<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objective" data-toc-modified-id="Objective-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Objective</a></span></li><li><span><a href="#Used-Python-Libraries" data-toc-modified-id="Used-Python-Libraries-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Used Python Libraries</a></span></li><li><span><a href="#Load-Data" data-toc-modified-id="Load-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Load Data</a></span><ul class="toc-item"><li><span><a href="#Chats-Data" data-toc-modified-id="Chats-Data-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Chats Data</a></span></li><li><span><a href="#Talking-Points-Data" data-toc-modified-id="Talking-Points-Data-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Talking Points Data</a></span></li></ul></li><li><span><a href="#A-First-Glance-at-Chats" data-toc-modified-id="A-First-Glance-at-Chats-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>A First Glance at Chats</a></span></li><li><span><a href="#Pre-Trained-Sentence-Encoder" data-toc-modified-id="Pre-Trained-Sentence-Encoder-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Pre-Trained Sentence Encoder</a></span><ul class="toc-item"><li><span><a href="#Example-of-Using-Encoder" data-toc-modified-id="Example-of-Using-Encoder-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Example of Using Encoder</a></span></li><li><span><a href="#Using-Encoder-on-Chats" data-toc-modified-id="Using-Encoder-on-Chats-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Using Encoder on Chats</a></span></li></ul></li><li><span><a href="#Data-Processing" data-toc-modified-id="Data-Processing-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Data Processing</a></span><ul class="toc-item"><li><span><a href="#Processing-Chats" data-toc-modified-id="Processing-Chats-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Processing Chats</a></span><ul class="toc-item"><li><span><a 
href="#Example-of-Processed-Chat-Data" data-toc-modified-id="Example-of-Processed-Chat-Data-6.1.1"><span class="toc-item-num">6.1.1&nbsp;&nbsp;</span>Example of Processed Chat Data</a></span></li></ul></li></ul></li><li><span><a href="#Analyzing-Processed-Data" data-toc-modified-id="Analyzing-Processed-Data-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Analyzing Processed Data</a></span><ul class="toc-item"><li><span><a href="#Analyzing-Chats" data-toc-modified-id="Analyzing-Chats-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Analyzing Chats</a></span></li></ul></li><li><span><a href="#Interactive-Visualization-of-Results" data-toc-modified-id="Interactive-Visualization-of-Results-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Interactive Visualization of Results</a></span></li></ul></div>

## Objective

This project will prototype a screening tool to:

    * identify calls in which customer agents discuss the offers/recommendations shown to them as talking points in the customer panel during a call;
    * determine whether the "discussed" button (supposedly pressed by agents) accurately records whether a recommendation was pitched.
    
Two data tables will be used:

    * a table containing information about the call and the call transcripts 
    * a table containing information about discussion of talking points
    
<img src="xxx.png" width=800 /> ![](one_pager_img)

[Back to Contents](#Table-of-Contents)

## Used Python Libraries

In [1]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text #needed to avoid crashes
import os

In [2]:
import os
import glob

In [3]:
import pandas as pd
import numpy as np
import json

In [4]:
from datetime import timedelta

In [5]:
import re

In [6]:
from sentence_transformers import SentenceTransformer, util

In [7]:
import psutil
import ray
import sys

In [8]:
from random import sample 

In [9]:
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

[Back to Contents](#Table-of-Contents)

## Load Data

We use data provided by Amazon Science ([Topical Chats](https://www.amazon.science/blog/amazon-releases-data-set-of-annotated-conversations-to-aid-development-of-socialbots) project). 

In [None]:
#https://registry.opendata.aws/topical-chat-enriched/

import boto3
import os
from botocore import UNSIGNED
from botocore.client import Config

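def download_all_files():
    # anonymous (unsigned) access to the public bucket
    s3 = boto3.resource('s3', config=Config(signature_version=UNSIGNED))
    # select bucket
    my_bucket = s3.Bucket('enriched-topical-chat')
    # download every object into the current directory, using its key as the relative path
    for s3_object in my_bucket.objects.all():
        filename = s3_object.key
        if filename.endswith('/'):
            continue  # skip folder placeholder keys, if any
        # create parent folders in case a key is nested; download_file does not create them
        os.makedirs(os.path.dirname(filename) or '.', exist_ok=True)
        my_bucket.download_file(s3_object.key, filename)
        
download_all_files()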

In [None]:
!mkdir alexa

In [None]:
!mv *.json alexa/

In [43]:
train_df = pd.read_json('data/train.json').T
train_df.head()

Unnamed: 0,config,content,conversation_rating
t_bde29ce2-4153-4056-9eb7-f4ad710505fe,C,[{'message': ['Are you a fan of Google or Micr...,"{'agent_1': 'Excellent', 'agent_2': 'Excellent'}"
t_1abc9c37-387d-4013-8691-88ef8c010e58,B,"[{'message': ['do you like dance?'], 'agent': ...","{'agent_1': 'Excellent', 'agent_2': 'Excellent'}"
t_1a600621-5ad4-409c-a812-bc0b2bb03aa6,C,[{'message': ['Hey what's up do use Google ver...,"{'agent_1': 'Excellent', 'agent_2': 'Excellent'}"
t_01269680-99c3-4ab4-9df3-23901e0623c9,C,"[{'message': ['Hi!', 'do you like to dance?'],...","{'agent_1': 'Excellent', 'agent_2': 'Excellent'}"
t_c4f84350-a9e8-4928-bde8-5193b62388e0,B,"[{'message': ['do you like dance?'], 'agent': ...","{'agent_1': 'Excellent', 'agent_2': 'Excellent'}"


In [81]:
cont_tmp = train_df.iloc[2]['content']
cont_tmp

[{'message': ["Hey what's up do use Google very often?I really love the company and was surprised to hear that it was founded back in 1998."],
  'agent': 'agent_1',
  'segmented_annotations': [{'da': '<Statement>',
    'gt_ks': {'ds': 'wiki',
     'section': 'FS1',
     'start_index': 479,
     'end_index': 553,
     'score': 0.27}}],
  'gt_turn_ks': {'ds': 'wiki',
   'section': 'FS1',
   'start_index': 479,
   'end_index': 553,
   'score': 0.27}},
 {'message': ['i think everyone must use it daily!',
   'its become ingrained in every day life'],
  'agent': 'agent_2',
  'segmented_annotations': [{'da': '<Statement>',
    'gt_ks': {'ds': 'wiki',
     'section': 'FS2',
     'start_index': 558,
     'end_index': 778,
     'score': 0.03}},
   {'da': '<Statement>',
    'gt_ks': {'ds': 'fun_facts',
     'section': 'FS3',
     'index': 1,
     'score': 0.12}}],
  'gt_turn_ks': {'ds': 'fun_facts',
   'section': 'FS3',
   'index': 1,
   'score': 0.11}},
 {'message': ['Agreed.',
   'The Google he
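
Each conversation's `content` is a list of turns, and each turn stores its `message` as a list of sentence segments next to the speaking `agent`. A minimal sketch of flattening that structure into `(agent, utterance)` pairs, assuming only the fields visible in the output above. The helper name `flatten_turns` and the toy `sample_content` are illustrative and not part of the dataset tooling:

```python
def flatten_turns(content):
    """Join each turn's sentence segments into one utterance string."""
    return [(turn["agent"], " ".join(turn["message"])) for turn in content]


# Toy input shaped like the first two turns shown above (annotations omitted).
sample_content = [
    {"message": ["Hey what's up do use Google very often?"], "agent": "agent_1"},
    {"message": ["i think everyone must use it daily!",
                 "its become ingrained in every day life"], "agent": "agent_2"},
]

turns = flatten_turns(sample_content)
for agent, utterance in turns:
    print(agent, "->", utterance)
```

Joining the segments per turn gives one string per utterance, which is the unit that gets compared against talking points later in the notebook.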

[Back to Contents](#Table-of-Contents)

### Chats Data

In [101]:
chats_datafiles = glob.glob("./alexa/*.json")
chats_df = pd.concat([pd.read_json(fp).T.drop(['config','conversation_rating'], axis=1) 
                      for fp in chats_datafiles], ignore_index=True)
print(chats_df.shape)

(10784, 1)


In [102]:
chats_df.head()

Unnamed: 0,content
0,[{'message': ['Do you know who Emily Dickson i...
1,[{'message': ['Did you know the richest superh...
2,"[{'message': ['What arts do you enjoy?', 'Musi..."
3,[{'message': ['Are you familiar with summit me...
4,"[{'message': ['Do you watch soccer?'], 'agent'..."


In [103]:
[ic['message'] for ic in chats_df.iloc[0]['content']]

[['Do you know who Emily Dickson is?'],
 ['Emily Dickinson?',
  'The poet?',
  'I do!',
  '"Tell all the truth, but tell it slant" she once said.',
  'Do you like her poetry?'],
 ['Yeah she was an icon she died in 1886 at the tender age of 55.'],
 ['Though she was reclusive, she lived an interesting 55 years.',
  'Do you know much about her life?'],
 ['I did not unfortunately!',
  'I hear over the years she shared at least 250 poems with Susan her close friend before marrying Austin.'],
 ['Yes.',
  'she wrote hundreds and hundreds of poems, but they were locked away in a drawer, and a critic said that there were many arresting phrases, nothing scanned or rhymed properly, and so he declined to help get them published.'],
 ['Did you kow theres a poem when read normally is depressing but when read backward is uplifting?'],
 ['Wow!',
  "That is certainly different than Emily's poetry.",
  "There's such a diversity in poetry.",
  "It's no surprise, since poetry dates back to prehistorical t

[Back to Contents](#Table-of-Contents)

### Talking Points Data

In [104]:
#https://github.com/alexa/Topical-Chat/blob/master/src/wiki/wiki.json
with open("alexa/wiki/wiki.json", "r") as f:
    wiki_data = json.load(f)
    shortened_wiki_lead_section = wiki_data['shortened_wiki_lead_section']
    summarized_wiki_lead_section = wiki_data['summarized_wiki_lead_section']

In [105]:
talking_points = list(shortened_wiki_lead_section.keys())

In [106]:
talking_points[:5]

['A horror film is a film that seeks to elicit fear. Initially inspired by literature from authors like Edgar Allan Poe, Bram Stoker, and Mary Shelley, horror has existed as a film genre for more than a century. The macabre and the supernatural are frequent themes. Horror may also overlap with the fantasy, supernatural fiction, and thriller genres.',
 'A soundtrack, also written sound track, can be recorded music accompanying and synchronized to the images of a motion picture, book, television program, or video game; a commercially released soundtrack album of music as featured in the soundtrack of a film, video, or television presentation; or the physical area of a film that contains the synchronized recorded sound.',
 'An album is a collection of audio recordings issued as a collection on compact disc (CD), vinyl, audio tape, or another medium. Albums of recorded music were developed in the early 20th century as individual 78-rpm records collected in a bound book resembling a photogr

[Back to Contents](#Table-of-Contents)

## A First Glance at Chats

In [156]:
cT_example = [' '.join(ic['message']) for ic in chats_df.iloc[2]['content']]
cT_example

['What arts do you enjoy? Music? Poetry?',
 'Hi there. I was an English major in college, so even though I took a lot of literature classes, I appreciate poetry, too. What about you?',
 'I took some poetry in high school and I still remember learning about palindromes',
 "Did you know that the comedian Demetri Martin wrote a 224-word palindrome poem? That's a lot of thinking and reading backwards!",
 "I can't even begin to comprehend how it would go",
 "There is also a poem that when read normally is depressing, but when read backwards is inspiring. I think that's impressive, as well.",
 'I think that in itself is pretty inspiring',
 'Yes, some works of poetry like the Epic of Gilgamesh are pretty, well, epic, for lack of a better word haha. They try to take you to another time and place.',
 'It is pretty amazing how literature allows us to build worlds within our minds',
 "Speaking of other worlds, isn't it crazy how much we've learned about Mars, to the point where there's a coloniza

[Back to Contents](#Table-of-Contents)

## Pre-Trained Sentence Encoder

In [None]:
#https://www.sbert.net/docs/installation.html

#these lines below are example for downloading pre-trained models
#embed_bert = SentenceTransformer('paraphrase-xlm-r-multilingual-v1')
#embed_bert = SentenceTransformer('stsb-roberta-large') #textual similarity

#the models are downloaded to here
#/Users/atambu310/.cache/torch/sentence_transformers/sbert.net_models_paraphrase-xlm-r-multilingual-v1_part
#copy to desired folder, ex: ./models

### Example of Using Encoder

In [107]:
# distiluse-base-multilingual-cased-v2 is a knowledge-distilled version of the multilingual Universal Sentence Encoder
embed_bert = SentenceTransformer('./models/sbert.net_models_._distiluse-base-multilingual-cased-v2')

# Two lists of sentences
sentences1 = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome']

sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

#Compute embedding for both lists
embeddings1 = embed_bert.encode(sentences1, convert_to_tensor=True)
embeddings2 = embed_bert.encode(sentences2, convert_to_tensor=True)

#Compute cosine similarities
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)

#Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))

The cat sits outside 		 The dog plays in the garden 		 Score: 0.3065
A man is playing guitar 		 A woman watches TV 		 Score: 0.2188
The new movie is awesome 		 The new movie is so great 		 Score: 0.9821
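
The scores above are cosine similarities between sentence embeddings: the dot product of two vectors divided by the product of their norms. `util.pytorch_cos_sim` returns the full matrix of pairwise scores, and the loop above reads its diagonal. A pure-Python sketch of the same quantity for a single pair of vectors, using toy vectors instead of real embeddings (the function name `cosine_similarity` is illustrative):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Vectors pointing the same way score close to 1 regardless of their length;
# orthogonal vectors score 0.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because the score depends only on direction, it is insensitive to embedding magnitude, which is why a fixed threshold can be applied across utterances of different lengths.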


In [108]:
#lower-casing and space removal are not needed with this model: the two variants
#below differ in casing, hyphenation and contraction, and still score close to 1

util.pytorch_cos_sim(embed_bert.encode(["With the Wifi Gateway, you'll get access to our simple and digital xFi dashboard."], convert_to_tensor=True), 
                     embed_bert.encode(["with the wi-fi gateway, you will get access to our simple and digital xFi dashboard."], convert_to_tensor=True)).detach().cpu().numpy()

array([[0.97768307]], dtype=float32)

### Using Encoder on Chats

In [157]:
# score utterance 9 against every talking point; the talking points are encoded once, in a single batch
utterance_emb = embed_bert.encode(cT_example[9:10], convert_to_tensor=True)
talking_points_emb = embed_bert.encode(talking_points, convert_to_tensor=True)
scores = util.pytorch_cos_sim(utterance_emb, talking_points_emb).detach().cpu().numpy()[0].tolist()

In [158]:
best_score_idx = np.argmax(scores)

In [159]:
scores[best_score_idx]

0.36937442421913147

In [160]:
talking_points[best_score_idx]

'Mars is the fourth planet from the Sun and the second-smallest planet in the Solar System after Mercury. In English, Mars carries a name of the Roman god of war, and is often referred to as the "Red Planet" because the reddish iron oxide prevalent on its surface gives it a reddish appearance that is distinctive among the astronomical bodies visible to the naked eye. Mars is a terrestrial planet with a thin atmosphere, having surface features reminiscent both of the impact craters of the Moon and the valleys, deserts, and polar ice caps of Earth.'
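
`np.argmax` returns only the single best match. When the best score is as low as the one above, it can help to look at the runners-up as well. A small sketch of ranking the top few candidates, using made-up scores; the helper `top_k` is hypothetical and not part of the notebook:

```python
def top_k(scores, k=3):
    """Return the k highest (index, score) pairs, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy similarity scores of one utterance against five talking points.
toy_scores = [0.12, 0.37, 0.05, 0.31, 0.29]
best = top_k(toy_scores, k=3)
for idx, score in best:
    print(idx, score)
```

If the gap between the first and second candidate is small, the match is ambiguous and a threshold alone may not be enough to decide.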

[Back to Contents](#Table-of-Contents)

## Data Processing

In [220]:
@ray.remote
class processText:
        
    def __init__(self):
        
        import tensorflow as tf
        import tensorflow_hub as hub
        import tensorflow_text #needed to avoid crashes    
        
        self.texts = ray.get(texts_ray)
        
        print('loading model')
        self.embed = SentenceTransformer('./models/sbert.net_models_._distiluse-base-multilingual-cased-v2')
        print('loading model...done')

        self.talking_points_emb = self.embed.encode(ray.get(talking_points_ray), convert_to_tensor=True)

        
    def clean_sentences(self, cT):

        # normalize whitespace and keep only utterances with more than 5 words
        #cT = [s.lower() for s in cT] #lower case
        cT = [" ".join(s.split()) for s in cT if len(s.split())>5]

        if len(cT)<3: # need at least 3 utterances left, otherwise skip the chat
            return []
        else:
            return cT
    
    
    def get_points(self, text_idxs, phrase_thr=0.60):

        is_talking_points = {}
        
        print('start', text_idxs)
        
        for i in range(text_idxs[0], text_idxs[1]): 
            
            cT = self.texts[i]
            cT = self.clean_sentences(cT)
            
            if len(cT)==0:
                continue

            emb_cT_i = self.embed.encode(cT, convert_to_tensor=True)
            # similarity 
            sim_mat = util.pytorch_cos_sim(emb_cT_i,
                                           self.talking_points_emb).detach().cpu().numpy()
            
            # loop over individual sentences and check similarity to each of the phrases    
            #best_match = np.unravel_index(np.argmax(sim_mat, axis=None), sim_mat.shape)
            #best_match_value = sim_mat[best_match]
            #if best_match_value>phrase_thr:
            #    is_talking_points.update({(i,cT[best_match[0]]): [(best_match[1], best_match_value)]})
            
            # for each sentence find matching phrases above threshold. 
            # In a chat multiple talking points can be mentioned
            for j in range(sim_mat.shape[0]):
                sim_idx = np.where(sim_mat[j,]>phrase_thr)[0]
                sim_values = [(k, sim_mat[j,k]) for k in sim_idx] # this is for debugging purposes, later just select best matching phrase
                if len(sim_idx)>0:
                    is_talking_points.update({(i,cT[j]): sim_values})
            

        print('end', text_idxs)
        
        return is_talking_points
    

In [213]:
ray.shutdown()
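
The cleaning step is meant to keep only utterances with more than five words and to skip chats that have fewer than three usable utterances. A standalone sketch of that rule, with the hypothetical name `filter_utterances` and a toy chat, so it can be checked without loading the model:

```python
def filter_utterances(utterances, min_words=5, min_utterances=3):
    """Keep utterances with more than `min_words` words, with whitespace normalized.

    Returns an empty list when fewer than `min_utterances` survive.
    """
    kept = [" ".join(u.split()) for u in utterances if len(u.split()) > min_words]
    return kept if len(kept) >= min_utterances else []


toy_chat = [
    "Wow!  That is   interesting.",  # 4 words: dropped
    "I think it is the character. I will check closer.",
    "Okay. Have you seen the film? Heard it was nominated for Oscars...",
    "Interesting. Costume designing would be a fun occupation.",
]

kept = filter_utterances(toy_chat)
print(len(kept), "utterances kept")
```

Filtering inside the list comprehension (rather than inside the inner `join`) removes short utterances entirely instead of leaving empty strings that would still be encoded.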

[Back to Contents](#Table-of-Contents)

### Processing Chats

In [201]:
all_chats = []
for i in chats_df['content'].values[:10] :
    all_chats_tmp = []
    for j in i:
        all_chats_tmp.append(' '.join(j['message']))
    all_chats.append(all_chats_tmp)

In [205]:
all_chats[1]

['Did you know the richest superhero is black panther?',
 'Is that the character in the 2018 film, or the actor who played the role?',
 'I think it is the character. I will check closer.',
 'Okay. Have you seen the film? Heard it was nominated for Oscars...',
 "I haven't yet but I plan to.",
 'Me, too. Understand that symbols and script were based on 4th century Nigerian story. Costume designer won an Oscar, I think...deserved to, at least...',
 'Interesting. Costume designing would be a fun occupation.',
 'Agree. One young star in the film, Chadwick Boseman, was sponsored to an Oxford, England theater program by a "private benefactor". Turned out to be the great actor, Denzel Washington. Fun fact?',
 'Wow! That is interesting.',
 'Glad script made black panther richest superhero, over characters like Tony Stark (Iron Man) and Bruce Wayne (Batman). Why not? ?',
 'True! If you want a superhero make him really rich.',
 "And of noble ancestry! This nearly all-black film is possibly the bi

In [221]:
%%time 

num_cpus = psutil.cpu_count(logical=False)
max_chats = 10 # number of chats to process; all_chats was built from the first 10 chats above
n_cells = 2 # chats per chunk

print('Processing', max_chats, 'chats')

# split task into chunks: int(max_chats/n_cells) chunks need one more boundary than chunks
run_list_chnks = np.linspace(0, max_chats, int(max_chats/n_cells) + 1, dtype=int)
run_list_chnks = [(run_list_chnks[i],run_list_chnks[i+1]) for i in range(0,len(run_list_chnks)-1)]

# init ray
ray.init(num_cpus=num_cpus)

phrase_thr = 0.50

texts_ray = ray.put(all_chats[:max_chats])
talking_points_ray = ray.put(talking_points)

pT = [processText.remote() for _ in range(num_cpus)]

# every 'num_cpus' jobs, start from worker 0 again
result_ids = [pT[i % num_cpus].get_points.remote(run_list_chnks[i],
                                                 phrase_thr=phrase_thr) for i in range(len(run_list_chnks))] 

# Fetch the results.
#results = ray.get(result_ids)    
results_c = ray.get(result_ids)

ray.shutdown()

Processing 10 chats


2021-04-20 16:57:46,893	INFO services.py:1173 -- View the Ray dashboard at http://127.0.0.1:8265
Traceback (most recent call last):
  File "/Users/atambu310/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 306, in <module>
    loop.run_until_complete(agent.run())
  File "/Users/atambu310/anaconda3/lib/python3.7/asyncio/base_events.py", line 584, in run_until_complete
    return future.result()
  File "/Users/atambu310/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 131, in run
    modules = self._load_modules()
  File "/Users/atambu310/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/agent.py", line 86, in _load_modules
    c = cls(self)
  File "/Users/atambu310/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/modules/reporter/reporter_agent.py", line 72, in __init__
    self._metrics_agent = MetricsAgent(dashboard_agent.metrics_export_port)
  File "/Users/atambu310/anaconda3/lib/python3.7/site-packages/ray/metrics_agent.py", line 42, in __init__
    namespace="ray", port=metrics_export_port)))
  File "/Users/atambu310/a

(pid=40181) loading model
(pid=40180) loading model
(pid=40185) loading model
(pid=40182) loading model
(pid=40184) loading model
(pid=40183) loading model

(pid=40181) loading model...done
(pid=40184) loading model...done
(pid=40180) loading model...done
(pid=40183) loading model...done
(pid=40185) loading model...done
(pid=40182) loading model...done

KeyboardInterrupt: 

In [224]:
ray.shutdown()
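
The processing cell splits the chat indices into half-open `(start, end)` ranges and hands them to the actors in turn with `i % num_cpus`. A pure-Python sketch of that scheduling idea, without Ray; the helper names `chunk_ranges` and `assign_round_robin` are hypothetical:

```python
def chunk_ranges(n_items, chunk_size):
    """Split range(n_items) into half-open (start, end) ranges of at most chunk_size."""
    return [(start, min(start + chunk_size, n_items))
            for start in range(0, n_items, chunk_size)]


def assign_round_robin(chunks, n_workers):
    """Pair each chunk with a worker index, cycling through the workers."""
    return [(i % n_workers, chunk) for i, chunk in enumerate(chunks)]


chunks = chunk_ranges(10, 2)
assignment = assign_round_robin(chunks, 3)
for worker, (start, end) in assignment:
    print("worker", worker, "processes chats", start, "to", end - 1)
```

Half-open ranges line up with `range(text_idxs[0], text_idxs[1])` in `get_points`, so consecutive chunks neither overlap nor skip a chat.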

#### Example of Processed Chat Data

In [225]:
results_c[0] # (chat index, utterance text above threshold): [(talking point 1, similarity to utterance),(talking point 2, similarity to utterance)] 

{(0,
  "Wow! That is certainly different than Emily's poetry. There's such a diversity in poetry. It's no surprise, since poetry dates back to prehistorical times. Did you know some of the first poetry was hunting poetry in Africa?"): [(66,
   0.5327011)],
 (1,
  'Glad script made black panther richest superhero, over characters like Tony Stark (Iron Man) and Bruce Wayne (Batman). Why not? ?'): [(58,
   0.5512788),
  (202, 0.51923746),
  (230, 0.5024118)],
 (1,
  "True!! You didn't miss anything. Marvel Comics character was not named after the 70's Black Panther Party, though studio almost called film, Black Leopard."): [(58,
   0.53018355)]}

In [227]:
talking_points[58]

"Black Panther is a 2018 American superhero film based on the Marvel Comics character of the same name. Produced by Marvel Studios and distributed by Walt Disney Studios Motion Pictures, it is the eighteenth film in the Marvel Cinematic Universe (MCU). The film is directed by Ryan Coogler, who co-wrote the screenplay with Joe Robert Cole, and stars Chadwick Boseman as T'Challa / Black Panther, alongside Michael B. Jordan, Lupita Nyong'o, Danai Gurira, Martin Freeman, Daniel Kaluuya, Letitia Wright, Winston Duke, Angela Bassett, Forest Whitaker, and Andy Serkis. In Black Panther, T'Challa is crowned king of Wakanda following his father's death, but his sovereignty is challenged by an adversary who plans to abandon the country's isolationist policies and begin a global revolution."
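
Each result maps a `(chat index, utterance)` key to every talking point that cleared the threshold. For reporting, usually only the strongest match per utterance matters. A short sketch of that reduction on a toy dictionary shaped like the output above; the utterance strings are placeholders, the similarity values are rounded for illustration, and `best_matches` is a hypothetical helper:

```python
def best_matches(result):
    """Keep only the highest-similarity (talking point, score) pair per utterance."""
    return {key: max(matches, key=lambda m: m[1]) for key, matches in result.items()}


# Toy result with the same shape as results_c[0]; scores are illustrative.
toy_result = {
    (0, "utterance about poetry"): [(66, 0.53)],
    (1, "utterance about Black Panther"): [(58, 0.55), (202, 0.52), (230, 0.50)],
}

best = best_matches(toy_result)
for (chat_no, utterance), (tp_no, sim) in best.items():
    print(chat_no, tp_no, sim)
```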

[Back to Contents](#Table-of-Contents)

## Analyzing Processed Data

For each talking point in the table below, we randomly sample a few transcript extracts to check the reliability of the extract-retrieval approach used in this notebook.

In [None]:
def analyze_results(results, n_texts, phrase_thr, verbose=True):

    talking_points_cnts = {k: 0 for k in talking_points}

    # loop over processed chunks and extract statistics
    for rr in range(len(results)):
        for chat_no_extract,tp_no_sim in results[rr].items():
            chat_no, chat_extract = chat_no_extract[0], chat_no_extract[1]
            tp_no_sim_sorted = sorted(tp_no_sim, key=lambda x: x[1], reverse=True) # sort by similarity values
            for kk in tp_no_sim_sorted[:1]: # take best
                tp_no = kk[0]
                tp_sim = kk[1]
                if tp_sim>phrase_thr:
                    if verbose:
                        print('------> Chat no. ', chat_no)
                        print('---> Extract:', chat_extract)
                        print('---> Phrase ', talking_points[tp_no])
                        print('---> Similarity index', tp_sim)
                    talking_points_cnts[talking_points[tp_no]] += 1
                 
    #extract some examples of match
    sample_extracts = {tp: [] for tp in talking_points}
    for tp in talking_points:
        for rr in range(len(results)):
            for chat_no_extract, tp_no_sim in results[rr].items():
                chat_no, chat_extract = chat_no_extract[0], chat_no_extract[1]
                tp_no_sim_sorted = sorted(tp_no_sim, key=lambda x: x[1], reverse=True)
                for kk in tp_no_sim_sorted[:1]: # take best
                    tp_no = kk[0]
                    tp_sim = kk[1]
                    if tp_sim>phrase_thr and tp_no==talking_points.index(tp):
                        sample_extracts[tp].append('**' + chat_extract)
                        #print('----->', tp)
                        #print('---> Extract:', chat_extract)
        if verbose:
            print(tp, len(sample_extracts[tp]), len(sample(sample_extracts[tp],min(20, len(sample_extracts[tp])))))
        sample_extracts[tp] = '\n'.join(sample(sample_extracts[tp],min(20, len(sample_extracts[tp])))) # keep at most 20 random examples
        #sample_extracts[tp] = '\n'.join(sample_extracts[tp]) # alternative: keep all examples
    sample_extracts = [v for k,v in sample_extracts.items()]
    
    ## put in excel
    results_df = pd.DataFrame({'Talking Points': [k for k,v  in talking_points_cnts.items()], 
                  'Perc. Used': [100*v/n_texts for k,v  in talking_points_cnts.items()],
                  'Sample Extracts': sample_extracts,
                  'Human Corrected': [0 for k in talking_points_cnts]})# randomly sample 20 extracts for each talking point and visually inspect how many are right

    results_df.to_excel('extracts.xlsx')

    return results_df
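
The core of the analysis is a count of best matches per talking point, expressed as a percentage of the processed chats, plus a small random sample of extracts for manual review. A compact sketch of that bookkeeping with the standard library only; the function name `summarize`, its parameters, and the toy input are assumptions made for illustration:

```python
import random
from collections import Counter, defaultdict


def summarize(best_by_utterance, n_texts, n_samples=2, seed=0):
    """Count best matches per talking point and sample extracts for review."""
    counts = Counter(tp for tp, _ in best_by_utterance.values())
    extracts = defaultdict(list)
    for (_chat_no, utterance), (tp, _sim) in best_by_utterance.items():
        extracts[tp].append(utterance)
    rng = random.Random(seed)  # fixed seed so the review sample is reproducible
    return {
        tp: {
            "perc_used": 100 * n / n_texts,
            "samples": rng.sample(extracts[tp], min(n_samples, len(extracts[tp]))),
        }
        for tp, n in counts.items()
    }


# Toy input: (chat index, utterance) -> (best talking point index, similarity).
toy_best = {(0, "a"): (66, 0.53), (1, "b"): (58, 0.55), (1, "c"): (58, 0.53)}
summary = summarize(toy_best, n_texts=2)
print(summary)
```

Note that the count is per matched utterance rather than per chat, so a chat that mentions the same talking point twice contributes twice, and the percentage can exceed 100.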

[Back to Contents](#Table-of-Contents)

### Analyzing Chats

In [None]:
results_c_df = analyze_results(results_c, max_chats, phrase_thr=0.6, verbose=False)
results_c_df

In [None]:
i = 5
results_c_df.iloc[i,0], results_c_df.iloc[i,2].split('\n')

In [None]:
i = 7
results_c_df.iloc[i,0], results_c_df.iloc[i,2].split('\n')

[Back to Contents](#Table-of-Contents)

## Interactive Visualization of Results

In [None]:
embed = SentenceTransformer('./models/sbert.net_models_._distiluse-base-multilingual-cased-v2')

In [None]:
talking_points_emb = embed.encode(talking_points, convert_to_tensor=True)

In [None]:
# input text gets extra slash in front of \n, so we added \\n for splitting sentences
def clean_sentences(cT):

        #create list of sentences from text
        cT = re.sub(r'\.\.\.','', cT) # drop ellipses
        cT = cT.split('\\n')
                
        cT = [s.lower() for s in cT] #lower case
        cT = [" ".join(s.split()) for s in cT if len(s.split())>5]  # normalize spaces, keep sentences with more than 5 words

        if len(cT)<3: # at least 3 sentences
            return []
        else:
            return cT
    
    
def get_points(cT, phrase_thr=0.60):

        cT = clean_sentences(cT)
            
        if len(cT)==0:
            return ''
        
        emb_cT = embed.encode(cT, convert_to_tensor=True)
        # similarity 
        sim_mat = util.pytorch_cos_sim(emb_cT,
                                       talking_points_emb).detach().cpu().numpy()
            
        # for each sentence find matching phrases above threshold. 
        for j in range(sim_mat.shape[0]):
            sim_idx = np.where(sim_mat[j,]>phrase_thr)[0]
            sim_values = [(k, sim_mat[j,k]) for k in sim_idx] 
            sim_sorted = sorted(sim_values, key=lambda x: x[1], reverse=True) # sort by similarity values
            if len(sim_sorted)>0:
                print('**utterance:', cT[j])
                print('Talking Point:', talking_points[sim_sorted[0][0]])
                print('Similarity: ', sim_sorted[0][1])            


In [None]:
print('**Available talking points**')
print('')
for i in talking_points:
    print(i)

In [None]:
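#Chats
i=185 #11, 185, 2010, 2334
# chats_df only has a 'content' column: join each turn's message segments into one utterance
[' '.join(turn['message']) for turn in chats_df['content'].iloc[i]]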
#Calls
#i=1460 #856, 865, 1460
#calls_df['Transcript'].iloc[i]

In [None]:
@interact_manual(in_text='', phrase_thr=widgets.FloatSlider(min=0.5, max=0.8, step=0.1, value=0.5))
def g(in_text, phrase_thr):
    return get_points(in_text, phrase_thr) 

In [None]:
# TODO: add a pop-up message indicating whether the "discussed" button was pressed for the talking point

[Back to Contents](#Table-of-Contents)