<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objective" data-toc-modified-id="Objective-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Objective</a></span></li><li><span><a href="#Used-Python-Libraries" data-toc-modified-id="Used-Python-Libraries-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Used Python Libraries</a></span></li><li><span><a href="#Load-Data" data-toc-modified-id="Load-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Load Data</a></span><ul class="toc-item"><li><span><a href="#Example-Conversation--Data" data-toc-modified-id="Example-Conversation--Data-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Example Conversation  Data</a></span></li><li><span><a href="#Extracting-Conversation-Data" data-toc-modified-id="Extracting-Conversation-Data-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Extracting Conversation Data</a></span></li><li><span><a href="#Talking-Points-Data" data-toc-modified-id="Talking-Points-Data-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Talking Points Data</a></span></li><li><span><a href="#Mapping-of-Conversations-to-Talking-Points---NOT-USED-YET" data-toc-modified-id="Mapping-of-Conversations-to-Talking-Points---NOT-USED-YET-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Mapping of Conversations to Talking Points - NOT USED YET</a></span></li></ul></li><li><span><a href="#Pre-Trained-Sentence-Encoder" data-toc-modified-id="Pre-Trained-Sentence-Encoder-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Pre-Trained Sentence Encoder</a></span><ul class="toc-item"><li><span><a href="#Example-of-Using-Encoder" data-toc-modified-id="Example-of-Using-Encoder-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Example of Using Encoder</a></span></li><li><span><a href="#Using-Encoder-for-Chats" data-toc-modified-id="Using-Encoder-for-Chats-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Using Encoder for Chats</a></span></li></ul></li><li><span><a href="#Data-Processing" 
data-toc-modified-id="Data-Processing-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Data Processing</a></span><ul class="toc-item"><li><span><a href="#Processing-Chats" data-toc-modified-id="Processing-Chats-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Processing Chats</a></span><ul class="toc-item"><li><span><a href="#Example-of-Processed-Chat-Data" data-toc-modified-id="Example-of-Processed-Chat-Data-5.1.1"><span class="toc-item-num">5.1.1&nbsp;&nbsp;</span>Example of Processed Chat Data</a></span></li></ul></li></ul></li><li><span><a href="#Analyzing-Processed-Data" data-toc-modified-id="Analyzing-Processed-Data-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Analyzing Processed Data</a></span><ul class="toc-item"><li><span><a href="#Analyzing-Chats" data-toc-modified-id="Analyzing-Chats-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Analyzing Chats</a></span></li></ul></li><li><span><a href="#Interactive-Visualization-of-Results" data-toc-modified-id="Interactive-Visualization-of-Results-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Interactive Visualization of Results</a></span></li></ul></div>

## Objective

This project will prototype a tool to:
1. identify utterances where a speaker uses "talking points", i.e. talks about a topic that was shown to them during the conversation;
2. 

We also want to learn which similarity values are typical, in natural conversations, between a topic and the utterances that use it (see the Using Encoder for Chats section).
    
We use data provided by Amazon Science ([Topical Chats](https://www.amazon.science/blog/amazon-releases-data-set-of-annotated-conversations-to-aid-development-of-socialbots) project). 

[Back to Contents](#Table-of-Contents)

## Used Python Libraries

In [1]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text #needed to avoid crashes
import os

In [2]:
import os
import glob

In [3]:
import pandas as pd
import numpy as np
import json

In [4]:
from datetime import timedelta

In [5]:
import re

In [6]:
from sentence_transformers import SentenceTransformer, util

In [7]:
import psutil
import ray
import sys

In [8]:
from random import sample 

In [9]:
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

[Back to Contents](#Table-of-Contents)

## Load Data

In [None]:
#https://registry.opendata.aws/topical-chat-enriched/
# the data can also be downloaded from the GitHub repo

import boto3
import os
from botocore import UNSIGNED
from botocore.client import Config

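def download_all_files():
    # initiate s3 resource (unsigned requests, no credentials needed)
    s3 = boto3.resource('s3', config=Config(signature_version=UNSIGNED))
    # select bucket
    my_bucket = s3.Bucket('enriched-topical-chat')
    # download every object, using its key as the local path under the current directory
    for s3_object in my_bucket.objects.all():
        filename = s3_object.key
        if filename.endswith('/'):
            continue  # skip folder placeholder keys
        # create parent folders when the key contains a path
        if os.path.dirname(filename):
            os.makedirs(os.path.dirname(filename), exist_ok=True)
        my_bucket.download_file(s3_object.key, filename)

download_all_files()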

In [None]:
!mkdir alexa

In [None]:
!mv *.json alexa/

### Example Conversation  Data

In [10]:
train_df = pd.read_json('alexa/conversations/train.json').T
train_df.head()

| | config | content | conversation_rating |
|---|---|---|---|
| t_bde29ce2-4153-4056-9eb7-f4ad710505fe | C | [{'message': ['Are you a fan of Google or Micr... | {'agent_1': 'Excellent', 'agent_2': 'Excellent'} |
| t_1abc9c37-387d-4013-8691-88ef8c010e58 | B | [{'message': ['do you like dance?'], 'agent': ... | {'agent_1': 'Excellent', 'agent_2': 'Excellent'} |
| t_1a600621-5ad4-409c-a812-bc0b2bb03aa6 | C | [{'message': ['Hey what's up do use Google ver... | {'agent_1': 'Excellent', 'agent_2': 'Excellent'} |
| t_01269680-99c3-4ab4-9df3-23901e0623c9 | C | [{'message': ['Hi!', 'do you like to dance?'],... | {'agent_1': 'Excellent', 'agent_2': 'Excellent'} |
| t_c4f84350-a9e8-4928-bde8-5193b62388e0 | B | [{'message': ['do you like dance?'], 'agent': ... | {'agent_1': 'Excellent', 'agent_2': 'Excellent'} |


In [11]:
cont_tmp = train_df.iloc[2]['content']
cont_tmp

[{'message': ["Hey what's up do use Google very often?I really love the company and was surprised to hear that it was founded back in 1998."],
  'agent': 'agent_1',
  'segmented_annotations': [{'da': '<Statement>',
    'gt_ks': {'ds': 'wiki',
     'section': 'FS1',
     'start_index': 479,
     'end_index': 553,
     'score': 0.27}}],
  'gt_turn_ks': {'ds': 'wiki',
   'section': 'FS1',
   'start_index': 479,
   'end_index': 553,
   'score': 0.27}},
 {'message': ['i think everyone must use it daily!',
   'its become ingrained in every day life'],
  'agent': 'agent_2',
  'segmented_annotations': [{'da': '<Statement>',
    'gt_ks': {'ds': 'wiki',
     'section': 'FS2',
     'start_index': 558,
     'end_index': 778,
     'score': 0.03}},
   {'da': '<Statement>',
    'gt_ks': {'ds': 'fun_facts',
     'section': 'FS3',
     'index': 1,
     'score': 0.12}}],
  'gt_turn_ks': {'ds': 'fun_facts',
   'section': 'FS3',
   'index': 1,
   'score': 0.11}},
 {'message': ['Agreed.',
   'The Google he

[Back to Contents](#Table-of-Contents)

### Extracting Conversation Data

In [12]:
chats_datafiles = glob.glob("./alexa/conversations/*.json")
chats_df = pd.concat([pd.read_json(fp).T.drop(['config','conversation_rating'], axis=1) 
                      for fp in chats_datafiles], ignore_index=False)
print(chats_df.shape)

(10784, 1)


In [13]:
chats_df.head()

| | content |
|---|---|
| t_f9116d33-7a0d-4969-a519-764a190fe7d9 | [{'message': ['Do you know who Emily Dickson i... |
| t_1bdb0da2-7b3b-41b8-b908-91e4c09c6ea7 | [{'message': ['Did you know the richest superh... |
| t_e13b13b6-b590-4d24-a871-cd80279d4310 | [{'message': ['What arts do you enjoy?', 'Musi... |
| t_2f6c6509-2624-435c-b070-644033cf3aa8 | [{'message': ['Are you familiar with summit me... |
| t_cbcfd55f-51ae-49b1-b072-926b398f3c34 | [{'message': ['Do you watch soccer?'], 'agent'... |
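
Each entry of `content` is a list of turns; a turn keeps its text as a list of segments under `message` and its speaker under `agent`. A minimal sketch of flattening one conversation into (speaker, utterance) pairs follows. It runs on a small hand-made sample shaped like the data above, and the helper name `flatten_turns` is hypothetical:

```python
def flatten_turns(content):
    """Join the message segments of each turn and pair them with the speaker."""
    return [(turn['agent'], ' '.join(turn['message'])) for turn in content]


# Hand-made sample mimicking the structure of one conversation (not real data)
sample_content = [
    {'message': ['Hi!', 'do you like to dance?'], 'agent': 'agent_1'},
    {'message': ['I do, mostly salsa.'], 'agent': 'agent_2'},
]

for agent, utterance in flatten_turns(sample_content):
    print(agent, '->', utterance)
```

Keeping the speaker next to the utterance may help later if matches need to be attributed to one of the two agents.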


In [172]:
#[ic['message'] for ic in chats_df.iloc[0]['content']]

In [14]:
[' '.join(ic['message']) for ic in chats_df.iloc[0]['content']]

['Do you know who Emily Dickson is?',
 'Emily Dickinson? The poet? I do! "Tell all the truth, but tell it slant" she once said. Do you like her poetry?',
 'Yeah she was an icon she died in 1886 at the tender age of 55.',
 'Though she was reclusive, she lived an interesting 55 years. Do you know much about her life?',
 'I did not unfortunately! I hear over the years she shared at least 250 poems with Susan her close friend before marrying Austin.',
 'Yes. she wrote hundreds and hundreds of poems, but they were locked away in a drawer, and a critic said that there were many arresting phrases, nothing scanned or rhymed properly, and so he declined to help get them published.',
 'Did you kow theres a poem when read normally is depressing but when read backward is uplifting?',
 "Wow! That is certainly different than Emily's poetry. There's such a diversity in poetry. It's no surprise, since poetry dates back to prehistorical times. Did you know some of the first poetry was hunting poetry in

[Back to Contents](#Table-of-Contents)

### Talking Points Data

In [15]:
#https://github.com/alexa/Topical-Chat/blob/master/src/wiki/wiki.json
with open("alexa/wiki/wiki.json", "r") as f:
    wiki_data = json.load(f)
    shortened_wiki_lead_section = wiki_data['shortened_wiki_lead_section']
    summarized_wiki_lead_section = wiki_data['summarized_wiki_lead_section']

In [16]:
talking_points = list(shortened_wiki_lead_section.keys())
talking_points_idxs = list(shortened_wiki_lead_section.values())

In [17]:
talking_points[:5]

['A horror film is a film that seeks to elicit fear. Initially inspired by literature from authors like Edgar Allan Poe, Bram Stoker, and Mary Shelley, horror has existed as a film genre for more than a century. The macabre and the supernatural are frequent themes. Horror may also overlap with the fantasy, supernatural fiction, and thriller genres.',
 'A soundtrack, also written sound track, can be recorded music accompanying and synchronized to the images of a motion picture, book, television program, or video game; a commercially released soundtrack album of music as featured in the soundtrack of a film, video, or television presentation; or the physical area of a film that contains the synchronized recorded sound.',
 'An album is a collection of audio recordings issued as a collection on compact disc (CD), vinyl, audio tape, or another medium. Albums of recorded music were developed in the early 20th century as individual 78-rpm records collected in a bound book resembling a photogr

In [18]:
#[re.split(r"(?<!^)\s*[.\n]+\s*(?!$)", i) for i in talking_points[:5]]

[Back to Contents](#Table-of-Contents)

### Mapping of Conversations to Talking Points - NOT USED YET

For each conversation, the topics labeled FS1, FS2 and FS3 are the ones shown to the two conversation partners. The reading sets come from:

https://github.com/alexa/Topical-Chat/tree/master/reading_sets/pre-build

In [None]:
!mkdir -p alexa/reading_sets
!wget https://raw.githubusercontent.com/alexa/Topical-Chat/master/reading_sets/pre-build/train.json && mv train.json alexa/reading_sets/
!wget https://raw.githubusercontent.com/alexa/Topical-Chat/master/reading_sets/pre-build/test_freq.json && mv test_freq.json alexa/reading_sets/
!wget https://raw.githubusercontent.com/alexa/Topical-Chat/master/reading_sets/pre-build/test_rare.json && mv test_rare.json alexa/reading_sets/
!wget https://raw.githubusercontent.com/alexa/Topical-Chat/master/reading_sets/pre-build/valid_freq.json && mv valid_freq.json alexa/reading_sets/
!wget https://raw.githubusercontent.com/alexa/Topical-Chat/master/reading_sets/pre-build/valid_rare.json && mv valid_rare.json alexa/reading_sets/


In [24]:
reading_datafiles = glob.glob("./alexa/reading_sets/*.json")
reading_df = pd.concat([pd.read_json(fp).T.drop(['config','article_url'], axis=1) 
                      for fp in reading_datafiles], ignore_index=False)
print(reading_df.shape)

(10784, 2)


In [25]:
reading_df.head()

| | agent_1 | agent_2 |
|---|---|---|
| t_f9116d33-7a0d-4969-a519-764a190fe7d9 | {'FS1': {'entity': 'Poetry', 'shortened_wiki_l... | {'FS1': {'entity': 'Poetry', 'shortened_wiki_l... |
| t_1bdb0da2-7b3b-41b8-b908-91e4c09c6ea7 | {'FS1': {'entity': 'Black Panther (film)', 'sh... | {'FS1': {'entity': 'Black Panther (film)', 'su... |
| t_e13b13b6-b590-4d24-a871-cd80279d4310 | {'FS1': {'entity': 'Poetry', 'shortened_wiki_l... | {'FS1': {'entity': 'Poetry', 'shortened_wiki_l... |
| t_2f6c6509-2624-435c-b070-644033cf3aa8 | {'FS1': {'entity': 'FIFA World Cup', 'shortene... | {'FS1': {'entity': 'FIFA World Cup', 'summariz... |
| t_cbcfd55f-51ae-49b1-b072-926b398f3c34 | {'FS1': {'entity': 'FIFA World Cup', 'shortene... | {'FS1': {'entity': 'FIFA World Cup', 'shortene... |


In [26]:
talking_points[talking_points_idxs.index(reading_df.iloc[0]['agent_1']['FS1']['shortened_wiki_lead_section'])]

'Poetry (the term derives from a variant of the Greek term, poiesis, "making") is a form of literature  that uses aesthetic and rhythmic qualities of language—such as phonaesthetics, sound symbolism, and metre—to evoke meanings in addition to, or in place of, the prosaic ostensible meaning.\nPoetry has a long history, dating back to prehistorical times with the creation of hunting poetry in Africa, and panegyric and elegiac court poetry were developed extensively throughout the history of the empires of the Nile, Niger and Volta river valleys . Some of the earliest written poetry in Africa can be found among the Pyramid Texts written during the 25th century BCE, while the Epic of Sundiata is one of the most well-known examples of griot court poetry. The earliest Western Asian epic poetry, the Epic of Gilgamesh, was written in Sumerian.  Early poems in the Eurasian continent evolved from folk songs such as the Chinese Shijing, or from a need to retell oral epics, as with the Sanskrit Ve

In [122]:
{'FS1': [reading_df.iloc[i]['agent_1']['FS1']['shortened_wiki_lead_section'] for i in range(len(reading_df))]}

{'FS1': [80844,
  81372,
  80844,
  80686,
  80686,
  80686,
  80992,
  80992,
  80992,
  81016,
  81016,
  81016,
  81377,
  81377,
  80481,
  80481,
  80481,
  80481,
  81377,
  81377,
  81377,
  81122,
  81377,
  81122,
  81122,
  81122,
  81377,
  81283,
  81283,
  81283,
  81283,
  80854,
  80854,
  80854,
  80854,
  81378,
  81378,
  81378,
  81378,
  81367,
  81367,
  81063,
  80686,
  81370,
  81135,
  81296,
  80686,
  81370,
  81135,
  81016,
  81063,
  81016,
  81370,
  81370,
  81063,
  81016,
  81016,
  81016,
  81016,
  81016,
  80686,
  81063,
  81063,
  81005,
  81135,
  81296,
  81135,
  80686,
  81016,
  81022,
  81022,
  81022,
  81293,
  81293,
  81293,
  80992,
  78652,
  78652,
  81372,
  81372,
  81372,
  81372,
  78652,
  81378,
  81378,
  81378,
  79877,
  79877,
  79877,
  79877,
  80895,
  80895,
  80701,
  81292,
  81238,
  81238,
  81238,
  81292,
  81292,
  81292,
  81238,
  81238,
  79877,
  79877,
  79877,
  79877,
  80895,
  80895,
  80895,
  80895,
  8

In [27]:
reading_df.iloc[0]['agent_1']

{'FS1': {'entity': 'Poetry',
  'shortened_wiki_lead_section': 80844,
  'fun_facts': ['t3_qlvl0', 't3_sc9i4', 't3_1wu6aw', 't3_m2vjw', 't3_qf7bz']},
 'FS2': {'entity': 'Mars',
  'shortened_wiki_lead_section': 80645,
  'fun_facts': ['t3_2obbaf',
   't3_16zmgy',
   't3_xo9f0',
   't3_1o5xog',
   't3_243i1e']},
 'FS3': {'entity': 'Piano',
  'shortened_wiki_lead_section': 81376,
  'fun_facts': ['t3_3fjuhg',
   't3_v67sn',
   't3_p7ygj',
   't3_1npju0',
   't3_1v09uz']}}
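
The lookups above use `talking_points_idxs.index(...)`, which scans the whole list each time. One possible alternative, sketched here on toy data, is to invert the text-to-id mapping once and resolve every section of a reading set through a dictionary. The names `id_to_text` and `resolve_reading_set` are hypothetical, and the short toy strings only stand in for the real entries of `shortened_wiki_lead_section`:

```python
# Toy stand-in for wiki_data['shortened_wiki_lead_section'] (text -> id)
shortened = {
    'toy text about Poetry': 80844,
    'toy text about Mars': 80645,
    'toy text about Piano': 81376,
}

# Invert once: id -> text
id_to_text = {idx: text for text, idx in shortened.items()}

# Reading set in the shape shown above (fun_facts omitted for brevity)
reading_set = {
    'FS1': {'entity': 'Poetry', 'shortened_wiki_lead_section': 80844},
    'FS2': {'entity': 'Mars', 'shortened_wiki_lead_section': 80645},
    'FS3': {'entity': 'Piano', 'shortened_wiki_lead_section': 81376},
}


def resolve_reading_set(agent_set, lookup):
    """Map each section label (FS1, FS2, FS3) to its talking-point text."""
    resolved = {}
    for section, info in agent_set.items():
        key = info.get('shortened_wiki_lead_section')
        # Entries without that key, or with an unknown id, resolve to None
        resolved[section] = lookup.get(key)
    return resolved


print(resolve_reading_set(reading_set, id_to_text))
```

Returning `None` instead of raising is a design choice of this sketch; it keeps one missing id from stopping a pass over all conversations.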

[Back to Contents](#Table-of-Contents)

## Pre-Trained Sentence Encoder

In [None]:
#https://www.sbert.net/docs/installation.html

#the commented lines below are examples of downloading pre-trained models
#embed_bert = SentenceTransformer('paraphrase-xlm-r-multilingual-v1')
#embed_bert = SentenceTransformer('stsb-roberta-large') #textual similarity
#embed_bert = SentenceTransformer('distiluse-base-multilingual-cased-v2')

#the models are downloaded to a cache folder such as
#/Users/atambu/.cache/torch/sentence_transformers/sbert.net_models_paraphrase-xlm-r-multilingual-v1_part
#copy them to the desired folder, e.g. ./models

### Example of Using Encoder

In [19]:
# distiluse-base-multilingual-cased-v2 is a multilingual model distilled from Google's multilingual Universal Sentence Encoder
embed_bert = SentenceTransformer('../models/sbert.net_models_distiluse-base-multilingual-cased-v2')

# Two lists of sentences
sentences1 = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome']

sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

#Compute embedding for both lists
embeddings1 = embed_bert.encode(sentences1, convert_to_tensor=True)
embeddings2 = embed_bert.encode(sentences2, convert_to_tensor=True)

#Compute cosine-similarity
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)

#Output the pairs with their score
for i in range(len(sentences1)):
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[i], cosine_scores[i][i]))

The cat sits outside 		 The dog plays in the garden 		 Score: 0.3065
A man is playing guitar 		 A woman watches TV 		 Score: 0.2188
The new movie is awesome 		 The new movie is so great 		 Score: 0.9821


In [20]:
# Lower-casing and whitespace stripping do not seem to be needed with this model:
# apparently some preprocessing is performed inside the model.

util.pytorch_cos_sim(embed_bert.encode(["A man is playing guitar."], convert_to_tensor=True), 
                     embed_bert.encode(["A man's playing guitar "], convert_to_tensor=True)).detach().cpu().numpy()

array([[0.97733396]], dtype=float32)
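
For reference, the score returned by `util.pytorch_cos_sim` is the cosine of the angle between two embedding vectors: their dot product divided by the product of their norms. A dependency-free sketch of that formula on toy 3-dimensional vectors follows. Real embeddings have many more dimensions, and `cosine_similarity` is a hypothetical helper, not part of `sentence_transformers`:

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a, so the similarity is close to 1
c = [3.0, 0.0, -1.0]   # orthogonal to a, so the similarity is 0

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Because the score depends only on direction, not length, two texts of very different size can still score high if they are about the same thing.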

### Using Encoder for Chats

In [21]:
talking_points_emb = embed_bert.encode(talking_points, convert_to_tensor=True)

In [28]:
cT_example = [' '.join(ic['message']) for ic in chats_df.iloc[0]['content']]
agent_1_tp = talking_points[talking_points_idxs.index(reading_df.iloc[0]['agent_1']['FS1']['shortened_wiki_lead_section'])]
agent_2_tp = talking_points[talking_points_idxs.index(reading_df.iloc[0]['agent_2']['FS1']['shortened_wiki_lead_section'])]

In [32]:
agent_1_tp

'Poetry (the term derives from a variant of the Greek term, poiesis, "making") is a form of literature  that uses aesthetic and rhythmic qualities of language—such as phonaesthetics, sound symbolism, and metre—to evoke meanings in addition to, or in place of, the prosaic ostensible meaning.\nPoetry has a long history, dating back to prehistorical times with the creation of hunting poetry in Africa, and panegyric and elegiac court poetry were developed extensively throughout the history of the empires of the Nile, Niger and Volta river valleys . Some of the earliest written poetry in Africa can be found among the Pyramid Texts written during the 25th century BCE, while the Epic of Sundiata is one of the most well-known examples of griot court poetry. The earliest Western Asian epic poetry, the Epic of Gilgamesh, was written in Sumerian.  Early poems in the Eurasian continent evolved from folk songs such as the Chinese Shijing, or from a need to retell oral epics, as with the Sanskrit Ve

In [33]:
agent_2_tp

'Poetry (the term derives from a variant of the Greek term, poiesis, "making") is a form of literature  that uses aesthetic and rhythmic qualities of language—such as phonaesthetics, sound symbolism, and metre—to evoke meanings in addition to, or in place of, the prosaic ostensible meaning.\nPoetry has a long history, dating back to prehistorical times with the creation of hunting poetry in Africa, and panegyric and elegiac court poetry were developed extensively throughout the history of the empires of the Nile, Niger and Volta river valleys . Some of the earliest written poetry in Africa can be found among the Pyramid Texts written during the 25th century BCE, while the Epic of Sundiata is one of the most well-known examples of griot court poetry. The earliest Western Asian epic poetry, the Epic of Gilgamesh, was written in Sumerian.  Early poems in the Eurasian continent evolved from folk songs such as the Chinese Shijing, or from a need to retell oral epics, as with the Sanskrit Ve

In [29]:
cT_example

['Do you know who Emily Dickson is?',
 'Emily Dickinson? The poet? I do! "Tell all the truth, but tell it slant" she once said. Do you like her poetry?',
 'Yeah she was an icon she died in 1886 at the tender age of 55.',
 'Though she was reclusive, she lived an interesting 55 years. Do you know much about her life?',
 'I did not unfortunately! I hear over the years she shared at least 250 poems with Susan her close friend before marrying Austin.',
 'Yes. she wrote hundreds and hundreds of poems, but they were locked away in a drawer, and a critic said that there were many arresting phrases, nothing scanned or rhymed properly, and so he declined to help get them published.',
 'Did you kow theres a poem when read normally is depressing but when read backward is uplifting?',
 "Wow! That is certainly different than Emily's poetry. There's such a diversity in poetry. It's no surprise, since poetry dates back to prehistorical times. Did you know some of the first poetry was hunting poetry in

In [30]:
cT_example_emb = embed_bert.encode(cT_example, convert_to_tensor=True)

In [31]:
scores = []
is_talking_points = {}

sim_mat = util.pytorch_cos_sim(cT_example_emb,
                               talking_points_emb).detach().cpu().numpy()
                        
# For each utterance, find the talking points whose similarity exceeds the threshold.
# A chat can mention multiple talking points; only the first match is printed here.
for j in range(sim_mat.shape[0]):
    sim_idx = np.where(sim_mat[j,]>0.5)[0]
    sim_values = [(k, sim_mat[j,k]) for k in sim_idx]
    if len(sim_idx)>0:
        print(cT_example[j])
        print('***', sim_values[0][1], talking_points[sim_values[0][0]])

Wow! That is certainly different than Emily's poetry. There's such a diversity in poetry. It's no surprise, since poetry dates back to prehistorical times. Did you know some of the first poetry was hunting poetry in Africa?
*** 0.53270096 Poetry (the term derives from a variant of the Greek term, poiesis, "making") is a form of literature  that uses aesthetic and rhythmic qualities of language—such as phonaesthetics, sound symbolism, and metre—to evoke meanings in addition to, or in place of, the prosaic ostensible meaning.
Poetry has a long history, dating back to prehistorical times with the creation of hunting poetry in Africa, and panegyric and elegiac court poetry were developed extensively throughout the history of the empires of the Nile, Niger and Volta river valleys . Some of the earliest written poetry in Africa can be found among the Pyramid Texts written during the 25th century BCE, while the Epic of Sundiata is one of the most well-known examples of griot court poetry. The
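
The loop above prints the first talking point whose similarity exceeds the threshold, which is not necessarily the closest one. A sketch of selecting the best match per utterance is shown below on a small made-up similarity matrix. It uses plain lists instead of the NumPy array used above, and `best_matches` is a hypothetical helper:

```python
def best_matches(sim_rows, threshold=0.5):
    """Return {row index: (best column, score)} for rows whose best score exceeds threshold."""
    matches = {}
    for j, row in enumerate(sim_rows):
        # index of the largest similarity in this row
        k = max(range(len(row)), key=row.__getitem__)
        if row[k] > threshold:
            matches[j] = (k, row[k])
    return matches


# Made-up similarities: 3 utterances (rows) x 3 talking points (columns)
toy_sim = [
    [0.10, 0.20, 0.05],   # no talking point above the threshold
    [0.55, 0.30, 0.62],   # two candidates, column 2 is the best
    [0.51, 0.12, 0.40],   # one candidate, column 0
]

print(best_matches(toy_sim))
```

With the NumPy matrix from the cell above, the same idea could be expressed with `sim_mat.argmax(axis=1)`.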

[Back to Contents](#Table-of-Contents)

## Data Processing

In [34]:
@ray.remote
class processText:
        
    def __init__(self):
        
        import tensorflow as tf
        import tensorflow_hub as hub
        import tensorflow_text #needed to avoid crashes    
        
        self.texts = ray.get(texts_ray)
        
        print('loading model')
        self.embed = SentenceTransformer('../models/sbert.net_models_distiluse-base-multilingual-cased-v2')
        print('loading model...done')

        self.talking_points_emb = self.embed.encode(ray.get(talking_points_ray), convert_to_tensor=True)

        
    def clean_sentences(self, cT):

        # cT is the list of utterances of one chat
        #cT = [s.lower() for s in cT] #lower case
        # collapse extra whitespace; utterances with 5 words or fewer are blanked
        # (replaced by an empty string), not removed
        cT = [" ".join(s.split()) if len(s.split())>5 else "" for s in cT]

        # skip chats with fewer than 3 utterances
        if len(cT)<3:
            return []
        return cT
    
    
    def get_points(self, text_idxs, phrase_thr=0.60):

        is_talking_points = {}
        
        print('start', text_idxs)
        
        for i in range(text_idxs[0], text_idxs[1]): 
            
            cT = self.texts[i]
            cT = self.clean_sentences(cT)
            
            if len(cT)==0:
                continue

            emb_cT_i = self.embed.encode(cT, convert_to_tensor=True)
            # similarity 
            sim_mat = util.pytorch_cos_sim(emb_cT_i,
                                           self.talking_points_emb).detach().cpu().numpy()
            
            # loop over individual sentences and check similarity to each of the phrases    
            #best_match = np.unravel_index(np.argmax(sim_mat, axis=None), sim_mat.shape)
            #best_match_value = sim_mat[best_match]
            #if best_match_value>phrase_thr:
            #    is_talking_points.update({(i,cT[best_match[0]]): [(best_match[1], best_match_value)]})
            
            # For each utterance, find the talking points whose similarity exceeds the threshold.
            # A chat can mention multiple talking points.
            for j in range(sim_mat.shape[0]):
                sim_idx = np.where(sim_mat[j,]>phrase_thr)[0]
                sim_values = [(k, sim_mat[j,k]) for k in sim_idx] # this is for debugging purposes, later just select best matching phrase
                if len(sim_idx)>0:
                    is_talking_points.update({(i,cT[j]): sim_values})
            

        print('end', text_idxs)
        
        return is_talking_points
    

In [None]:
ray.shutdown()
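
A standalone sketch of the filtering rule in `clean_sentences` can help to check its behavior outside the Ray actor: utterances with five words or fewer are blanked rather than removed, and chats with fewer than three utterances are skipped. The function below mirrors the method without `self`; the parameters `min_words` and `min_utterances` are hypothetical names for the constants used in the class, and the sample chat is made up:

```python
def clean_sentences(utterances, min_words=6, min_utterances=3):
    """Normalize whitespace, blank short utterances, skip very short chats."""
    cleaned = [" ".join(s.split()) if len(s.split()) >= min_words else ""
               for s in utterances]
    # the length check counts all utterances, including the blanked ones
    return cleaned if len(cleaned) >= min_utterances else []


# Made-up chat with irregular spacing and one short reply
chat = [
    "Do you   watch soccer at all these days?",
    "Not really.",
    "I follow the World Cup every four years though.",
]

print(clean_sentences(chat))       # short reply becomes an empty string
print(clean_sentences(chat[:2]))   # fewer than three utterances: skipped
```

Blanking keeps positions aligned with the original chat; a stricter variant could drop short utterances before counting, at the cost of processing fewer chats.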

[Back to Contents](#Table-of-Contents)

### Processing Chats

In [35]:
all_chats = []
for i in chats_df['content'].values :
    all_chats_tmp = []
    for j in i:
        all_chats_tmp.append(' '.join(j['message']))
    all_chats.append(all_chats_tmp)

In [36]:
len(all_chats)

10784

In [37]:
all_chats[1]

['Did you know the richest superhero is black panther?',
 'Is that the character in the 2018 film, or the actor who played the role?',
 'I think it is the character. I will check closer.',
 'Okay. Have you seen the film? Heard it was nominated for Oscars...',
 "I haven't yet but I plan to.",
 'Me, too. Understand that symbols and script were based on 4th century Nigerian story. Costume designer won an Oscar, I think...deserved to, at least...',
 'Interesting. Costume designing would be a fun occupation.',
 'Agree. One young star in the film, Chadwick Boseman, was sponsored to an Oxford, England theater program by a "private benefactor". Turned out to be the great actor, Denzel Washington. Fun fact?',
 'Wow! That is interesting.',
 'Glad script made black panther richest superhero, over characters like Tony Stark (Iron Man) and Bruce Wayne (Batman). Why not? ?',
 'True! If you want a superhero make him really rich.',
 "And of noble ancestry! This nearly all-black film is possibly the bi

In [38]:
%%time 

num_cpus = psutil.cpu_count(logical=False)
max_chats = len(all_chats)
n_cells = 100 # approximate number of chats per chunk (one chunk per remote call)

print('Processing', max_chats, 'chats')

# split task into chunks
run_list_chnks = np.linspace(0, max_chats, int(max_chats/n_cells), dtype=int)
run_list_chnks = [(run_list_chnks[i],run_list_chnks[i+1]) for i in range(0,len(run_list_chnks)-1)]

# init ray
ray.init(num_cpus=num_cpus)

phrase_thr = 0.50

texts_ray = ray.put(all_chats[:max_chats])
talking_points_ray = ray.put(talking_points)

pT = [processText.remote() for _ in range(num_cpus)]

# every 'num_cpus' jobs, start from worker 0 again
result_ids = [pT[i % num_cpus].get_points.remote(run_list_chnks[i],
                                                 phrase_thr=phrase_thr) for i in range(len(run_list_chnks))] 

# Fetch the results.
#results = ray.get(result_ids)    
results_c = ray.get(result_ids)

ray.shutdown()

Processing 10784 chats


2021-05-05 08:59:50,830	INFO services.py:1172 -- View the Ray dashboard at http://127.0.0.1:8265


(pid=26913) loading model
(pid=26914) loading model
(pid=26911) loading model
(pid=26912) loading model
(pid=26913) loading model...done
(pid=26914) loading model...done
(pid=26912) loading model...done
(pid=26911) loading model...done
(pid=26912) start (305, 406)
(pid=26914) start (0, 101)
(pid=26913) start (101, 203)
(pid=26911) start (203, 305)
(pid=26914) end (0, 101)
(pid=26914) start (406, 508)
(pid=26913) end (101, 203)
(pid=26913) start (508, 610)
(pid=26912) end (305, 406)
(pid=26912) start (712, 813)
(pid=26911) end (203, 305)
(pid=26911) start (610, 712)
(pid=26914) end (406, 508)
(pid=26914) start (813, 915)
(pid=26913) end (508, 610)
(pid=26913) start (915, 1017)


(pid=26914) end (8952, 9054)
(pid=26914) start (9359, 9461)
(pid=26912) end (9257, 9359)
(pid=26912) start (9664, 9766)
(pid=26911) end (9563, 9664)
(pid=26911) start (9970, 10071)
(pid=26913) end (9461, 9563)
(pid=26913) start (9868, 9970)
(pid=26914) end (9359, 9461)
(pid=26914) start (9766, 9868)
(pid=26912) end (9664, 9766)
(pid=26912) start (10071, 10173)
(pid=26911) end (9970, 10071)
(pid=26911) start (10377, 10478)
(pid=26913) end (9868, 9970)
(pid=26913) start (10275, 10377)
(pid=26912) end (10071, 10173)
(pid=26912) start (10478, 10580)
(pid=26914) end (9766, 9868)
(pid=26914) start (10173, 10275)
(pid=26911) end (10377, 10478)
(pid=26913) end (10275, 10377)
(pid=26913) start (10682,

In [39]:
ray.shutdown()
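
The chunk boundaries in the processing cell come from `np.linspace`, which yields chunks of roughly, not exactly, `n_cells` chats. One alternative, sketched here in plain Python, is fixed-size chunking with `range`, followed by the same round-robin assignment of chunks to workers. `make_chunks` and `assign_round_robin` are hypothetical helpers and not part of Ray; the worker count of 4 is taken from the log above:

```python
def make_chunks(n_items, chunk_size):
    """Half-open (start, end) index pairs that together cover range(n_items)."""
    return [(start, min(start + chunk_size, n_items))
            for start in range(0, n_items, chunk_size)]


def assign_round_robin(chunks, n_workers):
    """Map worker index -> list of chunks, cycling through the workers."""
    assignment = {w: [] for w in range(n_workers)}
    for i, chunk in enumerate(chunks):
        assignment[i % n_workers].append(chunk)
    return assignment


chunks = make_chunks(10784, 100)
print(len(chunks), chunks[0], chunks[-1])

plan = assign_round_robin(chunks, 4)
print([len(plan[w]) for w in range(4)])
```

Small chunks keep the workers evenly loaded when some chats are much longer than others, at the price of more remote calls.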

#### Example of Processed Chat Data

In [40]:
results_c[0] # (chat index, utterance text above threshold): [(talking point 1, similarity to utterance),(talking point 2, similarity to utterance)] 

{(0,
  "Wow! That is certainly different than Emily's poetry. There's such a diversity in poetry. It's no surprise, since poetry dates back to prehistorical times. Did you know some of the first poetry was hunting poetry in Africa?"): [(66,
   0.53270096)],
 (1,
  'Glad script made black panther richest superhero, over characters like Tony Stark (Iron Man) and Bruce Wayne (Batman). Why not? ?'): [(58,
   0.5512788),
  (202, 0.5192375),
  (230, 0.5024116)],
 (1,
  "True!! You didn't miss anything. Marvel Comics character was not named after the 70's Black Panther Party, though studio almost called film, Black Leopard."): [(58,
   0.53018355)],
 (3,
  'Yes, it is very important, the term summit was not used until the Geneva summit back in 1955, after the cold war era the number of sumiit events increased.'): [(122,
   0.51651245)],
 (3,
  "It is just sad she went from winning a gold medal and only 2 months into retirement...and gets such a bad diagnosis. She says she will beat it though,

In [41]:
talking_points[35]

'The piano is an acoustic, stringed musical instrument invented in Italy by Bartolomeo Cristofori around the year 1700 (the exact year is uncertain), in which the strings are struck by hammers. It is played using a keyboard, which is a row of keys (small levers) that the performer presses down or strikes with the fingers and thumbs of both hands to cause the hammers to strike the strings. '
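
Each element of `results_c` is a dictionary keyed by `(chat index, utterance)` whose value is a list of `(talking point index, similarity)` pairs. A sketch of flattening such chunks into one row per utterance, keeping only the best-scoring talking point, is shown below. The sample copies two entries from the output above with the utterance text abbreviated, and `flatten_results` is a hypothetical helper:

```python
def flatten_results(results):
    """Turn a list of result dictionaries into (chat, utterance, talking point, similarity) rows."""
    rows = []
    for chunk in results:
        for (chat_no, utterance), matches in chunk.items():
            # keep the talking point with the highest similarity
            tp_no, sim = max(matches, key=lambda m: m[1])
            rows.append((chat_no, utterance, tp_no, sim))
    return rows


# Two entries from the output above, utterances abbreviated for illustration
sample_results = [{
    (0, "Wow! That is certainly different ..."): [(66, 0.53270096)],
    (1, "Glad script made black panther ..."): [(58, 0.5512788),
                                                (202, 0.5192375),
                                                (230, 0.5024116)],
}]

for row in flatten_results(sample_results):
    print(row)
```

The rows could then be passed to `pd.DataFrame(rows, columns=[...])` for grouping by talking point.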

[Back to Contents](#Table-of-Contents)

## Analyzing Processed Data

For each talking point in the table below, we randomly sample a few transcript extracts to check the reliability of the extract retrieval approach used in this notebook.
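
The sampling step can be sketched in isolation. `random.sample` raises an error when asked for more items than are available, so the sample size is capped at the number of extracts. The helper `sample_up_to` and its `seed` parameter are hypothetical, and the extracts are placeholders:

```python
import random


def sample_up_to(items, k, seed=None):
    """Return at most k items drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))


extracts = ['extract A', 'extract B', 'extract C']

print(sample_up_to(extracts, 2, seed=0))    # two of the three extracts
print(sample_up_to(extracts, 20, seed=0))   # all three, in random order
```

Fixing the seed makes the sampled extracts reproducible between runs, which may help when comparing thresholds.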

In [42]:
def analyze_results(results, n_texts, phrase_thr, verbose=True):

    talking_points_cnts = {k: 0 for k in talking_points}

    # loop over processed chunks and extract statistics
    for rr in range(len(results)):
        for chat_no_extract,tp_no_sim in results[rr].items():
            chat_no, chat_extract = chat_no_extract[0], chat_no_extract[1]
            tp_no_sim_sorted = sorted(tp_no_sim, key=lambda x: x[1], reverse=True) # sort by similarity values
            for kk in tp_no_sim_sorted[:1]: # take best
                tp_no = kk[0]
                tp_sim = kk[1]
                if tp_sim>phrase_thr:
                    if verbose:
                        print('------> Chat no. ', chat_no, 'Type: ', df.chatQueueLob[chat_no])
                        print('---> Extract:', chat_extract)
                        print('---> Phrase ', talking_points[tp_no])
                        print('---> Similarity index', tp_sim)
                    talking_points_cnts[talking_points[tp_no]] += 1
                 
    #extract some examples of match
    sample_extracts = {tp: [] for tp in talking_points}
    for tp in talking_points:
        for rr in range(len(results)):
            for chat_no_extract, tp_no_sim in results[rr].items():
                chat_no, chat_extract = chat_no_extract[0], chat_no_extract[1]
                tp_no_sim_sorted = sorted(tp_no_sim, key=lambda x: x[1], reverse=True)
                for kk in tp_no_sim_sorted[:1]: # take best
                    tp_no = kk[0]
                    tp_sim = kk[1]
                    if tp_sim>phrase_thr and tp_no==talking_points.index(tp):
                        sample_extracts[tp].append('**' + chat_extract)
                        #print('----->', tp)
                        #print('---> Extract:', chat_extract)
        if verbose:
            print(tp, len(sample_extracts[tp]), len(sample(sample_extracts[tp],min(20, len(sample_extracts[tp])))))
        sample_extracts[tp] = '\n'.join(sample(sample_extracts[tp],min(20, len(sample_extracts[tp])))) # keep a random sample of at most 20 extracts
        #sample_extracts[tp] = '\n'.join(sample_extracts[tp]) # alternative: keep every extract
    sample_extracts = [v for k,v in sample_extracts.items()]
    
    # assemble the summary table (the Excel export below is currently disabled)
    results_df = pd.DataFrame({'Talking Points': [k for k,v  in talking_points_cnts.items()], 
                  'Perc. Used': [100*v/n_texts for k,v  in talking_points_cnts.items()],
                  'Sample Extracts': sample_extracts,
                  # placeholder column: to be filled in after visually inspecting the
                  # (up to 20) sampled extracts per talking point and counting how many are right
                  'Human Corrected': [0 for k in talking_points_cnts]})

    #results_df.to_excel('extracts.xlsx')

    return results_df
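
To make the data flow in `analyze_results` easier to follow, here is a minimal sketch of its counting step on toy data. It assumes, based on the output shown earlier, that each element of `results` is a dict mapping `(chat_no, extract)` to a list of `(talking_point_index, similarity)` pairs. The toy `talking_points`, the toy `results`, and the helper name `count_best_matches` are illustrative only and are not part of the notebook.

```python
from collections import Counter

# Toy stand-ins (hypothetical data, shaped like the notebook's objects)
talking_points = ["Horror films", "Albums", "Pianos"]
results = [
    {(0, "I like horror movies."): [(0, 0.71), (1, 0.52)],
     (1, "Vinyl is back."): [(1, 0.64)]},
    {(2, "Nice weather."): [(2, 0.41)]},
]

def count_best_matches(results, phrase_thr):
    """Count, per talking point, the extracts whose best match exceeds phrase_thr."""
    counts = Counter()
    for chunk in results:
        for (chat_no, extract), matches in chunk.items():
            if not matches:
                continue
            # keep only the single most similar talking point for this extract
            tp_no, tp_sim = max(matches, key=lambda m: m[1])
            if tp_sim > phrase_thr:
                counts[talking_points[tp_no]] += 1
    return counts

counts = count_best_matches(results, phrase_thr=0.5)
print(dict(counts))
```

Only the best-scoring talking point of each extract is counted, so an extract that clears the threshold for several talking points still contributes to a single row of the summary table.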

[Back to Contents](#Table-of-Contents)

### Analyzing Chats

In [43]:
results_c_df = analyze_results(results_c, max_chats, phrase_thr=0.5, verbose=False)
results_c_df

| | Talking Points | Perc. Used | Sample Extracts | Human Corrected |
|---|---|---|---|---|
| 0 | A horror film is a film that seeks to elicit f... | 3.347552 | **Yes I like horror movies too. You know that ... | 0 |
| 1 | A soundtrack, also written sound track, can be... | 0.000000 | | 0 |
| 2 | An album is a collection of audio recordings i... | 0.649110 | **Well the 21st century sales have mainly focu... | 0 |
| 3 | The president is a common title for the head o... | 1.047849 | **I cannot imagine it either. Apparently, a di... | 0 |
| 4 | The United States Senate is the upper chamber ... | 0.259644 | **I think the same 50 states, 100 US senators,... | 0 |
| ... | ... | ... | ... | ... |
| 256 | Hip hop or hip-hop, is a culture and art movem... | 0.649110 | **So, I've been listening to a lot of rap late... | 0 |
| 257 | The Terminator series is an American science-f... | 0.009273 | **It could be. I remember that The Terminator ... | 0 |
| 258 | Blade Runner is a 1982 neo-noir science fictio... | 0.000000 | | 0 |
| 259 | Jane Austen (; 16 December 1775 – 18 July 1817... | 0.018546 | **Yes they are. Jane Austin works critique the... | 0 |


In [52]:
#results_c_df['Perc. Used'].sum()

In [44]:
i = 0
results_c_df.iloc[i,0], results_c_df.iloc[i,2].split('\n')

('A horror film is a film that seeks to elicit fear. Initially inspired by literature from authors like Edgar Allan Poe, Bram Stoker, and Mary Shelley, horror has existed as a film genre for more than a century. The macabre and the supernatural are frequent themes. Horror may also overlap with the fantasy, supernatural fiction, and thriller genres.',
 ['**Yes I like horror movies too. You know that Stephen King considers Bambi to be a horror movie?',
  '**Hey they are pretty scary! Horror is not for everyone. Pet Cemetery is being remade into another movie soon too. Maybe stick with things like Simpsons or Futurama.',
  '**I do like horror films. Insidious was a good one I watched that focuses on a paranormal investigator and her sidekicks.',
  "**Here's a horror movie you might like. Cujo? It has a dog. One of the only Stephen King movies I can watch. Funny story... he was actually terrified of Bambi as a kid and considered that the first horror movie he saw.",
  '**I can understand t

In [45]:
i = 2

results_c_df.iloc[i,0], results_c_df.iloc[i,2].split('\n')

('An album is a collection of audio recordings issued as a collection on compact disc (CD), vinyl, audio tape, or another medium. Albums of recorded music were developed in the early 20th century as individual 78-rpm records collected in a bound book resembling a photograph album; this format evolved after 1948 into single vinyl LP records  played at \u200b33 1⁄3 rpm. Vinyl LPs are still issued, though album sales in the 21st-century have mostly focused on CD and MP3 formats. The audio cassette was a format used alongside vinyl from the 1970s into the first decade of the 2000s.',
 ["**Well the 21st century sales have mainly focused on cd's and mp3 format.",
  '**Okay then what comes to mind? I still love playing vinyl but I hate that cassettes are starting to come back. Their sound quality is inferior and we no longer need them as a portable format.',
  '**Thats cool Albums of recorded music were developed in the early 20th century',
  "**I guess it's just a sign of the changing times.

[Back to Contents](#Table-of-Contents)

## Interactive Visualization of Results

In [46]:
embed = SentenceTransformer('../models/sbert.net_models_distiluse-base-multilingual-cased-v2')

In [47]:
talking_points_emb = embed.encode(talking_points, convert_to_tensor=True)

In [48]:
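# text typed into the widget arrives with an escaped newline (a literal backslash followed by n),
# so sentences are split on '\\n' rather than on '\n'
def get_points(cT, phrase_thr=0.60):
            
        #print(len(cT.split('\\n')))

        # split into utterances and drop empty ones (str.split never returns an empty list,
        # so without this filter the emptiness check below could never trigger)
        cT = [s for s in cT.split('\\n') if s.strip()]
        if len(cT)==0:
            return ''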
        
        emb_cT = embed.encode(cT, convert_to_tensor=True)
        # similarity 
        sim_mat = util.pytorch_cos_sim(emb_cT,
                                       talking_points_emb).detach().cpu().numpy()
            
        # for each sentence find matching phrases above threshold. 
        for j in range(sim_mat.shape[0]):
            sim_idx = np.where(sim_mat[j,]>phrase_thr)[0]
            sim_values = [(k, sim_mat[j,k]) for k in sim_idx] 
            sim_sorted = sorted(sim_values, key=lambda x: x[1], reverse=True) # sort by similarity values
            if len(sim_sorted)>0:
                print('**utterance:', cT[j])
                print('Talking Point:', talking_points[sim_sorted[0][0]])
                print('Similarity: ', sim_sorted[0][1])            


In [49]:
print('**Available talking points**')
print('')
for i in talking_points:
    print(i)

**Available talking points**

A horror film is a film that seeks to elicit fear. Initially inspired by literature from authors like Edgar Allan Poe, Bram Stoker, and Mary Shelley, horror has existed as a film genre for more than a century. The macabre and the supernatural are frequent themes. Horror may also overlap with the fantasy, supernatural fiction, and thriller genres.
A soundtrack, also written sound track, can be recorded music accompanying and synchronized to the images of a motion picture, book, television program, or video game; a commercially released soundtrack album of music as featured in the soundtrack of a film, video, or television presentation; or the physical area of a film that contains the synchronized recorded sound.
An album is a collection of audio recordings issued as a collection on compact disc (CD), vinyl, audio tape, or another medium. Albums of recorded music were developed in the early 20th century as individual 78-rpm records collected in a bound book 

In [50]:
#Chats
i=1 
cT = all_chats[i]
print("\\n".join(cT) )


Did you know the richest superhero is black panther?\nIs that the character in the 2018 film, or the actor who played the role?\nI think it is the character. I will check closer.\nOkay. Have you seen the film? Heard it was nominated for Oscars...\nI haven't yet but I plan to.\nMe, too. Understand that symbols and script were based on 4th century Nigerian story. Costume designer won an Oscar, I think...deserved to, at least...\nInteresting. Costume designing would be a fun occupation.\nAgree. One young star in the film, Chadwick Boseman, was sponsored to an Oxford, England theater program by a "private benefactor". Turned out to be the great actor, Denzel Washington. Fun fact?\nWow! That is interesting.\nGlad script made black panther richest superhero, over characters like Tony Stark (Iron Man) and Bruce Wayne (Batman). Why not? ?\nTrue! If you want a superhero make him really rich.\nAnd of noble ancestry! This nearly all-black film is possibly the biggest performance work to come alon
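
Note that `"\\n"` in the cell above is the two-character sequence backslash + `n`, not a newline character. That is why the chat is printed on a single line with visible `\n` separators, and why `get_points` splits on `'\\n'`. A small sketch of the round trip, using made-up utterances:

```python
utterances = ["Did you see the film?", "Not yet, but I plan to."]

# The separator is a backslash followed by 'n' (two characters), not a newline
joined = "\\n".join(utterances)
print(joined)

# Splitting on a real newline finds nothing to split on ...
unsplit = joined.split("\n")
# ... while splitting on the literal two-character separator recovers the utterances
recovered = joined.split("\\n")
print(recovered)
```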

In [53]:
@interact_manual(in_text='', phrase_thr=widgets.FloatSlider(min=0.5, max=0.8, step=0.1, value=0.5))
def g(in_text, phrase_thr):
    return get_points(in_text, phrase_thr) 

interactive(children=(Text(value='', description='in_text'), FloatSlider(value=0.5, description='phrase_thr', …

[Back to Contents](#Table-of-Contents)