# Topic Modelling

The goal is a system that, given the turns of a conversation so far and a new turn, returns the most important topic(s) the conversation revolves around.

Note: The result will be fed into the knowledge retrieval phases to provide the core context, in addition to the latest turn of conversation.
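
As a rough illustration of that hand-off, the sketch below combines the latest turn with topic keywords into a single retrieval query. The function name `build_retrieval_query`, the `max_keywords` parameter, and the plain-string query format are all hypothetical; the actual retrieval interface is not specified in this notebook.

```python
def build_retrieval_query(latest_turn, topic_keywords, max_keywords=5):
    """Hypothetical helper: append the top topic keywords to the latest turn."""
    # Skip keywords the latest turn already mentions, preserving keyword order
    lowered_turn = latest_turn.lower()
    extra = [k for k in topic_keywords if k.lower() not in lowered_turn]
    return " ".join([latest_turn] + extra[:max_keywords])

query = build_retrieval_query(
    "their best player is neymar",
    ["player", "football", "cup", "world", "1970"],
)
print(query)  # their best player is neymar football cup world 1970
```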

## Sample Data Definition

In [10]:
docs = ["Hey! How are you doing?", "This COVID thing has been crazy hasn't it", "I heard the vaccines aren't all that effective",
       "I heard Pfizer had something to do with the vaccines", "and what about Moderna? I'm pretty sure they were involved too",
       "I'm not sure social distancing is useful in stopping the spread"]
docs.extend(docs)
docs.extend(docs)
docs.extend(docs)

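# Alternative, much larger corpus. The BERTopic output logged below (589 batches, topics such as
# 0_he_game_team_year) appears to come from this dataset rather than from the sample docs above.
# from sklearn.datasets import fetch_20newsgroups
# docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']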

### Imports

In [2]:
from bertopic import BERTopic
import numpy as np


In [3]:
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True, embedding_model="../../models/bert-base-cased-squad2")

In [4]:
topics, probs = topic_model.fit_transform(docs)

Some weights of the model checkpoint at ../../models/bert-base-cased-squad2 were not used when initializing BertModel: ['qa_outputs.weight', 'qa_outputs.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Batches: 100%|███████████████████████████████████████████████████████████████████████| 589/589 [13:40<00:00,  1.39s/it]
2022-03-28 21:28:18,366 - BERTopic - Transformed documents to Embeddings
2022-03-28 21:29:24,013 - BERTopic - Reduced dimensionality with UMAP
2022-03-28 21:29:46,150 - BERTopic - Clustered UMAP embeddings with HDBSCAN


In [5]:
freq = topic_model.get_topic_info(); freq.head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 10947 | -1_the_to_of_and |
| 1 | 0 | 875 | 0_he_game_team_year |
| 2 | 1 | 709 | 1_edu_the_it_of |
| 3 | 2 | 561 | 2_that_of_god_is |
| 4 | 3 | 515 | 3____ |


In [6]:
topic_model.get_topic(0)  # Select the most frequent topic

[('he', 0.017976277241705883),
 ('game', 0.016807014189312745),
 ('team', 0.01569459667494884),
 ('year', 0.01127740493065954),
 ('the', 0.010826522447135929),
 ('games', 0.010601793995639252),
 ('was', 0.010445442727259373),
 ('his', 0.010272403028888195),
 ('in', 0.00958907970771917),
 ('they', 0.00894425107608824)]
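
Most of the words in this topic are function words (`he`, `the`, `was`, ...), which carry little topical signal. One simple post-processing step, sketched below, is to drop such words from the `(word, weight)` list before passing it on. The small `FUNCTION_WORDS` set and the `content_words` helper are illustrative assumptions, not part of BERTopic.

```python
# (word, weight) pairs copied from the get_topic(0) output above
topic_words = [
    ('he', 0.017976277241705883), ('game', 0.016807014189312745),
    ('team', 0.01569459667494884), ('year', 0.01127740493065954),
    ('the', 0.010826522447135929), ('games', 0.010601793995639252),
    ('was', 0.010445442727259373), ('his', 0.010272403028888195),
    ('in', 0.00958907970771917), ('they', 0.00894425107608824),
]

# Tiny hand-picked list for this example only (an assumption, not a full stop-word list)
FUNCTION_WORDS = {'he', 'the', 'was', 'his', 'in', 'they'}

def content_words(pairs, ignore=FUNCTION_WORDS):
    """Keep only the pairs whose word is not in the ignore set, preserving order."""
    return [(word, weight) for word, weight in pairs if word not in ignore]

print([word for word, _ in content_words(topic_words)])
# ['game', 'team', 'year', 'games']
```

BERTopic also accepts a `vectorizer_model` argument, so passing a scikit-learn `CountVectorizer(stop_words="english")` there is another option worth checking against the library documentation.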

## Using gensim

### Imports

In [5]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

### Pre-process text

In [11]:
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemmatizer = WordNetLemmatizer()

def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemmatizer.lemmatize(word) for word in punc_free.split())
    return normalized

clean_corpus = [clean(doc).split() for doc in docs]
print(clean_corpus)

[['hey', 'doing'], ['covid', 'thing', 'crazy'], ['heard', 'vaccine', 'effective'], ['heard', 'pfizer', 'something', 'vaccine'], ['moderna', 'im', 'pretty', 'sure', 'involved'], ['im', 'sure', 'social', 'distancing', 'useful', 'stopping', 'spread'], ['hey', 'doing'], ['covid', 'thing', 'crazy'], ['heard', 'vaccine', 'effective'], ['heard', 'pfizer', 'something', 'vaccine'], ['moderna', 'im', 'pretty', 'sure', 'involved'], ['im', 'sure', 'social', 'distancing', 'useful', 'stopping', 'spread'], ['hey', 'doing'], ['covid', 'thing', 'crazy'], ['heard', 'vaccine', 'effective'], ['heard', 'pfizer', 'something', 'vaccine'], ['moderna', 'im', 'pretty', 'sure', 'involved'], ['im', 'sure', 'social', 'distancing', 'useful', 'stopping', 'spread'], ['hey', 'doing'], ['covid', 'thing', 'crazy'], ['heard', 'vaccine', 'effective'], ['heard', 'pfizer', 'something', 'vaccine'], ['moderna', 'im', 'pretty', 'sure', 'involved'], ['im', 'sure', 'social', 'distancing', 'useful', 'stopping', 'spread'], ['hey',
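
The output above still contains tokens such as `doing` and `im`. A likely reason is that `clean` removes stop words before it strips punctuation, so `doing?` never matches the stop word `doing`. The sketch below contrasts the two orderings using only the standard library; the tiny `STOP` set is an assumption standing in for NLTK's much longer English list.

```python
import string

# Tiny illustrative stop-word list (an assumption; NLTK's English list is much longer)
STOP = {'how', 'are', 'you', 'doing'}
PUNCT = str.maketrans('', '', string.punctuation)

def stop_then_punct(doc):
    """Order used in clean(): filter stop words first, strip punctuation second."""
    kept = [w for w in doc.lower().split() if w not in STOP]
    return " ".join(kept).translate(PUNCT).split()

def punct_then_stop(doc):
    """Alternative order: strip punctuation first so 'doing?' matches 'doing'."""
    words = doc.lower().translate(PUNCT).split()
    return [w for w in words if w not in STOP]

print(stop_then_punct("Hey! How are you doing?"))  # ['hey', 'doing']
print(punct_then_stop("Hey! How are you doing?"))  # ['hey']
```

Stripping punctuation first has its own cost: `i'm` becomes `im`, which no longer matches stop-word entries that contain an apostrophe, so the better ordering depends on the stop-word list in use.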

In [None]:
from gensim import corpora
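text_data = clean_corpus  # assumed input: the tokenized documents from the pre-processing step
dictionary = corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]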
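
The gensim step above turns each token list into a bag of words: a list of `(token_id, count)` pairs. The standard-library sketch below shows the idea only. The helpers `build_vocabulary` and `to_bow` are hypothetical, and ids here follow first appearance, which may differ from the ids gensim's `Dictionary` assigns.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Map each distinct token to an integer id (ids follow first appearance here)."""
    token2id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for the tokens known to the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Two cleaned documents taken from the pre-processing output above
sample = [['heard', 'vaccine', 'effective'], ['heard', 'pfizer', 'something', 'vaccine']]
token2id = build_vocabulary(sample)
bows = [to_bow(tokens, token2id) for tokens in sample]
print(token2id)  # {'heard': 0, 'vaccine': 1, 'effective': 2, 'pfizer': 3, 'something': 4}
print(bows)      # [[(0, 1), (1, 1), (2, 1)], [(0, 1), (1, 1), (3, 1), (4, 1)]]
```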

## Manually

In [2]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import time
import numpy as np
from nltk.tokenize import word_tokenize, sent_tokenize
import nltk


In [3]:
model = SentenceTransformer('../../models/all-MiniLM-L6-v2', device='cuda')

In [13]:
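def calc_sentence_similarity(msg, candidates):
    # Cosine similarity between msg and each candidate (higher means more similar)
    msg_embedding = model.encode([msg])
    candidate_embeddings = model.encode(candidates)
    similarities = cosine_similarity(msg_embedding, candidate_embeddings).flatten()
    return similarities

def has_topic_changed(msg, prev_msg, control_msg, error_threshold=0.02):
    # distances[0]: similarity to the previous turn; distances[1]: similarity to the neutral control message.
    # Despite the name, these are similarities. The topic counts as changed unless msg is more
    # similar to prev_msg than to control_msg by at least error_threshold.
    distances = calc_sentence_similarity(msg, [prev_msg, control_msg])
    print(f"topic_change_dist: {distances}")
    return distances[1] - distances[0] > -error_threshold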
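
To make the comparison concrete without loading the embedding model, the sketch below computes cosine similarity by hand and applies the same threshold test to two similarity pairs logged in the sample run further down. `cosine` and `changed_from_scores` are illustrative helpers, not functions used elsewhere in this notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def changed_from_scores(sim_prev, sim_control, error_threshold=0.02):
    """Same comparison as has_topic_changed, applied to precomputed similarities."""
    return sim_control - sim_prev > -error_threshold

print(cosine([1.0, 0.0], [1.0, 1.0]))               # about 0.707

# "I love football": prev_msg equals control_msg, so both similarities match
print(changed_from_scores(0.14568186, 0.14568186))  # True
# "My favorite team is Brazil": much closer to the previous turn than to the control
print(changed_from_scores(0.52336955, 0.07332079))  # False
```

Because the first turn is compared against the control message twice, the difference is zero and the first turn is always reported as a topic change.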

In [5]:
def tag_sentence(message):
    # Extract candidate keywords: nouns (NN*) and cardinal numbers (CD), skipping greetings
    nouns = []
    for sentence in sent_tokenize(message):
        words = word_tokenize(sentence)
        tagged = nltk.pos_tag(words)  # list of (word, Penn Treebank tag) pairs
        nouns.extend(word for word, tag in tagged
                     if tag[:2] in ('NN', 'CD') and word.lower() not in ('hi', 'hey'))
    return nouns
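
The filtering rule can be tried on its own with pre-tagged tokens, as in the sketch below. The tags are hand-written for illustration, so a real tagger may label some words differently, and `filter_keywords` is a hypothetical stand-alone version of the filter.

```python
def filter_keywords(tagged_tokens, skip=('hi', 'hey')):
    """Keep tokens whose tag starts with NN (nouns) or CD (cardinal numbers)."""
    return [word for word, tag in tagged_tokens
            if tag[:2] in ('NN', 'CD') and word.lower() not in skip]

# Hand-written Penn Treebank tags for illustration; a real tagger may differ
tagged = [('They', 'PRP'), ('won', 'VBD'), ('the', 'DT'), ('world', 'NN'),
          ('cup', 'NN'), ('in', 'IN'), ('1970', 'CD')]
print(filter_keywords(tagged))  # ['world', 'cup', '1970']
```

In the backup run logged at the end of this notebook, `Did` shows up among the keywords, which suggests the tagger can label a capitalised sentence-initial word as a noun. Lower-casing or a stop-word check before tagging might reduce such cases.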

### Experimenting with knowledge transition

In [37]:
prev_m_k = []
m_k = []
cur_k = []
topic_k = []
while True:
    m = input('User: ')
    if m == 'exit':
        break
    m_k = tag_sentence(m)
    prev_m_k = m_k
    

['match', 'yesterday']


### Clustering cosine similarity

In [6]:
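from scipy.cluster.vq import kmeans

def list_sorted_args(l, reverse=False):
    # Indices that would sort l (like np.argsort, with optional descending order)
    return sorted(range(len(l)), key=l.__getitem__, reverse=reverse)

def find_highest_similarity_scores(scores, n=3):
    # Sort the scores ascending and cluster the (rank, score) points into up to n groups
    s_idxs = list_sorted_args(scores)
    s = [scores[i] for i in s_idxs]
    s_len = len(s)
    s_range = range(s_len)
    
    # kmeans starts from random centroids, so the assignment can differ between runs
    kclust = kmeans(np.column_stack([s_range, s]).astype(float), n)
    # Assign each rank to the centroid whose rank coordinate is nearest
    assigned_clusters = [abs(kclust[0][:, 0] - e).argmin() for e in s_range]
    print(assigned_clusters)
    
    # Walk down from the best score and collect indices while they stay in the top cluster
    highest_cluster = assigned_clusters[-1]
    highest_idxs = []
    for i in range(s_len-1, -1, -1):
        if assigned_clusters[i] != highest_cluster:
            return highest_idxs
        highest_idxs.append(s_idxs[i])
    return highest_idxs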

#t = [0.12140948, 0.426371, 0.11862079, 0.44534147, 0.17006755, 0.55, 0.00, 0.00, 0.00, 0.00]
t = [0, 0.2, 0.5, 1]
[t[i] for i in find_highest_similarity_scores(t)]

[1, 1, 2, 0]


[1]
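
Since `kmeans` is randomly initialised, the selected group can vary between runs. A deterministic alternative, which is a different technique from the clustering above, is to split the sorted scores at their largest gap and keep everything above it. The sketch below assumes that heuristic is acceptable; `top_group_by_largest_gap` is a hypothetical name.

```python
def top_group_by_largest_gap(scores):
    """Return indices of the scores above the largest gap in the sorted scores."""
    order = sorted(range(len(scores)), key=scores.__getitem__)  # ascending
    if len(order) < 2:
        return order
    # Gap between each pair of neighbouring sorted scores
    gaps = [scores[order[i + 1]] - scores[order[i]] for i in range(len(order) - 1)]
    split = max(range(len(gaps)), key=gaps.__getitem__)  # first largest gap
    # Everything above the split, best score first
    return order[split + 1:][::-1]

t = [0, 0.2, 0.5, 1]
print([t[i] for i in top_group_by_largest_gap(t)])  # [1]
```

On this input the result matches the `[1]` shown above, although the two methods will not agree in general.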

In [15]:
topics = []
keywords = []
c_keywords = []
t_keywords = []
prev_msg = control_msg = "Hey! How are you doing?"
prev_keywords = []

# Keeps track of whether the topic changed in the previous turn
topic_changed = False
cur_topic_desc_keywords = []
while True:
    msg = input('\nUser: ')
    t1 = time.time()
    if msg == 'exit':
        break
    c_keywords = tag_sentence(msg)
    
    if has_topic_changed(msg, prev_msg, control_msg):
        keywords = c_keywords
        print(f"topic has changed to {keywords}")
        topic_changed = True
    else:
        if topic_changed:
            topic_changed = False

            # Compare the current sentence against the keywords in the previous turn and keep the most relevant ones
            # until the topic changes
            if len(keywords) > 3:
                scores = calc_sentence_similarity(msg, keywords)
                cur_topic_desc_keywords = [keywords[i] for i in find_highest_similarity_scores(scores)]
            else:
                cur_topic_desc_keywords = keywords
                keywords = []
            t_keywords = list(set(c_keywords) | set(cur_topic_desc_keywords) | set(keywords))
        else:
            # It's been >2 turns since the topic changed
            t_keywords = list(set(c_keywords) | set(cur_topic_desc_keywords) | set(keywords))
            if len(t_keywords) > 0:
                s_scores = calc_sentence_similarity(msg, t_keywords)
                print(f"scores: {s_scores}")
                s_idxs = list_sorted_args(s_scores, reverse=True)
                print(f"s_idxs: {s_idxs}")
                t_keywords = [t_keywords[i] for i in s_idxs]
        
        if keywords is None:
            keywords = []
        print(f"topic has not changed, keywords: {t_keywords}")
        keywords = c_keywords
            
    print(f"time elapsed: {time.time() - t1}")
    prev_msg = msg


User: I love football
topic_change_dist: [0.14568186 0.14568186]
topic has changed to ['football']
time elapsed: 0.33107709884643555

User: My favorite team is Brazil
topic_change_dist: [0.52336955 0.07332079]
topic has not changed, keywords: ['Brazil', 'team', 'football']
time elapsed: 0.03590273857116699

User: They won the world cup in 1970
topic_change_dist: [0.40615582 0.01180051]
scores: [0.36334264 0.11462341 0.19479725 0.18227434 0.21584925 0.54440093]
s_idxs: [5, 0, 4, 2, 3, 1]
topic has not changed, keywords: ['1970', 'Brazil', 'football', 'team', 'cup', 'world']
time elapsed: 0.0718085765838623

User: their best player is neymar
topic_change_dist: [0.3478805  0.05280136]
scores: [0.19667219 0.11755693 0.24733979 0.29790506 0.04338944]
s_idxs: [3, 2, 0, 1, 4]
topic has not changed, keywords: ['player', 'football', 'cup', 'world', '1970']
time elapsed: 0.35477280616760254


KeyboardInterrupt: Interrupted by user
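
The ranking step in the run above can be reproduced from the logged numbers alone. The sketch below uses the scores printed for "They won the world cup in 1970". The keyword order before ranking is not logged, so it is reconstructed here from the printed `s_idxs` and should be read as an assumption; `rank_keywords` is an illustrative helper.

```python
def rank_keywords(keywords, scores):
    """Order keywords by descending similarity score."""
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return [keywords[i] for i in order]

# Scores as logged; the pre-ranking keyword order is reconstructed from s_idxs
keywords = ['Brazil', 'world', 'team', 'cup', 'football', '1970']
scores = [0.36334264, 0.11462341, 0.19479725, 0.18227434, 0.21584925, 0.54440093]
print(rank_keywords(keywords, scores))
# ['1970', 'Brazil', 'football', 'team', 'cup', 'world']
```

Each score compares a single keyword against the whole message, so literal words from the latest turn (here `1970`) tend to rank first.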

### Backup: earlier version of the loop

In [36]:
topics = []
keywords = []
c_keywords = []
prev_msg = control_msg = "Hey! How are you doing?"
prev_keywords = []

# Keeps track of whether the topic changed in the previous turn
topic_changed = False
cur_topic_desc_keywords = []
while True:
    msg = input('User: ')
    t1 = time.time()
    if msg == 'exit':
        break
    c_keywords = tag_sentence(msg)
    
    if has_topic_changed(msg, prev_msg, control_msg):
        keywords = c_keywords
        print(f"topic has changed to {keywords}")
        topic_changed = True
    else:
        if topic_changed:
            topic_changed = False

            # Compare the current sentence against the keywords in the previous turn and keep the most relevant ones
            # until the topic changes
            if len(keywords) > 3:
                scores = calc_sentence_similarity(msg, keywords)
                cur_topic_desc_keywords = [keywords[i] for i in find_highest_similarity_scores(scores)]
            else:
                cur_topic_desc_keywords = keywords
                keywords = None
        #print("um", keywords)
        if keywords is None:
            keywords = []
        print(f"topic has not changed, keywords: {c_keywords + cur_topic_desc_keywords + keywords}")
        keywords = c_keywords
            
    print(f"time elapsed: {time.time() - t1}")
    prev_msg = msg

User: Hi there! What's up?
topic has not changed, keywords: []
time elapsed: 0.3567070960998535
User: Nothing much, what about you?
topic has not changed, keywords: ['Nothing']
time elapsed: 0.03793525695800781
User: Let's just say it's a great day to play football
topic has changed to ['day', 'football']
time elapsed: 0.041112422943115234
User: Did you watch the match yesterday?
topic has not changed, keywords: ['Did', 'match', 'yesterday', 'day', 'football']
time elapsed: 0.4401993751525879
User: I did! What a game, Barcelona is back!
topic has not changed, keywords: ['game', 'Barcelona', 'day', 'football', 'Did', 'match', 'yesterday']
time elapsed: 0.33307862281799316
User: Still not a match for Madrid though
topic has not changed, keywords: ['match', 'Madrid', 'day', 'football', 'game', 'Barcelona']
time elapsed: 0.3435804843902588
User: Cricket is going great too, I went to watch one of the Australia vs Pakistan matches here in Rawalpindi
topic has changed to ['Cricket', 'one', 'A

KeyboardInterrupt: Interrupted by user