# Topic modeling
Goal: find the topics present in a corpus.
Every document is a mixture of topics, and every topic is a mixture of words.
-> To infer the topics of a document, we can look at the content words it contains.

## Latent Dirichlet Allocation (LDA)
"To use a topic modeling technique, you need to provide (1) a document-term matrix and (2) the number of topics you would like the algorithm to pick up"

ref: [https://github.com/adashofdata/nlp-in-python-tutorial/blob/master/4-Topic-Modeling.ipynb]
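
The two inputs named in the quote can be illustrated on a toy corpus. The sketch below builds a small document-term matrix by hand with `collections.Counter`. The three sentences and all variable names are made up for illustration; the notebook itself loads a pickled matrix that was built elsewhere.

```python
from collections import Counter

# Hypothetical toy corpus (not taken from the transcripts used below)
docs = {
    "doc_a": "the cat chased the dog",
    "doc_b": "mummy made tea and cake",
    "doc_c": "the dog ate the cake",
}

# Vocabulary: every distinct word, sorted so the column order is stable
vocab = sorted({word for text in docs.values() for word in text.split()})

# Document-term matrix: one row per document, one count per vocabulary word
dtm = {}
for name, text in docs.items():
    counts = Counter(text.split())
    dtm[name] = [counts[word] for word in vocab]

num_topics = 2  # the second input: chosen by the analyst, not learned

print(vocab)
for name, row in dtm.items():
    print(name, row)
```

Each row is one document and each column one word, which is the same orientation as the `data` frame loaded in the first cell.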

## Topic Modeling - Attempt #1 (All Text)

In [1]:
# Let's read in our document-term matrix
import pandas as pd
import pickle

data = pd.read_pickle('dtm_stop_to.pkl')
data

Unnamed: 0,able,abracadabra,accident,ache,aches,acorn,acrobat,actually,adam,add,...,yum,yummilicious,yummy,zebra,zoo,zoom,zooming,zoos,zzz,æta
AJC1.xml,0,0,0,0,0,0,0,1,2,0,...,6,0,2,0,0,0,1,0,0,0
AJC2.xml,1,0,0,0,0,0,0,0,1,0,...,4,0,0,0,4,0,0,0,0,0
AVW1.xml,0,0,0,0,0,0,1,1,0,0,...,2,0,0,0,0,0,0,0,0,0
AVW2.xml,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
BDO1.xml,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
BDO2.xml,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
CDH1.xml,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
CMC1.xml,0,0,0,17,1,0,0,0,0,1,...,0,0,1,0,4,0,0,0,0,0
CMC2.xml,0,0,0,0,0,0,0,0,0,0,...,2,0,0,0,2,0,0,0,0,0
ECB1.xml,2,0,0,0,0,0,0,0,0,0,...,8,0,0,0,0,0,0,0,0,0


In [2]:
# Import the necessary modules for LDA with gensim
# Terminal / Anaconda Navigator: conda install -c conda-forge gensim
from gensim import matutils, models
import scipy.sparse

In [3]:
# One of the required inputs is a term-document matrix
tdm = data.transpose()
tdm.head()

Unnamed: 0,AJC1.xml,AJC2.xml,AVW1.xml,AVW2.xml,BDO1.xml,BDO2.xml,CDH1.xml,CMC1.xml,CMC2.xml,ECB1.xml,...,OMS1.xml,OMS2.xml,RMB1.xml,RMB2.xml,SAU1.xml,SAU2.xml,SC1.xml,SC2.xml,WJAW1.xml,WJAW2.xml
able,0,1,0,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0
abracadabra,0,0,0,0,0,0,0,0,0,0,...,2,0,0,0,0,0,0,0,0,0
accident,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ache,0,0,0,0,0,0,0,17,0,0,...,0,0,0,0,0,2,0,0,0,0
aches,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
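
As far as I understand `Sparse2Corpus`, with its default settings it treats each column of the sparse matrix as one document and yields a list of `(term_id, count)` pairs for the non-zero entries, which is why the document-term matrix is transposed first. The pure-Python sketch below mimics that conversion on a tiny made-up matrix. It illustrates the output format only and is not gensim's implementation.

```python
# Hypothetical term-document matrix: rows are terms, columns are documents
terms = ["cat", "dog", "tea"]
tdm_toy = [
    [2, 0, 1],  # "cat" counts in documents 0, 1, 2
    [0, 3, 0],  # "dog"
    [1, 0, 0],  # "tea"
]

def columns_to_bow(matrix):
    """Turn each column into a list of (term_id, count) pairs, skipping zeros."""
    n_docs = len(matrix[0])
    return [
        [(term_id, row[doc]) for term_id, row in enumerate(matrix) if row[doc] != 0]
        for doc in range(n_docs)
    ]

bow_corpus = columns_to_bow(tdm_toy)
for doc_id, bow in enumerate(bow_corpus):
    print(doc_id, bow)
```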

In [7]:
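# Gensim also requires a dictionary of all the terms and their respective locations in the term-document matrix
# cv is the fitted CountVectorizer; its vocabulary_ maps term -> column index, so we invert it
with open("cv_stop_to.pkl", "rb") as f:
    cv = pickle.load(f)
id2word = dict((v, k) for k, v in cv.vocabulary_.items())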
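
The inversion matters because `vocabulary_` on a fitted `CountVectorizer` maps each term to its column index, while `id2word` is used in the opposite direction (index to term). A minimal sketch with a hypothetical three-word vocabulary:

```python
# Hypothetical stand-in for cv.vocabulary_ (term -> column index)
vocabulary = {"cat": 0, "dog": 1, "tea": 2}

# Invert it to column index -> term, the direction id2word is used in
id2word_toy = {index: term for term, index in vocabulary.items()}

# The inversion is only lossless if every index is unique
assert len(id2word_toy) == len(vocabulary)

print(id2word_toy)
```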

In [8]:
# Now that we have the corpus (term-document matrix) and id2word (dictionary of location: term),
# we need to specify two other parameters as well - the number of topics and the number of passes
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=2, passes=10)
lda.print_topics()

[(0,
  '0.009*"does" + 0.009*"wanna" + 0.009*"erm" + 0.008*"lets" + 0.008*"little" + 0.008*"come" + 0.008*"wheres" + 0.008*"good" + 0.007*"right" + 0.007*"eat"'),
 (1,
  '0.013*"house" + 0.009*"come" + 0.009*"hes" + 0.009*"eat" + 0.009*"shall" + 0.008*"just" + 0.008*"cat" + 0.008*"wanna" + 0.007*"does" + 0.007*"lets"')]

In [9]:
# LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=80)
lda.print_topics()

[(0,
  '0.012*"right" + 0.010*"does" + 0.010*"mummy" + 0.010*"wanna" + 0.010*"come" + 0.009*"just" + 0.008*"hes" + 0.008*"house" + 0.008*"erm" + 0.007*"cat"'),
 (1,
  '0.013*"erm" + 0.012*"shall" + 0.011*"eat" + 0.010*"house" + 0.010*"little" + 0.008*"ice" + 0.008*"thankyou" + 0.008*"lets" + 0.008*"come" + 0.008*"tea"'),
 (2,
  '0.010*"wheres" + 0.009*"lets" + 0.009*"wanna" + 0.009*"house" + 0.009*"does" + 0.009*"just" + 0.008*"mom" + 0.008*"um" + 0.008*"come" + 0.007*"mum"')]

In [10]:
# LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.016*"house" + 0.013*"shall" + 0.009*"come" + 0.009*"little" + 0.009*"tea" + 0.009*"wanna" + 0.009*"wheres" + 0.008*"hes" + 0.008*"mummy" + 0.007*"eat"'),
 (1,
  '0.012*"does" + 0.009*"wanna" + 0.009*"erm" + 0.008*"right" + 0.008*"eat" + 0.008*"good" + 0.008*"lets" + 0.007*"bed" + 0.007*"little" + 0.007*"hes"'),
 (2,
  '0.012*"dog" + 0.011*"la" + 0.008*"house" + 0.008*"hafta" + 0.008*"lets" + 0.008*"just" + 0.007*"mummy" + 0.007*"little" + 0.007*"eat" + 0.007*"erm"'),
 (3,
  '0.010*"come" + 0.010*"house" + 0.009*"wanna" + 0.009*"lets" + 0.008*"does" + 0.008*"eat" + 0.008*"wheres" + 0.008*"just" + 0.008*"tea" + 0.008*"shall"')]

## Topic Modeling - Attempt #2 (Nouns Only)

In [11]:
# Let's create a function to pull out nouns from a string of text
from nltk import word_tokenize, pos_tag

def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)] 
    return ' '.join(all_nouns)
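
The filter works on the tag prefix: in the Penn Treebank tag set the noun tags (`NN`, `NNS`, `NNP`, `NNPS`) all start with `NN`, and the adjective tags start with `JJ`. The sketch below applies the same prefix test to hand-tagged tokens so that it runs without NLTK's tokenizer and tagger models. The tags were written by hand for illustration and are not real `pos_tag` output; `keep_by_prefix` is a hypothetical helper.

```python
# Hand-written (word, tag) pairs using Penn Treebank tags; NOT real pos_tag output
tagged = [
    ("look", "VB"), ("a", "DT"), ("little", "JJ"), ("house", "NN"),
    ("with", "IN"), ("two", "CD"), ("cats", "NNS"), ("inside", "RB"),
]

def keep_by_prefix(tagged_tokens, prefixes):
    """Keep words whose tag starts with any of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(tuple(prefixes))]

nouns_only = keep_by_prefix(tagged, ["NN"])
nouns_and_adjectives = keep_by_prefix(tagged, ["NN", "JJ"])

print(" ".join(nouns_only))
print(" ".join(nouns_and_adjectives))
```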

In [17]:
# Read in the cleaned data, before the CountVectorizer step
data_clean = pd.read_pickle('data_clean_to.pkl')
data_clean

Unnamed: 0,transcript
AJC1.xml,so you know it should be fine i could leave hi...
AJC2.xml,drove to c if you do needta go out dont bother...
AVW1.xml,lets see what we can see a bit lefti lighter i...
AVW2.xml,when did you play with them was it yesterday l...
BDO1.xml,look a doctor chair mum is that a doctor chair...
BDO2.xml,is he in there again in his little house yeah ...
CDH1.xml,yeah there we are thankyou okay yes have fun t...
CMC1.xml,it really doesnt matter i cant stress that eno...
CMC2.xml,pardon yeah why it doesnt work batterys gone j...
ECB1.xml,oh look at this oh look theres a fire station ...


In [18]:
# Apply the nouns function to the transcripts to filter only on nouns
data_nouns = pd.DataFrame(data_clean.transcript.apply(nouns))
data_nouns

Unnamed: 0,transcript
AJC1.xml,i hour thats matter something power rangers th...
AJC2.xml,drove coffee morning i yeah orange orange bowl...
AVW1.xml,lets bit lighter bit bit inside i toilet knick...
AVW2.xml,yesterday yesterday toys milk yours milk mind ...
BDO1.xml,doctor chair mum doctor chair yeah shoes door ...
BDO2.xml,house yeah i wonder yeah mummy house monster h...
CDH1.xml,yes see bit bear water thankyou yes cats home ...
CMC1.xml,matter i stress helpful point something past r...
CMC2.xml,pardon work batterys yeah hello hey i meow meo...
ECB1.xml,look look fire station theyd car foot door let...


In [56]:
# Create a new document-term matrix using only nouns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Re-add the additional stop words since we are recreating the document-term matrix
add_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                  'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said', 'yyy', 'xxx', 'www',
                  'ahhah', 'ah', 'yyys', 'mhm', 'um', 'okay', 'mm', 'oh', 'wow', 'ooh']
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

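# Recreate a document-term matrix with only nouns
# (recent scikit-learn versions expect stop_words as a list and provide get_feature_names_out();
#  on versions before 1.0, call get_feature_names() instead)
cvn = CountVectorizer(stop_words=list(stop_words))
data_cvn = cvn.fit_transform(data_nouns.transcript)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())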
data_dtmn.index = data_nouns.index
data_dtmn

Unnamed: 0,abracadabra,accident,ache,acorn,acrobat,adja,adjalash,adjalass,adjas,adult,...,youll,youve,yuck,yucky,yum,yummy,zebra,zoo,zoos,zzz
AJC1.xml,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,6,0,0,0,0,0
AJC2.xml,0,0,0,0,0,0,0,0,0,0,...,1,3,0,0,0,0,0,4,0,0
AVW1.xml,0,0,0,0,1,1,1,1,1,0,...,3,4,0,0,2,0,0,0,0,0
AVW2.xml,0,0,0,0,0,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0
BDO1.xml,0,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
BDO2.xml,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
CDH1.xml,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
CMC1.xml,0,0,4,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,4,0,0
CMC2.xml,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,2,0,0,2,0,0
ECB1.xml,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,4,0,0,0,0,0


In [57]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [58]:
# Let's start with 2 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=2, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.021*"look" + 0.020*"house" + 0.018*"whats" + 0.012*"plate" + 0.012*"tea" + 0.010*"lets" + 0.010*"bit" + 0.009*"cream" + 0.009*"ice" + 0.009*"dog"'),
 (1,
  '0.019*"look" + 0.017*"whats" + 0.011*"way" + 0.011*"house" + 0.011*"car" + 0.010*"bit" + 0.010*"hes" + 0.009*"cat" + 0.009*"mom" + 0.009*"lets"')]

In [59]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.023*"whats" + 0.022*"look" + 0.015*"house" + 0.015*"cat" + 0.014*"hes" + 0.012*"car" + 0.009*"lets" + 0.008*"cow" + 0.008*"mom" + 0.007*"helicopter"'),
 (1,
  '0.020*"look" + 0.019*"house" + 0.017*"whats" + 0.014*"plate" + 0.013*"tea" + 0.013*"ice" + 0.012*"cream" + 0.010*"mummy" + 0.009*"bit" + 0.009*"cup"'),
 (2,
  '0.020*"look" + 0.017*"rabbit" + 0.015*"lets" + 0.014*"house" + 0.014*"bit" + 0.013*"whats" + 0.009*"hello" + 0.009*"way" + 0.008*"yes" + 0.008*"car"')]

In [60]:
# Let's try 4 topics
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.031*"house" + 0.016*"lets" + 0.012*"whats" + 0.011*"tea" + 0.011*"bit" + 0.011*"cake" + 0.010*"yum" + 0.010*"car" + 0.009*"bed" + 0.009*"look"'),
 (1,
  '0.022*"look" + 0.015*"whats" + 0.013*"house" + 0.011*"lets" + 0.010*"dog" + 0.010*"rabbit" + 0.009*"car" + 0.009*"ice" + 0.009*"bit" + 0.008*"cream"'),
 (2,
  '0.024*"look" + 0.022*"whats" + 0.018*"house" + 0.015*"plate" + 0.014*"cat" + 0.013*"tea" + 0.012*"cup" + 0.011*"hes" + 0.008*"farm" + 0.008*"box"'),
 (3,
  '0.020*"whats" + 0.017*"look" + 0.016*"house" + 0.016*"bit" + 0.014*"cream" + 0.014*"ice" + 0.014*"plate" + 0.011*"round" + 0.011*"mummy" + 0.010*"car"')]

## Topic Modeling - Attempt #3 (Nouns and Adjectives)

In [61]:
# Let's create a function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    selected = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(selected)

In [62]:
# Apply the nouns_adj function to the transcripts to keep only nouns and adjectives
data_nouns_adj = pd.DataFrame(data_clean.transcript.apply(nouns_adj))
data_nouns_adj

Unnamed: 0,transcript
AJC1.xml,fine i hour thats matter something i power ran...
AJC2.xml,drove thankyou ive alotof coffee morning i tea...
AVW1.xml,lets bit lefti lighter bit lighter bit lighter...
AVW2.xml,yesterday yesterday toys milk yours hot milk m...
BDO1.xml,doctor chair mum doctor chair yeah shoes door ...
BDO2.xml,little house yeah i i wonder yeah mummy ohdear...
CDH1.xml,thankyou yes thankyou much see bit okay teddy ...
CMC1.xml,doesnt matter i stress helpful point something...
CMC2.xml,pardon work batterys yeah hello hey i meow whi...
ECB1.xml,oh look oh look fire station theyd able car fo...


In [63]:
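# Create a new document-term matrix using only nouns and adjectives;
# max_df=.8 also drops common words that appear in more than 80% of the transcripts
# (recent scikit-learn versions expect stop_words as a list and provide get_feature_names_out();
#  on versions before 1.0, call get_feature_names() instead)
cvna = CountVectorizer(stop_words=list(stop_words), max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.transcript)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())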
data_dtmna.index = data_nouns_adj.index
data_dtmna

Unnamed: 0,able,abracadabra,accident,ache,acorn,acrobat,adam,adja,adjalash,adjalass,...,youve,yuck,yucky,yum,yummilicious,yummy,zebra,zoo,zoos,zzz
AJC1.xml,0,0,0,0,0,0,1,0,0,0,...,1,1,0,6,0,2,0,0,0,0
AJC2.xml,1,0,0,0,0,0,0,0,0,0,...,3,0,0,1,0,0,0,4,0,0
AVW1.xml,0,0,0,0,0,1,0,1,1,1,...,5,0,0,2,0,0,0,0,0,0
AVW2.xml,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
BDO1.xml,0,0,0,0,0,0,0,0,0,0,...,1,1,1,0,0,0,0,0,0,0
BDO2.xml,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
CDH1.xml,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
CMC1.xml,0,0,0,9,0,0,0,0,0,0,...,1,0,0,0,0,1,0,4,0,0
CMC2.xml,0,0,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,2,0,0
ECB1.xml,2,0,0,0,0,0,0,0,0,0,...,0,0,0,6,0,0,0,0,0,0
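
What `max_df=.8` does can be sketched by hand: compute each term's document frequency (the share of documents that contain it) and drop the terms above the threshold. The toy documents and variable names below are hypothetical and only illustrate the rule. This is also a plausible reason why words such as "look", "whats" and "house", which dominate the nouns-only topics, no longer appear in the topics of this attempt.

```python
# Hypothetical toy documents, already tokenized
docs_toy = [
    ["look", "house", "cat"],
    ["look", "tea", "cake"],
    ["look", "car", "house"],
    ["look", "dog"],
    ["look", "rabbit", "house"],
]

max_df = 0.8  # same threshold as in the cell above

# Document frequency: the share of documents that contain each term
vocab_toy = sorted({word for doc in docs_toy for word in doc})
doc_freq = {
    word: sum(word in doc for doc in docs_toy) / len(docs_toy)
    for word in vocab_toy
}

# Keep terms at or below the threshold; terms above it are treated as too common
kept = [word for word in vocab_toy if doc_freq[word] <= max_df]
dropped = [word for word in vocab_toy if doc_freq[word] > max_df]

print("dropped:", dropped)
print("kept:", kept)
```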


In [64]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [65]:
# Let's start with 2 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.011*"erm" + 0.010*"ice" + 0.009*"mummy" + 0.008*"car" + 0.008*"lets" + 0.008*"cream" + 0.007*"way" + 0.007*"bed" + 0.007*"monkey" + 0.007*"round"'),
 (1,
  '0.010*"cat" + 0.009*"lets" + 0.009*"plate" + 0.009*"cup" + 0.009*"grapes" + 0.008*"cream" + 0.008*"ice" + 0.008*"dog" + 0.008*"cake" + 0.007*"car"')]

In [66]:
# Let's try 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.011*"plate" + 0.011*"cup" + 0.011*"mummy" + 0.009*"lets" + 0.008*"grapes" + 0.008*"cake" + 0.008*"ice" + 0.008*"erm" + 0.007*"cream" + 0.007*"door"'),
 (1,
  '0.023*"cat" + 0.012*"cow" + 0.011*"gon" + 0.010*"car" + 0.010*"box" + 0.010*"farm" + 0.008*"ice" + 0.007*"round" + 0.007*"way" + 0.007*"cream"'),
 (2,
  '0.011*"ice" + 0.011*"lets" + 0.010*"cream" + 0.009*"erm" + 0.009*"way" + 0.008*"car" + 0.008*"mom" + 0.008*"rabbit" + 0.008*"dog" + 0.008*"ive"')]

In [67]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.014*"car" + 0.011*"cat" + 0.010*"lets" + 0.009*"way" + 0.009*"dog" + 0.008*"ice" + 0.008*"mummy" + 0.007*"road" + 0.007*"ive" + 0.007*"ones"'),
 (1,
  '0.013*"lets" + 0.012*"mummy" + 0.011*"mom" + 0.009*"car" + 0.008*"bed" + 0.008*"rabbit" + 0.007*"mum" + 0.007*"cat" + 0.007*"way" + 0.007*"monkey"'),
 (2,
  '0.014*"hello" + 0.014*"dog" + 0.014*"rabbit" + 0.012*"plate" + 0.012*"cat" + 0.011*"mum" + 0.011*"cake" + 0.010*"erm" + 0.008*"food" + 0.008*"cup"'),
 (3,
  '0.017*"ice" + 0.015*"cream" + 0.012*"erm" + 0.011*"plate" + 0.008*"lets" + 0.008*"grapes" + 0.007*"yum" + 0.007*"mummy" + 0.007*"round" + 0.007*"bed"')]

## Identify Topics in Each Document
Of the 9 topic models built above, the 2-topic model on nouns and adjectives gives the most interpretable topics.

In [70]:
# Our final LDA model (for now)
ldana = models.LdaModel(corpus=corpusna, num_topics=2, id2word=id2wordna, passes=80)
ldana.print_topics()

[(0,
  '0.011*"cat" + 0.010*"ice" + 0.010*"erm" + 0.009*"rabbit" + 0.009*"cream" + 0.009*"dog" + 0.009*"lets" + 0.008*"cow" + 0.007*"plate" + 0.007*"way"'),
 (1,
  '0.010*"mummy" + 0.010*"car" + 0.009*"lets" + 0.009*"ice" + 0.009*"plate" + 0.007*"cream" + 0.007*"mom" + 0.007*"cup" + 0.006*"cake" + 0.006*"daddy"')]

Topic 0: animals
Topic 1: parents
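
One way to sanity-check these labels is to separate the words the two topics share from the words that are distinctive to each. The sketch below parses the two topic strings copied from the output above; `top_words` is a hypothetical helper, not part of gensim.

```python
import re

# Topic strings copied from the print_topics() output above
topics = {
    0: '0.011*"cat" + 0.010*"ice" + 0.010*"erm" + 0.009*"rabbit" + 0.009*"cream" + '
       '0.009*"dog" + 0.009*"lets" + 0.008*"cow" + 0.007*"plate" + 0.007*"way"',
    1: '0.010*"mummy" + 0.010*"car" + 0.009*"lets" + 0.009*"ice" + 0.009*"plate" + '
       '0.007*"cream" + 0.007*"mom" + 0.007*"cup" + 0.006*"cake" + 0.006*"daddy"',
}

def top_words(topic_string):
    """Extract the quoted words from a print_topics() string, in order."""
    return re.findall(r'"([^"]+)"', topic_string)

words = {topic_id: top_words(s) for topic_id, s in topics.items()}
shared = set(words[0]) & set(words[1])

for topic_id, topic_words in words.items():
    distinctive = [w for w in topic_words if w not in shared]
    print(topic_id, distinctive)
print("shared:", sorted(shared))
```

The distinctive words for topic 0 are mostly animals (cat, rabbit, dog, cow), and those for topic 1 include mummy, mom and daddy, while "ice", "cream", "lets" and "plate" appear in both.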

In [79]:
data_dtmna[5:10]

Unnamed: 0,able,abracadabra,accident,ache,acorn,acrobat,adam,adja,adjalash,adjalass,...,youve,yuck,yucky,yum,yummilicious,yummy,zebra,zoo,zoos,zzz
BDO2.xml,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
CDH1.xml,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
CMC1.xml,0,0,0,9,0,0,0,0,0,0,...,1,0,0,0,0,1,0,4,0,0
CMC2.xml,0,0,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,2,0,0
ECB1.xml,2,0,0,0,0,0,0,0,0,0,...,0,0,0,6,0,0,0,0,0,0


In [96]:
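# Let's take a look at which topics each transcript contains
# Note: the pattern [(a,b)] below assumes exactly one (topic, probability) pair per document,
# so a document that is split across two topics raises the ValueError shown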

corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna[:5].transpose()))

corpus_transformed = ldana[corpusna]
list(zip([a for [(a,b)] in corpus_transformed], data_dtmna.index))

ValueError: too many values to unpack (expected 1)

In [97]:
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))
corpus_transformed = ldana[corpusna]
list(corpus_transformed)

[[(1, 0.999146)],
 [(1, 0.9990987)],
 [(0, 0.99879956)],
 [(0, 0.99864256)],
 [(0, 0.22984977), (1, 0.77015024)],
 [(0, 0.99797976)],
 [(0, 0.9987211)],
 [(1, 0.9987976)],
 [(1, 0.998828)],
 [(1, 0.99795127)],
 [(0, 0.023704091), (1, 0.9762959)],
 [(0, 0.9977507)],
 [(0, 0.99394137)],
 [(0, 0.7724488), (1, 0.22755127)],
 [(1, 0.99770564)],
 [(1, 0.99786925)],
 [(0, 0.9800664), (1, 0.019933563)],
 [(1, 0.99723023)],
 [(1, 0.9981713)],
 [(0, 0.9960613)],
 [(0, 0.9976249)],
 [(0, 0.9980566)],
 [(0, 0.9980165)],
 [(0, 0.9982986)],
 [(0, 0.99372655)],
 [(0, 0.9934612)],
 [(0, 0.9983932)],
 [(0, 0.99880683)],
 [(1, 0.9983617)],
 [(1, 0.9982095)],
 [(1, 0.99559224)],
 [(0, 0.9848155), (1, 0.015184549)],
 [(0, 0.86083734), (1, 0.13916264)],
 [(0, 0.9918299)],
 [(1, 0.99611884)],
 [(1, 0.9926781)],
 [(0, 0.9931518)],
 [(0, 0.9987147)],
 [(0, 0.89102805), (1, 0.10897197)],
 [(1, 0.9957)],
 [(0, 0.99617636)],
 [(0, 0.19408727), (1, 0.8059127)],
 [(1, 0.9980905)],
 [(1, 0.9975987)]]
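
The list above has one entry per transcript, and some entries hold two `(topic, probability)` pairs, which is why unpacking a single pair per document failed earlier. A minimal sketch of picking the most probable topic per document, whatever the number of pairs, follows. The five rows are copied from the output above; pairing them with these file names assumes the rows follow the index order of `data_dtmna`, and `dominant_topic` is a hypothetical helper.

```python
# First five rows copied from the output above; pairing them with these file
# names assumes the rows follow the index order of data_dtmna
doc_topics = [
    ("AJC1.xml", [(1, 0.999146)]),
    ("AJC2.xml", [(1, 0.9990987)]),
    ("AVW1.xml", [(0, 0.99879956)]),
    ("AVW2.xml", [(0, 0.99864256)]),
    ("BDO1.xml", [(0, 0.22984977), (1, 0.77015024)]),
]

def dominant_topic(pairs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(pairs, key=lambda pair: pair[1])

assignments = [(name, dominant_topic(pairs)[0]) for name, pairs in doc_topics]
print(assignments)
```

In the notebook, the same helper could be applied to every document with `[dominant_topic(doc) for doc in corpus_transformed]`. Most documents show a single pair, probably because gensim omits topics whose probability falls below a minimum threshold.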

This data was gathered in a toy-playing situation.
We can see that the children mentioned many objects, animals, and parents.
