<font color = green >

# Text classification: topic modeling 

</font>

<font color = green >

## Latent Dirichlet allocation (LDA)

</font>

LDA is typically used to detect the underlying topics in a collection of text documents.

**Input**: text documents and the number of topics 
<br>
**Output**: a distribution of topics for each document (which allows assigning the topic with the highest probability) and a distribution of words for each topic 

**Assumptions**:
- Documents with similar topics use similar groups of words 
- Documents are probability distributions over latent topics 
- Topics are probability distributions over words


<font color = green >

### Generative process

</font>

LDA assumes that every document is created in the following way:

1) Define the number of words in the document
<br>
2) Choose the topic mixture over the fixed set of topics (e.g. 20% of topic 'Financial', 30% of topic 'Computer Science', and 50% of topic 'Sport')
<br>
3) Generate each word by:
<br>
   - picking a topic based on the document's multinomial distribution over topics 
<br>
   - picking a word based on that topic's multinomial distribution over words 
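
The three steps above can be sketched in a few lines of NumPy. This is only an illustration: the vocabulary, the topic names and the word probabilities in `topic_word` are made-up assumptions, and the topic mixture is fixed by hand (in full LDA the mixture itself is drawn from a Dirichlet prior).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 3 topics, each a distribution over a 6-word vocabulary.
vocab = ["bank", "stock", "code", "python", "goal", "match"]
topic_names = ["Financial", "Computer Science", "Sport"]
topic_word = np.array([
    [0.45, 0.45, 0.03, 0.03, 0.02, 0.02],  # Financial
    [0.03, 0.03, 0.45, 0.45, 0.02, 0.02],  # Computer Science
    [0.02, 0.02, 0.03, 0.03, 0.45, 0.45],  # Sport
])

def generate_document(n_words, topic_mixture):
    """Generate one document following the three steps of the generative process."""
    words = []
    for _ in range(n_words):
        # 3a) pick a topic from the document's distribution over topics
        t = rng.choice(len(topic_mixture), p=topic_mixture)
        # 3b) pick a word from that topic's distribution over words
        w = rng.choice(len(vocab), p=topic_word[t])
        words.append(vocab[w])
    return words

# 1) number of words, 2) topic mixture (20% / 30% / 50% as in the example)
doc = generate_document(n_words=10, topic_mixture=[0.2, 0.3, 0.5])
print(doc)
```

With this mixture, roughly half of the generated words should come from the 'Sport' topic, although a short document can deviate noticeably.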



<font color = green >

#### Recall
</font>


#### Binomial distribution

$$p(k \mid n)\quad =\quad C^{ k }_{ n }\cdot p^{ k }(1-p)^{ n-k }\quad =\quad \frac { n! }{ k!(n-k)! } p^{ k }(1-p)^{ n-k }$$

Example: probability of 6 heads in 10 flips of a fair coin: 
$$p(6 \mid 10)\quad =\quad C^{ 6 }_{ 10 }\cdot {0.5}^{ 6 }(0.5)^{ 4 }\quad = 210 \cdot 0.015625 \cdot 0.0625 = 0.205078125$$
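
The coin example can be checked numerically. The helper name `binomial_pmf` is just an illustrative choice; `math.comb` is the standard-library binomial coefficient.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# 6 heads out of 10 flips of a fair coin
prob = binomial_pmf(6, 10, 0.5)
print(prob)
```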


#### Multinomial distribution

$$p(n_{ 1 }, n_{ 2 }, \dots, n_{ k })\quad =\quad \frac { n! }{ n_{ 1 }!\,n_{ 2 }!\cdots n_{ k }! } p^{ n_{ 1 } }_{ 1 }p^{ n_{ 2 } }_{ 2 }\cdots p^{ n_{ k } }_{ k }$$

Example (three outcomes): <br>
n = 12 (12 games are played),<br>
n1 = 7 (number won by Player A),<br>
n2 = 2 (number won by Player B),<br>
n3 = 3 (the number drawn),<br>
p1 = 0.4 (probability Player A wins)<br>
p2 = 0.35(probability Player B wins)<br>
p3 = 0.25(probability of a draw)<br>
$$p(7,2,3)\quad =\quad \frac {12!}{ 7! \cdot 2! \cdot 3! }  \cdot 0.4^{7} \cdot 0.35^{2} \cdot 0.25^{3} \approx 0.0248$$
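
The same check for the three-outcome example, as a minimal sketch. The function name `multinomial_pmf` is an assumption made for illustration.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing the given outcome counts in sum(counts) trials."""
    coefficient = factorial(sum(counts))
    for c in counts:
        coefficient //= factorial(c)  # n! / (n1! * n2! * ... * nk!)
    return coefficient * prod(p**c for p, c in zip(probs, counts))

# 12 games: 7 wins for A, 2 wins for B, 3 draws
prob = multinomial_pmf([7, 2, 3], [0.4, 0.35, 0.25])
print(round(prob, 4))
```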




<font color = green >

### Maximum Likelihood Estimation
    
#### Simple sample
    
</font>

The observed data is the sample `Head Tail Head` (101).

Let's estimate the parameter `p`, the probability of flipping `Head`.

<!-- <img src = "MLE.jpg" height=500 width= 500 align="left"> -->


<br>

In [3]:
import numpy as np
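
For the sample `Head Tail Head` the likelihood is $L(p) = p \cdot (1-p) \cdot p = p^2(1-p)$. Setting the derivative $2p - 3p^2$ to zero gives $p = 2/3$, i.e. the share of heads in the sample. The sketch below finds the same maximum by a simple grid search; the grid resolution and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

flips = [1, 0, 1]  # Head Tail Head, encoded as 1 0 1

def likelihood(p, data):
    """Probability of the observed sequence for a given p = P(Head)."""
    heads = sum(data)
    tails = len(data) - heads
    return p**heads * (1 - p)**tails

# evaluate the likelihood on a grid of candidate values of p
grid = np.linspace(0, 1, 1001)
values = likelihood(grid, flips)

# the maximum likelihood estimate is the grid point with the largest likelihood
p_hat = grid[np.argmax(values)]
print(p_hat)
```

The printed value should be close to 2/3, up to the grid resolution.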

Now we have documents (instead of coin flips) and need to find the distributions (instead of `p` in the coin example) that maximize the likelihood of the data (all documents).

**Recall** 
<br> Known are the text documents and the number $K$ of topics 

**Target**:
<br>Among all possible topic distributions for all documents and all possible word distributions for topics, choose the ones that maximize the probability of all text documents.

**Note:** There is no practical way to enumerate all possible distributions, so an iterative procedure is used instead. 

**Approach**:
<br>
1) Randomly assign each word of each document to one of the $K$ topics 
<br>
2) Iterate the following process until convergence (the assignments of words to topics stop changing) 
<br>$\quad$ For each document $d$: 
<br>$\quad\quad\bullet$ For each word $w$ in $d$:
<br>$\quad\quad\quad$ - Assume that all topic assignments except the current one are correct
<br>$\quad\quad\quad$ - For every topic $t$ compute the score of the hypothesis that $w$ belongs to topic $t$:
<br>$\quad\quad\quad\quad\quad score(t) = p(t \mid d) \cdot p(w \mid t),$
<br>$\quad\quad\quad\quad p(t \mid d)$ is the proportion of words in $d$ assigned to $t$,
<br>$\quad\quad\quad\quad p(w \mid t)$ is the share of word $w$ among all words assigned to topic $t$.
<br>$\quad\quad\quad$ - Assign the word $w$ to the topic with the maximum score
<br>$\quad\quad\bullet$ Repeat for all $w$ in $d$
<br>$\quad$ Repeat for all $d$
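
The procedure above can be sketched in plain Python. This is a toy version under stated assumptions: it follows the "assign to the topic with the maximum score" rule literally, it adds a small smoothing term `eps` (not part of the description above) so that no score collapses to zero, and the function name and example documents are made up. Practical implementations typically sample the topic in proportion to the score rather than taking the maximum, and this is not how gensim trains its model.

```python
import random
from collections import Counter

def hard_assignment_lda(docs, K, max_iters=50, seed=0, eps=1e-2):
    """Toy version of the iterative procedure; `eps` is an assumed smoothing term."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # 1) randomly assign each word occurrence to one of K topics
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    doc_topic = [Counter(zd) for zd in z]        # words of d assigned to t
    topic_word = [Counter() for _ in range(K)]   # occurrences of w in t
    topic_total = [0] * K                        # all words assigned to t
    for doc, zd in zip(docs, z):
        for w, t in zip(doc, zd):
            topic_word[t][w] += 1
            topic_total[t] += 1

    # 2) iterate until the assignments stop changing
    for _ in range(max_iters):
        changed = False
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = z[d][i]
                # take the current word out, keep all other assignments fixed
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1

                def score(t):
                    p_t_d = (doc_topic[d][t] + eps) / (len(doc) - 1 + K * eps)
                    p_w_t = (topic_word[t][w] + eps) / (topic_total[t] + vocab_size * eps)
                    return p_t_d * p_w_t

                new = max(range(K), key=score)
                # put the word back under the best-scoring topic
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1
                z[d][i] = new
                changed = changed or new != old
        if not changed:
            break
    return z, topic_word

docs = [["goal", "match", "goal", "team"],
        ["stock", "bank", "stock", "market"],
        ["match", "team", "goal"],
        ["bank", "market", "stock"]]
assignments, topic_word = hard_assignment_lda(docs, K=2)
print(assignments)
```

`assignments` holds one topic id per word occurrence, and `topic_word` holds the per-topic word counts from which the word distribution of each topic can be read off.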

The result is a matrix of word distributions over topics.  

Note: 
- The computed topics are just word distributions, i.e. they still need to be summarized (named) somehow 
- The topic distribution of each document is computed from the words in the document and the topic assigned to each word 

In [5]:
pip install gensim 



<font color = green >

## Gensim LDA 

</font>



In [4]:
import pandas as pd 
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
from nltk.corpus import stopwords

import warnings
warnings.filterwarnings('ignore')

<font color = green >

### Define the text documents 

</font>



In [70]:
fn= 'voted-kaggle-dataset.csv'
df = pd.read_csv(fn)
print(len(df))
df.sample(3)

2150


Unnamed: 0,Title,Subtitle,Owner,Votes,Versions,Tags,Data Type,Size,License,Views,Download,Kernels,Topics,URL,Description
608,SherLock,A long-term smartphone sensor dataset with a h...,The BGU Cyber Security Research Center,18,"Version 1,2016-12-07",computer science,CSV,468 MB,Other,"6,767 views",516 downloads,5 kernels,,https://www.kaggle.com/BGU-CSRC/sherlock,What is the SherLock dataset?\nA long-term sma...
1609,Spoken Verbs,Classify simple audio commands,JohannesBuchner,4,"Version 3,2017-12-22|Version 2,2017-12-22|Vers...",languages\nacoustics\ncommunication\nhuman-com...,Other,349 MB,CC4,439 views,18 downloads,,,https://www.kaggle.com/jbuchner/spokenverbs,Context\nI want my computer to react to simple...
1298,Newspaper Endorsements of Presidential Candidates,Candidate endorsements and presidential electi...,WNYC,6,"Version 1,2017-02-04",news agencies\npolitics,CSV,19 KB,Other,992 views,79 downloads,2 kernels,0 topics,https://www.kaggle.com/wnyc/candidate-endorsem...,Content\nThis dataset includes presidential ca...


<font color = green >

### Tokenize, clean, and stem

</font>



In [73]:
en_stop = set(stopwords.words('english'))
p_stemmer = PorterStemmer()

def tokenize(df):
    texts = []
    for string in df:
        # tokenize document string
        raw = string
        tokens = word_tokenize(raw)

        # remove stop words from tokens
        tokens = [token for token in tokens if token not in en_stop]

        # stem tokens
        tokens = [p_stemmer.stem(token) for token in tokens]

        # add tokens to list
        texts.append(tokens)
    return texts

# note: iterating over a DataFrame yields its column names, so this tokenizes the headers
texts = tokenize(df)
texts

[['titl'],
 ['subtitl'],
 ['owner'],
 ['vote'],
 ['version'],
 ['tag'],
 ['data', 'type'],
 ['size'],
 ['licens'],
 ['view'],
 ['download'],
 ['kernel'],
 ['topic'],
 ['url'],
 ['descript']]
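
The output above consists of stemmed column names, because iterating over a pandas DataFrame yields its column labels rather than its rows. To tokenize the dataset descriptions, one would presumably pass a single text column instead, e.g. `tokenize(df['Description'].astype(str))`. The snippet below only illustrates the difference on a hypothetical two-row frame; the frame `toy` and its contents are made up.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the Kaggle CSV.
toy = pd.DataFrame({
    "Title": ["first title", "second title"],
    "Description": ["first toy description", "second toy description"],
})

# Iterating over the DataFrame itself yields the column labels ...
from_frame = list(toy)
# ... while iterating over one column yields the actual document strings.
from_column = list(toy["Description"].astype(str))

print(from_frame)
print(from_column)
```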

<font color = green >

### Convert tokenized documents into a "id <-> term" dictionary

</font>



In [74]:
dictionary = corpora.Dictionary(texts) # alternative to CountVectorizer: gensim builds the id <-> term mapping itself
print (type(dictionary), dictionary)
for k,w in dictionary.items():
    print (k,w)

<class 'gensim.corpora.dictionary.Dictionary'> Dictionary(16 unique tokens: ['titl', 'subtitl', 'owner', 'vote', 'version']...)
0 titl
1 subtitl
2 owner
3 vote
4 version
5 tag
6 data
7 type
8 size
9 licens
10 view
11 download
12 kernel
13 topic
14 url
15 descript


<font color = green >

### Create gensim corpus

</font>



In [75]:
print ('\nconvert tokenized documents into a document-term matrix')
corpus = [dictionary.doc2bow(text) for text in texts] # id and count
for item in corpus:
    print (item)


convert tokenized documents into a document-term matrix
[(0, 1)]
[(1, 1)]
[(2, 1)]
[(3, 1)]
[(4, 1)]
[(5, 1)]
[(6, 1), (7, 1)]
[(8, 1)]
[(9, 1)]
[(10, 1)]
[(11, 1)]
[(12, 1)]
[(13, 1)]
[(14, 1)]
[(15, 1)]


#### Explanation: 
Each pair shows the id of a term and how many times it occurs in the document, e.g. 
- `(3, 1)` means the term with id 3 (`vote`) occurs once in the fourth document
- `(6, 1), (7, 1)` means `data` and `type` each occur once in the seventh document 
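
To make the `(id, count)` format concrete, here is a small pure-Python sketch of the bag-of-words idea. It is an illustrative re-implementation, not gensim's code: the helper names `build_dictionary` and `doc_to_bow` are made up, and the ids that gensim assigns may differ.

```python
from collections import Counter

def build_dictionary(texts):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_texts = [["vote"], ["data", "type"], ["data", "data", "vote"]]
token2id = build_dictionary(toy_texts)
bows = [doc_to_bow(tokens, token2id) for tokens in toy_texts]
print(token2id)  # {'vote': 0, 'data': 1, 'type': 2}
print(bows)      # [[(0, 1)], [(1, 1), (2, 1)], [(0, 1), (1, 2)]]
```

In the last toy document `data` (id 1) occurs twice, which shows up as the pair `(1, 2)`.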

<font color = green >

### Generate LDA model

</font>



In [77]:
ldamodel = gensim.models.ldamodel.LdaModel(
    corpus, num_topics=2, id2word=dictionary, passes=20, random_state= 0)


<font color = green >

### Review topics 

</font>



In [78]:
ldamodel.print_topics(num_topics=2, num_words=60) # each coefficient (e.g. 0.093) is p(w|t)

[(0,
  '0.093*"data" + 0.093*"type" + 0.092*"view" + 0.092*"owner" + 0.092*"url" + 0.092*"vote" + 0.092*"size" + 0.092*"licens" + 0.033*"topic" + 0.033*"titl" + 0.033*"download" + 0.033*"tag" + 0.033*"subtitl" + 0.033*"version" + 0.033*"kernel" + 0.033*"descript"'),
 (1,
  '0.092*"descript" + 0.092*"kernel" + 0.092*"version" + 0.092*"subtitl" + 0.092*"tag" + 0.092*"download" + 0.092*"titl" + 0.092*"topic" + 0.033*"licens" + 0.033*"size" + 0.033*"vote" + 0.033*"url" + 0.033*"owner" + 0.033*"view" + 0.032*"type" + 0.032*"data"')]

<font color = green >

#### Vectorize data set

</font>



In [80]:
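# `vectorizer` is the fitted CountVectorizer created in the "Topic Modeling" section below
# note: iterating over a DataFrame yields its column names, so only the headers are vectorized
df_vectorized = vectorizer.transform(df)
print(df_vectorized)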

  (0, 31804)	1
  (1, 30529)	1
  (2, 23188)	1
  (3, 33958)	1
  (4, 33621)	1
  (5, 31017)	1
  (6, 9420)	1
  (6, 32650)	1
  (7, 29201)	1
  (8, 18902)	1
  (9, 33696)	1
  (10, 10907)	1
  (11, 17997)	1
  (12, 31967)	1
  (13, 33175)	1
  (14, 10057)	1


<font color = green >

#### Create gensim corpus

</font>



In [82]:
corpus = gensim.matutils.Sparse2Corpus(df_vectorized, documents_columns=False)
# comparing to using corpora.Dictionary:
# corpus = [dictionary.doc2bow(text) for text in texts] 
[item for item in corpus][:5]


[[(31804, 1)], [(30529, 1)], [(23188, 1)], [(33958, 1)], [(33621, 1)]]

<font color = green >

#### Create id2word dictionary

</font>



In [83]:
id_map = dict((v, k) for k, v in vectorizer.vocabulary_.items()) 
id_map

{9491: 'datasets',
 8483: 'contains',
 32232: 'transactions',
 8946: 'credit',
 6676: 'cards',
 28560: 'september',
 812: '2013',
 12233: 'european',
 6670: 'cardholders',
 9481: 'dataset',
 25014: 'presents',
 22530: 'occurred',
 9591: 'days',
 1528: '492',
 13635: 'frauds',
 1088: '284',
 2005: '807',
 15433: 'highly',
 32799: 'unbalanced',
 24698: 'positive',
 7552: 'class',
 2601: 'account',
 572: '172',
 22357: 'numerical',
 16669: 'input',
 33468: 'variables',
 27063: 'result',
 23713: 'pca',
 32260: 'transformation',
 32913: 'unfortunately',
 8307: 'confidentiality',
 17219: 'issues',
 25444: 'provide',
 22963: 'original',
 12881: 'features',
 4886: 'background',
 16565: 'information',
 9420: 'data',
 33355: 'v28',
 25096: 'principal',
 8163: 'components',
 22497: 'obtained',
 32263: 'transformed',
 31703: 'time',
 12874: 'feature',
 28353: 'seconds',
 11470: 'elapsed',
 32229: 'transaction',
 33238: 'used',
 12332: 'example',
 9976: 'dependant',
 8776: 'cost',
 28517: 'senstive

<font color = green >

#### Generate LDA model

</font>



In [86]:
ldamodel = gensim.models.ldamodel.LdaModel (corpus, num_topics=6, id2word=id_map, passes=25, random_state=34)

#### Note: 
Compared to the `corpora.Dictionary` approach, pass `id2word=id_map` instead of `id2word=dictionary`:

`ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20, random_state= 0)`

<font color = green >

#### Review topics

</font>



In [89]:
ldamodel.print_topics(num_topics=6,num_words=10)

[(0,
  '0.000*"kernels" + 0.000*"download" + 0.000*"tags" + 0.000*"topics" + 0.000*"versions" + 0.000*"views" + 0.000*"owner" + 0.000*"votes" + 0.000*"url" + 0.000*"description"'),
 (1,
  '0.000*"description" + 0.000*"kernels" + 0.000*"owner" + 0.000*"votes" + 0.000*"download" + 0.000*"subtitle" + 0.000*"tags" + 0.000*"title" + 0.000*"views" + 0.000*"url"'),
 (2,
  '0.000*"tags" + 0.000*"versions" + 0.000*"topics" + 0.000*"kernels" + 0.000*"views" + 0.000*"subtitle" + 0.000*"description" + 0.000*"size" + 0.000*"votes" + 0.000*"title"'),
 (3,
  '0.000*"size" + 0.000*"subtitle" + 0.000*"topics" + 0.000*"description" + 0.000*"votes" + 0.000*"versions" + 0.000*"tags" + 0.000*"kernels" + 0.000*"url" + 0.000*"title"'),
 (4,
  '0.000*"data" + 0.000*"type" + 0.000*"url" + 0.000*"owner" + 0.000*"versions" + 0.000*"tags" + 0.000*"topics" + 0.000*"kernels" + 0.000*"votes" + 0.000*"subtitle"'),
 (5,
  '0.000*"license" + 0.000*"views" + 0.000*"download" + 0.000*"title" + 0.000*"votes" + 0.000*"topi

<font color = green >

#### Name topics
   
</font>

You need to name the topics manually (or use the most frequent words of each topic).


In [24]:
topics_names= ['primary processing', 'Computers & IT', 'Religion', 'Sports', 'Science', 'Society & Lifestyle']

<font color = green >

### Topic Modeling 

</font>

[voted-kaggle-dataset](https://www.kaggle.com/canggih/voted-kaggle-dataset/version/2#voted-kaggle-dataset.csv)

In [28]:
import pandas as pd 
from sklearn.feature_extraction.text import CountVectorizer
import pickle
fn= 'voted-kaggle-dataset.csv'
df = pd.read_csv(fn)
print(len(df))

2150


Unnamed: 0,Title,Subtitle,Owner,Votes,Versions,Tags,Data Type,Size,License,Views,Download,Kernels,Topics,URL,Description
1753,Canadian Car Accidents 1994-2014,Car accidents in Canada from 1999-2014 with va...,steal,3,"Version 1,2017-07-10",,Other,353 MB,ODbL,"1,124 views",251 downloads,,,https://www.kaggle.com/tbsteal/canadian-car-ac...,Context\nThis data set contains collision data...
491,News Headlines Of India,16 years of categorized headlines focusing on ...,Rohk,21,"Version 3,2018-01-10|Version 2,2017-12-23|Vers...",news agencies\ncities\nhistoriography,CSV,62 MB,CC4,"2,758 views",241 downloads,8 kernels,0 topics,https://www.kaggle.com/therohk/india-headlines...,Context\nThis dataset is a compilation of 2.7 ...
753,US Candy Production by Month,From January 1972 to August 2017,Rachael Tatman,14,"Version 1,2017-10-14",food and drink\ntime series\nproduct\n+ 2 more...,CSV,10 KB,CC0,"4,780 views",732 downloads,4 kernels,,https://www.kaggle.com/rtatman/us-candy-produc...,Context:\nHalloween begins frenetic candy cons...


In [93]:
text = df['Description'].values.astype('U')

vectorizer = CountVectorizer(
    min_df=1, 
    stop_words='english',
    token_pattern=r"\b\w{3,}\b") 
print(vectorizer.fit(text))

vectorized = vectorizer.transform(text)
corpus = gensim.matutils.Sparse2Corpus(vectorized, documents_columns=False)
id_ = dict((i, j) for j, i in vectorizer.vocabulary_.items())

ldamodel = gensim.models.ldamodel.LdaModel (corpus, num_topics=2, id2word=id_, passes=30, random_state=4381)

ldamodel.print_topics(num_topics=2,num_words=5)

CountVectorizer(stop_words='english', token_pattern='\\b\\w{3,}\\b')


[(0,
  '0.053*"university" + 0.010*"state" + 0.007*"college" + 0.005*"data" + 0.004*"california"'),
 (1,
  '0.022*"data" + 0.017*"dataset" + 0.006*"content" + 0.006*"context" + 0.005*"acknowledgements"')]

In [94]:
topics_names = ['Education', 'Data']
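
Once the topics are named, a document can be labeled with the name of its most probable topic. The sketch below assumes a list of `(topic_id, probability)` pairs, which is the shape gensim returns from `ldamodel.get_document_topics(bow)`. The helper `most_likely_topic` and the example distribution are made up for illustration and are not model output.

```python
def most_likely_topic(doc_topics, names):
    """Pick the name of the topic with the highest probability.

    `doc_topics` is a list of (topic_id, probability) pairs.
    """
    topic_id, _ = max(doc_topics, key=lambda pair: pair[1])
    return names[topic_id]

topics_names = ['Education', 'Data']
# Made-up distribution for illustration only, not a model output.
example_doc_topics = [(0, 0.15), (1, 0.85)]
label = most_likely_topic(example_doc_topics, topics_names)
print(label)  # Data
```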

<font color = green >

## Learn more
</font>

Latent Dirichlet allocation
<br>
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

LDA Algorithm Description.mp4


<font color = green >

## Next lesson: Clustering 
</font>

