# Latent Dirichlet Allocation

In [1]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)

data.columns = ['text']

data.head()

| | text |
|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... |


In [2]:
data['clean_text'] = data['text']

The data is a collection of unlabelled emails. Let's try to extract topics from them!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

In [3]:
import nltk
import string

def preprocess(text):
    # Lowercase once, then strip every ASCII punctuation character
    text = text.lower()
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

In [4]:
data.clean_text = data.clean_text.apply(preprocess)
data.head()

| | text | clean_text |
|---|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | from gldcunixbcccolumbiaedu gary l dare\nsubje... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | from atterlepvelaacsoaklandedu cardinal ximene... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... | from minerkuhubccukansedu\nsubject re ancient ... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | from atterlepvelaacsoaklandedu cardinal ximene... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... | from vzhivovsuperiorcarletonca vladimir zhivov... |
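
The cleaned text above still contains digits and newline markers, which can end up as tokens in the vocabulary. As a rough sketch only, a stricter cleaning function might also drop digit runs and collapse whitespace. The name `preprocess_strict` and the sample string are hypothetical and not part of the original notebook.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character in one pass
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess_strict(text):
    """Hypothetical stricter variant of preprocess(): also drops digits and extra whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    text = re.sub(r"\d+", " ", text)          # replace digit runs (e.g. phone numbers) with a space
    return re.sub(r"\s+", " ", text).strip()  # collapse newlines and repeated spaces

# Made-up sample line, for illustration only
print(preprocess_strict("Subject: Re: Call 555-0100 NOW!!\n"))
```

Whether removing digits helps depends on the corpus, so it is worth comparing the resulting topics with and without this step.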


## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [5]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

In [6]:
vectorizer = TfidfVectorizer().fit(data['clean_text'])
data_vectorized = vectorizer.transform(data['clean_text'])


lda_model = LatentDirichletAllocation(n_components=5).fit(data_vectorized)

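def print_topics(model, vectorizer):
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Ten highest-weighted words for this topic, in descending order
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-10 - 1:-1]])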
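
LDA is formulated as a generative model over word counts, so a common alternative to the TF-IDF features used above is raw term counts. The sketch below is only an illustration under stated assumptions: the four-sentence corpus is made up, and `CountVectorizer(stop_words="english")`, `n_components=2`, and `random_state=0` are choices that do not appear in the notebook.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in corpus (hypothetical; the notebook would use data['clean_text'])
docs = [
    "the team won the hockey game last night",
    "hockey players skate on ice during the game",
    "the church teaches faith and prayer",
    "faith and prayer are central to the church",
]

# Raw term counts, with common English stop words removed
count_vectorizer = CountVectorizer(stop_words="english")
counts = count_vectorizer.fit_transform(docs)

# random_state fixes the initialisation so repeated runs give the same topics
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# One row per document, one column per topic; each row sums to 1
doc_topics = lda.transform(counts)
print(doc_topics.round(2))
```

Setting `random_state` matters here because LDA starts from a random initialisation, so topic numbering and word weights can otherwise differ between runs.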

## Visualize potential topics

👇 The function that prints the words associated with each potential topic is already written for you. You just have to pass the correct arguments!

In [7]:
print_topics(lda_model, vectorizer)

Topic 0:
[('ideological', 1.1394003457588737), ('denounced', 0.9905009425123743), ('wbt', 0.803070570184468), ('manipulation', 0.7556586051298897), ('govt', 0.6437208650822466), ('mp', 0.6352753351773337), ('repression', 0.6076287581064024), ('wbtsil', 0.6076287578102031), ('rahim', 0.5671692846794159), ('hirji', 0.5671692846605073)]
Topic 1:
[('706', 2.0405879937951776), ('306027415', 1.676685484101163), ('mcovingtaiugaedu', 1.6766854840967513), ('5420358', 1.676685484069725), ('n4tmi', 1.676685484052121), ('ai', 1.632608127337766), ('colons', 1.4731293700255716), ('artificial', 1.40010056578612), ('covington', 1.3245859320648412), ('associate', 1.201671150016187)]
Topic 2:
[('grass', 3.7824998476081833), ('valley', 3.6239915437181227), ('petch', 2.154420692907776), ('howl', 1.4635233430873373), ('petchgvg47gvgtekcom', 1.4350492104940011), ('finalswinner', 1.2451230965411488), ('gargle', 1.231119981359719), ('octopus', 1.1570864724291683), ('finalswho', 1.0921011251470316), ('statemai

## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

In [8]:
my_review = 'I hate this one, this too bad and rubissh ! This test is definitely shiit !!'
my_review = preprocess(my_review)

In [9]:
example_vectorized = vectorizer.transform([my_review])

lda_vectors = lda_model.transform(example_vectorized)

# Print the weight the model assigns to each topic for this example
for idx, weight in enumerate(lda_vectors[0]):
    print(f"topic {idx} :", weight)

topic 0 : 0.05367469912737307
topic 1 : 0.053674658891134173
topic 2 : 0.05367451921735552
topic 3 : 0.7853016499610639
topic 4 : 0.0536744728030734


In [10]:
lda_vectors

array([[0.0536747 , 0.05367466, 0.05367452, 0.78530165, 0.05367447]])

In [11]:
import numpy as np
print(f'my_review is predicted to be of Topic {np.argmax(lda_vectors)}')

my_review is predicted to be of Topic 3
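
The steps used above (clean, vectorize, transform, take the largest weight) can be wrapped in a small helper. This is a minimal sketch: `predict_topic` is a hypothetical name, and it only assumes that the vectorizer and the model each expose a `transform` method, as the scikit-learn objects in this notebook do.

```python
def predict_topic(text, vectorizer, lda_model, preprocess=None):
    """Return (index of the highest-weighted topic, full topic distribution) for one text."""
    if preprocess is not None:
        text = preprocess(text)                    # apply the same cleaning as the training data
    vectorized = vectorizer.transform([text])      # one row, one column per vocabulary term
    distribution = list(lda_model.transform(vectorized)[0])  # one weight per topic
    best = max(range(len(distribution)), key=distribution.__getitem__)
    return best, distribution

# Possible usage with the objects defined in this notebook:
# topic, weights = predict_topic("some new email text", vectorizer, lda_model, preprocess)
```

Returning the full distribution alongside the index makes it easier to spot cases where no topic clearly dominates.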
