# Latent Dirichlet Allocation

In [108]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)

data.columns = ['text']

data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


The data is a collection of emails that are not labelled. Let's try extract topics from them!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

In [110]:
import string
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    text = text.lower()
    text = ''.join(c for c in text if not c.isdigit())
    word_tokens = word_tokenize(text) 
    text = ''.join(w+' ' for w in word_tokens if not w in stop_words)
    word_tokens = word_tokenize(text) 
    lemmatized = [lemmatizer.lemmatize(word) for word in word_tokens]
    text = ''.join(w+' ' for w in lemmatized)
    return text

data['clean_text'] = data['text'].map(lambda x : clean_text(x))

In [111]:
data

Unnamed: 0,text,clean_text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...,gldcunixbcccolumbiaedu gary l dare subject sta...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...,atterlepvelaacsoaklandedu cardinal ximenez sub...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...,minerkuhubccukansedu subject ancient book orga...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...,atterlepvelaacsoaklandedu cardinal ximenez sub...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...,vzhivovsuperiorcarletonca vladimir zhivov subj...
...,...,...
1194,From: jerryb@eskimo.com (Jerry Kaufman)\nSubje...,jerrybeskimocom jerry kaufman subject prayer a...
1195,From: golchowy@alchemy.chem.utoronto.ca (Geral...,golchowyalchemychemutorontoca gerald olchowy s...
1196,From: jayne@mmalt.guild.org (Jayne Kulikauskas...,jaynemmaltguildorg jayne kulikauskas subject q...
1197,From: sclark@epas.utoronto.ca (Susan Clark)\nS...,sclarkepasutorontoca susan clark subject pick ...


In [112]:
data.head(5)

Unnamed: 0,text,clean_text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...,gldcunixbcccolumbiaedu gary l dare subject sta...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...,atterlepvelaacsoaklandedu cardinal ximenez sub...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...,minerkuhubccukansedu subject ancient book orga...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...,atterlepvelaacsoaklandedu cardinal ximenez sub...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...,vzhivovsuperiorcarletonca vladimir zhivov subj...


## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [113]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer().fit(data['clean_text'])
X = vectorizer.transform(data['clean_text'])

LDA = LatentDirichletAllocation(n_components=2)
LDA.fit(X)

LatentDirichletAllocation(n_components=2)

## Visualize potential topics

👇 The function to print the words associated with the potential topics is already made for you. You just have to pass the correct arguments!

In [114]:
def print_topics(model, vectorizer):
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        print([(vectorizer.get_feature_names()[i], topic[i])
                        for i in topic.argsort()[:-10 - 1:-1]])
        

print_topics(LDA, vectorizer)

Topic 0:
[('god', 35.54760893325172), ('christian', 22.323245818238878), ('jesus', 18.979310370077958), ('people', 17.494785849906638), ('would', 16.572695383414498), ('church', 16.438950156053217), ('one', 16.364900830040867), ('bible', 13.69651139938445), ('believe', 13.61376839976182), ('say', 13.009815517085483)]
Topic 1:
[('game', 26.681544442642505), ('team', 25.45486664865145), ('hockey', 18.504878448240177), ('player', 18.165315254538694), ('go', 15.247110552515855), ('play', 14.488760602224872), ('nhl', 13.455637003281197), ('year', 13.277282381409819), ('playoff', 13.088511677254296), ('university', 13.060207180707774)]


## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

In [118]:
email2 ='The tirade was posted after the president, who has refused to acknowledge his election \
loss to Democratic nominee Joe Biden, retweeted multiple comments from supporters,\
many of which expressed the view that they would instead be relying on right-wing\
cable channel and website Newsmax.Late on Thursday, the top story on Newsmax.com was headlined\
“Sen. Ted Cruz to Newsmax TV: ‘Media Don’t Get to Decide Presidency’.”\
Among Trump’s retweets was one by a user called “Appalachian Christian”,\
who said: “Suit yourself Left Fox 4 NewsMaxxxxx.” Fox was one of the first news organisations to \
call the state of Arizona for Biden and has warned its readers that Trump’s claims of victory are false.'

In [119]:
email = 'Ice hockey is a contact team sport played on ice, usually in a rink, in which two teams of skaters use their sticks to shoot a vulcanized rubber puck into their opponents net to score goals. The sport is known to be fast-paced and physical, with teams usually fielding six players at a time: one goaltender, and five players who skate the span of the ice trying to control the puck and score goals against the opposing team.'

In [120]:
new_text = pd.DataFrame({'text':[email]})

In [121]:
new_text['clean_text'] = new_text['text'].map(lambda x : remove_punct(x))
new_text['clean_text'] = new_text['clean_text'].map(lambda x : lower(x))

In [122]:
new_X = vectorizer.transform(new_text['clean_text'])

In [123]:
example_vectorized = vectorizer.transform(new_text['clean_text'])

lda_vectors = LDA.transform(example_vectorized)

print("topic 0 :", lda_vectors[0][0])
print("topic 1 :", lda_vectors[0][1])

topic 0 : 0.1384943530925435
topic 1 : 0.8615056469074566
