# Latent Dirichlet Allocation

In [11]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)

data.columns = ['text']

data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


The data is a collection of unlabelled emails. Let's try to extract topics from them!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

In [None]:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# The NLTK stop-word and tokenizer data must be downloaded beforehand (nltk.download).
# Build the stop-word set once instead of on every call.
STOP_WORDS = set(stopwords.words('english'))

def remove_punctuation(text):
    # Strip every ASCII punctuation character.
    return text.translate(str.maketrans('', '', string.punctuation))

def lower(text):
    return text.lower()

def del_number(text):
    # Drop digit characters one by one.
    return ''.join(char for char in text if not char.isdigit())

def stop_words(text):
    # Tokenize, then drop English stop words.
    word_tokens = word_tokenize(text)
    return ' '.join(w for w in word_tokens if w not in STOP_WORDS)

# Build clean_text step by step; the original "text" column is left untouched.
data['clean_text'] = data['text'].apply(remove_punctuation)
data['clean_text'] = data['clean_text'].apply(lower)
data['clean_text'] = data['clean_text'].apply(del_number)
data['clean_text'] = data['clean_text'].apply(stop_words)

data
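
To see what the four cleaning steps (punctuation, case, digits, stop words) do to a single string, here is a minimal standalone sketch. It is an illustration under assumptions, not the notebook's pipeline: `clean`, `TOY_STOP_WORDS`, and the sample sentence are hypothetical, and a plain whitespace split stands in for NLTK's `word_tokenize`.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the notebook cell uses NLTK's full English list instead.
TOY_STOP_WORDS = {"the", "is", "a", "to", "re", "in"}

def clean(text, stop_words=TOY_STOP_WORDS):
    """Apply the four cleaning steps to one string and return the result."""
    # 1. Remove punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Lowercase.
    text = text.lower()
    # 3. Remove digit characters.
    text = "".join(ch for ch in text if not ch.isdigit())
    # 4. Split on whitespace (a stand-in for word_tokenize) and drop stop words.
    return " ".join(w for w in text.split() if w not in stop_words)

# Made-up sample resembling an email header line.
sample = "Subject: Re: The game is in 2 periods!"
cleaned = clean(sample)
print(cleaned)  # subject game periods
```

Note that header words such as "subject" survive this kind of cleaning unless they are added to the stop-word list.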

## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [69]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

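# Learn the vocabulary and IDF weights from the cleaned emails.
# Note: LDA is commonly fit on raw term counts (CountVectorizer); TF-IDF weights are used here.
vectorizer = TfidfVectorizer().fit(data.clean_text)

# Sparse document-term matrix: one row per email, one column per word.
data_vectorized = vectorizer.transform(data.clean_text)

# n_components is the number of topics to extract.
lda_model = LatentDirichletAllocation(n_components=2).fit(data_vectorized)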

## Visualize potential topics

👇 The function to print the words associated with the potential topics is already made for you. You just have to pass the correct arguments!

In [70]:
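def print_topics(model, vectorizer):
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topics %d:" % (idx))
        # Ten highest-weighted words for this topic, in descending order of weight.
        print([(str(feature_names[i]), float(topic[i])) for i in topic.argsort()[:-10 - 1:-1]])

print_topics(lda_model, vectorizer)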

Topics 0:
[('pp', 2.7506781892521315), ('nyr', 1.7006927132368947), ('period', 1.544655219411332), ('har', 1.3454686248643242), ('edm', 1.3250652529247504), ('min', 1.3186068223510787), ('scorer', 1.3006121950701688), ('holger', 1.2861121697089621), ('pit', 1.2324652001021246), ('saves', 1.2234839476907964)]
Topics 1:
[('god', 29.932825348709095), ('would', 25.817333568994357), ('one', 23.031879597510123), ('subject', 22.440295777912066), ('organization', 21.55784969170528), ('university', 21.489029896325793), ('lines', 21.483881389040086), ('writes', 20.410085253142725), ('people', 20.38361422715784), ('game', 19.565453296883753)]
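
The slice applied to `argsort()` in `print_topics` (of the form `[:-n - 1:-1]`, with n = 10) is easy to misread. The sketch below reproduces it on a plain list so it can be checked by hand. The `argsort` helper is a hypothetical pure-Python stand-in for the NumPy method, and the vocabulary and weights are made up rather than taken from the model.

```python
def argsort(values):
    """Hypothetical pure-Python stand-in for numpy's argsort (ascending order)."""
    return sorted(range(len(values)), key=lambda i: values[i])

def top_n_indices(values, n):
    # argsort is ascending, so the largest weights sit at the end.
    # [:-n - 1:-1] walks backwards from the last position and stops after n items.
    return argsort(values)[:-n - 1:-1]

# Made-up weights for a five-word vocabulary (not taken from the model above).
vocabulary = ["game", "god", "period", "saves", "people"]
weights = [0.5, 2.0, 1.5, 0.1, 1.0]

top = top_n_indices(weights, 3)
print([(vocabulary[i], weights[i]) for i in top])
# [('god', 2.0), ('period', 1.5), ('people', 1.0)]
```

The result is the indices of the n largest weights, largest first, which is why each topic's word list is printed in descending order of weight.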




## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

In [83]:
new_text = ["I can buy myself flower !"]

# Reuse the fitted vectorizer, then get the topic distribution (one row per text).
new_text_vectorized = vectorizer.transform(new_text)
lda_vectors = lda_model.transform(new_text_vectorized)

print("Topic 0 :", lda_vectors[0][0])
print("Topic 1 :", lda_vectors[0][1])

Topic 0 : 0.2772098477216234
Topic 1 : 0.7227901522783766
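
The two numbers form a probability distribution over topics, so a single "predicted" topic is simply the one with the highest probability. A minimal sketch, using the values printed above; `most_likely_topic` is a hypothetical helper name, not part of scikit-learn.

```python
# Topic distribution for the example, copied from the output above.
topic_distribution = [0.2772098477216234, 0.7227901522783766]

def most_likely_topic(distribution):
    """Return the index of the topic with the highest probability."""
    return max(range(len(distribution)), key=lambda k: distribution[k])

predicted = most_likely_topic(topic_distribution)
print("Predicted topic:", predicted)
print("Sum of probabilities:", round(sum(topic_distribution), 6))
```

With the real NumPy array, `lda_vectors[0].argmax()` should give the same index. Since LDA is not seeded here, topic indices and probabilities may differ between runs.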
