# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails using the **LDA** algorithm, an unsupervised learning technique in NLP.

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [3]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)
data.columns = ['text']
data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


In [4]:
data.shape

(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [19]:
import string
from nltk.corpus import stopwords 
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

def preprocessing(sentence):
    # $CHALLENGIFY_BEGIN
    sentence = sentence.strip()  # Remove leading and trailing whitespace
    sentence = sentence.lower()  # Lowercase
    sentence = ''.join(char for char in sentence if not char.isdigit())  # Remove digits
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, '')  # Remove punctuation
    tokenized = word_tokenize(sentence)  # Tokenize
    words_only = [word for word in tokenized if word.isalpha()]  # Keep alphabetic tokens only
    stop_words = set(stopwords.words('english'))  # Build the stopword set
    without_stopwords = [word for word in words_only if word not in stop_words]  # Remove stopwords
    lemma = WordNetLemmatizer()  # Instantiate the lemmatizer
    lemmatized = [lemma.lemmatize(word) for word in without_stopwords]  # Lemmatize
    return lemmatized
    # $CHALLENGIFY_END

In [29]:
data['clean_text'] = data['text'].apply(lambda x: preprocessing(x))

In [32]:
data['clean_text'] = data['clean_text'].astype('str')

In [33]:
data.dtypes

text          object
clean_text    object
dtype: object
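
Note that casting a column of token lists to `str` stores the Python list representation, brackets, quotes and commas included, as the `clean_text` values displayed later in this notebook show. The vectorizer's default tokenizer still recovers the words from it, but joining the tokens with spaces is a possible alternative. A minimal sketch on a toy token list (the tokens are illustrative, not taken from the dataset):

```python
# Toy stand-in for one output of preprocessing(); the tokens are illustrative
tokens = ["team", "played", "game"]

as_repr = str(tokens)        # string cast of the list: keeps brackets, quotes and commas
as_text = " ".join(tokens)   # plain space-separated text

print(as_repr)  # ['team', 'played', 'game']
print(as_text)  # team played game
```

On the DataFrame, the equivalent would be something like `data['clean_text'].apply(' '.join)`, applied before any cast to `str`.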

## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorized_documents = vectorizer.fit_transform(data['clean_text'])
vectorized_documents = pd.DataFrame(vectorized_documents.toarray(), 
                                    columns = vectorizer.get_feature_names_out())

vectorized_documents

Unnamed: 0,aa,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg,aacc,aadams,aafreenetcarletonca,aargh,aaron,aaronbinahccbrandeisedu,aaroncathenamitedu,aassists,...,zombo,zone,zoo,zoomed,zorasterism,zubov,zupancic,zurich,zwart,zzzzzz
0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.086861,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.071741,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1194,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1195,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1196,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1197,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [34]:
from sklearn.decomposition import LatentDirichletAllocation

# Instantiate the LDA 
n_components = 5
lda_model = LatentDirichletAllocation(n_components=n_components, max_iter = 100)

# Fit the LDA on the vectorized documents
lda_model.fit(vectorized_documents)

LatentDirichletAllocation(max_iter=100, n_components=5)

In [35]:
topic_word_mixture = pd.DataFrame(lda_model.components_, 
                                 columns = vectorizer.get_feature_names_out())

In [36]:
topic_word_mixture

Unnamed: 0,aa,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg,aacc,aadams,aafreenetcarletonca,aargh,aaron,aaronbinahccbrandeisedu,aaroncathenamitedu,aassists,...,zombo,zone,zoo,zoomed,zorasterism,zubov,zupancic,zurich,zwart,zzzzzz
0,0.592012,0.200975,0.234539,0.200283,1.898426,0.200005,1.48817,0.676765,0.569512,0.200009,...,0.200006,0.200007,0.200014,0.200008,0.282841,0.200005,0.200005,0.200009,0.538734,0.200011
1,0.200018,0.200022,0.200016,0.200002,0.200021,0.20001,0.200011,0.200012,0.20002,0.200021,...,0.200012,0.200008,0.20003,0.200017,0.200017,0.20001,0.200009,0.200015,0.200022,0.200023
2,0.200018,0.200022,0.200016,0.200002,0.200021,0.20001,0.200011,0.200012,0.20002,0.200021,...,0.200012,0.200008,0.20003,0.200017,0.200017,0.20001,0.200009,0.200015,0.200022,0.200023
3,0.208622,0.300013,0.200008,0.206197,0.207672,0.971785,0.200005,0.200006,0.20001,0.277751,...,0.466287,2.277234,0.363448,0.461411,0.20001,1.465741,0.287793,0.358939,0.20001,0.381856
4,0.200018,0.200022,0.200016,0.200002,0.200021,0.20001,0.200011,0.200012,0.20002,0.200021,...,0.200012,0.200008,0.20003,0.200017,0.200017,0.20001,0.200009,0.200015,0.200022,0.200023
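
In scikit-learn, `components_[i, j]` can be read as a pseudo-count of how strongly word `j` is associated with topic `i`, so the rows are not probabilities. With the default `topic_word_prior` of `1 / n_components` (0.2 here), words that a topic essentially never uses stay close to 0.2, which is what rows 1, 2 and 4 show. Normalizing each row gives a per-topic word distribution. A minimal sketch with NumPy, using a small made-up matrix in place of `lda_model.components_`:

```python
import numpy as np

# Made-up stand-in for lda_model.components_ (2 topics x 4 words)
components = np.array([
    [0.2, 1.8, 0.2, 5.8],
    [3.2, 0.2, 0.4, 0.2],
])

# Divide each row by its sum so every topic becomes a distribution over words
topic_word_distribution = components / components.sum(axis=1, keepdims=True)

print(topic_word_distribution.round(3))
print(topic_word_distribution.sum(axis=1))  # each row now sums to 1
```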


## (3) Visualize potential topics

🎁 We coded a function for you that prints the words associated with the potential topics.

In [37]:
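def print_topics(model, vectorizer):
    feature_names = vectorizer.get_feature_names_out()  # Look up the vocabulary once
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # argsort()[:-10 - 1:-1] gives the indices of the 10 largest weights, in descending order
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-10 - 1:-1]])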
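
The slice `[:-10 - 1:-1]` in `print_topics` walks the ascending `argsort` result backwards and stops after 10 items, which yields the indices of the 10 largest weights in descending order. A small NumPy sketch of the same idiom with 3 items instead of 10 (the weights are made up):

```python
import numpy as np

# Made-up topic weights for 6 words
weights = np.array([0.2, 5.0, 0.7, 3.1, 0.2, 1.4])
n_top = 3

# Indices of the n_top largest weights, largest first
top_indices = weights.argsort()[:-n_top - 1:-1]

print(top_indices)           # [1 3 5]
print(weights[top_indices])  # [5.  3.1 1.4]
```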

❓ **Question** ❓ Print the topics extracted by your LDA.

In [38]:
print_topics(lda_model, vectorizer)

Topic 0:
[('god', 35.24028798498692), ('christian', 22.02696956087202), ('jesus', 18.69060618938084), ('people', 16.52165638461461), ('would', 16.245723070134886), ('church', 16.146022924001947), ('one', 15.77157398065796), ('bible', 13.400911715851686), ('believe', 12.773527512121278), ('say', 12.435351135708782)]
Topic 1:
[('klingon', 0.6371143985681904), ('barone', 0.5482628389160791), ('pbaronexaessharriscom', 0.5482628389160791), ('romford', 0.41148942821992784), ('swindon', 0.41148942821992784), ('humberside', 0.41148942821992784), ('basingstoke', 0.41148942821992784), ('slough', 0.41148942821992784), ('billingham', 0.41148942821992784), ('peterborough', 0.41148942821992784)]
Topic 2:
[('testing', 1.3857661006194322), ('rfl', 0.8871619923566103), ('khettryrwpubutkedu', 0.8871619923566103), ('tennessee', 0.8871619923566103), ('singapore', 0.88681682525264), ('sturm', 0.7970038377623305), ('gakwrscom', 0.6949200387424055), ('dee', 0.626781192271493), ('ladwig', 0.574521138884944), 

## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Use the LDA on the vectorized example to predict the topics
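
The two steps above can be sketched end to end on a toy corpus. This is a minimal, self-contained illustration: the documents, `n_components=2` and `random_state=0` are assumptions chosen for the example, not the settings of the model fitted in this notebook.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for this example
toy_docs = [
    "god church bible jesus believe",
    "team game season player injured",
    "church bible christian god people",
    "player team played game season",
]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_lda.fit(toy_X)

# 1. Vectorize the new text with the already fitted vectorizer (transform, not fit_transform)
new_X = toy_vectorizer.transform(["the team played a good game"])

# 2. Use the fitted LDA to get the document-topic mixture
doc_topic = toy_lda.transform(new_X)

print(doc_topic.shape)        # one row per document, one column per topic
print(doc_topic.sum(axis=1))  # each row sums to 1
```

Fitting on the sparse matrix rather than on a DataFrame should also avoid the "X does not have valid feature names" warning that appears in this notebook, which scikit-learn raises when a model fitted with feature names later receives input without them.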

In [39]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [54]:
lda_model.transform(vectorizer.transform(example))

  "X does not have valid feature names, but"


array([[0.05053579, 0.04949582, 0.04949541, 0.80097707, 0.04949591]])
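
Each value is the estimated share of the corresponding topic in the example, and the row sums to 1. A minimal NumPy sketch for reading off the dominant topic, using the values printed above:

```python
import numpy as np

# Document-topic mixture copied from the output above
doc_topic = np.array([[0.05053579, 0.04949582, 0.04949541, 0.80097707, 0.04949591]])

# Index of the topic with the largest share for the first (and only) document
dominant_topic = int(doc_topic.argmax(axis=1)[0])

print(dominant_topic)                # 3
print(doc_topic[0, dominant_topic])  # 0.80097707
```

Here topic 3 carries about 80% of the weight, while the other four topics share the remainder almost equally.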

In [50]:
data

Unnamed: 0,text,clean_text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...,"['gldcunixbcccolumbiaedu', 'gary', 'l', 'dare'..."
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...,"['atterlepvelaacsoaklandedu', 'cardinal', 'xim..."
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...,"['minerkuhubccukansedu', 'subject', 'ancient',..."
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...,"['atterlepvelaacsoaklandedu', 'cardinal', 'xim..."
4,From: vzhivov@superior.carleton.ca (Vladimir Z...,"['vzhivovsuperiorcarletonca', 'vladimir', 'zhi..."
...,...,...
1194,From: jerryb@eskimo.com (Jerry Kaufman)\nSubje...,"['jerrybeskimocom', 'jerry', 'kaufman', 'subje..."
1195,From: golchowy@alchemy.chem.utoronto.ca (Geral...,"['golchowyalchemychemutorontoca', 'gerald', 'o..."
1196,From: jayne@mmalt.guild.org (Jayne Kulikauskas...,"['jaynemmaltguildorg', 'jayne', 'kulikauskas',..."
1197,From: sclark@epas.utoronto.ca (Susan Clark)\nS...,"['sclarkepasutorontoca', 'susan', 'clark', 'su..."


🏁 Congratulations! You now know how to implement an LDA quickly.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!