# Latent Dirichlet Allocation (LDA)

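🎯 The goal of this challenge is to find topics within a corpus of short texts with the **LDA** algorithm (unsupervised learning in NLP).

📄 Here is a collection of 53K+ ***statements***, each tagged with a `status` label (for example `Anxiety`). The labels are only used to filter the corpus; LDA itself never sees them. Let's try to ***extract topics*** from the text!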

In [2]:
import pandas as pd

data = pd.read_csv('/home/kristina/code/g0zzy/stress_sense/raw_data/CombinedData.csv', index_col=0)
data.head()

Unnamed: 0,statement,status
0,oh my gosh,Anxiety
1,"trouble sleeping, confused mind, restless hear...",Anxiety
2,"All wrong, back off dear, forward doubt. Stay ...",Anxiety
3,I've shifted my focus to something else but I'...,Anxiety
4,"I'm restless and restless, it's been a month n...",Anxiety


In [3]:
data.shape

(53043, 2)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new `clean_text` column of the DataFrame.

In [None]:
import string

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# Build these once instead of on every call to clean()
# (the NLTK tokenizer, stopwords and WordNet resources must already be downloaded)
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def clean(text):
    text = str(text) # Cast once; a missing value (NaN) becomes the string 'nan'
    for punctuation in string.punctuation:
        text = text.replace(punctuation, ' ') # Remove punctuation
    lowercased = text.lower() # Lower case
    tokenized = word_tokenize(lowercased) # Tokenize
    words_only = [word for word in tokenized if word.isalpha()] # Keep alphabetic tokens only (drops numbers)
    without_stopwords = [word for word in words_only if word not in STOP_WORDS] # Remove stop words
    lemmatized = [LEMMATIZER.lemmatize(word) for word in without_stopwords] # Lemmatize (as nouns, the default)
    return ' '.join(lemmatized) # Join back into a string

# Apply to all texts
data['clean_text'] = data.statement.apply(clean)

data.head()

Unnamed: 0,statement,status,clean_text
0,oh my gosh,Anxiety,oh gosh
1,"trouble sleeping, confused mind, restless hear...",Anxiety,trouble sleeping confused mind restless heart ...
2,"All wrong, back off dear, forward doubt. Stay ...",Anxiety,wrong back dear forward doubt stay restless re...
3,I've shifted my focus to something else but I'...,Anxiety,shifted focus something else still worried
4,"I'm restless and restless, it's been a month n...",Anxiety,restless restless month boy mean


## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [18]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# Keep only stress-like categories
stress_posts = data[data['status'].isin(['Stress','Anxiety','Depression'])]['clean_text']


data_vectorized = vectorizer.fit_transform(stress_posts)

lda_model = LatentDirichletAllocation(n_components=2)

lda_vectors = lda_model.fit_transform(data_vectorized)
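The same vectorize-then-fit pipeline can be tried on a tiny corpus to see the shapes involved. This is only a sketch: `toy_docs`, `toy_vectorizer` and `toy_lda` are hypothetical names that are not part of the challenge, and `random_state=0` is an assumption added so that repeated runs give the same result (the cell above leaves it unset, so its topics can change between runs).

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the cleaned posts
toy_docs = [
    "sleep trouble restless night",
    "restless night worry sleep",
    "work deadline pressure boss",
    "boss pressure work stress",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix (n_docs, n_words)

# random_state fixes the initialisation so the fit is reproducible
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_doc_topics = toy_lda.fit_transform(toy_counts)   # dense array (n_docs, n_topics)

print(toy_counts.shape)
print(toy_doc_topics.shape)
print(toy_doc_topics.sum(axis=1))  # each row is a topic mixture that sums to 1
```

Each row of the `fit_transform` output is one document's mixture over the topics, which is why `lda_vectors` above has one column per topic.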

## (3) Visualize potential topics

🎁 We coded a function for you that prints the ten highest-weighted words of each potential topic.

In [19]:
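def print_topics(model, vectorizer, n_top_words=10):
    feature_names = vectorizer.get_feature_names_out() # Look up the vocabulary once
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Indices of the n_top_words largest weights, in descending order
        top_indices = topic.argsort()[:-n_top_words - 1:-1]
        print([(feature_names[i], topic[i]) for i in top_indices])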

❓ **Question** ❓ Print the topics extracted by your LDA.

In [20]:
print_topics(lda_model, vectorizer)

Topic 0:
[('anxiety', 6001.16025842358), ('year', 5893.714727561282), ('time', 5215.763203981976), ('get', 4187.162633974875), ('day', 3949.655330549668), ('would', 3669.923523269622), ('got', 3613.879514941272), ('back', 3560.9720252917696), ('one', 3354.232691669636), ('like', 3326.282973930347)]
Topic 1:
[('feel', 21013.0245050186), ('like', 20160.717026069255), ('want', 13037.830432828661), ('life', 11267.069870227526), ('know', 11107.901687634014), ('get', 9867.837366024774), ('even', 8979.413570431958), ('people', 8007.706118703771), ('time', 7840.23679601768), ('thing', 7676.862349425387)]


## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Use the LDA on the vectorized example to predict the topics

In [16]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [17]:
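# Apply the same cleaning as for the training texts before vectorizing
example_vectorized = vectorizer.transform([clean(text) for text in example])

# Use a new name so the training-set topic mixtures in lda_vectors are not overwritten
example_topics = lda_model.transform(example_vectorized)

print("topic 0 :", example_topics[0][0])
print("topic 1 :", example_topics[0][1])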

topic 0 : 0.7249920585383364
topic 1 : 0.2750079414616636
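
The two numbers printed above are the document's topic mixture: they sum to 1, and the largest one marks the dominant topic. A minimal sketch of reading it off with NumPy, using the values copied from that output (`mixture` and `dominant_topic` are hypothetical names):

```python
import numpy as np

# Topic mixture copied from the output above (one row per document)
mixture = np.array([[0.7249920585383364, 0.2750079414616636]])

dominant_topic = mixture.argmax(axis=1)  # index of the largest weight in each row
print(dominant_topic)                    # [0]
print(mixture.sum(axis=1))               # the row sums to 1, up to floating-point rounding
```

With only two topics, a 0.72 / 0.28 split is a fairly weak assignment, so the dominant topic is best read together with its weight.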


🏁 Congratulations! You know how to implement an LDA quickly.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!