# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails using the **LDA** algorithm, an unsupervised learning technique in NLP.

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [1]:
import pandas as pd

url = 'https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()

|   | text |
|---|------|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... |


In [2]:
data.shape

(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [None]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Download necessary NLTK data
# (recent NLTK releases may additionally need the "punkt_tab" resource for word_tokenize)
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")

# Build these once instead of on every call
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    """Lowercase, strip punctuation, tokenize, remove stopwords, lemmatize."""
    # Convert to lowercase
    text = text.lower()

    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))

    # Tokenize words
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # Join tokens back to string
    return " ".join(tokens)

# Apply preprocessing
data["clean_text"] = data["text"].apply(preprocess_text)

[nltk_data] Downloading package stopwords to /home/aheggs/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/aheggs/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/aheggs/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
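
The raw emails start with header lines (`From:`, `Subject:`, ...), and header words such as `subject` and `organization` show up among the top topic words printed in section (3). One possible extra cleaning step, sketched here under assumptions, is to drop those lines before calling `preprocess_text`. The helper name `strip_headers` and the list of header names are hypothetical, chosen for illustration; they are not part of the original challenge.

```python
import re

# Assumption: header lines look like "Name: value", as in the samples shown
# in the DataFrame preview ("From: ...", "Subject: ...").
# The set of header names below is a guess and may need adjusting.
HEADER_PATTERN = re.compile(
    r"^(from|subject|organization|lines):.*$",
    flags=re.IGNORECASE | re.MULTILINE,
)

def strip_headers(text):
    """Hypothetical helper: remove common email header lines from a raw email."""
    return HEADER_PATTERN.sub("", text).strip()

sample = (
    "From: someone@example.com\n"
    "Subject: Re: playoffs\n"
    "Lines: 12\n"
    "\n"
    "The team played well."
)
print(strip_headers(sample))  # → The team played well.
```

Note that this simple pattern also removes any body line that happens to start with one of these header names, so it is a rough filter rather than a real email parser.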


## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [4]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Convert text data into numerical format using CountVectorizer
vectorizer = CountVectorizer(max_features=1000, stop_words="english")
X = vectorizer.fit_transform(data["clean_text"])

# Train LDA model
lda = LatentDirichletAllocation(n_components=5, random_state=42)  # Extract 5 topics
lda.fit(X)

## (3) Visualize potential topics

🎁 We coded a function for you that prints the words most strongly associated with each potential topic.

In [5]:
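def print_topics(model, vectorizer):
    # Look up the vocabulary once rather than once per word
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Indices of the 10 highest-weight words, in descending order
        top_indices = topic.argsort()[:-10 - 1:-1]
        print([(feature_names[i], topic[i]) for i in top_indices])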

❓ **Question** ❓ Print the topics extracted by your LDA.

In [6]:
print_topics(lda, vectorizer)

Topic 0:
[('christian', 339.10862171650405), ('truth', 280.78756805426355), ('god', 242.72311766260563), ('law', 211.48444469805474), ('belief', 189.70994858386743), ('believe', 176.08994671300357), ('say', 172.03322267682327), ('question', 169.51246332405012), ('christianity', 168.25680161280187), ('absolute', 162.197998947287)]
Topic 1:
[('game', 446.8221809544968), ('team', 361.92372106446174), ('line', 305.6521076789393), ('player', 281.37316981587355), ('play', 267.4944336780269), ('goal', 263.94417451145165), ('think', 240.82123229840536), ('subject', 225.38258946565315), ('organization', 221.71340065643332), ('hockey', 208.63362950328792)]
Topic 2:
[('25', 320.01441425053616), ('pt', 312.1980310859925), ('550', 280.19999328881835), ('10', 270.568618739754), ('la', 238.9904998817351), ('period', 227.87869712759672), ('11', 198.10801870047007), ('12', 185.45528870456747), ('14', 168.37317828824655), ('13', 165.18358695677736)]
Topic 3:
[('team', 577.3940074521192), ('game', 459.10
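
The weights printed above are not probabilities: in scikit-learn, each entry of `components_` can be read as a pseudo-count of how often a word was assigned to a topic. Dividing each row by its sum turns it into a word distribution for that topic, which is easier to compare across topics. The sketch below shows the normalization on made-up numbers; `topic_word_probabilities` and the `toy_*` values are hypothetical and not taken from the fitted model.

```python
import numpy as np

def topic_word_probabilities(components):
    """Normalize each topic row so that its word weights sum to 1."""
    components = np.asarray(components, dtype=float)
    return components / components.sum(axis=1, keepdims=True)

# Made-up pseudo-counts for 2 topics over 3 words
toy_components = np.array([[6.0, 3.0, 1.0],
                           [1.0, 2.0, 7.0]])
toy_words = np.array(["game", "team", "god"])

probs = topic_word_probabilities(toy_components)
for idx, row in enumerate(probs):
    top = row.argsort()[::-1][:2]  # indices of the 2 highest-weight words
    print(f"Topic {idx}:", [(str(toy_words[i]), round(float(row[i]), 2)) for i in top])
```

With the real model, the same function could be applied to `lda.components_` before printing.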

## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Use the LDA on the vectorized example to predict the topics

In [7]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [8]:
# Preprocess new text
example_cleaned = [preprocess_text(example[0])]

# Transform new text into numerical format
example_vectorized = vectorizer.transform(example_cleaned)

# Predict topic distribution
topic_distribution = lda.transform(example_vectorized)
print("Topic distribution:", topic_distribution)

Topic distribution: [[0.02862926 0.88470332 0.02864203 0.02942078 0.0286046 ]]
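
Each row of the result is a mixture over the 5 topics. A common way to summarize it is to pick the topic with the largest weight. The minimal sketch below does that with the values printed above, copied by hand so it runs on its own; `dominant_topic` and `dominant_weight` are illustrative names.

```python
import numpy as np

# Values copied from the output printed above
topic_distribution = np.array(
    [[0.02862926, 0.88470332, 0.02864203, 0.02942078, 0.0286046]]
)

# Index and weight of the strongest topic for each document (one row here)
dominant_topic = topic_distribution.argmax(axis=1)
dominant_weight = topic_distribution.max(axis=1)

print("Dominant topic:", dominant_topic[0])            # → 1
print("Weight:", round(float(dominant_weight[0]), 3))  # → 0.885
```

Topic 1 is the one whose top words include `game`, `team`, `player` and `hockey`, which fits the sports-related example sentence.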


🏁 Congratulations! You now know how to implement an LDA model quickly.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!