# Latent Dirichlet Allocation

In [3]:
import pandas as pd
import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

data = pd.read_pickle('data_pickle')

data.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\utilisateur\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\utilisateur\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\utilisateur\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


| | target | reviews |
|---|---|---|
| 0 | neg | plot : two teen couples go to a church party ,... |
| 1 | neg | the happy bastard's quick movie review \ndamn ... |
| 2 | neg | it is movies like these that make a jaded movi... |
| 3 | neg | " quest for camelot " is warner bros . ' firs... |
| 4 | neg | synopsis : a mentally unstable man undergoing ... |


The data is a collection of movie reviews, each with a `target` sentiment label. We ignore the labels here and try to extract topics from the review text!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

## Remove punctuation and lowercase the text

In [4]:
import string

def remove_punct(text):
    # Iterating over a string yields characters, so this drops every punctuation character
    text = "".join([char for char in text if char not in string.punctuation])
    return text.lower()

data['clean_text'] = data['reviews'].apply(remove_punct)
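
For larger corpora, the character-by-character filter can be replaced by a translation table. The sketch below is one possible alternative, not part of the original exercise; the name `remove_punct_fast` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_fast(text):
    """Hypothetical alternative to remove_punct based on str.translate."""
    return text.translate(_PUNCT_TABLE).lower()

print(remove_punct_fast("Hello, World!"))  # → hello world
```

Like the original function, this only strips the ASCII characters listed in `string.punctuation`; typographic quotes or dashes would pass through unchanged.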

## Remove StopWords

In [5]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stopwords_list = set(stopwords.words("english"))

def remove_stopwords(text):
    # Tokenize, then keep only the tokens that are not English stop words
    text = [word.lower() for word in word_tokenize(text) if word.lower() not in stopwords_list]
    return " ".join(text)

data['clean_text'] = data['clean_text'].apply(remove_stopwords)

## Lemmatize

In [6]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    # lemmatize() defaults to pos='n', so every token is treated as a noun
    text = [lemmatizer.lemmatize(word) for word in word_tokenize(text)]
    return " ".join(text)

data['clean_text'] = data['clean_text'].apply(lemmatize_text)
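
The preprocessing cells tokenize each review twice, once for stop-word removal and once for lemmatization. As a rough sketch, the same steps can be fused into a single pass over the tokens. To stay self-contained, the example uses `str.split` as the tokenizer and takes the stop-word set and the lemmatizing function as parameters. `clean_text`, `TOY_STOPWORDS` and `identity` are illustrative assumptions; in this notebook you would pass `stopwords_list` and `lemmatizer.lemmatize` instead, and results may differ slightly from `word_tokenize`.

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text, stop_words, lemmatize):
    """Lowercase, strip punctuation, drop stop words and lemmatize in one pass."""
    tokens = text.translate(_PUNCT_TABLE).lower().split()
    return " ".join(lemmatize(tok) for tok in tokens if tok not in stop_words)

# Toy stand-ins so the sketch runs without NLTK data (assumptions, not NLTK's lists)
TOY_STOPWORDS = {"the", "a", "is", "to"}

def identity(word):
    return word

print(clean_text("The plot is a mess , to be honest .", TOY_STOPWORDS, identity))
# → plot mess be honest
```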

## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [7]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Transform texts to a Bag-of-Words representation
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['clean_text'])

# Train an LDA model
lda = LatentDirichletAllocation(n_components=3, learning_method='online', random_state=0)
lda.fit(X)
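
Words that occur in nearly every document carry little topical signal, yet the default `CountVectorizer()` keeps them in the vocabulary. One common option is to prune them with `max_df`. The toy corpus below is invented for illustration, and the threshold `0.9` is an arbitrary assumption rather than a tuned value.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus: "film" appears in every document
toy_docs = [
    "film plot character",
    "film scene director",
    "film disney animation",
    "film china mulan",
]

# max_df=0.9 ignores terms found in more than 90% of the documents
pruned_vectorizer = CountVectorizer(max_df=0.9)
pruned_vectorizer.fit(toy_docs)

vocabulary = sorted(pruned_vectorizer.vocabulary_)
print(vocabulary)
```

Here `film` is dropped because it appears in all four documents, while the rarer words remain. On the real reviews, a suitable threshold would have to be found by inspecting the resulting topics.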

## Visualize potential topics

👇 The function to print the words associated with the potential topics is already made for you. You just have to pass the correct arguments!

In [8]:
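# Print the highest-weighted words of each extracted topic
feature_names = vectorizer.get_feature_names_out()
for topic in lda.components_:
    # argsort() is ascending, so [:-10:-1] walks back from the end and yields the top 9 indices
    print("Topic: ", " ".join([feature_names[i] for i in topic.argsort()[:-10:-1]]))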

Topic:  film movie one like character time scene even get
Topic:  mulan cauldron lama dalai taran china disney pollock fantasia
Topic:  film one character movie like life get make time


## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

In [9]:
new_text = ["i'm a new text"]
new_text_vectorized = vectorizer.transform(new_text)

topic_distribution = lda.transform(new_text_vectorized)
print("Topic distribution: ", topic_distribution)
topic = topic_distribution.argmax()
print("Topic: ", topic)

Topic distribution:  [[0.13961317 0.1118682  0.74851863]]
Topic:  2
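
Calling `argmax()` without an axis flattens the array, which is only correct when a single document is transformed. For several documents, take the maximum per row. The sketch below uses plain lists so that it runs on its own; `dominant_topics` is a hypothetical helper, and on a NumPy array `topic_distribution.argmax(axis=1)` gives the same result.

```python
def dominant_topics(distribution):
    """Return the index of the most probable topic for each document row."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in distribution]

# Two rows taken from outputs in this notebook; they come from different
# model fits and are stacked here only for illustration
rows = [
    [0.13961317, 0.1118682, 0.74851863],
    [0.29022329, 0.59743036, 0.11234635],
]
print(dominant_topics(rows))  # → [2, 1]
```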


In [10]:
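import os
import requests
import json

# Azure Text Analytics API endpoint and key
endpoint = "https://azure-ml-ai900-william-31012023.cognitiveservices.azure.com/"
# Read the subscription key from the environment rather than hardcoding a secret
# (the variable name AZURE_TEXT_ANALYTICS_KEY is an arbitrary choice)
key = os.environ["AZURE_TEXT_ANALYTICS_KEY"]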

# Texts to extract topics from
texts = data['reviews']
new_text = ['new text']

# Call the Text Analytics API
def extract_topics_azure(texts, endpoint, key):
    """Return one list of topic labels per document, or [] if the request fails."""
    topics = []
    documents = {"documents": [{"id": i, "text": text} for i, text in enumerate(texts)]}
    headers = {
        'Ocp-Apim-Subscription-Key': key,
        'Content-Type': 'application/json'
    }
    response = requests.post(endpoint, headers=headers, json=documents)
    # Any non-200 response leaves `topics` empty, so callers must handle an empty list
    if response.status_code == 200:
        response_json = response.json()
        for document in response_json['documents']:
            # Keep the first 10 characters of each topic label
            topics.append([topic['topic'][:10] for topic in document['topics']])
    return topics

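# Extract topics using Azure Cognitive Services
# `texts` is a pandas Series, so convert it to a list before appending the new text
# (Series + list is elementwise arithmetic, not concatenation)
azure_topics = extract_topics_azure(list(texts) + new_text, endpoint, key)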

# Print extracted topics using Azure Cognitive Services
print("Azure topics: ", azure_topics)

# Extract topics using Latent Dirichlet Allocation (LDA)
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Transform texts to a Bag-of-Words representation
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Train an LDA model
lda = LatentDirichletAllocation(n_components=3, learning_method='online', random_state=0)
lda.fit(X)

# Vectorize the new text
new_text_vectorized = vectorizer.transform(new_text)

# Predict the topic of the new text
topic_distribution = lda.transform(new_text_vectorized)
print("LDA topic distribution: ", topic_distribution)
lda_topic = topic_distribution.argmax()
print("LDA topic: ", lda_topic)

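# Compare the results of the two models
if azure_topics:
    print("Azure vs LDA: ", azure_topics[0], lda_topic)
else:
    # The Azure call returned nothing, so indexing azure_topics[0] would raise IndexError
    print("Azure vs LDA: no Azure topics returned; LDA topic:", lda_topic)

Azure topics:  []
LDA topic distribution:  [[0.29022329 0.59743036 0.11234635]]
LDA topic:  1
Azure vs LDA: no Azure topics returned; LDA topic: 1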