# LDA Topic Modelling

* This notebook is showcases the process of building an NLP Topic Model using `Latent Dirichlet Allocation` method. 

## Table Of Contents

## Installations

In [1]:
# ## installing required libraries
# ! pip install pandas
# ! pip install numpy
# ! pip install plotly
# ! pip install nbformat
# ! pip install ipykernel
# ! pip install matplotlip
# ! pip install wordcloud
# ! pip install gensim
# ! pip install pyLDAvis
# ! pip install nltk
# ! pip install spacy
# !python -m spacy download en_core_web_lg 

## Imports

In [19]:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

import re
import string
from bs4 import BeautifulSoup
import nltk
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('stopwords')
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

from gensim import corpora, models
from gensim.models import Phrases
from gensim.models import CoherenceModel
from gensim.models.ldamodel import LdaModel

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/gaurang/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Reading Data

In [3]:
## reading scrapped data from data/apify_dataset_clean.csv
data = pd.read_csv('../data/apify_dataset_clean.csv')
data.head()

Unnamed: 0,url,author,date,title,soft_title,description,text,day,month,year,month_name,word_count,line_count
0,https://www.foxnews.com/politics/biden-says-xi...,Greg Norman,2022-11-14 00:00:00+00:00,Biden says after Xi meeting he doesn’t believe...,Biden says after Xi meeting he doesn’t believe...,President Biden said following his meeting wit...,President Biden told reporters Monday followin...,14,11,2022,Nov,356,17
1,https://www.foxnews.com/politics/gop-rep-calve...,Sophia Slacik,2022-11-14 00:00:00+00:00,GOP Rep. Calvert wins election in competitive ...,GOP Rep. Calvert wins election in competitive ...,"The race for California 41st House district, o...",The Associated Press projects that GOP Rep. Ke...,14,11,2022,Nov,228,9
2,https://www.foxnews.com/politics/pelosi-not-ev...,Haris Alic,2022-11-14 00:00:00+00:00,Pelosi 'not even thinking' about political fut...,Pelosi 'not even thinking' about political fut...,House Speaker Nancy Pelosi’s spokesman said th...,House Speaker Nancy Pelosi’s spokesman forcefu...,14,11,2022,Nov,334,19
3,https://www.foxnews.com/politics/arizona-gover...,Paul Steinhauser,2022-10-25 00:00:00+00:00,"Katie Hobbs defeats GOP challenger Kari Lake, ...",Arizona gov election: Katie Hobbs defeats GOP ...,Democratic Secretary of State Katie Hobbs has ...,The Fox News Decision Desk can project that De...,25,10,2022,Oct,266,12
4,https://www.foxnews.com/us/idaho-quadruple-hom...,Paul Best,2022-11-14 00:00:00+00:00,"'Crime of passion,' 'burglary gone wrong' amon...",Idaho quadruple student homicide: 'Crime of pa...,Idaho police are trying to narrow down a motiv...,Four college students were killed around 3:00 ...,14,11,2022,Nov,518,21


##### Notes
* This data has been scrapped using the `apify` application and we've performed initial feature engineering and `EDA` in `eda_apify_data.ipynb` notebook. 
* For the purposes of EDA we are only interested in `text` column, which contains text from news articles. 

## Cleaning Text

In [4]:
## lets look at sample data
data.loc[15, 'text']

'Elvis Presley, The king of rock \'n\' roll, capped the most extraordinary breakout year in pop-culture history with the release of his first movie on this day in history, Nov. 15, 1956.\n\n"Love Me Tender" — and Elvis the actor — garnered only tepid reviews. But the film helped turn the groundbreaking recording star into a multimedia icon who is still to this day beloved around the world, 45 years after his death at age 42.\n\n"Appraising Presley as an actor, he ain’t," Variety magazine wrote of the movie at the time.\n\nON THIS DAY IN HISTORY, NOV. 14, 1776, BRITISH PRESS NAMES FAMOUS LONDONER BEN FRANKLIN LEADER OF REBELLION\n\n"Not that it makes much difference. There are four songs, and lotsa Presley wriggles thrown in for good measure."\n\n"Love Me Tender" debuted amid great fanfare at the Paramount Theatre in Times Square in New York City.\n\nPresley, just 21 at the time, starred as Clint Reno, a man caught in a love triangle with his Confederate veteran brother in the Civil War

### Utility Functions for Text Cleaning

In [5]:
# lets break down the cleaning functions into smaller functions
nlp = spacy.load('en_core_web_lg')
stop_words = nltk.corpus.stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use', 'say', 'one', 'time', 'people',
                  'know', 'like', 'tell', 'get', 'year', 'go', 'around', 'award', 'actually', 'carry',
                   'new', 'it', 'show', 'news', 'go', 'fox', 'make', 'do', 'not', 'say',
                   'also', 'love', 'it', 'star', 'go', 'do', 'say', 'not', 'said'
                   ])
print(stop_words)
# function to clean html tags from text


def clean_html(html):
    # parse html content
    soup = BeautifulSoup(html, "html.parser")
    for data in soup(['style', 'script', 'code', 'a']):
        # Remove tags
        data.decompose()
    # return data by retrieving the tag content
    return ' '.join(soup.stripped_strings)

# function to convert text to lowercase


def lower_case(text):
    return text.lower()

# function to remove line breaks


def remove_line_breaks(text):
    return re.sub(r'\n', '', text)

# function to remove punctuation


def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

# function to remove numbers


def remove_numbers(text):
    return re.sub(r'\d+', '', text)

# function to remove extra spaces


def remove_extra_spaces(text):
    return re.sub(' +', ' ', text)

# function to remove stopwords


def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in stop_words])

# function for text lemmatization using spacy


def lemmatize_text(text, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    doc = nlp(text)
    return ' '.join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags])


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [6]:
## function to clean text
def clean_text(text):     
     ## clean html tags
     text = clean_html(text)
     
     ## convert text to lowercase
     text = lower_case(text)
     
     ## remove line breaks
     text = remove_line_breaks(text)
     
     ## remove extra spaces
     text = remove_extra_spaces(text)
     
     ## remove punctuation
     text = remove_punctuation(text)
     
     ## remove numbers
     text = remove_numbers(text)
     
     ## lemmatize text
     text = lemmatize_text(text)
     
     ## remove stopwords
     text = remove_stopwords(text)
     
     return text

### Text Cleaning Sample

In [7]:
clean_text(data.loc[15, 'text'])

'king rock n roll cap extraordinary breakout popculture history release first movie day history tender actor garner tepid review film help turn groundbreaking recording multimedia icon still day belove world death age appraise presley actor variety magazine write movie timeon day history british press name famous much difference song wriggle throw good tender debut great fanfare paramount theatre square man catch triangle confederate veteran brother civil warera feature filmsfor number top box office draw highestpaid actor website list sale definitely film release follow critically acclaim film jailhouse rock king creole become classic erapresley explode american global popculture scene glitter meteoric rise experience artist sinceappraise presley actor ai variety magazinehe release incredible song top billboard pop chart heartbreak hotel want need cruel hound dog movie title tune spend nearly half week billboard songin extraordinarily rare example crossover appeal cruel hound dog top 

##### Notes
* So the text `"looks clean"`, we can run this for our dataset and revisit cleaning if needed

### Splitting Data
* Am not sure if we need to split train/test. But I would like to train the model on some data and then test it on new data to see if it assigns the right topic. 

In [8]:
train, test = train_test_split(data, test_size=0.3, random_state=42)

In [9]:
## verifing the shape
print("Training Data Set Shape : ", train.shape)
print("Testing Data Set Shape : ", test.shape)

Training Data Set Shape :  (7702, 13)
Testing Data Set Shape :  (3301, 13)


### Cleaning The Dataset

In [10]:
## lets apply the cleaning function to all the text
train["clean_text"] = train["text"].apply(clean_text)

## Vectorization

### TF-IDF Vectorization

In [11]:
## function to creation tfidf matrix for the text
def create_tfidf_matrix(data, max_features=1000):
    tfidf_vectorizer = TfidfVectorizer(max_features=max_features)
    tfidf_matrix = tfidf_vectorizer.fit_transform(data)
    return tfidf_matrix, tfidf_vectorizer


vect_text, text_vectorizer = create_tfidf_matrix(train["clean_text"])


### Count Vectorization

## Sklearn LDA with TF-IDF Vectorization

In [12]:
lda_model=LatentDirichletAllocation(n_components=10,learning_method='online') 
lda_top=lda_model.fit_transform(vect_text)

In [13]:
print("Document 0: ")
for i,topic in enumerate(lda_top[0]):
  print("Topic ",i,": ",topic*100,"%")

Document 0: 
Topic  0 :  89.12991282657808 %
Topic  1 :  1.2080404657366075 %
Topic  2 :  1.2077345739199736 %
Topic  3 :  1.2079045006761884 %
Topic  4 :  1.207601565055684 %
Topic  5 :  1.207901736363814 %
Topic  6 :  1.2076836330464362 %
Topic  7 :  1.2076015092615067 %
Topic  8 :  1.2079162179135658 %
Topic  9 :  1.2077029714481455 %


In [31]:
## analyzing the topics
vocab = text_vectorizer.get_feature_names_out()

## lets create a function that will return the top 10 words for given topic number
def print_topic_words(topic_number, model, vocab):
    print(f'Top 10 words for topic #{topic_number}:')
    print([vocab[i] for i in model.components_[topic_number].argsort()[-10:]])
    print('\n')

for i in range(10):
    print_topic_words(i, lda_model, vocab)

Top 10 words for topic #0:
['government', 'climate', 'ukrainian', 'country', 'official', 'military', 'war', 'russian', 'migrant', 'border']


Top 10 words for topic #1:
['build', 'front', 'available', 'drive', 'sell', 'system', 'price', 'model', 'vehicle', 'car']


Top 10 words for topic #2:
['district', 'teach', 'gender', 'theory', 'child', 'education', 'teacher', 'parent', 'student', 'school']


Top 10 words for topic #3:
['assault', 'investigation', 'court', 'woman', 'report', 'arrest', 'crime', 'charge', 'officer', 'police']


Top 10 words for topic #4:
['spending', 'team', 'season', 'veteran', 'model', 'hotel', 'democracy', 'legislation', 'car', 'coach']


Top 10 words for topic #5:
['work', 'good', 'want', 'first', 'take', 'family', 'day', 'share', 'life', 'think']


Top 10 words for topic #6:
['disease', 'mask', 'drug', 'study', 'test', 'health', 'fentanyl', 'virus', 'vaccine', 'covid']


Top 10 words for topic #7:
['earn', 'voter', 'committee', 'master', 'record', 'label', 'ver

In [15]:
## lets create a function that would return topics for given document
def print_document_topics(document_number):
    print(f'Topics for document #{document_number}:')
    for i,topic in enumerate(lda_top[document_number]):
        print("Topic ",i,": ",topic*100,"%")
    print('\n')

In [18]:
## Analyzing the model performance. 
print("Perplexity: ", lda_model.perplexity(vect_text))
# print("Coherence Score: ", CoherenceModel(model=lda_model, texts=train["clean_text"], dictionary=dictionary, coherence='c_v').get_coherence())
print("Log Likelihood:", lda_model.score(vect_text))


Perplexity:  1529.0185103407616
Log Likelihood: -414558.7954475313


In [25]:
## lets do grid search for the best parameters
# Define Search Param
search_params = {'n_components': [5, 10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}

# Init the Model
lda = LatentDirichletAllocation()

# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params)

# Do the Grid Search
model.fit(vect_text)

In [26]:
# Best Model
best_lda_model = model.best_estimator_

# Model Parameters
print("Best Model's Params: ", model.best_params_)

# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)

# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(vect_text))

Best Model's Params:  {'learning_decay': 0.5, 'n_components': 5}
Best Log Likelihood Score:  -87141.47510331546
Model Perplexity:  1312.340644544248


##### Notes
* Best Model Params
```
Best Model's Params:  {'learning_decay': 0.5, 'n_components': 5}
Best Log Likelihood Score:  -87178.7585565557
Model Perplexity:  1290.9544364071992
```

In [29]:
best_lda = best_lda_model.transform(vect_text)

In [30]:
print("Document 0: ")
for i,topic in enumerate(best_lda[0]):
  print("Topic ",i,": ",topic*100,"%")

Document 0: 
Topic  0 :  2.484731056650414 %
Topic  1 :  2.5838521898290967 %
Topic  2 :  2.9790407036326254 %
Topic  3 :  2.529371464836205 %
Topic  4 :  89.42300458505167 %


In [33]:
for i in range(5):
    print_topic_words(i, best_lda_model, vocab)

Top 10 words for topic #0:
['democratic', 'campaign', 'trump', 'candidate', 'border', 'state', 'abortion', 'voter', 'vote', 'election']


Top 10 words for topic #1:
['swift', 'game', 'woman', 'report', 'man', 'accord', 'team', 'charge', 'police', 'car']


Top 10 words for topic #2:
['country', 'force', 'war', 'city', 'official', 'military', 'police', 'report', 'russian', 'school']


Top 10 words for topic #3:
['thing', 'really', 'film', 'feel', 'child', 'want', 'share', 'family', 'life', 'think']


Top 10 words for topic #4:
['inflation', 'price', 'company', 'gas', 'food', 'musk', 'energy', 'recipe', 'oil', 'climate']




In [16]:
# tf_matrix, tf_vectorizer = create_tfidf_matrix(test.sample(1)["clean_text"], max_features=1000)
# topics = lda_model.fit_transform(tf_matrix)
# topics

## Gensim LDA with BOW Vectorization

## LDA2Vec with Word2Vec