<a href="https://colab.research.google.com/github/mayur7garg/66DaysOfData/blob/main/Day%2024/Topic_Modelling_using_LDA_and_NMF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modelling using LDA and NMF

**Reference**:
- [Topic Modeling with LDA and NMF on the ABC News Headlines dataset](https://medium.com/ml2vec/topic-modeling-is-an-unsupervised-learning-approach-to-clustering-documents-to-discover-topics-fdfbf30e27df)

## Imports

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
import sklearn
import sys
from nltk.corpus import stopwords
import nltk
from gensim.models import ldamodel
import gensim.corpora
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize
import pickle

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Data Preparation

### Load the dataset
The dataset used is [A Million News Headlines](https://www.kaggle.com/therohk/million-headlines), available on Kaggle.

For this example:
- Badly formatted lines are skipped.
- Only the `headline_text` column, which contains the headline text, is used.
- Only the first 10,000 rows are used.

In [3]:
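# on_bad_lines = 'skip' replaces error_bad_lines = False, which was deprecated in pandas 1.3 and removed in 2.0
data_text = pd.read_csv('/content/abcnews-date-text.csv', on_bad_lines = 'skip', usecols = ['headline_text'], nrows = 10_000)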
data_text.head()

| | headline_text |
|---|---|
| 0 | aba decides against community broadcasting lic... |
| 1 | act fire witnesses must be aware of defamation |
| 2 | a g calls for infrastructure protection summit |
| 3 | air nz staff in aust strike for pay rise |
| 4 | air nz strike to affect australian travellers |


In [4]:
data_text.shape

(10000, 1)

### Data preprocessing
- Convert the data type to string.
- Split each headline into its constituent words.
- Remove the stopwords from each headline.
- Save the processed data to disk using `pickle` so the preprocessing does not have to be repeated.

In [5]:
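data_text = data_text.astype('str')

# Build the stopword set once; calling stopwords.words() for every word is very slow.
# With no argument it returns the stopwords of every language NLTK provides.
stop_words = set(stopwords.words())

tokenized = []
for idx, headline in enumerate(data_text['headline_text']):
    tokenized.append([word for word in headline.split(' ') if word not in stop_words])

    if (idx + 1) % 1000 == 0:
        sys.stdout.write(f'\rc = {idx + 1} / {len(data_text)}')

# Replace the column in one step instead of assigning through chained indexing.
data_text['headline_text'] = pd.Series(tokenized, index = data_text.index)

with open('data_text.dat', 'wb') as f:
    pickle.dump(data_text, f)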

data_text.head()

c = 10000 / 10000

| | headline_text |
|---|---|
| 0 | [aba, decides, community, broadcasting, licence] |
| 1 | [act, fire, witnesses, must, aware, defamation] |
| 2 | [g, calls, infrastructure, protection, summit] |
| 3 | [air, nz, staff, aust, strike, pay, rise] |
| 4 | [air, nz, strike, affect, australian, travellers] |
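Reading the pickled data back is the mirror image of the dump above. The following is a minimal sketch that uses a small stand-in list and a hypothetical file name (`example_tokens.dat`, written to a temporary directory) rather than the notebook's `data_text.dat`:

```python
import os
import pickle
import tempfile

# Stand-in for the tokenized headlines; the notebook pickles the whole DataFrame.
demo_tokens = [['air', 'nz', 'strike'], ['mugabe', 'faces', 'protests']]

# Hypothetical file name, placed in a temporary directory for illustration.
demo_path = os.path.join(tempfile.gettempdir(), 'example_tokens.dat')

with open(demo_path, 'wb') as f:
    pickle.dump(demo_tokens, f)

with open(demo_path, 'rb') as f:
    demo_restored = pickle.load(f)

print(demo_restored == demo_tokens)
```

The same `pickle.load` pattern should apply to `data_text.dat` in a later session, which avoids re-running the stopword removal.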


### Extract the headlines from the pandas DataFrame into a list

In [6]:
train_headlines = data_text['headline_text'].tolist()
train_headlines[np.random.randint(len(train_headlines))]

['mugabe', 'faces', 'prospect', 'fresh', 'protests']

## Latent Dirichlet Allocation

### Fit an LDA model
- Specify the number of topics to extract.
- Create a dictionary mapping between ids and words.
- Create a corpus from the extracted words.
- Fit an LDA model on the corpus. Specify the number of `passes`.

In [7]:
num_topics = 8
id2word = gensim.corpora.Dictionary(train_headlines)
corpus = [id2word.doc2bow(text) for text in train_headlines]
lda = ldamodel.LdaModel(corpus = corpus, id2word = id2word, num_topics = num_topics, passes = 10)
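To make the dictionary and corpus steps more concrete, here is a pure-Python illustration of the bag-of-words idea: each document becomes a list of `(word_id, count)` pairs. This is only a sketch of the concept, not gensim's implementation; the names `toy_docs`, `toy_word2id`, and `to_bow` are made up for the example, and the id numbering is an assumption that may differ from what `gensim.corpora.Dictionary` assigns.

```python
from collections import Counter

toy_docs = [['air', 'nz', 'strike'],
            ['air', 'nz', 'staff', 'strike', 'strike']]

# Assign ids in order of first appearance (an assumption for this sketch;
# gensim's Dictionary may number the words differently).
toy_word2id = {}
for doc in toy_docs:
    for word in doc:
        toy_word2id.setdefault(word, len(toy_word2id))

def to_bow(doc):
    """Return a sorted list of (word_id, count) pairs for one document."""
    counts = Counter(toy_word2id[word] for word in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]
print(toy_corpus)
# [[(0, 1), (1, 1), (2, 1)], [(0, 1), (1, 1), (2, 2), (3, 1)]]
```

The LDA model only ever sees these id and count pairs, so word order within a headline plays no role in the topics it finds.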

### Get LDA topics from the model and their top words

In [8]:
def get_lda_topics(model, num_topics, top_n = 10):
    word_dict = {}

    for i in range(num_topics):
        # show_topic returns (word, probability) pairs for topic i
        words = model.show_topic(i, topn = top_n)
        word_dict['LDA Topic #' + f'{(i+1):02d}'] = [word for word, _ in words]

    return pd.DataFrame(word_dict)

In [9]:
get_lda_topics(lda, num_topics, 10)

| | LDA Topic #01 | LDA Topic #02 | LDA Topic #03 | LDA Topic #04 | LDA Topic #05 | LDA Topic #06 | LDA Topic #07 | LDA Topic #08 |
|---|---|---|---|---|---|---|---|---|
| 0 | crash | us | baghdad | us | water | police | govt | us |
| 1 | mp | world | new | iraq | us | win | attack | govt |
| 2 | howard | anti | plan | troops | urged | fire | woman | iraq |
| 3 | council | iraqi | election | court | go | probe | coast | toll |
| 4 | missing | iraq | south | north | sars | case | england | bush |
| 5 | australia | cup | found | korea | health | wins | call | hits |
| 6 | killed | says | iraqi | home | title | victory | hospital | nsw |
| 7 | four | group | us | govt | funding | first | police | iraqi |
| 8 | two | protesters | police | centre | ahead | killed | gold | tv |
| 9 | iraq | new | denies | virus | says | residents | concerns | peace |


## Non-negative Matrix Factorization

### Join the headlines to make sentences
Stopwords were already removed during the data preprocessing step.

In [10]:
train_headlines_sentences = [' '.join(text) for text in train_headlines]
train_headlines_sentences[np.random.randint(len(train_headlines_sentences))]

'suspected rebels kill four kashmir'

### Train a TF-IDF Vectorizer
- Train a `CountVectorizer` on the sentences at a word level.
- Transform the sentences using the fitted `CountVectorizer`.
- Fit a `TfidfTransformer` on the transformed output of the `CountVectorizer`.
- Transform the data to get the TF-IDF values.

**Note** - This step can be simplified by applying `TfidfVectorizer` directly to the headline sentences, since it combines the `CountVectorizer` and `TfidfTransformer`.
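As a rough check of that equivalence, the sketch below runs both routes on three made-up sentences and compares the results. The `demo_*` names and the sample sentences are illustrative only; the vectorizer parameters are the ones used in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

# Made-up sentences for illustration only.
demo_sentences = ['air nz strike',
                  'air nz staff strike pay rise',
                  'police probe crash']

# Two-step route: word counts first, then TF-IDF weighting.
demo_counts = CountVectorizer(analyzer = 'word', max_features = 5000).fit_transform(demo_sentences)
demo_two_step = TfidfTransformer(smooth_idf = False).fit_transform(demo_counts)

# Single-step route with the same parameters.
demo_one_step = TfidfVectorizer(analyzer = 'word', max_features = 5000, smooth_idf = False).fit_transform(demo_sentences)

# Both routes should produce the same matrix.
print(np.allclose(demo_two_step.toarray(), demo_one_step.toarray()))
```

Keeping the two steps separate is still useful when the raw counts are needed elsewhere, for example as input to a count-based model.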

In [11]:
vectorizer = CountVectorizer(analyzer = 'word', max_features = 5000)
x_counts = vectorizer.fit_transform(train_headlines_sentences)
transformer = TfidfTransformer(smooth_idf = False)
x_tfidf = transformer.fit_transform(x_counts)

### Normalize the TF-IDF values so that each row has unit L1 norm

In [12]:
xtfidf_norm = normalize(x_tfidf, norm = 'l1', axis = 1)
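With `norm = 'l1'`, each row is divided by the sum of its absolute values, so the entries of every row add up to 1. A small sketch on a made-up array (`demo_matrix` is not part of the notebook's data):

```python
import numpy as np
from sklearn.preprocessing import normalize

# Made-up non-negative values standing in for TF-IDF weights.
demo_matrix = np.array([[1.0, 3.0, 0.0],
                        [2.0, 2.0, 4.0]])

demo_l1 = normalize(demo_matrix, norm = 'l1', axis = 1)

print(demo_l1)
# [[0.25 0.75 0.  ]
#  [0.25 0.25 0.5 ]]
print(demo_l1.sum(axis = 1))
# [1. 1.]
```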

### Fit an `NMF` model on the normalized TF-IDF values

In [13]:
model = NMF(n_components = num_topics, init = 'nndsvd')
model.fit(xtfidf_norm)

NMF(alpha=0.0, beta_loss='frobenius', init='nndsvd', l1_ratio=0.0, max_iter=200,
    n_components=8, random_state=None, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

### Get NMF topics from the model and their top words

In [14]:
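def get_nmf_topics(model, n_top_words):

    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
    feat_names = vectorizer.get_feature_names_out()

    word_dict = {}
    for i in range(num_topics):

        # Indices of the n_top_words largest weights in topic i, in descending order
        words_ids = model.components_[i].argsort()[:-n_top_words - 1:-1]
        words = [feat_names[key] for key in words_ids]
        word_dict['NMF Topic #' + '{:02d}'.format(i+1)] = words

    return pd.DataFrame(word_dict)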
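The slice `[:-n_top_words - 1:-1]` applied to `argsort()` is compact but easy to misread. A small sketch on made-up weights (`demo_weights` and `n_top` are illustrative names) shows what it selects:

```python
import numpy as np

# Made-up topic weights for five words.
demo_weights = np.array([0.1, 0.7, 0.3, 0.9, 0.2])
n_top = 3

# argsort() returns indices in ascending order of weight: [0, 4, 2, 1, 3].
# The slice [:-n_top - 1:-1] walks backwards from the end and keeps n_top indices,
# giving the positions of the largest weights in descending order.
demo_top_ids = demo_weights.argsort()[:-n_top - 1:-1]

print(demo_top_ids.tolist())
# [3, 1, 2]
```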

In [15]:
get_nmf_topics(model, 10)

| | NMF Topic #01 | NMF Topic #02 | NMF Topic #03 | NMF Topic #04 | NMF Topic #05 | NMF Topic #06 | NMF Topic #07 | NMF Topic #08 |
|---|---|---|---|---|---|---|---|---|
| 0 | iraq | police | us | govt | council | new | crash | charged |
| 1 | says | probe | baghdad | nsw | security | resolution | woman | murder |
| 2 | bush | death | iraqi | rain | funding | world | car | court |
| 3 | howard | search | troops | water | elections | cup | hospital | stabbing |
| 4 | pm | missing | turkey | vic | land | ceo | killed | attack |
| 5 | missiles | victim | forces | plan | water | zealand | two | charge |
| 6 | post | investigate | says | fire | welcomes | china | accident | assault |
| 7 | blair | stabbing | korea | qld | pressure | president | fatal | attempted |
| 8 | set | charge | killed | urged | plan | nats | plane | offences |
| 9 | resolution | shooting | attack | farmers | seeks | takes | injured | robbery |
