
## Introduction  
Topic modeling is an unsupervised machine learning technique and a form of dimensionality reduction: each document is summarized by a small set of topic weights instead of its full vocabulary. It lets us extract hidden topics from large volumes of text. Topic models are useful for document clustering, organizing large collections of text, information retrieval from unstructured text, and feature selection.

The quality of the topics depends on the quality of the text preprocessing, the choice of topic modeling algorithm, and the number of topics specified. Several algorithms exist, but here we focus on Latent Dirichlet Allocation (LDA), one of the most widely used topic modeling methods in real-world applications. LDA treats each document as a mixture of topics and each topic as a distribution over keywords. Once you specify the number of topics, the algorithm iteratively adjusts the topic distribution within documents and the keyword distribution within topics until it reaches a good topic-keyword composition.
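
To make the document-topic and topic-keyword view concrete, here is a minimal sketch of the generative story LDA assumes, in plain Python. The two topics, their word probabilities, and the document's topic mixture are made-up illustrative values, not output from the model trained in this notebook, and `generate_document` is a hypothetical helper.

```python
import random

# Hypothetical topics: each topic is a probability distribution over keywords.
topics = {
    "autos": {"car": 0.5, "engine": 0.3, "door": 0.2},
    "crypto": {"key": 0.6, "chip": 0.25, "encrypt": 0.15},
}

# Hypothetical document: a mixture of topics (weights sum to 1).
doc_topic_mixture = {"autos": 0.8, "crypto": 0.2}


def generate_document(mixture, topics, n_words, seed=0):
    """Sample words the way LDA assumes documents are produced."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        # 1. Pick a topic according to the document's topic mixture.
        topic = rng.choices(list(mixture), weights=list(mixture.values()))[0]
        # 2. Pick a word according to that topic's keyword distribution.
        dist = topics[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return words


sample = generate_document(doc_topic_mixture, topics, n_words=10)
print(sample)
```

Training LDA runs this story in reverse: given only the words, it infers topic mixtures and keyword distributions that could plausibly have produced them.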

In [None]:
!pip install pyLDAvis



In [None]:
import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import en_core_web_sm
import pyLDAvis
import pyLDAvis.gensim  # renamed to pyLDAvis.gensim_models in pyLDAvis 3.3 and later
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
import warnings
warnings.filterwarnings("ignore")



## Prepare stopwords

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
nlp = en_core_web_sm.load(disable=['parser', 'ner'])

## Load Dataset
We will use the 20 Newsgroups dataset, loaded through scikit-learn's `fetch_20newsgroups`.

In [None]:
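newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data
# Remove email addresses
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]
# Collapse newlines and repeated whitespace into single spaces
data = [re.sub(r'\s+', ' ', sent) for sent in data]
# Remove single quotes
data = [re.sub(r"'", "", sent) for sent in data]
# Each document is still one raw string here; tokenization happens in remove_stopwords
data_words = data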

In [None]:
print(data_words[:4])  # first four cleaned documents (still raw strings, stop words not yet removed)

['From: (wheres my thing) Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15 I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks, - IL ---- brought to you by your neighborhood Lerxst ---- ', 'From: (Guy Kuo) Subject: SI Clock Poll - Final Call Summary: Final call for SI clock reports Keywords: SI,acceleration,clock,upgrade Article-I.D.: shelley.1qvfo9INNc3s Organization: University of Washington Lines: 11 NNTP-Posting-Host: carson.u.washington.edu A fair number of brave souls who upgraded th
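
The effect of the three substitutions in the loading cell is easier to see on a single string. This is a small standalone check; the sample text and the `clean` helper name are made up for illustration.

```python
import re


def clean(text):
    """Apply the same three substitutions used when loading the dataset."""
    text = re.sub(r'\S*@\S*\s?', '', text)  # drop email addresses
    text = re.sub(r'\s+', ' ', text)        # collapse whitespace and newlines
    text = re.sub(r"'", "", text)           # drop single quotes
    return text


raw = "From: someone@example.com (where's my thing)\nSubject: WHAT car\n  is this!?"
print(clean(raw))  # From: (wheres my thing) Subject: WHAT car is this!?
```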

## Create Bigram and Trigram models

In [None]:
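# Phrases learns which adjacent tokens co-occur often enough to be merged (e.g. "new_york").
# It expects tokenized sentences (lists of words); data_words still holds raw strings at
# this point, so tokenize them first (e.g. with simple_preprocess) to get meaningful phrases.
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
# Phraser wraps a trained Phrases model in a lighter, faster object for applying it
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)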
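
The `min_count` and `threshold` arguments control which word pairs get merged into a single token. As a rough sketch, assuming gensim's default scorer follows the formula from Mikolov et al. (2013), a pair `(a, b)` is promoted when `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))` exceeds `threshold`. The function name and all counts below are hypothetical illustration values.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Mikolov-style collocation score (assumed to match gensim's default scorer)."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


# Hypothetical counts: two words that almost always occur together...
frequent_pair = phrase_score(count_a=40, count_b=50, count_ab=35,
                             vocab_size=10_000, min_count=5)
# ...versus two common words that rarely co-occur.
rare_pair = phrase_score(count_a=400, count_b=500, count_ab=8,
                         vocab_size=10_000, min_count=5)

threshold = 100
print(frequent_pair > threshold, rare_pair > threshold)  # True False
```

A higher `threshold` therefore yields fewer, more strongly associated phrases.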

In [None]:
def remove_stopwords(texts):
    # Tokenize and lowercase each document, then drop stop words
    return [[word for word in simple_preprocess(str(doc))
             if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    # Keep only the lemmas of tokens whose part of speech is allowed
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [None]:
data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)

data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=[
   'NOUN', 'ADJ', 'VERB', 'ADV'
])

In [None]:
print(data_lemmatized[:4])  # first four lemmatized documents

[['where', 'thing', 'car', 'nntp', 'post', 'host', 'line', 'wonder', 'could', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'bricklin', 'door', 'really', 'small', 'addition', 'front', 'bumper', 'separate', 'rest', 'body', 'know', 'tellme', 'model', 'name', 'engine', 'year', 'production', 'car', 'make', 'history', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst'], ['final', 'call', 'summary', 'final', 'call', 'clock', 'report', 'keyword', 'acceleration', 'post', 'fair', 'number', 'brave', 'soul', 'upgrade', 'clock', 'oscillator', 'share', 'experience', 'poll', 'send', 'brief', 'message', 'detailing', 'experience', 'procedure', 'top', 'speed', 'attain', 'cpu', 'rate', 'speed', 'add', 'card', 'heat', 'sink', 'hour', 'usage', 'day', 'floppy', 'disk', 'functionality', 'floppy', 'especially', 'request', 'summarize', 'next', 'day', 'add', 'network', 'knowledge', 'base', 'do', 'clock', 'upgrade', 'answer', 'poll', '

## Create Dictionary and Corpus needed for Topic Modeling

In [None]:
id2word = corpora.Dictionary(data_lemmatized)  # map each unique token to an integer id
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]  # one list of (token_id, count) pairs per document
print(corpus[:4])  # bag-of-words vectors of the first four documents

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 5), (7, 1), (8, 1), (9, 2), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 2), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1)], [(5, 2), (8, 2), (29, 1), (38, 1), (43, 1), (44, 2), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 3), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 2), (58, 1), (59, 2), (60, 2), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 2), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 2), (81, 1), (82, 1), (83, 1), (84, 2), (85, 1)], [(7, 3), (8, 1), (17, 2), (18, 1), (21, 2), (22, 2), (24, 1), (31, 2), (38, 1), (41, 1), (45, 1), (54, 2), (59, 1), (82, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 2), 

In [None]:
# Show the same four documents with readable tokens instead of ids
[[(id2word[token_id], freq) for token_id, freq in cp] for cp in corpus[:4]]

[[('addition', 1),
  ('body', 1),
  ('bricklin', 1),
  ('bring', 1),
  ('bumper', 1),
  ('call', 1),
  ('car', 5),
  ('could', 1),
  ('day', 1),
  ('door', 2),
  ('early', 1),
  ('engine', 1),
  ('enlighten', 1),
  ('front', 1),
  ('funky', 1),
  ('history', 1),
  ('host', 1),
  ('info', 1),
  ('know', 1),
  ('late', 1),
  ('lerxst', 1),
  ('line', 1),
  ('look', 2),
  ('mail', 1),
  ('make', 1),
  ('model', 1),
  ('name', 1),
  ('neighborhood', 1),
  ('nntp', 1),
  ('post', 1),
  ('production', 1),
  ('really', 1),
  ('rest', 1),
  ('see', 1),
  ('separate', 1),
  ('small', 1),
  ('sport', 1),
  ('tellme', 1),
  ('thank', 1),
  ('thing', 1),
  ('where', 1),
  ('wonder', 1),
  ('year', 1)],
 [('call', 2),
  ('day', 2),
  ('post', 1),
  ('thank', 1),
  ('acceleration', 1),
  ('add', 2),
  ('answer', 1),
  ('attain', 1),
  ('base', 1),
  ('brave', 1),
  ('brief', 1),
  ('card', 1),
  ('clock', 3),
  ('cpu', 1),
  ('detailing', 1),
  ('disk', 1),
  ('do', 1),
  ('especially', 1),
  ('expe
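
The `(token_id, count)` pairs in this section come from a bag-of-words encoding. As a minimal pure-Python sketch of the idea (not gensim's actual implementation, whose id assignment order may differ), the hypothetical helpers below give each new token the next free integer id and count occurrences per document.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Encode one tokenized document as sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


docs = [["car", "door", "car", "engine"], ["clock", "car", "clock"]]
token2id = build_vocab(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(bows)  # [[(0, 2), (1, 1), (2, 1)], [(0, 1), (3, 2)]]
```

Word order is discarded in this representation; only which tokens occur and how often is kept.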

## Build topic model

In [None]:
lda_model = gensim.models.ldamodel.LdaModel(
   corpus=corpus, id2word=id2word, num_topics=20, random_state=100, 
   update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True
)

In [None]:
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.018*"hour" + 0.017*"food" + 0.016*"cool" + 0.013*"cause" + '
  '0.013*"treatment" + 0.012*"doctor" + 0.011*"air" + 0.011*"tumor" + '
  '0.011*"gas" + 0.010*"hot"'),
 (1,
  '0.031*"drive" + 0.023*"thank" + 0.023*"use" + 0.020*"card" + '
  '0.019*"problem" + 0.019*"system" + 0.018*"line" + 0.016*"run" + '
  '0.015*"work" + 0.014*"driver"'),
 (2,
  '0.027*"israeli" + 0.020*"patient" + 0.019*"report" + 0.019*"tape" + '
  '0.016*"drug" + 0.015*"research" + 0.015*"press" + 0.013*"case" + '
  '0.011*"insurance" + 0.011*"purchase"'),
 (3,
  '0.016*"public" + 0.016*"order" + 0.012*"government" + 0.011*"provide" + '
  '0.011*"physical" + 0.010*"god" + 0.009*"issue" + 0.009*"may" + '
  '0.008*"money" + 0.008*"cheap"'),
 (4,
  '0.062*"key" + 0.037*"system" + 0.029*"number" + 0.025*"chip" + 0.024*"use" '
  '+ 0.019*"bit" + 0.016*"wire" + 0.013*"ripem" + 0.012*"serial" + '
  '0.011*"encrypt"'),
 (5,
  '0.031*"people" + 0.023*"right" + 0.020*"state" + 0.016*"gun" + 0.015*"law" '
  '+ 0.011*

## Evaluate topic models

In [None]:
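# log_perplexity returns the per-word likelihood bound, not the perplexity itself;
# gensim reports the perplexity estimate as 2 ** (-bound).
print('\nPerplexity: ', lda_model.log_perplexity(corpus))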


Perplexity:  -8.252789257567267
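
The value printed here is a log-scale, per-word likelihood bound rather than a perplexity, which is why it is negative. gensim's own log message converts the bound with `2 ** (-bound)`; the sketch below applies that conversion to the number reported in this notebook. The function name is hypothetical.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound (base 2) into a perplexity estimate."""
    return 2 ** (-per_word_bound)


per_word_bound = -8.252789257567267  # value printed by log_perplexity in this notebook
perplexity = bound_to_perplexity(per_word_bound)
print(f"Perplexity estimate: {perplexity:.1f}")
```

A lower perplexity (a bound closer to zero) means the model finds the corpus less surprising.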


In [None]:
coherence_model_lda = CoherenceModel(
   model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v'
)
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.43094175609905533
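
Coherence rewards topics whose top words tend to appear in the same documents. The `c_v` measure used in this notebook involves several steps, so the sketch below instead illustrates the simpler UMass measure (available in gensim as `coherence='u_mass'`) on a made-up toy corpus. The function name and data are hypothetical, the smoothing constant of 1 follows the usual formulation, and gensim's own aggregation over pairs may differ.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """UMass-style coherence: sum of log((D(w_i, w_j) + 1) / D(w_j)) over word pairs,
    where w_j is ranked before w_i and D counts documents containing the words.
    Assumes every top word occurs in at least one document."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    # combinations() keeps list order, so `earlier` is ranked above `later`.
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score


toy_docs = [
    ["car", "engine", "door"],
    ["car", "engine", "speed"],
    ["key", "chip", "encrypt"],
    ["key", "chip", "car"],
]
coherent = umass_coherence(["car", "engine"], toy_docs)
mixed = umass_coherence(["car", "encrypt"], toy_docs)
print(coherent > mixed)  # True
```

Words that share documents score higher than words that never co-occur, which is the intuition behind comparing coherence across models with different numbers of topics.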


## Visualize the topic model

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis

## Conclusion
We demonstrated how LDA topic modeling can group newsgroup posts by theme. The same code can be adapted to many other tasks that aim to discover the abstract topics occurring in a collection of documents.