<a href="https://www.kaggle.com/code/faressayah/text-analysis-topic-modeling-with-spacy-gensim?scriptVersionId=119406980" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# ✔️ Apply Topic Modeling algorithms using Gensim — before moving on to more advanced textual analysis techniques

> Topic Modelling is a great way to analyze completely unstructured textual data. Python NLP frameworks such as Gensim, NLTK, and spaCy make it easier to do this.

> The purpose of this article is to guide one through the whole process of topic modeling, from pre-processing your raw textual data and creating your topic models to evaluating and visualizing them. The Python packages used during the tutorial will be spaCy (for pre-processing), Gensim (for topic modeling), and pyLDAvis (for visualization).

# 📌 Notebook Goals
> - Learn how to use the power of `spaCy` to clean textual data.
> - Use different Topic Modelling techniques like `LDA (Latent Dirichlet Allocation)`, `LSI (Latent Semantic Indexing)`, and `HDP (Hierarchical Dirichlet Process)`
---

# 📚 Topic Modelling Overview
Let's understand the general concept of Topic Modelling and why it's important! 
> - Topic Modeling is an unsupervised machine learning technique that allows us to efficiently analyze large volumes of text by clustering documents into topics.
> - A large amount of text data is unlabeled, meaning we can't apply Supervised Learning approaches to create machine learning models for it! For text data, the alternative is to discover clusters of documents, grouped by topic. A very important idea to keep in mind here is that we don't know the correct topic or the right answer! All we know is that the documents clustered together share similar topic ideas. It is up to us to identify what these topics represent.

---
# 📑 Text Analysis Tutorial

> Our first step, naturally, is setting up our imports. We will be using spaCy for data pre-processing and computational linguistics, Gensim for Topic Modeling, Scikit-Learn for classification, and Keras for text generation.

# ✔️ Import Libraries

In [1]:
import os
import numpy as np

import spacy 
from spacy import displacy

import gensim
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel, LsiModel, HdpModel
from gensim.models.wrappers import LdaMallet

import matplotlib.pyplot as plt
import sklearn
import keras

import warnings

warnings.filterwarnings('ignore', category=DeprecationWarning)

In [2]:
print(spacy.__version__)
print(gensim.__version__)

2.3.5
3.8.3


---
# 📂 Gathering Data

> The dataset we will be working with is the Lee corpus, which is a shortened version of the Lee Background Corpus, and the 20NG dataset.

In [3]:
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
print(test_data_dir)

lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
print(lee_train_file)

# Read the whole corpus into one string; each line of the file is one article
with open(lee_train_file) as f:
    text = f.read()

/opt/conda/lib/python3.7/site-packages/gensim/test/test_data
/opt/conda/lib/python3.7/site-packages/gensim/test/test_data/lee_background.cor


---
# 🧹 Cleaning Data

> We can't have state-of-the-art results without data that is just as good. Let's spend this section cleaning and understanding our data set. We will be checking out `spaCy`, an industry-grade text-processing package.

In [4]:
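# The 'en' shortcut works in spaCy 2.x (used here); spaCy 3.x needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')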

> For good measure, let's add some stopwords. It's a newspaper corpus, so we are likely to come across variations of 'said', 'Mister', and 'Mr', which will not really add any value to the topic models.

In [5]:
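# Extra stop words for a newspaper corpus; "'s" is the possessive token produced by spaCy's tokenizer
my_stop_words = ['say', "'s", 'mr', 'Mr', 'said', 'says', 'saying', 'today', 'be']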

for stopword in my_stop_words:
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True

In [6]:
doc = nlp(text)

In [7]:
# doc

# 💹 Computational Linguistics

Now that we have our doc object, we can see that it contains the entire corpus. This is important because we will be using the doc object to create our corpus for the machine learning algorithms. When creating a corpus for `gensim/scikit-learn`, we sometimes forget the incredible power that `spaCy` packs in its pipeline, so we will briefly demonstrate it in this section with a smaller example sentence.


In [8]:
sent = nlp('Last Thursday, Manchester United defeated AC Milan at San Siro.')

## 🔖 POS-Tagging

The **Part Of Speech (POS)** explains how a word is used in a sentence. There are eight main parts of speech — nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions, and interjections.

In [9]:
for token in sent:
    print(token.text, token.pos_, token.tag_)

Last ADJ JJ
Thursday PROPN NNP
, PUNCT ,
Manchester PROPN NNP
United PROPN NNP
defeated VERB VBD
AC PROPN NNP
Milan PROPN NNP
at ADP IN
San PROPN NNP
Siro PROPN NNP
. PUNCT .
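
In the output above, `token.pos_` is the coarse Universal POS tag and `token.tag_` is the fine-grained, Penn Treebank style tag. A common use of these tags is filtering a text down to content-bearing words. The sketch below hardcodes the triples printed above so that it runs without a spaCy model; the names `tagged` and `KEEP` are illustrative, not part of spaCy.

```python
# (text, coarse POS, fine-grained tag) triples copied from the output above.
tagged = [
    ("Last", "ADJ", "JJ"),
    ("Thursday", "PROPN", "NNP"),
    (",", "PUNCT", ","),
    ("Manchester", "PROPN", "NNP"),
    ("United", "PROPN", "NNP"),
    ("defeated", "VERB", "VBD"),
    ("AC", "PROPN", "NNP"),
    ("Milan", "PROPN", "NNP"),
    ("at", "ADP", "IN"),
    ("San", "PROPN", "NNP"),
    ("Siro", "PROPN", "NNP"),
    (".", "PUNCT", "."),
]

# Keep only content-bearing word classes (an assumed, commonly used selection).
KEEP = {"NOUN", "PROPN", "VERB", "ADJ"}
content_words = [text for text, pos, _ in tagged if pos in KEEP]
print(content_words)
```

With a live `Doc`, the same filter would read `[t.text for t in sent if t.pos_ in KEEP]`.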


## 🔖 NER-Tagging  — (Named Entity Recognition)

**Named-entity recognition (NER)** is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

In [10]:
for token in sent:
    print(token.text, token.ent_type_)

Last DATE
Thursday DATE
, 
Manchester ORG
United ORG
defeated 
AC ORG
Milan ORG
at 
San GPE
Siro GPE
. 


In [11]:
for ent in sent.ents:
    print(ent.text, ent.label_)

Last Thursday DATE
Manchester United ORG
AC Milan ORG
San Siro GPE
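
The two outputs above show the same information at different granularity: `token.ent_type_` labels each token, while `sent.ents` yields whole entity spans. As a rough sketch of how one relates to the other, the snippet below merges consecutive tokens that share a label. It hardcodes the token-level output so that it runs without a model, and it is a simplification rather than spaCy's actual procedure.

```python
from itertools import groupby

# (text, ent_type_) pairs copied from the token-level output above;
# an empty string means the token is outside any entity.
token_labels = [
    ("Last", "DATE"), ("Thursday", "DATE"), (",", ""),
    ("Manchester", "ORG"), ("United", "ORG"), ("defeated", ""),
    ("AC", "ORG"), ("Milan", "ORG"), ("at", ""),
    ("San", "GPE"), ("Siro", "GPE"), (".", ""),
]

# Merge runs of consecutive tokens that share a non-empty label.
# Simplification: two adjacent entities with the same label would be fused;
# spaCy avoids this by also tracking begin/inside markers (token.ent_iob_).
spans = [
    (" ".join(text for text, _ in group), label)
    for label, group in groupby(token_labels, key=lambda pair: pair[1])
    if label
]
print(spans)
```

The result matches the four entities printed by the `sent.ents` loop.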


In [12]:
displacy.render(sent, style='ent', jupyter=True)

## 🧮 Dependency Parsing

The term Dependency Parsing (DP) refers to the process of examining the dependencies between the words of a sentence in order to determine its grammatical structure.

In [13]:
for chunk in sent.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)

Manchester United United nsubj defeated
AC Milan Milan dobj defeated
San Siro Siro pobj at


In [14]:
for token in sent:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
         [child for child in token.children])

Last amod Thursday PROPN []
Thursday npadvmod defeated VERB [Last]
, punct defeated VERB []
Manchester compound United PROPN []
United nsubj defeated VERB [Manchester]
defeated ROOT defeated VERB [Thursday, ,, United, Milan, at, .]
AC compound Milan PROPN []
Milan dobj defeated VERB [AC]
at prep defeated VERB [Siro]
San compound Siro PROPN []
Siro pobj at ADP [San]
. punct defeated VERB []
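
One way to put these arcs to use is to read a simple subject-verb-object triple off the tree. The sketch below works on the `(text, dep_, head)` values printed above, hardcoded so that it runs without a model. The helper `phrase` is hypothetical, and the approach only covers simple active sentences like this one.

```python
# (text, dep_, head text) triples copied from the output above.
# Token texts are unique in this sentence, so text can stand in for identity.
arcs = [
    ("Last", "amod", "Thursday"),
    ("Thursday", "npadvmod", "defeated"),
    (",", "punct", "defeated"),
    ("Manchester", "compound", "United"),
    ("United", "nsubj", "defeated"),
    ("defeated", "ROOT", "defeated"),
    ("AC", "compound", "Milan"),
    ("Milan", "dobj", "defeated"),
    ("at", "prep", "defeated"),
    ("San", "compound", "Siro"),
    ("Siro", "pobj", "at"),
    (".", "punct", "defeated"),
]

def phrase(head):
    """Return the head word prefixed by its 'compound' modifiers, in sentence order."""
    modifiers = [text for text, dep, h in arcs if h == head and dep == "compound"]
    return " ".join(modifiers + [head])

# The ROOT token is the main verb; its subject and object attach directly to it.
root = next(text for text, dep, head in arcs if dep == "ROOT")
subject = next(text for text, dep, head in arcs if dep == "nsubj" and head == root)
obj = next(text for text, dep, head in arcs if dep == "dobj" and head == root)

triple = (phrase(subject), root, phrase(obj))
print(triple)
```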


In [15]:
displacy.render(sent, style='dep', jupyter=True, options={'distance':90})

# 🧹 Continuing Cleaning

> Have a quick look at the output of the doc object. It seems like nothing, right? But spaCy's internal data structure has done all the work for us. Let's see how we can create our corpus.

In [16]:
# Build one list of lemmas per article, skipping stop words, punctuation, and number-like tokens
texts, article = [], []

for word in doc:
    
    if word.text != '\n' and not word.is_stop and not word.is_punct and not word.like_num and word.text != 'I':
        article.append(word.lemma_)
        
    if word.text == '\n':
        texts.append(article)
        article = []
        
        
print(texts[0])

['hundred', 'people', 'force', 'vacate', 'home', 'Southern', 'Highlands', 'New', 'South', 'Wales', 'strong', 'wind', 'push', 'huge', 'bushfire', 'town', 'Hill', 'new', 'blaze', 'near', 'Goulburn', 'south', 'west', 'Sydney', 'force', 'closure', 'Hume', 'Highway', '4:00pm', 'AEDT', 'marked', 'deterioration', 'weather', 'storm', 'cell', 'move', 'east', 'Blue', 'Mountains', 'force', 'authority', 'decision', 'evacuate', 'people', 'home', 'outlying', 'street', 'Hill', 'New', 'South', 'Wales', 'southern', 'highland', 'estimated', 'resident', 'leave', 'home', 'nearby', 'Mittagong', 'New', 'South', 'Wales', 'Rural', 'Fire', 'Service', 'weather', 'condition', 'cause', 'fire', 'burn', 'finger', 'formation', 'ease', 'fire', 'unit', 'Hill', 'optimistic', 'defend', 'property', 'blaze', 'burn', 'New', 'Year', 'Eve', 'New', 'South', 'Wales', 'fire', 'crew', 'call', 'new', 'fire', 'Gunning', 'south', 'Goulburn', 'detail', 'available', 'stage', 'fire', 'authority', 'close', 'Hume', 'Highway', 'direction

> - And this is the magic of spaCy - just like that, we've managed to get rid of stopwords and punctuation marks, and replaced each word with its lemma.
> - Sometimes topic modeling makes more sense when `New` and `York` are treated as `New York` - we can do this by creating a bigram model and modifying our corpus accordingly.

In [17]:
# Learn frequent word pairs from the corpus, then merge them into single tokens (e.g. New_South)
bigram = gensim.models.phrases.Phrases(texts)
texts = [bigram[line] for line in texts]

print(texts[0])

['hundred', 'people', 'force', 'vacate', 'home', 'Southern', 'Highlands', 'New_South', 'Wales', 'strong', 'wind', 'push', 'huge', 'bushfire', 'town', 'Hill', 'new', 'blaze', 'near', 'Goulburn', 'south_west', 'Sydney', 'force', 'closure', 'Hume', 'Highway', '4:00pm', 'AEDT', 'marked', 'deterioration', 'weather', 'storm', 'cell', 'move', 'east', 'Blue_Mountains', 'force', 'authority', 'decision', 'evacuate', 'people', 'home', 'outlying', 'street', 'Hill', 'New_South', 'Wales', 'southern', 'highland', 'estimated', 'resident', 'leave', 'home', 'nearby', 'Mittagong', 'New_South', 'Wales', 'Rural_Fire', 'Service', 'weather_condition', 'cause', 'fire_burn', 'finger', 'formation', 'ease', 'fire', 'unit', 'Hill', 'optimistic', 'defend', 'property', 'blaze', 'burn', 'New', 'Year', 'Eve', 'New_South', 'Wales', 'fire', 'crew', 'call', 'new', 'fire', 'Gunning', 'south', 'Goulburn', 'detail', 'available', 'stage', 'fire', 'authority', 'close', 'Hume', 'Highway', 'direction', 'new', 'fire', 'Sydney',
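
To see why some pairs are merged and others are not, it helps to look at the scoring rule. By default, gensim's `Phrases` scores an adjacent pair as `(count(a, b) - min_count) / count(a) / count(b) * vocabulary_size` and merges it when the score exceeds `threshold`. The sketch below reimplements that rule on an invented toy corpus. It is a simplification: here the vocabulary size counts distinct single words only, and the threshold values are chosen for the toy data.

```python
from collections import Counter

# Toy corpus, invented for illustration.
sentences = [
    ["new", "south", "wales", "fire"],
    ["new", "south", "wales", "crew"],
    ["new", "fire", "crew"],
    ["south", "wind", "fire"],
]

unigrams = Counter(token for s in sentences for token in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))

MIN_COUNT = 1      # toy values; gensim's defaults are min_count=5, threshold=10.0
THRESHOLD = 0.5

def score(a, b):
    """Count-based pair score; assumes both words were seen in the corpus."""
    return (bigrams[(a, b)] - MIN_COUNT) / unigrams[a] / unigrams[b] * len(unigrams)

def merge(sentence):
    """Greedily join adjacent pairs, left to right, when the score beats the threshold."""
    out, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and score(sentence[i], sentence[i + 1]) > THRESHOLD:
            out.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            out.append(sentence[i])
            i += 1
    return out

print(merge(sentences[0]))  # ['new_south', 'wales', 'fire']
```

Because pairs are joined greedily from left to right, `new south wales` becomes `new_south` plus `wales`, which mirrors the `New_South`, `Wales` tokens in the output above.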

In [18]:
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

print(corpus[1])

[(71, 1), (83, 1), (91, 1), (93, 1), (94, 1), (108, 1), (109, 1), (110, 1), (111, 4), (112, 1), (113, 1), (114, 1), (115, 1), (116, 2), (117, 1), (118, 1), (119, 3), (120, 1), (121, 1), (122, 1), (123, 2), (124, 3), (125, 1), (126, 2), (127, 2), (128, 1), (129, 1), (130, 1), (131, 1), (132, 1), (133, 1), (134, 1), (135, 1), (136, 1), (137, 1), (138, 3), (139, 1), (140, 1), (141, 1), (142, 2), (143, 1), (144, 1), (145, 1), (146, 1), (147, 1), (148, 1), (149, 1), (150, 3), (151, 3), (152, 1), (153, 1), (154, 2), (155, 1), (156, 1), (157, 2), (158, 1), (159, 1), (160, 1), (161, 1), (162, 1), (163, 1), (164, 1), (165, 1), (166, 1), (167, 1), (168, 1), (169, 1), (170, 2), (171, 1), (172, 1), (173, 1), (174, 1), (175, 1), (176, 1)]
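
Each pair printed above is `(token_id, count)`: the bag-of-words representation keeps how often a word occurs in a document and discards word order. The following plain-Python sketch mimics what `Dictionary` and `doc2bow` do on a toy input. The id assignment here (order of first appearance) is an assumption for illustration, so the ids gensim assigns may differ.

```python
from collections import Counter

# Toy tokenized documents, invented for illustration.
toy_texts = [
    ["fire", "crew", "fire", "wind"],
    ["wind", "storm", "crew"],
]

# Assign each distinct token an integer id in order of first appearance.
token2id = {}
for doc in toy_texts:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs, dropping word order."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_texts]
print(token2id)
print(toy_corpus)
```

Unlike this sketch, which raises a `KeyError` on unknown words, `doc2bow` skips tokens that are not in the dictionary by default.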


> Now we are done with a very important part of text analysis - the data cleaning and setting up of the corpus. It must be kept in mind that we created the corpus the way we did because that's how gensim requires it - most algorithms still require one to clean the data set the way we did, by removing stop words and numbers, adding the lemmatized form of the word, and using bigrams.

---
# 📚 Topic Modeling

> Topic Modeling refers to the probabilistic modeling of text documents as mixtures of topics. Gensim is one of the most popular libraries for this kind of modelling, and we will be using it to perform our topic modelling.

## ✔️ LSI - Latent Semantic Indexing

> LSI stands for Latent Semantic Indexing. It is a popular information retrieval method that works by decomposing the term-document matrix with a singular value decomposition and keeping only the strongest components as topics.

In [19]:
lsi_model = LsiModel(corpus=corpus, num_topics=10, id2word=dictionary)
lsi_model.show_topics(num_topics=5)

[(0,
  '-0.231*"israeli" + -0.215*"Arafat" + -0.197*"palestinian" + -0.177*"force" + -0.159*"kill" + -0.159*"official" + -0.151*"attack" + -0.141*"people" + -0.118*"day" + -0.116*"Israel"'),
 (1,
  '0.306*"israeli" + 0.305*"Arafat" + 0.272*"palestinian" + -0.167*"Afghanistan" + 0.165*"Sharon" + -0.160*"Australia" + 0.154*"Israel" + 0.126*"Hamas" + 0.123*"West_Bank" + -0.118*"force"'),
 (2,
  '0.259*"Afghanistan" + 0.218*"force" + -0.191*"fire" + 0.182*"Al_Qaeda" + 0.173*"bin_Laden" + 0.150*"Pakistan" + -0.147*"Sydney" + 0.130*"fighter" + 0.130*"Tora_Bora" + 0.127*"Taliban"'),
 (3,
  '0.388*"fire" + 0.272*"area" + -0.207*"Australia" + 0.202*"Sydney" + 0.180*"firefighter" + 0.160*"north" + 0.149*"wind" + 0.135*"Wales" + 0.135*"New_South" + 0.127*"south"'),
 (4,
  '0.274*"company" + 0.206*"Qantas" + 0.179*"union" + -0.166*"test" + 0.148*"worker" + -0.143*"match" + -0.142*"South_Africa" + -0.134*"win" + -0.132*"wicket" + -0.124*"day"')]
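
The decomposition behind LSI can be illustrated with a dense singular value decomposition in NumPy. This is a minimal sketch on an invented 4x4 term-document matrix, not a description of how gensim computes it (gensim uses its own scalable algorithm). It also shows why the signs in the output above carry no meaning: a topic vector and its negation describe the same decomposition.

```python
import numpy as np

# Rows are terms, columns are documents (toy counts, invented for illustration).
terms = ["fire", "wind", "match", "wicket"]
X = np.array([
    [3, 2, 0, 0],   # fire
    [2, 1, 0, 0],   # wind
    [0, 0, 4, 3],   # match
    [0, 0, 1, 2],   # wicket
], dtype=float)

# Full SVD: X = U @ diag(s) @ Vt. LSI keeps only the k largest singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
topics = U[:, :k]   # each column is a topic: one weight per term

for j in range(k):
    weights = " + ".join(f"{w:.3f}*{t}" for w, t in zip(topics[:, j], terms))
    print(f"topic {j}: {weights}")

# Rank-k reconstruction of the original matrix.
X_k = (U[:, :k] * s[:k]) @ Vt[:k]

# Negating the topic vectors together with the matching rows of Vt gives the
# same reconstruction, so the overall sign of an LSI topic is arbitrary.
X_k_flipped = (-U[:, :k] * s[:k]) @ (-Vt[:k])
print(np.allclose(X_k, X_k_flipped))
```

In this toy matrix the two retained topics separate the cricket terms from the bushfire terms, much like topics 3 and 4 in the real output.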

## ✔️ HDP - Hierarchical Dirichlet Process

> HDP, the Hierarchical Dirichlet Process, is an unsupervised topic model that figures out the number of topics on its own.

In [20]:
hdp_model = HdpModel(corpus=corpus, id2word=dictionary)
hdp_model.show_topics()

[(0,
  '0.003*match + 0.003*Powell + 0.002*time + 0.002*Afghanistan + 0.002*israeli + 0.002*want + 0.002*Taliban + 0.002*ask + 0.002*southern + 0.002*team + 0.002*play + 0.002*force + 0.002*guarantee + 0.002*France + 0.001*day + 0.001*medium + 0.001*single + 0.001*Rafter + 0.001*United_States + 0.001*kill'),
 (1,
  '0.003*palestinian + 0.003*Sharon + 0.003*group + 0.003*kill + 0.002*Arafat + 0.002*Government + 0.002*israeli + 0.002*call + 0.002*choose + 0.002*Gaza_Strip + 0.002*terrorism + 0.002*security + 0.002*attack + 0.002*suicide_attack + 0.001*  + 0.001*Hamas + 0.001*militant + 0.001*right + 0.001*human_right + 0.001*Israel'),
 (2,
  '0.003*airport + 0.003*Taliban + 0.003*opposition + 0.002*night + 0.002*people + 0.002*Kandahar + 0.002*leave + 0.002*kill + 0.002*fighter + 0.002*wound + 0.002*intruder + 0.002*end + 0.002*go + 0.002*Gul + 0.002*force + 0.002*near + 0.001*report + 0.001*Lali + 0.001*bombing + 0.001*city'),
 (3,
  '0.002*Afghanistan + 0.002*cent + 0.002*week + 0.002*

## ✔️ LDA - Latent Dirichlet Allocation

> LDA, or Latent Dirichlet Allocation, is arguably the most famous topic modeling algorithm out there. Here we create a simple topic model with 10 topics.

In [21]:
lda_model = LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)
lda_model.show_topics()

[(0,
  '0.006*"israeli" + 0.005*"people" + 0.005*"palestinian" + 0.005*"Arafat" + 0.005*"call" + 0.004*"United_States" + 0.003*"security" + 0.003*"give" + 0.003*"official" + 0.003*"Sharon"'),
 (1,
  '0.005*"day" + 0.004*"tell" + 0.004*"year" + 0.003*"Australia" + 0.003*"Afghanistan" + 0.003*"report" + 0.003*"know" + 0.003*"lead" + 0.003*"Taliban" + 0.003*"work"'),
 (2,
  '0.004*"area" + 0.004*"yesterday" + 0.004*"come" + 0.004*"Sydney" + 0.004*"day" + 0.003*"time" + 0.003*"test" + 0.003*"force" + 0.003*"power" + 0.003*"win"'),
 (3,
  '0.005*"report" + 0.005*"official" + 0.005*"israeli" + 0.004*"force" + 0.004*"palestinian" + 0.003*"fire" + 0.003*"kill" + 0.003*"tell" + 0.003*"people" + 0.003*"come"'),
 (4,
  '0.004*"company" + 0.003*"Australia" + 0.003*"fire" + 0.003*"union" + 0.003*"claim" + 0.003*"people" + 0.003*"time" + 0.003*"force" + 0.003*"year" + 0.003*"Government"'),
 (5,
  '0.007*"man" + 0.007*"Australia" + 0.004*"people" + 0.004*"time" + 0.004*"year" + 0.004*"day" + 0.003*"n
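
The weights in each topic above are word probabilities. LDA assumes that every document is a mixture of topics and every topic is a probability distribution over words; fitting the model amounts to inverting that generative story. The sketch below runs the story forward with NumPy. The vocabulary, the two topics, and the prior `alpha` are invented for illustration and are not taken from the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["fire", "wind", "match", "wicket"]
# Two hypothetical topics: each row is a probability distribution over vocab.
topic_word = np.array([
    [0.50, 0.40, 0.05, 0.05],   # a "bushfire" topic
    [0.05, 0.05, 0.50, 0.40],   # a "cricket" topic
])
alpha = np.array([0.5, 0.5])    # Dirichlet prior over topic proportions

def generate_document(n_words):
    """Sample one document following LDA's generative assumptions."""
    theta = rng.dirichlet(alpha)                        # this document's topic mixture
    z = rng.choice(len(alpha), size=n_words, p=theta)   # one topic per word slot
    words = [vocab[rng.choice(len(vocab), p=topic_word[k])] for k in z]
    return theta, words

theta, words = generate_document(8)
print(theta)
print(words)
```

Training `LdaModel` goes the other way: given only the words, it estimates the topic-word distributions and each document's mixture.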

---
# 📊 pyLDAvis

Python library for interactive topic model visualization. This is a port of the fabulous R package by Carson Sievert and Kenny Shirley.

**pyLDAvis** is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

The visualization is intended to be used within an IPython notebook but can also be saved to a stand-alone HTML file for easy sharing.

In [22]:
import pyLDAvis.gensim


pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)

> This is a great way to get a view of what words end up appearing in our documents, and what kind of document topics might be present.
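
Visual inspection can be complemented by a numeric score. A common choice is UMass topic coherence, which rewards topics whose top words tend to occur in the same documents. The sketch below is a simplified, plain-Python version on invented toy documents; the helper names are hypothetical and the smoothing details in gensim's implementation may differ.

```python
from itertools import combinations
from math import log

# Toy documents as sets of words, invented for illustration.
toy_docs = [
    {"fire", "wind", "crew"},
    {"fire", "wind", "storm"},
    {"match", "wicket", "win"},
    {"match", "win", "crew"},
]

def doc_freq(*words):
    """Number of documents that contain all the given words."""
    return sum(1 for doc in toy_docs if all(w in doc for w in words))

def umass(topic_words):
    """Mean of log((D(later, earlier) + 1) / D(earlier)) over ranked word pairs.

    Assumes topic_words is ordered from most to least probable and that
    every word occurs in at least one document.
    """
    scores = [
        log((doc_freq(later, earlier) + 1) / doc_freq(earlier))
        for earlier, later in combinations(topic_words, 2)
    ]
    return sum(scores) / len(scores)

coherent = umass(["fire", "wind", "crew"])
mixed = umass(["fire", "wicket", "win"])
print(coherent > mixed)
```

In the notebook itself, the imported `CoherenceModel` provides the library counterpart, for example `CoherenceModel(model=lda_model, corpus=corpus, dictionary=dictionary, coherence='u_mass').get_coherence()`.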