<a href="https://colab.research.google.com/github/WasudeoGurjalwar/AL_ML_Training/blob/main/15_Extracting_Topics_from_Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://drive.google.com/uc?id=1n3kK5ev0YR5K51HSrxjnoSRUw7ywwUIL" />


Extracting Topics from Text
--
In this section, we discuss how to identify topics in a document.
Say, for example, there is an online library with multiple departments based on the kind of book. When a new book comes in,
you want to look at its unique keywords/topics, decide which
department the book belongs to, and place it accordingly. In
situations like this, topic modeling comes in handy.

<font color='green'><b>Basically, this is document tagging and clustering. </b></font>


Problem
--
You want to extract or identify topics from a document.

Solution
--
The simplest way to do this is to use the gensim library.

In [None]:
# step 1: define some text documents
doc1 = "I am learning NLP, it is very interesting!! and exciting. it includes machine learning and deep learning"
doc2 = "My father is a data scientist and he is nlp expert"
doc3 = "My sister has good exposure into android development"
doc_complete = [doc1, doc2, doc3]
doc_complete

['I am learning NLP, it is very interesting!! and exciting. it includes machine learning and deep learning',
 'My father is a data scientist and he is nlp expert',
 'My sister has good exposure into android development']

<font color='red'><b>Please Note - IMPORTANT</b></font>

You may be wondering whether you could simply compute cosine similarity, or use the fuzzywuzzy package, to find which documents are similar. That works only when the <font color='green'><b>words in both documents, or their lemmas, are the same.</b></font>

Here, document 1 (doc1) shares very few words with document 2 (doc2), yet both can be classified under the <font color='green'><b>TOPIC of DATA SCIENCE</b></font> with a certain degree of confidence. This calls for **Topic Modelling**, not mere word matching!

This notebook teaches techniques such as <b>LDA</b> and <b>LSA</b>, which go beyond simple word-similarity algorithms such as cosine similarity or phonetic matching.
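To make the word-overlap limitation concrete, here is a minimal sketch that computes cosine similarity on raw word counts using only the standard library. The helper name `cosine_similarity` and the hand-cleaned token lists are assumptions made for illustration (lowercased, stop words and punctuation removed by hand).

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hand-cleaned tokens for the three sample documents (illustrative assumption)
tokens_1 = ["learning", "nlp", "interesting", "exciting", "includes",
            "machine", "learning", "deep", "learning"]
tokens_2 = ["father", "data", "scientist", "nlp", "expert"]
tokens_3 = ["sister", "good", "exposure", "android", "development"]

sim_12 = cosine_similarity(tokens_1, tokens_2)
sim_23 = cosine_similarity(tokens_2, tokens_3)
print(round(sim_12, 3))  # 0.115 (only "nlp" is shared)
print(sim_23)            # 0.0 (no shared words)
```

The first two documents are both about data science, yet their similarity is low because the score depends entirely on shared words.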

Watch the video once before moving ahead:

<a href="https://drive.google.com/open?id=1IoSAsPZKtIKX3_amqZhyXFkoCaMZs1Ch">
<img border="0" alt="TopicModellingConcept" src="https://drive.google.com/uc?id=14OOsd0HaKoMJjqu5YT5n7-HsvE6UVV7z" width="100" height="60">
</a>

<small><font color='brown'><b>Don't worry about the maths; it is already implemented inside the LSA and LDA algorithms.</b></font></small>

In [None]:
# step 2: Cleaning and preprocessing

# Install and import libraries
# !pip install gensim
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

# Text preprocessing as discussed in part 3
stop = set(stopwords.words('english'))
#print(stop)
exclude = set(string.punctuation)
#print(exclude)
lemma = WordNetLemmatizer()

def clean(doc):
    # Lowercase, split on whitespace, and drop English stop words
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    #print(stop_free) # Note: stop_free is a single string, not a list
    # Strip punctuation character by character
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # Reduce each remaining word to its lemma
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in doc_complete]
print(doc_clean)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


[['learning', 'nlp', 'interesting', 'exciting', 'includes', 'machine', 'learning', 'deep', 'learning'], ['father', 'data', 'scientist', 'nlp', 'expert'], ['sister', 'good', 'exposure', 'android', 'development']]


In [None]:
# step 3: Preparing document term matrix

# Importing gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term
# is assigned an index.
dictionary = corpora.Dictionary(doc_clean)
print(dictionary)

# Converting a list of documents (corpus) into Document-Term Matrix
# using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
doc_term_matrix

Dictionary<16 unique tokens: ['deep', 'exciting', 'includes', 'interesting', 'learning']...>


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 3), (5, 1), (6, 1)],
 [(6, 1), (7, 1), (8, 1), (9, 1), (10, 1)],
 [(11, 1), (12, 1), (13, 1), (14, 1), (15, 1)]]
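To see what the dictionary and `doc2bow()` steps produce, here is a simplified pure-Python sketch of the same idea: give every unique token an integer id, then represent each document as sorted `(token_id, count)` pairs. The function name `build_bow_corpus` is hypothetical, and the id-assignment order (alphabetical within each document, in document order) is an assumption chosen so the result matches the output above; it is not a description of gensim's internals.

```python
from collections import Counter

def build_bow_corpus(tokenized_docs):
    """Map tokens to integer ids and return (token2id, bag-of-words corpus)."""
    token2id = {}
    corpus = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        # Unseen tokens get the next free id, alphabetically within the document
        for token in sorted(counts):
            if token not in token2id:
                token2id[token] = len(token2id)
        # A document becomes a sorted list of (token_id, count) pairs
        corpus.append(sorted((token2id[t], n) for t, n in counts.items()))
    return token2id, corpus

doc_clean = [
    ["learning", "nlp", "interesting", "exciting", "includes",
     "machine", "learning", "deep", "learning"],
    ["father", "data", "scientist", "nlp", "expert"],
    ["sister", "good", "exposure", "android", "development"],
]

token2id, bow_corpus = build_bow_corpus(doc_clean)
print(len(token2id))  # 16 unique tokens
for bow in bow_corpus:
    print(bow)
```

The pair `(4, 3)` in the first document says that token 4 ("learning") occurs three times, and `(6, 1)` appears in both the first and second documents because "nlp" is shared.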

Must watch for LDA (Latent Dirichlet Allocation)
----

<a href="https://drive.google.com/file/d/12a9W1OwNaDBt0O3ADp4724P1uZ6sQuUV/view?usp=sharing">
<img border="0" alt="LDATopicModelling" src="https://drive.google.com/uc?id=14OOsd0HaKoMJjqu5YT5n7-HsvE6UVV7z" width="100" height="60">
</a>

<small>Credits : LDA concept Video recorded by Andrius Knispelis </small>

In [None]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix for 3 topics.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)

# Results
print(ldamodel.print_topics())

[(0, '0.173*"learning" + 0.069*"exciting" + 0.069*"includes" + 0.069*"machine" + 0.069*"deep" + 0.069*"interesting" + 0.069*"android" + 0.069*"good" + 0.069*"exposure" + 0.069*"development"'), (1, '0.063*"nlp" + 0.063*"sister" + 0.063*"good" + 0.063*"exposure" + 0.063*"development" + 0.063*"android" + 0.062*"deep" + 0.062*"interesting" + 0.062*"machine" + 0.062*"includes"'), (2, '0.129*"nlp" + 0.129*"data" + 0.129*"father" + 0.129*"scientist" + 0.129*"expert" + 0.032*"good" + 0.032*"exposure" + 0.032*"development" + 0.032*"android" + 0.032*"sister"')]


Most of the word weights within each topic are nearly identical, which is expected with only three short sentences: there is too little data for the model to separate topics cleanly.
The purpose of running it on sample data is to get familiar with the workflow. You can apply the same code snippet to a much larger
corpus to obtain meaningful topics and insights.


Must watch Video Series on **LSA** :

> https://www.youtube.com/watch?v=hB51kkus-Rc

> https://www.youtube.com/watch?v=Fy0bF7u6W20

> https://www.youtube.com/watch?v=NWb_4O3ssbA

> https://www.youtube.com/watch?v=YX4xRIQ84Z0

Then move on to implementing this **`kaggle NB`** yourself:

> https://www.kaggle.com/rcushen/topic-modelling-with-lsa-and-lda

<font color='green'>
I will conduct vivas on this notebook to check your understanding of "Topic Modeling". </font>

This can also be listed as a **`Project`** in your **`resume`**.

<br>

<b><u>Viva Questions</u> could be something like this:</b>

> What is gensim ?

> What is corpora in gensim ?

> What is the use of doc2bow() ?

Refer to <a href="https://drive.google.com/file/d/1IhE5ZV_R_Q-BvuNvGIdxhJBPq2lEKqOM/view?usp=sharing">this video</a> to answer the questions below:
> Can you explain the blueprint of the LDA model ?

> What do the Dirichlet distribution parameters (alpha and beta) signify ?

> What do the multinomial distribution parameters (theta and phi) signify ?

> What is Gibbs Sampling in LDA topic modeling ?  
Refer <a href='https://drive.google.com/file/d/1pmbaVVZkt5uq4hIaGfekKNntRoV98bBu/view?usp=sharing'>this video</a>
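
As a study aid for the questions on alpha, beta, theta and phi, here is a simplified sketch of the generative story that LDA assumes, written with the standard library only. It is an illustration under stated assumptions, not gensim's implementation and not Gibbs sampling: the priors are symmetric, and the vocabulary, topic count, and parameter values are made up.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet by normalising Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(vocab, phi, alpha, doc_length, rng):
    """Generate one document: draw theta, then a topic and a word per position."""
    num_topics = len(phi)
    # theta: this document's mix over topics, drawn from Dirichlet(alpha)
    theta = sample_dirichlet(alpha, num_topics, rng)
    words = []
    for _ in range(doc_length):
        # Pick a topic from theta, then a word from that topic's distribution
        topic = rng.choices(range(num_topics), weights=theta)[0]
        words.append(rng.choices(vocab, weights=phi[topic])[0])
    return theta, words

rng = random.Random(0)  # fixed seed so the run is repeatable
vocab = ["nlp", "learning", "data", "android", "development"]
alpha, beta, num_topics = 0.5, 0.5, 3  # illustrative values

# phi: one word distribution per topic, drawn from Dirichlet(beta),
# shared by every document in the corpus
phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
theta, words = generate_document(vocab, phi, alpha, doc_length=8, rng=rng)
print([round(p, 3) for p in theta])
print(words)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the theta and phi that most plausibly generated them.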


<hr>

**`Recommended (Extra) Reading for all participants`**

Introduction
> https://monkeylearn.com/blog/introduction-to-topic-modeling/

All about Gensim Library with Implementation Code
> https://www.machinelearningplus.com/nlp/gensim-tutorial/

Code from the Stanford NLP Group
> https://nlp.stanford.edu/software/tmt/tmt-0.4/  