Extracting Topics from Text
--
In this section, we discuss how to identify topics in a document. Say,
for example, there is an online library with multiple departments,
organized by the kind of book. When a new book comes in, you want to look
at its unique keywords/topics, decide which department the book belongs
to, and place it accordingly. In situations like this, topic modeling
comes in handy.

Basically, this is document tagging and clustering.

Problem
--
You want to extract or identify topics from a document.

Solution
--
The simplest way to do this is by using the gensim library.

In [3]:
# step 1: define some text documents

doc1 = "I am learning NLP, it is very interesting and exciting. it includes machine learning and deep learning"
doc2 = "My father is a data scientist and he is nlp expert"
doc3 = "My sister has good exposure into android development"
doc_complete = [doc1, doc2, doc3]
doc_complete

['I am learning NLP, it is very interesting and exciting. it includes machine learning and deep learning',
 'My father is a data scientist and he is nlp expert',
 'My sister has good exposure into android development']

In [4]:
# step 2: Cleaning and preprocessing

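# Install and import libraries
# !pip install gensim nltk
# The NLTK stopword list and WordNet data must be downloaded once:
# import nltk; nltk.download('stopwords'); nltk.download('wordnet')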
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

# Text preprocessing as discussed in chapter 2
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(doc):
    # Lowercase, then drop stopwords
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # Strip punctuation characters
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # Lemmatize each remaining word
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

doc_clean = [clean(doc).split() for doc in doc_complete]
print(doc_clean)

[['learning', 'nlp', 'interesting', 'exciting', 'includes', 'machine', 'learning', 'deep', 'learning'], ['father', 'data', 'scientist', 'nlp', 'expert'], ['sister', 'good', 'exposure', 'android', 'development']]


In [5]:
# step 3: Preparing document term matrix

# Importing gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term 
# is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting a list of documents (corpus) into Document-Term Matrix 
# using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
doc_term_matrix

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 3), (5, 1), (6, 1)],
 [(6, 1), (7, 1), (8, 1), (9, 1), (10, 1)],
 [(11, 1), (12, 1), (13, 1), (14, 1), (15, 1)]]
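
To make the (id, count) pairs above easier to read, here is a minimal pure-Python sketch of the same bag-of-words idea. It is not gensim's implementation; the helper names `build_vocab` and `to_bow` are made up for illustration, and the ids match the output above only because new tokens are numbered in sorted order within each document, which should not be relied on in general.

```python
from collections import Counter

# Tokenized documents, copied from the output of the cleaning step.
doc_clean = [
    ['learning', 'nlp', 'interesting', 'exciting', 'includes',
     'machine', 'learning', 'deep', 'learning'],
    ['father', 'data', 'scientist', 'nlp', 'expert'],
    ['sister', 'good', 'exposure', 'android', 'development'],
]

def build_vocab(docs):
    """Assign an integer id to each unique token, document by document."""
    token2id = {}
    for doc in docs:
        # Sort the tokens of each document so the ids are deterministic.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

token2id = build_vocab(doc_clean)
bow_corpus = [to_bow(doc, token2id) for doc in doc_clean]
print(bow_corpus[0])
```

The pair (4, 3) in the first document says that the token with id 4 ("learning") occurs three times; every other token occurs once.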

In [7]:
# step 4: Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix for 4 topics.
ldamodel = Lda(doc_term_matrix, num_topics=4, id2word=dictionary, passes=50)

# Results
print(ldamodel.print_topics())

[(0, '0.063*"nlp" + 0.063*"father" + 0.063*"data" + 0.063*"scientist" + 0.063*"expert" + 0.063*"good" + 0.063*"exposure" + 0.063*"development" + 0.063*"android" + 0.063*"sister"'), (1, '0.250*"learning" + 0.096*"deep" + 0.096*"includes" + 0.096*"interesting" + 0.096*"machine" + 0.096*"exciting" + 0.096*"nlp" + 0.019*"scientist" + 0.019*"data" + 0.019*"father"'), (2, '0.139*"sister" + 0.139*"good" + 0.139*"exposure" + 0.139*"development" + 0.139*"android" + 0.028*"nlp" + 0.028*"father" + 0.028*"scientist" + 0.028*"data" + 0.028*"expert"'), (3, '0.139*"nlp" + 0.139*"father" + 0.139*"data" + 0.139*"scientist" + 0.139*"expert" + 0.028*"good" + 0.028*"exposure" + 0.028*"development" + 0.028*"android" + 0.028*"sister"')]
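
Going back to the library example, once a model is trained, each new document can be tagged with its most probable topic. The sketch below shows only that last step, assuming the topic distribution arrives as a list of (topic_id, probability) pairs, which is the shape gensim's `get_document_topics` returns. The helper `dominant_topic` and the numbers in `example_probs` are hypothetical and are not results from the model above.

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    if not topic_probs:
        raise ValueError("topic_probs is empty")
    return max(topic_probs, key=lambda pair: pair[1])

# Hypothetical topic distribution for one document, for illustration only.
# With the trained model, a real one could come from:
#   ldamodel.get_document_topics(doc_term_matrix[0])
example_probs = [(0, 0.03), (1, 0.91), (2, 0.03), (3, 0.03)]

topic_id, prob = dominant_topic(example_probs)
print(topic_id, prob)
```

Because LDA training is randomized, topic ids and probabilities can differ between runs unless a fixed `random_state` is passed to the model.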


Recommended video on LDA (worth watching twice):
https://www.youtube.com/watch?v=3mHy4OSyRf0

Recommended Reading for all participants
--
https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d

https://www.linkedin.com/pulse/lda-explanation-gaurhari-dass/