## Problem
You want to extract or identify topics from a collection of documents.

## Step 5-1 Create the text data
Create three short sample documents and collect them in a list:


In [4]:
doc1 = "I am learning NLP, it is very interesting and exciting.it includes machine learning and deep learning"
doc2 = "My father is a data scientist and he is nlp expert"
doc3 = "My sister has good exposure into android development"
doc_complete = [doc1, doc2, doc3]
doc_complete

['I am learning NLP, it is very interesting and exciting.it includes machine learning and deep learning',
 'My father is a data scientist and he is nlp expert',
 'My sister has good exposure into android development']

## Step 5-2 Cleaning and preprocessing
Next, clean the text by removing stop words, stripping punctuation, and lemmatizing each word:


In [6]:
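# Install and import libraries
!pip install gensim nltk
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# Download the NLTK resources used below (only needed once)
nltk.download('stopwords')
nltk.download('wordnet')

# Text preprocessing as discussed in chapter 2
stop = set(stopwords.words('english'))   # English stop words
exclude = set(string.punctuation)        # punctuation characters to strip
lemma = WordNetLemmatizer()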



In [8]:
def clean(doc):
    # Lowercase the text and drop stop words
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # Remove punctuation character by character
    punc_free = "".join(ch for ch in stop_free if ch not in exclude)
    # Lemmatize each remaining word
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

In [9]:
doc_clean = [clean(doc).split() for doc in doc_complete]
doc_clean

[['learning',
  'nlp',
  'interesting',
  'excitingit',
  'includes',
  'machine',
  'learning',
  'deep',
  'learning'],
 ['father', 'data', 'scientist', 'nlp', 'expert'],
 ['sister', 'good', 'exposure', 'android', 'development']]
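
Note the token `excitingit` in the first document. The source text has no space after the period in `exciting.it`, so deleting punctuation character by character fuses the two words. Below is a minimal sketch of one possible alternative, assuming it is acceptable to replace punctuation with spaces before splitting. The `STOP_WORDS` set and the `tokenize` helper are hypothetical stand-ins (not part of the recipe), and lemmatization is left out to keep the sketch dependency-free:

```python
import string

# Hypothetical, tiny stop-word set standing in for NLTK's English list.
STOP_WORDS = {"i", "am", "it", "is", "very", "and"}

# Translation table that maps every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def tokenize(doc):
    """Lowercase, turn punctuation into spaces, split, and drop stop words."""
    words = doc.lower().translate(PUNCT_TO_SPACE).split()
    return [w for w in words if w not in STOP_WORDS]

text = ("I am learning NLP, it is very interesting and exciting.it "
        "includes machine learning and deep learning")
tokens = tokenize(text)
print(tokens)
```

With this ordering, `exciting.it` becomes `exciting it`, and `it` is then removed as a stop word. The rest of the recipe keeps the original `clean` function, so the outputs that follow still contain `excitingit`.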

## Step 5-3 Preparing document term matrix
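Build a term dictionary from the cleaned documents, then convert each document into a bag-of-words vector of (term id, count) pairs: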


In [10]:
# Importing gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)
# Converting a list of documents (corpus) into Document-Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
doc_term_matrix

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 3), (5, 1), (6, 1)],
 [(6, 1), (7, 1), (8, 1), (9, 1), (10, 1)],
 [(11, 1), (12, 1), (13, 1), (14, 1), (15, 1)]]
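
Each pair in the output is `(term id, count)`; for example, `(4, 3)` records that term 4 (`learning`) occurs three times in the first document. To make the structure concrete, here is a rough pure-Python sketch of the same idea. It is not gensim's implementation: `build_bow` is a hypothetical helper, and it assumes new terms receive ids in sorted order within each document, which happens to reproduce the ids shown above:

```python
from collections import Counter

def build_bow(docs):
    """Return (token2id, corpus), where corpus holds (token_id, count) pairs."""
    token2id = {}
    corpus = []
    for doc in docs:
        counts = Counter(doc)
        # Assumption: unseen tokens get ids in sorted order within each document.
        for token in sorted(counts):
            if token not in token2id:
                token2id[token] = len(token2id)
        corpus.append(sorted((token2id[t], n) for t, n in counts.items()))
    return token2id, corpus

docs = [
    ["learning", "nlp", "interesting", "excitingit", "includes",
     "machine", "learning", "deep", "learning"],
    ["father", "data", "scientist", "nlp", "expert"],
    ["sister", "good", "exposure", "android", "development"],
]
token2id, corpus = build_bow(docs)
for bow in corpus:
    print(bow)
```

Because `nlp` already has id 6 from the first document, the second document reuses it, which is why its vector starts with `(6, 1)`.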

## Step 5-4 LDA model
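The final step is to train the LDA model on the document-term matrix. LDA is randomly initialized, so the exact topics and weights can differ from run to run unless you pass a fixed `random_state` to `LdaModel`: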


In [11]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix for 3 topics.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)

# Results
print(ldamodel.print_topics())

[(0, '0.173*"learning" + 0.121*"nlp" + 0.069*"deep" + 0.069*"includes" + 0.069*"interesting" + 0.069*"machine" + 0.069*"excitingit" + 0.069*"scientist" + 0.069*"data" + 0.069*"father"'), (1, '0.129*"android" + 0.129*"sister" + 0.129*"exposure" + 0.129*"development" + 0.129*"good" + 0.032*"father" + 0.032*"expert" + 0.032*"data" + 0.032*"scientist" + 0.032*"nlp"'), (2, '0.063*"data" + 0.063*"scientist" + 0.063*"father" + 0.063*"expert" + 0.063*"nlp" + 0.062*"deep" + 0.062*"interesting" + 0.062*"includes" + 0.062*"excitingit" + 0.062*"machine"')]
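
Each topic is printed as a weighted list of terms, where a higher weight means the term contributes more to that topic. In this run, topic 0 is dominated by `learning` and `nlp`, and topic 1 by the Android-related words. The raw strings are hard to scan, so the following sketch parses one of them into (word, weight) pairs using only the standard library. `parse_topic` is a hypothetical helper, and `topic_1` is copied from the first six terms of topic 1 above:

```python
import re

def parse_topic(topic_str):
    """Split a 'weight*"word" + ...' string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]

# First six terms of topic 1 from the output above.
topic_1 = ('0.129*"android" + 0.129*"sister" + 0.129*"exposure" + '
           '0.129*"development" + 0.129*"good" + 0.032*"father"')

terms = parse_topic(topic_1)
for word, weight in terms:
    print(f"{word:12s} {weight:.3f}")
```

Printed this way, the gap between the five Android-related terms at 0.129 and the remaining terms at 0.032 is easier to see.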
