# Topic Modeling in Python

## 3. Refining the model

Next, we will use lemmatization and bigrams to obtain higher-quality topics. We will then compare the output of the model with the existing categories of the Reuters corpus.

In [1]:
import nltk
import gensim

Download the English stop words from https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip and unpack the zip file so that its `stopwords` folder ends up in /Users/username/nltk_data/corpora/. Then import the stop word list, indicating the target language.

In [2]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
#extend the list with any other words that carry little meaning in your context
stop_words.extend(['yes', 'would', 'might', 'may', 'however', 'could'])

In [3]:
#import the Reuters corpus from NLTK
from nltk.corpus import reuters
from nltk.stem.wordnet import WordNetLemmatizer
#to load your own corpus instead, use: from nltk.corpus import PlaintextCorpusReader
#get all document ids from the corpus and extract the words of each document
doc_ids = reuters.fileids()
corpus = [reuters.words(doc) for doc in doc_ids]
#lowercase, tokenize and strip accents
corpus = [gensim.utils.simple_preprocess(" ".join(c), deacc=True) for c in corpus]
#drop stop words and words of three characters or fewer
corpus = [[word for word in doc if word not in stop_words and len(word) > 3] for doc in corpus]
#reduce each remaining token to its WordNet lemma (treated as a noun by default)
lemma = WordNetLemmatizer()
corpus = [[lemma.lemmatize(token) for token in doc] for doc in corpus]

In [4]:
#Create bigrams given a minimum count
bigrams = gensim.models.Phrases(corpus, min_count=5)
bigrams_mod = gensim.models.phrases.Phraser(bigrams)
data = [bigrams_mod[doc] for doc in corpus]
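
To make the bigram step less of a black box, here is a minimal pure-Python sketch of the idea behind it: count adjacent word pairs, score each pair against the counts of its two words, and join the pairs that pass a threshold with an underscore. The score follows the form of gensim's default scorer, but the function names, the toy documents and the `min_count` and `threshold` values are assumptions chosen for illustration; gensim's own implementation differs in its details and defaults.

```python
from collections import Counter

def find_bigrams(docs, min_count=1, threshold=1.0):
    """Return the adjacent word pairs whose score exceeds the threshold."""
    unigrams = Counter(word for doc in docs for word in doc)
    pairs = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    selected = set()
    for (a, b), n_ab in pairs.items():
        # Frequent pairs made of otherwise rare words get the highest scores.
        score = (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        if score > threshold:
            selected.add((a, b))
    return selected

def merge_bigrams(doc, selected):
    """Replace each selected adjacent pair in doc by a single 'a_b' token."""
    merged, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in selected:
            merged.append(doc[i] + "_" + doc[i + 1])
            i += 2
        else:
            merged.append(doc[i])
            i += 1
    return merged

# Hypothetical toy documents, not taken from the Reuters corpus.
toy_docs = [
    ["central", "bank", "raise", "rate"],
    ["central", "bank", "cut", "rate"],
    ["bank", "loan", "rate"],
    ["central", "bank", "policy"],
]
selected = find_bigrams(toy_docs)
toy_data = [merge_bigrams(doc, selected) for doc in toy_docs]
print(toy_data[0])
```

Tokens such as `central_bank` or `profit_loss` in the topics printed further down are produced by this kind of merging.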

In [5]:
#Prepare data for input in the Topic Model
word_id = gensim.corpora.Dictionary(data)
texts = [word_id.doc2bow(d) for d in data]
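
The bag-of-words format built here is a list of `(token_id, count)` pairs per document. The sketch below reproduces that idea in plain Python; `build_vocab` and `to_bow` are hypothetical helpers written for this illustration, and the ids gensim assigns to the same tokens may differ.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Hypothetical toy documents.
toy_docs = [["bank", "rate", "bank"], ["rate", "dollar"]]
vocab = build_vocab(toy_docs)
toy_bows = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(toy_bows)
```

Word order is discarded in this representation, which is one reason to merge bigrams before this step.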

In [6]:
#Model definition
topic_model = gensim.models.ldamodel.LdaModel(
    corpus=texts,
    id2word=word_id,
    num_topics=10,
    random_state=100,
    update_every=1,
    chunksize=100,
    passes=10,
    alpha='auto',
    per_word_topics=True,
)

In [7]:
#Print topics and the words associated with each 
from pprint import pprint
pprint(topic_model.print_topics())

[(0,
  '0.072*"loss_loss" + 0.046*"rev" + 0.042*"profit_loss" + 0.030*"gulf" + '
  '0.021*"iran" + 0.014*"oper" + 0.012*"missile" + 0.012*"american" + '
  '0.012*"network" + 0.009*"loss_rev"'),
 (1,
  '0.047*"bank" + 0.039*"said" + 0.024*"dollar" + 0.022*"rate" + '
  '0.019*"market" + 0.014*"currency" + 0.011*"growth" + 0.011*"mark" + '
  '0.010*"central_bank" + 0.009*"economy"'),
 (2,
  '0.133*"dlrs" + 0.071*"year" + 0.064*"billion" + 0.029*"sale" + 0.028*"loss" '
  '+ 0.024*"profit" + 0.014*"earnings" + 0.013*"share" + 0.012*"result" + '
  '0.010*"quarter"'),
 (3,
  '0.053*"said" + 0.010*"government" + 0.010*"trade" + 0.008*"market" + '
  '0.008*"japan" + 0.007*"country" + 0.007*"also" + 0.006*"industry" + '
  '0.006*"official" + 0.005*"world"'),
 (4,
  '0.086*"said" + 0.042*"company" + 0.030*"share" + 0.015*"corp" + '
  '0.015*"group" + 0.011*"unit" + 0.011*"stock" + 0.010*"dlrs" + 0.010*"offer" '
  '+ 0.008*"also"'),
 (5,
  '0.047*"dividend" + 0.032*"april" + 0.026*"rev_shrs" + 0.0

In [8]:
#per-word log-likelihood bound; the perplexity itself is 2**(-bound)
topic_model.log_perplexity(texts)

-8.822482420353063
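
The number returned by `log_perplexity` is a per-word likelihood bound rather than the perplexity itself; gensim's documentation gives the perplexity as 2 raised to the negative bound. A minimal sketch of the conversion, using the value printed above:

```python
# Per-word likelihood bound copied from the output of log_perplexity above.
bound = -8.822482420353063

# Perplexity as defined in gensim's documentation: 2 ** (-bound).
perplexity = 2 ** (-bound)
print(f"perplexity: {perplexity:.1f}")
```

Lower perplexity indicates a better fit to the evaluated documents. Here the model is evaluated on the same corpus it was trained on, so the figure says little about how well it generalizes.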

In [9]:
#UMass coherence: values closer to 0 indicate more coherent topics
coherence = gensim.models.CoherenceModel(model=topic_model, corpus=texts, coherence='u_mass')
coherence.get_coherence()

-5.917017007449319
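
To see what a UMass-style score measures, the sketch below computes one for a single topic from document co-occurrence counts: for each pair of top words, it takes the log of how often the two words appear in the same document relative to how often the higher-ranked word appears at all. It uses the textbook formulation with +1 smoothing; `umass_coherence` and the toy documents are assumptions for illustration, and gensim's implementation uses different smoothing and averaging, so the numbers are not comparable with the value above.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass-style coherence of one topic; top_words is ordered by descending weight.

    Every word in top_words must occur in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    total = 0.0
    for l, m in combinations(range(len(top_words)), 2):  # l < m
        w_l, w_m = top_words[l], top_words[m]
        total += math.log((doc_freq(w_m, w_l) + 1) / doc_freq(w_l))
    return total

# Hypothetical toy documents.
toy_docs = [
    ["bank", "rate", "dollar"],
    ["bank", "rate"],
    ["wheat", "corn"],
    ["wheat", "corn", "bank"],
]
related = umass_coherence(["bank", "rate"], toy_docs)
unrelated = umass_coherence(["bank", "wheat"], toy_docs)
print(related, unrelated)
```

Words that tend to appear in the same documents give a score closer to 0, while words that rarely co-occur push the score further below 0.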

In [10]:
cats = reuters.categories()
print("Reuters has %d categories:\n%s" % (len(cats), cats))

Reuters has 90 categories:
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']


How does the output of the model compare with the corpus categories?
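
One possible way to approach this question is to cross-tabulate the dominant topic of each document against its Reuters categories. The sketch below shows only the counting step on hypothetical toy data. In the notebook, the dominant topic could be taken, for example, from `topic_model.get_document_topics(bow)` by picking the pair with the highest probability, and the labels from `reuters.categories(doc_id)`; the helper name `crosstab` and the toy values are assumptions.

```python
from collections import Counter, defaultdict

def crosstab(dominant_topics, doc_categories):
    """Count, for every topic id, how often each category label occurs."""
    table = defaultdict(Counter)
    for topic, categories in zip(dominant_topics, doc_categories):
        table[topic].update(categories)
    return table

# Hypothetical toy data: one dominant topic id and one label list per document.
toy_topics = [2, 2, 1, 4, 1]
toy_categories = [["earn"], ["earn"], ["money-fx", "interest"], ["acq"], ["money-fx"]]

table = crosstab(toy_topics, toy_categories)
for topic in sorted(table):
    print(topic, table[topic].most_common(3))
```

A topic whose counts concentrate on one or two categories lines up well with the existing labels, whereas a topic spread over many categories is capturing something the labels do not.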

How does the lemmatization process improve the results?

Continue to [PTM for summarization](summarization.ipynb)