## Topic Modelling
As the name suggests, topic modelling is a process for automatically identifying the topics present in a text object and for deriving the hidden patterns exhibited by a text corpus.

Topic modelling is different from rule-based text mining (for example, regular expressions). It is an unsupervised approach for finding groups of co-occurring words (called “topics”) in large collections of text.

Topics can be defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model should produce “health”, “doctor”, “patient”, “hospital” for a “Healthcare” topic, and “farm”, “crops”, “wheat” for a “Farming” topic.

More formally, we define a **topic** to be a distribution over a fixed vocabulary.
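
To make the definition concrete, here is a minimal sketch of two topics represented as probability distributions over one small fixed vocabulary. The vocabulary and the probability values are invented for illustration; they are not the output of any model.

```python
# Illustrative only: the vocabulary and probabilities below are invented.
vocabulary = ["health", "doctor", "patient", "hospital", "farm", "crops", "wheat"]

# Each topic assigns a probability to every word in the vocabulary.
healthcare = [0.30, 0.25, 0.20, 0.19, 0.02, 0.02, 0.02]
farming = [0.02, 0.02, 0.02, 0.03, 0.35, 0.30, 0.26]

def top_words(topic, vocab, k=3):
    """Return the k most probable words of a topic."""
    ranked = sorted(zip(topic, vocab), reverse=True)
    return [word for _, word in ranked[:k]]

for name, topic in [("Healthcare", healthcare), ("Farming", farming)]:
    # A valid distribution sums to 1 (up to floating-point error).
    assert abs(sum(topic) - 1.0) < 1e-9
    print(name, top_words(topic, vocabulary))
```

Both topics cover the same vocabulary; they differ only in where they put their probability mass.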



## Use Cases

The New York Times uses topic models to improve its user-to-article recommendation engine. Companies in the recruitment industry use topic models to extract latent features from job descriptions and match them to suitable candidates.

## Latent Dirichlet Allocation (LDA)

A popular topic modelling technique, LDA assumes that documents are produced from a mixture of topics. Those topics then generate words according to their probability distributions. Given a dataset of documents, LDA works backwards and tries to infer which topics would have created those documents in the first place, that is, the topics under which the observed documents are most likely. This is our likelihood.
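
The generative story described above can be sketched with NumPy. This is only a simulation of how LDA assumes documents arise, not an inference algorithm; the vocabulary, the number of topics, and the hyperparameter values `alpha` and `beta` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

vocabulary = ["sugar", "health", "father", "sister", "driving", "pressure"]
n_topics = 2

# Hypothetical Dirichlet hyperparameters (alpha for documents, beta for topics).
alpha = np.full(n_topics, 0.5)
beta = np.full(len(vocabulary), 0.5)

# Each topic is a distribution over the vocabulary.
topic_word = rng.dirichlet(beta, size=n_topics)

def generate_document(n_words):
    """Generate one document following LDA's generative story."""
    # 1. Draw the document's mixture over topics.
    doc_topic = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position.
        z = rng.choice(n_topics, p=doc_topic)
        # 3. Pick a word from that topic's distribution.
        w = rng.choice(len(vocabulary), p=topic_word[z])
        words.append(vocabulary[w])
    return doc_topic, words

doc_topic, words = generate_document(8)
print("topic mixture:", doc_topic.round(2))
print("document:", " ".join(words))
```

Fitting LDA amounts to running this story in reverse: only the words are observed, and the topic-word and document-topic distributions have to be estimated.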

In [1]:
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc4 = "Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better."
doc5 = "Health experts say that sugar is not good for your lifestyle."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

## Sklearn Implementation

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

# LDA is a probabilistic model of word counts, so it takes raw term counts
# (not TF-IDF weights) as input.
tf_vectorizer = CountVectorizer(max_features=13, stop_words='english')
tf = tf_vectorizer.fit_transform(doc_complete)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
tf_feature_names = tf_vectorizer.get_feature_names_out().tolist()

In [3]:
tf_feature_names

['bad',
 'driving',
 'father',
 'lot',
 'perform',
 'practice',
 'pressure',
 'say',
 'school',
 'sister',
 'spends',
 'stress',
 'sugar']

### Discussion

`min_df`: ignore terms that have a document frequency strictly lower than the given threshold.

`max_df`: ignore terms that have a document frequency strictly higher than the given threshold.

If an integer is passed, the threshold is an absolute number of documents; if a float between 0.0 and 1.0 is passed, it is a proportion of documents.
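
The thresholding rule can be sketched in plain Python. This is a simplified illustration of the rule as described here, not scikit-learn's implementation; the function name `filter_by_document_frequency` and the toy documents are hypothetical.

```python
def filter_by_document_frequency(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Integers are absolute document counts; floats are proportions of documents.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = {}
    for doc in docs:
        for term in set(doc.lower().split()):
            df[term] = df.get(term, 0) + 1

    # Convert float thresholds to absolute counts.
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, count in df.items() if low <= count <= high)

docs = ["sugar is bad", "sugar is sweet", "driving is stressful"]
# "is" appears in 3 of 3 documents, "sugar" in 2 of 3, every other term in 1 of 3.
print(filter_by_document_frequency(docs, min_df=2))              # drops rare terms
print(filter_by_document_frequency(docs, min_df=2, max_df=0.7))  # also drops "is"
```

In practice, `min_df` removes rare terms that carry little signal and `max_df` removes terms so common that they behave like stop words.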

In [10]:
from sklearn.decomposition import LatentDirichletAllocation
# Run LDA (the n_topics argument was renamed to n_components in scikit-learn 0.19)
lda = LatentDirichletAllocation(n_components=3, max_iter=40).fit(tf)




In [11]:
def display_topics(model, feature_names, no_top_words):
    """Print the no_top_words highest-weighted terms of each topic."""
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % topic_idx)
        # argsort() is ascending, so slice from the end to get the top terms
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_top_words = 4
display_topics(lda, tf_feature_names, no_top_words)

Topic 0:
say sugar sister driving
Topic 1:
father sister driving school
Topic 2:
sugar bad pressure stress
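
The slice `topic.argsort()[:-no_top_words - 1:-1]` used in `display_topics` is compact but easy to misread. The sketch below shows what it does on a single hypothetical topic; the feature names and weights are invented for illustration.

```python
import numpy as np

# Hypothetical weights for one topic over a five-word vocabulary.
feature_names = ["bad", "driving", "father", "sister", "sugar"]
topic = np.array([0.2, 1.5, 0.1, 0.9, 2.3])

no_top_words = 3
# argsort() returns indices in ascending order of weight; the slice
# [:-no_top_words - 1:-1] walks backwards from the end and stops after
# no_top_words items, so the largest weights come first.
top_indices = topic.argsort()[:-no_top_words - 1:-1]

print(top_indices.tolist())                      # [4, 1, 3]
print([feature_names[i] for i in top_indices])   # ['sugar', 'driving', 'sister']
```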


## Gensim Implementation with Manual Cleaning

In [12]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/uday/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

# Note: WordNetLemmatizer also needs the WordNet data (nltk.download('wordnet')).
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(doc):
    """Lowercase a document, drop stop words and punctuation, and lemmatize."""
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    # Show the cleaned text next to the original for comparison
    print(normalized)
    print("********")
    print(doc)
    print("\n")
    return normalized

doc_clean = [clean(doc).split() for doc in doc_complete]

sugar bad consume sister like sugar father
********
Sugar is bad to consume. My sister likes to have sugar, but not my father.


father spends lot time driving sister around dance practice
********
My father spends a lot of time driving my sister around to dance practice.


doctor suggest driving may cause increased stress blood pressure
********
Doctors suggest that driving may cause increased stress and blood pressure.


sometimes feel pressure perform well school father never seems drive sister better
********
Sometimes I feel pressure to perform well at school, but my father never seems to drive my sister to do better.


health expert say sugar good lifestyle
********
Health experts say that sugar is not good for your lifestyle.




In [14]:
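# Convert each document to a set of unique tokens.
# Note: this discards repeated words, so every count in the bag-of-words below is 1.
doc_clean = [set(i) for i in doc_clean]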

In [15]:
doc_clean

[{'bad', 'consume', 'father', 'like', 'sister', 'sugar'},
 {'around',
  'dance',
  'driving',
  'father',
  'lot',
  'practice',
  'sister',
  'spends',
  'time'},
 {'blood',
  'cause',
  'doctor',
  'driving',
  'increased',
  'may',
  'pressure',
  'stress',
  'suggest'},
 {'better',
  'drive',
  'father',
  'feel',
  'never',
  'perform',
  'pressure',
  'school',
  'seems',
  'sister',
  'sometimes',
  'well'},
 {'expert', 'good', 'health', 'lifestyle', 'say', 'sugar'}]

In [26]:
!pip install --upgrade gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/62/19/8ecba86351de0eacb9baf1cc49ba86315cd91bc672acd74d6e4e709eb482/gensim-3.6.0-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (24.0MB)
Collecting smart-open>=1.2.1 (from gensim)
  Downloading https://files.pythonhosted.org/packages/4b/1f/6f27e3682124de63ac97a0a5876da6186de6c19410feab66c1543afab055/smart_open-1.7.1.tar.gz
Collecting bz2file (from smart-open>=1.2.1->gensim)
  Downloading https://files.pythonhosted.org/packages/61/39/122222b5e85cd41c391b68a99ee296584b2a2d1d233e7ee32b4532384f2d/bz2file-0.98.tar.gz
Collecting boto3 (from smart-open>=1.2.1->gensim)
  Downloading https://files.pythonhosted.org/packages/f8/fe/76fc5a00b0ef8cd7958f2b71e7c442f01ff3883c6f3720a64ddef6b6680b/boto3-1.9.21-py2.py3-none-any.whl (128kB)

In [16]:
# Import gensim, a library designed for topic modelling

import gensim
from gensim import corpora

# Create the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Convert the list of documents (corpus) into a document-term matrix using the dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

In [17]:
doc_term_matrix

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)],
 [(2, 1), (4, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
 [(8, 1),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1)],
 [(2, 1),
  (4, 1),
  (18, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 1)],
 [(5, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]]
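
Each inner list above is a bag-of-words: a list of `(token_id, count)` pairs for one document. The plain-Python sketch below illustrates the idea on a toy corpus. It is not gensim's implementation (gensim may assign ids in a different order), and the helper names `build_token_ids` and `to_bag_of_words` are hypothetical.

```python
from collections import Counter

def build_token_ids(docs):
    """Assign an integer id to each unique token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bag_of_words(doc, token2id):
    """Return sorted (token_id, count) pairs for the tokens in one document."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

docs = [["sugar", "bad", "sugar"], ["father", "sugar"]]
token2id = build_token_ids(docs)
print(token2id)                                          # {'sugar': 0, 'bad': 1, 'father': 2}
print([to_bag_of_words(doc, token2id) for doc in docs])  # [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
```

Note that the toy documents here are lists, so the repeated "sugar" is counted twice; with sets of tokens, as in this notebook, every count is 1.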

In [18]:
print(dictionary)

Dictionary(35 unique tokens: ['bad', 'consume', 'father', 'like', 'sister']...)


In [21]:
# Create a reference to gensim's LDA model class
Lda = gensim.models.ldamodel.LdaModel

# Train the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)

In [23]:
print(ldamodel.print_topics(num_topics=3, num_words=4))

[(0, '0.079*"driving" + 0.045*"doctor" + 0.045*"may" + 0.045*"cause"'), (1, '0.099*"sugar" + 0.056*"good" + 0.056*"health" + 0.056*"expert"'), (2, '0.057*"father" + 0.057*"sister" + 0.056*"pressure" + 0.056*"never"')]
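
Each topic is printed as a string of `weight*"word"` terms, where the weight is the probability of the word under that topic. For further processing it can help to turn such a string into pairs. The helper `parse_topic` below is a hypothetical sketch that parses the printed format with a regular expression; gensim's `show_topics(formatted=False)` is likely a more direct route to the same information.

```python
import re

def parse_topic(topic_string):
    """Split a string like '0.079*"driving" + 0.045*"doctor"' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# Topic string copied from the output above.
topic_0 = '0.079*"driving" + 0.045*"doctor" + 0.045*"may" + 0.045*"cause"'
print(parse_topic(topic_0))
# [('driving', 0.079), ('doctor', 0.045), ('may', 0.045), ('cause', 0.045)]
```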

