# Topic Modeling Approaches

Topic modeling is an unsupervised machine learning technique for discovering the topics present in a collection of documents. In this notebook, we cover two widely used topic modeling techniques:

1. Latent Dirichlet Allocation (LDA).
2. Non-negative Matrix Factorization (NMF).

We will also look at the effect of feeding these algorithms different features: first raw term frequencies, and then term frequencies weighted by Inverse Document Frequency (IDF).

## Loading libraries and reading the data

We will load the basic libraries and take a sneak peek at the data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
filename = "sample-S2-records"
df = pd.read_json(filename,lines=True)
df.head()

Unnamed: 0,authors,doi,doiUrl,entities,id,inCitations,journalName,journalPages,journalVolume,outCitations,paperAbstract,pdfUrls,pmid,s2PdfUrl,s2Url,sources,title,venue,year
0,"[{'name': 'Kate Jack', 'ids': ['38280253']}]",,,[Jack Device Component],f2320c08c7d95bbf8bb72e4d6deaa6845ea4cf27,[],Nursing times,26,109 49-50,[],,[],24568020v1,,https://semanticscholar.org/paper/f2320c08c7d9...,[Medline],60 seconds with Kate Jack.,Nursing times,2013.0
1,"[{'name': 'W N Spellacy', 'ids': ['5862934']},...",,,"[Decision Making, Laboratory Certification Doc...",5432a99cdd9f8b248c50274cd3d2a6016f3d081e,[],The Journal of reproductive medicine,127-30,31 2,[],The search for new administrators in complex s...,[],3514907v1,,https://semanticscholar.org/paper/5432a99cdd9f...,[Medline],Organizing a search for an academic administra...,The Journal of reproductive medicine,1986.0
2,"[{'name': 'Stefanie Ernst', 'ids': ['39900230'...",,,"[Annexin A1, Annexins, Bacterial Infections, C...",155663331ea93379e99997bd43340eb54ab41a73,"[3738fad17126054f03cfe736b7156b6d6eef0481, 927...",Journal of immunology,7669-76,172 12,"[c2b53b26c004fe57e85424df6ad101d283150648, d30...",The human N-formyl peptide receptor (FPR) is a...,[http://www.jimmunol.org/content/jimmunol/172/...,15187149v1,http://pdfs.semanticscholar.org/cb73/147dc0bf1...,https://semanticscholar.org/paper/155663331ea9...,[Medline],An annexin 1 N-terminal peptide activates leuk...,Journal of immunology,2004.0
3,"[{'name': 'S Yamamoto', 'ids': ['1801874']}, {...",,,"[Adrenal Cortex Hormones, Bladder Neoplasm, Ca...",b5a25960ebee9a6e5db79196e6b07f0edfcf5313,"[8bcedf8512f672310326a6cc0ec897939d28c6d1, 8b7...",Nihon Rinsho Men'eki Gakkai kaishi = Japanese ...,128-35,19 2,[],Serum CA 19-9 (2-3 sialyl Le(a)) is a marker o...,[],8705689v1,,https://semanticscholar.org/paper/b5a25960ebee...,[Medline],[Serum CA 19-9 levels in rheumatic diseases wi...,Nihon Rinsho Men'eki Gakkai kaishi = Japanese ...,1996.0
4,"[{'name': 'Edwards', 'ids': ['14380299']}, {'n...",,,"[Cell Nucleus, Dependence, Nucleic Acids]",3b7538465b0559e2d3ff2b65991c8e399e457822,[],"Physical review. A, Atomic, molecular, and opt...",2709-2717,44 4,[],,[],9906253v1,,https://semanticscholar.org/paper/3b7538465b05...,[Medline],Sequence dependence of low-frequency Raman-act...,"Physical review. A, Atomic, molecular, and opt...",1991.0


## Preprocessing the data

We need to preprocess the data before performing topic modeling. We will perform the following steps as part of preprocessing:

1. Removing NaN (null) values.
2. Tokenizing the text.
3. Removing stop-words.
4. Lemmatizing the words.
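
The four steps can be sketched on a toy example. This is a minimal, self-contained illustration rather than the notebook's pipeline: the stop-word set and lemma lookup table (`TOY_STOP_WORDS`, `TOY_LEMMAS`) are hypothetical stand-ins for NLTK's stop-word list and `WordNetLemmatizer`, and, unlike the notebook's code, the sketch lowercases the text before filtering.

```python
import re

# Hypothetical stand-ins for NLTK's English stop-word list and WordNetLemmatizer.
TOY_STOP_WORDS = {"the", "of", "in", "a", "is", "were", "and"}
TOY_LEMMAS = {"levels": "level", "patients": "patient", "receptors": "receptor"}

def toy_preprocess(text):
    """Tokenize, drop stop words, and map each token to a base form."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())        # step 2: tokenize
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]   # step 3: remove stop words
    return [TOY_LEMMAS.get(t, t) for t in tokens]             # step 4: "lemmatize" via lookup

raw = ["Serum levels were measured in the patients.", "", None]
docs = [d for d in raw if d]                                  # step 1: drop empty / null entries
cleaned = [" ".join(toy_preprocess(d)) for d in docs]
print(cleaned)
```

Joining the tokens back into one string per document mirrors what the notebook does before vectorization, since the vectorizer expects raw strings.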

In [3]:
abstract = df['paperAbstract']
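# replace() returns a new Series for display; abstract itself is left unchanged.
# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
abstract.replace('', np.nan)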

0                                                   NaN
1     The search for new administrators in complex s...
2     The human N-formyl peptide receptor (FPR) is a...
3     Serum CA 19-9 (2-3 sialyl Le(a)) is a marker o...
4                                                   NaN
5                                                   NaN
6     The analysis of trichinellosis patients' sera ...
7     As a result of a large outbreak of measles, me...
8     Since 1954 average orthophosphate and total ph...
9     Using a rifampicin-resistant RNA polymerase wi...
10    Acid extracts of bovine hypothalamus stimulate...
11    We studied native point defects as well as Eu ...
12    The influences of induced alterations in muscl...
13                                                  NaN
14    Ischemic brain is particularly susceptible to ...
15    In an effort to develop potent antiplatelet ag...
16    Cost-containment regulations and possible legi...
17                                              

In [4]:
# Keep only the non-empty abstracts
absData = [text for text in abstract if text != '']

In [5]:
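import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# The tokenizer models, stop-word list and WordNet data must be available locally;
# if any are missing, fetch them once with nltk.download(), e.g. nltk.download('stopwords').
wn = WordNetLemmatizer()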


In [6]:
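# Build the stop-word set once; set membership tests are faster than scanning a list.
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """Tokenize text, drop NLTK English stop words, and lemmatize the remaining tokens."""
    tokens = word_tokenize(text)
    # The comparison is case-sensitive, so capitalized stop words such as "The" pass through here;
    # CountVectorizer lowercases the text and applies its own stop-word list later.
    tokens = [token for token in tokens if token not in stop_words]
    tokens = [wn.lemmatize(token) for token in tokens]
    return tokens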

In [7]:
# Rejoin the cleaned tokens into one string per abstract, as the vectorizer expects raw strings
text_tokens = [" ".join(preprocess_text(item)) for item in absData]

## Vectorization

In the next step, we vectorize the cleaned text using `CountVectorizer`, which turns each document into a vector of term counts.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def Term_frequency_vectors(text):
    # Cap the vocabulary at the 100 most frequent terms (passed as max_features).
    no_features = 100
    # Ignore terms that occur in more than 95% of the documents or in fewer than 2 documents.
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
    # fit_transform learns the vocabulary and returns the document-term count matrix.
    tf = tf_vectorizer.fit_transform(text)
    # Array mapping feature integer indices to feature names.
    # get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names().
    tf_feature_names = tf_vectorizer.get_feature_names_out()
    return tf_vectorizer, tf, tf_feature_names
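
To make the `min_df` and `max_df` settings concrete, here is a rough pure-Python sketch of document-frequency pruning followed by counting. The four toy documents are invented for illustration, whitespace splitting stands in for the vectorizer's tokenizer, and the `max_features` cap and stop-word removal are left out; it is meant to convey the idea, not to reproduce `CountVectorizer` exactly.

```python
from collections import Counter

# Hypothetical, already-cleaned documents (not taken from the notebook's data).
docs = [
    "patient serum level",
    "patient liver level",
    "patient serum peptide",
    "network algorithm patient",
]
tokenized = [d.split() for d in docs]

# Document frequency: the number of documents that contain each term at least once.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

min_df = 2      # absolute document count, as in the notebook
max_df = 0.95   # proportion of documents, as in the notebook
n_docs = len(docs)

# Keep terms that are neither too rare nor too common.
vocab = sorted(
    term for term, df in doc_freq.items()
    if df >= min_df and df <= max_df * n_docs
)

# Document-term matrix: one row per document, one column per vocabulary term.
doc_term = []
for tokens in tokenized:
    counts = Counter(tokens)
    doc_term.append([counts[term] for term in vocab])

print(vocab)
print(doc_term)
```

In this toy corpus "patient" appears in every document and is dropped by `max_df`, while terms that appear only once are dropped by `min_df`.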


In [9]:
vec, term_freq, features = Term_frequency_vectors(text_tokens)
# Number of topics to extract
no_topics = 20
# Run LDA
# http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
# n_components replaces the older n_topics argument, which recent scikit-learn releases no longer accept.
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0)
lda_model = lda.fit(term_freq)



## Implementation of LDA using Term Frequency.

The LDA model was fitted on the term-frequency vectors in the previous step. Now we compute the document-topic and topic-word matrices and display the top words of each topic.

In [10]:
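def display_topics(H, W, feature_names, documents, no_top_words, no_top_documents):
    """Print the top words of each topic (H: topic-word matrix, W: document-topic matrix)."""
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % topic_idx)
        # Indices of the no_top_words largest weights, in descending order.
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        # Indices of the documents that load most heavily on this topic
        # (computed here but not printed).
        top_doc_indices = np.argsort(W[:, topic_idx])[::-1][0:no_top_documents]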
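
The slice `[:-no_top_words - 1:-1]` in `display_topics` is compact but easy to misread. The small sketch below, using made-up weights and feature names, shows that it walks the ascending `argsort` result backwards and stops after `no_top_words` entries.

```python
import numpy as np

# Hypothetical term weights for a single topic, with matching feature names.
topic = np.array([0.1, 2.5, 0.7, 1.9, 0.3])
feature_names = ["care", "patient", "dna", "health", "rat"]
no_top_words = 3

ascending = topic.argsort()                    # indices from smallest to largest weight
top_idx = ascending[:-no_top_words - 1:-1]     # last no_top_words indices, reversed
top_words = [feature_names[i] for i in top_idx]
print(top_idx, top_words)
```

When there are no tied weights, `np.argsort(-topic)[:no_top_words]` selects the same indices and may be easier to read.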

In [11]:
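# Document-topic matrix: one row per document, one column per topic
lda_W = lda_model.transform(term_freq)

# Topic-word matrix: one row per topic, one column per vocabulary term
lda_H = lda_model.components_

no_top_words = 4
no_top_documents = 4
display_topics(lda_H, lda_W, features, text_tokens, no_top_words, no_top_documents)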

Topic 0:
paper algorithm simulation similar
Topic 1:
symptom clinical study finding
Topic 2:
care dna rat related
Topic 3:
present specie value based
Topic 4:
living cancer ratio state
Topic 5:
peptide activity like result
Topic 6:
perfusion used method related
Topic 7:
pressure type positive compared
Topic 8:
knowledge factor population ratio
Topic 9:
mg value model determined
Topic 10:
control level hypothermia local
Topic 11:
treatment local study common
Topic 12:
liver metastasis patient peptide
Topic 13:
time stroke health self
Topic 14:
skin effect using temperature
Topic 15:
case significantly reported structure
Topic 16:
level using problem network
Topic 17:
pressure ip condition induced
Topic 18:
19 level serum ca
Topic 19:
patient health medical care
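
Beyond the top words, the two matrices can be read directly. The sketch below uses small made-up matrices in place of `lda_W` and `lda_H` (the names `doc_topic` and `topic_word` are illustrative) to show three common operations: normalizing each topic row into a distribution over terms, finding each document's dominant topic, and ranking documents for a topic the way `display_topics` computes `top_doc_indices`.

```python
import numpy as np

# Hypothetical 3-document, 2-topic, 4-term example (not the notebook's data).
doc_topic = np.array([[0.9, 0.1],
                      [0.2, 0.8],
                      [0.6, 0.4]])
topic_word = np.array([[4.0, 3.0, 2.0, 1.0],
                       [1.0, 1.0, 1.0, 7.0]])

# Row-normalize so that each topic becomes a probability distribution over terms.
topic_word_dist = topic_word / topic_word.sum(axis=1, keepdims=True)

# Most probable topic for each document.
dominant_topic = doc_topic.argmax(axis=1)

# The two documents that load most heavily on topic 0, strongest first.
top_docs_topic0 = np.argsort(doc_topic[:, 0])[::-1][:2]

print(topic_word_dist)
print(dominant_topic, top_docs_topic0)
```

With the real model, `doc_topic` would be `lda_W` and `topic_word` would be `lda_H`; the indices in `top_docs_topic0` could then be used to look up the corresponding abstracts.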


### Visualizing the Model

In [21]:
import warnings
warnings.filterwarnings("ignore")

import pyLDAvis
# Note: pyLDAvis 3.4 renamed this module to pyLDAvis.lda_model.
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

data = pyLDAvis.sklearn.prepare(lda_model, term_freq, vec)
pyLDAvis.display(data)

## Implementation of LDA using Inverse Document Frequency.

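Instead of using raw term frequencies, let's weight each count by its Inverse Document Frequency (IDF) and run LDA on the result. With `norm=None`, `TfidfTransformer` returns these weighted counts without any row normalization.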
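
As a rough illustration of what this weighting does, the sketch below computes the smoothed IDF that scikit-learn documents as the default for `TfidfTransformer` (`smooth_idf=True`), namely `ln((1 + n) / (1 + df)) + 1`, and multiplies it into a small invented count matrix. The matrix and variable names are assumptions made for the example.

```python
import math

# Hypothetical count matrix: 3 documents x 3 terms.
counts = [
    [2, 1, 0],
    [1, 0, 0],
    [3, 0, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents in which each term has a non-zero count.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency for each term.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

# Without row normalization, each weighted value is simply count * idf.
weighted = [[c * w for c, w in zip(row, idf)] for row in counts]

for row in weighted:
    print([round(v, 3) for v in row])
```

A term that occurs in every document gets an IDF of exactly 1, so its counts are unchanged, while rarer terms are scaled up.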

In [22]:
from sklearn.feature_extraction.text import TfidfTransformer

# norm=None keeps the IDF-weighted counts unnormalized
idf_trans = TfidfTransformer(norm=None)
idf = idf_trans.fit_transform(term_freq)

# n_components replaces the older n_topics argument, which recent scikit-learn releases no longer accept.
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0)
lda_model = lda.fit(idf)

# Document-topic and topic-word matrices
lda_doc_topic = lda_model.transform(idf)
lda_topic_word = lda_model.components_

display_topics(lda_topic_word, lda_doc_topic, features, text_tokens, no_top_words, no_top_documents)

Topic 0:
simulation control model algorithm
Topic 1:
symptom child clinical peptide
Topic 2:
care rat dna related
Topic 3:
present specie value based
Topic 4:
living cancer ratio state
Topic 5:
peptide activity like result
Topic 6:
temperature enzyme dna case
Topic 7:
pressure perfusion induced response
Topic 8:
knowledge control network based
Topic 9:
mg value structure model
Topic 10:
hypothermia local level process
Topic 11:
treatment local rate common
Topic 12:
metastasis liver patient rate
Topic 13:
time stroke health self
Topic 14:
skin ratio expression disease
Topic 15:
high based paper problem
Topic 16:
network level algorithm using
Topic 17:
ip condition patient serum
Topic 18:
19 serum ca ip
Topic 19:
patient medical health concentration


### Visualizing the Model

In [28]:
data=pyLDAvis.sklearn.prepare(lda_model, idf, vec)
pyLDAvis.display(data)

## Implementation of NMF using Term Frequency.

Now, let's try implementing Non-negative Matrix Factorization (NMF) using term frequencies as features.
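
For intuition, NMF approximates the non-negative document-term matrix `V` by a product `W @ H`, where `W` is the document-topic matrix and `H` the topic-word matrix. The sketch below uses the classic multiplicative update rules for the squared-error objective. This is a different algorithm from scikit-learn's default coordinate-descent solver and has no regularization, so treat it as an illustration only; the function name `nmf_multiplicative` and the toy matrix are made up for the example.

```python
import numpy as np

def nmf_multiplicative(V, n_components, n_iter=200, seed=0, eps=1e-10):
    """Factor a non-negative matrix V into W @ H using multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    # Random positive initialization keeps every entry non-negative throughout.
    W = rng.random((n_docs, n_components)) + eps
    H = rng.random((n_components, n_terms)) + eps
    for _ in range(n_iter):
        # Each update rescales entries by a non-negative ratio, so signs never flip.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy count matrix with two obvious groups of co-occurring terms.
V = np.array([[3.0, 2.0, 0.0, 0.0],
              [2.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 4.0, 1.0],
              [0.0, 0.0, 1.0, 4.0]])

W, H = nmf_multiplicative(V, n_components=2)
print("reconstruction error:", np.linalg.norm(V - W @ H))
```

Each row of `H` plays the role of a topic, and each row of `W` says how strongly a document expresses each topic.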

In [26]:
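from sklearn.decomposition import NMF
# alpha sets the regularization strength and l1_ratio the L1/L2 mix.
# Note: scikit-learn 1.2 removed alpha in favor of alpha_W and alpha_H.
nmf = NMF(n_components=no_topics, random_state=1,
          alpha=.1, l1_ratio=.5)
nmf_model = nmf.fit(term_freq)

# Document-topic matrix
nmf_doc_topic = nmf_model.transform(term_freq)

# Topic-word matrix
nmf_topic_word = nmf_model.components_

display_topics(nmf_topic_word, nmf_doc_topic, features, text_tokens, no_top_words, no_top_documents)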

Topic 0:
time day work medical
Topic 1:
19 level ca serum
Topic 2:
stroke health self associated
Topic 3:
living cancer ratio state
Topic 4:
pressure perfusion induced response
Topic 5:
symptom clinical finding level
Topic 6:
metastasis liver patient case
Topic 7:
care rat related treatment
Topic 8:
dna skin cell using
Topic 9:
hypothermia local effect model
Topic 10:
peptide like result activity
Topic 11:
knowledge based study using
Topic 12:
condition level human compared
Topic 13:
patient medical health based
Topic 14:
mg model concentration expression
Topic 15:
immunized child le result
Topic 16:
fed control reduced network
Topic 17:
ip process expression cell
Topic 18:
problem using algorithm network
Topic 19:
treatment rate study year


### Visualizing the Model

In [29]:
data=pyLDAvis.sklearn.prepare(nmf_model, term_freq, vec)
pyLDAvis.display(data)

## Implementation of NMF using Inverse Document Frequency.

Now, let's try implementing Non-negative Matrix Factorization using the IDF-weighted features.

In [30]:
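# Same NMF settings as before, now fitted on the IDF-weighted matrix.
# Note: scikit-learn 1.2 removed alpha in favor of alpha_W and alpha_H.
nmf = NMF(n_components=no_topics, random_state=1,
          alpha=.1, l1_ratio=.5)
nmf_model = nmf.fit(idf)

# Document-topic matrix
nmf_doc_topic = nmf_model.transform(idf)

# Topic-word matrix
nmf_topic_word = nmf_model.components_

display_topics(nmf_topic_word, nmf_doc_topic, features, text_tokens, no_top_words, no_top_documents)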


Topic 0:
time day work medical
Topic 1:
19 serum ca ip
Topic 2:
stroke health self associated
Topic 3:
living cancer ratio state
Topic 4:
pressure perfusion induced response
Topic 5:
metastasis liver patient organ
Topic 6:
symptom clinical finding presence
Topic 7:
care rat related treatment
Topic 8:
hypothermia local model effect
Topic 9:
peptide like activity result
Topic 10:
dna skin cell using
Topic 11:
immunized child le compared
Topic 12:
mg concentration model expression
Topic 13:
knowledge based factor study
Topic 14:
ip skin process expression
Topic 15:
fed control reduced activity
Topic 16:
condition patient medical human
Topic 17:
treatment rate local year
Topic 18:
network algorithm problem temperature
Topic 19:
level state ca positive
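
One simple way to quantify how much the choice of features changes a topic is the Jaccard overlap between top-word sets. The sketch below applies it to three LDA topics whose top words are copied from the outputs above (term frequency versus IDF-weighted). The helper `jaccard` is illustrative, and comparing topics by index is an assumption that happens to be reasonable here because both runs use the same settings and seed; in general, topics from different runs need to be matched first.

```python
def jaccard(words_a, words_b):
    """Jaccard similarity between two space-separated top-word strings."""
    a, b = set(words_a.split()), set(words_b.split())
    return len(a & b) / len(a | b)

# Top words copied from the LDA outputs in this notebook.
lda_tf = {
    0: "paper algorithm simulation similar",
    13: "time stroke health self",
    18: "19 level serum ca",
}
lda_idf = {
    0: "simulation control model algorithm",
    13: "time stroke health self",
    18: "19 serum ca ip",
}

overlap = {k: jaccard(lda_tf[k], lda_idf[k]) for k in lda_tf}
for k, score in overlap.items():
    print("Topic %d: %.2f" % (k, score))
```

A score of 1 means the two feature sets produced the same top words for that topic, while lower scores point to topics that shifted under IDF weighting.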


### Visualizing the Model

In [31]:
data=pyLDAvis.sklearn.prepare(nmf_model, idf, vec)
pyLDAvis.display(data)