<h1 align="center">NLP Tutorial in Gensim</h1>
<h3 align="center">Chuck Chan</h3>


In [1]:
# These short sentences stand in for documents to keep the tutorial fast.
# The topic model results will be biased, because topic models are intended for
# large documents, where words are distributed differently.
docs =["We are running a tutorial on Natural Language Processing in Python.",
       "San Francisco will be foggy all week!", 
       "Python is a high level programming language.",
       "Today it is cold in SF",
       "It is so sunny that you can fry an egg outside in Vegas.",
       "Las Vegas is getting very cold in the evening.",
       "Other programming languages that have Natural Language Processing libraries include Java and C.",
       "The cold weather yesterday evening gave Kevin frostbite.",
       "What is the weather outside? Is it rainy or sunny?",
       "Doesn't it cost $25 to run your code on Amazon? No",
       "Running NLP code is sometimes very slow and not very accurate.",
       "Python libraries for NLP include: Spacy, NLTK, and gensim",
       "Gensim is is used in this tutorial because it is easy to install",
    ]
for n in docs:
    print(n)

We are running a tutorial on Natural Language Processing in Python.
San Francisco will be foggy all week!
Python is a high level programming language.
Today it is cold in SF
It is so sunny that you can fry an egg outside in Vegas.
Las Vegas is getting very cold in the evening.
Other programming languages that have Natural Language Processing libraries include Java and C.
The cold weather yesterday evening gave Kevin frostbite.
What is the weather outside? Is it rainy or sunny?
Doesn't it cost $25 to run your code on Amazon? No
Running NLP code is sometimes very slow and not very accurate.
Python libraries for NLP include: Spacy, NLTK, and gensim
Gensim is is used in this tutorial because it is easy to install


<h1>Remove punctuation</h1>

In [2]:
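# Strip a punctuation mark (? , . : ; !) that directly follows a word.
# Raw strings keep \w and \g from being read as string escapes.
import re
for i in range(len(docs)):
    docs[i] = re.sub(r'(\w+)[?,.:;!]', r'\g<1>', docs[i])
    print(docs[i])

print("")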

We are running a tutorial on Natural Language Processing in Python
San Francisco will be foggy all week
Python is a high level programming language
Today it is cold in SF
It is so sunny that you can fry an egg outside in Vegas
Las Vegas is getting very cold in the evening
Other programming languages that have Natural Language Processing libraries include Java and C
The cold weather yesterday evening gave Kevin frostbite
What is the weather outside Is it rainy or sunny
Doesn't it cost $25 to run your code on Amazon No
Running NLP code is sometimes very slow and not very accurate
Python libraries for NLP include Spacy NLTK and gensim
Gensim is is used in this tutorial because it is easy to install
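
The pattern only removes a single punctuation mark that directly follows a word character, which is why "Doesn't" and "$25" survive in the output above. The sketch below wraps the same rule in a hypothetical helper, `strip_punct` (a name introduced here for illustration), and probes two edge cases that the example sentences do not cover: decimal numbers and runs of dots.

```python
import re

# One punctuation mark from the set ? , . : ; ! directly after a word
PUNCT_AFTER_WORD = re.compile(r'(\w+)[?,.:;!]')

def strip_punct(text):
    """Hypothetical helper: drop a punctuation mark that follows a word."""
    return PUNCT_AFTER_WORD.sub(r'\g<1>', text)

samples = [
    "San Francisco will be foggy all week!",
    "Doesn't it cost $25 to run your code on Amazon? No",
    "pi is 3.14",   # the decimal point is removed as well
    "wait...",      # only the first dot follows a word character
]
for s in samples:
    print(repr(s), '->', repr(strip_punct(s)))
```

If numbers matter in your corpus, the decimal-point case suggests handling them before this step.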



<h1>Dealing with Compound Words</h1>

In [3]:
replacements ={
    'san_francisco':['San Francisco', 'SF'],
    'las_vegas':['Las Vegas',"Vegas"],
    'nlp':['NLP', 'Natural Language Processing']
}

    
    
# Replace every variant with its single-token key, e.g. 'San Francisco' -> 'san_francisco'
for key, variants in replacements.items():
    for i in range(len(docs)):
        for variant in variants:
            docs[i] = docs[i].replace(variant, key)

for d in docs:
    print(d)



We are running a tutorial on nlp in Python
san_francisco will be foggy all week
Python is a high level programming language
Today it is cold in san_francisco
It is so sunny that you can fry an egg outside in las_vegas
las_vegas is getting very cold in the evening
Other programming languages that have nlp libraries include Java and C
The cold weather yesterday evening gave Kevin frostbite
What is the weather outside Is it rainy or sunny
Doesn't it cost $25 to run your code on Amazon No
Running nlp code is sometimes very slow and not very accurate
Python libraries for nlp include Spacy NLTK and gensim
Gensim is is used in this tutorial because it is easy to install
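
Plain `str.replace` works for these sentences, but it also matches inside longer words: "SFO" would become "san_franciscoO". One possible alternative, sketched below, is a single regular expression with word boundaries. The helper name `merge_compounds` and the "SFO" sentence are assumptions made for illustration; the `replacements` dictionary is the one from the cell above.

```python
import re

replacements = {
    'san_francisco': ['San Francisco', 'SF'],
    'las_vegas': ['Las Vegas', 'Vegas'],
    'nlp': ['NLP', 'Natural Language Processing'],
}

# Map each surface form to the key that replaces it
variant_to_key = {v: key for key, variants in replacements.items() for v in variants}

# Try longer variants first so 'Las Vegas' is matched before 'Vegas'
ordered = sorted(variant_to_key, key=len, reverse=True)
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(v) for v in ordered) + r')\b')

def merge_compounds(text):
    """Hypothetical helper: replace whole-word variants with their key."""
    return pattern.sub(lambda m: variant_to_key[m.group(0)], text)

print(merge_compounds("Las Vegas is getting very cold in the evening"))
print(merge_compounds("Today it is cold in SF"))
print(merge_compounds("Flights into SFO are delayed"))  # left unchanged
```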


<h1>Tokenize</h1>

In [4]:
print("Tokenize text")

# Split each document on whitespace
tokens = [doc.split() for doc in docs]
for d in tokens:
    print (d)

Tokenize text
['We', 'are', 'running', 'a', 'tutorial', 'on', 'nlp', 'in', 'Python']
['san_francisco', 'will', 'be', 'foggy', 'all', 'week']
['Python', 'is', 'a', 'high', 'level', 'programming', 'language']
['Today', 'it', 'is', 'cold', 'in', 'san_francisco']
['It', 'is', 'so', 'sunny', 'that', 'you', 'can', 'fry', 'an', 'egg', 'outside', 'in', 'las_vegas']
['las_vegas', 'is', 'getting', 'very', 'cold', 'in', 'the', 'evening']
['Other', 'programming', 'languages', 'that', 'have', 'nlp', 'libraries', 'include', 'Java', 'and', 'C']
['The', 'cold', 'weather', 'yesterday', 'evening', 'gave', 'Kevin', 'frostbite']
['What', 'is', 'the', 'weather', 'outside', 'Is', 'it', 'rainy', 'or', 'sunny']
["Doesn't", 'it', 'cost', '$25', 'to', 'run', 'your', 'code', 'on', 'Amazon', 'No']
['Running', 'nlp', 'code', 'is', 'sometimes', 'very', 'slow', 'and', 'not', 'very', 'accurate']
['Python', 'libraries', 'for', 'nlp', 'include', 'Spacy', 'NLTK', 'and', 'gensim']
['Gensim', 'is', 'is', 'used', 'in', 'th

<h1>Lemmatization</h1>
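Lemmatization reduces words to their dictionary form; for example, it changes "libraries" to "library" and "languages" to "language". It is shown here for demonstration purposes only: the later steps continue from the tokens list, not from the lemmatized output.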

In [5]:
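# Requires the WordNet data: run nltk.download('wordnet') once beforehand
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
# lemmatize() treats each word as a noun by default, so verbs such as 'running' stay unchanged
lemmaset = [[lmtzr.lemmatize(w.lower()) for w in sent] for sent in tokens]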


print("Lemmatized Set")
for s in lemmaset:
    print(s)


Lemmatized Set
['we', 'are', 'running', 'a', 'tutorial', 'on', 'nlp', 'in', 'python']
['san_francisco', 'will', 'be', 'foggy', 'all', 'week']
['python', 'is', 'a', 'high', 'level', 'programming', 'language']
['today', 'it', 'is', 'cold', 'in', 'san_francisco']
['it', 'is', 'so', 'sunny', 'that', 'you', 'can', 'fry', 'an', 'egg', 'outside', 'in', 'las_vegas']
['las_vegas', 'is', 'getting', 'very', 'cold', 'in', 'the', 'evening']
['other', 'programming', 'language', 'that', 'have', 'nlp', 'library', 'include', 'java', 'and', 'c']
['the', 'cold', 'weather', 'yesterday', 'evening', 'gave', 'kevin', 'frostbite']
['what', 'is', 'the', 'weather', 'outside', 'is', 'it', 'rainy', 'or', 'sunny']
["doesn't", 'it', 'cost', '$25', 'to', 'run', 'your', 'code', 'on', 'amazon', 'no']
['running', 'nlp', 'code', 'is', 'sometimes', 'very', 'slow', 'and', 'not', 'very', 'accurate']
['python', 'library', 'for', 'nlp', 'include', 'spacy', 'nltk', 'and', 'gensim']
['gensim', 'is', 'is', 'used', 'in', 'this',

<h1>Remove Stopwords & Rare Words</h1>
Stopwords are filtered out first. Words that occur only once across all documents are then removed to avoid biasing the topic model.

In [10]:
stopset = set('for is a of the and to in it did or has had'.split())
filtered_docs = [[w for w in token if w.lower() not in stopset] for token in tokens]
print(filtered_docs)

[['We', 'are', 'running', 'tutorial', 'on', 'nlp', 'Python'], ['san_francisco', 'will', 'be', 'foggy', 'all', 'week'], ['Python', 'high', 'level', 'programming', 'language'], ['Today', 'cold', 'san_francisco'], ['so', 'sunny', 'that', 'you', 'can', 'fry', 'an', 'egg', 'outside', 'las_vegas'], ['las_vegas', 'getting', 'very', 'cold', 'evening'], ['Other', 'programming', 'languages', 'that', 'have', 'nlp', 'libraries', 'include', 'Java', 'C'], ['cold', 'weather', 'yesterday', 'evening', 'gave', 'Kevin', 'frostbite'], ['What', 'weather', 'outside', 'rainy', 'sunny'], ["Doesn't", 'cost', '$25', 'run', 'your', 'code', 'on', 'Amazon', 'No'], ['Running', 'nlp', 'code', 'sometimes', 'very', 'slow', 'not', 'very', 'accurate'], ['Python', 'libraries', 'nlp', 'include', 'Spacy', 'NLTK', 'gensim'], ['Gensim', 'used', 'this', 'tutorial', 'because', 'easy', 'install']]


In [11]:
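from collections import Counter

# Lowercase every token so that matching is case insensitive
idocs = [[word.lower() for word in doc] for doc in filtered_docs]
print("Lower case")
for i in idocs:
    print(i)
print('')


print('Unique words')
# Count every word over the whole corpus and keep those that occur exactly once
counts = Counter(word for doc in idocs for word in doc)
uniqueset = {word for word, n in counts.items() if n == 1}

print(uniqueset)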


Lower case
['we', 'are', 'running', 'tutorial', 'on', 'nlp', 'python']
['san_francisco', 'will', 'be', 'foggy', 'all', 'week']
['python', 'high', 'level', 'programming', 'language']
['today', 'cold', 'san_francisco']
['so', 'sunny', 'that', 'you', 'can', 'fry', 'an', 'egg', 'outside', 'las_vegas']
['las_vegas', 'getting', 'very', 'cold', 'evening']
['other', 'programming', 'languages', 'that', 'have', 'nlp', 'libraries', 'include', 'java', 'c']
['cold', 'weather', 'yesterday', 'evening', 'gave', 'kevin', 'frostbite']
['what', 'weather', 'outside', 'rainy', 'sunny']
["doesn't", 'cost', '$25', 'run', 'your', 'code', 'on', 'amazon', 'no']
['running', 'nlp', 'code', 'sometimes', 'very', 'slow', 'not', 'very', 'accurate']
['python', 'libraries', 'nlp', 'include', 'spacy', 'nltk', 'gensim']
['gensim', 'used', 'this', 'tutorial', 'because', 'easy', 'install']

Unique words
{'what', 'c', 'no', 'high', 'amazon', 'spacy', 'be', 'have', 'nltk', 'slow', 'an', 'foggy', 'cost', 'rainy', 'accurate', 

In [12]:
wbag = [[w for w in text if w not in uniqueset] for text in idocs]

print('Bag of words')
for i in wbag:
    print(i)

Bag of words
['running', 'tutorial', 'on', 'nlp', 'python']
['san_francisco']
['python', 'programming']
['cold', 'san_francisco']
['sunny', 'that', 'outside', 'las_vegas']
['las_vegas', 'very', 'cold', 'evening']
['programming', 'that', 'nlp', 'libraries', 'include']
['cold', 'weather', 'evening']
['weather', 'outside', 'sunny']
['code', 'on']
['running', 'nlp', 'code', 'very', 'very']
['python', 'libraries', 'nlp', 'include', 'gensim']
['gensim', 'tutorial']
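
The three cells above can be folded into one function. The following is a minimal sketch rather than the notebook's own code: `clean_corpus` and its `min_count` parameter are names introduced here, and the four sample documents are a subset of the tokenized sentences, so the words that survive differ from the full run.

```python
from collections import Counter

STOPSET = set('for is a of the and to in it did or has had'.split())

def clean_corpus(tokenized_docs, stopset=STOPSET, min_count=2):
    """Lowercase, drop stopwords, then drop words seen fewer than min_count times overall."""
    lowered = [[w.lower() for w in doc if w.lower() not in stopset]
               for doc in tokenized_docs]
    # Corpus-wide counts, computed after stopword removal
    counts = Counter(w for doc in lowered for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in lowered]

sample = [
    ['Python', 'is', 'a', 'high', 'level', 'programming', 'language'],
    ['Today', 'it', 'is', 'cold', 'in', 'san_francisco'],
    ['san_francisco', 'will', 'be', 'foggy', 'all', 'week'],
    ['Other', 'programming', 'languages', 'that', 'have', 'nlp'],
]
for doc in clean_corpus(sample):
    print(doc)
```

With `min_count=2` this mirrors the "occurs exactly once" rule used above; raising it would prune more aggressively.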


<h1>Text Dictionary</h1>
Each unique word is mapped to an integer id. This saves memory, since storing and comparing strings is more expensive than working with integers.

In [13]:
from gensim import corpora

dictionary = corpora.Dictionary(wbag)
print(dictionary)
print('')
print(dictionary.token2id)

Dictionary(19 unique tokens: ['on', 'python', 'nlp', 'code', 'that']...)

{'on': 1, 'python': 4, 'nlp': 3, 'code': 17, 'that': 8, 'running': 2, 'weather': 16, 'libraries': 15, 'outside': 10, 'include': 14, 'las_vegas': 11, 'evening': 12, 'cold': 7, 'sunny': 9, 'very': 13, 'tutorial': 0, 'gensim': 18, 'programming': 6, 'san_francisco': 5}


In [14]:
corpus = [dictionary.doc2bow(text) for text in wbag]

# For each document, display [(word_id, frequency), ...]
for i in corpus:
    print(i)

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
[(5, 1)]
[(4, 1), (6, 1)]
[(5, 1), (7, 1)]
[(8, 1), (9, 1), (10, 1), (11, 1)]
[(7, 1), (11, 1), (12, 1), (13, 1)]
[(3, 1), (6, 1), (8, 1), (14, 1), (15, 1)]
[(7, 1), (12, 1), (16, 1)]
[(9, 1), (10, 1), (16, 1)]
[(1, 1), (17, 1)]
[(2, 1), (3, 1), (13, 2), (17, 1)]
[(3, 1), (4, 1), (14, 1), (15, 1), (18, 1)]
[(0, 1), (18, 1)]
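
To make the (word_id, frequency) pairs concrete, here is a plain-Python sketch of the same idea. It is not gensim's implementation: `build_token2id` and `to_bow` are hypothetical helpers, and ids are assigned in first-seen order, which may differ from the ids gensim assigns.

```python
from collections import Counter

def build_token2id(docs):
    """Assign each new token the next free integer id (first-seen order)."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (word_id, frequency) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs_small = [['running', 'nlp', 'code', 'very', 'very'], ['code', 'on']]
token2id = build_token2id(docs_small)
print(token2id)
for doc in docs_small:
    print(to_bow(doc, token2id))
```

The repeated word "very" shows up as a single pair with frequency 2, just as `(13, 2)` does in the output above.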


<h1>TF-IDF</h1>

In [15]:
from gensim import models
# Initialize the model from the bag-of-words corpus
tfidf = models.TfidfModel(corpus)

print("TF-IDF Model")
print(tfidf)
print("")

corpus_tfidf = tfidf[corpus]

for i in corpus_tfidf:
    print(i)

TF-IDF Model
TfidfModel(num_docs=13, num_nnz=42)

[(0, 0.4993638686130151), (1, 0.4993638686130151), (2, 0.4993638686130151), (3, 0.3144444033650505), (4, 0.3911929157895468)]
[(5, 1.0)]
[(4, 0.6166859611993709), (6, 0.7872092639569278)]
[(5, 0.7872092639569278), (7, 0.6166859611993709)]
[(8, 0.5), (9, 0.5), (10, 0.5), (11, 0.5)]
[(7, 0.41209612453809086), (11, 0.5260471411514012), (12, 0.5260471411514012), (13, 0.5260471411514012)]
[(3, 0.30031204378750365), (6, 0.47692050604796093), (8, 0.47692050604796093), (14, 0.47692050604796093), (15, 0.47692050604796093)]
[(7, 0.4845593542465289), (12, 0.6185475859673962), (16, 0.6185475859673962)]
[(9, 0.5773502691896257), (10, 0.5773502691896257), (16, 0.5773502691896257)]
[(1, 0.7071067811865476), (17, 0.7071067811865476)]
[(2, 0.3953925464613104), (3, 0.2489747079866609), (13, 0.7907850929226208), (17, 0.3953925464613104)]
[(3, 0.3144444033650505), (4, 0.3911929157895468), (14, 0.4993638686130151), (15, 0.4993638686130151), (18, 0.499363868
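
The weights above can be checked by hand. The sketch below assumes the weighting is term count times log2(N / document frequency), followed by scaling each document vector to unit length, which is my understanding of gensim's default settings. Under that assumption it should give values close to 0.617 and 0.787 for "python" and "programming" in the third document. `tfidf_weights` is a hypothetical helper that works on words rather than ids.

```python
import math
from collections import Counter

# The bag-of-words documents printed earlier in this tutorial
wbag = [
    ['running', 'tutorial', 'on', 'nlp', 'python'],
    ['san_francisco'],
    ['python', 'programming'],
    ['cold', 'san_francisco'],
    ['sunny', 'that', 'outside', 'las_vegas'],
    ['las_vegas', 'very', 'cold', 'evening'],
    ['programming', 'that', 'nlp', 'libraries', 'include'],
    ['cold', 'weather', 'evening'],
    ['weather', 'outside', 'sunny'],
    ['code', 'on'],
    ['running', 'nlp', 'code', 'very', 'very'],
    ['python', 'libraries', 'nlp', 'include', 'gensim'],
    ['gensim', 'tutorial'],
]

def tfidf_weights(doc, docs):
    """Hypothetical helper: unit-length TF-IDF weights for one document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears
    doc_freq = Counter(word for d in docs for word in set(d))
    raw = {w: tf * math.log2(n_docs / doc_freq[w]) for w, tf in Counter(doc).items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()} if norm else raw

for word, weight in tfidf_weights(wbag[2], wbag).items():
    print(word, round(weight, 4))
```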

<h1>Latent Dirichlet Allocation</h1>
LDA is an unsupervised method for topic modeling. For each document, its output is a probability for each topic.

See <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">LDA</a> for a description.
For additional gensim LDA model features, see https://radimrehurek.com/gensim/models/ldamodel.html


In [16]:
# Define the number of topics
tops = 2
# Set up the model
lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=tops)

# Show the top terms of each topic as (word_id, probability) pairs
topic = [lda.get_topic_terms(t) for t in range(tops)]
for i in topic:
    print(i)

[(5, 0.071180690634488133), (18, 0.070236694847283135), (6, 0.070078244367833081), (9, 0.066817136009506353), (10, 0.06556492351032131), (4, 0.062382774238076782), (8, 0.060047174535213063), (14, 0.056073547434890368), (16, 0.053383846433893774), (15, 0.053148343921609963)]
[(7, 0.080569433827501175), (13, 0.075313140263695996), (1, 0.070300554453170849), (17, 0.066019340083690495), (12, 0.065839871233356634), (5, 0.063055734971793245), (2, 0.056513420663318395), (3, 0.055099145098413169), (0, 0.054864765071564639), (4, 0.053159702420372608)]


In [17]:
# Show the same topics with word ids resolved to the words themselves
lda.print_topics(2)

[(0,
  '0.071*san_francisco + 0.070*gensim + 0.070*programming + 0.067*sunny + 0.066*outside + 0.062*python + 0.060*that + 0.056*include + 0.053*weather + 0.053*libraries'),
 (1,
  '0.081*cold + 0.075*very + 0.070*on + 0.066*code + 0.066*evening + 0.063*san_francisco + 0.057*running + 0.055*nlp + 0.055*tutorial + 0.053*python')]

In [18]:
# For each document, display [(topic_id, probability), ...]
lda_corpus = lda[corpus_tfidf]
for doc in lda_corpus:
    print(doc) 

[(0, 0.2247946635725305), (1, 0.77520533642746947)]
[(0, 0.59995871307913284), (1, 0.40004128692086716)]
[(0, 0.73802606213437372), (1, 0.26197393786562628)]
[(0, 0.31837758508477138), (1, 0.68162241491522868)]
[(0, 0.79760053911623652), (1, 0.20239946088376351)]
[(0, 0.2075786924779715), (1, 0.79242130752202844)]
[(0, 0.79820294359975319), (1, 0.20179705640024681)]
[(0, 0.2385248696936767), (1, 0.76147513030632319)]
[(0, 0.77298637996271324), (1, 0.2270136200372869)]
[(0, 0.23316488239982783), (1, 0.7668351176001722)]
[(0, 0.20594018762942526), (1, 0.79405981237057477)]
[(0, 0.78758353206146936), (1, 0.21241646793853075)]
[(0, 0.71035821886814599), (1, 0.28964178113185413)]
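
A common next step is to assign each document to its most probable topic. The sketch below does this for the first four rows of the output above, with the probabilities rounded to three decimals. `dominant_topic` is a hypothetical helper. Because LDA starts from a random initialization, a fresh run may produce different probabilities and even swap the topic ids.

```python
# First four rows of the output above, rounded to three decimals
doc_topics = [
    [(0, 0.225), (1, 0.775)],
    [(0, 0.600), (1, 0.400)],
    [(0, 0.738), (1, 0.262)],
    [(0, 0.318), (1, 0.682)],
]

def dominant_topic(topic_probs):
    """Hypothetical helper: return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])

for doc_index, probs in enumerate(doc_topics):
    topic_id, prob = dominant_topic(probs)
    print(f"doc {doc_index}: topic {topic_id} (p={prob:.3f})")
```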
