# NLP Part 2


## Bigram, Clustering and Topic Modeling


In This notebook we will performing Bigram, Clustering and Topic Modeling. if you haven't follow part one notebook you can click [here](https://nbviewer.jupyter.org/github/datA2Z/All-about-data-science-and-AI/blob/master/Natural%20language%20processing/Natural%20language%20processing%20part%201.ipynb) and follow along. lets code...

In [1]:
# All the information related to this code are provided in notebook one.
import pandas as pd
import numpy as np
import nltk
from textblob import TextBlob
import os
import warnings
warnings.filterwarnings('ignore')
os.chdir("F:/portfolio/NLP/Sentiment analysis/Data")
yelp = pd.read_csv('yelp.csv')

yelp['Sentiment'] = 'NA'
for i in range(0,len(yelp)):
    t = TextBlob(yelp['text'][i]).sentiment.polarity
    if t < 0:
        yelp.set_value(i,'Sentiment','Negative')
    elif t == 0:
        yelp.set_value(i,'Sentiment','Neutral')
    else :
        yelp.set_value(i,'Sentiment','Positive')



pos = yelp.loc[yelp['Sentiment'] == "Positive"]
neg = yelp.loc[yelp['Sentiment'] == "Negative"]

### Bigram

Bigrams are basically a set of co-occuring words within a given window.

***Note : We are analyzing only negative text but you can perform it with other text as well!***

In [2]:
# import require library 
import re
import string
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords

In [3]:
# Converting text data into list
texts = neg["text"].tolist()

In [4]:
stopwords=stopwords.words('english')
english_vocab = set(w.lower() for w in nltk.corpus.words.words())

In [5]:
# Text clean funcation
def process_text(text):
   if text.startswith('@null'):
       return "[text not available]"
   text = re.sub(r'\$\w*','',text) # Remove tickers
   text = re.sub(r'https?:\/\/.*\/\w*','',text) # Remove hyperlinks
   text = re.sub(r'['+string.punctuation+']+', ' ',text) # Remove puncutations like 's
   text_ok = TweetTokenizer(strip_handles=True, reduce_len=True)
   tokens = text_ok.tokenize(text)
   tokens = [i.lower() for i in tokens if i not in stopwords and len(i) > 2 and  
                                             i in english_vocab]
   return tokens

In [6]:
words = []
for tw in texts:
    words += process_text(tw)

In [7]:
# Perform bigram
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words, 5)
finder.apply_freq_filter(5)
print(finder.nbest(bigram_measures.likelihood_ratio, 10))

[('worst', 'ever'), ('customer', 'service'), ('parking', 'lot'), ('strip', 'mall'), ('tasted', 'like'), ('behind', 'counter'), ('mac', 'cheese'), ('even', 'though'), ('fried', 'rice'), ('gross', 'gross')]


Great, as we can see Bigrams output customer ofter complaning about ('customer', 'service'), ('parking', 'lot'), ('worst', 'ever').

# Clustering

Let's go further, Now we will perform clustering to understand similar texts in negative reviews. Texts can be grouped together in clusters based on closeness or ‘distance’ amongst them.

Each text is pre-processed and added to a list. The list is fed to TFIDF Vectorizer to convert each text into a vector. Each value in the vector depends on how many times a word or a term appears in the text (TF) and on how rare it is amongst all text/documents (IDF).

In [8]:
cleaned_texts = []
for tw in texts:
    words = process_text(tw)
    cleaned_text = " ".join(w for w in words if len(w) > 2 and 
w.isalpha()) #Form sentences of processed words
    cleaned_texts.append(cleaned_text)
neg['CleanText'] = cleaned_texts

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer  
tfidf_vectorizer = TfidfVectorizer(use_idf=True, ngram_range=(1,3))  
tfidf_matrix = tfidf_vectorizer.fit_transform(cleaned_texts)  
feature_names = tfidf_vectorizer.get_feature_names() # num phrases  
from sklearn.metrics.pairwise import cosine_similarity  
dist = 1 - cosine_similarity(tfidf_matrix)  
print(dist)

[[-5.77315973e-15  9.85838108e-01  9.91833374e-01 ...  9.97755512e-01
   9.92920586e-01  9.92770666e-01]
 [ 9.85838108e-01 -1.06581410e-14  9.96470534e-01 ...  9.95751630e-01
   9.94295377e-01  9.94358520e-01]
 [ 9.91833374e-01  9.96470534e-01 -2.22044605e-16 ...  9.92887555e-01
   9.96491856e-01  9.94311228e-01]
 ...
 [ 9.97755512e-01  9.95751630e-01  9.92887555e-01 ...  2.10942375e-15
   9.91922312e-01  9.97681282e-01]
 [ 9.92920586e-01  9.94295377e-01  9.96491856e-01 ...  9.91922312e-01
  -1.11022302e-15  9.74792558e-01]
 [ 9.92770666e-01  9.94358520e-01  9.94311228e-01 ...  9.97681282e-01
   9.74792558e-01 -4.44089210e-16]]


In [10]:
from sklearn import cluster
from sklearn.cluster import KMeans
num_clusters = 3
km = KMeans(n_clusters=num_clusters)  
km.fit(tfidf_matrix)  
clusters = km.labels_.tolist()  
neg['ClusterID'] = clusters
print(neg['ClusterID'].value_counts())

2    514
0    270
1     18
Name: ClusterID, dtype: int64


The output shows 3 clusters, with following number of text in respective clusters.
Most of the tweets are clustered around in group Id =2. Remaining are in group id 0 and id 1.

In [11]:
#sort cluster centers by proximity to centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(num_clusters):
    print("Cluster {} : Words :".format(i))
    for ind in order_centroids[i, :10]: 
        print(' %s' % feature_names[ind])

Cluster 0 : Words :
 service
 place
 like
 chicken
 good
 average
 small
 pizza
 one
 money
Cluster 1 : Words :
 closed
 location closed
 location
 closed long
 closed long time
 closed last
 closed last month
 sad closed well
 sad closed
 closed well
Cluster 2 : Words :
 food
 get
 place
 like
 chicken
 one
 time
 service
 back
 bad


The output shows words appear in each cluster

### Topic Modeling


Finding central subject in the set of documents, Text in case here. 


We are using Latent Dirichlet Allocation (LDA). LDA is commonly used to identify chosen number (say, 6) topics.

In [12]:
# Import libraries and functions
from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

In [13]:
# Funcation for removing stopwords, punctuations and lemmatization 
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

In [14]:
# process texts
texts = [text for text in cleaned_texts if len(text) > 2]
doc_clean = [clean(doc).split() for doc in texts]
dictionary = corpora.Dictionary(doc_clean)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
ldamodel = models.ldamodel.LdaModel(doc_term_matrix, num_topics=6, id2word = dictionary, passes=5)

In [15]:
# Loop over text to find topics
for topic in ldamodel.show_topics(num_topics=6, formatted=False, num_words=6):
    print("Topic {}: Words: ".format(topic[0]))
    topicwords = [w for (w, val) in topic[1]]
    print(topicwords)

Topic 0: Words: 
['like', 'time', 'back', 'one', 'place', 'really']
Topic 1: Words: 
['food', 'much', 'time', 'like', 'really', 'pedicure']
Topic 2: Words: 
['food', 'like', 'get', 'time', 'chicken', 'tasted']
Topic 3: Words: 
['food', 'place', 'chicken', 'like', 'one', 'get']
Topic 4: Words: 
['one', 'place', 'like', 'food', 'would', 'get']
Topic 5: Words: 
['place', 'time', 'food', 'like', 'get', 'one']


It is clear from the words associated with the topics that they represent certain sentiments. Topic 0 is about place and time, Topic 1, 2, 3, 4, 5  is about food, chicken, taste etc.

**That is all for this demonstration. I hope you enjoyed the notebook, and see you in the next one.**