# Tasks

1. Normalize case
2. Tokenize (using word_tokenize from NLTK)
3. POS tagging using the NLTK pos tagger
4. For the topic model, we would want to include only nouns
 - First, find out all the POS tags that correspond to nouns
 - Limit the data to only terms with these tags
5. Lemmatize so that different forms of a term are treated as one (no need to pass a POS tag to the lemmatizer for now)
6. Remove stop words and punctuation (if there are any at all after the POS tagging)
7. Create a topic model using LDA on the cleaned up data with 12 topics
 - choose the topic model parameters carefully
 - what is the perplexity of the model?
 - what is the coherence of the model?
8. Analyze the topics: which pairs of topics can be combined?
9. Create a topic model using LDA with what you think is the optimal number of topics
 - choose the topic model parameters carefully
 - is the perplexity better now?
 - is the coherence better now?
10. The business finally needs to be able to interpret the topics
 - name each of the identified topics
 - create a table with the topic name and the top 10 terms in each to present to business

### Task 1. Normalize case

In [1]:
# import libraries
import warnings
warnings.filterwarnings("ignore")

import numpy as np, pandas as pd
import re, random, os, string

from pprint import pprint #pretty print
import matplotlib.pyplot as plt
%matplotlib inline

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [2]:
reviews = pd.read_csv("K8 Reviews v0.2.csv")
reviews.head()

Unnamed: 0,sentiment,review
0,1,Good but need updates and improvements
1,0,"Worst mobile i have bought ever, Battery is dr..."
2,1,when I will get my 10% cash back.... its alrea...
3,1,Good
4,0,The worst phone everThey have changed the last...


In [3]:
reviewsLower = [sent.lower() for sent in reviews.review.values]
reviewsLower[0]

'good but need updates and improvements'

### Task 2. Tokenize (using word_tokenize from NLTK)

In [4]:
reviewsToken = [word_tokenize(sent) for sent in reviewsLower]
reviewsToken[0]

['good', 'but', 'need', 'updates', 'and', 'improvements']

### Task 3. POS tagging using the NLTK pos tagger

In [5]:
import nltk

In [6]:
nltk.pos_tag(reviewsToken[0])

[('good', 'JJ'),
 ('but', 'CC'),
 ('need', 'VBP'),
 ('updates', 'NNS'),
 ('and', 'CC'),
 ('improvements', 'NNS')]

In [8]:
sent = "I like to move it".split()
sentTagged = nltk.pos_tag(sent)
sentTagged

[('I', 'PRP'), ('like', 'VBP'), ('to', 'TO'), ('move', 'VB'), ('it', 'PRP')]

In [9]:
reviewsTagged = [nltk.pos_tag(tokens) for tokens in reviewsToken]
reviewsTagged[0]

[('good', 'JJ'),
 ('but', 'CC'),
 ('need', 'VBP'),
 ('updates', 'NNS'),
 ('and', 'CC'),
 ('improvements', 'NNS')]

### Task 4. For the topic model, we would want to include only nouns
 - First, find out all the POS tags that correspond to nouns
 - Limit the data to only terms with these tags


In [10]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [11]:
taggedTuple = nltk.pos_tag(['great'])
taggedTuple[0]

('great', 'JJ')

In [12]:
pprint(taggedTuple[0][0])
pprint(taggedTuple[0][1])

'great'
'JJ'


In [13]:
# Keep only tokens whose Penn Treebank tag starts with "NN" (NN, NNS, NNP, NNPS)
reviewsNoun = []
for sent in reviewsTagged:
    reviewsNoun.append([token for token in sent if token[1].startswith("NN")])
reviewsNoun[0]

[('updates', 'NNS'), ('improvements', 'NNS')]
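
The first sub-task asks which POS tags correspond to nouns, and the truncated tagset listing above stops before reaching them. In the Penn Treebank tagset these are `NN`, `NNS`, `NNP` and `NNPS`. A minimal sketch of the same filter with the tags spelled out explicitly; `NOUN_TAGS` and `keep_nouns` are illustrative names that do not appear in the notebook:

```python
# Penn Treebank noun tags: singular, plural, proper singular, proper plural
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def keep_nouns(tagged_tokens):
    """Return only the (word, tag) pairs whose tag is a noun tag."""
    return [(word, tag) for word, tag in tagged_tokens if tag in NOUN_TAGS]

# Tagged tokens copied from the output of reviewsTagged[0]
tagged = [('good', 'JJ'), ('but', 'CC'), ('need', 'VBP'),
          ('updates', 'NNS'), ('and', 'CC'), ('improvements', 'NNS')]
print(keep_nouns(tagged))  # [('updates', 'NNS'), ('improvements', 'NNS')]
```

An explicit set makes the choice of tags visible to a reader and is easy to narrow, for example by dropping the proper-noun tags.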

### Task 5. Lemmatize
 - you want different forms of the terms to be treated as one
 - don't worry about providing POS tag to lemmatizer for now

In [15]:
lemm = WordNetLemmatizer()
reviewsLemm=[]
for sent in reviewsNoun:
    reviewsLemm.append([lemm.lemmatize(word[0]) for word in sent])

In [18]:
reviewsLemm[0:5]

[['update', 'improvement'],
 ['mobile',
  'i',
  'battery',
  'hell',
  'backup',
  'hour',
  'us',
  'idle',
  'discharged.this',
  'lie',
  'amazon',
  'lenove',
  'battery',
  'charger',
  'hour',
  'don'],
 ['i', '%', 'cash', 'january..'],
 [],
 ['phone', 'everthey', 'phone', 'problem', 'amazon', 'phone', 'amazon']]
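
Skipping the POS argument is harmless here because `WordNetLemmatizer.lemmatize` treats a word as a noun by default, and only nouns survived the previous step. If other word classes were kept, the Penn Treebank tag would need to be translated into one of WordNet's single-letter POS codes first. A hedged sketch of that translation; `penn_to_wordnet` is a hypothetical helper, not part of NLTK:

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS code ('n', 'v', 'a' or 'r')."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # nouns and everything else fall back to the noun default

# Assumed usage with the notebook's objects:
#   lemm.lemmatize(word, pos=penn_to_wordnet(tag))
for tag in ["NNS", "VBP", "JJ", "RB"]:
    print(tag, "->", penn_to_wordnet(tag))
```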

### Task 6. Remove stop words and punctuation (if there are any at all after the POS tagging)

In [20]:
from string import punctuation
from nltk.corpus import stopwords
stopNltk = stopwords.words("english")

In [21]:
# Use a set so that each membership test is a constant-time lookup
stopUpdated = set(stopNltk + list(punctuation) + ["..."] + [".."])
reviewsStopWordremoved = []
for sent in reviewsLemm:
    reviewsStopWordremoved.append([term for term in sent if term not in stopUpdated])

In [23]:
reviewsStopWordremoved[1:4]

[['mobile',
  'battery',
  'hell',
  'backup',
  'hour',
  'us',
  'idle',
  'discharged.this',
  'lie',
  'amazon',
  'lenove',
  'battery',
  'charger',
  'hour'],
 ['cash', 'january..'],
 []]
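
The output still contains fused tokens such as `'discharged.this'` and `'january..'`, because the stop list only removes tokens that consist entirely of punctuation. One possible cleanup, sketched below under the assumption that splitting on any ASCII punctuation is acceptable for this data (it would also split hyphenated words and contractions); `split_fused_tokens` is an illustrative name:

```python
import re
import string

# Matches one or more consecutive ASCII punctuation characters
PUNCT_RUN = re.compile(f"[{re.escape(string.punctuation)}]+")

def split_fused_tokens(tokens):
    """Split each token on punctuation runs and drop the empty pieces."""
    cleaned = []
    for token in tokens:
        cleaned.extend(piece for piece in PUNCT_RUN.split(token) if piece)
    return cleaned

sample = ['cash', 'january..', 'discharged.this']
print(split_fused_tokens(sample))  # ['cash', 'january', 'discharged', 'this']
```

Splitting can release new stop words (`'this'` in the sample), so a step like this would have to run before the stop-word filter rather than after it.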

### Task 7. Create a topic model using LDA on the cleaned up data with 12 topics
 - what is the coherence of the model?

Use gensim for this task.

In [24]:
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
from gensim.models import ldamodel

In [25]:
id2word = corpora.Dictionary(reviewsStopWordremoved)
texts = reviewsStopWordremoved
corpus = [id2word.doc2bow(text) for text in texts]

In [26]:
print(corpus[200])

[(426, 1), (427, 1), (428, 1), (429, 1)]


In [27]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=12, 
                                           random_state=42,
                                           passes=10,
                                           per_word_topics=True)

In [28]:
pprint(lda_model.print_topics())

[(0,
  '0.381*"mobile" + 0.023*"problem" + 0.023*"notification" + 0.017*"heat" + '
  '0.016*"cell" + 0.016*"message" + 0.011*"hang" + 0.011*"rate" + '
  '0.010*"whatsapp" + 0.009*"call"'),
 (1,
  '0.267*"battery" + 0.105*"problem" + 0.055*"backup" + 0.055*"heating" + '
  '0.052*"issue" + 0.037*"performance" + 0.036*"hour" + 0.032*"day" + '
  '0.030*"time" + 0.029*"life"'),
 (2,
  '0.062*"handset" + 0.051*"software" + 0.041*"box" + 0.032*"contact" + '
  '0.030*"update" + 0.026*"set" + 0.023*"star" + 0.023*"option" + 0.022*"item" '
  '+ 0.020*"purchase"'),
 (3,
  '0.080*"phone" + 0.049*"amazon" + 0.044*"service" + 0.030*"lenovo" + '
  '0.030*"day" + 0.029*"issue" + 0.027*"problem" + 0.026*"time" + '
  '0.022*"delivery" + 0.019*"experience"'),
 (4,
  '0.135*"feature" + 0.076*"camera" + 0.048*"mode" + 0.037*"video" + '
  '0.027*"android" + 0.025*"stock" + 0.023*"depth" + 0.019*"gallery" + '
  '0.018*"volta" + 0.017*"thanks"'),
 (5,
  '0.439*"product" + 0.090*"charger" + 0.018*"earphone" + 

In [30]:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=reviewsStopWordremoved, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.5560767730635368
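
The task list also asks for the model's perplexity, which the cells above do not compute. In gensim, `LdaModel.log_perplexity(corpus)` returns a per-word likelihood bound rather than the perplexity itself, and gensim's own log messages report the perplexity estimate as 2 raised to the negative of that bound. A minimal sketch of the conversion; the bound used here is a made-up placeholder, not a result from this notebook:

```python
def perplexity_from_bound(per_word_bound: float) -> float:
    """Convert gensim's per-word likelihood bound into a perplexity estimate."""
    return 2.0 ** (-per_word_bound)

# Assumed usage with the notebook's objects:
#   bound = lda_model.log_perplexity(corpus)
#   print(perplexity_from_bound(bound))
example_bound = -7.0  # hypothetical value, for illustration only
print(perplexity_from_bound(example_bound))  # 128.0
```

Lower perplexity is better. Evaluating it on the training corpus, as the commented usage does, tends to give an optimistic figure compared with held-out reviews.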


### Task 8. Analyze the topics, which pairs of topics can be combined?
 - you can assume that if a pair of topics has very similar top terms, they are very close and can be combined

### Based on the top terms, the following topics can be combined

 1. Topics 2 and 5: pricing
 2. Topics 4, 6 and 10: battery-related issues
 3. Topics 3 and 11: performance
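
The rule of thumb for this task, that topics with very similar top terms can be merged, can be made measurable with a set-overlap score. The sketch below uses Jaccard similarity, which is my choice of measure and not something the notebook prescribes. The term lists are copied from the printed 12-topic output for topics 0, 1 and 3 only:

```python
from itertools import combinations

# Top 10 terms per topic, copied from the print_topics() output above
top_terms = {
    0: ['mobile', 'problem', 'notification', 'heat', 'cell',
        'message', 'hang', 'rate', 'whatsapp', 'call'],
    1: ['battery', 'problem', 'backup', 'heating', 'issue',
        'performance', 'hour', 'day', 'time', 'life'],
    3: ['phone', 'amazon', 'service', 'lenovo', 'day',
        'issue', 'problem', 'time', 'delivery', 'experience'],
}

def jaccard(terms_a, terms_b):
    """Share of distinct terms that two topics have in common."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b)

# Score every pair of topics and list the most similar pairs first
pairs = sorted(
    ((jaccard(top_terms[i], top_terms[j]), i, j)
     for i, j in combinations(top_terms, 2)),
    reverse=True,
)
for score, i, j in pairs:
    print(f"topics {i} and {j}: {score:.2f}")
```

A high score only flags a candidate pair. Generic terms such as `problem` and `issue` inflate the overlap, so the final decision still needs a look at the terms themselves.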

### Task 9. Create topic model using LDA with what you think is the optimal number of topics

 - is the coherence better now?

In [31]:
# Build LDA model
lda_model8 = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=8, 
                                           random_state=42,
                                           passes=10,
                                           per_word_topics=True)

In [33]:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model8, texts=reviewsStopWordremoved, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.5470127061130555
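
Comparing the two runs answers the question: 0.5470 for 8 topics is slightly lower than 0.5561 for 12 topics, so by this measure coherence did not improve. A small sketch of how such a comparison could be organized; the dictionary holds only the two scores printed in this notebook, and further candidates would each need their own model and `CoherenceModel(...).get_coherence()` call:

```python
# c_v coherence per number of topics, copied from the two outputs above
coherence_by_k = {
    12: 0.5560767730635368,
    8: 0.5470127061130555,
}

# Pick the number of topics with the highest coherence score
best_k = max(coherence_by_k, key=coherence_by_k.get)
difference = coherence_by_k[12] - coherence_by_k[8]

print(f"best number of topics by coherence: {best_k}")
print(f"difference between the two runs: {difference:.4f}")
```

A gap this small may be within run-to-run variation, so interpretability of the topics is a reasonable tie-breaker.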


### Task 10. The business finally needs to be able to interpret the topics
 - name each of the identified topics
 - create a table with the topic name and the top 10 terms in each to present to business

In [34]:
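# show_topics(formatted=False) returns (topic_id, [(word, weight), ...]) pairs
topics = lda_model8.show_topics(formatted=False)
# Keep only the words, dropping their weights
topics_words = [(topic_id, [word for word, _ in terms]) for topic_id, terms in topics]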

In [35]:
for topic,words in topics_words:
    print(str(topic)+ "::"+ str(words))
print()

0::['mobile', 'charger', 'heat', 'charge', 'superb', 'turbo', 'hour', 'min', 'notification', 'awesome']
1::['battery', 'phone', 'problem', 'camera', 'backup', 'heating', 'issue', 'performance', 'quality', 'life']
2::['note', 'k8', 'lenovo', 'phone', 'software', 'screen', 'update', 'issue', 'handset', 'option']
3::['phone', 'amazon', 'issue', 'time', 'service', 'day', 'problem', 'month', 'lenovo', 'delivery']
4::['phone', 'camera', 'price', 'feature', 'range', 'mode', 'performance', 'device', 'quality', 'depth']
5::['product', 'money', 'waste', 'performance', 'ok', 'cast', 'item', 'pic', 'please', 'work']
6::['phone', 'network', 'call', 'sim', 'hai', 'jio', 'volta', 'budget', 'card', 'issue']
7::['camera', 'quality', 'money', 'value', 'music', 'speed', 'h', 'clarity', 'video', 'screen']



### Possible topics

1. Product accessories
2. Amazon
3. Pricing
4. Mobile network
5. Phone performance
6. Battery-related issues
7. Camera quality
8. General phone features
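
The second part of the task, a table of topic names with their top 10 terms, could be built roughly as follows. The list above does not say which name belongs to which topic id, so the mapping in `topic_names` is my own reading of three of the printed topics and should be treated as an assumption:

```python
# Three of the (topic_id, words) pairs printed above, copied verbatim
topics_words = [
    (3, ['phone', 'amazon', 'issue', 'time', 'service',
         'day', 'problem', 'month', 'lenovo', 'delivery']),
    (6, ['phone', 'network', 'call', 'sim', 'hai',
         'jio', 'volta', 'budget', 'card', 'issue']),
    (7, ['camera', 'quality', 'money', 'value', 'music',
         'speed', 'h', 'clarity', 'video', 'screen']),
]

# Assumed id-to-name mapping, for illustration only
topic_names = {3: "Amazon", 6: "Mobile network", 7: "Camera quality"}

# Build a markdown table with one row per topic
rows = ["| Topic | Name | Top 10 terms |", "| --- | --- | --- |"]
for topic_id, words in topics_words:
    name = topic_names.get(topic_id, "unnamed")
    rows.append(f"| {topic_id} | {name} | {', '.join(words)} |")

print("\n".join(rows))
```

The same rows could instead be loaded into a pandas DataFrame if the table needs to be exported for the business audience.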