## DESCRIPTION

Help a leading mobile brand understand the voice of the customer by analyzing the reviews of their product on Amazon and the topics that customers are talking about. You will perform topic modeling on specific parts of speech and then interpret the emerging topics.

## Problem Statement: 

A popular mobile phone brand, Lenovo, has launched a budget smartphone in the Indian market. The client wants to understand the VOC (voice of the customer) on the product. This will be useful not just to evaluate the current product but also to get some direction for developing the product pipeline. The client is particularly interested in the different aspects that customers care about. Product reviews by customers on a leading e-commerce site should provide a good view.

In [1]:
import pandas as pd
import numpy as np
import re
import string
import nltk
import gensim
import spacy
from nltk.tokenize import word_tokenize

In [2]:
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from gensim import corpora
from gensim.models import CoherenceModel
from pprint import pprint

In [3]:
from nltk.corpus import stopwords
stop_word = stopwords.words('english')

## Tasks: 

### 1. Read the .csv file using Pandas. Take a look at the top few records.

In [4]:
data = pd.read_csv(r'D:\Simplilearn\project\NLP\Project1\K8 Reviews v0.2.csv')
data.head()

Unnamed: 0,sentiment,review
0,1,Good but need updates and improvements
1,0,"Worst mobile i have bought ever, Battery is dr..."
2,1,when I will get my 10% cash back.... its alrea...
3,1,Good
4,0,The worst phone everThey have changed the last...


### 2. Normalize casings for the review text and extract the text into a list for easier manipulation.

In [5]:
data.review = data.review.apply(lambda x:x.lower())

In [6]:
data['review'] = data.review.apply(lambda x: re.sub(r'[^\w\s]','',x)) ## removing punctuations

In [7]:
data['review'] = data.review.apply(lambda x: ' '.join([item for item in x.split(' ') if item not in stop_word]))# removing stopwords

In [8]:
data.head()

Unnamed: 0,sentiment,review
0,1,good need updates improvements
1,0,worst mobile bought ever battery draining like...
2,1,get 10 cash back already 15 january
3,1,good
4,0,worst phone everthey changed last phone proble...


In [9]:
corpus = data.review.to_list()

In [10]:
corpus[1]

'worst mobile bought ever battery draining like hell backup 6 7 hours internet uses even put mobile idle getting dischargedthis biggest lie amazon  lenove expected making full saying battery 4000mah  booster charger fake takes least 4 5 hours fully chargeddont know lenovo survive making full usplease dont go else regret like'
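The cleaned review above contains fused tokens such as `dischargedthis`, `chargeddont`, and `usplease`. One likely cause, assuming the raw text had punctuation with no space after it, is that punctuation is deleted rather than replaced by a space. The sketch below is a minimal pure-Python variant of the cleaning steps; the `normalize` helper and the tiny `STOP_WORDS` set are hypothetical stand-ins for the notebook's cells and for NLTK's English stop-word list.

```python
import re

# Hypothetical mini stop-word set for illustration only;
# the notebook itself uses NLTK's English stop-word list.
STOP_WORDS = {"i", "is", "the", "a", "this", "to", "it"}

def normalize(text, stop_words=STOP_WORDS):
    """Lowercase, turn punctuation into spaces, and drop stop words."""
    text = text.lower()
    # Replace punctuation with a space so "discharged.This" stays two words.
    text = re.sub(r"[^\w\s]", " ", text)
    # split() with no argument collapses runs of whitespace.
    tokens = text.split()
    return " ".join(t for t in tokens if t not in stop_words)

print(normalize("Getting discharged.This is the biggest lie!"))
```

Keeping the stop words in a set also makes each membership check constant time, whereas `stopwords.words('english')` returns a list.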

### 3. Tokenize the reviews using NLTK's word_tokenize function.

In [11]:
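# Note: tokenization here uses gensim's simple_preprocess rather than NLTK's word_tokenize.
def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks from tokens
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)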

In [12]:
tokenized = list(sent_to_words(corpus))
print(tokenized[:1])

[['good', 'need', 'updates', 'improvements']]
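The tokenizer used in this step is gensim's `simple_preprocess`, which lowercases the text, keeps only alphabetic runs, and drops very short or very long tokens. This explains why numbers such as `6`, `7`, and the `4000` in `4000mah` are absent from the later outputs. The function below is only a rough pure-Python approximation of that behaviour (no accent stripping); the name `approx_simple_preprocess` is hypothetical, and the length limits of 2 and 15 are assumed to match gensim's defaults.

```python
import re

# Runs of word characters that are not digits.
_ALPHA_RUN = re.compile(r"[^\W\d]+")

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (illustrative only)."""
    tokens = _ALPHA_RUN.findall(text.lower())
    # Drop tokens outside the length limits or starting with an underscore.
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]

print(approx_simple_preprocess("battery 4000mah backup 6 7 hours"))
```

If numeric tokens matter for the analysis (for example battery capacity), a tokenizer that keeps digits would be the safer choice.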


### 4. Perform parts-of-speech tagging on each sentence using the NLTK POS tagger.

In [13]:
corpus_pos = [nltk.pos_tag(tokens) for tokens in tokenized]  # each element of tokenized is a list of tokens

In [14]:
corpus_pos[:1]

[[('good', 'JJ'), ('need', 'NN'), ('updates', 'NNS'), ('improvements', 'NNS')]]

### 5. For the topic model, we want to include only nouns.

* Find out all the POS tags that correspond to nouns.

In [15]:
noun_tag = ['NNS','NN','NNP']

* Limit the data to only terms with these tags.

In [16]:
pos_tag = [[token[0] for token in sent if token[1] in noun_tag] for sent in corpus_pos]

In [17]:
pos_tag[:2]

[['need', 'updates', 'improvements'],
 ['mobile',
  'hell',
  'backup',
  'hours',
  'uses',
  'lie',
  'amazon',
  'lenove',
  'battery',
  'mah',
  'booster',
  'charger',
  'hours',
  'lenovo',
  'usplease',
  'dont']]
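The noun filter can be illustrated on the tagged sentence shown in step 4. The sketch below is a hedged variant rather than the notebook's exact cell: it uses a set for the tag lookup and also includes `NNPS` (plural proper noun), the fourth Penn Treebank noun tag, which the list above leaves out. The helper name `keep_nouns` is hypothetical.

```python
# Penn Treebank noun tags, including NNPS (plural proper noun).
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def keep_nouns(tagged_sentence, noun_tags=NOUN_TAGS):
    """Return only the words whose POS tag is a noun tag."""
    return [word for word, tag in tagged_sentence if tag in noun_tags]

# Tagged tokens copied from the POS-tagging output in step 4.
tagged = [("good", "JJ"), ("need", "NN"), ("updates", "NNS"), ("improvements", "NNS")]
print(keep_nouns(tagged))
```

Because the reviews were lowercased earlier, proper-noun tags are probably rare in this data, so adding `NNPS` may change little in practice.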

### 6. Lemmatize.

* Different forms of the terms need to be treated as one.

* No need to provide POS tag to lemmatizer for now.

In [18]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [19]:
lemmatized_tags = [[lemmatizer.lemmatize(word) for word in sent] for sent in pos_tag]

In [20]:
lemmatized_tags[:2]

[['need', 'update', 'improvement'],
 ['mobile',
  'hell',
  'backup',
  'hour',
  'us',
  'lie',
  'amazon',
  'lenove',
  'battery',
  'mah',
  'booster',
  'charger',
  'hour',
  'lenovo',
  'usplease',
  'dont']]

### 7. Remove stopwords and punctuation (if there are any).

In [21]:
## stopwords and punctuation were already removed in step 2.

### 8. Create a topic model using LDA on the cleaned-up data with 12 topics.

* Print out the top terms for each topic.

* What is the coherence of the model with the c_v metric?

In [22]:
id2word = corpora.Dictionary(lemmatized_tags)
texts = lemmatized_tags
corpus = [id2word.doc2bow(text) for text in texts] #bag-of-words

In [23]:
corpus[1]

[(3, 1),
 (4, 1),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 1),
 (10, 2),
 (11, 1),
 (12, 1),
 (13, 1),
 (14, 1),
 (15, 1),
 (16, 1),
 (17, 1)]
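Each pair in the output above is `(token_id, count)`; the entry `(10, 2)` reflects that `hour` occurs twice in the second review. The pure-Python sketch below reproduces the idea without gensim. The helper `build_bow` is hypothetical, and the ID assignment order (alphabetical within each new document) is an assumption chosen to mirror the output shown, not a statement of gensim's implementation.

```python
from collections import Counter

def build_bow(docs):
    """Map tokens to integer IDs and return one sorted (id, count) list per doc."""
    token2id = {}
    bows = []
    for doc in docs:
        counts = Counter(doc)
        # Assumption: unseen tokens get IDs in alphabetical order per document.
        for token in sorted(counts):
            token2id.setdefault(token, len(token2id))
        bows.append(sorted((token2id[t], n) for t, n in counts.items()))
    return token2id, bows

# The first two lemmatized reviews from step 6.
docs = [
    ["need", "update", "improvement"],
    ["mobile", "hell", "backup", "hour", "us", "lie", "amazon", "lenove",
     "battery", "mah", "booster", "charger", "hour", "lenovo", "usplease", "dont"],
]
token2id, bows = build_bow(docs)
print(token2id["hour"], bows[1])
```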

In [24]:
# build lda model for 12 topics
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=12,
                                       random_state=50,
                                       chunksize=100,
                                       passes=10,
                                       per_word_topics=True)

In [25]:
pprint(lda_model.print_topics()) # Print out the top terms for each topic.

[(0,
  '0.284*"problem" + 0.082*"heating" + 0.052*"network" + 0.051*"charger" + '
  '0.030*"handset" + 0.027*"turbo" + 0.024*"sim" + 0.014*"connectivity" + '
  '0.012*"bill" + 0.011*"connection"'),
 (1,
  '0.047*"phone" + 0.037*"device" + 0.029*"return" + 0.029*"day" + '
  '0.027*"issue" + 0.027*"support" + 0.025*"amazon" + 0.022*"customer" + '
  '0.021*"call" + 0.020*"lenovo"'),
 (2,
  '0.112*"service" + 0.074*"delivery" + 0.065*"month" + 0.054*"amazon" + '
  '0.043*"smartphone" + 0.037*"thanks" + 0.031*"center" + 0.028*"centre" + '
  '0.026*"function" + 0.023*"class"'),
 (3,
  '0.431*"product" + 0.157*"mobile" + 0.029*"buy" + 0.025*"awesome" + '
  '0.023*"dont" + 0.015*"bit" + 0.014*"amazon" + 0.013*"piece" + 0.009*"plz" + '
  '0.008*"cost"'),
 (4,
  '0.144*"camera" + 0.042*"quality" + 0.033*"phone" + 0.027*"battery" + '
  '0.018*"mode" + 0.016*"feature" + 0.016*"performance" + 0.014*"processor" + '
  '0.012*"music" + 0.011*"game"'),
 (5,
  '0.143*"price" + 0.102*"battery" + 0.095*"b

In [26]:
print('coherence of the model : ',
      CoherenceModel(model=lda_model,texts=texts,dictionary=id2word,coherence='c_v').get_coherence())

coherence of the model :  0.4818381900203999


### 9. Analyze the topics through the business lens.

* Determine which of the topics can be combined.

* Some topics can be combined based on the device aspect they describe, such as battery or camera:

a) Topics 0 and 8 can be combined as 'negative feedback'.

b) Topics 5 and 6 can be combined as 'battery'.

c) Topics 7 and 11 can be combined as 'display'.

d) 'camera' appears in too many topics to be assigned to a single group, so it is ignored.
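The groupings above were made by inspection. As a possible complement, the overlap between the top terms of two topics can be quantified, for example with the Jaccard index. This is a suggested heuristic, not the method used in this notebook, and the sketch covers only topics 1, 2 and 4 because their full term lists are visible in the output above.

```python
from itertools import combinations

# Top terms copied from the 12-topic output above (topics 1, 2 and 4 only).
top_terms = {
    1: {"phone", "device", "return", "day", "issue",
        "support", "amazon", "customer", "call", "lenovo"},
    2: {"service", "delivery", "month", "amazon", "smartphone",
        "thanks", "center", "centre", "function", "class"},
    4: {"camera", "quality", "phone", "battery", "mode",
        "feature", "performance", "processor", "music", "game"},
}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

for i, j in combinations(sorted(top_terms), 2):
    shared = sorted(top_terms[i] & top_terms[j])
    print(i, j, round(jaccard(top_terms[i], top_terms[j]), 3), shared)
```

Pairs with a high score are candidates for merging; a low score, as for the three topics here, suggests that they cover different aspects.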

### 10. Create a topic model using LDA with what you think is the optimal number of topics

* What is the coherence of the model?

In [27]:
## use grid search to find the optimal number of topics

In [28]:
lemmatized_data = [' '.join(word) for word in lemmatized_tags]

In [29]:
vectorizer = CountVectorizer(analyzer='word',
                             min_df=10,                        # minimum number of documents a word must appear in
                             stop_words='english',             # remove stop words
                             lowercase=True,                   # convert all words to lowercase
                             token_pattern='[a-zA-Z0-9]{3,}',  # keep tokens of at least 3 characters
                             # max_features=50000,             # max number of unique words
                            )

In [30]:
data_vectorized = vectorizer.fit_transform(lemmatized_data)

In [31]:
search_params = {'n_components': [2,3,4,5,6,7,8], 'learning_decay': [.5, .7, .9]}

In [32]:
lda = LatentDirichletAllocation()
model = GridSearchCV(lda,param_grid=search_params)

In [33]:
%%time
model.fit(data_vectorized)

Wall time: 30min 22s


GridSearchCV(estimator=LatentDirichletAllocation(),
             param_grid={'learning_decay': [0.5, 0.7, 0.9],
                         'n_components': [2, 3, 4, 5, 6, 7, 8]})

In [34]:
best_model = model.best_estimator_ # Best Model
# Model Parameters
print("Best Model's Params: ", model.best_params_)

# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)

# Perplexity
print("Model Perplexity: ", best_model.perplexity(data_vectorized))

Best Model's Params:  {'learning_decay': 0.7, 'n_components': 2}
Best Log Likelihood Score:  -74073.43473645803
Model Perplexity:  187.42102140611075


* Among the values tried (2 to 8 topics), the grid search selects 2 topics with a learning_decay of 0.7 as the best configuration.
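The log-likelihood score and the perplexity printed above measure the same thing on different scales: assuming the standard definition, perplexity is the exponential of the negative log-likelihood per token, so a higher log-likelihood means a lower perplexity. The sketch below shows that relation with illustrative numbers only. The two values in the output cannot be plugged into it directly, because `best_score_` is averaged over cross-validation folds while the perplexity is computed on the full matrix. The function name is hypothetical.

```python
import math

def perplexity_from_log_likelihood(log_likelihood, n_tokens):
    """Perplexity = exp(-log-likelihood per token)."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return math.exp(-log_likelihood / n_tokens)

# Illustrative numbers only (not taken from the run above):
# a model that assigns each of 100 tokens a probability of 1/8.
ll = 100 * math.log(1 / 8)
print(perplexity_from_log_likelihood(ll, 100))
```

Since 2 is the smallest value in the grid, the selected topic count sits on the edge of the search range, which is worth keeping in mind when reading the result.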

In [35]:
## build the model with two topics
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=2,
                                       random_state=50,
                                       chunksize=100,
                                       passes=10,
                                       decay=0.7,
                                       per_word_topics=True)

In [36]:
print('coherence of the model : ',
      CoherenceModel(model=lda_model,texts=texts,dictionary=id2word,coherence='c_v').get_coherence())

coherence of the model :  0.512284641443429


### 11. The business should be able to interpret the topics.

* Name each of the identified topics.

The two topics are named 'negative review' (topic 0) and 'positive review' (topic 1).

* Create a table with the topic name and the top 10 terms in each to present to the business.

In [37]:
topic_terms = lda_model.print_topics()

In [38]:
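def clear(topic):
    """Extract bare terms from pieces such as '0.284*"problem"'."""
    terms = []
    for term in topic:
        # Keep the text after '*"' and drop the closing quote.
        terms.append(term.split('*"')[1][:-1])
    return terms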
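The string slicing in `clear` depends on the exact layout of the `print_topics()` output. A regular expression is one possible alternative that also keeps the weights. The sketch below is illustrative: `parse_topic` is a hypothetical helper, and the sample string is topic 0 from the 12-topic output earlier in this notebook.

```python
import re

# Matches weight*"term" pairs in a print_topics() style string.
_TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Return a list of (term, weight) tuples in their original order."""
    return [(term, float(weight)) for weight, term in _TERM.findall(topic_string)]

# Topic 0 string copied from the 12-topic output earlier in this notebook.
sample = ('0.284*"problem" + 0.082*"heating" + 0.052*"network" + 0.051*"charger" + '
          '0.030*"handset" + 0.027*"turbo" + 0.024*"sim" + 0.014*"connectivity" + '
          '0.012*"bill" + 0.011*"connection"')
parsed = parse_topic(sample)
print([term for term, _ in parsed])
```

With a trained gensim model at hand, `lda_model.show_topic(topic_id, topn=10)` returns `(term, weight)` pairs directly and avoids string parsing altogether.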

In [39]:
df_topic = pd.DataFrame()
df_topic['negative topic'] = clear(topic=topic_terms[0][1].split(' + '))
df_topic['positive topic'] = clear(topic=topic_terms[1][1].split(' + '))

In [40]:
df_topic

Unnamed: 0,negative topic,positive topic
0,camera,phone
1,battery,product
2,phone,battery
3,problem,issue
4,quality,time
5,performance,lenovo
6,mobile,day
7,note,note
8,heating,price
9,feature,money


## Conclusion

* This analysis used only nouns for topic modeling. Because the dataset is meant for sentiment analysis, including adverbs and verbs should give better predictions, and bigrams and trigrams could further improve the end result.
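As a small illustration of the bigram and trigram idea, the sketch below enumerates contiguous n-grams from a token list. The `ngrams` helper and the example tokens are hypothetical. It does no frequency filtering, so in practice a collocation detector such as gensim's `Phrases` model would probably be preferable, since it keeps only word pairs that co-occur often.

```python
def ngrams(tokens, n):
    """Join every run of n consecutive tokens with an underscore."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example tokens.
tokens = ["battery", "backup", "hour"]
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```

The joined tokens can be appended to each review's token list before the dictionary and bag-of-words corpus are built, so that phrases like `battery_backup` become topic terms in their own right.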