# Problem Statement: 

Lenovo, a popular mobile phone brand, has launched a budget smartphone in the Indian market. The client wants to understand the VOC (voice of the customer) for the product. This will help not only to evaluate the current product but also to set direction for the product pipeline. The client is particularly interested in the different aspects that customers care about. Product reviews by customers on a leading e-commerce site should provide a good view.

### Domain: 
Amazon reviews for a leading phone brand

### Analysis to be done: 
POS tagging, topic modeling using LDA, and topic interpretation

### Steps to perform:

Discover the topics in the reviews and present them to the business in a consumable format. Employ techniques from syntactic processing and topic modeling.

Perform specific cleanup, POS tagging, and restriction to the relevant POS tags, then perform topic modeling using LDA. Finally, give business-friendly names to the topics and make a table for the business.


#### Task 1: Read the .csv file using Pandas. Take a look at the top few records.

In [1]:
import pandas as pd

In [2]:
reviews = pd.read_csv('K8 Reviews v0.2.csv')

In [3]:
reviews.head()

| | sentiment | review |
|---|---|---|
| 0 | 1 | Good but need updates and improvements |
| 1 | 0 | Worst mobile i have bought ever, Battery is dr... |
| 2 | 1 | when I will get my 10% cash back.... its alrea... |
| 3 | 1 | Good |
| 4 | 0 | The worst phone everThey have changed the last... |


In [4]:
reviews.shape

(14675, 2)

#### Task 2: Normalize casings for the review text and extract the text into a list for easier manipulation.

In [5]:
reviews = reviews.review.values


In [6]:
reviews = [text.lower() for text in reviews]
reviews[:3]

['good but need updates and improvements',
 "worst mobile i have bought ever, battery is draining like hell, backup is only 6 to 7 hours with internet uses, even if i put mobile idle its getting discharged.this is biggest lie from amazon & lenove which is not at all expected, they are making full by saying that battery is 4000mah & booster charger is fake, it takes at least 4 to 5 hours to be fully charged.don't know how lenovo will survive by making full of us.please don;t go for this else you will regret like me.",
 'when i will get my 10% cash back.... its already 15 january..']
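If the `review` column contains missing values, `text.lower()` raises an `AttributeError`, because pandas represents missing text as a float `NaN`. The following is a minimal sketch of a more defensive normalization, assuming such rows can simply be dropped; the helper name `normalize_reviews` and the sample data are hypothetical.

```python
import math

def normalize_reviews(raw_reviews):
    """Lower-case and strip each review, skipping missing or empty entries."""
    cleaned = []
    for text in raw_reviews:
        # Missing text shows up as None or float NaN, neither of which has .lower()
        if text is None or (isinstance(text, float) and math.isnan(text)):
            continue
        text = str(text).strip().lower()
        if text:
            cleaned.append(text)
    return cleaned

# Hypothetical sample with a NaN, a None, and an empty string mixed in
sample = ["Good but need updates", float("nan"), "  WORST phone  ", None, ""]
print(normalize_reviews(sample))
```

Dropping rows changes the number of documents, so it may be worth comparing the length of the result with `reviews.shape` before continuing.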

#### Task 3: Tokenize the reviews using NLTK's word_tokenize function.

In [7]:
from nltk.tokenize import word_tokenize
reviews = [word_tokenize(text) for text in reviews]


In [8]:
reviews[:1]

[['good', 'but', 'need', 'updates', 'and', 'improvements']]
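The second review contains text such as `discharged.this`, where a sentence ends and the next one starts without a space, and a tokenizer may keep such a string as a single token. One possible pre-tokenization step is sketched below. It assumes the text is already lower-cased, and the helper name `add_missing_spaces` is hypothetical. It would also split abbreviations and domain names such as `amazon.in`, so treat it as a starting point rather than a complete fix.

```python
import re

# A letter, then sentence-ending punctuation, then another letter with no space between
_MISSING_SPACE = re.compile(r"(?<=[a-z])([.!?]+)(?=[a-z])")

def add_missing_spaces(text):
    """Insert a space after '.', '!' or '?' when the next word starts immediately."""
    return _MISSING_SPACE.sub(r"\1 ", text)

example = "its getting discharged.this is biggest lie"
print(add_missing_spaces(example))
```

The function would be applied to each review string before the call to `word_tokenize`.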

#### Task 4: Perform parts-of-speech tagging on each sentence using the NLTK POS tagger.

In [9]:
import nltk
from nltk.tag import pos_tag

In [10]:
reviews = [nltk.pos_tag(review) for review in reviews]

In [11]:
reviews[:1]

[[('good', 'JJ'),
  ('but', 'CC'),
  ('need', 'VBP'),
  ('updates', 'NNS'),
  ('and', 'CC'),
  ('improvements', 'NNS')]]

#### Task 5: For the topic model, we want to include only nouns.

1. Find out all the POS tags that correspond to nouns.

2. Limit the data to only terms with these tags.

In [12]:
def noun_pos_tags(reviews):
    """Collect every noun (NN, NNS, NNP, NNPS) from the tagged reviews into one flat list."""
    noun_tags = {'NN', 'NNS', 'NNP', 'NNPS'}
    reviews_postag = []
    for review in reviews:
        for word, pos in review:
            if pos in noun_tags:
                reviews_postag.append(word)
    return reviews_postag

In [13]:
tagged_reviews = noun_pos_tags(reviews)
tagged_reviews

['updates',
 'improvements',
 'mobile',
 'i',
 'battery',
 'hell',
 'backup',
 'hours',
 'uses',
 'idle',
 'discharged.this',
 'lie',
 'amazon',
 'lenove',
 'battery',
 'charger',
 'hours',
 'don',
 'i',
 '%',
 'cash',
 '..',
 'phone',
 'everthey',
 'phone',
 'problem',
 'amazon',
 'phone',
 'amazon',
 'camerawaste',
 'money',
 'phone',
 'allot',
 '..',
 'reason',
 'k8',
 'battery',
 'level',
 'problems',
 'phone',
 'hanging',
 'problems',
 'note',
 'station',
 'ahmedabad',
 'years',
 'phone',
 'lenovo',
 'lot',
 'glitches',
 'thing',
 'options',
 'wrost',
 'phone',
 'charger',
 'damage',
 'months',
 'item',
 'battery',
 'life',
 'i',
 'battery',
 'problem',
 'motherboard',
 'problem',
 'months',
 'mobile',
 'life',
 'phone',
 'slim',
 'battry',
 'backup',
 'screen',
 'headset',
 'time',
 'i',
 'product',
 'prize',
 'range',
 'specification',
 'comparison',
 'mobile',
 'range',
 'i',
 'phone',
 'seal',
 'i',
 'credit',
 'card',
 'i',
 '..',
 '..',
 'deal',
 'amazon',
 '..',
 'battery',
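
The output above is one flat list, so the boundaries between reviews are lost. LDA infers topics from the words that occur together within a document, so a variant that keeps one noun list per review may be a better input for the later steps. A minimal sketch, assuming the same `(word, tag)` format that `nltk.pos_tag` returns; the name `nouns_per_review` and the sample data are hypothetical.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def nouns_per_review(tagged_reviews):
    """Return one list of nouns per review instead of a single flat list."""
    return [
        [word for word, tag in review if tag in NOUN_TAGS]
        for review in tagged_reviews
    ]

# Hand-written sample in the (word, tag) format produced by the POS tagger
sample = [
    [("good", "JJ"), ("but", "CC"), ("need", "VBP"), ("updates", "NNS"),
     ("and", "CC"), ("improvements", "NNS")],
    [("battery", "NN"), ("is", "VBZ"), ("draining", "VBG")],
]
print(nouns_per_review(sample))
```

With this structure, the punctuation, stopword, and lemmatization steps would each need an inner loop over the reviews.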

#### Task 7: Remove stopwords and punctuation (if there are any). 



In [14]:
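# removing punctuation: isalpha() keeps alphabetic tokens only, so tokens containing digits (e.g. 'k8') are dropped as well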

tagged_reviews = [word for word in tagged_reviews if word.isalpha()]

In [15]:
len(tagged_reviews)

85077

In [16]:
# removing stopwords

from nltk.corpus import stopwords

In [17]:
stop_words = set(stopwords.words('english'))
tagged_reviews = [w for w in tagged_reviews if w not in stop_words]

In [18]:
len(tagged_reviews)

81822

In [19]:
tagged_reviews

['updates',
 'improvements',
 'mobile',
 'battery',
 'hell',
 'backup',
 'hours',
 'uses',
 'idle',
 'lie',
 'amazon',
 'lenove',
 'battery',
 'charger',
 'hours',
 'cash',
 'phone',
 'everthey',
 'phone',
 'problem',
 'amazon',
 'phone',
 'amazon',
 'camerawaste',
 'money',
 'phone',
 'allot',
 'reason',
 'battery',
 'level',
 'problems',
 'phone',
 'hanging',
 'problems',
 'note',
 'station',
 'ahmedabad',
 'years',
 'phone',
 'lenovo',
 'lot',
 'glitches',
 'thing',
 'options',
 'wrost',
 'phone',
 'charger',
 'damage',
 'months',
 'item',
 'battery',
 'life',
 'battery',
 'problem',
 'motherboard',
 'problem',
 'months',
 'mobile',
 'life',
 'phone',
 'slim',
 'battry',
 'backup',
 'screen',
 'headset',
 'time',
 'product',
 'prize',
 'range',
 'specification',
 'comparison',
 'mobile',
 'range',
 'phone',
 'seal',
 'credit',
 'card',
 'deal',
 'amazon',
 'battery',
 'solutions',
 'battery',
 'life',
 'smartphone',
 'galery',
 'problem',
 'speaker',
 'phone',
 'camera',
 'battery',
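
A quick frequency count can show which terms dominate after cleanup. Very frequent generic terms, such as `phone` in this listing, may end up near the top of every topic. The sketch below uses only the standard library; the helper name `top_terms` is hypothetical, and the sample is the first few tokens of the output above.

```python
from collections import Counter

def top_terms(tokens, n=10):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)

# First few tokens from the output above, used as a small sample
sample = ["updates", "improvements", "mobile", "battery", "hell", "backup",
          "hours", "uses", "idle", "lie", "amazon", "lenove", "battery"]
print(top_terms(sample, n=1))
```

On the full data, `top_terms(tagged_reviews, n=20)` would give a short list to review for additional domain-specific stopwords.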

#### Task 6: Lemmatize. 

1. Different forms of the terms need to be treated as one.

2. No need to provide POS tag to lemmatizer for now.

In [20]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()


In [21]:
lemmatized_wordlist = [wordnet_lemmatizer.lemmatize(word) for word in tagged_reviews]

In [22]:
lemmatized_wordlist 

['update',
 'improvement',
 'mobile',
 'battery',
 'hell',
 'backup',
 'hour',
 'us',
 'idle',
 'lie',
 'amazon',
 'lenove',
 'battery',
 'charger',
 'hour',
 'cash',
 'phone',
 'everthey',
 'phone',
 'problem',
 'amazon',
 'phone',
 'amazon',
 'camerawaste',
 'money',
 'phone',
 'allot',
 'reason',
 'battery',
 'level',
 'problem',
 'phone',
 'hanging',
 'problem',
 'note',
 'station',
 'ahmedabad',
 'year',
 'phone',
 'lenovo',
 'lot',
 'glitch',
 'thing',
 'option',
 'wrost',
 'phone',
 'charger',
 'damage',
 'month',
 'item',
 'battery',
 'life',
 'battery',
 'problem',
 'motherboard',
 'problem',
 'month',
 'mobile',
 'life',
 'phone',
 'slim',
 'battry',
 'backup',
 'screen',
 'headset',
 'time',
 'product',
 'prize',
 'range',
 'specification',
 'comparison',
 'mobile',
 'range',
 'phone',
 'seal',
 'credit',
 'card',
 'deal',
 'amazon',
 'battery',
 'solution',
 'battery',
 'life',
 'smartphone',
 'galery',
 'problem',
 'speaker',
 'phone',
 'camera',
 'battery',
 'product',
 ...]

#### Task 8: Create a topic model using LDA on the cleaned-up data with 12 topics.

1. Print out the top terms for each topic.

2. What is the coherence of the model with the c_v metric?

In [23]:
from gensim import corpora, models
import gensim

In [24]:
dictionary = corpora.Dictionary([lemmatized_wordlist])

In [25]:
print(dictionary)

Dictionary(6006 unique tokens: ['aa', 'aab', 'aachha', 'aaguthu', 'aaj']...)


In [26]:
print(dictionary.token2id)



In [27]:
corpus = [dictionary.doc2bow(text) for text in [lemmatized_wordlist]]
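
Wrapping `lemmatized_wordlist` in a single outer list makes the whole dataset one document, so `corpus` has length 1 and the topic model sees no document-level variation. The sketch below is a pure-Python illustration (not gensim's implementation) of bag-of-words counts in the `(token_id, count)` shape that `doc2bow` returns. It contrasts one flat document with one document per review. The helper names and the sample tokens are hypothetical.

```python
from collections import Counter

def build_vocab(documents):
    """Map every distinct token to an integer id (sorted for reproducibility)."""
    tokens = sorted({tok for doc in documents for tok in doc})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_bow(document, vocab):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(vocab[tok] for tok in document if tok in vocab)
    return sorted(counts.items())

per_review = [["update", "improvement"], ["battery", "charger", "battery"]]
flat = [tok for doc in per_review for tok in doc]

vocab = build_vocab(per_review)
corpus_per_review = [to_bow(doc, vocab) for doc in per_review]
corpus_flat = [to_bow(doc, vocab) for doc in [flat]]

print(len(corpus_per_review), corpus_per_review)  # two documents
print(len(corpus_flat), corpus_flat)              # one document
```

With gensim, the equivalent would be to pass a list of token lists, one per review, to `corpora.Dictionary` and to call `doc2bow` on each inner list.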

In [28]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=12, id2word=dictionary)

In [29]:
print(ldamodel)

LdaModel(num_terms=6006, num_topics=12, decay=0.5, chunksize=2000)


In [30]:
for idx, topic in ldamodel.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Topic: 0 
Words: 0.080*"phone" + 0.034*"battery" + 0.030*"camera" + 0.021*"problem" + 0.020*"issue" + 0.020*"product" + 0.015*"mobile" + 0.013*"time" + 0.013*"quality" + 0.012*"lenovo"


Topic: 1 
Words: 0.037*"phone" + 0.025*"camera" + 0.020*"battery" + 0.017*"product" + 0.014*"issue" + 0.012*"problem" + 0.012*"mobile" + 0.011*"day" + 0.010*"performance" + 0.009*"quality"


Topic: 2 
Words: 0.067*"phone" + 0.030*"battery" + 0.028*"camera" + 0.022*"product" + 0.020*"problem" + 0.020*"mobile" + 0.013*"issue" + 0.012*"lenovo" + 0.010*"feature" + 0.010*"price"


Topic: 3 
Words: 0.048*"phone" + 0.026*"battery" + 0.025*"product" + 0.023*"camera" + 0.019*"problem" + 0.016*"mobile" + 0.014*"quality" + 0.012*"day" + 0.012*"time" + 0.011*"issue"


Topic: 4 
Words: 0.074*"phone" + 0.050*"camera" + 0.029*"battery" + 0.019*"problem" + 0.016*"mobile" + 0.014*"product" + 0.014*"quality" + 0.013*"note" + 0.012*"issue" + 0.012*"time"


Topic: 5 
Words: 0.028*"phone" + 0.020*"camera" + 0.020*"battery"
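
To present the topics to the business, the weighted strings above can be parsed into plain term lists and combined with hand-chosen names. A minimal sketch, assuming the `weight*"term"` format shown in the output; the helper names and the placeholder topic name are hypothetical, and the real names should come from reading each topic's terms.

```python
import re

# Matches pairs like 0.080*"phone" in the topic strings printed above
_TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn '0.080*"phone" + 0.034*"battery"' into [('phone', 0.08), ...]."""
    return [(term, float(weight)) for weight, term in _TERM.findall(topic_string)]

def topic_table(topics, names):
    """Build rows of (topic id, business-friendly name, top terms)."""
    rows = []
    for idx, topic_string in topics:
        terms = [term for term, _ in parse_topic(topic_string)]
        rows.append((idx, names.get(idx, "unnamed"), ", ".join(terms)))
    return rows

topics = [(5, '0.028*"phone" + 0.020*"camera" + 0.020*"battery"')]
names = {5: "Camera and battery"}  # placeholder name, chosen by hand
for row in topic_table(topics, names):
    print(row)
```

The rows can be passed to `pd.DataFrame(rows, columns=[...])` to produce the final table.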

In [31]:
#coherence

from gensim.models.coherencemodel import CoherenceModel

In [32]:
cohmodel = CoherenceModel(model = ldamodel, texts=[lemmatized_wordlist], dictionary=dictionary, coherence='c_v')

In [33]:
import numpy as np
np.seterr(divide='ignore', invalid='ignore')

{'divide': 'warn', 'over': 'warn', 'under': 'ignore', 'invalid': 'warn'}

In [None]:
coherence = cohmodel.get_coherence()


In [None]:
print(coherence)