# PRODUCT REVIEW SUMMARIZATION

# Introduction #
Over the past decade, the online shopping market has reached new heights. The e-commerce industry has grown rapidly in this period and is now an essential part of daily life all over the world. Online stores offer a vast variety of products that can be purchased hassle-free through a simple web application. Continuous improvement of the whole process through new technology has brought a wide range of services that give the customer a real-time shopping experience. One of these services is customer reviews, which are very helpful in providing full information about a product. Reviews now sit alongside images, prices, and specifications as part of the product catalogue.

# Data Description #
The data is stored in a file named “Cell_Phones_and_Accessories.json”. Its fields are as follows:
 
 
`IC` - Item code of the product, e.g. B016MF3P3K

`Reviewer_Name` - Name of the reviewer

`Useful` - Number of useful votes (upvotes) the review received

`Prod_meta` - A dictionary of product metadata. It contains additional information about the product, if available.

`Review` - Text of the review

`Rating` - Rating given to the product by the reviewer

`Rev_summ` - Summary of the review

`Review_timestamp` - Time when the review was posted (Unix time format)

`Review_Date` - Date when the review was posted

`Prod_img` - Images that users posted after receiving the product

`Rev_verify` - Flag indicating whether the review has been verified (True/False)


# Data Wrangling #

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')


In [2]:
df=pd.read_json("Cell_Phones_and_Accessories.json")
df.head()

Unnamed: 0,Rating,Rev_verify,Review_Date,IC,Prod_meta,Reviewer_Name,Review,Rev_summ,Review_timestamp,Useful,Prod_img
0,5,True,"09 1, 2015",B009XD5TPQ,,Sunny Zoeller,Bought it for my husband. He's very happy with it,He's very happy with,1441065600,,
1,5,True,"01 9, 2016",B016MF3P3K,,Denise Lesley,Great screen protector. Doesn't even seem as ...,Five Stars,1452297600,,
2,5,True,"04 21, 2013",B008DC8N5G,,Emir,Saved me lots of money! it's not gorilla glass...,As long as you know how to put it on!,1366502400,,
3,3,True,"02 27, 2013",B0089CH3TM,{'Color:': ' Green'},Alyse,"The material and fit is very nice, but the col...",Good case overall,1361923200,3.0,
4,4,True,"12 19, 2013",B00AKZWGAC,,TechGuy,This last me about 3 days till i have to charg...,Awesome Battery,1387411200,,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 760450 entries, 0 to 760449
Data columns (total 11 columns):
Rating              760450 non-null int64
Rev_verify          760450 non-null bool
Review_Date         760450 non-null object
IC                  760450 non-null object
Prod_meta           407826 non-null object
Reviewer_Name       760359 non-null object
Review              759920 non-null object
Rev_summ            760095 non-null object
Review_timestamp    760450 non-null int64
Useful              62200 non-null object
Prod_img            18194 non-null object
dtypes: bool(1), int64(2), object(8)
memory usage: 64.5+ MB


Only the `Rating` and `Review_timestamp` series are stored as integers, and `Rev_verify` is stored as a boolean. The remaining columns have the generic `object` dtype (mostly strings).

In [4]:
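from datetime import datetime

def to_date_string(ts):
    # Convert Unix seconds to an "MM-DD-YYYY" string (interpreted in the machine's local time zone)
    return datetime.fromtimestamp(ts).strftime("%m-%d-%Y")

df["Review_timestamp"] = df["Review_timestamp"].apply(to_date_string)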

`Review_timestamp` is converted from Unix time to a more readable `MM-DD-YYYY` date string (a formatted string, not a `datetime` dtype). The converted values carry the same information as `Review_Date`, so the two columns are redundant and either one can be dropped.
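`datetime.fromtimestamp` interprets the timestamp in the local time zone of the machine running the notebook, so a value close to midnight can land on a different calendar day on another machine. A minimal sketch of a time-zone-independent variant, assuming the timestamps are UTC seconds (the helper name `unix_to_date_string` is hypothetical):

```python
from datetime import datetime, timezone

def unix_to_date_string(ts):
    """Format a Unix timestamp (seconds) as MM-DD-YYYY, interpreted in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%m-%d-%Y")

# Timestamps taken from the first two rows of the dataset preview
sample = [1441065600, 1452297600]
converted = [unix_to_date_string(ts) for ts in sample]
print(converted)
```

Under the UTC assumption both values reproduce the dates shown in `Review_Date` for those rows.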

In [5]:
df.head()

Unnamed: 0,Rating,Rev_verify,Review_Date,IC,Prod_meta,Reviewer_Name,Review,Rev_summ,Review_timestamp,Useful,Prod_img
0,5,True,"09 1, 2015",B009XD5TPQ,,Sunny Zoeller,Bought it for my husband. He's very happy with it,He's very happy with,09-01-2015,,
1,5,True,"01 9, 2016",B016MF3P3K,,Denise Lesley,Great screen protector. Doesn't even seem as ...,Five Stars,01-09-2016,,
2,5,True,"04 21, 2013",B008DC8N5G,,Emir,Saved me lots of money! it's not gorilla glass...,As long as you know how to put it on!,04-21-2013,,
3,3,True,"02 27, 2013",B0089CH3TM,{'Color:': ' Green'},Alyse,"The material and fit is very nice, but the col...",Good case overall,02-27-2013,3.0,
4,4,True,"12 19, 2013",B00AKZWGAC,,TechGuy,This last me about 3 days till i have to charg...,Awesome Battery,12-19-2013,,


In [6]:
df.isnull().sum()

Rating                   0
Rev_verify               0
Review_Date              0
IC                       0
Prod_meta           352624
Reviewer_Name           91
Review                 530
Rev_summ               355
Review_timestamp         0
Useful              698250
Prod_img            742256
dtype: int64

In [7]:
# Drop the columns that are not needed for the analysis, along with `Prod_meta`, `Useful`, and `Prod_img`,
# which have the largest number of null values.

df.drop(labels=["Reviewer_Name","Review_timestamp","Useful","Prod_img","Review_Date","Prod_meta"], axis=1, inplace=True)
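The choice of sparse columns can be made explicit by computing the share of missing values per column. A minimal sketch on a toy frame; the 0.4 threshold and the toy values are assumptions for illustration, not values taken from the notebook:

```python
import pandas as pd

# Toy frame standing in for the review data (values are made up)
toy = pd.DataFrame({
    "Rating": [5, 4, 3, 1],
    "Review": ["good", None, "ok", "bad"],
    "Useful": [None, None, "3", None],
    "Prod_img": [None, None, None, None],
})

# Fraction of missing values in each column
null_share = toy.isnull().mean()

# Columns whose missing share exceeds the (assumed) threshold
threshold = 0.4
sparse_cols = null_share[null_share > threshold].index.tolist()

trimmed = toy.drop(columns=sparse_cols)
print(sparse_cols)
print(trimmed.columns.tolist())
```

Given the null counts printed earlier, the same threshold would pick out `Prod_meta` (about 46% missing), `Useful` (about 92%), and `Prod_img` (about 98%).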

In [8]:
# Drop the rows that still contain `NaN` values

df.dropna(inplace=True)

In [9]:
df.isnull().sum()

Rating        0
Rev_verify    0
IC            0
Review        0
Rev_summ      0
dtype: int64

In [10]:
df.head(10)

Unnamed: 0,Rating,Rev_verify,IC,Review,Rev_summ
0,5,True,B009XD5TPQ,Bought it for my husband. He's very happy with it,He's very happy with
1,5,True,B016MF3P3K,Great screen protector. Doesn't even seem as ...,Five Stars
2,5,True,B008DC8N5G,Saved me lots of money! it's not gorilla glass...,As long as you know how to put it on!
3,3,True,B0089CH3TM,"The material and fit is very nice, but the col...",Good case overall
4,4,True,B00AKZWGAC,This last me about 3 days till i have to charg...,Awesome Battery
5,5,True,B00MAWPGMI,"Love this case, very sturdy!",Five Stars
6,5,False,B00NB7B4GI,Simple and good quality iPhone 6 case. Fits on...,Simple and good quality iPhone 6 case
7,5,True,B00NMR6N7W,Great screen protector for the money! Paid $1....,Perfect!
8,5,True,B018V60504,"Nice charger. One problem, one if the two USB ...",Make sure your Items work before you miss the ...
9,5,False,B00PG8TID6,Most battery packs for iPhones come as a total...,This clever design combines a battery pack int...


In [11]:
df['Rating'].value_counts()

5    475982
4    123871
3     66350
1     54948
2     38451
Name: Rating, dtype: int64

In [12]:
df['Rev_verify'].value_counts()

True     664660
False     94942
Name: Rev_verify, dtype: int64

In [13]:
# Some reviews are not verified, so keep only the verified ones.

df = df[df['Rev_verify']].copy()
df.head()

Unnamed: 0,Rating,Rev_verify,IC,Review,Rev_summ
0,5,True,B009XD5TPQ,Bought it for my husband. He's very happy with it,He's very happy with
1,5,True,B016MF3P3K,Great screen protector. Doesn't even seem as ...,Five Stars
2,5,True,B008DC8N5G,Saved me lots of money! it's not gorilla glass...,As long as you know how to put it on!
3,3,True,B0089CH3TM,"The material and fit is very nice, but the col...",Good case overall
4,4,True,B00AKZWGAC,This last me about 3 days till i have to charg...,Awesome Battery


In [14]:
# The dataset now contains only verified reviews, so drop the `Rev_verify` column

df.drop(["Rev_verify"],axis=1,inplace=True)

In [15]:
# Merge `Review` and `Rev_summ` into a single column

df['new_review'] = df['Review'] + df['Rev_summ']
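Adding the two columns directly joins the last word of the review to the first word of the summary, which is why the sample review printed later ends in `it.Awesome Battery` and the vocabulary contains fused tokens such as `therefive`. A minimal sketch of joining with a separator instead, assuming a single space is acceptable as the delimiter (the toy row is modeled on the second review and its exact punctuation is a guess):

```python
import pandas as pd

toy = pd.DataFrame({
    "Review": ["Great screen protector. Doesn't even seem as though it's on there"],
    "Rev_summ": ["Five Stars"],
})

# Direct concatenation fuses "there" and "Five" into one token
fused = toy["Review"] + toy["Rev_summ"]

# str.cat with a separator keeps the two texts apart
joined = toy["Review"].str.cat(toy["Rev_summ"], sep=" ")

print(fused.iloc[0])
print(joined.iloc[0])
```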

In [16]:
df.head()

Unnamed: 0,Rating,IC,Review,Rev_summ,new_review
0,5,B009XD5TPQ,Bought it for my husband. He's very happy with it,He's very happy with,Bought it for my husband. He's very happy with...
1,5,B016MF3P3K,Great screen protector. Doesn't even seem as ...,Five Stars,Great screen protector. Doesn't even seem as ...
2,5,B008DC8N5G,Saved me lots of money! it's not gorilla glass...,As long as you know how to put it on!,Saved me lots of money! it's not gorilla glass...
3,3,B0089CH3TM,"The material and fit is very nice, but the col...",Good case overall,"The material and fit is very nice, but the col..."
4,4,B00AKZWGAC,This last me about 3 days till i have to charg...,Awesome Battery,This last me about 3 days till i have to charg...


In [17]:
# Each review is stored as a string in the `new_review` series. A sample product review is shown below:

print(df["new_review"].iloc[4])

This last me about 3 days till i have to charge it. It does take FOREVER to charge so make sure you plug it in early at night so it will be fully charged in the morning. Sometimes I will get home late (1AM) and when i wake up (10AM) 9 hours It will still be charging, only at 70-80%. And it will take another 2 to be fully charged. But if i have to go somewhere after i wake up it won't be fully charged.
Anyways, great battery if you have the time to fully charge it.Awesome Battery


In [18]:
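# Define a function that converts every letter to lower case and removes tags, special characters, and digits.

import re

def pre_process(new_review):
    new_review = new_review.lower()                      # lower-case everything
    new_review = re.sub(r"<[^>]+>", " ", new_review)     # strip HTML-style tags (assumed pattern)
    new_review = re.sub(r"(\d|\W)+", " ", new_review)    # collapse digits and non-word characters into one space
    return new_review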

In [19]:
df['new_review'] = df['new_review'].apply(pre_process)
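To see what the cleaning step does to a single string, here is a self-contained sketch. The function `clean_text` is a hypothetical stand-in that mirrors the lower-casing and the digit/non-word substitution above; the tag pattern and the sample sentence are assumptions.

```python
import re

def clean_text(text):
    """Lower-case, strip HTML-style tags, and collapse digits/non-word characters."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)    # assumed tag pattern
    text = re.sub(r"(\d|\W)+", " ", text)   # digits and punctuation become one space
    return text

sample = "Great screen protector for the money! Paid $1.99<br/>Doesn't peel."
cleaned = clean_text(sample)
print(cleaned.split())
```

Because the apostrophe counts as a non-word character, contractions are split into two tokens, which matches the `doesn t` fragments visible in the processed reviews.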

Stop words are the most commonly used words, including pronouns (e.g. us, she, their), articles (e.g. the), and prepositions (e.g. under, from, off). These words do not help distinguish one document from another and are therefore dropped. The NLTK English stop word list is imported below and extended with a few informal tokens.
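The effect of stop word removal can be sketched without any downloads. The small `STOP_WORDS` set below is a hand-picked assumption standing in for the NLTK list, and `remove_stop_words` is a hypothetical helper:

```python
# Hand-picked stand-in for the NLTK English stop word list (assumption)
STOP_WORDS = {"it", "for", "my", "he", "s", "very", "with", "the"}

def remove_stop_words(text):
    """Return the tokens of `text` that are not stop words."""
    return [token for token in text.split() if token not in STOP_WORDS]

tokens = remove_stop_words("bought it for my husband he s very happy with it")
print(tokens)
```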

In [20]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
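# Extend the NLTK list with informal tokens; de-duplicated and kept as a list, which CountVectorizer accepts
stop = sorted(set(stopwords.words('english') + ['u', 'ü', 'ur', '4', '2', 'im', 'dont', 'doin', 'ure']))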

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [21]:
# Import CountVectorizer and apply it to the reviews, converted to a list

from sklearn.feature_extraction.text import CountVectorizer

# Convert the review column to a list of documents
docs = df['new_review'].tolist()
# Ignore terms that appear in more than 85% of the reviews and keep the 10,000 most frequent terms
cv = CountVectorizer(max_df=0.85, stop_words=stop, max_features=10000)
word = cv.fit_transform(docs)
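The effect of `max_df` is easiest to see on a tiny corpus. A minimal sketch with made-up documents and a two-word stop list (both are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "great case great fit",
    "great charger",
    "great screen protector",
    "great battery",
]

# "great" occurs in 4 of 4 documents (100% > 85%), so max_df removes it
toy_cv = CountVectorizer(max_df=0.85, stop_words=["the", "it"], max_features=10000)
toy_counts = toy_cv.fit_transform(toy_docs)

print(sorted(toy_cv.vocabulary_))
print(toy_counts.shape)
```

The resulting matrix has one row per document and one column per surviving term.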

In [22]:
list(cv.vocabulary_.keys())[:30]

['bought',
 'husband',
 'happy',
 'great',
 'screen',
 'protector',
 'even',
 'seem',
 'though',
 'therefive',
 'stars',
 'saved',
 'lots',
 'money',
 'gorilla',
 'glass',
 'careful',
 'subject',
 'easier',
 'scratching',
 'also',
 'sticky',
 'stuff',
 'like',
 'original',
 'press',
 'hard',
 'digitizer',
 'go',
 'crazy']

In [23]:
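# The vocabulary is capped at 10,000 terms (max_features), so this slice is out of range and returns an empty list
list(cv.get_feature_names_out())[200000:200012]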

[]

The Term Frequency-Inverse Document Frequency (*TF-IDF*) approach assigns continuous weights to each token instead of raw integer counts. Words that appear in many documents say little about any single document and are thus weighted lower. Words that are concentrated in a few documents help distinguish those documents from the rest and are thus weighted higher.
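A minimal pure-Python sketch of this weighting on three toy documents. It assumes the smoothed formula documented for scikit-learn's `TfidfTransformer` with `smooth_idf=True`, namely idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector; the toy documents and the helper name `tfidf` are made up.

```python
import math
from collections import Counter

toy_docs = [
    ["great", "case"],
    ["great", "charger"],
    ["great", "battery", "battery"],
]
n_docs = len(toy_docs)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf(doc):
    """TF-IDF weights for one tokenized document, L2-normalized."""
    counts = Counter(doc)
    raw = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(toy_docs[0])
print(weights)
```

In the first document `great` and `case` each occur once, yet `case` receives the larger weight because it appears in only one of the three documents.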


In [24]:
# Import the TF-IDF transformer and fit it on the count matrix
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [25]:
def sort(matrix):
    """Sort the (column index, score) pairs of a COO matrix by descending score."""
    tuples = zip(matrix.col, matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """Return a dict mapping the top-n feature names to their TF-IDF scores."""
    # Keep only the n highest-scoring (column index, score) pairs
    sorted_items = sorted_items[:topn]

    results = {}
    for idx, score in sorted_items:
        results[feature_names[idx]] = round(score, 4)

    return results
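The two helpers can be traced on a one-row sparse vector. A minimal sketch, assuming SciPy is available; the scores and the `toy_features` list are made up:

```python
from scipy.sparse import coo_matrix

# One-row sparse vector: three non-zero scores in columns 0, 2, and 3
vector = coo_matrix(([0.2, 0.7, 0.5], ([0, 0, 0], [0, 2, 3])), shape=(1, 5))
toy_features = ["bought", "case", "happy", "husband", "phone"]

# Pair each column index with its score and sort by descending score
pairs = sorted(zip(vector.col, vector.data), key=lambda p: (p[1], p[0]), reverse=True)

# Keep the two best-scoring terms, mapped back to their names
top2 = {toy_features[col]: round(float(score), 4) for col, score in pairs[:2]}
print(top2)
```

Columns with a zero score never appear in the COO representation, so only terms that actually occur in the document can be returned as keywords.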

In [26]:
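feature_names = cv.get_feature_names_out()

# Compute TF-IDF scores for a single review and pull out its top 30 keywords
doc = docs[30]
tf_idf_vector = tfidf_transformer.transform(cv.transform([doc]))
sorted_items = sort(tf_idf_vector.tocoo())
keywords = extract_topn_from_vector(feature_names, sorted_items, 30)
print("\\**************REVIEW****************")
# Note: this prints the first 30 reviews (docs[0:30]); the keywords below belong to docs[30]
print(docs[0:30])
print("\n*************Keywords**************")
for k in keywords:
    print(k)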

\**************REVIEW****************
['bought it for my husband he s very happy with ithe s very happy with', 'great screen protector doesn t even seem as though it s on therefive stars', 'saved me lots of money it s not gorilla glass so be careful as it will be subject to easier scratching it also doesn t have sticky stuff on the glass like the original if you press hard on the glass the digitizer will go crazy just shut the screen on and off and you ll be fine as long as you know how to put it on ', 'the material and fit is very nice but the color is more of a neon green than i expected or would have liked good case overall', 'this last me about days till i have to charge it it does take forever to charge so make sure you plug it in early at night so it will be fully charged in the morning sometimes i will get home late am and when i wake up am hours it will still be charging only at and it will take another to be fully charged but if i have to go somewhere after i wake up it won t 

In [27]:
tf_idf_vector=tfidf_transformer.transform(cv.transform(docs))

results=[]
for i in range(tf_idf_vector.shape[0]):
    curr_vector=tf_idf_vector[i]
    sorted_items=sort(curr_vector.tocoo())
    keywords=extract_topn_from_vector(feature_names,sorted_items,10)
    results.append(keywords)

nlp=pd.DataFrame(zip(docs,results),columns=['Summary','keywords'])
nlp

Unnamed: 0,Summary,keywords
0,bought it for my husband he s very happy with ...,"{'happy': 0.7931, 'husband': 0.5121, 'bought':..."
1,great screen protector doesn t even seem as th...,"{'therefive': 0.7224, 'seem': 0.3764, 'though'..."
2,saved me lots of money it s not gorilla glass ...,"{'glass': 0.4301, 'subject': 0.2895, 'digitize..."
3,the material and fit is very nice but the colo...,"{'neon': 0.5143, 'green': 0.3846, 'liked': 0.3..."
4,this last me about days till i have to charge ...,"{'fully': 0.5042, 'charged': 0.3556, 'wake': 0..."
...,...,...
664655,i love this nokia lumia unlocked phone fast an...,"{'nokia': 0.602, 'lumia': 0.3127, 'windows': 0..."
664656,great productfive stars,"{'productfive': 0.8845, 'great': 0.3426, 'star..."
664657,this iphone case is very durable and long last...,"{'producti': 0.6213, 'lasting': 0.4613, 'love'..."
664658,greatfive stars,"{'greatfive': 0.9335, 'stars': 0.3586}"


In [28]:
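# Note: assignment aligns on index labels; `df` keeps its filtered labels while `nlp` is numbered 0..n-1
nlp['Itemcode']=df['IC']
# The values below are computed over the whole dataset, so every row receives the same number
nlp['Max_Rating']=df['Rating'].max()
# unique().mean() averages the distinct rating values (1 to 5), which gives 3.0 rather than the mean rating
nlp['Avg_Rating']=df['Rating'].unique().mean()
nlp['Min_Rating']=df['Rating'].min()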

In [29]:
nlp = nlp[['Itemcode','Summary','keywords','Max_Rating','Min_Rating','Avg_Rating']]

In [30]:
# Note: this operates on the selected column only, so it is not expected to remove rows from `nlp`
nlp['Itemcode'].drop_duplicates(inplace=True)
nlp.dropna(inplace=True)

nlp.head()

Unnamed: 0,Itemcode,Summary,keywords,Max_Rating,Min_Rating,Avg_Rating
0,B009XD5TPQ,bought it for my husband he s very happy with ...,"{'happy': 0.7931, 'husband': 0.5121, 'bought':...",5,1,3.0
1,B016MF3P3K,great screen protector doesn t even seem as th...,"{'therefive': 0.7224, 'seem': 0.3764, 'though'...",5,1,3.0
2,B008DC8N5G,saved me lots of money it s not gorilla glass ...,"{'glass': 0.4301, 'subject': 0.2895, 'digitize...",5,1,3.0
3,B0089CH3TM,the material and fit is very nice but the colo...,"{'neon': 0.5143, 'green': 0.3846, 'liked': 0.3...",5,1,3.0
4,B00AKZWGAC,this last me about days till i have to charge ...,"{'fully': 0.5042, 'charged': 0.3556, 'wake': 0...",5,1,3.0
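In the table above every row shows the same rating figures (5, 1, and 3.0) because `max()`, `min()`, and `unique().mean()` are evaluated over the whole `Rating` column. If per-product statistics are the goal, one possible approach is to group by item code first. A minimal sketch on toy data (the values and the `product_stats` name are illustrative assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "IC": ["A1", "A1", "A1", "B2", "B2"],
    "Rating": [5, 4, 3, 1, 2],
})

# One row per product with its own max, min, and mean rating
product_stats = (
    toy.groupby("IC")["Rating"]
       .agg(Max_Rating="max", Min_Rating="min", Avg_Rating="mean")
       .reset_index()
)
print(product_stats)
```

When item codes from a filtered frame are attached to a frame built from a plain list, calling `reset_index(drop=True)` on the filtered frame first keeps the rows aligned by position.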


In [31]:
# Write the result to disk (to_json returns None when a path is given, so nothing needs to be assigned)
nlp.to_json('C:/Users/Admin/Desktop/Product_review/PRODUCT_REVIEW.json')

In [32]:
# Save the fitted TF-IDF transformer using joblib
# (sklearn.externals.joblib was removed from scikit-learn; the standalone package is used instead)
import joblib
joblib.dump(tfidf_transformer,'product_review.pkl')



['product_review.pkl']
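The saved file holds only the TF-IDF transformer. Scoring new text also needs the fitted `CountVectorizer`, since the transformer expects a count matrix built with the same vocabulary. A minimal sketch that stores both objects together and reloads them, assuming the standalone `joblib` package; the file name and toy documents are hypothetical:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy_docs = ["great case", "great charger", "battery lasts long"]

toy_cv = CountVectorizer()
toy_tfidf = TfidfTransformer(smooth_idf=True, use_idf=True)
toy_tfidf.fit(toy_cv.fit_transform(toy_docs))

# Persist the vectorizer and the transformer in a single file
path = os.path.join(tempfile.mkdtemp(), "toy_review_model.pkl")
joblib.dump({"vectorizer": toy_cv, "tfidf": toy_tfidf}, path)

# Reload both objects and score a new review
loaded = joblib.load(path)
scores = loaded["tfidf"].transform(loaded["vectorizer"].transform(["great battery"]))
print(scores.shape)
```

Bundling the two objects in one dictionary is only one option; a scikit-learn `Pipeline` would serve the same purpose.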