# Topic Modeling of Amazon Reviews

### Steps:

- Find the ASIN (Amazon's unique product ID) of every cell phone from the brand "Samsung":
    -  Filter rows of df_metadata whose categories look like ['Cell Phones & Accessories', 'Cell Phones','....']
    -  Filter rows where brand = "Samsung"
- Join those ASINs with df_reviews to build a new dataframe containing all reviews of Samsung cell phones.
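
The two steps above can be sketched on toy data. This is a minimal illustration rather than the notebook's actual code: the two small dataframes and their values are invented, and only the column names (asin, brand, categories, reviewText) follow the dataset. The helper `is_cell_phone` is a hypothetical name.

```python
import pandas as pd

# Toy stand-ins for df_metadata and df_reviews (all values are invented).
toy_metadata = pd.DataFrame({
    'asin': ['A1', 'A2', 'A3'],
    'brand': ['Samsung', 'Samsung', 'Nokia'],
    'categories': [
        [['Cell Phones & Accessories', 'Cell Phones', 'Unlocked Cell Phones']],
        [['Cell Phones & Accessories', 'Cases']],
        [['Cell Phones & Accessories', 'Cell Phones']],
    ],
})
toy_reviews = pd.DataFrame({
    'asin': ['A1', 'A1', 'A2', 'A3'],
    'reviewText': ['great screen', 'battery dies fast', 'nice case', 'solid'],
})

def is_cell_phone(categories):
    # True when the first category path has 'Cell Phones' at its second level.
    path = categories[0]
    return len(path) >= 2 and path[1] == 'Cell Phones'

# Step 1: Samsung rows whose category path marks them as cell phones.
mask = toy_metadata['categories'].apply(is_cell_phone) & (toy_metadata['brand'] == 'Samsung')
samsung_phones = toy_metadata[mask]

# Step 2: the inner join keeps only reviews whose asin survived the filter.
samsung_reviews = pd.merge(toy_reviews, samsung_phones, how='inner', on='asin')
print(samsung_reviews[['asin', 'reviewText']])
```

Only the two reviews of product A1 remain: A2 is a Samsung accessory rather than a phone, and A3 is a phone from another brand.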

#### Pre-processing:
- Choose the right corpus:
    -  Choose the phone with the most reviews, aiming for a corpus of 50,000 - 100,000 words.
- Remove words in the stoplist:
    -  English stopwords
    -  Common/expected words
        -  "phone"
        -  "Samsung"
- Lemmatize
- Keep only:
    -  Nouns
    -  Adjectives
- Build a tf-idf matrix
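
As a rough illustration of the stoplist and tf-idf steps, the sketch below works on three invented reviews using only the standard library. Several things here are assumptions: the stoplist is a tiny hand-written sample rather than a full English stopword list, lemmatization and part-of-speech filtering are skipped because they need an NLP library, and the weighting is one common tf-idf variant (term frequency times log(N / document frequency)). The function names `tokenize` and `tfidf` are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stoplist: a few English stopwords plus the expected words.
STOPLIST = {'the', 'is', 'a', 'and', 'this', 'but', 'phone', 'samsung'}

def tokenize(text):
    # Lowercase, keep alphabetic runs, drop stoplisted tokens.
    return [t for t in re.findall(r'[a-z]+', text.lower()) if t not in STOPLIST]

def tfidf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

reviews = [
    'This Samsung phone is great and the battery is great',
    'The battery is weak but the screen is sharp',
    'A sharp screen and a great camera',
]
tokens = [tokenize(r) for r in reviews]
matrix = tfidf(tokens)

# Terms of the second review, highest weight first.
print(sorted(matrix[1], key=matrix[1].get, reverse=True))
```

In the second review, "weak" gets the highest weight because it appears in no other review, while "battery", "screen" and "sharp" are shared with other reviews and are discounted.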

#### Modeling
- Build topic models using gensim

Import libraries, define functions

In [2]:
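import ast
import json
import os
import glob
import numpy as np
import random
import pandas as pd
import gzip

def parse(path):
    """Yield one record (a dict) per line of a gzipped file."""
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            # literal_eval parses the dict literal without executing
            # arbitrary code (unlike eval)
            yield ast.literal_eval(line)

def getDF(path):
    """Load every record in the file into a DataFrame indexed 0..n-1."""
    records = dict(enumerate(parse(path)))
    return pd.DataFrame.from_dict(records, orient='index')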
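
To see what the loader does without the full dataset, the sketch below writes two invented records to a temporary gzip file and reads them back line by line. It assumes each line holds one Python dict literal and parses it with `ast.literal_eval`, which accepts only literals and so does not execute arbitrary code. The file name, the records, and the name `parse_safe` are hypothetical.

```python
import ast
import gzip
import os
import tempfile

def parse_safe(path):
    # Yield one dict per line; literal_eval only accepts Python literals.
    with gzip.open(path, 'rt', encoding='utf-8') as handle:
        for line in handle:
            yield ast.literal_eval(line)

# Invented records shaped loosely like review rows.
records = [
    {'asin': 'X1', 'overall': 5.0, 'helpful': [0, 0]},
    {'asin': 'X2', 'overall': 2.0, 'helpful': [1, 3]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_reviews.json.gz')
    # Write one dict literal per line, mirroring the line-per-record layout.
    with gzip.open(path, 'wt', encoding='utf-8') as handle:
        for record in records:
            handle.write(repr(record) + '\n')
    loaded = list(parse_safe(path))

print(loaded == records)
```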

Read data

In [4]:
data_dir = '/Users/birupakhya/Documents/Projects/topic modeling/'
df_reviews = getDF(data_dir+'reviews_Cell_Phones_and_Accessories_5.json.gz')
# df_metadata = getDF(data_dir+'samsung_metadata.json.gz')
df_metadata = getDF(data_dir+'meta_Cell_Phones_and_Accessories.json.gz')

In [5]:
df_reviews.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A30TL5EWN6DFXT,120401325X,christina,"[0, 0]",They look good and stick good! I just don't li...,4.0,Looks Good,1400630400,"05 21, 2014"
1,ASY55RVNIL0UD,120401325X,emily l.,"[0, 0]",These stickers work like the review says they ...,5.0,Really great product.,1389657600,"01 14, 2014"
2,A2TMXE2AFO7ONB,120401325X,Erica,"[0, 0]",These are awesome and make my phone look so st...,5.0,LOVE LOVE LOVE,1403740800,"06 26, 2014"
3,AWJ0WZQYMYFQ4,120401325X,JM,"[4, 4]",Item arrived in great time and was in perfect ...,4.0,Cute!,1382313600,"10 21, 2013"
4,ATX7CZYFXI1KW,120401325X,patrice m rogoza,"[2, 3]","awesome! stays on, and looks great. can be use...",5.0,leopard home button sticker for iphone 4s,1359849600,"02 3, 2013"


In [None]:
df_metadata.head(2)

In [None]:
df_metadata[df_metadata['brand'] == 'Samsung'].head(2)

#### Array of ASINs
asin_extractor returns the list of ASINs for products whose second-level category equals subcat and whose brand equals brand.

In [6]:
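def asin_extractor(subcat, brand):
    """Return asins whose first category path has subcat at level 2 and whose brand matches."""
    df_asins_categories = df_metadata.loc[:, ['asin', 'categories', 'brand']]
    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    asins_categories = df_asins_categories.to_numpy()
    # keep rows whose first category path has at least two levels
    cellphones_true = [row for row in asins_categories if len(row[1][0]) >= 2]
    # keep rows whose second-level category and brand both match
    cellphones = [row for row in cellphones_true if row[1][0][1] == subcat and row[2] == brand]
    return [row[0] for row in cellphones]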

In [50]:
cellphone_asins = asin_extractor('Cell Phones','Samsung')

In [52]:
cellphone_asins[:2]

['B00280QJFU', 'B00387FAC0']

In [53]:
df_metadata_Samsung = df_metadata[df_metadata['asin'].isin(cellphone_asins)]

In [54]:
df_metadata_Samsung.head(3)

Unnamed: 0,asin,related,title,price,salesRank,imUrl,brand,categories,description
23000,B00280QJFU,"{'also_bought': ['B005UOUC54', 'B0046REOM8', '...",Samsung T301G Prepaid Phone (Tracfone),6.99,{'Cell Phones & Accessories': 10403},http://ecx.images-amazon.com/images/I/41uSjT4l...,Samsung,"[[Cell Phones & Accessories, Cell Phones, No-C...",This stylish Samsung T301G slider phone offers...
37558,B00387FAC0,"{'also_bought': ['B004JM2S4G', 'B007ZSWEKO', '...",Samsung T139 Prepaid Phone (T-Mobile),44.99,{'Cell Phones & Accessories': 6689},http://ecx.images-amazon.com/images/I/41g0lRu%...,Samsung,"[[Cell Phones & Accessories, Cell Phones, No-C...",Easily stay in contact with your family and fr...
45372,B003TP3FVO,"{'also_viewed': ['B004ZF0E16', 'B0067WEX1M', '...",Samsung Galaxy Spica GT-I5700 Black Unlocked,147.99,{'Cell Phones & Accessories': 448253},http://ecx.images-amazon.com/images/I/41BuaRci...,Samsung,"[[Cell Phones & Accessories, Cell Phones, Unlo...",Samsung Galaxy Spica is the little brother of ...


In [55]:
print(len(df_metadata))
print(len(df_metadata_Samsung))

346793
291


In [56]:
df_merged = pd.merge(df_reviews, df_metadata_Samsung, how='inner', on='asin' )

In [57]:
print(len(df_merged))

772


In [58]:
df_merged.head(1)

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,related,title,price,salesRank,imUrl,brand,categories,description
0,A2SH6A32BE6NEV,B00387FAC0,"Comp Expert ""Comp""","[11, 13]","While I already have a main cell phone, a Gala...",5.0,Makes for a great car or secondary phone,1356480000,"12 26, 2012","{'also_bought': ['B004JM2S4G', 'B007ZSWEKO', '...",Samsung T139 Prepaid Phone (T-Mobile),44.99,{'Cell Phones & Accessories': 6689},http://ecx.images-amazon.com/images/I/41g0lRu%...,Samsung,"[[Cell Phones & Accessories, Cell Phones, No-C...",Easily stay in contact with your family and fr...


In [59]:
d = df_merged.groupby('asin')

In [60]:
d.size()

asin
B00387FAC0     6
B004B9QNJS    20
B004H23JXW    20
B004T0LKHO     8
B004XIE6WI    12
B0057JAQXU    10
B005FLI78G    15
B005MUC9DY     5
B006HXJ40G     7
B006V47ONU     9
B007OVHW2M    15
B007UOXRS6     5
B007VCRRNS    76
B007X6FFLS    20
B00804T8A6     5
B0080DJ6CM    36
B00812YWXU    35
B0081HBX2I    14
B008DKPGP8     5
B008MC3N34     8
B008OK8IIY    13
B008P2SUEI     6
B008ZE6PJS     9
B0099LATZ2    56
B0099QRVZS    22
B009EQJ77I     7
B009PLBLQC    37
B00A29WCA0    43
B00A7K62Z0     7
B00AEK5V5K     7
B00ALGOQCQ    12
B00B090QRM     7
B00BDBEZEC     6
B00BI9AKJI     8
B00BV1MVJ0    18
B00BV1NKCW    20
B00BV48MY0     5
B00CBSX5U6    16
B00CIF9MJK     8
B00CIZFK9G    15
B00D1VVQ4O     6
B00D8T9QZU     6
B00DIJUURI     8
B00DMRRR72     7
B00DRNAT7G     6
B00DRNEV9S     6
B00DUJ6TYY     6
B00F0FVKN6     5
B00F0FVKRM    11
B00F33OE06    12
B00F5PGBEE     6
B00F9RRVUG     9
B00FDZMC9Y    12
B00H50DVPE     9
B00J4TK4B8     5
B00J4TK4CC     5
dtype: int64
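
The group sizes above show B007VCRRNS leading with 76 reviews. The pre-processing plan asks for the phone with the most reviews and a corpus of roughly 50,000 - 100,000 words, so a natural next check is to pick that product programmatically and estimate its word count. The sketch below does this on invented data; the toy dataframe is hypothetical, and counting whitespace-separated tokens is only a rough proxy for word count.

```python
import pandas as pd

# Invented reviews for three hypothetical products.
toy = pd.DataFrame({
    'asin': ['P1', 'P1', 'P1', 'P2', 'P2', 'P3'],
    'reviewText': [
        'great screen and battery',
        'camera is sharp',
        'too big for my hand',
        'stopped working',
        'ok',
        'fine',
    ],
})

# Reviews per product, then the asin with the largest count.
review_counts = toy.groupby('asin').size()
top_asin = review_counts.idxmax()

# Corpus for that product and an approximate word count.
corpus = toy.loc[toy['asin'] == top_asin, 'reviewText']
word_count = int(corpus.str.split().str.len().sum())
print(top_asin, review_counts[top_asin], word_count)
```

If the resulting word count falls well short of the target range, one option is to pool the reviews of several top products instead of using a single one.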

In [63]:
d.filter(lambda x: len(x) > 50)

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,related,title,price,salesRank,imUrl,brand,categories,description
132,A2V2RODHVIJKV5,B007VCRRNS,,"[0, 0]",Some of my best phones,5.0,Five Stars,1405814400,"07 20, 2014","{'also_bought': ['B0089VO78I', 'B00812YWXU', '...",Samsung Galaxy S3 i9300 16GB - Factory Unlocke...,285.00,{'Cell Phones & Accessories': 330},http://ecx.images-amazon.com/images/I/41exDZBI...,Samsung,"[[Cell Phones & Accessories, Cell Phones, Unlo...",The Galaxy S III is powered by Qualcomm MSM896...
133,A39H670PD32Z1X,B007VCRRNS,"Adam Sorenson ""Adam""","[2, 2]",I was on the verge of buying the new iphone bu...,5.0,i-phone killer,1356739200,"12 29, 2012","{'also_bought': ['B0089VO78I', 'B00812YWXU', '...",Samsung Galaxy S3 i9300 16GB - Factory Unlocke...,285.00,{'Cell Phones & Accessories': 330},http://ecx.images-amazon.com/images/I/41exDZBI...,Samsung,"[[Cell Phones & Accessories, Cell Phones, Unlo...",The Galaxy S III is powered by Qualcomm MSM896...
134,A292JUY8KQV3EB,B007VCRRNS,A.,"[0, 0]","The S3 is an amazing phone, even considering i...",5.0,Great phone!,1375315200,"08 1, 2013","{'also_bought': ['B0089VO78I', 'B00812YWXU', '...",Samsung Galaxy S3 i9300 16GB - Factory Unlocke...,285.00,{'Cell Phones & Accessories': 330},http://ecx.images-amazon.com/images/I/41exDZBI...,Samsung,"[[Cell Phones & Accessories, Cell Phones, Unlo...",The Galaxy S III is powered by Qualcomm MSM896...
135,A8M5WJ8H1K4T7,B007VCRRNS,Alex Alexzander,"[112, 121]",This is without a doubt the best phone I've ev...,5.0,Best phone I've ever owned,1344643200,"08 11, 2012","{'also_bought': ['B0089VO78I', 'B00812YWXU', '...",Samsung Galaxy S3 i9300 16GB - Factory Unlocke...,285.00,{'Cell Phones & Accessories': 330},http://ecx.images-amazon.com/images/I/41exDZBI...,Samsung,"[[Cell Phones & Accessories, Cell Phones, Unlo...",The Galaxy S III is powered by Qualcomm MSM896...
136,A2K5IZ8UR8GVXF,B007VCRRNS,Amateur Filmmaker,"[4, 5]","Original Review: August 8, 2013I didn't even b...",4.0,Lucky Buy? Great Phone...,1375920000,"08 8, 2013","{'also_bought': ['B0089VO78I', 'B00812YWXU', '...",Samsung Galaxy S3 i9300 16GB - Factory Unlocke...,285.00,{'Cell Phones & Accessories': 330},http://ecx.images-amazon.com/images/I/41exDZBI...,Samsung,"[[Cell Phones & Accessories, Cell Phones, Unlo...",The Galaxy S III is powered by Qualcomm MSM896...
137,A39QFOFN5RKIJZ,B007VCRRNS,Another Dude,"[17, 18]","First of all, I bought from Wireless Everythin...",5.0,This is the best CELLPHONE EVER!!!,1341619200,"07 7, 2012","{'also_bought': ['B0089VO78I', 'B00812YWXU', '...",Samsung Galaxy S3 i9300 16GB - Factory Unlocke...,285.00,{'Cell Phones & Accessories': 330},http://ecx.images-amazon.com/images/I/41exDZBI...,Samsung,"[[Cell Phones & Accessories, Cell Phones, Unlo...",The Galaxy S III is powered by Qualcomm MSM896...
138,A1QV5IH6HDRN0L,B007VCRRNS,armygirl,"[39, 47]","I was always a fan of Android operated phones,...",1.0,Love this phone! EDIT: NO I DON'T!,1346889600,"09 6, 2012","{'also_bought': ['B0089VO78I', 'B00812YWXU', '...",Samsung Galaxy S3 i9300 16GB - Factory Unlocke...,285.00,{'Cell Phones & Accessories': 330},http://ecx.images-amazon.com/images/I/41exDZBI...,Samsung,"[[Cell Phones & Accessories, Cell Phones, Unlo...",The Galaxy S III is powered by Qualcomm MSM896...
139,A3AFELPYTZH90T,B007VCRRNS,Baja Alan,"[0, 0]",This phone is great! Lots of features and Apps...,5.0,Unbelievably Easy to Use.,1356480000,"12 26, 2012","{'also_bought': ['B0089VO78I', 'B00812YWXU', '...",Samsung Galaxy S3 i9300 16GB - Factory Unlocke...,285.00,{'Cell Phones & Accessories': 330},http://ecx.images-amazon.com/images/I/41exDZBI...,Samsung,"[[Cell Phones & Accessories, Cell Phones, Unlo...",The Galaxy S III is powered by Qualcomm MSM896...
140,A3HJFYGUWFC0D9,B007VCRRNS,BARCAGP,"[0, 0]","excellent phone, a little big but it fits the ...",5.0,awesome,1360972800,"02 16, 2013","{'also_bought': ['B0089VO78I', 'B00812YWXU', '...",Samsung Galaxy S3 i9300 16GB - Factory Unlocke...,285.00,{'Cell Phones & Accessories': 330},http://ecx.images-amazon.com/images/I/41exDZBI...,Samsung,"[[Cell Phones & Accessories, Cell Phones, Unlo...",The Galaxy S III is powered by Qualcomm MSM896...
141,A346MAIT1GXHOH,B007VCRRNS,Bryant,"[3, 8]","Both phones freeze, turn off, kick you out of ...",2.0,Think twice on this one.,1348272000,"09 22, 2012","{'also_bought': ['B0089VO78I', 'B00812YWXU', '...",Samsung Galaxy S3 i9300 16GB - Factory Unlocke...,285.00,{'Cell Phones & Accessories': 330},http://ecx.images-amazon.com/images/I/41exDZBI...,Samsung,"[[Cell Phones & Accessories, Cell Phones, Unlo...",The Galaxy S III is powered by Qualcomm MSM896...
