# Amazon Review Codebook

This notebook documents how we process the [Amazon Product Data](http://jmcauley.ucsd.edu/data/amazon/links.html) (He and McAuley, 2016; McAuley et al., 2015). We walk through the processing for the `Musical Instruments` category; the same method is applied to all categories in `scripts/review_extract.py` and `scripts/cat_dummies.py`.

```
This dataset contains product reviews and metadata from Amazon, 
including 142.8 million reviews spanning May 1996 - July 2014.
This dataset includes reviews (ratings, text, helpfulness votes), 
product metadata (descriptions, category information, price, brand, and image features), 
and links (also viewed/also bought graphs).
```

We used the [aggressively deduplicated data](http://snap.stanford.edu/data/amazon/productGraph/aggressive_dedup.json.gz) for product reviews and the [product metadata](http://snap.stanford.edu/data/amazon/productGraph/metadata.json.gz) as raw data.

## Imports

In [None]:
import ast
import pandas as pd
import numpy as np
import gzip
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from collections import Counter
from datetime import datetime

#path
PATH = '../data/amzn/'
RAW_PATH = PATH + 'raw/'
PROCESSED_PATH = PATH + 'processed/'
lev1 = 'Musical Instruments'

#utils
def parse(path):
    # one dict literal per line; literal_eval parses it without executing arbitrary code
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for l in g:
            yield ast.literal_eval(l)

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

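def extract_rank(x):
    # salesRank is a dict {category: rank}; keep the first rank, NaN otherwise
    if isinstance(x, dict) and len(x) > 0:
        return next(iter(x.values()))
    return float('nan')

def word_len(x):
    # number of NLTK tokens; NaN for missing (non-string) values
    if isinstance(x, str):
        return len(nltk.word_tokenize(x))
    return float('nan')

def char_len(x):
    # number of characters; NaN for missing (non-string) values
    if isinstance(x, str):
        return len(x)
    return float('nan')

def get_sentiment(x):
    # relies on the global `sid` (SentimentIntensityAnalyzer) created before the first call
    if isinstance(x, str):
        score_dict = sid.polarity_scores(x)
        return score_dict['compound'], score_dict['neg'], score_dict['neu'], score_dict['pos']
    return float('nan'), float('nan'), float('nan'), float('nan')

def unix_to_dt(x):
    # fromtimestamp() uses the local timezone of the machine running the notebook
    return datetime.fromtimestamp(x).strftime('%Y-%m-%d %H:%M:%S')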
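
`unix_to_dt` calls `datetime.fromtimestamp` without a timezone, so the resulting strings depend on the local timezone of the machine that runs the notebook. If timezone-independent timestamps are preferred, a minimal sketch could pin the conversion to UTC. `unix_to_utc` is a hypothetical name and is not part of the original pipeline.

```python
from datetime import datetime, timezone

def unix_to_utc(x):
    # Hypothetical variant of unix_to_dt that always converts in UTC
    return datetime.fromtimestamp(x, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')

print(unix_to_utc(1394496000))  # 2014-03-11 00:00:00
```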

## Load Data

### Review Data

* reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
* asin - ID of the product, e.g. 0000013714
* reviewerName - name of the reviewer
* helpful - helpfulness rating of the review, e.g. 2/3
* reviewText - text of the review
* overall - rating of the product
* summary - summary of the review
* unixReviewTime - time of the review (unix time)
* reviewTime - time of the review (raw)
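
In the raw data the `helpful` field is a two-element list; the `2/3` example corresponds to `[2, 3]`, that is, 2 helpful votes out of 3 votes in total. A minimal sketch of splitting it into counts, using made-up rows and the hypothetical column names `helpful_total` and `helpful_ratio`:

```python
import pandas as pd

# Toy rows standing in for reviews.helpful (values are made up)
toy = pd.DataFrame({'helpful': [[2, 3], [0, 0], [0, 1]]})

toy['helpful_yes'] = toy.helpful.map(lambda x: x[0])
toy['helpful_total'] = toy.helpful.map(lambda x: x[1])
toy['helpful_no'] = toy.helpful_total - toy.helpful_yes
# The ratio is undefined when nobody voted, so it is left as NaN
toy['helpful_ratio'] = toy.helpful_yes / toy.helpful_total.where(toy.helpful_total > 0)
print(toy)
```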

In [225]:
reviews = getDF(RAW_PATH+'reviews_Musical_Instruments.json.gz')
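
`getDF` reads the gzipped file line by line, with one record per line. The sketch below reproduces that pattern on a tiny temporary file so that it runs without the raw data. The records and the names `parse_lines` and `toy_reviews` are made up for illustration, and each line is parsed with `ast.literal_eval`, which only accepts plain literals.

```python
import ast
import gzip
import os
import tempfile

import pandas as pd

def parse_lines(path):
    # One dict literal per line; literal_eval only accepts plain literals
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield ast.literal_eval(line)

# Made-up records written in the same one-dict-per-line layout
rows = [
    {'asin': '0000000001', 'overall': 5.0, 'helpful': [2, 3]},
    {'asin': '0000000002', 'overall': 1.0, 'helpful': [0, 1]},
]
path = os.path.join(tempfile.mkdtemp(), 'toy_reviews.json.gz')
with gzip.open(path, 'wt', encoding='utf-8') as f:
    for r in rows:
        f.write(repr(r) + '\n')

toy_reviews = pd.DataFrame(list(parse_lines(path)))
print(toy_reviews.shape)  # (2, 3)
```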

In [185]:
#helpful: the raw field is [helpful votes, total votes], e.g. [2, 3] for 2/3
reviews['helpful_yes'] = reviews.helpful.map(lambda x: x[0])
reviews['helpful_no'] = reviews.helpful.map(lambda x: x[1] - x[0])

#len
reviews['reviewText_len'] = reviews.reviewText.map(word_len)
reviews['reviewText_char'] = reviews.reviewText.map(char_len)
reviews['summary_len'] = reviews.summary.map(word_len)
reviews['summary_char'] = reviews.summary.map(char_len)

#datetime
reviews['dt'] = reviews.unixReviewTime.map(unix_to_dt)

#sentiment
sid = SentimentIntensityAnalyzer()
reviews['reviewText_tuple']= reviews.reviewText.map(get_sentiment)
reviews['summary_tuple']= reviews.summary.map(get_sentiment)

#extract tuple
reviews['reviewText_compound'] = reviews['reviewText_tuple'].map(lambda x: x[0])
reviews['reviewText_neg'] = reviews['reviewText_tuple'].map(lambda x: x[1])
reviews['reviewText_neu'] = reviews['reviewText_tuple'].map(lambda x: x[2])
reviews['reviewText_pos'] = reviews['reviewText_tuple'].map(lambda x: x[3])

reviews['summary_compound'] = reviews['summary_tuple'].map(lambda x: x[0])
reviews['summary_neg'] = reviews['summary_tuple'].map(lambda x: x[1])
reviews['summary_neu'] = reviews['summary_tuple'].map(lambda x: x[2])
reviews['summary_pos'] = reviews['summary_tuple'].map(lambda x: x[3])
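
The eight `map` calls above unpack the score tuples one position at a time. A more compact sketch builds all four columns at once from the tuples. To keep it runnable without the VADER lexicon, `fake_sentiment` is a hypothetical stand-in for `get_sentiment` and returns made-up scores.

```python
import pandas as pd

def fake_sentiment(x):
    # Hypothetical stand-in for get_sentiment: (compound, neg, neu, pos)
    if isinstance(x, str):
        return (0.5, 0.0, 0.5, 0.5)
    return (float('nan'),) * 4

toy = pd.DataFrame({'reviewText': ['Great strings', None]})
names = ['compound', 'neg', 'neu', 'pos']

# Expand each 4-tuple into four prefixed columns in one step
scores = pd.DataFrame(toy.reviewText.map(fake_sentiment).tolist(),
                      index=toy.index,
                      columns=['reviewText_' + n for n in names])
toy = toy.join(scores)
print(toy.columns.tolist())
```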

In [214]:
#save
selected_columns = ['reviewerID', 'asin', 'reviewerName', 'reviewText',
       'overall', 'summary', 'unixReviewTime', 'helpful_yes',
       'helpful_no', 'reviewText_len', 'reviewText_char', 'summary_len',
       'summary_char', 'dt', 'reviewText_compound', 'reviewText_neg', 'reviewText_neu',
       'reviewText_pos', 'summary_compound', 'summary_neg', 'summary_neu',
       'summary_pos']
reviews = reviews[selected_columns]
reviews.to_csv(PROCESSED_PATH+'reviews_sample.csv',index=False)
reviews.head()

Unnamed: 0,reviewerID,asin,reviewerName,reviewText,overall,summary,unixReviewTime,helpful_yes,helpful_no,reviewText_len,...,summary_char,dt,reviewText_compound,reviewText_neg,reviewText_neu,reviewText_pos,summary_compound,summary_neg,summary_neu,summary_pos
0,A1YS9MDZP93857,6428320,John Taylor,The portfolio is fine except for the fact that...,3.0,Parts missing,1394496000,0,0,24,...,13,2014-03-11 07:00:00,-0.1027,0.096,0.826,0.078,-0.296,0.688,0.312,0.0
1,A3TS466QBAWB9D,14072149,Silver Pencil,If you are a serious violin student on a budge...,5.0,"Perform it with a friend, today!",1370476800,0,0,107,...,32,2013-06-06 07:00:00,0.8542,0.051,0.809,0.14,0.5411,0.0,0.534,0.466
2,A3BUDYITWUSIS7,41291905,joyce gabriel cornett,This is and excellent edition and perfectly tr...,5.0,Vivalldi's Four Seasons,1381708800,0,0,34,...,23,2013-10-14 07:00:00,0.9651,0.0,0.52,0.48,0.0,0.0,1.0,0.0
3,A19K10Z0D2NTZK,41913574,TexasCowboy,Perfect for someone who is an opera fan or a w...,5.0,Full score: voice and orchestra,1285200000,0,0,145,...,31,2010-09-23 07:00:00,0.8834,0.053,0.818,0.129,0.0,0.0,1.0,0.0
4,A14X336IB4JD89,201891859,dfjm53,How many Nocturnes does it contain? All of the...,1.0,Unable to determine contents,1350432000,0,1,29,...,28,2012-10-17 07:00:00,-0.2359,0.085,0.915,0.0,0.0,0.0,1.0,0.0


### Meta Data

* asin - ID of the product, e.g. 0000031852
* title - name of the product
* price - price in US dollars (at time of crawl)
* imUrl - url of the product image
* related - related products (also bought, also viewed, bought together, buy after viewing)
* salesRank - sales rank information
* brand - brand name
* categories - list of categories the product belongs to
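
In the raw metadata `salesRank` is a dictionary that maps a category name to the product's rank in it, and some products have no entry. A minimal sketch of flattening it to a single number, mirroring `extract_rank` from the utilities; `first_rank` is a hypothetical name and the sample values are made up:

```python
import math

def first_rank(x):
    # Same idea as extract_rank: first rank in the dict, NaN if absent or empty
    if isinstance(x, dict) and x:
        return next(iter(x.values()))
    return float('nan')

print(first_rank({'Musical Instruments': 207315}))  # 207315
print(math.isnan(first_rank({})))                   # True
print(math.isnan(first_rank(float('nan'))))         # True
```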

In [116]:
meta = getDF(RAW_PATH+'meta_Musical_Instruments.json.gz')
meta.salesRank = meta.salesRank.map(extract_rank)

#len
meta['title_len'] = meta.title.map(word_len)
meta['title_char'] = meta.title.map(char_len)
meta['desc_len'] = meta.description.map(word_len)
meta['desc_char'] = meta.description.map(char_len)

#save
selected_columns = ['asin','title','title_len','title_char','price','salesRank','categories','brand','description','desc_len','desc_char']
meta = meta[selected_columns]
meta.to_csv(PROCESSED_PATH+'meta_sample.csv',index=False)

In [117]:
meta.head()

Unnamed: 0,asin,title,title_len,title_char,price,salesRank,categories,brand,description,desc_len,desc_char
0,6428320,"Six Sonatas For Two Flutes Or Violins, Volume ...",14.0,54.0,17.95,207315.0,"[[Musical Instruments, Instrument Accessories,...",,,,
1,14072149,Double Concerto in D Minor By Johann Sebastian...,48.0,239.0,18.77,94593.0,[[Musical Instruments]],,Composer: J.S. Bach.Peters Edition.For two vio...,11.0,62.0
2,41291905,Hal Leonard Vivaldi Four Seasons for Piano (Or...,12.0,66.0,,222972.0,"[[Musical Instruments, Instrument Accessories,...",,Vivaldi's famous set of four violin concertos ...,44.0,263.0
3,41913574,"Aida: Opera in Quattro Atti, Partitura -- Aida...",31.0,151.0,49.99,,[[Musical Instruments]],,444 pages. \nReprint of corrected and revised ...,10.0,53.0
4,201891859,Nocturnes,1.0,9.0,,171871.0,"[[Musical Instruments, Instrument Accessories,...",,,,


In [230]:
# one-hot encode the labels in each product's first category path
s = meta['categories'].map(lambda x: x[0])
# groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0
pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()

Unnamed: 0,A-D & D-A Converters,Accessories,Accessories & Supplies,Accordion Accessories,Accordions,Acid Jazz,Acoustic & Acoustic-Electric Basses,Acoustic & Classical Guitar Bags & Cases,Acoustic & Classical Guitar Parts,Acoustic Blues,...,Windsreens & Pop Filters,Wired Headsets,Wireless Microphones,Wood & Inlay Material,Wood Blocks,World Dance,World Music,Xylophone Accessories,Xylophones,Zurnas
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
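
The table above has one row per product and one 0/1 column per category label. The same idea on toy data, as a sketch that uses `Series.explode` in place of `apply(pd.Series).stack()`; the category lists are made up:

```python
import pandas as pd

# Made-up first category paths for three products
s = pd.Series([
    ['Musical Instruments', 'Guitars'],
    ['Musical Instruments'],
    ['Musical Instruments', 'Drums & Percussion'],
])

# One row per (product, label), one-hot encode, then collapse back to one row per product
dummies = pd.get_dummies(s.explode(), dtype=int).groupby(level=0).sum()
print(dummies)
```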


### Combined

In [12]:
combined = pd.read_csv(PROCESSED_PATH+'combined_Musical_Instruments.csv')
combined.columns

Index(['reviewer_nb', 'asin', 'overall', 'unixReviewTime', 'helpful_yes',
       'helpful_no', 'reviewText_len', 'reviewText_char', 'summary_len',
       'summary_char', 'dt', 'reviewText_compound', 'reviewText_neg',
       'reviewText_neu', 'reviewText_pos', 'summary_compound', 'summary_neg',
       'summary_neu', 'summary_pos', 'lev1', 'title_len', 'title_char',
       'desc_len', 'desc_char', 'price', 'salesRank', 'brand'],
      dtype='object')

In [217]:
combined = pd.merge(reviews,meta,how='left',on=['asin'])
combined.to_csv(PROCESSED_PATH+'combined_sample.csv',index=False)
combined.head()

Unnamed: 0,reviewerID,asin,reviewerName,reviewText,overall,summary,unixReviewTime,helpful_yes,helpful_no,reviewText_len,...,title,title_len,title_char,price,salesRank,categories,brand,description,desc_len,desc_char
0,A1YS9MDZP93857,6428320,John Taylor,The portfolio is fine except for the fact that...,3.0,Parts missing,1394496000,0,0,24,...,"Six Sonatas For Two Flutes Or Violins, Volume ...",14.0,54.0,17.95,207315.0,"[[Musical Instruments, Instrument Accessories,...",,,,
1,A3TS466QBAWB9D,14072149,Silver Pencil,If you are a serious violin student on a budge...,5.0,"Perform it with a friend, today!",1370476800,0,0,107,...,Double Concerto in D Minor By Johann Sebastian...,48.0,239.0,18.77,94593.0,[[Musical Instruments]],,Composer: J.S. Bach.Peters Edition.For two vio...,11.0,62.0
2,A3BUDYITWUSIS7,41291905,joyce gabriel cornett,This is and excellent edition and perfectly tr...,5.0,Vivalldi's Four Seasons,1381708800,0,0,34,...,Hal Leonard Vivaldi Four Seasons for Piano (Or...,12.0,66.0,,222972.0,"[[Musical Instruments, Instrument Accessories,...",,Vivaldi's famous set of four violin concertos ...,44.0,263.0
3,A19K10Z0D2NTZK,41913574,TexasCowboy,Perfect for someone who is an opera fan or a w...,5.0,Full score: voice and orchestra,1285200000,0,0,145,...,"Aida: Opera in Quattro Atti, Partitura -- Aida...",31.0,151.0,49.99,,[[Musical Instruments]],,444 pages. \nReprint of corrected and revised ...,10.0,53.0
4,A14X336IB4JD89,201891859,dfjm53,How many Nocturnes does it contain? All of the...,1.0,Unable to determine contents,1350432000,0,1,29,...,Nocturnes,1.0,9.0,,171871.0,"[[Musical Instruments, Instrument Accessories,...",,,,


In [218]:
combined.head(1000).to_csv(PROCESSED_PATH+'combined_sample_sample.csv',index=False)
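
The left merge keeps every review and attaches product metadata where the `asin` matches, so reviews without metadata end up with missing values in the metadata columns. A sketch on made-up frames showing how `indicator` and `validate` (both standard `pd.merge` options) can surface unmatched reviews and duplicated products:

```python
import pandas as pd

toy_reviews = pd.DataFrame({'asin': ['A1', 'A1', 'B2'], 'overall': [5.0, 3.0, 1.0]})
toy_meta = pd.DataFrame({'asin': ['A1'], 'price': [17.95]})

# validate raises if an asin appears more than once in toy_meta;
# indicator adds a _merge column marking rows with no metadata match
merged = pd.merge(toy_reviews, toy_meta, how='left', on='asin',
                  validate='many_to_one', indicator=True)
unmatched = (merged['_merge'] == 'left_only').sum()
print(len(merged), unmatched)  # 3 1
```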

## Citations

* He, R. and McAuley, J. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW, 2016.
* McAuley, J., Targett, C., Shi, J., and van den Hengel, A. Image-based recommendations on styles and substitutes. SIGIR, 2015.