# Data Gathering Project: Amazon Reviews
### Data Science for Quantitative Finance

Dataset: The Amazon review dataset was gathered by Professor Julian McCauley at UCSD. The dataset includes all reviews of all products sold on Amazon between 1995 and 2014.  The data is publicly available via Professor McCauley's website:  http://jmcauley.ucsd.edu/data/amazon/

The data for this notebook was downloaded directly from our class Google Drive and was moved into the working directory.

Create the following databases (in a dataframe or HDF5 format) for the Books, Electronics,
Cell Phones and Accessories, and Beauty categories:  
● df_reviews_categoryname: whose columns are timestamp, productid, reviewerid,
rating, review_text, review_summary   
● df_products: whose columns are productid, title, imUrl, brand  
● df_products_also_bought: indexed productid, contains also_bought column  
● df_products_also_viewed: indexed productid, contains also_viewed column  
● df_products_bought_together: indexed productid, contains bought_together column  
● df_products_sales_rank: indexed by productid contains sales_rank  
● df_products_categories: indexed by productid contains categories column  

In [41]:
#import packages
import numpy as np
import pandas as pd
import json
import gzip

In [None]:
#reading in files, script from Justin's wesbite
def parse(path):
  g = gzip.open(path, 'rb')
  for l in g:
    yield eval(l)

def getDF(path):
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')

In [73]:
#load ratings
df_reviews = getDF('Reviews_Electronics_5.json.gz')
#pd.read_csv('ratings_Electronics.csv', header = None)
#ratings_electronics.head()

In [74]:
df_reviews.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,AO94DHGC771SJ,528881469,amazdnu,"[0, 0]",We got this GPS for my husband who is an (OTR)...,5.0,Gotta have GPS!,1370131200,"06 2, 2013"
1,AMO214LNFCEI4,528881469,Amazon Customer,"[12, 15]","I'm a professional OTR truck driver, and I bou...",1.0,Very Disappointed,1290643200,"11 25, 2010"
2,A3N7T0DY83Y4IG,528881469,C. A. Freeman,"[43, 45]","Well, what can I say. I've had this unit in m...",3.0,1st impression,1283990400,"09 9, 2010"
3,A1H8PY3QHMQQA0,528881469,"Dave M. Shaw ""mack dave""","[9, 10]","Not going to write a long review, even thought...",2.0,"Great grafics, POOR GPS",1290556800,"11 24, 2010"
4,A24EV6RXELQZ63,528881469,Wayne Smith,"[0, 0]",I've had mine for a year and here's what we go...,1.0,"Major issues, only excuses for support",1317254400,"09 29, 2011"


In [78]:
#changing time entry into a datetime
df_reviews['timestamp'] = pd.to_datetime(df_reviews.unixReviewTime,
                                                  unit="s")

In [79]:
#dropping columns we won't need
df_reviews.drop(['helpful', 'reviewTime',
      'reviewerName', 'unixReviewTime'], axis=1, inplace=True)

In [80]:
#renaming columns as per assignment
df_reviews.rename(columns={'asin':'productid', 'overall':'rating', 'reviewText': 'review_text', 
                                       'summary':"review_summary"}, inplace=True)

In [81]:
df_reviews.head()

Unnamed: 0,reviewerID,productid,review_text,rating,review_summary,timestamp
0,AO94DHGC771SJ,528881469,We got this GPS for my husband who is an (OTR)...,5.0,Gotta have GPS!,2013-06-02
1,AMO214LNFCEI4,528881469,"I'm a professional OTR truck driver, and I bou...",1.0,Very Disappointed,2010-11-25
2,A3N7T0DY83Y4IG,528881469,"Well, what can I say. I've had this unit in m...",3.0,1st impression,2010-09-09
3,A1H8PY3QHMQQA0,528881469,"Not going to write a long review, even thought...",2.0,"Great grafics, POOR GPS",2010-11-24
4,A24EV6RXELQZ63,528881469,I've had mine for a year and here's what we go...,1.0,"Major issues, only excuses for support",2011-09-29


# loading in metadata

Df_products  - DONE
- Productid  
- title  
- imURL  
- Brand  

Df_products_also_bought  
- Indexed productid  
- Also_bought  

Df_products_also_viewed  
- Indexed productid  
- Also_viewed  

Df_products_bought_together  
- Indexed productid  
- Bought_together  

Df_products_sales_rank (A)  - DONE
- Indexed by productid  
- Sales_rank  

Df_products_categories  - DONE
- Indexed by productid  
- Categories  


In [55]:
df = getDF('meta_Electronics.json.gz')

In [56]:
df.head()

Unnamed: 0,asin,imUrl,description,categories,title,price,salesRank,related,brand
0,132793040,http://ecx.images-amazon.com/images/I/31JIPhp%...,The Kelby Training DVD Mastering Blend Modes i...,"[[Electronics, Computers & Accessories, Cables...",Kelby Training DVD: Mastering Blend Modes in A...,,,,
1,321732944,http://ecx.images-amazon.com/images/I/31uogm6Y...,,"[[Electronics, Computers & Accessories, Cables...",Kelby Training DVD: Adobe Photoshop CS5 Crash ...,,,,
2,439886341,http://ecx.images-amazon.com/images/I/51k0qa8f...,Digital Organizer and Messenger,"[[Electronics, Computers & Accessories, PDAs, ...",Digital Organizer and Messenger,8.15,{'Electronics': 144944},"{'also_viewed': ['0545016266', 'B009ECM8QY', '...",
3,511189877,http://ecx.images-amazon.com/images/I/41HaAhbv...,The CLIKR-5 UR5U-8780L remote control is desig...,"[[Electronics, Accessories & Supplies, Audio &...",CLIKR-5 Time Warner Cable Remote Control UR5U-...,23.36,,"{'also_viewed': ['B001KC08A4', 'B00KUL8O0W', '...",
4,528881469,http://ecx.images-amazon.com/images/I/51FnRkJq...,"Like its award-winning predecessor, the Intell...","[[Electronics, GPS & Navigation, Vehicle GPS, ...",Rand McNally 528881469 7-inch Intelliroute TND...,299.99,,"{'also_viewed': ['B006ZOI9OY', 'B00C7FKT2A', '...",


In [57]:
#rename columns
df.rename(columns={'asin':'productid'}, inplace=True)

### Categories Dataframe

In [14]:
#make categories df
df_products_categories = df[['productid', 'categories']]

In [16]:
#set index to productid
df_products_categories.set_index(['productid'], inplace=True)

In [None]:
#WRITE TO CSV

### SalesRank Dataframe

In [18]:
#make categories df
df_products_sales_rank = df[['productid', 'salesRank']]

In [19]:
#set index to productid
df_products_sales_rank.set_index(['productid'], inplace=True)

In [20]:
df_products_sales_rank.head()

Unnamed: 0_level_0,salesRank
productid,Unnamed: 1_level_1
B004A9SDD8,
B004AFQAUA,
B004AGCR1K,
B004AHBBPW,
B004ALFHV2,


In [21]:
#WRITE TO CSV

### Bought and Viewed Together

In [48]:
df.related[2][]

{'also_viewed': ['B00A7W29BE', 'B00I5PB9UM', 'B00FR88VTC']}

In [52]:
products = []
viewed = []
bought = []
for i in np.arange(5):
    print (df.related[i])
    products.append(df.productid[i])
    #need to take care of the error -- if it throws an error go to the next one
    bought.append(df.related[i]['also_bought'])
    viewed.append(df.related[i]['also_viewed'])
    
    

{'also_bought': ['B006M3K874', 'B00F85SMOI', 'B008K39F78', 'B00CLR1R5W', 'B00D3LBZV6', 'B006OOMDH4', 'B00IIZ7HGY', 'B008H4VBEU', 'B00DMNTXFK', 'B005DU213G', 'B00FDZ2EN8', 'B00HLCC6HK', 'B00C2QYOVQ', 'B007NK03WU', 'B00H5CGSDO', 'B009LSFH5U', 'B00AQIQ1OA', 'B00ARDBV8A', 'B00JRGCQ14', 'B00B2NNMYK', 'B00KO6GB4O', 'B00HDGR8YU', 'B00J5GL0D6', 'B00GWUBMJA', 'B009S984J8', 'B005D9T44G', 'B00JZQNJJ4', 'B008QMMEOE', 'B00EJEB4KS', 'B008I7H3H0', 'B00B3V4MEA', 'B005DGA9CA', 'B00BRN84QQ', 'B00IRQ12VU', 'B00H35I2Z0', 'B004Q1NH4U', 'B00CD3L8JY', 'B00B7S5X06', 'B00EWO3RHI', 'B00G5LBBU6', 'B005F0WONQ', 'B00B86QL4O', 'B004PLPYGU', 'B004DPC5Y2', 'B00FYXR49I', 'B008JHQXK2', 'B00B4B8MLI', 'B0077TVAD8', 'B00FL4EUZG', 'B00HFQBCEU', 'B00L43HTUG', 'B00H4I33YG', 'B005290NXI', 'B007F38Y7Q', 'B00J06AJNI', 'B004NWLM8K', 'B00BBF4NX8', 'B004WKIKGU', 'B007ZFZYH2', 'B00EUUQ172', 'B00KN80SS8', 'B008NMT684', 'B00DHIDYEG', 'B00FW5RJEI', 'B00CI5BH0C', 'B00CQ55S22', 'B00I0FNDEM', 'B00BWBHIUG', 'B00GUNPKGK', 'B00AYB1TP0', 'B0

KeyError: 'also_viewed'

### Df_Products 
Productid
title
imURL
Brand

In [59]:
#make products df
df_products = df[['productid', 'title', 'imUrl', 'brand']]

In [19]:
#set index to productid
df_products_sales_rank.set_index(['productid'], inplace=True)

In [60]:
df_products.head()

Unnamed: 0,productid,title,imUrl,brand
0,132793040,Kelby Training DVD: Mastering Blend Modes in A...,http://ecx.images-amazon.com/images/I/31JIPhp%...,
1,321732944,Kelby Training DVD: Adobe Photoshop CS5 Crash ...,http://ecx.images-amazon.com/images/I/31uogm6Y...,
2,439886341,Digital Organizer and Messenger,http://ecx.images-amazon.com/images/I/51k0qa8f...,
3,511189877,CLIKR-5 Time Warner Cable Remote Control UR5U-...,http://ecx.images-amazon.com/images/I/41HaAhbv...,
4,528881469,Rand McNally 528881469 7-inch Intelliroute TND...,http://ecx.images-amazon.com/images/I/51FnRkJq...,


In [61]:
#save to CSV