# Data Preprocessing

#### Team members: 


In [1]:
%matplotlib inline
import gzip
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### 1. Functions 
 
1.1 Functions for loading data

Read the data into a **pandas DataFrame** using the two functions below, which are provided on the website of our Amazon dataset.

In [2]:
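def parse(path):
    # Yield one record per line of the gzipped file.
    # NOTE: eval executes each line as Python code, so only use it on trusted files.
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield eval(l)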

In [3]:
def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')
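
As a hedged alternative to `eval`, the sketch below parses each line with `ast.literal_eval`, which only accepts Python literals and does not execute arbitrary expressions. The names `parse_safe` and `get_df_safe` are hypothetical and are not part of the helper code from the dataset website. The sketch assumes every line of the file is a single Python dict literal, as the `eval`-based loader implies.

```python
import ast
import gzip

import pandas as pd


def parse_safe(path):
    """Yield one record per line, accepting only Python literals."""
    # 'rt' opens the gzip stream in text mode, so each line is a str.
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield ast.literal_eval(line)


def get_df_safe(path):
    """Collect all records into a DataFrame with a 0..n-1 integer index."""
    return pd.DataFrame(list(parse_safe(path)))
```

Building the DataFrame from a list of dicts also avoids keeping a manual counter; keys missing from a record become NaN in the corresponding column.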

1.2 Functions for checking NaN values

We define the two functions below to check whether a chosen column, or any column of a DataFrame, contains NaN values.

In [11]:
def checkNanValue(dataframe, column):
    # Return True if the given column contains at least one NaN value.
    has_nan = dataframe[column].isnull().values.any()
    print("Check if column {} exists Nan value: {}".format(column, has_nan))
    return has_nan

def checkDataframeNanValue(dataframe):
    # Return the names of all columns that contain at least one NaN value.
    list_Nancolumns = []
    for column in dataframe.columns:
        if checkNanValue(dataframe, column):
            list_Nancolumns.append(column)
    return list_Nancolumns
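
For comparison, pandas can run the same check on every column at once. The following is a minimal sketch on a toy DataFrame; the values and the names `toy`, `has_nan` and `nan_columns` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, only to illustrate the check.
toy = pd.DataFrame({
    'asin': ['A1', 'B2', 'C3'],
    'price': [9.99, np.nan, 4.50],
    'brand': [None, 'Acme', None],
})

# isnull().any() returns one boolean per column.
has_nan = toy.isnull().any()

# Keep only the column names whose flag is True.
nan_columns = has_nan[has_nan].index.tolist()
print(nan_columns)  # ['price', 'brand']
```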

### 2. Health&Care metadata

After inspecting the Health&Care metadata, we found many null values and some columns that we are not interested in, so we decided to clean the data.

2.1 Import Health&Care metadata

In [48]:
df_health_meta = getDF('data/meta_Health_and_Personal_Care.json.gz')

In [49]:
df_health_meta.head()

,asin,description,title,imUrl,related,salesRank,categories,price,brand
0,77614992,This is an example product description.,Principles of Mgmt + Oper-CSUF Custom C,http://ecx.images-amazon.com/images/I/51G%2BRqOCiqL._SY300_.jpg,"{'also_bought': ['0471730726', '0132834871', '0471391905', 'B00000JZKB', '0324314132', '00735250...",{'Health & Personal Care': 168429},[[Health & Personal Care]],,
1,615208479,By now we all know the benefits of exercise for the body. It's the only real fountain of youth! ...,Brain Fitness Exercises Software,http://ecx.images-amazon.com/images/I/41kbZB047NL._SY300_.jpg,,{'Health & Personal Care': 1346973},"[[Health & Personal Care, Personal Care]]",,
2,615269990,What's wrong with your patient?Do all the symptoms and signs point to one diagnosis?Or are there...,Occam's Razor,http://ecx.images-amazon.com/images/I/51fH-ABeBAL._SY300_.jpg,"{'also_bought': ['1935660152', '0071743979', '0071831428', '0323087876', '0443069522', '09670090...",{'Toys & Games': 110575},"[[Health & Personal Care, Personal Care, Shaving & Hair Removal, Manual Shaving]]",34.99,
3,615315860,,101 BlenderBottle Recipes Quick and Easy,http://ecx.images-amazon.com/images/I/21zOQu2QrFL.jpg,"{'also_bought': ['B006VT9RBM', 'B0010JLMO8', 'B001CXC69C', 'B0064QSHXG', 'B00CZAQIZ4', 'B0018G4Z...",{'Health & Personal Care': 254068},[[Health & Personal Care]],,
4,615406394,This is an example product description.,"Aphrodite Reborn - Women's Stories of Hope, Courage and Cancer",http://ecx.images-amazon.com/images/I/51rJLgsi0%2BL._SX300_.jpg,"{'also_bought': ['0966035232', '1421407205']}",{'Health & Personal Care': 377936},[[Health & Personal Care]],,


2.2 Discard columns we are not interested in

We discard **description** and **imUrl**.

In [50]:
df_health_meta = df_health_meta.drop(['description','imUrl'],axis=1)

2.3 Check if there exists any NaN value in the DataFrame

In [51]:
list_Nancolumns = checkDataframeNanValue(df_health_meta)

Check if column asin exists Nan value: False
Check if column title exists Nan value: True
Check if column related exists Nan value: True
Check if column salesRank exists Nan value: True
Check if column categories exists Nan value: False
Check if column price exists Nan value: True
Check if column brand exists Nan value: True


2.4 Replace NaN values with 0

According to the result above, several columns contain NaN values. For now we only replace the **NaN** values in the **price** column with **0**; the NaN values in the other columns will be processed later.

In [52]:
# for column in list_Nancolumns:
#     df_health_meta[column] = df_health_meta[column].fillna(0)
df_health_meta['price'] = df_health_meta['price'].fillna(0)
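
To make the effect of this step concrete, here is a small sketch on a toy DataFrame with made-up values: only `price` is filled, while the NaN values in other columns are left untouched.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; 'brand' stands in for the other columns.
toy = pd.DataFrame({
    'price': [34.99, np.nan],
    'brand': [np.nan, 'Acme'],
})

# Fill missing prices with 0 and leave every other column as it is.
toy['price'] = toy['price'].fillna(0)

print(toy['price'].tolist())           # [34.99, 0.0]
print(toy['brand'].isnull().tolist())  # [True, False]
```

Note that after this step a missing price and a genuine price of 0 look the same.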

2.5 Keep only Health & Personal Care products and discard all other products

In this part, we **drop all rows whose top-level category is not Health & Personal Care**, using the **categories** attribute.

In [53]:
pd.set_option('display.max_colwidth', 100)
df_health_meta['categories'][:10]

0                                                                             [[Health & Personal Care]]
1                                                              [[Health & Personal Care, Personal Care]]
2                      [[Health & Personal Care, Personal Care, Shaving & Hair Removal, Manual Shaving]]
3                                                                             [[Health & Personal Care]]
4                                                                             [[Health & Personal Care]]
5                                                                             [[Health & Personal Care]]
6                                   [[Health & Personal Care, Personal Care, Eye Care, Reading Glasses]]
7    [[Health & Personal Care, Medical Supplies & Equipment, Daily Living Aids, Visual Impairment Aid...
8    [[Health & Personal Care, Medical Supplies & Equipment, Daily Living Aids, Low Strength Aids, Gr...
9         [[Health & Personal Care, Stationery & Party 

In [54]:
df_health_meta['categories'].apply(lambda x : x[0][0]).value_counts()

Health & Personal Care       262317
CDs & Vinyl                     445
Sports & Outdoors               170
Automotive                       31
Cell Phones & Accessories        21
Home & Kitchen                   15
Baby Products                    11
Electronics                       9
Tools & Home Improvement          6
Office Products                   5
Books                             2
Name: categories, dtype: int64

In [65]:
rows_to_delete = []
for idx, categories in enumerate(df_health_meta['categories']):
    if categories[0][0] != 'Health & Personal Care':
        rows_to_delete.append(idx)

In [67]:
df_health_meta = df_health_meta.drop(df_health_meta.index[rows_to_delete])
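
The same filtering can also be written with a boolean mask instead of collecting row positions in a loop. This is a sketch on a toy DataFrame with made-up rows; `toy`, `top_level` and `filtered` are hypothetical names.

```python
import pandas as pd

# Toy frame mimicking the nested list layout of the 'categories' column.
toy = pd.DataFrame({
    'asin': ['A1', 'B2', 'C3'],
    'categories': [
        [['Health & Personal Care', 'Personal Care']],
        [['CDs & Vinyl']],
        [['Health & Personal Care']],
    ],
})

# Extract the top-level category of each product.
top_level = toy['categories'].apply(lambda c: c[0][0])

# Keep only the rows whose top-level category matches.
filtered = toy[top_level == 'Health & Personal Care']
print(filtered['asin'].tolist())  # ['A1', 'C3']
```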

In [68]:
df_health_meta['categories'].apply(lambda x : x[0][0]).value_counts()

Health & Personal Care    262317
Name: categories, dtype: int64

2.6 Set asin as index

In [70]:
df_health_meta = df_health_meta.set_index(['asin'])

2.7 Generate pickle file

In [71]:
df_health_meta.to_pickle('health_metadata.pkl')

### 3. Health&Care reviews
In this part, we turn our attention to the Health&Care review dataset.

3.1 Import Health&Care reviews

In [81]:
df_health_review = getDF('data/reviews_Health_and_Personal_Care_5.json.gz')

In [82]:
df_health_review.head()

,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,ALC5GH8CAMAI7,159985130X,AnnN,"[1, 1]",This is a great little gadget to have around. We've already used it to look for splinters and a...,5.0,Handy little gadget,1294185600,"01 5, 2011"
1,AHKSURW85PJUE,159985130X,"AZ buyer ""AZ buyer""","[1, 1]",I would recommend this for a travel magnifier for the occasional reading.I had read on another r...,4.0,Small & may need to encourage battery,1329523200,"02 18, 2012"
2,A38RMU1Y5TDP9,159985130X,"Bob Tobias ""Robert Tobias""","[75, 77]",What I liked was the quality of the lens and the built in light. Then lens had no discernable d...,4.0,Very good but not great,1275955200,"06 8, 2010"
3,A1XZUG7DFXXOS4,159985130X,Cat lover,"[56, 60]","Love the Great point light pocket magnifier! works great, especially if you forget your glasses...",4.0,great addition to your purse,1202428800,"02 8, 2008"
4,A1MS3M7M7AM13X,159985130X,Cricketoes,"[1, 1]","This is very nice. You pull out on the magnifier when you want the light to come on, then slide ...",5.0,Very nice and convenient.,1313452800,"08 16, 2011"


The review dataset has nine columns.

**reviewerID** and **reviewerName** are not relevant to our project, so we delete them.

The dataset also contains two time attributes: **unixReviewTime**, a purely numeric Unix timestamp, and **reviewTime**, a loosely formatted string such as "01 5, 2011". We keep **unixReviewTime** and drop **reviewTime**. We also drop **summary**.

3.2 Discard columns we are not interested in

In [83]:
df_health_review = df_health_review.drop(['reviewerID','reviewerName','reviewTime','summary'],axis=1)

3.3 Change date format to standard datetime

We convert the unix time format into date time format.

In [84]:
df_health_review['unixReviewTime'] = pd.to_datetime(df_health_review['unixReviewTime'],unit='s')
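
As a quick check of this conversion, the sketch below applies `pd.to_datetime` with `unit='s'` to the first two `unixReviewTime` values shown in the table above. The variable names are illustrative only.

```python
import pandas as pd

# The first two unixReviewTime values from the review table (seconds since 1970-01-01 UTC).
seconds = pd.Series([1294185600, 1329523200])

# unit='s' tells pandas that the integers are seconds, not nanoseconds.
dates = pd.to_datetime(seconds, unit='s')
print(dates.dt.strftime('%Y-%m-%d').tolist())  # ['2011-01-05', '2012-02-18']
```

The results agree with the **reviewTime** strings of the same rows ("01 5, 2011" and "02 18, 2012").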

3.4 Check if there exists any NaN value in the DataFrame

There are no NaN values in the DataFrame.

In [85]:
checkDataframeNanValue(df_health_review)

Check if column asin exists Nan value: False
Check if column helpful exists Nan value: False
Check if column reviewText exists Nan value: False
Check if column overall exists Nan value: False
Check if column unixReviewTime exists Nan value: False


[]

3.5 Set asin as index

In [86]:
df_health_review = df_health_review.set_index(['asin'])

3.6 Generate pickle file

In [87]:
df_health_review.to_pickle('health_review.pkl')

### 4. Merge Health&Care metadata and review

4.1 Load Health&Care metadata and review

In [106]:
df_health_metadata = pd.read_pickle('health_metadata.pkl')
df_health_review = pd.read_pickle('health_review.pkl')

In [107]:
df_health_metadata.head(3)

asin,title,related,salesRank,categories,price,brand
77614992,Principles of Mgmt + Oper-CSUF Custom C,"{'also_bought': ['0471730726', '0132834871', '0471391905', 'B00000JZKB', '0324314132', '00735250...",{'Health & Personal Care': 168429},[[Health & Personal Care]],0.0,
615208479,Brain Fitness Exercises Software,,{'Health & Personal Care': 1346973},"[[Health & Personal Care, Personal Care]]",0.0,
615269990,Occam's Razor,"{'also_bought': ['1935660152', '0071743979', '0071831428', '0323087876', '0443069522', '09670090...",{'Toys & Games': 110575},"[[Health & Personal Care, Personal Care, Shaving & Hair Removal, Manual Shaving]]",34.99,


In [108]:
df_health_review.head(3)

asin,helpful,reviewText,overall,unixReviewTime
159985130X,"[1, 1]",This is a great little gadget to have around. We've already used it to look for splinters and a...,5.0,2011-01-05
159985130X,"[1, 1]",I would recommend this for a travel magnifier for the occasional reading.I had read on another r...,4.0,2012-02-18
159985130X,"[75, 77]",What I liked was the quality of the lens and the built in light. Then lens had no discernable d...,4.0,2010-06-08


In [109]:
df_health_review.shape

(346355, 4)

In [110]:
df_health_metadata.shape

(262317, 6)

4.2 Merge Health&Care metadata and review

In [111]:
df_merge = df_health_metadata.merge(df_health_review, how='inner', left_index=True, right_index=True)
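
To illustrate what an inner merge on the index does here, the following sketch uses two toy DataFrames with made-up values. Products without reviews and reviews without metadata are dropped, and the metadata of a product is repeated once per review.

```python
import pandas as pd

# Toy metadata: one row per product, indexed by asin.
meta = pd.DataFrame(
    {'title': ['Magnifier', 'Razor']},
    index=pd.Index(['A1', 'B2'], name='asin'),
)

# Toy reviews: several rows may share the same asin.
reviews = pd.DataFrame(
    {'overall': [5.0, 4.0, 3.0]},
    index=pd.Index(['A1', 'A1', 'C3'], name='asin'),
)

# Inner join on the index keeps only asins present in both frames.
merged = meta.merge(reviews, how='inner', left_index=True, right_index=True)
print(merged.index.tolist())       # ['A1', 'A1']
print(merged['overall'].tolist())  # [5.0, 4.0]
```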

In [114]:
df_merge

asin,title,related,salesRank,categories,price,brand,helpful,reviewText,overall,unixReviewTime
159985130X,"Lightwedge Lighted Pocket Magnifier, Plum","{'also_bought': ['B002DGPUM2', 'B00524H8MC', '1935009656', 'B0011X0PDW', 'B00524H98U', 'B000M755...",,"[[Health & Personal Care, Medical Supplies & Equipment, Daily Living Aids, Visual Impairment Aid...",24.95,,"[1, 1]",This is a great little gadget to have around. We've already used it to look for splinters and a...,5.0,2011-01-05
159985130X,"Lightwedge Lighted Pocket Magnifier, Plum","{'also_bought': ['B002DGPUM2', 'B00524H8MC', '1935009656', 'B0011X0PDW', 'B00524H98U', 'B000M755...",,"[[Health & Personal Care, Medical Supplies & Equipment, Daily Living Aids, Visual Impairment Aid...",24.95,,"[1, 1]",I would recommend this for a travel magnifier for the occasional reading.I had read on another r...,4.0,2012-02-18
159985130X,"Lightwedge Lighted Pocket Magnifier, Plum","{'also_bought': ['B002DGPUM2', 'B00524H8MC', '1935009656', 'B0011X0PDW', 'B00524H98U', 'B000M755...",,"[[Health & Personal Care, Medical Supplies & Equipment, Daily Living Aids, Visual Impairment Aid...",24.95,,"[75, 77]",What I liked was the quality of the lens and the built in light. Then lens had no discernable d...,4.0,2010-06-08
159985130X,"Lightwedge Lighted Pocket Magnifier, Plum","{'also_bought': ['B002DGPUM2', 'B00524H8MC', '1935009656', 'B0011X0PDW', 'B00524H98U', 'B000M755...",,"[[Health & Personal Care, Medical Supplies & Equipment, Daily Living Aids, Visual Impairment Aid...",24.95,,"[56, 60]","Love the Great point light pocket magnifier! works great, especially if you forget your glasses...",4.0,2008-02-08
159985130X,"Lightwedge Lighted Pocket Magnifier, Plum","{'also_bought': ['B002DGPUM2', 'B00524H8MC', '1935009656', 'B0011X0PDW', 'B00524H98U', 'B000M755...",,"[[Health & Personal Care, Medical Supplies & Equipment, Daily Living Aids, Visual Impairment Aid...",24.95,,"[1, 1]","This is very nice. You pull out on the magnifier when you want the light to come on, then slide ...",5.0,2011-08-16
159985130X,"Lightwedge Lighted Pocket Magnifier, Plum","{'also_bought': ['B002DGPUM2', 'B00524H8MC', '1935009656', 'B0011X0PDW', 'B00524H98U', 'B000M755...",,"[[Health & Personal Care, Medical Supplies & Equipment, Daily Living Aids, Visual Impairment Aid...",24.95,,"[2, 3]",The light comes on when the item is pulled. This is much easier to use than the plastic bookmar...,5.0,2007-02-24
159985130X,"Lightwedge Lighted Pocket Magnifier, Plum","{'also_bought': ['B002DGPUM2', 'B00524H8MC', '1935009656', 'B0011X0PDW', 'B00524H98U', 'B000M755...",,"[[Health & Personal Care, Medical Supplies & Equipment, Daily Living Aids, Visual Impairment Aid...",24.95,,"[0, 0]",These are lightweight and efficient and have some very good points:- the batteries last 2-3 mont...,4.0,2014-07-06
159985130X,"Lightwedge Lighted Pocket Magnifier, Plum","{'also_bought': ['B002DGPUM2', 'B00524H8MC', '1935009656', 'B0011X0PDW', 'B00524H98U', 'B000M755...",,"[[Health & Personal Care, Medical Supplies & Equipment, Daily Living Aids, Visual Impairment Aid...",24.95,,"[2, 2]",We bought one for road trips and trying to interpret maps without having to strain our eyes. Rea...,5.0,2011-02-24
159985130X,"Lightwedge Lighted Pocket Magnifier, Plum","{'also_bought': ['B002DGPUM2', 'B00524H8MC', '1935009656', 'B0011X0PDW', 'B00524H98U', 'B000M755...",,"[[Health & Personal Care, Medical Supplies & Equipment, Daily Living Aids, Visual Impairment Aid...",24.95,,"[1, 1]",The screen of the magnifier is small. If you're looking to read text this is not going to work. ...,3.0,2013-01-24
159985130X,"Lightwedge Lighted Pocket Magnifier, Plum","{'also_bought': ['B002DGPUM2', 'B00524H8MC', '1935009656', 'B0011X0PDW', 'B00524H98U', 'B000M755...",,"[[Health & Personal Care, Medical Supplies & Equipment, Daily Living Aids, Visual Impairment Aid...",24.95,,"[1, 1]",This pocket magnifier is nice and compact. The slide out feature makes it so it will fit into a...,4.0,2012-06-02


4.3 Generate pickle file

In [113]:
df_merge.to_pickle('merge.pkl')