# 2. Data Cleaning

In this section, the 'product' and 'review' datasets are imported and cleaned for further processing.

In [1]:
import pandas as pd
import gzip
import json
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

import dask.dataframe as dd

## 2.1 Import

This section and the sections that follow walk through the data cleaning process: trimming unneeded fields, checking for data values that do not meet the requirements, and checking for duplicated points that may affect the overall accuracy. <br> Because the entire dataset is very large, the whole data cleaning process is done category by category.

There are 29 categories in total for each of the review and product datasets.

For the same reason, after each category dataset has been cleaned, it is saved back to CSV for later use before the next category in line is cleaned.
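The category-by-category workflow described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the notebook's actual code: `process_category`, the demo file name, and the list of kept columns are hypothetical, and a tiny synthetic file stands in for a real category.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def process_category(src_path, dst_path, keep_cols):
    """Load one gzipped JSON-lines file, keep selected columns, save as CSV."""
    with gzip.open(src_path, 'rt', encoding='utf-8') as g:
        records = [json.loads(line) for line in g]
    df = pd.DataFrame(records)
    # Keep only the requested columns that exist in this category.
    df = df[[c for c in keep_cols if c in df.columns]]
    df.to_csv(dst_path, index=False)
    return len(df)


# Synthetic two-record file standing in for one category (not real data).
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'Demo_Category.json.gz')
with gzip.open(src, 'wt', encoding='utf-8') as g:
    g.write(json.dumps({'overall': 5.0, 'asin': 'B000000001',
                        'reviewText': 'ok', 'vote': '2'}) + '\n')
    g.write(json.dumps({'overall': 1.0, 'asin': 'B000000002',
                        'reviewText': 'bad'}) + '\n')

dst = os.path.join(tmp_dir, 'Demo_Category_clean.csv')
n_rows = process_category(src, dst, ['overall', 'asin', 'reviewText'])
saved = pd.read_csv(dst)
```

Because each call returns only a row count, the DataFrame for one category can be garbage-collected before the next category is loaded.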

In [2]:
#defining methods to import data

def parse(path):
    """Unzip a .gz file and yield one parsed JSON record per line."""
    # the with-block closes the file handle once iteration finishes
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield json.loads(l)


def getDF(path):
    """Use parse() to read the file and convert it to a pandas DataFrame."""
    df = {}
    for i, d in enumerate(parse(path)):
        df[i] = d
    return pd.DataFrame.from_dict(df, orient='index')


def pandasToDask(df):
    """Transform a pandas DataFrame to a dask DataFrame with 10 partitions."""
    return dd.from_pandas(df, npartitions=10)


def toDask(path):
    """Let dask unzip the .gz file and read the JSON into a dask DataFrame."""
    return dd.read_json(path, compression='gzip')

**Dataset 'review' import starts here.**

In [207]:
#datasets imported.

re_fashion = getDF('data/review/AMAZON_FASHION.json.gz')

In [233]:
re_beauty = getDF('data/review/All_Beauty.json.gz')

In [191]:
re_appliances = getDF('data/review/Appliances.json.gz')

In [251]:
re_ACS = getDF('data/review/Arts_Crafts_and_Sewing.json.gz')

In [266]:
re_automotive = getDF('data/review/Automotive.json.gz')

In [None]:
#skipping for now as this is 11GB
#re_books = getDF('data/review/Books.json.gz')

In [283]:
re_CV = getDF('data/review/CDs_and_Vinyl.json.gz')

In [296]:
re_CPA = getDF('data/review/Cell_Phones_and_Accessories.json.gz')

In [None]:
#testing with dask
re_CSJ = toDask('data/review/Clothing_Shoes_and_Jewelry.json.gz')

**At this point, the Jupyter Notebook and laptop hardware could not process or import any further data, as the subsequent categories contained up to 30 GB of data. After discussion with the expert, it was decided to pick 3 to 4 categories to move forward with the project.**

Categories chosen: Digital_Music, Kindle_Store, Magazine_Subscriptions, Movies_and_TV
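One way to lower peak memory during import is to read a file in chunks and discard unneeded columns as each chunk arrives. The sketch below is an alternative to `getDF`, not what this notebook ran: `readInChunks` and the synthetic demo file are hypothetical, and `dtype=False` is passed on the assumption that type inference should be switched off so that all-digit ASIN strings are not turned into numbers.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def readInChunks(path, keep_cols, chunksize=100000):
    """Read a gzipped JSON-lines file chunk by chunk, keeping selected columns."""
    pieces = []
    reader = pd.read_json(path, lines=True, chunksize=chunksize,
                          compression='gzip', dtype=False)
    for chunk in reader:
        # Drop unneeded columns immediately so they never accumulate.
        present = [c for c in keep_cols if c in chunk.columns]
        pieces.append(chunk[present])
    return pd.concat(pieces, ignore_index=True)


# Synthetic five-record file (not real data) to exercise the function.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_reviews.json.gz')
with gzip.open(demo_path, 'wt', encoding='utf-8') as g:
    for i in range(5):
        g.write(json.dumps({'overall': 5.0, 'asin': 'B00000000%d' % i,
                            'reviewText': 'text %d' % i, 'vote': '1'}) + '\n')

demo_df = readInChunks(demo_path, ['overall', 'asin', 'reviewText'], chunksize=2)
```

The trimmed result must still fit in memory, so this helps most when the dropped columns account for a large share of the file.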

In [3]:
re_MTV = getDF('data/review/Movies_and_TV.json.gz')

In [22]:
re_KS = getDF('data/review/Kindle_Store.json.gz')

In [34]:
re_DM = getDF('data/review/Digital_Music.json.gz')

In [37]:
re_MS = getDF('data/review/Magazine_Subscriptions.json.gz')

**Dataset 'product' import starts here.**

In [86]:
#datasets imported.

pr_MTV = getDF('data/metadata/meta_Movies_and_TV.json.gz')

In [58]:
pr_KS = getDF('data/metadata/meta_Kindle_Store.json.gz')

In [19]:
pr_DM = getDF('data/metadata/meta_Digital_Music.json.gz')

In [63]:
pr_MS = getDF('data/metadata/meta_Magazine_Subscriptions.json.gz')

In [3]:
pr_books = getDF('data/metadata/meta_Books.json.gz')

In [4]:
pr_books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2935525 entries, 0 to 2935524
Data columns (total 17 columns):
description     object
title           object
also_buy        object
brand           object
rank            object
also_view       object
main_cat        object
price           object
asin            object
category        object
image           object
feature         object
date            object
similar_item    object
tech1           object
details         object
fit             object
dtypes: object(17)
memory usage: 403.1+ MB


## 2.2 Data Trimming

In this section, data features deemed unneeded are trimmed.  <br> As mentioned above, this process is done category by category to keep the amount of data imported at one time manageable.

### 2.2.1 Dataset 'review' Trimming

Trimming candidates for dataset 'review' have already been determined in the previous section **1.2.1 Dataset 'review' Analysis**. <br>


* Needs To Be Maintained:    **overall , asin , reviewText**
* May Be Maintained If Possible:   **verified , reviewerID , summary**
* Candidates For Data Trimming:  **reviewTime , reviewerName , unixReviewTime , vote , style , image**

The columns under 'May Be Maintained If Possible' are kept for now and will be checked for viability once more at the end of the data cleaning process.
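Since the same six columns are dropped for every review category, the repeated `drop` calls could be wrapped in a small helper. The following is a minimal sketch: `trimReview` and `REVIEW_DROP_COLS` are hypothetical names, and `errors='ignore'` is an added safeguard (not used in the cells below) for categories in which one of the columns happens to be absent.

```python
import pandas as pd

# Columns listed above as candidates for data trimming.
REVIEW_DROP_COLS = ['reviewTime', 'reviewerName', 'unixReviewTime',
                    'vote', 'style', 'image']


def trimReview(df, drop_cols=REVIEW_DROP_COLS):
    """Return a copy of df without the trimming candidates."""
    # errors='ignore' skips any listed column that df does not contain.
    return df.drop(columns=drop_cols, errors='ignore')


# Synthetic frame (not real data) that lacks some of the listed columns.
demo_reviews = pd.DataFrame({
    'overall': [5.0, 2.0],
    'asin': ['B000000001', 'B000000002'],
    'reviewText': ['fine', None],
    'vote': ['3', None],
    'image': [None, None],
})
trimmed_reviews = trimReview(demo_reviews)
```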

#### 2.2.1.1 Category 'AMAZON_FASHION'

In [208]:
#check dataset info
re_fashion.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 883636 entries, 0 to 883635
Data columns (total 12 columns):
overall           883636 non-null float64
verified          883636 non-null bool
reviewTime        883636 non-null object
reviewerID        883636 non-null object
asin              883636 non-null object
reviewerName      883544 non-null object
reviewText        882403 non-null object
summary           883103 non-null object
unixReviewTime    883636 non-null int64
vote              79900 non-null object
style             304569 non-null object
image             28807 non-null object
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 81.7+ MB


In [209]:
#drop columns

re_fashion = re_fashion.drop(columns=['reviewTime','reviewerName','unixReviewTime','vote','style','image'])

#### 2.2.1.2 Category 'All_Beauty'

In [234]:
#check dataset info
re_beauty.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 371345 entries, 0 to 371344
Data columns (total 12 columns):
overall           371345 non-null float64
verified          371345 non-null bool
reviewTime        371345 non-null object
reviewerID        371345 non-null object
asin              371345 non-null object
reviewerName      371307 non-null object
reviewText        370946 non-null object
summary           371139 non-null object
unixReviewTime    371345 non-null int64
vote              51899 non-null object
style             125958 non-null object
image             8391 non-null object
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 34.4+ MB


In [235]:
#drop columns

re_beauty = re_beauty.drop(columns=['reviewTime','reviewerName','unixReviewTime','vote','style','image'])

#### 2.2.1.3 Category 'Appliances'

In [192]:
#check dataset info
re_appliances.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 602777 entries, 0 to 602776
Data columns (total 12 columns):
overall           602777 non-null float64
vote              65262 non-null object
verified          602777 non-null bool
reviewTime        602777 non-null object
reviewerID        602777 non-null object
asin              602777 non-null object
style             137973 non-null object
reviewerName      602762 non-null object
reviewText        602453 non-null object
summary           602649 non-null object
unixReviewTime    602777 non-null int64
image             9258 non-null object
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 55.8+ MB


In [193]:
#drop columns

re_appliances = re_appliances.drop(columns=['reviewTime','reviewerName','unixReviewTime','vote','style','image'])

#### 2.2.1.4 Category 'Arts_Crafts_and_Sewing'

In [256]:
#check dataset info
re_ACS.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2875917 entries, 0 to 2875916
Data columns (total 12 columns):
overall           2875917 non-null float64
vote              372185 non-null object
verified          2875917 non-null bool
reviewTime        2875917 non-null object
reviewerID        2875917 non-null object
asin              2875917 non-null object
style             1125693 non-null object
reviewerName      2875717 non-null object
reviewText        2873387 non-null object
summary           2874960 non-null object
unixReviewTime    2875917 non-null int64
image             86106 non-null object
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 266.0+ MB


In [257]:
#drop columns

re_ACS = re_ACS.drop(columns=['reviewTime','reviewerName','unixReviewTime','vote','style','image'])

#### 2.2.1.5 Category 'Automotive'

In [271]:
#check dataset info
re_automotive.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7990166 entries, 0 to 7990165
Data columns (total 12 columns):
image             220180 non-null object
overall           7990166 non-null float64
vote              793310 non-null object
verified          7990166 non-null bool
reviewTime        7990166 non-null object
reviewerID        7990166 non-null object
asin              7990166 non-null object
style             2348412 non-null object
reviewerName      7989739 non-null object
reviewText        7982068 non-null object
summary           7987439 non-null object
unixReviewTime    7990166 non-null int64
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 739.1+ MB


In [272]:
#drop columns

re_automotive = re_automotive.drop(columns=['reviewTime','reviewerName','unixReviewTime','vote','style','image'])

#### 2.2.1.6 Category 'Books'

#### 2.2.1.7 Category 'CDs_and_Vinyl'

In [284]:
#check dataset info
re_CV.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4543369 entries, 0 to 4543368
Data columns (total 12 columns):
reviewerID        4543369 non-null object
asin              4543369 non-null object
reviewerName      4543185 non-null object
verified          4543369 non-null bool
reviewText        4541934 non-null object
overall           4543369 non-null float64
reviewTime        4543369 non-null object
summary           4542549 non-null object
unixReviewTime    4543369 non-null int64
vote              1312570 non-null object
style             4390937 non-null object
image             17698 non-null object
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 420.3+ MB


In [285]:
#drop columns

re_CV = re_CV.drop(columns=['reviewTime','reviewerName','unixReviewTime','vote','style','image'])

#### 2.2.1.8 Category 'Cell_Phones_and_Accessories'

In [297]:
#check dataset info
re_CPA.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10063255 entries, 0 to 10063254
Data columns (total 12 columns):
overall           10063255 non-null float64
verified          10063255 non-null bool
reviewTime        10063255 non-null object
reviewerID        10063255 non-null object
asin              10063255 non-null object
reviewerName      10062528 non-null object
reviewText        10053882 non-null object
summary           10057736 non-null object
unixReviewTime    10063255 non-null int64
vote              689745 non-null object
image             182305 non-null object
style             5017134 non-null object
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 930.9+ MB


In [298]:
#drop columns

re_CPA = re_CPA.drop(columns=['reviewTime','reviewerName','unixReviewTime','vote','style','image'])

**As noted in section 2.1, the Jupyter Notebook and laptop hardware could not process or import any further data at this point, as the subsequent categories contained up to 30 GB of data. After discussion with the expert, it was decided to pick 3 to 4 categories to move forward with the project.**

Categories chosen: Digital_Music, Kindle_Store, Magazine_Subscriptions, Movies_and_TV

#### 2.2.1.9 Category 'Movies_and_TV'

In [5]:
#check dataset info
re_MTV.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8765568 entries, 0 to 8765567
Data columns (total 12 columns):
overall           8765568 non-null float64
verified          8765568 non-null bool
reviewTime        8765568 non-null object
reviewerID        8765568 non-null object
asin              8765568 non-null object
style             8316608 non-null object
reviewerName      8765309 non-null object
reviewText        8757545 non-null object
summary           8763379 non-null object
unixReviewTime    8765568 non-null int64
vote              1425989 non-null object
image             18346 non-null object
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 810.9+ MB


In [6]:
#drop columns

re_MTV = re_MTV.drop(columns=['reviewTime','reviewerName','unixReviewTime','vote','style','image'])

#### 2.2.1.10 Category 'Kindle_Store'

In [23]:
#check dataset info
re_KS.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5722988 entries, 0 to 5722987
Data columns (total 12 columns):
overall           5722988 non-null float64
verified          5722988 non-null bool
reviewTime        5722988 non-null object
reviewerID        5722988 non-null object
asin              5722988 non-null object
style             5309554 non-null object
reviewerName      5722682 non-null object
reviewText        5721364 non-null object
summary           5719911 non-null object
unixReviewTime    5722988 non-null int64
vote              737567 non-null object
image             6189 non-null object
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 529.4+ MB


In [24]:
#drop columns

re_KS = re_KS.drop(columns=['reviewTime','reviewerName','unixReviewTime','vote','style','image'])

#### 2.2.1.11 Category 'Digital_Music'

In [35]:
#check dataset info
re_DM.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1584082 entries, 0 to 1584081
Data columns (total 12 columns):
overall           1584082 non-null float64
verified          1584082 non-null bool
reviewTime        1584082 non-null object
reviewerID        1584082 non-null object
asin              1584082 non-null object
style             1310814 non-null object
reviewerName      1584001 non-null object
reviewText        1582629 non-null object
summary           1583547 non-null object
unixReviewTime    1584082 non-null int64
vote              124722 non-null object
image             6591 non-null object
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 146.5+ MB


In [36]:
#drop columns

re_DM = re_DM.drop(columns=['reviewTime','reviewerName','unixReviewTime','vote','style','image'])

#### 2.2.1.12 Category 'Magazine_Subscriptions'

In [38]:
#check dataset info
re_MS.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 89689 entries, 0 to 89688
Data columns (total 12 columns):
overall           89689 non-null float64
vote              24103 non-null object
verified          89689 non-null bool
reviewTime        89689 non-null object
reviewerID        89689 non-null object
asin              89689 non-null object
reviewerName      89687 non-null object
reviewText        89656 non-null object
summary           89670 non-null object
unixReviewTime    89689 non-null int64
style             51398 non-null object
image             135 non-null object
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 8.3+ MB


In [39]:
#drop columns

re_MS = re_MS.drop(columns=['reviewTime','reviewerName','unixReviewTime','vote','style','image'])

### 2.2.2 Dataset 'product' Trimming

Trimming candidates for dataset 'product' have already been determined in the previous section **1.2.2 Dataset 'product' Analysis**. <br>


* Needs To Be Maintained:    **title , asin , brand**
* May Be Maintained If Possible:   **also_view , also_buy , main_cat**
* Candidates For Data Trimming:  **image , feature , date , description , price , fit , details , similar_item , tech1 , category , rank**

The columns under 'May Be Maintained If Possible' are kept for now and will be checked for viability once more at the end of the data cleaning process.
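The product files do not all share the same columns, so selecting by a keep-list can be more robust than dropping by name. The sketch below is an alternative, not the approach used in the cells that follow (whose per-category drop lists differ from the lists above in places); `trimProduct` and `PRODUCT_KEEP_COLS` are hypothetical names.

```python
import pandas as pd

# 'Needs To Be Maintained' plus 'May Be Maintained If Possible' from above.
PRODUCT_KEEP_COLS = ['title', 'asin', 'brand',
                     'also_view', 'also_buy', 'main_cat']


def trimProduct(df, keep_cols=PRODUCT_KEEP_COLS):
    """Return (trimmed copy, list of wanted columns that df lacks)."""
    present = [c for c in keep_cols if c in df.columns]
    missing = [c for c in keep_cols if c not in df.columns]
    return df[present].copy(), missing


# Synthetic frame (not real data) with an unexpected extra column.
demo_products = pd.DataFrame({
    'title': ['Some Title'],
    'asin': ['B000000001'],
    'price': ['$9.99'],
    'tech2': [''],
    'main_cat': ['Movies & TV'],
})
trimmed_products, missing_cols = trimProduct(demo_products)
```

Returning the missing columns makes it visible when a category lacks a field that later steps expect.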

#### 2.2.2.1 Category 'Movies_and_TV'

In [88]:
#check dataset info
pr_MTV.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 203970 entries, 0 to 203969
Data columns (total 15 columns):
category       203745 non-null object
title          203900 non-null object
rank           201909 non-null object
main_cat       203922 non-null object
asin           203970 non-null object
image          37947 non-null object
description    174965 non-null object
brand          138384 non-null object
also_buy       95170 non-null object
also_view      93860 non-null object
price          107712 non-null object
details        195495 non-null object
feature        172 non-null object
date           38 non-null object
tech1          6 non-null object
dtypes: object(15)
memory usage: 24.9+ MB


In [89]:
#drop columns

pr_MTV = pr_MTV.drop(columns=['image','feature','date','description','price','details','category','tech1','rank'])

#### 2.2.2.2 Category 'Kindle_Store'

In [90]:
#check dataset info
pr_KS.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 493552 entries, 0 to 493551
Data columns (total 18 columns):
category        493552 non-null object
tech1           493552 non-null object
description     493552 non-null object
fit             493552 non-null object
title           493552 non-null object
also_buy        493552 non-null object
image           493552 non-null object
tech2           493552 non-null object
brand           493552 non-null object
feature         493552 non-null object
rank            493552 non-null object
also_view       493552 non-null object
main_cat        493552 non-null object
similar_item    493552 non-null object
date            493552 non-null object
price           493552 non-null object
asin            493552 non-null object
details         493550 non-null object
dtypes: object(18)
memory usage: 71.5+ MB


In [91]:
#drop columns  

pr_KS = pr_KS.drop(columns=['image','feature','date','description','price','fit','details','similar_item','tech1','tech2','category','also_buy','also_view','brand'])

#### 2.2.2.3 Category 'Digital_Music'

In [21]:
#check dataset info
pr_DM.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 465392 entries, 0 to 465391
Data columns (total 14 columns):
title          395452 non-null object
also_buy       108477 non-null object
brand          60374 non-null object
rank           444407 non-null object
also_view      48208 non-null object
price          45713 non-null object
asin           465392 non-null object
description    36870 non-null object
image          23407 non-null object
details        463852 non-null object
date           5 non-null object
feature        88 non-null object
category       7 non-null object
main_cat       1 non-null object
dtypes: object(14)
memory usage: 53.3+ MB


In [22]:
#drop columns  

pr_DM = pr_DM.drop(columns=['category','image','feature','date','description','price','also_buy','details','also_view','brand','rank'])

#### 2.2.2.4 Category 'Magazine_Subscriptions'

In [20]:
#check dataset info
pr_MS.info(verbose=True, null_counts=True)

NameError: name 'pr_MS' is not defined

In [95]:
#drop columns  

pr_MS = pr_MS.drop(columns=['category','image','also_buy','brand','description','rank','details','also_view'])

#### 2.2.2.5 Category 'Books'

In [5]:
pr_books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2935525 entries, 0 to 2935524
Data columns (total 17 columns):
description     object
title           object
also_buy        object
brand           object
rank            object
also_view       object
main_cat        object
price           object
asin            object
category        object
image           object
feature         object
date            object
similar_item    object
tech1           object
details         object
fit             object
dtypes: object(17)
memory usage: 403.1+ MB


In [6]:
#drop columns  

pr_books = pr_books.drop(columns=['rank','price','image','date','tech1','fit'])

## 2.3 Null Check

Null objects in a dataset can lower the accuracy of the model. <br> Aside from the NaN objects that pandas recognizes, there may be further data points that are null but are represented in a different way. <br> This section dives further into checking the data values and addressing null values.
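As an illustration of nulls that are "represented in a different way", the sketch below counts true NaN values separately from placeholder strings. It is a minimal sketch only: `countDisguisedNulls` is a hypothetical helper, and the placeholder spellings in `PLACEHOLDERS` are assumptions rather than values observed in this dataset.

```python
import pandas as pd

# Assumed placeholder spellings, compared after stripping and lower-casing.
PLACEHOLDERS = {'', 'nan', 'none', 'null', 'n/a'}


def countDisguisedNulls(srs, placeholders=PLACEHOLDERS):
    """Return (count of NaN/None values, count of placeholder strings)."""
    true_null = int(srs.isnull().sum())
    # Normalise the remaining values so '  ' and 'N/A' are caught as well.
    text = srs.dropna().astype(str).str.strip().str.lower()
    disguised = int(text.isin(placeholders).sum())
    return true_null, disguised


# Synthetic values (not real data): one None and three placeholders.
demo_text = pd.Series(['good', None, '  ', 'N/A', 'nan', 'fine'])
null_counts = countDisguisedNulls(demo_text)
```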

### 2.3.1 Dataset 'review' Null Check

#### 2.3.1.1 Category 'AMAZON_FASHION'

In [210]:
re_fashion.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 883636 entries, 0 to 883635
Data columns (total 6 columns):
overall       883636 non-null float64
verified      883636 non-null bool
reviewerID    883636 non-null object
asin          883636 non-null object
reviewText    882403 non-null object
summary       883103 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 41.3+ MB


Columns overall, verified, reviewerID, and asin contain no null values. <br> The expected range of values is known for overall (numeric, ranging from 1 to 5) and verified (true or false), so their unique values are checked against it.

In [211]:
print('Unique overall values: ',re_fashion.overall.unique())
print('Unique verified values: ',re_fashion.verified.unique())

Unique overall values:  [5. 2. 4. 3. 1.]
Unique verified values:  [ True False]


However, reviewerID and asin can be any user ID or product ID, so there is no point in checking their unique values. Instead, we check whether the values follow the constraints of the reviewerID and asin features.

For reviewerID, values can be of different lengths, so the lengths are checked for outliers. <br>
An asin strictly consists of 10 alphanumeric characters.
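Checking lengths alone would not catch a 10-character value that contains other symbols. The sketch below checks the full constraint with a regular expression; `findInvalidAsin` is a hypothetical helper and is not part of the notebook.

```python
import pandas as pd


def findInvalidAsin(srs):
    """Return the values that are not exactly 10 alphanumeric characters."""
    # str.match anchors at the start; \Z anchors at the very end.
    is_valid = srs.astype(str).str.match(r'[A-Za-z0-9]{10}\Z')
    return srs[~is_valid]


# Synthetic values (not real data): the last two break the constraint.
demo_asin = pd.Series(['B00004U9V2', '0001234567', 'B00-123456', 'SHORT'])
invalid_asin = findInvalidAsin(demo_asin)
```

An empty result means every value satisfies the constraint.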

In [10]:
def findUniqueLength(srs):
    """
    Finds unique lengths of data values

    Takes in Pandas Series and outputs the set of unique lengths in the series
    """
    # note: every value must support len(), so the series must not contain NaN
    return {len(x) for x in srs}

In [213]:
print('ReviewerID value lengths: ',findUniqueLength(re_fashion.reviewerID))
print('ASIN value lengths: ',findUniqueLength(re_fashion.asin))

ReviewerID value lengths:  {10, 11, 12, 13, 14, 19, 20}
ASIN value lengths:  {10}


All required fields have been accounted for. <br> 

Columns reviewText and summary contain some null values. 

In [214]:
print('Examples of valid reviewText data: \n\n',re_fashion.reviewText.values[:3])
print('\n')
print('Examples of null reviewText data: \n\n', re_fashion.loc[re_fashion.reviewText.isnull()].reviewText.head(3))
print('\n')
print('Any empty string values: ',len(re_fashion.loc[re_fashion.reviewText == ''].reviewText))


Examples of valid reviewText data: 

 ['Exactly what I needed.'
 "I agree with the other review, the opening is too small.  I almost bent the hook on some very expensive earrings trying to get these up higher than just the end so they're not seen.  Would not buy again but for the price, not sending back."
 "Love these... I am going to order another pack to keep in work; someone (including myself) is always losing the back to an earring.  I don't understand why all fish hook earrings don't have them.  Just wish that they were a tiny bit longer.  :)"]


Examples of null reviewText data: 

 303     NaN
2651    NaN
5094    NaN
Name: reviewText, dtype: object


Any empty string values:  0


In [215]:
print('Examples of valid summary data: \n\n',re_fashion.summary.values[:3])
print('\n')
print('Examples of null summary data: \n\n', re_fashion.loc[re_fashion.summary.isnull()].summary.head(3))
print('\n')
print('Any empty string values: ', len(re_fashion.loc[re_fashion.summary == ''].summary))

print('\n')
print('Column reviewText: ',len(re_fashion.loc[re_fashion.reviewText.isnull()].reviewText),' out of ',len(re_fashion.reviewText))
print('Column summary: ',len(re_fashion.loc[re_fashion.summary.isnull()].summary),' out of ',len(re_fashion.summary))


Examples of valid summary data: 

 ['perfect replacements!!'
 'I agree with the other review, the opening is ...' "My New 'Friends' !!"]


Examples of null summary data: 

 1030    NaN
8917    NaN
9422    NaN
Name: summary, dtype: object


Any empty string values:  0


Column reviewText:  1233  out of  883636
Column summary:  533  out of  883636


The model may not handle NaN values, so we replace them with the empty string, ''. 

In [216]:
#replace NaN with empty string 

re_fashion[['reviewText','summary']] = re_fashion[['reviewText','summary']].fillna('')

In [217]:
#check if empty string count is same as previous NaN count

print('reviewText empty string count: ', len(re_fashion.loc[re_fashion.reviewText == ''].reviewText))
print('summary empty string count: ', len(re_fashion.loc[re_fashion.summary == ''].summary))

reviewText empty string count:  1233
summary empty string count:  533


All null values have been dealt with.
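The fill-and-verify steps above are repeated for every category, so they could be combined into one helper. This is a minimal sketch: `fillTextNulls` is a hypothetical name, and it works on a copy, which may be too memory-hungry for the largest categories.

```python
import pandas as pd


def fillTextNulls(df, cols=('reviewText', 'summary')):
    """Replace NaN with '' in cols and verify the counts add up."""
    cols = list(cols)
    null_before = {c: int(df[c].isnull().sum()) for c in cols}
    empty_before = {c: int((df[c] == '').sum()) for c in cols}
    out = df.copy()
    out[cols] = out[cols].fillna('')
    for c in cols:
        # Empty strings now = previous empty strings + previous NaN values.
        assert int((out[c] == '').sum()) == empty_before[c] + null_before[c]
        assert int(out[c].isnull().sum()) == 0
    return out, null_before


# Synthetic frame (not real data) with nulls in both text columns.
demo_nulls = pd.DataFrame({
    'reviewText': ['great', None, 'ok'],
    'summary': [None, None, 'Five Stars'],
})
filled, replaced = fillTextNulls(demo_nulls)
```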

#### 2.3.1.2 Category 'All_Beauty'

To reduce redundant explanation, only code and output are shown for this and the remaining categories. <br>Refer to section **2.3.1.1 Category 'AMAZON_FASHION'** for process details.

In [236]:
#check dataset info
re_beauty.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 371345 entries, 0 to 371344
Data columns (total 6 columns):
overall       371345 non-null float64
verified      371345 non-null bool
reviewerID    371345 non-null object
asin          371345 non-null object
reviewText    370946 non-null object
summary       371139 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 17.4+ MB


In [9]:
#this code will be reused for all the categories, so it is better defined as a method

def analyzeNull(df):
    """
    Analyzes review dataset for null

    Takes in dataframe and prints various null- and valid-value related information
    """

    print('Unique overall values: ',df.overall.unique())
    print('Unique verified values: ',df.verified.unique())


    print('ReviewerID value lengths: ',findUniqueLength(df.reviewerID))
    print('ASIN value lengths: ',findUniqueLength(df.asin))


    print('Examples of valid reviewText data: \n\n',df.reviewText.values[:3])
    print('\n')
    print('Examples of null reviewText data: \n\n', df.loc[df.reviewText.isnull()].reviewText.head(3))
    print('Any empty string reviewText values: ',len(df.loc[df.reviewText == ''].reviewText))

    print('\n')

    print('Examples of valid summary data: \n\n',df.summary.values[:3])
    print('\n')
    print('Examples of null summary data: \n\n', df.loc[df.summary.isnull()].summary.head(3))
    print('Any empty string summary values: ', len(df.loc[df.summary == ''].summary))

    print('\n')
    print('Column reviewText: ',len(df.loc[df.reviewText.isnull()].reviewText),' null out of ',len(df.reviewText))
    print('Column summary: ',len(df.loc[df.summary.isnull()].summary),' null out of ',len(df.summary))


In [238]:
analyzeNull(re_beauty)

Unique overall values:  [1. 4. 5. 2. 3.]
Unique verified values:  [ True False]
ReviewerID value lengths:  {10, 11, 12, 13, 14, 19, 20}
ASIN value lengths:  {10}
Examples of valid reviewText data: 

 ['great'
 "My  husband wanted to reading about the Negro Baseball and this a great addition to his library\n Our library doesn't haveinformation so this book is his start. Tthank you"
 'This book was very informative, covering all aspects of game.']


Examples of null reviewText data: 

 547     NaN
3594    NaN
4105    NaN
Name: reviewText, dtype: object
Any empty string reviewText values:  0


Examples of valid summary data: 

 ['One Star'
 "... to reading about the Negro Baseball and this a great addition to his library Our library doesn't haveinformation so ..."
 'Worth the Read']


Examples of null summary data: 

 6979    NaN
7709    NaN
8317    NaN
Name: summary, dtype: object
Any empty string summary values:  0


Column reviewText:  399  null out of  371345
Column summary:  206  nul

In [228]:
#replace NaN with empty string 

re_beauty[['reviewText','summary']] = re_beauty[['reviewText','summary']].fillna('')

In [229]:
#check if empty string count is same as previous NaN count

print('reviewText empty string count: ', len(re_beauty.loc[re_beauty.reviewText == ''].reviewText))
print('summary empty string count: ', len(re_beauty.loc[re_beauty.summary == ''].summary))

reviewText empty string count:  399
summary empty string count:  206


#### 2.3.1.3 Category 'Appliances'

To reduce redundant explanation, only code and output are shown for this and the remaining categories. <br>Refer to section **2.3.1.1 Category 'AMAZON_FASHION'** for process details.

In [239]:
re_appliances.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 602777 entries, 0 to 602776
Data columns (total 6 columns):
overall       602777 non-null float64
verified      602777 non-null bool
reviewerID    602777 non-null object
asin          602777 non-null object
reviewText    602453 non-null object
summary       602649 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 28.2+ MB


In [240]:
analyzeNull(re_appliances)

Unique overall values:  [5. 4. 3. 1. 2.]
Unique verified values:  [False  True]
ReviewerID value lengths:  {10, 11, 12, 13, 14, 19, 20}
ASIN value lengths:  {10}
Examples of valid reviewText data: 

 ["Not one thing in this book seemed an obvious original thought. However, the clarity with which this author explains how innovation happens is remarkable.\n\nAlan Gregerman discusses the meaning of human interactions and the kinds of situations that tend to inspire original and/or clear thinking that leads to innovation. These things include how people communicate in certain situations such as when they are outside of their normal patterns.\n\nGregerman identifies the ingredients that make innovation more likely. This includes people being compelled to interact when they normally wouldn't, leading to serendipity. Sometimes the phenomenon will occur through collaboration, and sometimes by chance such as when an individual is away from home on travel.\n\nI recommend this book for its common

In [241]:
#replace NaN with empty string 

re_appliances[['reviewText','summary']] = re_appliances[['reviewText','summary']].fillna('')

In [242]:
#check if empty string count is same as previous NaN count

print('reviewText empty string count: ', len(re_appliances.loc[re_appliances.reviewText == ''].reviewText))
print('summary empty string count: ', len(re_appliances.loc[re_appliances.summary == ''].summary))

reviewText empty string count:  324
summary empty string count:  128


#### 2.3.1.4 Category 'Arts_Crafts_and_Sewing'

To reduce redundant explanation, only code and output are shown for this and the remaining categories. <br>Refer to section **2.3.1.1 Category 'AMAZON_FASHION'** for process details.

In [258]:
re_ACS.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2875917 entries, 0 to 2875916
Data columns (total 6 columns):
overall       2875917 non-null float64
verified      2875917 non-null bool
reviewerID    2875917 non-null object
asin          2875917 non-null object
reviewText    2873387 non-null object
summary       2874960 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 134.4+ MB


In [259]:
analyzeNull(re_ACS)

Unique overall values:  [5. 2. 4. 3. 1.]
Unique verified values:  [ True False]
ReviewerID value lengths:  {10, 11, 12, 13, 14, 18, 19, 20}
ASIN value lengths:  {10}
Examples of valid reviewText data: 

 ["I've read this book already and I've got plans for using it in future projects.  I'm DELIGHTED with the patterns in it and the advice and suggestions are just as good as you would expect from Melissa Leapman.  I'm so glad that I bought this.  As a lifelong and addicted knitter, this has been a valuable addition to my already good sized book collection.  Thanks Melissa for this very special knitting treat."
 'Nicely written directions.' 'love it']


Examples of null reviewText data: 

 5468    NaN
7599    NaN
8032    NaN
Name: reviewText, dtype: object
Any empty string reviewText values:  0


Examples of valid summary data: 

 ['A  WONDERFUL BOOK' 'Nice' 'Five Stars']


Examples of null summary data: 

 5585     NaN
10904    NaN
10952    NaN
Name: summary, dtype: object
Any empty stri

In [260]:
#replace NaN with empty string 

re_ACS[['reviewText','summary']] = re_ACS[['reviewText','summary']].fillna('')

In [261]:
#check if empty string count is same as previous NaN count

print('reviewText empty string count: ', len(re_ACS.loc[re_ACS.reviewText == ''].reviewText))
print('summary empty string count: ', len(re_ACS.loc[re_ACS.summary == ''].summary))

reviewText empty string count:  2530
summary empty string count:  957
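
The fill-and-verify step above is repeated unchanged for every category, so it could be wrapped in one helper. The following is only a sketch: the name `fill_text_nulls` and the toy dataframe are hypothetical and not part of the original notebook. It assumes, as the outputs above show, that the only text columns needing treatment are `reviewText` and `summary`.

```python
import numpy as np
import pandas as pd

def fill_text_nulls(df, cols=('reviewText', 'summary')):
    """Replace NaN with '' in the given text columns and verify the counts."""
    cols = list(cols)
    # NaN values plus any empty strings that were already present
    expected = df[cols].isna().sum() + (df[cols] == '').sum()
    df[cols] = df[cols].fillna('')
    actual = (df[cols] == '').sum()
    # after filling, every former NaN must show up as an empty string
    assert (expected == actual).all(), 'empty-string count does not match NaN count'
    return actual

# toy data standing in for a review dataframe (hypothetical values)
toy = pd.DataFrame({
    'reviewText': ['love it', np.nan, 'Nicely written directions.'],
    'summary': [np.nan, np.nan, 'Nice'],
})
counts = fill_text_nulls(toy)
print(counts)
```

The assertion replaces the manual comparison of the printed counts against the earlier `info()` output.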


#### 2.3.1.5 Category 'Automotive'

To avoid redundant explanation, only the code and output are shown for this and the remaining categories. <br>Refer to section **2.3.1.1 Category 'AMAZON_FASHION'** for process details.

In [273]:
re_automotive.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7990166 entries, 0 to 7990165
Data columns (total 6 columns):
overall       7990166 non-null float64
verified      7990166 non-null bool
reviewerID    7990166 non-null object
asin          7990166 non-null object
reviewText    7982068 non-null object
summary       7987439 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 373.4+ MB


In [274]:
analyzeNull(re_automotive)

Unique overall values:  [4. 2. 5. 1. 3.]
Unique verified values:  [ True False]
ReviewerID value lengths:  {10, 11, 12, 13, 14, 18, 19, 20}
ASIN value lengths:  {10}
Examples of valid reviewText data: 

 ["it's fine. I just would like the stickers to be a little more secure as I'm not sure I trust the gluing power of it. But for the moment it holds my keys."
 "took me three returns to get one that didn't wobble, and it's still not super stable but whatever, im tired of returning so ill just deal with it. Too bad because the product actually looks and functions well"
 'While the product is fine the description and picture are wrong. Describing headline states Bamboo and shows a light colored wood. I bought this to match other pieces I have. What I got is much darker. Box was checked "black" vs. "white" and box states material is plywood (Basswood/Waunut). No packing slip. I will probably give this away to a friend and try to find what I\'m looking for elsewhere. I purchased this on 08/0

In [275]:
#replace NaN with empty string 

re_automotive[['reviewText','summary']] = re_automotive[['reviewText','summary']].fillna('')

In [276]:
#check if empty string count is same as previous NaN count

print('reviewText empty string count: ', len(re_automotive.loc[re_automotive.reviewText == ''].reviewText))
print('summary empty string count: ', len(re_automotive.loc[re_automotive.summary == ''].summary))

reviewText empty string count:  8098
summary empty string count:  2727


#### 2.3.1.6 Category 'Books'

To avoid redundant explanation, only the code and output are shown for this and the remaining categories. <br>Refer to section **2.3.1.1 Category 'AMAZON_FASHION'** for process details.

#### 2.3.1.7 Category 'CDs_and_Vinyl'

To avoid redundant explanation, only the code and output are shown for this and the remaining categories. <br>Refer to section **2.3.1.1 Category 'AMAZON_FASHION'** for process details.

In [286]:
re_CV.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4543369 entries, 0 to 4543368
Data columns (total 6 columns):
reviewerID    4543369 non-null object
asin          4543369 non-null object
verified      4543369 non-null bool
reviewText    4541934 non-null object
overall       4543369 non-null float64
summary       4542549 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 212.3+ MB


In [287]:
analyzeNull(re_CV)

Unique overall values:  [5. 4. 3. 1. 2.]
Unique verified values:  [ True False]
ReviewerID value lengths:  {9, 10, 11, 12, 13, 14, 19, 20}
ASIN value lengths:  {10}
Examples of valid reviewText data: 

 ['I love this CD.  So inspiring!' 'Love it!!  Great seller!'
 "I bought this on cassette tape in the 80's. So inspirational to me back then. Came across it again and needed to be uplifted! Keith Green's music still works magic. I bought CD's for my girls to play in their cars"]


Examples of null reviewText data: 

 1177    NaN
6252    NaN
8609    NaN
Name: reviewText, dtype: object
Any empty string reviewText values:  0


Examples of valid summary data: 

 ['Five Stars' 'Five Stars'
 "I bought this on cassette tape in the 80's. ..."]


Examples of null summary data: 

 4703     NaN
8609     NaN
13204    NaN
Name: summary, dtype: object
Any empty string summary values:  0


Column reviewText:  1435  null out of  4543369
Column summary:  820  null out of  4543369


In [288]:
#replace NaN with empty string 

re_CV[['reviewText','summary']] = re_CV[['reviewText','summary']].fillna('')

In [289]:
#check if empty string count is same as previous NaN count

print('reviewText empty string count: ', len(re_CV.loc[re_CV.reviewText == ''].reviewText))
print('summary empty string count: ', len(re_CV.loc[re_CV.summary == ''].summary))

reviewText empty string count:  1435
summary empty string count:  820


#### 2.3.1.8 Category 'Cell_Phones_and_Accessories'

To avoid redundant explanation, only the code and output are shown for this and the remaining categories. <br>Refer to section **2.3.1.1 Category 'AMAZON_FASHION'** for process details.

In [299]:
re_CPA.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10063255 entries, 0 to 10063254
Data columns (total 6 columns):
overall       10063255 non-null float64
verified      10063255 non-null bool
reviewerID    10063255 non-null object
asin          10063255 non-null object
reviewText    10053882 non-null object
summary       10057736 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 470.3+ MB


In [300]:
analyzeNull(re_CPA)

Unique overall values:  [5. 3. 2. 4. 1.]
Unique verified values:  [False  True]
ReviewerID value lengths:  {7, 9, 10, 11, 12, 13, 14, 19, 20}
ASIN value lengths:  {10}
Examples of valid reviewText data: 

 ['If your into space this is the Calendar for you.' 'Awesome pictures!'
 'Great wall art and information for space exploration minded people.']


Examples of null reviewText data: 

 9969     NaN
13870    NaN
22862    NaN
Name: reviewText, dtype: object
Any empty string reviewText values:  0


Examples of valid summary data: 

 ['Five Stars' 'Five Stars' 'Five Stars']


Examples of null summary data: 

 23185    NaN
30331    NaN
44839    NaN
Name: summary, dtype: object
Any empty string summary values:  0


Column reviewText:  9373  null out of  10063255
Column summary:  5519  null out of  10063255


In [301]:
#replace NaN with empty string 

re_CPA[['reviewText','summary']] = re_CPA[['reviewText','summary']].fillna('')

In [302]:
#check if empty string count is same as previous NaN count

print('reviewText empty string count: ', len(re_CPA.loc[re_CPA.reviewText == ''].reviewText))
print('summary empty string count: ', len(re_CPA.loc[re_CPA.summary == ''].summary))

reviewText empty string count:  9373
summary empty string count:  5519


**At this point, Jupyter Notebook and the laptop's hardware could not import or process any more data, as the subsequent categories contained up to 30 GB of data. After discussion with the expert, it was decided to pick 3 to 4 categories to move forward with the project.**

Categories chosen: Digital_Music, Kindle_Store, Magazine_Subscriptions, Movies_and_TVs
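
For reference, one way to keep memory use bounded on large files is to read them in chunks and trim the fields immediately. This is not what the notebook does; it is a sketch under the assumption that the files are gzipped JSON-lines, as the `parse` method at the top of the section suggests. The function name `read_in_chunks`, the chunk size, and the toy file are hypothetical.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# the six review fields kept throughout this section
KEEP = ['overall', 'verified', 'reviewerID', 'asin', 'reviewText', 'summary']

def read_in_chunks(path, chunksize=100_000):
    """Yield trimmed DataFrames from a gzipped JSON-lines file, one chunk at a time."""
    reader = pd.read_json(path, lines=True, chunksize=chunksize, compression='gzip')
    for chunk in reader:
        # keep only the needed fields; a field missing from a chunk becomes NaN
        yield chunk.reindex(columns=KEEP)

# tiny stand-in file so the sketch runs without the real data
rows = [{'overall': 5.0, 'verified': True, 'reviewerID': 'A1', 'asin': 'B000000001',
         'reviewText': 'ok', 'summary': 'fine', 'extra_field': 1}] * 5
tmp_path = os.path.join(tempfile.mkdtemp(), 'toy.json.gz')
with gzip.open(tmp_path, 'wt') as f:
    for r in rows:
        f.write(json.dumps(r) + '\n')

chunks = list(read_in_chunks(tmp_path, chunksize=2))
sizes = [len(c) for c in chunks]
print(sizes)
```

Each chunk can be cleaned and appended to a csv before the next one is read, so the full category never has to sit in memory at once.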

#### 2.3.1.9 Category 'Movies_and_TVs'

To avoid redundant explanation, only the code and output are shown for this and the remaining categories. <br>Refer to section **2.3.1.1 Category 'AMAZON_FASHION'** for process details.

In [7]:
re_MTV.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8765568 entries, 0 to 8765567
Data columns (total 6 columns):
overall       8765568 non-null float64
verified      8765568 non-null bool
reviewerID    8765568 non-null object
asin          8765568 non-null object
reviewText    8757545 non-null object
summary       8763379 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 409.6+ MB


In [12]:
analyzeNull(re_MTV)

Unique overall values:  [5. 4. 1. 3. 2.]
Unique verified values:  [ True False]
ReviewerID value lengths:  {10, 11, 12, 13, 14, 19, 20}
ASIN value lengths:  {10}
Examples of valid reviewText data: 

 ['really happy they got evangelised .. spoiler alert==happy ending liked that..since started bit worrisome... but yeah great stories these missionary movies, really short only half hour but still great'
 'Having lived in West New Guinea (Papua) during the time period covered in this video, it is realistic, accurate, and conveys well the entrance of light and truth into a culture that was for centuries dead to and alienated from God.'
 "Excellent look into contextualizing the Gospel and God's sovereignty over cultural barriers. The book and movie are both captivating. I would definitely recommend to both Christians and non-believers."]


Examples of null reviewText data: 

 3196    NaN
5947    NaN
7459    NaN
Name: reviewText, dtype: object
Any empty string reviewText values:  0


Examples 

In [14]:
#replace NaN with empty string 

re_MTV[['reviewText','summary']] = re_MTV[['reviewText','summary']].fillna('')

In [15]:
#check if empty string count is same as previous NaN count

print('reviewText empty string count: ', len(re_MTV.loc[re_MTV.reviewText == ''].reviewText))
print('summary empty string count: ', len(re_MTV.loc[re_MTV.summary == ''].summary))

reviewText empty string count:  8023
summary empty string count:  2189


#### 2.3.1.10 Category 'Kindle_Store'

To avoid redundant explanation, only the code and output are shown for this and the remaining categories. <br>Refer to section **2.3.1.1 Category 'AMAZON_FASHION'** for process details.

In [25]:
re_KS.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5722988 entries, 0 to 5722987
Data columns (total 6 columns):
overall       5722988 non-null float64
verified      5722988 non-null bool
reviewerID    5722988 non-null object
asin          5722988 non-null object
reviewText    5721364 non-null object
summary       5719911 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 267.4+ MB


In [26]:
analyzeNull(re_KS)

Unique overall values:  [4. 5. 3. 1. 2. 0.]
Unique verified values:  [ True False]
ReviewerID value lengths:  {10, 11, 12, 13, 14, 19, 20}
ASIN value lengths:  {10}
Examples of valid reviewText data: 

 ['If you like making salsas this is a great book with different ideas for party dips. I gave it as a gift.'
 'great little book. simple and right to the point. A good basic Salsas and Tacos cooking guide.  I found it quite useful.'
 'This book has good pics of the recipes and easy to create them as well.']


Examples of null reviewText data: 

 2793    NaN
5026    NaN
5332    NaN
Name: reviewText, dtype: object
Any empty string reviewText values:  0


Examples of valid summary data: 

 ['Great Book' 'great little book' 'very good bok with good ideas.']


Examples of null summary data: 

 2989     NaN
9860     NaN
24311    NaN
Name: summary, dtype: object
Any empty string summary values:  0


Column reviewText:  1624  null out of  5722988
Column summary:  3077  null out of  5722988
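
The unique `overall` values printed above include `0.`, which does not appear in any of the other categories. The notebook does not act on it. If ratings are expected to lie between 1 and 5 (an assumption, not stated in this section), the affected rows could be isolated as in the sketch below. The helper name `out_of_range_ratings` and the toy data are hypothetical.

```python
import pandas as pd

def out_of_range_ratings(df, low=1.0, high=5.0):
    """Return the rows whose 'overall' rating falls outside [low, high]."""
    # between() is inclusive on both ends by default
    return df.loc[~df['overall'].between(low, high)]

# toy data (hypothetical), one rating outside the assumed range
toy = pd.DataFrame({'overall': [4.0, 5.0, 0.0, 1.0]})
bad = out_of_range_ratings(toy)
print(len(bad), 'rating(s) outside the assumed 1-5 range')
```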


In [27]:
#replace NaN with empty string 

re_KS[['reviewText','summary']] = re_KS[['reviewText','summary']].fillna('')

In [28]:
#check if empty string count is same as previous NaN count

print('reviewText empty string count: ', len(re_KS.loc[re_KS.reviewText == ''].reviewText))
print('summary empty string count: ', len(re_KS.loc[re_KS.summary == ''].summary))

reviewText empty string count:  1624
summary empty string count:  3077


#### 2.3.1.11 Category 'Digital_Music'

To avoid redundant explanation, only the code and output are shown for this and the remaining categories. <br>Refer to section **2.3.1.1 Category 'AMAZON_FASHION'** for process details.

In [40]:
re_DM.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1584082 entries, 0 to 1584081
Data columns (total 6 columns):
overall       1584082 non-null float64
verified      1584082 non-null bool
reviewerID    1584082 non-null object
asin          1584082 non-null object
reviewText    1582629 non-null object
summary       1583547 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 74.0+ MB


In [41]:
analyzeNull(re_DM)

Unique overall values:  [5. 1. 4. 3. 2.]
Unique verified values:  [ True False]
ReviewerID value lengths:  {10, 11, 12, 13, 14, 18, 19, 20}
ASIN value lengths:  {10}
Examples of valid reviewText data: 

 ['This is a great cd full of worship favorites!!  All time great Keith green songs. His best album by far.'
 'So creative!  Love his music - the words, the message! Some of my favorite songs on this CD. I should have bought it years ago!'
 'Keith Green, gone far to early in his carreer, left us with these few golden alblums to bless us and let us see from a more in sync world veiw or I should say "the language of the modern world\'.\n\nHad this on LP all His alblums..look for ammples and then you will wee what I am talking about.\nGod Bless you all']


Examples of null reviewText data: 

 1737    NaN
2166    NaN
8165    NaN
Name: reviewText, dtype: object
Any empty string reviewText values:  0


Examples of valid summary data: 

 ['Great worship cd' 'Gotta listen to this!'
 'Great appr

In [42]:
#replace NaN with empty string 

re_DM[['reviewText','summary']] = re_DM[['reviewText','summary']].fillna('')

In [43]:
#check if empty string count is same as previous NaN count

print('reviewText empty string count: ', len(re_DM.loc[re_DM.reviewText == ''].reviewText))
print('summary empty string count: ', len(re_DM.loc[re_DM.summary == ''].summary))

reviewText empty string count:  1453
summary empty string count:  535


#### 2.3.1.12 Category 'Magazine_Subscriptions'

To avoid redundant explanation, only the code and output are shown for this and the remaining categories. <br>Refer to section **2.3.1.1 Category 'AMAZON_FASHION'** for process details.

In [44]:
re_MS.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 89689 entries, 0 to 89688
Data columns (total 6 columns):
overall       89689 non-null float64
verified      89689 non-null bool
reviewerID    89689 non-null object
asin          89689 non-null object
reviewText    89656 non-null object
summary       89670 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 4.2+ MB


In [45]:
analyzeNull(re_MS)

Unique overall values:  [5. 3. 4. 2. 1.]
Unique verified values:  [False  True]
ReviewerID value lengths:  {11, 12, 13, 14, 19, 20}
ASIN value lengths:  {10}
Examples of valid reviewText data: 

 ['for computer enthusiast, MaxPC is a welcome sight in your mailbox. i can remember for years savorying every page of "boot" (as it was called in beginning) as i was (and still am) obcessed with PC\'s. Anyone, from advanced users - to beginners looking for knowledge - can profit from every issue of MaxPC. the icing on the cake is the subscription that comes with a CD-ROM as it is packed with demos, utilities, and other useful apps (very helpful for those not blessed with broadband connections). Until I discovered the community of hardware enthusiast web sites, MaxPC, formerly "boot", was my only really informative source for computing news and articles. To this day, i consider my subscription to it worth more than 10 subscriptions to most other computing mags. I can\'t wait until they merge wi

In [46]:
#replace NaN with empty string 

re_MS[['reviewText','summary']] = re_MS[['reviewText','summary']].fillna('')

In [47]:
#check if empty string count is same as previous NaN count

print('reviewText empty string count: ', len(re_MS.loc[re_MS.reviewText == ''].reviewText))
print('summary empty string count: ', len(re_MS.loc[re_MS.summary == ''].summary))

reviewText empty string count:  33
summary empty string count:  19


### 2.3.2 Dataset 'product' Null Check

#### 2.3.2.1 Category 'Movies_and_TVs'

In [96]:
pr_MTV.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 203970 entries, 0 to 203969
Data columns (total 6 columns):
title        203900 non-null object
main_cat     203922 non-null object
asin         203970 non-null object
brand        138384 non-null object
also_buy     95170 non-null object
also_view    93860 non-null object
dtypes: object(6)
memory usage: 10.9+ MB


In [97]:
#dropping also_buy, also_view, brand 
pr_MTV = pr_MTV.drop(columns=['also_buy','also_view','brand'])

In [7]:
#this code is reused for every category, so it is wrapped in a function

def analyzePrNull(df):
    """
    Analyzes product dataset for null

    Takes in a dataframe and prints null-related and valid-value information
    for the asin, title and main_cat columns
    """
    
    print('ASIN value lengths: ',findUniqueLength(df.asin))


    print('Examples of valid title data: \n\n',df.title.values[:3])
    print('\n')
    print('Examples of null title data: \n\n', df.loc[df.title.isnull()].title.head(3))
    print('Any empty string title values: ', len(df.loc[df.title == ''].title))
    
    print('\n')

    print('Examples of valid main_cat data: \n\n',df.main_cat.values[:3])
    print('\n')
    print('Examples of null main_cat data: \n\n', df.loc[df.main_cat.isnull()].main_cat.head(3))
    print('Any empty string main_cat values: ', len(df.loc[df.main_cat == ''].main_cat))

    print('\n')
    print('Column title: ',len(df.loc[df.title.isnull()].title),' null out of ',len(df.title))
    print('Column main_cat: ',len(df.loc[df.main_cat.isnull()].main_cat),' null out of ',len(df.main_cat))


    

In [99]:
analyzePrNull(pr_MTV)

ASIN value lengths:  {10}
Examples of valid title data: 

 ['Understanding Seizures and Epilepsy'
 "Spirit Led—Moving By Grace In The Holy Spirit's Gifts"
 'My Fair Pastry (Good Eats Vol. 9)']


Examples of null title data: 

 31001    NaN
35320    NaN
43015    NaN
Name: title, dtype: object
Any empty string title values:  0


Examples of valid main_cat data: 

 ['Movies & TV' 'Movies & TV' 'Movies & TV']


Examples of null main_cat data: 

 2356     NaN
21438    NaN
31001    NaN
Name: main_cat, dtype: object
Any empty string main_cat values:  0


Column title:  70  null out of  203970
Column main_cat:  48  null out of  203970
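
The counts printed at the end of this output can also be collected into a small table, which is easier to compare across categories. This is a sketch only; `null_summary` and the toy dataframe are hypothetical names and values, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def null_summary(df, cols):
    """Return null and empty-string counts per column as a small table."""
    return pd.DataFrame({
        'null': df[cols].isna().sum(),
        'empty': (df[cols] == '').sum(),
        'total': len(df),
    })

# toy data standing in for a product dataframe (hypothetical values)
toy = pd.DataFrame({
    'title': ['Some title', np.nan, ''],
    'main_cat': ['Movies & TV', 'Movies & TV', np.nan],
})
summary = null_summary(toy, ['title', 'main_cat'])
print(summary)
```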


In [100]:
#replace NaN with empty string 

pr_MTV[['main_cat','title']] = pr_MTV[['main_cat','title']].fillna('')

#### 2.3.2.2 Category 'Kindle_Store'


In [101]:
pr_KS.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 493552 entries, 0 to 493551
Data columns (total 4 columns):
title       493552 non-null object
rank        493552 non-null object
main_cat    493552 non-null object
asin        493552 non-null object
dtypes: object(4)
memory usage: 18.8+ MB


In [102]:
#dropping rank
pr_KS = pr_KS.drop(columns=['rank'])

In [103]:
#replace NaN with empty string 

pr_KS[['main_cat','title']] = pr_KS[['main_cat','title']].fillna('')

#### 2.3.2.3 Category 'Digital_Music'


In [23]:
pr_DM.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 465392 entries, 0 to 465391
Data columns (total 3 columns):
title       395452 non-null object
asin        465392 non-null object
main_cat    1 non-null object
dtypes: object(3)
memory usage: 14.2+ MB


In [24]:
print('ASIN value lengths: ',findUniqueLength(pr_DM.asin))


print('Examples of valid title data: \n\n',pr_DM.title.values[:3])
print('\n')
print('Examples of null title data: \n\n', pr_DM.loc[pr_DM.title.isnull()].title.head(3))
print('Any empty string title values: ', len(pr_DM.loc[pr_DM.title == ''].title))

print('\n')
print('Column title: ',len(pr_DM.loc[pr_DM.title.isnull()].title),' null out of ',len(pr_DM.title))

ASIN value lengths:  {10}
Examples of valid title data: 

 ['Master Collection Volume One' 'Hymns Collection: Hymns 1 & 2'
 'Early Works - Don Francisco']


Examples of null title data: 

 17    NaN
18    NaN
30    NaN
Name: title, dtype: object
Any empty string title values:  0


Column title:  69940  null out of  465392


In [25]:
#dropping main_cat (only 1 non-null value out of 465392)
pr_DM = pr_DM.drop(columns=['main_cat'])


In [26]:
#replace NaN with empty string 

pr_DM['title'] = pr_DM['title'].fillna('')
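
Here `main_cat` was dropped because only 1 of its 465392 values is non-null. The same decision could be made programmatically from the share of nulls per column. The sketch below is an illustration, not the notebook's method; the helper `mostly_null_columns` and the 0.99 threshold are assumptions.

```python
import numpy as np
import pandas as pd

def mostly_null_columns(df, threshold=0.99):
    """Return the columns whose share of null values is at least `threshold`."""
    null_share = df.isna().mean()
    return null_share[null_share >= threshold].index.tolist()

# toy data (hypothetical): main_cat is entirely null, title only partly
toy = pd.DataFrame({
    'title': ['a', np.nan, 'c', 'd'],
    'asin': ['0000000001', '0000000002', '0000000003', '0000000004'],
    'main_cat': [np.nan, np.nan, np.nan, np.nan],
})
to_drop = mostly_null_columns(toy)
toy = toy.drop(columns=to_drop)
print(to_drop)
```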

#### 2.3.2.4 Category 'Magazine_Subscriptions'


In [112]:
pr_MS.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3493 entries, 0 to 3492
Data columns (total 3 columns):
main_cat    3493 non-null object
asin        3493 non-null object
title       329 non-null object
dtypes: object(3)
memory usage: 109.2+ KB


In [113]:
analyzePrNull(pr_MS)

ASIN value lengths:  {10}
Examples of valid title data: 

 [nan nan nan]


Examples of null title data: 

 0    NaN
1    NaN
2    NaN
Name: title, dtype: object
Any empty string title values:  0


Examples of valid main_cat data: 

 ['Magazine Subscriptions' 'Magazine Subscriptions'
 'Magazine Subscriptions']


Examples of null main_cat data: 

 Series([], Name: main_cat, dtype: object)
Any empty string main_cat values:  0


Column title:  3164  null out of  3493
Column main_cat:  0  null out of  3493


In [114]:
#replace NaN with empty string 

pr_MS[['main_cat','title']] = pr_MS[['main_cat','title']].fillna('')

#### 2.3.2.5 Category 'Books'

In [8]:
pr_books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2935525 entries, 0 to 2935524
Data columns (total 11 columns):
description     object
title           object
also_buy        object
brand           object
also_view       object
main_cat        object
asin            object
category        object
feature         object
similar_item    object
details         object
dtypes: object(11)
memory usage: 268.8+ MB


In [11]:
analyzePrNull(pr_books)

ASIN value lengths:  {10}
Examples of valid title data: 

 ['Biology Gods Living Creation Third Edition 10 (A Beka Book Science Series)'
 'Mksap 16 Audio Companion: Medical Knowledge Self-Assessment Program'
 'Flex! Discography of North American Punk, Hardcore, and Powerpop 1975-1985 A-M']


Examples of null title data: 

 4141    NaN
6669    NaN
7545    NaN
Name: title, dtype: object
Any empty string title values:  0


Examples of valid main_cat data: 

 ['Books' 'Books' 'Books']


Examples of null main_cat data: 

 189771     NaN
189832     NaN
2382644    NaN
Name: main_cat, dtype: object
Any empty string main_cat values:  0


Column title:  828  null out of  2935525
Column main_cat:  48  null out of  2935525


In [12]:
#replace NaN with empty string 

pr_books[['main_cat','title']] = pr_books[['main_cat','title']].fillna('')

## 2.4 Duplicate Check

Duplicate rows in the dataset can also lower the accuracy of the model. <br>
This section checks each dataset for duplicate rows and removes them.

### 2.4.1 Dataset 'review' Duplicate Check

#### 2.4.1.1 Category 'AMAZON_FASHION'

In [179]:
re_fashion.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 883636 entries, 0 to 883635
Data columns (total 6 columns):
overall       883636 non-null float64
verified      883636 non-null bool
reviewerID    883636 non-null object
asin          883636 non-null object
reviewText    883636 non-null object
summary       883636 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 41.3+ MB


Note that the individual columns do not need to hold unique values in every row. <br> We are only looking for duplicated rows, that is, reviews that match in every column.
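
A small illustration of the distinction, using made-up values: rows that share a reviewer, a product, or a rating are not duplicates unless every column matches.

```python
import pandas as pd

# toy data (hypothetical values)
toy = pd.DataFrame({
    'reviewerID': ['A1', 'A1', 'A1', 'A2'],
    'asin':       ['B1', 'B1', 'B2', 'B1'],
    'overall':    [5.0, 5.0, 5.0, 5.0],
})

# rows 0 and 1 match in every column; rows 2 and 3 only share some values
mask = toy.duplicated(keep=False)
print(mask.tolist())  # [True, True, False, False]
```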

In [180]:
num_dup = len(re_fashion[re_fashion.duplicated(keep=False)])

print('total number of duplicated rows: ',num_dup)

dups = re_fashion[re_fashion.duplicated(keep=False)].copy()
dups.drop_duplicates(keep='first', inplace=True)
num_uniq = len(dups)

print('total number of unique rows out of duplicated rows: ', num_uniq)

print('expected number of rows after drop duplicates: ', len(re_fashion) - num_dup + num_uniq)

total number of duplicated rows:  14888
total number of unique rows out of duplicated rows:  7385
expected number of rows after drop duplicates:  876133


In [181]:
#drop duplicates

re_fashion.drop_duplicates(keep='first', inplace=True)

In [182]:
len(re_fashion)

876133

The row count matches the expected 876133, so all duplicates have been removed.
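
The manual comparison of `len(...)` against the expected count could also be done with an assertion at the moment of dropping. The following is a sketch; `drop_duplicates_checked` and the toy dataframe are hypothetical and not part of the original notebook.

```python
import pandas as pd

def drop_duplicates_checked(df):
    """Drop duplicated rows in place and confirm the resulting row count."""
    before = len(df)
    # rows flagged here are the repeats that drop_duplicates(keep='first') removes
    n_repeats = int(df.duplicated(keep='first').sum())
    df.drop_duplicates(keep='first', inplace=True)
    assert len(df) == before - n_repeats, 'unexpected row count after drop'
    return n_repeats

# toy data (hypothetical values): the first two rows are identical
toy = pd.DataFrame({'asin': ['B1', 'B1', 'B2'], 'overall': [5.0, 5.0, 3.0]})
removed = drop_duplicates_checked(toy)
print(removed, len(toy))
```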

#### 2.4.1.2 Category 'All_Beauty'

To avoid redundant explanation, only the code and output are shown for this and the remaining categories. <br>Refer to section **2.4.1.1 Category 'AMAZON_FASHION'** for process details.

In [187]:
re_beauty.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 371345 entries, 0 to 371344
Data columns (total 6 columns):
overall       371345 non-null float64
verified      371345 non-null bool
reviewerID    371345 non-null object
asin          371345 non-null object
reviewText    371345 non-null object
summary       371345 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 17.4+ MB


In [17]:
#this code is reused for every category, so it is wrapped in a function

def analyzeDup(df):
    """
    Analyzes a dataset for duplicate rows

    Takes in a dataframe and prints the number of duplicated rows, the number
    of unique rows among them, and the expected row count after dropping duplicates
    """
    #analyze duplicates
    
    dups = df[df.duplicated(keep=False)].copy()
    
    num_dup = len(dups)

    print('total number of duplicated rows: ',num_dup)

    
    dups.drop_duplicates(keep='first', inplace=True)
    num_uniq = len(dups)

    print('total number of unique rows out of duplicated rows: ', num_uniq)

    print('expected number of rows after drop duplicates: ', len(df) - num_dup + num_uniq)
    
    del dups
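
The three numbers printed by `analyzeDup` are related by `expected = total - num_dup + num_uniq`. A worked example on made-up data shows the arithmetic, and also that the same figure follows from counting only the repeats with `duplicated(keep='first')`.

```python
import pandas as pd

# toy data (hypothetical values): one row appears 3 times, one twice, one once
toy = pd.DataFrame({'asin': ['B1', 'B1', 'B1', 'B2', 'B2', 'B3'],
                    'overall': [5.0, 5.0, 5.0, 4.0, 4.0, 1.0]})

# every row that has at least one identical twin: 3 + 2 = 5
num_dup = int(toy.duplicated(keep=False).sum())
# distinct rows among those duplicated rows: 2
num_uniq = len(toy[toy.duplicated(keep=False)].drop_duplicates())
# 6 - 5 + 2 = 3 rows should remain
expected = len(toy) - num_dup + num_uniq

# the same figure from the repeats alone: 2 + 1 = 3 rows are removed
repeats = int(toy.duplicated(keep='first').sum())
assert expected == len(toy) - repeats
print(num_dup, num_uniq, expected)
```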

In [244]:
analyzeDup(re_beauty)

total number of duplicated rows:  17382
total number of unique rows out of duplicated rows:  8663
expected number of rows after drop duplicates:  362626


In [246]:
#drop duplicates

re_beauty.drop_duplicates(keep='first', inplace=True)

len(re_beauty)

362626

#### 2.4.1.3 Category 'Appliances'

To avoid redundant explanation, only the code and output are shown for this and the remaining categories. <br>Refer to section **2.4.1.1 Category 'AMAZON_FASHION'** for process details.

In [247]:
re_appliances.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 602777 entries, 0 to 602776
Data columns (total 6 columns):
overall       602777 non-null float64
verified      602777 non-null bool
reviewerID    602777 non-null object
asin          602777 non-null object
reviewText    602777 non-null object
summary       602777 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 28.2+ MB


In [248]:
analyzeDup(re_appliances)

total number of duplicated rows:  20832
total number of unique rows out of duplicated rows:  9403
expected number of rows after drop duplicates:  591348


In [249]:
#drop duplicates

re_appliances.drop_duplicates(keep='first', inplace=True)

len(re_appliances)

591348

#### 2.4.1.4 Category 'Arts_Crafts_and_Sewing'

To avoid redundant explanation, only the code and output are shown for this and the remaining categories. <br>Refer to section **2.4.1.1 Category 'AMAZON_FASHION'** for process details.

In [262]:
re_ACS.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2875917 entries, 0 to 2875916
Data columns (total 6 columns):
overall       2875917 non-null float64
verified      2875917 non-null bool
reviewerID    2875917 non-null object
asin          2875917 non-null object
reviewText    2875917 non-null object
summary       2875917 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 134.4+ MB


In [263]:
analyzeDup(re_ACS)

total number of duplicated rows:  261648
total number of unique rows out of duplicated rows:  130194
expected number of rows after drop duplicates:  2744463


In [264]:
#drop duplicates

re_ACS.drop_duplicates(keep='first', inplace=True)

len(re_ACS)

2744463

#### 2.4.1.5 Category 'Automotive'

To avoid redundant explanation, only the code and output are shown for this and the remaining categories. <br>Refer to section **2.4.1.1 Category 'AMAZON_FASHION'** for process details.

In [278]:
re_automotive.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7990166 entries, 0 to 7990165
Data columns (total 6 columns):
overall       7990166 non-null float64
verified      7990166 non-null bool
reviewerID    7990166 non-null object
asin          7990166 non-null object
reviewText    7990166 non-null object
summary       7990166 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 373.4+ MB


In [279]:
analyzeDup(re_automotive)

total number of duplicated rows:  328964
total number of unique rows out of duplicated rows:  164269
expected number of rows after drop duplicates:  7825471


In [280]:
#drop duplicates

re_automotive.drop_duplicates(keep='first', inplace=True)

len(re_automotive)

7825471

#### 2.4.1.6 Category 'Books'

To avoid redundant explanation, only the code and output are shown for this and the remaining categories. <br>Refer to section **2.4.1.1 Category 'AMAZON_FASHION'** for process details.

#### 2.4.1.7 Category 'CDs_and_Vinyl'

To avoid redundant explanation, only the code and output are shown for this and the remaining categories. <br>Refer to section **2.4.1.1 Category 'AMAZON_FASHION'** for process details.

In [290]:
re_CV.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4543369 entries, 0 to 4543368
Data columns (total 6 columns):
reviewerID    4543369 non-null object
asin          4543369 non-null object
verified      4543369 non-null bool
reviewText    4543369 non-null object
overall       4543369 non-null float64
summary       4543369 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 212.3+ MB


In [291]:
analyzeDup(re_CV)

total number of duplicated rows:  137777
total number of unique rows out of duplicated rows:  68280
expected number of rows after drop duplicates:  4473872


In [292]:
#drop duplicates

re_CV.drop_duplicates(keep='first', inplace=True)

len(re_CV)

4473872

#### 2.4.1.8 Category 'Cell_Phones_and_Accessories'

To avoid redundant explanation, only the code and output are shown for this and the remaining categories. <br>Refer to section **2.4.1.1 Category 'AMAZON_FASHION'** for process details.

In [303]:
re_CPA.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10063255 entries, 0 to 10063254
Data columns (total 6 columns):
overall       10063255 non-null float64
verified      10063255 non-null bool
reviewerID    10063255 non-null object
asin          10063255 non-null object
reviewText    10063255 non-null object
summary       10063255 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 470.3+ MB


In [304]:
analyzeDup(re_CPA)

total number of duplicated rows:  41913
total number of unique rows out of duplicated rows:  20870
expected number of rows after drop duplicates:  10042212


In [306]:
#drop duplicates

re_CPA.drop_duplicates(keep='first', inplace=True)

len(re_CPA)

10042212

**At this point, Jupyter Notebook and the laptop's hardware could not import or process any more data, as the subsequent categories contained up to 30 GB of data. After discussion with the expert, it was decided to pick 3 to 4 categories to move forward with the project.**

Categories chosen: Digital_Music, Kindle_Store, Magazine_Subscriptions, Movies_and_TVs

#### 2.4.1.9 Category 'Movies_and_TVs'

To avoid redundant explanation, only the code and output are shown for this and the remaining categories. <br>Refer to section **2.4.1.1 Category 'AMAZON_FASHION'** for process details.

In [16]:
re_MTV.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8765568 entries, 0 to 8765567
Data columns (total 6 columns):
overall       8765568 non-null float64
verified      8765568 non-null bool
reviewerID    8765568 non-null object
asin          8765568 non-null object
reviewText    8765568 non-null object
summary       8765568 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 409.6+ MB


In [17]:
analyzeDup(re_MTV)

total number of duplicated rows:  476331
total number of unique rows out of duplicated rows:  237683
expected number of rows after drop duplicates:  8526920


In [18]:
#drop duplicates

re_MTV.drop_duplicates(keep='first', inplace=True)

len(re_MTV)

8526920

#### 2.4.1.10 Category 'Kindle_Store'

To avoid redundant explanation, only the code and its output are shown for this and the remaining categories. <br>Refer to section **2.4.1.1 Category 'AMAZON_FASHION'** for process details.

In [29]:
re_KS.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5722988 entries, 0 to 5722987
Data columns (total 6 columns):
overall       5722988 non-null float64
verified      5722988 non-null bool
reviewerID    5722988 non-null object
asin          5722988 non-null object
reviewText    5722988 non-null object
summary       5722988 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 267.4+ MB


In [30]:
analyzeDup(re_KS)

total number of duplicated rows:  27034
total number of unique rows out of duplicated rows:  13362
expected number of rows after drop duplicates:  5709316


In [31]:
#drop duplicates

re_KS.drop_duplicates(keep='first', inplace=True)

len(re_KS)

5709316

#### 2.4.1.11 Category 'Digital_Music'

To avoid redundant explanation, only the code and its output are shown for this and the remaining categories. <br>Refer to section **2.4.1.1 Category 'AMAZON_FASHION'** for process details.

In [48]:
re_DM.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1584082 entries, 0 to 1584081
Data columns (total 6 columns):
overall       1584082 non-null float64
verified      1584082 non-null bool
reviewerID    1584082 non-null object
asin          1584082 non-null object
reviewText    1584082 non-null object
summary       1584082 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 74.0+ MB


In [49]:
analyzeDup(re_DM)

total number of duplicated rows:  129258
total number of unique rows out of duplicated rows:  64487
expected number of rows after drop duplicates:  1519311


In [50]:
#drop duplicates

re_DM.drop_duplicates(keep='first', inplace=True)

len(re_DM)

1519311

#### 2.4.1.12 Category 'Magazine_Subscriptions'

To avoid redundant explanation, only the code and its output are shown for this category. <br>Refer to section **2.4.1.1 Category 'AMAZON_FASHION'** for process details.

In [51]:
re_MS.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 89689 entries, 0 to 89688
Data columns (total 6 columns):
overall       89689 non-null float64
verified      89689 non-null bool
reviewerID    89689 non-null object
asin          89689 non-null object
reviewText    89689 non-null object
summary       89689 non-null object
dtypes: bool(1), float64(1), object(4)
memory usage: 4.2+ MB


In [52]:
analyzeDup(re_MS)

total number of duplicated rows:  2388
total number of unique rows out of duplicated rows:  1193
expected number of rows after drop duplicates:  88494


In [53]:
#drop duplicates

re_MS.drop_duplicates(keep='first', inplace=True)

len(re_MS)

88494

### 2.4.2 Dataset 'product' Duplicate Check

#### 2.4.2.1 Category 'Movies_and_TVs'

In [115]:
pr_MTV.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 203970 entries, 0 to 203969
Data columns (total 3 columns):
title       203970 non-null object
main_cat    203970 non-null object
asin        203970 non-null object
dtypes: object(3)
memory usage: 6.2+ MB


In [116]:
analyzeDup(pr_MTV)

total number of duplicated rows:  43854
total number of unique rows out of duplicated rows:  21927
expected number of rows after drop duplicates:  182043


In [117]:
#drop duplicates

pr_MTV.drop_duplicates(keep='first', inplace=True)

len(pr_MTV)

182043

#### 2.4.2.2 Category 'Kindle_Store'

In [118]:
pr_KS.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 493552 entries, 0 to 493551
Data columns (total 3 columns):
title       493552 non-null object
main_cat    493552 non-null object
asin        493552 non-null object
dtypes: object(3)
memory usage: 15.1+ MB


In [119]:
analyzeDup(pr_KS)

total number of duplicated rows:  0
total number of unique rows out of duplicated rows:  0
expected number of rows after drop duplicates:  493552


#### 2.4.2.3 Category 'Digital_Music'

In [27]:
pr_DM.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 465392 entries, 0 to 465391
Data columns (total 2 columns):
title    465392 non-null object
asin     465392 non-null object
dtypes: object(2)
memory usage: 10.7+ MB


In [28]:
analyzeDup(pr_DM)

total number of duplicated rows:  16764
total number of unique rows out of duplicated rows:  8382
expected number of rows after drop duplicates:  457010


In [29]:
#drop duplicates

pr_DM.drop_duplicates(keep='first', inplace=True)

len(pr_DM)

457010

#### 2.4.2.4 Category 'Magazine_Subscriptions'

In [123]:
pr_MS.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3493 entries, 0 to 3492
Data columns (total 3 columns):
main_cat    3493 non-null object
asin        3493 non-null object
title       3493 non-null object
dtypes: object(3)
memory usage: 109.2+ KB


In [129]:
analyzeDup(pr_MS)

total number of duplicated rows:  2130
total number of unique rows out of duplicated rows:  1065
expected number of rows after drop duplicates:  2428


In [130]:
#drop duplicates

pr_MS.drop_duplicates(keep='first', inplace=True)

len(pr_MS)

2428

#### 2.4.2.5 Category 'Books'

In [14]:
pr_books.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2935525 entries, 0 to 2935524
Data columns (total 11 columns):
description     2384361 non-null object
title           2935525 non-null object
also_buy        1344649 non-null object
brand           2834504 non-null object
also_view       1205292 non-null object
main_cat        2935525 non-null object
asin            2935525 non-null object
category        2545906 non-null object
feature         1542 non-null object
similar_item    61 non-null object
details         399866 non-null object
dtypes: object(11)
memory usage: 268.8+ MB


In [15]:
#similar_item and feature columns have too few non-null values to be viable.
pr_books.drop(columns=['similar_item','feature'],inplace=True)

In [36]:
pr_books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2935525 entries, 0 to 2935524
Data columns (total 9 columns):
description    object
title          object
also_buy       object
brand          object
also_view      object
main_cat       object
asin           object
category       object
details        object
dtypes: object(9)
memory usage: 224.0+ MB


Because some columns hold lists, which are not hashable, drop_duplicates cannot be applied to the whole dataframe. <br> Instead, we will extract the asin column and use it to identify any duplicates.
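
The failure is easy to reproduce on a small frame. The sketch below uses toy rows (made up for illustration, loosely modeled on the product fields) to show that `drop_duplicates` raises a `TypeError` when a column holds lists, while restricting the check to `asin` works because only that column is hashed.

```python
import pandas as pd

# Toy rows for illustration only; also_view holds Python lists.
toy = pd.DataFrame({
    "asin":      ["B00000IJYC", "B00000IJYC", "B00000IRGX"],
    "title":     ["Adventure Journal", "Adventure Journal", "Plum Canvas Book Cover (Hardcover)"],
    "also_view": [[], [], ["B00W4E1TMS", "0310823706"]],
})

try:
    toy.drop_duplicates()          # hashes every column, including the lists
    failed = False
except TypeError as err:
    failed = True
    print("drop_duplicates failed:", err)

# Restricting the comparison to asin avoids hashing the list column.
dup_mask = toy.duplicated(subset="asin", keep=False)
print(toy.loc[dup_mask, "asin"].tolist())
```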

In [44]:
asin_books = pr_books.asin.copy()

In [47]:
print('total number of asin: ',len(asin_books))
print('number of unique asin: ',len(asin_books.unique()))

total number of asin:  2935525
number of unique asin:  2930600


In [50]:
asin_count = pr_books.groupby('asin').asin.count()

In [53]:
dup_books = asin_count[asin_count>1].index

In [56]:
pr_books.loc[pr_books.asin.isin(dup_books)].sort_values('asin')

Unnamed: 0,description,title,also_buy,brand,also_view,main_cat,asin,category,details
2603424,"[Contains blank, lined pages Measures 5.75 by...",Adventure Journal,,,,Books,B00000IJYC,,"\n <div class=""content"">\n\n\n\n\n\n\n\n<ul>\..."
2608349,"[Contains blank, lined pages Measures 5.75 by...",Adventure Journal,,,,Books,B00000IJYC,,"\n <div class=""content"">\n\n\n\n\n\n\n\n<ul>\..."
2603426,"[Includes cardboard storage box, 24 note cards...",Tiffany Windows Note Cards,,,,Books,B00000IJYX,,"\n <div class=""content"">\n\n\n\n\n\n\n\n<ul>\..."
2608351,"[Includes cardboard storage box, 24 note cards...",Tiffany Windows Note Cards,,,,Books,B00000IJYX,,"\n <div class=""content"">\n\n\n\n\n\n\n\n<ul>\..."
2608350,"[, , , ]",NiteOwl DC Plus Upgrade Kit,,,,Books,B00000IJZO,,"\n <div class=""content"">\n\n\n\n\n\n\n\n<ul>\..."
2603425,"[, , , ]",NiteOwl DC Plus Upgrade Kit,,,,Books,B00000IJZO,,"\n <div class=""content"">\n\n\n\n\n\n\n\n<ul>\..."
2603428,[Book lovers like nothing better than to curl ...,Black Floral Tapestry Book Cover (Hardcover),,,,Books,B00000IRGQ,,"\n <div class=""content"">\n\n\n\n\n\n\n\n<ul>\..."
2608353,[Book lovers like nothing better than to curl ...,Black Floral Tapestry Book Cover (Hardcover),,,,Books,B00000IRGQ,,"\n <div class=""content"">\n\n\n\n\n\n\n\n<ul>\..."
2603427,"[, , , ]",Plum Canvas Book Cover (Hardcover),,,"[B00W4E1TMS, 0310823706]",Books,B00000IRGX,,"\n <div class=""content"">\n\n\n\n\n\n\n\n<ul>\..."
2608352,"[, , , ]",Plum Canvas Book Cover (Hardcover),,,"[B00W4E1TMS, 0310823706]",Books,B00000IRGX,,"\n <div class=""content"">\n\n\n\n\n\n\n\n<ul>\..."


Inspecting the duplicated rows shows that they appear to be exact copies of each other.  <br> Since the rows sharing an asin do not differ in any column, we will drop the duplicates manually: collect the index labels of the repeated rows in a list and drop them directly from the dataframe.  
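
The exact-copy observation comes from reading the displayed rows. A programmatic check is possible as well; the sketch below is one option, not part of the original notebook. It casts every value to text so that list-valued cells become comparable, then tests whether the rows sharing an asin are the same rows that are identical across all columns. The data is a toy frame.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "asin":      ["B00000IJYC", "B00000IJYC", "B00000IRGX", "B00000IRGX"],
    "title":     ["Adventure Journal", "Adventure Journal",
                  "Plum Canvas Book Cover (Hardcover)", "Plum Canvas Book Cover (Hardcover)"],
    "also_view": [None, None, ["B00W4E1TMS", "0310823706"], ["B00W4E1TMS", "0310823706"]],
})

as_text = toy.astype(str)                               # lists become their text form
same_asin = toy.duplicated(subset="asin", keep=False)   # rows that share an asin
exact_copy = as_text.duplicated(keep=False)             # rows identical in every column

# True only if every row with a repeated asin is also a full duplicate.
all_exact = bool((same_asin == exact_copy).all())
print(all_exact)
```

On a dataframe as large as `pr_books`, the text conversion is memory-hungry, so it may be more practical to run it only on the rows whose asin is repeated.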

In [63]:
pr_books.loc[pr_books.asin.isin(dup_books)].sort_values('asin').index

Int64Index([2603424, 2608349, 2603426, 2608351, 2608350, 2603425, 2603428,
            2608353, 2603427, 2608352,
            ...
            2613269, 2608344, 2608345, 2613270, 2613271, 2608346, 2608348,
            2613273, 2608347, 2613272],
           dtype='int64', length=9850)

In [64]:
dup_ind = pr_books.loc[pr_books.asin.isin(dup_books)].sort_values('asin').index[::2]

In [65]:
dup_ind

Int64Index([2603424, 2603426, 2608350, 2603428, 2603427, 2603429, 2603430,
            2608356, 2608357, 2608358,
            ...
            2608339, 2613265, 2613266, 2613267, 2613268, 2613269, 2608345,
            2613271, 2608348, 2608347],
           dtype='int64', length=4925)

In [76]:
# drop by index label; dup_ind already holds the labels of the rows to remove
pr_books.drop(index=dup_ind, inplace=True)

In [77]:
pr_books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2930600 entries, 0 to 2935524
Data columns (total 9 columns):
description    object
title          object
also_buy       object
brand          object
also_view      object
main_cat       object
asin           object
category       object
details        object
dtypes: object(9)
memory usage: 223.6+ MB
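
Taking every second label with `[::2]` relies on each duplicated asin appearing exactly twice. That holds here, since 9850 duplicated rows reduce by 4925 to the 2930600 unique asin values. A sketch of an alternative that does not depend on the group size is to build a boolean mask with `duplicated(subset='asin')`. The example uses a toy frame in which one asin appears three times.

```python
import pandas as pd

# Toy frame for illustration: asin "A" appears three times, "B" twice.
toy = pd.DataFrame(
    {
        "asin": ["A", "B", "A", "C", "B", "A"],
        "also_view": [[], ["X"], [], None, ["X"], []],   # list column is never hashed
    },
    index=[10, 11, 12, 13, 14, 15],
)

# Mark every repeat after the first occurrence of each asin, then keep the rest.
keep_mask = ~toy.duplicated(subset="asin", keep="first")
deduped = toy[keep_mask]
print(deduped.index.tolist())   # [10, 11, 13]
```

Because only `asin` is compared, this approach assumes, as the notebook does, that rows sharing an asin carry the same information.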


## 2.5 Data Export

After the data cleaning is finished, the datasets are saved back to csv files so that they can be easily imported for later use.
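
Two details are worth keeping in mind when the csv files are read back: an asin made up only of digits can lose its leading zeros if pandas infers an integer column, and list-valued columns are written as plain text. The sketch below illustrates a round trip on toy data. The helper `export_csv`, the directory argument, and the file name `pr_toy` are hypothetical and are not the notebook's own code.

```python
import ast
import tempfile
from pathlib import Path

import pandas as pd

def export_csv(df, name, data_dir):
    """Hypothetical helper: write df to <data_dir>/<name>.csv and return the path."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    path = data_dir / f"{name}.csv"
    df.to_csv(path, index=False)
    return path

# Toy rows for illustration only.
toy = pd.DataFrame({
    "asin": ["0310823706", "0310823707"],
    "also_view": [["B00W4E1TMS"], []],
})
path = export_csv(toy, "pr_toy", tempfile.mkdtemp())

def parse_list(text):
    """Turn the text form of a list back into a list; empty cells become None."""
    return ast.literal_eval(text) if text else None

restored = pd.read_csv(
    path,
    dtype={"asin": str},                   # keep leading zeros
    converters={"also_view": parse_list},  # rebuild the lists from text
)
print(restored["asin"].tolist())
```

Passing a directory argument instead of a hard-coded absolute path also makes the export cells easier to rerun on another machine.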

### 2.5.1 Dataset 'review' Export

In [139]:
#save to csv
re_fashion.to_csv(r'/Users/byungchankim/Downloads/Springboard/capstone2/data/re_fashion.csv',index=False)

In [190]:
re_beauty.to_csv(r'/Users/byungchankim/Downloads/Springboard/capstone2/data/re_beauty.csv',index=False)

In [250]:
re_appliances.to_csv(r'/Users/byungchankim/Downloads/Springboard/capstone2/data/re_appliances.csv',index=False)

In [265]:
re_ACS.to_csv(r'/Users/byungchankim/Downloads/Springboard/capstone2/data/re_ACS.csv',index=False)

In [281]:
re_automotive.to_csv(r'/Users/byungchankim/Downloads/Springboard/capstone2/data/re_automotive.csv',index=False)

In [293]:
re_CV.to_csv(r'/Users/byungchankim/Downloads/Springboard/capstone2/data/re_CV.csv',index=False)

In [307]:
re_CPA.to_csv(r'/Users/byungchankim/Downloads/Springboard/capstone2/data/re_CPA.csv',index=False)

**At this point, the Jupyter Notebook and the laptop's hardware could not import or process any more data, as the subsequent categories contained up to 30 GB of data. After discussion with the expert, it was decided to pick 3 to 4 categories to move forward with the project.**

Categories chosen: Digital_Music, Kindle_Store, Magazine_Subscriptions, Movies_and_TVs

In [19]:
re_MTV.to_csv(r'/Users/byungchankim/Downloads/Springboard/capstone2/data/re_MTV.csv',index=False)

In [32]:
re_KS.to_csv(r'/Users/byungchankim/Downloads/Springboard/capstone2/data/re_KS.csv',index=False)

In [54]:
re_DM.to_csv(r'/Users/byungchankim/Downloads/Springboard/capstone2/data/re_DM.csv',index=False)

In [55]:
re_MS.to_csv(r'/Users/byungchankim/Downloads/Springboard/capstone2/data/re_MS.csv',index=False)

**Dataset 'product' export starts here.**

In [131]:
pr_MTV.to_csv(r'/Users/byungchankim/Downloads/Springboard/capstone2/data/pr_MTV.csv',index=False)

In [132]:
pr_KS.to_csv(r'/Users/byungchankim/Downloads/Springboard/capstone2/data/pr_KS.csv',index=False)

In [133]:
pr_DM.to_csv(r'/Users/byungchankim/Downloads/Springboard/capstone2/data/pr_DM.csv',index=False)

In [134]:
pr_MS.to_csv(r'/Users/byungchankim/Downloads/Springboard/capstone2/data/pr_MS.csv',index=False)

In [80]:
pr_books.to_csv(r'/Users/byungchankim/Downloads/Springboard/capstone2/data/pr_books.csv',index=False)