# 2. Data Wrangling: Books

At this point in the project, the decision was made to focus solely on the Amazon Books review and product dataset. <br> Initially, the Books review dataset was too large to import for processing.  <br> The original dataset was therefore split into 14 smaller chunks for initial processing.

In [248]:
#import required packages
import pandas as pd
import gzip
import json
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import numpy as np

from itertools import islice

## 2.1 Data Import

To reduce the dataset size, the 'Books' review dataset covering reviews up to 2013 was selected. <br> Unlike the previous data import method, the 2013 Books review dataset is saved in txt format.  <br> A new read method is required to load the dataset into pandas.

In [249]:
def parseTxt(path):
    """
    Method to read in structured text data

    Takes in path to file
    Outputs pandas dataframe
    """

    df = {}
    i = 0

    with open(path) as f:
        while True:
            #each record spans 11 lines: 10 labelled fields, then one line that is not used
            chunk = list(islice(f,11))
            if not chunk:
                break
            d={}

            #strip the field label and the trailing newline from each line
            d.update({'asin': chunk[0].split('product/productId: ')[1].rstrip('\n')})
            d.update({'title': chunk[1].split('product/title: ')[1].rstrip('\n')})
            d.update({'price': chunk[2].split('product/price: ')[1].rstrip('\n')})
            d.update({'reviewerID': chunk[3].split('review/userId: ')[1].rstrip('\n')})
            d.update({'reviewerName': chunk[4].split('review/profileName: ')[1].rstrip('\n')})
            d.update({'helpfulness': chunk[5].split('review/helpfulness: ')[1].rstrip('\n')})
            d.update({'overall': chunk[6].split('review/score: ')[1].rstrip('\n')})
            d.update({'timestamp': chunk[7].split('review/time: ')[1].rstrip('\n')})
            d.update({'summary': chunk[8].split('review/summary: ')[1].rstrip('\n')})
            d.update({'reviewText': chunk[9].split('review/text: ')[1].rstrip('\n')})
            df[i]=d
            i+=1

    #keys of df are row numbers, so orient='index' turns each record into one row
    return pd.DataFrame.from_dict(df, orient='index')
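
`parseTxt` relies on every record having its fields in a fixed order. As a minimal sketch of a more defensive variant, the generator below matches each line by its label instead of by position. The names `FIELD_PREFIXES` and `iter_records` are hypothetical and not part of the original notebook, and the 11-line record layout is inferred from the parser above.

```python
from itertools import islice

# Field label -> column name, mirroring the fields read by parseTxt.
FIELD_PREFIXES = {
    'product/productId: ': 'asin',
    'product/title: ': 'title',
    'product/price: ': 'price',
    'review/userId: ': 'reviewerID',
    'review/profileName: ': 'reviewerName',
    'review/helpfulness: ': 'helpfulness',
    'review/score: ': 'overall',
    'review/time: ': 'timestamp',
    'review/summary: ': 'summary',
    'review/text: ': 'reviewText',
}

def iter_records(lines, lines_per_record=11):
    """Yield one dict per record from an iterable of text lines."""
    lines = iter(lines)
    while True:
        chunk = list(islice(lines, lines_per_record))
        if not chunk:
            break
        record = {}
        for line in chunk:
            # Match on the label rather than on the line position.
            for prefix, column in FIELD_PREFIXES.items():
                if line.startswith(prefix):
                    record[column] = line[len(prefix):].rstrip('\n')
                    break
        if record:
            yield record
```

With an open file handle `f`, `pd.DataFrame(list(iter_records(f)))` would give a frame comparable to the one returned by `parseTxt`.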

In [255]:
xaa = parseTxt('data/review/books/xaa')

In [333]:
xab = parseTxt('data/review/books/xab')

In [318]:
xac = parseTxt('data/review/books/xac')

In [326]:
xad = parseTxt('data/review/books/xad')

In [350]:
xae = parseTxt('data/review/books/xae')

In [351]:
xaf = parseTxt('data/review/books/xaf')

In [370]:
xag = parseTxt('data/review/books/xag')

In [366]:
xah = parseTxt('data/review/books/xah')

In [390]:
xai = parseTxt('data/review/books/xai')

In [391]:
xaj = parseTxt('data/review/books/xaj')

In [408]:
xak = parseTxt('data/review/books/xak')

In [409]:
xal = parseTxt('data/review/books/xal')

In [420]:
xam = parseTxt('data/review/books/xam')

In [421]:
xan = parseTxt('data/review/books/xan')
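
The fourteen cells above differ only in the file suffix. As a sketch of how they could be condensed, the helpers below generate the chunk names and apply a parser to each one. `chunk_names` and `load_chunks` are hypothetical names, and the default directory is simply the path used in the cells above.

```python
import string

def chunk_names(n_chunks=14, prefix='xa'):
    """Return names like 'xaa', 'xab', ... for the first n_chunks suffixes."""
    return [prefix + letter for letter in string.ascii_lowercase[:n_chunks]]

def load_chunks(parse, base_dir='data/review/books', n_chunks=14):
    """Apply `parse` to each chunk path and return {chunk name: result}."""
    return {
        name: parse(f'{base_dir}/{name}')
        for name in chunk_names(n_chunks)
    }
```

Calling `load_chunks(parseTxt)` would hold all fourteen frames in memory at once. If memory is tight, iterating over `chunk_names()` and processing one chunk at a time may be the safer choice.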

## 2.2 Data Trimming

### 2.2.1 Low Review Count Trimming

The original Books dataset is too big to be imported into the notebook.<br> It was split into 14 smaller txt files of around 1GB each. <br> At this stage, our biggest concern is trimming the dataset down as much as possible so that we end up with a dataset we can work with.

We will first drop any products with fewer than 50 reviews.

In [114]:
xaa_count= xaa.groupby('asin').asin.count()

In [115]:
#this will be our master series tracking ASINs with fewer than 50 reviews
master_50 = xaa_count[xaa_count<50]
len(master_50)

65281

In [145]:
#method to aid in series addition

def updateASIN(master, srs):
    
    print('master length: ', len(master))
    print('adding series length: ', len(srs))
    
    c=0
    dup = []
    
    for i in srs.index:
        if i in master.index:
            c+=1
            dup.append(i)
    
    print('number of overlapping ASIN: ',c)
    print('overlapping ASIN: ',dup)
    
    result = master.add(srs,fill_value=0)
    
    print('expected master length: ',len(master)+len(srs)-len(dup))
    
    print('actual master length: ', len(result))
    
    print('number of ASIN with 50 or more: ',len(result[result>49]))
    
    return result

In [129]:
#get the count for the next list of asin
tmp = xab.groupby('asin').asin.count()
tmp_50 = tmp[tmp<50]
len(tmp_50)

66793

In [130]:
new_master = updateASIN(master_50,tmp_50)

master length:  65281
adding series length:  66793
number of overlapping ASIN:  0
overlapping ASIN:  []
expected master length:  132074
actual master length:  132074
number of ASIN with 50 or more:  0


In [131]:
#get the count for the next list of asin
tmp = xac.groupby('asin').asin.count()
tmp_50 = tmp[tmp<50]
len(tmp_50)

67224

In [132]:
new_master_2 = updateASIN(new_master,tmp_50)

master length:  132074
adding series length:  67224
number of overlapping ASIN:  0
overlapping ASIN:  []
expected master length:  199298
actual master length:  199298
number of ASIN with 50 or more:  0


In [138]:
#get the count for the next list of asin
tmp = xad.groupby('asin').asin.count()
tmp_50 = tmp[tmp<50]
len(tmp_50)

63579

In [139]:
new_master_d = updateASIN(new_master_2,tmp_50)

master length:  199298
adding series length:  63579
number of overlapping ASIN:  0
overlapping ASIN:  []
expected master length:  262877
actual master length:  262877
number of ASIN with 50 or more:  0


In [143]:
#get the count for the next list of asin
tmp = xae.groupby('asin').asin.count()
tmp_50 = tmp[tmp<50]
len(tmp_50)

61770

In [146]:
new_master_e = updateASIN(new_master_d,tmp_50)

master length:  262877
adding series length:  61770
number of overlapping ASIN:  1
overlapping ASIN:  ['1931499926']
expected master length:  324646
actual master length:  324646
number of ASIN with 50 or more:  0


In [150]:
#get the count for the next list of asin
tmp = xaf.groupby('asin').asin.count()
tmp_50 = tmp[tmp<50]
len(tmp_50)

64052

In [151]:
new_master_f = updateASIN(new_master_e,tmp_50)

master length:  324646
adding series length:  64052
number of overlapping ASIN:  1
overlapping ASIN:  ['B000GY4ZYC']
expected master length:  388697
actual master length:  388697
number of ASIN with 50 or more:  0


In [157]:
#get the count for the next list of asin
tmp = xag.groupby('asin').asin.count()
tmp_50 = tmp[tmp<50]
len(tmp_50)

64496

In [158]:
new_master_g = updateASIN(new_master_f,tmp_50)

master length:  388697
adding series length:  64496
number of overlapping ASIN:  1
overlapping ASIN:  ['0877739447']
expected master length:  453192
actual master length:  453192
number of ASIN with 50 or more:  0


In [162]:
#get the count for the next list of asin
tmp = xah.groupby('asin').asin.count()
tmp_50 = tmp[tmp<50]
len(tmp_50)

66665

In [163]:
new_master_h = updateASIN(new_master_g,tmp_50)

master length:  453192
adding series length:  66665
number of overlapping ASIN:  0
overlapping ASIN:  []
expected master length:  519857
actual master length:  519857
number of ASIN with 50 or more:  0


In [165]:
#get the count for the next list of asin
tmp = xai.groupby('asin').asin.count()
tmp_50 = tmp[tmp<50]
len(tmp_50)

60767

In [166]:
new_master_i = updateASIN(new_master_h,tmp_50)

master length:  519857
adding series length:  60767
number of overlapping ASIN:  1
overlapping ASIN:  ['0810992493']
expected master length:  580623
actual master length:  580623
number of ASIN with 50 or more:  0


In [172]:
#get the count for the next list of asin
tmp = xaj.groupby('asin').asin.count()
tmp_50 = tmp[tmp<50]
len(tmp_50)

64643

In [173]:
new_master_j = updateASIN(new_master_i,tmp_50)

master length:  580623
adding series length:  64643
number of overlapping ASIN:  0
overlapping ASIN:  []
expected master length:  645266
actual master length:  645266
number of ASIN with 50 or more:  0


In [176]:
#get the count for the next list of asin
tmp = xak.groupby('asin').asin.count()
tmp_50 = tmp[tmp<50]
len(tmp_50)

66562

In [177]:
new_master_k = updateASIN(new_master_j,tmp_50)

master length:  645266
adding series length:  66562
number of overlapping ASIN:  0
overlapping ASIN:  []
expected master length:  711828
actual master length:  711828
number of ASIN with 50 or more:  0


In [180]:
#get the count for the next list of asin
tmp = xal.groupby('asin').asin.count()
tmp_50 = tmp[tmp<50]
len(tmp_50)

66052

In [181]:
new_master_l = updateASIN(new_master_k,tmp_50)

master length:  711828
adding series length:  66052
number of overlapping ASIN:  0
overlapping ASIN:  []
expected master length:  777880
actual master length:  777880
number of ASIN with 50 or more:  0


In [184]:
#get the count for the next list of asin
tmp = xam.groupby('asin').asin.count()
tmp_50 = tmp[tmp<50]
len(tmp_50)

65052

In [185]:
new_master_m = updateASIN(new_master_l,tmp_50)

master length:  777880
adding series length:  65052
number of overlapping ASIN:  0
overlapping ASIN:  []
expected master length:  842932
actual master length:  842932
number of ASIN with 50 or more:  0


In [188]:
#get the count for the next list of asin
tmp = xan.groupby('asin').asin.count()
tmp_50 = tmp[tmp<50]
len(tmp_50)

47670

In [189]:
new_master_n = updateASIN(new_master_m,tmp_50)

master length:  842932
adding series length:  47670
number of overlapping ASIN:  0
overlapping ASIN:  []
expected master length:  890602
actual master length:  890602
number of ASIN with 50 or more:  0


The final list of ASINs with fewer than 50 reviews has been derived. <br> Now, each dataset will be compared against our master ASIN list, and any rows containing those ASIN values will be removed.
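
The chunk-by-chunk bookkeeping above can also be written as one function. The sketch below is a variation, not the notebook's exact procedure: it sums the full per-chunk counts first and applies the threshold once at the end, whereas the cells above filter each chunk to counts below 50 before merging. The two approaches can differ for an ASIN whose reviews straddle a chunk boundary. `low_review_asins` and `min_reviews` are hypothetical names.

```python
import pandas as pd

def low_review_asins(chunks, min_reviews=50):
    """Return the ASINs whose total review count is below min_reviews.

    `chunks` is a list of DataFrames that each have an 'asin' column.
    """
    # One count series per chunk, indexed by ASIN.
    per_chunk = [df.groupby('asin').asin.count() for df in chunks]
    # Stack the series and sum the counts of ASINs that appear more than once.
    total = pd.concat(per_chunk).groupby(level=0).sum()
    return total[total < min_reviews].index
```

The returned index can be passed to `isin` in the same way as `remove_these`.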

In [193]:
#save the index
remove_these = new_master_n.index

In [194]:
remove_these

Index(['0000000868', '0000020214', '0000024341', '0000025240', '0000038504',
       '0000913154', '0001024043', '0001035649', '0001042335', '0001046349',
       ...
       'B0014S49YU', 'B001EMUIWI', 'B001EYEHIW', 'B002PWLQ04', 'B0030EY97I',
       'B00354WUGA', 'B0038EPHIU', 'B003A8D2YU', 'B003IMNND8', 'B004TU2WYO'],
      dtype='object', name='asin', length=890602)

### 2.2.2 Columns 'reviewText' and 'summary' Combining

In this section, the Books review dataset will be trimmed to remove any unnecessary features.  <br> Our primary goal is to reduce the overall size of the dataset.  <br>The dataset is currently broken down into 14 chunks.

**Low Review Count ASIN will be removed in this section as well.**

In [355]:
#this code will be reused for all the categories, so it is better to wrap it in methods

def findUniqueLength(srs):
    """
    Finds unique lengths of data values

    Takes in Pandas Series and outputs unique lengths in the series
    """

    return {len(x) for x in srs}


def analyzeNull(df):
    """
    Analyzes review dataset for null

    Takes in dataframe and outputs various null and valid related information
    """
    
    print('Unique overall values: ',df.overall.unique())

    print('ASIN value lengths: ',findUniqueLength(df.asin))
    
    
    #print('Example of valid title data: \n',df.title.values[:1])
    print('Any empty string title values: ',len(df.loc[df.title == ''].title))

    print('\n')
    #print('Examples of valid reviewText data: \n',df.reviewText.values[:1])
    print('Any empty string reviewText values: ',len(df.loc[df.reviewText == ''].reviewText))

    print('\n')
    #print('Examples of valid summary data: \n',df.summary.values[:1])
    print('Any empty string summary values: ', len(df.loc[df.summary == ''].summary))

def analyzeReview(df):
    

    print('Any empty string review values: ',len(df.loc[df.review == ''].review))

def analyzeEmptyTitle(df):
    
    empty_title = df.loc[df.title == '']
    
    print('Any empty string title values: ',len(empty_title.title))
    
    
    uniq = empty_title.asin.unique()
    print('How many unique ASIN with empty title: ',len(uniq))
    
    in_df = df.loc[df.asin.isin(uniq)]
    print('Are there other reviews to pull title from: ',len(in_df.loc[in_df.title!='']))

#### Chunk1 : re_b1

In [256]:
xaa.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 939056 entries, 0 to 939055
Data columns (total 10 columns):
asin            939056 non-null object
title           939056 non-null object
price           939056 non-null object
reviewerID      939056 non-null object
reviewerName    939056 non-null object
helpfulness     939056 non-null object
overall         939056 non-null object
timestamp       939056 non-null object
summary         939056 non-null object
reviewText      939056 non-null object
dtypes: object(10)
memory usage: 78.8+ MB


All the remaining chunks follow the same structure.  <br> As discussed in the previous chapter, **1. Initial Data Exploration**, the columns price, reviewerName, helpfulness, and timestamp will be removed.

In [257]:
#drop unneeded columns
xaa.drop(columns=['price','reviewerName','helpfulness','timestamp'],inplace=True)

In [265]:
xaa = xaa[~xaa.asin.isin(remove_these)]

In [269]:
analyzeNull(xaa)

Unique overall values:  ['5.0' '3.0' '1.0' '4.0' '2.0']
ASIN value lengths:  {10}
Example of valid title data: 
 ['Night World: Daughters Of Darkness']
Any empty string title values:  0


Examples of valid reviewText data: 
 ['This is 1 of da bst books dat i have EVER read! @ my school, we are doing a play on this & im playin Mary-Lynette. i cant wait 2 get to the last chapters when they finally give in 2 each other! Gr8 books!']
Any empty string reviewText values:  1


Examples of valid summary data: 
 ['BEST BOOK EVER!!']
Any empty string summary values:  3


In the later stages of the project, the summary and reviewText fields will be passed to CountVectorizer to create a similarity matrix.<br> Therefore, it is important to have at least one of the reviewText or summary columns populated. 


As there is no distinction between reviewText and summary for this purpose, the two columns will first be combined into one, and rows will then be removed if the combined field is also empty.
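
A minimal sketch of this combine-then-filter step is shown below. It differs from the notebook's plain `+` concatenation in one respect that is an assumption on our part: it inserts a separator so that the last word of the review text and the first word of the summary do not run together. `combine_review_columns` and `sep` are hypothetical names.

```python
import pandas as pd

def combine_review_columns(df, sep=' '):
    """Return a copy with 'reviewText' and 'summary' merged into 'review'."""
    out = df.copy()
    # na_rep='' treats missing values like empty strings.
    combined = out['reviewText'].str.cat(out['summary'], sep=sep, na_rep='')
    # strip() removes the lone separator left when one side is empty.
    out['review'] = combined.str.strip()
    # Keep only rows where the combined text is non-empty.
    out = out[out['review'] != '']
    return out.drop(columns=['reviewText', 'summary'])
```

Working on a copy also avoids assigning a new column to a filtered slice of the original frame.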


In [270]:
#combine the two columns
xaa["review"] = xaa["reviewText"] + xaa["summary"]

In [272]:
analyzeReview(xaa)

Example of valid review data: 
 ['This is 1 of da bst books dat i have EVER read! @ my school, we are doing a play on this & im playin Mary-Lynette. i cant wait 2 get to the last chapters when they finally give in 2 each other! Gr8 books!BEST BOOK EVER!!']
Any empty string review values:  0


In [273]:
#drop reviewText, summary columns
xaa.drop(columns=['reviewText','summary'],inplace=True)

In [274]:
xaa.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 559361 entries, 391 to 939030
Data columns (total 5 columns):
asin          559361 non-null object
title         559361 non-null object
reviewerID    559361 non-null object
overall       559361 non-null object
review        559361 non-null object
dtypes: object(5)
memory usage: 25.6+ MB


**The same process will be repeated for the remaining 13 chunks.**
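
Since the steps are identical for every chunk, they could be gathered into a single helper. The following is a sketch under that assumption; `trim_chunk` and `UNUSED_COLUMNS` are hypothetical names, and the diagnostic calls (`analyzeNull`, `analyzeReview`) are left out.

```python
import pandas as pd

# Columns dropped in the Chunk1 walkthrough.
UNUSED_COLUMNS = ['price', 'reviewerName', 'helpfulness', 'timestamp']

def trim_chunk(df, remove_these):
    """Apply the Chunk1 trimming steps to one chunk and return the result."""
    # Drop the unneeded columns without modifying the input frame.
    out = df.drop(columns=UNUSED_COLUMNS)
    # Remove rows whose ASIN is on the low-review-count list.
    out = out[~out['asin'].isin(remove_these)].copy()
    # Combine the two text columns, as in the notebook.
    out['review'] = out['reviewText'] + out['summary']
    return out.drop(columns=['reviewText', 'summary'])
```

For example, `xab = trim_chunk(xab, remove_these)` would stand in for the drop, filter, combine, and drop cells of a chunk.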

#### Chunk2 : re_b2

In [334]:
xab.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 937222 entries, 0 to 937221
Data columns (total 10 columns):
asin            937222 non-null object
title           937222 non-null object
price           937222 non-null object
reviewerID      937222 non-null object
reviewerName    937222 non-null object
helpfulness     937222 non-null object
overall         937222 non-null object
timestamp       937222 non-null object
summary         937222 non-null object
reviewText      937222 non-null object
dtypes: object(10)
memory usage: 78.7+ MB


In [335]:
#drop unneeded columns
xab.drop(columns=['price','reviewerName','helpfulness','timestamp'],inplace=True)

#remove low count reviews
xab = xab[~xab.asin.isin(remove_these)]

In [336]:
analyzeNull(xab)

Unique overall values:  ['5.0' '3.0' '2.0' '4.0' '1.0']
ASIN value lengths:  {10}
Example of valid title data: 
 ['Dead Run (Monkeewrench Series)']
Any empty string title values:  185


Examples of valid reviewText data: 
 ["First, this is a thriller not a mystery. It's a -- Can the good guys escape trouble themselves and stop the bad guys in time -- type of book. Second, the characters are way overblown, in the James Bond or Indianna Jones tradition. And they are not people I felt close to or felt like I even knew very well. But all of that said, it's a quick and entertaining, well written book."]
Any empty string reviewText values:  3


Examples of valid summary data: 
 ['entertaining though overblown']
Any empty string summary values:  3


In [337]:
analyzeEmptyTitle(xab)

Any empty string title values:  185
How many unique ASIN with empty title:  1
Are there other reviews to pull title from:  0


Empty titles need to be dealt with.  If other reviews for the same ASIN carry a title, we can use it to fill in the missing values.  <br> If such a title is found, it will be filled in. Otherwise, the ASIN will be saved to a missing-titles list for later.
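
The fill-or-record logic described here could look like the sketch below. None of the chunks shown in this section had a title to borrow, so the fill branch is illustrative only. `fill_missing_titles` is a hypothetical name.

```python
import pandas as pd

def fill_missing_titles(df):
    """Fill empty titles from other rows with the same ASIN, where possible.

    Returns the filled DataFrame and the list of ASINs still lacking a title.
    """
    out = df.copy()
    # First non-empty title seen for each ASIN.
    known = out.loc[out['title'] != ''].groupby('asin')['title'].first()
    empty = out['title'] == ''
    # Look up a title by ASIN; ASINs without one stay as empty strings.
    out.loc[empty, 'title'] = out.loc[empty, 'asin'].map(known).fillna('')
    still_missing = out.loc[out['title'] == '', 'asin'].unique().tolist()
    return out, still_missing
```

The second return value plays the role of the missing-titles list kept for later.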

In [338]:
#this will be our missing title list
missing_title_asin = []
missing_title_asin.append(xab.loc[xab.title==''].asin.unique()[0])

In [340]:
#combine the two columns
xab["review"] = xab["reviewText"] + xab["summary"]

analyzeReview(xab)

Example of valid review data: 
 ["First, this is a thriller not a mystery. It's a -- Can the good guys escape trouble themselves and stop the bad guys in time -- type of book. Second, the characters are way overblown, in the James Bond or Indianna Jones tradition. And they are not people I felt close to or felt like I even knew very well. But all of that said, it's a quick and entertaining, well written book.entertaining though overblown"]
Any empty string review values:  0


In [341]:
#drop reviewText, summary columns
xab.drop(columns=['reviewText','summary'],inplace=True)

xab.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 550043 entries, 2 to 937221
Data columns (total 5 columns):
asin          550043 non-null object
title         550043 non-null object
reviewerID    550043 non-null object
overall       550043 non-null object
review        550043 non-null object
dtypes: object(5)
memory usage: 25.2+ MB


#### Chunk3 : re_b3

In [319]:
xac.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 933052 entries, 0 to 933051
Data columns (total 10 columns):
asin            933052 non-null object
title           933052 non-null object
price           933052 non-null object
reviewerID      933052 non-null object
reviewerName    933052 non-null object
helpfulness     933052 non-null object
overall         933052 non-null object
timestamp       933052 non-null object
summary         933052 non-null object
reviewText      933052 non-null object
dtypes: object(10)
memory usage: 78.3+ MB


In [320]:
#drop unneeded columns
xac.drop(columns=['price','reviewerName','helpfulness','timestamp'],inplace=True)

#remove low count reviews
xac = xac[~xac.asin.isin(remove_these)]

In [321]:
analyzeNull(xac)

Unique overall values:  ['4.0' '5.0' '2.0' '3.0' '1.0']
ASIN value lengths:  {10}
Example of valid title data: 
 ['Heart of Darkness. (Heritage Club Series)']
Any empty string title values:  0


Examples of valid reviewText data: 
 ["I'm not going to review of Conrad's actual work. Too much ink has been spilled on that. I just want to give my thoughts on the free kindle edition.Since it's free, you can't argue with the price. However, you sort of get what you pay for. The chapters are not keyed, making navigation a tad more difficult. This lack is almost a non-issue, however, since the book is so short. The text also contains quite a few typos. Most damaging to the reading experience is the somewhat sloppy paragraphing.These issues are minor, though. Since this edition is free, one shouldn't expect a pristine text."]
Any empty string reviewText values:  2


Examples of valid summary data: 
 ['Review of the Free Kindle Edition']
Any empty string summary values:  4


In [322]:
#combine the two columns
xac["review"] = xac["reviewText"] + xac["summary"]

analyzeReview(xac)

Example of valid review data: 
 ["I'm not going to review of Conrad's actual work. Too much ink has been spilled on that. I just want to give my thoughts on the free kindle edition.Since it's free, you can't argue with the price. However, you sort of get what you pay for. The chapters are not keyed, making navigation a tad more difficult. This lack is almost a non-issue, however, since the book is so short. The text also contains quite a few typos. Most damaging to the reading experience is the somewhat sloppy paragraphing.These issues are minor, though. Since this edition is free, one shouldn't expect a pristine text.Review of the Free Kindle Edition"]
Any empty string review values:  0


In [323]:
#drop reviewText, summary columns
xac.drop(columns=['reviewText','summary'],inplace=True)

xac.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 538063 entries, 0 to 933051
Data columns (total 5 columns):
asin          538063 non-null object
title         538063 non-null object
reviewerID    538063 non-null object
overall       538063 non-null object
review        538063 non-null object
dtypes: object(5)
memory usage: 24.6+ MB


#### Chunk4 : re_b4

In [327]:
xad.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 937273 entries, 0 to 937272
Data columns (total 10 columns):
asin            937273 non-null object
title           937273 non-null object
price           937273 non-null object
reviewerID      937273 non-null object
reviewerName    937273 non-null object
helpfulness     937273 non-null object
overall         937273 non-null object
timestamp       937273 non-null object
summary         937273 non-null object
reviewText      937273 non-null object
dtypes: object(10)
memory usage: 78.7+ MB


In [328]:
#drop unneeded columns
xad.drop(columns=['price','reviewerName','helpfulness','timestamp'],inplace=True)

#remove low count reviews
xad = xad[~xad.asin.isin(remove_these)]

analyzeNull(xad)

Unique overall values:  ['5.0' '4.0' '3.0' '2.0' '1.0']
ASIN value lengths:  {10}
Example of valid title data: 
 ['The Surrendered Wife: A Practical Guide To Finding Intimacy, Passion and Peace']
Any empty string title values:  100


Examples of valid reviewText data: 
 ["Here's a question for you: do you think men and women have the same temperament and needs?This book is not based on equality (although the author is all for equality in the workplace and equal pay for equal work). It's based on the differencess between men and women. In my life experience and in reading, I believe the majority of men and women do have different needs.I don't mean that as a blanket statement in every case... there is a wide variety in nature. Some men have high sex drives, some have low sex drives, some men are more androgynous and would probably prefer a marriage based on equality rather than differences. I celebrate variety in life.Some books that have led me to the conclusion that men and women have

In [329]:
analyzeEmptyTitle(xad)

Any empty string title values:  100
How many unique ASIN with empty title:  1
Are there other reviews to pull title from:  0


In [343]:
#append to missing asin list as fix is not possible at this time.
missing_title_asin.append(xad.loc[xad.title==''].asin.unique()[0]) #[0] is to save the item only not a list

In [345]:
#combine the two columns
xad["review"] = xad["reviewText"] + xad["summary"]

analyzeReview(xad)

Example of valid review data: 
 ["Here's a question for you: do you think men and women have the same temperament and needs?This book is not based on equality (although the author is all for equality in the workplace and equal pay for equal work). It's based on the differencess between men and women. In my life experience and in reading, I believe the majority of men and women do have different needs.I don't mean that as a blanket statement in every case... there is a wide variety in nature. Some men have high sex drives, some have low sex drives, some men are more androgynous and would probably prefer a marriage based on equality rather than differences. I celebrate variety in life.Some books that have led me to the conclusion that men and women have different needs (besides my own life experience).....Self-Made Man: One Woman's Year Disguised as a Manwhich is about a butch gay woman masquerading as a man for over a year. She felt emotionally stifled taking on a man's role in life eve

In [346]:
#drop reviewText, summary columns
xad.drop(columns=['reviewText','summary'],inplace=True)

xad.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 571565 entries, 0 to 936847
Data columns (total 5 columns):
asin          571565 non-null object
title         571565 non-null object
reviewerID    571565 non-null object
overall       571565 non-null object
review        571565 non-null object
dtypes: object(5)
memory usage: 26.2+ MB


#### Chunk5 : re_b5

In [353]:
xae.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 938624 entries, 0 to 938623
Data columns (total 10 columns):
asin            938624 non-null object
title           938624 non-null object
price           938624 non-null object
reviewerID      938624 non-null object
reviewerName    938624 non-null object
helpfulness     938624 non-null object
overall         938624 non-null object
timestamp       938624 non-null object
summary         938624 non-null object
reviewText      938624 non-null object
dtypes: object(10)
memory usage: 78.8+ MB


In [354]:
#drop unneeded columns
xae.drop(columns=['price','reviewerName','helpfulness','timestamp'],inplace=True)

#remove low count reviews
xae = xae[~xae.asin.isin(remove_these)]

analyzeNull(xae)

Unique overall values:  ['5.0' '3.0' '4.0' '1.0' '2.0']
ASIN value lengths:  {10}
Example of valid title data: 
 ['Absalom, Absalom! (100 Greatest Masterpieces of American Literature)']
Any empty string title values:  0


Examples of valid reviewText data: 
 ['Imagine a line drawn in red dust. A stick dragging through creates mounds on either side that are too small to appreciate. Then the men: people, a conglomeration of classes and sexes, with race usually being white, walking on either side of the line. Then they are standing, yelling curt insults and blasphemies at each other....This sad situation best describes critical appraisal of William Faulkner and his novels. I find myself one of his supporters, though to tell the truth I understand why others don\'t like him and I can understand their aversion to him. First of all, it isn\'t that they are less intelligent than Faulkner fans, (V. Nabokov, one of the best literary minds of the 20th century, hated Faulkner\'s works and called 

In [356]:
#combine the two columns
xae["review"] = xae["reviewText"] + xae["summary"]

analyzeReview(xae)

Any empty string review values:  0


In [357]:
#drop reviewText, summary columns
xae.drop(columns=['reviewText','summary'],inplace=True)

xae.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 578301 entries, 298 to 938512
Data columns (total 5 columns):
asin          578301 non-null object
title         578301 non-null object
reviewerID    578301 non-null object
overall       578301 non-null object
review        578301 non-null object
dtypes: object(5)
memory usage: 26.5+ MB


#### Chunk6 : re_b6

In [358]:
xaf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 935959 entries, 0 to 935958
Data columns (total 10 columns):
asin            935959 non-null object
title           935959 non-null object
price           935959 non-null object
reviewerID      935959 non-null object
reviewerName    935959 non-null object
helpfulness     935959 non-null object
overall         935959 non-null object
timestamp       935959 non-null object
summary         935959 non-null object
reviewText      935959 non-null object
dtypes: object(10)
memory usage: 78.5+ MB


In [359]:
#drop unneeded columns
xaf.drop(columns=['price','reviewerName','helpfulness','timestamp'],inplace=True)

#remove low count reviews
xaf = xaf[~xaf.asin.isin(remove_these)]

analyzeNull(xaf)

Unique overall values:  ['1.0' '5.0' '4.0' '3.0' '2.0']
ASIN value lengths:  {10}
Any empty string title values:  0


Any empty string reviewText values:  0


Any empty string summary values:  2


In [360]:
#combine the two columns
xaf["review"] = xaf["reviewText"] + xaf["summary"]

analyzeReview(xaf)

Any empty string review values:  0


In [361]:
#drop reviewText, summary columns
xaf.drop(columns=['reviewText','summary'],inplace=True)

xaf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 561844 entries, 316 to 935734
Data columns (total 5 columns):
asin          561844 non-null object
title         561844 non-null object
reviewerID    561844 non-null object
overall       561844 non-null object
review        561844 non-null object
dtypes: object(5)
memory usage: 25.7+ MB


#### Chunk7 : re_b7

In [371]:
xag.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 940691 entries, 0 to 940690
Data columns (total 10 columns):
asin            940691 non-null object
title           940691 non-null object
price           940691 non-null object
reviewerID      940691 non-null object
reviewerName    940691 non-null object
helpfulness     940691 non-null object
overall         940691 non-null object
timestamp       940691 non-null object
summary         940691 non-null object
reviewText      940691 non-null object
dtypes: object(10)
memory usage: 78.9+ MB


In [372]:
#drop unneeded columns
xag.drop(columns=['price','reviewerName','helpfulness','timestamp'],inplace=True)

#remove low count reviews
xag = xag[~xag.asin.isin(remove_these)]

analyzeNull(xag)

Unique overall values:  ['5.0' '3.0' '4.0' '2.0' '1.0']
ASIN value lengths:  {10}
Any empty string title values:  194


Any empty string reviewText values:  0


Any empty string summary values:  7


In [373]:
analyzeEmptyTitle(xag)

Any empty string title values:  194
How many unique ASIN with empty title:  2
Are there other reviews to pull title from:  0


In [379]:
#append to missing asin list as fix is not possible at this time.
missing_title_asin.append(xag.loc[xag.title==''].asin.unique()[0]) #[0] is to save the item only not a list
missing_title_asin.append(xag.loc[xag.title==''].asin.unique()[1])

In [381]:
#combine the two columns
xag["review"] = xag["reviewText"] + xag["summary"]

analyzeReview(xag)

Any empty string review values:  0


In [382]:
#drop reviewText, summary columns
xag.drop(columns=['reviewText','summary'],inplace=True)

xag.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 567496 entries, 66 to 940690
Data columns (total 5 columns):
asin          567496 non-null object
title         567496 non-null object
reviewerID    567496 non-null object
overall       567496 non-null object
review        567496 non-null object
dtypes: object(5)
memory usage: 26.0+ MB


#### Chunk8 : re_b8

In [383]:
xah.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 948660 entries, 0 to 948659
Data columns (total 10 columns):
asin            948660 non-null object
title           948660 non-null object
price           948660 non-null object
reviewerID      948660 non-null object
reviewerName    948660 non-null object
helpfulness     948660 non-null object
overall         948660 non-null object
timestamp       948660 non-null object
summary         948660 non-null object
reviewText      948660 non-null object
dtypes: object(10)
memory usage: 79.6+ MB


In [384]:
#drop unneeded columns
xah.drop(columns=['price','reviewerName','helpfulness','timestamp'],inplace=True)

#remove low count reviews
xah = xah[~xah.asin.isin(remove_these)]

analyzeNull(xah)

Unique overall values:  ['5.0' '4.0' '3.0' '2.0' '1.0']
ASIN value lengths:  {10}
Any empty string title values:  0


Any empty string reviewText values:  0


Any empty string summary values:  3


In [385]:
#combine the two columns
xah["review"] = xah["reviewText"] + xah["summary"]

analyzeReview(xah)

Any empty string review values:  0


In [386]:
#drop reviewText, summary columns
xah.drop(columns=['reviewText','summary'],inplace=True)

xah.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 558092 entries, 0 to 948583
Data columns (total 5 columns):
asin          558092 non-null object
title         558092 non-null object
reviewerID    558092 non-null object
overall       558092 non-null object
review        558092 non-null object
dtypes: object(5)
memory usage: 25.5+ MB
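
The same four steps (drop unneeded columns, filter on `remove_these`, combine the text columns, drop the originals) are repeated for every chunk. As a minimal sketch, assuming the column names shown in the `info()` output above, they could be wrapped in one helper. `clean_chunk` and the demo frame are hypothetical and not part of the original notebook. The helper returns a new frame instead of modifying the chunk in place.

```python
import pandas as pd

def clean_chunk(df, remove_these):
    """Hypothetical helper mirroring the per-chunk cells above."""
    # drop columns that are not needed downstream
    df = df.drop(columns=['price', 'reviewerName', 'helpfulness', 'timestamp'])
    # remove reviews whose ASIN is listed in remove_these
    df = df[~df.asin.isin(remove_these)].copy()
    # combine review text and summary (no separator, as in the cells above)
    df['review'] = df['reviewText'] + df['summary']
    # drop the two source columns
    return df.drop(columns=['reviewText', 'summary'])

# tiny made-up frame with the same ten columns as a raw chunk
demo = pd.DataFrame({
    'asin': ['A000000001', 'A000000002'],
    'title': ['Book one', 'Book two'],
    'price': ['unknown', '9.99'],
    'reviewerID': ['U1', 'U2'],
    'reviewerName': ['a', 'b'],
    'helpfulness': ['0/0', '1/2'],
    'overall': ['5.0', '3.0'],
    'timestamp': ['1', '2'],
    'summary': ['Nice', ''],
    'reviewText': ['Good read.', 'Average.'],
})
cleaned = clean_chunk(demo, {'A000000002'})
print(cleaned.columns.tolist())
```

The null checks (`analyzeNull`, `analyzeReview`) are left out of the helper so that their output can still be inspected chunk by chunk.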


#### Chunk9 : re_b9

In [392]:
xai.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 946304 entries, 0 to 946303
Data columns (total 10 columns):
asin            946304 non-null object
title           946304 non-null object
price           946304 non-null object
reviewerID      946304 non-null object
reviewerName    946304 non-null object
helpfulness     946304 non-null object
overall         946304 non-null object
timestamp       946304 non-null object
summary         946304 non-null object
reviewText      946304 non-null object
dtypes: object(10)
memory usage: 79.4+ MB


In [393]:
#drop unneeded columns
xai.drop(columns=['price','reviewerName','helpfulness','timestamp'],inplace=True)

#remove low count reviews
xai = xai[~xai.asin.isin(remove_these)]

analyzeNull(xai)

Unique overall values:  ['3.0' '4.0' '5.0' '1.0' '2.0']
ASIN value lengths:  {10}
Any empty string title values:  174


Any empty string reviewText values:  0


Any empty string summary values:  6


In [394]:
analyzeEmptyTitle(xai)

Any empty string title values:  174
How many unique ASIN with empty title:  1
Are there other reviews to pull title from:  0


In [395]:
#record the ASIN with an empty title; no other review carries a title to copy from, so it cannot be fixed here
missing_title_asin.append(xai.loc[xai.title==''].asin.unique()[0]) #[0] takes the single ASIN string rather than the whole array
missing_title_asin

['0912411201', '0736668144', '0792734890', 'B00006B6YY', 'B00005WCAQ']
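
Indexing with `[0]` works here because `analyzeEmptyTitle` reports exactly one ASIN with an empty title. If a chunk ever contained several, a variant along the following lines might be used instead. This is only a sketch on a made-up frame; the demo ASINs are placeholders.

```python
import pandas as pd

# starting list (one real entry from above, for illustration)
missing_title_asin = ['0912411201']

# made-up chunk: two different ASINs have an empty title
demo = pd.DataFrame({
    'asin': ['A000000001', 'A000000001', 'A000000002', 'A000000003'],
    'title': ['', '', 'Some Title', ''],
})

# all unique ASINs with an empty title, in order of first appearance
empty_asins = demo.loc[demo.title == ''].asin.unique().tolist()

# add each one once, skipping ASINs already recorded
missing_title_asin.extend(a for a in empty_asins if a not in missing_title_asin)
print(missing_title_asin)
```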

In [396]:
#combine the two columns
xai["review"] = xai["reviewText"] + xai["summary"]

analyzeReview(xai)

Any empty string review values:  0


In [397]:
#drop reviewText, summary columns
xai.drop(columns=['reviewText','summary'],inplace=True)

xai.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 589415 entries, 131 to 946303
Data columns (total 5 columns):
asin          589415 non-null object
title         589415 non-null object
reviewerID    589415 non-null object
overall       589415 non-null object
review        589415 non-null object
dtypes: object(5)
memory usage: 27.0+ MB


#### Chunk10 : re_b10

In [398]:
xaj.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 939582 entries, 0 to 939581
Data columns (total 10 columns):
asin            939582 non-null object
title           939582 non-null object
price           939582 non-null object
reviewerID      939582 non-null object
reviewerName    939582 non-null object
helpfulness     939582 non-null object
overall         939582 non-null object
timestamp       939582 non-null object
summary         939582 non-null object
reviewText      939582 non-null object
dtypes: object(10)
memory usage: 78.9+ MB


In [399]:
#drop unneeded columns
xaj.drop(columns=['price','reviewerName','helpfulness','timestamp'],inplace=True)

#remove low count reviews
xaj = xaj[~xaj.asin.isin(remove_these)]

analyzeNull(xaj)

Unique overall values:  ['4.0' '5.0' '3.0' '1.0' '2.0']
ASIN value lengths:  {10}
Any empty string title values:  0


Any empty string reviewText values:  2


Any empty string summary values:  1


In [400]:
#combine the two columns
xaj["review"] = xaj["reviewText"] + xaj["summary"]

analyzeReview(xaj)

Any empty string review values:  0


In [401]:
#drop reviewText, summary columns
xaj.drop(columns=['reviewText','summary'],inplace=True)

xaj.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 565701 entries, 532 to 939581
Data columns (total 5 columns):
asin          565701 non-null object
title         565701 non-null object
reviewerID    565701 non-null object
overall       565701 non-null object
review        565701 non-null object
dtypes: object(5)
memory usage: 25.9+ MB


#### Chunk11 : re_b11

In [410]:
xak.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 936059 entries, 0 to 936058
Data columns (total 10 columns):
asin            936059 non-null object
title           936059 non-null object
price           936059 non-null object
reviewerID      936059 non-null object
reviewerName    936059 non-null object
helpfulness     936059 non-null object
overall         936059 non-null object
timestamp       936059 non-null object
summary         936059 non-null object
reviewText      936059 non-null object
dtypes: object(10)
memory usage: 78.6+ MB


In [411]:
#drop unneeded columns
xak.drop(columns=['price','reviewerName','helpfulness','timestamp'],inplace=True)

#remove low count reviews
xak = xak[~xak.asin.isin(remove_these)]

analyzeNull(xak)

Unique overall values:  ['5.0' '4.0' '1.0' '3.0' '2.0']
ASIN value lengths:  {10}
Any empty string title values:  0


Any empty string reviewText values:  3


Any empty string summary values:  2


In [412]:
#combine the two columns
xak["review"] = xak["reviewText"] + xak["summary"]

analyzeReview(xak)

Any empty string review values:  0


In [413]:
#drop reviewText, summary columns
xak.drop(columns=['reviewText','summary'],inplace=True)

xak.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 543730 entries, 0 to 936058
Data columns (total 5 columns):
asin          543730 non-null object
title         543730 non-null object
reviewerID    543730 non-null object
overall       543730 non-null object
review        543730 non-null object
dtypes: object(5)
memory usage: 24.9+ MB


#### Chunk12 : re_b12

In [414]:
xal.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 935726 entries, 0 to 935725
Data columns (total 10 columns):
asin            935726 non-null object
title           935726 non-null object
price           935726 non-null object
reviewerID      935726 non-null object
reviewerName    935726 non-null object
helpfulness     935726 non-null object
overall         935726 non-null object
timestamp       935726 non-null object
summary         935726 non-null object
reviewText      935726 non-null object
dtypes: object(10)
memory usage: 78.5+ MB


In [415]:
#drop unneeded columns
xal.drop(columns=['price','reviewerName','helpfulness','timestamp'],inplace=True)

#remove low count reviews
xal = xal[~xal.asin.isin(remove_these)]

analyzeNull(xal)

Unique overall values:  ['1.0' '5.0' '3.0' '2.0' '4.0']
ASIN value lengths:  {10}
Any empty string title values:  0


Any empty string reviewText values:  0


Any empty string summary values:  3


In [416]:
#combine the two columns
xal["review"] = xal["reviewText"] + xal["summary"]

analyzeReview(xal)

Any empty string review values:  0


In [417]:
#drop reviewText, summary columns
xal.drop(columns=['reviewText','summary'],inplace=True)

xal.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 551027 entries, 0 to 935101
Data columns (total 5 columns):
asin          551027 non-null object
title         551027 non-null object
reviewerID    551027 non-null object
overall       551027 non-null object
review        551027 non-null object
dtypes: object(5)
memory usage: 25.2+ MB


#### Chunk13 : re_b13

In [423]:
xam.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 936193 entries, 0 to 936192
Data columns (total 10 columns):
asin            936193 non-null object
title           936193 non-null object
price           936193 non-null object
reviewerID      936193 non-null object
reviewerName    936193 non-null object
helpfulness     936193 non-null object
overall         936193 non-null object
timestamp       936193 non-null object
summary         936193 non-null object
reviewText      936193 non-null object
dtypes: object(10)
memory usage: 78.6+ MB


In [424]:
#drop unneeded columns
xam.drop(columns=['price','reviewerName','helpfulness','timestamp'],inplace=True)

#remove low count reviews
xam = xam[~xam.asin.isin(remove_these)]

analyzeNull(xam)

Unique overall values:  ['3.0' '4.0' '5.0' '2.0' '1.0']
ASIN value lengths:  {10}
Any empty string title values:  0


Any empty string reviewText values:  0


Any empty string summary values:  3


In [425]:
#combine the two columns
xam["review"] = xam["reviewText"] + xam["summary"]

analyzeReview(xam)

Any empty string review values:  0


In [426]:
#drop reviewText, summary columns
xam.drop(columns=['reviewText','summary'],inplace=True)

xam.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 557375 entries, 105 to 936192
Data columns (total 5 columns):
asin          557375 non-null object
title         557375 non-null object
reviewerID    557375 non-null object
overall       557375 non-null object
review        557375 non-null object
dtypes: object(5)
memory usage: 25.5+ MB


#### Chunk14 : re_b14

In [427]:
xan.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 682087 entries, 0 to 682086
Data columns (total 10 columns):
asin            682087 non-null object
title           682087 non-null object
price           682087 non-null object
reviewerID      682087 non-null object
reviewerName    682087 non-null object
helpfulness     682087 non-null object
overall         682087 non-null object
timestamp       682087 non-null object
summary         682087 non-null object
reviewText      682087 non-null object
dtypes: object(10)
memory usage: 57.2+ MB


In [428]:
#drop unneeded columns
xan.drop(columns=['price','reviewerName','helpfulness','timestamp'],inplace=True)

#remove low count reviews
xan = xan[~xan.asin.isin(remove_these)]

analyzeNull(xan)

Unique overall values:  ['5.0' '4.0' '3.0' '2.0' '1.0']
ASIN value lengths:  {10}
Any empty string title values:  157


Any empty string reviewText values:  0


Any empty string summary values:  1


In [429]:
analyzeEmptyTitle(xan)

Any empty string title values:  157
How many unique ASIN with empty title:  1
Are there other reviews to pull title from:  0


In [430]:
#record the ASIN with an empty title; no other review carries a title to copy from, so it cannot be fixed here
missing_title_asin.append(xan.loc[xan.title==''].asin.unique()[0]) #[0] takes the single ASIN string rather than the whole array
missing_title_asin

['0912411201',
 '0736668144',
 '0792734890',
 'B00006B6YY',
 'B00005WCAQ',
 'B0000667ET']

In [431]:
#combine the two columns
xan["review"] = xan["reviewText"] + xan["summary"]

analyzeReview(xan)

Any empty string review values:  0


In [432]:
#drop reviewText, summary columns
xan.drop(columns=['reviewText','summary'],inplace=True)

xan.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 404097 entries, 0 to 681895
Data columns (total 5 columns):
asin          404097 non-null object
title         404097 non-null object
reviewerID    404097 non-null object
overall       404097 non-null object
review        404097 non-null object
dtypes: object(5)
memory usage: 18.5+ MB


## 2.3 Data Export

In [433]:
#every chunk is saved in the format below.
xam.to_csv(r'/Users/byungchankim/Downloads/Springboard/capstone2/data/re_b13.csv',index=False)

In [434]:
xan.to_csv(r'/Users/byungchankim/Downloads/Springboard/capstone2/data/re_b14.csv',index=False)
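
One point to watch when these files are loaded again: several ASINs start with a zero (for example `'0912411201'`), and `pd.read_csv` would by default parse such a column as integers and drop the leading zero. The sketch below shows one way to round-trip a chunk safely. The temporary directory and the demo frame are assumptions made for illustration, not the paths used above.

```python
import os
import tempfile

import pandas as pd

# made-up cleaned chunk with an ASIN that starts with zero and an empty title
demo = pd.DataFrame({
    'asin': ['0912411201'],
    'title': [''],
    'reviewerID': ['U1'],
    'overall': ['5.0'],
    'review': ['Good read.Nice'],
})

out_dir = tempfile.mkdtemp()          # stand-in for the data folder
path = os.path.join(out_dir, 're_demo.csv')
demo.to_csv(path, index=False)

# dtype=str keeps every column as text (ASIN keeps its leading zero);
# keep_default_na=False keeps empty titles as '' instead of NaN
back = pd.read_csv(path, dtype=str, keep_default_na=False)
print(back.asin.iloc[0], repr(back.title.iloc[0]))
```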

In [438]:
#save missing_title_asin
with open("missing.txt", "w") as f:
    f.write(str(missing_title_asin))

In [440]:
#save remove_these
with open("remove_these.txt", "w") as f:
    f.write(str(list(remove_these)))
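
Both files hold the `str()` representation of a Python list. As a minimal sketch, such a file can be turned back into a list with `ast.literal_eval`, which only evaluates literals. The temporary path below is an assumption for the demo; the notebook itself writes to the working directory.

```python
import ast
import os
import tempfile

# two of the ASINs recorded above, used as demo content
missing_title_asin = ['0912411201', '0736668144']

# write the list the same way as the cells above
path = os.path.join(tempfile.mkdtemp(), "missing.txt")
with open(path, "w") as f:
    f.write(str(missing_title_asin))

# read the text back and parse it into a list of strings
with open(path) as f:
    restored = ast.literal_eval(f.read())

print(restored == missing_title_asin)
```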