**Steps to download the data**

* We download the Amazon review dataset from https://nijianmo.github.io/amazon/index.html. We use the *Small subsets for experimentation* rather than the complete review data. These subsets are referred to as k-core and are available for several categories. We download only the 5-core reviews, not the ratings.

* The list of categories is specified in the file *productCategories.txt*. Even though these are *small* subsets, the reviews for some categories, such as Books, can be quite large (6.6GB). We selected the following categories to balance dataset size against category diversity:

    * AMAZON_FASHION
    * All_Beauty
    * Appliances
    * Arts_Crafts_and_Sewing
    * Automotive
    * Cell_Phones_and_Accessories
    * Digital_Music
    * Gift_Cards
    * Grocery_and_Gourmet_Food
    * Industrial_and_Scientific
    * Luxury_Beauty
    * Magazine_Subscriptions
    * Musical_Instruments
    * Office_Products
    * Patio_Lawn_and_Garden
    * Prime_Pantry
    * Software

* To change the selection, add or delete entries in *productCategories.txt*.

* Once the categories are set, run the provided shell script *getData.sh* with the command `sh getData.sh`. This downloads each category file into the current folder.

* Once the script has finished, run the cells that follow in this notebook to create the final dataset used in the chapter.
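
Before running the cells, it can help to confirm that the download produced a file for every listed category. The following is a minimal sketch and not part of the original notebook: it assumes each category is saved as `<category>_5.json.gz` (the naming that `get_category_data` relies on), and the helper names `read_categories` and `find_missing_files` are hypothetical.

```python
from pathlib import Path

def read_categories(path='productCategories.txt'):
    # One category per line; blank lines are skipped
    with open(path, 'r') as f:
        return [line.strip() for line in f if line.strip()]

def find_missing_files(categories, folder='.'):
    # Hypothetical helper: list the categories whose 5-core file is absent.
    # Assumes the file name pattern <category>_5.json.gz
    return [c for c in categories
            if not (Path(folder) / f'{c}_5.json.gz').exists()]
```

Calling `find_missing_files(read_categories())` from the download folder should return an empty list when every category file is present.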

In [1]:
import pandas as pd

In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize

def num_sentences(text):
    text = str(text)
    return len(sent_tokenize(text))

def get_category_data(cat_name):
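    # Load the gzipped JSON Lines file and drop the columns we do not need
    # (errors='ignore' skips any of these columns that a category file lacks)
    data = pd.read_json(cat_name+'_5.json.gz', lines=True)
    data = data.drop(columns=['style','reviewerName','vote','image'], errors='ignore')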
    print ('Original Number of Reviews in ', cat_name, ' - ', len(data))
    data = data.drop_duplicates()
    print ('Number of reviews after de-duplication - ', len(data))
        
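    # Keep reviews that contain exactly one sentence.
    # Missing text is stringified to 'nan' by num_sentences and counts as one
    # sentence, so it is filtered out explicitly afterwards.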
    data['num_sentences'] = data['reviewText'].apply(num_sentences)
    data = data[data['num_sentences'] == 1]
    data = data[~data['reviewText'].isna()]
    data = data.drop(columns=['num_sentences'])
    print ('Number of reviews with only one sentence - ', len(data))
    
    return data
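
The interaction between the one-sentence filter and missing review text can be seen on a toy frame. This is an illustrative sketch only: `naive_num_sentences` is a hypothetical regex-based stand-in for `sent_tokenize`, used here so the example runs without NLTK data, and its counts will not always match NLTK's.

```python
import re
import pandas as pd

def naive_num_sentences(text):
    # Stand-in for sent_tokenize (an assumption for this sketch):
    # count the non-empty chunks separated by '.', '!' or '?'
    return len([s for s in re.split(r'[.!?]+', str(text)) if s.strip()])

toy = pd.DataFrame({'reviewText': ['Great product.',
                                   'Loved it! Would buy again.',
                                   float('nan')]})
toy['num_sentences'] = toy['reviewText'].apply(naive_num_sentences)

# The missing value is stringified to 'nan' and counted as one sentence,
# so it has to be removed with a separate null check
one_sentence = toy[(toy['num_sentences'] == 1) & toy['reviewText'].notna()]
print(one_sentence['reviewText'].tolist())
```

Only the first review survives: the second has two sentences and the third has no text.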

In [5]:
# Create list of product categories
with open('productCategories.txt', 'r') as f:
    productCategories = []
    for line in f:
        if line.strip() != '':
            productCategories.append(line.rstrip('\n'))

In [None]:
# Load reviews for each category and concatenate them to create the dataset
# (DataFrame.append was removed in pandas 2.0, so we collect the frames and use pd.concat)
categoryDFs = [get_category_data(category) for category in productCategories]
finalDF = pd.concat(categoryDFs)

Original Number of Reviews in  AMAZON_FASHION  -  3176
Number of reviews after de-duplication -  3088
Number of reviews with only one sentence -  1144
Original Number of Reviews in  All_Beauty  -  5269
Number of reviews after de-duplication -  4223
Number of reviews with only one sentence -  1370
Original Number of Reviews in  Appliances  -  2277
Number of reviews after de-duplication -  205
Number of reviews with only one sentence -  117
Original Number of Reviews in  Arts_Crafts_and_Sewing  -  494485
Number of reviews after de-duplication -  447166
Number of reviews with only one sentence -  190688
Original Number of Reviews in  Automotive  -  1711519
Number of reviews after de-duplication -  1645944


In [32]:
len(finalDF)

2498117

In [33]:
finalDF['overall'].value_counts()

5    1875488
4     334975
3     143414
1      84851
2      59389
Name: overall, dtype: int64

In [34]:
# We intend to use rating scores to tag positive and negative sentiment
# A score of 3 is ambiguous, so we drop all reviews with this rating from the data
# (.copy() lets us add a column later without a SettingWithCopyWarning)
finalDF = finalDF[finalDF['overall'] != 3].copy()

In [35]:
# We tag all the reviews with a high rating [4,5] as positive sentiment (1)
# We tag all the reviews with a low rating [1,2] as negative sentiment (0)

finalDF['sentiment'] = 0
finalDF.loc[finalDF['overall'] > 3, 'sentiment'] = 1
finalDF.loc[finalDF['overall'] < 3, 'sentiment'] = 0

In [36]:
# To keep the dataset at a size we can work with easily while still capturing enough
# observations from both classes, we sample up to 150k observations from each sentiment class
def sampling_k_elements(group, k=150000):
    if len(group) < k:
        return group
    return group.sample(k, random_state=42)

finalDF = finalDF.groupby('sentiment').apply(sampling_k_elements).reset_index(drop=True)
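
The capping behaviour is easier to see on a toy frame: a class with more than `k` rows is down-sampled, while a smaller class is kept whole. The sketch below is an alternative formulation, not the notebook's code. The helper `sample_up_to_k` is hypothetical, and it uses an explicit loop with `pd.concat` so that it does not depend on how `groupby(...).apply` treats the grouping column, which has varied across pandas versions.

```python
import pandas as pd

def sample_up_to_k(df, label_col, k, seed=42):
    # Hypothetical helper: cap each class at k rows, keeping smaller classes whole
    parts = []
    for _, group in df.groupby(label_col):
        if len(group) < k:
            parts.append(group)
        else:
            parts.append(group.sample(k, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)

# Toy data: 10 positive and 3 negative reviews
toy = pd.DataFrame({'sentiment': [1] * 10 + [0] * 3,
                    'reviewText': ['text'] * 13})
balanced = sample_up_to_k(toy, 'sentiment', k=5)
print(balanced['sentiment'].value_counts().to_dict())
```

Here the positive class is capped at 5 rows and the negative class keeps all 3.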

In [38]:
print (finalDF['sentiment'].value_counts())
finalDF = finalDF.drop(columns=['sentiment'])

1    150000
0    144240
Name: sentiment, dtype: int64


In [39]:
finalDF.to_json('reviews_5_balanced.json', orient='records', lines=True)
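
With `orient='records'` and `lines=True`, each review is written as one JSON object per line, and the file can be loaded again with `pd.read_json(..., lines=True)`. The following round trip is a small sketch on made-up data written to a temporary folder. The file name `toy_reviews.json` and the two toy reviews are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({'overall': [5, 1],
                    'reviewText': ['Works well.', 'Broke after a day.']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_reviews.json')
    # Same options as above: one JSON record per line
    toy.to_json(path, orient='records', lines=True)
    with open(path) as f:
        first_line = f.readline().strip()
    restored = pd.read_json(path, lines=True)

print(first_line)
print(len(restored), list(restored.columns))
```

The same `pd.read_json` call, pointed at `reviews_5_balanced.json`, should load the dataset saved above.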