<h1 style="font-family: Roboto">Data preprocessing</h1>

This notebook downloads a subset of the Amazon Reviews dataset containing product ratings in the *Appliances* subcategory, prepares it for training, splits it into train and test sets, and uploads both sets to S3.

Information about the dataset can be found at https://nijianmo.github.io/amazon/index.html.

<h2>Config</h2>

We configure a few parameters up front. S3 bucket names must be globally unique, so change the bucket name in the configuration if it is already taken.

In [5]:
from config import config
DATA_URL  = "http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Appliances.csv"
S3_BUCKET = config['AWS']['S3_BUCKET']
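The config module imported above is not shown in this notebook. As a minimal sketch, assuming it exposes a standard-library `ConfigParser` with an `[AWS]` section, the object could be built as follows. The inline INI text and the bucket name `my-unique-ratings-bucket` are hypothetical placeholders, not values from the original project.

```python
from configparser import ConfigParser

# Hypothetical INI contents. In practice these would more likely live in
# a file next to the notebook rather than in an inline string.
EXAMPLE_INI = """
[AWS]
S3_BUCKET = my-unique-ratings-bucket
"""

config = ConfigParser()
config.read_string(EXAMPLE_INI)

# Section names are case-sensitive; option names are case-insensitive.
bucket_name = config["AWS"]["S3_BUCKET"]
print(bucket_name)
```

Keeping the bucket name outside the notebook means the same code can run against different buckets without editing any cell.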

<h2>Downloading</h2>

Now we download the dataset from the URL. It is stored in a temporary file on the machine where the notebook is running.

In [2]:
from urllib.request import urlretrieve

filename, _ = urlretrieve(DATA_URL)
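Because the temporary file is not kept between sessions, it can be convenient to download to a fixed path and skip the download when the file is already there. The sketch below shows one way to do this; the helper name `download_to` is hypothetical, and the demonstration uses a local `file://` URL so that it runs without network access.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def download_to(url, destination):
    """Download url to destination unless that file already exists."""
    destination = Path(destination)
    if not destination.exists():
        urlretrieve(url, destination)
    return destination


# Offline demonstration: a local file stands in for the remote dataset.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.csv"
    source.write_text("B000,U1,5.0,0\n")
    local_copy = download_to(source.as_uri(), Path(tmp) / "copy.csv")
    contents = local_copy.read_text()

print(contents)
```

In the notebook, the equivalent call would be something like `download_to(DATA_URL, "Appliances.csv")`, where the destination file name is an assumption.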

<h2>Load the data with Pandas</h2>

Now we load the comma-separated data into a Pandas dataframe. The file has no header row, so we supply the column names ourselves. The data is in a *tall* (narrow) format, with one row per rating; see https://en.wikipedia.org/wiki/Wide_and_narrow_data for an explanation of tall and wide data formats. Tall data of this kind is often referred to as *tidy* data.

In [3]:
import pandas as pd

data = pd.read_csv(filename,
                   names=["item", "user", "rating", "timestamp"])

data.drop("timestamp", axis=1, inplace=True)

# Let's see what the data frame looks like
data.head()

         item            user  rating
0  1118461304  A3NHUQ33CFH3VM     5.0
1  1118461304  A3SK6VNBQDNBJE     5.0
2  1118461304  A3SOFHUR27FO3K     5.0
3  1118461304  A1HOG1PYCAE157     5.0
4  1118461304  A26JGAM6GZMM4V     5.0
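To make the difference between the tall and wide formats concrete, here is a small sketch with made-up ratings (the item and user IDs are illustrative, not taken from the dataset). It pivots a tall table into a wide one with one row per user and one column per item.

```python
import pandas as pd

# Tall format: one row per (item, user) rating.
tall = pd.DataFrame({
    "item":   ["i1", "i1", "i2"],
    "user":   ["u1", "u2", "u1"],
    "rating": [5.0, 3.0, 4.0],
})

# Wide format: one row per user, one column per item.
# Pairs without a rating become NaN.
wide = tall.pivot(index="user", columns="item", values="rating")
print(wide)
```

For a real ratings dataset the wide table is mostly empty, since each user rates only a few items, which is one reason the tall format is the more practical one to store.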


<h2>Data cleaning</h2>

The downstream algorithms require some additional cleaning: each user should have at most one rating per item, so we drop duplicate (item, user) pairs.

In [4]:
# Keep only the first rating for each (item, user) pair
data = data.drop_duplicates(['item', 'user'])
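As a small illustration of what this call does, the sketch below applies it to made-up ratings in which one user rated the same item twice. By default `drop_duplicates` keeps the first occurrence of each pair.

```python
import pandas as pd

ratings = pd.DataFrame({
    "item":   ["i1", "i1", "i2"],
    "user":   ["u1", "u1", "u1"],
    "rating": [5.0, 1.0, 4.0],
})

# The second (i1, u1) row is dropped; the first one is kept.
deduplicated = ratings.drop_duplicates(["item", "user"])
print(deduplicated)
```

Passing `keep="last"` would retain the last occurrence of each pair instead of the first.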

<h2> Train/Test split </h2>

We now split the dataset randomly in half into a training dataset used to fit the models and a test set used to evaluate and compare them.

In [5]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(data, train_size=0.5)

# Renumber the rows of each split from 0, discarding the shuffled original index
train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)
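The split above is different on every run because no seed is given. If a repeatable split is wanted, one option is to pass `random_state`, as in this sketch on a toy dataframe; the seed value 42 is an arbitrary assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"item": list("abcdefgh"), "rating": range(8)})

# The same seed yields the same split on every call.
first_train, first_test = train_test_split(toy, train_size=0.5, random_state=42)
second_train, second_test = train_test_split(toy, train_size=0.5, random_state=42)

print(first_train.equals(second_train))
```

A fixed seed makes model comparisons repeatable across notebook runs, at the cost of always evaluating on the same test rows.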

<h2> Upload to S3 </h2>

Finally, we upload the data to AWS S3 using the <span style="font-family: 'Courier New', monospace;">boto3</span> Python interface to the AWS API.

In [8]:
import boto3

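REGION = 'eu-west-2'

# Pin the client to the bucket's region so that create_bucket
# does not depend on the default region configured on this machine
s3 = boto3.client('s3', region_name=REGION)

location = {'LocationConstraint': REGION}
s3.create_bucket(Bucket                    = S3_BUCKET,
                 CreateBucketConfiguration = location )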

def upload_data(dataset, datakey):
    """Upload a dataframe to the S3 bucket as a CSV object stored under datakey."""
    s3.put_object(Body   = dataset.to_csv(index=False),
                  Bucket = S3_BUCKET,
                  Key    = datakey)


upload_data(train_data, 'train_ratings')
upload_data(test_data,   'test_ratings')
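To check the upload, or to load the data from another notebook, the objects can be read back into dataframes. The following is a minimal sketch, not part of the original notebook: the helper name `download_data` is hypothetical, and the S3 client is passed in as an argument so the function can also be exercised without AWS access.

```python
import io

import pandas as pd


def download_data(s3_client, bucket, datakey):
    """Read a CSV object from an S3 bucket into a dataframe."""
    response = s3_client.get_object(Bucket=bucket, Key=datakey)
    # The response body is a stream; read it fully before parsing.
    return pd.read_csv(io.BytesIO(response["Body"].read()))
```

With the client and bucket defined above, a call such as `download_data(s3, S3_BUCKET, 'train_ratings')` should return a dataframe with the same columns as the uploaded training set.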