# Data Prepping

The first step is to get the data into a format that Amazon Personalize can ingest.

We will supply Item Metadata and User-Item Interaction data. Only the data provided here is used; no additional metadata is added at this time.


In [29]:
# Imports
import boto3

import json
import numpy as np
import pandas as pd
import time
import datetime
import csv

## Exploring the data

The data comes in two kinds of files: one file with movie data and four files with user-movie interactions. Below we describe each schema and the manipulations required to build CSV files for Personalize.

### Movie Item Metadata:

#### movie_titles.csv:

* MovieID
* YearOfRelease
* Title

#### Personalize Required Data:

* ITEM_ID = string

#### Personalize Item Schema:

Notice that we are removing the `Title` attribute. It provides metadata, but there is no obvious reason for it to be useful for predicting a movie recommendation. It would be worth experimenting later with this or other metadata added.

* ITEM_ID = MovieID
* YEAR_OF_RELEASE = YearOfRelease (string)


### User - Movie Interaction Data:

#### combined_data_1.txt (also 2, 3, 4)

* MovieID
    * CustomerID
    * Rating
    * Date

#### Personalize Required Data:

* USER_ID
* ITEM_ID
* TIMESTAMP

#### Personalize User-Item Interaction Schema:

As with the MovieLens dataset, we remove `Rating` and map `CustomerID` to `USER_ID` and `MovieID` to `ITEM_ID`. `Date` is converted to a Unix epoch timestamp and stored as `TIMESTAMP`.

We also reduce the interactions to only those with a rating of 3 or greater, just like the MovieLens data. This keeps the inputs of the two experiments close, though other approaches could be valid.

* USER_ID = string
* ITEM_ID  = string
* TIMESTAMP = long
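
The interaction fields above map onto an Avro-style schema in the same way as the item schema used later in this notebook. A minimal sketch, assuming the `Interactions` record name and the `com.amazonaws.personalize.schema` namespace; check it against the current Personalize documentation before registering it:

```python
import json

# Sketch of the interactions schema; field names and types mirror the list above.
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}

# The schema is passed to Personalize as a JSON string
schema_json = json.dumps(interactions_schema)
```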

The last heavy change to the interaction data is combining the four distinct text files into one larger CSV. This may require a larger notebook instance; in this case we are using a 2XL.

## Prepping Movie Data:

The CSV is nearly in the right format. The process below renames the headers, drops the `Title` column, and generates a new CSV to upload to S3 for use with Personalize.


In [69]:
# Build the items CSV from the movie metadata file

# First create the items CSV file
with open('items.csv', 'w', newline='') as csvfile:
    filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)

    # Write the headers:
    filewriter.writerow(['ITEM_ID', 'YEAR_OF_RELEASE'])
    # Next open the movie titles file
    with open('netflix-prize-data/movie_titles.csv', encoding="ISO-8859-1") as fileHandler:
        for line in fileHandler:
            # Titles may contain commas, so split only on the first two commas:
            # fields[0] is MovieID, fields[1] is YearOfRelease, the title is dropped
            fields = line.strip('\n').split(',', 2)
            filewriter.writerow([fields[0], fields[1]])


In [4]:
# Item Schema Cell
item_schema = {
    "type": "record",
    "name": "Item",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "YEAR_OF_RELEASE",
            "type": "string"
        },
    ],
    "version": "1.0"
}


"""
create_schema_response = personalize.create_schema(
    name = "django-items-schema-finalF",
    schema = json.dumps(item_schema)
)

schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))
"""


### Prepping User Movie Interaction Data

Here things get more complicated. The files are large, around 500MB each, and they are not in CSV format, so we need to build a parser that collects the rows and exports them to a CSV that is shipped to S3. Just like the movie metadata, there is a JSON file in the project containing the schema used to build the dataset within Personalize.

Basic algorithm:

1. Loop through the file. A line that holds a number followed by a `:` character marks the start of a new movie.
1. Record that number as the current `MovieID`.
1. Until the next movie line is found, split each line into `CustomerID`, `Rating`, and `Date`.
1. Convert the date to Unix epoch format.
1. Add the row to a dataframe or temporary storage.
1. Export to CSV with the correct headers.
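
The steps above can be sketched as a generator plus a small writer that loops over several input files. This is an illustrative sketch, not the notebook's own implementation: `parse_interactions` and `write_interactions` are hypothetical names, the sample lines are made up in the Netflix layout, and dates are interpreted as midnight UTC.

```python
import calendar
import csv
import time


def parse_interactions(lines, min_rating=3):
    """Yield (user_id, movie_id, epoch) tuples from Netflix-format lines."""
    movie_id = None
    for raw in lines:
        line = raw.strip()
        if line.endswith(':'):
            # A line such as "1:" starts a new movie block
            movie_id = line[:-1]
            continue
        parts = line.split(',')
        if len(parts) != 3 or movie_id is None:
            continue  # skip blank or malformed lines
        user_id, rating, date_str = parts
        if int(rating) >= min_rating:
            # Interpret the date as midnight UTC
            epoch = calendar.timegm(time.strptime(date_str, '%Y-%m-%d'))
            yield user_id, movie_id, epoch


def write_interactions(input_paths, output_path):
    """Write one CSV, with a single header row, from several input files."""
    with open(output_path, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(['USER_ID', 'ITEM_ID', 'TIMESTAMP'])
        for path in input_paths:
            with open(path) as handle:
                writer.writerows(parse_interactions(handle))


# Illustrative input: two movie blocks, one rating below the threshold
sample = ['1:', '1488844,3,2005-09-06', '822109,2,2005-05-13',
          '2:', '2059652,4,2005-09-05']
rows = list(parse_interactions(sample))
```

Because the parser is a generator, rows stream straight to the CSV writer and the full dataset never has to sit in memory, which matters with inputs of this size.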

In [55]:
# The functions below will aid in deconstructing the interaction data into
# a usable CSV

def strip_line(line):
    """
    As bad characters are found in this experiment, they will be
    added here so they are removed from the dataset, for example tabs or new lines.
    """
    line = line.strip('\n')
    return line


def is_line_a_movie(line):
    """
    This will take in a string `line` and determine if it has the
    characteristics of a movie line or not.

    Movie lines (for example `1:`) contain no commas and end with a colon,
    so we split on commas and check for a single element ending in `:`.
    """
    line_list = line.split(',')
    # Requiring the trailing colon keeps blank lines from being read as movies
    return len(line_list) == 1 and line.endswith(':')

def is_line_an_interaction(line):
    """
    This will take in a string `line` and determine if it has the
    characteristics of an interaction or not.
    
    Interaction lines are identified by containing commas and a list
    length of 3. If those conditions are met True is returned, otherwise False.
    """
    line_list = line.split(',')
    if len(line_list) == 3:
        return True
    return False

def convert_date_to_epoch(date_str):
    """
    The Netflix dataset provides dates like `2005-09-06` to designate:
    * Year 2005
    * Month September / 9 
    * Day 6
    
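    We need these in Unix epoch format. The code below does that and returns
    the date as an integer epoch. Note that `time.mktime` interprets the
    date as midnight in the machine's local time zone.
    """
    # Named date_format to avoid shadowing the built-in `format`
    date_format = '%Y-%m-%d'
    epoch = time.mktime(time.strptime(date_str, date_format))
    epoch = int(epoch)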
    return epoch
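
`time.mktime` interprets its argument in the machine's local time zone, so the same date can map to different epoch values on different hosts. Where reproducible timestamps matter, one option is `calendar.timegm`, which treats the parsed date as UTC. A minimal sketch; `date_to_utc_epoch` is a hypothetical helper, not part of the original notebook:

```python
import calendar
import time


def date_to_utc_epoch(date_str):
    """Return the Unix epoch (seconds) for midnight UTC on a YYYY-MM-DD date."""
    # timegm is the UTC counterpart of mktime and ignores the local time zone
    return calendar.timegm(time.strptime(date_str, '%Y-%m-%d'))


print(date_to_utc_epoch('2005-09-06'))  # 1125964800
```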
    

In [67]:
# This cell will take quite a few minutes to complete. The following cells append the other files to the CSV

# First create the interactions CSV file
with open('interactions.csv', 'w', newline='') as csvfile:
    filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    
    # Next open the first interactions file
    with open('netflix-prize-data/combined_data_1.txt') as fileHandler:
        movie = 0 
        # First write the headers for the file:
        filewriter.writerow(['USER_ID', 'ITEM_ID', 'TIMESTAMP'])
        # Read Each Line in a Loop:
        for line in fileHandler:
            line = strip_line(line=line)
            # Now we determine if it is a movie:
            if is_line_a_movie(line=line):
                movie = line.split(',')[0].strip(':')
            # If the line is an interaction
            elif is_line_an_interaction(line=line):
                line = line.split(',')
                if int(line[1]) >= 3:
                    timestamp = convert_date_to_epoch(date_str=line[2])
                    # Once all values are sorted write the row
                    filewriter.writerow([line[0], str(movie), str(timestamp)])

            


In [65]:
"""
# Note here we append file 2
with open('interactions.csv', 'a') as csvfile:
    filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    
    # Next open the second interactions file
    with open('netflix-prize-data/combined_data_2.txt') as fileHandler:
        movie = 0 
        # Read Each Line in a Loop:
        for line in fileHandler:
            line = strip_line(line=line)
            # Now we determine if it is a movie:
            if is_line_a_movie(line=line):
                movie = line.split(',')[0].strip(':')
            # If the line is an interaction
            elif is_line_an_interaction(line=line):
                line = line.split(',')
                if int(line[1]) >= 3:
                    timestamp = convert_date_to_epoch(date_str=line[2])
                    # Once all values are sorted write the row
                    filewriter.writerow([line[0], str(movie), str(timestamp)])


# Note here we append file 3
with open('interactions.csv', 'a') as csvfile:
    filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    
    # Next open the third interactions file
    with open('netflix-prize-data/combined_data_3.txt') as fileHandler:
        movie = 0 
        # Read Each Line in a Loop:
        for line in fileHandler:
            line = strip_line(line=line)
            # Now we determine if it is a movie:
            if is_line_a_movie(line=line):
                movie = line.split(',')[0].strip(':')
            # If the line is an interaction
            elif is_line_an_interaction(line=line):
                line = line.split(',')
                if int(line[1]) >= 3:
                    timestamp = convert_date_to_epoch(date_str=line[2])
                    # Once all values are sorted write the row
                    filewriter.writerow([line[0], str(movie), str(timestamp)])



# Note here we append file 4
with open('interactions.csv', 'a') as csvfile:
    filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    
    # Next open the fourth interactions file
    with open('netflix-prize-data/combined_data_4.txt') as fileHandler:
        movie = 0 
        # Read Each Line in a Loop:
        for line in fileHandler:
            line = strip_line(line=line)
            # Now we determine if it is a movie:
            if is_line_a_movie(line=line):
                movie = line.split(',')[0].strip(':')
            # If the line is an interaction
            elif is_line_an_interaction(line=line):
                line = line.split(',')
                if int(line[1]) >= 3:
                    timestamp = convert_date_to_epoch(date_str=line[2])
                    # Once all values are sorted write the row
                    filewriter.writerow([line[0], str(movie), str(timestamp)])




"""

### Uploading Interactions Data

Now that the CSV above has been created, it needs to be uploaded to S3 so that it can be used with Amazon Personalize.
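
One way to perform the upload is with the boto3 S3 client's `upload_file` method. A minimal sketch; the bucket name and key are placeholders, and `upload_csv` is a hypothetical helper, not part of the original notebook:

```python
def upload_csv(s3_client, local_path, bucket, key):
    """Upload a local file to S3 and return its s3:// URI."""
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client call
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


# Usage (requires AWS credentials; "my-personalize-bucket" is a placeholder):
# import boto3
# s3 = boto3.client("s3")
# print(upload_csv(s3, "interactions.csv", "my-personalize-bucket", "interactions.csv"))
```

Passing the client in as an argument keeps the helper easy to exercise without network access. Personalize also needs permission to read from the bucket before it can import the file.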