# Data preprocessing

This notebook handles all the data preparation and storage needed for the analysis. We split the analysis across several notebooks to make the report more readable.

## Imports, constants and utility functions

In [1]:
import pandas as pd
import numpy as np
import gzip

In [2]:
DATA_DIR = 'data/'

In [3]:
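##### Functions for reading and parsing files #####

import ast

def parse(path):
    # Each line holds one record written as a Python dict literal
    with gzip.open(path, 'rb') as g:
        for l in g:
            # literal_eval only accepts literals, unlike eval which runs arbitrary code
            yield ast.literal_eval(l.decode('utf-8'))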

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')
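
To illustrate the file format these helpers expect, here is a minimal sketch that writes two made-up records to a temporary gzip file and reads them back into a dataframe. It assumes each line is a Python dict literal, which is why the lines are evaluated rather than passed to a JSON parser. `parse_safe`, the sample records and the temporary path are invented for illustration, and `ast.literal_eval` is used as a safer stand-in for `eval`.

```python
import ast
import gzip
import os
import tempfile

import pandas as pd

# Two invented records, written one dict literal per line
sample_lines = [
    "{'asin': 'A1', 'title': 'Organic green tea'}",
    "{'asin': 'B2', 'title': 'Chocolate bar', 'price': 2.5}",
]

tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.json.gz')
with gzip.open(sample_path, 'wt', encoding='utf-8') as f:
    f.write('\n'.join(sample_lines) + '\n')

def parse_safe(path):
    """Hypothetical variant of parse: yield one dict per line, literals only."""
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield ast.literal_eval(line)

# Keys missing from a record (here 'price' for the first one) become NaN
sample_df = pd.DataFrame(list(parse_safe(sample_path)))
print(sample_df)
```

Records do not need to share the same keys: the dataframe takes the union of all keys as its columns.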

In [4]:
##### Functions related to the DataFrames directly #####

def df_with_datetime(df, col_name='datetime', out_format=None):
    if out_format:
        df[col_name] = pd.to_datetime(df['unixReviewTime'], unit='s').dt.strftime(out_format)
    else:
        df[col_name] = pd.to_datetime(df['unixReviewTime'], unit='s')
        
    return df
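
As a quick check of what the conversion above produces, the following sketch applies the same two `pandas` calls to a few invented timestamps. The column name `unixReviewTime` comes from the notebook; the values and the `'%Y-%m'` format are only examples.

```python
import pandas as pd

# Invented timestamps, in seconds since the Unix epoch
toy_reviews = pd.DataFrame({'unixReviewTime': [0, 86400, 1400000000]})

# Without out_format: a proper datetime64 column
as_datetime = pd.to_datetime(toy_reviews['unixReviewTime'], unit='s')

# With out_format: a column of formatted strings (here year and month)
as_year_month = as_datetime.dt.strftime('%Y-%m')

print(as_datetime.tolist())
print(as_year_month.tolist())
```

Note that passing `out_format` yields strings, so date arithmetic is only available on the column created without it.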

## Data collection and storage

The first task is to collect data from Amazon, organise it and store it in a ready-to-use format (pickles).

To achieve this, we do the following:
* Load the reviews data into dataframes
* Load the products metadata into dataframes
* Filter the reviews and products to only keep products considered "healthy"
* Merge the metadata and the reviews into a single dataframe
* Store the dataframe in the pickle format

We decided to restrict our analysis to reviews and products from the **Grocery and Gourmet Food** and **Sports and Outdoors** categories, as these categories are more representative of a healthy lifestyle than the others.


#### The Grocery and Gourmet Food category

The categories in the 'Grocery & Gourmet Food' file are not directly useful, because:
1. They do not tell us directly whether a food is healthy
2. Many reviews are about products that belong to no category other than the main one ('Grocery & Gourmet Food')
    
To get the reviews related to healthy products in this file, we can read the title (and/or description) of every product in the metadata and keep the ones containing a keyword related to a healthy lifestyle (e.g. "organic", "natural", ...). Once we have those products, we keep only the reviews about them (matched on the 'asin' value).
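
The keyword approach can be sketched on toy data as follows. The product and review rows are invented, only the two keywords quoted above are used, and the vectorised `str.contains` call is one possible way to express the match, not necessarily the one used later in this notebook.

```python
import re

import pandas as pd

keywords = ['organic', 'natural']  # the two keywords quoted above

# Invented metadata and reviews
toy_meta = pd.DataFrame({
    'asin': ['A1', 'B2', 'C3'],
    'title': ['Organic Green Tea', 'Chocolate Bar', None],
    'description': [None, 'Milk chocolate', 'All natural honey'],
})
toy_reviews = pd.DataFrame({
    'asin': ['A1', 'B2', 'B2', 'C3'],
    'overall': [5.0, 3.0, 4.0, 5.0],
})

# Combine title and description, treating missing values as empty strings
text = (toy_meta['title'].fillna('') + ' ' + toy_meta['description'].fillna('')).str.lower()

# One regular expression matching any keyword (escaped, e.g. for 'sugar-free')
pattern = '|'.join(re.escape(kw) for kw in keywords)
healthy_asins = toy_meta.loc[text.str.contains(pattern, regex=True), 'asin']

# Keep only the reviews whose product was flagged as healthy
healthy_reviews = toy_reviews[toy_reviews['asin'].isin(healthy_asins)]
print(healthy_reviews)
```

A substring match is deliberately permissive: "natural" also matches "unnatural", so the keyword list has to be chosen with such false positives in mind.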

#### The Sports and Outdoors category

The Sports and Outdoors file contains categories that are easy to classify as healthy or not, e.g. 'Exercise & Fitness' or 'Cycling', and its products sit in more precise categories than those of the 'Grocery & Gourmet Food' file. Thus, to get all reviews about products related to a healthy lifestyle, we can keep the reviews of every product that belongs to at least one 'healthy' category, choosing those categories manually.
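
To choose the healthy categories manually, it helps to list which categories exist and how many products each one covers. The sketch below does this on invented rows, assuming the `categories` field holds a list of category paths (each path being a list of names); that layout and the 'Hunting & Fishing' entry are assumptions made for illustration.

```python
from collections import Counter

import pandas as pd

# Invented metadata; each product has one or more category paths
toy_meta = pd.DataFrame({
    'asin': ['S1', 'S2', 'S3'],
    'categories': [
        [['Sports & Outdoors', 'Exercise & Fitness']],
        [['Sports & Outdoors', 'Cycling'], ['Sports & Outdoors', 'Exercise & Fitness']],
        [['Sports & Outdoors', 'Hunting & Fishing']],
    ],
})

category_counts = Counter()
for paths in toy_meta['categories']:
    # Count each category once per product, even if it appears in several paths
    category_counts.update({cat for path in paths for cat in path})

# Most frequent categories first, to review by hand
for cat, n in category_counts.most_common():
    print(n, cat)
```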

### Load the data into pandas dataframes

Given the amount of data to process, we do the computations sequentially (file by file).

In [5]:
METADATA_TO_KEEP = ['asin', 'title', 'categories', 'price']
REVIEWS_DATA_TO_KEEP = ['asin', 'overall', 'reviewText', 'datetime']
HEALTHY_FOOD_KEYWORDS = ['organic', 'natural', 'sugar-free', 'healthy', 'vitamin',
                        'supplement', 'minerals', 'diet', 'vegan']

##### Functions related to the 'healthiness' of items #####

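def is_food_healthy(item):
    # Title and description can be missing (NaN), so keep only the string fields;
    # a missing title must not prevent the description from being checked
    fields = [item.get('title'), item.get('description')]
    text = ' '.join(f.lower() for f in fields if isinstance(f, str))
    return any(kw in text for kw in HEALTHY_FOOD_KEYWORDS)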

def is_sport_item_healthy(item):
    for cat in get_categories(item):
        if cat in HEALTHY_SPORT_CATEGORIES:
            return True
        
    return False
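
`is_sport_item_healthy` relies on `get_categories` and `HEALTHY_SPORT_CATEGORIES`, which are not defined in this section. The sketch below shows what they could look like, assuming `categories` holds a list of category paths. Both definitions are hypothetical: the set contains only the two example categories named earlier, not the list actually used for the analysis, and `is_sport_item_healthy` is restated in a compact form so the sketch runs on its own.

```python
import pandas as pd

# Hypothetical: only the two example categories mentioned in the text
HEALTHY_SPORT_CATEGORIES = {'Exercise & Fitness', 'Cycling'}

def get_categories(item):
    """Hypothetical helper: flatten the nested 'categories' field into a set."""
    paths = item.get('categories')
    if not isinstance(paths, list):
        return set()  # missing value (e.g. NaN)
    return {cat for path in paths for cat in path}

def is_sport_item_healthy(item):
    # Same logic as the notebook's function, written with any()
    return any(cat in HEALTHY_SPORT_CATEGORIES for cat in get_categories(item))

# Invented metadata, including a product without categories
toy_meta = pd.DataFrame({
    'asin': ['S1', 'S2', 'S3'],
    'categories': [
        [['Sports & Outdoors', 'Cycling']],
        [['Sports & Outdoors', 'Hunting & Fishing']],
        float('nan'),
    ],
})

healthy_mask = toy_meta.apply(is_sport_item_healthy, axis=1)
print(toy_meta[healthy_mask])
```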

With our utility functions ready, we can store the data into pickles. The created files are the following:
* **food_reviews_df**: A file containing all the reviews of the Food category
* **food_meta_df**: A file containing all the product metadata of the Food category
* **sports_reviews_df**: A file containing all the reviews of the Sports category
* **sports_meta_df**: A file containing all the product metadata of the Sports category
* **healthy_food_df**: A file containing the merged information about healthy products and reviews of the Food category
* **healthy_sports_df**: A file containing the merged information about healthy products and reviews of the Sports category

In [6]:
def save_data(filename, pickle_filename, with_datetime=False):
    print('Saving file:', DATA_DIR + pickle_filename)

    df = getDF(DATA_DIR + filename)
    
    if with_datetime:
        df = df_with_datetime(df=df)
    
    df.to_pickle(DATA_DIR + pickle_filename)

In [None]:
save_data('reviews_Grocery_and_Gourmet_Food.json.gz', 'food_reviews_df', True)
save_data('meta_Grocery_and_Gourmet_Food.json.gz', 'food_meta_df')
save_data('reviews_Sports_and_Outdoors.json.gz', 'sports_reviews_df', True)
save_data('meta_Sports_and_Outdoors.json.gz', 'sports_meta_df')

Saving file: data/sports_reviews_df


In [None]:
def save_healthy_data(reviews_filename, metadata_filename, filtering_func, filename):
    print('Retrieving data from pickles...')
    reviews_df = pd.read_pickle(DATA_DIR + reviews_filename)
    meta_df = pd.read_pickle(DATA_DIR + metadata_filename)

    # Metadata about healthy products only
    print('Filtering healthy products...')
    meta_healthy_df = meta_df[meta_df.apply(filtering_func, axis=1)]

    # Reviews about healthy products merged with corresponding metadata
    print('Merging dataframes...')
    merged_healthy_df = pd.merge(meta_healthy_df[METADATA_TO_KEEP], reviews_df[REVIEWS_DATA_TO_KEEP], on='asin')

    # Store file into a pickle
    print('Saving into pickle at:', DATA_DIR + filename)
    merged_healthy_df.to_pickle(DATA_DIR + filename)

    print('Done')
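
The merge step uses the default inner join of `pd.merge`, so a row is produced only when an `asin` appears on both sides. The toy example below, with invented rows, shows the consequence: reviews of products outside the healthy metadata are dropped, and so are healthy products that have no review.

```python
import pandas as pd

# Invented data: 'C3' is healthy but has no review, 'B2' is reviewed but not healthy
toy_healthy_meta = pd.DataFrame({
    'asin': ['A1', 'C3'],
    'title': ['Organic Green Tea', 'Natural Honey'],
})
toy_reviews = pd.DataFrame({
    'asin': ['A1', 'A1', 'B2'],
    'overall': [5.0, 4.0, 3.0],
})

# Default how='inner': one output row per (product, review) pair sharing an asin
merged = pd.merge(toy_healthy_meta, toy_reviews, on='asin')
print(merged)
```

Each review row carries a copy of its product's metadata, so the merged dataframe has one row per review rather than one row per product.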

In [None]:
save_healthy_data(
    'food_reviews_df',
    'food_meta_df',
    is_food_healthy,
    'healthy_food_df'
)

In [None]:
save_healthy_data(
    'sports_reviews_df',
    'sports_meta_df',
    is_sport_item_healthy,
    'healthy_sports_df'
)