# Data preprocessing

This notebook handles all the data preparation and storage needed for the analysis. We decided to split the analysis across multiple notebooks to make the report more readable.

## Imports, constants and utility functions

In [1]:
import pandas as pd
import numpy as np
import gzip

In [2]:
DATA_DIR = 'data/'

## Data collection and storage

The first task is to collect data from Amazon, organise it and store it in a ready-to-use data format (pickles).

To achieve this, we do the following:
* Load the reviews data into dataframes
* Load the products metadata into dataframes
* Filter the reviews and products to only keep products considered "healthy"
* Merge the metadata and the reviews into a single dataframe
* Store the dataframe in the pickle format

We decided to restrict our analysis to reviews and products from the **Grocery and Gourmet Food** and **Sports and Outdoors** categories, as these categories are more representative of a healthy lifestyle than the others.


#### The Grocery and Gourmet Food category

The categories in the 'Grocery & Gourmet Food' file are not directly useful, because:
1. They do not tell us directly whether a food is healthy or not
2. Many reviews are about products that belong to no category other than the main one ('Grocery & Gourmet Food')

To get the reviews related to healthy products in this file, we can read the title (and/or description) of every product in the metadata and keep the products containing a keyword related to a healthy lifestyle (e.g. "organic", "natural", ...). Once we have those products, we keep only the reviews about them (matching on the 'asin' value).
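The keyword approach can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes the metadata dataframe has `asin`, `title` and `description` columns, and both the keyword list and the function name `filter_healthy_by_keyword` are hypothetical.

```python
import pandas as pd

# Hypothetical keyword list; only "organic" and "natural" are named in the text.
HEALTHY_KEYWORDS = ["organic", "natural"]


def filter_healthy_by_keyword(meta_df, reviews_df, keywords=HEALTHY_KEYWORDS):
    """Keep products whose title or description mentions a keyword, plus their reviews."""
    # Missing titles or descriptions are treated as empty strings.
    text = (meta_df["title"].fillna("") + " " + meta_df["description"].fillna("")).str.lower()
    # Plain, case-insensitive substring match against each keyword.
    mask = pd.Series(False, index=meta_df.index)
    for keyword in keywords:
        mask |= text.str.contains(keyword.lower(), regex=False)
    healthy_meta = meta_df[mask]
    # Keep only the reviews whose 'asin' belongs to a healthy product.
    healthy_reviews = reviews_df[reviews_df["asin"].isin(healthy_meta["asin"])]
    return healthy_meta, healthy_reviews


# Toy data, for illustration only.
meta = pd.DataFrame({
    "asin": ["A1", "A2", "A3"],
    "title": ["Organic green tea", "Chocolate bar", None],
    "description": [None, "Rich milk chocolate", "All natural almonds"],
})
reviews = pd.DataFrame({"asin": ["A1", "A2", "A3", "A1"], "overall": [5, 3, 4, 2]})
healthy_meta, healthy_reviews = filter_healthy_by_keyword(meta, reviews)
```

Because this is a substring match, a keyword such as "natural" would also match "unnatural"; a word-boundary regular expression would be stricter, at the cost of some speed.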

#### The Sports and Outdoors category

The Sports and Outdoors file contains categories that are easy to classify as healthy or not, e.g. 'Exercise & Fitness' or 'Cycling', and its products are assigned to more precise categories than in the 'Grocery & Gourmet Food' file. Thus, to get all reviews about products related to a healthy lifestyle, we take all the reviews about products that belong to at least one 'healthy' category, where the healthy categories are chosen manually.
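A possible sketch of this category-based filter is shown below. It assumes that each product's `categories` value is a list of category paths (a list of lists of names), which is an assumption about the metadata layout. The set of healthy categories is illustrative, since only two names are given in the text, and `in_healthy_category` is a hypothetical helper.

```python
import pandas as pd

# Hand-picked set; only these two names appear in the text.
HEALTHY_SPORTS_CATEGORIES = {"Exercise & Fitness", "Cycling"}


def in_healthy_category(categories, healthy=HEALTHY_SPORTS_CATEGORIES):
    """Return True if any category path contains a healthy category name."""
    # Assumption: 'categories' is a list of paths, each path a list of names.
    if not isinstance(categories, list):
        return False
    return any(name in healthy for path in categories for name in path)


# Toy data, for illustration only.
sports_meta = pd.DataFrame({
    "asin": ["S1", "S2", "S3"],
    "categories": [
        [["Sports & Outdoors", "Cycling", "Helmets"]],
        [["Sports & Outdoors", "Hunting & Fishing"]],
        [["Sports & Outdoors"], ["Sports & Outdoors", "Exercise & Fitness"]],
    ],
})
sports_reviews = pd.DataFrame({"asin": ["S1", "S2", "S3", "S3"], "overall": [4, 5, 3, 5]})

healthy_sports_meta = sports_meta[sports_meta["categories"].apply(in_healthy_category)]
healthy_sports_reviews = sports_reviews[sports_reviews["asin"].isin(healthy_sports_meta["asin"])]
```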

### Load the data into pandas dataframes

Given the amount of data to process, we do the computations sequentially (file by file).
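As a rough sketch of the loading step, the functions below read one gzip-compressed file into a dataframe. The file format is an assumption (one record per line, either strict JSON or a Python dict literal), and the names `parse_lines` and `load_dataframe` are hypothetical, not the notebook's own utility functions.

```python
import ast
import gzip
import json

import pandas as pd


def parse_lines(path):
    """Yield one dict per non-empty line of a gzip-compressed text file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # Assumption: non-JSON lines are Python dict literals.
                yield ast.literal_eval(line)


def load_dataframe(path, columns=None):
    """Load one file into a dataframe, optionally keeping only some columns."""
    df = pd.DataFrame(list(parse_lines(path)))
    return df if columns is None else df[columns]
```

Loading one file at a time and keeping only the columns needed limits how much data sits in memory at once, which is the motivation for processing the files sequentially.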

With our utility functions ready, we can store the data into pickles. The created files are the following:
* **food_reviews_df**: A file containing all the reviews of the Food category
* **food_meta_df**: A file containing the metadata of all the products of the Food category
* **sports_reviews_df**: A file containing all the reviews of the Sports category
* **sports_meta_df**: A file containing the metadata of all the products of the Sports category
* **healthy_food_df**: A file containing the merged information about healthy products and reviews of the Food category
* **healthy_sports_df**: A file containing the merged information about healthy products and reviews of the Sports category
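The merge and storage steps could look roughly like this. It is a sketch under assumptions: reviews and metadata are joined on `asin` with an inner join, and `save_pickle` and `merge_reviews_and_meta` are hypothetical helper names.

```python
import os

import pandas as pd

DATA_DIR = "data/"


def save_pickle(df, name, data_dir=DATA_DIR):
    """Write a dataframe to <data_dir>/<name> in the pickle format."""
    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, name)
    print("Saving file:", path)
    df.to_pickle(path)
    return path


def merge_reviews_and_meta(reviews_df, meta_df):
    """Attach product metadata to each review, matching on 'asin'."""
    # Inner join: reviews without matching (healthy) metadata are dropped.
    return reviews_df.merge(meta_df, on="asin", how="inner")
```

A stored file can be read back with `pd.read_pickle(path)`. Since the pickle format can execute arbitrary code when loaded, only files produced by this notebook should be read this way.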

Saving file: data/food_reviews_df
Saving file: data/food_meta_df
Saving file: data/sports_reviews_df
Saving file: data/sports_meta_df


Retrieving data from pickles...
Filtering healthy products...
Merging dataframes...
Saving into pickle at: data/healthy_food_df
Done


Retrieving data from pickles...
Filtering healthy products...
Merging dataframes...
Saving into pickle at: data/healthy_sports_df
Done


## Number of reviews per category

We store the number of reviews per category in a pickle.
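One way to compute these counts is sketched below. It assumes the merged dataframe has one row per review and a `categories` column holding a list of category paths (a list of lists of names). The function name `reviews_per_category` is hypothetical.

```python
import pandas as pd


def reviews_per_category(df):
    """Count how many reviews fall under each category name."""
    # Collapse each review's paths into unique names, so that a review is
    # counted at most once per category even if several paths share a name.
    unique_names = df["categories"].apply(
        lambda paths: sorted({name for path in paths for name in path})
    )
    # One row per (review, category name), then count rows per name.
    return unique_names.explode().value_counts()


# Toy data, for illustration only.
healthy_sports = pd.DataFrame({
    "asin": ["S1", "S3", "S3"],
    "categories": [
        [["Sports & Outdoors", "Cycling"]],
        [["Sports & Outdoors", "Cycling"], ["Sports & Outdoors", "Exercise & Fitness"]],
        [["Sports & Outdoors", "Exercise & Fitness"]],
    ],
})
counts = reviews_per_category(healthy_sports)
```

The result is a pandas Series indexed by category name, which can be written to disk with its `to_pickle` method.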

Food categories

Sports categories