# Part 1: Pre-processing

First, we load our modules and data:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dataset = pd.read_csv('CS_Purchase_data.csv', index_col=0)
print(dataset.head())
print(len(dataset))

In [None]:
len(dataset['Product_ID'].unique())

In [None]:
dataset['Marital_Status'].value_counts()

In [None]:
len(dataset['User_ID'].unique())

In [None]:
dataset['Product_Category_3'].value_counts()

Take a good look at the variables. There are two IDs, one for the users and one for the products. Each of those two entities has its own properties: users have an age, an occupation, an annual income and so on, while products have categories. In addition, each transaction records the purchase amount in the variable 'Purchase'.
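Because the user properties are repeated on every transaction row, it can be worth checking that they never change within a user before aggregating. The following is a minimal sketch on made-up toy data; the values are invented and only a few of the real column names are reused:

```python
import pandas as pd

# Toy transactions with invented values (not taken from CS_Purchase_data.csv).
toy = pd.DataFrame({
    'User_ID': [1, 1, 2, 2, 2],
    'Gender': ['F', 'F', 'M', 'M', 'M'],
    'Age': ['26-35', '26-35', '46-50', '46-50', '46-50'],
    'Product_ID': ['P1', 'P2', 'P1', 'P3', 'P4'],
    'Purchase': [100, 250, 80, 40, 60],
})

user_attributes = ['Gender', 'Age']

# Count the distinct values of each attribute within every user.
distinct_per_user = toy.groupby('User_ID')[user_attributes].nunique()

# True only if every user has exactly one value for every attribute.
attributes_are_constant = bool((distinct_per_user == 1).all().all())

print(distinct_per_user)
print(attributes_are_constant)
```

If the check fails on the real data, the user-level columns cannot safely be treated as properties of the user alone.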

## Merge datasets

Given that the exercise requires us to predict the purchase amount of a particular customer, the dataset needs to be transformed accordingly. In this part, you will aggregate the purchase data per user and remove the columns we no longer need.

### Remove unnecessary columns

Create a function that removes the unnecessary columns.

First, list the columns you think should be removed, copying their exact names:

In [None]:
def fill_in_columns():
    to_remove = []
    
    ### BEGIN SOLUTION    
    to_remove = ['Product_Category_1','Product_Category_2', 'Product_Category_3', 'Product_ID']
    ### END SOLUTION

    return to_remove

Now remove them:

In [None]:
def remove_columns(dataset, to_remove):
    purchase_data = dataset.copy()
    
    ### BEGIN SOLUTION    
    purchase_data = dataset.drop(to_remove, axis=1)
    ### END SOLUTION
    
    return purchase_data

In [None]:
data_copy = dataset.copy()
purchase_data = remove_columns(data_copy,fill_in_columns())

### Aggregate the observations

Now, aggregate the observations based on the user ID and make a new column 'Purchase_Sum' which contains the sum of all purchases of a particular user:

In [None]:
def aggregate_observations(dataset):
    purchase_data = dataset.copy()
    
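    # Group on the user ID together with the user-level columns so they are kept in the result
    group_keys = ['User_ID', 'Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status']
    purchase_data = purchase_data.groupby(group_keys).sum()
    # Rename the summed columns (e.g. 'Purchase' -> 'Purchase_Sum') and turn the keys back into columns
    purchase_data = purchase_data.add_suffix('_Sum').reset_index()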
    
    return purchase_data
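
To see what the group-sum-rename pattern produces, here is a small sketch on made-up toy data. The values are invented, and only `User_ID`, `Gender` and `Purchase` are reused from the real dataset:

```python
import pandas as pd

# Toy transactions with invented values: user 1 buys twice, user 2 three times.
toy = pd.DataFrame({
    'User_ID': [1, 1, 2, 2, 2],
    'Gender': ['F', 'F', 'M', 'M', 'M'],
    'Purchase': [100, 250, 80, 40, 60],
})

# Sum every remaining column within each (User_ID, Gender) group.
summed = toy.groupby(['User_ID', 'Gender']).sum()

# add_suffix renames only the value columns; reset_index turns the
# grouping keys back into ordinary columns.
toy_aggregated = summed.add_suffix('_Sum').reset_index()

print(toy_aggregated)
```

The result has one row per user, with the grouping keys kept under their original names and the summed column renamed to `Purchase_Sum`.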

In [None]:

# Aggregate the data with the product columns already removed, not the raw dataset
purchase_data_copy = purchase_data.copy()
purchase_data_copy = aggregate_observations(purchase_data_copy)
print(purchase_data_copy.head())

In [None]:
### BEGIN HIDDEN TESTS
from pandas.testing import assert_frame_equal
result = purchase_data.groupby( ['User_ID', 'Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status']).sum()
result = result.add_suffix('_Sum').reset_index()
assert_frame_equal(result, aggregate_observations(purchase_data))
purchase_data = result
### END HIDDEN TESTS
print(purchase_data.head())

### Merge the datasets

Now, we add the extra customer data:

In [None]:
customer_data = pd.read_csv('CS_Customer_data.csv')
print(customer_data.head())

Write a function that merges the two datasets:

In [None]:
def merge_datasets(purchase_data, customer_data):
    final_data = purchase_data.copy()
    
    ### BEGIN SOLUTION    
    final_data = pd.merge(purchase_data, customer_data, on='User_ID', how='left')
    ### END SOLUTION
    
    return final_data
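
A left merge keeps every row of the left table, so users without a match in the customer table end up with missing values. The sketch below illustrates this on made-up toy data; `Annual_Income` is a hypothetical column name, and the `validate` and `indicator` arguments are optional extras that the solution above does not use:

```python
import pandas as pd

# Toy tables with invented values; 'Annual_Income' is a hypothetical column name.
toy_purchases = pd.DataFrame({'User_ID': [1, 2, 3], 'Purchase_Sum': [350, 180, 90]})
toy_customers = pd.DataFrame({'User_ID': [1, 2], 'Annual_Income': [40000, 55000]})

merged = pd.merge(
    toy_purchases,
    toy_customers,
    on='User_ID',
    how='left',             # keep every user from the purchase table
    validate='one_to_one',  # raise if User_ID is duplicated on either side
    indicator=True,         # add a '_merge' column describing each row's origin
)

# Users present in the purchases but absent from the customer table.
unmatched = merged[merged['_merge'] == 'left_only']

print(merged)
print(unmatched['User_ID'].tolist())
```

Rows listed in `unmatched` carry `NaN` in the customer columns, which is worth knowing before those columns are plotted or modelled.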

Let's have a look at the variables:

In [None]:
final_data = merge_datasets(purchase_data, customer_data)
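numeric_columns = final_data.select_dtypes(include='number').columns
for var in final_data.columns:
    if var in numeric_columns:
        # Numeric variable: histogram (missing values are plotted as 0)
        plt.hist(final_data[var].fillna(0))
    else:
        # Categorical variable: take labels and heights from the same Series so they stay aligned
        counts = final_data[var].value_counts()
        plt.bar(x=counts.index.astype(str), height=counts.values)
    plt.title(var)
    plt.show()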

## Transform variables

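Inspecting the plots reveals the following problems:
- User_ID: an identifier, no longer needed after aggregation
- Marital status: stored as a number although it is categorical
- Occupation: stored as a number although it is categorical

However, we are going to use linear regression, which needs numeric inputs. We therefore leave those two as they are and instead transform the remaining categorical attributes into numeric ones: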
- Gender
- Age
- City category
- Stay in current city years

Fix the ID:

In [None]:
fixed_id_data = final_data.drop(['User_ID'], axis=1)
print(fixed_id_data.head())

Now, write a function to fix the categorical variables using dummy encoding. The new variables should contain the variable name as prefix (see below):

In [None]:
def transform_categorical_variables(fixed_id_data, to_transform):
    transformed_data = fixed_id_data.copy()
    
    ### BEGIN SOLUTION    
    # Loop over the variables to transform, replacing each one with its dummy variables
    for var in to_transform:
        # Pass the Series (not .values) so the dummies keep the same index as the data
        dummies = pd.get_dummies(transformed_data[var], prefix=var, drop_first=True)
        transformed_data = pd.concat([transformed_data.drop(var, axis=1), dummies], axis=1)
    ### END SOLUTION
    
    return transformed_data
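
The following sketch shows what dummy encoding with a prefix looks like, and why one level is dropped when the data feed a linear regression with an intercept. It uses made-up toy data and the `columns=` shortcut of `pd.get_dummies` instead of the loop above; the helper `design_matrix` is hypothetical and exists only for this illustration:

```python
import numpy as np
import pandas as pd

# Toy data with invented values.
toy = pd.DataFrame({
    'City_Category': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Purchase_Sum': [350, 180, 90, 120, 60, 75],
})

# One dummy per level versus one dummy per level except the first.
full = pd.get_dummies(toy, columns=['City_Category'])
reduced = pd.get_dummies(toy, columns=['City_Category'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))


def design_matrix(encoded):
    """Hypothetical helper: intercept column of ones plus the dummy columns."""
    dummies = encoded.filter(like='City_Category_').to_numpy(dtype=float)
    return np.column_stack([np.ones(len(dummies)), dummies])


# With all three dummies the columns sum to the intercept, so the matrix
# has 4 columns but lower rank; dropping one level removes the redundancy.
rank_full = np.linalg.matrix_rank(design_matrix(full))
rank_reduced = np.linalg.matrix_rank(design_matrix(reduced))
print(rank_full, rank_reduced)
```

Both matrices have the same rank, but only the reduced one has as many columns as its rank, which is what an ordinary least-squares fit with an intercept needs to have a unique solution.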

That concludes Part 1: Pre-processing. Apply the transformation and save the result for the next stage in our process, Part 2: Transformation:

In [None]:
to_transform = ['Age', 'Gender', 'City_Category', 'Stay_In_Current_City_Years']
transformed_data = transform_categorical_variables(fixed_id_data, to_transform)
print(transformed_data.head())
transformed_data.to_csv('CS_pre_processed_data.csv')