# Feature engineering notebook

In this notebook the feature engineering part of the data preprocessing takes place. In the data exploration notebook we have found some interesting features about the data, checked the distributions and found outlier datapoints which possibly would disturb the learning. In this notebook we carry out some preprocessing to remedy the issues found with the dataset, construct the features vector - label pairs and to split the dataset into train, validation and test sets.

## Structure

The following preprocessing steps are going to be carried out:
* converting the lengths of the offers into hours
* extracting features from the transcript datasets value column
* Removing the people and all of their transactions from the data who are considered to be outliers based on their total spent money during the 30 day period
* Encode the gender in the profile dataset by one-hot-encoding: 'M', 'F', 'O', 'U'
* Standardizing the non missing values in the age and income data with sklearn's StandardScaler.
* Filling up the None type values in the age and income columns with zeros (mean after standardization)
* calculate the membership length and remove the became_member_on field.
* Transforming the membership length dataset with sklearn's QuantileTransformer
* Creating the training dataset by creating feature vectors for every received offer and creating label for it which puts the offer in one of the following cathegories:
    * not viewed
    * viewed
    * viewed and completed
    * not vieweed but still completed
* splitting the dataset into training, validation and test datasets
* creating the data loaders
---

In [137]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
from sklearn.preprocessing import StandardScaler, QuantileTransformer
from importlib import reload
from source import feature_helpers, exploration_helpers

In [138]:
reload(feature_helpers)

<module 'source.feature_helpers' from '/home/ferenc/Documents/Udacity/Machine_Learning_Engineer/Starbucks_Capstone_Project/source/feature_helpers.py'>

### loading the data

In [139]:
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

In [140]:
profile = pd.read_json('data/profile.json', orient='records', lines=True)

### Converting the lengths of offers into hours

In [141]:
# converting the duration into hours:
portfolio.duration = portfolio.duration.apply(lambda x: x * 24)

### Extracting transcript values

In [142]:
# extracting the transaction values:
transcript = exploration_helpers.extract_transcript_values(transcript)

### Removing outliers

Removing persons from the dataset who have outlier amount of spent money. First the total amount of money spent per person is calculated:

In [143]:
cache_dir = 'cache'
cache_file = 'profile_total_spent.csv'
profile = feature_helpers.get_total_spent(profile, transcript, cache_file=os.path.join(cache_dir, cache_file))

Read preprocessed data from cache file: cache/profile_total_spent.csv


In [144]:
profile.head()

Unnamed: 0,gender,age,id,became_member_on,income,total_spent
0,,118,68be06ca386d4c31939f3a4f0e3dd783,20170212,,20.4
1,F,55,0610b486422d4921ae7d2bf64640c50b,20170715,112000.0,77.01
2,,118,38fe809add3b4fcf9315a9694bb96ff5,20180712,,14.3
3,F,75,78afa995795e4d85b5d9ceeca43f5fef,20170509,100000.0,159.27
4,,118,a03223e636434f42ac4c3df47e8bac43,20170804,,4.65


The the outliers are removed:

In [145]:
print('Number of people originally: {}'.format(len(profile.index)))
print('Number of transcript data originally: {}'.format(len(transcript.index)))

# removing outplier spending people.
profile, transcript = feature_helpers.remove_outliers(profile, transcript, lower_threshold=0, upper_threshold=300)

print('Number of people without outliers: {}'.format(len(profile.index)))
print('Number of transcript data without outliers: {}'.format(len(transcript.index)))

Number of people originally: 17000
Number of transcript data originally: 306534
Number of people without outliers: 15840
Number of transcript data without outliers: 284047


It can be seen that around 6.8% of the people and around 7.3% of the total transcript data is removed.

### One-hot-encoding the gender

For further use the gender has to be encoded using one-hot-encoding.

In [146]:
# inserting the extra columns into profile:
profile = feature_helpers.encode_gender(profile)

profile[1: 10]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  profile[gender][gender_mask] = 1


Unnamed: 0,F,M,O,U,age,id,became_member_on,income,total_spent
1,1,0,0,0,55,0610b486422d4921ae7d2bf64640c50b,20170715,112000.0,77.01
2,0,0,0,1,118,38fe809add3b4fcf9315a9694bb96ff5,20180712,,14.3
3,1,0,0,0,75,78afa995795e4d85b5d9ceeca43f5fef,20170509,100000.0,159.27
4,0,0,0,1,118,a03223e636434f42ac4c3df47e8bac43,20170804,,4.65
5,0,1,0,0,68,e2127556f4f64592b11af22de27a7932,20180426,70000.0,57.73
7,0,0,0,1,118,68617ca6246f4fbc85e91a2a49552598,20171002,,0.24
8,0,1,0,0,65,389bc3fa690240e798340f5a15918d5c,20180209,53000.0,36.43
9,0,0,0,1,118,8974fc5686fe429db53ddde067b88302,20161122,,15.62
10,0,0,0,1,118,c4863c7985cf408faee930f111475da3,20170824,,66.41


### Standardizing age and income

standardizing the age and income data using sklearn's StandarScaler.

In [149]:
# instantiating the standard scaler object:
standard_scaler = StandardScaler()

# transforming the data:
profile.age[profile.age != 118] = standard_scaler.fit_transform(profile.age[profile.age != 118].values.reshape(-1, 1))
profile.income = standard_scaler.fit_transform(profile.income.values.reshape(-1, 1))

### Filling up None type values

The age and income data of the people who belong to the Unknown cathegory will be filled with the respective means.

In [150]:
# filling up the unknown values in the age column:
profile.age[profile.age == 118] = 0
profile.income[profile.income.isna()] = 0

profile.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  profile.age[profile.age == 118] = 0
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  profile.income[profile.income.isna()] = 0


Unnamed: 0,F,M,O,U,age,id,became_member_on,income,total_spent
0,0,0,0,1,2.050423,68be06ca386d4c31939f3a4f0e3dd783,20170212,0.0,20.4
1,1,0,0,0,-0.279687,0610b486422d4921ae7d2bf64640c50b,20170715,2.214943,77.01
2,0,0,0,1,2.050423,38fe809add3b4fcf9315a9694bb96ff5,20180712,0.0,14.3
3,1,0,0,0,0.46003,78afa995795e4d85b5d9ceeca43f5fef,20170509,1.654475,159.27
4,0,0,0,1,2.050423,a03223e636434f42ac4c3df47e8bac43,20170804,0.0,4.65


### Calculating membership length

Calculating the membership length and removing the became_member_on field.

In [113]:
profile = exploration_helpers.get_membership_length(profile)

profile.head()

Unnamed: 0,F,M,O,U,age,id,income,total_spent,membership_length
0,0,0,0,1,62.561995,68be06ca386d4c31939f3a4f0e3dd783,64576.632252,20.4,529
1,1,0,0,0,55.0,0610b486422d4921ae7d2bf64640c50b,112000.0,77.01,376
2,0,0,0,1,62.561995,38fe809add3b4fcf9315a9694bb96ff5,64576.632252,14.3,14
3,1,0,0,0,75.0,78afa995795e4d85b5d9ceeca43f5fef,100000.0,159.27,443
4,0,0,0,1,62.561995,a03223e636434f42ac4c3df47e8bac43,64576.632252,4.65,356


In [None]:
# standardizing the age column:


In [135]:
a = {'a': [0, 1, 2, float('nan')], 'b': [1, 2, 3, 4]}
df = pd.DataFrame(a)
df.insert(loc=1, column='c', value=np.zeros_like(df.index))
df

Unnamed: 0,a,c,b
0,0.0,0,1
1,1.0,0,2
2,2.0,0,3
3,,0,4


In [136]:
standard_scaler = StandardScaler()
df.a = standard_scaler.fit_transform(df.a.values.reshape(-1,1))
df

Unnamed: 0,a,c,b
0,-1.224745,0,1
1,0.0,0,2
2,1.224745,0,3
3,,0,4
