# EDA - Feature Engineering

This document outlines some concepts used in Exploratory Data Analysis (EDA) and in preparing data for machine learning. It is not exhaustive and covers the following:

1. Missing values
2. Temporal variables
3. Non-Gaussian distributed variables
4. Categorical variables: remove rare labels
5. Categorical variables: convert strings to numbers
6. Standardise the values of the variables to the same range

### Setting the seed
It is important to note that we are engineering variables and pre-processing data with the idea of deploying the model if we find business value in it. Therefore, for each step that includes some element of randomness, it is extremely important that we set the seed. This way, we can obtain reproducibility between our research and our development code.
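
As a minimal sketch of what setting the seed buys us, the snippet below draws the same random sample twice from a toy frame. The names `SEED` and `toy` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

SEED = 0  # hypothetical seed value; any fixed integer works

# toy frame standing in for `data`
toy = pd.DataFrame({"num_col": range(10)})

# the same seed gives the same random sample on every run
sample_a = toy.sample(frac=0.5, random_state=SEED)
sample_b = toy.sample(frac=0.5, random_state=SEED)
print(sample_a.index.equals(sample_b.index))

# NumPy generators are seeded the same way
rng_a = np.random.default_rng(SEED)
rng_b = np.random.default_rng(SEED)
print(np.array_equal(rng_a.normal(size=3), rng_b.normal(size=3)))
```

Both checks print `True`: with a fixed seed, the research code and the deployed code see the same random draws.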

### Code vs Pseudocode

Most of this is actual code. However, since no dataset is included, the variable **data** represents data that has already been loaded.

Where needed, other placeholder names follow this convention: **target_col** for the column that we want to predict or classify, **num_col** for a numerical column, **cat_col** for a categorical column, **str_col** for a string column, and **date_col** for a column containing dates.


In [4]:
# some standard imports
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
%matplotlib inline

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import MinMaxScaler

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [None]:
# load the data from a csv

data = pd.read_csv('file.csv')

# get the dimensions
print(data.shape)

# get a look
data.head()

In [None]:
# Let's separate into train and test set
# this can also be done at a later step
# Remember to set the seed (random_state for this sklearn function)

X_train, X_test, y_train, y_test = train_test_split(data, data.target_col,
                                                    test_size=0.1,
                                                    random_state=0) # we are setting the seed here
X_train.shape, X_test.shape

### Missing Values

For categorical variables, fill missing values with an additional category, "Missing". Treating missingness as its own label makes the data easier to manipulate, and makes it easier to see how missing values relate to the target, than leaving them as **NaN**.

In [None]:
# make a list of the categorical variables that contain missing values
vars_with_na = [var for var in X_train.columns if X_train[var].isnull().sum() > 0 and X_train[var].dtypes == 'O']

# print the variable name and the percentage of missing values
for var in vars_with_na:
    print(var, np.round(X_train[var].isnull().mean() * 100, 1), '% missing values')

In [None]:
# function to replace NA in categorical variables with missing
def fill_categorical_na(df, var_list):
    X = df.copy()
    X[var_list] = df[var_list].fillna('Missing')
    return X

In [None]:
# replace missing values with new label: "Missing"
X_train = fill_categorical_na(X_train, vars_with_na)
X_test = fill_categorical_na(X_test, vars_with_na)

# check that we have no missing information in the engineered variables
# they should now have been replaced with 'Missing'
X_train[vars_with_na].isnull().sum()

In [None]:
# check that test set does not contain null values in the engineered variables
[var for var in vars_with_na if X_test[var].isnull().sum()>0]

For numerical variables, add a binary variable that flags where the value was missing, then replace the missing values in the original variable with the mode (the most frequent value). The mean is a common alternative, but it is sensitive to outliers.
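
The step described above can be sketched as follows. This is one possible implementation, not a fixed recipe: the helper name `fill_numerical_na`, the `_na` suffix for the indicator column, and the toy frames are all assumptions made for illustration. The fill value is learned from the train set only and then reused on the test set.

```python
import numpy as np
import pandas as pd

def fill_numerical_na(train, test, var_list):
    """Add a 0/1 missing indicator per variable, then fill NaN with the train mode."""
    train, test = train.copy(), test.copy()
    for var in var_list:
        # learn the fill value from the train set only, to avoid leakage
        mode_val = train[var].mode()[0]
        for df in (train, test):
            # hypothetical naming: <var>_na flags rows that were missing
            df[var + "_na"] = df[var].isnull().astype(int)
            df[var] = df[var].fillna(mode_val)
    return train, test

# toy data standing in for X_train / X_test
toy_train = pd.DataFrame({"num_col": [1.0, 2.0, 2.0, np.nan]})
toy_test = pd.DataFrame({"num_col": [np.nan, 3.0]})

toy_train, toy_test = fill_numerical_na(toy_train, toy_test, ["num_col"])
print(toy_train)
print(toy_test)
```

With the real data, the call would presumably be `X_train, X_test = fill_numerical_na(X_train, X_test, vars_with_na)`, using the list of numerical variables with missing values.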

In [None]:
# make a list of the numerical variables that contain missing values
# note the dtypes != 'O' for numerical

vars_with_na = [var for var in X_train.columns if X_train[var].isnull().sum() > 0 and X_train[var].dtypes != 'O']

# print the variable name and the percentage of missing values
for var in vars_with_na:
    print(var, np.round(X_train[var].isnull().mean() * 100, 1), '% missing values')

In [None]:
# check that test set does not contain null values in the engineered variables
[var for var in vars_with_na if X_test[var].isnull().sum()>0]

### Temporal variables

If some variables record the year in which a specific event happened, capture the time elapsed between that year and a reference year (here **year_col**):

In [None]:
# replace each year variable with the number of years elapsed up to year_col

def elapsed_years(df, var):
    # capture difference between year variable and year of event
    df[var] = df['year_col'] - df[var]
    return df

In [None]:
for var in ['date_col1', 'date_col2', 'date_col3']:
    X_train = elapsed_years(X_train, var)
    X_test = elapsed_years(X_test, var)

In [None]:
# check that test set does not contain null values in the engineered variables
[var for var in ['date_col1', 'date_col2', 'date_col3'] if X_test[var].isnull().sum()>0]
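
The cells above assume the variables already hold years as numbers. When the columns hold full dates instead, a similar idea can be sketched with pandas datetimes. The column names `date_col` and `event_date_col` and the toy values are hypothetical, and dividing by 365.25 is only an approximation of elapsed years.

```python
import pandas as pd

# toy frame; `date_col` and `event_date_col` are hypothetical column names
toy = pd.DataFrame({
    "date_col": ["2001-06-15", "2010-01-01"],
    "event_date_col": ["2011-06-15", "2012-01-01"],
})

# parse the strings into datetimes
for col in ["date_col", "event_date_col"]:
    toy[col] = pd.to_datetime(toy[col])

# elapsed time in whole days, then converted to approximate years
elapsed_days = (toy["event_date_col"] - toy["date_col"]).dt.days
toy["elapsed_years"] = elapsed_days / 365.25
print(toy)
```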

### Numerical variables

Log-transform the numerical variables that are strictly positive to get a more Gaussian-like distribution; the logarithm is undefined for zero and negative values. This tends to help linear machine learning models.
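
Before applying the transform, it can help to check which columns are actually safe to log. The sketch below does this on a toy frame; the column values are made up, and `loggable` is a hypothetical name.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "num_col1": [1.0, 10.0, 100.0],   # strictly positive: safe to log
    "num_col2": [0.0, 5.0, 50.0],     # contains a zero: np.log would give -inf
})

# keep only the columns whose values are all strictly positive
loggable = [var for var in toy.columns if (toy[var] > 0).all()]
print(loggable)

# transform the safe columns and leave the others untouched
for var in loggable:
    toy[var] = np.log(toy[var])
```

For non-negative columns that do contain zeros, `np.log1p`, which computes log(1 + x), is one possible alternative.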

In [None]:
for var in ['num_col1', 'num_col2', 'num_col3']:
    X_train[var] = np.log(X_train[var])
    X_test[var] = np.log(X_test[var])

In [None]:
# check that the train and test sets do not contain null values in the engineered variables

[var for var in ['num_col1', 'num_col2', 'num_col3']
 if X_train[var].isnull().sum() > 0 or X_test[var].isnull().sum() > 0]

### Categorical variables

First, group the categories that are present in less than 1% of the observations under a single label, "Rare":

In [None]:
# capture the categorical variables first

cat_vars = [var for var in X_train.columns if X_train[var].dtype == 'O']

In [None]:
# find the labels that are shared by more than a certain fraction of the rows in the dataset
def find_frequent_labels(df, var, rare_perc):
    df = df.copy()
    tmp = df.groupby(var)['target_col'].count() / len(df)
    return tmp[tmp>rare_perc].index

for var in cat_vars:
    frequent_ls = find_frequent_labels(X_train, var, 0.01)
    X_train[var] = np.where(X_train[var].isin(frequent_ls), X_train[var], 'Rare')
    X_test[var] = np.where(X_test[var].isin(frequent_ls), X_test[var], 'Rare')
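
To make the effect concrete, here is a small sketch on toy data. It uses `value_counts(normalize=True)` rather than a group-by on the target, which is a swapped-in variant that does not need `target_col` to be present; the helper name `find_frequent_labels_by_count`, the toy frames, and the 20% threshold are all hypothetical.

```python
import numpy as np
import pandas as pd

def find_frequent_labels_by_count(df, var, rare_perc):
    # share of rows carrying each label (no target column needed)
    freq = df[var].value_counts(normalize=True)
    return freq[freq > rare_perc].index

# toy data: "a" covers 60% of rows, "b" 30%, "c" 10%
toy_train = pd.DataFrame({"cat_col": ["a"] * 6 + ["b"] * 3 + ["c"]})
toy_test = pd.DataFrame({"cat_col": ["a", "c", "d"]})

# labels are learned on the train set, then applied to both sets
frequent = find_frequent_labels_by_count(toy_train, "cat_col", 0.2)
for df in (toy_train, toy_test):
    df["cat_col"] = np.where(df["cat_col"].isin(frequent), df["cat_col"], "Rare")

print(toy_train["cat_col"].value_counts())
print(toy_test["cat_col"].tolist())
```

Note that labels appearing only in the test set ("d" here) also end up as "Rare", because the frequent labels come from the train set alone.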

In [None]:
# Next, we need to transform the strings of these variables into numbers.
# We will do it so that we capture the relationship between the label and the target
# this function will assign discrete values to the strings of the variables, 
# so that the smaller value corresponds to the smaller mean of target

def replace_categories(train, test, var, target):
    ordered_labels = train.groupby([var])[target].mean().sort_values().index
    ordinal_label = {k:i for i, k in enumerate(ordered_labels, 0)} 
    train[var] = train[var].map(ordinal_label)
    test[var] = test[var].map(ordinal_label)

In [None]:

for var in cat_vars:
    replace_categories(X_train, X_test, var, 'target_col')
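
One thing worth checking after this step: a label that appears in the test set but not in the train set has no entry in the mapping, so `map` turns it into NaN. The sketch below shows this on toy data; the frames and values are made up for illustration.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "cat_col": ["a", "a", "b", "b", "c"],
    "target_col": [10.0, 20.0, 1.0, 3.0, 50.0],
})
toy_test = pd.DataFrame({"cat_col": ["b", "c", "z"]})  # "z" never appears in train

# order labels by mean target, smallest first: b (2.0), a (15.0), c (50.0)
ordered = toy_train.groupby("cat_col")["target_col"].mean().sort_values().index
mapping = {label: rank for rank, label in enumerate(ordered)}

toy_test["cat_col"] = toy_test["cat_col"].map(mapping)
print(mapping)
print(toy_test["cat_col"].isnull().sum())  # counts labels that had no mapping
```

Running the same kind of null check used in the earlier sections on the encoded variables would surface such cases before modelling.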

In [None]:
# plot the relationship between labels and target
# this makes a series of bar plots between the variables and the target_col
# each plot shows the variable components/values in relationship to the target_col
# if the target has been log-transformed, differences may seem small.

def analyse_vars(df, var):
    df = df.copy()
    df.groupby(var)['target_col'].median().plot.bar()
    plt.title(var)
    plt.ylabel('target_col')
    plt.show()
    
for var in cat_vars:
    analyse_vars(X_train, var)

### Feature Scaling

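Linear models generally need features on a comparable scale, so the features are scaled or normalised before fitting. Not all models require this; tree-based models, for example, are insensitive to feature scale.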

In [None]:
# select the features to scale: every column except the identifier ('Id') and the target

train_vars = [var for var in X_train.columns if var not in ['Id', 'target_col']]
len(train_vars)

# fit scaler
scaler = MinMaxScaler() # create an instance
scaler.fit(X_train[train_vars]) #  fit  the scaler to the train set for later use

# transform the train and test set, and add back the Id and target_col variables
X_train = pd.concat([X_train[['Id', 'target_col']].reset_index(drop=True),
                    pd.DataFrame(scaler.transform(X_train[train_vars]), columns=train_vars)],
                    axis=1)

X_test = pd.concat([X_test[['Id', 'target_col']].reset_index(drop=True),
                    pd.DataFrame(scaler.transform(X_test[train_vars]), columns=train_vars)],
                    axis=1)
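
As a small worked example of what the scaler does, the sketch below fits `MinMaxScaler` on a toy train set and applies it to a toy test set. Min-max scaling maps each value x to (x - min) / (max - min), with min and max taken from the data the scaler was fitted on. The toy frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

toy_train = pd.DataFrame({"num_col": [0.0, 5.0, 10.0]})
toy_test = pd.DataFrame({"num_col": [2.5, 20.0]})

scaler = MinMaxScaler()
scaler.fit(toy_train)  # learns min=0 and max=10 from the train set only

train_scaled = scaler.transform(toy_train)
test_scaled = scaler.transform(toy_test)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the minimum and maximum come from the train set, a test value larger than the train maximum (20.0 here) is scaled to a value above 1. This is expected behaviour rather than an error.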

