# Data sampling and feature engineering
Example of data processing starting from scratch (from data file import).

All we need is the powerful ***pandas*** module plus a few auxiliary ones:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# Feature engineering techniques
## Some of the most common feature engineering practices:

1. Missing values - remove another article.
2. Dealing with categorical features - bucketing into bins, one-hot encoding
 2.1, 2.2 ...
3. Feature transformation
 3.1 Log
 3.2 Feature expansions: polynomials, kernels
 3.3 Normalization
4. Feature crossing - (long, lat), date
5. Row aggregation
 5.1 Moving averages
6. Embedding
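
Several items in this list are not demonstrated later in the notebook, so here is a minimal sketch of a few of them on a made-up toy frame. The column names (`x`, `y`, and the derived ones) and all values are hypothetical and have nothing to do with the Bad Drivers data; min-max scaling is only one of several possible normalization schemes.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Bad Drivers dataset
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'y': [10.0, 20.0, 30.0, 40.0, 50.0]})

# 3.2 Polynomial expansion: add the square of a feature
toy['x_squared'] = toy['x'] ** 2

# 3.3 Min-max normalization: rescale a feature to the [0, 1] range
toy['y_norm'] = (toy['y'] - toy['y'].min()) / (toy['y'].max() - toy['y'].min())

# 4. Feature crossing: combine two features into one (here, their product)
toy['x_times_y'] = toy['x'] * toy['y']

# 5.1 Moving average over a window of 3 rows (the first 2 rows are NaN)
toy['x_ma3'] = toy['x'].rolling(window=3).mean()
```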



We will use the ***"Bad Drivers"*** dataset (freely available from *FiveThirtyEight* at https://github.com/fivethirtyeight/data/tree/master/bad-drivers).

Our script parameters:

In [2]:
# Source data file
SOURCE = 'input/bad-drivers.csv'

# Number of bucketing bins
BINS = 10

# Drop factor (related to standard deviation)
DROP_FACTOR = 2

Let's load the dataset first using the built-in pandas method and have a look at it.

In [3]:
df = pd.read_csv(SOURCE)
df

Unnamed: 0,State,Number of drivers involved in fatal collisions per billion miles,Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding,Percentage Of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired,Percentage Of Drivers Involved In Fatal Collisions Who Were Not Distracted,Percentage Of Drivers Involved In Fatal Collisions Who Had Not Been Involved In Any Previous Accidents,Car Insurance Premiums ($),Losses incurred by insurance companies for collisions per insured driver ($)
0,Alabama,18.8,39,30,96,80,784.55,145.08
1,Alaska,18.1,41,25,90,94,1053.48,133.93
2,Arizona,18.6,35,28,84,96,899.47,110.35
3,Arkansas,22.4,18,26,94,95,827.34,142.39
4,California,12.0,35,28,91,89,878.41,165.63
5,Colorado,13.6,37,28,79,95,835.5,139.91
6,Connecticut,10.8,46,36,87,82,1068.73,167.02
7,Delaware,16.2,38,30,87,99,1137.87,151.48
8,District of Columbia,5.9,34,27,100,100,1273.89,136.05
9,Florida,17.9,21,29,92,94,1160.13,144.18


This dataset is small, but well suited for demo purposes.

In [None]:
# Number of rows and columns
df.shape

Let's describe it.

In [None]:
#df.info()

In [None]:
df.describe()

In [None]:
df.hist(figsize=(30, 15))
plt.show()

The dataset looks complete, but what if something is *missing*? Pandas can help us handle this problem before it breaks a model. Let's create a faulty dataframe as a copy of the current data and remove some values in the first data column.

In [None]:
df_bad = df.copy()
df_bad.iloc[[0, 1, 2], 1] = None
df_bad

Have you noticed the missing values for the first 3 states? It is not a good idea to continue modeling with a partially filled dataset. What about dropping the rows with missing data?

In [None]:
df_recovered = df_bad.dropna()
df_recovered

The recovered dataset looks cleaner, but we have lost valuable rows... So, let's drop it and return to the original one.
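
Dropping rows is not the only option. As an alternative that this notebook does not otherwise use, missing values can be imputed, for example with the column mean. A minimal sketch on a hypothetical toy frame (the `state` and `rate` columns and their values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values
toy = pd.DataFrame({'state': ['A', 'B', 'C', 'D'],
                    'rate': [np.nan, 12.0, np.nan, 18.0]})

# Replace missing values with the mean of the observed ones (here 15.0)
toy_filled = toy.fillna({'rate': toy['rate'].mean()})
```

All rows are kept this way, at the cost of introducing values that were never observed, so whether it is appropriate depends on the model and the data.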

What about the ***Top 5*** and ***Bottom 5*** states by number of drivers involved in fatal collisions per billion miles? Pandas finds these lists with very little effort.

In [None]:
df.sort_values(by=['Number of drivers involved in fatal collisions per billion miles'], ascending=False).head(5)

In [None]:
df.sort_values(by=['Number of drivers involved in fatal collisions per billion miles'], ascending=False).tail(5)
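
The same kind of selection can also be written with `nlargest` and `nsmallest`, which skip the explicit sort. A minimal sketch on a hypothetical toy frame (the `state` and `rate` columns are made up):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'state': list('ABCDEFG'),
                    'rate': [18.8, 5.9, 22.4, 12.0, 10.8, 16.2, 13.6]})

# Three rows with the highest rate, in descending order
top3 = toy.nlargest(3, 'rate')

# Three rows with the lowest rate, in ascending order
bottom3 = toy.nsmallest(3, 'rate')
```

Note that `nsmallest` returns rows in ascending order, whereas taking `tail()` of a descending sort lists them in descending order.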

The difference between the two sets looks too high, so it might be reasonable to drop outliers based on the standard deviation.

In [None]:
# Column used for outlier detection
col = 'Number of drivers involved in fatal collisions per billion miles'

# Limits at DROP_FACTOR standard deviations around the mean
upper_lim = df[col].mean() + df[col].std() * DROP_FACTOR
lower_lim = df[col].mean() - df[col].std() * DROP_FACTOR

# Keep only the rows strictly inside the limits
df = df[(df[col] < upper_lim) & (df[col] > lower_lim)]
df
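
The filtering rule above can be sketched as a small reusable function. The name `drop_outliers` and the toy values are hypothetical; the logic is the same mean plus or minus a multiple of the standard deviation.

```python
import pandas as pd

def drop_outliers(frame, column, factor=2):
    """Keep rows whose value lies strictly within mean +/- factor * std."""
    mean = frame[column].mean()
    std = frame[column].std()  # sample standard deviation (ddof=1)
    lower = mean - factor * std
    upper = mean + factor * std
    return frame[(frame[column] > lower) & (frame[column] < upper)]

# Hypothetical toy data with one obvious outlier (50)
toy = pd.DataFrame({'value': [10, 11, 12, 10, 11, 12, 10, 11, 12, 50]})
filtered = drop_outliers(toy, 'value', factor=2)
```

Keep in mind that the outlier itself inflates both the mean and the standard deviation, so on very small samples this rule may fail to flag anything.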

Now the Top 5 and Bottom 5 are closer together.

In [None]:
df.sort_values(by=['Number of drivers involved in fatal collisions per billion miles'], ascending=False).head(5)

In [None]:
df.sort_values(by=['Number of drivers involved in fatal collisions per billion miles'], ascending=False).tail(5)

We may also apply a mathematical transformation to the data if needed. For example, let's add the base-10 ***log*** of the first variable as a new last column.

In [None]:
df = df.assign(log_n = np.log10(df['Number of drivers involved in fatal collisions per billion miles']))
df.rename(columns={'log_n':'Log of number of drivers involved in fatal collisions per billion miles'}, inplace=True)
df
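
To see what the log transform does, here is a minimal sketch on hypothetical values spanning several orders of magnitude. The `log1p` variant is an assumption added for illustration: it computes log(1 + x) and stays finite when a feature contains zeros, which plain `log10` does not.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data spanning several orders of magnitude
toy = pd.DataFrame({'value': [1.0, 10.0, 100.0, 1000.0]})

# Base-10 logarithm turns multiplicative steps into additive ones
toy['log10_value'] = np.log10(toy['value'])

# Natural log of (1 + x); defined at x = 0
toy['log1p_value'] = np.log1p(toy['value'])
```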

What if we need so-called *bucketing* (partitioning the entire range of a numerical feature into bins, possibly as a step towards a categorical variable)? Again, pandas can help us (the result is added as the last column).

In [None]:
bin_labels = list(range(BINS))
df.loc[:, 'Fatal collisions as bins'] = pd.cut(df['Number of drivers involved in fatal collisions per billion miles'],
                                        BINS, labels=bin_labels)
df
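
`pd.cut` with an integer number of bins produces equal-width bins. When the values are skewed, equal-frequency bins from `pd.qcut` may be a useful alternative. A minimal sketch comparing the two on hypothetical values:

```python
import pandas as pd

# Hypothetical skewed values: nine small numbers and one large one
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width bins: the range 1..100 is split in half at 50.5
equal_width = pd.cut(values, 2, labels=[0, 1])

# Equal-frequency bins: the split is at the median, 5.5
equal_freq = pd.qcut(values, 2, labels=[0, 1])

# Count how many values fall into the first bin in each scheme
n_width_bin0 = equal_width.astype(int).tolist().count(0)
n_freq_bin0 = equal_freq.astype(int).tolist().count(0)
```

With equal-width bins nine of the ten values land in the first bin, while equal-frequency bins split them five and five.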

What if our future model requires a ***one-hot-encoded*** feature set built from one or several categorical variables? For example, we may proceed with the new variable *Fatal collisions as bins*.

This transformation is presented as a separate pandas dataframe in order to keep a neat representation. Also this trick could help to prepare a separate array for the dependent variable.

In [None]:
df_y = pd.get_dummies(df['Fatal collisions as bins'], prefix='bin')
df_y
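
To make the shape of the result concrete, here is a minimal sketch of one-hot encoding on a hypothetical series of bin labels. Passing `dtype=int` is an optional choice made here so that the output contains 0/1 integers; without it, recent pandas versions return boolean columns.

```python
import pandas as pd

# Hypothetical bin labels for four rows
bins = pd.Series([0, 2, 1, 2], name='bin')

# One column per distinct label, with a 1 marking the label of each row
one_hot = pd.get_dummies(bins, prefix='bin', dtype=int)
```

Each row of `one_hot` contains exactly one 1, in the column that corresponds to its bin.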

Is your dependent variable ready? Feel free to make a final numpy array output and go ahead with the modeling!

In [None]:
df_y.values