# Processing data

Many ML algorithms need data that is numeric, complete (no missing values), and standardized. Tree ensembles are the most accommodating and require the least preprocessing.
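As a rough sketch of what "numeric and complete" means in practice, the helper below reports, for each column, whether its dtype is numeric and how many values are missing. `readiness_report` is a hypothetical name (it is not part of pandas), and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def readiness_report(frame):
    """Summarize, per column, whether it is numeric and how many values are missing."""
    return pd.DataFrame({
        'is_numeric': [is_numeric_dtype(frame[c]) for c in frame.columns],
        'n_missing': [int(frame[c].isnull().sum()) for c in frame.columns],
    }, index=frame.columns)


# Toy data (made up): one numeric column with a gap, one string column.
toy = pd.DataFrame({
    'age': [22.0, np.nan, 35.0],
    'sex': ['male', 'female', 'female'],
})
report = readiness_report(toy)
print(report)
```

Columns flagged as non-numeric need encoding, and columns with a non-zero missing count need imputation, before most estimators will accept them.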

In [None]:
url = (
    'http://biostat.mc.vanderbilt.edu/' 
    'wiki/pub/Main/DataSets/titanic3.xls'
)

In [None]:
import pandas as pd

In [None]:
df = pd.read_excel(url)
df_orig = df.copy()

### Basic inspection

In [None]:
df.sample(5)

In [None]:
df.info()

### Detailed inspection

In [None]:
import pandas_profiling as pp

In [None]:
pp.ProfileReport(df)

### Create new features

Sometimes we want to create new features from existing columns. For example, the `name` column can be mined to extract titles. We illustrate how to do this but do not use the result here.

In [None]:
# Raw string so that `\.` is passed to the regex engine as a literal period
df['title'] = df.name.str.extract(r'.*([A-Z][a-z]+)\..*')
df.title.value_counts()
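To see the idea in isolation, here is a minimal sketch on made-up names, assuming a "Surname, Title. Given names" layout. The pattern is a simplified one chosen for illustration, not the one used in the cell above.

```python
import pandas as pd

# Toy names in an assumed "Surname, Title. Given names" layout (made up).
names = pd.Series([
    'Smith, Mr. John',
    'Jones, Miss. Mary Ann',
    'Brown, Master. Tom',
])

# Capture the first run of letters that is immediately followed by a period.
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)
print(titles.tolist())
```

With `expand=False` and a single capture group, `str.extract` returns a Series rather than a one-column DataFrame.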

### Drop features

These features are either uninformative or leak information about the outcome.

In [None]:
target = df.survived
df = df.drop(columns=[
    'survived', 'name', 'ticket', 'cabin',
    'boat', 'body', 'home.dest', 'title',
])

### Inspect for missing data

In [None]:
import missingno as mn

In [None]:
mn.matrix(df);

### Fill in missing values for categorical variables

In [None]:
df.select_dtypes('object').isnull().sum()

In [None]:
df['embarked'] = df['embarked'].fillna('')
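Filling with a constant effectively turns "missing" into its own category. A minimal sketch of that choice next to one common alternative, filling with the most frequent value, on a made-up column (the `'missing'` label is an assumption; the cell above uses an empty string):

```python
import numpy as np
import pandas as pd

# Toy column (made up): two ports and one missing entry.
port = pd.Series(['S', 'C', np.nan, 'S'], dtype='object')

# Option 1: treat "missing" as its own category.
as_category = port.fillna('missing')

# Option 2: replace missing entries with the most frequent observed value.
most_frequent = port.mode().iloc[0]
as_mode = port.fillna(most_frequent)

print(as_category.tolist())
print(as_mode.tolist())
```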

### Tangent: `catboost` is nice

Minimal processing or tuning is required to use `catboost`, making it a nice "default" algorithm.

In [None]:
! python3 -m pip install --quiet catboost

In [None]:
import catboost

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df, target, random_state=0)

In [None]:
cb = catboost.CatBoostClassifier()

In [None]:
cb.fit(X_train, y_train, 
       cat_features=['sex', 'embarked'],
       verbose=0);

In [None]:
cb.score(X_test, y_test)

### Category encoding

#### Vanilla encoding

For variables with only a few distinct values, one-hot encoding (dummy variables) is often used. For variables with many distinct values, we can use hash encoding, which follows the same idea but assigns values to a fixed number of bins using a hash function.

We may choose to drop one of the created columns to avoid multicollinearity.

In [None]:
pd.get_dummies(df, drop_first=True).head()
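Hash encoding is not shown in the cell above, so here is a hand-rolled sketch of the idea: hash each value, take the result modulo the number of bins, and one-hot encode the bin. `hash_encode` and the city names are hypothetical, and this is an illustration rather than the implementation used by any particular library.

```python
import hashlib

import numpy as np
import pandas as pd


def hash_encode(values, n_bins=4):
    """One-hot encode the hash bin of each value (hypothetical helper)."""
    def bucket(value):
        # md5 is used only because it is stable across Python sessions,
        # unlike the built-in hash() for strings.
        digest = hashlib.md5(str(value).encode('utf-8')).hexdigest()
        return int(digest, 16) % n_bins

    buckets = np.array([bucket(v) for v in values])
    onehot = np.eye(n_bins, dtype=int)[buckets]  # row i is the indicator of bin i
    return pd.DataFrame(onehot, columns=[f'hash_{i}' for i in range(n_bins)])


cities = ['Paris', 'London', 'Paris', 'Tokyo', 'Lima', 'Oslo']
encoded = hash_encode(cities, n_bins=4)
print(encoded)
```

The number of columns stays fixed at `n_bins` however many distinct values appear, at the cost of unrelated values sometimes sharing a bin.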

#### Target encoding

We can use the target to find a more informative encoding. Note that these methods leak information from the target into the features and are prone to overfitting.

In [None]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

In [None]:
import category_encoders as ce

In [None]:
te = ce.TargetEncoder()

In [None]:
# Target encoding applies to the categorical (object) columns
te.fit_transform(df.select_dtypes('object'), target).head()
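To make the leakage concern concrete, here is a minimal sketch of plain mean target encoding in which the mapping is learned from training rows only and then applied to test rows. The data are made up, and this unsmoothed version is an illustration, not necessarily what `ce.TargetEncoder` computes.

```python
import pandas as pd

# Toy training and test data (made up for illustration).
train = pd.DataFrame({
    'port': ['S', 'S', 'C', 'C', 'C', 'Q'],
    'survived': [0, 1, 1, 1, 0, 0],
})
test = pd.DataFrame({'port': ['S', 'C', 'Z']})

# Learn the mapping from the training rows only.
global_mean = train['survived'].mean()
category_means = train.groupby('port')['survived'].mean()

# Apply it to both sets; categories unseen in training fall back to the global mean.
train_encoded = train['port'].map(category_means)
test_encoded = test['port'].map(category_means).fillna(global_mean)

print(test_encoded.tolist())
```

Even with this discipline, each training row's own label contributes to its encoded value, which is one reason the method can overfit on rare categories.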

### Split data into train and test data sets

Before we go further, we split the data into training and test sets to avoid data leakage.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df, target)

### Category encoding

#### We will be conservative and avoid risk of leakage

Note that we do not bother to drop a column: multicollinearity is only a problem when fitting linear models without regularization, which is rarely done in ML (cf. statistics).
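The multicollinearity point can be checked numerically. In this sketch on a made-up column, the full one-hot encoding plus an intercept column is rank deficient, while dropping one level restores full column rank. `with_intercept` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Toy categorical column (made up for illustration).
sex = pd.Series(['male', 'female', 'female', 'male'], name='sex')

full = pd.get_dummies(sex, dtype=float)                      # one column per level
reduced = pd.get_dummies(sex, drop_first=True, dtype=float)  # one level dropped


def with_intercept(dummies):
    """Prepend a column of ones, as an unregularized linear model would."""
    ones = np.ones((len(dummies), 1))
    return np.hstack([ones, dummies.to_numpy()])


full_design = with_intercept(full)
reduced_design = with_intercept(reduced)

# The dummy columns of the full encoding sum to the intercept column.
print(full_design.shape[1], np.linalg.matrix_rank(full_design))
print(reduced_design.shape[1], np.linalg.matrix_rank(reduced_design))
```

Regularized linear models and tree-based models do not rely on the design matrix having full column rank, which is why the redundant column is harmless here.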

In [None]:
ohe = ce.OneHotEncoder(cols=['sex', 'embarked'], use_cat_names=True)

In [None]:
X_train = ohe.fit_transform(X_train)
X_test = ohe.transform(X_test)

In [None]:
X_train.head()

### Impute missing numeric values

#### Vanilla imputation

A simple imputation is to fill with mean or median.

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
si = SimpleImputer(strategy='mean')

In [None]:
X_train.select_dtypes('number').head(3)

We illustrate the code but will use a fancier imputation method instead.

```python
X_train[X_train.select_dtypes('number').columns] = \
si.fit_transform(X_train.select_dtypes('number'))
X_test[X_test.select_dtypes('number').columns] = \
si.transform(X_test.select_dtypes('number'))
```
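The choice between mean and median matters most for skewed columns. A small sketch on a made-up column with one outlier:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column (made up) with one large outlier and one missing value.
col = np.array([[1.0], [2.0], [3.0], [100.0], [np.nan]])

mean_filled = SimpleImputer(strategy='mean').fit_transform(col)
median_filled = SimpleImputer(strategy='median').fit_transform(col)

print(mean_filled[-1, 0])    # mean of the four observed values
print(median_filled[-1, 0])  # median of the four observed values
```

The mean is pulled towards the outlier, while the median stays near the bulk of the observed values.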

#### Fancy imputation

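`IterativeImputer` models each feature with missing values as a function of the other features and cycles through the features repeatedly, which is similar in spirit to `mice` in R.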

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [None]:
ii = IterativeImputer(random_state=0)

In [None]:
X_train[X_train.select_dtypes('number').columns] = \
ii.fit_transform(X_train.select_dtypes('number'))

In [None]:
X_test[X_test.select_dtypes('number').columns] =  \
ii.transform(X_test.select_dtypes('number'))

In [None]:
X_train.isnull().sum().sum(), X_test.isnull().sum().sum()
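One practical difference from `mice` is that `IterativeImputer` returns a single completed data set by default. A hedged sketch of how several imputations could be drawn instead, using `sample_posterior=True` with different seeds on made-up data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix (made up): two roughly linearly related columns with gaps.
x_toy = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, 5.9],
    [4.0, np.nan],
    [np.nan, 10.0],
])

# Each seed draws the missing entries from the estimator's predictive
# distribution, giving one completed data set per seed.
imputations = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(x_toy)
    for seed in range(3)
]

for completed in imputations:
    print(completed[3, 1], completed[4, 0])
```

Downstream analyses can then be run on each completed data set and the results pooled. How to pool them is outside the scope of this notebook.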

#### Simple example to illustrate differences

In [None]:
import numpy as np

In [None]:
x = np.array([
    [10, 10],
    [1, 1],
    [2,2],
    [10, 10],
    [10, np.nan],
    [np.nan, 10],
    [np.nan, np.nan]
])

In [None]:
si.fit_transform(x)

In [None]:
ii.fit_transform(x)

In [None]:
X_train.to_csv('data/X_train_unscaled.csv', index=False)
X_test.to_csv('data/X_test_unscaled.csv', index=False)
y_train.to_csv('data/y_train_unscaled.csv', index=False)
y_test.to_csv('data/y_test_unscaled.csv', index=False)

### Standardize data

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
X_train.iloc[:, :] = scaler.fit_transform(X_train)
X_test.iloc[:, :] = scaler.transform(X_test)
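What the scaler does can be written out by hand. The sketch below, on made-up arrays, estimates the mean and standard deviation on the training rows only and reuses them for the test rows, mirroring the `fit_transform` / `transform` pair above.

```python
import numpy as np

# Toy train/test matrices (made up for illustration).
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics are estimated on the training rows only ...
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# ... and then reused unchanged for the test rows.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

The scaled training columns have mean 0 and standard deviation 1. The test columns generally do not, because they are scaled with the training statistics.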

### Save processed data for future use

In [None]:
X_train.to_csv('data/X_train.csv', index=False)
X_test.to_csv('data/X_test.csv', index=False)
y_train.to_csv('data/y_train.csv', index=False)
y_test.to_csv('data/y_test.csv', index=False)
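`to_csv` does not create missing directories, so the `data/` folder has to exist before the cell above runs. A minimal round-trip sketch, using a temporary directory and made-up stand-ins for the processed data:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the processed features and target (made up).
X_toy = pd.DataFrame({'age': [0.5, -0.5], 'fare': [1.2, -1.2]})
y_toy = pd.Series([1, 0], name='survived')

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'data'
    # Create the output directory first; exist_ok avoids an error on reruns.
    out_dir.mkdir(parents=True, exist_ok=True)

    X_toy.to_csv(out_dir / 'X_train.csv', index=False)
    y_toy.to_csv(out_dir / 'y_train.csv', index=False)

    X_back = pd.read_csv(out_dir / 'X_train.csv')
    # A Series is written as a one-column file; squeeze turns it back into a Series.
    y_back = pd.read_csv(out_dir / 'y_train.csv').squeeze('columns')

print(X_back)
print(y_back)
```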