# Data Dojo 20 - Missing Data

[`scikit-learn` documentation on imputation](https://scikit-learn.org/stable/modules/impute.html)

## Specific Tasks

- How often is each feature missing?
- Try a simple imputer
- Try a more sophisticated imputation strategy
- Optional: try a model that can handle missing values / a multi-stage modeling approach

## Setup



In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

### Hacking Order

In [2]:
np.random.seed(42)
names = ["Sascha", "Marko", "Sebastian", "Max", "Markus", "Sabine", "Caro", "Prithivi", "Mike", "Robin"]
np.random.shuffle(names)
" => ".join(names)

'Mike => Marko => Sabine => Sascha => Prithivi => Sebastian => Robin => Markus => Max => Caro'

### Data Loading

In [4]:
data = pd.read_csv("https://github.com/ddojo/ddojo.github.io/raw/main/sessions/14_trees/train.tsv", sep="\t")
test = pd.read_csv("https://github.com/ddojo/ddojo.github.io/raw/main/sessions/14_trees/test.tsv", sep="\t")

#### All cases

In [5]:
X = data.drop("species",axis=1)
y = data.species
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

In [6]:
X_test = test.drop("tree_id",axis=1)
tree_id = test.tree_id
pred = pd.DataFrame()
pred["tree_id"] = tree_id
pred["species"] = "unknown"

In [15]:
X_train.isna().sum()

latitude               0
longitude              0
stem_diameter_cm       0
height_m             316
crown_radius_m      2512
dtype: int64
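The counts above answer "how often" in absolute terms; a relative share per feature is often easier to compare. A minimal sketch on a made-up toy frame (only the column names are taken from the notebook, the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for X_train; the values are invented, only the column names match.
toy = pd.DataFrame({
    "stem_diameter_cm": [10.0, 12.5, 8.0, 20.0],
    "height_m": [5.0, np.nan, 4.0, 9.0],
    "crown_radius_m": [np.nan, np.nan, 1.5, 3.0],
})

# isna() yields a boolean frame; the mean of booleans is the share of True values.
missing_share = toy.isna().mean()
print(missing_share)
```

On the real data the same call would be `X_train.isna().mean()`.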

In [23]:
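# Show every row with at least one missing value.
# (`np.nan in entry` checks the index labels of the row, not its values, so it never matches.)
X_train[X_train.isna().any(axis=1)]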

In [24]:
from sklearn.impute import SimpleImputer

In [25]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X_train)

In [36]:
X_train_full = imp.transform(X_train)

In [33]:
y_train.isna().sum()

0

In [35]:
X_val_full = imp.transform(X_val)
X_test_full = imp.transform(X_test)

In [29]:
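# imp.transform returns a NumPy array, so wrap it in a DataFrame before calling isna()
pd.DataFrame(X_train_full).isna().sum()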

0    0
1    0
2    0
3    0
4    0
dtype: int64

In [37]:
from sklearn.ensemble import RandomForestClassifier

In [39]:
RF = RandomForestClassifier(n_estimators = 100, criterion="log_loss", oob_score=True)

In [40]:
RF.fit(X_train_full, y_train)

In [41]:
RF.score(X_val_full, y_val)

0.9516809116809117

In [46]:

imp = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=-1)
imp.fit(X_train)

In [47]:
X_train_full = imp.transform(X_train)
X_val_full = imp.transform(X_val)
X_test_full = imp.transform(X_test)

In [52]:
RF.fit(X_train_full, y_train)

In [53]:
RF.score(X_val_full, y_val)

0.9535042735042735
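The constant fill scores marginally higher than the mean fill here. The gap is small and the forest is not seeded, so it may well be noise, but one possible reading is that the fact that a value is missing carries some signal. The `add_indicator` parameter of `SimpleImputer` (visible in the signature printed further down) keeps that information explicitly. A minimal sketch on a made-up array, not the tree data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny invented array with one missing value in each column.
X_toy = np.array([
    [1.0, np.nan],
    [2.0, 4.0],
    [np.nan, 6.0],
])

# Mean imputation plus one 0/1 indicator column per feature that had missing values during fit.
imp_ind = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imp_ind.fit_transform(X_toy)
print(X_out)
```

The first two output columns are the imputed features; the last two flag where a value was originally missing.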

In [54]:
predictions = RF.predict(X_test_full)

In [45]:
?SimpleImputer

Init signature:
SimpleImputer(
    *,
    missing_values=nan,
    strategy='mean',
    fill_value=None,
    verbose='deprecated',
    copy=True,
    add_indicator=False,
    keep_empty_features=False,
)
Docstring:
Univariate imputer for completing missing values with simple strategies.

Replace missing values using a descriptive statistic (e.g. mean, median, or
most frequent) along each column, or using a cons

In [31]:
X_train_full = X_train.copy()
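# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated chained assignment in recent pandas versions.
X_train_full["height_m"] = X_train_full["height_m"].fillna(X_train_full["height_m"].mean())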
X_train_full.isna().sum()

latitude               0
longitude              0
stem_diameter_cm       0
height_m               0
crown_radius_m      2512
dtype: int64

In [32]:
# Same pattern for the second column with missing values.
X_train_full["crown_radius_m"] = X_train_full["crown_radius_m"].fillna(X_train_full["crown_radius_m"].mean())
X_train_full.isna().sum()

latitude            0
longitude           0
stem_diameter_cm    0
height_m            0
crown_radius_m      0
dtype: int64
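For the "more sophisticated imputation strategy" task, one candidate is `KNNImputer` from `sklearn.impute`, which fills a missing entry with the average of that feature over the nearest rows. This is a sketch on a small invented array, not a result from the session; `n_neighbors=2` is an arbitrary choice for the toy data.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small invented array; rows 0 and 2 each have one missing entry.
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by the mean of that column over the
# two nearest rows (distance computed on the features both rows have).
knn_imp = KNNImputer(n_neighbors=2)
X_filled = knn_imp.fit_transform(X_toy)
print(X_filled)
```

On the notebook's data the pattern would mirror the `SimpleImputer` cells: fit on `X_train`, then `transform` `X_train`, `X_val`, and `X_test`.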

#### Only complete cases



In [6]:
X_complete = data.dropna().drop("species",axis=1)
y_complete = data.dropna().species
X_train_complete, X_val_complete, y_train_complete, y_val_complete = train_test_split(X_complete, y_complete, random_state=42)

In [7]:
X_test_complete = test.dropna().drop("tree_id",axis=1)
tree_id_complete = test.dropna().tree_id
pred_complete = pd.DataFrame()
pred_complete["tree_id"] = tree_id_complete
pred_complete["species"] = "unknown"

## Models

## Save Test Predictions

In [55]:
pred["species"] = RF.predict(X_test_full)
pred.to_csv("constant-1_imp_RF.tsv", sep="\t")

or

In [0]:
# `model` is a placeholder for any estimator fitted on X_train_complete
pred_complete["species"] = model.predict(X_test_complete)
pred_complete.to_csv("my_prediction.tsv", sep="\t")