<a href="https://colab.research.google.com/github/alastra32/DS-Unit-2-Applied-Modeling/blob/master/module2/assignment_applied_modeling_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Applied Modeling, Module 2

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] Plot the distribution of your target. 
    - Classification problem: Are your classes imbalanced? Then, don't use just accuracy.
    - Regression problem: Is your target skewed? If so, let's discuss in Slack.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline?
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

## Setup 

In [1]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # category_encoders, version >= 2.0
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade category_encoders pandas-profiling plotly

Requirement already up-to-date: category_encoders in /usr/local/lib/python3.6/dist-packages (2.1.0)
Requirement already up-to-date: pandas-profiling in /usr/local/lib/python3.6/dist-packages (2.3.0)
Requirement already up-to-date: plotly in /usr/local/lib/python3.6/dist-packages (4.1.1)


In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split

# merge train_features.csv & train_labels.csv
trainandval = pd.merge(pd.read_csv('https://raw.githubusercontent.com/alastra32/DS-Unit-2-Kaggle-Challenge/master/data/tanzania/train_features.csv'), 
                 pd.read_csv('https://raw.githubusercontent.com/alastra32/DS-Unit-2-Kaggle-Challenge/master/data/tanzania/train_labels.csv'))

# read test_features.csv & sample_submission.csv
test = pd.read_csv('https://raw.githubusercontent.com/alastra32/DS-Unit-2-Kaggle-Challenge/master/data/tanzania/test_features.csv')
sample_submission = pd.read_csv('https://raw.githubusercontent.com/alastra32/DS-Unit-2-Kaggle-Challenge/master/data/tanzania/sample_submission.csv')

In [0]:
# import block
pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.style.use('dark_background')
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
import category_encoders as ce
from xgboost import XGBClassifier

In [4]:
# train validation split
train, val = train_test_split(trainandval, train_size=0.95, test_size=0.05, 
                              stratify=trainandval['status_group'], random_state=42)

train.shape, val.shape, test.shape

((56430, 41), (2970, 41), (14358, 40))

## Manual Mode 



In [0]:
# We need a function that returns the mode of a given series for the imputer function.
def manual_mode(feature):
  try: 
    return feature.mode()[0]
  except:
    pass

## Imputer

In [0]:
# imputes by the lowest non-null region measure


def fill_nulls(df, feature, method):
  #attempt to fill nulls by method in succesively larger geographic scopes
  df = df.copy()# avoid settingwithcopy warning
  geo_scopes = ['ward', 'lga', 'region', 'basin']
  
  if method == 'mode':
    method = manual_mode
  
  for scope in geo_scopes:
    if df[feature].isnull().sum() == 0:
      break
    df[feature] = df[feature].fillna(df.groupby(scope)[feature].transform(method))

  return df[feature]


def impute(df, features, method):
  #imputation of given features by given method (mean/median/mode)
  df = df.copy()
  
  for feature in features:
    df[feature] = fill_nulls(df, feature, method)

  return df

## Wrangler

In [0]:


def flag_missing_values(df):
  '''add "<FEATURE>_MISSING" flag feature for all columns with nulls'''
  df.copy()
  
  columns_with_nulls = df.columns[df.isna().any()]
  
  for col in columns_with_nulls:
    df[col+'_MISSING'] = df[col].isna()
  
  return df


def convert_dummy_nulls(df):
  '''Convert 0 to NaN's'''
  df = df.copy()
  
  # replace near-zero latitudes with zero
  df['latitude'] = df['latitude'].replace(-2e-08, 0)
  
  zero_columns = ['longitude', 'latitude', 'construction_year', 'gps_height', 
                  'population']
  
  for col in zero_columns:
    df[col] = df[col].replace(0, np.nan)
    
  return df
  
    
def clean_text_columns(df):
  '''convert text to lowercase, remove non-alphanumerics, unknowns to NaN'''
  df = df.copy()
  
  text_columns = df[df.columns[(df.applymap(type) == str).all(0)]]
  unknowns = ['unknown', 'notknown', 'none', 'nan', '']
    
  for col in text_columns:
    df[col] = df[col].str.lower().str.replace('\W', '')
    df[col] = df[col].replace(unknowns, np.nan)

  return df


def get_distances_to_population_centers(df):
  '''create a distance feature for population centers'''
  df = df.copy()
  population_centers = {'dar': (6.7924, 39.2083), 
                        'mwanza': (2.5164, 32.9175),
                        'dodoma': (6.1630, 35.7516)}
  
  for city, loc in population_centers.items():
    df[city+'_distance'] = ((((df['latitude']-loc[0])**2)
                           + ((df['longitude']-loc[1])**2))**0.5)
  
  return df


def engineer_date_features(df):
  df = df.copy()
  
  # change date_recorded to datetime format
  df['date_recorded'] = pd.to_datetime(df.date_recorded, 
                                      infer_datetime_format=True)
    
  # extract components from date_recorded
  df['year_recorded'] = df['date_recorded'].dt.year
  df['month_recorded'] = df['date_recorded'].dt.month
  df['day_recorded'] = df['date_recorded'].dt.day

  df['inspection_interval'] = df['year_recorded'] - df['construction_year']
  
  return df


def wrangle(df):
    '''cleaning/engineering function'''
    df = df.copy()
    
    df = convert_dummy_nulls(df)   
    df = clean_text_columns(df)
    df = get_distances_to_population_centers(df)
    df = engineer_date_features(df)
    df = flag_missing_values(df)
    
    drop_features = ['recorded_by', 'id', 'date_recorded']
    df = df.drop(columns=drop_features)

    # Apply imputation
    numeric_columns = df.select_dtypes(include = 'number').columns
    nonnumeric_columns = df.select_dtypes(exclude = 'number').columns
    
    df = impute(df, numeric_columns, 'median')
    df = impute(df, nonnumeric_columns, 'mode')

    return df

## Engineer, Pipe, and Train

In [0]:
# clean and engineer all datasets - this may take a bit, ~5 minutes
train_wrangled = wrangle(train)
val_wrangled = wrangle(val)
test_wrangled = wrangle(test)

In [0]:
# arrange data into X features matrix and y target vector
target = 'status_group'

X_train = train_wrangled.drop(columns=target)
y_train = train_wrangled[target]

X_val = val_wrangled.drop(columns=target)
y_val = val_wrangled[target]

X_test = test_wrangled

In [10]:

transformers = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median')
)

X_train_transformed = transformers.fit_transform(X_train)
X_val_transformed = transformers.fit_transform(X_val)

model = RandomForestClassifier(n_estimators=129, max_depth=29, min_samples_leaf=2, 
                            random_state=42, min_impurity_decrease=2.22037e-16, n_jobs=-1)


model.fit(X_train_transformed,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=29, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=2.22037e-16,
                       min_impurity_split=None, min_samples_leaf=2,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=129, n_jobs=-1, oob_score=False,
                       random_state=42, verbose=0, warm_start=False)

In [11]:
#score
model.score(X_val_transformed,y_val)

0.8111111111111111

Permutation Importance

In [12]:
!pip install eli5



In [13]:
import eli5
from eli5.sklearn import PermutationImportance

permuter = PermutationImportance(
    model, 
    scoring='accuracy',
    n_iter=2,
    random_state=42
)

permuter.fit(X_val_transformed, y_val)
feature_names = X_val.columns.tolist()

eli5.show_weights(
    permuter,
    top=None,
    feature_names = feature_names
)


Using TensorFlow backend.


Weight,Feature
0.0316  ± 0.0000,quantity_group
0.0258  ± 0.0024,quantity
0.0123  ± 0.0010,waterpoint_type_group
0.0072  ± 0.0003,waterpoint_type
0.0059  ± 0.0017,population
0.0040  ± 0.0007,extraction_type_class
0.0034  ± 0.0000,region_code
0.0024  ± 0.0000,source_type
0.0020  ± 0.0000,district_code
0.0020  ± 0.0007,construction_year


In [14]:
minimum_importance = 0
mask = permuter.feature_importances_ > minimum_importance
features = X_train.columns[mask]
X_train = X_train[features]

X_train.shape


(56430, 44)

In [15]:
X_val = X_val[features]
X_val.shape

(2970, 44)

In [16]:
pipeline = make_pipeline(
    
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=129, max_depth=29, min_samples_leaf=2, 
                            random_state=42, min_impurity_decrease=2.22037e-16, n_jobs=-1)

)

pipeline.fit(X_train, y_train)
print ('Validation Accuracy', pipeline.score(X_val, y_val))

Validation Accuracy 0.8171717171717172


xgboost for gradient boosting

In [17]:
from xgboost import XGBClassifier

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    XGBClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)

pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['funder', 'region', 'lga',
                                      'scheme_management', 'scheme_name',
                                      'permit', 'extraction_type',
                                      'extraction_type_group',
                                      'extraction_type_class', 'management',
                                      'quantity', 'quantity_group', 'source',
                                      'source_type', 'source_class',
                                      'waterpoint_type',
                                      'waterpoint_type_group'],
                                drop_invariant=False, handle_...
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, gamma=0, learning_rate=0.1,
                               max_delta_step=0,

In [18]:
from sklearn.metrics import accuracy_score 
y_pred = pipeline.predict(X_val)
print ('Validation Accuracy', accuracy_score(y_val, y_pred))

Validation Accuracy 0.7521885521885522


In [19]:
encoder = ce.OrdinalEncoder()
X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)

X_train.shape, X_val.shape, X_train_encoded.shape, X_val_encoded.shape

((56430, 44), (2970, 44), (56430, 44), (2970, 44))

In [0]:
eval_set = [(X_train_encoded, y_train), (X_val_encoded, y_val)]

In [21]:
model = XGBClassifier(
    n_estimators=1000,
    max_depth=7,
    learning_rate=0.1,
    n_jobs = -1
)

model.fit(X_train_encoded, y_train, eval_set=eval_set, eval_metric='merror', early_stopping_rounds=50)

[0]	validation_0-merror:0.250328	validation_1-merror:0.255219
Multiple eval metrics have been passed: 'validation_1-merror' will be used for early stopping.

Will train until validation_1-merror hasn't improved in 50 rounds.
[1]	validation_0-merror:0.24673	validation_1-merror:0.251852
[2]	validation_0-merror:0.243558	validation_1-merror:0.251178
[3]	validation_0-merror:0.241662	validation_1-merror:0.250842
[4]	validation_0-merror:0.23771	validation_1-merror:0.245791
[5]	validation_0-merror:0.234769	validation_1-merror:0.242088
[6]	validation_0-merror:0.232837	validation_1-merror:0.241414
[7]	validation_0-merror:0.232111	validation_1-merror:0.239731
[8]	validation_0-merror:0.231437	validation_1-merror:0.241414
[9]	validation_0-merror:0.230285	validation_1-merror:0.238721
[10]	validation_0-merror:0.228265	validation_1-merror:0.23771
[11]	validation_0-merror:0.227326	validation_1-merror:0.234343
[12]	validation_0-merror:0.226298	validation_1-merror:0.234343
[13]	validation_0-merror:0.2256

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=7,
              min_child_weight=1, missing=None, n_estimators=1000, n_jobs=-1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)