Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 3

## Assignment
- [X] [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2/portfolio-project/ds6), then choose your dataset, and [submit this form](https://forms.gle/nyWURUg65x1UTRNV9), due today at 4pm Pacific.
- [X] Continue to participate in our Kaggle challenge.
- [X] Try xgboost.
- [X] Get your model's permutation importances.
- [ ] Try feature selection with permutation importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- [ ] Try other categorical encodings.
- [ ] Try other Python libraries for gradient boosting.
- [ ] Look at the bonus notebook in the repo, about monotonic constraints with gradient boosting.
- [ ] Make visualizations and share on Slack.

### Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

#### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- _**[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)**_
- _**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)**_
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

#### Imposter Syndrome
- [Effort Shock and Reward Shock (How The Karate Kid Ruined The Modern World)](http://www.tempobook.com/2014/07/09/effort-shock-and-reward-shock/)
- [How to manage impostor syndrome in data science](https://towardsdatascience.com/how-to-manage-impostor-syndrome-in-data-science-ad814809f068)
- ["I am not a real data scientist"](https://brohrer.github.io/imposter_syndrome.html)
- _**[Imposter Syndrome in Data Science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/)**_






### Python libraries for Gradient Boosting
- [scikit-learn Gradient Tree Boosting](https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting) — slower than other libraries, but [the new version may be better](https://twitter.com/amuellerml/status/1129443826945396737)
  - Anaconda: already installed
  - Google Colab: already installed
- [xgboost](https://xgboost.readthedocs.io/en/latest/) — can accept missing values and enforce [monotonic constraints](https://xiaoxiaowang87.github.io/monotonicity_constraint/)
  - Anaconda, Mac/Linux: `conda install -c conda-forge xgboost`
  - Windows: `conda install -c anaconda py-xgboost`
  - Google Colab: already installed
- [LightGBM](https://lightgbm.readthedocs.io/en/latest/) — can accept missing values and enforce [monotonic constraints](https://blog.datadive.net/monotonicity-constraints-in-machine-learning/)
  - Anaconda: `conda install -c conda-forge lightgbm`
  - Google Colab: already installed
- [CatBoost](https://catboost.ai/) — can accept missing values and use [categorical features](https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html) without preprocessing
  - Anaconda: `conda install -c conda-forge catboost`
  - Google Colab: `pip install catboost`

### Categorical Encodings

**1.** The article **[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)** mentions 4 encodings:

- **"Categorical Encoding":** This means using the raw categorical values as-is, not encoded. Scikit-learn doesn't support this, but some tree algorithm implementations do, for example [CatBoost](https://catboost.ai/) or R's [rpart](https://cran.r-project.org/web/packages/rpart/index.html) package.
- **Numeric Encoding:** Synonymous with Label Encoding, or "Ordinal" Encoding with random order. We can use [category_encoders.OrdinalEncoder](https://contrib.scikit-learn.org/categorical-encoding/ordinal.html).
- **One-Hot Encoding:** We can use [category_encoders.OneHotEncoder](http://contrib.scikit-learn.org/categorical-encoding/onehot.html).
- **Binary Encoding:** We can use [category_encoders.BinaryEncoder](http://contrib.scikit-learn.org/categorical-encoding/binary.html).

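To make the difference between these encodings concrete, here is a minimal hand-rolled sketch using only pandas on a made-up `color` column. It is not how category_encoders implements them (column names and code assignment there may differ); it only illustrates the idea behind numeric, one-hot, and binary encoding.

```python
import pandas as pd

# Toy column; the values are made up for illustration.
colors = pd.Series(['red', 'green', 'blue', 'green', 'red'], name='color')

# Numeric (ordinal) encoding: each category gets an arbitrary integer,
# here assigned in order of first appearance, starting at 1.
codes, categories = pd.factorize(colors)
ordinal = pd.Series(codes + 1, name='color_ordinal')

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(colors, prefix='color').astype(int)

# Binary encoding: write each ordinal code in base 2, one column per bit.
n_bits = int(ordinal.max()).bit_length()
binary = pd.DataFrame({
    f'color_bit{bit}': (ordinal // 2 ** bit) % 2
    for bit in reversed(range(n_bits))
})

print(pd.concat([colors, ordinal, one_hot, binary], axis=1))
```

With three categories, one-hot needs three columns while binary needs only two. That gap grows quickly for high-cardinality features.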

**2.** The short video 
**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)** introduces an interesting idea: use both X _and_ y to encode categoricals.

Category Encoders has multiple implementations of this general concept:

- [CatBoost Encoder](http://contrib.scikit-learn.org/categorical-encoding/catboost.html)
- [James-Stein Encoder](http://contrib.scikit-learn.org/categorical-encoding/jamesstein.html)
- [Leave One Out](http://contrib.scikit-learn.org/categorical-encoding/leaveoneout.html)
- [M-estimate](http://contrib.scikit-learn.org/categorical-encoding/mestimate.html)
- [Target Encoder](http://contrib.scikit-learn.org/categorical-encoding/targetencoder.html)
- [Weight of Evidence](http://contrib.scikit-learn.org/categorical-encoding/woe.html)

Category Encoders' mean encoding implementations work for regression or binary classification problems.

For a multi-class classification problem, you will need to temporarily reformulate the target as binary. For example:

```python
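import category_encoders as ce

encoder = ce.TargetEncoder(min_samples_leaf=..., smoothing=...)  # Both parameters > 1 to avoid overfitting
# Fit on the training set only, using a binary version of the target
X_train_encoded = encoder.fit_transform(X_train, y_train == 'functional')
# Transform the validation features with the fitted encoder; no target is needed
X_val_encoded = encoder.transform(X_val)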
```
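
To show what "use both X _and_ y" means in practice, here is a minimal sketch of a smoothed mean encoding written by hand with pandas. The data, the `source` column values, and the smoothing strength `m` are made up, and the additive smoothing formula is an assumption chosen for simplicity (similar in spirit to the M-estimate idea), not the exact formula `ce.TargetEncoder` uses.

```python
import pandas as pd

# Made-up training data: one categorical feature and a three-class target.
X_train = pd.DataFrame({'source': ['spring', 'spring', 'dam', 'dam', 'dam', 'river']})
y_train = pd.Series(['functional', 'non functional', 'functional',
                     'functional', 'functional', 'functional needs repair'])

# Reformulate the multi-class target as a binary one.
is_functional = (y_train == 'functional').astype(int)

prior = is_functional.mean()  # overall share of 'functional' rows
m = 2                         # smoothing strength (hypothetical choice)

# Per-category sum and count of the binary target.
stats = is_functional.groupby(X_train['source']).agg(['sum', 'count'])

# Additive smoothing pulls small categories toward the overall prior.
encoding = (stats['sum'] + m * prior) / (stats['count'] + m)

# The mapping is learned from the training set only. Categories that never
# appeared in training fall back to the prior.
X_val = pd.DataFrame({'source': ['dam', 'lake']})
X_val_encoded = X_val['source'].map(encoding).fillna(prior)

print(encoding)
print(X_val_encoded)
```

Note that the validation target never enters the calculation: the encoder only sees `y_train`.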

**3.** The **[dirty_cat](https://dirty-cat.github.io/stable/)** library has a Target Encoder implementation that works with multi-class classification.

```python
dirty_cat.TargetEncoder(clf_type='multiclass-clf')
```
It also implements an interesting idea called ["Similarity Encoder" for dirty categories](https://www.slideshare.net/GaelVaroquaux/machine-learning-on-non-curated-data-154905090).

However, it seems like dirty_cat doesn't handle missing values or unknown categories as well as category_encoders does. And you may need to use it with one column at a time, instead of with your whole dataframe.

**4. [Embeddings](https://www.kaggle.com/learn/embeddings)** can work well with sparse / high cardinality categoricals.

_**I hope it’s not too frustrating or confusing that there’s not one “canonical” way to encode categoricals. It’s an active area of research and experimentation! Maybe you can make your own contributions!**_

In [2]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # category_encoders, version >= 2.0
    # eli5, version >= 0.9
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade category_encoders eli5 pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module3')

Requirement already up-to-date: category_encoders in /usr/local/lib/python3.6/dist-packages (2.0.0)
Requirement already up-to-date: eli5 in /usr/local/lib/python3.6/dist-packages (0.9.0)
Requirement already up-to-date: pandas-profiling in /usr/local/lib/python3.6/dist-packages (2.3.0)
Requirement already up-to-date: plotly in /usr/local/lib/python3.6/dist-packages (4.1.0)
Reinitialized existing Git repository in /content/.git/
fatal: remote origin already exists.
From https://github.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge
 * branch            master     -> FETCH_HEAD
Already up to date.


In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv('../data/tanzania/train_features.csv'), 
                 pd.read_csv('../data/tanzania/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv('../data/tanzania/test_features.csv')
sample_submission = pd.read_csv('../data/tanzania/sample_submission.csv')

In [7]:
train.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,,GeoData Consultants Ltd,Other,,True,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,True,GeoData Consultants Ltd,VWC,,True,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,True,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional


In [0]:
# Split off 25% of the training data for validation, stratified by the target
X_train, X_val, y_train, y_val = train_test_split(
    train.drop(columns='status_group'),
    train['status_group'],
    test_size=0.25,
    stratify=train['status_group'])

In [16]:
X_train.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
18292,58821,0.0,2011-07-10,Hesawa,0,HESAWA,31.837348,-2.620175,Kahuhwa,0,Lake Victoria,Kahuhwa,Kagera,18,8,Chato,Chato,0,True,GeoData Consultants Ltd,VWC,,True,0,afridev,afridev,handpump,vwc,user-group,never pay,never pay,salty,salty,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump
12698,2594,0.0,2011-07-15,Government Of Tanzania,1117,DWE,31.232741,-6.376189,Kwa Daniel Mbegele,0,Lake Rukwa,Isanjandugu,Rukwa,15,1,Mpanda,Nsimbo,1,False,GeoData Consultants Ltd,Water authority,Msaginya,True,1972,gravity,gravity,gravity,water authority,commercial,never pay,never pay,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
9757,67720,2400.0,2011-03-27,Livin,1704,LIVI,35.034634,-8.711974,none,0,Rufiji,M,Iringa,11,2,Mufindi,Mtambula,0,True,GeoData Consultants Ltd,VWC,,False,2000,india mark ii,india mark ii,handpump,vwc,user-group,pay annually,annually,soft,good,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump
6501,61045,0.0,2013-02-10,Jumanne Siabo,0,Jumanne Siabo,32.634808,-3.81645,Jummanne Siabo,0,Lake Victoria,Igagati,Shinyanga,17,3,Kahama,Mhongolo,0,True,GeoData Consultants Ltd,VWC,,False,0,other,other,other,private operator,commercial,unknown,unknown,milky,milky,seasonal,seasonal,shallow well,shallow well,groundwater,other,other
8549,65941,0.0,2013-01-19,Rips,20,Rips,39.614351,-10.053176,Kwa Sharifu,0,Ruvuma / Southern Coast,Mayani,Lindi,80,62,Lindi Urban,Jamhuri,3000,True,GeoData Consultants Ltd,,,True,2011,other,other,other,vwc,user-group,unknown,unknown,soft,good,enough,enough,shallow well,shallow well,groundwater,other,other


In [0]:
# Defining feature engineering function
import numpy as np

def engineer(df):
  df = df.copy()
  # Making age variable: treat construction_year == 0 as missing,
  # then fill it with the mean year of this dataframe
  df['date_recorded'] = pd.to_datetime(df['date_recorded'])
  df['construction_year'] = df['construction_year'].replace(0, np.nan)
  mean_year = df['construction_year'].mean()
  df['construction_year'] = df['construction_year'].fillna(mean_year)
  df['age'] = df['date_recorded'].dt.year - df['construction_year']
  # Adding day, month, year, and day of week features
  df['day_recorded'] = df['date_recorded'].dt.day
  df['month_recorded'] = df['date_recorded'].dt.month
  df['year_recorded'] = df['date_recorded'].dt.year
  df['day_of_week_recorded'] = df['date_recorded'].dt.dayofweek
  df = df.drop(columns='date_recorded')
  # Putting NaN values where zeros are but shouldn't be,
  # and flagging which rows were missing
  cols_with_zeros = ['longitude', 'latitude',
                     'gps_height', 'population']
  for col in cols_with_zeros:
    df[col] = df[col].replace(0, np.nan)
    df[col+'_MISSING'] = df[col].isnull()
  # Drop duplicate columns
  duplicates = ['quantity_group', 'payment_type']
  df = df.drop(columns=duplicates)
  # Drop recorded_by (never varies) and id (always varies, random)
  unusable_variance = ['recorded_by', 'id']
  df = df.drop(columns=unusable_variance)
  # Making region code categorical instead of numeric
  df['region_code'] = pd.Categorical(df['region_code'])
  return df

In [0]:
def drop_high_cardinality(df, max_cardinality=51):
  # Drop non-numeric columns with more than max_cardinality unique values
  df = df.copy()
  cardinality = df.select_dtypes(exclude='number').nunique()
  high_cardinality = cardinality[cardinality > max_cardinality].index
  return df.drop(columns=high_cardinality)

In [0]:
# applying engineer function
X_train = engineer(X_train)
X_val = engineer(X_val)
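
One thing to be aware of: `engineer` computes the mean construction year separately for each dataframe it receives, so the train, validation, and test sets are each filled with a slightly different value. A possible alternative, sketched below on made-up frames, is to learn the fill value from the training data once and reuse it. The helper name `fill_construction_year` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up frames; 0 marks a missing construction year, as in the data above.
train_df = pd.DataFrame({'construction_year': [1990, 0, 2010]})
val_df = pd.DataFrame({'construction_year': [0, 2000]})

def fill_construction_year(df, fill_value):
    """Replace 0 with NaN, then fill with a value learned elsewhere."""
    df = df.copy()
    years = df['construction_year'].replace(0, np.nan)
    df['construction_year'] = years.fillna(fill_value)
    return df

# Learn the fill value from the training frame only ...
train_mean_year = train_df['construction_year'].replace(0, np.nan).mean()

# ... and reuse it for every other frame.
train_filled = fill_construction_year(train_df, train_mean_year)
val_filled = fill_construction_year(val_df, train_mean_year)

print(train_filled)
print(val_filled)
```

Whether this changes the score noticeably is an empirical question; the point is only that the validation and test sets are then treated exactly like new, unseen data.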

In [7]:
# Making pipeline and fitting XGBClassifier
from sklearn.pipeline import make_pipeline
import category_encoders as ce
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

from xgboost import XGBClassifier


pipeline = make_pipeline(
  ce.OrdinalEncoder(),
  IterativeImputer(),
  XGBClassifier(n_estimators = 100, n_jobs = -1))

pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['funder', 'installer', 'wpt_name',
                                      'basin', 'subvillage', 'region',
                                      'region_code', 'lga', 'ward',
                                      'public_meeting', 'scheme_management',
                                      'scheme_name', 'permit',
                                      'extraction_type',
                                      'extraction_type_group',
                                      'extraction_type_class', 'management',
                                      'management_group', 'payment',
                                      'water_quality', 'quality_g...
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, gamma=0, learning_rate=0.1,
                               max_delta

In [8]:
from sklearn.metrics import accuracy_score
y_pred = pipeline.predict(X_val)
print('Validation Accuracy', accuracy_score(y_val, y_pred))

Validation Accuracy 0.7475420875420875


In [0]:
# Trying early stopping

encoder = ce.OrdinalEncoder()
X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)

eval_set = [(X_train_encoded, y_train),
           (X_val_encoded, y_val)]

model = XGBClassifier(n_estimators = 1000, max_depth =7, learning_rate = .2, n_jobs = -1)
            

In [10]:
model.fit(X_train_encoded, y_train, early_stopping_rounds = 100, eval_metric = 'merror', eval_set = eval_set)

[0]	validation_0-merror:0.253805	validation_1-merror:0.263906
Multiple eval metrics have been passed: 'validation_1-merror' will be used for early stopping.

Will train until validation_1-merror hasn't improved in 100 rounds.
[1]	validation_0-merror:0.251246	validation_1-merror:0.259461
[2]	validation_0-merror:0.24615	validation_1-merror:0.255758
[3]	validation_0-merror:0.243367	validation_1-merror:0.253401
[4]	validation_0-merror:0.238429	validation_1-merror:0.250034
[5]	validation_0-merror:0.236207	validation_1-merror:0.24835
[6]	validation_0-merror:0.230348	validation_1-merror:0.24303
[7]	validation_0-merror:0.226644	validation_1-merror:0.241751
[8]	validation_0-merror:0.223143	validation_1-merror:0.237374
[9]	validation_0-merror:0.221235	validation_1-merror:0.237643
[10]	validation_0-merror:0.219416	validation_1-merror:0.236431
[11]	validation_0-merror:0.216655	validation_1-merror:0.234613
[12]	validation_0-merror:0.214164	validation_1-merror:0.233872
[13]	validation_0-merror:0.210

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.2, max_delta_step=0, max_depth=7,
              min_child_weight=1, missing=None, n_estimators=1000, n_jobs=-1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [27]:
model.best_score

0.200943
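
The log line "Will train until validation_1-merror hasn't improved in 100 rounds" describes a simple stopping rule. The sketch below reproduces that rule in plain Python on a made-up list of validation errors. It is an illustration of the idea, not xgboost's actual implementation, and the function name is hypothetical.

```python
def best_round_with_early_stopping(val_errors, patience):
    """Return (best_round, best_error, rounds_run) for a sequence of validation errors.

    Training stops once the error has not improved for `patience` rounds.
    """
    best_round, best_error = 0, float('inf')
    rounds_run = 0
    for round_number, error in enumerate(val_errors):
        rounds_run = round_number + 1
        if error < best_error:
            # New best: remember where it happened
            best_round, best_error = round_number, error
        elif round_number - best_round >= patience:
            # No improvement for `patience` rounds: stop
            break
    return best_round, best_error, rounds_run

# Made-up validation errors: they improve for three rounds, then stall.
errors = [0.26, 0.25, 0.24, 0.245, 0.243, 0.241, 0.25, 0.26]
print(best_round_with_early_stopping(errors, patience=3))
```

A larger `patience` (like the 100 rounds used above) tolerates longer plateaus at the cost of more training time.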

In [29]:
for max_depth in range(1, 15, 2):
  model = XGBClassifier(n_estimators = 1000, max_depth = max_depth, learning_rate = .4, n_jobs = -1)
  model.fit(X_train_encoded, y_train, early_stopping_rounds = 30, eval_metric = 'merror', eval_set = eval_set)
  print('Max depth: ', max_depth)
  print('best score: ', model.best_score)

[0]	validation_0-merror:0.352772	validation_1-merror:0.348889
Multiple eval metrics have been passed: 'validation_1-merror' will be used for early stopping.

Will train until validation_1-merror hasn't improved in 30 rounds.
[1]	validation_0-merror:0.309517	validation_1-merror:0.314074
[2]	validation_0-merror:0.352772	validation_1-merror:0.348889
[3]	validation_0-merror:0.308193	validation_1-merror:0.303502
[4]	validation_0-merror:0.306532	validation_1-merror:0.306465
[5]	validation_0-merror:0.306218	validation_1-merror:0.306128
[6]	validation_0-merror:0.299574	validation_1-merror:0.297037
[7]	validation_0-merror:0.298092	validation_1-merror:0.295084
[8]	validation_0-merror:0.302222	validation_1-merror:0.301818
[9]	validation_0-merror:0.300067	validation_1-merror:0.300067
[10]	validation_0-merror:0.300202	validation_1-merror:0.299529
[11]	validation_0-merror:0.298204	validation_1-merror:0.297508
[12]	validation_0-merror:0.298294	validation_1-merror:0.298182
[13]	validation_0-merror:0.2

In [30]:
# Taking a closer look at the best range
for max_depth in range(8, 11):
  model = XGBClassifier(n_estimators = 1000, max_depth = max_depth, learning_rate = .2, n_jobs = -1)
  model.fit(X_train_encoded, y_train, early_stopping_rounds = 50, eval_metric = 'merror', eval_set = eval_set)
  print('Max depth: ', max_depth)
  print('best score: ', model.best_score)

[0]	validation_0-merror:0.240853	validation_1-merror:0.250909
Multiple eval metrics have been passed: 'validation_1-merror' will be used for early stopping.

Will train until validation_1-merror hasn't improved in 50 rounds.
[1]	validation_0-merror:0.236947	validation_1-merror:0.24633
[2]	validation_0-merror:0.234613	validation_1-merror:0.246263
[3]	validation_0-merror:0.229091	validation_1-merror:0.239192
[4]	validation_0-merror:0.226599	validation_1-merror:0.241077
[5]	validation_0-merror:0.223861	validation_1-merror:0.238114
[6]	validation_0-merror:0.221122	validation_1-merror:0.237576
[7]	validation_0-merror:0.217666	validation_1-merror:0.232795
[8]	validation_0-merror:0.216027	validation_1-merror:0.231987
[9]	validation_0-merror:0.214007	validation_1-merror:0.228552
[10]	validation_0-merror:0.211044	validation_1-merror:0.227677
[11]	validation_0-merror:0.207295	validation_1-merror:0.227071
[12]	validation_0-merror:0.20413	validation_1-merror:0.224242
[13]	validation_0-merror:0.201

In [11]:
# Remaking best model
model = XGBClassifier(n_estimators = 1000, max_depth = 10, learning_rate = .2, n_jobs = -1)
model.fit(X_train_encoded, y_train, early_stopping_rounds = 50, eval_metric = 'merror', eval_set = eval_set)


[0]	validation_0-merror:0.206375	validation_1-merror:0.243771
Multiple eval metrics have been passed: 'validation_1-merror' will be used for early stopping.

Will train until validation_1-merror hasn't improved in 50 rounds.
[1]	validation_0-merror:0.198923	validation_1-merror:0.237037
[2]	validation_0-merror:0.191964	validation_1-merror:0.233064
[3]	validation_0-merror:0.185567	validation_1-merror:0.225993
[4]	validation_0-merror:0.182088	validation_1-merror:0.224983
[5]	validation_0-merror:0.178474	validation_1-merror:0.222694
[6]	validation_0-merror:0.174299	validation_1-merror:0.219529
[7]	validation_0-merror:0.170034	validation_1-merror:0.218047
[8]	validation_0-merror:0.165208	validation_1-merror:0.213939
[9]	validation_0-merror:0.159708	validation_1-merror:0.212391
[10]	validation_0-merror:0.155892	validation_1-merror:0.212054
[11]	validation_0-merror:0.152256	validation_1-merror:0.209899
[12]	validation_0-merror:0.148889	validation_1-merror:0.208552
[13]	validation_0-merror:0.1

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.2, max_delta_step=0, max_depth=10,
              min_child_weight=1, missing=None, n_estimators=1000, n_jobs=-1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [12]:
transformers = make_pipeline(ce.OrdinalEncoder(), 
                             IterativeImputer())

X_train_transformed = transformers.fit_transform(X_train,y_train)
X_val_transformed = transformers.transform(X_val)

# The imputer returns NumPy arrays, so restore the column names
X_train_transformed = pd.DataFrame(X_train_transformed, columns=X_train.columns)
X_val_transformed = pd.DataFrame(X_val_transformed, columns=X_val.columns)
X_val_transformed.head()

Unnamed: 0,amount_tsh,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,water_quality,quality_group,quantity,source,source_type,source_class,waterpoint_type,waterpoint_type_group,age,day_recorded,month_recorded,year_recorded,day_of_week_recorded,longitude_MISSING,latitude_MISSING,gps_height_MISSING,population_MISSING
0,0.0,75.0,1700.652645,30.0,33.442833,-2.699987,-1.0,0.0,4.0,10854.0,4.0,19.0,4.0,45.0,173.0,534.883384,1.0,1.0,3.0,1.0,1996.91383,4.0,4.0,3.0,1.0,1.0,4.0,1.0,1.0,1.0,3.0,3.0,1.0,3.0,2.0,14.08617,10.0,8.0,2011.0,2.0,0.0,0.0,1.0,1.0
1,6.0,34.0,1584.0,3.0,37.454083,-3.297059,-1.0,0.0,2.0,-1.0,2.0,3.0,4.0,35.0,258.0,60.0,1.0,2.0,28.0,1.0,2008.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,2.0,1.0,5.0,3.0,2.0,2013.0,6.0,0.0,0.0,0.0,0.0
2,20.0,146.0,267.0,142.0,39.082482,-10.977941,1831.0,0.0,9.0,6044.0,18.0,26.0,33.0,28.0,1459.0,1.0,1.0,1.0,3.0,1.0,2004.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,5.0,2.0,2.0,1.0,2.0,1.0,9.0,5.0,2.0,2013.0,1.0,0.0,0.0,0.0,0.0
3,30.0,5.0,1622.0,5.0,34.8809,-4.671819,-1.0,0.0,1.0,4129.0,16.0,13.0,2.0,60.0,94.0,492.0,1.0,1.0,49.0,2.0,2000.0,8.0,7.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,13.0,12.0,1.0,2013.0,5.0,0.0,0.0,0.0,0.0
4,25.0,34.0,709.0,3.0,37.531283,-3.468486,-1.0,0.0,2.0,-1.0,2.0,3.0,4.0,35.0,841.0,155.0,1.0,2.0,96.0,1.0,2008.0,8.0,7.0,5.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,5.0,13.0,3.0,2013.0,2.0,0.0,0.0,0.0,0.0


In [46]:
import eli5
from eli5.sklearn import PermutationImportance

# Shuffle each feature in the validation set and measure the drop in accuracy
permuter = PermutationImportance(model,
                                 scoring='accuracy',
                                 n_iter=2)
permuter.fit(X_val_transformed, y_val)
feature_names = X_val_encoded.columns.tolist()
eli5.show_weights(permuter, top=None, feature_names=feature_names)

Weight,Feature
0.1153  ± 0.0010,quantity
0.0354  ± 0.0023,waterpoint_type
0.0202  ± 0.0000,longitude
0.0202  ± 0.0016,extraction_type_class
0.0165  ± 0.0040,latitude
0.0143  ± 0.0011,population
0.0135  ± 0.0020,amount_tsh
0.0111  ± 0.0009,lga
0.0106  ± 0.0026,payment
0.0099  ± 0.0014,extraction_type


In [20]:
# Getting output file
test_engineered = engineer(test).copy()

test_transformed = transformers.transform(test_engineered)
test_transformed = pd.DataFrame(test_transformed)
test_transformed.columns = test_engineered.columns

y_pred = model.predict(test_transformed)
y_pred.shape

(14358,)

In [14]:
test.head()

Unnamed: 0,amount_tsh,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,water_quality,quality_group,quantity,source,source_type,source_class,waterpoint_type,waterpoint_type_group,age,day_recorded,month_recorded,year_recorded,day_of_week_recorded,longitude_MISSING,latitude_MISSING,gps_height_MISSING,population_MISSING
0,0.0,Dmdd,1996.0,DMDD,35.290799,-4.059696,Dinamu Secondary School,0,Internal,Magoma,Manyara,21,3,Mbulu,Bashay,321.0,True,Parastatal,,True,2012.0,other,other,other,parastatal,parastatal,never pay,soft,good,seasonal,rainwater harvesting,rainwater harvesting,surface,other,other,1.0,4,2,2013,0,False,False,False,False
1,0.0,Government Of Tanzania,1569.0,DWE,36.656709,-3.309214,Kimnyak,0,Pangani,Kimnyak,Arusha,2,2,Arusha Rural,Kimnyaki,300.0,True,VWC,TPRI pipe line,True,2000.0,gravity,gravity,gravity,vwc,user-group,never pay,soft,good,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe,13.0,4,2,2013,0,False,False,False,False
2,0.0,,1567.0,,34.767863,-5.004344,Puma Secondary,0,Internal,Msatu,Singida,13,2,Singida Rural,Puma,500.0,True,VWC,P,,2010.0,other,other,other,vwc,user-group,never pay,soft,good,insufficient,rainwater harvesting,rainwater harvesting,surface,other,other,3.0,1,2,2013,4,False,False,False,False
3,0.0,Finn Water,267.0,FINN WATER,38.058046,-9.418672,Kwa Mzee Pange,0,Ruvuma / Southern Coast,Kipindimbi,Lindi,80,43,Liwale,Mkutano,250.0,,VWC,,True,1987.0,other,other,other,vwc,user-group,unknown,soft,good,dry,shallow well,shallow well,groundwater,other,other,26.0,22,1,2013,1,False,False,False,False
4,500.0,Bruder,1260.0,BRUDER,35.006123,-10.950412,Kwa Mzee Turuka,0,Ruvuma / Southern Coast,Losonga,Ruvuma,10,3,Mbinga,Mbinga Urban,60.0,,Water Board,BRUDER,True,2000.0,gravity,gravity,gravity,water board,user-group,pay monthly,soft,good,enough,spring,spring,groundwater,communal standpipe,communal standpipe,13.0,27,3,2013,2,False,False,False,False


In [21]:
test_transformed.head()

Unnamed: 0,amount_tsh,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,water_quality,quality_group,quantity,source,source_type,source_class,waterpoint_type,waterpoint_type_group,age,day_recorded,month_recorded,year_recorded,day_of_week_recorded,longitude_MISSING,latitude_MISSING,gps_height_MISSING,population_MISSING
0,0.0,246.0,1996.0,275.0,35.290799,-4.059696,-1.0,0.0,1.0,10602.0,11.0,21.0,3.0,15.0,1787.0,321.0,1.0,5.0,3.0,1.0,2012.0,5.0,5.0,4.0,4.0,3.0,4.0,1.0,1.0,5.0,6.0,6.0,2.0,4.0,3.0,1.0,4.0,2.0,2013.0,0.0,0.0,0.0,0.0,0.0
1,0.0,4.0,1569.0,3.0,36.656709,-3.309214,-1.0,0.0,2.0,-1.0,9.0,2.0,2.0,27.0,30.0,300.0,1.0,1.0,19.0,1.0,2000.0,2.0,2.0,2.0,1.0,1.0,4.0,1.0,1.0,1.0,2.0,2.0,1.0,2.0,1.0,13.0,4.0,2.0,2013.0,0.0,0.0,0.0,0.0,0.0
2,0.0,5.0,1567.0,5.0,34.767863,-5.004344,261.0,0.0,1.0,285.0,16.0,13.0,2.0,60.0,274.0,500.0,1.0,1.0,1419.0,2.0,2010.0,5.0,5.0,4.0,1.0,1.0,4.0,1.0,1.0,1.0,6.0,6.0,2.0,4.0,3.0,3.0,1.0,2.0,2013.0,4.0,0.0,0.0,0.0,0.0
3,0.0,104.0,267.0,103.0,38.058046,-9.418672,-1.0,0.0,9.0,8225.0,12.0,25.0,43.0,16.0,1972.0,250.0,3.0,1.0,3.0,1.0,1987.0,5.0,5.0,4.0,1.0,1.0,5.0,1.0,1.0,3.0,3.0,3.0,1.0,4.0,3.0,26.0,22.0,1.0,2013.0,1.0,0.0,0.0,0.0,0.0
4,500.0,766.0,1260.0,863.0,35.006123,-10.950412,14110.0,0.0,9.0,-1.0,10.0,10.0,3.0,13.0,1337.0,60.0,3.0,2.0,842.0,1.0,2000.0,2.0,2.0,2.0,2.0,1.0,3.0,1.0,1.0,2.0,2.0,2.0,1.0,2.0,1.0,13.0,27.0,3.0,2013.0,2.0,0.0,0.0,0.0,0.0


In [22]:
y_pred = pd.Series(y_pred)
submission = y_pred.to_frame()
submission.head()

Unnamed: 0,0
0,functional
1,functional
2,functional
3,non functional
4,functional


In [24]:
# Move the predictions to their own column, then put the ids in the first column
submission['status_group'] = submission[0]
submission[0] = test['id']
submission.columns = ['id', 'status_group']
submission.head()

Unnamed: 0,id,status_group
0,50785,functional
1,51630,functional
2,17168,functional
3,45559,non functional
4,49871,functional


In [0]:
submission.to_csv('submission4.csv', index=False)

In [0]:
from google.colab import files
# Just try again if you get this error:
# TypeError: Failed to fetch
# https://github.com/googlecolab/colabtools/issues/337
files.download('submission4.csv')