<a href="https://colab.research.google.com/github/arewelearningyet/DS-Unit-2-Kaggle-Challenge/blob/master/module1-decision-trees/LS_DS_221_assignment%20%5BDS12%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 1*

---

# Decision Trees

## Assignment
- [ ] [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. Go to our Kaggle InClass competition website. You will be given the URL in Slack. Go to the Rules page. Accept the rules of the competition.
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Begin with baselines for classification.
- [ ] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your feature importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [what columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) What columns are duplicates, or nearly duplicates? Can you extract the year from date_recorded? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Make exploratory visualizations and share on Slack.


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)
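
As a minimal sketch of the binning idea above, the following splits a numeric feature into quantile bins with `pd.qcut` and computes the share of functional pumps in each bin. The toy DataFrame and its values are made up for illustration; only the `population` and `functional` column names come from this notebook.

```python
import pandas as pd

# Toy data (hypothetical values), standing in for the training set
toy = pd.DataFrame({
    'population': [1, 5, 20, 40, 80, 150, 300, 900],
    'functional': [0, 0, 1, 0, 1, 1, 1, 1],
})

# Split the numeric feature into 4 quantile bins (2 rows per bin here)
toy['population_bin'] = pd.qcut(toy['population'], q=4)

# The mean of a 0/1 target per bin is the share of functional pumps in that bin
share_by_bin = toy.groupby('population_bin', observed=True)['functional'].mean()
print(share_by_bin)
```

The binned column can then be passed to a Seaborn categorical plot, for example `sns.barplot(x='population_bin', y='functional', data=toy)`.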

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this classification problem, you may want to use the parameter `logistic=True`, but it can be slow.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.

#### High-cardinality categoricals

This code from a previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```
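
The same pattern can be wrapped in a small helper so the top values are learned from the training set once and then applied to any other split. This is only a sketch: `reduce_cardinality` is a hypothetical name, and the installer values below are made up.

```python
import pandas as pd

def reduce_cardinality(train_col, other_col, n=10, other_label='OTHER'):
    """Keep the n most frequent values seen in train_col; replace the rest.

    Hypothetical helper: the top-n list is learned from the training column
    only, then applied to another column (train itself, validation, or test).
    """
    top_n = train_col.value_counts().index[:n]
    return other_col.where(other_col.isin(top_n), other_label)

# Toy data (made-up values)
train_installer = pd.Series(['DWE', 'DWE', 'DWE', 'Gov', 'Gov', 'RWE'])
test_installer = pd.Series(['DWE', 'Gov', 'RWE', 'Unseen'])

result = reduce_cardinality(train_installer, test_installer, n=2)
print(result.tolist())
```

Learning the top values from the training column alone keeps information from the validation and test sets out of the encoding.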


In [2]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'



In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [4]:
!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Collecting https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
  Using cached https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
Building wheels for collected packages: pandas-profiling
  Building wheel for pandas-profiling (setup.py) ... done
  Created wheel for pandas-profiling: filename=pandas_profiling-2.5.0-py2.py3-none-any.whl size=240261 sha256=212bc2b60c7cdb783c4254e2bfdbc8fb3ca1e5c78603218eaaaa759482c41f07
  Stored in directory: /tmp/pip-ephem-wheel-cache-2qucrm7o/wheels/56/c2/dd/8d945b0443c35df7d5f62fa9e9ae105a2d8b286302b92e0109
Successfully built pandas-profiling


In [0]:
# Pandas Profiling can be very slow with medium & large datasets.
# Passing minimal=True makes it faster.
# https://github.com/pandas-profiling/pandas-profiling/issues/222
from pandas_profiling import ProfileReport

#profile_report = ProfileReport(train, minimal=True)

#profile_report.to_notebook_iframe()

# Do train/validate/test split with the Tanzania Waterpumps data.

In [6]:
#  Do train/validate/test split with the Tanzania Waterpumps data.
train, val = train_test_split(train, train_size=0.80, test_size=0.20,
                              stratify=train['status_group'])

train.shape, val.shape, test.shape

((47520, 41), (11880, 41), (14358, 40))
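
To illustrate what `stratify` does in the split above, here is a small sketch on made-up data. The toy frame and its 50/40/10 class mix are assumptions for the example, not the real waterpumps proportions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced 3-class target (made-up data)
toy = pd.DataFrame({
    'feature': range(100),
    'status_group': ['functional'] * 50 + ['non functional'] * 40
                    + ['functional needs repair'] * 10,
})

toy_train, toy_val = train_test_split(
    toy, train_size=0.80, test_size=0.20,
    stratify=toy['status_group'], random_state=42)

# With stratify, both splits keep the 50/40/10 class mix
print(toy_train['status_group'].value_counts(normalize=True))
print(toy_val['status_group'].value_counts(normalize=True))
```

Because the class mix is preserved, a majority-class baseline scores almost the same on the training and validation splits.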

In [0]:
import numpy as np

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

# Split train into train & val
train, val = train_test_split(train, train_size=0.80, test_size=0.20, 
                              stratify=train['status_group'], random_state=42)


def wrangle(X):
    """Wrangle train, validate, and test sets in the same way"""
    
    # Prevent SettingWithCopyWarning
    X = X.copy()
    
    # About 3% of the time, latitude has small values near zero,
    # outside Tanzania, so we'll treat these values like zero.
    X['latitude'] = X['latitude'].replace(-2e-08, 0)
    
    # When columns have zeros and shouldn't, they are like null values.
    # So we will replace the zeros with nulls, and impute missing values later.
    # Also create a "missing indicator" column, because the fact that
    # values are missing may be a predictive signal.
    cols_with_zeros = ['longitude', 'latitude', 'construction_year', 
                       'gps_height', 'population']
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)
        X[col+'_MISSING'] = X[col].isnull()
            
    # Drop duplicate columns
    duplicates = ['quantity_group', 'payment_type']
    X = X.drop(columns=duplicates)
    
    # Drop recorded_by (never varies) and id (always varies, random)
    unusable_variance = ['recorded_by', 'id']
    X = X.drop(columns=unusable_variance)
    
    # Convert date_recorded to datetime
    X['date_recorded'] = pd.to_datetime(X['date_recorded'], infer_datetime_format=True)
    
    # Extract components from date_recorded, then drop the original column
    X['year_recorded'] = X['date_recorded'].dt.year
    X['month_recorded'] = X['date_recorded'].dt.month
    X['day_recorded'] = X['date_recorded'].dt.day
    X = X.drop(columns='date_recorded')
    
    # Engineer feature: how many years from construction_year to date_recorded
    X['years'] = X['year_recorded'] - X['construction_year']
    X['years_MISSING'] = X['years'].isnull()
    
    # return the wrangled dataframe
    return X

train = wrangle(train)
val = wrangle(val)
test = wrangle(test)
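
The zero-to-null step inside `wrangle` can be seen in isolation on a tiny frame. This is a sketch with made-up values; it mirrors the `replace` / `isnull` pattern and the `years` feature but is not the full function.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values); a construction_year of 0 means "unknown"
toy = pd.DataFrame({
    'construction_year': [1998, 0, 2010, 0],
    'year_recorded': [2011, 2013, 2012, 2011],
})

# Treat zeros as missing, then record which rows were missing
toy['construction_year'] = toy['construction_year'].replace(0, np.nan)
toy['construction_year_MISSING'] = toy['construction_year'].isnull()

# Arithmetic with NaN stays NaN, so the engineered feature is missing
# wherever construction_year was missing
toy['years'] = toy['year_recorded'] - toy['construction_year']

print(toy)
```

Without the replacement, the second row would report a pump age of 2013 years, which is why the zeros are converted before the feature is engineered.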

In [0]:
from numba import njit

@njit
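def cut(arr):
    # Bin population counts into codes 1-7.
    bins = np.empty(arr.shape[0])
    for idx, x in enumerate(arr):
        # Each lower bound is where the previous bin ends, so the
        # intervals are contiguous and do not overlap.
        if (x >= 0) & (x < 1):
            bins[idx] = 1
        elif (x >= 1) & (x < 50):
            bins[idx] = 2
        elif (x >= 50) & (x < 250):
            bins[idx] = 3
        elif (x >= 250) & (x < 500):
            bins[idx] = 4
        elif (x >= 500) & (x < 1000):
            bins[idx] = 5
        elif (x >= 1000) & (x < 15300):
            bins[idx] = 6
        else:
            # NaN, negative values, and values >= 15300 land here
            bins[idx] = 7

    return bins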

train['popbin']=cut(train['population'].to_numpy())
val['popbin']=cut(val['population'].to_numpy())

In [0]:
test['popbin']=cut(test['population'].to_numpy())
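
Because the `elif` chain in `cut` is evaluated in order, it effectively produces the half-open bins [0, 1), [1, 50), [50, 250), [250, 500), [500, 1000), [1000, 15300), with everything else (including NaN) in bin 7. Assuming those are the intended bins, a roughly equivalent sketch with `pd.cut` looks like this; `cut_with_pandas` is a hypothetical name and the sample values are made up.

```python
import numpy as np
import pandas as pd

def cut_with_pandas(values):
    """Bin population into codes 1-6; anything else (NaN, out of range) gets 7."""
    edges = [0, 1, 50, 250, 500, 1000, 15300]
    # right=False makes each bin closed on the left and open on the right
    binned = pd.cut(values, bins=edges, right=False, labels=[1, 2, 3, 4, 5, 6])
    # pd.cut yields NaN for values outside the edges and for NaN input
    return pd.Series(binned).astype(float).fillna(7).to_numpy()

population = np.array([0.5, 1, 49, 50, 249, 250, 999, 1000, 15300, np.nan])
print(cut_with_pandas(population))
```

The pandas version avoids the Numba dependency, at the cost of some speed on very large arrays.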

# Begin with baselines for classification.

In [0]:
target= 'status_group'

In [9]:
train['status_group'].value_counts(normalize=True)

functional                 0.543077
non functional             0.384238
functional needs repair    0.072685
Name: status_group, dtype: float64

In [10]:
# define majority class baseline

majority_class = train[target].mode()[0]

print(f'Majority class of {target}: {majority_class}')

Majority class of status_group: functional


In [11]:
# print accuracy for majority baseline against train, val

from sklearn.metrics import accuracy_score

y_true = train[target]
y_pred = [majority_class] * len(y_true)
majbasacc_train = accuracy_score(y_true, y_pred)

y_true = val[target]
y_pred = [majority_class] * len(y_true)
majbasacc_val = accuracy_score(y_true, y_pred)

print(f'Baseline accuracy score against train: {majbasacc_train * 100:.2f}%')
print(f'Baseline accuracy score against val: {majbasacc_val * 100:.2f}%')

Baseline accuracy score against train: 54.31%
Baseline accuracy score against val: 54.31%
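
scikit-learn also ships a ready-made majority-class baseline, `DummyClassifier`, which can stand in for the manual calculation above. A minimal sketch on made-up labels:

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Toy labels (made up); the feature values are ignored by the dummy model
X_toy = pd.DataFrame({'x': range(10)})
y_toy = pd.Series(['functional'] * 6 + ['non functional'] * 4)

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

# .score() returns accuracy, here the share of the majority class
baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

Since it follows the usual `fit` / `score` interface, the dummy model can be dropped into the same evaluation code as any real classifier.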


# Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.

In [121]:
train.nunique().sort_values(ascending=False)

longitude                    46028
latitude                     46026
wpt_name                     30661
subvillage                   17231
scheme_name                   2563
gps_height                    2400
ward                          2082
installer                     1929
funder                        1716
population                     985
lga                            124
amount_tsh                      94
years                           60
num_private                     59
construction_year               54
day_recorded                    31
region_code                     27
region                          21
district_code                   20
extraction_type                 18
extraction_type_group           13
month_recorded                  12
management                      12
scheme_management               12
source                          10
basin                            9
water_quality                    8
extraction_type_class            7
waterpoint_type     

In [13]:
train.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 47520.0 | 37175.2375 | 21479.239079 | 1.0 | 18545.5 | 37102.5 | 55776.5 | 74246.0 |
| amount_tsh | 47520.0 | 316.01639 | 3066.352849 | 0.0 | 0.0 | 0.0 | 25.0 | 350000.0 |
| gps_height | 47520.0 | 669.638005 | 693.181687 | -90.0 | 0.0 | 372.0 | 1320.25 | 2770.0 |
| longitude | 46072.0 | 35.14696 | 2.606233 | 29.607122 | 33.281198 | 35.001212 | 37.22606 | 40.345193 |
| latitude | 46072.0 | -5.886312 | 2.813859 | -11.586297 | -8.648982 | -5.164544 | -3.372735 | -0.998464 |
| num_private | 47520.0 | 0.492003 | 13.104644 | 0.0 | 0.0 | 0.0 | 0.0 | 1776.0 |
| region_code | 47520.0 | 15.35484 | 17.656093 | 1.0 | 5.0 | 12.0 | 17.0 | 99.0 |
| district_code | 47520.0 | 5.654104 | 9.690207 | 0.0 | 2.0 | 3.0 | 5.0 | 80.0 |
| population | 47520.0 | 177.626662 | 443.352264 | 0.0 | 0.0 | 25.0 | 215.0 | 15300.0 |
| construction_year | 47520.0 | 1301.220665 | 951.456039 | 0.0 | 0.0 | 1986.0 | 2004.0 | 2013.0 |


In [14]:
train.describe(exclude='number').T

| | count | unique | top | freq |
|---|---|---|---|---|
| date_recorded | 47520 | 352 | 2011-03-15 | 460 |
| funder | 44605 | 1694 | Government Of Tanzania | 7238 |
| installer | 44588 | 1903 | DWE | 13917 |
| wpt_name | 47520 | 30676 | none | 2852 |
| basin | 47520 | 9 | Lake Victoria | 8203 |
| subvillage | 47216 | 17240 | Majengo | 406 |
| region | 47520 | 21 | Iringa | 4231 |
| lga | 47520 | 124 | Njombe | 1994 |
| ward | 47520 | 2085 | Igosi | 241 |
| public_meeting | 44846 | 2 | True | 40828 |


In [123]:
# subset features to exclude target, index
train_features= train.drop(columns=[target])
# define numeric features to list
numeric_features=train_features.select_dtypes(include='number').columns.tolist()
# define list describing cardinality of non-numeric features
cardinality = train_features.select_dtypes(exclude='number').nunique()
# define subset of categorical features by threshold of 50 unique values
# to reduce cardinality/dimensionality 
categorical_features=cardinality[cardinality <=50].index.tolist()

# define selected features by combining low-cardinality categoricals and numeric
features= numeric_features + categorical_features

print(features)

['amount_tsh', 'gps_height', 'longitude', 'latitude', 'num_private', 'region_code', 'district_code', 'population', 'construction_year', 'year_recorded', 'month_recorded', 'day_recorded', 'years', 'popbin', 'basin', 'region', 'public_meeting', 'scheme_management', 'permit', 'extraction_type', 'extraction_type_group', 'extraction_type_class', 'management', 'management_group', 'payment', 'water_quality', 'quality_group', 'quantity', 'source', 'source_type', 'source_class', 'waterpoint_type', 'waterpoint_type_group', 'longitude_MISSING', 'latitude_MISSING', 'construction_year_MISSING', 'gps_height_MISSING', 'population_MISSING', 'years_MISSING']


In [0]:
X_train=train[features]
y_train=train[target]
X_val=val[features]
y_val=val[target]
X_test=test[features]

In [128]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline=make_pipeline(ce.OneHotEncoder(use_cat_names=True),
                       SimpleImputer(strategy='mean'),
                       StandardScaler(),
                       DecisionTreeClassifier(min_samples_leaf=20, max_depth=None))
pipeline.fit(X_train, y_train)
print('Validation Accuracy: ', pipeline.score(X_val, y_val))

Validation Accuracy:  0.7697811447811448


In [129]:
pipeline=make_pipeline(ce.OneHotEncoder(use_cat_names=True),
                       SimpleImputer(strategy='mean'),
                       StandardScaler(),
                       DecisionTreeClassifier(min_samples_leaf=10, max_depth=None))
pipeline.fit(X_train, y_train)
print('Validation Accuracy: ', pipeline.score(X_val, y_val))

Validation Accuracy:  0.775


In [130]:
pipeline=make_pipeline(ce.OneHotEncoder(use_cat_names=True),
                       SimpleImputer(strategy='mean'),
                       StandardScaler(),
                       DecisionTreeClassifier(min_samples_leaf=5, max_depth=None))
pipeline.fit(X_train, y_train)
print('Validation Accuracy: ', pipeline.score(X_val, y_val))

y_pred=pipeline.predict(X_test)

Validation Accuracy:  0.7738215488215489


In [131]:
pipeline=make_pipeline(ce.OneHotEncoder(use_cat_names=True),
                       SimpleImputer(strategy='mean'),
                       StandardScaler(),
                       DecisionTreeClassifier(min_samples_leaf=1, max_depth=None))
pipeline.fit(X_train, y_train)
print('Validation Accuracy: ', pipeline.score(X_val, y_val))

Validation Accuracy:  0.755976430976431


In [132]:
from sklearn.ensemble import RandomForestClassifier

pipeline=make_pipeline(ce.OneHotEncoder(use_cat_names=True), 
                       SimpleImputer(strategy='mean'),
                       StandardScaler(),
                       RandomForestClassifier(n_estimators=20,
                             max_depth=25,
                             min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
                             max_features='sqrt', max_leaf_nodes=None,
                             bootstrap=True, oob_score=True, n_jobs=-1,
                             class_weight=None))

pipeline.fit(X_train, y_train)
print('Validation Accuracy: ', pipeline.score(X_val, y_val))

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])


Validation Accuracy:  0.8053872053872054


In [133]:
pipeline=make_pipeline(ce.OneHotEncoder(use_cat_names=True), 
                       SimpleImputer(strategy='median'),
                       StandardScaler(),
                       RandomForestClassifier(n_estimators=20,
                             max_depth=25,
                             min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
                             max_features='sqrt', max_leaf_nodes=None,
                             bootstrap=True, oob_score=True, n_jobs=-1,
                             class_weight=None))

pipeline.fit(X_train, y_train)
print('Validation Accuracy: ', pipeline.score(X_val, y_val))

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])


Validation Accuracy:  0.8074915824915825


In [134]:
pipeline=make_pipeline(ce.OneHotEncoder(use_cat_names=True), 
                       SimpleImputer(strategy='median'),
                       StandardScaler(),
                       RandomForestClassifier(n_estimators=20,
                             criterion='entropy',
                             max_depth=25,
                             min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
                             max_features='sqrt', max_leaf_nodes=None,
                             bootstrap=True, oob_score=True, n_jobs=-1,
                             class_weight=None))

pipeline.fit(X_train, y_train)
print('Validation Accuracy: ', pipeline.score(X_val, y_val))

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])


Validation Accuracy:  0.805050505050505


In [135]:
pipeline=make_pipeline(ce.OneHotEncoder(use_cat_names=True), 
                       SimpleImputer(strategy='median'),
                       StandardScaler(),
                       RandomForestClassifier(n_estimators=20,
                             criterion='entropy',
                             max_depth=20,
                             min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
                             max_features='sqrt', max_leaf_nodes=None,
                             bootstrap=True, oob_score=True, n_jobs=-1,
                             class_weight=None))

pipeline.fit(X_train, y_train)
print('Validation Accuracy: ', pipeline.score(X_val, y_val))

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])


Validation Accuracy:  0.805050505050505


In [136]:
pipeline=make_pipeline(ce.OneHotEncoder(use_cat_names=True), 
                       SimpleImputer(strategy='median'),
                       StandardScaler(),
                       RandomForestClassifier(n_estimators=20,
                             criterion='entropy',
                             max_depth=23,
                             min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
                             max_features='sqrt', max_leaf_nodes=None,
                             bootstrap=True, oob_score=True, n_jobs=-1,
                             class_weight=None))

pipeline.fit(X_train, y_train)
print('Validation Accuracy: ', pipeline.score(X_val, y_val))

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])


Validation Accuracy:  0.8061447811447812


In [137]:
pipeline=make_pipeline(ce.OneHotEncoder(use_cat_names=True), 
                       SimpleImputer(strategy='median'),
                       StandardScaler(),
                       RandomForestClassifier(n_estimators=20,
                             criterion='entropy',
                             max_depth=23,
                             min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
                             max_features='sqrt', max_leaf_nodes=None,
                             bootstrap=True, oob_score=False, n_jobs=-1,
                             class_weight=None))

pipeline.fit(X_train, y_train)
print('Validation Accuracy: ', pipeline.score(X_val, y_val))

Validation Accuracy:  0.8045454545454546


In [138]:
pipeline=make_pipeline(ce.OneHotEncoder(use_cat_names=True), 
                       SimpleImputer(strategy='median'),
                       StandardScaler(),
                       RandomForestClassifier(n_estimators=50,
                             criterion='entropy',
                             max_depth=23,
                             min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
                             max_features='sqrt', max_leaf_nodes=None,
                             bootstrap=True, oob_score=False, n_jobs=-1,
                             class_weight=None))

pipeline.fit(X_train, y_train)
print('Validation Accuracy: ', pipeline.score(X_val, y_val))

Validation Accuracy:  0.80993265993266


In [140]:
pipeline=make_pipeline(ce.OneHotEncoder(use_cat_names=True), 
                       SimpleImputer(strategy='median'),
                       StandardScaler(),
                       RandomForestClassifier(n_estimators=53,
                             criterion='entropy',
                             max_depth=23, min_samples_leaf=1,
                             max_features='sqrt', random_state=42,
                             bootstrap=True, oob_score=True, n_jobs=-1))

pipeline.fit(X_train, y_train)
print('Validation Accuracy: ', pipeline.score(X_val, y_val))

Validation Accuracy:  0.80993265993266


In [0]:
# from sklearn.model_selection import RandomizedSearchCV

# transformations = make_pipeline(
#     ce.OneHotEncoder(cols=features, use_cat_names=True),
#     ce.OrdinalEncoder(),
# )

# pipeline = make_pipeline(
#     SimpleImputer(strategy='median'),
#     RandomForestClassifier(random_state=13, n_jobs=-1)
# )

# param_distributions = {
#     'randomforestclassifier__n_estimators' : [25, 50],
#     'randomforestclassifier__criterion' : ['gini'],
#     'randomforestclassifier__max_depth' : [25, 50],
#     'randomforestclassifier__min_samples_leaf' : [1, 5, 10],
#     'randomforestclassifier__max_features' : [None, 'auto'],
# }

In [0]:
# search = RandomizedSearchCV(estimator=pipeline, param_distributions=param_distributions, n_iter=5, scoring=None, cv=3, random_state=42, n_jobs=-1)

In [0]:
# X_train_transformed = transformations.fit_transform(X_train)
# X_test_transformed = transformations.transform(X_test)

# search.fit(X_train_transformed, y_train)

In [0]:
# print(search.best_params_)
# print(search.best_score_)
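
For reference, here is a small runnable sketch of a randomized search over a pipeline. It is not the search that was attempted above: it uses synthetic numeric data instead of the waterpumps features, skips the categorical encoders, and uses smaller parameter values so it finishes quickly. Keeping the imputer inside the pipeline means each cross-validation fold fits it on its own training portion.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with a few missing values (not the waterpumps data)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X[::25, 0] = np.nan

pipe = make_pipeline(
    SimpleImputer(strategy='median'),
    RandomForestClassifier(random_state=13),
)

# make_pipeline names each step after its lowercased class name,
# so parameters are addressed as '<step name>__<parameter>'
param_distributions = {
    'randomforestclassifier__n_estimators': [10, 20],
    'randomforestclassifier__max_depth': [5, 10],
    'randomforestclassifier__min_samples_leaf': [1, 5, 10],
    'randomforestclassifier__max_features': [None, 'sqrt'],
}

search = RandomizedSearchCV(pipe, param_distributions=param_distributions,
                            n_iter=5, cv=3, random_state=42)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

`best_score_` is the mean cross-validated accuracy of the best candidate, so it is not directly comparable to a score on a single held-out validation set.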

# Get your validation accuracy score.

# Get and plot your feature importances.

In [0]:
# Feature importances.
# FROM LECTURE: Linear models have coefficients, but TREES HAVE FEATURE IMPORTANCES
# Take the final pipeline step by position: the most recently fitted pipeline
# ends in a RandomForestClassifier, so there is no 'decisiontreeclassifier' step.
model = pipeline.steps[-1][1]
encoder = pipeline.named_steps['onehotencoder']
encoded_cols = encoder.transform(X_val).columns
importances = pd.Series(model.feature_importances_, encoded_cols)

In [0]:
import matplotlib.pyplot as plt
plt.figure(figsize=(10,30))
importances.sort_values().plot.barh();
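
With many one-hot encoded columns, a plot of every importance gets very tall. The sketch below shows one way to keep only the largest few. It uses synthetic data and a plain imputer-plus-tree pipeline as stand-ins, and the `feature_0` ... `feature_5` column names are made up.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical column names
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(6)])

pipe = make_pipeline(SimpleImputer(strategy='median'),
                     DecisionTreeClassifier(min_samples_leaf=10, random_state=0))
pipe.fit(X, y)

# Look up the final estimator by position, so the code does not depend
# on the step name of a particular model class
model = pipe.steps[-1][1]
importances = pd.Series(model.feature_importances_, index=X.columns)

# Importances are normalized to sum to 1; keep only the largest few
top = importances.sort_values(ascending=False).head(3)
print(top)
```

`top.sort_values().plot.barh()` then draws a compact chart of just those features.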

# Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue Submit Predictions button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)


# Commit your notebook to your fork of the GitHub repo.

In [139]:
y_pred = pipeline.predict(X_test)

submission = sample_submission.copy()
submission[target] = y_pred

filename = 'sub3_2020-02-17.csv'

submission.to_csv(filename, index=False)

submission

| | id | status_group |
|---|---|---|
| 0 | 50785 | functional |
| 1 | 51630 | functional |
| 2 | 17168 | functional |
| 3 | 45559 | non functional |
| 4 | 49871 | functional |
| ... | ... | ... |
| 14353 | 39307 | non functional |
| 14354 | 18990 | functional |
| 14355 | 28749 | functional |
| 14356 | 33492 | functional |


# experiment with null treatment, binning population

In [110]:
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [111]:

train, val = train_test_split(train, train_size=0.80, test_size=0.20,
                              stratify=train['status_group'])

train.shape, val.shape, test.shape

((47520, 41), (11880, 41), (14358, 40))

In [0]:
# Collect columns that contain nulls, with the type of their first non-null value
nullcols = []
for col in train.columns:
    if train[col].isnull().any():
        t = type(train[col].dropna().iat[0])
        nullcols.append([col, t])

# String columns: label missing values explicitly
for col, t in nullcols:
    if t is str:
        train.loc[train[col].isnull(), col] = 'MISSING'

# Boolean columns: recode as strings with a third 'MISSING' level
for col, t in nullcols:
    if t is bool:
        train.loc[train[col]==True, col] = 'TRUE'
        train.loc[train[col]==False, col] = 'FALSE'
        train.loc[train[col].isnull(), col] = 'MISSING'

In [115]:
train.columns

Index(['id', 'amount_tsh', 'date_recorded', 'funder', 'gps_height',
       'installer', 'longitude', 'latitude', 'wpt_name', 'num_private',
       'basin', 'subvillage', 'region', 'region_code', 'district_code', 'lga',
       'ward', 'population', 'public_meeting', 'recorded_by',
       'scheme_management', 'scheme_name', 'permit', 'construction_year',
       'extraction_type', 'extraction_type_group', 'extraction_type_class',
       'management', 'management_group', 'payment', 'payment_type',
       'water_quality', 'quality_group', 'quantity', 'quantity_group',
       'source', 'source_type', 'source_class', 'waterpoint_type',
       'waterpoint_type_group', 'status_group'],
      dtype='object')

# experiment with dropping high correlation features

In [105]:
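# subset features to exclude target, id, and the constant recorded_by
# drop region, which is highly correlated with region_code
# likewise drop the redundant extraction_*, management_group, payment, and quality_group columns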
train_features= train.drop(columns=[target, 'id', 'region', 
                                    'extraction_type_class', 
                                    'extraction_type_group', 'management_group',
                                    'payment', 'quality_group',
                                    'recorded_by'])
# define numeric features to list
numeric_features=train_features.select_dtypes(include='number').columns.tolist()
# define list describing cardinality of non-numeric features
cardinality = train_features.select_dtypes(exclude='number').nunique()
# define subset of categorical features by threshold of 125 unique values
# to limit cardinality/dimensionality (high enough to keep lga, with 124 values)
categorical_features=cardinality[cardinality <=125].index.tolist()

# define selected features by combining low-cardinality categoricals and numeric
features= numeric_features + categorical_features

print(features)

['amount_tsh', 'gps_height', 'longitude', 'latitude', 'num_private', 'region_code', 'district_code', 'population', 'construction_year', 'popbin', 'basin', 'lga', 'public_meeting', 'scheme_management', 'permit', 'extraction_type', 'management', 'payment_type', 'water_quality', 'quantity', 'source', 'source_type', 'source_class', 'waterpoint_type', 'waterpoint_type_group']


In [0]:
# ensemble the submissions by majority vote
files = ['sub1_2020-02-17.csv',
         'sub2_2020-02-17.csv',
         'sub3_2020-02-17.csv',
         'sub4_2020-02-17.csv',
         'sub5_2020-02-17.csv',
         'sub6_2020-02-17.csv',
         'sub7_2020-02-17.csv',
         ]
 
target = 'status_group'
submissions = (pd.read_csv(file)[[target]] for file in files)
ensemble = pd.concat(submissions, axis='columns')
majority_vote = ensemble.mode(axis='columns')[0]


submission = sample_submission.copy()
submission[target] = majority_vote
submission.to_csv('ensemble_submission-2020-02-17.csv', index=False)
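
The voting step above relies on `DataFrame.mode(axis='columns')`. A minimal sketch on made-up predictions shows what it returns; the three `model_*` columns are hypothetical stand-ins for the submission files.

```python
import pandas as pd

# Three made-up prediction columns, one per model
votes = pd.DataFrame({
    'model_a': ['functional', 'non functional', 'functional'],
    'model_b': ['functional', 'functional', 'non functional'],
    'model_c': ['non functional', 'functional', 'non functional'],
})

# mode(axis='columns') returns the most common value(s) in each row;
# column 0 holds the first mode for that row
majority_vote = votes.mode(axis='columns')[0]
print(majority_vote.tolist())
```

With three classes, ties are still possible even with an odd number of files. In that case the result has extra columns and `[0]` picks just one of the tied labels, so it is worth checking how often ties occur before relying on the vote.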