<a href="https://colab.research.google.com/github/TimTree/DS-Unit-2-Kaggle-Challenge/blob/master/module2/assignment_kaggle_challenge_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 2

## Assignment
- [ ] Read [“Adopting a Hypothesis-Driven Workflow”](https://outline.com/5S5tsB), a blog post by a Lambda DS student about the Tanzania Waterpumps challenge.
- [ ] Continue to participate in our Kaggle challenge.
- [ ] Try Ordinal Encoding.
- [ ] Try a Random Forest Classifier.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- [ ] Try other [categorical encodings](https://contrib.scikit-learn.org/categorical-encoding/).
- [ ] Get and plot your feature importances.
- [ ] Make visualizations and share on Slack.

### Reading

Top recommendations in _**bold italic:**_

#### Decision Trees
- A Visual Introduction to Machine Learning, [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/), and _**[Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)**_
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU)

#### Random Forests
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/), Chapter 8: Tree-Based Methods
- [Coloring with Random Forests](http://structuringtheunstructured.blogspot.com/2017/11/coloring-with-random-forests.html)
- _**[Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)**_

#### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- _**[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)**_
- _**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)**_
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

#### Imposter Syndrome
- [Effort Shock and Reward Shock (How The Karate Kid Ruined The Modern World)](http://www.tempobook.com/2014/07/09/effort-shock-and-reward-shock/)
- [How to manage impostor syndrome in data science](https://towardsdatascience.com/how-to-manage-impostor-syndrome-in-data-science-ad814809f068)
- ["I am not a real data scientist"](https://brohrer.github.io/imposter_syndrome.html)
- _**[Imposter Syndrome in Data Science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/)**_






### Setup

You can work locally (follow the [local setup instructions](https://lambdaschool.github.io/ds/unit2/local/)) or on Colab (run the code cell below).

In [1]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module2')

Reinitialized existing Git repository in /content/.git/
fatal: remote origin already exists.
From https://github.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge
 * branch            master     -> FETCH_HEAD
Already up to date.


In [0]:
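# The data paths below are relative to the repo root, so switch back there.
# Only do this in Colab, since '/content' does not exist on a local machine.
if in_colab:
    os.chdir('/content')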

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv('data/waterpumps/train_features.csv'), 
                 pd.read_csv('data/waterpumps/train_labels.csv'))
test = pd.read_csv('data/waterpumps/test_features.csv')
sample_submission = pd.read_csv('data/waterpumps/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [0]:
train, validate = train_test_split(train, random_state=84)

In [5]:
train.shape, validate.shape

((44550, 41), (14850, 41))
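
The 44,550 / 14,850 split above follows from `train_test_split`'s default behavior: when neither `test_size` nor `train_size` is passed, scikit-learn holds out 25% of the rows. Here is a quick arithmetic check in plain Python (no data needed; the variable names are only for illustration):

```python
import math

n_rows = 59400      # rows in the merged training data (from train.shape above)
test_size = 0.25    # scikit-learn's default when no size is given

# The held-out part is rounded up; the rest stays in train.
n_validate = math.ceil(n_rows * test_size)
n_train = n_rows - n_validate

print(n_train, n_validate)  # -> 44550 14850
```

If the class proportions should match between the two parts, `train_test_split` also accepts a `stratify` argument. This notebook does not use it.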

In [0]:
import numpy as np

def wrangle(X):
  # Wrangle the train, validate, and test datasets in one function

  # Prevent SettingWithCopyWarning
  X = X.copy()

  # About 3% of the time, latitude has small values near zero,
  # outside Tanzania, so we'll treat these values like zero.
  X['latitude'] = X['latitude'].replace(-2e-08, 0)

  # Some columns use 0 as a placeholder for missing data
  # (e.g., construction year can't realistically be 0). Let's convert
  # such values into nulls.
  cols_with_zeros = ['longitude', 'latitude', 'construction_year',
                     'population', 'amount_tsh', 'gps_height']
  for col in cols_with_zeros:
    X[col] = X[col].replace(0, np.nan)
  
  # Drop columns that duplicate 'quantity' and 'payment'
  X = X.drop(columns=['quantity_group', 'payment_type'])
  
  # Generate column that gives time difference from construction to inspection
  X['construction_to_inspection'] = pd.to_datetime(X['date_recorded']).dt.year - X['construction_year']

  # return the wrangled dataframe
  return X

In [0]:
train = wrangle(train)
validate = wrangle(validate)
test = wrangle(test)
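
To see what `wrangle` does to placeholder zeros, here is a small sketch that applies the same steps to a toy DataFrame. The `toy` frame and its three rows are made up for illustration. Only the column names and the replacement logic come from the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the placeholder values in the real data
toy = pd.DataFrame({
    'latitude': [-10.059954, -2e-08, 0.0],
    'construction_year': [1980, 0, 1997],
    'date_recorded': ['2013-01-30', '2012-10-15', '2013-02-14'],
})

# Same steps as wrangle(): near-zero latitude -> 0, then 0 -> NaN
toy['latitude'] = toy['latitude'].replace(-2e-08, 0)
for col in ['latitude', 'construction_year']:
    toy[col] = toy[col].replace(0, np.nan)

# Years from construction to inspection; NaN wherever the year is unknown
toy['construction_to_inspection'] = (
    pd.to_datetime(toy['date_recorded']).dt.year - toy['construction_year']
)

# Share of missing values per column after wrangling
print(toy.isnull().mean())
```

The nulls carry through to the engineered column, so a row with an unknown construction year also has an unknown `construction_to_inspection`. The imputer in the modeling pipeline handles those later.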

In [8]:
train.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,water_quality,quality_group,quantity,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group,construction_to_inspection
5990,65194,20.0,2013-01-30,Fini Water,278.0,FINI WATER,38.925036,-10.059954,Tulieni,0,Ruvuma / Southern Coast,Mashineni,Lindi,80,53,Ruangwa,Ruangwa,560.0,True,GeoData Consultants Ltd,Water authority,Ruangwa,True,1980.0,submersible,submersible,submersible,water authority,commercial,pay per bucket,soft,good,enough,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional,33.0
9838,53155,,2012-10-15,Aict,,AICT,33.007452,-3.892433,Mwangoye,0,Internal,Mwangoye,Shinyanga,17,3,Shinyanga Rural,Didia,,True,GeoData Consultants Ltd,WUG,,False,,other,other,other,wug,user-group,never pay,soft,good,enough,shallow well,shallow well,groundwater,other,other,non functional,
48972,1693,,2011-07-11,Hesawa,,HESAWA,31.772398,-2.472249,Muungano,0,Lake Victoria,Nyangwe B,Kagera,18,8,Chato,Kigongo,,True,GeoData Consultants Ltd,WUA,,True,,nira/tanira,nira/tanira,handpump,wua,user-group,pay monthly,salty,salty,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional,
45435,61413,,2011-03-26,Village Council,,Village Council,33.247356,-8.951263,Ite,0,Lake Rukwa,Nsungwe,Mbeya,12,2,Mbeya Rural,Bonde la Songwe,,True,GeoData Consultants Ltd,VWC,,True,,submersible,submersible,submersible,wug,user-group,pay monthly,salty,salty,insufficient,machine dbh,borehole,groundwater,communal standpipe,communal standpipe,functional,
1666,18326,,2013-02-14,Dwsp,1354.0,DWE,34.176786,-2.928304,Tumaini,0,Lake Victoria,Ndolelezi,Shinyanga,17,1,Bariadi,Lagangabilili,500.0,True,GeoData Consultants Ltd,WUG,,False,1997.0,nira/tanira,nira/tanira,handpump,wug,user-group,never pay,soft,good,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional needs repair,16.0


In [9]:
# The target is the status group
target = 'status_group'

# There are a lot of features here, numeric and categorical.

# Begin by getting all the features that are not the ID or status_group
train_features = train.drop(columns=[target,'id'])

# Numeric features: every numeric column that remains after dropping
# the target and id.
numerics = train_features.select_dtypes(include='number').columns.tolist()

# And here's all the categorical features
categoricals = train_features.select_dtypes(exclude='number').columns.tolist()

# Some categorical variables have a huge number of unique values. That many
# categories can make it harder for the model to generalize, and one-hot
# encoding them (one new column per unique value) could exhaust our
# computer's RAM.
# So let's keep only the low-cardinality categoricals (in this case,
# the categorical variables with 21 or fewer unique values).
low_cardinality_categoricals = [col for col in categoricals
                               if train_features[col].nunique() <= 21]

# Now here are our features.
features = numerics + low_cardinality_categoricals
print(features)

['amount_tsh', 'gps_height', 'longitude', 'latitude', 'num_private', 'region_code', 'district_code', 'population', 'construction_year', 'construction_to_inspection', 'basin', 'region', 'public_meeting', 'recorded_by', 'scheme_management', 'permit', 'extraction_type', 'extraction_type_group', 'extraction_type_class', 'management', 'management_group', 'payment', 'water_quality', 'quality_group', 'quantity', 'source', 'source_type', 'source_class', 'waterpoint_type', 'waterpoint_type_group']
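
Why filter on cardinality at all? One-hot encoding adds one column per unique value, while ordinal encoding keeps one column per feature. Here is a minimal plain-Python sketch of that bookkeeping. The two toy columns and the threshold of 3 are assumptions for illustration (the notebook itself uses 21 and pandas' `nunique`):

```python
# Hypothetical toy columns; the real data has far more rows and categories
columns = {
    'payment': ['pay per bucket', 'never pay', 'pay monthly', 'pay monthly'],
    'wpt_name': ['Tulieni', 'Mwangoye', 'Muungano', 'Ite'],
}

def cardinality(values):
    """Count the distinct non-missing values in a column."""
    return len({v for v in values if v is not None})

max_cardinality = 3  # illustrative threshold only
cardinalities = {name: cardinality(vals) for name, vals in columns.items()}
low_cardinality = [name for name, n in cardinalities.items()
                   if n <= max_cardinality]

# One-hot: one output column per distinct value. Ordinal: one per feature.
one_hot_width = sum(cardinalities.values())
ordinal_width = len(columns)

print(cardinalities, low_cardinality, one_hot_width, ordinal_width)
```

In the toy case the gap is small (7 columns versus 2). With a name-like column that has thousands of distinct values, the one-hot width grows by thousands while the ordinal width stays the same.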


In [0]:
# Override features to try to improve the validation score: use every column,
# including the high-cardinality categoricals. Ordinal encoding keeps one
# column per feature, so the feature matrix stays small.
features = numerics + categoricals

In [0]:
X_train = train[features]
y_train = train[target]
X_validate = validate[features]
y_validate = validate[target]
X_test = test[features]

In [27]:
%%time

import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Ordinal-encode the categoricals, fill missing values with column means,
# then fit a random forest
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    RandomForestClassifier(n_estimators=100, random_state=235, n_jobs=-1)
)

# Fit on train
pipeline.fit(X_train, y_train)

# Score on validate
print('Validation Accuracy', pipeline.score(X_validate, y_validate))

# Predict on test
y_pred = pipeline.predict(X_test)

Validation Accuracy 0.8085521885521886
CPU times: user 19.6 s, sys: 109 ms, total: 19.7 s
Wall time: 10.7 s
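
The first pipeline step turns each categorical column into integers. As a rough picture of the idea, here is a hand-rolled sketch in plain Python. It is not how `category_encoders` is implemented. The helper names, the toy values, and the choice of `-1` for unseen categories are all assumptions made for this example.

```python
def fit_ordinal(values):
    """Assign each distinct category an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
    return mapping

def transform_ordinal(values, mapping, unknown=-1):
    """Replace categories with their integers; unseen ones get `unknown`."""
    return [mapping.get(v, unknown) for v in values]

# Toy 'basin' values; 'Pangani' stands in for a category not seen in training
train_basin = ['Internal', 'Lake Victoria', 'Lake Rukwa', 'Lake Victoria']
mapping = fit_ordinal(train_basin)

encoded_train = transform_ordinal(train_basin, mapping)
encoded_new = transform_ordinal(['Lake Rukwa', 'Pangani'], mapping)

print(mapping)
print(encoded_train)  # -> [1, 2, 3, 2]
print(encoded_new)    # -> [3, -1]
```

The integers carry no real order, which would mislead a linear model. A tree can still separate the categories by splitting on the codes repeatedly, which is one reason ordinal encoding is a common pairing with random forests.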


In [28]:
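# Reuse the fitted encoder to recover the feature names in model order
encoder = pipeline.named_steps['ordinalencoder']
encoded = encoder.transform(X_train)

# Pair each of the forest's importances with its feature name
rf = pipeline.named_steps['randomforestclassifier']
importances = pd.Series(rf.feature_importances_, encoded.columns)
importances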

amount_tsh                    0.018723
gps_height                    0.044582
longitude                     0.081173
latitude                      0.078269
num_private                   0.001042
region_code                   0.014087
district_code                 0.016295
population                    0.030916
construction_year             0.032510
construction_to_inspection    0.032648
date_recorded                 0.038890
funder                        0.030799
installer                     0.023259
wpt_name                      0.055980
basin                         0.011759
subvillage                    0.052947
region                        0.013868
lga                           0.021958
ward                          0.036980
public_meeting                0.007121
recorded_by                   0.000000
scheme_management             0.012098
scheme_name                   0.021689
permit                        0.006653
extraction_type               0.019373
extraction_type_group    
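
An unsorted list is hard to scan, so as a step toward the "plot your feature importances" stretch goal, here is a minimal sketch that ranks a few of the values printed above and draws a rough text bar chart. It uses plain Python with a hand-copied subset of the output. The `text_bar` helper and its `scale` are made up for this example.

```python
# A handful of importances copied from the output above
importances_subset = {
    'gps_height': 0.044582,
    'longitude': 0.081173,
    'latitude': 0.078269,
    'wpt_name': 0.055980,
    'subvillage': 0.052947,
    'recorded_by': 0.000000,
}

# Rank from most to least important
ranked = sorted(importances_subset.items(), key=lambda kv: kv[1], reverse=True)

def text_bar(value, scale=500):
    """Draw a bar whose length is proportional to the importance."""
    return '#' * round(value * scale)

for name, value in ranked:
    print(f'{name:<12} {value:.6f} {text_bar(value)}')
```

In the notebook itself, where `importances` is a pandas Series, `importances.sort_values().plot.barh()` should give the same ranking as a real chart. Two things to keep in mind when reading it: `recorded_by` scores zero, presumably because it holds a single value, and a random forest's built-in importances can overstate high-cardinality columns such as `wpt_name` and `subvillage`.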

In [29]:
# Build the submission: one predicted status_group per test id
submission = test[['id']].copy()
submission['status_group'] = y_pred
submission.head()

Unnamed: 0,id,status_group
0,50785,non functional
1,51630,functional
2,17168,functional
3,45559,non functional
4,49871,functional


In [0]:
submission.to_csv('kaggleChallenge.csv', index=False)
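
Before uploading, it can help to sanity-check the file. The sketch below uses only the standard library. `check_submission` is a hypothetical helper, not part of Kaggle's tooling. The expected header and the three label values are taken from the outputs shown earlier in this notebook.

```python
import csv

# The three status_group values that appear in the training data
VALID_LABELS = {'functional', 'non functional', 'functional needs repair'}

def check_submission(path, expected_rows):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    with open(path, newline='') as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ['id', 'status_group']:
            return [f'unexpected header: {reader.fieldnames}']
        rows = list(reader)

    if len(rows) != expected_rows:
        problems.append(f'expected {expected_rows} rows, found {len(rows)}')

    unknown = sorted({r['status_group'] for r in rows} - VALID_LABELS)
    if unknown:
        problems.append(f'unknown labels: {unknown}')

    if len({r['id'] for r in rows}) != len(rows):
        problems.append('duplicate ids')

    return problems
```

For this notebook the call would be `check_submission('kaggleChallenge.csv', expected_rows=14358)`, since the test set has 14,358 rows. An empty list means the file passed these basic checks. It says nothing about how well the predictions will score.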