<a href="https://colab.research.google.com/github/Vertex138/DS-Unit-2-Kaggle-Challenge/blob/master/submissions/Assignment222.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 2

## Assignment
- [ ] Read [“Adopting a Hypothesis-Driven Workflow”](https://outline.com/5S5tsB), a blog post by a Lambda DS student about the Tanzania Waterpumps challenge.
- [ ] Continue to participate in our Kaggle challenge.
- [ ] Try Ordinal Encoding.
- [ ] Try a Random Forest Classifier.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- [ ] Try other [categorical encodings](https://contrib.scikit-learn.org/categorical-encoding/).
- [ ] Get and plot your feature importances.
- [ ] Make visualizations and share on Slack.
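
One possible starting point for the feature-importance stretch goal, sketched on synthetic data and not on the waterpumps set. The `feature_i` column names and the `make_classification` data are assumptions for illustration only; `feature_importances_` is the fitted attribute that scikit-learn's `RandomForestClassifier` exposes.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data; the feature names are hypothetical.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
columns = [f'feature_{i}' for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=columns)

pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
pipeline.fit(X, y)

# make_pipeline names each step after its lowercased class name.
forest = pipeline.named_steps['randomforestclassifier']
importances = pd.Series(forest.feature_importances_, index=columns)
importances = importances.sort_values()
print(importances)
```

If matplotlib is installed, `importances.plot.barh()` should turn the sorted Series into a horizontal bar chart.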

### Reading

Top recommendations in _**bold italic:**_

#### Decision Trees
- A Visual Introduction to Machine Learning, [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/), and _**[Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)**_
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU)

#### Random Forests
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/), Chapter 8: Tree-Based Methods
- [Coloring with Random Forests](http://structuringtheunstructured.blogspot.com/2017/11/coloring-with-random-forests.html)
- _**[Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)**_

#### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- _**[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)**_
- _**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)**_
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

#### Imposter Syndrome
- [Effort Shock and Reward Shock (How The Karate Kid Ruined The Modern World)](http://www.tempobook.com/2014/07/09/effort-shock-and-reward-shock/)
- [How to manage impostor syndrome in data science](https://towardsdatascience.com/how-to-manage-impostor-syndrome-in-data-science-ad814809f068)
- ["I am not a real data scientist"](https://brohrer.github.io/imposter_syndrome.html)
- _**[Imposter Syndrome in Data Science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/)**_






In [0]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

Collecting category_encoders==2.*
  Downloading https://files.pythonhosted.org/packages/a0/52/c54191ad3782de633ea3d6ee3bb2837bda0cf3bc97644bb6375cf14150a0/category_encoders-2.1.0-py2.py3-none-any.whl (100kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.1.0


In [0]:
# Imports:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

from sklearn.tree import DecisionTreeClassifier
import graphviz
from sklearn.tree import export_graphviz

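# `files` only exists on Colab, so import it only there
# (`sys` is imported in the first cell)
if 'google.colab' in sys.modules:
    from google.colab import files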

In [0]:
df = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
dfTest = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
dfSample = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

df.shape, dfTest.shape

((59400, 41), (14358, 40))

In [0]:
# Split 'df' into train and validation sets
dfTrain, dfVal = train_test_split(df, random_state=138)
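
When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows. A minimal sketch on a toy frame shows the resulting sizes and how the optional `stratify` argument keeps class proportions the same in both parts. The toy data and the use of `stratify` are assumptions for illustration; the cell above does not stratify.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced three-class target (made-up data).
toy = pd.DataFrame({
    'x': range(120),
    'status_group': (['functional'] * 72
                     + ['non functional'] * 36
                     + ['functional needs repair'] * 12),
})

train, val = train_test_split(toy, random_state=138,
                              stratify=toy['status_group'])

print(train.shape, val.shape)
print(val['status_group'].value_counts())
```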

In [0]:
# Pick the target, then select features:
# every numeric column plus each non-numeric column with at most 500 unique values
target = 'status_group'

preFeatures = dfTrain.drop(columns=[target, 'id'])
numFeatures = preFeatures.select_dtypes(include='number').columns.tolist()
# Cardinality (number of unique values) of each non-numeric column
carFeatures = preFeatures.select_dtypes(exclude='number').nunique()
catFeatures = carFeatures[carFeatures <= 500].index.tolist()

features = numFeatures + catFeatures

xTrain = dfTrain[features]
yTrain = dfTrain[target]
xVal = dfVal[features]
yVal = dfVal[target]
xTest = dfTest[features]

features

['amount_tsh',
 'gps_height',
 'longitude',
 'latitude',
 'num_private',
 'region_code',
 'district_code',
 'population',
 'construction_year',
 'date_recorded',
 'basin',
 'region',
 'lga',
 'public_meeting',
 'recorded_by',
 'scheme_management',
 'permit',
 'extraction_type',
 'extraction_type_group',
 'extraction_type_class',
 'management',
 'management_group',
 'payment',
 'payment_type',
 'water_quality',
 'quality_group',
 'quantity',
 'quantity_group',
 'source',
 'source_type',
 'source_class',
 'waterpoint_type',
 'waterpoint_type_group']
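
The selection above keeps every numeric column plus any non-numeric column with at most 500 unique values. The same logic on a toy frame, with made-up columns and a cutoff of 3 in place of 500, makes the filter easier to see:

```python
import pandas as pd

# Hypothetical columns, for illustration only.
toy = pd.DataFrame({
    'population': [10, 250, 0, 40],
    'zone': ['north', 'south', 'north', 'south'],
    'pump_name': ['a', 'b', 'c', 'd'],  # unique per row: high cardinality
})

numeric = toy.select_dtypes(include='number').columns.tolist()
cardinality = toy.select_dtypes(exclude='number').nunique()
categorical = cardinality[cardinality <= 3].index.tolist()

toy_features = numeric + categorical
print(toy_features)  # ['population', 'zone']
```

`pump_name` is dropped because it has four unique values, one per row, which exceeds the cutoff.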

In [0]:
# # Make a Random Forest
# forestPipeline = make_pipeline(
#     ce.OrdinalEncoder(), 
#     SimpleImputer(strategy='median'), 
#     RandomForestClassifier(n_estimators=1000, random_state=138, n_jobs=-1)
# )

In [0]:
# # Test the forest pipeline
# forestPipeline.fit(xTrain, yTrain)
# print('Validation Accuracy', forestPipeline.score(xVal, yVal))
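
To see what ordinal encoding does to a column before it reaches the forest, here is a small sketch. It swaps in scikit-learn's `OrdinalEncoder` for `ce.OrdinalEncoder`, so the integer codes and the handling of unseen categories may differ from what the pipeline above produces; the category values are made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up category values for a single column.
train = pd.DataFrame({'quantity': ['enough', 'dry', 'seasonal', 'enough']})
val = pd.DataFrame({'quantity': ['dry', 'never seen']})

# Categories seen during fit get codes 0, 1, 2, ... in sorted order;
# categories first seen at transform time get unknown_value.
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(train)

print(encoder.categories_)
encoded = encoder.transform(val)
print(encoded)
```

Each category becomes a single integer, so the number of columns stays fixed no matter how many distinct values a feature has.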

In [0]:
# AFTER USING THE CODE FROM THE PREVIOUS ASSIGNMENT:
# The best accurate tree based on 
# 6 24 77.865
# 10 26 77.558

# BEST FROM 

In [0]:
# A loop to measure how validation accuracy changes as n_estimators increases (10 to 990)
# maxVarAcc = 0
# prev = 0
# score = 0
# for i in range(1, 100):
#   loopPipeline = make_pipeline(
#       ce.OrdinalEncoder(),
#       SimpleImputer(strategy='median'),
#       RandomForestClassifier(n_estimators=i*10, random_state=138, n_jobs=-1))
#   loopPipeline.fit(xTrain, yTrain)
#   prev = score
#   score = loopPipeline.score(xVal, yVal)
#   print(i*10, "\t", score, "\t", score-prev)
#   if score > maxVarAcc:
#     maxVarAcc = score
#     print("^^^")  # marks a new best score

In [0]:
# A loop to determine the best max_depth

# maxVarAcc = 0
# prev = 0
# score = 0
# for depth in range(1, 50):
#   loopPipeline = make_pipeline(
#     ce.OrdinalEncoder(),
#     SimpleImputer(strategy='median'),
#     RandomForestClassifier(n_estimators=100, random_state=138, n_jobs=-1, max_depth=depth))
#   loopPipeline.fit(xTrain, yTrain)
#   prev = score
#   score = loopPipeline.score(xVal, yVal)
#   print(depth, "\t", score, "\t", score-prev)
#   if score > maxVarAcc:
#     maxVarAcc = score
#     print("^^^")  # marks a new best score

In [0]:
# A loop to determine the best min_samples_split

# maxVarAcc = 0
# prev = 0
# score = 0
# for minSplit in range(2, 10):
#   loopPipeline = make_pipeline(
#     ce.OrdinalEncoder(),
#     SimpleImputer(strategy='median'),
#     RandomForestClassifier(n_estimators=100, random_state=138, n_jobs=-1, max_depth=19, min_samples_split=minSplit))
#   loopPipeline.fit(xTrain, yTrain)
#   prev = score
#   score = loopPipeline.score(xVal, yVal)
#   print(minSplit, "\t", score, "\t", score-prev)
#   if score > maxVarAcc:
#     maxVarAcc = score
#     print("^^^")  # marks a new best score
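
The commented-out loops tune one hyperparameter at a time against a single validation split. An alternative, swapped in here as an illustration and not something this notebook ran, is scikit-learn's `GridSearchCV`, which tries combinations of values and scores each with cross-validation. The sketch uses synthetic data and a deliberately tiny grid so it finishes quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the waterpumps features and labels.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=138)

pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=20, random_state=138),
)

# Pipeline parameters are addressed as '<step name>__<parameter>'.
param_grid = {
    'randomforestclassifier__max_depth': [3, 6, None],
    'randomforestclassifier__min_samples_split': [2, 4],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because every combination is scored on several folds, the chosen settings depend less on one particular split than a single `score(xVal, yVal)` comparison does, at the cost of more fitting time.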

In [0]:
bestPipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=180, max_depth=19, min_samples_split=4,
                           random_state=138, n_jobs=-1),
)
bestPipeline.fit(xTrain, yTrain)
bestPipeline.score(xVal, yVal)

0.8131986531986533
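
A validation accuracy is easier to interpret next to a majority-class baseline, meaning the accuracy of always predicting the most common training label. A minimal sketch with made-up labels (the notebook's real `yTrain` and `yVal` are not reproduced here):

```python
import pandas as pd

# Made-up labels standing in for yTrain and yVal.
y_train = pd.Series(['functional'] * 6
                    + ['non functional'] * 3
                    + ['functional needs repair'])
y_val = pd.Series(['functional', 'non functional', 'functional',
                   'functional needs repair'])

majority = y_train.mode()[0]            # most frequent training label
baseline = (y_val == majority).mean()   # share of validation rows it gets right
print(majority, baseline)  # functional 0.5
```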

In [0]:
testPred = bestPipeline.predict(xTest)
# Build the submission frame: one row per test id with its predicted status
dfSubmit = pd.DataFrame({'id': dfTest['id'], 'status_group': testPred})
dfSubmit.head()

|   | id | status_group |
|---|---|---|
| 0 | 50785 | non functional |
| 1 | 51630 | functional |
| 2 | 17168 | functional |
| 3 | 45559 | non functional |
| 4 | 49871 | functional |


In [0]:
# dfSubmit.to_csv('waterpumpsKaggle9.csv', index=False)
# files.download('waterpumpsKaggle9.csv')
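
Before uploading, it can help to confirm that the file matches the sample submission's layout. A sketch with small stand-in frames (in this notebook they would be `dfSubmit` and `dfSample`); it writes to an in-memory buffer instead of a real file:

```python
import io
import pandas as pd

# Stand-ins for dfSample and dfSubmit.
sample = pd.DataFrame({'id': [50785, 51630],
                       'status_group': ['functional', 'functional']})
submit = pd.DataFrame({'id': [50785, 51630],
                       'status_group': ['non functional', 'functional']})

# Same columns, same row count, same ids in the same order.
assert list(submit.columns) == list(sample.columns)
assert len(submit) == len(sample)
assert (submit['id'].values == sample['id'].values).all()

# index=False keeps the row index out of the CSV.
buffer = io.StringIO()
submit.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Leaving out `index=False` would add an extra unnamed first column, which no longer matches the two-column layout of the sample file.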