<a href="https://colab.research.google.com/github/EvidenceN/DS-Unit-2-Kaggle-Challenge/blob/master/module1/Evidence.N.%20Answers_Assignment_kaggle_challenge_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 1*

---

# Decision Trees

## Assignment
- [ ] [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. Go to our Kaggle InClass competition website. You will be given the URL in Slack. Go to the Rules page. Accept the rules of the competition.
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Begin with baselines for classification.
- [ ] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your feature importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [what columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) What columns are duplicates, or nearly duplicates? Can you extract the year from date_recorded? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Make exploratory visualizations and share on Slack.
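
A wrangle function along the lines described in the list above might look like the following minimal sketch. The function name `wrangle`, the new column names `year_recorded` and `years_in_service`, the list of columns where zeros are treated as missing, and the choice to drop `quantity_group` as a duplicate of `quantity` are all assumptions for illustration, not part of the assignment. The demo frame is hand-made, not real data.

```python
import numpy as np
import pandas as pd


def wrangle(df):
    """Apply the same cleaning steps to a train, validation, or test frame."""
    df = df.copy()  # avoid mutating the caller's frame

    # Assumption: zeros in these columns stand in for missing values
    for col in ['longitude', 'construction_year', 'population']:
        df[col] = df[col].replace(0, np.nan)

    # Extract the year from date_recorded
    df['date_recorded'] = pd.to_datetime(df['date_recorded'])
    df['year_recorded'] = df['date_recorded'].dt.year

    # Years from construction to inspection (NaN if construction_year is missing)
    df['years_in_service'] = df['year_recorded'] - df['construction_year']

    # Assumption: quantity_group duplicates quantity, so one copy is dropped
    return df.drop(columns=['date_recorded', 'quantity_group'])


# Tiny hand-made frame to exercise the function
demo = pd.DataFrame({
    'longitude': [38.71, 0.0],
    'construction_year': [2004, 0],
    'population': [500, 0],
    'date_recorded': ['2011-02-23', '2011-07-20'],
    'quantity': ['enough', 'enough'],
    'quantity_group': ['enough', 'enough'],
})
clean = wrangle(demo)
print(clean)
```

Calling the same function on `train`, `val`, and `test` keeps the three sets consistent, which is the point of wrapping the steps in one place.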


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)
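
As a rough sketch of the binning idea, the snippet below cuts a numeric feature into quartiles with `pd.qcut` and computes the share of functional pumps in each bin. The values in `demo` are made up for illustration, and the names `gps_height_bin` and `rate_by_bin` are hypothetical.

```python
import pandas as pd

# Hand-made stand-in for the training set (values are illustrative only)
demo = pd.DataFrame({
    'gps_height': [0, 70, 412, 1425, 0, 900, 1600, 250],
    'status_group': ['functional', 'functional', 'functional', 'non functional',
                     'non functional', 'functional', 'functional', 'non functional'],
})
demo['functional'] = (demo['status_group'] == 'functional').astype(int)

# Split the numeric feature into 4 bins holding roughly equal numbers of rows
demo['gps_height_bin'] = pd.qcut(demo['gps_height'], q=4, duplicates='drop')

# Share of functional pumps within each bin
rate_by_bin = demo.groupby('gps_height_bin', observed=True)['functional'].mean()
print(rate_by_bin)
```

The resulting Series can be plotted directly, for example with `rate_by_bin.plot.bar()`, or the binned column can be passed to a Seaborn categorical plot.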

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this classification problem, you may want to use the parameter `logistic=True`, but it can be slow.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.

#### High-cardinality categoricals

This code from a previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```


In [131]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'



In [132]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [133]:
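# Split train into train and validation sets.
# Note: with train_size=0.2 and test_size=0.8, only 20% of the rows are kept
# for training and 80% go to validation.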

train, val = train_test_split(train, test_size = 0.8, train_size = 0.2, 
                              random_state = 42, stratify=train['status_group'])

train.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
36369,12516,0.0,2011-02-23,Legeza Legeza,70,Private,38.714599,-6.919704,Legeza,0,Wami / Ruvu,Mzenga B,Pwani,6,3,Kisarawe,Vihingo,500,True,GeoData Consultants Ltd,Private operator,,True,2004,india mark iii,india mark iii,handpump,private operator,commercial,never pay,never pay,salty,salty,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump,functional
12756,59778,0.0,2011-07-20,Hifab,0,Hesawa,33.390971,-3.044316,Majilala,0,Lake Victoria,Mwajilala,Mwanza,19,4,Kwimba,Lyoma,0,,GeoData Consultants Ltd,VWC,,True,0,swn 80,swn 80,handpump,vwc,user-group,never pay,never pay,soft,good,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump,functional
17415,42561,0.0,2012-10-26,Rwssp,0,DWE,32.357735,-3.90101,Imalanota A,0,Lake Tanganyika,Imalangi,Shinyanga,17,3,Kahama,Mpunze,0,True,GeoData Consultants Ltd,,,True,0,other,other,other,wug,user-group,unknown,unknown,milky,milky,enough,enough,shallow well,shallow well,groundwater,other,other,functional
11103,26287,0.0,2013-02-14,,1425,,34.702588,-5.198856,Bwawani,0,Internal,Mnyankinda,Singida,13,2,Singida Rural,Ikungu,1,True,GeoData Consultants Ltd,VWC,,,2000,gravity,gravity,gravity,vwc,user-group,unknown,unknown,unknown,unknown,dry,dry,dam,dam,surface,communal standpipe,communal standpipe,non functional
18409,61331,3000.0,2011-03-27,Wsdp,412,DWE,38.359131,-4.975955,Mworongo,0,Pangani,Kwamkole,Tanga,4,2,Korogwe,Chekelei,500,True,GeoData Consultants Ltd,VWC,,False,2008,afridev,afridev,handpump,vwc,user-group,pay annually,annually,salty,salty,seasonal,seasonal,shallow well,shallow well,groundwater,hand pump,hand pump,functional
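
The split above keeps 20% of the rows for fitting. A more conventional 80/20 split, sketched here on a hand-made frame, would swap the two sizes. The frame `demo` and the names `train_part` and `val_part` are assumptions for illustration, and using this split on the real data would change the scores reported in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hand-made frame with a 60/40 class balance (illustrative only)
demo = pd.DataFrame({
    'x': range(100),
    'status_group': ['functional'] * 60 + ['non functional'] * 40,
})

# 80% of rows for training, 20% for validation, preserving class proportions
train_part, val_part = train_test_split(
    demo, train_size=0.8, test_size=0.2,
    stratify=demo['status_group'], random_state=42)

print(train_part.shape, val_part.shape)
```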


In [134]:
# Classification baseline: class proportions in the training set.

train['status_group'].value_counts(normalize=True)

functional                 0.543098
non functional             0.384259
functional needs repair    0.072643
Name: status_group, dtype: float64
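
The proportions above mean that always predicting the most frequent class, `functional`, would be right about 54% of the time on the training labels. A minimal sketch of scoring that majority-class baseline against a validation set is shown below, using made-up label Series (`y_train_demo` and `y_val_demo` are hypothetical names, not the notebook's variables).

```python
import pandas as pd

# Made-up labels standing in for the real training and validation targets
y_train_demo = pd.Series(
    ['functional'] * 5 + ['non functional'] * 4 + ['functional needs repair'])
y_val_demo = pd.Series(
    ['functional', 'non functional', 'functional', 'functional needs repair'])

# The baseline always predicts the most frequent training class
majority_class = y_train_demo.mode()[0]

# Baseline accuracy: fraction of validation labels equal to that class
baseline_accuracy = (y_val_demo == majority_class).mean()
print(majority_class, baseline_accuracy)
```

Any fitted model should beat this number on the validation set to be worth keeping.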

In [135]:
# Get a general profile of the training dataset.
# The call is kept inside a string because it is not working for some reason.
'''
import pandas_profiling

profile_report = train.profile_report(
    check_correlation_pearson=False,
    correlations={
        'pearson': False,
        'spearman': False,
        'kendall': False,
        'phi_k': False,
        'cramers': False,
        'recoded': False,
    },
    plot={'histogram': {'bayesian_blocks_bins': False}},
)

profile_report
'''

In [0]:
# defining the target and features
target = 'status_group'
train_features = train.drop(columns=[target, 'id'])

numeric = train_features.select_dtypes(include='number').columns.tolist()
cardinality = train_features.select_dtypes(exclude='number').nunique()
categorical_features = cardinality[cardinality <= 50].index.tolist()

features = numeric + categorical_features

x_train = train[features]
y_train = train[target]
x_val = val[features]
y_val = val[target]
x_test = test[features]

In [137]:
# Select features. Use a scikit-learn pipeline to encode categoricals, 
# impute missing values, and fit a decision tree classifier.

# importing the necessary libraries
from sklearn.pipeline import make_pipeline
import category_encoders as ce
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt

# creating a pipeline for one hot encoding, imputing, and decision tree classifier
pipeline = make_pipeline(ce.OneHotEncoder(use_cat_names=True),
                         SimpleImputer(strategy='mean'),
                         DecisionTreeClassifier(random_state=42))

# fitting the data
pipeline.fit(x_train, y_train)

# validation score
validation_accuracy = pipeline.score(x_val, y_val)
print(f'validation accuracy: {validation_accuracy}')

validation accuracy: 0.7166035353535354


In [0]:
# See what the decision tree looks like. 

import graphviz
from sklearn.tree import export_graphviz

model = pipeline.named_steps['decisiontreeclassifier']
encoder = pipeline.named_steps['onehotencoder']
encoded_columns = encoder.transform(x_val).columns

dot_data = export_graphviz(model, 
                           out_file=None, 
                           max_depth=3, 
                           feature_names=encoded_columns,
                           class_names=model.classes_, 
                           impurity=False, 
                           filled=True, 
                           proportion=True, 
                           rounded=True)   
display(graphviz.Source(dot_data))

In [0]:
# get and plot feature importances.
%matplotlib inline

model = pipeline.named_steps['decisiontreeclassifier']
encoder = pipeline.named_steps['onehotencoder']
columns = encoder.transform(x_val).columns
importances = pd.Series(model.feature_importances_, columns)
#plt.autoscale(enable=True)
plt.figure(figsize=(10,30))
importances.sort_values().plot.barh(color='red');

In [140]:
# Tuning hyperparameters to get better predictions.

# Testing a different maximum tree depth
pipeline = make_pipeline(ce.OneHotEncoder(use_cat_names=True),
                         SimpleImputer(strategy='mean'),
                         DecisionTreeClassifier(max_depth = 4, random_state=42))

# fitting the data
pipeline.fit(x_train, y_train)

# validation score
validation_accuracy = pipeline.score(x_val, y_val)
print(f'validation accuracy: {validation_accuracy}')

validation accuracy: 0.7041666666666667


In [141]:
# Tuning hyperparameters to get better predictions.

# Testing a different maximum number of leaf nodes
pipeline = make_pipeline(ce.OneHotEncoder(use_cat_names=True),
                         SimpleImputer(strategy='mean'),
                         DecisionTreeClassifier(max_leaf_nodes = 3, random_state=42))

# fitting the data
pipeline.fit(x_train, y_train)

# validation score
validation_accuracy = pipeline.score(x_val, y_val)
print(f'validation accuracy: {validation_accuracy}')

validation accuracy: 0.6935395622895623


In [142]:
# Tuning hyperparameters to get better predictions.

# Combining several parameters to see if there is a difference
pipeline = make_pipeline(ce.OneHotEncoder(use_cat_names=True),
                         SimpleImputer(strategy='mean'),
                         DecisionTreeClassifier(min_samples_leaf = 18,
                                                max_depth = 20, 
                                                max_leaf_nodes = 20, random_state=42))

# fitting the data
pipeline.fit(x_train, y_train)

# validation score
validation_accuracy = pipeline.score(x_val, y_val)
print(f'validation accuracy: {validation_accuracy}')

validation accuracy: 0.7106902356902357


In [155]:
# Tuning hyperparameters to get better predictions.

# Testing a different minimum number of samples per leaf
best_model = make_pipeline(ce.OneHotEncoder(use_cat_names=True),
                         SimpleImputer(strategy='mean'),
                         DecisionTreeClassifier(min_samples_leaf = 9, random_state=42))

# fitting the data
best_model.fit(x_train, y_train)

# validation score
validation_accuracy = best_model.score(x_val, y_val)
print(f'validation accuracy: {validation_accuracy}')

validation accuracy: 0.73621632996633
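
The cells above try one setting at a time by hand. The same comparison can be sketched as a loop over candidate values. The example below uses synthetic data from `make_classification` rather than the waterpumps data, and the candidate list and the names `scores` and `best_leaf_size` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the waterpumps data)
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Fit one tree per candidate value and record its validation accuracy
scores = {}
for leaf_size in [1, 5, 9, 20]:
    tree = DecisionTreeClassifier(min_samples_leaf=leaf_size, random_state=42)
    tree.fit(X_tr, y_tr)
    scores[leaf_size] = tree.score(X_va, y_va)

# Pick the candidate with the highest validation accuracy
best_leaf_size = max(scores, key=scores.get)
print(scores, best_leaf_size)
```

With the notebook's pipeline, the loop body would build the pipeline with `make_pipeline` and score it on `x_val` and `y_val` instead.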


In [0]:
y_pred = best_model.predict(x_test)


# Makes a dataframe with two columns, id and status_group, 
# and writes to a csv file, without the index

submission = sample_submission.copy()
submission['status_group'] = y_pred
submission.to_csv('Evidence.N DS9 Kaggle Challenge 1.csv', index=False)

In [0]:
from google.colab import files
files.download('Evidence.N DS9 Kaggle Challenge 1.csv')