Lambda School Data Science

*Unit 2, Sprint 2, Module 1*

---

# Decision Trees

## Assignment
- [x] [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. Go to our Kaggle InClass competition website. You will be given the URL in Slack. Go to the Rules page. Accept the rules of the competition. Notice that the Rules page also has instructions for the Submission process. The Data page has feature definitions.
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Begin with baselines for classification.
- [ ] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your feature importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [what columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) What columns are duplicates, or nearly duplicates? Can you extract the year from date_recorded? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Make exploratory visualizations and share on Slack.


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)
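As a rough sketch of the binning idea, the snippet below groups a numeric feature into quantile bins and computes the share of functional pumps per bin. The `toy` dataframe and its values are made up for illustration; with the real data you would use `train` and the `functional` column described above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training set (hypothetical values, not the real data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'gps_height': rng.integers(0, 2000, size=200),
    'functional': rng.integers(0, 2, size=200),
})

# Bin the numeric feature into (up to) 4 quantile-based groups
toy['gps_height_bin'] = pd.qcut(toy['gps_height'], q=4, duplicates='drop')

# Share of functional pumps within each bin
rate_by_bin = toy.groupby('gps_height_bin', observed=True)['functional'].mean()
print(rate_by_bin)
```

The binned column has only a few unique values, so it can also be passed to a categorical plot.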

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this classification problem, you may want to use the parameter `logistic=True`, but it can be slow.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.

#### High-cardinality categoricals

This code from a previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```
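The same idea can be wrapped in small helpers so that the list of frequent values is learned once from the training set and then reused on every other set. This is only a sketch: `top_n_categories`, `collapse_rare`, and the toy dataframes are hypothetical names and data, not part of the original assignment.

```python
import pandas as pd

def top_n_categories(series, n=10):
    """Return the n most frequent values in a Series (hypothetical helper)."""
    return series.value_counts().index[:n]

def collapse_rare(series, keep, other='OTHER'):
    """Replace every value that is not in `keep` with `other`."""
    return series.where(series.isin(keep), other)

# Toy data for illustration only
train_toy = pd.DataFrame({'funder': ['A', 'A', 'A', 'B', 'B', 'C']})
test_toy = pd.DataFrame({'funder': ['A', 'C', 'D']})

# Learn the frequent values from the training data only ...
keep = top_n_categories(train_toy['funder'], n=2)

# ... then apply the same list to both sets
train_toy['funder'] = collapse_rare(train_toy['funder'], keep)
test_toy['funder'] = collapse_rare(test_toy['funder'], keep)

print(train_toy['funder'].tolist())
print(test_toy['funder'].tolist())
```

Because each mask is built from the same dataframe it is applied to, the index always lines up, and values unseen in training (such as `'D'` here) also fall into `'OTHER'`.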


In [1]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'


### Do train/validate/test split with the Tanzania Waterpumps data.

In [73]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

# Split train into train and val
train, val = train_test_split(train,train_size=0.80, test_size=.20,
                              stratify=train['status_group'], random_state=7)

train.shape, val.shape, test.shape

((47520, 41), (11880, 41), (14358, 40))

In [7]:
# Check Pandas Profiling version
import pandas_profiling
pandas_profiling.__version__

'2.4.0'

In [9]:
# Old code for Pandas Profiling version 2.3
# It can be very slow with medium & large datasets.
# These parameters will make it faster.

# profile = train.profile_report(
#     check_correlation_pearson=False,
#     correlations={
#         'pearson': False,
#         'spearman': False,
#         'kendall': False,
#         'phi_k': False,
#         'cramers': False,
#         'recoded': False,
#     },
#     plot={'histogram': {'bayesian_blocks_bins': False}},
# )
#

# New code for Pandas Profiling version 2.4
from pandas_profiling import ProfileReport
profile = ProfileReport(train, minimal=True)

profile.to_notebook_iframe()

In [0]:
# Save the profile report to an HTML file
profile.to_file(output_file='tanzania_profile_report_minimal.html')

In [11]:
# check out basic visualizations
import plotly.express as px
px.scatter(train, x='longitude',y='latitude',color='status_group',opacity=0.1)
# Need to get rid of the extreme outliers

In [12]:
train[['longitude','latitude']].describe()

| | longitude | latitude |
|---|---|---|
| count | 47520.0 | 47520.0 |
| mean | 34.071532 | -5.699443 |
| std | 6.567491 | 2.943513 |
| min | 0.0 | -11.64944 |
| 25% | 33.087438 | -8.532038 |
| 50% | 34.902873 | -5.013405 |
| 75% | 37.177604 | -3.325406 |
| max | 40.345193 | -2e-08 |


In [13]:
train.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
26660,50225,0.0,2013-02-03,Rotary Club,1138,Rotary Club,36.901195,-3.39107,Kitefu Primary School,0,Pangani,Tanesco,Arusha,2,7,Meru,Maji ya Chai,350,True,GeoData Consultants Ltd,WUA,,False,2009,gravity,gravity,gravity,wua,user-group,unknown,unknown,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
46901,67215,3000.0,2011-03-08,Danida,910,DANID,35.366883,-7.642936,none,0,Rufiji,Ndorobo B,Iringa,11,1,Iringa Rural,Mlowa,1,True,GeoData Consultants Ltd,VWC,Mlowa,True,1992,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1058,55164,1500.0,2013-02-12,Government Of Tanzania,682,District council,37.512633,-3.542461,Kwa Salim Idd,0,Pangani,Kitopeni,Kilimanjaro,3,2,Mwanga,Kileo,250,True,GeoData Consultants Ltd,WUA,Kifaru water Supply,True,2011,submersible,submersible,submersible,wua,user-group,pay monthly,monthly,soft,good,insufficient,insufficient,machine dbh,borehole,groundwater,communal standpipe,communal standpipe,functional
9559,7310,0.0,2011-03-11,Government Of Tanzania,370,RWE,38.688599,-4.847517,Msikitini,0,Pangani,Dukani,Tanga,4,2,Korogwe,Kizara,150,True,GeoData Consultants Ltd,VWC,Bombomajimoto water,False,1975,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,non functional
35556,37576,0.0,2012-10-09,Rwssp,0,WEDECO,34.156026,-3.104921,Polewalandi,0,Lake Victoria,Makungu,Shinyanga,17,6,Meatu,Kisesa,0,True,GeoData Consultants Ltd,WUG,,True,0,nira/tanira,nira/tanira,handpump,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,functional


In [0]:
import numpy as np

def wrangle(X):

  # Make a copy so the original dataframe is not modified
  X = X.copy()

  # Replace the near-zero latitude placeholder with 0
  X['latitude'] = X['latitude'].replace(-2e-08, 0)

  # Replace 0s with nulls so that they can be imputed later
  cols_with_zeros = ['longitude','latitude','construction_year',
                     'gps_height','population']
  for col in cols_with_zeros:
    X[col] = X[col].replace(0, np.nan)
  
  # quantity and quantity_group are duplicates, so drop one
  X = X.drop(columns='quantity_group')
  # recorded_by is the same in every row, so drop it
  X = X.drop(columns='recorded_by')
  
  # Convert date_recorded to a datetime data type
  X['date_recorded'] = pd.to_datetime(X['date_recorded'])
  
  # Add a year column using date_recorded
  X['year_inspection'] = X['date_recorded'].dt.year
  # Add an age_at_inspection column
  X['age_at_inspection'] = X['year_inspection'] - X['construction_year']

  # Keep the 10 most frequent funder and installer values; replace the rest with 'OTHER'.
  # The mask is built from X (not the global train) so that its index matches X;
  # using train here raised an IndexingError when wrangling val and test.
  # Note: top10 is computed from X itself, so train, val, and test may keep different values.
  top10 = X['funder'].value_counts()[:10].index
  X.loc[~X['funder'].isin(top10), 'funder'] = 'OTHER'
  top10 = X['installer'].value_counts()[:10].index
  X.loc[~X['installer'].isin(top10), 'installer'] = 'OTHER'

  # Return the wrangled dataframe
  return X

In [75]:
train = wrangle(train)
val = wrangle(val)
test = wrangle(test)

### Begin with baselines for classification.
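
One common baseline for classification is to always predict the most frequent class seen in training. The sketch below shows that idea on a few made-up labels; `majority_class_baseline` is a hypothetical helper, and with the real data you would pass `train['status_group']` and `val['status_group']`.

```python
import pandas as pd

def majority_class_baseline(y_train, y_val):
    """Return the most frequent training class and its accuracy on y_val."""
    majority = y_train.mode()[0]
    val_accuracy = (y_val == majority).mean()
    return majority, val_accuracy

# Toy labels for illustration only
y_train_toy = pd.Series(['functional', 'functional', 'non functional', 'functional'])
y_val_toy = pd.Series(['functional', 'non functional', 'functional', 'non functional'])

majority, baseline_accuracy = majority_class_baseline(y_train_toy, y_val_toy)
print(majority, baseline_accuracy)
```

Any model worth keeping should score above this number on the validation set.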

### Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
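
The following is a minimal sketch of such a pipeline, not a finished solution. It assumes scikit-learn's own `OrdinalEncoder` in place of the `category_encoders` package installed above, and it runs on a tiny made-up dataframe; the column choices and toy values are placeholders for your own feature selection.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data for illustration only (one categorical and one numeric feature)
X_train_toy = pd.DataFrame({
    'quantity': ['enough', 'insufficient', 'enough', 'insufficient', 'enough', 'enough'],
    'gps_height': [1138.0, np.nan, 682.0, 370.0, np.nan, 910.0],
})
y_train_toy = pd.Series(['functional', 'non functional', 'functional',
                         'non functional', 'functional', 'functional'])
X_val_toy = pd.DataFrame({
    'quantity': ['insufficient', 'enough', 'unknown'],  # 'unknown' is unseen in training
    'gps_height': [np.nan, 500.0, 700.0],
})
y_val_toy = pd.Series(['non functional', 'functional', 'functional'])

# Encode the categorical column as integers (unseen categories become -1)
# and fill missing numeric values with the training median
preprocess = ColumnTransformer([
    ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
     ['quantity']),
    ('num', SimpleImputer(strategy='median'), ['gps_height']),
])

pipeline = make_pipeline(preprocess, DecisionTreeClassifier(random_state=42))
pipeline.fit(X_train_toy, y_train_toy)

# Accuracy on the held-out validation rows
val_accuracy = pipeline.score(X_val_toy, y_val_toy)
print('Validation accuracy:', val_accuracy)
```

Keeping the encoder and imputer inside the pipeline means they are fit on the training rows only and then reused unchanged on validation and test data.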

### Get your validation accuracy score.

### Get and plot your feature importances.
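
A fitted scikit-learn decision tree exposes its importances through the `feature_importances_` attribute. The sketch below fits a small tree on made-up data and pairs each importance with its column name; the toy columns are assumptions for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data: 'signal' separates the classes perfectly, 'constant' carries no information
X_toy = pd.DataFrame({
    'signal': [0, 0, 0, 1, 1, 1],
    'constant': [5, 5, 5, 5, 5, 5],
})
y_toy = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)

# Pair each importance with its feature name and sort for plotting
importances = pd.Series(tree.feature_importances_, index=X_toy.columns).sort_values()
print(importances)

# In a notebook with matplotlib installed, a horizontal bar chart can be drawn with:
# importances.plot.barh()
```

When the tree is the last step of a pipeline built with `make_pipeline`, it can be retrieved with `pipeline.named_steps['decisiontreeclassifier']`.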

### Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue Submit Predictions button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
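
Here is a rough sketch of building the submission file with pandas. It assumes the expected format is an `id` column plus a `status_group` column; check `sample_submission` to confirm. The ids and predictions below are made up, and the CSV is written to an in-memory buffer so the example has no side effects.

```python
import io
import pandas as pd

# Toy stand-ins for the test ids and the model's predictions
test_toy = pd.DataFrame({'id': [101, 102, 103]})
y_pred_toy = ['functional', 'non functional', 'functional']

# Assumed submission layout: one row per test id with its predicted status_group
submission = pd.DataFrame({'id': test_toy['id'], 'status_group': y_pred_toy})

# Write without the dataframe index; pass a filename instead of the buffer to save to disk
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```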

### Commit your notebook to your fork of the GitHub repo.