<a href="https://colab.research.google.com/github/claudiasofiaC/DS-Unit-2-Kaggle-Challenge/blob/master/assignment_kaggle_challenge_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 1*

---

# Decision Trees

## Assignment
- [ ] [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. Go to our Kaggle InClass competition website. You will be given the URL in Slack. Go to the Rules page. Accept the rules of the competition.
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Begin with baselines for classification.
- [ ] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your feature importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [what columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) What columns are duplicates, or nearly duplicates? Can you extract the year from date_recorded? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Make exploratory visualizations and share on Slack.
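
The date-based feature ideas in the wrangling stretch goal could be sketched as follows. This is only an illustration on a tiny made-up frame: the helper name `add_date_features` and the new column names (`year_recorded`, `month_recorded`, `years_in_service`) are assumptions, not part of the assignment. It also assumes `construction_year` uses 0 to mean "unknown".

```python
import numpy as np
import pandas as pd

def add_date_features(X):
    """Return a copy of X with recorded year/month and pump age columns."""
    X = X.copy()
    # Parse the date strings once, then pull out the parts we want
    recorded = pd.to_datetime(X['date_recorded'])
    X['year_recorded'] = recorded.dt.year
    X['month_recorded'] = recorded.dt.month
    # Assumption: a construction_year of 0 means "unknown", so treat it as missing
    construction_year = X['construction_year'].replace(0, np.nan)
    # Years between construction and inspection (NaN when construction is unknown)
    X['years_in_service'] = X['year_recorded'] - construction_year
    return X

# Made-up rows, for illustration only
demo = pd.DataFrame({
    'date_recorded': ['2011-03-14', '2013-02-04', '2012-10-01'],
    'construction_year': [1999, 0, 2010],
})
demo = add_date_features(demo)
print(demo[['year_recorded', 'years_in_service']])
```

Because the helper takes and returns a DataFrame, the same call can be applied to the train, validation, and test sets so all three are transformed identically.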


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)
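
As a rough sketch of the binning idea for numeric features, the example below uses `pd.qcut` to split a numeric column into quartiles and then computes the share of functional pumps in each bin. The data is made up for illustration; the `functional` column follows the 0/1 target encoding suggested above.

```python
import pandas as pd

# Made-up data, for illustration only
demo = pd.DataFrame({
    'gps_height': [0, 15, 120, 300, 640, 900, 1200, 1500],
    'functional': [0, 0, 1, 0, 1, 1, 1, 1],
})

# Split the numeric feature into 4 bins with roughly equal row counts
demo['gps_height_bin'] = pd.qcut(demo['gps_height'], q=4)

# Share of functional pumps within each bin
rate_by_bin = demo.groupby('gps_height_bin', observed=True)['functional'].mean()
print(rate_by_bin)
```

The binned column can then be treated like any other low-cardinality categorical feature when plotting.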

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this classification problem, you may want to use the parameter `logistic=True`, but it can be slow.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.

#### High-cardinality categoricals

This code from a previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```


In [1]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

Collecting category_encoders==2.*
  Downloading https://files.pythonhosted.org/packages/a0/52/c54191ad3782de633ea3d6ee3bb2837bda0cf3bc97644bb6375cf14150a0/category_encoders-2.1.0-py2.py3-none-any.whl (100kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.1.0


In [0]:
# imports
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import plotly.express as px
import pandas_profiling
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
import category_encoders as ce
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer

In [0]:
pd.options.display.max_rows = 20


In [0]:
# Load the training features and labels, the test features, and the sample submission

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

In [22]:
train.shape, test.shape

((59400, 41), (14358, 40))

In [24]:
train.isnull().sum()

id                          0
amount_tsh                  0
date_recorded               0
funder                   2903
gps_height                  0
                         ... 
source_type                 0
source_class                0
waterpoint_type             0
waterpoint_type_group       0
status_group                0
Length: 41, dtype: int64

In [23]:
train, val = train_test_split(train, train_size=0.80, test_size=0.20, 
                              stratify=train['status_group'], random_state=92)

train.shape, val.shape, test.shape

((47520, 41), (11880, 41), (14358, 40))
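
The assignment asks for a classification baseline before modeling. One common choice is the majority-class baseline: always predict the most frequent class in the training labels and see how accurate that is. A minimal sketch on made-up labels follows; the variable names ending in `_demo` are placeholders, and in this notebook the same idea would apply to `train['status_group']` and `val['status_group']`.

```python
import pandas as pd

# Made-up labels, for illustration only
y_train_demo = pd.Series(['functional', 'non functional', 'functional',
                          'functional needs repair', 'functional'])
y_val_demo = pd.Series(['functional', 'non functional',
                        'non functional', 'functional'])

# The most frequent class in the training labels is the baseline prediction
majority_class = y_train_demo.mode()[0]

# Accuracy of always predicting the majority class
baseline_train_acc = (y_train_demo == majority_class).mean()
baseline_val_acc = (y_val_demo == majority_class).mean()
print(majority_class, baseline_train_acc, baseline_val_acc)
```

Any model worth keeping should beat this number on the validation set.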

In [0]:
pandas_profiling.ProfileReport(train)

In [0]:
# do some exploring

for column in train:
    print(column)
    print(f'{column} has {train[column].isnull().sum()} null values.')
    print(f'There are {train[column].nunique()} possible values.')
    print(f'This column is a(n) {train[column].dtype}.')
    print()

In [0]:
# Function to wrangle the data (In the same style as Ryan Herr demonstrated)
def wrangle(X):
    X = X.copy()
    
    # Treat the near-zero latitude value -2e-08 as 0 so it becomes NaN below
    X['latitude'] = X['latitude'].replace(-2e-08, 0)

    # Replace 0 with NaN in columns where 0 stands in for a missing value
    fill_zeros = ['longitude', 'latitude', 'construction_year',
                  'population', 'amount_tsh']
    for column in fill_zeros:
        X[column] = X[column].replace(0, np.nan)

    # Drop duplicate or near-duplicate columns
    X = X.drop(columns=['quality_group', 'source_type', 'source_class', 'quantity_group', 'payment_type', 
                        'extraction_type', 'extraction_type_group', 'waterpoint_type'])

    # Identify numeric and categorical columns
    numbers = X.select_dtypes('number').columns
    categorical_features = X.select_dtypes('object').columns

    # Reduce the cardinality of the categorical columns
    for column in categorical_features:
        # Columns that contain null values
        if X[column].isnull().sum() > 0:
            # More than 10 unique values: keep the 10 most frequent and
            # replace everything else, including nulls, with 'Other'
            if X[column].nunique() > 10:
                frequent = X[column].value_counts()[:10].index
                X.loc[~X[column].isin(frequent), column] = 'Other'
            # 10 or fewer unique values: keep every observed value.
            # Note: unique() includes NaN, so this mask matches nothing
            # and the nulls in these columns stay as NaN.
            else:
                possible_values = X[column].unique()
                X.loc[~X[column].isin(possible_values), column] = 'Other'
        # More than 35 unique values: keep the 10 most frequent,
        # replace everything else with 'Other'
        if X[column].nunique() > 35:
            frequent = X[column].value_counts()[:10].index
            X.loc[~X[column].isin(frequent), column] = 'Other'

    return X

In [0]:
# Apply the wrangle function to every split so they are processed the same way

train = wrangle(train)

valid = wrangle(val)

test = wrangle(test)
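
One thing to watch for: `wrangle` picks the most frequent categories from whichever frame it receives, so the training and validation sets can end up keeping different category values. A possible alternative, sketched below on made-up data, is to learn the top categories from the training set only and reuse them for every other split. The helper names `fit_top_categories` and `apply_top_categories` are hypothetical and not part of the notebook.

```python
import pandas as pd

def fit_top_categories(train_col, n=10):
    """Return the n most frequent non-null values in a training column."""
    return train_col.value_counts().index[:n]

def apply_top_categories(col, top):
    """Keep values found in `top`; replace everything else (including nulls) with 'Other'."""
    return col.where(col.isin(top), 'Other')

# Made-up data, for illustration only
train_demo = pd.DataFrame({'funder': ['A', 'A', 'A', 'B', 'B', 'C', None]})
val_demo = pd.DataFrame({'funder': ['A', 'C', 'D', None]})

# Learn the categories from the training set, then reuse them everywhere
top = fit_top_categories(train_demo['funder'], n=2)
train_demo['funder'] = apply_top_categories(train_demo['funder'], top)
val_demo['funder'] = apply_top_categories(val_demo['funder'], top)
print(val_demo['funder'].tolist())
```

This mirrors the `NEIGHBORHOOD` example earlier in the notebook, where `top10` is computed from `train` and then applied to both `train` and `test`.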

In [29]:
train.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type_class,management,management_group,payment,water_quality,quantity,source,waterpoint_type_group,status_group
22586,11611,,Other,Government Of Tanzania,0,RWE,30.942205,-1.029911,none,0,Lake Victoria,Other,Kagera,18,7,Other,Other,,True,GeoData Consultants Ltd,Other,Other,True,,submersible,vwc,user-group,never pay,soft,insufficient,spring,communal standpipe,non functional
47046,10605,,Other,Other,0,DWE,31.479394,-1.37595,Other,0,Lake Victoria,Other,Kagera,18,2,Other,Other,,True,GeoData Consultants Ltd,VWC,Other,True,,handpump,vwc,user-group,never pay,soft,enough,shallow well,hand pump,functional
2808,64307,500.0,Other,Unicef,1506,DWE,34.648047,-8.972701,Other,0,Rufiji,Other,Iringa,11,4,Njombe,Other,150.0,True,GeoData Consultants Ltd,WUA,wanging'ombe water supply s,True,1984.0,gravity,wua,user-group,pay monthly,soft,enough,river,communal standpipe,functional needs repair
3758,64534,,Other,Other,1375,Other,34.264181,-2.939733,Other,0,Lake Victoria,Other,Shinyanga,17,1,Bariadi,Other,500.0,True,GeoData Consultants Ltd,WUG,Other,False,2007.0,handpump,wug,user-group,never pay,soft,enough,shallow well,hand pump,functional
57984,1888,,Other,Kkkt,1987,KKKT,36.210766,-2.935612,Other,0,Internal,Other,Arusha,2,6,Other,Other,250.0,True,GeoData Consultants Ltd,Other,Other,,2001.0,gravity,vwc,user-group,pay when scheme fails,soft,insufficient,river,communal standpipe,functional


In [0]:
# set up features and target

target = 'status_group'
features = train.drop(columns=['id', target]).columns

X_train = train[features]
X_val = valid[features]
y_train = train[target]
y_val = valid[target]
X_test = test[features]

In [32]:
X_train.isnull().sum()

amount_tsh               33374
date_recorded                0
funder                       0
gps_height                   0
installer                    0
                         ...  
payment                      0
water_quality                0
quantity                     0
source                       0
waterpoint_type_group        0
Length: 31, dtype: int64

In [33]:
y_train

22586             non functional
47046                 functional
2808     functional needs repair
3758                  functional
57984                 functional
                  ...           
30486                 functional
41313                 functional
6853              non functional
8604                  functional
32418             non functional
Name: status_group, Length: 47520, dtype: object

In [34]:
pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    DecisionTreeClassifier(random_state=42, max_depth=20, min_samples_split=25)
)

# Fit pipeline with training data
pipeline.fit(X_train, y_train)

# Score training and validation data
print(f'Training score is {pipeline.score(X_train, y_train)}.')
print(f'Validation score is {pipeline.score(X_val, y_val)}.')


Training score is 0.8452020202020202.
Validation score is 0.7734006734006734.
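
The assignment also asks for feature importances. A minimal sketch of pulling them out of a fitted pipeline is shown below. It uses made-up numeric data and a smaller pipeline than the one above (no encoder), so the names `X_demo`, `y_demo`, and `demo_pipeline` are illustrative assumptions. `make_pipeline` names each step after its lowercased class name, which is how the tree is looked up here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up data: the target depends only on the 'signal' column
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'signal': rng.normal(size=200),
    'noise': rng.normal(size=200),
})
y_demo = (X_demo['signal'] > 0).astype(int)

demo_pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    DecisionTreeClassifier(random_state=42, max_depth=3),
)
demo_pipeline.fit(X_demo, y_demo)

# Look up the fitted tree by its auto-generated step name
tree = demo_pipeline.named_steps['decisiontreeclassifier']

# Pair each importance with its column name and sort for plotting
importances = pd.Series(tree.feature_importances_,
                        index=X_demo.columns).sort_values()
print(importances)
# importances.plot.barh() would draw a horizontal bar chart (requires matplotlib)
```

With a one-hot encoder in the pipeline, as in the cell above, the importances line up with the encoded columns rather than the original ones, so the index would need to come from the encoder's output instead of `X_train.columns`.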
