<a href="https://colab.research.google.com/github/grzegorzkwolek/DS-Unit-2-Kaggle-Challenge/blob/master/module1/GKwolek_assignment_kaggle_challenge_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 1*

---

# Decision Trees

## Assignment
- [X] [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. Go to our Kaggle InClass competition website. You will be given the URL in Slack. Go to the Rules page. Accept the rules of the competition.
- [X] Do train/validate/test split with the Tanzania Waterpumps data.
- [X] Begin with baselines for classification.
- [ ] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your feature importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [what columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) What columns are duplicates, or nearly duplicates? Can you extract the year from date_recorded? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Make exploratory visualizations and share on Slack.


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)
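
As a rough sketch of the binning idea above, the snippet below applies `pd.qcut` to a small synthetic frame and computes the share of functional pumps per bin, which is the quantity a categorical estimate plot would draw. The column names mirror this dataset, but the values are made up and the choice of four bins is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training set (values are made up for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'gps_height': rng.integers(0, 2000, size=200),
    'functional': rng.integers(0, 2, size=200),
})

# Split the numeric feature into 4 quantile bins of roughly equal size
demo['gps_height_bin'] = pd.qcut(demo['gps_height'], q=4)

# Mean of the 0/1 target per bin = share of functional pumps in that bin
functional_rate = demo.groupby('gps_height_bin', observed=True)['functional'].mean()
print(functional_rate)
```

The resulting series has one row per bin, so it stays readable even when the raw feature has thousands of unique values.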

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this classification problem, you may want to use the parameter `logistic=True`, but it can be slow.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.

#### High-cardinality categoricals

This code from a previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```
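
The same idea can be wrapped in a small function so that every split is treated identically. This is only a sketch: `keep_top_values` is a hypothetical helper name, and the toy series stand in for a real column. The top values are taken from the training column only and then applied to the other column.

```python
import pandas as pd

def keep_top_values(train_col, other_col, top_n=10, other_label='OTHER'):
    """Hypothetical helper: keep the top_n most frequent training values
    and replace everything else with other_label in both columns."""
    top = train_col.value_counts().index[:top_n]
    train_out = train_col.where(train_col.isin(top), other_label)
    other_out = other_col.where(other_col.isin(top), other_label)
    return train_out, other_out

# Toy data (made up): 'a' and 'b' are the two most frequent training values
train_demo = pd.Series(['a', 'a', 'a', 'b', 'b', 'c', 'd'])
test_demo = pd.Series(['a', 'c', 'e'])

train_out, test_out = keep_top_values(train_demo, test_demo, top_n=2)
print(train_out.tolist())  # ['a', 'a', 'a', 'b', 'b', 'OTHER', 'OTHER']
print(test_out.tolist())   # ['a', 'OTHER', 'OTHER']
```

Values that appear only in the test column (such as `'e'` here) also end up as `'OTHER'`, so the encoder never sees a category it was not fitted on.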


In [0]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'



In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [0]:
from sklearn.model_selection import train_test_split
# Hold out 25% of the labeled rows (the train_test_split default) for validation
train_df, val_df = train_test_split(train, random_state=7)

display(train.shape, train_df.shape, val_df.shape, test.shape)

(59400, 41)

(44550, 41)

(14850, 41)

(14358, 40)

In [0]:
train_df['status_group'].value_counts(normalize=True)

functional                 0.544422
non functional             0.383120
functional needs repair    0.072458
Name: status_group, dtype: float64

In [0]:
val_df['status_group'].value_counts(normalize=True)

functional                 0.539057
non functional             0.387609
functional needs repair    0.073333
Name: status_group, dtype: float64
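
The value counts above give the majority-class baseline directly: always predicting `functional` would be right for about 54% of the rows in either split. A minimal sketch of that calculation follows, using made-up toy labels in place of `y_train` and `y_val`.

```python
import pandas as pd

# Toy labels (made up); in the notebook these would be the train and validation targets
y_train_demo = pd.Series(
    ['functional'] * 6 + ['non functional'] * 3 + ['functional needs repair']
)
y_val_demo = pd.Series(['functional'] * 2 + ['non functional'] * 2)

# The baseline always predicts the most frequent class seen in training
majority_class = y_train_demo.mode()[0]

# Accuracy of that constant prediction on the validation labels
baseline_accuracy = (y_val_demo == majority_class).mean()
print(majority_class, baseline_accuracy)  # functional 0.5
```

Any model worth submitting should beat this number on the validation set.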

In [0]:
train["extraction_type"].unique()

array(['gravity', 'submersible', 'swn 80', 'nira/tanira', 'india mark ii',
       'other', 'ksb', 'mono', 'windmill', 'afridev', 'other - rope pump',
       'india mark iii', 'other - swn 81', 'other - play pump', 'cemo',
       'climax', 'walimi', 'other - mkulima/shinyanga'], dtype=object)

In [0]:
train["extraction_type_group"].unique()

array(['gravity', 'submersible', 'swn 80', 'nira/tanira', 'india mark ii',
       'other', 'mono', 'wind-powered', 'afridev', 'rope pump',
       'india mark iii', 'other handpump', 'other motorpump'],
      dtype=object)

In [0]:
import numpy as np

def wrangle(X):
    X = X.copy()
    
    # About 3% of the time, latitude has small values near zero,
    # outside Tanzania, so we'll treat these values like zero.
    X['latitude'] = X['latitude'].replace(-2e-08, 0)
    
    # When columns have zeros and shouldn't, they are like null values.
    # So we will replace the zeros with nulls, and impute missing values later.
    cols_with_zeros = ['longitude', 'latitude']
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)
            
    # quantity & quantity_group are duplicates, and so are payment & payment_type,
    # so drop one column from each pair.
    # extraction_type and extraction_type_group are kept as is for now
    # (the _group column is a candidate for dropping); same for source and source_type.
    X = X.drop(columns=['quantity_group', 'payment_type'])
    
    # Parse the recording date and extract the year as a new feature
    X['date_recorded'] = pd.to_datetime(X['date_recorded'])
    X['year_recorded'] = X['date_recorded'].dt.year
    
    # return the wrangled dataframe
    return X

In [0]:
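# wrangle() returns a wrangled copy and leaves its input untouched.
# The results are not assigned here, so train, test, train_df and val_df
# keep their original columns in the cells below; only the last call is displayed.
wrangle(test)
wrangle(train)
wrangle(train_df)
wrangle(val_df)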

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,water_quality,quality_group,quantity,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group,year_recorded
41624,39618,0.0,2012-10-12,Africare,0,Africare,32.791519,-5.974270,Shule Ya Msingi Utimule,0,Lake Tanganyika,Mabangwe,Tabora,14,5,Sikonge,Ipole,0,True,GeoData Consultants Ltd,VWC,,True,0,india mark ii,india mark ii,handpump,vwc,user-group,never pay,soft,good,dry,shallow well,shallow well,groundwater,hand pump,hand pump,non functional,2012
35694,14077,1000.0,2013-02-21,Danida,980,DANIDA,35.948202,-10.632045,Kwa Shuleni,0,Rufiji,Misufini B,Ruvuma,10,5,Namtumbo,Msindo,750,True,GeoData Consultants Ltd,VWC,Ngwinde water gravity scheme,True,1991,gravity,gravity,gravity,vwc,user-group,pay annually,soft,good,seasonal,river,river/lake,surface,communal standpipe,communal standpipe,functional,2013
46007,60781,3000.0,2013-01-22,Government Of Tanzania,1170,DWE,30.264181,-4.612664,Kwa Paulo,0,Lake Tanganyika,Nyenge Kati,Kigoma,16,2,Kasulu,Titye,230,True,GeoData Consultants Ltd,Water authority,Mbagwe,True,1978,gravity,gravity,gravity,vwc,user-group,pay when scheme fails,soft,good,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional,2013
18292,58821,0.0,2011-07-10,Hesawa,0,HESAWA,31.837348,-2.620175,Kahuhwa,0,Lake Victoria,Kahuhwa,Kagera,18,8,Chato,Chato,0,True,GeoData Consultants Ltd,VWC,,True,0,afridev,afridev,handpump,vwc,user-group,never pay,salty,salty,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional,2011
45030,43684,0.0,2011-03-20,Government Of Tanzania,1174,RWE,38.429973,-5.004257,Shuleni,0,Pangani,Madukani,Tanga,4,2,Korogwe,Dindira,750,True,GeoData Consultants Ltd,VWC,Sakale water supply,False,1971,gravity,gravity,gravity,vwc,user-group,never pay,soft,good,enough,spring,spring,groundwater,communal standpipe multiple,communal standpipe,functional,2011
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20029,55996,20.0,2011-03-09,Wua,229,WU,38.344302,-6.407104,Funua,0,Wami / Ruvu,Ofisi Ya Kijiji,Pwani,6,1,Bagamoyo,Lugoba,40,True,GeoData Consultants Ltd,WUA,Chalinze wate,True,2003,ksb,submersible,submersible,wua,user-group,pay per bucket,soft,good,enough,river,river/lake,surface,communal standpipe,communal standpipe,functional,2011
26726,71301,0.0,2011-07-18,H,0,H,33.076467,-2.388994,Kwa Nugwa Badegeleki,0,Lake Victoria,1,Mwanza,19,8,Ilemela,Pasiansi,0,True,GeoData Consultants Ltd,VWC,,True,0,swn 80,swn 80,handpump,vwc,user-group,pay monthly,soft,good,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,functional,2011
12124,20994,0.0,2011-07-28,Danida,0,Central government,33.909647,-9.489943,Lupaso-Shuleni,0,Lake Nyasa,Lupaso,Mbeya,12,3,Kyela,Ipinda,0,True,GeoData Consultants Ltd,VWC,Ngamanga water supplied sch,True,0,gravity,gravity,gravity,vwc,user-group,pay monthly,soft,good,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional,2011
32494,52472,200.0,2013-01-18,Government Of Tanzania,640,District Council,39.270447,-10.145717,Guest House,0,Ruvuma / Southern Coast,Mnara,Lindi,80,23,Lindi Rural,Mnara,1,False,GeoData Consultants Ltd,WUA,Rondo Water Supply,False,2012,ksb,submersible,submersible,wua,user-group,pay per bucket,soft,good,insufficient,dam,dam,surface,communal standpipe multiple,communal standpipe,non functional,2013
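
Because `wrangle` starts with `X.copy()` and returns the copy, its changes persist only if the return value is assigned. The sketch below shows the difference with a hypothetical cut-down function, `wrangle_demo`, and a two-row toy frame; neither is part of the original notebook.

```python
import numpy as np
import pandas as pd

def wrangle_demo(X):
    """Hypothetical, cut-down stand-in for wrangle(): works on a copy."""
    X = X.copy()
    X['latitude'] = X['latitude'].replace(0, np.nan)
    return X

raw = pd.DataFrame({'latitude': [0.0, -5.97]})

wrangle_demo(raw)                      # return value discarded
nulls_without_assignment = raw['latitude'].isna().sum()

raw = wrangle_demo(raw)                # return value assigned back
nulls_with_assignment = raw['latitude'].isna().sum()

print(nulls_without_assignment, nulls_with_assignment)  # 0 1
```

In this notebook the equivalent pattern would be `train_df = wrangle(train_df)` and so on for each split, done before the feature lists are built so that they reflect the wrangled columns.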


In [0]:
target = "status_group"

train_features = train.drop(columns=[target, 'id'])

# numeric features
numeric_features = train_features.select_dtypes(include='number').columns.tolist()

# series with the cardinality of the nonnumeric features
cardinality = train_features.select_dtypes(exclude='number').nunique()

# Get a list of all categorical features with cardinality <= 50
categorical_features = cardinality[cardinality <= 50].index.tolist()

# Combine the lists 
features = numeric_features + categorical_features
print(len(features))
print(features)

31
['amount_tsh', 'gps_height', 'longitude', 'latitude', 'num_private', 'region_code', 'district_code', 'population', 'construction_year', 'basin', 'region', 'public_meeting', 'recorded_by', 'scheme_management', 'permit', 'extraction_type', 'extraction_type_group', 'extraction_type_class', 'management', 'management_group', 'payment', 'payment_type', 'water_quality', 'quality_group', 'quantity', 'quantity_group', 'source', 'source_type', 'source_class', 'waterpoint_type', 'waterpoint_type_group']


In [0]:
categorical_features

['basin',
 'region',
 'public_meeting',
 'recorded_by',
 'scheme_management',
 'permit',
 'extraction_type',
 'extraction_type_group',
 'extraction_type_class',
 'management',
 'management_group',
 'payment',
 'payment_type',
 'water_quality',
 'quality_group',
 'quantity',
 'quantity_group',
 'source',
 'source_type',
 'source_class',
 'waterpoint_type',
 'waterpoint_type_group']

In [0]:
X_train = train_df[features]
y_train = train_df[target]
X_val = val_df[features]
y_val = val_df[target]
X_test = test[features]

In [0]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    DecisionTreeClassifier(max_depth = 15)
)

# Fit on train
pipeline.fit(X_train, y_train)

# Score on Train/Val
print('Training Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Predict on Test Data
y_pred = pipeline.predict(X_test)

Training Accuracy 0.8290684624017958
Validation Accuracy 0.7632996632996633
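
The assignment also asks for feature importances. A fitted tree exposes them through `feature_importances_`, and `make_pipeline` names each step after its lowercased class name, so the tree can be pulled out of the pipeline by that name. The sketch below uses synthetic numeric data rather than the waterpump features, and it leaves out the encoder and scaler to stay short.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic numeric data (made up): the label depends only on 'signal'
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'signal': rng.normal(size=300),
    'noise': rng.normal(size=300),
})
y_demo = (X_demo['signal'] > 0).astype(int)

demo_pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    DecisionTreeClassifier(max_depth=3, random_state=0),
)
demo_pipeline.fit(X_demo, y_demo)

# Pull the fitted tree out of the pipeline by its auto-generated step name
tree = demo_pipeline.named_steps['decisiontreeclassifier']

# One importance per input column; the values sum to 1
importances = pd.Series(tree.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

With the one-hot encoded pipeline in this notebook, the index would have to be the encoded column names rather than the raw feature names. Calling `importances.sort_values().plot.barh()` would draw the chart, assuming matplotlib is installed.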


In [0]:
y_pred

array(['non functional', 'functional', 'functional', ..., 'functional',
       'functional', 'non functional'], dtype=object)

In [0]:
status_group = pd.DataFrame(data=y_pred)
status_group

Unnamed: 0,0
0,non functional
1,functional
2,functional
3,non functional
4,functional
...,...
14353,non functional
14354,functional
14355,functional
14356,functional


In [0]:
# Pair each test id with its prediction (a copy, so test itself is not modified)
submission_id = test[['id']].copy()
submission_id["status_group"] = y_pred

In [0]:
submission_id

Unnamed: 0,id,status_group
0,50785,non functional
1,51630,functional
2,17168,functional
3,45559,non functional
4,49871,functional
...,...,...
14353,39307,non functional
14354,18990,functional
14355,28749,functional
14356,33492,functional


In [0]:
submission_id.set_index("id")

Unnamed: 0_level_0,status_group
id,Unnamed: 1_level_1
50785,non functional
51630,functional
17168,functional
45559,non functional
49871,functional
...,...
39307,non functional
18990,functional
28749,functional
33492,functional


In [0]:
# index=False keeps the row index out of the file, leaving only id and status_group
submission_id.to_csv('kaggle_03.csv', index=False)

In [0]:
# from google.colab import files
# files.download('kaggle_03.csv')
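
Before uploading, it can help to confirm that the written file matches the layout of the sample submission. The sketch below assumes the sample submission has exactly the columns `id` and `status_group` (its contents are not shown in this notebook) and uses two-row toy frames and an in-memory buffer instead of real files.

```python
import io
import pandas as pd

# Toy frames (made up) standing in for sample_submission and submission_id
sample_demo = pd.DataFrame({
    'id': [50785, 51630],
    'status_group': ['functional', 'functional'],
})
submission_demo = pd.DataFrame({
    'id': [50785, 51630],
    'status_group': ['non functional', 'functional'],
})

# Write without the row index, then read back what a grader would see
buffer = io.StringIO()
submission_demo.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# Same columns in the same order, and the same ids row by row
columns_match = list(round_trip.columns) == list(sample_demo.columns)
ids_match = round_trip['id'].equals(sample_demo['id'])
print(columns_match, ids_match)  # True True
```

If the file is written with the default `index=True`, the round trip gains an extra unnamed first column and the column check fails.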

In [0]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    # "sqrt" is what "auto" meant for classifiers; "auto" was removed in scikit-learn 1.3
    RandomForestClassifier(max_depth=20, n_estimators=225, max_features="sqrt")
)

# Fit on train
pipeline.fit(X_train, y_train)

# Score on Train/Val
print('Training Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Predict on Test Data
y_pred = pipeline.predict(X_test)

Training Accuracy 0.9096071829405162
Validation Accuracy 0.8008754208754209


In [0]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    # "sqrt" is what "auto" meant for classifiers; "auto" was removed in scikit-learn 1.3
    RandomForestClassifier(max_depth=22, n_estimators=225, max_features="sqrt")
)

# Fit on train
pipeline.fit(X_train, y_train)

# Score on Train/Val
print('Training Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Predict on Test Data
y_pred = pipeline.predict(X_test)

Training Accuracy 0.9363187429854096
Validation Accuracy 0.804040404040404
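
The two random forest cells above differ only in `max_depth` (20, then 22): training accuracy rises from about 0.91 to 0.94 while validation accuracy moves from about 0.801 to 0.804. A loop makes that kind of comparison less manual. The sketch below runs on synthetic data from `make_classification`, and the depth values and `n_estimators=50` are arbitrary choices for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the waterpump features (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=7)

# Fit one forest per depth and record (train accuracy, validation accuracy)
scores = {}
for depth in [2, 5, 10, 20]:
    model = RandomForestClassifier(max_depth=depth, n_estimators=50, random_state=0)
    model.fit(X_tr, y_tr)
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_va, y_va))
    print(depth, scores[depth])
```

A widening gap between the two numbers as depth grows is the usual sign of overfitting, so the depth with the best validation score is the one to prefer. Fixing `random_state` keeps the runs comparable.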


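In [0]:
import sys

# Build the submission: one row per test id with its predicted status_group
submission_id = test[['id']].copy()
submission_id["status_group"] = y_pred

# index=False keeps the row index out of the file, leaving only id and status_group
submission_id.to_csv('kaggle_06.csv', index=False)

# Download the file when running on Colab
if 'google.colab' in sys.modules:
    from google.colab import files
    files.download('kaggle_06.csv')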