<a href="https://colab.research.google.com/github/medinadiegoeverardo/DS-Unit-2-Kaggle-Challenge/blob/master/module1/medinadiego_1_assignment_kaggle_challenge_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 1

## Assignment
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [what other columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) What other columns are duplicates, or nearly duplicates? Can you extract the year from date_recorded? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [ ] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your feature importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Make exploratory visualizations and share on Slack.


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)
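
As a rough sketch of the numeric-feature case, the snippet below bins a column with `pd.qcut` and computes the share of functional pumps per bin, which is the quantity a categorical estimate plot would draw. The `toy` frame and its values are made up for illustration; with the real data you would work on `train` instead.

```python
import pandas as pd

# Toy stand-in for the training set; the values are invented for illustration.
toy = pd.DataFrame({
    'gps_height': [0, 10, 250, 400, 800, 1200, 1500, 2000],
    'status_group': ['non functional', 'functional', 'non functional', 'functional',
                     'functional', 'functional', 'non functional', 'functional'],
})

# Binary target: 1 if the pump is functional, 0 otherwise.
toy['functional'] = (toy['status_group'] == 'functional').astype(int)

# Split the numeric feature into 4 bins holding equal numbers of rows.
toy['gps_height_bin'] = pd.qcut(toy['gps_height'], q=4)

# Share of functional pumps in each bin.
rate_by_bin = toy.groupby('gps_height_bin', observed=True)['functional'].mean()
print(rate_by_bin)
```

The binned column can then be passed to a Seaborn categorical plot in place of the raw numeric feature.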

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this classification problem, you may want to use the parameter `logistic=True`, but it can be slow.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.

#### High-cardinality categoricals

This code from a previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```
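
The same idea can be wrapped in a helper so the top categories are learned from the training set once and then applied to any other split. This is only a sketch: `keep_top_categories` and its parameters are hypothetical names, and the toy series stand in for a real column.

```python
import pandas as pd

def keep_top_categories(train_col, other_cols, n=10, other_label='OTHER'):
    """Replace values outside the training set's top n categories.

    Returns new Series and leaves the inputs unchanged. Missing values
    are not in the top list, so they are replaced by other_label as well.
    """
    top = train_col.value_counts().index[:n]
    new_train = train_col.where(train_col.isin(top), other_label)
    new_others = [col.where(col.isin(top), other_label) for col in other_cols]
    return new_train, new_others

# Toy data: 'a' and 'b' are the two most frequent training values.
toy_train = pd.Series(['a', 'a', 'a', 'b', 'b', 'c'])
toy_test = pd.Series(['a', 'c', 'd'])

new_train, (new_test,) = keep_top_categories(toy_train, [toy_test], n=2)
print(new_train.tolist())
print(new_test.tolist())
```

Learning the list from the training data only keeps the validation and test sets from influencing which categories survive.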


In [49]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'



In [137]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [0]:
train, validation = train_test_split(train, random_state=10)

In [52]:
print(train.shape)
print(validation.shape)

(44550, 41)
(14850, 41)
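
`train_test_split` holds out 25% of the rows by default, which is where the 44550 / 14850 split comes from. A possible variation, shown here on made-up data, is to set `test_size` explicitly and pass `stratify` so both parts keep roughly the same class mix. The names `X` and `y` and the 80/20 ratio are illustrative assumptions, not what this notebook ran.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 16 rows of class 'a' and 4 rows of class 'b'.
X = pd.DataFrame({'x': range(20)})
y = ['a'] * 16 + ['b'] * 4

# Hold out 20% of the rows and keep the a/b ratio the same in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=10
)

print(len(X_train), len(X_val))
print(y_train.count('b'), y_val.count('b'))
```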


In [53]:
train.tail()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
28017,45759,1500.0,2013-02-12,Government Of Tanzania,723,District council,37.556652,-3.544106,Kwa Baba Colimba,0,Pangani,Majengo,Kilimanjaro,3,2,Mwanga,Kileo,300,True,GeoData Consultants Ltd,WUA,Kifaru water Supply,True,2009,submersible,submersible,submersible,wua,user-group,pay monthly,monthly,soft,good,insufficient,insufficient,machine dbh,borehole,groundwater,communal standpipe,communal standpipe,functional
50496,17747,0.0,2013-02-17,Government Of Tanzania,1750,Government,37.563315,-3.249813,Kwa Kara,0,Pangani,Komola,Kilimanjaro,3,4,Rombo,Keni Mengeni,1,True,GeoData Consultants Ltd,WUA,Marera-Lole pipeline,True,1970,gravity,gravity,gravity,wua,user-group,pay monthly,monthly,soft,good,enough,enough,spring,spring,groundwater,communal standpipe multiple,communal standpipe,functional needs repair
29199,18177,0.0,2011-07-16,Concern,0,TWESA,30.475867,-2.589907,Nyalukingie,0,Lake Victoria,Mgweli,Kagera,18,30,Ngara,Kanazi,0,True,GeoData Consultants Ltd,VWC,,False,0,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,enough,enough,spring,spring,groundwater,other,other,functional
40061,45636,0.0,2013-03-18,Wfp/usaid,1825,Active MKM,35.578253,-2.246388,Shuleni,0,Internal,Maloon,Arusha,2,5,Ngorongoro,Arash,179,True,GeoData Consultants Ltd,Parastatal,,False,2011,gravity,gravity,gravity,parastatal,parastatal,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
17673,71037,0.0,2011-04-02,,0,,33.876041,-9.273638,Kalengo,0,Lake Nyasa,Kalengo,Mbeya,12,4,Rungwe,Lufilyo,0,True,GeoData Consultants Ltd,Parastatal,K,,0,gravity,gravity,gravity,parastatal,parastatal,unknown,unknown,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional


### Zeros to nulls for the imputer

In [0]:
col = train.columns
# list(col)

In [111]:
train.source_type.isnull().any()

False

In [0]:
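import numpy as np

def replacing_nulls_with_nulls(df):
  """Replace 0 with NaN in every column that already contains nulls.

  Modifies df in place and returns the names of the columns it changed.
  Columns without any nulls (all the numeric ones here) are left untouched.
  Caution: in the boolean columns (public_meeting, permit) False compares
  equal to 0, so it appears to be replaced with NaN as well.
  """
  those_null = []
  for col in df.columns:
    if not df[col].isnull().any():
      continue

    df[col] = df[col].replace(0, np.nan)
    those_null.append(col)
  return those_null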

In [141]:
replacing_nulls_with_nulls(train)

['funder',
 'installer',
 'subvillage',
 'public_meeting',
 'scheme_management',
 'scheme_name',
 'permit']

In [143]:
replacing_nulls_with_nulls(validation)

['funder',
 'installer',
 'subvillage',
 'public_meeting',
 'scheme_management',
 'scheme_name',
 'permit']
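
The lists above contain only columns that already had nulls, so numeric columns such as `construction_year` keep their zeros. A minimal sketch of converting zeros in chosen numeric columns is shown below. The column list `ZERO_AS_MISSING` and the helper `zeros_to_nan` are hypothetical, and treating 0 as "not recorded" in these columns is an assumption you would want to check against the data.

```python
import numpy as np
import pandas as pd

# Hypothetical list: columns where 0 is assumed to mean "not recorded".
ZERO_AS_MISSING = ['construction_year', 'longitude', 'population']

def zeros_to_nan(df, columns=ZERO_AS_MISSING):
    """Return a copy of df with 0 replaced by NaN in the given columns."""
    df = df.copy()
    for col in columns:
        if col in df.columns:
            df[col] = df[col].replace(0, np.nan)
    return df

# Toy rows standing in for the real data.
toy = pd.DataFrame({
    'construction_year': [2009, 0, 1970],
    'longitude': [37.5, 0.0, 33.8],
    'population': [300, 0, 1],
    'amount_tsh': [1500.0, 0.0, 0.0],  # not in the list, so left unchanged
})

cleaned = zeros_to_nan(toy)
print(cleaned)
```

Because the function returns a copy, the same call can be applied to the train, validation, and test frames without modifying the originals.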

In [0]:
# Encode the target labels as integers (mapped back to strings before submission)
mode_map = {'functional': 0, 'non functional': 1, 'functional needs repair': 2}
train['status_group'] = train['status_group'].replace(mode_map)
validation['status_group'] = validation['status_group'].replace(mode_map)

In [0]:
# not needed

# train = train.dropna(axis=1)
# validation = validation.dropna(axis=1)

### Selecting features

In [153]:
train.describe()

| | id | amount_tsh | gps_height | longitude | latitude | num_private | region_code | district_code | population | construction_year |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 44550.0 | 44550.0 | 44550.0 | 44550.0 | 44550.0 | 44550.0 | 44550.0 | 44550.0 | 44550.0 | 44550.0 |
| mean | 37197.87138 | 321.741092 | 667.512997 | 34.083798 | -5.700832 | 0.482379 | 15.324893 | 5.656611 | 179.436229 | 1297.77486 |
| std | 21428.742249 | 3041.270688 | 693.334158 | 6.549088 | 2.948316 | 13.035927 | 17.617153 | 9.66277 | 455.355845 | 952.534494 |
| min | 0.0 | 0.0 | -90.0 | 0.0 | -11.64944 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 18663.25 | 0.0 | 0.0 | 33.092791 | -8.531322 | 0.0 | 5.0 | 2.0 | 0.0 | 0.0 |
| 50% | 37204.5 | 0.0 | 367.0 | 34.901191 | -5.014511 | 0.0 | 12.0 | 3.0 | 25.0 | 1986.0 |
| 75% | 55679.75 | 20.0 | 1322.0 | 37.177192 | -3.323021 | 0.0 | 17.0 | 5.0 | 214.0 | 2004.0 |
| max | 74247.0 | 350000.0 | 2770.0 | 40.345193 | -2e-08 | 1776.0 | 99.0 | 80.0 | 15300.0 | 2013.0 |


In [154]:
train.describe(include=['O'])

Unnamed: 0,date_recorded,funder,installer,wpt_name,basin,subvillage,region,lga,ward,public_meeting,recorded_by,scheme_management,scheme_name,permit,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
count,44550,41811,41794,44550,44550,44261,44550,44550,44550,38204,44550,41632,23407,29105,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550
unique,347,1629,1840,28953,9,16658,21,125,2080,1,1,12,2482,1,18,13,7,12,5,7,7,8,6,5,5,10,7,3,7,6,3
top,2011-03-15,Government Of Tanzania,DWE,none,Lake Victoria,Majengo,Iringa,Njombe,Igosi,True,GeoData Consultants Ltd,VWC,K,True,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
freq,426,6798,13020,2708,7746,387,3984,1861,228,38204,44550,27593,513,29105,20084,20084,20084,30318,39340,18963,18963,38054,38054,24869,24869,12754,12754,34320,21387,25971,24106


In [182]:
# functional: 0
# non functional: 1
# functional needs repair: 2

train.groupby('quantity')['status_group'].var()
#train.groupby('construction_year')['status_group'].value_counts()

quantity
dry             0.031544
enough          0.387685
insufficient    0.440910
seasonal        0.458704
unknown         0.217881
Name: status_group, dtype: float64
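
The variance above is computed on the integer codes 0, 1, 2, so its size depends on which code each class happened to receive. One alternative worth considering is to look at class shares per category directly. The sketch below uses `pd.crosstab` on a made-up `toy` frame with string labels; it is an illustration, not something this notebook ran.

```python
import pandas as pd

# Toy rows standing in for the real data.
toy = pd.DataFrame({
    'quantity': ['dry', 'dry', 'enough', 'enough', 'enough', 'seasonal'],
    'status_group': ['non functional', 'non functional', 'functional',
                     'functional', 'non functional', 'functional needs repair'],
})

# normalize='index' makes each row sum to 1:
# the share of each status within one quantity level.
shares = pd.crosstab(toy['quantity'], toy['status_group'], normalize='index')
print(shares.round(2))
```

A row dominated by a single class suggests that the category value is informative about the target.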

In [0]:
# Decided to try something different: instead of relying on a feature selection function,
# I chose the features by hand, as a domain expert on waterpumps might.

# Categoricals were chosen by how important they seemed
# (based on the feature descriptions on the Kaggle page).
# Numerics were chosen for their high variance.

categoricals = ['management', 'management_group', 'extraction_type_class', 'extraction_type_group', 'waterpoint_type',
                'water_quality', 'quantity', 'basin', 'region']
numerics = ['construction_year', 'amount_tsh', 'gps_height', 'longitude', 'latitude']
features = categoricals + numerics

### Prep and processing

In [208]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

target = 'status_group'

x_train = train[features]
y_train = train[target]
x_val = validation[features]
y_val = validation[target]
x_test = test[features]

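my_pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),  # one 0/1 column per category value
    SimpleImputer(strategy='mean'),  # fill any NaN with the column mean
    StandardScaler(),  # not required for a tree: splits don't depend on feature scale
    DecisionTreeClassifier(min_samples_leaf=5, max_features=40, random_state=10)
)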

my_pipeline.fit(x_train, y_train)
print('train set accuracy', my_pipeline.score(x_train, y_train))
print('validation set accuracy', my_pipeline.score(x_val, y_val))

train set accuracy 0.8490011223344557
validation set accuracy 0.7665993265993266
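
The assignment also asks for feature importances. A minimal sketch on toy data is given below: a fitted tree exposes them through its `feature_importances_` attribute. In this notebook the fitted tree should be reachable as `my_pipeline.named_steps['decisiontreeclassifier']`, but the matching column names would have to come from the encoder's output, which is an assumption about how you would wire it up. The frame `X` and labels `y` here are invented.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data: one column separates the classes, the other never changes.
X = pd.DataFrame({
    'informative': [0, 0, 0, 0, 1, 1, 1, 1],
    'constant':    [5, 5, 5, 5, 5, 5, 5, 5],
})
y = [0, 0, 0, 0, 1, 1, 1, 1]

tree = DecisionTreeClassifier(random_state=10).fit(X, y)

# Pair each importance with its column name and sort for plotting.
importances = pd.Series(tree.feature_importances_, index=X.columns).sort_values()
print(importances)

# importances.plot.barh() would draw a horizontal bar chart (needs matplotlib).
```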


In [209]:
y_pred = my_pipeline.predict(x_test)
print('prediction: ', y_pred)

prediction:  [0 0 1 ... 0 0 1]


In [0]:
submission = sample_submission.copy()
submission['status_group'] = y_pred

# Map the integer predictions back to the original string labels
mode_map = {0: 'functional', 1: 'non functional', 2: 'functional needs repair'}
submission['status_group'] = submission['status_group'].replace(mode_map)

In [216]:
submission.head()

| | id | status_group |
|---|---|---|
| 0 | 50785 | functional |
| 1 | 51630 | functional |
| 2 | 17168 | non functional |
| 3 | 45559 | non functional |
| 4 | 49871 | functional |


In [222]:
submission.shape

(14358, 2)

In [0]:
submission.to_csv('medinadiegokaggle_1.csv', index=False)

In [0]:
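# files.download only exists on Colab, so skip it when running locally
if 'google.colab' in sys.modules:
    from google.colab import files
    files.download('medinadiegokaggle_1.csv')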

In [0]:
# train['funder'].value_counts().index[0]