<a href="https://colab.research.google.com/github/jonathanmendoza-tx/DS-Unit-2-Kaggle-Challenge/blob/master/module1/Jonathan_Mendoza_assignment_kaggle_challenge_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 1

## Assignment
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [what other columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) What other columns are duplicates, or nearly duplicates? Can you extract the year from date_recorded? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [ ] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your feature importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Try other [scikit-learn scalers](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Make exploratory visualizations and share on Slack.


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this problem, you may want to use the parameter `logistic=True`.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.
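
As a minimal sketch of the binning idea above, the example below uses a made-up toy DataFrame (the `population` values and the `toy` name are assumptions for illustration, not the real data). It builds the 0/1 target, bins a numeric feature with `pd.qcut`, and computes the share of functional pumps in each bin.

```python
import pandas as pd

# Toy stand-in for the training set; the values are invented for illustration
toy = pd.DataFrame({
    'population': [1, 5, 20, 50, 120, 300, 450, 900],
    'status_group': ['non functional', 'functional', 'non functional',
                     'functional', 'functional', 'functional needs repair',
                     'functional', 'functional'],
})

# Binary target: 1 if the pump is functional, else 0
toy['functional'] = (toy['status_group'] == 'functional').astype(int)

# Bin the numeric feature into 4 quantile bins of roughly equal size
toy['population_bin'] = pd.qcut(toy['population'], q=4)

# Share of functional pumps within each bin
rate_by_bin = toy.groupby('population_bin', observed=True)['functional'].mean()
print(rate_by_bin)
```

The binned column has only a few unique values, so it could also be passed as the `x` argument of a Seaborn categorical plot such as `sns.catplot(..., kind='bar')`, with `functional` as `y`.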

#### High-cardinality categoricals

This code from a previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```
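
The same pattern can be wrapped in a small function so that the top values are learned from the training set once and then applied to every split. This is only a sketch: `reduce_cardinality` is a hypothetical helper, and the `funder` toy data is made up for illustration.

```python
import pandas as pd

def reduce_cardinality(train, others, column, top_n=10, other_label='OTHER'):
    """Hypothetical helper: keep the top_n most frequent training values of
    `column` and replace every other value with `other_label`."""
    # Learn the frequent values from the training set only
    top = train[column].value_counts().head(top_n).index
    result = []
    for df in [train] + list(others):
        df = df.copy()
        # Missing values are not in `top`, so they also become other_label
        df.loc[~df[column].isin(top), column] = other_label
        result.append(df)
    return result

toy_train = pd.DataFrame({'funder': ['A', 'A', 'A', 'B', 'B', 'C', 'D']})
toy_val = pd.DataFrame({'funder': ['A', 'C', 'E']})

toy_train, toy_val = reduce_cardinality(toy_train, [toy_val], 'funder', top_n=2)
print(toy_val['funder'].tolist())
```

Learning the top values from the training set alone keeps the validation and test sets from influencing which categories survive.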



In [0]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # category_encoders, version >= 2.0
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade category_encoders pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module1')



In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv('../data/tanzania/train_features.csv'), 
                 pd.read_csv('../data/tanzania/train_labels.csv'))
test = pd.read_csv('../data/tanzania/test_features.csv')
sample_submission = pd.read_csv('../data/tanzania/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [0]:
train, val = train_test_split(train, train_size=0.8, test_size = 0.2,
                              stratify = train['status_group'])
train.shape, val.shape, test.shape

((47520, 41), (11880, 41), (14358, 40))
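
The split above is not seeded, so each run produces a different train/validation partition. The sketch below, on made-up toy data, shows how `stratify` preserves the class proportions and how a fixed `random_state` makes the split repeatable. The value `42` and the toy class counts are assumptions, not part of the original cell.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a 60/30/10 class mix; the values are invented for illustration
toy = pd.DataFrame({
    'x': range(100),
    'status_group': (['functional'] * 60
                     + ['non functional'] * 30
                     + ['functional needs repair'] * 10),
})

# random_state=42 is an arbitrary seed chosen for this sketch
toy_train, toy_val = train_test_split(
    toy, test_size=0.2, stratify=toy['status_group'], random_state=42)

# The validation set keeps the same class proportions as the full data
print(toy_val['status_group'].value_counts(normalize=True))
```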

In [0]:
import plotly.express as px
px.scatter(train, x='longitude', y='latitude', color='status_group', opacity=0.1)

In [0]:
numeric_cols = train.select_dtypes('number').columns.tolist()
# Check how many rows have a zero in each numeric column
for col in numeric_cols:
  print('\n', col, '\n', train.query(f'{col}==0').shape)



 id 
 (1, 41)

 amount_tsh 
 (33280, 41)

 gps_height 
 (16434, 41)

 longitude 
 (1473, 41)

 latitude 
 (0, 41)

 num_private 
 (46919, 41)

 region_code 
 (0, 41)

 district_code 
 (19, 41)

 population 
 (17195, 41)

 construction_year 
 (16654, 41)
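
The same zero counts can be computed without a loop by comparing all numeric columns at once. The sketch below uses a small made-up DataFrame (`toy` and its values are assumptions for illustration) rather than the real training set.

```python
import pandas as pd

# Toy stand-in for the training set; the values are invented for illustration
toy = pd.DataFrame({
    'amount_tsh': [0.0, 25.0, 0.0],
    'longitude': [30.7, 0.0, 35.4],
    'basin': ['Rufiji', 'Internal', 'Internal'],
})

# Compare every numeric column with zero, then count the True values per column
zero_counts = (toy.select_dtypes('number') == 0).sum()
print(zero_counts)
```

This returns one count per numeric column as a pandas Series, which is easier to sort or filter than printed shapes.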


In [0]:

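def wrangle(X):
  """Wrangle the train, validation, and test sets in the same way."""
  import numpy as np

  # Work on a copy so the caller's DataFrame is left untouched
  X = X.copy()

  # Parse dates so the inspection year can be extracted below
  X['date_recorded'] = pd.to_datetime(X['date_recorded'])

  # Some latitudes are stored as -2e-08 rather than 0; normalize them first
  X['latitude'] = X['latitude'].replace(-2e-08, 0)

  # Treat zeros in these columns as missing values
  zero_cols = ['latitude', 'longitude', 'construction_year', 'gps_height',
               'district_code', 'population']
  for col in zero_cols:
    X[col] = X[col].replace(0, np.nan)

  # Despite its name, this is the number of years from construction to
  # inspection (NaN when construction_year is missing)
  X['years_since_inspection'] = X['date_recorded'].dt.year - X['construction_year']

  # Drop columns judged uninformative or (near-)duplicates of other columns
  X = X.drop(columns=['recorded_by', 'num_private', 'quantity_group', 'scheme_name',
                      'extraction_type_group', 'payment', 'source', 'waterpoint_type'])

  return X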

In [0]:
train = wrangle(train)
val = wrangle(val)
test = wrangle(test)

In [0]:
train.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,scheme_management,permit,construction_year,extraction_type,extraction_type_class,management,management_group,payment_type,water_quality,quality_group,quantity,source_type,source_class,waterpoint_type_group,status_group,years_since_inspection
59249,60321,0.0,2011-08-06,Jica,,DWE,30.734873,-1.041514,Musya 2,Lake Victoria,Magoma,Kagera,18,1.0,Karagwe,Bugomora,,True,,True,,other,other,vwc,user-group,never pay,soft,good,insufficient,shallow well,groundwater,other,non functional,
23547,73422,25.0,2013-03-21,0,47.0,0,39.096159,-6.646026,Chalesi Mapendano,Wami / Ruvu,New City,Dar es Salaam,7,1.0,Kinondoni,Bunju,168.0,True,VWC,False,2010.0,submersible,submersible,vwc,user-group,per bucket,soft,good,enough,river/lake,surface,communal standpipe,functional,3.0
8722,27388,0.0,2011-03-12,Government Of Tanzania,,GOVER,35.486193,-6.646269,Kwa Mtendaji,Rufiji,Mwenge,Dodoma,1,4.0,Chamwino,Huzi,,True,VWC,True,,mono,motorpump,vwc,user-group,unknown,soft,good,dry,borehole,groundwater,communal standpipe,non functional,
57204,50689,0.0,2011-03-17,Water,,Gove,35.451158,-5.836435,Zahana,Internal,Kichan,Dodoma,1,6.0,Bahi,Lamaiti,,True,VWC,True,,mono,motorpump,vwc,user-group,per bucket,soft,good,enough,borehole,groundwater,communal standpipe,functional,
55609,43864,50.0,2013-02-12,Government Of Tanzania,1535.0,Government,36.557698,-5.307173,Kwa Sugal,Internal,Ngarenaro,Manyara,21,4.0,Kiteto,Kibaya,200.0,False,Water Board,False,1997.0,gravity,gravity,water board,user-group,per bucket,salty,salty,insufficient,spring,groundwater,other,functional needs repair,16.0


In [0]:
px.scatter(train, x='longitude', y='latitude', color='status_group', opacity=0.2)

In [0]:
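target = 'status_group'

# Candidate features: everything except the target and the row id
train_features = train.drop(columns=[target, 'id'])

# All numeric columns
numeric = train_features.select_dtypes(include='number').columns.tolist()

# Number of unique values in each non-numeric column
cardinality = train_features.select_dtypes(exclude='number').nunique()

# Keep only the categoricals with at most 21 unique values
categorical = cardinality[cardinality <= 21].index.tolist()

features = numeric + categorical

features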

['amount_tsh',
 'gps_height',
 'longitude',
 'latitude',
 'region_code',
 'district_code',
 'population',
 'construction_year',
 'years_since_inspection',
 'basin',
 'region',
 'public_meeting',
 'scheme_management',
 'permit',
 'extraction_type',
 'extraction_type_class',
 'management',
 'management_group',
 'payment_type',
 'water_quality',
 'quality_group',
 'quantity',
 'source_type',
 'source_class',
 'waterpoint_type_group']

In [0]:
X_train = train[features]
y_train = train[target]

X_val = val[features]
y_val = val[target]

X_test = test[features]

In [0]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Encode categoricals, impute missing values with the column mean,
# scale, then fit a decision tree
pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(),
    StandardScaler(),
    DecisionTreeClassifier(min_samples_leaf=8)
)

pipeline.fit(X_train, y_train)

print('Validation Accuracy', pipeline.score(X_val, y_val))

# Predict on the test set for the Kaggle submission
y_pred = pipeline.predict(X_test)

Validation Accuracy 0.7746632996632996
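
The assignment also asks for feature importances, which a fitted pipeline exposes through its final step. The sketch below is self-contained and uses made-up toy data (`X_toy`, `y_toy`, and the `signal`/`noise` columns are assumptions for illustration), with a shorter pipeline than the one above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data: the target depends only on 'signal'; 'noise' is irrelevant
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    'signal': rng.normal(size=200),
    'noise': rng.normal(size=200),
})
y_toy = (X_toy['signal'] > 0).astype(int)

toy_pipeline = make_pipeline(SimpleImputer(), DecisionTreeClassifier(random_state=0))
toy_pipeline.fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name
tree = toy_pipeline.named_steps['decisiontreeclassifier']

# Pair each importance with its column name and sort for plotting
importances = pd.Series(tree.feature_importances_, index=X_toy.columns).sort_values()
print(importances)

# importances.plot.barh() would draw a horizontal bar chart (needs matplotlib)
```

For the pipeline fitted above, the one-hot encoder adds columns, so the importances no longer line up with `X_train.columns`. One option is to take the names from the encoded frame, for example `pipeline.named_steps['onehotencoder'].transform(X_val).columns`.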


In [0]:

submission = sample_submission.copy()
submission['status_group'] = y_pred
submission.to_csv('tanzania_submission-02.csv', index = False)

!head tanzania_submission-02.csv

if in_colab:
  from google.colab import files
  files.download('tanzania_submission-02.csv')

id,status_group
50785,non functional
51630,functional
17168,functional
45559,non functional
49871,functional
52449,functional
24806,non functional
28965,non functional
36301,non functional
