<a href="https://colab.research.google.com/github/emmettgb/DS-Unit-2-Kaggle-Challenge/blob/master/Emmett%20Boudreau-assignment_kaggle_challenge_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 1

## Assignment
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [what other columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) What other columns are duplicates, or nearly duplicates? Can you extract the year from date_recorded? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [ ] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your feature importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.
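
The wrangling step in the checklist above could look roughly like the sketch below. It runs on a tiny made-up frame. The column names `date_recorded`, `construction_year`, `longitude`, and `latitude` come from the dataset, while `wrangle`, `year_recorded`, and `pump_age` are hypothetical names chosen for illustration, and treating a zero `construction_year` as missing is an assumption.

```python
import numpy as np
import pandas as pd


def wrangle(X):
    """Apply the same cleaning to train, validation, and test frames."""
    X = X.copy()

    # Assumption: zeros in these columns stand in for missing values.
    for col in ['longitude', 'latitude', 'construction_year']:
        X[col] = X[col].replace(0, np.nan)

    # Extract the year from the inspection date.
    X['date_recorded'] = pd.to_datetime(X['date_recorded'])
    X['year_recorded'] = X['date_recorded'].dt.year

    # Engineered feature: years from construction to inspection.
    X['pump_age'] = X['year_recorded'] - X['construction_year']
    return X


# Tiny made-up sample, not real competition data.
toy = pd.DataFrame({
    'date_recorded': ['2011-03-23', '2013-02-14'],
    'construction_year': [2008, 0],
    'longitude': [34.67, 0.0],
    'latitude': [-9.31, -4.47],
})
wrangled = wrangle(toy)
print(wrangled[['year_recorded', 'construction_year', 'pump_age']])
```

Because the function returns a copy, the same call can be applied to each split in turn, for example `train = wrangle(train)`, `val = wrangle(val)`, `test = wrangle(test)`.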


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Try other [scikit-learn scalers](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Make exploratory visualizations and share on Slack.


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)
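
As a rough sketch of the binning idea, the snippet below bins a numeric feature with `pd.qcut` and computes the share of functional pumps per bin, which is the quantity a categorical estimate plot would draw. The data is made up, and `height_bin` is a hypothetical column name.

```python
import pandas as pd

# Made-up sample, not real competition data.
toy = pd.DataFrame({
    'gps_height': [0, 10, 50, 300, 800, 1200, 1500, 2000],
    'status_group': ['non functional', 'functional', 'non functional',
                     'functional', 'functional', 'non functional',
                     'functional', 'functional'],
})

# Represent the target as 0 or 1 so that its mean is a proportion.
toy['functional'] = (toy['status_group'] == 'functional').astype(int)

# Split the numeric feature into four equal-sized (quantile) bins.
toy['height_bin'] = pd.qcut(toy['gps_height'], q=4)

# Share of functional pumps within each bin.
rate_by_bin = toy.groupby('height_bin', observed=True)['functional'].mean()
print(rate_by_bin)
```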

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this problem, you may want to use the parameter `logistic=True`, which expects a binary 0/1 target such as the `functional` column defined above.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.

#### High-cardinality categoricals

This code from a previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```
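
The same idea can be wrapped in a small helper so that the top values are learned once from the training column and then applied unchanged to the other splits. This is only a sketch on made-up data; `reduce_cardinality` is a hypothetical name, not a library function.

```python
import pandas as pd


def reduce_cardinality(train_col, other_cols, n=10, other='OTHER'):
    """Keep the n most frequent training values; map everything else to `other`."""
    top = train_col.value_counts().index[:n]

    def collapse(col):
        # Values outside the training top-n (including NaN) become `other`.
        return col.where(col.isin(top), other)

    return collapse(train_col), [collapse(col) for col in other_cols]


# Made-up columns standing in for a high-cardinality feature.
train_toy = pd.Series(['a', 'a', 'a', 'b', 'b', 'c'])
val_toy = pd.Series(['a', 'c', 'd'])

train_reduced, (val_reduced,) = reduce_cardinality(train_toy, [val_toy], n=2)
print(train_reduced.tolist())
print(val_reduced.tolist())
```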



In [1]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # category_encoders, version >= 2.0
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade category_encoders pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module1')

Collecting category_encoders
  Downloading https://files.pythonhosted.org/packages/6e/a1/f7a22f144f33be78afeb06bfa78478e8284a64263a3c09b1ef54e673841e/category_encoders-2.0.0-py2.py3-none-any.whl (87kB)
Collecting pandas-profiling
  Downloading https://files.pythonhosted.org/packages/2c/2f/aae19e2173c10a9bb7fee5f5cad35dbe53a393960fc91abc477dcc4661e8/pandas-profiling-2.3.0.tar.gz (127kB)
Collecting plotly
  Downloading https://files.pythonhosted.org/packages/63/2b/4ca10995bfbdefd65c4238f9a2d3fde33705d18dd50914dd13302ec1daf1/plotly-4.1.0-py2.py3-none-any.whl (7.1MB)
Collecting htmlmin>=0.1.12 (from pandas-profiling)
  Downloading https://files.pythonhosted.org/packages/b3/e7/fcd59e12169de19f0131ff2812077f964c6b960e7c09804d30a7bf2ab461/htmlmin-0.1.12.tar.gz
Collecting phik>=0.9.8 (from pan

In [0]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# pd.merge joins on the columns the two files share (here, `id`)
train = pd.merge(pd.read_csv('../data/tanzania/train_features.csv'), 
                 pd.read_csv('../data/tanzania/train_labels.csv'))
test = pd.read_csv('../data/tanzania/test_features.csv')
sample_submission = pd.read_csv('../data/tanzania/sample_submission.csv')

train.shape, test.shape

In [3]:
# Hold out 20% of the labeled data for validation, keeping class proportions
train, val = train_test_split(train, train_size=0.80, test_size=0.20, 
                              stratify=train['status_group'], random_state=42)
train.shape, val.shape, test.shape

((47520, 41), (11880, 41), (14358, 40))
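
Because the split is stratified on `status_group`, the class proportions should be nearly identical in both parts. A quick way to check this, sketched here on made-up labels rather than the real data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with an imbalanced three-class target.
toy = pd.DataFrame({
    'x': range(100),
    'status_group': (['functional'] * 60 + ['non functional'] * 30
                     + ['functional needs repair'] * 10),
})

toy_train, toy_val = train_test_split(
    toy, train_size=0.80, test_size=0.20,
    stratify=toy['status_group'], random_state=42)

# Compare the share of each class in the two splits side by side.
proportions = pd.DataFrame({
    'train': toy_train['status_group'].value_counts(normalize=True),
    'val': toy_val['status_group'].value_counts(normalize=True),
})
print(proportions)
```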

In [4]:
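import plotly.express as px

# Keep only training rows with nonzero coordinates (zeros mark missing GPS data)
train = train[(train[['longitude','latitude']] != 0).all(axis=1)]
# Drop every training row that has a missing value in any column
train = train.dropna()
px.scatter(train, x='longitude', y='latitude', color='status_group', opacity=0.1)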

In [0]:
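import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def clean(X):
    """Return a cleaned copy of X; the input frame is not modified."""
    X = X.copy()
    # Treat the near-zero latitude value -2e-08 as zero
    X['latitude'] = X['latitude'].replace(-2e-08, 0)
    # Zeros in these columns stand in for missing values
    cols_with_zeros = ['longitude', 'latitude']
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)
    # Dropped as a near-duplicate of the quantity column
    X = X.drop(columns='quantity_group')
    return X


# One-hot encode categoricals, impute missing values with the column mean,
# standardize, then fit a logistic regression
pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    SimpleImputer(), 
    StandardScaler(), 
    LogisticRegression(solver='lbfgs', multi_class='auto')
)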

In [6]:
# Preview the cleaned frame; the result is not assigned, so `train` is unchanged
clean(train)

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
7263,65358,500.0,2011-03-23,Rc Church,2049,ACRA,34.665760,-9.308548,Kwa Yasinta Ng'Ande,0,Rufiji,Kitichi,Iringa,11,4,Njombe,Imalinyi,175,True,GeoData Consultants Ltd,WUA,Tove Mtwango gravity Scheme,True,2008,gravity,gravity,gravity,wua,user-group,pay monthly,monthly,soft,good,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
52726,27001,0.0,2011-03-10,Water,0,Gove,35.389331,-6.399942,Chama,0,Internal,Mtakuj,Dodoma,1,6,Bahi,Nondwa,0,True,GeoData Consultants Ltd,VWC,Zeje,True,0,mono,mono,motorpump,vwc,user-group,pay per bucket,per bucket,soft,good,enough,machine dbh,borehole,groundwater,communal standpipe,communal standpipe,functional
8558,41546,0.0,2011-08-07,Dwe/norad,1295,DWE,31.214583,-8.431428,Kwa Feston Mambosasa,0,Lake Tanganyika,Kisumba Kati,Rukwa,15,2,Sumbawanga Rural,Kasanga,200,True,GeoData Consultants Ltd,VWC,Kisumba water supply,True,1986,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,insufficient,river,river/lake,surface,communal standpipe,communal standpipe,functional
2559,16230,20000.0,2013-09-03,Oxfam,1515,DWE,36.696700,-3.337926,Oroirwa,0,Pangani,Oroirwa,Arusha,2,2,Arusha Rural,Oltroto,150,True,GeoData Consultants Ltd,VWC,Nabaiye pipe line,True,1995,gravity,gravity,gravity,vwc,user-group,pay monthly,monthly,soft,good,insufficient,spring,spring,groundwater,communal standpipe multiple,communal standpipe,functional
54735,10307,0.0,2011-04-17,Water,0,DWE,36.292724,-5.177333,Zahanati,0,Internal,Polisi,Dodoma,1,1,Kondoa,Mrijo,0,True,GeoData Consultants Ltd,VWC,Mrij,False,0,mono,mono,motorpump,vwc,user-group,pay per bucket,per bucket,soft,good,enough,machine dbh,borehole,groundwater,communal standpipe,communal standpipe,functional
4372,16650,20.0,2011-03-03,District Council,17,DWE,39.020094,-5.346661,Kwa Kilo,3,Pangani,Mbuyuni,Tanga,4,5,Pangani,Kimang'a,300,True,GeoData Consultants Ltd,Water authority,Boza water supply,False,2009,submersible,submersible,submersible,water authority,commercial,pay per bucket,per bucket,salty,salty,insufficient,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
6431,19057,50.0,2011-03-19,Private Individual,-12,Da,38.890990,-6.444754,Mwamini,0,Wami / Ruvu,Nia Njema C,Pwani,6,1,Bagamoyo,Magomeni,30,True,GeoData Consultants Ltd,Company,Bagamoyo wate,True,2008,ksb,submersible,submersible,private operator,commercial,pay per bucket,per bucket,soft,good,enough,river,river/lake,surface,communal standpipe,communal standpipe,functional
1373,12040,0.0,2013-03-26,Government Of Tanzania,1466,DWE,37.501573,-3.277989,Ofisi Ya Kata,0,Pangani,Kirimbochoni,Kilimanjaro,3,4,Moshi Rural,Marangu Magharibi,15,True,GeoData Consultants Ltd,VWC,Marangu west,True,1972,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
2026,21481,6500.0,2013-02-14,Dmdd,1969,Dmdd,35.371400,-4.470802,Mathias,0,Internal,Rabai,Manyara,21,2,Hanang,Mogitu,162,True,GeoData Consultants Ltd,VWC,Gamowaso,True,2003,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
41101,48982,500.0,2013-01-15,Norad,884,RWE,29.660968,-4.818648,Kwa Samwel Katabila,0,Lake Tanganyika,Kibingo,Kigoma,16,3,Kigoma Rural,Kagongo,1500,True,GeoData Consultants Ltd,VWC,Mkongoro One,True,1985,gravity,gravity,gravity,vwc,user-group,pay monthly,monthly,soft,good,seasonal,river,river/lake,surface,communal standpipe multiple,communal standpipe,non functional


In [0]:
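target = 'status_group'
train_features = train.drop(columns=[target, 'id'])

# The second positional argument of select_dtypes is `exclude`, so this keeps
# the integer columns and leaves out float columns such as longitude and latitude
numeric_features = train_features.select_dtypes(include=int, exclude=float).columns.tolist()

# Keep categorical columns with at most 50 unique values
categorical = train_features.select_dtypes(exclude='number').nunique()
categorical_features = categorical[categorical <= 50].index.tolist()

features = numeric_features + categorical_features
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]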

In [8]:
pipeline.fit(X_train, y_train)
print('Validation Accuracy', pipeline.score(X_val, y_val))
y_pred = pipeline.predict(X_test)


lbfgs failed to converge. Increase the number of iterations.



Validation Accuracy 0.7195286195286196
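
The assignment calls for a decision tree classifier, whereas the pipeline above fits a logistic regression. A minimal sketch of a decision-tree variant with feature importances follows. It uses only scikit-learn so that it runs without `category_encoders`, which means `OrdinalEncoder` is swapped in for `ce.OneHotEncoder` here. The data is made up, and `max_depth=3` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up sample, not real competition data.
toy_X = pd.DataFrame({
    'quantity': ['enough', 'dry', 'enough', 'dry', 'enough', 'dry'],
    'gps_height': [1200.0, 15.0, np.nan, 30.0, 900.0, 5.0],
})
toy_y = pd.Series(['functional', 'non functional', 'functional',
                   'non functional', 'functional', 'non functional'])

# Encode the categorical column; numeric columns pass through unchanged.
# The output column order is: encoded columns first, then the remainder.
preprocess = make_column_transformer(
    (OrdinalEncoder(), ['quantity']),
    remainder='passthrough',
)

tree_pipeline = make_pipeline(
    preprocess,
    SimpleImputer(strategy='median'),
    DecisionTreeClassifier(max_depth=3, random_state=42),
)
tree_pipeline.fit(toy_X, toy_y)

# make_pipeline names each step after its lowercased class name.
tree = tree_pipeline.named_steps['decisiontreeclassifier']
importances = pd.Series(tree.feature_importances_,
                        index=['quantity', 'gps_height'])
print(importances.sort_values(ascending=False))
```

With matplotlib installed, `importances.sort_values().plot.barh()` would draw the importances as a horizontal bar chart.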


In [0]:
df = pd.DataFrame()

In [10]:
df = pd.DataFrame(y_pred)
df.head(5)

Unnamed: 0,0
0,non functional
1,functional
2,non functional
3,non functional
4,functional


In [11]:
print(df.shape)

(14358, 1)


In [12]:
# Name the positional index 'id' (row positions 0..N-1, not the values in test['id'])
df.index.names = ['id']
df.rename(columns={ df.columns[0]: "status_group" }, inplace = True)
df.head(3)

id,status_group
0,non functional
1,functional
2,non functional


In [13]:
df.to_csv('Kaggle')
print(df.shape)

(14358, 1)
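
Note that the frame above is indexed 0, 1, 2, and so on, rather than by the waterpoint ids in the test set. Assuming the competition expects the `id` values from `test_features.csv` (the layout of `sample_submission.csv` would confirm this), a submission could be assembled along these lines. The ids and predictions below are made up, and the frame is written to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Made-up stand-ins for test['id'] and y_pred.
toy_test = pd.DataFrame({'id': [50785, 51630, 17168]})
toy_pred = ['non functional', 'functional', 'functional']

# Pair each prediction with the id of the row it was predicted for.
submission = pd.DataFrame({
    'id': toy_test['id'].to_numpy(),
    'status_group': toy_pred,
})

# index=False keeps pandas from writing the 0..N-1 row index as an extra column.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

With the real data, the same call with a filename (for example the hypothetical `submission.csv`) in place of `buffer` would write the file to upload.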
