Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 1

## Assignment
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [what other columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) What other columns are duplicates, or nearly duplicates? Can you extract the year from date_recorded? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [ ] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your feature importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s) !
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Try other [scikit-learn scalers](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Make exploratory visualizations and share on Slack.


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this problem, you may want to use the parameter `logistic=True`

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.

#### High-cardinality categoricals

This code from a previous assignment demonstrates how to replace less frequent values with 'OTHER'

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```



In [1]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # category_encoders, version >= 2.0
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade category_encoders pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module1')

In [23]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv('../data/tanzania/train_features.csv'), 
                 pd.read_csv('../data/tanzania/train_labels.csv'))
test = pd.read_csv('../data/tanzania/test_features.csv')
sample_submission = pd.read_csv('../data/tanzania/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [24]:
y=train[['status_group']]
train=train.drop('status_group', axis=1)

In [25]:
X_train, X_val, y_train,y_val=train_test_split(train,y)

In [26]:
X_train.shape, y_train.shape, X_val.shape,y_val.shape

((44550, 40), (44550, 1), (14850, 40), (14850, 1))

In [27]:
import plotly.express as px
import numpy as np
%matplotlib inline

In [28]:
desc=X_train.describe(exclude='number').T
desc=desc[(desc['unique']<50) & (desc['unique']>1)]

In [29]:
encodableCats=desc.index.tolist()
numberCats=X_train.select_dtypes('number').columns.tolist()
allcats=numberCats+encodableCats
allcats

['id',
 'amount_tsh',
 'gps_height',
 'longitude',
 'latitude',
 'num_private',
 'region_code',
 'district_code',
 'population',
 'construction_year',
 'basin',
 'region',
 'public_meeting',
 'scheme_management',
 'permit',
 'extraction_type',
 'extraction_type_group',
 'extraction_type_class',
 'management',
 'management_group',
 'payment',
 'payment_type',
 'water_quality',
 'quality_group',
 'quantity',
 'quantity_group',
 'source',
 'source_type',
 'source_class',
 'waterpoint_type',
 'waterpoint_type_group']

In [36]:
def cleanUp(data):
    #I don't want setting with copy errors
    data=data[allcats].copy()
    #get rid of the bad faux 0s 
    data['latitude']=data['latitude'].replace(-2e-08, 0)
    #Replace fake 0's with NaNs for accuracy  
    
    zeros=['latitude','longitude', 'num_private', 'construction_year','population','amount_tsh']
    for col in zeros:
        data[col]=data[col].replace(0,np.nan)
    #handle duplicate columns,and ID is random
    data = data.drop(['quantity_group','id'],axis=1)
    # there are some numbers which should be treated as strings
    numStrings=['region_code','district_code']
    for col in numStrings:
        data[col]=data[col].astype(str)
    #Theres a date string
    #data['date_recorded']=pd.to_datetime(data['date_recorded'], infer_datetime_format=True)
    #data['age']=data['date_recorded'].apply(lambda x: x.year)-data['construction_year']
    #the GPS values that are 0 where the age is null are also errors and should be nan's 
    def gpsHeight(cols):
        age=cols[0]
        height=cols[1]
        if pd.isnull(age):
            if height==0:
                return np.NaN
        else:
            return height
    data['gps_height']=data[['construction_year','gps_height']].apply(gpsHeight,axis=1)
   
    return data

In [37]:
X_train=cleanUp(X_train)
X_val=cleanUp(X_val)
test=cleanUp(test)

In [38]:
px.scatter(data, 'latitude','longitude',color='amount_tsh')

NameError: name 'data' is not defined

In [39]:
X_train['amount_tsh'].value_counts(normalize=True)


500.0       0.173579
50.0        0.138064
1000.0      0.083472
20.0        0.083245
200.0       0.067486
100.0       0.047278
10.0        0.044940
2000.0      0.041321
30.0        0.040869
250.0       0.031820
300.0       0.030840
5000.0      0.025788
5.0         0.021641
25.0        0.019831
3000.0      0.018398
1200.0      0.015684
1500.0      0.011160
6.0         0.010406
600.0       0.010330
4000.0      0.009425
2500.0      0.008445
2400.0      0.008144
6000.0      0.007239
7.0         0.003770
40.0        0.003393
750.0       0.003393
8000.0      0.003242
10000.0     0.002865
12000.0     0.002639
3600.0      0.002564
              ...   
50000.0     0.000226
4500.0      0.000151
0.2         0.000151
26000.0     0.000151
9000.0      0.000151
520.0       0.000151
53.0        0.000075
590.0       0.000075
1.0         0.000075
9.0         0.000075
120000.0    0.000075
45000.0     0.000075
38000.0     0.000075
8500.0      0.000075
900.0       0.000075
1400.0      0.000075
13000.0     0

In [40]:
pd.crosstab(X['gps_height'], X['age']).loc[0].sum()

NameError: name 'X' is not defined

In [41]:
fig=px.scatter_mapbox(X_train, lon='longitude', lat='latitude', color='age')
fig.update_layout(mapbox_style='stamen-terrain')

ValueError: Value of 'color' is not the name of a column in 'data_frame'. Expected one of ['amount_tsh', 'gps_height', 'longitude', 'latitude', 'num_private', 'region_code', 'district_code', 'population', 'construction_year', 'basin', 'region', 'public_meeting', 'scheme_management', 'permit', 'extraction_type', 'extraction_type_group', 'extraction_type_class', 'management', 'management_group', 'payment', 'payment_type', 'water_quality', 'quality_group', 'quantity', 'source', 'source_type', 'source_class', 'waterpoint_type', 'waterpoint_type_group'] but received: age

In [42]:
Height0=X[X['gps_height']==0]

NameError: name 'X' is not defined

In [43]:
#working things out
def gpsHeight(cols):
    age=cols[0]
    height=cols[1]
    if pd.isnull(age):
        if height==0:
            return np.NaN
    else:
        return height
#data[['age','gps_height']].apply(gpsHeight,axis=1)

In [44]:
X_train['date_recorded'].isna().sum()

KeyError: 'date_recorded'

In [45]:
data.loc[data['age'].isna()]['gps_height']

NameError: name 'data' is not defined

In [46]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

In [50]:
pipeline=make_pipeline(
    ce.OneHotEncoder(),
    SimpleImputer(),
    StandardScaler(),
    LogisticRegressionCV(n_jobs=-1,multi_class='auto',cv=5, solver='lbfgs')
)
pipeline.fit(X_train,y_train)


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().



Pipeline(memory=None,
         steps=[('onehotencoder',
                 OneHotEncoder(cols=['region_code', 'district_code', 'basin',
                                     'region', 'public_meeting',
                                     'scheme_management', 'permit',
                                     'extraction_type', 'extraction_type_group',
                                     'extraction_type_class', 'management',
                                     'management_group', 'payment',
                                     'payment_type', 'water_quality',
                                     'quality_group', 'quantity', 'source',
                                     'source_type', 'so...
                ('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('logisticregressioncv',
                 LogisticRegressionCV(Cs=10, class_weight=None, cv=5,
                                      dual=False, fit_intercept=True,
           

In [52]:
y_pred=pipeline.predict(X_train)

In [55]:
pipeline.score(X_val, y_val)

0.745993265993266