<a href="https://colab.research.google.com/github/Phatdeluxe/DS-Unit-2-Regression-Classification/blob/master/module4/assignment_regression_classification_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 4


## Assignment

- [ ] Watch Aaron's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes) to learn about the mathematics of Logistic Regression.
- [ ] [Sign up for a Kaggle account](https://www.kaggle.com/), if you don't already have one. Go to our Kaggle InClass competition website. You will be given the URL in Slack. Go to the Rules page. Accept the rules of the competition.
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your validation accuracy score.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

---


## Stretch Goals

- [ ] Add your own stretch goal(s)!
- [ ] Clean the data. For ideas, refer to [The Quartz guide to bad data](https://github.com/Quartz/bad-data-guide),  a "reference to problems seen in real-world data along with suggestions on how to resolve them." One of the issues is ["Zeros replace missing values."](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values)
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding. For example, you could try `quantity`, `basin`, `extraction_type_class`, and more. (But remember it may not work with high cardinality categoricals.)
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

---

## Data Dictionary 

### Features

Your goal is to predict the operating condition of a waterpoint for each record in the dataset. You are provided the following set of information about the waterpoints:

- `amount_tsh` : Total static head (amount water available to waterpoint)
- `date_recorded` : The date the row was entered
- `funder` : Who funded the well
- `gps_height` : Altitude of the well
- `installer` : Organization that installed the well
- `longitude` : GPS coordinate
- `latitude` : GPS coordinate
- `wpt_name` : Name of the waterpoint if there is one
- `num_private` :  
- `basin` : Geographic water basin
- `subvillage` : Geographic location
- `region` : Geographic location
- `region_code` : Geographic location (coded)
- `district_code` : Geographic location (coded)
- `lga` : Geographic location
- `ward` : Geographic location
- `population` : Population around the well
- `public_meeting` : True/False
- `recorded_by` : Group entering this row of data
- `scheme_management` : Who operates the waterpoint
- `scheme_name` : Who operates the waterpoint
- `permit` : If the waterpoint is permitted
- `construction_year` : Year the waterpoint was constructed
- `extraction_type` : The kind of extraction the waterpoint uses
- `extraction_type_group` : The kind of extraction the waterpoint uses
- `extraction_type_class` : The kind of extraction the waterpoint uses
- `management` : How the waterpoint is managed
- `management_group` : How the waterpoint is managed
- `payment` : What the water costs
- `payment_type` : What the water costs
- `water_quality` : The quality of the water
- `quality_group` : The quality of the water
- `quantity` : The quantity of water
- `quantity_group` : The quantity of water
- `source` : The source of the water
- `source_type` : The source of the water
- `source_class` : The source of the water
- `waterpoint_type` : The kind of waterpoint
- `waterpoint_type_group` : The kind of waterpoint

### Labels

There are three possible values:

- `functional` : the waterpoint is operational and there are no repairs needed
- `functional needs repair` : the waterpoint is operational, but needs repairs
- `non functional` : the waterpoint is not operational

--- 

## Generate a submission

Your code to generate a submission file may look like this:

```python
# estimator is your model or pipeline, which you've fit on X_train

# X_test is your pandas dataframe or numpy array, 
# with the same number of rows, in the same order, as test_features.csv, 
# and the same number of columns, in the same order, as X_train

y_pred = estimator.predict(X_test)


# Makes a dataframe with two columns, id and status_group, 
# and writes to a csv file, without the index

sample_submission = pd.read_csv('sample_submission.csv')
submission = sample_submission.copy()
submission['status_group'] = y_pred
submission.to_csv('your-submission-filename.csv', index=False)
```

If you're working locally, the csv file is saved in the same directory as your notebook.

If you're using Google Colab, you can use this code to download your submission csv file.

```python
from google.colab import files
files.download('your-submission-filename.csv')
```

---

## Import data

In [0]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module4')

In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
# Read the Tanzania Waterpumps data
# train_features.csv : the training set features
# train_labels.csv : the training set labels
# test_features.csv : the test set features
# sample_submission.csv : a sample submission file in the correct format
    
import pandas as pd

train_features = pd.read_csv('../data/waterpumps/train_features.csv')
train_labels = pd.read_csv('../data/waterpumps/train_labels.csv')
test_features = pd.read_csv('../data/waterpumps/test_features.csv')
sample_submission = pd.read_csv('../data/waterpumps/sample_submission.csv')

assert train_features.shape == (59400, 40)
assert train_labels.shape == (59400, 2)
assert test_features.shape == (14358, 40)
assert sample_submission.shape == (14358, 2)

## Splitting into train and validate

In [43]:
train_features.shape

(59400, 40)

In [44]:
train_labels.shape

(59400, 2)

In [0]:
train_features.head()

In [13]:
train_labels.head()

Unnamed: 0,id,status_group
0,69572,functional
1,8776,functional
2,34310,functional
3,67743,non functional
4,19728,functional


In [30]:
all_values = pd.merge(train_features, train_labels, how='inner', on='id')
all_values.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,,GeoData Consultants Ltd,Other,,True,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe,functional
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,True,GeoData Consultants Ltd,VWC,,True,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe,non functional
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,True,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,functional


In [0]:
from sklearn.model_selection import train_test_split

my_train, my_val = train_test_split(all_values, random_state=69)
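
The split above relies on the `train_test_split` defaults, which hold out 25% of the rows and do not stratify. One option, sketched here on made-up labels rather than the waterpumps data, is to pass `stratify` so that each class keeps the same share in both parts. The toy frame, its class proportions, and the 20% validation size are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for all_values (hypothetical data, not the real dataset):
# 60% 'functional', 30% 'non functional', 10% 'functional needs repair'
toy = pd.DataFrame({
    'id': range(100),
    'status_group': (['functional'] * 60
                     + ['non functional'] * 30
                     + ['functional needs repair'] * 10),
})

# stratify keeps the class proportions the same in both splits
toy_train, toy_val = train_test_split(
    toy, test_size=0.2, stratify=toy['status_group'], random_state=69)

# Class shares in each part, for comparison
train_share = toy_train['status_group'].value_counts(normalize=True)
val_share = toy_val['status_group'].value_counts(normalize=True)
```

Comparing `train_share` with `val_share` is a quick way to confirm that the rare `functional needs repair` class is represented in both parts.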

In [32]:
my_train.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
13245,22823,5.0,2013-12-03,Germany Republi,1221,CES,37.216138,-3.253888,Kwa Rashid Uromi,0,Pangani,Mbweera,Kilimanjaro,3,5,Hai,Machame Uroki,55,True,GeoData Consultants Ltd,Water Board,Uroki-Bomang'ombe water sup,True,1999,gravity,gravity,gravity,water board,user-group,pay per bucket,per bucket,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
48583,17113,50.0,2013-03-18,0,-20,0,39.531663,-7.059533,Kwa Ramadhani,0,Wami / Ruvu,Kitomondo,Dar es Salaam,7,3,Temeke,Pemba Mnazi,110,True,GeoData Consultants Ltd,VWC,,False,2000,submersible,submersible,submersible,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,machine dbh,borehole,groundwater,communal standpipe,communal standpipe,functional
18839,65518,0.0,2011-07-25,Drdp Ngo,0,Artisan,30.93893,-1.403907,Tank La Zahanat,0,Lake Victoria,Murutongole,Kagera,18,1,Karagwe,Kimuli,0,True,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,dry,dry,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe,non functional
19953,59331,0.0,2011-02-28,Co,433,Co,37.105749,-6.706684,Kisumuni,0,Wami / Ruvu,Mapilipili A,Morogoro,5,1,Kilosa,Chanzuru,60,True,GeoData Consultants Ltd,VWC,,True,0,other,other,other,wug,user-group,never pay,never pay,milky,milky,enough,enough,shallow well,shallow well,groundwater,other,other,functional
40213,62409,0.0,2011-02-28,Amref,26,AMREF,39.388779,-7.019467,Shuleni,0,Wami / Ruvu,Fungoni,Pwani,60,43,Mkuranga,Vikindu,254,True,GeoData Consultants Ltd,VWC,,False,2010,india mark ii,india mark ii,handpump,vwc,user-group,never pay,never pay,soft,good,enough,enough,machine dbh,borehole,groundwater,hand pump,hand pump,non functional


In [0]:
target = 'status_group'

y_train = my_train['status_group']
y_val = my_val['status_group']

In [0]:
my_train = my_train.drop('status_group', axis=1)
my_val = my_val.drop('status_group', axis=1)

## Baseline (Mode)

In [35]:
y_train.value_counts(normalize=True)

functional                 0.543120
non functional             0.384467
functional needs repair    0.072413
Name: status_group, dtype: float64

In [37]:
y_train.mode()[0]

'functional'

In [0]:
majority_class = y_train.mode()[0]

In [0]:
y_pred = [majority_class] * len(y_train)

In [42]:
from sklearn.metrics import accuracy_score
accuracy_score(y_train, y_pred)

0.5431200897867564

In [48]:
y_pred = [majority_class] * len(y_val)
accuracy_score(y_val, y_pred)

0.542962962962963
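
The same majority-class baseline can also be expressed with scikit-learn's `DummyClassifier`, which may be convenient later because it has the same `fit`/`predict` interface as a real model. This is a minimal sketch on hypothetical labels; the names `toy_y_train`, `toy_y_val`, and the placeholder feature column are assumptions, not part of the notebook.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels standing in for y_train and y_val
toy_y_train = (['functional'] * 6 + ['non functional'] * 3
               + ['functional needs repair'])
toy_y_val = ['functional'] * 3 + ['non functional'] * 2

# The baseline ignores the features, so a placeholder column is enough
toy_X_train = [[0]] * len(toy_y_train)
toy_X_val = [[0]] * len(toy_y_val)

# 'most_frequent' always predicts the most common training label
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(toy_X_train, toy_y_train)

baseline_val_accuracy = accuracy_score(toy_y_val, baseline.predict(toy_X_val))
```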

## Logistic regression first attempt

In [0]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

In [117]:
my_train.describe(include='object')

Unnamed: 0,date_recorded,funder,installer,wpt_name,basin,subvillage,region,lga,ward,public_meeting,recorded_by,scheme_management,scheme_name,permit,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
count,44550,41815,41795,44550,44550,44277,44550,44550,44550,42036,44550,41674,23456,42244,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550
unique,349,1628,1853,29001,9,16656,21,125,2077,2,1,11,2490,2,18,13,7,12,5,7,7,8,6,5,5,10,7,3,7,6
top,2011-03-15,Government Of Tanzania,DWE,none,Lake Victoria,Madukani,Iringa,Njombe,Igosi,True,GeoData Consultants Ltd,VWC,K,True,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
freq,444,6768,12998,2674,7665,384,3998,1889,246,38240,44550,27639,511,29112,20073,20073,20073,30385,39410,19024,19024,38083,38083,24861,24861,12758,12758,34414,21414,25931


In [0]:
high_cardinal = ['funder', 'installer', 'wpt_name',
                 'subvillage', 'lga', 'ward',
                 'scheme_name']

X_train = my_train.drop(high_cardinal, axis=1)
X_val = my_val.drop(high_cardinal, axis=1)

In [0]:
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)

In [0]:
imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_val_imputed = imputer.transform(X_val_encoded)

In [0]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_val_scaled = scaler.transform(X_val_imputed)

In [122]:
model = LogisticRegressionCV(cv=5, n_jobs=-1, random_state=69)
model.fit(X_train_scaled, y_train)
model.score(X_val_scaled, y_val)



0.7396632996632997
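
The encode, impute, scale, and fit steps above could also be chained in a scikit-learn pipeline, so every transformer is fit on the training data once and then reused for validation and test rows. The sketch below is not the notebook's exact setup: it uses scikit-learn's own `OneHotEncoder` and plain `LogisticRegression` instead of `category_encoders` and `LogisticRegressionCV`, it scales only the numeric column, and the tiny data frame is made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature of the training data (values are made up)
toy_X = pd.DataFrame({
    'gps_height': [1390.0, np.nan, 686.0, 263.0, 0.0, 1221.0, 433.0, 26.0],
    'quantity': ['enough', 'dry', 'enough', 'dry',
                 'enough', 'enough', 'dry', 'dry'],
})
toy_y = ['functional', 'non functional', 'functional', 'non functional',
         'functional', 'functional', 'non functional', 'non functional']

numeric_cols = ['gps_height']
categorical_cols = ['quantity']

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the mean, then standardize
    ('num', make_pipeline(SimpleImputer(), StandardScaler()), numeric_cols),
    # Categorical columns: one-hot encode; unseen categories become all zeros
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])

pipeline = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
pipeline.fit(toy_X, toy_y)

# The same fitted steps are applied to new rows in one call
toy_new = pd.DataFrame({'gps_height': [500.0], 'quantity': ['seasonal']})
toy_pred = pipeline.predict(toy_new)
```

With this shape, generating a submission reduces to a single `pipeline.predict` call on the test features, which avoids repeating each `transform` step by hand.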

In [0]:
# estimator is your model or pipeline, which you've fit on X_train
 
# X_test is your pandas dataframe or numpy array, 
# with the same number of rows, in the same order, as test_features.csv, 
# and the same number of columns, in the same order, as X_train
 
X_test = test_features.drop(high_cardinal, axis=1)
X_test_encoded = encoder.transform(X_test)
X_test_imputed = imputer.transform(X_test_encoded)
X_test_scaled = scaler.transform(X_test_imputed)
y_pred = model.predict(X_test_scaled)
 
 
# Makes a dataframe with two columns, id and status_group, 
# and writes to a csv file, without the index
 
sample_submission = pd.read_csv('../data/waterpumps/sample_submission.csv')
submission = sample_submission.copy()
submission['status_group'] = y_pred
submission.to_csv('not_quite_there.csv', index=False)
# If you're working locally, the csv file is saved in the same directory as your notebook.

# If you're using Google Colab, you can use this code to download your submission csv file.

from google.colab import files
files.download('not_quite_there.csv')

In [131]:
# model.predict returns a NumPy array, which has no .head(); slice it instead
y_pred[:5]

## Logistic regression attempt 2 (more thought on feature selection)

In [67]:
my_train.describe(include='all')

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
count,44550.0,44550.0,44550,41815,44550.0,41795,44550.0,44550.0,44550,44550.0,44550,44277,44550,44550.0,44550.0,44550,44550,44550.0,42036,44550,41674,23456,42244,44550.0,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550,44550
unique,,,349,1628,,1853,,,29001,,9,16656,21,,,125,2077,,2,1,11,2490,2,,18,13,7,12,5,7,7,8,6,5,5,10,7,3,7,6
top,,,2011-03-15,Government Of Tanzania,,DWE,,,none,,Lake Victoria,Madukani,Iringa,,,Njombe,Igosi,,True,GeoData Consultants Ltd,VWC,K,True,,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
freq,,,444,6768,,12998,,,2674,,7665,384,3998,,,1889,246,,38240,44550,27639,511,29112,,20073,20073,20073,30385,39410,19024,19024,38083,38083,24861,24861,12758,12758,34414,21414,25931
mean,37113.344961,320.570879,,,669.227003,,34.105875,-5.707829,,0.512121,,,,15.25789,5.646218,,,177.897059,,,,,,1301.588373,,,,,,,,,,,,,,,,
std,21437.148679,3064.075752,,,693.005967,,6.512606,2.943769,,13.480397,,,,17.532836,9.649489,,,468.410805,,,,,,951.331393,,,,,,,,,,,,,,,,
min,0.0,0.0,,,-63.0,,0.0,-11.64944,,0.0,,,,1.0,0.0,,,0.0,,,,,,0.0,,,,,,,,,,,,,,,,
25%,18524.25,0.0,,,0.0,,33.10442,-8.545347,,0.0,,,,5.0,2.0,,,0.0,,,,,,0.0,,,,,,,,,,,,,,,,
50%,37123.5,0.0,,,372.0,,34.914879,-5.026543,,0.0,,,,12.0,3.0,,,25.0,,,,,,1986.0,,,,,,,,,,,,,,,,
75%,55620.75,20.0,,,1320.0,,37.184326,-3.326931,,0.0,,,,17.0,5.0,,,215.0,,,,,,2004.0,,,,,,,,,,,,,,,,
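
In the summary above, `longitude` has a minimum of 0.0 and `construction_year` has both a minimum and a 25th percentile of 0. Neither is plausible for a waterpoint in Tanzania, which suggests that zeros are standing in for missing values. One way to handle this, sketched on made-up rows, is to replace those zeros with `NaN` so that an imputer can fill them. Which columns to treat this way is an assumption that should be checked column by column.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; zeros here stand in for values that were never recorded
toy = pd.DataFrame({
    'longitude': [34.938093, 0.0, 37.460664],
    'construction_year': [1999, 0, 2009],
    'amount_tsh': [6000.0, 0.0, 25.0],  # a zero may be genuine here, so keep it
})

# Assumption: zero is not a valid value in these columns
zero_means_missing = ['longitude', 'construction_year']

cleaned = toy.copy()
cleaned[zero_means_missing] = cleaned[zero_means_missing].replace(0, np.nan)

# Number of missing values per column after the replacement
missing_counts = cleaned.isna().sum()
```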


I am looking for columns to drop. Here is the list:

- date_recorded
- funder
- installer
- wpt_name
- subvillage
- lga
- ward
- scheme_name
- num_private
- recorded_by
- extraction_type_group
- payment_type
- quantity_group
- quality_group
- source_type
- source_class
- waterpoint_type_group


Many of these are redundant, or are just broader groupings of other features. I may experiment with swapping them in and out, or with engineering new features from them.

In [76]:
my_train['water_quality'].value_counts()

soft                  38083
salty                  3674
unknown                1383
milky                   612
coloured                378
salty abandoned         269
fluoride                136
fluoride abandoned       15
Name: water_quality, dtype: int64

In [77]:
my_train['quality_group'].value_counts()

good        38083
salty        3943
unknown      1383
milky         612
colored       378
fluoride      151
Name: quality_group, dtype: int64
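
The two sets of counts above line up: `soft` and `good` both have 38083 rows, `salty` plus `salty abandoned` (3674 + 269) gives the 3943 rows in the `salty` group, and `fluoride` plus `fluoride abandoned` (136 + 15) gives 151. That suggests `quality_group` is only a coarser relabeling of `water_quality`. A quick way to check this kind of redundancy, sketched on hypothetical rows, is to count how many distinct group labels each detailed value appears with.

```python
import pandas as pd

# Hypothetical rows mimicking the two columns
toy = pd.DataFrame({
    'water_quality': ['soft', 'soft', 'salty', 'salty abandoned',
                      'fluoride', 'fluoride abandoned'],
    'quality_group': ['good', 'good', 'salty', 'salty',
                      'fluoride', 'fluoride'],
})

# For each detailed value, count the distinct group labels it appears with
groups_per_value = toy.groupby('water_quality')['quality_group'].nunique()

# True when every detailed value maps to exactly one group, i.e. the group
# column adds no information beyond the detailed column
group_is_redundant = bool((groups_per_value == 1).all())
```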

In [0]:
drop_list = ['date_recorded', 'funder', 'installer',
             'wpt_name', 'subvillage', 'lga', 
             'ward', 'scheme_name', 'num_private', 
             'recorded_by', 'extraction_type_group',
             'payment_type', 'quantity_group',
             'quality_group', 'source_type', 'source_class', 
             'waterpoint_type_group']

X_train = my_train.drop(drop_list, axis=1)
X_val = my_val.drop(drop_list, axis=1)          

In [0]:
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)

In [0]:
imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_val_imputed = imputer.transform(X_val_encoded)

In [0]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_val_scaled = scaler.transform(X_val_imputed)

In [84]:
# This model is worse than just dropping the high cardinality columns
model = LogisticRegressionCV(cv=5, n_jobs=-1, random_state=69)
model.fit(X_train_scaled, y_train)
model.score(X_val_scaled, y_val)



0.7341414141414141

## Trying with just a couple of the high-cardinality columns (installer and funder)

In [0]:
# Every column except the two we want to keep
columns_list = [header for header in my_train.columns
                if header not in ('funder', 'installer')]

In [0]:
X_train = my_train.drop(columns_list, axis=1)
X_val = my_val.drop(columns_list, axis=1)

In [0]:
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)

In [0]:
imputer = SimpleImputer()
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_val_imputed = imputer.transform(X_val_encoded)

In [0]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_val_scaled = scaler.transform(X_val_imputed)

In [115]:
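# The worst model so far (about 63% validation accuracy), but still above the 54% majority-class baseline.
# Could we assume that roughly 60% of the time we can identify bad wells based on who installed and who funded them?
# Missing values in these two columns may also have affected the result too much.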
model = LogisticRegressionCV(cv=5, n_jobs=-1, random_state=69)
model.fit(X_train_scaled, y_train)
model.score(X_val_scaled, y_val)



0.6336700336700337
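
One-hot encoding `funder` and `installer` creates one column per unique value, and the training split has 1628 and 1853 of them respectively. A common way to tame this, shown here as a rough sketch, is to keep only the most frequent values and lump everything else into a single label. The helper `keep_top_categories`, the `'OTHER'` label, and the cutoff `n` are all hypothetical choices, and the installer values are made up.

```python
import pandas as pd

def keep_top_categories(train_col, other_col, n=10, other_label='OTHER'):
    """Keep the n most frequent training values; relabel the rest.

    Hypothetical helper: the top values are chosen from the training
    column only, then applied to both columns. Missing values also
    end up under other_label.
    """
    top = train_col.value_counts().nlargest(n).index

    def reduce(col):
        # Keep values found in `top`, replace everything else
        return col.where(col.isin(top), other_label)

    return reduce(train_col), reduce(other_col)

# Made-up installer values
toy_train = pd.Series(['DWE', 'DWE', 'DWE', 'CES', 'CES', 'Artisan', 'Co'])
toy_val = pd.Series(['DWE', 'AMREF', 'CES'])

train_reduced, val_reduced = keep_top_categories(toy_train, toy_val, n=2)
```

Choosing the top values from the training column alone keeps information from the validation rows out of the preprocessing.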

## Now we will try SelectKBest and see what happens

In [109]:
from sklearn.

Unnamed: 0,funder,installer
13245,Germany Republi,CES
48583,0,0
18839,Drdp Ngo,Artisan
19953,Co,Co
40213,Amref,AMREF
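
The cell above was left unfinished. As a rough idea of what univariate feature selection with scikit-learn's `SelectKBest` can look like, here is a sketch on synthetic numeric data. The scoring function `f_classif`, the value of `k`, and the data itself are assumptions; on the waterpumps data the features would need to be encoded and imputed first.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(69)

# Synthetic data: column 0 tracks the label, columns 1-3 are pure noise
y = np.array([0] * 50 + [1] * 50)
X = rng.normal(size=(100, 4))
X[:, 0] += 3 * y

# Score every column against the label and keep the k best
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

# Boolean mask over the original columns showing which ones survived
kept_mask = selector.get_support()
```

The selector should be fit on the training split only and then applied to validation and test data with `selector.transform`.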
