Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 4

## Assignment

- [ ] Watch Aaron Gallant's [video #1](https://www.youtube.com/watch?v=pREaWFli-5I) (12 minutes) & [video #2](https://www.youtube.com/watch?v=bDQgVt4hFgY) (9 minutes) to learn about the mathematics of Logistic Regression.
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Do one-hot encoding. (Remember that it may not work well with high-cardinality categoricals.)
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your coefficients.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

> [Do Not Copy-Paste.](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit) You must type each of these exercises in, manually. If you copy and paste, you might as well not even do them. The point of these exercises is to train your hands, your brain, and your mind in how to read, write, and see code. If you copy-paste, you are cheating yourself out of the effectiveness of the lessons.


## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Clean the data. For ideas, refer to [The Quartz guide to bad data](https://github.com/Quartz/bad-data-guide), a "reference to problems seen in real-world data along with suggestions on how to resolve them." One of the issues is ["Zeros replace missing values."](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values)
- [ ] Make exploratory visualizations.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this problem, you may want to use the parameter `logistic=True`.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.
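
If you want to bin a numeric feature before plotting, a minimal sketch might look like the following. It uses a made-up toy frame rather than the real training set; only the column names `gps_height`, `status_group`, and `functional` come from this notebook, and the choice of 4 bins is arbitrary.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training set; the values are random and only
# illustrate the mechanics.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'gps_height': rng.integers(0, 2000, size=200),
    'status_group': rng.choice(['functional', 'non functional'], size=200),
})
toy['functional'] = (toy['status_group'] == 'functional').astype(int)

# Bin the numeric feature into 4 equal-sized groups (quartiles), then
# compare the share of functional pumps in each bin.
toy['gps_height_bin'] = pd.qcut(toy['gps_height'], q=4)
rate_by_bin = toy.groupby('gps_height_bin', observed=True)['functional'].mean()
print(rate_by_bin)
```

The resulting per-bin rates are what a categorical estimate plot would draw, so the binned column can be passed to Seaborn in place of the raw numeric one.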

#### High-cardinality categoricals

This code from the previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```
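
The same idea can be wrapped in a small helper so that the top values are learned from the training column only and then applied to every other split. This is a sketch, not part of the original assignment code: `reduce_cardinality` is a hypothetical name, and the toy `funder` values are invented.

```python
import pandas as pd

def reduce_cardinality(train_col, other_cols, top_n=10, other_label='OTHER'):
    """Keep the top_n most frequent training values; replace the rest."""
    # The top values are computed from the training column only
    top = train_col.value_counts().index[:top_n]

    def collapse(col):
        # Keep values found in `top`, replace everything else
        return col.where(col.isin(top), other_label)

    return collapse(train_col), [collapse(col) for col in other_cols]

# Toy columns standing in for a high-cardinality feature
train_toy = pd.Series(['a', 'a', 'a', 'b', 'b', 'c'], name='funder')
val_toy = pd.Series(['a', 'c', 'd'], name='funder')

train_reduced, (val_reduced,) = reduce_cardinality(train_toy, [val_toy], top_n=2)
print(train_reduced.tolist())  # ['a', 'a', 'a', 'b', 'b', 'OTHER']
print(val_reduced.tolist())    # ['a', 'OTHER', 'OTHER']
```

Values that never appear in the training data (such as `'d'` above) also end up as `'OTHER'`, which keeps the set of categories identical across splits before one-hot encoding.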

#### Pipelines

The [Scikit-Learn User Guide](https://scikit-learn.org/stable/modules/compose.html) explains why pipelines are useful and demonstrates how to use them:

> Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:
> - **Convenience and encapsulation.** You only have to call fit and predict once on your data to fit a whole sequence of estimators.
> - **Joint parameter selection.** You can grid search over parameters of all estimators in the pipeline at once.
> - **Safety.** Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
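
As a rough illustration of the points quoted above, here is a minimal pipeline that chains scaling and logistic regression. It runs on synthetic data from `make_classification`, not the waterpumps data, and the hyperparameters (`max_iter=1000`, an 80/20 split) are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# One object holds the scaler and the model
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# fit() fits the scaler on the training data only, then fits the model
pipeline.fit(X_tr, y_tr)

# score() applies the already-fitted scaler to the validation data
val_accuracy = pipeline.score(X_va, y_va)
print('Validation Accuracy', val_accuracy)
```

Because the scaler is fitted inside `fit`, there is no separate `scaler.transform` call to forget or to apply to the wrong split.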

### Reading
- [ ] [How (and why) to create a good validation set](https://www.fast.ai/2017/11/13/validation-sets/)
- [ ] [Always start with a stupid model, no exceptions](https://blog.insightdatascience.com/always-start-with-a-stupid-model-no-exceptions-3a22314b9aaa)
- [ ] [Statistical Modeling: The Two Cultures](https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726)
- [ ] [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way (without an excessive amount of formulas or academic pre-requisites).



In [1]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # category_encoders, version >= 2.0
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade category_encoders pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module4')

In [2]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [3]:
import pandas as pd

train_features = pd.read_csv('../data/tanzania/train_features.csv')
train_labels = pd.read_csv('../data/tanzania/train_labels.csv')
test_features = pd.read_csv('../data/tanzania/test_features.csv')
sample_submission = pd.read_csv('../data/tanzania/sample_submission.csv')

assert train_features.shape == (59400, 40)
assert train_labels.shape == (59400, 2)
assert test_features.shape == (14358, 40)
assert sample_submission.shape == (14358, 2)

In [4]:
# Imports, then display settings

%matplotlib inline
import category_encoders as ce
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.linear_model import LinearRegression, LogisticRegressionCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

pd.set_option('display.max_columns', None)
plt.style.use('dark_background')

In [5]:
# Inspect column dtypes (date_recorded is stored as object, not datetime)
train_features.dtypes

id                         int64
amount_tsh               float64
date_recorded             object
funder                    object
gps_height                 int64
installer                 object
longitude                float64
latitude                 float64
wpt_name                  object
num_private                int64
basin                     object
subvillage                object
region                    object
region_code                int64
district_code              int64
lga                       object
ward                      object
population                 int64
public_meeting            object
recorded_by               object
scheme_management         object
scheme_name               object
permit                    object
construction_year          int64
extraction_type           object
extraction_type_group     object
extraction_type_class     object
management                object
management_group          object
payment                   object
payment_ty

In [6]:
# Train-validation split

X_train = train_features.copy()
y_train = train_labels['status_group'].copy()

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, train_size = 0.80, test_size = 0.20,
    stratify = y_train, random_state = 42
)

X_train.shape, X_val.shape, y_train.shape, y_val.shape

((47520, 40), (11880, 40), (47520,), (11880,))
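
Before fitting anything, it can help to know what accuracy a majority-class guess would reach on the validation set, in the spirit of the "start with a stupid model" reading. The sketch below uses a handful of invented labels in place of `y_train` and `y_val`.

```python
import pandas as pd

# Toy labels standing in for y_train / y_val
y_train_toy = pd.Series(
    ['functional'] * 6 + ['non functional'] * 3 + ['functional needs repair']
)
y_val_toy = pd.Series(
    ['functional', 'non functional', 'functional', 'functional needs repair']
)

# The baseline always predicts the most frequent training class
majority_class = y_train_toy.mode()[0]

# Baseline accuracy: share of validation labels equal to that class
baseline_accuracy = (y_val_toy == majority_class).mean()
print(majority_class, baseline_accuracy)  # functional 0.5
```

Any model's validation accuracy is only meaningful relative to this number.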

In [7]:
for col in X_train.select_dtypes('object').columns:
    print(col)

date_recorded
funder
installer
wpt_name
basin
subvillage
region
lga
ward
public_meeting
recorded_by
scheme_management
scheme_name
permit
extraction_type
extraction_type_group
extraction_type_class
management
management_group
payment
payment_type
water_quality
quality_group
quantity
quantity_group
source
source_type
source_class
waterpoint_type
waterpoint_type_group


In [8]:
# Fill missing values in object columns with the training-set mode
for column in X_train.select_dtypes('object').columns:
    mode = X_train[column].mode()[0]
    X_train[column] = X_train[column].fillna(mode)
    X_val[column] = X_val[column].fillna(mode)
    test_features[column] = test_features[column].fillna(mode)
    
# Fill missing values in numeric columns with the training-set mean
for column in X_train.select_dtypes('number').columns:
    mean = X_train[column].mean()
    X_train[column] = X_train[column].fillna(mean)
    X_val[column] = X_val[column].fillna(mean)
    test_features[column] = test_features[column].fillna(mean)
    
X_val.isna().sum()

id                       0
amount_tsh               0
date_recorded            0
funder                   0
gps_height               0
installer                0
longitude                0
latitude                 0
wpt_name                 0
num_private              0
basin                    0
subvillage               0
region                   0
region_code              0
district_code            0
lga                      0
ward                     0
population               0
public_meeting           0
recorded_by              0
scheme_management        0
scheme_name              0
permit                   0
construction_year        0
extraction_type          0
extraction_type_group    0
extraction_type_class    0
management               0
management_group         0
payment                  0
payment_type             0
water_quality            0
quality_group            0
quantity                 0
quantity_group           0
source                   0
source_type              0
s
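
An alternative to the manual loops is scikit-learn's `SimpleImputer`, which learns the fill values in `fit` and reuses them in `transform`. This is a sketch on invented toy frames; only the column names `permit` and `population` are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames with missing values (made-up data)
train_toy = pd.DataFrame({
    'permit': pd.Series(['True', 'True', np.nan, 'False'], dtype=object),
    'population': [10.0, np.nan, 30.0, 20.0],
})
val_toy = pd.DataFrame({
    'permit': pd.Series([np.nan, 'False'], dtype=object),
    'population': [np.nan, 5.0],
})

cat_imputer = SimpleImputer(strategy='most_frequent')
num_imputer = SimpleImputer(strategy='mean')

# Learn the fill values from the training frame only
train_toy['permit'] = cat_imputer.fit_transform(train_toy[['permit']]).ravel()
train_toy['population'] = num_imputer.fit_transform(train_toy[['population']]).ravel()

# Reuse the training-set mode and mean on the validation frame
val_toy['permit'] = cat_imputer.transform(val_toy[['permit']]).ravel()
val_toy['population'] = num_imputer.transform(val_toy[['population']]).ravel()

print(val_toy)
```

The imputers can also be placed inside a pipeline, so the fill values are always learned from whichever data `fit` receives.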

In [9]:
# Add feature - pump age
X_train['pump_age'] = 2013 - X_train['construction_year']
X_val['pump_age'] = 2013 - X_val['construction_year']
test_features['pump_age'] = 2013 - test_features['construction_year']
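
One caveat, assuming that a `construction_year` of 0 stands for a missing value (the "Zeros replace missing values" problem from the Quartz guide): subtracting it from 2013 yields a pump age of 2013 years. A hedged sketch of one way to handle this on toy data follows; the `construction_year_missing` flag is a hypothetical extra column, not something the notebook creates.

```python
import numpy as np
import pandas as pd

# Toy column; 0 is assumed to mean "year unknown"
toy = pd.DataFrame({'construction_year': [1999, 0, 2010, 0]})

# Turn the placeholder zeros into NaN before doing arithmetic
years = toy['construction_year'].replace(0, np.nan)

# Age relative to 2013, the reference year used in this notebook
toy['pump_age'] = 2013 - years

# Optional indicator so a model can still see that the year was missing
toy['construction_year_missing'] = years.isna()

print(toy)
```

The resulting NaN ages would still need to be imputed (for example with the training-set mean or median) before fitting a logistic regression.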

In [10]:
test_features['latitude'].head()

0    -4.059696
1    -3.309214
2    -5.004344
3    -9.418672
4   -10.950412
Name: latitude, dtype: float64

In [11]:
# Add feature - Distance from Dodoma, measured in degrees.
# Dodoma lies south of the equator (about 6.1630 S, 35.7516 E), so its
# latitude is negative, matching the negative latitudes in the data.
DODOMA_LAT, DODOMA_LON = -6.1630, 35.7516
for df in (X_train, X_val, test_features):
    df['dodomadistance'] = ((df['latitude'] - DODOMA_LAT)**2
                            + (df['longitude'] - DODOMA_LON)**2)**0.5

In [12]:
test_features.dodomadistance.max()

36.278912219173826
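
The feature above is a straight-line distance in degrees. If a distance in kilometres is preferred, the haversine formula is one common option. The sketch below is an alternative, not what this notebook uses; the Earth radius of 6371 km and the Dar es Salaam coordinates are approximate values chosen for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, approximate

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Dodoma: about 6.1630 S, 35.7516 E (southern latitudes are negative)
DODOMA_LAT, DODOMA_LON = -6.1630, 35.7516

# Works on scalars as well as NumPy arrays or pandas Series
lats = np.array([-6.1630, -6.7924])   # Dodoma, Dar es Salaam (approximate)
lons = np.array([35.7516, 39.2083])
distances = haversine_km(lats, lons, DODOMA_LAT, DODOMA_LON)
print(distances)
```

Rows with placeholder coordinates (a longitude of 0, for instance) will still produce a large, misleading distance under either formula, so they are worth cleaning first.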

In [13]:
# Mapping the ys to integers for the encoder
mapdict = {
    'functional': 1,
    'non functional': -1,
    'functional needs repair': 0
}
y_train = y_train.map(mapdict)
y_val = y_val.map(mapdict)

In [14]:
# Target-encode the categoricals with CatBoostEncoder, scale, and fit a
# first logistic regression; the encoded features are ranked in the next cell

categoryfeatures = X_train.select_dtypes(include = 'object').columns

encoder = ce.cat_boost.CatBoostEncoder()
X_train_encoded = encoder.fit_transform(X_train, y_train)
X_val_encoded = encoder.transform(X_val)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_encoded)
X_val_scaled = scaler.transform(X_val_encoded)

model = LogisticRegressionCV()
model.fit(X_train_scaled, y_train)
print('Validation Accuracy', model.score(X_val_scaled, y_val))



Validation Accuracy 0.7553030303030303
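
The assignment also asks for the coefficients. A minimal sketch of how they could be collected into a labeled table is shown below, on synthetic three-class data with a plain `LogisticRegression`; the feature names are placeholders. With the fitted `model` above, `model.coef_` and `X_train_encoded.columns` would play the same roles.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the scaled training set
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]

X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, y)

# For a multiclass target, coef_ has one row per class and one column per feature
coefficients = pd.DataFrame(clf.coef_, index=clf.classes_, columns=feature_names)
print(coefficients.round(2))

# To plot, assuming matplotlib is installed:
# coefficients.T.plot.barh()
```

Because the features are standardized first, the coefficient magnitudes are comparable across features.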


In [15]:
# Selecting best features
# Note: k = 42 equals the total number of columns, so every feature is kept;
# lower k to actually drop features. f_regression treats the -1/0/1 labels
# as a numeric target.

selector = SelectKBest(score_func=f_regression, k = 42)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)

X_train_selected.shape

(47520, 42)
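
For a classification target, `f_classif` (an ANOVA F-test) is the scoring function scikit-learn provides for this purpose, and `k` needs to be smaller than the number of columns for anything to be dropped. The following sketch runs on synthetic data; `k=4` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic three-class data with 10 features, 3 of them informative
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Keep the 4 features with the highest F-scores
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)

# Column indices of the features that were kept
kept = selector.get_support(indices=True)
print(X_selected.shape, kept)
```

In practice, `k` is usually tuned by comparing validation accuracy across several values.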

In [16]:
# List selected features

all_names = X_train.columns
selected_mask = selector.get_support()
selected_names = all_names[selected_mask]

for name in selected_names:
    print(name)

id
amount_tsh
date_recorded
funder
gps_height
installer
longitude
latitude
wpt_name
num_private
basin
subvillage
region
region_code
district_code
lga
ward
population
public_meeting
recorded_by
scheme_management
scheme_name
permit
construction_year
extraction_type
extraction_type_group
extraction_type_class
management
management_group
payment
payment_type
water_quality
quality_group
quantity
quantity_group
source
source_type
source_class
waterpoint_type
waterpoint_type_group
pump_age
dodomadistance


In [17]:
# Cat-boosted logistic with best features

X_train_subset = X_train[selected_names]
X_val_subset = X_val[selected_names]

encoder2 = ce.cat_boost.CatBoostEncoder()
X_train_encoded2 = encoder2.fit_transform(X_train_subset, y_train)
X_val_encoded2 = encoder2.transform(X_val_subset)

scaler2 = StandardScaler()
X_train_scaled2 = scaler2.fit_transform(X_train_encoded2)
X_val_scaled2 = scaler2.transform(X_val_encoded2)

model2 = LogisticRegressionCV()
model2.fit(X_train_scaled2, y_train)
print('Validation Accuracy', model2.score(X_val_scaled2, y_val))



Validation Accuracy 0.7553030303030303


In [18]:
# Transforming the test data
X_test_subset = test_features[selected_names]
X_test_encoded = encoder2.transform(X_test_subset)
X_test_scaled = scaler2.transform(X_test_encoded)
assert all(X_test_encoded.columns == X_train_encoded2.columns)

In [21]:
X_test_encoded.isna().sum()

id                       0
amount_tsh               0
date_recorded            0
funder                   0
gps_height               0
installer                0
longitude                0
latitude                 0
wpt_name                 0
num_private              0
basin                    0
subvillage               0
region                   0
region_code              0
district_code            0
lga                      0
ward                     0
population               0
public_meeting           0
recorded_by              0
scheme_management        0
scheme_name              0
permit                   0
construction_year        0
extraction_type          0
extraction_type_group    0
extraction_type_class    0
management               0
management_group         0
payment                  0
payment_type             0
water_quality            0
quality_group            0
quantity                 0
quantity_group           0
source                   0
source_type              0
s

In [22]:
# Predicting the test data
y_test_pred = model2.predict(X_test_scaled)

In [23]:
# Unmapping the prediction
y_test_pred = pd.Series(y_test_pred)
unmapdict = {value: key for key, value in mapdict.items()}
y_test_pred = y_test_pred.map(unmapdict)

In [24]:
# Formatting submission
submission = sample_submission.copy()
submission['status_group'] = y_test_pred
submission.to_csv('submission-02.csv', index = False)