<a href="https://colab.research.google.com/github/davidanagy/DS-Unit-2-Kaggle-Challenge/blob/master/module2/assignment_kaggle_challenge_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 2

## Assignment
- [ ] Read [“Adopting a Hypothesis-Driven Workflow”](https://outline.com/5S5tsB), a blog post by a Lambda DS student about the Tanzania Waterpumps challenge.
- [ ] Continue to participate in our Kaggle challenge.
- [ ] Try Ordinal Encoding.
- [ ] Try a Random Forest Classifier.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- [ ] Try other [categorical encodings](https://contrib.scikit-learn.org/categorical-encoding/).
- [ ] Get and plot your feature importances.
- [ ] Make visualizations and share on Slack.

### Reading

Top recommendations in _**bold italic:**_

#### Decision Trees
- A Visual Introduction to Machine Learning, [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/), and _**[Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)**_
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU)

#### Random Forests
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/), Chapter 8: Tree-Based Methods
- [Coloring with Random Forests](http://structuringtheunstructured.blogspot.com/2017/11/coloring-with-random-forests.html)
- _**[Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)**_

#### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- _**[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)**_
- _**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)**_
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

#### Imposter Syndrome
- [Effort Shock and Reward Shock (How The Karate Kid Ruined The Modern World)](http://www.tempobook.com/2014/07/09/effort-shock-and-reward-shock/)
- [How to manage impostor syndrome in data science](https://towardsdatascience.com/how-to-manage-impostor-syndrome-in-data-science-ad814809f068)
- ["I am not a real data scientist"](https://brohrer.github.io/imposter_syndrome.html)
- _**[Imposter Syndrome in Data Science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/)**_






### Setup

You can work locally (follow the [local setup instructions](https://lambdaschool.github.io/ds/unit2/local/)) or on Colab (run the code cell below).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [3]:
# Copying my earlier code...

train, val = train_test_split(train, stratify=train['status_group'], random_state=77)

train.shape, val.shape

((44550, 41), (14850, 41))
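
With the default `test_size`, `train_test_split` holds out 25% of the rows (14,850 of 59,400), and `stratify` keeps the class mix of `status_group` roughly the same in both parts. A minimal sketch on made-up labels (the toy frame and its 60/30/10 class mix are assumptions for illustration, not the real data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, imbalanced labels standing in for status_group (made-up data).
toy = pd.DataFrame({
    'feature': range(200),
    'status_group': (['functional'] * 120
                     + ['non functional'] * 60
                     + ['functional needs repair'] * 20),
})

toy_train, toy_val = train_test_split(
    toy, stratify=toy['status_group'], random_state=77)

# Class shares in the validation part should closely match the full data.
full_shares = toy['status_group'].value_counts(normalize=True)
val_shares = toy_val['status_group'].value_counts(normalize=True)
print(pd.DataFrame({'full': full_shares, 'val': val_shares}))
```

Without `stratify`, the rare class could end up over- or under-represented in the validation part purely by chance.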

In [0]:
# I'm going to try iterating. I'll hold off on dropping columns and reducing cardinality for now.

import numpy as np

def remove_zeroes(X):
  X = X.copy()
  
  X['latitude'] = X['latitude'].replace(-2e-08, 0)
  
  zeroes = ['gps_height', 'longitude', 'latitude', 'population', 'construction_year']
  for col in zeroes:
    X[col] = X[col].replace(0, np.nan)
  
  return X

def datetime_features(X):
  X = X.copy()
  
  X['date_recorded'] = pd.to_datetime(X['date_recorded'])  # infer_datetime_format is deprecated in pandas 2.x
  
  X['year_recorded'] = X['date_recorded'].dt.year
  
  X['construction_year'] = X['construction_year'].fillna(np.around(np.mean(X['construction_year']), decimals=0))
  
  X['time_to_inspection'] = X['year_recorded'] - X['construction_year']
  
  return X

def engineer_features(X):
  X = X.copy()
  
  X = remove_zeroes(X)
  X = datetime_features(X)
  X = X.drop(['id', 'status_group', 'date_recorded'], axis=1)
  
  return X

X_train = engineer_features(train)
y_train = train['status_group']
X_val = engineer_features(val)
y_val = val['status_group']
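
One thing to watch in `datetime_features`: the fill value for `construction_year` is the mean of whichever frame is passed in, so `train`, `val`, and `test` are each filled with their own mean. A possible alternative, sketched below on toy data, learns the value from the training rows and reuses it elsewhere. The helper names `fit_year_fill` and `apply_year_fill` are hypothetical, and whether this changes the score here is untested.

```python
import numpy as np
import pandas as pd

def fit_year_fill(X_fit):
    """Rounded mean of the non-zero construction years in the fitting split."""
    years = X_fit['construction_year'].replace(0, np.nan)
    return np.around(years.mean())

def apply_year_fill(X, fill_value):
    """Treat 0 as missing and fill it with a value learned elsewhere."""
    X = X.copy()
    X['construction_year'] = (
        X['construction_year'].replace(0, np.nan).fillna(fill_value))
    return X

# Made-up rows; 0 marks an unknown construction year.
toy_train = pd.DataFrame({'construction_year': [1990, 2000, 0, 2010]})
toy_val = pd.DataFrame({'construction_year': [0, 1980]})

fill = fit_year_fill(toy_train)               # learned from training rows only
toy_val_filled = apply_year_fill(toy_val, fill)
print(fill, toy_val_filled['construction_year'].tolist())
```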

In [5]:
# Trying ordinal encoding.

import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    DecisionTreeClassifier(min_samples_leaf=14, max_depth=29, random_state=90)
)

pipeline.fit(X_train, y_train)
print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

Train Accuracy 0.8270707070707071
Validation Accuracy 0.7618855218855218
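
As a rough picture of what ordinal encoding does: each distinct category is swapped for an integer, so a column with thousands of categories stays a single column. The sketch below is a simplified pandas stand-in, not the `category_encoders` implementation; the helper names and the `-1` code for unseen categories are assumptions for illustration.

```python
import pandas as pd

def fit_ordinal_mapping(series):
    """Map each distinct category to 1, 2, 3, ... in order of first appearance."""
    categories = pd.unique(series.dropna())
    return {category: code for code, category in enumerate(categories, start=1)}

def apply_ordinal_mapping(series, mapping, unknown=-1):
    """Encode a column; values not seen during fitting get `unknown`."""
    return series.map(mapping).fillna(unknown).astype(int)

# Made-up values for a single categorical column.
train_basin = pd.Series(['Lake Victoria', 'Internal', 'Rufiji', 'Internal'])
val_basin = pd.Series(['Rufiji', 'Pangani'])

mapping = fit_ordinal_mapping(train_basin)
encoded_val = apply_ordinal_mapping(val_basin, mapping).tolist()
print(mapping)
print(encoded_val)
```

The integers carry no real order, but a tree can still isolate groups of categories by splitting on the code several times.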


In [6]:
# Now let's see what happens when I drop redundant columns.

def drop_redundant(X):
  X = X.copy()
  
  redundant_cols = ['recorded_by', 'extraction_type_group', 'extraction_type_class', 'management_group', 'payment_type',
                   'quality_group', 'quantity_group', 'source_type', 'source_class', 'waterpoint_type_group',
                   'region_code', 'district_code', 'date_recorded', 'id', 'status_group']
  
  X = X.drop(columns=redundant_cols)
    
  return X

def engineer_features(X):
  X = X.copy()
  
  X = remove_zeroes(X)
  X = datetime_features(X)
  X = drop_redundant(X)
  
  return X

X_train = engineer_features(train)
y_train = train['status_group']
X_val = engineer_features(val)
y_val = val['status_group']

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    DecisionTreeClassifier(min_samples_leaf=14, max_depth=29, random_state=90)
)

pipeline.fit(X_train, y_train)
print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Validation accuracy increases a little bit (0.7619 to 0.7636), but it's still not as good as with one-hot encoding.

Train Accuracy 0.8286419753086419
Validation Accuracy 0.7636363636363637


In [7]:
# What if I use one-hot encoding for low cardinality, and ordinal encoding for high cardinality?

cardinality = X_train.select_dtypes(exclude='number').nunique()
  
high_cardinality = cardinality[cardinality > 50].index.tolist()

low_cardinality = cardinality[cardinality <= 50].index.tolist()

pipeline = make_pipeline(
    ce.OrdinalEncoder(cols=high_cardinality),
    ce.OneHotEncoder(cols=low_cardinality),
    SimpleImputer(strategy='median'),
    DecisionTreeClassifier(min_samples_leaf=14, max_depth=29, random_state=90)
)

pipeline.fit(X_train, y_train)
print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Validation accuracy is lower (0.7603 vs. 0.7636)! One-hot combined with reducing cardinality seems the best, at least for single trees.

Train Accuracy 0.8264421997755331
Validation Accuracy 0.7603367003367003
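
For comparison, scikit-learn's own `ColumnTransformer` can route different columns to different encoders in one step, instead of chaining two `category_encoders` transformers. This is a swapped-in technique rather than what the cell above does, and the toy columns and values are made up:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

toy_X = pd.DataFrame({
    'ward': ['a', 'b', 'c', 'd', 'e', 'f'],   # pretend this is high cardinality
    'quantity': ['dry', 'enough', 'dry', 'enough', 'dry', 'enough'],
    'amount_tsh': [0.0, 25.0, 0.0, 50.0, 0.0, 10.0],
})
toy_y = ['non functional', 'functional'] * 3

encoders = ColumnTransformer(
    [
        # Unseen categories at predict time are encoded as -1.
        ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value',
                                   unknown_value=-1), ['ward']),
        # Unseen categories at predict time become all-zero indicator rows.
        ('onehot', OneHotEncoder(handle_unknown='ignore'), ['quantity']),
    ],
    remainder='passthrough',  # numeric columns pass through unchanged
)

toy_pipeline = make_pipeline(encoders, DecisionTreeClassifier(random_state=90))
toy_pipeline.fit(toy_X, toy_y)

encoded = encoders.transform(toy_X)
print(encoded.shape)  # 1 ordinal + 2 one-hot + 1 numeric column
```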


In [8]:
from sklearn.ensemble import RandomForestClassifier

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=100, n_jobs=-1)
)

pipeline.fit(X_train, y_train)
print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Now that's high!

Train Accuracy 0.9999775533108867
Validation Accuracy 0.8058585858585858
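
A train accuracy near 1.0 is typical for a forest of fully grown trees, so the validation score is the number to watch. The sketch below uses synthetic data (an assumption, not the waterpumps set) to show how a setting like `min_samples_leaf` changes the train score; the value 25 is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; flip_y adds label noise that cannot be learned.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=77)

scores = {}
for min_samples_leaf in (1, 25):
    forest = RandomForestClassifier(n_estimators=100, random_state=100,
                                    min_samples_leaf=min_samples_leaf)
    forest.fit(X_tr, y_tr)
    # Store (train accuracy, validation accuracy) for each setting.
    scores[min_samples_leaf] = (forest.score(X_tr, y_tr),
                                forest.score(X_va, y_va))
    print(min_samples_leaf, scores[min_samples_leaf])
```

Whether a larger `min_samples_leaf` helps validation accuracy on the real data would have to be checked; the sketch only illustrates that a smaller train score is not a problem by itself.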


In [9]:
# What if I reduce cardinality and one-hot encode?

def reduce_cardinality(X):
  X = X.copy()
  
  cardinality = X.select_dtypes(exclude='number').nunique()
  
  high_cardinality = cardinality[cardinality > 50].index.tolist()
  
  for feature in high_cardinality:
    top10 = X[feature].value_counts()[:10].index
    X.loc[~X[feature].isin(top10), feature] = 'OTHER'
  
  return X

def engineer_features(X):
  X = X.copy()
  
  X = remove_zeroes(X)
  X = datetime_features(X)
  X = drop_redundant(X)
  X = reduce_cardinality(X)
  
  return X

X_train = engineer_features(train)
X_val = engineer_features(val)

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=100, n_jobs=-1)
)

pipeline.fit(X_train, y_train)
print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Validation accuracy is lower (0.8048 vs. 0.8059)!

Train Accuracy 0.9970145903479237
Validation Accuracy 0.8048484848484848
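
Note that `reduce_cardinality` picks the high-cardinality columns and their top 10 values separately for each frame it receives, so `train` and `val` can end up with different kept categories. A possible variant, sketched on toy data, learns the kept categories from the training rows and applies them everywhere. The helper names are hypothetical.

```python
import pandas as pd

def fit_top_categories(X_fit, feature, n=10):
    """Return the n most frequent values of `feature` in the fitting split."""
    return X_fit[feature].value_counts().index[:n]

def apply_top_categories(X, feature, top):
    """Replace every value outside `top` with 'OTHER' (missing values too)."""
    X = X.copy()
    X.loc[~X[feature].isin(top), feature] = 'OTHER'
    return X

# Made-up funder names.
toy_train = pd.DataFrame({'funder': ['A', 'A', 'A', 'B', 'B', 'C']})
toy_val = pd.DataFrame({'funder': ['C', 'C', 'A', 'D']})

top2 = fit_top_categories(toy_train, 'funder', n=2)   # learned from train only
reduced_val = apply_top_categories(toy_val, 'funder', top2)['funder'].tolist()
print(list(top2), reduced_val)
```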


In [10]:
def engineer_features(X):
  X = X.copy()
  
  X = remove_zeroes(X)
  X = datetime_features(X)
  X = drop_redundant(X)
  
  return X

X_train = engineer_features(train)
X_val = engineer_features(val)

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=100, n_jobs=-1)
)

pipeline.fit(X_train, y_train)
print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

Train Accuracy 0.9999775533108867
Validation Accuracy 0.8058585858585858


In [11]:
# Have to re-define so I'm not dropping "status_group," since 'test' has no status_group column!

def drop_redundant(X):
  X = X.copy()
  
  redundant_cols = ['recorded_by', 'extraction_type_group', 'extraction_type_class', 'management_group', 'payment_type',
                   'quality_group', 'quantity_group', 'source_type', 'source_class', 'waterpoint_type_group',
                   'region_code', 'district_code', 'date_recorded', 'id']
  
  X = X.drop(columns=redundant_cols)
    
  return X

X_test = engineer_features(test)

y_pred = pd.DataFrame(pipeline.predict(X_test), columns=['status_group'])

submission1 = pd.concat([test['id'], y_pred], axis=1)

submission1.head()

| | id | status_group |
|---|---|---|
| 0 | 50785 | non functional |
| 1 | 51630 | functional |
| 2 | 17168 | functional |
| 3 | 45559 | non functional |
| 4 | 49871 | functional |


In [0]:
submission1.to_csv('water-submission-07.csv', index=None, header=True)
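
The `pd.concat(..., axis=1)` call above lines rows up by index label, which works here because `test` and `y_pred` both carry a default 0..n-1 index. If `test` were ever filtered or shuffled first, the labels would no longer match. A small sketch of the difference, using a deliberately non-default index on made-up rows:

```python
import numpy as np
import pandas as pd

# Toy test frame whose index does not start at 0.
toy_test = pd.DataFrame({'id': [50785, 51630, 17168]}, index=[10, 11, 12])
toy_pred = np.array(['non functional', 'functional', 'functional'])

# concat aligns on index labels, so mismatched labels yield NaN-padded rows.
misaligned = pd.concat(
    [toy_test['id'], pd.DataFrame(toy_pred, columns=['status_group'])], axis=1)

# Building the frame from plain arrays pairs rows by position instead.
submission = pd.DataFrame({'id': toy_test['id'].to_numpy(),
                           'status_group': toy_pred})
print(misaligned.shape, submission.shape)
```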

In [13]:
X_train.head()

Unnamed: 0,amount_tsh,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,lga,ward,population,public_meeting,scheme_management,scheme_name,permit,construction_year,extraction_type,management,payment,water_quality,quantity,source,waterpoint_type,year_recorded,time_to_inspection
58517,0.0,Wvt,,WVT,,,Mwayai,0,Lake Victoria,Sasanda,Shinyanga,Bariadi,Bumera,,True,WUG,,False,1997.0,nira/tanira,wug,never pay,soft,enough,shallow well,hand pump,2013,16.0
53717,0.0,Government Of Tanzania,,RWE,31.770808,-1.003277,Kamenge,0,Lake Victoria,Bukuma,Kagera,Misenyi,Kanyigo,,True,VWC,Kan,True,1997.0,gravity,vwc,never pay,soft,insufficient,spring,communal standpipe,2011,14.0
5580,0.0,Twesa,,TWESA,33.405885,-3.504091,Azimio,0,Internal,Kadoto B,Shinyanga,Shinyanga Rural,Pandagichiza,,True,WUG,,True,1997.0,nira/tanira,wug,never pay,soft,insufficient,shallow well,hand pump,2012,15.0
47950,0.0,Patuu,,PATUU,36.117722,-6.869506,Wiyenzele Primary School,0,Rufiji,Wiyenzele,Dodoma,Mpwapwa,Mlunduzi,,True,VWC,,True,1997.0,gravity,vwc,never pay,soft,seasonal,rainwater harvesting,communal standpipe,2011,14.0
10488,0.0,Lwi,,LWI,,,Muungano,0,Lake Victoria,Madukani,Mwanza,Magu,Nkungulu,,,WUG,,False,1997.0,india mark iii,wug,unknown,soft,enough,shallow well,hand pump,2012,15.0


In [14]:
# Going to try dropping fewer columns.

def drop_redundant(X):
  X = X.copy()
  
  redundant_cols = ['recorded_by', 'payment_type', 'region_code', 'date_recorded', 'id']
  
  X = X.drop(columns=redundant_cols)
    
  return X

def engineer_features(X):
  X = X.copy()
  
  X = remove_zeroes(X)
  X = datetime_features(X)
  X = drop_redundant(X)
  
  return X

X_train = engineer_features(train).drop('status_group', axis=1)
X_val = engineer_features(val).drop('status_group', axis=1)

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=100, n_jobs=-1)
)

pipeline.fit(X_train, y_train)
print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Validation accuracy goes up (0.8059 to 0.8079)!

Train Accuracy 0.9999775533108867
Validation Accuracy 0.8078787878787879
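
For the feature-importances stretch goal, the fitted forest can be pulled out of a pipeline through `named_steps`. The sketch below runs on synthetic data with placeholder column names (both assumptions); on the real pipeline the same idea should apply as long as the encoder keeps one output column per input column, as ordinal encoding does.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with placeholder feature names.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(6)])

toy_pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=100),
)
toy_pipeline.fit(X, y)

# make_pipeline names each step after its lowercased class name.
forest = toy_pipeline.named_steps['randomforestclassifier']
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

With matplotlib installed, `importances.sort_values().plot.barh()` would turn the same Series into a bar chart.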


In [0]:
X_test = engineer_features(test)

y_pred = pd.DataFrame(pipeline.predict(X_test), columns=['status_group'])

submission2 = pd.concat([test['id'], y_pred], axis=1)

submission2.to_csv('water-submission-08.csv', index=None, header=True)

In [16]:
# Testing the application of two encoders at once.

cardinality = X_train.select_dtypes(exclude='number').nunique()
  
high_cardinality = cardinality[cardinality > 50].index.tolist()

low_cardinality = cardinality[cardinality <= 50].index.tolist()

pipeline = make_pipeline(
    ce.OrdinalEncoder(cols=high_cardinality),
    ce.OneHotEncoder(cols=low_cardinality),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=100, n_jobs=-1)
)

pipeline.fit(X_train, y_train)
print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Validation accuracy goes up very slightly (0.8079 to 0.8081).

Train Accuracy 0.9999775533108867
Validation Accuracy 0.8081481481481482


In [0]:
y_pred = pd.DataFrame(pipeline.predict(X_test), columns=['status_group'])

submission3 = pd.concat([test['id'], y_pred], axis=1)

submission3.to_csv('water-submission-09.csv', index=None, header=True)

# Sadly this ended up having a lower test accuracy...
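
A likely reason a validation gain does not carry over: with 14,850 validation rows and accuracy near 0.81, the binomial standard error is roughly 0.003, which is larger than the 0.0002 to 0.002 differences between these experiments. Cross-validation gives a mean and a spread instead of a single number. A minimal sketch on synthetic data (an assumption; the real pipeline could be passed to `cross_val_score` in the same way):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=100)

# Five folds: each row is used for validation exactly once.
fold_scores = cross_val_score(forest, X, y, cv=5, scoring='accuracy')

print('Fold accuracies:', fold_scores.round(3))
print('Mean:', round(fold_scores.mean(), 3), 'Std:', round(fold_scores.std(), 3))
```

If two setups differ by less than the spread across folds, the difference is probably not worth a leaderboard submission.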

In [18]:
# Trying an iterative imputer.

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    IterativeImputer(max_iter=100, initial_strategy='median', min_value=0),
    RandomForestClassifier(n_estimators=100, random_state=100, n_jobs=-1)
)

pipeline.fit(X_train, y_train)
print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Accuracy goes down

Train Accuracy 0.9999775533108867
Validation Accuracy 0.8046464646464646


In [19]:
# Trying a "most frequent" strategy.

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='most_frequent'),
    RandomForestClassifier(n_estimators=100, random_state=100, n_jobs=-1)
)

pipeline.fit(X_train, y_train)
print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Accuracy goes down (by a bit)

Train Accuracy 0.9999775533108867
Validation Accuracy 0.8074747474747475


In [20]:
X_train.describe()

| | amount_tsh | gps_height | longitude | latitude | num_private | district_code | population | construction_year | year_recorded | time_to_inspection |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 44550.0 | 29104.0 | 43171.0 | 43171.0 | 44550.0 | 44550.0 | 28427.0 | 44550.0 | 44550.0 | 44550.0 |
| mean | 320.565308 | 1016.484779 | 35.153192 | -5.895653 | 0.473692 | 5.629001 | 282.31449 | 1996.885926 | 2011.920135 | 15.034209 |
| std | 3150.47797 | 613.434859 | 2.604573 | 2.81298 | 11.594878 | 9.650221 | 581.84945 | 10.058957 | 0.958777 | 10.095001 |
| min | 0.0 | -90.0 | 29.607201 | -11.64944 | 0.0 | 0.0 | 1.0 | 1960.0 | 2002.0 | -7.0 |
| 25% | 0.0 | 391.0 | 33.284972 | -8.662794 | 0.0 | 2.0 | 40.0 | 1996.0 | 2011.0 | 8.0 |
| 50% | 0.0 | 1165.0 | 35.001336 | -5.177179 | 0.0 | 3.0 | 150.0 | 1997.0 | 2012.0 | 14.0 |
| 75% | 25.0 | 1498.0 | 37.237524 | -3.375576 | 0.0 | 5.0 | 320.0 | 2004.0 | 2013.0 | 16.0 |
| max | 350000.0 | 2628.0 | 40.344301 | -0.998464 | 1776.0 | 80.0 | 30500.0 | 2013.0 | 2013.0 | 53.0 |


In [21]:
X_train.isnull().sum()

amount_tsh                   0
funder                    2727
gps_height               15446
installer                 2737
longitude                 1379
latitude                  1379
wpt_name                     0
num_private                  0
basin                        0
subvillage                 268
region                       0
district_code                0
lga                          0
ward                         0
population               16123
public_meeting            2559
scheme_management         2913
scheme_name              21245
permit                    2314
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
water_quality                0
quality_group                0
quantity                     0
quantity_group               0
source                       0
source_type                  0
source_c
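
`gps_height` and `population` are missing in roughly a third of the training rows (15,446 and 16,123 of 44,550), so the fact that a value is missing may itself be informative. One option is `SimpleImputer`'s `add_indicator=True`, which appends a 0/1 column for each feature that had missing values during fitting. A sketch on a made-up two-column array; whether it helps on this dataset is untested:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy columns standing in for gps_height and population.
toy = np.array([[1200.0, 150.0],
                [np.nan, 40.0],
                [800.0, np.nan]])

imputer = SimpleImputer(strategy='median', add_indicator=True)
filled = imputer.fit_transform(toy)

# Columns 0-1: median-imputed values; columns 2-3: missingness flags.
print(filled)
```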

In [22]:
# What if I keep the zeroes in "gps_height"?

def remove_zeroes2(X):
  X = X.copy()
  
  X['latitude'] = X['latitude'].replace(-2e-08, 0)
  
  zeroes = ['longitude', 'latitude', 'population', 'construction_year']
  for col in zeroes:
    X[col] = X[col].replace(0, np.nan)
  
  return X

def engineer_features2(X):
  X = X.copy()
  
  X = remove_zeroes2(X)
  X = datetime_features(X)
  X = drop_redundant(X)
  
  return X

X_train = engineer_features2(train).drop('status_group', axis=1)
X_val = engineer_features2(val).drop('status_group', axis=1)

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=100, n_jobs=-1)
)

pipeline.fit(X_train, y_train)
print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Highest accuracy so far!

Train Accuracy 0.9999775533108867
Validation Accuracy 0.8082154882154882


In [0]:
X_test = engineer_features2(test)

y_pred = pd.DataFrame(pipeline.predict(X_test), columns=['status_group'])

submission4 = pd.concat([test['id'], y_pred], axis=1)

submission4.to_csv('water-submission-10.csv', index=None, header=True)

# Accuracy on the test set went down...

In [24]:
# What if I also replace the zeroes in amount_tsh?

def remove_zeroes3(X):
  X = X.copy()
  
  X['latitude'] = X['latitude'].replace(-2e-08, 0)
  
  zeroes = ['amount_tsh', 'gps_height', 'longitude', 'latitude', 'population', 'construction_year']
  for col in zeroes:
    X[col] = X[col].replace(0, np.nan)
  
  return X

def engineer_features3(X):
  X = X.copy()
  
  X = remove_zeroes3(X)
  X = datetime_features(X)
  X = drop_redundant(X)
  
  return X

X_train = engineer_features3(train).drop('status_group', axis=1)
X_val = engineer_features3(val).drop('status_group', axis=1)

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=100, n_jobs=-1)
)

pipeline.fit(X_train, y_train)
print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

Train Accuracy 0.9999775533108867
Validation Accuracy 0.8073400673400674


In [0]:
X_test = engineer_features3(test)

y_pred = pd.DataFrame(pipeline.predict(X_test), columns=['status_group'])

submission5 = pd.concat([test['id'], y_pred], axis=1)

submission5.to_csv('water-submission-11.csv', index=None, header=True)

# Lower...

In [26]:
# "Scheme name" has a lot of NaNs. What if I drop it?

def drop_redundant2(X):
  X = X.copy()
  
  redundant_cols = ['scheme_name', 'recorded_by', 'payment_type', 'region_code', 'date_recorded', 'id']
  
  X = X.drop(columns=redundant_cols)
    
  return X

def engineer_features4(X):
  X = X.copy()
  
  X = remove_zeroes(X)
  X = datetime_features(X)
  X = drop_redundant2(X)
  
  return X

X_train = engineer_features4(train).drop('status_group', axis=1)
X_val = engineer_features4(val).drop('status_group', axis=1)

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=100, n_jobs=-1)
)

pipeline.fit(X_train, y_train)
print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Accuracy goes down...

Train Accuracy 0.9999775533108867
Validation Accuracy 0.8047811447811448


In [27]:
# What if I keep the zeroes in population?

def remove_zeroes4(X):
  X = X.copy()
  
  X['latitude'] = X['latitude'].replace(-2e-08, 0)
  
  zeroes = ['longitude', 'latitude', 'construction_year', 'gps_height']
  for col in zeroes:
    X[col] = X[col].replace(0, np.nan)
  
  return X

def engineer_features5(X):
  X = X.copy()
  
  X = remove_zeroes4(X)
  X = datetime_features(X)
  X = drop_redundant(X)
  
  return X

X_train = engineer_features5(train).drop('status_group', axis=1)
X_val = engineer_features5(val).drop('status_group', axis=1)

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=100, n_jobs=-1)
)

pipeline.fit(X_train, y_train)
print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Accuracy goes down, by a bit

Train Accuracy 0.9999775533108867
Validation Accuracy 0.8073400673400674


In [28]:
(X_train['gps_height'] < 0).sum()

1120

In [29]:
# Let's try removing all negative numbers from gps_height too.

def remove_zeroes5(X):
  X = X.copy()
  
  X['latitude'] = X['latitude'].replace(-2e-08, 0)
  
  X['gps_height'] = X['gps_height'].clip(lower=0)  # negative heights become 0, then NaN in the loop below
  
  zeroes = ['gps_height', 'longitude', 'latitude', 'population', 'construction_year']
  for col in zeroes:
    X[col] = X[col].replace(0, np.nan)
  
  return X

def engineer_features6(X):
  X = X.copy()
  
  X = remove_zeroes5(X)
  X = datetime_features(X)
  X = drop_redundant(X)
  
  return X

X_train = engineer_features6(train).drop('status_group', axis=1)
X_val = engineer_features6(val).drop('status_group', axis=1)

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=100, n_jobs=-1)
)

pipeline.fit(X_train, y_train)
print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Accuracy goes down

Train Accuracy 0.9999775533108867
Validation Accuracy 0.807070707070707


In [30]:
X_train.describe() # Confirms I successfully replaced the gps_height values

| | amount_tsh | gps_height | longitude | latitude | num_private | district_code | population | construction_year | year_recorded | time_to_inspection |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 44550.0 | 27984.0 | 43171.0 | 43171.0 | 44550.0 | 44550.0 | 28427.0 | 44550.0 | 44550.0 | 44550.0 |
| mean | 320.565308 | 1057.969983 | 35.153192 | -5.895653 | 0.473692 | 5.629001 | 282.31449 | 1996.885926 | 2011.920135 | 15.034209 |
| std | 3150.47797 | 588.756279 | 2.604573 | 2.81298 | 11.594878 | 9.650221 | 581.84945 | 10.058957 | 0.958777 | 10.095001 |
| min | 0.0 | 1.0 | 29.607201 | -11.64944 | 0.0 | 0.0 | 1.0 | 1960.0 | 2002.0 | -7.0 |
| 25% | 0.0 | 468.0 | 33.284972 | -8.662794 | 0.0 | 2.0 | 40.0 | 1996.0 | 2011.0 | 8.0 |
| 50% | 0.0 | 1193.0 | 35.001336 | -5.177179 | 0.0 | 3.0 | 150.0 | 1997.0 | 2012.0 | 14.0 |
| 75% | 25.0 | 1512.0 | 37.237524 | -3.375576 | 0.0 | 5.0 | 320.0 | 2004.0 | 2013.0 | 16.0 |
| max | 350000.0 | 2628.0 | 40.344301 | -0.998464 | 1776.0 | 80.0 | 30500.0 | 2013.0 | 2013.0 | 53.0 |


In [0]:
# Right now I'm at #2 on the leaderboard...would like to get #1, but can't think
# of anything else to do right now. So I'll just submit to the repo.