<a href="https://colab.research.google.com/github/AHartNtkn/DS-Unit-2-Kaggle-Challenge/blob/master/module2/DS7_assignment_kaggle_challenge_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 2

## Assignment
- [X] Read [“Adopting a Hypothesis-Driven Workflow”](https://outline.com/5S5tsB), a blog post by a Lambda DS student about the Tanzania Waterpumps challenge.
- [X] Continue to participate in our Kaggle challenge.
- [X] Try Ordinal Encoding.
- [X] Try a Random Forest Classifier.
- [X] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [X] Commit your notebook to your fork of the GitHub repo.

## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s)!
- [X] Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- [ ] Try other [categorical encodings](https://contrib.scikit-learn.org/categorical-encoding/).
- [ ] Get and plot your feature importances.
- [ ] Make visualizations and share on Slack.

### Reading

Top recommendations in _**bold italic:**_

#### Decision Trees
- A Visual Introduction to Machine Learning, [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/), and _**[Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)**_
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU)

#### Random Forests
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/), Chapter 8: Tree-Based Methods
- [Coloring with Random Forests](http://structuringtheunstructured.blogspot.com/2017/11/coloring-with-random-forests.html)
- _**[Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)**_

#### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- _**[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)**_
- _**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)**_
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

#### Imposter Syndrome
- [Effort Shock and Reward Shock (How The Karate Kid Ruined The Modern World)](http://www.tempobook.com/2014/07/09/effort-shock-and-reward-shock/)
- [How to manage impostor syndrome in data science](https://towardsdatascience.com/how-to-manage-impostor-syndrome-in-data-science-ad814809f068)
- ["I am not a real data scientist"](https://brohrer.github.io/imposter_syndrome.html)
- _**[Imposter Syndrome in Data Science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/)**_






In [1]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # category_encoders, version >= 2.0
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade category_encoders pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module2')


In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv('../data/tanzania/train_features.csv'), 
                 pd.read_csv('../data/tanzania/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv('../data/tanzania/test_features.csv')
sample_submission = pd.read_csv('../data/tanzania/sample_submission.csv')

In [0]:
import random as ran

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import f_classif, chi2, SelectKBest, SelectPercentile, SelectFpr, SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler

import category_encoders as ce

import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='sklearn')

In [0]:
# Import all the training and test data initially together.
trainFeatures = pd.read_csv("../data/tanzania/train_features.csv")
trainLen = len(trainFeatures)  # 59400 rows in the training set
IFull = pd.concat([
           trainFeatures,
           pd.read_csv("../data/tanzania/test_features.csv")]).reset_index()
OTrain = pd.read_csv("../data/tanzania/train_labels.csv")

In [0]:
### Data Wrangling for both train and test sets

# Delete low-quality features
LQFeats = ["index", "id","recorded_by","quantity_group",'district_code','region_code']
IFull = IFull.drop(LQFeats, axis=1)

# Convert date columns to datetime values (a construction_year of 0 means "unknown")
IFull["date_recorded"] = pd.to_datetime(IFull["date_recorded"])
IFull["construction_year"] = pd.to_datetime(IFull["construction_year"].replace(0,np.nan),format="%Y")

# Fill in construction year with average values.
averageConstructionYear = pd.to_datetime(IFull["construction_year"].dropna().values.astype(np.int64).mean())
IFull["construction_year"] = IFull["construction_year"].fillna(averageConstructionYear)

# Create new feature corresponding to the age of a pump.
IFull["age"] = round((IFull["date_recorded"] - IFull["construction_year"]).dt.days / 365.25, 1)

# Convert time-related types back to numbers
IFull["construction_year"] = IFull["construction_year"].dt.year
IFull["month_recorded"] = IFull["date_recorded"].dt.month.astype(str)
IFull["year_recorded"] = IFull["date_recorded"].dt.year
IFull["date_recorded"] = IFull["date_recorded"].astype(np.int64)

regionCoordinatesM = {'Arusha': [-3.246043844575401, 36.55500339056361],
 'Dar es Salaam': [-6.907108803888347, 39.21493696941172],
 'Dodoma': [-5.941307299202325, 36.041956683179855],
 'Iringa': [-8.909404369833466, 34.89582103436594],
 'Kagera': [-1.9612435961778185, 31.232021236726276],
 'Kigoma': [-4.296333588647042, 30.218888989479233],
 'Kilimanjaro': [-3.523668709464474, 37.50540380773228],
 'Lindi': [-9.766073749473128, 38.98823080785604],
 'Manyara': [-4.303462004587154, 35.942841353772295],
 'Mara': [-1.7375038054204093, 34.15713524788167],
 'Mbeya': [-9.096028396803234, 33.53034883194474],
 'Morogoro': [-7.409802021663037, 37.04663136955299],
 'Mtwara': [-10.683688033971968, 39.388908361752286],
 'Mwanza': [-1.9462319854940118, 24.602444512672093],
 'Pwani': [-7.008696225821545, 38.88377808797843],
 'Rukwa': [-7.3617965028073185, 31.292962136392454],
 'Ruvuma': [-10.776146647558239, 35.72782465778519],
 'Shinyanga': [-2.79133846094068, 26.5515938013012],
 'Singida': [-4.898334361159773, 34.73935867100201],
 'Tabora': [-4.72298819716211, 32.87706818312785],
 'Tanga': [-5.074809126685709, 38.5033910213175]}

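# Replace missing coordinates with region centers.
# Missing values are stored as placeholders: latitude -2e-08, longitude 0.
# NOTE: the Mwanza and Shinyanga longitudes above (about 24.6 and 26.6) fall
# west of Tanzania, which spans roughly 29 to 41 degrees east; they may have
# been averaged with the placeholder zeros and are worth re-checking.
regionLat = IFull['region'].map({r: c[0] for r, c in regionCoordinatesM.items()})
regionLon = IFull['region'].map({r: c[1] for r, c in regionCoordinatesM.items()})
IFull['latitude'] = IFull['latitude'].mask(IFull['latitude'] == -2e-08, regionLat)
IFull['longitude'] = IFull['longitude'].mask(IFull['longitude'] == 0, regionLon)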
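
The region centers in `regionCoordinatesM` are hard-coded, and the notebook does not show how they were derived. One possible way to compute such centers, sketched here on a made-up frame, is to blank out the placeholder coordinates first so they cannot pull the averages toward zero. The toy rows and the `region_centers` helper are illustrative assumptions, not the author's code.

```python
import pandas as pd

def region_centers(df, lat_placeholder=-2e-08, lon_placeholder=0):
    """Mean latitude/longitude per region, ignoring placeholder coordinates."""
    coords = df[["region", "latitude", "longitude"]].copy()
    # Treat placeholder values as missing so they do not enter the mean.
    coords["latitude"] = coords["latitude"].mask(coords["latitude"] == lat_placeholder)
    coords["longitude"] = coords["longitude"].mask(coords["longitude"] == lon_placeholder)
    return coords.groupby("region")[["latitude", "longitude"]].mean()

# Made-up rows: the second Mwanza pump carries placeholder coordinates.
toy = pd.DataFrame({
    "region": ["Mwanza", "Mwanza", "Mwanza", "Arusha"],
    "latitude": [-2.5, -2e-08, -2.7, -3.2],
    "longitude": [33.0, 0.0, 33.4, 36.6],
})
centers = region_centers(toy)
print(centers)
```

Calling `centers.loc[region].tolist()` for each region would give `[latitude, longitude]` pairs in the same shape as the dictionary used in the cell.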


In [0]:
# Split data back up into Train and Test
ITrain = IFull.loc[0:trainLen-1]
ITest = IFull.loc[trainLen:]

# Do a train-validate split
inputTrain, inputValidate, outputTrain, outputValidate = train_test_split(
    ITrain, OTrain['status_group'], train_size=0.8, test_size=0.2)
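
The split in this cell draws a different random partition on every run, so the validation score will vary between runs. A minimal sketch of a reproducible, stratified alternative follows, using made-up labels. The seed value 42 and the use of `stratify` are assumptions for illustration; the notebook itself uses neither.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels standing in for OTrain['status_group'].
labels = pd.Series(["functional"] * 60 + ["non functional"] * 30
                   + ["functional needs repair"] * 10)
features = pd.DataFrame({"x": range(len(labels))})

X_tr, X_val, y_tr, y_val = train_test_split(
    features, labels,
    train_size=0.8, test_size=0.2,
    stratify=labels,      # keep class proportions equal in both parts
    random_state=42,      # arbitrary seed so the split can be repeated
)

print(y_val.value_counts().to_dict())
```

Stratifying matters most for the rare `functional needs repair` class, which a plain random split can under-represent in the validation set.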

In [0]:
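class KNeighborsColumn(BaseEstimator, TransformerMixin):
    """
    Transformer that adds k-nearest-neighbors class probabilities as columns.
    """
    def __init__(self, n_neighbors, distFeatures):
        # Store every constructor argument as an attribute so that
        # get_params() and clone() work (needed by GridSearchCV).
        self.n_neighbors = n_neighbors
        self.distFeatures = distFeatures

    def fit(self, X, y):
        self.kscaler = StandardScaler()
        self.kmodel = KNeighborsClassifier(n_neighbors=self.n_neighbors)
        self.kmodel.fit(self.kscaler.fit_transform(X[self.distFeatures]), y)
        return self

    def transform(self, X):
        X = X.copy()
        XScaled = self.kscaler.transform(X[self.distFeatures])

        # predict_proba orders its columns like kmodel.classes_, which is
        # sorted alphabetically, so look each label up by name.
        probs = self.kmodel.predict_proba(XScaled)
        classes = list(self.kmodel.classes_)
        X["NearestNonFunc"] = probs[:, classes.index("non functional")]
        X["NearestFuncNeRep"] = probs[:, classes.index("functional needs repair")]
        X["NearestFunc"] = probs[:, classes.index("functional")]
        return X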
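
Which column of `predict_proba` belongs to which class is determined by the classifier's `classes_` attribute, not by the order in which labels appear in the training data. The sketch below, on made-up one-dimensional points, shows that ordering for the three `status_group` labels and how a column can be looked up by name.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 1-D points; labels are the three status_group values.
X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
y = np.array(["non functional", "non functional",
              "functional", "functional",
              "functional needs repair", "functional needs repair"])

knn = KNeighborsClassifier(n_neighbors=2).fit(X, y)

# classes_ is sorted, so string labels come out in alphabetical order
# even though "non functional" appears first in y.
print(knn.classes_)

# Look up a column by label instead of relying on its position.
probs = knn.predict_proba(np.array([[0.05]]))
col = list(knn.classes_).index("non functional")
print(col, probs[0, col])
```

Because `functional` sorts first and `non functional` last, column 0 holds the probability of `functional` and column 2 the probability of `non functional`.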

In [0]:
class ColumnSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a group of columns based on a list.
    """
    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.cols]

In [0]:
# Feature Selection Pipelines

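# Numeric columns, plus the probability columns that KNeighborsColumn adds
# inside the pipeline. They are not present in inputTrain, so they must be
# listed explicitly; otherwise ColumnSelector drops them before the model.
numeric_features = inputTrain.select_dtypes('number').columns.tolist() + [
    "NearestNonFunc", "NearestFuncNeRep", "NearestFunc"]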

nums = Pipeline( [
    ("ncol", ColumnSelector(numeric_features)),
    ("nimp", SimpleImputer(missing_values=np.nan, strategy='mean')),
    ("nmod", SelectFromModel(RandomForestClassifier(n_estimators=100), threshold='median'))
    ] )

categorical_features = inputTrain.select_dtypes(exclude='number').columns.tolist()

cats = Pipeline( [
    ("ccol", ColumnSelector(categorical_features)),
    ("cord", ce.OrdinalEncoder()),
    ("cfpr", SelectFpr(chi2, alpha=.001))
    ] )

feats = FeatureUnion([('nums', nums), ('cats', cats)])

In [0]:
RFCla = Pipeline( [
    ("knearest", KNeighborsColumn(n_neighbors=40, distFeatures=['longitude', 'latitude', "date_recorded"])),
    ("feat", feats),
    ("RF", RandomForestClassifier(n_estimators=100))
    ] )

In [15]:
model1 = RFCla
model1.fit(inputTrain, outputTrain)

score = model1.score(inputValidate, outputValidate)
print('Validation Accuracy', score)

Validation Accuracy 0.8074074074074075
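
The notebook stops at the validation score and does not show the submission step from the assignment list. Below is a rough sketch of how a submission file could be assembled, assuming the competition expects the two columns `id` and `status_group` as in `sample_submission.csv`. Since `id` was dropped from `IFull` during wrangling, the ids would have to come from `sample_submission['id']`, on the assumption that its rows are in the same order as the test features. The `build_submission` helper and the ids and predictions shown are made up for illustration.

```python
import os
import tempfile
import pandas as pd

def build_submission(ids, predictions):
    """Pair each test id with its predicted status_group."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    return pd.DataFrame({"id": list(ids), "status_group": list(predictions)})

# Made-up values standing in for sample_submission['id'] and
# model1.predict(ITest).
submission = build_submission(
    ids=[101, 102, 103],
    predictions=["functional", "non functional", "functional"],
)

# Write without the index so the file has exactly two columns.
path = os.path.join(tempfile.mkdtemp(), "submission.csv")
submission.to_csv(path, index=False)
print(pd.read_csv(path))
```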
