<a href="https://colab.research.google.com/github/vishnuyar/DS-Unit-2-Kaggle-Challenge/blob/master/module1/Copy_of_assignment_kaggle_challenge_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 1

## Assignment
- [X] Do train/validate/test split with the Tanzania Waterpumps data.
- [X] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [what other columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) What other columns are duplicates, or nearly duplicates? Can you extract the year from date_recorded? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [X] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [X] Get your validation accuracy score.
- [X] Get and plot your feature importances.
- [X] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [X] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Try other [scikit-learn scalers](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Make exploratory visualizations and share on Slack.


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this problem, you may want to use the parameter `logistic=True`.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.
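As a rough illustration of the ideas above, the sketch below builds the binary target, computes the share of functional pumps per category, and bins a numeric feature with `pd.qcut`. The `toy` frame and its values are made up for the example; only the column names come from the waterpumps data. The two resulting tables are the quantities a categorical estimate plot would draw.

```python
import pandas as pd

# Toy stand-in for the training set (values are made up for illustration)
toy = pd.DataFrame({
    'status_group': ['functional', 'non functional', 'functional',
                     'functional needs repair', 'functional', 'non functional',
                     'functional', 'non functional'],
    'quantity': ['enough', 'dry', 'enough', 'insufficient',
                 'enough', 'dry', 'insufficient', 'dry'],
    'population': [10, 250, 40, 90, 500, 1, 120, 30],
})

# Binary target: 1 if the pump is functional, else 0
toy['functional'] = (toy['status_group'] == 'functional').astype(int)

# Share of functional pumps per category
rate_by_quantity = toy.groupby('quantity')['functional'].mean()

# Bin a numeric feature into quartiles, then take the share per bin
toy['population_bin'] = pd.qcut(toy['population'], q=4)
rate_by_bin = toy.groupby('population_bin', observed=True)['functional'].mean()

print(rate_by_quantity)
print(rate_by_bin)
```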

#### High-cardinality categoricals

This code from a previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```
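The same idea can be wrapped in a small helper so that the top values are learned from the training set once and then applied to every other split. This is only a sketch: `reduce_cardinality`, its parameters, and the toy `funder` values are hypothetical names and data chosen for the example.

```python
import pandas as pd

def reduce_cardinality(train, others, col, top_n=10, other_label='OTHER'):
    """Keep the top_n most frequent training values of `col`;
    map every other value to other_label in all frames."""
    top = train[col].value_counts().index[:top_n]
    out = []
    for df in [train, *others]:
        df = df.copy()  # leave the caller's frames untouched
        df.loc[~df[col].isin(top), col] = other_label
        out.append(df)
    return out

# Toy data (made up for illustration)
toy_train = pd.DataFrame({'funder': ['a', 'a', 'a', 'b', 'b', 'c']})
toy_test = pd.DataFrame({'funder': ['a', 'd', 'b']})

new_train, new_test = reduce_cardinality(toy_train, [toy_test], 'funder', top_n=2)
print(new_train['funder'].tolist())
print(new_test['funder'].tolist())
```

Learning `top` from the training frame only keeps the validation and test sets from influencing which categories survive.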



In [0]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # category_encoders, version >= 2.0
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    # lightgbm (imported in the next cell)
    !pip install --upgrade category_encoders pandas-profiling plotly lightgbm
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module1')

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from lightgbm import LGBMClassifier
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV

ModuleNotFoundError: No module named 'lightgbm'

In [0]:
train = pd.merge(pd.read_csv('../data/tanzania/train_features.csv'), 
                 pd.read_csv('../data/tanzania/train_labels.csv'))
test = pd.read_csv('../data/tanzania/test_features.csv')
sample_submission = pd.read_csv('../data/tanzania/sample_submission.csv')
train,val = train_test_split(train,random_state = 32,stratify=train['status_group'],test_size=0.20)
train.shape, test.shape

((47520, 41), (14358, 40))
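The split above passes `stratify=train['status_group']`, which asks scikit-learn to keep the class proportions roughly equal in both parts. A minimal sketch on made-up labels (the `toy` frame and its class counts are assumptions for the example) shows one way to check that:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with an imbalanced class mix (made up for illustration)
labels = (['functional'] * 60
          + ['non functional'] * 30
          + ['functional needs repair'] * 10)
toy = pd.DataFrame({'id': range(100), 'status_group': labels})

toy_train, toy_val = train_test_split(
    toy, test_size=0.20, random_state=32, stratify=toy['status_group'])

# Class shares in each part; with stratification they should be close
train_share = toy_train['status_group'].value_counts(normalize=True)
val_share = toy_val['status_group'].value_counts(normalize=True)
print(pd.DataFrame({'train': train_share, 'val': val_share}))
```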

In [0]:
# Observations from exploring the data, both graphically and with basic summary functions:
# - Not relevant or too many null values: funder, installer, scheme_name, num_private
# - High cardinality: wpt_name, subvillage, lga, ward
# - region_code is removed because region is kept
# - scheme_management is removed because management is kept
# - recorded_by holds a constant value
# - extraction_type is kept; extraction_type_group and extraction_type_class are removed
# - payment_type and payment are the same
# - waterpoint_type_group is removed because waterpoint_type can replace it
# - water_quality is kept because it is the same as quality_group
# - quantity_group and quantity are the same
# - source and source_type are the same

drop_columns = ['funder', 'installer', 'scheme_name', 'scheme_management', 'wpt_name',
                'subvillage', 'lga', 'ward', 'recorded_by', 'extraction_type_group', 'extraction_type_class',
                'payment_type', 'waterpoint_type_group', 'quality_group', 'quantity_group', 'source_type', 'id', 'num_private', 'region_code']
# Copy the split frames to avoid pandas' SettingWithCopyWarning
train = train.copy()
val = val.copy()

train.drop(columns=drop_columns,inplace=True)
val.drop(columns=drop_columns,inplace=True)
test.drop(columns=drop_columns,inplace=True)
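# Latitude uses -2e-08 as a near-zero placeholder; set it to 0 so it is handled with the other zeros
train['latitude'] = train['latitude'].replace(-2e-08, 0)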
val['latitude'] = val['latitude'].replace(-2e-08, 0)
test['latitude'] = test['latitude'].replace(-2e-08, 0)

In [0]:

# Replace zero values in a column with NaN
def make_zero_nan(data, col):
    X = data.copy()
    X[col] = X[col].replace(0, np.nan)
    return X

# Replace NaN values in a column with a given value
def replace_nan(data, col, value):
    X = data.copy()
    X[col] = X[col].replace(np.nan, value)
    return X

# Add engineered features and recode a few columns
def feature_addition(data):
    X = data.copy()
    # Keep only the year in which the record was taken
    X['date_recorded'] = pd.to_datetime(X['date_recorded']).dt.year
    # Years from construction to inspection
    X['since_construction'] = X['date_recorded'] - X['construction_year']
    # Treat district_code as a categorical
    X['district_code'] = X['district_code'].astype('str')
    # Missing flags are filled with True, then stored as 0/1
    X['public_meeting'] = X['public_meeting'].fillna(True).astype(int)
    X['permit'] = X['permit'].fillna(True).astype(int)
    return X

# Cap the values of a column at `value`.
# Note: NaN < value is False, so NaN entries are also set to `value`.
def remove_outlier(data, col, value):
    X = data.copy()
    X[col] = X[col].apply(lambda x: x if x < value else value)
    return X

In [0]:
nan_columns = ['latitude','longitude','construction_year','amount_tsh']
for col in nan_columns:
  train = make_zero_nan(train,col)
  val = make_zero_nan(val,col)
  test = make_zero_nan(test,col)

# Replace missing construction years (originally zeros) with the earliest construction year in the training set
min_year = train['construction_year'].min()
train = replace_nan(train,'construction_year',min_year)
val = replace_nan(val,'construction_year',min_year)
test = replace_nan(test,'construction_year',min_year)


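# Fill missing gps_height values with the minimum gps_height in the training set.
# Note: gps_height is not in nan_columns above, so its zeros are not converted to NaN first.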
min_height = train['gps_height'].min()
train = replace_nan(train,'gps_height',min_height)
val = replace_nan(val,'gps_height',min_height)
test = replace_nan(test,'gps_height',min_height)


In [0]:
high = train['population'].quantile(.975)
train = remove_outlier(train,'population',high)
val = remove_outlier(val,'population',high)
test = remove_outlier(test,'population',high)

In [0]:
high = train['amount_tsh'].quantile(.975)
train = remove_outlier(train,'amount_tsh',high)
val = remove_outlier(val,'amount_tsh',high)
test = remove_outlier(test,'amount_tsh',high)
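One detail worth knowing when capping a column that already contains NaN: a comparison with NaN is always False, so an `x if x < value else value` rule sends NaN to the cap as well, while `Series.clip` leaves NaN in place. The short sketch below compares the two on a made-up series; which behaviour is wanted here is a modelling choice, not something this example decides.

```python
import numpy as np
import pandas as pd

# Made-up values: one ordinary, one missing, one extreme
s = pd.Series([10.0, np.nan, 5000.0])
cap = 100.0

# Element-wise rule as used in remove_outlier: NaN fails `x < cap`, so it becomes the cap
capped_apply = s.apply(lambda x: x if x < cap else cap)

# clip caps large values but keeps NaN as NaN
capped_clip = s.clip(upper=cap)

print(capped_apply.tolist())
print(capped_clip.tolist())
```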

In [0]:
train = feature_addition(train)
val = feature_addition(val)
test = feature_addition(test)
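The cleaning steps in the cells above could also be bundled into a single function, so that train, validation, and test are guaranteed to go through identical steps. The following is a simplified sketch under stated assumptions, not the notebook's exact logic: `wrangle`, `min_year`, and `caps` are hypothetical names, the toy rows are made up, and capping uses `clip`, which (unlike `remove_outlier`) leaves NaN untouched.

```python
import numpy as np
import pandas as pd

def wrangle(df, min_year, caps):
    """Apply the same cleaning to any split.
    min_year and caps are expected to be computed on the training set."""
    X = df.copy()
    # Zeros stand in for missing values in these columns
    for col in ['latitude', 'longitude', 'construction_year', 'amount_tsh']:
        X[col] = X[col].replace(0, np.nan)
    # Fill missing construction years with the training minimum
    X['construction_year'] = X['construction_year'].fillna(min_year)
    # Cap extreme values at training-set thresholds
    for col, cap in caps.items():
        X[col] = X[col].clip(upper=cap)
    # Year of inspection and years since construction
    X['date_recorded'] = pd.to_datetime(X['date_recorded']).dt.year
    X['since_construction'] = X['date_recorded'] - X['construction_year']
    return X

# Toy rows (made up for illustration)
toy = pd.DataFrame({
    'latitude': [-6.5, 0.0],
    'longitude': [35.1, 0.0],
    'construction_year': [2000, 0],
    'amount_tsh': [50.0, 0.0],
    'population': [100, 9000],
    'date_recorded': ['2011-03-14', '2013-02-04'],
})

out = wrangle(toy, min_year=2000, caps={'population': 500})
print(out)
```

Passing the training-set statistics in as arguments keeps the function free of hidden state and avoids recomputing them on validation or test data.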

In [0]:
target = 'status_group'
features = train.columns.drop(target)
Y_train = train[target]
Y_val = val[target]
X_train = train[features]
X_val = val[features]
# Pipeline for model testing: one-hot encode, impute, scale, then classify.
# The decision tree tried earlier (with a max_depth search) is kept as a comment.
pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='most_frequent'),
    RobustScaler(),
    # DecisionTreeClassifier(random_state=32, min_samples_leaf=20, max_depth=39),
    LGBMClassifier(max_depth=8, learning_rate=0.03, n_estimators=1500, min_child_samples=20)
)

# 5-fold cross-validated accuracy on the training set
scores = cross_val_score(pipeline, X_train, Y_train, cv=5)
print(scores, "Mean: ", scores.mean())

# Train and validation accuracy (disabled while cross-validating)
# pipeline.fit(X_train, Y_train)
# pred_train = pipeline.predict(X_train)
# y_pred = pipeline.predict(X_val)
# print("Training Score:", accuracy_score(Y_train, pred_train))
# print("Val Score:", accuracy_score(Y_val, y_pred))


[0.79694897 0.80115729 0.79366582 0.79345539 0.80214692] Mean:  0.7974748776846173
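The scores above come from `cross_val_score`, which fits clones of the estimator and leaves the object passed in unfitted. To get a validation accuracy as well, the pipeline has to be fitted explicitly. The sketch below shows this on synthetic data, with a plain scikit-learn pipeline standing in for the one above; the data, the tree depth, and the variable names are assumptions for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some missing values in the last column
rng = np.random.RandomState(32)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.rand(300) < 0.1, 2] = np.nan

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.20, random_state=32, stratify=y)

pipe = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    DecisionTreeClassifier(max_depth=5, random_state=32),
)

# Cross-validation works on clones, so `pipe` itself stays unfitted
scores = cross_val_score(pipe, X_tr, y_tr, cv=5)
fitted_before = hasattr(pipe.steps[-1][1], 'tree_')

# Fit explicitly, then score on the held-out validation split
pipe.fit(X_tr, y_tr)
fitted_after = hasattr(pipe.steps[-1][1], 'tree_')
val_acc = accuracy_score(y_va, pipe.predict(X_va))

print(scores.mean(), val_acc)
```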


In [0]:
# [0.79852709 0.80178853 0.79219276 0.79534933 0.80109451] Mean:  0.7977904434635834

array([0.7945292 , 0.79768543, 0.79040404, 0.78935185, 0.79751631])

In [0]:
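# Fit on the training split, then predict for the test set.
# (cross_val_score only fits clones, so `pipeline` itself is still unfitted at this point.)
pipeline.fit(X_train, Y_train)
predictions = pipeline.predict(test[features])
sample_submission['status_group'] = predictions
sample_submission.to_csv('kaggle-submission-10.csv', index=False)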

In [0]:
features

Index(['amount_tsh', 'date_recorded', 'gps_height', 'longitude', 'latitude',
       'basin', 'region', 'district_code', 'population', 'public_meeting',
       'permit', 'construction_year', 'extraction_type', 'management',
       'management_group', 'payment', 'water_quality', 'quantity', 'source',
       'source_class', 'waterpoint_type', 'since_construction'],
      dtype='object')

In [0]:
train.dtypes

amount_tsh            float64
date_recorded           int64
gps_height            float64
longitude             float64
latitude              float64
basin                  object
region                 object
district_code          object
population              int64
public_meeting          int64
permit                  int64
construction_year     float64
extraction_type        object
management             object
management_group       object
payment                object
water_quality          object
quantity               object
source                 object
source_class           object
waterpoint_type        object
status_group           object
since_construction    float64
dtype: object

In [0]:
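# Take the final estimator by position; its step name depends on the classifier
# used ('decisiontreeclassifier' or 'lgbmclassifier')
model = pipeline.steps[-1][1]
encoder = pipeline.named_steps['onehotencoder']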

In [0]:
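# One row per encoded column with its importance.
# Note: this rebinds `features`, which previously held the feature names.
features = pd.DataFrame({'columns': encoder.transform(X_val).columns.to_list(), 'importance': model.feature_importances_})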

In [0]:
features.shape

(122, 2)

In [0]:
# Features sorted by importance, least important first, with the top 20 left out
#plt.figure(figsize=(12,35))
features.sort_values('importance', ascending=True)[:-20]

|     | columns                  | importance |
|-----|--------------------------|------------|
| 73  | management_unknown       | 0.000000   |
| 106 | source_hand dtw          | 0.000000   |
| 72  | management_other         | 0.000000   |
| 25  | region_Kigoma            | 0.000000   |
| 28  | region_Dar es Salaam     | 0.000000   |
| 30  | region_Mara              | 0.000000   |
| 31  | region_Rukwa             | 0.000000   |
| 65  | management_parastatal    | 0.000000   |
| 63  | extraction_type_climax   | 0.000000   |
| 62  | extraction_type_windmill | 0.000000   |
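To look at the most important features rather than the least important ones, the importances can be labelled with their column names, sorted, and cut to the top entries. This is a minimal sketch on a made-up, already-encoded matrix with a plain decision tree; `X_toy`, `y_toy`, and `top` are hypothetical names for the example.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy, already-encoded feature matrix (values are made up for illustration)
X_toy = pd.DataFrame({
    'quantity_dry':    [1, 0, 0, 1, 0, 0, 1, 0],
    'quantity_enough': [0, 1, 1, 0, 1, 1, 0, 1],
    'constant':        [1, 1, 1, 1, 1, 1, 1, 1],
})
y_toy = ['non functional', 'functional', 'functional', 'non functional',
         'functional', 'functional', 'non functional', 'functional']

tree = DecisionTreeClassifier(random_state=32).fit(X_toy, y_toy)

# Label each importance with its column name and sort, largest last
importances = pd.Series(tree.feature_importances_, index=X_toy.columns).sort_values()

# Keep the (up to) 20 most important features
top = importances.tail(20)
print(top)
```

With matplotlib installed, `top.plot.barh()` would draw these values as a horizontal bar chart, largest at the top.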


In [0]:
features.head()

0    Index(['amount_tsh', 'date_recorded_2013-03-15...
1    [0.044831299635566146, 5.743392291606526e-05, ...
dtype: object