<a href="https://colab.research.google.com/github/lechemrc/DS-Unit-2-Applied-Modeling/blob/master/module3/assignment_applied_modeling_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Applied Modeling, Module 3

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] Continue to iterate on your project: data cleaning, exploration, feature engineering, modeling.
- [ ] Make at least 1 partial dependence plot to explain your model.
- [ ] Share at least 1 visualization on Slack.

(If you have not yet completed an initial model for your portfolio project, then do today's assignment using your Tanzania Waterpumps model.)

## Stretch Goals
- [ ] Make multiple PDPs with 1 feature in isolation.
- [ ] Make multiple PDPs with 2 features in interaction. 
- [ ] Use Plotly to make a 3D PDP.
- [ ] Make PDPs with categorical feature(s). Use Ordinal Encoder, outside of a pipeline, to encode your data first. If there is a natural ordering, then take the time to encode it that way, instead of random integers. Then use the encoded data with pdpbox. Get readable category names on your plot, instead of integer category codes.
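
The categorical stretch goal can be sketched without pdpbox. The block below is a minimal illustration, not the assignment's required approach: the toy data, the `size_order` mapping, and the helper `partial_dependence_1d` are all hypothetical, and the partial dependence is computed by hand (force the feature to each value, then average the predicted probability) rather than with pdpbox.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: 'size' has a natural order, so encode it by hand
# instead of letting an encoder assign arbitrary integers.
size_order = {'small': 0, 'medium': 1, 'large': 2}  # assumed ordering
rng = np.random.RandomState(42)
toy = pd.DataFrame({
    'size': rng.choice(list(size_order), size=300),
    'noise': rng.normal(size=300),
})
toy['size_code'] = toy['size'].map(size_order)

# Target depends on 'size' only: larger sizes are more often positive.
y = (toy['size_code'] + rng.normal(scale=0.5, size=300) > 1).astype(int)

X = toy[['size_code', 'noise']]
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

def partial_dependence_1d(model, X, feature, grid):
    """Average predicted P(class 1) with `feature` forced to each grid value."""
    averages = []
    for value in grid:
        X_forced = X.copy()
        X_forced[feature] = value
        averages.append(model.predict_proba(X_forced)[:, 1].mean())
    return np.array(averages)

grid = sorted(size_order.values())
pdp = partial_dependence_1d(model, X, 'size_code', grid)

# Map integer codes back to readable names for reporting or axis labels.
code_to_name = {code: name for name, code in size_order.items()}
pdp_by_name = pd.Series(pdp, index=[code_to_name[c] for c in grid])
print(pdp_by_name)
```

The same `code_to_name` lookup can be used to relabel the tick marks on a plot so that it shows category names instead of integer codes.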

## Links
- [Christoph Molnar: Interpretable Machine Learning — Partial Dependence Plots](https://christophm.github.io/interpretable-ml-book/pdp.html) + [animated explanation](https://twitter.com/ChristophMolnar/status/1066398522608635904)
- [Kaggle / Dan Becker: Machine Learning Explainability — Partial Dependence Plots](https://www.kaggle.com/dansbecker/partial-plots)
- [Plotly: 3D PDP example](https://plot.ly/scikit-learn/plot-partial-dependence/#partial-dependence-of-house-value-on-median-age-and-average-occupancy)

### Colab Setup

In [0]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Applied-Modeling.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module3')

Initialized empty Git repository in /content/.git/
remote: Enumerating objects: 77, done.
remote: Total 77 (delta 0), reused 0 (delta 0), pack-reused 77
Unpacking objects: 100% (77/77), done.
From https://github.com/LambdaSchool/DS-Unit-2-Applied-Modeling
 * branch            master     -> FETCH_HEAD
 * [new branch]      master     -> origin/master
Checking out files: 100% (26/26), done.
Collecting category_encoders==2.0.0 (from -r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/6e/a1/f7a22f144f33be78afeb06bfa78478e8284a64263a3c09b1ef54e673841e/category_encoders-2.0.0-py2.py3-none-any.whl (87kB)
Collecting eli5==0.10.1 (from -r requirements.txt (line 2))
  Downloading https://files.pythonhosted.org/packages/97/2f/c85c7d8f8548e460829971785347e14e45fa5c6617da374711dec8cb38cc/eli5-0.10.1-py2.py3-none-any.whl (105kB)
Co

### Important Imports

In [0]:
# libraries and math functions
import pandas as pd
import numpy as np
import pandas_profiling
from scipy.io import arff # for loading .arff file
from scipy.stats import randint, uniform

# imports for pipeline and regression
import category_encoders as ce
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from ipywidgets import interact, fixed

# plotting
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

### Importing dataset

The dataset was largely cleaned beforehand.

In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/lechemrc/Datasets-to-ref/master/Autism%20Screening%20for%20Children/csv_result-Autism-Child-Data.csv', na_values='?')
print(df.shape)
df.head()

(292, 22)


Unnamed: 0,id,A1_Score,A2_Score,A3_Score,A4_Score,A5_Score,A6_Score,A7_Score,A8_Score,A9_Score,A10_Score,age,gender,ethnicity,jundice,austim,contry_of_res,used_app_before,result,age_desc,relation,Class/ASD
0,1,1,1,0,0,1,1,0,1,0,0,6.0,m,Others,no,no,Jordan,no,5,4-11 years,Parent,NO
1,2,1,1,0,0,1,1,0,1,0,0,6.0,m,Middle Eastern,no,no,Jordan,no,5,4-11 years,Parent,NO
2,3,1,1,0,0,0,1,1,1,0,0,6.0,m,,no,no,Jordan,yes,5,4-11 years,,NO
3,4,0,1,0,0,1,1,0,0,0,1,5.0,f,,yes,no,Jordan,no,4,4-11 years,,NO
4,5,1,1,1,1,1,1,1,1,1,1,5.0,m,Others,yes,no,United States,no,10,4-11 years,Parent,YES


### Data Wrangling

In [0]:
df = pd.read_csv('https://raw.githubusercontent.com/lechemrc/Datasets-to-ref/master/Autism%20Screening%20for%20Children/csv_result-Autism-Child-Data.csv', na_values='?')

def data_wrangle(df):
  ''' cleaning the data with one function'''

  # null values
  df = df.fillna(value='unspecified')

  # Dropping columns with single value
  df = df.drop('age_desc', axis=1)

  # Dropping id column since it is only a row identifier
  df = df.drop('id', axis=1)

  # Cleaning column names
  df = df.rename(columns={'jundice':'born_jaundice', 
                          'austim':'family_pdd', 
                          'contry_of_res':'country', 
                          'used_app_before':'prior_screening'})
  
  # Changing the country column values to 'other' if there are 5 or
  # fewer instances in the df
  frequencies = df['country'].value_counts()
  condition = frequencies <= 5
  mask = frequencies[condition].index
  mask_dict = dict.fromkeys(mask, 'other')

  df['country'] = df['country'].replace(mask_dict)

  # renaming values for clarity
  df['relation'] = df['relation'].replace(
      {'self':'Self', 
       'Health care professional':'Healthcare Professional', 
       'unspecified':'Unspecified'})
  
  df['ethnicity'] = df['ethnicity'].replace({'Pasifika':'Pacifica', 
                                           'Others':'unspecified'})

  return df

In [0]:
df = data_wrangle(df)
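
To make the rare-country step in `data_wrangle` easier to follow, here is a minimal sketch of the same `value_counts` / `replace` pattern on a hypothetical toy column (the country counts below are made up for illustration). Note that the `<= 5` condition also collapses a value that appears exactly 5 times.

```python
import pandas as pd

# Hypothetical toy column; the real 'country' column has many more values.
country = pd.Series(['Jordan'] * 7 + ['India'] * 6 + ['Egypt'] * 5 + ['Malta'] * 1)

frequencies = country.value_counts()
rare = frequencies[frequencies <= 5].index   # 5 or fewer rows counts as rare
collapsed = country.replace(dict.fromkeys(rare, 'other'))

print(collapsed.value_counts())
```

Here 'Egypt' (5 rows) and 'Malta' (1 row) both end up in 'other', while 'Jordan' and 'India' are kept.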

### Regression and Analysis

In [0]:
df['Class/ASD'].value_counts()

NO     151
YES    141
Name: Class/ASD, dtype: int64

In [0]:
df.columns

Index(['A1_Score', 'A2_Score', 'A3_Score', 'A4_Score', 'A5_Score', 'A6_Score',
       'A7_Score', 'A8_Score', 'A9_Score', 'A10_Score', 'age', 'gender',
       'ethnicity', 'born_jaundice', 'family_pdd', 'country',
       'prior_screening', 'result', 'relation', 'Class/ASD'],
      dtype='object')

In [0]:
# Including 'result' gave the model 100% accuracy,
# which points to major data leakage from it
df['result'].value_counts()

8     44
7     44
5     41
6     40
4     33
9     32
10    21
3     21
2      9
1      6
0      1
Name: result, dtype: int64
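
One way to see why `result` leaks the target: the counts printed above line up exactly with the class counts if the label is a simple cutoff on `result`. The sketch below only re-adds the printed numbers; the cutoff of 7 is an assumption inferred from those counts, not something the dataset documentation in this notebook states.

```python
# Counts copied from the df['result'].value_counts() output above.
result_counts = {8: 44, 7: 44, 5: 41, 6: 40, 4: 33, 9: 32,
                 10: 21, 3: 21, 2: 9, 1: 6, 0: 1}
# Counts copied from the df['Class/ASD'].value_counts() output.
class_counts = {'NO': 151, 'YES': 141}

threshold = 7  # assumed cutoff, inferred from the counts rather than documented
at_or_above = sum(n for score, n in result_counts.items() if score >= threshold)
below = sum(n for score, n in result_counts.items() if score < threshold)

print(at_or_above, below)
```

Matching totals are consistent with a threshold rule but do not prove it row by row; `pd.crosstab(df['result'], df['Class/ASD'])` would confirm it directly on the data.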

In [0]:
# Setting target and features
target = 'Class/ASD'

# Dropping 'result', 'age', and 'family_pdd' as they seem to be confounding and not helpful
features = df.columns.drop([target, 'result', 'age', 'family_pdd'])

X = df[features]
y = df[target]

# Train / Test split
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# Train / Val split 
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, 
    stratify=y_trainval, random_state=42)

X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape, y_test.shape 

((174, 16), (174,), (59, 16), (59,), (59, 16), (59,))
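
The two splits give roughly 60/20/20: the first holds out 20% for test, and 25% of the remaining 80% is another 20% for validation. The sketch below reproduces the row counts with plain arithmetic. The helper `split_sizes` is hypothetical, and it assumes that `train_test_split` rounds the test count up and the train count down when given a fractional `test_size`, which matches the shapes printed above.

```python
import math

def split_sizes(n_samples, test_size):
    """Row counts for a fractional test_size: test rounds up, train rounds down."""
    n_test = math.ceil(test_size * n_samples)
    n_train = math.floor((1 - test_size) * n_samples)
    return n_train, n_test

n_trainval, n_test = split_sizes(292, 0.20)      # first split: trainval / test
n_train, n_val = split_sizes(n_trainval, 0.25)   # second split: train / val
print(n_train, n_val, n_test)
```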

In [0]:
features

Index(['A1_Score', 'A2_Score', 'A3_Score', 'A4_Score', 'A5_Score', 'A6_Score',
       'A7_Score', 'A8_Score', 'A9_Score', 'A10_Score', 'gender', 'ethnicity',
       'born_jaundice', 'country', 'prior_screening', 'relation'],
      dtype='object')

#### Majority Class Accuracy

In [0]:
pd.options.display.float_format = None

In [0]:
y_train.value_counts(normalize=True)

NO     0.517241
YES    0.482759
Name: Class/ASD, dtype: float64

In [0]:
# Accuracy score using the majority class
majority_class = y_train.mode()[0]
y_pred = np.full_like(y_val, fill_value=majority_class)
accuracy_score(y_val, y_pred)

0.5084745762711864
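
The same majority-class baseline can also be expressed with scikit-learn's `DummyClassifier`. This is an alternative sketch, not what the cell above does. The label arrays are hypothetical stand-ins whose class counts (90/174 in train, 30/59 in validation) were chosen to match the proportions printed above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels with the class balance seen above.
y_train_toy = np.array(['NO'] * 90 + ['YES'] * 84)
y_val_toy = np.array(['NO'] * 30 + ['YES'] * 29)

# Features are ignored by the 'most_frequent' strategy, so zeros suffice.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(np.zeros((len(y_train_toy), 1)), y_train_toy)
y_pred_toy = baseline.predict(np.zeros((len(y_val_toy), 1)))

baseline_accuracy = accuracy_score(y_val_toy, y_pred_toy)
print(baseline_accuracy)
```

Any real model should beat this number by a clear margin before its accuracy means much.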

#### XGBoost with ordinal encoder

In [0]:
# xgboost = make_pipeline(
#     ce.OrdinalEncoder(),
#     SimpleImputer(strategy='median')
# )

# X_train_processed = xgboost.fit_transform(X_train)
# X_val_processed = xgboost.transform(X_val)

# eval_set = [(X_train_processed, y_train), 
#             (X_val_processed, y_val)]

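# from xgboost import XGBClassifier
# model = XGBClassifier(n_estimators=100, n_jobs=-1, random_state=42)
# model.fit(X_train_processed, y_train, eval_set=eval_set, eval_metric='auc', 
#           early_stopping_rounds=10)

# # X_test must go through the same encoder/imputer before predicting
# y_pred = model.predict(xgboost.transform(X_test))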

#### Random Forest Classifier with ordinal encoder

In [0]:
# random_forest_ord = make_pipeline(
#     ce.OrdinalEncoder(),
#     SimpleImputer(strategy='median'),
#     RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
# )

# random_forest_ord.fit(X_train, y_train)
# y_pred = random_forest_ord.predict(X_test)
# print('Validation Accuracy:', random_forest_ord.score(X_val, y_val))

The Random Forest Classifier with One Hot Encoder was the better model. XGBoost reported nearly 100% accuracy, but I'm very skeptical of that and figure there's some overfitting happening.

#### Random Forest Classifier with One Hot Encoding

In [0]:
random_forest = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(),
    RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
)

random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)
print('Validation Accuracy:', random_forest.score(X_val, y_val), 
      '\nTest Accuracy:', random_forest.score(X_test, y_test))

Validation Accuracy: 0.8305084745762712 
Test Accuracy: 0.9322033898305084
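
With only 59 rows each in validation and test, single-split accuracies are noisy (at 0.83 accuracy on 59 rows, one standard error is roughly 0.05), which may explain part of the gap between the two scores. Cross-validation gives a steadier estimate. The block below is a minimal sketch on hypothetical stand-in data, and it uses scikit-learn's `OneHotEncoder` in place of `ce.OneHotEncoder` so that it is self-contained. In the notebook, `X_trainval` and `y_trainval` would take the place of `X_toy` and `y_toy`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in data; swap in X_trainval / y_trainval from the notebook.
rng = np.random.RandomState(42)
X_toy = pd.DataFrame({
    'q1': rng.choice(['yes', 'no'], size=200),
    'q2': rng.choice(['yes', 'no'], size=200),
    'group': rng.choice(['a', 'b', 'c'], size=200),
})
y_toy = np.where((X_toy['q1'] == 'yes') & (X_toy['q2'] == 'yes'), 'YES', 'NO')

# sklearn's OneHotEncoder is used here in place of ce.OneHotEncoder.
pipeline = make_pipeline(
    OneHotEncoder(handle_unknown='ignore'),
    RandomForestClassifier(n_estimators=50, random_state=42),
)

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_toy, y_toy, cv=cv, scoring='accuracy')
print(scores.mean(), scores.std())
```

Reporting the mean and spread across folds makes it easier to judge whether a difference between two models is larger than the noise.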
