# Predicting Congenital Disorder using Weighted Random Forest
https://www.kaggle.com/c/ga-dat-syd13/data

Author: [Aman Arora](https://au.linkedin.com/in/aroraaman?trk=profile-badge)  

**So, what is a congenital disorder?** <br> 
Most babies are born healthy, but when a baby has a condition that is present from birth, it is called a congenital disorder. Congenital disorders can be inherited or caused by environmental factors and their impact on a child’s health and development can vary from mild to severe. A child with a congenital disorder may experience a disability or health problems throughout life. (https://www.pregnancybirthbaby.org.au/what-is-a-congenital-disorder)

**Here are some of the resources that I referenced before creating this notebook:**<br>
1. https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
2. General Assembly, Sydney - Linear Regression iPython Notebook. Authors: Kevin Markham (Washington, D.C.), Ed Podojil (New York City); <br>
   Taught by: **Dima Galat** (https://www.linkedin.com/in/dimagalat/)
3. https://www.booktopia.com.au/multivariate-data-analysis-joe-f-hair/prod9781292021904.html?source=pla&gclid=EAIaIQobChMIpJ2qkJLO3QIV16mWCh3RBAFUEAQYASABEgL39vD_BwE (Multivariate Data Analysis)
4. https://www-bcf.usc.edu/~gareth/ISL/ (Introduction to Statistical Learning, James et al., 2014)
5. https://www.kaggle.com/erikbruin/house-prices-lasso-xgboost-and-a-detailed-eda
6. https://www.kaggle.com/juliencs/a-study-on-regression-applied-to-the-ames-dataset
7. https://www.kaggle.com/apapiu/regularized-linear-models

### Importing Libraries

In [1]:
#Getting the toolkit together
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

from sklearn import metrics, svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (auc, classification_report, confusion_matrix,
                             precision_recall_curve, recall_score,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import RandomOverSampler

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("fivethirtyeight")

### Data EDA

I performed the initial EDA for this data set in my notebook 'Predicting Congenital Disorder using Resampling.ipynb'. Here I will therefore only clean the data (handle the nulls) and fit a weighted Random Forest model.

In [2]:
# Import the training data
train_path = '/Users/user/Desktop/Folders/Data_Scientist/Project 3_ GA/health-diagnostics-train.csv'
health_df = pd.read_csv(train_path)

# Import the test data
test_path = '/Users/user/Desktop/Folders/Data_Scientist/Project 3_ GA/health-diagnostics-test.csv'
health_df_test = pd.read_csv(test_path)

In [3]:
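# '#NULL!' marks missing values in the raw files; convert it to NaN
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
health_df.replace('#NULL!', np.nan, inplace = True)
health_df_test.replace('#NULL!', np.nan, inplace = True)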

I have already tried different combinations of imputing and dropping data while working on this exercise. Because these are categorical features, imputing missing values with the mode() makes sense. However, dropping the incomplete rows in train and imputing the test set with mode() gave the best accuracy. <br>
Why? First, we have over 30,000 rows in the train data, so losing about 900 of them is roughly 3% of the data. Second, the dropped rows are not unique and all belong to target == 0, that is, our majority class. No minority-class rows are lost, so it is safe to drop this data in train. In the test data we cannot drop any rows, so there we impute with mode(). 

In [4]:
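# Train: drop rows with any missing value
health_df.dropna(inplace = True)
# Test: fill each column's missing values with its most frequent value (the mode)
health_df_test = health_df_test.apply(lambda x: x.fillna(x.value_counts().index[0]))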
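
The reasoning above (drop incomplete rows in train only if they all belong to the majority class, impute the test set with the mode) can be checked before anything is dropped. A minimal sketch on a small made-up frame; the column name `feature_a` and all values are hypothetical, only the `target` column name comes from this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for health_df / health_df_test
train = pd.DataFrame({
    "feature_a": [1.0, 2.0, np.nan, 2.0, np.nan, 1.0],
    "target":    [0,   0,   0,      1,   0,      1],
})
test = pd.DataFrame({"feature_a": [2.0, np.nan, 2.0, 1.0]})

# Which target classes do the incomplete training rows belong to?
missing_rows = train[train.isna().any(axis=1)]
missing_by_class = missing_rows["target"].value_counts()
print(missing_by_class)

# Share of training rows that dropna() would remove
dropped_share = len(missing_rows) / len(train)
print("Share of rows dropped:", round(dropped_share, 3))

# Mode imputation for the test set: fill each column with its most frequent value
test_imputed = test.apply(lambda col: col.fillna(col.mode().iloc[0]))
print(test_imputed)
```

If `missing_by_class` lists only class 0, dropping the rows removes no minority-class examples.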

**TRAIN TEST SPLIT**

In [5]:
X = health_df.loc[:, health_df.columns != 'target']
y = health_df['target']

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 60)
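
With a minority class this rare, a purely random split can leave the hold-out set with very few positive cases. One option, not used in the cell above, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. A sketch on synthetic labels (the data and the names `X_demo`, `y_demo` are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels: 990 negatives and 10 positives
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([0] * 990 + [1] * 10)

# stratify=y_demo preserves the 99:1 class ratio in both train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=60, stratify=y_demo
)

print("Positives in train:", int(y_tr.sum()))
print("Positives in test:", int(y_te.sum()))
```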

In [8]:
X_train.shape

(24348, 9)

In [9]:
X_test.shape

(3630, 9)

### Implementing Weighted Random Forest

In [10]:
target_count = pd.Series(y_train).value_counts()

print('Class 0:', target_count[0])
print('Class 1:', target_count[1])
print('Proportion of majority to minority class:', round(target_count[0] / target_count[1], 2), ': 1')

prop = round(target_count[0] / target_count[1],2)

Class 0: 24299
Class 1: 49
Proportion of majority to minority class: 495.9 : 1


I ran a random search followed by a grid search to find the best class weight. It turned out that the best class weight was the inverse of the class proportion in our dataset. This is also suggested in the research article "Predicting congenital heart defects: A comparison of three data mining methods" by Yanhong Luo. (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0177811)
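
For reference, scikit-learn's `'balanced'` heuristic also produces weights inversely proportional to class frequency, `n_samples / (n_classes * count)`. The sketch below applies it to the class counts printed above (24299 and 49). It only illustrates the inverse-proportion idea and does not reproduce the exact weights used in this notebook:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rebuild a label vector with the class counts printed above
y_counts = np.array([0] * 24299 + [1] * 49)

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]),
                               y=y_counts)
class_weight = {0: weights[0], 1: weights[1]}
print(class_weight)

# The ratio of the two weights equals the majority-to-minority proportion
ratio = weights[1] / weights[0]
print("Weight ratio (class 1 : class 0):", round(ratio, 2))
```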

In [11]:
RF = RandomForestClassifier(n_estimators = 500,
                            class_weight={0: 1/prop,1:1.6},
                            max_features = 3,
                            min_samples_leaf = 65,
                            criterion = 'gini',
                            random_state = 123,
                            bootstrap = True
                           )
RF.fit(X_train,y_train)

# Hard class predictions feed the confusion matrix;
# class-1 probabilities feed the ROC curve and AUC
y_pred = RF.predict(X_test)
y_score = RF.predict_proba(X_test)[:, 1]

#Evaluation
cnf_matrix = confusion_matrix(y_test, y_pred)
TP = cnf_matrix[1,1]
TN = cnf_matrix[0,0]
FP = cnf_matrix[0,1]
FN = cnf_matrix[1,0]
TPR = TP/(TP+FN)                # recall / sensitivity
TNR = TN/(TN+FP)                # specificity
ACC = (TP+TN)/(TP+TN+FP+FN)
wtACC = (0.7*TPR) + (0.3*TNR)   # weighted accuracy, favouring recall
prec = TP/(TP+FP)
G_mean = np.sqrt(TPR * TNR)
print("the recall for this model is :",TPR)
print("TP",TP) 
print("TN",TN) 
print("FP",FP) 
print("FN",FN) 
print("AUC: {}".format(roc_auc_score(y_test, y_score)))
print('PRECISION:',prec)
print('TPR:',TPR)
print('TNR:',TNR)
print('ACC:',ACC)
print('wtACC:',wtACC)
print('G_mean:',G_mean)
print('-'*100)

fpr, tpr, threshold = roc_curve(y_test, y_score)
auc_score = roc_auc_score(y_test, y_score)
print('AUC score:', auc_score, 'for model:',RF)
plt.figure()
plt.plot(fpr, tpr, color='red', lw=2, label='ROC curve')
plt.plot([0, 1], [0, 1], color='blue', lw=2, linestyle='--')
##Title and label
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.show()

the recall for this model is : 0.03636363636363636
TP 132
TN 0
FP 0
FN 3498




ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.
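
As the message says, ROC AUC is undefined when the labels passed as `y_true` contain only one class, because the curve needs both positives and negatives. A small guard makes that case explicit. The helper name `safe_roc_auc` is hypothetical and the arrays are toy values:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def safe_roc_auc(y_true, y_score):
    """Return ROC AUC, or None when y_true holds fewer than two classes."""
    if np.unique(y_true).size < 2:
        return None
    return roc_auc_score(y_true, y_score)

# Both classes present: the scores separate the classes perfectly
auc_two_classes = safe_roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
print(auc_two_classes)

# Only one class present: AUC is undefined, so the helper returns None
auc_one_class = safe_roc_auc([1, 1, 1], [0.2, 0.6, 0.9])
print(auc_one_class)
```

Checking `y_test.value_counts()` before scoring would show whether the hold-out labels really contain a single class.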

**HYPERPARAMETER TUNING**

In [None]:
# #Class Weight 
# class_weight=[{0: 1/prop, 1: w} for w in np.linspace(start = 1, stop = 3, num = 20)]

# # Number of trees in random forest
# n_estimators = [100,200,300,400,500]

# # Number of features to consider at every split
# max_features = [2,3,4,5,6,7,8,9]

# # Maximum number of levels in tree
# max_depth = [int(x) for x in np.linspace(10, 1000, num = 50)]
# max_depth.append(None)

# # Minimum number of samples required at each leaf node
# min_samples_leaf = [45,50,55,60,65,70,75,80,100,150,200]
# # Method of selecting samples for training each tree
# bootstrap = [True, False]
# #criterion - measure of quality of split 
# criterion = ['gini','entropy']
# # Create the random grid
# random_grid = {'n_estimators': n_estimators,
#                'class_weight':class_weight,
#                'max_features': max_features,
#                'max_depth': max_depth,
#                'min_samples_leaf': min_samples_leaf,
#                'bootstrap': bootstrap,
#                'criterion':criterion}

In [None]:
# # Use the random grid to search for best hyperparameters

# rf = RandomForestClassifier()

# rf_random = RandomizedSearchCV(estimator = rf, 
#                                param_distributions = random_grid, 
#                                n_iter = 200, 
#                                cv = 3, 
#                                scoring = 'roc_auc',
#                                verbose=2, 
#                                random_state=None, 
#                                n_jobs = -1)

# # Fit the random search model
# rf_random.fit(X_train,y_train)

# # # RUNTIME ~35mins

In [None]:
# rf_random.best_estimator_
# # >>> RandomForestClassifier(bootstrap=True,
# #             class_weight={0: 0.0018015096650993532, 1: 1.1052631578947367},
# #             criterion='entropy', max_depth=460, max_features=9,
# #             max_leaf_nodes=None, min_impurity_decrease=0.0,
# #             min_impurity_split=None, min_samples_leaf=70,
# #             min_samples_split=2, min_weight_fraction_leaf=0.0,
# #             n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
# #             verbose=0, warm_start=False)
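
A scaled-down version of the randomized search above, run on synthetic data so that it finishes in seconds. The data, the reduced grid, and the explicit `StratifiedKFold` are assumptions made for this sketch and are not the settings that produced the result above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic imbalanced data (roughly 5% positives), standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=600, n_features=9, n_informative=4,
                                     weights=[0.95, 0.05], random_state=0)

# Small search space: the positive-class weight is the main quantity being tuned
param_distributions = {
    "n_estimators": [10, 20],
    "min_samples_leaf": [5, 10],
    "class_weight": [{0: 1, 1: w} for w in (5, 10, 20)],
}

# Stratified folds keep some positive cases in every validation fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,
    cv=cv,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("Best CV AUC:", round(search.best_score_, 3))
```

Fixing `random_state` in both the estimator and the search makes the run repeatable, which the original search (with `random_state=None`) is not.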