# Data Mining 7331 - Fall 2019
## Lab 2 - Classification

* **Allen Ansari**
* **Chad Madding**
* **Yongjun (Ian) Chu**

## Introduction
Cardiovascular diseases (CVD) are the number one cause of death in the US each year. The best way to reduce the death rate is early detection and screening. In this Mini Lab we will implement Logistic Regression (Logit) and Support Vector Machine (SVM) models to predict the probability of a patient having CVD based on results from medical examinations, such as blood pressure values and glucose content. The following categories are used for the analysis:

### Data description

We will be performing an analysis of the cardiovascular disease dataset found on Kaggle (https://www.kaggle.com/sulianova/cardiovascular-disease-dataset). Our analysis will consist of exploring the statistical summaries of the features, visualizing the attributes, and drawing conclusions from the visualizations and analysis.

Our task is to predict the presence or absence of cardiovascular disease (CVD) using the patient examination results. 

There are 3 types of input features:

- *Objective*: factual information;
- *Examination*: results of medical examination;
- *Subjective*: information given by the patient.

|Feature   |Variable Type   |Variable   |Value Type   |
|:---------|:--------------|:---------------|:------------|
| Age | Objective Feature | years | int (years) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Body Mass Index | Examination Feature | BMI | int (category code) |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |

For any binary data type, "0" means "No" and "1" means "Yes". All of the dataset values were collected at the moment of medical examination.

### Table of Contents<a id="top"></a>

* **[Data Preparation Part 1](#Data_Preparation_Part_1)**
    * **[10 points]** Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis.
* **[Data Preparation Part 2](#Data_Preparation_Part_2)**
    * **[5 points]** Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).
* **[Modeling and Evaluation 1](#Modeling_and_Evaluation_1)**
    * **[10 points]** Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.
* **[Modeling and Evaluation 2](#Modeling_and_Evaluation_2)**
    * **[10 points]** Choose the method you will use for dividing your data into training and testing splits (i.e., are you using Stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. For example, if you are using time series data then you should be using continuous training and testing sets across time.
* **[Modeling and Evaluation 3](#Modeling_and_Evaluation_3)**
    * **[20 points]** Create three different classification/regression models for each task (e.g., random forest, KNN, and SVM for task one and the same or different algorithms for task two). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric. You must investigate different parameters of the algorithms!
* **[Modeling and Evaluation 4](#Modeling_and_Evaluation_4)**
    * **[10 points]** Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.
* **[Modeling and Evaluation 5](#Modeling_and_Evaluation_5)**
    * **[10 points]** Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods. You must use statistical comparison techniques—be sure they are appropriate for your chosen method of validation as discussed in unit 7 of the course.
* **[Modeling and Evaluation 6](#Modeling_and_Evaluation_6)**
    * **[10 points]** Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.
* **[Deployment](#Deployment)**
    * **[5 points]** How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?
* **[Exceptional Work](#Exceptional_Work)**
    * **[10 points]** You have free rein to provide additional analyses. One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?

<a href="#top">Back to Top</a>
### Data Preparation Part 1<a id="Data_Preparation_Part_1"></a>
* Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis.

#### Load the cleaned data generated in Lab_1

In [1]:
import pandas as pd
import numpy as np
import copy
import seaborn as sns

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

matplotlib.style.use('ggplot')

import warnings
warnings.simplefilter('ignore', DeprecationWarning)
warnings.simplefilter('ignore', FutureWarning)

from pandas.plotting import scatter_matrix

#Bring in data set
df = pd.read_csv('data/cardio_clean.csv', sep=',') #read in the csv file

# Show the dimensions and the first 5 rows of the dataset
print(df.shape)
df.head()

(63055, 14)


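|   | Unnamed: 0 | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | years | BMI |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| 0 | 0 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 | 50 | 2 |
| 1 | 1 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 | 55 | 4 |
| 2 | 2 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 | 52 | 2 |
| 3 | 3 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 | 48 | 3 |
| 4 | 4 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 | 48 | 2 |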
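
The `BMI` column in the preview holds small integer codes rather than raw index values. The five rows shown are consistent with the standard WHO cut-offs (18.5, 25, 30) applied to weight / height², but the binning actually used in Lab 1 is not shown in this notebook, so the sketch below is an assumption. `bmi_category` and `BMI_EDGES` are hypothetical names introduced only for illustration.

```python
from bisect import bisect_right

# Assumed WHO cut-offs; the binning actually used in Lab 1 is not shown here.
BMI_EDGES = [18.5, 25.0, 30.0]

def bmi_category(height_cm, weight_kg):
    """Return an assumed BMI code: 1 underweight, 2 normal, 3 overweight, 4 obese."""
    bmi = weight_kg / (height_cm / 100.0) ** 2
    return bisect_right(BMI_EDGES, bmi) + 1

# (height, weight) pairs taken from the five preview rows above
preview = [(168, 62.0), (156, 85.0), (165, 64.0), (169, 82.0), (156, 56.0)]
codes = [bmi_category(h, w) for h, w in preview]
print(codes)
```

If the printed codes match the `BMI` column of the preview, the assumed cut-offs are at least compatible with those rows; that does not prove the same rule was applied to the full dataset.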


<a href="#top">Back to Top</a>
### Data Preparation Part 2<a id="Data_Preparation_Part_2"></a>
* Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).

<a href="#top">Back to Top</a>
### Modeling and Evaluation 1<a id="Modeling_and_Evaluation_1"></a>
* Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.

<a href="#top">Back to Top</a>
### Modeling and Evaluation 2<a id="Modeling_and_Evaluation_2"></a>
* Choose the method you will use for dividing your data into training and testing splits (i.e., are you using Stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate. For example, if you are using time series data then you should be using continuous training and testing sets across time.

<a href="#top">Back to Top</a>
### Modeling and Evaluation 3<a id="Modeling_and_Evaluation_3"></a>
* Create three different classification/regression models for each task (e.g., random forest, KNN, and SVM for task one and the same or different algorithms for task two). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric. You must investigate different parameters of the algorithms!

### KNN Classification Parameter Optimization with GridSearch

K-Nearest Neighbor (KNN) classification is a valid option for this dataset because the data has been preprocessed and has no missing values. Parameter selection is critical to the performance of KNN classifiers, so substantial time and effort went into investigating the optimal parameters.

##### Parameter Analysis:

*Algorithms:* The algorithm used to compute the nearest neighbors can be set to ‘auto’, which automatically determines the most appropriate algorithm for the given dataset and parameters, so it was left at its default in our GridSearch.

##### GridSearch Parameters:

*n_neighbors:* Number of neighbors to use in the analysis. Preliminary analyses were conducted to find a suitable range for the number of neighbors. They indicated that the optimal number of neighbors is below 15; above 15, the accuracy plateaus and starts to decrease.

*Leaf_size:* The leaf size was varied over 10 and 30 in the grid search. Leaf size is passed to the underlying tree structure (BallTree or KDTree) and mainly affects the speed and memory use of building and querying the tree rather than the predictions themselves. Smaller leaves carry an overhead penalty, but we include the parameter in our Grid Search to check for any effect.

*Metric:* How distance is measured between data points can be adjusted. The two options chosen were ‘minkowski’ and ‘euclidean’. With the default power parameter p=2, the Minkowski distance is identical to the Euclidean distance, so these two settings select the same neighbors.

*Weights:* Both ‘uniform’ and ‘distance’ were evaluated. With ‘uniform’ weights, all neighboring points get equal weight. With ‘distance’ weights, points are weighted by the inverse of their distance.
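
To make the metric and weights options concrete, here is a minimal pure-Python sketch. The helper names (`minkowski`, `euclidean`, `weighted_vote`) and the toy numbers are hypothetical and are not part of scikit-learn; the sketch only mirrors the definitions given above. It shows that Minkowski distance with p=2 coincides with Euclidean distance, and that inverse-distance weighting can overturn a simple majority vote.

```python
import math

def minkowski(a, b, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def euclidean(a, b):
    """Ordinary straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_vote(distances, labels, weights="uniform"):
    """Winning label among neighbors under uniform or inverse-distance weights."""
    votes = {}
    for d, label in zip(distances, labels):
        w = 1.0 if weights == "uniform" else 1.0 / d  # assumes d > 0
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, p=2), euclidean(a, b))  # the two values coincide

# Toy neighborhood: one very close positive, two farther negatives
dists, labs = [0.1, 1.0, 1.2], [1, 0, 0]
print(weighted_vote(dists, labs, "uniform"))   # majority of neighbors wins
print(weighted_vote(dists, labs, "distance"))  # the close neighbor dominates
```

Because `p` stays at its default of 2, a grid that lists both ‘minkowski’ and ‘euclidean’ effectively evaluates each remaining parameter combination twice.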

*Predictor Variables:*  
Many of the predictor variables are on different scales, so to ensure all variables are treated equally in the analysis, all predictor variables are scaled to have a mean of 0 and a standard deviation of 1.
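
Standardization replaces each value with z = (x - mean) / std, computed column by column. A point worth keeping in mind is that the mean and standard deviation should ideally come from the training rows only, so that held-out rows do not influence the transform. The sketch below illustrates this on synthetic stand-in data (`X_demo` is made up for the example and is not the cardio dataset).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: one feature in the hundreds (like height in cm) and one 0/1 flag
X_demo = np.column_stack([rng.normal(165, 8, 200), rng.integers(0, 2, 200)]).astype(float)
X_tr, X_te = X_demo[:150], X_demo[150:]

scaler = StandardScaler().fit(X_tr)  # statistics come from the training rows only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)      # held-out rows reuse the training mean and std

# The transform is just z = (x - mean) / std, column by column
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(np.allclose(X_te_s, manual))
print(X_tr_s.mean(axis=0).round(6), X_tr_s.std(axis=0).round(6))
```

Wrapping the scaler and the classifier in a `sklearn.pipeline.Pipeline` is one way to have cross-validation repeat this fit-on-train step inside every fold.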

In [2]:
#KNN Classification 10-fold cross-validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn import metrics as mt
import warnings
warnings.filterwarnings('ignore')

#Bring in data set
df = pd.read_csv('data/cardio_train.csv', sep=';') #read in the csv file

#Xtemp = df.drop(['cardio'], axis=1)
# create variables we are more familiar with
y = df.cardio
X = df.drop(['cardio'], axis=1)

#yhat = np.zeros(y.shape) # we will fill this with predictions

# Scaling training variables
scl = StandardScaler()
X = scl.fit_transform(X)

# create cross validation iterator
cv = StratifiedKFold(n_splits=10)

ClsEstimator = KNeighborsClassifier(n_jobs = -1)

parameters = { 'n_neighbors':[3,5,13]
              ,'weights': ['uniform','distance']
              ,'leaf_size': [10,30]
              ,'metric': ['minkowski','euclidean']
             }
#Create a grid search object using the estimator and parameter grid defined above
from sklearn.model_selection import GridSearchCV
ClsGridSearch = GridSearchCV(estimator=ClsEstimator
                   #, n_jobs=10 # jobs to run in parallel
                   , verbose=1 # low verbosity
                   , param_grid=parameters
                   , cv=cv # KFolds = 10
                   , scoring='accuracy')

#Perform hyperparameter search to find the best combination of parameters for our data
ClsGridSearch.fit(X, y)

Fitting 10 folds for each of 24 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:  9.7min finished


GridSearchCV(cv=StratifiedKFold(n_splits=10, random_state=None, shuffle=False),
             error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=-1,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='warn', n_jobs=None,
             param_grid={'leaf_size': [10, 30],
                         'metric': ['minkowski', 'euclidean'],
                         'n_neighbors': [3, 5, 13],
                         'weights': ['uniform', 'distance']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='accuracy', verbose=1)
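
Beyond the single best combination, a fitted `GridSearchCV` object keeps per-candidate scores in its `cv_results_` attribute, which helps judge how much each parameter actually matters. The following is a minimal sketch on synthetic data from `make_classification`, not on the cardio data; the pipeline step names `scale` and `knn` and the reduced grid are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the notebook would pass its own X and y instead
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaling lives inside the pipeline, so it is refit on each training fold
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
grid = {"knn__n_neighbors": [3, 5, 13], "knn__weights": ["uniform", "distance"]}

search = GridSearchCV(pipe, grid, cv=StratifiedKFold(n_splits=5), scoring="accuracy")
search.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
cols = ["param_knn__n_neighbors", "param_knn__weights",
        "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[cols].sort_values("rank_test_score")
print(search.best_params_)
print(results.to_string(index=False))
```

Comparing `mean_test_score` against `std_test_score` across rows gives a rough sense of whether the winning candidate is clearly better or within fold-to-fold noise.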

The GridSearch algorithm determined the following optimal parameters for the K-Neighbors algorithm:

*Leaf-Size:* 10  
*Number of Neighbors:* 13  
*Distance Metric:* Minkowski  
*Weights:* Uniform

In [3]:
#Use the best parameters for our KNN classifier
ClsGridSearchEst = ClsGridSearch.best_estimator_
print(ClsGridSearchEst)

KNeighborsClassifier(algorithm='auto', leaf_size=10, metric='minkowski',
                     metric_params=None, n_jobs=-1, n_neighbors=13, p=2,
                     weights='uniform')


Re-run the KNN classification analysis with the optimal algorithm parameters that were determined by the parameter GridSearch.

In [4]:
yhat = np.zeros(y.shape) # initializing variable

for train, test in cv.split(X,y):
    # Use the best parameters found by GridSearch to run the KNN classifier
    clf_knn = KNeighborsClassifier(n_neighbors=13, weights='uniform',metric='minkowski', algorithm='auto',p=2,leaf_size=10)
    # X is a NumPy array; y is a pandas Series, so index it by position
    clf_knn.fit(X[train],y.iloc[train])
    yhat[test] = clf_knn.predict(X[test])

total_accuracy = mt.accuracy_score(y, yhat)
#Print out the results
print('KNN classifier accuracy with optimal parameters is: %.3f'%(total_accuracy))

KNN classifier accuracy with optimal parameters is: 0.641


*KNN classifier accuracy* with optimal parameters is *64.1%*.
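
The manual fold loop above can also be expressed with `cross_val_predict`, which returns one out-of-fold prediction per row. The sketch below uses synthetic stand-in data rather than the cardio dataset and checks that the compact form and an explicit loop give the same predictions; `X_demo`, `y_demo`, and `cv_demo` are names introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (the notebook's scaled X and y would be used instead)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

cv_demo = StratifiedKFold(n_splits=10)
knn = KNeighborsClassifier(n_neighbors=13, weights="uniform", leaf_size=10)

# Compact form: one out-of-fold prediction per row
yhat_fast = cross_val_predict(knn, X_demo, y_demo, cv=cv_demo)

# Equivalent explicit loop, mirroring the cell above
yhat_loop = np.zeros_like(y_demo)
for train, test in cv_demo.split(X_demo, y_demo):
    knn.fit(X_demo[train], y_demo[train])
    yhat_loop[test] = knn.predict(X_demo[test])

print(np.array_equal(yhat_fast, yhat_loop))
print("accuracy: %.3f" % accuracy_score(y_demo, yhat_fast))
```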

### Random Forest

One of the most commonly used classification techniques is random forest, thanks to its low bias and general stability. One way to optimize a random forest model is to try different parameters by hand to improve performance. Another is to use grid search, which systematically evaluates combinations of hyperparameters and reports the best-performing one. We chose this route because it saves both time and sanity when comparing so many different parameters.
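
As a rough sketch of the grid-search route described above, the example below tunes a `RandomForestClassifier` with `GridSearchCV`. It runs on synthetic data from `make_classification`, and the grid values in `rf_grid` are hypothetical placeholders, not the settings tuned in this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data; replace with the notebook's features and the cardio target
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=2019)

# Hypothetical grid: illustrative values only, not the ones tuned in this lab
rf_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, None],
    "min_samples_leaf": [1, 10],
}

rf_search = GridSearchCV(
    RandomForestClassifier(random_state=2019),  # fixed seed so reruns match
    rf_grid,
    cv=StratifiedKFold(n_splits=3),
    scoring="accuracy",
)
rf_search.fit(X_demo, y_demo)

print(rf_search.best_params_)
print("mean CV accuracy: %.3f" % rf_search.best_score_)
```

On a dataset the size of the cardio data, passing `n_jobs=-1` to `GridSearchCV` would evaluate candidates in parallel, at the cost of higher memory use.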

We'll start with a baseline random forest for our starting position.

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop('cardio', axis=1), df['cardio'], test_size=0.3, random_state=2019)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print('Random Forest : Accuracy score - %.3f' % mt.accuracy_score(y_test, y_pred))

print('Random Forest : F1 score - %.3f' % mt.f1_score(y_test, y_pred))

Random Forest : Accuracy score - 0.725
Random Forest : F1 score - 0.718
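
The cell above reports both accuracy and F1. F1 is the harmonic mean of precision and recall, so it reacts to missed CVD cases (false negatives) and false alarms (false positives), which accuracy alone can hide. The sketch below computes these quantities from raw confusion counts; `binary_scores` is a hypothetical helper and the labels are toy values, not model output.

```python
def binary_scores(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy labels: 4 true positives in the data, one missed and one false alarm
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

acc, prec, rec, f1 = binary_scores(y_true_demo, y_pred_demo)
print("accuracy=%.2f precision=%.2f recall=%.2f f1=%.2f" % (acc, prec, rec, f1))
```

These definitions agree with `accuracy_score` and `f1_score` from `sklearn.metrics` for binary labels with the positive class coded as 1.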


<a href="#top">Back to Top</a>
### Modeling and Evaluation 4<a id="Modeling_and_Evaluation_4"></a>
* Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.

<a href="#top">Back to Top</a>
### Modeling and Evaluation 5<a id="Modeling_and_Evaluation_5"></a>
* Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods. You must use statistical comparison techniques—be sure they are appropriate for your chosen method of validation as discussed in unit 7 of the course.

<a href="#top">Back to Top</a>
### Modeling and Evaluation 6<a id="Modeling_and_Evaluation_6"></a>
* Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.

<a href="#top">Back to Top</a>
### Deployment<a id="Deployment"></a>
* How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?

<a href="#top">Back to Top</a>
### Exceptional Work<a id="Exceptional_Work"></a>
* You have free rein to provide additional analyses. One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?