# DS 7331 Data Mining
### Logistic Regression and SVM
### Mini Lab
* Allen Ansari<br>
* Chad Madding<br>
* Yongjun (Ian) Chu<br>

## Introduction
Cardiovascular diseases (CVD) are the leading cause of death in the US each year. To reduce the death rate, the best approach is early detection and screening. In this Mini Lab we implement Logistic Regression (Logit) and Support Vector Machine (SVM) models to predict the probability that a patient has CVD from medical examination results, such as blood pressure and glucose level. The analysis is organized into the following categories:

**1) Model Creation**
- Create a logistic regression model and a support vector machine model for the classification task involved with our dataset. 
- Assess how well each model performs (use 80/20 training/testing split for your data).
- Adjust parameters of the models to make them more accurate. The SGDClassifier is fine to use for optimizing logistic regression and linear support vector machines. For many problems, SGD will be required in order to train the SVM model in a reasonable timeframe. 

**2) Model Advantages**
- Discuss the advantages of each model for each classification task. 
- Does one type of model offer superior performance over another in terms of prediction accuracy? In terms of training time or efficiency? Explain in detail.

**3) Interpret Feature Importance**
- Use the weights from logistic regression to interpret the importance of different features for the classification task.
- Explain your interpretation in detail. Why do you think some variables are more important?

**4) Interpret Support Vectors**
- Look at the chosen support vectors for the classification task. Do these provide any insight into the data? Explain. If you used stochastic gradient descent (and therefore did not explicitly solve for support vectors), try subsampling your data to train the SVC model, then analyze the support vectors from the subsampled dataset.
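
The subsampling idea in the last item can be sketched as follows. This is a minimal example on synthetic data, assuming a linear-kernel `SVC` with `C=1.0` and a subsample of 500 rows; none of these choices come from the lab itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the full dataset (assumption: not the real CVD data).
X, y = make_classification(n_samples=5000, n_features=12, n_informative=6,
                           n_clusters_per_class=1, random_state=0)

# Draw a random subsample so that the exact SVC solver stays fast.
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=500, replace=False)
X_sub = StandardScaler().fit_transform(X[idx])
y_sub = y[idx]

svc = SVC(kernel="linear", C=1.0)
svc.fit(X_sub, y_sub)

# Support vectors are the training points on or inside the margin.
print("support vectors per class:", svc.n_support_)
print("fraction of subsample that are support vectors:",
      len(svc.support_) / len(X_sub))

# Functional margin y * f(x) with labels mapped to -1/+1; for support
# vectors it is at most 1, up to solver tolerance.
signed_labels = 2 * y_sub[svc.support_] - 1
functional_margin = signed_labels * svc.decision_function(svc.support_vectors_)
print("largest functional margin among support vectors:",
      functional_margin.max())
```

Comparing the feature distributions of `svc.support_vectors_` with those of the whole subsample is one way to see which records sit near the decision boundary.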

## Business Understanding
### Choosing the cardiovascular diseases dataset
Cardiovascular diseases (CVD) are the leading cause of death in the US each year. To reduce the death rate, the best approach is early detection and screening. An efficient way to do this is to predict the probability that a patient has CVD from medical examination results, such as blood pressure and glucose level.

Here, we obtained a CVD dataset from Kaggle. It consists of 70,000 patient records with 12 features, such as age, gender, systolic blood pressure, diastolic blood pressure, and CVD status (binary, 1 or 0). The purpose of this dataset is to determine which medical factors have the most bearing on whether a patient has CVD.

To mine useful knowledge from the dataset, we will build a prediction algorithm chosen from commonly used classification models, including logistic regression, to relate a specific attribute or group of attributes to the probability that a patient has CVD. To measure the effectiveness of the algorithm, we will use cross-validation. For each fold, we compute the Area Under the Receiver Operating Characteristic Curve (AUC), a performance metric for binary classification models. AUC measures the model's ability to assign a higher score to positive examples than to negative examples. Averaging the AUC values across the cross-validation folds gives an overall performance measure for a given model. We will compare the results from the different models and choose the best one(s).
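
The evaluation scheme described above can be sketched as follows. This is a minimal example on synthetic data, assuming 5 stratified folds and a default `LogisticRegression`; the fold count and the model settings are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CVD features (assumption: not the real dataset).
X, y = make_classification(n_samples=2000, n_features=12, n_informative=6,
                           n_clusters_per_class=1, random_state=0)

# Stratified folds keep the positive/negative ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# scoring="roc_auc" computes one AUC per held-out fold.
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
mean_auc = auc_scores.mean()

print("AUC per fold:", auc_scores.round(3))
print(f"mean AUC: {mean_auc:.3f}")
```

Swapping the estimator inside `make_pipeline` while keeping the same `cv` object lets different models be compared on identical folds.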

### Data description

We will be performing an analysis of the cleaned-up cardiovascular diseases dataset from our Lab 1 assignment.

Our task is to predict the presence or absence of cardiovascular disease (CVD) using the patient examination results. 

There are 3 types of input features:

- *Objective*: factual information;
- *Examination*: results of medical examination;
- *Subjective*: information given by the patient.

|Feature   |Variable Type   |Variable   |Value Type   |
|:---------|:--------------|:---------------|:------------|
| Gender | Objective Feature | gender | categorical code |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |
| Age | Objective Feature | years | int (years) |
| Body mass index (BMI) | Objective Feature | BMI | categorical code |

For any binary data type, "0" means "No" and "1" means "Yes". All of the dataset values were collected at the moment of medical examination.
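
The BMI column holds small integer codes, not raw BMI values. As an illustration of how such a code could be derived from height and weight, the sketch below computes BMI as weight (kg) divided by height (m) squared and bins it with the standard WHO cut-offs (18.5, 25, 30). The cut-offs and the 1 to 4 coding are assumptions; the source does not state how the column was built. The height and weight values are the first five records of the cleaned dataset.

```python
import numpy as np
import pandas as pd

# Height (cm) and weight (kg) of the first five records in the cleaned data.
sample = pd.DataFrame({
    "height": [168, 156, 165, 169, 156],
    "weight": [62.0, 85.0, 64.0, 82.0, 56.0],
})

# BMI = weight in kilograms / (height in metres) squared.
bmi = sample["weight"] / (sample["height"] / 100) ** 2

# Assumed coding: 1 = underweight (<18.5), 2 = normal (18.5 to <25),
# 3 = overweight (25 to <30), 4 = obese (>=30).
sample["BMI"] = np.digitize(bmi, [18.5, 25.0, 30.0]) + 1

print(sample.assign(bmi_raw=bmi.round(1)))
```

Under these assumptions the computed codes for the five records are 2, 4, 2, 3, 2, which agrees with the BMI values stored in the dataset for the same rows.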

```python
import pandas as pd
import numpy as np
import copy
import seaborn as sns

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

matplotlib.style.use('ggplot')

import warnings
warnings.simplefilter('ignore', DeprecationWarning)
warnings.simplefilter('ignore', FutureWarning)

from pandas.plotting import scatter_matrix

# Bring in the data set
df = pd.read_csv('data/cardio_clean.csv')  # read in the CSV file

# Show the dimensions and the first 5 rows of the dataset
print(df.shape)
df.head()
```

(63055, 13)


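| | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | years | BMI |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| 0 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 | 50 | 2 |
| 1 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 | 55 | 4 |
| 2 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 | 52 | 2 |
| 3 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 | 48 | 3 |
| 4 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 | 48 | 2 |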
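
Before modeling, it can help to confirm that the loaded columns match the data description. The sketch below runs a few such checks on the five preview rows, typed in by hand so the example is self-contained; on the real data the same checks would be applied to `df`. The grouping of columns into `binary_cols` and `ordinal_cols` follows the feature table, and the blood pressure check is an added sanity check, not a step taken from the lab.

```python
import pandas as pd

# The five preview rows, entered manually (on the real data, use df instead).
preview = pd.DataFrame({
    "gender": [2, 1, 1, 2, 1],
    "height": [168, 156, 165, 169, 156],
    "weight": [62.0, 85.0, 64.0, 82.0, 56.0],
    "ap_hi": [110, 140, 130, 150, 100],
    "ap_lo": [80, 90, 70, 100, 60],
    "cholesterol": [1, 3, 3, 1, 1],
    "gluc": [1, 1, 1, 1, 1],
    "smoke": [0, 0, 0, 0, 0],
    "alco": [0, 0, 0, 0, 0],
    "active": [1, 1, 0, 1, 0],
    "cardio": [0, 1, 1, 1, 0],
    "years": [50, 55, 52, 48, 48],
    "BMI": [2, 4, 2, 3, 2],
})

binary_cols = ["smoke", "alco", "active", "cardio"]
ordinal_cols = ["cholesterol", "gluc"]

# Columns whose values fall outside the documented coding.
bad_binary = [c for c in binary_cols if not preview[c].isin([0, 1]).all()]
bad_ordinal = [c for c in ordinal_cols if not preview[c].isin([1, 2, 3]).all()]

# Systolic pressure is expected to exceed diastolic pressure.
bp_ok = (preview["ap_hi"] > preview["ap_lo"]).all()

# Share of each target class; a strong imbalance would affect accuracy.
class_balance = preview["cardio"].value_counts(normalize=True)

print("binary columns with unexpected values:", bad_binary)
print("ordinal columns with unexpected values:", bad_ordinal)
print("systolic > diastolic in every row:", bool(bp_ok))
print(class_balance)
```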
