# DS 7331 Data Mining
### Logistic Regression and SVM
### Mini Lab
* Allen Ansari<br>
* Chad Madding<br>
* Yongjun (Ian) Chu<br>

## Introduction
Cardiovascular diseases (CVD) are the no. 1 cause of death in US each year. To reduce the death rate, the best approach is by early detection and screening. In this Mini Lab we will implemented Logistic Regression (Logit) and Support Vector Machine (SVM) to look at predicting the probablity of a patient having CVD based on results from medical examinations, such as blood pressure values and glucose content. The following categories are used for the analysis:

**1) Model Creation**
- Create a logistic regression model and a support vector machine model for the classification task involved with our dataset. 
- Assess how well each model performs (use 80/20 training/testing split for your data).
- Adjust parameters of the models to make them more accurate. The SGDClassifier is fine to use for optimizing logistic regression and linear support vector machines. For many problems, SGD will be required in order to train the SVM model in a reasonable timeframe. 

**2) Model Advantages**
- Discuss the advantages of each model for each classification task. 
- Does one type of model offer superior performance over another in terms of prediction accuracy? In terms of training time or efficiency? Explain in detail.

**3) Interpret Feature Importance**
- Use the weights from logistic regression to interpret the importance of different features for the classification task.
- Explain your interpretation in detail. Why do you think some variables are more important?

**4) Interpet Support Vectors**
- Look at the chosen support vectors for the classification task. Do these provide any insight into the data? Explain. If you used stochastic gradient descent (and therefore did not explicitly solve for support vectors), try subsampling your data to train the SVC model— then analyze the support vectors from the subsampled dataset.

## Business Understanding
### Choosing the cadiovascular diseases dataset
Cardiovascular diseases (CVD) are the no. 1 cause of death in US each year. To reduce the death rate, the best approach is by early detection and screening. An efficeint way would be to predict the probablity of a patient having CVD based on results from medical examinations, such as blood pressure values and glucose content. 

Here, we obtained a CVD dataset from Kaggle. It consists of 70,000 records of patients data in 12 features, such as age, gender, systolic blood pressure, diastolic blood pressure and CVD status(binary, 1 or 0). The purpose of this dataset was to determine which medical aspects had the most bearing on whether a patient would had CVD or not. 

To mine useful knowledge from the dataset, we will establish a prediction algorithm chosen from some commonly used classification models, including logistic regression, to find a relationship between a specific attribute or group of attributes and the probability of having CVD for a patient. To measure the effectiveness of our prediction algorithm, we will use the cross-validation. For each evaluation, accuracy metric for binary classification models called Area Under the (Receiver Operating Characteristic) Curve (AUC) will be generated. AUC measures the ability of the model to predict a higher score for positive examples as compared to negative examples. We can get the overall performance measure by computing the average of the AUC metrics from cross-validations for any particular model. Results from different models will be compared and the best one(s) will be chosen.

### Data description

We will be peforming an analysis of a cleaned up cadiovascular diseases dataset we used from the Lab 1 assigement.

Our task is to predict the presence or absence of cardiovascular disease (CVD) using the patient examination results. 

There are 3 types of input features:

- *Objective*: factual information;
- *Examination*: results of medical examination;
- *Subjective*: information given by the patient.

|Feature   |Variable Type   |Variable   |Value Type   |
|:---------|:--------------|:---------------|:------------|
| Gender | Objective Feature | gender | categorical code |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |
| Years | Objective Feature | years | age in years |
| BMI | Objective Feature | bmi | bmi |

For any binary data type, "0" means "No" and "1" means "Yes". All of the dataset values were collected at the moment of medical examination.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from tqdm import tqdm
import time
from collections import OrderedDict

warnings.filterwarnings('ignore')
#Bring in data set
df = pd.read_csv('Data/cardio_train.csv',sep=';') #read in the csv file

# Show the dimention and the first 5 rows of the dataset
print(df.shape)
df.head()

(70000, 13)


Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0


Body mass index (BMI) is commonly used in medical field. It is a key index for relating weight to height. BMI is a person's weight in kilograms (kg) divided by his or her height in meters squared.

In [2]:
#create a new vaiable as BMI

df['BMI'] = df['weight']/((df['height']/100)**2)

In [3]:
#to only keep the entries between 97.5% quantile and 2.5% quantile for ap_hi and ap_lo
df.drop(df[(df['ap_hi'] > df['ap_hi'].quantile(0.975)) | (df['ap_hi'] < df['ap_hi'].quantile(0.025))].index,inplace=True)
df.drop(df[(df['ap_lo'] > df['ap_lo'].quantile(0.975)) | (df['ap_lo'] < df['ap_lo'].quantile(0.025))].index,inplace=True)
df.shape

(66193, 14)

In [4]:
df.drop(df[(df['weight'] > df['weight'].quantile(0.975)) | (df['weight'] < df['weight'].quantile(0.025))].index,inplace=True)
#we want to check how the plot looks like when converting age from days to years
df['years'] = (df['age'] / 365).round().astype('int')

### Nominal Data
Nominal data usually has more than two values. For logistic regression and SVMs, we created dummy variables that only factor in 0s and 1s for the prediction process of logistic regression and SVMs. In CVD dataset gender is nominal but it is 1 for female and 2 for male which need to be converted to 0 and 1. Also we converted BMI to nominal form for better explanation 

In [5]:
df.gender = df.gender.apply(lambda x: 0 if x == 1 else 1)
df['BMI'] = df['BMI'].apply(lambda x: 1 if x<18.5 else( 2 if x>=18.5 and x<25 else( 3 if x >= 25 and x < 30 else 4)))

In [6]:
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,BMI,years
0,0,18393,1,168,62.0,110,80,1,1,0,0,1,0,2,50
1,1,20228,0,156,85.0,140,90,3,1,0,0,1,1,4,55
2,2,18857,0,165,64.0,130,70,3,1,0,0,0,1,2,52
3,3,17623,1,169,82.0,150,100,1,1,0,0,1,1,3,48
4,4,17474,0,156,56.0,100,60,1,1,0,0,0,0,2,48


In [7]:
#Are there any duplicate entries in the dataset?
#duplicateRowsDF = df[df.duplicated(keep=False)]
duplicateRowsDF = df[df.duplicated(keep='first')]

print("Duplicate Rows except first occurrence based on all columns are :")
print(duplicateRowsDF)

print(f"\nThere are {len(duplicateRowsDF)} duplicated entries in the dataset!")

Duplicate Rows except first occurrence based on all columns are :
Empty DataFrame
Columns: [id, age, gender, height, weight, ap_hi, ap_lo, cholesterol, gluc, smoke, alco, active, cardio, BMI, years]
Index: []

There are 0 duplicated entries in the dataset!


In [8]:
del df['age']
del df['id']
y=df['cardio'].astype(np.int32)
del df['cardio']
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 63079 entries, 0 to 69999
Data columns (total 12 columns):
gender         63079 non-null int64
height         63079 non-null int64
weight         63079 non-null float64
ap_hi          63079 non-null int64
ap_lo          63079 non-null int64
cholesterol    63079 non-null int64
gluc           63079 non-null int64
smoke          63079 non-null int64
alco           63079 non-null int64
active         63079 non-null int64
BMI            63079 non-null int64
years          63079 non-null int32
dtypes: float64(1), int32(1), int64(10)
memory usage: 6.0 MB


# Create Models

## Logistic Regression

### SGDClassifier Over the Other Sklearn Functions

First, we used SVC setting kernel = 'linear' but waited a long time for it to finish. Then we used LogisticRegression and checked accuracy and precision.

At the end we tried SGDClassifier with loss = 'log' which was exponentially faster than the others so this is what we use for logistic regression.

### Functions to Test Accuracy and scores (MAE, RSME and MAPE)
here are the functions that we used to individually find, visualize, and report the best parameters per model, where we reuse those parameters for the optimized model. These functions also return accuracy,Precision, MAE,RSME and MAPE for each split.
we used 10 split cross validation with 80% for training and 20% for test.

In [20]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import mean_absolute_error, make_scorer, mean_squared_error
def test_accuracy(model, n_splits=10, print_steps=False, params={}):
    accuracies = []
    precisions = []
    MAEs = []
    rsmes = []
    MAPEs = []
    for i in range(1, n_splits+1):
        X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=.2, random_state=i)
        yhat, _ = model(
            X_train=X_train,
            y_train=y_train,
            X_test=X_test,
            **params
        )
        accuracy = float(sum(yhat==y_test)) / len(y_test)
        accuracies.append(accuracy)
        precision, recall, fscore, support = score(y_test, yhat)
        precisions.append(precision)
        MAE = float(mean_absolute_error(y_test, yhat))
        MAEs.append(MAE)
        rsme = float(np.sqrt(mean_squared_error(y_test, yhat)))
        rsmes.append(rsme)
        mask = y_test != 0
        MAPE = float(np.fabs((y_test - yhat)/y_test)[mask].mean() * 100)
        MAPEs.append(MAPE)
        if print_steps:
            matrix = pd.DataFrame(confusion_matrix(y_test, yhat),
                columns=['Predicted 1', 'Predicted 0'],
                index=['Actual 1', 'Actual 0'],
            )
            print('*' * 15 + ' Split %d ' % i + '*' * 15)
            print('Precision:',precision[0])
            print('Accuracy:', accuracy)
            print(matrix)
            print('-' * 40)
            print('Cross Validation Fold Mean Error Scores')
            print('MAE:',MAE)
            print('RSME:',rsme)
            print('MAPE:',MAPE)
    Scores = [np.mean(accuracies),np.mean(precisions),np.mean(MAEs),np.mean(rsmes),np.mean(MAPEs)]
    return Scores 
def find_optimal_accuracy(model, param, param_values, params={}):
    result = {}
    for param_value in tqdm(list(param_values)):
        params_local = params.copy()
        params_local[param] = param_value
        result[param_value] = test_accuracy(model, params=params_local)
    
    result = pd.Series(result).sort_index()
    plt.xlabel(param, fontsize=15)
    plt.ylabel('Accuracy', fontsize=15)
    
    optimal_param = result.argmax()
    optimal_accuracy = result[optimal_param]
    
    if type(param_value) == str:
        result.plot(kind='bar')
    else:
        result.plot()
    plt.show()
    return optimal_param

### Logistic Regression
For the logistic regression model, we created a function that took in X_train and Y_train from the original data set to test for X_test from the modified dataset. The accuracy of the logistic regression prediction for positive or negative cardio was compared with that of the original, where a confusion matrix was made to show percentage accuracy. Along with accuracy precision,MAE,RSME and MAPE for each split will be shown. Average Accuracy, Precision, MAE, RSME and MAPE for 10 folds cross validation is more than 0.72, 0.724, 0.279, 0.528 and 34.741 respectively.

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
def logistic_regression_model(X_train, y_train, X_test, **params):
    
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    clr = LogisticRegression(**params)
    clr.fit(X_train, y_train)
    return clr.predict(X_test), clr

best_params_logistic = {}

model = logistic_regression_model
scores = test_accuracy(model=model, params=best_params_logistic, print_steps=True)
print('-' * 50)
print('Average unoptimized accuracy: %f' % scores[0])
print('Average unoptimized Precision: %f' % scores[1])
print('Average unoptimized MAE: %f' % scores[2])
print('Average unoptimized RSME: %f' % scores[3])
print('Average unoptimized MAPE: %f' % scores[4])


*************** Split 1 ***************
Precision: 0.6938433617199498
Accuracy: 0.7184527584020292
          Predicted 1  Predicted 0
Actual 1         4970         1359
Actual 0         2193         4094
----------------------------------------
Cross Validation Fold Mean Error Scores
MAE: 0.28154724159797084
RSME: 0.5306102539510247
MAPE: 34.88150151105456
*************** Split 2 ***************
Precision: 0.7024655244463017
Accuracy: 0.7201965757767914
          Predicted 1  Predicted 0
Actual 1         5043         1394
Actual 0         2136         4043
----------------------------------------
Cross Validation Fold Mean Error Scores
MAE: 0.27980342422320864
RSME: 0.5289644829506124
MAPE: 34.56870043696391
*************** Split 3 ***************
Precision: 0.6929959457570251
Accuracy: 0.7171052631578947
          Predicted 1  Predicted 0
Actual 1         4957         1373
Actual 0         2196         4090
----------------------------------------
Cross Validation Fold Mean Error Scor

In [11]:
list(scores)

[0.7206721623335446,
 0.7240710236818055,
 0.2793278376664553,
 0.5285059307375766,
 34.74131711665776]

For SGDClassifier with loss = log we got average accuracy almost same as logistic regression model.

In [12]:
def SGDClassifier_log_model(X_train, y_train, X_test, **params):
    
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    params['loss'] = 'log'

    clf = SGDClassifier(**params)
    clf.fit(X_train, y_train)
    return clf.predict(X_test), clf

best_params_logistic = {}

model = SGDClassifier_log_model
scores = test_accuracy(model=model, params=best_params_logistic, print_steps=True)
print('-' * 50)
print('Average unoptimized accuracy: %f' % scores[0])
print('Average unoptimized Precision: %f' % scores[1])
print('Average unoptimized MAE: %f' % scores[2])
print('Average unoptimized RSME: %f' % scores[3])
print('Average unoptimized MAPE: %f' % scores[4])

*************** Split 1 ***************
Precision: 0.6742532005689901
Accuracy: 0.7119530754597336
          Predicted 1  Predicted 0
Actual 1         5214         1115
Actual 0         2519         3768
----------------------------------------
Cross Validation Fold Mean Error Scores
MAE: 0.2880469245402663
RSME: 0.5367000321783727
MAPE: 40.06680451725783
*************** Split 2 ***************
Precision: 0.6678748338770086
Accuracy: 0.7100507292327204
          Predicted 1  Predicted 0
Actual 1         5528          909
Actual 0         2749         3430
----------------------------------------
Cross Validation Fold Mean Error Scores
MAE: 0.28994927076727967
RSME: 0.5384693777433213
MAPE: 44.48939957921994
*************** Split 3 ***************
Precision: 0.6952421171171171
Accuracy: 0.718135700697527
          Predicted 1  Predicted 0
Actual 1         4939         1391
Actual 0         2165         4121
----------------------------------------
Cross Validation Fold Mean Error Scores

## Optimizing the Logistic Regression Model
By running logistic regression  one time with the built in parameters for both LogisticRegression and SGDClassifier, we got an average accuracy of 0.71 from 10 splits. To try to improve this, we are doing following steps.

first, for LogisticRegression function we want to see how changing the value of C, class_weight, random_state, max_iter and penalty will affect the accuracy. To do this we are using GridSearchCV function to check 'C' value between 0.001, 0.01, 0.1, 1, 10, 100 and 1000. Also sets penalty as L2 and 'class_weight'to 'balanced' and 'none'. we also assign 0 and 'lbfgs' for random_state and solver respectively. max_iter will be defined between 1500 and 2000.


In [13]:
#Divide data into test and training splits and having 10-fold CV
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)
#Logisitic regression 10-fold cross-validation 

regEstimator = LogisticRegression()

parameters = { 'penalty':['l1','l2']
              ,'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
              ,'class_weight': ['balanced',None]
              ,'random_state': [0]
              ,'max_iter':[1500,2000]
             }

#Create a grid search object using the above parameters 
from sklearn.model_selection import GridSearchCV
regGridSearch = GridSearchCV(estimator=regEstimator
                   , n_jobs=8 # jobs to run in parallel
                   , verbose=1 # low verbosity
                   , param_grid=parameters
                   , cv=cv # KFolds = 10
                   , scoring='accuracy')

#data scaling
scaler = StandardScaler()
scaler.fit(df)

#Transform training data to z-scores
#This makes our model's coefficients take on the same scale for accurate feature importance analysis 
X_Scl = scaler.transform(df)

#Perform hyperparameter search to find the best combination of parameters for our data
regGridSearch.fit(X_Scl, y)

Fitting 10 folds for each of 56 candidates, totalling 560 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    4.7s
[Parallel(n_jobs=8)]: Done 184 tasks      | elapsed:   13.9s
[Parallel(n_jobs=8)]: Done 434 tasks      | elapsed:   31.2s
[Parallel(n_jobs=8)]: Done 560 out of 560 | elapsed:   39.8s finished


GridSearchCV(cv=ShuffleSplit(n_splits=10, random_state=0, test_size=0.2, train_size=None),
             error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='warn', n_jobs=8,
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                         'class_weight': ['balanced', None],
                         'max_iter': [1500, 2000], 'penalty': ['l1', 'l2'],
                         'random_state': [0]},
         

In [14]:
#Display the best estimator parameters
regGridSearch.best_estimator_

LogisticRegression(C=0.1, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=1500, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [15]:
#Use the best parameters for our Logistic Regression object
classifierEst = regGridSearch.best_estimator_

#Classifier Evaluation

from sklearn.model_selection import cross_validate

def EvaluateClassifierEstimator(classifierEstimator, X, Y, cv):
   
    #Perform cross validation 
    scores = cross_validate(classifierEstimator, df, y, scoring=['accuracy','precision']
                            , cv=cv, return_train_score=True)

    Accavg = scores['test_accuracy'].mean()
    Preavg = scores['test_precision'].mean()

    print_str = "The average accuracy for all cv folds is: \t\t\t {Accavg:.5}"
    print_str2 = "The average precision for all cv folds is: \t\t\t {Preavg:.5}"


    print(print_str.format(Accavg=Accavg))
    print(print_str2.format(Preavg=Preavg))
    print('*********************************************************')

    print('Cross Validation Fold Mean Error Scores')
    scoresResults = pd.DataFrame()
    scoresResults['Accuracy'] = scores['test_accuracy']
    scoresResults['Precision'] = scores['test_precision']

    return scoresResults
#Evaluate the regression estimator above using our pre-defined cross validation and scoring metrics. 
EvaluateClassifierEstimator(classifierEst, X_Scl, y, cv)

The average accuracy for all cv folds is: 			 0.71914
The average precision for all cv folds is: 			 0.74094
*********************************************************
Cross Validation Fold Mean Error Scores


Unnamed: 0,Accuracy,Precision
0,0.717898,0.750314
1,0.715996,0.741457
2,0.722654,0.732837
3,0.715679,0.740146
4,0.726141,0.743292
5,0.720989,0.747712
6,0.71663,0.734184
7,0.717977,0.740701
8,0.716392,0.744102
9,0.721068,0.734642


Second, we want to see how changing the value of alpha, epsilon, number of iterations, and penalty will affect the accuracy for SGDClassifier model. To do this we have another 'For' loop which sets alpha and epsilon at ten and twenty linear increments from 0.00001 to 0.001 and 0.01 to .5, respectively. The number of iterations could be 1, 3, 6, 10, or 15 and penalty could be L1 or L2.

The optimal values will be stored at best_params_logistic variable and will used to fit for final model.

The optimal value for alpha we found is 0.00034 and that for epsilon is 0.371. The optimal penalty is L2 at 15 iterations.Alpha is just a constant multiplied to the regularization term so our value of 0.00034 is expected.

We found L2, the squared error, is slightly more accurate than L1, the error. This was expected because L2 is typically better for minimizing error than L1 and L2 is standard for linear SVM models, where it performed the best for our model.

The iteration number vs accuracy should be a fairly random distribution. We expected to get different results each time and expected that they would be about our initial accuracy, 71%.

In [19]:
test_params = [
    ('n_iter_no_change', [1, 3, 6, 10, 15]),
    ('alpha', np.linspace(0.00001, 0.001, 10)),
    ('epsilon', np.linspace(0.01, .5, 20)),
    ('penalty', ['l1', 'l2'])
]

for param, param_values in test_params:
    best_params_logistic[param] = find_optimal_accuracy(
        SGDClassifier_log_model,
        param=param,
        param_values=param_values,
        params=best_params_logistic
    )
    print("Best", param, best_params_logistic[param])
    time.sleep(1)

SyntaxError: invalid syntax (<ipython-input-19-d9d6a0671bb3>, line 10)

### Optimized Logistic Regression Model Performance
Once we plugged in all optimal values into the model, the final accuracy became 0.719, which is slightly better than that of 0.718 from default parameters. 

In [None]:
%%timeit -n1 -r1

scores = test_accuracy(
    SGDClassifier_log_model, n_splits=10, params=best_params_logistic)
print('Optimized Logistic Regression Accuracy: %f' % scores[0])
print('Optimized Logistic Regression Precision: %f' % scores[1])
print('Optimized Logistic Regression MAE: %f' % scores[2])
print('Optimized Logistic Regressiond RSME: %f' % scores[3])
print('Optimized Logistic Regression MAPE: %f' % scores[4])


In [None]:
#Best Model
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=.2, random_state=1)
scaler = StandardScaler()
X_train_lr = scaler.fit_transform(X_train)
X_test_lr = scaler.transform(X_test)
Best_clf = SGDClassifier(loss='log',**best_params_logistic)
Best_clf.fit(X_train_lr, y_train)
# sort these attributes and spit them out
zip_vars = zip(Best_clf.coef_.T,df.columns) # combine attributes
zip_vars = sorted(zip_vars)
for coef, name in zip_vars:
    print(name, 'has weight of', coef[0]) # now print them out

In [None]:
# plot weight
from matplotlib import pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
weights = Best_clf.coef_[0]
feature_names = df.columns.values
linreg_ft_imp_df = pd.DataFrame({'feature_names':feature_names, 'weights':weights, 'absolute_weights': np.abs(weights)})
linreg_ft_imp_df.sort_values(by='absolute_weights', inplace=True, ascending=False )
#drawing the coefficients with SNS
import seaborn as sns
ax = sns.barplot(x =linreg_ft_imp_df['weights'], y = linreg_ft_imp_df['feature_names'], orient= 'h')
ax.set_title("Top Feature Correlations")
ax.set_xlabel("Coefficient Magnitude\n(z-score)")

The bar plots above will help interpret the importance of different features for the classification task.

General Observation - It is obvious that Systolic blood pressure (ap_hi) holds the most weight in our prediction. There is a sharp drop to the second and third coefficients; their age in years and the patents cholesterol. The last two positive correlated features are weight and Diastolic blood pressure (ap_lo), theses seem to play less of a roll in the prediction of cardiovascular diseases. Looking at the graph these features fill out the positive correlation, meaning the higher the numbers the greater the risk of having some form of Cardiovascular diseases. This makes since looking at the features in this group. The higher someone’s blood pressure, age or cholesterol the greater the risk. We can now look at the most important negative correlated feature, physical activity. The lower a person’s physical activity level (active) is the more of a predictor of Cardiovascular diseases.

ap_hi – This is the highest influential factor in our analysis. We saw a strong linear relationship between this variable and cardiovascular diseases (cardio). The rate of cardiovascular diseases tends to rise Systolic blood pressure increases.

years – The age of a patent was the second in the prediction of cardo. This would also make since as that the older a person is the more risk of developing cardiovascular problems. We found that the data showed a person in their mid-50’s and older was more at risk.

cholesterol – Cholesterol and years were almost equal in weight meaning the higher the cholesterol the more of a person is at risk of cardiovascular diseases.

weight – Weight was one of the lower predictors but still showed to have slight significances.

ap_lo – One surprising feature to show not much in the predictions was a person’s Diastolic blood pressure (ap_lo). It was the lowest of the positive correlated features.

active - The last feature we would like to point out is the physical activity of the patents. There is significant negative correlation in the prediction. This also was not any new revelation, it basically points out that the less active a person is the more of a predictor of cardiovascular diseases.

## B.SVM
For the support vector machine model, we created a function that took in X_train and Y_train from the original data set to test for X_test from the modified dataset. The accuracy of the SVM prediction for positive or negative cardio was compared with that of the original, where a confusion matrix was made to show percentage accuracy. Due to the complexity of the dataset, we are again slightly better than 71% accuracy.

In [None]:
def support_vector_machine_model(X_train, y_train, X_test, **params):
    
    # X = (X - µ) / σ
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    params['loss'] = 'hinge'
    clf = SGDClassifier(**params)
    clf.fit(X_train, y_train)
    return clf.predict(X_test), clf

best_params_svc = {}


model = support_vector_machine_model
score = test_accuracy(model=model, params=best_params_svc, print_steps=True)
print('-' * 50)
print('Average unoptimized accuracy: %f' % scores[0])
print('Average unoptimized Precision: %f' % scores[1])
print('Average unoptimized MAE: %f' % scores[2])
print('Average unoptimized RSME: %f' % scores[3])
print('Average unoptimized MAPE: %f' % scores[4])

### Optimizing the Support Vector Machine Model
By running SVM model one time with the built in parameters, we got an average accuracy of 0.71 from 10 splits. To try to improve this, we will do a few things listed below:

First, we want to do the 80/20 split 10 times and average those results to get a better accuracy. By splitting the training and test sets up multiple times, we can minimize the effects of outliers.

Second, we want to see how changing the value of alpha,epsilon, number of iterations, and penalty will affect the accuracy. To do this we have another for loop which sets alpha and epsilon at 10 and 20 linear increments from 0.00001 to 0.01 and 0.01 to 0.5 respectively. The number of iterations could be 10, 15, 30, 60, or 100 and penalty could be L1 or L2.

We found that the optimal value for alpha is 0.00421. The optimal penalty is L2 at 100 iterations.

Alpha is just a constant multiplied to the regularization term so our value of 0.00421 is expected. Alpha could be used again if we set the learning rate to optimal but we will not do that for this mini lab.

Epsilon was not changed because the results had noisy accuracy and we decided to remove it.

We found L2, the squared error, is slightly more accurate than L1, the error. This was expected because L2 is typically better for minimizing error than L1 and L2 is standard for linear SVM models, where it performed the best for our model.

The iteration number vs accuracy should be a fairly random distribution. We expected to get different results each time and expected that they would be about our initial accuracy, 0.53 +/- 0.1. This time, 100 iterations is the optimal number. Although the accuracy per iteration was still going up, we had to stop at 100 iterations due to running time restraints.

In [None]:
test_params = [
    ('n_iter_no_change', [10, 15, 30, 60, 100]),
    ('alpha', np.linspace(0.00001, 0.001, 10)),
    ('epsilon', np.linspace(0.01, .5, 20)),
    ('penalty', ['l1', 'l2'])]
    
model = support_vector_machine_model

for param, test_values in test_params:
    best_params_svc[param] = find_optimal_accuracy(
        model=model,
        param=param,
        param_values=test_values,
        params=best_params_svc
    )
    print("Best", param, best_params_svc[param])
    time.sleep(1)

In [None]:
#Best Model
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=.2, random_state=1)
scaler = StandardScaler()
X_train_lr = scaler.fit_transform(X_train)
X_test_lr = scaler.transform(X_test)
Best_clf = SGDClassifier(loss='hinge',**best_params_svc)
Best_clf.fit(X_train_lr, y_train)
# sort these attributes and spit them out
zip_vars = zip(Best_clf.coef_.T,df.columns) # combine attributes
zip_vars = sorted(zip_vars)
for coef, name in zip_vars:
    print(name, 'has weight of', coef[0]) # now print them out

In [None]:
# plot weight
from matplotlib import pyplot as plt
%matplotlib inline
plt.style.use('ggplot')


weights = pd.Series(Best_clf.coef_[0],index=df.columns)
weights.plot(kind='bar')
plt.show()

In [None]:
from sklearn.svm import SVC
clf = SVC(kernel='linear')

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

clf.fit(X_scaled, y)

# this hold the indexes of the support vectors
clf.support_

# this holds a subset of the data which is used for support vectors
support_vectors = pd.DataFrame(clf.support_vectors_, columns=df.columns)

# get number of support vectors for each class
print('Number of support vectors for each feature:', clf.n_support_)

In [None]:
V_grouped = support_vectors.groupby(y.loc[clf.support_].values)
X_grouped = df.groupby(y.values)

vars_to_plot = ['years','ap_hi','ap_lo','height','weight','cholesterol','gluc','BMI']

for v in vars_to_plot:
    plt.figure(figsize=(10,4)).subplots_adjust(wspace=.4)

    plt.subplot(1,2,1)
    V_grouped[v].plot.kde() 
    plt.legend(['cardio 0','cardio 1'])
    plt.title(v+' (Instances chosen as Support Vectors)')

    plt.subplot(1,2,2)
    X_grouped[v].plot.kde() 
    plt.legend(['cardio 0','cardio 1'])
    plt.title(v+' (Original)')