**Exercise 1: (4 points) Given 1000 records in a dataset, 1000 models are trained with 999 records as part of
the training sample and the remaining 1 sample for testing, and the error rate is averaged out,
this validation technique is called**

(c) LOOCV

**Exercise 2: (4 points) In k-fold cross validation technique, the value of k being small could lead to which of
the following in relation to the error rate**

(c) high bias and low variance

**Exercise 3: (4 points) In k-fold cross validation technique, the value of k being large could lead to which of
the following in relation to the error rate**

(b) low bias and high variance

**Exercise 4: (6 points) Explain what regularization is and why it is useful**

Regularization involves fitting a model involving all the input variables. It shrinks the value of our variables to be closer to zero relative to the least squares estimates. This reduces our variance and prevents overfitting. In some cases, our coefficients can be estimated to be exactly zero. This gives the added bonus of helping with variable selection. Overall, regularization significantly reduces the variance of the model by shrinking the values of our coefficients towards zero, without substantial increase to bias.

**Exercise 5: Consider the framingham.csv data file. The dataset is publically available on the Kaggle website, and it is from an ongoing cardiovascular study on residents of the town of Framingham,
Massachusetts. The classification goal is to predict whether the patient has 10-year risk of future coronary heart disease (CHD). The dataset provides the patients? information. It includes
over 4,000 records and 15 attributes. Each attribute is a potential risk factor. There are both
demographic, behavioral and medical risk factors. In Python, answer the following:**

_(a) (4 points) Using the pandas library, read the csv data file and create a data-frame called
heart._

In [2]:
import boto3
import pandas as pd
import numpy as np

from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score

## Defining the s3 bucket
s3 = boto3.resource('s3')
bucket_name = 'craig-shaffer-data-445-bucket'
bucket = s3.Bucket(bucket_name)

## Defining the file to be read from s3 bucket
file_key = 'framingham.csv'

bucket_object = bucket.Object(file_key)
file_object = bucket_object.get()
file_content_stream = file_object.get('Body')

# reading the datafile
heart = pd.read_csv(file_content_stream)
heart.head()

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0


_(b) (3 points) Remove observations with missing values._

In [3]:
# removing observations with NA
heart = heart.dropna()

_(c) (25 points) Perform a 5-folds cross validation with the goal of measuring the performance,in terms of F1-score, of two competing models:_
- _Using age, currentSmoker, totChol, sysBP, diaBP, BMI, heartRate, and glucose as the predictor variables, and TenYearCHD as the target variable build a logistic regression model under the 5-folds cross validation framework. Compute and store the F1-score for each iteration._
- _Using age, currentSmoker, totChol, BMI, heartRate, and glucose as the predictor variables, and TenYearCHD as the target variable build a logistic regression model under the 5-folds cross validation framework. Compute and store the F1-score for each iteration._

_Use 25% as threshold to change the likelihoods to labels. Make sure to scale the input variables of both models to 0-1 range (see MinMaxScaler) before you run the 5-fold cross validation framework. Also, you can use the f1_score function to compute the F1-score._




In [21]:
# scale the input variables of both models to 0-1 range
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(heart), index= heart.index, columns= heart.columns)

x= scaled.drop(columns = ['TenYearCHD'],axis=1)
y= scaled['TenYearCHD']

# defining folds
kf = KFold(n_splits =5,shuffle=True)

# defining the lists to store results
f1score_1 = list()
f1score_2 = list()

for train_idx, val_idx in kf.split(x):
    #splitting the data
    x_train,x_val = x.iloc[train_idx],x.iloc[val_idx]
    y_train,y_val = y.iloc[train_idx],y.iloc[val_idx]
    
    ## model 1 ##
    x1 = x_train[['age', 'currentSmoker', 'totChol', 'sysBP', 'diaBP','BMI','heartRate','glucose']]
    xval1 = x_val[['age', 'currentSmoker', 'totChol', 'sysBP', 'diaBP','BMI','heartRate','glucose']]
    
    md1 = LogisticRegression().fit(x1,y_train)
    
    #predicting on validation set
    pred1 = md1.predict(xval1)
    
    #changing liklihoods to label w/ 25% threshold
    pred_label= np.where(pred1<0.25,0,1)
    
    #computing f1 score
    f1score_1.append(f1_score(y_val, pred_label))
    
    ## model 2 ##
    x2 = x_train[['age', 'currentSmoker', 'totChol', 'BMI','heartRate','glucose']]
    xval2 = x_val[['age', 'currentSmoker', 'totChol', 'BMI','heartRate','glucose']]
    
    md2 = LogisticRegression().fit(x2,y_train)
    
    #predicting on validation set
    pred2 = md2.predict(xval2)
    
    #changing liklihoods to label w/ 25% threshold
    pred_label= np.where(pred2<0.25,0,1)
    
    #computing f1 score
    f1score_2.append(f1_score(y_val, pred_label))
    
print(f1score_1)
print(f1score_2)

[0.05309734513274336, 0.06504065040650406, 0.07272727272727272, 0.017391304347826087, 0.07874015748031496]
[0.018348623853211007, 0.017241379310344827, 0.0, 0.05128205128205127, 0.0]


_(d) (4 points) Report the average F1-score of each of the models. What model would you use
to predict TenYearCHD? Explain._

In [22]:
print('The avg score of model one is',np.mean(f1score_1))
print('The avg score of model two is',np.mean(f1score_2))

The avg score of model one is 0.05739934601893224
The avg score of model two is 0.017374410889121422


You would want to use model 1 to predict TenYearCHD. The avg f1 score of model 1 is higher. F1 scores can range from 0 to 1, with 1 representing a model that perfectly classifies each observation and 0 representing models that are unable to correctly classify an observation. So we want to chose model one because its avg f1 score is higher than model two.