# **Exercise - Model Performance**
# DATA 3300

## Name:

# Q1

**Using the (full) `voters.csv` dataset, conduct a 5-fold cross-validated logistic regression analysis in Python. Assume the dataset has already been checked for collinear independent variables and that none were found.**

*   Import the required libraries and packages
*   Import and view the dataset
*   Assign the IVs to an object called `x` and apply any preprocessing steps
*   Assign the DV to an object called `y` and apply any preprocessing steps
*   Perform 90-10 train-test split
*   Implement 5-fold cross-validated logistic regression



In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report, confusion_matrix
import sklearn.metrics as metrics

from sklearn.linear_model import LogisticRegression

In [None]:
df = pd.read_csv('')                                                                                      # reads in the dataset
# produce a heading of the dataset

In [None]:
x = df.drop(['Primary Key', 'DV'], axis=1)                                                                # remove non-IVs when creating x-object
x = pd.get_dummies(data = ?, drop_first = ?)                                                              # dummy codes categorical IVs

y = df['DV']                                                                                              # create y object
y = pd.get_dummies(data = ?, drop_first = ?)                                                              # dummy codes the y object

In [None]:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size = ?, random_state = 100)                                                              # creates a 90-10 train-test split

print(x_train.shape)                                                                                      # examines the shape of the training and test set objects
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
cv = KFold(n_splits = ?, random_state = 1, shuffle = True)                                                # produces a 5-fold cross-validation

# set model to Logistic regression

In [None]:
scoring = {'acc': 'accuracy',                                                                            # creates a scoring dictionary
           'f1' : 'f1',
           'precision' : 'precision',
           'recall' : 'recall',
           'roc_auc' : 'roc_auc',
           'r2' : 'r2'}

In [None]:
scores = cross_validate(?, ?, ?, scoring = scoring, cv = cv, return_train_score=False)                   # train model using 5-fold cross validation on training set
# display the scores

# Q2
Display the average cross-validated accuracy, f1-score, precision, recall, ROC-AUC, and $R^2$.

In [None]:
scores = pd.DataFrame(scores, columns = scores.keys())                                                  # creates a dataframe of each score type and its score
# display the average scores, why?

## 2A

**List the $R^2$ value and interpret what it means.**
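
As a point of reference when interpreting this value, the sketch below computes $R^2$ by hand on a small set of made-up labels (not the `voters.csv` data). It assumes the `r2` scorer compares predicted class labels (0/1) against the actual labels, which is why $R^2$ can be low, or even negative, while accuracy looks reasonable. The helper name `r_squared` is hypothetical.

```python
# Toy labels for illustration only; these are not from voters.csv.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot, computed from the definition."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Share of labels predicted correctly.
accuracy = sum(a == p for a, p in zip(y_true, y_pred)) / len(y_true)
r2 = r_squared(y_true, y_pred)

print("accuracy =", accuracy)
print("R^2      =", r2)
```

In this toy case 6 of 8 labels are correct, yet the hard-label predictions explain no more variance than always predicting the mean of `y_true`, so the two metrics tell very different stories.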

## 2B

**List the ROC-AUC value and interpret what it means.**
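
One common reading of ROC-AUC is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below computes that quantity directly on made-up scores; the function name `pairwise_auc` is hypothetical and the data are not from `voters.csv`.

```python
# Toy example: 1 = positive class, 0 = negative class (made-up values).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class

def pairwise_auc(labels, scores):
    """Share of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win, matching the usual rank-based definition.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

auc = pairwise_auc(y_true, y_score)
print("ROC-AUC =", auc)
```

On this scale, 0.5 corresponds to a random ordering of cases and 1.0 to perfect separation of the two classes.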

# Q3

**Generate a confusion matrix of the predicted outcome vs the actual outcome (`VIntent=Kodos`).**

In [None]:
y_pred = cross_val_predict(model, x_train, y_train, cv = cv)                                            # makes cross-validated predictions on the training set
conf = confusion_matrix(?, ?)                                                                           # generate a confusion matrix of predicted against actual

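sns.set(rc={'figure.figsize':(12,10)})                                                                  # set the figure size before plotting so it applies to this heatmap
sns.heatmap(conf, annot=True, fmt='g')                                                                  # plot the confusion matrix with cell counts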
plt.xlabel('Predicted Class')
plt.ylabel('Actual Class')
plt.show()

## 3A

**If `VIntent=Kodos` is the positive class, identify what TP, FP, TN, and FN mean in the context of this dataset, and provide the number of cases for each.**

* **TP** =
* **FP** =
* **TN** =
* **FN** =
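
For reference, the four cells can be counted directly from actual and predicted labels. The sketch below uses made-up labels where 1 stands for the positive class; with labels 0 and 1, scikit-learn's `confusion_matrix` arranges the same counts as `[[TN, FP], [FN, TP]]` (rows are actual classes, columns are predicted classes).

```python
# Made-up labels for illustration; 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # actual positive, predicted positive
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # actual negative, predicted positive
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # actual negative, predicted negative
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # actual positive, predicted negative

# Same layout as a 2x2 confusion matrix with rows = actual, columns = predicted.
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)
```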

## 3B

**Identify the costs of both types of model errors (FP and FN) for this specific dataset. Are these costs about the same or does one error cost more than the other?**

## 3C

**Is the distribution of the two outcome classes of the DV about even (within 60:40) or uneven (highly skewed)? State how you know this.**

In [None]:
# display value_counts of VIntent

## 3D

**What is the overall (cross-validated) model accuracy? Given this data set, is this a good model performance metric to use? Why or why not?**

## 3E

**Calculate the baseline accuracy using a naive (a priori) prediction. `VIntent=Kodos` is the positive class outcome. Is our model performing well? Why or why not?**
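
A minimal sketch of a naive baseline, assuming the naive rule is to always predict the most frequent class. The class counts below are invented for illustration, and `"Other"` is a placeholder label rather than a value taken from the dataset.

```python
from collections import Counter

# Invented outcomes for illustration only; not the real VIntent distribution.
outcomes = ["Kodos"] * 30 + ["Other"] * 70

counts = Counter(outcomes)
majority_class, majority_count = counts.most_common(1)[0]

# Always predicting the majority class is correct for every majority-class case.
baseline_accuracy = majority_count / len(outcomes)
print("majority class    =", majority_class)
print("baseline accuracy =", baseline_accuracy)
```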

In [None]:
print('baseline accuracy =', ?/?)

## 3F

**What is the precision of the positive class and what does this mean? When should precision be used to assess model performance?**

## 3G

**What is the recall of the positive class and what does this mean? When should recall be used to assess model performance?**

## 3H

**What is the $f_1$ score and what does it mean? Also describe when this metric should be used.**
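
The three class-level metrics in 3F through 3H can all be computed from confusion-matrix counts. The sketch below uses made-up counts for the positive class, not results from this dataset.

```python
# Made-up confusion-matrix counts for the positive class.
tp, fp, fn = 40, 10, 30

precision = tp / (tp + fp)  # of the cases predicted positive, the share that were positive
recall = tp / (tp + fn)     # of the actual positive cases, the share that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"f1        = {f1:.3f}")
```

Because $f_1$ is a harmonic mean, it sits closer to the lower of precision and recall than a simple average would.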

# Q4

**Develop and run a second Logistic Regression model with 5-fold cross-validation that excludes age, homeowner, income category, marital status and religion as IVs.**

**Then compute the accuracy, precision, recall, f1, and ROC-AUC scores.**

In [None]:
# pull up columns in x_train

In [None]:
x_train_2 = x_train[['', '', '', '']]                                                         # include only the remaining variables in x_train_2

In [None]:
scores2 = cross_validate(?, ?, ?, scoring = scoring, cv = cv, return_train_score=False)       # run cross-validated regression on updated training data

In [None]:
scores2 = pd.DataFrame(scores2, columns = scores2.keys())                                     # creates a dataframe of score names and scores
np.abs(?.mean())                                                                              # take the absolute value of each mean score

In [None]:
y_pred = cross_val_predict(model, x_train_2, y_train, cv = cv)                                # makes cross-validated predictions on training data

# Q5

**Compare Model 1 and Model 2 on their performance metrics by displaying both models' cross-validated metrics in the table below.**

In [None]:
model_1 = np.abs(scores.mean())                                                                                 # stores your average CV scores to model_1
model_2 = np.abs(scores2.mean())                                                                                # stores your average CV scores to model_2
models = model_1, model_2                                                                                       # creates a models object of model_1 and model_2 scores

In [None]:
model_compare = pd.DataFrame(data = ?,
                        index = ["?", "?"],                                                                    # produce a dataframe to display the scores of model_1 and model_2
                        columns = ["?",
                                   "?",
                                   "?",
                                   "test_f1",
                                   "test_precision",
                                   "test_recall",
                                   "test_roc_auc",
                                   "test_r2"])
model_compare

## 5A

**Which model performed best overall? How do you know?**

## 5B

**Fit the best-performing model to the training data and evaluate it on the test set. How does the model perform on the test set? Is there any evidence of overfitting?**

In [None]:
?.fit(x_train, y_train)                                                                     # fit best performing model to training data

predictions = ?.predict(x_test)                                                             # make predictions on x_test

In [None]:
print(classification_report(?, ?))                                                          # print classification report on test set performance