# CSCE 623 Homework Assignment 5

### Student Name:  <font color="red">LASTNAME, FIRSTNAME</font>

### Date: <font color="red">May XX, 2022</font>

### Solving Common Problems

Instructions:
* Review all provided code before starting your work - the instructor has provided hints and tips throughout the code
* This assignment is composed of 3 parts (3 case studies)
    * Each part is a standalone snippet of a machine learning activity that contains one or more flaws
    * Your goal is to identify each flaw by performing the steps required to diagnose it
    
* In each case, you will complete several steps
    * CODE CELL (diagnostics): Write code to diagnose the problem and produce evidence of the issue (e.g. print statements, tables, and graphs).  
    (IMPORTANT - Even if you can see the flaw directly in the provided code, you must include code here that demonstrates the flaw and shows evidence of it. Failing to do so will cost you points on the assignment.)
    * MARKDOWN CELL: Describe the problem and how to solve it in a markdown cell (English text)
    * CODE CELL (solution): Solve the issues so the ML task works properly
    
* Additional Requirements / Considerations
    * While you may inspect the performance on the test set during diagnosis, you should not use the test set to fix the issue.  All decision making (e.g. hyperparameter selection) should be conducted on the non-test set.
    * Use some form of validation (a hold-out validation set, cross-validation, or LOOCV) for hyperparameter tuning; don't just tune on the training set
    * Ensure your hyperparameter choices and the rationale for them are displayed/explained in code and/or markdown cells
    * Make decisions algorithmically (avoid hardcoding values)

* Remember to restart the kernel and rerun all cells before submitting the assignment
* Submit only the Jupyter Notebook (.ipynb) file - do not submit the datasets.
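
The validation requirements above can be illustrated with a minimal sketch. It assumes a synthetic dataset from `make_classification` (not one of the assignment datasets) and a hypothetical grid of `C` values for logistic regression; the point is only the workflow: hold out the test set, score candidates with cross-validation on the non-test data, pick the winner in code, and touch the test set once at the end.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption: NOT one of the assignment datasets)
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Hold out the test set first; it plays no role in tuning
X_non_test, X_test, y_non_test, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Hypothetical candidate grid for the regularization strength C
candidate_Cs = np.logspace(-3, 2, 6)

# Score every candidate with 5-fold CV on the non-test data only
mean_cv_scores = [
    cross_val_score(LogisticRegression(C=C), X_non_test, y_non_test, cv=5).mean()
    for C in candidate_Cs
]

# Choose the winner algorithmically instead of hardcoding it
best_C = candidate_Cs[int(np.argmax(mean_cv_scores))]

# Refit on all non-test data, then evaluate on the test set exactly once
final_model = LogisticRegression(C=best_C).fit(X_non_test, y_non_test)
test_accuracy = final_model.score(X_test, y_test)

for C, score in zip(candidate_Cs, mean_cv_scores):
    print(f"C={C:g}  mean CV accuracy={score:.3f}")
print("selected C:", best_C, " test accuracy:", round(test_accuracy, 3))
```

Printing the whole table of candidates and scores, rather than only the selected value, is one way to make the rationale for a hyperparameter choice visible in the notebook.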

In [None]:
# Note... not all of these are used...

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

import itertools
import copy

from math import factorial

from sklearn.preprocessing import OneHotEncoder

from sklearn.preprocessing import scale 
from sklearn.preprocessing import StandardScaler

from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector

from sklearn.pipeline import Pipeline

from sklearn.model_selection import cross_validate,  cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.model_selection import train_test_split

from sklearn import metrics
from sklearn.metrics import mean_squared_error, make_scorer, average_precision_score, recall_score, accuracy_score, precision_score, confusion_matrix

from sklearn.metrics import roc_curve, precision_recall_curve, auc, roc_auc_score, RocCurveDisplay 


from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV

from sklearn.neighbors import KNeighborsClassifier

from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

from sklearn.feature_selection import SequentialFeatureSelector

%matplotlib inline
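# 'seaborn-white' was renamed 'seaborn-v0_8-white' in matplotlib 3.6; use whichever name this install provides
plt.style.use('seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white')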

from IPython.display import Markdown as md

import warnings
#warnings.filterwarnings('ignore')
#warnings.filterwarnings(action='once')

## OPTIONAL STUDENT CODING: If you need any imports, code them below

In [None]:
######### ------- EXTRA STUDENT IMPORTS ------------


######### ------- END STUDENT IMPORTS ------------

In [None]:
#helper functions provided by instructor

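def data_explore(df):
    """Display summary statistics, per-class statistics and covariances, a pairplot, and per-class histograms.
    Expects a DataFrame with columns 'X1', 'X2', and 'Class' (labels 0/1)."""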
    display(md('Data Statistics:'))
    display(df.describe())
    display(md('Class 0 Statistics:'))
    display(df.loc[df.Class==0,:].describe())
    display(md('Class 1 Statistics:'))
    display(df.loc[df.Class==1,:].describe())
    display(md('Covariance of Class 0:'))
    display(df.loc[df.Class==0,['X1','X2']].cov())
    display(md('Covariance of Class 1:'))
    display(df.loc[df.Class==1,['X1','X2']].cov())
    sns.pairplot(df, hue="Class", height=5)
    df.loc[df.Class==0,:].hist(grid=False, layout=(1,3), figsize=(12,4));
    df.loc[df.Class==1,:].hist(grid=False, layout=(1,3), figsize=(12,4));
    

    
def predict_probs(models, X):
    """ Returns a dictionary of predicted probability arrays, one per model in the input dictionary 'models', computed on the feature data 'X'
    params:  
    models - a dictionary of fitted classification models, keyed by model name
    X - the feature values of the dataset to predict on"""
    predicts = {}
    
    for key, model in models.items():
        predicts[key] = model.predict_proba(X)
    return predicts    

------------------------------------
# CASE 1

The aspiring machine learning novice is trying to build a simple logistic regression classifier to fit a 2-feature, 2-class dataset, but the outcome is not as expected... you must rescue them from their uncertain fate!

------------------------------------

### Case 1 - Load data & explore:

In [None]:
#load dataset
# c1_df = pd.read_csv("c1_data.csv")
c1_df = pd.read_csv('c1_data.csv', header=0, names=['X1','X2','Class'], index_col=0)



#visualize/explore the dataset
data_explore(c1_df)


    



### Case 1 - Split test & non-test, fit a model and evaluate performance

In [None]:
#split into test & non-test
tngfrac = 0.75  # 25 percent of data used for test, rest for non-test
tngqty = np.ceil(tngfrac*len(c1_df)).astype(int)
c1_non_test_df, c1_test_df= c1_df[:tngqty], c1_df[tngqty:]
 

#fit a model
c1_model = LogisticRegression()
feature_cols = ['X1','X2']
c1_model.fit(c1_non_test_df[feature_cols],c1_non_test_df.Class)

#check performance on non-test set
non_test_cv_scores = cross_val_score(c1_model, c1_non_test_df[feature_cols], c1_non_test_df.Class, cv=5)
print("\n\n\nNon-test mean accuracy from 5-fold CV", np.mean(non_test_cv_scores))

#eval performance on test set
yhat = c1_model.predict(c1_test_df[feature_cols])
test_score = c1_model.score(c1_test_df[feature_cols], c1_test_df['Class'])
print("\nTest set accuracy:", test_score)
print("\n\n")

### C1 ISSUE & TASK:

The 5-fold cross-validation performance of the model on the non-test data looks very good, but the test set performance is bad.  

Your job is to figure out why.   In the areas below, complete the following steps:

1.  Use the student code area below to diagnose the problem (use any tools you have learned in the class to do this).  Once you figure out what the problem is, make sure it is clearly presented using code to visualize/print evidence of the problem
2.  Use the markdown area after the code cell to describe the problem
3.  Solve the problem in the designated code cell after the markdown cell by copying the above cells and fixing errors to resolve the performance gap.   

Hint:  Your CV performance and your test performance should be similar and both should be better than chance on this dataset.
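
When CV and test scores disagree, one generic check is to compare the label distribution in each split. The sketch below is an illustration on a hypothetical toy frame (`demo_df`, whose rows are deliberately ordered by label), split by slicing in the same style as the cell above. Whether anything like this applies to the Case 1 data is for your own diagnostics to establish.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame whose rows happen to be ordered by label
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "X1": rng.normal(size=100),
    "Class": np.repeat([0, 1], [70, 30]),
})

# Slice-based split: first 75% of rows are non-test, the rest are test
cut = int(np.ceil(0.75 * len(demo_df)))
demo_non_test, demo_test = demo_df[:cut], demo_df[cut:]

# Tabulate the class proportions in each split side by side
balance = pd.DataFrame({
    "all": demo_df.Class.value_counts(normalize=True),
    "non_test": demo_non_test.Class.value_counts(normalize=True),
    "test": demo_test.Class.value_counts(normalize=True),
}).fillna(0.0)  # a class absent from a split shows up as 0.0

print(balance)
```

A table like this is the kind of printed evidence the diagnostics cell asks for; a bar chart of the same proportions would serve equally well.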

## Case 1 Diagnostics to discover the problem (STUDENT CODE REQUIRED)

In [None]:
# CASE 1 DIAGNOSTICS

# --------- START STUDENT CODE -------------


# --------- END STUDENT CODE -------------




## CASE 1 Explanation and plan for solution (STUDENT MARKDOWN REQUIRED)

In the markdown cell below, describe the problem/mistake the novice made and describe your plan for fixing the issue

<font color='green'>STUDENT ANSWER BELOW</font>   

<font color='green'>ANSWER....

## CASE 1 Solution (STUDENT CODE REQUIRED)

In this step, fix the problem, run CV, and make sure you print both the mean accuracy from cross-validation and the test set accuracy.

If working correctly, your CV and Test set accuracies should be within a few percent of each other.

### Display the mean CV accuracy and Test Set Accuracy


In [None]:
# CASE 1 SOLUTION

myrandstate = 42
mean_cv_accuracy = None #placeholder
test_accuracy = None #placeholder

c1_model = LogisticRegression()
feature_cols = ['X1','X2']

# --------- START STUDENT CODE -------------

#take the steps to fix the problem and fit the c1_model 



# --------- END STUDENT CODE -------------

#eval performance on test set
yhat = c1_model.predict(c1_test_df[feature_cols])
test_accuracy = c1_model.score(c1_test_df[feature_cols], c1_test_df['Class'])

print("\n\n\nNon-test mean accuracy from 5-fold CV", mean_cv_accuracy)
print("\nTest set accuracy:", test_accuracy)
print("if working well, CV accuracy should be close to test accuracy")
print("\n\n\n\n")



------------------------------------
# CASE 2

The aspiring machine learning novice is trying to build a KNN classifier to fit a 200-feature, 2-class dataset, but the 5-fold CV performance is far lower than desired.  The novice consults a colleague who brags about achieving over 70% accuracy with KNN on the dataset, but the braggart refuses to help.   

Only you can save our novice!

------------------------------------

In [None]:
myrandstate = 42

c2_df = pd.read_csv('c2_data.csv', header=0, index_col=0)

#test / non-test split
tngfrac = 0.75
c2_non_test_df, c2_test_df= train_test_split(c2_df, train_size = tngfrac, stratify=c2_df.Class, random_state=myrandstate)


display(c2_non_test_df.describe())


#instantiate a model
c2_model = KNeighborsClassifier(n_neighbors=1)

#check performance on non-test set using cross-validation
non_test_cv_scores = cross_val_score(c2_model,
                                     c2_non_test_df.loc[:, c2_non_test_df.columns != "Class"],
                                     c2_non_test_df.Class, cv=5)
print("\n\n\nNon-test mean accuracy from 5-fold CV", np.mean(non_test_cv_scores))

#fit a model
c2_model.fit(c2_non_test_df.loc[:, c2_non_test_df.columns != "Class"],c2_non_test_df.Class)


#eval performance on test set
yhat = c2_model.predict(c2_test_df.loc[:, c2_test_df.columns != "Class"])
test_score = c2_model.score(c2_test_df.loc[:, c2_test_df.columns != "Class"], c2_test_df['Class'])
print("\nTest set accuracy:", test_score)
print("\n\n\n\n")

In [None]:
#CASE 2 DIAGNOSTICS


# --------- START STUDENT CODE -------------



# --------- END STUDENT CODE -------------


## CASE 2 Explanation and Plan for Solution (STUDENT MARKDOWN REQUIRED)

<font color='green'>STUDENT ANSWER BELOW</font>   

<font color='green'>ANSWER....

## CASE 2 Solution (STUDENT CODE REQUIRED)

Implement your fix to help the novice achieve over 70% accuracy using KNN on the test set.

To achieve this, fit a new model `c2_fixed_model` on the non-test data.

The model you fit will be evaluated on the test data and should achieve an accuracy of around 70%.
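
As a generic illustration of fitting a tuned KNN model on non-test data only, here is a minimal sketch on synthetic data. The dataset, the grid of `n_neighbors` values, and the choice to standardize features are all assumptions made for the example; they are not claims about what the Case 2 data needs. Wrapping preprocessing and classifier in a `Pipeline` means any preprocessing is refit inside each CV fold, so the held-out fold never influences it.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the Case 2 dataset)
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
X_non_test, X_test, y_non_test, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=42)

# Preprocessing + classifier in one estimator, refit inside every CV fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical grid; step names prefix the parameter names
param_grid = {"knn__n_neighbors": [1, 3, 5, 11, 21, 41]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_non_test, y_non_test)  # the test set is never seen here

# best_estimator_ is refit on all non-test data with the selected parameters
demo_fixed_model = search.best_estimator_
demo_test_accuracy = demo_fixed_model.score(X_test, y_test)

print("selected parameters:", search.best_params_)
print("mean CV accuracy of selected model:", round(search.best_score_, 3))
print("test accuracy:", round(demo_test_accuracy, 3))
```

`search.cv_results_` holds the mean CV score for every grid point, which is convenient for displaying the rationale behind the selected value.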

### Display the test set accuracy. 

In [None]:
#CASE 2 SOLUTION

c2_fixed_model = None  #placeholder for the model you will fit

# --------- START STUDENT CODE -------------

# take actions and fit a c2_fixed_model that will do well on the test set

# --------- END STUDENT CODE -------------

# determine test set performance


In [None]:

yhat = c2_fixed_model.predict(c2_test_df.loc[:, c2_test_df.columns != "Class"])
test_accuracy = c2_fixed_model.score(c2_test_df.loc[:, c2_test_df.columns != "Class"], c2_test_df['Class'])

print("\nTest set accuracy:", test_accuracy)
print("\n\n\n\n")



------------------------------------
# CASE 3

The ML novice is tackling a customer requirement.  The customer wants a classifier for a targeting system with the highest possible precision: ideally, as many true targets as possible are found with zero false positives.  The catch is that the customer wants to use QDA for this model and wants a solution that has perfect precision (1.0) and, subject to that, finds the maximum number of targets.  Model tuning should happen on the non-test set, and performance evaluation/reporting on the test set.

Unfortunately, things are not going well for our novice... see if you can help!

------------------------------------

In [None]:

randstate = 42

#load the data
c3_df = pd.read_csv('c3_data.csv', header=0, names=['X1','X2','Class'], index_col=0)

#split test & nontest
c3_non_test_df,c3_test_df = train_test_split(c3_df,test_size=0.5,random_state=randstate, stratify=c3_df.Class)
#explore the non-test data
data_explore(c3_non_test_df)

In [None]:
#subset the training data into features and labels
X = c3_non_test_df.loc[:,['X1','X2']]
y = c3_non_test_df.loc[:,['Class']].values.ravel()

#instantiate qda model
qda = QuadraticDiscriminantAnalysis()
# qda = LinearDiscriminantAnalysis()
qda.fit(X, y)



kfold_count = 5

non_test_cv_precision=np.mean(cross_val_score(estimator = QuadraticDiscriminantAnalysis(), scoring=make_scorer(precision_score),
                                    X=X,
                                    y=y,
                                    cv=kfold_count))

print("Non-test set cv precision:",non_test_cv_precision)


#obtain prediction probs on test set using the model fit previously on the non-test data
preds_test = qda.predict_proba(c3_test_df.loc[:,['X1','X2']])

#classify the prediction probabilities
desired_precision = 1.0
y_hat_test = (preds_test[:,1]>=desired_precision)*1.0

predPos = y_hat_test==1 
truePos = predPos&(c3_test_df['Class'].values==1)
prec = sum(truePos*1.0)/sum(predPos*1.0)
print("precision:", prec) 
print("Test Set predicted positives:",sum(truePos*1.0))



## CASE 3 Diagnostics (STUDENT CODE REQUIRED)

So we can see that the novice's model has unacceptably low precision during CV on the non-test set, and on the test set the model is not predicting *anything* as positive, so precision cannot even be computed because of the division by zero!  This is bad.  Run diagnostics in the cell below to see if you can discover the problem.
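
To make the "no predicted positives" situation concrete, the sketch below uses a handful of hypothetical probabilities and labels (not output from the novice's model) and counts predicted positives at several decision thresholds. It passes `zero_division=0` to `precision_score` so the undefined case returns 0.0 instead of raising a warning.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical positive-class probabilities and true labels (illustration only)
p_pos = np.array([0.05, 0.40, 0.80, 0.97, 0.999])
y_true = np.array([0, 0, 1, 1, 1])

results = {}
for threshold in (0.5, 0.9, 1.0):
    # Predict positive when the probability meets the decision threshold
    y_hat = (p_pos >= threshold).astype(int)
    n_pred_pos = int(y_hat.sum())
    # With no predicted positives, precision is 0/0; zero_division=0 reports 0.0
    prec = float(precision_score(y_true, y_hat, zero_division=0))
    results[threshold] = (n_pred_pos, prec)
    print(f"threshold={threshold}: predicted positives={n_pred_pos}, precision={prec}")

print("largest predicted probability:", p_pos.max())
```

Looking at the range of predicted probabilities next to the threshold being applied is a quick way to see why a classifier might never predict the positive class.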



In [None]:
#CASE 3 DIAGNOSTICS


# --------- START STUDENT CODE -------------




# --------- END STUDENT CODE -------------


## CASE 3 Explanation and Plan for Solution (STUDENT MARKDOWN REQUIRED)

Describe what you discovered in the diagnostic code you wrote above AND define the plan for how to resolve the issue

<font color='green'>STUDENT ANSWER BELOW</font>   

<font color='green'>ANSWER....

## CASE 3 Solution (STUDENT CODE REQUIRED)

In the code cell below, implement a solution which achieves the customer goal of perfect precision with the most possible targets found (maximize true positives).  
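
One possible way to trade recall for precision is to choose a decision threshold from out-of-fold probabilities on the non-test data. The sketch below is an assumption-laden illustration on synthetic data, not the required solution: it uses `cross_val_predict` and `precision_recall_curve` to find the lowest threshold at which cross-validated precision is 1.0, then applies that threshold to a held-out test set. A threshold chosen this way is not guaranteed to give perfect precision on unseen data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in data (assumption: NOT the Case 3 dataset)
X, y = make_classification(n_samples=600, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=1.5, random_state=0)
X_nt, X_te, y_nt, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42)

# Out-of-fold positive-class probabilities on the non-test data
oof_prob = cross_val_predict(QuadraticDiscriminantAnalysis(), X_nt, y_nt,
                             cv=5, method="predict_proba")[:, 1]

# precision/recall have one more entry than thresholds (the final endpoint)
precision, recall, thresholds = precision_recall_curve(y_nt, oof_prob)
perfect = precision[:-1] == 1.0

if perfect.any():
    # Recall never increases with the threshold, so the lowest perfect-precision
    # threshold keeps the most true positives
    chosen = thresholds[perfect].min()
else:
    # Fallback for the illustration: the strictest available threshold
    chosen = thresholds.max()
    print("no threshold reached precision 1.0 in cross-validation")

# Refit on all non-test data and apply the chosen threshold to the test set
demo_clf = QuadraticDiscriminantAnalysis().fit(X_nt, y_nt)
y_hat = demo_clf.predict_proba(X_te)[:, 1] >= chosen
demo_TP = int(np.sum(y_hat & (y_te == 1)))
demo_pred_pos = int(y_hat.sum())
demo_prec = demo_TP / demo_pred_pos if demo_pred_pos else float("nan")

print("chosen threshold:", chosen)
print("test precision:", demo_prec, "; test true positives:", demo_TP)
```

Reporting the true positive count alongside precision matters here because a very strict threshold can reach perfect precision while finding almost no targets.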

### Report the precision and number of true positives found in the test set.

In [None]:
# CASE 3 SOLUTION


test_prec = None #placeholder for test set precision
test_TP = None #placeholder for test set TRUE POSITIVE count (an integer)

X_non_test = c3_non_test_df.loc[:,['X1','X2']].values
y_non_test = c3_non_test_df.Class.values

X_test = c3_test_df.loc[:,['X1','X2']].values
y_test = c3_test_df.Class.values

clf = QuadraticDiscriminantAnalysis()  #you will need to train a model of this type

# --------- START STUDENT CODE -------------

#after fitting a model, evaluate it and compute the test_prec and test_TP on the test set data 

# --------- END STUDENT CODE -------------


In [None]:


print("Test Set Precision:", test_prec, "; Test Set True Positives:", test_TP)
