In [None]:
import finalExamUtilities as fEU

import numpy as np
import numpy.ma as ma
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import math
import scipy.linalg as sp_la
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# Overall Review <a class="anchor" id="review"></a>


Today we are going to load a dataset and use it to review each type of data analysis method we have looked at this semester.

Here's the scenario: You have been hired for the summer by a local realty company that manages apartment complexes. When someone applies to rent an apartment, the company collects information about the applicant, including information about debt, income and credit rating. The company has a shortage of staff to process applications, and is also dealing with a fair housing-related lawsuit. The company would like you to develop an automated solution for determining whether applicants are rent-worthy.

# Prepare The Data <a class="anchor" id="prepData"></a>

## 1. Load and look at your data <a class="anchor" id="loadData"></a>

* Where does the data come from? *This data set comes from https://www.kaggle.com/datasets/samuelcortinhas/credit-card-approval-clean-data which in turn comes from https://archive.ics.uci.edu/ml/datasets/Credit+Approval. It was originally contributed by Ross Quinlan in the late 1980s/early 1990s.*
  * What are the variables?
  * What are the types of the variables?
* Are there any ethical concerns with using this data? *This data set is about credit scoring, so (unless it's artificial, in which case we have different concerns!) it includes information about individuals.*
  * *We checked whether the data contains any personally identifying information (PII), and it does not: there are no names, addresses, social security numbers, etc.*
  * *Even without PII, the data contains sensitive features such as age, gender and ethnicity. Models that use these features should be subject to extra scrutiny to make sure they are not biased in favor of one group of people over another. In our experiments today, we will **exclude these features**.*
  * *The data includes a label for credit-worthiness. Historically, human-made decisions about credit-worthiness (and rent-worthiness) have been notably subject to bias. So we should tell the realty company not to deploy any model we create as the sole decision maker.*

Referring to the cell below, which contains a report of the data with sensitive variables 'age', 'gender', 'ethnicity' and 'citizen' filtered out, please answer:

* How many data points are there?
* How many variables are there?
* What is the type of each variable?
  * syntactic types:
  * semantic types:
* Are the variables independent of each other? How do you make that assessment?
* Is there a value for each variable for each data point? 
* Do the values make sense? Are there outliers or other insanities?

In [None]:
transformedData = fEU.prepData(dataName="cc", type="clustering", fractionToKeep=1)

* What would be a reasonable choice of dependent variable in this dataset, and why, for:
  * regression?
  * classification?
* What shall we split our data into, and why?
* What do we need to watch out for as we split our data?
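
One thing to watch for when splitting is that each split keeps roughly the same class balance as the full data set. The sketch below is a minimal, assumed illustration on made-up labels; `stratified_split` is a hypothetical helper, not part of `finalExamUtilities`, and the 60/20/20 fractions are only an example.

```python
import numpy as np

def stratified_split(y, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return index arrays (train, dev, test) that preserve class proportions."""
    rng = np.random.default_rng(seed)
    train, dev, test = [], [], []
    for label in np.unique(y):
        # Shuffle the indices of this class, then cut them into three pieces
        idx = rng.permutation(np.flatnonzero(y == label))
        n_train = int(round(fractions[0] * len(idx)))
        n_dev = int(round(fractions[1] * len(idx)))
        train.extend(idx[:n_train])
        dev.extend(idx[n_train:n_train + n_dev])
        test.extend(idx[n_train + n_dev:])
    return np.array(train), np.array(dev), np.array(test)

# Toy labels: 70 negatives and 30 positives (made-up data, not the credit set)
y = np.array([0] * 70 + [1] * 30)
train_idx, dev_idx, test_idx = stratified_split(y)

print(len(train_idx), len(dev_idx), len(test_idx))
print(y[train_idx].mean(), y[dev_idx].mean(), y[test_idx].mean())
```

The fixed seed makes the split reproducible, and the three index sets never overlap, so no data point leaks from train into dev or test.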

## 2. Consider Transforming/Normalizing the Data <a class="anchor" id="normalizeData"></a>

* Do we need to transform this data? Why or why not?
* If we do need to transform it, what will we do?
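
As one possible answer, here is a minimal sketch of z-score normalization on made-up numbers. It assumes we estimate the mean and standard deviation on the training split only and reuse them for dev, which keeps information about dev out of the fitted transformation. The arrays and variable names are illustrative, not taken from `fEU.prepData`.

```python
import numpy as np

# Toy data: the second column has a much larger scale than the first
train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
dev = np.array([[2.0, 200.0]])

# Estimate the statistics on the training split only
mu = train.mean(axis=0)
sigma = train.std(axis=0)
sigma[sigma == 0] = 1.0  # guard against constant columns

# Apply the same transformation to every split
train_z = (train - mu) / sigma
dev_z = (dev - mu) / sigma

print(train_z)
print(dev_z)
```

After this step both columns of the training data have mean 0 and standard deviation 1, so neither dominates a distance computation just because of its units.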

## 3. Consider Dimensionality Reduction <a class="anchor" id="pcaData"></a>

* In what circumstances do we want to use dimensionality reduction?
* What method do we use for dimensionality reduction?
* What are the steps in this method?
* One way to choose how many dimensions to keep is by looking at an elbow plot. Looking at the one below, how many dimensions should we keep for this data set in order to retain 80% of the cumulative explained variance?
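
The steps of PCA and the 80% rule can be sketched in plain NumPy as follows. This is an assumed illustration on synthetic data, not the implementation behind `fEU.PCA`; the function name `pca_explained_variance` is hypothetical.

```python
import numpy as np

def pca_explained_variance(X):
    """Center, compute the covariance matrix, and eigendecompose it."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = eigvals / eigvals.sum()          # proportion of variance explained
    return eigvals, eigvecs, ratio

# Synthetic data: five independent columns with very different spreads
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

_, eigvecs, ratio = pca_explained_variance(X)
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative explained variance reaches 80%
k = int(np.searchsorted(cumulative, 0.80) + 1)

# Project the centered data onto the first k principal components
projected = (X - X.mean(axis=0)) @ eigvecs[:, :k]
print(k, cumulative)
```

Reading `k` off `cumulative` is the numeric counterpart of reading the elbow plot by eye.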

In [None]:
pca = fEU.PCA(centered=True, plot=True)
pca.fit(transformedData)
#projected = pca.project(transformedData, ??)


[Go back to the top](#review)


# Model <a class="anchor" id="model"></a>

## 4. Regression <a class="anchor" id="regression"></a>

* To fit a regression, the type of the dependent variable should be what?
* Name and define the loss function for regression.
* The normal equation is one method for fitting a linear regression. What is the normal equation? When can we *not* use it?
* If I have a lot of variables in my data, how can I effectively decide which to include in my regression?
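
For reference, the normal equation gives the least-squares weights as $w = (A^T A)^{-1} A^T y$, where $A$ is the data matrix with a column of ones prepended for the intercept. The sketch below applies it to noiseless synthetic data, so the recovered weights should match the ones used to generate `y`. It cannot be used when $A^T A$ is singular, for example with perfectly collinear columns or more columns than rows.

```python
import numpy as np

# Synthetic, noiseless data: y = 3 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so the first weight is the intercept
A = np.column_stack([np.ones(len(X)), X])

# Solve (A^T A) w = A^T y rather than forming the inverse explicitly
w = np.linalg.solve(A.T @ A, A.T @ y)

# Mean squared error, the quantity least squares minimizes
residuals = y - A @ w
mse = np.mean(residuals ** 2)
print(w, mse)
```

Using `np.linalg.solve` instead of an explicit inverse is a common numerical choice; it solves the same equation.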

In [None]:
transformedTrain, transformedDev, transformedTest, trainY, devY, testY = fEU.prepData(dataName="cc", type="regression", fractionToKeep=-1)

In [None]:
%%time

# Linear regression on the transformed data

lr = LinearRegression().fit(transformedTrain, trainY)
print(lr.score(transformedDev, devY))

In [None]:
%%time

# Polynomial regression on the transformed data, degree 2
# (interaction_only=True keeps only products of distinct features, no squared terms)
pf = PolynomialFeatures(degree = 2, include_bias = False, interaction_only = True)
polynomial2Train = pf.fit_transform(transformedTrain)
lr = LinearRegression().fit(polynomial2Train, trainY)
# Reuse the feature mapping fitted on the training data; do not refit on dev
polynomial2Dev = pf.transform(transformedDev)
print(lr.score(polynomial2Dev, devY))

In [None]:
%%time

# Polynomial regression on the transformed data, degree 3
# (interaction_only=True keeps only products of distinct features, no powers)
pf = PolynomialFeatures(degree = 3, include_bias = False, interaction_only = True)
polynomial3Train = pf.fit_transform(transformedTrain)
lr = LinearRegression().fit(polynomial3Train, trainY)
# Reuse the feature mapping fitted on the training data; do not refit on dev
polynomial3Dev = pf.transform(transformedDev)
print(lr.score(polynomial3Dev, devY))

[Go back to the top](#review)

## 5. Clustering <a class="anchor" id="clustering"></a>

* In what circumstances would we want to cluster our data?
* Clustering requires a distance metric. Name and define a distance metric *other than Euclidean distance*.
* For k-means clustering, we minimize *inertia*. Define inertia.
* k-means clustering is sensitive to the structure of the input data. In what way? How can we fix this type of issue with data structure?
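
To make two of these definitions concrete, the sketch below computes Manhattan (L1) distance, one possible non-Euclidean metric, and k-means inertia, the sum of squared Euclidean distances from each point to its assigned centroid. The four points and their cluster labels are made up for illustration.

```python
import numpy as np

def manhattan(a, b):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return float(np.abs(a - b).sum())

def inertia(X, centroids, labels):
    """Sum of squared Euclidean distances from each point to its centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

# Four toy points in two obvious clusters
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

# Each centroid is the mean of the points assigned to it
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(manhattan(X[0], X[3]))           # → 12.0
print(inertia(X, centroids, labels))   # → 4.0
```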

[Go back to the top](#review)

# Classification <a class="anchor" id="classification"></a>

* To train a classifier, the type of the dependent variable should be what?
* We will use "credit-worthy" as a proxy for "rent-worthy". How many values does this variable have?
* How do we know how well a classification model works?

## 6. K-nearest neighbors <a class="anchor" id="knn"></a>

* How does the *fit* function work for k-nearest neighbors?
* How does the *predict* function work?
* One way to choose a value of $k$ is by looking at an elbow plot. Looking at the elbow plot below, what value of k would you choose for this data and why?
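
As a rough illustration of *fit* and *predict*, here is a minimal k-nearest neighbors sketch in NumPy. `TinyKNN` is a hypothetical class written for this review, not scikit-learn's `KNeighborsClassifier`; it assumes Euclidean distance and class labels that are non-negative integers.

```python
import numpy as np

class TinyKNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Fitting" only memorizes the training data
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Pairwise Euclidean distances, shape (n_queries, n_train)
        d = np.linalg.norm(X[:, None, :] - self.X[None, :, :], axis=2)
        # Indices of the k closest training points for each query
        nearest = np.argsort(d, axis=1)[:, :self.k]
        # Majority vote among the labels of those neighbors
        votes = self.y[nearest]
        return np.array([np.bincount(row).argmax() for row in votes])

# Toy data: two well-separated groups
trainX = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
trainLabels = [0, 0, 0, 1, 1, 1]

knn_sketch = TinyKNN(k=3).fit(trainX, trainLabels)
predictions = knn_sketch.predict([[0.5, 0.5], [5.5, 5.5]])
print(predictions)
```

All of the work happens at prediction time, which is why kNN is cheap to fit and comparatively expensive to query.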

In [None]:
transformedTrain, transformedDev, transformedTest, trainY, devY, testY = fEU.prepData(dataName="cc", type="classification", fractionToKeep=-1)

In [None]:
%%time

# Fit a kNN to the transformed train data, choose k using dev data; this is hyperparameter tuning

fEU.fitExploreKNN(transformedTrain, trainY, transformedDev, devY, 2, 30, 2)

In [None]:
%%time

# Fit a kNN to the transformed train data, best k, test using test data

knn = KNeighborsClassifier(n_neighbors=?).fit(transformedTrain, trainY)
print(knn.score(transformedTest, testY))
print(confusion_matrix(testY, knn.predict(transformedTest)))

## 7. Naive Bayes <a class="anchor" id="nb"></a>

* State Bayes' rule.
* In Bayes' rule, which parts are the posterior, prior, likelihood and evidence?
* Why do we call a Naive Bayes model "naive"? What does this allow us to do?
* A simple Naive Bayes model is based on relative frequencies of values of the variables in the training data. 
  * How can we account for values of variables we may not see for a particular class at train time?
  * The estimated probabilities output via this method, for any non-trivial number of variable values, will be very small. How can we handle this?
  * If a variable is quantitative (continuous or discrete) we can fit a Naive Bayes model using a probability density function for the variable. Name and define a probability distribution commonly used in this way.
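
The three sub-questions above correspond to three standard techniques, sketched here with made-up numbers: add-alpha (Laplace) smoothing for unseen values, summing log probabilities instead of multiplying probabilities, and the Gaussian (normal) density for quantitative variables. The function names are hypothetical and the counts are illustrative only.

```python
import math

def laplace_smoothed(count, class_total, n_values, alpha=1.0):
    """Smoothed estimate of P(value | class); never exactly zero."""
    return (count + alpha) / (class_total + alpha * n_values)

def gaussian_log_pdf(x, mean, var):
    """Log of the normal density with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# A value never seen with this class still gets a small non-zero probability
p_unseen = laplace_smoothed(count=0, class_total=10, n_values=3)

# Multiplying many small likelihoods underflows to zero in floating point...
likelihoods = [1e-5] * 100
product = math.prod(likelihoods)

# ...while summing their logs stays well within range
log_sum = sum(math.log(p) for p in likelihoods)

print(p_unseen, product, log_sum)
print(gaussian_log_pdf(0.0, mean=0.0, var=1.0))
```

Because the logarithm is monotonic, comparing summed log probabilities across classes picks the same class as comparing the products would.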

In [None]:
transformedTrain, transformedDev, transformedTest, trainY, devY, testY = fEU.prepData(dataName="cc", type="classification", fractionToKeep=-1)

In [None]:
%%time

# Fit a naive Bayes model to the transformed train data, test using test data

gnb = GaussianNB().fit(transformedTrain, trainY)
print(gnb.score(transformedTest, testY))
print(confusion_matrix(testY, gnb.predict(transformedTest)))
fEU.aucRoc(gnb.predict(transformedTest), testY)

[Go back to the top](#review)

## Evaluation and Visualization <a class="anchor" id="classificationEvaluation"></a>

* In addition to accuracy, we often create confusion matrices for a classifier.
  * Draw a confusion matrix and label the cells corresponding to true positives, true negatives, false positives and false negatives
  * Looking at the confusion matrix above, what can we say about the classes in this model?
  * Define true positive rate and false positive rate.
  * What is a ROC curve? How is it related to AUC?
  * For a multiclass classifier, what is a variant on the vanilla confusion matrix that we can use?
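
The cell counts and the two rates can be computed by hand, as in the sketch below. The labels are made up; `confusion_counts` is a hypothetical helper. The final layout, rows for true class and columns for predicted class, is the one scikit-learn's `confusion_matrix` uses for labels 0 and 1.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive and t != positive:
            fp += 1
        elif p != positive and t != positive:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Made-up true labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)

tpr = tp / (tp + fn)  # true positive rate: fraction of actual positives found
fpr = fp / (fp + tn)  # false positive rate: fraction of actual negatives flagged

# Rows are the true class (0 then 1), columns are the predicted class
matrix = [[tn, fp],
          [fn, tp]]
print(matrix, tpr, fpr)  # → [[3, 1], [1, 3]] 0.75 0.25
```

A ROC curve plots `tpr` against `fpr` as the decision threshold varies; this sketch computes a single point on such a curve.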

[Go back to the top](#review)

# 8. RBF Networks <a class="anchor" id="rbfNetworks"></a>

* What is a radial basis function?
* What is the structure of an RBF network?
* In this course, what type of activation function did we define for the hidden nodes?
* What are the steps to training an RBF network?
* For what types of modeling can we use an RBF network?

Let's think about training an RBF network to determine whether an applicant is "rent-worthy".
* How many nodes will be in the input layer?
* How many nodes will be in the output layer?
* Thinking about the k-means clustering we trained on this data earlier, how many nodes would you put in the hidden layer?
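
To make the structure concrete, here is a minimal RBF network sketch for regression on a synthetic 1-D problem. It is an assumed illustration, not the implementation behind `fEU.RBFNetwork`: it assumes Gaussian activations in the hidden layer, uses evenly spaced prototypes as a stand-in for k-means centroids, fixes `sigma` by hand, and fits the output weights by linear least squares.

```python
import numpy as np

def rbf_activations(X, prototypes, sigma):
    """Gaussian activation of every hidden node (prototype) for every input."""
    sq_dist = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Synthetic regression target: one period of a sine wave
X = np.linspace(0.0, 2.0 * np.pi, 80).reshape(-1, 1)
y = np.sin(X[:, 0])

# Step 1: choose prototypes (stand-in for k-means centroids)
prototypes = np.linspace(0.0, 2.0 * np.pi, 8).reshape(-1, 1)

# Step 2: compute hidden-layer activations and add a bias column
H = rbf_activations(X, prototypes, sigma=1.0)
H = np.column_stack([np.ones(len(H)), H])

# Step 3: fit the output-layer weights by linear least squares
w, *_ = np.linalg.lstsq(H, y, rcond=None)

yhat = H @ w
mse = np.mean((y - yhat) ** 2)
print(H.shape, mse)
```

For classification, one common variation keeps the hidden layer the same and fits one output node per class against one-hot encoded labels, predicting the class whose output is largest.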

## A Worked Example for Regression <a class="anchor" id="rbfRegression"></a>

Which is more accurate, linear regression (see earlier) or regression via RBF network?

In [None]:
transformedTrain, transformedDev, transformedTest, trainY, devY, testY = fEU.prepData(dataName="cc", type="regression", fractionToKeep=-1)

In [None]:
%%time

# Get the number of prototypes; this is hyperparameter tuning

rbf = fEU.RBFNetwork(type="regression")
rbf.explorePrototypes(transformedTrain, 2, 20, 1)

In [None]:
%%time

rbf.fit(transformedTrain, trainY, 15)
yhat = rbf.predict(transformedDev)
rbf.score(devY, yhat)

## A Worked Example for Classification <a class="anchor" id="rbfClassification"></a>

Which is most accurate, kNN classification, Naive Bayes classification, or classification via RBF network?

In [None]:
transformedTrain, transformedDev, transformedTest, trainY, devY, testY = fEU.prepData(dataName="cc", type="classification", fractionToKeep=-1)

In [None]:
%%time

# Get the number of prototypes; this is hyperparameter tuning

rbf = fEU.RBFNetwork(type="classification")
rbf.explorePrototypes(transformedTrain, 2, 20, 1)

In [None]:
%%time

rbf.fit(transformedTrain, trainY, 15)
yhat = rbf.predict(transformedDev)
rbf.score(devY, yhat)