# Assignment 2

As before, if a question can be answered with 'yes/no', or a numeric value, you may simply state as much. If you incorporate code from the internet (which is not required and generally not advisable), please cite the source within your code (providing a URL is sufficient).

We will go through comparable code and concepts in the live learning sessions. If you run into trouble, start by using Python's `help()` function to get information about the datasets and functions in question. The internet is also a great resource when coding (though note that no outside searches are required by the assignment!).

Please bring questions that you cannot work out on your own to office hours, work periods or share with your peers on Slack. We will work with you through the issue.

If you like, you may collaborate with others in the cohort. If you choose to do so, please indicate whom you worked with in your pull request by tagging their GitHub username. Separate submissions are required.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Import specific objects
from sklearn.preprocessing import StandardScaler
from ISLP import load_data

### Question 1: Classification using KNN

We'll now use the `Caravan` dataset from the `ISLP` package. (You may use `Caravan.describe()` to review details of the dataset.) In this dataset, the response variable of interest is `Purchase`, which indicates if a given customer purchased a caravan insurance policy. We will simultaneously use all other variables in the dataset to predict the response variable.

In [2]:
# Load the "Caravan" dataset using the "load_data" function from the ISLP package
Caravan = load_data('Caravan')

# Obtain and Print number of rows and columns
rows, columns = Caravan.shape
print(f'The dataset has {rows} rows and {columns} columns.')

#Print Data Type Counts
print(f'\n\nCount Of Columns by Data Type:\n{Caravan.dtypes.value_counts()}\n')

#Print levels of the 'Purchase' variable
purchase_levels = Caravan['Purchase'].unique()
print(f'Levels of Purchase Variable: {purchase_levels}')

# Describing the Caravan Dataset
Caravan.describe()

The dataset has 5822 rows and 86 columns.


Count Of Columns by Data Type:
int64     85
object     1
Name: count, dtype: int64

Levels of Purchase Variable: ['No' 'Yes']


| |MOSTYPE|MAANTHUI|MGEMOMV|MGEMLEEF|MOSHOOFD|MGODRK|MGODPR|MGODOV|MGODGE|MRELGE|...|ALEVEN|APERSONG|AGEZONG|AWAOREG|ABRAND|AZEILPL|APLEZIER|AFIETS|AINBOED|ABYSTAND|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|count|5822.0|5822.0|5822.0|5822.0|5822.0|5822.0|5822.0|5822.0|5822.0|5822.0|...|5822.0|5822.0|5822.0|5822.0|5822.0|5822.0|5822.0|5822.0|5822.0|5822.0|
|mean|24.253349|1.110615|2.678805|2.99124|5.773617|0.696496|4.626932|1.069907|3.258502|6.183442|...|0.076606|0.005325|0.006527|0.004638|0.570079|0.000515|0.006012|0.031776|0.007901|0.014256|
|std|12.846706|0.405842|0.789835|0.814589|2.85676|1.003234|1.715843|1.017503|1.597647|1.909482|...|0.377569|0.072782|0.080532|0.077403|0.562058|0.022696|0.081632|0.210986|0.090463|0.119996|
|min|1.0|1.0|1.0|1.0|1.0|0.0|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|
|25%|10.0|1.0|2.0|2.0|3.0|0.0|4.0|0.0|2.0|5.0|...|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|
|50%|30.0|1.0|3.0|3.0|7.0|0.0|5.0|1.0|3.0|6.0|...|0.0|0.0|0.0|0.0|1.0|0.0|0.0|0.0|0.0|0.0|
|75%|35.0|1.0|3.0|3.0|8.0|1.0|6.0|2.0|4.0|7.0|...|0.0|0.0|0.0|0.0|1.0|0.0|0.0|0.0|0.0|0.0|
|max|41.0|10.0|5.0|6.0|10.0|9.0|9.0|5.0|9.0|9.0|...|8.0|1.0|1.0|2.0|7.0|1.0|2.0|3.0|2.0|2.0|


Before fitting any model, it is essential to understand our data. Answer the following questions about the `Caravan` dataset (Hint: use `print` and `describe`):  
_(i)_ How many observations (rows) does the dataset contain?    
_(ii)_ How many variables (columns) does the dataset contain?    
_(iii)_ What 'variable' type is the response variable `Purchase` (e.g., 'character', 'factor', 'numeric', etc)? What are the 'levels' of the variable?    
_(iv)_ How many predictor variables do we have (Hint: all variables other than `Purchase`)?  

**Answer(i):**<p>
5822 rows, as shown by the `.shape` attribute and the `count` row of `.describe()`.<p>
**Answer(ii):**<p>
86 columns, as shown by `.shape`. Note that `.describe()` summarizes only the numeric columns by default, so it omits `Purchase`.<p>
**Answer(iii):**<p>
Per the code block above, `Purchase` has dtype `object`, which pandas uses for strings; here it acts as a categorical variable with 2 levels: 'Yes' and 'No'.<p>
**Answer(iv):**<p>
85 predictor variables

Next, we must perform 'pre-processing' or 'data munging' to prepare our data for classification/prediction. For KNN, there are three essential steps. A first essential step is to 'standardize' the predictor variables. We can achieve this using `StandardScaler` (assigned to `scaler` below), provided as follows:

In [3]:
# Select predictors (excluding the 86th column)
predictors = Caravan.iloc[:, :-1]

# Standardize the predictors
scaler = StandardScaler()
predictors_standardized = pd.DataFrame(scaler.fit_transform(predictors), columns=predictors.columns)

# Display the head of the standardized predictors
print(predictors_standardized.head())

    MOSTYPE  MAANTHUI   MGEMOMV  MGEMLEEF  MOSHOOFD    MGODRK    MGODPR  \
0  0.680906  -0.27258  0.406697 -1.216964  0.779405 -0.694311  0.217444   
1  0.992297  -0.27258 -0.859500 -1.216964  0.779405  0.302552 -0.365410   
2  0.992297  -0.27258 -0.859500 -1.216964  0.779405 -0.694311 -0.365410   
3 -1.187437  -0.27258  0.406697  0.010755 -0.970980  1.299414 -0.948264   
4  1.225840  -0.27258  1.672893 -1.216964  1.479559  0.302552 -0.365410   

     MGODOV    MGODGE    MRELGE  ...   ALEVEN  APERSONG   AGEZONG  AWAOREG  \
0 -0.068711 -0.161816  0.427670  ... -0.20291 -0.073165 -0.081055 -0.05992   
1 -0.068711  0.464159 -0.096077  ... -0.20291 -0.073165 -0.081055 -0.05992   
2  0.914172  0.464159 -1.667319  ... -0.20291 -0.073165 -0.081055 -0.05992   
3  0.914172  0.464159 -0.619824  ... -0.20291 -0.073165 -0.081055 -0.05992   
4 -0.068711  0.464159  0.427670  ... -0.20291 -0.073165 -0.081055 -0.05992   

     ABRAND   AZEILPL  APLEZIER   AFIETS   AINBOED  ABYSTAND  
0  0.764971 -0.02

_(v)_ Why is it important to standardize the predictor variables?  
_(vi)_ Why did we elect not to standardize our response variable `Purchase`?  


**Answer(v):**<p>
The main reason is **Scale Sensitivity**, because KNN works by calculating the distance between data points. If the predictor variables are on different scales, variables with larger ranges will dominate the distance calculations, potentially leading to biased results <p>
**Answer(vi):**<p>
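Because `Purchase` is a categorical label, not a numeric feature. Standardization rescales the predictors that enter the distance calculation; the response is only the class being predicted, so it plays no part in that calculation.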
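
The scale sensitivity described in Answer(v) can be illustrated with a small sketch. The three customers and their two features (an income in dollars and a policy count) are hypothetical and are not taken from `Caravan`; the `standardize` helper is a hand-written stand-in that uses the population standard deviation, which is what `StandardScaler` is understood to use.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(rows):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(col) / len(col) for col in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
            for col, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in rows]

# Hypothetical customers: (income in dollars, number of policies)
a = (50_000.0, 1.0)
b = (50_500.0, 9.0)   # similar income, very different policy count
c = (50_800.0, 1.0)   # larger income gap, identical policy count

# On the raw scale, the income column dominates the distance
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# After standardization, both columns contribute on a comparable scale
sa, sb, sc = standardize([a, b, c])
std_ab = euclidean(sa, sb)
std_ac = euclidean(sa, sc)

print("Nearest to a on raw scale:", "b" if raw_ab < raw_ac else "c")
print("Nearest to a after standardizing:", "b" if std_ab < std_ac else "c")
```

In this toy case the nearest neighbour of `a` changes once the columns are standardized, which is the kind of effect that would otherwise let a large-range variable decide the KNN prediction.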


_(vii)_ A second essential step is to set a random seed. Do so below (Hint: use the `random.seed` function). Why is setting a seed important? Is the particular seed value important? Why or why not?

In [4]:
# Setting a random seed
np.random.seed(33)

**Answer(vii):**<p>
 Setting a random seed ensures that the random processes (in our case: shuffling data and splitting datasets) yield the same results every time the code is run. The particular value of the random seed itself is generally not important. What matters is the consistency that setting a seed provides.
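
A minimal sketch of the reproducibility point, assuming Python's standard-library `random` module rather than `np.random` (the idea is the same, but the generated numbers differ between the two libraries). The `draw` helper is hypothetical and exists only for this illustration.

```python
import random

def draw(seed, n=5):
    """Return n pseudo-random numbers from a generator started at `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

first_run = draw(33)
second_run = draw(33)   # same seed: identical sequence
other_seed = draw(34)   # different seed: a different, equally valid sequence

print("Same seed reproduces the draws:", first_run == second_run)
print("Different seed gives other draws:", first_run != other_seed)
```

Any seed gives a reproducible sequence; the value 33 used in the notebook has no special meaning beyond being fixed.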

_(viii)_ A third essential step is to split our standardized data into separate training and testing sets. We will split into 75% training and 25% testing. The provided code randomly partitions our data, and creates linked training sets for the predictors and response variables. Extend the code to create a non-overlapping test set for the predictors and response variables.

**Answer(viii):**<p>
The code below already creates a non-overlapping test set for the predictors and response variables.

In [5]:
# Create a random vector of True and False values
split = np.random.choice([True, False], size=len(predictors_standardized), replace=True, p=[0.75, 0.25])

# Define the training set for X (predictors)
training_X = predictors_standardized[split]

# Define the training set for Y (response)
training_Y = Caravan.loc[split, 'Purchase']

# Define the testing set for X (predictors)
testing_X = predictors_standardized[~split]

# Define the testing set for Y (response)
testing_Y = Caravan.loc[~split, 'Purchase']
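
As a rough check of the claim in Answer(viii), the sketch below mimics the boolean-mask split with the standard-library `random` module (an assumption made so the example runs without the dataset) and confirms that the mask and its negation never share a row. Because each row is assigned independently with probability 0.75, the training share is only approximately 75%.

```python
import random

rng = random.Random(33)
n_rows = 5822  # number of rows reported for the Caravan dataset

# True -> training row, False -> testing row (each row drawn independently)
split = [rng.random() < 0.75 for _ in range(n_rows)]

train_idx = [i for i, keep in enumerate(split) if keep]
test_idx = [i for i, keep in enumerate(split) if not keep]

# The two index sets should be disjoint and together cover every row
overlap = set(train_idx) & set(test_idx)
train_share = len(train_idx) / n_rows

print("Rows in both sets:", len(overlap))
print("All rows assigned:", len(train_idx) + len(test_idx) == n_rows)
print(f"Training share: {train_share:.3f}")
```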


_(ix)_ We are finally set to fit the KNN model. In Python, we can use the `KNeighborsClassifier` class. Fit the KNN with k=1. (You may review its arguments by typing `help(KNeighborsClassifier)`.)

**Answer(ix):**<p>
Code Block below:

In [6]:
#Answer (ix):

# Importing additional required libraries
from ISLP import confusion_table
from sklearn.neighbors import KNeighborsClassifier

# Creating a model instance where K=1
knn1 = KNeighborsClassifier(n_neighbors=1)

# Fitting the model with the training sample
knn1.fit(training_X, training_Y)
# Predicting using the testing sample
knn1_pred = knn1.predict(testing_X)
# Confusion matrix
print(confusion_table(knn1_pred, testing_Y))

Truth        No  Yes
Predicted           
No         1234   72
Yes          98   11
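
To make the role of K concrete, here is a minimal pure-Python sketch of the nearest-neighbour vote. It is not how `KNeighborsClassifier` is implemented; the `knn_predict` helper and the five training points are hypothetical and only illustrate the idea of voting among the K closest points.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=1):
    """Predict the majority label among the k training points nearest to query."""
    # Rank training rows by Euclidean distance to the query point
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical standardized predictors with a rare 'Yes' class
train_X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.2, 0.1), (0.3, 0.3)]
train_y = ['No', 'No', 'Yes', 'No', 'No']
query = (0.9, 0.9)

pred_k1 = knn_predict(train_X, train_y, query, k=1)
pred_k3 = knn_predict(train_X, train_y, query, k=3)

print("K=1 prediction:", pred_k1)
print("K=3 prediction:", pred_k3)
```

With K=1 the single closest point decides the label, while with K=3 the two 'No' neighbours outvote the one 'Yes' neighbour. When one class is rare, a larger K therefore tends to predict the rare class less often.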


Using your fit model, answer the following questions:   
_(x)_ What is the prediction accuracy? (Hint: use the `score` method, and compare your model to `testing_Y`)  
_(xi)_ What is the prediction error? (Hint: compute it from the accuracy)

**Answer(x):**<p>
87.99% accuracy per the code blocks below <p>
**Answer(xi):**<p>
12.01% prediction error per the code blocks below <p>

In [7]:
# Calculating accuracy
accuracy1 = knn1.score(testing_X, testing_Y)
print("Accuracy Rate:", accuracy1*100)

Accuracy Rate: 87.98586572438163


In [8]:
# prediction error rate
prediction_error_rate1 = 1-accuracy1
print("Prediction Error Rate:", prediction_error_rate1*100)

Prediction Error Rate: 12.014134275618371
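
The accuracy and error rate can also be recovered by hand from the K=1 confusion table printed earlier, which is a useful cross-check on the `score` method. The counts below are copied from that table; the variable names are chosen only for this sketch.

```python
# Counts from the K=1 confusion table (rows: predicted, columns: truth)
pred_no_true_no, pred_no_true_yes = 1234, 72
pred_yes_true_no, pred_yes_true_yes = 98, 11

total = (pred_no_true_no + pred_no_true_yes
         + pred_yes_true_no + pred_yes_true_yes)

# Correct predictions sit on the diagonal of the table
correct = pred_no_true_no + pred_yes_true_yes

accuracy = correct / total
error_rate = 1 - accuracy

print(f"Test observations: {total}")
print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"Prediction error: {error_rate * 100:.2f}%")
```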


_(xii)_ How does this prediction error/accuracy compare to what could be achieved via random guesses? To answer this, consider the percent of customers in the `Caravan` dataset who actually purchase insurance, computed below:

**Answer(xii):**<p>
Only about 6% of customers in the dataset purchase insurance, so naive baselines are already very accurate. If we guess 'Yes' at random with probability $p$, where $p$ is the proportion of customers who purchase insurance in the whole sample, the expected accuracy is $p^2 + (1 - p)^2$, which is 88.76% (per output below). The KNN model with K=1 reaches 87.99%, so it is slightly less accurate than random guessing. It is also below the 94.13% obtained by always predicting 'No' on the test set, since 1332 of the 1415 test customers did not purchase.

In [9]:
# Calculate the percentage of customers who purchase insurance
percentage_purchase = (Caravan['Purchase'].eq('Yes').sum() / len(Caravan['Purchase'])) * 100

# Convert the percentage to a proportion before applying p^2 + (1 - p)^2
p = percentage_purchase / 100

# Obtaining random guess accuracy, expressed as a percentage
random_guess_accuracy = (p**2 + (1 - p)**2) * 100

# Compare the performance
print(f"KNN K=1 Model Accuracy Rate: {accuracy1*100:.2f}")
print(f"Random Guess Accuracy Rate: {random_guess_accuracy:.2f}")

KNN K=1 Model Accuracy Rate: 87.99
Random Guess Accuracy Rate: 88.76


_(xiii)_ Fit a second KNN model, with $K=3$. Does this model perform better (i.e., have higher accuracy, compared to a random guess)?

**Answer(xiii):**<p>
Per the code block below, the K=3 model results in an accuracy of 92.37%, which is better than both the K=1 model (87.99%) and the random-guess baseline (88.76%). It is still below the 94.13% obtained by always predicting 'No' on the test set.

In [10]:
# Creating a model instance where K=3
knn3 = KNeighborsClassifier(n_neighbors=3)

# Fitting the model with the training sample
knn3.fit(training_X, training_Y)

# Predicting using the testing sample
knn3_pred = knn3.predict(testing_X)

# Confusion matrix
print(confusion_table(knn3_pred, testing_Y))

# Calculating accuracy of the K=3 model
accuracy3 = knn3.score(testing_X, testing_Y)

# Prediction error rate
prediction_error_rate3 = 1-accuracy3
print("Prediction Error Rate:", prediction_error_rate3)

# Compare the performance
print(f"KNN K=3 Model Accuracy Rate: {accuracy3*100:.2f}")
print(f"Random Guess Accuracy Rate: {random_guess_accuracy:.2f}")

Truth        No  Yes
Predicted           
No         1303   79
Yes          29    4
Prediction Error Rate: 0.07632508833922258
KNN K=3 Model Accuracy Rate: 92.37
Random Guess Accuracy Rate: 88.76
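
Because purchasers are rare, overall accuracy alone can hide how a model behaves on the 'Yes' class. The sketch below recomputes a few summary figures directly from the two confusion tables printed above; the `summarize` helper and its key names are hypothetical, and the choice of extra metrics (an always-'No' baseline and the share of 'Yes' predictions that are correct) is an assumption about what is useful to compare, not part of the assignment.

```python
def summarize(tn, fn, fp, tp):
    """Summary rates from confusion-table counts.

    tn: predicted No,  truth No     fn: predicted No,  truth Yes
    fp: predicted Yes, truth No     tp: predicted Yes, truth Yes
    """
    total = tn + fn + fp + tp
    return {
        "accuracy": (tn + tp) / total,
        # Accuracy of a rule that predicts 'No' for every test customer
        "always_no_accuracy": (tn + fp) / total,
        # Share of 'Yes' predictions that were actual purchasers
        "precision_yes": tp / (tp + fp),
        # Share of actual purchasers in the test set
        "base_rate_yes": (fn + tp) / total,
    }

# Counts copied from the K=1 and K=3 confusion tables
k1 = summarize(tn=1234, fn=72, fp=98, tp=11)
k3 = summarize(tn=1303, fn=79, fp=29, tp=4)

for name, stats in (("K=1", k1), ("K=3", k3)):
    print(name, {key: round(value * 100, 2) for key, value in stats.items()})
```

Read this way, neither model beats the always-'No' rule on accuracy, but in both tables the customers flagged as 'Yes' purchase at a higher rate than the test-set base rate.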


# Criteria

|Criteria            |Complete           |Incomplete          |
|--------------------|---------------|--------------|
|Classification using KNN|All steps are done correctly and the answers are correct.|At least one step is done incorrectly leading to a wrong answer.|

## Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

### Submission Parameters:
* Submission Due Date: `HH:MM AM/PM - DD/MM/YYYY`
* The branch name for your repo should be: `assignment-2`
* What to submit for this assignment:
    * This Jupyter Notebook (assignment_2.ipynb) should be populated and should be the only change in your pull request.
* What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/applied_statistical_concepts/pull/<pr_id>`
    * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

Checklist:
- [X] Created a branch with the correct naming convention.
- [X] Ensured that the repository is public.
- [X] Reviewed the PR description guidelines and adhered to them.
- [X] Verify that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack at `#cohort-3-help`. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
