# Assignment 1

You only need to write one line of code for each question. When answering questions that ask you to identify or interpret something, the length of your response doesn’t matter. For example, if the answer is just ‘yes,’ ‘no,’ or a number, you can just give that answer without adding anything else.

We will go through comparable code and concepts in the live learning session. If you run into trouble, start by using the `help()` function in Python to get information about the datasets and functions in question. The internet is also a great resource when coding (though note that **no outside searches are required by the assignment!**). If you do incorporate code from the internet, please cite the source within your code (providing a URL is sufficient).

Please bring questions that you cannot work out on your own to office hours or work periods, or share them with your peers on Slack. We will work through the issue with you.

### Classification using KNN

Let's set up our workspace and use the **Wine dataset** from `scikit-learn`. This dataset contains 178 wine samples with 13 chemical features, used to classify wines into different classes based on their origin.

The **response variable** is `class`, which indicates the type of wine. We'll use all of the chemical features to predict this response variable.

In [2]:
# Import standard libraries
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score, precision_score
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

In [3]:
from sklearn.datasets import load_wine

# Load the Wine dataset
wine_data = load_wine()

# Convert to DataFrame
wine_df = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)

# Bind the 'class' (wine target) to the DataFrame
wine_df['class'] = wine_data.target

# Display the DataFrame
wine_df


| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 173 | 13.71 | 5.65 | 2.45 | 20.5 | 95.0 | 1.68 | 0.61 | 0.52 | 1.06 | 7.70 | 0.64 | 1.74 | 740.0 | 2 |
| 174 | 13.40 | 3.91 | 2.48 | 23.0 | 102.0 | 1.80 | 0.75 | 0.43 | 1.41 | 7.30 | 0.70 | 1.56 | 750.0 | 2 |
| 175 | 13.27 | 4.28 | 2.26 | 20.0 | 120.0 | 1.59 | 0.69 | 0.43 | 1.35 | 10.20 | 0.59 | 1.56 | 835.0 | 2 |
| 176 | 13.17 | 2.59 | 2.37 | 20.0 | 120.0 | 1.65 | 0.68 | 0.53 | 1.46 | 9.30 | 0.60 | 1.62 | 840.0 | 2 |


#### **Question 1:** 
#### Data inspection

Before fitting any model, it is essential to understand our data. **Use Python code** to answer the following questions about the **Wine dataset**:

_(i)_ How many observations (rows) does the dataset contain?

In [9]:
# Your answer here
# Number of observations (rows) in the Wine dataset loaded above
wine_df.shape[0]


178

_(ii)_ How many variables (columns) does the dataset contain?

In [12]:
# Your answer here
# Number of variables (columns), including the response `class`
wine_df.shape[1]


14

_(iii)_ What is the 'variable type' of the response variable `class` (e.g., 'integer', 'category', etc.)? What are the 'levels' (unique values) of the variable?

In [13]:
# Your answer here
# Variable type and unique levels of the response variable `class`.
# `class` is stored as an integer but encodes three categories;
# the exact integer width (int32 or int64) can vary by platform.
wine_df['class'].dtype, wine_df['class'].unique()


(dtype('int64'), array([0, 1, 2]))


_(iv)_ How many predictor variables do we have (Hint: all variables other than `class`)? 

In [14]:
# Your answer here
# Number of predictor variables: every column except `class`
wine_df.drop(columns='class').shape[1]


13

You can use `print()` and `describe()` to help answer these questions.
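
As a minimal sketch of how those helpers can be combined, the snippet below rebuilds the Wine DataFrame under a separate name (`wine_check` is a hypothetical name chosen so that `wine_df` is not overwritten) and prints the shape, the class counts and summary statistics for a few predictors. The three columns passed to `describe()` are an arbitrary selection for illustration.

```python
import pandas as pd
from sklearn.datasets import load_wine

# Rebuild the DataFrame the same way as the setup cell
# (`wine_check` is a hypothetical name used only in this sketch)
wine_raw = load_wine()
wine_check = pd.DataFrame(wine_raw.data, columns=wine_raw.feature_names)
wine_check['class'] = wine_raw.target

# Rows and columns in one call
n_rows, n_cols = wine_check.shape
print(f"rows: {n_rows}, columns: {n_cols}")

# Unique levels of the response and how many wines fall in each
class_counts = wine_check['class'].value_counts().sort_index()
print(class_counts)

# Summary statistics for a few predictors (arbitrary selection)
print(wine_check[['alcohol', 'magnesium', 'proline']].describe())
```

`describe()` reports the count, mean, standard deviation and quartiles per column, which also makes the very different scales of the predictors visible.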

#### **Question 2:** 
#### Standardization and data-splitting

Next, we must perform 'pre-processing' or 'data munging' to prepare our data for classification/prediction. For KNN, there are three essential steps. A first essential step is to 'standardize' the predictor variables. We can achieve this using `StandardScaler`, as follows:

In [None]:
# Select predictors (excluding the last column)
predictors = wine_df.iloc[:, :-1]

# Standardize the predictors
scaler = StandardScaler()
predictors_standardized = pd.DataFrame(scaler.fit_transform(predictors), columns=predictors.columns)

# Display the head of the standardized predictors
print(predictors_standardized.head())
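
To make the standardization step concrete, the sketch below computes the z-score by hand, z = (x - mean) / standard deviation per column, and checks that it agrees with `StandardScaler`. The variable names (`X_raw`, `manual_z`, `scaler_z`) are illustrative and not part of the assignment code.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X_raw = load_wine().data

# z = (x - column mean) / column standard deviation
# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for .std()
manual_z = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
scaler_z = StandardScaler().fit_transform(X_raw)

matches = np.allclose(manual_z, scaler_z)
print("matches StandardScaler:", matches)
print("column means close to 0:", np.allclose(scaler_z.mean(axis=0), 0))
print("column stds close to 1:", np.allclose(scaler_z.std(axis=0), 1))
```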

(i) Why is it important to standardize the predictor variables?

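> - It ensures that every predictor contributes equally: KNN relies on distances between observations, so variables measured on larger scales would otherwise dominate.
> - It typically improves the performance of distance-based models such as KNN.
> - It facilitates convergence in gradient-based algorithms (not used by KNN, but relevant to other models).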
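
The effect of scale on distances can be illustrated with a small sketch. It takes the first two wines and computes what share of their squared Euclidean distance comes from each feature, before and after standardization. The helper `distance_shares` is a hypothetical function written for this illustration only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = wine.data
names = wine.feature_names

def distance_shares(a, b):
    # Fraction of the squared Euclidean distance contributed by each feature
    sq_diff = (a - b) ** 2
    return sq_diff / sq_diff.sum()

# Shares on the raw data
raw_shares = distance_shares(X[0], X[1])

# Shares after standardizing every column
X_std = StandardScaler().fit_transform(X)
std_shares = distance_shares(X_std[0], X_std[1])

top_raw = names[int(np.argmax(raw_shares))]
print(f"raw data: '{top_raw}' accounts for {raw_shares.max():.1%} of the squared distance")
print(f"standardized: the largest single share is {std_shares.max():.1%}")
```

On the raw data a single large-scale feature supplies most of the distance, whereas after standardization the contributions are spread across several features.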

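(ii) Why did we elect not to standardize our response variable `class`?


> - This is a classification task, so the response is a categorical label rather than a numeric quantity.
> - The model does not require it: KNN computes distances from the predictors only.
> - Leaving the classes as 0, 1 and 2 keeps the predictions interpretable.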

(iii) A second essential step is to set a random seed. Do so below (Hint: use the random.seed function). Why is setting a seed important? Is the particular seed value important? Why or why not?

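> - Reproducibility: the same random split and results are obtained every time the code is run.
> - Consistency in model evaluation: models are compared on the same training and testing sets.
> - Collaboration: others can rerun the notebook and obtain the same numbers.
> - The particular seed value is not important; any fixed value works, as long as the same one is used on each run.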
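
A minimal sketch of what a seed does is shown below. Note that Python's `random.seed` seeds only the standard-library `random` module; draws made with `np.random` are controlled by `np.random.seed`, which is what the splitting code in this assignment uses. The helper `draw_split` and the seed values are illustrative assumptions.

```python
import numpy as np

def draw_split(seed, n=50):
    # Seed NumPy's global generator, then draw a True/False mask
    np.random.seed(seed)
    return np.random.choice([True, False], size=n, replace=True, p=[0.75, 0.25])

same_a = draw_split(123)
same_b = draw_split(123)
other = draw_split(456)

# The same seed reproduces the same mask; a different seed gives another,
# equally valid, random mask
print("identical with the same seed:", np.array_equal(same_a, same_b))
print("identical with a different seed:", np.array_equal(same_a, other))
```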

(iv) A third essential step is to split our standardized data into separate training and testing sets. We will split into 75% training and 25% testing. The provided code randomly partitions our data, and creates linked training sets for the predictors and response variables. 

Extend the code to create a non-overlapping test set for the predictors and response variables.

In [22]:
# Do not touch
np.random.seed(123)
# Create a random vector of True and False values to split the data
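split = np.random.choice([True, False], size=len(predictors_standardized), replace=True, p=[0.75, 0.25])

# Training sets: the rows where `split` is True
X_train = predictors_standardized[split]
y_train = wine_df['class'][split]

# Test sets: the remaining rows, so the training and test sets never overlap
X_test = predictors_standardized[~split]
y_test = wine_df['class'][~split]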
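
The indexing pattern behind a mask-based split can be checked on a toy example. The sketch below uses made-up data (`toy_X` and `toy_y` are hypothetical stand-ins for the standardized predictors and the response) and confirms that the training and test rows do not overlap. Because each row is assigned at random, the realized proportions are only approximately 75% and 25%.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the standardized predictors and the response
# (hypothetical data, used only to show the indexing pattern)
toy_X = pd.DataFrame({'x1': np.arange(12.0), 'x2': np.arange(12.0) * 2})
toy_y = pd.Series(np.arange(12) % 3, name='class')

np.random.seed(123)
mask = np.random.choice([True, False], size=len(toy_X), replace=True, p=[0.75, 0.25])

# True rows form the training set; ~mask flips the mask to select the rest
toy_X_train, toy_y_train = toy_X[mask], toy_y[mask]
toy_X_test, toy_y_test = toy_X[~mask], toy_y[~mask]

# The two index sets share no rows and together cover every row
overlap = toy_X_train.index.intersection(toy_X_test.index)
print("train rows:", len(toy_X_train), "test rows:", len(toy_X_test))
print("overlapping rows:", len(overlap))
```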

#### **Question 3:**
#### Model initialization and cross-validation
We are finally set to fit the KNN model. 


Perform a grid search to tune the `n_neighbors` hyperparameter using 10-fold cross-validation. Follow these steps:

1. Initialize the KNN classifier using `KNeighborsClassifier()`.
2. Define a parameter grid for `n_neighbors` ranging from 1 to 50.
3. Implement a grid search using `GridSearchCV` with 10-fold cross-validation to find the optimal number of neighbors.
4. After fitting the model on the training data, identify and return the best value for `n_neighbors` based on the grid search results.

In [39]:
# Import necessary libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
import numpy as np

# Training and test sets defined by the Question 2 `split` mask
X_train, y_train = predictors_standardized[split], wine_df['class'][split]
X_test, y_test = predictors_standardized[~split], wine_df['class'][~split]

# Initialize the KNN classifier
knn = KNeighborsClassifier()

# Define a parameter grid for `n_neighbors` ranging from 1 to 50
param_grid = {'n_neighbors': np.arange(1, 51)}

# Grid search with 10-fold cross-validation over the candidate values
grid_search = GridSearchCV(knn, param_grid, cv=10)

# Fit on the training data, then return the best value of `n_neighbors`
grid_search.fit(X_train, y_train)
grid_search.best_params_['n_neighbors']
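
Beyond the single best value, a fitted `GridSearchCV` stores the cross-validated score of every candidate in `cv_results_`. The self-contained sketch below shows one way to inspect it. It uses `train_test_split` with `stratify` purely for brevity, which is an assumption of this sketch and not the split prescribed in Question 2, and the names `search`, `results`, `X_tr` and `y_tr` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Self-contained setup (assumption: a stratified 75/25 split for brevity)
X_all, y_all = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X_all)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_scaled, y_all, test_size=0.25, random_state=123, stratify=y_all
)

# Same grid and fold count as in the question
search = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': np.arange(1, 51)},
    cv=10,
)
search.fit(X_tr, y_tr)

# One row per candidate k, sorted by mean cross-validated accuracy
results = pd.DataFrame(search.cv_results_)[
    ['param_n_neighbors', 'mean_test_score', 'std_test_score']
].sort_values('mean_test_score', ascending=False)

print(results.head())
print("best k:", search.best_params_['n_neighbors'])
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```

Looking at the top few rows is useful because several values of `n_neighbors` often score almost identically, in which case the reported best value is only one of several reasonable choices.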



#### **Question 4:**
#### Model evaluation

Using the best value for `n_neighbors`, fit a KNN model on the training data and evaluate its performance on the test set using `accuracy_score`.

In [42]:
# Best number of neighbors found by the Question 3 grid search
best_n_neighbors = grid_search.best_params_['n_neighbors']

# Initialize the KNN classifier with the selected n_neighbors
knn = KNeighborsClassifier(n_neighbors=best_n_neighbors)

# Fit the model on the training data
knn.fit(X_train, y_train)

# Make predictions on the test set
y_pred = knn.predict(X_test)

# Evaluate the model performance using accuracy_score
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print(f"Accuracy of the KNN model with {best_n_neighbors} neighbors: {accuracy:.4f}")
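
Accuracy is a single number, so it can help to see where the errors fall. The sketch below is a self-contained illustration that adds a confusion matrix and per-class recall next to `accuracy_score`. The value `k = 5` is a placeholder to be replaced by the value selected in the grid search, and the stratified `train_test_split` is an assumption made to keep the example short.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Self-contained setup (assumption: a stratified 75/25 split for brevity)
X_all, y_all = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X_all)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_scaled, y_all, test_size=0.25, random_state=123, stratify=y_all
)

# k = 5 is a placeholder; substitute the value chosen by the grid search
k = 5
model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Overall accuracy: the share of test wines classified correctly
acc = accuracy_score(y_te, pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, pred)

# Recall per class: the share of each true class that was recovered
per_class_recall = recall_score(y_te, pred, average=None)

print(f"accuracy: {acc:.4f}")
print(cm)
print("recall per class:", per_class_recall.round(3))
```

Accuracy equals the sum of the diagonal of the confusion matrix divided by the number of test observations, so the two outputs can be cross-checked against each other.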


# Criteria


| **Criteria**                                           | **Complete**                                      | **Incomplete**                                    |
|--------------------------------------------------------|---------------------------------------------------|--------------------------------------------------|
| **Data Inspection**                                    | Data is inspected for number of variables, observations and data types. | Data inspection is missing or incomplete.         |
| **Data Scaling**                                       | Data scaling or normalization is applied where necessary (e.g., using `StandardScaler`). | Data scaling or normalization is missing or incorrectly applied. |
| **Model Initialization**                               | The KNN model is correctly initialized and a random seed is set for reproducibility.            | The KNN model is not initialized, is incorrect, or lacks a random seed for reproducibility. |
| **Parameter Grid for `n_neighbors`**                   | The parameter grid for `n_neighbors` is correctly defined. | The parameter grid is missing or incorrectly defined. |
| **Cross-Validation Setup**                             | Cross-validation is set up correctly with 10 folds. | Cross-validation is missing or incorrectly set up. |
| **Best Hyperparameter (`n_neighbors`) Selection**       | The best value for `n_neighbors` is identified using the grid search results. | The best `n_neighbors` is not selected or incorrect. |
| **Model Evaluation on Test Data**                      | The model is evaluated on the test data using accuracy. | The model evaluation is missing or uses the wrong metric. |


## Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

### Note:

If you like, you may collaborate with others in the cohort. If you choose to do so, please indicate whom you worked with in your pull request by tagging their GitHub usernames. Separate submissions are required.

### Submission Parameters:
* Submission Due Date: `HH:MM AM/PM - DD/MM/YYYY`
* The branch name for your repo should be: `assignment-1`
* What to submit for this assignment:
    * This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
* What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/applying_statistical_concepts/pull/<pr_id>`
    * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

Checklist:
- [ ] Created a branch with the correct naming convention.
- [ ] Ensured that the repository is public.
- [ ] Reviewed the PR description guidelines and adhered to them.
- [ ] Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack at `#cohort-4-help`. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
