# Assignment 1

You only need to write one line of code for each question. When answering questions that ask you to identify or interpret something, the length of your response doesn’t matter. For example, if the answer is just ‘yes,’ ‘no,’ or a number, you can just give that answer without adding anything else.

We will go through comparable code and concepts in the live learning session. If you run into trouble, start by using Python's built-in `help()` function to get information about the dataset and functions in question. The internet is also a great resource when coding (though note that **no outside searches are required by the assignment!**). If you do incorporate code from the internet, please cite the source within your code (providing a URL is sufficient).
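
As a small illustration of the `help()` tip above, the sketch below looks up the documentation for `load_wine` and prints the start of the dataset's built-in description. The variable names `wine` and `description` are chosen here for illustration only and are not part of the assignment.

```python
from sklearn.datasets import load_wine

# Print the docstring for the loader function (parameters, return value, examples)
help(load_wine)

# The loaded dataset also carries its own plain-text description
wine = load_wine()
description = wine.DESCR
print(description[:500])  # first 500 characters only
```

The same pattern works for any function used in this notebook, for example `help(pd.DataFrame.head)`.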

Please bring questions that you cannot work out on your own to office hours or work periods, or share them with your peers on Slack. We will work through the issue with you.

In [1]:
# Import standard libraries
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score, precision_score, accuracy_score


In [2]:
from sklearn.datasets import load_wine

# Load the Wine dataset
wine_data = load_wine()

# Convert to DataFrame
wine_df = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)

# Add the wine target as a 'class' column on the DataFrame
wine_df['class'] = wine_data.target

# Display the first five rows of the DataFrame
wine_df.head()

In [3]:
# (i) Number of observations (rows) in the dataset
n_observations = wine_df.shape[0]
print(n_observations)

178


In [4]:
# (ii) Number of variables (columns) in the dataset
n_variables = wine_df.shape[1]
print(n_variables)

14


In [5]:
# (iii) Variable type of response variable 'class' and its unique values
type_class = wine_df['class'].dtype
unique_classes = wine_df['class'].unique()
print(type_class, unique_classes)

int64 [0 1 2]


In [6]:
# (iv) Number of predictor variables
n_predictors = wine_df.shape[1] - 1
print(n_predictors)

13


In [7]:
# Select predictors (excluding the last column)
predictors = wine_df.iloc[:, :-1]

# Standardize the predictors
scaler = StandardScaler()
predictors_standardized = pd.DataFrame(scaler.fit_transform(predictors), columns=predictors.columns)

print("Standardization Complete")

Standardization Complete
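
One thing to keep in mind about the step above: the scaler is fitted on all 178 rows, so the means and standard deviations it learns include rows that later end up in the test set. A common alternative, sketched here as an assumption and not as what the assignment requires, is to split first and let a pipeline fit the scaler on the training rows only. The name `leak_free_knn` and the choice of `n_neighbors=5` are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the predictors and target directly as pandas objects
X, y = load_wine(return_X_y=True, as_frame=True)

# Split BEFORE scaling so the test rows never influence the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123
)

# The pipeline fits StandardScaler on X_train only, then applies the same
# transformation to X_test at prediction time.
# n_neighbors=5 is an arbitrary illustrative value, not a tuned one.
leak_free_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
leak_free_knn.fit(X_train, y_train)

test_accuracy = leak_free_knn.score(X_test, y_test)
print(round(test_accuracy, 3))
```

Because the split and the scaling differ from the cells in this notebook, the numbers produced by this sketch are not expected to match the outputs shown here.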


In [8]:
# Set seeds for reproducibility (scikit-learn draws on NumPy's RNG, not Python's `random`)
random.seed(123)
np.random.seed(123)

In [9]:
# Split data into training (75%) and testing (25%) sets
X_train, X_test, y_train, y_test = train_test_split(
    predictors_standardized, wine_df['class'], test_size=0.25, random_state=123
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(133, 13) (45, 13) (133,) (45,)


In [10]:
# Perform grid search with 10-fold cross-validation to find best k
knn = KNeighborsClassifier()
param_grid = {'n_neighbors': list(range(1, 51))}
grid_search = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

best_n_neighbors = grid_search.best_params_['n_neighbors']
print("Best n_neighbors:", best_n_neighbors)

Best n_neighbors: 3
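
To see how close the other values of k came to the winner, the fitted search object exposes a `cv_results_` dictionary. The sketch below repeats the steps from the cells above in a self-contained form and then tabulates the mean and standard deviation of the cross-validated accuracy for each k. The name `cv_table` is illustrative, and `n_jobs` is left at its default here for simplicity.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Rebuild the standardized predictors and the 75/25 split used above
X, y = load_wine(return_X_y=True, as_frame=True)
X_std = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
X_train, X_test, y_train, y_test = train_test_split(
    X_std, y, test_size=0.25, random_state=123
)

# Same grid search as above: k = 1..50, 10-fold cross-validation, accuracy
grid_search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": list(range(1, 51))},
    cv=10,
    scoring="accuracy",
)
grid_search.fit(X_train, y_train)

# One row per candidate k, sorted from best to worst mean accuracy
cv_table = (
    pd.DataFrame(grid_search.cv_results_)[
        ["param_n_neighbors", "mean_test_score", "std_test_score"]
    ]
    .sort_values("mean_test_score", ascending=False)
)
print(cv_table.head())
```

If several values of k have nearly identical mean accuracy, the `std_test_score` column helps judge whether the difference between them is meaningful.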
