<a href="https://colab.research.google.com/github/eliauf23/gateway-data-science/blob/main/ws3_gds.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Worksheet Week 3
### Classification I

In this worksheet you will work on a simple binary classification problem using a k-nearest neighbor classifier.

### PART I: Understanding the data

We will work with the Breast Cancer Wisconsin dataset. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. 

For more information, see https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)

**0. Up and running**

a) Import `numpy`, `matplotlib.pyplot` and `pandas`

**1. Load the dataset.**

a) You can use sklearn's datasets package, and the function `load_breast_cancer()`

b) Turn this data into a DataFrame, and have a look at it (using the functions ``head()`` and ``describe()``).

In [None]:
# your code here

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
import seaborn as sns


data = datasets.load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["label"] = data.target
print(df)


In [None]:
df.head()


In [None]:
df.describe()


**2. Exploring the data**

a) How many samples are there? How many features, and what kind are they?

569 samples and 30 features (31 columns once the `label` column is added). All features are non-negative real numbers stored as `float64`; the label is an integer.

b) What is the distribution of classes in this dataset?

2 classes - malignant & benign. 212 samples are malignant, 357 are benign. 

c) To gain some more insight into these features, have a look at the distribution of the first four features. An appealing way to do this is with *violin plots*. Try using the function `violinplot` of the `seaborn` package to display the distributions of the first four features *per class*.




In [None]:
df.info()


In [None]:
df["label"].value_counts()

In [None]:
# One panel per feature; without separate axes the four plots overlap on one figure
fig, axes = plt.subplots(1, 4, figsize=(15, 4))

# The first four columns are mean radius, mean texture, mean perimeter, mean area
for ax, feature in zip(axes, df.columns[:4]):
    sns.violinplot(data=df, x="label", y=feature, ax=ax)
plt.tight_layout()


In [None]:
sns.violinplot( x="label", y="mean texture", data=df)


In [None]:
sns.violinplot( x="label", y="mean perimeter", data=df)


In [None]:
sns.violinplot( x="label", y="mean area", data=df)


### PART II: Classification

**1. Getting started**

a) Split the data into a training and validation set, at a ratio of 70/30. You can do this manually, or use the function `train_test_split()` from the package `sklearn.model_selection`.

b) Compute the distribution of positive and negative samples in your training and validation splits. Are they the same as the distribution of the original data? Compare with the result of using the parameter `stratify` within `train_test_split()`.

c) Import the class `KNeighborsClassifier` from the package `sklearn.neighbors`, and create a classifier with the choice of $k=1$ -- one neighbor. Have a look at https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html for details.

d) Fit the classifier (use the method `fit()` of the instance you created) to your training data, and then compute accuracy (use the method `score()`) on the validation data. What is the accuracy of this classifier?
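
One way to approach point *b* is sketched below. It compares the class fractions of a plain random split with those of a stratified split. The helper `class_fractions`, the fixed `random_state=0`, and the use of `as_frame=True` are illustrative choices, not part of the worksheet.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

def class_fractions(labels):
    """Fraction of samples per class (0 = malignant, 1 = benign)."""
    return labels.value_counts(normalize=True).sort_index()

# Plain random split: class fractions depend on the random draw
_, _, y_tr, y_va = train_test_split(X, y, test_size=0.30, random_state=0)
# Stratified split: class fractions follow the full dataset as closely as possible
_, _, y_tr_s, y_va_s = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

summary = pd.DataFrame({
    "full": class_fractions(y),
    "train": class_fractions(y_tr),
    "val": class_fractions(y_va),
    "train (stratified)": class_fractions(y_tr_s),
    "val (stratified)": class_fractions(y_va_s),
})
print(summary.round(3))
```

With stratification, the train and validation columns should sit within a fraction of a percent of the full-data column; without it, they drift with the random seed.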



In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
x = pd.DataFrame(data.data, columns=data.feature_names)
y = df.label

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, stratify=y)
model = KNeighborsClassifier(n_neighbors=1)
model.fit(x_train, y_train)
model.score(x_test, y_test)


**2. Dependence on amounts of data**

We will now explore how this performance depends on the number of training samples.




a) As before, construct a training set and validation set.

b) Fit a k-NN classifier with increasing amounts of training data, from 1 to $60$ samples. For each model, record its training accuracy and validation accuracy.

c) Plot the obtained training and validation accuracy as a function of the number of training samples.

d) Since your results will depend on the specific (random) draw of your data, repeat points *a* and *b* above 20 times, each drawing a different split of training/validation. Finally, plot the *average* of the accuracies (training and validation) for each number of training samples.

e) Why is the training accuracy at 1?

f) Run again point *d* above, but now using a k-nearest neighbor classifier with $k=5$. What happened to the training accuracy, and why?

g) Noting that the validation error is an unbiased estimate of the Risk of the classifier, what can you say about the value of the Training Error? Is it unbiased?
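
Points *e* and *f* can be probed on a small synthetic example. The sketch below assumes random 2-D points with labels drawn independently of the features (pure noise), so any accuracy above chance on the training set reflects memorization rather than learning. The data, seed, and the dictionary `train_acc` are illustrative only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))        # 50 distinct points in the plane
y = rng.integers(0, 2, size=50)     # labels carry no information about X

train_acc = {}
for k in (1, 5):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # Scoring on the same points used for fitting gives the training accuracy
    train_acc[k] = clf.score(X, y)
    print(f"k={k}: training accuracy = {train_acc[k]:.2f}")
```

With $k=1$ every training point is its own nearest neighbor (distance zero), so it always recovers its own label. With $k=5$ the other four neighbors can outvote it, so the training accuracy is no longer pinned at 1.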

In [None]:
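# Entry i - 1 of each array corresponds to i training samples
n_max, n_runs = 60, 20
train_acc = np.zeros(n_max)
val_acc = np.zeros(n_max)
model = KNeighborsClassifier(n_neighbors=1)

for j in range(n_runs):
  x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, stratify=y)
  for i in range(1, n_max + 1):
    # use only the first i training samples
    x_new = x_train.iloc[:i]
    y_new = y_train.iloc[:i]
    model.fit(x_new, y_new)
    train_acc[i - 1] += model.score(x_new, y_new)
    # evaluate on the full validation set, not just its first i rows
    val_acc[i - 1] += model.score(x_test, y_test)

# average over the random splits
train_acc /= n_runs
val_acc /= n_runs
print(val_acc)

In [None]:
# use a new name so the original df (needed for the labels) is not overwritten
acc_df = pd.DataFrame({"training": train_acc, "validation": val_acc},
                      index=np.arange(1, n_max + 1))
ax = acc_df.plot()
ax.set_xlabel("number of training samples")
ax.set_ylabel("mean accuracy")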


In [None]:
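k = 5  # k-NN needs at least k training samples to find k neighbors
n_max, n_runs = 60, 20
sizes = np.arange(k, n_max + 1)
train_acc = np.zeros(len(sizes))
val_acc = np.zeros(len(sizes))
model = KNeighborsClassifier(n_neighbors=k)

for j in range(n_runs):
  x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, stratify=y)
  for idx, i in enumerate(sizes):
    x_new = x_train.iloc[:i]
    y_new = y_train.iloc[:i]
    model.fit(x_new, y_new)
    train_acc[idx] += model.score(x_new, y_new)
    val_acc[idx] += model.score(x_test, y_test)  # full validation set

# average over the random splits and plot against the number of training samples
acc_df = pd.DataFrame({"training": train_acc / n_runs, "validation": val_acc / n_runs}, index=sizes)
ax = acc_df.plot()
ax.set_xlabel("number of training samples")
ax.set_ylabel("mean accuracy")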


**3. Number of Neighbors**

This classifier has 1 (hyper)parameter that needs tuning: the number of neighbors, $k$. While we will cover model selection later in the class, it is useful for you to start thinking about these questions.

We will select this parameter as the one that maximizes performance *on a validation set*. 

a) As before, partition your data into a training and validation set, now using 75% for the validation set. Train *different* kNN classifiers (for different values of $k$) on the training set, evaluate their performance on the validation set, and plot the validation accuracy as a function of $k$. Adequate values of $k$ to explore might be `K =  [1,2,3,4,5,6,7,8,9,10,15,20,25,30,35]`.

b) Also, as before, a single run will have too high a variance (because of the limited number of samples). As you did in 2.d, repeat this process 20 times, and report the mean validation accuracy as a function of $k$. Moreover, plot this mean together with the $5^{th}$ and $95^{th}$ percentiles of the accuracy for each $k$ (consider using the function `fill_between` of `matplotlib.pyplot`, as well as `numpy`'s `percentile` function; use the `alpha` parameter for some extra aesthetics!).

c) What can you conclude about the number $k$? How does it influence the result? What value of $k$ would you choose?
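
A minimal sketch of points *a* and *b* follows. The variable `val_fraction`, the per-run `random_state`, and the use of `stratify=y` are assumptions made for illustration; set `val_fraction` to whatever split the exercise calls for.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
K = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35]
n_runs = 20
val_fraction = 0.25  # assumption: adjust to the split the exercise asks for

# scores[r, j] = validation accuracy of run r with K[j] neighbors
scores = np.zeros((n_runs, len(K)))
for r in range(n_runs):
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=val_fraction, stratify=y, random_state=r)
    for j, k in enumerate(K):
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        scores[r, j] = clf.score(X_va, y_va)

# Summaries across the 20 random splits, one value per k
mean_acc = scores.mean(axis=0)
p5, p95 = np.percentile(scores, [5, 95], axis=0)

plt.plot(K, mean_acc, marker="o", label="mean validation accuracy")
plt.fill_between(K, p5, p95, alpha=0.3, label="5th to 95th percentile")
plt.xlabel("k")
plt.ylabel("validation accuracy")
plt.legend()
plt.show()
```

Keeping all runs in a single `scores` array makes the mean and both percentiles one-line reductions along `axis=0`.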

In [None]:
x_train, x_validation, y_train, y_validation = train_test_split(x, y, test_size=0.25, stratify=None)
