# Module 4: Classification

Notebook version: `25.0` (please don't change)


Each group should submit its answers to the practical questions on Brightspace.

NOTE: For this practical you will need the Python libraries imported below.

In [None]:
# Download helper functions and datasets:
!wget -nc https://raw.githubusercontent.com/brmprnk/LB2292/main/module4/cigarsdata.pkl
!wget -nc https://raw.githubusercontent.com/brmprnk/LB2292/main/module4/Genesdata.pkl
!wget -nc https://raw.githubusercontent.com/brmprnk/LB2292/main/LST_Functions.py

import numpy as np
import pandas as pd
print(pd.__version__)

import pickle
import scipy.stats as st
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.svm import SVC
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from LST_Functions import plot_decision_boundary, learning_curve

## 1. Bayesian Classification

The following figure shows the uniform class-conditional probability density functions of two classes. The first class, $p(x|w_1)$, is indicated by a dashed blue line and the second class, $p(x|w_2)$, by a solid black line. The two classes have equal priors: $P(w_1) = P(w_2) = 1/2$.

![](https://raw.githubusercontent.com/brmprnk/LB2292/main/module4/img/bayesian.png)
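
As a reminder of how Bayes' rule combines priors and class-conditional densities, here is a minimal sketch for two uniform classes. The intervals `(0, 2)` and `(1, 5)` and the helper names `uniform_density` and `posteriors` are hypothetical and chosen only for illustration; they are not the densities shown in the figure.

```python
def uniform_density(x, low, high):
    """Density of a uniform distribution on [low, high], evaluated at x."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def posteriors(x, prior_1, prior_2, interval_1, interval_2):
    """Return (P(w1|x), P(w2|x)) via Bayes' rule for two uniform classes."""
    joint_1 = prior_1 * uniform_density(x, *interval_1)  # P(w1) * p(x|w1)
    joint_2 = prior_2 * uniform_density(x, *interval_2)  # P(w2) * p(x|w2)
    evidence = joint_1 + joint_2                         # p(x)
    if evidence == 0:
        return None  # x lies outside both supports, so the posterior is undefined
    return joint_1 / evidence, joint_2 / evidence


# Hypothetical example: equal priors, w1 uniform on [0, 2], w2 uniform on [1, 5]
example = posteriors(1.5, 0.5, 0.5, (0, 2), (1, 5))
print(example)
```

Note that a point outside the support of both densities has $p(x) = 0$, so the sketch returns `None` there instead of dividing by zero.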


---

#### ✏ Exercise 1a

> Use Bayes' rule to derive the class posterior probabilities of the following objects: $x = 3$, $x = -0.54$, $x = 1$, $x = -2$ (i.e., for each object $x$, calculate $p(w_1|x)$ and $p(w_2|x)$).

**Solution:**

(your solution here)



---

#### ✏ Exercise 1b

> Based on the calculated posterior probabilities, to which class do you assign each object?

**Solution:**

(your answer here)



---

#### ✏ Exercise 1c

> Where do you draw the decision boundary of the Bayes classifier?

**Solution:**

(your solution here)



---

#### ✏ Exercise 1d

> Compute the Bayes error.
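
Recall that the Bayes error is the area under the smaller of the two prior-weighted densities, $\int \min_i P(w_i)\,p(x|w_i)\,dx$. The sketch below approximates this integral numerically with a midpoint sum. The intervals `(0, 2)` and `(1, 5)`, the step size, and the helper `weighted_uniform` are hypothetical and are not the densities from the figure.

```python
import numpy as np


def weighted_uniform(x, prior, low, high):
    """Prior times the uniform density on [low, high], evaluated on an array x."""
    inside = (x >= low) & (x <= high)
    return prior * np.where(inside, 1.0 / (high - low), 0.0)


# Midpoints of small cells covering both (hypothetical) supports
step = 0.001
x_mid = np.arange(-1, 6, step) + step / 2

joint_1 = weighted_uniform(x_mid, 0.5, 0, 2)  # P(w1) * p(x|w1)
joint_2 = weighted_uniform(x_mid, 0.5, 1, 5)  # P(w2) * p(x|w2)

# Bayes error: integrate the smaller of the two weighted densities
bayes_error_demo = np.sum(np.minimum(joint_1, joint_2)) * step
print(round(bayes_error_demo, 4))
```

For uniform densities the integral can also be read off directly as a rectangle area (overlap width times the smaller weighted density), which is a useful check on the numerical value.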

**Solution:**

(your solution here)



---

#### ✏ Exercise 1e

> Assume that the class priors are: $P(w_1) = 1/3$ and $P(w_2) = 2/3$. Recalculate the posterior probabilities for $x = 3$, $x = -0.5$, $x = 1$.

**Solution:**

(your solution here)


## 2. Programming a Bayes classifier

Based on the two class-conditional densities from the previous exercise, we will generate 200 data points, 100 for each class. The data points generated for ω1 (`x1`) are uniformly distributed between -1 and 2. The data points generated for ω2 (`x2`) are uniformly distributed between 0 and 4. The following Python code generates this dataset.
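
Scaling and shifting a sample from the unit interval is equivalent to drawing directly from the target interval. The sketch below shows this alternative form with a seeded generator so that the draw is reproducible. The seed value and the `_demo` variable names are arbitrary choices for illustration, not part of the practical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

# Draw directly from the target intervals instead of scaling uniform(0, 1)
x1_demo = rng.uniform(-1, 2, size=100)  # class w1
x2_demo = rng.uniform(0, 4, size=100)   # class w2

X_demo = np.hstack((x1_demo, x2_demo))  # 1-D array with 200 values
y_demo = np.repeat([1, 2], 100)         # 100 labels "1" followed by 100 labels "2"

print(X_demo.shape, x1_demo.min(), x1_demo.max())
```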

In [None]:
# Generate 100 samples for both classes
x1 = np.random.uniform(0, 1, 100) * 3 - 1  # w1: uniform on [-1, 2)
x2 = np.random.uniform(0, 1, 100) * 4      # w2: uniform on [0, 4)

# Join the samples into one array X
X = np.hstack((x1, x2))

# Create an array of class labels (1 or 2) for the corresponding data points in X
y = np.ones(X.shape[0], int)
y[100:] = 2


---

#### ✏ Exercise 2a

> Because we just generated the data from the class-conditional probabilities, we know to which of the two classes each data point actually belongs. Without that foreknowledge, however, we can also use a Bayes classifier to classify each data point and assign it to either ω1 or ω2. Complete the Python code below to classify all the objects in the dataset using a Bayes classifier.

In [None]:
p_w_1 = 0.5
p_w_2 = 0.5
p_x_w_1 = 1/3
p_x_w_2 = 0.25


# An array where the predicted class labels are stored
y_predicted = np.zeros(X.shape[0], int)

# Loop over all objects and classify them
for i in range(0, len(X)):
  xi = X[i]

  # Calculate P(x_i | w_1)
  if (xi >= -1) and (xi <= 2):
    p_xi_given_w1 = ??? # Replace the ???
  else:
    p_xi_given_w1 = 0

  # Calculate P(x_i | w_2)
  if (xi >= 0) and (xi <= 4):
    p_xi_given_w2 = ??? # Replace the ???
  else:
    p_xi_given_w2 = 0

  # Multiply prior with class-conditional
  bayes_rule_numerator_w1 = ???  # Replace the ???

  bayes_rule_numerator_w2 = ???  # Replace the ???

  # Calculate P(x_i) (the denominator in Bayes' rule)
  p_xi = ??? # Replace the ???

  # Apply Bayes' rule to calculate the posterior probability of each class
  p_w_1_given_xi = ??? # Replace the ???
  p_w_2_given_xi = ??? # Replace the ???

  # classify object: w1 or w2
  if (p_w_1_given_xi > p_w_2_given_xi):
    y_predicted[i] = ??? # Replace the ???
  else:
    ??? # Replace the ???

error_rate = ??? # Replace the ??? (fraction of misclassified objects)
print(error_rate)


---

#### ✏ Exercise 2b

> To see how well your Bayes classifier performed, count how many objects from x1 and x2 are misclassified. How does this compare to the Bayes error that you computed in the previous exercise?
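
Counting misclassifications per class comes down to comparing two label arrays element by element. Here is a minimal sketch on made-up labels; the arrays `y_true_demo` and `y_pred_demo` are hypothetical and unrelated to the dataset above.

```python
import numpy as np

# Hypothetical true and predicted labels for eight objects
y_true_demo = np.array([1, 1, 1, 1, 2, 2, 2, 2])
y_pred_demo = np.array([1, 1, 2, 1, 2, 1, 1, 2])

misclassified = y_true_demo != y_pred_demo  # boolean mask of errors

errors_w1 = np.sum(misclassified & (y_true_demo == 1))  # class 1 objects labelled 2
errors_w2 = np.sum(misclassified & (y_true_demo == 2))  # class 2 objects labelled 1
demo_error_rate = np.mean(misclassified)                # overall fraction of errors

print(errors_w1, errors_w2, demo_error_rate)
```

On a finite sample the empirical error rate fluctuates around the theoretical value, so an exact match with the Bayes error should not be expected.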

**Solution:**

(your answer here)



## 3. Linear Classifiers


---

#### ✏ Exercise 3

> Load the cigars dataset from the *cigarsdata.pkl* file using the Python code below.
> Create a training and test set with `train_test_split`, using a 50% test size.
> Create a linear discriminant classifier using `LinearDiscriminantAnalysis` and train it with the `fit` method.
> Make a scatter plot of the data and plot the decision boundary of the classifier using the `plot_decision_boundary` function given in `LST_Functions` (hint: check the function documentation in order to use it properly).
> Finally, obtain the classifier's predictions for the test set with the `predict` method and compute the classification error.
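
The general split, fit, predict, and score pattern can be sketched on synthetic data as follows. The blob centers, sample size, and `random_state` values are assumptions made for illustration; the sketch does not use the cigars data and leaves out `plot_decision_boundary`, whose usage is documented in `LST_Functions`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Hypothetical two-class data: two well-separated Gaussian blobs
X_demo, y_demo = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                            cluster_std=1.0, random_state=0)

# Hold out half of the samples for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.5, random_state=0)

lda = LinearDiscriminantAnalysis()
lda.fit(X_tr, y_tr)          # estimate class means and shared covariance
y_hat = lda.predict(X_te)    # predicted labels for the held-out half

test_error = np.mean(y_hat != y_te)  # fraction of misclassified test samples
print(f"test error: {test_error:.3f}")
```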


In [None]:
with open('cigarsdata.pkl', 'rb') as f:
  datadict = pickle.load(f)

data = datadict['data']
labels = datadict['labels']
del datadict

# SOLUTION

# Your code here ...



---

#### ✏ Exercise 4a

> Create a dataset of two interleaving half circles (banana-shaped) with 400 samples, using the code below.
>
> Repeat the previous exercise using the new data (`LinearDiscriminantAnalysis`, `fit`, `plot_decision_boundary`, and `predict`). Is the linear classifier appropriate for this problem? What is the error rate in this case?


In [None]:
Banana_shaped = datasets.make_moons(n_samples=400, noise=0.05)
data = Banana_shaped[0]
labels = Banana_shaped[1]

# SOLUTION:

# Your code here ...



---

#### ✏ Exercise 4b

> The scikit-learn (sklearn) library implements many different classifiers (see the table below). Which classifier performs best? (hint: use `plot_decision_boundary` to observe the decision boundary of each classifier, and `train_test_split` to create a training and test set with a 50% test size)

| Classifier                    | Function name                   |
|-------------------------------|---------------------------------|
| Linear Bayes classifier       | LinearDiscriminantAnalysis()    |
| Quadratic Bayes classifier    | QuadraticDiscriminantAnalysis() |
| k-Nearest neighbor classifier | KNeighborsClassifier()          |
| Support Vector Machine        | SVC(probability=True)           |
| Nearest mean classifier       | NearestCentroid()               |
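
Because all of these classifiers share the same `fit`/`predict` interface, they can be compared in a single loop. The sketch below does so on a different, hypothetical dataset (concentric circles from `make_circles`) with arbitrary `random_state` values, so its ranking need not carry over to the banana-shaped data.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.svm import SVC

# Hypothetical data: two concentric circles (not the data from the exercise)
X_demo, y_demo = make_circles(n_samples=400, noise=0.05, factor=0.5,
                              random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.5, random_state=0)

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "kNN": KNeighborsClassifier(),
    "SVM": SVC(probability=True),
    "Nearest mean": NearestCentroid(),
}

# Fit each classifier on the training half and score it on the test half
test_errors = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    test_errors[name] = np.mean(clf.predict(X_te) != y_te)

for name, err in test_errors.items():
    print(f"{name:>12}: {err:.3f}")
```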


In [None]:
# SOLUTION:

# Your code here ...


## 5. Training a classifier: effect of training set size

The code below generates a two-dimensional Gaussian dataset with 20 samples and 2 features. The two classes have means (1, 1) and (2, 2) and standard deviations 1 and 2, respectively (`cluster_std` sets the standard deviation, not the variance).
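
In scikit-learn, the `cluster_std` argument of `make_blobs` is a standard deviation. A quick empirical check is sketched below; the large sample size and the `random_state` are arbitrary choices made only to get stable estimates.

```python
from sklearn.datasets import make_blobs

# Large sample so that the empirical spread is close to the true value
X_big, y_big = make_blobs(n_samples=20000, n_features=2,
                          centers=[[1, 1], [2, 2]], cluster_std=[1, 2],
                          random_state=0)

# Per-feature standard deviation within each class
std_class0 = X_big[y_big == 0].std(axis=0)
std_class1 = X_big[y_big == 1].std(axis=0)

print(std_class0, std_class1)
```

The printed values should be close to 1 for the first class and close to 2 for the second, which corresponds to variances of about 1 and 4.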

In [None]:
Gaussian_shaped = datasets.make_blobs(n_samples=20, n_features=2, centers=[[1, 1], [2, 2]], cluster_std=[1, 2])
data = Gaussian_shaped[0]
labels = Gaussian_shaped[1]


---

#### ✏ Exercise 5a

> Train the k-Nearest neighbor classifier on the dataset obtained above (hint: use `KNeighborsClassifier` with 3 neighbors, and use the `fit` method to train the classifier).

In [None]:
# SOLUTION:

# Your code here ...



---

#### ✏ Exercise 5b

> Now create a larger dataset with 1000 samples and use it to test the classifier trained in Exercise 5a. What is your error rate now?

In [None]:
# SOLUTION:

# Your code here ...



---

#### ✏ Exercise 5c

> Generate a new Gaussian, two-dimensional dataset with 500 samples.


In [None]:
# SOLUTION:

# Your code here ...



---

#### ✏ Exercise 5d

> Train the k-Nearest neighbor classifier with 3 neighbors on this new set and then test it on the set with 1000 samples from Exercise 5b. What is your error rate?


In [None]:
# SOLUTION:

# Your code here ...



---

#### ✏ Exercise 5e

> Can you explain the different error rates obtained?
>
> Hint: Apply the `learning_curve` function given in `LST_Functions` to see the k-Nearest neighbor error on this dataset, using training sets of different sizes.
>
> Hint: The `learning_curve` function takes a while, because it has to retrain many times.
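
To build intuition for what a learning curve measures, the sketch below retrains a 3-nearest-neighbor classifier on training sets of increasing size and scores each one on the same fixed test set. It does not use or reproduce `learning_curve` from `LST_Functions`; the helper `sample`, the blob parameters, the sizes, and the seeds are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier


def sample(n, seed):
    """Draw n points from two hypothetical overlapping Gaussian classes."""
    return make_blobs(n_samples=n, centers=[[0, 0], [2, 0]],
                      cluster_std=1.0, random_state=seed)


# One fixed, large test set shared by all training sizes
X_test_demo, y_test_demo = sample(1000, seed=1)

sizes = [20, 50, 100, 200, 500]
errors = []
for n in sizes:
    X_tr, y_tr = sample(n, seed=0)                        # training set of size n
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
    errors.append(np.mean(knn.predict(X_test_demo) != y_test_demo))

for n, err in zip(sizes, errors):
    print(f"n_train = {n:4d}  test error = {err:.3f}")
```

A single draw per training size is noisy, so averaging over several random training sets per size would give a smoother curve.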


In [None]:
# SOLUTION:

learning_curve()  # Use this function!



## 6. Marker gene selection

The code below loads the "Genesdata.pkl" file, which contains two datasets: (X_train, y_train) and (X_test, y_test). Some genes in these datasets are good markers, while some contain less information about the classes.

In [5]:
# Load the genes dataset
with open('Genesdata.pkl', 'rb') as f:
  datadict = pickle.load(f)

X_train = datadict['X_train']
X_test = datadict['X_test']
y_train = datadict['y_train']
y_test = datadict['y_test']
del datadict


---

#### ✏ Exercise 6a

> Employ a criterion such as the t-statistic to evaluate a gene’s predictive power, using the training set. (hint: `X_train` contains the data, and the class labels are stored in `y_train`)
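
A two-sample t-statistic can be computed for all genes at once with `scipy.stats.ttest_ind` and `axis=0`. The sketch below uses synthetic data and assumes that rows are samples and columns are genes; check whether `X_train` follows the same layout. The array sizes, the seed, and the shifted gene index are hypothetical.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)  # arbitrary seed

# Hypothetical expression matrix: 60 samples (rows) x 5 genes (columns)
n_per_class, n_genes = 30, 5
X_demo = rng.normal(size=(2 * n_per_class, n_genes))
y_demo = np.repeat([0, 1], n_per_class)

# Make gene 2 informative by shifting its mean in class 1
X_demo[y_demo == 1, 2] += 3.0

# One t-statistic per gene (column), comparing class 0 against class 1
t_stat, p_val = st.ttest_ind(X_demo[y_demo == 0], X_demo[y_demo == 1], axis=0)
print(np.round(t_stat, 2))
```

By default `ttest_ind` assumes equal variances in both classes; passing `equal_var=False` gives Welch's t-test instead.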

In [None]:
# SOLUTION:

# Your code here ...



---

#### ✏ Exercise 6b

> Identify the two best genes by sorting the t-statistics (hint: remember to look at the absolute value of the t-statistic).
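
Ranking by absolute value can be done with `np.argsort`. A minimal sketch on made-up t-statistics follows; the values in `t_demo` are hypothetical.

```python
import numpy as np

# Hypothetical t-statistics for five genes
t_demo = np.array([0.5, -4.2, 1.1, 3.7, -0.3])

# Sort gene indices by decreasing absolute t-statistic
order = np.argsort(-np.abs(t_demo))
top_two = order[:2]

print(top_two)  # indices of the two genes with the largest |t|
```

Sorting the raw values instead would rank strongly negative t-statistics last, even though they separate the classes just as well as strongly positive ones.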

In [None]:
# SOLUTION:

# Your code here ...



---

#### ✏ Exercise 6c

> Retain the two top features/genes and train a classifier using the training dataset `(X_train, y_train)`. Then test the trained classifier using the test dataset `(X_test, y_test)` (hint: use `LinearDiscriminantAnalysis` with its `fit` and `predict` methods). What is the error?

In [None]:
# SOLUTION:

# Your code here ...



---

#### ✏ Exercise 6d

> Use `plot_decision_boundary` to visualize the two selected features for each dataset separately. Are these features the best overall separators for the two classes? Why?

In [None]:
# SOLUTION:

# Your code here ...


## 7. Basic cross-validation

Generate a dataset using the following command:

In [11]:
Banana_shaped = datasets.make_moons(n_samples=1000, noise=0.1)
data = Banana_shaped[0]
labels = Banana_shaped[1]


---

#### ✏ Exercise 7

> Use `StratifiedKFold` to create 3 folds. Use its `split` method to get the training and test indices for each fold, and split the data into training and test sets using these indices (see code below). For each fold, train three classifiers (linear Bayes, SVM, and Random Forest) and compare their performance. Which one performs best?
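
The fold-by-fold pattern is sketched below with a different classifier (`NearestCentroid`) on hypothetical blob data, so it illustrates the mechanics without answering the question. The dataset parameters and `random_state` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import NearestCentroid

# Hypothetical two-class data (not the banana-shaped set)
X_demo, y_demo = make_blobs(n_samples=300, centers=[[0, 0], [2, 2]],
                            cluster_std=1.0, random_state=0)

cv = StratifiedKFold(n_splits=3)  # each fold keeps the class proportions
fold_errors = []

for train_idx, test_idx in cv.split(X_demo, y_demo):
    clf = NearestCentroid()                        # fresh classifier per fold
    clf.fit(X_demo[train_idx], y_demo[train_idx])  # train on two folds
    y_hat = clf.predict(X_demo[test_idx])          # test on the held-out fold
    fold_errors.append(np.mean(y_hat != y_demo[test_idx]))

mean_error = np.mean(fold_errors)
print(fold_errors, mean_error)
```

Averaging the error over the folds gives one number per classifier, which makes the comparison less dependent on a single lucky or unlucky split.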

In [None]:
# Use one classifier at a time: keep one of these lines and comment out the other two
Classifier = LinearDiscriminantAnalysis()
Classifier = SVC()
Classifier = RandomForestClassifier(n_estimators=3)

# Create cross validation and use for loop to cover all folds
CV = StratifiedKFold(n_splits=3)

# SOLUTION:
for train_ind, test_ind in CV.split(data, labels):
  ??? # Replace the ??? with the correct code
