# Quiz 1
Notebook containing the answers for quiz 1 in ML: Supervised Methods

## Question 1
Assume that there is a labeling function $f : \mathcal{X} \rightarrow \mathcal{Y}$, and that the probability distribution over $\mathcal{X}$ is
denoted by $\mathcal{D}$.
The error of a classifier $h : \mathcal{X} \rightarrow \mathcal{Y}$ can be defined as:
$$ L_{\mathcal{D},f}(h) \stackrel{\text{def}}{=} \mathbb{P}_{x \sim \mathcal{D}} [h(x) \neq f(x)] $$
where, from the learner's point of view, $x$ is a randomly chosen example
from an **unknown distribution $\mathcal{D}$**.
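
Since $\mathcal{D}$ is unknown to the learner, this error can only be estimated from samples. The sketch below illustrates the definition with a Monte Carlo estimate. The distribution, `f`, `h`, and the helper `estimate_error` are hypothetical choices made up for this illustration; they are not part of the quiz.

```python
import random

def estimate_error(h, f, sample_x, n_samples=100_000, seed=0):
    """Monte Carlo estimate of L_{D,f}(h) = P_{x~D}[h(x) != f(x)]."""
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(n_samples):
        x = sample_x(rng)          # draw x ~ D
        mistakes += h(x) != f(x)   # count disagreements between h and f
    return mistakes / n_samples

# Assumed toy setup: D is uniform on [0, 1), f thresholds at 0.5, h at 0.6.
f = lambda x: int(x >= 0.5)
h = lambda x: int(x >= 0.6)
sample_x = lambda rng: rng.random()

err = estimate_error(h, f, sample_x)
print(f"estimated error: {err:.3f}")
```

In this toy setup `h` and `f` disagree exactly on $[0.5, 0.6)$, so the true error is 0.1 and the estimate should land close to it.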


## Question 2
What is the probability of having more than 10 noisy examples in a sample of 35 drawn uniformly
from a distribution with 20% inherent noise?

In [3]:
import numpy as np
import math
def binom(n, k):
    # Binomial coefficient "n choose k", computed with exact integer arithmetic
    return math.factorial(n) // math.factorial(k) // math.factorial(n - k)

n = 35   # sample size
p = 0.2  # probability that a single example is noisy
sum_prob = 0
# Sum the binomial probabilities of observing N = 0, 1, ..., 10 noisy examples
for N in range(11):
    sum_prob += binom(n, N) * np.power(p, N) * np.power(1 - p, n - N)
print("probability of sampling 10 or less noisy examples:",np.round(sum_prob,2))
print("probability of sampling more than 10 noisy examples: ",np.round(1-sum_prob,2))

probability of sampling 10 or less noisy examples: 0.93
probability of sampling more than 10 noisy examples:  0.07
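
As a cross-check, the same quantity can be obtained by summing the upper tail directly instead of taking the complement. This is a minimal sketch using only the standard library (`math.comb` requires Python 3.8 or later); the helper name `binom_pmf` is introduced here for illustration.

```python
import math

n, p, threshold = 35, 0.2, 10

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# P(X <= 10): the lower tail, as in the cell above
p_at_most = sum(binom_pmf(k, n, p) for k in range(threshold + 1))
# P(X > 10): the upper tail, summed directly over k = 11, ..., 35
p_more = sum(binom_pmf(k, n, p) for k in range(threshold + 1, n + 1))

print("P(X <= 10):", round(p_at_most, 2))
print("P(X > 10): ", round(p_more, 2))
```

The two tails should sum to 1 up to floating-point error, which is a quick sanity check on the loop bounds.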


## Question 3
In each of the following problems, which measure should be prioritized when evaluating a classifier?

A) Cancer prediction (positive: cancer, negative: healthy). The patients with positive prediction will go through more analysis, and the negative ones will be sent home.

*When detecting cancer, we must prioritize minimizing the number of false negatives (i.e. patients whose cancer goes undetected). Missing a cancer is much more dangerous than flagging cancer in a healthy patient, who will simply go through more analysis.
Thus, we must focus on **recall**.*

B) Spam email prediction (positive: spam, negative: non-spam). Emails detected as spam are removed automatically.

*When detecting spam, it is more important to minimize false positives. We do not want important emails to be classified as spam and removed, which clearly has worse consequences than letting some spam emails through. Thus, we must focus on **precision**.*

$$\text{precision}=\dfrac{\text{TP}}{\text{TP}+\text{FP}}, \quad \text{recall}=\dfrac{\text{TP}}{\text{TP}+\text{FN}}$$

1. A: precision, B: recall
2. **A: recall, B: precision**
3. A: precision, B: precision
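
To make the two formulas concrete, here is a small sketch that evaluates them on confusion-matrix counts. The counts and the helper `precision_recall` are made up for illustration and do not come from the quiz.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of positive predictions that are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of actual positives that are found
    return precision, recall

# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives
precision, recall = precision_recall(tp=40, fp=10, fn=20)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Lowering the number of false negatives raises recall (case A), while lowering the number of false positives raises precision (case B).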

## Question 4
Import the files `X.csv` and `Y.csv` given in the Materials section as the inputs and targets.
The same data can also be obtained from the Boston dataset available in sklearn, where the features LSTAT and RM are used as inputs and the target is used as output.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
X = pd.read_csv("X.csv")
Y = pd.read_csv("Y.csv")

X.shape, Y.shape

In [None]:
# Split the dataset into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

# Using linear regression (LR) without a bias (intercept) term,
# which option is the root-mean-square error (RMSE) over the test data?
from sklearn.linear_model import LinearRegression
reg = LinearRegression(fit_intercept=False).fit(X_train, Y_train)

from sklearn.metrics import mean_squared_error
rmse_test = np.sqrt(mean_squared_error(Y_test, reg.predict(X_test)))
print("The RMSE over test data is {:.2f}".format(rmse_test))

The answer is:

1. **5.26**
2. 6.34
3. 4.62
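
For reference, the two ingredients of this question, a least-squares fit without an intercept and the RMSE, can be written out with NumPy alone. The sketch below uses synthetic data (`X_demo`, `y_demo`, and `w_true` are assumptions for illustration, not the quiz data), so its numbers are unrelated to the answer above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the inputs and targets (NOT the quiz data)
X_demo = rng.normal(size=(200, 2))
w_true = np.array([2.0, -1.0])
y_demo = X_demo @ w_true + rng.normal(scale=0.5, size=200)

# Ordinary least squares without an intercept: minimize ||X w - y||^2 over w
w_hat, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

def rmse(y_true, y_pred):
    """Root-mean-square error: sqrt(mean((y_true - y_pred)^2))."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

rmse_demo = rmse(y_demo, X_demo @ w_hat)
print("estimated weights:", np.round(w_hat, 2))
print(f"RMSE: {rmse_demo:.2f}")
```

Because no intercept is fitted, the model is forced through the origin, which is what `fit_intercept=False` requests in the cell above.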

## Question 5
Compute the average ($K$-fold cross-validation error) and the variance of the root-mean-square error (RMSE) on the test folds, for $K = 2$ and $K = 5$.

In [None]:
from sklearn.model_selection import KFold
k_list = [2, 5]
for k in k_list:
    rmse_test = []
    kf = KFold(n_splits=k, random_state=1, shuffle=True)
    for train_idx, val_idx in kf.split(X_train):
        X_train_kf, X_val_kf = X_train.iloc[train_idx, :], X_train.iloc[val_idx, :]
        Y_train_kf, Y_val_kf = Y_train.iloc[train_idx, :], Y_train.iloc[val_idx, :]
        
        reg = LinearRegression(fit_intercept=False).fit(X_train_kf, Y_train_kf)
        rmse_test.append(np.sqrt(mean_squared_error(Y_val_kf, reg.predict(X_val_kf))))
    print(f"k={k}: average={np.mean(rmse_test):.2f}, variance={np.var(rmse_test):.2f}.")

The answer is option **3.** 
1. K=2: average = 6.23, variance = 0.21 / K=5: average = 6.12, variance = 0.64
2. K=2: average = 5.38, variance = 0.43 / K=5: average = 5.53, variance = 0.38
3. **K=2: average = 6.03, variance = 0.37 / K=5: average = 5.72, variance = 0.50**
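
One detail worth keeping in mind when reading these variances: `np.var`, as used in the loop above, computes the population variance by default (`ddof=0`), dividing by the number of folds rather than by the number of folds minus one. The sketch below shows the difference on made-up per-fold RMSE values, which are hypothetical and not the quiz results.

```python
import numpy as np

# Hypothetical per-fold RMSE values (for illustration only)
fold_rmse = np.array([5.0, 6.0, 7.0])

mean_rmse = np.mean(fold_rmse)
pop_var = np.var(fold_rmse)             # ddof=0: divides by K (NumPy's default)
sample_var = np.var(fold_rmse, ddof=1)  # ddof=1: divides by K - 1

print(f"mean = {mean_rmse:.2f}")
print(f"population variance = {pop_var:.2f}, sample variance = {sample_var:.2f}")
```

With only a few folds the two conventions differ noticeably (by a factor of 2 when $K = 2$), so the reported numbers depend on which one is used.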