## Data 607 -- Statistical and Machine Learning -- Winter 2020
### Assignment 1

#### Instructions

- Present your solutions in a single Jupyter notebook `.ipynb` file.

- Call the file 
<p style="text-align: center;font-family: monospace">[your last name]_[your first name]_[student number]_a1.ipynb</p>
Do not include the `[` and `]` in the file name.

- Upload the file to the appropriate dropbox on the D2L site before 23:59 on Wednesday, March 4.

- You may consult with your classmates, but you must submit your own work.


#### Exercise 1

Edgar Anderson's <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set">iris dataset</a> was made famous by <a href="https://en.wikipedia.org/wiki/Ronald_Fisher">Sir Ronald Fisher</a>'s analysis in his 1936 paper, "<a href="https://digital.library.adelaide.edu.au/dspace/handle/2440/15227">The use of multiple measurements in taxonomic problems</a>". The dataset consists of measurements of four physical characteristics -- *sepal length*, *sepal width*, *petal length*, and *petal width* -- of three species of iris -- *Iris setosa*, *Iris virginica* and *Iris versicolor* -- collected with an eye to quantifying morphological variation. Nowadays, analysis of the iris dataset is a staple of statistical pedagogy.

In this exercise, we will use Bayes' theorem and the iris dataset to train a classifier.

Due to its prominence, the iris dataset is included with `scikit-learn`. Load it as follows:

In [1]:
from sklearn.datasets import load_iris
data = load_iris()

print(data.feature_names)
print(data.target_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']


Although the iris dataset includes four features, we'll only use two for our Bayes classifier: petal length (feature 2) and petal width (feature 3), numbered from 0.

In [2]:
X = data.data[:, 2:]
y = data.target

print(X.shape, y.shape)

(150, 2) (150,)


Split the data set into a training set of size 120 and a testing set of size 30.

In [3]:
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)

print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)

(120, 2) (120,) (30, 2) (30,)


Compute the relative proportions of the three species of iris in the training set. Store them in a vector `p_y`. In the Bayesian parlance, these proportions represent *prior probabilities* of the three species.

In [4]:
# p_y = # your code here
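
As a hedged illustration of what "relative proportions" means here, the sketch below computes class proportions for a small made-up label vector. The names `toy_labels` and `toy_priors` are hypothetical and are not part of the assignment; the same idea would apply to `y_tr`.

```python
import numpy as np

# Hypothetical label vector standing in for y_tr (0, 1, 2 encode the classes).
toy_labels = np.array([0, 0, 1, 2, 2, 2, 1, 0, 2, 1])

# Count how often each class occurs, then normalize so the entries sum to 1.
counts = np.bincount(toy_labels, minlength=3)
toy_priors = counts / counts.sum()

print(toy_priors)  # [0.3 0.3 0.4]
```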

Separate `X_tr` into three subsets according to species.

In [5]:
X_tr_y = [X_tr[y_tr == k, :] for k in [0, 1, 2]]
print([x.shape for x in X_tr_y])

[(37, 2), (42, 2), (41, 2)]


Fit a 2-dimensional Gaussian distribution to each of these subsets.

In [6]:
import numpy as np

means = [np.mean(x, axis=0) for x in X_tr_y]
covs = [np.cov(x.T) for x in X_tr_y]

for i, (mean, cov) in enumerate(zip(means, covs)):
    print(f"class {i}:\n-------\nmean = {mean}\ncov = {cov}\n")

class 0:
-------
mean = [1.46216216 0.25135135]
cov = [[0.03186186 0.00588589]
 [0.00588589 0.01367868]]

class 1:
-------
mean = [4.32380952 1.35      ]
cov = [[0.2013705  0.06268293]
 [0.06268293 0.03621951]]

class 2:
-------
mean = [5.60731707 2.01707317]
cov = [[0.33169512 0.05137195]
 [0.05137195 0.07045122]]



Make contour plots of the densities for each of the three species.

Make a contour plot of the associated *mixture density*. (Do *not* use `GaussianMixture` from `sklearn.mixture`. That's for situations in which you *don't know the class labels*. Just take an appropriate weighted sum of the densities whose contours you just plotted above.)

<div style="background-color: pink; padding: 10px">
See <code>height-weight.ipynb</code> from Session 01 for an example in the two-class case.
</div>

In [7]:
from scipy.stats import multivariate_normal as mvn

# your code here
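
One possible way to evaluate a mixture density on a grid is sketched below for a made-up two-component case. All parameter values and the names `toy_weights`, `toy_means`, and `toy_covs` are assumptions chosen for illustration; for the exercise, the weights would presumably be the class priors and the components the fitted Gaussians.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Hypothetical two-component example; these parameters are made up.
toy_weights = np.array([0.4, 0.6])  # mixture weights, summing to 1
toy_means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
toy_covs = [np.eye(2), np.array([[1.0, 0.3], [0.3, 0.5]])]

# Evaluation grid: pos has shape (len(ys), len(xs), 2), which mvn.pdf accepts.
xs = np.linspace(-3, 6, 200)
ys = np.linspace(-3, 4, 150)
xx, yy = np.meshgrid(xs, ys)
pos = np.dstack((xx, yy))

# One density surface per component, then their weighted sum.
densities = [mvn(mean=m, cov=c).pdf(pos) for m, c in zip(toy_means, toy_covs)]
mixture = sum(w * d for w, d in zip(toy_weights, densities))

print(mixture.shape)  # (150, 200)
```

With matplotlib available, `plt.contour(xx, yy, mixture)` would draw the contours of the mixture, and passing an element of `densities` instead would draw a single component.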

For each sample $x$ in the *test set*, compute the likelihoods
$$p(x|y=0),\quad p(x|y=1),\quad \text{and} \quad p(x|y=2)\tag{$\star$}$$
using the above class-conditional Gaussian distributions fit to the training data.
Store these in a matrix `p_x_y` of shape `(30, 3)` whose `i`-th row contains the likelihoods of `X_te[i]` as in $(\star)$.

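Predict class labels for the testing set by choosing, for each sample, the class with the highest posterior probability. By Bayes' theorem, the posterior $p(y=k|x)$ is proportional to $p(x|y=k)\,p(y=k)$, so the priors `p_y` enter here alongside `p_x_y`.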

Compare your predicted class labels with `y_te`. How accurate are your predictions?
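
The likelihood, posterior, and accuracy steps can be sketched on a made-up two-class problem as follows. Every name and number here (`toy_priors`, `toy_means`, `toy_covs`, `points`, `true_labels`) is hypothetical; this is a minimal illustration of the pattern under those assumptions, not the assignment's solution.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

# Hypothetical two-class setup; all numbers are illustrative.
toy_priors = np.array([0.5, 0.5])
toy_means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
toy_covs = [np.eye(2), np.eye(2)]

points = np.array([[0.1, -0.2], [3.8, 4.1], [0.5, 0.3], [4.2, 3.5]])
true_labels = np.array([0, 1, 0, 1])

# Likelihood matrix: column k holds p(x | y = k) for every row x of points.
likelihoods = np.column_stack(
    [mvn(mean=m, cov=c).pdf(points) for m, c in zip(toy_means, toy_covs)]
)

# Unnormalized posterior: likelihood times prior, broadcast across rows.
unnormalized = likelihoods * toy_priors
posteriors = unnormalized / unnormalized.sum(axis=1, keepdims=True)

# Predict the class with the largest posterior and score the predictions.
predicted = np.argmax(posteriors, axis=1)
accuracy = np.mean(predicted == true_labels)
print(predicted, accuracy)  # [0 1 0 1] 1.0
```

Normalizing each row does not change the `argmax`, so it is only needed if the posterior probabilities themselves are of interest.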

<div style="background-color: pink; padding: 10px">
See <code>height-weight.ipynb</code> from Session 01 for an example in the two-class case.
</div>

#### Exercise 2

Another data set included with `scikit-learn` is the
<a href="https://scikit-learn.org/stable/datasets/index.html#diabetes-dataset">diabetes dataset</a>.
It contains measurements of ten features for each of 442 diabetes patients as well as a quantitative measurement of disease progression after one year. The goal is to predict each patient's progression from their ten baseline feature measurements.

1. Using `LinearRegression` from `sklearn.linear_model`, train a linear regression model on a training subset of the data. Report the mean squared error on the corresponding testing data.

2. Now use 10-fold cross-validation with mean squared error to estimate the prediction error of a linear regression model for the diabetes data set, as above.

3. Now perform the regression using `KNeighborsRegressor` from `sklearn.neighbors`. Use 10-fold cross-validation with mean squared error to choose an optimal value of $k$.

4. `KNeighborsRegressor` accepts a parameter called `weights`
(see <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html">the docs</a>).
With `weights="distance"`, the regressor weights the neighbor response values by the inverse of their distance from the point of interest, $x$, giving neighbors closer to $x$ more influence than those farther away.
The default is `weights="uniform"`, which weights each neighbor equally.
Use 10-fold cross-validation with mean squared error to find an optimal choice of the pair `(n_neighbors, weights)`.

5. Compare the estimates of prediction error from 1, 2, 3, and 4.
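
The three kinds of error estimate asked for above can be obtained with standard `scikit-learn` tools, roughly as in the sketch below. The variable names, the 80/20 split, the `random_state`, and the range of `n_neighbors` searched are all arbitrary assumptions made for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor

X_d, y_d = load_diabetes(return_X_y=True)

# Single train/test split: one hold-out estimate of the MSE.
Xd_tr, Xd_te, yd_tr, yd_te = train_test_split(X_d, y_d, test_size=0.2, random_state=0)
model = LinearRegression().fit(Xd_tr, yd_tr)
holdout_mse = mean_squared_error(yd_te, model.predict(Xd_te))

# 10-fold CV: scikit-learn scorers are "higher is better", so MSE is negated.
neg_mse = cross_val_score(
    LinearRegression(), X_d, y_d, cv=10, scoring="neg_mean_squared_error"
)
linear_cv_mse = -neg_mse.mean()

# Grid search over both k-NN hyperparameters; the range of k is an assumption.
param_grid = {"n_neighbors": list(range(1, 31)), "weights": ["uniform", "distance"]}
search = GridSearchCV(
    KNeighborsRegressor(), param_grid, cv=10, scoring="neg_mean_squared_error"
)
search.fit(X_d, y_d)
knn_cv_mse = -search.best_score_

print(holdout_mse, linear_cv_mse, search.best_params_, knn_cv_mse)
```

Restricting `param_grid` to `"weights": ["uniform"]` would give the search over $k$ alone. Note that with an integer `cv`, the folds are taken in order without shuffling; passing a `KFold(n_splits=10, shuffle=True)` object instead is a reasonable alternative.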