**Exercise set 9**
==============

>The goal of this exercise set is to show how decision trees
>are constructed, and how partial least-squares regression (PLSR) and decision tree
>classification can be applied to relatively complex data sets.

**Exercise 9.1**

Table 1 lists conditions under which our friend Hermann
plays tennis. Use these data to create a decision tree (by hand) that predicts whether or
not Hermann plays tennis. Construct the tree using the information entropy
and the information gain.


|**Outlook**  | **Wind** | **Humidity** | **Playing tennis**  |
|:---|:---|:---|:---|
|overcast | strong | normal | yes |
|sunny    | strong | normal | yes |
|rain     | weak   | high   | yes |
|sunny    | strong | high   | no  |
|sunny    | weak   | high   | no  |
|sunny    | weak   | normal | yes |
|rain     | strong | high   | no  |

**Table 1:** *Conditions for which Hermann is playing tennis.*
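
The two quantities used to build the tree can be written as $H = -\sum_i p_i \log_2 p_i$ (entropy of the class labels) and the information gain, which is the entropy before a split minus the weighted average entropy of the subsets after the split. The sketch below is one possible way to check hand calculations. The function names `entropy` and `information_gain` and the toy data are assumptions made for illustration; the toy data is deliberately not Table 1.

```python
"""Entropy and information gain for a categorical split (illustrative)."""
from collections import Counter
from math import log2


def entropy(labels):
    """Return the Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def information_gain(feature, labels):
    """Return the entropy reduction obtained by splitting on `feature`."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        # Labels of the samples that take this feature value.
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data (not Table 1): does a plant grow at a given light level?
light = ['low', 'low', 'high', 'high']
grows = ['no', 'no', 'yes', 'yes']
gain = information_gain(light, grows)
print(f'Entropy before split: {entropy(grows):.3f} bits')
print(f'Information gain: {gain:.3f} bits')
```

In a hand-built tree, the variable with the largest information gain is placed at the current node, and the procedure is repeated for each branch.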

In [None]:
# Your code here

**Your answer to question 9.1:** *Double click here*

**Exercise 9.2**

[Windig and Stephenson](https://doi.org/10.1021/ac00046a015) have measured near-infrared spectra
for 140 mixtures of the solvents methylene chloride, 2-butanol, methanol,
dichloropropane, and acetone. In this exercise, we will see whether we can
predict the compositions of the mixtures from the spectra.
Each of the $140$ spectra has been sampled at $700$ wavelengths
between $1100$ and $2500$ nm. The raw data containing the spectra
and the corresponding concentrations can be found in the file
[`Data/windig.csv`](Data/windig.csv).



**(a)**  Create a partial least-squares regression (PLSR) model for predicting
the concentrations. Use $1$ PLS component for your first model and
assess it using $R^2$, RMSEC, RMSECV and RMSEP. The raw data can
be loaded as shown below.
```python
"""Load the Windig data set."""
import pandas as pd
data = pd.read_csv('Data/windig.csv')
X = data.filter(like='data', axis=1).values  # NIR spectra
Y = data.filter(like='concentrations', axis=1).values  # Concentrations
```

In [None]:
# Your code here

**Your answer to question 9.2(a):** *Double click here*

**(b)** Improve your PLSR model by including more
PLS components. Try components in the
range from $2$ up to $15$ and compare the different models. How many
PLS components are you satisfied with? In the following, we will refer
to the model you are most satisfied with as "model A".



In [None]:
# Your code here

**Your answer to question 9.2(b):** *Double click here*

**(c)**  Plot the regression coefficients for model A (see point **(b)**).



In [None]:
# Your code here

**Your answer to question 9.2(c):** *Double click here*

**(d)**  If you are given a new spectrum of a mixture of methylene chloride,
2-butanol, methanol, dichloropropane, and acetone, how well would
your model A predict the concentrations of the different solvents
in the mixture?

In [None]:
# Your code here

**Your answer to question 9.2(d):** *Double click here*

**(e)**  Create a least-squares model for predicting the concentrations.
Assess it using $R^2$, RMSEC, RMSECV and RMSEP. Does this model
perform as you expect?
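
A minimal sketch of fitting an ordinary least-squares model with scikit-learn is given below. The synthetic arrays `X_demo` and `y_demo` are assumptions and were chosen to have more variables than samples, a situation that also arises when spectra are sampled at many wavelengths; the same error measures as for PLSR can be computed from the predictions.

```python
"""Sketch: ordinary least squares with more variables than samples."""
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 40 samples, 100 variables.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(40, 100))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=40)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

ols = LinearRegression()
ols.fit(X_train, y_train)

r2_train = ols.score(X_train, y_train)  # R^2 on the training set
r2_test = ols.score(X_test, y_test)  # R^2 on the held-out test set
print(f'R2 (training) = {r2_train:.3f}, R2 (test) = {r2_test:.3f}')
```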

In [None]:
# Your code here

**Your answer to question 9.2(e):** *Double click here*


**Exercise 9.3**

[Schummer *et al.*](https://doi.org/10.1016/S0378-1119(99)00342-X) studied ovarian cancer by measuring gene expression
values for $1536$ genes in both normal and tumor tissues. One of their goals was
to find genes that were overexpressed in tumor samples compared with normal samples.
This knowledge may be used for tumor diagnosis. The raw data can be
found in the file [`Data/ovo.csv`](Data/ovo.csv).


**(a)**  Perform a principal component analysis (PCA) on the gene expression data,
and obtain the explained variance when using $1$, $2$, $5$ and $10$
components.

Center the data before performing the PCA. This can be
done with the `scale` function
from `sklearn.preprocessing`: `X = scale(X, with_std=False)`.

The raw data can
be loaded as shown below.
```python
"""Load the ovarian cancer data set."""
import pandas as pd
data = pd.read_csv('Data/ovo.csv')
classes = data['objlabels']  # Classification of data points.
X = data.filter(like='X.', axis=1)  # Gene expressions.
```
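
The centering and the explained variance can be sketched as follows. The array `X_demo` is synthetic and only stands in for the gene expression data; the cumulative sum of `explained_variance_ratio_` gives the fraction of the variance explained by the first $n$ components.

```python
"""Sketch: explained variance from a PCA on centered data."""
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Synthetic stand-in data with columns of decreasing spread.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(30, 50)) * np.linspace(5.0, 0.5, 50)

X_centered = scale(X_demo, with_std=False)  # subtract the column means

pca = PCA(n_components=10)
pca.fit(X_centered)

cumulative = np.cumsum(pca.explained_variance_ratio_)
for n in (1, 2, 5, 10):
    print(f'{n:2d} components: {100 * cumulative[n - 1]:.1f}% of the variance')
```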




In [None]:
# Your code here

**Your answer to question 9.3(a):** *Double click here*

**(b)**  Inspect the data by plotting the scores and loadings for principal component
number $1$ and principal component number $2$:

* (i)  Can you observe any clustering
of the samples?

* (ii)  Are there any outliers among the samples?

* (iii)  Can you identify some
genes which are overexpressed in tumors? 

* (iv)  Can you identify some
genes which are underexpressed in tumors? 
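
Scores and loadings can be extracted from a fitted PCA as sketched below, here on synthetic data where one variable is shifted for a subset of the samples. The names `X_demo`, `scores`, and `loadings` are assumptions, and the loadings are taken as the rows of `components_` without any rescaling, which is only one of several conventions.

```python
"""Sketch: scores and loadings for the first two principal components."""
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Synthetic stand-in data: variable 0 is shifted for the first ten samples.
rng = np.random.default_rng(6)
X_demo = rng.normal(size=(25, 8))
X_demo[:10, 0] += 4.0

pca = PCA(n_components=2)
scores = pca.fit_transform(scale(X_demo, with_std=False))  # one row per sample
loadings = pca.components_.T  # one row per variable

# Variable with the largest absolute loading on principal component 1.
top_variable = int(np.argmax(np.abs(loadings[:, 0])))
print(f'Largest loading on PC1: variable {top_variable}')
```

Plotting `scores[:, 0]` against `scores[:, 1]`, colored by class, shows possible clusters and outliers, while the corresponding plot of the loadings indicates which variables drive the separation.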





In [None]:
# Your code here

**Your answer to question 9.3(b):** *Double click here*

**(c)**  Based on your answer in **(b)**, can
you identify some pairs of genes that seem to distinguish between
normal and tumor tissues? Support your findings by plotting the raw data.

In [None]:
# Your code here

**Your answer to question 9.3(c):** *Double click here*

**(d)**  Create a classifier for this data set using a decision tree. Limit the depth of the decision
tree to $2$. Assess the classifier, and compare it with your findings in **(c)**.
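
A depth-limited decision tree can be sketched with scikit-learn as shown below. The synthetic data, the labels, the hypothetical variable names `gene_0` to `gene_4`, and the 70/30 split with test-set accuracy as the assessment are assumptions; cross-validation would be another reasonable way to assess the classifier.

```python
"""Sketch: a decision tree classifier limited to depth 2."""
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data: the class depends only on variable 2.
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(100, 5))
y_demo = np.where(X_demo[:, 2] > 0.0, 'tumor', 'normal')

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, y_train)

accuracy = tree.score(X_test, y_test)  # fraction of correct test predictions
print(f'Test accuracy: {accuracy:.2f}')
# Text rendering of the fitted tree, using hypothetical variable names.
print(export_text(tree, feature_names=[f'gene_{i}' for i in range(5)]))
```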



In [None]:
# Your code here

**Your answer to question 9.3(d):** *Double click here*

**(e)**  Create a random forest classifier for the data set. Set the maximum depth to $2$, and
use $500$ trees in your forest (i.e. set `n_estimators=500`). Assess the
classifier and plot the
variable importance for the $20$ most important variables. Compare this with your
previous findings.
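
The sketch below shows one way to fit such a forest and rank the variables by importance. The synthetic data is an assumption, as is the use of the out-of-bag score (`oob_score=True`) as the assessment; the importances are the impurity-based values in `feature_importances_`.

```python
"""Sketch: random forest with ranked variable importances."""
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: the class depends on variables 7 and 11.
rng = np.random.default_rng(5)
X_demo = rng.normal(size=(120, 40))
y_demo = (X_demo[:, 7] + X_demo[:, 11] > 0.0).astype(int)

forest = RandomForestClassifier(
    n_estimators=500, max_depth=2, oob_score=True, random_state=0
)
forest.fit(X_demo, y_demo)

importances = forest.feature_importances_
top20 = np.argsort(importances)[::-1][:20]  # indices, most important first

print(f'Out-of-bag accuracy: {forest.oob_score_:.2f}')
for index in top20:
    print(f'variable {index:2d}: importance = {importances[index]:.4f}')
```

The values in `importances[top20]` can be passed to a bar plot to visualize the ranking.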

In [None]:
# Your code here

**Your answer to question 9.3(e):** *Double click here*