<h1><center><span style="color:red;">**IMPORTANT NOTICE**</span></center></h1>

Before submitting, **please**, make sure that your notebook runs **without errors** in Python 3.6
and **reproduces your solution as intended**, when you **Restart the Kernel and re-run the whole
notebook**!
<span style="color:red;">You will be severely penalized if your notebook does not run.</span>

Wherever applicable, your solution will be graded based on the **plots** generated by
**your code** on the **TA's** computer.

<br/> <!--Intentionally left blank-->

# Home Assignment -- 5

Please, write your implementation within the designated blocks:
```python
...
### BEGIN Solution

# >>> your solution here <<<

### END Solution
...
```

Write your theoretical derivations within such blocks:
```markdown
**BEGIN Solution**

<!-- >>> your derivation here <<< -->

**END Solution**
```

## $\LaTeX$ in Jupyter
Jupyter has constantly improving $\LaTeX$ support. Below are the basic methods to
write **neat, tidy, and well typeset** equations in your notebooks:
* to write an **inline** equation use 
```markdown
$ your LaTeX equation here $
```
* to write an equation that is **displayed on a separate line** use 
```markdown
$$ your LaTeX equation here $$
```
* to write a **block of equations** use 
```markdown
\begin{align}
    left-hand-side
        &= right-hand-side on line 1
        \\
        &= right-hand-side on line 2
        \\
        &= right-hand-side on the last line
\end{align}
```
The **ampersand** (`&`) aligns the equations horizontally and the **double backslash**
(`\\`) creates a new line.

<br/> <!--Intentionally left blank-->

<hr/> <!--Intentionally left blank-->

# Part 1 (19 pt.): Model selection and sensitivity analysis

<br/> <!--Intentionally left blank-->

## Task 1 (2 pt.): Information criteria

Assume that the regression model is
$$y = \sum_{i=1}^k \beta_i x_i + \varepsilon,$$
where $\varepsilon$ is normally distributed, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, and $\sigma^2$ is known.

Prove that the model with the highest Akaike information criterion is the model with the smallest Mallows' $C_p$.
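
For reference, the statement above is consistent with the following definitions. They are an assumption about the intended convention, not something fixed by the task text: $n$ denotes the number of observations, $\mathrm{RSS}$ the residual sum of squares of the fitted model with $k$ regressors, and $\ell(\hat\beta)$ the maximized log-likelihood. Under the alternative convention $\mathrm{AIC} = 2k - 2\ell(\hat\beta)$, "highest" should be read as "lowest".

$$
\begin{aligned}
\ell(\hat\beta) &= -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{\mathrm{RSS}}{2\sigma^2}, \\
\mathrm{AIC} &= \ell(\hat\beta) - k, \\
C_p &= \frac{\mathrm{RSS}}{\sigma^2} - n + 2k.
\end{aligned}
$$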

**BEGIN Solution**


**END Solution**

<br/> <!--Intentionally left blank-->

## Task 2 (17 pt.): Sensitivity analysis and optimization for rotating disk problem

In this task, you are asked to optimize a rotating disk. You will use approximation techniques, sensitivity analysis, and optimization. For sensitivity analysis we recommend the SALib library (https://github.com/SALib/SALib), and for optimization, scipy.

1. Parameters `r1,t1,r2,r3,t3,r4` are input variables that define the geometrical shape of the disk. Parameters `mass,smax,u2` are the mass of the disk, the maximal radial stress, and the contact stress, respectively. **Those are the
target variables to predict (yes, there are three regression targets).**
2. The `problem` Python dict is passed to the SALib methods and defines the bounds of the parameters.

### Necessary imports

Run the following command in the next empty code cell.

```python
!pip install salib
```

In [None]:
### here

Other imports

In [None]:
%matplotlib inline
from SALib.analyze import sobol as sobol_analyzer
from SALib.analyze import morris as morris_analyzer

from SALib.sample import saltelli as saltelli_sampler
from SALib.sample import morris as morris_sampler

from scipy.optimize import minimize

import numpy as np
import pandas as pd

from matplotlib import pyplot

from sklearn.model_selection import GridSearchCV

### Define problem

The problem is defined as a simple Python dict, in which you specify the number of input variables,
the bounds of each input variable, and their names. This will be helpful for sensitivity analysis.
Note that the bounds defined here are valid for **standardized data**.

In [None]:
problem = {
    'num_vars': 6,
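    # `data` is not loaded yet at this point, so the input names are listed explicitly
    # (assumed to match the first six columns of the data set)
    'names': ['r1', 't1', 'r2', 'r3', 't3', 'r4'],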
    'bounds': np.array([[-1.7321, 1.7321],
                        [-1.7321, 1.7321],
                        [-1.7321, 1.7321],
                        [-1.7321, 1.7321],
                        [-1.7321, 1.7321],
                        [-1.7321, 1.7321]]),
    'groups': None
    }

<br/> <!--Intentionally left blank-->

### Task 2.1 (7 pt.): Surrogate modelling

The actual dependency is not given, only a data set of inputs and outputs.
Surrogate modelling is an approach that constructs approximations of the real dependency and uses them for optimization and modelling.
To perform sensitivity analysis and optimization, we are going to use a regression model.

Your tasks:

* Load the data set from `data/doe_100.csv`.
* Build several regression models using different techniques: Gaussian Process Regression, Kernel Ridge regression, SVR.
* Perform k-fold cross-validation for each model and choose the best.

The most accurate models will be used in **all subsequent exercises**.

<span style="color:green">**NOTE**</span> sklearn has a convenient GP implementation.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
# pick any kernels you like; these four are only examples
from sklearn.gaussian_process.kernels import RBF, Matern, WhiteKernel, ConstantKernel
```
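
The model comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic (not `data/doe_100.csv`), there is a single target instead of three, and the hyperparameters, the 5-fold split, and the $R^2$ score are arbitrary choices rather than prescribed ones.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for the standardized inputs and one target (assumption).
rng = np.random.RandomState(0)
X = rng.uniform(-1.7321, 1.7321, size=(100, 6))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + 0.05 * rng.randn(100)

# Candidate surrogates; the hyperparameter values are illustrative only.
candidates = {
    'gpr': GaussianProcessRegressor(
        kernel=RBF(length_scale=np.ones(6)) + WhiteKernel(),
        normalize_y=True),
    'kernel_ridge': KernelRidge(kernel='rbf', alpha=0.1, gamma=0.1),
    'svr': SVR(kernel='rbf', C=10.0, epsilon=0.01),
}

# The same folds are reused for every model so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring='r2').mean()
    for name, model in candidates.items()
}

best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X, y)  # refit the winner on all data
print(scores, best_name)
```

For the assignment, the same loop would be repeated for each of the three targets, and each candidate could be wrapped in `GridSearchCV` to tune its hyperparameters before the comparison.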

In [None]:
### BEGIN Solution

# >>> your solution here <<<

### END Solution

<br/> <!--Intentionally left blank-->

### Task 2.2 (6 pt.): Sensitivity analysis

SALib is a Python library for sensitivity analysis.

It implements several popular global sensitivity analysis methods, including:
* the Morris method, which may be thought of as a crude estimate of the average absolute value of the partial derivatives;
* Sobol indices, which show the portion of the output variance that is explained by each input.

Each method takes **x** and **y** samples as input. But the samples must be properly generated.
There are special functions in SALib library to do exactly that.

Using the **best model per target** your task is to

* calculate Sobol indices:
    * Generate **x** and **y** samples using Saltelli’s extension of the Sobol sequence
    * Calculate Sobol indices using obtained samples
* calculate screening indices
    * Generate **x** and **y** samples for Morris method
    * Apply Morris method to generated samples to obtain screening indices


* Using your judgement and based on the analysis results, choose the variables that have the most influence on the output.

**NOTE** Make sure to use the *same sample* for all three targets.

In [None]:
### BEGIN Solution

# >>> your solution here <<<

### END Solution

<br/> <!--Intentionally left blank-->

### Task 2.3 (4 pt.): Optimization

The final goal is to minimize the **mass** of the rotating disk. This will be done with a scipy optimizer applied to the approximation provided by the surrogate model. We assume that the surrogate model is of reasonable quality. The optimization problem for the full parameter space is prepared for you.

The following optimization problem should be solved:

$$
\begin{aligned}
& {\rm mass}(x) \rightarrow \min_x \\
& \mbox{subject to} \quad S_{max}(x) \le 600, \\
& \phantom{\mbox{subject to}} \quad U_2(x) \le 0.3
\end{aligned}
$$

Your tasks:

* Perform optimization by running the code below
* After performing sensitivity analysis, you have identified the most influential features. Re-estimate your models on the reduced feature space
* Change the optimization problem statement so that it uses only the selected variables
* Compare the optimal results for the two formulations and draw a conclusion

In [None]:
result = minimize(lambda x: best_models[0].predict(x.reshape(1, -1)),
                  [109.0, 32.0, 123.0, 154.0, 6.0, 198.0],
                  bounds=problem['bounds'],
                  constraints=[{'type': 'ineq',
                                'fun' : lambda x: 600 - best_models[1].predict(x.reshape(1, -1))
                               },
                               {'type': 'ineq',
                                'fun' : lambda x: 0.3 - best_models[2].predict(x.reshape(1, -1))
                               }])
print(result)
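
One possible way to set up the reduced formulation is sketched below. It is not the required solution: the functions `mass`, `smax`, and `u2` are hypothetical analytic stand-ins for the surrogates, the choice of `active` variables is arbitrary, and the non-selected variables are simply fixed at zero (the mean of the standardized data) instead of refitting the models on fewer features.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical analytic stand-ins for the three surrogate models.
def mass(x):
    return 10.0 + 2.0 * x[0] + x[1] + 0.01 * x[5] + 0.1 * np.sum(x ** 2)

def smax(x):
    return 500.0 - 80.0 * x[0] - 40.0 * x[1]

def u2(x):
    return 0.2 - 0.05 * x[0]

bounds = np.array([[-1.7321, 1.7321]] * 6)
constraints = [
    {'type': 'ineq', 'fun': lambda x: 600.0 - smax(x)},  # smax(x) <= 600
    {'type': 'ineq', 'fun': lambda x: 0.3 - u2(x)},      # u2(x) <= 0.3
]

# Full formulation: optimize over all six variables.
full = minimize(mass, np.zeros(6), bounds=bounds, constraints=constraints)

# Reduced formulation: optimize only the selected variables.
active = [0, 1]  # assumed outcome of the sensitivity analysis

def embed(z):
    """Place the reduced vector z into a full 6-dimensional point."""
    x = np.zeros(6)  # non-selected variables stay at 0
    x[active] = z
    return x

reduced = minimize(
    lambda z: mass(embed(z)),
    np.zeros(len(active)),
    bounds=bounds[active],
    constraints=[{'type': c['type'],
                  'fun': lambda z, f=c['fun']: f(embed(z))}
                 for c in constraints])

print('full   :', full.fun, full.x)
print('reduced:', reduced.fun, embed(reduced.x))
```

Comparing `full.fun` with `reduced.fun` shows how much of the attainable improvement is lost by dropping the less influential variables.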

In [None]:
### BEGIN Solution

# >>> your solution here <<<

### END Solution

<br/> <!--Intentionally left blank-->