# Nonlinear Relationships

Linear Regression can in fact model nonlinear relationships in data, provided it is given the appropriate features.
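
To make this concrete: "linear" refers to linearity in the weights, not in the input. As a sketch (the feature functions $\phi_j$ are notation introduced here for illustration, not something defined elsewhere in this practical), the model being fitted is

$$\hat{y} = w_0 + \sum_{j=1}^{m} w_j \, \phi_j(x)$$

where, for example, $\phi_1(x) = x$, $\phi_2(x) = x^2$ and $\phi_3(x) = \sin(x)$. However nonlinear each $\phi_j$ is in $x$, the prediction remains a linear function of the weights $w_0, \ldots, w_m$, so ordinary least squares still applies.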

In this practical, we use an alternative dataset which will benefit from feature engineering. We focus on univariate data so that you can familiarise yourself with the process. You are welcome to use any transformation you like to improve your model fit.

We will work with the `sklearn`, `numpy`, `pandas` and `matplotlib` libraries.

**Possible transformations**:

- Polynomial features: $x^2, x^3, x^4, \ldots$. Take a look at `sklearn.preprocessing.PolynomialFeatures` for a quick way to derive these features.
- Sinusoidal features: $\sin(x), \cos(x), \sin(0.5x), \cos(0.5x), \ldots$. 
- Any other standard functions, e.g. `log`, `exp`, `tanh`. Most familiar functions can be found within `numpy`. 
- 'Bump'-type functions, e.g. the (unnormalised) Gaussian density $f(x; c, \sigma) = \exp\left(-(x-c)^2/(2\sigma^2)\right)$. Try varying the centre $c$ and the bandwidth $\sigma$.
- Compositions of any of the above.
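
As a rough illustration of the list above, the following sketch builds a few of these feature types and stacks them into one matrix. The input `x_demo`, the helper `gaussian_bump`, and the chosen centres and bandwidth are all hypothetical; they are not tuned to the dataset in this practical.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def gaussian_bump(x, c, sigma):
    """Unnormalised Gaussian bump centred at c with bandwidth sigma (hypothetical helper)."""
    return np.exp(-(x - c) ** 2 / (2 * sigma ** 2))

# Hypothetical univariate input, shaped (n, 1) like the x arrays in this practical
x_demo = np.linspace(-30, 120, 50).reshape(-1, 1)

# Polynomial features x, x^2, x^3 (include_bias=False drops the constant column)
poly_feats = PolynomialFeatures(degree=3, include_bias=False).fit_transform(x_demo)

# Sinusoidal features at two frequencies
sin_feats = np.hstack((np.sin(x_demo), np.cos(x_demo),
                       np.sin(0.5 * x_demo), np.cos(0.5 * x_demo)))

# Gaussian bumps at a few arbitrarily chosen centres
bump_feats = np.hstack([gaussian_bump(x_demo, c, sigma=15.0) for c in (0.0, 40.0, 80.0)])

# One row per observation, one column per derived feature
features = np.hstack((poly_feats, sin_feats, bump_feats))
print(features.shape)  # (50, 10)
```

Each block of columns can be added or removed independently, which makes it easy to experiment with different combinations.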

**Additional hints**:

- You may find it useful to use `np.hstack` again in order to build up your matrix of derived features. 
- In order to debug your model fit (why have my transformations not yielded a good fit?), it may be helpful to plot the raw transformations against the target data.
- When considering how well your model may perform in practice, take a look at the fit outside the support of your data (i.e. below $\min(x)$ and above $\max(x)$).
- Have fun!
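
To illustrate the hint about looking outside the support of the data, here is a minimal sketch on hypothetical data (a noisy sine, not the dataset used in this practical). All names (`x_train`, `demo_poly`, `demo_model`, `rmse`) are made up for the example. It fits a degree-5 polynomial on $[0, 2\pi]$ and compares the error inside that interval with the error just beyond it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Hypothetical training data: a noisy sine observed only on [0, 2*pi]
x_train = rng.uniform(0, 2 * np.pi, size=(200, 1))
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.standard_normal(200)

# Degree-5 polynomial features followed by an ordinary linear regression
demo_poly = PolynomialFeatures(degree=5, include_bias=False)
demo_model = LinearRegression().fit(demo_poly.fit_transform(x_train), y_train)

def rmse(x):
    """Root-mean-squared error against the noise-free sine on the points x."""
    pred = demo_model.predict(demo_poly.transform(x))
    return np.sqrt(np.mean((pred - np.sin(x[:, 0])) ** 2))

x_inside = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)       # within the support
x_outside = np.linspace(2 * np.pi, 3 * np.pi, 100).reshape(-1, 1)  # beyond max(x)

rmse_inside = rmse(x_inside)
rmse_outside = rmse(x_outside)
print("RMSE inside support:  {:.3f}".format(rmse_inside))
print("RMSE outside support: {:.3f}".format(rmse_outside))
```

A good fit on the training range says little about behaviour beyond it; polynomial features in particular tend to diverge quickly once you leave the data.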

We start by importing the relevant libraries and defining several helper functions:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
def generate_data(n=200):
    """
    Generate data for the Linear Regression exercises. In our examples we 
    have used the default value of n=200.
    """
    np.random.seed(1000)
    
    x = (np.random.rand(n) - 0.2) * 160
    y = 1.8 * np.log(x + 36) + np.exp(x/10 - 10.5) + 0.02 * x  + np.random.randn(n) * 2
    x = x.reshape(n,1)
    
    return x, y
    
def generate_data_nonlinear(n=200):
    """
    Generate non-linear data for the Linear Regression exercises. In our examples we 
    have used the default value of n=200.
    """
    x, y = generate_data(n)
    s = float("".join([chr(47+x) for x in [1,-1,1,6]]))
    f = getattr(np, "".join([chr(92+x) for x in [23,13,18]]))
    y = f(s*np.abs(x)) + 0.25*np.random.randn(*x.shape)
    return x, y

def generate_data3d(n=200):
    """
    Generate 3D data for the Linear Regression exercises. In our examples we 
    have used the default value of n=200.
    """
    np.random.seed(3)
    
    x = (np.random.multivariate_normal(np.array([1,2]), np.array([[3, 2.6], [2.6, 3]]), size=n))
    w = np.random.randn(2)
    v = x @ w
    y = np.exp(np.sqrt(np.abs(v-np.percentile(v, 0.9)))) + 0.08 * v  + np.random.randn(n) * 0.3
    
    return x, y

We load the data and make a scatter plot to get a sense of which features might be useful.

In [None]:
# load the data
np.random.seed(1000)
x2, y2 = generate_data_nonlinear(200)

# make a scatter plot
plt.scatter(x2, y2);

**Task**:

- Fit a standard linear regression model to the data.

In [None]:
from sklearn.linear_model import LinearRegression
# Fit the model and extract the parameter vector w2
# Your code here...
model = LinearRegression().fit(x2, y2)
w2 = np.concatenate((np.ravel(model.intercept_), np.ravel(model.coef_)))


**Tasks**:

1. Plot the results of your linear model using the `plot_results` function.
2. Calculate the model's $R^2$ using `model.score`.

You may find the following plotting function useful. It takes the following arguments:

* `x`, `y`: The data for the scatter plot.
* `predicted`: The predicted values for the `x_plot` array defined below. For `sklearn`, this should be the result of `model.predict(x_plot)`.
* `color`: The colour of the fitted line (optional).
* `do_scatter`: Whether to draw the scatter plot of `x` and `y` (optional, default `True`).

In [None]:
x_plot = np.linspace(-30,120, 200).reshape(200,1)

def plot_results(x, y, predicted, color="#e1851e", do_scatter=True):
    # Optionally scatter the raw data, then overlay the fitted curve on x_plot
    if do_scatter:
        plt.scatter(x, y, alpha=.7)
    plt.plot(x_plot, predicted, color=color, linewidth=4, alpha=0.8)

In [None]:
# Your code here...
plot_results(x2, y2, model.predict(x_plot))


In [None]:
# What is the model score?
# Your result here...
print("R2 for linear fit is: {:.3f}".format(model.score(x2, y2)))
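
For reference, `score` on a `LinearRegression` returns the coefficient of determination, $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$. The sketch below checks this by hand on hypothetical data; `x_demo`, `y_demo` and `demo_model` are illustrative names and are unrelated to `x2`, `y2` and `model` above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data, used only to illustrate the formula
x_demo = rng.uniform(-30, 120, size=(100, 1))
y_demo = 0.05 * x_demo[:, 0] + rng.standard_normal(100)

demo_model = LinearRegression().fit(x_demo, y_demo)
pred = demo_model.predict(x_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("manual R2: {:.3f}".format(r2_manual))
print("score  R2: {:.3f}".format(demo_model.score(x_demo, y_demo)))
```

An $R^2$ of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean of the targets.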


### Feature extraction

**Tasks**:
1. Create a variety of features that you think may be useful (see the list given above) and concatenate them into an $X_2$ matrix using `np.hstack`. For example, using polynomial features, the desired result would be:
<br>
$$X_2 = \begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 & \ldots \\ 1 & x_2 & x_2^2 & x_2^3 & \ldots \\ \vdots & \vdots & \vdots & \vdots & \ddots \\ 1 & x_n & x_n^2 & x_n^3 & \ldots \end{bmatrix}$$
<br>
2. Fit an `sklearn` `LinearRegression` model using this matrix and with targets `y2`.

In [None]:
# Your code here...
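# The column of ones matches the X_2 matrix above; LinearRegression also fits its own intercept by default
X2 = np.hstack((np.ones_like(x2), x2, x2**2, x2**3, x2**4, x2**5))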
model_poly = LinearRegression().fit(X2, y2)


**Tasks**:

1. Apply the same transformations as above to the vector `x_plot`, and save the results to a variable `X_plot` (note the capital X).
2. Plot your model using `plot_results`. You'll need to feed in the predictions from `model.predict(X_plot)` (now on the matrix constructed in (1)).


In [None]:
# Your code here...
X_plot = np.hstack((np.ones_like(x_plot), x_plot, x_plot**2, x_plot**3, x_plot**4, x_plot**5))
plot_results(x2, y2, model_poly.predict(X_plot))
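
Repeating the transformation by hand for every new input is easy to get wrong. One possible alternative, sketched here on hypothetical data, is to wrap the feature construction and the regression in an `sklearn` pipeline so that the same transformation is applied at fit and predict time. The `StandardScaler` step is an addition not used elsewhere in this practical (it only rescales the very large high-order polynomial columns), and the names `x_demo`, `y_demo`, `demo_pipeline` and `x_grid` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)

# Hypothetical data on a similar input range to this practical
x_demo = rng.uniform(-30, 120, size=(200, 1))
y_demo = np.tanh((x_demo[:, 0] - 40) / 30) + 0.1 * rng.standard_normal(200)

# The pipeline stores the feature transformation, so fit and predict share it.
# StandardScaler (an assumption of this sketch) brings the columns to comparable scales.
demo_pipeline = make_pipeline(
    PolynomialFeatures(degree=5, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
demo_pipeline.fit(x_demo, y_demo)

# Raw inputs go in directly; no separately constructed feature matrix is needed
x_grid = np.linspace(-30, 120, 200).reshape(-1, 1)
pred_grid = demo_pipeline.predict(x_grid)
r2_demo = demo_pipeline.score(x_demo, y_demo)

print(pred_grid.shape)
print("R2: {:.3f}".format(r2_demo))
```

The trade-off is that a pipeline hides the intermediate feature matrix, so building it explicitly with `np.hstack` remains the more transparent option while you are learning the process.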


**Tasks**:

1. Find your model's $R^2$ using `model.score`.
2. Try some different features and try to improve this!

In [None]:
# Your code here...
print("R2 for nonlinear fit is: {:.3f}".format(model_poly.score(X2, y2)))
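
When trying several feature sets, it can help to score them side by side. The following is a minimal sketch on hypothetical data (a single noisy bump, not the dataset in this practical); the `candidates` dictionary, the `bumps` helper and its centres and bandwidth are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Hypothetical data with a single bump centred at x = 40
x_demo = rng.uniform(-30, 120, size=(200, 1))
y_demo = (np.exp(-(x_demo[:, 0] - 40) ** 2 / (2 * 20 ** 2))
          + 0.05 * rng.standard_normal(200))

def bumps(x, centres=(0.0, 40.0, 80.0), sigma=20.0):
    """Stack one Gaussian bump feature per centre (hypothetical helper)."""
    return np.hstack([np.exp(-(x - c) ** 2 / (2 * sigma ** 2)) for c in centres])

# Candidate feature maps, each taking an (n, 1) array to an (n, k) matrix
candidates = {
    "linear": lambda x: x,
    "cubic": lambda x: np.hstack((x, x ** 2, x ** 3)),
    "bumps": bumps,
}

# Fit one model per feature map and record its training R^2
scores = {}
for name, make_features in candidates.items():
    X_demo = make_features(x_demo)
    scores[name] = LinearRegression().fit(X_demo, y_demo).score(X_demo, y_demo)
    print("{:>6}: R2 = {:.3f}".format(name, scores[name]))
```

Note that $R^2$ on the training data cannot decrease when features are added to an existing set, so a high score alone does not show that the extra features generalise; comparing on held-out data is a fairer test.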
