In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
%matplotlib inline

# Non-linear transformations

Sometimes the output variable doesn't have a direct linear relationship with the predictor variable. The relationship may instead be quadratic, exponential, logarithmic, or a higher-order polynomial. In such cases, transforming the predictor variable comes in very handy.

The following is a rough guideline on how to spot and handle non-linear relationships:

- Plot a scatter plot of the output variable with each of the predictor variables.
- If the points fall roughly along a straight line, then the predictor variable is linearly related to the output variable.
- If the points follow a characteristic non-linear shape (for example a square-root or logarithmic curve), then transform the predictor variable by applying that function.
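
As a rough numeric companion to the visual check above, one option is to compare how strongly the output correlates with the raw predictor and with a few candidate transformations. The sketch below uses made-up synthetic data (not the `auto` dataset), and the candidate list is only an illustrative assumption. It does not replace looking at the scatter plot.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the auto dataset):
# y depends on log(x), plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(1, 100, 200)
y = 10 - 2 * np.log(x) + rng.normal(0, 0.2, 200)

# Pearson correlation of y with x and with two candidate transformations of x
candidates = {'x': x, 'sqrt(x)': np.sqrt(x), 'log(x)': np.log(x)}
correlations = {
    name: np.corrcoef(values, y)[0, 1] for name, values in candidates.items()
}

for name, r in correlations.items():
    print(f'corr(y, {name}) = {r:.3f}')
```

A transformation whose correlation is noticeably closer to -1 or 1 than that of the raw predictor is a reasonable candidate to try in a linear model.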

Let's illustrate this with an example using the `auto.csv` dataset, which contains information about miles per gallon (mpg) and horsepower for a number of car models. Here `mpg` is the output variable and `horsepower` is the predictor variable.

Let's import the dataset and take a look at the relationship between `horsepower` and `mpg`:

In [None]:
# Import `auto` as a pandas dataframe
auto = pd.read_csv('auto.csv')
auto.head()

In [None]:
# Create scatter plot to show relationship between horsepower and mpg
plt.figure(figsize=(12, 6))
plt.plot(auto['horsepower'], auto['mpg'], 'ro')
plt.xlabel('Horsepower')
plt.ylabel('MPG (Miles Per Gallon)')

plt.show()

The relationship between `horsepower` and `mpg` doesn't seem to have a linear shape.

However, for the sake of comparison, let's try and fit a linear model first – i.e. assume that the model is:

![](https://latex.codecogs.com/gif.latex?%5Ctext%7Bmpg%7D%20%3D%20%5Calpha%20+%20%5Cbeta_1*%5Ctext%7Bhorsepower%7D)

In [None]:
# Initialise and fit linear regression model
X = auto['horsepower'].to_numpy()   # NumPy array, so X[:, np.newaxis] also works on recent pandas
Y = auto['mpg']
lm = LinearRegression()
lm.fit(X[:, np.newaxis], Y)   # see note below

print(f'alpha = {lm.intercept_}')
print(f'betas = {lm.coef_}')

> Note that scikit-learn's `fit` and `predict` methods expect *X* to be a two-dimensional array of shape `(n_samples, n_features)`. Indexing with `np.newaxis` adds the second dimension that a single predictor is otherwise missing.
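
To make the note above concrete, here is a minimal sketch of what `np.newaxis` does to the shape of an array. The three horsepower values are made up for illustration.

```python
import numpy as np

x = np.array([130.0, 165.0, 150.0])   # hypothetical horsepower values
x_2d = x[:, np.newaxis]               # add a second axis of length 1

print(x.shape)      # (3,)   one-dimensional
print(x_2d.shape)   # (3, 1) three samples, one feature

# reshape(-1, 1) is an equivalent way to get the same column layout
same = np.array_equal(x_2d, x.reshape(-1, 1))
print(same)         # True
```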

The line of best fit can be plotted by the following snippet:

In [None]:
# Create scatter plot to show relationship between horsepower and mpg and predictions
plt.figure(figsize=(12, 6))
plt.plot(auto['horsepower'], auto['mpg'], 'ro')
plt.plot(X, lm.predict(X[:, np.newaxis]))
plt.xlabel('Horsepower')
plt.ylabel('MPG (Miles Per Gallon)')

plt.show()

This model is not a very good fit: its *R<sup>2</sup>* is 0.6059 and its RSE is 4.9058 (20.92% error). See if you can figure out how to calculate these!

In [None]:
# Calculate R2 score

# Calculate RSE and error


From the scatter plot, it looks like the relationship between `horsepower` and `mpg` could be linearised by taking the square root of `horsepower`. If we assume a model of the form ![](https://latex.codecogs.com/gif.latex?%5Ctext%7Bmpg%7D%20%3D%20%5Calpha%20+%20%5Cbeta_1*%5Csqrt%5Ctext%7Bhorsepower%7D), the fit may improve. Let’s give it a go!

In [None]:
# Transform X by taking the square root
X2 = np.sqrt(auto['horsepower'].to_numpy())   # NumPy array, so X2[:, np.newaxis] also works on recent pandas
Y2 = auto['mpg']
lm2 = LinearRegression()
lm2.fit(X2[:, np.newaxis], Y2)

print(f'alpha = {lm2.intercept_}')
print(f'beta = {lm2.coef_}')

In [None]:
# See if you can work out the R2, RSE, and error on your own!
# Calculate R2 score

# Calculate RSE and error


The *R<sup>2</sup>* value for this model comes out to be around 0.6437, and the RSE is 4.6648 (19.90% error) – a slight improvement!

Let’s plot our model to see how transforming `horsepower` has helped our prediction:

In [None]:
# Create scatter plot to show relationship between sqrt(horsepower) and mpg, with predictions
ypred = lm2.predict(X2[:, np.newaxis])

plt.figure(figsize=(12, 6))
plt.plot(X2, Y2, 'ro')
plt.plot(X2, ypred)
plt.xlabel('Square Root of Horsepower')
plt.ylabel('MPG (Miles Per Gallon)')

plt.show()

There is still room for improvement. The scatter plot shows that there is still a curve in our data. We may be able to improve our model by taking the log of `horsepower`, i.e. assuming an equation:

![](https://latex.codecogs.com/gif.latex?%5Ctext%7Bmpg%7D%20%3D%20%5Calpha%20+%20%5Cbeta_1*%5Clog%28%5Ctext%7Bhorsepower%7D%29)

See if you can perform this **log-transform** on your own! 

In [None]:
# Transform X by taking the log - note: to take log, you can use np.log()

# Initialise and fit a linear regression model

# Print alpha and beta

# Calculate R2 score

# Calculate RSE and error


In [None]:
# Plot new prediction against log-transformed relationship


You should have found an *R<sup>2</sup>* of 0.6683 and an RSE of 4.5007 (19.19% error), and your plot should look like the following:

![](https://github.com/nextdotxyz/linear-regression-with-python/blob/master/00%20Notebooks/3.4%20log.png?raw=true)

Finally, we can try a model with a **polynomial fit** using `scikit-learn`'s `PolynomialFeatures` transformer. This lets us **power-transform** X up to a specified degree. We will power-transform `horsepower` to the second degree, i.e. assume a model that can be written as:

![](https://latex.codecogs.com/gif.latex?%5Ctext%7Bmpg%7D%20%3D%20%5Calpha%20+%20%5Cbeta_1*%5Ctext%7Bhorsepower%7D%20+%20%5Cbeta_2*%5Ctext%7Bhorsepower%7D%5E2)
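
Before fitting on the real data, it may help to see what the power-transform produces. The sketch below uses a tiny made-up, noise-free dataset (an assumption for illustration) generated from `y = 1 + 2x + 3x^2`. With the default settings, `PolynomialFeatures(degree=2)` outputs a constant column, `x`, and `x^2`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up, noise-free data: y = 1 + 2x + 3x^2
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 2.0 * x + 3.0 * x**2

# Each row becomes [1, x, x^2]
features = PolynomialFeatures(degree=2).fit_transform(x[:, np.newaxis])
print(features[:2])   # rows for x = 1 and x = 2

# The regression is still linear in the transformed columns
model = LinearRegression().fit(features, y)
print(f'alpha = {model.intercept_}')
print(f'betas = {model.coef_}')
```

Because `LinearRegression` fits its own intercept, the coefficient on the constant column comes out as zero. The remaining two coefficients correspond to `x` and `x^2`.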

In [None]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2)         # set number of degrees to 2
X4 = poly.fit_transform(X[:, np.newaxis])   # transform X
Y4 = auto['mpg']                            # Y remains un-transformed

# Initialise and fit new model
lm4 = LinearRegression()
lm4.fit(X4, Y4)

# Print parameters
print(f'alpha = {lm4.intercept_}')
print(f'beta = {lm4.coef_}')

This gives us the model:

![](https://latex.codecogs.com/gif.latex?%5Ctext%7Bmpg%7D%20%3D%2056.9%20-0.4662%20*%5Ctext%7Bhorsepower%7D%20+%200.001*%5Ctext%7Bhorsepower%7D%5E2)

Of the models tried so far, this one has the highest *R<sup>2</sup>* and the lowest RSE and error percentage!

In [None]:
# R-squared
print(f'R2 = {lm4.score(X4, Y4)}')

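# RSE & Error
SSD = (Y4 - lm4.predict(X4))**2                     # squared residuals
dof = len(Y4) - 2 - 1                               # n - p - 1 with p = 2 predictors (389 for this dataset)
RSE = np.sqrt(np.sum(SSD) / dof)
ymean = np.mean(Y4)
error = RSE / ymean                                 # RSE as a fraction of the mean mpg
print(f'RSE = {RSE}\nError = {error:.2%}')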
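
The same calculation can be wrapped in a small helper so that the degrees of freedom are derived from the data rather than typed in. This is a sketch only: the function name `rse_and_error` and the toy values are made up for illustration. It uses the standard definition of RSE, the square root of the sum of squared residuals divided by `n - p - 1`.

```python
import numpy as np

def rse_and_error(y_true, y_pred, n_predictors):
    """Return (RSE, RSE / mean(y_true)) for a fitted regression model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ssd = np.sum((y_true - y_pred) ** 2)        # sum of squared residuals
    dof = y_true.size - n_predictors - 1        # residual degrees of freedom
    rse = np.sqrt(ssd / dof)
    return rse, rse / y_true.mean()

# Toy values, made up for illustration
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.5, 1.5, 3.0, 4.5, 4.5]
rse, error = rse_and_error(y_true, y_pred, n_predictors=1)
print(f'RSE = {rse:.4f}, Error = {error:.2%}')
```

For the polynomial model above, the call would be `rse_and_error(Y4, lm4.predict(X4), n_predictors=2)`.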

Let’s try plotting this to see how our prediction looks:

In [None]:
ypred = lm4.predict(X4)                    # store predictions
newX, newY = zip(*sorted(zip(X, ypred)))   # sort values for plotting

# Plot polynomial regression against original data
plt.figure(figsize=(12, 6))
plt.plot(X, Y, 'ro')
plt.plot(newX, newY)
plt.xlabel('Horsepower')
plt.ylabel('MPG (Miles Per Gallon)')

plt.show()

Looks pretty good!

You can try to improve the model further by changing the degree of the polynomial, i.e. by changing `n` in `poly = PolynomialFeatures(degree=n)`. Keep in mind, however, that a higher degree may fit the training data better but generalise poorly to other data (remember overfitting?).
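
One way to see this trade-off is to hold some data back and compare the error on the training points with the error on the held-out points. The sketch below uses synthetic data and `np.polyfit` rather than the `auto` dataset and `PolynomialFeatures`. The data, the split, and the chosen degrees are all assumptions for illustration.

```python
import numpy as np

# Synthetic data (made up): a quadratic trend plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = 1.0 - 2.0 * x + 1.5 * x**2 + rng.normal(0, 0.3, 60)

# Simple split: first 40 points for training, last 20 held out
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 2, 8):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse[degree] = mse(coeffs, x_train, y_train)
    test_mse[degree] = mse(coeffs, x_test, y_test)
    print(f'degree {degree}: train MSE = {train_mse[degree]:.4f}, '
          f'held-out MSE = {test_mse[degree]:.4f}')
```

The training error can only go down as the degree increases, so it is the held-out error that tells you whether the extra flexibility is actually helping.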

In [None]:
# Try to change the polynomial degree


In this lesson we've covered several issues and considerations that come up when implementing linear regression models! To wrap up, reopen the instructions panel on the left, then press Next Step.