# **Spline Regression**

Spline regression is a non-linear regression technique that aims to overcome the limitations of linear and polynomial regression.

In spline regression, the range of the predictor is divided into bins, and a separate model is fitted to the data in each bin. The points where the data is divided are called knots. Because a different function is fitted in each bin, the overall model is a piecewise function; when the pieces are polynomials, it is a piecewise polynomial.
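To make the idea of knots and piecewise polynomials concrete, here is a minimal sketch that builds a cubic spline by hand using a truncated power basis. The helper name `truncated_power_basis` and the synthetic data are assumptions made for illustration; this is a simpler basis than the B-splines used later in this notebook, not the same implementation.

```python
import numpy as np

def truncated_power_basis(x, knots, degree=3):
    """Columns 1, x, ..., x^degree, then max(x - k, 0)^degree for each knot k."""
    x = np.asarray(x, dtype=float)
    polynomial_terms = np.vander(x, degree + 1, increasing=True)
    knot_terms = [np.maximum(x - k, 0.0) ** degree for k in knots]
    return np.column_stack([polynomial_terms, *knot_terms])

# Synthetic data (an assumption for illustration, not the housing data).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 100, size=200))
y = np.sin(x / 15.0) + rng.normal(scale=0.2, size=x.size)

knots = (20, 30, 40, 50)
basis = truncated_power_basis(x, knots)

# Ordinary least squares fit of the spline coefficients.
coef, *_ = np.linalg.lstsq(basis, y, rcond=None)

# The pieces join smoothly: the fitted values just left and right of a knot agree.
eps = 1e-6
left, right = truncated_power_basis([30 - eps, 30 + eps], knots) @ coef
print(basis.shape)
print(abs(left - right) < 1e-3)
```

Each knot adds one extra column to the basis, which is what lets the cubic change shape from one bin to the next while staying continuous at the knot.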

To read more about it, please refer to this [story](https://analyticsindiamag.com/hands-on-guide-to-spline-regression/).

## **Implementation**

We will implement polynomial spline regression on the Boston housing dataset. This dataset is most commonly used for linear regression, but we will fit cubic splines to it. It contains house prices in Boston, and its features are factors that affect those prices.

We will first install the required packages and restart the kernel, and then load the dataset.

In [None]:
!python -m pip install pip --upgrade --user -q 
!python -m pip install numpy pandas seaborn matplotlib scipy statsmodels patsy scikit-learn --user -q

In [None]:
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

In [None]:
import pandas as pd
from patsy import dmatrix
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
dataset = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
dataset

Let us now plot age against the house prices, which are stored in the `medv` column, and see how the data looks.

In [None]:
plt.scatter(dataset['age'], dataset['medv'])

Clearly, the relationship between these points is not linear, so we will use spline regression as follows:

### **Cubic and natural splines**

In [None]:
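# Cubic B-spline basis (degree 3 by default) with interior knots at ages 20, 30, 40 and 50
spline_cube = dmatrix('bs(x, knots=(20,30,40,50))', {'x': dataset['age']})
spline_fit = sm.GLM(dataset['medv'], spline_cube).fit()
# Natural cubic spline basis with the same interior knots
natural_spline = dmatrix('cr(x, knots=(20,30,40,50))', {'x': dataset['age']})
spline_natural = sm.GLM(dataset['medv'], natural_spline).fit()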
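The design matrices built above keep their spline specification in a `design_info` attribute, which patsy can reuse to build the same basis for new values. The sketch below illustrates this with `patsy.build_design_matrices` on synthetic ages; the data and the variable names `train_design`, `new_age` and `new_design` are assumptions for illustration only.

```python
import numpy as np
from patsy import dmatrix, build_design_matrices

# Synthetic stand-in for the age column (assumption: values between 0 and 100).
rng = np.random.default_rng(0)
age = rng.uniform(0, 100, size=200)

# Same cubic B-spline formula as in the cell above.
train_design = dmatrix('bs(x, knots=(20,30,40,50))', {'x': age})

# New ages that lie inside the observed range.
new_age = np.linspace(age.min(), age.max(), 25)

# Rebuild the basis for the new values from the stored design_info,
# so the knots and boundary values come from the original data.
new_design = build_design_matrices([train_design.design_info], {'x': new_age})[0]

print(np.asarray(train_design).shape)  # intercept plus the spline basis columns
print(np.asarray(new_design).shape)
```

Reusing `design_info` avoids repeating the formula by hand and keeps the basis tied to the data it was originally built from.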

### **Creating linspaces**

Next, we will use `np.linspace` to create 50 evenly spaced age values between the minimum and maximum ages in the dataset. We will then use these values to generate predictions from the models fitted above.

In [None]:
# Evenly spaced ages covering the observed range (named age_grid to avoid shadowing the built-in range)
age_grid = np.linspace(dataset['age'].min(), dataset['age'].max(), 50)
cubic_line = spline_fit.predict(dmatrix('bs(x, knots=(20,30,40,50))', {'x': age_grid}))
line_natural = spline_natural.predict(dmatrix('cr(x, knots=(20,30,40,50))', {'x': age_grid}))

### **Plot the graph**

Finally, with the predictions made, we can plot the spline regression curves and check how the models fit each bin.

In [None]:
plt.plot(age_grid, cubic_line, color='r', label='Cubic spline')
plt.plot(age_grid, line_natural, color='g', label='Natural spline')
plt.legend()
plt.scatter(dataset['age'], dataset['medv'])
plt.xlabel('age')
plt.ylabel('medv')
plt.show()

As you can see, the two curves differ slightly around the knots at 20 and 30, and they also fit differently around the knots at 40 and 50. This is because a separate polynomial piece is fitted in each bin, and the cubic and natural splines use different bases. Overall, both curves follow the bulk of the points.