# Linear regression

Let's start with an easy example that you are likely familiar with: finding the best-fit line.

We'll also introduce a key Python library for machine learning: scikit-learn.

<img src="data-sci-images/scikit-learn.png" width=500>

https://scikit-learn.org/stable/index.html
<br>
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

We'll start by making an artificial data set: points on a straight line with random noise added.

In [None]:
x = np.arange(50)

In [None]:
x

In [None]:
# np.random.normal(mean, sigma, number of elements)
np.random.normal(0, 3, 100)
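
Because the noise is random, every run of the notebook gives slightly different numbers. If reproducible results are wanted, one option is a seeded generator. This is a minimal sketch: the seed value `0` and the names `rng`, `rng_again`, and `noise` are assumptions made for illustration, not part of the notebook above.

```python
import numpy as np

# A seeded generator produces the same "random" numbers on every run.
# The seed value 0 is an arbitrary choice.
rng = np.random.default_rng(0)

# Same meaning as np.random.normal(mean, sigma, number of elements)
noise = rng.normal(0, 3, 100)

# A second generator with the same seed reproduces the same draws
rng_again = np.random.default_rng(0)
noise_again = rng_again.normal(0, 3, 100)

print(noise.shape)
print(np.array_equal(noise, noise_again))
```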

In [None]:
y = 5 + 0.5 * x + np.random.normal(0, 5, 50)

In [None]:
plt.plot(x,y,'ro')

`LinearRegression.fit` expects the input features as a 2D array-like of shape `(n_samples, n_features)`, for example a 2D numpy array or a dataframe. Our `x` is a 1D array, so we convert it to a one-column dataframe.

In [None]:
lindf = pd.DataFrame(x)
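
A one-column 2D numpy array should work just as well as a dataframe here. The sketch below uses noise-free data so that the fitted values are easy to check. The names `X`, `y_exact`, and `reg_exact` are introduced only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(50)        # 1D array, shape (50,)
X = x.reshape(-1, 1)     # 2D array, shape (50, 1): 50 samples, 1 feature

# Noise-free line, so the fit should recover slope 0.5 and intercept 5
y_exact = 5 + 0.5 * x

reg_exact = LinearRegression().fit(X, y_exact)
print(x.shape, X.shape)
print(reg_exact.coef_, reg_exact.intercept_)
```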

In [None]:
reg = LinearRegression().fit(lindf, y)

In [None]:
reg.coef_

In [None]:
reg.intercept_

In [None]:
# model values at the training x; reg.coef_ is a length-1 array that broadcasts over x
ytrain = reg.intercept_ + reg.coef_ * x

In [None]:
plt.plot(x,y,'ro',x,ytrain,'b-');

In [None]:
# score() returns the coefficient of determination, R^2
reg.score(lindf, y)

In [None]:
mean_squared_error(y, ytrain)

In [None]:
r2_score(y, ytrain)
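
To make these metrics less of a black box, here is a minimal sketch that recomputes the mean squared error and $R^2$ by hand and compares them with the scikit-learn values. It regenerates its own data with a seeded generator so that it runs on its own. The seed and the names `rng`, `X`, `yfit`, `mse_manual`, and `r2_manual` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
x = np.arange(50)
y = 5 + 0.5 * x + rng.normal(0, 5, 50)

X = x.reshape(-1, 1)
reg = LinearRegression().fit(X, y)
yfit = reg.predict(X)            # model values at the training x

residuals = y - yfit

# Mean squared error: the average of the squared residuals
mse_manual = np.mean(residuals ** 2)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y, yfit))
print(r2_manual, r2_score(y, yfit), reg.score(X, y))
```

The two numbers on the first printed line should agree, as should the three on the second: `reg.score` and `r2_score` both report the same $R^2$.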

What we really want is to use the trained model to predict y values for new x values.

In [None]:
x2 = np.arange(1,50,5)

In [None]:
lindf2 = pd.DataFrame(x2)

In [None]:
ytest = 5 + 0.5 * x2 + np.random.normal(0, 5, len(x2))

In [None]:
plt.plot(x,ytrain,'b-',x2,ytest,'g^');

In [None]:
ypred = reg.predict(lindf2)

In [None]:
plt.plot(x2,ytest,'g^',x2,ypred,'g-');

In [None]:
# conventional argument order is (y_true, y_pred); the MSE itself is symmetric
mean_squared_error(ytest, ypred)

In [None]:
# root-mean-square error, in the same units as y
np.sqrt(mean_squared_error(ytest, ypred))

The workflow above illustrates the main steps involved in using a machine learning algorithm:

* load the data
  * here it was artificially generated data -- $y = 5 + 0.5x + \text{noise}$
* choose a model
  * here linear regression
* train the model
  * minimize the cost function (here the residual sum of squares between the observed values and the model values)
* test the model
  * on new data (i.e. data that was not used for training)
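
The training step in the list above can be sketched directly: ordinary least squares picks the intercept and slope that minimize the residual sum of squares. The sketch below solves that problem with `np.linalg.lstsq` and compares the result with `LinearRegression`. The seed, the helper `rss`, and the names `A`, `intercept`, and `slope` are assumptions made for illustration. This shows the idea only and is not a description of how scikit-learn is implemented internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
x = np.arange(50)
y = 5 + 0.5 * x + rng.normal(0, 5, 50)

def rss(b0, b1):
    """Residual sum of squares for the line y = b0 + b1 * x."""
    return np.sum((y - (b0 + b1 * x)) ** 2)

# Design matrix: a column of ones (intercept) and a column of x (slope)
A = np.column_stack([np.ones(len(x)), x])

# Least-squares solution of A @ [b0, b1] = y
coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = coeffs

reg = LinearRegression().fit(x.reshape(-1, 1), y)

print(intercept, reg.intercept_)
print(slope, reg.coef_[0])
# The fitted line has an RSS no larger than that of the true line (5, 0.5)
print(rss(intercept, slope), rss(5, 0.5))
```

Note that the fitted line has a residual sum of squares no larger than the true generating line: the fit minimizes the cost on this particular noisy sample, which is why testing on new data matters.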