# Ordinary Least Squares



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

We will start with the most familiar linear regression, a straight-line fit to data. A straight-line fit is a model of the form <br>
$$\hat{y} = a \cdot x + b$$ <br>
where $a$ is the slope and $b$ is the intercept.

Consider the following data, which is scattered about a line with a slope of **2** and an intercept of **-5**:

In [None]:
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)

Show x and y as a table:

In [None]:
pd.DataFrame({"X":x,"y":y})

As discussed, a is the slope of the line. For a least-squares fit, it equals the covariance of x and y divided by the variance of x:

$$a = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$$
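
This ratio follows from minimizing the sum of squared residuals. A short derivation sketch, where $n$ is the number of points and $\bar{x}$, $\bar{y}$ are the sample means (symbols introduced here for illustration):

$$S(a, b) = \sum_{i=1}^{n} (y_i - a \cdot x_i - b)^2$$

Setting $\partial S / \partial b = 0$ gives

$$b = \bar{y} - a \cdot \bar{x}$$

and substituting this into $\partial S / \partial a = 0$ gives

$$a = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$$

The normalization ($1/n$ or $1/(n-1)$) cancels in the ratio, provided the covariance and the variance use the same one.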

In [None]:
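# Sample covariance of x and y, and sample variance of x.
# np.cov divides by n - 1 by default, so np.var needs ddof=1 to match.
covariance = np.cov(x, y)[0, 1]
variance = np.var(x, ddof=1)

In [None]:
a = covariance / variance

In [None]:
a

As you can see above, the slope is close to the true value **2**.

#### Calculate the intercept

$$b = \mathrm{mean}(y) - a \cdot \mathrm{mean}(x)$$

In [None]:
b = np.mean(y) - a * np.mean(x)
b

The intercept is close to the true value **-5**. The fitted equation is approximately $\hat{y} = 2.03 \cdot x - 5.00$.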

To predict a new value of y, substitute the x value into the equation. For example, to predict y when x = 0.23, place 0.23 in the equation above:

In [None]:
a *0.23 + b
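
As a cross-check on the manual calculation, the sketch below recomputes the slope and intercept from the closed-form formulas and compares them with `np.polyfit` at degree 1. It regenerates the same synthetic data so it can run on its own. The names `slope_pf`, `intercept_pf`, `y_manual`, and `y_polyfit` are introduced here for illustration, and `ddof=1` is passed to `np.var` so that its normalization matches the default of `np.cov`.

```python
import numpy as np

# Recreate the same synthetic data used in this notebook
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)

# Closed-form estimates; ddof=1 keeps np.var consistent with np.cov
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
intercept = np.mean(y) - slope * np.mean(x)

# np.polyfit with degree 1 returns the coefficients as [slope, intercept]
slope_pf, intercept_pf = np.polyfit(x, y, 1)

# Predict y at a new x with both sets of coefficients
x_new = 0.23
y_manual = slope * x_new + intercept
y_polyfit = np.polyval([slope_pf, intercept_pf], x_new)

print("closed form:", slope, intercept)
print("np.polyfit: ", slope_pf, intercept_pf)
print("prediction: ", y_manual, y_polyfit)
```

Both routes should agree up to floating-point rounding, since they solve the same least-squares problem.
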

## The same formula has been implemented in multiple packages like scikit-learn, scipy and statsmodels

### OLS using the scikit-learn package on the same data set

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()

In [None]:
# scikit-learn expects a 2-D feature matrix; the target can stay 1-D
lr.fit(x.reshape(-1, 1), y)

In [None]:
print("Model slope:    ", lr.coef_[0])
print("Model intercept:", lr.intercept_)

As you can see above, the fitted slope and intercept are close to the true values **2** and **-5**.

To predict y for new values of x, we can use the `predict` method:

In [None]:
lr.predict(np.array([0.23]).reshape(-1,1))

### OLS using the statsmodels package

In [None]:
import statsmodels.api as sm

In [None]:
X = sm.add_constant(x)
model = sm.OLS(y,X).fit()

In [None]:
model.summary()

As you can see, the coef values for x1 and const are close to the true values **2** and **-5**, respectively.

### OLS using the SciPy package

In [None]:
from scipy.linalg import lstsq

In [None]:
p, res, rnk, s = lstsq(x[:, np.newaxis]**[0,1],y)

In [None]:
p

As you can see, `p` holds the intercept and the slope, in that order, and both are close to the true values **-5** and **2**.
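
The expression `x[:, np.newaxis]**[0,1]` passed to `lstsq` can look cryptic. The sketch below unpacks it: broadcasting a column of x values against the powers `[0, 1]` gives a two-column design matrix whose first column is all ones (for the intercept) and whose second column is x (for the slope). It uses NumPy's own `np.linalg.lstsq` rather than the SciPy function so that it has no dependency beyond NumPy. The names `A` and `A_explicit` are introduced here for illustration.

```python
import numpy as np

# Recreate the same synthetic data used in this notebook
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)

# x[:, np.newaxis] has shape (50, 1); raising it to the powers [0, 1]
# broadcasts to shape (50, 2) with columns x**0 (all ones) and x**1
A = x[:, np.newaxis] ** [0, 1]

# The same matrix built explicitly, column by column
A_explicit = np.column_stack([np.ones_like(x), x])

# Solve min ||A @ coef - y||^2; rcond=None selects NumPy's current default cutoff
coef, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = coef  # coefficient order follows the column order of A

print("design matrix shape:", A.shape)
print("intercept:", intercept, "slope:", slope)
```

Because the column of ones comes first, the intercept is the first coefficient and the slope the second.
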

#### Limitations of the Ordinary Least Squares Technique
* Impacted by Outliers
* Non-linearities 
* Too many independent variables
* Multicollinearity 
* Heteroskedasticity
* Noise in the Independent Variables
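
To illustrate the first limitation, the sketch below refits the same synthetic data after corrupting a single point. The corruption (adding 100 to the y value at the largest x) is a hypothetical choice made here only for demonstration, and `ols_fit` is a helper name introduced for this example.

```python
import numpy as np

# Recreate the same synthetic data used in this notebook
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)


def ols_fit(x, y):
    """Return (slope, intercept) from the closed-form OLS formulas."""
    slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    intercept = np.mean(y) - slope * np.mean(x)
    return slope, intercept


slope_clean, intercept_clean = ols_fit(x, y)

# Hypothetical outlier: shift the y value at the largest x by +100
y_outlier = y.copy()
y_outlier[np.argmax(x)] += 100.0
slope_outlier, intercept_outlier = ols_fit(x, y_outlier)

print("clean fit:       ", slope_clean, intercept_clean)
print("with one outlier:", slope_outlier, intercept_outlier)
```

A single corrupted point out of 50 is enough to pull the fitted slope well away from 2, because squaring the residuals gives large errors a disproportionate weight.
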