In [1]:
import numpy as np
import pandas as pd

# Lecture 2 - 3

## Simple Linear Regression
- Examines the linear relationship between two variables; the relationship can be positive or negative
- $ y = \beta_0X + \alpha $, where $\beta_0$ is the slope and $\alpha$ is the intercept
- In linear regression, we predict a continuous variable
- The line of best fit is the one that minimizes the sum of squared errors (residuals) between the observed and predicted $y$
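
As a minimal sketch of the points above, the slope and intercept of the best-fit line can be computed in closed form for a single regressor. The data here is synthetic (an assumption for illustration, not from the lecture), and the names `slope`, `intercept`, and `sse` are hypothetical.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2x + 1 plus small noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Closed-form least-squares estimates for a single regressor
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Residuals are the errors whose sum of squares the fit minimizes
residuals = y - (slope * x + intercept)
sse = np.sum(residuals ** 2)
print(slope, intercept, sse)
```

The estimates should land close to the slope and intercept used to generate the data, and they agree with what `np.polyfit(x, y, 1)` returns.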


In [10]:
import statsmodels.api as sm
from sklearn import datasets

In [15]:
data = datasets.load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)
target_respone = pd.DataFrame(data.target, columns=['MEDV'])
# The target y is MEDV (median home value); the remaining columns are the regressors,
# because we are trying to predict median house prices in Boston.
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2.


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np

        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_ho

In [16]:
X = df["RM"]
y = target_respone["MEDV"]

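# Note the argument order: statsmodels takes (y, X), whereas scikit-learn's fit takes (X, y)
# No constant column is added here, so the model is fit without an intercept
model = sm.OLS(y, X).fit()
predictions = model.predict(X)  # in-sample predictions from the fitted model
# Display the summary statistics
model.summary()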

| | | | |
|---|---|---|---|
| Dep. Variable: | MEDV | R-squared (uncentered): | 0.901 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.901 |
| Method: | Least Squares | F-statistic: | 4615.0 |
| Date: | Thu, 29 Sep 2022 | Prob (F-statistic): | 3.74e-256 |
| Time: | 15:59:50 | Log-Likelihood: | -1747.1 |
| No. Observations: | 506 | AIC: | 3496.0 |
| Df Residuals: | 505 | BIC: | 3500.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| RM | 3.6534 | 0.054 | 67.930 | 0.000 | 3.548 | 3.759 |

| | | | |
|---|---|---|---|
| Omnibus: | 83.295 | Durbin-Watson: | 0.493 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 152.507 |
| Skew: | 0.955 | Prob(JB): | 7.65e-34 |
| Kurtosis: | 4.894 | Cond. No. | 1.0 |


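Above, OLS fits the line by minimising the sum of squared distances (residuals) between the observations and the regression line.
- Because no constant column was passed to `sm.OLS`, the model has no intercept, which is why the summary reports an uncentered R-squared.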
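
The gap between centered and uncentered R-squared can be sketched with numpy alone. This is an illustration on synthetic data (an assumption, not the Boston data), and the variable names are hypothetical. It fits a line through the origin and scores the same residuals both ways.

```python
import numpy as np

# Synthetic data (an assumption): a large intercept and a modest slope
rng = np.random.default_rng(1)
x = rng.uniform(4, 9, size=200)
y = 20.0 + 1.0 * x + rng.normal(scale=2.0, size=x.size)

# Fit through the origin (no intercept term)
slope_origin = np.sum(x * y) / np.sum(x * x)
resid_origin = y - slope_origin * x

# Uncentered R^2 compares the residuals with sum(y^2)
r2_uncentered = 1.0 - np.sum(resid_origin ** 2) / np.sum(y ** 2)

# Centered R^2 compares the same residuals with the variation of y around its mean
r2_centered = 1.0 - np.sum(resid_origin ** 2) / np.sum((y - y.mean()) ** 2)

print(r2_uncentered, r2_centered)
```

The uncentered value can look high even when the no-intercept line explains little of the variation around the mean, so it should not be compared directly with the R-squared of a model that includes an intercept. In statsmodels, an intercept is typically included by passing `sm.add_constant(X)` instead of `X`.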

In [20]:
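from sklearn import linear_model
lm = linear_model.LinearRegression()
# Fit on all columns of df, so this is a multiple regression with an intercept
model = lm.fit(df, target_respone['MEDV'])
predictions = lm.predict(df)
print(predictions[0:5])  # first five in-sample predictions
print(lm.score(df, target_respone['MEDV']))  # R^2 on the training data
print(lm.coef_)  # one coefficient per column of df
print(lm.intercept_)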

[30.00384338 25.02556238 30.56759672 28.60703649 27.94352423]
0.7406426641094095
[-1.08011358e-01  4.64204584e-02  2.05586264e-02  2.68673382e+00
 -1.77666112e+01  3.80986521e+00  6.92224640e-04 -1.47556685e+00
  3.06049479e-01 -1.23345939e-02 -9.52747232e-01  9.31168327e-03
 -5.24758378e-01]
36.45948838509016


## Multiple Linear Regression

## Ordinary Least Squares
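- Linear Model:
  - $ y = \beta^TX + \alpha $, with weight vector $\beta$ and intercept $\alpha$
  - Without loss of generality, the intercept can be absorbed into the weights. The notation changes here: $a$ is the feature vector of one observation, $x$ the weight vector, and $\beta$ the intercept, so $ y = \begin{bmatrix} a^T & 1 \end{bmatrix} \begin{bmatrix} x \\ \beta \end{bmatrix} $
  - Stacking all $n$ observations row by row: $ \hat{A} = \begin{bmatrix} a_1^T & 1 \\ \vdots & \vdots \\ a_n^T & 1 \end{bmatrix} $
  
  - $\hat{x} = \begin{bmatrix} x \\ \beta \end{bmatrix} $
- Opt: $ \min_{\hat{x}} \parallel \hat{A}\hat{x} - y \parallel_2^2 $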
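
A minimal sketch of the augmented formulation above, assuming synthetic data: a column of ones is appended to the data matrix so the intercept becomes one more entry of the unknown vector, and the least-squares problem is solved with `np.linalg.lstsq`. The values `x_true` and `beta_true` are hypothetical and used only to generate the data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 3
A = rng.normal(size=(n, d))             # rows are a_i^T
x_true = np.array([1.5, -2.0, 0.5])     # hypothetical weights
beta_true = 4.0                         # hypothetical intercept
y = A @ x_true + beta_true + rng.normal(scale=0.1, size=n)

# Append a column of ones so the intercept becomes one more weight
A_hat = np.hstack([A, np.ones((n, 1))])

# Solve min || A_hat x_hat - y ||_2 by least squares
x_hat, *_ = np.linalg.lstsq(A_hat, y, rcond=None)
weights, intercept = x_hat[:d], x_hat[d]
print(weights, intercept)
```

The last entry of `x_hat` plays the role of $\beta$, and the first `d` entries play the role of $x$.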
  
### Gauss Markov
> The OLS estimator has the lowest sampling variance in the class of linear unbiased estimators; in other words, OLS is BLUE, the best linear unbiased estimator (Gauss-Markov).
- In the real world, the model includes noise: $ y = Ax + \epsilon $, or row by row $ y_i = a_i^T x + \epsilon_i $
- In particular, we assume that the noise has mean zero and finite, constant variance, and is uncorrelated across observations: $ E[\epsilon_i] = 0 ; Var(\epsilon_i) = \sigma^2 $
- We are interested in $\hat{x}$, the solution to the least-squares problem above. It is itself a random variable, because $y$ is a random variable.
- In particular, we are only interested in the class of linear estimators, that is, estimators that are linear in the observations:
  $ \hat{x} = \sum^n_{i=1}c_iy_i$
- That the estimator is unbiased means that: $ E[\hat{x}] = x $, that is, $ E[\hat{x}_j - x_j] = 0 $ for each component $j$ of $x$
- The Gauss-Markov theorem states that, when $A^TA$ is invertible, the following estimator is unbiased and has the lowest variance among all linear unbiased estimators:
  - $ \hat{x} = (A^TA)^{-1} A^Ty$
- In other words, no other linear unbiased estimator has lower variance than the one above
- If we allow the estimator to be biased, then we can further reduce the variance
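
The Gauss-Markov claim can be checked numerically in a small sketch. It compares the exact covariance of the OLS estimator with that of another linear unbiased estimator. The competitor, weighted least squares with arbitrary positive weights, is a hypothetical choice made only for comparison, and the data matrix is synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 60, 2
A = rng.normal(size=(n, d))
sigma2 = 1.0  # assumed noise variance

# OLS: x_hat = (A^T A)^{-1} A^T y, so Cov(x_hat) = sigma^2 (A^T A)^{-1}
cov_ols = sigma2 * np.linalg.inv(A.T @ A)

# Competing linear estimator x_tilde = C y, with C = (A^T W A)^{-1} A^T W
# and W a diagonal matrix of arbitrary positive weights (hypothetical)
w = rng.uniform(0.2, 5.0, size=n)
C = np.linalg.inv(A.T @ (w[:, None] * A)) @ (A.T * w)

# x_tilde is unbiased because C A = I; its covariance is sigma^2 C C^T
cov_wls = sigma2 * C @ C.T

# Gauss-Markov: cov_wls - cov_ols should be positive semidefinite
gap_eigenvalues = np.linalg.eigvalsh(cov_wls - cov_ols)
print(gap_eigenvalues)
```

All eigenvalues of the difference should be non-negative up to rounding, meaning the competitor's variance is at least that of OLS in every direction.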
  

### Multicollinearity and Dummy Variables
