## [1] Simple Linear Regression

**Simple linear regression** is the special case of linear regression in which there is only one explanatory variable $X$. In this case, the relationship between $X$ and the response $Y$ can be written as:
$$
Y = \beta_0 + \beta_1X + \epsilon
$$

- $\beta_0$ and $\beta_1$ are two **unknown** constants that represent the intercept and the slope, respectively.
- $\epsilon$ is called the error term. It represents the deviation of an observed value from the linear relationship.
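
As a minimal sketch of how $\beta_0$ and $\beta_1$ can be estimated from data, the snippet below applies the standard closed-form least-squares formulas for a single explanatory variable. The data are synthetic and every name in it (`x`, `y`, `beta0_hat`, `beta1_hat`, the true values 2.0 and 0.5) is an assumption made for illustration, not part of the wine dataset used later.

```python
import numpy as np

# Synthetic data (assumption): true intercept 2.0, true slope 0.5, Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
eps = rng.normal(0, 1, size=200)
y = 2.0 + 0.5 * x + eps

# Closed-form least-squares estimates for one explanatory variable:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
x_mean, y_mean = x.mean(), y.mean()
beta1_hat = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta0_hat = y_mean - beta1_hat * x_mean

print(beta0_hat, beta1_hat)
```

The estimates should land close to the values used to generate the data, with the gap coming from the noise term $\epsilon$.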

### Data Explanation

**Wine quality inspection with the input variables below**

- Input variables (based on physicochemical tests):
   - fixed acidity
   - volatile acidity
   - citric acid
   - residual sugar
   - chlorides
   - free sulfur dioxide
   - total sulfur dioxide
   - density
   - pH
   - sulphates
   - alcohol
- Output variable (based on sensory data):
   - quality (score between 0 and 10)

In [None]:
import pandas as pd
import numpy as np

In [None]:
from google.colab import files
uploaded = files.upload()

Saving winequality-red.csv to winequality-red.csv


In [None]:
df = pd.read_csv("winequality-red.csv", sep=';')

## EDA & Data Preparation

In [None]:
df.shape

(1599, 12)

In [None]:
df.isnull().sum(axis=0)

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

In [None]:
# Separate the independent columns from our target variable
X = df.drop('quality', axis= 1)
y = df['quality']

In [None]:
# Convert dataframe to numpy array
X = X.to_numpy()
y = y.to_numpy()

In [None]:
X.shape

(1599, 11)

### Weight Inspection

1) **OLS**

In [None]:
# Prepend a column of ones so that the first weight acts as the intercept (bias)
X1 = np.insert(X, 0, 1, axis=1)
# Solve the normal equation w = (X1^T X1)^+ X1^T y; @ is matrix multiplication
weights = np.linalg.pinv(X1.T @ X1) @ X1.T @ y
weights

array([ 2.19652081e+01,  2.49905523e-02, -1.08359026e+00, -1.82563948e-01,
        1.63312696e-02, -1.87422516e+00,  4.36133331e-03, -3.26457970e-03,
       -1.78811634e+01, -4.13653146e-01,  9.16334412e-01,  2.76197700e-01])
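
The weights above come from the normal equation $\hat{w} = (X^\top X)^{+} X^\top y$. As a hedged cross-check, the sketch below solves the same least-squares problem with `np.linalg.lstsq`, which avoids forming $X^\top X$ explicitly and is often numerically preferable. It runs on synthetic data; `X_demo`, `y_demo`, and `true_w` are hypothetical names introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))           # hypothetical feature matrix
true_w = np.array([1.0, 2.0, -3.0, 0.5])     # intercept followed by three slopes

# Same construction as above: prepend a column of ones for the intercept
X1_demo = np.insert(X_demo, 0, 1, axis=1)
y_demo = X1_demo @ true_w + rng.normal(scale=0.1, size=100)

# Normal equation with the pseudo-inverse
w_normal = np.linalg.pinv(X1_demo.T @ X1_demo) @ X1_demo.T @ y_demo

# Least-squares solver; returns (solution, residuals, rank, singular values)
w_lstsq, *_ = np.linalg.lstsq(X1_demo, y_demo, rcond=None)

print(w_normal)
print(w_lstsq)
print(np.allclose(w_normal, w_lstsq))
```

On a well-conditioned problem like this one, both routes should agree up to floating-point error.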

2) **scikit-learn**

In [None]:
from sklearn import linear_model
l_regression = linear_model.LinearRegression()
l_regression.fit(X, y)

LinearRegression()

In [None]:
# Check the beta coefficients
l_regression.coef_

array([ 2.49905527e-02, -1.08359026e+00, -1.82563948e-01,  1.63312698e-02,
       -1.87422516e+00,  4.36133331e-03, -3.26457970e-03, -1.78811638e+01,
       -4.13653144e-01,  9.16334413e-01,  2.76197699e-01])

In [None]:
# Check the intercept
l_regression.intercept_

21.96520844944842

3) **Interpretation: Correlation of Variables to Weights**

In [None]:
print(df.columns[:-1])
print(l_regression.coef_)

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol'],
      dtype='object')
[ 2.49905527e-02 -1.08359026e+00 -1.82563948e-01  1.63312698e-02
 -1.87422516e+00  4.36133331e-03 -3.26457970e-03 -1.78811638e+01
 -4.13653144e-01  9.16334413e-01  2.76197699e-01]
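
The two printouts above are easier to read side by side. One possible way to pair them, sketched here with the names and coefficients copied from that output into plain Python lists (`feature_names`, `coefs`, and `coef_by_feature` are names chosen for this example):

```python
# Feature names and coefficients copied from the output above
feature_names = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]
coefs = [
    2.49905527e-02, -1.08359026e+00, -1.82563948e-01, 1.63312698e-02,
    -1.87422516e+00, 4.36133331e-03, -3.26457970e-03, -1.78811638e+01,
    -4.13653144e-01, 9.16334413e-01, 2.76197699e-01,
]

# Map each feature to its coefficient and print one pair per line
coef_by_feature = dict(zip(feature_names, coefs))
for name, value in coef_by_feature.items():
    print(f"{name:<22s}{value:+.5f}")
```

Keep in mind that the features are on different scales, so the raw magnitudes are not directly comparable across features.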


- A linear model is useful for its interpretability: each coefficient describes how the corresponding feature contributes to the output variable.
- For example, $\beta_i$ means that a one-unit increase in feature $i$, with all other features held fixed, changes the predicted target by $\beta_i$.
  - Note that the change can be positive or negative.

- **Alcohol (0.2762)**
> If we increase alcohol by 1 unit while holding the other variables fixed, the predicted quality increases by about 0.2762.
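
To make the alcohol interpretation concrete, the sketch below reuses the OLS weights printed earlier and compares the predictions for two inputs that differ only in alcohol. The `sample` values are hypothetical and chosen purely for illustration, and `predict` is a helper defined here, not a library function.

```python
import numpy as np

# Weights copied from the OLS output above: intercept first, then the 11 features
weights = np.array([
    2.19652081e+01, 2.49905523e-02, -1.08359026e+00, -1.82563948e-01,
    1.63312696e-02, -1.87422516e+00, 4.36133331e-03, -3.26457970e-03,
    -1.78811634e+01, -4.13653146e-01, 9.16334412e-01, 2.76197700e-01,
])

# Hypothetical wine sample (illustrative values), in the column order of the dataset
sample = np.array([8.0, 0.5, 0.3, 2.5, 0.08, 15.0, 45.0, 0.997, 3.3, 0.65, 10.0])

def predict(features):
    """Linear prediction: intercept plus the weighted sum of the features."""
    return weights[0] + features @ weights[1:]

# Raise alcohol (the last feature) by one unit and keep everything else fixed
bumped = sample.copy()
bumped[-1] += 1.0

diff = predict(bumped) - predict(sample)
print(diff)
```

The difference between the two predictions equals the alcohol coefficient up to floating-point error, regardless of the values chosen for the other features.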