In [22]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import sklearn.metrics as metrics
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
%config InlineBackend.figure_format = 'retina'

%matplotlib inline


1. **Slope (m):**
   - The slope of the best-fit line is:
     $$ m = r_P \times \frac{\sigma_y}{\sigma_x} = \frac{\text{cov}(X, Y)}{\text{var}(X)} $$
   - Here, $r_P$ is the Pearson correlation coefficient between X and Y, and $\sigma_y$ and $\sigma_x$ are the standard deviations of Y and X, respectively.
   - The two forms are equivalent because $r_P = \text{cov}(X, Y) / (\sigma_x \sigma_y)$ and $\text{var}(X) = \sigma_x^2$. OLS regression estimates the slope from the sample covariance and variance.

2. **Y-intercept (b):**
   - The y-intercept is:
     $$ b = \mu_y - m\mu_x $$
   - Here, $\mu_y$ and $\mu_x$ are the means of Y and X, respectively.
   - Equivalently, the best-fit line always passes through the point of means $(\mu_x, \mu_y)$.

These two equations are the closed-form OLS estimates for simple linear regression: they give the parameters of the line that minimizes the sum of squared residuals.
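
As a quick numerical check, the sketch below computes the slope both ways with NumPy and then derives the intercept from the means. The sample arrays are hypothetical values chosen for illustration, and the variable names (`m_corr`, `m_cov`, `b`) are not from the notebook.

```python
import numpy as np

# Illustrative sample (hypothetical values, not from the notebook)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Slope via the Pearson correlation and the standard deviations
r = np.corrcoef(x, y)[0, 1]
m_corr = r * np.std(y, ddof=1) / np.std(x, ddof=1)

# Slope via covariance over variance (same ddof in numerator and denominator)
m_cov = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# Intercept from the means
b = y.mean() - m_cov * x.mean()

print(m_corr, m_cov, b)
```

The two slope values should agree up to floating-point rounding, as long as the same `ddof` is used throughout.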

Let's walk through how Ordinary Least Squares (OLS) regression works mathematically using the provided data:

```python
ages = np.array([20, 25, 30, 35, 40])
cigarettes_per_day = np.array([10, 15, 20, 25, 30])
```

1. **Define the model**: In simple linear regression, the model is represented as:

   $$ Y = \beta_0 + \beta_1 X + \epsilon $$

   Where:
   - $Y$ is the dependent variable (cigarettes per day),
   - $X$ is the independent variable (ages),
   - $\beta_0$ is the y-intercept (constant term),
   - $\beta_1$ is the slope (coefficient of the independent variable),
   - $\epsilon$ is the error term.

2. **Calculate the mean of X and Y**:

   $$ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i $$

   $$ \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i $$

   For our data:

   $$ \bar{X} = \frac{20 + 25 + 30 + 35 + 40}{5} = 30 $$

   $$ \bar{Y} = \frac{10 + 15 + 20 + 25 + 30}{5} = 20 $$

3. **Calculate the slope (β1)**:

   $$ \beta_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} $$

   Substituting the values:

   $$ \beta_1 = \frac{(20-30)(10-20) + (25-30)(15-20) + (30-30)(20-20) + (35-30)(25-20) + (40-30)(30-20)}{(20-30)^2 + (25-30)^2 + (30-30)^2 + (35-30)^2 + (40-30)^2} $$

   Simplifying, we get:

   $$ \beta_1 = \frac{100 + 25 + 0 + 25 + 100}{100 + 25 + 0 + 25 + 100} = \frac{250}{250} = 1 $$

4. **Calculate the y-intercept (β0)**:

   $$ \beta_0 = \bar{Y} - \beta_1 \bar{X} $$

   Substituting the values:

   $$ \beta_0 = 20 - 1 \times 30 = -10 $$

5. **Fit the regression line**:
   Now we have the estimated values of $\beta_0$ and $\beta_1$, so the regression line equation is:

   $$ \text{Cigarettes per day} = 1 \times \text{Age} - 10 $$

   This line is the best linear fit to the given data points. Here it passes through every point exactly, because each observation satisfies $Y = X - 10$.
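
The steps above can be reproduced in a few lines of plain Python. This is a minimal sketch; the names `s_xy`, `s_xx`, `beta_0` and `beta_1` are chosen here for illustration.

```python
# Data from the example above
ages = [20, 25, 30, 35, 40]
cigarettes_per_day = [10, 15, 20, 25, 30]
n = len(ages)

# Step 2: means of X and Y
x_bar = sum(ages) / n
y_bar = sum(cigarettes_per_day) / n

# Step 3: slope = sum of cross-deviations / sum of squared X-deviations
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, cigarettes_per_day))
s_xx = sum((x - x_bar) ** 2 for x in ages)
beta_1 = s_xy / s_xx

# Step 4: intercept from the means and the slope
beta_0 = y_bar - beta_1 * x_bar

print("means:", x_bar, y_bar)
print("slope:", beta_1, "intercept:", beta_0)
```

The result should agree with the slope and intercept that `LinearRegression` reports for the same data later in the notebook.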

<div class="width=80%">
In linear regression, there are several metrics commonly used to measure the error or goodness of fit of the model. Some of the most common ones include:

- Mean Absolute Error (MAE): The mean absolute error is the average of the absolute differences between the predicted values and the actual values. It provides a measure of the average magnitude of errors in the predictions.

- Mean Squared Error (MSE): The mean squared error is the average of the squared differences between the predicted values and the actual values. Squaring the errors penalizes larger errors more heavily than smaller ones.

- Root Mean Squared Error (RMSE): The root mean squared error is the square root of the mean squared error. It provides a measure of the average magnitude of errors in the same units as the dependent variable.

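- R-squared (R2): R-squared represents the proportion of variance in the dependent variable that is explained by the independent variables in the model. For an OLS fit with an intercept, evaluated on the data it was fitted to, it ranges from 0 to 1, with higher values indicating a better fit. On other data, or for a model without an intercept, it can be negative.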

These metrics can help assess the overall performance of the linear regression model and compare different models or variations of the same model. It's important to consider the specific context of the problem and the goals of the analysis when choosing which metric to use.
</div>
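
To make the definitions concrete, the following sketch computes the four metrics directly from their formulas with NumPy. The predicted values are hypothetical and serve only as an illustration.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([10, 15, 20, 25, 30])
y_pred = np.array([12, 14, 21, 24, 29])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # same units as y

# R-squared = 1 - residual sum of squares / total sum of squares
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

Because the errors are squared before averaging, a single large miss would raise MSE and RMSE much more than it would raise MAE.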

<div class="width=80%">

### Hypothesis Testing in Linear Regression

In the context of linear regression, hypothesis testing involves evaluating the statistical significance of the relationship between the independent variable(s) and the dependent variable. Specifically, it aims to determine whether the independent variable(s) have a significant effect on the dependent variable, and whether the coefficients estimated by the regression model are significantly different from zero.

#### Null Hypothesis $(H_0)$
The null hypothesis states that there is no linear relationship between the independent variable and the dependent variable. Mathematically, it can be expressed as:

$$ H_0: \beta_1 = 0, $$

where $\beta_1$ is the coefficient (slope) associated with the independent variable of interest.

#### Alternative Hypothesis $(H_1)$
The alternative hypothesis contradicts the null hypothesis and asserts that there is a linear relationship between the independent variable and the dependent variable. For a simple linear regression with one independent variable, it can be expressed as:

$$ H_1: \beta_1 \neq 0. $$

This means that the coefficient $\beta_1$ is not equal to zero, indicating that there is a linear relationship between the independent variable and the dependent variable.

During hypothesis testing in linear regression, we typically perform a t-test on an individual coefficient or an F-test on the model as a whole; with a single predictor the two are equivalent, since $F = t^2$. The p-value is the probability of obtaining a test statistic at least as extreme as the one observed if the null hypothesis were true. If the p-value is less than a predefined significance level (e.g., 0.05), we reject the null hypothesis in favor of the alternative and conclude that the relationship is statistically significant. Otherwise we fail to reject the null hypothesis, which is not the same as showing that no relationship exists.

In summary, the null hypothesis represents the absence of a relationship, while the alternative hypothesis suggests the presence of a relationship between the independent and dependent variables in the linear regression model.

</div>
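
A minimal sketch of the slope t-test, assuming SciPy is available. The data are hypothetical noisy values (not from the notebook), since a perfect fit has zero residual variance and makes the test degenerate.

```python
import numpy as np
from scipy import stats

# Hypothetical noisy data, for illustration only
x = np.array([20, 25, 30, 35, 40], dtype=float)
y = np.array([11, 14, 21, 24, 30], dtype=float)

res = stats.linregress(x, y)

# t-statistic for H0: beta_1 = 0 is the slope divided by its standard error
t_stat = res.slope / res.stderr
df = len(x) - 2  # n minus the two estimated parameters

# Two-sided p-value from the t distribution; should match res.pvalue
p_manual = 2 * stats.t.sf(abs(t_stat), df)

alpha = 0.05
print(f"slope={res.slope:.3f}  t={t_stat:.2f}  p={res.pvalue:.4g}")
print("reject H0" if res.pvalue < alpha else "fail to reject H0")
```

The manual p-value is included only to show where the number reported by `linregress` comes from.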

In [24]:
ages = np.array([20, 25, 30, 35, 40])
ages_2d = ages.reshape(-1, 1)
cigarettes_per_day = np.array([10, 15, 20, 25, 30])
reg = LinearRegression().fit(ages_2d, cigarettes_per_day)
r_squared = reg.score(ages_2d, cigarettes_per_day)  # R² on the training data

slope = reg.coef_            # array with one coefficient per feature
intercept = reg.intercept_

print("Slope:", slope)
print("Intercept:", intercept)

# reg.intercept_ = 5
reg.predict([[20]])

Slope: [1.]
Intercept: -10.0


array([10.])

The summary table printed by `model.summary()` (where `model` is assumed to be a fitted statsmodels OLS results object, e.g. the return value of `sm.OLS(...).fit()`) contains several statistics for assessing the performance and goodness of fit of the linear regression model. The key metrics are:

1. **R-squared (R²):** This statistic measures the proportion of the variance in the dependent variable (cigarettes per day) that is explained by the independent variable (age). A higher R-squared value indicates a better fit of the model to the data.

2. **Adjusted R-squared:** This is a modified version of R-squared that takes into account the number of predictors in the model. It penalizes the addition of unnecessary predictors and is especially useful when comparing models with different numbers of predictors.

3. **F-statistic and Prob (F-statistic):** The F-statistic tests the overall significance of the regression model. A low p-value (Prob (F-statistic)) indicates that at least one of the independent variables is significantly related to the dependent variable.

4. **Root Mean Squared Error (RMSE):** This is the square root of the mean of the squared differences between the observed and predicted values. It measures the typical deviation of the predicted values from the actual values, and a lower RMSE indicates better model performance. The statsmodels OLS summary table does not print RMSE directly, but it can be computed from the residuals, e.g. `np.sqrt(np.mean(model.resid ** 2))`.

5. **AIC and BIC:** Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are measures of the relative quality of statistical models. Lower values of AIC and BIC indicate better model fit, with penalization for model complexity.

You can use these metrics to evaluate the performance of the linear regression model and determine how well it captures the relationship between age and cigarettes per day in the dataset.
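
The statistics listed above can also be computed by hand, which helps when reading the summary table. The sketch below uses hypothetical noisy data and NumPy only. The AIC and BIC lines assume a Gaussian likelihood and count the intercept as a parameter, which is an assumption about the convention used by the summary table.

```python
import numpy as np

# Hypothetical noisy data, for illustration only
x = np.array([20, 25, 30, 35, 40], dtype=float)
y = np.array([11, 14, 21, 24, 30], dtype=float)
n, p = len(x), 1  # observations, predictors (excluding the intercept)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))
rmse = np.sqrt(ss_res / n)

# Gaussian log-likelihood at the fitted parameters (assumed convention)
llf = -n / 2 * (np.log(2 * np.pi) + np.log(ss_res / n) + 1)
k = p + 1  # slope(s) plus intercept
aic = 2 * k - 2 * llf
bic = k * np.log(n) - 2 * llf

print(f"R2={r2:.4f}  adj R2={adj_r2:.4f}  F={f_stat:.1f}  RMSE={rmse:.4f}")
print(f"AIC={aic:.3f}  BIC={bic:.3f}")
```

With one predictor, the F-statistic equals the square of the slope's t-statistic, so the two tests lead to the same conclusion.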

In [6]:
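from sklearn import metrics
import numpy as np

# Actual and predicted values.
# NOTE: y_pred is assumed to hold predictions from an earlier cell that is not
# shown here. The metrics printed below evidently come from different values
# than the perfectly linear example data, which would give zero error.
y_true = cigarettes_per_day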

# Mean Absolute Error (MAE)
mae = metrics.mean_absolute_error(y_true, y_pred)

# Mean Squared Error (MSE)
mse = metrics.mean_squared_error(y_true, y_pred)

# Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)

# R-squared (R2)
r2 = metrics.r2_score(y_true, y_pred)

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("R-squared (R2):", r2)

Mean Absolute Error (MAE): 2.2787128074520755
Mean Squared Error (MSE): 8.172361389474789
Root Mean Squared Error (RMSE): 2.8587342285485002
R-squared (R2): 0.03229548620208289


In [7]:
x1 = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
x1

array([[1, 1],
       [1, 2],
       [2, 2],
       [2, 3]])

In [8]:
# y = 1 * x_0 + 2 * x_1 + 3
y = np.dot(x1, np.array([1, 2])) + 3
y

array([ 6,  8,  9, 11])
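
Since `y` was generated from known coefficients, a least-squares fit should recover them. The sketch below uses NumPy's `np.linalg.lstsq` with an explicit column of ones for the intercept, rather than `LinearRegression`; the names `design`, `w0` and `w1` are illustrative.

```python
import numpy as np

x1 = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(x1, np.array([1, 2])) + 3  # y = 1 * x_0 + 2 * x_1 + 3

# Append a column of ones so the last fitted coefficient is the intercept
design = np.column_stack([x1, np.ones(len(x1))])

# Solve the least-squares problem design @ coef ≈ y
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
w0, w1, intercept = coef

print("coefficients:", w0, w1)
print("intercept:", intercept)
```

Fitting `LinearRegression().fit(x1, y)` should give the same values in `coef_` and `intercept_`, up to floating-point rounding.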