# Part 1: SLR

Consider a Simple Linear Regression model of the form:

$$
y = \alpha + \beta x + \epsilon
$$

where:
- $y$ is the dependent variable,
- $x$ is the independent variable,
- $\epsilon$ is the error term,
- $\alpha$ and $\beta$ are the parameters to be estimated.

Using pen and paper, prove that the ordinary least squares (OLS) estimators are:

$$
\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

$$
\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}
$$

Hint: start with $\hat{\alpha}$.
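
Once the proof is done, the two formulas can be checked numerically. The sketch below is only an illustration: the helper name `slr_closed_form` and the synthetic data are assumptions, not part of the exercise. It compares the closed-form estimates with the least-squares line returned by `np.polyfit`.

```python
import numpy as np

def slr_closed_form(x, y):
    """Return (alpha_hat, beta_hat) from the closed-form OLS formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    beta_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

# Synthetic data (an assumption for illustration): y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=200)

alpha_hat, beta_hat = slr_closed_form(x, y)

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line
slope_ref, intercept_ref = np.polyfit(x, y, deg=1)

print("slopes agree:", np.isclose(beta_hat, slope_ref))
print("intercepts agree:", np.isclose(alpha_hat, intercept_ref))
```

Agreement with `np.polyfit` is a sanity check, not a proof: it shows the formulas give the least-squares line on this sample only.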

# Part 2: MLR

Consider a Multiple Linear Regression model of the form:

$$
y = X \beta + \epsilon
$$

1. What is $\beta$?
2. What is $X$?
3. What are the dimensions of $X$ and $\beta$?
4. Prove that $\hat{\beta} = (X^T X)^{-1} X^T y$.
5. What are the dimensions of each term in the expression for $\hat{\beta}$?
6. Under what conditions is $X^T X$ invertible?
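
The questions above can be explored numerically before or after working them out by hand. The following is a minimal sketch on synthetic data (the sizes, coefficients, and variable names are illustrative assumptions). It solves the normal equations, compares the result with `np.linalg.lstsq`, and uses `np.linalg.matrix_rank` to show how a duplicated column affects the rank of $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Design matrix: a column of ones (intercept) plus two synthetic features
features = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), features])      # shape (n, p) with p = 3
beta_true = np.array([1.0, -2.0, 0.5])            # illustrative values
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X^T X) beta = X^T y instead of forming the inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine
beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)

# Rank of X compared with its number of columns
rank_full = np.linalg.matrix_rank(X)

# A column that is a multiple of an existing one adds no new information
X_collinear = np.column_stack([X, 2.0 * X[:, 1]])
rank_collinear = np.linalg.matrix_rank(X_collinear)

print("X shape:", X.shape, "| beta_hat shape:", beta_hat.shape)
print("rank(X):", rank_full, "of", X.shape[1], "columns")
print("rank(X_collinear):", rank_collinear, "of", X_collinear.shape[1], "columns")
```

Solving the linear system with `np.linalg.solve` is a design choice made here for numerical stability. It computes the same $\hat{\beta}$ as the formula in question 4 whenever $X^T X$ is invertible.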

# Part 3: Calculate Simple Linear Regression Coefficients


Given the following dataset:

| $x$ | $y$ |
|-----|-----|
| 1   | 3   |
| 2   | 5   |
| 3   | 4   |
| 4   | 7   |
| 5   | 6   |

1. Calculate the intercept and slope for a Simple Linear Regression model.
2. Draw the regression line on paper, then plot it in code. Compare the two.
3. Calculate the residuals on paper, then in code. Compare the results.
4. Calculate the SSR on paper, then in code. Compare the results.
5. Calculate $R^2$ on paper, then in code. Interpret the value.
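
For the coding half of these tasks, one possible starting point is sketched below. It uses `np.polyfit` for the fit, which is one choice among several (the variable names are illustrative). The printed values are meant to be compared with the pen-and-paper results, so no expected output is shown.

```python
import numpy as np

# Dataset from the table above
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 5, 4, 7, 6], dtype=float)

# Least-squares line; np.polyfit with deg=1 returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

# Fitted values, residuals, and sum of squared residuals (SSR)
y_hat = intercept + slope * x
residuals = y - y_hat
ssr = np.sum(residuals ** 2)

print("slope:", round(float(slope), 4))
print("intercept:", round(float(intercept), 4))
print("residuals:", np.round(residuals, 4))
print("SSR:", round(float(ssr), 4))
```

To draw the line, `plt.scatter(x, y)` followed by `plt.plot(x, y_hat)` from `matplotlib.pyplot` is enough for a dataset this small.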


## $R^2$ (Coefficient of Determination)

$R^2$, the coefficient of determination, measures the proportion of the variance of the dependent variable that is explained by the independent variable(s) in a regression model. It indicates goodness of fit: how well the model accounts for the variability of the response.

For an ordinary least squares model with an intercept, evaluated on the data it was fitted to, $R^2$ lies between 0 and 1:
- $R^2 = 1$ indicates that the regression model fits the data perfectly.
- $R^2 = 0$ indicates that the model explains none of the variability of the response around its mean.

On held-out data, $R^2$ can be negative when the model predicts worse than the mean of the response.

$R^2$ is calculated as:

$$
R^2 = 1 - \frac{\text{SSR}}{\text{SST}}
$$

where:
- SSR (Sum of Squared Residuals) is:
  $\text{SSR} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
- SST (Total Sum of Squares) is:
  $\text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$
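
The formula translates directly into a few lines of NumPy. The helper `r_squared` below is a hypothetical name used for illustration. The three calls show a perfect fit, a mean-only predictor, and deliberately poor predictions for which the formula yields a negative value.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Compute R^2 = 1 - SSR / SST from observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ssr = np.sum((y_true - y_pred) ** 2)          # sum of squared residuals
    sst = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ssr / sst

y_true = np.array([3.0, 5.0, 4.0, 7.0, 6.0])

# Predictions equal to the observations: SSR = 0
perfect = r_squared(y_true, y_true)

# Predicting the mean everywhere: SSR = SST
mean_only = r_squared(y_true, np.full_like(y_true, y_true.mean()))

# Deliberately bad predictions (the observations in reverse order): SSR > SST
worse_than_mean = r_squared(y_true, y_true[::-1])

print(perfect, mean_only, worse_than_mean)
```

scikit-learn's `r2_score(y_true, y_pred)` computes the same quantity for a single target, so it can serve as a cross-check.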

# Part 4: Coding Task: Linear Regression on the California Housing Dataset

The goal of this task is to predict median house values from median income using a linear regression model.

### **Tasks:**
1) Divide the dataset into training and testing sets.
2) Fit a linear regression model using the training data.
3) Make predictions on the test data.
4) Evaluate the model’s performance using appropriate metrics, such as Mean Squared Error (MSE) and R² score.
5) Plot the regression line on a scatter plot of the actual data to visually assess the model’s fit.
6) Plot the distribution of the residuals.
7) Repeat these steps using more features.
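
To make clear what the library calls in tasks 1 to 4 compute, the steps can be sketched with NumPy alone. Everything in this block is an assumption for illustration: the helper names (`split_data`, `fit_ols`, `predict`, `evaluate`) are hypothetical, and the data is a synthetic stand-in rather than the California Housing dataset.

```python
import numpy as np

def split_data(X, y, test_size=0.2, seed=42):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(round(test_size * len(y)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply the fitted intercept and slopes to new rows."""
    return coef[0] + X @ coef[1:]

def evaluate(y_true, y_pred):
    """Return (MSE, R^2) for a set of predictions."""
    mse = np.mean((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / sst
    return mse, r2

# Synthetic stand-in for one feature and a target (not the real dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0, 15, size=(1000, 1))
y = 0.5 + 0.4 * X[:, 0] + rng.normal(scale=0.5, size=1000)

X_train, X_test, y_train, y_test = split_data(X, y)   # task 1
coef = fit_ols(X_train, y_train)                      # task 2
y_pred = predict(coef, X_test)                        # task 3
mse, r2 = evaluate(y_test, y_pred)                    # task 4
residuals = y_test - y_pred                           # input for task 6

print("coefficients:", coef)
print("MSE:", mse, "| R^2:", r2)
```

The same four steps map onto `train_test_split`, `LinearRegression.fit`, `LinearRegression.predict`, and `mean_squared_error` / `r2_score` in scikit-learn. The shuffling here is not guaranteed to reproduce scikit-learn's split for the same seed.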

In [7]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()

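|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |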


In [12]:
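#1) Divide the dataset into training and testing sets.
# Target: median house value; features: all eight columns of df
y = data.target
X = df
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
)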


In [13]:
linReg = LinearRegression()
linReg.fit(X_train, y_train)

In [18]:
#3) Make predictions on the test data
# linReg was already fitted on the training data in the previous cell
y_pred = linReg.predict(X_test)
linReg.coef_

array([ 4.48674910e-01,  9.72425752e-03, -1.23323343e-01,  7.83144907e-01,
       -2.02962058e-06, -3.52631849e-03, -4.19792487e-01, -4.33708065e-01])

In [17]:
#4) Evaluate the model’s performance using appropriate metrics, such as Mean Squared Error (MSE) and R² score.
# mean_squared_error and r2_score are imported in the first cell
mse = mean_squared_error(y_test, y_pred)

# Coefficient of determination (R²)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R² Score:", r2)

Mean Squared Error (MSE): 0.555891598695244
R² Score: 0.5757877060324511


In [20]:
#5) Plot the regression line on a scatter plot of the actual data to visually assess the model’s fit.
# X_test holds all eight features, but a regression line needs a single x-axis.
# Fit a separate one-feature model on MedInc and plot its predictions.
X_train_inc = X_train[["MedInc"]]
X_test_inc = X_test[["MedInc"]]

linReg_inc = LinearRegression()
linReg_inc.fit(X_train_inc, y_train)
y_pred_inc = linReg_inc.predict(X_test_inc)

# Convert MedInc to a 1-D array and sort it so the line is drawn left to right
x_test_array = X_test_inc["MedInc"].to_numpy()
sorted_indices = x_test_array.argsort()
x_test_sorted = x_test_array[sorted_indices]
y_pred_sorted = y_pred_inc[sorted_indices]

plt.figure(figsize=(8, 6))

# 1. Scatter plot of the actual values (test set)
plt.scatter(x_test_array, y_test, color='blue', alpha=0.5, label='Actual Data')

# 2. Fitted regression line (predicted values), ordered along the x-axis
plt.plot(x_test_sorted, y_pred_sorted, color='red', label='Regression Line')

plt.xlabel("MedInc (Median Income)")
plt.ylabel("MedHouseVal (Median House Value)")
plt.title("Linear Regression Fit on Test Data")
plt.legend()
plt.show()