<a href="https://colab.research.google.com/github/aleksejalex/EIEE9E_2025_ZS/blob/main/PyPEF_08_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyPEF, lecture 08. Linear regression.

Prepared by: Aleksej Gaj ( pythonforstudents24@gmail.com )

🔗 Course website: [https://aleksejgaj.cz/pef_python/](https://aleksejgaj.cz/pef_python/)


👉 Technical note: the numbering of the notebooks is slightly inconsistent:
 - use the links from the course webpage; those should be correct.

 (If you spot an inconsistency, please feel free to email me.)

### Two basic tasks/approaches:

 - **regression** = fitting a "shape" to the data as closely as possible (the response is a continuous value)

<img src="https://higherlogicdownload.s3.amazonaws.com/IMWUC/UploadedImages/92757287-d116-4157-b004-c2a0aba1b048/linear-regression-in-machine-learning.png" alt="banner" width="300" align="center"><br>
 *Examples:* predict the age of a cat based on its weight, predict the price of a house based on its size and location, predict the failure of some kind of machinery, ...

 - **classification** = separating data into classes (response is represented by discrete values)

<img src="https://miro.medium.com/v2/resize:fit:980/1*wm5m2Wd0gMXAkztc7vP7yA.png" alt="banner" width="260" align="center"><br>
*Examples:* predict whether a creature is a mammal or not, predict whether there is a man or a woman in a picture, predict what scene is shown in a video, ...



## scikit-learn
<a href="https://scikit-learn.org/stable/index.html"><img src="https://scikit-learn.org/stable/_static/scikit-learn-logo-small.png" alt="banner" width="220" align="right"></a>
 = library offering algorithms for classification, regression, clustering, etc.
 - user friendly, doesn't require deep knowledge
 - compatible with NumPy, SciPy, matplotlib and Pandas
 - has very nice documentation [here](https://scikit-learn.org/stable/index.html)




In [None]:
# imports for today (we are already familiar with those libraries)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In case you need to install this package:

In [None]:
%%capture
%pip install scikit-learn

## A few useful things to learn:
Before we get to ML, let's get familiar with some functions we will use:

 - **generate random data in a specific way** (so you can easily train your classifying skills)

In [None]:
from sklearn.datasets import make_blobs

# Generate random data using make_blobs
num_samples = 15
x, y = make_blobs(n_samples=num_samples, centers=2, cluster_std=1.3, random_state=42)

The function outputs pairs $(\vec{x}, y)$, where:
 - $y \in \lbrace 0, 1, ..., N-1 \rbrace$, where $N$ is the number of classes ... the index of the class to which the observation belongs
 - $\vec{x} = (x_1, x_2, ..., x_a)$, where $x_j$ is the value of the $j$-th feature ... the vector of features (example: if our observations are customers, their features can be: age, gender, amount of money spent, ...)
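To make the structure of these pairs concrete, here is a small sketch that inspects the output of the same `make_blobs` call. The names `X_demo` and `y_demo` are chosen only for this illustration, so they do not collide with the variables used in the cells.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same settings as in the cell above; make_blobs creates 2 features by default
X_demo, y_demo = make_blobs(n_samples=15, centers=2, cluster_std=1.3, random_state=42)

print(X_demo.shape)          # one row per observation, one column per feature
print(y_demo.shape)          # one class label per observation
print(np.unique(y_demo))     # the class indices that occur in the data
print(X_demo[0], y_demo[0])  # the first pair (feature vector, class index)
```

With two centers, the labels are the class indices 0 and 1, and each row of `X_demo` is one feature vector.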

In [None]:
x

In [None]:
y

In [None]:
plt.figure(figsize=(4, 3))
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='viridis', marker='o')
plt.title('Visualization of Blobs')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

<a href="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmedia.geeksforgeeks.org%2Fwp-content%2Fcdn-uploads%2F20190523171258%2Foverfitting_2.png&f=1&nofb=1&ipt=042ce96dbef98d7aa8ba54ca1e90ce3d0132347982c54dd1ca7580e6da8411e3&ipo=images"><img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmedia.geeksforgeeks.org%2Fwp-content%2Fcdn-uploads%2F20190523171258%2Foverfitting_2.png&f=1&nofb=1&ipt=042ce96dbef98d7aa8ba54ca1e90ce3d0132347982c54dd1ca7580e6da8411e3&ipo=images" alt="banner" width="600" align="right"></a>
 - **split the data** (`X, y`) into a training dataset and a testing dataset.
This is needed to evaluate how well the model predicts on data it has not seen. (The ability to generalise to new data is important.)

It also helps to detect overfitting: if the model learns too much from our data, it starts to memorise not only the main pattern, but also things like random noise.
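A rough sketch of why held-out data matters, under assumptions of my own: the data are synthetic (a noisy line), every fourth point is held out as a test set, and polynomials of degree 1 and 7 are fitted with `np.polyfit` instead of scikit-learn. None of these choices come from the lecture; they only illustrate the idea.

```python
import numpy as np

rng = np.random.default_rng(0)
x_all = np.linspace(0, 1, 20)
y_all = 2 * x_all + 1 + rng.normal(0, 0.1, x_all.size)  # noisy line (assumed data)

# Hold out every fourth point as a simple test set
test_mask = np.arange(x_all.size) % 4 == 0
x_tr, y_tr = x_all[~test_mask], y_all[~test_mask]
x_te, y_te = x_all[test_mask], y_all[test_mask]

def mse(y_true, y_pred):
    """Mean squared error between two arrays."""
    return np.mean((y_true - y_pred) ** 2)

train_errors, test_errors = {}, {}
for degree in (1, 7):
    coeffs = np.polyfit(x_tr, y_tr, degree)  # least-squares polynomial fit
    train_errors[degree] = mse(y_tr, np.polyval(coeffs, x_tr))
    test_errors[degree] = mse(y_te, np.polyval(coeffs, x_te))
    print(f"degree {degree}: train MSE = {train_errors[degree]:.5f}, "
          f"test MSE = {test_errors[degree]:.5f}")
```

The more flexible polynomial cannot have a larger training error than the straight line, because it contains the line as a special case. Whether its test error is worse depends on the data, which is exactly why the test set is needed to find out.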

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, shuffle=True)

In [None]:
print(f"size of x = {x.shape}")
print(f"size of X_train = {X_train.shape}")
print(f"size of X_test = {X_test.shape}")
print(f"size of y = {y.shape}")
print(f"size of y_train = {y_train.shape}")
print(f"size of y_test = {y_test.shape}")

In [None]:
# we don't want to mix up different variables under the same name, so:
del(x, y, X_train, X_test, y_train, y_test)

## Linear regression - most common tool
 - intuitively: the algorithm tries to find the optimal coefficients of a line that approximates the data

### The idea behind fitting:
*find coefficients of the line (slope and intercept) such that the total sum of squared vertical distances `datapoint<->line` is minimised*

<img src="https://aleksejalex.4fan.cz/imgsbin/uploads/regression_animation.gif" width="400" >

(Unfortunately, the credits are lost; the source is probably [this](https://www.youtube.com/watch?v=lorPvqyGsZU&t=4s) or [this](https://dataphys.org/list/doing-regression-with-a-cardboard-straw-and-strings/))
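In ordinary least squares, the quantity being minimised is the sum of squared vertical distances between the data points and the line. The sketch below compares that sum for two hand-picked lines and for the least-squares line. The five points, the helper `sum_of_squares` and the candidate lines are illustrative assumptions.

```python
import numpy as np

# Five illustrative points (assumed data)
x_pts = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pts = np.array([1.0, 3.0, 2.0, 3.0, 5.0])

def sum_of_squares(slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    residuals = y_pts - (slope * x_pts + intercept)
    return np.sum(residuals ** 2)

sse_guess_1 = sum_of_squares(1.0, 0.0)  # the line y = x
sse_guess_2 = sum_of_squares(0.5, 1.0)  # the line y = 0.5 x + 1

# np.polyfit with degree 1 returns the least-squares [slope, intercept]
best_slope, best_intercept = np.polyfit(x_pts, y_pts, 1)
sse_best = sum_of_squares(best_slope, best_intercept)

print(sse_guess_1, sse_guess_2, sse_best)
```

No other straight line can achieve a smaller sum than the least-squares line; the two guesses give 3.0 and 3.75, while the fitted line gives about 2.4.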

## Intuition behind it
 - It shows the **trend**: how one variable changes when another changes.

   **Example:** `(Hours studied) → (Exam score)` The line helps you **predict** the score for any number of hours.

 - we estimate, or "pull back toward", the best-fitting line
 - we are not assuming growth, only trying to describe a relationship


### 🤔 Why it's called "regression"
The term regression comes from a 19th-century statistician, Francis Galton.
He noticed that:

 - very tall parents had children who were tall, but a little closer to average

 - very short parents had children who were short, but also closer to average

He called this:
```
"regression to the mean"
```
The name stuck.




## Trivial example:
✅ **1. A line that best fits your data**

Imagine you plot pairs of numbers on a graph, for example:
 - hours studied (x)
 - exam score (y)

The dots may look scattered.
Linear regression draws the straight line that best fits those dots.

✅ **2. The line helps you predict**

If you know the line, then:
```
If a student studies 5 hours, what score do we expect?
```
You just read off the line.

Let's try it! But to be sure it "works", we will do it this way:
 1. choose $a=2$, $b=1$
 2. calculate x and y, where: $$ y = a \cdot x + b + e, $$ where $e$ is some Gaussian noise, $e \sim \mathcal{N}(0, 0.1^2)$, i.e. with standard deviation $0.1$ (so that it is not too obvious)
 3. split the dataset into training data and test data
 4. fit the model to the training data (i.e. estimate the parameters $\hat{a}$, $\hat{b}$ from the training data)
 5. plot the results, calculate the errors

If we manage to do everything right, this should hold approximately: $\hat{a} \approx a$, $\hat{b} \approx b$ (not exactly, because of the noise).

In [None]:
# Set coefficients
a = 2
b = 1

In [None]:
# Generate random data with Gaussian noise
np.random.seed(0)
num_samples = 10
x = np.random.rand(num_samples)
noise = np.random.normal(0, 0.1, num_samples)
y = a * x + b + noise

In [None]:
from sklearn.model_selection import train_test_split
# Split dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [None]:
x_train.shape

In [None]:
# Reshape x arrays to be 2D (technical part, required by scikit-learn)
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)

(In scikit-learn, the input data for training a linear regression model is expected to be a 2D array where each row represents a sample and each column represents a feature.)
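A tiny sketch of what the reshape does to the array shape. The array `x_1d` is made up for the illustration.

```python
import numpy as np

x_1d = np.array([0.1, 0.5, 0.9])  # three samples as a flat, 1D array
x_2d = x_1d.reshape(-1, 1)        # -1 lets NumPy infer the number of rows

print(x_1d.shape)  # (3,)   : 3 values, no notion of "columns"
print(x_2d.shape)  # (3, 1) : 3 samples (rows), 1 feature (column)

# An equivalent way to add the feature axis
x_2d_alt = x_1d[:, np.newaxis]
print(np.array_equal(x_2d, x_2d_alt))  # True
```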

In [None]:
from sklearn.linear_model import LinearRegression

# Create and fit linear regression model
model = LinearRegression()
model.fit(x_train, y_train)

# Get the estimated coefficients
a_hat = model.coef_[0]
b_hat = model.intercept_

# Print the estimated coefficients
print("Estimated coefficient a:", a_hat)
print("Estimated coefficient b:", b_hat)

In [None]:
# Plot the data and regression line
plt.scatter(x_train, y_train, color='blue', label='Training Data')
plt.scatter(x_test, y_test, color='red', label='Testing Data')
plt.plot(x_train, model.predict(x_train), color='green', label='Regression Line')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Linear Regression')
plt.legend()
plt.show()

In [None]:
# Evaluate predictions numerically on the test set
from sklearn import metrics

predictions = model.predict(x_test)
mse = np.mean((y_test - predictions) ** 2)
print("Mean squared error:", mse)

# R^2 (coefficient of determination): 1 means a perfect fit; it is not a classification accuracy
r2 = metrics.r2_score(y_test, predictions)
print(f"R^2 score: {r2:.4f}")
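The R² score returned by `metrics.r2_score` can be computed by hand as $1 - SS_{res}/SS_{tot}$. The sketch below uses made-up values (`y_true`, `y_pred`, `y_bad` are assumptions for the illustration) and also shows that R² can be negative, so it should not be read as a percentage of correct answers.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 3.0, 2.0, 3.0, 5.0])  # observed values (made up)
y_pred = np.array([1.2, 2.0, 2.8, 3.6, 4.4])  # predictions of some line (made up)

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))  # both approaches agree

# A predictor worse than "always predict the mean" gets a negative score
y_bad = np.zeros_like(y_true)
r2_bad = r2_score(y_true, y_bad)
print(r2_bad)
```

Note also that with only a handful of test points, R² is a very noisy estimate.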

In [None]:
# Computing the errors in parameter estimation:
error_of_a = np.abs(a - a_hat)
error_of_b = np.abs(b - b_hat)

print(f"Error of 'a' is {error_of_a:.5f} and error of 'b' is {error_of_b:.5f}.")

In [None]:
my_values = [-1, 0.3, 0.56, 7, 1000]
for value in my_values:
    x_new = np.array([[value]])  # 2D input: one sample, one feature
    y_pred = model.predict(x_new)[0]
    print(f"Prediction at x = {value} is {y_pred:.4f}, while the noise-free value is {a * value + b:.4f}.")

🔥 What do we see here? It seems our linear model predicts nicely only for values within the range of the data it was trained on, and it predicts worse for values far outside that range (*extrapolation*).

Well, that's inevitable 😕. Any model struggles to predict something it has never seen, and *the further* a new value is from the training data, *the worse* the prediction.


We have 2 ways to deal with it:
 - increase the amount of data in the training dataset (the more, the better)
 - increase the range of the training dataset

Yet, of course, any model has its limits, and we cannot expect the same precision for *any* input.
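For a fitted line, the error against the noise-free line is $|(\hat{a} - a)\,x + (\hat{b} - b)|$, so a small error in the slope is multiplied by $x$. The sketch below illustrates this with its own synthetic data and `np.polyfit`. The names `a_true`, `b_true`, `a_est`, `b_est` and the random generator are assumptions for the illustration, not the variables from the cells above.

```python
import numpy as np

a_true, b_true = 2.0, 1.0
rng = np.random.default_rng(0)
x_fit = rng.random(10)  # training inputs in [0, 1)
y_fit = a_true * x_fit + b_true + rng.normal(0, 0.1, x_fit.size)

# Least-squares line; polyfit returns [slope, intercept]
a_est, b_est = np.polyfit(x_fit, y_fit, 1)
print(f"slope error = {a_est - a_true:+.4f}, intercept error = {b_est - b_true:+.4f}")

errors = {}
for x_new in (0.5, 7.0, 1000.0):
    predicted = a_est * x_new + b_est
    noise_free = a_true * x_new + b_true
    errors[x_new] = abs(predicted - noise_free)
    print(f"x = {x_new:7.1f}: absolute error = {errors[x_new]:.4f}")
```

Far from the training range the slope error dominates, so the absolute error grows roughly in proportion to the distance from the data.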

### statsmodels (statistically about regression)
<a href="https://www.statsmodels.org/stable/index.html"><img src="https://www.statsmodels.org/stable/_images/statsmodels-logo-v2-horizontal.svg" alt="banner" width="380" align="right"></a>
 = library for different statistical models
 - statistical approach
 - under active development
 - best of 2 worlds: Python and R
 - operates on `pandas.DataFrame`-s
 - homepage: [here](https://www.statsmodels.org/stable/index.html)


**Linear regression**
 - is a statistical method used to model the relationship between a *dependent variable* and one or more *independent variables* by *fitting a straight line* to the observed data points.
 - aim: to find the best-fitting line that minimizes the difference between the actual values and the predicted values.

In [None]:
%%capture
%pip install statsmodels

In [None]:
import statsmodels.formula.api as smf
import statsmodels.api as sm

In [None]:
# Define DataFrame from x_train and y_train
df_nums = pd.DataFrame(data=x_train, columns=['x_train'])
df_nums['y_train'] = y_train
print(df_nums.head(3))


# Create and fit the model using formula
import statsmodels.formula.api as smf
model_nums = smf.ols(formula='y_train ~ x_train', data=df_nums)
results_nums = model_nums.fit()

# Print summary
print(results_nums.summary())
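Besides the printed summary table, the fitted results object exposes the individual numbers as attributes. A minimal sketch on synthetic data follows; `df_demo` and `res_demo` are names invented for this example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df_demo = pd.DataFrame({"x": rng.random(30)})
df_demo["y"] = 2 * df_demo["x"] + 1 + rng.normal(0, 0.1, len(df_demo))

res_demo = smf.ols(formula="y ~ x", data=df_demo).fit()

print(res_demo.params)      # estimated intercept and slope (a pandas Series)
print(res_demo.conf_int())  # confidence intervals (95 % by default)
print(res_demo.pvalues)     # p-values of the individual coefficients
print(res_demo.rsquared)    # R^2 of the fit on the training data

# Cross-check against NumPy's least-squares line
slope_np, intercept_np = np.polyfit(df_demo["x"], df_demo["y"], 1)
print(slope_np, intercept_np)
```

The coefficient named `Intercept` corresponds to $\hat{b}$ and the one named after the column (`x` here) to $\hat{a}$.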

In [None]:
import matplotlib.pyplot as plt
import statsmodels.api as sm

fig = plt.figure(figsize=(7,5), dpi=100)
ax = fig.add_subplot(111)
sm.graphics.plot_ccpr(results_nums, "x_train", ax=ax)
plt.show()

Explanation: [https://www.geeksforgeeks.org/solving-linear-regression-in-python/](https://www.geeksforgeeks.org/solving-linear-regression-in-python/) ... building the table from the very beginning

#### But how does it really work inside, "under the hood"?

In [None]:
# Standalone simple linear regression example
from math import sqrt

# Calculate root mean squared error
def rmse_metric(actual, predicted):
	"""Return the root mean squared error (RMSE)."""
	sum_error = 0.0
	for i in range(len(actual)):
		prediction_error = predicted[i] - actual[i]
		sum_error += (prediction_error ** 2)
	mean_error = sum_error / float(len(actual))
	return sqrt(mean_error)

# Evaluate regression algorithm on training dataset
def evaluate_algorithm(dataset, algorithm):
	test_set = list()
	for row in dataset:
		row_copy = list(row)
		row_copy[-1] = None
		test_set.append(row_copy)
	predicted = algorithm(dataset, test_set)
	print(predicted)
	actual = [row[-1] for row in dataset]
	rmse = rmse_metric(actual, predicted)
	return rmse

# Calculate the mean value of a list of numbers
def mean(values):
	return sum(values) / float(len(values))

# Calculate the (unnormalised) covariance between x and y: the sum of products of deviations
def covariance(x, mean_x, y, mean_y):
	covar = 0.0
	for i in range(len(x)):
		covar += (x[i] - mean_x) * (y[i] - mean_y)
	return covar

# Calculate the (unnormalised) variance of a list of numbers: the sum of squared deviations
def variance(values, mean):
	return sum([(x-mean)**2 for x in values])

# Calculate coefficients
def coefficients(dataset):
	x = [row[0] for row in dataset]
	y = [row[1] for row in dataset]
	x_mean, y_mean = mean(x), mean(y)
	b1 = covariance(x, x_mean, y, y_mean) / variance(x, x_mean)
	b0 = y_mean - b1 * x_mean
	return [b0, b1]

# Simple linear regression algorithm
def simple_linear_regression(train, test):
	predictions = list()
	b0, b1 = coefficients(train)
	for row in test:
		yhat = b0 + b1 * row[0]
		predictions.append(yhat)
	return predictions

# Test simple linear regression
dataset = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
rmse = evaluate_algorithm(dataset, simple_linear_regression)
print(f"RMSE: {rmse:.3f}")
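The same computation can be cross-checked in a few lines of NumPy. This is a sketch using the same five points; the names `points`, `x_col` and `y_col` are chosen only for this example.

```python
import numpy as np

points = np.array([[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]], dtype=float)
x_col, y_col = points[:, 0], points[:, 1]

# Slope = sum of products of deviations / sum of squared deviations of x
b1 = np.sum((x_col - x_col.mean()) * (y_col - y_col.mean())) / np.sum((x_col - x_col.mean()) ** 2)
# Intercept: the fitted line passes through the point of means
b0 = y_col.mean() - b1 * x_col.mean()

rmse_np = np.sqrt(np.mean((y_col - (b0 + b1 * x_col)) ** 2))
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, RMSE = {rmse_np:.3f}")

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line
slope_np, intercept_np = np.polyfit(x_col, y_col, 1)
print(slope_np, intercept_np)
```

Both approaches give a slope of 0.8, an intercept of 0.4 and an RMSE of about 0.693 on these points.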

### A more realistic example:

#### Wine regression
(Cortez, Paulo, Cerdeira, A., Almeida, F., Matos, T., and Reis, J. (2009). Wine Quality. UCI Machine Learning Repository. https://doi.org/10.24432/C56S3T, [source of data](https://archive.ics.uci.edu/dataset/186/wine+quality))

In [None]:
df_wine = pd.read_csv("https://gist.githubusercontent.com/aleksejalex/e63e1f9cf8f9ff90e70dff71e0228ab7/raw/d27a61ce5a76dfa9f02f977d32381ff3a3eb9f86/wine.csv", delimiter=';')
df_wine.describe(include='all')

In [None]:
sns.pairplot(df_wine)
plt.show()

In [None]:
model_wine = smf.ols(formula='quality ~ alcohol', data=df_wine)
results_wine = model_wine.fit()

# Print summary
print(results_wine.summary())

A so-called saturated model (a perfect fit: as many parameters as data points); here the `*` operator adds all main effects and all their interactions:

In [None]:
model_wine_satur = smf.ols(formula='quality ~ fixed_acidity*volatile_acidity*citric_acid*residual_sugar*chlorides*free_sulfur_dioxide*total_sulfur_dioxide*density*pH*sulphates*alcohol', data=df_wine)
results_wine_satur = model_wine_satur.fit()

# Print summary
print(results_wine_satur.summary())

Model with main effects and interactions up to order 3:

In [None]:
import statsmodels.formula.api as smf

# Define the formula for the linear regression model with interactions up to order 3
formula = 'quality ~ (fixed_acidity + volatile_acidity + citric_acid + residual_sugar + chlorides + free_sulfur_dioxide + total_sulfur_dioxide + density + pH + sulphates + alcohol)**3'

# Fit the model
model_wine_satur_no_iter = smf.ols(formula=formula, data=df_wine)
results_wine_satur_no_iter = model_wine_satur_no_iter.fit()

# Print summary
print(results_wine_satur_no_iter.summary())


In [None]:
# Only the single 11-way interaction term (plus the intercept): `:` does not add main effects
model_wine_satur_no_iter = smf.ols(formula='quality ~ fixed_acidity:volatile_acidity:citric_acid:residual_sugar:chlorides:free_sulfur_dioxide:total_sulfur_dioxide:density:pH:sulphates:alcohol', data=df_wine)
results_wine_satur_no_iter = model_wine_satur_no_iter.fit()

# Print summary
print(results_wine_satur_no_iter.summary())
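The formulas above differ only in the operators `*`, `**` and `:`. The sketch below lists the terms each operator generates on a toy DataFrame with three predictors. The DataFrame `df_toy` and its columns `a`, `b`, `c` are invented for the illustration, and the data are pure noise.

```python
from math import comb

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df_toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=["y", "a", "b", "c"])

terms_by_formula = {}
for formula in ("y ~ a*b*c", "y ~ (a + b + c)**2", "y ~ a:b:c"):
    fitted = smf.ols(formula=formula, data=df_toy).fit()
    terms_by_formula[formula] = list(fitted.params.index)
    print(f"{formula:20s} -> {terms_by_formula[formula]}")

# How many coefficients (including the intercept) do 11 predictors give?
n_vars = 11
n_full = 2 ** n_vars                                       # `*`: every subset of the predictors
n_order3 = 1 + sum(comb(n_vars, k) for k in range(1, 4))   # `**3`: subsets of size 1 to 3
print(n_full, n_order3)
```

So `a*b*c` expands to all main effects and all interactions, `(...)**2` stops at pairwise interactions, and `a:b:c` keeps only the single highest-order product. With 11 predictors the full `*` expansion has 2048 coefficients, while the expansion up to order 3 has 232.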
