---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: Web Analytics

### 📋 **Topic**: Pandas (self-study)

### 🔗 **Link**: bit.ly/WA_LEC7_LINEAR

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.

---

# 🪄 1. Learning a Linear Model

Much of the power of data comes from its ability to help us predict unknown (e.g., future) quantities.

## 1.1 Getting and cleaning our data
Let's build a simple model to predict the mpg of cars from the other information we have available on those cars.

In [None]:
import pandas as pd

# url from which to retrieve the data (UCI Machine Learning Repository)
# more info on the dataset: https://archive.ics.uci.edu/dataset/9/auto+mpg
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original"

# define the names of the columns of our data
column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration',
                'model', 'origin', 'car_name']

# read the data from the url into a pandas dataframe
mpg_df = pd.read_csv(url,
                     sep=r"\s+",  # whitespace-separated columns (delim_whitespace is deprecated in recent pandas)
                     header=None,
                     names=column_names)

# display the first 5 rows of the dataframe
mpg_df.head(5)

In [None]:
mpg_df.describe()

First, let's separate our X (the attributes) from our y (the target variable, i.e., the attribute to be predicted).

In [3]:
# we'll use these columns as the features (predictors) of our model
predictors = ["weight", "acceleration", "horsepower", "cylinders", "displacement"]
# and this column as a target
target = "mpg"

# drop any NaNs for now
cleaned_df = mpg_df.dropna()

In [None]:
cleaned_df[predictors].head()

In [None]:
cleaned_df[target].head()

# 1.2 Linear regression
Linear regression is a fundamental statistical and machine learning method used to model the relationship between a dependent variable and one or more independent variables. Its main goal is to find the straight line that best fits the data, so that the line can be used to predict numerical values.

Simply put, when we say we're "plotting a linear regression", we're trying to draw a straight line that best represents the data according to the "least squares criterion".

## Example
Consider a scenario where we have some data on the number of hours studied and the respective grades achieved by a group of students. Let's create a simple linear regression to see if there's a relationship between these two variables.

In [None]:
%pip install matplotlib

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sample data: Hours studied vs. exam scores
data = {'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8],
        'Exam_Score': [50, 55, 62, 65, 70, 75, 78, 85]}
df = pd.DataFrame(data)

# Plotting the data
plt.scatter(df['Hours_Studied'], df['Exam_Score'], color='blue', label='Data Points')
plt.title('Hours Studied vs. Exam Score')
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.legend()
plt.show()

In the plot above, the data points seem to follow a linear trend: as the number of hours studied increases, the exam score generally increases as well.

## Fitting a linear regression model
When we say "fitting a linear regression to the data", we're trying to find the straight line that best represents our data. Mathematically, a straight line is represented as:
- $y=mx+c$, where $m$ is the slope and $c$ is the y-intercept, if we have only one predictor $x$
- $y=m_1x_1+m_2x_2+\dots+m_nx_n + c$, if we have multiple predictors $x_1, \dots, x_n$

In the context of our example, "fitting" would mean finding the best line -- the best possible slope $m^*$ and intercept $c^*$ -- that describes the relationship between hours studied and exam scores.

## What does *best* mean?

The code below has some candidate lines. Which one do you think is the best fit?

In [None]:
# Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sample data: Hours studied vs. exam scores
data = {'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8],
        'Exam_Score': [50, 55, 62, 65, 70, 75, 78, 85]}
df = pd.DataFrame(data)

# Generating some potential fit lines
lines = {
    'Line 1': [48 + 5*x for x in df['Hours_Studied']],
    'Line 2': [45 + 6*x for x in df['Hours_Studied']],
    'Line 3': [50 + 4.5*x for x in df['Hours_Studied']],
    'Line 4': [70 + 0.2*x for x in df['Hours_Studied']]
}

# Plotting the data and the potential fit lines
plt.scatter(df['Hours_Studied'], df['Exam_Score'], color='blue', label='Data Points')

for line, values in lines.items():
    plt.plot(df['Hours_Studied'], values, label=line)

plt.title('Potential Fit Lines: Hours Studied vs. Exam Score')
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.legend()
plt.show()


## Least Squares Criterion as an Objective Function

The fundamental idea behind the method of least squares is quite intuitive: Find the line (or curve) that minimizes the sum of the squared differences (residuals) between the observed values and the values that our model predicts.

### A. Why Squares?
Before diving into the math, it's worth noting why we consider the squared differences:
1. **Positivity**: Squaring ensures that negative and positive differences don't cancel each other out.
2. **Penalty**: Larger deviations are given more weight, meaning we're penalizing large deviations more than smaller ones.
3. **Differentiability**: The sum of squared differences is a smooth (differentiable) function of the model's coefficients, so we can minimize it with calculus by setting its derivatives to zero.
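
The first point can be checked numerically. The sketch below is only an illustration: it reuses the hours/scores data from the example and a deliberately poor "model" that always predicts the mean score (the names `flat_prediction` and `residuals` are ours, not part of the notebook).

```python
import numpy as np

# Hours studied and exam scores from the example above
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([50, 55, 62, 65, 70, 75, 78, 85])

# A deliberately poor "model": ignore the hours and always predict the mean score
flat_prediction = np.full(len(scores), scores.mean())
residuals = scores - flat_prediction

# Positive and negative residuals cancel, hiding the poor fit
print("Sum of raw residuals:    ", residuals.sum())
# Squaring removes the cancellation, so the misfit becomes visible
print("Sum of squared residuals:", (residuals ** 2).sum())
```

The raw residuals add up to zero even though this flat line ignores the trend entirely, which is why the raw sum is not a useful measure of fit.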

### B. Mathematical Representation

Suppose you have a set of data points $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$.
Now, let's assume that the relationship between $x$ and $y$ is linear and can be represented as $y = \beta_0 + \beta_1 x$, where
- $\beta_0$ is the y-intercept.
- $\beta_1$ is the slope of the line.

For each data point $x_i$, your model would then predict a y-value equal to $\hat{y}_i = \beta_0 + \beta_1 x_i$.

The residual (difference between observed and predicted) for this data point is $e_i = y_i - \hat{y}_i$.

The goal of the least squares method is to find values for $\beta_0$ and $\beta_1$ that minimize the sum of the squared residuals:

$$Q(\beta_0, \beta_1) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
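
The objective $Q$ can be evaluated directly for any candidate intercept and slope. As a rough sketch, the code below computes $Q$ for the four candidate lines plotted earlier. The helper `sum_squared_residuals` and the dictionary names are our own, introduced only for illustration.

```python
import numpy as np

# Hours studied and exam scores from the example above
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([50, 55, 62, 65, 70, 75, 78, 85])

def sum_squared_residuals(beta_0, beta_1, x, y):
    """Q(beta_0, beta_1): sum of squared differences between y and beta_0 + beta_1 * x."""
    residuals = y - (beta_0 + beta_1 * x)
    return float(np.sum(residuals ** 2))

# (intercept, slope) pairs of the four candidate lines plotted earlier
candidates = {
    'Line 1': (48, 5),
    'Line 2': (45, 6),
    'Line 3': (50, 4.5),
    'Line 4': (70, 0.2),
}

# Evaluate Q for every candidate line
q_values = {name: sum_squared_residuals(b0, b1, hours, scores)
            for name, (b0, b1) in candidates.items()}

for name, q in q_values.items():
    print(f"{name}: Q = {q:.2f}")

# The candidate with the smallest Q fits best among these four lines
best_candidate = min(q_values, key=q_values.get)
print("Smallest Q among the candidates:", best_candidate)
```

This only ranks the four hand-picked lines. The least squares line is the one that minimizes $Q$ over all possible intercepts and slopes, so it can do better than any of them.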

### C. Optimization (Optional)

To find the values of $\beta_0$ and $\beta_1$ that minimize $Q$, we set its partial derivatives with respect to $\beta_0$ and $\beta_1$ to zero. The resulting system of equations can be solved to get:
- $\beta_1 = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$
- $\beta_0 = \frac{\sum y - \beta_1 \sum x}{n}$

where:
- $n$ is the number of data points in our dataset.
- The sums $\sum x$, $\sum y$, $\sum xy$, and $\sum x^2$ are taken over all the data points.
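
The closed-form formulas can be applied by hand to the hours/scores data. The following is a minimal sketch using plain NumPy sums; the variable names (`sum_xy`, `sum_x2`, and so on) are ours and simply mirror the symbols in the formulas.

```python
import numpy as np

# Hours studied (x) and exam scores (y) from the example above
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([50, 55, 62, 65, 70, 75, 78, 85], dtype=float)
n = len(x)

# The four sums that appear in the formulas
sum_x, sum_y = x.sum(), y.sum()
sum_xy = (x * y).sum()
sum_x2 = (x ** 2).sum()

# Closed-form least squares estimates of the slope and intercept
beta_1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
beta_0 = (sum_y - beta_1 * sum_x) / n

print(f"slope beta_1     = {beta_1:.4f}")
print(f"intercept beta_0 = {beta_0:.4f}")
```

A library fit on the same data should agree with these values up to floating-point rounding.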

In [None]:
# Required Libraries
from sklearn.linear_model import LinearRegression

# Create and fit the model
model = LinearRegression()
X = df[['Hours_Studied']]  # Features
y = df['Exam_Score']  # Target variable

model.fit(X, y)
# Predict the values using the model
df['Predicted_Score'] = model.predict(X)

# Plotting the data, potential fit lines, and best fit line
plt.scatter(df['Hours_Studied'], df['Exam_Score'], color='blue', label='Data Points')

for line, values in lines.items():
    plt.plot(df['Hours_Studied'], values, label=line)

# Plotting the best fit line
plt.plot(df['Hours_Studied'], df['Predicted_Score'], color='red', linestyle='--', label='Best Fit Line')

plt.title('Potential Fit Lines and Best Fit Line: Hours Studied vs. Exam Score')
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.legend()
plt.show()

One advantage of fitting a linear regression model is that it allows us to easily inspect the coefficients of the predictor variables. In our case, we can see how much the exam score is expected to increase (or decrease) for every additional hour of study.

In [None]:
# inspect the coeffs
print(model.coef_)

This suggests that each additional hour of study is associated with an increase of about 4.8 points in the exam score.
- Note: this is an association in the data, and not necessarily a causal relationship.
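
To make the interpretation of the coefficient concrete, the sketch below refits the same model and compares the predictions for two hypothetical students who studied 5 and 6 hours. The `new_students` dataframe is made up for illustration; the model and data are the ones from the example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same sample data as in the example above
df = pd.DataFrame({'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8],
                   'Exam_Score': [50, 55, 62, 65, 70, 75, 78, 85]})

# Fit the one-predictor linear regression
model = LinearRegression().fit(df[['Hours_Studied']], df['Exam_Score'])

slope = model.coef_[0]        # one coefficient per predictor
intercept = model.intercept_  # predicted score at zero hours of study

# Hypothetical new students who studied 5 and 6 hours
new_students = pd.DataFrame({'Hours_Studied': [5, 6]})
pred_5, pred_6 = model.predict(new_students)

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
print(f"predicted score at 5 hours: {pred_5:.2f}")
print(f"predicted score at 6 hours: {pred_6:.2f}")
print(f"difference: {pred_6 - pred_5:.2f}")  # equals the slope
```

The difference between the two predictions equals the slope, which is what "points per additional hour" means for a linear model.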

# 1.4 Back to cars


In [None]:
from sklearn import linear_model

# build a model
linear = linear_model.LinearRegression()

# fit the model to the data!
# the first argument is the predictors, the second argument the target variable!
linear.fit(cleaned_df[predictors], cleaned_df[target])

# inspect the coefficients (zip pairs each predictor name with its coefficient)
pd.DataFrame([dict(zip(predictors, linear.coef_))])

We fitted our linear regression. Let's now see how it predicts!

In [None]:
# get some predictions from the model
preds = linear.predict(cleaned_df[predictors])

predictions_df = cleaned_df.assign(predictions=preds)

predictions_df.head(10)

In [None]:
# and let's try a scatter plot of our predicted mpg against the true value
predictions_df.plot(kind="scatter", 
                    x="mpg", 
                    y="predictions", 
                    c='forestgreen', 
                    xlim=(0,50), 
                    ylim=(0,50))

### Question: Why didn't we plot the best fitting line -- like before?
With a single predictor, the fitted model is a line in a two-dimensional plot of the predictor against the target, so it is easy to draw. With multiple predictors (five here), the fitted model is a hyperplane in a higher-dimensional space, which cannot be drawn on a two-dimensional plot. Instead, we plot the actual values of the target variable against the predicted values; the closer the points lie to the diagonal, the better the predictions.
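
A plot of actual versus predicted values can be complemented with a numeric summary of how far the predictions are from the truth. The sketch below is one possible way to do this, not part of the original notebook: the helpers `rmse` and `r_squared` are our own, and the numbers are made up for illustration rather than taken from the Auto MPG data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error, in the target's units."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Share of the target's variance captured by the predictions (1.0 is a perfect fit)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
    return float(1 - ss_res / ss_tot)

# Made-up values purely for illustration (NOT from the Auto MPG data)
example_true = [18.0, 25.0, 30.0, 22.0]
example_pred = [20.0, 24.0, 28.0, 23.0]

example_rmse = rmse(example_true, example_pred)
example_r2 = r_squared(example_true, example_pred)
print(f"RMSE = {example_rmse:.3f}")
print(f"R^2  = {example_r2:.3f}")
```

In the notebook, the same helpers could be called with `cleaned_df[target]` and `preds`. Keep in mind that these predictions are made on the same data the model was fitted to, so the numbers would describe fit quality rather than performance on unseen cars.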