# Linear regression in PyTorch

Welcome to the `02_linear_regression` notebook. It is part of a portfolio that showcases foundational concepts and techniques in PyTorch. This notebook focuses on linear regression, a fundamental machine learning algorithm for predicting a continuous target variable from one or more input features.

In this notebook, I cover essential topics including generating synthetic data, defining and training a linear regression model, and evaluating its performance. I'll also explore optimizations and best practices to improve model accuracy and efficiency. 

Through various exercises, this notebook demonstrates practical applications of linear regression in PyTorch, providing a solid foundation for more advanced projects.

## Table of contents
1. [Understanding linear regression](#understanding-linear-regression)
2. [Setting up the environment](#setting-up-the-environment)
3. [Generating synthetic data](#generating-synthetic-data)
4. [Defining the linear regression model](#defining-the-linear-regression-model)
5. [Loss function and optimizer](#loss-function-and-optimizer)
6. [Training the linear regression model](#training-the-linear-regression-model)
7. [Evaluating the model](#evaluating-the-model)
8. [Saving and loading the model](#saving-and-loading-the-model)
9. [Optimizations](#optimizations)
10. [Conclusion](#conclusion)
11. [Further exercises](#further-exercises)


## Understanding linear regression

Linear regression is a fundamental statistical method used in machine learning and data analysis to model the relationship between one or more input variables (features) and a continuous output variable (target). The primary objective of linear regression is to find the best-fitting straight line (or hyperplane in higher dimensions) that can predict the target variable from the input features.

### Key concepts

#### 1. Simple vs. multiple linear regression
- **Simple linear regression**: Involves a single input variable and aims to model the relationship between this variable and the target variable. The goal is to find a straight line that best describes this relationship.
- **Multiple linear regression**: Involves two or more input variables. The model tries to fit a hyperplane in a multidimensional space to predict the target variable.

#### 2. The best-fit line
The best-fit line (or hyperplane) is the one that minimizes the sum of the squared differences between the actual target values and the predicted values. These differences are called residuals. The smaller the residuals, the better the model fits the data.

#### 3. Model parameters
Linear regression models have coefficients (weights) and an intercept (bias). Each coefficient gives the change in the predicted target for a one-unit change in the corresponding input feature, with the other features held fixed. The intercept is the predicted value of the target when all input features are zero.

#### 4. Assumptions of linear regression
For linear regression to provide reliable results, several assumptions must be met:
- **Linearity**: The relationship between the input features and the target variable should be linear.
- **Independence**: The residuals (errors) should be independent of each other.
- **Homoscedasticity**: The residuals should have constant variance across all levels of the input features.
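- **Normality**: The residuals should be approximately normally distributed. This matters mainly for confidence intervals and hypothesis tests; the least-squares estimates themselves do not require it.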

#### 5. Model evaluation
To evaluate the performance of a linear regression model, several metrics are commonly used:
- **Mean Squared Error (MSE)**: Measures the average squared difference between actual and predicted values. Lower values indicate a better fit.
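- **R-squared (R²)**: Represents the proportion of variance in the target variable that can be explained by the input features. For a least-squares fit with an intercept, evaluated on the training data, values range from 0 to 1, with higher values indicating a better fit. On held-out data R² can be negative, which means the model predicts worse than always returning the mean of the target.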

#### 6. Overfitting and underfitting
- **Overfitting**: Occurs when the model learns the noise in the training data, leading to poor generalization to new data. This can happen if the model is too complex.
- **Underfitting**: Occurs when the model is too simple to capture the underlying pattern in the data, leading to poor performance on both training and new data.

#### 7. Regularization
Regularization techniques are used to prevent overfitting by adding a penalty to the model's complexity. Common regularization methods include:
- **Ridge regression (L2 Regularization)**: Adds a penalty proportional to the square of the coefficients.
- **Lasso regression (L1 Regularization)**: Adds a penalty proportional to the absolute value of the coefficients, which can lead to sparse models with some coefficients being zero.

### Applications

#### Economics and finance
- **Stock price prediction**: Estimating future stock prices based on historical data and market indicators.
- **Risk management**: Assessing the relationship between risk factors and asset returns.
- **Economic forecasting**: Predicting economic indicators such as GDP growth, inflation rates, and unemployment rates.
- **Credit scoring**: Evaluating the likelihood of a borrower defaulting on a loan based on their financial history.

#### Healthcare
- **Disease progression**: Modeling the progression of diseases over time based on patient data.
- **Healthcare costs**: Predicting healthcare costs for individuals or populations based on demographic and medical history data.
- **Medical research**: Identifying relationships between various factors (e.g., lifestyle, genetics) and health outcomes.
- **Patient outcomes**: Forecasting patient recovery times or survival rates based on treatment variables.

#### Marketing and sales
- **Sales forecasting**: Estimating future sales based on historical sales data, seasonality, and market trends.
- **Customer lifetime value**: Predicting the long-term value of customers based on their purchasing behavior.
- **Advertising effectiveness**: Assessing the impact of advertising campaigns on sales or brand awareness.
- **Market analysis**: Identifying trends and relationships in consumer behavior and market data.

#### Environmental science
- **Climate modeling**: Analyzing the relationship between greenhouse gas emissions and global temperatures.
- **Pollution levels**: Predicting pollution levels based on industrial activities, traffic patterns, and weather conditions.
- **Water quality**: Estimating water quality parameters based on land use, agricultural practices, and rainfall data.
- **Renewable energy forecasting**: Predicting the output of renewable energy sources like solar and wind based on weather conditions.

#### Real estate
- **Property valuation**: Estimating the value of properties based on features such as location, size, and amenities.
- **Rental price prediction**: Forecasting rental prices based on market conditions and property characteristics.
- **Market trends**: Analyzing trends in the real estate market to guide investment decisions.

#### Social sciences
- **Sociological research**: Examining relationships between social factors (e.g., education, income) and various outcomes (e.g., crime rates, health).
- **Education analysis**: Predicting student performance based on socioeconomic background, school resources, and attendance.

#### Engineering and manufacturing
- **Quality control**: Monitoring and predicting product quality based on production parameters.
- **Process optimization**: Modeling relationships between input variables and output quality to optimize manufacturing processes.
- **Failure prediction**: Estimating the likelihood of equipment failure based on usage data and maintenance history.

#### Sports and performance analysis
- **Player performance**: Predicting athlete performance based on historical performance data and training metrics.
- **Game outcome prediction**: Estimating the outcomes of sports events based on team statistics and player performance.

#### Agriculture
- **Crop yield prediction**: Forecasting crop yields based on factors such as weather conditions, soil quality, and farming practices.
- **Livestock health**: Modeling the health and productivity of livestock based on feeding practices and environmental conditions.

#### Transportation and logistics
- **Demand forecasting**: Predicting transportation demand for public transit systems or ride-sharing services.
- **Logistics optimization**: Estimating delivery times and optimizing routes based on traffic patterns and order volumes.

#### Insurance
- **Premium calculation**: Determining insurance premiums based on risk factors such as age, health status, and driving history.
- **Claim prediction**: Estimating the likelihood of insurance claims based on customer data and historical claim records.

#### Technology and internet
- **User behavior analysis**: Predicting user engagement and retention based on interaction data from websites and apps.
- **Recommendation systems**: Modeling user preferences to provide personalized content or product recommendations.

### Maths

Linear regression aims to model the relationship between a dependent variable $ y $ and one or more independent variables $ x_1, x_2, \ldots, x_p $. The goal is to find the best-fitting linear equation that predicts the value of $ y $ from the input variables.

#### Simple linear regression
For simplicity, let's start with simple linear regression, where there is only one independent variable $ x $.

##### 1. The linear model
The model assumes a linear relationship between the independent variable $ x $ and the dependent variable $ y $:
$$ y = \beta_0 + \beta_1 x + \epsilon $$
- $ \beta_0 $ is the intercept (the value of $ y $ when $ x = 0 $).
- $ \beta_1 $ is the slope (the change in $ y $ for a one-unit change in $ x $).
- $ \epsilon $ is the error term, representing the deviation of the observed values from the true line.

##### 2. The objective
The objective of linear regression is to estimate the coefficients $ \beta_0 $ and $ \beta_1 $ such that the sum of the squared differences between the observed values $ y_i $ and the predicted values $ \hat{y_i} $ is minimized. This method is known as **Ordinary Least Squares (OLS)**.

##### 3. The cost function
The cost function, also known as the **mean squared error (MSE)**, is defined as:
$$ J(\beta_0, \beta_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 $$
where:
$$ \hat{y_i} = \beta_0 + \beta_1 x_i $$
- $ n $ is the number of data points.
- $ y_i $ are the actual values.
- $ \hat{y_i} $ are the predicted values.

##### 4. Minimizing the cost function
To find the values of $ \beta_0 $ and $ \beta_1 $ that minimize the cost function, we take the partial derivatives of $ J(\beta_0, \beta_1) $ with respect to $ \beta_0 $ and $ \beta_1 $, set them to zero, and solve for $ \beta_0 $ and $ \beta_1 $.

Setting the partial derivatives to zero gives:
$$ \frac{\partial J}{\partial \beta_0} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0 $$
$$ \frac{\partial J}{\partial \beta_1} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) x_i = 0 $$

Solving these equations simultaneously gives the estimates:
$$ \beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} $$
$$ \beta_0 = \bar{y} - \beta_1 \bar{x} $$
where:
- $ \bar{x} $ is the mean of the $ x $ values.
- $ \bar{y} $ is the mean of the $ y $ values.
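
The closed-form estimates above can be checked numerically. The following is a minimal sketch, assuming a small hand-made dataset that lies exactly on $ y = 1 + 2x $ (the data and variable names are illustrative, not part of the notebook):

```python
import torch

# Hypothetical toy data lying exactly on y = 1 + 2x.
x = torch.tensor([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations divided by the sum of squared x-deviations.
beta_1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
# Intercept: forces the fitted line through the point of means.
beta_0 = y_bar - beta_1 * x_bar

print(f"beta_0 = {beta_0.item():.4f}, beta_1 = {beta_1.item():.4f}")
```

Because the data contain no noise, the estimates recover the intercept 1 and slope 2 used to build them.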

#### Multiple linear regression
When there are multiple independent variables, say $ p $ of them (with $ n $ still denoting the number of data points), the model extends to:
$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p + \epsilon $$

##### 1. The vector form
This can be written in vector form for convenience:
$$ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} $$
where:
- $ \mathbf{y} $ is the vector of observed values.
- $ \mathbf{X} $ is the matrix of input features, including a column of ones for the intercept.
- $ \boldsymbol{\beta} $ is the vector of coefficients.
- $ \boldsymbol{\epsilon} $ is the vector of error terms.

##### 2. The cost function
The cost function in multiple linear regression is similarly:
$$ J(\boldsymbol{\beta}) = \frac{1}{n} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) $$

##### 3. Solving for the coefficients
To minimize the cost function, we take the derivative with respect to $ \boldsymbol{\beta} $ and set it to zero:
$$ \frac{\partial J}{\partial \boldsymbol{\beta}} = -\frac{2}{n} \mathbf{X}^T (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) = 0 $$

Solving for $ \boldsymbol{\beta} $ gives:
$$ \boldsymbol{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} $$
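This equation is known as the **normal equation** and provides the least-squares estimates of the coefficients. It requires $ \mathbf{X}^T \mathbf{X} $ to be invertible, which fails when some features are perfectly collinear.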
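
As a rough illustration of the matrix solution, the sketch below builds a design matrix with a leading column of ones and solves the linear system with `torch.linalg.solve` instead of forming the inverse explicitly. The data, the coefficient values, and the choice of `float64` are assumptions made for the example:

```python
import torch

torch.manual_seed(0)

# Hypothetical data: 100 samples, 2 features, known coefficients, no noise.
n = 100
features = torch.randn(n, 2, dtype=torch.float64)
true_beta = torch.tensor([0.5, 2.0, -1.0], dtype=torch.float64)  # [intercept, b1, b2]

# Design matrix X with a leading column of ones for the intercept.
X = torch.cat([torch.ones(n, 1, dtype=torch.float64), features], dim=1)
y = X @ true_beta

# Solve (X^T X) beta = X^T y directly rather than computing the inverse.
beta_hat = torch.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```

Solving the system is generally preferred over calling an explicit inverse because it is cheaper and numerically more stable.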

#### Interpretation of the coefficients
- **Intercept ($ \beta_0 $)**: The expected value of $ y $ when all $ x $ variables are zero.
- **Slope ($ \beta_i $)**: The change in the expected value of $ y $ for a one-unit change in $ x_i $, holding all other variables constant.

#### Evaluating model performance
1. **Mean Squared Error (MSE)**: The average of the squares of the residuals, providing a measure of the model's accuracy.
2. **R-squared ($ R^2 $)**: The proportion of the variance in the dependent variable that is predictable from the independent variables.
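
For reference, one common way to write $ R^2 $ in the notation used above is as one minus the ratio of the residual sum of squares to the total sum of squares:
$$ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y_i})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $$
The numerator equals $ n $ times the MSE, so on a fixed dataset a lower MSE corresponds to a higher $ R^2 $.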

## Setting up the environment

##### **Q1: How do you install the necessary PyTorch libraries using a Jupyter notebook?**

##### **Q2: How do you import the required libraries for linear regression?**

##### **Q3: How do you check the version of PyTorch installed?**
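
A minimal sketch, assuming PyTorch is already installed in the active environment. The printed values depend on the local install:

```python
import torch

# torch.__version__ is a version string; its value depends on the install.
print(torch.__version__)

# Reports whether a CUDA-capable GPU is visible to this install.
print(torch.cuda.is_available())
```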

## Generating synthetic data

##### **Q4: How do you generate synthetic data for linear regression in PyTorch?**

##### **Q5: How do you add noise to the synthetic data to simulate real-world scenarios?**
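
One possible approach is to sample inputs, compute targets from a known line, and then add Gaussian noise. The ground-truth line $ y = 2x + 1 $, the sample count, and the noise level below are assumptions chosen for illustration:

```python
import torch

torch.manual_seed(42)  # make the random draws reproducible

# Assumed ground truth for illustration: y = 2x + 1.
true_weight, true_bias = 2.0, 1.0

# 100 evenly spaced inputs in [0, 10], shaped (100, 1) as nn.Linear expects.
X = torch.linspace(0, 10, 100).unsqueeze(1)
y_clean = true_weight * X + true_bias

# Gaussian noise with an assumed standard deviation of 1.0.
noise_std = 1.0
y = y_clean + noise_std * torch.randn_like(X)

print(X.shape, y.shape)
```

Keeping `y_clean` around makes it easy to compare a fitted line against the noise-free relationship later.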

##### **Q6: How do you visualize the synthetic data using `matplotlib`?**

##### **Q7: How do you split the synthetic data into training and testing sets?**
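
A simple way to split tensors is to shuffle the row indices with `torch.randperm` and slice. The sketch below assumes an 80/20 split and a small synthetic dataset, both chosen only for illustration:

```python
import torch

torch.manual_seed(0)

# Hypothetical dataset: 100 noisy points around y = 2x + 1.
X = torch.linspace(0, 10, 100).unsqueeze(1)
y = 2.0 * X + 1.0 + torch.randn_like(X)

# Shuffle the indices, then take the first 80% for training.
indices = torch.randperm(X.shape[0])
n_train = int(0.8 * X.shape[0])
train_idx, test_idx = indices[:n_train], indices[n_train:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)
```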

## Defining the linear regression model

##### **Q8: How do you define a simple linear regression model using `nn.Module` in PyTorch?**
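
A minimal sketch of such a model wraps a single `nn.Linear` layer. The class name `LinearRegressionModel` and the `in_features` argument are naming choices made for this example:

```python
import torch
from torch import nn


class LinearRegressionModel(nn.Module):
    """Computes y_hat = x @ w^T + b for a configurable number of input features."""

    def __init__(self, in_features: int = 1):
        super().__init__()
        self.linear = nn.Linear(in_features, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)


model = LinearRegressionModel()
print(model)  # shows the layer structure
```

With one input feature the model has exactly two learnable parameters: one weight and one bias.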

##### **Q9: How do you initialize the weights and biases of the linear regression model?**
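
`nn.Linear` initializes its parameters automatically, but they can be overwritten with the in-place functions in `torch.nn.init`. The particular choice below (small normal weights, zero bias) is an assumption for illustration, not a recommendation from the notebook:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(1, 1)

# The trailing underscore marks in-place functions that overwrite the tensor.
nn.init.normal_(model.weight, mean=0.0, std=0.01)
nn.init.zeros_(model.bias)

print(model.weight.data, model.bias.data)
```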

##### **Q10: How do you print the model summary to view its structure?**

## Loss function and optimizer

##### **Q11: How do you define a loss function for linear regression in PyTorch?**
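
Mean squared error is the usual choice for linear regression and is available as `nn.MSELoss`. The tensors below are made-up values used only to show the call:

```python
import torch
from torch import nn

criterion = nn.MSELoss()  # the default reduction averages over all elements

predictions = torch.tensor([[2.5], [0.0], [2.0]])
targets = torch.tensor([[3.0], [-0.5], [2.0]])

# Squared errors are 0.25, 0.25 and 0.0, so the loss is their mean.
loss = criterion(predictions, targets)
print(loss.item())
```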

##### **Q12: How do you choose and configure an optimizer for your linear regression model?**

##### **Q13: What is the purpose of the learning rate in the optimizer, and how do you set it?**
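
The learning rate scales how far each parameter moves along the negative gradient per update. The sketch below uses plain SGD with an assumed `lr=0.01` and checks a single update by hand:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(1, 1)

# lr=0.01 is an assumed starting value for illustration.
learning_rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# One backward pass on a toy input to obtain a gradient.
x = torch.ones(1, 1)
w_before = model.weight.detach().clone()
model(x).sum().backward()
grad = model.weight.grad.clone()

# Plain SGD update: w_new = w_old - lr * grad.
optimizer.step()
print(w_before.item(), model.weight.item(), grad.item())
```

A larger learning rate takes bigger steps and can diverge; a smaller one is more stable but converges more slowly.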

## Training the linear regression model

##### **Q14: How do you create a training loop for linear regression in PyTorch?**

##### **Q15: How do you update the model parameters during training?**

##### **Q16: How do you calculate and print the training loss during each epoch?**
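
A basic full-batch training loop could look like the sketch below. The dataset, learning rate, epoch count, and logging interval are all assumptions chosen so the example converges quickly:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical data: noisy samples around y = 2x + 1.
X = torch.linspace(0, 1, 100).unsqueeze(1)
y = 2.0 * X + 1.0 + 0.05 * torch.randn_like(X)

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for epoch in range(1000):
    optimizer.zero_grad()           # clear gradients from the previous step
    loss = criterion(model(X), y)   # forward pass and loss
    loss.backward()                 # backpropagate to compute gradients
    optimizer.step()                # update the weight and bias
    losses.append(loss.item())      # keep the history for later plotting
    if (epoch + 1) % 200 == 0:
        print(f"epoch {epoch + 1:4d}  loss {loss.item():.4f}")

print(model.weight.item(), model.bias.item())
```

The `losses` list holds one value per epoch, which is the series a loss curve would plot.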

##### **Q17: How do you visualize the training loss over epochs using `matplotlib`?**

##### **Q18: How do you use early stopping to prevent overfitting during training?**
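
PyTorch has no built-in early stopping, so it is usually written by hand. The sketch below tracks the validation loss, keeps the best parameters seen so far, and stops after `patience` epochs without an improvement of at least `min_delta`. Both names and their values are assumptions for this example:

```python
import copy

import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical data, split into training and validation parts.
X = torch.linspace(0, 1, 120).unsqueeze(1)
y = 2.0 * X + 1.0 + 0.1 * torch.randn_like(X)
perm = torch.randperm(120)
X_train, y_train = X[perm[:90]], y[perm[:90]]
X_val, y_val = X[perm[90:]], y[perm[90:]]

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

patience, min_delta = 10, 1e-4  # assumed values
best_val, best_state, bad_epochs = float("inf"), None, 0

for epoch in range(2000):
    model.train()
    optimizer.zero_grad()
    criterion(model(X_train), y_train).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val), y_val).item()

    if val_loss < best_val - min_delta:
        # Meaningful improvement: remember these parameters and reset the counter.
        best_val, bad_epochs = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"stopping at epoch {epoch + 1}, best val loss {best_val:.4f}")
            break

# Restore the parameters that achieved the best validation loss.
model.load_state_dict(best_state)
```

A simple linear model rarely overfits data like this, so here early stopping mostly saves epochs once the validation loss has flattened.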

## Evaluating the model

##### **Q19: How do you make predictions using your trained linear regression model?**

##### **Q20: How do you evaluate the model's performance using metrics like Mean Squared Error (MSE)?**
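
For evaluation, the model is switched to eval mode and the forward pass is wrapped in `torch.no_grad()`. In the sketch below the "trained" model is a stand-in whose parameters are set by hand to $ y = 2x + 1 $, and the test points are made up:

```python
import torch
from torch import nn

# Stand-in for a trained model: parameters set by hand (an assumption for illustration).
model = nn.Linear(1, 1)
with torch.no_grad():
    model.weight.fill_(2.0)
    model.bias.fill_(1.0)

X_test = torch.tensor([[0.0], [1.0], [2.0]])
y_test = torch.tensor([[1.0], [3.5], [5.0]])

model.eval()
with torch.no_grad():  # gradient tracking is not needed for inference
    y_pred = model(X_test)

# Predictions are 1, 3 and 5, so the only error is 0.5 on the second point.
mse = nn.functional.mse_loss(y_pred, y_test).item()
print(y_pred.squeeze().tolist(), mse)
```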

##### **Q21: How do you visualize the model's predictions against the actual data using `matplotlib`?**

##### **Q22: How do you calculate the R-squared ($ R^2 $) value to evaluate the goodness-of-fit for your model?**
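
The coefficient of determination can be computed directly from tensors. The helper `r2_score` below is a hypothetical function written for this example, and the sample values are made up:

```python
import torch


def r2_score(y_true: torch.Tensor, y_pred: torch.Tensor) -> float:
    """Return 1 - SS_res / SS_tot for the given targets and predictions."""
    ss_res = ((y_true - y_pred) ** 2).sum()          # residual sum of squares
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
    return (1.0 - ss_res / ss_tot).item()


y_true = torch.tensor([1.0, 2.0, 3.0, 4.0])
y_pred = torch.tensor([1.1, 1.9, 3.2, 3.8])

# SS_res = 0.10 and SS_tot = 5.0 for these values.
r2 = r2_score(y_true, y_pred)
print(r2)
```

A perfect prediction gives a score of 1, and predicting the mean of `y_true` everywhere gives 0.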

## Saving and loading the model

##### **Q23: How do you save the trained linear regression model in PyTorch?**

##### **Q24: How do you load a saved linear regression model in PyTorch?**

##### **Q25: How do you save and load the model's state dictionary in PyTorch?**
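
Saving the state dictionary stores only the learned parameters, so the same architecture must be recreated before loading. The sketch below writes to a temporary directory to stay self-contained. The file name is hypothetical:

```python
import os
import tempfile

import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(1, 1)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "linear_regression.pt")  # hypothetical file name

    # Save only the parameters, not the whole Python object.
    torch.save(model.state_dict(), path)

    # Recreate the same architecture, then load the parameters into it.
    restored = nn.Linear(1, 1)
    restored.load_state_dict(torch.load(path))

restored.eval()  # switch to inference mode before making predictions
print(restored.state_dict())
```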

## Optimizations

##### **Q26: How do you perform hyperparameter tuning to improve the performance of your linear regression model?**

##### **Q27: How do you implement learning rate scheduling to adjust the learning rate during training?**
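
Schedulers in `torch.optim.lr_scheduler` wrap an optimizer and change its learning rate over time. The sketch below uses `StepLR` with assumed settings that halve the rate every 10 epochs:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical noise-free data on y = 2x + 1.
X = torch.linspace(0, 1, 50).unsqueeze(1)
y = 2.0 * X + 1.0

model = nn.Linear(1, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Multiply the learning rate by gamma every step_size epochs (assumed values).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    optimizer.zero_grad()
    criterion(model(X), y).backward()
    optimizer.step()
    scheduler.step()  # called once per epoch, after optimizer.step()

# After 30 epochs the rate has been halved three times: 0.1 -> 0.0125.
print(scheduler.get_last_lr())
```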

##### **Q28: How do you normalize or standardize your data before training a linear regression model?**
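
Standardization subtracts the mean and divides by the standard deviation of each feature. In the sketch below the statistics come from the training split only and are then reused for the test split. The data and split sizes are assumptions:

```python
import torch

torch.manual_seed(0)

# Hypothetical features on very different scales.
X = torch.randn(200, 2) * torch.tensor([1.0, 1000.0]) + torch.tensor([5.0, -300.0])
X_train, X_test = X[:160], X[160:]

# Compute statistics on the training split only to avoid leaking test information.
mean = X_train.mean(dim=0, keepdim=True)
std = X_train.std(dim=0, keepdim=True)

X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std

print(X_train_std.mean(dim=0), X_train_std.std(dim=0))
```

Features on a common scale usually let gradient descent use a single learning rate effectively across all weights.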

##### **Q29: How do you handle multicollinearity in linear regression models?**

##### **Q30: How do you implement polynomial regression using PyTorch to capture non-linear relationships?**
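
Polynomial regression can reuse a linear model by feeding it powers of the input as extra features, so the model stays linear in its parameters. The quadratic target, degree, and training settings below are assumptions for illustration:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical quadratic data: y = 3x^2 - 2x + 1 plus small noise.
x = torch.linspace(-1, 1, 200).unsqueeze(1)
y = 3.0 * x**2 - 2.0 * x + 1.0 + 0.05 * torch.randn_like(x)

# Polynomial features [x, x^2]; nn.Linear supplies the constant term as its bias.
X_poly = torch.cat([x, x**2], dim=1)

model = nn.Linear(2, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(2000):
    optimizer.zero_grad()
    loss = criterion(model(X_poly), y)
    loss.backward()
    optimizer.step()

# The learned weights should be close to [-2, 3] and the bias close to 1.
print(model.weight.data, model.bias.data, loss.item())
```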

## Conclusion

## Further exercises

##### **Q31: How do you extend the linear regression model to handle multiple features (multiple linear regression)?**
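
With `nn.Linear`, handling several features mostly comes down to changing `in_features`. The sample and feature counts below are assumed sizes used to show the resulting shapes:

```python
import torch
from torch import nn

torch.manual_seed(0)

n_samples, n_features = 64, 3  # assumed sizes
X = torch.randn(n_samples, n_features)

# Only in_features changes; the loss, optimizer and training loop stay the same.
model = nn.Linear(in_features=n_features, out_features=1)
y_hat = model(X)

print(model.weight.shape, model.bias.shape, y_hat.shape)
```

The weight matrix has one column per feature, and the output keeps one prediction per sample.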

##### **Q32: How do you use PyTorch to perform linear regression on a real-world dataset, such as the Boston Housing dataset?**

##### **Q33: How do you implement ridge regression (L2 regularization) using PyTorch to prevent overfitting?**
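
One way to get an L2 penalty in PyTorch is the optimizer's `weight_decay` argument, which for SGD adds `weight_decay * w` to each parameter's gradient. The sketch below trains two copies of the same model, with and without the penalty, and compares weight norms. The data, penalty strength, and the helper `train` are assumptions for this example:

```python
import copy

import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical data with 5 features and known weights.
X = torch.randn(100, 5)
true_w = torch.tensor([[1.5], [-2.0], [0.5], [3.0], [-1.0]])
y = X @ true_w + 0.1 * torch.randn(100, 1)

plain = nn.Linear(5, 1)
ridge = copy.deepcopy(plain)  # identical starting point


def train(model: nn.Module, weight_decay: float) -> float:
    """Train with SGD and return the L2 norm of the learned weights."""
    optimizer = torch.optim.SGD(
        [
            # Penalize the weights only; the bias is left unregularized.
            {"params": [model.weight], "weight_decay": weight_decay},
            {"params": [model.bias], "weight_decay": 0.0},
        ],
        lr=0.1,
    )
    for _ in range(500):
        optimizer.zero_grad()
        nn.functional.mse_loss(model(X), y).backward()
        optimizer.step()
    return model.weight.detach().norm().item()


plain_norm = train(plain, weight_decay=0.0)
ridge_norm = train(ridge, weight_decay=0.5)
print(plain_norm, ridge_norm)
```

The penalized model ends with a smaller weight norm, which is the shrinkage effect ridge regression is meant to produce.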

##### **Q34: How do you implement lasso regression (L1 regularization) using PyTorch to enforce sparsity in the model?**
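
An L1 penalty is not exposed through `weight_decay`, so a common approach is to add it to the loss by hand. The sketch below assumes data in which only two of five features matter and an arbitrary penalty strength `l1_lambda`:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical data where only the first 2 of 5 features influence the target.
X = torch.randn(200, 5)
true_w = torch.tensor([[3.0], [-2.0], [0.0], [0.0], [0.0]])
y = X @ true_w + 0.1 * torch.randn(200, 1)

model = nn.Linear(5, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
l1_lambda = 0.1  # assumed penalty strength

for epoch in range(1000):
    optimizer.zero_grad()
    mse = nn.functional.mse_loss(model(X), y)
    l1_penalty = model.weight.abs().sum()  # the bias is left unpenalized
    loss = mse + l1_lambda * l1_penalty
    loss.backward()
    optimizer.step()

print(model.weight.data)
```

Plain gradient descent pushes the irrelevant weights close to zero but rarely to exactly zero. Exact sparsity usually needs a proximal (soft-thresholding) update or a dedicated solver.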

##### **Q35: How do you visualize the learned weights of the linear regression model to interpret feature importance?**