# Linear Regression in Python

by Mirko Stojiljković. [Here](https://realpython.com/linear-regression-in-python/)

## Table of Contents

1. Regression
    - What Is Regression?
    - When Do You Need Regression?
2. Linear Regression
    - Problem Formulation
    - Regression Performance
    - Simple Linear Regression
    - Multiple Linear Regression
    - Polynomial Regression
    - Underfitting and Overfitting
3. Implementing Linear Regression in Python
    - Python Packages for Linear Regression
    - Simple Linear Regression With scikit-learn
    - Multiple Linear Regression With scikit-learn
    - Polynomial Regression With scikit-learn
    - Advanced Linear Regression With statsmodels
4. Beyond Linear Regression
5. Conclusion

By the end of this article, you'll have learned:

- What linear regression is
- What linear regression is used for
- How linear regression works
- How to implement linear regression in Python, step by step

### 1. Regression
#### 1.1 What Is Regression?
Regression searches for relationships among variables.

In regression analysis, you usually consider some phenomenon of interest and have a number of observations. Each observation has two or more features. Following the assumption that (at least) one of the features depends on the others, you try to establish a relation among them.

- The __dependent features__ are called the dependent variables, `outputs`, or `responses`.
- The __independent features__ are called the independent variables, `inputs`, or `predictors`.

#### 1.2 When Do You Need Regression?

Use regression to answer whether and __how one phenomenon influences another, or how several variables are related__.

Regression is also useful when you want `to forecast a response using a new set of predictors`. 

### 2. Linear Regression
Linear regression is among the simplest regression methods. One of its main advantages is the ease of interpreting results.

#### 2.1 Problem Formulation
Regression is about determining the best predicted weights, that is, the weights corresponding to the smallest residuals.

To get the best weights, you usually __minimize the sum of squared residuals__ (`SSR`) for all observations $i = 1, \dots, n$: $\mathrm{SSR} = \sum_i (y_i - f(\mathbf{x}_i))^2$. This approach is __called the method of ordinary least squares__.
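The SSR definition above can be sketched in a few lines of NumPy. This is only an illustration: the four observations and the helper name `ssr` are made up for the example, and `np.linalg.lstsq` stands in for whichever least-squares solver you prefer.

```python
import numpy as np

# Hypothetical observations (not from the article): one input, one output.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])

def ssr(b0, b1):
    """Sum of squared residuals for the line f(x) = b0 + b1 * x."""
    residuals = y - (b0 + b1 * x)
    return float(np.sum(residuals ** 2))

# Design matrix with a column of ones so that b0 acts as the intercept.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: the weights that minimize the SSR.
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)

print(ssr(b0, b1))        # SSR at the least-squares weights
print(ssr(b0 + 0.1, b1))  # nudging a weight away can only increase the SSR
```

Comparing the two printed values shows the defining property of ordinary least squares: no other pair of weights gives a smaller SSR on the same observations.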

#### 2.2 Regression Performance


The __coefficient of determination__, denoted as $R^2$, tells you how much of the variation in $y$ can be explained by its dependence on $\mathbf{x}$ under the particular regression model. A larger $R^2$ __indicates a better fit and means that the model can better explain the variation of the output with different inputs__.

The value $R^2 = 1$ corresponds to $\mathrm{SSR} = 0$, that is, to the __perfect fit__, since the predicted and actual responses match exactly.
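One common way to compute the coefficient of determination is $R^2 = 1 - \mathrm{SSR} / \mathrm{SST}$, where $\mathrm{SST} = \sum_i (y_i - \bar{y})^2$ is the total variation of the output around its mean. The sketch below assumes that definition; the function name `r_squared` and the sample values are hypothetical.

```python
import numpy as np

def r_squared(y_actual, y_predicted):
    """Coefficient of determination, computed as 1 - SSR / SST."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    ssr = np.sum((y_actual - y_predicted) ** 2)      # unexplained variation
    sst = np.sum((y_actual - y_actual.mean()) ** 2)  # total variation
    return 1.0 - ssr / sst

# Hypothetical responses, chosen only for illustration.
y = [3.0, 5.0, 7.0, 9.0]

print(r_squared(y, y))          # perfect predictions: SSR = 0, so R^2 = 1
print(r_squared(y, [6.0] * 4))  # always predicting the mean: R^2 = 0
```

The second call shows the other reference point: a model that only ever predicts the mean of the outputs explains none of the variation.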

#### 2.3 Simple Linear Regression
Simple or single-variate linear regression is the simplest case of linear regression with a single independent variable, $\mathbf{x} = x$.

![fig-spl-lin-reg.webp](attachment:fig-spl-lin-reg.webp)

When implementing simple linear regression, you typically start with a given set of __input-output ($x$-$y$) pairs__ (green circles). These pairs are your __observations__. For example, the leftmost observation (green circle) has the input $x = 5$ and the actual output (response) $y = 5$. The next one has $x = 15$ and $y = 20$, and so on.

The __estimated regression function__ (black line) has the equation $f(x) = b_0 + b_1 x$. Your goal is to calculate the optimal values of the predicted weights $b_0$ and $b_1$ that __minimize SSR and determine the estimated regression function__.

The __predicted responses__ (red squares) are the points on the regression line that correspond to the input values. For example, for the input $x = 5$, the predicted response is $f(5) = 8.33$ (represented with the leftmost red square).

The __residuals__ (vertical dashed gray lines) can be calculated as $y_i - f(\mathbf{x}_i) = y_i - b_0 - b_1 x_i$ for $i = 1, \dots, n$. They are the `distances between the green circles and red squares`. When you implement linear regression, you are actually trying to __minimize these distances and make the red squares as close to the predefined green circles as possible__.
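A minimal sketch of these steps with NumPy follows. Only the first two observations, (5, 5) and (15, 20), are stated in the text; the remaining four are assumptions picked so that the fit reproduces the predicted response of about 8.33 at an input of 5. The closed-form expressions for the weights are the standard least-squares solution for a single input.

```python
import numpy as np

# First two pairs come from the text; the other four are assumed.
x = np.array([5.0, 15.0, 25.0, 35.0, 45.0, 55.0])
y = np.array([5.0, 20.0, 14.0, 32.0, 22.0, 38.0])

# Closed-form least-squares weights for f(x) = b0 + b1 * x.
x_mean, y_mean = x.mean(), y.mean()
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b0 = y_mean - b1 * x_mean

# Predicted responses (the red squares) and residuals (the dashed lines).
y_pred = b0 + b1 * x
residuals = y - y_pred

print(round(b0, 3), round(b1, 3))  # intercept and slope
print(round(y_pred[0], 2))         # predicted response at x = 5, about 8.33
print(residuals)                   # one residual per observation
```

With an intercept in the model, the residuals of a least-squares fit sum to zero, which is a quick sanity check on the computed weights.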

#### 2.4 Multiple Linear Regression
Multiple or multivariate linear regression is a case of linear regression with two or more independent variables.

If there are just two independent variables, the estimated regression function is $f(x_1, x_2) = b_0 + b_1 x_1 + b_2 x_2$. It `represents a regression plane in a three-dimensional space`. The goal of regression is to determine the values of the weights $b_0$, $b_1$, and $b_2$ such that this plane is as close as possible to the actual responses and yields the __minimal SSR__.
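The two-input case can be sketched the same way. The data below are hypothetical and generated from an exact plane, so the recovered weights are known in advance; real observations would include noise and the weights would only be estimates.

```python
import numpy as np

# Hypothetical data lying exactly on the plane y = 1 + 2*x1 + 3*x2.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2

# One column per weight: ones for b0, then x1 for b1 and x2 for b2.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares weights that minimize the SSR.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = weights

print(b0, b1, b2)  # close to 1, 2, 3 because the data are noise-free
```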


#### 2.5 Polynomial Regression
Polynomial regression assumes a polynomial dependence between the output and the inputs and, consequently, a polynomial estimated regression function.

The simplest example of polynomial regression has a single independent variable, and the estimated regression function is a polynomial of degree 2: $f(x) = b_0 + b_1 x + b_2 x^2$.

Now, remember that you want to calculate $b_0$, $b_1$, and $b_2$, which __minimize SSR__.
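Because the function is still linear in the weights, a degree-2 fit can be treated as a linear regression on the transformed inputs $1$, $x$, and $x^2$. The sketch below assumes hypothetical data generated from an exact quadratic, so the expected weights are known.

```python
import numpy as np

# Hypothetical data lying exactly on y = 1 - 2*x + 0.5*x^2.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 - 2.0 * x + 0.5 * x ** 2

# Treat 1, x, and x^2 as three separate input columns.
X = np.column_stack([np.ones_like(x), x, x ** 2])

# The same least-squares step as in the linear case.
(b0, b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)

print(b0, b1, b2)  # close to 1, -2, 0.5 for this noise-free data
```

The only change from the linear case is the extra $x^2$ column in the design matrix; the minimization itself is unchanged.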

#### 2.6 Underfitting and Overfitting

- __Underfitting__ occurs `when a model can't accurately capture the dependencies among data`, usually as a consequence of its own `simplicity`. It often yields a low $R^2$ with known data and generalizes poorly when applied to new data.

- __Overfitting__ happens when a model learns both dependencies among data and random fluctuations. In other words, a model learns the existing data too well. When applied to known data, such models usually yield a high $R^2$. However, they often don't generalize well and have a significantly lower $R^2$ when used with new data.

Example of underfitted, well-fitted and overfitted models

![fig-und-over-poly-reg.webp](attachment:fig-und-over-poly-reg.webp)

- __Top left__ plot shows a linear regression line that has a low $R^2$. This is likely an example of underfitting.
- __Top right__ plot illustrates polynomial regression with the degree equal to 2. The model has a value of $R^2$ that is satisfactory in many cases and shows trends nicely.
- __Bottom left__ plot presents polynomial regression with the degree equal to 3. The value of $R^2$ is higher than in the preceding cases. The model shows some signs of overfitting, especially for input values close to 60, where the line starts decreasing although the actual data don't show that.
- __Bottom right__ plot shows the perfect fit: six points and a polynomial line of degree 5 (or higher) yield $R^2 = 1$.
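The effect of the polynomial degree on $R^2$ can be reproduced with a short experiment. The six data points below are assumptions and may not match those in the figure; `Polynomial.fit` from NumPy is used here as one convenient least-squares polynomial fitter, and `r_squared` is a hypothetical helper.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Six assumed observations; the figure's actual data may differ.
x = np.array([5.0, 15.0, 25.0, 35.0, 45.0, 55.0])
y = np.array([5.0, 20.0, 14.0, 32.0, 22.0, 38.0])

def r_squared(y_actual, y_predicted):
    """Coefficient of determination, computed as 1 - SSR / SST."""
    ssr = np.sum((y_actual - y_predicted) ** 2)
    sst = np.sum((y_actual - y_actual.mean()) ** 2)
    return 1.0 - ssr / sst

scores = {}
for degree in (1, 2, 3, 5):
    # Least-squares polynomial fit of the given degree.
    model = Polynomial.fit(x, y, deg=degree)
    scores[degree] = r_squared(y, model(x))
    print(degree, round(scores[degree], 4))
```

On the known data, $R^2$ never decreases as the degree grows, and a degree-5 polynomial passes through all six points. That high score says nothing about new data, which is why a perfect fit on the training points is a warning sign rather than a goal.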



### 3. Implementing Linear Regression in Python
#### 3.1 Python Packages for Linear Regression
#### 3.2 Simple Linear Regression With scikit-learn
#### 3.3 Multiple Linear Regression With scikit-learn
#### 3.4 Polynomial Regression With scikit-learn
#### 3.5 Advanced Linear Regression With statsmodels

### 4. Beyond Linear Regression
### 5. Conclusion