# Outline

1. What is Regression?
2. Purpose of Regression
3. Types of Regression Models
4. Python Packages for Linear Regression

## 1. What is Regression?

Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables).

Linear regression is the most common form of this technique. In its simplest form, called simple linear regression and usually fitted by ordinary least squares (OLS), it establishes the linear relationship between two variables based on a line of best fit. Linear regression is thus graphically depicted using a straight line, with the slope defining how a change in one variable impacts the other. The y-intercept of a linear regression relationship represents the value of one variable when the value of the other is zero. Non-linear regression models also exist, but are more complex.

![image.png](attachment:image.png)

Regression problems are prevalent in machine learning, and regression analysis is the most commonly used technique for solving them. It is based on data modelling and entails determining the best-fit line: the line that lies as close as possible to the data points overall, typically by minimizing the sum of squared vertical distances between the line and each data point. While there are other techniques for regression analysis, linear and logistic regression are the most widely used. Ultimately, the type of regression model we adopt is determined by the nature of the data.
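
To make the idea of a best-fit line concrete, here is a minimal sketch of the closed-form least-squares solution for a single predictor. The small arrays and the names `slope`, `intercept`, and `residuals` are chosen for this illustration only.

```python
import numpy as np

# Small illustrative dataset (one predictor, one response).
x = np.array([5, 15, 25, 35, 45, 55], dtype=float)
y = np.array([5, 20, 14, 32, 22, 38], dtype=float)

# Closed-form least-squares estimates for a straight line:
# slope = S_xy / S_xx, intercept = mean(y) - slope * mean(x)
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Residuals: vertical distances between observed and fitted values.
residuals = y - (intercept + slope * x)

print(f"slope: {slope:.2f}, intercept: {intercept:.4f}")
print(f"sum of squared residuals: {np.sum(residuals ** 2):.4f}")
```

No other straight line gives a smaller sum of squared residuals for this data, which is what "best fit" means in the least-squares sense.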

## 2. Purpose of Regression
Regression analysis is used for one of two purposes: predicting the value of the dependent variable when the values of the independent variables are known, or estimating the effect of an independent variable on the dependent variable.

## 3. Types of Regression Models

#### 3.1 Linear Regression
Linear regression is the most extensively used modelling technique. It assumes a linear relationship between a dependent variable (Y) and an independent variable (X) and describes it with a regression line, also known as a best-fit line. The relationship is defined as Y = c + m*X + e, where ‘c’ denotes the intercept, ‘m’ denotes the slope of the line, and ‘e’ is the error term.

The linear regression model can be simple (with one dependent variable and one independent variable) or multiple (with one dependent variable and more than one independent variable).
![image.png](attachment:image.png)
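
The multiple case can be sketched with scikit-learn's `LinearRegression` as follows. The data is hypothetical and noise-free, generated from y = 1 + 2*x1 + 3*x2, so that the recovered coefficients are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2.
X = np.array([[0, 1], [1, 0], [2, 2], [3, 1], [4, 3]], dtype=float)
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

# One row per observation, one column per independent variable.
model = LinearRegression().fit(X, y)

print(f"intercept: {model.intercept_:.3f}")
print(f"coefficients: {np.round(model.coef_, 3)}")
```

Because the data contains no noise, the fitted intercept and coefficients should match the generating values up to floating-point error.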

#### 3.2 Logistic Regression
Logistic regression is applicable when the dependent variable is discrete. In its basic (binary) form, the technique computes the probability of one of two mutually exclusive outcomes such as pass/fail, true/false, or 0/1. The target variable can take only one of two values, a sigmoid curve represents its relationship to the independent variable, and the predicted probability lies between 0 and 1.
![image-2.png](attachment:image-2.png)
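
A minimal sketch of the sigmoid relationship, using scikit-learn's `LogisticRegression`. The pass/fail data, the `sigmoid` helper, and the variable names are hypothetical and only serve this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data: hours studied (input) and pass/fail outcome (0 or 1).
hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)

# predict_proba returns one column per class; column 1 is P(passed = 1).
proba = clf.predict_proba(hours)[:, 1]

# The same probabilities follow from the sigmoid of the linear score.
manual = sigmoid(clf.intercept_[0] + clf.coef_[0, 0] * hours[:, 0])

print(np.round(proba, 3))
```

The linear part of the model produces an unbounded score, and the sigmoid converts that score into a probability.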

#### 3.3 Polynomial Regression
Polynomial regression is used to represent a non-linear relationship between dependent and independent variables. It can be treated as a special case of multiple linear regression in which powers of the independent variable serve as additional predictors, so the fitted curve bends rather than following a straight line.
![image-3.png](attachment:image-3.png)
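
One common way to fit such a curve is to expand the input into polynomial terms and then apply ordinary linear regression. The sketch below assumes hypothetical noise-free data from y = 2 + 0.5*x² and uses scikit-learn's `PolynomialFeatures`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noise-free data from the curve y = 2 + 0.5 * x**2.
x = np.arange(-3, 4, dtype=float).reshape(-1, 1)
y = 2 + 0.5 * x[:, 0] ** 2

# Expand x into the columns [x, x**2]; the model stays linear in its coefficients.
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

model = LinearRegression().fit(x_poly, y)

print(f"intercept: {model.intercept_:.3f}")
print(f"coefficients for [x, x^2]: {np.round(model.coef_, 3)}")
```

The curve is non-linear in x, but the model is still linear in its coefficients, which is why the same `LinearRegression` class can fit it.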

#### 3.4 Ridge Regression
Ridge regression is applied when the data exhibits multicollinearity, that is, when the independent variables are highly correlated. While least squares estimates remain unbiased under multicollinearity, their variances are large enough that the estimated coefficients can lie far from the true values. Ridge regression reduces the standard errors by adding a small amount of bias to the regression estimates.

The lambda (λ) parameter in the ridge regression equation controls the strength of this penalty: larger values shrink the coefficients more, which mitigates the multicollinearity problem.
![image-4.png](attachment:image-4.png)
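
The effect of the penalty can be sketched by comparing ordinary least squares with scikit-learn's `Ridge` on two nearly identical predictors. The data is synthetic and hypothetical, and in scikit-learn the penalty strength λ is passed as the `alpha` argument.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Hypothetical data: x2 is almost a copy of x1, so the two are highly correlated.
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.01, size=50)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha plays the role of lambda

# Ridge shrinks the coefficient vector toward zero, which stabilizes the estimates.
print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
```

With correlated predictors, the least squares coefficients tend to be large and unstable, while the ridge coefficients are smaller in overall magnitude.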

#### 3.5 Lasso Regression
Like ridge regression, the lasso (Least Absolute Shrinkage and Selection Operator) technique penalizes the size of the regression coefficients, but it penalizes their absolute values rather than their squares. As a result, the lasso also performs variable selection: it can shrink some coefficient values to exactly zero.
![image-5.png](attachment:image-5.png)
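
A minimal sketch of this variable selection with scikit-learn's `Lasso`, assuming synthetic data in which only the first of three features affects the response. The penalty strength `alpha=0.5` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Hypothetical data: only the first of three features influences y.
X = rng.normal(size=(100, 3))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)

# The L1 penalty sets the coefficients of the irrelevant features to exactly zero
# and shrinks the remaining one below its true value of 3.
print("Lasso coefficients:", lasso.coef_)
```

The features with zero coefficients are effectively dropped from the model, which is the sense in which the lasso selects variables.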

#### 3.6 Quantile Regression
Quantile regression is an extension of the linear regression technique. Instead of the conditional mean, it models a conditional quantile of the dependent variable, such as the median. It is employed when the assumptions of linear regression are not met or when the data contains outliers, and it is widely used in statistics and econometrics.
![image-6.png](attachment:image-6.png)
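
Quantile regression is commonly formulated as minimizing the so-called pinball loss. The sketch below is not a full regression model: it only shows, on a hypothetical sample with one outlier, that the constant minimizing this loss at q = 0.5 is the median rather than the mean. The helper `pinball_loss` is written for this illustration.

```python
import numpy as np

def pinball_loss(y, prediction, q):
    """Average pinball (quantile) loss of a constant prediction at quantile q."""
    diff = y - prediction
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

# Hypothetical sample with one extreme outlier.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Try every integer candidate and keep the one with the lowest loss at q = 0.5.
candidates = np.arange(0, 101)
losses = [pinball_loss(y, c, q=0.5) for c in candidates]
best = candidates[int(np.argmin(losses))]

print(f"mean: {y.mean()}, pinball minimizer at q=0.5: {best}")
```

The outlier pulls the mean far from the bulk of the data, while the minimizer of the pinball loss stays at the median, which illustrates why quantile regression is less sensitive to outliers.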

## 4. Python Packages for Linear Regression

It’s time to start implementing linear regression in Python. To do this, you’ll apply the proper packages and their functions and classes.

NumPy is a fundamental Python scientific package that allows many high-performance operations on single-dimensional and multidimensional arrays. It also offers many mathematical routines. Of course, it’s open-source.

The package scikit-learn is a widely used Python library for machine learning, built on top of NumPy and some other packages.

#### 4.1 Simple Linear Regression With scikit-learn
You’ll start with the simplest case, which is simple linear regression. There are five basic steps when you’re implementing linear regression:

1. Import the packages and classes that you need.
2. Provide data to work with, and apply appropriate transformations if needed.
3. Create a regression model and fit it with existing data.
4. Check the results of model fitting to know whether the model is satisfactory.
5. Apply the model for predictions.

These steps are more or less general for most of the regression approaches and implementations.

#### Step 1: Import packages and classes
The first step is to import the package numpy and the class LinearRegression from sklearn.linear_model:

In [1]:
import numpy as np
from sklearn.linear_model import LinearRegression

#### Step 2: Provide data
The second step is defining data to work with. The inputs (regressors, 𝑥) and output (response, 𝑦) should be arrays or similar objects. This is the simplest way of providing data for regression:

In [2]:
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))

y = np.array([5, 20, 14, 32, 22, 38])

In [3]:
x

array([[ 5],
       [15],
       [25],
       [35],
       [45],
       [55]])

In [4]:
y

array([ 5, 20, 14, 32, 22, 38])

As you can see, x has two dimensions, and x.shape is (6, 1), while y has a single dimension, and y.shape is (6,).
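
The shapes matter because scikit-learn estimators expect the input as a two-dimensional array with one row per sample and one column per feature. A short sketch of what `.reshape((-1, 1))` does (the name `x_flat` is introduced here only for illustration):

```python
import numpy as np

x_flat = np.array([5, 15, 25, 35, 45, 55])  # shape (6,): one-dimensional
x = x_flat.reshape((-1, 1))                  # shape (6, 1): six samples, one feature
y = np.array([5, 20, 14, 32, 22, 38])        # shape (6,): one response per sample

print(x_flat.shape, x.shape, y.shape)
```

The `-1` tells NumPy to infer the number of rows from the length of the array, so the same call works for any number of samples.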

#### Step 3: Create a model and fit it

The next step is to create a linear regression model and fit it using the existing data.

Create an instance of the class LinearRegression, which will represent the regression model:

In [5]:
model = LinearRegression()

This statement creates the variable model as an instance of LinearRegression. You can provide several optional parameters to LinearRegression:

* fit_intercept is a Boolean that determines whether to calculate the intercept 𝑏₀ (True) or to consider it equal to zero (False). It defaults to True.
* normalize is a Boolean that, if True, normalizes the input variables. It defaults to False, in which case it doesn’t normalize the input variables. Note that this parameter was deprecated in scikit-learn 1.0 and removed in version 1.2, so it is only available in older releases.
* copy_X is a Boolean that decides whether to copy (True) or overwrite the input variables (False). It’s True by default.
* n_jobs is either an integer or None. It represents the number of jobs used in parallel computation. It defaults to None, which usually means one job. -1 means to use all available processors.

Your model as defined above uses the default values of all parameters.
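
As a small illustration of one of these parameters, the sketch below fits the same data twice, once with the defaults and once with `fit_intercept=False`. The variable names `with_intercept` and `through_origin` are chosen for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

# Default: the intercept is estimated from the data.
with_intercept = LinearRegression().fit(x, y)

# fit_intercept=False forces the line through the origin.
through_origin = LinearRegression(fit_intercept=False).fit(x, y)

print(f"default:        b0={with_intercept.intercept_:.3f}, b1={with_intercept.coef_[0]:.3f}")
print(f"through origin: b0={through_origin.intercept_:.3f}, b1={through_origin.coef_[0]:.3f}")
```

Forcing the intercept to zero changes the slope as well, because the line has to compensate for the missing offset.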

It’s time to start using the model. First, you need to call .fit() on model:

In [6]:
model.fit(x, y)

With .fit(), you calculate the optimal values of the weights 𝑏₀ and 𝑏₁, using the existing input and output, x and y, as the arguments. In other words, .fit() fits the model. It returns self, which is the variable model itself. That’s why you can replace the last two statements with this one:

In [7]:
model = LinearRegression().fit(x, y)

This statement does the same thing as the previous two. It’s just shorter.

#### Step 4: Get results

Once you have your model fitted, you can get the results to check whether the model works satisfactorily and to interpret it.

You can obtain the coefficient of determination, 𝑅², with .score() called on model:

In [8]:
r_sq = model.score(x, y)

print(f"coefficient of determination: {r_sq}")

coefficient of determination: 0.715875613747954


When you’re applying .score(), the arguments are also the predictor x and response y, and the return value is 𝑅².
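
To see what `.score()` reports, 𝑅² can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. This is a minimal sketch; the names `ss_res`, `ss_tot`, and `r_sq_manual` are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
model = LinearRegression().fit(x, y)

# R^2 = 1 - SS_res / SS_tot
y_pred = model.predict(x)
ss_res = np.sum((y - y_pred) ** 2)    # variation the model leaves unexplained
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r_sq_manual = 1 - ss_res / ss_tot

print(f"manual R^2: {r_sq_manual}")
print(f"model.score: {model.score(x, y)}")
```

Both values should agree up to floating-point error. A value closer to 1 means the line explains more of the variation in y.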

The attributes of model are .intercept_, which represents the coefficient 𝑏₀, and .coef_, which represents 𝑏₁:

In [9]:
print(f"intercept: {model.intercept_}")

intercept: 5.633333333333333


In [10]:
print(f"slope: {model.coef_}")

slope: [0.54]


#### Step 5: Predict response

Once you have a satisfactory model, then you can use it for predictions with either existing or new data. To obtain the predicted response, use .predict():



In [11]:
y_pred = model.predict(x)

print(f"predicted response:\n{y_pred}")

predicted response:
[ 8.33333333 13.73333333 19.13333333 24.53333333 29.93333333 35.33333333]


In [12]:
y

array([ 5, 20, 14, 32, 22, 38])
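
The predicted response is simply 𝑏₀ + 𝑏₁𝑥 evaluated for each input, and the same model can be applied to inputs it has not seen. The sketch below reproduces the prediction by hand and then predicts for a hypothetical new input `x_new`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
model = LinearRegression().fit(x, y)

# predict() evaluates b0 + b1 * x for every row of the input.
y_manual = model.intercept_ + model.coef_[0] * x[:, 0]

# New inputs must have the same two-dimensional layout as the training data.
x_new = np.arange(5).reshape((-1, 1))  # hypothetical inputs 0, 1, 2, 3, 4
y_new = model.predict(x_new)

print(f"manual prediction:\n{y_manual}")
print(f"prediction for new inputs:\n{y_new}")
```

Comparing the predictions with the observed `y` shows how far each point lies from the fitted line.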