# Multiple Linear Regression Using scikit-learn



__What is Multiple Linear Regression and when should you use it?__

Multiple Linear Regression is a supervised learning algorithm used for predicting a continuous outcome variable (also known as the dependent variable) based on multiple independent variables (also known as predictors, features or attributes).

It extends simple linear regression, which uses a single independent variable, by modeling the relationship between several independent variables and one dependent variable.
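In symbols, the model is usually written as below. The notation is the common textbook one and is an addition here, not taken from the text above: $y$ is the dependent variable, $x_1, \dots, x_p$ are the $p$ independent variables, $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are the coefficients, and $\varepsilon$ is the error term.

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
$$

Fitting the model means estimating the $\beta$ values from data, typically by minimizing the sum of squared errors.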

Here are a few examples of when you might use Multiple Linear Regression:

- You have a dataset of housing prices and you want to predict the price of a house based on its square footage, number of bedrooms, and location.
- You have a dataset of stock prices and you want to predict the future price of a stock based on its past prices, dividends, and trading volume.
- You have a dataset of weather data and you want to predict temperature based on humidity, wind speed, and pressure.

In general, Multiple Linear Regression is a good choice when you have multiple independent variables and you want to model the relationship between them and the dependent variable. Note that it assumes a linear relationship between the independent variables and the dependent variable. It also assumes that the independent variables are not strongly correlated with each other, that is, that there is no severe multicollinearity.

__What is multicollinearity?__

Multicollinearity is a statistical phenomenon that occurs when two or more independent variables in a multiple linear regression model are highly correlated with each other. This correlation can cause the coefficient estimates of the independent variables to become unstable and hard to interpret.
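One common way to look for multicollinearity is the variance inflation factor (VIF). The sketch below is illustrative only: the helper `variance_inflation_factors` and the synthetic columns are assumptions made for this example, not part of the notebook. It regresses each feature on the others and reports `1 / (1 - R^2)`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def variance_inflation_factors(features: pd.DataFrame) -> pd.Series:
    """Return the VIF of each column: 1 / (1 - R^2) from regressing it on the others."""
    vifs = {}
    for column in features.columns:
        others = features.drop(columns=[column])
        target = features[column]
        # R^2 of predicting this column from all the other columns
        r_squared = LinearRegression().fit(others, target).score(others, target)
        vifs[column] = 1.0 / (1.0 - r_squared)
    return pd.Series(vifs, name="VIF")


# Synthetic data (hypothetical): x3 is almost a copy of x1, x2 is unrelated
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + rng.normal(scale=0.05, size=200)
demo = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(demo.corr().round(2))          # pairwise correlations
vif = variance_inflation_factors(demo)
print(vif.round(1))                  # large values flag collinear columns
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity, but the threshold is a convention rather than a strict rule.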



__When shouldn't you use Multiple Linear Regression?__

While Multiple Linear Regression is a widely used and powerful tool for predicting a continuous outcome variable based on multiple independent variables, there are certain situations where it may not be the best choice.

Here are a few examples of when you should not use Multiple Linear Regression:

- The relationship between the independent and dependent variables is non-linear: Multiple Linear Regression assumes a linear relationship between the independent and dependent variables. If the relationship is non-linear, a non-linear model such as polynomial regression or decision tree may be a better choice.
- The dependent variable is binary or categorical: Multiple Linear Regression is designed for predicting continuous outcome variables. If the dependent variable is binary or categorical, a logistic regression or a classification algorithm such as Random Forest or Support Vector Machine would be more appropriate.
- There is a high degree of multicollinearity between the independent variables: Multiple Linear Regression assumes that the independent variables are not strongly correlated with each other. If multicollinearity is present, it can lead to unreliable and unstable coefficient estimates. In this case, you can drop one of the correlated variables, apply a dimensionality reduction technique such as principal component analysis, or use a regularized model such as ridge regression.
- The data has too many missing values: Multiple Linear Regression needs complete data, with no missing values in the features or the target. If the data has too many missing values, it will be difficult to get accurate results. In this case, you can use techniques such as imputation, or switch to a model that tolerates missing values, such as some tree-based implementations.

It's important to consider the nature of your data and the assumptions of the model before using Multiple Linear Regression, as using it in an inappropriate situation can lead to inaccurate or unreliable results.
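As a minimal sketch of the imputation option mentioned above, the example below fills missing feature values with the column mean before fitting. The synthetic data and the name `imputed_model` are assumptions for illustration, and mean imputation is only one of several possible strategies. scikit-learn's `LinearRegression` raises an error if it is given `NaN` values directly, which is why the imputer comes first in the pipeline.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic data (hypothetical): y depends linearly on two features plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Knock out roughly 10% of the feature values to simulate missing data
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

# Fill each missing value with its column mean, then fit the regression
imputed_model = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
imputed_model.fit(X_missing, y)

print("R-squared on the imputed data: %.2f" % imputed_model.score(X_missing, y))
```

Keeping the imputer inside the pipeline means the same fill values are reused automatically when predicting on new data.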

In [1]:
# Importing Important Libraries 
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("error", message=".*check_inverse*.", category=UserWarning, append=False)

In [2]:
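# Create a DataFrame with 50 rows and 3 columns of random integers in [25, 75).
# Note: y is drawn independently of x1 and x2, so there is no real relationship to learn.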
df = pd.DataFrame({'x1': np.random.randint(25,75,50),
                   'x2': np.random.randint(25,75,50),
                   'y': np.random.randint(25,75,50)})

In [3]:
df.head()

   x1  x2   y
0  43  38  64
1  49  70  73
2  47  66  73
3  38  71  68
4  61  39  32


In [4]:
x=df.drop(columns=['y'])
x.head()

   x1  x2
0  43  38
1  49  70
2  47  66
3  38  71
4  61  39


In [5]:
y=df['y']
y.head()

0    64
1    73
2    73
3    68
4    32
Name: y, dtype: int32

In [6]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [7]:
# Create a multiple linear regression model
model = LinearRegression()

In [8]:
# Fit the model to the training data
model.fit(X_train, y_train)

LinearRegression()

In [9]:
# Use the model to make predictions on the test data
y_pred = model.predict(X_test)

In [10]:
# Print the mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

Mean squared error: 305.09


In [11]:
# Print the R-squared score
print("R-squared score: %.2f" % r2_score(y_test, y_pred))

R-squared score: -0.01
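An R-squared score near zero is expected here, because `y` was generated at random, independently of `x1` and `x2`. For contrast, here is a sketch on synthetic data where `y` really does depend linearly on the features. The coefficients 10, 2 and -0.5, the noise level, and all `demo` names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
demo = pd.DataFrame({"x1": rng.integers(25, 75, 200),
                     "x2": rng.integers(25, 75, 200)})
# Assumed ground truth for illustration: y = 10 + 2*x1 - 0.5*x2 + noise
demo["y"] = 10 + 2 * demo["x1"] - 0.5 * demo["x2"] + rng.normal(scale=5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo[["x1", "x2"]], demo["y"], test_size=0.2, random_state=42)

demo_model = LinearRegression().fit(X_tr, y_tr)

# The fitted coefficients should land close to the assumed values 2 and -0.5
print("Intercept:", demo_model.intercept_)
print("Coefficients:", dict(zip(X_tr.columns, demo_model.coef_)))

demo_r2 = r2_score(y_te, demo_model.predict(X_te))
print("R-squared score: %.2f" % demo_r2)
```

Because the signal is much larger than the noise in this synthetic setup, the test-set R-squared should come out close to 1.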


In [12]:
# New data, using the same column names the model was trained on
new_data = pd.DataFrame([[65, 85]], columns=['x1', 'x2'])

In [13]:
# Use the model to make predictions on new data
predictions = model.predict(new_data)
print("Predictions: ", predictions)

Predictions:  [43.07754782]




__What are the advantages of Multiple Linear Regression?__

Multiple Linear Regression is a powerful tool that allows you to predict a continuous outcome variable based on multiple independent variables.
Here are some of the advantages of using it:

- Easy to implement: Multiple Linear Regression is relatively simple to implement and understand, making it a popular choice among practitioners.
- Flexibility: Multiple Linear Regression can handle multiple independent variables, making it a flexible tool for modeling complex relationships.
- Linearity: Multiple Linear Regression models the outcome as a weighted sum of the independent variables. When the true relationship is approximately linear, this makes it easy to see how each independent variable relates to the outcome variable.
- Efficiency: Multiple Linear Regression is computationally efficient and fast, making it suitable for large datasets.
- Model interpretability: Multiple Linear Regression is relatively easy to interpret and the coefficients can be used to estimate the effect of the independent variables on the outcome variable.

It's important to note that Multiple Linear Regression assumes that the error term is normally distributed, homoscedastic (constant variance), and independent across observations. The data itself does not need to be normally distributed. It also assumes that there is no severe multicollinearity between the independent variables. These assumptions should be checked before relying on the model.

__What are the disadvantages of Multiple Linear Regression?__

While Multiple Linear Regression is a powerful tool for predicting a continuous outcome variable based on multiple independent variables, it also has some limitations and potential disadvantages.

Here are a few examples:

- Linearity assumption: Multiple Linear Regression assumes a linear relationship between the independent and dependent variables. If the relationship is non-linear, the model cannot capture it and the predictions will be systematically off.
- Assumes no multicollinearity: Multiple Linear Regression assumes that the independent variables are not strongly correlated with each other. If multicollinearity is present, it can lead to unreliable and unstable coefficient estimates.
- Assumes no autocorrelation: Multiple Linear Regression assumes that the errors are independent of each other. If the error term is autocorrelated, as often happens with time series data, the standard errors and significance tests can be misleading.
- Assumes no heteroscedasticity: Multiple Linear Regression assumes that the error term has constant variance. If the variance changes across observations, the coefficient estimates remain unbiased but their standard errors become unreliable.
- Sensitive to outliers: Because least squares penalizes squared errors, a few extreme observations can pull the fitted coefficients strongly and distort the results.
- Assumes normal distribution of errors: Multiple Linear Regression assumes that the error term is normally distributed. If it is not, confidence intervals and p-values can be inaccurate, especially in small samples.

In addition, Multiple Linear Regression can be sensitive to irrelevant independent variables, which add noise to the coefficient estimates, and it may not perform well when predictors are correlated. It is important to consider the assumptions of the model and the nature of your data when using Multiple Linear Regression, and to check for potential issues such as multicollinearity, outliers, and non-normal errors before interpreting the results.
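Several of these checks can be run on the residuals of a fitted model. The sketch below uses synthetic data and rough numeric diagnostics computed with NumPy only. The variable names and the cutoffs mentioned in the comments are assumptions for illustration, and these quick checks do not replace formal tests or residual plots.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical) that satisfies the model assumptions
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=200)

fitted = LinearRegression().fit(X, y)
predicted = fitted.predict(X)
residuals = y - predicted

# Durbin-Watson statistic: values near 2 suggest little first-order autocorrelation
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Rough heteroscedasticity check: do absolute residuals grow with the fitted values?
spread_trend = np.corrcoef(predicted, np.abs(residuals))[0, 1]

# Rough outlier check: count residuals more than 3 standard deviations from zero
standardized = residuals / residuals.std()
n_outliers = int(np.sum(np.abs(standardized) > 3))

print("Durbin-Watson: %.2f" % durbin_watson)
print("corr(fitted, |residual|): %.2f" % spread_trend)
print("Residuals beyond 3 SD:", n_outliers)
```

On data that violates an assumption, these numbers move in a recognizable direction: a Durbin-Watson value far from 2 points to autocorrelation, and a clearly non-zero `spread_trend` points to heteroscedasticity.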

##### Md. Ashiqur Rahman
##### Thank You