# <h1 style="text-align:center;">Machine Learning</h1>


<h2> Multiple Linear Regression </h2>

Multiple Linear Regression (MLR) is a regression algorithm that models the linear relationship between a single continuous dependent variable and two or more independent variables.


1. **Dependent Variable (Y)**: In multiple linear regression, the dependent variable (Y) is a continuous, real-valued variable. The goal of MLR is to model and predict this variable from the values of the independent variables.


2. **Independent Variables (Predictors)**: The independent variables, also known as predictors or features, can be of different types. They may be continuous, categorical, or a mix of both. Categorical variables, however, must first be converted into a numeric format, often through one-hot encoding, before they can be included in the regression model.


3. **Linear Relationship**: MLR assumes a linear relationship between each independent variable and the dependent variable. This means that a one-unit change in an independent variable, with the others held fixed, changes the expected value of the dependent variable by a constant amount. It's important to check for linearity and to transform the data if necessary to meet this assumption.


4. **Regression Line in Multidimensional Space**: MLR, as the name suggests, models the relationship between the dependent variable and multiple independent variables simultaneously. With more than one predictor, the fitted "line" is a plane or hyperplane through the multidimensional space of data points. It is represented by the equation:


  <h3> Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε </h3>


   Where:
   - Y is the dependent variable.
   - X₁, X₂, ..., Xₖ are the independent variables.
   - β₀ is the intercept.
   - β₁, β₂, ..., βₖ are the coefficients for the independent variables.
   - ε is the error term, representing the variability in Y that is not explained by the model.

The goal of MLR is to estimate the coefficients (β values) that provide the best fit for the data. This allows you to make predictions for the dependent variable based on the values of the independent variables.
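As a minimal sketch of what estimating the coefficients means, the snippet below generates synthetic data from known β values and recovers them by ordinary least squares with NumPy. The data, the variable names (`X_demo`, `true_beta`, `beta_hat`) and the noise level are assumptions made for illustration; they are not taken from the start-up dataset used later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 2 predictors (illustrative values only)
n_samples = 200
X_demo = rng.normal(size=(n_samples, 2))
true_beta = np.array([3.0, 1.5, -2.0])          # [beta_0, beta_1, beta_2]
noise = rng.normal(scale=0.1, size=n_samples)   # error term epsilon
y_demo = true_beta[0] + X_demo @ true_beta[1:] + noise

# Prepend a column of ones so the intercept beta_0 is estimated as well
design = np.column_stack([np.ones(n_samples), X_demo])

# Ordinary least squares: minimise ||design @ beta - y||^2
beta_hat, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
print(beta_hat)
```

The estimated values should land close to `true_beta`; the small remaining gap comes from the noise term.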

It's worth noting that the assumptions of MLR, including linearity, independence of errors, constant variance, and absence of multicollinearity, should be checked and validated to ensure the model's reliability and accuracy. Data preprocessing, exploratory data analysis, and diagnostic tests play crucial roles in building and evaluating MLR models.

<h3> Assumptions for Multiple Linear Regression:</h3>


1. A linear relationship should exist between the target and the predictor variables.


2. The regression residuals should be normally distributed.


3. MLR assumes little or no multicollinearity (correlation among the independent variables) in the data.
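One common way to check the multicollinearity assumption is the variance inflation factor (VIF): each predictor is regressed on all the others, and VIF = 1 / (1 - R²). The sketch below is one possible implementation on synthetic data. The helper name `variance_inflation_factors` and the three toy columns are hypothetical and introduced here only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def variance_inflation_factors(features):
    """Return one VIF per column: 1 / (1 - R^2) from regressing it on the others."""
    features = np.asarray(features, dtype=float)
    vifs = []
    for j in range(features.shape[1]):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        r_squared = LinearRegression().fit(others, target).score(others, target)
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)               # unrelated to a
c = a + 0.05 * rng.normal(size=300)    # nearly a copy of a

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)
```

Columns `a` and `c` should receive very large VIFs because each almost determines the other, while `b` stays near 1. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign, but that threshold is a convention rather than a strict rule.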

<h3> Implementation of Multiple Linear Regression model using Python:</h3>

<h4>Problem Description:</h4>

We have a dataset of 50 start-up companies. For each company it contains five pieces of information for a financial year: R&D Spend, Administration Spend, Marketing Spend, State, and Profit. Our goal is to build a model that predicts a company's profit and indicates which factor affects the profit the most.

Since we need to predict the Profit, it is the dependent variable, and the other four variables are the independent variables. Below are the main steps of building the MLR model:

1. Data Pre-processing Steps
    
    
2. Fitting the MLR model to the training set
    
    
3. Predicting the result of the test set 

<h4> Step 1: Data Pre-processing</h4>

In [58]:
# importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [59]:
#importing datasets  
data_set= pd.read_csv('50_Startups.csv')  

In [60]:
data_set

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |
| 5 | 131876.9 | 99814.71 | 362861.36 | New York | 156991.12 |
| 6 | 134615.46 | 147198.87 | 127716.82 | California | 156122.51 |
| 7 | 130298.13 | 145530.06 | 323876.68 | Florida | 155752.6 |
| 8 | 120542.52 | 148718.95 | 311613.29 | New York | 152211.77 |
| 9 | 123334.88 | 108679.17 | 304981.62 | California | 149759.96 |


In the above output, we can see that there are five variables: four are continuous and one (State) is categorical.

In [61]:
# Extract the independent (x) and dependent (y) variables

x = data_set.iloc[:, :-1].values   # all columns except Profit
y = data_set.iloc[:, 4].values     # Profit only (the last column)


In [62]:
x

array([[165349.2, 136897.8, 471784.1, 'New York'],
       [162597.7, 151377.59, 443898.53, 'California'],
       [153441.51, 101145.55, 407934.54, 'Florida'],
       [144372.41, 118671.85, 383199.62, 'New York'],
       [142107.34, 91391.77, 366168.42, 'Florida'],
       [131876.9, 99814.71, 362861.36, 'New York'],
       [134615.46, 147198.87, 127716.82, 'California'],
       [130298.13, 145530.06, 323876.68, 'Florida'],
       [120542.52, 148718.95, 311613.29, 'New York'],
       [123334.88, 108679.17, 304981.62, 'California'],
       [101913.08, 110594.11, 229160.95, 'Florida'],
       [100671.96, 91790.61, 249744.55, 'California'],
       [93863.75, 127320.38, 249839.44, 'Florida'],
       [91992.39, 135495.07, 252664.93, 'California'],
       [119943.24, 156547.42, 256512.92, 'Florida'],
       [114523.61, 122616.84, 261776.23, 'New York'],
       [78013.11, 121597.55, 264346.06, 'California'],
       [94657.16, 145077.58, 282574.31, 'New York'],
       [91749.16, 114175.79, 29491

In [63]:
y

array([192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12,
       156122.51, 155752.6, 152211.77, 149759.96, ...

As the output for x shows, its last column (State) holds categorical values, which cannot be used directly to fit the model. So we need to encode this variable.

<h4> Encoding Dummy Variables:</h4>

We have one categorical variable (State), which cannot be passed to the model directly, so we encode it. Encoding the states as integers with the LabelEncoder class is not sufficient, because the integers imply an order between the states that does not exist and may lead to a wrong model. To avoid this problem, we use OneHotEncoder, which creates one dummy variable per state. Below is the code for it:

In [64]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Assuming you have your data in the variable 'x' where the 4th column is categorical
# Create a ColumnTransformer to apply transformations to specific columns
column_transformer = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(), [3])  # Apply OneHotEncoder to the 4th column
    ],
    remainder='passthrough'  # Keep the other columns as they are
)

# Apply the transformations to your data
x = column_transformer.fit_transform(x)



In [65]:
x

array([[0.0, 0.0, 1.0, 165349.2, 136897.8, 471784.1],
       [1.0, 0.0, 0.0, 162597.7, 151377.59, 443898.53],
       [0.0, 1.0, 0.0, 153441.51, 101145.55, 407934.54],
       [0.0, 0.0, 1.0, 144372.41, 118671.85, 383199.62],
       [0.0, 1.0, 0.0, 142107.34, 91391.77, 366168.42],
       [0.0, 0.0, 1.0, 131876.9, 99814.71, 362861.36],
       [1.0, 0.0, 0.0, 134615.46, 147198.87, 127716.82],
       [0.0, 1.0, 0.0, 130298.13, 145530.06, 323876.68],
       [0.0, 0.0, 1.0, 120542.52, 148718.95, 311613.29],
       [1.0, 0.0, 0.0, 123334.88, 108679.17, 304981.62],
       [0.0, 1.0, 0.0, 101913.08, 110594.11, 229160.95],
       [1.0, 0.0, 0.0, 100671.96, 91790.61, 249744.55],
       [0.0, 1.0, 0.0, 93863.75, 127320.38, 249839.44],
       [1.0, 0.0, 0.0, 91992.39, 135495.07, 252664.93],
       [0.0, 1.0, 0.0, 119943.24, 156547.42, 256512.92],
       [0.0, 0.0, 1.0, 114523.61, 122616.84, 261776.23],
       [1.0, 0.0, 0.0, 78013.11, 121597.55, 264346.06],
       [0.0, 0.0, 1.0, 94657.16, 145077.58

As we can see in the above output, the State column has been converted into dummy variables (0 and 1). Each dummy variable column corresponds to one state, which we can check by comparing the output with the original dataset: the first column corresponds to California, the second to Florida, and the third to New York.
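The column order can also be read from the fitted encoder instead of being inferred by eye. The sketch below rebuilds a small transformer on three rows copied from the dataset and inspects `categories_`. The names `x_demo`, `demo_transformer` and `state_order` are introduced here only for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three rows in the same layout as x: three numeric columns, then State
x_demo = np.array([
    [165349.2, 136897.8, 471784.1, 'New York'],
    [162597.7, 151377.59, 443898.53, 'California'],
    [153441.51, 101145.55, 407934.54, 'Florida'],
], dtype=object)

demo_transformer = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(), [3])],
    remainder='passthrough',
)
demo_transformer.fit(x_demo)

# categories_ lists the category values in the order of the dummy columns
state_order = demo_transformer.named_transformers_['onehot'].categories_[0]
print(state_order)
```

The encoder sorts the categories, which is why California comes first even though New York appears first in the data.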

In [66]:
#avoiding the dummy variable trap:  
X = x[:, 1:]  

# If we do not remove the first dummy variable, then it may introduce multicollinearity in the model.
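To see why keeping every dummy column is a problem, note that the three dummy columns always sum to 1, which is exactly the intercept column. The sketch below uses six made-up rows (not the real dataset) and compares the rank of the design matrix with and without the first dummy column.

```python
import numpy as np

# One-hot columns for six illustrative rows: California, Florida, New York
dummies = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
], dtype=float)
intercept = np.ones((dummies.shape[0], 1))

# All three dummies sum to the intercept column, so one column is redundant
full = np.hstack([intercept, dummies])
reduced = np.hstack([intercept, dummies[:, 1:]])

full_rank = np.linalg.matrix_rank(full)
reduced_rank = np.linalg.matrix_rank(reduced)
print(full.shape[1], full_rank)        # 4 columns, rank 3
print(reduced.shape[1], reduced_rank)  # 3 columns, rank 3
```

With all dummies present, the matrix has more columns than its rank, so the coefficients are not uniquely determined. Dropping one dummy column restores full column rank.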

In [67]:
X

array([[0.0, 1.0, 165349.2, 136897.8, 471784.1],
       [0.0, 0.0, 162597.7, 151377.59, 443898.53],
       [1.0, 0.0, 153441.51, 101145.55, 407934.54],
       [0.0, 1.0, 144372.41, 118671.85, 383199.62],
       [1.0, 0.0, 142107.34, 91391.77, 366168.42],
       [0.0, 1.0, 131876.9, 99814.71, 362861.36],
       [0.0, 0.0, 134615.46, 147198.87, 127716.82],
       [1.0, 0.0, 130298.13, 145530.06, 323876.68],
       [0.0, 1.0, 120542.52, 148718.95, 311613.29],
       [0.0, 0.0, 123334.88, 108679.17, 304981.62],
       [1.0, 0.0, 101913.08, 110594.11, 229160.95],
       [0.0, 0.0, 100671.96, 91790.61, 249744.55],
       [1.0, 0.0, 93863.75, 127320.38, 249839.44],
       [0.0, 0.0, 91992.39, 135495.07, 252664.93],
       [1.0, 0.0, 119943.24, 156547.42, 256512.92],
       [0.0, 1.0, 114523.61, 122616.84, 261776.23],
       [0.0, 0.0, 78013.11, 121597.55, 264346.06],
       [0.0, 1.0, 94657.16, 145077.58, 282574.31],
       [1.0, 0.0, 91749.16, 114175.79, 294919.57],
       [0.0, 1.0, 86419.7

As we can see in the above output, the first column has been removed.

Now we will split the dataset into training and test set. The code for this is given below:

In [68]:
# Splitting the dataset into training and test set.  
from sklearn.model_selection import train_test_split  
x_train, x_test, y_train, y_test= train_test_split(X, y, test_size= 0.2, random_state=0)  
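As a quick sanity check of the split, the sketch below applies the same `test_size` and `random_state` to stand-in arrays with 50 rows. The arrays `X_demo` and `y_demo` are placeholders, not the start-up data. It shows the resulting sizes and confirms that features and targets stay aligned row by row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the dataset (50)
X_demo = np.arange(50 * 5, dtype=float).reshape(50, 5)
y_demo = np.arange(50, dtype=float)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)

# Row i of X_demo starts with 5 * i, so this recovers the original row index
rows_aligned = np.array_equal(X_tr[:, 0] / 5, y_tr)
print(rows_aligned)
```

With 50 rows and `test_size=0.2`, 40 rows go to the training set and 10 to the test set.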

<h4> Training set :</h4>

In [69]:
x_train

array([[1.0, 0.0, 55493.95, 103057.49, 214634.81],
       [0.0, 1.0, 46014.02, 85047.44, 205517.64],
       [1.0, 0.0, 75328.87, 144135.98, 134050.07],
       [0.0, 0.0, 46426.07, 157693.92, 210797.67],
       [1.0, 0.0, 91749.16, 114175.79, 294919.57],
       [1.0, 0.0, 130298.13, 145530.06, 323876.68],
       [1.0, 0.0, 119943.24, 156547.42, 256512.92],
       [0.0, 1.0, 1000.23, 124153.04, 1903.93],
       [0.0, 1.0, 542.05, 51743.15, 0.0],
       [0.0, 1.0, 65605.48, 153032.06, 107138.38],
       [0.0, 1.0, 114523.61, 122616.84, 261776.23],
       [1.0, 0.0, 61994.48, 115641.28, 91131.24],
       [0.0, 0.0, 63408.86, 129219.61, 46085.25],
       [0.0, 0.0, 78013.11, 121597.55, 264346.06],
       [0.0, 0.0, 23640.93, 96189.63, 148001.11],
       [0.0, 0.0, 76253.86, 113867.3, 298664.47],
       [0.0, 1.0, 15505.73, 127382.3, 35534.17],
       [0.0, 1.0, 120542.52, 148718.95, 311613.29],
       [0.0, 0.0, 91992.39, 135495.07, 252664.93],
       [0.0, 0.0, 64664.71, 139553.16, 137962.

In [70]:
y_train

y_train is a one-dimensional array holding the Profit values of the 40 training rows, aligned row by row with x_train. It must not contain the feature columns or the State labels.

<h4> Testing Dataset :</h4>

In [71]:
x_test

array([[1.0, 0.0, 66051.52, 182645.56, 118148.2],
       [0.0, 0.0, 100671.96, 91790.61, 249744.55],
       [1.0, 0.0, 101913.08, 110594.11, 229160.95],
       [1.0, 0.0, 27892.92, 84710.77, 164470.71],
       [1.0, 0.0, 153441.51, 101145.55, 407934.54],
       [0.0, 1.0, 72107.6, 127864.55, 353183.81],
       [0.0, 1.0, 20229.59, 65947.93, 185265.1],
       [0.0, 1.0, 61136.38, 152701.92, 88218.23],
       [1.0, 0.0, 73994.56, 122782.75, 303319.26],
       [1.0, 0.0, 142107.34, 91391.77, 366168.42]], dtype=object)

In [72]:
y_test

y_test is a one-dimensional array holding the Profit values of the 10 test rows, aligned row by row with x_test. It must not contain the feature columns or the State labels.

<h4> Step 2: Fitting the MLR model to the training set</h4>

In [73]:
# Fitting the MLR model to the training set
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(x_train, y_train)


Now, we have successfully trained our model using the training dataset. In the next step, we will test the performance of the model using the test dataset.

<h4>Step 3: Prediction of the test set results</h4>

The last step is to check the performance of the model by predicting the test set results. For prediction, we create a y_pred vector. Below is the code for it:

In [74]:
# Predicting the test set result
y_pred = reg.predict(x_test)
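One common way to judge the predictions is to compare them with the true test values, for example through the R² score on the training and test sets. The sketch below runs the full fit, predict and score cycle on synthetic data. The names `X_demo`, `y_demo` and `demo_reg` and all numbers are assumptions for illustration; the scores say nothing about the start-up dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: 50 rows, 3 predictors, linear signal plus noise
X_demo = rng.uniform(0, 100, size=(50, 3))
y_demo = 10.0 + X_demo @ np.array([0.8, -0.1, 0.05]) + rng.normal(scale=1.0, size=50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The model must be fitted before predict() can be called
demo_reg = LinearRegression().fit(X_tr, y_tr)
demo_pred = demo_reg.predict(X_te)

train_r2 = demo_reg.score(X_tr, y_tr)   # R^2 on the training set
test_r2 = r2_score(y_te, demo_pred)     # R^2 on the held-out test set
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

A test score far below the training score would hint at overfitting; similar scores suggest the model generalises to unseen rows.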

<h3>Applications of Multiple Linear Regression:</h3>

There are mainly two applications of Multiple Linear Regression:

1. Assessing the effect of each independent variable on the prediction
2. Predicting the impact of changes in the independent variables
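Both applications can be read off a fitted model. The sketch below uses noise-free synthetic data with hypothetical feature names (`feature_a`, `feature_b`): the coefficients show the effect of each variable, and the difference between two predictions shows the impact of a change.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
feature_names = ["feature_a", "feature_b"]                # hypothetical names
X_demo = rng.uniform(0, 10, size=(100, 2))
y_demo = 5.0 + 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1]    # noise-free for clarity

demo_reg = LinearRegression().fit(X_demo, y_demo)

# 1. Effect of each independent variable: one coefficient per feature
for name, coef in zip(feature_names, demo_reg.coef_):
    print(f"{name}: {coef:.2f} change in y per unit increase")

# 2. Impact of a change: raise feature_a by 3 units, hold feature_b fixed
baseline = np.array([[4.0, 6.0]])
changed = baseline + np.array([[3.0, 0.0]])
delta = demo_reg.predict(changed)[0] - demo_reg.predict(baseline)[0]
print(delta)
```

Coefficients are expressed per unit of their own feature, so comparing them across features is only meaningful when the features share a scale or have been standardised.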