# Multiple Linear Regression

Multiple linear regression predicts a continuous label (the target) from two or more features by fitting a weighted sum of those features plus an intercept.

### Equation: y = b<sub>0</sub> + b<sub>1</sub>x<sub>1</sub> + b<sub>2</sub>x<sub>2</sub> + ... + b<sub>n</sub>x<sub>n</sub>
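To make the equation concrete, the sketch below evaluates it for a single sample in plain Python. The intercept, coefficients, and feature values are made-up numbers chosen for illustration; they are not values fitted on the startup dataset.

```python
# Hypothetical parameters, for illustration only (not fitted values)
b0 = 50000.0                          # intercept b0
b = [0.5, -0.25, 0.125]               # coefficients b1, b2, b3
x = [100000.0, 120000.0, 300000.0]    # feature values x1, x2, x3

# y = b0 + b1*x1 + b2*x2 + b3*x3
y_hat = b0 + sum(b_i * x_i for b_i, x_i in zip(b, x))
print(y_hat)  # 107500.0
```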

## Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Importing Dataset

In [2]:
df = pd.read_csv("./50_Startups.csv")
df.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [3]:
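# Features: every column except the last one (Profit)
X = df.iloc[:, :-1].values
# Target: the Profit column, kept 2-D with shape (n_samples, 1)
y = df.iloc[:, -1:].values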

In [4]:
X[:10]

array([[165349.2, 136897.8, 471784.1, 'New York'],
       [162597.7, 151377.59, 443898.53, 'California'],
       [153441.51, 101145.55, 407934.54, 'Florida'],
       [144372.41, 118671.85, 383199.62, 'New York'],
       [142107.34, 91391.77, 366168.42, 'Florida'],
       [131876.9, 99814.71, 362861.36, 'New York'],
       [134615.46, 147198.87, 127716.82, 'California'],
       [130298.13, 145530.06, 323876.68, 'Florida'],
       [120542.52, 148718.95, 311613.29, 'New York'],
       [123334.88, 108679.17, 304981.62, 'California']], dtype=object)

In [5]:
y[:10]

array([[192261.83],
       [191792.06],
       [191050.39],
       [182901.99],
       [166187.94],
       [156991.12],
       [156122.51],
       [155752.6 ],
       [152211.77],
       [149759.96]])

## Encoding Categorical Data

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[("encoder", OneHotEncoder(), [3])], remainder="passthrough")
X = np.array(ct.fit_transform(X))

In [7]:
X[:10]

array([[0.0, 0.0, 1.0, 165349.2, 136897.8, 471784.1],
       [1.0, 0.0, 0.0, 162597.7, 151377.59, 443898.53],
       [0.0, 1.0, 0.0, 153441.51, 101145.55, 407934.54],
       [0.0, 0.0, 1.0, 144372.41, 118671.85, 383199.62],
       [0.0, 1.0, 0.0, 142107.34, 91391.77, 366168.42],
       [0.0, 0.0, 1.0, 131876.9, 99814.71, 362861.36],
       [1.0, 0.0, 0.0, 134615.46, 147198.87, 127716.82],
       [0.0, 1.0, 0.0, 130298.13, 145530.06, 323876.68],
       [0.0, 0.0, 1.0, 120542.52, 148718.95, 311613.29],
       [1.0, 0.0, 0.0, 123334.88, 108679.17, 304981.62]], dtype=object)
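The three new leading columns are indicator columns for the State values. The sketch below rebuilds that layout by hand with NumPy for the first three rows, assuming the categories are ordered alphabetically (California, Florida, New York), which is what the output above shows. It only mimics the result and is not how `OneHotEncoder` is implemented.

```python
import numpy as np

# State values of the first three rows shown above
states = np.array(["New York", "California", "Florida"])

# np.unique returns the distinct values in sorted order
categories = np.unique(states)

# Compare every state against every category to get one indicator per column
one_hot = (states[:, None] == categories[None, :]).astype(float)

print(categories)
print(one_hot)
```

Each row contains exactly one 1.0, in the column of its state, which matches the first three encoded rows above.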

## Splitting data into Train and Test Set

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
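With `test_size=0.2`, one fifth of the rows is held out for testing. The sketch below shows the shuffle-and-slice idea with NumPy, assuming the dataset has 50 rows (as the file name and the ten test predictions suggest). It is a hand-rolled illustration and will not select the same rows as `train_test_split` with `random_state=42`.

```python
import numpy as np

n_samples = 50     # assumed dataset size
test_size = 0.2

rng = np.random.default_rng(42)         # seeded so the shuffle is repeatable
indices = rng.permutation(n_samples)    # shuffled row indices 0..49

n_test = int(round(n_samples * test_size))
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(len(train_idx), len(test_idx))  # 40 10
```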

## Training the Multiple Linear Regression Model

In [9]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()
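`LinearRegression` fits the intercept and coefficients by ordinary least squares. The sketch below illustrates that idea on synthetic, noise-free data whose true parameters are known, solving the problem with `np.linalg.lstsq`. The data, the `_demo` names, and the choice of solver are assumptions made for illustration, not a description of scikit-learn's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))                         # synthetic features
y_demo = 2.0 + 3.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]    # known b0=2, b1=3, b2=-1

# Prepend a column of ones so the first parameter acts as the intercept
A = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least-squares solution of A @ params ≈ y_demo
params, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
print(np.round(params, 6))
```

Because the data has no noise, the recovered parameters are approximately 2, 3, and -1. On the fitted model in this notebook, the corresponding values are exposed as `regressor.intercept_` and `regressor.coef_`.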

## Predicting on Test Data

In [10]:
y_preds = regressor.predict(X_test)

# y_train is 2-D, so predict() already returns a column of shape (n, 1);
# round to 2 decimals for display
y_preds = y_preds.round(decimals=2)
y_preds

array([[126362.88],
       [ 84608.45],
       [ 99677.49],
       [ 46357.46],
       [128750.48],
       [ 50912.42],
       [109741.35],
       [100643.24],
       [ 97599.28],
       [113097.43]])

In [11]:
results_df = pd.DataFrame(y_test, columns=["Actual"])
results_df["Predicted"] = y_preds
results_df

| | Actual | Predicted |
|---|---|---|
| 0 | 134307.35 | 126362.88 |
| 1 | 81005.76 | 84608.45 |
| 2 | 99937.59 | 99677.49 |
| 3 | 64926.08 | 46357.46 |
| 4 | 125370.37 | 128750.48 |
| 5 | 35673.41 | 50912.42 |
| 6 | 105733.54 | 109741.35 |
| 7 | 107404.34 | 100643.24 |
| 8 | 97427.84 | 97599.28 |
| 9 | 122776.86 | 113097.43 |


In [12]:
# R² (coefficient of determination) of the model on test data;
# for a regressor, score() returns R², not classification accuracy
print(f"R² of model on test data: {regressor.score(X_test, y_test):.2f}")

R² of model on test data: 0.90
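For a regressor, `score` returns the coefficient of determination R² = 1 - SS<sub>res</sub> / SS<sub>tot</sub>. As a cross-check, the sketch below recomputes it in plain Python from the ten actual and predicted values in the results table above. Since those predictions were rounded to two decimals, the result may differ from `regressor.score` in the last digits.

```python
# Actual and predicted profits copied from the results table above
actual = [134307.35, 81005.76, 99937.59, 64926.08, 125370.37,
          35673.41, 105733.54, 107404.34, 97427.84, 122776.86]
predicted = [126362.88, 84608.45, 99677.49, 46357.46, 128750.48,
             50912.42, 109741.35, 100643.24, 97599.28, 113097.43]

mean_actual = sum(actual) / len(actual)

# Residual sum of squares: error left after using the model
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
# Total sum of squares: error of always predicting the mean
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r2 = 1 - ss_res / ss_tot
print(f"{r2:.2f}")  # 0.90
```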
