# Multiple Linear Regression

In multiple linear regression we have **many** independent variables and only **one** dependent variable.

### 1. Assumptions of a linear regression model -->

Every linear regression (LR) model rests on a set of assumptions. The main ones are:
>    1. linearity,
>    2. homoscedasticity,
>    3. multivariate normality,
>    4. independence of errors, and
>    5. lack of multicollinearity.
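
The last assumption in the list can be checked numerically. The sketch below assumes the variance inflation factor (VIF) as the diagnostic, which these notes do not name: each feature is regressed on all the others, and VIF = 1 / (1 - R²) of that regression. The helper `vif` and the synthetic data are illustrative only, not part of the startup dataset used later.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of every column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    factors = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns (plus an intercept).
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

# Synthetic data (an assumption for illustration): two unrelated features,
# then a third feature that is almost exactly their sum.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
independent = np.column_stack([a, b])
collinear = np.column_stack([a, b, a + b + rng.normal(scale=0.1, size=200)])

print(vif(independent))  # values close to 1: no collinearity
print(vif(collinear))    # large values: every column is explained by the others
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a warning sign, but the threshold is a convention rather than a hard rule.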

### 2. Dummy variables -->

Dummy variables are one way to handle categorical values. The idea is to create a separate feature for each category. Example:
> colors: {red, blue, green, red, green}

Here the three categories are {red, blue, green}. Three columns are created, one per category, each holding boolean values. If the i<sup>th</sup> example is red, then only the _red_ column holds 1 and all the others hold 0. This is repeated for every training example.

This is a great way to handle categorical values, but it can lead to some problems. The major one is _multicollinearity_: the dummy columns always sum to 1, so any one of them can be predicted exactly from the others.
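
A minimal sketch of the colors example, assuming `pandas.get_dummies` as the encoder (these notes do not prescribe a specific tool). The second call uses `drop_first=True`, which is one common way to break the exact dependency between the dummy columns: the dropped category becomes the baseline, represented by a row of zeros.

```python
import pandas as pd

colors = pd.Series(["red", "blue", "green", "red", "green"], name="color")

# One column per category; exactly one 1 in every row.
dummies = pd.get_dummies(colors, dtype=int)
print(dummies)

# Dropping the first category (alphabetically "blue") removes the redundancy:
# a row of all zeros now means "blue".
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)
print(reduced)
```

With all three columns kept, each row sums to 1, which duplicates the information carried by the model's intercept; with one column dropped, that redundancy is gone and no information is lost.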

### 3. P value -->

Every event has some probability associated with it. For example, tossing a fair coin gives heads or tails with probability 0.5 each.

But how can the "_fairness_" of the coin be judged? This is where _hypothesis testing_ comes in.

The coin can be either fair or unfair. An assumption is made about the _state_ of the coin, and the assumption is tested by tossing it. If the coin is fair we expect a mix of heads and tails. But if we see the same outcome again and again, it starts to get <u>sus</u>.

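That sus feeling can be quantified. The initial assumption (here, that the coin is fair) is called the "_null hypothesis_", and the _**P**_ value is the probability of seeing an outcome at least as extreme as the one observed if the null hypothesis were true. The smaller the _**P**_ value, the more sus the null hypothesis becomes; the point at which we decide to reject it is a threshold chosen in advance, called the _significance level_ (commonly 0.05).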
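
The coin example can be worked through exactly. The sketch below assumes a two-sided exact binomial test, where "at least as extreme" means "no more likely than the observed count under the null hypothesis"; the function name `two_sided_p_value` is illustrative.

```python
from math import comb

def two_sided_p_value(heads, tosses, p_heads=0.5):
    """Exact two-sided binomial P value for seeing `heads` in `tosses` tosses."""
    def pmf(k):
        # Probability of exactly k heads if the null hypothesis is true.
        return comb(tosses, k) * p_heads**k * (1 - p_heads) ** (tosses - k)

    observed = pmf(heads)
    # Add up every outcome that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(pmf(k) for k in range(tosses + 1) if pmf(k) <= observed * (1 + 1e-9))

print(two_sided_p_value(6, 10))   # 0.75390625   -> nothing sus about 6 heads in 10
print(two_sided_p_value(10, 10))  # 0.001953125  -> 10 heads in 10 is very sus
```

At a significance level of 0.05, the first result gives no reason to reject the null hypothesis of a fair coin, while the second does.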

### 4. How to build a model? -->

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# importing the dataset -->
raw_data = pd.read_csv("data/50_Startups.csv")

In [3]:
# data description -->
raw_data.describe()

| | R&D Spend | Administration | Marketing Spend | Profit |
|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 73721.6156 | 121344.6396 | 211025.0978 | 112012.6392 |
| std | 45902.256482 | 28017.802755 | 122290.310726 | 40306.180338 |
| min | 0.0 | 51283.14 | 0.0 | 14681.4 |
| 25% | 39936.37 | 103730.875 | 129300.1325 | 90138.9025 |
| 50% | 73051.08 | 122699.795 | 212716.24 | 107978.19 |
| 75% | 101602.8 | 144842.18 | 299469.085 | 139765.9775 |
| max | 165349.2 | 182645.56 | 471784.1 | 192261.83 |


In [4]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   R&D Spend        50 non-null     float64
 1   Administration   50 non-null     float64
 2   Marketing Spend  50 non-null     float64
 3   State            50 non-null     object 
 4   Profit           50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB


In [5]:
# creating the train-test split -->
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(raw_data, test_size=0.2, random_state=42)