# Multiple Linear Regression

In multiple linear regression we have **many** independent variables and only **one** dependent variable.
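
Written out, the model described above takes the following form. The symbols are notation chosen here for illustration and do not come from the notebook: $y$ is the dependent variable, $x_1, \dots, x_n$ are the independent variables, $b_0$ is the intercept, $b_1, \dots, b_n$ are the coefficients, and $\varepsilon$ is the error term.

```latex
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n + \varepsilon
```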

### 1. Assumptions of a linear regression model -->

Every linear regression (LR) model rests on a set of assumptions. The main ones are:
>    1. linearity,
>    2. homoscedasticity,
>    3. multivariate normality,
>    4. independence of errors, and
>    5. lack of multicollinearity.
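
One common way to probe the last assumption is the variance inflation factor (VIF): regress each feature on all the others and compute `1 / (1 - R^2)`. The sketch below is only an illustration on synthetic data; the helper name `variance_inflation_factors` and the toy columns `a`, `b`, `c` are made up here and are not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j on the other columns."""
    n_features = X.shape[1]
    vifs = np.empty(n_features)
    for j in range(n_features):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs


# Synthetic data (assumption): a and b are independent, c is almost a + b.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + b + rng.normal(scale=0.05, size=200)

vif_independent = variance_inflation_factors(np.column_stack([a, b]))
vif_collinear = variance_inflation_factors(np.column_stack([a, b, c]))
print(vif_independent, vif_collinear)
```

Values close to 1 suggest little collinearity; a frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign.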

### 2. Dummy variables -->

Dummy variables are one way to handle categorical values. The idea is to create a separate feature for each category. Example:
> colors: {red, blue, green, red, green}

Here the three categories are {red, blue, green}. Three columns are created, one per category, each holding boolean values. If the i<sup>th</sup> example is red, then only the _red_ column holds 1 and the others hold 0. This is repeated for every training example.

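This is a great way to handle categorical values, but it can lead to problems. The major one is _multicollinearity_: the dummy columns of a feature always sum to 1, so any one of them can be predicted exactly from the others (the "dummy variable trap"). Dropping one dummy column per categorical feature avoids this.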
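
The colors example can be reproduced with `pandas.get_dummies`. This is a minimal sketch rather than the notebook's own preprocessing (which uses `OneHotEncoder` later on); the variable names are chosen here for illustration.

```python
import pandas as pd

colors = pd.Series(["red", "blue", "green", "red", "green"], name="color")

# One indicator column per category; columns come out in sorted order.
dummies = pd.get_dummies(colors, dtype=int)

# Dropping the first category removes the exact linear dependence between the columns.
dummies_reduced = pd.get_dummies(colors, drop_first=True, dtype=int)

print(dummies)
print(dummies_reduced)
```

In the full encoding every row sums to 1, which is exactly the dependence that causes multicollinearity; in the reduced encoding a row of all zeros stands for the dropped category.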

### 3. P value -->

Every event has some probability associated with it. For example, a fair coin toss has a 50/50 chance of landing heads or tails.

But how can the "_fairness_" of a coin be judged? This is where _hypothesis testing_ comes in.

The coin is either fair or unfair. We start by assuming a _state_ for the coin (say, that it is fair) and test that assumption by tossing it. If the coin is fair we expect a mix of heads and tails. But if we see the same outcome again and again, it starts to seem <u>sus</u>.

The starting assumption is called the "_null hypothesis_". The _**P**_ value puts a number on that sus feeling: it is the probability of seeing a result at least as extreme as the one observed, assuming the null hypothesis is true. The smaller the P value, the harder it is to keep believing the null hypothesis. The point at which we give up on it is a threshold chosen in advance, called the significance level (0.05 is a common choice).
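
For the coin, that number can be computed directly from the binomial distribution. The sketch below assumes a one-sided test with a fair coin as the null hypothesis; the function name `p_value_at_least` is hypothetical and chosen for this example.

```python
from math import comb


def p_value_at_least(heads: int, tosses: int, p_heads: float = 0.5) -> float:
    """One-sided P value: probability of `heads` or more heads in `tosses` tosses."""
    return sum(
        comb(tosses, k) * p_heads**k * (1 - p_heads) ** (tosses - k)
        for k in range(heads, tosses + 1)
    )


p_six_of_ten = p_value_at_least(6, 10)   # 386/1024, not suspicious
p_ten_of_ten = p_value_at_least(10, 10)  # 1/1024, very suspicious
print(p_six_of_ten, p_ten_of_ten)
```

Six heads in ten tosses gives a P value of about 0.38, so there is no reason to doubt the coin. Ten heads in ten gives about 0.001, far below 0.05, so the null hypothesis of a fair coin would be rejected.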

### 4. How to build a model? -->

### 5. Building the model -->

#### Data preprocessing =>

In [67]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [68]:
# importing the dataset -->
raw_data = pd.read_csv("data/50_Startups.csv")

In [69]:
# data description -->
raw_data.describe()

|       | R&D Spend     | Administration | Marketing Spend | Profit        |
|-------|---------------|----------------|-----------------|---------------|
| count | 50.0          | 50.0           | 50.0            | 50.0          |
| mean  | 73721.6156    | 121344.6396    | 211025.0978     | 112012.6392   |
| std   | 45902.256482  | 28017.802755   | 122290.310726   | 40306.180338  |
| min   | 0.0           | 51283.14       | 0.0             | 14681.4       |
| 25%   | 39936.37      | 103730.875     | 129300.1325     | 90138.9025    |
| 50%   | 73051.08      | 122699.795     | 212716.24       | 107978.19     |
| 75%   | 101602.8      | 144842.18      | 299469.085      | 139765.9775   |
| max   | 165349.2      | 182645.56      | 471784.1        | 192261.83     |


In [70]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   R&D Spend        50 non-null     float64
 1   Administration   50 non-null     float64
 2   Marketing Spend  50 non-null     float64
 3   State            50 non-null     object 
 4   Profit           50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB


In [71]:
# creating the train-test split -->
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(raw_data, test_size=0.2, random_state=42)

In [72]:
# using pipelines and column transformers to pre-process data -->
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [73]:
def split_xy(dataset: pd.DataFrame, target: str) -> tuple[pd.DataFrame, pd.Series]:
    # split the data into features X (DataFrame) and target y (Series)
    X = dataset.drop(target, axis=1)
    y = dataset[target].copy()

    return X, y

In [74]:
def get_nums_cats(X: pd.DataFrame, cat_vars: list) -> tuple[pd.DataFrame, pd.DataFrame]:
    X_nums = X.select_dtypes([np.number])
    X_cats = X[cat_vars]

    return X_nums, X_cats

In [75]:
X_train, y_train = split_xy(train_data, "Profit")

In [76]:
X_train_num, X_train_cat = get_nums_cats(X_train, ["State"])

In [77]:
# applying the column transformer -->
num_cols = X_train_num.columns
cat_cols = X_train_cat.columns

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler())
])

# standalone check of the numeric pipeline; the ColumnTransformer below fits its own clone
X_train_num_pl = num_pipeline.fit_transform(X_train_num)

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", OneHotEncoder(), cat_cols)
])

X_train_prepared = full_pipeline.fit_transform(X_train)
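
Since the dummy variable trap was raised earlier, here is a hedged variant of the column transformer that drops one dummy column per categorical feature via `OneHotEncoder(drop="first")`. It is a sketch on a made-up toy frame (the values and the names `toy`, `toy_pipeline`, `toy_prepared` are assumptions), not a change to the pipeline used in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the startup data; the values are invented for illustration.
toy = pd.DataFrame({
    "R&D Spend": [10.0, 20.0, 30.0, 40.0],
    "Marketing Spend": [5.0, 3.0, 8.0, 1.0],
    "State": ["New York", "California", "Florida", "California"],
})

toy_pipeline = ColumnTransformer(
    [
        ("num", StandardScaler(), ["R&D Spend", "Marketing Spend"]),
        ("cat", OneHotEncoder(drop="first"), ["State"]),  # 3 categories -> 2 columns
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_prepared = toy_pipeline.fit_transform(toy)
print(toy_prepared.shape)
```

The result has two scaled numeric columns plus two dummy columns instead of three. scikit-learn's `LinearRegression` still fits when all three dummy columns are kept, but the individual coefficients are then harder to interpret.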

In [78]:
X_train_prepared

array([[ 0.34202149,  0.22787678,  0.12425038,  0.        ,  1.        ,
         0.        ],
       [ 1.36207849, -1.0974737 ,  1.14990688,  0.        ,  1.        ,
         0.        ],
       [-0.71081297, -2.5770186 , -0.34136825,  1.        ,  0.        ,
         0.        ],
       [ 0.90611438,  1.0172367 ,  0.66890185,  0.        ,  0.        ,
         1.        ],
       [ 1.40997088, -0.09115403,  1.30006861,  0.        ,  0.        ,
         1.        ],
       [ 1.20367103,  0.96116332, -0.95248784,  1.        ,  0.        ,
         0.        ],
       [-1.05285826, -1.34392538, -0.62843389,  0.        ,  1.        ,
         0.        ],
       [-1.61480906, -0.19649414,  0.54106768,  0.        ,  1.        ,
         0.        ],
       [-1.642623  ,  0.52691442, -2.07854935,  1.        ,  0.        ,
         0.        ],
       [ 0.77885123,  0.05437051,  0.2294954 ,  0.        ,  0.        ,
         1.        ],
       [ 0.96515572, -0.45976843,  0.61043134,  1.

#### Training the model =>

In [79]:
from sklearn.linear_model import LinearRegression

In [80]:
linear_model: LinearRegression = LinearRegression()
linear_model.fit(X_train_prepared, y_train)
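
After fitting, the learned weights are available as `coef_` and `intercept_`. The sketch below shows this on synthetic, noise-free data with known weights, so the recovered values can be checked; `X_demo`, `y_demo`, and `demo_model` are illustrative names and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): y = 3*x1 - 2*x2 + 5 with no noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0

demo_model = LinearRegression().fit(X_demo, y_demo)

print("coefficients:", demo_model.coef_)
print("intercept:", demo_model.intercept_)
```

The same two attributes can be read from `linear_model` above; there, each coefficient refers to one column of `X_train_prepared`, in the order produced by the column transformer.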

In [81]:
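# caution: y_train keeps the original row labels after the shuffle, so y_train[0] is the row
# labelled 0; y_train.iloc[0] is the positional match for X_train_prepared[0]
print(f"{X_train_prepared[0]} -> {y_train[0]}")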

[0.34202149 0.22787678 0.12425038 0.         1.         0.        ] -> 192261.83


In [82]:
# calculating the error -->
from sklearn.metrics import mean_squared_error
y_predicted = linear_model.predict(X_train_prepared)

In [83]:
mse = mean_squared_error(y_train, y_predicted)
rmse = np.sqrt(mse)
print(f"MSE: {mse}, RMSE: {rmse}")

MSE: 79700060.0825932, RMSE: 8927.489013300055
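
For reference, the two metrics printed above are defined as follows. The notation is chosen here and is not from the notebook: $m$ is the number of examples, $y_i$ the true profit, and $\hat{y}_i$ the predicted profit.

```latex
\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```

Because of the square root, RMSE is in the same units as `Profit`, which makes it easier to read against the profit statistics in the data description.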


#### Evaluation on the test set =>

In [87]:
X_test, y_test = split_xy(test_data, "Profit")

In [88]:
# applying the column transformer -->
X_test_prepared = full_pipeline.transform(X_test)

In [90]:
y_final_predict = linear_model.predict(X_test_prepared)
mse = mean_squared_error(y_test, y_final_predict)
rmse = np.sqrt(mse)
print(f"MSE: {mse}, RMSE: {rmse}")

MSE: 82010363.04501358, RMSE: 9055.957323497807
