# Multiple Linear Regression

In multiple linear regression we have **many** independent variables and only **one** dependent variable.
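
As a sketch of what this looks like in symbols (the names $y$, $x_j$, $b_j$, $n$ and $\varepsilon$ are notation assumed here, not taken from the notebook), the model is usually written as:

$$
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n + \varepsilon
$$

where $y$ is the dependent variable, $x_1, \dots, x_n$ are the independent variables, $b_0$ is the intercept, $b_1, \dots, b_n$ are the coefficients to be learned, and $\varepsilon$ is the error term.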

### 1. Assumptions of a linear regression model -->

Every linear regression (LR) model rests on a set of assumptions. The major ones are:
>    1. linearity,
>    2. homoscedasticity,
>    3. multivariate normality,
>    4. independence of errors, and
>    5. lack of multicollinearity.

### 2. Dummy variables -->

Dummy variables are one way to handle categorical values. The idea is to create a separate feature for each category. Example:
> colors: {red, blue, green, red, green}

Here the three categories are {red, blue, green}. Three columns are created, one per category, each holding boolean values. If the i<sup>th</sup> example is red, then only the _red_ column holds a 1 and all the others hold 0. This is repeated for every training example.

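This is a great way to handle categorical values, but it can lead to some problems. The major one is _multicollinearity_: the dummy columns always sum to 1, so any one of them can be predicted exactly from the others (the so-called dummy variable trap). A common remedy is to drop one of the dummy columns.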
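
A minimal sketch of the idea using pandas' `get_dummies`, applied to the color list above. The variable names (`colors`, `dummies`, `reduced`) are illustrative, and dropping the first category is just one common way to avoid the redundancy, not necessarily what this notebook does later.

```python
import pandas as pd

# The color list from the example above.
colors = pd.Series(["red", "blue", "green", "red", "green"], name="color")

# One 0/1 column per category (columns come out in sorted order).
dummies = pd.get_dummies(colors, dtype=int)
print(dummies)

# Every row sums to 1, so any one column is fully determined by the others.
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())

# Dropping the first category removes that redundancy.
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)
print(reduced)
```

With the first column dropped, a row of all zeros stands for the dropped category, so no information is lost.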

### 3. P value -->

Every event has some probability associated with it. For example, a fair coin toss has a 50/50 chance of giving heads or tails.

But how can the "_fairness_" of the coin be judged? This is where _hypothesis testing_ comes in.

The coin can be either fair or unfair. An assumption is made about the _state_ of the coin, and the assumption is tested by tossing it. If the coin is fair, we expect a mix of heads and tails. But if we see the same outcome again and again, it starts to seem <u>sus</u>.

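That sus feeling can be quantified. Starting from the initial hypothesis, called the "_null hypothesis_" (here: the coin is fair), the _**P**_ value is the probability of seeing a result at least as extreme as the one observed if the null hypothesis were true. The smaller the P value, the more sus the null hypothesis looks; once it falls below a chosen threshold (the significance level, commonly 0.05), the null hypothesis is rejected.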
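
To make the coin example concrete, here is a small sketch that computes an exact two-sided P value for the null hypothesis "the coin is fair". The function name `two_sided_p_value` and the 9-heads-in-10-tosses scenario are made up for illustration; only the standard library is used.

```python
from math import comb


def two_sided_p_value(heads: int, tosses: int) -> float:
    """Exact two-sided P value under the null hypothesis of a fair coin."""
    # Probability of each possible number of heads if the coin is fair.
    probs = [comb(tosses, k) * 0.5 ** tosses for k in range(tosses + 1)]
    observed = probs[heads]
    # Add up every outcome that is at least as unlikely as the observed one.
    # The tiny tolerance guards against floating-point ties.
    return sum(p for p in probs if p <= observed + 1e-12)


p = two_sided_p_value(heads=9, tosses=10)
print(f"P value for 9 heads in 10 tosses: {p:.4f}")
```

For 9 heads in 10 tosses the outcomes counted are 0, 1, 9 and 10 heads, giving 22/1024, roughly 0.0215. That is below the usual 0.05 threshold, so the fair-coin hypothesis would be rejected at that level.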

### 4. How to build a model? -->

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# importing the dataset -->
raw_data = pd.read_csv("data/50_Startups.csv")

In [None]:
# data description -->
raw_data.describe()

In [None]:
raw_data.info()

In [5]:
# creating the train-test split -->
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(raw_data, test_size=0.2, random_state=42)

In [26]:
# using pipelines and column transformers to pre-process data -->
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

In [47]:
def split_xy(dataset: pd.DataFrame, target: str) -> tuple[pd.DataFrame, pd.Series]:
    # split the data into the features X and the target column y (a Series)
    X = dataset.drop(target, axis=1)
    y = dataset[target].copy()

    return X, y

In [46]:
def get_nums_cats(X: pd.DataFrame, cat_vars: list) -> tuple[pd.DataFrame, pd.DataFrame]:
    # numeric columns are picked by dtype, categorical ones by the given names
    X_nums = X.select_dtypes(include=[np.number])
    X_cats = X[cat_vars]

    return X_nums, X_cats

In [13]:
X_train, y_train = split_xy(train_data, "Profit")

In [38]:
X_train_num, X_train_cat = get_nums_cats(X_train, ["State"])

In [43]:
# applying the column transformer -->
num_cols = X_train_num.columns
cat_cols = X_train_cat.columns

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler())
])

X_train_num_pl = num_pipeline.fit_transform(X_train_num)

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", OneHotEncoder(), cat_cols)
])

X_train_prepared = full_pipeline.fit_transform(X_train)

In [None]:
X_train_prepared
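
One possible next step, sketched on synthetic data so that it runs without `data/50_Startups.csv`: chain a preprocessing step like the one above with scikit-learn's `LinearRegression` in a single `Pipeline`. The `Spend` column, the state labels and the generated numbers are invented for illustration; only the `State` and `Profit` column names come from the notebook. Ending with `LinearRegression` is an assumption about where the notebook is heading.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in data: "Spend" and the state labels are invented.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Spend": rng.uniform(0, 100, size=40),
    "State": rng.choice(["A", "B", "C"], size=40),
})
# Profit depends linearly on Spend, plus a little noise.
demo["Profit"] = 3.0 * demo["Spend"] + rng.normal(0, 1, size=40)

X = demo.drop("Profit", axis=1)
y = demo["Profit"]

# Scale the numeric column and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Spend"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["State"]),
])

# Preprocessing and regression are fitted together as one estimator.
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])
model.fit(X, y)

r2 = model.score(X, y)  # R^2 on the training data
predictions = model.predict(X)
print(f"training R^2: {r2:.3f}")
```

Keeping the preprocessing inside the pipeline means the same transformations fitted on the training data are reused automatically when `model.predict` is called on the test split.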