# Machine Learning
### Textbook is available at: [https://www.github.com/a-mhamdi/isetbz](https://www.github.com/a-mhamdi/isetbz)

---

### Multiple Linear Regression

**Importing the libraries**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

**Importing the dataset**

In [2]:
df = pd.read_csv('datasets/50_Startups.csv')
df.head()

|   | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|-----------|----------------|-----------------|-------|--------|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


Extract the features $X$ and the target $y$ from the dataset. **Profit**, the last column, is the dependent variable; every other column is a feature.

In [3]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

Check the first five observations in $X$.

In [4]:
print(X[:5])

[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 [144372.41 118671.85 383199.62 'New York']
 [142107.34 91391.77 366168.42 'Florida']]


Check the corresponding first five values from the **Profit** column.

In [5]:
print(y[:5])

[192261.83 191792.06 191050.39 182901.99 166187.94]


**Encoding categorical data**

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

In [7]:
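# One-hot encode column 3 (State); remainder='passthrough' keeps the numeric columns, placed after the encoded ones
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')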
X = np.array(ct.fit_transform(X))

In [8]:
print(X[:5])

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]]
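
The three leading columns are the one-hot encoding of **State**. Comparing the two printouts of $X$, `[0.0 0.0 1.0]` stands for New York, `[1.0 0.0 0.0]` for California and `[0.0 1.0 0.0]` for Florida, which is consistent with one indicator column per category in alphabetical order. The idea can be sketched in plain Python. This is an illustration only, not scikit-learn's implementation, and the helper name `one_hot` is hypothetical.

```python
def one_hot(values):
    """Return (categories, rows) with one indicator column per sorted category."""
    categories = sorted(set(values))
    rows = [[1.0 if value == category else 0.0 for category in categories]
            for value in values]
    return categories, rows


# State values of the first five observations shown above
states = ['New York', 'California', 'Florida', 'New York', 'Florida']
categories, encoded = one_hot(states)

print(categories)
for row in encoded:
    print(row)
```

Each row contains exactly one `1.0`, in the column that belongs to its state.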


**Splitting the dataset into training set and test set**

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
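
With `test_size=0.2`, one fifth of the rows is held out for testing, and `random_state=0` makes the shuffle reproducible. The sketch below shows the principle on row indices only, assuming the dataset has 50 rows as the file name suggests. The helper `simple_split` is hypothetical and does not reproduce the exact rows that `train_test_split` selects.

```python
import random


def simple_split(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices, then hold out a `test_size` fraction as the test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded, hence reproducible
    n_test = round(n_rows * test_size)     # simplification of the rounding rule
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = simple_split(50)
print(len(train_idx), len(test_idx))  # 40 10
```

Every row lands in exactly one of the two sets, so the model is later evaluated on observations it never saw during training.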

**Training the multiple linear regression model on the training set**

In [11]:
from sklearn.linear_model import LinearRegression

In [12]:
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()
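
As a reminder of what is being fitted, here is the model in standard ordinary least squares notation. The symbols $\beta_j$, $x_{ij}$ and $n$ are introduced here for illustration and do not appear in the notebook. Each prediction is an intercept plus a weighted sum of the six columns of $X$ (three state indicators and three numeric features):

$$
\hat{y}_i = \beta_0 + \sum_{j=1}^{6} \beta_j \, x_{ij}
$$

The coefficients are chosen to minimise the sum of squared residuals over the $n$ training observations:

$$
\min_{\beta_0, \dots, \beta_6} \; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$

After fitting, the estimated intercept and coefficients are available as `lr.intercept_` and `lr.coef_`.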

**Making predictions on the test set and comparing them with the actual values**

In [13]:
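y_pred = lr.predict(X_test)
np.set_printoptions(precision=2)
print(type(y_pred), y_pred.shape)
# Reshape both 1-D arrays into column vectors so they can be printed side by side
y_pred = y_pred.reshape(-1,1)
print(type(y_pred), y_pred.shape)
y_test = y_test.reshape(-1,1)
# Left column: predicted profit; right column: actual profit
print(np.concatenate((y_pred, y_test), axis=1))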

<class 'numpy.ndarray'> (10,)
<class 'numpy.ndarray'> (10, 1)
[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]
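
Reading the two columns side by side gives a first impression. A summary metric makes the comparison easier to judge. The sketch below computes the mean absolute error and the coefficient of determination $R^2$ by hand from the ten (predicted, actual) pairs printed above. Because those values are rounded to two decimals, the results are approximate. The variable names are chosen for illustration.

```python
# (predicted, actual) profit pairs copied from the output above
pairs = [
    (103015.20, 103282.38), (132582.28, 144259.40),
    (132447.74, 146121.95), (71976.10, 77798.83),
    (178537.48, 191050.39), (116161.24, 105008.31),
    (67851.69, 81229.06), (98791.73, 97483.56),
    (113969.44, 110352.25), (167921.07, 166187.94),
]

n = len(pairs)
actual = [t for _, t in pairs]
mean_actual = sum(actual) / n

# Mean absolute error: average size of the prediction error
mae = sum(abs(t - p) for p, t in pairs) / n

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = sum((t - p) ** 2 for p, t in pairs)
ss_tot = sum((t - mean_actual) ** 2 for t in actual)
r2 = 1 - ss_res / ss_tot

print(f"MAE = {mae:.2f}, R^2 = {r2:.4f}")
```

On these ten rows the mean absolute error is roughly 7,500 and $R^2$ is about 0.93. In the notebook itself, `sklearn.metrics.mean_absolute_error` and `sklearn.metrics.r2_score` compute the same quantities directly from `y_test` and `y_pred`.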
