## Multiple Linear Regression

- In this exercise, we fit a regression model that predicts the profit of a startup company from several independent variables (features), such as R&D spend and marketing spend.
- For simplicity, we ignore the categorical column 'State'. Encoding categorical values is covered in another tutorial.
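
Written out, the model fitted in this notebook has the form below. The coefficient symbols $b_0, \dots, b_3$ are notation introduced here for illustration; they do not appear in the original notebook.

$$
\text{Profit} \approx b_0 + b_1 \cdot (\text{R\&D Spend}) + b_2 \cdot (\text{Administration}) + b_3 \cdot (\text{Marketing Spend})
$$

Here $b_0$ is the intercept and $b_1, b_2, b_3$ are the weights learned for the three numeric features.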

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [12]:
dataset = pd.read_csv('./data/50_Startups.csv')
dataset

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |
| 5 | 131876.9 | 99814.71 | 362861.36 | New York | 156991.12 |
| 6 | 134615.46 | 147198.87 | 127716.82 | California | 156122.51 |
| 7 | 130298.13 | 145530.06 | 323876.68 | Florida | 155752.6 |
| 8 | 120542.52 | 148718.95 | 311613.29 | New York | 152211.77 |
| 9 | 123334.88 | 108679.17 | 304981.62 | California | 149759.96 |


Scikit-learn's regression models expect X and y to have the following shapes:
- X: [number_of_samples, number_of_features]
- y: [number_of_samples]

Therefore, we need to convert the selected DataFrame columns into NumPy arrays with these shapes.
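
As a minimal sketch of how the selection style affects the resulting shape, consider the small example below. The frame `toy` and its values are made up purely for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical three-row frame, used only to illustrate array shapes.
toy = pd.DataFrame({
    'R&D Spend': [100.0, 200.0, 300.0],
    'Marketing Spend': [10.0, 20.0, 30.0],
    'Profit': [50.0, 90.0, 130.0],
})

# Selecting with a list of column names keeps two dimensions.
X_2d = toy[['R&D Spend', 'Marketing Spend']].to_numpy()
# Selecting a single column by name yields a 1-D array.
y_1d = toy['Profit'].to_numpy()
# Even a single feature stays 2-D when selected with a list.
X_single = toy[['R&D Spend']].to_numpy()

print(X_2d.shape, y_1d.shape, X_single.shape)
```

The double brackets matter: a single-feature X selected with `toy['R&D Spend']` would be 1-D, which is not the layout described above for X.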

In [13]:
dataset.columns

Index(['R&D Spend', 'Administration', 'Marketing Spend', 'State', 'Profit'], dtype='object')

In [14]:
X_input = dataset[['R&D Spend', 'Administration', 'Marketing Spend']]   # exclude the categorical 'State' column
y_input = dataset['Profit']
X_input

| | R&D Spend | Administration | Marketing Spend |
|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 |
| 1 | 162597.7 | 151377.59 | 443898.53 |
| 2 | 153441.51 | 101145.55 | 407934.54 |
| 3 | 144372.41 | 118671.85 | 383199.62 |
| 4 | 142107.34 | 91391.77 | 366168.42 |
| 5 | 131876.9 | 99814.71 | 362861.36 |
| 6 | 134615.46 | 147198.87 | 127716.82 |
| 7 | 130298.13 | 145530.06 | 323876.68 |
| 8 | 120542.52 | 148718.95 | 311613.29 |
| 9 | 123334.88 | 108679.17 | 304981.62 |


In [15]:
y_input

0     192261.83
1     191792.06
2     191050.39
3     182901.99
4     166187.94
5     156991.12
6     156122.51
7     155752.60
8     152211.77
9     149759.96
10    146121.95
11    144259.40
12    141585.52
13    134307.35
14    132602.65
15    129917.04
16    126992.93
17    125370.37
18    124266.90
19    122776.86
20    118474.03
21    111313.02
22    110352.25
23    108733.99
24    108552.04
25    107404.34
26    105733.54
27    105008.31
28    103282.38
29    101004.64
30     99937.59
31     97483.56
32     97427.84
33     96778.92
34     96712.80
35     96479.51
36     90708.19
37     89949.14
38     81229.06
39     81005.76
40     78239.91
41     77798.83
42     71498.49
43     69758.98
44     65200.33
45     64926.08
46     49490.75
47     42559.73
48     35673.41
49     14681.40
Name: Profit, dtype: float64

In [16]:
X = X_input.values
y = y_input.values
print("Shape of X: ", X.shape)
print("Shape of y: ", y.shape)

Shape of X:  (50, 3)
Shape of y:  (50,)


## Splitting the dataset into the Training set and Test set

In [17]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the 50 rows for testing; random_state=0 makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
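
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on placeholder arrays of the same shape as X and y. The names `X_demo` and `y_demo` and their contents are assumptions for illustration, not the startup data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shapes as X (50, 3) and y (50,).
X_demo = np.arange(150, dtype=float).reshape(50, 3)
y_demo = np.arange(50, dtype=float)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
# Repeating the call with the same random_state gives the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)        # 40 training rows, 10 test rows
print(np.array_equal(X_te, X_te2))   # identical test sets
```

With 50 samples, `test_size=0.2` leaves 40 rows for training and 10 for testing, which matches the ten predictions shown later in the notebook.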

## Training the Multiple Linear Regression model on the Training set

In [18]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()
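
After fitting, a `LinearRegression` object exposes the learned parameters as `intercept_` and `coef_`. The sketch below demonstrates this on synthetic, noise-free data with known weights; `X_demo`, `y_demo`, and the chosen weights are assumptions made for illustration, and the notebook's own `regressor` can be inspected in the same way.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y is an exact linear function of three features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(50, 3))
y_demo = 5.0 + 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1] - 1.0 * X_demo[:, 2]

model = LinearRegression().fit(X_demo, y_demo)

# One intercept and one coefficient per feature column.
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
```

Because the synthetic target has no noise, the fitted intercept and coefficients should come out very close to 5, 2, 3 and -1. On real data such as the startup profits, the values are estimates rather than exact recoveries.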

## Predicting the Test set results

In [19]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
# Print predictions (left column) next to the true profits (right column)
print(np.column_stack((y_pred, y_test)))

[[103901.9  103282.38]
 [132763.06 144259.4 ]
 [133567.9  146121.95]
 [ 72911.79  77798.83]
 [179627.93 191050.39]
 [115166.65 105008.31]
 [ 67113.58  81229.06]
 [ 98154.81  97483.56]
 [114756.12 110352.25]
 [169064.01 166187.94]]
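
`predict` expects the same 2-D layout as the training input, even for a single sample. The following is a minimal sketch using a tiny made-up dataset; `X_demo`, `y_demo`, and `new_sample` are hypothetical names and values chosen only to show the reshape step.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset: y = 1 + 1*x0 + 2*x1 + 3*x2 exactly.
X_demo = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 0.0],
    [0.0, 1.0, 4.0],
    [3.0, 3.0, 1.0],
    [4.0, 0.0, 2.0],
])
y_demo = 1.0 + X_demo @ np.array([1.0, 2.0, 3.0])

model = LinearRegression().fit(X_demo, y_demo)

# A single sample is 1-D; reshape it to one row with three feature columns.
new_sample = np.array([1.0, 1.0, 1.0])
pred = model.predict(new_sample.reshape(1, -1))

print(pred.shape)
print(pred[0])
```

The result is an array with one entry per input row, so a single-row input yields an array of length one.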


In [20]:
# 'ref' is the true profit, 'pred' the model output, 'diff' the absolute error
data = {'ref': y_test, 'pred': y_pred, 'diff': np.abs(y_test - y_pred)}
df = pd.DataFrame(data)
df

| | ref | pred | diff |
|---|---|---|---|
| 0 | 103282.38 | 103901.89697 | 619.51697 |
| 1 | 144259.4 | 132763.059931 | 11496.340069 |
| 2 | 146121.95 | 133567.9037 | 12554.0463 |
| 3 | 77798.83 | 72911.789767 | 4887.040233 |
| 4 | 191050.39 | 179627.925672 | 11422.464328 |
| 5 | 105008.31 | 115166.648648 | 10158.338648 |
| 6 | 81229.06 | 67113.576906 | 14115.483094 |
| 7 | 97483.56 | 98154.806868 | 671.246868 |
| 8 | 110352.25 | 114756.115552 | 4403.865552 |
| 9 | 166187.94 | 169064.014088 | 2876.074088 |
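
The per-row differences above can also be condensed into summary scores. The sketch below is not part of the original notebook: it copies the ten reference values and the predictions as printed earlier (rounded to two decimals) and computes the mean absolute error and the R² score with scikit-learn. The names `y_ref` and `y_hat` are introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Reference profits and predictions copied from the output above
# (predictions rounded to two decimals, as printed).
y_ref = np.array([103282.38, 144259.40, 146121.95, 77798.83, 191050.39,
                  105008.31, 81229.06, 97483.56, 110352.25, 166187.94])
y_hat = np.array([103901.90, 132763.06, 133567.90, 72911.79, 179627.93,
                  115166.65, 67113.58, 98154.81, 114756.12, 169064.01])

mae = mean_absolute_error(y_ref, y_hat)  # average of the 'diff' column
r2 = r2_score(y_ref, y_hat)              # 1.0 would be a perfect fit

print(f"MAE: {mae:.2f}")
print(f"R^2: {r2:.4f}")
```

In the notebook itself, the same calls could be made directly on `y_test` and `y_pred`. Keep in mind that a ten-row test set gives only a rough estimate of how the model generalizes.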
