# Multiple Linear Regression

- Simple linear regression (SLR) uses a single predictor, whereas multiple linear regression (MLR) uses more than one.
- MLR formula: **Y = β0 + β1X1 + ... + βnXn**
    - Y  = target variable
    - X1, ..., Xn = predictor variables
    - β0 = constant (intercept): where the regression line crosses the Y axis, i.e. the estimated value of Y when every X is 0
    - β1, ..., βn = regression coefficients (slopes), one per predictor
    - n  = number of predictors
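
As a minimal sketch of how the formula turns coefficients into a prediction, the function below evaluates it for a single observation. The name `predict_mlr` and the coefficient values are made up for illustration; they are not taken from the dataset.

```python
def predict_mlr(intercept, coefficients, features):
    """Return b0 + b1*x1 + ... + bn*xn for a single observation."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per predictor")
    return intercept + sum(b * x for b, x in zip(coefficients, features))


# Hypothetical model with n = 2 predictors: 1 + 2*4 + 3*5
print(predict_mlr(1.0, [2.0, 3.0], [4.0, 5.0]))
```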

#### 1. Reading the Data

In [26]:
import pandas as pd

df = pd.read_csv('dataset/50_Startups.csv')
df.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


#### 2. Data Pre-processing

##### 2.1 Inspecting the Data

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   R&D Spend        50 non-null     float64
 1   Administration   50 non-null     float64
 2   Marketing Spend  50 non-null     float64
 3   State            50 non-null     object 
 4   Profit           50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB


**insight:**
- there are 4 numeric columns and one categorical column (`State`)
- there are no null values
- the categorical column must be converted to numeric form through one-hot encoding
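
The same checks can be done programmatically rather than by reading the `df.info()` output. The sketch below uses a small made-up DataFrame (`sample`) in place of the real dataset, so the values are illustrative only.

```python
import pandas as pd

# Small stand-in for the startup data (values are made up)
sample = pd.DataFrame({
    "R&D Spend": [1000.0, 2000.0, 3000.0],
    "State": ["New York", "California", "Florida"],
    "Profit": [500.0, 900.0, 1300.0],
})

# Count missing values per column, then add them up for a total
total_nulls = int(sample.isnull().sum().sum())

# Columns that are not numeric need encoding before regression
categorical_columns = list(sample.select_dtypes(exclude="number").columns)

print(total_nulls)
print(categorical_columns)
```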

##### 2.2 Defining the Predictor and Target Variables

In [28]:
x = df.iloc[:, :-1]  # predictors: every column except the last
y = df.iloc[:, 4]    # target: the Profit column

In [29]:
print(x.head())

   R&D Spend  Administration  Marketing Spend       State
0  165349.20       136897.80        471784.10    New York
1  162597.70       151377.59        443898.53  California
2  153441.51       101145.55        407934.54     Florida
3  144372.41       118671.85        383199.62    New York
4  142107.34        91391.77        366168.42     Florida


In [30]:
print(y.head())

0    192261.83
1    191792.06
2    191050.39
3    182901.99
4    166187.94
Name: Profit, dtype: float64


After defining the target and predictor variables, the categorical column must be converted into numeric columns.

##### 2.3 One-Hot Encoding

One-hot encoding converts categorical data into numeric data.

In [31]:
# First, create dummy variables: 0/1 indicator columns that represent each category numerically
states = pd.get_dummies(x['State'], drop_first=True)  # drop_first removes the first dummy column to avoid the dummy variable trap
states = states.astype(int) 
states.head()

| | Florida | New York |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 0 |
| 2 | 1 | 0 |
| 3 | 0 | 1 |
| 4 | 1 | 0 |
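
To see why one dummy column is dropped, the sketch below compares the full encoding with the `drop_first=True` encoding on five made-up rows. The helper `design_rank` is a hypothetical name introduced here; it measures the rank of the design matrix once a constant column is added, as a regression with an intercept would do.

```python
import numpy as np
import pandas as pd

state_column = pd.Series(["New York", "California", "Florida", "New York", "Florida"])

full = pd.get_dummies(state_column, dtype=int)                      # one column per category
reduced = pd.get_dummies(state_column, drop_first=True, dtype=int)  # first category dropped


def design_rank(dummies):
    """Rank of the design matrix [constant | dummies]."""
    constant = np.ones((len(dummies), 1))
    matrix = np.hstack([constant, dummies.to_numpy(dtype=float)])
    return int(np.linalg.matrix_rank(matrix))


# Every row of the full encoding sums to 1, which duplicates the constant column
print(full.sum(axis=1).tolist())
print(full.shape[1] + 1, design_rank(full))        # column count vs. rank, full encoding
print(reduced.shape[1] + 1, design_rank(reduced))  # column count vs. rank, reduced encoding
```

With the full encoding the design matrix has more columns than its rank, so the coefficients are not uniquely determined; dropping one category removes that redundancy.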


Once encoding is done, `states` (the encoded variable) is concatenated with the predictor variable `x`.

In [32]:
x=pd.concat([x,states],axis=1)

In [33]:
print(x.head())

   R&D Spend  Administration  Marketing Spend       State  Florida  New York
0  165349.20       136897.80        471784.10    New York        0         1
1  162597.70       151377.59        443898.53  California        0         0
2  153441.51       101145.55        407934.54     Florida        1         0
3  144372.41       118671.85        383199.62    New York        0         1
4  142107.34        91391.77        366168.42     Florida        1         0


We also need to drop the `State` column from `x`, since it is no longer used.

In [34]:
x=x.drop('State',axis=1)

In [35]:
print(x.head())

   R&D Spend  Administration  Marketing Spend  Florida  New York
0  165349.20       136897.80        471784.10        0         1
1  162597.70       151377.59        443898.53        0         0
2  153441.51       101145.55        407934.54        1         0
3  144372.41       118671.85        383199.62        0         1
4  142107.34        91391.77        366168.42        1         0
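
The encode, concatenate, and drop steps can also be collapsed into a single `pd.get_dummies` call on the whole DataFrame. This is an alternative sketch on made-up rows, not the approach used in this notebook; note that the dummy columns then carry a `State_` prefix.

```python
import pandas as pd

# Made-up rows with the same kind of columns as the predictors
features = pd.DataFrame({
    "R&D Spend": [1000.0, 2000.0, 3000.0],
    "State": ["New York", "California", "Florida"],
})

# Encode State, append the dummy columns, and drop the original column in one call
encoded = pd.get_dummies(features, columns=["State"], drop_first=True, dtype=int)

print(list(encoded.columns))
```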


#### 3. Splitting Data

In [36]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

#### 4. Linear Regression Modeling

In [37]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(x_train, y_train)

#### 5. Making Predictions

In [38]:
y_pred = regressor.predict(x_test)

In [39]:
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df

| | Actual | Predicted |
|---|---|---|
| 28 | 103282.38 | 103015.201598 |
| 11 | 144259.4 | 132582.277608 |
| 10 | 146121.95 | 132447.738452 |
| 41 | 77798.83 | 71976.098513 |
| 2 | 191050.39 | 178537.482211 |
| 27 | 105008.31 | 116161.242302 |
| 38 | 81229.06 | 67851.692097 |
| 31 | 97483.56 | 98791.733747 |
| 22 | 110352.25 | 113969.43533 |
| 4 | 166187.94 | 167921.065696 |


#### 6. Evaluating Model Performance

In [40]:
from sklearn import metrics
print('Model R^2 Square value', metrics.r2_score(y_test, y_pred))

Model R^2 Square value 0.9347068473282423


**insight:**
- an R-squared of about 0.93 means the model explains roughly 93% of the variance in profit on the test data, which indicates a good fit
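
The score can be checked by hand from its definition, R² = 1 - SS_res / SS_tot. The sketch below copies the actual and predicted values from the table in step 5 and uses plain Python instead of `metrics.r2_score`; the variable names are chosen here for illustration.

```python
# Actual and predicted profits copied from the table in step 5
actual = [103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
          105008.31, 81229.06, 97483.56, 110352.25, 166187.94]
predicted = [103015.201598, 132582.277608, 132447.738452, 71976.098513, 178537.482211,
             116161.242302, 67851.692097, 98791.733747, 113969.43533, 167921.065696]

mean_actual = sum(actual) / len(actual)

# Residual sum of squares: squared error of the model's predictions
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Total sum of squares: squared error of always predicting the mean
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 4))  # agrees with the reported score to four decimals
```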

#### 7. Regression Analysis

##### 7.1 Using 5 predictors

In [41]:
import statsmodels.api as sm  # provides methods for statistical analysis and regression

# Add a constant column to x. statsmodels' OLS does not include an intercept by default,
# so add_constant prepends a column of 1s to x so that the intercept can be estimated.
x_new = sm.add_constant(x)

# Select the columns used in the regression model.
# Initially all independent variables are included: x_opt holds the constant column plus the selected columns.
x_opt = x_new.iloc[:, [0, 1, 2, 3, 4, 5]]

# Build a multiple linear regression model using Ordinary Least Squares (OLS) from statsmodels.
# endog is the dependent variable (here, y) and exog holds the independent variables (here, x_opt).
# fit() estimates the regression coefficients and other parameters.
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()

# Display the summary of the regression results
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.945 |
| Method: | Least Squares | F-statistic: | 169.9 |
| Date: | Mon, 27 Nov 2023 | Prob (F-statistic): | 1.34e-27 |
| Time: | 08:41:08 | Log-Likelihood: | -525.38 |
| No. Observations: | 50 | AIC: | 1063.0 |
| Df Residuals: | 44 | BIC: | 1074.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.013e+04 | 6884.820 | 7.281 | 0.000 | 3.62e+04 | 6.4e+04 |
| R&D Spend | 0.8060 | 0.046 | 17.369 | 0.000 | 0.712 | 0.900 |
| Administration | -0.0270 | 0.052 | -0.517 | 0.608 | -0.132 | 0.078 |
| Marketing Spend | 0.0270 | 0.017 | 1.574 | 0.123 | -0.008 | 0.062 |
| Florida | 198.7888 | 3371.007 | 0.059 | 0.953 | -6595.030 | 6992.607 |
| New York | -41.8870 | 3256.039 | -0.013 | 0.990 | -6604.003 | 6520.229 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.782 | Durbin-Watson: | 1.283 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.266 |
| Skew: | -0.948 | Prob(JB): | 2.41e-05 |
| Kurtosis: | 5.572 | Cond. No. | 1.45e+06 |


**insight:**
- The p-value indicates how strong the evidence is that a coefficient differs from zero. The smaller it is the better; a predictor is commonly treated as statistically significant when its p-value is below 0.05.
- In the output above, `R&D Spend` has the best (lowest) p-value
- The `New York` column has the highest p-value (0.990), so it is removed first

##### 7.2 Using 4 predictors

In [42]:
x_opt = x_new.iloc[:, [0, 1, 2, 3, 4]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.946 |
| Method: | Least Squares | F-statistic: | 217.2 |
| Date: | Mon, 27 Nov 2023 | Prob (F-statistic): | 8.49e-29 |
| Time: | 08:41:08 | Log-Likelihood: | -525.38 |
| No. Observations: | 50 | AIC: | 1061.0 |
| Df Residuals: | 45 | BIC: | 1070.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.011e+04 | 6647.870 | 7.537 | 0.000 | 3.67e+04 | 6.35e+04 |
| R&D Spend | 0.8060 | 0.046 | 17.606 | 0.000 | 0.714 | 0.898 |
| Administration | -0.0270 | 0.052 | -0.523 | 0.604 | -0.131 | 0.077 |
| Marketing Spend | 0.0270 | 0.017 | 1.592 | 0.118 | -0.007 | 0.061 |
| Florida | 220.1585 | 2900.536 | 0.076 | 0.940 | -5621.821 | 6062.138 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.758 | Durbin-Watson: | 1.282 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.172 |
| Skew: | -0.948 | Prob(JB): | 2.53e-05 |
| Kurtosis: | 5.563 | Cond. No. | 1.40e+06 |


**insight:**
- the `Florida` column now has the highest p-value (0.940), so it should be removed

##### 7.3 Using 3 predictors

In [43]:
x_opt = x_new.iloc[:, [0, 1, 2, 3]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()


| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.951 |
| Model: | OLS | Adj. R-squared: | 0.948 |
| Method: | Least Squares | F-statistic: | 296.0 |
| Date: | Mon, 27 Nov 2023 | Prob (F-statistic): | 4.53e-30 |
| Time: | 08:41:08 | Log-Likelihood: | -525.39 |
| No. Observations: | 50 | AIC: | 1059.0 |
| Df Residuals: | 46 | BIC: | 1066.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.012e+04 | 6572.353 | 7.626 | 0.000 | 3.69e+04 | 6.34e+04 |
| R&D Spend | 0.8057 | 0.045 | 17.846 | 0.000 | 0.715 | 0.897 |
| Administration | -0.0268 | 0.051 | -0.526 | 0.602 | -0.130 | 0.076 |
| Marketing Spend | 0.0272 | 0.016 | 1.655 | 0.105 | -0.006 | 0.060 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.838 | Durbin-Watson: | 1.282 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 21.442 |
| Skew: | -0.949 | Prob(JB): | 2.21e-05 |
| Kurtosis: | 5.586 | Cond. No. | 1.40e+06 |


**insight:**
- the `Administration` (p = 0.602) and `Marketing Spend` (p = 0.105) columns are still not statistically significant, so both should be removed

##### 7.4 Using 1 predictor

In [44]:
x_opt = x_new.iloc[:, [0, 1]]
regressor_OLS = sm.OLS(endog = y, exog = x_opt).fit()
regressor_OLS.summary()


| | | | |
|---|---|---|---|
| Dep. Variable: | Profit | R-squared: | 0.947 |
| Model: | OLS | Adj. R-squared: | 0.945 |
| Method: | Least Squares | F-statistic: | 849.8 |
| Date: | Mon, 27 Nov 2023 | Prob (F-statistic): | 3.50e-32 |
| Time: | 08:41:08 | Log-Likelihood: | -527.44 |
| No. Observations: | 50 | AIC: | 1059.0 |
| Df Residuals: | 48 | BIC: | 1063.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.903e+04 | 2537.897 | 19.320 | 0.000 | 4.39e+04 | 5.41e+04 |
| R&D Spend | 0.8543 | 0.029 | 29.151 | 0.000 | 0.795 | 0.913 |

| | | | |
|---|---|---|---|
| Omnibus: | 13.727 | Durbin-Watson: | 1.116 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 18.536 |
| Skew: | -0.911 | Prob(JB): | 9.44e-05 |
| Kurtosis: | 5.361 | Cond. No. | 1.65e+05 |


**insight:**
- only one variable remains, `R&D Spend`, and its p-value meets the significance threshold
- the resulting equation is y = 49030 + 0.8543X, where X is the R&D spend (the intercept in the summary is 4.903e+04, i.e. about 49030)
- in other words, for these 50 companies profit is strongly associated with R&D spend: the higher the R&D spend, the higher the predicted profit
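
As a worked example, the rounded coefficients from the last summary table (const 4.903e+04 and `R&D Spend` 0.8543) can be plugged into the equation directly. The function name `predict_profit` and the R&D figure of 100,000 are hypothetical inputs chosen for illustration.

```python
INTERCEPT = 49030.0  # const, 4.903e+04 in the summary table
SLOPE = 0.8543       # coefficient of R&D Spend


def predict_profit(rd_spend):
    """Estimated profit for a given R&D spend, using the one-predictor model."""
    return INTERCEPT + SLOPE * rd_spend


# 49030 + 0.8543 * 100000 = 134460
print(round(predict_profit(100_000), 2))
```

Because the coefficients are rounded, the result is an approximation of what the fitted model itself would return.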

## Advantages and Disadvantages of Regression

### Advantages

+ easy to apply, and its coefficients are easy to interpret
+ performs well on data with a linear relationship, while being less complex than many other algorithms
+ although it is sensitive to outliers, this can be mitigated to some extent with standardization, dimensionality reduction, regularization, or cross-validation

### Disadvantages
- outliers that go undetected can strongly distort the model
- the algorithm cannot capture every predictor variable that influences the target
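
To illustrate the first point, the sketch below fits a least-squares slope to five made-up points that lie exactly on y = 2x, then adds a single outlier and fits again. The helper `ols_slope` is written here in plain Python for illustration and is not part of the notebook.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Five points exactly on the line y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
clean_slope = ols_slope(xs, ys)

# The same points plus one outlier at x = 6 (the line would give 12, not 60)
outlier_slope = ols_slope(xs + [6], ys + [60])

print(clean_slope, round(outlier_slope, 3))
```

One extreme point out of six is enough to pull the slope far away from 2, which is why outlier checks matter before fitting.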