# Linear Regression Case Study

## Case 1
Evaluate the start-up sector.
The data consists of 4 predictor variables:
- R&D Expenditure (in $)
- Admin Expenditure (in $)
- Marketing Expenditure (in $)
- State

Dependent variable:
- Profit (in $)

In [2]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

sns.set()


In [4]:
# read in data
data = pd.read_csv('./data/Startups.csv')
data.head()

Unnamed: 0,R&D Expenditure,Administration Expenditure,Marketing Expenditure,State,Profit
0,165349.2,136897.8,471784.1,Florida,192261.83
1,162597.7,151377.59,443898.53,Florida,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,Florida,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [6]:
# Regression
# Profit ~ Marketing Expenditure
y = data['Profit']
x1 = data['Marketing Expenditure']
X = sm.add_constant(x1)

results = sm.OLS(y, X).fit()
results.summary()

Dep. Variable:,Profit,R-squared:,0.559
Model:,OLS,Adj. R-squared:,0.55
Method:,Least Squares,F-statistic:,60.88
Date:,"Wed, 20 Mar 2024",Prob (F-statistic):,4.38e-10
Time:,22:29:56,Log-Likelihood:,-580.18
No. Observations:,50,AIC:,1164.0
Df Residuals:,48,BIC:,1168.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,6e+04,7684.530,7.808,0.000,4.46e+04,7.55e+04
Marketing Expenditure,0.2465,0.032,7.803,0.000,0.183,0.310

Omnibus:,4.42,Durbin-Watson:,1.178
Prob(Omnibus):,0.11,Jarque-Bera (JB):,3.882
Skew:,-0.336,Prob(JB):,0.144
Kurtosis:,4.188,Cond. No.,489000.0
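
The adjusted R-squared in the summary above can be reproduced from the other entries in the same table. This is a minimal sketch, assuming the usual definition `1 - (1 - R²)(n - 1)/(n - k - 1)` for a model with an intercept. The helper name `adjusted_r2` is hypothetical, and the inputs are the rounded values printed in the table, so agreement is only to about three decimals.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model that includes an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values copied from the summary above:
# R-squared 0.559, 50 observations, 1 predictor (Marketing Expenditure)
adj = adjusted_r2(0.559, 50, 1)
print(round(adj, 3))  # 0.55
```

With a single predictor and 50 observations the penalty is small, which is why the adjusted value sits just below the raw R-squared.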


In [7]:
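# Prediction: expected profit at a marketing spend of $325,000
# (.iloc makes the positional lookup explicit; params is indexed by label)
results.params.iloc[0] + results.params.iloc[1]*325000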

140102.80976194615
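
The prediction above is the fitted line evaluated at a marketing spend of $325,000: intercept plus slope times x. The sketch below shows the same arithmetic as a dot product between a design row `[1, x]` and the coefficient vector. It uses NumPy's least-squares solver instead of statsmodels so that it runs without the CSV file, and it fits only the five rows shown in the `data.head()` preview, so its coefficients and prediction will not match the 50-row fit. All variable names are illustrative.

```python
import numpy as np

# Five preview rows from data.head(): marketing spend and profit (in $)
marketing = np.array([471784.10, 443898.53, 407934.54, 383199.62, 366168.42])
profit = np.array([192261.83, 191792.06, 191050.39, 182901.99, 166187.94])

# Design matrix with a leading column of ones, mirroring sm.add_constant
X = np.column_stack([np.ones_like(marketing), marketing])

# Ordinary least squares: beta[0] is the intercept, beta[1] the slope
beta, *_ = np.linalg.lstsq(X, profit, rcond=None)

# Predict at a new marketing spend, first by hand and then as a dot product
new_spend = 325_000
by_hand = beta[0] + beta[1] * new_spend
by_dot = np.array([1.0, new_spend]) @ beta

print(by_hand, by_dot)
```

The dot-product form extends directly to models with several predictors, where the design row becomes `[1, x1, x2, ...]` in the same column order as the fitted coefficients.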

In [11]:
# Regression
# Profit ~ Marketing Expenditure + R&D + Admin
y = data['Profit']
x1 = data[['Marketing Expenditure', 'R&D Expenditure', 'Administration Expenditure']]
X = sm.add_constant(x1)

results = sm.OLS(y, X).fit()
results.summary()

Dep. Variable:,Profit,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.948
Method:,Least Squares,F-statistic:,296.0
Date:,"Wed, 20 Mar 2024",Prob (F-statistic):,4.53e-30
Time:,22:32:50,Log-Likelihood:,-525.39
No. Observations:,50,AIC:,1059.0
Df Residuals:,46,BIC:,1066.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,5.012e+04,6572.353,7.626,0.000,3.69e+04,6.34e+04
Marketing Expenditure,0.0272,0.016,1.655,0.105,-0.006,0.060
R&D Expenditure,0.8057,0.045,17.846,0.000,0.715,0.897
Administration Expenditure,-0.0268,0.051,-0.526,0.602,-0.130,0.076

Omnibus:,14.838,Durbin-Watson:,1.282
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.442
Skew:,-0.949,Prob(JB):,2.21e-05
Kurtosis:,5.586,Cond. No.,1400000.0
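
One way to read the coefficient table above: the `t` column is `coef` divided by `std err`, and a two-sided p-value below 0.05 corresponds to a 95% confidence interval that excludes zero. The sketch below applies that check to the three predictors, using values copied from the table. Because those values are rounded, the recomputed t statistics are approximate, and the dictionary layout is only an illustration.

```python
# Rows copied from the coefficient table above:
# name -> (coef, std err, lower 95% bound, upper 95% bound)
coef_table = {
    "Marketing Expenditure": (0.0272, 0.016, -0.006, 0.060),
    "R&D Expenditure": (0.8057, 0.045, 0.715, 0.897),
    "Administration Expenditure": (-0.0268, 0.051, -0.130, 0.076),
}

significant = {}
for name, (coef, se, lower, upper) in coef_table.items():
    t_stat = coef / se  # approximate, since coef and se are rounded in the table
    # A 95% interval that excludes zero corresponds to p < 0.05
    significant[name] = not (lower <= 0 <= upper)
    print(f"{name}: t ~ {t_stat:.2f}, significant at 5%: {significant[name]}")
```

Only R&D Expenditure passes this check, which is consistent with its p-value of 0.000 against 0.105 for Marketing and 0.602 for Administration.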


In [12]:
# Regression
# Profit ~ R&D
y = data['Profit']
x1 = data['R&D Expenditure']
X = sm.add_constant(x1)

results = sm.OLS(y, X).fit()
results.summary()

Dep. Variable:,Profit,R-squared:,0.947
Model:,OLS,Adj. R-squared:,0.945
Method:,Least Squares,F-statistic:,849.8
Date:,"Wed, 20 Mar 2024",Prob (F-statistic):,3.50e-32
Time:,22:34:48,Log-Likelihood:,-527.44
No. Observations:,50,AIC:,1059.0
Df Residuals:,48,BIC:,1063.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,4.903e+04,2537.897,19.320,0.000,4.39e+04,5.41e+04
R&D Expenditure,0.8543,0.029,29.151,0.000,0.795,0.913

Omnibus:,13.727,Durbin-Watson:,1.116
Prob(Omnibus):,0.001,Jarque-Bera (JB):,18.536
Skew:,-0.911,Prob(JB):,9.44e-05
Kurtosis:,5.361,Cond. No.,165000.0
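
The three fits can be compared on the information criteria printed in their summaries. The sketch below recomputes AIC and BIC from the reported log-likelihoods, assuming the standard definitions `AIC = 2p - 2·logL` and `BIC = p·ln(n) - 2·logL`, where `p` counts the estimated coefficients including the constant. The model labels are shorthand for the three regressions, and picking the lowest BIC is one possible comparison rule, not necessarily the one intended here.

```python
import math

n_obs = 50  # "No. Observations" in every summary

# model -> (log-likelihood, number of estimated coefficients incl. the constant)
models = {
    "Profit ~ Marketing": (-580.18, 2),
    "Profit ~ Marketing + R&D + Admin": (-525.39, 4),
    "Profit ~ R&D": (-527.44, 2),
}

scores = {}
for name, (log_lik, n_params) in models.items():
    aic = 2 * n_params - 2 * log_lik
    bic = n_params * math.log(n_obs) - 2 * log_lik
    scores[name] = (aic, bic)
    print(f"{name}: AIC = {aic:.1f}, BIC = {bic:.1f}")

# Lower is better for both criteria
best_by_bic = min(scores, key=lambda name: scores[name][1])
print("Lowest BIC:", best_by_bic)
```

The R&D-only model and the three-predictor model are nearly tied on AIC (both print as 1059.0 in the summaries), while BIC, which penalizes extra coefficients more heavily, favors the R&D-only model.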


In [13]:
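# Prediction: expected profit at an R&D spend of $125,000
# (.iloc makes the positional lookup explicit; params is indexed by label)
results.params.iloc[0] + results.params.iloc[1]*125000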

155819.32050860324