## Multiple linear regression: estimating the effect of advertising on different platforms on sales

Objective: We want to estimate how much sales increase as a result of advertising on television, on radio and in newspapers.

In [2]:
import pandas as pd
import numpy as np
df=pd.read_csv("Advertising.csv")
df=df.iloc[:,1:] # drop the first column, which only holds the saved row index
df.head()

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


In [3]:
X=df.drop('sales', axis=1) # Independent variables
y=df[["sales"]] ## the dependent variable

In [4]:
y.head()

Unnamed: 0,sales
0,22.1
1,10.4
2,9.3
3,18.5
4,12.9


In [5]:
X.head()

Unnamed: 0,TV,radio,newspaper
0,230.1,37.8,69.2
1,44.5,39.3,45.1
2,17.2,45.9,69.3
3,151.5,41.3,58.5
4,180.8,10.8,58.4


In [6]:
# Build with Statsmodels
import statsmodels.api as sm # related library

In [7]:
lm=sm.OLS(y,X) # build the model: OLS(dependent variable, independent variables); no intercept is added unless X contains a constant column

In [8]:
model = lm.fit() # model output in multiple linear regression

In [9]:
model.summary() #model summary

0,1,2,3
Dep. Variable:,sales,R-squared (uncentered):,0.982
Model:,OLS,Adj. R-squared (uncentered):,0.982
Method:,Least Squares,F-statistic:,3566.0
Date:,"Fri, 17 Feb 2023",Prob (F-statistic):,2.43e-171
Time:,01:55:04,Log-Likelihood:,-423.54
No. Observations:,200,AIC:,853.1
Df Residuals:,197,BIC:,863.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
TV,0.0538,0.001,40.507,0.000,0.051,0.056
radio,0.2222,0.009,23.595,0.000,0.204,0.241
newspaper,0.0168,0.007,2.517,0.013,0.004,0.030

0,1,2,3
Omnibus:,5.982,Durbin-Watson:,2.038
Prob(Omnibus):,0.05,Jarque-Bera (JB):,7.039
Skew:,-0.232,Prob(JB):,0.0296
Kurtosis:,3.794,Cond. No.,12.6


Comment: In the regression output, the p-value of the F-statistic (2.43e-171) is significant at the 1% level, so the model as a whole is meaningful. The R² and adjusted R² are both 0.982. Note that this model was fit without a constant term, which is why statsmodels reports the uncentered R². The uncentered R² measures variation around zero rather than around the mean of sales, so it is typically higher than the usual R² and should not be read as "the variables explain 98.2% of the variance in sales".
TV and radio are statistically significant at the 1% level, and newspaper (p = 0.013) at the 5% level, so the coefficients can be interpreted. Holding all other variables constant, one additional unit of spending on TV ads is associated with an increase in sales of about 0.05 units, one additional unit on radio ads with about 0.22 units, and one additional unit on newspaper ads with about 0.02 units (0.0168).
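To make the difference between the uncentered and the usual (centered) R² concrete, here is a minimal NumPy sketch. It uses synthetic data generated on the spot, not the Advertising data, and the helper `fit_and_score` is a hypothetical name introduced only for this illustration. In statsmodels, the usual way to include an intercept is to pass `sm.add_constant(X)` instead of `X`.

```python
import numpy as np

# Synthetic data for illustration only (NOT the Advertising data).
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=200)
y = 5.0 + 0.05 * x + rng.normal(0, 1, size=200)

def fit_and_score(design, target):
    """Least-squares fit; return (uncentered R^2, centered R^2)."""
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    ssr = np.sum((target - design @ beta) ** 2)            # residual sum of squares
    r2_uncentered = 1 - ssr / np.sum(target ** 2)           # variation around zero
    r2_centered = 1 - ssr / np.sum((target - target.mean()) ** 2)  # around the mean
    return r2_uncentered, r2_centered

no_const = x.reshape(-1, 1)                          # design matrix without intercept
with_const = np.column_stack([np.ones_like(x), x])   # design matrix with intercept

r2_unc_no_const, r2_cen_no_const = fit_and_score(no_const, y)
r2_unc_const, r2_cen_const = fit_and_score(with_const, y)

print("no intercept:   uncentered R2 =", r2_unc_no_const, " centered R2 =", r2_cen_no_const)
print("with intercept: centered R2 =", r2_cen_const)
```

For the same residuals, the uncentered value can never be lower than the centered one, which is why a no-intercept fit can look better than it is.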

In [10]:
from sklearn.linear_model import LinearRegression # required library
lm= LinearRegression()
lm.fit(X,y)

LinearRegression()

In [11]:

model=lm.fit(X,y) # fit the model and keep the result; no prediction yet

In [14]:
model.intercept_ # intercept (constant term)

array([2.93888937])

In [15]:
model.coef_ # coefficient values (tv, radio, newspaper)

array([[ 0.04576465,  0.18853002, -0.00103749]])

## Prediction

#### Sales ≈ 2.94 + TV * 0.04 + radio * 0.19 - newspaper * 0.001 (approximate coefficients)

Example: What sales does the model predict if we spend 30 units on TV, 10 units on radio and 40 units on newspaper?

In [16]:
2.94+30*0.04+10*0.19-40*0.001 # rough manual calculation with the approximate coefficients

5.999999999999999

In [17]:
yeni_veri =[[30],[10],[40]]

In [18]:
import pandas as pd
yeni_veri=pd.DataFrame(yeni_veri).T # take the transpose so the three values form one row

In [19]:
yeni_veri 

Unnamed: 0,0,1,2
0,30,10,40


In [20]:
model.predict(yeni_veri)  # predict sales for the new data



array([[6.15562918]])
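The hand calculation gives about 6.0 while `model.predict` returns 6.156. The gap comes only from shortening the coefficients. A minimal check, using the full-precision intercept and coefficients printed earlier (the variable names below are illustrative and do not appear in the notebook):

```python
# Intercept and coefficients copied from model.intercept_ and model.coef_ above.
intercept = 2.93888937
coef_tv, coef_radio, coef_newspaper = 0.04576465, 0.18853002, -0.00103749

# Spending for the example: 30 units TV, 10 units radio, 40 units newspaper.
tv, radio, newspaper = 30, 10, 40

manual_prediction = (
    intercept
    + coef_tv * tv
    + coef_radio * radio
    + coef_newspaper * newspaper
)
print(manual_prediction)  # should agree with model.predict to about 6 decimals
```

Any remaining difference is at the level of the eight printed decimals of the coefficients.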

In [22]:
from sklearn.metrics import mean_squared_error # needed to compute the MSE

In [23]:
y.head()

Unnamed: 0,sales
0,22.1
1,10.4
2,9.3
3,18.5
4,12.9


In [24]:
model.predict(X)[0:10]

array([[20.52397441],
       [12.33785482],
       [12.30767078],
       [17.59782951],
       [13.18867186],
       [12.47834763],
       [11.72975995],
       [12.12295317],
       [ 3.72734086],
       [12.55084872]])

In [27]:
MSE=mean_squared_error(y,model.predict(X)) # mean_squared_error(actual, predicted)
MSE

2.7841263145109365

In [30]:
import numpy as np
RMSE= np.sqrt(MSE) # RMSE is the square root of the MSE
RMSE

1.66857014072257
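To show what `mean_squared_error` and the square root compute, the sketch below repeats the calculation by hand on only the first five observations, using the actual and predicted values printed above. Because it covers five rows rather than all 200, its result differs from the MSE and RMSE of the full data set. The variable names are illustrative.

```python
import math

# First five actual sales values (from y.head()) and the matching predictions
# (from model.predict(X)[0:10]) shown above.
actual = [22.1, 10.4, 9.3, 18.5, 12.9]
predicted = [20.52397441, 12.33785482, 12.30767078, 17.59782951, 13.18867186]

# MSE: average of the squared differences between actual and predicted values.
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse_first_five = sum(squared_errors) / len(squared_errors)

# RMSE: square root of the MSE, expressed in the same units as sales.
rmse_first_five = math.sqrt(mse_first_five)

print(mse_first_five, rmse_first_five)
```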

## Model Tuning (Model Validation)

In [31]:
# hold-out test set
from sklearn.model_selection import train_test_split  # required library to split train and test data

In [33]:
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.20, random_state=99) # (independent variables, dependent variable, share of data used for the test set, random seed)

In [34]:
X_train.head()

Unnamed: 0,TV,radio,newspaper
16,67.8,36.6,114.0
51,100.4,9.6,3.6
97,184.9,21.0,22.0
164,117.2,14.7,5.4
71,109.8,14.3,31.7


In [35]:
X_test.head()

Unnamed: 0,TV,radio,newspaper
135,48.3,47.0,8.5
127,80.2,0.0,9.2
191,75.5,10.8,6.0
66,31.5,24.6,2.2
119,19.4,16.0,22.3


In [37]:
y_train.head()

Unnamed: 0,sales
16,12.5
51,10.7
97,15.5
164,11.9
71,12.4


In [38]:
lm=LinearRegression() 
model =lm.fit(X_train,y_train) # fit on the training data only

In [39]:
#training error
np.sqrt(mean_squared_error(y_train,model.predict(X_train)))

1.723682482265075

In [42]:
# test error
np.sqrt(mean_squared_error(y_test,model.predict(X_test)))

1.4312783138301641

In [47]:
# k-fold cross-validation
from sklearn.model_selection import cross_val_score

In [49]:
cross_val_score(model,X_train,y_train,cv=10, scoring="neg_mean_squared_error")

array([-2.1019073 , -2.48953197, -3.09704214, -2.34694216, -3.68175761,
       -1.8691401 , -3.18173007, -4.1927349 , -2.17128376, -8.03821974])
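The scores are negative because scikit-learn scorers follow a "higher is better" convention, so `neg_mean_squared_error` returns the MSE with its sign flipped. A minimal sketch of turning the ten fold scores printed above into a cross-validated MSE and RMSE (the variable names are illustrative):

```python
import math

# Fold scores copied from the cross_val_score output above (negative MSE per fold).
neg_mse_scores = [
    -2.1019073, -2.48953197, -3.09704214, -2.34694216, -3.68175761,
    -1.8691401, -3.18173007, -4.1927349, -2.17128376, -8.03821974,
]

fold_mse = [-score for score in neg_mse_scores]   # flip the sign back to MSE
fold_rmse = [math.sqrt(m) for m in fold_mse]      # per-fold RMSE

cv_mse = sum(fold_mse) / len(fold_mse)            # average MSE over the 10 folds
cv_rmse = math.sqrt(cv_mse)                       # RMSE of the averaged MSE

print("CV MSE:", cv_mse)
print("CV RMSE:", cv_rmse)
print("largest fold MSE:", max(fold_mse))
```

The last fold has a much larger error than the others, so the averaged value is worth reading alongside the per-fold spread.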