# Linear Regression Analysis of Marketing Budget and Sales

Dataset source
    <https://www.kaggle.com/datasets/singhnavjot2062001/product-advertising-data>

In [None]:
import pandas as pd

In [None]:
pd.options.display.float_format = '{:.2f}'.format

In [None]:
data = pd.read_csv('Advertising_data.csv')

In [None]:
data.info()

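# Column descriptions

  * TV: TV broadcast advertising cost
  * Billboards: billboard marketing cost
  * Google_Ads: Google Ads cost
  * Social_Media: social media cost
  * Influencer_Marketing: influencer marketing cost
  * Affiliate_Marketing: affiliate marketing cost
  * Product_Sold: product sales amount



We assume that each row holds monthly cost and sales figures, and that the Product_Sold column is the following month's sales.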

In [None]:
data.head()

Unnamed: 0,TV,Billboards,Google_Ads,Social_Media,Influencer_Marketing,Affiliate_Marketing,Product_Sold
0,281.42,538.8,123.94,349.3,242.77,910.1,7164.0
1,702.97,296.53,558.13,180.55,781.06,132.43,5055.0
2,313.14,295.94,642.96,505.71,438.91,464.23,6154.0
3,898.52,61.27,548.73,240.93,278.96,432.27,5480.0
4,766.52,550.72,651.91,666.33,396.33,841.93,9669.0


# Exploratory Data Analysis

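  * Y (dependent/response variable): Product_Sold
  * X (independent/explanatory variables): TV through Affiliate_Marketing



With these definitions, we use a linear regression model to explore the relationship between per-channel marketing cost and product sales.

Before relying on a linear regression model, the following assumptions and checks should be addressed:

  1. The relationship between the response and explanatory variables is linear rather than nonlinear (linearity)
  2. The error terms are not correlated with one another (no serial correlation)
  3. The expected value of the error term is 0
  4. The variance of the error term is constant across all observations (homoscedasticity, i.e. no heteroscedasticity)
  5. The error term is uncorrelated with the independent variables (no endogeneity)
  6. Check for multicollinearity (multiple regression model)
  7. Identify and remove outliers
  8. Identify and remove high-leverage observations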



<https://blog.naver.com/yonxman/220950614789>

In [None]:
# data.columns

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression

# Explanatory variables (channel spend) and response (sales)
X = data[['TV', 'Billboards', 'Google_Ads', 'Social_Media','Influencer_Marketing', 'Affiliate_Marketing']]
y = data['Product_Sold']

mlr_model = LinearRegression()
mlr_model.fit(X, y)

# In-sample predictions and residuals (observed minus fitted)
data['Sold_predict'] = mlr_model.predict(X)
data['Residual'] = (data['Product_Sold'] - data['Sold_predict'])
print('MSE: ', np.mean(data['Residual']**2)) # MSE

In [None]:
print('R^2: ', mlr_model.score(X,y))

The MSE is about 104 and $R^2$ is about 99.9%.

In [None]:
mlr_model.coef_

In [None]:
mlr_model.intercept_

In [None]:
data.columns

In [None]:
b0 = round(mlr_model.intercept_, 2)
b1 = round(mlr_model.coef_[0], 2)
b2 = round(mlr_model.coef_[1], 2)
b3 = round(mlr_model.coef_[2], 2)
b4 = round(mlr_model.coef_[3], 2)
b5 = round(mlr_model.coef_[4], 2)
b6 = round(mlr_model.coef_[5], 2)

print(f'Regression equation: Product_Sold ~ {b0} + {b1} TV + {b2} Billboards + {b3} Google_Ads + {b4} Social_Media + {b5} Influencer_Marketing + {b6} Affiliate_Marketing')

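### 1\. Check whether the relationship between the response and explanatory variables is nonlinear

-> Check with residual plots. If a pattern is visible, the linear model may be misspecified.

The residual plots show no discernible pattern.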

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, axs = plt.subplots(nrows = 2, ncols = 3, figsize = (20, 10))
for n, col in enumerate(data[X.columns]):
    i, j = n//3, n%3
    sns.regplot(x = data[col], y = data["Residual"], ax =  axs[i][j])

plt.suptitle('Residual Plot for Linear Fit', fontsize = 15)
plt.show()

### 2. Are the error terms correlated with one another? (serial correlation)

  1. Check visually



<https://blog.naver.com/yonxman/220960992282>

In [None]:
plt.figure(figsize = (30, 10))
plt.plot(data.index, data['Residual'], marker ='o', markersize = 8, mfc = 'white', mec = 'black')

The error terms do not appear to be correlated.

  2. Durbin-Watson test

In [None]:
import statsmodels.formula.api as smf
lm = smf.ols(formula = 'Product_Sold ~ TV+Billboards+Google_Ads+Social_Media+Influencer_Marketing+Affiliate_Marketing', data = data).fit()
lm.summary()

0,1,2,3
Dep. Variable:,Product_Sold,R-squared:,1.0
Model:,OLS,Adj. R-squared:,1.0
Method:,Least Squares,F-statistic:,1347000.0
Date:,"Thu, 09 May 2024",Prob (F-statistic):,0.0
Time:,13:02:43,Log-Likelihood:,-1123.6
No. Observations:,300,AIC:,2261.0
Df Residuals:,293,BIC:,2287.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.0887,2.723,0.033,0.974,-5.270,5.447
TV,2.0011,0.002,956.708,0.000,1.997,2.005
Billboards,2.9980,0.002,1375.489,0.000,2.994,3.002
Google_Ads,1.4997,0.002,704.836,0.000,1.496,1.504
Social_Media,2.5000,0.002,1138.719,0.000,2.496,2.504
Influencer_Marketing,1.1998,0.002,574.871,0.000,1.196,1.204
Affiliate_Marketing,3.9989,0.002,1827.081,0.000,3.995,4.003

0,1,2,3
Omnibus:,0.258,Durbin-Watson:,2.081
Prob(Omnibus):,0.879,Jarque-Bera (JB):,0.17
Skew:,0.057,Prob(JB):,0.918
Kurtosis:,3.028,Cond. No.,5670.0


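A Durbin-Watson statistic close to 2 indicates no autocorrelation. <https://m.blog.naver.com/yolwooju/221915408383>

A value greater than 2 suggests negative autocorrelation.

A value less than 2 suggests positive autocorrelation.

\--> The Durbin-Watson statistic is 2.081, so we conclude there is almost no autocorrelation.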
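To make the "close to 2" rule concrete, here is a minimal sketch that computes the Durbin-Watson statistic both by its definition and with `statsmodels.stats.stattools.durbin_watson`. The two residual series are synthetic (an assumption for illustration, not the notebook's residuals): one is independent noise, the other is positively autocorrelated AR(1) noise.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)

# Hypothetical residual series: independent noise vs. AR(1) noise with coefficient 0.8
independent = rng.normal(size=500)
autocorrelated = np.zeros(500)
for t in range(1, 500):
    autocorrelated[t] = 0.8 * autocorrelated[t - 1] + rng.normal()

def dw_manual(resid):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    resid = np.asarray(resid)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

dw_independent = durbin_watson(independent)
dw_autocorrelated = durbin_watson(autocorrelated)

# The independent series should land near 2, the AR(1) series well below 2
print(round(dw_independent, 2), round(dw_autocorrelated, 2))
```

The statistic is roughly 2(1 - r), where r is the lag-1 autocorrelation of the residuals, which is why values near 2 correspond to r near 0.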

### 3. The expected value of the error term is 0 -> satisfied

In [None]:
lm.resid.describe()

In [None]:
sns.displot(lm.resid, kde = True)
plt.title(f"Skew: {round(lm.resid.skew(), 2)} ")

### 4. The variance of the error term is constant across all observations (homoscedasticity)

-> If this is violated, transform the dependent (response) variable,

-> or use weighted regression.

In [None]:
sns.regplot(x = data["Sold_predict"], y = data["Residual"])

  2. White's test for heteroscedasticity



<https://www.statology.org/white-test-in-python/>

In [None]:
from statsmodels.stats.diagnostic import het_white

white_test = het_white(lm.resid, lm.model.exog)

In [None]:
lm.model.exog # design matrix of observed values (includes the intercept column)

In [None]:
data.head()

Unnamed: 0,TV,Billboards,Google_Ads,Social_Media,Influencer_Marketing,Affiliate_Marketing,Product_Sold,Sold_predict,Residual
0,281.42,538.8,123.94,349.3,242.77,910.1,7164.0,7168.42,19.53
1,702.97,296.53,558.13,180.55,781.06,132.43,5055.0,5050.98,16.2
2,313.14,295.94,642.96,505.71,438.91,464.23,6154.0,6125.56,809.01
3,898.52,61.27,548.73,240.93,278.96,432.27,5480.0,5470.42,91.77
4,766.52,550.72,651.91,666.33,396.33,841.93,9669.0,9670.94,3.76


In [None]:
#define labels to use for output of White's test
labels = ['Test Statistic', 'Test Statistic p-value', 'F-Statistic', 'F-Test p-value']
#print results of White's test
test_result = dict(zip(labels, white_test))
print(test_result)

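Test Statistic $\chi^2$ = 27.50

p-value = 0.437

$H_0$: Homoscedasticity holds (the residuals are evenly scattered).

$H_a$: Heteroscedasticity is present (the residuals are not evenly scattered).

Since the p-value is greater than 0.05, we fail to reject the null hypothesis, so there is no evidence of heteroscedasticity.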

### 5\. The error term is uncorrelated with the independent variables (endogeneity)

In [None]:
for col in data[X.columns]:
    corr = data[[col, 'Residual']].corr()
    print(col, round(corr.loc[col,'Residual'], 2))

There is almost no correlation.

### 6.다중공선성 확인

<https://aliencoder.tistory.com/17>

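  1. Check the pairwise correlations between the independent variables.

  2. Compute the variance inflation factor (VIF); a value above 5 or 10 is taken to indicate a multicollinearity problem.

Variables that are correlated with one another have high VIFs.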

In [None]:
corr_by_X = data[X.columns].corr()
corr_by_X

Unnamed: 0,TV,Billboards,Google_Ads,Social_Media,Influencer_Marketing,Affiliate_Marketing
TV,1.0,-0.03,0.03,-0.04,0.01,0.09
Billboards,-0.03,1.0,0.05,0.05,-0.01,-0.04
Google_Ads,0.03,0.05,1.0,0.04,-0.06,-0.13
Social_Media,-0.04,0.05,0.04,1.0,-0.04,-0.02
Influencer_Marketing,0.01,-0.01,-0.06,-0.04,1.0,-0.05
Affiliate_Marketing,0.09,-0.04,-0.13,-0.02,-0.05,1.0


In [None]:
sns.heatmap(corr_by_X, annot = True)

In [None]:
# !pip install seaborn --upgrade

There appears to be almost no correlation between the independent variables.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
def get_VIF(X: pd.DataFrame):
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    vif = pd.DataFrame()
    vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif["features"] = X.columns
    vif = vif.sort_values("VIF Factor", ascending = False).reset_index(drop = True)
    return vif[["features", "VIF Factor"]]

In [None]:
get_VIF(X)

Unnamed: 0,features,VIF Factor
0,TV,3.71
1,Billboards,3.7
2,Social_Media,3.57
3,Google_Ads,3.55
4,Affiliate_Marketing,3.35
5,Influencer_Marketing,3.02


All VIF values are below 5, so multicollinearity is not a concern.

### 7\. Check for outliers

  * Outlier: an observation whose response value is unusual given the values of its explanatory variables



An observation whose studentized residual exceeds 3 in absolute value may be an outlier. <https://datascienceschool.net/03%20machine%20learning/05.03%20%EB%A0%88%EB%B2%84%EB%A6%AC%EC%A7%80%EC%99%80%20%EC%95%84%EC%9B%83%EB%9D%BC%EC%9D%B4%EC%96%B4.html>
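The cell that builds the `Studentized Residuals` and `outlier_yn` columns used later in this notebook is not shown. The following is a sketch of how such columns could be produced, run on synthetic data. The use of internally studentized residuals, the absolute value, and the `'outlier'` label are assumptions; only the `'normal'` label and the column names appear in the notebook's output.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
demo = pd.DataFrame({"x1": rng.uniform(0, 1000, 300), "x2": rng.uniform(0, 1000, 300)})
demo["y"] = 2.0 * demo["x1"] + 3.0 * demo["x2"] + rng.normal(scale=10, size=300)
demo.loc[0, "y"] += 200  # plant one gross outlier for the demo

fit = smf.ols("y ~ x1 + x2", data=demo).fit()
influence = fit.get_influence()

# Absolute internally studentized residuals (assumed variant)
demo["Studentized Residuals"] = np.abs(influence.resid_studentized_internal)

# Flag observations beyond the |t| > 3 rule of thumb; 'outlier' is a hypothetical label
demo["outlier_yn"] = np.where(demo["Studentized Residuals"] > 3, "outlier", "normal")
print(demo["outlier_yn"].value_counts())
```

`influence.resid_studentized_external` is the leave-one-out variant and could be swapped in if that definition is preferred.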

In [None]:
lm.summary()

0,1,2,3
Dep. Variable:,Product_Sold,R-squared:,1.0
Model:,OLS,Adj. R-squared:,1.0
Method:,Least Squares,F-statistic:,1347000.0
Date:,"Thu, 09 May 2024",Prob (F-statistic):,0.0
Time:,12:44:30,Log-Likelihood:,-1123.6
No. Observations:,300,AIC:,2261.0
Df Residuals:,293,BIC:,2287.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.0887,2.723,0.033,0.974,-5.270,5.447
TV,2.0011,0.002,956.708,0.000,1.997,2.005
Billboards,2.9980,0.002,1375.489,0.000,2.994,3.002
Google_Ads,1.4997,0.002,704.836,0.000,1.496,1.504
Social_Media,2.5000,0.002,1138.719,0.000,2.496,2.504
Influencer_Marketing,1.1998,0.002,574.871,0.000,1.196,1.204
Affiliate_Marketing,3.9989,0.002,1827.081,0.000,3.995,4.003

0,1,2,3
Omnibus:,0.258,Durbin-Watson:,2.081
Prob(Omnibus):,0.879,Jarque-Bera (JB):,0.17
Skew:,0.057,Prob(JB):,0.918
Kurtosis:,3.028,Cond. No.,5670.0


In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

### 8. Check for high-leverage observations

  * Leverage: the observation's x values are unusual -> such points can strongly influence the least-squares line

In [None]:
hat = lm.get_influence().hat_matrix_diag

plt.figure(figsize=(20, 5))
plt.stem(hat)
n,p = X.shape[0], X.shape[1]
plt.axhline((p+1)/ n, c="g", ls="--")
plt.title("Leverage")
plt.show()

In [None]:
print(round((p+1)/ n, 2), 'is the average leverage; an observation far above this value may have high leverage')

In [None]:
data['Leverage'] = hat

In [None]:
sns.scatterplot(x = 'Leverage', y = 'Studentized Residuals', data = data, hue = 'outlier_yn')
plt.axvline((p+1)/n, color = 'red', linestyle = '--')

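More than half of the observations that strongly influence the least-squares fit are high-leverage points.

Compare the R-squared values obtained by fitting a linear regression on one variable at a time.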

In [None]:
import statsmodels.formula.api as smf
lm = smf.ols(formula = 'Product_Sold ~ TV+Billboards+Google_Ads+Social_Media+Influencer_Marketing+Affiliate_Marketing', data = data).fit()
lm.summary()

Affiliate marketing explains the largest share of the variance.

In [None]:
lm_dict = {col : smf.ols(formula = f'Product_Sold ~ {col}', data = data).fit() for col in X.columns}
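The dictionary of single-variable fits can be turned into an R-squared ranking directly. The sketch below does this on synthetic data with three channels. The column names mirror the dataset, while the generated values and `lm_dict_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# Rounded slopes taken from this notebook's fitted equation; the data itself is synthetic
coefs = {"TV": 2.0, "Billboards": 3.0, "Affiliate_Marketing": 4.0}
demo = pd.DataFrame({c: rng.uniform(0, 1000, 300) for c in coefs})
demo["Product_Sold"] = sum(b * demo[c] for c, b in coefs.items()) + rng.normal(scale=10, size=300)

# One simple regression per channel, as in lm_dict
lm_dict_demo = {c: smf.ols(f"Product_Sold ~ {c}", data=demo).fit() for c in coefs}

# Collect R-squared per channel and sort from most to least explanatory
r2_by_channel = pd.Series({c: m.rsquared for c, m in lm_dict_demo.items()}).sort_values(ascending=False)
print(r2_by_channel.round(3))
```

When the channels have similar spreads, as in this synthetic data, the channel with the largest slope also explains the most variance on its own.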

Check that the residual mean of each simple linear regression is 0

In [None]:
# Residual mean of each single-variable model stored in lm_dict
for col, model in lm_dict.items():
    print(col, round(model.resid.mean(), 2))


The residuals follow a normal distribution.

## Variable selection via forward selection

In [None]:
from mlxtend.feature_selection import SequentialFeatureSelector as sfs

clf = LinearRegression()

# Build step forward feature selection
sfs1 = sfs(clf,k_features = len(X.columns),forward=True,floating=False, scoring='r2',cv=5)

# Run sequential forward selection (floating=False, so plain SFS rather than SFFS)
sfs1 = sfs1.fit(X, y)

In [None]:
# !pip install mlxtend

In [None]:
sfs1.subsets_

If we fit only TV, Billboards, Google_Ads, Social_Media, and Affiliate_Marketing, which together explain about 95% of the variance, do the residuals still follow a normal distribution?

In [None]:
import statsmodels.formula.api as smf
# Five-variable model: Influencer_Marketing is left out, matching the summary output below
lm5 = smf.ols(formula = 'Product_Sold ~ TV+Billboards+Google_Ads+Social_Media+Affiliate_Marketing', data = data).fit()
lm5.summary()

0,1,2,3
Dep. Variable:,Product_Sold,R-squared:,0.959
Model:,OLS,Adj. R-squared:,0.958
Method:,Least Squares,F-statistic:,1378.0
Date:,"Thu, 09 May 2024",Prob (F-statistic):,1.14e-201
Time:,12:51:20,Log-Likelihood:,-2177.9
No. Observations:,300,AIC:,4368.0
Df Residuals:,294,BIC:,4390.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,651.6564,83.031,7.848,0.000,488.246,815.066
TV,2.0192,0.070,28.783,0.000,1.881,2.157
Billboards,2.9951,0.073,40.968,0.000,2.851,3.139
Google_Ads,1.4203,0.071,19.943,0.000,1.280,1.561
Social_Media,2.4489,0.074,33.282,0.000,2.304,2.594
Affiliate_Marketing,3.9268,0.073,53.577,0.000,3.783,4.071

0,1,2,3
Omnibus:,148.415,Durbin-Watson:,1.987
Prob(Omnibus):,0.0,Jarque-Bera (JB):,18.073
Skew:,0.101,Prob(JB):,0.000119
Kurtosis:,1.815,Cond. No.,4780.0


In [None]:
round(lm5.resid.mean(),2)

In [None]:
# Residual distribution of the five-variable model
sns.displot(lm5.resid, kde = True)
plt.title(f"Skew: {round(lm5.resid.skew(), 2)} ")

Refit the multiple linear regression after removing only the outliers

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression

# Keep only rows not flagged as outliers; .copy() avoids SettingWithCopyWarning on the assignments below
data_prep1 = data[(data['outlier_yn'] == 'normal')].copy()

X1 = data_prep1[['TV', 'Billboards', 'Google_Ads', 'Social_Media','Influencer_Marketing', 'Affiliate_Marketing']]
y1 = data_prep1['Product_Sold']

mlr_model1 = LinearRegression()
mlr_model1.fit(X1, y1)

# In-sample predictions and residuals for the refitted model
data_prep1['Sold_predict'] = mlr_model1.predict(X1)
data_prep1['Residual'] = (data_prep1['Product_Sold'] - data_prep1['Sold_predict'])
print('MSE: ', np.mean(data_prep1['Residual']**2)) # MSE

In [None]:
(104-98)/104

The MSE drops from 104 to 98, a decrease of about 5.8%.

In [None]:
print('R^2: ', mlr_model1.score(X1, y1))  # score the refitted model on the outlier-free data

In [None]:
# Coefficients of the model refitted without outliers
b0 = round(mlr_model1.intercept_, 2)
b1 = round(mlr_model1.coef_[0], 2)
b2 = round(mlr_model1.coef_[1], 2)
b3 = round(mlr_model1.coef_[2], 2)
b4 = round(mlr_model1.coef_[3], 2)
b5 = round(mlr_model1.coef_[4], 2)
b6 = round(mlr_model1.coef_[5], 2)

print(f'Regression equation: Product_Sold ~ {b0} + {b1} TV + {b2} Billboards + {b3} Google_Ads + {b4} Social_Media + {b5} Influencer_Marketing + {b6} Affiliate_Marketing')

Only the intercept of the regression equation changes noticeably, rising from 0.09 to 0.35.

In [None]:
data

Unnamed: 0,TV,Billboards,Google_Ads,Social_Media,Influencer_Marketing,Affiliate_Marketing,Product_Sold,Sold_predict,Residual,Studentized Residuals,outlier_yn,Leverage
0,281.42,538.80,123.94,349.30,242.77,910.10,7164.00,7168.42,-4.42,0.43,normal,0.02
1,702.97,296.53,558.13,180.55,781.06,132.43,5055.00,5050.98,4.02,0.39,normal,0.02
2,313.14,295.94,642.96,505.71,438.91,464.23,6154.00,6125.56,28.44,2.76,normal,0.01
3,898.52,61.27,548.73,240.93,278.96,432.27,5480.00,5470.42,9.58,0.93,normal,0.02
4,766.52,550.72,651.91,666.33,396.33,841.93,9669.00,9670.94,-1.94,0.19,normal,0.01
...,...,...,...,...,...,...,...,...,...,...,...,...
295,770.05,501.36,694.60,172.26,572.26,410.56,6851.00,6844.93,6.07,0.59,normal,0.01
296,512.38,250.83,373.78,366.95,987.14,509.03,6477.00,6475.35,1.65,0.16,normal,0.02
297,998.10,858.75,781.06,60.61,174.63,213.53,6949.00,6958.30,-9.30,0.91,normal,0.04
298,322.35,681.22,640.29,343.65,534.22,648.71,7737.00,7741.99,-4.99,0.48,normal,0.01


# Conclusion

Multiple linear regression equation

#### Product_Sold ~ 0.35 + 2.0 TV + 3.0 Billboards + 1.5 Google_Ads + 2.5 Social_Media + 1.2 Influencer_Marketing + 4.0 Affiliate_Marketing

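  * Product sales respond most strongly to the **affiliate marketing budget** and are least affected by **influencer marketing cost**.
  * When no marketing budget is spent, the model predicts sales of about 0.35 units (the intercept), which is not statistically distinguishable from 0 (p = 0.895).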


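  1. Linearity of the relationship between the response and explanatory variables -> the residual plots show no discernible pattern
  2. Correlation between error terms (serial correlation) -> the error terms do not appear to be correlated; the Durbin-Watson statistic is 2.081, so there is almost no autocorrelation
  3. The expected value of the error term is 0 -> satisfied; the residuals follow a normal distribution
  4. Constant error variance across all observations (homoscedasticity) -> no evidence of heteroscedasticity
  5. The error term is uncorrelated with the independent variables (endogeneity) -> almost no correlation
  6. Multicollinearity (multiple regression model) -> almost no correlation between the independent variables; all VIF values are below 5, so multicollinearity is not a concern
  7. **Identify and remove outliers** -> there are 2 outliers whose studentized residuals exceed 3 in absolute value
  8. **Identify high-leverage observations** -> some observations strongly influence the least-squares fit, and about half of them are high-leverage points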

# Hypothesis tests

## Q. Is there a relationship between advertising budget and sales?

> H0: $b_1$ = ... = $b_6$ = 0
> 
> Ha: at least one $b_j$ is nonzero.
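The overall F-test for these hypotheses compares the explained and unexplained sums of squares: F = ((TSS - RSS) / p) / (RSS / (n - p - 1)). As a sketch on synthetic two-variable data (an assumption for illustration), the manual computation below reproduces the `fvalue` and `f_pvalue` attributes that `statsmodels` reports.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({"x1": rng.uniform(0, 1000, n), "x2": rng.uniform(0, 1000, n)})
demo["y"] = 2.0 * demo["x1"] + 3.0 * demo["x2"] + rng.normal(scale=10, size=n)

fit = smf.ols("y ~ x1 + x2", data=demo).fit()
p = int(fit.df_model)  # number of slope coefficients

tss = np.sum((demo["y"] - demo["y"].mean()) ** 2)  # total sum of squares
rss = np.sum(fit.resid ** 2)                       # residual sum of squares

f_manual = ((tss - rss) / p) / (rss / (n - p - 1))
p_manual = stats.f.sf(f_manual, p, n - p - 1)      # upper-tail probability under H0

print(f_manual, fit.fvalue)
print(p_manual, fit.f_pvalue)
```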

In [None]:
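import statsmodels.formula.api as smf
# Refit on the outlier-free data (298 observations, matching the summary output below)
lm_prep = smf.ols(formula = 'Product_Sold ~ TV+Billboards+Google_Ads+Social_Media+Influencer_Marketing+Affiliate_Marketing', data = data_prep1).fit()
lm_prep.summary()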

0,1,2,3
Dep. Variable:,Product_Sold,R-squared:,1.0
Model:,OLS,Adj. R-squared:,1.0
Method:,Least Squares,F-statistic:,1431000.0
Date:,"Thu, 09 May 2024",Prob (F-statistic):,0.0
Time:,13:38:58,Log-Likelihood:,-1107.1
No. Observations:,298,AIC:,2228.0
Df Residuals:,291,BIC:,2254.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.3505,2.643,0.133,0.895,-4.852,5.553
TV,2.0007,0.002,979.398,0.000,1.997,2.005
Billboards,2.9978,0.002,1413.649,0.000,2.994,3.002
Google_Ads,1.4993,0.002,724.041,0.000,1.495,1.503
Social_Media,2.5002,0.002,1173.268,0.000,2.496,2.504
Influencer_Marketing,1.2000,0.002,587.841,0.000,1.196,1.204
Affiliate_Marketing,3.9996,0.002,1878.503,0.000,3.995,4.004

0,1,2,3
Omnibus:,3.395,Durbin-Watson:,2.089
Prob(Omnibus):,0.183,Jarque-Bera (JB):,3.376
Skew:,0.222,Prob(JB):,0.185
Kurtosis:,2.726,Cond. No.,5650.0


In [None]:
lm_prep.fvalue

In [None]:
lm_prep.f_pvalue

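The F-test p-value (f_pvalue) is below 0.05, so we reject the null hypothesis and conclude that there is a relationship between advertising and sales.

## Q. How strong is the relationship between advertising budget and sales?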

In [None]:
# RSE = sqrt(RSS / (n - p - 1)), so ddof must count the p slopes plus the intercept
print("RSE (Residual Standard Error of the model)", lm_prep.resid.std(ddof = X1.shape[1] + 1))

In [None]:
print("R-squared", round(lm_prep.rsquared*100,3))

In [None]:
print("Mean of the response variable:", y1.mean())

In [None]:
print("Error:" ,10.03 / 7029.14  * 100  , "%")  # RSE divided by the mean response

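R-squared is 99.99%, meaning the explanatory variables account for 99.99% of the variation in the response, and the RSE is only about 0.14% of the mean response, so the relationship is strong.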

## Q. Which channels contribute to sales?

In [None]:
lm_prep.summary()

0,1,2,3
Dep. Variable:,Product_Sold,R-squared:,1.0
Model:,OLS,Adj. R-squared:,1.0
Method:,Least Squares,F-statistic:,1431000.0
Date:,"Thu, 09 May 2024",Prob (F-statistic):,0.0
Time:,13:59:30,Log-Likelihood:,-1107.1
No. Observations:,298,AIC:,2228.0
Df Residuals:,291,BIC:,2254.0
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.3505,2.643,0.133,0.895,-4.852,5.553
TV,2.0007,0.002,979.398,0.000,1.997,2.005
Billboards,2.9978,0.002,1413.649,0.000,2.994,3.002
Google_Ads,1.4993,0.002,724.041,0.000,1.495,1.503
Social_Media,2.5002,0.002,1173.268,0.000,2.496,2.504
Influencer_Marketing,1.2000,0.002,587.841,0.000,1.196,1.204
Affiliate_Marketing,3.9996,0.002,1878.503,0.000,3.995,4.004

0,1,2,3
Omnibus:,3.395,Durbin-Watson:,2.089
Prob(Omnibus):,0.183,Jarque-Bera (JB):,3.376
Skew:,0.222,Prob(JB):,0.185
Kurtosis:,2.726,Cond. No.,5650.0


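The p-values for all channels are effectively 0, so every channel contributes to sales. Affiliate marketing (Affiliate_Marketing) has the largest coefficient and therefore contributes the most per unit of spend, while influencer marketing contributes the least.

## Q. How large is the effect of each channel on sales?

None of the channels' confidence intervals contains 0. Affiliate marketing, billboards, and social media contribute the most to sales, in that order.

Among the channels, the largest positive pairwise correlation is between TV and affiliate marketing (0.09); the largest in absolute value is between Google Ads and affiliate marketing (-0.13).

## Q. Is there synergy between advertising channels?

The model explains 99.99% of the variance without any interaction terms (products of two variables), so there appears to be no meaningful interaction (synergy) effect between channels.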
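A more direct check is to add an interaction term and look at its coefficient, p-value, and the gain in R-squared. The sketch below does this with the formula term `TV:Billboards` on synthetic, purely additive data. The data is an assumption for illustration, and the same formula pattern could be applied to any pair of channels.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({"TV": rng.uniform(0, 1000, n), "Billboards": rng.uniform(0, 1000, n)})
# Purely additive data-generating process: no synergy between the two channels
demo["Product_Sold"] = 2.0 * demo["TV"] + 3.0 * demo["Billboards"] + rng.normal(scale=10, size=n)

additive = smf.ols("Product_Sold ~ TV + Billboards", data=demo).fit()
with_interaction = smf.ols("Product_Sold ~ TV + Billboards + TV:Billboards", data=demo).fit()

# Interaction coefficient and its p-value
print(with_interaction.params["TV:Billboards"], with_interaction.pvalues["TV:Billboards"])

# How much extra variance the interaction term explains
r2_gain = with_interaction.rsquared - additive.rsquared
print(r2_gain)
```

An interaction coefficient close to 0 together with a negligible R-squared gain would support the no-synergy conclusion more directly than the high R-squared of the additive model alone.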