<a href="https://colab.research.google.com/github/francji1/01RAD/blob/main/python/01RAD_Ex06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 01RAD Exercise 6


In [None]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


Imagine that you are statistical consultants asked to build a marketing plan for next year that will result in high product sales. Based on this data and your final model, answer the following questions:

1 - Is there a relationship between advertising budget and sales?

2 - Which media contribute to sales, i.e., do all three media (TV, radio, and newspaper) contribute to sales?

3 - Which media generate the biggest boost in sales?

4 - How strong is the relationship between advertising budget and sales?

5 - How much increase in sales is associated with a given increase in TV advertising?

6 - How much increase in sales is associated with a given increase in Radio advertising?

7 - How accurately can we estimate the effect of each medium on sales?

8 - How accurately can we predict future sales?

9 - Is there synergy among the advertising media?

10 - Imagine you have $100k. What is the best strategy for spending it on advertising?

11 - How much more product will we sell if we spend $10k on TV and $20k on radio advertising?

12 - What is the 95% confidence interval for the answer to the previous question?

The problem is described in the book *An Introduction to Statistical Learning with Applications in R*: https://faculty.marshall.usc.edu/gareth-james/ISL/


In [None]:
# Load the data
Advert = pd.read_csv("https://raw.githubusercontent.com/francji1/01RAD/main/data/Advert.csv", sep=",")
Advert.head()

In [None]:
print(Advert.describe())
print(Advert.info())

In [None]:
# Simple linear regression models
model_tv = smf.ols('sales ~ TV', data=Advert).fit()
print(model_tv.summary())

model_ra = smf.ols('sales ~ radio', data=Advert).fit()
print(model_ra.summary())

model_np = smf.ols('sales ~ newspaper', data=Advert).fit()
print(model_np.summary())

In [None]:
# Predictions for simple models
new_data = pd.DataFrame({
    'TV': np.arange(0, 301, 5),
    'radio': np.arange(0, 301, 5),
    'newspaper': np.arange(0, 301, 5)
})

predictions_tv = model_tv.get_prediction(new_data).summary_frame()
predictions_ra = model_ra.get_prediction(new_data).summary_frame()
predictions_np = model_np.get_prediction(new_data).summary_frame()

# Plotting
sns.scatterplot(x='TV', y='sales', data=Advert)
sns.lineplot(x=new_data['TV'], y=predictions_tv['mean'], color="red")
plt.fill_between(new_data['TV'], predictions_tv['obs_ci_lower'], predictions_tv['obs_ci_upper'], color='blue', alpha=0.3)
plt.show()
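
The shaded band in the plot above uses `obs_ci_lower` and `obs_ci_upper`, which are the prediction-interval columns of `summary_frame()`; the confidence interval for the mean response is in `mean_ci_lower` and `mean_ci_upper`. The following is a minimal sketch on synthetic data. The data frame `demo`, the seed, and the coefficients are illustrative assumptions, not the Advert data. It prints both intervals side by side.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data standing in for the Advert table (illustrative only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"TV": rng.uniform(0, 300, size=200)})
demo["sales"] = 7.0 + 0.05 * demo["TV"] + rng.normal(0, 3, size=200)

demo_model = smf.ols("sales ~ TV", data=demo).fit()

# Predict at a few new budgets; alpha=0.05 gives 95% intervals
grid = pd.DataFrame({"TV": [50, 150, 250]})
frame = demo_model.get_prediction(grid).summary_frame(alpha=0.05)

# mean_ci_*: confidence interval for the expected sales at each budget
# obs_ci_*:  prediction interval for a single new observation (wider)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]])
```

The confidence interval answers "where is the average sales level at this budget", while the prediction interval answers "where will one new market fall", so the latter also carries the residual noise.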

In [None]:

# Create a figure with subplots
fig, axs = plt.subplots(1, 3, figsize=(15, 5))

# TV vs Sales
sns.scatterplot(x='TV', y='sales', data=Advert, ax=axs[0])
model_tv = ols('sales ~ TV', data=Advert).fit()
sns.lineplot(x='TV', y=model_tv.predict(Advert['TV']), data=Advert, ax=axs[0], color='blue')

# Radio vs Sales
sns.scatterplot(x='radio', y='sales', data=Advert, ax=axs[1])
model_ra = ols('sales ~ radio', data=Advert).fit()
sns.lineplot(x='radio', y=model_ra.predict(Advert['radio']), data=Advert, ax=axs[1], color='blue')

# Newspaper vs Sales
sns.scatterplot(x='newspaper', y='sales', data=Advert, ax=axs[2])
model_np = ols('sales ~ newspaper', data=Advert).fit()
sns.lineplot(x='newspaper', y=model_np.predict(Advert['newspaper']), data=Advert, ax=axs[2], color='blue')

# Set the title for each subplot
axs[0].set_title('TV Advertisements vs Sales')
axs[1].set_title('Radio Advertisements vs Sales')
axs[2].set_title('Newspaper Advertisements vs Sales')

plt.tight_layout()
plt.show()

In [None]:
# Models with and without interactions
model0 = smf.ols('sales ~ TV * radio * newspaper', data=Advert).fit()
print(model0.summary())
print(model0.conf_int())

model1 = smf.ols('sales ~ TV + radio + newspaper', data=Advert).fit()
print(model1.summary())
print(model1.conf_int())

model2 = smf.ols('sales ~ TV * radio', data=Advert).fit()
print(model2.summary())
print(model2.conf_int())
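
The formula `sales ~ TV * radio` expands to both main effects plus their product. Written out with coefficients $\beta_0,\dots,\beta_3$ (notation assumed here, it does not appear in the notebook), the model and the resulting marginal effects are:

```latex
\begin{aligned}
\text{sales} &= \beta_0 + \beta_1\,\text{TV} + \beta_2\,\text{radio}
              + \beta_3\,(\text{TV}\times\text{radio}) + \varepsilon,\\[4pt]
\frac{\partial\,\mathrm{E}[\text{sales}]}{\partial\,\text{TV}} &= \beta_1 + \beta_3\,\text{radio},
\qquad
\frac{\partial\,\mathrm{E}[\text{sales}]}{\partial\,\text{radio}} = \beta_2 + \beta_3\,\text{TV}.
\end{aligned}
```

A positive $\beta_3$ means that an extra unit of TV spending is associated with a larger increase in sales when the radio budget is higher, which is the kind of synergy that question 9 asks about.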

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
Advert.columns

In [None]:
variables = Advert.columns[1:-1]  # Skipping the first and last columns

# DataFrame to store VIF values
vif_data = pd.DataFrame()
vif_data["feature"] = variables

# Create a new DataFrame for the independent variables (predictors) only
X = Advert[variables]
X = sm.add_constant(X)

# Calculate VIF for each predictor
vif_data["VIF"] = [1 / (1 - sm.OLS(X[col], X.loc[:, X.columns != col]).fit().rsquared) for col in variables]
print(vif_data)

In [None]:
variables = Advert.columns[1:-1]  # Skip the first and the last column
vif_dict = {}
Advert_x = Advert.iloc[:, 1:-1]
print(variables)
for variable in variables:
    # The independent variables set.
    x_vars = Advert_x.drop([variable], axis=1)
    # The dependent variable.
    y_var = Advert_x[variable]

    # Add constant for OLS model
    x_vars_const = sm.add_constant(x_vars)
    # Fit the model
    model = sm.OLS(y_var, x_vars_const).fit()

    # Calculate R-squared value
    rsq = model.rsquared
    #print(model.summary())
    # Calculate VIF
    vif = 1 / (1 - rsq)

    vif_dict[variable] = vif

# Display the VIF values
vif_dict


In [None]:
# Compute the correlation matrix for TV, radio, newspaper, and sales
correlation_matrix = Advert[['TV', 'radio', 'newspaper', 'sales']].corr()
correlation_matrix

### Is at least one of the predictors X1, X2,...,Xp useful in predicting the response?

To determine if at least one predictor is useful, look at the F-statistic and its corresponding p-value from the overall regression model.
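
In symbols (notation assumed here: $p$ predictors, $n$ observations, TSS and RSS for the total and residual sums of squares), the overall F-test can be written as:

```latex
\begin{aligned}
H_0 &: \beta_1 = \beta_2 = \dots = \beta_p = 0
\quad\text{vs.}\quad
H_1 : \text{at least one } \beta_j \neq 0,\\[4pt]
F &= \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)},
\qquad
\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2,
\quad
\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 .
\end{aligned}
```

Under $H_0$ and the usual normal-error assumptions, $F$ follows an F distribution with $p$ and $n - p - 1$ degrees of freedom, so a large $F$ (small p-value) is evidence that at least one predictor is useful.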

In [None]:
# Fit the model with TV, radio, and newspaper as predictors
model1 = smf.ols('sales ~ TV + radio + newspaper', data=Advert).fit()

# Perform the F-test to test if all parameters (excluding the intercept) are zero
f_statistic = model1.fvalue
f_pvalue = model1.f_pvalue

print(f"F-statistic: {f_statistic}")
print(f"P-value: {f_pvalue}")

In [None]:
from scipy.stats import norm, t, f
import scipy.stats

In [None]:
# Fit the model with TV, radio, and newspaper as predictors
model1 = smf.ols('sales ~ TV + radio + newspaper', data=Advert).fit()

# Calculate TSS
y_mean = Advert['sales'].mean()
TSS = ((Advert['sales'] - y_mean)**2).sum()

# RSS is the sum of squared residuals from the model
RSS = model1.ssr

# Number of predictors p (excluding the intercept)
p = len(model1.params) - 1

# Number of observations n
n = Advert.shape[0]

# Calculate the F-statistic
F = ((TSS - RSS) / p) / (RSS / (n - p - 1))

# Get the p-value from the F-distribution
# (the survival function avoids 1 - cdf rounding to exactly 0 for very large F)
p_value = f.sf(F, p, n - p - 1)

F, p_value


In [None]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
from statsmodels.stats.anova import anova_lm
from itertools import combinations

# Assumes the Advert DataFrame is already loaded and contains the columns 'TV', 'radio', 'newspaper', 'sales'

# Fit full model with all interactions
model1 = smf.ols('sales ~ TV * radio * newspaper', data=Advert).fit()
#print(model1.summary())

# Compare the nested models using ANOVA (anova_lm expects the restricted model first)
model2 = smf.ols('sales ~ TV * radio', data=Advert).fit()
print(anova_lm(model2, model1))

# Perform stepwise regression (manual implementation since statsmodels does not have a built-in function)
def stepwise_selection(X, y,
                       initial_list=[],
                       threshold_in=0.01,
                       threshold_out = 0.05,
                       verbose=True):
    """ Perform a forward-backward feature selection
    based on p-value from statsmodels.api.OLS
    Arguments:
        X - pandas.DataFrame with candidate features
        y - list-like with the target
        initial_list - list of features to start with (column names of X)
        threshold_in - include a feature if its p-value < threshold_in
        threshold_out - exclude a feature if its p-value > threshold_out
        verbose - whether to print the sequence of inclusions and exclusions
    Returns: list of selected features
    Always set threshold_in < threshold_out to avoid infinite looping.
    See https://en.wikipedia.org/wiki/Stepwise_regression for the details
    """
    included = list(initial_list)
    while True:
        changed=False
        # forward step
        excluded = list(set(X.columns)-set(included))
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included+[new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed=True
            if verbose:
                print('Add  {:30} with p-value {:.6}'.format(best_feature, best_pval))

        # backward step
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        # use all coefs except intercept
        pvalues = model.pvalues.iloc[1:]
        worst_pval = pvalues.max()  # NaN if pvalues is empty, so the comparison below is False
        if worst_pval > threshold_out:
            changed=True
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            if verbose:
                print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
        if not changed:
            break
    return included

result = stepwise_selection(Advert[['TV', 'radio', 'newspaper']], Advert['sales'])
print(result)
#
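
The two nested models fitted in the cell above (`sales ~ TV * radio` inside `sales ~ TV * radio * newspaper`) can be compared with a partial F-test via `anova_lm`. The sketch below shows the idea on synthetic data; the columns `a`, `b`, `c`, `y`, the seed, and the coefficients are illustrative assumptions. It also recomputes the F statistic by hand from the residual sums of squares.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Synthetic data (illustrative only): y depends on a, b and a*b; c is noise
rng = np.random.default_rng(2)
toy = pd.DataFrame(rng.uniform(0, 10, size=(200, 3)), columns=["a", "b", "c"])
toy["y"] = (1 + 0.5 * toy["a"] + 0.3 * toy["b"]
            + 0.1 * toy["a"] * toy["b"] + rng.normal(0, 1, size=200))

small = smf.ols("y ~ a * b", data=toy).fit()      # restricted model
big = smf.ols("y ~ a * b * c", data=toy).fit()    # adds c and its interactions

# anova_lm expects the nested models ordered from smaller to larger
table = anova_lm(small, big)
print(table)

# The same F statistic computed by hand from the residual sums of squares
df_diff = small.df_resid - big.df_resid
F_manual = ((small.ssr - big.ssr) / df_diff) / (big.ssr / big.df_resid)
print(F_manual)
```

A large p-value in the second row of the table suggests that the extra terms of the larger model do not improve the fit beyond what chance would explain.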

In [None]:
!pip install rpy2
%load_ext rpy2.ipython
from rpy2.robjects import pandas2ri



In [None]:
%%R -o trees
library(MASS)
install.packages('leaps')
library(leaps)

Advert <- read.table("https://raw.githubusercontent.com/francji1/01RAD/main/data/Advert.csv",header=TRUE,sep=",")
head(Advert)


model1 <-  lm(sales ~ TV*radio*newspaper, data = Advert) # model with all interactions
summary(model1)

model2 <-  lm(sales ~ TV*radio, data = Advert)
summary(model2)

anova(model1,model2)

pairs(Advert)
n = nrow(Advert)

model_step0 <- step(model1)   # what is step function doing?
summary(model_step0)

anova(model1,model2)
anova(model1,model_step0)
anova(model2,model_step0)


# BIC
model_step1 <- stepAIC(model1, k=log(n))
summary(model_step1)

# AIC
model_step2 <- stepAIC(model1, k=2)
summary(model_step2)
model_step2 <- stepAIC(model1, direction="both")
model_step2$anova

# compare obtained model from BIC and AIC step functions
anova(model_step1,model_step2)

# Drop 1 predictor from full model
# which one?
dropterm(model1, test = "F")  # test must be one of "none", "Chisq", "F"

# Add 1 predictor to a starting model (here sales ~ TV; use sales ~ 1 for the intercept-only null model)
# which one?
with(Advert, add1(lm(sales~TV),sales~TV+radio+newspaper, test = "F"))

AIC = matrix(0,8,2)
BIC = matrix(0,8,2)

# write loop - add one variable per step and save AIC
AIC[1,]= extractAIC(lm(sales~1, data = Advert))
AIC[2,]= extractAIC(lm(sales~TV, data = Advert))
AIC[3,]= extractAIC(lm(sales~TV+radio, data = Advert))
AIC[4,]= extractAIC(lm(sales~TV*radio, data = Advert))
AIC[5,]= extractAIC(lm(sales~TV*radio+newspaper, data = Advert))
AIC[6,]= extractAIC(lm(sales~TV*radio+TV*newspaper, data = Advert))
AIC[7,]= extractAIC(lm(sales~(.)^2, data = Advert[,2:5]))
AIC[8,]= extractAIC(lm(sales~(.)^3, data = Advert[,2:5]))


BIC[1,] = extractAIC(lm(sales~1, data = Advert), k =log(n))
BIC[2,] = extractAIC(lm(sales~TV, data = Advert), k =log(n))
BIC[3,] = extractAIC(lm(sales~TV+radio, data = Advert), k =log(n))
BIC[4,] = extractAIC(lm(sales~TV*radio, data = Advert), k =log(n))
BIC[5,] = extractAIC(lm(sales~TV*radio+newspaper, data = Advert), k =log(n))
BIC[6,] = extractAIC(lm(sales~TV*radio+TV*newspaper, data = Advert), k =log(n))
BIC[7,] = extractAIC(lm(sales~(.)^2, data = Advert[,2:5]), k =log(n))
BIC[8,] = extractAIC(lm(sales~(.)^3, data = Advert[,2:5]), k =log(n))

# make it nicer

plot(AIC[,1],AIC[,2],type = "l",col = "red")
lines(BIC[,1],BIC[,2],col = "blue")

head(Advert)
leaps(x=Advert[,2:4], y=Advert[,5],
      names=names(Advert)[2:4], method="Cp")

leaps<-regsubsets(sales~TV+radio+newspaper,data=Advert,nbest=10)
summary(leaps)
plot(leaps,scale="Cp")


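# plot statistic by subset size (subsets() comes from the car package, not leaps)
if (!requireNamespace("car", quietly = TRUE)) install.packages("car")
car::subsets(leaps, statistic="cp")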
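
The R cell above fills the `AIC` and `BIC` matrices with `extractAIC`, which for `lm` fits returns the equivalent degrees of freedom and the criterion `n * log(RSS / n) + k * edf`. The sketch below is a Python counterpart on synthetic data. The helper name `extract_aic`, the seed, and the data are assumptions for illustration, and the helper assumes a full-rank design matrix.

```python
import numpy as np

def extract_aic(X, y, k=2.0):
    """Return (edf, criterion) for an OLS fit with intercept.

    Mirrors the formula used by R's extractAIC for lm objects:
    n * log(RSS / n) + k * edf, with edf = number of fitted coefficients
    (assumes the design matrix has full column rank).
    """
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    edf = design.shape[1]
    return edf, n * np.log(rss / n) + k * edf

# Synthetic data (illustrative only): columns 0 and 1 matter, column 2 is noise
rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 3))
y = 2.0 + 1.5 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)

# Add one column per step and record both criteria
results = []
for p in range(0, 4):
    edf, aic = extract_aic(X[:, :p], y, k=2.0)
    _, bic = extract_aic(X[:, :p], y, k=np.log(n))
    results.append((edf, aic, bic))
    print(f"{p} predictors: edf={edf}, AIC={aic:.2f}, BIC={bic:.2f}")
```

Setting `k=2` gives AIC and `k=log(n)` gives BIC, so BIC penalizes each extra coefficient more heavily whenever `n` exceeds about 7.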

