## Evaluating Regression Assumptions for Departments' Sales Response to Markdowns

### Background:

The objective was to assess the effect of markdowns on sales for individual departments or stores. A separate multiple regression (OLS) model was fitted for each group, and the following statistical tests were used to check the model assumptions:

1. Breusch-Pagan Test
2. Anderson-Darling Test
3. Variance Inflation Factor (VIF)
4. Durbin-Watson Statistic

### Results:

1. **Breusch-Pagan Test**: 
   - Objective: Test for homoscedasticity.
   - Null hypothesis: The variances for the errors are equal across values of independent variables.
   - Findings: Many departments had p-values below 0.05, indicating heteroscedasticity.

2. **Anderson-Darling Test**: 
   - Objective: Test for normality.
   - Null hypothesis: The regression residuals are normally distributed.
   - Findings: In many departments the Anderson-Darling statistic exceeded its 5% critical value, so normality of the residuals was rejected.

3. **VIF**: 
   - Objective: Quantify variance inflation due to multicollinearity.
   - Rule of thumb: A VIF above roughly 5 to 10 (the cutoff varies by convention) indicates problematic multicollinearity.
   - Findings: Most departments had high VIF values, indicating multicollinearity issues.

4. **Durbin-Watson Statistic**:
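   - Objective: Test for first-order autocorrelation in the residuals.
   - Ideal Value: Around 2.0, suggesting no autocorrelation; values toward 0 point to positive autocorrelation and values toward 4 to negative autocorrelation.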
   - Findings: Many departments showed potential autocorrelation issues.
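
As a rough illustration of how the four checks above can be read together, the sketch below turns one department's diagnostics into a list of flagged assumptions. The dictionary keys match those produced by the `diagnostic_tests` function in the code cells of this notebook. The helper name `flag_violations`, the VIF cutoff of 10, and the Durbin-Watson band of 1.5 to 2.5 are assumptions chosen for the example, and the numbers in `example` are made up.

```python
def flag_violations(diag, alpha=0.05, vif_limit=10.0, dw_band=(1.5, 2.5)):
    """Return the assumptions that one group's diagnostics call into question.

    The thresholds are illustrative rules of thumb, not fixed standards.
    """
    flags = []
    # Breusch-Pagan: a small p-value rejects constant error variance
    if diag["Breusch-Pagan p-value"] < alpha:
        flags.append("heteroscedasticity")
    # Anderson-Darling: a statistic above the critical value rejects normality
    if diag["Anderson-Darling Statistic"] > diag["Anderson-Darling 5% Critical Value"]:
        flags.append("non-normal residuals")
    # VIF: large values suggest collinear predictors
    if diag["Max VIF"] > vif_limit:
        flags.append("multicollinearity")
    # Durbin-Watson: values far from 2 suggest autocorrelated residuals
    low, high = dw_band
    if not low <= diag["Durbin-Watson Statistic"] <= high:
        flags.append("autocorrelation")
    return flags


# Made-up diagnostics for a single department
example = {
    "Breusch-Pagan p-value": 0.01,
    "Anderson-Darling Statistic": 2.3,
    "Anderson-Darling 5% Critical Value": 0.77,
    "Max VIF": 3.2,
    "Durbin-Watson Statistic": 1.1,
}
print(flag_violations(example))
# ['heteroscedasticity', 'non-normal residuals', 'autocorrelation']
```

Applying such a helper to every row of the saved diagnostics table would give a per-department summary of which assumptions look doubtful.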

### Recommendations:

1. **Homoscedasticity**:
   - Transform the dependent variable (e.g., using a log transformation).
   - Consider weighted least squares regression.

2. **Normality**:
   - Transform the dependent variable.
   - If the sample size is large, rely on the Central Limit Theorem: the coefficient estimates are then approximately normal even when the residuals are not.

3. **Multicollinearity**:
   - Drop redundant variables.
   - Use ridge or lasso regression.

4. **Autocorrelation**:
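   - For time-series data, use time-series-specific models (for example, models with lagged sales or autoregressive errors).
   - Otherwise, keep the OLS coefficients but report autocorrelation-robust (HAC, also called Newey-West) standard errors.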
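
To make the ridge suggestion in the multicollinearity item concrete, here is a minimal sketch on synthetic data using only NumPy. It is not the model fitted in this notebook: the helper `ridge_coefficients`, the variable `promo`, and the penalty `alpha=10.0` are hypothetical choices for illustration. Predictors are standardized and the response is centered so the intercept is left unpenalized.

```python
import numpy as np


def ridge_coefficients(X, y, alpha):
    """Closed-form ridge fit on standardized predictors and a centered response."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    n_features = Xs.shape[1]
    # Solve (Xs'Xs + alpha * I) beta = Xs'yc; alpha = 0 reduces to ordinary least squares
    return np.linalg.solve(Xs.T @ Xs + alpha * np.eye(n_features), Xs.T @ yc)


rng = np.random.default_rng(0)
n = 200
markdown = rng.normal(size=n)
# A second predictor that is almost a copy of the first (synthetic collinearity)
promo = markdown + rng.normal(scale=0.05, size=n)
sales = 2.0 * markdown + 1.0 * promo + rng.normal(size=n)
X = np.column_stack([markdown, promo])

ols_like = ridge_coefficients(X, sales, alpha=0.0)
ridge = ridge_coefficients(X, sales, alpha=10.0)
print("alpha=0: ", ols_like)
print("alpha=10:", ridge)
```

The penalty shrinks the coefficient vector, which stabilizes the estimates when predictors are nearly collinear, at the cost of some bias. That trade-off matters here because the goal is to interpret the markdown coefficient rather than to predict.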

### Closing Remarks:

For understanding the effect of markdowns on sales:
- If the focus is on determining the effect and not building a predictive model, simpler models or methods like difference-in-differences or interrupted time series might be more appropriate.

- The violated assumptions of the multiple regression models must be addressed before the coefficients are interpreted; otherwise, conclusions about markdown effects can be misleading.
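
For reference, the difference-in-differences idea mentioned above reduces to simple arithmetic on four group means. The sketch below uses made-up weekly sales figures and a hypothetical split into a "treated" group (exposed to markdowns) and a "control" group (not exposed). Whether such a control group can be constructed from this dataset is an open question and is only assumed here.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus the change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Made-up average weekly sales, before and after markdowns begin
treated_pre = [100.0, 110.0, 90.0]
treated_post = [130.0, 140.0, 120.0]
control_pre = [80.0, 85.0, 75.0]
control_post = [90.0, 95.0, 85.0]

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print(effect)  # 20.0
```

The estimate is only credible if both groups would have followed parallel sales trends in the absence of markdowns.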

In [1]:
import pandas as pd

In [2]:
train_df = pd.read_csv("data/train.csv")
features_df = pd.read_csv("data/features.csv")
stores_df = pd.read_csv("data/stores.csv")
test_df = pd.read_csv("data/test.csv")

In [3]:
train_df = (train_df
           .merge(features_df, how='left', indicator=True)
           .merge(stores_df, how='left'))

In [4]:
# Keep only rows with positive weekly sales; copy to avoid chained-assignment warnings
train_df = train_df.loc[train_df['Weekly_Sales'] > 0].copy()

# Derive calendar features from the date strings (the Date column itself stays a string)
dates = pd.DatetimeIndex(train_df['Date'])
train_df['year'] = dates.year
train_df['month'] = dates.month
train_df['week'] = dates.isocalendar()['week'].values  # ISO week number

In [5]:
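# Dates taken from index labels 40-90 (no markdown data) and 92-142 (markdown data available).
# Label slicing with .loc is inclusive and does not fail if a label was filtered out above.
times_without_markdowns = train_df['Date'].loc[40:90].tolist()
times_with_markdowns = train_df['Date'].loc[92:142].tolist()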

In [6]:
data_without_markdown = train_df[train_df['Date'].isin(times_without_markdowns)]
data_with_markdown = train_df[train_df['Date'].isin(times_with_markdowns)]

In [7]:
import statsmodels.api as sm
import pandas as pd
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy.stats import anderson
from statsmodels.stats.outliers_influence import variance_inflation_factor

def diagnostic_tests(y, X, model):
    results = {}

    # Breusch-Pagan Test for Homoscedasticity
    bp_test = het_breuschpagan(model.resid, model.model.exog)
    results['Breusch-Pagan p-value'] = bp_test[1]

    # Anderson-Darling Test for Normality
    a_test = anderson(model.resid)
    results['Anderson-Darling Statistic'] = a_test.statistic
    results['Anderson-Darling 5% Critical Value'] = a_test.critical_values[2]

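    # VIF for Multicollinearity
    # The constant stays in the design matrix so the VIFs are computed correctly,
    # but its own VIF is not meaningful and is excluded from the reported maximum.
    vif_data = pd.DataFrame({
        "Variable": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    })
    results['Max VIF'] = vif_data.loc[vif_data['Variable'] != 'const', 'VIF'].max()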

    # Durbin-Watson for Autocorrelation
    results['Durbin-Watson Statistic'] = sm.stats.durbin_watson(model.resid)

    return results

def dept_or_store_markdown_responsiveness(dataframe: pd.DataFrame, selector: str = "top", identifier="Dept"):
    department_effects = {}
    diagnostics = {}
    departments = dataframe[identifier].unique()

    for dept in departments:
        subset_data = dataframe[dataframe[identifier] == dept].copy()
        subset_data['Total_MarkDown'] = subset_data[['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']].sum(axis=1)

        predictors = ['Total_MarkDown', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment']
        X = sm.add_constant(subset_data[predictors])
        y = subset_data['Weekly_Sales']

        model = sm.OLS(y, X).fit()
        department_effects[str(dept)] = model.params['Total_MarkDown']

        # Diagnostic tests
        diagnostics[dept] = diagnostic_tests(y, X, model)

    diagnostics_df = pd.DataFrame(diagnostics).T
    diagnostics_df.to_csv('diagnostics.csv')

    df = pd.Series(department_effects)
    # df.to_csv('department_effects.csv')

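    if selector == "top":
        return df.sort_values(ascending=False).head()
    elif selector == "bottom":
        return df.sort_values(ascending=True).head()
    else:
        # Fail loudly instead of printing a message and silently returning None
        raise ValueError(f"selector must be 'top' or 'bottom', got {selector!r}")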

# Top 5 departments by markdown coefficient, using the weeks with markdown data
dept_or_store_markdown_responsiveness(data_with_markdown)

72    1.178458
92    0.847536
95    0.685443
2     0.593135
40    0.522279
dtype: float64

In [8]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# '#FF6347', '#1E90FF'
fig = make_subplots(rows=1, cols=2, subplot_titles=["Top 5 responsive depts to markdowns", "Bottom 5 responsive depts to markdowns"])
data = dept_or_store_markdown_responsiveness(data_with_markdown, selector="top", identifier="Dept")
fig.add_trace(go.Bar(x=data.index, y=data, marker_color='#FF6347', name="Top 5"), row=1, col=1)

data = dept_or_store_markdown_responsiveness(data_with_markdown, selector="bottom", identifier="Dept")
fig.add_trace(go.Bar(x=data.index, y=data, marker_color='#1E90FF', name="Bottom 5"), row=1, col=2)

fig.update_layout(yaxis_title="Coefficient", autosize=True, template="plotly_dark")

axes_kwargs = {"showgrid": True, "gridwidth": 1, "gridcolor": "Gray"}
fig.update_xaxes(**axes_kwargs)
fig.update_yaxes(**axes_kwargs)

fig.show()

In [9]:
fig = make_subplots(rows=1, cols=2, subplot_titles=["Top 5 responsive stores to markdowns", "Bottom 5 responsive stores to markdowns"])
data = dept_or_store_markdown_responsiveness(data_with_markdown, selector="top", identifier="Store")
fig.add_trace(go.Bar(x=data.index, y=data, marker_color='#FF6347', name="Top 5"), row=1, col=1)

data = dept_or_store_markdown_responsiveness(data_with_markdown, selector="bottom", identifier="Store")
fig.add_trace(go.Bar(x=data.index, y=data, marker_color='#1E90FF', name="Bottom 5"), row=1, col=2)

fig.update_layout(yaxis_title="Coefficient", autosize=True, template="plotly_dark")

axes_kwargs = {"showgrid": True, "gridwidth": 1, "gridcolor": "Gray"}
fig.update_xaxes(**axes_kwargs)
fig.update_yaxes(**axes_kwargs)

fig.show()