
## Confounding Variables Case Study

 Pricing & Confounds

**Context:**  


Steve, an analyst at your company, recently made the following argument:
> “We should raise prices because when we do so, customers are more likely to perceive our products as luxury goods. When I analyzed the data, I found that every additional dollar the price increased was associated with 45 units higher sales.”  

You are skeptical that customers buy more of a product when its price is higher, and you suspect that confounds in the historical data produce a misleading correlation.



### Overview of the Data Fields

- **dayofyear**  
  Calendar day of the year (1 to 365).

- **month1, month2, …, month12**  
  Dummy variables (0/1). Each equals 1 if the observation is in that month, 0 otherwise.

- **product**  
  - Value = 1 → *Product 1* (entry-level version).  
  - Value = 2 → *Product 2* (enhanced version with more features).

- **price**  
  Price listed for the product on the website that day. Prices vary because the company runs discount sales on ~20–40% of days.

- **units**  
  Number of units of the product purchased that day.


## Setup
Load Libraries

In [None]:
# Load Libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

## 1. Load and Explore the data

### 1.a. Load the data
Use `pd.read_csv` to pull the dataset.

In [None]:
# The data are available from this link:
url = 'https://raw.githubusercontent.com/dansacks/gb740/main/retaildata.csv'
data = pd.read_csv(url)
print(data.columns)
print(data.shape)
print(data.describe())

Index(['dayofyear', 'product', 'price', 'units', 'month1', 'month2', 'month3',
       'month4', 'month5', 'month6', 'month7', 'month8', 'month9', 'month10',
       'month11', 'month12'],
      dtype='object')
(730, 16)
        dayofyear     product       price        units      month1  \
count  730.000000  730.000000  730.000000   730.000000  730.000000   
mean   183.000000    1.500000   46.507945  2980.834247    0.084932   
std    105.438271    0.500343   17.612156   927.817395    0.278971   
min      1.000000    1.000000   25.500000  1759.000000    0.000000   
25%     92.000000    1.000000   30.000000  2095.000000    0.000000   
50%    183.000000    1.500000   44.250000  3203.500000    0.000000   
75%    274.000000    2.000000   65.000000  3635.750000    0.000000   
max    365.000000    2.000000   65.000000  5860.000000    1.000000   

           month2      month3      month4      month5      month6      month7  \
count  730.000000  730.000000  730.000000  730.000000  730.000000  73

### 1.b.  First five rows

In [None]:
## First five Rows
data.head(5)

Unnamed: 0,dayofyear,product,price,units,month1,month2,month3,month4,month5,month6,month7,month8,month9,month10,month11,month12
0,1,1,30.0,2064,1,0,0,0,0,0,0,0,0,0,0,0
1,1,2,65.0,3616,1,0,0,0,0,0,0,0,0,0,0,0
2,2,1,30.0,2222,1,0,0,0,0,0,0,0,0,0,0,0
3,2,2,65.0,3454,1,0,0,0,0,0,0,0,0,0,0,0
4,3,1,30.0,2026,1,0,0,0,0,0,0,0,0,0,0,0


### 1.c. Report the mean of `units` and `price`.

In [None]:
print(f"units(y) = {data['units'].mean():.2f}")
print(f"price(d) = {data['price'].mean():.2f}")

units(y) = 2980.83
price(d) = 46.51


## 2. Unadjusted regression: Estimate `units ~ price`.

In [None]:
# Unadjusted regression: outcome (y) = units, treatment (d) = price
# Model: y ~ d
model = smf.ols(formula='units ~ price', data=data).fit()

# Print summary
print(model.summary())



                            OLS Regression Results                            
Dep. Variable:                  units   R-squared:                       0.740
Model:                            OLS   Adj. R-squared:                  0.740
Method:                 Least Squares   F-statistic:                     2071.
Date:                Tue, 04 Nov 2025   Prob (F-statistic):          4.63e-215
Time:                        19:54:04   Log-Likelihood:                -5531.8
No. Observations:                 730   AIC:                         1.107e+04
Df Residuals:                     728   BIC:                         1.108e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    873.3970     49.519     17.638      0.0

### 2.a Report coefficient on `price`.  

coefficient on price: 45.31
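
With a single regressor, the OLS slope is just the sample covariance of price and units divided by the sample variance of price, so it summarizes association and nothing more. The sketch below checks that identity on synthetic data; the numbers are made up for illustration and are not the retail dataset.

```python
import numpy as np

# Synthetic, illustrative data (NOT the retail dataset).
rng = np.random.default_rng(1)
price = rng.uniform(25, 65, size=200)
units = 900 + 45 * price + rng.normal(0, 300, size=200)

# Closed-form slope of a one-regressor OLS: cov(d, y) / var(d).
slope_formula = np.cov(price, units, ddof=1)[0, 1] / np.var(price, ddof=1)

# Same slope from a degree-1 least-squares fit (polyfit returns slope first).
slope_fit, intercept_fit = np.polyfit(price, units, deg=1)

print(f"cov/var slope: {slope_formula:.4f}")
print(f"fitted slope:  {slope_fit:.4f}")
```

Because the slope is built only from how price and units move together, anything else that moves with both of them is folded into the estimate.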

## 3. Explain in words what would make a variable a confounding factor in this situation.




(A variable is a confounding factor if it is associated with both the independent variable (price, denoted d) and the dependent variable (units sold, denoted y) but is not itself caused by the independent variable. In this situation, a confounder would be any variable that influences both d and y, such as product quality, brand reputation, or marketing intensity, and is not caused by price.
- Product quality may lead to higher prices (d) and also drive more sales (y), but it is typically determined before pricing decisions, so it is not caused by d.
- Brand reputation can justify premium pricing (d) and increase consumer demand (y), and it usually exists independently of the price.
- Marketing intensity can affect both the price (d), through positioning and perceived value, and the number of units sold (y), but marketing strategies are generally set before pricing.


If such confounders are omitted from the analysis, the estimated effect of price (d) on units sold (y) may be biased: part of the variation attributed to price is actually due to the omitted variables, so price can appear to have a stronger, weaker, or even oppositely signed effect than it really does.)
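
The definition above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the confounder `premium`, the true price effect of -25, and every other number are hypothetical choices for illustration, not estimates from the retail data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical confounder: 1 for a premium product, 0 for an entry-level one.
premium = rng.integers(0, 2, size=n)

# The premium product is priced higher (illustrative numbers only).
price = 30 + 35 * premium + rng.normal(0, 3, size=n)

# Assumed truth: each extra dollar of price LOWERS units by 25,
# while the premium product sells 2500 more units at any given price.
units = 3000 - 25 * price + 2500 * premium + rng.normal(0, 100, size=n)

def ols(y, *regressors):
    """Return OLS coefficients [intercept, slope_1, ...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

naive_slope = ols(units, price)[1]              # confounder omitted
adjusted_slope = ols(units, price, premium)[1]  # confounder included

print(f"naive slope on price:    {naive_slope:.2f}")
print(f"adjusted slope on price: {adjusted_slope:.2f}")
```

In this setup the naive slope comes out positive even though the assumed causal effect is negative, and including the confounder recovers a slope close to -25.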






## 4. Investigate product as a confound
Make a table by `product` with mean price and mean units.

In [None]:
# Group by product and calculate mean price and mean units
product_table = data.groupby('product').agg({
    'price': ['mean'],
    'units': ['mean']
})
product_table.columns = ['price (d)', 'units (y)']

print(product_table)

         price (d)    units (y)
product                        
1        28.998904  2178.761644
2        64.016986  3782.906849


## 5. Does your analysis in part 4 indicate that product is a confound?


(Yes, the analysis in part 4 suggests that product is a confounder in the relationship between price (d) and units sold (y), because it meets all three conditions:

* Product is associated with price (d): the table shows that Product 2 has a much higher average price than Product 1 (about 64.02 versus 29.00).

* Product is associated with units sold (y): the same table shows that Product 2 also has higher average sales than Product 1 (about 3,783 versus 2,179 units per day).

* Product is not caused by price (d): the product type is determined before the price is set; it is a characteristic of the item, not a consequence of its price.

Because product satisfies all three conditions, failing to control for it could bias the estimated effect of price on units sold.)
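
A complementary check is to estimate the price slope separately within each product and compare it with the pooled slope. The sketch below does this on a synthetic frame named `toy`; the base prices, the 15% discount, the 30% sale frequency, and the -25 slope are all assumptions made for illustration, not values taken from the retail data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n_days = 365
frames = []
# (product id, list price, baseline demand): hypothetical values.
for product, base_price, base_units in [(1, 30.0, 2900.0), (2, 65.0, 5400.0)]:
    on_sale = rng.random(n_days) < 0.3                    # ~30% of days discounted
    price = np.where(on_sale, base_price * 0.85, base_price)
    units = base_units - 25 * price + rng.normal(0, 150, size=n_days)
    frames.append(pd.DataFrame({"product": product, "price": price, "units": units}))
toy = pd.concat(frames, ignore_index=True)

def price_slope(frame):
    """Slope of a simple least-squares fit of units on price."""
    slope, _intercept = np.polyfit(frame["price"], frame["units"], deg=1)
    return slope

pooled = price_slope(toy)
within = {product: price_slope(group) for product, group in toy.groupby("product")}

print(f"pooled slope: {pooled:.2f}")
for product, slope in within.items():
    print(f"product {product} slope: {slope:.2f}")
```

Here the pooled slope is positive while both within-product slopes are negative, which is the pattern to look for when a grouping variable confounds the pooled relationship.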






## 6. Estimate a regression of units on price and product. What is the new coefficient on price?

In [None]:
# Adjusted regression: outcome (y) = units, treatment (d) = price, confounder = product
# Model: y ~ d + confounder
model = smf.ols(formula='units ~ price + product', data=data).fit()
# Print summary
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:                  units   R-squared:                       0.748
Model:                            OLS   Adj. R-squared:                  0.748
Method:                 Least Squares   F-statistic:                     1081.
Date:                Tue, 04 Nov 2025   Prob (F-statistic):          1.53e-218
Time:                        20:01:20   Log-Likelihood:                -5519.7
No. Observations:                 730   AIC:                         1.105e+04
Df Residuals:                     727   BIC:                         1.106e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    561.2966     79.676      7.045      0.0

### 6.a What is the new coefficient on price?

coefficient on price: -2.2129
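
The gap between the two price coefficients can be read through the standard omitted-variable-bias decomposition for OLS. The symbols $\gamma$ and $\delta$ below are introduced here for illustration: $\gamma$ is the coefficient on product in the adjusted regression, and $\delta$ is the slope from an auxiliary regression of product on price.

$$
\begin{aligned}
\text{units} &= a + \beta_{\text{short}}\,\text{price} + e \\
\text{units} &= \alpha + \beta_{\text{long}}\,\text{price} + \gamma\,\text{product} + u \\
\text{product} &= \pi_0 + \delta\,\text{price} + v \\
\beta_{\text{short}} &= \beta_{\text{long}} + \gamma\,\delta
\end{aligned}
$$

Plugging in the reported coefficients gives $\gamma\,\delta \approx 45.31 - (-2.21) = 47.52$. Since Product 2 is the higher-priced product, $\delta > 0$, so $\gamma$ must be positive as well: almost all of the unadjusted slope reflects the product difference rather than price itself.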

## 7. Estimate a regression of units on price, product, and month2, month3, ... month12.



In [None]:
# Adjusted regression with confounders:
# Outcome (y) = units
# Treatment (d) = price
# Confounders = product and month (seasonality)
# Model: y ~ d + confounders
model = smf.ols(formula='''
    units ~ price + product + month2 + month3 + month4 + month5 + month6 +
            month7 + month8 + month9 + month10 + month11 + month12
''', data=data).fit()

# Print summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                  units   R-squared:                       0.924
Model:                            OLS   Adj. R-squared:                  0.923
Method:                 Least Squares   F-statistic:                     670.2
Date:                Tue, 04 Nov 2025   Prob (F-statistic):               0.00
Time:                        20:03:31   Log-Likelihood:                -5082.4
No. Observations:                 730   AIC:                         1.019e+04
Df Residuals:                     716   BIC:                         1.026e+04
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    249.7913     53.440      4.674      0.0

### 7.a What is the new coefficient on price?

coefficient on price: -25.7320
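
The formula in part 7 leaves out `month1` on purpose: with an intercept in the model, including all twelve dummies would make them perfectly collinear, so January serves as the reference month. As a small convenience sketch (the variable names `month_terms` and `formula` are introduced here), the same formula string can be built programmatically rather than typed out:

```python
# Build the regression formula with month2..month12 as controls;
# month1 is omitted so that January is the reference category.
month_terms = [f"month{m}" for m in range(2, 13)]
formula = "units ~ price + product + " + " + ".join(month_terms)

print(formula)
```

The resulting string can be passed to `smf.ols(formula=formula, data=data)` in place of the hand-written version.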


## 8. Explain in words what your analysis found relative to Steve's.



(Steve's analysis found a positive association between the treatment variable (d = price) and the outcome variable (y = units sold): each additional dollar of price was associated with about 45 more units sold. However, this came from an unadjusted regression, which did not account for potential confounding factors.
My analysis controlled for product type and seasonality (month), both of which may confound the relationship between d and y, and found that the relationship reversed. Controlling for product alone moved the coefficient on d from 45.31 to -2.21, and adding the month dummies moved it to -25.73, meaning that a higher price was associated with fewer units sold once these confounds were addressed. This suggests that Steve's result was driven largely by differences between products: Product 2 has both a higher average price and higher average sales, which creates a misleading positive correlation. Once we account for these differences, the data do not support the idea that raising prices leads to higher sales.)

