# Week 2 Day 5 Assignment: Customer Retention

**Dataset:** `storedata_total1.csv`

Tasks:
- Analyze customer retention vs non‑retention
- Conduct a t‑test to check statistical significance
- Report findings in plain English

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

# Load dataset
path = "storedata_total1.csv"
df = pd.read_csv(path)

df.head()

|   | custid | retained | created | firstorder | lastorder | esent | eopenrate | eclickrate | avgorder | ordfreq | paperless | refill | doorstep | favday | city |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6H6T6N | 0 | 28-09-2012 | 11-08-2013 | 11-08-2013 | 29 | 100.0 | 3.448276 | 14.52 | 0.0 | 0 | 0 | 0 | Monday | DEL |
| 1 | APCENR | 1 | 19-12-2010 | 01-04-2011 | 19-01-2014 | 95 | 92.631579 | 10.526316 | 83.69 | 0.181641 | 1 | 1 | 1 | Friday | DEL |
| 2 | 7UP6MS | 0 | 03-10-2010 | 01-12-2010 | 06-07-2011 | 0 | 0.0 | 0.0 | 33.58 | 0.059908 | 0 | 0 | 0 | Wednesday | DEL |
| 3 | 7ZEW8G | 0 | 22-10-2010 | 28-03-2011 | 28-03-2011 | 0 | 0.0 | 0.0 | 54.96 | 0.0 | 0 | 0 | 0 | Thursday | BOM |
| 4 | 8V726M | 1 | 27-11-2010 | 29-11-2010 | 28-01-2013 | 30 | 90.0 | 13.333333 | 111.91 | 0.00885 | 0 | 0 | 0 | Monday | BOM |


In [2]:
# Basic overview
print(df.shape)
print(df.columns)

# Retention counts
retention_counts = df["retained"].value_counts(dropna=False)
retention_counts

(30801, 15)
Index(['custid', 'retained', 'created', 'firstorder', 'lastorder', 'esent',
       'eopenrate', 'eclickrate', 'avgorder', 'ordfreq', 'paperless', 'refill',
       'doorstep', 'favday', 'city'],
      dtype='object')


retained
1    24472
0     6329
Name: count, dtype: int64
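
The counts above show that the two groups are quite unequal in size. As a quick worked check, the share of retained customers can be computed directly from the two printed counts. The variable names below are illustrative and not part of the original notebook.

```python
# Counts copied from the value_counts() output above
retained_count = 24472
not_retained_count = 6329

total = retained_count + not_retained_count
retained_share = retained_count / total

# Share of customers flagged as retained, as a percentage
print(f"Total customers: {total}")
print(f"Retained share: {retained_share:.1%}")
```

Roughly four in five customers are retained, so the groups are unbalanced. This is one reason a test that does not assume equal group variances is a reasonable choice here.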

In [3]:
# Split groups
retained = df[df["retained"] == 1]
not_retained = df[df["retained"] == 0]

# Compare average order value (avgorder)
retained_avg = retained["avgorder"].dropna()
not_retained_avg = not_retained["avgorder"].dropna()

retained_avg.mean(), not_retained_avg.mean()

(np.float64(61.95767489375613), np.float64(61.55018802338442))

In [4]:
# Welch's t-test for avgorder
avg_ttest = stats.ttest_ind(retained_avg, not_retained_avg, equal_var=False, nan_policy="omit")

avg_ttest

TtestResult(statistic=np.float64(0.6901421078050171), pvalue=np.float64(0.4901215064484649), df=np.float64(9601.352732263727))
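
To make the `statistic` and `df` fields in this result less opaque, here is a minimal standard-library sketch of Welch's formulas: the difference in means divided by the combined standard error, with Welch-Satterthwaite degrees of freedom. The function name `welch_t` and the toy data are assumptions for illustration only. This is not SciPy's implementation and it does not use the assignment dataset.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(a), len(b)
    # Squared standard error of each group mean (sample variance / n)
    se2_a = variance(a) / na
    se2_b = variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


# Toy samples (hypothetical, not from storedata_total1.csv)
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]

t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof)
```

Because the degrees of freedom depend on both variances and both sample sizes, `df` is generally not a whole number, which matches the fractional `df` in the output above.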

In [5]:
# Optional: compare order frequency (ordfreq)
retained_freq = retained["ordfreq"].dropna()
not_retained_freq = not_retained["ordfreq"].dropna()

freq_ttest = stats.ttest_ind(retained_freq, not_retained_freq, equal_var=False, nan_policy="omit")

retained_freq.mean(), not_retained_freq.mean(), freq_ttest

(np.float64(0.038298768468576334),
 np.float64(0.0355234908389951),
 TtestResult(statistic=np.float64(1.950582794390669), pvalue=np.float64(0.05113391459158599), df=np.float64(10248.596826812807)))

## Findings (Plain English)

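- We compared **retained** (24,472) vs **not retained** (6,329) customers using Welch's t‑test, which does not assume the two groups have equal variances.
- The test checks whether the group means of a metric (`avgorder` or `ordfreq`) differ by more than chance alone would explain.
- If the p‑value is below 0.05, we call the difference statistically significant.
- `avgorder`: retained customers average 61.96 vs 61.55 for not retained (t ≈ 0.69, p ≈ 0.49). The difference is not statistically significant.
- `ordfreq`: retained customers average 0.0383 vs 0.0355 for not retained (t ≈ 1.95, p ≈ 0.051). The p‑value is just above 0.05, so this difference is also not statistically significant at the 5% level, although it is borderline.
- Conclusion: in this dataset, neither average order value nor order frequency clearly separates retained from not retained customers.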
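
The 0.05 decision rule can be sketched as a small helper that turns a p-value into a plain-English sentence. The function name `describe_result` and the `alpha` parameter are hypothetical and do not appear in the original notebook. The p-values are copied from the t-test outputs above.

```python
def describe_result(metric, p_value, alpha=0.05):
    """Summarize a two-sided t-test result in one plain-English sentence."""
    if p_value < alpha:
        verdict = "is statistically significant"
    else:
        verdict = "is not statistically significant"
    return f"{metric}: p = {p_value:.3f}, so the difference {verdict} at alpha = {alpha}."


# p-values copied from the Welch t-test outputs in cells 4 and 5
avg_summary = describe_result("avgorder", 0.4901215064484649)
freq_summary = describe_result("ordfreq", 0.05113391459158599)

print(avg_summary)
print(freq_summary)
```

A fixed cutoff is a convention rather than a hard boundary, so a p-value as close to 0.05 as the one for `ordfreq` is best reported alongside the group means and not only as a yes/no verdict.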