Dunnhumby Coupon Redemption

What it does:

A classification model that predicts whether or not a given customer in a group of frequent shoppers will redeem a coupon. The data was sourced from Kaggle.

This Capstone Project was built as part of the academic requirement of the PGP-DSE program at Great Learning Hyderabad.

Sometimes GitHub is unable to Preview Code Blocks for Jupyter Notebooks. If this happens, you can just view my Notebook.

How to build it yourself:

  1. Install Python.
  2. Install non-standard Python libraries: launch a command prompt and run this command:
    C:\Windows\system32> pip install ipykernel jupyterlab notebook numpy pandas matplotlib seaborn scikit-learn statsmodels
  3. Download the Jupyter Notebook.
  4. Download the Dataset from Kaggle.
  5. Launch Jupyter Notebook from the Start Menu, and navigate to the folder containing the dataset and Jupyter Notebook you just downloaded.
  6. Extract the .csv files into a folder named 'archive'. Ensure that the Jupyter Notebook is in the same folder as 'archive'.
  7. Go to Cell -> Run All.
  8. Profit!

How to interpret it:

  1. The EDA portion answers some important questions that arise regarding the data.
  2. The Data was not present in a monolithic form; several features were created from the different tables and joined to create the base DataFrame.
  3. Several Classification Models were built, the best of which achieved a Macro Average F1-Score of 0.696 (roughly 0.7).

My instance and the insights derived:

Data Collection:

8 .csv files were downloaded from Kaggle. Since we didn't have a monolithic DataFrame, the HH_demographic DataFrame was taken as the base, and several numerical features were created and added to it in order to build the models.

Feature Engineering:

First things first, the HH_demographic DataFrame didn't contain the Target Variable (Redeemed). If the household_key (unique identifier) was present in the coupon_redempt DataFrame, the Target Variable was assigned the value 1; if not, 0.
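A minimal pandas sketch of this step, assuming the Kaggle dump was extracted into 'archive' as in the build steps above:

```python
import pandas as pd

# Load the demographics table and the redemption log
hh_demographic = pd.read_csv('archive/hh_demographic.csv')
coupon_redempt = pd.read_csv('archive/coupon_redempt.csv')

# Redeemed = 1 if the household ever appears in the redemption table, else 0
redeemers = set(coupon_redempt['household_key'])
hh_demographic['Redeemed'] = (
    hh_demographic['household_key'].isin(redeemers).astype(int)
)
```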

Next, the HH_demographic DataFrame contained categorical variables that were static descriptors of the customers/households, such as income bracket (INCOME_DESC) and household composition (HH_COMP_DESC). I wanted to add numerical features that would be "dynamic" descriptors of the customers'/households' shopping habits. I believed the following features would contribute useful information to the model (a sketch of how they can be computed follows the list):

  1. Number of Campaigns a Household was targeted for
  2. Number of Distinct Coupons a Household Redeemed
  3. Coupon Success Ratio - Ratio of Coupons Redeemed to Coupons Received
  4. Average Number of Items a Household Purchases Per Visit
  5. Number of Visits a Household Pays to the Retail Store
  6. Average Amount Spent by a Household Per Visit
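Continuing from the sketch above, these features can be derived roughly as follows; the column names (CAMPAIGN, COUPON_UPC, BASKET_ID, QUANTITY, SALES_VALUE) follow the standard Kaggle schema, and the feature names themselves are illustrative:

```python
campaign_table = pd.read_csv('archive/campaign_table.csv')
transactions = pd.read_csv('archive/transaction_data.csv')

# 1. Number of campaigns a household was targeted for
campaigns_per_hh = campaign_table.groupby('household_key')['CAMPAIGN'].nunique()

# 2. Number of distinct coupons a household redeemed
coupons_redeemed = coupon_redempt.groupby('household_key')['COUPON_UPC'].nunique()

# 5. Number of visits (distinct baskets) a household pays to the store
visits = transactions.groupby('household_key')['BASKET_ID'].nunique()

# 4. and 6. Average items purchased and amount spent per visit
avg_items = transactions.groupby('household_key')['QUANTITY'].sum() / visits
avg_spend = transactions.groupby('household_key')['SALES_VALUE'].sum() / visits

for name, feat in [('num_campaigns_targeted', campaigns_per_hh),
                   ('distinct_coupons_redeemed_household', coupons_redeemed),
                   ('num_visits', visits),
                   ('avg_items_per_visit', avg_items),
                   ('avg_spend_per_visit', avg_spend)]:
    hh_demographic[name] = hh_demographic['household_key'].map(feat)

# 3. Coupon success ratio = coupons redeemed / coupons received, where
#    "received" counts the coupons attached to the campaigns the
#    household was targeted for
coupon = pd.read_csv('archive/coupon.csv')
per_campaign = coupon.groupby('CAMPAIGN')['COUPON_UPC'].nunique()
received = (campaign_table['CAMPAIGN'].map(per_campaign)
            .groupby(campaign_table['household_key']).sum())
hh_demographic['coupon_success_ratio'] = (
    hh_demographic['distinct_coupons_redeemed_household'].fillna(0)
    / hh_demographic['household_key'].map(received)
)
```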

Data Cleaning:

After the features were created, several cleaning and preprocessing steps remained (see the sketch after this list):

  1. Null Values in the created Numerical Features were imputed with zero, since no corresponding activity existed for that household_key (unique identifier).
  2. Outliers were treated using Winsorization (Capping)
  3. The Data was split into Train and Test sets.
  4. The numerical variables were Scaled and the categorical variables were encoded for both Train and Test sets.
  5. The categorical variables were re-binned into coarser, more meaningful groups than the initial ones (see 'Further Binning the Categorical Variables' below).
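A sketch of these steps, continuing from the previous sketches; the percentile caps and split ratio are illustrative choices:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = hh_demographic.copy()
num_cols = ['num_campaigns_targeted', 'distinct_coupons_redeemed_household',
            'coupon_success_ratio', 'num_visits',
            'avg_items_per_visit', 'avg_spend_per_visit']

# 1. Nulls in the engineered features mean "no recorded activity" -> impute 0
df[num_cols] = df[num_cols].fillna(0)

# 2. Winsorize (cap) outliers, here at the 1st/99th percentiles
for col in num_cols:
    lo, hi = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(lo, hi)

# 3. Train/test split, stratified on the target
X = df.drop(columns=['Redeemed', 'household_key'])
y = df['Redeemed']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# 4. Scale numericals (fit on train only) and one-hot encode categoricals
scaler = StandardScaler().fit(X_train[num_cols])
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test).reindex(columns=X_train.columns, fill_value=0)
```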

Exploratory Data Analysis:

The following questions were answered:

  1. On average, how long did each type of campaign run for?

[Figure: Campaign Runtime]

On average, TypeB campaigns run the shortest, for 37.6 days, and TypeC campaigns run the longest, for 74.5 days.

  2. Which was the most popular type of campaign?

[Figure: Popular Campaign Type]

TypeA was the most popular type of campaign.

  3. Who were the most frequent shoppers?

[Figure: Frequent Shoppers]

The Data spans a period of 2 years (roughly 730 days), so households with more recorded visits than that were, on average, visiting the store more than once a day.
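These answers can be reproduced along the following lines, assuming the standard Kaggle schema (START_DAY and END_DAY in campaign_desc, with DESCRIPTION holding the TypeA/TypeB/TypeC label):

```python
campaign_desc = pd.read_csv('archive/campaign_desc.csv')

# 1. Average runtime (in days) per campaign type
campaign_desc['duration'] = campaign_desc['END_DAY'] - campaign_desc['START_DAY']
print(campaign_desc.groupby('DESCRIPTION')['duration'].mean())

# 2. Most popular campaign type, by household-campaign pairings
print(campaign_table['DESCRIPTION'].value_counts())

# 3. Most frequent shoppers, by number of distinct store visits
print(transactions.groupby('household_key')['BASKET_ID'].nunique()
      .sort_values(ascending=False).head(10))
```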

Data Visualization:

I used Seaborn to visualize the distribution of individual variables (Univariate Analysis), as well as the relationship of each variable with the Target Variable (Bivariate Analysis).

Univariate Analysis:

Numerical Variables: [figure]

Categorical Variables: [figure]

Bivariate Analysis:

Numerical Variables vs Target: [figure]

Categorical Variables vs Target: [figure]
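Illustrative Seaborn calls for these four kinds of plots (the column choices are examples):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Univariate, numerical: distribution of a single variable
sns.histplot(df['avg_spend_per_visit'], kde=True)
plt.show()

# Univariate, categorical: category counts
sns.countplot(x='INCOME_DESC', data=df)
plt.xticks(rotation=45)
plt.show()

# Bivariate, numerical vs target
sns.boxplot(x='Redeemed', y='avg_spend_per_visit', data=df)
plt.show()

# Bivariate, categorical vs target
sns.countplot(x='INCOME_DESC', hue='Redeemed', data=df)
plt.xticks(rotation=45)
plt.show()
```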

Treating Quasi-Separation:

  1. Upon further inspection, 'distinct_coupons_redeemed_household' and 'coupon_success_ratio' both almost perfectly separate the subgroups in the target variable 'Redeemed'.
  2. Logically, if the value of these columns is zero, then the target is 0, and if the value is non-zero, then the target is 1.
  3. Since including these columns would let the model simply read the target off them, they were removed (see the sketch below).
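A quick check of the separation before dropping the two columns:

```python
leaky = ['distinct_coupons_redeemed_household', 'coupon_success_ratio']
for col in leaky:
    # Non-zero values line up almost one-to-one with Redeemed == 1
    print(pd.crosstab(df[col] > 0, df['Redeemed']))

X_train = X_train.drop(columns=leaky)
X_test = X_test.drop(columns=leaky)
```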

Further Binning the Categorical Variables:

[Figure: Further Binning the Categorical Variables]
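As an example of the kind of re-binning involved, the twelve INCOME_DESC brackets could be collapsed into three coarser groups; the mapping below is hypothetical, and the actual bins used in the notebook may differ:

```python
# Hypothetical coarser income bins (actual notebook bins may differ)
income_bins = {
    'Under 15K': 'Low', '15-24K': 'Low', '25-34K': 'Low',
    '35-49K': 'Middle', '50-74K': 'Middle', '75-99K': 'Middle',
    '100-124K': 'High', '125-149K': 'High', '150-174K': 'High',
    '175-199K': 'High', '200-249K': 'High', '250K+': 'High',
}
df['INCOME_DESC'] = df['INCOME_DESC'].map(income_bins)
```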

Treating Multicollinearity:

Variables with the highest VIF (Variance Inflation Factor) were dropped from the DataFrame iteratively, until every remaining variable had a VIF below the chosen threshold of 10 (i.e. at most 90% of the variation in any variable can be explained by the other variables).
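A sketch of the iterative procedure using statsmodels, applied here to the numerical columns of the training set:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X, threshold=10.0):
    # Repeatedly drop the column with the highest VIF
    # until all remaining columns fall below the threshold
    X = X.copy()
    while X.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(X.values.astype(float), i)
             for i in range(X.shape[1])],
            index=X.columns)
        if vifs.max() < threshold:
            break
        X = X.drop(columns=[vifs.idxmax()])
    return X

X_train_reduced = drop_high_vif(X_train.select_dtypes('number'))
```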

Final Model:

Several classification models were built, including:

  1. Logistic Regression
  2. K-Nearest Neighbors
  3. Bagged K-Nearest Neighbors
  4. Decision Tree Classifier
  5. Bagged Decision Tree Classifier
  6. Random Forest Classifier
  7. AdaBoost Classifier

The hyperparameters of the Random Forest Classifier were tuned, viz. 'criterion', 'n_estimators', 'max_depth', 'min_impurity_decrease', 'min_samples_split'. The final model gave us a Macro Average F1-Score of 0.696.
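A hedged reconstruction of the tuning step with GridSearchCV; the grid values below are illustrative, since the README does not list the exact ranges searched:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

param_grid = {
    'criterion': ['gini', 'entropy'],
    'n_estimators': [100, 300, 500],
    'max_depth': [4, 6, 8, None],
    'min_impurity_decrease': [0.0, 0.001, 0.01],
    'min_samples_split': [2, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring='f1_macro', cv=5, n_jobs=-1)
search.fit(X_train, y_train)

# The "macro avg" row of this report is the reported F1-Score
print(classification_report(y_test, search.predict(X_test)))
```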

Future Scope:

I am satisfied with the outcome of this project. However, it is simplistic and undoubtedly a prototype, with huge scope for improvement:

  1. Building a model on the Top 10 Most Important Features.
  2. Performing Further Feature Engineering to obtain more significant Features.
  3. Performing Oversampling Techniques (e.g. SMOTE) to obtain more rows for the minority class.
  4. Product-Based Coupon Redemption instead of Customer-Based Coupon Redemption.

Since the Academic Requirements for this project have been met, I will put off implementing these for some time. In the meantime, feel free to contribute to this Project!
