What is in your basket? — Instacart Dataset Exploration

Project Introduction

The aim of this project is to explore and analyze the Instacart dataset and to draw actionable recommendations and conclusions about customer habits and needs.

Technologies Used

Methods Used

Data Processing / Data Cleaning
Data Analysis
Descriptive Statistics
Data Visualization
Reporting

Project Description

Instacart is a popular app for grocery ordering and delivery. It makes it easy to fill your refrigerator and pantry with your favorite produce whenever you need it. In 2017, Instacart published a Kaggle competition that challenged the data science community to build a powerful recommendation system for its customers. The company released an anonymized dataset containing a sample of over 3 million grocery orders from more than 200,000 Instacart users.

I was curious enough to explore and analyze the dataset. I will leave the recommender system for another time and focus here on exploratory data analysis (EDA): answering general questions about customer habits, the most popular produce and departments, and similar topics.

In order to perform an analysis, I started by asking general questions about customer habits, popular produce, dominant departments, returning customers and other questions that seemed relevant. I ended up with a list of questions I deemed interesting enough to answer:

When do people place their orders?
How many orders are there per customer?
How often do customers reorder?
Which products are frequently ordered?
Which products are usually reordered?
What is the proportion between reordered and newly ordered items?
How many products are there in the cart usually?
What are the most popular aisles per products ordered?
What are the most popular departments?

Data Sources

The data used for this analysis is available on the Instacart Kaggle Webpage. It contains 6 files: orders, products, departments, aisles, order products prior, and order products train.
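
The files presumably link to one another through ID columns, so they can be joined into a single table of order lines before any analysis. Below is a minimal sketch of such a join, using only the Python standard library and tiny inline stand-ins for the files. The column names (product_id, aisle_id, department_id, reordered) and all sample rows are assumptions for illustration and are not taken from the notebook.

```python
# Tiny inline stand-ins for the CSV files; column names and rows are assumptions.
products = [
    {"product_id": 1, "product_name": "Banana", "aisle_id": 24, "department_id": 4},
    {"product_id": 2, "product_name": "Whole Milk", "aisle_id": 84, "department_id": 16},
]
aisles = {24: "fresh fruits", 84: "milk"}
departments = {4: "produce", 16: "dairy eggs"}
order_products = [
    {"order_id": 10, "product_id": 1, "reordered": 1},
    {"order_id": 10, "product_id": 2, "reordered": 0},
    {"order_id": 11, "product_id": 1, "reordered": 1},
]


def merge_order_lines(order_products, products, aisles, departments):
    """Attach product, aisle and department names to each order line."""
    # Index products by ID so each lookup is a dictionary access.
    products_by_id = {p["product_id"]: p for p in products}
    merged = []
    for line in order_products:
        product = products_by_id[line["product_id"]]
        merged.append({
            **line,
            "product_name": product["product_name"],
            "aisle": aisles[product["aisle_id"]],
            "department": departments[product["department_id"]],
        })
    return merged


merged = merge_order_lines(order_products, products, aisles, departments)
for row in merged:
    print(row["order_id"], row["product_name"], row["aisle"], row["department"])
```

With the real files, the same join would typically be done with a dataframe library's merge operation; the dictionary lookups here only illustrate the idea.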

File Descriptions

Data - folder containing competition data
Images - images
Instacart Dataset Exploration - Notebook containing the complete process of Data Exploration and Analysis

Feature Notebooks and Deliverables

Blog Posts

Blog post on Instacart Dataset Exploration: What is in your basket? — Instacart Dataset Exploration

Structure of Notebooks

  1. Data Preprocessing and Basic EDA

        1. Imports
        2. Data
           2.1 Orders Dataset
           2.2 Products Dataset
           2.3 Aisles Dataset
           2.4 Departments Dataset
           2.5 Order Products Dataset
           2.6 Merging Dataframes
        3. Business Case
           3.1 What is the structure of our data?
           3.2 When do people place their orders?
           3.3 How many orders are there per customer?
           3.4 How often do customers reorder?
           3.5 Which products are frequently ordered?
           3.6 Which products are usually reordered?
           3.7 What is the proportion between reordered and newly ordered items?
           3.8 How many products are there in the cart usually?
           3.9 What are the most popular aisles per products ordered?
           3.10 What is the share of orders per aisle?
           3.11 What are the most popular departments?
           3.12 What is the share of orders per department?

Presentation

Link to the presentation: Instacart Dataset Exploration Presentation

Most Important Findings

1. When do people place their orders?

The most popular days to place orders appear to be Monday and Tuesday, when the number of orders easily exceeds 100K. The other days of the week show similar order counts to one another. This raises follow-up questions: Why do customers order on Monday and Tuesday more than on other days? Are there any differences in produce availability during the week? Does delivery play a role in this difference?

Most orders are placed on Monday and Tuesday between 9am and 4pm. The heatmap gives a more detailed view of the hours at which orders are placed on those days.
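
The counts behind a day-by-hour heatmap like this can be sketched as follows. This is a minimal illustration, assuming each order carries an integer day of week and an hour of day; the sample pairs are made up, and which weekday each integer stands for is also an assumption.

```python
from collections import Counter

# Hypothetical (day_of_week, hour_of_day) pairs, one per order.
orders = [(0, 10), (0, 10), (0, 14), (1, 9), (1, 10), (3, 20)]


def count_by_day_and_hour(orders):
    """Count orders in each (day, hour) cell of the heatmap."""
    return Counter(orders)


def count_by_day(orders):
    """Count orders per day of week, ignoring the hour."""
    return Counter(day for day, _ in orders)


heat = count_by_day_and_hour(orders)
per_day = count_by_day(orders)
busiest_day, busiest_count = per_day.most_common(1)[0]

print("orders per day:", dict(per_day))
print("busiest day:", busiest_day, "with", busiest_count, "orders")
print("orders on day 0 at 10:00:", heat[(0, 10)])
```

The `heat` counter holds one value per cell, which is the form a plotting library would need once it is reshaped into a 7 x 24 grid.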

2. How many orders are there per customer?

More than half of the customers placed between 4 and 10 orders. The number of orders per customer is capped at 100, as stated in the competition instructions. Because the data does not include customers who placed one, two, or three orders, I can only speculate about their share. It would be interesting to see how many customers placed a single order and how that compares to the total number of orders. The current insights suggest there is room for the business to grow its revenue by increasing the number of recurring customers, i.e. customers who place more than 10 orders.
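
The share of customers in a given order-count range can be computed along these lines. This is a sketch on made-up data: the user IDs, order IDs and counts are hypothetical, and `share_in_range` is an illustrative helper rather than a function from the notebook.

```python
from collections import Counter

# Hypothetical number of orders per user, expanded into (order_id, user_id) pairs.
order_counts = {"a": 4, "b": 6, "c": 12, "d": 5}
orders = [(f"{user}-{i}", user) for user, n in order_counts.items() for i in range(n)]


def share_in_range(orders_per_user, low, high):
    """Fraction of users whose order count lies in [low, high]."""
    in_range = sum(1 for n in orders_per_user.values() if low <= n <= high)
    return in_range / len(orders_per_user)


# Count how many orders each user placed.
orders_per_user = Counter(user_id for _, user_id in orders)
share_4_to_10 = share_in_range(orders_per_user, 4, 10)
share_over_10 = share_in_range(orders_per_user, 11, 100)

print(f"users with 4 to 10 orders: {share_4_to_10:.0%}")
print(f"users with more than 10 orders: {share_over_10:.0%}")
```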

3. Which products are frequently ordered?

As a frequent banana shopper myself, I got the giggles from this visualization. Most of the 12 products most often added to carts are fresh fruit, mostly organic; the rest are fresh vegetables. Are the majority of businesses offered on Instacart actually selling farmer’s market produce? Or do we have a direct answer to a question about customers' healthy diet choices?
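
A ranking like this comes down to counting how often each product appears across all order lines. Here is a minimal sketch, assuming order lines have already been joined with product names; the sample lines and the `top_products` helper are illustrative only.

```python
from collections import Counter

# Hypothetical (order_id, product_name) pairs, one per item added to a cart.
order_lines = [
    (10, "Banana"),
    (10, "Organic Strawberries"),
    (11, "Banana"),
    (12, "Banana"),
    (12, "Organic Baby Spinach"),
    (13, "Organic Strawberries"),
]


def top_products(order_lines, n):
    """Return the n most frequently ordered products with their counts."""
    return Counter(name for _, name in order_lines).most_common(n)


top = top_products(order_lines, 2)
for name, count in top:
    print(f"{name}: {count}")
```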

4. How many products are there in the cart usually?

The average number of products per cart is between 4 and 6. Combining this with the analysis of the number of orders per user, I would focus on elements that could increase the average number of items placed in a cart.
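
Cart size can be measured by counting the order lines that belong to each order and then summarizing those counts. The sketch below uses made-up order lines and reports both the mean and the median, since the two can differ noticeably when a few carts are very large.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical (order_id, product_id) pairs, one per item in a cart.
order_lines = [
    (10, 1), (10, 2), (10, 3),
    (11, 1),
    (12, 1), (12, 4), (12, 5), (12, 6), (12, 7),
]

# Number of items in each cart, keyed by order ID.
cart_sizes = Counter(order_id for order_id, _ in order_lines)
sizes = sorted(cart_sizes.values())

mean_size = mean(sizes)
median_size = median(sizes)

print("cart sizes:", sizes)
print("mean:", mean_size, "median:", median_size)
```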

5. Conclusion and Future Recommendations

The original goal of the Instacart Kaggle competition was to develop an efficient recommender system for Instacart's customers and potentially increase the number of reorders and the number of items placed in a cart. I consider these two numbers to have the most impact on a business's revenue, and increasing them would likely play a role in increasing profit. It would be interesting to compare the results of an implemented data science solution against the existing dataset and to see whether improvements can be made in areas other than item recommendation.

Licenses

Database Contents License (DbCL) v1.0

Contact

Find me on LinkedIn, Twitter or adzictanja.com.
