# Instacart Market Basket Analysis

This notebook analyzes shopping patterns in the Instacart dataset and predicts product reorders using Random Forest and XGBoost classifiers.

The dataset was downloaded from [Instacart Market Basket Analysis - Kaggle](https://www.kaggle.com/datasets/psparks/instacart-market-basket-analysis).



## Import Required Libraries

### Install packages

In [7]:
# macOS only: XGBoost needs the OpenMP runtime (libomp); skip this line on other platforms
!brew install libomp
!pip install pandas numpy matplotlib seaborn
!pip install scikit-learn xgboost imbalanced-learn



In [8]:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE

## Data Loading

Define the file paths and load the required datasets. The CSV files are expected in the current working directory, which is normally the directory containing the notebook.

In [10]:
current_dir = os.getcwd()
files = {
    "aisles": os.path.join(current_dir, 'aisles.csv'),
    "departments": os.path.join(current_dir, 'departments.csv'),
    "orders": os.path.join(current_dir, 'orders.csv'),
    "order_products_prior": os.path.join(current_dir, 'order_products__prior.csv'),
    "order_products_train": os.path.join(current_dir, 'order_products__train.csv'),
    "products": os.path.join(current_dir, 'products.csv')
}

print("Loading datasets...")
aisles_df = pd.read_csv(files['aisles'])
departments_df = pd.read_csv(files['departments'])
orders_df = pd.read_csv(files['orders'])
order_products_prior_df = pd.read_csv(files['order_products_prior'])
order_products_train_df = pd.read_csv(files['order_products_train'])
products_df = pd.read_csv(files['products'])
print("Loaded.")

Loading datasets...
Loaded.
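The order-line files hold tens of millions of rows, so the default 64-bit integer columns can use a lot of memory. The following is a minimal sketch, on a three-row stand-in for `order_products__prior.csv`, of passing explicit dtypes to `pd.read_csv`. The `compact_dtypes` mapping is an assumption chosen to fit the value ranges shown in this notebook, not something the original cells use.

```python
import io
import pandas as pd

# Tiny stand-in for order_products__prior.csv (same column names as the real file)
csv_text = """order_id,product_id,add_to_cart_order,reordered
2,33120,1,1
2,28985,2,1
2,9327,3,0
"""

# Assumed dtypes: narrow integer types that still fit the IDs and flags in this dataset
compact_dtypes = {
    "order_id": "int32",
    "product_id": "int32",
    "add_to_cart_order": "int16",
    "reordered": "int8",
}

default_df = pd.read_csv(io.StringIO(csv_text))
compact_df = pd.read_csv(io.StringIO(csv_text), dtype=compact_dtypes)

# Compare the memory used by the column data (index excluded)
default_bytes = default_df.memory_usage(index=False).sum()
compact_bytes = compact_df.memory_usage(index=False).sum()
print(f"default: {default_bytes} bytes, compact: {compact_bytes} bytes")
```

The same `dtype=` argument could be passed to the real `pd.read_csv(files['order_products_prior'])` call if memory becomes a constraint.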


In [21]:
aisles_df.shape

(134, 2)

In [22]:
aisles_df.head()

|   | aisle_id | aisle |
|---|---|---|
| 0 | 1 | prepared soups salads |
| 1 | 2 | specialty cheeses |
| 2 | 3 | energy granola bars |
| 3 | 4 | instant foods |
| 4 | 5 | marinades meat preparation |


In [23]:
departments_df.shape

(21, 2)

In [24]:
departments_df.head()

|   | department_id | department |
|---|---|---|
| 0 | 1 | frozen |
| 1 | 2 | other |
| 2 | 3 | bakery |
| 3 | 4 | produce |
| 4 | 5 | alcohol |


In [25]:
orders_df.shape

(3421083, 7)

In [13]:
orders_df.head()

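|   | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |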


In [26]:
order_products_prior_df.shape

(32434489, 4)

In [14]:
order_products_prior_df.head()

|   | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |


In [27]:
order_products_train_df.shape

(1384617, 4)

In [15]:
order_products_train_df.head()

|   | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 1 | 49302 | 1 | 1 |
| 1 | 1 | 11109 | 2 | 1 |
| 2 | 1 | 10246 | 3 | 0 |
| 3 | 1 | 49683 | 4 | 0 |
| 4 | 1 | 43633 | 5 | 1 |


In [28]:
products_df.shape

(49688, 4)

In [29]:
products_df.head()

|   | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 |


## Data Preprocessing

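Merge the aisle and department lookups into the product table, combine the prior and train order lines, and join them with the orders and products. Missing `days_since_prior_order` values, which occur on a user's first order, are filled with 0.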

In [None]:
print("Merging datasets...")
products_df = pd.merge(products_df, aisles_df, on='aisle_id')
products_df = pd.merge(products_df, departments_df, on='department_id')
order_products_all_df = pd.concat([order_products_prior_df, order_products_train_df])
merged_df = pd.merge(order_products_all_df, orders_df, on='order_id', how='inner')
merged_df = pd.merge(merged_df, products_df, on='product_id', how='inner')

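# Handle missing values (assign the result back; in-place fillna on a selected column is deprecated in recent pandas)
merged_df['days_since_prior_order'] = merged_df['days_since_prior_order'].fillna(0)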
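Inner joins silently drop rows without a match, and a duplicated key on the lookup side silently multiplies rows. As a hedged illustration on toy frames (the `orders` and `order_lines` data below are made up), `pd.merge` can check both conditions through its `validate` and `indicator` parameters:

```python
import pandas as pd

# Toy data: order 4 has order lines but no entry in the orders table
orders = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [10, 10, 11]})
order_lines = pd.DataFrame({"order_id": [1, 1, 2, 4], "product_id": [100, 101, 100, 102]})

# validate="many_to_one" raises pandas.errors.MergeError if order_id repeats in `orders`;
# indicator=True adds a `_merge` column showing where each row was found
checked = pd.merge(
    order_lines, orders, on="order_id", how="left",
    validate="many_to_one", indicator=True,
)

unmatched = checked[checked["_merge"] == "left_only"]
print(f"{len(unmatched)} order line(s) have no matching order")
```

Running a check like this once before the real inner joins makes it visible how many rows, if any, would be lost.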

## Feature Engineering

Create new features to improve model performance:
- Average basket size per user
- Purchase frequency per user
- Product reorder rate

In [None]:
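print("Feature Engineering...")
# Average basket size = items bought by the user / number of distinct orders by that user
items_per_user = merged_df.groupby('user_id')['product_id'].transform('count')
orders_per_user = merged_df.groupby('user_id')['order_id'].transform('nunique')
merged_df['average_basket_size'] = items_per_user / orders_per_user
# Purchase frequency = highest order number seen for the user (how many orders they placed)
merged_df['purchase_frequency'] = merged_df.groupby('user_id')['order_number'].transform('max')
# Share of a product's purchases that were reorders (note: derived from the target column itself)
merged_df['product_reorder_rate'] = merged_df.groupby('product_id')['reordered'].transform('mean')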

# Sample 1% of the dataset for efficiency
print("Sampling 1% of the dataset for efficiency...")
sampled_df = merged_df.sample(frac=0.01, random_state=42)
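Because `product_reorder_rate` is an average of the `reordered` target, computing it over every row lets each row's own label feed into its feature. One possible way to avoid that, sketched here on made-up rows, is to compute the rate from `eval_set == 'prior'` rows only and map it onto the rows used for modeling. This is an assumption about how the feature could be built, not what the cell above does.

```python
import pandas as pd

# Toy order lines: four historical ("prior") rows and two rows to be predicted ("train")
toy = pd.DataFrame({
    "eval_set":   ["prior", "prior", "prior", "prior", "train", "train"],
    "product_id": [1, 1, 2, 2, 1, 2],
    "reordered":  [1, 0, 1, 1, 1, 0],
})

# Reorder rate per product, using historical rows only
history = toy[toy["eval_set"] == "prior"]
rate_from_history = history.groupby("product_id")["reordered"].mean()

# Attach the historical rate to the modeling rows; their own labels are never used
train_rows = toy[toy["eval_set"] == "train"].copy()
train_rows["product_reorder_rate"] = train_rows["product_id"].map(rate_from_history)
print(train_rows)
```

Products that never appear in the historical rows would get `NaN` here and would need a fallback value.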

## Prepare Data for Modeling

Select the features and target variable, split the data, and then apply SMOTE to the training set only so that the test set contains no synthetic rows.

In [None]:
features = [
    'order_hour_of_day', 'order_dow', 'days_since_prior_order', 'aisle_id',
    'department_id', 'average_basket_size', 'purchase_frequency', 'product_reorder_rate'
]
target = 'reordered'

X = sampled_df[features]
y = sampled_df[target]

# Split first, stratifying on the target so both sets keep the original class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Balance only the training data; resampling before the split would leak synthetic rows into the test set
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
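For intuition about what SMOTE does, the sketch below shows the core interpolation idea in plain NumPy: pick a minority-class row, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. The function `smote_like_oversample` is a hypothetical, simplified illustration and is not the `imblearn` implementation.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    if n < 2:
        raise ValueError("need at least two minority rows to interpolate")

    # Pairwise Euclidean distances between minority rows
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour

    k = min(k, n - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]   # k nearest rows for every row

    base = rng.integers(0, n, size=n_new)           # randomly chosen starting rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                    # interpolation factor in [0, 1)
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Four minority points on the corners of the unit square
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(minority, n_new=6)
print(synthetic.round(2))
```

Every synthetic row lies between two real minority rows, which is why such rows should not end up in data used for evaluation.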

## Random Forest Model with Hyperparameter Tuning

In [None]:
print("Tuning Random Forest Classifier...")
rf = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf_random = RandomizedSearchCV(estimator=rf, param_distributions=param_grid, n_iter=10, cv=3, verbose=2, random_state=42)
rf_random.fit(X_train, y_train)

# Predictions and evaluation
y_pred_rf = rf_random.best_estimator_.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)
classification_rep_rf = classification_report(y_test, y_pred_rf)

print("\nRandom Forest Classifier Performance:")
print(f"Accuracy: {accuracy_rf:.4f}")
print("Confusion Matrix:\n", conf_matrix_rf)
print("\nClassification Report:\n", classification_rep_rf)
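To help read the confusion matrix and classification report printed above, here is a small worked example on made-up labels showing how precision and recall for the positive class follow from the four confusion-matrix counts. The `y_true` and `y_pred` arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = reordered, 0 = not reordered
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted reorders, how many were real
recall = tp / (tp + fn)      # of the real reorders, how many were found

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")

# The hand-computed values agree with scikit-learn's helpers
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
```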

## XGBoost Model

In [None]:
print("Training XGBoost Classifier...")
xgb_model = XGBClassifier(n_estimators=200, max_depth=10, learning_rate=0.1, random_state=42)
xgb_model.fit(X_train, y_train)

# Predictions and evaluation
y_pred_xgb = xgb_model.predict(X_test)
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
conf_matrix_xgb = confusion_matrix(y_test, y_pred_xgb)
classification_rep_xgb = classification_report(y_test, y_pred_xgb)

print("\nXGBoost Classifier Performance:")
print(f"Accuracy: {accuracy_xgb:.4f}")
print("Confusion Matrix:\n", conf_matrix_xgb)
print("\nClassification Report:\n", classification_rep_xgb)
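Accuracy depends on the default 0.5 decision threshold. Classifiers with a scikit-learn interface, including `XGBClassifier`, also expose `predict_proba`, and the resulting scores can be summarized with a threshold-free metric such as ROC AUC. The sketch below uses made-up scores in place of real model output:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ROC AUC: fraction of (positive, negative) pairs that the scores rank correctly
auc = roc_auc_score(y_true, y_score)

# Accuracy requires choosing a threshold first
accuracy_at_half = accuracy_score(y_true, (y_score >= 0.5).astype(int))

print(f"ROC AUC: {auc:.2f}")
print(f"Accuracy at threshold 0.5: {accuracy_at_half:.2f}")
```

With a fitted model, the scores would come from something like `xgb_model.predict_proba(X_test)[:, 1]`.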

## Model Comparison and Feature Importance Analysis

In [None]:
print("Model Comparison:")
print(f"Random Forest Accuracy: {accuracy_rf:.4f}")
print(f"XGBoost Accuracy: {accuracy_xgb:.4f}")

# Feature Importance Plot
feature_importances = xgb_model.feature_importances_
feature_names = features

plt.figure(figsize=(10, 5))
sns.barplot(x=feature_importances, y=feature_names, palette="Blues_r")
plt.title("Feature Importance in Reorder Prediction (XGBoost)")
plt.xlabel("Importance Score")
plt.ylabel("Feature")
plt.tight_layout()
plt.savefig(os.path.join(current_dir, 'feature_importance_xgb.png'))
plt.show()
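The `feature_importances_` values plotted above are computed inside the model during training. A complementary check is permutation importance, which measures how much a held-out score drops when one feature's values are shuffled. The following sketch uses scikit-learn's `permutation_importance` with a small Random Forest on synthetic data as a stand-in, so the feature names and numbers are not from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5 features, only 2 of which carry signal
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the held-out set and record the mean drop in accuracy
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)

for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f}")
```

The same call should work with the fitted `xgb_model`, `X_test`, and `y_test` from this notebook, since `XGBClassifier` follows the scikit-learn estimator interface.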

## Key Findings and Conclusions

- XGBoost achieves higher accuracy than the tuned Random Forest on the 20% test split.
- The most important features include `order_hour_of_day`, `purchase_frequency`, and `product_reorder_rate`.
- Future work could evaluate deep learning models to see whether they improve accuracy further.

In [None]:
# Save Model Performance Metrics to File
performance_output = os.path.join(current_dir, 'model_performance.txt')
with open(performance_output, 'w') as f:
    f.write(f"Random Forest Accuracy: {accuracy_rf:.4f}\n")
    f.write(f"XGBoost Accuracy: {accuracy_xgb:.4f}\n")

print(f"\nModel performance metrics saved to {performance_output}")

In [None]:

# 🧠 OpenAI + FAISS Setup
!pip install openai faiss-cpu --quiet
import openai
import faiss
import numpy as np
openai.api_key = "sk-..."  # <-- Replace with your OpenAI key


In [None]:

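from tqdm import tqdm

# openai.embeddings_utils was removed in openai>=1.0, so the helper is defined locally.
# The keyword is still named `engine` so the existing calls keep working.
def get_embedding(text, engine="text-embedding-ada-002"):
    response = openai.embeddings.create(input=[text.replace("\n", " ")], model=engine)
    return response.data[0].embedding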

# Use a subset if needed to avoid token cost
products_sampled = products_df[['product_id', 'product_name']].drop_duplicates().copy()
products_sampled['embedding'] = [
    get_embedding(name, engine="text-embedding-ada-002") for name in tqdm(products_sampled['product_name'])
]


In [None]:

# Build FAISS index
embedding_matrix = np.array(products_sampled['embedding'].tolist()).astype('float32')
faiss_index = faiss.IndexFlatL2(embedding_matrix.shape[1])
faiss_index.add(embedding_matrix)
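`IndexFlatL2` performs an exact, brute-force search and returns squared Euclidean distances. To show what that means, here is a NumPy-only sketch of the same computation on four toy 2-D vectors. `exact_l2_search` is a hypothetical helper written for illustration and is not part of FAISS.

```python
import numpy as np

def exact_l2_search(index_vectors, query_vectors, top_k):
    """Return (squared distances, indices) of the top_k nearest index vectors per query."""
    # Squared L2 distance between every query and every indexed vector
    diffs = query_vectors[:, None, :] - index_vectors[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Indices of the top_k smallest distances, nearest first
    order = np.argsort(sq_dists, axis=1)[:, :top_k]
    return np.take_along_axis(sq_dists, order, axis=1), order

# Toy "embeddings" and a single query vector
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype="float32")
query = np.array([[0.9, 0.1]], dtype="float32")

D, I = exact_l2_search(vectors, query, top_k=2)
print("nearest indices:", I[0])
print("squared distances:", D[0])
```

The returned `D` and `I` have the same meaning as the pair returned by `faiss_index.search`. FAISS does this far more efficiently on large embedding matrices.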


In [None]:

# Semantic recommendation using FAISS
def recommend_similar_products(product_name, top_k=5):
    query_vec = np.array([get_embedding(product_name, engine="text-embedding-ada-002")]).astype('float32')
    D, I = faiss_index.search(query_vec, top_k)
    return products_sampled.iloc[I[0]][['product_id', 'product_name']]


In [None]:

# Save embeddings for reuse
products_sampled.to_pickle("products_with_embeddings.pkl")
