# Project: Product Recommendation Model

Goal: Predict the probability a user will purchase a product → rank products for personalized recommendations.
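
To make the "predict, then rank" step concrete, here is a minimal sketch of turning per-pair purchase probabilities into a top-N list per user. The column names `user_id`, `product_id`, and `score`, and the helper `top_n_per_user`, are hypothetical and chosen only for illustration; the scores are made-up toy values, not model output.

```python
import pandas as pd

def top_n_per_user(scores: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Return the n highest-scoring products for each user (hypothetical helper)."""
    # Sort so that, within each user, the highest score comes first
    ranked = scores.sort_values(["user_id", "score"], ascending=[True, False])
    # head(n) on the grouped frame keeps the first n rows of every user
    return ranked.groupby("user_id", sort=False).head(n).reset_index(drop=True)

# Toy predicted purchase probabilities (illustrative values only)
scores = pd.DataFrame({
    "user_id":    ["u1", "u1", "u1", "u2", "u2"],
    "product_id": ["p1", "p2", "p3", "p1", "p3"],
    "score":      [0.10, 0.80, 0.45, 0.30, 0.70],
})

top2 = top_n_per_user(scores, n=2)
print(top2)
```

In practice the `score` column would come from the trained model's predicted probability for each candidate user-product pair.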

# Dataset

Dataset: Olist Brazilian E-Commerce (Kaggle).
Period: 2016–2018.
Rows: 99K orders, 113K items, 33K products, 96K users.
Merged tables: orders, items, products, customers, reviews, category translations.
Engineered features (11): price, product rating/reviews, user spend/rating, recency, clicked (1 if purchased), category (one-hot).
Target: purchased (1 = bought).
Negatives: 3× random non-purchased pairs.
Final size: ~450K rows, ~25% positive.
Real transactional data with RFM (recency, frequency, monetary) signals; well suited to purchase prediction and top-N recommendations.

The dataset is available here: https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce
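
The "3× random non-purchased pairs" step can be sketched as rejection sampling: draw random (user, product) pairs and keep only those that never appear among the purchases. This is one possible approach under stated assumptions, not necessarily the sampling code used in this notebook. The function `sample_negatives` and the columns `user_id`, `product_id`, and `purchased` are hypothetical names, and the data is a toy example.

```python
import numpy as np
import pandas as pd

def sample_negatives(positives: pd.DataFrame, ratio: int = 3, seed: int = 42) -> pd.DataFrame:
    """Label observed pairs 1 and add `ratio` times as many unseen pairs labelled 0."""
    rng = np.random.default_rng(seed)
    pos = positives[["user_id", "product_id"]].drop_duplicates().assign(purchased=1)
    users = pos["user_id"].unique()
    items = pos["product_id"].unique()
    seen = set(zip(pos["user_id"], pos["product_id"]))

    target = ratio * len(pos)
    if target > len(users) * len(items) - len(seen):
        raise ValueError("Not enough unseen user-product pairs for this ratio")

    negatives = set()
    while len(negatives) < target:
        # Draw a batch of random pairs and reject any that were actually purchased
        batch = zip(rng.choice(users, size=target), rng.choice(items, size=target))
        for pair in batch:
            if pair not in seen:
                negatives.add(pair)
                if len(negatives) == target:
                    break

    neg = pd.DataFrame(sorted(negatives), columns=["user_id", "product_id"])
    neg["purchased"] = 0
    return pd.concat([pos, neg], ignore_index=True)

# Toy purchases: five users, each buying a different product
positives = pd.DataFrame({
    "user_id":    ["u1", "u2", "u3", "u4", "u5"],
    "product_id": ["p1", "p2", "p3", "p4", "p5"],
})

dataset = sample_negatives(positives, ratio=3)
print(dataset["purchased"].value_counts())
```

With three negatives per positive, one row in four is a purchase, which matches the ~25% positive rate quoted above.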

In [6]:
!pip install kaggle



In [7]:
import os

# List the working directory to confirm the archive and API token are present
print(os.listdir())


['.ipynb_checkpoints', 'brazilian-ecommerce.zip', 'kaggle.json', 'recommendation.ipynb']

In [8]:
import shutil

# Create .kaggle directory
os.makedirs(os.path.expanduser("~/.kaggle"), exist_ok=True)

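# Copy kaggle.json to ~/.kaggle (the original file stays in the working directory)
shutil.copy("kaggle.json", os.path.expanduser("~/.kaggle/kaggle.json"))

# Restrict the token to owner read/write so other users cannot read the API key
os.chmod(os.path.expanduser("~/.kaggle/kaggle.json"), 0o600)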

In [16]:
!kaggle datasets download -d olistbr/brazilian-ecommerce

Dataset URL: https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce
License(s): CC-BY-NC-SA-4.0
brazilian-ecommerce.zip: Skipping, found more recently modified local copy (use --force to force download)


In [17]:
import zipfile
import os

# Extract ZIP contents into 'olist_data' folder
with zipfile.ZipFile("brazilian-ecommerce.zip", "r") as zip_ref:
    zip_ref.extractall("olist_data")

# Confirm extracted files
print("Extracted files:", os.listdir("olist_data"))

Extracted files: ['olist_customers_dataset.csv', 'olist_geolocation_dataset.csv', 'olist_orders_dataset.csv', 'olist_order_items_dataset.csv', 'olist_order_payments_dataset.csv', 'olist_order_reviews_dataset.csv', 'olist_products_dataset.csv', 'olist_sellers_dataset.csv', 'product_category_name_translation.csv']


# Load & Merge the Data

In [23]:
import pandas as pd
import numpy as np

# Load core tables from the extracted folder
orders        = pd.read_csv('olist_data/olist_orders_dataset.csv')
order_items   = pd.read_csv('olist_data/olist_order_items_dataset.csv')
products      = pd.read_csv('olist_data/olist_products_dataset.csv')
customers     = pd.read_csv('olist_data/olist_customers_dataset.csv')
reviews       = pd.read_csv('olist_data/olist_order_reviews_dataset.csv')
category_name = pd.read_csv('olist_data/product_category_name_translation.csv')

In [24]:
# Merge step-by-step
df = (order_items
      .merge(orders, on='order_id')
      .merge(customers, on='customer_id')
      .merge(products, on='product_id')
      .merge(reviews, on='order_id', how='left')
      .merge(category_name, on='product_category_name', how='left'))

# Display shape and preview
print(f"Dataset shape: {df.shape}")
print(df.head())

Dataset shape: (113314, 33)
                           order_id  order_item_id  \
0  00010242fe8c5a6d1ba2dd792cb16214              1   
1  00018f77f2f0320c557190d7a144bdd3              1   
2  000229ec398224ef6ca0657da4fc703e              1   
3  00024acbcdf0a6daa1e931b038114c75              1   
4  00042b26cf59d7ce69dfabb4e55b4fd9              1   

                         product_id                         seller_id  \
0  4244733e06e7ecb4970a6e2683c13e61  48436dade18ac8b2bce089ec2a041202   
1  e5f2d52b802189ee658865ca93d83a8f  dd7ddc04e1b6c2c614352b383efe2d36   
2  c777355d18b72b67abbeef9df44fd0fd  5b51032eddd242adc84c38acab88f23d   
3  7634da152a4610f1595efa32f14722fc  9d7a1d34a5052409006425275ba1c2b4   
4  ac6c3623068f30de03045865e4e10089  df560393f3a51e74553ab94004ba5c87   

   shipping_limit_date   price  freight_value  \
0  2017-09-19 09:45:35   58.90          13.29   
1  2017-05-03 11:05:13  239.90          19.93   
2  2018-01-18 14:48:30  199.00          17.87   
3  2018-08-1
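
A left join adds rows whenever the right-hand table has more than one row per key. If some orders in `olist_order_reviews_dataset.csv` carry several reviews (an assumption worth checking, not something the output above confirms), the `reviews` merge would duplicate order items. The sketch below reproduces the effect on toy data and shows one way to guard against it: aggregate reviews to one row per order, then merge with `validate="many_to_one"` so pandas raises an error if the key is not unique. Averaging the review score is an illustrative choice, not the notebook's stated method.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are illustrative only)
order_items = pd.DataFrame({
    "order_id":   ["o1", "o1", "o2"],
    "product_id": ["p1", "p2", "p3"],
    "price":      [10.0, 20.0, 30.0],
})
reviews = pd.DataFrame({
    "review_id":    ["r1", "r2", "r3"],
    "order_id":     ["o1", "o2", "o2"],  # order o2 has two reviews
    "review_score": [5, 4, 2],
})

# Naive left join: the item in o2 is repeated once per review
naive = order_items.merge(reviews, on="order_id", how="left")
print(len(order_items), "->", len(naive))

# Collapse reviews to one row per order before joining
reviews_per_order = (reviews
                     .groupby("order_id", as_index=False)
                     .agg(review_score=("review_score", "mean")))

# validate raises pandas.errors.MergeError if order_id is not unique on the right
safe = order_items.merge(reviews_per_order, on="order_id",
                         how="left", validate="many_to_one")
print(len(order_items), "->", len(safe))
```

On the real tables, comparing `len(order_items)` with `len(df)` after the merge is a quick way to see whether any rows were duplicated.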

# Feature Engineering