All the previous steps done in the EDA:

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
customers_df = pd.read_csv("/content/drive/MyDrive/Zeotap_Assignment_Ansh/Customers.csv")
products_df = pd.read_csv("/content/drive/MyDrive/Zeotap_Assignment_Ansh/Products.csv")
transactions_df = pd.read_csv("/content/drive/MyDrive/Zeotap_Assignment_Ansh/Transactions.csv")

In [4]:
customers_df['SignupDate'] = pd.to_datetime(customers_df['SignupDate'])
transactions_df['TransactionDate'] = pd.to_datetime(transactions_df['TransactionDate'])

In [5]:
merged_data = pd.merge(transactions_df, customers_df, on='CustomerID')
merged_data = pd.merge(merged_data, products_df, on='ProductID')

In [9]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics.pairwise import cosine_similarity

**1. Aggregating Transaction Data to Customer Level:**

I group the merged_data by CustomerID and aggregate the transaction-level data to create customer-level features:

TotalSpending: Sum of all transaction values for each customer.

TotalTransactions: Count of transactions each customer has made.

AvgSpendingPerTransaction: Average spending per transaction.

PreferredCategory: The most frequent product category purchased by each customer (based on transaction data).
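To make the aggregation concrete, here is a minimal sketch on a made-up four-row table. The toy data and the names toy and toy_features are hypothetical; only the column names mirror merged_data. It uses mode().iloc[0] for the preferred category, which breaks ties alphabetically. That is an assumption of this sketch, not necessarily how value_counts().idxmax() resolves ties.

```python
import pandas as pd

# Hypothetical toy transactions; column names mirror merged_data.
toy = pd.DataFrame({
    "CustomerID": ["C0001", "C0001", "C0001", "C0002"],
    "TransactionID": ["T1", "T2", "T3", "T4"],
    "TotalValue": [100.0, 50.0, 30.0, 80.0],
    "Category": ["Books", "Books", "Electronics", "Clothing"],
})

# One row per customer, built with named aggregation.
toy_features = toy.groupby("CustomerID").agg(
    TotalSpending=("TotalValue", "sum"),
    TotalTransactions=("TransactionID", "count"),
    AvgSpendingPerTransaction=("TotalValue", "mean"),
    # Most frequent category; mode() sorts ties alphabetically.
    PreferredCategory=("Category", lambda s: s.mode().iloc[0]),
).reset_index()

print(toy_features)
```

Customer C0001 ends up with a total of 180.0 over 3 transactions (average 60.0) and Books as the preferred category.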


**2. Encoding Categorical Features:**

I use LabelEncoder to convert PreferredCategory and Region (both categorical) into numeric values, making them usable in machine learning models. Region is encoded after the merge in step 3, since it comes from customers_df.
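A small sketch of what the encoding produces. The region names below are placeholders and may not match the values in Customers.csv. LabelEncoder assigns integers in sorted order of the labels, so the resulting numbers carry an ordering that has no real meaning. One-hot encoding with pd.get_dummies is shown only as a possible alternative, not as what this notebook does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Placeholder region labels (assumed for illustration).
regions = pd.Series(
    ["Asia", "Europe", "North America", "South America"], name="Region"
)

# Label encoding: one integer per category, assigned in sorted label order.
codes = LabelEncoder().fit_transform(regions)
mapping = {r: int(c) for r, c in zip(regions, codes)}
print(mapping)
# {'Asia': 0, 'Europe': 1, 'North America': 2, 'South America': 3}

# Alternative (not used in this notebook): one indicator column per category,
# which avoids implying that region 3 is "further" from region 0 than region 1.
one_hot = pd.get_dummies(regions, prefix="Region", dtype=float)
print(one_hot)
```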

**3. Merging Demographic Information:**

I merge customer_features (aggregated transactional features) with customers_df (demographic information like Region) on CustomerID to create a comprehensive customer profile.

**4. Normalizing Numerical Features:**

I normalize numerical features (TotalSpending, TotalTransactions, and AvgSpendingPerTransaction) using StandardScaler to ensure they are on a similar scale for distance/similarity calculations.
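As a quick illustration of what StandardScaler does, the sketch below scales a made-up 3x2 array (the values are arbitrary). Each column is shifted to mean 0 and divided by its population standard deviation, so columns with very different ranges end up comparable.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales.
X = np.array([
    [100.0, 1.0],
    [200.0, 2.0],
    [300.0, 3.0],
])

X_scaled = StandardScaler().fit_transform(X)

# After scaling, each column has mean ~0 and standard deviation ~1.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```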

**5. Building the Similarity Matrix:**

I build a cosine similarity matrix from the features Region, TotalSpending, TotalTransactions, AvgSpendingPerTransaction, and PreferredCategory. Entry (i, j) of the matrix measures how similar customer i is to customer j based on these attributes.
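A toy example of how cosine similarity behaves, using four made-up 2-D vectors (the array profiles is hypothetical). Cosine similarity compares direction only: vectors pointing the same way score 1 regardless of length, orthogonal vectors score 0, and opposite vectors score -1.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical feature vectors, one row per "customer".
profiles = np.array([
    [1.0, 0.0],   # reference vector
    [2.0, 0.0],   # same direction, twice the length
    [0.0, 3.0],   # orthogonal to the reference
    [-1.0, 0.0],  # opposite direction
])

sim = cosine_similarity(profiles)

print(sim[0, 1])  # same direction: close to 1
print(sim[0, 2])  # orthogonal: close to 0
print(sim[0, 3])  # opposite: close to -1
```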

**6. Recommendation Logic:**

For each of the first 20 customers (C0001 to C0020):

I calculate their similarity scores with every other customer.

I exclude the self-comparison (i.e., the target customer comparing themselves).

I select the top 3 most similar customers based on the highest similarity scores.
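The three steps above can be sketched with NumPy on a small hand-written similarity matrix. The helper top_k_similar and the matrix values are hypothetical and only illustrate the selection logic. Masking the target's own score with -inf is one way to exclude the self-comparison.

```python
import numpy as np

def top_k_similar(sim_matrix, ids, target_idx, k=3):
    """Return the k most similar (id, score) pairs for one row, excluding itself."""
    scores = sim_matrix[target_idx].astype(float).copy()
    scores[target_idx] = -np.inf           # exclude the self-comparison
    top = np.argsort(scores)[::-1][:k]     # indices of the k highest scores
    return [(ids[j], float(scores[j])) for j in top]

# Hypothetical customers and similarity scores.
ids = ["C0001", "C0002", "C0003", "C0004"]
sim = np.array([
    [1.0, 0.9, 0.2, 0.5],
    [0.9, 1.0, 0.1, 0.4],
    [0.2, 0.1, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])

print(top_k_similar(sim, ids, target_idx=0, k=2))
# [('C0002', 0.9), ('C0004', 0.5)]
```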

**7. Saving Results:**

I create a Lookalike.csv file with one row per target customer. The Lookalikes column stores that customer's top 3 similar customers and their similarity scores as a stringified list of (CustomerID, score) pairs.
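If a flat file is easier to work with downstream, one possible alternative is a long layout with one row per (target, lookalike) pair. The sketch below uses made-up results and hypothetical column names (Rank, LookalikeID, Score), and writes to an in-memory buffer instead of a real file. It is not the layout this notebook saves.

```python
import io
import pandas as pd

# Hypothetical results in the same shape as lookalike_results.
results = {
    "C0001": [("C0002", 0.9), ("C0004", 0.5)],
    "C0002": [("C0001", 0.9), ("C0004", 0.4)],
}

# One row per (target customer, lookalike) pair, ranked from most similar.
rows = [
    {"CustomerID": target, "Rank": rank, "LookalikeID": other, "Score": score}
    for target, matches in results.items()
    for rank, (other, score) in enumerate(matches, start=1)
]
long_df = pd.DataFrame(rows)

# Write to an in-memory buffer and read it back to check the round trip.
buffer = io.StringIO()
long_df.to_csv(buffer, index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip)
```

Unlike a stringified list, this layout can be read back without parsing Python literals.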

In [19]:

# Aggregate transaction data to customer level
customer_features = merged_data.groupby('CustomerID').agg(
    TotalSpending=('TotalValue', 'sum'),
    TotalTransactions=('TransactionID', 'count'),
    AvgSpendingPerTransaction=('TotalValue', 'mean'),
    PreferredCategory=('Category', lambda x: x.value_counts().idxmax())
).reset_index()

# Encode categorical features
le = LabelEncoder()
customer_features['PreferredCategory'] = le.fit_transform(customer_features['PreferredCategory'])

# Add demographic information from Customers.csv
customer_profiles = pd.merge(customers_df, customer_features, on='CustomerID')

# Normalize numerical features
scaler = StandardScaler()
numerical_cols = ['TotalSpending', 'TotalTransactions', 'AvgSpendingPerTransaction']
customer_profiles[numerical_cols] = scaler.fit_transform(customer_profiles[numerical_cols])

# Encode Region (categorical) as integers
le_region = LabelEncoder()
customer_profiles['Region'] = le_region.fit_transform(customer_profiles['Region'])

# Build a similarity matrix
feature_cols = ['Region', 'TotalSpending', 'TotalTransactions', 'AvgSpendingPerTransaction', 'PreferredCategory']
similarity_matrix = cosine_similarity(customer_profiles[feature_cols])

# Recommendation Logic
lookalike_results = {}

# IDs of the first 20 customers (C0001 to C0020), built once outside the loop
target_ids = {f"C{n:04}" for n in range(1, 21)}

for i, target_customer in enumerate(customer_profiles['CustomerID']):
    # Skip if not in the first 20 customers
    if target_customer not in target_ids:
        continue

    # Get (row index, similarity score) pairs for the target customer
    similarity_scores = list(enumerate(similarity_matrix[i]))

    # Sort by similarity score (descending), exclude self-comparison, keep the top 3
    similar_customers = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    similar_customers = [(customer_profiles.iloc[j]['CustomerID'], score)
                         for j, score in similar_customers if j != i][:3]

    # Save to lookalike results
    lookalike_results[target_customer] = similar_customers

# Create Lookalike.csv
lookalike_df = pd.DataFrame({
    'CustomerID': list(lookalike_results.keys()),
    'Lookalikes': [str(lst) for lst in lookalike_results.values()]
})
lookalike_df.to_csv('Lookalike.csv', index=False)

print("Lookalike recommendations saved to Lookalike.csv!")


Lookalike recommendations saved to Lookalike.csv!


Below is the code to calculate precision and recall if ground truth data is available:

In [21]:
# # Assume ground_truth_dict is a dictionary where:
# # key = customer_id, value = set of true similar customers
# # e.g., ground_truth_dict = {'C0001': {'C0002', 'C0003'}, 'C0002': {'C0001', 'C0004'}, ...}

# def calculate_precision_recall(lookalike_results, ground_truth_dict, top_n=3):
#     precision_list = []
#     recall_list = []

#     for customer_id, recommended_customers in lookalike_results.items():
#         true_similars = ground_truth_dict.get(customer_id, set())

#         # Get top_n recommended customers
#         recommended_set = {rec[0] for rec in recommended_customers[:top_n]}

#         # Calculate precision
#         intersection = true_similars.intersection(recommended_set)
#         precision = len(intersection) / top_n if top_n > 0 else 0
#         precision_list.append(precision)

#         # Calculate recall
#         recall = len(intersection) / len(true_similars) if len(true_similars) > 0 else 0
#         recall_list.append(recall)

#     # Average precision and recall
#     avg_precision = sum(precision_list) / len(precision_list) if precision_list else 0
#     avg_recall = sum(recall_list) / len(recall_list) if recall_list else 0

#     return avg_precision, avg_recall

# # Example usage:
# avg_precision, avg_recall = calculate_precision_recall(lookalike_results, ground_truth_dict)
# print(f"Average Precision: {avg_precision}")
# print(f"Average Recall: {avg_recall}")
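Since no ground truth is available here, the per-customer calculation can only be illustrated on made-up data. The helper precision_recall_at_k and the IDs below are hypothetical. Precision divides the hits by k and recall divides them by the number of true similar customers, mirroring the commented-out function above.

```python
def precision_recall_at_k(recommended, relevant, k=3):
    """Precision and recall of the top-k recommended IDs against a set of true IDs."""
    top = set(recommended[:k])
    hits = len(top & relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical recommendations and ground truth for a single customer.
recommended = ["C0002", "C0004", "C0007"]
relevant = {"C0002", "C0003"}

p, r = precision_recall_at_k(recommended, relevant, k=3)
print(f"Precision@3: {p:.3f}")  # 1 hit out of 3 recommendations
print(f"Recall@3: {r:.3f}")     # 1 hit out of 2 true similar customers
```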
