# 1. Introduction

# Lookalike Model for Customer Recommendation
This notebook develops a lookalike model that recommends similar customers based on their transaction history.  
We load customer, product, and transaction data, derive per-customer purchase features, and use cosine similarity scores to identify the top 3 most similar customers.

## **Key Steps:**
1. Data Loading
2. Data Preprocessing
3. Feature Engineering
4. Similarity Calculation
5. Generating Recommendations
6. Output Results


# 2. Data Loading

In [5]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity

# Load datasets
customers = pd.read_csv('data/Customers.csv')
products = pd.read_csv('data/Products.csv')
transactions = pd.read_csv('data/Transactions.csv')

# Display data summaries
print("Customers Dataset:")
display(customers.head())
print("Products Dataset:")
display(products.head())
print("Transactions Dataset:")
display(transactions.head())


Customers Dataset:


| | CustomerID | CustomerName | Region | SignupDate |
|---|---|---|---|---|
| 0 | C0001 | Lawrence Carroll | South America | 2022-07-10 |
| 1 | C0002 | Elizabeth Lutz | Asia | 2022-02-13 |
| 2 | C0003 | Michael Rivera | South America | 2024-03-07 |
| 3 | C0004 | Kathleen Rodriguez | South America | 2022-10-09 |
| 4 | C0005 | Laura Weber | Asia | 2022-08-15 |


Products Dataset:


| | ProductID | ProductName | Category | Price |
|---|---|---|---|---|
| 0 | P001 | ActiveWear Biography | Books | 169.3 |
| 1 | P002 | ActiveWear Smartwatch | Electronics | 346.3 |
| 2 | P003 | ComfortLiving Biography | Books | 44.12 |
| 3 | P004 | BookWorld Rug | Home Decor | 95.69 |
| 4 | P005 | TechPro T-Shirt | Clothing | 429.31 |


Transactions Dataset:


| | TransactionID | CustomerID | ProductID | TransactionDate | Quantity | TotalValue | Price |
|---|---|---|---|---|---|---|---|
| 0 | T00001 | C0199 | P067 | 2024-08-25 12:38:23 | 1 | 300.68 | 300.68 |
| 1 | T00112 | C0146 | P067 | 2024-05-27 22:23:54 | 1 | 300.68 | 300.68 |
| 2 | T00166 | C0127 | P067 | 2024-04-25 07:38:55 | 1 | 300.68 | 300.68 |
| 3 | T00272 | C0087 | P067 | 2024-03-26 22:55:37 | 2 | 601.36 | 300.68 |
| 4 | T00363 | C0070 | P067 | 2024-03-21 15:10:10 | 3 | 902.04 | 300.68 |


## **Data Loading**
The datasets are loaded from the following files:
- `Customers.csv`: Contains customer details like CustomerID, name, region, and signup date.
- `Products.csv`: Includes product information like ProductID, category, and price.
- `Transactions.csv`: Tracks customer purchases, with details like TransactionID, ProductID, quantity, and total value.

### **Preview of the Data:**
Displayed above are the first five rows of each dataset.
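
As a minimal sketch of a slightly more defensive loading step, the snippet below parses the date column while reading and fails early if an expected column is absent. The helper `load_checked` and the in-memory `toy_csv` are hypothetical and not part of the notebook; the two rows are copied from the transactions preview above.

```python
import io

import pandas as pd

# Toy stand-in for data/Transactions.csv (two rows copied from the preview above)
toy_csv = io.StringIO(
    "TransactionID,CustomerID,ProductID,TransactionDate,Quantity,TotalValue,Price\n"
    "T00001,C0199,P067,2024-08-25 12:38:23,1,300.68,300.68\n"
    "T00272,C0087,P067,2024-03-26 22:55:37,2,601.36,300.68\n"
)


def load_checked(source, required_columns, date_columns=()):
    """Hypothetical helper: read a CSV, parse date columns, and check required columns."""
    frame = pd.read_csv(source, parse_dates=list(date_columns))
    missing = set(required_columns) - set(frame.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return frame


toy_transactions = load_checked(
    toy_csv,
    required_columns=["TransactionID", "CustomerID", "ProductID", "Quantity", "TotalValue"],
    date_columns=["TransactionDate"],
)
print(toy_transactions.dtypes)
```

Parsing dates at read time would make the later `pd.to_datetime` conversion unnecessary for that column, though converting afterwards, as the notebook does, works equally well.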

# 3. Data Preprocessing

In [6]:
# Convert dates to datetime format
customers['SignupDate'] = pd.to_datetime(customers['SignupDate'])
transactions['TransactionDate'] = pd.to_datetime(transactions['TransactionDate'])

# Merge datasets for easier analysis
merged_data = transactions.merge(customers, on='CustomerID').merge(products, on='ProductID')

# Clean data (remove duplicates, handle missing values if any)
merged_data = merged_data.drop_duplicates()
merged_data = merged_data.dropna()

# Preview merged data
print("Merged Dataset:")
display(merged_data.head())


Merged Dataset:


| | TransactionID | CustomerID | ProductID | TransactionDate | Quantity | TotalValue | Price_x | CustomerName | Region | SignupDate | ProductName | Category | Price_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T00001 | C0199 | P067 | 2024-08-25 12:38:23 | 1 | 300.68 | 300.68 | Andrea Jenkins | Europe | 2022-12-03 | ComfortLiving Bluetooth Speaker | Electronics | 300.68 |
| 1 | T00112 | C0146 | P067 | 2024-05-27 22:23:54 | 1 | 300.68 | 300.68 | Brittany Harvey | Asia | 2024-09-04 | ComfortLiving Bluetooth Speaker | Electronics | 300.68 |
| 2 | T00166 | C0127 | P067 | 2024-04-25 07:38:55 | 1 | 300.68 | 300.68 | Kathryn Stevens | Europe | 2024-04-04 | ComfortLiving Bluetooth Speaker | Electronics | 300.68 |
| 3 | T00272 | C0087 | P067 | 2024-03-26 22:55:37 | 2 | 601.36 | 300.68 | Travis Campbell | South America | 2024-04-11 | ComfortLiving Bluetooth Speaker | Electronics | 300.68 |
| 4 | T00363 | C0070 | P067 | 2024-03-21 15:10:10 | 3 | 902.04 | 300.68 | Timothy Perez | Europe | 2022-03-15 | ComfortLiving Bluetooth Speaker | Electronics | 300.68 |


## **Data Preprocessing**
- Converted date columns (`SignupDate`, `TransactionDate`) to `datetime` format.
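- Merged the three datasets on `CustomerID` and `ProductID` to create a unified view for analysis. Because both `Transactions.csv` and `Products.csv` contain a `Price` column, the merged frame carries `Price_x` and `Price_y`.
- Removed duplicate rows and dropped rows with missing values.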
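
The merge step can be illustrated on toy frames. The sketch below shows where suffixed price columns come from and one possible way to collapse them when they agree. The frames, the suffix names `_transaction` and `_product`, and the decision to drop one column are illustrative assumptions, not what the notebook does.

```python
import pandas as pd

# Toy frames with the same shared non-key column (`Price`) as the real datasets
toy_transactions = pd.DataFrame({
    "TransactionID": ["T00001", "T00272"],
    "CustomerID": ["C0199", "C0087"],
    "ProductID": ["P067", "P067"],
    "Quantity": [1, 2],
    "TotalValue": [300.68, 601.36],
    "Price": [300.68, 300.68],
})
toy_products = pd.DataFrame({
    "ProductID": ["P067"],
    "ProductName": ["ComfortLiving Bluetooth Speaker"],
    "Category": ["Electronics"],
    "Price": [300.68],
})

# Default behaviour: the overlapping column is suffixed as Price_x / Price_y
merged_default = toy_transactions.merge(toy_products, on="ProductID")

# Explicit suffixes make the origin of each price column readable
merged_named = toy_transactions.merge(
    toy_products, on="ProductID", suffixes=("_transaction", "_product")
)

# If both price columns agree on every row, one of them is redundant
prices_agree = bool(
    (merged_named["Price_transaction"] == merged_named["Price_product"]).all()
)
if prices_agree:
    merged_named = merged_named.drop(columns="Price_product").rename(
        columns={"Price_transaction": "Price"}
    )

print(merged_default.columns.tolist())
print(merged_named.columns.tolist())
```

Neither price column feeds into the features built in the next section, so the duplicated column is harmless there; collapsing it mainly keeps the merged view easier to read.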


# 4. Feature Engineering

In [7]:
# Aggregate transaction data for each customer
customer_features = merged_data.groupby('CustomerID').agg({
    'TotalValue': 'sum',
    'Quantity': 'sum',
    'ProductID': 'nunique',  # Number of unique products purchased
    'TransactionID': 'count'  # Number of transactions
}).rename(columns={
    'TotalValue': 'TotalSpent',
    'Quantity': 'TotalQuantity',
    'ProductID': 'UniqueProducts',
    'TransactionID': 'TransactionCount'
}).reset_index()

# Normalize the features for similarity calculations
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(customer_features.iloc[:, 1:])
customer_features_normalized = pd.DataFrame(scaled_features, columns=customer_features.columns[1:])
customer_features_normalized['CustomerID'] = customer_features['CustomerID']

# Preview normalized features
print("Customer Features (Normalized):")
display(customer_features_normalized.head())


Customer Features (Normalized):


| | TotalSpent | TotalQuantity | UniqueProducts | TransactionCount | CustomerID |
|---|---|---|---|---|---|
| 0 | 0.308942 | 0.354839 | 0.444444 | 0.4 | C0001 |
| 1 | 0.168095 | 0.290323 | 0.333333 | 0.3 | C0002 |
| 2 | 0.249541 | 0.419355 | 0.333333 | 0.3 | C0003 |
| 3 | 0.497806 | 0.709677 | 0.777778 | 0.7 | C0004 |
| 4 | 0.184287 | 0.193548 | 0.222222 | 0.2 | C0005 |


## **Feature Engineering**
Created aggregated features for each customer:
- **TotalSpent**: Sum of all transaction values.
- **TotalQuantity**: Total quantity of products purchased.
- **UniqueProducts**: Number of distinct products bought.
- **TransactionCount**: Total number of transactions.

The features were normalized using `MinMaxScaler` so that each one lies in the range [0, 1] and no single feature dominates the similarity calculation because of its scale.
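
To make the aggregation and scaling concrete, here is a small sketch on invented data. It uses pandas named aggregation instead of the dict-plus-rename form in the cell above, and applies the min-max formula `(x - min) / (max - min)` by hand rather than through `MinMaxScaler`. The toy transactions are assumptions for illustration only.

```python
import pandas as pd

# Invented transactions for three customers
toy = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2", "C3", "C3", "C3"],
    "TransactionID": ["T1", "T2", "T3", "T4", "T5", "T6"],
    "ProductID": ["P1", "P1", "P2", "P1", "P2", "P3"],
    "Quantity": [1, 2, 1, 3, 1, 1],
    "TotalValue": [10.0, 20.0, 50.0, 30.0, 50.0, 20.0],
})

# Named aggregation: output column name = (input column, aggregation)
features = toy.groupby("CustomerID").agg(
    TotalSpent=("TotalValue", "sum"),
    TotalQuantity=("Quantity", "sum"),
    UniqueProducts=("ProductID", "nunique"),
    TransactionCount=("TransactionID", "count"),
)

# Min-max scaling applied column by column: (x - min) / (max - min)
scaled = (features - features.min()) / (features.max() - features.min())

print(features)
print(scaled)
```

Note that the hand-written formula divides by zero for a column whose values are all equal and yields `NaN`, so it is meant only to show what the scaling computes.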


# 5. Similarity Calculation

In [8]:
# Compute cosine similarity between all customers
similarity_matrix = cosine_similarity(customer_features_normalized.iloc[:, :-1])
similarity_df = pd.DataFrame(similarity_matrix, index=customer_features['CustomerID'], columns=customer_features['CustomerID'])

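# Function to get the top 3 similar customers for a given customer
def get_top_3_similar(similarity_df, customer_id):
    # Drop the customer's own entry explicitly: when other customers tie at 1.0,
    # skipping the first sorted row does not reliably exclude the customer itself
    similar_customers = similarity_df[customer_id].drop(customer_id).nlargest(3)
    return [(cust_id, round(score, 4)) for cust_id, score in similar_customers.items()]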

# Generate recommendations for the first 20 customers
lookalike_recommendations = {}
for customer_id in customer_features['CustomerID'][:20]:
    lookalike_recommendations[customer_id] = get_top_3_similar(similarity_df, customer_id)

# Save recommendations to Lookalike.csv
recommendations_df = pd.DataFrame({
    "CustomerID": lookalike_recommendations.keys(),
    "Recommendations": [str(rec) for rec in lookalike_recommendations.values()]
})
recommendations_df.to_csv('output/Rakesh_Valasala_Lookalike.csv', index=False)

# Preview recommendations
print("Lookalike Recommendations:")
display(recommendations_df.head())


Lookalike Recommendations:


| | CustomerID | Recommendations |
|---|---|---|
| 0 | C0001 | [('C0173', 1.0), ('C0177', 0.9999), ('C0122', ... |
| 1 | C0002 | [('C0030', 0.9999), ('C0029', 0.9999), ('C0031... |
| 2 | C0003 | [('C0136', 0.9993), ('C0073', 0.9992), ('C0197... |
| 3 | C0004 | [('C0195', 0.9999), ('C0072', 0.9998), ('C0190... |
| 4 | C0005 | [('C0125', 1.0), ('C0064', 1.0), ('C0105', 0.9... |


## **Similarity Calculation**
- Calculated pairwise cosine similarity between customers based on the normalized features.
- For each of the first 20 customers, identified the top 3 most similar customers with their similarity scores.
- Saved the recommendations to `output/Rakesh_Valasala_Lookalike.csv`.

### **Sample Output:**
Displayed above are the top 3 recommendations for the first 5 customers.
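
The similarity and ranking steps can be sketched with NumPy alone. The example below computes cosine similarity as the dot product of L2-normalized rows and ranks neighbours while masking out the target's own entry. The IDs, feature values, and the helper `top_k_similar` are hypothetical.

```python
import numpy as np

ids = ["A", "B", "C", "D"]
X = np.array([
    [0.2, 0.4],
    [0.4, 0.8],  # same direction as A, so cosine similarity with A is 1 despite larger values
    [0.9, 0.1],
    [0.5, 0.5],
])

# Cosine similarity: dot product of L2-normalized rows
unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = unit @ unit.T


def top_k_similar(sim, ids, target, k=3):
    """Hypothetical helper: k most similar IDs to `target`, never including `target`."""
    i = ids.index(target)
    scores = sim[i].copy()
    scores[i] = -np.inf  # mask the self-similarity so ties at 1.0 cannot push it into the result
    order = np.argsort(-scores)[:k]
    return [(ids[j], round(float(scores[j]), 4)) for j in order]


print(top_k_similar(sim, ids, "A", k=2))
```

Cosine similarity compares direction only, so customers whose features are roughly proportional score close to 1 regardless of magnitude. That may be one reason the scores in the sample output cluster near 1.0, given that all four features tend to grow together with purchase activity.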


# 6. Conclusion

## **Conclusion**
This notebook successfully builds a lookalike model for customer recommendations.  
Key steps included:
1. Loading and preprocessing customer, product, and transaction data.
2. Engineering meaningful features for similarity calculations.
3. Using cosine similarity to identify the top 3 most similar customers for each of the first 20 customers.

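The output file `output/Rakesh_Valasala_Lookalike.csv` has two columns, `CustomerID` and `Recommendations`.  
Each `Recommendations` entry is a stringified list of `(CustomerID, SimilarityScore)` tuples, which corresponds to the mapping `Map<CustomerID, List<(CustomerID, SimilarityScore)>>`.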
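
Because the recommendations are stored as strings, a consumer has to parse them back into Python objects. The following is a sketch of one way to do that with `ast.literal_eval`, using an in-memory buffer in place of the real file. The IDs and scores are invented for illustration.

```python
import ast
import io

import pandas as pd

# Invented recommendations in the same shape the notebook produces
recs = {
    "C0001": [("C0002", 0.98), ("C0003", 0.95), ("C0004", 0.91)],
    "C0002": [("C0001", 0.98), ("C0004", 0.93), ("C0003", 0.9)],
}

# Write in the same two-column layout, stringifying each list of tuples
out = pd.DataFrame({
    "CustomerID": list(recs.keys()),
    "Recommendations": [str(value) for value in recs.values()],
})
buffer = io.StringIO()  # stands in for the CSV file on disk
out.to_csv(buffer, index=False)
buffer.seek(0)

# Read back and turn each string into a list of (CustomerID, score) tuples
loaded = pd.read_csv(buffer)
loaded["Recommendations"] = loaded["Recommendations"].apply(ast.literal_eval)
lookalike_map = dict(zip(loaded["CustomerID"], loaded["Recommendations"]))

print(lookalike_map["C0001"])
```

This round trip relies on the scores being plain Python floats. With NumPy 2.x, NumPy scalars inside a list print as `np.float64(...)`, which `ast.literal_eval` cannot parse, so casting each score with `float()` before stringifying is a reasonable precaution.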
