
# <p style="background-color: #ff6200; font-family:calibri; color:white; font-size:140%; font-family:Verdana; text-align:center; border-radius:5px 10px; padding: 20px">Lookalike Model</p>

# <b><span style='color:#ff6200'> Importing Necessary Libraries</span></b>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from sklearn.cluster import KMeans
import sklearn 
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# <b><span style='color:#ff6200'> Checking versions</span></b>

In [2]:
print("pandas version:", pd.__version__)
print("numpy version:",np.__version__)
print("seaborn version:",sns.__version__)
print("scikit-learn version:",sklearn.__version__)

pandas version: 1.5.3
numpy version: 1.24.0
seaborn version: 0.12.2
scikit-learn version: 1.6.1


# <b><span style='color:#ff6200'> Configure Seaborn plot styles</span></b>

In [3]:
#Set background color and use dark grid
sns.set(rc={'axes.facecolor': '#fcf0dc'}, style='darkgrid')

# <b><span style='color:#ff6200'> Loading the Dataset</span></b>

In [4]:
customers_df = pd.read_csv("dataset/Customers.csv")
transactions_df = pd.read_csv("dataset/Transactions.csv")
products_df = pd.read_csv("dataset/Products.csv")
clusters_df = pd.read_csv("dataset/Clusters.csv")

# <b><span style='color:#ff6200'> Dataset Overview</span></b>

In [5]:
customers_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   CustomerID    200 non-null    object
 1   CustomerName  200 non-null    object
 2   Region        200 non-null    object
 3   SignupDate    200 non-null    object
dtypes: object(4)
memory usage: 6.4+ KB


In [6]:
transactions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   TransactionID    1000 non-null   object 
 1   CustomerID       1000 non-null   object 
 2   ProductID        1000 non-null   object 
 3   TransactionDate  1000 non-null   object 
 4   Quantity         1000 non-null   int64  
 5   TotalValue       1000 non-null   float64
 6   Price            1000 non-null   float64
dtypes: float64(2), int64(1), object(4)
memory usage: 54.8+ KB


In [7]:
products_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   ProductID    100 non-null    object 
 1   ProductName  100 non-null    object 
 2   Category     100 non-null    object 
 3   Price        100 non-null    float64
dtypes: float64(1), object(3)
memory usage: 3.2+ KB


In [8]:
clusters_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199 entries, 0 to 198
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  199 non-null    int64  
 1   CustomerID  199 non-null    object 
 2   Recency     199 non-null    int64  
 3   Frequency   199 non-null    int64  
 4   Monetary    199 non-null    float64
 5   Cluster     199 non-null    int64  
dtypes: float64(1), int64(4), object(1)
memory usage: 9.5+ KB


<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
### **Insights from Customer Data Exploration (from the EDA file)**

- **No Missing Values:** The Customers dataset does not contain any null values.  
- **No Duplicates:** There are no duplicate records in the dataset.  
- **Customer Distribution by Region:**  
  - **South America:** 59 customers  
  - **Asia:** 45 customers  
  - **North America:** 46 customers  
  - **Europe:** 50 customers  
- The customer distribution across regions appears **fairly balanced, with no significant skewness**.

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
### **Insights from Transactions Data (from the EDA file)**

- We have **1,000 transactions** recorded in the dataset.  
- The **TransactionDate** column needs to be **converted to datetime format** for accurate time-based analysis.  
- **No missing values** were found in the transaction data, ensuring data completeness.  
- The data spans **2023-12-30** to **2024-12-28**.
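
The datetime conversion noted above could be sketched roughly as follows. The three sample rows (including the deliberately broken one) are made up for illustration and are not taken from `Transactions.csv`; only the column name `TransactionDate` comes from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the real file has 1,000 transactions.
sample = pd.DataFrame({
    "TransactionID": ["T1", "T2", "T3"],
    "TransactionDate": ["2024-08-25 12:38:23", "2024-05-27 22:23:54", "not a date"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising,
# which makes bad rows easy to count afterwards.
sample["TransactionDate"] = pd.to_datetime(sample["TransactionDate"], errors="coerce")

n_bad = int(sample["TransactionDate"].isna().sum())
date_range = (sample["TransactionDate"].min(), sample["TransactionDate"].max())

print(sample.dtypes)
print("unparseable rows:", n_bad)
print("range:", date_range)
```

Counting `NaT` values after the conversion is a cheap way to confirm that every date string was parsed before doing any time-based analysis.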

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
### **Insights from Products Data Exploration (from the EDA file)**  

- **No Missing Values:** The Products dataset does not contain any null values.  
- **No Duplicates:** There are no duplicate records in the dataset.  

#### **Product Distribution by Category:**  
- **Books:** 26 Products  
- **Electronics:** 26 Products  
- **Home Decor:** 23 Products  
- **Clothing:** 25 Products  

 **Balanced Distribution:**  
- The product distribution across categories appears **fairly balanced**, with no significant skewness.  

 **Wide Price Range:**  
- Product prices vary significantly, from **\$16.08** to **\$497.76**, indicating a mix of **budget and premium products**.  


# <b><span style='color:#ff6200'> Creating Dataframe for Lookalike</span></b>

In [9]:
# Merge Transactions with Products to get Category
transactions_df = transactions_df.merge(products_df[['ProductID', 'Category']], on='ProductID', how='left')


In [10]:
# Find the top category purchased by each customer
top_category = transactions_df.groupby(['CustomerID', 'Category']).size().reset_index(name='Count')
top_category = top_category.loc[top_category.groupby('CustomerID')['Count'].idxmax()]  # Keep top category
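
One detail worth checking in the step above is ties: `idxmax` returns the first row with the maximum count, so a customer who buys two categories equally often is assigned whichever comes first in the grouped (alphabetically sorted) order. The sketch below uses made-up purchases to show this and to list the tied customers; the names `toy`, `counts` and `tied_customers` are illustrative only.

```python
import pandas as pd

# Hypothetical purchases: C1 buys Books and Clothing equally often.
toy = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C1", "C1", "C2", "C2", "C2"],
    "Category": ["Clothing", "Books", "Clothing", "Books",
                 "Electronics", "Electronics", "Books"],
})

# Same two steps as in the notebook cell.
counts = toy.groupby(["CustomerID", "Category"]).size().reset_index(name="Count")
top = counts.loc[counts.groupby("CustomerID")["Count"].idxmax()]

# Customers whose highest count is shared by more than one category.
max_count = counts.groupby("CustomerID")["Count"].transform("max")
tied = counts[counts["Count"] == max_count].groupby("CustomerID").size()
tied_customers = tied[tied > 1].index.tolist()

print(top)
print("customers with a tie:", tied_customers)
```

If ties turn out to be common in the real data, a secondary criterion such as total spend per category might be a more meaningful tie-breaker than alphabetical order.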


In [11]:
# Merge with Customers to get Region
df = customers_df[['CustomerID', 'Region']].merge(top_category[['CustomerID', 'Category']], on='CustomerID', how='left')


In [12]:

# Merge with Clusters to get Cluster Number
df = df.merge(clusters_df[['CustomerID', 'Cluster']], on='CustomerID', how='left')


In [13]:

# Rename column
df.rename(columns={'Category': 'TopCategory'}, inplace=True)

# For evaluation and verification 
verification_df = df.copy()

# Display the first few rows
df.head()

  CustomerID         Region  TopCategory  Cluster
0      C0001  South America  Electronics      3.0
1      C0002           Asia     Clothing      1.0
2      C0003  South America   Home Decor      4.0
3      C0004  South America        Books      0.0
4      C0005           Asia  Electronics      1.0


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   CustomerID   200 non-null    object 
 1   Region       200 non-null    object 
 2   TopCategory  199 non-null    object 
 3   Cluster      199 non-null    float64
dtypes: float64(1), object(3)
memory usage: 7.8+ KB
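
The output above shows 199 non-null values out of 200 for `TopCategory` and `Cluster`, which is what a left join produces when a customer has no matching row on the right side. A minimal sketch of how such customers could be located, using small made-up frames in place of `customers_df` and `top_category`:

```python
import pandas as pd

# Hypothetical stand-ins for customers_df and top_category.
customers = pd.DataFrame({"CustomerID": ["C1", "C2", "C3"],
                          "Region": ["Asia", "Europe", "Asia"]})
top_cat = pd.DataFrame({"CustomerID": ["C1", "C3"],
                        "Category": ["Books", "Clothing"]})

# indicator=True adds a _merge column; for a left join its values are
# 'both' (match found) or 'left_only' (no match on the right side).
merged = customers.merge(top_cat, on="CustomerID", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "CustomerID"].tolist()

print(unmatched)
```

Listing the unmatched IDs before calling `dropna` makes it explicit which customers are excluded from the lookalike model.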


<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
# **Selected Columns & Rationale**  

## **1. Transactions Dataset**  
- **`CustomerID`**: Links purchases to individual customers.  
- **`ProductID`**: Connects transactions to product details.  
- **`Category`**: Identifies customer purchase preferences.  

## **2. Customers Dataset**  
- **`CustomerID`**: Key for merging customer data.  
- **`Region`**: Enables geographic purchase analysis.  

## **3. Clusters Dataset (RFM Segmentation)**  
- **`CustomerID`**: Links segmentation data.  
- **`Cluster`**: Groups customers based on purchasing behavior.  

## **4. Products Dataset**  
- **`ProductID`**: Connects products to transactions.  
- **`Category`**: Determines top purchased product types.  

### **Why These Columns?**  
- **`CustomerID`** ensures dataset integration.  
- **`Category`** reveals purchasing trends.  
- **`Region`** supports geographic insights.  
- **`Cluster`** aids customer segmentation.  

This selection enables a targeted analysis of customer behavior and purchasing patterns.  


# <b><span style='color:#ff6200'> Feature Engineering</span></b>

In [15]:
df.dropna(axis=0,inplace=True)

In [16]:
df["Cluster"] = df["Cluster"].astype(int)

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199 entries, 0 to 199
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   CustomerID   199 non-null    object
 1   Region       199 non-null    object
 2   TopCategory  199 non-null    object
 3   Cluster      199 non-null    int32 
dtypes: int32(1), object(3)
memory usage: 7.0+ KB


In [18]:
df["Region"].unique()

array(['South America', 'Asia', 'North America', 'Europe'], dtype=object)

In [19]:
df["TopCategory"].unique()

array(['Electronics', 'Clothing', 'Home Decor', 'Books'], dtype=object)

# <b><span style='color:#ff6200'> One Hot Encoding</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

### **Cosine Similarity and One-Hot Encoding**  

Given the small dataset, **cosine similarity** is used to identify similar customers. To enable numerical comparisons, **one-hot encoding** is applied to:  

- **`Region`**: For geographic similarity.  
- **`TopCategory`**: To compare purchasing behavior.  
- **`Cluster`**: To analyze customer segmentation.  

This transformation ensures an effective similarity analysis based on purchasing patterns, location, and segmentation.
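
To make the effect of this encoding concrete: each customer becomes a vector with exactly three ones (one per feature), so every vector has norm √3 and the cosine similarity between two customers equals the number of attributes they share divided by 3. This is why a similarity matrix built this way contains only the values 0, 1/3, 2/3 and 1. The sketch below checks this on four made-up customers and computes the similarity directly with NumPy rather than scikit-learn.

```python
import numpy as np
import pandas as pd

# Hypothetical customers described by the same three categorical features.
toy = pd.DataFrame({
    "CustomerID": ["A", "B", "C", "D"],
    "Region": ["Asia", "Asia", "Europe", "Europe"],
    "TopCategory": ["Books", "Books", "Books", "Clothing"],
    "Cluster": [1, 1, 2, 3],
}).set_index("CustomerID")

encoded = pd.get_dummies(toy, columns=["Region", "TopCategory", "Cluster"]).astype(float)

# Cosine similarity: dot product divided by the product of the vector norms.
X = encoded.to_numpy()
norms = np.linalg.norm(X, axis=1)
sim = (X @ X.T) / np.outer(norms, norms)
sim_df = pd.DataFrame(sim, index=encoded.index, columns=encoded.index)

print(sim_df.round(3))
```

Here A and B share all three attributes (similarity 1), A and C share only the top category (1/3), and A and D share nothing (0).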


In [20]:

df = pd.get_dummies(df, columns=['Region', 'TopCategory',"Cluster"])


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 199 entries, 0 to 199
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   CustomerID               199 non-null    object
 1   Region_Asia              199 non-null    uint8 
 2   Region_Europe            199 non-null    uint8 
 3   Region_North America     199 non-null    uint8 
 4   Region_South America     199 non-null    uint8 
 5   TopCategory_Books        199 non-null    uint8 
 6   TopCategory_Clothing     199 non-null    uint8 
 7   TopCategory_Electronics  199 non-null    uint8 
 8   TopCategory_Home Decor   199 non-null    uint8 
 9   Cluster_0                199 non-null    uint8 
 10  Cluster_1                199 non-null    uint8 
 11  Cluster_2                199 non-null    uint8 
 12  Cluster_3                199 non-null    uint8 
 13  Cluster_4                199 non-null    uint8 
dtypes: object(1), uint8(13)
memory usage: 5.6+

In [22]:
# Preview only: the result is not assigned back, so df keeps all 13 dummy columns
df.drop(columns=["Region_Asia","TopCategory_Books","Cluster_0"])

Unnamed: 0,CustomerID,Region_Europe,Region_North America,Region_South America,TopCategory_Clothing,TopCategory_Electronics,TopCategory_Home Decor,Cluster_1,Cluster_2,Cluster_3,Cluster_4
0,C0001,0,0,1,0,1,0,0,0,1,0
1,C0002,0,0,0,1,0,0,1,0,0,0
2,C0003,0,0,1,0,0,1,0,0,0,1
3,C0004,0,0,1,0,0,0,0,0,0,0
4,C0005,0,0,0,0,1,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
195,C0196,1,0,0,0,0,1,0,0,1,0
196,C0197,1,0,0,0,1,0,1,0,0,0
197,C0198,1,0,0,1,0,0,1,0,0,0
198,C0199,1,0,0,0,1,0,1,0,0,0
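
Dropping one reference column per feature is a common step for regression models, but it changes cosine similarity: a customer who falls into every reference level becomes an all-zero vector, and cosine similarity is undefined for a zero vector (0/0). The sketch below illustrates this on made-up data using `drop_first=True`; it is an illustration of the general effect, not a statement about which variant the notebook intends to use.

```python
import numpy as np
import pandas as pd

# Hypothetical customers; rows 0 and 1 are identical.
toy = pd.DataFrame({
    "Region": ["Asia", "Asia", "Europe"],
    "TopCategory": ["Books", "Books", "Clothing"],
})

full = pd.get_dummies(toy).astype(float)                      # all levels kept
reduced = pd.get_dummies(toy, drop_first=True).astype(float)  # reference levels dropped

full_norms = np.linalg.norm(full.to_numpy(), axis=1)
reduced_norms = np.linalg.norm(reduced.to_numpy(), axis=1)

print("norms with all columns:   ", full_norms)
print("norms with dropped columns:", reduced_norms)
```

With all columns kept, every row has the same non-zero norm; with the reference levels dropped, the two identical Asia/Books customers have norm 0 and can no longer be compared by cosine similarity.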


# <b><span style='color:#ff6200'> Cosine Similarity for Recommendations</span></b>

In [23]:
from sklearn.metrics.pairwise import cosine_similarity


# df holds the one-hot encoded customer profiles; drop CustomerID so only feature columns are compared
similarity_matrix = cosine_similarity(df.drop(columns=["CustomerID"]))

# Convert to DataFrame
similarity_df = pd.DataFrame(similarity_matrix, index=df['CustomerID'], columns=df['CustomerID'])
similarity_df.head()


CustomerID,C0001,C0002,C0003,C0004,C0005,C0006,C0007,C0008,C0009,C0010,...,C0191,C0192,C0193,C0194,C0195,C0196,C0197,C0198,C0199,C0200
C0001,1.0,0.0,0.333333,0.333333,0.333333,0.666667,0.333333,0.0,0.0,0.0,...,0.333333,0.666667,0.0,0.333333,0.666667,0.333333,0.333333,0.0,0.333333,0.333333
C0002,0.0,1.0,0.0,0.0,0.666667,0.0,0.333333,0.0,0.666667,0.666667,...,0.333333,0.0,0.333333,0.0,0.0,0.0,0.333333,0.666667,0.333333,0.666667
C0003,0.333333,0.0,1.0,0.333333,0.0,0.333333,0.333333,0.333333,0.0,0.0,...,0.333333,0.666667,0.333333,0.0,0.666667,0.333333,0.0,0.0,0.0,0.0
C0004,0.333333,0.0,0.333333,1.0,0.0,0.666667,0.0,0.333333,0.0,0.0,...,0.666667,0.333333,0.333333,0.333333,0.333333,0.0,0.0,0.0,0.0,0.0
C0005,0.333333,0.666667,0.0,0.0,1.0,0.0,0.666667,0.0,0.333333,0.333333,...,0.333333,0.333333,0.333333,0.0,0.0,0.0,0.666667,0.333333,0.666667,0.333333


# <b><span style='color:#ff6200'> Top 3 Recommendations for Customers 1-20</span></b>

In [24]:
lookalike_dict = {}
for customer_id in similarity_df.index[:20]:  # First 20 customers
    similar_customers = similarity_df.loc[customer_id].drop(customer_id).nlargest(3)  # Top 3
    lookalike_dict[customer_id] = list(zip(similar_customers.index, similar_customers.values))
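
Because the features are categorical, many customers can tie at the same score. With its default `keep="first"`, `nlargest(3)` keeps the first tied entries in order of appearance, so the three returned lookalikes may be only a subset of equally similar customers. A small sketch with made-up scores and placeholder IDs:

```python
import pandas as pd

# Hypothetical similarity scores of one customer against five others.
scores = pd.Series([1.0, 1.0, 2 / 3, 1.0, 1.0],
                   index=["P", "Q", "R", "S", "T"])

top3 = scores.nlargest(3)                    # keep="first" is the default
with_ties = scores.nlargest(3, keep="all")   # also keeps everything tied with 3rd place

print(top3)
print("customers tied at the top score:", len(with_ties))
```

Comparing the two results shows how many equally good candidates were cut off by the top-3 limit.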


# <b><span style='color:#ff6200'> Convert the Dictionary to a DataFrame</span></b>

In [25]:
lookalike_df = pd.DataFrame(lookalike_dict.items(), columns=['CustomerID', 'Lookalikes'])
lookalike_df.head(20)

  CustomerID                                         Lookalikes
0      C0001  [(C0039, 1.0000000000000002), (C0048, 1.000000...
1      C0002  [(C0056, 1.0000000000000002), (C0088, 1.000000...
2      C0003  [(C0025, 1.0000000000000002), (C0052, 1.000000...
3      C0004  [(C0082, 1.0000000000000002), (C0087, 1.000000...
4      C0005  [(C0115, 1.0000000000000002), (C0140, 1.000000...
5      C0006  [(C0011, 1.0000000000000002), (C0118, 1.000000...
6      C0007  [(C0005, 0.6666666666666669), (C0026, 0.666666...
7      C0008  [(C0059, 1.0000000000000002), (C0065, 1.000000...
8      C0009  [(C0010, 1.0000000000000002), (C0061, 1.000000...
9      C0010  [(C0009, 1.0000000000000002), (C0061, 1.000000...


# <b><span style='color:#ff6200'> Manual Evaluation of the Recommendations</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

### **Need for Manual Verification**  

Since there is no ground truth available for validation, manual verification is essential to ensure data accuracy. This allows us to confirm that:  

- The **assigned top categories** accurately reflect customer purchase behavior.  
- The **region and cluster mappings** are correctly merged.  
- The **similarity results** align with expected patterns.  

By manually reviewing the data, we can identify inconsistencies.

In [26]:
lookalike_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   CustomerID  20 non-null     object
 1   Lookalikes  20 non-null     object
dtypes: object(2)
memory usage: 448.0+ bytes


# <b><span style='color:#ff6200'> Comparing and Verifying Every Recommendation</span></b>

In [27]:
for customer_id, recommendations in lookalike_df.values:
    print("=" * 50)
    print(f"Customer Data for ID: {customer_id}\n")
    print(verification_df.loc[verification_df['CustomerID'] == customer_id].reset_index(drop=True))
    
    recommended_customers = [rec[0] for rec in recommendations]
    
    print("\nRecommended Customer Data:\n")
    print(verification_df.loc[verification_df['CustomerID'].isin(recommended_customers)].reset_index(drop=True))
    print("=" * 50, "\n")


Customer Data for ID: C0001

  CustomerID         Region  TopCategory  Cluster
0      C0001  South America  Electronics      3.0

Recommended Customer Data:

  CustomerID         Region  TopCategory  Cluster
0      C0039  South America  Electronics      3.0
1      C0048  South America  Electronics      3.0
2      C0096  South America  Electronics      3.0

Customer Data for ID: C0002

  CustomerID Region TopCategory  Cluster
0      C0002   Asia    Clothing      1.0

Recommended Customer Data:

  CustomerID Region TopCategory  Cluster
0      C0056   Asia    Clothing      1.0
1      C0088   Asia    Clothing      1.0
2      C0092   Asia    Clothing      1.0

Customer Data for ID: C0003

  CustomerID         Region TopCategory  Cluster
0      C0003  South America  Home Decor      4.0

Recommended Customer Data:

  CustomerID         Region TopCategory  Cluster
0      C0025  South America  Home Decor      4.0
1      C0052  South America  Home Decor      4.0
2      C0158  South America  Home
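
The visual comparison above could also be summarized numerically by counting, for each recommended pair, how many of the three attributes match. The sketch below uses small made-up stand-ins for `verification_df` and `lookalike_dict`; the helper `shared_attributes` is a hypothetical name introduced for this example.

```python
import pandas as pd

# Hypothetical stand-ins for verification_df and lookalike_dict.
profiles = pd.DataFrame({
    "CustomerID": ["C1", "C2", "C3", "C4"],
    "Region": ["Asia", "Asia", "Asia", "Europe"],
    "TopCategory": ["Books", "Books", "Clothing", "Books"],
    "Cluster": [1, 1, 1, 2],
}).set_index("CustomerID")

lookalikes = {"C1": [("C2", 1.0), ("C3", 2 / 3), ("C4", 1 / 3)]}


def shared_attributes(customer_id, other_id, frame):
    """Count the columns on which two customers have the same value."""
    return int((frame.loc[customer_id] == frame.loc[other_id]).sum())


rows = []
for customer_id, recs in lookalikes.items():
    for other_id, score in recs:
        rows.append({
            "CustomerID": customer_id,
            "Lookalike": other_id,
            "Score": score,
            "Shared": shared_attributes(customer_id, other_id, profiles),
        })
check = pd.DataFrame(rows)

print(check)
```

With three one-hot encoded features, `Score * 3` should equal `Shared` for every pair, which gives a quick consistency test of the merge and encoding steps.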

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

### **Conclusion**  

Our lookalike model returns customers who closely match the target customer on the selected features. Key takeaways:  

- **Effective Matching**: The model identifies similar customers based on top purchase category, region, and RFM cluster.  
- **High Recommendation Accuracy**: In the manual check above, the recommended customers share the target customer's region, top category, and cluster. Since no ground truth is available, this confirms consistency with the input features rather than a measured accuracy.  
- **Robust Methodology**: Cosine similarity on one-hot encoded features is simple and transparent, so every score can be traced back to the attributes two customers have in common.  
- **Business Impact**: Enables targeted marketing strategies and improved customer engagement.  

Overall, the model behaves as expected on the reviewed customers, making it a useful tool for customer analysis and strategic decision-making.  



# <b><span style='color:#ff6200'> Exporting the Lookalike.csv</span></b>

In [28]:
lookalike_df.to_csv("Lookalike.csv", index=False)
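
Note that a column holding lists of tuples is written to CSV as its string representation, which has to be parsed again when the file is read back. One possible alternative, sketched here under the assumption that a long layout is acceptable, is to write one row per (customer, lookalike) pair. The dictionary below is a made-up stand-in for `lookalike_dict`, and the column names `Rank`, `LookalikeID` and `Score` are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for lookalike_dict built earlier.
lookalikes = {
    "C1": [("C2", 1.0), ("C3", 0.667)],
    "C4": [("C1", 0.333), ("C2", 0.333)],
}

# One row per (customer, lookalike) pair, with an explicit rank column.
long_df = pd.DataFrame(
    [
        {"CustomerID": cust, "Rank": rank, "LookalikeID": other, "Score": score}
        for cust, recs in lookalikes.items()
        for rank, (other, score) in enumerate(recs, start=1)
    ]
)

csv_text = long_df.to_csv(index=False)
print(csv_text)
```

In this layout each value sits in its own column, so the file can be filtered or joined directly without any string parsing.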


# <p style="background-color: #ff6200; font-family:calibri; color:white; font-size:140%; font-family:Verdana; text-align:center; border-radius:5px 10px; padding: 20px">Thank you</p>