1. Data Preprocessing: We will merge customer, transaction, and product data to create a complete user profile that includes both customer and product information.
2. Feature Engineering: We will create features from each customer's transaction history (total spend, product category preferences, number of purchases). Demographic columns such as Region and SignupDate are merged in but are not used as features in the code that follows.
3. Similarity Calculation: We will compute similarity scores between customers using cosine similarity on their standardized feature profiles.
4. Recommendation System: For each of the first 20 customers (C0001 to C0020), we will calculate similarity scores with all other customers and recommend the top 3 most similar customers.



In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Load the datasets
customers = pd.read_csv('/content/drive/MyDrive/ZEOTAP /Customers.csv')
products = pd.read_csv('/content/drive/MyDrive/ZEOTAP /Products.csv')
transactions = pd.read_csv('/content/drive/MyDrive/ZEOTAP /Transactions.csv')

# Display the first few rows to check the data
print(customers.head())
print(products.head())
print(transactions.head())


  CustomerID        CustomerName         Region  SignupDate
0      C0001    Lawrence Carroll  South America  2022-07-10
1      C0002      Elizabeth Lutz           Asia  2022-02-13
2      C0003      Michael Rivera  South America  2024-03-07
3      C0004  Kathleen Rodriguez  South America  2022-10-09
4      C0005         Laura Weber           Asia  2022-08-15
  ProductID              ProductName     Category   Price
0      P001     ActiveWear Biography        Books  169.30
1      P002    ActiveWear Smartwatch  Electronics  346.30
2      P003  ComfortLiving Biography        Books   44.12
3      P004            BookWorld Rug   Home Decor   95.69
4      P005          TechPro T-Shirt     Clothing  429.31
  TransactionID CustomerID ProductID      TransactionDate  Quantity  \
0        T00001      C0199      P067  2024-08-25 12:38:23         1   
1        T00112      C0146      P067  2024-05-27 22:23:54         1   
2        T00166      C0127      P067  2024-04-25 07:38:55         1   
3       

1.2 Merge Customer, Transaction, and Product Data
We will merge the data from customers, products, and transactions based on CustomerID and ProductID. This will allow us to create a combined feature set for each customer.

In [2]:
# Merge customers with transactions
customer_transactions = pd.merge(transactions, customers, on='CustomerID', how='left')

# Merge the resulting dataframe with product data
customer_transactions = pd.merge(customer_transactions, products, on='ProductID', how='left')

# Display the merged data
print(customer_transactions.head())


  TransactionID CustomerID ProductID      TransactionDate  Quantity  \
0        T00001      C0199      P067  2024-08-25 12:38:23         1   
1        T00112      C0146      P067  2024-05-27 22:23:54         1   
2        T00166      C0127      P067  2024-04-25 07:38:55         1   
3        T00272      C0087      P067  2024-03-26 22:55:37         2   
4        T00363      C0070      P067  2024-03-21 15:10:10         3   

   TotalValue  Price_x     CustomerName         Region  SignupDate  \
0      300.68   300.68   Andrea Jenkins         Europe  2022-12-03   
1      300.68   300.68  Brittany Harvey           Asia  2024-09-04   
2      300.68   300.68  Kathryn Stevens         Europe  2024-04-04   
3      601.36   300.68  Travis Campbell  South America  2024-04-11   
4      902.04   300.68    Timothy Perez         Europe  2022-03-15   

                       ProductName     Category  Price_y  
0  ComfortLiving Bluetooth Speaker  Electronics   300.68  
1  ComfortLiving Bluetooth Speaker
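
The Price_x and Price_y columns in the output above appear because both the transactions and the products tables carry a Price column, and pandas disambiguates them with its default suffixes. The following is a minimal sketch, on made-up toy frames rather than the real CSV files, of how the same two merges could name those columns explicitly and check that each key matches at most one row on the right-hand side. The suffix names and toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the three CSV files (values are made up for illustration)
customers = pd.DataFrame({"CustomerID": ["C1", "C2"], "Region": ["Asia", "Europe"]})
products = pd.DataFrame({
    "ProductID": ["P1", "P2"],
    "Category": ["Books", "Clothing"],
    "Price": [10.0, 20.0],
})
transactions = pd.DataFrame({
    "TransactionID": ["T1", "T2", "T3"],
    "CustomerID": ["C1", "C1", "C2"],
    "ProductID": ["P1", "P2", "P1"],
    "Quantity": [1, 2, 1],
    "Price": [10.0, 20.0, 10.0],
})

# validate="many_to_one" raises an error if a key is duplicated in the right-hand table
merged = transactions.merge(customers, on="CustomerID", how="left", validate="many_to_one")
merged = merged.merge(
    products,
    on="ProductID",
    how="left",
    validate="many_to_one",
    suffixes=("_transaction", "_product"),  # hypothetical names replacing the default _x / _y
)

print(merged.columns.tolist())
```

A left merge from transactions keeps one row per transaction, so the row count of the result should equal the row count of the transactions table.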

2. Feature Engineering
We will create features based on the transaction history of each customer, such as:

Total Spend: Sum of TotalValue for each customer.
Product Category Preferences: Number of purchases in each product category.
Purchase Frequency: Number of transactions per customer. The code stores this count in a column named AvgPurchaseFrequency, although it is a total rather than an average.

In [3]:
# Create the "Total Spend" feature by summing up the total value of transactions for each customer
total_spend = customer_transactions.groupby('CustomerID')['TotalValue'].sum().reset_index()
total_spend.rename(columns={'TotalValue': 'TotalSpend'}, inplace=True)

# Create the "Product Category Preferences" feature (i.e., the frequency of each category per customer)
category_preference = customer_transactions.groupby(['CustomerID', 'Category']).size().unstack(fill_value=0)

# Create the "AvgPurchaseFrequency" feature (despite the name, this is the total number of transactions per customer)
avg_purchase_frequency = customer_transactions.groupby('CustomerID').size().reset_index(name='AvgPurchaseFrequency')

# Merge all features into one dataframe
customer_features = pd.merge(total_spend, avg_purchase_frequency, on='CustomerID')
customer_features = pd.merge(customer_features, category_preference, on='CustomerID')

# Display the features
print(customer_features.head())


  CustomerID  TotalSpend  AvgPurchaseFrequency  Books  Clothing  Electronics  \
0      C0001     3354.52                     5      1         0            3   
1      C0002     1862.74                     4      0         2            0   
2      C0003     2725.38                     4      0         1            1   
3      C0004     5354.88                     8      3         0            2   
4      C0005     2034.24                     3      0         0            2   

   Home Decor  
0           1  
1           2  
2           2  
3           3  
4           1  
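
The feature table above is built only from transactions. If demographic information should also influence similarity, one option is to one-hot encode Region and join it onto the feature table. This is a sketch on a three-row toy sample, not the notebook's method; the `Region_` prefix and the `features_with_region` name are assumptions introduced for illustration.

```python
import pandas as pd

# Small samples mirroring the first three customers shown earlier
customers = pd.DataFrame({
    "CustomerID": ["C0001", "C0002", "C0003"],
    "Region": ["South America", "Asia", "South America"],
})
customer_features = pd.DataFrame({
    "CustomerID": ["C0001", "C0002", "C0003"],
    "TotalSpend": [3354.52, 1862.74, 2725.38],
})

# One 0/1 indicator column per region; CustomerID is kept as-is
region_dummies = pd.get_dummies(
    customers[["CustomerID", "Region"]], columns=["Region"], prefix="Region", dtype=int
)

# Join on CustomerID so each customer still has exactly one row
features_with_region = customer_features.merge(region_dummies, on="CustomerID", how="left")
print(features_with_region)
```

Indicator columns would be standardized together with the other features, so it is worth checking how much weight they end up carrying in the similarity score.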


3. Similarity Calculation
Now that we have a customer profile, we will calculate similarity scores between customers. Cosine Similarity is commonly used for such tasks: it measures the cosine of the angle between two feature vectors, so it compares the shape of two profiles rather than their overall magnitude. Scores range from -1 (opposite profiles) to 1 (identical direction).
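
To make the definition concrete: for two vectors a and b, the cosine similarity is their dot product divided by the product of their lengths. The sketch below uses made-up vectors, computes the score by hand with NumPy, and compares it with scikit-learn's `cosine_similarity`. The helper name `cosine` is introduced here for illustration only.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up category-count profiles; b is a multiplied by 2, c shares no category with a
a = np.array([1.0, 0.0, 3.0, 1.0])
b = np.array([2.0, 0.0, 6.0, 2.0])
c = np.array([0.0, 2.0, 0.0, 0.0])

def cosine(u, v):
    """Cosine of the angle between u and v: dot product over the product of norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, b))  # same direction, so the score is 1 up to floating-point rounding
print(cosine(a, c))  # orthogonal vectors, so the score is 0

# The pairwise matrix from scikit-learn holds the same values
sk = cosine_similarity(np.vstack([a, b, c]))
print(sk[0, 1], sk[0, 2])
```

Because b is just a scaled copy of a, the two get the maximum score even though b represents twice as many purchases.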

3.1 Standardize the Data
Before calculating similarity, it's important to standardize the features so that no single large-valued feature (e.g., total spend) dominates the similarity score.
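
A small sketch of why this matters, using three made-up customers with one large-valued column (spend) and two small count columns. On the raw values the spend column dominates, so two customers with opposite category preferences still look almost identical; after `StandardScaler` the category columns carry comparable weight. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Made-up rows: [TotalSpend, Books, Electronics]
X = np.array([
    [3000.0, 5.0, 0.0],  # high spender, buys only books
    [3100.0, 0.0, 5.0],  # high spender, buys only electronics
    [500.0, 5.0, 0.0],   # low spender, buys only books
])

raw = cosine_similarity(X)
scaled = cosine_similarity(StandardScaler().fit_transform(X))

# Similarity between the first two customers, before and after standardization
print(raw[0, 1])
print(scaled[0, 1])
```

On the raw data the first score is close to 1, while on the standardized data it is negative, reflecting the opposite category preferences.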

In [4]:
# Standardize the numerical features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(customer_features.drop('CustomerID', axis=1))

# Compute cosine similarity between all customers
similarity_matrix = cosine_similarity(scaled_features)

# Display the similarity matrix
print(similarity_matrix)


[[ 1.         -0.54352611  0.00852079 ...  0.10015196  0.60159549
  -0.49581555]
 [-0.54352611  1.          0.79054587 ...  0.48889369  0.28860935
   0.20428846]
 [ 0.00852079  0.79054587  1.         ...  0.5047866   0.76479776
  -0.04480824]
 ...
 [ 0.10015196  0.48889369  0.5047866  ...  1.          0.46248786
  -0.12357102]
 [ 0.60159549  0.28860935  0.76479776 ...  0.46248786  1.
  -0.54314818]
 [-0.49581555  0.20428846 -0.04480824 ... -0.12357102 -0.54314818
   1.        ]]


3.2 Generate Lookalikes
For each customer, we will recommend the top 3 most similar customers (excluding themselves). We will use the cosine similarity scores for this task.

In [5]:
# Generate the top 3 lookalikes for each customer (excluding themselves)
lookalike_map = {}

for idx, customer_id in enumerate(customer_features['CustomerID']):
    # Get similarity scores for each customer
    similarity_scores = similarity_matrix[idx]

    # Sort the customers by similarity score in descending order
    sorted_indices = similarity_scores.argsort()[::-1]
    # Drop the customer's own index explicitly: with tied scores (e.g., identical profiles) it is not guaranteed to sort first
    similar_customers = [i for i in sorted_indices if i != idx][:3]  # Top 3 excluding the customer itself

    # Collect customer ids and similarity scores
    recommended_customers = [(customer_features.iloc[i]['CustomerID'], similarity_scores[i]) for i in similar_customers]

    # Store in the lookalike map
    lookalike_map[customer_id] = recommended_customers

# Display the lookalike map for the first 20 customers
for customer_id in list(lookalike_map.keys())[:20]:
    print(f"Customer {customer_id} lookalikes: {lookalike_map[customer_id]}")


Customer C0001 lookalikes: [('C0069', 0.9474257972151854), ('C0127', 0.8739694001028301), ('C0190', 0.8460722354249515)]
Customer C0002 lookalikes: [('C0133', 0.9681437939265284), ('C0062', 0.8997910818956721), ('C0134', 0.8968440792176382)]
Customer C0003 lookalikes: [('C0166', 0.9944603992451134), ('C0031', 0.9746433592752327), ('C0158', 0.9376095644412319)]
Customer C0004 lookalikes: [('C0090', 0.9178464884605366), ('C0122', 0.9118789079040195), ('C0017', 0.9094137328867413)]
Customer C0005 lookalikes: [('C0197', 0.9996873537675999), ('C0007', 0.9906572143006201), ('C0140', 0.8991908631524648)]
Customer C0006 lookalikes: [('C0135', 0.9131662546767203), ('C0187', 0.7746920876470779), ('C0185', 0.7290633873719096)]
Customer C0007 lookalikes: [('C0005', 0.9906572143006201), ('C0197', 0.9869375698842499), ('C0120', 0.8957528613441336)]
Customer C0008 lookalikes: [('C0162', 0.9354260196136521), ('C0154', 0.8903984349746906), ('C0113', 0.8845529363450987)]
Customer C0009 lookalikes: [('C0
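
The same top-k selection can also be written without a Python-level sort per customer. The sketch below, on a made-up 4x4 similarity matrix, masks the diagonal with negative infinity so that a customer can never be returned as their own lookalike, even when another customer ties at a score of 1.0. The names `masked`, `top_idx`, and `k` are introduced here for illustration.

```python
import numpy as np

ids = np.array(["C0001", "C0002", "C0003", "C0004"])

# Made-up symmetric similarity matrix; C0001 and C0002 tie at 1.0 with each other
sim = np.array([
    [1.0, 1.0, 0.2, 0.5],
    [1.0, 1.0, 0.3, 0.4],
    [0.2, 0.3, 1.0, 0.9],
    [0.5, 0.4, 0.9, 1.0],
])

k = 2
masked = sim.copy()
np.fill_diagonal(masked, -np.inf)  # exclude each customer from their own ranking

# Negating the scores turns the ascending argsort into a descending ranking per row
top_idx = np.argsort(-masked, axis=1)[:, :k]

lookalikes = {
    str(ids[row]): [(str(ids[j]), float(sim[row, j])) for j in top_idx[row]]
    for row in range(len(ids))
}
print(lookalikes["C0001"])
```

For the toy matrix, C0001's best match is C0002 (the tied score of 1.0), followed by C0004.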

4. Output: Save to CSV
Finally, we will flatten lookalike_map into one row per (customer, lookalike) pair, with columns CustomerID, LookalikeCustomerID, and SimilarityScore, and save it to a CSV file (Lookalike.csv).

In [6]:
# Convert the lookalike map to a DataFrame
lookalike_list = []
for customer_id, recommendations in lookalike_map.items():
    for rec in recommendations:
        lookalike_list.append([customer_id, rec[0], rec[1]])

lookalike_df = pd.DataFrame(lookalike_list, columns=['CustomerID', 'LookalikeCustomerID', 'SimilarityScore'])

# Save to CSV
lookalike_df.to_csv('Lookalike.csv', index=False)

# Display the first few rows of the output
print(lookalike_df.head())


  CustomerID LookalikeCustomerID  SimilarityScore
0      C0001               C0069         0.947426
1      C0001               C0127         0.873969
2      C0001               C0190         0.846072
3      C0002               C0133         0.968144
4      C0002               C0062         0.899791
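
The cell above writes lookalikes for every customer in the feature table, whereas the task statement asks for the first 20 customers (C0001 to C0020). A hedged sketch of restricting the output to those IDs follows, using a toy stand-in for lookalike_df with made-up scores. It assumes the zero-padded four-digit ID format seen in the customer data; `first_20_ids` and `first_20` are names introduced here.

```python
import pandas as pd

# Toy stand-in for lookalike_df (scores are made up)
lookalike_df = pd.DataFrame({
    "CustomerID": ["C0001", "C0001", "C0020", "C0021"],
    "LookalikeCustomerID": ["C0069", "C0127", "C0005", "C0003"],
    "SimilarityScore": [0.95, 0.87, 0.80, 0.75],
})

# IDs C0001 .. C0020, built explicitly rather than by position in the table
first_20_ids = [f"C{n:04d}" for n in range(1, 21)]

# Keep only rows whose CustomerID is one of the first 20 customers
first_20 = lookalike_df[lookalike_df["CustomerID"].isin(first_20_ids)]
print(first_20)
```

Filtering by explicit ID is more robust than slicing the first 20 keys, since a customer with no transactions would be missing from the feature table and shift the positions. The filtered frame could then be passed to `to_csv` in the same way as above.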
