
**1. Data Preparation**

In [1]:
# Import libraries
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Load datasets
customers = pd.read_csv('Customers.csv')
products = pd.read_csv('Products.csv')
transactions = pd.read_csv('Transactions.csv')

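# Merge datasets (Transactions and Products both carry a Price column,
# so pandas suffixes them as Price_x and Price_y in the merged frame)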
data = transactions.merge(customers, on='CustomerID').merge(products, on='ProductID')

# Inspect merged dataset
print(data.head())

  TransactionID CustomerID ProductID      TransactionDate  Quantity  \
0        T00001      C0199      P067  2024-08-25 12:38:23         1   
1        T00112      C0146      P067  2024-05-27 22:23:54         1   
2        T00166      C0127      P067  2024-04-25 07:38:55         1   
3        T00272      C0087      P067  2024-03-26 22:55:37         2   
4        T00363      C0070      P067  2024-03-21 15:10:10         3   

   TotalValue  Price_x     CustomerName         Region  SignupDate  \
0      300.68   300.68   Andrea Jenkins         Europe  2022-12-03   
1      300.68   300.68  Brittany Harvey           Asia  2024-09-04   
2      300.68   300.68  Kathryn Stevens         Europe  2024-04-04   
3      601.36   300.68  Travis Campbell  South America  2024-04-11   
4      902.04   300.68    Timothy Perez         Europe  2022-03-15   

                       ProductName     Category  Price_y  
0  ComfortLiving Bluetooth Speaker  Electronics   300.68  
1  ComfortLiving Bluetooth Speaker
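
The `Price_x` / `Price_y` pair in the output above comes from merging two frames that share a `Price` column. As a minimal sketch of one way to avoid the duplicate, the snippet below drops `Price` from the product table before merging and uses `validate` to confirm that each key matches at most one customer or product row. The `*_demo` frames and their values are hypothetical stand-ins for the CSV files, not data from the assignment.

```python
import pandas as pd

# Hypothetical miniature tables that reuse the column names of the real CSVs
transactions_demo = pd.DataFrame({
    "TransactionID": ["T1", "T2", "T3"],
    "CustomerID": ["C1", "C2", "C1"],
    "ProductID": ["P1", "P1", "P2"],
    "Quantity": [1, 2, 1],
    "TotalValue": [10.0, 20.0, 5.0],
    "Price": [10.0, 10.0, 5.0],
})
customers_demo = pd.DataFrame({
    "CustomerID": ["C1", "C2"],
    "Region": ["Europe", "Asia"],
})
products_demo = pd.DataFrame({
    "ProductID": ["P1", "P2"],
    "Category": ["Books", "Toys"],
    "Price": [10.0, 5.0],
})

merged_demo = (
    transactions_demo
    # many_to_one raises if a CustomerID appears more than once in customers_demo
    .merge(customers_demo, on="CustomerID", how="left", validate="many_to_one")
    # Drop the product-side Price so only the transaction-side column survives
    .merge(products_demo.drop(columns="Price"), on="ProductID",
           how="left", validate="many_to_one")
)

print(merged_demo.columns.tolist())
```

Using `how="left"` keeps every transaction even if a key is missing from a lookup table, which makes unmatched rows visible as `NaN` instead of silently dropping them.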

**2. Feature Matrix Creation**

In [2]:
# Create a customer-product interaction matrix
customer_product_matrix = data.pivot_table(
    index='CustomerID',
    columns='ProductID',
    values='TotalValue',
    aggfunc='sum',
    fill_value=0
)

# Check the feature matrix
print(customer_product_matrix.head())

ProductID   P001    P002  P003    P004  P005    P006  P007   P008  P009  P010  \
CustomerID                                                                      
C0001        0.0     0.0   0.0    0.00   0.0    0.00   0.0    0.0   0.0   0.0   
C0002        0.0     0.0   0.0  382.76   0.0    0.00   0.0    0.0   0.0   0.0   
C0003        0.0  1385.2   0.0    0.00   0.0  363.96   0.0    0.0   0.0   0.0   
C0004        0.0     0.0   0.0    0.00   0.0    0.00   0.0  293.7   0.0   0.0   
C0005        0.0     0.0   0.0    0.00   0.0    0.00   0.0    0.0   0.0   0.0   

ProductID   ...  P091  P092  P093  P094    P095    P096    P097  P098  P099  \
CustomerID  ...                                                               
C0001       ...   0.0   0.0   0.0   0.0    0.00  614.94    0.00   0.0   0.0   
C0002       ...   0.0   0.0   0.0   0.0  454.52    0.00    0.00   0.0   0.0   
C0003       ...   0.0   0.0   0.0   0.0    0.00    0.00    0.00   0.0   0.0   
C0004       ...   0.0   0.0   0.0   0

**3. Similarity Calculation**

In [3]:
# Calculate cosine similarity between customers
similarity_matrix = cosine_similarity(customer_product_matrix)

# Convert similarity matrix to a DataFrame for easy manipulation
similarity_df = pd.DataFrame(similarity_matrix,
                             index=customer_product_matrix.index,
                             columns=customer_product_matrix.index)

# Inspect the similarity DataFrame
print(similarity_df.head())

CustomerID  C0001  C0002     C0003     C0004     C0005  C0006     C0007  \
CustomerID                                                                
C0001         1.0    0.0  0.000000  0.000000  0.000000    0.0  0.203038   
C0002         0.0    1.0  0.000000  0.000000  0.000000    0.0  0.000000   
C0003         0.0    0.0  1.000000  0.139782  0.347737    0.0  0.000000   
C0004         0.0    0.0  0.139782  1.000000  0.186362    0.0  0.000000   
C0005         0.0    0.0  0.347737  0.186362  1.000000    0.0  0.000000   

CustomerID     C0008  C0009     C0010  ...    C0191     C0192  C0193  \
CustomerID                             ...                             
C0001       0.000000    0.0  0.000000  ...  0.13837  0.000000    0.0   
C0002       0.095163    0.0  0.000000  ...  0.00000  0.000000    0.0   
C0003       0.004856    0.0  0.000000  ...  0.00000  0.000000    0.0   
C0004       0.016953    0.0  0.071485  ...  0.00000  0.000000    0.0   
C0005       0.000000    0.0  0.000000  ...
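
To make the numbers above easier to read, here is a small sketch of what cosine similarity measures: the dot product of two spend vectors divided by the product of their Euclidean norms. The three vectors and the `cosine` helper are hypothetical illustrations; the zero-norm guard is a choice made for this sketch.

```python
import numpy as np

# Hypothetical spend of three customers across three products
a = np.array([100.0, 0.0, 50.0])
b = np.array([200.0, 0.0, 100.0])  # same product mix as a, twice the spend
c = np.array([0.0, 80.0, 0.0])     # no products in common with a


def cosine(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    # Return 0.0 for an all-zero vector instead of dividing by zero
    return float(u @ v / denom) if denom else 0.0


print(cosine(a, b))  # same direction, so the score is (numerically) 1
print(cosine(a, c))  # no shared products, so the score is 0
```

Two consequences follow for the matrix above: customers who bought the same products in the same proportions score close to 1 regardless of how much they spent, and any pair with no product in common scores exactly 0, which is why the sparse interaction matrix produces so many zeros.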

**4. Recommendation Function**

In [4]:
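# Function to find the top N most similar customers for a given customer (default 3)
def get_top_lookalikes(customer_id, similarity_df, top_n=3):
    # Drop the customer's own entry by label rather than by position, so a tie
    # at the top score cannot leave the customer in their own lookalike list
    similar_customers = similarity_df[customer_id].drop(customer_id).nlargest(top_n)
    # Cast to a plain float so the scores print as bare numbers
    return [(sim_cust, round(float(score), 2)) for sim_cust, score in similar_customers.items()]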

# Test the function with a sample customer
print(get_top_lookalikes('C0001', similarity_df))

[('C0050', 0.53), ('C0100', 0.53), ('C0105', 0.52)]


**5. Generate Recommendations for Target Customers**

In [5]:
# Generate lookalikes for the first 20 customers in the matrix
# (C0001 to C0020, provided each of them has at least one transaction)
lookalikes = {}
for customer_id in similarity_df.index[:20]:  # First 20 customers
    lookalikes[customer_id] = get_top_lookalikes(customer_id, similarity_df)

# Convert lookalikes to a DataFrame for export
lookalike_df = pd.DataFrame({
    "CustomerID": list(lookalikes),
    "Lookalikes": [str(values) for values in lookalikes.values()]
})

# Save lookalike recommendations to CSV
lookalike_df.to_csv("Lookalike.csv", index=False)
print("Lookalike recommendations saved to Lookalike.csv")

Lookalike recommendations saved to Lookalike.csv
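
Because the export stores each lookalike list as a string (`str(values)`), reading `Lookalike.csv` back yields text, not lists. The sketch below shows one way to restore the structure with `ast.literal_eval`. It writes to an in-memory buffer instead of a file, and the single row reuses the sample result printed for `C0001`. It assumes the scores were written as plain numbers, as in that sample output.

```python
import ast
import io

import pandas as pd

# Miniature of the exported frame, using the C0001 result shown earlier
demo = pd.DataFrame({
    "CustomerID": ["C0001"],
    "Lookalikes": [str([("C0050", 0.53), ("C0100", 0.53), ("C0105", 0.52)])],
})

# Round-trip through CSV text (an in-memory buffer stands in for Lookalike.csv)
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
# literal_eval parses the string back into a list of (customer, score) tuples
restored["Lookalikes"] = restored["Lookalikes"].apply(ast.literal_eval)

print(restored.loc[0, "Lookalikes"])
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice here than `eval` for text read from a file.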
