In [12]:
#Step 1: Load and Explore the Data
import pandas as pd

# Load datasets
customers_df = pd.read_csv('/content/Customers.csv')
products_df = pd.read_csv('/content/Products.csv')
transactions_df = pd.read_csv('/content/Transactions.csv')

# Preview the datasets
print("Customers Data:")
print(customers_df.head())

print("Products Data:")
print(products_df.head())

print("Transactions Data:")
print(transactions_df.head())


Customers Data:
  CustomerID        CustomerName         Region  SignupDate
0      C0001    Lawrence Carroll  South America  2022-07-10
1      C0002      Elizabeth Lutz           Asia  2022-02-13
2      C0003      Michael Rivera  South America  2024-03-07
3      C0004  Kathleen Rodriguez  South America  2022-10-09
4      C0005         Laura Weber           Asia  2022-08-15
Products Data:
  ProductID              ProductName     Category   Price
0      P001     ActiveWear Biography        Books  169.30
1      P002    ActiveWear Smartwatch  Electronics  346.30
2      P003  ComfortLiving Biography        Books   44.12
3      P004            BookWorld Rug   Home Decor   95.69
4      P005          TechPro T-Shirt     Clothing  429.31
Transactions Data:
  TransactionID CustomerID ProductID      TransactionDate  Quantity  \
0        T00001      C0199      P067  2024-08-25 12:38:23         1   
1        T00112      C0146      P067  2024-05-27 22:23:54         1   
2        T00166      C0127   

In [13]:
#Step 2: Data Preprocessing and Cleaning
# Convert date columns to datetime format
customers_df['SignupDate'] = pd.to_datetime(customers_df['SignupDate'])
transactions_df['TransactionDate'] = pd.to_datetime(transactions_df['TransactionDate'])

# Merge datasets to create a comprehensive customer profile
merged_df = transactions_df.merge(customers_df, on='CustomerID').merge(products_df, on='ProductID')

# Drop unnecessary columns ('CustomerName' and 'ProductName' not needed for analysis)
merged_df.drop(columns=['CustomerName', 'ProductName'], inplace=True)

# Replace all missing values with 0 (applies to every column; no per-column check is done)
merged_df.fillna(0, inplace=True)


To prepare the data for analysis, we first converted the SignupDate and TransactionDate columns to datetime, which makes date arithmetic and trend analysis straightforward. We then merged the transactions, customers, and products tables into a single dataset on their shared keys, CustomerID and ProductID. Both merges use the pandas default inner join, so a transaction whose CustomerID or ProductID has no match in the lookup tables is dropped. Next, we removed CustomerName and ProductName, which aren’t needed for the analysis. Finally, we replaced any missing values with zeros. This is a blunt fix: it applies to every column regardless of type, so it is only safe when few or no values are missing. The result is a single table ready for feature engineering.
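Before filling gaps with zeros, it can help to see where they are and whether the joins dropped or duplicated rows. The following is a minimal sketch on made-up toy tables, not the notebook's data; the `validate` argument and the `rows_lost` check are additions for illustration.

```python
import pandas as pd

# Toy stand-ins for the three tables; all values are invented for illustration.
customers = pd.DataFrame({
    "CustomerID": ["C1", "C2"],
    "Region": ["Asia", None],  # one customer with a missing region
})
products = pd.DataFrame({
    "ProductID": ["P1", "P2"],
    "Category": ["Books", "Clothing"],
})
transactions = pd.DataFrame({
    "TransactionID": ["T1", "T2", "T3"],
    "CustomerID": ["C1", "C2", "C2"],
    "ProductID": ["P1", "P2", "P1"],
    "Quantity": [1, 2, 1],
})

# validate="many_to_one" raises an error if a key is duplicated in the
# lookup table, which would otherwise silently multiply transaction rows.
merged = (
    transactions
    .merge(customers, on="CustomerID", how="inner", validate="many_to_one")
    .merge(products, on="ProductID", how="inner", validate="many_to_one")
)

# Count missing values per column before deciding how to fill them.
missing_per_column = merged.isna().sum()
# Inner joins drop transactions whose keys have no match.
rows_lost = len(transactions) - len(merged)

print(missing_per_column[missing_per_column > 0])
print(f"Transactions dropped by the inner joins: {rows_lost}")
```

If only a categorical column such as Region has gaps, a label like "Unknown" may be a more honest fill value than 0.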

In [14]:
#Step 3: Feature Engineering
# Aggregate features to create a customer profile
customer_features = merged_df.groupby('CustomerID').agg({
    'Quantity': 'sum',  # Total items purchased
    'TotalValue': 'sum',  # Total spending
    'TransactionID': 'count',  # Purchase frequency
    'TransactionDate': lambda x: (pd.to_datetime('today') - x.max()).days  # Recency in days, measured from the run date
}).reset_index()

# Rename columns for better readability
customer_features.rename(columns={
    'Quantity': 'TotalQuantity',
    'TotalValue': 'TotalSpending',
    'TransactionID': 'PurchaseFrequency',
    'TransactionDate': 'Recency'
}, inplace=True)

# Merge with customer demographic details
final_customer_data = customer_features.merge(customers_df[['CustomerID', 'Region']], on='CustomerID')


We created a detailed customer profile by grouping data based on CustomerID and calculating key features like total items purchased (TotalQuantity), total spending (TotalSpending), number of transactions (PurchaseFrequency), and days since the last purchase (Recency). These features give insights into each customer's purchasing behavior. The columns were renamed for better clarity, and we merged this data with demographic details like Region to provide a complete view of each customer.
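Because Recency is measured from today's date, its values change every time the notebook is rerun. One possible alternative, sketched here on made-up transactions, is to measure it from a fixed snapshot date. Using the latest transaction in the data as the snapshot is an assumption for illustration, not the notebook's method.

```python
import pandas as pd

# Invented transactions for two customers.
tx = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2"],
    "TransactionDate": pd.to_datetime(
        ["2024-01-10 09:00:00", "2024-03-01 18:30:00", "2024-02-15 12:00:00"]
    ),
})

# Assumption: the most recent transaction in the data is the reference point,
# so the result does not depend on when the code is run.
snapshot = tx["TransactionDate"].max()

# Last purchase per customer, then whole days between it and the snapshot.
last_purchase = tx.groupby("CustomerID")["TransactionDate"].max()
recency_days = (snapshot - last_purchase).dt.days

print(recency_days)
```

Since the features are standardized afterwards, moving the reference date shifts every customer's Recency by the same amount and should not change the scaled values.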

In [15]:
#Step 4: Data Preprocessing and Scaling
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Encode categorical variables (Region)
encoder = LabelEncoder()
final_customer_data['Region'] = encoder.fit_transform(final_customer_data['Region'])

# Standardize numerical features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(final_customer_data.drop(columns=['CustomerID']))

# Convert to DataFrame for easy reference
scaled_df = pd.DataFrame(scaled_features, columns=final_customer_data.columns[1:])
scaled_df['CustomerID'] = final_customer_data['CustomerID']


To prepare the data for modeling, we first encoded the categorical Region column as integers with LabelEncoder. We then standardized every feature column with StandardScaler so that each has zero mean and unit variance, and no single feature dominates the distance calculation because of its units. Note that the integer Region codes are standardized along with the numeric features, which treats region as if it had an order; the codes are assigned alphabetically and carry no such meaning. The scaled array was converted back into a DataFrame with the original column names, and the CustomerID column was re-added so each row can be traced to its customer.
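A common alternative to integer codes for an unordered category is one-hot encoding. The sketch below uses made-up profiles and plain pandas to show the idea. It is not what the notebook does, and the column names are illustrative.

```python
import pandas as pd

# Invented customer profiles.
profiles = pd.DataFrame({
    "CustomerID": ["C1", "C2", "C3"],
    "TotalSpending": [100.0, 200.0, 300.0],
    "Region": ["Asia", "Europe", "Asia"],
})

# One-hot encode Region: one 0/1 column per region, with no implied order.
region_dummies = pd.get_dummies(profiles["Region"], prefix="Region", dtype=float)

# Z-score the numeric column. ddof=0 is the population standard deviation,
# which is what StandardScaler uses.
numeric = profiles[["TotalSpending"]]
scaled_numeric = (numeric - numeric.mean()) / numeric.std(ddof=0)

# Keep CustomerID alongside the features for later lookup.
features = pd.concat(
    [profiles[["CustomerID"]], scaled_numeric, region_dummies], axis=1
)
print(features)
```

Whether the one-hot columns should also be scaled or weighted is a modeling choice, because it controls how much region counts relative to purchasing behavior.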

In [16]:
#Step 5: Train KNN Model and Find Similar Customers
from sklearn.neighbors import NearestNeighbors

# Fit a nearest-neighbor index using cosine distance (4 neighbors: the query itself plus 3 lookalikes)
knn_model = NearestNeighbors(n_neighbors=4, metric='cosine')
knn_model.fit(scaled_features)

# Find lookalikes for the first 20 customers (C0001 - C0020)
customer_ids = final_customer_data['CustomerID'][:20].values
lookalike_map = {}

for idx, customer in enumerate(scaled_features[:20]):
    distances, indices = knn_model.kneighbors([customer])
    similar_customers = final_customer_data.iloc[indices[0][1:]]  # Exclude itself

    lookalike_map[customer_ids[idx]] = list(zip(
        similar_customers['CustomerID'].values, distances[0][1:].round(2)
    ))


We used scikit-learn's NearestNeighbors, an unsupervised nearest-neighbor search, to find similar customers from their standardized features. The model was fitted with metric='cosine', so the values it returns are cosine distances (1 minus cosine similarity): 0 means two feature vectors point in the same direction, and larger values mean less similar customers. We request four neighbors because the closest match to each query is normally the customer itself, which is dropped to leave three lookalikes. For demonstration, we looked up the first 20 customer profiles in the dataset. The lookalike_map stores, for each of them, the IDs of the three nearest customers together with the corresponding cosine distances, rounded to two decimals.
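The relationship between cosine similarity and the cosine distance reported by the model can be checked by hand. This is a small sketch in plain Python; the helper names are illustrative and not part of scikit-learn.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance, defined as 1 minus cosine similarity."""
    return 1.0 - cosine_similarity(a, b)

# Same direction, different length: distance is (numerically) 0.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: distance is 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
# Opposite direction: distance is 2, the maximum.
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Cosine distance ignores vector length, so two customers with proportional standardized profiles count as identical even when one profile is much larger in magnitude.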

In [18]:
# Step 6: Model evaluation
import numpy as np

# Average the cosine distances of each customer's top 3 lookalikes (lower = more similar)
avg_similarity_scores = []

for cust_id, lookalikes in lookalike_map.items():
    scores = [score for _, score in lookalikes]
    avg_similarity_scores.append(np.mean(scores))

print(f'Average Similarity Score: {np.mean(avg_similarity_scores):.4f}')


Average Similarity Score: 0.0632


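The evaluation printed an Average Similarity Score of 0.0632. Because the model uses metric='cosine', this value is actually the mean cosine distance between each of the 20 customers and their three lookalikes, which corresponds to a mean cosine similarity of about 1 - 0.0632 = 0.9368. A distance this close to zero indicates that the recommended lookalikes have feature vectors pointing in nearly the same direction as the query customer. It measures how tight the neighborhoods are, not whether the recommendations are useful, since there is no ground truth here for what counts as a true lookalike.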
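If similarity scores are wanted in the output, where higher means more similar, the stored distances can be converted. The sketch below uses a made-up lookalike_map with the same shape as the notebook's; the customer IDs and distances are invented.

```python
# Invented example with the same structure as lookalike_map:
# customer ID -> list of (lookalike ID, cosine distance).
lookalike_map = {
    "C0001": [("C0107", 0.02), ("C0137", 0.05), ("C0184", 0.08)],
    "C0002": [("C0088", 0.01), ("C0142", 0.10), ("C0159", 0.04)],
}

# Convert each cosine distance to a cosine similarity (1 - distance).
similarity_map = {
    customer: [(other, round(1.0 - dist, 2)) for other, dist in pairs]
    for customer, pairs in lookalike_map.items()
}

# Average over all stored pairs, in both forms.
all_distances = [dist for pairs in lookalike_map.values() for _, dist in pairs]
mean_distance = sum(all_distances) / len(all_distances)
mean_similarity = 1.0 - mean_distance

print(f"Mean cosine distance:   {mean_distance:.4f}")
print(f"Mean cosine similarity: {mean_similarity:.4f}")
```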

In [20]:
# Convert the lookalike dictionary into a DataFrame and save to CSV
lookalike_df = pd.DataFrame(list(lookalike_map.items()), columns=['CustomerID', 'Lookalikes'])
lookalike_df.to_csv('Lookalike.csv', index=False)

print("Lookalike recommendations saved successfully in Lookalike.csv")


Lookalike recommendations saved successfully in Lookalike.csv
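Saving a column of Python lists writes each list as its string representation, which is awkward to parse back. Depending on the NumPy version, the scores may even be written as `np.float64(...)`. One possible alternative is a flat layout with one row per customer and lookalike pair. The sketch below uses the standard csv module, an in-memory buffer in place of a file, and invented IDs and column names.

```python
import csv
import io

# Invented example with the same structure as lookalike_map.
lookalike_map = {
    "C0001": [("C0107", 0.02), ("C0137", 0.05)],
    "C0002": [("C0088", 0.01), ("C0142", 0.10)],
}

# Write one row per (customer, lookalike) pair. An in-memory buffer stands in
# for a real file such as Lookalike.csv.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["CustomerID", "LookalikeID", "CosineDistance"])
for customer, pairs in lookalike_map.items():
    for other, dist in pairs:
        # float() turns NumPy scalars into plain Python floats before writing.
        writer.writerow([customer, other, float(dist)])

# Read the rows back to confirm the layout round-trips cleanly.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0])
```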
