# Zeotap Data Science Assignment

### Task 2: Lookalike Model

Build a **Lookalike Model** that takes a user's information as input and recommends **3 similar customers** based on their profile and transaction history.

#### Requirements:
- Use both **customer** and **product** information to build the model.
- Assign a **similarity score** to each recommended customer.

#### Deliverables:
1. Provide the top **3 lookalikes** with their similarity scores for the first 20 customers (CustomerID: C0001 - C0020) from `Customers.csv`.
2. Create a file named **Lookalike.csv**, which contains a single map:  
   `Map<cust_id, List<cust_id, score>>`.  
   For example:  
   ```plaintext
   C0001: [(C0002, 0.95), (C0003, 0.87), (C0004, 0.83)]
   ```


# Import Libraries

In [110]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load Datasets

In [111]:
customers = pd.read_csv('Customers.csv')
products = pd.read_csv('Products.csv')
transactions = pd.read_csv('Transactions.csv')

customers.head()

Unnamed: 0,CustomerID,CustomerName,Region,SignupDate
0,C0001,Lawrence Carroll,South America,2022-07-10
1,C0002,Elizabeth Lutz,Asia,2022-02-13
2,C0003,Michael Rivera,South America,2024-03-07
3,C0004,Kathleen Rodriguez,South America,2022-10-09
4,C0005,Laura Weber,Asia,2022-08-15


In [112]:
customers.dtypes

Unnamed: 0,0
CustomerID,object
CustomerName,object
Region,object
SignupDate,object


In [113]:
products.head()

Unnamed: 0,ProductID,ProductName,Category,Price
0,P001,ActiveWear Biography,Books,169.3
1,P002,ActiveWear Smartwatch,Electronics,346.3
2,P003,ComfortLiving Biography,Books,44.12
3,P004,BookWorld Rug,Home Decor,95.69
4,P005,TechPro T-Shirt,Clothing,429.31


In [114]:
transactions.head()

Unnamed: 0,TransactionID,CustomerID,ProductID,TransactionDate,Quantity,TotalValue,Price
0,T00001,C0199,P067,2024-08-25 12:38:23,1,300.68,300.68
1,T00112,C0146,P067,2024-05-27 22:23:54,1,300.68,300.68
2,T00166,C0127,P067,2024-04-25 07:38:55,1,300.68,300.68
3,T00272,C0087,P067,2024-03-26 22:55:37,2,601.36,300.68
4,T00363,C0070,P067,2024-03-21 15:10:10,3,902.04,300.68


In [115]:
transactions.dtypes

Unnamed: 0,0
TransactionID,object
CustomerID,object
ProductID,object
TransactionDate,object
Quantity,int64
TotalValue,float64
Price,float64


Convert the SignupDate column to datetime format and compute the number of days since signup

In [116]:
customers['SignupDate'] = pd.to_datetime(customers['SignupDate'])
customers['DaysSinceSignup'] = (pd.Timestamp.now() - customers['SignupDate']).dt.days

Encode the Region column to numerical values

In [117]:
region_encoder = LabelEncoder()
customers['RegionEncoded'] = region_encoder.fit_transform(customers['Region'])
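
`LabelEncoder` assigns integers in alphabetical order, so the codes imply an ordering between regions that has no meaning. Note also that `RegionEncoded` is not included in the features passed to the similarity step later in this notebook. If Region were to be used as a feature, one-hot encoding is one common alternative. The following is a minimal sketch on toy rows taken from the sample output above; the names `toy` and `region_dummies` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the Region column (values copied from the sample rows above)
toy = pd.DataFrame({
    "CustomerID": ["C0001", "C0002", "C0003"],
    "Region": ["South America", "Asia", "South America"],
})

# One indicator column per region, so no ordering between regions is implied
region_dummies = pd.get_dummies(toy["Region"], prefix="Region", dtype=float)
print(region_dummies)
```

The resulting columns could be stacked next to the scaled numeric features, although whether that improves the lookalikes would need to be checked on the real data.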

In [118]:
customers.head()

Unnamed: 0,CustomerID,CustomerName,Region,SignupDate,DaysSinceSignup,RegionEncoded
0,C0001,Lawrence Carroll,South America,2022-07-10,932,3
1,C0002,Elizabeth Lutz,Asia,2022-02-13,1079,0
2,C0003,Michael Rivera,South America,2024-03-07,326,3
3,C0004,Kathleen Rodriguez,South America,2022-10-09,841,3
4,C0005,Laura Weber,Asia,2022-08-15,896,0


Aggregate each customer's purchased product IDs and mean price

In [119]:
product_agg = transactions.groupby('CustomerID').agg({
    'ProductID': lambda x: ' '.join(x),
    'Price': 'mean',
}).reset_index()

In [120]:
product_agg.head()

Unnamed: 0,CustomerID,ProductID,Price
0,C0001,P054 P022 P096 P083 P029,278.334
1,C0002,P095 P004 P019 P071,208.92
2,C0003,P025 P006 P035 P002,195.7075
3,C0004,P049 P053 P038 P025 P097 P024 P008 P077,240.63625
4,C0005,P025 P039 P012,291.603333


In [121]:
customer_data = customers.merge(product_agg, on="CustomerID", how="left").fillna('')

In [122]:
customer_data.head()

Unnamed: 0,CustomerID,CustomerName,Region,SignupDate,DaysSinceSignup,RegionEncoded,ProductID,Price
0,C0001,Lawrence Carroll,South America,2022-07-10,932,3,P054 P022 P096 P083 P029,278.334
1,C0002,Elizabeth Lutz,Asia,2022-02-13,1079,0,P095 P004 P019 P071,208.92
2,C0003,Michael Rivera,South America,2024-03-07,326,3,P025 P006 P035 P002,195.7075
3,C0004,Kathleen Rodriguez,South America,2022-10-09,841,3,P049 P053 P038 P025 P097 P024 P008 P077,240.63625
4,C0005,Laura Weber,Asia,2022-08-15,896,0,P025 P039 P012,291.603333


In [123]:
customer_data.dtypes

Unnamed: 0,0
CustomerID,object
CustomerName,object
Region,object
SignupDate,datetime64[ns]
DaysSinceSignup,int64
RegionEncoded,int64
ProductID,object
Price,object


# Task

Replace empty strings in the Price column with 0

In [124]:
customer_data['Price'] = customer_data['Price'].replace('', 0)


Convert the Price column to float

In [125]:
customer_data['Price'] = customer_data['Price'].astype(float)
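
The `object` dtype on `Price` appears to come from the blanket `.fillna('')` in the merge step, which writes an empty string into a numeric column for any customer without transactions. A possible alternative, sketched below on toy data, is to fill each column with a value of its own type so that the later `replace` and `astype` steps are not needed. The frames `customers_toy` and `agg_toy` are made up for illustration.

```python
import pandas as pd

# Toy frames: C0003 has no transactions, so the left merge leaves NaN for it
customers_toy = pd.DataFrame({"CustomerID": ["C0001", "C0002", "C0003"]})
agg_toy = pd.DataFrame({
    "CustomerID": ["C0001", "C0002"],
    "ProductID": ["P054 P022", "P095"],
    "Price": [278.334, 208.92],
})

merged = customers_toy.merge(agg_toy, on="CustomerID", how="left")

# Fill per column: empty string for the text column, 0.0 for the numeric one
merged = merged.fillna({"ProductID": "", "Price": 0.0})
print(merged.dtypes)
```

With this approach `Price` stays `float64` throughout. Whether 0 is the right stand-in for a customer with no purchases is a separate modelling choice.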

Vectorize the ProductID column using TF-IDF

In [126]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [127]:
vectorizer = TfidfVectorizer()
product_vectors = vectorizer.fit_transform(customer_data['ProductID'])

In [128]:
print(product_vectors)

  (0, 53)	0.44670179082042977
  (0, 21)	0.44670179082042977
  (0, 95)	0.44670179082042977
  (0, 82)	0.4635960154130754
  (0, 28)	0.4318000286319847
  (1, 94)	0.5155574644782344
  (1, 3)	0.5155574644782344
  (1, 18)	0.46606709394813095
  (1, 70)	0.5011810706524065
  (2, 24)	0.48458565078107324
  (2, 5)	0.47497937887137304
  (2, 34)	0.5062400951581463
  (2, 1)	0.5322521045128139
  (3, 24)	0.33280285984008173
  (3, 48)	0.3200634696551294
  (3, 52)	0.34767466012001563
  (3, 37)	0.32620548173200814
  (3, 96)	0.35615954069957473
  (3, 23)	0.4016352120657565
  (3, 7)	0.34767466012001563
  (3, 76)	0.3879121227559116
  (4, 24)	0.5584582469547062
  (4, 38)	0.5584582469547062
  (4, 11)	0.6133912069931822
  (5, 39)	0.48348173324614474
  :	:
  (193, 89)	0.3993835682650084
  (193, 4)	0.3993835682650084
  (194, 83)	0.3984079335232049
  (194, 66)	0.3984079335232049
  (194, 58)	0.37108290706347963
  (194, 46)	0.4246291084421297
  (194, 27)	0.4064655779403157
  (194, 32)	0.44644772069094874
  (195, 19)	

Scale the numeric features (DaysSinceSignup and Price)

In [129]:
from sklearn.preprocessing import StandardScaler

In [130]:
scaler = StandardScaler()
numeric_features = scaler.fit_transform(customer_data[['DaysSinceSignup', 'Price']])
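
As a quick check of what the scaler does, the sketch below standardizes three rows of `DaysSinceSignup` and `Price` copied from the sample output above and confirms that each column ends up with zero mean and unit standard deviation. The array `raw` is illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: DaysSinceSignup, Price (first three sample rows shown earlier)
raw = np.array([
    [932.0, 278.334],
    [1079.0, 208.92],
    [326.0, 195.7075],
])

scaled = StandardScaler().fit_transform(raw)

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```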

In [131]:
numeric_features

array([[ 1.15288412e+00,  1.10365810e-01],
       [ 1.60559336e+00, -8.54626425e-01],
       [-7.13386604e-01, -1.03830637e+00],
       [ 8.72635547e-01, -4.13707663e-01],
       [ 1.04201655e+00,  2.94835854e-01],
       [-5.28607325e-01,  8.56916189e-01],
       [ 1.22063652e+00,  1.10694335e+00],
       [-5.47085253e-01, -5.45446328e-01],
       [-7.89777438e-02,  3.95347189e-01],
       [ 6.66298685e-01, -1.62827177e+00],
       [ 6.75537649e-01,  3.09080752e-01],
       [-1.18457377e+00,  5.60841110e-03],
       [-9.38201395e-01,  7.00588392e-01],
       [-1.04290965e+00, -1.54402575e+00],
       [-3.80783901e-01,  1.18715779e+00],
       [-5.16288706e-01,  1.68151230e+00],
       [-4.26978721e-01, -6.20169485e-01],
       [-1.32315823e+00, -2.23226170e-01],
       [ 5.80068354e-01, -4.13748210e-01],
       [-1.00903345e+00,  5.15413981e-01],
       [ 2.87501161e-01,  5.30793049e-01],
       [-3.06872189e-01,  1.25162153e-01],
       [ 1.54707992e+00,  1.72581328e+00],
       [-6.

Combine the scaled numeric features with the TF-IDF product vectors

In [132]:
from scipy.sparse import hstack
combined_features = hstack([numeric_features, product_vectors])

Calculate the cosine similarity matrix

In [133]:
similarity_matrix = cosine_similarity(combined_features)
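
Because each TF-IDF row has unit length while the two standardized numeric columns can each have magnitude around 1 or more, the numeric block may carry much of the weight in the combined cosine similarity. The sketch below illustrates this on toy data. The helper `combined_similarity` and its `numeric_weight` parameter are hypothetical, not part of the notebook, and the weight values are arbitrary.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.metrics.pairwise import cosine_similarity

def combined_similarity(numeric, tfidf, numeric_weight):
    """Cosine similarity on [weighted numeric | tfidf] (hypothetical helper)."""
    combined = hstack([csr_matrix(numeric * numeric_weight), tfidf], format="csr")
    return cosine_similarity(combined)

# Customers 0 and 1 share numeric features; customers 0 and 2 share products
numeric = np.array([[1.0, 0.5], [1.0, 0.5], [-1.0, -0.5]])
tfidf = csr_matrix(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]))

sim_equal = combined_similarity(numeric, tfidf, numeric_weight=1.0)
sim_down = combined_similarity(numeric, tfidf, numeric_weight=0.5)

print(sim_equal.round(2))
print(sim_down.round(2))
```

In this toy case, customer 0's nearest neighbour is customer 1 at weight 1.0 and customer 2 at weight 0.5, which shows that the ranking can depend on how the two feature blocks are balanced.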

Generate recommendations for the first 20 customers

In [134]:
recommendations = {}
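for idx, customer_id in enumerate(customers['CustomerID'][:20]):
    # Drop the customer's own entry by index; slicing [1:4] assumes self always sorts first, which ties can break
    scores = [(i, score) for i, score in enumerate(similarity_matrix[idx]) if i != idx]
    similar_customers = sorted(scores, key=lambda x: x[1], reverse=True)[:3]
    # Cast to a plain float so the scores are written to the CSV as bare numbers
    recommendations[customer_id] = [(customers.iloc[i]['CustomerID'], round(float(score), 2)) for i, score in similar_customers]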


Create a DataFrame with the lookalike customer recommendations

In [135]:
lookalike_data = [{"cust_id": cust_id, "similar_list": recs} for cust_id, recs in recommendations.items()]
lookalike_df = pd.DataFrame(lookalike_data)
lookalike_df.to_csv("Lookalike.csv", index=False)
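
Since `similar_list` is written to the CSV as the string form of a Python list, a reader has to parse it back to recover the `Map<cust_id, List<cust_id, score>>` structure. One possible way is sketched below with `ast.literal_eval`. It assumes the IDs are plain strings and the scores are plain Python floats, and it uses an in-memory buffer in place of `Lookalike.csv`; the single row is copied from the output shown below.

```python
import ast
import io

import pandas as pd

# One row in the same shape the notebook writes out
recs = {"C0001": [("C0104", 0.71), ("C0045", 0.69), ("C0160", 0.65)]}
df = pd.DataFrame([{"cust_id": k, "similar_list": v} for k, v in recs.items()])

# Round-trip through CSV text (a StringIO buffer stands in for the file)
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
# The column comes back as text; literal_eval safely rebuilds the list of tuples
loaded["similar_list"] = loaded["similar_list"].apply(ast.literal_eval)

lookalike_map = dict(zip(loaded["cust_id"], loaded["similar_list"]))
print(lookalike_map["C0001"])
```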

Display the lookalike customer recommendations DataFrame

In [136]:
lookalike_df

Unnamed: 0,cust_id,similar_list
0,C0001,"[(C0104, 0.71), (C0045, 0.69), (C0160, 0.65)]"
1,C0002,"[(C0109, 0.84), (C0176, 0.83), (C0134, 0.82)]"
2,C0003,"[(C0144, 0.8), (C0151, 0.74), (C0033, 0.73)]"
3,C0004,"[(C0053, 0.67), (C0175, 0.64), (C0160, 0.63)]"
4,C0005,"[(C0192, 0.64), (C0023, 0.64), (C0118, 0.64)]"
5,C0006,"[(C0058, 0.78), (C0078, 0.68), (C0048, 0.66)]"
6,C0007,"[(C0079, 0.84), (C0023, 0.78), (C0040, 0.78)]"
7,C0008,"[(C0144, 0.62), (C0111, 0.6), (C0151, 0.56)]"
8,C0009,"[(C0156, 0.47), (C0140, 0.44), (C0085, 0.39)]"
9,C0010,"[(C0060, 0.84), (C0091, 0.78), (C0180, 0.78)]"
