
**TASK 1: Tabular Data Merge & Product Recommendation Model**

Uploading the Datasets

In [None]:
from google.colab import files
uploaded = files.upload()

Saving customer_social_profiles - customer_social_profiles.csv to customer_social_profiles - customer_social_profiles (1).csv


In [None]:
uploaded.keys()

dict_keys(['customer_social_profiles - customer_social_profiles (1).csv'])

In [None]:
from google.colab import files
uploaded = files.upload()

Saving customer_transactions - customer_transactions.csv to customer_transactions - customer_transactions (2).csv


In [None]:
uploaded.keys()

dict_keys(['customer_transactions - customer_transactions (2).csv'])
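
The dictionary returned by `files.upload()` maps each saved filename to the file's raw bytes, so the data can also be read straight from that dictionary instead of retyping names that pick up suffixes such as ` (1)` or ` (2)` on re-upload. The sketch below is one possible way to do that; `read_uploaded_csv` is a hypothetical helper, and `demo_upload` is a stand-in dictionary with made-up rows so the example runs outside Colab.

```python
import io

import pandas as pd


def read_uploaded_csv(uploaded, name_fragment):
    """Return a DataFrame from the first uploaded file whose name contains name_fragment.

    `uploaded` is assumed to be a dict of {filename: bytes}, the shape that
    google.colab's files.upload() returns. This helper is hypothetical.
    """
    for filename, content in uploaded.items():
        if name_fragment in filename:
            return pd.read_csv(io.BytesIO(content))
    raise KeyError(f"No uploaded file name contains {name_fragment!r}")


# Stand-in for the dict returned by files.upload(); the rows are illustrative only.
demo_upload = {
    "customer_transactions - customer_transactions (2).csv":
        b"customer_id_legacy,transaction_id\n151,1001\n192,1002\n",
}

demo_transactions = read_uploaded_csv(demo_upload, "customer_transactions")
print(demo_transactions.shape)
```

Note that each call to `files.upload()` in this notebook reassigns `uploaded`, so with this approach each file would need to be read (or the dictionary stored under another name) before the next upload.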

Load and Explore the Data

In [None]:
import pandas as pd

# Load the files by their exact uploaded names
transactions = pd.read_csv('customer_transactions - customer_transactions (2).csv')
social = pd.read_csv('customer_social_profiles - customer_social_profiles (1).csv')

# Preview data
print("Transactions:")
print(transactions.head())

print("\nSocial Profiles:")
print(social.head())

Transactions:
   customer_id_legacy  transaction_id  purchase_amount purchase_date  \
0                 151            1001              408    2024-01-01   
1                 192            1002              332    2024-01-02   
2                 114            1003              442    2024-01-03   
3                 171            1004              256    2024-01-04   
4                 160            1005               64    2024-01-05   

  product_category  customer_rating  
0           Sports              2.3  
1      Electronics              4.2  
2      Electronics              2.1  
3         Clothing              2.8  
4         Clothing              1.3  

Social Profiles:
  customer_id_new social_media_platform  engagement_score  \
0            A178              LinkedIn                74   
1            A190               Twitter                82   
2            A150              Facebook                96   
3            A162               Twitter                89   
4 

Check columns to prepare for merging

In [None]:
print("Transactions columns:", transactions.columns.tolist())
print("Social columns:", social.columns.tolist())

Transactions columns: ['customer_id_legacy', 'transaction_id', 'purchase_amount', 'purchase_date', 'product_category', 'customer_rating']
Social columns: ['customer_id_new', 'social_media_platform', 'engagement_score', 'purchase_interest_score', 'review_sentiment']


**Prepare for Merging**

Rename **customer_id_new** and **customer_id_legacy** to a common name.
Both datasets identify the customer, but under slightly different column names:

**transactions:** uses customer_id_legacy

**social:** uses customer_id_new

We rename both columns to a common name, customer_id, so the two tables can be merged on it.

In [None]:
# Rename columns in both DataFrames
transactions.rename(columns={'customer_id_legacy': 'customer_id'}, inplace=True)
social.rename(columns={'customer_id_new': 'customer_id'}, inplace=True)

Convert both IDs to the same type and merge

In [None]:
# 1. Ensure both IDs are strings (safest and quickest)
transactions['customer_id'] = transactions['customer_id'].astype(str)
social['customer_id']      = social['customer_id'].astype(str)

# 2. Now merge
merged_df = pd.merge(
    transactions,
    social,
    on='customer_id',
    how='inner'        # keep only customers that appear in BOTH tables
)

print("Merged shape:", merged_df.shape)
merged_df.head()

Merged shape: (0, 10)


| | customer_id | transaction_id | purchase_amount | purchase_date | product_category | customer_rating | social_media_platform | engagement_score | purchase_interest_score | review_sentiment |
|---|---|---|---|---|---|---|---|---|---|---|


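A merged shape of (0, 10) means that none of the customer_id values matched between the two datasets. Casting both columns to string is not enough on its own: the transactions IDs are plain numbers (e.g., 151), while the social IDs carry a letter prefix (e.g., A178). To fix this, we align the ID format by keeping only the digits of the social IDs, make sure both IDs are strings, and merge the datasets again.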
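
Before changing the IDs, the mismatch described above can be confirmed with a quick diagnostic. The sketch below uses an outer merge with `indicator=True`, which adds a `_merge` column saying whether each row came from the left table, the right table, or both. `tx_demo` and `social_demo` are small illustrative frames that only mimic the two ID formats; they are not the real data.

```python
import pandas as pd

# Toy frames that mimic the two ID formats seen in the previews (assumed values).
tx_demo = pd.DataFrame({"customer_id": ["151", "192", "114"]})
social_demo = pd.DataFrame({"customer_id": ["A178", "A190", "A151"]})

# An outer merge keeps every row and labels its origin in the `_merge` column.
check = pd.merge(tx_demo, social_demo, on="customer_id", how="outer", indicator=True)

n_both = int((check["_merge"] == "both").sum())
n_left_only = int((check["_merge"] == "left_only").sum())
n_right_only = int((check["_merge"] == "right_only").sum())
print("matched:", n_both, "| transactions only:", n_left_only, "| social only:", n_right_only)

# Printing a few raw values from each side makes the format difference visible.
print(tx_demo["customer_id"].head().tolist())
print(social_demo["customer_id"].head().tolist())
```

A `matched` count of zero alongside non-zero counts on both sides points to a key-format problem rather than genuinely disjoint customers.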

In [None]:
# Keep only the first run of digits, dropping prefixes such as "A"
# (raw string so that \d is not treated as an invalid escape sequence)
social['customer_id'] = social['customer_id'].str.extract(r'(\d+)', expand=False)

In [None]:
social['customer_id'] = social['customer_id'].astype(str)
transactions['customer_id'] = transactions['customer_id'].astype(str)

In [None]:
merged_df = pd.merge(transactions, social, on='customer_id', how='inner')
print(" New merged shape:", merged_df.shape)
merged_df.head()

 New merged shape: (219, 10)


| | customer_id | transaction_id | purchase_amount | purchase_date | product_category | customer_rating | social_media_platform | engagement_score | purchase_interest_score | review_sentiment |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 151 | 1001 | 408 | 2024-01-01 | Sports | 2.3 | TikTok | 61 | 1.3 | Neutral |
| 1 | 151 | 1001 | 408 | 2024-01-01 | Sports | 2.3 | Twitter | 72 | 1.6 | Neutral |
| 2 | 151 | 1001 | 408 | 2024-01-01 | Sports | 2.3 | Twitter | 82 | 3.6 | Negative |
| 3 | 192 | 1002 | 332 | 2024-01-02 | Electronics | 4.2 | Instagram | 60 | 4.3 | Positive |
| 4 | 114 | 1003 | 442 | 2024-01-03 | Electronics | 2.1 | Facebook | 87 | 4.8 | Negative |
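
In the preview above, transaction 1001 appears three times, once for each social profile that shares customer_id 151, so the merge is one-to-many on the social side. A minimal sketch of how that could be checked is shown below, assuming toy frames (`tx_demo`, `social_demo`) built from the first few preview rows. It counts profiles per customer and uses the `validate` argument of `pd.merge`, which raises `pandas.errors.MergeError` when the stated key relationship does not hold.

```python
import pandas as pd

# Toy frames modelled on the first preview rows (illustrative, not the full data).
tx_demo = pd.DataFrame({"customer_id": ["151", "192"], "transaction_id": [1001, 1002]})
social_demo = pd.DataFrame({
    "customer_id": ["151", "151", "151", "192"],
    "engagement_score": [61, 72, 82, 60],
})

# How many social profiles does each customer have?
profiles_per_customer = social_demo["customer_id"].value_counts()
print(profiles_per_customer)

# validate="many_to_one" requires the right-hand keys to be unique.
try:
    pd.merge(tx_demo, social_demo, on="customer_id", how="inner", validate="many_to_one")
    merge_is_many_to_one = True
except pd.errors.MergeError:
    merge_is_many_to_one = False
print("right-hand keys unique:", merge_is_many_to_one)
```

Whether the repeated rows are wanted depends on the modelling goal; aggregating the social table to one row per customer before merging would be one alternative.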


Clean the Data

In [None]:
# See how much missing data remains
print(merged_df.isnull().sum())

# Drop any rows with missing values for now
merged_df.dropna(inplace=True)
print(" After cleaning:", merged_df.shape)

customer_id                 0
transaction_id              0
purchase_amount             0
purchase_date               0
product_category            0
customer_rating            19
social_media_platform       0
engagement_score            0
purchase_interest_score     0
review_sentiment            0
dtype: int64
 After cleaning: (200, 10)
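
All 19 missing values sit in customer_rating, so dropping rows costs about 9% of the merged data. As a possible alternative to dropping "for now", the sketch below fills the gaps with the column median. This is not what the notebook does, and `ratings_demo` is an illustrative frame with made-up values.

```python
import numpy as np
import pandas as pd

# Illustrative ratings with two gaps (not the real data).
ratings_demo = pd.DataFrame({"customer_rating": [2.3, np.nan, 4.2, 2.1, np.nan]})

# The median ignores NaN by default and is less sensitive to outliers than the mean.
median_rating = ratings_demo["customer_rating"].median()
ratings_demo["customer_rating"] = ratings_demo["customer_rating"].fillna(median_rating)

print(median_rating)
print(ratings_demo["customer_rating"].isnull().sum())
```

If imputation is used ahead of model training, computing the median on the training split only avoids leaking test-set information.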


**Feature Engineering (Preparing Data for ML)**


In [None]:
# One-hot encode categorical input features
merged_df_encoded = pd.get_dummies(
    merged_df,
    columns=['social_media_platform', 'review_sentiment'],
    drop_first=True
)

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
merged_df_encoded['product_category_encoded'] = label_encoder.fit_transform(merged_df_encoded['product_category'])
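
`LabelEncoder` assigns integer codes to the sorted class names, and the fitted encoder can turn predicted codes back into category names with `inverse_transform`. The short sketch below illustrates this on a toy list of categories taken from the preview; `demo_encoder` is a separate encoder created only for the example.

```python
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(["Sports", "Electronics", "Electronics", "Clothing"])

# classes_ holds the category names in sorted order; the position is the code.
label_map = {name: code for code, name in enumerate(demo_encoder.classes_)}
print(label_map)

# Map integer predictions back to readable category names.
decoded = demo_encoder.inverse_transform([0, 2])
print(list(decoded))
```

Keeping the fitted encoder around is what makes the model's integer predictions readable as product recommendations later on.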

Split the Data (X = features, y = target)

In [None]:
from sklearn.model_selection import train_test_split

X = merged_df_encoded.drop(['product_category', 'product_category_encoded', 'purchase_date'], axis=1)
y = merged_df_encoded['product_category_encoded']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
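
Because the merge repeats a transaction once per matching social profile, a random row-wise split can place copies of the same transaction in both the training and the test set, which may inflate test scores. One possible alternative, not used in this notebook, is a group-aware split with `GroupShuffleSplit`, sketched below on a toy frame. The `demo` data and its column values are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: some transactions appear on several rows, as in the merged table.
demo = pd.DataFrame({
    "transaction_id": [1001, 1001, 1001, 1002, 1003, 1003, 1004, 1005],
    "engagement_score": [61, 72, 82, 60, 87, 70, 55, 90],
    "target": [2, 2, 2, 1, 1, 1, 0, 0],
})
X_demo = demo[["engagement_score"]]
y_demo = demo["target"]

# All rows sharing a transaction_id are kept on the same side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X_demo, y_demo, groups=demo["transaction_id"]))

train_groups = set(demo["transaction_id"].iloc[train_idx])
test_groups = set(demo["transaction_id"].iloc[test_idx])
print("shared transactions:", train_groups & test_groups)
```

The split above also keeps customer_id and transaction_id in X; whether identifier columns should be model features is a separate choice worth revisiting.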

**Train the Product Recommendation Model**

Let’s start with a Random Forest Classifier:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Train
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 Score (weighted):", f1_score(y_test, y_pred, average='weighted'))

Accuracy: 0.675
F1 Score (weighted): 0.6714854426619132
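
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below shows how a majority-class baseline could be computed with `DummyClassifier`; the arrays are toy values chosen for illustration, so the printed number says nothing about the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels; the features are ignored by the "most_frequent" strategy.
X_train_demo = np.zeros((6, 1))
y_train_demo = np.array([0, 0, 0, 1, 1, 2])
X_test_demo = np.zeros((4, 1))
y_test_demo = np.array([0, 1, 0, 2])

# Always predict the most common training class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

baseline_acc = accuracy_score(y_test_demo, baseline.predict(X_test_demo))
print("Baseline accuracy:", baseline_acc)
```

Running the same baseline on the notebook's `X_train`, `y_train`, `X_test`, and `y_test` would show how much of the 0.675 accuracy comes from the model rather than from class frequencies.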


Save Merged Dataset for Later Use

In [None]:
merged_df_encoded.to_csv('merged_dataset.csv', index=False)

**Task 2: Image Data Collection & Processing**