# Product Recommendation Prediction


## Data preparation
We load the dataset and define the target variable as 'is_recommended', which indicates whether a reviewer would recommend the product. The raw data is imbalanced (778,160 positive vs. 148,263 negative examples), so we downsample to 148,000 examples per class before training.

In [None]:
import sys
import os
sys.path.append(os.path.abspath(os.path.join("..")))
import pandas as pd

# loading the preprocessed data 
from src.data_preprocessing import load_processedDfs
from src.sentiment import sentiment_vader

df = load_processedDfs()

# scoring each review with VADER (returns a score and a label per review)
results = [sentiment_vader(text) for text in df['review_text']]
df[['vader_score', 'vader_label']] = pd.DataFrame(results, index=df.index)

In [7]:
df['is_recommended'].value_counts()

is_recommended
1.0    778160
0.0    148263
Name: count, dtype: int64

In [None]:
# balancing the dataset
balanced_df = pd.concat([
    df[df['is_recommended']==1].sample(148000,random_state=42),
    df[df['is_recommended']==0].sample(148000,random_state=42)
]).reset_index(drop=True)

balanced_df['is_recommended'].value_counts()

is_recommended
1.0    148000
0.0    148000
Name: count, dtype: int64
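
The hard-coded 148000 above works because the minority class has 148,263 rows. As a minimal sketch of the same downsampling idea, the helper below derives the per-class sample size from the rarest class instead. The function name `downsample_to_balance` and the toy frame are hypothetical and only for illustration; they are not part of the project code.

```python
import pandas as pd

def downsample_to_balance(frame, target_col, n_per_class=None, seed=42):
    """Return a frame with the same number of rows for every class in target_col."""
    counts = frame[target_col].value_counts()
    # Default to the size of the rarest class so no class has to be upsampled.
    n = int(counts.min()) if n_per_class is None else n_per_class
    parts = [
        frame[frame[target_col] == label].sample(n=n, random_state=seed)
        for label in counts.index
    ]
    return pd.concat(parts).reset_index(drop=True)

# Toy frame with a 6:2 class imbalance (made-up values, not the review data).
toy = pd.DataFrame({
    "is_recommended": [1.0] * 6 + [0.0] * 2,
    "rating": [5, 4, 5, 3, 4, 5, 1, 2],
})
balanced_toy = downsample_to_balance(toy, "is_recommended")
print(balanced_toy["is_recommended"].value_counts())
```

Passing `n_per_class=148000` would reproduce the fixed sample size used in the cell above.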

## Model Training
Relevant features were extracted, including product attributes, reviews, and skin type. Categorical columns were converted to numeric form with scikit-learn's ColumnTransformer. We also used the TextBlob library to compute a numeric polarity score for the review text.

An XGBoost classifier was trained on this feature set. This tree-based model captures nonlinear relationships and feature interactions, which makes it well suited to predicting product recommendations.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from xgboost import XGBClassifier
from textblob import TextBlob

# feature extraction
balanced_df['sentiment'] = balanced_df['clean_text'].apply(
    lambda x: TextBlob(x).sentiment.polarity
)

numeric_features = [
    'loves_count','product_rating','price_usd','reviews',
    'sephora_exclusive','total_feedback_count','new',
    'sentiment', 'sentiment_score'
]

categorical_features = [
    'skin_type','sentiment_label'
]

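# numeric columns: fill missing values
# (assumption: median imputation; the original transformer definitions were not shown)
numeric_transformer = SimpleImputer(strategy='median')

# categorical columns: fill missing values, then one-hot encode
# (assumption: most-frequent imputation; unseen categories are ignored at predict time)
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# applying each transformer to its column group; all other columns are dropped
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='drop'
)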

# splitting the data into train and test set
X = balanced_df[numeric_features + categorical_features]
y = balanced_df['is_recommended']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# setting up model pipeline
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier(
        n_estimators=400,
        max_depth=6,
        learning_rate=0.05,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='binary:logistic',
        eval_metric='logloss',
        random_state=42,
        n_jobs=-1
    ))
])


In [None]:
# model prediction
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))
print('Accuracy Score:', accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))


              precision    recall  f1-score   support

         0.0       0.76      0.75      0.75     29600
         1.0       0.75      0.76      0.76     29600

    accuracy                           0.76     59200
   macro avg       0.76      0.76      0.76     59200
weighted avg       0.76      0.76      0.76     59200

Accuracy Score: 0.7562331081081081
[[22138  7462]
 [ 6969 22631]]
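
The figures in the classification report can be re-derived from the confusion matrix. The sketch below copies the four counts from the output above and relies on scikit-learn's layout (true labels on rows, predicted labels on columns, ordered 0.0 then 1.0); the variable names are only illustrative.

```python
# Counts copied from the confusion matrix printed above.
# Row 0 = true 0.0, row 1 = true 1.0; column 0 = predicted 0.0, column 1 = predicted 1.0.
tn, fp = 22138, 7462
fn, tp = 6969, 22631

total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # share of all test rows predicted correctly
precision_pos = tp / (tp + fp)    # of rows predicted 1.0, how many are truly 1.0
recall_pos = tp / (tp + fn)       # of rows truly 1.0, how many were found
precision_neg = tn / (tn + fn)    # same two quantities for class 0.0
recall_neg = tn / (tn + fp)

print(f"test rows       : {total}")
print(f"accuracy        : {accuracy:.4f}")
print(f"class 1.0  P / R: {precision_pos:.2f} / {recall_pos:.2f}")
print(f"class 0.0  P / R: {precision_neg:.2f} / {recall_neg:.2f}")
```

The rounded values agree with the report: 0.75 precision and 0.76 recall for class 1.0, and 0.76 precision and 0.75 recall for class 0.0.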


In [None]:
from sklearn.metrics import roc_auc_score

y_prob = model.predict_proba(X_test)[:,1]
print("ROC AUC:", roc_auc_score(y_test, y_prob))

ROC AUC: 0.8352349194211104


Analysis: The model achieves about 76% accuracy and a ROC AUC of about 0.84 on the balanced test set. Precision and recall are close to 0.75 for both classes, so the model separates recommended from non-recommended reviews reasonably well without favoring either class.
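
One way to read the ROC AUC: it equals the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one, with ties counted as half. The sketch below computes that quantity by brute force on a four-row toy example. The helper `pairwise_auc` is hypothetical and quadratic in the number of rows, so it is meant only to illustrate the definition; the notebook itself uses `roc_auc_score`.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))

# Toy labels and scores (made up, not model output).
toy_true = [0, 0, 1, 1]
toy_score = [0.1, 0.4, 0.35, 0.8]

# Three of the four positive/negative pairs are ordered correctly.
print(pairwise_auc(toy_true, toy_score))  # 0.75
```

Read this way, an AUC of about 0.84 means the model ranks a recommended review above a non-recommended one in roughly 84 out of 100 such pairs.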

## Feature Importance
The importance of each feature was analyzed to understand which factors most influence product recommendations.

In [None]:
feature_names = model.named_steps['preprocessor'].get_feature_names_out()

xgb_model = model.named_steps['classifier']

importances = xgb_model.feature_importances_

feature_importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values(by='importance', ascending=False)

feature_importance_df.head(20)


                          feature  importance
8            num__sentiment_score    0.199496
13  cat__sentiment_label_negative    0.194930
15  cat__sentiment_label_positive    0.189548
5       num__total_feedback_count    0.156129
1             num__product_rating    0.078296
7                  num__sentiment    0.047144
14   cat__sentiment_label_neutral    0.019347
3                    num__reviews    0.017567
2                  num__price_usd    0.017234
0                num__loves_count    0.017221


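Analysis: Sentiment score and sentiment label dominate the ranking. Together with the TextBlob polarity feature, the sentiment-derived features sum to about 0.65 of the importance (0.199 + 0.195 + 0.190 + 0.047 + 0.019), well ahead of total_feedback_count (0.156) and product_rating (0.078). This suggests that the wording of a review is the strongest signal of whether the reviewer recommends the product.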
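
Because one-hot encoding splits a categorical column into several features, it can help to add the importances back up per source column. The sketch below does this for the ten rows listed above, with values copied from the table; features that are not listed are left out. The helper `source_column` and the grouping logic are illustrative assumptions, not part of the original analysis.

```python
# Importances copied from the table above (only the ten listed rows).
listed_importances = {
    "num__sentiment_score": 0.199496,
    "cat__sentiment_label_negative": 0.194930,
    "cat__sentiment_label_positive": 0.189548,
    "num__total_feedback_count": 0.156129,
    "num__product_rating": 0.078296,
    "num__sentiment": 0.047144,
    "cat__sentiment_label_neutral": 0.019347,
    "num__reviews": 0.017567,
    "num__price_usd": 0.017234,
    "num__loves_count": 0.017221,
}

# Columns that were one-hot encoded in the pipeline.
CATEGORICAL_COLUMNS = ["skin_type", "sentiment_label"]

def source_column(feature_name):
    """Map a transformed feature name back to the column it came from."""
    base = feature_name.split("__", 1)[1]        # drop the 'num__' / 'cat__' prefix
    for col in CATEGORICAL_COLUMNS:
        if base.startswith(col + "_"):           # e.g. 'sentiment_label_negative'
            return col
    return base

grouped = {}
for name, value in listed_importances.items():
    col = source_column(name)
    grouped[col] = grouped.get(col, 0.0) + value

for col, value in sorted(grouped.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:22s} {value:.3f}")

# Combined weight of the three sentiment-derived columns.
sentiment_total = sum(grouped[c] for c in ("sentiment_score", "sentiment_label", "sentiment"))
print(f"sentiment features combined: {sentiment_total:.3f}")
```

Grouped this way, sentiment_label becomes the single most important source column, since its three one-hot features add up to about 0.40.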