# Part 2: Predictive Analysis

In [48]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE

# Step 1: Feature Set Preparation and Preprocessing

In [49]:
print("Step 1: Data Loading, Merging, and Preprocessing")

Step 1: Data Loading, Merging, and Preprocessing


In [50]:
# Define file paths
data_dir = "../data/data-1"
orders_path = os.path.join(data_dir, "olist_orders_dataset.csv")
reviews_path = os.path.join(data_dir, "olist_order_reviews_dataset.csv")
items_path = os.path.join(data_dir, "olist_order_items_dataset.csv")
payments_path = os.path.join(data_dir, "olist_order_payments_dataset.csv")
products_path = os.path.join(data_dir, "olist_products_dataset.csv")
customers_path = os.path.join(data_dir, "olist_customers_dataset.csv")

In [51]:
try:
    # Load all relevant datasets, parsing the timestamp columns used later
    date_cols = ['order_purchase_timestamp', 'order_estimated_delivery_date', 'order_delivered_customer_date']
    orders = pd.read_csv(orders_path, parse_dates=date_cols)
    reviews = pd.read_csv(reviews_path)
    items = pd.read_csv(items_path)
    payments = pd.read_csv(payments_path)
    products = pd.read_csv(products_path)
    customers = pd.read_csv(customers_path)
except FileNotFoundError as e:
    print(f"Error: {e}. Please ensure '{data_dir}' exists and contains all required CSVs.")
    raise  # stop here instead of continuing with missing data

In [52]:
# Merge datasets into a single DataFrame
df = orders.merge(reviews, on='order_id', how='left')
df = df.merge(items, on='order_id', how='left')
df = df.merge(payments, on='order_id', how='left')
df = df.merge(products, on='product_id', how='left')
df = df.merge(customers, on='customer_id', how='left')
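
Because `items` and `payments` can each hold several rows per `order_id`, these left merges can repeat an order (and its review) more than once. A minimal sketch on toy frames shows the effect and one way to count it. The values below are invented for illustration; only the column names come from the notebook.

```python
import pandas as pd

# Toy stand-ins for the Olist tables; the values are invented for illustration.
orders = pd.DataFrame({"order_id": ["o1", "o2"]})
items = pd.DataFrame({"order_id": ["o1", "o1", "o2"], "price": [10.0, 20.0, 5.0]})
payments = pd.DataFrame({"order_id": ["o1", "o1", "o2"], "payment_value": [15.0, 15.0, 5.0]})

merged = orders.merge(items, on="order_id", how="left")
merged = merged.merge(payments, on="order_id", how="left")

# Order o1 has 2 items and 2 payments, so it appears 2 * 2 = 4 times.
rows_per_order = merged.groupby("order_id").size()
print(rows_per_order)
```

Running the same `groupby("order_id").size()` check on the real `df` would show how many rows each review contributes to the training data.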

In [53]:
# Drop rows with missing review scores, as this is our target variable
df.dropna(subset=['review_score'], inplace=True)

In [54]:
# Feature Engineering and Cleaning

# Create the binary target variable
# 1 if the review score is 4 or 5 (high), 0 if it's 1, 2, or 3 (low)
df['review_score_binary'] = df['review_score'].apply(lambda x: 1 if x >= 4 else 0)
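
The same target can be built without a row-by-row `apply`. A small sketch on a made-up series of scores, assuming the same threshold of 4:

```python
import pandas as pd

# Hypothetical review scores covering every possible value.
scores = pd.Series([1, 2, 3, 4, 5], name="review_score")

# Vectorized comparison: True for scores of 4 or 5, then cast to 0/1.
review_score_binary = (scores >= 4).astype(int)
print(review_score_binary.tolist())  # [0, 0, 0, 1, 1]
```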

In [55]:
# Check for class imbalance
print(f"High review scores (4-5): {len(df[df['review_score_binary'] == 1])} samples")
print(f"Low review scores (1-3): {len(df[df['review_score_binary'] == 0])} samples")

High review scores (4-5): 88662 samples
Low review scores (1-3): 29484 samples


In [56]:
# Calculate delivery delay in days (positive = delivered later than estimated).
# Orders without a delivery date get a delay of 0.
df['delivery_delay'] = (df['order_delivered_customer_date'] - df['order_estimated_delivery_date']).dt.days.fillna(0)

In [57]:
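# Calculate each customer's average review score over all of their rows in df.
# Note: the mean includes the current order's own review_score, not only previous orders.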
customer_avg_review = df.groupby('customer_unique_id')['review_score'].transform('mean')
df['customer_avg_review'] = customer_avg_review.fillna(df['review_score'].mean()) # Fill NaN with overall average

In [58]:
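# Calculate each product's average review score over all of its rows in df.
# Note: the mean includes the current order's own review_score, not only previous orders.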
product_avg_review = df.groupby('product_id')['review_score'].transform('mean')
df['product_avg_review'] = product_avg_review.fillna(df['review_score'].mean()) # Fill NaN with overall average
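
`transform('mean')` averages over every row in the group, the current order included. If the intent is an average over previous orders only, one option is to sort by purchase time and take a shifted expanding mean per customer. The sketch below uses a toy frame with invented values, and `customer_prior_avg_review` is a hypothetical column name, not one used in the notebook.

```python
import pandas as pd

# Toy data: customer "a" has three orders, customer "b" has one.
toy = pd.DataFrame({
    "customer_unique_id": ["a", "a", "a", "b"],
    "order_purchase_timestamp": pd.to_datetime(
        ["2018-01-01", "2018-02-01", "2018-03-01", "2018-01-15"]
    ),
    "review_score": [5, 1, 3, 4],
})

# Sort chronologically so "previous" is well defined within each customer.
toy = toy.sort_values("order_purchase_timestamp")

# shift(1) drops the current order; expanding().mean() averages everything before it.
toy["customer_prior_avg_review"] = (
    toy.groupby("customer_unique_id")["review_score"]
       .transform(lambda s: s.shift(1).expanding().mean())
)
print(toy.sort_index())
```

A customer's first order has no history, so its value is `NaN` and would still need a fallback such as the training-set mean.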

In [59]:
# Select features for the model
# We'll use order, payment, and product features that we believe influence the review score
features = [
    # 'delivery_delay',
    'payment_value',
    'freight_value',
    'payment_installments',
    'product_weight_g',
    'product_length_cm',
    'product_height_cm',
    'product_width_cm',
    'product_photos_qty',
    'customer_avg_review',
    'product_avg_review'
]

In [60]:
# Handle missing values for numerical features
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which raises a FutureWarning and will stop working in pandas 3.0
for feature in features:
    if df[feature].isnull().any():
        df[feature] = df[feature].fillna(df[feature].median())

In [61]:
# Define X (features) and y (target)
X = df[features]
y = df['review_score_binary']

# Step 2: Train a Supervised Model

In [62]:
print("Step 2: Training a Logistic Regression Model")

Step 2: Training a Logistic Regression Model


In [63]:
# Split the data into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Check the training-set class balance before applying SMOTE
print(y_train.value_counts())
# --- SMOTE Implementation to handle class imbalance ---
print("\nApplying SMOTE to balance the training data...")
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print(f"Original training data shape: {X_train.shape}, {y_train.shape}")
print(f"Resampled training data shape: {X_train_resampled.shape}, {y_train_resampled.shape}")

X_train, y_train = X_train_resampled, y_train_resampled

review_score_binary
1    70929
0    23587
Name: count, dtype: int64

Applying SMOTE to balance the training data...
Original training data shape: (94516, 10), (94516,)
Resampled training data shape: (141858, 10), (141858,)
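
The merged frame can contain several rows for the same customer, so a plain random split may place rows from one customer in both the training and the test set. A sketch of a group-aware split with scikit-learn's `GroupShuffleSplit`, on hypothetical toy data, is shown below. Note that this splitter does not stratify on the target.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 10 rows belonging to 5 customers (2 rows each).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([1, 1, 0, 0, 1, 1, 1, 1, 0, 0])
groups = np.repeat(["c1", "c2", "c3", "c4", "c5"], 2)

# Hold out roughly 20% of the customers (not 20% of the rows).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X_toy, y_toy, groups=groups))

# No customer appears on both sides of the split.
shared = set(groups[train_idx]) & set(groups[test_idx])
print(shared)
```

In the notebook, `groups` would presumably be `df['customer_unique_id']`.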


In [64]:
# Check the new class balance after applying SMOTE
print(y_train_resampled.value_counts())

review_score_binary
1    70929
0    70929
Name: count, dtype: int64
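
SMOTE builds each synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates a single interpolation step with NumPy and checks the row counts printed above. It is not the `imblearn` implementation, and the two points are invented.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical minority-class points (e.g. payment_value, freight_value).
x = np.array([100.0, 20.0])
neighbor = np.array([140.0, 30.0])

# SMOTE-style step: pick a random point on the segment between them.
lam = rng.uniform(0.0, 1.0)
synthetic = x + lam * (neighbor - x)

# Row counts from the output above: the minority class is topped up to the majority size.
n_majority, n_minority = 70929, 23587
n_synthetic = n_majority - n_minority
n_total = n_majority + n_minority + n_synthetic
print(synthetic, n_synthetic, n_total)
```

The total of 141858 rows matches the resampled training shape reported by the notebook.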


In [65]:
# Initialize and train the Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(max_iter=1000)

# Step 3: Model Evaluation and Feature Importance 

In [66]:
print("Step 3: Model Evaluation and Feature Importance")

Step 3: Model Evaluation and Feature Importance


In [67]:
# Make predictions on the test set
y_pred = model.predict(X_test)

In [68]:
# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)


print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print("\nConfusion Matrix:")
print(conf_matrix)
print(classification_report(y_test, y_pred, target_names=['Low', 'High'], zero_division=0))

Accuracy: 0.9566
Precision: 0.9702
Recall: 0.9720

Confusion Matrix:
[[ 5367   530]
 [  496 17237]]
              precision    recall  f1-score   support

         Low       0.92      0.91      0.91      5897
        High       0.97      0.97      0.97     17733

    accuracy                           0.96     23630
   macro avg       0.94      0.94      0.94     23630
weighted avg       0.96      0.96      0.96     23630



### Logistic Regression: Performance Analysis with New Features (`customer_avg_review`, `product_avg_review`)

Overall Accuracy: **95.66%**.

* For the "Low" class, precision is **92%** and recall is **91%**.

* The model identifies about 91% of the customers who give a low score, and about 92% of its low-score predictions are correct.

* The minority "Low" class is no longer neglected: its precision and recall are close to those of the "High" class.

* The model also performs very well on the "High" class, with **97%** precision and recall.
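
The per-class figures follow directly from the confusion matrix printed above. A short worked calculation, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes, in label order Low then High):

```python
# Confusion matrix from the output above: rows = actual, columns = predicted,
# in label order [Low (0), High (1)].
tn, fp = 5367, 530    # actual Low: predicted Low, predicted High
fn, tp = 496, 17237   # actual High: predicted Low, predicted High

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_high = tp / (tp + fp)
recall_high = tp / (tp + fn)
precision_low = tn / (tn + fn)
recall_low = tn / (tn + fp)

print(f"accuracy={accuracy:.4f}")
print(f"High: precision={precision_high:.4f}, recall={recall_high:.4f}")
print(f"Low:  precision={precision_low:.2f}, recall={recall_low:.2f}")
```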

#### Conclusion
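* Adding the `customer_avg_review` and `product_avg_review` features improved the model substantially. Relative to the Logistic Regression with SMOTE results in Appendix A, "Low"-class recall rose from 0.56 to 0.91 and precision from 0.34 to 0.92.

* On this test set the model is a highly accurate tool for identifying at-risk customers. Both new features are averages that include the review score of the row being predicted, so these figures are likely optimistic for new orders whose review is not yet known.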

In [69]:
import xgboost as xgb
# XGBoost Model
print("\nTraining and Evaluating XGBoost Model")

# use_label_encoder is not used by current XGBoost versions, so it is omitted here
model_xgb = xgb.XGBClassifier(random_state=42, eval_metric='logloss')
model_xgb.fit(X_train, y_train)
y_pred_xgb = model_xgb.predict(X_test)

accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
precision_xgb = precision_score(y_test, y_pred_xgb)
recall_xgb = recall_score(y_test, y_pred_xgb)
conf_matrix_xgb = confusion_matrix(y_test, y_pred_xgb)

print(f"XGBoost Accuracy: {accuracy_xgb:.4f}")
print(f"XGBoost Precision: {precision_xgb:.4f}")
print(f"XGBoost Recall: {recall_xgb:.4f}")
print("\nXGBoost Confusion Matrix:")
print(conf_matrix_xgb)
print(classification_report(y_test, y_pred_xgb, target_names=['Low', 'High'], zero_division=0))



Training and Evaluating XGBoost Model
XGBoost Accuracy: 0.9934
XGBoost Precision: 0.9966
XGBoost Recall: 0.9946

XGBoost Confusion Matrix:
[[ 5837    60]
 [   95 17638]]
              precision    recall  f1-score   support

         Low       0.98      0.99      0.99      5897
        High       1.00      0.99      1.00     17733

    accuracy                           0.99     23630
   macro avg       0.99      0.99      0.99     23630
weighted avg       0.99      0.99      0.99     23630



### XGBoost: Performance Analysis with New Features (`customer_avg_review`, `product_avg_review`)
Overall Accuracy: **99.34%**.

* For the "Low" class, precision is **98%** and recall is **99%**.

* The model misses only 60 of the 5897 low-score rows in the test set, and 95 of its 5932 low-score predictions are wrong.

* The minority "Low" class is no longer neglected: its precision and recall are on par with those of the "High" class.

* The model also performs extremely well on the "High" class, with **99.66%** precision (shown as 1.00 in the report) and **99%** recall.

#### Conclusion
* Adding the `customer_avg_review` and `product_avg_review` features improved the model substantially, and XGBoost raises accuracy further, from 95.66% with Logistic Regression to 99.34%.

# Appendix A: Old Results (without `customer_avg_review` and `product_avg_review`)
# Step 4: Review of the Model
## Model’s Limitations
The main limitation of this model is the class imbalance. As the analysis showed, there are far more high review scores than low ones. This can bias the model: it learns that predicting "high score" is usually correct, which can produce a high accuracy score that does not reflect its ability to identify the less frequent "low score" cases. The precision and recall for the "low score" class will likely be poor.

A second limitation is the simplicity of the features. Currently, the model considers only a few numerical variables related to the order and product. It lacks contextual information, such as the customer's previous history, the seller's reputation, or the qualitative content of the review itself.

## Performance Analysis With SMOTE
The model's accuracy is 62.36%, which seems decent at first glance. However, it is crucial to look at the precision and recall to understand what the model is actually doing.

Recall for "Low" Class (0.56). This is the most significant improvement. With SMOTE, the model correctly identifies 56% of the actual low-score reviews, compared with 10% without SMOTE, which means far fewer false negatives. This is the key metric to improve, as the main goal of the model is to find and flag these unhappy customers.

Precision for "Low" Class (0.34). This is a key area of concern. A precision of 0.34 means that when the model predicts a review will be "low," it is only correct about one-third of the time. This indicates a high number of false positives—orders that the model incorrectly flags as high-risk. This could lead to a customer support team wasting resources on happy customers.

Accuracy: The overall accuracy of 62% shows that the model still misclassifies more than a third of all cases. This is a common trade-off when using SMOTE: we improve recall for the minority class, but often at the cost of overall precision and accuracy.

## Performance Analysis Without SMOTE
The output shows a high overall accuracy of 77.12% and a precision of 76.92%. At first glance, this seems like a very good model. However, a closer look at the precision and recall for each class reveals an issue related to the class imbalance we noted earlier.

Recall for "Low" Class (0.10). This is the most critical metric. Without SMOTE, the model is only able to correctly identify 10% of the actual low-score reviews. This means it is failing to flag the vast majority of at-risk customers, which is the primary goal of the model.

Precision for "Low" Class (0.83). The high precision here (83%) indicates that when the model does predict a low score, it is highly likely to be correct. The problem is that it is so conservative in making these predictions that it misses almost all of them.

Recall for "High" Class (0.99). The model is good at predicting "high" scores, with a near-perfect recall of 99%. This is expected given the dataset's heavy class imbalance. The model can simply learn to predict "high" for almost every case and still achieve a high overall accuracy.
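
A quick way to put the 77.12% accuracy in context is to compare it with a trivial classifier that always predicts "High". The sketch below uses the class counts printed earlier in the notebook; because the split is stratified, the test set should have nearly the same ratio.

```python
# Class counts printed earlier in the notebook.
n_high, n_low = 88662, 29484

# Accuracy of a trivial classifier that always predicts "High".
baseline_accuracy = n_high / (n_high + n_low)
reported_accuracy = 0.7712  # Logistic Regression without SMOTE

gain = reported_accuracy - baseline_accuracy
print(f"baseline={baseline_accuracy:.4f}, gain over baseline={gain:.4f}")
```

The model without SMOTE is only about two percentage points better than always predicting "High".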

## XGBoost Performance Analysis
The XGBoost model improves on the overall performance metrics of the Logistic Regression model with SMOTE.

Accuracy: The XGBoost model has an accuracy of 80.30%, which is higher than the Logistic Regression model's 62.36%. This indicates that it correctly predicts the review score more often overall.

Precision: The precision is 81.64%, a slight increase from the previous model.

Recall: The recall is 95.13%, a substantial jump from the previous model's 64.62%.
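
The three headline figures above pin down the full confusion matrix once the class support is known. The sketch below assumes the old run used a test set with the same support as the current one (5897 "Low" and 17733 "High" rows); that is an assumption, since the original output for this run is not shown.

```python
# Assumed test-set support, taken from the current run's classification report.
n_low, n_high = 5897, 17733

# Headline XGBoost figures quoted above (positive class = "High").
recall_high, precision_high = 0.9513, 0.8164

tp = round(recall_high * n_high)       # High predicted as High
fn = n_high - tp                       # High predicted as Low
fp = round(tp / precision_high) - tp   # Low predicted as High
tn = n_low - fp                        # Low predicted as Low

# Implied overall accuracy and "Low"-class metrics.
accuracy = (tp + tn) / (n_low + n_high)
recall_low = tn / n_low
precision_low = tn / (tn + fn)
print(f"accuracy={accuracy:.4f}, Low recall={recall_low:.2f}, Low precision={precision_low:.2f}")
```

Under this assumption the implied accuracy agrees with the quoted 80.30% to within rounding, which makes the calculation a useful cross-check for any per-class figures reported alongside these metrics.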

However, when looking at the more detailed breakdown, we see a more nuanced story:

Low Score Class: The reported precision for the 'Low' class is 0.34 and the recall is 0.56. Both values are identical to those of the Logistic Regression model with SMOTE, so on these figures XGBoost shows no gain on the minority class, and the number of false positives remains high.

High Score Class: The reported precision is 0.81 and the recall is 0.65. This recall matches the Logistic Regression figure (64.62%) rather than the 95.13% XGBoost recall quoted above, so the per-class breakdown should be re-checked against the XGBoost classification report.

Subject to that check, the XGBoost model with SMOTE is the stronger classifier on overall accuracy (80.30% versus 62.36%). While the precision for the "Low" class is still relatively low, trading precision for higher recall is often acceptable in a business setting where the cost of a missed low-score customer (a false negative) is higher than the cost of a false positive. The model could be used to proactively flag orders at a higher risk of a low review, and further tuning of the XGBoost hyperparameters could improve the results.
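
The trade-off between false negatives and false positives can be made explicit by choosing the decision threshold that minimises total cost. The sketch below is a minimal illustration: the probabilities, labels, and both cost values are hypothetical. In the notebook, the probability of a low review would presumably come from column 0 of `predict_proba`, since "High" is the positive class.

```python
# Hypothetical predicted probabilities of a LOW review and true labels (1 = low).
p_low = [0.9, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]
is_low = [1, 1, 0, 1, 0, 0, 0, 0]

COST_FN = 5.0  # assumed cost of missing an unhappy customer
COST_FP = 1.0  # assumed cost of contacting a happy customer


def total_cost(threshold):
    """Sum the cost of missed low reviews and of wrongly flagged orders."""
    cost = 0.0
    for p, y in zip(p_low, is_low):
        flagged = p >= threshold
        if y == 1 and not flagged:
            cost += COST_FN
        elif y == 0 and flagged:
            cost += COST_FP
    return cost


# Evaluate a few candidate thresholds and keep the cheapest one.
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
best = min(thresholds, key=total_cost)
print(best, total_cost(best))
```

With a missed low review costed at five times a false alarm, the cheapest threshold in this toy example is 0.3, below the default of 0.5.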