<img src="https://devra.ai/analyst/notebook/3553/image.jpg" style="width: 100%; height: auto;" />

<div style="text-align:center; border-radius:15px; padding:15px; color:white; margin:0; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden; margin-bottom: 1em;">    <div style="font-size:150%; color:#FEE100"><b>E-Commerce Transactions Analysis Notebook</b></div>    <div>This notebook was created with the help of <a href="https://devra.ai/ref/kaggle" style="color:#6666FF">Devra AI</a></div></div>

Welcome, curious reader. This notebook explores a multifaceted dataset of e-commerce transactions, sessions, reviews, and customer behavior. Along the way we may uncover some surprising relationships in online shopping habits. If you find this useful, feel free to upvote it.

## Table of Contents

- [Introduction](#Introduction)
- [Data Import and Overview](#Data-Import-and-Overview)
- [Data Cleaning and Preprocessing](#Data-Cleaning-and-Preprocessing)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Prediction Modeling: Forecasting Order Totals](#Prediction-Modeling:-Forecasting-Order-Totals)
- [Conclusions and Future Work](#Conclusions-and-Future-Work)

## Introduction

In this notebook, we explore a set of related tables derived from e-commerce transactions. Our curiosity stems from the idea that online behavior is erratic and patterned at the same time. How do discount strategies, payment methods, and device types influence the final order total? Among many potential questions, we have chosen to build a predictor for the order amount in USD. The predictor uses features from the orders table, and we evaluate it with the R² score and the mean squared error on a held-out test set.

Let the data exploration, some dry humor, and robust analysis commence.

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Render figures inline in the notebook.
# (Switching to the non-interactive 'Agg' backend afterwards would keep plt.show() from displaying anything.)
%matplotlib inline

# Set seaborn style for better visuals
sns.set(style='whitegrid')

## Data Import and Overview

In this section we load the datasets of interest. The data files live under the Kaggle input directory for this dataset. There are multiple CSV files, covering events, reviews, orders, products, customers, order items, and sessions. For our main prediction task, we focus on `orders.csv`.

In [2]:
# Read orders data
orders = pd.read_csv('/kaggle/input/e-commerce-transactions-clickstream/orders.csv', encoding='ascii')
print('Orders shape:', orders.shape)

# Preview first few rows
orders.head()

Orders shape: (33580, 10)


|   | order_id | customer_id | order_time | payment_method | discount_pct | subtotal_usd | total_usd | country | device | source |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 13917 | 2025-01-31T23:07:42 | card | 20 | 107.15 | 85.72 | PL | desktop | organic |
| 1 | 2 | 1022 | 2024-02-19T01:17:50 | card | 0 | 116.17 | 116.17 | FR | tablet | organic |
| 2 | 3 | 6145 | 2024-12-04T20:24:13 | card | 0 | 137.35 | 137.35 | US | mobile | organic |
| 3 | 4 | 3152 | 2024-07-17T08:50:47 | card | 15 | 32.18 | 27.35 | BR | mobile | email |
| 4 | 5 | 12378 | 2020-08-21T16:54:16 | card | 0 | 238.09 | 238.09 | NL | desktop | paid |
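
The load above passes `encoding='ascii'`, which raises an error as soon as a file contains a single non-ASCII byte. The sketch below shows one possible way to make the read more tolerant by trying a short list of encodings in order. The helper name `read_csv_with_fallback`, the encoding list, and the demo file are assumptions for illustration and are not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Hypothetical helper: return the first DataFrame that decodes cleanly."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error


# Demo on a made-up file that is valid latin-1 but not valid UTF-8
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "orders_demo.csv")
    with open(demo_path, "wb") as fh:
        fh.write("order_id,country\n1,Curaçao\n".encode("latin-1"))
    demo = read_csv_with_fallback(demo_path)

print(demo)
```

Note that `latin-1` accepts any byte sequence, so placing it last makes it a catch-all; whether that is appropriate depends on how the files were actually written.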


## Data Cleaning and Preprocessing

We now inspect the data for missing values and type mismatches, and convert the date field. The `order_time` column in the orders dataset is loaded as a string, so we convert it to a datetime type, which any time-based filtering or grouping would require. With `errors='coerce'`, values that cannot be parsed become `NaT` rather than raising an error, and the subsequent `dropna` removes them.

The cleaning process may look mundane, but it is as crucial as selecting the right coffee in the morning.

In [3]:
# Convert order_time to datetime
orders['order_time'] = pd.to_datetime(orders['order_time'], errors='coerce')

# Check for missing values
missing_values = orders.isnull().sum()
print('Missing values in orders:')
print(missing_values)

# For simplicity, drop any rows with missing order_time or total_usd
orders_clean = orders.dropna(subset=['order_time', 'total_usd'])
print('Cleaned orders shape:', orders_clean.shape)

Missing values in orders:
order_id          0
customer_id       0
order_time        0
payment_method    0
discount_pct      0
subtotal_usd      0
total_usd         0
country           0
device            0
source            0
dtype: int64
Cleaned orders shape: (33580, 10)
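
Since the real data has no missing values, the cleaning step above never visibly does anything. As a minimal sketch of what it would do on dirty input, the toy frame below reuses two timestamps from the preview and adds one deliberately malformed value (the malformed row and its total are made up for illustration).

```python
import pandas as pd

# Two valid ISO timestamps from the preview plus one invented malformed value
toy = pd.DataFrame({
    "order_time": ["2025-01-31T23:07:42", "not-a-date", "2024-02-19T01:17:50"],
    "total_usd": [85.72, 10.00, 116.17],
})

# Unparseable strings become NaT instead of raising an error
toy["order_time"] = pd.to_datetime(toy["order_time"], errors="coerce")
n_bad = int(toy["order_time"].isna().sum())

# Same dropna call as in the notebook, applied to the toy frame
toy_clean = toy.dropna(subset=["order_time", "total_usd"])

print("Unparseable timestamps:", n_bad)
print("Rows kept:", len(toy_clean))
```

Counting the `NaT` values before dropping them gives a quick check on how many rows the coercion silently discards.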


## Exploratory Data Analysis

In our EDA, we examine key features of the orders dataset. The analysis includes histograms of the numeric fields, a pair plot to look for relationships between them, and a count plot showing how orders are distributed across a categorical variable.

It is always interesting to see what numbers might be hiding behind what seems like chaotic transactions.

In [4]:
# Select numeric columns from orders for further analysis
numeric_df = orders_clean.select_dtypes(include=[np.number])
print('Numeric columns in orders:', numeric_df.columns.tolist())

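# Histogram of numeric variables
# Note: order_id and customer_id are identifiers, so their histograms carry little analytical meaning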
fig, axes = plt.subplots(nrows=1, ncols=len(numeric_df.columns), figsize=(15, 4))
for ax, col in zip(axes, numeric_df.columns):
    sns.histplot(numeric_df[col], ax=ax, kde=True)
    ax.set_title(f'Histogram of {col}')
plt.tight_layout()
plt.show()

Numeric columns in orders: ['order_id', 'customer_id', 'discount_pct', 'subtotal_usd', 'total_usd']
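
The numeric columns listed above include `order_id` and `customer_id`, which are identifiers rather than measurements. One possible way to restrict the plots to actual measures is sketched below on the first three preview rows. Filtering on the `_id` suffix is an assumption about the naming convention, not something the original notebook does.

```python
import numpy as np
import pandas as pd

# First three rows of the preview, numeric columns only
preview = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [13917, 1022, 6145],
    "discount_pct": [20, 0, 0],
    "subtotal_usd": [107.15, 116.17, 137.35],
    "total_usd": [85.72, 116.17, 137.35],
})

numeric_cols = preview.select_dtypes(include=[np.number]).columns
# Assumption: identifier columns can be recognised by their "_id" suffix
measure_cols = [c for c in numeric_cols if not c.endswith("_id")]

print(measure_cols)
```

Passing `orders_clean[measure_cols]` to the histogram and pair plot cells would reduce both to the three columns that describe order value.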


In [5]:
# Create a pair plot to visualize relationships among numeric columns in orders
sns.pairplot(numeric_df)
plt.show()

In [6]:
# Visualize distribution of payment_method as an example of a categorical variable
plt.figure(figsize=(8, 4))
sns.countplot(data=orders_clean, x='payment_method', palette='viridis')
plt.title('Count of Payment Methods')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
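
A count plot shows how often each category occurs but not how order value differs between categories. As a rough sketch of that second question, the snippet below groups the five preview rows by `device` and summarises `total_usd`. Five rows are far too few to draw conclusions from; the same call on `orders_clean` would give the real picture.

```python
import pandas as pd

# The five preview rows, reduced to the columns needed here
preview = pd.DataFrame({
    "device": ["desktop", "tablet", "mobile", "mobile", "desktop"],
    "total_usd": [85.72, 116.17, 137.35, 27.35, 238.09],
})

# Order count and mean order total per device, highest mean first
summary = (
    preview.groupby("device")["total_usd"]
    .agg(["count", "mean"])
    .sort_values("mean", ascending=False)
)

print(summary)
```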

## Prediction Modeling: Forecasting Order Totals

Next, we build a regression model intended to predict the final order total (`total_usd`) from orders data. 

The model uses features such as `discount_pct` and `subtotal_usd`, with additional categorical variables one-hot encoded. We split the data into training and testing subsets, train the model, and finally compute metrics such as the R² score and Mean Squared Error. 
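
To make the one-hot encoding step concrete, here is a small sketch on three preview rows. It shows what `drop_first=True` does: the alphabetically first category gets no column of its own and is represented by all remaining indicator columns being zero.

```python
import pandas as pd

# Three preview rows with one numeric and one categorical feature
toy = pd.DataFrame({
    "subtotal_usd": [107.15, 116.17, 137.35],
    "device": ["desktop", "tablet", "mobile"],
})

# Categories are ordered desktop, mobile, tablet; drop_first removes "desktop"
encoded = pd.get_dummies(toy, columns=["device"], drop_first=True)

print(encoded)
```

For a tree-based model such as a random forest, dropping the first level is optional; it matters mainly for linear models, where a full set of indicators is collinear with the intercept.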

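One caveat before we start: in the preview rows, `total_usd` equals `subtotal_usd` reduced by `discount_pct` percent (for example, 107.15 × 0.80 = 85.72). With both of those columns among the features, a near-perfect fit is to be expected, and it tells us little about the influence of payment method, country, device, or source. Even so, order totals can have a mind of their own, so we let the model have its say.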
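
Before fitting any model, it is worth checking how much of `total_usd` is plain arithmetic. The sketch below applies the assumed formula `subtotal_usd * (1 - discount_pct / 100)` to the five preview rows and reports the largest gap. The formula is inferred from those rows, not documented by the dataset, so it should be verified on the full table.

```python
import pandas as pd

# The five preview rows from the orders table
preview = pd.DataFrame({
    "discount_pct": [20, 0, 0, 15, 0],
    "subtotal_usd": [107.15, 116.17, 137.35, 32.18, 238.09],
    "total_usd": [85.72, 116.17, 137.35, 27.35, 238.09],
})

# Assumed relationship: the total is the subtotal after the percentage discount
expected_total = preview["subtotal_usd"] * (1 - preview["discount_pct"] / 100)
max_abs_gap = (preview["total_usd"] - expected_total).abs().max()

print(f"Largest absolute gap: {max_abs_gap:.4f} USD")
```

If the gap stays within rounding on the full dataset, this one-line formula is the baseline any learned model has to beat.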

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Select features and target for prediction
# For simplicity, we select discount_pct, subtotal_usd, and we also one-hot encode categorical features: payment_method, country, device, and source
features = orders_clean[['discount_pct', 'subtotal_usd', 'payment_method', 'country', 'device', 'source']].copy()
target = orders_clean['total_usd']

# One-hot encode categorical columns
features = pd.get_dummies(features, columns=['payment_method', 'country', 'device', 'source'], drop_first=True)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
print('Training samples:', X_train.shape[0])
print('Testing samples:', X_test.shape[0])

# Initialize and train the Random Forest Regressor
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f'R² Score: {r2:.3f}')
print(f'Mean Squared Error: {mse:.3f}')

Training samples: 26864
Testing samples: 6716
R² Score: 0.999
Mean Squared Error: 29.250
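
The two reported metrics are on different scales, which makes them awkward to read side by side. The sketch below computes MSE, RMSE, and R² from their standard definitions with NumPy and converts the reported MSE into an RMSE in USD. The helper `regression_metrics` and its toy input are illustrative assumptions, not part of the original notebook.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Hypothetical helper: MSE, RMSE and R² from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    ss_res = float(np.sum(residuals ** 2))                   # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))    # total variation
    return {"mse": mse, "rmse": mse ** 0.5, "r2": 1.0 - ss_res / ss_tot}


# Toy check with made-up values
metrics = regression_metrics([1, 2, 3], [1, 2, 4])
print(metrics)

# RMSE implied by the MSE reported above, in USD
reported_rmse = 29.250 ** 0.5
print(f"Reported RMSE: {reported_rmse:.2f} USD")
```

An RMSE of roughly 5.41 USD is easier to judge against typical order totals than a squared-dollar figure, and it shows that an R² of 0.999 still leaves errors of a few dollars per order.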


## Conclusions and Future Work

We built and evaluated a predictor for order totals. The pipeline covered data cleaning, exploratory visualization (histograms, a pair plot, and a count plot), and a Random Forest regressor that reached an R² of 0.999 and a mean squared error of 29.250 on the 6,716 held-out orders.

In the preview rows, `total_usd` follows directly from `subtotal_usd` and `discount_pct`, so that score likely reflects arithmetic the model could recover rather than new insight into shopper behavior. Future analyses could investigate relationships across the other data sources, such as correlating clickstream event data with order behavior, or apply natural language processing techniques to the review texts.
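
As a starting point for linking clickstream data to orders, one could aggregate events per customer and join the counts onto the orders table. The sketch below illustrates the idea on toy frames. The events schema (`customer_id`, `event_type`) and the event rows are hypothetical, since the notebook never loads the events file; only the `customer_id` and `total_usd` values come from the orders preview.

```python
import pandas as pd

# Three orders from the preview
orders_toy = pd.DataFrame({
    "customer_id": [13917, 1022, 6145],
    "total_usd": [85.72, 116.17, 137.35],
})

# Hypothetical events table: column names and rows are invented for illustration
events_toy = pd.DataFrame({
    "customer_id": [13917, 13917, 1022],
    "event_type": ["view", "add_to_cart", "view"],
})

# Count events per customer, then left-join so orders without events are kept
event_counts = (
    events_toy.groupby("customer_id").size().rename("n_events").reset_index()
)
joined = orders_toy.merge(event_counts, on="customer_id", how="left")
joined["n_events"] = joined["n_events"].fillna(0).astype(int)

print(joined)
```

A left join keeps every order even when no events match, which avoids silently shrinking the dataset before any correlation is computed.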

If you found this notebook engaging and informative, please upvote it.