It's simple to buy any product with a click and have it delivered to your door. Online shopping has been rapidly evolving over the last few years, making our lives easier. But behind the scenes, e-commerce companies face a complex challenge that needs to be addressed.

Uncertainty plays a big role in how supply chains plan and organize their operations to ensure that products are delivered on time. It can lead to stockouts, delayed deliveries, and increased operational costs.

You work for the Sales & Operations Planning (S&OP) team at a multinational e-commerce company. The team needs your help planning for the upcoming end-of-year sales and wants to use your insights to plan promotions and manage inventory. The goal is to have the right products in stock when needed and to keep customers satisfied with prompt delivery to their doorstep.


## The Data

You are provided with a sales dataset to use. A summary and preview are provided below.

### Online Retail.csv

| Column     | Description              |
|------------|--------------------------|
| `'InvoiceNo'` | A 6-digit number uniquely assigned to each transaction |
| `'StockCode'` | A 5-digit number uniquely assigned to each distinct product |
| `'Description'` | The product name |
| `'Quantity'` | The quantity of each product (item) per transaction |
| `'UnitPrice'` | Product price per unit |
| `'CustomerID'` | A 5-digit number uniquely assigned to each customer |
| `'Country'` | The name of the country where each customer resides |
| `'InvoiceDate'` | The date and time when each transaction was generated (format `"MM/DD/YYYY"`) |
| `'Year'` | The year when each transaction was generated |
| `'Month'` | The month when each transaction was generated |
| `'Week'` | The week when each transaction was generated (`1`-`52`) |
| `'Day'` | The day of the month when each transaction was generated (`1`-`31`) |
| `'DayOfWeek'` | The day of the week when each transaction was generated <br>(`0` = Monday, `6` = Sunday) |

**Analyze the Online Retail.csv dataset and build a forecasting model to predict 'Quantity' of products sold.**

1. Split the data into two sets based on the splitting date, "2011-09-25". All data up to and including this date should be in the training set, while data after this date should be in the test set. Return a pandas DataFrame, pd_daily_train_data, containing, at least, the columns "Country", "StockCode", "InvoiceDate", "Quantity".

2. Using your test set, calculate the Mean Absolute Error (MAE) of your forecast model for the 'Quantity' sold. Return a double (float) named mae.

3. How many units are expected to be sold during week 39 of 2011? Store as an integer variable called quantity_sold_w39.
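The split in task 1 only behaves chronologically when `InvoiceDate` is compared as a date rather than as text. Below is a minimal plain-Python sketch of that idea, not the project's reference solution; the sample rows, the `parse_invoice_date` helper, and the exact `MM/DD/YYYY` string layout are assumptions made for illustration.

```python
from datetime import date, datetime

SPLIT_DATE = date(2011, 9, 25)

# Hypothetical rows: (InvoiceDate as text, Quantity)
rows = [
    ("12/01/2010", 6),
    ("09/25/2011", 4),
    ("09/26/2011", 10),
    ("10/02/2011", 3),
]


def parse_invoice_date(text):
    """Parse an assumed MM/DD/YYYY string into a date."""
    return datetime.strptime(text, "%m/%d/%Y").date()


# Training set: up to and including the split date; test set: strictly after
train = [r for r in rows if parse_invoice_date(r[0]) <= SPLIT_DATE]
test = [r for r in rows if parse_invoice_date(r[0]) > SPLIT_DATE]

# Comparing the raw text instead is lexicographic and misfiles rows
naive_train = [r for r in rows if r[0] <= "2011-09-25"]

print(len(train), len(test), len(naive_train))
```

In PySpark the analogous step would likely be to parse the column with `to_date` and a pattern that matches the file before filtering; the right pattern depends on how the timestamps are actually stored.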

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, weekofyear
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import StringIndexer
from pyspark.sql.types import IntegerType

In [2]:
# Create a SparkSession
spark = SparkSession.builder \
    .appName("OnlineRetailForecasting") \
    .getOrCreate()

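# Load the dataset
df = spark.read.csv("OnlineRetail.csv", header=True, inferSchema=True)

# Filter the data based on the splitting date (the training set includes the split date).
# Note: these comparisons are chronological only if 'InvoiceDate' was inferred as a
# date/timestamp; if it was read as a string, the comparison is lexicographic.
split_date = "2011-09-25"
train_data = df.filter(col("InvoiceDate") <= split_date)
test_data = df.filter(col("InvoiceDate") > split_date)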

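# Index the 'Country' column. Fit the indexer on the training set only and reuse
# the fitted model, so both splits share the same country-to-index mapping.
# handleInvalid="keep" assigns countries unseen during fitting to an extra index.
indexer = StringIndexer(inputCol="Country", outputCol="CountryIndex", handleInvalid="keep")
indexer_model = indexer.fit(train_data)
indexed_train_data = indexer_model.transform(train_data)
indexed_test_data = indexer_model.transform(test_data)

# Select relevant columns for the training and test data
train_data_final = indexed_train_data.select("CountryIndex", "StockCode", "InvoiceDate", "Quantity")
test_data_final = indexed_test_data.select("CountryIndex", "StockCode", "InvoiceDate", "Quantity")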
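A label encoder such as `StringIndexer` learns its mapping from whatever data it is fitted on, so fitting it separately on each split can give the same country different indices in training and test. The plain-Python sketch below illustrates the effect. The `fit_index` and `transform` helpers and the sample countries are hypothetical, and the most-frequent-first ordering only loosely mirrors the indexer's default.

```python
from collections import Counter


def fit_index(labels):
    """Map each label to an index, most frequent label first (ties alphabetical)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def transform(labels, mapping):
    """Encode labels; labels unseen during fitting share one extra index."""
    unknown = len(mapping)
    return [mapping.get(label, unknown) for label in labels]


# Hypothetical country columns for the two splits
train_countries = ["United Kingdom", "France", "United Kingdom", "Germany"]
test_countries = ["Germany", "Germany", "France", "Spain"]

mapping = fit_index(train_countries)

# Encoding the test split with the training mapping keeps indices consistent
shared = transform(test_countries, mapping)

# Refitting on the test split silently changes what each index means
refit = transform(test_countries, fit_index(test_countries))

print(mapping)
print(shared, refit)
```

Here "Germany" is encoded as 2 under the training mapping but as 0 after refitting, so a model trained on the first encoding would read the second one as a different country.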


In [3]:
df_train_numeric = train_data_final.filter(col("StockCode").rlike("^[0-9]+$")) \
                                    .withColumn("StockCode", col("StockCode").cast(IntegerType()))
df_test_numeric = test_data_final.filter(col("StockCode").rlike("^[0-9]+$")) \
                                    .withColumn("StockCode", col("StockCode").cast(IntegerType()))
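The `rlike("^[0-9]+$")` filter keeps only stock codes made up entirely of digits, so any code with a letter suffix or a non-product label is excluded from both training and evaluation. The sketch below shows the same pattern with Python's `re` module. The sample codes are invented for illustration and may not appear in the dataset.

```python
import re

# Same pattern as the Spark filter: one or more digits and nothing else
NUMERIC_CODE = re.compile(r"^[0-9]+$")

# Hypothetical stock codes
codes = ["85123", "85123A", "POST", "22423", ""]

kept = [int(code) for code in codes if NUMERIC_CODE.match(code)]
dropped = [code for code in codes if not NUMERIC_CODE.match(code)]

print(kept)
print(dropped)
```

Counting how many rows fall into the dropped group is a cheap check on how much of the sales volume the model never sees.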

In [4]:
# Define features
feature_columns = ["CountryIndex", "StockCode"]  # Add more features if needed

# Vectorize features
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
vectorized_train_data = assembler.transform(df_train_numeric)

In [5]:
rf = RandomForestRegressor(featuresCol="features", labelCol="Quantity", maxBins=50)
model = rf.fit(vectorized_train_data)

                                                                                

In [6]:
vectorized_test_data = assembler.transform(df_test_numeric)
predictions = model.transform(vectorized_test_data)

In [7]:
# Evaluate the model using Mean Absolute Error (MAE)
evaluator = RegressionEvaluator(labelCol="Quantity", predictionCol="prediction", metricName="mae")
mae = evaluator.evaluate(predictions)

print("Mean Absolute Error (MAE):", mae)




Mean Absolute Error (MAE): 11.570475212250512
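For reference, MAE is the average of the absolute differences between actual and predicted values. The sketch below computes it by hand on a few made-up numbers; the `mean_absolute_error` helper and the sample values are illustrative and unrelated to the model output above.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical quantities and forecasts
actual = [6, 4, 10, 3]
predicted = [5.0, 7.0, 8.5, 3.0]

# Absolute errors are 1.0, 3.0, 1.5 and 0.0, so the mean is 5.5 / 4
mae_example = mean_absolute_error(actual, predicted)
print(mae_example)
```

Because MAE is expressed in the units of the label, a value near 11.57 reads as an average miss of roughly 11 to 12 units per row.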


                                                                                

In [8]:
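# Task 3 asks for the forecast, so sum the model's predictions (not the actual
# 'Quantity') over week 39 of 2011
week_39_quantity = predictions.filter((year("InvoiceDate") == 2011) & (weekofyear("InvoiceDate") == 39)) \
                              .selectExpr("sum(prediction) as total_quantity") \
                              .collect()[0]["total_quantity"]

# Store the result as an integer, falling back to 0 when no rows match
quantity_sold_w39 = int(round(week_39_quantity)) if week_39_quantity is not None else 0

print("Units expected to be sold during week 39 of 2011:", quantity_sold_w39)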

                                                                                

Units expected to be sold during week 39 of 2011: 0
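A result of 0 is worth sanity-checking against the calendar. Spark's `weekofyear` follows ISO 8601 week numbering, and the standard library can show which dates week 39 of 2011 covers. The sketch below also sums a forecast over that week. The `daily_forecast` values and the `in_week_39` helper are made up for illustration.

```python
from datetime import date, timedelta

# ISO week 39 of 2011 runs Monday to Sunday
week_start = date.fromisocalendar(2011, 39, 1)
week_end = week_start + timedelta(days=6)

# Hypothetical (date, predicted quantity) pairs
daily_forecast = [
    (date(2011, 9, 25), 120.4),
    (date(2011, 9, 26), 98.7),
    (date(2011, 9, 30), 143.2),
    (date(2011, 10, 2), 80.1),
    (date(2011, 10, 3), 75.0),
]


def in_week_39(day):
    """True when the date falls in ISO week 39 of ISO year 2011."""
    iso = day.isocalendar()
    return iso[0] == 2011 and iso[1] == 39


# Sum only the forecasts inside the week, then round to a whole number of units
forecast_w39 = int(round(sum(q for d, q in daily_forecast if in_week_39(d))))

print(week_start, week_end, forecast_w39)
```

Week 39 starts on 2011-09-26, the day after the split date, so it lies entirely in the test set. A total of 0 therefore more likely means the week filter matched no rows, for example because `InvoiceDate` was not parsed as a date, than that nothing was sold.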
