# Time Series

The purpose of this notebook is to forecast the quantity sold of each unique product (item) over the last 4 weeks. To check that the predictions are accurate, the actual data for the last 4 available weeks will be used for a later graphical comparison in Power BI.

## Libraries

In [None]:
%pip install xgboost



In [None]:
# imports time series
import pandas as pd
import numpy as np

# plots
import matplotlib.pyplot as plt
%matplotlib inline

# model metrics
from sklearn.metrics import mean_squared_error
import xgboost as xgb

# misc
import os
import time
import itertools
import warnings
warnings.filterwarnings("ignore")

# time series tools
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing, Holt


## Dataset


In [None]:
df = pd.read_csv('C:/Users/paula/OneDrive/Documentos/Nuclio/TFM/gb_union_weeks.csv')


In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
# convert the date column to datetime
df["date"] = pd.to_datetime(df["date"], format = "%Y-%m-%d")

In [None]:
MIN_DATE = df["date"].min()
MAX_DATE = df["date"].max()
print(f"Min date is {MIN_DATE}\nMax date is {MAX_DATE}")

In [None]:
# Drop the 2011 data because it is incomplete and the 2012 data because it has a growth spike that could affect the forecast
df = df[df['date'] >= '2013-01-01']

In [None]:
df.isnull().sum()

In [None]:
# sort the values
df.sort_values(["item", "date"], ascending = True, inplace = True)
df.tail(30)

In [None]:
# impute missing prices within each id: backfill first, then forward fill
df["sell_price"] = df.groupby("id", group_keys = False)["sell_price"].apply(
    lambda series: series.bfill().ffill()
)

Since the "quantity" column is the one to be predicted, the values we have for the forecast weeks must be removed and set to 0; otherwise the model would learn from them and the prediction would not be valid.

In [None]:
# Set the 'quantity' column to null
# for the specified weeks
weeks = [201613.0, 201614.0, 201615.0, 201616.0]
df.loc[df['year_week'].isin(weeks), ['quantity']] = np.nan

In [None]:
df["quantity"] = df["quantity"].fillna(0)

In [None]:
df.tail()

## EDA

### EDA: Global

We visualize how the data is distributed over time, along with autocorrelation plots (as used in ARIMA) to see the impact that lags could have on the model.

In [None]:
def plot_ts_acf_pacf(y, title):
    '''
    Plots the ts you pass and the acf and pacf.
    '''
    fig = plt.figure(figsize = (12, 10))
    ax1, ax2, ax3 = fig.subplots(3, 1)

    ax1.plot(y)
    plot_acf(x = y, ax = ax2, lags = 14)
    plot_pacf(x = y, ax = ax3, lags = 14)

    plt.suptitle(t = title, fontsize = 20)

In [None]:
y = df.set_index("date").resample("W")["quantity"].sum()[:-4] # drop the records for the last 4 weeks

In [None]:
plot_ts_acf_pacf(y = y, title = "Weekly Sales for all items in all shops");

### EDA: features

We visualize several columns that could be important when training the model.

In [None]:
(
    df.
    groupby(["store"])
    ["quantity"].sum()
    .sort_values(ascending = False)
    .plot(kind = "bar", figsize = (12, 4))
);

In [None]:
(
    df.
    groupby(["region"])
    ["quantity"].sum()
    .sort_values(ascending = False)
    .plot(kind = "bar", figsize = (12, 4))
);

In [None]:
(
    df.
    groupby(["category"])
    ["quantity"].sum()
    .sort_values(ascending = False)
    .plot(kind = "bar", figsize = (12, 4))
);

## Feature Engineering

Since XGBoost, the algorithm used to train the multivariate time series model, does not accept string-valued categorical columns by default, a coding rule was created (the numeric values were chosen arbitrarily):

- Category: accesories = 12, home_&_garden = 13, supermarket = 14
- Region: Boston = 21, New York = 31, Philadelphia = 41
- Store: South_End = 1, Roxbury = 2, Back_Bay = 3, Greenwich_Village = 4, Harlem = 5, Tribeca = 6, Brooklyn = 7, Midtown_Village = 8, Yorktown = 9, Queen_Village = 10.

The remaining columns were built by concatenating these codes.
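
As a rough illustration of the concatenation rule, the sketch below builds a store code from a region code and a store number. The helper `make_store_code` and the dictionary names are hypothetical and not part of the notebook; the store-to-region pairing (for example, South_End in Boston) is an assumption inferred from codes such as 211 for `BOS_1`.

```python
# Hypothetical helper, only to illustrate the concatenation rule.
REGION_CODES = {"Boston": 21, "New York": 31, "Philadelphia": 41}
STORE_CODES = {
    "South_End": 1, "Roxbury": 2, "Back_Bay": 3, "Greenwich_Village": 4,
    "Harlem": 5, "Tribeca": 6, "Brooklyn": 7, "Midtown_Village": 8,
    "Yorktown": 9, "Queen_Village": 10,
}

def make_store_code(region: str, store: str) -> int:
    """Concatenate the region code and the store number as digits."""
    return int(f"{REGION_CODES[region]}{STORE_CODES[store]}")

# Assumed pairings: South_End in Boston, Queen_Village in Philadelphia.
print(make_store_code("Boston", "South_End"))            # 211
print(make_store_code("Philadelphia", "Queen_Village"))  # 4110
```

Because the digits are concatenated rather than added, the resulting codes have no numeric meaning beyond identifying the store.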


In [None]:
df.head()

In [None]:
df['category'].unique()

In [None]:
category_map = {
    'ACCESORIES': 12,
    'HOME_&_GARDEN': 13,
    'SUPERMARKET': 14
}

df['category'] = df['category'].map(category_map)

In [None]:
df['department'].unique()

In [None]:
import re

def replace_department(value):
    # Find the numeric part of the value
    number_part = re.search(r'\d+', value).group()

    # Replace the text part with the corresponding category code
    if 'ACCESORIES' in value:
        return f'12{number_part}'
    elif 'HOME_&_GARDEN' in value:
        return f'13{number_part}'
    elif 'SUPERMARKET' in value:
        return f'14{number_part}'
    else:
        return value  # Return the original value if no category matches

# Apply the function to the 'department' column
df['department'] = df['department'].apply(replace_department)

In [None]:
df['department'] = df['department'].astype('int64')

In [None]:
df['item'].unique()

In [None]:
def replace_item(value):
    # Extract the parts of the value
    parts = re.match(r'(ACCESORIES|HOME_&_GARDEN|SUPERMARKET)_(\d+)_(\d+)', value)
    if parts:
        category, group, number = parts.groups()

        # Assign the numeric code for the category
        if category == 'ACCESORIES':
            category_number = '12'
        elif category == 'HOME_&_GARDEN':
            category_number = '13'
        elif category == 'SUPERMARKET':
            category_number = '14'

        # Build the new value
        return f'{category_number}_{group}_{number}'

# Apply the function to the 'item' column
df['item'] = df['item'].apply(replace_item)

In [None]:
df['region'].unique()

In [None]:
region_map = {
    'Boston': 21,
    'New York': 31,
    'Philadelphia': 41
}

df['region'] = df['region'].map(region_map)

In [None]:
df['store'].unique()

In [None]:
stores_with_numbers = {'South_End': 1, 'Roxbury': 2, 'Back_Bay': 3, 'Greenwich_Village': 4,
                       'Harlem': 5, 'Tribeca': 6, 'Brooklyn': 7, 'Midtown_Village': 8,
                       'Yorktown': 9, 'Queen_Village': 10}

# Replace the values in the 'store' column with the numbers from the dictionary
df['store'] = df['store'].map(stores_with_numbers)

In [None]:
df['store_code'].unique()

In [None]:
store_code_mapping = {
    'BOS_1': 211,
    'BOS_2': 212,
    'BOS_3': 213,
    'NYC_1': 314,
    'NYC_2': 315,
    'NYC_3': 316,
    'NYC_4': 317,
    'PHI_1': 418,
    'PHI_2': 419,
    'PHI_3': 4110
}

df['store_code'] = df['store_code'].map(store_code_mapping)

In [None]:
df.head()

In [None]:
df.drop(['id'], axis=1, inplace=True)

In [None]:
# Calendar features derived from the date
df['day'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.dayofweek
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df['trim'] = df['date'].dt.quarter

In [None]:
df.head()

## MA

In [None]:
df['item_store_sp_ma'] = df.groupby(['item','store'])['sell_price'].transform(
    lambda series:series.shift(1).rolling(window=3).mean())

In [None]:
df['item_department_sp_ma'] = df.groupby(['item','department'])['sell_price'].transform(
    lambda series:series.shift(1).rolling(window=3).mean())

In [None]:
df['department_store_sp_ma'] = df.groupby(['department','store'])['sell_price'].transform(
    lambda series:series.shift(1).rolling(window=3).mean())

In [None]:
df["sell_price_ma"] = df.groupby(["item"])["sell_price"].transform(
    lambda series:series.shift(1).rolling(window=3).mean())

In [None]:
df['item_region_month_q_ma'] = df.groupby(['item','region','month'])['sell_price'].transform(
    lambda series:series.shift(1).rolling(window=3).mean())

In [None]:
df['item_cat_month_q_ma'] = df.groupby(['category','region','month'])['sell_price'].transform(
    lambda series:series.shift(1).rolling(window=2).mean())

In [None]:
df['category_store_sp_ma'] = df.groupby(['category','store'])['sell_price'].transform(
    lambda series:series.shift(1).rolling(window=3).mean())

In [None]:
df['item_department_month_sp_ma'] = df.groupby(['item','department','month'])['sell_price'].transform(
    lambda series:series.shift(1).rolling(window=3).mean())
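
The moving averages above all follow the same pattern: shift the series by one row, then average a rolling window. A minimal sketch on made-up data (the `toy` frame is hypothetical) shows that each row's feature only uses earlier rows of its group, never its own value:

```python
import pandas as pd

# Toy data: one item with five consecutive prices (illustrative values).
toy = pd.DataFrame({
    "item": ["A"] * 5,
    "sell_price": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Same pattern as the notebook: shift by one row, then a 3-row rolling mean.
toy["sell_price_ma"] = toy.groupby("item")["sell_price"].transform(
    lambda series: series.shift(1).rolling(window=3).mean()
)

print(toy)
# The first three rows are NaN (not enough history);
# the fourth row holds mean(1, 2, 3) and the fifth holds mean(2, 3, 4).
```

The leading NaN values are expected; XGBoost can handle missing feature values natively, so they do not need to be imputed for training.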

## Build Time Series Features - Lags

### Build Time Series Features - Lag 1

In [None]:
def build_ts_vars(df, gb_list, target_column, agg_func, agg_func_name,lag=1): # RS: added lag parameter
    assert "date" in df.columns.tolist(), "Date must be in df columns"
    new_name = "_".join(gb_list + [target_column] + [agg_func_name])
    gb_df_ = (
        df
        .set_index("date")
        .groupby(gb_list)
        .resample("M")[target_column]
        .apply(agg_func)
        .to_frame()
        .reset_index()
        .rename(
            columns = {target_column : new_name}
        )
    )
    i=1
    while i <= lag:
        gb_df_[f"{new_name}_lag{i}"] = gb_df_.groupby(gb_list)[new_name].transform(lambda series: series.shift(i))
        i+=1
    print(f"Dropping columns that might cause target leakage {new_name}")
    gb_df_.drop(new_name, inplace = True, axis = 1)
    dfst = pd.merge(df, gb_df_, on = ["date"] + gb_list, how = "left")
    return dfst
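
One detail worth checking in `build_ts_vars`: `resample("M")` labels each monthly bucket with the month-end date, so the final merge on the exact `date` only matches rows whose date falls on a month end; all other rows receive NaN lags. The sketch below is an alternative under stated assumptions, not the notebook's method: it joins on a calendar-month key instead. The `toy` frame and the `month_key` column are hypothetical, and the lag assumes no month is missing within a group.

```python
import pandas as pd

# Toy weekly data for one department (values are made up for illustration).
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2016-01-03", "2016-01-10", "2016-02-07", "2016-02-14", "2016-03-06"]
    ),
    "department": [121] * 5,
    "sell_price": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Hypothetical join key: the calendar month of each row.
toy["month_key"] = toy["date"].dt.to_period("M")

# Monthly aggregate per department.
monthly = (
    toy.groupby(["department", "month_key"])["sell_price"]
    .sum()
    .rename("department_sell_price_sum")
    .reset_index()
)

# Lag 1: the previous month's aggregate within each department.
monthly["department_sell_price_sum_lag1"] = (
    monthly.groupby("department")["department_sell_price_sum"].shift(1)
)

# Drop the unlagged aggregate and merge the lag back on the month key.
out = toy.merge(
    monthly.drop(columns="department_sell_price_sum"),
    on=["department", "month_key"],
    how="left",
)
print(out[["date", "department_sell_price_sum_lag1"]])
```

With this key, every February row receives the January total and every March row the February total, regardless of the day of the month.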

In [None]:
GB_LIST = ["department"]
TARGET_COLUMN = "sell_price"
AGG_FUNC = np.sum
AGG_FUNC_NAME = "sum"

vars_ts_ = build_ts_vars(
    df = df,
    gb_list = GB_LIST,
    target_column = TARGET_COLUMN,
    agg_func = AGG_FUNC,
    agg_func_name =  AGG_FUNC_NAME
)

df=vars_ts_

In [None]:
GB_LIST = ["category"]
TARGET_COLUMN = "sell_price"
AGG_FUNC = np.sum
AGG_FUNC_NAME = "sum"

vars_ts_ = build_ts_vars(
    df = df,
    gb_list = GB_LIST,
    target_column = TARGET_COLUMN,
    agg_func = AGG_FUNC,
    agg_func_name =  AGG_FUNC_NAME
)
df=vars_ts_

In [None]:

df.head()

### Build Time Series Features - Lag 2

In [None]:
def build_ts_vars(df, gb_list, target_column, agg_func, agg_func_name,lag=2): # RS: added lag parameter
    assert "date" in df.columns.tolist(), "Date must be in df columns"
    new_name = "_".join(gb_list + [target_column] + [agg_func_name])
    gb_df_ = (
        df
        .set_index("date")
        .groupby(gb_list)
        .resample("M")[target_column]
        .apply(agg_func)
        .to_frame()
        .reset_index()
        .rename(
            columns = {target_column : new_name}
        )
    )
    i=2
    while i <= lag:
        gb_df_[f"{new_name}_lag{i}"] = gb_df_.groupby(gb_list)[new_name].transform(lambda series: series.shift(i))
        i+=1
    print(f"Dropping columns that might cause target leakage {new_name}")
    gb_df_.drop(new_name, inplace = True, axis = 1)
    dfst = pd.merge(df, gb_df_, on = ["date"] + gb_list, how = "left")
    return dfst

In [None]:
GB_LIST = ["department"]
TARGET_COLUMN = "sell_price"
AGG_FUNC = np.sum
AGG_FUNC_NAME = "sum"

vars_ts_ = build_ts_vars(
    df = df,
    gb_list = GB_LIST,
    target_column = TARGET_COLUMN,
    agg_func = AGG_FUNC,
    agg_func_name =  AGG_FUNC_NAME
)
df=vars_ts_

In [None]:
GB_LIST = ["category"]
TARGET_COLUMN = "sell_price"
AGG_FUNC = np.sum
AGG_FUNC_NAME = "sum"

vars_ts_ = build_ts_vars(
    df = df,
    gb_list = GB_LIST,
    target_column = TARGET_COLUMN,
    agg_func = AGG_FUNC,
    agg_func_name =  AGG_FUNC_NAME
)
df=vars_ts_

In [None]:
df.head()

### Build Time Series Features - Lag 3

In [None]:
def build_ts_vars(df, gb_list, target_column, agg_func, agg_func_name,lag=3): # RS: added lag parameter
    assert "date" in df.columns.tolist(), "Date must be in df columns"
    new_name = "_".join(gb_list + [target_column] + [agg_func_name])
    gb_df_ = (
        df
        .set_index("date")
        .groupby(gb_list)
        .resample("M")[target_column]
        .apply(agg_func)
        .to_frame()
        .reset_index()
        .rename(
            columns = {target_column : new_name}
        )
    )
    i=3
    while i <= lag:
        gb_df_[f"{new_name}_lag{i}"] = gb_df_.groupby(gb_list)[new_name].transform(lambda series: series.shift(i))
        i+=1
    print(f"Dropping columns that might cause target leakage {new_name}")
    gb_df_.drop(new_name, inplace = True, axis = 1)
    dfst = pd.merge(df, gb_df_, on = ["date"] + gb_list, how = "left")
    return dfst

In [None]:
GB_LIST = ["department"]
TARGET_COLUMN = "sell_price"
AGG_FUNC = np.sum
AGG_FUNC_NAME = "sum"

vars_ts_ = build_ts_vars(
    df = df,
    gb_list = GB_LIST,
    target_column = TARGET_COLUMN,
    agg_func = AGG_FUNC,
    agg_func_name =  AGG_FUNC_NAME
)
df=vars_ts_

In [None]:
GB_LIST = ["category"]
TARGET_COLUMN = "sell_price"
AGG_FUNC = np.sum
AGG_FUNC_NAME = "sum"

vars_ts_ = build_ts_vars(
    df = df,
    gb_list = GB_LIST,
    target_column = TARGET_COLUMN,
    agg_func = AGG_FUNC,
    agg_func_name =  AGG_FUNC_NAME
)
df=vars_ts_

In [None]:
df.head()

## Train - Test

In [None]:
df.columns.tolist()

In [None]:
df.set_index("item", inplace = True)

In [None]:
train_index = sorted(list(df["date"].unique()))[:-8]

valida_index = sorted(list(df["date"].unique()))[-8:-4]

test_index = sorted(list(df["date"].unique()))[-4:]

In [None]:
print(f"Our train index is {train_index[:2]} - ... - {train_index[-2:]}\n")
print(f"Our validation index is {valida_index}\n")
print(f"Our test/prediction index is {test_index}\n")

In [None]:
X_train = df[df["date"].isin(train_index)].drop(['quantity', "date"], axis=1)
Y_train = df[df["date"].isin(train_index)]['quantity']

X_valida = df[df["date"].isin(valida_index)].drop(['quantity', "date"], axis=1)
Y_valida = df[df["date"].isin(valida_index)]['quantity']

X_test = df[df["date"].isin(test_index)].drop(['quantity', "date"], axis = 1)
Y_test = df[df["date"].isin(test_index)]['quantity']
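
The split above is purely chronological: the last 4 dates form the test set, the 4 before them the validation set, and everything earlier the training set. A small sketch on a hypothetical range of 12 weekly dates makes the slicing explicit:

```python
import pandas as pd

# Hypothetical weekly dates, standing in for df["date"].unique().
unique_dates = sorted(pd.date_range("2016-01-03", periods=12, freq="W"))

train_dates = unique_dates[:-8]    # everything before the last 8 weeks
valid_dates = unique_dates[-8:-4]  # the 4 weeks before the test window
test_dates = unique_dates[-4:]     # the last 4 weeks, to be predicted

print(len(train_dates), len(valid_dates), len(test_dates))
print(max(train_dates) < min(valid_dates) < min(test_dates))
```

Keeping the three windows in time order means the model is never validated on dates earlier than the ones it was trained on.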

## Model Train

In [None]:
model = xgb.XGBRegressor(eval_metric = "rmse", random_state = 175)
model.fit(X_train, Y_train, eval_set = [(X_train, Y_train), (X_valida, Y_valida)], verbose = True)

In [None]:
fig, ax = plt.subplots(figsize = (10, 15))
xgb.plot_importance(model, importance_type = "gain", ax = ax);

## Prediction

In [None]:
if "quantity" in X_test.columns:
    X_test.drop("quantity", axis = 1, inplace = True)

Y_test_predict = model.predict(X_test)
X_test["quantity"] = Y_test_predict

In [None]:
X_test.reset_index(inplace = True)

In [None]:
Y_train_predict = model.predict(X_train)
Y_valida_predict = model.predict(X_valida)

rmse_train = np.sqrt(
    mean_squared_error(
        y_true = Y_train,
        y_pred = Y_train_predict
    )
)

rmse_valida = np.sqrt(
    mean_squared_error(
        y_true = Y_valida,
        y_pred = Y_valida_predict
    )
)

# Filename-safe versions of the scores (the decimal point becomes "_")
rmse_train_tag = str(round(rmse_train, 3)).replace(".", "_")
rmse_valida_tag = str(round(rmse_valida, 3)).replace(".", "_")

In [None]:
print(f"Train RMSE: {rmse_train:.3f}")
print(f"Validation RMSE: {rmse_valida:.3f}")

In [None]:
(
    X_test[["item", "quantity"]]
    .to_csv(f"submission_train_{rmse_train_tag}_valida_{rmse_valida_tag}.csv", index = False)
)

In [None]:
X_test

In [None]:
MIN_QTY = X_test["quantity"].min()
MAX_QTY = X_test["quantity"].max()
print(f"Min sales is {MIN_QTY}\nMax sales is {MAX_QTY}")

In [None]:
X_test['quantity'] =X_test['quantity'].round(0)

In [None]:
X_test['quantity'] =X_test['quantity'].astype('int64')

In [None]:
X_test['quantity'] = np.where(X_test['quantity'] < 0, 0, X_test['quantity'])

In [None]:
X_test

## X_test transform

The encoded columns are converted back to their original format.

In [None]:
region_map = {
     21: 'Boston',
     31: 'New York',
     41: 'Philadelphia'
}

X_test['region'] = X_test['region'].map(region_map)

In [None]:
category_map = {
     12: 'ACCESORIES',
     13: 'HOME_&_GARDEN',
     14: 'SUPERMARKET'
}

X_test['category'] = X_test['category'].map(category_map)

In [None]:
stores_with_numbers = {
     1: 'South_End',
     2: 'Roxbury',
     3: 'Back_Bay',
     4: 'Greenwich_Village',
     5: 'Harlem',
     6: 'Tribeca',
     7: 'Brooklyn',
     8: 'Midtown_Village',
     9: 'Yorktown',
     10: 'Queen_Village'}


X_test['store'] = X_test['store'].map(stores_with_numbers)

In [None]:
def replace_item2(value):
    # Extract the parts of the value
    parts = re.match(r'(12|13|14)_(\d+)_(\d+)', value)
    if parts:
        category, group, number = parts.groups()

        # Map the numeric code back to the category name
        if category == '12':
            category_number = 'ACCESORIES'
        elif category == '13':
            category_number = 'HOME_&_GARDEN'
        elif category == '14':
            category_number = 'SUPERMARKET'

        # Build the new value
        return f'{category_number}_{group}_{number}'

# Apply the function to the 'item' column
X_test['item'] = X_test['item'].apply(replace_item2)

In [None]:
department_num = {
     121: 'ACCESORIES_1',
     122: 'ACCESORIES_2',
     131: 'HOME_&_GARDEN_1',
     132: 'HOME_&_GARDEN_2',
     141: 'SUPERMARKET_1',
     142: 'SUPERMARKET_2',
     143: 'SUPERMARKET_3'

}
X_test['department'] = X_test['department'].map(department_num)
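
The reverse dictionaries above are typed out by hand. As a possible alternative, a reverse mapping can be derived from the forward one with a dict comprehension, which keeps the two from drifting apart. The names `region_to_code` and `code_to_region` are hypothetical, and the sketch assumes the codes are unique:

```python
import pandas as pd

# Forward mapping, as used in the feature engineering step.
region_to_code = {"Boston": 21, "New York": 31, "Philadelphia": 41}

# Derived reverse mapping (assumes every code is unique).
code_to_region = {code: name for name, code in region_to_code.items()}

encoded = pd.Series([21, 41, 31])
decoded = encoded.map(code_to_region)
print(decoded.tolist())  # ['Boston', 'Philadelphia', 'New York']
```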

In [None]:
X_test

In [None]:
X_test.to_csv('time_series_xgboost.csv', index=False)