### Retailer Sales Data Forecasting

This dataset contains the daily sales data of a US retailer.  
Your objective is to forecast the total sales for each State over the next 12 months, using the historical data provided.


### Part 1: Top-Down Approach

1. **Aggregate Sales:**  
   Combine the sales data to create total sales at the `year-month` level, using the `Order Date` as the time variable for aggregation.


2. **Train a Model:**  
   Using the aggregated `year-month` data, train a model to forecast the total sales for the next 12 months.  
   *(The choice of model is up to you.)*

3. **Disaggregate Predictions:**  
   Split the predicted sales from the `year-month` level back to the `year-month-State` level.  
   *(The splitting strategy is up to you.)*

4. **Evaluate Accuracy:**  
   Assess the forecast accuracy at both the `year-month` and `year-month-State` levels.
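
Steps 1 and 3 can be sketched on toy data as follows. This is only an illustration: the rows are made up, the forecast totals are placeholders for whatever model is chosen in step 2, and splitting by each State's share of all historical sales is just one possible strategy.

```python
import pandas as pd

# Toy rows standing in for the retailer file (values are made up for illustration).
orders = pd.DataFrame({
    "Order Date": pd.to_datetime(
        ["2017-01-05", "2017-01-20", "2017-02-03", "2017-02-14", "2017-02-28"]
    ),
    "State": ["California", "Texas", "California", "California", "Texas"],
    "Sales": [100.0, 50.0, 80.0, 20.0, 50.0],
})
orders["Year-Month"] = orders["Order Date"].dt.to_period("M")

# Step 1: aggregate to the year-month level.
monthly_total = orders.groupby("Year-Month")["Sales"].sum()

# Step 3 (assumed splitting rule): each State's share of all historical sales.
state_share = orders.groupby("State")["Sales"].sum() / orders["Sales"].sum()

# Placeholder year-month forecasts; a real model from step 2 would produce these.
forecast_total = pd.Series(
    [200.0, 300.0], index=pd.period_range("2017-03", periods=2, freq="M")
)

# Outer product: rows are forecast months, columns are States.
state_forecast = pd.DataFrame(
    forecast_total.values[:, None] * state_share.values[None, :],
    index=forecast_total.index,
    columns=state_share.index,
)
```

Because the shares sum to 1, the State-level forecasts add back up to the year-month forecast for every month.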


### Part 2: Alternative Approach

- Implement a different approach to forecast the next 12 months of sales at the `year-month-State` level.
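
One possible alternative is a bottom-up approach: forecast each State on its own and, if a total is needed, sum the State forecasts. The sketch below uses a seasonal-naive rule (repeat the value from 12 months earlier) purely as a placeholder model. The data and the `seasonal_naive` helper are hypothetical.

```python
import pandas as pd

# Toy year-month-State history (made-up numbers): 12 months for two States.
months = pd.period_range("2018-01", periods=12, freq="M")
history = pd.DataFrame(
    {
        "California": [float(100 + 10 * i) for i in range(12)],
        "Texas": [float(50 + 5 * i) for i in range(12)],
    },
    index=months,
)

def seasonal_naive(series: pd.Series, horizon: int = 12, season: int = 12) -> pd.Series:
    """Forecast each future month with the value observed `season` months earlier."""
    last_season = series.iloc[-season:].to_numpy()
    values = [last_season[h % season] for h in range(horizon)]
    future_index = pd.period_range(series.index[-1] + 1, periods=horizon, freq="M")
    return pd.Series(values, index=future_index, name=series.name)

# One forecast per State, then a bottom-up total.
state_forecast = pd.DataFrame({s: seasonal_naive(history[s]) for s in history.columns})
total_forecast = state_forecast.sum(axis=1)
```

Any per-State model could replace `seasonal_naive`; the point is only that the forecast is produced at the `year-month-State` level directly rather than split from a total.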


In [9]:
import pandas as pd

df = pd.read_csv('Retailer Sales Data.csv')
df['Order Date'] = pd.to_datetime(df['Order Date'], format='%d/%m/%Y')


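# Drop rows where 'State' is missing (we need state-level forecasts).
# .copy() makes df_cleaned an independent DataFrame, which avoids the
# SettingWithCopyWarning when the 'Year-Month' column is added below.
df_cleaned = df.dropna(subset=['State']).copy()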

# Create 'Year-Month' column from 'Order Date'
df_cleaned['Year-Month'] = df_cleaned['Order Date'].dt.to_period('M')

# Aggregate sales at the 'Year-Month' level
monthly_sales = df_cleaned.groupby('Year-Month')['Sales'].sum().reset_index()

# Convert 'Year-Month' back to a datetime format for consistency
monthly_sales['Year-Month'] = monthly_sales['Year-Month'].dt.to_timestamp()





# Test 1

In [None]:
from statsmodels.tsa.holtwinters import ExponentialSmoothing
import numpy as np

# Prepare data for time series forecasting
monthly_sales.set_index('Year-Month', inplace=True)

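# Apply Exponential Smoothing with an additive trend and no seasonal component
# (Holt's linear trend method; seasonal_periods is not used while seasonal=None)
model = ExponentialSmoothing(monthly_sales['Sales'], trend='add', seasonal=None, seasonal_periods=12)
hw_model = model.fit()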

# Forecast the next 12 months
forecast_periods = 12
forecast = hw_model.forecast(steps=forecast_periods)

# Generate future dates for the forecast
future_dates = pd.date_range(start=monthly_sales.index[-1] + pd.DateOffset(months=1), periods=forecast_periods, freq='MS')

# Combine forecast with future dates
forecast_df = pd.DataFrame({'Year-Month': future_dates, 'Forecasted Sales': forecast})

# Show the forecasted results
forecast_df.head()


In [None]:
# Calculate historical state-wise proportions
df_cleaned['Year-Month'] = df_cleaned['Order Date'].dt.to_period('M')

# Aggregate sales by 'Year-Month' and 'State'
statewise_monthly_sales = df_cleaned.groupby(['Year-Month', 'State'])['Sales'].sum().reset_index()

# Calculate state-wise sales as a proportion of the total sales in that month
statewise_monthly_sales['Year-Month'] = statewise_monthly_sales['Year-Month'].dt.to_timestamp()
total_monthly_sales = statewise_monthly_sales.groupby('Year-Month')['Sales'].sum().reset_index()

# Merge the total monthly sales with the state-wise sales
statewise_monthly_sales = pd.merge(statewise_monthly_sales, total_monthly_sales, on='Year-Month', suffixes=('', '_Total'))

# Calculate the proportion of state sales
statewise_monthly_sales['Sales_Proportion'] = statewise_monthly_sales['Sales'] / statewise_monthly_sales['Sales_Total']

# Use the last available month's proportions for disaggregation of the forecast
last_known_month = statewise_monthly_sales[statewise_monthly_sales['Year-Month'] == statewise_monthly_sales['Year-Month'].max()]

# Disaggregate the forecast based on the proportions from the last known month
disaggregated_forecast = []

for _, forecast_row in forecast_df.iterrows():
    forecast_month = forecast_row['Year-Month']
    forecast_value = forecast_row['Forecasted Sales']
    
    # Apply the state-wise proportions from the last available month
    for _, proportion_row in last_known_month.iterrows():
        state = proportion_row['State']
        proportion = proportion_row['Sales_Proportion']
        state_forecast = forecast_value * proportion
        
        disaggregated_forecast.append({
            'Year-Month': forecast_month,
            'State': state,
            'Forecasted Sales': state_forecast
        })

# Convert disaggregated forecast to DataFrame
disaggregated_forecast_df = pd.DataFrame(disaggregated_forecast)

from sklearn.metrics import mean_absolute_error, mean_squared_error

# Assuming 'monthly_sales' is your complete historical dataset
# Split data into training and testing sets (e.g., last 12 months for testing)
train_data = monthly_sales[:-12]
test_data = monthly_sales[-12:]

# Train the model on the training data
model = ExponentialSmoothing(train_data['Sales'], trend='add', seasonal=None, seasonal_periods=12)
hw_model = model.fit()

# Forecast the same period as the test data
predictions = hw_model.forecast(steps=12)

# Calculate error metrics
mae = mean_absolute_error(test_data['Sales'], predictions)
# RMSE via an explicit square root, since `squared=False` is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(test_data['Sales'], predictions))
mape = np.mean(np.abs((test_data['Sales'] - predictions) / test_data['Sales'])) * 100

# Print the metrics
print(f'Mean Absolute Error (MAE): {mae}')
print(f'Root Mean Squared Error (RMSE): {rmse}')
print(f'Mean Absolute Percentage Error (MAPE): {mape}%')



# Test 2

In [12]:
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Train-test split (the last 12 months are held out for testing)
train_data = monthly_sales[:-12]
test_data = monthly_sales[-12:]

# Fit a SARIMA model with example parameters (tune them to suit the data)
sarima_model = SARIMAX(train_data['Sales'],
                       order=(1, 1, 1),  # ARIMA parameters (p, d, q)
                       seasonal_order=(1, 1, 1, 12))  # Seasonal parameters (P, D, Q, S) with yearly seasonality

sarima_fit = sarima_model.fit(disp=False)

# Forecast the 12 test months
sarima_predictions = sarima_fit.forecast(steps=12)

# Compute the error metrics
mae_sarima = mean_absolute_error(test_data['Sales'], sarima_predictions)
# RMSE via an explicit square root, since `squared=False` is deprecated in recent scikit-learn releases
rmse_sarima = np.sqrt(mean_squared_error(test_data['Sales'], sarima_predictions))
mape_sarima = np.mean(np.abs((test_data['Sales'] - sarima_predictions) / test_data['Sales'])) * 100

# Print the results
print(f'Mean Absolute Error (MAE) - SARIMA: {mae_sarima}')
print(f'Root Mean Squared Error (RMSE) - SARIMA: {rmse_sarima}')
print(f'Mean Absolute Percentage Error (MAPE) - SARIMA: {mape_sarima}%')


Mean Absolute Error (MAE) - SARIMA: 15411.558676616303
Root Mean Squared Error (RMSE) - SARIMA: 18005.217400851787
Mean Absolute Percentage Error (MAPE) - SARIMA: 42.68170147743031%


  warn('Too few observations to estimate starting parameters%s.'


In [22]:
# Print the min and max of df_cleaned['Year-Month']
print(df_cleaned['Year-Month'].min())
print(df_cleaned['Year-Month'].max())


2015-01
2018-12
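
The printed range covers 2015-01 to 2018-12, but a State-level series does not necessarily have a row for every month in that range. A minimal sketch of filling such gaps with zeros before modelling or evaluation, on made-up data (treating a missing State-month as zero sales is an assumption that may or may not suit the real dataset):

```python
import pandas as pd

# Toy year-month-State sales with a gap: Alabama has no row for 2015-02,
# and Texas only appears in 2015-01.
sales = pd.DataFrame({
    "Year-Month": pd.PeriodIndex(["2015-01", "2015-03", "2015-01"], freq="M"),
    "State": ["Alabama", "Alabama", "Texas"],
    "Sales": [10.0, 30.0, 5.0],
})

# Every (month, State) combination that should exist.
all_months = pd.period_range("2015-01", "2015-03", freq="M")
all_states = sorted(sales["State"].unique())
full_index = pd.MultiIndex.from_product(
    [all_months, all_states], names=["Year-Month", "State"]
)

# Reindex onto the full grid; combinations with no recorded sales become 0.
complete = (
    sales.set_index(["Year-Month", "State"])["Sales"]
    .reindex(full_index, fill_value=0.0)
    .reset_index()
)
```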


In [13]:
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Assume 'df_cleaned' is the cleaned dataframe, ready to use
# Aggregate sales at the 'year-month' level
df_cleaned['Year-Month'] = df_cleaned['Order Date'].dt.to_period('M')
monthly_sales = df_cleaned.groupby('Year-Month')['Sales'].sum().reset_index()
monthly_sales['Year-Month'] = monthly_sales['Year-Month'].dt.to_timestamp()

# Train-test split (the last 12 months are held out for testing)
train_data = monthly_sales[:-12]
test_data = monthly_sales[-12:]

# Fit the SARIMA model
sarima_model = SARIMAX(train_data['Sales'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
sarima_fit = sarima_model.fit(disp=False)

# Forecast the next 12 months (the test period)
sarima_predictions = sarima_fit.forecast(steps=12)

# Compute the error metrics at the aggregate level
mae_sarima = mean_absolute_error(test_data['Sales'], sarima_predictions)
# RMSE via an explicit square root, since `squared=False` is deprecated in recent scikit-learn releases
rmse_sarima = np.sqrt(mean_squared_error(test_data['Sales'], sarima_predictions))
mape_sarima = np.mean(np.abs((test_data['Sales'] - sarima_predictions) / test_data['Sales'])) * 100

# Print the error metrics
print(f'Mean Absolute Error (MAE) - SARIMA: {mae_sarima}')
print(f'Root Mean Squared Error (RMSE) - SARIMA: {rmse_sarima}')
print(f'Mean Absolute Percentage Error (MAPE) - SARIMA: {mape_sarima}%')


Mean Absolute Error (MAE) - SARIMA: 15411.558676616303
Root Mean Squared Error (RMSE) - SARIMA: 18005.217400851787
Mean Absolute Percentage Error (MAPE) - SARIMA: 42.68170147743031%


  warn('Too few observations to estimate starting parameters%s.'


In [20]:
# Build a dataframe pairing the predictions with future dates.
# Note: sarima_predictions were produced for the 2018 hold-out period, but the dates
# generated here start in 2019-01, so the forecasts are labeled one year too late
# and do not line up with the 2018 actuals used for evaluation.
future_dates = pd.date_range(start=monthly_sales['Year-Month'].max() + pd.DateOffset(months=1), periods=12, freq='MS')
forecast_df = pd.DataFrame({'Year-Month': future_dates, 'Forecasted Sales': sarima_predictions})

### DISAGGREGATE THE FORECASTS TO THE STATE LEVEL ###

# Aggregate sales at the 'year-month-state' level to compute historical proportions
statewise_monthly_sales = df_cleaned.groupby(['Year-Month', 'State'])['Sales'].sum().reset_index()

# Make sure the 'Year-Month' column is of Period type
#statewise_monthly_sales['Year-Month'] = statewise_monthly_sales['Year-Month'].dt.to_period('M')

# Compute each state's share of the monthly total
total_monthly_sales = statewise_monthly_sales.groupby('Year-Month')['Sales'].sum().reset_index()
statewise_monthly_sales = pd.merge(statewise_monthly_sales, total_monthly_sales, on='Year-Month', suffixes=('', '_Total'))
statewise_monthly_sales['Sales_Proportion'] = statewise_monthly_sales['Sales'] / statewise_monthly_sales['Sales_Total']

# Use the last available month to obtain the disaggregation proportions
last_known_month = statewise_monthly_sales[statewise_monthly_sales['Year-Month'] == statewise_monthly_sales['Year-Month'].max()]

# Disaggregate the forecasts using the proportions from the last available month
disaggregated_forecast = []

for _, forecast_row in forecast_df.iterrows():
    forecast_month = forecast_row['Year-Month'].to_period('M')  # Make sure this is a Period
    forecast_value = forecast_row['Forecasted Sales']
    
    # Apply each state's historical proportion
    for _, proportion_row in last_known_month.iterrows():
        state = proportion_row['State']
        proportion = proportion_row['Sales_Proportion']
        state_forecast = forecast_value * proportion
        
        disaggregated_forecast.append({
            'Year-Month': forecast_month,
            'State': state,
            'Forecasted Sales': state_forecast
        })

# Convert the disaggregated forecasts to a dataframe
disaggregated_forecast_df = pd.DataFrame(disaggregated_forecast)

# Show the disaggregated forecasts
print(disaggregated_forecast_df.head())

### STATE-LEVEL EVALUATION ###
# Take the earliest date in the test period and convert it to a Period
test_data_min_date = pd.to_datetime(test_data['Year-Month'].min()).to_period('M')

# Keep the actual state-level sales from the start of the test period onward
actual_state_sales = statewise_monthly_sales[statewise_monthly_sales['Year-Month'] >= test_data_min_date]

# Print the result to check that the comparison lines up
print(actual_state_sales.head())

# Compute the state-level metrics (if actual values are available for comparison)
# MAE, RMSE, and MAPE can be computed per state in the same way as for the monthly forecasts


  Year-Month       State  Forecasted Sales
0    2019-01     Alabama        556.743317
1    2019-01     Arizona       1042.252602
2    2019-01    Arkansas        892.027452
3    2019-01  California      10890.916715
4    2019-01    Colorado       1620.707604
    Year-Month                 State     Sales  Sales_Total  Sales_Proportion
869    2018-01               Alabama    56.370    35688.064          0.001580
870    2018-01               Arizona   100.922    35688.064          0.002828
871    2018-01            California  5022.232    35688.064          0.140726
872    2018-01              Colorado   169.064    35688.064          0.004737
873    2018-01  District of Columbia    77.760    35688.064          0.002179


             State  MAE  RMSE  MAPE
0          Alabama  NaN   NaN   NaN
1          Arizona  NaN   NaN   NaN
2         Arkansas  NaN   NaN   NaN
3       California  NaN   NaN   NaN
4         Colorado  NaN   NaN   NaN
5      Connecticut  NaN   NaN   NaN
6          Florida  NaN   NaN   NaN
7            Idaho  NaN   NaN   NaN
8         Illinois  NaN   NaN   NaN
9          Indiana  NaN   NaN   NaN
10            Iowa  NaN   NaN   NaN
11        Kentucky  NaN   NaN   NaN
12       Louisiana  NaN   NaN   NaN
13   Massachusetts  NaN   NaN   NaN
14        Michigan  NaN   NaN   NaN
15       Minnesota  NaN   NaN   NaN
16     Mississippi  NaN   NaN   NaN
17        Missouri  NaN   NaN   NaN
18        Nebraska  NaN   NaN   NaN
19      New Jersey  NaN   NaN   NaN
20      New Mexico  NaN   NaN   NaN
21        New York  NaN   NaN   NaN
22  North Carolina  NaN   NaN   NaN
23    North Dakota  NaN   NaN   NaN
24            Ohio  NaN   NaN   NaN
25        Oklahoma  NaN   NaN   NaN
26    Pennsylvania  NaN   Na
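
A metrics table made entirely of `NaN` is typically what a left merge produces when forecast rows find no matching actual rows, for example when the forecasts carry 2019 dates and the actuals are from 2018. A small sketch of how `indicator=True` can expose this, on toy frames (the values are illustrative, and relabeling the forecasts with the test-period months is an assumed fix):

```python
import pandas as pd

# Toy forecast labeled with 2019 months and toy actuals for 2018.
forecast = pd.DataFrame({
    "Year-Month": pd.PeriodIndex(["2019-01", "2019-02"], freq="M"),
    "State": ["Alabama", "Alabama"],
    "Forecasted Sales": [556.7, 600.0],
})
actual = pd.DataFrame({
    "Year-Month": pd.PeriodIndex(["2018-01", "2018-02"], freq="M"),
    "State": ["Alabama", "Alabama"],
    "Sales": [56.37, 80.0],
})

# The keys do not overlap, so every forecast row is left without an actual value.
misaligned = forecast.merge(
    actual, on=["Year-Month", "State"], how="left", indicator=True
)

# Relabel the forecasts with the months they were actually produced for.
test_months = pd.period_range("2018-01", periods=len(forecast), freq="M")
aligned = forecast.assign(**{"Year-Month": test_months}).merge(
    actual, on=["Year-Month", "State"], how="left", indicator=True
)
```

Inspecting the `_merge` column (`left_only` versus `both`) before computing metrics makes this kind of key mismatch visible straight away.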

# Test 2

In [1]:
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

df = pd.read_csv('Retailer Sales Data.csv')
df['Order Date'] = pd.to_datetime(df['Order Date'], format='%d/%m/%Y')


# Drop rows where 'State' is missing (we need state-level forecasts).
# .copy() avoids the SettingWithCopyWarning when adding the 'Year-Month' column.
df_cleaned = df.dropna(subset=['State']).copy()

# Create 'Year-Month' column from 'Order Date'
df_cleaned['Year-Month'] = df_cleaned['Order Date'].dt.to_period('M')

# Aggregate sales at the 'Year-Month' level
monthly_sales = df_cleaned.groupby('Year-Month')['Sales'].sum().reset_index()

# Convert 'Year-Month' back to a datetime format for consistency
monthly_sales['Year-Month'] = monthly_sales['Year-Month'].dt.to_timestamp()

# Train on data up to December 2017 and use 2018 for testing
train_data = monthly_sales[monthly_sales['Year-Month'] < '2018-01-01']
test_data = monthly_sales[monthly_sales['Year-Month'] >= '2018-01-01']

# Fit the SARIMA model
sarima_model = SARIMAX(train_data['Sales'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
sarima_fit = sarima_model.fit(disp=False)

# Forecast all of 2018 (12 months)
sarima_predictions = sarima_fit.forecast(steps=len(test_data))

# Compute the error metrics at the aggregate level
mae_sarima = mean_absolute_error(test_data['Sales'], sarima_predictions)
# RMSE via an explicit square root, since `squared=False` is deprecated in recent scikit-learn releases
rmse_sarima = np.sqrt(mean_squared_error(test_data['Sales'], sarima_predictions))
mape_sarima = np.mean(np.abs((test_data['Sales'] - sarima_predictions) / test_data['Sales'])) * 100

# Print the error metrics
print(f'Mean Absolute Error (MAE) - SARIMA: {mae_sarima}')
print(f'Root Mean Squared Error (RMSE) - SARIMA: {rmse_sarima}')
print(f'Mean Absolute Percentage Error (MAPE) - SARIMA: {mape_sarima}%')

# Build a dataframe pairing the predictions with the test-period dates
future_dates = pd.date_range(start=test_data['Year-Month'].min(), periods=len(test_data), freq='MS')
forecast_df = pd.DataFrame({'Year-Month': future_dates, 'Forecasted Sales': sarima_predictions})

### DISAGGREGATE THE FORECASTS TO THE STATE LEVEL ###

# Aggregate sales at the 'year-month-state' level to compute historical proportions
statewise_monthly_sales = df_cleaned.groupby(['Year-Month', 'State'])['Sales'].sum().reset_index()

# Compute each state's share of the monthly total
total_monthly_sales = statewise_monthly_sales.groupby('Year-Month')['Sales'].sum().reset_index()
statewise_monthly_sales = pd.merge(statewise_monthly_sales, total_monthly_sales, on='Year-Month', suffixes=('', '_Total'))
statewise_monthly_sales['Sales_Proportion'] = statewise_monthly_sales['Sales'] / statewise_monthly_sales['Sales_Total']

# Use the last available month to obtain the disaggregation proportions.
# Note: the last available month is 2018-12, which lies inside the 2018 test period,
# so these proportions use information from the period being evaluated.
# Taking them from the training period (up to 2017-12) would avoid that.
last_known_month = statewise_monthly_sales[statewise_monthly_sales['Year-Month'] == statewise_monthly_sales['Year-Month'].max()]

# Disaggregate the forecasts using the proportions from the last available month
disaggregated_forecast = []

for _, forecast_row in forecast_df.iterrows():
    forecast_month = forecast_row['Year-Month'].to_period('M')  # Make sure this is a Period
    forecast_value = forecast_row['Forecasted Sales']
    
    # Apply each state's historical proportion
    for _, proportion_row in last_known_month.iterrows():
        state = proportion_row['State']
        proportion = proportion_row['Sales_Proportion']
        state_forecast = forecast_value * proportion
        
        disaggregated_forecast.append({
            'Year-Month': forecast_month,
            'State': state,
            'Forecasted Sales': state_forecast
        })

# Convert the disaggregated forecasts to a dataframe
disaggregated_forecast_df = pd.DataFrame(disaggregated_forecast)

# Show the disaggregated forecasts
print(disaggregated_forecast_df.head())

### STATE-LEVEL EVALUATION ###
# Compute the state-level metrics (if actual values are available for comparison)

# Join the actual values onto the forecasts
merged_df = pd.merge(
    disaggregated_forecast_df,
    statewise_monthly_sales[statewise_monthly_sales['Year-Month'] >= '2018-01-01'],
    on=['Year-Month', 'State'],
    how='left',
    suffixes=('_Forecast', '_Actual')
)

# Compute the metrics for each state
metrics = []

# Iterate over each state
for state in merged_df['State'].unique():
    state_data = merged_df[merged_df['State'] == state]
    
    # Compute MAE
    mae = np.mean(np.abs(state_data['Forecasted Sales'] - state_data['Sales']))
    
    # Compute RMSE
    rmse = np.sqrt(np.mean((state_data['Forecasted Sales'] - state_data['Sales']) ** 2))
    
    # Compute MAPE (there is no zero-division guard here, so a zero actual value yields inf)
    mape = np.mean(np.abs((state_data['Forecasted Sales'] - state_data['Sales']) / state_data['Sales'])) * 100

    metrics.append({
        'State': state,
        'MAE': mae,
        'RMSE': rmse,
        'MAPE': mape
    })

# Build a DataFrame with the results
metrics_df = pd.DataFrame(metrics)

# Show the metrics for each state
print(metrics_df)


Mean Absolute Error (MAE) - SARIMA: 15411.558676616303
Root Mean Squared Error (RMSE) - SARIMA: 18005.217400851787
Mean Absolute Percentage Error (MAPE) - SARIMA: 42.68170147743031%
  Year-Month       State  Forecasted Sales
0    2018-01     Alabama        556.743317
1    2018-01     Arizona       1042.252602
2    2018-01    Arkansas        892.027452
3    2018-01  California      10890.916715
4    2018-01    Colorado       1620.707604
             State          MAE         RMSE         MAPE
0          Alabama   588.137995   611.306035  2508.086480
1          Arizona   671.816038   748.746890   182.771597
2         Arkansas   948.679864  1035.178320  6345.995669
3       California  4181.289192  4789.261579    63.449909
4         Colorado  1542.783821  1672.559593          inf
5      Connecticut   783.493690   878.320336  1630.220402
6          Florida  1281.215594  1708.241162   169.801122
7            Idaho   289.610025   424.355606    74.166987
8         Illinois  1463.194774  1791.

  warn('Too few observations to estimate starting parameters%s.'
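
Several State-level MAPE values in the table above are `inf` or run into thousands of percent, which is how MAPE behaves when actual sales are zero or very small. A sketch of two more forgiving alternatives follows: a MAPE that skips zero actuals, and a weighted absolute percentage error (WAPE). The helper names `mape` and `wape` and the toy numbers are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error over entries whose actual value is non-zero."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    mask = actual != 0
    if not mask.any():
        return float("nan")
    return float(np.mean(np.abs((actual[mask] - forecast[mask]) / actual[mask])) * 100)

def wape(actual, forecast):
    """Weighted absolute percentage error: total absolute error over total actual."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = np.abs(actual).sum()
    if denom == 0:
        return float("nan")
    return float(np.abs(actual - forecast).sum() / denom * 100)

# Toy example: the first month has zero actual sales.
actual = [0.0, 50.0, 100.0]
forecast = [10.0, 40.0, 110.0]
mape_value = mape(actual, forecast)  # skips the zero-sales month
wape_value = wape(actual, forecast)  # counts every month's error, weighted by volume
```

WAPE keeps the error from the zero-sales month in the numerator instead of discarding it, which tends to make it more stable for small or intermittent series.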


# XGBoost

In [3]:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Load the data
df = pd.read_csv('Retailer Sales Data.csv')
df['Order Date'] = pd.to_datetime(df['Order Date'], format='%d/%m/%Y')

# Clean the data (.copy() avoids the SettingWithCopyWarning on the next assignment)
df_cleaned = df.dropna(subset=['State']).copy()
df_cleaned['Year-Month'] = df_cleaned['Order Date'].dt.to_period('M')
monthly_sales = df_cleaned.groupby('Year-Month')['Sales'].sum().reset_index()
monthly_sales['Year-Month'] = monthly_sales['Year-Month'].dt.to_timestamp()

# Create the calendar features
monthly_sales['Month'] = monthly_sales['Year-Month'].dt.month
monthly_sales['Year'] = monthly_sales['Year-Month'].dt.year

# Create the lag features
def create_lag_features(df, lags):
    df = df.copy()
    for lag in lags:
        df[f'lag_{lag}'] = df['Sales'].shift(lag)
    return df

# Define the lags to use
lags = [1, 2, 3, 6, 12]
monthly_sales = create_lag_features(monthly_sales, lags)
monthly_sales = monthly_sales.dropna()

# Split the data into training and test sets (shuffle=False keeps chronological order)
X = monthly_sales.drop(['Year-Month', 'Sales'], axis=1)
y = monthly_sales['Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Create and train the XGBoost model
model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Compute the error metrics
mae_xgb = mean_absolute_error(y_test, y_pred)
# RMSE via an explicit square root, since `squared=False` is deprecated in recent scikit-learn releases
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred))
mape_xgb = np.mean(np.abs((y_test - y_pred) / y_test)) * 100

# Print the error metrics
print(f'Mean Absolute Error (MAE) - XGBoost: {mae_xgb}')
print(f'Root Mean Squared Error (RMSE) - XGBoost: {rmse_xgb}')
print(f'Mean Absolute Percentage Error (MAPE) - XGBoost: {mape_xgb}%')

# Forecast for 2018
# Note: `>=` also selects the 2017-12 row, so the predictions are labeled one month
# later than the rows they come from (the date range below starts at 2018-01).
# The lag features are built from actual sales, so these are one-step-ahead
# predictions, and the early 2018 rows also belong to the training set.
forecast_features = X[X.index >= monthly_sales[monthly_sales['Year-Month'] == '2017-12-01'].index[-1]]
forecast_predictions = model.predict(forecast_features)

# Build a DataFrame with the aggregate forecasts
forecast_df = pd.DataFrame({
    'Year-Month': pd.date_range(start='2018-01-01', periods=len(forecast_predictions), freq='MS'),
    'Forecasted Sales': forecast_predictions
})

# Aggregate sales at the 'year-month-state' level to compute historical proportions
statewise_monthly_sales = df_cleaned.groupby(['Year-Month', 'State'])['Sales'].sum().reset_index()

# Compute each state's share of the monthly total
total_monthly_sales = statewise_monthly_sales.groupby('Year-Month')['Sales'].sum().reset_index()
statewise_monthly_sales = pd.merge(statewise_monthly_sales, total_monthly_sales, on='Year-Month', suffixes=('', '_Total'))
statewise_monthly_sales['Sales_Proportion'] = statewise_monthly_sales['Sales'] / statewise_monthly_sales['Sales_Total']

# Use the last available month to obtain the disaggregation proportions
last_known_month = statewise_monthly_sales[statewise_monthly_sales['Year-Month'] == statewise_monthly_sales['Year-Month'].max()]

# Disaggregate the forecasts using the proportions from the last available month
disaggregated_forecast = []

for _, forecast_row in forecast_df.iterrows():
    forecast_month = forecast_row['Year-Month'].to_period('M')  # Make sure this is a Period
    forecast_value = forecast_row['Forecasted Sales']
    
    # Apply each state's historical proportion
    for _, proportion_row in last_known_month.iterrows():
        state = proportion_row['State']
        proportion = proportion_row['Sales_Proportion']
        state_forecast = forecast_value * proportion
        
        disaggregated_forecast.append({
            'Year-Month': forecast_month,
            'State': state,
            'Forecasted Sales': state_forecast
        })

# Convert the disaggregated forecasts to a DataFrame
disaggregated_forecast_df = pd.DataFrame(disaggregated_forecast)

# Show the disaggregated forecasts
print(disaggregated_forecast_df.head())

### STATE-LEVEL EVALUATION ###
# Join the actual values onto the forecasts
merged_df = pd.merge(
    disaggregated_forecast_df,
    statewise_monthly_sales[statewise_monthly_sales['Year-Month'] >= '2018-01-01'],
    on=['Year-Month', 'State'],
    how='left',
    suffixes=('_Forecast', '_Actual')
)

# Compute the metrics for each state
metrics = []

# Iterate over each state
for state in merged_df['State'].unique():
    state_data = merged_df[merged_df['State'] == state]
    
    # Compute MAE
    mae = np.mean(np.abs(state_data['Forecasted Sales'] - state_data['Sales']))
    
    # Compute RMSE
    rmse = np.sqrt(np.mean((state_data['Forecasted Sales'] - state_data['Sales']) ** 2))
    
    # Compute MAPE (there is no zero-division guard here, so a zero actual value yields inf)
    mape = np.mean(np.abs((state_data['Forecasted Sales'] - state_data['Sales']) / state_data['Sales'])) * 100

    metrics.append({
        'State': state,
        'MAE': mae,
        'RMSE': rmse,
        'MAPE': mape
    })

# Build a DataFrame with the results
metrics_df = pd.DataFrame(metrics)

# Show the metrics for each state
print(metrics_df)


Mean Absolute Error (MAE) - XGBoost: 13046.314957031249
Root Mean Squared Error (RMSE) - XGBoost: 15276.032036023355
Mean Absolute Percentage Error (MAPE) - XGBoost: 20.835464866020455%
  Year-Month       State  Forecasted Sales
0    2018-01     Alabama        981.372572
1    2018-01     Arizona       1837.180770
2    2018-01    Arkansas       1572.378593
3    2018-01  California      19197.440917
4    2018-01    Colorado       2856.824572
             State          MAE         RMSE         MAPE
0          Alabama   545.626996   593.520702  2200.049375
1          Arizona   575.811650   785.994849   213.870447
2         Arkansas   651.660749   766.143386  5269.750397
3       California  5307.368212  6334.815628    78.836852
4         Colorado  1278.633876  1491.004237          inf
5      Connecticut   521.819389   616.336101  1280.333970
6          Florida  1320.118365  1839.507433   111.457969
7            Idaho   295.512619   430.637117    75.604637
8         Illinois  1168.874735  1
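
The lag features in the XGBoost cell are built from actual sales, so its predictions are one-step-ahead. For a genuine 12-month horizon the future lags are unknown, and one common workaround is recursive forecasting: predict one month, append the prediction to the history, and reuse it as a lag for the next month. The sketch below illustrates the idea with a least-squares linear model standing in for XGBoost. The toy series and the helpers `make_lag_matrix` and `recursive_forecast` are hypothetical.

```python
import numpy as np

def make_lag_matrix(series, lags):
    """Build rows of lagged values aligned with the target each row should predict."""
    max_lag = max(lags)
    X = np.array([[series[t - lag] for lag in lags] for t in range(max_lag, len(series))])
    y = np.array(series[max_lag:])
    return X, y

def recursive_forecast(series, lags, predict_one, horizon):
    """Forecast `horizon` steps, feeding each prediction back in as a future lag."""
    extended = list(series)
    for _ in range(horizon):
        features = np.array([extended[-lag] for lag in lags])
        extended.append(float(predict_one(features)))
    return extended[len(series):]

# Toy series with a repeating 4-step pattern (made-up values).
history = [10.0, 20.0, 30.0, 40.0] * 5
lags = [1, 4]
X, y = make_lag_matrix(history, lags)

# Stand-in model: ordinary least squares with an intercept.
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

def predict_one(features):
    """Predict the next value from one row of lag features."""
    return coef[0] + features @ coef[1:]

forecast = recursive_forecast(history, lags, predict_one, horizon=8)
```

From the fifth step onward the `lag_4` input is itself a prediction, which is exactly the situation a 12-month forecast faces. Errors can compound under this scheme, so it is usually worth evaluating it on a hold-out year in the same recursive way.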



In [4]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameters to tune
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 150],
    'subsample': [0.8, 0.9, 1.0]
}

# Create the XGBoost model
xgb_model = xgb.XGBRegressor(objective='reg:squarederror')

# Set up the grid search
# Note: cv=3 uses ordinary K-fold splits, which ignore the time order of the rows
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, cv=3, scoring='neg_mean_squared_error', verbose=1, n_jobs=-1)

# Run the search
grid_search.fit(X_train, y_train)

# Print the best parameters
print("Best parameters found: ", grid_search.best_params_)

# Use the best model to make predictions
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Compute the error metrics
mae_xgb = mean_absolute_error(y_test, y_pred)
# RMSE via an explicit square root, since `squared=False` is deprecated in recent scikit-learn releases
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred))
mape_xgb = np.mean(np.abs((y_test - y_pred) / y_test)) * 100

# Print the error metrics
print(f'Mean Absolute Error (MAE) - XGBoost: {mae_xgb}')
print(f'Root Mean Squared Error (RMSE) - XGBoost: {rmse_xgb}')
print(f'Mean Absolute Percentage Error (MAPE) - XGBoost: {mape_xgb}%')


Fitting 3 folds for each of 81 candidates, totalling 243 fits
Best parameters found:  {'learning_rate': 0.2, 'max_depth': 5, 'n_estimators': 150, 'subsample': 0.8}
Mean Absolute Error (MAE) - XGBoost: 15656.003038281251
Root Mean Squared Error (RMSE) - XGBoost: 18505.023312138004
Mean Absolute Percentage Error (MAPE) - XGBoost: 23.81630189410809%
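
The tuned model scores worse on the test set than the default one, and one possible contributor is the validation scheme: `GridSearchCV` with `cv=3` on a regressor uses ordinary K-fold splitting, so some folds train on later months and validate on earlier ones. A time-ordered alternative is an expanding-window split. scikit-learn provides `TimeSeriesSplit` for this, which can be passed as `cv`. The hand-rolled generator below (a hypothetical helper, written without scikit-learn so it runs on its own) only illustrates the idea.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs where training always precedes testing."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        # The training window grows with each split; the test fold follows it directly.
        train_end = n_samples - (n_splits - k + 1) * fold
        yield list(range(train_end)), list(range(train_end, train_end + fold))

# Example with 28 chronologically ordered rows and 3 splits.
splits = list(expanding_window_splits(n_samples=28, n_splits=3))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}, validate {test_idx[0]}..{test_idx[-1]}")
```

Each validation fold lies strictly after its training window, which mirrors how the model would be used for forecasting.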
