<div class="alert alert-info">
   <center>
       <h3>Time Series Multi-Horizon Forecasting: how can we predict non-stationary phenomena at multiple time horizons?</h3>
       <br>
      <p>Hello and welcome to this datacraft workshop, held in collaboration with Danone.</p>
</center>
    
    
        
    

   

The goal of this workshop is to predict the ``ordered_volumes`` column of the dataset at several horizons (1 week, 3 months, 1 year).
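
As a rough sketch, the three horizons can be expressed as numbers of weekly steps, since the data is weekly. The conversion used here (13 weeks for 3 months, 52 weeks for 1 year) and the names `HORIZONS_IN_WEEKS` and `horizon_lengths` are assumptions made for illustration; they are not part of the workshop material.

```python
# Horizons from the workshop statement, expressed in weekly steps.
# Assumption: 3 months is rounded to 13 weeks and 1 year to 52 weeks.
HORIZONS_IN_WEEKS = {"1 week": 1, "3 months": 13, "1 year": 52}


def horizon_lengths(n_test_points, horizons=HORIZONS_IN_WEEKS):
    """For each horizon, return how many test points can actually be scored."""
    return {name: min(weeks, n_test_points) for name, weeks in horizons.items()}


# With 20 weekly test points, the 1-year horizon is capped at 20 points.
print(horizon_lengths(20))  # {'1 week': 1, '3 months': 13, '1 year': 20}
```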

There are 125 different products in total, but you can focus on just one product from each cluster identified by Gabriel. That part is detailed in the notebook **Clustering datacraft**.

We can therefore focus on one or two products per cluster.


This notebook is dedicated to exploring the Danone data and to using the **prophet** and **neural prophet** packages.

The ``train_prophet_model`` function trains a prophet model and lets you tune its parameters without going through the original prophet API.

The documentation and the examples should make its usage clear, but feel free to ask us any question.

If you would rather skip the function and use the API directly (in particular for cross-validation, adding seasonalities, or tuning parameters that the function does not expose), have a look at the prophet documentation:

- https://facebook.github.io/prophet/docs/quick_start.html#python-api
- https://github.com/facebook/prophet/blob/main/python/prophet/forecaster.py
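
For intuition about cross-validation on a time series, here is a minimal hand-written sketch of rolling-origin cutoffs: each cutoff leaves a minimum amount of history before it and a full forecast horizon after it. This is not prophet's implementation. The function name `rolling_cutoffs`, the end date and the window sizes are assumptions chosen for illustration; only the 2017-01-09 start matches the first `time_index` value in the dataset preview.

```python
from datetime import date, timedelta


def rolling_cutoffs(first, last, initial_weeks, period_weeks, horizon_weeks):
    """Return cutoff dates for rolling-origin evaluation on weekly data.

    Each cutoff has at least `initial_weeks` of history before it and a
    complete horizon of `horizon_weeks` before `last`.
    """
    cutoffs = []
    cutoff = first + timedelta(weeks=initial_weeks)
    while cutoff + timedelta(weeks=horizon_weeks) <= last:
        cutoffs.append(cutoff)
        cutoff += timedelta(weeks=period_weeks)
    return cutoffs


# Hypothetical setup: 3 years of initial history, a new fold every 26 weeks,
# and a 13-week (about 3 months) horizon.
cutoffs = rolling_cutoffs(
    first=date(2017, 1, 9),
    last=date(2021, 12, 27),  # assumed last date, for illustration only
    initial_weeks=156,
    period_weeks=26,
    horizon_weeks=13,
)
for c in cutoffs:
    print(c)
```

Each cutoff then defines one fold: train on everything up to the cutoff, score the following 13 weeks.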



# Load the libraries

In [1]:
import prophet
import numpy as np
from dash import Dash, State
import pandas as pd
import random
from datetime import date
from datetime import datetime
from prophet import Prophet 
from prophet.plot import add_changepoints_to_plot
from neuralprophet import NeuralProphet 

import plotly.graph_objects as go
import plotly.express as px
from plotly.tools import mpl_to_plotly

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import MinMaxScaler


import matplotlib.pyplot as plt



from dash.exceptions import PreventUpdate

from jupyter_dash import JupyterDash
from dash import dcc, html  # `dash_html_components` is deprecated; `html` now ships with dash
from dash.dependencies import Output, Input

import dash_bootstrap_components as dbc




# Load the data

In [2]:
df = pd.read_parquet("./data/forecasting.parquet.gzip")
danone = df.copy(deep = True)

# Variable selection dictionary

In [3]:
'''Columns to keep in the dataframe or not: 0 to drop, 1 to keep;
a column whose name is not listed here is dropped'''

selected_columns = {"time_index" : 1,"apparenttemperaturemax" :1,"cos_iso_month" : 0,"cos_iso_week" : 0,"cos_iso_week_of_month": 0,"days_before_next_holiday": 0,"forecasted_volumes": 0,"fu_cod": 0,"future_ordered_volumes_until_saturday_of_current_week": 0,"future_ordered_volumes_until_saturday_of_previous_week": 0,"holiday_day_of_week": 0,"holidays_count_in_week": 0,"iso_month": 0,"iso_week": 0,"iso_week_of_month": 0,"mat_net_weight_value_kg": 0,"ordered_volumes": 1,"precipintensity": 1,"promo_uplift_coefficient": 0}
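
As a small illustration of how such a dictionary resolves to a list of columns, the sketch below uses a shortened, hypothetical version of it (the variable names `selection` and `kept` are not part of the notebook).

```python
# Shortened, hypothetical version of the selection dictionary:
# 1 means "keep the column", 0 means "drop it".
selection = {
    "time_index": 1,
    "iso_week": 0,
    "ordered_volumes": 1,
    "precipintensity": 1,
}

# Keep the names flagged with 1, in the order they were declared.
kept = [name for name, flag in selection.items() if flag == 1]
print(kept)  # ['time_index', 'ordered_volumes', 'precipintensity']
```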

# Dataset overview

In [4]:
df.sort_values(["product_index", "time_index"])\
    .head()[["time_index","product_index","ordered_volumes", "future_ordered_volumes_until_saturday_of_current_week"]]

Unnamed: 0,time_index,product_index,ordered_volumes,future_ordered_volumes_until_saturday_of_current_week
26946,2017-01-09,60198,136.199997,136.199997
26947,2017-01-16,60198,142.199997,142.199997
26948,2017-01-23,60198,162.600006,162.600006
26949,2017-01-30,60198,181.199997,181.199997
26950,2017-02-06,60198,184.199997,184.199997


In [5]:
df.sort_values(["product_index","time_index"])\
    .head()[["time_index","product_index","ordered_volumes", "promo_mean_horizon_0"]]

Unnamed: 0,time_index,product_index,ordered_volumes,promo_mean_horizon_0
26946,2017-01-09,60198,136.199997,0.0
26947,2017-01-16,60198,142.199997,0.0
26948,2017-01-23,60198,162.600006,0.0
26949,2017-01-30,60198,181.199997,0.0
26950,2017-02-06,60198,184.199997,10.653943


This dataset contains the volume of orders placed by customers, aggregated by week.
For each week we have the total of the orders placed by customers from Monday to Friday; this figure is available in the columns: <br> <b>ordered_volumes</b> and <br> <b>future_ordered_volumes_until_saturday_of_current_week</b>
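
To make the weekly convention concrete, here is a minimal sketch of how Monday-to-Friday volumes could be summed and labelled with the Monday of their week, which is how the `time_index` values in the preview appear to be dated (2017-01-09 is a Monday). The daily figures are made up, and the way the real weekly totals were built is an assumption, not something the dataset documents.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekly_totals(daily_orders):
    """Sum Monday-to-Friday volumes per week, keyed by the Monday of that week."""
    totals = defaultdict(float)
    for day, volume in daily_orders.items():
        if day.weekday() < 5:  # 0 = Monday ... 4 = Friday
            monday = day - timedelta(days=day.weekday())
            totals[monday] += volume
    return dict(totals)


# Made-up daily volumes for the seven days starting Monday 2017-01-09.
daily = {date(2017, 1, 9) + timedelta(days=i): 10.0 for i in range(7)}
print(weekly_totals(daily))  # {datetime.date(2017, 1, 9): 50.0}
```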

---
---

In [6]:
df.sort_values(["product_index", "time_index"])\
    .head()[["time_index","product_index","precipintensity", "apparenttemperaturemax"]]

Unnamed: 0,time_index,product_index,precipintensity,apparenttemperaturemax
26946,2017-01-09,60198,0.002171,40.996632
26947,2017-01-16,60198,0.000264,34.286678
26948,2017-01-23,60198,0.00145,40.556255
26949,2017-01-30,60198,0.003083,51.472401
26950,2017-02-06,60198,0.001145,44.919128


Here we display some of the <b>regressors</b>: they add information along the timeline and may help explain some of the changes observed in the order-volume column.
The <b>prophet</b> model looks at these variables and, depending on their correlation with the target, can apply corrections.
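
A quick way to get a feel for a regressor is to compute its correlation with the target. The sketch below implements the Pearson coefficient by hand and applies it to the five rows shown in the previews above; the helper name `pearson` is hypothetical, and five points are far too few to draw any conclusion, so this only shows the mechanics.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# First five rows of product 60198, copied from the previews above.
ordered_volumes = [136.199997, 142.199997, 162.600006, 181.199997, 184.199997]
apparent_temperature_max = [40.996632, 34.286678, 40.556255, 51.472401, 44.919128]

r = pearson(ordered_volumes, apparent_temperature_max)
print(f"correlation on these five rows: {r:.3f}")
```

On the full dataframe, `df["ordered_volumes"].corr(df["apparenttemperaturemax"])` computes the same quantity per product once the rows are filtered.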

## Utility functions

In [7]:
def min_max_scale_df(df):
    """
        Rescale the ``ordered_volumes`` column to the [0, 1] range
        (min-max scaling); the dataframe is modified in place

        @param : df, dataframe
        @return : df, dataframe
    """
    scaler = MinMaxScaler()
    df["ordered_volumes"] = scaler.fit_transform(np.array(df["ordered_volumes"]).reshape(-1,1))
    return df

def get_df_selected_columns(df,selected_columns):
    """
        Allows to select columns of a dataframe 

        @param : df, dataframe
        @return : df, dataframe
    """
    
    columns = []
    for col in selected_columns:
        if selected_columns[col] == 1:
            columns.append(col)
    
    return df[columns]

def year_to_date(date_sep):
    """
        Convert a year (int, float or 4-character string) to the date of
        1 January of that year; a 'YYYY-MM-DD' string is parsed as a full date
        
        @param : date_sep, int, float or string ('2019', 2019, '2020-01-13')
        @return : date, converted year to date
    """
    if isinstance(date_sep, (float, int)):
        # int() first, so that 2021.0 becomes '2021' rather than '2021.0'
        date_sep = datetime.strptime(str(int(date_sep)), "%Y")

    else:
        if len(date_sep) == 4:
            date_sep = datetime.strptime(str(date_sep), "%Y")
        else :
            date_sep = datetime.strptime(str(date_sep), "%Y-%m-%d")
    return date_sep
    

def prepare_df(df,product_id,selected_columns):
    
    """
        Processings on df : 
            - Select wanted product
            - Sort from start date time to end
            - Distinguish necessary columns from regressors
            - Rename columns (ds,y) to fit prophet model expectations
        
        @param : df, dataframe
        @param : product_id, value of product_index to keep
        @param : selected_columns, dict mapping column names to 1 (keep) or 0 (drop)
        @return : df (dataframe), regressor (list of regressors column's names)
        
    """
    
    df = df.query(f'product_index=={product_id}')
    df = get_df_selected_columns(df,selected_columns).sort_values('time_index')
    
    # Every kept column other than the date and the target is a regressor
    # (a list rather than a set: pandas needs an ordered collection of labels)
    regressor = [col for col in df.columns if col not in ("time_index", "ordered_volumes")]
    
    # Keep strictly positive volumes, then work on new frames instead of
    # modifying a slice of the caller's dataframe in place
    df = df[df['ordered_volumes']>0]
    df = df.rename(columns = {"time_index" : "ds", "ordered_volumes" : "y"})
    df = df.dropna(subset=regressor)
    df = df.reset_index(drop=True)
    
    
    return df, regressor
    
def split_df(df, prediction_w_period,separation_date):
    
    """
        Used to make train set and test set, takes a separation date
        and a prediction period in weeks
        
        @param : df, dataframe
        @param : prediction_w_period, period of prediction in weeks
        @param : separation_date, the separation date which takes
        the first part for train and second part calculated with prediction_w_period
    """
    
    separation_date = year_to_date(separation_date)
    
    # Rows strictly before the separation date go to train; the test set is
    # made of the first prediction_w_period rows from that date onwards
    df_train = df[df.ds < separation_date]
    df_test = df[df.ds >= separation_date]
    df_test = df_test.iloc[:prediction_w_period]
    
    return df_train, df_test
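
The scaling performed by `min_max_scale_df` can be spelled out by hand: with its default settings, `MinMaxScaler` maps each value `x` to `(x - min) / (max - min)`. The sketch below is a plain-Python equivalent for a single column; the helper name `min_max_scale` is hypothetical, and the inputs are the first five `ordered_volumes` values of the preview, rounded.

```python
def min_max_scale(values):
    """Rescale values to [0, 1]: the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no scale information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([136.2, 142.2, 162.6, 181.2, 184.2])
print([round(v, 3) for v in scaled])  # [0.0, 0.125, 0.55, 0.938, 1.0]
```

Note that the scaler should be fitted on the train set only if the scaled values are used for evaluation; otherwise the minimum and maximum of the test period leak into training.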

# See the documentation: <a href = "https://github.com/ourownstory/neural_prophet/blob/main/neuralprophet/forecaster.py">NeuralProphet</a>

In [1]:
params = {
    "changepoints" : None, # when a list of dates is given here, the model does not detect changepoints on its own (None, the default, keeps automatic detection)
    "n_changepoints" : 25, # number of changepoints; ignored if changepoints is specified
    "changepoints_range" : 0.8, # the n_changepoints are placed within the first 80% of the train set
    "trend_reg" : 0,
    "trend_reg_threshold" : False,
    "yearly_seasonality" :"auto",
    "weekly_seasonality" :"auto",
    "daily_seasonality": "auto",
    "seasonality_mode" : "additive",
    "seasonality_reg" : 0,
    "n_forecasts" : 1,
    "n_lags" : 0,
    "num_hidden_layers" : 0,
    "d_hidden" : None,
    "ar_reg" : None,
    "learning_rate" : None,
    "epochs" : None,
    "batch_size" : None,
    "loss_func" : "Huber",
    "optimizer" : "AdamW",
    "newer_samples_weight" : 2,
    "newer_samples_start" : 0.0,
    "impute_missing" : True,
    "collect_metrics" :True,
    "normalize" :"auto",
    "global_normalization" :False,
    "global_time_normalization" : True,
    "unknown_data_normalization" : False,
}

def train_neural_prophet_model( params = params,
                                df = danone,
                                freq = "W-MON",
                                product_id = 70189,
                                date_sep = 2021,
                                periode = 12 # number of weeks in the test set
):
    
    selected_columns = {"time_index" : 1, "ordered_volumes" : 1}
    
    df, regressor = prepare_df(df, product_id, selected_columns)
    df_train, df_test = split_df(df,periode,date_sep)
    date_sep = year_to_date(date_sep)
    
            
    #if add_holiday == True: 
    
            
    m = NeuralProphet(**params)
    
    m.add_country_holidays("FR")
    
    df_train, df_val = NeuralProphet().split_df(df_train, valid_p=0.2)
    
    m.fit(df_train, freq = freq, validation_df = df_val)
    
    forecast = m.predict(df_test)
    
    metrics = m.test(df_test)
    
    future = df_test
    
    df = pd.concat([df_train, df_val],axis=0)
    
    df = pd.concat([df, df_test], axis=0)
    
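    # Compare observations and predictions position by position: the frame
    # returned by predict() carries its own index, so an index-aligned
    # subtraction could silently produce NaN
    test_mask = df.ds>=date_sep

    residuals = pd.DataFrame({'ds': df.loc[test_mask, 'ds'].to_numpy()})

    residuals['e'] = df.loc[test_mask, 'y'].to_numpy() - forecast.loc[forecast.ds>=date_sep, 'yhat1'].to_numpy()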

    residuals.reset_index(drop=True, inplace=True)
    
    return m, df, future, forecast, residuals
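
The `valid_p=0.2` argument above holds out the most recent part of the training data for validation. As a minimal sketch of that idea (a chronological hold-out, never a random one), assuming rows already sorted by date: the helper `chronological_split` is hypothetical, and NeuralProphet's own `split_df` may round or handle lags differently.

```python
def chronological_split(rows, valid_p=0.2):
    """Split time-ordered rows: the last `valid_p` fraction becomes validation."""
    n_valid = int(len(rows) * valid_p)
    cut = len(rows) - n_valid
    return rows[:cut], rows[cut:]


# Ten time-ordered observations: the last two are held out.
train_rows, valid_rows = chronological_split(list(range(10)), valid_p=0.2)
print(train_rows)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(valid_rows)  # [8, 9]
```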


In [9]:
def train_prophet_model(df = danone,
                product_id = 70189,

                interval = 0.95,

                add_holiday = True,

                regressor = [],
                
                #changepoints 

                changepoints = None, # when a list of dates is given here, the model does not detect changepoints on its own (None, the default, keeps automatic detection)

                changepoint_prior_scale=0.05, # sensitivity to changes in the trend (e.g. if we consider that an apparent trend change is not a real one)

                n_changepoints = 25, # number of changepoints; ignored if changepoints is specified

                changepoint_range = 0.8, # the n_changepoints are placed within the first 80% of the train set

                #growth

                growth = 'linear', # trend type: 'linear', 'logistic' or 'flat'
                                     # if growth = 'logistic', cap must be provided

                cap = None, # default = None; carrying capacity: when forecasting growth, there is usually some maximum achievable point: total market size = carrying capacity
                        # cap can be a list or a constant

                floor = None,

                holidays = pd.DataFrame({'ds':pd.to_datetime(['2017-01-16']),
                                         'holiday':'saint-machin'}), #default = None

                #seasonality

                # pass False to disable a given seasonality

                yearly_seasonality = 'auto', # otherwise an integer
                                          #The default values are often appropriate, but they can be increased when the seasonality needs to fit higher-frequency changes, and generally be less smooth
                                          # try 'auto' and 10, which is the default value according to the docs
                                          #Increasing the number of Fourier terms allows the seasonality to fit faster changing cycles, but can also lead to overfitting: N Fourier terms corresponds to 2N variables used for modeling the cycle
                                          # https://facebook.github.io/prophet/docs/seasonality,_holiday_effects,_and_regressors.html#specifying-custom-seasonalities
                                                          # seasonality_prior_scale can also be added

                seasonality_mode = 'additive', # or 'multiplicative'

                # if holidays or seasonalities are overfitted, the prior scales can be used (default = 10)

                seasonality_prior_scale = 10.0,
                holidays_prior_scale = 10.0,

                mcmc_samples = 0,

                uncertainty_samples = 1000,
                
                date_sep = 2021,

                periode = 12 # number of weeks in the test set
):
    
    
    
    """
    Process the danone df, split it, train the model on the train. 

    Return the trained model, the processed prophet friendly df, the forecast df with forecasted values and the residuals df with residuals.
        
    @param df, DataFrame (default=danone)
    
    @param product_id, int : the product_index we want to forecast, must be in the 'product_index' column of df
    
    @param date_sep int or string : the date at which we want to divide our dataset. 
    example : '2019', 2019, '2020-01-13'
    
    @param periode, int : number of weeks we want in the test set 
    
    @param cap : int, DataFrame, Series, array, list : carrying capacity : When forecasting growth, there is usually some maximum achievable point: total market size, total population size, etc. This is called the carrying capacity, and the forecast should saturate at this point. 
    
    @param floor : int, DataFrame, Series, array, list saturating minimum
    
    @param interval : Float, width of the uncertainty intervals provided for the forecast.
    If mcmc_samples=0, this will be only the uncertainty in the trend using the MAP estimate of the extrapolated generative model.
    If mcmc_samples>0, this will be integrated over all model parameters, which will include uncertainty in seasonality. 
    
    @param add_holiday, bool : True if we want to include the effects of prophet built-in holidays (default=True)
    (Jour de l'an, Fête du Travail, Armistice 1945, Fête nationale, Armistice 1918, Lundi de Pâques, Lundi de Pentecôte, Ascension, Assomption, Toussaint, Noël)
    
    @param regressor, list : list of the regressors we want to include in the model, must be names of columns of df. (default=empty list)
    
    @param changepoints, list : list of dates where we suspect a change in the trend, dates must be strings and must be in the time_index column of df (default=None)
    
    @param n_changepoints, int : if changepoints=None, the model distributes n_changepoints over the train set and tests whether there is a change in the trend at those points.
    Not used if `changepoints` is specified
    
    @param changepoint_range, float in [0,1] : Proportion of history in which trend changepoints will be estimated.
    Defaults to 0.8 for the first 80%. Not used if `changepoints` is specified
    
    @param yearly_seasonality : 'auto', True, False or a number of Fourier terms to generate (default=auto)
    
    
    @param holidays : DataFrame with columns holiday (string) and ds (date type)
    and optionally columns lower_window and upper_window which specify a range of days around the date to be included as holidays. lower_window=-2 will include 2 days prior to the date as holidays.
    Also optionally can have a column prior_scale specifying the prior scale for that holiday.
    
    @param seasonality_mode : 'additive' (default), 'multiplicative'
    
    @param seasonality_prior_scale : int or float, Parameter modulating the strength of the seasonality model.
    Larger values allow the model to fit larger seasonal fluctuations, smaller values dampen the seasonality. Can be specified for individual seasonalities using add_seasonality.
    
    @param holidays_prior_scale : int or float, Parameter modulating the strength of the holiday components model, unless overridden in the holidays input.
    
    @param mcmc_samples : int, if greater than 0, will do full Bayesian inference with the specified number of MCMC samples. If 0, will do MAP estimation.
    
    @param uncertainty_samples : Number of simulated draws used to estimate uncertainty intervals. Setting this value to 0 or False will disable uncertainty estimation and speed up the calculation. 
    
    @return m : prophet model : the trained model
    
    @return df : DataFrame : the processed df in a prophet-friendly shape 
    
    @return future : DataFrame : a single column DataFrame with train set dates and test set dates concatenated
    
    @return forecast : DataFrame : the output of m.predict(future), including yhat, yhat_lower and yhat_upper
    
    @return residuals : DataFrame : DataFrame with e = y-yhat for each prediction
    """
    
    df = df.query(f'product_index=={product_id}')
    df = df[['time_index','ordered_volumes']+regressor].sort_values('time_index')
    df = df[df['ordered_volumes']>0]
    df.columns = ['ds','y']+regressor
    df.dropna(subset=regressor, inplace=True)
    df.reset_index(inplace=True, drop=True)


    # cap (and optionally floor) are only used by the logistic trend
    if growth == 'logistic':
        if cap is None:
            raise ValueError("growth='logistic' requires a value for cap")
        df['cap'] = cap
        if floor is not None:
            df['floor'] = floor

    # Split at date_sep: rows strictly before it go to train, the rest to test
    date_sep = year_to_date(date_sep)
    df_train = df[df.ds<date_sep]
    df_test = df[df.ds>=date_sep]

    m = Prophet(interval_width=interval, changepoints=changepoints, changepoint_prior_scale=changepoint_prior_scale,n_changepoints=n_changepoints,
                changepoint_range=changepoint_range,
                growth=growth,
                holidays=holidays, holidays_prior_scale=holidays_prior_scale,
                yearly_seasonality=yearly_seasonality,
                #weekly_seasonality=weekly_seasonality, daily_seasonality=daily_seasonality,
                seasonality_mode=seasonality_mode,
                seasonality_prior_scale=seasonality_prior_scale,
                mcmc_samples=mcmc_samples, uncertainty_samples=uncertainty_samples)

    if add_holiday:
        m.add_country_holidays(country_name="FR")

    # register each regressor column by its name (not the whole list)
    for reg in regressor:
        m.add_regressor(reg)

    m.fit(df_train)

    future = m.make_future_dataframe(periods=len(df_test), freq='W-MON', include_history=True)

    future[regressor] = df[regressor] # to be adapted if include_history=False

    # the logistic trend also needs cap (and floor, if any) on the future frame
    if growth == 'logistic':
        future['cap'] = cap
        if floor is not None:
            future['floor'] = floor

    forecast = m.predict(future)

    residuals = pd.DataFrame()
    
    df = pd.concat([df_train,df_test], axis = 0)

    residuals['ds']=future.tail(periode)["ds"]

    residuals['e'] = df[df.ds>=date_sep]['y']-forecast[forecast.ds>=date_sep]['yhat']

    residuals.reset_index(drop=True, inplace=True)

    return m, df, future, forecast, residuals
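
Since the residuals are defined as `e = y - yhat`, error metrics per horizon can be derived from them directly. The sketch below computes MAE and RMSE over the first `h` residuals for a few horizons; the helper names and the residual values are made up for illustration, and with the real output one could pass `residuals['e'].tolist()`.

```python
from math import sqrt


def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return sqrt(sum(e * e for e in errors) / len(errors))


def metrics_by_horizon(errors, horizons=(1, 4, 12)):
    """MAE and RMSE over the first h residuals, for each horizon h that fits."""
    return {
        h: {"mae": mae(errors[:h]), "rmse": rmse(errors[:h])}
        for h in horizons
        if h <= len(errors)
    }


# Made-up weekly residuals, in chronological order.
errors = [2.0, -1.0, 3.0, -4.0]
scores = metrics_by_horizon(errors, horizons=(1, 4, 12))
print(scores)  # horizon 12 is skipped: only 4 residuals are available
```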

# Displaying the model's performance

In [10]:
def plot_my_ploty_graph(df, forecast, product_id=70189,year=2021):
    # to be revised: both metrics use every row of df and forecast, which must have the same length
    mse = mean_squared_error(df['y'], forecast['yhat'])
    mae = mean_absolute_error(df['y'], forecast['yhat'])

    s1 = go.Scatter(x=forecast['ds'], y=forecast['yhat_lower'], name='yhat_lower',fill='tonexty',line={"color":"gray"},fillcolor='rgba(68, 68, 68, 0.1)',showlegend=True)

    s2 = go.Scatter(x=forecast['ds'], y=forecast['yhat'], name='prediction',line={"color":"red"})

    s3 = go.Scatter(x=forecast['ds'], y=forecast['yhat_upper'], name='yhat_upper',line={"color":"gray"},showlegend=True)

    fig = go.Figure(data=[s2,s3,s1],layout={'title':f'One-year forecast of quantities for product {product_id}'})

    fig.add_scatter(x=df['ds'],y=df['y'],mode='markers',name='Observed quantity')

    fig.add_vline(x=str(year_to_date(year)), line_width=3, line_color="green")

    fig.add_annotation(x='2017', y=600,
                text=f"MSE : {mse}",
                showarrow=False,
                arrowhead=1)

    fig.add_annotation(x='2017', y=550,
                text=f"MAE : {mae}",
                showarrow=False,
                arrowhead=1)


    fig.update_layout(hovermode='x',
                     xaxis_title='Date',yaxis_title='Quantities')

    #fig.show()
    
    return fig

In [6]:
def plot_my_neural_graph(forecasts, product_id=70189,year=2021):
    """
    Show model's benchmarks via graphical interface
    
    @param : forecasts, list of dataframes, each with ds, y and yhat1 (prediction value)
    @param : product_id, the product id to select
    @param : year, the separation year between train and test set
    """
        
    #mse = mean_squared_error(forecast['yhat1'],forecast['y']) ######### a modifier
    #mae = mean_absolute_error(forecast['yhat1'],forecast['y'])
    
    #s1 = go.Scatter(x=forecast['ds'], y=forecast['yhat1'], name='prediction',line={"color":"green"})
    
    dat = []
    
    for forecast in forecasts :
        # random colour, zero-padded so that it always has six hex digits
        rd = random.randint(0,16777215)
        hex_color = f'#{rd:06x}'
        dat.append(go.Scatter(x=forecast['ds'], y=forecast['yhat1'], name='prediction',line={"color":hex_color}))
    

        
    fig = go.Figure(data=dat,layout={'title':f'One-year forecast of quantities for product {product_id}'})
    
    for forecast in forecasts :
        fig.add_scatter(x=forecast['ds'],y=forecast['y'],mode='markers',name='Observed quantity')
        '''fig.add_annotation(x='2017', y=600,
                text=f"MSE : {mean_squared_error(forecast['yhat1'],forecast['y'])}",
                showarrow=False,
                arrowhead=1)
        fig.add_annotation(x='2017', y=550,
                text=f"MAE : {mean_absolute_error(forecast['yhat1'],forecast['y'])}",
                showarrow=False,
                arrowhead=1)'''
        

    #fig.add_vline(x=str(year_to_date(year)), line_width=3, line_color="green")


    '''fig.add_annotation(x='2017', y=600,
                text=f"MSE : {mse}",
                showarrow=False,
                arrowhead=1)

    fig.add_annotation(x='2017', y=550,
                text=f"MAE : {mae}",
                showarrow=False,
                arrowhead=1)'''

    fig.update_layout(hovermode='x',
                     xaxis_title='Date',yaxis_title='Quantities')

    #fig.show()
    return fig
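
One detail worth isolating is the random line colour. A `#rrggbb` colour needs exactly six hex digits, whereas slicing the output of `hex()` drops leading zeros (255 becomes `ff`, not `0000ff`), which may be rejected or read as a different colour. The sketch below, with the hypothetical helper `random_hex_color`, pads the value explicitly.

```python
import random


def random_hex_color(rng=random):
    """Return a random colour as '#rrggbb', always with six hex digits."""
    return f"#{rng.randint(0, 0xFFFFFF):06x}"


class FixedRng:
    """Stand-in random source that always returns 255, to show the padding."""

    def randint(self, low, high):
        return 255


print(random_hex_color(FixedRng()))  # #0000ff
print(len(random_hex_color()))       # 7
```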

# Train the model

In [12]:
#m, danone, future, forecast, residuals = train_neural_prophet_model(params)
#m, danone, future, forecast, residuals = train_prophet_model(df)

INFO:prophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.

The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.





