# NYC Open Data - Demand for Shelter:

## Problem Statement
The "Demand for Shelter" dataset from NYC Open Data provides a comprehensive view of the homeless shelter system in New York City. It includes historical and current data on the number of people seeking shelter on a daily basis, broken down into categories such as adults, families with children, adult families (without children), and single individuals. The data also captures various metrics like the type of shelter, borough distribution, and how these numbers fluctuate over time.

This data is crucial because it reflects the city's response to homelessness and tracks trends that may indicate underlying social, economic, or policy-driven factors affecting homelessness.

When individuals have access to stable housing, they are better able to contribute to the economy. Homeless individuals often struggle to maintain employment or education due to the instability of their living situations. Providing stable housing allows them to re-enter the workforce, gain skills, and become productive members of society. Studies have shown that providing housing and support services for homeless individuals is more cost-effective than allowing them to live on the streets, where they often rely on emergency services, shelters, and hospitals, which are much more costly. By addressing homelessness, cities can reduce costs in areas such as emergency medical care, policing, and sanitation.

## Business Objective:
As a government data scientist, the objective is to develop a forecasting model that accurately predicts the future demand for homeless shelters in New York City. By leveraging historical data and forecasting techniques, the model will enable the city’s Department of Homeless Services (DHS) and related agencies to anticipate shelter needs, allocate resources efficiently, and make informed policy decisions to reduce homelessness and improve public services.

## Goals:
* Accurate Forecasting of Shelter Demand: Build a predictive model using time series or machine learning techniques to forecast the number of individuals and families seeking shelter on a daily, weekly, or monthly basis.

* Identifying Key Drivers of Shelter Demand: Analyze the factors contributing to increased shelter use, such as unemployment rates, housing costs, extreme weather events, or policy changes, to understand what drives homelessness trends.

* Optimizing Resource Allocation: Provide actionable insights to DHS on when and where resources (shelters, beds, staffing, services) will be most needed, improving service delivery and reducing instances of overcrowded shelters.

# Load Dataset

In [1]:
# Install Greykite, LinkedIn's forecasting library that provides the Silverkite model
!pip install greykite

Collecting greykite
  Downloading greykite-1.0.0-py2.py3-none-any.whl.metadata (9.4 kB)
Collecting dill>=0.3.1.1 (from greykite)
  Downloading dill-0.3.9-py3-none-any.whl.metadata (10 kB)
Collecting holidays<0.25 (from greykite)
  Downloading holidays-0.24-py3-none-any.whl.metadata (16 kB)
Collecting holidays-ext>=0.0.7 (from greykite)
  Downloading holidays_ext-0.0.8-py3-none-any.whl.metadata (1.3 kB)
Collecting numpy<1.25.0,>=1.22.0 (from greykite)
  Downloading numpy-1.24.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting overrides>=2.8.0 (from greykite)
  Downloading overrides-7.7.0-py3-none-any.whl.metadata (5.8 kB)
Collecting pandas<2.0.0,>=1.5.0 (from greykite)
  Downloading pandas-1.5.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting pmdarima<=1.8.5,>=1.8.0 (from greykite)
  Downloading pmdarima-1.8.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl.metadata (7.7 kB)

In [34]:
# Make sure to use the preferred scikit-learn version to avoid misconfig error
!pip install --upgrade --force-reinstall scikit-learn==1.3.1

Collecting scikit-learn==1.3.1
  Downloading scikit_learn-1.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting numpy<2.0,>=1.17.3 (from scikit-learn==1.3.1)
  Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting scipy>=1.5.0 (from scikit-learn==1.3.1)
  Downloading scipy-1.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting joblib>=1.1.1 (from scikit-learn==1.3.1)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=2.0.0 (from scikit-learn==1.3.1)
  Downloading threadpoolctl-3.5.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.3.1-cp310-cp310-manylinux_

In [65]:
# import modules for data analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import (
    plot_acf, plot_pacf,
    month_plot, quarter_plot
)

In [2]:
# Greykite functions
from greykite.framework.templates.autogen.forecast_config import *
from greykite.framework.templates.forecaster import Forecaster
from greykite.framework.templates.model_templates import ModelTemplateEnum
from greykite.common.features.timeseries_features import *
from greykite.common.evaluation import EvaluationMetricEnum
from greykite.framework.utils.result_summary import summarize_grid_search_results

In [3]:
# load dataset
path = "/content/drive/MyDrive/Colab Notebooks/Time-Series/New York City Shelter Demand Projection/DHS_Homeless_Shelter_Census.csv"
df = pd.read_csv(path)
df.head()

Unnamed: 0,Date of Census,Total Individuals in Shelter,Total Single Adults in Shelter,Families with Children in Shelter,Adult Families in Shelter
0,07/17/2023,81957,21636,16490,2905
1,07/16/2023,81848,21538,16491,2891
2,07/15/2023,81808,21467,16508,2895
3,07/14/2023,81725,21529,16477,2891
4,07/13/2023,81916,21638,16505,2894


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4012 entries, 0 to 4011
Data columns (total 5 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   Date of Census                     4012 non-null   object
 1   Total Individuals in Shelter       4012 non-null   int64 
 2   Total Single Adults in Shelter     4012 non-null   int64 
 3   Families with Children in Shelter  4012 non-null   int64 
 4   Adult Families in Shelter          4012 non-null   int64 
dtypes: int64(4), object(1)
memory usage: 156.8+ KB


# Data Preprocessing

## Rename Columns and Change Dtypes

In [5]:
# rename columns
df.rename(
    columns={
        "Date of Census" : 'ds',
        "Total Individuals in Shelter" : "y"}, inplace=True
)

In [6]:
# convert the date column to datetime
df['ds'] = pd.to_datetime(df['ds'])
# check the result
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4012 entries, 0 to 4011
Data columns (total 5 columns):
 #   Column                             Non-Null Count  Dtype         
---  ------                             --------------  -----         
 0   ds                                 4012 non-null   datetime64[ns]
 1   y                                  4012 non-null   int64         
 2   Total Single Adults in Shelter     4012 non-null   int64         
 3   Families with Children in Shelter  4012 non-null   int64         
 4   Adult Families in Shelter          4012 non-null   int64         
dtypes: datetime64[ns](1), int64(4)
memory usage: 156.8 KB


In [7]:
max_df_date = df['ds'].max()
min_df_date = df['ds'].min()

print(f"Date start from {min_df_date} up to {max_df_date}")

Date start from 2013-08-21 00:00:00 up to 2024-10-12 00:00:00


## Check Duplicate

In [8]:
# count the dates that appear more than once
df.ds.duplicated().sum()

15

In [9]:
# check duplicated rows
df[df.ds.duplicated()]

Unnamed: 0,ds,y,Total Single Adults in Shelter,Families with Children in Shelter,Adult Families in Shelter
130,2023-03-10,71018,21358,13518,2779
453,2022-04-19,45348,16575,8467,1389
455,2022-04-18,45239,16592,8426,1390
457,2022-04-17,45084,16449,8418,1392
491,2022-03-15,45158,16542,8428,1396
493,2022-03-14,45309,16633,8435,1417
495,2022-03-13,45261,16585,8429,1415
520,2022-02-17,45224,16619,8387,1431
556,2022-01-13,45335,16557,8456,1437
606,2021-11-22,45838,16725,8528,1509


In [10]:
# delete duplicates, keeping the first row for each date
df.drop_duplicates(subset='ds', inplace=True)
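Before dropping rows by date alone, it can be worth checking whether the duplicated dates carry the same counts or conflicting ones. The sketch below uses a small toy frame with made-up values (not the census data) and only standard pandas calls; the `ds` and `y` names mirror the renamed columns.

```python
import pandas as pd

# Toy frame with hypothetical values; two dates appear twice
toy = pd.DataFrame({
    "ds": pd.to_datetime(["2022-04-17", "2022-04-17", "2022-04-18", "2022-04-18"]),
    "y": [45084, 45084, 45239, 45300],
})

# Count distinct `y` values per date; more than one means the duplicates disagree
distinct_per_date = toy.groupby("ds")["y"].nunique()
conflicting_dates = distinct_per_date[distinct_per_date > 1].index

# drop_duplicates keeps the first row per date by default (keep="first")
deduped = toy.drop_duplicates(subset="ds")
print(list(conflicting_dates))
print(deduped)
```

If conflicting dates show up in the real data, keeping the first row is a choice rather than a neutral cleanup, so it may deserve a second look.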

# Plot Dataset

In [11]:
# pandas time series tools work best when the index is a DatetimeIndex,
# so create a temporary dataframe that uses the `ds` column as its index

# create temp. dataframe
df_temp = df.copy()

# set `ds` column as index
df_temp.set_index('ds', inplace=True)

# see the result
df_temp.head(3)

ds,y,Total Single Adults in Shelter,Families with Children in Shelter,Adult Families in Shelter
2023-07-17,81957,21636,16490,2905
2023-07-16,81848,21538,16491,2891
2023-07-15,81808,21467,16508,2895
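One thing a datetime index makes easy is checking for missing days. The fit log later in this notebook reports NA values in `y_true`, which may point to gaps in the daily series. The sketch below shows one way to list such gaps on a toy frame with made-up values; applying it to `df_temp` is left as an assumption.

```python
import pandas as pd

# Toy daily series with one missing day (2024-01-03); values are hypothetical
toy = pd.DataFrame(
    {"y": [100, 101, 103, 104]},
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-05"]),
)

# Build the complete daily calendar between the first and last observation
full_range = pd.date_range(toy.index.min(), toy.index.max(), freq="D")

# Dates in the calendar that are absent from the data
missing_dates = full_range.difference(toy.index)

# Reindexing exposes the gaps as NaN rows, which can then be inspected or imputed
filled = toy.sort_index().reindex(full_range)
print(missing_dates)
print(filled)
```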


In [78]:
# plot the daily census as a line with the individual observations overlaid
fig, ax = plt.subplots(1,1,figsize=(12, 6))
sns.lineplot(df_temp, x=df_temp.index, y='y',
             color = 'blue',ax=ax)
sns.scatterplot(df_temp, x=df_temp.index, y='y',
                color = 'k', ax=ax)

plt.title("Homeless Shelter Census", size=15)
plt.xlabel("Date", size=12)
plt.ylabel("Number of Individuals", size=12)
plt.show();





In [13]:
# Helper that plots the ACF and PACF of a column over a chosen number of lags

import statsmodels.api as sm

def plot_acf_pacf(df, column_name, months=6):
  """
  Plots ACF and PACF for a given time series column.

  Args:
    df: Pandas DataFrame containing the time series data, indexed by date.
    column_name: Name of the column to analyze.
    months: Number of months of lags to show (approximated as 30 days each).
  """
  # Sort chronologically so that lag k means "k days earlier"
  series = df[column_name].sort_index()
  # The full series is used; `months` only sets the number of daily lags
  fig = plt.figure(figsize=(12, 8))
  ax1 = fig.add_subplot(211)
  fig = sm.graphics.tsa.plot_acf(series, lags=months*30, ax=ax1)
  ax2 = fig.add_subplot(212)
  fig = sm.graphics.tsa.plot_pacf(series, lags=months*30, ax=ax2)
  plt.show()


# Example usage:
plot_acf_pacf(df_temp, 'y', months=6)

In [14]:
# Example usage:
plot_acf_pacf(df_temp, 'y', months=1)

The daily series appears relatively steady, and the plots do not suggest that autoregressive terms are needed.
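One way to probe this reading numerically is to compare the lag-1 autocorrelation of the levels with that of the day-over-day changes. A slowly drifting series tends to show high autocorrelation in levels whether or not it has autoregressive structure, so the differenced series is usually more informative. The sketch below uses a synthetic random walk as a stand-in, not the census data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: a slowly drifting series built from independent daily changes
toy = pd.Series(50_000 + np.cumsum(rng.normal(0, 50, size=500)))

# Lag-1 autocorrelation of the levels (dominated by the slow drift)
level_lag1 = toy.autocorr(lag=1)

# Lag-1 autocorrelation of the daily changes (close to zero for independent steps)
diff_lag1 = toy.diff().dropna().autocorr(lag=1)

print(f"levels: {level_lag1:.3f}, differences: {diff_lag1:.3f}")
```

On the real data the same two numbers could be computed from `df_temp['y'].sort_index()`.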

# Training and Testing Data Boundaries

In [15]:
# keep only the `ds` and `y` columns for modeling
data = df.loc[:, ['ds', 'y']].copy()

In [16]:
# test horizon
forecast_test_horizon = 3*30 # three months, approximated as 90 days
# date on which the training period ends and the test period begins
test_cutoff = data.ds.max() - pd.Timedelta(forecast_test_horizon, unit='D')

print(f"Testing start from {test_cutoff}")

Testing start from 2024-07-14 00:00:00
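To make the boundary concrete, here is a minimal sketch of the same cutoff logic applied to a toy frame. It assumes the cutoff date itself belongs to the training set, which is how a training end date is normally read; the frame and its values are hypothetical.

```python
import pandas as pd

# Toy daily frame with ten days of hypothetical values
toy = pd.DataFrame({
    "ds": pd.date_range("2024-01-01", periods=10, freq="D"),
    "y": range(10),
})

horizon = 3  # number of days held out for testing
cutoff = toy["ds"].max() - pd.Timedelta(horizon, unit="D")

# Rows up to and including the cutoff are used for training, the rest for testing
train = toy[toy["ds"] <= cutoff]
test = toy[toy["ds"] > cutoff]

print(cutoff.date(), len(train), len(test))
```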


# Modeling

## Defining Components

### Defining MetaData

In [17]:
# Define the metadata for the time series, specifying the date column, value column, frequency, and training end date
metadata = MetadataParam(time_col = "ds",                # Column name for the time variable
                         value_col = "y",                  # Column name for the target variable
                         freq = "D",                       # Frequency of the data ('D' for daily)
                         train_end_date = test_cutoff)     # End date for the training period

metadata

MetadataParam(anomaly_info=None, date_format=None, freq='D', time_col='ds', train_end_date=Timestamp('2024-07-14 00:00:00'), value_col='y')

### Growth Term

In [18]:
# Define the possible growth terms for the model
# {} == dict()
growth = dict(growth_term = ["linear", "quadratic", "sqrt"])
growth

{'growth_term': ['linear', 'quadratic', 'sqrt']}
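As a rough picture of what the three growth options mean, the sketch below builds the corresponding trend features from elapsed time. Measuring time in years since the first observation is an assumption made for illustration; Greykite constructs its own internal time variable.

```python
import numpy as np

# Elapsed time in years over five years of daily data (assumed scaling)
t = np.arange(0, 5 * 365) / 365.0

# One candidate trend feature per growth term
growth_features = {
    "linear": t,           # constant rate of change
    "quadratic": t ** 2,   # accelerating growth
    "sqrt": np.sqrt(t),    # growth that flattens over time
}

for name, values in growth_features.items():
    print(name, round(float(values[-1]), 3))
```

The grid search then picks whichever shape gives the lowest cross-validation error.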

### Seasonality

In [19]:
# Define the seasonality parameters, letting Silverkite choose each component automatically
# Silverkite exposes five seasonality components: yearly, quarterly, monthly, weekly, and daily
# Daily seasonality describes patterns within a day, so it cannot be estimated from daily data
# We therefore fit the other four and turn daily seasonality off
# The seasonality defined for our model is:

seasonality = dict(yearly_seasonality = "auto",
                   quarterly_seasonality = "auto",
                   monthly_seasonality = "auto",
                   weekly_seasonality = "auto",
                   daily_seasonality = False)

seasonality

{'yearly_seasonality': 'auto',
 'quarterly_seasonality': 'auto',
 'monthly_seasonality': 'auto',
 'weekly_seasonality': 'auto',
 'daily_seasonality': False}
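Silverkite represents each seasonal component with Fourier terms, which is where names such as `cos4_tow_weekly` in the fit log come from. The sketch below builds weekly sine and cosine terms by hand and shows why high orders can become redundant on daily data: with only seven distinct days of the week, the fourth harmonic repeats the third. The helper `fourier_terms` and the order of 4 are assumptions for illustration, not Greykite's implementation.

```python
import numpy as np

def fourier_terms(position, order):
    """Return sin/cos columns for harmonics 1..order; `position` lies in [0, 1)."""
    cols = []
    for k in range(1, order + 1):
        cols.append(np.sin(2 * np.pi * k * position))
        cols.append(np.cos(2 * np.pi * k * position))
    return np.column_stack(cols)

# Four weeks of daily data: the position within the week takes only seven values
day_of_week = np.arange(28) % 7
weekly = fourier_terms(day_of_week / 7.0, order=4)

# Eight columns are built, but fewer are linearly independent
rank = np.linalg.matrix_rank(weekly, tol=1e-8)
print(weekly.shape, rank)
```

Collinear columns like these carry no extra information, so dropping them does not change what the model can fit.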

### Holidays

In [20]:
# Retrieve available holidays across the specified years in the US
get_available_holidays_across_countries(countries = ["US"],  # Country to check
                                        year_start = 2013,    # Start year for holiday search
                                        year_end = 2024)      # End year for holiday search

['Christmas Day',
 'Christmas Day (Observed)',
 'Columbus Day',
 'Halloween',
 'Independence Day',
 'Independence Day (Observed)',
 'Juneteenth National Independence Day',
 'Juneteenth National Independence Day (Observed)',
 'Labor Day',
 'Martin Luther King Jr. Day',
 'Memorial Day',
 "New Year's Day",
 "New Year's Day (Observed)",
 'Thanksgiving',
 'Veterans Day',
 'Veterans Day (Observed)',
 "Washington's Birthday"]

In [21]:
# choose holidays to use:
used_holidays = [
    "Christmas Day (Observed)",
    "Independence Day (Observed)",
    "Thanksgiving",
    "New Year's Day (Observed)"
]

# Define events to be included in the model
events = dict(holidays_to_model_separately = used_holidays,          # Holidays to model separately
              holiday_lookup_countries = ["US"],                     # Country for holiday lookup
              holiday_pre_num_days = 2,                              # Number of days before the holiday to include
              holiday_post_num_days = 2,                             # Number of days after the holiday to include
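              # Note: "New Year's Day" is not in `used_holidays` (only the "(Observed)" variant is),
              # which is the likely source of the "not valid" holiday warning in the fit log
              holiday_pre_post_num_dict = {"New Year's Day": (3,1)}, # Specific pre and post days for certain holidays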
              daily_event_df_dict = {                                # Dictionary of daily events
                  "elections": pd.DataFrame({                        # Include Elections event
                  "date": ["2016-11-08", "2020-11-03"],
                  "event_name": ["elections"] * 2
                  })})

events

{'holidays_to_model_separately': ['Christmas Day (Observed)',
  'Independence Day (Observed)',
  'Thanksgiving',
  "New Year's Day (Observed)"],
 'holiday_lookup_countries': ['US'],
 'holiday_pre_num_days': 2,
 'holiday_post_num_days': 2,
 'holiday_pre_post_num_dict': {"New Year's Day": (3, 1)},
 'daily_event_df_dict': {'elections':          date event_name
  0  2016-11-08  elections
  1  2020-11-03  elections}}
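The pre and post settings widen each holiday into a window of neighboring days. The sketch below only illustrates which calendar dates such a window covers; `holiday_window` is a hypothetical helper, and Greykite builds its own event features internally.

```python
import pandas as pd

def holiday_window(date, pre_days, post_days):
    """Return the dates from `pre_days` before to `post_days` after a holiday."""
    day = pd.Timestamp(date)
    start = day - pd.Timedelta(days=pre_days)
    end = day + pd.Timedelta(days=post_days)
    return pd.date_range(start, end, freq="D")

# Thanksgiving 2023 with two days on either side, matching the settings above
window = holiday_window("2023-11-23", pre_days=2, post_days=2)
print([d.strftime("%Y-%m-%d") for d in window])
```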

### Change Points

In [22]:
# the plot shows many potential change points in the data,
# so let Silverkite locate them by setting the method to "auto":

changepoints = dict(changepoints_dict = dict(method = "auto")) # Automatically detect changepoints
changepoints

{'changepoints_dict': {'method': 'auto'}}

### Set Algorithm

In [23]:
# Define custom algorithms to be used for fitting
custom = dict(fit_algorithm_dict = [dict(fit_algorithm = "linear"),
                                    dict(fit_algorithm = "ridge"),
                                    dict(fit_algorithm = "gradient_boosting")])
custom

{'fit_algorithm_dict': [{'fit_algorithm': 'linear'},
  {'fit_algorithm': 'ridge'},
  {'fit_algorithm': 'gradient_boosting'}]}

## SilverKite Models

In [24]:
# Build the silverkite model with the specified components
model = ModelComponentsParam(growth = growth,                    # Growth term parameters
                             seasonality = seasonality,          # Seasonality parameters
                             events = events,                    # Events and holidays to model
                             changepoints = changepoints,        # Changepoints configuration
                             custom = custom)                    # Custom algorithms for fitting
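Because `growth_term` and `fit_algorithm_dict` are both given as lists, the template treats them as a grid and tries every combination. The sketch below enumerates that grid with `itertools.product`; the result is consistent with the "9 candidates" reported in the fit log. The dictionary layout of each candidate is only for illustration.

```python
from itertools import product

growth_terms = ["linear", "quadratic", "sqrt"]
fit_algorithms = ["linear", "ridge", "gradient_boosting"]

# Every pairing of a growth term with a fit algorithm is one candidate model
candidates = [
    {"growth_term": growth, "fit_algorithm": algorithm}
    for growth, algorithm in product(growth_terms, fit_algorithms)
]

print(len(candidates))
for candidate in candidates:
    print(candidate)
```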

## Evaluation Metrics

In [40]:
# Define the evaluation metrics; Mean Squared Error (MSE) is used to select the best model during cross-validation
evaluation_metric = EvaluationMetricParam(
    cv_selection_metric=EvaluationMetricEnum.MeanSquaredError.name,
    cv_report_metrics=[
        EvaluationMetricEnum.MeanSquaredError.name,
        EvaluationMetricEnum.MeanAbsoluteError.name,
        EvaluationMetricEnum.MedianAbsoluteError.name,
        EvaluationMetricEnum.MedianAbsolutePercentError.name,
    ],
    agg_periods=7,
    agg_func=np.sum,
    null_model_params = {
        "strategy": "mean"
    },
    relative_error_tolerance=0.01
)
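With `agg_periods=7` and `agg_func=np.sum`, errors are measured on weekly totals rather than on individual days, which makes the reported values much larger than daily errors. The sketch below shows the effect on toy numbers. The helper `aggregated_mse` is hypothetical, and the way it drops an incomplete final block is an assumption that may differ from Greykite's behavior.

```python
import numpy as np

def aggregated_mse(y_true, y_pred, agg_periods):
    """MSE computed on sums over consecutive blocks of `agg_periods` values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n_blocks = len(y_true) // agg_periods
    usable = n_blocks * agg_periods  # ignore an incomplete trailing block
    true_sums = y_true[:usable].reshape(n_blocks, agg_periods).sum(axis=1)
    pred_sums = y_pred[:usable].reshape(n_blocks, agg_periods).sum(axis=1)
    return float(np.mean((true_sums - pred_sums) ** 2))

# Two weeks of hypothetical data where every daily forecast is off by 10
y_true = np.full(14, 80_000.0)
y_pred = y_true + 10.0

daily_mse = aggregated_mse(y_true, y_pred, agg_periods=1)
weekly_mse = aggregated_mse(y_true, y_pred, agg_periods=7)
print(daily_mse, weekly_mse)
```

A daily error of 10 becomes a weekly error of 70, so the squared error grows by a factor of 49. This is worth keeping in mind when reading the cross-validation table.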

## Cross-Validations Parameters

In [41]:
# Set up the cross-validation (CV) parameters
evaluation_period = EvaluationPeriodParam(
    cv_min_train_periods = data.shape[0] - forecast_test_horizon - 180, # Minimum training size: all rows minus the test horizon and a further 180 days
    cv_expanding_window = True,                                         # Expanding training window instead of a fixed-size sliding window
    cv_max_splits = 50,                                                 # Maximum number of CV splits
    cv_periods_between_splits = int(0.5*forecast_test_horizon)          # Step between consecutive CV splits (45 days)
    )

In [42]:
# Configure the forecasting model using the Silverkite template
config = ForecastConfig(
    model_template = ModelTemplateEnum.SILVERKITE.name,       # Use the Silverkite model template
    forecast_horizon = forecast_test_horizon,                 # Forecast horizon of 3 months (90 days)
    coverage = 0.95,                                          # Nominal coverage of the prediction intervals
    metadata_param = metadata,                                # Metadata parameters defined earlier
    model_components_param = model,                           # Model components defined earlier
    evaluation_metric_param = evaluation_metric,              # Evaluation metric for model selection
    evaluation_period_param = evaluation_period)              # CV parameters for model evaluation

## Perform Training, and CV

In [43]:
# Initialize the forecaster and run the forecast configuration
forecaster = Forecaster()
result = forecaster.run_forecast_config(df = data,
                                        config = config) # Run the forecast using the prepared config

Fitting 2 folds for each of 9 candidates, totalling 18 fits



Requested holiday 'New Years Day' is not valid. Valid holidays are: ['Christmas Day (Observed)', 'Independence Day (Observed)', 'Thanksgiving', 'New Years Day (Observed)', 'Other']


The following Fourier series terms are removed due to collinearity:
['sin1_toq_quarterly', 'cos1_toq_quarterly', 'sin2_toq_quarterly', 'cos2_toq_quarterly', 'sin3_toq_quarterly', 'cos3_toq_quarterly', 'cos4_tow_weekly']


73 value(s) in y_true were NA or infinite and are omitted in error calc.



In [80]:
# Summarize the results of the grid search with MSE as the evaluation metric
cv_results = summarize_grid_search_results(
    grid_search = result.grid_search,                          # Grid search results from the forecast run
    decimals = 1,                                              # Number of decimal places to round in the summary
    score_func = EvaluationMetricEnum.MeanSquaredError.name)   # Use MSE as the score function

In [81]:
# Convert the 'params' column to string type and set it as the index
cv_results["params"] = cv_results["params"].astype(str)
cv_results.set_index("params", drop = True, inplace = True)
cv_results

params,rank_test_MSE,rank_test_MAE,rank_test_MedAE,rank_test_MedAPE,mean_test_MSE,mean_test_MAE,mean_test_MedAE,mean_test_MedAPE,split_test_MSE,split_test_MAE,...,std_test_MedAE,split0_train_MedAE,split1_train_MedAE,std_train_MedAE,split0_test_MedAPE,split1_test_MedAPE,std_test_MedAPE,split0_train_MedAPE,split1_train_MedAPE,std_train_MedAPE
"[('estimator__growth_term', 'linear'), ('estimator__fit_algorithm_dict', {'fit_algorithm': 'linear'})]",5,5,5,5,6382280000.0,77074.0,77217.5,12.5,"(7615890538.7, 5148670299.4)","(85015.6, 69132.4)",...,6054.6,23659.9,24179.5,259.8,13.3,11.7,0.8,5.7,5.8,0.1
"[('estimator__growth_term', 'quadratic'), ('estimator__fit_algorithm_dict', {'fit_algorithm': 'linear'})]",6,6,6,6,7143079000.0,81010.9,81277.9,13.2,"(9339431885.6, 4946726357.7)","(94340.7, 67681.2)",...,11551.8,21282.6,19327.2,977.7,14.9,11.5,1.7,5.1,4.7,0.2
"[('estimator__growth_term', 'sqrt'), ('estimator__fit_algorithm_dict', {'fit_algorithm': 'linear'})]",4,4,4,4,5993435000.0,74923.6,74995.2,12.2,"(6614194834.9, 5372675434.5)","(79111.1, 70736.1)",...,2247.4,24563.8,24309.6,127.1,12.4,12.0,0.2,5.8,5.8,0.0
"[('estimator__growth_term', 'linear'), ('estimator__fit_algorithm_dict', {'fit_algorithm': 'ridge'})]",8,8,8,8,42770680000.0,206508.8,206278.0,33.5,"(45993919521.6, 39547440856.5)","(214415.8, 198601.9)",...,10011.0,17318.5,15823.2,747.7,34.7,32.4,1.2,4.1,3.9,0.1
"[('estimator__growth_term', 'quadratic'), ('estimator__fit_algorithm_dict', {'fit_algorithm': 'ridge'})]",7,7,7,7,42739270000.0,206432.3,206201.0,33.5,"(45965422315.0, 39513115656.9)","(214349.3, 198515.3)",...,10020.6,17337.1,15830.8,753.1,34.7,32.4,1.2,4.1,3.9,0.1
"[('estimator__growth_term', 'sqrt'), ('estimator__fit_algorithm_dict', {'fit_algorithm': 'ridge'})]",9,9,9,9,42794260000.0,206566.2,206335.5,33.6,"(46015293670.5, 39573230862.0)","(214465.6, 198666.9)",...,10003.6,17315.5,15832.2,741.7,34.7,32.4,1.2,4.1,3.9,0.1
"[('estimator__growth_term', 'linear'), ('estimator__fit_algorithm_dict', {'fit_algorithm': 'gradient_boosting'})]",2,2,2,2,1183619000.0,28632.9,29531.6,4.8,"(2184601429.4, 182635599.0)","(46400.8, 10865.0)",...,18120.2,754.0,740.6,6.7,7.7,1.9,2.9,0.2,0.2,0.0
"[('estimator__growth_term', 'quadratic'), ('estimator__fit_algorithm_dict', {'fit_algorithm': 'gradient_boosting'})]",3,3,3,3,1648842000.0,32858.5,33144.6,5.3,"(3165372975.4, 132311370.3)","(55944.0, 9773.0)",...,24702.3,738.9,718.6,10.2,9.3,1.4,3.9,0.2,0.2,0.0
"[('estimator__growth_term', 'sqrt'), ('estimator__fit_algorithm_dict', {'fit_algorithm': 'gradient_boosting'})]",1,1,1,1,1162656000.0,28059.8,28400.2,4.6,"(2182910918.4, 142401593.0)","(46352.4, 9767.3)",...,19162.0,725.8,736.0,5.1,7.7,1.5,3.1,0.2,0.2,0.0


In [82]:
# Display each candidate's rank and mean test MSE alongside its fit algorithm and growth term
cv_results[["rank_test_MSE", "mean_test_MSE",
            "param_estimator__fit_algorithm_dict",
            "param_estimator__growth_term"]]

params,rank_test_MSE,mean_test_MSE,param_estimator__fit_algorithm_dict,param_estimator__growth_term
"[('estimator__growth_term', 'linear'), ('estimator__fit_algorithm_dict', {'fit_algorithm': 'linear'})]",5,6382280000.0,{'fit_algorithm': 'linear'},linear
"[('estimator__growth_term', 'quadratic'), ('estimator__fit_algorithm_dict', {'fit_algorithm': 'linear'})]",6,7143079000.0,{'fit_algorithm': 'linear'},quadratic
"[('estimator__growth_term', 'sqrt'), ('estimator__fit_algorithm_dict', {'fit_algorithm': 'linear'})]",4,5993435000.0,{'fit_algorithm': 'linear'},sqrt
"[('estimator__growth_term', 'linear'), ('estimator__fit_algorithm_dict', {'fit_algorithm': 'ridge'})]",8,42770680000.0,{'fit_algorithm': 'ridge'},linear
"[('estimator__growth_term', 'quadratic'), ('estimator__fit_algorithm_dict', {'fit_algorithm': 'ridge'})]",7,42739270000.0,{'fit_algorithm': 'ridge'},quadratic
"[('estimator__growth_term', 'sqrt'), ('estimator__fit_algorithm_dict', {'fit_algorithm': 'ridge'})]",9,42794260000.0,{'fit_algorithm': 'ridge'},sqrt
"[('estimator__growth_term', 'linear'), ('estimator__fit_algorithm_dict', {'fit_algorithm': 'gradient_boosting'})]",2,1183619000.0,{'fit_algorithm': 'gradient_boosting'},linear
"[('estimator__growth_term', 'quadratic'), ('estimator__fit_algorithm_dict', {'fit_algorithm': 'gradient_boosting'})]",3,1648842000.0,{'fit_algorithm': 'gradient_boosting'},quadratic
"[('estimator__growth_term', 'sqrt'), ('estimator__fit_algorithm_dict', {'fit_algorithm': 'gradient_boosting'})]",1,1162656000.0,{'fit_algorithm': 'gradient_boosting'},sqrt


In [47]:
# Plot the backtest results from the forecasting model
result.backtest.plot()

# Forecasting

In [60]:
forecast_df = result.forecast.df
forecast_df

Unnamed: 0,ts,actual,forecast,forecast_lower,forecast_upper,forecast_null
0,2013-08-21,49673.0,49956.997716,49548.914060,50365.081372,59682.017081
1,2013-08-22,49690.0,49960.609963,49539.621823,50381.598102,59682.017081
2,2013-08-23,49548.0,49934.947123,49526.457432,50343.436815,59682.017081
3,2013-08-24,49617.0,49925.759602,49513.459343,50338.059860,59682.017081
4,2013-08-25,49858.0,49966.953542,49547.664065,50386.243019,59682.017081
...,...,...,...,...,...,...
4066,2024-10-08,87218.0,87180.553003,86752.812793,87608.293213,59682.017081
4067,2024-10-09,87301.0,87191.706796,86783.623140,87599.790452,59682.017081
4068,2024-10-10,87301.0,87191.706796,86770.718656,87612.694935,59682.017081
4069,2024-10-11,87285.0,87168.054169,86759.564477,87576.543861,59682.017081
