# **ML ENGINEER TECH ASSESSMENT: KAYODE TAIWO**

**Problem Statement**

Forecasting vaccination numbers is crucial for identifying demand and, in turn, the number of doses that must be supplied. This helps plan logistics operations and campaigns for optimal distribution.

*   Build a simple forecasting model using the vaccination data in the [*time_series_covid19_vaccine_doses_admin_US.csv*](https://raw.githubusercontent.com/govex/COVID-19/master/data_tables/vaccine_data/us_data/time_series/time_series_covid19_vaccine_doses_admin_US.csv) file. The forecasting should be done for at least 3 different states.

*   Starting on May 1st 2022, the model should forecast the trend for a suitable prediction horizon and update these predictions daily based on the new data that comes in. You may choose how you show this.

*   Once you have a suitable training flow and model, create a PowerBI page that shows the timeseries of the 3 states and their forecasts.



**Solution Approach**

This is a univariate time series. We work with three states, *Alabama*, *California*, and *Georgia*, as specified in the *provinces* list below. More states can be added, or a different set of three used, by modifying this *provinces* list.

A univariate time series contains a single time dependent variable. Typical analysis of such a series starts by extracting features from both the *datetime* field and the *datetime dependent* variable.

Features such as *day*, *month*, *year*, *dayofweek*, *weekofyear*, *hour*, *minute*, *second*, *is_weekend*, *is_holiday*, etc. are extracted from the datetime field, while *time lag* features are extracted from the datetime-dependent variable.
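
As a minimal sketch of the datetime part of this step, the snippet below derives a few calendar features from a toy daily index. The dates and the lowercase column names are illustrative assumptions, and *is_holiday* is left out because it needs a holiday calendar (such as the `holidays` package imported later in this notebook).

```python
import pandas as pd

# Toy daily index; the dates are illustrative only.
idx = pd.date_range("2022-04-29", periods=5, freq="D")
features = pd.DataFrame(index=idx)

# Calendar features read directly from the DatetimeIndex.
features["day"] = idx.day
features["month"] = idx.month
features["year"] = idx.year
features["dayofweek"] = idx.dayofweek  # pandas convention: Monday=0 ... Sunday=6
features["weekofyear"] = idx.isocalendar().week.astype(int)  # ISO week number
features["is_weekend"] = idx.dayofweek >= 5  # Saturday or Sunday

print(features)
```

Hour, minute, and second features only carry information when the series is sampled more often than once a day, which is not the case for this data set.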

Once the time series data has been augmented with these additional features, we explore several model types:

*   Machine Learning models such as xgboost, LinearRegression, etc.
*   Deep Learning models such as RNN, LSTM, GRU, etc.

To evaluate the results of our model we use the following metrics:

*   Root Mean Squared Error (RMSE)
*   Coefficient of Determination (R2)
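
For reference, both metrics can be written in a few lines of NumPy. This is a sketch of the standard definitions (R2 here is the coefficient of determination, which is what `sklearn.metrics.r2_score` reports); the helper names `rmse` and `r2` and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Toy values, not taken from the vaccination data.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
print(rmse(y_true, y_pred), r2(y_true, y_pred))
```

RMSE is expressed in doses, so it is not directly comparable between states of different size, whereas R2 is unitless.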

**NOTES**

*   The original data provided was split into 3 sets after pre-processing and feature extraction - train, validation, and test. Everything before 1st February 2022 was taken as train. Everything on or after 1st of February 2022 and before 1st of May 2022 was taken as validation. Everything on or after 1st May 2022 was taken as test.

*   For the LinearRegression model from sklearn that was used in this analysis there was no need for hyperparameter tuning, so the train and validation sets described above were combined into a single train set.

*   The results of some of our machine learning and deep learning model approaches are not shown in this notebook primarily because they did not perform sufficiently well. Specifically the models xgboost, RNN, LSTM, and GRU all exhibited signs of *overfitting* to the train and validation datasets. Each of these model types had good RMSE and R2 on the train and validation datasets but extremely poor values on the hold out set (test set). Perhaps they might be improved by regularization or more performance tuning.
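
The chronological split described in the first note can be sketched with plain timestamp comparisons. The notebook itself splits on an integer *Series* column, so the snippet below is an equivalent alternative on a toy frame, not the code used later; the toy values and variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy daily frame that straddles both cut-off dates.
idx = pd.date_range("2021-12-01", "2022-06-30", freq="D")
df = pd.DataFrame({"Dose": np.arange(len(idx), dtype=float)}, index=idx)

val_start = pd.Timestamp("2022-02-01")   # first validation day
test_start = pd.Timestamp("2022-05-01")  # first test day

# No shuffling: every training row precedes every validation row,
# and every validation row precedes every test row.
train = df[df.index < val_start]
val = df[(df.index >= val_start) & (df.index < test_start)]
test = df[df.index >= test_start]

print(len(train), len(val), len(test))
```

Keeping the split strictly ordered in time prevents future observations from leaking into the training data.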

**Importing libraries**

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler
import plotly.express as px
import plotly.graph_objs as go
from plotly.offline import iplot

import holidays

us_holidays = holidays.US()
provinces = ['Alabama', 'California', 'Georgia']


**Download the COVID-19 Time Series dataset**

In [None]:
!wget https://raw.githubusercontent.com/govex/COVID-19/master/data_tables/vaccine_data/us_data/time_series/time_series_covid19_vaccine_doses_admin_US.csv

--2022-09-25 12:47:26--  https://raw.githubusercontent.com/govex/COVID-19/master/data_tables/vaccine_data/us_data/time_series/time_series_covid19_vaccine_doses_admin_US.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 317102 (310K) [text/plain]
Saving to: ‘time_series_covid19_vaccine_doses_admin_US.csv.2’


2022-09-25 12:47:26 (76.5 MB/s) - ‘time_series_covid19_vaccine_doses_admin_US.csv.2’ saved [317102/317102]



**Function to read covid-19 dataset and reshape data format from wide to long by pivoting date columns into rows**

This function reads the downloaded covid-19 dataset into a pandas dataframe and reshapes it from wide to long format with the pandas pd.melt API. Some columns are kept as identifier columns, while the date columns are converted to rows: the column names become the values of a new *Date* column and the column data become the values of a new *Dose* column.

Returns the reshaped pandas dataframe.
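
To make the reshaping concrete, here is a small sketch of `pd.melt` on a two-state toy table. The dose values are made up for illustration, and only one identifier column is kept.

```python
import pandas as pd

# Wide format: one row per state, one column per date (values are made up).
wide = pd.DataFrame({
    "Province_State": ["Alabama", "Georgia"],
    "2022-05-01": [100.0, 200.0],
    "2022-05-02": [110.0, 220.0],
})

# Long format: one row per (state, date) pair.
long = pd.melt(wide, id_vars=["Province_State"], var_name="Date", value_name="Dose")
long["Date"] = pd.to_datetime(long["Date"])
long = long.set_index("Date").sort_index()

print(long)
```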

In [None]:
def read_and_pivot():
    df = pd.read_csv("./time_series_covid19_vaccine_doses_admin_US.csv")
    df = pd.melt(df, id_vars=['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Lat', 'Long_', 'Combined_Key', 'Population'], var_name='Date', value_name='Dose').drop(columns=['Admin2'])
    df['Date'] = pd.to_datetime(df['Date'])
    df = df.set_index(['Date'])
    # Index.is_monotonic was removed in pandas 2.0; is_monotonic_increasing is the equivalent check
    if not df.index.is_monotonic_increasing:
        df = df.sort_index()
    return df

**Function to read specific covid-19 dataset and index the data by a date column**

This function reads the covid-19 dataset specified in its *file_name* parameter into a pandas dataframe and indexes the dataframe by a date column named *Date*.

Returns the reindexed pandas dataframe.

In [None]:
def read_and_index(file_name):
    df = pd.read_csv(file_name)
    df = df.set_index(['Date'])
    df.index = pd.to_datetime(df.index)
    if not df.index.is_monotonic_increasing:
        df = df.sort_index()
    return df

**Function to generate time lag features**

This function generates time lag features for the pandas dataframe specified in its *df* parameter. The number of lags to generate is also specified in the *n_lags* parameter.

The first *n_lags* rows are deleted from the resulting dataframe.

This function returns a copy of the dataframe, *df*, with *n_lags* new time lag features and the first *n_lags* rows removed.
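
A short sketch of the same shifting logic on a five-row toy series shows what the lag columns contain. The values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Dose": [10.0, 12.0, 15.0, 19.0, 24.0]},
    index=pd.date_range("2022-05-01", periods=5, freq="D"),
)

n_lags = 2
lagged = toy.copy()
for n in range(1, n_lags + 1):
    # lag{n} holds the Dose value observed n days earlier.
    lagged[f"lag{n}"] = lagged["Dose"].shift(n)

# The first n_lags rows have no complete history, so they are dropped.
lagged = lagged.iloc[n_lags:]
print(lagged)
```

Each remaining row pairs the current *Dose* with the values observed one and two days earlier, which is what lets a regression model predict today's value from recent history.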

In [None]:
def generate_time_lags(df, n_lags):
    df_n = df.copy()
    for n in range(1, n_lags + 1):
        df_n[f"lag{n}"] = df_n["Dose"].shift(n)
    df_n = df_n.iloc[n_lags:]
    return df_n

**Function to generate one hot encoded features for categorical columns**

This function generates one hot encoded features for the pandas dataframe specified in its *df* parameter. The columns to one hot encode are also specified in its *col_name* parameter.

The *col_name* columns to one hot encode are deleted from the resulting dataframe.

This function returns the dataframe, *df*, with new one hot encoded features and the *col_name* columns removed.
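
As an illustration, the sketch below applies the same `pd.get_dummies` pattern to a three-row toy frame (the values are made up).

```python
import pandas as pd

toy = pd.DataFrame({"Day": [30, 31, 1], "Dose": [5.0, 6.0, 7.0]})

# One indicator column per distinct value of Day, named Day_<value>.
dummies = pd.get_dummies(toy["Day"], prefix="Day")
encoded = pd.concat([toy, dummies], axis=1).drop(columns=["Day"])

print(encoded.columns.tolist())
```

Only values that actually occur receive a column, so encoding a subset that lacks, say, the 31st would yield fewer columns. Encoding the full series before splitting it into train and test sets, as this notebook does, avoids that mismatch.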

In [None]:
def onehot_encode_pd(df, col_name):
    dummies = pd.get_dummies(df[col_name], prefix=col_name)
    return pd.concat([df, dummies], axis=1).drop(columns=[col_name])

**Function to generate cyclical encoded features for periodic categorical columns**

This function generates cyclical encoded features for the pandas dataframe specified in its *df* parameter. The columns to encode are also specified in its *col_name* parameter.

The *col_name* columns to encode are deleted from the resulting dataframe.

This function returns the dataframe, *df*, with new cyclical encoded features and the *col_name* columns removed.
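
The idea behind the encoding is that a periodic value is mapped to a point on the unit circle, so the last and first values of a cycle end up next to each other. The sketch below uses the same sine and cosine formula on month numbers; the helper name `encode_cyclical` is an assumption made for this example.

```python
import numpy as np

def encode_cyclical(values, period, start_num=0):
    """Map periodic values to (sin, cos) coordinates on the unit circle."""
    angle = 2 * np.pi * (np.asarray(values, dtype=float) - start_num) / period
    return np.sin(angle), np.cos(angle)

# January, July, and December, with months numbered from 1.
months = np.array([1, 7, 12])
sin_m, cos_m = encode_cyclical(months, period=12, start_num=1)
points = np.column_stack([sin_m, cos_m])

# Distances in the encoded space.
dist_dec_jan = np.linalg.norm(points[2] - points[0])
dist_jul_jan = np.linalg.norm(points[1] - points[0])
print(dist_dec_jan, dist_jul_jan)
```

With raw month numbers December (12) and January (1) are eleven units apart; in the encoded space they are close together, while January and July sit on opposite sides of the circle.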


In [None]:
def generate_cyclical_features(df, col_name, period, start_num=0):
    kwargs = {
        f'sin_{col_name}': lambda x: np.sin(2 * np.pi * (x[col_name] - start_num) / period),
        f'cos_{col_name}': lambda x: np.cos(2 * np.pi * (x[col_name] - start_num) / period),
    }
    return df.assign(**kwargs).drop(columns=[col_name])

**Function to plot a time series dataset**

In [None]:
def plot_dataset(df, title):
    data = []
    value = go.Scatter(
        x=df.index,
        y=df['Dose'],
        mode="lines",
        name="values",
        marker=dict(),
        text=df.index,
        line=dict(color="rgba(0,0,0, 0.3)"),
    )
    data.append(value)

    layout = dict(
        title=title,
        xaxis=dict(title="Date", ticklen=5, zeroline=False),
        yaxis=dict(title="Value", ticklen=5, zeroline=False),
    )

    fig = dict(data=data, layout=layout)
    iplot(fig)

**Data preprocessing and feature engineering**

*   Load and transform covid-19 data
*   Generate date features, *Day*, *Month*, *Year*, *WeekOfYear*, *DayOfWeek*, and *Quarter*, from the *Date* index.
*   Generate time lag features, *lag1* through *lag8*, from the *Dose* column.
*   Generate cyclical features for the columns *Month*, *WeekOfYear*, *DayOfWeek*, and *Quarter*.
*   Generate one hot features for the *Day* column.

In [None]:
covid = read_and_pivot()
cutoffs_1 = {}
for province in provinces:
    print('Processing Province => {0}'.format(province))
    df = covid[covid['Province_State']==province].copy()
    df.drop(['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Province_State', 'Country_Region', 'Lat', 'Long_', 'Combined_Key', 'Population'], axis=1, inplace=True)
    df = df.sort_index()
    # extract calendar features from the date index
    df['Day'] = [i.day for i in df.index]
    df['Month'] = [i.month for i in df.index]
    df['Year'] = [i.year for i in df.index]
    df['WeekOfYear'] = [i.week for i in df.index]
    df['DayOfWeek'] = [i.dayofweek for i in df.index]
    df['Quarter'] = [i.quarter for i in df.index]
    df = df[['Day', 'Month', 'Year', 'WeekOfYear', 'DayOfWeek', 'Quarter', 'Dose']]
    df.fillna(0, inplace=True)
    df = generate_time_lags(df, 8)
    # create a sequence of numbers
    df['Series'] = np.arange(1, len(df)+1)

    may_01_2022 = df[(df['Year'] == 2022) & (df['Month'] == 5) & (df['Day'] == 1)]
    print('may_01_2022[Series].to_list()[0] = {0}'.format(may_01_2022['Series'].to_list()[0]))
    test_cutoff = may_01_2022['Series'].to_list()[0]

    feb_01_2022 = df[(df['Year'] == 2022) & (df['Month'] == 2) & (df['Day'] == 1)]
    print('feb_01_2022[Series].to_list()[0] = {0}'.format(feb_01_2022['Series'].to_list()[0]))
    val_cutoff = feb_01_2022['Series'].to_list()[0]

    cutoffs_1[province] = (test_cutoff, val_cutoff)

    df = generate_cyclical_features(df, 'Quarter', 4, 1)
    df = generate_cyclical_features(df, 'Month', 12, 1)
    df = generate_cyclical_features(df, 'DayOfWeek', 7, 0)
    df = generate_cyclical_features(df, 'WeekOfYear', 52, 0)
    df = onehot_encode_pd(df, 'Day')

    df.to_csv('./time_series_covid19_vaccine_doses_admin_US_{0}.csv'.format(province), index=True)


Processing Province => Alabama
may_01_2022[Series].to_list()[0] = 496
feb_01_2022[Series].to_list()[0] = 407
Processing Province => California
may_01_2022[Series].to_list()[0] = 496
feb_01_2022[Series].to_list()[0] = 407
Processing Province => Georgia
may_01_2022[Series].to_list()[0] = 496
feb_01_2022[Series].to_list()[0] = 407


**Preview the data and time series plot for each province**

From the plot of the time series data for each of our 3 provinces, we notice an increasing trend with no seasonality.

In [None]:
for province in provinces:
    print('Processing Province => {0}\n'.format(province))
    df = read_and_index('./time_series_covid19_vaccine_doses_admin_US_{0}.csv'.format(province))
    print(df.head())
    plot_dataset(df, title='Covid-19 Dose by Date for {0}'.format(province))

Processing Province => Alabama

            Year    Dose    lag1    lag2    lag3    lag4    lag5    lag6  \
Date                                                                       
2020-12-22  2020  5181.0  5181.0  5181.0  5181.0  5181.0     0.0     0.0   
2020-12-23  2020  5181.0  5181.0  5181.0  5181.0  5181.0  5181.0     0.0   
2020-12-24  2020  5181.0  5181.0  5181.0  5181.0  5181.0  5181.0  5181.0   
2020-12-25  2020  5181.0  5181.0  5181.0  5181.0  5181.0  5181.0  5181.0   
2020-12-26  2020  5181.0  5181.0  5181.0  5181.0  5181.0  5181.0  5181.0   

              lag7    lag8  ...  Day_22  Day_23  Day_24  Day_25  Day_26  \
Date                        ...                                           
2020-12-22     0.0     0.0  ...       1       0       0       0       0   
2020-12-23     0.0     0.0  ...       0       1       0       0       0   
2020-12-24     0.0     0.0  ...       0       0       1       0       0   
2020-12-25  5181.0     0.0  ...       0       0       0     

Processing Province => California

            Year  Dose  lag1  lag2  lag3  lag4  lag5  lag6  lag7  lag8  ...  \
Date                                                                    ...   
2020-12-22  2020   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
2020-12-23  2020   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
2020-12-24  2020   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
2020-12-25  2020   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   
2020-12-26  2020   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...   

            Day_22  Day_23  Day_24  Day_25  Day_26  Day_27  Day_28  Day_29  \
Date                                                                         
2020-12-22       1       0       0       0       0       0       0       0   
2020-12-23       0       1       0       0       0       0       0       0   
2020-12-24       0       0       1       0       0       0       0       0   
2020-12-25       0   

Processing Province => Georgia

            Year     Dose     lag1     lag2     lag3     lag4     lag5  \
Date                                                                     
2020-12-22  2020  26010.0  17870.0   1258.0      0.0      0.0      0.0   
2020-12-23  2020  26010.0  26010.0  17870.0   1258.0      0.0      0.0   
2020-12-24  2020  26010.0  26010.0  26010.0  17870.0   1258.0      0.0   
2020-12-25  2020  26010.0  26010.0  26010.0  26010.0  17870.0   1258.0   
2020-12-26  2020  26010.0  26010.0  26010.0  26010.0  26010.0  17870.0   

              lag6  lag7  lag8  ...  Day_22  Day_23  Day_24  Day_25  Day_26  \
Date                            ...                                           
2020-12-22     0.0   0.0   0.0  ...       1       0       0       0       0   
2020-12-23     0.0   0.0   0.0  ...       0       1       0       0       0   
2020-12-24     0.0   0.0   0.0  ...       0       0       1       0       0   
2020-12-25     0.0   0.0   0.0  ...       0       0   

**Covid-19 Vaccination Doses Increase Year on Year**

In [None]:
for province in provinces:
    df = read_and_index('./time_series_covid19_vaccine_doses_admin_US_{0}.csv'.format(province))
    fig = px.bar(df.groupby(df.index.year)['Dose'].mean(), title="Average Covid-19 Dose by Year for {0}".format(province), height=600)
    fig.update_xaxes(type='category')
    fig.show()

**Do Covid-19 Vaccination Doses Peak Around September Each Year?**

Or perhaps not. The data set spans three calendar years, starting around 22nd of December 2020 and ending in late September 2022. An average by month therefore includes only December for 2020 and excludes October through December for 2022. The monthly averages shown below would be representative only if we had data for every month of every year from 2020 through 2022.
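
The effect can be reproduced on synthetic data. The sketch below builds a toy series that rises by one unit every day over a date range similar to the processed data (642 daily rows starting 2020-12-22; the values themselves are assumptions) and compares pooling by month with grouping by year and month.

```python
import numpy as np
import pandas as pd

# Toy series that increases by one unit per day and never decreases.
idx = pd.date_range("2020-12-22", periods=642, freq="D")
toy = pd.DataFrame({"Dose": np.arange(len(idx), dtype=float)}, index=idx)

# Pooled by month: mixes months covered in two years with months covered in one.
by_month = toy.groupby(toy.index.month)["Dose"].mean()

# Grouped by (year, month): keeps the chronological order intact.
by_year_month = toy.groupby([toy.index.year, toy.index.month])["Dose"].mean()

print(by_month.idxmax())       # month with the highest pooled average
print(by_year_month.idxmax())  # (year, month) with the highest average
```

Although the toy series only ever rises, the pooled averages peak in September and drop for October through December, purely because those months are missing from the final year.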

In [None]:
for province in provinces:
    df = read_and_index('./time_series_covid19_vaccine_doses_admin_US_{0}.csv'.format(province))
    fig = px.bar(df.groupby(df.index.month)['Dose'].mean(), title="Average Covid-19 Dose by Month for {0}".format(province), height=600)
    fig.show()

**Any variation in Covid-19 vaccinations by Year and Month?**

Yes. The charts below give a more accurate picture: average covid-19 vaccinations grouped by year and then by month. This differs from the previous (misleading) chart, which pools each month across years without considering the year.

Like the daily series, which displays an increasing trend, the monthly averages shown below increase over time.

In [None]:
for province in provinces:
    df = read_and_index('./time_series_covid19_vaccine_doses_admin_US_{0}.csv'.format(province))
    gb = df.groupby([df.index.year, df.index.month])['Dose'].mean()
    x_data = [str(elem[0])+'-'+str(elem[1]) for elem in gb.index.to_list()]
    y_data = gb.to_list()
    fig = px.bar(x=x_data, y=y_data, title="Average Covid-19 Dose by Year and Month for {0}".format(province), height=600)
    fig.show()

**Any variations in Covid-19 vaccinations by specific day of week?**

To check this, we group the observations by day of week (pandas numbers them 0 = Monday through 6 = Sunday), take the mean, and plot the result.

As the charts show, there is no variation by day of week on average.
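
When reading these charts it helps to recall how pandas numbers weekdays. The short check below prints the `dayofweek` value next to the day name for one week; the chosen dates are illustrative.

```python
import pandas as pd

# One week of dates; 2022-05-01 fell on a Sunday.
week = pd.date_range("2022-05-01", periods=7, freq="D")

for ts in week:
    # dayofweek follows the pandas convention Monday=0 ... Sunday=6.
    print(ts.date(), ts.dayofweek, ts.day_name())
```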

In [None]:
for province in provinces:
    df = read_and_index('./time_series_covid19_vaccine_doses_admin_US_{0}.csv'.format(province))
    fig = px.bar(df.groupby(df.index.dayofweek)['Dose'].mean(), title="Average Covid-19 Dose by Day of Week for {0}".format(province), height=600)
    fig.show()

**Model, predict and visualize predictions**

*   Train a Linear Regression model for each province using data prior to 1st May 2022.
*   Predict covid-19 doses from 1st May 2022 through end of data.
*   Visualize your predictions and compare to actual data using the metrics *RMSE* and *R2*.


In [None]:
for province in provinces:
    print('\nProcessing Province => {0}\n'.format(province))
    df = read_and_index('./time_series_covid19_vaccine_doses_admin_US_{0}.csv'.format(province))

    test_cutoff = (cutoffs_1[province])[0]
    val_cutoff = (cutoffs_1[province])[1]

    data = df[df['Series'] < test_cutoff]
    train = data[data['Series'] < val_cutoff]
    val = data[data['Series'] >= val_cutoff]
    test = df[df['Series'] >= test_cutoff]
    # print('data.shape = {0}, train.shape = {1}, val.shape = {2}, test.shape = {3}'.format(data.shape, train.shape, val.shape, test.shape))

    X_data = data.drop(['Dose'], axis=1)
    Y_data = data['Dose']
    X_train = train.drop(['Dose'], axis=1)
    Y_train = train['Dose']
    X_valid = val.drop(['Dose'], axis=1)
    Y_valid = val['Dose']
    X_test = test.drop(['Dose'], axis=1)
    Y_test = test['Dose']

    X_data, Y_data, X_train, Y_train, X_valid, Y_valid, X_test, Y_test = X_data.values, Y_data.values, X_train.values, Y_train.values, X_valid.values, Y_valid.values, X_test.values, Y_test.values

    model = LinearRegression()
    model.fit(X_data, Y_data)

    Y_Hat_data = model.predict(X_data)
    # RMSE as sqrt(MSE); the squared=False argument is deprecated in recent scikit-learn releases
    rmse_data = np.sqrt(mean_squared_error(Y_data, Y_Hat_data))
    r2_data = r2_score(Y_data, Y_Hat_data)
    print("rmse_train = {0}, r2_data = {1}".format(rmse_data, r2_data))

    Y_Hat_test = model.predict(X_test)
    rmse_test = np.sqrt(mean_squared_error(Y_test, Y_Hat_test))
    r2_test = r2_score(Y_test, Y_Hat_test)
    print("rmse_test = {0}, r2_test = {1}\n".format(rmse_test, r2_test))

    preds = test.drop(columns=['Dose'])
    preds['Predicted'] = Y_Hat_test
    concat_df = pd.concat([data, preds], axis=0)

    concat_df.to_csv('./time_series_covid19_vaccine_doses_preds_admin_US_{0}.csv'.format(province), index=True)

    fig = px.line(concat_df, x=concat_df.index, y=["Dose", "Predicted"], template = 'plotly_dark')
    fig.show()



Processing Province => Alabama

rmse_train = 11687.095615273849, r2_data = 0.9999649041519526
rmse_test = 5220.477414643021, r2_test = 0.9966572711190536




Processing Province => California

rmse_train = 117837.88873568237, r2_data = 0.9999747592389752
rmse_test = 84177.07777172951, r2_test = 0.9968387820491801




Processing Province => Georgia

rmse_train = 36705.04406796915, r2_data = 0.9999380297143837
rmse_test = 17465.55576928533, r2_test = 0.9948695075460656



**Conclusion**

While the more complex models (xgboost, RNN, LSTM, GRU) overfit the training dataset, the LinearRegression model fit the data nearly perfectly, with R2 above 0.99 on both the train and test sets as shown above. This is excellent generalization, at least for the forecast horizon from 1st May 2022 through the end of the data in late September 2022.

To deploy this LinearRegression model in practice, one would likely benefit from retraining it at regular intervals (perhaps daily) so that new observations are incorporated into the training set and the model stays accurate.

# **APPENDIX**

**Retraining Linear Regression Model Daily to Incorporate New Observations**

The method described above already incorporates the new daily observations from 1st of May 2022 onward, because the lag features in the test set are built from actual observations rather than model predictions. However, it does not retrain the Linear Regression model each day as new observations arrive.

Given the accuracy obtained above, retraining may not be strictly necessary to have a useful model. Nevertheless, we include here a model that is retrained daily from 1st of May 2022 onward while still using the new actual observations in the lag features.
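
The distinction between lag features built from actual observations and lag features built from model predictions can be sketched on synthetic data. The snippet below is an illustration under stated assumptions (a random toy series, three lags, a hypothetical `lag_matrix` helper); the recursive variant is shown only for contrast and is not used anywhere in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
series = np.cumsum(rng.uniform(50, 150, size=120))  # toy rising series
n_lags, split = 3, 100                              # first 100 points are history

def lag_matrix(values, n_lags):
    """Rows of [lag1, ..., lag_n] aligned with targets values[n_lags:]."""
    X = np.column_stack(
        [values[n_lags - k: len(values) - k] for k in range(1, n_lags + 1)]
    )
    return X, values[n_lags:]

X, y = lag_matrix(series, n_lags)
n_train = split - n_lags
model = LinearRegression().fit(X[:n_train], y[:n_train])

# One-step-ahead: every test row uses lags taken from actual observations.
one_step = model.predict(X[n_train:])

# Recursive: after the first step, lags are filled with earlier predictions.
history = list(series[:split])
recursive = []
for _ in range(len(series) - split):
    x = np.array(history[-1:-n_lags - 1:-1]).reshape(1, -1)  # lag1, lag2, lag3
    y_hat = model.predict(x)[0]
    recursive.append(y_hat)
    history.append(y_hat)
recursive = np.array(recursive)

print(one_step[:3], recursive[:3])
```

The two forecasts agree on the first test day, where both see the same actual history, and can drift apart afterwards because the recursive variant no longer sees real observations.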

In [None]:
def preprocess_data():
    covid = read_and_pivot()
    cutoffs = {}
    for province in provinces:
        print('Processing Province => {0}'.format(province))
        df = covid[covid['Province_State']==province].copy()
        df.drop(['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Province_State', 'Country_Region', 'Lat', 'Long_', 'Combined_Key', 'Population'], axis=1, inplace=True)
        df = df.sort_index()
        # extract calendar features from the date index
        df['Day'] = [i.day for i in df.index]
        df['Month'] = [i.month for i in df.index]
        df['Year'] = [i.year for i in df.index]
        df['WeekOfYear'] = [i.week for i in df.index]
        df['DayOfWeek'] = [i.dayofweek for i in df.index]
        df['Quarter'] = [i.quarter for i in df.index]
        df = df[['Day', 'Month', 'Year', 'WeekOfYear', 'DayOfWeek', 'Quarter', 'Dose']]
        df.fillna(0, inplace=True)
        df = generate_time_lags(df, 8)

        # create a sequence of numbers
        df['Series'] = np.arange(1, len(df)+1)

        may_01_2022 = df[(df['Year'] == 2022) & (df['Month'] == 5) & (df['Day'] == 1)]
        test_cutoff = may_01_2022['Series'].to_list()[0]
        print('test_cutoff = {0}'.format(test_cutoff))

        feb_01_2022 = df[(df['Year'] == 2022) & (df['Month'] == 2) & (df['Day'] == 1)]
        val_cutoff = feb_01_2022['Series'].to_list()[0]
        print('val_cutoff = {0}'.format(val_cutoff))

        max_cutoff = df['Series'].max()
        print('max_cutoff = {0}'.format(max_cutoff))

        cutoffs[province] = (test_cutoff, val_cutoff, max_cutoff)

        df = generate_cyclical_features(df, 'Quarter', 4, 1)
        df = generate_cyclical_features(df, 'Month', 12, 1)
        df = generate_cyclical_features(df, 'DayOfWeek', 7, 0)
        df = generate_cyclical_features(df, 'WeekOfYear', 52, 0)
        df = onehot_encode_pd(df, 'Day')

        df.to_csv('./time_series_covid19_vaccine_doses_admin_US_{0}_pre1.csv'.format(province), index=True)
    return cutoffs

cutoffs_2 = preprocess_data()

Processing Province => Alabama
test_cutoff = 496
val_cutoff = 407
max_cutoff = 642
Processing Province => California
test_cutoff = 496
val_cutoff = 407
max_cutoff = 642
Processing Province => Georgia
test_cutoff = 496
val_cutoff = 407
max_cutoff = 642


In [None]:
for province in provinces:
    print('\nProcessing Province => {0}\n'.format(province))
    df = read_and_index('./time_series_covid19_vaccine_doses_admin_US_{0}_pre1.csv'.format(province))

    test_cutoff = (cutoffs_2[province])[0]
    val_cutoff = (cutoffs_2[province])[1]
    max_cutoff = (cutoffs_2[province])[2]

    n_test = (max_cutoff - test_cutoff) + 1
    preds = np.zeros(n_test)

    for i in range(test_cutoff, max_cutoff+1):

        j = i - test_cutoff

        data = df[df['Series'] < i]
        train = data[data['Series'] < (val_cutoff + j)]
        val = data[data['Series'] >= (val_cutoff + j)]
        test = df[df['Series'] == i]
        # print('data.shape = {0}, train.shape = {1}, val.shape = {2}, test.shape = {3}'.format(data.shape, train.shape, val.shape, test.shape))

        X_data = data.drop(['Dose'], axis=1)
        Y_data = data['Dose']
        X_train = train.drop(['Dose'], axis=1)
        Y_train = train['Dose']
        X_valid = val.drop(['Dose'], axis=1)
        Y_valid = val['Dose']
        X_test = test.drop(['Dose'], axis=1)
        Y_test = test['Dose']

        X_data, Y_data, X_train, Y_train, X_valid, Y_valid, X_test, Y_test = X_data.values, Y_data.values, X_train.values, Y_train.values, X_valid.values, Y_valid.values, X_test.values, Y_test.values

        model = LinearRegression()
        model.fit(X_data, Y_data)

        Y_Hat_test = model.predict(X_test)
        
        preds[j] = Y_Hat_test[0]

    # print("notebook_predict.run: preds.shape = {0}".format(preds.shape))
    data = df[df['Series'] < test_cutoff]
    test = df[df['Series'] >= test_cutoff]
    X_test = test.drop(['Dose'], axis=1)
    Y_test = test['Dose']
    X_test, Y_test = X_test.values, Y_test.values

    # RMSE as sqrt(MSE); the squared=False argument is deprecated in recent scikit-learn releases
    rmse_test = np.sqrt(mean_squared_error(Y_test, preds))
    r2_test = r2_score(Y_test, preds)
    print("rmse_test = {0}, r2_test = {1}\n".format(rmse_test, r2_test))

    preds_df = test.drop(columns=['Dose'])
    preds_df['Label'] = preds
    concat_df = pd.concat([data, preds_df], axis=0)

    concat_df.to_csv('./time_series_covid19_vaccine_doses_preds_retrain_admin_US_{0}.csv'.format(province), index=True)

    fig = px.line(concat_df, x=concat_df.index, y=["Dose", "Label"], template = 'plotly_dark')
    fig.show()



Processing Province => Alabama

rmse_test = 5042.633288517207, r2_test = 0.9968811428465372




Processing Province => California

rmse_test = 78779.91931214281, r2_test = 0.9972311603041631




Processing Province => Georgia

rmse_test = 16299.37802639027, r2_test = 0.9955317619949436



As can be seen above, the results are close to those obtained without daily retraining: test RMSE falls by roughly 3 to 7 percent (for example, from about 5220 to 5043 for Alabama) and R2 is essentially unchanged. This suggests that the most important way of incorporating new daily observations is to use them to form the lag features for the test set examples. Both models presented in this notebook do so; the first does not retrain daily, while the second, presented in this appendix, does.

Looking at the *RMSE*, *R2*, and time series plots for both the model in the main section of this notebook and the model in this appendix, one can see that they achieve comparable metrics. Both generalize well without overfitting to the training data set.