# Lab 09: Planning for Growth at Washington Cafe

Congratulations! You've just gotten your dream student work-study job as a data analyst for Washington Cafe. As you can guess from the name, Washington Cafe was started three years ago by a group of students at Washington University in St. Louis. At the time, they were ambitious sophomores. Because putting so much of their passion into Washington Cafe took time away from their studies, they barely graduated. They love the restaurant, and it's been successful enough that they want to grow it. And they learned enough from their studies to know that the best way to do this will involve gathering insights from the data they've been collecting for the past three years.

**That's where you fit in!**

## Chapter 1: The Small Diner

Your first order of business is to get some plans in place for managing the labor demand for the small diner that Washington Cafe currently is. You've got weekly labor data from the past two years of business. Each month, it seems like the management team is scrambling to figure out the shift schedule and, more often than not, the dining room is either short on staff or staff are sitting around doing nothing. Not the best experience for diners or for the staff.

Start with a simple Auto-Regressive (AR) model to forecast the "typical" labor needs. There really isn't much to go on at this point other than previous labor usage, so that's where we'll start.
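
To make the idea concrete, here is a minimal sketch of what a lag-1 autoregression does: it regresses each week on the week before, then iterates that relationship forward. The helper names (`fit_ar1`, `forecast_ar1`) and the noise-free numbers are made up for illustration and are not part of the lab data or of statsmodels.

```python
import numpy as np

def fit_ar1(y):
    """Estimate c and phi in y[t] = c + phi * y[t-1] + e[t] by least squares."""
    y = np.asarray(y, dtype=float)
    # Regress each observation on an intercept and the previous observation.
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    (c, phi), *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    return c, phi

def forecast_ar1(last_value, c, phi, steps):
    """Iterate the AR(1) recursion forward from the last observed value."""
    forecasts = []
    value = last_value
    for _ in range(steps):
        value = c + phi * value
        forecasts.append(value)
    return np.array(forecasts)

# Hypothetical, noise-free series that follows y[t] = 2 + 0.5 * y[t-1]
hours = [10.0]
for _ in range(9):
    hours.append(2 + 0.5 * hours[-1])

c, phi = fit_ar1(hours)
ar1_forecast = forecast_ar1(hours[-1], c, phi, steps=3)
print(c, phi, ar1_forecast)
```

With real, noisy data the estimates will not be exact, and multi-step forecasts drift toward the long-run mean `c / (1 - phi)` whenever `|phi| < 1`.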

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

# These ACF/PACF plotters from statsmodels are helpful
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# From statsmodels, we can load in useful time series analysis tools
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX

# And scikit-learn has useful tools for evaluating the usefulness of a model
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [None]:
# Since we have to read and setup data for each phase...
def read_data(file):
    df = pd.read_csv(file, parse_dates=['week_start_date'])
    # Index by the week start date and specify that the frequency is
    # "weekly, starting on Monday"
    df = df.set_index('week_start_date').asfreq('W-MON')
    df.drop(columns='restaurant', inplace=True)
    return df

labor1 = read_data('washington_cafe_stage1_diner_2018_2020.csv')
labor1.head()
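
As a small aside on `asfreq('W-MON')`: besides labeling the index as weekly, it also inserts a row of `NaN` for any Monday missing from the file, which is worth checking before fitting a model. The sketch below uses a made-up three-row frame, and the `labor_hours` column name is an assumption, not necessarily the column in the lab CSVs.

```python
import pandas as pd

# Hypothetical weekly data with one missing week (2018-01-15)
raw = pd.DataFrame({
    'week_start_date': pd.to_datetime(['2018-01-01', '2018-01-08', '2018-01-22']),
    'labor_hours': [410.0, 395.0, 430.0],  # assumed column name
})

# Same indexing steps as read_data: set the index, then fix the frequency
weekly = raw.set_index('week_start_date').asfreq('W-MON')
print(weekly)

# Weeks that were absent from the raw data show up as NaN rows
missing_weeks = weekly.index[weekly['labor_hours'].isna()]
print(missing_weeks)
```

If `missing_weeks` is non-empty on the real data, the gaps would need to be filled or otherwise handled before modeling.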

In [None]:
plt.rcParams["figure.figsize"] = (10, 3)

# Create a function to do all the time series plots
def ts_plots(df, title):
    # Visualize time series raw
    df.plot(title=title)
    plt.show()
    
    # Calculate appropriate number of lags (fewer than half the observations, capped at 52)
    max_lags = min(52, len(df) // 2 - 1)
    
    # ACF/PACF
    plot_acf(df, lags=max_lags)
    plt.show()
    
    plot_pacf(df, lags=max_lags, method='ywm')
    plt.show()

# Do timeline, acf, and pacf
ts_plots(labor1, 'Stage 1 - Diner')

**Main Plot** - Doesn't seem to be much pattern in the labor hours worked each week. That's consistent with what we've been told: There's limited planning going on and people just work whatever shifts they want. Lots of shortage and lots of excess relative to the actual demand.

**Auto Correlation Function** - The ACF tells us how closely correlated each data point is with the data point that lags it by 1, 2, 3, ... 52 time periods. In this case:
* The ACF drops off immediately
* There doesn't appear to be any seasonality (which would show up as strong bumps at certain lags)
* Mostly what matters seems to be whatever happened the week before

**Partial Auto Correlation Function** - The PACF tells us how much each data point is correlated with the lagging data points after taking all other lagging data points into consideration.
* The PACF drops off immediately.


So, we can conclude exactly what we expected: there aren't any other hidden patterns in the data. We can simply use an autoregressive model for our planning purposes.
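
For readers who want to see what the ACF plot is computing, the following is a rough numpy sketch of the sample autocorrelation, plus the common rule-of-thumb band of about ±1.96/√n used to judge whether a lag looks different from white noise. The function names are illustrative only, and the band drawn by `plot_acf` may be computed somewhat differently.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    deviations = x - x.mean()
    denom = np.sum(deviations ** 2)
    return np.array([
        np.sum(deviations[lag:] * deviations[:len(x) - lag]) / denom
        for lag in range(max_lag + 1)
    ])

def white_noise_band(n):
    """Approximate 95% band for the autocorrelations of white noise."""
    return 1.96 / np.sqrt(n)

# Tiny made-up series so the arithmetic is easy to check by hand
acf_values = sample_acf([1, 2, 3, 4, 5], max_lag=2)
band = white_noise_band(104)  # 104 weeks = two years of weekly data
print(acf_values, band)
```

Lags whose autocorrelation stays inside the band are hard to distinguish from noise, which is the sense in which the ACF "drops off immediately".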

In [None]:
# Let's reserve our last 10 weeks for the testing period and 
# the other 94 weeks before that for training
labor1_train = labor1.iloc[:-10]
labor1_test = labor1.iloc[-10:]

ar_model = AutoReg(labor1_train, lags=1).fit()
ar_model.summary()

In [None]:
pred1 = ar_model.predict(start=labor1_test.index[0], end=labor1_test.index[-1])

mae = mean_absolute_error(labor1_test, pred1)
rmse = math.sqrt(mean_squared_error(labor1_test, pred1))

print(f'Mean Absolute Error:     {mae}')
print(f'Root Mean Squared Error: {rmse}')

In [None]:
# Plot forecast vs actual
def plot_predict(df_train, df_test, df_pred, title):
    plt.plot(df_train.index, df_train.values, label='train')
    plt.plot(df_test.index, df_test.values, label='test')
    plt.plot(df_pred.index, df_pred.values, label='forecast')
    plt.title(title)
    plt.legend(); plt.show()

plot_predict(labor1_train, labor1_test, pred1, 'Stage 1 - AR Forecast')

The actual test data observations (orange) bounce all over the place, just like the training data observations. Our model doesn't know much better than to basically repeat whatever it saw last week. So, you tell the management team to just look at the average over the past year and plan on that being the go-forward plan... for now.
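
The "plan on the average" advice can itself be scored as a baseline forecast, which gives the AR model something to be compared against. Below is a minimal sketch with made-up numbers; `mean_forecast` and its `window` parameter are hypothetical helpers, not part of the lab code.

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error."""
    return np.mean(np.abs(np.asarray(actual) - np.asarray(predicted)))

def rmse(actual, predicted):
    """Root mean squared error."""
    return np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2))

def mean_forecast(train, steps, window=52):
    """Forecast every future week as the average of the last `window` weeks."""
    train = np.asarray(train, dtype=float)
    return np.full(steps, train[-window:].mean())

# Made-up weekly labor hours
train_hours = np.array([400.0, 420.0, 380.0, 400.0])
test_hours = np.array([410.0, 390.0])

baseline = mean_forecast(train_hours, steps=len(test_hours), window=4)
baseline_mae = mae(test_hours, baseline)
baseline_rmse = rmse(test_hours, baseline)
print(baseline, baseline_mae, baseline_rmse)
```

If a fitted model cannot beat this kind of baseline on the test weeks, the simpler plan is probably the better recommendation.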

## Chapter 2: Washington Cafe Is Becoming a Local Favorite

After another year, you decide to reevaluate the labor trends. You've had lots of other projects going on around favorite dishes, where patrons are coming from, and the cost of ingredients. But now it's time to look back at the labor trends again. You decide to look at the ACF and PACF again as a starting point.

In [None]:
labor2 = read_data('washington_cafe_stage2_local_favorite_2018_2022.csv')

# Keep only the most recent 104 weeks (2 years)
labor2 = labor2.iloc[-104:]
labor2.head()

In [None]:
ts_plots(labor2, 'Stage 2 - Local Favorite')

In [None]:
# Reserve the last 10 weeks for testing and train on the rest.
# labor2 was already trimmed to the most recent 2 years, where the pattern clearly changed.
labor2_train = labor2.iloc[:-10]
labor2_test = labor2.iloc[-10:]

# This time, we'll use a moving average model
# (0,0,1) means:
# p = auto-regressive - 0 means none
# d = differences - 0 means none
# q = moving average - 
#     1 means use the error from the 1 previous forecast to correct the next
ma_model = ARIMA(labor2_train, order=(0,0,1)).fit()

ma_model.summary()
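
To illustrate the "use the previous forecast error to correct the next forecast" idea, here is a hand-rolled MA(1) sketch. It assumes the mean `mu` and coefficient `theta` are already known (in the lab they are estimated by `ARIMA`), and the function name and numbers are made up for illustration.

```python
def ma1_forecast(y, mu, theta, steps):
    """Forecast an MA(1) process y[t] = mu + e[t] + theta * e[t-1]."""
    error_prev = 0.0  # assumption: no error before the first observation
    for value in y:
        # The one-step prediction is mu + theta * previous error;
        # the new error is whatever that prediction missed.
        error_prev = value - (mu + theta * error_prev)
    # Only the first forecast can use a known error; after that the
    # expected future errors are zero, so the forecast is just the mean.
    return [mu + theta * error_prev] + [mu] * (steps - 1)

ma_forecast = ma1_forecast([12.0, 9.0], mu=10.0, theta=0.5, steps=3)
print(ma_forecast)
```

This is why an MA(1) forecast over a 10-week test window looks like one adjusted value followed by a flat line at the mean.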

In [None]:
pred2 = ma_model.predict(start=labor2_test.index[0], end=labor2_test.index[-1])

mae = mean_absolute_error(labor2_test, pred2)
rmse = math.sqrt(mean_squared_error(labor2_test, pred2))

print(f'Mean Absolute Error:     {mae}')
print(f'Root Mean Squared Error: {rmse}')

In [None]:
plot_predict(labor2_train, labor2_test, pred2, 'Stage 2 - MA Forecast')

## Chapter 3: Business Is Booming!

Finally, the business starts to take off and there's a major influx of new business. The hard work, some marketing genius, and great food have paid off. Let's use the data to see if we can predict the growth rather than just react to it after a few stressful shifts.

In [None]:
labor3 = read_data('washington_cafe_stage3_boom_2023.csv')
labor3.head()

In [None]:
ts_plots(labor3, 'Stage 3 - Boom')

What's different?


There is a clear, strong **upward trend** in the data. Labor hours are consistently growing from around 1,300 hours per week at the start of 2023 to over 2,000 hours per week by the end of the year. This is completely different from Stages 1 and 2, which showed relatively flat patterns with random fluctuations.

**Auto Correlation Function (ACF)**: The ACF shows a **slow, gradual decay** rather than dropping off quickly like in Stages 1 and 2. The correlation stays high for many lags before slowly declining. This is a sign of **non-stationarity** caused by a trend in the data. The data points are highly correlated with their recent past because they are all on the same upward trajectory.

**Partial Auto Correlation Function (PACF)**: The PACF shows a **significant spike at lag 1**, with the other lags being much smaller and within the confidence bands. This suggests that an autoregressive component of order 1 (AR(1)) would be appropriate.

**Key Insight**: This data is **non-stationary** due to the strong trend. We need to use **differencing** to remove the trend and make the series stationary before we can model it effectively. This is why ARIMA models will perform much better than simple AR or MA models for this stage.

In short, the model requires differencing because of the trend.
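
A quick way to see what differencing does is to apply it to a perfectly linear trend: the week-over-week changes are constant, so the trend disappears, and a cumulative sum recovers the original levels. The numbers below are made up and only loosely echo the scale of the Stage 3 data.

```python
import numpy as np

weeks = np.arange(6)
trended = 1300.0 + 15.0 * weeks      # hypothetical series growing 15 hours per week

# First difference: y[t] - y[t-1]. The trend becomes a constant.
differenced = np.diff(trended)

# Undo the differencing: start from the first level and accumulate the changes.
restored = np.concatenate([[trended[0]], trended[0] + np.cumsum(differenced)])

print(differenced)
print(restored)
```

An ARIMA model with `d=1` works on the differenced series internally and then accumulates its forecasts back to the original scale in the same way.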


In [None]:
# Suppress the noisy fitting warnings (this doesn't change the results, it just keeps the output readable)
import warnings
from statsmodels.tools.sm_exceptions import ConvergenceWarning

warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings('ignore', category=ConvergenceWarning)

labor3_train = labor3[:-10]
labor3_test = labor3[-10:]




plt.rcParams['figure.figsize'] = (10,6)

params = {
    'White noise': (0,0,0),
    'Random walk': (0,1,0),
    'Twice-differenced white noise': (0,2,0),
    '1st-order regression': (1,0,0),
    '2nd-order regression': (2,0,0),
    'Twice-differenced 1st-order': (1,2,0),
    'Simple exponential smoothing': (0,1,1),
    '1st-order moving average': (0,0,1),
    '2nd-order moving average': (0,0,2),
    'ARMA': (1,0,1),
    'ARIMA': (1,1,1),
    'Damped-trend linear exponential smoothing': (1,1,2),
    'Linear exponential smoothing 1': (0,2,1),
    'Linear exponential smoothing 2': (0,2,2)
}

preds = {}

for label, param in params.items():
    try:
        model = ARIMA(labor3_train, order=param).fit()
        pred = model.predict(start=labor3_test.index[0], end=labor3_test.index[-1])
        preds[label] = pred
    except Exception as e:
        # Report any order that fails to fit and move on to the next one
        print(f"Could not fit {label}: {e}")

plt.plot(labor3_train.index, labor3_train.values, label='train')
plt.plot(labor3_test.index, labor3_test.values, label='test')

for label, pred in preds.items():
    plt.plot(pred.index, pred.values, linestyle='dashed', label=(label + ' ' + str(params[label])))

plt.title('Stage 3 — ARIMA Forecasts')
plt.legend()
plt.show()



In [None]:
# Fit ARIMA(1,1,1), a common starting point for trended data; the comparison
# table in a later cell checks this choice against the other orders
best_model = ARIMA(labor3_train, order=(1,1,1)).fit()
pred3 = best_model.predict(start=labor3_test.index[0], end=labor3_test.index[-1])

mae3 = mean_absolute_error(labor3_test, pred3)
rmse3 = math.sqrt(mean_squared_error(labor3_test, pred3))

print(f'Mean Absolute Error:     {mae3}')
print(f'Root Mean Squared Error: {rmse3}')

In [None]:
plot_predict(labor3_train, labor3_test, pred3, 'Stage 3 - ARIMA(1,1,1) Forecast')

In [None]:
# Create a comparison table of all models tested
results = []

for label, param in params.items():
    try:
        model = ARIMA(labor3_train, order=param).fit()
        pred = model.predict(start=labor3_test.index[0], end=labor3_test.index[-1])
        mae = mean_absolute_error(labor3_test, pred)
        rmse = math.sqrt(mean_squared_error(labor3_test, pred))
        aic = model.aic
        
        results.append({
            'Model': label,
            'Order (p,d,q)': str(param),
            'MAE': round(mae, 2),
            'RMSE': round(rmse, 2),
            'AIC': round(aic, 2)
        })
    except Exception as e:
        # Report models that fail to fit instead of silently skipping them
        print(f'Could not fit {label}: {e}')

# Create dataframe and sort by MAE
results_df = pd.DataFrame(results).sort_values('MAE')
print('Model Comparison for Stage 3 (sorted by MAE):')
print(results_df.to_string(index=False))

print(f'\nBest Model: {results_df.iloc[0]["Model"]} with MAE = {results_df.iloc[0]["MAE"]}')

## Chapter 4: Into Fine Dining

With that huge growth in business last year, Washington Cafe has decided to transform into a fine dining establishment. As a result, there's more seasonal fluctuation in business (e.g., parents' weekend and holidays). Let's take a look and see if we can build a seasonal model, too.

In [None]:
labor4 = read_data('washington_cafe_stage4_fine_dining_2024_2026.csv')
labor4.head()

In [None]:
ts_plots(labor4, 'Stage 4 - Fine Dining')

In [None]:
# The seasonality shows up in how high the ACF stays in the first chart
# So, we need to difference over the past 52 weeks

# Create seasonally differenced data
labor4_diff = labor4.diff(52).dropna()

print('Analyzing seasonal differencing (52-week period):')
print(f'Original data shape: {labor4.shape}')
print(f'Differenced data shape: {labor4_diff.shape}')

# Plot the differenced data
ts_plots(labor4_diff, 'Stage 4 - Seasonally Differenced (52 weeks)')
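
Seasonal differencing can be sketched the same way as ordinary differencing, except each value is compared with the value one full season earlier. The toy example below uses a 4-period season instead of 52 weeks so it is easy to check by hand; the helper name and numbers are made up.

```python
import numpy as np

def seasonal_difference(y, period):
    """Return y[t] - y[t - period] for every t that has a value one season back."""
    y = np.asarray(y, dtype=float)
    return y[period:] - y[:-period]

# Hypothetical series: level + gentle trend + a repeating 4-period pattern
season = np.array([0.0, 50.0, 0.0, -50.0])
t = np.arange(12)
series = 1000.0 + 2.0 * t + np.tile(season, 3)

seasonal_diff = seasonal_difference(series, period=4)
print(seasonal_diff)
```

The repeating pattern cancels out completely, and only the growth accumulated over one season (here 4 periods × 2 per period) remains. This is also why the differenced series is one full season shorter than the original.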


In [None]:

# SARIMAX(p,d,q)(P,D,Q, s) — starting point (1,1,1)(1,1,1,52)

#split into train and test
labor4_train = labor4.iloc[:-10]
labor4_test = labor4.iloc[-10:]

print(f'Training data: {len(labor4_train)} weeks')
print(f'Test data: {len(labor4_test)} weeks')

# Fit SARIMAX model with seasonal components
# (1,1,1) non-seasonal: AR(1), I(1), MA(1)
# (1,1,1,52) seasonal: AR(1), I(1), MA(1) with 52-week period
sarimax_model = SARIMAX(
    labor4_train,
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 52),
    enforce_stationarity=False,
    enforce_invertibility=False
).fit(disp=False)

print('\nSARIMAX Model Summary:')
print(sarimax_model.summary())


In [None]:
pred4 = sarimax_model.predict(start=labor4_test.index[0], end=labor4_test.index[-1])

mae4 = mean_absolute_error(labor4_test, pred4)
rmse4 = math.sqrt(mean_squared_error(labor4_test, pred4))

print(f'Mean Absolute Error:     {mae4}')
print(f'Root Mean Squared Error: {rmse4}')

In [None]:
plot_predict(labor4_train, labor4_test, pred4, 'Stage 4 - SARIMAX Forecast')
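
One way to judge whether the SARIMAX errors are good is to compare them with a seasonal-naive baseline, which simply repeats the values from one season earlier. The following sketch is an illustration with a made-up 4-period season rather than the lab's 52 weeks, and `seasonal_naive_forecast` is a hypothetical helper.

```python
import numpy as np

def seasonal_naive_forecast(train, steps, period):
    """Forecast by repeating the last full season of the training data."""
    train = np.asarray(train, dtype=float)
    last_season = train[-period:]
    # Repeat the last season enough times to cover `steps`, then trim.
    repeats = -(-steps // period)  # ceiling division
    return np.tile(last_season, repeats)[:steps]

# Two made-up seasons of length 4
train_values = [10, 20, 30, 40, 12, 22, 32, 42]
naive_forecast = seasonal_naive_forecast(train_values, steps=6, period=4)
print(naive_forecast)
```

For the Stage 4 data, the same idea would use `period=52` and `steps=10`, and its MAE and RMSE could be set beside the SARIMAX numbers.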

## In Summary: Our Beloved Washington Cafe

Over the course of this analysis, we've witnessed Washington Cafe's remarkable transformation through four distinct stages, each requiring increasingly sophisticated forecasting methods:

**Stage 1 - Small Diner (2018-2020)**: Random fluctuations with no clear pattern. A simple **AR(1) model** was sufficient since labor demand was essentially unpredictable, driven by ad-hoc scheduling rather than actual customer patterns.

**Stage 2 - Local Favorite (2020-2022)**: Slight patterns began to emerge, but volatility remained high. A **Moving Average MA(1) model** helped capture short-term corrections based on recent forecast errors.

**Stage 3 - Business Boom (2023)**: A clear upward trend emerged as the restaurant gained popularity. **ARIMA(1,1,1)** with first-order differencing successfully captured this growth trajectory. The differencing component (I) was critical for handling the non-stationary trend.

**Stage 4 - Fine Dining (2024-2026)**: The transformation into a fine dining establishment introduced **seasonal patterns** tied to holidays, parents' weekends, and other predictable events. **SARIMAX(1,1,1)(1,1,1,52)** captures both the underlying trend and the 52-week seasonal cycle.


### The SARIMAX model's predictions show that:

1. **Seasonality is real**: The model successfully identifies and forecasts the recurring patterns in labor demand. High-demand periods (likely holidays and special events) can now be anticipated rather than reacted to.

2. **Planning is now possible**: Unlike the early diner days, Washington Cafe can now forecast labor needs weeks in advance with reasonable accuracy. The MAE and RMSE metrics indicate how close our predictions are to reality.

3. **Staffing optimization**: Management can now:
   - Hire additional staff ahead of busy seasons
   - Schedule training during predictable slow periods
   - Budget more accurately for labor costs
   - Improve employee satisfaction through better scheduling

### The Bigger Picture

This analysis demonstrates a fundamental principle: **as a business matures, its patterns become more predictable**. What started as chaotic, reactive scheduling evolved into data-driven workforce planning. The progression from AR → MA → ARIMA → SARIMAX mirrors the restaurant's journey from a scrappy startup to an established fine dining destination.

The key lesson for Washington Cafe's management: **invest in data collection and analysis early**. The patterns were always there in Stages 3 and 4, but without proper time series analysis, they would have remained hidden, leading to continued overstaffing or understaffing issues.

### Next Steps

- Continue monitoring model performance and refine parameters as more data becomes available
- Consider adding exogenous variables (weather, local events, marketing campaigns) to further improve predictions
- Extend this methodology to other aspects of the business (inventory, revenue, customer traffic)
- Use these forecasts for strategic planning, such as expansion decisions or menu changes

Now that Washington Cafe is running smoothly after some nice growth, we recommend our WashU founders enroll in the WashU McKelvey CAPS MDAA program so they can better operate their business.  Either that, or they should ask Paul, Greg, Luke or Nick to periodically conduct these analyses for them in exchange for an occasional free dining experience!

In [None]:
# Forecast an additional 52 weeks (1 full year) beyond the end of our data.
# sarimax_model was fit on labor4_train only, so its forecast starts at the
# first test week. We forecast through the test weeks plus 52 more and keep
# only the final 52, so the values line up with the weeks after our data ends.

# Get the last date in our dataset and forecast 52 weeks forward
last_date = labor4.index[-1]
forecast_periods = 52
test_periods = len(labor4_test)

# Generate the forecast and drop the steps that overlap the test period
forecast_extended = sarimax_model.forecast(
    steps=test_periods + forecast_periods
).iloc[test_periods:]

# Create a date range for the extended forecast (starting the week after our last observation)
forecast_index = pd.date_range(
    start=last_date + pd.Timedelta(weeks=1),
    periods=forecast_periods,
    freq='W-MON'
)
forecast_extended.index = forecast_index

#plot the extended forecast
plt.figure(figsize=(12, 5))
plt.plot(labor4.index, labor4.values, label='Historical Data', linewidth=2)
plt.plot(forecast_extended.index, forecast_extended.values, 
         label='52-Week Forecast', linestyle='--', linewidth=2, color='red')
plt.axvline(x=last_date, color='gray', linestyle=':', linewidth=1, label='Forecast Start')
plt.title('Stage 4 - Extended 52-Week Forecast Beyond Available Data')
plt.xlabel('Date')
plt.ylabel('Labor Hours')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

#display forecast statistics
print(f'Forecast Summary for next {forecast_periods} weeks:')
print(f'Starting from: {forecast_index[0].strftime("%Y-%m-%d")}')
print(f'Ending at: {forecast_index[-1].strftime("%Y-%m-%d")}')
print(f'Mean forecasted labor hours: {forecast_extended.mean():.2f}')
print(f'Min forecasted labor hours: {forecast_extended.min():.2f}')
print(f'Max forecasted labor hours: {forecast_extended.max():.2f}')
print(f'Standard deviation: {forecast_extended.std():.2f}')

In [None]:
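# Get the forecast with confidence intervals.
# The model was fit on labor4_train, so forecast through the test weeks as
# well and keep only the steps that fall after the end of our data.
test_periods = len(labor4_test)
forecast_result = sarimax_model.get_forecast(steps=test_periods + forecast_periods)
forecast_mean = forecast_result.predicted_mean.iloc[test_periods:]
forecast_ci = forecast_result.conf_int().iloc[test_periods:]

# Set up the index
forecast_mean.index = forecast_index
forecast_ci.index = forecast_index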

#plot with confidence intervals
plt.figure(figsize=(12, 5))
plt.plot(labor4.index, labor4.values, label='Historical Data', linewidth=2)
plt.plot(forecast_mean.index, forecast_mean.values, 
         label='52-Week Forecast', linestyle='--', linewidth=2, color='red')
plt.fill_between(forecast_ci.index, 
                 forecast_ci.iloc[:, 0], 
                 forecast_ci.iloc[:, 1], 
                 color='red', alpha=0.2, label='95% Confidence Interval')
plt.axvline(x=last_date, color='gray', linestyle=':', linewidth=1, label='Forecast Start')
plt.title('Stage 4 - Extended Forecast with Confidence Intervals')
plt.xlabel('Date')
plt.ylabel('Labor Hours')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### Extended Forecast Analysis

The 52-week forecast shows what Washington Cafe can expect for labor demand over the next full year:

**Key Observations:**

1. **Seasonal Pattern Continues**: The SARIMAX model projects that the seasonal cycle will continue, with peaks and valleys occurring at the same intervals as in the historical data (every 52 weeks).

2. **Forecast Uncertainty**: The confidence interval (shaded area) widens as we move further into the future: the further out we forecast, the less certain we are.

3. **Long-term Planning**: Despite the uncertainty, this forecast provides valuable insights for:
   - Annual budgeting for labor costs
   - Identifying high-demand periods throughout the year for staff hiring (aligned to expected cycles)
   - Planning major events or promotions during forecasted slow periods
   - Setting realistic growth targets
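
The widening interval in observation 2 can be illustrated with the simplest differenced model, a random walk, where the forecast variance grows in proportion to the horizon. This is only a sketch of the general effect: the SARIMAX intervals in the plot come from the fitted model itself, and the `sigma` value below is made up.

```python
import numpy as np

def random_walk_interval_half_width(sigma, horizon, z=1.96):
    """Half-width of the approximate 95% interval at horizons 1..horizon.

    For a random walk, the h-step forecast error is a sum of h independent
    shocks, so its standard deviation is sigma * sqrt(h).
    """
    horizons = np.arange(1, horizon + 1)
    return z * sigma * np.sqrt(horizons)

# Hypothetical weekly shock size of 50 labor hours, forecast one year ahead
half_widths = random_walk_interval_half_width(sigma=50.0, horizon=52)
print(half_widths[0], half_widths[3], half_widths[-1])
```

In this simplified setting the interval four weeks out is twice as wide as the interval one week out, and it keeps growing through week 52.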

**Caution**: This forecast assumes the business will continue operating under similar conditions. Major changes (new menu, different hours, economic shifts, competitors) could invalidate these predictions. The model should be retrained regularly as new data becomes available.