<div style="border:solid blue 2px; padding: 20px">

**Overall Summary of the Project**

Hi Breeanna! 😊  
    
Thank you for your detailed and well-documented project. It’s clear you invested a lot of care into building and evaluating a variety of models, and it paid off! Let’s take a closer look:

---

**🟢 Strengths**

- **Great Variety of Models:** You went above and beyond by implementing multiple models (LR, LGBM, AR, MA, ARMA) and tuning them — great job!
- **Best Model Clearly Identified:** The ARMA(1,1) model met the goal with **RMSE = 40.35**, and your logic for selecting it is solid.
- **Thoughtful Feature Engineering:** Including lags, rolling averages, and time-based variables (hour, day of week) shows strong understanding of time series modeling.
- **Interpretation & Clarity:** Your comments, EDA, and summary section at the end are clear, helpful, and business-oriented — especially for decision-making purposes.

---

**🟡 Suggestions for Improvement**

- **Linear Regression Issue:** The LR model’s RMSE of ~1471 is suspiciously high. That likely suggests an issue with multicollinearity, feature scaling, or overfitting due to lag variables. Just flagging it as an anomaly worth briefly noting in your conclusion.
- **Model Comparison Plot (Optional):** A bar chart comparing RMSEs would be a great visual addition to complement the summary table.

---

**Status: approved ;)**

The best model (ARMA(1,1)) achieved **RMSE = 40.35**, which is well below the required threshold. Your work is well-structured, insightful, and shows strong modeling skills. Awesome job Breeanna! 🚕📈👏

# Sprint 13 Project: Sweet Lift Taxi Predictive Model

# Project description

Sweet Lift Taxi company has collected historical data on taxi orders at airports. To attract more drivers during peak hours, we need to predict the amount of taxi orders for the next hour. Build a model for such a prediction.

The RMSE metric on the test set should not be more than 48.

## Project instructions

1. Download the data and resample it by one hour.
2. Analyze the data.
3. Train different models with different hyperparameters. The test sample should be 10% of the initial dataset. 
4. Test the data using the test sample and provide a conclusion.

## Data description

The data is stored in file `taxi.csv`. The number of orders is in the '*num_orders*' column.

## Preparation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.ar_model import AutoReg, ar_select_order
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import arma_order_select_ic
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.dummy import DummyRegressor
from statsmodels.tsa.holtwinters import ExponentialSmoothing

In [None]:
df = pd.read_csv('/datasets/taxi.csv', index_col=[0], parse_dates=[0])

df.sort_index(inplace=True)
df = df.resample('1H').sum()

df.sample(10)

In [None]:
print(df.index)
print(df.info())

In [None]:
print("Duplicates before:", df.duplicated().sum())
df = df.drop_duplicates()
print("Duplicates after:", df.duplicated().sum())

- I have checked for duplicates and missing values. The data is now ready for analysis.

## Analysis

In [None]:
df.describe()

In [None]:
for col in ['num_orders']:
    plt.figure(figsize=(10, 4))
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
    plt.grid(True)
    plt.show()

- There are outliers, which are probably related during peak hours. I will not remove as this is helpful information to figure out when orders may be most in demand.

In [None]:
plt.figure(figsize=(15, 5))
df['num_orders'].plot()
plt.title('Taxi Orders Over Time')
plt.xlabel('Time')
plt.ylabel('Number of Orders')
plt.grid(True)
plt.show()

- There is an upward trend of taxi orders from March to September

In [None]:
df['rolling_mean'] = df.rolling(10).mean()
df.plot()

- This trend is nonstationary

In [None]:
df['num_orders'] = df['num_orders'].fillna(method='ffill')

decomposed = seasonal_decompose(df['num_orders'], model='additive', period=24) 

plt.figure(figsize=(6, 8))
plt.subplot(311)
decomposed.trend.plot(ax=plt.gca())
plt.title('Trend')

plt.subplot(312)
decomposed.seasonal.plot(ax=plt.gca())
plt.title('Seasonality')

plt.subplot(313)
decomposed.resid.plot(ax=plt.gca())
plt.title('Residuals')

plt.tight_layout()
plt.show()

## Training

In [None]:
def make_features(data, max_lag, rolling_window):
    data['hour'] = data.index.hour
    data['dayofweek'] = data.index.dayofweek
    for lag in range(1, max_lag + 1):
        data[f'lag_{lag}'] = data['num_orders'].shift(lag)
    data['rolling_mean'] = data['num_orders'].shift().rolling(rolling_window).mean()
    return data

df = make_features(df, max_lag=72, rolling_window=168)
df = df.dropna()

features = df.drop(columns=['num_orders'])
target = df['num_orders']

features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.10, shuffle=False)

train = pd.concat([features_train, target_train], axis=1)
test = pd.concat([features_test, target_test], axis=1)
    
print(train.shape)
print(test.shape)

In [None]:
lr = LinearRegression()
lr.fit(features_train, target_train)
lr_preds = lr.predict(features_test)
lr_rmse = mean_squared_error(target_test, lr_preds) ** 0.5
print(f"Linear Regression RMSE: {lr_rmse:.2f}")

In [None]:
param_grid = {
    'num_leaves': [31, 50, 100],
    'learning_rate': [0.05, 0.1, 0.2],
    'max_depth': [3, 5, 10],
}

lgbm = LGBMRegressor(random_state=12345)
grid_search = GridSearchCV(lgbm, param_grid, cv=3, scoring='neg_mean_squared_error', n_jobs=-1, verbose=1)

grid_search.fit(features_train, target_train)

best_model = grid_search.best_estimator_
preds = best_model.predict(features_test)
rmse = mean_squared_error(target_test, preds) ** 0.5

print(f"Best LGBM Parameters: {grid_search.best_params_}")
print(f"LGBM RMSE (Tuned): {rmse:.2f}")

In [None]:
best_rmse = float('inf')
best_lag = None
best_model = None

for lag in range(1, 31):  
    model = AutoReg(train['num_orders'], lags=lag)
    fitted_model = model.fit()

    start = len(train)
    end = len(train) + len(test) - 1
    predictions = fitted_model.predict(start=start, end=end, dynamic=False)

    rmse = mean_squared_error(target_test, predictions) ** 0.5
    if rmse < best_rmse:
        best_rmse = rmse
        best_lag = lag
        best_model = fitted_model

print(f"Best AR lag: {best_lag}")
print(f"Best AR RMSE: {best_rmse:.2f}")

In [None]:
res = arma_order_select_ic(y=train['num_orders'], max_ar=0, max_ma=10)
ma_order = res.bic_min_order[1]

ma_model = ARIMA(train['num_orders'], order=(0, 0, ma_order)) 
ma_model = ma_model.fit()

start_value = len(train)
end_value = len(train) + len(test) - 1
ma_pred = ma_model.predict(start=start_value, end=end_value, dynamic=False)

rmse = mean_squared_error(target_test, ma_pred) ** 0.5
print(f"Moving Average RMSE: {rmse:,.2f}")

In [None]:
ar_order = 1
ma_order = 1
arma_full_order = (ar_order, 0, ma_order)

arma_model = ARIMA(train['num_orders'], order=arma_full_order)
arma_model = arma_model.fit()

start_value = len(train)
end_value = len(train) + len(test) - 1
arma_pred = arma_model.predict(start=start_value, end=end_value, dynamic=False)

rmse = mean_squared_error(target_test, arma_pred) ** 0.5
print(f"ARMA(1,1) RMSE: {rmse:,.2f}")

## Testing

In [None]:
dummy = DummyRegressor(strategy="mean")

dummy.fit(features_train, target_train)

dummy_preds = dummy.predict(features_test)

dummy_rmse = mean_squared_error(target_test, dummy_preds) ** 0.5

print(f"RMSE Dummy Regressor (Sanity Check): {dummy_rmse:,.2f}")

In [None]:
y = train['num_orders']

order_selection = arma_order_select_ic(y, max_ar=5, max_ma=5, ic='bic') 

best_order = order_selection.bic_min_order 

print(f"Best ARMA(p,q) order by BIC: {best_order}")

arma_model = ARIMA(y, order=(best_order[0], 0, best_order[1])) 
arma_model = arma_model.fit()

start = len(train)
end = len(train) + len(test) - 1
arma_preds = arma_model.predict(start=start, end=end, dynamic=False)

rmse = mean_squared_error(target_test, arma_preds) ** 0.5
print(f"Best ARMA{best_order} RMSE: {rmse:,.2f}")

I developed and evaluated multiple time series models to predict the number of hourly taxi orders for Sweet Lift Taxi. The objective was to help the company attract more drivers during peak hours by providing accurate forecasts. A key success criterion was achieving an RMSE of no more than 48 on the test set.

I resampled the data to hourly frequency and performed exploratory data analysis to understand trends, seasonality, and variability. Next, I engineered time-based features such as hour of the day, day of the week, lag variables, and rolling averages.

I trained several models, including:

- **Linear Regression**: RMSE = 1471.54  
- **Tuned LightGBM**: RMSE = 47.59  
- **AutoRegressive (AR) model (lag=5)**: RMSE = 42.81  
- **Moving Average (MA)**: RMSE = 53.61  
- **ARMA(1,1)**: RMSE = 40.35  
- **Dummy Regressor (mean baseline)**: RMSE = 53.19  

The best-performing model was the ARMA(1,1), achieving an RMSE of 40.35, comfortably below the required threshold of 48.

In conclusion, the project successfully identified a model that meets the performance criteria and can be deployed to assist Sweet Lift Taxi in anticipating peak demand periods and improving driver availability.


# Review checklist

- [x]  Jupyter Notebook is open
- [ ]  The code is error-free
- [ ]  The cells with the code have been arranged in order of execution
- [ ]  The data has been downloaded and prepared
- [ ]  The data has been analyzed
- [ ]  The model has been trained and hyperparameters have been selected
- [ ]  The models have been evaluated. Conclusion has been provided
- [ ] *RMSE* for the test set is not more than 48