## Time Series Analysis of US Air Quality by State and County

### Part IV: Forecasting 7 days using an ARIMA time series model

Author: Gem Ruby <br>
Date: April 2023

Note: This notebook covers only the modeling and prediction of a 7-day period.

In [None]:
## Import libraries
import pandas as pd
import os
import numpy as np
import requests
import warnings
warnings.filterwarnings("ignore")

# plotting
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns

# stats
from statsmodels.api import tsa # time series analysis
import statsmodels.api as sm

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#change directory
os.chdir('/content/drive/MyDrive/2022 - BrainStation/AirQuality_Capstone')

In [None]:
#read in the dataframe
df = pd.read_csv('/content/drive/MyDrive/2022 - BrainStation/AirQuality_Capstone/Data/county 2015-2022.csv')

In [None]:
#check df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2271501 entries, 0 to 2271500
Data columns (total 10 columns):
 #   Column                     Dtype 
---  ------                     ----- 
 0   State Name                 object
 1   county Name                object
 2   State Code                 int64 
 3   County Code                int64 
 4   Date                       object
 5   AQI                        int64 
 6   Category                   object
 7   Defining Parameter         object
 8   Defining Site              object
 9   Number of Sites Reporting  int64 
dtypes: int64(4), object(6)
memory usage: 173.3+ MB


In [None]:
#Change Date
df['Date'] = pd.to_datetime(df['Date'])

In [None]:
!pip3 install arch yfinance pmdarima --quiet

In [None]:
from pmdarima.arima import auto_arima

In [None]:
# Collect each county's forecast in a list, then concatenate once at the end
# (DataFrame.append was removed in pandas 2.0)
forecast_frames = []

# Loop over each unique state and county combination
for (state, county), group in df.groupby(['State Name', 'county Name']):

    # Sort by date and linearly interpolate any missing AQI values
    group = group.sort_values('Date')
    aqi = group['AQI'].interpolate(method='linear')

    # Fit a non-seasonal ARIMA model, letting auto_arima choose the order
    model = auto_arima(aqi, seasonal=False, error_action='ignore', suppress_warnings=True)

    # Generate the date range for the 7 days after the county's last observation
    last_date = group['Date'].max()
    date_range = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=7, freq='D')

    # Make 7-day forecast
    forecasted = model.predict(n_periods=7)

    # Create a dataframe with the forecasted AQI values
    # (np.asarray drops any index on the forecast so it aligns with date_range)
    forecast_frames.append(
        pd.DataFrame({'Date': date_range, 'State': state, 'County': county, 'AQI': np.asarray(forecasted)})
    )

# Combine all county forecasts into one dataframe
forecast_df = pd.concat(forecast_frames, ignore_index=True)
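
Note on the interpolation step: linear interpolation only fills values that are present as NaN. Days that are absent from a county's records altogether are not created by it. The sketch below is one possible way to expose such gaps before interpolating, assuming a daily calendar is wanted. The `toy` dataframe and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy series: one county with a two-day gap (Jan 3 and Jan 4 are absent)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-05"]),
    "AQI": [30, 40, 70],
})

# Reindex onto a full daily calendar so the absent days appear as NaN rows
daily = toy.set_index("Date")["AQI"].asfreq("D")

# Linear interpolation now fills the two previously absent days
filled = daily.interpolate(method="linear")
print(filled)
```

Whether to fill gaps this way is a modeling choice. For counties with long reporting gaps, interpolated values may dominate the series and are probably better excluded than filled.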

In [None]:
#check the forecast df
forecast_df.sample(10)

Unnamed: 0,Date,State,County,AQI
1237,2022-01-06,Florida,Manatee,32.40571
5557,2022-01-07,Pennsylvania,Delaware,46.671413
3555,2022-01-07,Missouri,Jefferson,41.036449
6977,2022-01-06,Washington,Clallam,37.196185
2813,2022-01-07,Maine,Oxford,25.407629
1465,2022-01-03,Georgia,Floyd,29.340293
1984,2022-01-04,Indiana,Floyd,24.0755
1401,2022-01-01,Georgia,Clayton,35.871209
5956,2019-11-12,South Carolina,Pickens,41.340327
6962,2022-01-05,Washington,Benton,23.213135


In [None]:
# Set the file path and name
file_path = '/content/drive/MyDrive/2022 - BrainStation/AirQuality_Capstone/Data/Forecasted_AQI.csv'

# Save dataframe to CSV file
forecast_df.to_csv(file_path, index=False)

The completed forecast has now been saved to the drive so that the forecasted values are preserved. As the sample above shows, some counties have no recorded AQI data for the later part of the 2015-2022 period (for example, the forecast for Pickens, South Carolina is dated 2019-11-12). To avoid forecasting from stale data, we will remove all AQI forecasts dated before October 2021.
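
A minimal sketch of that filtering step, assuming the cutoff is the first day of October 2021. Here `sample_forecast` is a hypothetical stand-in for `forecast_df`, built from three rows of the sample output above, and the names `cutoff` and `recent_forecast` are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for forecast_df, using rows from the sample output above
sample_forecast = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-06", "2019-11-12", "2022-01-05"]),
    "State": ["Florida", "South Carolina", "Washington"],
    "County": ["Manatee", "Pickens", "Benton"],
    "AQI": [32.40571, 41.340327, 23.213135],
})

# Assumed cutoff: forecasts dated before 2021-10-01 are treated as stale
cutoff = pd.Timestamp("2021-10-01")

# Keep only forecasts on or after the cutoff
recent_forecast = sample_forecast[sample_forecast["Date"] >= cutoff].reset_index(drop=True)
print(recent_forecast)
```

Applied to the full dataframe, the same boolean mask would drop every county whose last observation falls well before the end of the dataset.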