# Time Series Forecasting: Analyzing JetTrain Traffic Data

This notebook applies time series forecasting to the JetTrain Traffic dataset. We look for trends and seasonal patterns in the traffic counts, then fit a model that predicts future traffic flows.

Time series forecasting is used in many settings, from financial markets to traffic management. For the JetTrain data the practical payoff is direct: a reliable forecast helps planners anticipate demand and manage urban traffic.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

## Data Loading and Preprocessing

The first step in our analysis is to load and preprocess the data. We'll use pandas to read the dataset and perform initial preprocessing, including converting date and time information to datetime objects and setting them as the index. This preprocessing lays the groundwork for our time series analysis.

Proper formatting of datetime data is essential for time series analysis, as it allows us to utilize various time-based functionalities provided by pandas.
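
As a small illustration of those time-based features, the sketch below builds a synthetic hourly series (the values are made up, not taken from the dataset) and shows two things a `DatetimeIndex` makes possible: partial-string slicing and resampling. The name `demo` is a placeholder used only here.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data for illustration only; not the real traffic counts
idx = pd.date_range('2014-01-01', periods=72, freq='h')
demo = pd.DataFrame({'Count': np.arange(72)}, index=idx)

# Partial-string indexing: select every row that falls on 2 January
jan_2 = demo.loc['2014-01-02']

# Resampling: aggregate the hourly rows into daily totals
daily_totals = demo['Count'].resample('D').sum()

print(len(jan_2))           # number of hourly rows on that day
print(daily_totals.head())
```

Neither operation works on a column of plain strings, which is why the conversion step comes first.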


In [None]:
df = pd.read_csv('Train_SU63ISt.csv')
df

In [None]:
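# Timestamps are assumed to be day-first (e.g. 26-09-2014 00:00), as in the test file
df['Datetime'] = pd.to_datetime(df['Datetime'], dayfirst=True)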

In [None]:
df.set_index('Datetime', inplace=True)
df.drop('ID', axis=1, inplace=True)


## Exploratory Data Analysis (EDA)

With the data loaded and preprocessed, we explore it to understand its patterns and distribution. We start with a plain time series plot, then decompose the series into trend, seasonal, and residual components, and finally look at summary statistics, a histogram, and a boxplot.

Visualizations play a key role in identifying patterns, outliers, and anomalies in time series data, which are critical for accurate forecasting.


In [None]:
df['Count'].plot()
plt.show()

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

# The data is hourly, so period=24 makes the daily cycle explicit
result = seasonal_decompose(df['Count'], model='additive', period=24)
result.plot()
plt.show()
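
To make the additive model (observed = trend + seasonal + residual) concrete, the sketch below decomposes a synthetic hourly series by hand using only pandas. It is a simplified version of the classical moving-average approach and is not claimed to match the `statsmodels` implementation exactly; the series, the period of 24, and all variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series (illustrative): linear trend + daily cycle + noise
rng = np.random.default_rng(0)
idx = pd.date_range('2014-01-01', periods=24 * 14, freq='h')
t = np.arange(len(idx))
series = pd.Series(0.05 * t + 10 * np.sin(2 * np.pi * t / 24)
                   + rng.normal(0, 1, len(idx)), index=idx)

period = 24  # assumed seasonal period: one day of hourly observations

# Trend: moving average over one full period. Because the period is even,
# a second 2-point average is applied, and the result is shifted back so
# that each value is centered on its window.
trend = series.rolling(period).mean().rolling(2).mean().shift(-(period // 2))

# Seasonal: mean detrended value for each hour of the day, centered on zero
detrended = series - trend
seasonal_means = detrended.groupby(detrended.index.hour).mean()
seasonal_means -= seasonal_means.mean()
seasonal = pd.Series(seasonal_means.loc[series.index.hour].to_numpy(),
                     index=series.index)

# Residual: whatever the trend and seasonal components do not explain
resid = series - trend - seasonal

print(pd.DataFrame({'trend': trend, 'seasonal': seasonal, 'resid': resid}).dropna().head())
```

The first and last half-period of the trend are missing because the centered window does not fit there; the decomposition plot above shows the same gaps at the edges.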

In [None]:
df['Count'].describe()

In [None]:
df['Count'].plot(kind='hist', bins=30)
plt.title('Histogram of Traffic Counts')
plt.xlabel('Count')
plt.show()

df['Count'].plot(kind='box')
plt.title('Boxplot of Traffic Counts')
plt.show()
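
The points a boxplot draws beyond its whiskers follow the 1.5 × IQR rule by default. The sketch below applies the same rule directly so the flagged values can be listed and inspected; the numbers are invented for illustration, and in the notebook `counts` would be `df['Count']`.

```python
import pandas as pd

# Illustrative values only; in the notebook this would be df['Count']
counts = pd.Series([8, 10, 12, 11, 9, 10, 13, 95])

# Interquartile range and the whisker limits used by a default boxplot
q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the limits are the candidates for closer inspection
outliers = counts[(counts < lower) | (counts > upper)]
print(outliers)
```

For traffic data with a strong upward trend, values flagged this way may simply be recent observations rather than errors, so they are worth checking against the time series plot before removing anything.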

## Feature Engineering

For effective time series forecasting, it's often beneficial to introduce additional features that capture periodic patterns like hours of the day, days of the week, or months of the year. We'll create these features and examine how they correlate with traffic counts.

Features like these tell a forecasting model where each observation falls in the daily, weekly, and yearly cycles, which often improves its accuracy.


In [None]:
# Convert the index to datetime if it's not already
df.index = pd.to_datetime(df.index)

df['day_of_week'] = df.index.dayofweek  # Monday=0, Sunday=6
df['hour'] = df.index.hour
df['month'] = df.index.month
df['day_of_month'] = df.index.day
df

In [None]:
df.columns

In [None]:
import seaborn as sns

# Assuming 'day_of_week' and 'hour' features are created
correlation_matrix = df[['Count', 'day_of_week', 'hour', 'day_of_month', 'month']].corr()
sns.heatmap(correlation_matrix, annot=True)
plt.show()
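
One caveat when reading the heatmap: Pearson correlation measures linear association, so a strong daily cycle can still show a weak correlation with the raw `hour` value. A common option, which is not part of the original notebook, is to encode the hour as sine and cosine components. The sketch below uses synthetic data and the hypothetical column names `hour_sin` and `hour_cos`.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data (illustrative): counts peak around midday every day
idx = pd.date_range('2014-01-01', periods=24 * 30, freq='h')
hour = idx.hour.to_numpy()
demo = pd.DataFrame({'hour': hour,
                     'Count': 100 - 50 * np.cos(2 * np.pi * hour / 24)},
                    index=idx)

# Encode the hour as a point on a circle so that 23:00 and 00:00 are neighbours
demo['hour_sin'] = np.sin(2 * np.pi * demo['hour'] / 24)
demo['hour_cos'] = np.cos(2 * np.pi * demo['hour'] / 24)

# Correlation of each feature with the count
corr_with_count = demo[['Count', 'hour', 'hour_sin', 'hour_cos']].corr()['Count']
print(corr_with_count)
```

In this constructed example the raw hour correlates only weakly with the count even though the count is fully determined by the time of day, while the cosine component captures the relationship.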

In [None]:
# Group the data by 'month' and average the 'Count' for each month
monthly_counts = df.groupby('month')['Count'].mean()

# Plot the mean 'Count' for each month as a line or bar chart
monthly_counts.plot(kind='bar')  # You can change 'bar' to 'line' if you prefer a line plot

plt.title('Mean Count by Month')
plt.xlabel('Month')
plt.ylabel('Mean Count')
plt.xticks(ticks=range(len(monthly_counts)), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], rotation=45)
plt.show()


In [None]:
# Group the data by 'month' and sum the 'Count' for each month
monthly_counts = df.groupby('month')['Count'].sum()

# Plot the total 'Count' for each month as a line or bar chart
monthly_counts.plot(kind='bar')  # You can change 'bar' to 'line' if you prefer a line plot

plt.title('Total Count by Month')
plt.xlabel('Month')
plt.ylabel('Total Count')
plt.xticks(ticks=range(len(monthly_counts)), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], rotation=45)
plt.show()

In [None]:
# Group the data by 'day_of_week' and average the 'Count' for each day
dow_counts = df.groupby('day_of_week')['Count'].mean()

# Plot the mean 'Count' for each day of the week as a line or bar chart
dow_counts.plot(kind='bar')  # You can change 'bar' to 'line' if you prefer a line plot

plt.title('Mean Count by Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Mean Count')
plt.xticks(ticks=range(len(dow_counts)), labels=['Mon','Tue','Wed','Thu','Fri','Sat','Sun'], rotation=45)
plt.show()

In [None]:
# 'hour' and 'month' already exist from the feature cell above; only 'year' is new
df['year'] = df.index.year
df.columns

In [None]:
# First, resample and sum the 'Count' data to get a Series
daily_counts = df['Count'].resample('D').sum()

# Then, create the seasonal lag by shifting the Series
count_seasonal_lag = daily_counts.shift(365)

# If you want to combine them into a new DataFrame:
daily_df = pd.DataFrame({'Count': daily_counts, 'Count_seasonal_lag': count_seasonal_lag})

daily_df
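
A lag column always starts with missing values, because the earliest rows have no earlier observation to copy from. The toy example below uses a 7-day lag in place of the 365-day lag so the effect is visible in a short series; the data and the column name `Count_lag_7` are assumptions for illustration.

```python
import pandas as pd

# Toy daily series (illustrative values), two weeks long
idx = pd.date_range('2014-01-01', periods=14, freq='D')
daily = pd.Series(range(14), index=idx, name='Count')

# A 7-day lag stands in for the 365-day lag used above
lagged = pd.DataFrame({'Count': daily, 'Count_lag_7': daily.shift(7)})

# The first 7 rows have no earlier observation to copy, so they are NaN
n_missing = lagged['Count_lag_7'].isna().sum()
print(n_missing)

# Rows with a missing lag are usually dropped before fitting a model on them
usable = lagged.dropna()
print(usable.head())
```

With `shift(365)` the same thing happens on a larger scale: the first 365 days of `daily_df` carry no lag value, which matters if the lag column is later used as a model input.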

In [None]:
from pmdarima import auto_arima

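# m=7 models weekly seasonality in the daily series; a seasonal search with
# m=365 is impractically slow and memory-hungry
model = auto_arima(daily_df['Count'], seasonal=True, m=7, trace=True,
                   error_action='ignore', suppress_warnings=True)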

In [9]:
future_df = pd.read_csv('Test_0qrQsBZ.csv')
future_df

         ID          Datetime
0     18288  26-09-2014 00:00
1     18289  26-09-2014 01:00
2     18290  26-09-2014 02:00
3     18291  26-09-2014 03:00
4     18292  26-09-2014 04:00
...     ...               ...
5107  23395  26-04-2015 19:00
5108  23396  26-04-2015 20:00
5109  23397  26-04-2015 21:00
5110  23398  26-04-2015 22:00


In [10]:
future_df.drop('ID', axis=1, inplace=True)

In [12]:
future_df['Datetime']

0       26-09-2014 00:00
1       26-09-2014 01:00
2       26-09-2014 02:00
3       26-09-2014 03:00
4       26-09-2014 04:00
              ...       
5107    26-04-2015 19:00
5108    26-04-2015 20:00
5109    26-04-2015 21:00
5110    26-04-2015 22:00
5111    26-04-2015 23:00
Name: Datetime, Length: 5112, dtype: object
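
The `dtype: object` in this output means the timestamps are still plain strings. They are also written day-first, so it is safer to state the format explicitly when converting them. The sketch below shows the conversion on two sample strings in the same layout; the second value is made up to show a date whose day and month could be confused.

```python
import pandas as pd

# Two timestamps in the day-first layout shown in the output above
raw = pd.Series(['26-09-2014 00:00', '05-10-2014 13:00'])

# An explicit format removes any day/month ambiguity
parsed = pd.to_datetime(raw, format='%d-%m-%Y %H:%M')

print(parsed.dt.month.tolist())  # [9, 10]
print(parsed.dt.day.tolist())    # [26, 5]
```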

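In [14]:
import numpy as np

# Requires the auto_arima cell above to have been run first, so that `model` exists.
# Parse the day-first timestamps of the test file
future_df['Datetime'] = pd.to_datetime(future_df['Datetime'], format='%d-%m-%Y %H:%M')

# The model was fit on daily totals, so forecast one value per calendar day
# covered by the test file rather than one per hourly row
future_dates = pd.date_range(future_df['Datetime'].min().normalize(),
                             future_df['Datetime'].max().normalize(), freq='D')
forecasted_counts = model.predict(n_periods=len(future_dates))

In [None]:
# Create a DataFrame with the future dates and the forecasted daily counts
forecast_df = pd.DataFrame({'Datetime': future_dates,
                            'Predicted_Count': np.asarray(forecasted_counts)})
forecast_df.set_index('Datetime', inplace=True)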

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# If you have actual test data to compare against
test_actual = ...  # your actual observed counts

# Compute RMSE over the periods for which actual values are available
rmse = sqrt(mean_squared_error(test_actual, forecasted_counts[:len(test_actual)]))
print(f'The RMSE is: {rmse}')

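In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# Hold out the last 30 days (an illustrative choice) and refit on the rest,
# because `model` above was fit on the full daily series
train, test = daily_df.iloc[:-30], daily_df.iloc[-30:]
holdout_model = auto_arima(train['Count'], seasonal=True, m=7,
                           error_action='ignore', suppress_warnings=True)

# Generate predictions for the hold-out window
predictions = holdout_model.predict(n_periods=len(test))

# Calculate and report RMSE
rmse = sqrt(mean_squared_error(test['Count'], predictions))
print(f'Hold-out RMSE: {rmse:.2f}')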
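
An RMSE value is easier to judge next to a simple baseline. The sketch below is not part of the original analysis: on synthetic daily data it computes RMSE by hand and scores a seasonal-naive forecast, which just repeats the last observed week. The helper name `rmse`, the data, and the 14-day hold-out are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Synthetic daily totals (illustrative): weekly pattern plus noise
rng = np.random.default_rng(42)
idx = pd.date_range('2014-01-01', periods=70, freq='D')
values = 100 + 20 * np.sin(2 * np.pi * np.arange(70) / 7) + rng.normal(0, 2, 70)
daily = pd.Series(values, index=idx)

# Chronological split: the last 14 days are held out, never shuffled
train, test = daily.iloc[:-14], daily.iloc[-14:]

# Seasonal-naive forecast: repeat the last observed week of the training data
last_week = train.iloc[-7:].to_numpy()
naive_forecast = np.tile(last_week, 2)

print(f'Seasonal-naive RMSE: {rmse(test, naive_forecast):.2f}')
```

If a fitted model cannot beat this kind of baseline on the same hold-out window, its seasonal settings are worth revisiting.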