**Time Series Data**

- Data collected over time, in order (e.g., hourly electricity usage, daily temperature).
- Important because values often depend on previous time points.
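
As a rough illustration of that dependence, the sketch below measures how strongly a series correlates with itself shifted by one step (lag-1 autocorrelation). The helper `lag_autocorrelation` and both synthetic series are made up for this example and are not part of the dataset used later.

```python
import numpy as np

def lag_autocorrelation(values, lag=1):
    """Pearson correlation between a series and itself shifted by `lag` steps."""
    values = np.asarray(values, dtype=float)
    return float(np.corrcoef(values[:-lag], values[lag:])[0, 1])

# Synthetic data (assumption): a smooth daily cycle versus pure random noise.
rng = np.random.default_rng(0)
hours = np.arange(24 * 14)                  # two weeks of hourly steps
smooth = np.sin(2 * np.pi * hours / 24)     # each value is close to the previous one
noise = rng.normal(size=hours.size)         # each value is unrelated to the previous one

smooth_ac = lag_autocorrelation(smooth)
noise_ac = lag_autocorrelation(noise)
print(f"lag-1 autocorrelation, daily cycle: {smooth_ac:.2f}")
print(f"lag-1 autocorrelation, white noise: {noise_ac:.2f}")
```

A value near 1 means the previous point says a lot about the next one; a value near 0 means it says very little.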

In [1]:
!pip install pandas numpy matplotlib seaborn scikit-learn statsmodels prophet xgboost



In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt

In [4]:
from statsmodels.tsa.arima.model import ARIMA
from prophet import Prophet
import xgboost as xgb

In [5]:
df = pd.read_csv(
    "household_power_consumption.txt",
    sep=";",
    low_memory=False,
    na_values=["?"]
)

print(df.columns[:10])

Index(['Date', 'Time', 'Global_active_power', 'Global_reactive_power',
       'Voltage', 'Global_intensity', 'Sub_metering_1', 'Sub_metering_2',
       'Sub_metering_3'],
      dtype='object')


In [6]:
import warnings
warnings.filterwarnings("ignore")

**Resampling:**

Changing the frequency of the data by grouping observations into larger (or smaller) time bins.

Example: the dataset records one reading per minute, but we resample to hourly averages to smooth out noise and make the series easier to model.
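
A minimal sketch of the idea on made-up data, assuming a pandas version that accepts the lowercase `"h"` alias (the same alias the notebook uses). The series `minute_data` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical minute-level readings: 120 values covering two full hours.
index = pd.date_range("2024-01-01 00:00", periods=120, freq="min")
minute_data = pd.Series([1.0] * 60 + [3.0] * 60, index=index, name="power_kw")

# Group the readings into hourly bins and average each bin.
hourly = minute_data.resample("h").mean()
print(hourly)
```

The 120 minute-level rows collapse into two hourly rows, one per bin.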

In [7]:
# Reloading the same dataset
df = pd.read_csv(
    "household_power_consumption.txt",
    sep=";",
    low_memory=False,
    na_values=["?"]
)

print("Before combining:", df.columns[:5])  # sanity check

# Combine Date + Time into a single datetime column (dates are day-first)
df["datetime"] = pd.to_datetime(
    df["Date"].astype(str) + " " + df["Time"].astype(str),
    dayfirst=True,
    errors="coerce"
)

# Keep only datetime + target, dropping rows where either one is missing
df = df[["datetime", "Global_active_power"]].dropna()

# Ensure the target is numeric
df["Global_active_power"] = df["Global_active_power"].astype(float)

# Resample to hourly means
df = df.set_index("datetime").resample("h").mean()

# Fill hours left empty by missing readings, interpolating linearly in time
df = df.interpolate(method="time")

print("After processing:", df.head())
print("Index type:", type(df.index))


Before combining: Index(['Date', 'Time', 'Global_active_power', 'Global_reactive_power',
       'Voltage'],
      dtype='object')
After processing:                      Global_active_power
datetime                                
2006-12-16 17:00:00             4.222889
2006-12-16 18:00:00             3.632200
2006-12-16 19:00:00             3.400233
2006-12-16 20:00:00             3.268567
2006-12-16 21:00:00             3.056467
Index type: <class 'pandas.core.indexes.datetimes.DatetimeIndex'>


**Train-Test Split**

Train data → used to teach the model patterns.

Test data → unseen data used to check how well the model predicts.

In time series, we train on the past and test on the future. Shuffling the rows before splitting would leak future information into training.

In [8]:
# Use the last 30 days (24 hourly rows per day) as the test set
train = df.iloc[:-24*30]
test  = df.iloc[-24*30:]
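
An alternative to slicing by row count is to split at an explicit timestamp. The sketch below is only an illustration on synthetic data: the helper `split_by_cutoff`, the frame `demo`, and the cutoff date are all hypothetical.

```python
import pandas as pd

def split_by_cutoff(frame, cutoff):
    """Return (train, test): rows strictly before `cutoff`, and rows from `cutoff` onward."""
    cutoff = pd.Timestamp(cutoff)
    train_part = frame.loc[frame.index < cutoff]
    test_part = frame.loc[frame.index >= cutoff]
    return train_part, test_part

# Hypothetical hourly series spanning ten days.
index = pd.date_range("2024-01-01", periods=24 * 10, freq="h")
demo = pd.DataFrame({"value": range(len(index))}, index=index)

demo_train, demo_test = split_by_cutoff(demo, "2024-01-08")
print(len(demo_train), len(demo_test))

# Every training timestamp precedes every test timestamp.
print(demo_train.index.max() < demo_test.index.min())
```

Either way, the key property is the same: no row in the training set is later than any row in the test set.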

**Feature Engineering**

Creating new, useful inputs for the model.

Example: from the datetime index, we extract:

- Hour of day (electricity usage differs between morning and evening)
- Day of week (weekdays vs. weekends)
- Month (seasonal changes)
- Weekend flag (0 = weekday, 1 = weekend)

In [9]:
# Feature engineering for ML models (XGBoost)
train_feat = train.copy()
test_feat = test.copy()

# Adding time-based features
for dataset in [train_feat, test_feat]:
    dataset["hour"] = dataset.index.hour
    dataset["dayofweek"] = dataset.index.dayofweek
    dataset["month"] = dataset.index.month
    dataset["is_weekend"] = (dataset.index.dayofweek >= 5).astype(int)

X_train = train_feat.drop("Global_active_power", axis=1)
y_train = train_feat["Global_active_power"]
X_test = test_feat.drop("Global_active_power", axis=1)
y_test = test_feat["Global_active_power"]


**1. ARIMA (Auto-Regressive Integrated Moving Average):**

A traditional statistical model for time series. It predicts future values from past values and past forecast errors. It works well when the data has a trend but no overly complex patterns.

Parameters:

p = number of past values used (autoregression)

d = number of times the series is differenced (to remove trend)

q = number of past forecast errors used (moving average)

Think of ARIMA as saying: “The next hour’s electricity usage depends on the last few hours + some noise correction.”
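
To make the differencing step (the d parameter) concrete, here is a small sketch on a made-up series. It uses `np.diff` directly rather than ARIMA itself, and the trend and wiggle values are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical series: a steady upward trend of 0.5 per step plus a repeating wiggle.
steps = np.arange(12)
wiggle = np.tile([0.0, 1.0, 0.0, -1.0], 3)
series = 0.5 * steps + wiggle

# First-order differencing (d=1): each value minus the one before it.
differenced = np.diff(series, n=1)

print("original   :", series)
print("differenced:", differenced)
```

The original series keeps climbing, while the differenced series only moves back and forth within a fixed range. That is the sense in which differencing removes a trend.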

In [10]:
# Using only the target series
arima_series = train["Global_active_power"]

In [None]:
# Fit ARIMA model (order chosen manually)
arima_model = ARIMA(arima_series, order=(2,1,2))
arima_fit = arima_model.fit()

In [None]:
# Forecasting for test set length
arima_forecast = arima_fit.forecast(steps=len(test))

In [None]:
# Evaluate the model
arima_mae = mean_absolute_error(test["Global_active_power"], arima_forecast)
arima_rmse = sqrt(mean_squared_error(test["Global_active_power"], arima_forecast))
print(f"ARIMA → MAE: {arima_mae:.3f}, RMSE: {arima_rmse:.3f}")

In [None]:
plt.figure(figsize=(12,5))
plt.plot(train.index, train["Global_active_power"], label="Train")
plt.plot(test.index, test["Global_active_power"], label="Test")
plt.plot(test.index, arima_forecast, label="ARIMA Forecast")
plt.legend()
plt.title("ARIMA Forecast")
plt.show()

**2. Prophet (by Facebook/Meta):**

A modern forecasting tool. It handles daily, weekly, and yearly seasonality very well (patterns that repeat each day or week) and is more robust for business and calendar-driven patterns.

Easy to use: just feed it a dataframe with ds (date) and y (value) columns.

Think of Prophet as: “The model learns repeating patterns in the calendar (weekends, seasons, holidays) and uses them to forecast.”

In [None]:
# Prophet requires dataframe with columns "ds" and "y"
prophet_train = train.reset_index()[["datetime", "Global_active_power"]].rename(
    columns={"datetime": "ds", "Global_active_power": "y"}
)

In [None]:
# Fitting Prophet model
prophet_model = Prophet()
prophet_model.fit(prophet_train)

In [None]:
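# Build a dataframe holding the training timestamps plus len(test) future hours
# (lowercase "h" matches the alias used for resampling; uppercase "H" is deprecated in recent pandas)
future = prophet_model.make_future_dataframe(periods=len(test), freq="h")
forecast = prophet_model.predict(future)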

In [None]:
# Extracting only test part
prophet_forecast = forecast.set_index("ds").loc[test.index, "yhat"]

In [None]:
# Evaluate the model
prophet_mae = mean_absolute_error(test["Global_active_power"], prophet_forecast)
prophet_rmse = sqrt(mean_squared_error(test["Global_active_power"], prophet_forecast))
print(f"Prophet → MAE: {prophet_mae:.3f}, RMSE: {prophet_rmse:.3f}")

In [None]:
plt.figure(figsize=(12,5))
plt.plot(train.index, train["Global_active_power"], label="Train")
plt.plot(test.index, test["Global_active_power"], label="Test")
plt.plot(test.index, prophet_forecast, label="Prophet Forecast")
plt.legend()
plt.title("Prophet Forecast")
plt.show()

**3. XGBoost (Extreme Gradient Boosting):**

A machine learning model built from an ensemble of gradient-boosted decision trees. It’s not specifically built for time series, but we can make it work by giving it time features (hour, weekday, etc.).

It is very good at capturing complex, non-linear relationships.

Think of XGBoost as: “Instead of looking at the past values, it uses smart trees to learn how time-related features affect energy usage.” Note that in this notebook it sees only the calendar features, not lagged values of the series.

In [None]:
# Train XGBoost Regressor
xgb_model = xgb.XGBRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)
xgb_model.fit(X_train, y_train)

In [None]:
# Forecast
xgb_forecast = xgb_model.predict(X_test)

In [None]:
# Evaluate the model
xgb_mae = mean_absolute_error(y_test, xgb_forecast)
xgb_rmse = sqrt(mean_squared_error(y_test, xgb_forecast))
print(f"XGBoost → MAE: {xgb_mae:.3f}, RMSE: {xgb_rmse:.3f}")

In [None]:
plt.figure(figsize=(12,5))
plt.plot(train.index, train["Global_active_power"], label="Train")
plt.plot(test.index, test["Global_active_power"], label="Test")
plt.plot(test.index, xgb_forecast, label="XGBoost Forecast")
plt.legend()
plt.title("XGBoost Forecast")
plt.show()

**Evaluation Metrics:**

**MAE (Mean Absolute Error):**

Average absolute size of the errors, ignoring their direction.

Example: If the model predicts 4.5 but the actual is 5, the error is 0.5.

Smaller MAE = better model.

**RMSE (Root Mean Squared Error):**

Similar to MAE, but punishes big errors more strongly because each error is squared before averaging. Like MAE, it is in the same units as the target, and lower is better.

Often used in forecasting.
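
The difference between the two metrics can be sketched by hand. The functions `mae` and `rmse` below are plain-Python stand-ins written for this illustration (the notebook itself uses scikit-learn), and the numbers are made up.

```python
from math import sqrt

def mae(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of the mean of the squared differences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [5.0, 5.0, 5.0, 5.0]
even_errors = [4.5, 5.5, 4.5, 5.5]    # four errors of 0.5 each
one_big_error = [5.0, 5.0, 5.0, 3.0]  # a single error of 2.0

print("even errors   -> MAE:", mae(actual, even_errors), "RMSE:", rmse(actual, even_errors))
print("one big error -> MAE:", mae(actual, one_big_error), "RMSE:", rmse(actual, one_big_error))
```

Both sets of predictions have the same MAE, but the one with a single large miss has the higher RMSE.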

**Comparing all models**

In [None]:
results = pd.DataFrame({
    "Model": ["ARIMA", "Prophet", "XGBoost"],
    "MAE": [arima_mae, prophet_mae, xgb_mae],
    "RMSE": [arima_rmse, prophet_rmse, xgb_rmse]
})


In [None]:
print(results)

In [None]:
sns.barplot(data=results, x="Model", y="RMSE")
plt.title("Model Comparison by RMSE (lower is better)")
plt.show()

**Summary of Models:**

ARIMA → traditional, statistical, works with past values.

Prophet → modern, calendar-aware, great for seasonal patterns.

XGBoost → machine learning, uses time features for prediction.

**Visualization:**

We plot actual vs. forecasted energy usage to see how well each model follows the real data.

The plots make it easy to see at a glance which model tracks the test period best.