# 05 - EDA and Time Series Analysis

This notebook demonstrates the EDA and time series modules of the `ML_Engine` library. It uses two public datasets, the Daily Female Births series and AAPL stock prices, to show exploratory data analysis, time series diagnostics, and feature engineering.

## 1. Environment Setup

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import os

# Import from our new modules
from ML_Engine.data import eda as data_eda
from ML_Engine.visualization import eda as viz_eda
from ML_Engine.data import timeseries as data_ts
from ML_Engine.visualization import timeseries as viz_ts

# Set plot style
plt.style.use('seaborn-v0_8-whitegrid')

# Ensure outputs directory exists
output_dir = 'outputs'
os.makedirs(output_dir, exist_ok=True)

## 2. Data Loading

We will use the 'Daily Female Births' dataset, which is a simple univariate time series.

In [None]:
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-total-female-births.csv'
df = pd.read_csv(url, header=0, index_col=0, parse_dates=True).squeeze()

# For consistency, let's convert it to a DataFrame
df = df.to_frame(name='Births')
df.index.name = 'Date'
df.head()

## 3. Exploratory Data Analysis (EDA)

Let's use the new EDA modules to quickly understand our dataset.

In [None]:
# Get a statistical summary of the DataFrame
summary_df = data_eda.summarize(df)
print("Dataset Summary:")
print(summary_df)

In [None]:
# Analyze the target variable ('Births')
target_analysis = data_eda.analyze_target(df, 'Births')
print("\nTarget Variable Analysis:")
print(target_analysis)

In [None]:
# Get a visual overview of the dataset
fig = viz_eda.plot_dataset_overview(df)
fig.savefig(os.path.join(output_dir, '05_dataset_overview.png'))
plt.show()
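
The return value of `data_eda.summarize` is not shown in this notebook. As a rough point of comparison, a per-column overview can be assembled with plain pandas. The sketch below is an assumption about what such a summary might contain, not the `ML_Engine` implementation, and the `demo` values are made up.

```python
import pandas as pd

# Tiny illustrative frame with one missing value (made-up data)
demo = pd.DataFrame({"Births": [35, 32, None, 31, 44]})

# One row per column: type, missing count, distinct values, basic moments
overview = pd.DataFrame({
    "dtype": demo.dtypes.astype(str),
    "missing": demo.isna().sum(),
    "unique": demo.nunique(),
    "mean": demo.mean(numeric_only=True),
    "std": demo.std(numeric_only=True),
})
print(overview)
```

Comparing this table with the output of `summarize` is a quick way to see which statistics the library adds on top.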

## 4. Time Series Analysis

### 4.1. Basic Time Series Plot

In [None]:
# Plot the time series data
fig = viz_ts.plot_timeseries(df.reset_index(), date_col='Date', value_col='Births')
fig.savefig(os.path.join(output_dir, '05_timeseries_plot.png'))
plt.show()

### 4.2. Stationarity Check

In [None]:
# Check for stationarity using the ADF test
p_value, is_stationary, interpretation = data_ts.check_stationarity(df['Births'])
print(f"Is the series stationary? {is_stationary} (p-value: {p_value:.4f})")
print("ADF Test Results:", interpretation)

### 4.3. ACF and PACF Plots

In [None]:
# Plot ACF and PACF to help choose the AR order (p) and MA order (q);
# the differencing order (d) comes from the stationarity check above
fig = viz_ts.plot_acf_pacf(df['Births'], lags=40)
fig.savefig(os.path.join(output_dir, '05_acf_pacf.png'))
plt.show()
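
The values behind an ACF plot can be spot-checked by hand. The following is a minimal sketch using pandas' `Series.autocorr` on a synthetic series with a 7-day cycle. The synthetic data and the name `acf_by_hand` are assumptions for illustration and are not part of `ML_Engine`.

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a 7-day cycle plus noise (illustrative data only)
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=140, freq="D")
weekly = np.sin(2 * np.pi * np.arange(140) / 7)
series = pd.Series(weekly + rng.normal(scale=0.1, size=140), index=idx)

# Pearson correlation between the series and a lagged copy of itself
acf_by_hand = {lag: series.autocorr(lag=lag) for lag in (1, 3, 7, 14)}
for lag, value in acf_by_hand.items():
    print(f"lag {lag:>2}: {value:+.3f}")
```

A strong positive value at lags 7 and 14 is the signature of a weekly pattern. Note that `Series.autocorr` computes a Pearson correlation on the overlapping part of the series, so it can differ slightly from the estimator commonly used in ACF plots.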

### 4.4. Seasonality Decomposition

In [None]:
# Decompose the time series to observe trend, seasonality, and residuals
# Assuming a weekly seasonality (period=7)
fig = viz_ts.plot_seasonality_decomposition(df['Births'], period=7)
fig.savefig(os.path.join(output_dir, '05_decomposition.png'))
plt.show()
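
To make the three components concrete, here is a sketch of a classical additive decomposition done by hand with pandas. It assumes an odd period (7) and synthetic data with a known seasonal pattern. How `plot_seasonality_decomposition` computes its components is not shown in this notebook, so treat this as an illustration of the general technique only.

```python
import numpy as np
import pandas as pd

period = 7
n = 20 * period
idx = pd.date_range("2020-01-01", periods=n, freq="D")
rng = np.random.default_rng(1)

# Synthetic series: linear trend + known weekly pattern + noise (made-up data)
true_trend = np.linspace(40.0, 50.0, n)
true_seasonal = np.tile([3.0, 1.0, 0.0, -1.0, -2.0, -2.0, 1.0], n // period)
y = pd.Series(true_trend + true_seasonal + rng.normal(scale=0.5, size=n), index=idx)

# 1. Trend: centered moving average over one full period
trend = y.rolling(window=period, center=True).mean()

# 2. Seasonal: mean detrended value at each position in the cycle,
#    shifted so the seasonal effects sum to zero
detrended = y - trend
position = np.arange(n) % period
seasonal_means = detrended.groupby(position).mean()
seasonal_means -= seasonal_means.mean()
seasonal = pd.Series(seasonal_means.to_numpy()[position], index=idx)

# 3. Residual: what remains after removing trend and seasonal parts
resid = y - trend - seasonal

print(seasonal_means.round(2))
```

The centered window leaves `period - 1` undefined values at the edges of the trend and residual, which is why decomposition plots usually show gaps at both ends.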

### 4.5. Feature Engineering

In [None]:
# Create date-based features
df_featured = data_ts.extract_date_features(df.reset_index(), 'Date')

# Create lag features
df_featured = data_ts.create_lag_features(df_featured, 'Births', lags=[1, 7, 30])

# Create rolling window features
df_featured = data_ts.create_rolling_features(df_featured, 'Births', windows=[7, 30], agg=['mean', 'std'])

df_featured.head(10)
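
For readers who want to see what lag and rolling features look like underneath, here is a minimal pandas sketch. The column names follow the `<column>_lag_<n>` and `<column>_rolling_<window>_<agg>` pattern that appears later in this notebook. Whether `create_rolling_features` shifts the series before rolling is not shown here, so the `shift(1)` below is an assumption made to keep the current value out of its own feature.

```python
import pandas as pd

# Small illustrative frame (values are made up)
demo = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=6, freq="D"),
    "Births": [35, 32, 30, 31, 44, 29],
})

# Lag feature: the value observed `lag` rows earlier
for lag in (1, 2):
    demo[f"Births_lag_{lag}"] = demo["Births"].shift(lag)

# Rolling feature: statistic over the previous `window` rows.
# Shifting by one row first excludes the current observation.
window = 3
demo[f"Births_rolling_{window}_mean"] = demo["Births"].shift(1).rolling(window).mean()

print(demo)
```

The leading rows contain NaNs because there is no history to look back on, which is why the next section drops them before splitting.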

## 5. Time Series Splitting

Finally, let's demonstrate how to split the data for time series modeling.

In [None]:
# Drop rows with NaNs created by feature engineering
df_featured_clean = df_featured.dropna()

# Split the data
X_train, X_test, y_train, y_test = data_ts.split_timeseries(df_featured_clean, date_col='Date', test_size=0.2)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print("\nTraining data starts:", X_train['Date'].min())
print("Training data ends:", X_train['Date'].max())
print("Test data starts:", X_test['Date'].min())
print("Test data ends:", X_test['Date'].max())
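
The key property of a time series split is that every training row precedes every test row, with no shuffling. A minimal sketch of that idea follows. The helper `chronological_split` is hypothetical and returns whole frames rather than the `X`/`y` tuple that `split_timeseries` returns.

```python
import pandas as pd

def chronological_split(frame, date_col, test_size=0.2):
    """Hypothetical helper: the most recent `test_size` fraction of rows becomes the test set."""
    ordered = frame.sort_values(date_col).reset_index(drop=True)
    n_test = int(round(len(ordered) * test_size))
    split_at = len(ordered) - n_test
    return ordered.iloc[:split_at], ordered.iloc[split_at:]

# Made-up data for illustration
demo = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "Births": range(10),
})
train, test = chronological_split(demo, "Date", test_size=0.2)

print(len(train), len(test))
print("No overlap in time:", train["Date"].max() < test["Date"].min())
```

A random split would leak future information into training, so the date check in the last line is a useful sanity test for any splitter.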

## 6. Advanced Financial Time Series Analysis

In this section, we demonstrate more sophisticated time series analysis using financial data.
We'll use Apple Inc. (AAPL) stock data to showcase advanced EDA and time series feature engineering.

In [None]:
# Install yfinance if not already installed
try:
    import yfinance
except ImportError:
    !pip install yfinance -q
    import yfinance

In [None]:
import yfinance as yf
import pandas as pd
import numpy as np

# Download AAPL stock data for the last 2 years
ticker = 'AAPL'
df_fin = yf.download(ticker, period='2y', interval='1d', progress=False)
df_fin.head()

In [None]:
# Basic statistics
print(f"Data shape: {df_fin.shape}")
print("\nColumn dtypes:")
print(df_fin.dtypes)
print("\nSummary statistics:")
print(df_fin.describe())

# Check for missing values
print("\nMissing values per column:")
print(df_fin.isnull().sum())

In [None]:
# Calculate daily returns and volatility
df_fin['Returns'] = df_fin['Close'].pct_change()
df_fin['Log_Returns'] = np.log(df_fin['Close'] / df_fin['Close'].shift(1))
df_fin['Volatility'] = df_fin['Returns'].rolling(window=30).std() * np.sqrt(252)  # annualized

# Drop NaN rows created by the shift and the 30-day rolling window
df_fin_clean = df_fin.dropna()

print("First few rows with returns and volatility:")
print(df_fin_clean[['Close', 'Returns', 'Log_Returns', 'Volatility']].head())
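
The relationship between the two return definitions can be verified on a handful of prices. This sketch uses made-up closing prices and shows that log returns equal `ln(1 + simple return)`, that log returns add up over time while simple returns compound, and how the `sqrt(252)` convention scales a daily standard deviation.

```python
import numpy as np
import pandas as pd

# Made-up closing prices for illustration
close = pd.Series([100.0, 102.0, 101.0, 105.0, 104.0])

simple = close.pct_change()               # P_t / P_{t-1} - 1
log_ret = np.log(close / close.shift(1))  # ln(P_t / P_{t-1})

# The two are linked by log_ret = ln(1 + simple)
assert np.allclose(log_ret.dropna(), np.log1p(simple.dropna()))

# Log returns sum across time; simple returns compound
total_from_log = np.exp(log_ret.sum()) - 1
total_from_simple = (1 + simple.dropna()).prod() - 1
print(total_from_log, total_from_simple)  # both should be close to 104/100 - 1

# Annualize a daily standard deviation, assuming 252 trading days per year
daily_vol = simple.std()
annual_vol = daily_vol * np.sqrt(252)
print(daily_vol, annual_vol)
```

For small daily moves the two return series are nearly identical, so the choice mostly matters when aggregating over longer horizons.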

In [None]:
import matplotlib.pyplot as plt

# Plot closing price and returns
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Closing price
axes[0, 0].plot(df_fin_clean.index, df_fin_clean['Close'], color='blue', linewidth=1)
axes[0, 0].set_title('AAPL Closing Price')
axes[0, 0].set_ylabel('Price (USD)')
axes[0, 0].grid(True)

# Daily returns
axes[0, 1].plot(df_fin_clean.index, df_fin_clean['Returns'], color='green', linewidth=1)
axes[0, 1].set_title('Daily Returns')
axes[0, 1].set_ylabel('Returns')
axes[0, 1].grid(True)

# Volume (line plot)
axes[1, 0].plot(df_fin_clean.index, df_fin_clean['Volume'], color='gray', linewidth=1)
axes[1, 0].set_title('Trading Volume')
axes[1, 0].set_ylabel('Volume')
axes[1, 0].grid(True)

# Rolling volatility
axes[1, 1].plot(df_fin_clean.index, df_fin_clean['Volatility'], color='red', linewidth=1)
axes[1, 1].set_title('30-Day Rolling Volatility (Annualized)')
axes[1, 1].set_ylabel('Volatility')
axes[1, 1].grid(True)

plt.tight_layout()
plt.show()

In [None]:
# Use our new EDA modules
from ML_Engine.data.eda import summarize, analyze_target
from ML_Engine.visualization.eda import plot_dataset_overview, plot_correlations

# Summarize the financial DataFrame
summary = summarize(df_fin_clean)
print("Financial Data Summary:")
print(summary)

# Analyze a target variable (e.g., Returns)
target_analysis = analyze_target(df_fin_clean, 'Returns')
print("\nReturns Analysis:")
print(target_analysis)

In [None]:
# Plot correlations between financial features
fig = plot_correlations(df_fin_clean[['Open', 'High', 'Low', 'Close', 'Volume', 'Returns', 'Volatility']])
fig.suptitle('Correlation Matrix - Financial Features', fontsize=16)
plt.show()

In [None]:
# Use our time series modules
from ML_Engine.data.timeseries import extract_date_features, create_lag_features, create_rolling_features
from ML_Engine.visualization.timeseries import plot_acf_pacf, plot_seasonality_decomposition

# Reset index to have Date as a column
df_fin_reset = df_fin_clean.reset_index()

# Extract date features
df_with_date_features = extract_date_features(df_fin_reset, 'Date')
print("Date features added. New columns:")
print([col for col in df_with_date_features.columns if 'Date' in col])

In [None]:
# Create lag features for Returns
df_with_lags = create_lag_features(df_with_date_features, 'Returns', lags=[1, 2, 3, 5, 10])

# Create rolling features for Returns
df_with_rolling = create_rolling_features(df_with_lags, 'Returns', windows=[5, 10, 20], agg=['mean', 'std', 'min', 'max'])

print(f"DataFrame shape after feature engineering: {df_with_rolling.shape}")
print("\nSample of engineered features:")
print(df_with_rolling[['Returns', 'Returns_lag_1', 'Returns_rolling_5_mean', 'Returns_rolling_10_std']].head())

In [None]:
# Check stationarity of returns series
from ML_Engine.data.timeseries import check_stationarity

p_value, is_stationary, interpretation = check_stationarity(df_fin_clean['Returns'])
print(f"Is the returns series stationary? {is_stationary} (p-value: {p_value:.6f})")
print("\nADF Test Results:")
for key, value in interpretation.items():
    print(f"  {key}: {value}")

In [None]:
# Plot ACF and PACF for returns
fig = plot_acf_pacf(df_fin_clean['Returns'], lags=40)
fig.suptitle('ACF and PACF - AAPL Daily Returns', fontsize=16)
plt.show()

In [None]:
# Try seasonality decomposition (assuming weekly seasonality for trading days)
# Note: Financial data typically has 5 trading days per week
try:
    fig = plot_seasonality_decomposition(df_fin_clean['Returns'].dropna(), period=5)
    fig.suptitle('Seasonality Decomposition - AAPL Daily Returns (5-day period)', fontsize=16)
    plt.show()
except Exception as e:
    print(f"Seasonality decomposition failed: {e}")
    print("Decomposition usually fails because of missing values or too few observations for the chosen period, not because seasonality is absent.")

In [None]:

# Demonstrate time series split
from ML_Engine.data.timeseries import split_timeseries
import pandas as pd

# Prepare a feature matrix (using engineered features)
# Flatten MultiIndex columns to strings
if isinstance(df_with_rolling.columns, pd.MultiIndex):
    df_with_rolling.columns = ['_'.join(col).strip('_') for col in df_with_rolling.columns]
feature_cols = [col for col in df_with_rolling.columns if col not in ['Date', 'Returns'] and not col.startswith('Date_')]
feature_cols = feature_cols[:10]  # Limit to first 10 features for demonstration
target_col = 'Returns'

df_for_split = df_with_rolling[['Date'] + feature_cols + [target_col]].dropna()

X_train, X_test, y_train, y_test = split_timeseries(df_for_split, date_col='Date', test_size=0.2)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print(f"Training period: {X_train['Date'].min()} to {X_train['Date'].max()}")
print(f"Test period: {X_test['Date'].min()} to {X_test['Date'].max()}")
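
The column-flattening step in the cell above guards against two-level column labels, which `yf.download` can return depending on its version and arguments. The sketch below applies the same expression to a made-up frame so the resulting names are easy to see. The `(field, ticker)` layout is an assumption for illustration.

```python
import pandas as pd

# Two-level columns in a (field, ticker) layout; derived columns have an empty second level
cols = pd.MultiIndex.from_tuples([("Close", "AAPL"), ("Volume", "AAPL"), ("Returns", "")])
demo = pd.DataFrame([[150.0, 1000, 0.01], [151.5, 1200, 0.01]], columns=cols)

# Join the levels with "_" and strip the underscore left by an empty level
demo.columns = ["_".join(col).strip("_") for col in demo.columns]
print(list(demo.columns))
```

Columns with an empty second level keep their plain name, which is why `'Returns'` and `'Date'` can still be referenced directly after flattening.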

## Summary

This advanced financial time series analysis demonstrates:

1. **Financial Data Acquisition**: Using `yfinance` to download real stock data
2. **Advanced EDA**: Returns calculation, volatility measurement, and correlation analysis
3. **ML_Engine Integration**: Using the new EDA and time series modules for automated analysis
4. **Feature Engineering**: Date features, lag features, and rolling window statistics
5. **Time Series Diagnostics**: Stationarity testing, ACF/PACF analysis, and seasonality decomposition
6. **Temporal Splitting**: Proper time-aware train/test split for forecasting tasks

The engineered features can now be passed to the existing `ML_Engine.models` module for time series forecasting.