# Session 1: Data Analysis & Basic Model

## Learning Objectives

In this session, we will:

1. **Load and explore financial data** - Understand the structure, quality, and characteristics of our dataset
2. **Perform exploratory data analysis (EDA)** - Identify patterns, outliers, missing data, and relationships
3. **Build a simple linear regression model** - Predict returns using features
4. **Implement a basic trading strategy** - Convert predictions into buy/sell signals
5. **Backtest the strategy** - Evaluate performance using both ML metrics (RMSE, R²) and financial metrics (Sharpe ratio, drawdown)
6. **Identify data leakage** - Understand one of the most common pitfalls in financial ML

## The Problem

We want to build a **systematic trading strategy** that:
- Uses machine learning to predict future returns
- Makes trading decisions **without any arbitrary rules** (purely model-based)
- Can be backtested and evaluated objectively

The key question: **Can we predict $r_{t+1}$ (return from time $t$ to $t+1$) using features observed at time $t$?**

$$r_{t+1} = \frac{P_{t+1} - P_t}{P_t}$$

Where $P_t$ is the price at time $t$.
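
As a minimal sketch of this definition (the toy `prices` series is hypothetical and not part of the lab data), the forward return can be computed with pandas by taking the one-period percentage change and shifting it back one row, so that row $t$ holds $r_{t+1}$:

```python
import pandas as pd

# Hypothetical toy prices; the lab's real data is loaded from a CSV file later
prices = pd.Series([100.0, 102.0, 99.96, 104.958])

# pct_change() gives (P_t - P_{t-1}) / P_{t-1}; shifting by -1 moves each value
# up one row, so every row holds the return over the NEXT period
forward_returns = prices.pct_change().shift(-1)

print(forward_returns)
```

The last row is `NaN` because there is no later price to compute a forward return from.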


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sys

# Set style
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)

# Import our modules - add parent directory to path
sys.path.insert(0, str(Path("..").resolve()))

from eda.analysis import basic_summary
from features.engineering import prepare_features, prepare_target
from backtesting.engine import backtest_strategy, print_backtest_metrics


In [None]:
# Load data
data_path = Path("../data/saved/stock_a.csv")
df = pd.read_csv(data_path, parse_dates=["timestamp"])

print(f"Data loaded: {df.shape[0]} rows, {df.shape[1]} columns")
df.head()


### Question 1.1: Data Structure

**What columns do we have? What does each represent?**

**Answer:**
- `timestamp`: Date of observation
- `X1, X2, X3`: Features (predictors) observed at time $t$
- `price`: Price at time $t$
- `bid`, `ask`: Bid and ask prices (for transaction cost modeling)
- `returns`: Forward return from $t$ to $t+1$ (this is our target variable)

**Key insight:** Features at time $t$ should predict returns from $t$ to $t+1$. This is the correct temporal alignment for trading.


In [None]:
# Basic summary
basic_summary(df)


### Question 1.2: Missing Data

**How much missing data do we have? In which columns?**

**Answer:**
We should see missing data only in feature columns (X1, X2, X3) - about 5% as we configured. Missing data in features is common in real financial data (data feed issues, corporate actions, etc.). We'll handle this in feature engineering.

**Important:** We should NOT have missing data in `returns` or `price` - these are critical for our analysis.
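
One way to enforce this expectation is an explicit check. The sketch below uses a small hypothetical frame with the lab's column names; on the real data the same checks would be run against `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df, using the lab's column names
toy = pd.DataFrame({
    "X1": [0.1, np.nan, 0.3, 0.4],
    "price": [100.0, 101.0, 100.5, 102.0],
    "returns": [0.01, -0.005, 0.015, 0.0],
})

# Fraction of missing values per column
missing_frac = toy.isnull().mean()
print(missing_frac)

# Critical columns must be complete; fail loudly if they are not
critical_cols = ["price", "returns"]
assert toy[critical_cols].notnull().all().all(), "Missing values in critical columns"
```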


In [None]:
# Visualize missing data
missing_pct = df.isnull().sum() / len(df) * 100
missing_pct[missing_pct > 0].plot(kind="bar")
plt.title("Missing Data Percentage")
plt.ylabel("Percentage")
plt.xticks(rotation=45)
plt.show()


### 2.1 Price and Returns Analysis


In [None]:
# Plot price series
plt.figure(figsize=(14, 6))
plt.subplot(2, 1, 1)
plt.plot(df["timestamp"], df["price"])
plt.title("Price Series")
plt.xlabel("Date")
plt.ylabel("Price")
plt.grid(True)

plt.subplot(2, 1, 2)
plt.plot(df["timestamp"], df["returns"])
plt.title("Returns Series")
plt.xlabel("Date")
plt.ylabel("Returns")
plt.grid(True)

plt.tight_layout()
plt.show()


### Question 2.1: Returns Distribution

**What does the distribution of returns look like? Is it normal?**

**Answer:**
Financial returns typically exhibit:
- **Fat tails** (more extreme values than normal distribution)
- **Near-zero mean** (in efficient markets)
- **Volatility clustering** (periods of high/low volatility)

Let's check our synthetic data:


In [None]:
# Returns distribution
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
df["returns"].hist(bins=50, edgecolor="black")
plt.title("Returns Distribution")
plt.xlabel("Returns")
plt.ylabel("Frequency")

plt.subplot(1, 2, 2)
df["returns"].plot(kind="box")
plt.title("Returns Box Plot")
plt.ylabel("Returns")

plt.tight_layout()
plt.show()

print("Returns statistics:")
print(df["returns"].describe())
print(f"\nSkewness: {df['returns'].skew():.4f}")
# pandas reports excess kurtosis (Fisher definition): a normal distribution gives 0
print(f"Excess kurtosis: {df['returns'].kurtosis():.4f}")


### 2.2 Feature Analysis


In [None]:
# Feature distributions
feature_cols = ["X1", "X2", "X3"]

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, col in enumerate(feature_cols):
    df[col].hist(bins=50, ax=axes[i], edgecolor="black")
    axes[i].set_title(f"{col} Distribution")
    axes[i].set_xlabel(col)
    axes[i].set_ylabel("Frequency")

plt.tight_layout()
plt.show()


### Question 2.2: Feature-Target Relationship

**Do our features have predictive power? Let's check correlations.**

**Answer:**
We should see some correlation between features and returns. However, correlation doesn't guarantee predictive power in a trading context (we'll see why later).
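
One reason correlation can mislead is timing. The sketch below uses synthetic data (all names are hypothetical): a feature that is strongly correlated with the return of the same period but carries no information about the next period's return, which is the only one a strategy could act on.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Synthetic i.i.d. returns and a feature that "sees" the same-period return
period_returns = pd.Series(rng.normal(0.0, 0.01, n))
feature = period_returns + pd.Series(rng.normal(0.0, 0.005, n))

# Correlation with the return of the same period (not tradable)
same_period_corr = feature.corr(period_returns)

# Correlation with the NEXT period's return (what a strategy can act on)
next_period_corr = feature.corr(period_returns.shift(-1))

print(f"Same-period correlation: {same_period_corr:.3f}")
print(f"Next-period correlation: {next_period_corr:.3f}")
```

The same comparison on real data is a cheap sanity check that a feature is aligned with the return it is supposed to predict.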


In [None]:
# Correlation matrix
corr_cols = feature_cols + ["returns"]
corr_matrix = df[corr_cols].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, fmt=".3f", cmap="coolwarm", center=0)
plt.title("Correlation Matrix: Features vs Returns")
plt.tight_layout()
plt.show()

print("\nCorrelations with returns:")
for col in feature_cols:
    corr = df[col].corr(df["returns"])
    print(f"  {col}: {corr:.4f}")


### 2.3 Note on Automated EDA Tools

**ydata-profiling** (formerly pandas-profiling) is a useful Python package that automatically generates comprehensive HTML reports with detailed statistics, correlations, missing data analysis, and visualizations. It can save significant time during exploratory data analysis.

To use it:
```python
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Data Profile")
profile.to_file("report.html")
```

For this lab, we'll use manual EDA to understand each step, but in practice, automated tools like ydata-profiling can be very helpful for quick data overviews.


In [None]:
# Additional correlation analysis for all numeric features
plt.figure(figsize=(10, 8))
numeric_cols = df.select_dtypes(include=[np.number]).columns
sns.heatmap(df[numeric_cols].corr(), annot=True, fmt=".3f", cmap="coolwarm", center=0)
plt.title("Correlation Matrix (All Numeric Features)")
plt.tight_layout()
plt.show()


## 3. Feature Engineering

For now, feature engineering is minimal - we'll just handle missing values. In future sessions, we'll add:
- Technical indicators (moving averages, RSI, etc.)
- Lagged features
- Rolling statistics
- Feature interactions
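
How `prepare_features` fills missing values is not shown in this notebook. As one possible approach (an assumption, not necessarily what the helper does), forward-filling uses only past observations and therefore does not leak future information:

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame with gaps
features = pd.DataFrame({
    "X1": [0.5, np.nan, 0.7, np.nan],
    "X2": [np.nan, 1.0, np.nan, 1.2],
})

# Forward fill carries the last known value forward (past data only);
# leading gaps have no past value, so fill those with 0 as a neutral default
filled = features.ffill().fillna(0.0)

print(filled)
```

By contrast, filling gaps with a mean computed over the full sample would use values from the test period.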


In [None]:
# Prepare features and target
X = prepare_features(df, feature_cols=feature_cols)
y = prepare_target(df, target_col="returns")

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nMissing values after processing: {X.isnull().sum().sum()}")

X.head()


## 4. Train-Test Split

**Critical for time series:** We must use a **chronological split**, not random!

Why? In real trading, we can't use future data to predict the past. A random split would put observations from after the test dates into the training set, giving unrealistically good performance.
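
The difference between the two kinds of split can be seen on row positions alone. This is a toy sketch with ten time-ordered rows and an arbitrary seed:

```python
import numpy as np

n = 10
idx = np.arange(n)  # time-ordered row positions 0..9
split = int(n * 0.8)

# Chronological split: every training row precedes every test row
chrono_train, chrono_test = idx[:split], idx[split:]

# Random split: shuffling first typically puts later rows into the training set
rng = np.random.default_rng(42)
shuffled = rng.permutation(idx)
rand_train, rand_test = shuffled[:split], shuffled[split:]

print("Chronological: last train row", chrono_train.max(), "| first test row", chrono_test.min())
print("Random:        last train row", rand_train.max(), "| first test row", rand_test.min())
```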


In [None]:
# Chronological split: 80% train, 20% test
split_idx = int(len(df) * 0.8)

X_train = X.iloc[:split_idx]
X_test = X.iloc[split_idx:]
y_train = y.iloc[:split_idx]
y_test = y.iloc[split_idx:]

print(f"Train set: {len(X_train)} samples ({X_train.index[0]} to {X_train.index[-1]})")
print(f"Test set: {len(X_test)} samples ({X_test.index[0]} to {X_test.index[-1]})")

# Also split the full dataframe for backtesting
df_train = df.iloc[:split_idx].copy()
df_test = df.iloc[split_idx:].copy()


## 5. Linear Regression Model

We'll start with the simplest model: **Linear Regression**

$$\hat{r}_{t+1} = \beta_0 + \beta_1 X_{1,t} + \beta_2 X_{2,t} + \beta_3 X_{3,t}$$

Where $\hat{r}_{t+1}$ is the predicted return.
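
To make the fitting step concrete, here is a sketch of ordinary least squares on synthetic data using NumPy (the coefficients and data are made up for illustration). scikit-learn's `LinearRegression` solves the same least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic features and a known linear relationship plus noise
X_syn = rng.normal(size=(n, 3))
true_beta = np.array([0.002, -0.001, 0.0005])
true_intercept = 0.0001
y_syn = true_intercept + X_syn @ true_beta + rng.normal(0.0, 0.0001, n)

# Prepend a column of ones so the first coefficient is the intercept
X_design = np.column_stack([np.ones(n), X_syn])

# Least-squares solution of X_design @ beta = y_syn
beta_hat, *_ = np.linalg.lstsq(X_design, y_syn, rcond=None)

print("Intercept:", beta_hat[0])
print("Coefficients:", beta_hat[1:])
```

With low noise, the estimated coefficients land close to the ones used to generate the data.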


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Fit model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Model coefficients
print("Model Coefficients:")
for i, col in enumerate(X.columns):
    print(f"  {col}: {model.coef_[i]:.6f}")
print(f"  Intercept: {model.intercept_:.6f}")


### 5.1 Model Evaluation (ML Metrics)


In [None]:
# Calculate ML metrics
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
train_mae = mean_absolute_error(y_train, y_train_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

print("=" * 80)
print("MODEL PERFORMANCE (ML METRICS)")
print("=" * 80)
print(f"\nTrain Set:")
print(f"  RMSE: {train_rmse:.6f}")
print(f"  MAE: {train_mae:.6f}")
print(f"  R²: {train_r2:.4f}")
print(f"\nTest Set:")
print(f"  RMSE: {test_rmse:.6f}")
print(f"  MAE: {test_mae:.6f}")
print(f"  R²: {test_r2:.4f}")
print("=" * 80)


### Question 5.1: Model Interpretation

**What do these metrics tell us?**

**Answer:**
- **RMSE/MAE**: Measure prediction error. Lower is better.
- **R²**: Proportion of variance explained. R² = 1 means perfect predictions, R² = 0 means the model is no better than predicting the mean, and R² < 0 (possible on the test set) means it is worse than predicting the mean.
- **Train vs Test**: If train R² >> test R², we might be overfitting.

**But wait!** Good ML metrics don't guarantee profitable trading. We need to backtest!
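
These metrics are simple to compute by hand, which also shows that R² can fall below zero when predictions are worse than the mean. A sketch with made-up numbers:

```python
import numpy as np

# Made-up actual and predicted returns
y_true = np.array([0.01, -0.02, 0.015, -0.005])
y_pred = np.array([0.008, -0.015, 0.01, 0.0])

errors = y_true - y_pred
rmse = np.sqrt(np.mean(errors ** 2))
mae = np.mean(np.abs(errors))

# R² = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Predictions with the wrong sign do worse than the mean, so R² is negative
wrong_sign_pred = -y_true
r2_wrong_sign = 1.0 - np.sum((y_true - wrong_sign_pred) ** 2) / ss_tot

print(f"RMSE: {rmse:.6f}, MAE: {mae:.6f}, R²: {r2:.4f}")
print(f"R² with wrong-sign predictions: {r2_wrong_sign:.4f}")
```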


In [None]:
# Visualize predictions vs actual
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Train set
axes[0].scatter(y_train, y_train_pred, alpha=0.5, s=10)
axes[0].plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 'r--', lw=2)
axes[0].set_xlabel("Actual Returns")
axes[0].set_ylabel("Predicted Returns")
axes[0].set_title(f"Train Set: R² = {train_r2:.4f}")
axes[0].grid(True)

# Test set
axes[1].scatter(y_test, y_test_pred, alpha=0.5, s=10)
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[1].set_xlabel("Actual Returns")
axes[1].set_ylabel("Predicted Returns")
axes[1].set_title(f"Test Set: R² = {test_r2:.4f}")
axes[1].grid(True)

plt.tight_layout()
plt.show()


## 6. Trading Strategy Implementation

**Simple strategy:**
- If predicted return > 0: **BUY** (go long)
- If predicted return ≤ 0: **SELL** (go short)

**Assumptions (for now):**
- We can trade at the **close price** (we'll relax this later)
- Transaction costs: 0.1% per trade
- We can go both long and short
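
The internals of `backtest_strategy` are not shown here, so the following is only a sketch of one common convention, under stated assumptions: the position equals the signal, the per-period strategy return is the signal times the realized return, and a proportional cost is charged on every change in position (a flip from long to short counts as two units traded).

```python
import numpy as np

# Made-up predictions and realized forward returns
predictions = np.array([0.002, 0.001, -0.003, -0.001, 0.004])
realized = np.array([0.010, -0.005, -0.020, 0.005, 0.015])
cost_rate = 0.001  # 0.1% per unit of position traded (assumed convention)

# Long (+1) when the prediction is positive, short (-1) otherwise
signal = np.where(predictions > 0, 1, -1)

# Units traded each period; we start flat, so the first trade opens the position
position_change = np.abs(np.diff(signal, prepend=0))

gross = signal * realized
net = gross - cost_rate * position_change

print("Signal:         ", signal)
print("Position change:", position_change)
print("Net returns:    ", net)
```

Under this convention a strategy that flips sides often pays costs on most periods, which is why small predicted returns may not survive transaction costs.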


In [None]:
# Add predictions to test dataframe
df_test["prediction"] = y_test_pred

# Simple strategy: buy if prediction > 0, sell if prediction < 0
df_test["signal"] = np.where(df_test["prediction"] > 0, 1, -1)

print("Signal distribution:")
print(df_test["signal"].value_counts())
print(f"\n% Long: {(df_test['signal'] == 1).sum() / len(df_test) * 100:.2f}%")
print(f"% Short: {(df_test['signal'] == -1).sum() / len(df_test) * 100:.2f}%")


## 7. Backtesting

Now let's see if our strategy is actually profitable!


In [None]:
# Run backtest
results = backtest_strategy(
    df_test,
    y_test_pred,
    initial_capital=100000,
    transaction_cost=0.001,  # 0.1%
    trade_at="close"
)

# Print metrics
print_backtest_metrics(results)


### Question 7.1: Interpreting Backtest Results

**What do these metrics mean?**

**Answer:**

**Financial Metrics:**
- **Total Return**: Overall profit/loss percentage
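- **Sharpe Ratio**: Risk-adjusted return (mean excess return divided by the volatility of returns, annualized). As a rule of thumb, > 1 is decent, > 2 is good, > 3 is excellent
- **Max Drawdown**: Worst peak-to-trough decline in portfolio value. Smaller in magnitude is better (investors hate large drawdowns)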
- **Win Rate**: Percentage of profitable trades

**ML Metrics:**
- Same as before, but now we see them in context of actual trading performance

**Key insight:** A model can have good R² but poor trading performance if:
- Predictions are too small (can't overcome transaction costs)
- Predictions have wrong sign (direction matters more than magnitude for binary signals)
- Model overfits to noise
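
For reference, the financial metrics can be computed from a series of per-period strategy returns. This is a sketch assuming daily data (252 periods per year), a zero risk-free rate, and a win rate measured per period rather than per trade; `print_backtest_metrics` may use different conventions.

```python
import numpy as np

# Made-up daily strategy returns
strategy_returns = np.array([0.01, -0.02, 0.015, 0.005, -0.01, 0.02])
periods_per_year = 252  # assumption: daily data

# Annualized Sharpe ratio with a zero risk-free rate
sharpe = strategy_returns.mean() / strategy_returns.std(ddof=1) * np.sqrt(periods_per_year)

# Equity curve from compounding, then drawdown relative to the running peak
equity = 100_000 * np.cumprod(1 + strategy_returns)
running_peak = np.maximum.accumulate(equity)
drawdown = equity / running_peak - 1
max_drawdown = drawdown.min()

# Share of periods with a positive return
win_rate = (strategy_returns > 0).mean()

print(f"Sharpe: {sharpe:.2f}")
print(f"Max drawdown: {max_drawdown:.2%}")
print(f"Win rate: {win_rate:.2%}")
```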


In [None]:
# Plot equity curve
plt.figure(figsize=(14, 6))
plt.plot(df_test["timestamp"], results["equity_curve"])
plt.title("Equity Curve")
plt.xlabel("Date")
plt.ylabel("Portfolio Value")
plt.grid(True)
plt.axhline(y=100000, color='r', linestyle='--', label='Initial Capital')
plt.legend()
plt.tight_layout()
plt.show()


## 8. Exercise: Find the Data Leakage Bug!

**Challenge:** There's a subtle data leakage issue in our current setup. Can you find it?

**Hint:** Think about when we observe features vs when we can actually trade.

**Answer:**

The issue is that we're using features at time $t$ to predict returns from $t$ to $t+1$, but we're **trading at the close price of time $t$**. 

In reality:
- We observe features at the **end of day $t$** (at close)
- We can only trade at the **next day's open** (or later)
- So we should be predicting returns from $t+1$ to $t+2$, not $t$ to $t+1$!

**The fix:** Shift the target one period into the future, so that the features at time $t$ are paired with the return from $t+1$ to $t+2$ (equivalently, lag the features by one period). Ignoring this latency is a common mistake that leads to unrealistic backtest performance.

**Try it:** Modify the code to account for this latency and see how it affects performance!

```python
y = prepare_target(df, target_col="returns").shift(-1).fillna(0)
```
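
To see what the shift does, consider a toy frame (hypothetical values) in which `returns` already holds the forward return from $t$ to $t+1$. After `shift(-1)`, each row's target is the return from $t+1$ to $t+2$. This sketch drops the final row, which has no such return, instead of filling it with 0; either choice affects only one sample.

```python
import pandas as pd

toy = pd.DataFrame({
    "X1": [0.1, 0.2, 0.3, 0.4],
    "returns": [0.01, -0.02, 0.03, -0.01],  # forward return from t to t+1
})

# Target under a one-period execution delay: return from t+1 to t+2
toy["target_delayed"] = toy["returns"].shift(-1)

# The last row has no t+1 -> t+2 return, so drop it
aligned = toy.dropna(subset=["target_delayed"])

print(aligned)
```

If rows are dropped this way, the features and the backtest dataframe need the same rows removed so that everything stays aligned.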

**Challenge:** Re-run the same analysis with stock_b, which was generated with a small amount of autocorrelation in returns. Do you see a difference? How can you explain it?

**Hint:** We now have a small but real explanatory power for the returns at $t+2$.


## Summary

In this session, we:

1. ✅ Loaded and explored financial data
2. ✅ Performed EDA to understand data characteristics
3. ✅ Built a linear regression model to predict returns
4. ✅ Implemented a simple trading strategy
5. ✅ Backtested the strategy with both ML and financial metrics
6. ✅ Identified a data leakage issue (latency between observation and execution)

**Key Takeaways:**
- Good ML metrics ≠ Profitable trading
- Chronological train-test split is essential
- Transaction costs matter!
- Latency/execution timing is critical

**Next Session:** We'll explore logistic regression for direction prediction, threshold tuning, and more sophisticated backtesting.
