# Step 2: Feature Creation

## Objective
Create engineered features from raw OHLC data to predict USD/BRL exchange rate direction.

## Process
1. Load raw data from `data/raw/BRL_X_raw.csv`
2. Create target variable (next-day price direction: up=1, down or unchanged=0)
3. Engineer statistical features (moving averages, standard deviations)
4. Calculate momentum indicators (Relative Strength Level)
5. Generate physics-based features (velocity, acceleration, momentum, force)
6. Handle missing values from rolling calculations
7. Save feature set to `data/processed/` directory

## Output
- `data/processed/BRL_X_features.csv`: Complete feature set ready for model training

## Feature Categories
- Target: Binary classification (price direction)
- Statistical: Moving averages and standard deviations (6 and 12 periods)
- Momentum: Relative Strength Level indicators
- Physics-Based: Velocity, acceleration, mass, force, kinetic energy, catalyst, work, torque, momentum, gravitational force

In [1]:
# Import required libraries
import os
import numpy as np
import pandas as pd
from datetime import datetime

print(f"Feature creation started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Feature creation started at: 2025-10-25 10:57:16


In [2]:
# Define configuration parameters
RAW_DATA_PATH = '../data/raw/BRL_X_raw.csv'  # Input file from notebook 01
PROCESSED_DATA_PATH = '../data/processed/BRL_X_features.csv'  # Output file for notebook 03

# Feature engineering parameters
SHORT_WINDOW = 6   # Short-term window for moving averages and momentum calculations
LONG_WINDOW = 12   # Long-term window for moving averages and momentum calculations
PHYSICS_PERIOD = 6 # Period for physics-based feature calculations

print(f"Configuration:")
print(f"  Input: {RAW_DATA_PATH}")
print(f"  Output: {PROCESSED_DATA_PATH}")
print(f"  Short Window: {SHORT_WINDOW} periods")
print(f"  Long Window: {LONG_WINDOW} periods")
print(f"  Physics Period: {PHYSICS_PERIOD} periods")

Configuration:
  Input: ../data/raw/BRL_X_raw.csv
  Output: ../data/processed/BRL_X_features.csv
  Short Window: 6 periods
  Long Window: 12 periods
  Physics Period: 6 periods


In [3]:
# Load raw BRL=X data from previous notebook
df = pd.read_csv(RAW_DATA_PATH, index_col=0)

# Convert index to datetime and ensure proper formatting
df.index = pd.to_datetime(df.index)
df.index.name = 'Date'

# Sort by date to ensure chronological order
df = df.sort_index()

# Display data information
print(f"Loaded {len(df)} records from {df.index.min().strftime('%Y-%m-%d')} to {df.index.max().strftime('%Y-%m-%d')}")
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nFirst 5 rows:")
df.head()

Loaded 4116 records from 2010-01-04 to 2025-10-24
Dataset shape: (4116, 4)
Columns: ['Close', 'High', 'Low', 'Open']

Data types:
Close    float64
High     float64
Low      float64
Open     float64
dtype: object

Missing values:
Close    0
High     0
Low      0
Open     0
dtype: int64

First 5 rows:


Date,Close,High,Low,Open
2010-01-04,1.719,1.7412,1.6723,1.693
2010-01-05,1.737,1.737,1.6713,1.6713
2010-01-06,1.7315,1.7359,1.6798,1.6798
2010-01-07,1.7389,1.7472,1.6805,1.7242
2010-01-08,1.732,1.7492,1.6954,1.6954


In [4]:
# Create target variable for classification
# Target represents the direction of price movement on the next day
# 1 = Price goes up (positive return)
# 0 = Price goes down (negative or zero return)

# Calculate percentage change (daily returns)
df['pct_change'] = df['Close'].pct_change()

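# Create target by shifting returns forward by 1 day
# This means we're predicting tomorrow's direction based on today's features
# Note: the last row has no next-day return; the comparison below maps its NaN to 0,
# so that row's label is a placeholder rather than an observed direction
df['target'] = df['pct_change'].shift(-1)
df['target'] = (df['target'] > 0).astype(int)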

# Display target distribution to check class balance
print("Target variable distribution:")
print(df['target'].value_counts())
print(f"\nClass balance:")
print(df['target'].value_counts(normalize=True))
print(f"\nFirst 10 rows with target:")
df[['Close', 'pct_change', 'target']].head(10)

Target variable distribution:
target
1    2076
0    2040
Name: count, dtype: int64

Class balance:
target
1    0.504373
0    0.495627
Name: proportion, dtype: float64

First 10 rows with target:


Date,Close,pct_change,target
2010-01-04,1.719,,1
2010-01-05,1.737,0.010471,0
2010-01-06,1.7315,-0.003166,1
2010-01-07,1.7389,0.004274,0
2010-01-08,1.732,-0.003968,0
2010-01-11,1.731,-0.000577,1
2010-01-12,1.7373,0.00364,1
2010-01-13,1.7459,0.00495,1
2010-01-14,1.7641,0.010424,1
2010-01-15,1.7644,0.00017,0
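
Because `(x > 0)` evaluates to `False` for NaN, the final row above receives label 0 even though its next-day return is unknown. The following is a minimal sketch of one way to leave that row unlabeled so a later `dropna()` removes it. The frame `toy` and its prices are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy closing prices (hypothetical values, for illustration only)
toy = pd.DataFrame({"Close": [1.00, 1.02, 1.01, 1.03]})
toy["pct_change"] = toy["Close"].pct_change()

# Next-day return; the final row has no next day, so it is NaN
next_ret = toy["pct_change"].shift(-1)

# Label 1/0 only where the next-day return is known; keep NaN otherwise
toy["target"] = (next_ret > 0).astype(float).where(next_ret.notna())

print(toy)
```

With this variant the label column is float rather than int until the unlabeled row is dropped, and the final dataset would contain one row fewer than the counts reported in this notebook.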


In [5]:
# Statistical Features: Moving averages and standard deviations
# These features capture trend and volatility information at different time scales

# Short-term (6 periods) moving statistics
df['mm_std6'] = df['pct_change'].rolling(SHORT_WINDOW).mean()  # 6-period moving average of returns
df['std6'] = df['pct_change'].rolling(SHORT_WINDOW).std()      # 6-period volatility (standard deviation)

# Long-term (12 periods) moving statistics
df['mm_std12'] = df['pct_change'].rolling(LONG_WINDOW).mean()  # 12-period moving average of returns
df['std12'] = df['pct_change'].rolling(LONG_WINDOW).std()      # 12-period volatility (standard deviation)

print(f"Statistical features created:")
print(f"  - mm_std6: {SHORT_WINDOW}-period moving average")
print(f"  - std6: {SHORT_WINDOW}-period standard deviation")
print(f"  - mm_std12: {LONG_WINDOW}-period moving average")
print(f"  - std12: {LONG_WINDOW}-period standard deviation")

Statistical features created:
  - mm_std6: 6-period moving average
  - std6: 6-period standard deviation
  - mm_std12: 12-period moving average
  - std12: 12-period standard deviation
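
As a rough illustration of how these rolling statistics behave at the start of the series, the sketch below applies the same calls to a toy price series. The prices and the window of 3 are assumptions for readability, not the notebook's settings.

```python
import numpy as np
import pandas as pd

# Toy price series (hypothetical values)
close = pd.Series([1.00, 1.01, 1.03, 1.02, 1.04, 1.05, 1.07, 1.06])
ret = close.pct_change()               # first value is NaN

window = 3                             # illustrative window, not the notebook's 6 or 12
roll_mean = ret.rolling(window).mean()
roll_std = ret.rolling(window).std()   # sample standard deviation (ddof=1 by default)

# The first `window` rows are NaN: one from pct_change plus window - 1 from the warm-up
print(roll_mean.isna().sum(), roll_std.isna().sum())
```

The same reasoning accounts for the 6 and 12 missing values that the 6- and 12-period columns show in the data quality check.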


In [6]:
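# Momentum Features: Relative Strength Level (RSL) indicators
# RSL here is the ratio of the current return to its rolling mean return, minus 1
# RSL > 0: the current return has the same sign as the rolling mean and a larger magnitude
# RSL < 0: the current return is smaller in magnitude than the rolling mean or has the opposite sign
# Caveat: the ratio becomes very large when the rolling mean is close to zero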

# Short-term relative strength (6 periods)
df['RSL_6'] = (df['pct_change'] / df['pct_change'].rolling(SHORT_WINDOW).mean()) - 1

# Long-term relative strength (12 periods)
df['RSL_12'] = (df['pct_change'] / df['pct_change'].rolling(LONG_WINDOW).mean()) - 1

print(f"Momentum features created:")
print(f"  - RSL_6: {SHORT_WINDOW}-period Relative Strength Level")
print(f"  - RSL_12: {LONG_WINDOW}-period Relative Strength Level")

Momentum features created:
  - RSL_6: 6-period Relative Strength Level
  - RSL_12: 12-period Relative Strength Level
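
Since the RSL columns divide by a rolling mean of returns, they can take extreme values whenever that mean is near zero. The sketch below reproduces the effect on toy data and compares it with a price-based ratio. The price-based variant is an assumption shown for contrast, not the notebook's method, and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy prices (hypothetical); returns alternate in sign, so their rolling mean is near zero
close = pd.Series([1.000, 1.010, 1.000, 1.010, 1.000, 1.011, 1.000, 1.010])
ret = close.pct_change()
window = 4  # illustrative window

# Ratio used in the notebook: current return relative to its rolling mean
rsl_returns = ret / ret.rolling(window).mean() - 1

# Alternative (an assumption, not the notebook's method): price relative to its rolling mean
rsl_price = close / close.rolling(window).mean() - 1

print(rsl_returns.abs().max(), rsl_price.abs().max())
```

If the returns-based ratio is kept, clipping or otherwise bounding it before training may be worth considering, since a few extreme rows can dominate scale-sensitive models.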


In [7]:
# Physics-Based Features: Apply physics concepts to price movement analysis
# These features model price changes using analogies from classical mechanics
# The intuition is that price movements follow patterns similar to physical motion

p = PHYSICS_PERIOD  # Use configured period for all physics calculations

# Velocity (v): Rate of change in returns (acceleration of price momentum)
# Represents how quickly the momentum is changing
df['v'] = (df['pct_change'] - df['pct_change'].shift(p)) / p

# Acceleration (a): Rate of change in velocity (jerk of price momentum)
# Indicates changing momentum trends
df['a'] = (df['v'] - df['v'].shift(p)) / p

# Mass (m): Sum of returns over period (accumulated momentum)
# Represents the "weight" or significance of recent price movements
df['m'] = df['pct_change'].rolling(p).sum()

# Force (f): Mass times acceleration (Newton's second law applied to prices)
# Measures the strength of momentum change
df['f'] = df['m'] * df['a']

# Kinetic Energy (T): 0.5 * mass * velocity^2
# Represents the "energy" in price movement
df['T'] = 0.5 * df['m'] * df['v'] * df['v']

# Catalyst (cat): Sum of the previous p-1 returns, divided by p
# cat_op is the p-period total minus the current return (rolling(1).sum() is the current return)
cat_op = df['pct_change'].rolling(p).sum() - df['pct_change'].rolling(1).sum()
cat_ad = p
df['cat'] = cat_op / cat_ad

# Work (w): Force in the direction of motion (F * cos(theta))
# Represents effective force applied in direction of price movement
df['w'] = df['f'] * np.cos(np.arctan(df['cat']))

# Kinetic Energy Alternative (k): Mass * unscaled catalyst (cat_op, before dividing by p)
# Another measure of price movement energy
df['k'] = cat_op * df['m']

# Torque (tau): Rotational force (F * sin(theta))
# Captures perpendicular component of force (potential turning points)
df['tau'] = df['f'] * np.sin(np.arctan(df['cat']))

# Momentum (M): Mass times velocity (classical momentum definition)
# Represents persistence of price movement
df['M'] = df['m'] * df['v']

# Gravitational-like force (g): Intended to model mean-reversion tendency
# Note: m equals the rolling sum, so the ratio below is 1 and g reduces to m ** 2
df['g'] = df['m'] * df['pct_change'].rolling(p).sum() / (df['m'] / df['pct_change'].rolling(p).sum()) ** 2

print(f"Physics-based features created ({PHYSICS_PERIOD}-period):")
print(f"  - v: Velocity (rate of return change)")
print(f"  - a: Acceleration (rate of velocity change)")
print(f"  - m: Mass (accumulated returns)")
print(f"  - f: Force (mass * acceleration)")
print(f"  - T: Kinetic Energy")
print(f"  - cat: Catalyst (momentum shift)")
print(f"  - w: Work (directional force)")
print(f"  - k: Kinetic Energy (alternative)")
print(f"  - tau: Torque (rotational force)")
print(f"  - M: Momentum (mass * velocity)")
print(f"  - g: Gravitational force (mean reversion)")

Physics-based features created (6-period):
  - v: Velocity (rate of return change)
  - a: Acceleration (rate of velocity change)
  - m: Mass (accumulated returns)
  - f: Force (mass * acceleration)
  - T: Kinetic Energy
  - cat: Catalyst (momentum shift)
  - w: Work (directional force)
  - k: Kinetic Energy (alternative)
  - tau: Torque (rotational force)
  - M: Momentum (mass * velocity)
  - g: Gravitational force (mean reversion)
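
Two of the formulas above simplify algebraically, which the following sketch checks numerically on synthetic returns. The random series and the sample `cat` values are assumptions for illustration; only the expressions themselves come from the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ret = pd.Series(rng.normal(0, 0.01, 50))  # synthetic returns, for illustration only
p = 6

m = ret.rolling(p).sum()
s = ret.rolling(p).sum()                  # same quantity as m

# g as written in the notebook: m / s is 1, so g equals m squared
g = m * s / (m / s) ** 2
print(np.allclose(g.dropna(), (m ** 2).dropna()))

# cos(arctan(x)) = 1 / sqrt(1 + x^2), which is close to 1 for small x,
# so w = f * cos(arctan(cat)) stays close to f when cat is small
cat = pd.Series([0.003, -0.0175, 0.016])  # magnitudes similar to the cat column's range
cos_theta = np.cos(np.arctan(cat))
print(cos_theta.min())
```

This suggests `g` carries the same information as `m` up to sign, and `w` is nearly a copy of `f`, which may matter when checking the feature set for redundancy.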


In [8]:
# Check data quality after feature engineering
# Rolling calculations introduce NaN values at the beginning of the series

print("Data quality check after feature engineering:")
print(f"\nDataset shape: {df.shape}")
print(f"\nMissing values per column:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")
print(f"\nRows with any missing values: {df.isnull().any(axis=1).sum()}")

Data quality check after feature engineering:

Dataset shape: (4116, 23)

Missing values per column:
Close          0
High           0
Low            0
Open           0
pct_change     1
target         0
mm_std6        6
std6           6
mm_std12      12
std12         12
RSL_6          6
RSL_12        12
v              7
a             13
m              6
f             13
T              7
cat            6
w             13
k              6
tau           13
M              7
g              6
dtype: int64

Total missing values: 152

Rows with any missing values: 13
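
The per-column counts above follow from chaining `pct_change` with `shift(p)` differences: each step extends the leading run of NaN values. A minimal sketch on a synthetic price series (the series itself is an assumption) shows where the counts 1, 7, and 13 come from.

```python
import numpy as np
import pandas as pd

p = 6
close = pd.Series(np.linspace(1.0, 2.0, 40))  # synthetic prices, for illustration only

ret = close.pct_change()        # 1 leading NaN
v = (ret - ret.shift(p)) / p    # 1 + p leading NaN
a = (v - v.shift(p)) / p        # 1 + 2p leading NaN

print(ret.isna().sum(), v.isna().sum(), a.isna().sum())
```

The longest warm-up, `1 + 2 * p`, determines how many rows contain at least one missing value.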


In [9]:
# Handle missing values
# Remove rows with NaN values introduced by pct_change, shift, and rolling calculations
# These occur only in the first 13 rows; target has no NaN because the comparison in cell 4 maps NaN to 0

rows_before = len(df)
df.dropna(inplace=True)
rows_after = len(df)
rows_dropped = rows_before - rows_after

print(f"Missing value handling:")
print(f"  Rows before: {rows_before}")
print(f"  Rows after: {rows_after}")
print(f"  Rows dropped: {rows_dropped}")
print(f"\nFinal dataset shape: {df.shape}")
print(f"Date range: {df.index.min().strftime('%Y-%m-%d')} to {df.index.max().strftime('%Y-%m-%d')}")

Missing value handling:
  Rows before: 4116
  Rows after: 4103
  Rows dropped: 13

Final dataset shape: (4103, 23)
Date range: 2010-01-21 to 2025-10-24


In [10]:
# Save processed features for model training
# Prepare dataset by removing intermediate columns and keeping only features and target

# Create output directory if it doesn't exist
processed_dir = os.path.dirname(PROCESSED_DATA_PATH)
os.makedirs(processed_dir, exist_ok=True)

# Drop the OHLC price columns and the intermediate pct_change column by name
# (rather than by column position); only the target and engineered features remain
df_to_save = df.drop(columns=['Close', 'High', 'Low', 'Open', 'pct_change'])

# Save to CSV at the path configured in cell 2
output_path = PROCESSED_DATA_PATH
df_to_save.to_csv(output_path)

print(f"Feature dataset saved successfully:")
print(f"  Path: {output_path}")
print(f"  Shape: {df_to_save.shape}")
print(f"  Features: {df_to_save.shape[1] - 1} (excluding target)")
print(f"  Samples: {len(df_to_save)}")
print(f"  Columns: {list(df_to_save.columns)}")

Feature dataset saved successfully:
  Path: ../data/processed/BRL_X_features.csv
  Shape: (4103, 18)
  Features: 17 (excluding target)
  Samples: 4103
  Columns: ['target', 'mm_std6', 'std6', 'mm_std12', 'std12', 'RSL_6', 'RSL_12', 'v', 'a', 'm', 'f', 'T', 'cat', 'w', 'k', 'tau', 'M', 'g']


In [11]:
# Verify saved features by loading and displaying sample
# This confirms the file was saved correctly and is ready for the next notebook

df_verification = pd.read_csv(output_path, index_col=0)

print("Verification of saved features:")
print(f"\nShape: {df_verification.shape}")
print(f"Columns: {list(df_verification.columns)}")
print(f"\nRandom sample of 5 records:")
df_verification.sample(5)

Verification of saved features:

Shape: (4103, 18)
Columns: ['target', 'mm_std6', 'std6', 'mm_std12', 'std12', 'RSL_6', 'RSL_12', 'v', 'a', 'm', 'f', 'T', 'cat', 'w', 'k', 'tau', 'M', 'g']

Random sample of 5 records:


Date,target,mm_std6,std6,mm_std12,std12,RSL_6,RSL_12,v,a,m,f,T,cat,w,k,tau,M,g
2015-09-02,1,0.006739,0.012821,0.005054,0.01071,2.005352,3.007,0.000988,-0.00056,0.040434,-2.3e-05,1.97523e-08,0.003363,-2.3e-05,0.000816,-7.614281e-08,4e-05,0.001635
2019-10-10,1,-0.001836,0.008569,-0.001054,0.006204,-3.317254,-5.035513,0.000717,0.000209,-0.011013,-2e-06,-2.830226e-09,-0.002544,-2e-06,0.000168,5.862569e-09,-8e-06,0.000121
2020-06-30,1,0.003084,0.022801,0.007052,0.019065,-5.661238,-3.038602,-0.000313,0.000311,0.018506,6e-06,9.083702e-10,0.005481,6e-06,0.000609,3.149935e-08,-6e-06,0.000342
2015-04-27,0,-0.003903,0.006233,-0.002676,0.009283,0.743265,1.542699,-0.000732,-0.000757,-0.023421,1.8e-05,-6.275629e-09,-0.002769,1.8e-05,0.000389,-4.911178e-08,1.7e-05,0.000549
2025-09-11,0,-0.001986,0.003607,-0.000112,0.00373,1.928264,50.781809,-0.001846,-0.000513,-0.011916,6e-06,-2.029852e-08,-0.001017,6e-06,7.3e-05,-6.217319e-09,2.2e-05,0.000142


In [12]:
# Display descriptive statistics of engineered features
print("Descriptive statistics of engineered features:")
df_verification.describe()

Descriptive statistics of engineered features:


statistic,target,mm_std6,std6,mm_std12,std12,RSL_6,RSL_12,v,a,m,f,T,cat,w,k,tau,M,g
count,4103.0,4103.0,4103.0,4103.0,4103.0,4103.0,4103.0,4103.0,4103.0,4103.0,4103.0,4103.0,4103.0,4103.0,4103.0,4103.0,4103.0,4103.0
mean,0.503778,0.000329,0.009542,0.000332,0.009809,-0.049125,-2.959601,-2e-06,-3.257542e-07,0.001974,4.058609e-06,1.308728e-08,0.000275,4.058568e-06,0.00041,2.126913e-10,2e-05,0.0005075496
std,0.500047,0.003741,0.005323,0.002601,0.00449,77.527837,127.361578,0.002573,0.0007456791,0.022445,2.03963e-05,3.056797e-07,0.003426,2.03958e-05,0.00077,1.450543e-07,7.6e-05,0.0008967892
min,0.0,-0.016556,0.00058,-0.011545,0.001274,-1390.787268,-4359.59169,-0.017409,-0.00517406,-0.099337,-0.0001098325,-8.699774e-06,-0.017546,-0.0001098273,-0.000986,-2.870422e-06,-0.000372,1.907292e-10
25%,0.0,-0.001952,0.005963,-0.001321,0.006658,-2.203472,-3.037414,-0.001378,-0.0003943993,-0.011713,-2.170031e-06,-7.722312e-09,-0.001796,-2.170028e-06,1.7e-05,-6.93804e-09,-6e-06,3.421024e-05
50%,1.0,0.000253,0.008272,0.000204,0.008891,-0.031399,0.049583,3.6e-05,8.77309e-06,0.001517,6.117699e-07,6.620042e-11,0.000222,6.117678e-07,0.000126,2.65066e-12,3e-06,0.0001721186
75%,1.0,0.002434,0.011816,0.00188,0.011823,2.367262,3.525318,0.001402,0.0004025825,0.014602,6.686474e-06,1.141378e-08,0.002242,6.686446e-06,0.000476,6.84715e-09,2.7e-05,0.0005863488
max,1.0,0.016141,0.04032,0.011341,0.031838,3337.326149,2330.570441,0.01386,0.004115409,0.096847,0.0002970379,6.076659e-06,0.015844,0.0002970347,0.009407,2.66166e-06,0.000999,0.009867764


## Summary

Feature engineering completed successfully:
- Created target variable (binary price direction prediction)
- Engineered 4 statistical features (moving averages and standard deviations)
- Calculated 2 momentum indicators (Relative Strength Level)
- Generated 11 physics-based features (velocity, acceleration, force, energy, etc.)
- Handled missing values from rolling calculations
- Saved to `data/processed/BRL_X_features.csv`

**Final Dataset:**
- Total features: 17 (excluding target)
- Target variable: Binary classification (0=down or unchanged, 1=up)
- All features are numeric and ready for machine learning

## Next Steps
Proceed to `03_lazyclassifier_evaluation.ipynb` to:
- Load the engineered features
- Split data into train/test sets
- Evaluate multiple classification algorithms
- Compare model performance
- Identify best performing models for ensemble