
# ⚙️ 03 - Model Training for BTC/USDT

**Objective:**  
In this notebook, we will **train a baseline classifier** to predict whether the BTC/USDT close will be higher or lower one minute ahead. We'll use the features generated in `02_feature_engineering.ipynb`.

---

## 📌 Overview

1. **Load the enhanced dataset** (`BTCUSDT_1m_features.csv`).
2. **Define a target variable** (e.g., predict if the next close is higher than the current close).
3. **Split the data** into train and test sets.
4. **Train a baseline model** (e.g., Random Forest).
5. **Evaluate performance** with accuracy, confusion matrix, or other metrics.
6. **Discuss next steps** (hyperparameter tuning, model improvement, real-time integration).

---

## 💡 Why This Matters

- Engineered features, a well-defined target variable, and a split that respects time order together give a sound starting point for AI-based trading signals.
- The results show whether your features carry any predictive signal or whether you need more advanced engineering.


In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# For the model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# 1) Load the feature-enhanced dataset
df = pd.read_csv('../data/BTCUSDT_1m_features.csv')

print("Data loaded. Here are the first 5 rows:")
df.head()


Data loaded. Here are the first 5 rows:


| | open_time | open | high | low | close | volume | close_time | quote_asset_volume | number_of_trades | taker_buy_volume | ... | day_of_week | hour_of_day | ma_14 | ema_14 | bb_upper | bb_lower | rsi_14 | close_lag1 | close_lag2 | returns_1m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-01-01 00:00:00 | 42283.58 | 42298.62 | 42261.02 | 42298.61 | 35.92724 | 2024-01-01 00:00:59.999 | 1519032.0 | 1327 | 23.18766 | ... | 0 | 0 | NaN | 42298.61 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2024-01-01 00:01:00 | 42298.62 | 42320.0 | 42298.61 | 42320.0 | 21.16779 | 2024-01-01 00:01:59.999 | 895580.9 | 1348 | 13.47483 | ... | 0 | 0 | NaN | 42301.462 | NaN | NaN | NaN | 42298.61 | NaN | 0.000506 |
| 2 | 2024-01-01 00:02:00 | 42319.99 | 42331.54 | 42319.99 | 42325.5 | 21.60391 | 2024-01-01 00:02:59.999 | 914371.1 | 1019 | 11.21801 | ... | 0 | 0 | NaN | 42304.667067 | NaN | NaN | NaN | 42320.0 | 42298.61 | 0.00013 |
| 3 | 2024-01-01 00:03:00 | 42325.5 | 42368.0 | 42325.49 | 42367.99 | 30.5073 | 2024-01-01 00:03:59.999 | 1291997.0 | 1241 | 24.04878 | ... | 0 | 0 | NaN | 42313.110124 | NaN | NaN | NaN | 42325.5 | 42320.0 | 0.001004 |
| 4 | 2024-01-01 00:04:00 | 42368.0 | 42397.23 | 42367.99 | 42397.23 | 46.05107 | 2024-01-01 00:04:59.999 | 1951945.0 | 1415 | 34.12804 | ... | 0 | 0 | NaN | 42324.326108 | NaN | NaN | NaN | 42367.99 | 42325.5 | 0.00069 |



## 🏷️ 1. Defining Our Target Variable

We want a simple target for now:
- **`target` = 1 if the next close is strictly higher than the current close**, else 0 (an unchanged close counts as 0).

This is a classic **binary classification** approach: will the price go up or not?


In [2]:

# We'll shift the 'close' price by -1 to get the "next" bar's close
df['future_close'] = df['close'].shift(-1)

# Define target: 1 if future close > current close, else 0
df['target'] = (df['future_close'] > df['close']).astype(int)

# Drop rows with NaNs created by shifting (the very last row)
df.dropna(subset=['future_close'], inplace=True)

# Quick check
df[['close','future_close','target']].tail(10)


| | close | future_close | target |
|---|---|---|---|
| 14989 | 46271.09 | 46280.01 | 1 |
| 14990 | 46280.01 | 46265.99 | 0 |
| 14991 | 46265.99 | 46248.36 | 0 |
| 14992 | 46248.36 | 46256.47 | 1 |
| 14993 | 46256.47 | 46227.44 | 0 |
| 14994 | 46227.44 | 46223.16 | 0 |
| 14995 | 46223.16 | 46220.65 | 0 |
| 14996 | 46220.65 | 46234.01 | 1 |
| 14997 | 46234.01 | 46249.18 | 1 |
| 14998 | 46249.18 | 46220.12 | 0 |



## 🌐 2. Selecting Features for the Model

We'll pick columns that might help predict future price movements:
- **Technical indicators** (ma_14, ema_14, rsi_14, etc.)
- **Time-based** (day_of_week, hour_of_day)
- **Lagged data** (close_lag1, etc.)
- **Returns or volume**

We exclude anything that leaks future information (such as `future_close` itself). The raw `close` is also left out here so the model sees only derived features, though adding it back is a reasonable experiment.
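One way to sanity-check a feature for leakage is to confirm that a lag column really holds past values. The sketch below is only an illustration: the helper name `is_backward_looking` is hypothetical, and the sample values are the first five `close` and `close_lag1` entries from the loaded dataset shown earlier. The `leaky` column is deliberately built from the next bar to show what a failing check looks like.

```python
import numpy as np
import pandas as pd

def is_backward_looking(frame, lag_col, source_col, lag):
    """Return True if lag_col equals source_col shifted `lag` rows into the past."""
    expected = frame[source_col].shift(lag)
    # Compare only rows where both the expected and the stored value exist
    both = expected.notna() & frame[lag_col].notna()
    return bool(np.allclose(frame.loc[both, lag_col], expected[both]))

sample = pd.DataFrame({
    "close": [42298.61, 42320.0, 42325.5, 42367.99, 42397.23],
    "close_lag1": [np.nan, 42298.61, 42320.0, 42325.5, 42367.99],
})
# Deliberately leaky column: it contains the NEXT bar's close
sample["leaky"] = sample["close"].shift(-1)

ok_lag = is_backward_looking(sample, "close_lag1", "close", lag=1)
ok_leaky = is_backward_looking(sample, "leaky", "close", lag=1)
print(ok_lag, ok_leaky)
```

A check like this only covers simple lag columns; rolling indicators would need their own comparison against a recomputation on past data.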


In [3]:

# Let's define a list of feature columns we want to use
feature_cols = [
    'ma_14','ema_14','bb_upper','bb_lower','rsi_14',
    'close_lag1','close_lag2','returns_1m',
    'day_of_week','hour_of_day','volume'
]

# Some columns may be missing if you changed the feature-engineering step,
# so keep only the ones actually present in df to avoid a KeyError
available_cols = [c for c in feature_cols if c in df.columns]
X = df[available_cols].copy()

print("Feature columns being used:\n", available_cols)

# The target
y = df['target']

print(f"X shape: {X.shape}, y shape: {y.shape}")
X.head()


Feature columns being used:
 ['ma_14', 'ema_14', 'bb_upper', 'bb_lower', 'rsi_14', 'close_lag1', 'close_lag2', 'returns_1m', 'day_of_week', 'hour_of_day', 'volume']
X shape: (14999, 11), y shape: (14999,)


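| | ma_14 | ema_14 | bb_upper | bb_lower | rsi_14 | close_lag1 | close_lag2 | returns_1m | day_of_week | hour_of_day | volume |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 42298.61 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 35.92724 |
| 1 | NaN | 42301.462 | NaN | NaN | NaN | 42298.61 | NaN | 0.000506 | 0 | 0 | 21.16779 |
| 2 | NaN | 42304.667067 | NaN | NaN | NaN | 42320.0 | 42298.61 | 0.00013 | 0 | 0 | 21.60391 |
| 3 | NaN | 42313.110124 | NaN | NaN | NaN | 42325.5 | 42320.0 | 0.001004 | 0 | 0 | 30.5073 |
| 4 | NaN | 42324.326108 | NaN | NaN | NaN | 42367.99 | 42325.5 | 0.00069 | 0 | 0 | 46.05107 |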



## 🧪 3. Splitting the Data into Train and Test

We'll do a simple chronological split:
- With time-series data, a random split mixes future rows into the training set, which inflates test scores.
- As a quick approach, we call `train_test_split` with **shuffle=False**, so the first 80% of rows train the model and the last 20% test it.

Note: For a more rigorous evaluation, consider a walk-forward split or scikit-learn's time-series split.
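The walk-forward idea from the note can be sketched with scikit-learn's `TimeSeriesSplit`, which yields expanding training windows that are each followed by a later test window. The array `X_demo` is a synthetic stand-in (an assumption, not the notebook's data), and `n_splits=4` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Synthetic, time-ordered stand-in for the feature matrix: 100 rows, 3 features
X_demo = np.arange(300).reshape(100, 3)

tscv = TimeSeriesSplit(n_splits=4)
folds = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_demo), start=1):
    folds.append((train_idx, test_idx))
    # Every training row comes strictly before every test row
    print(f"Fold {fold}: train rows 0-{train_idx[-1]}, "
          f"test rows {test_idx[0]}-{test_idx[-1]}")
```

In each fold the model would be fit on the training rows and scored on the test rows, and the per-fold scores averaged.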


In [4]:

# Drop rows with any missing feature values
X = X.dropna()
y = y.loc[X.index]  # Keep only the targets for the remaining rows

X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.2, 
    shuffle=False  # preserve time order
)

print("Train set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])


Train set size: 11984
Test set size: 2996



## 🏋️ 4. Training a Basic Random Forest

We'll use `RandomForestClassifier` as a baseline. Later, you can try other models:
- LogisticRegression
- XGBoost
- Neural networks

You can also tune the hyperparameters of any of them.

**Warning:** This is just a quick demonstration. Serious trading models need thorough tuning, cross-validation, etc.


In [5]:

# Instantiate the model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train on the training set
model.fit(X_train, y_train)

print("Model trained successfully!")

# Let's get feature importances
importances = model.feature_importances_
for col, imp in zip(X_train.columns, importances):
    print(f"{col}: {imp:.4f}")


Model trained successfully!
ma_14: 0.0940
ema_14: 0.0924
bb_upper: 0.0998
bb_lower: 0.0994
rsi_14: 0.1133
close_lag1: 0.1002
close_lag2: 0.1002
returns_1m: 0.1166
day_of_week: 0.0160
hour_of_day: 0.0503
volume: 0.1178



## 📊 5. Evaluation

We'll use:
- **Accuracy Score**: the fraction of test bars whose up/down direction was predicted correctly.
- **Confusion Matrix**: counts of correct and incorrect predictions per class, which exposes false positives and false negatives.
- **Classification Report**: precision, recall, and F1 for each class.

**Caution**: For trading, accuracy alone is not enough. You eventually want to see real PnL (profit/loss) from a backtest. But let's do a quick check here.
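To make the PnL point concrete, here is a minimal sketch of a naive long/flat rule: hold a long position over the next bar when the prediction is 1, otherwise stay out. The function name `long_flat_returns` and the toy prices are hypothetical, and the sketch ignores fees, slippage, and position sizing, so it is not a substitute for a proper backtest.

```python
import pandas as pd

def long_flat_returns(close: pd.Series, signal: pd.Series) -> pd.Series:
    """Per-bar strategy return: long over the next bar when signal == 1, else flat."""
    next_ret = close.shift(-1) / close - 1  # return from bar t to bar t+1
    # The last bar has no next close, so its return is NaN and is dropped
    return (signal * next_ret).dropna()

# Toy data (hypothetical values, not taken from the dataset)
close = pd.Series([100.0, 101.0, 100.0, 102.0])
signal = pd.Series([1, 0, 1, 0])  # stand-in for the model's 0/1 predictions

strat = long_flat_returns(close, signal)
total = (1 + strat).prod() - 1  # compounded return over the period
print(f"Total return: {total:.4f}")
```

In this notebook, `signal` could be `pd.Series(y_pred, index=X_test.index)` and `close` the matching `df.loc[X_test.index, 'close']`, assuming the test rows are consecutive bars.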


In [6]:

# Predict on the test set
y_pred = model.predict(X_test)

# Basic Accuracy
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy on test set: {acc:.3f}")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# Classification Report
report = classification_report(y_test, y_pred, digits=3)
print("Classification Report:")
print(report)


Accuracy on test set: 0.500
Confusion Matrix:
[[693 777]
 [721 805]]
Classification Report:
              precision    recall  f1-score   support

           0      0.490     0.471     0.481      1470
           1      0.509     0.528     0.518      1526

    accuracy                          0.500      2996
   macro avg      0.499     0.499     0.499      2996
weighted avg      0.500     0.500     0.500      2996




## 🏁 6. Conclusion & Next Steps

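- **You've trained a basic Random Forest** on your engineered features to predict whether the price will go up in the next minute.
- The test **accuracy** of 0.500 is no better than a coin flip: always predicting "up" would score about 0.509, since 1,526 of the 2,996 test rows are class 1. Treat it as a baseline, not a signal. For intraday trading, you also need to: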
  1. Perform **hyperparameter tuning** (grid search or random search).
  2. Use **walk-forward validation** or a more robust **time-series cross-validation**.
  3. Integrate your predictions into a **backtesting** environment to see actual profit/loss.
  4. Possibly define different target horizons (like next 5 or 15 minutes) if 1 minute is too noisy.
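Steps 1 and 2 can be combined by passing a time-series splitter as the `cv` argument of a grid search. The sketch below uses synthetic random data and a deliberately tiny, illustrative parameter grid (both are assumptions), so the resulting score is meaningless; only the wiring is the point.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))             # synthetic features
y_demo = (rng.random(300) > 0.5).astype(int)   # synthetic up/down labels

# Illustrative grid only; a real search would cover more values
param_grid = {"n_estimators": [20, 50], "max_depth": [3, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),  # each fold trains on the past, validates on later rows
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

With the notebook's data, `X_train` and `y_train` would take the place of the synthetic arrays, keeping `X_test` untouched for the final check.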

### Ideas to Improve
- **Add more advanced features** (MACD, Ichimoku, advanced volume analysis).
- **Create a more robust target** (e.g., price rising by at least 0.2% or 0.5%).
- **Explore neural networks** or other advanced architectures.
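As an illustration of the more robust target idea, the sketch below labels a bar 1 only when the close rises by at least a given fraction over a chosen horizon. The function name `make_threshold_target`, the default horizon, and the toy prices are assumptions for demonstration.

```python
import pandas as pd

def make_threshold_target(close: pd.Series, horizon: int = 5,
                          threshold: float = 0.002) -> pd.Series:
    """1.0 if close rises by at least `threshold` over the next `horizon` bars, else 0.0.

    Rows without a full look-ahead window are left as NaN so they can be dropped.
    """
    future_ret = close.shift(-horizon) / close - 1
    return (future_ret >= threshold).astype(float).where(future_ret.notna())

# Toy prices (hypothetical values)
close = pd.Series([100.0, 100.1, 100.5, 100.2, 100.3])
target = make_threshold_target(close, horizon=2, threshold=0.002)
print(target.tolist())
```

A threshold target is usually more imbalanced than the plain up/down label, so precision and recall for class 1 become more informative than accuracy.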

---

**Next:**  
Once the model's predictions show some potential, hook them into a backtesting script (`backtesting.py`) to see how they perform financially, not just in accuracy terms.
