# OANDA REST-V20 API Data Fetcher

This notebook demonstrates how to:
1. Download historical candle data for a given instrument, time span, and granularity using the OANDA REST-V20 API.
2. Build features from the tick data and save them as Parquet, as in step 1.
3. Split the data into three time intervals: training, evaluation, and test.

Optionally, it also lists all instruments available through the OANDA REST-V20 API.

## Setup

Before running this notebook:
1. Copy `.env.example` to `.env`
2. Fill in your OANDA API token and account ID in the `.env` file
3. Install requirements: `pip install -r requirements.txt`
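
To check that the `.env` file is filled in before running the cells below, a small parser can help. This is a minimal sketch using only the standard library; the helper `load_env_file` and the variable names `OANDA_API_TOKEN` and `OANDA_ACCOUNT_ID` are assumptions for illustration, so check `.env.example` for the names the project actually expects.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a dotenv-style file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


env = load_env_file()
# Hypothetical variable names; the real ones are defined in .env.example.
required = ("OANDA_API_TOKEN", "OANDA_ACCOUNT_ID")
missing = [k for k in required if not (env.get(k) or os.environ.get(k))]
print("Missing settings:", missing)
```

An empty list means both values were found either in the file or in the process environment.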

In [1]:
import pandas as pd
import numpy as np
import os

In [1]:
%run ../risk_estimator/config.py

config = get_config()

In [None]:
%run ../scripts/fetch_data.py
    
# Careful! Takes 3h to fetch!

# fetch_interval_complete(config['instrument'],
#                         config['start_date'],
#                         config['end_date'],
#                         granularity=config['timeframe'],
#                         price_types=('M','B','A'),
#                         chunk_hours=6,
#                         save_path=config['raw_data_path'])
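
The `chunk_hours=6` argument suggests that the interval is fetched in fixed-size time windows. The sketch below shows one way such windows could be generated; `iter_time_chunks` is a hypothetical helper, not the code in `fetch_data.py`. If the granularity is S5 (as the file names further down suggest), six hours is 4320 candles, which would stay under the 5000-candle per-request limit that the v20 candles endpoint is understood to impose (worth verifying against the OANDA documentation).

```python
from datetime import datetime, timedelta, timezone


def iter_time_chunks(start, end, chunk_hours=6):
    """Yield (chunk_start, chunk_end) pairs that cover [start, end) without gaps."""
    step = timedelta(hours=chunk_hours)
    cursor = start
    while cursor < end:
        # The last chunk is shortened so it never runs past `end`
        chunk_end = min(cursor + step, end)
        yield cursor, chunk_end
        cursor = chunk_end


start = datetime(2012, 1, 1, tzinfo=timezone.utc)
end = datetime(2012, 1, 2, tzinfo=timezone.utc)
chunks = list(iter_time_chunks(start, end, chunk_hours=6))
print(len(chunks))  # one day split into six-hour windows gives 4 chunks
```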

In [None]:
%run ../scripts/build_features.py

# Careful! Building features for 44M rows takes ~30min

# src  = config['raw_data_path']
# dest = config['feature_data_path']
# col0 = config['vol_source_col_name']
# col1 = config['vol_target_col_name']
# freq = config['vol_shift_freq']
# build_features(src, dest, col0, col1, shift_freq=freq)
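
The `shift_freq` argument indicates that the target column is the source column observed some fixed time later. As an illustration only, the sketch below builds such a column by relabelling timestamps; `future_shifted` is a hypothetical stand-in and may differ from what `build_features.py` does. It assumes a unique `DatetimeIndex`.

```python
import pandas as pd


def future_shifted(df, col, new_col, freq):
    """Add `new_col` holding the value of `col` observed `freq` later."""
    out = df.copy()
    shifted = out[col].copy()
    # Relabel each observation with an earlier timestamp, so the value
    # recorded at t + freq ends up at t.
    shifted.index = shifted.index - pd.Timedelta(freq)
    # Timestamps with no observation exactly `freq` later become NaN.
    out[new_col] = shifted.reindex(out.index)
    return out


rng = pd.date_range("2024-01-01", periods=10, freq="1min")
toy = pd.DataFrame({"x": range(10)}, index=rng)
toy = future_shifted(toy, "x", "x_future", "3min")
print(toy)
```

Because the shift is defined in time rather than in rows, gaps in the index produce NaN targets instead of silently pairing a row with the wrong future candle.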

In [2]:

%run ../scripts/split_dataset.py
src = config['feature_data_path']
dest = config['split_dir']
train_start       = config['train_start']
train_cutoff      = config['train_cutoff']
val_cutoff        = config['val_cutoff']
split_processed_parquet(src, dest, train_start, train_cutoff, val_cutoff)


Appended 1014511 rows to ../data/split_2012_2014_2015/train.parquet (row_group 18)
Appended 1048576 rows to ../data/split_2012_2014_2015/train.parquet (row_group 19)
Appended 1048576 rows to ../data/split_2012_2014_2015/train.parquet (row_group 20)
Appended 1048576 rows to ../data/split_2012_2014_2015/train.parquet (row_group 21)
Appended 1048576 rows to ../data/split_2012_2014_2015/train.parquet (row_group 22)
Appended 1048576 rows to ../data/split_2012_2014_2015/train.parquet (row_group 23)
Appended 1048576 rows to ../data/split_2012_2014_2015/train.parquet (row_group 24)
Appended 1048576 rows to ../data/split_2012_2014_2015/train.parquet (row_group 25)
Appended 1048576 rows to ../data/split_2012_2014_2015/train.parquet (row_group 26)
Appended 1048576 rows to ../data/split_2012_2014_2015/train.parquet (row_group 27)
Appended 1048576 rows to ../data/split_2012_2014_2015/train.parquet (row_group 28)
Appended 1048576 rows to ../data/split_2012_2014_2015/train.parquet (row_group 29)
Appe
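
For reference, a chronological three-way split can be sketched on a small in-memory frame as follows. `chronological_split` is a hypothetical helper, and the half-open boundaries (start inclusive, cutoff exclusive) are an assumption; `split_processed_parquet` may treat the cutoffs differently and works on Parquet row groups rather than a single DataFrame. The example dates are guessed from the output directory name.

```python
import pandas as pd


def chronological_split(df, train_start, train_cutoff, val_cutoff):
    """Split a DatetimeIndex-ed frame into train, validation, and test by time."""
    train_start = pd.Timestamp(train_start)
    train_cutoff = pd.Timestamp(train_cutoff)
    val_cutoff = pd.Timestamp(val_cutoff)
    idx = df.index
    train = df[(idx >= train_start) & (idx < train_cutoff)]
    val = df[(idx >= train_cutoff) & (idx < val_cutoff)]
    test = df[idx >= val_cutoff]
    return train, val, test


days = pd.date_range("2012-01-01", "2015-12-31", freq="D")
toy = pd.DataFrame({"x": range(len(days))}, index=days)
train, val, test = chronological_split(toy, "2012-01-01", "2014-01-01", "2015-01-01")
print(len(train), len(val), len(test))
```

Splitting by time rather than at random keeps every validation and test row strictly later than the training rows, which avoids look-ahead leakage in time-series models.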

# Auxiliary experiments (kept to explain pitfalls)

## Function: List All Available Instruments

In [None]:
import requests
from datetime import datetime, timezone

%run ../scripts/fetch_data.py

def list_instruments():
    """
    List all instruments available in the OANDA account.
    
    Returns:
        pandas.DataFrame: DataFrame containing instrument details
    """
    url = f"{API_URL}/v3/accounts/{ACCOUNT_ID}/instruments"
    
    headers = {
        'Authorization': f'Bearer {API_TOKEN}',
        'Content-Type': 'application/json'
    }
    
    try:
        # A timeout keeps the cell from hanging indefinitely on network problems
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        
        data = response.json()
        instruments = data.get('instruments', [])
        
        # Create a DataFrame with relevant information
        instruments_data = []
        for inst in instruments:
            instruments_data.append({
                'name': inst.get('name'),
                'type': inst.get('type'),
                'displayName': inst.get('displayName'),
                'pipLocation': inst.get('pipLocation'),
                'displayPrecision': inst.get('displayPrecision'),
                'tradeUnitsPrecision': inst.get('tradeUnitsPrecision'),
                'minimumTradeSize': inst.get('minimumTradeSize'),
                'maximumTrailingStopDistance': inst.get('maximumTrailingStopDistance'),
                'minimumTrailingStopDistance': inst.get('minimumTrailingStopDistance')
            })
        
        df = pd.DataFrame(instruments_data)
        print(f"Found {len(df)} instruments")
        return df
        
    except requests.exceptions.RequestException as e:
        print(f"Error fetching instruments: {e}")
        # e.response is None when no HTTP response was received (e.g. connection error)
        if e.response is not None:
            print(f"Response: {e.response.text}")
        return None

## Test: List All Instruments

In [19]:
# Get all available instruments
instruments_df = list_instruments()

if instruments_df is not None:
    # Display first few instruments
    print("\nFirst 10 instruments:")
    display(instruments_df.head(10))
    
    # Display instrument types
    print("\nInstrument types available:")
    print(instruments_df['type'].value_counts())
    
    # Display all instrument names
    print("\nAll instrument names:")
    print(instruments_df['name'].tolist())

Found 68 instruments

First 10 instruments:


| | name | type | displayName | pipLocation | displayPrecision | tradeUnitsPrecision | minimumTradeSize | maximumTrailingStopDistance | minimumTrailingStopDistance |
|---|---|---|---|---|---|---|---|---|---|
| 0 | TRY_JPY | CURRENCY | TRY/JPY | -2 | 3 | 0 | 1 | 100.0 | 0.05 |
| 1 | AUD_JPY | CURRENCY | AUD/JPY | -2 | 3 | 0 | 1 | 100.0 | 0.05 |
| 2 | USD_CNH | CURRENCY | USD/CNH | -4 | 5 | 0 | 1 | 1.0 | 0.0005 |
| 3 | NZD_JPY | CURRENCY | NZD/JPY | -2 | 3 | 0 | 1 | 100.0 | 0.05 |
| 4 | EUR_GBP | CURRENCY | EUR/GBP | -4 | 5 | 0 | 1 | 1.0 | 0.0005 |
| 5 | CHF_HKD | CURRENCY | CHF/HKD | -4 | 5 | 0 | 1 | 1.0 | 0.0005 |
| 6 | USD_CZK | CURRENCY | USD/CZK | -4 | 5 | 0 | 1 | 1.0 | 0.0005 |
| 7 | NZD_HKD | CURRENCY | NZD/HKD | -4 | 5 | 0 | 1 | 1.0 | 0.0005 |
| 8 | EUR_NOK | CURRENCY | EUR/NOK | -4 | 5 | 0 | 1 | 1.0 | 0.0005 |
| 9 | USD_CAD | CURRENCY | USD/CAD | -4 | 5 | 0 | 1 | 1.0 | 0.0005 |



Instrument types available:
type
CURRENCY    68
Name: count, dtype: int64

All instrument names:
['TRY_JPY', 'AUD_JPY', 'USD_CNH', 'NZD_JPY', 'EUR_GBP', 'CHF_HKD', 'USD_CZK', 'NZD_HKD', 'EUR_NOK', 'USD_CAD', 'EUR_AUD', 'EUR_SGD', 'USD_HKD', 'CAD_HKD', 'USD_CHF', 'AUD_HKD', 'NZD_CHF', 'AUD_CHF', 'GBP_CHF', 'USD_THB', 'EUR_HKD', 'CHF_JPY', 'GBP_HKD', 'EUR_NZD', 'AUD_SGD', 'EUR_JPY', 'EUR_TRY', 'USD_JPY', 'SGD_JPY', 'GBP_ZAR', 'ZAR_JPY', 'USD_SEK', 'GBP_SGD', 'CAD_CHF', 'AUD_NZD', 'HKD_JPY', 'USD_NOK', 'GBP_AUD', 'USD_PLN', 'EUR_ZAR', 'NZD_USD', 'USD_ZAR', 'CAD_JPY', 'CAD_SGD', 'USD_HUF', 'EUR_CAD', 'CHF_ZAR', 'USD_DKK', 'EUR_HUF', 'EUR_CHF', 'EUR_DKK', 'EUR_USD', 'EUR_CZK', 'NZD_CAD', 'SGD_CHF', 'GBP_JPY', 'USD_TRY', 'GBP_PLN', 'AUD_USD', 'GBP_USD', 'USD_MXN', 'GBP_CAD', 'AUD_CAD', 'EUR_PLN', 'GBP_NZD', 'EUR_SEK', 'USD_SGD', 'NZD_SGD']


In [None]:
# Question: are there missing data points in the downloaded data?
# Explanation: this cell was used for 2005-2015 data, in yearly chunks.

import os
import pandas as pd

# Check every downloaded file for NaNs
data_dir = 'data/raw'
files = sorted([f for f in os.listdir(data_dir) if f.endswith('.parquet')])
print(f"Checking {len(files)} parquet files for missing data...")
for f in files:
    df = pd.read_parquet(os.path.join(data_dir, f))
    if df.isnull().values.any():
        missing_percentage = df.isnull().mean().mean() * 100
        print(f"Missing data found in file {f}: percentage of missing values: {missing_percentage:.2f}%")
    else:
        print(f"No missing data in file {f}.")

# Conclusion: Missing data found in 2005 (22%) and 2008 (0.58%); other years are clean.
# Decided: start analysis from 2009 onward (see features.py).

Checking 11 parquet files for missing data...
Missing data found in file EUR_CHF_2005_S5_BA.parquet: percentage of missing values: 22.47%
No missing data in file EUR_CHF_2006_S5_BA.parquet.
No missing data in file EUR_CHF_2007_S5_BA.parquet.
Missing data found in file EUR_CHF_2008_S5_BA.parquet: percentage of missing values: 0.58%
No missing data in file EUR_CHF_2009_S5_BA.parquet.
No missing data in file EUR_CHF_2010_S5_BA.parquet.
No missing data in file EUR_CHF_2011_S5_BA.parquet.
No missing data in file EUR_CHF_2012_S5_BA.parquet.
No missing data in file EUR_CHF_2013_S5_BA.parquet.
No missing data in file EUR_CHF_2014_S5_BA.parquet.
No missing data in file EUR_CHF_2015_S5_BA.parquet.
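
The percentage printed above is the mean over columns of each column's missing fraction, so it cannot show whether the gaps are concentrated in a few columns. A per-column breakdown, sketched below on toy data, can make that visible. `missing_report` and the column names `bid` and `ask` are hypothetical and not taken from the downloaded files.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Return per-column missing percentages and the overall share of missing cells."""
    per_column = df.isnull().mean() * 100
    overall = df.isnull().to_numpy().mean() * 100
    return per_column, overall


# Toy frame with one missing value in one of two columns
toy = pd.DataFrame({"bid": [1.0, np.nan, 3.0, 4.0], "ask": [1.1, 2.1, 3.1, 4.1]})
per_column, overall = missing_report(toy)
print(per_column)
print(f"overall: {overall:.2f}%")
```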


In [None]:
# Question: do 'a_t', 'b_t', 'm_t' (number of quotes per candle) ever differ within the same candle?
# Explanation: this cell was used for data with 'a_t', 'b_t', 'm_t' columns.

import os
import pandas as pd

# Check every downloaded file for differences
data_dir = 'data/raw'
files = sorted([f for f in os.listdir(data_dir) if f.endswith('.parquet')])
print(f"Checking {len(files)} parquet files for differences in 'a_t', 'b_t', 'm_t'...")
for f in files:
    df = pd.read_parquet(os.path.join(data_dir, f))
    diffs = df[(df['a_t'] != df['b_t']) | (df['a_t'] != df['m_t']) | (df['b_t'] != df['m_t'])]
    if not diffs.empty:
        print(f"Differences found in file {f} in {len(diffs)} candles:")
        print(diffs)
    else:
        print(f"No differences in file {f}.")

# Conclusion: No differences found in any year. All 'a_t', 'b_t', 'm_t' are identical per candle. Let's keep 'm_t'.

Checking 7 parquet files for differences in 'a_t', 'b_t', 'm_t'...
No differences in file EUR_CHF_2009_S5_BA.parquet.
No differences in file EUR_CHF_2010_S5_BA.parquet.
No differences in file EUR_CHF_2011_S5_BA.parquet.
No differences in file EUR_CHF_2012_S5_BA.parquet.
No differences in file EUR_CHF_2013_S5_BA.parquet.
No differences in file EUR_CHF_2014_S5_BA.parquet.
No differences in file EUR_CHF_2015_S5_BA.parquet.


Test the algorithm for time-since-last-true:

In [32]:
import pandas as pd
import numpy as np

# Create an artificial DataFrame for testing
df = pd.DataFrame({
    'a': [1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1],
    # 'b': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
})


def since_last_nonzero_nonshifted(series):
    mask = series != 0
    # cumsum increases at each nonzero, so each group starts at a nonzero value
    group = mask.cumsum()
    # Position within the group, counted from 1; cast to float so NaN can be stored
    result = (series.groupby(group).cumcount() + 1).astype(float)
    # Where group == 0 there has been no nonzero value yet, so set to NaN
    result[group == 0] = np.nan
    return result.values

def add_since_last_nonzero(df, col, new_col):
    df[new_col] = since_last_nonzero_nonshifted(df[col])
    df[new_col] = df[new_col].shift()
    return df

df = add_since_last_nonzero(df, 'a', 'since_a')
# df = add_since_last_nonzero(df, 'b', 'since_b')
print(df)

    a  since_a
0   1      NaN
1   0      1.0
2   0      2.0
3   0      3.0
4   1      4.0
5   0      1.0
6   1      2.0
7   0      1.0
8   1      2.0
9   1      1.0
10  1      1.0
11  0      1.0
12  0      2.0
13  1      3.0
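
As a cross-check, the same quantity can be computed with an explicit loop. The sketch below is a hypothetical reference implementation (`since_last_nonzero_reference` is not part of the project's scripts) that counts, for each row, the rows elapsed since the most recent nonzero value strictly before it, and compares the result with the `since_a` column printed above.

```python
import numpy as np


def since_last_nonzero_reference(values):
    """Rows elapsed since the most recent nonzero value strictly before each position."""
    out = []
    last = None  # index of the most recent nonzero seen so far
    for i, v in enumerate(values):
        out.append(np.nan if last is None else float(i - last))
        if v != 0:
            last = i
    return out


a = [1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1]
# Values of the since_a column from the output above
expected = [np.nan, 1, 2, 3, 4, 1, 2, 1, 2, 1, 1, 1, 2, 3]
result = since_last_nonzero_reference(a)
assert np.allclose(result, expected, equal_nan=True)
```

The loop is slow on tens of millions of rows, so it is only meant for validating the vectorized version on small inputs.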


In [5]:
# Testing the future-shifted feature function

import pandas as pd
rng = pd.date_range('2024-01-01', periods=10, freq='1min')
df = pd.DataFrame({'x': range(10)}, index=rng)
%run ../scripts/build_features.py
df = add_future_shifted_feature(df, 'x', 'x_future', freq='3min')
print(df)


                     x  x_future
2024-01-01 00:00:00  0       3.0
2024-01-01 00:01:00  1       4.0
2024-01-01 00:02:00  2       5.0
2024-01-01 00:03:00  3       6.0
2024-01-01 00:04:00  4       7.0
2024-01-01 00:05:00  5       8.0
2024-01-01 00:06:00  6       9.0
