# Databento Python client quickstart

**Welcome to the Databento client library quickstart tutorial!**

We'll walk through how to use our client library's functionality to work with the data available from Databento.

**Note:**

For information on our symbology, refer to https://databento.com/docs/api-reference-historical/basics/symbology. 

For a more detailed API reference, refer to https://databento.com/docs/api-reference-historical.

This tutorial covers the following:
- Using the historical client to request metadata
- Using the historical client to request time series market data
- Working with DBNStore data I/O helper objects
- Using the historical client to make batch data requests
- Querying batch job states
- Programmatically downloading batch jobs

**Tip:** You can call `help()` on any class or method to see its docstring.

## Installation and setup

First, ensure you have the latest `databento` client library installed:
```bash
pip install -U databento
```

## Historical data client

Once you've installed the Python client library, you can import it and initialize a historical client for requests. We'll use this `client` throughout the rest of the tutorial.

To initialize a client, you need to provide a valid API key. You can find your keys on the API Keys page of your Databento portal at https://databento.com.

In [1]:
import databento as db


client = db.Historical(key="YOUR_API_KEY")
db.__version__

'0.17.0'
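
Hard-coding a key in a notebook makes it easy to leak. As a minimal sketch, the key could instead be read from an environment variable. The helper name `load_api_key` and the variable name `DATABENTO_API_KEY` are assumptions chosen for illustration, not something this tutorial defines.

```python
import os


def load_api_key(env_var: str = "DATABENTO_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Both the helper and the default variable name are hypothetical.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

With this in place, the client could be created with `db.Historical(key=load_api_key())`.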

## Requesting metadata

Before we make any requests for actual data, we can look into the metadata to see what is available to download.

In [2]:
client.metadata.list_publishers()

[{'publisher_id': 1,
  'dataset': 'GLBX.MDP3',
  'venue': 'GLBX',
  'description': 'CME Globex MDP 3.0'},
 {'publisher_id': 2,
  'dataset': 'XNAS.ITCH',
  'venue': 'XNAS',
  'description': 'Nasdaq TotalView-ITCH'},
 {'publisher_id': 3,
  'dataset': 'XBOS.ITCH',
  'venue': 'XBOS',
  'description': 'Nasdaq XBOS TotalView-ITCH'},
 {'publisher_id': 4,
  'dataset': 'XPSX.ITCH',
  'venue': 'XPSX',
  'description': 'Nasdaq XPSX TotalView-ITCH'},
 {'publisher_id': 5,
  'dataset': 'BATS.PITCH',
  'venue': 'BATS',
  'description': 'CBOE BZX'},
 {'publisher_id': 6,
  'dataset': 'BATY.PITCH',
  'venue': 'BATS',
  'description': 'CBOE BYX'},
 {'publisher_id': 7,
  'dataset': 'EDGA.PITCH',
  'venue': 'EDGA',
  'description': 'CBOE EDGA'},
 {'publisher_id': 8,
  'dataset': 'EDGX.PITCH',
  'venue': 'EDGX',
  'description': 'CBOE EDGX'},
 {'publisher_id': 9,
  'dataset': 'XNYS.PILLAR',
  'venue': 'XNYS',
  'description': 'NYSE'},
 {'publisher_id': 10,
  'dataset': 'XCIS.PILLAR',
  'venue': 'XCIS',
  'd

In [3]:
client.metadata.list_datasets()

['GLBX.MDP3', 'OPRA.PILLAR', 'XNAS.ITCH']

In [4]:
client.metadata.list_schemas(dataset="GLBX.MDP3")

['mbo',
 'mbp-1',
 'mbp-10',
 'tbbo',
 'trades',
 'ohlcv-1s',
 'ohlcv-1m',
 'ohlcv-1h',
 'ohlcv-1d',
 'definition',
 'statistics']

In [5]:
client.metadata.list_fields(schema="mbo", encoding="dbn")

[{'name': 'length', 'type': 'uint8_t'},
 {'name': 'rtype', 'type': 'uint8_t'},
 {'name': 'publisher_id', 'type': 'uint16_t'},
 {'name': 'instrument_id', 'type': 'uint32_t'},
 {'name': 'ts_event', 'type': 'uint64_t'},
 {'name': 'order_id', 'type': 'uint64_t'},
 {'name': 'price', 'type': 'int64_t'},
 {'name': 'size', 'type': 'uint32_t'},
 {'name': 'flags', 'type': 'uint8_t'},
 {'name': 'channel_id', 'type': 'uint8_t'},
 {'name': 'action', 'type': 'char'},
 {'name': 'side', 'type': 'char'},
 {'name': 'ts_recv', 'type': 'uint64_t'},
 {'name': 'ts_in_delta', 'type': 'int32_t'},
 {'name': 'sequence', 'type': 'uint32_t'}]
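
The field listing can be used to reason about the binary layout of a record. The sketch below sums the byte widths of the listed C types, assuming the fields are packed back to back with no padding (an assumption, not something the listing states). The names `C_TYPE_SIZES` and `record_size` are hypothetical.

```python
# Byte widths of the C types that appear in the field listing above.
C_TYPE_SIZES = {
    "uint8_t": 1,
    "uint16_t": 2,
    "uint32_t": 4,
    "uint64_t": 8,
    "int32_t": 4,
    "int64_t": 8,
    "char": 1,
}

# The MBO fields, copied from the output of list_fields above.
mbo_fields = [
    {"name": "length", "type": "uint8_t"},
    {"name": "rtype", "type": "uint8_t"},
    {"name": "publisher_id", "type": "uint16_t"},
    {"name": "instrument_id", "type": "uint32_t"},
    {"name": "ts_event", "type": "uint64_t"},
    {"name": "order_id", "type": "uint64_t"},
    {"name": "price", "type": "int64_t"},
    {"name": "size", "type": "uint32_t"},
    {"name": "flags", "type": "uint8_t"},
    {"name": "channel_id", "type": "uint8_t"},
    {"name": "action", "type": "char"},
    {"name": "side", "type": "char"},
    {"name": "ts_recv", "type": "uint64_t"},
    {"name": "ts_in_delta", "type": "int32_t"},
    {"name": "sequence", "type": "uint32_t"},
]


def record_size(fields):
    """Sum the byte widths of all fields, assuming no padding."""
    return sum(C_TYPE_SIZES[field["type"]] for field in fields)


print(record_size(mbo_fields))  # 56
```

A total of 56 bytes would be consistent with the `length` value of 14 shown in the NumPy output later in this tutorial, if `length` counts 4-byte words.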

### Unit cost / GB

To get the unit cost per GB for each schema of a dataset, call the `list_unit_prices` method.

In [6]:
client.metadata.list_unit_prices(dataset="GLBX.MDP3")

[{'mode': 'historical',
  'unit_prices': {'mbo': 1.1,
   'mbp-1': 2.42,
   'mbp-10': 0.45,
   'tbbo': 17.89,
   'trades': 24.8,
   'ohlcv-1s': 50.5,
   'ohlcv-1m': 63.5,
   'ohlcv-1h': 130.0,
   'ohlcv-1d': 175.0,
   'definition': 1.66,
   'statistics': 0.8}},
 {'mode': 'historical-streaming',
  'unit_prices': {'mbo': 1.1,
   'mbp-1': 2.42,
   'mbp-10': 0.45,
   'tbbo': 17.89,
   'trades': 24.8,
   'ohlcv-1s': 50.5,
   'ohlcv-1m': 63.5,
   'ohlcv-1h': 130.0,
   'ohlcv-1d': 175.0,
   'definition': 1.66,
   'statistics': 0.8}},
 {'mode': 'live',
  'unit_prices': {'mbo': 1.32,
   'mbp-1': 2.9,
   'mbp-10': 0.54,
   'tbbo': 21.48,
   'trades': 29.76,
   'ohlcv-1s': 60.6,
   'ohlcv-1m': 76.2,
   'ohlcv-1h': 156.0,
   'ohlcv-1d': 210.0,
   'definition': 1.99,
   'statistics': 0.96}}]
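
Because the result is a list with one entry per mode, pulling out a single price takes a small lookup. The following is a sketch over a trimmed copy of the output above; the helper name `unit_price` is hypothetical.

```python
def unit_price(price_list, mode, schema):
    """Return the unit price per GB for one mode and schema."""
    for entry in price_list:
        if entry["mode"] == mode:
            return entry["unit_prices"][schema]
    raise KeyError(f"no prices listed for mode {mode!r}")


# A trimmed copy of the list_unit_prices output shown above.
sample_prices = [
    {"mode": "historical", "unit_prices": {"mbo": 1.1, "trades": 24.8}},
    {"mode": "live", "unit_prices": {"mbo": 1.32, "trades": 29.76}},
]

print(unit_price(sample_prices, "historical", "mbo"))  # 1.1
print(unit_price(sample_prices, "live", "trades"))  # 29.76
```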

### Parameter setup for data cost query

First, instantiate a dictionary of the query parameters for the data you're interested in.

In [7]:
params = {
    "dataset": "GLBX.MDP3",
    "symbols": "ES.FUT",
    "stype_in": "parent",
    "schema": "mbo",
    "start": "2022-06-10T14:30",
    "end": "2022-06-11",
}

### Data cost
Before making a request for data, you can query the expected price in US dollars. The price is determined from the following formula: `unit_cost` * `uncompressed_size_GB`.

In [8]:
client.metadata.get_cost(**params)

0.7847935900092124
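
The formula can be reproduced by hand. The sketch below assumes 1 GB means 10^9 bytes, which this tutorial does not specify; the function name `estimate_cost` and the example size are made up for illustration.

```python
def estimate_cost(unit_price_per_gb, uncompressed_bytes, bytes_per_gb=10**9):
    """Apply unit_cost * uncompressed_size_GB.

    Assumption: a GB is 10**9 bytes; pass bytes_per_gb=2**30 otherwise.
    """
    return unit_price_per_gb * (uncompressed_bytes / bytes_per_gb)


# Hypothetical request: 2.5 GB of uncompressed data at 1.1 USD per GB.
cost = estimate_cost(1.1, 2_500_000_000)
print(round(cost, 2))  # 2.75
```

Read the other way around, if the `mbo` unit price of 1.1 from the listing above applies, the quoted cost of about 0.785 USD corresponds to roughly 0.713 GB of uncompressed data.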

## Requesting time series data

The rest of this tutorial involves historical time series data. Here's how to request it.

The historical time series data is streamed into an in-memory buffer encapsulated by a `DBNStore` object, which we'll use later to work with the data.

The following code sample requests all E-mini S&P 500 futures contracts active between 2022-06-10T14:30 and 2022-06-11, using `parent` symbology.

In [9]:
data = client.timeseries.get_range(
    **params,
    limit=1000,  # <-- request limited to 1000 records
)

## Working with the DBNStore

All time series data requests include a metadata header with the following specifications:
- The original query parameters (these can be used to re-request the data)
- Symbology mappings

### Metadata properties

In [10]:
data.dataset

'GLBX.MDP3'

In [11]:
data.schema

<Schema.MBO: 'mbo'>

In [12]:
data.symbols

['ES.FUT']

In [13]:
data.stype_in

<SType.PARENT: 'parent'>

In [14]:
data.stype_out

<SType.INSTRUMENT_ID: 'instrument_id'>

In [15]:
data.start

Timestamp('2022-06-10 14:30:00+0000', tz='UTC')

In [16]:
data.end

Timestamp('2022-06-12 00:00:00+0000', tz='UTC')

In [17]:
data.limit

1000

In [18]:
data.compression

<Compression.ZSTD: 'zstd'>

### Symbology resolution

The metadata contains all information which would have been provided in a `symbology.resolve` request:

In [19]:
data.symbology

{'symbols': ['ES.FUT'],
 'stype_in': 'parent',
 'stype_out': 'instrument_id',
 'start_date': '2022-06-10',
 'end_date': '2022-06-12',
 'partial': [],
 'not_found': [],
 'mappings': {'ESZ6': [{'start_date': datetime.date(2022, 6, 10),
    'end_date': datetime.date(2022, 6, 12),
    'symbol': '10252'}],
  'ESU2-ESU3': [{'start_date': datetime.date(2022, 6, 10),
    'end_date': datetime.date(2022, 6, 12),
    'symbol': '16445'}],
  'ESH3-ESU3': [{'start_date': datetime.date(2022, 6, 10),
    'end_date': datetime.date(2022, 6, 12),
    'symbol': '20604'}],
  'ESZ5': [{'start_date': datetime.date(2022, 6, 10),
    'end_date': datetime.date(2022, 6, 12),
    'symbol': '294973'}],
  'ESU3-ESH4': [{'start_date': datetime.date(2022, 6, 10),
    'end_date': datetime.date(2022, 6, 12),
    'symbol': '18909'}],
  'ESM3-ESH4': [{'start_date': datetime.date(2022, 6, 10),
    'end_date': datetime.date(2022, 6, 12),
    'symbol': '2018'}],
  'ESH4-ESM4': [{'start_date': datetime.date(2022, 6, 10),
   

### Symbology mappings

A subset of the symbology metadata includes mappings — per date interval — between the requested symbols `stype_in` and the specified `stype_out`.

In [20]:
data.mappings

{'ESU2-ESM3': [{'start_date': datetime.date(2022, 6, 10),
   'end_date': datetime.date(2022, 6, 12),
   'symbol': '26998'}],
 'ESU3-ESZ3': [{'start_date': datetime.date(2022, 6, 10),
   'end_date': datetime.date(2022, 6, 12),
   'symbol': '35947'}],
 'ESZ5': [{'start_date': datetime.date(2022, 6, 10),
   'end_date': datetime.date(2022, 6, 12),
   'symbol': '294973'}],
 'ESH3': [{'start_date': datetime.date(2022, 6, 10),
   'end_date': datetime.date(2022, 6, 12),
   'symbol': '206299'}],
 'ESZ2-ESM3': [{'start_date': datetime.date(2022, 6, 10),
   'end_date': datetime.date(2022, 6, 12),
   'symbol': '19355'}],
 'ESM2-ESH3': [{'start_date': datetime.date(2022, 6, 10),
   'end_date': datetime.date(2022, 6, 12),
   'symbol': '6817'}],
 'ESU3-ESH4': [{'start_date': datetime.date(2022, 6, 10),
   'end_date': datetime.date(2022, 6, 12),
   'symbol': '18909'}],
 'ESU2-ESZ2': [{'start_date': datetime.date(2022, 6, 10),
   'end_date': datetime.date(2022, 6, 12),
   'symbol': '431796'}],
 'ESM3':
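
Since records carry an `instrument_id` rather than a raw symbol, it can be handy to invert these mappings. The sketch below works on a few entries copied from the output above and treats `end_date` as exclusive, which is an assumption. The helper name `instrument_ids_on` is hypothetical.

```python
import datetime as dt

# A few entries copied from the mappings output above.
sample_mappings = {
    "ESZ5": [{"start_date": dt.date(2022, 6, 10),
              "end_date": dt.date(2022, 6, 12),
              "symbol": "294973"}],
    "ESH3": [{"start_date": dt.date(2022, 6, 10),
              "end_date": dt.date(2022, 6, 12),
              "symbol": "206299"}],
    "ESU3-ESH4": [{"start_date": dt.date(2022, 6, 10),
                   "end_date": dt.date(2022, 6, 12),
                   "symbol": "18909"}],
}


def instrument_ids_on(mappings, day):
    """Map instrument ID -> raw symbol for intervals that cover `day`."""
    lookup = {}
    for raw_symbol, intervals in mappings.items():
        for interval in intervals:
            # Assumption: end_date is exclusive.
            if interval["start_date"] <= day < interval["end_date"]:
                lookup[interval["symbol"]] = raw_symbol
    return lookup


lookup = instrument_ids_on(sample_mappings, dt.date(2022, 6, 10))
print(lookup["294973"])  # ESZ5
```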

### Instrument definitions

The metadata also contains 'mini-definitions,' which are a subset of the full `definition` schema. The full instrument definitions — including all data from the exchange — can be obtained in a separate request.

### Pandas DataFrame

To construct a pandas `DataFrame` from the data, you can call the `.to_df()` method.

In [21]:
import pandas as pd


pd.set_option("display.max_columns", None)

df = data.to_df()
df.head(20)

| ts_recv | ts_event | rtype | publisher_id | instrument_id | action | side | price | size | channel_id | order_id | flags | ts_in_delta | sequence | symbol |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2022-06-10 14:30:00.000637393+00:00 | 2022-06-10 14:30:00.000025147+00:00 | 160 | 1 | 97042 | A | B | 3903.5 | 10 | 0 | 6410153029859 | 130 | 16158 | 65509273 | ESU2 |
| 2022-06-10 14:30:00.000758205+00:00 | 2022-06-10 14:30:00.000644983+00:00 | 160 | 1 | 97042 | A | B | 3903.5 | 1 | 0 | 6410153029860 | 130 | 18571 | 65509274 | ESU2 |
| 2022-06-10 14:30:00.003570401+00:00 | 2022-06-10 14:30:00.003445373+00:00 | 160 | 1 | 3403 | A | A | 3904.75 | 1 | 0 | 6410153029861 | 130 | 21409 | 65509275 | ESM2 |
| 2022-06-10 14:30:00.003759884+00:00 | 2022-06-10 14:30:00.003627981+00:00 | 160 | 1 | 3403 | A | B | 3895.0 | 1 | 0 | 6410153029862 | 130 | 20756 | 65509276 | ESM2 |
| 2022-06-10 14:30:00.003766076+00:00 | 2022-06-10 14:30:00.003630375+00:00 | 160 | 1 | 3403 | A | A | 3904.5 | 1 | 0 | 6410153029863 | 130 | 16130 | 65509277 | ESM2 |
| 2022-06-10 14:30:00.003773116+00:00 | 2022-06-10 14:30:00.003651355+00:00 | 160 | 1 | 3403 | A | A | 3904.75 | 1 | 0 | 6410153029864 | 130 | 17362 | 65509278 | ESM2 |
| 2022-06-10 14:30:00.003786908+00:00 | 2022-06-10 14:30:00.003657773+00:00 | 160 | 1 | 3403 | A | B | 3894.75 | 1 | 0 | 6410153029865 | 130 | 25327 | 65509279 | ESM2 |
| 2022-06-10 14:30:00.003787568+00:00 | 2022-06-10 14:30:00.003675667+00:00 | 160 | 1 | 3403 | A | B | 3894.5 | 1 | 0 | 6410153029866 | 130 | 15046 | 65509280 | ESM2 |
| 2022-06-10 14:30:00.003802408+00:00 | 2022-06-10 14:30:00.003690029+00:00 | 160 | 1 | 3403 | A | A | 3905.0 | 1 | 0 | 6410153029867 | 130 | 20556 | 65509281 | ESM2 |
| 2022-06-10 14:30:00.003812108+00:00 | 2022-06-10 14:30:00.003704327+00:00 | 160 | 1 | 3403 | A | A | 3905.25 | 1 | 0 | 6410153029868 | 130 | 19123 | 65509282 | ESM2 |


### Numpy arrays

To convert the data to a NumPy `np.ndarray` with one element per record, you can call the `.to_ndarray()` method.



In [22]:
data.to_ndarray()[:10]

array([(14, 160, 1, 97042, 1654871400000025147, 6410153029859, 3903500000000, 10, 130, 0, b'A', b'B', 1654871400000637393, 16158, 65509273),
       (14, 160, 1, 97042, 1654871400000644983, 6410153029860, 3903500000000,  1, 130, 0, b'A', b'B', 1654871400000758205, 18571, 65509274),
       (14, 160, 1,  3403, 1654871400003445373, 6410153029861, 3904750000000,  1, 130, 0, b'A', b'A', 1654871400003570401, 21409, 65509275),
       (14, 160, 1,  3403, 1654871400003627981, 6410153029862, 3895000000000,  1, 130, 0, b'A', b'B', 1654871400003759884, 20756, 65509276),
       (14, 160, 1,  3403, 1654871400003630375, 6410153029863, 3904500000000,  1, 130, 0, b'A', b'A', 1654871400003766076, 16130, 65509277),
       (14, 160, 1,  3403, 1654871400003651355, 6410153029864, 3904750000000,  1, 130, 0, b'A', b'A', 1654871400003773116, 17362, 65509278),
       (14, 160, 1,  3403, 1654871400003657773, 6410153029865, 3894750000000,  1, 130, 0, b'A', b'B', 1654871400003786908, 25327, 65509279),
       (14, 1
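
Unlike the DataFrame, the array holds raw integers: the first record's price appears as 3903500000000 here and as 3903.5 in the DataFrame, and timestamps are nanoseconds since the Unix epoch. The sketch below converts both by hand, assuming prices are fixed-point integers in units of 1e-9 (inferred from those two outputs). The helper names are hypothetical.

```python
from datetime import datetime, timezone
from decimal import Decimal

PRICE_SCALE = 9  # assumption: raw prices are integers in units of 1e-9


def to_decimal_price(raw_price: int) -> Decimal:
    """Shift the decimal point of a raw fixed-point price."""
    return Decimal(raw_price).scaleb(-PRICE_SCALE)


def to_utc(ts_ns: int):
    """Split nanoseconds since the epoch into (UTC datetime, leftover ns)."""
    seconds, nanos = divmod(ts_ns, 1_000_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc), nanos


# Values taken from the first record in the array above.
price = to_decimal_price(3903500000000)
when, nanos = to_utc(1654871400000025147)

print(price == Decimal("3903.5"))  # True
print(when.isoformat(), nanos)  # 2022-06-10T14:30:00+00:00 25147
```

`Decimal` is used rather than `float` so that the conversion stays exact for any price.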

### Replay

To replay the time series data stream record by record to a handler callback, you can use the `.replay(callback)` method.

In [23]:
def my_handler(record):
    # backtesting / trading strategies (event-driven)
    print(record)

In [24]:
#  data.replay(my_handler)
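
A handler does not have to be a plain function; any callable that accepts one record should work. The sketch below keeps state between calls by counting records per instrument. It runs against stand-in objects and assumes that real records expose an `instrument_id` attribute; the class name `RecordCounter` is hypothetical.

```python
from collections import Counter
from types import SimpleNamespace


class RecordCounter:
    """Callable handler that counts records per instrument ID."""

    def __init__(self):
        self.by_instrument = Counter()

    def __call__(self, record):
        # Assumption: each record has an `instrument_id` attribute.
        self.by_instrument[record.instrument_id] += 1


# Stand-in records so the sketch runs without any downloaded data.
fake_records = [
    SimpleNamespace(instrument_id=97042),
    SimpleNamespace(instrument_id=97042),
    SimpleNamespace(instrument_id=3403),
]

handler = RecordCounter()
for record in fake_records:  # mimics a replay loop calling the handler
    handler(record)

print(handler.by_instrument[97042])  # 2
print(handler.by_instrument[3403])  # 1
```

With real data, the same object could be passed as `data.replay(handler)`.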

## Writing data to disk

You can write the raw DBN data to disk using the `.to_file()` method.



In [26]:
data.to_file("test.dbn")

You can also write to disk as CSV or JSON.

In [27]:
data.to_csv("my_data.csv")

In [28]:
data.to_json("my_data.json")

## Time series batch requests

The client library can also make batch download requests to the Databento API.

In [29]:
new_job = client.batch.submit_job(
    dataset="GLBX.MDP3",
    symbols=["ESH1"],
    schema="trades",
    start="2020-12-27T12:00",
    end="2020-12-29",
    encoding="dbn",
    delivery="download",
    limit=1000,  # <-- request limited to 1000 records
)
new_job_id = new_job["id"]

new_job

{'id': 'GLBX-20230811-PDJMMKWNRG',
 'user_id': '46PCMCVF',
 'bill_id': None,
 'cost_usd': None,
 'dataset': 'GLBX.MDP3',
 'symbols': 'ESH1',
 'stype_in': 'raw_symbol',
 'stype_out': 'instrument_id',
 'schema': 'trades',
 'start': '2020-12-27 12:00:00+00:00',
 'end': '2020-12-30 00:00:00+00:00',
 'limit': 1000,
 'encoding': 'dbn',
 'compression': 'zstd',
 'pretty_px': False,
 'pretty_ts': False,
 'split_duration': 'day',
 'split_size': None,
 'split_symbols': False,
 'packaging': None,
 'delivery': 'download',
 'record_count': None,
 'billed_size': None,
 'actual_size': None,
 'package_size': None,
 'state': 'queued',
 'ts_received': '2023-08-11 00:24:03.786913+00:00',
 'ts_queued': None,
 'ts_process_start': None,
 'ts_process_done': None,
 'ts_expiration': None}

## Querying batch job state

It's possible to query for a list of your batch jobs, with optional filter parameters for `state` (the state of the batch job) and `since` (when the job was received). 

This could help to programmatically build and manage larger data pipelines. Once we see the batch job has completed processing (with a state of `done`), then we can download the files.

Note the value of the batch job's `id`, which we'll need to provide for the download. It is saved in `new_job_id`.

In [30]:
client.batch.list_jobs(since=pd.Timestamp.utcnow() - pd.Timedelta(minutes=5))

[{'id': 'GLBX-20230811-PDJMMKWNRG',
  'user_id': '46PCMCVF',
  'bill_id': None,
  'cost_usd': None,
  'dataset': 'GLBX.MDP3',
  'symbols': 'ESH1',
  'stype_in': 'raw_symbol',
  'stype_out': 'instrument_id',
  'schema': 'trades',
  'start': '2020-12-27 12:00:00+00:00',
  'end': '2020-12-30 00:00:00+00:00',
  'limit': 1000,
  'encoding': 'dbn',
  'compression': 'zstd',
  'pretty_px': False,
  'pretty_ts': False,
  'split_duration': 'day',
  'split_size': None,
  'split_symbols': False,
  'packaging': None,
  'delivery': 'download',
  'record_count': None,
  'billed_size': None,
  'actual_size': None,
  'package_size': None,
  'state': 'queued',
  'ts_received': '2023-08-11 00:24:03.786913+00:00',
  'ts_queued': None,
  'ts_process_start': None,
  'ts_process_done': None,
  'ts_expiration': None,
  'progress': 0}]
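
In a pipeline, waiting for the `done` state usually means polling. The following is a sketch of such a loop, written against any callable that returns a list of job dictionaries shaped like the output above, so it can be tried without network access. The function name `wait_for_job` and its parameters are hypothetical, and only the `queued` and `done` states seen in this tutorial are considered.

```python
import time


def wait_for_job(list_jobs, job_id, poll_seconds=5.0,
                 timeout_seconds=600.0, sleep=time.sleep):
    """Poll `list_jobs()` until the job with `job_id` reports state 'done'."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        for job in list_jobs():
            if job["id"] == job_id and job["state"] == "done":
                return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not done after {timeout_seconds}s")
        sleep(poll_seconds)


# Simulated responses: the job is queued on the first poll, done on the second.
responses = iter([
    [{"id": "JOB-1", "state": "queued"}],
    [{"id": "JOB-1", "state": "done"}],
])
finished = wait_for_job(lambda: next(responses), "JOB-1",
                        poll_seconds=0, sleep=lambda seconds: None)
print(finished["state"])  # done
```

With the real client, the first argument could be something like `lambda: client.batch.list_jobs(since=...)`, and the second `new_job_id`.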

In [32]:
help(client.batch.download)

Help on method download in module databento.historical.api.batch:

download(output_dir: 'PathLike[str] | str', job_id: 'str', filename_to_download: 'str | None' = None, enable_partial_downloads: 'bool' = True) -> 'list[Path]' method of databento.historical.api.batch.BatchHttpAPI instance
    Download a batch job or a specific file to `{output_dir}/{job_id}/`.
    
    Will automatically generate any necessary directories if they do not
    already exist.
    
    Makes one or many `GET /batch/download/{job_id}/{filename}` HTTP request(s).
    
    Parameters
    ----------
    output_dir: PathLike or str
        The directory to download the file(s) to.
    job_id : str
        The batch job identifier.
    filename_to_download : str, optional
        The specific file to download.
        If `None` then will download all files for the batch job.
    enable_partial_downloads : bool, default True
        If partially downloaded files will be resumed using range request(s).
    
    Retu

## Programmatic downloads
Now that the batch job has completed (with a state of `done`), we can download the files by providing an output directory path, and the `job_id` (found above):

In [33]:
client.batch.download(job_id=new_job_id, output_dir="mydata")

[PosixPath('mydata/GLBX-20230811-PDJMMKWNRG/manifest.json'),
 PosixPath('mydata/GLBX-20230811-PDJMMKWNRG/condition.json'),
 PosixPath('mydata/GLBX-20230811-PDJMMKWNRG/metadata.json'),
 PosixPath('mydata/GLBX-20230811-PDJMMKWNRG/symbology.json'),
 PosixPath('mydata/GLBX-20230811-PDJMMKWNRG/glbx-mdp3-20201227.trades.dbn.zst')]

Or, we can download a specific file for the job:

In [34]:
client.batch.download(job_id=new_job_id, output_dir="mydata", filename_to_download="metadata.json")

[PosixPath('mydata/GLBX-20230811-PDJMMKWNRG/metadata.json')]
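
The returned list mixes JSON support files with the data files themselves. As a small sketch, the data files can be picked out by suffix; the helper name `data_files` is hypothetical, and `PurePosixPath` is used only so the example runs the same way on any platform.

```python
from pathlib import PurePosixPath


def data_files(paths, suffix=".dbn.zst"):
    """Keep only the paths whose file name ends with `suffix`."""
    return [path for path in paths if path.name.endswith(suffix)]


# Paths copied from the download output above.
job_dir = PurePosixPath("mydata/GLBX-20230811-PDJMMKWNRG")
downloaded = [
    job_dir / "manifest.json",
    job_dir / "condition.json",
    job_dir / "metadata.json",
    job_dir / "symbology.json",
    job_dir / "glbx-mdp3-20201227.trades.dbn.zst",
]

for path in data_files(downloaded):
    print(path.name)  # glbx-mdp3-20201227.trades.dbn.zst
```

Each selected path could then be loaded for analysis, for example with `db.DBNStore.from_file(path)` if your installed version provides it.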