# Databento Python client quickstart

**Welcome to the Databento client library quickstart tutorial!**

We'll walk through how to use our client library's functionality to work with the data available from Databento.

**Note:**

- For information on our symbology, refer to https://docs.databento.com/reference-historical/basics/symbology.
- For a more detailed API reference, refer to https://docs.databento.com/reference-historical.

This tutorial covers the following:
- Using the historical client to request metadata
- Using the historical client to request time series market data
- Working with Bento data I/O helper objects
- Using the historical client to make batch data requests
- Querying batch job states

**Tip:** You can call `help()` on any class or method to see its docstring.

## Installation and setup

First, ensure you have the latest `databento` client library installed:
```bash
pip install -U databento
```

## Historical data client

Once you've installed the Python client library, you can import it and initialize a historical client for requests. We'll use this `client` throughout the rest of the tutorial.

To initialize a client, you need to provide a valid API key. You can find your keys on the API Keys page of your Databento portal at https://databento.com.

In [2]:
import databento as db

In [44]:
client = db.Historical(key="YOUR_API_KEY")

## Requesting metadata

Before we make any requests for actual data, we can look into the metadata to see what is available to download.

In [5]:
client.metadata.list_datasets()

['GLBX.MDP3', 'XNAS.ITCH']

In [8]:
client.metadata.list_schemas(dataset="GLBX.MDP3")

['mbo',
 'mbp-1',
 'mbp-10',
 'tbbo',
 'trades',
 'ohlcv-1s',
 'ohlcv-1m',
 'ohlcv-1h',
 'ohlcv-1d',
 'definition',
 'statistics',
 'status']

In [11]:
client.metadata.list_fields(dataset="GLBX.MDP3", schema="trades", encoding="csv")

{'GLBX.MDP3': {'csv': {'trades': {'ts_recv': 'int',
    'ts_event': 'int',
    'ts_in_delta': 'int',
    'publisher_id': 'int',
    'product_id': 'int',
    'action': 'string',
    'side': 'string',
    'flags': 'int',
    'price': 'int',
    'size': 'int',
    'sequence': 'int'}}}}

In [12]:
client.metadata.list_encodings()

['dbz', 'csv', 'json']

In [13]:
client.metadata.list_compressions()

['none', 'zstd']

### Unit cost / GB

To get the unit cost per GB for each data schema, call the `list_unit_prices` method.

In [14]:
client.metadata.list_unit_prices(dataset="GLBX.MDP3", mode="historical-streaming")

{'historical-streaming': {'mbo': 21.05,
  'mbp-1': 82.05,
  'mbp-10': 31.95,
  'tbbo': 22.56,
  'trades': 67.76,
  'ohlcv-1s': 78.68,
  'ohlcv-1m': 63.32,
  'ohlcv-1h': 52.91,
  'ohlcv-1d': 41.5,
  'definition': 66.8,
  'statistics': 97.92,
  'status': 62.72}}

### Parameter setup for data cost query

First, instantiate a dictionary of the query parameters for the data you're interested in.

In [17]:
params = {
    "dataset": "GLBX.MDP3",
    "symbols": "ES.FUT",
    "stype_in": "smart",
    "schema": "mbo",
    "start": "2020-12-27",
    "end": "2020-12-30",
}

### Data cost
Before making a request for data, you can query the expected price in USD. The price is determined from the following formula: `unit_cost` * `uncompressed_size_GB`.

In [18]:
client.metadata.get_cost(**params)

13.778901880607009
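
As a rough illustration of the formula above, the sketch below multiplies a unit price by an uncompressed size. The helper name `estimate_cost` and the sample size are made up for illustration, and the code assumes that 1 GB means 10^9 bytes, which this tutorial does not state. The value returned by `get_cost` remains the figure to rely on.

```python
def estimate_cost(unit_cost_per_gb: float, uncompressed_size_bytes: int) -> float:
    """Return unit_cost * uncompressed_size_GB (assumes 1 GB = 10**9 bytes)."""
    size_gb = uncompressed_size_bytes / 1e9
    return unit_cost_per_gb * size_gb


# 'mbo' unit price taken from the list_unit_prices output above.
mbo_unit_cost = 21.05
# Made-up size, for illustration only.
example_size_bytes = 2_000_000_000

estimated = estimate_cost(mbo_unit_cost, example_size_bytes)
print(f"Estimated cost: {estimated:.2f} USD")
```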

## Requesting time series data

The rest of this tutorial involves historical time series data. Here's how to request this data.

The historical time series data is streamed into an in-memory buffer encapsulated by a `Bento` object, which we'll use later to work with the data.

The following code sample requests data for all E-mini S&P 500 futures contract outrights active between 2020-12-27 and 2020-12-30, using `smart` symbology.

In [19]:
data = client.timeseries.stream(
    **params,
    limit=1000,  # <-- request limited to 1000 records
)

In [20]:
import pandas as pd

## Working with the Bento helper object

All time series data requests include a metadata header with the following specifications:
- The original query parameters (these can be used to re-request the data)
- Data shape
- Symbology mappings
- Instrument 'mini-definitions'

### Metadata properties

In [21]:
data.dataset

'GLBX.MDP3'

In [22]:
data.schema

<Schema.MBO: 'mbo'>

In [23]:
data.symbols

['ES.FUT']

In [24]:
data.stype_in

<SType.SMART: 'smart'>

In [25]:
data.stype_out

<SType.PRODUCT_ID: 'product_id'>

In [26]:
data.start

Timestamp('2020-12-27 00:00:00+0000', tz='UTC')

In [27]:
data.end

Timestamp('2020-12-30 00:00:00+0000', tz='UTC')

In [28]:
data.limit

1000

In [29]:
data.encoding

<Encoding.DBZ: 'dbz'>

In [30]:
data.compression

<Compression.ZSTD: 'zstd'>

In [31]:
data.shape

(1000, 14)

In [32]:
data.dtype

dtype([('nwords', 'u1'), ('type', 'u1'), ('publisher_id', '<u2'), ('product_id', '<u4'), ('ts_event', '<u8'), ('order_id', '<u8'), ('price', '<i8'), ('size', '<u4'), ('flags', 'i1'), ('channel_id', 'u1'), ('action', 'S1'), ('side', 'S1'), ('ts_recv', '<u8'), ('ts_in_delta', '<i4'), ('sequence', '<u4')])

In [33]:
data.struct_size

56
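
To see where the `struct_size` of 56 comes from, the field widths in `data.dtype` can be added up. The sketch below does this with Python's standard `struct` module, using a format string that mirrors the dtype field by field. It assumes the record is laid out little-endian with no padding, exactly as the dtype describes; the names `MBO_FORMAT` and `MBO_FIELDS` are illustrative and not part of the client library. The sample values are copied from the first row of the `to_ndarray()` output shown later in this tutorial.

```python
import struct

# One format character per dtype field, in order ('<' = little-endian, no padding).
# Assumption: records are packed exactly as the NumPy dtype above describes.
MBO_FORMAT = "<BBHIQQqIbBccQiI"
MBO_FIELDS = (
    "nwords", "type", "publisher_id", "product_id", "ts_event", "order_id",
    "price", "size", "flags", "channel_id", "action", "side",
    "ts_recv", "ts_in_delta", "sequence",
)

record_size = struct.calcsize(MBO_FORMAT)
print(record_size)  # 56

# Round-trip one record through bytes to confirm the layout.
sample = (14, 32, 1, 5482, 1609099225061045683, 647439984644, 315950000000000,
          2, 0, 0, b"B", b"A", 1609099225250461359, 92701, 1098)
raw = struct.pack(MBO_FORMAT, *sample)
decoded = dict(zip(MBO_FIELDS, struct.unpack(MBO_FORMAT, raw)))
print(len(raw), decoded["product_id"], decoded["action"])
```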

### Symbology resolution

The metadata contains all the information that a `symbology.resolve` request would have returned:

In [34]:
data.symbology

{'result': {'ESH1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '5482'}],
  'ESH1-ESH2': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '21885'}],
  'ESH1-ESM1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '19651'}],
  'ESH1-ESU1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '4223'}],
  'ESH1-ESZ1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '20076'}],
  'ESH2': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '5782'}],
  'ESM1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '3853'}],
  'ESM1-ESH2': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '22223'}],
  'ESM1-ESU1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '4673'}],
  'ESM1-ESZ1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '22279'}],
  'ESU1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '1030'}],
  'ESU1-ESH2': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '16280'}],
  'ESU1-ESZ1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '20117'}],
  'ESZ1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '8858'}],


### Symbology mappings

A subset of the symbology metadata is the mappings, per date interval, between the requested symbols (given as `stype_in`) and their equivalents in the specified `stype_out`.

In [35]:
data.mappings

{'ESH1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '5482'}],
 'ESH1-ESH2': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '21885'}],
 'ESH1-ESM1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '19651'}],
 'ESH1-ESU1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '4223'}],
 'ESH1-ESZ1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '20076'}],
 'ESH2': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '5782'}],
 'ESM1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '3853'}],
 'ESM1-ESH2': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '22223'}],
 'ESM1-ESU1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '4673'}],
 'ESM1-ESZ1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '22279'}],
 'ESU1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '1030'}],
 'ESU1-ESH2': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '16280'}],
 'ESU1-ESZ1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '20117'}],
 'ESZ1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '8858'}],
 'ESZ1-ESH2': [{'t0': '2
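
A mapping of this shape can be searched in reverse, for example to find which input symbol a `product_id` in a record belongs to on a given date. The sketch below is a minimal illustration using two entries copied from the output above. The helper `symbol_for` is hypothetical, and treating `t0` as inclusive and `t1` as exclusive is an assumption.

```python
from datetime import date
from typing import Optional

# Two entries copied from the `data.mappings` output above.
mappings = {
    "ESH1": [{"t0": "2020-12-27", "t1": "2020-12-30", "s": "5482"}],
    "ESM1": [{"t0": "2020-12-27", "t1": "2020-12-30", "s": "3853"}],
}


def symbol_for(mappings: dict, product_id: int, on: date) -> Optional[str]:
    """Return the input symbol mapped to product_id on the given date, or None."""
    for symbol, intervals in mappings.items():
        for interval in intervals:
            start = date.fromisoformat(interval["t0"])
            end = date.fromisoformat(interval["t1"])
            # Assumption: t0 is inclusive and t1 is exclusive.
            if interval["s"] == str(product_id) and start <= on < end:
                return symbol
    return None


print(symbol_for(mappings, 5482, date(2020, 12, 28)))  # ESH1
```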

### Instrument definitions

The metadata also contains 'mini-definitions,' which are a subset of the full `definition` schema. The full instrument definitions — including all data from the exchange — can be obtained in a separate request.

### Pandas DataFrame

To construct a pandas `DataFrame` from the data, you can call the `.to_df()` method.

In [38]:
df = data.to_df(pretty_px=True, pretty_ts=True)
df.head(20)

ts_recv,ts_event,ts_in_delta,publisher_id,product_id,order_id,action,side,flags,price,size,sequence
2020-12-27 20:00:25.252293395+00:00,2020-12-27 20:00:25.061045683+00:00,26179,1,5482,647773887134,B,A,0,3634.0,10,1124
2020-12-27 20:00:25.252293395+00:00,2020-12-27 20:00:25.061045683+00:00,26179,1,5482,647773887135,B,A,0,3634.0,10,1124
2020-12-27 20:00:25.252293395+00:00,2020-12-27 20:00:25.061045683+00:00,26179,1,5482,647773887136,B,A,0,3634.25,10,1124
2020-12-27 20:00:25.252293395+00:00,2020-12-27 20:00:25.061045683+00:00,26179,1,5482,647773887137,B,A,0,3634.25,10,1124
2020-12-27 20:00:25.252293395+00:00,2020-12-27 20:00:25.061045683+00:00,26179,1,5482,647773887138,B,A,0,3634.5,19,1124
2020-12-27 20:00:25.252293395+00:00,2020-12-27 20:00:25.061045683+00:00,26179,1,5482,647773887139,B,A,0,3634.5,1,1124
2020-12-27 20:00:25.252293395+00:00,2020-12-27 20:00:25.061045683+00:00,26179,1,5482,647773887140,B,A,0,3634.75,5,1124
2020-12-27 20:00:25.252293395+00:00,2020-12-27 20:00:25.061045683+00:00,26179,1,5482,647773887141,B,A,0,3634.75,5,1124
2020-12-27 20:00:25.252293395+00:00,2020-12-27 20:00:25.061045683+00:00,26179,1,5482,647773887142,B,A,0,3634.75,5,1124
2020-12-27 20:00:25.252293395+00:00,2020-12-27 20:00:25.061045683+00:00,26179,1,5482,647773887143,B,A,0,3634.75,5,1124


### NumPy arrays

To get the individual records as a NumPy structured array (`np.ndarray`), you can call the `.to_ndarray()` method.



In [None]:
data.to_ndarray()[:10]

array([(14, 32, 1, 5482, 1609099225061045683, 647439984644, 315950000000000,  2, 0, 0, b'B', b'A', 1609099225250461359, 92701, 1098),
       (14, 32, 1, 5482, 1609099225061045683, 647439984689, 310550000000000,  3, 0, 0, b'B', b'A', 1609099225250461359, 92701, 1098),
       (14, 32, 1, 5482, 1609099225061045683, 647508324609, 330000000000000,  2, 0, 0, b'B', b'A', 1609099225250461359, 92701, 1098),
       (14, 32, 1, 5482, 1609099225061045683, 647530969859, 287000000000000, 10, 0, 0, b'B', b'A', 1609099225250461359, 92701, 1098),
       (14, 32, 1, 5482, 1609099225061045683, 647570749552, 321325000000000,  1, 0, 0, b'B', b'A', 1609099225250461359, 92701, 1098),
       (14, 32, 1, 5482, 1609099225061045683, 647570749560, 321225000000000,  1, 0, 0, b'B', b'A', 1609099225250461359, 92701, 1098),
       (14, 32, 1, 5482, 1609099225061045683, 647570749656, 321125000000000,  1, 0, 0, b'B', b'A', 1609099225250461359, 92701, 1098),
       (14, 32, 1, 5482, 1609099225061045683, 647570749727, 32
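
In the raw records, `ts_event` and `ts_recv` are plain integers. Comparing them with the DataFrame output earlier suggests they are nanoseconds since the UNIX epoch in UTC. The sketch below converts one such value with the standard library only; the helper name `ns_to_utc_string` is illustrative. Python's `datetime` only keeps microseconds, so the nanosecond remainder is formatted separately.

```python
from datetime import datetime, timezone


def ns_to_utc_string(ts_ns: int) -> str:
    """Format a nanosecond UNIX timestamp as a UTC string with full precision."""
    seconds, nanos = divmod(ts_ns, 1_000_000_000)
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # Append the nanosecond remainder manually to avoid losing precision.
    return f"{dt:%Y-%m-%d %H:%M:%S}.{nanos:09d}+00:00"


# ts_event value copied from the first record above.
print(ns_to_utc_string(1609099225061045683))
# 2020-12-27 20:00:25.061045683+00:00
```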

### Replay

To replay the time series data stream record by record to a handler callback, you can use the `.replay(callback)` method.

In [None]:
def my_handler(record):
    # backtesting / trading strategies (event-driven)
    print(record)

In [None]:
# data.replay(my_handler)

## Writing data to disk

You can write the raw DBZ data to disk using the `.to_file()` method.



In [40]:
data.to_file("test.dbz")

<databento.common.bento.FileBento at 0x7f9ede557340>

You can also write to disk as CSV or JSON.

In [41]:
data.to_csv("my_data.csv")

In [42]:
data.to_json("my_data.json")

## Time series batch requests

The client library can also make batch data requests to the Databento API.

In [None]:
client.batch.submit_job(
    dataset="GLBX.MDP3",
    symbols=["ESH1"],
    schema="trades",
    start="2020-12-27T12:00",
    end="2020-12-29",
    encoding="dbz",
    delivery="download",
    compression="zstd",
    limit=100,  # <-- request limited to 100 records
)

{'id': 'GLBX-20220720-BTW9J5HY5C',
 'user_id': '46PCMCVF',
 'bill_id': '3eaf1158',
 'dataset': 'GLBX.MDP3',
 'symbols': 'ESH1',
 'stype_in': 'native',
 'stype_out': 'product_id',
 'schema': 'trades',
 'start': '2020-12-27 12:00:00+00:00',
 'end': '2020-12-29 00:00:00+00:00',
 'limit': 100,
 'encoding': 'dbz',
 'compression': 'zstd',
 'split_duration': 'day',
 'split_size': None,
 'packaging': 'none',
 'delivery': 'download',
 'is_example': False,
 'record_count': 100,
 'billed_size': 4800,
 'actual_size': None,
 'package_size': None,
 'state': 'queued',
 'ts_received': '2022-07-20 07:26:45.617296+00:00',
 'ts_queued': None,
 'ts_process_start': None,
 'ts_process_done': None,
 'ts_expiration': None}

## Querying batch job state

It's possible to query for a list of your batch jobs, with optional filter parameters for `state` and `since`. This can help you programmatically build and manage larger data pipelines.

In [None]:
client.batch.list_jobs(since=pd.Timestamp.now(tz="UTC") - pd.Timedelta(minutes=5))

[{'id': 'GLBX-20220720-BTW9J5HY5C',
  'user_id': '46PCMCVF',
  'bill_id': '3eaf1158',
  'dataset': 'GLBX.MDP3',
  'symbols': 'ESH1',
  'stype_in': 'native',
  'stype_out': 'product_id',
  'schema': 'trades',
  'start': '2020-12-27 12:00:00+00:00',
  'end': '2020-12-29 00:00:00+00:00',
  'limit': 100,
  'encoding': 'dbz',
  'compression': 'zstd',
  'split_duration': 'day',
  'split_size': None,
  'packaging': 'none',
  'delivery': 'download',
  'is_example': False,
  'record_count': 100,
  'billed_size': 4800,
  'actual_size': None,
  'package_size': None,
  'state': 'queued',
  'ts_received': '2022-07-20 07:26:45.617296+00:00',
  'ts_queued': '2022-07-20 07:26:46.395321+00:00',
  'ts_process_start': None,
  'ts_process_done': None,
  'ts_expiration': None,
  'progress': 0}]
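
Because the result is a list of dictionaries, a pipeline can group or filter jobs with ordinary Python. The sketch below groups job IDs by their `state` field. The helper `group_job_ids_by_state` is illustrative, the first job is abbreviated from the output above, and the second job (including its ID and the `done` state name) is made up for the example.

```python
from collections import defaultdict


def group_job_ids_by_state(jobs):
    """Return a dict mapping each job state to the list of job IDs in that state."""
    grouped = defaultdict(list)
    for job in jobs:
        grouped[job["state"]].append(job["id"])
    return dict(grouped)


jobs = [
    # Abbreviated from the list_jobs output above.
    {"id": "GLBX-20220720-BTW9J5HY5C", "state": "queued", "progress": 0},
    # Hypothetical job: the ID and the 'done' state name are assumptions.
    {"id": "EXAMPLE-JOB-2", "state": "done", "progress": 100},
]

by_state = group_job_ids_by_state(jobs)
print(by_state.get("queued", []))  # ['GLBX-20220720-BTW9J5HY5C']
```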