# Databento Python client quickstart

**Welcome to the Databento official Python client library quickstart tutorial!**

Learn how to use the client library to request and analyse the data available from Databento.

This tutorial will cover the following:
- Using the historical client to request metadata
- Using the historical client to request time series market data
- Working with `Bento` data I/O helper objects

**Tips:**
- We can call `help()` on any class or method to see its docstring


## Historical data client

Once we have installed the Python client library, we can import it and initialize a historical client for requests.
We'll use this `client` throughout the remainder of the tutorial.

To initialize a client you'll need to provide a valid API access key. You can find your keys on the 'Access Keys' page of the user portal after logging into your account at https://databento.com.

In [1]:
import databento as db

In [2]:
client = db.Historical(key="YOUR_ACCESS_KEY")
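Pasting a key directly into a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key has been exported as an environment variable. The variable name `DATABENTO_API_KEY` and the helper `load_api_key` are illustrative choices for this sketch, not something this tutorial defines:

```python
import os


def load_api_key(env_var: str = "DATABENTO_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing.

    The default variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Usage (commented out because it needs a real key and the client library):
# client = db.Historical(key=load_api_key())
```

Failing loudly on a missing key keeps the error close to its cause instead of surfacing later as a rejected request.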

## Requesting metadata

Before we make any requests for actual data, we can explore various metadata to discover what is available.

In [3]:
client.metadata.list_datasets()

['GLBX.MDP3', 'XNAS.ITCH']

In [4]:
client.metadata.list_schemas(dataset="GLBX.MDP3")

['mbo',
 'mbp-1',
 'mbp-10',
 'tbbo',
 'trades',
 'ohlcv-1s',
 'ohlcv-1m',
 'ohlcv-1h',
 'ohlcv-1d',
 'definition',
 'statistics',
 'status']

In [5]:
client.metadata.list_fields(dataset="GLBX.MDP3", schema="trades", encoding="csv")

{'GLBX.MDP3': {'csv': {'trades': {'ts_recv': 'int',
    'ts_event': 'int',
    'ts_in_delta': 'int',
    'pub_id': 'int',
    'product_id': 'int',
    'action': 'string',
    'side': 'string',
    'flags': 'int',
    'price': 'int',
    'size': 'int',
    'sequence': 'int'}}}}

In [6]:
client.metadata.list_encodings()

['dbz', 'csv', 'json']

In [7]:
client.metadata.list_compressions()

['none', 'zstd']

### Unit cost / GB

In [8]:
client.metadata.list_unit_prices(dataset="GLBX.MDP3", mode="historical-streaming")

{'historical-streaming': {'mbo': 21.05,
  'mbp-1': 82.05,
  'mbp-10': 31.95,
  'tbbo': 22.56,
  'trades': 67.76,
  'ohlcv-1s': 78.68,
  'ohlcv-1m': 63.32,
  'ohlcv-1h': 52.91,
  'ohlcv-1d': 41.5,
  'definition': 66.8,
  'statistics': 97.92,
  'status': 62.72}}

## Requesting time series data

First we will instantiate a dictionary of the query parameters for the data we're interested in.


In [9]:
params = {
    "dataset": "GLBX.MDP3",
    "symbols": "ES.FUT",
    "stype_in": "smart",
    "schema": "mbo",
    "start": "2020-12-27",
    "end": "2020-12-30",
}

### Data cost
Before making an actual request for data, we can query its price, which is computed as `unit_cost` * `uncompressed_size_GB`.

In [10]:
client.metadata.get_cost(**params)

13.778901880607009
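The formula above can be turned around to see how much data the quote corresponds to. This is a rough sketch that assumes the `mbo` unit price listed earlier for `historical-streaming` is the one applied to this request, which the tutorial does not state explicitly; `estimate_cost` is a hypothetical helper, not part of the client library:

```python
# Values copied from the outputs shown earlier in this tutorial.
unit_price_per_gb = 21.05          # 'mbo' under 'historical-streaming'
quoted_cost = 13.778901880607009   # result of client.metadata.get_cost(**params)

# cost = unit_price * size_gb, so the implied uncompressed size is:
implied_size_gb = quoted_cost / unit_price_per_gb


def estimate_cost(unit_price: float, size_gb: float) -> float:
    """Price of a request: unit price (per GB) times uncompressed size (GB)."""
    return unit_price * size_gb


print(f"implied uncompressed size: {implied_size_gb:.4f} GB")
print(f"reconstructed cost: {estimate_cost(unit_price_per_gb, implied_size_gb):.6f}")
```

Under that assumption the quote corresponds to roughly two thirds of a gigabyte of uncompressed `mbo` records for the three-day range.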

## Requesting time series data

Now we will request the historical time series data which will be used throughout the remainder of the tutorial.

The data will be streamed into an in-memory buffer encapsulated by a `Bento` object, which we'll use later to work with the data.

Here we will request all E-mini S&P500 futures contract outrights active between 2020-12-27 and 2020-12-30 using `smart` symbology:

In [11]:
data = client.timeseries.stream(
    **params,
    limit=1000,  # <-- request limited to 1000 records
)

In [12]:
import pandas as pd

## Working with the Bento helper object

All timeseries data requests will contain an accompanying metadata header which includes:
- The original query parameters
- Symbology mappings
- Instrument 'mini-definitions'

### Metadata properties

In [13]:
data.dataset

'GLBX.MDP3'

In [14]:
data.schema

<Schema.MBO: 'mbo'>

In [15]:
data.symbols

['ES.FUT']

In [16]:
data.stype_in

<SType.SMART: 'smart'>

In [17]:
data.stype_out

<SType.PRODUCT_ID: 'product_id'>

In [18]:
data.start

Timestamp('2020-12-27 00:00:00+0000', tz='UTC')

In [19]:
data.end

Timestamp('2020-12-30 00:00:00+0000', tz='UTC')

In [20]:
data.encoding

<Encoding.DBZ: 'dbz'>

In [21]:
data.compression

<Compression.ZSTD: 'zstd'>

In [22]:
data.shape

(1000, 14)

In [23]:
data.dtype

dtype([('nwords', 'u1'), ('type', 'u1'), ('pub_id', '<u2'), ('product_id', '<u4'), ('ts_event', '<u8'), ('order_id', '<u8'), ('price', '<i8'), ('size', '<u4'), ('flags', 'i1'), ('chan_id', 'u1'), ('side', 'S1'), ('action', 'S1'), ('ts_recv', '<u8'), ('ts_in_delta', '<i4'), ('sequence', '<u4')])

In [24]:
data.struct_size

56
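The `struct_size` of 56 follows directly from the field widths in `data.dtype`. As a standard-library sketch, we can write the same layout as a `struct` format string and check the size. This assumes the fields are packed back to back with no padding (which is how a NumPy structured dtype is laid out by default); the name `MBO_FORMAT` is made up for this example:

```python
import struct

# "<" means little-endian with no padding. One format code per dtype field:
# nwords, type, pub_id, product_id, ts_event, order_id, price, size,
# flags, chan_id, side, action, ts_recv, ts_in_delta, sequence
MBO_FORMAT = "<BBHIQQqIbBccQiI"

record_size = struct.calcsize(MBO_FORMAT)

# First record from this tutorial's sample data, packed and unpacked again.
first = (14, 32, 1, 5482, 1609099225061045683, 647439984644,
         315950000000000, 2, 0, 0, b"B", b"A",
         1609099225250461359, 92701, 1098)
raw = struct.pack(MBO_FORMAT, *first)
decoded = struct.unpack(MBO_FORMAT, raw)

print(record_size, len(raw), decoded == first)
```

The round trip only demonstrates the arithmetic behind the record size; it is not a description of how the library itself decodes data.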

### Symbology resolution

The metadata contains all information which would have been provided in a `symbology.resolve` request:

In [25]:
data.symbology

{'result': {'ESH1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '5482'}],
  'ESH1-ESH2': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '21885'}],
  'ESH1-ESM1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '19651'}],
  'ESH1-ESU1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '4223'}],
  'ESH1-ESZ1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '20076'}],
  'ESH2': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '5782'}],
  'ESM1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '3853'}],
  'ESM1-ESH2': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '22223'}],
  'ESM1-ESU1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '4673'}],
  'ESM1-ESZ1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '22279'}],
  'ESU1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '1030'}],
  'ESU1-ESH2': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '16280'}],
  'ESU1-ESZ1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '20117'}],
  'ESZ1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '8858'}],


### Symbology mappings

A subset of the symbology metadata includes the mappings between the requested symbols `stype_in` and the specified `stype_out`.

In [26]:
data.mappings

{'ESH1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '5482'}],
 'ESH1-ESH2': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '21885'}],
 'ESH1-ESM1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '19651'}],
 'ESH1-ESU1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '4223'}],
 'ESH1-ESZ1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '20076'}],
 'ESH2': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '5782'}],
 'ESM1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '3853'}],
 'ESM1-ESH2': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '22223'}],
 'ESM1-ESU1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '4673'}],
 'ESM1-ESZ1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '22279'}],
 'ESU1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '1030'}],
 'ESU1-ESH2': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '16280'}],
 'ESU1-ESZ1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '20117'}],
 'ESZ1': [{'t0': '2020-12-27', 't1': '2020-12-30', 's': '8858'}],
 'ESZ1-ESH2': [{'t0': '2
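Because the records themselves carry only a numeric `product_id`, it is often handy to invert the mappings into a lookup from ID back to symbol. A small sketch using a few entries copied from the output above; it assumes `'s'` holds the resolved `stype_out` value (it matches the `product_id` seen in the records) and that each ID maps to a single symbol over the requested range. The helper name `symbols_by_product_id` is hypothetical:

```python
# A few entries copied from the `data.mappings` output above.
mappings = {
    "ESH1": [{"t0": "2020-12-27", "t1": "2020-12-30", "s": "5482"}],
    "ESH2": [{"t0": "2020-12-27", "t1": "2020-12-30", "s": "5782"}],
    "ESM1": [{"t0": "2020-12-27", "t1": "2020-12-30", "s": "3853"}],
    "ESH1-ESH2": [{"t0": "2020-12-27", "t1": "2020-12-30", "s": "21885"}],
}


def symbols_by_product_id(mappings: dict) -> dict:
    """Invert {symbol: [intervals]} into {product_id: symbol}.

    Assumes one symbol per product_id across all intervals; if an ID were
    reused over time, the last interval seen would win.
    """
    lookup = {}
    for symbol, intervals in mappings.items():
        for interval in intervals:
            lookup[int(interval["s"])] = symbol
    return lookup


lookup = symbols_by_product_id(mappings)
print(lookup[5482])
```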

### Instrument definitions

The metadata also contains 'mini-definitions', which are a subset of the full `definition` schema (also available through the `timeseries` endpoint).

In [27]:
data.definitions

{'ESH1': [{'symbol': 'ESH1',
   'date': 20201227,
   'asset': 'ES',
   'exchange': 'XCME',
   'security_type': 'FUT',
   'min_price_increment': 25000000000,
   'display_factor': 10000000,
   'activation': 1576852200000000000,
   'expiration': 1616160600000000000,
   'currency': 'USD',
   'ts_event': 1609088715926494829}],
 'ESH2': [{'symbol': 'ESH2',
   'date': 20201227,
   'asset': 'ES',
   'exchange': 'XCME',
   'security_type': 'FUT',
   'min_price_increment': 25000000000,
   'display_factor': 10000000,
   'activation': 1608301800000000000,
   'expiration': 1647610200000000000,
   'currency': 'USD',
   'ts_event': 1609088715926494829}],
 'ESM1': [{'symbol': 'ESM1',
   'date': 20201227,
   'asset': 'ES',
   'exchange': 'XCME',
   'security_type': 'FUT',
   'min_price_increment': 25000000000,
   'display_factor': 10000000,
   'activation': 1584711000000000000,
   'expiration': 1624023000000000000,
   'currency': 'USD',
   'ts_event': 1609088715926494829}],
 'ESZ1': [{'symbol': 'ESZ1',

In [28]:
data.instrument('ESH1')

[{'symbol': 'ESH1',
  'date': 20201227,
  'asset': 'ES',
  'exchange': 'XCME',
  'security_type': 'FUT',
  'min_price_increment': 25000000000,
  'display_factor': 10000000,
  'activation': 1576852200000000000,
  'expiration': 1616160600000000000,
  'currency': 'USD',
  'ts_event': 1609088715926494829}]
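The `activation`, `expiration` and `ts_event` fields are integers in nanoseconds since the UNIX epoch, the same representation as the `ts_recv` values in the records. A standard-library sketch for turning them into readable UTC datetimes (the helper `ns_to_datetime` is our own; note that `datetime` only keeps microseconds, so the last three digits are dropped):

```python
from datetime import datetime, timedelta, timezone


def ns_to_datetime(ts_ns: int) -> datetime:
    """Convert integer nanoseconds since the UNIX epoch to an aware UTC datetime.

    Integer arithmetic avoids float rounding; sub-microsecond digits are dropped.
    """
    seconds, nanos = divmod(ts_ns, 1_000_000_000)
    whole = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return whole + timedelta(microseconds=nanos // 1000)


# Values copied from the ESH1 definition above.
activation = ns_to_datetime(1576852200000000000)
expiration = ns_to_datetime(1616160600000000000)

print(activation.isoformat())
print(expiration.isoformat())
```

For ESH1 this gives an expiration on 2021-03-19, the third Friday of March 2021.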

### Pandas DataFrame

We can easily obtain a pandas `DataFrame` by calling the `to_df` method.

In [38]:
df = data.to_df(pretty_px=True, pretty_ts=True)

# For now we must lightly process prices ourselves to account for the
# display factor; eventually `pretty_px` will handle this.
definition = data.instrument("ESH1")
df["price"] = df["price"] * (definition[0]["display_factor"] * 1e-9)

df.head(20)

| ts_recv | ts_event | ts_in_delta | pub_id | product_id | order_id | action | side | flags | price | size | sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-12-27 20:00:25.250461359+00:00 | 2020-12-27 20:00:25.061045683+00:00 | 92701 | 1 | 5482 | 647439984644 | A | B | 0 | 3159.5 | 2 | 1098 |
| 2020-12-27 20:00:25.250461359+00:00 | 2020-12-27 20:00:25.061045683+00:00 | 92701 | 1 | 5482 | 647439984689 | A | B | 0 | 3105.5 | 3 | 1098 |
| 2020-12-27 20:00:25.250461359+00:00 | 2020-12-27 20:00:25.061045683+00:00 | 92701 | 1 | 5482 | 647508324609 | A | B | 0 | 3300.0 | 2 | 1098 |
| 2020-12-27 20:00:25.250461359+00:00 | 2020-12-27 20:00:25.061045683+00:00 | 92701 | 1 | 5482 | 647530969859 | A | B | 0 | 2870.0 | 10 | 1098 |
| 2020-12-27 20:00:25.250461359+00:00 | 2020-12-27 20:00:25.061045683+00:00 | 92701 | 1 | 5482 | 647570749552 | A | B | 0 | 3213.25 | 1 | 1098 |
| 2020-12-27 20:00:25.250461359+00:00 | 2020-12-27 20:00:25.061045683+00:00 | 92701 | 1 | 5482 | 647570749560 | A | B | 0 | 3212.25 | 1 | 1098 |
| 2020-12-27 20:00:25.250461359+00:00 | 2020-12-27 20:00:25.061045683+00:00 | 92701 | 1 | 5482 | 647570749656 | A | B | 0 | 3211.25 | 1 | 1098 |
| 2020-12-27 20:00:25.250461359+00:00 | 2020-12-27 20:00:25.061045683+00:00 | 92701 | 1 | 5482 | 647570749727 | A | B | 0 | 3210.25 | 1 | 1098 |
| 2020-12-27 20:00:25.250461359+00:00 | 2020-12-27 20:00:25.061045683+00:00 | 92701 | 1 | 5482 | 647570749776 | A | B | 0 | 3209.25 | 1 | 1098 |
| 2020-12-27 20:00:25.250461359+00:00 | 2020-12-27 20:00:25.061045683+00:00 | 92701 | 1 | 5482 | 647570749868 | A | B | 0 | 3208.25 | 1 | 1098 |
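The price scaling in the cell above can also be reproduced by hand from the raw integers. This sketch assumes that both raw prices and `display_factor` are fixed-point integers with nine implied decimal places, which is inferred from the `1e-9` factor in the cell and from the sample values rather than stated by the tutorial. `display_price` and `FIXED_POINT_SCALE` are names invented for the example:

```python
from fractions import Fraction

FIXED_POINT_SCALE = 10**9  # assumed: nine implied decimal places


def display_price(raw_price: int, display_factor: int) -> float:
    """Scale a raw fixed-point price by a fixed-point display factor.

    Exact rational arithmetic avoids rounding error from chained float multiplications.
    """
    return float(Fraction(raw_price * display_factor, FIXED_POINT_SCALE**2))


# Values copied from data.instrument("ESH1").
display_factor = 10000000
min_price_increment = 25000000000

# Raw prices of the first three sample records (before any scaling).
raw_prices = [315950000000000, 310550000000000, 330000000000000]

prices = [display_price(p, display_factor) for p in raw_prices]
tick = display_price(min_price_increment, display_factor)

print(prices, tick)
```

Applying the same scaling to `min_price_increment` yields a tick size of 0.25, and every converted price is a whole multiple of it, which is a useful sanity check on the conversion.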


### Numpy arrays

It's also possible to work with an array of individual records represented as `np.ndarray`(s).



In [30]:
data.to_ndarray()[:10]

array([(14, 32, 1, 5482, 1609099225061045683, 647439984644, 315950000000000,  2, 0, 0, b'B', b'A', 1609099225250461359, 92701, 1098),
       (14, 32, 1, 5482, 1609099225061045683, 647439984689, 310550000000000,  3, 0, 0, b'B', b'A', 1609099225250461359, 92701, 1098),
       (14, 32, 1, 5482, 1609099225061045683, 647508324609, 330000000000000,  2, 0, 0, b'B', b'A', 1609099225250461359, 92701, 1098),
       (14, 32, 1, 5482, 1609099225061045683, 647530969859, 287000000000000, 10, 0, 0, b'B', b'A', 1609099225250461359, 92701, 1098),
       (14, 32, 1, 5482, 1609099225061045683, 647570749552, 321325000000000,  1, 0, 0, b'B', b'A', 1609099225250461359, 92701, 1098),
       (14, 32, 1, 5482, 1609099225061045683, 647570749560, 321225000000000,  1, 0, 0, b'B', b'A', 1609099225250461359, 92701, 1098),
       (14, 32, 1, 5482, 1609099225061045683, 647570749656, 321125000000000,  1, 0, 0, b'B', b'A', 1609099225250461359, 92701, 1098),
       (14, 32, 1, 5482, 1609099225061045683, 647570749727, 32

### Replay

We may also have a use case for replaying the time series data stream record by record through a handler.

In [31]:
def my_handler(record):
    # backtesting / trading strategies (event-driven)
    print(record)

In [32]:
# data.replay(my_handler)
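A handler does not have to be a plain function; any callable works for this pattern, which makes it easy to keep state between records. The sketch below is a self-contained illustration: `replay` is a stand-in we define ourselves for `data.replay`, and the records are plain dicts built from the first four sample rows, since the tutorial does not show the record type the real method passes to the handler. Reading `'A'` as "add" and `'B'` as "bid" is also an assumption:

```python
from collections import defaultdict


class BidSizeTracker:
    """Stateful handler that accumulates added bid size per price level."""

    def __init__(self):
        self.size_at_price = defaultdict(int)
        self.records_seen = 0

    def __call__(self, record):
        self.records_seen += 1
        # Assumed meanings: action 'A' = add order, side 'B' = bid.
        if record["action"] == "A" and record["side"] == "B":
            self.size_at_price[record["price"]] += record["size"]


def replay(records, handler):
    """Stand-in for `data.replay`: feed records to the handler one at a time."""
    for record in records:
        handler(record)


# First four rows of the sample data, reduced to the fields used here.
sample = [
    {"action": "A", "side": "B", "price": 3159.5, "size": 2},
    {"action": "A", "side": "B", "price": 3105.5, "size": 3},
    {"action": "A", "side": "B", "price": 3300.0, "size": 2},
    {"action": "A", "side": "B", "price": 2870.0, "size": 10},
]

tracker = BidSizeTracker()
replay(sample, tracker)
print(tracker.records_seen, sum(tracker.size_at_price.values()))
```

Keeping the state on the handler object rather than in globals makes it straightforward to run several independent strategies over the same stream.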

## Time series batch requests

It's possible to make time series batch requests programmatically with the client library. Doing so submits a job to the batch endpoint of the backend API.

In [33]:
client.batch.timeseries_submit(
    dataset="GLBX.MDP3",
    symbols=["ESH1"],
    schema="trades",
    start="2020-12-27T12:00",
    end="2020-12-29",
    encoding="dbz",
    delivery="download",
    compression="zstd",
    limit=100,  # <-- request limited to 100 records
)

{'id': 'GLBX-20220718-B6WM8NCU3K',
 'user_id': '46PCMCVF',
 'bill_id': '75c1035a',
 'dataset': 'GLBX.MDP3',
 'symbols': 'ESH1',
 'stype_in': 'native',
 'stype_out': 'product_id',
 'schema': 'trades',
 'start': '2020-12-27 12:00:00+00:00',
 'end': '2020-12-29 00:00:00+00:00',
 'limit': 100,
 'encoding': 'dbz',
 'compression': 'zstd',
 'nrows': 100,
 'ncols': 14,
 'split_duration': 'day',
 'split_size': None,
 'packaging': 'none',
 'delivery': 'download',
 'is_example': False,
 'billed_size': 4800,
 'actual_size': None,
 'package_size': None,
 'state': 'queued',
 'ts_received': '2022-07-18 08:35:57.625343+00:00',
 'ts_queued': None,
 'ts_process_start': None,
 'ts_process_done': None,
 'ts_expiration': None}

We can also list the batch jobs we have made, with optional filter parameters for `state` and `since`. This could help us programmatically build and manage larger data pipelines.

In [34]:
client.batch.list_jobs(since=pd.Timestamp.utcnow() - pd.Timedelta(minutes=5))

[{'id': 'GLBX-20220718-B6WM8NCU3K',
  'user_id': '46PCMCVF',
  'bill_id': '75c1035a',
  'dataset': 'GLBX.MDP3',
  'symbols': 'ESH1',
  'stype_in': 'native',
  'stype_out': 'product_id',
  'schema': 'trades',
  'start': '2020-12-27 12:00:00+00:00',
  'end': '2020-12-29 00:00:00+00:00',
  'limit': 100,
  'encoding': 'dbz',
  'compression': 'zstd',
  'nrows': 100,
  'ncols': 14,
  'split_duration': 'day',
  'split_size': None,
  'packaging': 'none',
  'delivery': 'download',
  'is_example': False,
  'billed_size': 4800,
  'actual_size': None,
  'package_size': None,
  'state': 'queued',
  'ts_received': '2022-07-18 08:35:57.625343+00:00',
  'ts_queued': '2022-07-18 08:35:58.144992+00:00',
  'ts_process_start': None,
  'ts_process_done': None,
  'ts_expiration': None,
  'progress': 0}]
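Since each job is returned as a plain dictionary, pipeline logic can be written with ordinary Python. The following sketch works offline on a trimmed copy of the job shown above; the helpers `jobs_in_state` and `seconds_until_queued` are hypothetical, and `'queued'` is the only state value that appears in this tutorial:

```python
from datetime import datetime

# Trimmed copy of the job dictionary returned above.
jobs = [
    {
        "id": "GLBX-20220718-B6WM8NCU3K",
        "state": "queued",
        "ts_received": "2022-07-18 08:35:57.625343+00:00",
        "ts_queued": "2022-07-18 08:35:58.144992+00:00",
        "progress": 0,
    }
]


def jobs_in_state(jobs: list, state: str) -> list:
    """Return the ids of jobs whose 'state' field equals `state`."""
    return [job["id"] for job in jobs if job.get("state") == state]


def seconds_until_queued(job: dict) -> float:
    """Time between the job being received and being queued, in seconds."""
    received = datetime.fromisoformat(job["ts_received"])
    queued = datetime.fromisoformat(job["ts_queued"])
    return (queued - received).total_seconds()


print(jobs_in_state(jobs, "queued"))
print(seconds_until_queued(jobs[0]))
```

In a real pipeline the `jobs` list would come from `client.batch.list_jobs(...)`, and timestamp fields that are still `None` would need to be checked before parsing.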