In [1]:
import datetime
import pandas as pd
import pprint
import pyaurorax

aurorax = pyaurorax.PyAuroraX()

# Search for data product records

The AuroraX database also includes records describing data products for auroral data, such as keograms, montages, summary plots, etc. We can search for data products in much the same way as we search for ephemeris records and conjunctions.

More information about data product records can be found [here](https://docs.aurorax.space/about_the_data/overview/) and [here](https://docs.aurorax.space/about_the_data/categories/#data-products).

A common stumbling block when making search queries is not knowing which values are valid for the program, platform, instrument type, etc. The AuroraX search engine is underpinned by 'data sources', and this is where that information can be found. Use the `aurorax.search.sources.list()` function to show all the available data sources that you can use when constructing search queries. For metadata filter requests, the data sources also identify the available filter keys and values. We'll have a closer look at this in the metadata filter examples further below.

In [2]:
# The data sources are what we use for search queries. We list some below,
# and in the following search queries in this notebook, we utilize this
# information for the program, platform, instrument type fields.

# let's list the TREx RGB data sources, just to get a table view of a few
aurorax.search.sources.list_in_table(program="trex", instrument_type="RGB ASI")

# the below line gets the same data sources as a list of objects, which we'll
# use later to explore the available metadata filters
sources = aurorax.search.sources.list(program="trex", instrument_type="RGB ASI")

Identifier   Program   Platform      Instrument Type   Source Type   Display Name 
96           trex      fort smith    RGB ASI           ground        TREx RGB FSMI
101          trex      lucky lake    RGB ASI           ground        TREx RGB LUCK
102          trex      pinawa        RGB ASI           ground        TREx RGB PINA
103          trex      gillam        RGB ASI           ground        TREx RGB GILL
104          trex      rabbit lake   RGB ASI           ground        TREx RGB RABB
339          trex      athabasca     RGB ASI           ground        TREx RGB ATHA
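
One way to turn a source listing into values you can pass to a query is to collect the distinct programs, platforms, and instrument types. The sketch below is a minimal illustration under stated assumptions: it uses `SimpleNamespace` stand-ins built from a few rows of the table above rather than real data source objects, and `collect_query_values` is a hypothetical helper name, not part of PyAuroraX. It only assumes each record exposes `program`, `platform`, and `instrument_type` attributes.

```python
from types import SimpleNamespace


def collect_query_values(sources):
    """Return the distinct program, platform, and instrument type values."""
    return {
        "programs": sorted({s.program for s in sources}),
        "platforms": sorted({s.platform for s in sources}),
        "instrument_types": sorted({s.instrument_type for s in sources}),
    }


# stand-ins for data source records, copied from a few rows of the table above
example_sources = [
    SimpleNamespace(program="trex", platform="fort smith", instrument_type="RGB ASI"),
    SimpleNamespace(program="trex", platform="lucky lake", instrument_type="RGB ASI"),
    SimpleNamespace(program="trex", platform="gillam", instrument_type="RGB ASI"),
]

query_values = collect_query_values(example_sources)
print(query_values)
```

With a real listing, passing the list returned by `aurorax.search.sources.list()` in place of `example_sources` should behave the same way, as long as those three attributes are present.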


Now that we know a bit more about how the data sources come into play with the search engine, let's have a look at a basic data product search. Let's get one day of data product records for the TREx program.

In [3]:
# set search parameters
start = datetime.datetime(2020, 2, 1, 0, 0, 0)
end = datetime.datetime(2020, 2, 1, 23, 59, 59)
programs = ["trex"]
instrument_types = ["RGB ASI"]

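# perform search
#
# NOTE: only the program filter is passed here, so the instrument_types list
# defined above is not applied and records for any TREx instrument type
# can be returned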
s = aurorax.search.data_products.search(start, end, programs=programs, verbose=True)

[2025-01-26 00:14:34.088667] Search object created
[2025-01-26 00:14:34.176518] Request submitted
[2025-01-26 00:14:34.176585] Request ID: a1cd8422-4a13-45e8-b91b-386f0ca7a7bb
[2025-01-26 00:14:34.176614] Request details available at: https://api.aurorax.space/api/v1/data_products/requests/a1cd8422-4a13-45e8-b91b-386f0ca7a7bb
[2025-01-26 00:14:34.176639] Waiting for data ...
[2025-01-26 00:14:35.705916] Checking for data ...
[2025-01-26 00:14:36.224453] Data is now available
[2025-01-26 00:14:36.224645] Retrieving data ...
[2025-01-26 00:14:36.362717] Retrieved 684.8 kB of data containing 280 records


In [4]:
# show the first 10 data product records
#
# NOTE: while here we format the results into a Pandas dataframe, this
# is not required. We actually don't include Pandas as a dependency since
# it's used simply as a nice add-on to view data. If you're good with slicing
# and dicing lists and dictionaries, you'll be fine without it.
data_products = [d.__dict__ for d in s.data]
df = pd.DataFrame(data_products)
df.sort_values("start")[0:10]

Unnamed: 0,data_source,data_product_type,start,end,url,metadata
0,"DataSource(identifier=104, program='trex', pla...",montage,2020-02-01T00:00:00,2020-02-01T00:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'montage_type': 'hourly', 'imaging_end_time':..."
26,"DataSource(identifier=103, program='trex', pla...",montage,2020-02-01T00:00:00,2020-02-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'montage_type': 'daily', 'imaging_end_time': ..."
27,"DataSource(identifier=101, program='trex', pla...",montage,2020-02-01T00:00:00,2020-02-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'montage_type': 'daily', 'imaging_end_time': ..."
28,"DataSource(identifier=101, program='trex', pla...",keogram,2020-02-01T00:00:00,2020-02-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'keogram_type': 'daily_moviederived', 'imagin..."
29,"DataSource(identifier=102, program='trex', pla...",movie,2020-02-01T00:00:00,2020-02-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'movie_type': 'real-time daily', 'imaging_end..."
30,"DataSource(identifier=104, program='trex', pla...",keogram,2020-02-01T00:00:00,2020-02-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'keogram_type': 'daily', 'imaging_end_time': ..."
31,"DataSource(identifier=104, program='trex', pla...",keogram,2020-02-01T00:00:00,2020-02-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'keogram_type': 'daily_moviederived', 'imagin..."
32,"DataSource(identifier=104, program='trex', pla...",montage,2020-02-01T00:00:00,2020-02-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'montage_type': 'daily', 'imaging_end_time': ..."
33,"DataSource(identifier=96, program='trex', plat...",montage,2020-02-01T00:00:00,2020-02-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'montage_type': 'daily', 'imaging_end_time': ..."
34,"DataSource(identifier=103, program='trex', pla...",movie,2020-02-01T00:00:00,2020-02-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'movie_type': 'real-time daily', 'imaging_end..."
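
As the note in the cell above says, Pandas is optional. The sketch below shows roughly the same "sort by start and take the first 10" step with plain lists and dictionaries, plus a count per data product type. The records are small stand-ins with sample values (only three fields each), not real search results.

```python
from collections import Counter

# stand-in records shaped like `d.__dict__` above (trimmed, sample values)
example_records = [
    {"data_product_type": "montage", "start": "2020-02-01T00:00:00", "end": "2020-02-01T00:59:00"},
    {"data_product_type": "keogram", "start": "2020-02-01T01:00:00", "end": "2020-02-01T01:59:00"},
    {"data_product_type": "movie", "start": "2020-02-01T00:00:00", "end": "2020-02-01T23:59:00"},
    {"data_product_type": "keogram", "start": "2020-02-01T00:00:00", "end": "2020-02-01T23:59:00"},
]

# sort by start time and keep at most the first 10 records
first_records = sorted(example_records, key=lambda r: r["start"])[:10]

# count how many records exist for each data product type
type_counts = Counter(r["data_product_type"] for r in example_records)

for record in first_records:
    print(record["start"], record["end"], record["data_product_type"])
print(dict(type_counts))
```

The same `sorted()` call works whether `start` holds ISO 8601 strings or `datetime` objects, since both sort chronologically.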


# Search with metadata filters

Using the metadata filters to help search for data product records is one of the more advanced tools available. There exist metadata filters for some ground-based auroral data sources that we can utilize to filter our results further.

An important part of being able to utilize the metadata filters is knowing the available keys and values. Each data source record has an attribute named `data_product_metadata_schema` that describes them.

In [5]:
# using the data source listing that we retrieved further above, let's
# have a look at one of the records
#
# for the first data source, print only the first metadata filter info
print(sources[0].program, sources[0].platform, sources[0].instrument_type)
pprint.pprint(sources[0].data_product_metadata_schema[0])  # type: ignore

trex fort smith RGB ASI
{'additional_description': '',
 'allowed_values': ['hourly',
                    'hourly_hires',
                    'daily',
                    'daily_hires',
                    'daily_hires_200px',
                    'daily_moviebased'],
 'data_type': 'string',
 'description': 'Type of keogram',
 'field_name': 'keogram_type'}


Above, we see just one of the metadata filters we can use for a TREx RGB. We'll leave it up to you from here to explore the additional filters for the TREx RGBs, and the available filters for any other data source.
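
As a starting point for that exploration, the sketch below lists the filter keys in a schema and checks a candidate value before it is used in a filter. It is a minimal illustration: `list_filter_keys` and `is_allowed` are hypothetical helpers, the one-entry schema is copied from the output above, and treating an empty `allowed_values` list as "not enumerated" is an assumption of this sketch.

```python
def list_filter_keys(schema):
    """Map each metadata field name to its list of allowed values."""
    return {entry["field_name"]: entry.get("allowed_values") for entry in schema}


def is_allowed(schema, key, value):
    """Check a key/value pair against the schema before building a filter."""
    keys = list_filter_keys(schema)
    if key not in keys:
        return False
    allowed = keys[key]
    # assumption: an empty or missing list means the values are not enumerated
    return not allowed or value in allowed


# one-entry schema copied from the output above; a real schema has more entries
example_schema = [
    {
        "additional_description": "",
        "allowed_values": ["hourly", "hourly_hires", "daily", "daily_hires",
                           "daily_hires_200px", "daily_moviebased"],
        "data_type": "string",
        "description": "Type of keogram",
        "field_name": "keogram_type",
    },
]

print(list_filter_keys(example_schema))
print(is_allowed(example_schema, "keogram_type", "daily"))
print(is_allowed(example_schema, "keogram_type", "weekly"))
```

With the listing retrieved earlier, `sources[0].data_product_metadata_schema` could be passed in place of `example_schema`.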

Now that we understand the metadata filter keys and values a bit more, let's look at a simple example: a data product search that returns only the daily keograms for the TREx RGBs.

In [6]:
# set search parameters
start = datetime.datetime(2021, 1, 1, 0, 0, 0)
end = datetime.datetime(2021, 1, 1, 23, 59, 59)
programs = ["trex"]
instrument_types = ["RGB ASI"]

# set metadata filter
metadata_filter = aurorax.search.MetadataFilter(expressions=[
    aurorax.search.MetadataFilterExpression("keogram_type", "daily", operator="="),
])

# perform search
s = aurorax.search.data_products.search(
    start,
    end,
    programs=programs,
    instrument_types=instrument_types,
    metadata_filters=metadata_filter,
    verbose=True,
)

[2025-01-26 00:14:36.415289] Search object created
[2025-01-26 00:14:36.458252] Request submitted
[2025-01-26 00:14:36.458401] Request ID: 39d9f13e-96df-4cd2-bd83-86d826d0551b
[2025-01-26 00:14:36.458433] Request details available at: https://api.aurorax.space/api/v1/data_products/requests/39d9f13e-96df-4cd2-bd83-86d826d0551b
[2025-01-26 00:14:36.458456] Waiting for data ...
[2025-01-26 00:14:37.966013] Checking for data ...
[2025-01-26 00:14:38.461931] Data is now available
[2025-01-26 00:14:38.462105] Retrieving data ...
[2025-01-26 00:14:38.518135] Retrieved 12.8 kB of data containing 5 records


In [7]:
# show the results
data_products = [d.__dict__ for d in s.data]
df = pd.DataFrame(data_products)
df.sort_values("start")

Unnamed: 0,data_source,data_product_type,start,end,url,metadata
0,"DataSource(identifier=96, program='trex', plat...",keogram,2021-01-01T00:00:00,2021-01-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'keogram_type': 'daily', 'imaging_end_time': ..."
1,"DataSource(identifier=101, program='trex', pla...",keogram,2021-01-01T00:00:00,2021-01-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'keogram_type': 'daily', 'imaging_end_time': ..."
2,"DataSource(identifier=102, program='trex', pla...",keogram,2021-01-01T00:00:00,2021-01-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'keogram_type': 'daily', 'imaging_end_time': ..."
3,"DataSource(identifier=103, program='trex', pla...",keogram,2021-01-01T00:00:00,2021-01-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'keogram_type': 'daily', 'imaging_end_time': ..."
4,"DataSource(identifier=104, program='trex', pla...",keogram,2021-01-01T00:00:00,2021-01-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'keogram_type': 'daily', 'imaging_end_time': ..."
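
The filtering itself is done by the AuroraX search engine, not by your code. Purely to illustrate what the `=` expression selects, the sketch below applies the same key/value test locally to a few stand-in records. `matches_all` is a hypothetical helper, the records are trimmed examples rather than real results, and requiring every expression to match is an assumption of this sketch.

```python
def matches_all(record, expressions):
    """True if the record's metadata equals the wanted value for every key."""
    metadata = record.get("metadata") or {}
    return all(metadata.get(key) == value for key, value in expressions)


# trimmed stand-in records; real records carry more fields
example_records = [
    {"data_product_type": "keogram", "metadata": {"keogram_type": "daily"}},
    {"data_product_type": "keogram", "metadata": {"keogram_type": "hourly"}},
    {"data_product_type": "montage", "metadata": {"montage_type": "daily"}},
]

# local stand-in for the ("keogram_type", "daily", operator="=") expression
expressions = [("keogram_type", "daily")]

daily_keograms = [r for r in example_records if matches_all(r, expressions)]
print(daily_keograms)
```

Note that the montage record is excluded even though its `montage_type` is `daily`, because the filter key is `keogram_type`.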


# Do the search step-by-step

Under the hood, the AuroraX API performs a data products search asynchronously. Note that this does not mean that it can be done using a Python async method; it means that PyAuroraX makes more than a single HTTP request when doing a search. Operating this way adds some complexity within PyAuroraX, but it also opens the search up to some very important capabilities.

The main capability enabled by this architecture is being able to perform queries for large timeframes, and/or between a large number of data sources. Queries like this can sometimes take several minutes, and cause browsers and programmatic HTTP requests to time out.

Instead of using the `aurorax.search.data_products.search()` method like we have been, you can also perform a search step-by-step when more control over the process is desired. This is achieved by using the `return_immediately` parameter. One use case is starting a series of searches and then collecting the results from each as it finishes, as opposed to running one search at a time (in other words, parallelized searches).

In [8]:
# set up the search parameters
start = datetime.datetime(2020, 2, 1, 0, 0, 0)
end = datetime.datetime(2020, 2, 5, 23, 59, 59)
programs = ["trex"]
instrument_types = ["RGB ASI"]

# create the Search object
s = aurorax.search.data_products.search(start, end, programs=programs, instrument_types=instrument_types, return_immediately=True)
s.pretty_print()

DataProductSearch:
  executed     : False
  completed    : False
  request_id   : 
  request      : None
  request_url  : 
  data_url     : 
  query        : {'data_sources': {'programs': ['trex'], 'platforms': [], 'instrument_types': ['R...
  status       : {}
  data         : 
  logs         : 


In [9]:
# submit the search to begin
s.execute()
s.pretty_print()

DataProductSearch:
  executed     : True
  completed    : False
  request_id   : 0348c764-7afa-4a3b-a72a-4385baa6b908
  request      : AuroraXAPIResponse [202] (Accepted)
  request_url  : https://api.aurorax.space/api/v1/data_products/requests/0348c764-7afa-4a3b-a72a-4385baa6b908
  data_url     : 
  query        : {'data_sources': {'programs': ['trex'], 'platforms': [], 'instrument_types': ['R...
  status       : {}
  data         : [0 data product results]
  logs         : [0 log messages]


In [10]:
# update the search request status
s.update_status()
s.pretty_print()

DataProductSearch:
  executed     : True
  completed    : False
  request_id   : 0348c764-7afa-4a3b-a72a-4385baa6b908
  request      : AuroraXAPIResponse [202] (Accepted)
  request_url  : https://api.aurorax.space/api/v1/data_products/requests/0348c764-7afa-4a3b-a72a-4385baa6b908
  data_url     : 
  query        : {'data_sources': {'programs': ['trex'], 'platforms': [], 'instrument_types': ['R...
  status       : {'search_request': {'request_id': '0348c764-7afa-4a3b-a72a-4385baa6b908', 'query...
  data         : [0 data product results]
  logs         : [2 log messages]


In [11]:
# wait for the data to be available
s.wait()
s.update_status()
s.pretty_print()

DataProductSearch:
  executed     : True
  completed    : True
  request_id   : 0348c764-7afa-4a3b-a72a-4385baa6b908
  request      : AuroraXAPIResponse [202] (Accepted)
  request_url  : https://api.aurorax.space/api/v1/data_products/requests/0348c764-7afa-4a3b-a72a-4385baa6b908
  data_url     : https://api.aurorax.space/api/v1/data_products/requests/0348c764-7afa-4a3b-a72a-4385baa6b908/data
  query        : {'data_sources': {'programs': ['trex'], 'platforms': [], 'instrument_types': ['R...
  status       : {'search_request': {'request_id': '0348c764-7afa-4a3b-a72a-4385baa6b908', 'query...
  data         : [0 data product results]
  logs         : [7 log messages]


In [12]:
# now that we know the request is complete, let's retrieve the data
s.get_data()
s.pretty_print()

DataProductSearch:
  executed     : True
  completed    : True
  request_id   : 0348c764-7afa-4a3b-a72a-4385baa6b908
  request      : AuroraXAPIResponse [202] (Accepted)
  request_url  : https://api.aurorax.space/api/v1/data_products/requests/0348c764-7afa-4a3b-a72a-4385baa6b908
  data_url     : https://api.aurorax.space/api/v1/data_products/requests/0348c764-7afa-4a3b-a72a-4385baa6b908/data
  query        : {'data_sources': {'programs': ['trex'], 'platforms': [], 'instrument_types': ['R...
  status       : {'search_request': {'request_id': '0348c764-7afa-4a3b-a72a-4385baa6b908', 'query...
  data         : [820 data product results]
  logs         : [7 log messages]


In [13]:
# show the first 10 data product results
data_products = [d.__dict__ for d in s.data]
df = pd.DataFrame(data_products)
df.sort_values("start")[0:10]

Unnamed: 0,data_source,data_product_type,start,end,url,metadata
0,"DataSource(identifier=102, program='trex', pla...",keogram,2020-02-01T00:00:00,2020-02-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'keogram_type': 'daily_moviederived', 'imagin..."
20,"DataSource(identifier=101, program='trex', pla...",keogram,2020-02-01T00:00:00,2020-02-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'keogram_type': 'daily_hires', 'imaging_end_t..."
21,"DataSource(identifier=101, program='trex', pla...",keogram,2020-02-01T00:00:00,2020-02-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'keogram_type': 'daily_moviederived', 'imagin..."
22,"DataSource(identifier=103, program='trex', pla...",montage,2020-02-01T00:00:00,2020-02-01T00:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'montage_type': 'hourly', 'imaging_end_time':..."
23,"DataSource(identifier=103, program='trex', pla...",keogram,2020-02-01T00:00:00,2020-02-01T00:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'keogram_type': 'hourly', 'imaging_end_time':..."
24,"DataSource(identifier=103, program='trex', pla...",keogram,2020-02-01T00:00:00,2020-02-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'keogram_type': 'daily_moviederived', 'imagin..."
26,"DataSource(identifier=103, program='trex', pla...",keogram,2020-02-01T00:00:00,2020-02-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'keogram_type': 'daily_hires_200px', 'imaging..."
27,"DataSource(identifier=103, program='trex', pla...",montage,2020-02-01T00:00:00,2020-02-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'montage_type': 'daily', 'imaging_end_time': ..."
28,"DataSource(identifier=103, program='trex', pla...",keogram,2020-02-01T00:00:00,2020-02-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'keogram_type': 'daily', 'imaging_end_time': ..."
29,"DataSource(identifier=103, program='trex', pla...",movie,2020-02-01T00:00:00,2020-02-01T23:59:00,https://data.phys.ucalgary.ca/sort_by_project/...,"{'movie_type': 'real-time daily', 'imaging_end..."
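
The parallelized use case mentioned at the start of this section can be sketched as a small polling loop: start every search, then repeatedly check each one and fetch its data as soon as it completes. To keep the example runnable without network access, it uses `FakeSearch`, a hypothetical stand-in that only mimics the `execute()`, `update_status()`, `get_data()`, and `completed` members shown above. `run_all`, the poll interval, and the timeout are likewise illustrative choices, not part of PyAuroraX.

```python
import time


class FakeSearch:
    """Hypothetical stand-in exposing the same steps as the search object above."""

    def __init__(self, name, polls_needed):
        self.name = name
        self.executed = False
        self.completed = False
        self.data = []
        self._polls_left = polls_needed

    def execute(self):
        self.executed = True

    def update_status(self):
        # pretend the server finishes after a fixed number of status checks
        self._polls_left -= 1
        if self._polls_left <= 0:
            self.completed = True

    def get_data(self):
        self.data = ["%s record" % self.name]


def run_all(searches, poll_interval=0.01, timeout=5.0):
    """Start every search, then collect results in the order they finish."""
    for s in searches:
        s.execute()

    pending = list(searches)
    finished = []
    deadline = time.monotonic() + timeout
    while pending and time.monotonic() < deadline:
        for s in list(pending):
            s.update_status()
            if s.completed:
                s.get_data()
                finished.append(s)
                pending.remove(s)
        if pending:
            time.sleep(poll_interval)
    return finished, pending


searches = [FakeSearch("slow", polls_needed=3), FakeSearch("fast", polls_needed=1)]
finished, pending = run_all(searches)
print([s.name for s in finished], [s.name for s in pending])
```

With real search objects created using `return_immediately=True`, the same loop should apply, assuming `update_status()` refreshes the `completed` flag as the `pretty_print()` output above suggests. A longer poll interval would be kinder to the API than the tiny value used here.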


# Describe the data products search as an SQL-like statement

To help understand the query a bit more, you can also 'describe' the search as an SQL-like statement. Let's look at a simple example of this.

In [2]:
# set search parameters
start = datetime.datetime(2020, 2, 1, 0, 0, 0)
end = datetime.datetime(2020, 2, 1, 23, 59, 59)
programs = ["trex"]
instrument_types = ["RGB ASI"]

# set up search
#
# NOTE: to 'describe' a query, you don't need to actually execute the search
# if you don't want to. You can use this to help you make sure the search is
# what you want, and then execute it after.
s = aurorax.search.data_products.search(start, end, programs=programs, return_immediately=True)

# describe the search
print("Query description: %s" % (s.describe()))

# execute the search now
s.execute()
s.wait()
s.get_data()
print("\nSearch completed. Found %d data product records" % (s.status["search_result"]["result_count"]))

Query description: Find data_products for ((program in (trex) filtered by metadata ()) AND  data_product_metadata_filters []) AND data_product start >= 2020-02-01T00:00 UTC AND data_product end <= 2020-02-01T23:59:59 UTC AND

Search completed. Found 280 data product records


# Configure the response data

Search results can be configured to return only certain pieces of information for each data product record. For any web developers reading this guide, this is similar to GraphQL, where you control what data you get back from an API. The common use case is optimizing response time: if you only care about a few key pieces of information, you can use the `response_format` parameter to prune the result down to a fraction of the data that would normally be returned, which speeds up the data download and the overall `search()` call.

To do this, we utilize the `response_format` parameter to control the data product data structure we get back. One caveat for searches that use the `response_format` parameter: the data returned will be a list of dictionaries, instead of a list of `DataProductData` objects.

The next question you may have is 'how do you know the response format possibilities'?

To help answer this, you can use the `aurorax.search.data_products.create_response_format_template()` function. This will return a template for the `response_format` parameter. Take this, adjust as needed, and use for search requests.

In [3]:
pprint.pprint(aurorax.search.data_products.create_response_format_template())

{'data_product_type': True,
 'data_source': {'data_product_metadata_schema': {'additional_description': True,
                                                  'allowed_values': True,
                                                  'data_type': True,
                                                  'description': True,
                                                  'field_name': True},
                 'display_name': True,
                 'ephemeris_metadata_schema': {'additional_description': True,
                                               'allowed_values': True,
                                               'data_type': True,
                                               'description': True,
                                               'field_name': True},
                 'identifier': True,
                 'instrument_type': True,
                 'maintainers': True,
                 'metadata': True,
                 'owner': True,
                 'platform': T
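
One way to "adjust as needed" is to start from the template and keep only the paths you care about. The sketch below is a minimal illustration: `prune_template` is a hypothetical helper, and `example_template` is a small excerpt built from keys visible in the output above, not the full template.

```python
def prune_template(template, keep):
    """Keep only the dotted paths listed in `keep`; everything else is dropped."""
    pruned = {}
    for path in keep:
        src, dst = template, pruned
        parts = path.split(".")
        for part in parts[:-1]:
            src = src[part]  # raises KeyError for an unknown path
            dst = dst.setdefault(part, {})
        dst[parts[-1]] = src[parts[-1]]
    return pruned


# a small excerpt of the template printed above, not the full structure
example_template = {
    "data_product_type": True,
    "data_source": {
        "display_name": True,
        "identifier": True,
        "instrument_type": True,
        "owner": True,
    },
}

pruned_format = prune_template(
    example_template,
    ["data_product_type", "data_source.display_name", "data_source.identifier"],
)
print(pruned_format)
```

Because an unknown path raises a `KeyError`, a typo in a field name is caught before the search request is sent.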

Ok, let's have a look at an example now. 

We'll do a simple search for TREx data products spanning a single day. The response format will include just the identifier and display name for the data source, and the type, start, end, and URL for the data product.

In [2]:
# set our response format
response_format = {
    "start": True,
    "end": True,
    "url": True,
    "data_product_type": True,
    "data_source": {
        "display_name": True,
        "identifier": True,
    },
}

# set core search parameters
start = datetime.datetime(2020, 2, 1, 0, 0, 0)
end = datetime.datetime(2020, 2, 1, 23, 59, 59)
programs = ["trex"]
instrument_types = ["RGB ASI"]

# perform search
s = aurorax.search.data_products.search(start, end, programs=programs, response_format=response_format, verbose=True)

[2025-02-12 07:03:49.969766] Search object created
[2025-02-12 07:03:50.003836] Request submitted
[2025-02-12 07:03:50.003861] Request ID: 795149d6-a326-40e6-ad1f-515ecd400260
[2025-02-12 07:03:50.003865] Request details available at: https://api.aurorax.space/api/v1/data_products/requests/795149d6-a326-40e6-ad1f-515ecd400260
[2025-02-12 07:03:50.003868] Waiting for data ...
[2025-02-12 07:03:51.440168] Checking for data ...
[2025-02-12 07:03:51.883085] Data is now available
[2025-02-12 07:03:51.883171] Retrieving data ...
[2025-02-12 07:03:52.020265] Retrieved 684.8 kB of data containing 280 records


In [3]:
# let's see the first record
#
# remember - since this search was done with a response_format parameter, the
# response is not a list of DataProductData objects, but is instead a list of
# dictionaries
pprint.pprint(s.data[0])

{'data_product_type': 'montage',
 'data_source': {'display_name': 'TREx RGB RABB', 'identifier': 104},
 'end': datetime.datetime(2020, 2, 1, 0, 59),
 'start': datetime.datetime(2020, 2, 1, 0, 0),
 'url': 'https://data.phys.ucalgary.ca/sort_by_project/TREx/RGB/stream2/2020/02/01/rabb_rgb-05/ut00/20200201_00_rabb_rgb-05_full-montage.jpg'}
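
Since the results are plain dictionaries, ordinary dictionary access is all that is needed to work with them. The sketch below groups URLs by data source display name. The records are stand-ins shaped like the one printed above, and the URLs are placeholders rather than real data locations.

```python
import datetime
from collections import defaultdict

# stand-in results shaped like the record above; the URLs are placeholders
example_results = [
    {
        "data_product_type": "montage",
        "data_source": {"display_name": "TREx RGB RABB", "identifier": 104},
        "start": datetime.datetime(2020, 2, 1, 0, 0),
        "end": datetime.datetime(2020, 2, 1, 0, 59),
        "url": "https://example.org/rabb_montage.jpg",
    },
    {
        "data_product_type": "keogram",
        "data_source": {"display_name": "TREx RGB RABB", "identifier": 104},
        "start": datetime.datetime(2020, 2, 1, 0, 0),
        "end": datetime.datetime(2020, 2, 1, 23, 59),
        "url": "https://example.org/rabb_keogram.jpg",
    },
    {
        "data_product_type": "keogram",
        "data_source": {"display_name": "TREx RGB GILL", "identifier": 103},
        "start": datetime.datetime(2020, 2, 1, 0, 0),
        "end": datetime.datetime(2020, 2, 1, 23, 59),
        "url": "https://example.org/gill_keogram.jpg",
    },
]

# group the URLs by data source display name using plain dictionary access
urls_by_source = defaultdict(list)
for record in example_results:
    urls_by_source[record["data_source"]["display_name"]].append(record["url"])

for name, urls in sorted(urls_by_source.items()):
    print(name, len(urls))
```

Only keys that were requested in `response_format` can be accessed this way; a key left out of the format would not be present in each record.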
