In [1]:
from mdf_forge.forge import Forge

In [2]:
mdf = Forge()

# Generally Useful Help

### current_query
You can see the query you're currently building with `current_query()`.

In [3]:
mdf.match_field("mdf.source_name", "oqmd")
mdf.current_query()

'(mdf.source_name:oqmd)'

### reset_query
If you have built a query that you no longer want, call `reset_query()` to clear it entirely and start a new one.

In [4]:
mdf.reset_query()

In [5]:
mdf.current_query()

''

### Query info
We can build a query using `exclude_field()` and `match_field()` and execute it with `search()`. To learn more about the query, including the actual query string that was sent, pass the `info=True` argument to `search()`.

In [6]:
mdf.exclude_field("mdf.source_name", "sluschi").match_field("mdf.elements", "Al").exclude_field("mdf.source_name", "oqmd")
res, info = mdf.search(limit=10, info=True)

When you use the `info=True` argument, `search()` will return a tuple instead of a list. The first element in the tuple will be the same list of results you're used to, but the second tuple element will be a dictionary of query info.

In [7]:
res[0]

{'mdf': {'collection': 'AMCS',
  'composition': 'Al4',
  'elements': ['Al'],
  'ingest_date': '2017-08-04T19:07:56.890584Z',
  'links': {'cif': {'http_host': 'http://rruff.geo.arizona.edu',
    'path': '/AMS/xtal_data/CIFfiles/19141.cif'},
   'dif': {'http_host': 'http://rruff.geo.arizona.edu',
    'path': '/AMS/xtal_data/DIFfiles/19141.txt'},
   'landing_page': 'http://rruff.geo.arizona.edu/AMS/minerals/Aluminum#18135',
   'parent_id': '5984c125f2c0043771d1507c'},
  'mdf_id': '5984c60cf2c0043771d19753',
  'metadata_version': '0.3.2',
  'resource_type': 'record',
  'scroll_id': 18135,
  'source_name': 'amcs',
  'tags': ['Aluminum', 'cif', 'dif'],
  'title': 'AMCS - Aluminum'}}

In [8]:
info

{'query': {'advanced': True,
  'limit': 10,
  'offset': 0,
  'q': '( NOT mdf.source_name:sluschi AND mdf.elements:Al AND  NOT mdf.source_name:oqmd)'},
 'total_query_matches': 18170}
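
The `info` dictionary makes it easy to check how much of the full result set a limited search returned. The sketch below works on a plain dictionary shaped like the output above, so it runs without contacting the service; `summarize_query_info` is a hypothetical helper, not part of `mdf_forge`.

```python
def summarize_query_info(info, num_returned):
    """Describe how many matches a limited search returned.

    `info` is assumed to have the shape shown above: a "query" block
    containing "q", plus a top-level "total_query_matches" count.
    """
    total = info["total_query_matches"]
    return {
        "query_string": info["query"]["q"],
        "returned": num_returned,
        "total": total,
        "truncated": num_returned < total,
    }


# Values copied from the example output above.
example_info = {
    "query": {
        "advanced": True,
        "limit": 10,
        "offset": 0,
        "q": "( NOT mdf.source_name:sluschi AND mdf.elements:Al AND  NOT mdf.source_name:oqmd)",
    },
    "total_query_matches": 18170,
}

summary = summarize_query_info(example_info, num_returned=10)
print(summary["returned"], "of", summary["total"], "matches returned")
```

With a real search, you would pass `info` and `len(res)` from `mdf.search(..., info=True)` instead of the hard-coded example.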

### Repeat a query
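By default, `search()` clears the current query from memory once it has run. Pass `reset_query=False` to keep the query so it can be run again. In the example below, the first search keeps the query, so the second search (which uses the default) runs the same query before clearing it.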

In [9]:
mdf.match_field("mdf.source_name", "nist_xps_db")

<mdf_forge.forge.Forge at 0x7faa2d990748>

In [10]:
res, info = mdf.search(limit=10, info=True, reset_query=False)
info["query"]["q"]

'(mdf.source_name:nist_xps_db)'

In [11]:
res, info = mdf.search(limit=10, info=True)
info["query"]["q"]

'(mdf.source_name:nist_xps_db)'
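
The behavior above can be modeled with a small toy class. This is only a sketch of the observable behavior (terms accumulate, and a search clears them unless `reset_query=False` is passed); `ToyQuery` is a hypothetical stand-in, not Forge's implementation, and its query-string formatting is simplified.

```python
class ToyQuery:
    """Toy stand-in that mimics the query-building behavior shown above.

    NOT Forge's implementation: it only models how terms accumulate and
    how search() clears them unless reset_query=False is passed.
    """

    def __init__(self):
        self._terms = []

    def match_field(self, field, value):
        self._terms.append(f"{field}:{value}")
        return self  # returning self allows method chaining

    def exclude_field(self, field, value):
        self._terms.append(f"NOT {field}:{value}")
        return self

    def current_query(self):
        if not self._terms:
            return ""
        return "(" + " AND ".join(self._terms) + ")"

    def reset_query(self):
        self._terms = []

    def search(self, reset_query=True):
        # A real search would contact the service; this toy only returns
        # the query string that would have been sent.
        q = self.current_query()
        if reset_query:
            self.reset_query()
        return q


toy = ToyQuery()
toy.match_field("mdf.source_name", "nist_xps_db")
first = toy.search(reset_query=False)  # query is kept in memory
second = toy.search()                  # same query runs, then is cleared
after = toy.current_query()            # nothing left to run
print(repr(first), repr(second), repr(after))
```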

### show_fields
How do you know what fields there are to search on? Use `show_fields()` to find out. Called with no arguments, `show_fields()` shows you all of the top-level blocks (such as "mdf").

In [12]:
mdf.show_fields()

{'dss_tox': 'object',
 'fe_cr_al_oxidation': 'object',
 'gdb9_14': 'object',
 'header': 'object',
 'hopv': 'object',
 'jcap_xps_spectral_db': 'object',
 'md_17': 'object',
 'mdf': 'object',
 'metadata': 'object',
 'mpi_mainz': 'object',
 'natural_fiber_composite_tensile': 'object',
 'nist_janaf': 'object',
 'oqmd': 'object',
 'pppdb': 'object',
 'qm_mdt_c': 'object',
 'quinary_alloys': 'object',
 'xafs_sl': 'object'}

If you give `show_fields()` a top-level block, it will show you the mapping for that block, including the expected datatypes.

In [13]:
mdf.show_fields("mdf")

{'mdf.author.email': 'text',
 'mdf.author.family_name': 'text',
 'mdf.author.full_name': 'text',
 'mdf.author.given_name': 'text',
 'mdf.author.instituition': 'text',
 'mdf.author.institution': 'text',
 'mdf.author.orcid': 'text',
 'mdf.citation': 'text',
 'mdf.collection': 'text',
 'mdf.composition': 'text',
 'mdf.data_contact.email': 'text',
 'mdf.data_contact.family_name': 'text',
 'mdf.data_contact.full_name': 'text',
 'mdf.data_contact.given_name': 'text',
 'mdf.data_contact.instituition': 'text',
 'mdf.data_contact.institution': 'text',
 'mdf.data_contact.orcid': 'text',
 'mdf.data_contributor.email': 'text',
 'mdf.data_contributor.family_name': 'text',
 'mdf.data_contributor.full_name': 'text',
 'mdf.data_contributor.github': 'text',
 'mdf.data_contributor.given_name': 'text',
 'mdf.data_contributor.institution': 'text',
 'mdf.description': 'text',
 'mdf.elements': 'text',
 'mdf.ingest_date': 'date',
 'mdf.license': 'text',
 'mdf.links.DSC_data.globus_endpoint': 'text',
 'mdf.li
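
Because `show_fields()` returns an ordinary dictionary of field names and datatypes, it can be filtered like any other dictionary. The sketch below picks out the fields of one datatype from a small excerpt of the mapping above; `fields_of_type` is a hypothetical helper, not part of `mdf_forge`.

```python
def fields_of_type(mapping, datatype):
    """Return the sorted field names in `mapping` whose datatype matches."""
    return sorted(name for name, kind in mapping.items() if kind == datatype)


# A small excerpt of the mapping shown above.
example_mapping = {
    "mdf.citation": "text",
    "mdf.elements": "text",
    "mdf.ingest_date": "date",
    "mdf.license": "text",
}

date_fields = fields_of_type(example_mapping, "date")
print(date_fields)
```

With a live connection, `example_mapping` would be replaced by the result of `mdf.show_fields("mdf")`.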

# Fetching Datasets

### fetch_datasets_from_results
This method allows you to automatically collect all the datasets that have records returned from a search. In other words, if you search for `mdf.elements:Al` and a _record_ from OQMD is returned, you can pass that record to `fetch_datasets_from_results()` and get the OQMD _dataset_ entry back.

In [14]:
records = mdf.search("mdf.tags:outcar AND mdf.resource_type:record")

In [15]:
res = mdf.fetch_datasets_from_results(records)
res[0]

{'mdf': {'author': [{'email': 'gibbons.dayna@epa.gov',
    'family_name': 'Gibbons',
    'full_name': 'Dayna Gibbons',
    'given_name': 'Dayna',
    'institution': 'US EPA Research'}],
  'citation': ['Richard, A M. AND C. R. Williams. DISTRIBUTED STRUCTURE-SEARCHABLE TOXICITY (DSSTOX) PUBLIC DATABASE NETWORK: A PROPOSAL. MUTATION RESEARCH NEW FRONTIERS ISSUE 499(1):27-52, (2001).'],
  'collection': 'DSS Tox',
  'data_contact': {'email': 'gibbons.dayna@epa.gov',
   'family_name': 'Gibbons',
   'full_name': 'Dayna Gibbons',
   'given_name': 'Dayna',
   'institution': 'US EPA Research'},
  'data_contributor': [{'email': 'dep78@uchicago.edu',
    'family_name': 'Pike',
    'full_name': 'Evan Pike',
    'github': 'dep78',
    'given_name': 'Evan',
    'institution': 'The University of Chicago'}],
  'description': 'DSSTox provides a high quality public chemistry resource for supporting improved predictive toxicology. A distinguishing feature of this effort is the accurate mapping of bioassa

If you don't need the record results themselves, you can call `fetch_datasets_from_results()` with no arguments. It will then execute the current query and use those results instead of results you pass in.

In [16]:
res = mdf.match_field("mdf.elements", "Al").fetch_datasets_from_results()
res[0]

{'mdf': {'author': [{'email': 'qhong@alumni.caltech.edu',
    'family_name': 'Hong',
    'full_name': 'Qi-Jun Hong',
    'given_name': 'Qi-Jun',
    'institution': 'Brown University'},
   {'email': 'avdw@alum.mit.edu',
    'family_name': 'van de Walle',
    'full_name': 'Axel van de Walle',
    'given_name': 'Axel',
    'institution': 'Brown University'}],
  'citation': ['Qi-Jun Hong, Axel van de Walle, A user guide for SLUSCHI: Solid and Liquid in Ultra Small Coexistence with Hovering Interfaces, Calphad, Volume 52, March 2016, Pages 88-97, ISSN 0364-5916, http://doi.org/10.1016/j.calphad.2015.12.003.'],
  'collection': 'SLUSCHI',
  'data_contact': {'email': 'qhong@alumni.caltech.edu',
   'family_name': 'Hong',
   'full_name': 'Qi-Jun Hong',
   'given_name': 'Qi-Jun',
   'institution': 'Brown University'},
  'data_contributor': [{'email': 'jgaff@uchicago.edu',
    'family_name': 'Gaff',
    'full_name': 'Jonathon Gaff',
    'github': 'jgaff',
    'given_name': 'Jonathon',
    'institu
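
The record shown earlier in this notebook carries a `parent_id` under `mdf.links`. Assuming that value identifies the dataset entry a record belongs to (the notebook does not state this explicitly), the distinct datasets behind a list of records could be collected by hand as sketched below. This illustrates the idea only and is not how `fetch_datasets_from_results()` is implemented; `unique_parent_ids` and the record IDs are hypothetical.

```python
def unique_parent_ids(records):
    """Collect distinct parent IDs from records, preserving first-seen order.

    Records without an mdf.links.parent_id entry are skipped.
    """
    seen = []
    for record in records:
        parent_id = record.get("mdf", {}).get("links", {}).get("parent_id")
        if parent_id is not None and parent_id not in seen:
            seen.append(parent_id)
    return seen


# Placeholder records: only the keys used here, with made-up IDs.
example_records = [
    {"mdf": {"links": {"parent_id": "dataset-1"}}},
    {"mdf": {"links": {"parent_id": "dataset-2"}}},
    {"mdf": {"links": {"parent_id": "dataset-1"}}},
    {"mdf": {"links": {}}},
]

parents = unique_parent_ids(example_records)
print(parents)
```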

# Aggregations

### aggregate
Queries submitted with `search()` are limited to returning 10,000 results. If this limit is too low, you can use `aggregate()` to retrieve _all_ results from a query, no matter how many. Be careful with this function: it is easy to retrieve a very large number of results without meaning to. Consider running `search(your_query, limit=0, info=True)` (see above) first to find out how many results you will get.

For this example, we will see how many results the query will retrieve before aggregating.

In [17]:
mdf.match_field("mdf.source_name", "oqmd").match_field("mdf.elements", "Pb").exclude_field("mdf.elements", "Al")
res, info = mdf.search(limit=0, info=True, reset_query=False)
print("Number of results:", info["total_query_matches"])

Number of results: 23269


Assuming we want all of these results, we can use `aggregate()` on the same query.

In [18]:
res = mdf.aggregate()
print("Number of results:", len(res))

100%|██████████| 23269/23269 [01:00<00:00, 388.45it/s]

Number of results: 23269
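
The check-then-aggregate pattern above can be wrapped in a small decision helper. The sketch below only compares a match count against the 10,000-result limit stated above; `needs_aggregate` and `SEARCH_RESULT_CAP` are hypothetical names, not part of `mdf_forge`.

```python
SEARCH_RESULT_CAP = 10000  # limit on search() results stated above


def needs_aggregate(total_matches, cap=SEARCH_RESULT_CAP):
    """Return True when a query matches more results than search() can return."""
    return total_matches > cap


# 23269 is the match count from the example above.
use_aggregate = needs_aggregate(23269)
print("Use aggregate():", use_aggregate)
```

In practice, `total_matches` would come from `info["total_query_matches"]` after a `search(limit=0, info=True, reset_query=False)` call, as in the cells above.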



