# Inspect the Data Portal

## JSON-LD Schema

Google Search can display rich results for data portals. To enable this feature, the data portal should include a JSON-LD schema in the HTML head.

The data portal generates this schema automatically in the `_include/head.html` file, which is part of the Jekyll template.

We can verify the JSON-LD schema by using the Google Rich Results Test tool:

https://search.google.com/test/rich-results

Paste the full URL of one of the dataset pages (for example, the `index-cmb.html` page) into the tool.
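
For a quick local check before using the online tool, the JSON-LD block can also be pulled out of a page with the Python standard library. This is a minimal sketch: the `sample_html` string and the `Dataset` fields in it are hypothetical stand-ins, not the actual output of the portal template.

```python
import json
from html.parser import HTMLParser


class JSONLDExtractor(HTMLParser):
    """Collect the parsed contents of every JSON-LD <script> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            # json.loads raises ValueError if the block is not valid JSON
            self.blocks.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


# Hypothetical page content; a real dataset page would be read from disk or fetched.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset", "name": "Example dataset"}
</script>
</head><body></body></html>
"""

extractor = JSONLDExtractor()
extractor.feed(sample_html)
for block in extractor.blocks:
    print(block["@type"], "-", block["name"])
```

A parse error here points to malformed JSON in the template. The Rich Results Test remains the reference for whether Google accepts the schema.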

## Fetching Data Using BDBag

[BDBag](https://github.com/fair-research/bdbag)

- CLI and Python package
- can fetch public and restricted datasets
- uses the manifest format we created for the datasets
- will validate (checksum) the data to ensure file integrity
- uses the [BagIt](https://en.wikipedia.org/wiki/BagIt) convention to organize data in a portable unit. See IETF [RFC 8493](https://datatracker.ietf.org/doc/html/rfc8493)
- useful for small datasets and when installing Globus Connect is challenging

We'll start with the public datasets, then show how to download a restricted one. We can create bags using both a CLI tool and the Python API.

### Create a Bag Using the CLI

In [None]:
import os
current_folder = %pwd
if not current_folder.endswith("cheapandfair-template"):
    %cd cheapandfair-template

Create an empty directory to hold the bag and the data.

In [None]:
!mkdir synch
!ls synch

We'll create an empty (or holey) bag using the remote file manifest for the `synch` dataset. 
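
To make the input concrete, here is a sketch of how one entry of a remote file manifest could be built for a local file. The field names (`url`, `length`, `filename`, `sha256`) are assumed from the BDBag documentation; the manifests used for these datasets may carry a different checksum algorithm, and the URL is a hypothetical placeholder.

```python
import hashlib
import json
import os
import tempfile


def manifest_entry(path, url):
    """Describe one file in the style of a BDBag remote file manifest entry."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks so large files do not need to fit in memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha256.update(chunk)
    return {
        "url": url,
        "length": os.path.getsize(path),
        "filename": os.path.basename(path),
        "sha256": sha256.hexdigest(),
    }


# Demonstrate with a throwaway file and a hypothetical URL.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "example.txt")
    with open(sample, "wb") as f:
        f.write(b"hello\n")
    entry = manifest_entry(sample, "https://example.org/data/example.txt")

# The manifest file is a JSON list with one such entry per file
print(json.dumps([entry], indent=2))
```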

In [None]:
!bdbag --remote-file-manifest synch-manifest.json synch

The `synch` folder now has manifests in the BagIt format and a `fetch.txt` file that lists the URLs of the data files. There's also a `data/` folder where the files will go when they're retrieved. Data in the bag can have subfolders.
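
RFC 8493 defines each `fetch.txt` line as a URL, a length in bytes (or `-` when unknown), and a path inside the bag, separated by whitespace. A minimal parser sketch, using made-up sample lines rather than the real `synch` entries:

```python
from typing import NamedTuple, Optional


class FetchEntry(NamedTuple):
    url: str
    length: Optional[int]  # None when the length is given as "-"
    path: str


def parse_fetch(text):
    """Parse the contents of a BagIt fetch.txt file into FetchEntry records."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first two runs of whitespace only; the path may contain spaces
        url, length, path = line.split(None, 2)
        entries.append(FetchEntry(url, None if length == "-" else int(length), path))
    return entries


# Hypothetical fetch.txt contents
sample = (
    "https://example.org/synch/map_a.fits\t1024\tdata/map_a.fits\n"
    "https://example.org/synch/sub/map_b.fits\t-\tdata/sub/map_b.fits\n"
)

entries = parse_fetch(sample)
for e in entries:
    print(e.path, e.length, e.url)
```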

In [None]:
!ls -R synch

In [None]:
!cat synch/fetch.txt

`bdbag` will materialize bags by retrieving data using the URIs in `fetch.txt` and comparing the file checksums to the manifests.
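
The checksum comparison can be illustrated with the standard library. This sketch is not BDBag's implementation; it only assumes the BagIt layout, where `manifest-<algorithm>.txt` holds one `checksum path` pair per line, and it builds a throwaway bag to check.

```python
import hashlib
import os
import tempfile


def verify_manifest(bag_dir, algorithm="sha256"):
    """Return the payload paths whose checksum differs from the manifest."""
    failures = []
    manifest = os.path.join(bag_dir, f"manifest-{algorithm}.txt")
    with open(manifest) as f:
        for line in f:
            if not line.strip():
                continue
            expected, rel_path = line.rstrip("\n").split(None, 1)
            # Reads the whole file at once; fine for a sketch, not for huge files
            with open(os.path.join(bag_dir, rel_path), "rb") as payload:
                actual = hashlib.new(algorithm, payload.read()).hexdigest()
            if actual != expected.lower():
                failures.append(rel_path)
    return failures


# Build a tiny bag in which one file does not match its recorded checksum.
with tempfile.TemporaryDirectory() as bag:
    os.mkdir(os.path.join(bag, "data"))
    for name, content in [("good.txt", b"expected"), ("bad.txt", b"login page")]:
        with open(os.path.join(bag, "data", name), "wb") as f:
            f.write(content)
    with open(os.path.join(bag, "manifest-sha256.txt"), "w") as f:
        f.write(f"{hashlib.sha256(b'expected').hexdigest()}  data/good.txt\n")
        f.write(f"{hashlib.sha256(b'real data').hexdigest()}  data/bad.txt\n")
    failures = verify_manifest(bag)

print(failures)
```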

In [None]:
!bdbag --materialize synch

In [None]:
!ls -R synch

### Create a Bag Using the Python SDK

In [None]:
import os
from bdbag import bdbag_api

In [None]:
os.mkdir('dust')

In [None]:
for i in os.listdir('dust'):
    print(i)

In [None]:
# https://github.com/fair-research/bdbag/blob/master/doc/api.md#make_bag
dust_bag = bdbag_api.make_bag('dust', remote_file_manifest='dust-manifest.json')

In [None]:
for i in os.listdir('dust'):
    print(i)

In [None]:
for i in os.listdir('dust/data'):
    print(i)

In [None]:
# https://github.com/fair-research/bdbag/blob/master/doc/api.md#materialize
dust_bag_path = bdbag_api.materialize('dust')

In [None]:
print(dust_bag_path)

In [None]:
for i in os.listdir('dust/data'):
    print(i)

### Retrieve Restricted Files Using the BDBag Keychain

The BDBag tool will use credentials stored in `~/.bdbag/keychain.json` if the file is present and the URIs of the files match a URI prefix in the keychain. To retrieve the files from the `cmb` dataset, we can reuse the token that we used to upload the manifests to the Guest Collection.
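
The prefix rule can be pictured as follows. This sketch assumes a plain string-prefix comparison, which may differ in detail from what BDBag actually does; `find_credentials`, the URLs, and the token value are all hypothetical.

```python
def find_credentials(url, keychain):
    """Return the first keychain entry whose "uri" is a prefix of url, else None."""
    for entry in keychain:
        if url.startswith(entry["uri"]):
            return entry
    return None


# Hypothetical keychain with a single entry scoped to the cmb dataset
keychain = [
    {
        "uri": "https://data.example.org/portal/cmb",
        "auth_type": "bearer-token",
        "auth_params": {"token": "PLACEHOLDER"},
    }
]

restricted = "https://data.example.org/portal/cmb/cmb_023GHz.fits"
public = "https://data.example.org/portal/synch/synch_map.fits"

print(find_credentials(restricted, keychain) is not None)
print(find_credentials(public, keychain) is not None)
```

Scoping the prefix to the restricted dataset keeps the token from being sent with requests for the public datasets.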

#### Try to Retrieve Without the Keychain

In [None]:
!mkdir cmb

In [None]:
!bdbag --remote-file-manifest cmb-manifest.json cmb

Right now, there's a new, empty bag:

In [None]:
!ls -R cmb

When we try to materialize the bag, the checksums will fail, because instead of the expected files we're downloading login pages.
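
A quick way to tell a login page from real data is to look at the first bytes of the file: a FITS file starts with the `SIMPLE` header keyword, whereas an HTML page does not. A minimal sketch using two throwaway files (the helper name `looks_like_fits` is made up for this example):

```python
import os
import tempfile

# A FITS primary header begins with the SIMPLE keyword, padded to eight characters
FITS_MAGIC = b"SIMPLE  ="


def looks_like_fits(path):
    """Return True if the file starts with the FITS SIMPLE keyword."""
    with open(path, "rb") as f:
        return f.read(len(FITS_MAGIC)) == FITS_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    fits_path = os.path.join(tmp, "real.fits")
    html_path = os.path.join(tmp, "login.fits")
    with open(fits_path, "wb") as f:
        f.write(b"SIMPLE  =                    T")
    with open(html_path, "wb") as f:
        f.write(b"<!DOCTYPE html><html><body>Log in</body></html>")
    fits_ok = looks_like_fits(fits_path)
    html_ok = looks_like_fits(html_path)

print(fits_ok, html_ok)
```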

In [None]:
!bdbag --materialize cmb

In [None]:
!head cmb/data/cmb_023GHz.fits

#### Create the Keychain File

In [None]:
from os.path import expanduser
import json
import toml

In [None]:
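# Read the configuration once instead of re-parsing the file for each key
config = toml.load("config.toml")
collection_uuid = config["UUID"]
domain = config["DOMAIN"]
folder = config["FOLDER"]

# Scope the token to the restricted cmb dataset only.
# The prefix is built by plain concatenation, so FOLDER must start and end with "/".
uri_prefix = f'https://{domain}{folder}cmb'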

with open(expanduser('~/.cheapandfair.json')) as f:
    tokens = json.load(f)

https_token = tokens['by_rs'][collection_uuid]['access_token']

In [None]:
keychain = [
    {
        "uri": uri_prefix,
        "auth_type": "bearer-token",
        "auth_params": {
            "token": https_token,
            "allow_redirects_with_token": "True"
        }
    }
]

In [None]:
# Note: this overwrites any existing keychain file
with open(expanduser('~/.bdbag/keychain.json'), 'w') as f:
    json.dump(keychain, f)

#### Materialize the Bag

In [None]:
!bdbag --materialize cmb