# Get NMDC biosample geographical origin coordinates

You can use this notebook to generate a list of the `id` and geographical origin coordinates of each biosample in the NMDC database.

### Install and import dependencies

In [None]:
%pip install requests

In [2]:
import csv
from typing import Optional

import requests

### Fetch the `id` and `lat_lon` of each biosample

In this cell, we fetch the `id` value and the `lat_lon` value (i.e., the [geographical origin coordinates](https://microbiomedata.github.io/nmdc-schema/lat_lon/)) of each biosample in the NMDC database.

The NMDC API endpoint we use here returns at most 2000 biosamples per request. Since the NMDC database currently contains more than 2000 biosamples, we submit multiple requests to the endpoint, each one for a distinct page (i.e., batch) of biosamples.

Also, when using _page number_-based pagination to access the various pages, the endpoint only provides access to a maximum of 10000 biosamples in total. Since the NMDC database currently contains more than 10000 biosamples, we use _cursor_-based pagination instead.
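
The shape of a cursor-based loop can be tried offline before touching the real API. The following is only a minimal sketch: `FAKE_PAGES`, `fake_fetch_page`, and `iterate_items` are hypothetical names, and the canned payloads merely imitate the `results` / `meta.next_cursor` layout that the next cell reads from the real responses.

```python
from typing import Callable, Dict, Iterator, List, Optional

# Hypothetical stand-in for the API: three tiny "pages" keyed by cursor.
FAKE_PAGES: Dict[str, dict] = {
    "*": {"results": [{"id": "a"}, {"id": "b"}], "meta": {"next_cursor": "c1"}},
    "c1": {"results": [{"id": "c"}, {"id": "d"}], "meta": {"next_cursor": "c2"}},
    "c2": {"results": [{"id": "e"}], "meta": {"next_cursor": None}},
}


def fake_fetch_page(cursor: str) -> dict:
    """Return the canned payload for the given cursor (no network access)."""
    return FAKE_PAGES[cursor]


def iterate_items(fetch_page: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every item by following `next_cursor` until it is None."""
    cursor: Optional[str] = "*"  # "*" refers to the first page
    while cursor is not None:
        payload = fetch_page(cursor)
        yield from payload["results"]
        cursor = payload["meta"]["next_cursor"]


all_ids: List[str] = [item["id"] for item in iterate_items(fake_fetch_page)]
print(all_ids)
```

Each response names the cursor for the next request, so the loop never needs to know how many pages exist in advance.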

In [3]:
lat_lons_by_biosample_id = dict()

page_num: int = 1
page_cursor: Optional[str] = "*"  # the "*" cursor refers to the first page
while True:
    print(f"Fetching page number {page_num} via cursor '{page_cursor}'", end=": ")
    request_params = dict(per_page=2000, fields="lat_lon", cursor=page_cursor)
    response = requests.get("https://api.microbiomedata.org/biosamples", params=request_params)
    response.raise_for_status()  # stop early if the API returned an HTTP error status

    # Collect the `id` and `lat_lon` value of each biosample in the response.
    # Note: Once we have it locally, we can explore it without Internet access.
    response_payload = response.json()
    biosamples = response_payload["results"]
    print(f"{len(biosamples)} biosamples")
    for biosample in biosamples:
        biosample_id = biosample["id"]
        biosample_lat_lon = biosample["lat_lon"]
        lat_lons_by_biosample_id[biosample_id] = biosample_lat_lon

    # If we haven't fetched all the biosamples yet, prepare to fetch the next batch.
    # Note: In the NMDC database, each biosample has a unique `id` value.
    next_page_cursor = response_payload["meta"]["next_cursor"]
    if next_page_cursor is not None:
        page_num += 1
        page_cursor = next_page_cursor
    else:
        break

print(f"Fetched `id` and `lat_lon` values for {len(lat_lons_by_biosample_id)} biosamples")

Fetching page number 1 via cursor '*': 2000 biosamples
Fetching page number 2 via cursor 'nmdc:sys0kypbyj24': 2000 biosamples
Fetching page number 3 via cursor 'nmdc:sys0bhe4tv08': 2000 biosamples
Fetching page number 4 via cursor 'nmdc:sys02vhbtb14': 2000 biosamples
Fetching page number 5 via cursor 'nmdc:sys0nqe4y821': 2000 biosamples
Fetching page number 6 via cursor 'nmdc:sys03av6n346': 2000 biosamples
Fetching page number 7 via cursor 'nmdc:sys09d3ets91': 1006 biosamples
Fetched `id` and `lat_lon` values for 13006 biosamples
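
Before exporting the coordinates, it may be worth checking them for missing or out-of-range values. The following is a minimal sketch, not part of the original notebook: `find_suspect_coordinates` is a hypothetical helper, the sample records are made up, and it assumes each `lat_lon` value is either `None` or a dictionary with numeric `latitude` and `longitude` keys.

```python
from typing import Dict, Optional


def find_suspect_coordinates(lat_lons: Dict[str, Optional[dict]]) -> Dict[str, Optional[dict]]:
    """Return the entries whose coordinates are missing or out of range."""
    suspects = {}
    for biosample_id, lat_lon in lat_lons.items():
        latitude = (lat_lon or {}).get("latitude")
        longitude = (lat_lon or {}).get("longitude")
        is_valid = (
            isinstance(latitude, (int, float))
            and isinstance(longitude, (int, float))
            and -90 <= latitude <= 90
            and -180 <= longitude <= 180
        )
        if not is_valid:
            suspects[biosample_id] = lat_lon
    return suspects


# Made-up sample data, not real NMDC records.
sample = {
    "example:1": {"latitude": 46.37, "longitude": -119.27},
    "example:2": {"latitude": 123.0, "longitude": 10.0},  # latitude out of range
    "example:3": None,  # no coordinates at all
}
suspects = find_suspect_coordinates(sample)
print(sorted(suspects))
```

To check the fetched data, you could call `find_suspect_coordinates(lat_lons_by_biosample_id)`.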


### Dump the `id` and geographical origin coordinates to a CSV file

In this cell, we dump the fetched data to a CSV file. The CSV file will have the following columns:
- `biosample_id`
- `latitude`
- `longitude`

In [4]:
OUTFILE_PATH = "nmdc_biosample_geo_coordinates.csv"

# Note: The `csv` module documentation recommends opening the file with `newline=""`.
with open(OUTFILE_PATH, "w", newline="") as file:
    fieldnames = ["biosample_id", "latitude", "longitude"]
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for biosample_id, lat_lon_value in lat_lons_by_biosample_id.items():
        latitude = lat_lon_value["latitude"]
        longitude = lat_lon_value["longitude"]
        row = dict(biosample_id=biosample_id, latitude=latitude, longitude=longitude)
        writer.writerow(row)

# Report success only after the `with` block has closed the file.
print(f"Dumped data to: {OUTFILE_PATH}")

Dumped data to: nmdc_biosample_geo_coordinates.csv
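
To read the coordinates back later, `csv.DictReader` can rebuild a dictionary from the same three columns. The following is a minimal sketch that uses made-up rows held in memory, so it runs without the file; `csv_text` and `coordinates_by_id` are illustrative names.

```python
import csv
import io

# Made-up rows in the same three-column layout, held in memory instead of on disk.
csv_text = (
    "biosample_id,latitude,longitude\n"
    "example:1,46.37,-119.27\n"
    "example:2,-33.5,151.2\n"
)

coordinates_by_id = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    # The csv module returns every field as a string, so convert explicitly.
    coordinates_by_id[row["biosample_id"]] = (float(row["latitude"]), float(row["longitude"]))

print(coordinates_by_id["example:1"])
```

To read the real file, replace `io.StringIO(csv_text)` with a file object opened via `open(OUTFILE_PATH, newline="")`.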


### (Example) Fetch all metadata about a biosample

Once you have the `id` of a biosample, you can use the NMDC API to get more metadata about that biosample.

In this cell, we fetch metadata about an arbitrary biosample, given its `id`.

In [5]:
biosample_id = "nmdc:bsm-13-amrnys72"  # this is the `id` of an arbitrary biosample

response = requests.get(f"https://api.microbiomedata.org/biosamples/{biosample_id}")
response.raise_for_status()  # stop early if the API returned an HTTP error status
biosample = response.json()
biosample

{'id': 'nmdc:bsm-13-amrnys72',
 'name': 'Sand microcosm microbial communities from a hyporheic zone in Columbia River, Washington, USA - GW-RW T4_25-Nov-14',
 'description': 'Sterilized sand packs were incubated back in the ground and collected at time point T4.',
 'env_broad_scale': {'has_raw_value': 'ENVO:01000253',
  'term': {'id': 'ENVO:01000253', 'type': 'nmdc:OntologyClass'},
  'type': 'nmdc:ControlledIdentifiedTermValue'},
 'env_local_scale': {'has_raw_value': 'ENVO:01000621',
  'term': {'id': 'ENVO:01000621', 'type': 'nmdc:OntologyClass'},
  'type': 'nmdc:ControlledIdentifiedTermValue'},
 'env_medium': {'has_raw_value': 'ENVO:01000017',
  'term': {'id': 'ENVO:01000017', 'type': 'nmdc:OntologyClass'},
  'type': 'nmdc:ControlledIdentifiedTermValue'},
 'type': 'nmdc:Biosample',
 'collection_date': {'has_raw_value': '2014-11-25',
  'type': 'nmdc:TimestampValue'},
 'depth': {'has_raw_value': '0.5',
  'has_numeric_value': 0.5,
  'has_unit': 'm',
  'type': 'nmdc:QuantityValue'},
 'geo