@minhkhul
Contributor

@minhkhul minhkhul commented Nov 11, 2025

Addresses issue(s) https://github.com/cmu-delphi/epidata-etl/issues/394

Summary:

  • Add hsa and hsa_nci geo resolutions.
  • Adjust the regex pattern to match specific geo resolutions.
  • Add various hsa_nci and hsa cases to the unit tests to make sure the new regex works as intended.
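
As a rough illustration of the regex change described in the summary (not the code from this PR), a filename pattern can be built from an explicit list of geo resolutions instead of a generic `\w+` group. The names `GEO_RESOLUTIONS`, `FILENAME_PATTERN`, and `parse_filename`, as well as the `<date>_<geo_type>_<signal>.csv` filename layout, are assumptions made for this sketch. Sorting the alternatives longest-first matters because `hsa` is a prefix of `hsa_nci`.

```python
import re

# Hypothetical list for this sketch; the real one is
# CsvImporter.GEOGRAPHIC_RESOLUTIONS in the repository.
GEO_RESOLUTIONS = ('county', 'hrr', 'msa', 'dma', 'state', 'hhs', 'nation', 'hsa', 'hsa_nci')

# Try longer names first so that 'hsa_nci' is not matched as 'hsa'
# followed by a signal name starting with 'nci_'.
_geo_alternation = '|'.join(
    re.escape(geo) for geo in sorted(GEO_RESOLUTIONS, key=len, reverse=True)
)

# Assumed filename layout: <YYYYMMDD>_<geo_type>_<signal>.csv
FILENAME_PATTERN = re.compile(r'^(\d{8})_(' + _geo_alternation + r')_(\w+)\.csv$')


def parse_filename(name):
    """Return (date, geo_type, signal), or None if the name does not match."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    return match.groups()
```

With this approach an unknown geo type never matches the pattern at all, which is why a separate membership check after matching becomes unreachable.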

Prerequisites:

  • Unless it is a documentation hotfix it should be merged against the dev branch
  • Branch is up-to-date with the branch to be merged with, i.e. dev
  • Build is successful
  • Code is cleaned up and formatted

@minhkhul minhkhul requested a review from melange396 November 11, 2025 17:43
Collaborator

@melange396 melange396 left a comment

Very nice job! Your approach is a great way of handling this!

There are a couple other things we will need to take care of:

  • The following is essentially dead code now; we should remove it at the least, but potentially replace its functionality:
    if geo_type not in CsvImporter.GEOGRAPHIC_RESOLUTIONS:
      logger.warning(event='invalid geo_type', detail=geo_type, file=path)
      yield (path, None)
      continue
    (an easy way would be to add " (or geo_type)" to the log message where it will now error out).
  • The new geo_type and some basic validation code should also be added to the section at:
    if geo_type in ('hrr', 'msa', 'dma', 'hhs'):
      # these particular ids are prone to be written as ints -- and floats
      try:
        geo_id = str(CsvImporter.floaty_int(geo_id))
      except ValueError:
        # expected a number, but got a string
        return (None, 'geo_id')
    # sanity check geo_id with respect to geo_type
    if geo_type == 'county':
      if len(geo_id) != 5 or not '01000' <= geo_id <= '80000':
        return (None, 'geo_id')
    elif geo_type == 'hrr':
      if not 1 <= int(geo_id) <= 500:
        return (None, 'geo_id')
    elif geo_type == 'msa':
      if len(geo_id) != 5 or not '10000' <= geo_id <= '99999':
        return (None, 'geo_id')
    elif geo_type == 'dma':
      if not 450 <= int(geo_id) <= 950:
        return (None, 'geo_id')
    elif geo_type == 'state':
      # note that geo_id is lowercase
      if len(geo_id) != 2 or not 'aa' <= geo_id <= 'zz':
        return (None, 'geo_id')
    elif geo_type == 'hhs':
      if not 1 <= int(geo_id) <= 10:
        return (None, 'geo_id')
    elif geo_type == 'nation':
      # geo_id is lowercase
      if len(geo_id) != 2 or not 'aa' <= geo_id <= 'zz':
        return (None, 'geo_id')
    else:
      return (None, 'geo_type')
    ... According to https://seer.cancer.gov/seerstat/variables/countyattribs/hsa.html, valid codes should be 1-3 digit numbers, or the special code "1022" (which NSSP doesn't appear to use, FWIW).
  • I DO NOT think we should add the new geo_type to the part of the server code that does input validation, as it's done with the older geomapper code, which does not support this definition of HSA (it uses the incompatible Dartmouth version; see doc/urlref/crosswalk). However, we should mention in a comment there that we are intentionally sidestepping it for that reason:
    # TODO: keep this translator in sync with CsvImporter.GEOGRAPHIC_RESOLUTIONS in acquisition/covidcast/ and with GeoMapper
    geo_type_translator = {
      "county": "fips",
      "state": "state_id",
      "zip": "zip",
      "hrr": "hrr",
      "hhs": "hhs",
      "msa": "msa",
      "nation": "nation"
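
For the basic validation requested in the second bullet above, a minimal sketch might look like the following. It assumes the rule quoted from the SEER page (1-3 digit codes, plus the special code 1022). The function `validate_hsa_nci`, the constant `HSA_NCI_SPECIAL_CODE`, and the `(value, None)` success return are hypothetical and chosen only for illustration; `floaty_int` here is a local stand-in for `CsvImporter.floaty_int`, not the repository's implementation.

```python
# Special code listed on the SEER HSA page; hypothetical constant name.
HSA_NCI_SPECIAL_CODE = 1022


def floaty_int(value):
    """Stand-in for CsvImporter.floaty_int: accept '12' or '12.0', reject '12.5' and text."""
    float_value = float(value)  # raises ValueError for non-numeric strings
    if not float_value.is_integer():
        raise ValueError('not an int: "%s"' % str(value))
    return int(float_value)


def validate_hsa_nci(geo_id):
    """Return (normalized_geo_id, None) if valid, else (None, 'geo_id').

    The failure tuple mirrors the existing checks; the success tuple is
    an assumption made for this sketch.
    """
    try:
        code = floaty_int(geo_id)
    except ValueError:
        # expected a number, but got a string
        return (None, 'geo_id')
    # valid codes: 1-3 digit numbers, or the special code 1022
    if not (1 <= code <= 999 or code == HSA_NCI_SPECIAL_CODE):
        return (None, 'geo_id')
    return (str(code), None)
```

In the importer itself this would more naturally be another `elif geo_type == 'hsa_nci':` branch, with `'hsa_nci'` added to the tuple of types that pass through `floaty_int`.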

@minhkhul minhkhul requested a review from melange396 November 12, 2025 00:49
Collaborator

@melange396 melange396 left a comment

I have two comment clarifications, and let's add an integration test to bring the whole thing together... Then I think this is done!

for https://github.com/cmu-delphi/delphi-epidata/blob/dev/integrations/server/test_covidcast.py :

  def test_hsa_nci(self):
    row = CovidcastTestRow.make_default_row(geo_type='hsa_nci', geo_value='99')
    self._insert_rows([row])
    response = self.request_based_on_row(row)
    expected = [row.as_api_row_dict()]
    self.assertEqual(response, {
      'result': 1,
      'epidata': expected,
      'message': 'success',
    })

minhkhul and others added 3 commits November 12, 2025 11:08
Co-authored-by: george <george.haff@gmail.com>
Co-authored-by: george <george.haff@gmail.com>
Collaborator

@melange396 melange396 left a comment

awesome!

@melange396 melange396 merged commit fd6250f into dev Nov 12, 2025
8 checks passed
@melange396 melange396 deleted the hsa-nci branch November 12, 2025 16:53

3 participants