-
Notifications
You must be signed in to change notification settings - Fork 16
Description
Roni and Dmytro noticed this and brought it to my attention: flat lines for NSSP signals for NY and Los Angeles in epivis. What can be seen in the plots are not actually "zero" values but instead all "null" values, which epivis currently plots at 0 (you can verify this by looking in your browser's network debugger for the API responses). After looking into this more, i found the following...
NSSP raw source data is almost all at the county resolution (there is also state and nation resolution data, but those are vastly outnumbered by county. they do not seem to suffer from the same problem. i will ignore them from here on.). Some of these data points come from the upstream source as "empty", and are then properly interpreted as NULL by the nssp indicators pipeline code and given a missingness annotation. However, there seems to be an inordinately and surprisingly large number of these null/empty/missing values.
Some missing values may be from low-population ares where the data is not well reported or censored for privacy reasons, but data is missing from some very large regions (in terms of area and/or population) like Manhattan (New York County, FIPS code 36061) and Los Angeles (FIPS 06037), as seen in the epivis link at the top of this comment.
-
For NSSP data over all available time, as seen on Feb 27, 2025 in the "query data" tool on the cdc website ( https://data.cdc.gov/Public-Health-Surveillance/NSSP-Emergency-Department-Visit-Trajectories-by-St/rdmq-nq56/explore/query/ ) :
- total rows: 384625
- rows where at least one value column is non-null: 293330
- rows where all value columns are null: 91295 (24%)
- note that this number plus the one above it sum to the total: 91295 + 293330 = 384625
- rows with null in "combined" col: 139460 (36%)
- rows with null in rsv column: 98750 (26%)
- rows with null in flu column: 98750 (26%)
- rows with null in covid column: 98750 (26%)
- (as you probably noticed, "combined" column is different, but rsv+flu+covid column missingness appears to be coincident)
- so the number of null values that end up in our data set should be: (987503+139460)/(3846254) = 28%
-
For newer data, the missingness is even more prevalent... For a targeted date range between 29 Sep 2024 and 27 Feb 2025, we see:
- total rows: 61540
- rows where at least one value column is non-null: 45740
- rows where all value columns are null: 15800 (26%)
- note 15800 + 45740 = 61540
- rows with null in "combined" col: 61540 (100% ?!?! perhaps they silently discontinued this)
- rows with null in rsv column: 15800 (26%)
- rows with null in flu column: 15800 (26%)
- rows with null in covid column: 15800 (26%)
- (you may have noticed that the rsv, flu, and covid column missingness percentage is ~the same as for the whole dataset above)
- so the number of null values that end up in our data set should be: (158003+61540)/(615404) = >44%
(the above numbers include rows that are for state-level and national-level data points, but i claim they are negligible compared to the number of counties)
Whatever the reason for the frequency of null values, i think we should probably exclude them from the database. 44% missingness is nearly the norm and not the exception, and leaving all those nulls in place is wasteful and potentially misleading.
Ive not checked exhaustively yet, but i hypothesize the nulls appear consistently for many of the same locations -- it is certainly true that all of the NYC data is null across the entire timeseries. If we have no actual datapoints for a location, i think it is more appropriate to omit all records for that location instead of storing only null values.
This may be of particular interest to @RoniRos : if we keep the null values, the recently implemented geo coverage for the discovery app will report that nssp includes data for Manhattan, which is effectively erroneous; if we remove the null values, geo coverage will not report that nssp includes any data for Manhattan, which is arguably more accurate.
Another potential concern is that while "null-value locations" are appropriately left out of our geoaggregation computations, i fear that smaller geo types that we aggregate (like MSA perhaps) might still end up with unrepresentative values if large portions of their population are not accounted for.
Related issue for epivis "null" plotting: cmu-delphi/www-epivis#91