NNDSS Weekly data refresh #1022

Open · wants to merge 3 commits into base: master
55 changes: 55 additions & 0 deletions scripts/us_cdc/nndss_weekly_tables/README.md
@@ -0,0 +1,55 @@
## CDC WONDER: NNDSS - Infectious diseases

There are three data sources for this import:

## National Notifiable Diseases Surveillance System

The National Notifiable Disease Surveillance System (NNDSS) is a nationwide collaboration that enables all levels of public health (local, state, territorial, federal, and international) to share health information to monitor, control, and prevent the occurrence and spread of state-reportable and nationally notifiable infectious and some noninfectious diseases and conditions.

### Nationally Notifiable Infectious Diseases and Conditions, United States: Annual Tables (2016 - 2019 | tables: 4-8)
This is downloaded using `process_annual_tables_16-19.py --mode=<all|download|process> --input_path=<path> --output_path=<path> --table_ids=4,5,6,7,8 --years=2016,2017,2018,2019`.
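
For example, a full download-and-process run over all four years might look like the following (the input and output paths are illustrative):

```
python process_annual_tables_16-19.py --mode=all --input_path=./input --output_path=./output --table_ids=4,5,6,7,8 --years=2016,2017,2018,2019
```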

### Nationally Notifiable Infectious Diseases and Conditions, United States: Annual Tables (2007 - 2015 | tables: 4-8)
The annual tables from 1993 to 2015 are available through [MMWR](https://www.cdc.gov/mmwr/mmwr_nd/index.html). On this website, the tables are embedded as HTML table content in the webpage.

> **NOTE:** While MMWR has data from 1993 to 2015, the datasets between 1993 and 2006 are images which cannot be scraped to plain text and would need an OCR approach to extract data points. Hence, this import does not include the NNDSS infectious disease annual tables for the period 1993 to 2006.

The webpages by year are tabulated below:

|Year|Webpage URL|
|----|-----------|
|2007|[https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5653a1.htm](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5653a1.htm)|
|2008|[https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5754a1.htm](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5754a1.htm)|
|2009|[https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5853a1.htm](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5853a1.htm)|
|2010|[https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5953a1.htm](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5953a1.htm)|
|2011|[https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6053a1.htm](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6053a1.htm)|
|2012|[https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6153a1.htm](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6153a1.htm)|
|2013|[https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6253a1.htm](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6253a1.htm)|
|2014|[https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6354a1.htm](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6354a1.htm)|
|2015|[https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6453a1.htm](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6453a1.htm)|

We will be using [`beautifulsoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to extract and process the datasets.

```
script for data download and processing
```
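
As a rough sketch of the intended approach (the helper name `extract_annual_tables` is illustrative, not the final script), the extraction could look like this, mirroring the pattern used in `download_weekly_data.py` below:

```
import pandas as pd
import requests
from bs4 import BeautifulSoup


def extract_annual_tables(page_url: str) -> list:
    """Returns each HTML table on an MMWR annual-tables page as a DataFrame."""
    page = requests.get(page_url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # MMWR pages embed the data as plain HTML tables, so each table element
    # can be handed to pandas directly.
    return [
        pd.read_html(table.prettify())[0] for table in soup.find_all('table')
    ]


# Example: the 2007 annual tables page listed above.
tables = extract_annual_tables(
    'https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5653a1.htm')
```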

### Nationally Notifiable Infectious Diseases and Conditions, United States: Weekly Tables (1996 - 2022 | counts of medicalConditions)
The CDC releases [weekly cases](https://wonder.cdc.gov/nndss/nndss_weekly_tables_menu.asp) of selected nationally notifiable infectious diseases from the National Notifiable Diseases Surveillance System (NNDSS). NNDSS data reported by the 50 states, New York City, the District of Columbia, and the U.S. territories are collated and published weekly as numbered tables, and the data available in [CDC WONDER](https://wonder.cdc.gov/nndss/nndss_weekly_tables_menu.asp) starts from Week 1 of 1996.

A similar dataset, nominally covering 2014 onwards, is available at [data.cdc.gov](https://data.cdc.gov/NNDSS/NNDSS-Weekly-Data/x9gk-5huc). That dataset is relatively clean and exposes the count of weekly cases, but we were only able to retrieve data from 2022, not the older years, so we use [CDC WONDER](https://wonder.cdc.gov/nndss/nndss_weekly_tables_menu.asp) as the preferred data source for this import.
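
For reference, that dataset can also be pulled through its Socrata endpoint. The `/resource/x9gk-5huc.json` path below follows Socrata's standard convention for the dataset ID in the link above; treat it as an assumption rather than a documented part of this import:

```
import pandas as pd
import requests

# Socrata JSON endpoint for the NNDSS Weekly Data dataset (id x9gk-5huc);
# $limit caps the number of rows returned per request.
resp = requests.get('https://data.cdc.gov/resource/x9gk-5huc.json',
                    params={'$limit': 1000})
df = pd.DataFrame(resp.json())
print(df.head())
```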


```
scripts for data processing
```
The CSV file is updated every week, and we pick up the entire CSV file each time so that corrections made after review (reflected on the [Notice to Data Users](https://wonder.cdc.gov/nndss/NTR.html) page) are reflected in the import.
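
With the download script included in this PR, a refresh can be started as follows (the `--output_path` flag is defined in `download_weekly_data.py`; the path is illustrative):

```
python download_weekly_data.py --output_path=./data
```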


Columns that still need to be resolved against the schema:

- weekly tables:
1. Carbapenemase-producing carbapenem-resistant Enterobacteriaceae †;Enterobacter spp.;Current week
2. Carbapenemase-producing carbapenem-resistant Enterobacteriaceae †;Escherichia coli;Current week
3. Carbapenemase-producing carbapenem-resistant Enterobacteriaceae †;Klebsiella spp.;Current week

184 changes: 184 additions & 0 deletions scripts/us_cdc/nndss_weekly_tables/download_weekly_data.py
@@ -0,0 +1,184 @@
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
"""
import datetime
import time
import os
import requests
import pandas as pd
from absl import flags, app
from bs4 import BeautifulSoup

_START = 2006
_END = 2025  # datetime.date.today().year + 1 to make the last year inclusive

_BASE_URL = "https://wonder.cdc.gov/nndss/"
_WEEKLY_TABLE_2010 = _BASE_URL + "nndss_weekly_tables_menu.asp?mmwr_year={year}&mmwr_week={week}"
# The menu URL is currently identical for 2017 onwards; kept as a separate
# name in case the templates diverge.
_WEEKLY_TABLE_2017 = _WEEKLY_TABLE_2010
_FILENAME_TEMPLATE = "mmwr_year_{year}_mmwr_week_{week}_mmwr_table_{id}"
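# e.g. yields mmwr_year_2017_mmwr_week_01_mmwr_table_1 for
# https://wonder.cdc.gov/nndss/static/2017/01/2017-01-table1.html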
_BAD_URLS = [
'https://wonder.cdc.gov/nndss/nndss_weekly_tables_1995_2014.asp?mmwr_year=2007&mmwr_week=13&mmwr_table=1&request=Submit'
]


def parse_html_table(table_url: str, file_path: str) -> None:
    """Extracts the data table at table_url and saves it as a CSV at file_path."""
    if table_url in _BAD_URLS:
        return
    table_content = requests.get(table_url)
    t_soup = BeautifulSoup(table_content.content, 'html.parser')
    # Pages up to 2015 hold the data in the second <table>; from 2016 onwards
    # the page has a single <table>, so fall back to the first one.
    for table_index in (1, 0):
        try:
            table_result_set = t_soup.find_all('table')[table_index]
            df = pd.read_html(table_result_set.prettify())[0]
            # save the file in output path for each file
            df.to_csv(file_path, index=False)
            return
        except IndexError:
            continue
    # This case occurs for pages like
    # https://wonder.cdc.gov/nndss/nndss_reps.asp?mmwr_year=2007&mmwr_week=13&mmwr_table=1&request=Submit
    # which are inaccessible even from the website, so the table is skipped.
    print("Link not working. Skipping table...")


def extract_table_from_link(table_url: str,
                            filename: str,
                            output_path: str,
                            update: bool = False) -> None:
    """Downloads the table at table_url to <output_path>/<filename>.csv.

    Existing files are skipped unless update is True; failed downloads are
    retried up to 10 times.
    """
    num_tries = 10
    file_path = os.path.join(output_path, f'{filename}.csv')
    if not os.path.exists(file_path) or update:
        print(f"Downloading {table_url}", end=" ..... ", flush=True)
        try:
            parse_html_table(table_url, file_path)
            print("Done.", flush=True)
        except Exception:
            print("Terminated with error. Please check the link.", flush=True)
            while num_tries > 1:
                num_tries = num_tries - 1
                print(f"Attempting download again. Tries remaining: {num_tries}")
                try:
                    parse_html_table(table_url, file_path)
                    break
                except Exception:
                    time.sleep(1)
        time.sleep(2)
    else:
        print(f"Download from {table_url} already exists in {output_path}")
        time.sleep(0.2)


def scrape_table_links_from_page(page_url: str,
                                 output_path: str,
                                 update: bool = False) -> None:
    """Finds all table links on the weekly menu page at page_url and downloads
    each linked table to output_path."""
page = requests.get(page_url)
soup = BeautifulSoup(page.content, 'html.parser')
# get link to all tables in the page
table_link_list = [
tag.find("a")["href"] for tag in soup.select("tr:has(a)")
]

for table_link in table_link_list:
# Between years 1996 to 2016, select requestMode=Submit
if 'Submit' in table_link:
table_url = _BASE_URL + table_link
# extract filename from link patterns like https://wonder.cdc.gov/nndss/nndss_reps.asp?mmwr_year=1996&mmwr_week=01&mmwr_table=2A&request=Submit
filename = table_url.split('?')[1].split('&request')[0].replace(
'=', '_').replace('&', '_')
print("Submit", table_url, filename, output_path, update)
extract_table_from_link(table_url, filename, output_path, update)

# From year 2017, the base link structure has changed to: https://wonder.cdc.gov/nndss/static/2017/01/2017-01-table1.html
if table_link.endswith('.html') and 'table' in table_link:
# skip /nndss/ in the table_link, since it is already part of the _BASE_URL
table_url = _BASE_URL + table_link[7:]
# extract year, week, table_id from link
filename_components = table_link.split('/')[-1].split(
'.html')[0].split('-')
filename = _FILENAME_TEMPLATE.format(
year=filename_components[0],
week=filename_components[1],
id=filename_components[2].split('table')[1])
extract_table_from_link(table_url, filename, output_path, update)


def get_index_url(year, week):
    """Returns the weekly-tables menu URL for the given MMWR year and week."""
if year < 2017:
return _WEEKLY_TABLE_2010.format(year=year, week=week)
else:
return _WEEKLY_TABLE_2017.format(year=year, week=week)


def download_weekly_nnds_data_across_years(year_range: str,
                                           output_path: str) -> None:
    """Downloads all weekly NNDSS tables for each year in year_range to
    <output_path>/nndss_weekly_data."""
    output_path = os.path.join(output_path, 'nndss_weekly_data')
    if not os.path.exists(output_path):
        os.makedirs(output_path)
    for year in year_range:
        # Most MMWR years have 52 weeks, some have 53; the year % 4 check is
        # an approximation and may request a menu page that does not exist.
        week_range = [str(x).zfill(2) for x in range(1, 53)]
        if year % 4 == 0:
            week_range = [str(x).zfill(2) for x in range(1, 54)]
for week in week_range:
index_url = get_index_url(year, week)
print(f"Fetching data from {index_url}")
scrape_table_links_from_page(index_url, output_path, update=False)


def update_downloaded_files(year, week, file_path):
    """Re-downloads the tables for the given year and week into the directory
    containing file_path."""
    output_path = os.path.dirname(file_path)
    index_url = get_index_url(year, week)
    scrape_table_links_from_page(index_url, output_path, update=True)


def get_next_week(year: str, output_path: str) -> int:
    """Returns the MMWR week after the latest week of the given year already
    downloaded to output_path."""
    # Sort so the latest downloaded week comes last; os.listdir returns
    # entries in arbitrary order.
    all_files_in_dir = sorted(os.listdir(output_path))
    files_of_year = [files for files in all_files_in_dir if str(year) in files]
last_downloaded_file = files_of_year[-1]
week = last_downloaded_file.split('_mmwr_week_')[1].split('_mmwr_table')[0]
return int(week) + 1


def download_latest_weekly_nndss_data(year: str, output_path: str) -> None:
    """Downloads the tables on the current weekly menu page to output_path."""
index_url = "https://wonder.cdc.gov/nndss/nndss_weekly_tables_menu.asp"
print(f"Fetching data from {index_url}")
scrape_table_links_from_page(index_url, output_path)


FLAGS = flags.FLAGS
# Flags must be defined at module scope so that absl parses them before
# app.run() invokes main().
flags.DEFINE_string(
    'output_path', './data',
    'Path to the directory where generated files are to be stored.')


def main(_) -> None:
    year_range = range(_START, _END)
    download_weekly_nnds_data_across_years(year_range, FLAGS.output_path)


if __name__ == '__main__':
app.run(main)
Binary file not shown.
141 changes: 141 additions & 0 deletions scripts/us_cdc/nndss_weekly_tables/nndss_data/place_name_to_dcid.csv
@@ -0,0 +1,141 @@
Place Name,Resolved place dcid,Notes
U.S. Residents,country/USA,Not sure if we can do `SVObs place = country/USA` and add `citizenshipStatus=USCitizen` as pv? `USCitizen` will be a new instance in USC_CitizenshipEnum
"U.S. Residents, excluding U.S. Territories",country/USA,Not sure if we can do `SVObs place = country/USA` and add `citizenshipStatus=USCitizen` as pv? `USCitizen` will be a new instance in USC_CitizenshipEnum
United States,country/USA,
New England,usc/NewEnglandDivision,
Connecticut,geoId/09,
Maine,geoId/23,
Massachusetts,geoId/25,
New Hampshire,geoId/33,
Rhode Island,geoId/44,
Vermont,geoId/50,
Middle Atlantic,usc/MiddleAtlanticDivision,
New Jersey,geoId/34,
New York (excluding New York City),,To split geoId/36? spatial split on geojson and create a new place `Upstate_NewYork` containedInPlace: geoId/36
New York City,,Copy New York City map from geoId/36 and create a new place `NewYorkCity`
Pennsylvania,geoId/42,
East North Central,usc/EastNorthCentralDivision,
Illinois,geoId/17,
Indiana,geoId/18,
Michigan,geoId/26,
Ohio,geoId/39,
Wisconsin,geoId/55,
West North Central,usc/WestNorthCentralDivision,
Iowa,geoId/19,
Kansas,geoId/20,
Minnesota,geoId/27,
Missouri,geoId/29,
Nebraska,geoId/31,
North Dakota,geoId/38,
South Dakota,geoId/46,
South Atlantic,usc/SouthAtlanticDivision,
Delaware,geoId/10,
District of Columbia,geoId/11,
Florida,geoId/12,
Georgia,geoId/13,
Maryland,geoId/24,
North Carolina,geoId/37,
South Carolina,geoId/45,
Virginia,geoId/51,
West Virginia,geoId/54,
East South Central,usc/EastSouthCentralDivision,
Alabama,geoId/01,
Kentucky,geoId/21,
Mississippi,geoId/28,
Tennessee,geoId/47,
West South Central,usc/WestSouthCentralDivision,
Arkansas,geoId/05,
Louisiana,geoId/22,
Oklahoma,geoId/40,
Texas,geoId/48,
Mountain,usc/MountainDivision,
Arizona,geoId/04,
Colorado,geoId/08,
Idaho,geoId/16,
Montana,geoId/30,
Nevada,geoId/32,
New Mexico,geoId/35,
Utah,geoId/49,
Wyoming,geoId/56,
Pacific,usc/PacificDivision,
Alaska,geoId/02,
California,geoId/06,
Hawaii,geoId/15,
Oregon,geoId/41,
Washington,geoId/53,
U.S. Territories,,
American Samoa,geoId/60,
Commonwealth of Northern Mariana Islands,geoId/69,
Guam,geoId/66,
Puerto Rico,geoId/72,
U.S. Virgin Islands,geoId/78,
Non-U.S. Residents,country/USA,Not sure if we can do `SVObs place = country/USA` and add `citizenshipStatus=NotAUSCitizen` as pv?
Total,,Intentionally left blank
UNITED STATES,country/USA,
NEW ENGLAND,usc/NewEnglandDivision,
Conn.,geoId/09,
Maine,geoId/23,
Mass.,geoId/25,
N.H.,geoId/33,
R.I.,geoId/44,
Vt.,geoId/50,
MID. ATLANTIC,usc/MiddleAtlanticDivision,
N.J.,geoId/34,
N.Y. (Upstate),,To split geoId/36? spatial split on geojson and create a new place `Upstate_NewYork` containedInPlace: geoId/36
N.Y. City,,Copy New York City map from geoId/36 and create a new place `NewYorkCity`
Pa.,geoId/42,
E.N. CENTRAL,usc/EastNorthCentralDivision,
Ill.,geoId/17,
Ind.,geoId/18,
Mich.,geoId/26,
Ohio,geoId/39,
Wis.,geoId/55,
W.N. CENTRAL,usc/WestNorthCentralDivision,
Iowa,geoId/19,
Kans.,geoId/20,
Minn.,geoId/27,
Mo.,geoId/29,
Nebr.,geoId/31,
N. Dak.,geoId/38,
S. Dak.,geoId/46,
S. ATLANTIC,usc/SouthAtlanticDivision,
Del.,geoId/10,
D.C.,geoId/11,
Fla.,geoId/12,
Ga.,geoId/13,
Md.,geoId/24,
N.C.,geoId/37,
S.C.,geoId/45,
Va.,geoId/51,
W. Va.,geoId/54,
E.S. CENTRAL,usc/EastSouthCentralDivision,
Ala.,geoId/01,
Ky.,geoId/21,
Miss.,geoId/28,
Tenn.,geoId/47,
W.S. CENTRAL,usc/WestSouthCentralDivision,
Ark.,geoId/05,
La.,geoId/22,
Okla.,geoId/40,
Tex.,geoId/48,
MOUNTAIN,usc/MountainDivision,
Ariz.,geoId/04,
Colo.,geoId/08,
Idaho,geoId/16,
Mont.,geoId/30,
Nev.,geoId/32,
N. Mex.,geoId/35,
Utah,geoId/49,
Wyo.,geoId/56,
PACIFIC,usc/PacificDivision,
Alaska,geoId/02,
Calif.,geoId/06,
Hawaii,geoId/15,
Oreg.,geoId/41,
Wash.,geoId/53,
Amer. Samoa,geoId/60,
C.N.M.I.,geoId/69,
Guam,geoId/66,
P.R.,geoId/72,
V.I.,geoId/78,