NNDSS Weekly data refresh #1022

Open · wants to merge 3 commits into base: master
55 changes: 55 additions & 0 deletions scripts/us_cdc/nndss_weekly_tables/README.md
@@ -0,0 +1,55 @@
## CDC WONDER: NNDSS - Infectious diseases

There are three data sources for this import:

## National Notifiable Diseases Surveillance System

The National Notifiable Disease Surveillance System (NNDSS) is a nationwide collaboration that enables all levels of public health (local, state, territorial, federal, and international) to share health information to monitor, control, and prevent the occurrence and spread of state-reportable and nationally notifiable infectious and some noninfectious diseases and conditions.

### Nationally Notifiable Infectious Diseases and Conditions, United States: Annual Tables (2016 - 2019 | tables: 4-8)
This is downloaded using `process_annual_tables_16-19.py --mode=<all|download|process> --input_path=<path> --output_path=<path> --table_ids=4,5,6,7,8 --years=2016,2017,2018,2019`.
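
For example, a full download-and-process run over all four years might look like the following (the input and output paths are illustrative):

```
python process_annual_tables_16-19.py --mode=all --input_path=./input --output_path=./output --table_ids=4,5,6,7,8 --years=2016,2017,2018,2019
```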

### Nationally Notifiable Infectious Diseases and Conditions, United States: Annual Tables (2007 - 2015 | tables: 4-8)
The annual tables from 1993 to 2015 are available through [MMWR](https://www.cdc.gov/mmwr/mmwr_nd/index.html). On this website, the tables are embedded as HTML table content in the webpage.

> **NOTE:** While MMWR has data from 1993 to 2015, the datasets between 1993 and 2006 are images which cannot be scraped to plain text and would need an OCR approach to extract data points. Hence, this import does not include the NNDSS infectious disease annual tables for the period 1993 to 2006.

The webpages by year are tabulated below:

|Year|Webpage URL|
|----|-----------|
|2007|[https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5653a1.htm](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5653a1.htm)|
|2008|[https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5754a1.htm](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5754a1.htm)|
|2009|[https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5853a1.htm](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5853a1.htm)|
|2010|[https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5953a1.htm](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5953a1.htm)|
|2011|[https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6053a1.htm](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6053a1.htm)|
|2012|[https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6153a1.htm](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6153a1.htm)|
|2013|[https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6253a1.htm](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6253a1.htm)|
|2014|[https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6354a1.htm](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6354a1.htm)|
|2015|[https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6453a1.htm](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6453a1.htm)|

We will be using [`beautifulsoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to extract and process the datasets.

```
script for data download and processing
```
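
As a rough sketch of the intended approach (the helper name `extract_annual_tables` is illustrative, not the final script), the extraction could look like this, mirroring the pattern used in `download_weekly_data.py` below:

```
import pandas as pd
import requests
from bs4 import BeautifulSoup


def extract_annual_tables(page_url: str) -> list:
    """Returns each HTML table on an MMWR annual-tables page as a DataFrame."""
    page = requests.get(page_url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # MMWR pages embed the data as plain HTML tables, so each table element
    # can be handed to pandas directly.
    return [
        pd.read_html(table.prettify())[0] for table in soup.find_all('table')
    ]


# Example: the 2007 annual tables page listed above.
tables = extract_annual_tables(
    'https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5653a1.htm')
```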

### Nationally Notifiable Infectious Diseases and Conditions, United States: Weekly Tables (1996 - 2022 | counts of medicalConditions)
The CDC releases [weekly cases](https://wonder.cdc.gov/nndss/nndss_weekly_tables_menu.asp) of selected nationally notifiable infectious diseases from the National Notifiable Diseases Surveillance System (NNDSS). NNDSS data reported by the 50 states, New York City, the District of Columbia, and the U.S. territories are collated and published weekly as numbered tables, and the data available in [CDC WONDER](https://wonder.cdc.gov/nndss/nndss_weekly_tables_menu.asp) starts from Week 1 of 1996.

A similar dataset, nominally covering 2014 onwards, is available at [data.cdc.gov](https://data.cdc.gov/NNDSS/NNDSS-Weekly-Data/x9gk-5huc). That dataset is relatively clean and exposes the count of weekly cases, but we were only able to retrieve data from 2022, not the older years, so we use [CDC WONDER](https://wonder.cdc.gov/nndss/nndss_weekly_tables_menu.asp) as the preferred data source for this import.
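
For reference, that dataset can also be pulled through its Socrata endpoint. The `/resource/x9gk-5huc.json` path below follows Socrata's standard convention for the dataset ID in the link above; treat it as an assumption rather than a documented part of this import:

```
import pandas as pd
import requests

# Socrata JSON endpoint for the NNDSS Weekly Data dataset (id x9gk-5huc);
# $limit caps the number of rows returned per request.
resp = requests.get('https://data.cdc.gov/resource/x9gk-5huc.json',
                    params={'$limit': 1000})
df = pd.DataFrame(resp.json())
print(df.head())
```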


```
scripts for data processing
```
The CSV file is updated every week, and we pick up the entire CSV file each time so that corrections made after review (reflected on the [Notice to Data Users](https://wonder.cdc.gov/nndss/NTR.html) page) are reflected in the import.
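
With the download script included in this PR, a refresh can be started as follows (the `--output_path` flag is defined in `download_weekly_data.py`; the path is illustrative):

```
python download_weekly_data.py --output_path=./data
```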


Columns that still need to be resolved against the schema:

- weekly tables:
1. Carbapenemase-producing carbapenem-resistant Enterobacteriaceae †;Enterobacter spp.;Current week
2. Carbapenemase-producing carbapenem-resistant Enterobacteriaceae †;Escherichia coli;Current week
3. Carbapenemase-producing carbapenem-resistant Enterobacteriaceae †;Klebsiella spp.;Current week

184 changes: 184 additions & 0 deletions scripts/us_cdc/nndss_weekly_tables/download_weekly_data.py
@@ -0,0 +1,184 @@
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
"""
import datetime
import time
import os
import requests
import pandas as pd
from absl import flags, app
from bs4 import BeautifulSoup

_START = 2006
_END = 2025  # datetime.date.today().year + 1 to make the last year inclusive

_BASE_URL = "https://wonder.cdc.gov/nndss/"
_WEEKLY_TABLE_2010 = _BASE_URL + "nndss_weekly_tables_menu.asp?mmwr_year={year}&mmwr_week={week}"
# The menu URL is currently identical for 2017 onwards; kept as a separate
# name in case the templates diverge.
_WEEKLY_TABLE_2017 = _WEEKLY_TABLE_2010
_FILENAME_TEMPLATE = "mmwr_year_{year}_mmwr_week_{week}_mmwr_table_{id}"
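# e.g. yields mmwr_year_2017_mmwr_week_01_mmwr_table_1 for
# https://wonder.cdc.gov/nndss/static/2017/01/2017-01-table1.html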
_BAD_URLS = [
'https://wonder.cdc.gov/nndss/nndss_weekly_tables_1995_2014.asp?mmwr_year=2007&mmwr_week=13&mmwr_table=1&request=Submit'
]


def parse_html_table(table_url: str, file_path: str) -> None:
    """Extracts the data table at table_url and saves it as a CSV at file_path."""
    if table_url in _BAD_URLS:
        return
    table_content = requests.get(table_url)
    t_soup = BeautifulSoup(table_content.content, 'html.parser')
    # Pages up to 2015 hold the data in the second <table>; from 2016 onwards
    # the page has a single <table>, so fall back to the first one.
    for table_index in (1, 0):
        try:
            table_result_set = t_soup.find_all('table')[table_index]
            df = pd.read_html(table_result_set.prettify())[0]
            # save the file in output path for each file
            df.to_csv(file_path, index=False)
            return
        except IndexError:
            continue
    # This case occurs for pages like
    # https://wonder.cdc.gov/nndss/nndss_reps.asp?mmwr_year=2007&mmwr_week=13&mmwr_table=1&request=Submit
    # which are inaccessible even from the website, so the table is skipped.
    print("Link not working. Skipping table...")


def extract_table_from_link(table_url: str,
                            filename: str,
                            output_path: str,
                            update: bool = False) -> None:
    """Downloads the table at table_url to <output_path>/<filename>.csv.

    Existing files are skipped unless update is True; failed downloads are
    retried up to 10 times.
    """
    num_tries = 10
    file_path = os.path.join(output_path, f'{filename}.csv')
    if not os.path.exists(file_path) or update:
        print(f"Downloading {table_url}", end=" ..... ", flush=True)
        try:
            parse_html_table(table_url, file_path)
            print("Done.", flush=True)
        except Exception:
            print("Terminated with error. Please check the link.", flush=True)
            while num_tries > 1:
                num_tries = num_tries - 1
                print(f"Attempting download again. Tries remaining: {num_tries}")
                try:
                    parse_html_table(table_url, file_path)
                    break
                except Exception:
                    time.sleep(1)
        time.sleep(2)
    else:
        print(f"Download from {table_url} already exists in {output_path}")
        time.sleep(0.2)


def scrape_table_links_from_page(page_url: str,
                                 output_path: str,
                                 update: bool = False) -> None:
    """Finds all table links on the weekly menu page at page_url and downloads
    each linked table to output_path."""
page = requests.get(page_url)
soup = BeautifulSoup(page.content, 'html.parser')
# get link to all tables in the page
table_link_list = [
tag.find("a")["href"] for tag in soup.select("tr:has(a)")
]

for table_link in table_link_list:
# Between years 1996 to 2016, select requestMode=Submit
if 'Submit' in table_link:
table_url = _BASE_URL + table_link
# extract filename from link patterns like https://wonder.cdc.gov/nndss/nndss_reps.asp?mmwr_year=1996&mmwr_week=01&mmwr_table=2A&request=Submit
filename = table_url.split('?')[1].split('&request')[0].replace(
'=', '_').replace('&', '_')
print("Submit", table_url, filename, output_path, update)
extract_table_from_link(table_url, filename, output_path, update)

# From year 2017, the base link structure has changed to: https://wonder.cdc.gov/nndss/static/2017/01/2017-01-table1.html
if table_link.endswith('.html') and 'table' in table_link:
# skip /nndss/ in the table_link, since it is already part of the _BASE_URL
table_url = _BASE_URL + table_link[7:]
# extract year, week, table_id from link
filename_components = table_link.split('/')[-1].split(
'.html')[0].split('-')
filename = _FILENAME_TEMPLATE.format(
year=filename_components[0],
week=filename_components[1],
id=filename_components[2].split('table')[1])
extract_table_from_link(table_url, filename, output_path, update)


def get_index_url(year, week):
    """Returns the weekly-tables menu URL for the given MMWR year and week."""
if year < 2017:
return _WEEKLY_TABLE_2010.format(year=year, week=week)
else:
return _WEEKLY_TABLE_2017.format(year=year, week=week)


def download_weekly_nnds_data_across_years(year_range: str,
                                           output_path: str) -> None:
    """Downloads all weekly NNDSS tables for each year in year_range to
    <output_path>/nndss_weekly_data."""
    output_path = os.path.join(output_path, 'nndss_weekly_data')
    if not os.path.exists(output_path):
        os.makedirs(output_path)
    for year in year_range:
        # Most MMWR years have 52 weeks, some have 53; the year % 4 check is
        # an approximation and may request a menu page that does not exist.
        week_range = [str(x).zfill(2) for x in range(1, 53)]
        if year % 4 == 0:
            week_range = [str(x).zfill(2) for x in range(1, 54)]
for week in week_range:
index_url = get_index_url(year, week)
print(f"Fetching data from {index_url}")
scrape_table_links_from_page(index_url, output_path, update=False)


def update_downloaded_files(year, week, file_path):
    """Re-downloads the tables for the given year and week into the directory
    containing file_path."""
    output_path = os.path.dirname(file_path)
    index_url = get_index_url(year, week)
    scrape_table_links_from_page(index_url, output_path, update=True)


def get_next_week(year: str, output_path: str) -> int:
    """Returns the MMWR week after the latest week of the given year already
    downloaded to output_path."""
    # Sort so the latest downloaded week comes last; os.listdir returns
    # entries in arbitrary order.
    all_files_in_dir = sorted(os.listdir(output_path))
    files_of_year = [files for files in all_files_in_dir if str(year) in files]
last_downloaded_file = files_of_year[-1]
week = last_downloaded_file.split('_mmwr_week_')[1].split('_mmwr_table')[0]
return int(week) + 1


def download_latest_weekly_nndss_data(year: str, output_path: str) -> None:
    """Downloads the tables on the current weekly menu page to output_path."""
index_url = "https://wonder.cdc.gov/nndss/nndss_weekly_tables_menu.asp"
print(f"Fetching data from {index_url}")
scrape_table_links_from_page(index_url, output_path)


FLAGS = flags.FLAGS
# Flags must be defined at module scope so that absl parses them before
# app.run() invokes main().
flags.DEFINE_string(
    'output_path', './data',
    'Path to the directory where generated files are to be stored.')


def main(_) -> None:
    year_range = range(_START, _END)
    download_weekly_nnds_data_across_years(year_range, FLAGS.output_path)


if __name__ == '__main__':
app.run(main)
Binary file not shown.
141 changes: 141 additions & 0 deletions scripts/us_cdc/nndss_weekly_tables/nndss_data/place_name_to_dcid.csv
@@ -0,0 +1,141 @@
Place Name,Resolved place dcid,Notes
U.S. Residents,country/USA,Not sure if we can do `SVObs place = country/USA` and add `citizenshipStatus=USCitizen` as pv? `USCitizen` will be a new instance in USC_CitizenshipEnum
"U.S. Residents, excluding U.S. Territories",country/USA,Not sure if we can do `SVObs place = country/USA` and add `citizenshipStatus=USCitizen` as pv? `USCitizen` will be a new instance in USC_CitizenshipEnum
United States,country/USA,
New England,usc/NewEnglandDivision,
Connecticut,geoId/09,
Maine,geoId/23,
Massachusetts,geoId/25,
New Hampshire,geoId/33,
Rhode Island,geoId/44,
Vermont,geoId/50,
Middle Atlantic,usc/MiddleAtlanticDivision,
New Jersey,geoId/34,
New York (excluding New York City),,To split geoId/36? spatial split on geojson and create a new place `Upstate_NewYork` containedInPlace: geoId/36
New York City,,Copy New York City map from geoId/36 and create a new place `NewYorkCity`
Pennsylvania,geoId/42,
East North Central,usc/EastNorthCentralDivision,
Illinois,geoId/17,
Indiana,geoId/18,
Michigan,geoId/26,
Ohio,geoId/39,
Wisconsin,geoId/55,
West North Central,usc/WestNorthCentralDivision,
Iowa,geoId/19,
Kansas,geoId/20,
Minnesota,geoId/27,
Missouri,geoId/29,
Nebraska,geoId/31,
North Dakota,geoId/38,
South Dakota,geoId/46,
South Atlantic,usc/SouthAtlanticDivision,
Delaware,geoId/10,
District of Columbia,geoId/11,
Florida,geoId/12,
Georgia,geoId/13,
Maryland,geoId/24,
North Carolina,geoId/37,
South Carolina,geoId/45,
Virginia,geoId/51,
West Virginia,geoId/54,
East South Central,usc/EastSouthCentralDivision,
Alabama,geoId/01,
Kentucky,geoId/21,
Mississippi,geoId/28,
Tennessee,geoId/47,
West South Central,usc/WestSouthCentralDivision,
Arkansas,geoId/05,
Louisiana,geoId/22,
Oklahoma,geoId/40,
Texas,geoId/48,
Mountain,usc/MountainDivision,
Arizona,geoId/04,
Colorado,geoId/08,
Idaho,geoId/16,
Montana,geoId/30,
Nevada,geoId/32,
New Mexico,geoId/35,
Utah,geoId/49,
Wyoming,geoId/56,
Pacific,usc/PacificDivision,
Alaska,geoId/02,
California,geoId/06,
Hawaii,geoId/15,
Oregon,geoId/41,
Washington,geoId/53,
U.S. Territories,,
American Samoa,geoId/60,
Commonwealth of Northern Mariana Islands,geoId/69,
Guam,geoId/66,
Puerto Rico,geoId/72,
U.S. Virgin Islands,geoId/78,
Non-U.S. Residents,country/USA,Not sure if we can do `SVObs place = country/USA` and add `citizenshipStatus=NotAUSCitizen` as pv?
Total,,Intentionally left blank
UNITED STATES,country/USA,
NEW ENGLAND,usc/NewEnglandDivision,
Conn.,geoId/09,
Maine,geoId/23,
Mass.,geoId/25,
N.H.,geoId/33,
R.I.,geoId/44,
Vt.,geoId/50,
MID. ATLANTIC,usc/MiddleAtlanticDivision,
N.J.,geoId/34,
N.Y. (Upstate),,To split geoId/36? spatial split on geojson and create a new place `Upstate_NewYork` containedInPlace: geoId/36
N.Y. City,,Copy New York City map from geoId/36 and create a new place `NewYorkCity`
Pa.,geoId/42,
E.N. CENTRAL,usc/EastNorthCentralDivision,
Ill.,geoId/17,
Ind.,geoId/18,
Mich.,geoId/26,
Ohio,geoId/39,
Wis.,geoId/55,
W.N. CENTRAL,usc/WestNorthCentralDivision,
Iowa,geoId/19,
Kans.,geoId/20,
Minn.,geoId/27,
Mo.,geoId/29,
Nebr.,geoId/31,
N. Dak.,geoId/38,
S. Dak.,geoId/46,
S. ATLANTIC,usc/SouthAtlanticDivision,
Del.,geoId/10,
D.C.,geoId/11,
Fla.,geoId/12,
Ga.,geoId/13,
Md.,geoId/24,
N.C.,geoId/37,
S.C.,geoId/45,
Va.,geoId/51,
W. Va.,geoId/54,
E.S. CENTRAL,usc/EastSouthCentralDivision,
Ala.,geoId/01,
Ky.,geoId/21,
Miss.,geoId/28,
Tenn.,geoId/47,
W.S. CENTRAL,usc/WestSouthCentralDivision,
Ark.,geoId/05,
La.,geoId/22,
Okla.,geoId/40,
Tex.,geoId/48,
MOUNTAIN,usc/MountainDivision,
Ariz.,geoId/04,
Colo.,geoId/08,
Idaho,geoId/16,
Mont.,geoId/30,
Nev.,geoId/32,
N. Mex.,geoId/35,
Utah,geoId/49,
Wyo.,geoId/56,
PACIFIC,usc/PacificDivision,
Alaska,geoId/02,
Calif.,geoId/06,
Hawaii,geoId/15,
Oreg.,geoId/41,
Wash.,geoId/53,
Amer. Samoa,geoId/60,
C.N.M.I.,geoId/69,
Guam,geoId/66,
P.R.,geoId/72,
V.I.,geoId/78,