# CDC Data Extraction

## Description
### Date Range
Files became available around late December 2020. To merge cleanly with the mobility data and with economic indicators such as unemployment, this project focuses on data from 01/02/2021.

### Data Format
The CDC has made a number of Excel spreadsheets available on its website. Our project uses data from the Counties worksheet, since the state-level data did not provide enough detail to target hotspots.

In [1]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

First, fetch a sample web page to inspect its structure.

In [2]:
# Target web page:
url = "https://beta.healthdata.gov/Health/COVID-19-Community-Profile-Report/gqxm-d9w9"

# Establishing the connection to the web page:
response = requests.get(url)

# The status code shows how the target server responded to the request,
# e.g. 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found.
print(response.status_code)

# Get the response body (the page's HTML) as a Python string.
html = response.text

200
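
The cell above only prints the numeric code. As a minimal sketch, the standard library's `http.HTTPStatus` can turn that number into a readable label. The helper names `describe_status` and `is_success` are hypothetical and not part of the original notebook.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a readable label such as '404 Not Found' for an HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Codes outside the standard registry have no known phrase.
        return f"{code} Unknown status"


def is_success(code: int) -> bool:
    """True for 2xx codes, the range that signals a successful request."""
    return 200 <= code < 300
```

In the notebook this could be called as `describe_status(response.status_code)`. Calling `response.raise_for_status()` is an alternative that raises an exception on 4xx and 5xx responses instead of continuing silently.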


In [3]:
soup = BeautifulSoup(html, "lxml")

Below we see that the pertinent information is stored as relative links inside the page's JavaScript. The Excel spreadsheets are listed alongside the PDF reports.

In [4]:
soup.extract()

<!DOCTYPE html>
<html lang="en">
<!--
  Powered by Socrata
  http://www.socrata.com
  -->
<head>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="authenticity_token" name="csrf-param"/>
<meta content="AHIacVMt6ZONm8/FLF6OswdgaPI2fbJ+VqYy0WI6zJCWIjv8lHNk7okIMoSwKHqEF0a21jAJAFkGAR1P78aCUg==" name="csrf-token"/>
<script>
//<![CDATA[
var socrata = {"currentUser":null,"domain":"healthdata.gov","domain_id":"3332","environment":"production","featureFlags":{"enable_usds_global_header":true,"embetter_analytics_page":false,"tyler_privacy_policy":true,"enable_colocate_ui":false,"display_dataset_landing_page_notice":false,"ignore_hiding_columns_unhidden_on_derived_views":false,"feature_map_default_extent":"","enable_vertical_filter_bar":true,"show_site_analytics_referrers_dataset":false,"enable_region_code_transform":false,"show_system_datasets_in_catalog":true,"enable_analyzer_view_validation":false,"retire_get_nbe_migrations_info":false,"domain_locale":"en_US","enable_standa

We use a regular expression to extract the relative links and build the spreadsheet URLs. The files are downloaded to the repo's download directory, where another notebook loads them into a pandas DataFrame containing only the CDC county data.
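
`urllib.request.urlretrieve` does not create missing directories, so the download directory has to exist before any file is saved. The following is a minimal sketch of preparing the target path, assuming a local directory such as `../download`. The helpers `prepare_target` and `needs_download` are hypothetical names, not part of the original notebook.

```python
from pathlib import Path


def prepare_target(download_dir: str, filename: str) -> Path:
    """Create download_dir if it is missing and return the full path for filename."""
    directory = Path(download_dir)
    directory.mkdir(parents=True, exist_ok=True)
    return directory / filename


def needs_download(target: Path) -> bool:
    """Return False for files that already exist and are non-empty."""
    return not (target.exists() and target.stat().st_size > 0)
```

Checking `needs_download` before each request would let the notebook be re-run without fetching spreadsheets that are already on disk.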

In [5]:
# ref https://stackoverflow.com/questions/17407691/python-regex-to-match-multiple-times
# ref https://www.tutorialspoint.com/downloading-files-from-web-using-python

import urllib.request

all_scripts = soup.find_all('script')

base_url = 'https://beta.healthdata.gov'

# Capture the relative link of every .xlsx attachment: "href":"<path>.xlsx",
# The dot is escaped so that only a literal ".xlsx" extension matches.
pattern = re.compile(r'"href":"([^,]+\.xlsx)",', re.IGNORECASE)

for script in all_scripts:
    script_string = str(script.string)
    # The attachment links live in the script that defines initialState.
    if 'initialState' in script_string:
        for relative_link in pattern.findall(script_string):
            download_url = f'{base_url}{relative_link}'

            # The file name is whatever follows the last '=' in the URL.
            filename = download_url[(download_url.rfind('=') + 1):].replace(' ', '_')
            # Replace spaces so the URL is valid when requested.
            download_url = download_url.replace(' ', '+')
            urllib.request.urlretrieve(download_url, f'../download/{filename}')
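
To check the extraction logic without touching the network, the same steps can be run against a small sample string. This is a sketch under stated assumptions: `SAMPLE_SCRIPT` is a hand-written, hypothetical snippet (the real page's JSON and file paths may differ), `extract_xlsx_links` is a hypothetical helper, and the pattern is a variant of the one in the cell above with the dot before `xlsx` escaped.

```python
import re

BASE_URL = "https://beta.healthdata.gov"
XLSX_LINK = re.compile(r'"href":"([^,]+\.xlsx)",', re.IGNORECASE)

# Hypothetical sample; paths and file names are made up for illustration.
SAMPLE_SCRIPT = (
    'var initialState = {"attachments":['
    '{"href":"/api/views/gqxm-d9w9/files/abc?download=true&filename=Report 20210102.xlsx","name":"a"},'
    '{"href":"/api/views/gqxm-d9w9/files/def?download=true&filename=Report 20210102.pdf","name":"b"}'
    ']};'
)


def extract_xlsx_links(script_text, base_url=BASE_URL):
    """Return (download_url, filename) pairs for every .xlsx link found."""
    results = []
    for relative_link in XLSX_LINK.findall(script_text):
        # The file name is whatever follows the last '=' in the link.
        filename = relative_link[relative_link.rfind("=") + 1:].replace(" ", "_")
        # Spaces are replaced with '+' so the URL can be requested.
        url = f"{base_url}{relative_link}".replace(" ", "+")
        results.append((url, filename))
    return results
```

Because `[^,]+` cannot cross a comma, each match stays inside a single `"href"` value, so the PDF entry in the sample is skipped rather than merged into a neighbouring match.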

