# Fentanyl Fighters CS484 Project

## Goal: Develop a predictive model for future opioid overdoses at the county level in the United States

## Datasets:

### CDC Opioid Prescription Rates 
* Used to identify the rate of opioid prescribing at the county level for training purposes. Contains county FIPS codes along with the number of opioid prescriptions per 100 individuals.  
* Location of data: datasets/opioidprescribingbycounty_YoY/
* Source: https://www.cdc.gov/drugoverdose/rxrate-maps/

### County Coordinates
* Used to map US County FIPS codes to geographic coordinates for GIS mapping 
* Location of data: datasets/county_coords.xlsx
* Source: https://gist.github.com/russellsamora/12be4f9f574e92413ea3f92ce1bc58e6

### Drug Overdose Deaths by County from 2020-2022
* Will be used as our testing data to evaluate the performance of our model
* Location of the data: datasets/VSRR_Provisional_County-Level_Drug_Overdose_Death_Counts.xlsx
* Source: https://www.cdc.gov/nchs/nvss/vsrr/drug-overdose-data.htm


# CDC Opioid Prescription Rates

## Data Pre-Processing

* This code is responsible for extracting data related to opioid prescriptions in US counties for each year from 2006 to 2020 from the CDC website. The extracted data is pre-processed, and each year's data is stored in a separate CSV file with the corresponding year in the filename. The code starts by defining the URL base and generates a list of years to iterate over. For each year in the list, the code constructs the full URL, makes a request to the website, and retrieves the HTML content.

* If the year is 2020, the code downloads the data file directly as the HTML is formatted differently for that year. Otherwise, it parses the HTML content using BeautifulSoup and extracts the table headers and rows. A DataFrame is then created from the table using pandas. A new column is added to the DataFrame with the year, and the data is saved to a CSV file with a filename based on the year. The DataFrame is also appended to a list of DataFrames.

* Once all of the years' data has been extracted and pre-processed, the code concatenates all of the DataFrames into a single DataFrame and saves it to a CSV file. The resulting CSV file contains data on opioid prescriptions in US counties for each year from 2006 to 2020. The goal of this data extraction and pre-processing is to provide data for use in a predictive model for opioid overdoses at the county level.

In [26]:
import pandas as pd
import requests
import csv
from bs4 import BeautifulSoup
import time

# URL base for the website with the HTML tables
urlbase = 'https://www.cdc.gov/drugoverdose/rxrate-maps/'

# Generate list of years to iterate
years = [f"county{year}.html" for year in range(2006, 2021)]

# Create an empty list to store the individual DataFrames
dfs = []

# Iterate over each URL and process the data
for year in years:
    # Construct the full URL for this year's data
    url = urlbase + year
    
    # Make a request to the website and get the HTML content
    response = requests.get(url)
    
    if year == 'county2020.html':
        time.sleep(3)

    # Check if the year is 2020 and download the data file directly
    if year == 'county2020.html':
        csv_url = "https://www.cdc.gov/drugoverdose/data-files/2020-County-Rx-Map.csv"
        df = pd.read_csv(csv_url)
    else:
        html_content = response.content
        
        # Parse the HTML content with BeautifulSoup and find the table
        soup = BeautifulSoup(html_content, 'html.parser')
        table = soup.find('table')
        
        # Get the table headers and rows
        headers = [th.text.strip() for th in table.find_all('th')]
        rows = []
        for tr in table.find_all('tr'):
            row = [td.text.strip() for td in tr.find_all('td')]
            if row:
                rows.append(row)
        
        # Create a DataFrame from the table using pandas
        df = pd.DataFrame(rows, columns=headers)
    
    # Add a new column to the DataFrame with the year
    df["Year"] = year[6:10]
    
    # Write the data to a CSV file with a filename based on the year
    filename = f"datasets/opioidprescribingbycounty_YoY/county{year[6:10]}.csv"
    df.to_csv(filename, index=False)
    
    print(f"{filename} saved successfully!")
    
    # Append the DataFrame to the list of DataFrames
    dfs.append(df)

# Concatenate all of the DataFrames into a single DataFrame
df_all = pd.concat(dfs, ignore_index=True)

# Save the combined DataFrame to a CSV file
df_all.to_csv("datasets/opioidprescribingbycounty_YoY/county2006to2020.csv", index=False)

print("county2006to2020.csv saved successfully!")


datasets/opioidprescribingbycounty_YoY/county2006.csv saved successfully!
datasets/opioidprescribingbycounty_YoY/county2007.csv saved successfully!
datasets/opioidprescribingbycounty_YoY/county2008.csv saved successfully!
datasets/opioidprescribingbycounty_YoY/county2009.csv saved successfully!
datasets/opioidprescribingbycounty_YoY/county2010.csv saved successfully!
datasets/opioidprescribingbycounty_YoY/county2011.csv saved successfully!
datasets/opioidprescribingbycounty_YoY/county2012.csv saved successfully!
datasets/opioidprescribingbycounty_YoY/county2013.csv saved successfully!
datasets/opioidprescribingbycounty_YoY/county2014.csv saved successfully!
datasets/opioidprescribingbycounty_YoY/county2015.csv saved successfully!
datasets/opioidprescribingbycounty_YoY/county2016.csv saved successfully!
datasets/opioidprescribingbycounty_YoY/county2017.csv saved successfully!
datasets/opioidprescribingbycounty_YoY/county2018.csv saved successfully!
datasets/opioidprescribingbycounty_YoY

In [10]:
import pandas as pd
import os

# Define the path to the directory containing the CSV files
csv_dir = "datasets/opioidprescribingbycounty_YoY"

# Create an empty list to store the individual DataFrames
dfs = []

# Iterate over each CSV file in the directory and append to the list of DataFrames
for filename in os.listdir(csv_dir):
    if filename.endswith(".csv"):
        year = filename[6:10] # Extract the year from the filename
        df = pd.read_csv(os.path.join(csv_dir, filename)) # Read the CSV file into a DataFrame
        df["Year"] = year # Add a new column to the DataFrame with the year
        dfs.append(df) # Append the DataFrame to the list

# Concatenate all of the DataFrames into a single DataFrame
df_all = pd.concat(dfs, ignore_index=True)

# Display the resulting DataFrame
print(df_all.head())

EmptyDataError: No columns to parse from file

In [3]:


dfs = {}

df = pd.read_excel('datasets/cdc_disp_rates.xlsx')
dfs['disp_rates'] = df
df = pd.read_excel('datasets/county_coords.xlsx')
dfs['coords'] = df

dfs['coords'] = dfs['coords'].sort_values('FIPS')
dfs['disp_rates'] = dfs['disp_rates'].sort_values('FIPS')

#attempting to merge the data
merged_data = pd.merge(dfs['disp_rates'], dfs['coords'], on='FIPS')

print(merged_data.head())

  State   County    FIPS  Opioid Dispensing Rate per 100     NAME   INTPTLAT  \
0    AL  AUTAUGA  1001.0                            98.3  Autauga  32.532237   
1    AL  BALDWIN  1003.0                            65.0  Baldwin  30.659218   
2    AL  BARBOUR  1005.0                            22.8  Barbour  31.870253   
3    AL     BIBB  1007.0                            24.8     Bibb  33.015893   
4    AL   BLOUNT  1009.0                            22.8   Blount  33.977357   

    INTPTLON  
0 -86.646439  
1 -87.746067  
2 -85.405103  
3 -87.127148  
4 -86.566440  


In [2]:
import pandas as pd
import folium
import geopandas as gpd
from folium.plugins import HeatMap

# Load the data
df_disp_rates = pd.read_excel('datasets/cdc_disp_rates.xlsx')
df_coords = pd.read_excel('datasets/county_coords.xlsx')

# Sort and merge the data
df_disp_rates = df_disp_rates.sort_values('FIPS')
df_coords = df_coords.sort_values('FIPS')
merged_data = pd.merge(df_disp_rates, df_coords, on='FIPS')

# Convert to a GeoDataFrame
gdf = gpd.GeoDataFrame(merged_data, geometry=gpd.points_from_xy(merged_data['INTPTLON'], merged_data['INTPTLAT']))

# Create a base map
m = folium.Map(location=[37.8, -96], zoom_start=4)

# Extract the dispensing rates and coordinates to be used in the heatmap
data = gdf[['INTPTLAT', 'INTPTLON', 'Opioid Dispensing Rate per 100']].values.tolist()

# Create and add the heatmap to the map
heatmap = HeatMap(data, radius=10, blur=6, max_zoom=5, gradient={0.2: 'blue', 0.4: 'green', 0.6: 'yellow', 0.8: 'orange', 1: 'red'})
m.add_child(heatmap)

# Display the map
m


In [15]:

# Load the death rates data
df_deaths = pd.read_excel('datasets/VSRR_Provisional_County-Level_Drug_Overdose_Death_Counts.xlsx')

# Filter the data for the latest year
#latest_year = df_deaths['Year'].max()
#df_deaths = df_deaths[df_deaths['Year'] == latest_year]

# Group by FIPS and calculate the mean death rate
df_deaths = df_deaths.groupby('FIPS')['Provisional Drug Overdose Deaths'].mean().reset_index()

# Merge the death rates data with the merged_data DataFrame
merged_data = pd.merge(merged_data, df_deaths, on='FIPS')

# Convert to a GeoDataFrame
gdf = gpd.GeoDataFrame(merged_data, geometry=gpd.points_from_xy(merged_data['INTPTLON'], merged_data['INTPTLAT']))
print(gdf.head())
# Create a base map
m = folium.Map(location=[37.8, -96], zoom_start=4)

# Extract the dispensing rates and coordinates to be used in the heatmap
data_dispensing = gdf[['INTPTLAT', 'INTPTLON', 'Opioid Dispensing Rate per 100']].values.tolist()
gdf = gdf.drop(['Provisional Drug Overdose Deaths_x', 'Provisional Drug Overdose Deaths_y'], axis=1)
data_deaths = gdf[['INTPTLAT', 'INTPTLON', 'Provisional Drug Overdose Deaths']].values.tolist()

# Convert to pandas DataFrames
df_dispensing = pd.DataFrame(data_dispensing, columns=['LAT', 'LON', 'RATE']).dropna()
df_deaths = pd.DataFrame(data_deaths, columns=['LAT', 'LON', 'DEATHS']).dropna()

# Convert back to lists
data_dispensing = df_dispensing.values.tolist()
data_deaths = df_deaths.values.tolist()

# Create and add the dispensing rates heatmap to the map
heatmap_dispensing = HeatMap(data_dispensing, radius=10, blur=6, max_zoom=5, gradient={0.2: 'blue', 0.4: 'green', 0.6: 'yellow', 0.8: 'orange', 1: 'red'})
m.add_child(heatmap_dispensing)

# Extract the deaths data
data_deaths = gdf[['INTPTLAT', 'INTPTLON', 'Provisional Drug Overdose Deaths']].dropna().values.tolist()

# Create and add the death rates heatmap to the map
heatmap_deaths = HeatMap(data_deaths, radius=10, blur=6, max_zoom=5, gradient={0: 'lightblue', 0.1: 'blue', 0.3: 'purple', 0.5: 'red', 1: 'darkred'})
m.add_child(heatmap_deaths)


# Display the map
m


  State   County    FIPS  Opioid Dispensing Rate per 100     NAME   INTPTLAT  \
0    AL  AUTAUGA  1001.0                            98.3  Autauga  32.532237   
1    AL  BALDWIN  1003.0                            65.0  Baldwin  30.659218   
2    AL  BARBOUR  1005.0                            22.8  Barbour  31.870253   
3    AL     BIBB  1007.0                            24.8     Bibb  33.015893   
4    AL   BLOUNT  1009.0                            22.8   Blount  33.977357   

    INTPTLON                    geometry  Provisional Drug Overdose Deaths_x  \
0 -86.646439  POINT (-86.64644 32.53224)                                 NaN   
1 -87.746067  POINT (-87.74607 30.65922)                           85.333333   
2 -85.405103  POINT (-85.40510 31.87025)                                 NaN   
3 -87.127148  POINT (-87.12715 33.01589)                           10.000000   
4 -86.566440  POINT (-86.56644 33.97736)                           15.666667   

   Provisional Drug Overdose Deaths_y 