# Analysis of Air Quality and Hospitalisations in the US
## Introduction
In this notebook, we compare air quality data with the number of asthma-related hospitalisations per US state per year. The geographical area was selected based on data availability: we could not easily find comparable air quality and hospitalisation data for other countries or regions. This is a follow-up to the analysis of air quality and asthma prevalence, in which we found a counterintuitive negative correlation between air quality and asthma prevalence. Here we take the analysis a step further: rather than counting the people who have asthma, we look at the possible impact of poorer air quality on severe asthma symptoms that require hospitalisation. We expect to see a positive correlation between poor air quality and hospitalisations. In other words, the higher the concentration of PM2.5, the more people with asthma will have to be hospitalised.

## Preparation of data
The data are prepared in four steps:
1. Import dependencies for preparation and analysis
2. Import the first dataset (air quality per US state)
3. Import the second dataset (hospitalisation for asthma per US state)
4. Merge both datasets

### Dependencies

In [None]:
# Dependencies
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
from scipy import stats
import hvplot.pandas
import requests

# Turn off warning messages
import warnings
warnings.filterwarnings("ignore")

### Dataset 1: air quality per US state
We first import the cleaned dataset of air quality data (PM2.5 concentration) per US state and per year. The dataset is loaded into a DataFrame and displayed.

In [None]:
airquality_csv = Path("Cleaned_Datasets/cleaned_airquality_usstates.csv")
airquality_df = pd.read_csv(airquality_csv)
airquality_df = airquality_df.rename(columns={'PM25': 'PM2.5'})
airquality_df

### Dataset 2: hospitalisations per US state
We then import the cleaned dataset of asthma-related hospitalisations per US state and per year. The dataset is loaded into a DataFrame and displayed. Because the values are stored as strings with commas as thousands separators, they must be converted to numeric values with `pd.to_numeric` after the commas are removed.

In [None]:
hospital_csv = Path("Cleaned_Datasets/cleaned_hospitalisations.csv")
hospital_norformat_df = pd.read_csv(hospital_csv)
hospital_raw_df = hospital_norformat_df.rename(columns={'No. of Hospitalisations' : 'Hosp.'})

# Work on an explicit copy so that the conversion below does not modify a view of hospital_raw_df
hospital_df = hospital_raw_df.copy()

# Convert hospitalisation data from string to numeric, take into account that the str includes a , to separate thousands
# Example: this will convert '1,042' to 1042.0
hospital_df["Hosp."] = pd.to_numeric(hospital_df['Hosp.'].str.replace(",",""))
hospital_df
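
As an alternative to the string replacement above, `pd.read_csv` can parse thousands separators at load time through its `thousands` parameter. The sketch below is only an illustration: it reads a small in-memory CSV with invented values (not the project's data) and assumes the column is named `No. of Hospitalisations`, as in the rename above.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the layout of the hospitalisation file (values are invented)
sample_csv = io.StringIO(
    'State,Year,No. of Hospitalisations\n'
    'Alabama,2018,"1,042"\n'
    'Alaska,2018,"315"\n'
)

# thousands="," tells pandas to treat commas inside numbers as thousands separators
sample_df = pd.read_csv(sample_csv, thousands=",")
sample_df = sample_df.rename(columns={"No. of Hospitalisations": "Hosp."})

print(sample_df.dtypes)
print(sample_df)
```

With this approach the column is numeric as soon as it is loaded, so the separate `pd.to_numeric` step would no longer be needed.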

### Merge datasets
Both DataFrames are merged into one on the `State` and `Year` columns. An inner join is used to keep only the state and year combinations for which both the PM2.5 and the hospitalisation data exist.

In [None]:
airqual_hospital_df = pd.merge(airquality_df, hospital_df, on=['State', 'Year'], how='inner')
airqual_hospital_df
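
An inner join silently drops rows that have no match in the other dataset. One way to see what would be lost is an outer merge with `indicator=True`, sketched below on two miniature DataFrames. The names `air_df`, `hosp_df`, `check_df` and `dropped_df` and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the two DataFrames (values are invented)
air_df = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "Year": [2018, 2018, 2018],
    "PM2.5": [8.1, 6.3, 7.4],
})
hosp_df = pd.DataFrame({
    "State": ["Alabama", "Arizona", "Arkansas"],
    "Year": [2018, 2018, 2018],
    "Hosp.": [1042.0, 2310.0, 980.0],
})

# An outer merge with indicator=True adds a '_merge' column recording the origin of each row
check_df = pd.merge(air_df, hosp_df, on=["State", "Year"], how="outer", indicator=True)

# Rows present in only one dataset, i.e. the rows an inner merge would drop
dropped_df = check_df[check_df["_merge"] != "both"]
print(dropped_df[["State", "Year", "_merge"]])
```

In the notebook, the same check could be run on `airquality_df` and `hospital_df` before the inner merge.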

## Data analysis
In this section, we use the merged DataFrame to determine the correlation between the PM2.5 concentration in each US state for each year and the number of hospitalisations for the same year and state.

In [None]:
# Scatter plot of hospitalisations against PM2.5 concentration
fig1 = airqual_hospital_df.plot.scatter('PM2.5', 'Hosp.', figsize=(8, 5))
fig1.set_ylabel('No. of Hospitalisations for Asthma')

x_data = airqual_hospital_df['PM2.5']
y_data = airqual_hospital_df['Hosp.']

# Reference line at zero hospitalisations
plt.hlines(0, min(x_data), max(x_data), colors='black')

# Linear regression of hospitalisations on PM2.5
slope, intercept, rvalue, pvalue, stderr = stats.linregress(x_data, y_data)
y_reg = slope * x_data + intercept

plt.plot(x_data, y_reg, 'r')

# Save figure as PNG to add to presentation (create the output folder if needed)
Path('Images').mkdir(exist_ok=True)
plt.savefig('Images/PM25_and_Hospitalisation_US.png')

# Show
plt.show()

### Correlation

In [None]:
# Print the correlation coefficient between PM2.5 and the number of hospitalisations
print(f"Correlation between PM2.5 concentration and hospitalisations for asthma: {rvalue:.3f}")

We find a moderate correlation between the PM2.5 concentration and the number of hospitalisations. The scatter plot and the regression line suggest that a linear model may not be the best fit, and that a quadratic or exponential model may be more accurate. Nonetheless, the overall trend is that the higher the PM2.5 concentration, the more hospital admissions for asthma symptoms are observed.
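
The idea that a curved model might fit better than a straight line can be checked by comparing the coefficient of determination (R^2) of a linear and a quadratic least-squares fit. The sketch below is not the notebook's analysis: it uses synthetic data generated for illustration, and the helper `r_squared` and the variables `x_demo` and `y_demo` are hypothetical names.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination for a set of predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Synthetic data for illustration only (not the notebook's data): a curved trend plus noise
rng = np.random.default_rng(0)
x_demo = np.linspace(4, 12, 50)
y_demo = 30 * x_demo ** 2 + rng.normal(0, 100, size=x_demo.size)

# Fit polynomials of degree 1 (linear) and degree 2 (quadratic) by least squares
linear_coeffs = np.polyfit(x_demo, y_demo, 1)
quadratic_coeffs = np.polyfit(x_demo, y_demo, 2)

r2_linear = r_squared(y_demo, np.polyval(linear_coeffs, x_demo))
r2_quadratic = r_squared(y_demo, np.polyval(quadratic_coeffs, x_demo))

print(f"R^2 linear: {r2_linear:.3f}")
print(f"R^2 quadratic: {r2_quadratic:.3f}")
```

In the notebook, `x_demo` and `y_demo` would be replaced by `x_data` and `y_data`. Because the quadratic model contains the linear one as a special case, its R^2 on the fitted data can never be lower, so a small improvement alone does not show that the curved model is better.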

### Geographical distribution

In [None]:
# Looking at the data for 2018 only
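# .copy() gives an independent DataFrame, so the Lat/Lon columns added below are not written to a view
airqual_hospital_2018_df = airqual_hospital_df.loc[airqual_hospital_df['Year']==2018,:].copy()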

try:
    # Dependencies
    from api_keys import api_key_geoapify

    # Build the endpoint URL
    base_url = "https://api.geoapify.com/v1/geocode/search"

    params = {
        "apiKey":api_key_geoapify,
        "format":"json",
    }

    # Iterate through the rows of the 2018 DataFrame
    for index, row in airqual_hospital_2018_df.iterrows():

        # Print current status
        print(f"Now adding longitude/latitude for: {airqual_hospital_2018_df.loc[index,'State']}...")

        # Add the state name as the search text
        params["text"] = airqual_hospital_2018_df.loc[index,'State']

        # Run request
        response = requests.get(base_url,params=params).json()

        airqual_hospital_2018_df.loc[index,'Lat'] = response['results'][0]['lat']
        airqual_hospital_2018_df.loc[index,'Lon'] = response['results'][0]['lon']
    
    display(airqual_hospital_2018_df)
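except Exception as error:
    # Report the underlying error so that failures other than a bad key can be diagnosed
    print(f'Error ({type(error).__name__}): {error}')
    print('Most likely cause: no API key found or wrong API key.')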
    print('Follow the steps below to solve:')
    print('1. Create a file called api_keys.py')
    print('2. In the file, include the line: api_key_geoapify = "..."')
    print('3. Replace ... with your geoapify API key')
    print('4. Restart Kernel and run this notebook again.')
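
The loop above assumes that every request returns at least one match. A more defensive way to read the coordinates is sketched below. The helper `extract_lat_lon` is hypothetical, it relies only on the `results` / `lat` / `lon` layout already used in the cell above, and the sample responses (including the shape of the error response) are invented for illustration rather than real API output.

```python
def extract_lat_lon(response_json):
    """Return (lat, lon) from a geocoding response, or (None, None) if there is no match.

    Assumes the response layout used in the cell above:
    {"results": [{"lat": ..., "lon": ...}, ...]}
    """
    results = response_json.get("results") or []
    if not results:
        return None, None
    first_match = results[0]
    return first_match.get("lat"), first_match.get("lon")

# Hypothetical responses (coordinates and error shape are illustrative only)
ok_response = {"results": [{"lat": 32.3, "lon": -86.9}]}
empty_response = {"results": []}
error_response = {"message": "Invalid apiKey"}

print(extract_lat_lon(ok_response))
print(extract_lat_lon(empty_response))
print(extract_lat_lon(error_response))
```

With such a helper, a state that cannot be geocoded would get missing coordinates instead of stopping the whole loop.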

In [None]:
try:
    # Configure the map
    map_plot_1 = airqual_hospital_2018_df.hvplot.points(
        'Lon',
        'Lat',
        geo=True,
        tiles = "OSM",
        frame_width = 800,
        frame_height = 600,
        size = "Hosp.",
        scale = 0.5,
        color = "PM2.5",
        cmap='bkr'
    )

    # Display the map
    display(map_plot_1)
except Exception:
    print('Please run the block above first and make sure your API key is valid.')

# Answer to key questions
## Does a country’s air quality have an impact on hospital admissions due to asthma?
Yes. Although the available data do not allow us to draw any conclusion about causality, there is a correlation between the concentration of PM2.5 in the air and the number of hospital admissions due to asthma. A higher concentration of PM2.5 is expected to trigger asthmatic reactions, and the data are consistent with this.