# Analysis of Air Pollution Data

## Data
- Source: the [API](https://aqs.epa.gov/aqsweb/documents/data_api.html#lists) of the EPA (US Environmental Protection Agency)
- Daily air pollution data for Los Angeles from 2019 (the COVID-19 years are left out)

In [8]:
# Import the modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import json
from time import sleep

#API Sign-Up
email="felixsch00@outlook.de"
#response_signup= requests.get(f"https://aqs.epa.gov/data/api/signup?email={email}")
#API Key via E-Mail
api_key="goldfox88" 
#Check if API is available --> 200 is working
response_available=requests.get("https://aqs.epa.gov/data/api/metaData/isAvailable")
print(f"Status Code Available: {response_available.status_code}")

Status Code Available: 200


In [21]:
# Get some information about the parameters that can be requested
#Definitions
response_definition=requests.get(f"https://aqs.epa.gov/data/api/metaData/fieldsByService?email={email}&key={api_key}&service=sampleData")
print(f"Status Code E-Mail and Key okay: {response_definition.status_code}")

#Check available parameter classes (groups of parameter) -> to a JSON File
resp_parameter_classes=requests.get(f"https://aqs.epa.gov/data/api/list/classes?email={email}&key={api_key}")
with open("API_Infos/Parameter_Classes.json","w") as file:
    json.dump(resp_parameter_classes.json(),file,indent=4)
#Parameters from AQI POLLUTANTS   
def get_parameter_from_class(class_code:str,email=email,api_key=api_key):
    """Get the codes and definitions of the available AQ measurements and store them in a JSON file in the API_Infos folder

    Args:
        class_code (str): Class code shown in the Parameter_Classes.json file
        email (str, optional): For API. Defaults to email.
        api_key (str, optional): For API. Defaults to api_key.
    """
    parameters_api={"email":email,"key":api_key,"pc":class_code}
    req_parameters=requests.get("https://aqs.epa.gov/data/api/list/parametersByClass",params=parameters_api)
    with open(f"API_Infos/Parameter_{class_code}.json","w") as file:
        json.dump(req_parameters.json(),file,indent=4)
    print(f"{class_code} Parameters successfully print to json file")

#Basic Pollutants
get_parameter_from_class("AQI POLLUTANTS")
#Meteorological Parameters
get_parameter_from_class("MET")
#Volatile organic compounds
get_parameter_from_class("VOC")



Status Code Available: 200
Status Code E-Mail and Key okay: 200
AQI POLLUTANTS Parameters successfully print to json file
MET Parameters successfully print to json file
VOC Parameters successfully print to json file


## Data

### API
AQS API of the US [EPA](https://aqs.epa.gov/aqsweb/documents/data_api.html)

### Which time intervals should I look at in my analysis?
- Data can be retrieved down to the individual measurement intervals (SampleData) -> custom intervals (e.g. 6 h) could be computed from these by averaging
- It is simpler to just use the maxima of the **daily data** (the means can also be used for a closer look) --> Daily Summary Data by Site
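
The two options above can be compared with a minimal pandas sketch. It assumes the measurements are already available as a time-indexed series; the helper name `aggregate_samples` and the hourly values are made up for illustration and do not come from the API.

```python
import pandas as pd

def aggregate_samples(samples: pd.Series, interval: str = "6h") -> pd.DataFrame:
    """Aggregate a time-indexed series of measurements to a coarser interval."""
    resampled = samples.resample(interval)
    # One row per interval with the mean and the maximum of its samples
    return pd.DataFrame({"mean": resampled.mean(), "max": resampled.max()})

# Hypothetical hourly measurements for two days (values are made up)
index = pd.date_range("2019-01-01", periods=48, freq="h")
hourly = pd.Series(range(48), index=index, dtype=float)

six_hourly = aggregate_samples(hourly, "6h")  # custom 6 h intervals
daily = aggregate_samples(hourly, "1D")       # daily mean and maximum
```

The daily summary endpoint already delivers the equivalent of `daily`, which is why no resampling is needed when only daily values are analysed.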

### Monitoring stations
- It is important to select three centrally located monitoring stations. They were chosen using a [map view](https://epa.maps.arcgis.com/apps/webappviewer/index.html?id=5f239fd3e72f424f98ef3d5def547eb5&extent=-146.2334,13.1913,-46.3896,56.5319).
    + Los Angeles-North Main Street (Short: LA_N): AQS Site ID 06-037-1103
    + Pico Rivera: AQS Site ID 06-037-1602
    + Compton: AQS Site ID 06-037-1302

![The three marked monitoring stations were selected.](Stations_Map.png)
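
The request code in this notebook passes the state, county and site as separate query parameters ("06", "037" and the four-digit site code). As a small sketch, an AQS Site ID can be split into these three parts; the helper name `split_site_id` is hypothetical.

```python
def split_site_id(site_id: str) -> dict:
    """Split an AQS Site ID of the form 'SS-CCC-NNNN' into query parameters."""
    state, county, site = site_id.split("-")
    return {"state": state, "county": county, "site": site}

# The three stations selected above
stations = {
    "la_n": "06-037-1103",
    "pico_rivera": "06-037-1602",
    "compton": "06-037-1302",
}
station_params = {name: split_site_id(site_id) for name, site_id in stations.items()}
```

Keeping the parts as strings preserves the leading zeros of the state and county codes.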

### Parameter

The code block above wrote the generally available measured substances to JSON files.
The pollutants of interest (incl. their code for the API) are:
- CO 42101
- SO2 42401
- NO2 42602
- O3 44201
- PM10 (particulate matter) 81102 (0-10 µm total)
- PM2.5 88101
- (Total NMOC 43102 (all VOCs except CH4))
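
The request code further down notes that up to 5 parameters may be requested at once, separated by commas. The following sketch groups the six codes accordingly; the helper name `chunk_codes` is an assumption for illustration (the notebook itself simply splits the codes into two groups of three).

```python
def chunk_codes(codes: list, max_per_request: int = 5) -> list:
    """Build comma-separated parameter strings with at most max_per_request codes each."""
    return [
        ",".join(str(code) for code in codes[start:start + max_per_request])
        for start in range(0, len(codes), max_per_request)
    ]

parameter_pollutant = {"CO": 42101, "SO2": 42401, "NO2": 42602,
                       "O3": 44201, "PM10": 81102, "PM2.5": 88101}
param_strings = chunk_codes(list(parameter_pollutant.values()))
```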

#### CO 42101
- different SampleDurations (1h and 8h) -> use 1h here
- Unit: ppm
- To exclude faulty measurements, validity_indicator must be "Y" and observation_percent must be > 40 %

#### NO2
- SampleDuration: "1 HOUR"

#### SO2
- SampleDuration: "5 MINUTE"

#### PM2.5
- SampleDuration: 24 HOUR

#### O3
- SampleDuration: "8-HR RUN AVG BEGIN HOUR"

#### PM10
- SampleDuration: 24 HOUR
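
The per-pollutant notes above can be turned into a single lookup table. This is a sketch under assumptions: the column names follow the daily data fields used later in this notebook, the helper name `filter_daily_rows` is hypothetical, and the toy rows are made up.

```python
import pandas as pd

# Sample duration noted above for each parameter code
SAMPLE_DURATION = {
    "42101": "1 HOUR",                   # CO
    "42602": "1 HOUR",                   # NO2
    "42401": "5 MINUTE",                 # SO2
    "88101": "24 HOUR",                  # PM2.5
    "44201": "8-HR RUN AVG BEGIN HOUR",  # O3
    "81102": "24 HOUR",                  # PM10
}

def filter_daily_rows(df: pd.DataFrame, min_percent: float = 40) -> pd.DataFrame:
    """Keep valid rows that have the sample duration noted for their pollutant."""
    valid = (df["validity_indicator"] == "Y") & (df["observation_percent"] > min_percent)
    expected = df["parameter_code"].map(SAMPLE_DURATION)
    return df[valid & (df["sample_duration"] == expected)]

# Made-up rows: only the first one passes all three checks
rows = pd.DataFrame({
    "parameter_code": ["42101", "42101", "42101", "88101"],
    "sample_duration": ["1 HOUR", "8 HOUR", "1 HOUR", "24 HOUR"],
    "validity_indicator": ["Y", "Y", "N", "Y"],
    "observation_percent": [100.0, 100.0, 100.0, 30.0],
})
kept = filter_daily_rows(rows)
```

Rows whose parameter code is missing from the table are dropped, because the mapped duration is then empty and never equals the actual one.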

### AirQualityIndex (AQI)

See https://www.eea.europa.eu/themes/air/air-quality-index -> under "About the European Air Quality Index" in the map legend
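
An index of this kind assigns each concentration to a band. The following is only a generic sketch of such a lookup: the breakpoints and labels are illustrative placeholders and NOT the thresholds of the index linked above, which would have to be taken from that page for each pollutant. The function name `classify` is hypothetical.

```python
from bisect import bisect_right

# Illustrative placeholder values only, not official index thresholds
EXAMPLE_BREAKPOINTS = [10.0, 20.0, 30.0]
EXAMPLE_LABELS = ["band 1", "band 2", "band 3", "band 4"]

def classify(value: float, breakpoints: list, labels: list) -> str:
    """Return the label of the band a concentration falls into."""
    if len(labels) != len(breakpoints) + 1:
        raise ValueError("need exactly one more label than breakpoints")
    # A value equal to a breakpoint is assigned to the upper band
    return labels[bisect_right(breakpoints, value)]

band = classify(12.5, EXAMPLE_BREAKPOINTS, EXAMPLE_LABELS)
```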

In [49]:
#IDs for location
id={"la_n":1103,"pico_rivera":1602,"compton":1302}
parameter_pollutant={"CO":42101,"SO2": 42401,"NO2": 42602 ,"O3" :44201,"PM10" : 81102,"PM2.5":88101}
#dates
bdate="20190101"
edate="20191231"
#Up to 5 parameters may be requested, separated by commas.

def get_pollution_data(param: list, year: int, site: int, filename: str):
    """Get the daily data for one of the sites in LA and store it as a JSON file

    Args:
        param (list): Parameter codes of the pollutants (up to 5 per request)
        year (int): Year
        site (int): use the site code with 4 digits
        filename (str): the first part of the filename
    """
    # Join the parameter codes into the comma-separated string the API expects
    param_str = ",".join(str(parameter) for parameter in param)
    state = "06"
    county = "037"
    bdate = str(year) + "0101"
    edate = str(year) + "1231"
    parameters_req = {"email": email, "key": api_key, "param": param_str, "bdate": bdate, "edate": edate,
                      "state": state, "county": county, "site": site}
    request_la_n = requests.get("https://aqs.epa.gov/data/api/dailyData/bySite", params=parameters_req)
    # Report the status of this request, not of the earlier availability check
    print(f"API Request sucessfull: {request_la_n.status_code}")
    with open(f"Pollution_Data/{filename}_{year}_{site}.json", "w") as file:
        json.dump(request_la_n.json(), file, indent=4)
    # Pause between consecutive requests
    sleep(5)
            
    
get_pollution_data([42101,42401,42602],2019,id["la_n"],"first3")
get_pollution_data([44201,81102,88101],2019,id["la_n"],"last3")

get_pollution_data([42101,42401,42602],2018,id["la_n"],"first3")
get_pollution_data([44201,81102,88101],2018,id["la_n"],"last3")

get_pollution_data([42101,42401,42602],2017,id["la_n"],"first3")
get_pollution_data([44201,81102,88101],2017,id["la_n"],"last3")

get_pollution_data([42101,42401,42602],2016,id["la_n"],"first3")
get_pollution_data([44201,81102,88101],2016,id["la_n"],"last3")

API Request sucessfull: 200
API Request sucessfull: 200
API Request sucessfull: 200
API Request sucessfull: 200
API Request sucessfull: 200
API Request sucessfull: 200
API Request sucessfull: 200
API Request sucessfull: 200
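
The eight calls above follow a regular pattern (two parameter groups for each of four years). As a sketch, the same list of requests can be generated instead of written out; the helper name `build_request_plan` is hypothetical, and the download loop is left as a comment because it needs the notebook's function and API access.

```python
from itertools import product

def build_request_plan(years: list, parameter_groups: dict) -> list:
    """Return one (codes, year, label) tuple per request."""
    return [
        (codes, year, label)
        for year, (label, codes) in product(years, parameter_groups.items())
    ]

parameter_groups = {
    "first3": [42101, 42401, 42602],
    "last3": [44201, 81102, 88101],
}
plan = build_request_plan([2019, 2018, 2017, 2016], parameter_groups)

# With the notebook's function and station dictionary in scope:
# for codes, year, label in plan:
#     get_pollution_data(codes, year, id["la_n"], label)
```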


In [55]:
# Concatenate all JSON pollution files into one large DataFrame


filenames=["Pollution_Data/first3_2019_1103.json","Pollution_Data/first3_2017_1103.json","Pollution_Data/first3_2018_1103.json",
                      "Pollution_Data/last3_2019_1103.json","Pollution_Data/last3_2017_1103.json","Pollution_Data/last3_2018_1103.json",
                      "Pollution_Data/first3_2016_1103.json","Pollution_Data/last3_2016_1103.json"]
df_to_concat=[]
for filename in filenames:
    with open(filename) as file:
        data=json.load(file)
    df=pd.json_normalize(data["Data"])
    df.drop(labels=["state_code","county_code","poc","latitude","longitude","datum","parameter","sample_duration_code",
                    "pollutant_standard","units_of_measure","event_type","method_code","method","local_site_name",
                    "site_address","state", "county","city","cbsa_code","cbsa","date_of_last_change",
                    "aqi"]
            ,axis=1,inplace=True)
    print(filename)
    df.info()
    # Keep only valid days with sufficient data coverage
    df = df[df["validity_indicator"]=="Y"]
    df = df[df["observation_percent"]>40]
    # Keep only the sample durations chosen for each pollutant group
    if "first3" in filename:
        df = df[(df["sample_duration"]=="1 HOUR") ]
    elif "last3" in filename:
        df = df[(df["sample_duration"]=="24 HOUR") | (df["sample_duration"]=="8-HR RUN AVG BEGIN HOUR")]
    # Drop on a new frame instead of in place to avoid modifying a filtered view
    df = df.drop(columns=["site_number","validity_indicator","observation_percent","observation_count","sample_duration"])
    df.columns = ["parameter_code","date","mean","max","max_hour"]
    df["date"] = pd.to_datetime(df["date"],format="%Y-%m-%d")
    df.info()
    df_to_concat.append(df)
pollution_raw=pd.concat(df_to_concat)
pollution_raw.info()
pollution_raw.head()


Pollution_Data/first3_2019_1103.json
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3993 entries, 0 to 3992
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   site_number          3993 non-null   object 
 1   parameter_code       3993 non-null   object 
 2   sample_duration      3993 non-null   object 
 3   date_local           3993 non-null   object 
 4   observation_count    3993 non-null   int64  
 5   observation_percent  3993 non-null   float64
 6   validity_indicator   3993 non-null   object 
 7   arithmetic_mean      3993 non-null   float64
 8   first_max_value      3993 non-null   float64
 9   first_max_hour       3993 non-null   int64  
dtypes: float64(3), int64(2), object(5)
memory usage: 312.1+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2369 entries, 0 to 3992
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          -------------

  parameter_code        date       mean   max  max_hour
0          42602  2019-01-01  22.766667  34.9        19
1          42602  2019-01-02  26.722727  38.8        19
2          42602  2019-01-03       30.7  40.9        18
3          42602  2019-01-04    35.6125  46.7        17
4          42602  2019-01-05    26.2125  39.3         8


In [63]:
parameter_pollutant={"CO":42101,"SO2": 42401,"NO2": 42602 ,"O3" :44201,"PM10" : 81102,"PM2.5":88101}
# Select the PM10 rows (parameter codes are stored as strings)
df_pm10=pollution_raw[pollution_raw["parameter_code"]==str(parameter_pollutant["PM10"])]
df_pm10.info()
# Why are there so few PM10 data points?

<class 'pandas.core.frame.DataFrame'>
Int64Index: 226 entries, 3967 to 3982
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   parameter_code  226 non-null    object        
 1   date            226 non-null    datetime64[ns]
 2   mean            226 non-null    float64       
 3   max             226 non-null    float64       
 4   max_hour        226 non-null    int64         
dtypes: datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 10.6+ KB
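
One way to look into the low PM10 count is to tabulate how many days remain per pollutant and year after filtering. This is a sketch that assumes a frame with the `parameter_code` and `date` columns of `pollution_raw`; the helper name `count_days_per_year` is hypothetical and the toy rows are made up.

```python
import pandas as pd

def count_days_per_year(df: pd.DataFrame) -> pd.DataFrame:
    """Count the rows per parameter code (index) and calendar year (columns)."""
    years = df["date"].dt.year.rename("year")
    return df.groupby(["parameter_code", years]).size().unstack(fill_value=0)

# Made-up rows standing in for pollution_raw
toy = pd.DataFrame({
    "parameter_code": ["81102", "81102", "88101", "88101", "88101"],
    "date": pd.to_datetime(["2018-01-01", "2019-01-07", "2019-01-01",
                            "2019-01-02", "2019-01-03"]),
})
counts = count_days_per_year(toy)
```

Applied to `pollution_raw`, such a table would show whether PM10 is sparse in every year or only in some, which narrows down whether the filters or the source data are responsible.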
