# <span style="font-size:1.2em;">Creating Datasets for Four Major Air Pollutants, 2000-2021</span>
Author: Angela Kim

---

# Contents

<span style="font-size:1.2em;">

- <a href="#Overview">Overview</a>
    
- <a href="#Ozone">Ozone</a>
    
- <a href="#Carbon Monoxide">Carbon Monoxide</a>
    
- <a href="#Nitrogen Dioxide">Nitrogen Dioxide</a>

- <a href="#Sulfur Dioxide">Sulfur Dioxide</a>

- <a href="#Sources">Sources</a>

</span>

# <a id="Overview">Overview</a>

For this project, I originally planned to use a dataset from [Kaggle](https://www.kaggle.com/sogun3/uspollution) on US air quality from 2000-2016 and then append more recent data for 2017-2021. However, as I explored the data, I grew frustrated with how poorly it had been put together. I did my best to clean it while maintaining its integrity, but ultimately decided that starting at the source would be less frustrating, less time-consuming, and much cleaner.

The [US EPA](https://www.epa.gov/) has open-source pre-generated data files in `.csv` format on the four major air pollutants I'm interested in for this project: ozone ($O_{3}$), carbon monoxide ($CO$), nitrogen dioxide ($NO_{2}$), and sulfur dioxide ($SO_{2}$).

I downloaded a total of 88 `.csv` [daily summary data files](https://aqs.epa.gov/aqsweb/airdata/download_files.html#Daily) for the years 2000-2021 and concatenated them into their respective datasets.
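Because the files follow a regular naming pattern (one file per pollutant per year), the read-and-concatenate step can also be written as a small helper. This is a minimal sketch, not the code used in the cells below: `load_daily` is a hypothetical name, and it assumes the same folder layout and filenames that appear later in this notebook.

```python
import pandas as pd


def load_daily(folder, code, years=range(2000, 2022)):
    """Read one CSV per year and stack them into a single DataFrame."""
    # Filenames follow the pattern daily_<parameter code>_<year>.csv
    frames = [pd.read_csv(f"{folder}/daily_{code}_{year}.csv") for year in years]
    return pd.concat(frames, ignore_index=True)


# Example call, matching the Ozone paths used in this notebook:
# O3 = load_daily("data/dailyO3", "44201")
```

The explicit year-by-year cells below do the same work; the helper only trades the named per-year variables for a loop.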

Finally, I exported all four datasets as `.csv` files.

The resulting `.csv` files were too large to upload to GitHub, but you can download all the files I used from the EPA site and run this notebook to get the compiled datasets.
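If file size is the obstacle, one option is to write compressed output, since both `to_csv` and `read_csv` accept a `compression` argument. The sketch below uses a small made-up DataFrame as a stand-in for a compiled dataset; whether the real files would shrink enough to upload has not been checked here.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for one of the compiled datasets (values are made up).
demo = pd.DataFrame({"site": ["A", "B", "A"], "mean": [0.031, 0.027, 0.035]})

out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "demo.csv.gz")

# Write a gzip-compressed CSV.
demo.to_csv(out_path, index=False, compression="gzip")

# Read it back; compression is inferred from the .gz extension by default.
restored = pd.read_csv(out_path)
print(restored.shape)
```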

**Please consider:**
1. Take note of where you stored the downloaded data and adjust the file paths in the code accordingly before running the notebook.
2. Make sure to uncomment the export lines if you want the datasets as `.csv`.
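
For the first point, it can help to keep the data location in one variable and check for missing files before loading anything. This is a sketch under the layout assumed by the cells below (`data/daily<pollutant>/daily_<code>_<year>.csv`); `DATA_DIR`, `FOLDERS`, and `find_missing_files` are hypothetical names, not part of the original notebook.

```python
from pathlib import Path

# Change this to wherever the downloaded EPA files are stored.
DATA_DIR = Path("data")

# Folder name -> parameter code, taken from the filenames in this notebook.
FOLDERS = {"dailyO3": "44201", "dailyCO": "42101",
           "dailyNO2": "42602", "dailySO2": "42401"}


def find_missing_files(data_dir, years=range(2000, 2022)):
    """Return the expected files that do not exist under data_dir."""
    missing = []
    for folder, code in FOLDERS.items():
        for year in years:
            path = Path(data_dir) / folder / f"daily_{code}_{year}.csv"
            if not path.is_file():
                missing.append(path)
    return missing


missing = find_missing_files(DATA_DIR)
print(f"{len(missing)} of 88 expected files are missing")
```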

In [1]:
# Import libraries
import pandas as pd
import numpy as np

# <a id="Ozone">Ozone</a>

In [2]:
# Import Ozone datasets
O3_2000 = pd.read_csv('data/dailyO3/daily_44201_2000.csv')
O3_2001 = pd.read_csv('data/dailyO3/daily_44201_2001.csv')
O3_2002 = pd.read_csv('data/dailyO3/daily_44201_2002.csv')
O3_2003 = pd.read_csv('data/dailyO3/daily_44201_2003.csv')
O3_2004 = pd.read_csv('data/dailyO3/daily_44201_2004.csv')
O3_2005 = pd.read_csv('data/dailyO3/daily_44201_2005.csv')
O3_2006 = pd.read_csv('data/dailyO3/daily_44201_2006.csv')
O3_2007 = pd.read_csv('data/dailyO3/daily_44201_2007.csv')
O3_2008 = pd.read_csv('data/dailyO3/daily_44201_2008.csv')
O3_2009 = pd.read_csv('data/dailyO3/daily_44201_2009.csv')
O3_2010 = pd.read_csv('data/dailyO3/daily_44201_2010.csv')
O3_2011 = pd.read_csv('data/dailyO3/daily_44201_2011.csv')
O3_2012 = pd.read_csv('data/dailyO3/daily_44201_2012.csv')
O3_2013 = pd.read_csv('data/dailyO3/daily_44201_2013.csv')
O3_2014 = pd.read_csv('data/dailyO3/daily_44201_2014.csv')
O3_2015 = pd.read_csv('data/dailyO3/daily_44201_2015.csv')
O3_2016 = pd.read_csv('data/dailyO3/daily_44201_2016.csv')
O3_2017 = pd.read_csv('data/dailyO3/daily_44201_2017.csv')
O3_2018 = pd.read_csv('data/dailyO3/daily_44201_2018.csv')
O3_2019 = pd.read_csv('data/dailyO3/daily_44201_2019.csv')
O3_2020 = pd.read_csv('data/dailyO3/daily_44201_2020.csv')
O3_2021 = pd.read_csv('data/dailyO3/daily_44201_2021.csv')


# Concatenate datasets
O3_all = [O3_2000, O3_2001, O3_2002, O3_2003, O3_2004, O3_2005, O3_2006, O3_2007, O3_2008, O3_2009, O3_2010, 
          O3_2011, O3_2012, O3_2013, O3_2014, O3_2015, O3_2016, O3_2017, O3_2018, O3_2019, O3_2020, O3_2021]

O3 = pd.concat(O3_all, ignore_index=True)


# Export to csv
# O3.to_csv('O3.csv', index=False)
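
After concatenating, a quick consistency check can catch a mistyped path or a change in columns between years. The sketch below is a suggestion rather than part of the original workflow: `check_concat` is a hypothetical helper, and the demo uses small stand-in frames instead of the real data.

```python
import pandas as pd


def check_concat(frames, combined):
    """Raise AssertionError if the combined frame is inconsistent with its parts."""
    # Row count of the result should equal the sum of the parts.
    expected_rows = sum(len(f) for f in frames)
    assert len(combined) == expected_rows, "row count mismatch"
    # Every part should share the same columns, in the same order.
    first_cols = list(frames[0].columns)
    for i, f in enumerate(frames[1:], start=1):
        assert list(f.columns) == first_cols, f"columns differ in frame {i}"
    return True


# Stand-in data; with the real data this would be check_concat(O3_all, O3).
parts = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3]})]
whole = pd.concat(parts, ignore_index=True)
print(check_concat(parts, whole))
```

The same call works for the other three pollutants by passing their `_all` list and concatenated DataFrame.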



# <a id="Carbon Monoxide">Carbon Monoxide</a>

In [3]:
# Import Carbon Monoxide datasets
CO_2000 = pd.read_csv('data/dailyCO/daily_42101_2000.csv')
CO_2001 = pd.read_csv('data/dailyCO/daily_42101_2001.csv')
CO_2002 = pd.read_csv('data/dailyCO/daily_42101_2002.csv')
CO_2003 = pd.read_csv('data/dailyCO/daily_42101_2003.csv')
CO_2004 = pd.read_csv('data/dailyCO/daily_42101_2004.csv')
CO_2005 = pd.read_csv('data/dailyCO/daily_42101_2005.csv')
CO_2006 = pd.read_csv('data/dailyCO/daily_42101_2006.csv')
CO_2007 = pd.read_csv('data/dailyCO/daily_42101_2007.csv')
CO_2008 = pd.read_csv('data/dailyCO/daily_42101_2008.csv')
CO_2009 = pd.read_csv('data/dailyCO/daily_42101_2009.csv')
CO_2010 = pd.read_csv('data/dailyCO/daily_42101_2010.csv')
CO_2011 = pd.read_csv('data/dailyCO/daily_42101_2011.csv')
CO_2012 = pd.read_csv('data/dailyCO/daily_42101_2012.csv')
CO_2013 = pd.read_csv('data/dailyCO/daily_42101_2013.csv')
CO_2014 = pd.read_csv('data/dailyCO/daily_42101_2014.csv')
CO_2015 = pd.read_csv('data/dailyCO/daily_42101_2015.csv')
CO_2016 = pd.read_csv('data/dailyCO/daily_42101_2016.csv')
CO_2017 = pd.read_csv('data/dailyCO/daily_42101_2017.csv')
CO_2018 = pd.read_csv('data/dailyCO/daily_42101_2018.csv')
CO_2019 = pd.read_csv('data/dailyCO/daily_42101_2019.csv')
CO_2020 = pd.read_csv('data/dailyCO/daily_42101_2020.csv')
CO_2021 = pd.read_csv('data/dailyCO/daily_42101_2021.csv')


# Concatenate datasets
CO_all = [CO_2000, CO_2001, CO_2002, CO_2003, CO_2004, CO_2005, CO_2006, CO_2007, CO_2008, CO_2009, CO_2010,
          CO_2011, CO_2012, CO_2013, CO_2014, CO_2015, CO_2016, CO_2017, CO_2018, CO_2019, CO_2020, CO_2021]

CO = pd.concat(CO_all, ignore_index=True)


# Export to csv
# CO.to_csv('CO.csv', index=False)

# <a id="Nitrogen Dioxide">Nitrogen Dioxide</a>

In [5]:
# Import Nitrogen Dioxide datasets
NO2_2000 = pd.read_csv('data/dailyNO2/daily_42602_2000.csv')
NO2_2001 = pd.read_csv('data/dailyNO2/daily_42602_2001.csv')
NO2_2002 = pd.read_csv('data/dailyNO2/daily_42602_2002.csv')
NO2_2003 = pd.read_csv('data/dailyNO2/daily_42602_2003.csv')
NO2_2004 = pd.read_csv('data/dailyNO2/daily_42602_2004.csv')
NO2_2005 = pd.read_csv('data/dailyNO2/daily_42602_2005.csv')
NO2_2006 = pd.read_csv('data/dailyNO2/daily_42602_2006.csv')
NO2_2007 = pd.read_csv('data/dailyNO2/daily_42602_2007.csv')
NO2_2008 = pd.read_csv('data/dailyNO2/daily_42602_2008.csv')
NO2_2009 = pd.read_csv('data/dailyNO2/daily_42602_2009.csv')
NO2_2010 = pd.read_csv('data/dailyNO2/daily_42602_2010.csv')
NO2_2011 = pd.read_csv('data/dailyNO2/daily_42602_2011.csv')
NO2_2012 = pd.read_csv('data/dailyNO2/daily_42602_2012.csv')
NO2_2013 = pd.read_csv('data/dailyNO2/daily_42602_2013.csv')
NO2_2014 = pd.read_csv('data/dailyNO2/daily_42602_2014.csv')
NO2_2015 = pd.read_csv('data/dailyNO2/daily_42602_2015.csv')
NO2_2016 = pd.read_csv('data/dailyNO2/daily_42602_2016.csv')
NO2_2017 = pd.read_csv('data/dailyNO2/daily_42602_2017.csv')
NO2_2018 = pd.read_csv('data/dailyNO2/daily_42602_2018.csv')
NO2_2019 = pd.read_csv('data/dailyNO2/daily_42602_2019.csv')
NO2_2020 = pd.read_csv('data/dailyNO2/daily_42602_2020.csv')
NO2_2021 = pd.read_csv('data/dailyNO2/daily_42602_2021.csv')


# Concatenate datasets
NO2_all = [NO2_2000, NO2_2001, NO2_2002, NO2_2003, NO2_2004, NO2_2005, NO2_2006, NO2_2007, NO2_2008, NO2_2009, 
           NO2_2010, NO2_2011, NO2_2012, NO2_2013, NO2_2014, NO2_2015, NO2_2016, NO2_2017, NO2_2018, NO2_2019, 
           NO2_2020, NO2_2021]

NO2 = pd.concat(NO2_all, ignore_index=True)


# Export to csv
# NO2.to_csv('NO2.csv', index=False)

# <a id="Sulfur Dioxide">Sulfur Dioxide</a>

In [4]:
# Import Sulfur Dioxide datasets
SO2_2000 = pd.read_csv('data/dailySO2/daily_42401_2000.csv')
SO2_2001 = pd.read_csv('data/dailySO2/daily_42401_2001.csv')
SO2_2002 = pd.read_csv('data/dailySO2/daily_42401_2002.csv')
SO2_2003 = pd.read_csv('data/dailySO2/daily_42401_2003.csv')
SO2_2004 = pd.read_csv('data/dailySO2/daily_42401_2004.csv')
SO2_2005 = pd.read_csv('data/dailySO2/daily_42401_2005.csv')
SO2_2006 = pd.read_csv('data/dailySO2/daily_42401_2006.csv')
SO2_2007 = pd.read_csv('data/dailySO2/daily_42401_2007.csv')
SO2_2008 = pd.read_csv('data/dailySO2/daily_42401_2008.csv')
SO2_2009 = pd.read_csv('data/dailySO2/daily_42401_2009.csv')
SO2_2010 = pd.read_csv('data/dailySO2/daily_42401_2010.csv')
SO2_2011 = pd.read_csv('data/dailySO2/daily_42401_2011.csv')
SO2_2012 = pd.read_csv('data/dailySO2/daily_42401_2012.csv')
SO2_2013 = pd.read_csv('data/dailySO2/daily_42401_2013.csv')
SO2_2014 = pd.read_csv('data/dailySO2/daily_42401_2014.csv')
SO2_2015 = pd.read_csv('data/dailySO2/daily_42401_2015.csv')
SO2_2016 = pd.read_csv('data/dailySO2/daily_42401_2016.csv')
SO2_2017 = pd.read_csv('data/dailySO2/daily_42401_2017.csv')
SO2_2018 = pd.read_csv('data/dailySO2/daily_42401_2018.csv')
SO2_2019 = pd.read_csv('data/dailySO2/daily_42401_2019.csv')
SO2_2020 = pd.read_csv('data/dailySO2/daily_42401_2020.csv')
SO2_2021 = pd.read_csv('data/dailySO2/daily_42401_2021.csv')


# Concatenate datasets
SO2_all = [SO2_2000, SO2_2001, SO2_2002, SO2_2003, SO2_2004, SO2_2005, SO2_2006, SO2_2007, SO2_2008, SO2_2009, 
           SO2_2010, SO2_2011, SO2_2012, SO2_2013, SO2_2014, SO2_2015, SO2_2016, SO2_2017, SO2_2018, SO2_2019, 
           SO2_2020, SO2_2021]

SO2 = pd.concat(SO2_all, ignore_index=True)


# Export to csv
# SO2.to_csv('SO2.csv', index=False)
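
Once all four datasets are built, it may be worth confirming that each one spans the full 2000-2021 range. The sketch below assumes the EPA daily files contain a `Date Local` column; that column name is an assumption about the file schema, so adjust it if yours differs. `year_range` is a hypothetical helper and the demo frame is made up.

```python
import pandas as pd


def year_range(df, date_col="Date Local"):
    """Return the (earliest, latest) year found in the date column."""
    years = pd.to_datetime(df[date_col]).dt.year
    return int(years.min()), int(years.max())


# Stand-in data; with the real data this would be e.g. year_range(SO2).
demo = pd.DataFrame({"Date Local": ["2000-01-01", "2010-06-15", "2021-12-31"]})
print(year_range(demo))
```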

# <a id="Sources">Sources</a>

- [EPA AirData Daily Summary Data](https://aqs.epa.gov/aqsweb/airdata/download_files.html#Daily)