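This notebook prepares the NYC Air Quality dataset for analysis: it keeps the UHF42 neighborhoods, selects five pollutants, and exports the summer measurements.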

The raw dataset can be obtained at https://data.cityofnewyork.us/Environment/Air-Quality/c3uy-2p5r/explore.

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('../components/Air_Quality_20240417.csv')
df.head()

Unnamed: 0,Unique ID,Indicator ID,Name,Measure,Measure Info,Geo Type Name,Geo Join ID,Geo Place Name,Time Period,Start_Date,Data Value,Message
0,130823,648,Asthma emergency department visits due to PM2.5,Estimated annual rate (under age 18),"per 100,000 children",UHF42,409.0,Southeast Queens,2005-2007,01/01/2005,73.8,
1,151637,643,Annual vehicle miles traveled,Million miles,per square mile,UHF42,208.0,Canarsie - Flatlands,2005,01/01/2005,35.7,
2,151647,643,Annual vehicle miles traveled,Million miles,per square mile,UHF42,307.0,Gramercy Park - Murray Hill,2005,01/01/2005,185.6,
3,154605,643,Annual vehicle miles traveled,Million miles,per square mile,CD,404.0,Elmhurst and Corona (CD4),2005,01/01/2005,78.0,
4,154586,643,Annual vehicle miles traveled,Million miles,per square mile,CD,303.0,Bedford Stuyvesant (CD3),2005,01/01/2005,69.4,


In [2]:
# select only the desired rows and columns

# 'Geo Type Name' identifies which geographic scheme a row uses
# (e.g. UHF42 or CD); we keep only the UHF42 neighborhoods
df = df[df['Geo Type Name'] == 'UHF42']
df = df[['Name', 'Geo Join ID', 'Time Period', 'Data Value']]
df.head()

Unnamed: 0,Name,Geo Join ID,Time Period,Data Value
0,Asthma emergency department visits due to PM2.5,409.0,2005-2007,73.8
1,Annual vehicle miles traveled,208.0,2005,35.7
2,Annual vehicle miles traveled,307.0,2005,185.6
6,Annual vehicle miles traveled,107.0,2005,42.7
7,Asthma emergency department visits due to PM2.5,306.0,2005-2007,139.1


In [3]:
# rename columns

df.rename(columns={'Geo Join ID': 'Neighborhood', 'Data Value': 'Value'}, inplace=True)
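
The neighborhood codes load as floats (409.0, 208.0, and so on). If integer codes are preferred downstream, one option is a nullable integer cast. The following is a minimal sketch on toy data; the `codes` name and the missing value are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the 'Neighborhood' column; None simulates a missing code.
codes = pd.Series([409.0, 208.0, None], name='Neighborhood')

# 'Int64' (capital I) is pandas' nullable integer dtype, so a missing value
# becomes <NA> instead of forcing the column back to float.
codes = codes.astype('Int64')
print(codes.tolist())
```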

In [4]:
# filter the metric column

metrics = df['Name'].unique()
print(metrics)

pollutants = ['Nitrogen dioxide (NO2)', 'Ozone (O3)',
              'Fine particles (PM 2.5)', 'Outdoor Air Toxics - Benzene',
              'Outdoor Air Toxics - Formaldehyde']
df = df[df['Name'].isin(pollutants)]

['Asthma emergency department visits due to PM2.5'
 'Annual vehicle miles traveled'
 'Respiratory hospitalizations due to PM2.5 (age 20+)'
 'Asthma hospitalizations due to Ozone'
 'Outdoor Air Toxics - Formaldehyde'
 'Cardiac and respiratory deaths due to Ozone'
 'Asthma emergency departments visits due to Ozone'
 'Cardiovascular hospitalizations due to PM2.5 (age 40+)'
 'Deaths due to PM2.5' 'Outdoor Air Toxics - Benzene'
 'Annual vehicle miles traveled (trucks)'
 'Annual vehicle miles traveled (cars)' 'Fine particles (PM 2.5)'
 'Nitrogen dioxide (NO2)' 'Ozone (O3)'
 'Boiler Emissions- Total SO2 Emissions'
 'Boiler Emissions- Total NOx Emissions'
 'Boiler Emissions- Total PM2.5 Emissions']
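
Because `isin` silently ignores names that do not occur in the column, a typo in the pollutant list would drop that pollutant without any error. A small check like the one sketched below can catch this. The `toy` frame and the `wanted` list are hypothetical stand-ins for `df` and `pollutants`.

```python
import pandas as pd

# Toy stand-in for the 'Name' column of the real dataset.
toy = pd.DataFrame({'Name': ['Ozone (O3)',
                             'Nitrogen dioxide (NO2)',
                             'Deaths due to PM2.5']})

# Names we intend to keep; the last one is deliberately absent from toy.
wanted = ['Nitrogen dioxide (NO2)', 'Ozone (O3)', 'Fine particles (PM 2.5)']

# Requested names that never appear in the data.
missing = sorted(set(wanted) - set(toy['Name'].unique()))
print(missing)
```

An empty `missing` list means every requested name was found.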


In [5]:
# rename pollutants

pollutants_new = ['Nitrogen Dioxide', 'Ozone', 'Fine Particles', 'Benzene', 'Formaldehyde']

df.replace(pollutants, pollutants_new, inplace=True)
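
Note that `df.replace` with two lists is applied to every column of the frame. An alternative, sketched here on toy data, is to pass an explicit mapping and restrict it to the `Name` column so that no other column can be touched by accident. The `toy` frame, the `rename_map` name, and the contrived value in `Time Period` are assumptions made for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({
    'Name': ['Ozone (O3)', 'Fine particles (PM 2.5)'],
    # Contrived: a non-Name column that happens to contain a pollutant label.
    'Time Period': ['Summer 2009', 'Ozone (O3)'],
})

# Explicit old -> new mapping keeps each pair visibly together.
rename_map = {
    'Ozone (O3)': 'Ozone',
    'Fine particles (PM 2.5)': 'Fine Particles',
}

# Only the 'Name' column is rewritten; 'Time Period' is left as-is.
toy['Name'] = toy['Name'].replace(rename_map)
print(toy)
```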

In [11]:
# filter for only summer measurements

df['Time Period'].unique()

# .copy() makes summer an independent DataFrame, so the assignment below
# does not raise SettingWithCopyWarning
summer = df[df['Time Period'].str[:6] == "Summer"].copy()

# drop the "Summer" part, keeping only the four-digit year
summer['Time Period'] = summer['Time Period'].str[-4:]

summer.head()


Unnamed: 0,Name,Neighborhood,Time Period,Value
1941,Fine Particles,209.0,2009,10.6
1942,Nitrogen Dioxide,409.0,2009,16.1
1943,Nitrogen Dioxide,410.0,2009,8.6
1944,Fine Particles,104.0,2009,10.4
1945,Nitrogen Dioxide,211.0,2009,23.1
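
The slicing above relies on summer labels having the form "Summer" followed by a four-digit year. A stricter variant, sketched here under that assumption, uses a regular expression so that non-matching labels are dropped rather than sliced blindly. The `toy` frame and the `summer_toy` name are hypothetical.

```python
import pandas as pd

# Toy 'Time Period' values mixing summer labels with other formats.
toy = pd.DataFrame({'Time Period': ['Summer 2009', '2005-2007',
                                    'Summer 2010', '2005']})

# str.extract yields NaN wherever the pattern does not match.
year = toy['Time Period'].str.extract(r'^Summer (\d{4})$', expand=False)

# Keep only matching rows and store the year as an integer column.
summer_toy = (toy.assign(Year=year)
                 .dropna(subset=['Year'])
                 .astype({'Year': int}))
print(summer_toy)
```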


In [12]:
# rename the 'Time Period' column to 'Year'; reassigning the result
# avoids modifying a slice in place
summer = summer.rename(columns={'Time Period': 'Year'})
summer.head()


Unnamed: 0,Name,Neighborhood,Year,Value
1941,Fine Particles,209.0,2009,10.6
1942,Nitrogen Dioxide,409.0,2009,16.1
1943,Nitrogen Dioxide,410.0,2009,8.6
1944,Fine Particles,104.0,2009,10.4
1945,Nitrogen Dioxide,211.0,2009,23.1


In [13]:
summer.to_csv('../components/summer_data.csv')
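
By default `to_csv` also writes the row index, which shows up as an unnamed first column when the file is read back. If that column is not wanted, `index=False` omits it. The sketch below demonstrates this with an in-memory buffer instead of a file; the `toy` frame is a hypothetical stand-in for `summer`.

```python
import io
import pandas as pd

# Toy stand-in for the summer frame, with a non-default index.
toy = pd.DataFrame({'Name': ['Fine Particles'], 'Neighborhood': [209.0],
                    'Year': ['2009'], 'Value': [10.6]}, index=[1941])

# Write without the index, then read the CSV text back.
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
back = pd.read_csv(buf)
print(back.columns.tolist())
```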