# INTRODUCTION<br/>
Recent epidemiological studies have identified (i) altitude and (ii) specific airborne pollutants as independent risk factors for suicide and/or suicidal behavior. Specifically, nitrogen dioxide **(NO<sub>2</sub>)**, which is formed primarily through the burning of fossil fuels, was found to be associated with a 20% increase in the odds of completed suicide.<br/>

**Salt Lake City (SLC)**, Utah, is an urban agglomeration situated at approximately 4,500 ft above sea level. Notably, SLC was recently identified as having the nation's 7th worst air quality. SLC is bordered by the Oquirrh Mountains to the west and the Wasatch Range to the east, with the two ranges meeting at the south. During the colder fall and winter months, an interplay of natural meteorological events forms a layer of warm air that traps colder air and its pollutants within the SLC valley. Because of this inversion effect, SLC's air quality deteriorates significantly during fall and winter.<br/>

In contrast, **Honolulu (HON)**, Hawaii, is a similarly sized urban area (HON vs. SLC population: ~300,000 vs. ~200,000) with an overall air quality that is ranked amongst the nation's best. HON is situated at sea level and thus **does not** experience seasonal inversion effects.<br/>



The United States (US) Clean Air Act (est. ~1970) requires the US Environmental Protection Agency (EPA) to set air quality standards for six common air pollutants, including NO<sub>2</sub>. The National Ambient Air Quality Standard (NAAQS) for NO<sub>2</sub> is currently set to a daily one-hour maximum of 100 parts per billion (ppb), and the national average NO<sub>2</sub> concentration has decreased substantially over the past couple of decades. The EPA releases outdoor air quality data for several airborne pollutants across the US, including the **daily one-hour maximum** NO<sub>2</sub> measures for SLC and HON.<br/>


Given the implication of NO<sub>2</sub> in suicidal behavior in at-risk clinical populations, a team of psychiatrists at the University of Utah is keen to investigate the daily one-hour maximum NO<sub>2</sub> levels for SLC and HON over a two-decade timespan, including the years 1999, 2004, 2009, 2014, 2018, and 2019. They are also interested in seasonal effects and the potential increase of NO<sub>2</sub> levels during the inversion months (September through February) within SLC from 2018 to 2019.

**_Specific research questions include:_**

1. Is there a significant decrease in SLC atmospheric NO<sub>2</sub> concentration from 1999 to 2019?
2. Is there a significant decrease in HON atmospheric NO<sub>2</sub> concentration from 1999 to 2019?
3. Are SLC atmospheric NO<sub>2</sub> concentrations significantly different from corresponding HON measures between 1999 and 2019?
4. Is SLC atmospheric NO<sub>2</sub> concentration higher during fall and winter months as compared to spring and summer months? 
5. Is HON atmospheric NO<sub>2</sub> concentration higher during fall and winter months as compared to spring and summer months? 
<br/>

Data source: <https://www.epa.gov/outdoor-air-quality-data/download-daily-data>

# HYPOTHESES<BR/>

***_Hypothesis 1_***
___
**Ho:** there is no significant decrease in SLC atmospheric NO<sub>2</sub> concentrations from 1999 to 2019<br/>

**Ha:** there is a significant decrease in SLC atmospheric NO<sub>2</sub> concentrations from 1999 to 2019<br/><br/>


***_Hypothesis 2_***
___
**Ho:** there is no significant decrease in HON atmospheric NO<sub>2</sub> concentrations from 1999 to 2019<br/>

**Ha:** there is a significant decrease in HON atmospheric NO<sub>2</sub> concentrations from 1999 to 2019<br/><br/>


***_Hypothesis 3_***
___
**Ho:** there is no significant difference for SLC and HON atmospheric NO<sub>2</sub> concentrations between 1999 and 2019<br/>

**Ha:** there is a significant difference for SLC and HON atmospheric NO<sub>2</sub> concentrations between 1999 and 2019<br/><br/>


***_Hypothesis 4_***
___
**Ho:** there is no significant difference in SLC atmospheric NO<sub>2</sub> concentration in fall and winter months compared to spring and summer months<br/>

**Ha:** there is a significant difference in SLC atmospheric NO<sub>2</sub> concentration in fall and winter months compared to spring and summer months<br/><br/>


***_Hypothesis 5_***
___
**Ho:** there is no significant difference in HON atmospheric NO<sub>2</sub> concentration in fall and winter months compared to spring and summer months<br/>

**Ha:** there is a significant difference in HON atmospheric NO<sub>2</sub> concentration in fall and winter months compared to spring and summer months<br/><br/>



# DATA<BR/>

All data were obtained from the publicly accessible US EPA repository for outdoor air quality data (link to download: <https://www.epa.gov/outdoor-air-quality-data/download-daily-data>), which provides a tool that queries daily air quality summaries for a given **pollutant**, **year**, **geographic US location**, and **monitor site**. For most locations, data can be downloaded from multiple monitor sites. The two monitor sites deemed most representative for urban SLC and urban HON are the **Hawthorne** and **Kapolei** stations, respectively.<br/>

Daily one-hour maximum SLC and HON NO<sub>2</sub> data were then downloaded individually as comma-separated value (CSV) files for the years 1999, 2004, 2009, 2014, 2018, and 2019. The individual CSV files were concatenated into a single dataframe using the Pandas library (Python 3.0), which enabled the assembly of a master composite CSV file. Two additional variables were appended to the file to enhance usability: 'year' (four-digit integer) and 'site' (i.e. 'SLC' for 'Hawthorne', and 'HON' for 'Kapolei'). This resulted in a master CSV file containing 4056 observations and 22 variables. The SLC and HON sites showed a total of 64 days and 260 days of missing data, respectively. The master CSV file used from here on for analyzing daily SLC and HON NO<sub>2</sub> concentrations is available at the following link: <https://raw.githubusercontent.com/aprescot1977/Thinkful/master/no2_master.csv>.
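
The concatenation step described above could look roughly like the following sketch. It is not the script that produced 'no2_master.csv': the in-memory CSV text, the column name `Daily Max 1-hour NO2 Concentration`, and the helper `build_master` are illustrative assumptions, and real EPA downloads would be read from disk and carry many more columns.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two downloaded EPA CSV files (made-up values).
RAW_FILES = {
    ("Hawthorne", 1999): (
        "Date,Daily Max 1-hour NO2 Concentration\n"
        "01/01/1999,52\n"
        "01/02/1999,47\n"
    ),
    ("Kapolei", 1999): (
        "Date,Daily Max 1-hour NO2 Concentration\n"
        "01/01/1999,9\n"
        "01/02/1999,11\n"
    ),
}

# Mapping from monitor station to the site code used in the proposal.
SITE_CODES = {"Hawthorne": "SLC", "Kapolei": "HON"}


def build_master(raw_files):
    """Concatenate per-station, per-year CSV text into one dataframe."""
    frames = []
    for (station, year), text in raw_files.items():
        frame = pd.read_csv(io.StringIO(text))
        frame["year"] = year                  # four-digit integer
        frame["site"] = SITE_CODES[station]   # 'SLC' or 'HON'
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


master = build_master(RAW_FILES)
print(master)
```

With real files, `io.StringIO(text)` would be replaced by the file path, and `master.to_csv(...)` would write the composite file.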
___

**Dataset information**: The primary variables of interest include:<br/>
* Date (expressed as MM/DD/YYYY)
* Site ('SLC' or 'HON')
* Year (four digit integer)
* Daily Max 1-hour NO<sub>2</sub> concentration (units: ppb)





# METHODS<BR/>

**Hypotheses 1 and 2** will be tested using the main dataframe (i.e. from 'no2_master.csv'). The normality of the atmospheric NO<sub>2</sub> concentration distributions for the years 1999, 2004, 2009, 2014, 2018, and 2019 will be evaluated using histogram analysis, as well as measures of skewness and kurtosis. In addition, Shapiro-Wilk tests will likely be performed. The 'site' variable will be used to differentiate the 'SLC' from 'HON' distributions and for the resulting within-site statistical analysis. If the SLC and HON atmospheric NO<sub>2</sub> concentration distributions are normally distributed, a one-way ANOVA will be performed; otherwise the data will be treated using non-parametric tests.
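
A minimal sketch of this decision rule is shown below, using SciPy. The proposal does not name the non-parametric test, so Kruskal-Wallis is assumed here as the usual counterpart to one-way ANOVA; the helper names, the 0.05 threshold, and the synthetic data are also assumptions for illustration only.

```python
import numpy as np
from scipy import stats


def describe_shape(values):
    """Return skewness, excess kurtosis, and the Shapiro-Wilk p-value."""
    _, shapiro_p = stats.shapiro(values)
    return {
        "skewness": float(stats.skew(values)),
        "excess_kurtosis": float(stats.kurtosis(values)),
        "shapiro_p": float(shapiro_p),
    }


def compare_years(groups, alpha=0.05):
    """One-way ANOVA if every group passes Shapiro-Wilk, else Kruskal-Wallis."""
    samples = list(groups.values())
    all_normal = all(describe_shape(s)["shapiro_p"] > alpha for s in samples)
    if all_normal:
        stat, p_value = stats.f_oneway(*samples)
        return "one-way ANOVA", float(stat), float(p_value)
    stat, p_value = stats.kruskal(*samples)  # assumed non-parametric fallback
    return "Kruskal-Wallis", float(stat), float(p_value)


# Synthetic, right-skewed stand-in data (NOT EPA measurements).
rng = np.random.default_rng(0)
synthetic = {
    1999: rng.exponential(scale=40.0, size=300),
    2019: rng.exponential(scale=10.0, size=300),
}
test_name, statistic, p_value = compare_years(synthetic)
print(test_name, statistic, p_value)
```

In the actual analysis, `groups` would hold one array of daily maxima per year for a single site, built by filtering the main dataframe on 'site' and 'year'.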
___
**Hypothesis 3** will also be tested using the main dataframe (i.e. from 'no2_master.csv'). Normality testing for the 6 discrete annual distributions will already have been conducted as detailed above. The annual SLC and HON atmospheric NO<sub>2</sub> concentrations will then be statistically compared on a year-by-year basis. The choice of parametric versus non-parametric statistical analysis will depend on the normality of the SLC and HON NO<sub>2</sub> distributions.
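
One way the year-by-year comparison might be organized is sketched below. The specific tests are assumptions, since the proposal leaves them open: Welch's t-test stands in for the parametric branch and the Mann-Whitney U test for the non-parametric branch. The short column name `no2_ppb` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

NO2 = "no2_ppb"  # hypothetical short name for the daily max 1-hour NO2 column


def compare_sites_by_year(df, parametric=False):
    """Compare SLC and HON NO2 within each year; return one row per year."""
    rows = []
    for year, chunk in df.groupby("year"):
        slc = chunk.loc[chunk["site"] == "SLC", NO2].dropna()
        hon = chunk.loc[chunk["site"] == "HON", NO2].dropna()
        if parametric:
            # Welch's t-test (does not assume equal variances).
            stat, p_value = stats.ttest_ind(slc, hon, equal_var=False)
        else:
            stat, p_value = stats.mannwhitneyu(slc, hon, alternative="two-sided")
        rows.append({"year": year, "statistic": stat, "p_value": p_value})
    return pd.DataFrame(rows)


# Synthetic stand-in data (NOT EPA measurements).
rng = np.random.default_rng(1)
records = []
for year, slc_scale, hon_scale in [(1999, 50.0, 10.0), (2019, 35.0, 8.0)]:
    for site, scale in [("SLC", slc_scale), ("HON", hon_scale)]:
        for value in rng.exponential(scale=scale, size=200):
            records.append({"year": year, "site": site, NO2: value})
demo = pd.DataFrame(records)

results = compare_sites_by_year(demo)
print(results)
```

Because six yearly comparisons would be run, a multiple-comparison correction on the resulting p-values may be worth considering.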
___
**Hypotheses 4 and 5** will be tested by creating a separate dataframe. The 'Date' column of the main dataframe (i.e. from 'no2_master.csv') will be used to select the days that fall within fall 2018, winter 2018, spring 2019, summer 2019, or fall 2019. These seasons are defined as follows:<br/><br/>


* **Fall 2018:** 2018-09-01 to 2018-11-30
* **Winter 2018:** 2018-12-01 to 2019-02-28
* **Spring 2019:** 2019-03-01 to 2019-05-31
* **Summer 2019:** 2019-06-01 to 2019-08-31
* **Fall 2019:** 2019-09-01 to 2019-11-30<br/><br/>

These specific dates will be accessed using the Pandas 'between' method, and a new column will be generated that holds a string showing the relevant 'season' and 'year'. The normality of the 5 new seasonal distributions (each containing approximately 90 observations) will be tested for both SLC and HON using the methods outlined above (see Hypotheses 1 and 2). The seasonal variation of SLC and HON air NO<sub>2</sub> concentration will be tested using parametric or non-parametric statistical analysis, depending on the outcome of the distribution analysis.
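
The season labeling could be sketched as follows with `Series.between`, which is inclusive at both ends by default. The helper `label_seasons` and the toy dataframe are assumptions for illustration, and the final interval is labeled 'Fall 2019' on the assumption that this is what its dates imply.

```python
import pandas as pd

# Season boundaries taken from the date ranges listed above.
SEASONS = {
    "Fall 2018": ("2018-09-01", "2018-11-30"),
    "Winter 2018": ("2018-12-01", "2019-02-28"),
    "Spring 2019": ("2019-03-01", "2019-05-31"),
    "Summer 2019": ("2019-06-01", "2019-08-31"),
    "Fall 2019": ("2019-09-01", "2019-11-30"),
}


def label_seasons(df, date_col="Date"):
    """Return only the rows inside a defined season, with a 'season' column."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col], format="%m/%d/%Y")
    out["season"] = None
    for label, (start, end) in SEASONS.items():
        mask = dates.between(pd.Timestamp(start), pd.Timestamp(end))
        out.loc[mask, "season"] = label
    # Days outside every interval keep season=None and are dropped here.
    return out.dropna(subset=["season"]).reset_index(drop=True)


# Toy rows (made-up values) covering boundary dates and out-of-range dates.
toy = pd.DataFrame(
    {
        "Date": ["08/31/2018", "09/01/2018", "02/28/2019", "11/30/2019", "12/01/2019"],
        "site": ["SLC"] * 5,
    }
)
seasonal = label_seasons(toy)
print(seasonal)
```

Grouping `seasonal` by 'site' and 'season' would then give the five seasonal distributions per site.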

# AUDIENCE


    Proposal clearly outlines the user or stakeholder that this research will be valuable for
    Proposal identifies why audience will benefit from research’s findings
    
    Andy - Need to add audience:
    (1) Public Health Authorities
    (2) Physicians/psychiatrists/social workers
    (3) General public, patients, families
    (4) Preventative measures for at-risk populations
    
    Also - the Dunn-Bonferroni test might be a cool addition
    
    Also need to add methods for concatenating datasets described above.
    That will need addition of individual datasets to GitHub, as well as preliminary distributions before uploading proposal.
