# The relationship between PM2.5 concentration and seasons in Beijing

### Introduction
Due to rapid industrial development, air pollution has become a major concern, and large metropolitan cities such as Beijing are especially affected. Particulate matter 2.5 (PM2.5) consists of tiny particles or droplets in the air that are carcinogenic and cause respiratory disease (Zhao et al.). To explore the pattern of PM2.5 concentration in Beijing, this study focuses on the question: is PM2.5 concentration in Beijing seasonal? The *Beijing PM2.5 Data Data Set* will be used to answer this question. The dataset contains hourly measurements of PM2.5 concentration, temperature, dew point, pressure, wind conditions, and weather conditions in Beijing from 2010 to 2014. All variables are numeric (integers or real numbers) except the wind direction `cbwd`, which is categorical.


### Preliminary Data Analysis


In [3]:
library(tidyverse)
library(repr)
library(tidymodels)
library(RCurl)


In [4]:
pm2.5_data <- read_csv("https://raw.githubusercontent.com/gbrwg/DSCI100-Group-Project/main/data/beijing_pm2.5_data.csv") %>%
            filter(year == 2010)  # year is parsed as a double, so compare against a number
pm2.5_data


Parsed with column specification:
cols(
  No = col_double(),
  year = col_double(),
  month = col_double(),
  day = col_double(),
  hour = col_double(),
  pm2.5 = col_double(),
  DEWP = col_double(),
  TEMP = col_double(),
  PRES = col_double(),
  cbwd = col_character(),
  Iws = col_double(),
  Is = col_double(),
  Ir = col_double()
)



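| No `<dbl>` | year `<dbl>` | month `<dbl>` | day `<dbl>` | hour `<dbl>` | pm2.5 `<dbl>` | DEWP `<dbl>` | TEMP `<dbl>` | PRES `<dbl>` | cbwd `<chr>` | Iws `<dbl>` | Is `<dbl>` | Ir `<dbl>` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2010 | 1 | 1 | 0 | NA | -21 | -11 | 1021 | NW | 1.79 | 0 | 0 |
| 2 | 2010 | 1 | 1 | 1 | NA | -21 | -12 | 1020 | NW | 4.92 | 0 | 0 |
| 3 | 2010 | 1 | 1 | 2 | NA | -21 | -11 | 1019 | NW | 6.71 | 0 | 0 |
| 4 | 2010 | 1 | 1 | 3 | NA | -21 | -14 | 1019 | NW | 9.84 | 0 | 0 |
| 5 | 2010 | 1 | 1 | 4 | NA | -20 | -12 | 1018 | NW | 12.97 | 0 | 0 |
| 6 | 2010 | 1 | 1 | 5 | NA | -19 | -10 | 1017 | NW | 16.10 | 0 | 0 |
| 7 | 2010 | 1 | 1 | 6 | NA | -19 | -9 | 1017 | NW | 19.23 | 0 | 0 |
| 8 | 2010 | 1 | 1 | 7 | NA | -19 | -9 | 1017 | NW | 21.02 | 0 | 0 |
| 9 | 2010 | 1 | 1 | 8 | NA | -19 | -9 | 1017 | NW | 24.15 | 0 | 0 |
| 10 | 2010 | 1 | 1 | 9 | NA | -20 | -8 | 1017 | NW | 27.28 | 0 | 0 |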


In [15]:
## Omit rows with NA, set the seed, and split the data
pm2.5_data_withoutna <- na.omit(pm2.5_data)
set.seed(1)
pm2.5_split <- initial_split(pm2.5_data_withoutna, prop = 0.75, strata = pm2.5)
pm2.5_train <- training(pm2.5_split)
pm2.5_test <- testing(pm2.5_split)

In [36]:
## Find the number of rows that contain missing data.
missing_data <- nrow(pm2.5_data) - nrow(pm2.5_data_withoutna)
missing_data

There are 669 rows with missing data.
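
The row count above does not show which columns the missing values come from. A minimal sketch of a per-column check follows; it runs on a small made-up tibble (`toy`, with invented values) rather than the real data, and the names `na_per_column` and `rows_with_na` are illustrative only.

```r
library(dplyr)

# Toy data with made-up values; only the column names mirror the dataset.
toy <- tibble(
  pm2.5 = c(NA, 129, NA, 148),
  TEMP  = c(-11, -4, -12, -5),
  cbwd  = c("NW", "SE", "NW", "SE")
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# Number of rows that na.omit() would drop (rows with at least one NA).
rows_with_na <- sum(!complete.cases(toy))

na_per_column
rows_with_na
```

Applied to `pm2.5_data`, the same two calls would indicate whether the missing rows are confined to the `pm2.5` column, as the first ten rows of the table suggest.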

In [44]:
## Find the mean for each predictor
Pm_2.5_selected <- select(pm2.5_train, DEWP, TEMP, PRES, Iws, Is, Ir)

mean_dewp <- Pm_2.5_selected %>%
  summarize(mean = mean(DEWP)) %>%
  pull()

mean_temp <- Pm_2.5_selected %>%
  summarize(mean = mean(TEMP)) %>%
  pull()

mean_pres <- Pm_2.5_selected %>%
  summarize(mean = mean(PRES)) %>%
  pull()

mean_iws <- Pm_2.5_selected %>%
  summarize(mean = mean(Iws)) %>%
  pull()

mean_is <- Pm_2.5_selected %>%
  summarize(mean = mean(Is)) %>%
  pull()

mean_ir <- Pm_2.5_selected %>%
  summarize(mean = mean(Ir)) %>%
  pull()

In [46]:
## Creating the table displaying the mean of each predictor
tab_pm2.5 <- matrix(c(mean_dewp, mean_temp, mean_pres, mean_iws, mean_is, mean_ir), ncol=6, byrow=TRUE)
colnames(tab_pm2.5) <- c("DEWP", "TEMP", "PRES", "Iws", "Is", "Ir")
rownames(tab_pm2.5) <- c("mean")
tab_pm2.5 <- as.table(tab_pm2.5)
tab_pm2.5

|      | DEWP | TEMP | PRES | Iws | Is | Ir |
|---|---|---|---|---|---|---|
| mean | 1.567216 | 11.52867 | 1016.168 | 29.50518 | 0.07611203 | 0.2886326 |
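
The six per-predictor means could also be computed in a single step. The sketch below assumes dplyr 1.0.0 or later (for `across()`) and uses a small made-up tibble, `toy`, in place of the training data; it is an alternative formulation, not the code that produced the table above.

```r
library(dplyr)

# Toy data with made-up values standing in for the selected predictors.
toy <- tibble(
  DEWP = c(-21, -20, -19),
  TEMP = c(-11, -12, -10),
  PRES = c(1021, 1020, 1019)
)

# One row containing the mean of every column.
toy_means <- toy %>%
  summarize(across(everything(), mean))

toy_means
```

With the real data, replacing `toy` with the selected predictor columns of `pm2.5_train` should give the same means as the separate `summarize()` calls.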

### Methods


We restrict the analysis to data from 2010, since the concentration of PM2.5 increases from year to year; using a single year keeps this trend from obscuring the seasonal pattern.
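
One way the seasonal question could be approached is to label each observation with a season and compare mean PM2.5 across seasons. The sketch below is an assumption about how this might be done, not a method fixed by this proposal: the helper `month_to_season()` is hypothetical, the meteorological season boundaries (December to February as winter, and so on) are an assumed definition, and the tibble `toy` contains made-up values.

```r
library(dplyr)

# Hypothetical helper: map a month number (1-12) to a meteorological season.
# The season boundaries are an assumption, not taken from the dataset.
month_to_season <- function(month) {
  case_when(
    month %in% c(12, 1, 2) ~ "winter",
    month %in% 3:5         ~ "spring",
    month %in% 6:8         ~ "summer",
    month %in% 9:11        ~ "autumn"
  )
}

# Toy data with made-up PM2.5 values; column names mirror the dataset.
toy <- tibble(
  month = c(1, 4, 7, 10, 12),
  pm2.5 = c(100, 60, 80, 90, 120)
)

# Mean PM2.5 and number of observations per season.
season_summary <- toy %>%
  mutate(season = month_to_season(month)) %>%
  group_by(season) %>%
  summarize(mean_pm2.5 = mean(pm2.5, na.rm = TRUE), n = n())

season_summary
```

Applied to `pm2.5_data_withoutna`, the same pipeline would give one mean per season for 2010, which could then be compared or plotted.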

### Expected Outcomes and Significance
Air pollution is an increasingly important problem, and addressing it requires understanding its patterns and correlates. We expect to find a seasonal pattern in PM2.5 concentration. Identifying the peak seasons could help governments design and time policies that reduce PM2.5 concentration when it is highest. Further research could investigate the causes of the seasonal rise in PM2.5 concentration and ways to mitigate it.

### References
Zhao, Hui, et al. “Spatiotemporal Distribution of PM2.5 and O3 and Their Interaction during the Summer and Winter Seasons in Beijing, China.” *MDPI*, Multidisciplinary Digital Publishing Institute, 30 Nov. 2018, https://www.mdpi.com/2071-1050/10/12/4519/htm. 