<h1><center>Water Quality</center></h1>

<h2>Packages</h2>

In [5]:
# Load packages such as ggplot2, dplyr, tidyr, and readr to be able to use specialised functions for creating
# visualisations, reading, writing, and manipulating data.
library(tidyverse)

In [3]:
# Load the tidygeocoder package to be able to use a function to convert the given latitude and longitude
# to an address.
library(tidygeocoder)


In [None]:
# Load skimr package to be able to use a function to understand the structure of the dataframe we will analyse
library(skimr)

In [None]:
# Load the knitr package to be able to use a function for presenting information in a tidy format.
library(knitr)

In [None]:
# Load the visdat package to be able to use a function for visualisation of the data. 
library(visdat)

In [None]:
# Load the lubridate package to be able to use function(s) for manipulating datetime data type.
library(lubridate)

In [4]:
# Load the highcharter package to be able to use interactive charting/graphing functions.
library(highcharter)


In [None]:
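# Load jsonlite, XML, and xml2 for parsing API responses, glue for building query strings,
# and httr for sending HTTP requests.
library(jsonlite)
library(XML)
library(xml2)
library(glue)
library(httr)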

<h2>Accessing and Importing Datasets</h2>

In [None]:
# A function to get a dataset from the Ministry for the Environment (MfE) data service through its WFS API

get_data_from_mfe <- function(api_key, data_id){
    
    # Build the query URL from the API key and the data ID listed on the MfE website
    query <- glue('https://data.mfe.govt.nz/services;key={api_key}/wfs?service=WFS&version=2.0.0&request=GetFeature&typeNames={data_id}')
    
    api_response <- GET(query) # send the request and store the API response
    stop_for_status(api_response) # stop with an informative error if the request failed
    
    data_xml <- read_xml(api_response) # read the XML body of the API response
    
    data_parsed <- xmlParse(data_xml) # parse the XML into a document whose nodes can be queried in R
    
    data_df <- glue('//data.mfe.govt.nz:{data_id}') %>%  # XPath for the nodes named after the data ID
        getNodeSet(data_parsed, .)  %>%  # collect every node that matches the XPath
        xmlToDataFrame(nodes = .) # turn the data within the matched nodes into a data frame
    
    return(data_df) # return the data frame
}

In [None]:
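# Get the river quality E. coli dataset from MfE using their API service.
# Then display the first six rows.
# The API key is read from an environment variable (the name MFE_API_KEY is an assumption)
# so that it is not stored in the notebook.
river_ecoli <- get_data_from_mfe(Sys.getenv("MFE_API_KEY"), "table-109662")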
river_ecoli %>% 
    head()

In [None]:
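# Get the river quality nitrogen dataset from MfE using their API service.
# Then display the first six rows.
# The API key is read from an environment variable (the name MFE_API_KEY is an assumption)
# so that it is not stored in the notebook.
river_nitrogen <- get_data_from_mfe(Sys.getenv("MFE_API_KEY"), "table-109659")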
river_nitrogen %>% 
    head()

In [None]:
# Read groundwq.csv from GitHub, store it as groundwq for analysis, and display the first six rows.
groundwq <- "https://raw.githubusercontent.com/beuri97/data201_gp/waterq/water_quality/data/groundwq.csv" %>% 
  read_csv()
groundwq %>% 
    head()

In [None]:
# Read new_river_ecoli.csv (the geocoded version) from GitHub and store it as river_ecoli for analysis.
# This replaces the river_ecoli data frame obtained from the API above.
river_ecoli <- "https://raw.githubusercontent.com/beuri97/data201_gp/waterq/water_quality/data/new_river_ecoli.csv" %>% 
  read_csv()
river_ecoli %>% 
    head()

In [None]:
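# Read new_river_nitrogen.csv (the geocoded version) from GitHub and store it as river_nitrogen for analysis.
# This replaces the river_nitrogen data frame obtained from the API above.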
river_nitrogen <- "https://raw.githubusercontent.com/beuri97/data201_gp/waterq/water_quality/data/new_river_nitrogen.csv" %>% 
  read_csv()
river_nitrogen %>% 
    head()

<p style="text-align: justify"> We used the web and API services of the Ministry for the Environment (MfE) to retrieve the river quality datasets from its database. The groundwater dataset in that database is incomplete, so we instead downloaded the groundwater dataset from the LAWA website, uploaded it to GitHub, and read it into R from its raw file link. </p>
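
<p style="text-align: justify"> As a rough illustration of how the API request described above is formed, the sketch below builds the WFS query URL on its own, without sending it. The helper name `build_mfe_query`, the environment variable `MFE_API_KEY`, and the placeholder key are assumptions made for this example and are not part of the MfE service. </p>

```r
library(glue)

# Hypothetical helper: builds the WFS GetFeature URL for one MfE table or layer.
build_mfe_query <- function(api_key, data_id) {
  glue(
    "https://data.mfe.govt.nz/services;key={api_key}/wfs",
    "?service=WFS&version=2.0.0&request=GetFeature&typeNames={data_id}"
  )
}

# Assumed environment variable name; falls back to a placeholder if it is not set.
api_key <- Sys.getenv("MFE_API_KEY", unset = "YOUR_KEY")

query_url <- build_mfe_query(api_key, "table-109662")
print(query_url)
```

<p style="text-align: justify"> Keeping the key in an environment variable is one way to avoid storing it in the notebook itself. </p>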

<h2>Conversion and Saving Data to CSV File</h2>

In [None]:
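# Takes the latitude and longitude columns of a data frame and adds new columns with the full address of each row.
# reverse_geocode() expects the names of the coordinate columns rather than vectors of values, so df should
# contain columns named lat and long.
convert_lat_long <- function(df, lat, long){
    converted_df <- df %>%
        reverse_geocode(lat = lat, long = long, 
                        method = "osm", full_results = TRUE)
    return(converted_df)
}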

In [None]:
## Takes the river_ecoli, and converts the provided latitudes and longitudes to add new columns containing the 
## address of the river sites. Then displays the first six rows.
# new_riverecoli <- convert_lat_long(river_ecoli, river_ecoli$lat, river_ecoli$long)
# new_riverecoli %>% head()

In [None]:
## Takes the river_nitrogen, and converts the provided latitudes and longitudes to add new columns containing the 
## address of the river sites. Then displays the first six rows.
# new_rivernitrogen <- convert_lat_long(river_nitrogen, river_nitrogen$lat, river_nitrogen$long)
# new_rivernitrogen %>% head()

In [None]:
## Write the new dataset as CSVs for use.
# write_csv(new_riverecoli, "new_river_ecoli.csv")
# write_csv(new_rivernitrogen, "new_river_nitrogen.csv")

<p style="text-align: justify"> Region is a key variable for relating our datasets to one another. The river quality datasets we obtained from MfE only give the coordinates of each site, so we wrote a helper function around `reverse_geocode` from `tidygeocoder`, which takes a latitude and longitude and looks up the matching location through a geocoding service. We then created new datasets that include the location details, such as the region, returned by the conversion. The conversion and the writing of the new datasets are commented out because the conversion takes a long time to run; the notebook instead reads the converted datasets from the copies we uploaded to GitHub. </p>
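
<p style="text-align: justify"> One possible way to shorten a slow conversion like this is to look up each distinct coordinate pair once and join the result back to the full table. The sketch below shows the idea on made-up data. `river_sample` and `lookup_address` are hypothetical stand-ins: in the notebook the lookup step would be the call to `reverse_geocode`, which is replaced here so that the example runs without a network connection. </p>

```r
library(dplyr)

# Toy stand-in for the river data: three rows, but only two distinct sites.
river_sample <- tibble(
  s_id     = c(1, 1, 2),
  end_year = c(2018, 2019, 2019),
  lat      = c(-43.53, -43.53, -36.85),
  long     = c(172.63, 172.63, 174.76)
)

# Hypothetical placeholder for the slow lookup (assumption: the real step would be
# reverse_geocode(lat = lat, long = long, method = "osm", full_results = TRUE)).
lookup_address <- function(coords) {
  coords %>%
    mutate(address = paste0("site at ", lat, ", ", long))
}

# Look up each distinct coordinate pair once.
site_addresses <- river_sample %>%
  distinct(lat, long) %>%
  lookup_address()

# Join the looked-up addresses back onto every row of the original table.
river_with_address <- river_sample %>%
  left_join(site_addresses, by = c("lat", "long"))

print(river_with_address)
```

<p style="text-align: justify"> With this approach the number of lookups depends on the number of sites rather than the number of rows, which matters when each site appears once per year. </p>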

<h2>Groundwater Quality</h2>

In [None]:
# Gives an overview of groundwq such as columns, data types, the possible values, number of rows and columns.
groundwq %>% 
  glimpse()

In [None]:
# Takes groundwq, derives the year, standardises the indicator and region names, renames the well name column,
# and keeps the relevant columns and the rows from 2002 to 2019 to create a clean version of groundwq.
# Note: case_when() labels every indicator other than "E.coli" as nitrate nitrogen, so the Indicator
# filter at the end does not exclude any rows.
new_groundwq <- groundwq %>% 
  mutate(CensoredValue = ifelse(is.na(CensoredValue), NA_integer_, CensoredValue),
         Year = year(Date),
         Indicator = case_when(Indicator == "E.coli" ~ "E.coli cfu/100ml", TRUE ~ "Nitrate nitrogen g/m3"),
         Region = case_when(Region == "Hawkes Bay" ~ "Hawke's Bay",
                            Region == "Manawatu-Whanganui" ~ "Manawatū-Whanganui", TRUE ~ Region),
         WellName = LAWAWellName) %>% 
  select(Region, WellName, Latitude, Longitude, Indicator, Year, CensoredValue) %>% 
  filter(Indicator %in% c("E.coli cfu/100ml", "Nitrate nitrogen g/m3"), Year >= 2002, Year <= 2019)

In [None]:
# Takes new_groundwq, selects the columns about the quality of the sites, and groups by Region, Year,
# WellName, and Indicator to get the mean measurement of each indicator for each well from 
# 2002 to 2019 across NZ. Then displays the first six rows of the grouped data frame.
sites_quality <- new_groundwq %>% 
  select(Region, Year, WellName, CensoredValue, Indicator) %>% 
  group_by(Region, Year, WellName, Indicator) %>% 
  summarise(MeanVal = mean(CensoredValue))
sites_quality %>% 
    head()

In [None]:
# Converts the sites_quality to wide format to identify wells in specific years that are not assessed.
sites_quality_wide <- sites_quality %>% 
  spread(key = Indicator,
         value = MeanVal)
sites_quality_wide %>% 
    head()

In [None]:
# Takes new_groundwq, selects the columns about the sites' coordinates, and then gets the unique 
# entries of the wells across NZ.
sites <- new_groundwq %>% 
  select(Region, WellName, Latitude, Longitude) %>% 
  distinct()
sites %>% 
    head()

In [None]:
# Takes the wide format of sites_quality, removes the rows with missing values, groups the rows by Year, and
# summarises the mean of each indicator from 2004 to 2019.
gwq_overall_change <- sites_quality_wide %>%
  na.omit() %>% 
  group_by(Year) %>% 
  summarise(MeanEcoli = mean(`E.coli cfu/100ml`), MeanNitrogen = mean(`Nitrate nitrogen g/m3`))
gwq_overall_change
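
<p style="text-align: justify"> The effect of removing missing values in the wide format can be seen on a small made-up table. In this sketch, which uses hypothetical values and `pivot_wider` (the current `tidyr` equivalent of `spread`), a well that was assessed for only one indicator in a given year is dropped from that year's mean for both indicators. </p>

```r
library(dplyr)
library(tidyr)

# Made-up long-format data: well B has no E. coli measurement in 2004.
toy_quality <- tibble(
  Year      = c(2004, 2004, 2004, 2005, 2005, 2005, 2005),
  WellName  = c("A", "A", "B", "A", "A", "B", "B"),
  Indicator = c("E.coli cfu/100ml", "Nitrate nitrogen g/m3", "Nitrate nitrogen g/m3",
                "E.coli cfu/100ml", "Nitrate nitrogen g/m3",
                "E.coli cfu/100ml", "Nitrate nitrogen g/m3"),
  MeanVal   = c(2, 1.5, 3.0, 4, 2.5, 6, 3.5)
)

# One row per well and year, one column per indicator; unassessed cells become NA.
toy_wide <- toy_quality %>%
  pivot_wider(names_from = Indicator, values_from = MeanVal)

# Rows with any NA are removed before averaging, so well B does not count towards 2004.
toy_yearly <- toy_wide %>%
  na.omit() %>%
  group_by(Year) %>%
  summarise(MeanEcoli = mean(`E.coli cfu/100ml`),
            MeanNitrogen = mean(`Nitrate nitrogen g/m3`))

print(toy_yearly)
```

<p style="text-align: justify"> Here the 2004 nitrate nitrogen mean is based on well A alone, even though well B has a nitrate nitrogen value for that year. </p>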

In [7]:
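# Creates an interactive line chart of the yearly mean E. coli count and nitrate nitrogen concentration,
# each plotted against its own y-axis.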
overall_gwq <- highchart() %>% 
  hc_yAxis_multiples(
    list(lineWidth = 3, lineColor='blue', title=list(text="E.coli cfu/100ml")),
    list(lineWidth = 3, lineColor="green", title=list(text="Nitrate nitrogen g/m3"))
  ) %>% 
  hc_add_series(data = gwq_overall_change$MeanEcoli, color='blue', name = "E.coli") %>% 
  hc_add_series(data = gwq_overall_change$MeanNitrogen, color='green', name = "Nitrate nitrogen", yAxis = 1) %>%
  hc_xAxis(categories = gwq_overall_change$Year, title = list(text = "Year")) %>% 
  hc_title(text = "Average E. coli Count and Nitrate Nitrogen Amount in NZ (2004 - 2019)")
overall_gwq


<h2>River Quality (E. coli & Nitrogen)</h2>

In [None]:
# Gives an overview of river_ecoli such as columns, data types, the possible values, number of rows and columns.
river_ecoli %>% 
    glimpse()

In [None]:
# Reads the entirety of river_ecoli and creates a plot to check if it contains missing data (NA).
river_ecoli %>% 
  vis_miss()

In [None]:
# Takes river_ecoli, renames the columns, standardises the indicator, and selects the relevant columns and rows. 
new_river_ecoli <- river_ecoli %>% 
  rename(Region = state, Year = end_year, Indicator = measure, Median = median, Units = units, S_ID = s_id,
         Latitude = lat, Longitude = long) %>% 
  mutate(Indicator = "E.coli cfu/100ml") %>% 
  select(Region, Year, S_ID, Median, Indicator, Latitude, Longitude) %>% 
  filter(Year >= 2002, Year <= 2019)
new_river_ecoli %>% 
    head()

In [None]:
# Check for missing data again (NA)
new_river_ecoli %>% 
  vis_miss()

In [None]:
# Takes new_river_ecoli, selects the relevant columns, groups the rows by Region, Year, site ID, and Indicator,
# and summarises the mean of the median values for each site across NZ in each year.
river_src_quality_ecoli <- new_river_ecoli %>% 
  select(Region, Year, S_ID, Median, Indicator) %>% 
  group_by(Region, Year, S_ID, Indicator) %>% 
  summarise(MeanVal = mean(Median))
river_src_quality_ecoli %>% 
    head()

In [None]:
# Takes new_river_ecoli, selects the columns about the sites' coordinates, and then gets the unique 
# entries of the river sites across NZ.
river_src_ecoli <- new_river_ecoli %>% 
  select(Region, S_ID, Latitude, Longitude) %>% 
  distinct()
river_src_ecoli %>% 
    head()

In [None]:
# Gives an overview of river_nitrogen such as columns, data types, the possible values, number of rows and columns.
# This allows us to decide which columns are relevant.
river_nitrogen %>% 
  glimpse()

In [None]:
# Takes river_nitrogen, renames the columns, keeps the rows from 2002 to 2019 with ammoniacal nitrogen or
# nitrate(-nitrite) nitrogen as the indicator, standardises the indicators, and selects the necessary columns.
new_river_nitrogen <- river_nitrogen %>% 
  rename(Region = state, Year = end_year, Indicator = measure, Median = median, Units = units, S_ID = s_id,
         Latitude = lat, Longitude = long) %>% 
  filter(Year >= 2002, Year <= 2019, Indicator %in% c("Ammoniacal nitrogen", "Nitrate-nitrite nitrogen")) %>%
  mutate(Indicator = case_when(Indicator == "Ammoniacal nitrogen" ~ "Ammoniacal nitrogen g/m3",
                               TRUE ~ "Nitrate-nitrite nitrogen g/m3")) %>%
  select(Region, Year, S_ID, Median, Indicator, Latitude, Longitude)

In [None]:
# Reads the entirety of river_nitrogen and creates a plot to check if it contains missing data (NA). 
river_nitrogen %>% 
  vis_miss()

In [None]:
# Takes the river_nitrogen data frame, renames the columns, selects the necessary rows, and groups them
# by Region, Indicator, and Year. Lastly, summarises each group as the sum of the median values, rounded
# to 2 decimal places.
new_rivernitrogen <- river_nitrogen %>% 
  rename(Region = state, Indicator = measure, Units = units, Med_Value = median,
         Year = end_year) %>% 
  filter(Indicator %in% c("Ammoniacal nitrogen", "Nitrate-nitrite nitrogen"),
         Year >= 2002, Year <= 2019) %>% 
  group_by(Region, Indicator, Year) %>% 
  summarise(Total_MedVal = round(sum(Med_Value), 2)) %>% 
  distinct()
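
<p style="text-align: justify"> The summary above uses `round(x, 2)`, which rounds to two decimal places. Rounding to two significant figures would use `signif` instead. A minimal comparison on an arbitrary value: </p>

```r
x <- 123.456

two_decimal_places  <- round(x, 2)   # keeps two digits after the decimal point
two_sig_figures     <- signif(x, 2)  # keeps the two leading digits

print(c(two_decimal_places, two_sig_figures))
```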

In [None]:
# Takes new_rivernitrogen, converts it to wide format, and adds the units to the indicator column names.
rivernitrogen_wide <- new_rivernitrogen %>% 
  spread(key = Indicator,
         value = Total_MedVal) %>% 
  rename(`Ammoniacal nitrogen (g/m3)` = `Ammoniacal nitrogen`,
         `Nitrate-nitrite nitrogen (g/m3)` = `Nitrate-nitrite nitrogen`)
rivernitrogen_wide

<h2>Adding New Columns and Joining Dataframes</h2>

In [None]:
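# Join rivernitrogen_wide and riverecoli_wide to create the river_quality dataframe.
# riverecoli_wide is not created in the cells above; it is assumed to be the wide-format E. coli table,
# built in the same way as rivernitrogen_wide.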
river_quality <- rivernitrogen_wide %>% 
  full_join(riverecoli_wide)

In [None]:
# Create one-column tibbles that label each row with its water category, then bind them to the
# new_groundwq and river_quality dataframes.
groundwq_categ <- tibble(Water_Categ = rep(c("Groundwater Quality"), each = nrow(new_groundwq)))
river_categ <- tibble(Water_Categ = rep(c("River Quality"), each = nrow(river_quality)))

groundwq <- cbind(new_groundwq, groundwq_categ)
river_quality <- cbind(river_quality, river_categ)

In [None]:
# Join groundwq and river_quality to create the water_quality dataframe.
# Normalise the indicator name for all E. coli observations.
water_quality <- groundwq %>% 
  full_join(river_quality) %>% 
  mutate(Indicator = case_when(Indicator == "E. coli" ~ "E.coli", TRUE ~ Indicator))