# Simulation Project 2018

## Objective

The aim of this project is to simulate a data set based on a real-world phenomenon. The data will be simulated using the numpy.random module, which is part of the NumPy Python library. The data I am simulating is based on water samples collected from the Suir River in County Tipperary.

A monitoring station at Knocknageragh Bridge, Templemore, Co. Tipperary was installed in January 2004 and to date 178 samples have been collected and analysed by the EPA laboratory in Kilkenny. The four variables that will be analysed from the samples collected will be the level of ammonia, pH level, temperature and nitrate levels. I will generate random arrays for the variables using parameters that match the real data. I will then show how the data is distributed for each variable and provide conclusions on the levels of the elements present in the water samples. 

## Introduction

The Environmental Protection Agency (EPA), as part of the Water Framework Directive (WFD), periodically collects and analyses water samples from a third of the major rivers in Ireland. EPA ecologists then assess these samples by measuring various elements to determine the quality of the water. A report on the ecological status of these rivers is produced every three years.

Once the samples are collected they are sent to the EPA laboratory in Kilkenny for analysis. The water samples are examined for pH, Biochemical Oxygen Demand (BOD), Colour, Ammonia, Oxidised Nitrogen (Nitrite/Nitrate), o-Phosphate, Chloride, Major Anions and Cations, Hardness, and Alkalinity. 

The variables that I will examine for this project are pH, ammonia, temperature and nitrate levels. I will examine the distribution of the data for each variable and I will investigate if there is any correlation between the variables.

This project is based on the collection and analysis of 178 samples from the Suir River. The location of the monitoring station is at Knocknageragh Bridge, Templemore, County Tipperary. The exact location can be seen on the map below - the red 'x' marks the spot.

![Map.PNG](attachment:Map.PNG)

## The numpy.random package
numpy.random is a submodule of the Python library NumPy. This submodule is a tool for generating random numbers, which can be used for simulating data. The functions used in this project (such as numpy.random.normal) generate pseudorandom numbers with the Mersenne Twister algorithm, one of the most extensively tested random number generators in existence. This module will be used to simulate the data for this project.
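
Because the numbers are pseudorandom, fixing a seed makes a simulation repeatable. The sketch below is illustrative only: the seed value 2018 is an arbitrary assumption, and the second half uses NumPy's newer `default_rng` interface, which is not used elsewhere in this notebook and is based on a different generator (PCG64) rather than the Mersenne Twister.

```python
import numpy as np

# Legacy interface (Mersenne Twister): seed the global generator.
np.random.seed(2018)  # 2018 is an arbitrary seed chosen for illustration
first = np.random.normal(8, 0.23, 5)

np.random.seed(2018)  # re-seeding restarts the same sequence
second = np.random.normal(8, 0.23, 5)

# The same seed gives the same sequence of pseudorandom numbers.
print(np.array_equal(first, second))

# Newer Generator interface: explicitly seeded generator objects.
rng_a = np.random.default_rng(2018)
rng_b = np.random.default_rng(2018)
same_again = np.array_equal(rng_a.normal(8, 0.23, 5), rng_b.normal(8, 0.23, 5))
print(same_again)
```

Without a seed, each run of the cells in this notebook produces a different set of 178 values, so the histograms will vary slightly from run to run.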

## Python Libraries
Python has a large number of libraries, which makes it a powerful programming language for analysing and visualising data. The libraries imported and used in this project are NumPy, Pandas and Matplotlib. These libraries provide the tools to simulate the data that I require and to plot the distributions of the four variables.

In [25]:
# import the python library NumPy
import numpy as np
# import python plotting library matplotlib
import matplotlib.pyplot as plt
# import pandas dataframe library
import pandas as pd

## numpy.random.normal
The numpy.random.normal function returns an array of random numbers drawn from a distribution defined by two parameters, the mean and the standard deviation. A third argument sets the number of values returned. This function produces a normal distribution, also known as the bell-shaped curve. The curve peaks at its mean, and the standard deviation describes how wide or narrow it is. The normal distribution is symmetric about the mean.
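
As a rough check of this description, the sketch below draws a large sample and compares its mean and standard deviation with the parameters passed in. The sample size of 100,000 and the seed are assumptions made for illustration; the mean and standard deviation are the pH values used later in this notebook.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so the sketch is repeatable

mean, sd = 8, 0.23  # pH parameters used later in this notebook
sample = np.random.normal(mean, sd, 100_000)  # large size chosen for illustration

sample_mean = sample.mean()
sample_sd = sample.std()

# Share of values within one standard deviation of the mean
# (about 68% is expected for a normal distribution).
within_one_sd = np.mean(np.abs(sample - mean) < sd)

print(sample_mean, sample_sd, within_one_sd)
```

With only 178 values, as in the rest of this project, the sample mean and standard deviation will sit further from the requested parameters than they do here.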

## numpy.random.lognormal

The numpy.random.lognormal function returns a random array of numbers drawn from a log-normal distribution. Its parameters are the mean and standard deviation of the underlying normal distribution (that is, of the logarithm of the values) and the sample size. A log-normal distribution is a continuous probability distribution, also known as the Galton distribution. It is skewed to the right, with most values clustered at the low end and a long tail of larger values. The values in a log-normal distribution cannot be negative.
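
The properties described above can be checked with a short sketch. The parameters (0 and 0.5), the sample size and the seed are illustrative assumptions and are not taken from the river data.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so the sketch is repeatable

mu, sigma = 0.0, 0.5  # illustrative parameters of the underlying normal
sample = np.random.lognormal(mu, sigma, 100_000)

# Every value is strictly positive.
all_positive = bool((sample > 0).all())

# The logarithm of the values is normally distributed with mean mu and sd sigma.
log_mean = np.log(sample).mean()
log_sd = np.log(sample).std()

# The long tail of large values pulls the mean above the median.
sample_mean = sample.mean()
sample_median = np.median(sample)

print(all_positive, log_mean, log_sd, sample_mean, sample_median)
```

Note that the mean and standard deviation passed to the function describe the logarithm of the values, not the values themselves.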

## Create the Dataset

I will create a dataset using the Pandas and NumPy libraries in Python. There will be 178 rows in my dataset and 7 columns. The seven columns will have the following headers:

- River Name - (categorical variable with one value Suir River)
- Station Name - (categorical variable with one value Knocknageragh Bridge)
- Sample Date - (date range)
- pH Level - (non-negative real numbers)
- Ammonia - (non-negative real numbers)
- Temperature - (non-negative real numbers)
- Nitrate - (non-negative real numbers)

In [29]:
# create a placeholder dataset; the measurement columns hold the integers 1 to 178 for now
rng = pd.date_range('1/1/2004', periods=178, freq='M')

raw_data = {'River Name': 'Suir River',
            'Station Name': 'Knocknageragh Bridge',
            'Sample Date': rng,
            'pH Level': range(1, 179),
            'Ammonia': range(1, 179),
            'Temperature': range(1, 179),
            'Nitrate': range(1, 179)}

# Create a dataframe
df = pd.DataFrame(raw_data)

# View the dataframe
df

| | River Name | Station Name | Sample Date | pH Level | Ammonia | Temperature | Nitrate |
|---|---|---|---|---|---|---|---|
| 0 | Suir River | Knocknageragh Bridge | 2004-01-31 | 1 | 1 | 1 | 1 |
| 1 | Suir River | Knocknageragh Bridge | 2004-02-29 | 2 | 2 | 2 | 2 |
| 2 | Suir River | Knocknageragh Bridge | 2004-03-31 | 3 | 3 | 3 | 3 |
| 3 | Suir River | Knocknageragh Bridge | 2004-04-30 | 4 | 4 | 4 | 4 |
| 4 | Suir River | Knocknageragh Bridge | 2004-05-31 | 5 | 5 | 5 | 5 |
| 5 | Suir River | Knocknageragh Bridge | 2004-06-30 | 6 | 6 | 6 | 6 |
| 6 | Suir River | Knocknageragh Bridge | 2004-07-31 | 7 | 7 | 7 | 7 |
| 7 | Suir River | Knocknageragh Bridge | 2004-08-31 | 8 | 8 | 8 | 8 |
| 8 | Suir River | Knocknageragh Bridge | 2004-09-30 | 9 | 9 | 9 | 9 |
| 9 | Suir River | Knocknageragh Bridge | 2004-10-31 | 10 | 10 | 10 | 10 |


In [28]:
rng = pd.date_range('1/1/2004', periods=178, freq='M')
rng

DatetimeIndex(['2004-01-31', '2004-02-29', '2004-03-31', '2004-04-30',
               '2004-05-31', '2004-06-30', '2004-07-31', '2004-08-31',
               '2004-09-30', '2004-10-31',
               ...
               '2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31', '2018-06-30', '2018-07-31', '2018-08-31',
               '2018-09-30', '2018-10-31'],
              dtype='datetime64[ns]', length=178, freq='M')
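
The dataframe above holds placeholder integers in the four measurement columns. One way to fill those columns with simulated values is sketched below, using the same distribution parameters that appear in the sections that follow. The names `sim_df`, `measures`, `summary` and `correlations` and the seed are assumptions introduced for this sketch, and `pd.offsets.MonthEnd()` is used here only as an equivalent way of requesting month-end dates.

```python
import numpy as np
import pandas as pd

np.random.seed(2018)  # arbitrary seed so the sketch is repeatable
n = 178               # number of samples in the dataset

sim_df = pd.DataFrame({
    'River Name': 'Suir River',               # scalar is repeated on every row
    'Station Name': 'Knocknageragh Bridge',
    # Month-end dates starting in January 2004
    'Sample Date': pd.date_range('1/1/2004', periods=n, freq=pd.offsets.MonthEnd()),
    'pH Level': np.random.normal(8, 0.23, n),
    'Ammonia': np.random.normal(0.045, 0.026, n),
    'Temperature': np.random.normal(10.6, 4.14, n),
    'Nitrate': np.random.lognormal(0.9899, 0.8325, n),
})

measures = ['pH Level', 'Ammonia', 'Temperature', 'Nitrate']
summary = sim_df[measures].describe()   # count, mean, std, quartiles per column
correlations = sim_df[measures].corr()  # pairwise Pearson correlations

print(sim_df.head())
print(correlations.round(2))
```

Because each column is drawn independently, the correlations between the simulated variables should be close to zero. Any relationship present in the real samples, for example between temperature and ammonia, is not captured by this approach.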

## pH Level

pH is a measure of the hydrogen-ion concentration in the water. A pH below 7 is acidic, a pH above 7 (to a maximum of 14) is basic and a pH of 7 is neutral. pH varies and is dependent on the geology of the river, the river flow and wastewater discharges, but it is generally in the range of 6 to 9. The pH level can also be affected by biological processes such as carbon dioxide uptake by plants during photosynthesis. Lethal effects of pH on aquatic life occur below pH 4.5 and above pH 9.5.

The pH levels for the 178 samples analysed by the laboratory have a mean value of 8 and a standard deviation of 0.23. This variable is modelled with a normal distribution.

In [None]:
# create a normal distributed array centred around 8, standard deviation or spread of 0.23 and 178 elements
x = np.random.normal(8, 0.23, 178)

In [None]:
x

In [None]:
# plot a histogram for the ph level results with 10 bins
plt.hist(x)

### pH Level Histogram

The histogram above shows an approximately normal distribution, the pattern known as the bell-shaped curve. In a normal distribution, points are as likely to occur on one side of the average as on the other. The 178 values of x are centred around 8 with a small standard deviation of 0.23. As the simulated pH values lie within the normal range of 6 to 9, the River Suir's pH levels are considered normal and no further investigation is required.
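
The claim that the values fall within the range of 6 to 9 can be checked directly rather than read off the histogram. This is a minimal sketch: the variable names and the seed are assumptions, and the pH array is re-simulated here so that the block runs on its own.

```python
import numpy as np

np.random.seed(2018)  # arbitrary seed so the sketch is repeatable

ph = np.random.normal(8, 0.23, 178)  # same parameters as the pH cell above

# Boolean mask of values inside the typical river range of 6 to 9.
in_range = (ph >= 6) & (ph <= 9)
share_in_range = in_range.mean()

print(ph.min(), ph.max(), share_in_range)
```

With a mean of 8 and a standard deviation of 0.23, the upper limit of 9 is more than four standard deviations above the mean, so values outside the range are expected to be very rare.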

## Temperature

The temperature of rivers affects the solubility of many chemical compounds and can therefore influence the effect of pollutants on aquatic life. Increased temperatures elevate the metabolic oxygen demand, which in conjunction with reduced oxygen solubility, impacts many species. The temperature of the water in the rivers is dependent on the weather and the normal range is between 6 and 16 degrees Celsius.

The temperature was measured for the 178 samples by the EPA laboratory and the results were recorded. After some statistical analysis the samples showed a mean value of 10.6 and a standard deviation of 4.14. This variable is modelled with a normal distribution.

In [None]:
# create a normal distributed array centred around 10.6, standard deviation of 4.14 and 178 elements
x = np.random.normal(10.6, 4.14, 178)

In [None]:
x

In [None]:
# plot a histogram for temperature results with 10 bins
plt.hist(x)

### Temperature Level Histogram

The histogram above shows an approximately normal distribution, the pattern known as the bell-shaped curve. In a normal distribution, points are as likely to occur on one side of the average as on the other. The 178 values of x are centred around 10.6 with a standard deviation of 4.14.
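
As a worked example, the share of values that a normal distribution with these parameters places inside the 6 to 16 degree range can be calculated from the normal cumulative distribution function. The helper `normal_cdf` is a hypothetical function written for this sketch using only the standard library.

```python
from math import erf, sqrt

def normal_cdf(x, mean, sd):
    """Probability that a normal(mean, sd) value is less than or equal to x."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

mean, sd = 10.6, 4.14  # temperature parameters used above

# Probability of a value falling in the typical range of 6 to 16 degrees.
p_typical = normal_cdf(16, mean, sd) - normal_cdf(6, mean, sd)

# Probability of a simulated value below 0 degrees.
p_below_zero = normal_cdf(0, mean, sd)

print(round(p_typical, 3), round(p_below_zero, 4))
```

Under this model roughly three quarters of the simulated temperatures fall inside the typical range, and a small fraction can fall below zero, which is worth bearing in mind given that the Temperature column is described as non-negative.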

## Ammonia (N mg/L)

Ammonia occurs naturally in water bodies arising from the microbiological decomposition of nitrogenous compounds in organic matter. Fish and other aquatic organisms also excrete ammonia. Ammonia may also be discharged directly into water bodies by some industrial processes or as a component of domestic sewage or animal slurry. Ammonia can also arise in waters from the decay of discharged organic waste. Natural (unpolluted) waters contain relatively small amounts of ammonia, usually less than 0.02 mg/l.

Ammonia exists in aqueous solutions in two forms, ionised and un-ionised. The un-ionised fraction is toxic to freshwater fish at very low concentration. The relative proportions of ionised and un-ionised ammonia in water depend on temperature and pH. The concentration of un-ionised ammonia becomes greater with increasing temperatures and pH.

The 178 samples collected and analysed by the EPA laboratory showed a mean value of 0.045 mg/l and a standard deviation of 0.026 mg/l (approximately 0.05 and 0.03). This variable is modelled with a normal distribution.

In [None]:
# create a normal distributed array with a mean of 0.045, standard deviation of 0.026 and 178 elements
x = np.random.normal(0.045, 0.026, 178)

In [None]:
x

In [None]:
# plot a histogram for ammonia level results with 10 bins
plt.hist(x)

### Ammonia Level Histogram

The histogram above shows an approximately normal distribution. The points are as likely to occur on one side of the average as on the other. The 178 values of x are centred around 0.045 with a standard deviation of 0.026. The traces of ammonia in the River Suir are very small and indicate there is very little contamination from sewage or slurry. None of the water samples show ammonia above 1 mg/l, therefore no further investigation or flagging is required.
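
Because the standard deviation is large relative to the mean, a normal distribution with these parameters can produce a few negative values, which are not physically possible for a concentration. The sketch below counts them and shows one simple option, clipping at zero. The variable names and the seed are assumptions, and clipping is only one possible treatment that slightly distorts the distribution; it is not part of the original analysis.

```python
import numpy as np

np.random.seed(2018)  # arbitrary seed so the sketch is repeatable

ammonia = np.random.normal(0.045, 0.026, 178)  # same parameters as above

# Count simulated concentrations that came out below zero.
n_negative = int((ammonia < 0).sum())

# One simple option: replace negative values with zero.
ammonia_clipped = np.clip(ammonia, 0, None)

# Confirm that no simulated value exceeds the 1 mg/l flagging threshold.
none_above_one = bool((ammonia <= 1).all())

print(n_negative, ammonia_clipped.min(), none_above_one)
```

An alternative would be to model ammonia with a distribution that cannot go below zero, such as the log-normal distribution used for nitrate.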

## Nitrate Levels

High nitrate levels in river samples are generally caused by fertilisers running into the water from agricultural land. Nitrates are a source of nutrients for plants and their presence encourages plant proliferation. If there are more plants in the river, they use more oxygen, which leaves less oxygen for the bugs and insects. The increased plant life also reduces the light received by the river.

The 178 samples collected and analysed by the EPA laboratory showed a mean value of 1.61, a variance of 0.436 and a standard deviation of 0.66. As this variable is modelled with a log-normal distribution, the parameters mu and sigma (the mean and standard deviation of the underlying normal distribution) need to be calculated from the sample mean and variance. The values used are 0.9899 and 0.8325 respectively. The formula used to calculate the mu and sigma passed to the random.lognormal function can be seen below:

![Formula.PNG](attachment:Formula.PNG)
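
The formula image is not reproduced in this text, so the sketch below assumes the standard moment-matching relations for the log-normal distribution, $\mu = \ln\left(m^2 / \sqrt{v + m^2}\right)$ and $\sigma = \sqrt{\ln\left(1 + v / m^2\right)}$, where $m$ and $v$ are the sample mean and variance. The function name `lognormal_params` is hypothetical, and the image may show a different formulation.

```python
import numpy as np

def lognormal_params(mean, variance):
    """Convert a sample mean and variance into the mu and sigma of the
    underlying normal distribution (standard moment matching)."""
    sigma_sq = np.log(1 + variance / mean**2)
    mu = np.log(mean) - sigma_sq / 2
    return mu, np.sqrt(sigma_sq)

mu, sigma = lognormal_params(1.61, 0.436)

# Round trip: the mean and variance implied by mu and sigma
# should reproduce the inputs.
implied_mean = np.exp(mu + sigma**2 / 2)
implied_var = (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2)

print(round(mu, 4), round(sigma, 4), implied_mean, implied_var)
```

For a mean of 1.61 and a variance of 0.436, these relations give values of roughly 0.40 for mu and 0.39 for sigma, which differ from the 0.9899 and 0.8325 quoted above. The quoted figures may therefore have been derived from different inputs or a different formula, and are worth re-checking against the original data.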


In [None]:
# create a log-normally distributed array with a mu of 0.9899, sigma of 0.8325 and 178 elements
x = np.random.lognormal(0.9899, 0.8325, 178)

In [None]:
x

In [None]:
# plot a histogram for total oxidised nitrogen level results with 10 bins
plt.hist(x)

### Nitrate Histogram

A log-normal distribution is a continuous probability distribution, also known as the Galton distribution. It is skewed to the right, with a long tail of large values, and its values cannot be negative. The nitrate levels in the Suir River are high, which is a serious problem because excess nitrate increases plant growth. These plants use more oxygen and block out the light, leaving too little oxygen for the insects. When large amounts of nutrients are added to rivers (from fertilisers), the biological quality of the water suffers.

## References

- http://www.epa.ie/pubs/reports/water/rivers/Interim%20Report_2012_web.pdf
- https://docs.scipy.org/doc/numpy/
- https://realpython.com/python-random/
- https://www.epa.ie/pubs/reports/water/waterqua/iwqmolou/App%207.pdf
- https://stat.ethz.ch/~stahel/lognormal/bioscience.pdf
- https://en.wikipedia.org/wiki/Log-normal_distribution