# Simple data processing

In this notebook we will analyse sea level data measured with a tide gauge on the West Florida Shelf. The data are hourly and cover the year 2004.

In this notebook we will go through the following steps:

* Load the necessary libraries
* Read the data necessary for the exercise (sea level height measured with a tide gauge)
* Calculate the standard deviation of the data
* Plot the data with +/- n * standard deviation to detect outliers/suspect points

If you need help with any command, you can of course ask me, but you can also try to search for it:
https://docs.julialang.org/en/v1/manual/getting-started/

(or plain Google search "julialang" and then your keywords...)

Once you have finished the exercise, we'll comment on the results together.

In [None]:
#Loading the necessary libraries
using PyPlot
using DelimitedFiles
using Statistics

The next cell downloads the file for the exercise. If you run it a second time, a red message appears saying that the file is already there. That is fine, no problem!

In [None]:
filename = "8762075.sealevel.txt";
if !isfile(filename)
    @info("downloading $filename")
    cp(download("https://dox.uliege.be/index.php/s/PESUtiJ9RNSO73Y/download"),filename)
else
    @info("$filename is already downloaded")
end

In [None]:
#Reading the data

data,header = readdlm("8762075.sealevel.txt",header=true);



#The data have 8784 rows and 8 columns
size(data)

In [None]:
#Let's inspect the header
header[1:8]

In [None]:
#and the first 10 rows of the data
data[1:10,:]

In [None]:
# The 6th column contains the variable "sea-level-height"
# We extract this column into a new variable "sealevel" which is now a 1-column time series (with 8784 values)
sealevel = data[:,6];
size(sealevel)

#### $\rightarrow$ First exercise: calculate the standard deviation of the data (here is the equation):

\begin{equation}
    s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \bar{x})^2}
\end{equation}
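
As a quick check of the formula by hand, here is a small worked example on eight made-up numbers (they are not tide gauge data). It uses the $N-1$ denominator, which is what Julia's `std` applies by default, so your own code should give the same value on this toy series:

\begin{equation}
    x = (2, 4, 4, 4, 5, 5, 7, 9), \quad N = 8, \quad \bar{x} = \frac{40}{8} = 5
\end{equation}

\begin{equation}
    \sum_{i=1}^N (x_i - \bar{x})^2 = 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32, \quad s = \sqrt{\frac{32}{7}} \approx 2.138
\end{equation}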


In [None]:
#Insert your equation here:
# I'll help you start:
N = length(sealevel); #number of elements
bar_x = sum(sealevel)/N; #average of our time series, sealevel

#Now you:
# Think about the order of the commands you need, and whether you need a loop
#
#
#mystd = ...
# @show(mystd)

In [None]:
# We now compare your result with the standard deviation calculated by Julia
# If you do not get the same result (agreement to the 3rd decimal is enough), go back to your equation above
seal_std = std(sealevel)

@show seal_std
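
If you prefer not to compare the two numbers by eye, one possible way to check agreement "to the 3rd decimal" is `isapprox` with an absolute tolerance. The sketch below uses a made-up four-element series and a hand-rounded value called `candidate`; both are assumptions for illustration and stand in for `sealevel` and your own `mystd`.

```julia
using Statistics

# Made-up series, used only to illustrate the comparison
x = [1.0, 2.0, 3.0, 4.0]

reference = std(x)     # Julia's result (N-1 denominator by default)
candidate = 1.291      # hypothetical hand-computed value, rounded to 3 decimals

# true when the two values differ by less than 0.001
agree = isapprox(candidate, reference; atol = 1e-3)
@show agree
```

With the real data you would replace `candidate` by `mystd` and `reference` by `seal_std`.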

In [None]:
# Now we'll take a look at the data. First we plot the sealevel time series
plot(sealevel)
ylabel("Sea level (m)");
xlabel("time (hours since the start of 2004)");
#and then we add a line showing the mean of the time series
plot([1; 8784],[mean(sealevel); mean(sealevel)],"r")

# Note that we need two points between [brackets], because the mean and the standard deviation are
# just one number each, not a time series. So we take one point at time = 1 and one at time = 8784
# and join these two points by a line of a chosen colour

# Now it's your turn: add lines at the mean plus and minus 1, 2 and 3 times the standard deviation.
# You can use different colours, "g", "k"...


### Questions

$\rightarrow$ What do you think about the levels indicated by the standard deviation test?

$\rightarrow$ Would you discard some data because they look suspicious? (i.e. outliers)


$\rightarrow$ Locate the data that look suspicious and check their date (columns 1-4 contain year, month, day, hour). 

$\rightarrow$ Is there a physically plausible explanation for those outlying data? (i.e. they might not be outliers but the result of a physical process). Think about it, and search the internet if you need to check the timing. We'll discuss it together in our next session.
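
To locate suspicious points and read off their dates, one possible approach is to build a list of the rows that lie more than n standard deviations from the mean. The sketch below runs on a small synthetic matrix called `toy`, which only mimics the layout described above (year, month, day, hour in columns 1-4, sea level in column 6). All values in it are invented, and the threshold `nsigma = 3` is an assumption you may want to change.

```julia
using Statistics

# Synthetic stand-in for `data`: 48 hourly rows, 8 columns (values are invented)
nrows = 48
toy = zeros(nrows, 8)
toy[:, 1] .= 2004                                  # year
toy[:, 2] .= 1                                     # month
toy[:, 3] = [fill(1, 24); fill(2, 24)]             # day
toy[:, 4] = [0:23; 0:23]                           # hour
toy[:, 6] = 0.2 .* sin.(2pi .* (1:nrows) ./ 12)    # tide-like signal
toy[30, 6] = 2.5                                   # one artificial spike

level = toy[:, 6]
nsigma = 3                                         # assumed threshold

# Indices of the rows further than nsigma standard deviations from the mean
suspect = findall(abs.(level .- mean(level)) .> nsigma * std(level))

# Print year, month, day, hour and the value of every flagged row
for i in suspect
    println(Int.(toy[i, 1:4]), "  ", level[i])
end
```

With the real data, `toy` would be replaced by `data` and `level` by `sealevel`. Whether a flagged point is a true outlier is still for you to judge.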