<a href="https://colab.research.google.com/github/ds4geo/ds4geo/blob/master/WS%202020%20Course%20Notes/Session%205.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Science for Geoscientists - Winter Semester 2020**
# **Session 5 - Time Series - 4th November 2020**
In the previous sessions and assignments, we've covered basic data handling, manipulation and visualisation. In this session we will go deeper into working with time series, focusing on two frequently encountered topics: interpolation and filtering. We will use a monitoring dataset from Spanagel cave.

# Part 5.1 - Walkthrough of Session 4 LA-ICPMS exercise - *Walkthrough/Discussion*


# Part 5.2 - Cave monitoring data exercise part 1 - *Workshop*
We will work with cave monitoring data from Spanagel cave collected by Paul Töchterle and the UIBK Quaternary Research Group.

The data was collected to understand the cave circulation, which is important for interpreting speleothem-based palaeoclimate records from the cave. Three types of data were collected: 1. outside air temperature, 2. cave temperature, and 3. cave CO2 concentration together with oxygen and carbon isotope ratios.

The following poster gives an explanation of the project and the data: https://github.com/ds4geo/ds4geo/blob/master/data/timeseries/Spanagel_Poster.pdf

We will load the data and perform some important processing steps to enable further comparative analysis of the three datasets.

## Part 5.2.1 - Load the cave monitoring data
Three datasets are available: Cave temperature, Outside air temperature, and Cave CO2 measurements (concentration, d13C, d18O).

The data sets are located as follows:
* Cave air temperature:
 * "https://github.com/ds4geo/ds4geo/raw/master/data/timeseries/Au%C3%9Fenluft%2BEingangslabyrinth.xlsx"
 * Sheet: "Daten3"
 * Time column: A, data column: G
* Outside air temperature:
 * "https://github.com/ds4geo/ds4geo/raw/master/data/timeseries/Au%C3%9Fenluft%2BEingangslabyrinth.xlsx"
 * Sheet: "Daten3"
 * Time column: I, data column: J
* Cave CO2 measurements:
 * "https://github.com/ds4geo/ds4geo/raw/master/data/timeseries/CO2%20_compiled.xlsx"
 * Sheet: "Data Stream (2)"
 * Time column: A, d13C column: C, d18O column: D, ppm CO2 column: E

**Task**
Load all three datasets using Pandas.
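
As a starting point, the loading step could be wrapped in a small helper like the one below. This is only a sketch: `load_columns` and the column names passed to it are hypothetical, and it assumes the sheet has a single header row and that the time column is stored as Excel dates (so Pandas parses it as datetimes). Reading `.xlsx` files also needs the `openpyxl` package, which is assumed to be installed.

```python
import pandas as pd


def load_columns(url, sheet, cols, names):
    """Read selected spreadsheet columns into a DataFrame (hypothetical helper).

    url   : location of the Excel file
    sheet : sheet name, e.g. "Daten3"
    cols  : Excel column letters as a string, e.g. "A,G"
    names : new column names, in the same order as the columns appear in the sheet
    """
    df = pd.read_excel(url, sheet_name=sheet, usecols=cols)
    df.columns = names
    # Columns in the same sheet can have different lengths, so drop empty rows.
    return df.dropna(how="all")
```

For the cave temperature it might be called as `load_columns(url, "Daten3", "A,G", ["time", "cave_temp"])`, with `url` set to the address listed above.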

## Part 5.2.2 - Plot the data to get an overview
Make a few plots to get an overview of the data.

## Part 5.2.3 - Create a sub-set of the data for analysis
Create a sub-set of each dataset containing only the data between September and November 2015.

Hint: set the dataset indexes to the time column as we did in session 3 and use .loc to easily select time ranges.
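
The hint can be illustrated on made-up data. The sketch below uses a synthetic daily record (not the cave data) and hypothetical column names `time` and `temp`; the same pattern should apply to each real dataset once its time column has been parsed as datetimes.

```python
import numpy as np
import pandas as pd

# Synthetic daily record from June to December 2015 (placeholder values).
times = pd.date_range("2015-06-01", "2015-12-31", freq="D")
df = pd.DataFrame({"time": times, "temp": np.linspace(5.0, -5.0, len(times))})

# Use the time column as the index so that .loc can slice by date.
df = df.set_index("time")

# Date-string slices on a DatetimeIndex include both end points.
subset = df.loc["2015-09-01":"2015-11-30"]
print(subset.index.min(), subset.index.max(), len(subset))
```

Partial strings also work, so `df.loc["2015-09":"2015-11"]` selects the same three months.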

## Part 5.2.4 - Check the number of samples/sampling rate
To do further analysis, it helps if all three datasets have the same number of samples, with corresponding timestamps and the same sampling rate. Check the length and the sampling rate of each dataset.

Hint: you will need to use indexing, and the `DataFrame.value_counts()` method will be useful.
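
One possible way to check the sampling rate is to look at the differences between consecutive timestamps and count how often each time step occurs. The sketch below uses a small synthetic series with one irregular step; the variable names are illustrative only.

```python
import pandas as pd

# Synthetic record: mostly 30-minute steps, with one 60-minute gap.
times = pd.to_datetime([
    "2015-09-01 00:00", "2015-09-01 00:30", "2015-09-01 01:00",
    "2015-09-01 02:00", "2015-09-01 02:30",
])
series = pd.Series([1.0, 1.1, 1.2, 1.4, 1.5], index=times)

# Number of samples.
n_samples = len(series)

# Time step between consecutive samples (the first difference is undefined).
steps = series.index.to_series().diff().dropna()

# How many times each step length occurs.
step_counts = steps.value_counts()
print(n_samples)
print(step_counts)
```

A single entry in `step_counts` would indicate a regular sampling rate; several entries indicate gaps or a changing rate.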

# Part 5.3 - Timeseries and interpolation theory - *Mini-Lecture*
You will see from the workshop that the rate of data sampling is different in each of the datasets, and is even inconsistent within some of them. To analyse them further, we need them to be at the same rate, so we need to resample or interpolate them.

Super simple notes:

**Point vs Period:** Data representing a particular location or moment in time, or across a range. E.g. "daily" temperature data = temperature at 12 noon each day, vs average temperature across 24 hours.

**Resampling:** changing the positions or periods of data points (in e.g. time or spatial domain).

**Interpolation:** how you "get" values for positions or periods which were not directly measured.

**Upsampling:** fewer observations to more datapoints. E.g. creating daily data from weekly observations. Requires interpolation.

**Downsampling:** more observations to fewer datapoints. E.g. creating quarterly data from monthly observations. Requires interpolation or aggregation.

**Common interpolation methods:** linear, spline, nearest
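
To make these terms concrete, here is a minimal NumPy sketch on invented numbers: three weekly point observations are upsampled to daily values by linear interpolation, and the daily values are then downsampled to weekly period means by aggregation.

```python
import numpy as np

# Weekly point observations: day number and measured value (synthetic).
obs_day = np.array([0.0, 7.0, 14.0])
obs_val = np.array([10.0, 17.0, 3.0])

# Upsampling: estimate a value for every day by linear interpolation.
daily_day = np.arange(0.0, 15.0)
daily_val = np.interp(daily_day, obs_day, obs_val)

# Downsampling by aggregation: mean of each complete 7-day period.
# Days 0-6 form the first week and days 7-13 the second.
weekly_mean = daily_val[:14].reshape(2, 7).mean(axis=1)

print(daily_val)
print(weekly_mean)
```

Note that the weekly means are period data, so they differ from the original weekly point observations.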


# Part 5.4 - Introduction to SciPy - *Mini-lecture*
While Numpy handles multidimensional arrays and various mathematical operations on them, the SciPy library provides a wide range of more advanced functionality which is particularly useful for scientific data analysis.

It includes for example modules concerning linear algebra, regression/fitting, integration, signal processing, image manipulation and statistics.

SciPy can also refer to a collection of related libraries including Numpy, Pandas, Matplotlib and the SciPy library itself.

It contains a module called scipy.interpolate which we will use in the next section.

See here:
https://www.scipy.org/scipylib/index.html

The SciPy cookbook has many useful examples of using SciPy functions:
https://scipy-cookbook.readthedocs.io/





# Part 5.5 - Cave monitoring data exercise part 2 (resampling/interpolation) - *Workshop*
We will now resample the datasets so that they share the same sampling rate and the same sample times. Everything will be resampled to the sampling rate and times of the cave temperature record.

In [None]:
# For all of the below sections we will need the SciPy Interpolate module:
from scipy import interpolate

## Part 5.5.1 - Upsampling outside air temperature
The outside air temperature is sampled at a lower rate than the cave temperature, therefore we need to upsample the outside air temperature record.

Re-write the following pseudo-code to perform resampling by linear interpolation.

In [None]:
'''
1. Create an interpolation object containing the outside air temperature data.
We use "interp1d" (actually a class) from the SciPy interpolate module, where:
<A> is the time column of the outside air temp data
   we cannot interpolate using times directly, so "to_numpy().astype(float)" is
   a trick to convert the times to decimal numbers
<B> is the temperature column of the outside air temp data
<C> is the type of interpolation - look at the interp1d help documentation
<D> is the variable name to give to the interpolation object
'''
# <D> = interpolate.interp1d(<A>.to_numpy().astype(float), <B>, kind=<C>)

In [None]:
'''
2. Perform the interpolation by providing the times from the cave temperature
data set. These are the times where we want to calculate surface air temperatures.
<E> is the time column of the cave temp record, converted to decimal numbers as
     in <A>
<F> is the resampled time-series
'''
#<F> = <D>(<E>)

In [None]:
'''
3. Plot the resampled result and compare it to the original.
It will be helpful to visualise the datapoints themselves, so use data markers
in the plot.
'''
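
The three steps above can be sketched on synthetic data as follows. The series `outside` and the index `cave_times` are invented stand-ins for the real records, and `to_float` is a hypothetical helper. It casts to `datetime64[ns]` first, an extra precaution (not part of the pseudo-code) so that both sets of times are expressed in the same unit before being converted to floats.

```python
import numpy as np
import pandas as pd
from scipy import interpolate

# Synthetic "outside" record sampled every 6 hours.
outside = pd.Series(
    [0.0, 6.0, 12.0],
    index=pd.to_datetime(
        ["2015-09-01 00:00", "2015-09-01 06:00", "2015-09-01 12:00"]
    ),
)

# Synthetic "cave" sample times every 3 hours (the target times).
cave_times = pd.date_range(
    "2015-09-01 00:00", "2015-09-01 12:00", freq=pd.Timedelta(hours=3)
)


def to_float(index):
    """Convert datetimes to floats (nanoseconds since 1970). Hypothetical helper."""
    return index.to_numpy().astype("datetime64[ns]").astype(float)


# 1. Build the interpolation object from the known times and values.
f = interpolate.interp1d(to_float(outside.index), outside.to_numpy(), kind="linear")

# 2. Evaluate it at the target times.
resampled = pd.Series(f(to_float(cave_times)), index=cave_times)
print(resampled)
```

By default `interp1d` raises an error if a target time lies outside the range of the known times; its `bounds_error` and `fill_value` arguments control this behaviour.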

## Part 5.5.2 - Downsample the cave CO2 data
The cave CO2 data is at a much higher and variable sampling rate, so we will downsample it to the cave temperature sampling rate/timings.

Use the previous section as a guide. You will need to separately interpolate the CO2 concentration, the d18O and d13C data.

## Part 5.5.3 - Visualise data relationships
Now the data is directly comparable, we can easily create some simple visualisations (or calculate statistics).

Try making scatter plots of CO2 concentration against surface temperature, cave temperature, and the difference between cave and surface temperature.

## Part 5.5.4 - Test interpolation methods - *Workshop*
A number of interpolation methods exist besides linear interpolation. Experiment with and compare the different methods provided by interp1d. Which do you think are most appropriate in this context for the up- and downsampling and why?
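
A comparison could be set up along the following lines. This is a sketch on a synthetic zig-zag signal, and the three `kind` values shown are only a selection of those accepted by `interp1d`.

```python
import numpy as np
from scipy import interpolate

# Synthetic zig-zag observations.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.0, 0.0, 1.0, 0.0])

# Target positions: four times denser than the observations.
x_new = np.linspace(0.0, 4.0, 17)

# Interpolate with several methods and keep the results for comparison.
results = {}
for kind in ["nearest", "linear", "cubic"]:
    f = interpolate.interp1d(x, y, kind=kind)
    results[kind] = f(x_new)

# Print the value range produced by each method.
for kind, values in results.items():
    print(f"{kind:8s} min={values.min():.3f} max={values.max():.3f}")
```

Plotting each entry of `results` against `x_new`, together with the original points, makes the differences between the methods easier to judge.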

## Part 5.5.5 - Optional: Try out Pandas resampling and interpolation methods - *Workshop*
Pandas has DataFrame methods `.resample()` and `.interpolate()` which make resampling very easy, although they hide details that are important for understanding what is going on.

If you have time, try out the direct Pandas based approach.

See documentation:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.resample.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html

And examples here:

https://machinelearningmastery.com/resample-interpolate-time-series-data-python/
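
A minimal sketch of the Pandas approach on a synthetic, regularly sampled series is shown below. It assumes the series has a `DatetimeIndex`; the values and time steps are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic series sampled every 6 hours.
idx = pd.date_range("2015-09-01", periods=4, freq=pd.Timedelta(hours=6))
s = pd.Series([0.0, 6.0, 12.0, 18.0], index=idx)

# Upsample to 3-hourly steps and fill the new positions by linear interpolation.
up = s.resample(pd.Timedelta(hours=3)).interpolate(method="linear")

# Downsample to 12-hourly periods by averaging the samples in each period.
down = s.resample(pd.Timedelta(hours=12)).mean()

print(up)
print(down)
```

Here the downsampled values are period means rather than interpolated point values, which is one of the choices that `.resample()` makes easy to overlook.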

# Part 5.6 - Introduction to data filtering - *Mini-lecture*

* Most common: moving/rolling averages
* Make a moving/rolling "window", and take average of values in window for each position across dataset.
* Simple application in Numpy:
 * Filter of 1/n (n=filter length)
 * convolve filter with data

* Pandas has simple rolling/smoothing methods to explore yourselves.
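
The NumPy approach listed above can be sketched as follows, using a synthetic noisy sine wave in place of real data. The filter length of 11 is an arbitrary choice.

```python
import numpy as np

# Synthetic noisy signal: a sine wave plus random noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0 * np.pi, 200)
noisy = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Moving average filter: n equal weights of 1/n, so the weights sum to 1.
n = 11
kernel = np.ones(n) / n

# Convolve the filter with the data; mode="same" keeps the original length.
smoothed = np.convolve(noisy, kernel, mode="same")
```

With `mode="same"` the ends of the data are padded with zeros, so the first and last few smoothed values are biased towards zero. Using `mode="valid"` avoids this by returning only the positions where the window fits entirely inside the data, at the cost of a shorter output.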


# Part 5.7 - Data filtering exercise - *Workshop*
We will apply a simple moving average filter on a noisy dataset to smooth it.

You can either use a dataset of your choice (from the student submitted datasets), or use the following XRF core scanning data from an Antarctic sediment core: "https://doi.pangaea.de/10.1594/PANGAEA.859980?format=textfile" (V is particularly noisy).

1. Construct a filter: a 1D NumPy array of length n in which each value is 1/n (so its sum is 1). The length must be an odd number.

2. Apply the filter by convolving it with the data to be filtered using np.convolve.

3. Plot the original and smoothed data.

4. Experiment with different filter lengths.


# Part 5.8 - Advanced filtering - *Mini-lecture*
* Moving average filters give equal weight to all data in the window.
* However, we often want to weight central points more than distant ones.
* Many more complex filter shapes exist, e.g. those in assignment task 1.
* Filtering is also possible in frequency domain with Fourier Transforms – not addressed in this course.
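
As an illustration of a weighted filter, the sketch below compares a flat moving average with a Hann window from NumPy, both normalised so that their weights sum to 1. The choice of the Hann window and the length of 11 are arbitrary, and the data is synthetic.

```python
import numpy as np

n = 11

# Flat moving average: every point in the window has the same weight.
flat = np.ones(n) / n

# Hann window: weights peak at the centre and fall to zero at the edges.
hann = np.hanning(n)
hann = hann / hann.sum()  # normalise so the weights sum to 1

# Apply both filters to the same synthetic noisy signal.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0 * np.pi, 200)
noisy = np.sin(x) + rng.normal(scale=0.3, size=x.size)

smooth_flat = np.convolve(noisy, flat, mode="same")
smooth_hann = np.convolve(noisy, hann, mode="same")
```

Because the outer weights of the Hann window are close to zero, it smooths less strongly than a flat window of the same length.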


# Part 5.9 - Week 5 Assignment

**Task 1**

Numpy provides some more complex filters here: https://numpy.org/doc/stable/reference/routines.window.html (window means filter in this context)
Create a notebook where you apply these filters, as well as moving average filters, to a noisy dataset. Experiment also with different filter lengths and comment on the differences and value of each method.

**Task 2**

Downsampling a noisy dataset by a large factor (e.g. 10-100x), either by interpolation or by taking every nth data point, often gives a poor representation of the original data. One solution is to first filter the data, then downsample the filtered dataset.

Look back to the NGRIP dataset from Session 1. Resample it to 1 sample every 1 ky. First use the method we used above for the cave CO2 data, then try pre-filtering it (try different filters/sizes). Compare and discuss the difference between the two approaches.
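
The general idea of filtering before downsampling can be sketched on synthetic data as below. This is not the NGRIP analysis itself: the signal, the noise level, the downsampling factor of 20 and the moving average filter are all assumptions made for illustration.

```python
import numpy as np

# Synthetic record: a slow sine wave with strong noise.
rng = np.random.default_rng(1)
t = np.arange(1000)
signal = np.sin(2.0 * np.pi * t / 500.0)
noisy = signal + rng.normal(scale=0.5, size=t.size)

step = 20  # keep 1 sample in every 20

# Approach 1: take every nth point of the noisy data directly.
naive = noisy[::step]

# Approach 2: smooth with a moving average (odd length 21), then downsample.
kernel = np.ones(step + 1) / (step + 1)
smoothed = np.convolve(noisy, kernel, mode="same")
filtered = smoothed[::step]

# Compare both with the noise-free signal, skipping the end points,
# which can be affected by the zero padding of mode="same".
truth = signal[::step]
err_naive = np.sqrt(np.mean((naive[1:-1] - truth[1:-1]) ** 2))
err_filtered = np.sqrt(np.mean((filtered[1:-1] - truth[1:-1]) ** 2))
print(err_naive, err_filtered)
```

With real data there is no noise-free signal to compare against, so the comparison has to be made visually or against the full-resolution record.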

**Submission**
* Submit the assignment here: https://github.com/ds4geo/ds4geo_ws2020/tree/master/Assignments/Session%203
* Create a new Colab notebook via Google Drive, then save it to the submission repository using "save a copy to GitHub". See here:
 * https://github.com/ds4geo/ds4geo/blob/master/Github%20Assignment%20Readme.md
* The **deadline** is 23:59 on 10th November 2020.
* This assignment comprises 5% of the assessment for the course. Marks are split between the two tasks and are awarded for effectively running the analyses, visualising, and discussing the results.
