# Geospatial time series

*****

This notebook shows some ideas about time series analysis in a geospatial context. 

In <a href="#part1">**Part 1**</a>, we look at:
- what the temporal dimension is
- how you can access the data over this temporal dimension
- account for spatial variability
- mitigate missing data


In <a href="#part2">**Part 2**</a>, we introduce temporal statistics you can derive and how you calculate them:
- trends
- variability
- deviation from the mean
<!-- - seasonality -->


<!-- In the **3. step** you will learn about some built-in and self-made methods that aim to investigate in more detail the points of step 2, including:
- moving windows
- statistical tests (a small selection of tests that aim to show different temporal aspects) -->


In <a href="#part3">**Part 3**</a>, the ideas from parts 1 and 2 are extended to:
- compare different pixels (different time series)

By the end, you will be able to use the information stored in all 3 dimensions (x and y - spatial, and z - time). These results can often be presented as **maps**, which are covered in a different notebook. 
*****


In [1]:
%%html
<style>
    .dothis{
    font-weight: bold;
    color: #ff7f0e;
    font-size:large
    }
</style>

In [None]:
# Make sure the script is using the proper kernel
try:
    %run ../swiss_utils/assert_env.py
except:
    %run ./swiss_utils/assert_env.py

In [None]:
# Import modules

# reload module before executing code
%load_ext autoreload
%autoreload 2


# define modules locations (you might have to adapt define_mod_locs.py)
%run swiss_utils/define_mod_locs.py

import os
import shutil
import xarray as xr
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
from matplotlib.patches import Polygon, Rectangle

# from swiss_utils.data_cube_utilities.sdc_utilities import load_multi_clean

# import datacube
# dc = datacube.Datacube()

# silence warning (not recommended during development)
import warnings
warnings.filterwarnings("ignore")

# AND THE FUNCTION
# from swiss_utils.data_cube_utilities.sdc_utilities import indices_ts_stats

plt.rcParams['figure.figsize'] = (16,8)       # this line changes the size of the figures displayed in the notebooks

<hr style="border-top:8px solid black" />

# Part 0: Preparing/downloading our data

We will use a pre-prepared small data subset around Fribourg which we extracted from the Swiss Data Cube for you earlier. <span class="dothis">Download this dataset by running the next cell.</span> After a short while you should see the .nc file appear in the file explorer pane on your left (you may need to click the 'Refresh' button).

<span style="color:gray; font-style:italic">We made this data subset using `time_series_data_preparation.ipynb`. You might find this approach useful when doing your project work.</span>

In [None]:
nc_filename = "ls8_lasrc_swiss_fribourg_example.nc"
import os
if os.path.exists(nc_filename):
    print('File already downloaded.')
else:
    print('Downloading...')
    import requests
    URL = "https://drive.switch.ch/index.php/s/D8mj6rg6VQvlbAw/download"
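    response = requests.get(URL)
    response.raise_for_status()  # stop here if the download failed
    with open(nc_filename, "wb") as f:  # the file is closed automatically
        f.write(response.content)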
    print('Done.')

In [None]:
# Open the prepared Landsat 8 subset for the Fribourg region 
ds = xr.open_dataset('ls8_lasrc_swiss_fribourg_example.nc', engine='netcdf4')

In [None]:
# ds - the dataset
ds

<hr style="border-top:8px solid black">
<a name='part1'></a>

# Part 1: Temporal Data
## Time components

Of special interest for us is the `time` dimension. Its components are explained in more detail here: https://docs.xarray.dev/en/stable/user-guide/time-series.html#datetime-components.
`time` has multiple attributes that will later allow you to select data of interest. You can first have a look at all the time steps in the dataset by calling `<xarray.DataArray>.time`. In the cell below you will see that the time of each scene is stored in a very detailed format:
- 2013-04-18T10:18:18.000000000

with:
- 2013 - year
- 04 - month
- 18 - day
- 10:18:18.000000000 - Hour:Minute:Second



In [None]:
ds.time
# ds["time"]  # will yield the same output / different way of writing it

We can access the individual parts using the same writing but with an additional `.dt` followed by the attribute of interest:

In [None]:
# examples
ds.time.dt.month
# ds.time.dt.day
# ds.time.dt.year
# ds.time.dt.season

***
> NOTE: The date/time string is in a format that we understand (years, months, days, etc.). Inside a computer, the date/time is represented as a numeric value. A standard way is to represent any date as the number of days since "1970-01-01". This makes it possible to convert the date/time string into something meaningful for the computer.
***

In [None]:
from matplotlib import dates

print(dates.date2num(np.datetime64('1850-11-17 13:12:11')))
print(dates.date2num(np.datetime64('1970-01-01 00:00:00')))  # this is the standard time starting point
print(dates.date2num(np.datetime64('2022-11-17')))

# The output unit is [days since start]
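The same "days since 1970-01-01" number can be reproduced by hand with plain `numpy`. This is only a minimal sketch to illustrate the idea; it assumes the epoch "1970-01-01" (the default in recent `matplotlib` versions) and uses a date chosen for illustration.

```python
import numpy as np

epoch = np.datetime64("1970-01-01")
date = np.datetime64("2022-11-17")

# Subtracting two datetime64 values gives a timedelta64;
# dividing by one day converts it into a plain float
days_since_epoch = (date - epoch) / np.timedelta64(1, "D")
print(days_since_epoch)
```

The result should match what `dates.date2num` returns for the same date, as long as the epoch is the same.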

## Time series

In this example, the DataArray `ndvi` stores, for every pixel (x,y / lon,lat), the evolution of the Normalized Difference Vegetation Index (NDVI) along the time axis. 

The map below shows the average value (`.mean()`) over the time axis (`dim="time"`) for all scenes (images) available in September (`month==9`) for the year 2013 (`year==2013`).

This example reduces the dimensions of the DataArray (`ndvi`) in 3D from:
- "time"
- "latitude"/y
- "longitude"/x

to 2D:
- "latitude"/y
- "longitude"/x

You can see this by just calling the objects `ndvi` and `xs` respectively:

In [None]:
ndvi = ds.ndvi
ndvi

In [None]:
xs = ds.ndvi.sel(time=np.logical_and(ds.ndvi.time.dt.month == 9, ds.ndvi.time.dt.year == 2013)).mean(dim="time")
xs

In [None]:
xs.plot.imshow(vmin=0, vmax=1, cmap=cm.BrBG)

For time-series analysis, the temporal information of the `"time"` dimension is relevant. Extracting a time series for a pixel is like using a cookie cutter to cut a piece of lasagna through all its layers.
Because the 3D DataArray looks like a 3D matrix, one can use indices for the rows and columns. The `xarray` and `geopandas` object types also allow selecting pixels using the `longitude` and `latitude` information.

From the information in the `config_cell.txt` file and from the cell output above, you can see that the spatial extent is:
- xmin = 7.15 ºE
- xmax = 7.25 ºE
- ymin = 46.7 ºN
- ymax = 46.8 ºN

In [None]:
# a point in the middle of the study area
point_coords = [7.195, 46.802]  

In [None]:
# in which "dimensions" is the information stored?
ndvi.dims

In [None]:
# With the .sel() method you select certain data. You define the dimension (dimension name)
# in which the value should be looked for. In the example these are "longitude" and "latitude"

da = ndvi.sel(
    latitude=point_coords[1],      # point_coords[1] is the 2nd entry: the latitude (46.802)
    longitude=point_coords[0],     # point_coords[0] is the 1st entry: the longitude (python indices start counting at 0!)
    method="nearest"               # the nearest method finds the 1 closest pixel
             )

In [None]:
# Look at the output, the dimensions have been reduced. Lon and Lat are only single values and are not dimensions any more
da.dims

In [None]:
da
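To see what `method="nearest"` does, here is a small sketch on a made-up 3 x 3 grid. The grid, its coordinates and its values are hypothetical and only serve as illustration; the dimension names follow the ones used in this notebook.

```python
import numpy as np
import xarray as xr

# Hypothetical 3 x 3 grid (values are made up for illustration)
lats = [46.80, 46.79, 46.78]
lons = [7.18, 7.19, 7.20]
grid = xr.DataArray(
    np.arange(9).reshape(3, 3),
    coords={"latitude": lats, "longitude": lons},
    dims=("latitude", "longitude"),
)

# The requested point lies between grid cells; "nearest" snaps to the closest one
pixel = grid.sel(latitude=46.792, longitude=7.187, method="nearest")
print(float(pixel.latitude), float(pixel.longitude), int(pixel))
```

Note that swapping latitude and longitude does not raise an error with `method="nearest"`: you would simply get a pixel at the edge of the grid, so it is worth checking the coordinates of the returned pixel.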

In [None]:
da.plot.line('o-')

The above plot shows for one pixel each time step as a blue point. The points are connected if consecutive time steps are not separated by a data gap. One can also select more pixels. The following example shows for the same latitude the selection of points along 5 different longitudes.

In [None]:
# [7.195, 46.802]  
x_coords = np.arange(7.195, 7.2, .001)
x_coords

In [None]:
y_coord = 46.802              # fixed latitude
da = ndvi.sel(  
    latitude=y_coord, 
    longitude=x_coords,
    method="nearest"          # the nearest method finds the 1 closest pixel
             )

In [None]:
da

In [None]:
# Overview map with positions indicated by circles
fig,ax = plt.subplots(1)

# First plot the mean NDVI of the whole time series as a map
ndvi.mean(dim='time').plot.imshow(vmin=0,
                   vmax=1,
                   cmap=cm.BrBG)

# On top of the map, show the locations of our positions
ax.plot(x_coords, 
     np.repeat(y_coord, len(x_coords)),
     marker='o',
     linestyle='none',
     fillstyle='none',
     color='red',
     markersize=8)

# Draw a box around the positions
ax.add_patch(Rectangle((7.192, 46.8), .01, .005,facecolor="#FF000022", edgecolor='r'))


Plotting the individual time series for the selected 5 points in the map above:

In [None]:
da.plot.line(x='time', marker='o', markersize=1)

The plot shows that the 5 locations have some variability. They share many data gaps, which could indicate that these result from cloud cover.
***


## Reducing / Preparing data
A single data point (in our case pixel) might show a lot of missing data and might not be fully representative for a landscape unit, like a forest, a lake, a city, or a crop field. One way to account for this is to select ***multiple pixels*** and calculate their mean. The following cells give some examples to calculate a spatial mean.


<!-- ### loc - Locate/Select (rows and columns)
As already shown two cells above, the `.loc[]` method. It is a **method/function** even though it uses **square brackets** - this is to highlight that its purpose is *selecting* parts of the dataset in a similar way as for matrices and other *standard python objects*. `.loc[]` selects rows and columns by its names. In a `xarray.DataArray` the **row and column names (spatial)** are the latitudes (y) and longitudes (x). By default, time takes always the first dimension.
 -->
 
### `.sel()` - masking data entries of interest

Here we use the `.sel()` method. It takes keyword arguments, which for `.sel()` are the names of the `dimensions`. This makes the method very easy to apply because we *explicitly* define the dimensions on which to perform the selection. In the example these keyword arguments can be one, some, or all of the dimensions:

- 1st dimension: `"time"`
- 2nd dimension: `"latitude"`
- 3rd dimension: `"longitude"`

To tell the method which selection we want to have, we define a **single value** to look for (e.g. `time='2019-10-30'`), or a **range** (`longitude=slice(7.192, 7.193)`). The `slice()` function is interpreted directly by `.sel()` to know that all the values between the first (7.192) and the last value (7.193) should be found.


**Examples**

Specific dates and date ranges:
- `time='2019-10-12'` - one date - will find a time step and its values only if there is data on that day!
- `time='2019-10-13'` - this one will not find anything because there is no data on that day.
- `time='2019-10-13', method="nearest"` - one date, and method='nearest' because the exact time entry is: `2019-10-12T10:17:17`. This will return the value from the day before.
- `time=slice('2019-10-11', '2019-10-13')` - this one will return all the entries in the time `slice`
***
All dates of the same month:
- `time=ndvi.time.dt.month==4` - select all time steps where the month is April
- `time=ndvi.time.dt.month.isin([1, 2, 3])` - select all time steps where the month is either January, February, or March
***
Spatial:
- `latitude=46.7` - the point at latitude 46.7. If no pixel has exactly this latitude, xarray raises a `KeyError`.
- `latitude=46.8, method="nearest"` - the latitude closest to the value 46.8
- `latitude=slice(46.81, 46.8)` - all latitudes between 46.8 and 46.81 (higher value first, because the latitudes are stored from north to south)
- `longitude` - same as for latitude, but with the lower value first in a `slice()`

You can combine exact and "nearest" selections by using two `.sel()` operations:

`ndvi.sel(time='2019-10-13', method='nearest').sel(latitude=slice(46.81, 46.8))`


<span class='dothis'>Now try out a different date, with and without the `method='nearest'`, and a `slice(<date-start>, <date-end>)` operation.</span>

In [None]:
da = ndvi.sel(time='2019-10-12')
da

The following cell is the one that was introduced before. It will take **all time steps**, the **latitudes** between 46.805 and 46.8 (reverse order because latitudes increase from bottom to top, but image coordinates increase from top to bottom), and **longitudes** from 7.192 to 7.193. No `method='nearest'` is required here.

In [None]:
da = ndvi.sel(                                # no time is defined --> all time steps
              latitude=slice(46.805, 46.8),   # latitudes from 46.80 to 46.805
              # NOTE: the order ^  ,  ^   is the higher and then the lower latitude. That is
              # because the image coordinates go from top to bottom, but latitudes
              # go from south to north --> bottom to top. That's why they are reversed
              longitude=slice(7.192, 7.193))  # longitudes from 7.192 to 7.193
da
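The effect of the slice order can be checked on a tiny example. The sketch below assumes a latitude axis that is stored from north to south, as in this dataset; the coordinate values themselves are made up.

```python
import numpy as np
import xarray as xr

# Hypothetical latitude axis stored from north to south (descending)
lat = xr.DataArray(
    np.arange(5),
    coords={"latitude": [46.82, 46.81, 46.80, 46.79, 46.78]},
    dims="latitude",
)

north_to_south = lat.sel(latitude=slice(46.81, 46.79))  # matches the storage order
south_to_north = lat.sel(latitude=slice(46.79, 46.81))  # opposite order

print(north_to_south.size, south_to_north.size)
```

A slice in the "wrong" order does not raise an error; it silently returns an empty selection, which is easy to overlook.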


***
`.sel()` also allows selecting certain months, seasons, or years by asking where the **time components** match a condition. In the example below the expression `ndvi.time.dt.month==4` asks where the `month` component matches the value `4` (April). <span class='dothis'>Try to select all time steps from the `ndvi` DataArray that correspond to summer `JJA` (June, July, August) using the **time component**:`.time.dt.season`.</span>

In [None]:
# Only the time dimension values are shown because of the additional ".time" at the end. For the whole dataset, remove this ending.
ndvi.sel(time=ndvi.time.dt.month==4).time

*** 
Often there are many ways that lead to the desired result. There is not a single ***correct way*** to do many things. The following example will extract all the values for the months December and January using three different methods. The results are stored in the variables `a`, `b`, and `c` and plotted to showcase if they are identical.


>
<font color=blue> The `.sel()` method is the most flexible one - it can be used for almost all selection purposes for time series analysis.</font>

In [None]:
# All steps with January or December
a = da.loc[np.logical_or(da.time.dt.month==12, da.time.dt.month==1),: ,: ]
b = da.where((da["time.month"]==12) | (da["time.month"]==1)) 
c = da.sel(time=np.logical_or(da.time.dt.month == 12, da.time.dt.month==1))

# Showing that the three methods find the same data points
a.sel(latitude=46.8, longitude=7.19, method='nearest').plot.line('o',markersize=15,c='black', label='da.loc[]')
b.sel(latitude=46.8, longitude=7.19, method='nearest').plot.line('o',markersize=10,c='yellow',label='da.where()')
c.sel(latitude=46.8, longitude=7.19, method='nearest').plot.line('o',markersize=5, c='red',   label='da.sel()')
plt.legend()

alen = len(a.sel(latitude=da.latitude.values[0],longitude= da.longitude.values[0]))
blen = len(b.sel(latitude=da.latitude.values[0],longitude= da.longitude.values[0]))
clen = len(c.sel(latitude=da.latitude.values[0],longitude= da.longitude.values[0]))
print('Length of `time` with .loc[] is :', alen)
print('Length of `time` with .where() is :', blen)
print('Length of `time` with .sel() is :', clen)
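The difference in length between `.where()` and `.sel()` can be reproduced on a small made-up series. This is a sketch with hypothetical values (one monthly value per month of 2020), not the notebook's data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical monthly series covering one year
time = pd.date_range("2020-01-01", periods=12, freq="MS")
series = xr.DataArray(np.linspace(0.1, 0.9, 12), coords={"time": time}, dims="time")

winter = series.time.dt.month.isin([12, 1])

selected = series.sel(time=winter)   # drops all other time steps
masked = series.where(winter)        # keeps them, but filled with NaN

print(selected.sizes["time"], masked.sizes["time"], int(masked.notnull().sum()))
```

Both approaches find the same valid data points; `.where()` only keeps the original shape, which can be useful when arrays have to stay aligned.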

### `.groupby()` - apply operations over dimensions 

The `groupby` function takes one **dimension** and identifies its unique entries. All data that share the same entry are grouped together, e.g. for **longitude** all pixels with a longitude of 7.192 form one group. 
After the grouping, an operation has to be applied, e.g. `.mean()` to average all the values in each group. If there is **more than one dimension** over which this operation can be applied, one has to specify which one to use. See the examples below.


In [None]:
# mean values for "vertical stripes", i.e. per longitude take all the values of the different latitudes and average them
da_grpByLon = da.groupby('longitude').mean('latitude')
# Note in the output that the dimension latitude disappeared, and we only have time and longitude left
da_grpByLon

In [None]:
da_grpByLon.plot.line(x='time')

One can specify also multiple dimensions (axes) over which to apply the function. The next example applies the `.mean()` for each time step (identified with `.groupby()`), by averaging over both spatial dimensions/axes `('longitude', 'latitude')`.


In [None]:
# mean values over the entire "red box" that is shown in the map earlier
da_grpByLonLat = da.groupby('time').mean(('latitude', 'longitude'))
# da_grpByLonLat

In [None]:
# The resulting plot shows the average value for each time step in the red box shown in the map.
da_grpByLonLat.plot.line()
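What the spatial averaging does with missing values can be seen on a tiny made-up cube. The sketch below writes the reduction directly with `.mean()` over both spatial dimensions; the values and the single `NaN` pixel are hypothetical.

```python
import numpy as np
import xarray as xr

# Hypothetical cube: 2 time steps, 2 x 2 pixels, one cloudy (NaN) pixel
cube = xr.DataArray(
    np.array([[[0.2, 0.4], [0.6, 0.8]],
              [[0.1, np.nan], [0.3, 0.5]]]),
    dims=("time", "latitude", "longitude"),
)

# One value per time step; NaN pixels are skipped by default
spatial_mean = cube.mean(dim=("latitude", "longitude"))
print(spatial_mean.values)
```

The second time step is averaged over the three valid pixels only, so a spatial mean can still be calculated when some pixels are missing.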

### `.resample()`
Similar to groupby, `.resample()` will generate a new dataset by e.g. averaging over the `"time"` dimension. The main purpose of `.resample()` is to create new temporal resolutions/aggregations:
- `D` - daily
- `M` - monthly
- `Y` - annual
- `Q` - quarterly (3 months)
- ... and more

<span class='dothis'>Resample the `da` DataArray to annual and quarterly values.</span>

In [None]:
# da_monthly = da.resample(time='M').mean()

# With 'M', the new monthly dates always fall on the last day of the month.
# To have the dates be the first day of the month, one can add "S" to the resample argument string:

da_monthly = da.resample(time='MS').mean()

da_monthly

In [None]:
# da_annual = da.resample(time='AS').mean()
# da_annual.sel(longitude=7.193 , latitude=46.8, method='nearest').plot.line()

# da_seasonal = da.resample(time='QS').mean()
# da_seasonal.sel(longitude=7.193 , latitude=46.8, method='nearest').plot.line()
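To check what `.resample()` produces, it helps to try it on a series where the answer is easy to calculate by hand. The sketch below uses a hypothetical daily series for the first quarter of 2020 whose values are simply the day counter.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical daily series for January to March 2020 (91 days)
time = pd.date_range("2020-01-01", "2020-03-31", freq="D")
daily = xr.DataArray(np.arange(time.size, dtype=float), coords={"time": time}, dims="time")

# "MS" labels each monthly mean with the first day of the month
monthly = daily.resample(time="MS").mean()
print(monthly.time.values)
print(monthly.values)
```

Depending on the installed `pandas` version, some aliases have newer names (for example `ME` instead of `M`), so a deprecation warning for the older spelling is possible.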

### Combining different methods
The methods `.groupby()`, `.resample()`, and `.sel()` will always create a new `xarray.DataArray` as an output. As such, one can again apply the previously mentioned methods.
This allows for running multiple processing steps in one line. In the following example, **(1)** the pixels are averaged over the spatial dimensions for each time step, and **(2)** the resulting time series is resampled to monthly means. In the end, there will be one time series (only time dimension).

In [None]:
daa = ndvi.sel(latitude=slice(46.805 , 46.8),
               longitude=slice(7.192, 7.193)).groupby('time').mean(('latitude', 'longitude')).resample(time='MS').mean()
                                                             

In [None]:
daa

In [None]:
daa.plot.line('-o')

<hr style="border-top:8px solid black">
<a name='part2'></a>

# Part 2: Trends, variability, deviation
This section will first introduce the ideas behind these three keywords. Then there will be examples on how to extract them and what pre-processing is needed.

## Trends

In our day-to-day life the words **trend** and **tendency** can often be used interchangeably; in the context of climate, however, they are different. 

> *Climate describes the average weather conditions for a particular location and over a long period of time.
> $[...]$ climate normals—30-year historical averages of variables like temperature and precipitation [...]* (WMO 2022)

The main use: a **trend** is a **statistically significant change** over time in our variable. In other words, over a time period in a climatological context (>= 30 years), the values increase or decrease, and there is only a very small probability that this is observed by chance. If there is less than **30 years** of data available, one can use the term **tendency** (this is not an agreed-upon term!!!) to highlight that this is not a climatological context.

However, if we **de-trend** our data, it means that any systematic increase or decrease in values over time is removed. This is independent of whether the change over time is **significant or non-significant**.

***

In order to be statistically significant the following must be fulfilled:
- $p \le \alpha$

The significance level $\alpha$ is chosen by us (usually 5%). In the example of a trend, it is the probability of concluding that there is a change over time even though in reality there is none. *The general definition: it is the probability of rejecting the Null-hypothesis $H_{0}$ (=no trend), given that the Null-hypothesis is true.*
>In other words, we accept making a mistake in 5% of the cases by assuming there is a trend even though there is no trend. Lowering the value of $\alpha$ makes us more sure there is really a trend, but it also makes it more difficult to find one that is not as obvious.

The $p-value$ (sometimes written only $p$) is the result of a statistical test. In the example of a trend, it is the probability of seeing a change over time at least as extreme as the one we observe, assuming ($H_{0}$) there is no real trend. *The general definition: it is the probability of obtaining a result at least as extreme, given that $H_{0}$ is true.*

> Example: $H_{0} =$"no trend", $\alpha=0.05$, $p-value=0.0231$ (outcome of our analysis). Because $p \le \alpha$, we reject $H_{0}$ and accept $H_{1}$ the alternative hypothesis that there is a trend. The result is statistically significant at our chosen significance level $\alpha=0.05$. The lower the $p-value$, the less likely that an identified trend was identified even though in reality there is no trend. Reversing the wording: The lower the $p-value$, the more likely there is a real trend.
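The decision rule $p \le \alpha$ can be tried out on simulated data. The following is only a sketch: the two series are randomly generated (one with a built-in increase, one without), and it uses the `linregress` function from `scipy.stats`, the same function this notebook uses further down.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(42)
x = np.arange(100)                        # hypothetical time axis (e.g. days)
noise = rng.normal(0, 0.05, size=x.size)  # random fluctuations

y_trend = 0.01 * x + noise                # clear increase over time
y_flat = noise                            # no trend built in

alpha = 0.05
p_trend = linregress(x, y_trend).pvalue
p_flat = linregress(x, y_flat).pvalue

print("with trend:    p =", p_trend, "significant:", p_trend <= alpha)
print("without trend: p =", p_flat, "significant:", p_flat <= alpha)
```

If you repeat the second case with many different random seeds, you can expect roughly 5% of the runs to come out as "significant" by chance. That is exactly the error rate we accept by choosing $\alpha=0.05$.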

***
## Calculating trends/tendencies 
There are various packages for the calculation of trends and associated statistics. While `xarray` directly provides a function for fitting over time, `.polyfit()`, it does not directly provide statistics like the $p-value$. Instead we use the `linregress` function from the `scipy.stats` package.

In order to determine the change over time, it is necessary to represent the time in a continuous format. As shown in part 1, there are functions that convert the date/time string into a numeric format. In order to convert the date/time string to numeric and vice versa, you have two functions in the next cell. 

In [None]:
from matplotlib.dates import date2num, num2date

# two functions to convert back and forth
def xr_date2num(time):
    return date2num(time)

def xr_num2date(time_numeric):
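    # transforms the num2date output (days since ...) into datetime64 (seconds since ...);
    # num2date returns timezone-aware datetimes, so the timezone info is dropped first
    return np.array([np.datetime64(d.replace(tzinfo=None)) for d in num2date(time_numeric)])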


In [None]:
# You can check that the two functions work as intended:

# # forward: date to numeric
# xr_date2num(daa.time)
# # backward:
# xr_num2date(xr_date2num(daa.time))

### `linregress()`
`linregress` takes two arrays of values (x and y) to check if there is a linear relationship between them. `x` will be time and `y` the NDVI values. If there is a significant relationship between them, it means there is a significant change over time; in other words a statistically significant `trend` or `tendency`.

The output of `linregress` is multiple statistics. You can check them by uncommenting the `linregress?` line in the next cell. These are in short:

- Slope of the regression line.
- Intercept of the regression line.
- Pearson correlation coefficient. The square of `rvalue` is equal to the coefficient of determination.
- p-value for a hypothesis test whose null hypothesis is that the slope is zero, using Wald Test with t-distribution of the test statistic. 
- and others

***
#### Slope
The **slope** says how much the change is **per time unit**. If we use monthly data, then the change would normally be **change per month**. However, <span style="color:red"> we use the transformed time in days! So we get the slope in units of NDVI per day</span>. 

#### Intercept
The **intercept** is graphically the point where the regression line crosses the y-axis at **x=0**. This statistic is **only of interest for the graphical interpretation** in our case. <span style="color:red"> But keep in mind that time starts at "1970-01-01"</span>.

#### Correlation coefficient (r) and coefficient of determination (r$^{2}$)
The **coefficient of determination r$^{2}$** tells us how much of the variance is explained by the regression, while the **correlation coefficient r** gives the strength and direction of the linear relationship. In other words, they describe how well our regression explains the relationship. With enough data points, you will have a **low $p-value$ if the r$^{2}$ is high**. But: <span style="color:red"> You can have a **low $p-value$ but also a low r$^{2}$**.</span>

#### p-value
Statistically, maybe the most important outcome. **Is there really a change over time, or do we see something by chance?**. The lower, the more robust/striking.
***

In [None]:
from scipy.stats import linregress
# linregress?

### Removing non-valid data points `NaN`
The `linregress` function is very strict with regards to missing data. We can only use data where there are no missing values (`NaN`). The next cell filters them away.

In [None]:
# our values for y and x
y = daa.values  
x = xr_date2num(daa.time.values)

# this checks if the value is a valid numeric data point
clean_mask = np.isfinite(y)  

# the mask has the indices of valid data in y.
# you can compare the before and after:

# y
# y[clean_mask]

# The cleaning is applied to both:
# - time
# - ndvi
# so that they have the same length
y_clean = y[clean_mask]
x_clean = x[clean_mask]
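How the boolean mask works is easiest to see on a handful of values. The arrays below are made up for illustration.

```python
import numpy as np

y = np.array([0.2, np.nan, 0.4, np.nan, 0.6])   # hypothetical NDVI values
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])         # hypothetical time values

mask = np.isfinite(y)                # True where y is a valid number
x_clean, y_clean = x[mask], y[mask]  # the same mask keeps x and y aligned

print(mask)
print(x_clean, y_clean)
```

Applying the same mask to both arrays is the important part: filtering only `y` would leave the two arrays with different lengths.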

In [None]:
# Finally the regression
result = linregress(x_clean, y_clean)
print(result)

# We are only interested in
# - slope 
# - intercept (only graphical)
# - p-value
# - r-value

*** 
The output shows us that $p > \alpha$. The change over time is thus not statistically significant.

The information on what the optimal regression (*ordinary least squares*) looks like is stored in the `result` object. The `slope` and `intercept` can be accessed with `result.slope` and `result.intercept`, the $p-value$ with `result.pvalue`, and $r$ with `result.rvalue`.

The line can be created by plotting the x-values (time) against the values calculated from the simple formula for a line:

$y = m*x + b$,

where $m$ is the slope, $b$ is the intercept, $x$ are the time values (in days), and $y$ are the NDVI values.
***

The following is an example of a non-significant trend:


In [None]:
daa.plot.line()
# add the regression line
m = result.slope
b = result.intercept

# back-transform the time? Not needed, because matplotlib knows!
y_pred = m * x + b     # the predicted values; using all time steps (non-filtered)
plt.plot(x, y_pred, 'bo-', markersize=3)

# But you can try. Comment out the command before, and use the following two lines instead:
# x_rev = xr_num2date(x)
# plt.plot(x_rev, y_pred)

### Slope units
As mentioned before, we transform the time. If we look at the `slope` value, we can see a value of `-4.828947187444238e-06`. Since the transformed time values have the units of **days**, this value indicates a change of **-4.828947187444238e-06 per day**.

We can check quickly by looking at the predicted values and corresponding time entries:

In [None]:
print(result.slope)

In [None]:
t0 = x[0]
t1 = x[1]

dt = t1 - t0
print('The difference in days between the two time steps:',dt)

# the predicted NDVI values
ndvi_predicted = m * x + b
ndvi_pred0 = ndvi_predicted[0]
ndvi_pred1 = ndvi_predicted[1]

dndvi = ndvi_pred1 - ndvi_pred0
print('The difference in NDVI from the regression between the two time steps:',dndvi)

rate = dndvi/dt
print('The slope is:',rate)

# Quick check if this rate is the same as from the regression (ratio=1):
print("This should be close to 1:",rate/result.slope)
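A slope per day is hard to interpret for slow changes, so it can help to convert it to a longer time unit. The sketch below takes the slope value quoted above and assumes an average year length of 365.25 days; that conversion factor is our assumption, not something the regression provides.

```python
slope_per_day = -4.828947187444238e-06   # value quoted in the text above

# Assumption: an average year has 365.25 days
slope_per_year = slope_per_day * 365.25
slope_per_decade = slope_per_year * 10

print(f"{slope_per_year:.5f} NDVI per year")
print(f"{slope_per_decade:.4f} NDVI per decade")
```

The slope is small in either unit, and the regression output above already showed that the change is not statistically significant.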

***
### Complete examples

In the following you have two examples with a full workflow:
- selecting data
- averaging the data
- resampling to monthly time steps
- filtering NaN values
- running linregress
- plotting


#### Monthly example

In [None]:
# spatial subset
da = ndvi.sel(latitude=slice(46.8, 46.79),  # higher value before lower for latitudes!!
              longitude=slice(7.178, 7.19))
# select all values of a certain month
da_mon = da.sel(time=da.time.dt.month == 9)
# resample to monthly means
da_mon = da_mon.resample(time='MS').mean()
# average over all latitudes and longitudes per time step
da_mon = da_mon.groupby('time').mean(('longitude','latitude'))

# regression preparation:
y = da_mon.values  
x = xr_date2num(da_mon.time.values)  

# only take non-NaN values
clean_mask = np.isfinite(y)  

y_clean = y[clean_mask]
x_clean = x[clean_mask]

# regression
reg = linregress(x_clean, y_clean)

# print results
print(reg)

# slope
m = reg.slope
# intercept
b = reg.intercept

# calculate regression line (all months)
y_pred = m * x + b
# calculate regression line (only September)
y_pred_mon = m * x_clean + b

# plot
da_mon.plot.line('bo-')
plt.plot(x_clean, y_pred_mon, 'r-')

#### Annual example

In [None]:
# spatial subset
da = ndvi.sel(latitude=slice(46.8, 46.79),  # higher value before lower for latitudes!!
              longitude=slice(7.178, 7.19))
# select all values of a certain year
da_mon = da.sel(time=da.time.dt.year == 2014)
# resample to monthly means
da_mon = da_mon.resample(time='MS').mean()
# average over all latitudes and longitudes per time step
da_mon = da_mon.groupby('time').mean(('longitude','latitude'))

# regression preparation:
y = da_mon.values  
x = xr_date2num(da_mon.time.values)  

# only take non-NaN values
clean_mask = np.isfinite(y)  

y_clean = y[clean_mask]
x_clean = x[clean_mask]

# regression
reg = linregress(x_clean, y_clean)

# print results
print(reg)

# slope
m = reg.slope
# intercept
b = reg.intercept

# calculate regression line (all months)
y_pred = m * x + b
# calculate regression line (only months of 2014 with valid data)
y_pred_mon = m * x_clean + b

# plot
da_mon.plot.line('bo-')
plt.plot(x_clean, y_pred_mon, 'r-')

### Spatio-temporal trends
It is possible to calculate the trend through time for every pixel in our datacube. This makes it possible to compare the trends pixel by pixel in the form of a map.

To do this, `xarray` provides a special function: `apply_ufunc()`. The output helps to understand in more detail where we observe statistically significant trends (Is it at fields, in the city, ...?). And we can use these results also to select these areas for further analysis (not covered here).

The example below shows first the function and then the output in the next cell for the `slope`.



In [None]:
# from scipy.stats import linregress

dataset = ndvi

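# Don't change anything in the code below.
# x is the time-step index (0, 1, 2, ...), so the slope is expressed in NDVI per
# time step, not per day. Use date2num(dataset.time.values) to get NDVI per day.
x = np.arange(dataset.time.shape[0])

def new_linregress(y):
    # Wrapper around scipy linregress to use in apply_ufunc
    clean_mask = np.isfinite(y)
    y_clean = y[clean_mask]
    x_clean = x[clean_mask]
    if y_clean.size < 2:
        # a regression needs at least two valid points; return NaN otherwise
        return np.full(4, np.nan)
    slope, intercept, r_value, p_value, _ = linregress(x_clean, y_clean)
    return np.array([slope, intercept, r_value, p_value])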

stats = xr.apply_ufunc(new_linregress, dataset, 
                       input_core_dims=[['time']],
                       output_core_dims=[["parameter"]],
                       vectorize=True,
                       dask="parallelized",
                       output_dtypes=['float64'],
                       output_sizes={"parameter": 4},
                      )

In [None]:
# The output is a bit different here. We have two spatial dimensions (longitude and latitude), and an array 'parameter'
stats.dims

# Inside the 'parameter' array we have the 4 columns
# slope, intercept, r_value, p_value 
# this is the result of the "return np.array([slope, intercept, r_value, p_value])"

In [None]:
# the easiest way to access the outputs is by using the positional index
# Column 1 - slope (index 0)
# Column 2 - intercept (index 1)
# Column 3 - rvalue (index 2)
# Column 4 - pvalue (index 3)

ndvi_slope = stats[:,:,0]  # slope
ndvi_slope.name = 'slope'  # change the name from 'ndvi' to 'slope'
ndvi_slope.plot.imshow()
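To convince yourself that the per-pixel regression returns what you expect, you can run the same pattern on a small cube where the true slope of every pixel is known. This is a simplified sketch: the cube is made up, it contains no `NaN` values, and the `dask`-related arguments are left out because the data is small and held in memory.

```python
import numpy as np
import xarray as xr
from scipy.stats import linregress

# Hypothetical cube: 10 time steps, 2 x 3 pixels, each pixel with its own known slope
true_slope = np.array([[0.1, 0.2, 0.3],
                       [0.4, 0.5, 0.6]])
t = np.arange(10)
cube = xr.DataArray(
    t[:, None, None] * true_slope[None, :, :],
    dims=("time", "latitude", "longitude"),
)

def fit_pixel(y):
    # Regression of one pixel's time series against the time-step index
    res = linregress(t, y)
    return np.array([res.slope, res.intercept, res.rvalue, res.pvalue])

stats_demo = xr.apply_ufunc(
    fit_pixel, cube,
    input_core_dims=[["time"]],        # the function consumes the time axis
    output_core_dims=[["parameter"]],  # and returns a new axis with 4 entries
    vectorize=True,                    # call fit_pixel once per pixel
)

print(stats_demo.dims)
print(stats_demo.isel(parameter=0).values)   # should reproduce true_slope
```

Selecting with `.isel(parameter=0)` is equivalent to the positional index `[:, :, 0]` used above, but it states explicitly which axis is meant.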

<span class='dothis'>1) Change the `parameter` from slope to $p-value$.</span>

<span class='dothis'>2) Change the dataset from the full dataset `ndvi` to the spatial subset `da`.</span>

***
<a name='part3'></a>
# Part 3: Variability

Variability refers to how much a variable like NDVI changes in general, as compared to how much the values change systematically over time (--> trends/tendencies). The monthly example from before shows no statistically significant trend ($p > \alpha$). But we see that the values change a lot. Some examples of sources of high variability are different crops on the fields that result in different NDVI values, different precipitation patterns in combination with temperature that lead to variable snow cover, etc.

A common statistic to describe variability is the **standard deviation**. 


$s = \sqrt{\frac{\sum_{i=1}^{n}{(x_i-\bar{x})^2}}{n}}$

The standard deviation has the same unit as the data in the time series. It makes it therefore more intuitive to use it instead of the ***variance***.


Another useful way to investigate variability is by looking at the **deviation from the mean**, sometimes called anomalies. Instead of calculating a single statistic over all time steps, one derives for each time step a value.
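Both statistics can be calculated by hand for a few values to see how they relate. The numbers below are made up; the sketch follows the formula above (division by $n$).

```python
import numpy as np

x = np.array([0.2, 0.4, 0.6, 0.8])   # hypothetical NDVI values

mean = x.mean()
anomalies = x - mean                          # deviation from the mean: one value per time step
s = np.sqrt(np.sum(anomalies**2) / x.size)    # standard deviation: one value for the whole series

print(anomalies)
print(s, np.std(x))   # np.std divides by n by default, like the formula above
```

The anomalies always sum to zero, while the standard deviation summarises their typical size in a single number.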


### Application
#### Standard deviation

One can directly calculate the standard deviation for each pixel by calling the function `.std('time')`, indicating that it should be applied over the **time** dimension.

The following example shows directly the difference between the urban and the rural area in terms of NDVI variability. Crop fields can easily be identified where the variability is especially high.


In [None]:
ndvi.std('time').plot.imshow()

In [None]:
# The same example but only for the month of August
ndvi.sel(time=ndvi.time.dt.month==8).std('time').plot.imshow()

### Application
#### Deviation from the mean
As the name says, we have to calculate the mean first and subtract this value from each individual NDVI value. If there is a strong seasonality, we have to think of which mean we calculate (monthly, annual, ...), and of which data we subtract this mean (also monthly, annual, ...).



In [None]:
da_annual_mean = ndvi.mean('time')
da_annual = ndvi.resample(time='AS').mean()
da_dev_from_mean = da_annual - da_annual_mean

# plot the time series for a pixel:
da_dev_from_mean_pixel = da_dev_from_mean.sel(longitude=7.19, latitude=46.793, method='nearest')

da_dev_from_mean_pixel.plot.line('ko-')
plt.hlines(y= da_dev_from_mean_pixel.mean(), 
           xmin=da_dev_from_mean_pixel.time[0], 
           xmax=da_dev_from_mean_pixel.time[-1])


In [None]:
# plot the deviation from the mean for the year 2018 - as a map
# da_dev_from_mean.sel(time=da_dev_from_mean.time.dt.year==2018)[0].plot.imshow()
da_dev_from_mean.sel(time=da_dev_from_mean.time.dt.year==2018).mean(dim='time').plot.imshow()
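If the data has a strong seasonality, one option is to subtract the mean of each calendar month instead of the overall mean. The following is a sketch of that idea on a made-up monthly series, in which the second year is uniformly 0.2 higher than the first; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical monthly series over two years: a seasonal cycle, with the
# second year uniformly 0.2 higher than the first
time = pd.date_range("2020-01-01", periods=24, freq="MS")
seasonal = np.tile(np.sin(np.linspace(0, np.pi, 12)), 2)
offset = np.repeat([0.0, 0.2], 12)
series = xr.DataArray(seasonal + offset, coords={"time": time}, dims="time")

# Mean of each calendar month across all years
monthly_mean = series.groupby("time.month").mean("time")

# Subtract the matching monthly mean from every time step
anomalies = series.groupby("time.month") - monthly_mean

print(anomalies.values.round(2))
```

The seasonal cycle disappears from the result, and only the difference between the two years remains: -0.1 for every month of the first year and +0.1 for every month of the second.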


In [None]:
# the easiest way to access the outputs is by using the positional index
# this line from the cell above defines the order:
# return np.array([slope, intercept, r_value, p_value])
# Column 1 - slope (index 0)
# Column 2 - intercept (index 1)
# Column 3 - rvalue (index 2)
# Column 4 - pvalue (index 3)
ndvi_pvalue = stats[:,:,3]  # pvalue
ndvi_pvalue.name = 'pvalue'
ndvi_pvalue.plot.imshow()