# Spatio-temporal folds generation for cross validation

In [3]:
using Pkg
# the path where you have the Manifest.toml and Project.toml 
Pkg.activate("/home/dpabon/YAXArraysToolboxNotebooks/Notebooks/")


[32m[1m  Activating[22m[39m project at `~/YAXArraysToolboxNotebooks/Notebooks`


For this short tutorial we will use the cookfarm dataset from the GSIF package in R. The package documentation describes it as follows:

```
"The R.J. Cook Agronomy Farm (cookfarm) is a Long-Term Agroecosystem Research Site operated by Washington State University, located near Pullman, Washington, USA. Contains spatio-temporal (3D+T) measurements of three soil properties and a number of spatial and temporal regression covariates."

The cookfarm data set contains four data frames. The readings data frame contains measurements
of volumetric water content (cubic-m/cubic-m), temperature (degree C) and bulk electrical
conductivity (dS/m), measured at 42 locations using 5TE sensors at five standard depths
(0.3, 0.6, 0.9, 1.2, 1.5 m) for the period "2011-01-01" to "2012-12-31":
SOURCEID factor; unique station ID
Date date; observation day
Port*VW numeric; volumetric water content measurements at five depths
Port*C numeric; soil temperature measurements at five depths
Port*EC numeric; bulk electrical conductivity measurements at five depths
The profiles data frame contains soil profile descriptions from 142 sites:
SOURCEID factor; unique station ID
Easting numeric; x coordinate in the local projection system
Northing numeric; y coordinate in the local projection system
TAXNUSDA factor; Keys to Soil Taxonomy taxon name e.g. "Caldwell"
HZDUSD factor; horizon designation
UHDICM numeric; upper horizon depth from the surface in cm
LHDICM numeric; lower horizon depth from the surface in cm
BLD bulk density in tonnes per cubic-meter
PHIHOX numeric; pH index measured in water solution
The grids data frame contains values of regression covariates at 10 m resolution:
DEM numeric; Digital Elevation Model
TWI numeric; SAGA GIS Topographic Wetness Index
MUSYM factor; soil mapping units e.g. "Thatuna silt loam"
NDRE.M numeric; mean value of the Normalized Difference Red Edge Index (time series of 11
RapidEye images)
NDRE.sd numeric; standard deviation of the Normalized Difference Red Edge Index (time series of
11 RapidEye images)
Cook_fall_ECa numeric; apparent electrical conductivity image from fall
Cook_spr_ECa numeric; apparent electrical conductivity image from spring
X2011 factor; cropping system in 2011
X2012 factor; cropping system in 2012
The weather data frame contains daily temperatures and rainfall from the nearest meteorological
station:
Date date; observation day
Precip_wrcc numeric; observed precipitation in mm
MaxT_wrcc numeric; observed maximum daily temperature in degree C
MinT_wrcc numeric; observed minimum daily temperature in degree C

Gasch, C.K., Hengl, T., Gräler, B., Meyer, H., Magney, T., Brown, D.J., 2015. Spatio-temporal
interpolation of soil water, temperature, and electrical conductivity in 3D+T: the Cook
Agronomy Farm data set. Spatial Statistics, 14, pp.70–90.
```

In [4]:
using DataFrames, CSV, YAXArraysToolbox 

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mPrecompiling YAXArraysToolbox [fe326ce2-c736-437b-b94b-0bdf007dd2e5]


In [5]:
cookdata = CSV.read("cookfarm.csv", DataFrame)

| Row | SOURCEID | VW | Easting | Northing | altitude | DEM | TWI | NDRE.M | NDRE.Sd | Bt | BLD | PHI | Crop | Date | Precip_wrcc | MaxT_wrcc | MinT_wrcc | Precip_cum | cday | cdayt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | String7 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | String7 | Date | Float64 | Float64 | Float64 | Float64 | Int64 | Float64 |
| 1 | CAF357 | 0.303 | 4.93828e5 | 5.18102e6 | -0.3 | 792.576 | 3.79125 | 0.0816121 | 0.280518 | 0.0 | 1.22 | 5.84 | SL | 2010-01-01 | 5.8 | 2.8 | -3.3 | 5.8 | 14611 | -0.052336 |
| 2 | CAF357 | 0.328 | 4.93828e5 | 5.18102e6 | -0.6 | 792.576 | 3.79125 | 0.0816121 | 0.280518 | 0.0 | 1.36 | 6.32 | SL | 2010-01-01 | 5.8 | 2.8 | -3.3 | 5.8 | 14611 | -0.052336 |
| 3 | CAF357 | 0.376 | 4.93828e5 | 5.18102e6 | -0.9 | 792.576 | 3.79125 | 0.0816121 | 0.280518 | 0.0 | 1.48 | 6.52 | SL | 2010-01-01 | 5.8 | 2.8 | -3.3 | 5.8 | 14611 | -0.052336 |
| 4 | CAF357 | 0.35 | 4.93828e5 | 5.18102e6 | -1.2 | 792.576 | 3.79125 | 0.0816121 | 0.280518 | 0.0 | 1.56 | 6.68 | SL | 2010-01-01 | 5.8 | 2.8 | -3.3 | 5.8 | 14611 | -0.052336 |
| 5 | CAF357 | 0.323 | 4.93828e5 | 5.18102e6 | -1.5 | 792.576 | 3.79125 | 0.0816121 | 0.280518 | 0.0106 | 1.6 | 6.72 | SL | 2010-01-01 | 5.8 | 2.8 | -3.3 | 5.8 | 14611 | -0.052336 |
| 6 | CAF357 | 0.297 | 4.93828e5 | 5.18102e6 | -0.3 | 792.576 | 3.79125 | 0.0816121 | 0.280518 | 0.0 | 1.22 | 5.84 | SL | 2010-01-02 | 6.9 | 6.1 | 0.6 | 12.7 | 14612 | -0.0348995 |
| 7 | CAF357 | 0.33 | 4.93828e5 | 5.18102e6 | -0.6 | 792.576 | 3.79125 | 0.0816121 | 0.280518 | 0.0 | 1.36 | 6.32 | SL | 2010-01-02 | 6.9 | 6.1 | 0.6 | 12.7 | 14612 | -0.0348995 |
| 8 | CAF357 | 0.375 | 4.93828e5 | 5.18102e6 | -0.9 | 792.576 | 3.79125 | 0.0816121 | 0.280518 | 0.0 | 1.48 | 6.52 | SL | 2010-01-02 | 6.9 | 6.1 | 0.6 | 12.7 | 14612 | -0.0348995 |
| 9 | CAF357 | 0.35 | 4.93828e5 | 5.18102e6 | -1.2 | 792.576 | 3.79125 | 0.0816121 | 0.280518 | 0.0 | 1.56 | 6.68 | SL | 2010-01-02 | 6.9 | 6.1 | 0.6 | 12.7 | 14612 | -0.0348995 |
| 10 | CAF357 | 0.323 | 4.93828e5 | 5.18102e6 | -1.5 | 792.576 | 3.79125 | 0.0816121 | 0.280518 | 0.0106 | 1.6 | 6.72 | SL | 2010-01-02 | 6.9 | 6.1 | 0.6 | 12.7 | 14612 | -0.0348995 |
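Before generating folds it can help to check how many distinct spatial and temporal units the grouping columns contain, since these counts bound the sensible values of `k`. The following is a minimal sketch on a small made-up table: the station IDs other than `CAF357` and all values are hypothetical, and only the column names `SOURCEID` and `Date` mirror `cookdata`.

```julia
using Dates

# Hypothetical toy table mimicking two columns of `cookdata`
toy = (
    SOURCEID = ["CAF357", "CAF357", "CAF003", "CAF003", "CAF120", "CAF120"],
    Date     = Date.(["2010-01-01", "2010-01-02", "2010-01-01",
                      "2010-01-02", "2010-01-01", "2010-01-03"]),
)

n_stations = length(unique(toy.SOURCEID))   # number of spatial units
n_days     = length(unique(toy.Date))       # number of temporal units
first_day, last_day = extrema(toy.Date)     # period covered by the rows

println("stations: ", n_stations, ", days: ", n_days,
        ", period: ", first_day, " to ", last_day)
```

On the real data the same calls would be applied to `cookdata.SOURCEID` and `cookdata.Date`.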


In [8]:
# Now we can check the documentation of the spacetime_folds function
@doc spacetime_folds

# Create Space-time Folds

Create spatial, temporal or spatio-temporal Folds for cross validation based on pre-defined groups.

## Arguments:

  * `x`: DataFrame containing spatio-temporal data.
  * `spacevar`: String. Column of `x` that identifies the spatial units (e.g. ID of weather stations).
  * `timevar`: String. Column of `x` that identifies the temporal units (e.g. the day of the year).
  * `k`: Int64. Number of folds. For leave-one-location-out or leave-one-time-step-out cross validation (with `timevar` or `spacevar` set to `nothing`, respectively), set `k` to the number of unique spatial or temporal units.
  * `class`: String. Column of `x` that identifies a class unit (e.g. land cover). Not implemented yet.
  * `seed`: Int64 or Float64. See `?Random.seed!`.

## Return

`cv_indices_train, cv_indices_test = spacetime_folds(x;spacevar="var1", timevar="var2", k=10, class=nothing, seed=23)`

## References

Meyer, H., Reudenbach, C., Hengl, T., Katurji, M., Nauß, T. (2018): Improving performance of spatio-temporal machine learning models using forward feature selection and target-oriented validation. Environmental Modelling & Software 101: 1-9.
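To make the idea of folds "based on pre-defined groups" concrete, here is a minimal sketch of one way such folds can be built. It is not the implementation used by `spacetime_folds`: the helper `sketch_spacetime_folds`, its defaults and the toy data are all hypothetical. The assumption is that each unique location and each unique time step is assigned to one of `k` folds, that the test set of a fold holds the rows whose location and time step both belong to it, and that the training set holds the rows where neither does.

```julia
using Random

# Hypothetical helper, NOT the YAXArraysToolbox implementation.
# `space` and `time` are equally long vectors of group labels, one per row.
function sketch_spacetime_folds(space::AbstractVector, time::AbstractVector;
                                k::Int = 3, seed::Int = 23)
    rng = MersenneTwister(seed)
    sunits = unique(space)
    tunits = unique(time)
    k = min(k, length(sunits), length(tunits))   # cannot have more folds than units
    # Shuffle the units, then deal them round-robin into k folds
    sfold = Dict(u => mod1(i, k) for (i, u) in enumerate(shuffle(rng, sunits)))
    tfold = Dict(u => mod1(i, k) for (i, u) in enumerate(shuffle(rng, tunits)))

    train = Vector{Vector{Int}}(undef, k)
    test  = Vector{Vector{Int}}(undef, k)
    for f in 1:k
        # Test rows: location AND time step both belong to fold f
        test[f]  = [i for i in eachindex(space)
                    if sfold[space[i]] == f && tfold[time[i]] == f]
        # Training rows: neither location nor time step belongs to fold f
        train[f] = [i for i in eachindex(space)
                    if sfold[space[i]] != f && tfold[time[i]] != f]
    end
    return train, test
end

# Toy data: 4 stations, each observed on 4 days (16 rows)
stations = repeat(["A", "B", "C", "D"], inner = 4)
days     = repeat(1:4, outer = 4)
train_idx, test_idx = sketch_spacetime_folds(stations, days; k = 2)
```

Under this scheme, rows that share only the location or only the time step with the held-out fold end up in neither set, so a model is never trained on data from a held-out station or a held-out day.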


In [9]:
cv_indices_train, cv_indices_test = spacetime_folds(cookdata; spacevar = "SOURCEID", timevar = "Date", class=nothing, seed=23)

(Vector[[1098, 1106, 1114, 1121, 1127, 1135, 1144, 1153, 1176, 1186  …  182875, 182910, 182947, 182983, 183019, 183057, 183092, 183128, 183164, 183200], [2229, 2244, 2259, 2273, 2290, 2305, 2320, 2335, 2350, 2367  …  182683, 182721, 182761, 182799, 182836, 182875, 182910, 182947, 182983, 183019], [2229, 2244, 2259, 2273, 2290, 2305, 2320, 2335, 2350, 2367  …  182683, 182721, 182761, 182799, 182836, 183057, 183092, 183128, 183164, 183200], [2229, 2244, 2259, 2273, 2290, 2305, 2320, 2335, 2350, 2367  …  182875, 182910, 182947, 182983, 183019, 183057, 183092, 183128, 183164, 183200], [2229, 2244, 2259, 2273, 2290, 2305, 2320, 2335, 2350, 2367  …  182875, 182910, 182947, 182983, 183019, 183057, 183092, 183128, 183164, 183200], [2229, 2244, 2259, 2273, 2290, 2381, 2396, 2411, 2427, 2443  …  182875, 182910, 182947, 182983, 183019, 183057, 183092, 183128, 183164, 183200], [2305, 2320, 2335, 2350, 2367, 2381, 2396, 2411, 2427, 2443  …  182875, 182910, 182947, 182983, 183019, 183057, 183092, …

The object ```cv_indices_train``` holds one vector per fold (10 in total). Each vector lists the row indices of the dataframe that make up the training data for that fold.

In [10]:
cv_indices_train

10-element Vector{Vector}:
 [1098, 1106, 1114, 1121, 1127, 1135, 1144, 1153, 1176, 1186  …  182875, 182910, 182947, 182983, 183019, 183057, 183092, 183128, 183164, 183200]
 [2229, 2244, 2259, 2273, 2290, 2305, 2320, 2335, 2350, 2367  …  182683, 182721, 182761, 182799, 182836, 182875, 182910, 182947, 182983, 183019]
 [2229, 2244, 2259, 2273, 2290, 2305, 2320, 2335, 2350, 2367  …  182683, 182721, 182761, 182799, 182836, 183057, 183092, 183128, 183164, 183200]
 [2229, 2244, 2259, 2273, 2290, 2305, 2320, 2335, 2350, 2367  …  182875, 182910, 182947, 182983, 183019, 183057, 183092, 183128, 183164, 183200]
 [2229, 2244, 2259, 2273, 2290, 2305, 2320, 2335, 2350, 2367  …  182875, 182910, 182947, 182983, 183019, 183057, 183092, 183128, 183164, 183200]
 [2229, 2244, 2259, 2273, 2290, 2381, 2396, 2411, 2427, 2443  …  182875, 182910, 182947, 182983, 183019, 183057, 183092, 183128, 183164, 183200]
 [2305, 2320, 2335, 2350, 2367, 2381, 2396, 2411, 2427, 2443  …  182875, 182910, 182947, 182983, 183019 …

The object ```cv_indices_test``` holds one vector per fold (10 in total). Each vector lists the row indices of the dataframe that make up the testing data for that fold.

In [11]:
cv_indices_test

10-element Vector{Vector}:
 [36, 37, 38, 39, 40, 41, 42, 43, 44, 45  …  183117, 183118, 183144, 183146, 183155, 183156, 183172, 183180, 183189, 183190]
 [11, 12, 13, 14, 15, 86, 87, 88, 89, 90  …  183201, 183202, 183203, 183204, 183205, 183206, 183207, 183208, 183209, 183210]
 [96, 97, 98, 99, 100, 131, 132, 133, 134, 135  …  183124, 183132, 183134, 183141, 183160, 183168, 183177, 183196, 183204, 183206]
 [1, 2, 3, 4, 5, 51, 52, 53, 54, 55  …  183121, 183122, 183149, 183151, 183158, 183159, 183184, 183185, 183193, 183194]
 [6, 7, 8, 9, 10, 21, 22, 23, 24, 25  …  183126, 183129, 183137, 183150, 183162, 183165, 183173, 183198, 183201, 183209]
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  183130, 183136, 183139, 183161, 183166, 183171, 183175, 183197, 183202, 183208]
 [146, 147, 148, 149, 150, 186, 187, 188, 189, 190  …  183110, 183114, 183145, 183147, 183148, 183152, 183181, 183182, 183183, 183186]
 [31, 32, 33, 34, 35, 76, 77, 78, 79, 80  …  183097, 183116, 183131, 183133, 183154, 183167, 183169, …
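
Once the train and test indices are available, a cross-validation loop walks over both vectors in parallel. The sketch below shows the pattern with made-up numbers: `y`, `toy_train` and `toy_test` are hypothetical stand-ins for a response column such as `cookdata.VW` and for the index vectors above, and the "model" is just the training mean, used here only as a placeholder.

```julia
using Statistics

# Hypothetical toy response and fold indices (stand-ins for a column of
# `cookdata` and for the vectors returned by `spacetime_folds`)
y         = [0.30, 0.33, 0.38, 0.35, 0.32, 0.29, 0.31, 0.36]
toy_train = [[5, 6, 7, 8], [1, 2, 3, 4]]
toy_test  = [[1, 2], [7, 8]]

# Root mean squared error between observations and predictions
rmse(obs, pred) = sqrt(mean((obs .- pred) .^ 2))

scores = Float64[]
for (tr, te) in zip(toy_train, toy_test)
    @assert isempty(intersect(tr, te))   # no row may be in both sets
    pred = mean(y[tr])                   # placeholder "model": training mean
    push!(scores, rmse(y[te], pred))     # evaluate on the held-out rows
end

println("RMSE per fold: ", round.(scores; digits = 4))
```

With the real data, the rows of fold `i` would be selected with `cookdata[cv_indices_train[i], :]` and `cookdata[cv_indices_test[i], :]`, and the placeholder mean would be replaced by the model being evaluated.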