# Tutorial: Analyse data downloaded from Google Earth Engine using pyveg 


Google Earth Engine is a powerful tool for obtaining and analysing satellite imagery. The pyveg package provides useful scripts for interacting with the Earth Engine API and downloading data. 

The location used in this tutorial is an area of Tiger Bush vegetation in Niger, at coordinates 2.59, 13.12. The downloaded data is a JSON file containing weather and network centrality metrics on a monthly basis from 2015 to 2020.

Now let's use the functions provided by pyveg to run a simple analysis on the data.

In [None]:
import json
import os
import shutil

import pandas as pd
from matplotlib import pyplot as plt

from pyveg.src.data_analysis_utils import *
from pyveg.src.plotting import *
from pyveg.src.image_utils import create_gif_from_images
from pyveg.src.analysis_preprocessing import *
from pyveg.scripts.analyse_gee_data import plot_feature_vector


%matplotlib inline

The input dataset is a JSON file found in this directory.

In [None]:
# summary JSON file produced by the `download_gee_data` script
json_summary_path = 'results_summary_TigerBush_Niger.json'

The output figures will be saved in an `analysis` sub-directory. 

In [None]:
# put output plots in the results dir
input_dir = '.'
output_dir = os.path.join(input_dir, 'analysis')

Read the JSON summary file and produce a dictionary of dataframes. Each key is a collection name, either weather related or image related (for the network centrality measures).

In [None]:
print(f"Reading results from '{os.path.abspath(json_summary_path)}'...")
# use a context manager so the file handle is closed after reading
with open(json_summary_path) as f:
    json_data = json.load(f)

ts_dirname, dfs = preprocess_data(
    json_data, output_dir, n_smooth=4, resample=False, period="MS"
)

print(dfs.keys())

This is what the output dataframes look like:

In [None]:
print(dfs['COPERNICUS/S2'].head())
print(dfs['ECMWF/ERA5/MONTHLY'].head())

## Spatial analysis

First, let's build 2D plots showing the network centrality values across the overall 10 km image for each date.

In [None]:
# create a new subdir for the spatial analysis
spatial_subdir = os.path.join(output_dir, 'spatial')

# if the directory exists, delete results from previous runs
if os.path.exists(spatial_subdir):
    shutil.rmtree(spatial_subdir)

os.makedirs(spatial_subdir, exist_ok=True)


In [None]:
# spatial analysis and plotting
# from the dataframe, produce a network metric figure for each available date
print('\nCreating spatial plots...')

for collection_name, df in dfs.items():
    if collection_name == 'COPERNICUS/S2' or 'LANDSAT' in collection_name:
        # convert the dataframe of each image to geopandas and coarsen its resolution slightly
        data_df_geo = convert_to_geopandas(df.copy())
        create_lat_long_metric_figures(data_df_geo, 'offset50', spatial_subdir)

# combine the per-date figures into a single animated GIF
output_plots_name = create_gif_from_images(spatial_subdir, 'output.gif')

Let's visualise the result as a GIF.

In [None]:
from IPython.display import Image
with open(output_plots_name,'rb') as f:
    display(Image(data=f.read(), format='png',width=500, height=500))

The network centrality feature vectors, averaged over all time points and sub-images, are the following:

In [None]:
# time series outputs go in the main output directory
tsa_subdir = output_dir

# remove outliers from the time series
dfs = drop_veg_outliers(dfs, sigmas=3) # not convinced this is really helping much

# plot the feature vectors averaged over all time points and sub images
try:
    input_dir = os.path.join(output_dir,'preprocessed_data')
    plot_feature_vector(input_dir)
except AttributeError:
    print('Cannot plot feature vectors...')
            


## Time series analysis

Using the data we can build a time series. The analysis takes the following steps:

- Build a time series for every sub-image, dropping outlier points and smoothing each sub-image time series.
- Average the network centrality measures from every sub-image into a single time series.
- Compare the time series with precipitation data and calculate measures such as correlations.
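
The first two steps above can be sketched with pandas as follows. This is a minimal illustration on synthetic data, not pyveg's implementation: the helper `build_mean_time_series`, the column names `date` and `sub_image`, the sigma-based outlier rule, and the rolling-mean smoother are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def build_mean_time_series(df, value_col="offset50", sigmas=3.0, window=3):
    """Clean each sub-image series, then average across sub-images per date.

    Hypothetical helper: the `date` and `sub_image` columns are assumptions
    made for this sketch, not pyveg's actual schema.
    """
    cleaned = []
    for _, sub in df.groupby("sub_image"):
        sub = sub.sort_values("date").copy()
        values = sub[value_col]
        # Mask points further than `sigmas` standard deviations from the mean.
        deviation = (values - values.mean()).abs()
        sub[value_col] = values.where(deviation <= sigmas * values.std(ddof=0))
        # Centred rolling mean as a simple stand-in for smoothing.
        sub["smoothed"] = (
            sub[value_col].rolling(window, center=True, min_periods=1).mean()
        )
        cleaned.append(sub)
    combined = pd.concat(cleaned)
    # One value per date: the mean over all sub-images.
    return combined.groupby("date")["smoothed"].mean()


# Synthetic demo: three sub-images, 24 months, one injected outlier.
dates = pd.date_range("2019-01-01", periods=24, freq="MS")
rng = np.random.default_rng(0)
frames = []
for sub_id in range(3):
    values = (
        -1000
        + 50 * np.sin(2 * np.pi * np.arange(24) / 12)
        + rng.normal(0, 5, 24)
    )
    frames.append(
        pd.DataFrame({"date": dates, "sub_image": sub_id, "offset50": values})
    )
demo = pd.concat(frames, ignore_index=True)
demo.loc[5, "offset50"] = 1e6  # obvious outlier in sub-image 0

mean_series = build_mean_time_series(demo)
print(mean_series.head())
```

Masking outliers with `where` rather than dropping rows keeps every date in place, so the rolling window and the final per-date average still line up across sub-images.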

In [None]:
# load the processed time series from disk
time_series_path = os.path.join(output_dir,'processed_data','time_series.csv')
time_series_dfs = pd.read_csv(time_series_path)

# create a sub-directory for the correlation plots
corr_subdir = os.path.join(output_dir, "correlations")
os.makedirs(corr_subdir, exist_ok=True)

Investigate the cross-correlation between the network centrality measures and precipitation at different time lags.

In [None]:


# make cross correlation scatterplot matrix plots
plot_cross_correlations(time_series_dfs, corr_subdir)
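
To make the idea of a lagged cross-correlation concrete, the sketch below correlates one series against time-shifted copies of another. It uses synthetic data and a hypothetical helper, `lagged_correlations`; it is not what `plot_cross_correlations` does internally, only an illustration of the underlying calculation.

```python
import numpy as np
import pandas as pd


def lagged_correlations(df, x_col, y_col, max_lag=6):
    """Pearson correlation of y(t) with x(t - lag) for lag = 0..max_lag."""
    return pd.Series(
        {lag: df[y_col].corr(df[x_col].shift(lag)) for lag in range(max_lag + 1)},
        name="correlation",
    )


# Synthetic monthly data: the vegetation signal follows rainfall by two months.
months = np.arange(72)
rain = np.sin(2 * np.pi * months / 12)
demo = pd.DataFrame({
    "total_precipitation": rain,
    "S2_offset50_mean": np.roll(rain, 2),
})

corrs = lagged_correlations(demo, "total_precipitation", "S2_offset50_mean")
best_lag = corrs.idxmax()
print(corrs)
print(f"Strongest correlation at a lag of {best_lag} months")
```

In this toy setup the correlation peaks at the lag that was built into the data; on real data the peak is usually broader and noisier.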


In [None]:
   
# make a smoothed time series plot
plot_time_series(time_series_dfs, os.path.join(tsa_subdir,'analysis'))

Explore the autocorrelation of all available time series.

In [None]:
# make autocorrelation plots
plot_autocorrelation_function(time_series_dfs, corr_subdir)
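
As a rough illustration of what an autocorrelation function measures, the snippet below computes it for a synthetic series with a 12-month cycle using `pandas.Series.autocorr`. The data and variable names are made up for the example and are independent of pyveg.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with a purely annual cycle.
months = np.arange(72)
series = pd.Series(np.sin(2 * np.pi * months / 12))

# Correlation of the series with itself shifted by 1 to 24 months.
acf = pd.Series({lag: series.autocorr(lag=lag) for lag in range(1, 25)})
print(acf.round(2))
```

A seasonal series like this one is strongly positively correlated with itself at lags of 12 and 24 months and strongly negatively correlated at 6 and 18 months.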


## Seasonal and trend analysis

The time series above show a clear seasonal pattern. The STL (Seasonal-Trend decomposition using LOESS) implementation from the statsmodels package is applied to the unsmoothed time series to separate the seasonal, trend, and residual components.

This is done for both the network centrality metrics and precipitation data.

In [None]:
plot_stl_decomposition(time_series_dfs[['S2_offset50_mean','total_precipitation','date']], 12, os.path.join(output_dir, "detrended/STL"))
