---
title: Week 2 - PCA (principal component analysis) for GRACE and GRACE-FO mascon data
subtitle: Perform signal separation of GRACE data with PCA and EOF (empirical orthogonal function) analysis
authors:
  - name: Katrin Bentel (katrin.bentel@ethz.ch)
---

:::{important} Learning Goals &#9971;
- [ ] I can prepare the GRACE/GRACE-FO (or any other) data set for PCA/EOF analysis (centered data matrix)
- [ ] I can perform PCA/EOF analysis in Python
- [ ] I can explain the equations behind PCA
- [ ] I can plot the results
- [ ] I can interpret the results (EOF patterns, PC time series, and variance expressed)
- [ ] I can perform simple selection techniques for the dominant modes and explain more rigorous approaches
- [ ] I can reconstruct the data and explain the benefits of PCA/EOF analysis
- [ ] I can handle gaps in the data appropriately
:::

:::{attention} Questions
Don't hesitate to ask any questions that come up. If you think your question could be relevant to others as well, please post it in the [**Moodle forum**](https://moodle-app2.let.ethz.ch/mod/forum/view.php?id=1187440). Alternatively, you can contact me by email.
:::

____

## Table of Contents
#### [](#h-prepare-data)

#### [](#h-PCA-global)

#### [](#h-data-reconstruction)

#### [](#h-gaps)

#### [](#h-regional)

#### [](#h-outlook-feedback)


_____

(h-prepare-data)=
# 1. Get and prepare GRACE data

:::{tip} _Exercise 1:_ Load the GRACE and GRACE-FO mascon data set RL06.3_v04 from JPL and prepare the data for PCA/EOF analysis.
- The required data set is the same one we worked with last week. It is called `GRCTellus.JPL.200204_202411.GLO.RL06.3M.MSCNv04CRI.nc`. Please refer to the routines from last week to get the data ready. To load the data, adjust the file path so that it points to last week's file, which is stored in the data folder.

- The data has to be arranged in a rectangular matrix, with time along the first dimension and location along the second dimension. Each row holds one monthly field and each column holds the time series of one pixel. Please also refer to the slides.

- The data has to be centered, i.e. it must consist of true anomalies. Therefore, remove the time mean from each grid location.
:::
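
The preparation steps above can be sketched roughly as follows. This is a minimal illustration on a synthetic array, not the reference solution: the variable name `lwe_thickness` and the dimension order (time, lat, lon) are assumptions that should be checked against the metadata printed from the file.

```python
import numpy as np

# Synthetic stand-in for the mascon grid with dimensions (time, lat, lon).
# With the real file this would come from something like
# mascons['lwe_thickness'][:] (variable name is an assumption; check the metadata).
rng = np.random.default_rng(0)
n_time, n_lat, n_lon = 24, 6, 12
field = rng.normal(size=(n_time, n_lat, n_lon))

# Flatten the two spatial dimensions: one row per month, one column per pixel.
data_matrix = field.reshape(n_time, n_lat * n_lon)

# Centre the data: remove the time mean of every pixel (column).
time_mean = data_matrix.mean(axis=0)
anomalies = data_matrix - time_mean

print(anomalies.shape)
```

Keeping `time_mean` around is useful later, because it has to be added back when the data is reconstructed.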

#### &#128187; Coding starts here

The first code cell loads the Python libraries and you can continue with your code by adding more cells below.

```python
# reads the .nc and .nc4 files
import netCDF4 as nc
# miscellaneous operating system interfaces
import os

# visualizes the data
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# processes the data
import numpy as np
import pandas as pd

# helps visualize the data
import cartopy.crs as ccrs
from cartopy.mpl.geoaxes import GeoAxes

from datetime import datetime

from sklearn.decomposition import PCA
```

```python
# LOAD DATA
file_path = './../data/GRCTellus.JPL.200204_202411.GLO.RL06.3M.MSCNv04CRI.nc'

# Check if the file exists
if os.path.exists(file_path):
    mascons = nc.Dataset(file_path)
else:
    raise FileNotFoundError(f"File not found: {file_path}")

# short alternative, without checking the file path:
# mascons = nc.Dataset('./../data/GRCTellus.JPL.200204_202411.GLO.RL06.3M.MSCNv04CRI.nc')


# ACCESS METADATA

# Printing the dataset, mascons, gives us information about the data contained in the file.

mascons   # same as print(mascons)
```

:::{hint} On Jupyternaut or your favourite chatbot &#9756;
:class: dropdown
- You may also work with a chatbot in this exercise. If you used Jupyternaut with the provided configuration, this configuration is still set in Jupyternaut and you can continue to use it.
- Everything that was mentioned about good practice and the use of an LLM (large language model) in the last homework still holds. Please keep in mind that it is crucially important to understand all the code that you use in your notebook (if you don't understand some code in detail, ask your chatbot and **test** the functionality).
- Since today's exercises are less plotting-focussed, the support you can get from a chatbot might be a bit different from last week.
:::

_____

(h-PCA-global)=
# 2. Perform EOF analysis of the entire data set

:::{tip} _Exercise 2:_ Perform EOF analysis
Take the data matrix which you just prepared in Exercise 1 and perform an EOF analysis of this data with the routines from `sklearn`. The required packages have already been imported above. Please refer to the PCA demo example notebook, where the routines are used on synthetic data. Add your code cells below.
:::
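
As a rough sketch of what the `sklearn` call could look like, the example below runs `PCA` on a synthetic centred matrix with one artificial "annual" mode plus noise. The names `anomalies`, `pcs`, `eofs`, and `var_ratio` are illustrative choices, and the number of modes is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic centred data matrix (time x pixels) standing in for the GRACE anomalies.
rng = np.random.default_rng(1)
n_time, n_pix = 60, 200
t = np.arange(n_time)
pattern = rng.normal(size=n_pix)
anomalies = np.outer(np.sin(2 * np.pi * t / 12), pattern)   # one "annual" mode
anomalies += 0.1 * rng.normal(size=(n_time, n_pix))          # plus white noise
anomalies -= anomalies.mean(axis=0)

n_modes = 5
pca = PCA(n_components=n_modes)
pcs = pca.fit_transform(anomalies)          # PC time series, shape (n_time, n_modes)
eofs = pca.components_                      # EOF patterns, shape (n_modes, n_pix)
var_ratio = pca.explained_variance_ratio_   # fraction of variance per mode

print(pcs.shape, eofs.shape, var_ratio.round(3))
```

Note that `PCA` also subtracts the column means internally, so centring beforehand does not change the result, but it keeps the time mean available for later steps.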

:::{tip} _Exercise 3:_ Plot the first few modes
Now it is time to reuse your routines from last week to plot maps and time series. For the first few modes, plot the EOF patterns, the PC time series, and the explained variance value. Arrange the plots next to each other to get a better overview of your results.
:::
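
One possible layout, sketched here with plain `matplotlib` on synthetic data, puts each EOF map next to its PC time series and writes the explained variance into the map title. The grid size, the colour map, and the use of `pcolormesh` instead of last week's `cartopy` maps are assumptions made to keep the example short.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic decomposition via SVD, standing in for the PCA output of Exercise 2.
rng = np.random.default_rng(2)
n_time, n_lat, n_lon = 48, 18, 36
anomalies = rng.normal(size=(n_time, n_lat * n_lon))
anomalies -= anomalies.mean(axis=0)
u, s, vt = np.linalg.svd(anomalies, full_matrices=False)
pcs = u * s                        # PC time series, shape (n_time, n_modes)
eofs = vt                          # EOF patterns, shape (n_modes, n_pix)
var_ratio = s**2 / np.sum(s**2)    # explained variance fraction per mode

lat = np.linspace(-85, 85, n_lat)
lon = np.linspace(-175, 175, n_lon)
n_plot = 3

fig, axes = plt.subplots(n_plot, 2, figsize=(10, 3 * n_plot), squeeze=False)
for k in range(n_plot):
    ax_map, ax_ts = axes[k]
    # Left column: EOF pattern reshaped back onto the lat/lon grid.
    im = ax_map.pcolormesh(lon, lat, eofs[k].reshape(n_lat, n_lon),
                           cmap="RdBu_r", shading="auto")
    fig.colorbar(im, ax=ax_map)
    ax_map.set_title(f"EOF {k + 1} ({100 * var_ratio[k]:.1f} % variance)")
    # Right column: the matching PC time series.
    ax_ts.plot(pcs[:, k])
    ax_ts.set_title(f"PC {k + 1}")
    ax_ts.set_xlabel("month index")
fig.tight_layout()
```

With the real data, the month index on the x-axis would be replaced by the actual dates, and the map axes by a `cartopy` projection.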


:::{tip} _Exercise 4:_ Physical interpretation
What do you see in your plots? Try to identify signals in your EOF and PC time series plots. Discuss in your team what the physical meaning of the different signals in the first few modes could be.
:::


_____

(h-data-reconstruction)=
# 3. Reconstruct the data from the dominant modes (EOF patterns and principal component time series)


:::{tip} _Exercise 5:_ Derive criteria for significance of modes from singular values 
For data reconstruction, only the dominant modes are used, so that most of the variance of the data is retained but expressed in fewer basis functions (dimensionality reduction). This step also acts as a **filter** on the data.
- Plot the singular values / variance explained for each mode. Can you guess from the curve which modes might contain signal?
- Plot the cumulative explained variance. Another strategy is to make sure that the reconstruction explains a certain level of variance, e.g. 90%. How does this compare to the number of significant modes you found from the previous plot?
- For a more rigorous way to choose the modes, one possibility is to test whether the time series are significantly different from white noise (unless you are very familiar with statistical tests, you can skip this for this homework).
:::
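
The cumulative-variance criterion can be sketched numerically as below. The data is synthetic (an annual cycle and a trend plus white noise), and the 90% threshold is just the example value from the text, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(3)
n_time, n_pix = 120, 300
t = np.arange(n_time)
# Two synthetic signals (annual cycle and linear trend) plus white noise.
signal = (np.outer(np.sin(2 * np.pi * t / 12), rng.normal(size=n_pix))
          + np.outer((t - t.mean()) / n_time, rng.normal(size=n_pix)))
anomalies = signal + 0.2 * rng.normal(size=(n_time, n_pix))
anomalies -= anomalies.mean(axis=0)

s = np.linalg.svd(anomalies, compute_uv=False)   # singular values, descending
var_ratio = s**2 / np.sum(s**2)                  # variance fraction per mode
cum_var = np.cumsum(var_ratio)                   # cumulative explained variance

threshold = 0.90
# Smallest number of modes whose cumulative explained variance reaches the threshold.
n_keep = int(np.searchsorted(cum_var, threshold) + 1)
print(n_keep, cum_var[:n_keep])
```

Plotting `var_ratio` and `cum_var` against the mode number gives the two curves asked for in the exercise. With `sklearn`, `pca.explained_variance_ratio_` plays the role of `var_ratio`.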


:::{tip} _Exercise 6:_ Reconstruct the data with selected modes from the previous exercise
Using the modes you selected above, reconstruct the data. If you were not sure about your selection, test with different numbers of modes. Look at a few monthly snapshots and compare them to the original data (e.g. plot the differences). What do you observe?
:::
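
A truncated reconstruction can be sketched as the product of the retained PC time series and EOF patterns, with the time mean added back. The example uses an SVD on synthetic data; the helper `reconstruct` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_time, n_pix = 60, 150
t = np.arange(n_time)
data = (5.0                                                    # constant offset
        + np.outer(np.cos(2 * np.pi * t / 12), rng.normal(size=n_pix))
        + 0.3 * rng.normal(size=(n_time, n_pix)))              # white noise

time_mean = data.mean(axis=0)
anomalies = data - time_mean
u, s, vt = np.linalg.svd(anomalies, full_matrices=False)
pcs, eofs = u * s, vt

def reconstruct(n_modes):
    """Rebuild the data from the first n_modes modes and add the time mean back."""
    return pcs[:, :n_modes] @ eofs[:n_modes] + time_mean

for k in (1, 5, len(s)):
    rms = np.sqrt(np.mean((data - reconstruct(k)) ** 2))
    print(f"{k:3d} modes: RMS difference = {rms:.3e}")
```

Using all modes reproduces the input up to rounding, while fewer modes leave out the part treated as noise. If `sklearn` was used with `n_components=k`, `pca.inverse_transform(pcs)` should give the equivalent truncated reconstruction.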

:::{attention} Congratulations! 
You have now completed the entire process of data analysis and data reconstruction with PCA on GRACE data. The following chapters and exercises look into adapting, refining, and optimising this process.
:::

_____

(h-gaps)=
# 4. Gap handling

:::{tip} _Exercise 7:_ Explore different ways of handling gaps
- There are different ways of handling gaps in the data before performing EOF analysis. In the exercises above, the missing months have simply been left out. Another approach is to interpolate the data to fill missing months. Please try this for shorter gaps and see how it affects your EOF analysis results. Plot your new EOFs and PC time series below.
- Handling the gap between GRACE and GRACE-FO is trickier. Try removing one more month, so that you have a gap of an entire year instead of 11 months. Does it make a difference? (Plot the EOFs and PCs again.)
- You might see now that it makes sense to arrange your EOF and PC plots in a compact way. You may want to refine your plot arrangement from above.
- There are several scientific publications that make use of EOF analysis to bridge the gap between GRACE and GRACE-FO. They often target the spherical harmonic coefficients (and use other GRACE solutions), but this works for our mascon data, too. Can you think of and sketch at least one approach to bridging the gap between the two missions using EOF analysis?
:::
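
Filling only the short gaps could be sketched with `pandas` as shown below. The example assumes a regular monthly time axis (the real GRACE time stamps are not exactly regular), uses synthetic values, and the limit `max_gap = 2` is an arbitrary choice. Gaps are filled per run length because the `limit` argument of `interpolate` alone would also partly fill long gaps.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data matrix with a short (1 month) and a long (11 month) gap.
full_index = pd.date_range("2015-01-01", periods=36, freq="MS")
values = np.arange(36, dtype=float)[:, None] * np.ones((1, 4))   # linear in time
df = pd.DataFrame(values, index=full_index)
missing = [5] + list(range(10, 21))
df_obs = df.drop(full_index[missing])        # only the months that have data

# Put the missing months back as NaN rows on a regular monthly axis.
df_full = df_obs.reindex(full_index)

max_gap = 2                                  # longest gap (in months) to interpolate
is_gap = df_full.isna().all(axis=1)
run_id = (is_gap != is_gap.shift()).cumsum() # label consecutive runs of gap / no gap
run_len = is_gap.astype(int).groupby(run_id).transform("sum")
short_gap = is_gap & (run_len <= max_gap)

# Interpolate everything linearly, then copy back only the short-gap rows.
interp = df_full.interpolate(method="linear", limit_area="inside")
df_filled = df_full.copy()
df_filled.loc[short_gap] = interp.loc[short_gap]

# Rows that are still NaN (the long gap) are dropped before the EOF analysis.
data_matrix = df_filled.dropna().to_numpy()
print(data_matrix.shape)
```

The resulting `data_matrix` still has to be centred before it is passed to the PCA.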

_____

(h-regional)=
# 5. EOF analysis of selected regions

:::{tip} _Exercise 8:_ Perform EOF analysis of a region only
- Select any region which you think might contain an interesting signal

- EOF analysis of the continents only should lead to a clearer signal in the first modes. Try it and compare the results.

:::
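
Selecting a region amounts to keeping only the columns of the data matrix that belong to it. The sketch below uses a coarse synthetic grid and a hypothetical bounding box (roughly Greenland); the 0 to 360 degree longitude convention is an assumption that should be checked against the file. A land mask, if available, could replace the box for the continents-only case.

```python
import numpy as np

rng = np.random.default_rng(5)
n_time = 36
lat = np.arange(-87.5, 90, 5.0)      # coarse synthetic grid, not the mascon grid
lon = np.arange(2.5, 360, 5.0)       # longitudes assumed to run from 0 to 360
lon2d, lat2d = np.meshgrid(lon, lat)
field = rng.normal(size=(n_time, lat.size, lon.size))

# Hypothetical bounding box: 59-84 N, 287-349 E (73 W to 11 W).
in_region = (lat2d >= 59) & (lat2d <= 84) & (lon2d >= 287) & (lon2d <= 349)

# Boolean indexing over the two spatial axes keeps only the regional pixels.
regional = field[:, in_region]               # shape (n_time, n_region_pixels)
regional = regional - regional.mean(axis=0)  # centre each pixel in time

u, s, vt = np.linalg.svd(regional, full_matrices=False)
var_ratio = s**2 / np.sum(s**2)

# Put an EOF back on the full grid for plotting; pixels outside the region stay NaN.
eof1_map = np.full(lat2d.shape, np.nan)
eof1_map[in_region] = vt[0]
print(regional.shape, eof1_map.shape)
```

The same `in_region` mask can be reused to map every regional EOF back onto the global grid.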

_____

(h-outlook-feedback)=
# 6. Outlook and feedback

&#9989; Task:
: Finally, I'd again really appreciate your feedback on this Jupyter notebook homework.

```python
from IPython.display import IFrame
IFrame('https://docs.google.com/forms/d/e/1FAIpQLSc2lg39Siu95lTva0OIIN6tVEAfTls-uAp0LNg-Wz7YZHH3VQ/viewform?embedded=true', 640, 1657)
```