# Lesson: working with netCDF data

Last week you learned how to use the basic features of the python language with the numpy and matplotlib libraries. The purpose of this lesson is to introduce you to most of the tools that you will use in the semester.

This is a very **dense** lesson. Please work through all of it and try to remember the principles. The code here will serve as a template for your own code, and you can always come back to these examples when you need them. I don't expect you to understand every detail, but I do hope that you will grasp the main ideas. You will have to copy and adapt parts of the code below to complete the exercises.

Remember that I will never ask you to use tools you didn't use in a lesson before! <small>If this happens, it was a mistake on my part, sorry!</small>

## More about the python syntax

### Python objects

In python, all variables are also "things". In the programming jargon, these "things" are called *objects*. Without going into details that you won't need for this lecture, objects have so-called "attributes" and "methods" (what you may know under the name "functions"). Attributes are information stored about the object.

For example, even simple integers are also "things with attributes":

In [None]:
# Let's define an integer
a = 1
# Get its attributes
print('The real part of a is', a.real)
print('The imaginary part of a is', a.imag)

Attributes are read with a *dot*. They are very much like variables. In fact, they are variables:

In [None]:
ra = a.real
ra

Importantly, objects can also have functions that apply to them. For example, strings have a function called ``split()``:

In [None]:
s = 'This:is:a:splitted:example'
s_splitted = s.split(':')
print(s_splitted)

One difference between attributes and functions is that functions are called with parentheses, and sometimes they require arguments (the ``':'`` in this case). Another difference is that a function almost always returns something (some functions return nothing, but they are rare).

Strings also have a ``join()`` method by the way:

In [None]:
' '.join(s_splitted)

It is not necessary to know the details of object-oriented programming to use python (in fact, most of the time you don't need to implement these concepts yourself). But it is important to know that you have access to attributes and methods on almost *everything* in python.
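
As a small illustration of the difference between attributes and methods, here is a sketch using a complex number and a list. These two built-in types are chosen only for this example and are not used elsewhere in the lesson:

```python
# A complex number: 'real' and 'imag' are attributes (no parentheses)
z = 3 + 4j
print('Real part:', z.real)
print('Imaginary part:', z.imag)

# 'conjugate' is a method: it is called with parentheses and returns a new object
z_conj = z.conjugate()
print('Conjugate:', z_conj)

# A few methods return nothing (None): list.sort() sorts the list in place
numbers = [3, 1, 2]
result = numbers.sort()
print('Return value of sort():', result)
print('List after sort():', numbers)
```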

## NetCDF data

Let's do some serious data crunching! 

First, let's import the tools we need. Remember why we need to import our tools? If not, ask Fabien! 

*Note: this can take a few seconds. On a normal system this is fast, but here the modules are imported from a network directory.*

### Imports and options

In [None]:
# Display the plots in the notebook:
%matplotlib inline
# Import the tools we are going to need today:
import matplotlib.pyplot as plt  # plotting library
import numpy as np  # numerical library
import xarray as xr  # netCDF library
import cartopy  # Map projections library
import cartopy.crs as ccrs  # Projections list
# Some defaults:
plt.rcParams['figure.figsize'] = (12, 5)  # Default plot size
np.set_printoptions(threshold=20)  # avoid printing very large arrays on screen
# The commands below are to ignore certain warnings.
import warnings
warnings.filterwarnings('ignore')

### Get the data 

The data we are going to use today is from the [CERES](https://climatedataguide.ucar.edu/climate-data/ceres-ebaf-clouds-and-earths-radiant-energy-systems-ceres-energy-balanced-and-filled) (Clouds and the Earth's Radiant Energy System) mission. We are going to use the EBAF-[TOA](https://eosweb.larc.nasa.gov/project/ceres/guide/cer_ebaf-toa.pdf) and the EBAF-[Surface](https://eosweb.larc.nasa.gov/project/ceres/guide/cer_ebaf-sfc.pdf) data products (both freely available).

If you are on the university computers, the easiest way is to read the files directly from the online directory as shown below. You can also get them from OLAT, or download them [here for TOA](https://www.dropbox.com/s/ozbo099btcz87qy/CERES_EBAF-TOA_Ed2.8_Avg-2001-2014.nc?dl=0), [here for Surface](https://www.dropbox.com/s/cyh1uovjx38hdx7/CERES_EBAF-Surface_Ed2.8_Avg-2001-2014.nc?dl=0).

### Read the data

Most of today's meteorological data are stored in the NetCDF format (``*.nc``). NetCDF files are binary files, which means that you can't just open them in a text editor: you need a special reader for them. Nearly all programming languages offer an interface to NetCDF. For this course we are going to use the [xarray](http://xarray.pydata.org/en/stable/) library to read the data:

In [None]:
# Here I downloaded the file in the "data" folder which I placed in the same folder as the notebook
# "ds" stands for "dataset". Of course you can give any name to your variables
ds = xr.open_dataset('./data/CERES_EBAF-TOA_Ed2.8_Avg-2001-2014.nc')

**Note**: you'll have to give an absolute or relative path to the file for this to work, for example ``'C:\PATH\TO\FILE\CERES_EBAF-TOA_Ed2.8_Avg-2001-2014.nc'`` on Windows.
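
One possible way to build file paths that work on every operating system is the `pathlib` module from the standard library. The sketch below assumes the same `data` folder as in the cell above; the variable names `data_dir`, `fpath` and `win_path` are only for illustration. On Windows, backslashes in a regular string can be interpreted as escape characters, so a raw string (prefix `r`) is the safer choice:

```python
from pathlib import Path

# Relative path: a "data" folder located next to the notebook (assumption)
data_dir = Path('data')
fpath = data_dir / 'CERES_EBAF-TOA_Ed2.8_Avg-2001-2014.nc'
print(fpath.name)      # file name without the folders
print(fpath.suffix)    # file extension
print(fpath.exists())  # True only if the file is really there

# Windows-style path written as a raw string, so that "\" is kept as is
win_path = r'C:\PATH\TO\FILE\CERES_EBAF-TOA_Ed2.8_Avg-2001-2014.nc'
print(win_path)
```

``xr.open_dataset()`` accepts both plain strings and `Path` objects, so either form can be passed to it.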

In [None]:
# See what we have
ds

The NetCDF dataset is made up of various elements: Dimensions, Coordinates, Variables, Attributes:
- the dimensions specify the number of elements along each data coordinate; their names are chosen to be understandable and representative
- the attributes usually do not contain any data: they provide some information about the file
- the variables contain the actual data. In our file there are five variables. All have the dimensions [month, lat, lon], so we can expect an array of size [12, 180, 360]
- the coordinates locate the data in space or time
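
To make these four elements more concrete, here is a minimal sketch that builds a tiny dataset by hand. Everything in it is made up for illustration (the variable name `flux`, its values, and the attributes), and the grid is much coarser than the CERES one:

```python
import numpy as np
import xarray as xr

# Coordinates: they locate the data in time and space
month = np.arange(1, 13)
lat = np.array([-45., 0., 45.])
lon = np.array([0., 90., 180., 270.])

# Variable: the actual data, here a constant field with dimensions (month, lat, lon)
data = np.full((12, 3, 4), 100.)
flux = xr.DataArray(data,
                    dims=['month', 'lat', 'lon'],
                    coords={'month': month, 'lat': lat, 'lon': lon},
                    attrs={'units': 'W m-2', 'long_name': 'Made-up flux'})

# Dataset: one or more variables plus file-level attributes
toy_ds = xr.Dataset({'flux': flux}, attrs={'title': 'Toy dataset'})
print(toy_ds)
print(toy_ds.flux.dims, toy_ds.flux.shape)
```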

### Coordinates 

Let's have a look at the **month** coordinate first:

In [None]:
ds.month

The month coordinate goes from 1 to 12: these are the months of the year. From the attribute "title", we know that the data represent the average for each month over the period 2001-2014.

The **location coordinates** are also self-explanatory:

In [None]:
ds.lon

In [None]:
ds.lat

**Q: what is the spatial resolution of CERES data?**

In [None]:
# your answer here

### Variables 

Variables can also be accessed directly from the dataset:

In [None]:
ds.toa_sw_all_mon

The **attributes** of a variable are extremely important: they carry the *metadata* and must be specified by the data provider. Here we can read in which units the variable is defined, as well as a description of the variable (the "long_name" attribute).

**Q: what other information can we read from this printout? Explore the other data variables and see if you understand all of them.**

In [None]:
# your answer here

## Simple analyses 

Analysing climate data is extremely easy in Python thanks to the [xarray](http://xarray.pydata.org/en/stable/) and [cartopy](http://scitools.org.uk/cartopy/docs/latest/index.html) libraries. First we are going to compute the time average of the TOA Shortwave Flux over the year:

In [None]:
sw_avg = ds.toa_sw_all_mon.mean(dim='month')

What did we just do? From the netcdf dataset, we took the toa_sw_all_mon variable (``ds.toa_sw_all_mon``) and we applied the function `.mean()` to it. So an equivalent formulation could be:

In [None]:
# Equivalent code:
sw = ds.toa_sw_all_mon
sw_avg = sw.mean(dim='month')

What is ``sw_avg`` by the way?

In [None]:
sw_avg

So `sw_avg` is a 2-dimensional array of dimensions [lat, lon] (note that the month dimension has disappeared).

When we applied the `mean()` function, we added an argument (called a **keyword argument**): ``dim='month'``. With this argument, we told the function to compute the average *over the month dimension*.

Let's remove this keyword and compute the mean again:

In [None]:
sw.mean()

Ha! We now have an array without dimensions: a single element array, also called a **scalar**. This is the total average over all the dimensions. We'll come back to this later...
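
For comparison, plain numpy arrays offer the same operation, but the dimension has to be identified by its position (the `axis` keyword) instead of its name. A minimal sketch on a made-up array with the same shape as our data:

```python
import numpy as np

# Made-up data with the shape (month, lat, lon) = (12, 180, 360)
fake = np.ones((12, 180, 360))

# Average over the first axis (position 0, the months)
avg_over_month = fake.mean(axis=0)
print(avg_over_month.shape)

# Without the keyword, the average is taken over all axes at once
total_avg = fake.mean()
print(total_avg)
```

With xarray you don't have to remember that the months are on axis 0: writing ``dim='month'`` is enough, which makes the code easier to read.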

**Q: what should we expect from the following commands:**

    sw.mean(dim='lon')
    sw.mean(dim='month').mean(dim='lon')
    sw.mean(dim=['month', 'lon'])
    
**Try them out!**

In [None]:
# Try the commands above. Do they work as expected? 

**E: what is the maximum shortwave radiation value radiated back to space? And the minimum?** ([hint](http://xarray.pydata.org/en/stable/generated/xarray.DataArray.min.html))

In [None]:
# your answer here

## A first plot 

### 2d data

We are now going to plot the average Top of The Atmosphere Shortwave Flux on a map:

In [None]:
# Define the map projection
ax = plt.axes(projection=ccrs.Robinson())
# ax is an empty plot. We now plot the variable sw_avg onto ax
sw_avg.plot(ax=ax, transform=ccrs.PlateCarree()) 
# the keyword "transform" tells the function in which projection the data is stored 
ax.coastlines(); ax.gridlines(); # Add gridlines and coastlines to the plot

We are looking at the average TOA outgoing shortwave flux, expressed in W m$^{-2}$. Such time averages are often written with a bar on top of them:

$\overline{SW_{TOA}} = temporal\_mean(SW_{TOA})$

**Q: look at the basic features of the plot. Can you explain most of the patterns that you observe? Where are the highest values? The lowest ones?**

### 1d data

It is equally easy to plot 1d data. In this case, we are going to compute the zonal average of ``sw_avg``. "Zonal average" means "average along a latitude circle". It is often written with ``[]`` or ``<>`` in formulas:

$\left[ \overline{SW_{TOA}} \right] = zonal\_mean(temporal\_mean(SW_{TOA}))$

Note that the two operators are commutative, i.e.:

$\left[ \overline{SW_{TOA}} \right] = \overline{\left[ SW_{TOA} \right]}$
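
The commutativity of the two averages can be checked numerically. The sketch below uses random numbers instead of the CERES data (an assumption made to keep the example self-contained) and plain numpy arrays, where axis 0 plays the role of `month` and the last axis the role of `lon`:

```python
import numpy as np

# Random stand-in for a (month, lat, lon) variable
rng = np.random.default_rng(seed=0)
fake = rng.random((12, 18, 36))

# Temporal mean first, then zonal mean
time_then_zonal = fake.mean(axis=0).mean(axis=1)
# Zonal mean first, then temporal mean
zonal_then_time = fake.mean(axis=2).mean(axis=0)

# Both are arrays of 18 latitude values and agree up to rounding errors
print(time_then_zonal.shape)
print(np.allclose(time_then_zonal, zonal_then_time))
```

This holds here because every average is taken over the same number of valid points. With missing values in the data, the order of the operations can matter.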

Let's compute the zonal average and plot it right away:

In [None]:
sw_avg.mean(dim='lon').plot();

**Q: look at the basic features of the plot. Can you recognize the important features from the map above?**

## More data manipulation with xarray 

As you have probably noticed already, xarray's objects (called Dataset for the whole netCDF file or DataArray for single variables) are quite powerful, and can do much more than the arrays you may know from other languages. Last week we talked about the differences between python's lists and numpy's arrays. Today we introduced a new object (DataArray) which is one level higher in usability.

But don't worry if this sounds confusing at first! From now on we are going to use DataArrays only. The best thing about them is that they carry their dimension names and coordinates with them. This is the reason why it was so easy to make a plot with the right axis labels in just one command. They have other very useful properties, and we will learn these step by step.

One of the first nice properties they have is that they behave just like regular arrays. That is, you can do arithmetic with them. Our first task will be to compute the net energy balance at the top of the atmosphere:

$$\overline{EB_{TOA}} = \overline{SW_{In}} - \overline{SW_{TOA}} - \overline{LW_{TOA}} \approx 0$$

### Arithmetic and averages on a sphere

In [None]:
# Note that there are many different ways to get to the same result. For the sake of clarity we use the simple way:
eb_avg = ds.solar_mon.mean(dim='month') - ds.toa_sw_all_mon.mean(dim='month') - ds.toa_lw_all_mon.mean(dim='month')
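
To see what this element-wise arithmetic does, here is a sketch with three made-up flux values per variable (hypothetical numbers, not CERES data) stored in plain numpy arrays. DataArrays follow the same rule, with the addition that they also match values by coordinate:

```python
import numpy as np

# Hypothetical fluxes in W m-2 at three locations
solar_in = np.array([400., 300., 200.])
sw_out = np.array([100., 100., 90.])
lw_out = np.array([250., 230., 180.])

# The subtraction is applied element by element
net = solar_in - sw_out - lw_out
print(net)
```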

**E: plot eb_avg on a map. Why did xarray choose to use another colormap? Describe the basic features of the plot. Where is the climate system gaining energy? Losing energy?** 

In [None]:
# your answer here

In [None]:
ax = plt.axes(projection=ccrs.Robinson())
eb_avg.plot(ax=ax, transform=ccrs.PlateCarree()) 
ax.coastlines(); ax.gridlines(); 

We said that the energy balance should be close to zero (balanced). Fortunately, it is easy to check:

In [None]:
eb_avg.mean()

But, wait? This is quite far from zero!!! What's going on here?

Well, it's simpler than it seems. This is an annoying problem with our planet: it happens to be a sphere (or something close to a sphere).

So when we average without taking this into account, we get wrong results. How wrong is it? A regular plot of the data will help us to see what happens here:

In [None]:
eb_avg.plot();

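This flat representation has to be compared to a sphere. When averaging [lon, lat] data without weights, one gives too much weight to the high latitudes, because grid cells there cover a smaller area than grid cells near the equator.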

Fortunately, this can be solved by noting that we have to weight each latitudinal band by the cosine of the latitude, i.e. $\cos \varphi$. We are going to compute a new average, but [weighted](https://en.wikipedia.org/wiki/Weighted_arithmetic_mean) this time. First, let's make a weight array:

In [None]:
weight = np.cos(np.deg2rad(ds.lat))
weight = weight / weight.sum()

In [None]:
weight.plot();

**Q: can you follow each step? If not, redo each step one by one, and use the ? to get help about each of these functions!**

In [None]:
# your answer here

``weight`` is an array of 180 elements, normalised so that its sum is 1. This is exactly what we need to compute a weighted average! First, we have to average over the longitudes (this is fine, because along a latitude circle all points have the same weight), and then compute the weighted average.

In [None]:
zonal_eb_avg = eb_avg.mean(dim='lon')  # important! Always average over longitudes first
# this averaging is needed so that the arithmetic below makes sense (multiply two arrays of 180 elements together)
weighted_eb_avg = np.sum(zonal_eb_avg * weight)
weighted_eb_avg

Aaaah, this looks much better now. Not exactly zero, but much closer. The remaining value (called the residual) is a combination of [measurement errors](http://ceres.larc.nasa.gov/science_information.php?page=EBAFbalance), geometrical approximations, and also a little bit of anthropogenic energy imbalance.
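
To convince yourself that the cosine weighting gives the right answer, it helps to test it on a case where the true result is known. The sketch below is such a test, with an assumed idealised field $f(\varphi) = \sin^2 \varphi$ (not CERES data) on a regular 1° grid. Its exact average over the sphere is $1/3$, while the unweighted average over latitudes is $1/2$:

```python
import numpy as np

# Latitudes of the cell centres on a regular 1 degree grid (assumption)
lat = np.arange(-89.5, 90, 1.)
lat_rad = np.deg2rad(lat)

# Idealised zonal-mean field with a known average over the sphere (1/3)
field = np.sin(lat_rad) ** 2

# Normalised weights, built the same way as in the lesson
weight = np.cos(lat_rad)
weight = weight / weight.sum()

unweighted_avg = field.mean()
weighted_avg = np.sum(field * weight)
print('Unweighted:', unweighted_avg)
print('Weighted:', weighted_avg)
```

The unweighted value is too large because this field is largest near the poles, which cover only a small part of the sphere's surface.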

### Data selection and multiline plots

We have seen that DataArrays can be averaged along one dimension as follows:

In [None]:
eb_lon_avg = ds.solar_mon.mean(dim='lon')
eb_lon_avg

The resulting array has dimensions (month, lat). One of the things we'd like to do is select certain months for example, which is an easy task with xarray and the method ``.sel()``:

In [None]:
avg_jan = eb_lon_avg.sel(month=1)
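
``.sel()`` selects by coordinate *value* (here the month number), not by position in the array. xarray also provides ``.isel()`` for selection by integer position. A minimal sketch on a made-up DataArray (the name `toy` and its values are hypothetical):

```python
import numpy as np
import xarray as xr

# Made-up (month, lat) array: each row holds the month number
months = np.arange(1, 13)
lats = np.array([-45., 0., 45.])
values = np.repeat(months[:, np.newaxis], 3, axis=1).astype(float)
toy = xr.DataArray(values, dims=['month', 'lat'],
                   coords={'month': months, 'lat': lats})

# Select by coordinate value: the month labelled 7 (July)
july = toy.sel(month=7)
# Select by position: index 6 is the seventh element, so July again
july_by_position = toy.isel(month=6)
# Several months at once, for example June to August
summer = toy.sel(month=[6, 7, 8])

print(july.values)
print(summer.dims, summer.shape)
```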

**E: plot avg_jan to make sure that it is indeed what you think it is.**

In [None]:
# your answer here

With the help of a few commands, it is not a big deal to make a nice looking plot:

In [None]:
eb_lon_avg.sel(month=1).plot(label='January')
eb_lon_avg.sel(month=7).plot(label='July')
eb_lon_avg.mean(dim='month').plot(label='Annual Avg', linewidth=3)
plt.xlim(-90, 90)
plt.title('Incoming solar radiation - zonal average')
plt.legend(loc='best')
plt.ylabel('W m$^{-2}$');