# 03: More tools for statistical and climatological analysis

In exercise 00 we learned to use the Jupyter notebook and the very basics of Python. In exercise 01 we saw that we can do some statistics and plots without knowing much about the details of the Python language. Finally, we applied what we had learned in exercise 02.

Today's lesson will also be organised in two units. This notebook will introduce some new tools, while the next notebook (``04_Exercises_Precipitation.ipynb``) will help you to apply them on your own.

## Import the packages and read the temperature data

This did not change:

In [None]:
# Define the tools we are going to need today
%matplotlib inline
import matplotlib.pyplot as plt  # plotting library
import numpy as np  # numerical library
import xray  # NetCDF library
import cartopy  # Map plotting library
import cartopy.crs as ccrs  # Projections
# Some defaults
plt.rcParams['figure.figsize'] = (14, 5)  # Default plot size
np.set_printoptions(threshold=20)  # avoid printing very large arrays on screen
# The commands below are not important
import warnings
warnings.filterwarnings('ignore')

**New data**: I have prepared new NetCDF files for the temperature and the invariant field. You will find them on OLAT. You won't see much difference between the files, but I had a good reason to change them. Reading the data is as easy as it was before:

In [None]:
netcdf = xray.open_dataset('ERA-Int-Monthly-2mTemp_new.nc')
t2_var = netcdf.t2m

We will convert the temperature to degree Celsius:

In [None]:
t2_var = t2_var - 273.15

## "Real" average temperature of the Earth

As you remember, t2_var is a multidimensional array:

In [None]:
print(t2_var.dims)

We have learned how to average this 3D array over time:

In [None]:
t2_avg = t2_var.mean(dim='time')
t2_avg

or how to compute the total average:

In [None]:
print(t2_avg.mean())

As we discussed in exercise 01, this is definitely **not** the average temperature of the Earth. The reason is that the grid points of a regular longitude-latitude grid do not all cover the same area: they are smaller at high latitudes.
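
To make this more concrete, here is a minimal sketch on synthetic data (not the ERA-Interim file). It assumes a made-up temperature field that depends on latitude only, and it uses cos(latitude) as an approximate measure of the relative area of a grid point. The names `lats`, `toy_t2` and `cos_weights` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

# Toy regular grid: latitudes from 90 to -90 and 8 longitudes (assumed values)
lats = np.linspace(90, -90, 19)  # 10 degree spacing
nlon = 8

# Made-up temperature profile: 30 at the equator, -30 at the poles
profile = 60 * np.cos(np.deg2rad(lats)) - 30
toy_t2 = np.repeat(profile[:, np.newaxis], nlon, axis=1)

# Plain average: every grid point counts the same
plain_mean = toy_t2.mean()

# Area-aware average: grid points near the poles count less
cos_weights = np.cos(np.deg2rad(lats))[:, np.newaxis] * np.ones(nlon)
weighted_mean = (toy_t2 * cos_weights).sum() / cos_weights.sum()

print(plain_mean, weighted_mean)
```

In this toy case the plain average is colder than the area-aware one, because the many small polar grid points are over-represented.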

### Create our own DataArray to do a weighted average

With the following commands I create a DataArray which contains the area of each grid point on a sphere. These three commands are the "compressed" version of an explanation provided in the notebook "Appendix_A_Surface_of_Gridpoints". The most interested students may want to have a look at it, but for now we just use the few commands I prepared:

In [None]:
corner_lats = np.deg2rad(np.clip(np.arange(242) * 0.75 - 90.375, -90, 90))
area_segment = 2 * np.pi * 6371**2 * np.abs(np.sin(corner_lats[1:]) - np.sin(corner_lats[:-1])) / 480
area_grid = (t2_avg * 0 + 1) * xray.DataArray(area_segment, [('latitude', t2_avg.latitude)])
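
The second command relies on a standard result of spherical geometry: the area of the band between two latitudes φ1 and φ2 on a sphere of radius R is 2πR²·|sin φ2 − sin φ1|, and dividing it by the number of longitudes (480 here) gives the area of one grid point. The sketch below is only an illustration of that formula for a single grid point. It compares it with the small-cell approximation R²·cos(φ)·Δφ·Δλ. The choice of a cell centred on 45°N and all variable names are assumptions made for this example.

```python
import numpy as np

R = 6371.0   # Earth's radius in km
dlat = 0.75  # grid spacing in degrees, as in the commands above
nlon = 480   # number of longitudes, as in the commands above

# One grid cell centred on 45 degrees North (arbitrary choice for illustration)
lat_c = 45.0
lat_s, lat_n = np.deg2rad([lat_c - dlat / 2, lat_c + dlat / 2])

# Band between the two latitudes, shared equally between all longitudes
exact = 2 * np.pi * R**2 * np.abs(np.sin(lat_n) - np.sin(lat_s)) / nlon

# Approximation for small cells: R^2 * cos(lat) * dlat * dlon (angles in radians)
approx = R**2 * np.cos(np.deg2rad(lat_c)) * np.deg2rad(dlat) * (2 * np.pi / nlon)

print(exact, approx)
```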

**E: "Explore" the variable area_grid. Plot it on a map (as we did with temperature). What is the probable unit of this variable? Write a simple test to see if this variable is indeed the real surface area of the ERA-Interim grid points** (Earth's radius: 6371 km)

In [None]:
# Your answer here

### Weighted average

To compute the average temperature of the Earth we have to "weight" each temperature value by its relative contribution to the global average. The easiest way to do this is to define an array of weights with the following properties: it has the same dimensions as ``area_grid``, each value is proportional to the area of the grid point, and the sum of its elements is 1.

**E: use the variable "area_grid" to compute a variable named "weight" which has the properties mentioned above.** ([hint](http://xray.readthedocs.org/en/stable/generated/xray.DataArray.sum.html))

In [None]:
# Your answer here

**E: compute a variable "weighted_t2_avg" by multiplying "t2_avg" with "weight". Compute the sum of its elements.**

In [None]:
# Your answer here

**Q: What is the result of our computations? Is it now closer to our expectations?**

In [None]:
# Your answer here

## Working with time series 

We start by multiplying our 3-dimensional ``t2_var`` with the 2-dimensional ``weight`` variable:

In [None]:
weighted_2d_var = weight * t2_var

We then sum this weighted variable over the dimensions 'longitude' and 'latitude', then we plot the result:

In [None]:
t2_avg_ts = weighted_2d_var.sum(dim='longitude').sum(dim='latitude')
t2_avg_ts.plot();
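
The multiplication works even though the two arrays do not have the same number of dimensions: the 2D weights are repeated ("broadcast") along the time dimension. xray matches the dimensions by name. The NumPy sketch below matches them by position instead and uses small synthetic arrays; the shapes and the names `toy_var`, `toy_weight` and `toy_ts` are assumptions made for this illustration.

```python
import numpy as np

# Synthetic data: 4 time steps on a 3 x 5 (latitude, longitude) grid
rng = np.random.default_rng(0)
toy_var = rng.normal(size=(4, 3, 5))

# Synthetic weights on the same grid, normalised so that they sum to 1
toy_weight = rng.uniform(size=(3, 5))
toy_weight = toy_weight / toy_weight.sum()

# The 2D weights are applied to every time step
weighted = toy_weight * toy_var     # shape (4, 3, 5)

# Summing over latitude and longitude leaves one value per time step
toy_ts = weighted.sum(axis=(1, 2))  # shape (4,)

print(weighted.shape, toy_ts.shape)
```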

**Q: what is "t2_avg_ts"? Are you surprised by what you see? Try to find an explanation for the strong periodic variations.**

In [None]:
# Your answer here

### Annual cycle 

xray makes it very easy to compute the standard statistics of time series. For example, let's see what the following commands will do:

In [None]:
t2_cycle_ts = t2_avg_ts.groupby('time.month').mean(dim='time')
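
For readers who want to see the grouping idea spelled out, here is a minimal NumPy sketch on a synthetic monthly series. It assumes that the series starts in January and covers complete years; `months`, `toy_ts` and `toy_cycle` are illustrative names.

```python
import numpy as np

# Synthetic monthly series: 3 complete years, starting in January (assumed)
n_years = 3
months = np.tile(np.arange(1, 13), n_years)  # 1, 2, ..., 12, 1, 2, ...
toy_ts = 10 + 5 * np.sin(2 * np.pi * (months - 4) / 12) + 0.1 * np.arange(months.size)

# Group the values by calendar month and average each group
toy_cycle = np.array([toy_ts[months == m].mean() for m in range(1, 13)])

print(toy_cycle.shape)  # one value per calendar month
```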

**E: "explore" the variable "t2_cycle_ts". Plot it. What are we looking at? What could be the reasons for these variations?**

In [None]:
# Your answer here

### Annual average

In [None]:
t2_annual_ts = t2_avg_ts.resample(dim='time', freq='A')
t2_annual_ts.plot()

**E: "explore" the variable "t2_annual_ts". Plot it. What are we looking at?**

In [None]:
# Your answer here

**E: compute the standard deviation, min and max of this time series. What can you say about the variability of the air temperature at the surface of the Earth?**

In [None]:
# Your answer here

## Selecting specific areas of our data

We are now getting back to the map of average temperature:

In [None]:
ax = plt.axes(projection=ccrs.PlateCarree()) # Note that I changed the projection
t2_avg.plot(ax=ax, origin='upper', aspect='equal', transform=ccrs.PlateCarree()) 
ax.gridlines()  # What does this command do?
ax.coastlines();

We are now learning how to "select" parts of the data for a specific analysis. Once again, xray provides us with very intuitive tools:

In [None]:
sel_t2 = t2_avg.sel(longitude=slice(-20, 20))

The best way to understand what we've done is simply to plot it:

In [None]:
ax = plt.axes(projection=ccrs.PlateCarree()) # Note that I changed the projection
sel_t2.plot(ax=ax, origin='upper', aspect='equal', transform=ccrs.PlateCarree()) 
ax.add_feature(cartopy.feature.BORDERS); # What does this command do? 
ax.coastlines();

**E: create a new "sel_t2" variable which is a subset of t2_avg between the longitudes (-20, 60) and the latitudes (40, -40). Plot the result.** (*hint: yes, I wrote (40, -40) and not (-40, 40)*)

In [None]:
sel_t2 = t2_avg.sel(longitude=slice(-20, 60), latitude=slice(40, -40))
ax = plt.axes(projection=ccrs.PlateCarree()) # Note that I changed the projection
sel_t2.plot(ax=ax, origin='upper', aspect='equal', transform=ccrs.PlateCarree()) 
ax.add_feature(cartopy.feature.BORDERS); # What does this command do? 
ax.coastlines();
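
Why (40, -40)? A label-based slice walks along the coordinate in the order in which it is stored, and the latitudes used here run from north to south. The sketch below imitates this behaviour with plain NumPy on a coarse toy grid. The helper `label_slice` is hypothetical (it is not part of xray) and the 10° coordinates are assumed values chosen so that the requested labels exist on the grid.

```python
import numpy as np

# Coarse toy coordinates (assumed): latitudes from north to south,
# longitudes from west to east
lats = np.arange(90, -100, -10)  # 90, 80, ..., -90
lons = np.arange(-180, 180, 10)  # -180, -170, ..., 170

def label_slice(coord, start, stop):
    """Values met when walking along coord from label start to label stop."""
    first = np.nonzero(coord == start)[0]
    last = np.nonzero(coord == stop)[0]
    if first.size == 0 or last.size == 0:
        return coord[0:0]  # label not on the grid: nothing selected
    return coord[first[0]:last[0] + 1]  # both end points are included

print(label_slice(lats, 40, -40))  # north to south: values are selected
print(label_slice(lats, -40, 40))  # wrong direction: empty selection
print(label_slice(lons, -20, 60))  # west to east: values are selected
```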

## Dimensional juggling!

Now that we know how to work with time series and how to select part of our data, maybe we could combine both methods? Let's get back to our original 3D temperature data. Remember?

In [None]:
t2_var.dims

If I ask xray to plot this data, it won't really know what to do with all these dimensions, so it falls back to a default solution:

In [None]:
t2_var.plot();

**Q: what are we looking at? Explain what you see.**

Note that the selection methods we applied earlier can *also* be applied to our 3-dimensional array! Let's try it:

In [None]:
sel_t2 = t2_var.sel(longitude=slice(-20, 60), latitude=slice(40, -40))

**E: Explore sel_t2. What are its dimensions, its coordinates?**

In [None]:
# your answer here

By the way, if there is a time dimension: shouldn't the time aggregation methods also be applicable to our variable?

In [None]:
sel_t2_cycle = sel_t2.groupby('time.month').mean(dim='time')

**E: Explore sel_t2_cycle. What are its dimensions, its coordinates?**

In [None]:
# your answer here

Let's continue to play around:

In [None]:
sel_t2_cycle_lonavg = sel_t2_cycle.mean(dim='longitude')

**E: Explore sel_t2_cycle_lonavg. What are its dimensions, its coordinates? Try out the command: "sel_t2_cycle_lonavg.T" What did it change?**

In [None]:
# your answer here

OK. Let's try to plot it:

In [None]:
sel_t2_cycle_lonavg.T.plot();

The plot above is called a [Hovmöller diagram](https://en.wikipedia.org/wiki/Hovm%C3%B6ller_diagram) and is often used in climatology. The default plotting procedure is probably not the best choice for this kind of plot. Here's another possibility:

In [None]:
xray.plot.contourf(sel_t2_cycle_lonavg.T, levels=np.linspace(10, 30, 11));
plt.title('Hovmöller plot of the monthly average of temperature 1970-2014 over Africa (20°W, 60°E)');
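
The chain of operations that leads to this diagram can be summarised with array shapes. The following NumPy sketch uses random synthetic data and assumes complete years starting in January; the grid size and the `toy_` names are made up for the illustration.

```python
import numpy as np

# Synthetic monthly data: 2 years on a 5 x 4 (latitude, longitude) grid
n_years, n_lat, n_lon = 2, 5, 4
rng = np.random.default_rng(1)
toy_var = rng.normal(size=(n_years * 12, n_lat, n_lon))

# Step 1: mean annual cycle at each grid point -> (month, latitude, longitude)
toy_cycle = toy_var.reshape(n_years, 12, n_lat, n_lon).mean(axis=0)

# Step 2: average over longitude -> (month, latitude)
toy_cycle_lonavg = toy_cycle.mean(axis=2)

# Step 3: transpose so that latitude comes first -> (latitude, month)
toy_hovmoller = toy_cycle_lonavg.T

print(toy_cycle.shape, toy_cycle_lonavg.shape, toy_hovmoller.shape)
```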

**Q: Describe the plot.**

In [None]:
# your answer here

## Selection based on a condition

What if we are interested in air temperature on land only, and want to remove the oceans from our analyses? For this we will have to "mask out" the ocean grid points. First, we need to open the invariant file:

In [None]:
nc_inv = xray.open_dataset('ERA-Int-Invariant_new.nc')
nc_inv

We remember from Exercise 02 that the variable "lsm" contains the landmask from ERA-Interim.

In [None]:
ax = plt.axes(projection=ccrs.Robinson())
nc_inv.lsm.plot(ax=ax, origin='upper', aspect='equal', transform=ccrs.PlateCarree()) 
ax.gridlines()
ax.coastlines();

OK. So "1" is land, "0" is ocean. We are going to use this information to mask out the values from the ocean:

In [None]:
masked_t2_avg = t2_avg.where(nc_inv.lsm == 1)

What did we just do? We applied a filter to keep the values only [where](http://xray.readthedocs.org/en/stable/generated/xray.DataArray.where.html#xray.DataArray.where) a certain condition is met. Here the condition is that the variable "lsm" is equal to one (land).
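
The same idea can be tried out on a tiny example. The sketch below uses NumPy's `np.where` on a made-up 2 x 3 field and mask (1 = land, 0 = ocean); the values and the `toy_` names are assumptions for illustration only. Grid points that do not meet the condition are replaced by NaN ("not a number").

```python
import numpy as np

# Synthetic 2 x 3 temperature field and land-sea mask (1 = land, 0 = ocean)
toy_t2 = np.array([[10.0, 12.0, 14.0],
                   [20.0, 22.0, 24.0]])
toy_lsm = np.array([[1, 0, 0],
                    [1, 1, 0]])

# Keep the values where the condition holds, put NaN everywhere else
toy_masked = np.where(toy_lsm == 1, toy_t2, np.nan)
print(toy_masked)

# NaN-aware statistics ignore the masked grid points
print(np.nanmean(toy_masked))
```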

**E: Plot the variable "masked_t2_avg". Compute the average temperature on land from this data. But be careful! Shouldn't this array also be weighted? Repeat the operation with all ocean grid points. Compare the two values.**

In [None]:
# your answer here

**E: Now repeat the "dimensional juggling" operations above, but with the oceans masked. Do the Hovmöller plot for Africa with all ocean grid points masked.**