---

# <center> Environmental Data </center>

---

## Course objectives and philosophy
Let's start by recalling what the learning objectives of the Environmental Data modules are:

1. Understand common data formats and database structures specific to representative fields of environmental science
2. Demonstrate technical competency in handling common data types routinely encountered in the environmental sciences and identify relevant open-source data repositories
3. Identify and design suitable data analysis strategies that consider data types, data distribution constraints, strengths, benefits and limitations of statistical and modelling tools and environmental dynamics.
4. Understand the limitations of available data and data analysis products. Understand sources of errors and demonstrate ability to comprehensively characterize uncertainties and interpret results in the context of these uncertainties, including measurement errors, environmental uncertainties as well as errors stemming from the analytical procedure itself (e.g. calibration of analysis using synthetic data/models).

You'll note that these objectives address a progression of skills. The first two focus on technical competency (understanding common data formats and an ability to manipulate these data). The last two go deeper than technical competency - these aim to develop a data analysis workflow. In real-life situations, some of that workflow will be informed by intuition, and the best workflow will depend on context (e.g. time, infrastructure available) and on the objectives of a study.

Intuition will come with experience, and therefore, also from failure! This course is an opportunity to try, sometimes fail ... and therefore learn! (That is the luxury of being a student!) With that in mind, some of the coursework presented here will be purposefully "vague". This is not meant to be annoying; it is designed to push you to seek and explore possible solutions (solutions are rarely unique) and express your creativity as an analyst.

**Important**: Data analysis is a key component of the scientific method and therefore cannot exist without `hypothesis testing`. Approach every part of the data analysis workflow as a test, an experiment. Each experiment should be motivated by a hypothesis (a goal), will involve applying a methodology, will produce results pertinent to the hypothesis/goal, and will require a critique of these results with regards to their ability to address the hypothesis/goal before drawing a conclusion (it is ok for conclusions to come with caveats and uncertainties if the analysis warrants limits of interpretation).  

**Warning**: Data analysis can also be a bit of a "rabbit hole"! There will always be nicer, cooler, better things to do. Sometimes, good is good enough though. Whatever you do, never lose sight of the overall objectives of a study. These need to be delivered as a priority.

## This segment of the module
In this module, content will be delivered using a mixture of
PowerPoint lectures, Jupyter notebooks and practical exercises.

Warning: other directories, containing figures, images, animations or data, are also provided.
However, climate data can be very large, and GitHub has a low file-size limit for these.
A Dropbox link with the data also exists: https://www.dropbox.com/sh/fxcmtbz4o3tacz1/AABjQbeyg27zDh1chZxRDFcpa?dl=0

### Github repository: `environmental-data-week-3`


## We will focus on and use climate data.
This segment of the module will focus on 2D, 3D and 4D data. To illustrate techniques and learn some useful skills, we will use data around the theme of climate and climate change. As part of this, we will learn how to access and use climate data and output from global climate models. 

The science of climate is beyond the scope of this course, but although we will focus on data analysis, it is my hope that you will develop an appreciation and better understanding of climate science in the process. 
___

---

# 01- Introduction to climate data and climate models

---

<a id='Contents'></a>
## Contents
- [Lectures](#lectures)
- [Question](#question)
- [ESMValTool](#ESMValTool)
- [Sources of climate data](#climatedata)
- [CEDA](#CEDA)
- [File formats](#fileformat)
- [Metadata](#Metadata)
- [NetCDF](#NetCDF)
- [NCO](#NCO)
- [Reanalysis products](#reanalysis)
- [Climate services UKCP18](#ukcp18)
  
## Learning outcomes
1. Understand the basics of climate modelling
2. Know where to find and how to access climate data (ESGF, CEDA)
3. Understand common file formats
4. Understand the importance of metadata and understand CF conventions
5. Understand the NetCDF file format and know how to read, create and manipulate NetCDF data
6. Become familiar with other key tools used to manipulate netCDF data (NCO, CDO)
7. Know how to access ERA5 reanalysis products
8. Understand what climate services and UKCP18 are, and know how to access those data

---


<a id='lectures'></a>
# Lecture on relevance of climate science, climate models and climate modelling


(see Powerpoint presentations)

1. 01v1_Why_climate_data.pptx/pdf
2. 02v1_Building_climate_models.pptx/pdf
3. 03v1_Common_issues.pptx/pdf
   

---

<a id='question'></a>

# Assessment and driving question to answer this week: 
It is ok to learn techniques and go through practicals, but the overall objective of this course is to equip you with the skills needed to pursue your own analyses. With this in mind, we are going to work towards a central question, to motivate learning through the next few lessons.  


* ## Should French Champagne makers invest in wineries in Hampshire (UK)?

Recently, [French champagne producer Taittinger decided to invest in a winery in Kent, called Domaine Evremond](https://www.domaineevremond.com). Taittinger is aware that other English wineries in [Hampshire](https://en.wikipedia.org/wiki/Hampshire) are also [producing excellent sparkling wine](https://www.visit-hampshire.co.uk/food-and-drink/vineyards). 

Taittinger's business wing is of course concerned about this. Are these new UK wineries going to be strong competitors, or should Taittinger see them as investment opportunities and try to buy them while it can, to establish a stronger presence in the UK market?

Like everyone else, Taittinger is also aware of climate change and realizes that, from now on, it needs to systematically incorporate a climate analysis into its business decisions.  

From the growing success of English winemakers, Taittinger can see that the current climate in the Hampshire region is good enough for growing good grapes, but they want to know if and how climate change will affect that climate. Ultimately, they want to know if climate change could affect the wine business in Hampshire. Will climate change negatively affect viticulture in the region, should we expect little change, or are conditions actually going to become even better for growing grapes in the next few decades? 

Imagine that your new start-up, an up-and-coming climate intelligence company, is contacted by representatives from Taittinger HQ to advise them on how climate will change in Hampshire. They want to know how climate change may affect wineries in Hampshire in the coming 50 years or so and they are looking for someone to deliver answers.  

**Your task is to i) develop a research strategy that will answer their questions on the future of climate in Hampshire, ii) present initial results on the issue, and iii) issue a preliminary recommendation as to whether Taittinger should or should not invest in UK wineries going forward.**

Sadly, time is of the essence! Taittinger has heard that [Moët & Chandon](https://en.wikipedia.org/wiki/Moët_&_Chandon), another major champagne producer, is also eyeing these business opportunities and they must make a decision quickly. **Your goal here is to produce code (e.g. a Jupyter notebook) by Friday that summarizes your analysis and recommendations to Taittinger, building on what you learn during the week.** 

In your analysis, consider which data you want to use, how to get them (can you access and analyze them already?), and what could be learned from them. Consider resolution, data use policy, data storage and analytical requirements (will you need to use a cloud service like JASMIN? If so, could you use it for free as a commercial entity? If not, how much could that cost?). You may also want to use output from reanalysis products (e.g. ERA5) or climate models (UKCP18, CMIP, etc.). 

Use any and all resources at your disposal (including your instructor and your classmates) but keep in mind that **this is a commercial project**, so beware of any data use or resource access policies in place. In the real world, it could be that addressing this question would imply some costs that could be passed on to the client. However, the rules for this exercise are to only use open-access resources. 

You don't have much time, but this could be a big and important contract for your new company so you want to develop a good solution, ideally with some preliminary results to showcase your skills, and prove to Taittinger that you are the right people for the job. Your company could earn future business from them! 

**IMPORTANT**: Although you are encouraged to talk with your classmates and exchange ideas, sources of data, etc., the report you produce **must represent your own work**. **Plagiarism in any form will not be tolerated, either in the report or in the code.** 

---

<a id='ESMValTool'></a>

# Python environments to consider for working with climate data 

* ## Earth System Model Evaluation Tool (ESMValTool)
![esmvaltool](img/ESMValTool-logo-2.png)

This week, it should be possible to simply install packages as needed...

However, if you anticipate possible future needs, you may want to invest a bit of effort to set up a specific Python environment to work with climate data. If so, a good place to start is the `ESMValTool` project. 



[ESMValTool](https://docs.esmvaltool.org/projects/esmvalcore/en/latest/index.html) is a community-led, open-source set of software tools (built on Python 3) developed to improve diagnosing and understanding climate models. While the focus has been on climate models, not observations, one cannot quantify model biases without also comparing the model results with observations. [ESMValTool](https://docs.esmvaltool.org/projects/esmvalcore/en/latest/index.html) therefore also has some capability to analyze and manipulate large datasets. 

[ESMValTool](https://docs.esmvaltool.org/projects/esmvalcore/en/latest/index.html) can work offline, but as the amount of climate data is so large, it is mostly designed to work with data centers that provide local access to the vast amount of data. One such initiative is the Earth System Grid Federation ([ESGF](https://esgf.llnl.gov)). ESMValTool provides functionality to work on HPC systems.

We will not use ESMValTool here specifically, but it will install other convenient tools and you may be curious to investigate ESMValTool capabilities on your own anyway. It is now used routinely for climate modelling research. 

Additional information and help can be found at https://www.esmvaltool.org, and on the [ESMValGroup](https://github.com/ESMValGroup) GitHub page. The [ESMValTool tutorial](https://esmvalgroup.github.io/ESMValTool_Tutorial/) is a great resource to get started. 


### Installing ESMValTool (optional, but recommended)
[Installation instructions](https://docs.esmvaltool.org/en/latest/quickstart/installation.html) differ by operating system, and the tool is still very much experimental and in development - keep this in mind! 

[Notes: ESMValTool 2.0 requires a Unix(-like) operating system and Python 3.7+ (Python 3.7, 3.8 and 3.9 are supported). On my laptop (MacBook Pro 13'' M1 2020 with macOS Monterey 12.0.1), I installed it in Nov 2021 with Python 3.9.7, but at that time `conda` was still an installation option. The ESMValTool developers have since changed the installation method to `mamba`, so do refer to the official online instructions. You will now also be required to [install `mamba`](https://docs.esmvaltool.org/en/latest/quickstart/installation.html#mamba-installation) first (...yet another package/dependency manager).] 

ESMValTool can also work with Julia, R, NCL. Do check installation instructions for these extensions. 

[back to contents](#Contents)

---

<a id='climatedata'></a>

# Working with climate data and climate models

### Earth System Grid Federation (ESGF)

![esgf](img/ESGF_logo.png)


[ESGF](https://esgf.llnl.gov) is an open-source platform that provides distributed access to peta/exa-scale scientific data, **globally**. 

It is an interagency and international effort led by the US Department of Energy (DOE), co-funded by the National Aeronautics and Space Administration (NASA), the National Oceanic and Atmospheric Administration (NOAA), the US National Science Foundation (NSF), and other international partners, such as the Max Planck Institute for Meteorology (MPI-M), the German Climate Computing Centre (DKRZ), the Australian National University (ANU) National Computational Infrastructure (NCI), the Institut Pierre-Simon Laplace (IPSL), and the **Centre for Environmental Data Analysis ([CEDA](https://www.ceda.ac.uk))** in the UK.


<a id='CEDA'></a>

### For the UK: Centre for Environmental Data Analysis (CEDA)

![ceda](img/ceda_archive_logo_transp_white_3_h80.png)

[CEDA](https://www.ceda.ac.uk) serves the UK environmental science community. It is a component of ESGF. 

CEDA has two main branches: 
1. [CEDA-Archive: https://archive.ceda.ac.uk](https://archive.ceda.ac.uk) 
2. [JASMIN: https://jasmin.ac.uk](https://jasmin.ac.uk)

[CEDA-Archive](https://archive.ceda.ac.uk) serves as the national (UK) data centre for atmospheric and earth observation research. It currently holds >18 Petabytes of atmospheric and earth observation data from a variety of sources, such as aircraft campaigns, satellite imagery, automatic weather stations, climate models, etc. 

CEDA is one of 5 UK data centers that together make up the Environmental Data Service ([EDS](https://nerc.ukri.org/research/sites/environmental-data-service-eds/)), supported by UK Research and Innovation's Natural Environment Research Council (NERC). The 5 centers are 
1. British Oceanographic Data Centre (Marine), 
2. CEDA (Atmospheric, Earth Observation, and Solar and space physics), 
3. Environmental Information Data Centre (Terrestrial and freshwater), 
4. National Geoscience Data Centre (Geoscience), and 
5. Polar Data Centre (Polar and cryosphere). 
   
As the focus is on climate here, CEDA is the most relevant, but it is useful to know of these resources if you end up working in other fields. 

CEDA also serves as the Data Distribution Center ([DDC](https://www.ipcc-data.org)) for the Intergovernmental Panel on Climate Change ([IPCC](https://www.ipcc.ch)).

[JASMIN](https://jasmin.ac.uk) is a data intensive supercomputer for environmental science. It currently supports over 160 projects ranging from climate science and oceanography to air pollution, earthquake deformation or biodiversity studies. JASMIN consists of multi-Petabyte fast storage, co-located with data analysis computing facilities. This would be the "go-to" tool for processing large amounts of climate data, which are directly linked to the system (through CEDA). We will here work offline (using local storage), but JASMIN and CEDA are useful resources for anyone working in environmental science in the UK (most nations don't have these facilities - they are rare and valuable!). 

Since CEDA and JASMIN are linked, the **compute power is linked to the data**, making for an efficient environment for data analysis. 

#### Getting access

Obviously, maintaining these facilities has a cost and therefore not all users will have the same access privileges. Members of educational, academic and research institutions in the UK can often access these infrastructures for free (provided one can demonstrate need, and subject to a fair-use policy). Use and access will likely not be free for commercial applications.  

**Use your Imperial College London email address (or one from another UK institution) to register!**

In order to use the CEDA-Archive, it is necessary to first [register for an account](https://services.ceda.ac.uk/cedasite/register/info/). This is generally quick, easy and free.

Similarly, access to JASMIN requires an account, but this involves a more complex multi-step process. First, one must ask for a [JASMIN portal account](https://help.jasmin.ac.uk/article/4435-get-a-jasmin-account). If granted, one can then register for a [jasmin-login account](https://help.jasmin.ac.uk/article/161-get-login-account). The type of account given will depend on need and privileges requested. Users granted jasmin-login access get a HOME directory with 100GB of storage and can access the shared JASMIN servers (scientific servers, data transfer servers, and the LOTUS HPC cluster).
   
[back to contents](#Contents)

---

<a id='fileformat'></a>

## File formats 

![unidata](img/unidatalogo.png)

It is customary for different scientific communities to use different data formats ... it would be too easy otherwise! 

More seriously, these file formats usually evolve out of necessity. Not all data are equal in quantity, precision, type, etc. Each community therefore tends to evolve data formats specific to their needs and to cater for certain applications. 

* **TEXT**: Files in **text formats**, typically encoded as [ASCII](https://en.wikipedia.org/wiki/ASCII) or [Unicode](https://en.wikipedia.org/wiki/Unicode) (often UTF-8 or UTF-16), simply contain bytes that are interpreted as text characters. ASCII defines 128 characters, which are also the first 128 characters of Unicode, so ASCII is a subset of Unicode. ASCII was developed as a 7-bit code, and because UTF-8 encodes those 128 characters with the same single-byte values, any valid ASCII file is also a valid UTF-8 file. Text formats are very attractive and useful as the file content is directly human-readable, but this only works well for small files because storing numerical data as text is inefficient. 

* **BINARY**: More commonly, we will encounter files in **binary formats**. Sadly, binary files are not human-readable. They need to be interpreted, **using sets of rules set by whoever prescribed the format**. These rules have to be known to decode the data effectively (this can be a source of problems...and cause for frustration!).  
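
To make the text/binary distinction concrete, here is a minimal sketch using only the Python standard library. The three pressure values are made up for illustration, and the choice of little-endian 32-bit floats is an assumption of this example, not a property of any particular data format.

```python
import struct

# Three pressure readings in hPa (hypothetical values, for illustration only).
values = [1013.25, 1009.75, 998.5]

# Text storage: every character of the written number costs one byte.
text = ",".join(str(v) for v in values)
text_ascii = text.encode("ascii")
text_utf8 = text.encode("utf-8")

# Plain ASCII text produces exactly the same bytes when encoded as UTF-8.
same_bytes = text_ascii == text_utf8

# A non-ASCII character (the degree sign) needs two bytes in UTF-8.
degree_sign_bytes = "°".encode("utf-8")

# Binary storage: three little-endian 32-bit floats, 4 bytes each.
# The format string "<3f" is the 'rule' a reader must know to decode the bytes.
binary = struct.pack("<3f", *values)
decoded = list(struct.unpack("<3f", binary))

# Decoding with the wrong rule (big-endian here) silently yields wrong numbers.
misread = list(struct.unpack(">3f", binary))

print(len(text_ascii), "bytes as text;", len(binary), "bytes as binary")
print("ASCII bytes identical to UTF-8 bytes:", same_bytes)
print("decoded correctly:", decoded == values, "| misread equals original:", misread == values)
```

The binary version is smaller and decodes exactly, but only because the reader knows the layout; the same bytes read with a different byte order give meaningless values without raising any error.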
  
Some common data formats found in environmental sciences are summarized in [this table](https://help.ceda.ac.uk/article/104-file-formats). 

The `pandas` I/O tools can handle many of them (read and write capability), as summarized in this [table](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html). 

`gdal`, the Geospatial Data Abstraction Library ([GDAL](https://gdal.org)), is also able to deal with these (and many more). 

Some of the most important, or common, file formats in climate studies are: 

1. `.csv`: These are comma-separated values, i.e. text separated by commas. This format (or the related tab-delimited text format) is typically used for small datasets, such as when saving an Excel spreadsheet into a format useful for other programmes. Certain data-loggers can also produce text-based datasets. Data in this format will tend to be in the form of small tables or 1-D data, such as time-series data.  
   
2. `.hdf`: The so-called [Hierarchical Data Format](https://www.hdfgroup.org) is the format commonly associated with satellite imagery, where the files typically hold raster data. HDF files tend to be very large and can have quite complex structure. The current version is `HDF5`, which differs from legacy `HDF4`. Common HDF file suffixes are: .hdf, .h4, .hdf4, .he2, .h5, .hdf5, .he5. `pandas` can read HDF5 data into `python` using [`pandas.read_hdf`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_hdf.html), although this works best for files that were written by `pandas` itself. The more specific `h5py` python package ([HDF5 for Python](http://www.h5py.org)) provides more control, if needed.  
   
3. `.nc` : NetCDF files are a typical format for gridded datasets with more than one dimension (i.e. 2D to ND data), for example data with a latitude, longitude, depth/altitude and certain values. It is the data currency format used to exchange climate model outputs. Note, this does not mean that raw output of climate models comes as `.nc` files. Generally, models use their own binary formats during computation, to improve speed and storage. Results are only transformed into convenient `.nc` format at the end, where the goal is data communication and interpretation. The [`xarray`](http://xarray.pydata.org/en/stable/) python library is tailored to work with `netCDF` data formats.  
   
4. `.grb`: [GRIB](https://en.wikipedia.org/wiki/GRIB) files, which stands for 'GRIdded Binary' or 'General Regularly-distributed Information in Binary form', are often used in meteorology. They are the types of files used by ECMWF. There have been 3 versions: version 0 is legacy and no longer in use, and version 2 is slowly replacing version 1, although version 1 is still used by most meteorology centers today. The [`cfgrib`](https://github.com/ecmwf/cfgrib) python library from ECMWF will help translate `.grb` into `xarray`, `netcdf` or `hdf`. Alternatively, the [`pygrib`](https://github.com/jswhit/pygrib) python package will provide a way to read `.grb` files, and modify existing ones, but not create them.
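
Since a binary file can only be decoded with the right rules, a useful first step with an unfamiliar file is to work out which format it is in. Many binary formats start with a fixed signature ("magic bytes"). The sketch below checks the first few bytes against the signatures of the formats listed above. The helper names `sniff_format` and `sniff_file` are made up for this example, and the check is deliberately simplistic: it only looks at the very start of the file, so it can miss files that carry extra leading bytes.

```python
# Leading-byte signatures for some of the formats discussed in this section.
SIGNATURES = [
    (b"\x89HDF\r\n\x1a\n", "HDF5 (also used by netCDF-4)"),
    (b"\x0e\x03\x13\x01", "HDF4"),
    (b"CDF\x01", "netCDF classic"),
    (b"CDF\x02", "netCDF 64-bit offset"),
    (b"CDF\x05", "netCDF 64-bit data"),
    (b"GRIB", "GRIB"),
    (b"\x1f\x8b", "gzip-compressed"),
]


def sniff_format(header: bytes) -> str:
    """Guess a format name from the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown (possibly text)"


def sniff_file(path: str) -> str:
    """Read the first 8 bytes of a file on disk and guess its format."""
    with open(path, "rb") as f:
        return sniff_format(f.read(8))


# Demonstration on hand-written headers rather than real files.
print(sniff_format(b"CDF\x01\x00\x00\x00\x00"))
print(sniff_format(b"\x89HDF\r\n\x1a\n"))
print(sniff_format(b"time,temperature"))
```

A `.nc` suffix does not tell you whether a file is netCDF classic or netCDF-4; a check like this one does, which matters when a tool supports only one of the two.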

[back to contents](#Contents)

---

* ## File aggregation and compression

Data often come **aggregated** and **compressed**. These will typically have the following suffixes: 

1. `.tar`: The .tar suffix represents an **aggregation** of data, i.e. many files grouped together into a bundle of joy called a 'tar ball'. 'Tar balls' have to be 'untarred' (disaggregated) before being usable. 
2. `.zip`, `.gz`, ...: This relates to data **compression** (this is different from aggregation). Compressed data have to be uncompressed before being usable. Uncompressing will increase the storage requirement of the data! Because they are smaller, it is convenient to exchange compressed files. 

It is also common to find files that are both "tarred" and "zipped", typically with suffix `.tar.gz`. 
To unzip and untar a file, you can use (on MacOS and Linux): 
>`tar -xzf <tarfile>`
 
where the option flags `-xzf` stand for `Xtract Ze File` ... as "the Terminator" would say (`man tar` will of course provide more detail): 
* x: Extract, i.e. untar the tarfile. 
* z: Use gzip to uncompress (this option can be omitted if the file is 'just' aggregated and not compressed, i.e. just .tar)
* f: specifies the input tarfile to operate on. 

To *create* an aggregated compressed file, the syntax is similar, just replace option `x` (extract) with `c` (for create): 
>`tar -czf tarfile <list of files to tar and gzip>`. 
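
The same two operations can be done from Python with the standard-library `tarfile` module, which is handy inside a notebook or on systems without a `tar` command. This is a minimal sketch: the file names and contents are placeholders, and everything happens inside a temporary directory that is deleted at the end.

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)

    # Two small placeholder files standing in for real data files.
    (workdir / "a.txt").write_text("first file\n")
    (workdir / "b.txt").write_text("second file\n")

    # Equivalent of: tar -czf bundle.tar.gz a.txt b.txt
    archive = workdir / "bundle.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(workdir / "a.txt", arcname="a.txt")
        tar.add(workdir / "b.txt", arcname="b.txt")

    # Equivalent of: tar -xzf bundle.tar.gz, extracting into a separate directory
    outdir = workdir / "extracted"
    outdir.mkdir()
    with tarfile.open(archive, "r:gz") as tar:
        member_names = tar.getnames()  # list the contents before extracting
        tar.extractall(outdir)

    restored = (outdir / "b.txt").read_text()

print(member_names)
print(restored)
```

Listing the members with `getnames()` before extracting is a cheap way to see what an archive will write to disk. Only extract archives from sources you trust.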

[back to contents](#Contents)

---

<a id='Metadata'></a>

## The importance of Metadata
[Metadata](https://help.ceda.ac.uk/article/4428-metadata-basics) **relate to all the information necessary to interpret, understand and use a given dataset. Metadata are not the data themselves, but are required to provide context and therefore inform the data analysis.** 

There are two types of metadata: "**discovery metadata**", which can be used to locate the dataset in a search, and "**detailed metadata**", which contain the information necessary to use the data (ideally) without having to ask the data provider for more information (even if one of the key pieces of information should be the data provider information and contact detail!).

There is no upper limit as to what information should be included in the metadata, but typical metadata comprise information about when and where the observations, or the file, were produced, how they were produced (which instrument, technique, algorithm were used), information about the experiment the data come from, maybe some information about data accuracy (if not provided as separate data in the file), information about who produced the data or the file, including contact information, or appropriate reference(s), and, if relevant, additional information about the research context, i.e. which project is associated with these data (this could help the user understand if more data from this group/project exist or can be expected). If the data represent a subset of a larger dataset, this should also be mentioned. 

Another key piece of information that the metadata should provide is a description of the variables in the file, with variable **names**, and the **units** of these variables. 

If the data in the file represent a processed version of other raw data, then one should add information about these raw data as well, their provenance, references, and the transformation that was made, etc. One should be able to track the chain of information all the way to the original observations (or model output). 

If the data represent gridded variables, then the nature of the grid on which these data are defined should also be explained. 

Similarly, if the position, or some other features of the data, depend on a reference system, that reference system should be given. This will be discussed in more depth during the introduction to geostatistics lecture. 

In the case of climate models (or other models, if relevant), the metadata should also include information about the model itself, the model name, version, the length of the integration, maybe a brief description of the spin-up, boundary conditions, initial conditions used to produce the simulation. 

Obviously the information that will be included in the metadata varies substantially. You should learn (and will discover - mostly through sheer frustration when key information is missing!) what constitutes useful metadata in your field of work. 

Metadata can be included in the file itself (if the file format allows it), as separate files, or in the form of some other documents. When sourcing data, always make a note of where the data come from. This is especially true if getting data online as it is very easy to download data and then forget where they come from. Detailed note-taking is part of the job of the data analyst (and that of any scientist!). 

When metadata is included in the file directly, this typically comes as a header, or in a specified location in the file. The netCDF format has rules for how/where to include metadata, which is one of the reasons why netCDF data are particularly useful and popular in climate science. 
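
When a file format cannot hold metadata itself (a plain `.csv`, for example), one option is to keep the notes in a separate "sidecar" file next to the data. The sketch below is one possible way to do this and not an established standard: the function name `write_provenance`, the `.meta.json` suffix, the field names and the example URL are all made up for illustration.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_provenance(data_path: Path, source_url: str, notes: str) -> Path:
    """Write '<data file name>.meta.json' next to the data file and return its path."""
    record = {
        "file": data_path.name,
        "source_url": source_url,
        # When the record was made, in UTC, so that it is unambiguous.
        "retrieved_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        # A checksum lets you verify later that the file has not changed.
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
        "size_bytes": data_path.stat().st_size,
        "notes": notes,
    }
    sidecar = data_path.with_name(data_path.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar


with tempfile.TemporaryDirectory() as tmp:
    # A tiny placeholder data file (hypothetical content).
    data_file = Path(tmp) / "station.csv"
    data_file.write_text("time,temperature_degC\n2020-01-01,4.2\n")

    sidecar = write_provenance(
        data_file,
        source_url="https://example.org/placeholder",
        notes="Placeholder file used only to demonstrate the sidecar idea.",
    )
    record = json.loads(sidecar.read_text())

print(sidecar.name)
print(sorted(record))
```

Calling such a helper immediately after each download makes the note-taking automatic instead of something to remember later.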


[back to contents](#Contents)

---

 * ### The Climate and Forecast (CF) Metadata Conventions (prominent in climate science)
 
Given that the amount of data produced keeps increasing, data libraries are being developed. As in any library, certain sets of rules are required so that the data can be found later. Metadata are obviously a good way to ensure data can be catalogued and found by various search systems.

Multiple types of conventions exist (see for example this [list of netCDF conventions from Unidata](https://www.unidata.ucar.edu/software/netcdf/conventions.html)). Ensuring that data files are produced in a way that follows conventions about content, vocabulary used and layout, allows for batch processing, easy extraction and automation! It is extremely useful (but should not prevent innovation).

CEDA, as it focuses on climate and environmental data, relies extensively on the [Climate and Forecast (CF) Conventions](http://cfconventions.org). The CF Conventions are probably the most popular in climate science (they underpin the modelling effort of the IPCC). A detailed description of release 1.9 of the CF Conventions for netCDF can be found [here](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html). 

The CF Conventions were designed specifically for the netCDF format but they can be applied widely. The netCDF format enables creation of self-describing datasets by design. CF Conventions aim to ensure that  files contain sufficient metadata that they are self-describing in the sense that each variable in the file has an associated description of what it represents, including physical units if appropriate, and that each value can be located in space (relative to earth-based coordinates) and time. (Absence of such information in early (historical) datasets has hindered climate change science for decades - how does one measure change, if one cannot locate the observations in time and space?)

One example of what the CF Conventions provide is a list of **standard_names** for certain commonly-used variables. The [CF standard name table](http://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html) will show what the names are and describe what they represent. 

To illustrate, imagine you work with data at the earth surface and label your variable 'temperature'. What did you mean exactly? Surface temperature of the ocean/land, or air temperature? etc. CF conventions ensure that climate scientists name common variables in the same way. 
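
The 'temperature' ambiguity can be caught automatically. The sketch below represents variable attributes as plain Python dictionaries and flags variables that lack a `standard_name` or `units` attribute. The variable names `tas`, `sst` and `temperature` are illustrative, and requiring both attributes on every variable is a simplification chosen for this example, not a statement of what the CF Conventions mandate. `air_temperature` and `sea_surface_temperature` are entries in the CF standard name table.

```python
# Attributes this sketch insists on (an assumption of the example, see above).
REQUIRED_ATTRIBUTES = ("standard_name", "units")

# Variable attributes as they might be read from a data file (illustrative).
variables = {
    "tas": {"standard_name": "air_temperature", "units": "K"},
    "sst": {"standard_name": "sea_surface_temperature", "units": "K"},
    "temperature": {"long_name": "temperature"},  # which temperature? which units?
}


def missing_attributes(attrs: dict) -> list:
    """Return the required attribute names that are absent from attrs."""
    return [name for name in REQUIRED_ATTRIBUTES if name not in attrs]


# Collect only the variables that fail the check.
problems = {}
for var_name, attrs in variables.items():
    missing = missing_attributes(attrs)
    if missing:
        problems[var_name] = missing

print(problems)
```

Here only the variable labelled `temperature` is reported, because nothing in its attributes says which temperature it is or in which units it is expressed.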

A [python package](https://pypi.org/project/cfunits/) called `cfunits` provides an interface to the units handling used by the CF Conventions. This is useful to combine and compare variables and convert various units. By relying on a package, fewer user-errors are made.   

The CF Conventions are not fixed, they are evolving depending on needs and scientific progress. Although conventions are decided by a committee of experts, anyone can propose a change to the convention by engaging in the [discussion forum](https://cfconventions.org/discussion.html).


[back to contents](#Contents)

---

<a id='NetCDF'></a>

## Network Common Data Form (NetCDF)

[NetCDF](https://www.unidata.ucar.edu/software/netcdf/) is one of the most common data formats used to store climate data. NetCDF files allow the user to insert metadata in the data file by design, ensuring the data file is self-describing (amongst other properties).

*NetCDF is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.* 

The netCDF format (whose most recent version, netCDF-4, is built on HDF5) is attractive because it is:

* Self-Describing. A netCDF file includes information about the data it contains and metadata.
* Portable. A netCDF file can be accessed by computers with different ways of storing integers, characters, and floating-point numbers.
* Scalable. Small subsets of large datasets in various formats may be accessed efficiently through netCDF interfaces, even from remote servers.
* Appendable. Data may be appended to a properly structured netCDF file without copying the dataset or redefining its structure.
* Sharable. One writer and multiple readers may simultaneously access the same netCDF file.
* Archivable. Access to all earlier forms of netCDF data will be supported by current and future versions of the software.


The NetCDF project is maintained by the Unidata program at the University Corporation for Atmospheric Research ([UCAR](https://www.ucar.edu)). [UCAR](https://www.ucar.edu) also manages [NCAR](https://ncar.ucar.edu), one of the first centers to have developed climate models and today one of the gold standards on the topic. 

NCAR also developed [NCL](https://www.ncl.ucar.edu), an interpreted programming language designed specifically for scientific data analysis and visualization (we will not use NCL here, but the python package `ESMValTool` has built-in support for it). 

[NCAR](https://ncar.ucar.edu) has also developed the [climate data guide](https://climatedataguide.ucar.edu), a tool to search and access >200 data sets specific to the ocean, land and the atmosphere. (Some of these data may or may not also be found on CEDA.) 

[back to contents](#Contents)

---

## Useful tools to read, create or manipulate NetCDF files

Because they are so common, various tools have been developed to effectively manipulate NetCDF files. We will here only discuss a few of them, notably the `netCDF4` python interface, and the `nco` routines which work on the command-line.

---
* ### Python support for NetCDF
---

`netCDF4`: As expected, Unidata provides a [Python API for its NetCDF C library](http://unidata.github.io/netcdf4-python/). This can be installed with `pip`, or with `conda` via conda-forge: `conda install -c conda-forge netCDF4`. 
  It can read/write files in any of the NetCDF formats (NETCDF3_CLASSIC, NETCDF3_64BIT_OFFSET, NETCDF3_64BIT_DATA, NETCDF4_CLASSIC, and NETCDF4). The latest NETCDF4 format creates files that are readable by HDF5 clients as well.
  
  Detailed instructions on the use of `netCDF4` can be found [here](http://unidata.github.io/netcdf4-python/). We are here only going to demonstrate a few features. 

Before looking at the internal structure of a file, let's create an empty NetCDF file using python into a new `output/` directory. To do this, we need to import the `Dataset` constructor from `netCDF4`. 


In [1]:
# let's first create a new directory called 'output' where we will save this file
import os
mypath = 'output'
# for more complicated paths, we could use mypath = os.path.join('dir','other-dir'). 
if not os.path.isdir(mypath): # if the directory does not exist, ...
    os.makedirs(mypath) # let's create it

In [2]:
# Now, let's create the NetCDF file:
# import the Dataset constructor
from netCDF4 import Dataset
# create a .nc file, using the default NETCDF4 format.
# the 'w' option is for 'write'. 
# one could use a 'r' option for 'read' if the file already exists. 
rootgrp = Dataset("output/test.nc", "w", format="NETCDF4")
# one can use the "data_model" attribute to show what format the file is in
print(rootgrp.data_model)
# Finally, one can close the file using the Dataset.close method.
rootgrp.close()

# [note: with the 'w' option, an existing output/test.nc is overwritten by default.
# You may still get an error if you run this twice while the file is open elsewhere
# (e.g. it was not closed after a previous run). Close it, or delete test.nc, and this will run again.] 

NETCDF4


### A sidenote on remote data access

Note that `Dataset` can also read a remote [OPeNDAP](https://www.opendap.org)-hosted dataset over http if a URL is provided instead of a filename. If you can access a file remotely, it may be best to do so, as it unburdens you from the cost of storage! However, you should also ensure that users will have continuous access to the file for the duration of the project, and possibly afterwards too if the results must be archived. 

[OPeNDAP](https://www.opendap.org) stands for Open-source Project for a Network Data Access Protocol and is software for remote data retrieval, commonly used in Earth science. It gives access to data over the internet, so that one can retrieve data when needed without necessarily having to download all of it locally, which facilitates real-time analysis. 

Other commonly encountered remote-access protocols are [THREDDS](https://www.unidata.ucar.edu/software/tds/current/) and [ERDDAP](https://upwell.pfeg.noaa.gov/erddap/index.html). 

[THREDDS](https://www.unidata.ucar.edu/software/tds/current/) stands for Thematic Real-time Environmental Distributed Data Services. The THREDDS Data Server ([TDS](https://docs.unidata.ucar.edu/tds/current/userguide/index.html)) is another Unidata service that provides remote access to real-time or archived datasets. It is a web server that provides metadata and data access for scientific datasets, using a variety of remote data access protocols, including OPeNDAP. The THREDDS data catalogue of meteorological data is available [here](https://thredds.ucar.edu/thredds/catalog.html). 

[ERDDAP](https://upwell.pfeg.noaa.gov/erddap/index.html) is a data server from the National Oceanic and Atmospheric Administration (NOAA, USA) that acts as a middleman between the user and various other data servers. You can see a [list of datasets available through ERDDAP here](https://upwell.pfeg.noaa.gov/erddap/info/index.html?page=1&itemsPerPage=1000). ERDDAP provides a consistent way to download subsets of gridded and tabular data in common file formats. It is a tool that simplifies user interactions, so that the user doesn't have to know about OPeNDAP or other remote-access protocols.

[back to contents](#Contents)

---

Let's look into the `output` directory to see if a `.nc` file now exists. There are [many ways to do this](https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory) (e.g. `os`, `glob`). 

In [3]:
# using 'glob' as it does pattern matching
import glob
ncfiles = glob.glob("output/*.nc")
print(ncfiles)

['output/ERAtest_subarea_1deg1deg.nc', 'output/ERA5test.nc', 'output/test.nc']


One can also use `Dataset` to read existing files. To do so, simply replace the `"w"` option with the `"r"` option (for read). One could also use the `"a"` option to append new data to an existing file. 

In [4]:
# let's try to open the 'test.nc' file we just created
ncdata = Dataset("output/test.nc", "r")
print(ncdata)
# don't forget to close the file when done
ncdata.close()

<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    dimensions(sizes): 
    variables(dimensions): 
    groups: 


We see that even though we added no information to 'test.nc' when we created it, the file already contains structure. In particular, there is a `group`, and that group seems to possess some placeholders for `dimensions` and `variables` information. We also see that variables have dimensions, and that dimensions have sizes. 

Another convenient way to look at the structure of a netCDF file, this time from the command line, is to use `ncdump -h <filename.nc>`. `ncdump` is part of the netCDF C library, which needs to be installed for anything related to netCDF to work (so it comes 'for free'!).

- **Groups**: `groups` are the NetCDF analogue of directories in a file system (only implemented in NETCDF4, so not all NetCDF files will have groups). That is, a NetCDF file can contain different groups, which are like containers for `variables`, `dimensions` and `attributes`. Groups can also contain other groups, and the `root group` created by `Dataset` is like the root directory. 

![netcdf](img/hdf5-example-data-structure.jpg)

One can create new groups using `Dataset.createGroup`. Let's create two new groups ("forecasts" and "analyses") within 'test.nc', and then two subgroups within the "forecasts" group (i.e. nested groups, just like subdirectories).


In [5]:
# we are going to append "a" new information to the existing file
rootgrp = Dataset("output/test.nc", "a")
# let's create 2 new groups
fcstgrp = rootgrp.createGroup("forecasts")
analgrp = rootgrp.createGroup("analyses")
# show the results below
print(rootgrp.groups)

# now let's create 2 new subgroups in forecasts
# if 'forecasts' did not already exist, it would have been created,
# analogous to the unix 'mkdir -p' command.  
fcstgrp1 = rootgrp.createGroup("/forecasts/model1")
fcstgrp2 = rootgrp.createGroup("/forecasts/model2")
print('----now see the subgroups in forecasts: ')
print(rootgrp.groups)


{'forecasts': <class 'netCDF4._netCDF4.Group'>
group /forecasts:
    dimensions(sizes): 
    variables(dimensions): 
    groups: , 'analyses': <class 'netCDF4._netCDF4.Group'>
group /analyses:
    dimensions(sizes): 
    variables(dimensions): 
    groups: }
----now see the subgroups in forecasts: 
{'forecasts': <class 'netCDF4._netCDF4.Group'>
group /forecasts:
    dimensions(sizes): 
    variables(dimensions): 
    groups: model1, model2, 'analyses': <class 'netCDF4._netCDF4.Group'>
group /analyses:
    dimensions(sizes): 
    variables(dimensions): 
    groups: }


- **Dimensions**: In NetCDF files, we define the sizes of all the variables using `dimensions`. Dimensions are created using the `Dataset.createDimension` method. We must provide a name and a size; dimensions can be finite (fixed size) or unlimited. Unlimited dimensions are useful because they can be appended to. To create an unlimited dimension, set its size to `None` or `0`.  

If needed, one can rename dimensions with `Dataset.renameDimension`. 

Here is an example, where we create 4 dimensions, level, time, lat and lon. Level and time are unlimited (useful so we can append new data), while lat and lon are fixed (i.e. the spatial grid used does not change). (NB: prior to NETCDF4, only one unlimited dimension could be set.)

In [6]:
level = rootgrp.createDimension("level", None)
time = rootgrp.createDimension("time", None)
lat = rootgrp.createDimension("lat", 73)
lon = rootgrp.createDimension("lon", 144)

print("Dimensions are stored in a python dictionary:")
print(rootgrp.dimensions)


Dimensions are stored in a python dictionary:
{'level': <class 'netCDF4._netCDF4.Dimension'> (unlimited): name = 'level', size = 0, 'time': <class 'netCDF4._netCDF4.Dimension'> (unlimited): name = 'time', size = 0, 'lat': <class 'netCDF4._netCDF4.Dimension'>: name = 'lat', size = 73, 'lon': <class 'netCDF4._netCDF4.Dimension'>: name = 'lon', size = 144}


- **Variables**: NetCDF variables are a bit like python arrays from `numpy`. Variables can have any number of dimensions (i.e. 1D, 2D, 3D, 4D, ...). One can create a new variable with `Dataset.createVariable`, where we must provide a name and a data type for the variable. The dimensions of a variable are provided as a 'tuple' of dimension names. One can also create a 0D variable (i.e. a scalar variable), in which case we simply leave out the dimension information. 

Datatype must be one of the following: 'f8','f4','i8','i4','i2','i1','u8','u4','u2','u1','S1'. This is similar to 'numpy'; 'f' is for floating point, 'i' for signed-integer, 'u' for unsigned-integers, 'S' for single character string; 8 is for 64-bit, 4 for 32-bit, 2 for 16-bit and 1 for 8-bit. ('i8' and 'u8' are only available on NETCDF4). 
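
Since these type codes mirror numpy's, numpy itself can be used to check what each code means. This is only a quick illustration using `numpy.dtype`; it does not touch any NetCDF file.

```python
import numpy as np

codes = ["f8", "f4", "i8", "i4", "i2", "i1", "u8", "u4", "u2", "u1", "S1"]

# size in bytes of one element of each type
sizes = {code: np.dtype(code).itemsize for code in codes}

for code in codes:
    dt = np.dtype(code)
    # dt.kind is 'f' (float), 'i' (signed int), 'u' (unsigned int) or 'S' (byte string)
    print("{:>3}: kind={} size={} bits".format(code, dt.kind, dt.itemsize * 8))
```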

Variable names can be changed with `Dataset.renameVariable`. 

Note that the dimensions themselves will eventually have to hold some values, so they are defined as 'coordinate variables', a special type of variables. 

Let's define coordinate variables (dimensions) and also a new variable ('temp') in 'test.nc'.

In [7]:
# define the coordinate variables (time, level, lat, lon)
times = rootgrp.createVariable("time","f8",("time",))
levels = rootgrp.createVariable("level","i4",("level",))
latitudes = rootgrp.createVariable("lat","f4",("lat",))
longitudes = rootgrp.createVariable("lon","f4",("lon",))
# new variable 'temp' has 4 dimensions, 2 of which are unlimited (time and level)
temp = rootgrp.createVariable("temp","f4",("time","level","lat","lon",))
# we can print a summary of the 'temp' variable
print(temp)

# we can also create a variable inside a group (e.g. in forecasts/model1)
ftemp = rootgrp.createVariable("/forecasts/model1/temp","f4",("time","level","lat","lon",))
# we can also query the status of the group
print('--- model1 ---')
print(rootgrp["/forecasts/model1"]) 
# or the variable in the group directly
print('---model1/temp---')
print(rootgrp["/forecasts/model1/temp"])

# variables in the Dataset are stored as a python dictionary
print('--- python variable dictionary ---')
print(rootgrp.variables)

<class 'netCDF4._netCDF4.Variable'>
float32 temp(time, level, lat, lon)
unlimited dimensions: time, level
current shape = (0, 0, 73, 144)
filling on, default _FillValue of 9.969209968386869e+36 used
--- model1 ---
<class 'netCDF4._netCDF4.Group'>
group /forecasts/model1:
    dimensions(sizes): 
    variables(dimensions): float32 temp(time, level, lat, lon)
    groups: 
---model1/temp---
<class 'netCDF4._netCDF4.Variable'>
float32 temp(time, level, lat, lon)
path = /forecasts/model1
unlimited dimensions: time, level
current shape = (0, 0, 73, 144)
filling on, default _FillValue of 9.969209968386869e+36 used
--- python variable dictionary ---
{'time': <class 'netCDF4._netCDF4.Variable'>
float64 time(time)
unlimited dimensions: time
current shape = (0,)
filling on, default _FillValue of 9.969209968386869e+36 used, 'level': <class 'netCDF4._netCDF4.Variable'>
int32 level(level)
unlimited dimensions: level
current shape = (0,)
filling on, default _FillValue of -2147483647 used, 'lat': <clas

- **Attributes**: The last category of information required is 'attributes'. Attributes can be 'global' or 'variable'-specific. 'Global' attributes are for the whole Dataset or the group. 'Variable' attributes only inform on a specific variable (or coordinate variable).  

`Dataset.ncattrs` can be used to retrieve attributes from a NetCDF file. Alternatively, the `__dict__` attribute of a Dataset, Group or Variable will return the name/value pairs for all the attributes. 

Let's define some global attributes, and assign units to the variables and coordinate variables:

In [8]:
# let's import the 'time' library, to add to our metadata
import time
# now let's add some attributes, to add to our metadata
rootgrp.description = "Example of how to create and manipulate test.nc"
rootgrp.history = "Created " + time.ctime(time.time())
rootgrp.source = "netCDF4 python module tutorial: http://unidata.github.io/netcdf4-python/#tutorial"

# Now, let's assign units to the variables/coordinate variables
# (these could follow CF Conventions, which would write "degrees_north" and "degrees_east")
latitudes.units = "degrees north"
longitudes.units = "degrees east"
levels.units = "hPa"
times.units = "hours since 0001-01-01 00:00:00.0"
times.calendar = "gregorian"
temp.units = "K"

# now let's see what we did, and show how that information could be retrieved
# so it could be integrated/used in a programme designed to analyze data from a NetCDF file.
#
# method 1: get information with Dataset.ncattrs
print('--- method 1: ncattrs ---')
for name in rootgrp.ncattrs():
     print("Global attr {} = {}".format(name, getattr(rootgrp, name)))


# method 2: get information as a python dictionary, with rootgrp.__dict__
print('--- method 2: __dict__ ---')
print(rootgrp.__dict__)

--- method 1: ncattrs ---
Global attr description = Example of how to create and manipulate test.nc
Global attr history = Created Mon Nov 22 15:15:38 2021
Global attr source = netCDF4 python module tutorial: http://unidata.github.io/netcdf4-python/#tutorial
--- method 2: __dict__ ---
{'description': 'Example of how to create and manipulate test.nc', 'history': 'Created Mon Nov 22 15:15:38 2021', 'source': 'netCDF4 python module tutorial: http://unidata.github.io/netcdf4-python/#tutorial'}


**Adding data**: Finally, we are prepared to add data to the file... which is the point! This is simple: we can treat the 'variables' as arrays and assign data into them. 

In [9]:
# let's make up some data (and worry about dimensions)
import numpy as np
# Create latitudes and longitudes
lats =  np.arange(-90,91,2.5) # 73 lats from -90 to +90 in 2.5 degree increments
lons =  np.arange(-180,180,2.5) # 144 lons from -180 to +177.5 in 2.5 degree increments
# assign lats/lons data to a slice (of latitudes/longitudes arrays)
latitudes[:] = lats
longitudes[:] = lons
# let's see what we made:
print("latitudes =\n{}".format(latitudes[:]))
print("longitudes =\n{}".format(longitudes[:]))


# Let's add random numbers for temperature 'temp'. 
# Recalling we created variable temp with the following dimensions: 
# temp(time, level, lat, lon). 'temp' is a 4D dataset. 
# Dimensions time and level are unlimited, while lat and lon are finite. 
print("temp shape before adding data = {}".format(temp.shape))

from numpy.random import uniform
# we now create 5 different time slices, and 10 different levels. 
# Levels represent altitude slices of a 4D dataset.
nlats = len(rootgrp.dimensions["lat"])
nlons = len(rootgrp.dimensions["lon"])
temp[0:5, 0:10, :, :] = uniform(size=(5, 10, nlats, nlons))
print("temp shape after adding data = {}".format(temp.shape))
# !! Beware: unlike numpy arrays, variables with unlimited dimensions will grow
# along these dimensions if we assign data beyond their current size!
#
# Since 'temp' now holds 10 levels, the unlimited 'level' dimension has grown to 10, and so has
# the coordinate variable 'levels', even if we have not yet assigned values to each 'level'; 
# as can be seen here:
print("levels shape after adding 'level' data to variable temp = {}".format(levels.shape))

# Let's now assign some values to 'levels': 
levels[:] =  [1000.,850.,700.,500.,300.,250.,200.,150.,100.,50.]

latitudes =
[-90.  -87.5 -85.  -82.5 -80.  -77.5 -75.  -72.5 -70.  -67.5 -65.  -62.5
 -60.  -57.5 -55.  -52.5 -50.  -47.5 -45.  -42.5 -40.  -37.5 -35.  -32.5
 -30.  -27.5 -25.  -22.5 -20.  -17.5 -15.  -12.5 -10.   -7.5  -5.   -2.5
   0.    2.5   5.    7.5  10.   12.5  15.   17.5  20.   22.5  25.   27.5
  30.   32.5  35.   37.5  40.   42.5  45.   47.5  50.   52.5  55.   57.5
  60.   62.5  65.   67.5  70.   72.5  75.   77.5  80.   82.5  85.   87.5
  90. ]
longitudes =
[-180.  -177.5 -175.  -172.5 -170.  -167.5 -165.  -162.5 -160.  -157.5
 -155.  -152.5 -150.  -147.5 -145.  -142.5 -140.  -137.5 -135.  -132.5
 -130.  -127.5 -125.  -122.5 -120.  -117.5 -115.  -112.5 -110.  -107.5
 -105.  -102.5 -100.   -97.5  -95.   -92.5  -90.   -87.5  -85.   -82.5
  -80.   -77.5  -75.   -72.5  -70.   -67.5  -65.   -62.5  -60.   -57.5
  -55.   -52.5  -50.   -47.5  -45.   -42.5  -40.   -37.5  -35.   -32.5
  -30.   -27.5  -25.   -22.5  -20.   -17.5  -15.   -12.5  -10.    -7.5
   -5.    -2.5    0.     2.5    

**Slicing and indexing**: Slices are defined as in numpy, i.e. a `start:stop:step` triplet can be used, and so can an integer index `i` to take the i-th element, but indexing with sequences of integers or booleans works a little differently in netCDF4 than in numpy. For example:

> temp[0, 0, [0,1,2,3], [0,1,2,3]].shape

returns (4,4), so this would be a 4 rows x 4 column array, corresponding to the first time point and the first level only. 

In numpy, this would result in only 4 elements (e.g. a vector with 4 elements). 
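
The numpy side of this comparison can be checked without any NetCDF file. In the sketch below (a plain 10 x 10 array stands in for the lat/lon plane), numpy pairs the two index lists element by element, while `np.ix_` reproduces the "every row with every column" behaviour described above for netCDF4 variables.

```python
import numpy as np

a = np.arange(100).reshape(10, 10)  # stand-in for a 2D (lat, lon) slice
idx = [0, 1, 2, 3]

# numpy pairs the index lists: elements (0,0), (1,1), (2,2), (3,3)
paired = a[idx, idx]
# np.ix_ builds the outer product of the index lists: a 4 x 4 block
outer = a[np.ix_(idx, idx)]

print(paired.shape, outer.shape)
```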

We can use slicing and indexing to select data from N-dimensional arrays. For example, to extract the 0th, 2nd and 4th index of the time dimension, the 850, 500 and 200 hPa pressure levels, and only the data from the Northern Hemisphere and Eastern Hemisphere from the 'temp' variable we defined in test.nc, we could simply do:

In [10]:
tempdat = temp[::2, [1,3,6], lats>0, lons>0]
print("shape of fancy temp slice = {}".format(tempdat.shape))

shape of fancy temp slice = (3, 3, 36, 71)


**Scalar variables**: When dealing with scalar variables (e.g. variable 'v') with no dimensions, one can use numpy to extract a numpy scalar array from the data: `numpy.asarray(v)`, or `v[...]`. 

**Masks**: `Masks` are often used in climate science data to separate data from different regions (e.g. ocean vs land, etc.). By default, the python netCDF4 API returns numpy masked arrays, in which entries that equal the `missing_value` or `_FillValue` attribute are masked. One can force unmasking (i.e. return all values, regardless) by calling the `Dataset.set_auto_mask` method with `False`; to switch auto-masking back on, call it again with `True`. (The related `Dataset.set_always_mask` method only controls whether a masked array is returned even when no entries are masked.) When writing a masked array to a variable in a netCDF file (i.e. an array with missing data), the masked elements are filled with the value defined in the variable's `missing_value` attribute, or with `_FillValue` if no `missing_value` is defined. 

**Time coordinates**: The time coordinate often poses a special challenge when working with environmental data. Not only is the format of the time information complex (e.g. YYYY-MM-DD hh:mm:ss) and awkward to deal with, but it can also depend on the calendar used, or require additional information about when T=0 was if the time coordinate is given as relative time. CF Conventions advocate for a measure of time given relative to a fixed reference date, itself written in the YYYY-MM-DD hh:mm:ss format, and using a defined calendar. 

The `cftime` and `datetime` packages provide functions to decode time units and variable values from a netCDF file, if that file conforms to CF Conventions. The following three functions can greatly simplify the problem of dealing with time: 
1. [`num2date`](http://unidata.github.io/netcdf4-python/#num2date): converts numeric time values, given in the specified `units` and `calendar`, to so-called "datetime objects". All calendars defined in the CF Conventions are supported: ‘standard’, ‘gregorian’, ‘proleptic_gregorian’, ‘noleap’, ‘365_day’, ‘360_day’, ‘julian’, ‘all_leap’, ‘366_day’.
2. [`date2num`](http://unidata.github.io/netcdf4-python/#date2num): this does the opposite of `num2date`
3. [`date2index`](http://unidata.github.io/netcdf4-python/#date2index): returns the indices of a netCDF time variable that correspond to given dates

Here is an example:

In [11]:
# Example of dealing with time in python/netCDF4
# load the proper functions
from datetime import datetime, timedelta
from cftime import num2date, date2num
#
# Recall the 'temp' variable in our toy netCDF file
# We built the 'temp' variable with 5 time points:
# index 0 of 'shape' gives the size of the first dimension, here 'time'
print("Number of time points for the 'temp' variable:\n{}".format(temp.shape[0])) 
#
# We now use the 'datetime' and 'timedelta' functions, from the 'datetime' package
# to create 5 time points, spaced by 12 hours, starting on  March 1st 2001. 
dates = [datetime(2001,3,1)+n*timedelta(hours=12) for n in range(temp.shape[0])]
# Let's see the 5 time points, noticing that the date automatically rolls over to March 2nd and 3rd
print("dates created for each time point:\n{}".format(dates)) 
#
# We can then use the 'date2num' function, from 'cftime' to turn these dates into numeric values
# Numerical values are much easier to plot or deal with, due to their simpler format. 
# Numerical values make no intuitive sense, though! 
# Numeric time is useful for analysis, but no good for communication. 
# Note that 'units' and 'calendar' are specified. 
times[:] = date2num(dates,units=times.units,calendar=times.calendar)
# Numeric time values are relative to a starting point! 
print("Numerical time values (in units {}):\n{}".format(times.units, times[:]))
# 
# we can use 'num2date' to turn numerical times back into calendar dates
dates = num2date(times[:],units=times.units,calendar=times.calendar)
print("dates corresponding to time values:\n{}".format(dates))


Number of time points for the 'temp' variable:
5
dates created for each time point:
[datetime.datetime(2001, 3, 1, 0, 0), datetime.datetime(2001, 3, 1, 12, 0), datetime.datetime(2001, 3, 2, 0, 0), datetime.datetime(2001, 3, 2, 12, 0), datetime.datetime(2001, 3, 3, 0, 0)]
Numerical time values (in units hours since 0001-01-01 00:00:00.0):
[17533104. 17533116. 17533128. 17533140. 17533152.]
dates corresponding to time values:
[cftime.DatetimeGregorian(2001, 3, 1, 0, 0, 0, 0, has_year_zero=False)
 cftime.DatetimeGregorian(2001, 3, 1, 12, 0, 0, 0, has_year_zero=False)
 cftime.DatetimeGregorian(2001, 3, 2, 0, 0, 0, 0, has_year_zero=False)
 cftime.DatetimeGregorian(2001, 3, 2, 12, 0, 0, 0, has_year_zero=False)
 cftime.DatetimeGregorian(2001, 3, 3, 0, 0, 0, 0, has_year_zero=False)]



<a id='NCO'></a>
### NetCDF C operators (NCO) - command line 
---
NetCDF Operators ([NCO](http://nco.sourceforge.net)) are (in my opinion) fantastic! 

The NCO operators are about a dozen standalone programmes that allow users to manipulate and analyze data from NetCDF, HDF or DAP files directly from the command line, without having to open the files or load them into memory in an interpreted language. They produce output to screen or in text, binary, or netCDF formats. They are written in C (Fortran versions also available) and build on the [GNU scientific library](https://www.gnu.org/software/gsl/), making them very fast and efficient. They are built to work with the CF Conventions for metadata, and can work with OPeNDAP as well. 

They are meant to be used to transform primary data into secondary data (stuff that is then plotted and carries interpretable meaning as decided by the analyst). NCO operators are not for plotting; plotting will be done in a higher-level language (e.g. python, R, Matlab, etc.). 

NCO are "race horses", designed to minimize the system memory required and improve speed. NCO routines are not meant to do everything and anything: they are tools, not a full-fledged programming language. They are a set of (very useful) functions that do a small set of manipulations common to many analysis workflows for gridded or unstructured data. 

NCO operators are quite specific in what they do but they can be daisy-chained to perform complex tasks. 

...sadly, they are not that easy or intuitive to use: it takes practice. But they are powerful, as they can be embedded in shell scripts or in any code (with a call to the system), where they can help solve certain memory issues (depending on the interpreted language used), or help speed up the workflow. 

NCO routines [can be installed](http://nco.sourceforge.net/#Executables) using package managers like homebrew (on MacOS) or even from conda (or they can be compiled from source too). Some NCO operators can support shared memory parallelism with OpenMP threading (if compiled to do so), but most of the NCO tools do not benefit from a great speed-up, as processing speed tends to be limited by I/O operations.   

The NCO operators are ([detailed explanation and examples are found in the user guide](http://nco.sourceforge.net/nco.html)): 

* [ncap2](http://nco.sourceforge.net/nco.html#ncap2) netCDF Arithmetic Processor 
* [ncatted](http://nco.sourceforge.net/nco.html#ncatted-netCDF-Attribute-Editor) netCDF ATTribute EDitor
* [ncbo](http://nco.sourceforge.net/nco.html#ncbo-netCDF-Binary-Operator) netCDF Binary Operator (addition, multiplication...)
* [ncclimo](http://nco.sourceforge.net/nco.html#ncclimo-netCDF-Climatology-Generator) netCDF CLIMatOlogy Generator
* [nces](http://nco.sourceforge.net/nco.html#nces-netCDF-Ensemble-Statistics) netCDF Ensemble Statistics
* [ncecat](http://nco.sourceforge.net/nco.html#ncecat-netCDF-Ensemble-Concatenator) netCDF Ensemble conCATenator
* [ncflint](http://nco.sourceforge.net/nco.html#ncflint-netCDF-File-Interpolator) netCDF FiLe INTerpolator
* [ncks](http://nco.sourceforge.net/nco.html#ncks-netCDF-Kitchen-Sink) netCDF Kitchen Sink
* [ncpdq](http://nco.sourceforge.net/nco.html#ncpdq-netCDF-Permute-Dimensions-Quickly) netCDF Permute Dimensions Quickly, Pack Data Quietly
* [ncra](http://nco.sourceforge.net/nco.html#ncra-netCDF-Record-Averager) netCDF Record Averager
* [ncrcat](http://nco.sourceforge.net/nco.html#ncrcat-netCDF-Record-Concatenator) netCDF Record conCATenator
* [ncremap](http://nco.sourceforge.net/nco.html#ncremap-netCDF-Remapper) netCDF REMAPper 
* [ncrename](http://nco.sourceforge.net/nco.html#ncrename-netCDF-Renamer) netCDF RENAMEr
* [ncwa](http://nco.sourceforge.net/nco.html#ncwa-netCDF-Weighted-Averager) netCDF Weighted Averager 


The most useful ones, for me, have been `ncks`, `ncpdq` and `ncra`.
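
To give a flavour of how NCO operators can be daisy-chained from within python (via a call to the system), here is a hedged sketch. The file names `in.nc`, `subset.nc` and `timemean.nc` are hypothetical, and the commands are only executed if NCO is installed and the input file exists; otherwise they are simply printed.

```python
import os
import shutil
import subprocess

# hypothetical file names, used for illustration only
infile, subset, mean = "in.nc", "subset.nc", "timemean.nc"

# step 1: ncks extracts variable 'temp' between latitudes 0 and 90
# (-O overwrites an existing output file, -v selects a variable,
#  -d restricts a dimension to a range of coordinate values)
subset_cmd = ["ncks", "-O", "-v", "temp", "-d", "lat,0.0,90.0", infile, subset]
# step 2: ncra averages the result over the record (unlimited) dimension
mean_cmd = ["ncra", "-O", subset, mean]

nco_available = all(shutil.which(tool) is not None for tool in ("ncks", "ncra"))
if nco_available and os.path.isfile(infile):
    for cmd in (subset_cmd, mean_cmd):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
else:
    # NCO or the input file is missing: just show what would be run
    for cmd in (subset_cmd, mean_cmd):
        print(" ".join(cmd))
```

The output of the first operator is the input of the second, which is all that "daisy-chaining" means here. The same two commands could equally be typed in a terminal or placed in a shell script.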

[back to contents](#Contents)

---

### Climate Data Operators (CDO)
The Climate Data Operators ([CDO](https://code.mpimet.mpg.de/projects/cdo)) software is a collection of >600 operators for standard processing of climate and forecast model data. They are slightly less memory/speed efficient than NCO, but they do more (>600 vs about a dozen!). CDO supports the main data formats used in the field, such as GRIB and netCDF. 

They too can be daisy-chained for specific applications. There are [many recipes available](https://code.mpimet.mpg.de/projects/cdo/embedded/cdo_eca.pdf) to calculate various climate indices and diagnostics, which makes them very attractive for evaluating gridded climate data or climate model output.

The [CDO user guide](https://code.mpimet.mpg.de/projects/cdo/embedded/index.html) provides detailed instructions and examples. 

[They are also very useful, but we don't have time to discuss them in detail here. You are encouraged to practice with CDO on your own time]

[back to contents](#Contents)

---

<a id='reanalysis'></a>

# Reanalysis products

![C3S](img/c3s-logo.png)
![ECMWF](img/logo-ecmwf.png) 

[*ERA5 provides hourly estimates of a large number of atmospheric, land and oceanic climate variables. The data cover the Earth on a 30km grid and resolve the atmosphere using 137 levels from the surface up to a height of 80km. ERA5 includes information about uncertainties for all variables at reduced spatial and temporal resolutions. Quality-assured monthly updates of ERA5 (1979 to present) are published within 3 months of real time. Preliminary daily updates of the dataset are available to users within 5 days of real time.*](https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5)

[ERA5](https://confluence.ecmwf.int/display/CKB/ERA5) is a [family of datasets](https://confluence.ecmwf.int/display/CKB/The+family+of+ERA5+datasets). It currently comprises ERA5, ERA5.1 and ERA5-Land. ERA5 is the fifth generation ECMWF atmospheric reanalysis of the global climate covering the period from January 1950 to present. ERA5 is produced by the Copernicus Climate Change Service ([C3S](https://confluence.ecmwf.int/pages/viewpage.action?pageId=151530614)) at the European Centre for Medium-Range Weather Forecasts ([ECMWF](https://www.ecmwf.int)) and made available via the [Climate Change Service](https://climate.copernicus.eu). 

Importantly, ERA5 is a [**reanalysis product**](https://www.youtube.com/watch?v=FAGobvUGl24), meaning a **model that assimilates data**. A model of the climate (weather) is run, and adjusted (following certain laws of physics and constraints) to fit [as many observations](https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#ERA5:datadocumentation-Observations) as possible using a technique called 4D-var. One must realize that, even if assimilated products such as ERA5 are often used 'in lieu' of observations, they are **\*not\*** observations: they are a model product, but a product that is made to look as much like the data as possible given the computational, mathematical and physical limitations of the model. 

ERA5 is one of various [reanalysis products](https://reanalyses.org) available globally. Another well-known product is the [NCEP/NCAR Reanalysis product](https://en.wikipedia.org/wiki/NCEP/NCAR_Reanalysis). [MERRA-2](https://gmao.gsfc.nasa.gov/reanalysis/MERRA-2/), produced by NASA, is another. 

The article by [Hersbach et al. 2020](https://rmets.onlinelibrary.wiley.com/doi/10.1002/qj.3803) discusses the ERA5 global reanalysis product in more detail.

While observations are only available in specific locations and at specific times, the reanalysis product provides a clever way to dynamically interpolate between these observations. ERA5 also comes as a gridded product, making it very convenient to use. In the model-world, one can access a complete/global picture at every time-step, with a spatial resolution as high as computational limits allow. I re-emphasize that this is not the same as observations! ... but it is as close to observations as we can get if one is trying to work with a spatially and temporally interpolated product. 

**Grid geometry depends on data format**: Note that the grid geometry of the ERA5 output data [depends on the format in which the data are downloaded](https://confluence.ecmwf.int/display/CKB/ERA5%3A+What+is+the+spatial+reference). Native GRIB format data are delivered on the model's native grid geometry (this is not a regular lat/lon grid!). On the other hand, data in NetCDF format are automatically interpolated and regridded to a regular lat/lon grid. While this is not very important for most applications, one must remember that interpolated data in the NetCDF files are not the same as the original model output, and this could have implications for the conservation properties of some variables. It can be easier and more convenient to work with data interpolated on a regular lat/lon grid, however.  

**Grid definition and wrap-around**: The gridded ERA5 archive is provided on a [-90;+90] latitude grid and a [0;+360] longitude grid, in decimal degrees, referenced to the Greenwich Prime Meridian. While latitude is generally not an issue, care must be taken when working with longitude, as one must remember that 0 and 360 are the same point. One must account for the wrap-around issue: although the first column and last column of a dataset on the [0;+360] grid are far away from each other in terms of index, these points are geographically very close. Some software can automatically deal with this wrap-around and convert to [-180;+180] or another system as required, but this should not be taken for granted. 
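
When the software at hand does not handle the wrap-around, the conversion can be done by hand. The sketch below uses a deliberately tiny, made-up grid (4 longitudes, one value each) rather than real ERA5 data: longitudes are mapped from [0;+360] to [-180;+180], and the data columns are then re-ordered so that longitude increases monotonically again.

```python
import numpy as np

# made-up example: 4 longitudes on a [0;360[ grid, and one value per longitude
lon360 = np.arange(0.0, 360.0, 90.0)        # [0, 90, 180, 270]
field = np.array([10.0, 20.0, 30.0, 40.0])  # longitude is the last axis

# map longitudes to [-180;180[
lon180 = ((lon360 + 180.0) % 360.0) - 180.0  # [0, 90, -180, -90]

# re-order longitudes AND data so that longitude increases monotonically
order = np.argsort(lon180)
lon180_sorted = lon180[order]
field_sorted = field[..., order]  # '...' keeps any leading (time, level, lat) axes

print(lon180_sorted)
print(field_sorted)
```

The important step is to apply the same re-ordering to the data as to the coordinate; converting the longitude values alone would leave the data attached to the wrong meridians.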


[back to contents](#Contents)

---


### Downloading ERA5 via the Climate Data Store (CDS)
The instructions to download ERA5 data are [here](https://confluence.ecmwf.int/display/CKB/How+to+download+ERA5). There are two ways to download data:

1. The simplest (if one only needs to do this once) is to use the [web interface](https://confluence.ecmwf.int/display/CKB/How+to+download+ERA5#HowtodownloadERA5-3-DownloadingonlineERA5familydatathroughtheCDSwebinterface). 

2. Alternatively, one can download ERA5 programmatically, which requires installing and understanding yet another service and its [API](https://confluence.ecmwf.int/display/CKB/How+to+download+ERA5#HowtodownloadERA5-3-DownloadingonlineERA5familydatathroughtheCDSwebinterface), namely, the [Climate Data Store API](https://confluence.ecmwf.int/display/CKB/Climate+Data+Store+%28CDS%29+infrastructure+and+API). (Beware, the installation instructions depend on the operating system.) This provides the user with a mechanism to write data request scripts that download files automatically (or as necessary within a workflow). 

#### The CDS API
  
**Note that before using this service, one must first register for a free CDS account [here](https://cds.climate.copernicus.eu/user/register)**. Registration is fast, only a minute. 

The next step is to set up the uid:API key details, as prescribed in the installation instructions. 

Then install the CDS API itself; follow the instructions [here](https://cds.climate.copernicus.eu/api-how-to). (Note, this can also be done with [conda](https://anaconda.org/conda-forge/cdsapi): `conda install -c conda-forge cdsapi`.)

The Copernicus Climate Change Service [provides tutorials and training material](https://confluence.ecmwf.int/display/COPSRV/CDS+web+API+%28cdsapi%29+training) on the CDS API. 

The following shows an example of how one can download Air Temperature, here on the 1000 hPa pressure level, for a given time point in both NetCDF and GRIB formats. 

Note that the CDS web interface can be used to build the CDS API download script. In the web interface, in the `Download data` tab, after selecting some files of interest, one can click the button `Show API request` at the bottom left, which will generate a script. This script can then be modified easily as required (for example, to access similar data from different dates, which would be convenient if building a self-updating dashboard, etc.). 
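As a minimal sketch of how a generated request could be adapted to several dates, the hypothetical helpers below (`build_request` and `daily_requests`, names chosen here for illustration) only build the request dictionaries and output file names. They reuse the keys of the temperature request used in this lecture and do not contact the CDS; the output file naming pattern is an assumption.

```python
from datetime import date, timedelta

def build_request(day, time="12:00", fmt="netcdf"):
    """Return a CDS request dictionary for temperature at 1000 hPa on `day`."""
    return {
        "product_type": "reanalysis",
        "variable": "temperature",
        "pressure_level": "1000",
        "year": f"{day.year:04d}",
        "month": f"{day.month:02d}",
        "day": f"{day.day:02d}",
        "time": time,
        "format": fmt,
    }

def daily_requests(start, n_days, fmt="netcdf"):
    """Yield (request, target_file) pairs for n_days consecutive days."""
    extension = "nc" if fmt == "netcdf" else "grib"
    for i in range(n_days):
        day = start + timedelta(days=i)
        target = f"output/ERA5_t1000_{day:%Y%m%d}.{extension}"  # hypothetical naming
        yield build_request(day, fmt=fmt), target

for request, target in daily_requests(date(2021, 1, 30), 3):
    print(target, request["year"], request["month"], request["day"])
    # To actually download (requires a configured CDS API account):
    # cdsapi.Client().retrieve("reanalysis-era5-pressure-levels", request, target)
```

Using `datetime` arithmetic rather than incrementing the day string by hand takes care of month and year boundaries (here, 30 January to 1 February).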

One can also study the systematic structure of the ERA archive to build CDS API scripts more [efficiently](https://confluence.ecmwf.int/display/CKB/Climate+Data+Store+%28CDS%29+documentation#ClimateDataStore(CDS)documentation-Efficiencytips). 

A full list of variables available for download is provided [here](https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#ERA5:datadocumentation-Parameterlistings). 


In [8]:
#!/usr/bin/env python
# 
# example usage of the CDS API: https://confluence.ecmwf.int/display/CKB/How+to+download+ERA5#HowtodownloadERA5-3-DownloadingonlineERA5familydatathroughtheCDSwebinterface
#
# Downloading Air Temperature on the 1000 hPa level for Jan 1st 2021 at 12:00.
#

import cdsapi
 
c = cdsapi.Client()
 
c.retrieve(
    'reanalysis-era5-pressure-levels',
    {
        'product_type': 'reanalysis',
        'variable': 'temperature',
        'pressure_level': '1000',
        'year': '2021',
        'month': '01',
        'day': '01',
        'time': '12:00',
        'format': 'netcdf',     # NetCDF format
    },
    'output/ERA5test.nc')       # Output file. Adapt as you wish.


# Retrieve the same data in GRIB format
c.retrieve(
    'reanalysis-era5-pressure-levels',
    {
        'product_type': 'reanalysis',
        'variable': 'temperature',
        'pressure_level': '1000',
        'year': '2021',
        'month': '01',
        'day': '01',
        'time': '12:00',
        'format': 'grib',       # GRIB format
    },
    'output/ERA5test.grib')     # Output file. Adapt as you wish.

2021-11-07 22:10:12,481 INFO Welcome to the CDS
2021-11-07 22:10:12,482 INFO Sending request to https://cds.climate.copernicus.eu/api/v2/resources/reanalysis-era5-pressure-levels
2021-11-07 22:10:12,510 INFO Request is queued
2021-11-07 22:10:13,541 INFO Request is running
2021-11-07 22:10:15,072 INFO Request is completed
2021-11-07 22:10:15,074 INFO Downloading https://download-0005.copernicus-climate.eu/cache-compute-0005/cache/data1/adaptor.mars.internal-1636323013.50075-12027-16-4a1dd4dd-58a0-45e2-8511-49ef05de9d17.nc to output/ERA5test.nc (2M)
2021-11-07 22:10:15,461 INFO Download rate 5.1M/s
2021-11-07 22:10:15,504 INFO Welcome to the CDS
2021-11-07 22:10:15,505 INFO Sending request to https://cds.climate.copernicus.eu/api/v2/resources/reanalysis-era5-pressure-levels
2021-11-07 22:10:15,533 INFO Request is queued
2021-11-07 22:10:16,565 INFO Request is completed
2021-11-07 22:10:16,566 INFO Downloading https://download-0001.copernicus-climate.eu/cache-compute-0001/cache/data5/ada

Result(content_length=2076600,content_type=application/x-grib,location=https://download-0001.copernicus-climate.eu/cache-compute-0001/cache/data5/adaptor.mars.internal-1636323015.8604436-19368-20-595f97a6-e99b-4c4c-a9c5-c103da7811f8.grib)

One can also modify the CDS API download script using optional post-processing arguments.

For example, one can use the `grid` option to change the grid on which the data are presented.  

The native grid of ERA data on CDS is 0.25°x0.25° (atmosphere) and 0.5°x0.5° (ocean waves) for variables, while derived quantities, such as mean and variance, are on a 0.5°x0.5° grid (atmosphere) and a coarser 1°x1° grid (ocean waves). ERA5-Land is provided at 0.1°x0.1°. These grid resolutions are provided by default, but one may not need such high resolution for a given application, or one may only be interested in a subset of the global dataset. In that case, it makes sense to minimize the local computational and storage burden by using the CDS API to request a coarser and more focused data file. 

The following script shows an [example](https://confluence.ecmwf.int/display/CKB/How+to+download+ERA5#HowtodownloadERA5-3-DownloadingonlineERA5familydatathroughtheCDSwebinterface), where the resolution is decreased to 1 degree by 1 degree and a sub-domain is selected (in case one does not need global coverage): 

In [9]:
#!/usr/bin/env python
import cdsapi
 
c = cdsapi.Client()
 
c.retrieve(
    'reanalysis-era5-pressure-levels',
    {
        'product_type': 'reanalysis',
        'variable': 'temperature',
        'pressure_level': '1000',
        'year': '2021',
        'month': '01',
        'day': '01',
        'time': '12:00',
        'format': 'netcdf',                         # NetCDF
        'area'          : [60., -11., 34., 35.],    # Default area is global; provide [North, West, South, East] limits to select an area (here selecting Europe)
        'grid'          : [1.0, 1.0],               # Default atmospheric resolution is 0.25 x 0.25; provide a [latitude, longitude] grid resolution to interpolate the data to a different grid.
    },
    'output/ERAtest_subarea_1deg1deg.nc')           # Output file. Adapt as you wish.

2021-11-07 22:55:17,737 INFO Welcome to the CDS
2021-11-07 22:55:17,738 INFO Sending request to https://cds.climate.copernicus.eu/api/v2/resources/reanalysis-era5-pressure-levels
2021-11-07 22:55:17,764 INFO Request is queued
2021-11-07 22:55:18,794 INFO Request is running
2021-11-07 22:55:22,618 INFO Request is completed
2021-11-07 22:55:22,619 INFO Downloading https://download-0002.copernicus-climate.eu/cache-compute-0002/cache/data1/adaptor.mars.internal-1636325721.1884649-10109-12-29518ea8-a89a-4c1c-a103-206d5b95ad02.nc to output/ERAtest_subarea_1deg1deg.nc (3.9K)
2021-11-07 22:55:22,731 INFO Download rate 34.7K/s


Result(content_length=3948,content_type=application/x-netcdf,location=https://download-0002.copernicus-climate.eu/cache-compute-0002/cache/data1/adaptor.mars.internal-1636325721.1884649-10109-12-29518ea8-a89a-4c1c-a103-206d5b95ad02.nc)
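To get a feel for how much the `area` and `grid` options reduce the request, one can count the grid points involved. The sketch below uses a hypothetical helper, `n_grid_points`, which assumes that both edges of the area are included and that the area limits fall on grid points; the actual behaviour of the CDS at the edges of the domain may differ.

```python
def n_grid_points(area, grid):
    """Number of (latitude, longitude) points in a sub-domain.

    area: [North, West, South, East] limits in degrees
    grid: [latitude step, longitude step] in degrees
    Assumes both edges are included (hence the + 1).
    """
    north, west, south, east = area
    dlat, dlon = grid
    n_lat = int(round((north - south) / dlat)) + 1
    n_lon = int(round((east - west) / dlon)) + 1
    return n_lat, n_lon

europe = [60., -11., 34., 35.]

n_lat, n_lon = n_grid_points(europe, [1.0, 1.0])
print(n_lat, n_lon, n_lat * n_lon)            # 27 47 1269

n_lat25, n_lon25 = n_grid_points(europe, [0.25, 0.25])
print(n_lat25, n_lon25, n_lat25 * n_lon25)    # 105 185 19425
```

For this sub-domain, moving from 0.25° to 1° reduces the number of values per time step and level by roughly a factor of 15, which is consistent with the very small file (3.9K) reported in the log above.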

### The CDS Toolbox

An extension to the CDS API is the [CDS Toolbox](https://cds.climate.copernicus.eu/toolbox/doc/index.html). The Toolbox is a programming interface. It is free and available to everyone. It links raw data to an online computing facility, thereby removing the computing barrier for users worldwide. The Toolbox allows users to develop python scripts for the CDS and to run them online. Users can then simply download the maps, graphs or secondary data without incurring the costs associated with storing and maintaining a large climate data archive locally, and without needing to invest in a powerful computer. 


The Toolbox works hand-in-hand with the CDS API. 

That is, one can issue commands to the Toolbox via the CDS API, or develop a python script that does certain things, and then use the CDS API functionality to send it to the CDS for evaluation, only downloading the finished product locally. 

Examples of how to proceed are given [here](https://confluence.ecmwf.int/display/COPSRV/Call+a+service+with+the+CDS+API). 

Note, it is also possible to use the CDS Toolbox to plot data from other services (i.e. not originally stored on CDS), using the `remote` function instead of the `service` function. This is possible thanks to protocols such as OPeNDAP. Here is an example using data from UNIDATA. 

To illustrate how the Toolbox can be used, the following piece of code, run locally (after the CDS API is properly configured and assuming one is connected to the internet), will produce, and then download to our `output/` directory, a map of global temperature on January 1st 2021 at 14:00.

In [14]:
import cdsapi
 
c = cdsapi.Client(full_stack=True, debug=True)
 
r = c.service("catalogue.retrieve",
    'reanalysis-era5-single-levels',
    {"variable": "2t", 
     "product_type": "reanalysis", 
     "date": "2021-01-01", 
     "time": "14:00"})
 
r = c.service('map.plot', r) # ask the Toolbox to make a map of the data
 
r.download("output/ERA5_testmap.png") # Check the output/ directory for a new map!

2021-11-07 23:24:33,573 INFO Welcome to the CDS
2021-11-07 23:24:33,574 INFO Sending request to https://cds.climate.copernicus.eu/api/v2/tasks/services/catalogue/retrieve/clientid-b51ae2dfb7874262977f338d0cf6f978
2021-11-07 23:24:33,640 INFO Request is completed
2021-11-07 23:24:33,655 INFO Welcome to the CDS
2021-11-07 23:24:33,656 INFO Sending request to https://cds.climate.copernicus.eu/api/v2/tasks/services/map/plot/clientid-5a7d81efe86142cb9c058dd2d90761f3
2021-11-07 23:24:33,824 INFO Downloading https://download-0010.copernicus-climate.eu/cache-compute-0010/cache/plots/map.plot-1636326997.4671943-21549-20-e380a147-00d8-4697-a88a-b79f4f1a80a0.png to output/ERA5_testmap.png (335.7K)
2021-11-07 23:24:33,957 INFO Download rate 2.5M/s


'output/ERA5_testmap.png'

---

<a id='ukcp18'></a>
# Towards Climate Services


* ## UKCP18: latest high resolution model output for the UK
[UKCP](https://www.metoffice.gov.uk/research/approach/collaboration/ukcp/index) stands for UK Climate Projections. UKCP uses climate models to provide the most up-to-date assessment of how the UK climate may change in the future. 

[UKCP18](https://www.metoffice.gov.uk/research/approach/collaboration/ukcp/about) represents a set of high resolution climate projections for the UK and the globe ('18' because this was released in 2018). It is an update of UKCP09 (done in 2009). 

UKCP18 provides model projections of climate change until 2100 over the UK at resolutions of 2.2km and 12km, and also provides global projections at 60km resolution. Although the 2.2km resolution is only available in some regions, we note that this is roughly the resolution used for weather forecasting! ... but in this case, the forecasts were carried forward for about a century (i.e. that's a lot of data)! This nesting, or resolution enhancement, allows the model to resolve localised high-impact events, such as heavy rainfall. UKCP18 also comes with a set of marine projections of sea-level rise and storm surge. 

Although UKCP18 is 'just' a model product, it is based on consistent physics, and the product represents the best tool available today to try to tease out the likely effect of climate change on the UK in the next few decades. [UKCP18](https://www.metoffice.gov.uk/research/approach/collaboration/ukcp/about) simulations are intended as a tool to help decision-makers in the UK assess exposure to climate risk. It is part of the Met Office Hadley Centre Climate Programme.

### Accessing UKCP18 data

Instructions for downloading UKCP18 data are available [here](https://www.metoffice.gov.uk/research/approach/collaboration/ukcp/download-data). The UKCP18 project provides a [web-interface](https://ukclimateprojections-ui.metoffice.gov.uk/ui/home) to facilitate data download, but this doesn't contain all the data. The full data is accessible from [CEDA](http://catalogue.ceda.ac.uk/?q=ukcp18&sort_by=). 

Accessing the data through the web-interface requires registration (this only takes a few minutes).

There is also a UKCP Web Processing Service (WPS) that can be used to build Apps. A python [API](https://ukclimateprojections-ui.metoffice.gov.uk/help/api), [`ukcp-api-client`](https://github.com/ukcp-data/ukcp-api-client), is available on GitHub and allows remote applications to talk directly to the UKCP service to submit requests for products.   

[Data availability](https://www.metoffice.gov.uk/binaries/content/assets/metofficegovuk/pdf/research/ukcp/ukcp18_data_availability_jul-2021.pdf): The resolution (temporal and spatial) of data available to download varies between domain (ocean, land, atmosphere) and what is available depends on what service is used to access the data (the full archive is available through CEDA...but this may not be the easiest way depending on what information we are looking for!)

### Example: will the weather become rainier over Imperial College London?
Our goal here is to download high temporal resolution model output, i.e. hourly data, from UKCP18, using the web-interface. 
To use the web-interface, first create an account and remember your login credentials. 

**ADVICE**: You are strongly encouraged to install/work with a [password manager](https://www.techradar.com/uk/best/password-manager) on your computer ... you will need to create lots of accounts when working with data! It is not advisable at all to reuse passwords for multiple data services, as one cannot guarantee the safety of each data provider.  

Once you have your account, log in. We are now going to walk through the download procedure to access rainfall predictions for `Imperial College London`. 

Note, we will get the data for the RCP8.5 scenario, for two time periods, 1981-2000 and 2061-2080, and we are going to ask for the data to be saved in the CF-NetCDF format (i.e. a NetCDF file compliant with CF conventions). Because we want two time periods, we will need to go through the whole request procedure twice. 

Tick the following boxes in "Product Selection":
1. Collection: Land projections: local (2.2km)
2. Scenario: RCP8.5 (more on what the scenario means later)
3. Output: Data only
4. Climate Change Type: Absolute values
   
This will create a shortlist of datasets, shown on the right side. 
1. Select **"Data: Variables from local projections (2.2km) regridded to 5km over UK for subdaily data (hourly and 3-hourly)"**. 
2. Click on `View Details` and familiarize yourself with what the data contain. You will note the statement `We recommend that you consider results from more than one 5km grid box, i.e. you should also consider surrounding grid boxes.` Of course, the weather is highly variable, and in order to develop a spatially and temporally coherent picture of what climate change may look like in an area, in a way that accounts for uncertainties and variability, one should follow this advice. Because we are only going through the process to illustrate how to get data, we will only ask for data from 1 grid box (i.e. the South Kensington area). Once you have finished looking at the details, go back (click `back` in your browser). 
3. Assuming these are the data we want, click `Submit a Request`. 
4. For "Variable", select `Precipitation rate (mm/day)`.
5. On the Map, click on the `red search button`, and type `Imperial College London`. Various choices come up; pick the first one. We note that all the options are within the same 5km by 5km grid cell anyway. Remember, we are working with gridded model products here, so there is no point in choosing a location with too much detail (i.e. the model can't differentiate between the Royal School of Mines and the Natural History Museum given the model resolution). 
6. On the Map, click on the grid cell with Imperial College (the grid cell should be highlighted in blue).
7. For "Temporal Average", select `hourly`. This means we will get data for every hour (that's a lot of data!).
8. For "Time range", select one of our two periods, `1981-2000` or `2061-2080` (the other period will be requested when the procedure is repeated).
9. The next pull-down menu asks which `ensemble member` we are interested in. Choose the first one, `HadGEM3-GC3.05-r001i1p00000`. "Ensembles" are basically replicates. In this case, the model was run 12 times! Because the weather is turbulent, the results from each ensemble member will vary! A complete assessment of the uncertainties associated with these results would have to consider all available ensemble members, and check that the conclusions of any analysis are consistently reproduced across the ensemble set, or report a probability stating how likely the outcome is. Since we are only doing an example here (and time is limited), we will only work with a single ensemble member. 
10. For "Data Format", choose "CF-netCDF". As we know, netCDF files come with a bunch of metadata, and this one is also conveniently compliant with "CF conventions", which makes automation of the analysis easier. 
11. Add a title for "Job label". Although optional, as a scientist, you will want to keep notes of what you are doing. One suggested title could be `EDSML_UKCP18_2061-2080_precip-hourly`. 
12. Click `submit`. You will see that the web-interface now carries you to the next phase, `generating output`. Don't close your browser! By clicking submit, a request was made to fetch the data from the data archive and to put them in a file. This can take a few minutes (5-10 min or more), so be patient. 
13. When ready, you will see a blue `Download` button. The file size should be about 1.4Mb (quite small, but only because we selected data from a single grid cell. If we were to do the analysis over the whole UK, the file would be much bigger!). This should look something like the following screenshot: 

![UKCP18download](img/UKCP18_download.png)

Note the other tabs (Outputs, Job Details, ASCII, XML Response) and explore these. 

**Repeat steps 1-12**: After the data are downloaded, click on `Edit inputs` (bottom right). Because we are looking to download data from two periods, we'll have to repeat the download procedure, making sure to select the correct time periods. Our goal will be to compare precipitation results over Imperial College from the historical period 1981-2000 with model predictions for the 2061-2080 period, assuming humanity achieves no reduction in greenhouse gas emissions (i.e. the RCP8.5 scenario is essentially a `business-as-usual scenario`).

Make sure you move your downloaded files to a convenient folder, where you will remember what these files are! (e.g. an `output/` folder within this lecture folder)


[back to contents](#Contents)

### Looking at the data

Now take a look at the downloaded data. 

You should see folders with the following format (`output_<some string>`): 
```
output_08b295c097368efd78f0c78b53b320d7_20211114_234836/
output_1819e6a692e9503460dcf1416043a7c2_20211114_231924/
```

Looking in these folders, we see 4 files: 
```
>>> ls output_08b295c097368efd78f0c78b53b320d7_20211114_234837

2021-11-14T23-44-34.nc
data_licence.txt
input_paths.txt
request.txt
```
`input_paths.txt` shows the paths on the CEDA archive from where the data were obtained. We see a long list of paths, each corresponding to a different time period (the date range at the end of each file name). 
```
http://data.ceda.ac.uk/badc/ukcp18/data/land-cpm/uk/5km/rcp85/01/pr/1hr/v20210615/pr_rcp85_land-cpm_uk_5km_01_1hr_20601201-20601230.nc
http://data.ceda.ac.uk/badc/ukcp18/data/land-cpm/uk/5km/rcp85/01/pr/1hr/v20210615/pr_rcp85_land-cpm_uk_5km_01_1hr_20610101-20610130.nc
...
```

The web-interface conveniently isolated the one data point from each one of these files, and concatenated the results into a single file for us. Without the web-interface, we would have had to do this ourselves, manually! (Now you also understand why it took a few minutes to retrieve our data file!)
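The file names listed in `input_paths.txt` follow a systematic pattern, which can be exploited when working with the archive directly. The sketch below splits one of the paths shown above into its components. The field labels (`variable`, `scenario`, `collection`, ...) are an interpretation based on the path itself and are assumptions, not official UKCP18 terminology; `parse_ukcp18_path` is a hypothetical helper name.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def parse_ukcp18_path(url):
    """Split a UKCP18 file name into its underscore-separated fields.

    The labels given to the fields are assumptions made for illustration.
    """
    name = PurePosixPath(urlparse(url).path).stem   # file name without '.nc'
    (variable, scenario, collection, domain,
     resolution, member, frequency, period) = name.split("_")
    start, end = period.split("-")
    return {
        "variable": variable,
        "scenario": scenario,
        "collection": collection,
        "domain": domain,
        "resolution": resolution,
        "member": member,
        "frequency": frequency,
        "start": start,
        "end": end,
    }

url = ("http://data.ceda.ac.uk/badc/ukcp18/data/land-cpm/uk/5km/rcp85/01/pr/1hr/"
       "v20210615/pr_rcp85_land-cpm_uk_5km_01_1hr_20601201-20601230.nc")

info = parse_ukcp18_path(url)
print(info["variable"], info["scenario"], info["frequency"])   # pr rcp85 1hr
print(info["start"], info["end"])                              # 20601201 20601230
```

Note that the date range of this file ends on the 30th of the month: the archive is organised in 30-day months.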

`request.txt` is just a summary of the request we made. This is convenient as it can be used to retrace our steps and know how/where the data were obtained. 

Note the `data_licence.txt` file. When accessing/downloading data, always know under what licence these are provided. For academic research, there are usually very few restrictions (except the need to reference the data adequately!), but this may not be the case for commercial applications (...better check the licence before facing a lawsuit!). Here, `data_licence.txt` tells us that: 
`The data on this web site are available under the Open Government Licence, see http://www.nationalarchives.gov.uk/doc/open-government-licence/`

**Question**: Given this licence, would you be free to use UKCP18 data for commercial applications? 

**Answer**: [Yes](http://www.nationalarchives.gov.uk/doc/open-government-licence/) (make sure you know why!)


`2021-11-14T23-44-34.nc` is the netCDF file with the data. Navigate to the directory where the data are saved and use the command line tool `ncdump -h <filename>` (in the terminal) to see the header of the netCDF file. 

```
>>> ncdump -h 2021-11-14T23-44-34.nc

netcdf \2021-11-14T23-44-34 {
dimensions:
	ensemble_member = 1 ;
	time = 172080 ;
	bnds = 2 ;
	string27 = 27 ;
	string64 = 64 ;
variables:
	float pr(ensemble_member, time) ;
		pr:standard_name = "lwe_precipitation_rate" ;
		pr:long_name = "Precipitation rate" ;
		pr:units = "mm/hour" ;
		pr:cell_methods = "time: mean" ;
		pr:grid_mapping = "transverse_mercator" ;
		pr:coordinates = "ensemble_member_id latitude longitude month_number projection_x_coordinate projection_y_coordinate year yyyymmddhh" ;
	int transverse_mercator ;
		transverse_mercator:grid_mapping_name = "transverse_mercator" ;
		transverse_mercator:longitude_of_prime_meridian = 0. ;
		transverse_mercator:semi_major_axis = 6377563.396 ;
		transverse_mercator:semi_minor_axis = 6356256.909 ;
		transverse_mercator:longitude_of_central_meridian = -2. ;
		transverse_mercator:latitude_of_projection_origin = 49. ;
		transverse_mercator:false_easting = 400000. ;
		transverse_mercator:false_northing = -100000. ;
		transverse_mercator:scale_factor_at_central_meridian = 0.9996012717 ;
	int ensemble_member(ensemble_member) ;
		ensemble_member:units = "1" ;
		ensemble_member:long_name = "ensemble_member" ;
	double time(time) ;
		time:axis = "T" ;
		time:bounds = "time_bnds" ;
		time:units = "hours since 1970-01-01 00:00:00" ;
		time:standard_name = "time" ;
		time:calendar = "360_day" ;
	double time_bnds(time, bnds) ;
	char ensemble_member_id(ensemble_member, string27) ;
		ensemble_member_id:units = "1" ;
		ensemble_member_id:long_name = "ensemble_member_id" ;
	double latitude ;
		latitude:units = "degrees_north" ;
		latitude:standard_name = "latitude" ;
	double longitude ;
		longitude:units = "degrees_east" ;
		longitude:standard_name = "longitude" ;
	int month_number(time) ;
		month_number:units = "1" ;
		month_number:long_name = "month_number" ;
	double projection_x_coordinate ;
		projection_x_coordinate:bounds = "projection_x_coordinate_bnds" ;
		projection_x_coordinate:units = "m" ;
		projection_x_coordinate:standard_name = "projection_x_coordinate" ;
	double projection_x_coordinate_bnds(bnds) ;
	double projection_y_coordinate ;
		projection_y_coordinate:bounds = "projection_y_coordinate_bnds" ;
		projection_y_coordinate:units = "m" ;
		projection_y_coordinate:standard_name = "projection_y_coordinate" ;
	double projection_y_coordinate_bnds(bnds) ;
	int year(time) ;
		year:units = "1" ;
		year:long_name = "year" ;
	char yyyymmddhh(time, string64) ;
		yyyymmddhh:units = "1" ;
		yyyymmddhh:long_name = "yyyymmddhh" ;

// global attributes:
		:collection = "land-cpm" ;
		:contact = "ukcpproject@metoffice.gov.uk" ;
		:description = "Precipitation rate" ;
		:domain = "uk" ;
		:frequency = "1hr" ;
		:institution = "Met Office Hadley Centre (MOHC), FitzRoy Road, Exeter, Devon, EX1 3PB, UK." ;
		:institution_id = "MOHC" ;
		:label_units = "mm/hour" ;
		:plot_label = "Precipitation rate (mm/hour)" ;
		:project = "UKCP18" ;
		:references = "https://ukclimateprojections.metoffice.gov.uk" ;
		:resolution = "5km" ;
		:scenario = "rcp85" ;
		:source = "UKCP18 realisation from a set of 12 convection-permitting models (HadREM3-RA11M) driven by perturbed variants of the Met Office Unified Model Global Atmosphere GA7 model (HadREM3-GA705) at 12km resolution.  The HadREM3-GA705 models were driven by perturbed variants of the global HadGEM3-GC3.05" ;
		:title = "UKCP18 land projections - Regridded 2.2km convection-permitting climate model results on 5km British National Grid from Ordnance Survey (OSGB), Precipitation rate over the UK for the RCP8.5 scenario" ;
		:version = "v20210615" ;
		:Conventions = "CF-1.7" ;
}
```
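Two details of this header deserve attention: the time axis is expressed in `hours since 1970-01-01 00:00:00`, and it uses a `360_day` calendar, i.e. 12 months of 30 days each. Standard date libraries cannot represent dates such as 30 February, so the conversion has to be done with calendar-aware tools or by hand. The following is a minimal sketch of the arithmetic, using hypothetical helper names (`encode_360_day`, `decode_360_day`) and truncating to whole hours; it is not the method used by any particular library.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30      # 360_day calendar: every month has 30 days
DAYS_PER_YEAR = 360

def encode_360_day(year, month, day, hour=0):
    """Hours since 1970-01-01 00:00:00 in a 360-day calendar."""
    days = ((year - 1970) * DAYS_PER_YEAR
            + (month - 1) * DAYS_PER_MONTH
            + (day - 1))
    return days * HOURS_PER_DAY + hour

def decode_360_day(hours_since_1970):
    """Return (year, month, day, hour) for a 360-day calendar time value.

    Fractional hours are truncated in this sketch.
    """
    days, hour = divmod(int(hours_since_1970), HOURS_PER_DAY)
    years, day_of_year = divmod(days, DAYS_PER_YEAR)
    month0, day0 = divmod(day_of_year, DAYS_PER_MONTH)
    return 1970 + years, month0 + 1, day0 + 1, hour

print(encode_360_day(2060, 12, 1))                    # 785520
print(decode_360_day(785520))                         # (2060, 12, 1, 0)
print(decode_360_day(encode_360_day(2061, 2, 30)))    # (2061, 2, 30, 0): a valid date here!

# Length of the time axis in the header above, in model years
n_steps = 172080
print(n_steps / (HOURS_PER_DAY * DAYS_PER_YEAR))
```

The last line shows that the 172080 hourly time steps span just under 20 model years (19 years and 11 months of 30 days). The start date of the time axis is not shown in the header, so it is not assumed here.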


[back to contents](#Contents)

---