# Bringing In Data

In cdms2, data (arrays of numbers) are called **Variables**.

cdms2 has two types of variables: ***File Variables*** and ***Transient Variables***.

The first type, ***File Variables***, are just a pointer (or handle) to a variable in a file. They are very useful for querying the variable, as shown in the previous notebook. The actual data is still on disk and is not loaded into memory yet. Once you get more experienced with CDAT/cdms2, this has useful implications for things like distributed computing (dask, mpi, etc.).

The second type is the ***Transient Variable***. These variables offer the same query functions as *File Variables*, but their data is now in memory rather than in the original file.

***WARNING***: ANY operation on a file variable will convert it to a ***Transient Variable***. Keep this in mind, as it has memory implications (e.g. you do not want to load a 3 TB variable in memory).

cdms variables can be treated just like a *supercharged* `numpy` array. As the figure below demonstrates, they are a numpy array PLUS a mask marking areas where values are to be ignored (missing or bad) PLUS dimension information (axes) to describe the array PLUS metadata about the array itself and its dimensions.
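
The four ingredients listed above can be pictured with a small plain-Python stand-in. This is only a sketch: `SketchVariable` and its `valid_mean` method are hypothetical names invented for illustration and are not part of cdms2, which stores the data as numpy arrays rather than lists.

```python
from dataclasses import dataclass, field


@dataclass
class SketchVariable:
    """Hypothetical stand-in for the four parts of a cdms variable."""
    data: list    # the raw numbers (a numpy array in cdms2)
    mask: list    # True where a value must be ignored
    axes: dict    # dimension name -> coordinate values
    attributes: dict = field(default_factory=dict)  # metadata

    def valid_mean(self):
        """Average only the values that are not masked out."""
        kept = [v for v, m in zip(self.data, self.mask) if not m]
        return sum(kept) / len(kept)


var = SketchVariable(
    data=[280.0, 1e20, 290.0],          # 1e20 plays the role of a missing value
    mask=[False, True, False],
    axes={"latitude": [-45.0, 0.0, 45.0]},
    attributes={"id": "tas", "units": "K"},
)
print(var.valid_mean())  # 285.0, the masked value is skipped
```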


## Loading a variable

### File Variables

Once a file is opened, you can create a file variable by referring to it with *square* brackets ***[]***:


In [1]:
import os
import cdms2
home = os.path.expandvars("$HOME")
f = cdms2.open(os.path.join(home,"cmip6_data/CMIP6/CMIP/E3SM-Project/E3SM-1-0/piControl/r1i1p1f1/Amon/tas/gr/v20180608/tas_Amon_E3SM-1-0_piControl_r1i1p1f1_gr_000101-050012.nc_0"))

In [2]:
tas_file = f["tas"]

The above `tas_file` is a file variable. To actually load all of its data in memory (and do anything with the data), you will need a ***Transient Variable***.

### Transient Variables

In [3]:
tas_memory = f("tas")

You probably noticed that the above line takes a lot more time to execute: this time we had to read the entire variable into memory.

## Subsetting Variables

As you probably guessed, bringing entire variables into memory will quickly bring your system to its knees. We need to work on a subset of the variable instead.

In the following we will start from `tas_file`, our ***File Variable***, and load into memory only what we need.

### Subsetting by index

As explained above, we can treat variables just as regular numpy arrays, so subsetting by index is easy:

In [4]:
tas_sub = tas_file[0]  # retrieve only the first time step
print(tas_sub.shape)

(129, 256)
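
The shape printed above also gives a rough idea of what loading the whole variable costs compared with a single time step. The sketch below is back-of-the-envelope arithmetic only. It assumes 500 years of monthly data (suggested by the `000101-050012` part of the file name, not read from the file) and 4-byte floats; neither is confirmed by the notebook.

```python
from math import prod

# Assumptions (not confirmed by the notebook): 500 years of monthly data,
# as the "000101-050012" file name suggests, stored as 4-byte floats.
n_time = 500 * 12
grid_shape = (129, 256)   # (lat, lon), the shape of a single time step
bytes_per_value = 4

one_step_bytes = prod(grid_shape) * bytes_per_value
full_bytes = n_time * one_step_bytes

print(f"one time step:  {one_step_bytes / 1e3:.1f} kB")
print(f"whole variable: {full_bytes / 1e6:.1f} MB")
```

Under these assumptions a single step is about a hundred kilobytes while the full variable is several hundred megabytes, which is why the rest of this notebook reads only the part it needs.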


Steps are also allowed:

In [5]:
tas_sub = tas_file[-20::2]  # last 20 time steps but every other one
print(tas_sub.shape)

(10, 129, 256)
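
To see which indices a stepped slice such as `[-20::2]` picks, Python's own `slice.indices` can be used without touching any data. This is a sketch assuming a time axis of 6000 steps, a length inferred from the file name rather than read from the file.

```python
# Assumption: 6000 time steps, inferred from the file name rather than
# read from the file itself.
n_time = 6000

# Resolve the negative start and the open end against the axis length.
start, stop, step = slice(-20, None, 2).indices(n_time)
selected = list(range(start, stop, step))

print(start, stop, step)          # 5980 6000 2
print(len(selected))              # 10
print(selected[0], selected[-1])  # 5980 5998
```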


**IMPORTANT NOTE**: while they can *mostly* be treated as numpy arrays, cdms transient variables do **NOT** support array indexing.

### Subsetting by dimension

Since cdms variables contain dimension information, the easiest way to subset them is usually to rely on the dimension names. This frees the user from having to know the actual order of the dimensions (unlike index subsetting).

Dimension subsetting can again be done by index or by value. `time` dimension subsetting also accepts actual time objects, either as strings in the `YYYY-MM-DD HH:MM:SS` format or as `cdtime` objects (see a later notebook).

#### Subsetting via index

In [6]:
tas_sub = tas_file(time=slice(0, 5))  # first 5 time steps
print(tas_sub.shape)

(5, 129, 256)


#### Subsetting by value

Subsetting by value, in its simplest form, requires 2 values representing the desired bounds. To determine whether an index needs to be selected, cdms checks whether the dimension `value` falls within this domain.


In [7]:
print(tas_file.getLatitude()[:])  # Looking at latitude values in file

[-90.      -88.59375 -87.1875  -85.78125 -84.375   -82.96875 -81.5625
 -80.15625 -78.75    -77.34375 -75.9375  -74.53125 -73.125   -71.71875
 -70.3125  -68.90625 -67.5     -66.09375 -64.6875  -63.28125 -61.875
 -60.46875 -59.0625  -57.65625 -56.25    -54.84375 -53.4375  -52.03125
 -50.625   -49.21875 -47.8125  -46.40625 -45.      -43.59375 -42.1875
 -40.78125 -39.375   -37.96875 -36.5625  -35.15625 -33.75    -32.34375
 -30.9375  -29.53125 -28.125   -26.71875 -25.3125  -23.90625 -22.5
 -21.09375 -19.6875  -18.28125 -16.875   -15.46875 -14.0625  -12.65625
 -11.25     -9.84375  -8.4375   -7.03125  -5.625    -4.21875  -2.8125
  -1.40625   0.        1.40625   2.8125    4.21875   5.625     7.03125
   8.4375    9.84375  11.25     12.65625  14.0625   15.46875  16.875
  18.28125  19.6875   21.09375  22.5      23.90625  25.3125   26.71875
  28.125    29.53125  30.9375   32.34375  33.75     35.15625  36.5625
  37.96875  39.375    40.78125  42.1875   43.59375  45.       46.40625
  47.8125   49.218

In [8]:
tas_sub = tas_file(time = slice(0, 12), latitude=(-20, 20))  # first year only lats between 20 south and 20 north
print(tas_sub.shape)
print(tas_sub.getLatitude()[:])

(12, 29, 256)
[-19.6875  -18.28125 -16.875   -15.46875 -14.0625  -12.65625 -11.25
  -9.84375  -8.4375   -7.03125  -5.625    -4.21875  -2.8125   -1.40625
   0.        1.40625   2.8125    4.21875   5.625     7.03125   8.4375
   9.84375  11.25     12.65625  14.0625   15.46875  16.875    18.28125
  19.6875 ]


In [9]:
tas_sub = tas_file(time = slice(0, 12), latitude=(-22, 22))  # first year only lats between 22 south and 22 north
print(tas_sub.shape)
print(tas_sub.getLatitude()[:])

(12, 31, 256)
[-21.09375 -19.6875  -18.28125 -16.875   -15.46875 -14.0625  -12.65625
 -11.25     -9.84375  -8.4375   -7.03125  -5.625    -4.21875  -2.8125
  -1.40625   0.        1.40625   2.8125    4.21875   5.625     7.03125
   8.4375    9.84375  11.25     12.65625  14.0625   15.46875  16.875
  18.28125  19.6875   21.09375]


In [10]:
tas_sub = tas_file(time = slice(0, 12), latitude=(-21.09375, 21.09375))  # first year only lats between 2 latitudes exactly
print(tas_sub.shape)
print(tas_sub.getLatitude()[:])

(12, 31, 256)
[-21.09375 -19.6875  -18.28125 -16.875   -15.46875 -14.0625  -12.65625
 -11.25     -9.84375  -8.4375   -7.03125  -5.625    -4.21875  -2.8125
  -1.40625   0.        1.40625   2.8125    4.21875   5.625     7.03125
   8.4375    9.84375  11.25     12.65625  14.0625   15.46875  16.875
  18.28125  19.6875   21.09375]
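
The counts above can be checked by hand. The sketch below rebuilds the latitude nodes (129 values spaced 1.40625 degrees apart, as printed earlier) and applies a plain "node inside the closed interval" test. The helper `select_closed` is a hypothetical name used for illustration; it mimics the rule described here, not cdms2's actual implementation.

```python
# Rebuild the latitude nodes printed earlier: 129 values from -90 to 90,
# spaced 1.40625 degrees apart (a step that is exact in binary floats).
lats = [-90.0 + 1.40625 * i for i in range(129)]


def select_closed(values, low, high):
    """Keep the values whose node lies in [low, high], bounds included."""
    return [v for v in values if low <= v <= high]


print(len(select_closed(lats, -20, 20)))              # 29
print(len(select_closed(lats, -22, 22)))              # 31
print(len(select_closed(lats, -21.09375, 21.09375)))  # 31
```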


Now, as explained above, by default cdms looks at the latitude value (its `node`) and checks whether it falls within the domain, bounds included.

A third argument can be passed to control this. In the following example we exclude the upper latitude by adding a third argument, `'co'`, which stands for ***c***losed ***o***pen, i.e. inclusive on the first bound, exclusive on the second bound:

In [11]:
tas_sub = tas_file(time = slice(0, 12), latitude=(-21.09375, 21.09375, 'co'))  # first year only, lats between 2 exact latitudes, upper one excluded
print(tas_sub.shape)
print(tas_sub.getLatitude()[:])

(12, 30, 256)
[-21.09375 -19.6875  -18.28125 -16.875   -15.46875 -14.0625  -12.65625
 -11.25     -9.84375  -8.4375   -7.03125  -5.625    -4.21875  -2.8125
  -1.40625   0.        1.40625   2.8125    4.21875   5.625     7.03125
   8.4375    9.84375  11.25     12.65625  14.0625   15.46875  16.875
  18.28125  19.6875 ]
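
The effect of each letter in the indicator can be sketched with two lookup tables, one per side of the interval. `select_nodes` is a hypothetical helper written for this illustration only, and the latitude list is rebuilt from the values printed earlier.

```python
import operator

# Map each indicator letter to the comparison used on that side:
# "c" (closed) keeps the bound itself, "o" (open) drops it.
LOWER = {"c": operator.ge, "o": operator.gt}
UPPER = {"c": operator.le, "o": operator.lt}


def select_nodes(values, low, high, indicator="cc"):
    """Keep the nodes inside the interval described by the indicator."""
    keep_low, keep_high = LOWER[indicator[0]], UPPER[indicator[1]]
    return [v for v in values if keep_low(v, low) and keep_high(v, high)]


lats = [-90.0 + 1.40625 * i for i in range(129)]
for indicator in ("cc", "co", "oc", "oo"):
    picked = select_nodes(lats, -21.09375, 21.09375, indicator)
    print(indicator, len(picked), picked[0], picked[-1])
```

With `'co'` the sketch keeps 30 latitudes and the last one is 19.6875, in line with the output above.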


We can further control this by telling cdms to look not at the ***n***ode (value) to decide whether the index needs to be selected, but at the actual cell ***b***ounds, for example:

In [12]:
tas_sub1 = tas_file(time = slice(0, 12), latitude=(-21, 21, 'ccn'))  # first year only, nodes between 21 south and 21 north
print(tas_sub1.shape)
tas_sub2 = tas_file(time = slice(0, 12), latitude=(-21, 21, 'ccb'))  # first year only, cells whose bounds reach 21 south and 21 north
print(tas_sub2.shape)

(12, 29, 256)
(12, 31, 256)


The second example brought in 2 extra latitudes. Why?

In the second case we asked cdms to look at the bounds, and indeed -21 and 21 are within the bounds of the first and last cells:

In [13]:
print(tas_sub2.getLatitude().getBounds())

[[-21.796875 -20.390625]
 [-20.390625 -18.984375]
 [-18.984375 -17.578125]
 [-17.578125 -16.171875]
 [-16.171875 -14.765625]
 [-14.765625 -13.359375]
 [-13.359375 -11.953125]
 [-11.953125 -10.546875]
 [-10.546875  -9.140625]
 [ -9.140625  -7.734375]
 [ -7.734375  -6.328125]
 [ -6.328125  -4.921875]
 [ -4.921875  -3.515625]
 [ -3.515625  -2.109375]
 [ -2.109375  -0.703125]
 [ -0.703125   0.703125]
 [  0.703125   2.109375]
 [  2.109375   3.515625]
 [  3.515625   4.921875]
 [  4.921875   6.328125]
 [  6.328125   7.734375]
 [  7.734375   9.140625]
 [  9.140625  10.546875]
 [ 10.546875  11.953125]
 [ 11.953125  13.359375]
 [ 13.359375  14.765625]
 [ 14.765625  16.171875]
 [ 16.171875  17.578125]
 [ 17.578125  18.984375]
 [ 18.984375  20.390625]
 [ 20.390625  21.796875]]
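
The difference between testing the node and testing the cell bounds can be reproduced in plain Python. This is a sketch under one assumption: each cell extends half a grid step on either side of its node, which matches the bounds printed above for these interior cells (the cells at the poles may be defined differently). The helper names are hypothetical.

```python
step = 1.40625
lats = [-90.0 + step * i for i in range(129)]

# Assumption: each cell extends half a grid step either side of its node,
# which matches the bounds printed above for the interior cells.
cells = [(lat - step / 2, lat + step / 2) for lat in lats]


def select_by_node(low, high):
    """Keep a latitude when its node lies inside [low, high]."""
    return [lat for lat in lats if low <= lat <= high]


def select_by_bounds(low, high):
    """Keep a latitude as soon as its cell overlaps [low, high]."""
    return [lat for lat, (b0, b1) in zip(lats, cells)
            if b1 >= low and b0 <= high]


print(len(select_by_node(-21, 21)))    # 29
print(len(select_by_bounds(-21, 21)))  # 31
```

The two extra latitudes are the cells centred on -21.09375 and 21.09375: their nodes lie outside the interval, but their bounds reach into it.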


#### Time values

For time, we can pass either the time value as stored in the file or the actual date we want. This is very useful because:

1. It is hard to figure out what `171466.000000 hours since 2000` is (`2019-07-24 10:00`).
2. It gets harder if you want to compare files with different time units!
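
To make the first point concrete, here is how a relative time can be decoded by hand with the standard library. It is a sketch only: `decode` is a hypothetical helper, and it relies on Python's proleptic Gregorian calendar, so it does not cover the `noleap` or `360_day` calendars that some models use.

```python
from datetime import datetime, timedelta


def decode(value, unit, since):
    """Turn 'value unit since <date>' into a calendar date.

    Sketch only: uses Python's proleptic Gregorian calendar.
    `unit` must be a timedelta keyword such as "days" or "hours".
    """
    return since + timedelta(**{unit: value})


print(decode(171466.0, "hours", datetime(2000, 1, 1)))  # 2019-07-24 10:00:00
```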

In [14]:
miroc_ps_file = cdms2.open("/global/cscratch1/sd/cmip6/CMIP6/CMIP/MIROC/MIROC6/historical/r1i1p1f1/Amon/ps/gn/v20181212/ps_Amon_MIROC6_historical_r1i1p1f1_gn_195001-201412.nc")
ipsl_ps_file = cdms2.open("/global/cscratch1/sd/cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/historical/r1i1p1f1/Amon/ps/gr/v20180803/ps_Amon_IPSL-CM6A-LR_historical_r1i1p1f1_gr_185001-201412.nc")
ipsl_time = ipsl_ps_file["ps"].getTime()
miroc_time = miroc_ps_file["ps"].getTime()
print("IPSL:", ipsl_time[0], ipsl_time.units)
print("MIROC:", miroc_time[0], miroc_time.units)

IPSL: 15.5 days since 1850-01-01 00:00:00
MIROC: 36539.5 days since 1850-1-1


Let's retrieve year 2000 from both files

In [15]:
ipsl = ipsl_ps_file("ps", time=("2000", "2001", "con"))
miroc = miroc_ps_file("ps", time=("2000", "2001", "con"))
print("IPSL:", ipsl.shape)
print("MIROC:", miroc.shape)

IPSL: (12, 143, 144)
MIROC: (12, 128, 256)


Finally, time axes have a few useful functions that make them easier to understand. The most useful is `asComponentTime()`:

In [16]:
print(ipsl.getTime().asComponentTime())

[2000-1-16 12:0:0.0, 2000-2-15 12:0:0.0, 2000-3-16 12:0:0.0, 2000-4-16 0:0:0.0, 2000-5-16 12:0:0.0, 2000-6-16 0:0:0.0, 2000-7-16 12:0:0.0, 2000-8-16 12:0:0.0, 2000-9-16 0:0:0.0, 2000-10-16 12:0:0.0, 2000-11-16 0:0:0.0, 2000-12-16 12:0:0.0]


Similarly, to compare two models with different time units, the `asRelativeTime(units)` function can help you go back and forth.

In [17]:
print(ipsl.getTime().asRelativeTime("days since 2020"))

[-7289.500000 days since 2020, -7259.500000 days since 2020, -7229.500000 days since 2020, -7199.000000 days since 2020, -7168.500000 days since 2020, -7138.000000 days since 2020, -7107.500000 days since 2020, -7076.500000 days since 2020, -7046.000000 days since 2020, -7015.500000 days since 2020, -6985.000000 days since 2020, -6954.500000 days since 2020]


Converting units can also be done via the `toRelativeTime(units)` command:

In [18]:
print(ipsl.getTime()[0], ipsl.getTime().units)
ipsl.getTime().toRelativeTime("hours since 2000")
print(ipsl.getTime()[0], ipsl.getTime().units)

54801.5 days since 1850-01-01 00:00:00
372.0 hours since 2000
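
The numbers above can be cross-checked with the standard library. The sketch below re-expresses the first IPSL time value against other units; `rebase` is a hypothetical helper, and it assumes the standard (Gregorian) calendar, which happens to agree with the values printed in this notebook.

```python
from datetime import datetime, timedelta


def rebase(value, old_unit, old_since, new_unit, new_since):
    """Re-express a relative time against another unit and reference date.

    Sketch only: assumes the standard (Gregorian) calendar.
    """
    instant = old_since + timedelta(**{old_unit: value})
    return (instant - new_since) / timedelta(**{new_unit: 1})


first = (54801.5, "days", datetime(1850, 1, 1))       # first IPSL time step
print(rebase(*first, "days", datetime(2020, 1, 1)))   # -7289.5
print(rebase(*first, "hours", datetime(2000, 1, 1)))  # 372.0
```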


## Re-ordering

cdms lets you easily re-order the data by using the `order` keyword:

In [19]:
print("Original:", ipsl.getAxisIds(), ipsl.shape)
ipsl = ipsl(order='(longitude)(latitude)(time)')
print("Re-ordered:", ipsl.getAxisIds(), ipsl.shape)

Original: ['time', 'lat', 'lon'] (12, 143, 144)
Re-ordered: ['lon', 'lat', 'time'] (144, 143, 12)


Dimensions can be referenced by their aliases: `x, y, z, t`

In [20]:
ipsl = ipsl(order="txy")
print("Further re-ordered:", ipsl.getAxisIds(), ipsl.shape)

Further re-ordered: ['time', 'lon', 'lat'] (12, 144, 143)
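
Re-ordering only permutes the axes, so the resulting shape can be predicted from the axis ids alone. The sketch below shows that bookkeeping in plain Python; `reorder` is a hypothetical helper, not a cdms2 function, and it moves no data.

```python
def reorder(axis_ids, shape, new_order):
    """Return the axis ids and shape after moving the axes into new_order."""
    positions = [axis_ids.index(name) for name in new_order]
    return [axis_ids[p] for p in positions], tuple(shape[p] for p in positions)


ids, shape = reorder(["time", "lat", "lon"], (12, 143, 144), ["lon", "lat", "time"])
print(ids, shape)  # ['lon', 'lat', 'time'] (144, 143, 12)

ids, shape = reorder(ids, shape, ["time", "lon", "lat"])
print(ids, shape)  # ['time', 'lon', 'lat'] (12, 144, 143)
```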


## Data spanning multiple files: cdscan

With CMIP data, it is common for the full dataset to be split over multiple files representing different time periods.

Example:

In [21]:
import glob
files = glob.glob("/global/cscratch1/sd/cmip6/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r1i1p1f1/Amon/tas/gr/v20190711/*.nc")
print(files[:10])

['/global/cscratch1/sd/cmip6/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r1i1p1f1/Amon/tas/gr/v20190711/tas_Amon_EC-Earth3_historical_r1i1p1f1_gr_198701-198712.nc', '/global/cscratch1/sd/cmip6/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r1i1p1f1/Amon/tas/gr/v20190711/tas_Amon_EC-Earth3_historical_r1i1p1f1_gr_198201-198212.nc', '/global/cscratch1/sd/cmip6/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r1i1p1f1/Amon/tas/gr/v20190711/tas_Amon_EC-Earth3_historical_r1i1p1f1_gr_185001-185012.nc', '/global/cscratch1/sd/cmip6/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r1i1p1f1/Amon/tas/gr/v20190711/tas_Amon_EC-Earth3_historical_r1i1p1f1_gr_193801-193812.nc', '/global/cscratch1/sd/cmip6/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r1i1p1f1/Amon/tas/gr/v20190711/tas_Amon_EC-Earth3_historical_r1i1p1f1_gr_200401-200412.nc', '/global/cscratch1/sd/cmip6/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r1i1p1f1/Amon/tas/gr/v20190711/tas_Amon_EC-Earth3_historica

The `cdscan` utility lets you create an xml file that makes your data look like it is in one file only, avoiding the trouble of opening each file as you need more time steps.
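
Without such a catalogue, the period covered by each file has to be worked out by hand, and the listing above is not in chronological order. The sketch below shows that manual bookkeeping for four of the file names listed above. It assumes the `YYYYMM-YYYYMM` suffix convention seen in those names; `month_span` is a hypothetical helper.

```python
import re

# File names taken from the listing above (directory part dropped).
names = [
    "tas_Amon_EC-Earth3_historical_r1i1p1f1_gr_198701-198712.nc",
    "tas_Amon_EC-Earth3_historical_r1i1p1f1_gr_198201-198212.nc",
    "tas_Amon_EC-Earth3_historical_r1i1p1f1_gr_185001-185012.nc",
    "tas_Amon_EC-Earth3_historical_r1i1p1f1_gr_193801-193812.nc",
]

# Assumption: every name ends with _YYYYMM-YYYYMM.nc
SPAN = re.compile(r"_(\d{4})(\d{2})-(\d{4})(\d{2})\.nc$")


def month_span(name):
    """Return (first, last) month of a file, counted as year * 12 + month."""
    y0, m0, y1, m1 = map(int, SPAN.search(name).groups())
    return y0 * 12 + m0, y1 * 12 + m1


ordered = sorted(names, key=month_span)
months = sum(last - first + 1 for first, last in map(month_span, ordered))

print(ordered[0])  # the 185001-185012 file comes first
print(months)      # 48 monthly steps across the four files
```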

In [22]:
!cdscan -x one_file.xml /global/cscratch1/sd/cmip6/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r1i1p1f1/Amon/tas/gr/v20190711/*.nc

Finding common directory ...
Common directory: /global/cscratch1/sd/cmip6/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r1i1p1f1/Amon/tas/gr/v20190711/
Scanning files ...
/global/cscratch1/sd/cmip6/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r1i1p1f1/Amon/tas/gr/v20190711/tas_Amon_EC-Earth3_historical_r1i1p1f1_gr_185001-185012.nc
Setting reference time units to days since 1850-01-01 00:00:00
/global/cscratch1/sd/cmip6/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r1i1p1f1/Amon/tas/gr/v20190711/tas_Amon_EC-Earth3_historical_r1i1p1f1_gr_185101-185112.nc
Setting reference time units to days since 1850-01-01 00:00:00
/global/cscratch1/sd/cmip6/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r1i1p1f1/Amon/tas/gr/v20190711/tas_Amon_EC-Earth3_historical_r1i1p1f1_gr_185201-185212.nc
Setting reference time units to days since 1850-01-01 00:00:00
/global/cscratch1/sd/cmip6/CMIP6/CMIP/EC-Earth-Consortium/EC-Earth3/historical/r1i1p1f1/Amon/tas/gr/v20190711/tas_Amon_EC-Earth

In [23]:
f = cdms2.open("one_file.xml")
tas = f["tas"]
print("Shape:", tas.shape)

Shape: (1980, 256, 512)


This concludes this tutorial. Please proceed to [Manipulating Variables](02_Manipulating_Variables.ipynb).