# CCMP Winds in a cloud-optimized format for Pangeo

The Cross-Calibrated Multi-Platform (CCMP) Ocean Surface Wind Vector Analyses is part of the NASA Making Earth System Data Records for Use in Research Environments (MEaSUREs) Program. MEaSUREs develops consistent global- and continental-scale Earth System Data Records by supporting projects that produce data using proven algorithms and input. If you use this data, please give [credit](https://podaac.jpl.nasa.gov/MEaSUREs-CCMP?sections=about). For more information, please review the [documentation](https://podaac-tools.jpl.nasa.gov/drive/files/allData/ccmp/L2.5/docs/ccmp_users_guide.pdf). Please note that this data is not recommended for trend calculations.

# Accessing cloud satellite data

- CCMP zarr conversion funding: Interagency Implementation and Advanced Concepts Team [IMPACT](https://earthdata.nasa.gov/esds/impact) for the Earth Science Data Systems (ESDS) program and AWS Public Dataset Program
  
### Credits: Tutorial development
* [Dr. Chelle Gentemann](mailto:gentemann@faralloninstitute.org) -  [Twitter](https://twitter.com/ChelleGentemann)   - Farallon Institute

### Zarr data format

 [Zarr](https://zarr.readthedocs.io/en/stable/)

### Data proximate computing
These are BIG datasets that you can analyze on the cloud without downloading the data. You can run this on your phone, a Raspberry Pi, laptop, or desktop.   
By using public cloud data, your science is reproducible and easily shared!

### To run this notebook

Code is in the cells that have <span style="color: blue;">In [  ]:</span> to the left of the cell and have a colored background

To run the code:
- option 1) click anywhere in the cell, then hold `shift` down and press `Enter`
- option 2) click on the Run button at the top of the page in the dashboard

Remember:
- to insert a new cell below press `Esc` then `b`
- to delete a cell press `Esc` then `dd`

### First start by importing libraries

In [None]:
#libs for reading data
import xarray as xr
import gcsfs
import glob
import numpy as np
import matplotlib.pyplot as plt
from xhistogram.xarray import histogram

#lib for dask gateway
from dask_gateway import Gateway
from dask.distributed import Client
from dask import delayed

### Start a cluster, a group of computers that will work together.

(A cluster is the key to big data analysis on the cloud.)

- This will set up a [dask kubernetes](https://docs.dask.org/en/latest/setup/kubernetes.html) cluster for your analysis and give you a path that you can paste into the top of the Dask dashboard to visualize parts of your cluster.  
- You don't need to paste the link below into the Dask dashboard for this to work, but it will help you visualize progress.
- Try 20 workers to start (during the tutorial), but you can increase the number later to speed things up. The cell below instead lets the cluster scale adaptively between 1 and 75 workers.

In [None]:
gateway = Gateway()
cluster = gateway.new_cluster()
cluster.adapt(minimum=1, maximum=75)
client = Client(cluster)
cluster

** ☝️ Don’t forget to click the link above or copy it to the Dask dashboard ![images.png](attachment:images.png) on the left to view the scheduler dashboard! **

### Initialize Dataset

Here we load the dataset from the zarr store. Note that this very large dataset (273 GB) initializes nearly instantly, and we can see the full list of variables and coordinates.

### Examine Metadata

For those unfamiliar with this dataset, the variable metadata is very helpful for understanding what the variables actually represent.
Printing the dataset will show you the dimensions, coordinates, and data variables with clickable icons at the end that show more metadata and size.

In [None]:
from intake import open_catalog

cat = open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/atmosphere.yaml")

ds = cat['nasa_ccmp_wind_vectors'].to_dask()

ds['wspd']=np.sqrt(ds.uwnd**2+ds.vwnd**2)  #calculate wind speed

ds
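
As a small illustration of the wind-speed line above, the sketch below uses plain NumPy arrays in place of `ds.uwnd` and `ds.vwnd` (the numbers are invented for the example, not CCMP values). It also shows something worth keeping in mind when averaging winds: the mean of the speeds is generally larger than the speed of the mean vector.

```python
import numpy as np

# Invented zonal (u) and meridional (v) wind components in m/s at four times
u = np.array([3.0, -3.0, 4.0, -4.0])
v = np.array([4.0, -4.0, 3.0, -3.0])

# Same formula as the notebook: speed = sqrt(u^2 + v^2)
wspd = np.sqrt(u**2 + v**2)

mean_of_speeds = wspd.mean()                          # average the scalar speed
speed_of_mean = np.sqrt(u.mean()**2 + v.mean()**2)    # speed of the averaged vector

print(wspd)
print(mean_of_speeds, speed_of_mean)
```

Here the directions cancel in the vector average, so the two numbers differ strongly. That is why `wspd` is computed from the components at every time step before any averaging.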

# Plot a global image of the data on 7/4/2020

``xarray`` makes plotting the data very easy.  A nice overview of plotting with xarray is [here](http://xarray.pydata.org/en/stable/plotting.html).  Details on [.plot](http://xarray.pydata.org/en/stable/generated/xarray.DataArray.plot.html#xarray.DataArray.plot)

In [None]:
day = ds.sel(time='2020-07-04T00')

day.nobs.plot()

## Make a land/ocean/ice mask to show where there is actually data

### Three different ways to mask the data
1. A daily mask that removes data over land and sea ice
- take the rolling maximum of the nobs (number of observations) variable over a 180-time-step window
- keep points where that maximum is > 0, so that land and recent sea ice are masked out
2. A land mask that removes all data that are over land or where there is 'permanent' sea ice
- sum over time for the nobs variable
- keep points where the sum is > 0
3. A climatology mask that removes all data that are over land or where there has ever been sea ice
- take the same rolling maximum of nobs, then sum it over time
- keep points where the sum exceeds a cutoff (125000 in the code below)

# Apply the mask 
- over land, CCMP is ERA5 data
- for many ocean applications a land / sea ice mask is needed
- below are some different mask options that use the CCMP data to generate a mask
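
The idea behind these masks can be sketched with a tiny NumPy array standing in for `nobs`. Everything here is invented for illustration (array shape, values, and the names `land_mask` and `obs_mask`); it is a minimal sketch of the counting logic, not the notebook's xarray implementation.

```python
import numpy as np

# Invented observation counts with shape (time, lat, lon) = (4, 2, 3)
nobs = np.zeros((4, 2, 3))
nobs[:, 0, :] = 2.0        # open ocean: observed at every time step
nobs[:2, 1, 1] = 1.0       # seasonally ice-covered: observed at the first two steps only
# nobs[:, 1, 0] and nobs[:, 1, 2] stay 0: never observed (land)

wspd = np.full(nobs.shape, 7.0)   # placeholder wind speeds

# "land"-style mask: keep any cell that was observed at least once
land_mask = nobs.sum(axis=0) > 0

# per-time mask: keep a cell only at the times it was actually observed
obs_mask = nobs > 0

masked_land = np.where(land_mask, wspd, np.nan)   # the 2-D mask broadcasts over time
masked_obs = np.where(obs_mask, wspd, np.nan)
```

The land-style mask keeps the seasonally covered cell at all times, while the per-time mask drops it once the observations stop.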


In [None]:
def mask_data(ds, mask_type):
    """Mask ds using the nobs variable; returns (masked dataset, boolean mask)."""
    if mask_type == 'daily':  # daily mask removes sea ice and land
        # rolling window of 180 time steps (45 days at 4 per day)
        mask_obs = ds.nobs.rolling(time=180, center=True).max()
        cutoff = 0
    elif mask_type == 'land':  # land mask only (includes data over sea ice)
        mask_obs = ds.nobs.sum({'time'}, keep_attrs=True)  # this will give you a LAND mask
        cutoff = 0
    elif mask_type == 'climatology':  # climatology mask removes max sea ice extent and land
        # same 180-time-step rolling window, then summed over the whole record
        mask_obs = ds.nobs.rolling(time=180, center=True).max()
        mask_obs = mask_obs.sum({'time'}, keep_attrs=True)
        cutoff = 125000
    else:
        raise ValueError("mask_type must be 'daily', 'land', or 'climatology'")
    dy_mask = mask_obs > cutoff
    dy_mask = dy_mask.compute()  # computing the mask speeds up subsequent operations
    masked = ds.where(dy_mask)
    return masked, dy_mask
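
One detail of the rolling maximum is easy to miss. To my understanding, xarray's rolling reductions return NaN by default wherever the window is incomplete, and `NaN > cutoff` is False, so the first and last time steps of a rolling-window mask come out fully masked. The NumPy sketch below reproduces that behaviour with a window of 3 instead of 180; the data and variable names are invented for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Invented observation counts at one grid cell over 10 time steps
nobs = np.array([0, 0, 1, 0, 0, 0, 0, 2, 0, 0], dtype=float)
window = 3   # the notebook uses 180; a small window keeps the example readable

# Maximum over every full window of length `window`
full_windows = sliding_window_view(nobs, window).max(axis=1)

# Centre the result and leave NaN at the edges, where no full window exists
rolled = np.full(nobs.shape, np.nan)
half = window // 2
rolled[half:half + full_windows.size] = full_windows

mask = rolled > 0   # comparisons with NaN are False, so the edges are masked out
print(rolled)
print(mask)
```

If the edge time steps matter for an analysis, it may be worth checking how many of them the mask removes.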

# Plot what the different masks look like
- This next cell block will take a while as the masks are computed.

In [None]:
%%time
subset=ds.isel(time=slice(500,3500))
masked1,dy_mask = mask_data(subset,'daily')
masked2,land_mask = mask_data(subset,'land')
masked3,clim_mask = mask_data(ds,'climatology')
fig, ax = plt.subplots(1,3, figsize=(18,6))
masked1.wspd.isel(time=500).plot(ax=ax[0])
masked2.wspd.isel(time=500).plot(ax=ax[1])
masked3.wspd.isel(time=1000).plot(ax=ax[2])

In [None]:
masked1

# For this we will use the climatology mask

In [None]:
# decide which mask to use: 'daily' (land/ice), 'land', or 'climatology'
masked,mask_obs = mask_data(ds,'climatology')

In [None]:
mask_obs.plot()

In [None]:
fig, ax = plt.subplots(1,3, figsize=(18,6))
masked.wspd[100,:,:].plot(ax=ax[0])
masked.wspd[-100,:,:].plot(ax=ax[1])
masked.wspd[5000,:,:].plot(ax=ax[2])

# create a weighted global mean function

In [None]:
# from http://gallery.pangeo.io/repos/pangeo-gallery/cmip6/global_mean_surface_temp.html
def global_mean(ds):
    lat = ds.latitude
    weight = np.cos(np.deg2rad(lat))
    weight /= weight.mean()
    other_dims = set(ds.dims) - {'time'}
    return (ds * weight).mean(other_dims)
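
To see what the cosine-of-latitude weighting does, here is a minimal NumPy sketch on an invented 3 x 2 grid. Grid cells shrink toward the poles on a regular latitude/longitude grid, so high-latitude values should count for less than equatorial ones. The latitudes and field values are made up for the example.

```python
import numpy as np

lat = np.array([-60.0, 0.0, 60.0])          # invented latitudes in degrees
field = np.array([[10.0, 10.0],             # values on a (lat, lon) grid
                  [4.0, 4.0],
                  [10.0, 10.0]])

weight = np.cos(np.deg2rad(lat))   # cell area shrinks as cos(latitude)
weight /= weight.mean()            # normalise so the weights average to 1

plain_mean = field.mean()
weighted_mean = (field * weight[:, None]).mean()

print(plain_mean, weighted_mean)
```

The unweighted mean over-counts the two high-latitude rows; the weighted mean pulls the result toward the equatorial value.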

# calculate the global mean
- I wish I didn't have to have these loops.  Programmatically, it would be much cleaner to just do: 
```python
#glb_mn = global_mean(masked)
#glb_mn = glb_mn.compute()
#print(glb_mn) 
```
- but this code doesn't run, it kills my kernel (memory?) every time I try
- for some reason if I run it year by year it runs fine.
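
A side note on the year-by-year workaround: averaging annual means gives every year the same weight, while a mean over all time steps gives leap years slightly more. For this record the difference is probably small, but the exact overall mean can still be recovered chunk by chunk by keeping a sum and a count per chunk. The sketch below uses invented NumPy arrays in place of the annual subsets.

```python
import numpy as np

# Invented "years" of different length standing in for the annual subsets
chunks = [np.array([8.0, 9.0]), np.array([7.0, 8.0, 9.0, 12.0])]

# Reduce each chunk on its own, keeping only a sum and a count
sums = [c.sum() for c in chunks]
counts = [c.size for c in chunks]

exact_mean = sum(sums) / sum(counts)                  # same as the mean over all samples
mean_of_means = np.mean([c.mean() for c in chunks])   # what averaging annual means gives

print(exact_mean, mean_of_means)
```

Only two numbers per chunk need to stay in memory, so this pattern scales to as many chunks as needed.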


In [None]:
m,x=[],[]
for lyr in range(1988,2020):
    subset = masked.sel(time=str(lyr))
    m1 = global_mean(subset)
    m1 = m1.mean()
    m1_computed = m1.compute()
    m.append(m1_computed)
    x.append(lyr)
    print(lyr)
mn_yr = xr.concat(m, dim='time')
mn_yr['time']=np.arange(1988,2020)
glb_mn = np.mean(mn_yr)
print(glb_mn)

# Results
- glb_mn = 1988 - 2019, 32 years
- nobs     1.296
- uwnd     -0.4763
- vwnd     0.2749
- wspd     8.558

In [None]:
plt.rcParams['figure.figsize'] = (12,6)
mn_yr.wspd.plot()
#plt.legend(fontsize=8)
plt.xlim(1988,2020)
#plt.ylim()
plt.ylabel('CCMPv2 Wind Speed (m s$^{-1}$)',fontsize=18)
plt.xlabel('Year',fontsize=18)
#plt.text(10,0.011,'CCMPv2 1988-2019 ',fontsize=18)
plt.text(2005,8.5,'Global mean = 8.6 m s$^{-1}$',fontsize=16)
#plt.text(10,0.009,'67% of winds are > 6 m s$^{-1}$',fontsize=16)
plt.savefig('./../../figures/ccmp_ts_mean.png')

# global Histogram figure

In [None]:
bins = np.arange(0,30,.1)
h,x=[],[]
for lyr in range(1988,2020):
    subset = masked.wspd.sel(time=str(lyr))
    h1 = histogram(subset, bins=[bins])
    h1 = h1.compute()
    print('start',lyr)
    h.append(h1)
    x.append(lyr)
    hh = xr.concat(h, dim='time')
    hh.to_netcdf('./../../data/ccmp/ccmp_annual_hist_20210507a.nc')
    print('end',lyr)

In [None]:
hh=xr.open_dataset('./../../data/ccmp/ccmp_annual_hist_20210507.nc')
hh1=xr.open_dataset('./../../data/ccmp/ccmp_annual_hist_20210507a.nc')
#hh1.assign_coords['time']=hh1.time+27
hh=xr.concat([hh,hh1],dim='time')
hh['time']=np.arange(1988,2020)
hh.to_netcdf('./../../data/ccmp/ccmp_annual_hist_20210507_final.nc')

In [None]:
hh = xr.open_dataset('./../../data/ccmp/ccmp_annual_hist_20210507_final.nc')
hhall = hh.histogram_wspd.sum('time')
hhall

In [None]:
yr = hh.histogram_wspd[0,:].load()

In [None]:
yr.plot()

In [None]:
# bins are 0.1 m/s wide, so bin i covers [0.1*i, 0.1*(i+1)) m/s
print('fraction of winds < 2 m/s',hhall[0:20].sum()/hhall.sum())
print('fraction of winds < 6 m/s',hhall[0:60].sum()/hhall.sum())
print('fraction of winds >= 6 m/s',hhall[60:].sum()/hhall.sum())
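
Because the histogram uses 0.1 m/s bins starting at 0, a wind-speed threshold maps directly to a bin index. The sketch below checks that mapping with `np.histogram` on synthetic data; the Weibull draw is only a convenient stand-in for wind speeds, and the helper name `n_bins_below` is hypothetical.

```python
import numpy as np

width = 0.1
bins = np.arange(0, 30, width)   # same bin edges as the notebook
rng = np.random.default_rng(0)
wspd = 9.0 * rng.weibull(2.0, size=10_000)   # invented wind speeds in m/s

counts, edges = np.histogram(wspd, bins=bins)

def n_bins_below(threshold):
    """Number of bins that lie entirely below `threshold` (bin i is [edges[i], edges[i+1]))."""
    return int(round(threshold / width))

i2, i6 = n_bins_below(2.0), n_bins_below(6.0)
frac_below_6 = counts[:i6].sum() / counts.sum()
frac_from_6 = counts[i6:].sum() / counts.sum()

print(i2, i6, frac_below_6, frac_from_6)
```

Slicing `counts[:20]` therefore covers speeds below 2 m/s and `counts[:60]` covers speeds below 6 m/s; values at or above the last bin edge fall outside the histogram and are not counted in either fraction.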

In [None]:
hh2=hh
x=hh.time
plt.rcParams['figure.figsize'] = (8,8)
for iyr in range(32):
    plt.plot(hh.wspd_bin,hh2.histogram_wspd[iyr,:]/hh2.histogram_wspd[iyr,:].sum(),label=str(x[iyr].data))
plt.legend(fontsize=8)
plt.xlim(-0,32)
plt.ylim(0,.013)
plt.xlabel('CCMP Wind Speed (m s$^{-1}$)',fontsize=18)
plt.ylabel('PDF (s m$^{-1}$)',fontsize=18)
plt.text(11,0.011,'CCMPv2 1988-2019 ',fontsize=18)
plt.text(11,0.010,'Global mean = 8.6 m s$^{-1}$',fontsize=16)
plt.text(11,0.009,'68% of winds are > 6 m s$^{-1}$',fontsize=16)
plt.savefig('./../../figures/ccmp_annual_hist.png')

In [None]:
plt.rcParams['figure.figsize'] = (8,8)
hhall = hh2.sum('time')
plt.plot(hh.wspd_bin,hhall.histogram_wspd/hhall.histogram_wspd.sum(),linewidth=5)
plt.xlim(-0,30)
plt.ylim(0,.012)
plt.xlabel('CCMP Wind Speed (m s$^{-1}$)',fontsize=18)
plt.ylabel('PDF (s m$^{-1}$)',fontsize=18)
plt.text(10,0.011,'CCMPv2 1988-2019 ',fontsize=18)
plt.text(10,0.010,'Global mean = 8.6 m s$^{-1}$',fontsize=18)
plt.text(10,0.009,'68% of winds are > 6 m s$^{-1}$',fontsize=18)
plt.savefig('./../../figures/ccmp_all_hist2.png')

# maps of wind speed distributions for c.donlon

In [None]:
%%time
# calc % winds
#a spatial map showing a climatology of roughness.  
#Ideally in 3 panels - (a) at Hs=<2, (b) at Hs=mean wind speed (c) Hs> 10 
wnd = ds.wspd.where(ds.wspd<=2)
f2 = (wnd/wnd).sum({'time'})/len(wnd.time)*100  # percent less than or equal to 2 m/s
wnd = ds.wspd.where((ds.wspd>=8)&(ds.wspd<=9))
f8 = (wnd/wnd).sum({'time'})/len(wnd.time)*100  # percent 8-9 m/s
wnd = ds.wspd.where(ds.wspd>10)
f10 = (wnd/wnd).sum({'time'})/len(wnd.time)*100  # percent > 10 m/s
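
Each of these maps is the percentage of time steps at which the wind speed meets a condition. An equivalent way to write that, sketched here with an invented NumPy array instead of `ds.wspd`, is to take the time mean of the boolean comparison. One difference from the `wnd/wnd` trick: a wind speed of exactly 0 gives 0/0 = NaN there and is not counted, whereas the boolean form counts it.

```python
import numpy as np

# Invented wind speeds with shape (time, lat, lon) = (4, 1, 2)
wspd = np.array([[[1.0, 8.5]],
                 [[0.0, 12.0]],
                 [[3.0, 8.0]],
                 [[1.5, 11.0]]])

# A comparison yields True/False; its mean over time is the fraction of time steps
pct_le2 = (wspd <= 2).mean(axis=0) * 100
pct_8to9 = ((wspd >= 8) & (wspd <= 9)).mean(axis=0) * 100
pct_gt10 = (wspd > 10).mean(axis=0) * 100

print(pct_le2, pct_8to9, pct_gt10)
```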

In [None]:
%%time
f2 = f2.compute()
f8 = f8.compute()
f10 = f10.compute()
ff = xr.concat([f2,f8,f10],dim='frac')

In [None]:
import cartopy.crs as ccrs  # needed for the map projections below
plt.rcParams['figure.figsize'] = (15.0,8.0)
plt.rcParams.update({'font.size': 16})
fg = ff.plot(aspect=1, size=10, vmin=0, vmax=100,
    col="frac",
    transform=ccrs.PlateCarree(),  # remember to provide this!
    subplot_kws={
        "projection": ccrs.PlateCarree()
    },
    cbar_kwargs={"label":'Percent',"orientation": "horizontal", "shrink": 0.8, "aspect": 40},
    robust=True,
)
tstr = ['< 2 m/s','8-9 m/s','> 10 m/s']
for i, ax in enumerate(fg.axes.flat):
    ax.set_title(tstr[i]) 
fg.map(lambda: plt.gca().coastlines())
fig_fname = '../../figures/map_global_wind_distributions.png'
plt.savefig(fig_fname, transparent=False, format='png')

In [None]:
ff2 = ff.where(mask_obs>0)
fg = ff2.plot(aspect=1, size=10, vmin=0, vmax=100,
    col="frac",
    transform=ccrs.PlateCarree(),  # remember to provide this!
    subplot_kws={
        "projection": ccrs.PlateCarree()
    },
    cbar_kwargs={"label":'Percent',"orientation": "horizontal", "shrink": 0.8, "aspect": 40},
    robust=True,
)
tstr = ['< 2 m/s','8-9 m/s','> 10 m/s']
for i, ax in enumerate(fg.axes.flat):
    ax.set_title(tstr[i]) 
fg.map(lambda: plt.gca().coastlines())
fig_fname = '../../figures/map_ocean_wind_distributions.png'
plt.savefig(fig_fname, transparent=False, format='png')

In [None]:
import cartopy.crs as ccrs
plt.rcParams['figure.figsize'] = (18.0,5.0)
plt.rcParams.update({'font.size': 16})
ax = plt.subplot(131,projection=ccrs.PlateCarree())
cs=f2.plot(ax=ax,vmin=0,vmax=100,cbar_kwargs={'shrink':.35,'label': 'Wind < 2 m/s'})
ax.coastlines()
ax = plt.subplot(132,projection=ccrs.PlateCarree())
cs=f8.plot(ax=ax,vmin=0,vmax=100,cbar_kwargs={'shrink':.35,'label': 'Wind 8-9 m/s'})
ax.coastlines()
ax = plt.subplot(133,projection=ccrs.PlateCarree())
cs=f10.plot(ax=ax,vmin=0,vmax=100,cbar_kwargs={'shrink':.35,'label': 'Wind > 10 m/s'})
ax.coastlines()


In [None]:
# calculate weibull distributions TESTING STILL
ds

In [None]:
#test out weibull at one point with data and without data
import scipy.stats as stats
data = ds.wspd[:,0,400].load()
params = stats.exponweib.fit(data, floc=0, f0=1)
shape = params[1]
scale = params[3]
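
With `f0=1` the first shape parameter of `exponweib` is held at 1, which together with `floc=0` reduces it to the ordinary two-parameter Weibull distribution, so `shape` and `scale` above are the usual Weibull parameters. The sketch below shows what the two parameters control, using synthetic samples and the closed-form mean and exceedance probability. The parameter values are invented, not fitted CCMP values.

```python
import math
import numpy as np

shape, scale = 2.0, 9.0            # invented parameters, not fitted CCMP values
rng = np.random.default_rng(42)
sample = scale * rng.weibull(shape, size=200_000)   # synthetic "wind speeds" in m/s

# Analytic mean of a two-parameter Weibull: scale * Gamma(1 + 1/shape)
analytic_mean = scale * math.gamma(1.0 + 1.0 / shape)

# Analytic probability of exceeding a threshold: exp(-(x/scale)**shape)
threshold = 6.0
analytic_exceed = math.exp(-(threshold / scale) ** shape)

print(sample.mean(), analytic_mean)
print((sample > threshold).mean(), analytic_exceed)
```

Once the two parameters are known at a grid point, quantities such as the mean wind speed or the fraction of winds above a threshold follow from these formulas without going back to the full time series.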

In [None]:
values,bins,hist = plt.hist(data,bins=51,range=(0,25),density=True)
center = (bins[:-1] + bins[1:]) / 2.
# Using all params and the stats function
params = stats.exponweib.fit(data, floc=0, f0=1)
plt.plot(center,stats.exponweib.pdf(center,*params),lw=4,label='scipy exp')

In [None]:
params = stats.exponweib.fit(ds.wspd, floc=0, f0=1)

In [None]:
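# params is a plain tuple (a, c, loc, scale), so wrap it in a Dataset before saving
xr.Dataset({'shape': params[1], 'scale': params[3]}).to_netcdf('./../../data/weib.nc')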

In [None]:
params

In [None]:
#adapted from https://gist.github.com/luke-gregor/4bb5c483b2d111e52413b260311fbe43
import pandas as pd

def dataset_encoding(xds):
    """Tabulate selected encoding settings for every data variable in xds."""
    cols = ['source', 'original_shape', 'dtype', 'zlib', 'complevel', 'chunksizes']
    info = pd.DataFrame(columns=cols, index=list(xds.data_vars))
    for row in info.index:
        var_encoding = xds[row].encoding
        for col in info.columns:
            # .ix was removed from pandas; .at sets a single cell by label
            info.at[row, col] = var_encoding.get(col, '')
    return info


def xarray_trend(xarr):
    """Fit a Weibull distribution at each grid point of a (time, lat, lon) DataArray.

    The name is kept from the linear-trend gist this was adapted from.
    """
    from scipy import stats
    import numpy as np

    # getting shapes
    n = xarr.shape[0]                      # number of time steps
    m = int(np.prod(xarr.shape[1:]))       # number of grid points

    # one column per grid point, one row per time step
    y = xarr.values.reshape(n, -1)

    # ###################### #
    # WEIBULL FIT DONE BELOW #
    # exponweib.fit returns a single parameter set per call, so fit point by point
    shape = np.full(m, np.nan)
    scale = np.full(m, np.nan)
    for i in range(m):
        col = y[:, i]
        col = col[np.isfinite(col)]        # drop masked / missing values
        if col.size > 0:
            params = stats.exponweib.fit(col, floc=0, f0=1)
            shape[i] = params[1]
            scale[i] = params[3]

    # preparing outputs
    out = xarr[:2].mean('time')
    # first create variable for the shape parameter and adjust meta
    xarr_shape = out.copy()
    xarr_shape.attrs['units'] = 'none'
    xarr_shape.values = shape.reshape(xarr.shape[1:])
    # do the same for the scale parameter
    xarr_scale = out.copy()
    xarr_scale.attrs['info'] = "none"
    xarr_scale.values = scale.reshape(xarr.shape[1:])
    # join these variables
    xarr_out = xarr_shape.to_dataset(name='shape')
    xarr_out['scale'] = xarr_scale

    return xarr_out

In [None]:
sst_slope2=[]
for inc in range(0,1):
    mlon=inc*5
    mlon2 = (inc+1)*5-1
    subset = ds.wspd.sel(longitude=slice(mlon,mlon2),latitude=slice(-78,-68)).load()
    sst_slope = xarray_trend(subset)
    sst_slope2.append(sst_slope)

In [None]:
# debugging cell: this was an inline copy of the body of xarray_trend
# (a bare `return` is not valid outside a function), so call the function instead
xarr_out = xarray_trend(subset)
xarr_out

In [None]:
sst_slope2=[]
for inc in range(0,35):
    mlon=inc*10
    mlon2 = (inc+1)*10-1
    subset = ds.wspd.sel(longitude=slice(mlon,mlon2))
    sst_slope = xarray_trend(subset)
    sst_slope2.append(sst_slope)

In [None]:
sst_slope

In [None]:
cluster.close()