<left><img src="https://github.com/pandas-dev/pandas/raw/main/web/pandas/static/img/pandas.svg" alt="pandas Logo" style="width: 200px;"/></left>
<right><img src="https://matplotlib.org/stable/_images/sphx_glr_logos2_003.png" style="width: 200px;"/></right>

# Pandas and Matplotlib - EMODNET
---

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

import datetime
from pathlib import Path


## Load ERDDAP data

[ERDDAP](https://coastwatch.pfeg.noaa.gov/erddapinfo/) is a data server that gives you a simple, consistent way to download data in the format and the spatial and temporal coverage that you want. ERDDAP is a web application with an interface for people to use. It is also a RESTful web service that allows data access directly from any computer program (e.g. Matlab, R, or webpages).

This notebook uses the Python client [erddapy](https://pyoceans.github.io/erddapy) to help construct the RESTful URLs and translate the responses into pandas and xarray objects.

A typical ERDDAP RESTful URL looks like:

[https://data.ioos.us/gliders/erddap/tabledap/whoi_406-20160902T1700.mat?depth,latitude,longitude,salinity,temperature,time&time>=2016-07-10T00:00:00Z&time<=2017-02-10T00:00:00Z&latitude>=38.0&latitude<=41.0&longitude>=-72.0&longitude<=-69.0](https://data.ioos.us/gliders/erddap/tabledap/whoi_406-20160902T1700.mat?depth,latitude,longitude,salinity,temperature,time&time>=2016-07-10T00:00:00Z&time<=2017-02-10T00:00:00Z&latitude>=38.0&latitude<=41.0&longitude>=-72.0&longitude<=-69.0)

Let's break it down to smaller parts:

- **server**: https://data.ioos.us/gliders/erddap/
- **protocol**: tabledap
- **dataset_id**: whoi_406-20160902T1700
- **response**: .mat
- **variables**: depth,latitude,longitude,salinity,temperature,time
- **constraints**:
    - time>=2016-07-10T00:00:00Z
    - time<=2017-02-10T00:00:00Z
    - latitude>=38.0
    - latitude<=41.0
    - longitude>=-72.0
    - longitude<=-69.0
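
The breakdown above can be sketched in plain Python. This is only an illustration of how the pieces fit together: `build_tabledap_url` is a hypothetical helper written for this notebook, not part of erddapy or ERDDAP, and it does no URL encoding or validation.

```python
# Hypothetical helper: join the pieces of a tabledap request into one URL.
def build_tabledap_url(server, protocol, dataset_id, response, variables, constraints):
    base = f"{server}/{protocol}/{dataset_id}.{response}"
    query = ",".join(variables)              # comma-separated variable names
    for key, value in constraints.items():   # each constraint is appended with '&'
        query += f"&{key}{value}"
    return f"{base}?{query}"

url = build_tabledap_url(
    server="https://data.ioos.us/gliders/erddap",
    protocol="tabledap",
    dataset_id="whoi_406-20160902T1700",
    response="mat",
    variables=["depth", "latitude", "longitude", "salinity", "temperature", "time"],
    constraints={
        "time>=": "2016-07-10T00:00:00Z",
        "time<=": "2017-02-10T00:00:00Z",
        "latitude>=": 38.0,
        "latitude<=": 41.0,
        "longitude>=": -72.0,
        "longitude<=": -69.0,
    },
)
print(url)
```

In practice erddapy builds and encodes these URLs for you; the sketch only makes the structure of the request visible.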

### EMODNET:  
https://emodnet.ec.europa.eu/en/emodnet-web-service-documentation#non-ogc-web-services

erddap EMODNET physics:  
https://prod-erddap.emodnet-physics.eu/erddap/index.html  
https://prod-erddap.emodnet-physics.eu/erddap/tabledap/documentation.html  

### erddapy  
https://github.com/ioos/erddapy

>pip install erddapy

In [None]:
from erddapy import ERDDAP
from erddapy.core.url import urlopen

In [None]:
# ERDDAP server for EMODnet Physics
# (another public ERDDAP server, e.g. 'https://coastwatch.pfeg.noaa.gov/erddap',
# could be used in the same way)
server = 'https://prod-erddap.emodnet-physics.eu/erddap'
protocol = 'tabledap'
emodnet = ERDDAP(server=server, protocol=protocol)

In [None]:
min_time = '2010-01-01T00:00:00Z'
max_time = '2020-12-31T23:00:00Z'
min_lon, max_lon = -17, -15
min_lat, max_lat = 44.1, 44.5

In [None]:
# Search the server for datasets that overlap the space/time box defined above
kw = {
    'min_lon': min_lon, 'max_lon': max_lon,
    'min_lat': min_lat, 'max_lat': max_lat,
    'min_time': min_time, 'max_time': max_time,
}

search_url = emodnet.get_search_url(response='csv', **kw)
search_df = pd.read_csv(urlopen(search_url))
# Keep only the columns needed to pick a dataset
search_df = search_df[['Institution', 'Dataset ID', 'tabledap']]
search_df

In [None]:
dataset_id = 'GLODAPv2_2021'
emodnet.dataset_id = dataset_id
emodnet.response = "csv"
emodnet.constraints = {
#     "time>=": min_time,
#     "time<=": max_time,
    "latitude>=": min_lat,
    "latitude<=": max_lat,
    "longitude>=": min_lon,
    "longitude<=": max_lon,
}
emodnet.variables = ["longitude", "latitude", "time",
    "G2temperature", "G2salinity", "G2pressure"
]

df = emodnet.to_pandas()

In [None]:
df

---

## The pandas [`DataFrame`](https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe)...
...is a **labeled**, two-dimensional columnar structure, similar to a table, spreadsheet, or the R `data.frame`.

![dataframe schematic](https://github.com/pandas-dev/pandas/raw/main/doc/source/_static/schemas/01_table_dataframe.svg "Schematic of a pandas DataFrame")

The `columns` that make up our `DataFrame` can be lists, dictionaries, NumPy arrays, pandas `Series`, or many other data types not mentioned here. Within these `columns`, you can have data values of many different data types used in Python and NumPy, including text, numbers, and dates/times. The first column of a `DataFrame`, shown in the image above in dark gray, is uniquely referred to as an `index`; this column contains information characterizing each row of our `DataFrame`. Similar to any other `column`, the `index` can label rows by text, numbers, datetime objects, and many other data types. Datetime objects are a quite popular way to label rows.
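
As a small illustration of the description above, the sketch below builds a `DataFrame` from a dictionary of columns and labels its rows with datetimes. The column names and values are made up for this example and do not come from the datasets used in this notebook.

```python
import pandas as pd

# Hypothetical values, chosen only to show mixed column types
index = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"])
frame = pd.DataFrame(
    {
        "temperature": [12.1, 12.4, 11.9],  # floating-point column
        "station": ["A", "A", "B"],         # text column
    },
    index=index,                            # datetime labels for the rows
)

print(frame)
print(frame.index)   # a DatetimeIndex
print(frame.dtypes)  # one data type per column
```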

For our first example using pandas DataFrames, we use the `GLODAPv2_2021` data retrieved above from the EMODnet Physics ERDDAP server in comma-separated value (`.csv`) format and stored in `df`.

In [None]:
df

In [None]:
# Set index
df.set_index(pd.to_datetime(df['time (UTC)']), inplace=True)

In [None]:
df

In [None]:
df.index[0]

### Read file

In [None]:
# Notebooks do not define __file__, so build the data path from the working directory
dir_data = Path.cwd() / 'data'

fnd = dir_data / 'GLODAPv2.2021.csv'
df2 = pd.read_csv(fnd)
df2

The `DataFrame` index, as described above, contains information characterizing rows; each row has a unique ID value, which is displayed in the index column.  By default, the IDs for rows in a `DataFrame` are represented as sequential integers, which start at 0.

In [None]:
df2.index

At the moment, the index column of our `DataFrame` is not very helpful for humans. However, Pandas has clever ways to make index columns more human-readable. The next example demonstrates how to use optional keyword arguments to convert `DataFrame` index IDs to a human-friendly datetime format.
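
As a self-contained illustration of such keyword arguments, the sketch below reads a tiny in-memory CSV twice: once with the defaults and once with `parse_dates` and `index_col`. The CSV text is invented for this example; the real file used in the following cells stores its date parts in separate columns and needs more work.

```python
import io
import pandas as pd

# Hypothetical CSV content with a single, well-formed time column
csv_text = """time,temperature
2020-01-01 00:00,12.1
2020-01-01 06:00,12.4
2020-01-01 12:00,11.9
"""

# Default behaviour: rows are labelled 0, 1, 2, ...
default = pd.read_csv(io.StringIO(csv_text))

# parse_dates converts the column to datetimes; index_col makes it the row index
dated = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"], index_col="time")

print(default.index)
print(dated.index)
```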

In [None]:
# Read the file first, then assemble a datetime column from its date/time parts.
# Combining the columns through `parse_dates` did not produce dates here, and that
# option is deprecated since pandas 2.2.
df2 = pd.read_csv(fnd)

# pd.to_datetime accepts a DataFrame whose columns are named year, month, day, ...
# so strip the 'G2' prefix from the relevant columns
parts = df2[['G2year', 'G2month', 'G2day', 'G2hour', 'G2minute']].rename(
    columns=lambda name: name[2:])
# errors='coerce' turns rows with invalid or missing parts into NaT
df2['time'] = pd.to_datetime(parts, errors='coerce')
df2

In [None]:
df2.drop('time', axis=1)

Each of our data rows is now helpfully labeled by a datetime-object-like index value; this means that we can now easily identify data values not only by named columns, but also by date labels on rows. This is a sneak preview of the `DatetimeIndex` functionality of Pandas; this functionality enables a large portion of Pandas' timeseries-related usage. Don't worry; `DatetimeIndex` will be discussed in full detail later on this page. In the meantime, let's look at the columns of data read in from the `.csv` file:

In [None]:
df.columns

## The pandas [`Series`](https://pandas.pydata.org/docs/user_guide/dsintro.html#series)...

...is essentially any one of the columns of our `DataFrame`. A `Series` also includes the index column from the source `DataFrame`, in order to provide a label for each value in the `Series`.

![pandas Series](https://github.com/pandas-dev/pandas/raw/main/doc/source/_static/schemas/01_table_series.svg "Schematic of a pandas Series")

The pandas `Series` is a fast and capable 1-dimensional array of nearly any data type we could want, and it can behave very similarly to a NumPy `ndarray` or a Python `dict`. You can take a look at any of the `Series` that make up your `DataFrame`, either by using its column name and the Python `dict` notation, or by using dot-shorthand with the column name:

### Get column information  

df['name']  
df.name 

If the column name is a number:  
df[145]  
df.145 is not valid!

In [None]:
df['G2temperature']

<div class="alert alert-block alert-info">
<b>Tip:</b> You can also use the dot notation illustrated below to specify a column name, but this syntax is mostly provided for convenience. For the most part, this notation is interchangeable with the dictionary notation; however, if the column name is not a valid Python identifier (e.g., it starts with a number or space), you cannot use dot notation.</div>
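
The two notations can be compared on a small, made-up `DataFrame`. Only the column name `G2temperature` is taken from this notebook; the other names and all values are hypothetical and exist only to show names that cannot be used with dot notation.

```python
import pandas as pd

frame = pd.DataFrame({
    "G2temperature": [10.5, 11.0],  # valid Python identifier
    "2m depth": [1, 2],             # starts with a digit and contains a space
    145: [0.1, 0.2],                # numeric column name
})

by_key = frame["G2temperature"]   # dictionary notation
by_attr = frame.G2temperature     # dot notation, same Series
print(by_key.equals(by_attr))

# These columns are only reachable with dictionary notation
print(frame["2m depth"])
print(frame[145])
```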

In [None]:
df.G2temperature

In [None]:
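# Whitespace-delimited wave data without a header row; sep=r'\s+' is equivalent to
# the deprecated delim_whitespace=True.
# Columns 0-3 (YY, mm, DD, time) are combined into a single datetime index.
# Note: this nested parse_dates form is deprecated since pandas 2.2.
df = pd.read_table('data/data_waves.dat', header=None, sep=r'\s+',
                   names=['YY', 'mm', 'DD', 'time', 'hs', 'tm', 'tp', 'dirm', 'dp', 'spr', 'h', 'lm', 'lp',
                          'uw', 'vw'],
                   parse_dates=[[0, 1, 2, 3]], index_col=0)
df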

In [None]:
df.describe()

In [None]:
df.hs[:10000].plot()

In [None]:
df.max()

In [None]:
df.sort_values('hs', ascending=False)

## Resampling, Shifting, and Windowing

In [None]:
df['hs']

In [None]:
df.hs[:100].plot()

In [None]:
# 12-hour rolling mean; lowercase 'h' avoids the deprecation warning for 'H' in pandas 2.2+
df.rolling('12h').mean().hs[:100].plot()

In [None]:
dfi = df.iloc[:500]

In [None]:
# Compare the raw series with daily means (resample) and daily samples (asfreq)
dfi.hs.plot(alpha=0.5, style='-.', marker='o', markersize=1)
dfi.hs.resample('24h').mean().plot(style=':', linewidth=2)
dfi.hs.asfreq('24h').plot(style='--')
plt.legend(['input', 'resample', 'asfreq'],
           loc='upper left');

In [None]:
#df.hs.plot(alpha=0.5, style='-.', marker='o', markersize=1)
df_res = df.hs.resample('A').mean()
df_res.plot(style=':', linewidth=2)

In [None]:
df.hs.asfreq('Y').plot()

Notice the difference: at each point, ``resample`` reports the *average of the previous year*, while ``asfreq`` reports the *value at the end of the year*.
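
The same contrast can be checked on a tiny synthetic series. The sketch below uses hourly values down-sampled to daily (`'D'`) rather than yearly data, purely to keep the numbers easy to verify; the values are invented. With `'D'`, each label marks the start of a day, so `resample` averages that day while `asfreq` picks the single value at that timestamp.

```python
import numpy as np
import pandas as pd

# Two days of hourly values 0, 1, ..., 47 (synthetic data)
index = pd.date_range("2020-01-01", periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.Series(np.arange(48.0), index=index)

daily_mean = hourly.resample("D").mean()  # aggregates all values in each day
daily_pick = hourly.asfreq("D")           # selects the value at each daily timestamp

print(daily_mean.tolist())  # [11.5, 35.5]
print(daily_pick.tolist())  # [0.0, 24.0]
```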

For up-sampling, ``resample()`` and ``asfreq()`` are largely equivalent, though ``resample()`` has many more options available.
In this case, the default for both methods is to leave the up-sampled points empty, that is, filled with NA values.
Like the ``fillna()`` method of a ``DataFrame``, ``asfreq()`` accepts a ``method`` argument to specify how values are imputed.
The annual-maximum cells further down use down-sampling instead: ``resample('A').max()`` and a ``groupby()`` on the index year give the same values, labelled differently.
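
A minimal sketch of up-sampling, assuming a made-up series with one value every two days that is up-sampled to daily frequency. It shows the default NA gaps and forward filling through both `asfreq()` and `resample()`.

```python
import pandas as pd

# Synthetic series: one value every second day
index = pd.date_range("2020-01-01", periods=3, freq="2D")
sparse = pd.Series([1.0, 3.0, 5.0], index=index)

empty = sparse.asfreq("D")                   # new days are left as NaN
filled = sparse.asfreq("D", method="ffill")  # new days copy the previous value
resampled = sparse.resample("D").ffill()     # same result through resample()

print(empty)
print(filled.tolist())  # [1.0, 1.0, 3.0, 3.0, 5.0]
```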

In [None]:
df.resample('A').max()

In [None]:
annual_max = df.groupby(df.index.year).max()
annual_max

In [None]:
index_hs_max=df.hs.groupby(df.index.year).idxmax()
index_hs_max

In [None]:
df["1982-01-01":"1982-12-01"]

### Using `.iloc` and `.loc` to index

In this section, we introduce ways to access data that are preferred by pandas over the methods listed above. When accessing by label, it is preferred to use `.loc`, and when accessing by integer position, `.iloc` is preferred. These indexers behave similarly to the notation introduced above, but provide more speed, security, and rigor in your value selection. Using them can also help you avoid [chained assignment warnings](https://pandas.pydata.org/docs/user_guide/indexing.html#returning-a-view-versus-a-copy) generated by pandas.
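
The difference between the two can be sketched on a small made-up frame. The column names `hs` and `tp` mirror the wave data used in this notebook, but the dates and values here are hypothetical.

```python
import pandas as pd

index = pd.to_datetime(["1982-01-01", "1982-02-01", "1982-03-01", "1982-04-01"])
frame = pd.DataFrame(
    {"hs": [2.0, 3.5, 1.5, 2.5], "tp": [8.0, 10.0, 7.0, 9.0]},
    index=index,
)

row_by_position = frame.iloc[3]           # fourth row, counted from 0
row_by_label = frame.loc["1982-04-01"]    # the row labelled 1982-04-01

label_slice = frame.loc["1982-01-01":"1982-03-01"]  # label slices include the end
position_slice = frame.iloc[0:3]                    # position slices exclude the end

# Selecting rows and a column in one .loc call avoids chained assignment
frame.loc[frame["hs"] > 3, "hs"] = 3.0
print(frame)
```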

In [None]:
df.iloc[3]

In [None]:
df.iloc[0:12]

In [None]:
df.loc["1982-04-01"]

In [None]:
df.loc["1982-01-01":"1982-12-01"]

The `.loc` and `.iloc` methods also allow us to pull entire rows out of a `DataFrame`, as the examples above show.

In the next example, we illustrate how you can use slices of rows and lists of columns to create a smaller `DataFrame` out of an existing `DataFrame`:
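
A minimal sketch of that selection, assuming a small made-up frame whose column names (`hs`, `tm`, `tp`) follow the wave data in this notebook; the dates and values are hypothetical.

```python
import pandas as pd

index = pd.to_datetime(["1982-01-01", "1982-02-01", "1982-03-01", "1982-04-01"])
frame = pd.DataFrame(
    {
        "hs": [2.0, 3.5, 1.5, 2.5],
        "tm": [6.0, 7.5, 5.5, 6.5],
        "tp": [8.0, 10.0, 7.0, 9.0],
    },
    index=index,
)

# Rows by label slice, columns by a list of names
subset = frame.loc["1982-01-01":"1982-02-01", ["hs", "tp"]]

# The same selection by integer position
same_subset = frame.iloc[0:2, [0, 2]]

print(subset)
```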

### Resampling
In these examples, we illustrate a process known as resampling. Using resampling, you can change the frequency of index data values, reducing so-called 'noise' in a data plot. This is especially useful when working with timeseries data; plots can be equally effective with resampled data in these cases. The resampling performed in these examples converts the original time series to yearly averages. This is performed by passing the value '1Y' to the `resample` method.
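
The noise-reducing effect can be verified on synthetic data. The sketch below is an assumption-laden toy: it uses hourly values resampled to daily means (`'D'`) instead of yearly means, because a short series is easier to check and the spelling of the yearly alias has changed between pandas releases.

```python
import numpy as np
import pandas as pd

# Four days of hourly data: a daily step (0, 1, 2, 3) plus alternating +1/-1 noise
hours = np.arange(96)
values = hours // 24 + np.where(hours % 2 == 0, 1.0, -1.0)
index = pd.date_range("2020-01-01", periods=96, freq=pd.Timedelta(hours=1))
noisy = pd.Series(values, index=index)

# Averaging over each day cancels the alternating noise
daily = noisy.resample("D").mean()

print(daily.tolist())  # [0.0, 1.0, 2.0, 3.0]
print(noisy.diff().abs().max(), daily.diff().abs().max())
```

The largest jump between consecutive points is smaller in the resampled series, which is why the resampled plot looks smoother.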

In [None]:
df.hs.plot();

In [None]:
df.hs.resample('1Y').mean().plot();

In [None]:
# Exercises