<a name="top"></a>

<div style="float:right; width:98 px; height:98px;">
<center><img src="https://raw.githubusercontent.com/Unidata/MetPy/master/src/metpy/plots/_static/unidata_150x150.png" alt="Unidata Logo" style="height: 98px;"><p><center>Unidata Program Center</center></p></center>
</div>

<h1>Python Readiness Learning Series</h1>
<h3>Exploratory Data Analysis</h3>

<h4>Learning Objectives</h4>
<ol>
    <li>Create and use a framework for exploratory analysis of Earth Systems Science datasets</li>
    <li>Locate critical exploratory information about tabular data using Pandas tools</li>
    <li>Locate critical exploratory information about netCDF data using Xarray tools</li>
</ol>

<h4>Schedule</h4>
<table>
<tbody>
<tr style="height: 22px;">
<td style="height: 22px;">&nbsp;9:30 - 9:50</td>
<td style="height: 22px;">Introduction and Metalearning&nbsp;</td>
</tr>
<tr style="height: 22px;">
<td style="height: 22px;">&nbsp;9:50 - 10:20</td>
<td style="height: 22px;">Exploratory Data Analysis&nbsp;</td>
</tr>
<tr style="height: 22px;">
<td style="height: 22px;">&nbsp;10:20 - 12:00</td>
<td style="height: 22px;">EDA I: Tabular Data&nbsp;</td>
</tr>
<tr style="height: 22.6px;">
<td style="height: 22.6px;">&nbsp;12:00 - 1:00</td>
<td style="height: 22.6px;">Lunch&nbsp;</td>
</tr>
<tr style="height: 22px;">
<td style="height: 22px;">&nbsp;1:00 - 2:45</td>
<td style="height: 22px;">EDA II: Multidimensional Data&nbsp;</td>
</tr>
<tr style="height: 22px;">
<td style="height: 22px;">&nbsp;2:45 - 3:00</td>
<td style="height: 22px;">Introduction to Project and Closing&nbsp;</td>
</tr>
</tbody>
</table>

<div style="clear:both"></div>
</div>

<hr style="height:2px;">

<div style="width:1000 px">

### Table of Contents
1. <a href="#meta">Metalearning</a>
1. <a href="#eda">Exploratory Data Analysis</a>
1. <a href="#table">EDA I: Tabular Data</a>
1. <a href="#multi">EDA II: Multidimensional Data</a>
1. <a href="#proj">Introduction to the Day 2 Project</a>
1. <a href="#more">More Information</a>
    
<div style="clear:both"></div>
</div>

<a name="meta"></a>
## Metalearning

Learning a new subject, especially something as complex as learning a new language, is not a linear process. It would be unreasonable to expect that you will leave this session today feeling like you have learned everything you need to be a successful scientific Python user. Instead, consider these next two sessions as a means of making those initial connections in your brain that you will continue to build on in your studies. It takes time and sufficient practice to transfer new information from short-term to long-term memory and build the mental models we need to complete our future work.

All the while, we battle the **forgetting curve**. 

*Execute the cell below to view the slide widget*

In [None]:
# Run this cell to view the widget
import ipywidgets as wg
from IPython.display import Image

w=800
def f(Slide):
    if Slide == 1:
        return Image(url='https://elearning.unidata.ucar.edu/metpy/PythonReadiness/media/FC1.PNG', width=w)
    elif Slide == 2:
        return Image(url='https://elearning.unidata.ucar.edu/metpy/PythonReadiness/media/FC2.PNG', width=w)
    elif Slide == 3:
        return Image(url='https://elearning.unidata.ucar.edu/metpy/PythonReadiness/media/FC3.PNG', width=w)
    else:
        return Image(url='https://elearning.unidata.ucar.edu/metpy/PythonReadiness/media/FC4.PNG', width=w)

wg.interact(f, Slide=wg.IntSlider(min=1,max=4,step=1));

With this knowledge about your own learning, consider how the information you learn in this series continues to build bridges between your short and long-term memory.

<a name="eda"></a>
## Exploratory Data Analysis

The topic for the day is *Exploratory Data Analysis*. 

<div class="alert alert-success">
    <b>Discussion</b>: 

What does this process sound like we’re going to do?<br>
What do you think could be the goals of this type of analysis? What do you think EDA is not?
</div>

While there are many ways to approach this, we propose a general four-phase framework for understanding data. First, and arguably the most difficult hurdle, is finding data and reading it into Python. Second, we examine the metadata, the data that describes what the values represent. Third, we look at the individual variables: which ones are available, their units, and so on. Finally, we examine how variables relate to each other. This last phase is where we bridge the gap between EDA and explanatory analysis. 

<img src="https://elearning.unidata.ucar.edu/metpy/PythonReadiness/media/EDAFramework.png">
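
The four phases of the framework can be sketched in a few lines of pandas. This is a minimal illustration under stated assumptions, not a complete EDA: the CSV text, the column names (`DATE`, `T`, `RH`), and every value are made up so that the example runs without any files.

```python
import io
import pandas as pd

# Hypothetical stand-in for a small observation file; all values are made up.
csv_text = """DATE,T,RH
2021-01-01 00:00,2.5,80
2021-01-01 01:00,2.1,83
2021-01-01 02:00,,85
2021-01-01 03:00,1.4,88
"""

# Phase 1: read the data into Python
obs = pd.read_csv(io.StringIO(csv_text), parse_dates=["DATE"])

# Phase 2: metadata, i.e. which columns exist and how they are stored
column_types = obs.dtypes

# Phase 3: single variables, their ranges and missing values
summary = obs[["T", "RH"]].describe()
missing = obs.isnull().sum()

# Phase 4: how variables relate to each other
t_rh_corr = obs["T"].corr(obs["RH"])
```

Each phase answers a different question, so it helps to keep the results in separate variables and inspect them one at a time.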

<div class="alert alert-success">
    <b>Discussion</b>: 

What do we look for in an exploratory data analysis?
    
When we acquire a new dataset, what kind of information would we need to say we can “fully describe” the data? Consider any and all types of earth systems data.
    
Document your responses below.
</div>

<a name="table"></a>
## EDA I: Tabular Data

<p>In this exercise we will explore local time series data in csv format. These data are "local" because they are stored on the same machine where we are running this notebook from.</p>

<img src="https://elearning.unidata.ucar.edu/metpy/PythonReadiness/media/localdata.png">

<p>We start by importing the Python package required to read the csv, <code>pandas</code>, with the abbreviated name <code>pd</code>. Then we call the pandas <code>read_csv()</code> function through that abbreviation, as <code>pd.read_csv()</code>, to read the file and load it into our Python project as a pandas <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html" target="blank">DataFrame</a>.</p>

In [None]:
# import required packages
import pandas as pd

### Time Series #1

We will explore two different kinds of time series data. This is the first.

In [None]:
# load csv from local folder
df = pd.read_csv('../../data/timeseries.csv')

# preview the DataFrame, df
df

<div class="alert alert-success">
    <b>Discussion</b>: 
    
What information can we find in this preview alone? What information from our earlier EDA brainstorm are we missing?
</div>

#### Other helpful pandas functions

Recall the other information we would want to know for a successful EDA. These functions below help us to more clearly see that information in a pandas DataFrame.

In [None]:
# list all column headers
df.columns

In [None]:
# describe the DataFrame to see the range of values
df.describe()

<div class="alert alert-success">
    <b>Discussion</b>: 
    
Referring to the syntax we used to describe the DataFrame compared to the syntax we used to print the column names, why do we use () after <code>df.describe()</code> but not after <code>df.columns</code>?
</div>

Not every dataset uses "Null" to denote missing data; some use -999 or another large negative number as a placeholder. Because the minimum values in the DataFrame description seem reasonable, such a placeholder is unlikely here, so let's check for Null values instead. 

In [None]:
# Use pandas isnull() to find missing values
df.isnull()

Our result isn't particularly useful since we have so many entries that are clipped from the preview... Let's try to add them together.

In [None]:
# sum all null values by column
df.isnull().sum()
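
If a dataset does use a placeholder such as -999, `isnull()` will not flag it, because the placeholder is an ordinary number. The sketch below shows one way to handle that case. It uses a small made-up DataFrame (the name `toy` and its values are assumptions for illustration, not part of the workshop data).

```python
import numpy as np
import pandas as pd

# Made-up wind speeds where -999 marks a missing report (assumed for illustration)
toy = pd.DataFrame({"WS": [3.2, -999.0, 4.1, -999.0, 5.0]})

# isnull() finds nothing, because -999 is a valid number to pandas
nulls_before = int(toy["WS"].isnull().sum())

# Convert the placeholder to NaN so pandas treats it as missing
cleaned = toy.replace(-999.0, np.nan)
nulls_after = int(cleaned["WS"].isnull().sum())
```

When the placeholder is known before loading, the `na_values` parameter of `pd.read_csv()` can perform the same conversion while the file is read.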

<div class="alert alert-success">
    <b>Discussion</b>: 
    
What can you ascertain from the results, and what do they mean, physically? What might explain the missing data? Try to be as descriptive as possible.
</div>

We can use pandas selection tools to determine which times from the dataset are missing values.

In [None]:
# determine what times are missing WS
# Note the syntax for selecting one column:
# DataFrame['column name']

# create a "mask" of the entries where WS is null
nullWS = df['WS'].isnull()


# pull times where nullWS is True
df['DATE'][nullWS]
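
The same mask-based selection can also be written in one step with `.loc`, which takes the row mask and the column name together. The sketch below uses a made-up three-row DataFrame (the name `toy` and its values are hypothetical) so that it runs on its own.

```python
import pandas as pd

# Made-up data: the second wind speed report is missing
toy = pd.DataFrame({
    "DATE": ["2021-01-01 00:00", "2021-01-01 01:00", "2021-01-01 02:00"],
    "WS": [3.2, None, 4.1],
})

# True where WS is missing, False elsewhere
null_ws = toy["WS"].isnull()

# Select rows with the mask and the DATE column in a single operation
missing_times = toy.loc[null_ws, "DATE"]
```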

#### EDA Plotting

Let’s now move into phase four of the EDA framework and examine how variables relate to each other. We will do this by plotting some of the variables. 


But first, consider what we know to be true about the physical world. We have 24 hours' worth of data. What shape would we expect temperature to take over a 24-hour period? What about relative humidity? What red flags might indicate that we have a bad dataset?

In [None]:
# plot the shape of relative humidity
df['RH'].plot()

In [None]:
# plot the shape of temperature
df['T'].plot()

Do these plots make physical sense? Why or why not?

### Time Series #2

Let's try out an EDA on a different time series next. 

In [None]:
# read data with Pandas
df2 = pd.read_csv('../../data/SFC_obs.csv')


# display Pandas DataFrame preview
df2

<div class="alert alert-success">
    <b>Discussion</b>: 
    
Immediately we see differences between the previous dataset and this one. What are they?
</div>

In [None]:
# describe the DataFrame to see the range of values
df2.describe()

Notice that we don’t get a summary of the categorical variables, only the numeric variables. 

With the trick below, we can describe all columns by casting every value to the object type before calling `describe()`.

In [None]:
# optionally, transpose the table for clearer reading by appending .transpose()
df2.astype('object').describe().transpose()
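
As an alternative to casting, `describe()` accepts an `include` parameter. The sketch below applies `include="all"` to a small made-up DataFrame (the name `toy`, the station IDs, and the values are assumptions for illustration). Numeric columns keep their statistics such as the mean, and categorical columns gain `unique`, `top`, and `freq`.

```python
import pandas as pd

# Made-up observations with one categorical and one numeric column
toy = pd.DataFrame({
    "station": ["FNL", "DEN", "FNL"],
    "temperature": [12.0, 15.5, 11.0],
})

# Summarize every column, whatever its type
summary_all = toy.describe(include="all")
```

Rows that do not apply to a column, such as `mean` for station names, are filled with NaN.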

Let's now use our pandas selection tools to more closely examine a single station.

Note the syntax for selecting one column:

<code>DataFrame['column name']</code>

In [None]:
# create a stations variable and preview it
stations = df2['station']
stations

In [None]:
# create a true/false "mask" where the value in the series is the desired string
fnl_mask = stations.values == 'FNL'
fnl_mask

In [None]:
# sum all True values to determine how many entries we have
sum(fnl_mask)

In [None]:
# trim df to just the entries where the mask is true
df_fnl = df2[fnl_mask]
df_fnl
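
Before building a mask for one station, it can help to see how many entries every station has. A minimal sketch with made-up data follows. The DataFrame `toy` and its station IDs are hypothetical, while `value_counts()` is a standard pandas Series method.

```python
import pandas as pd

# Made-up surface observations from three stations
toy = pd.DataFrame({
    "station": ["FNL", "DEN", "FNL", "BOU"],
    "temperature": [12.0, 15.5, 11.0, 14.2],
})

# Count the entries for every station at once
counts = toy["station"].value_counts()

# Build the mask, count its True values, and trim the DataFrame
fnl_mask = toy["station"] == "FNL"
n_fnl = int(fnl_mask.sum())
toy_fnl = toy[fnl_mask]
```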

<div class="alert alert-success">
    <b>Exercise</b>: 

Recall that the whole reason we do EDA is to understand our data to ultimately determine if it is sufficient for a scientific or research question. Let's put together all of these tools now in a new scenario. 
    
<b>Goal:</b>
    
Working in pairs, determine if this dataset is appropriate for your research question.
    
<b>Scenario:</b>

You are starting out on a project related to how tropospheric ozone concentrations in the US have changed in the last 10 years. 
You’ve found a paper that provides a benchmark dataset that could be helpful for you in your statistical comparisons. To determine if it is appropriate for your project, you must first examine the dataset using the EDA approach we just introduced.     
    
<b>Tasks:</b>
    
First, load the dataset (via the read_csv function)
http://hdl.handle.net/11304/89dd440e-4e10-496e-b476-1ccf0ebeb4f3

The paper that describes the dataset can help you find some critical metadata:
https://essd.copernicus.org/articles/13/3013/2021 

Then, find the following information:
1. Spatial extent of the data, the number of records, and summary of stations
1. Temporal extent of the data
1. Summary of all variables 
1. Examine the ranges of values for a few relevant variables to determine if any data seem like outliers or incorrect
1. Summarize information about any missing data
1. Then, make a decision on whether you want to use this dataset for the research project in this scenario.
    
</div>

<a name="multi"></a>
## EDA II: Multidimensional Data

In the previous section, we looked at three different examples of tabular data. These all represented point-based data, but that's not the only kind of data we work with in the Earth Systems Sciences. 

In fact, a large percentage of the data used in the Earth Systems Sciences is array-based. 

<div class="alert alert-success">
    <b>Exercise</b>: 

<br><b><a href="https://elearning.unidata.ucar.edu/dataeLearning/MultidimensionalDataStructures" target="blank">Multidimensional Data Structures</a></b>
    
Complete the above microlesson. We will watch the video together. 
</div>


Now let's try examining a local netCDF file like the one we saw in the lesson. 

While <code>pandas</code> is a great package for examining tabular data, it is insufficient for multidimensional or array-based data.

To examine these types of data we use a different package, <code>xarray</code>, with the abbreviated name <code>xr</code>.

In [None]:
# import required packages 
import xarray as xr

In [None]:
# open the file with xarray open_dataset
nc = '../../data/irma_gfs_example.nc'
data = xr.open_dataset(nc)

# View a summary of the Dataset
data

<div class="alert alert-success">
    <b>Discussion</b>: 

Of the must-have information for EDA, what information can we gather from just this preview?
</div>


#### CF (Climate and Forecast) Conventions

Notice that we have info in the global metadata called Conventions. While not crucial to us right now, these will be important for future analyses you do, so let’s briefly discuss. 

Consider the following: you are working with several datasets from various sources and comparing variables in one region to another. Imagine comparing temperatures from three different datasets, but the labels for temperature are different. One is “T”, one is “temp”, and one is “temperature”. Additionally, two of the datasets record temperature in K and one records temperature in C. Reproducing results on these datasets would be cumbersome because you would have to keep changing variable names, units, and potentially other attributes. 
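
The sketch below illustrates how shared metadata can ease this problem. It builds two made-up DataArrays with different variable names and units but the same CF-style `standard_name` attribute. The arrays, their values, and the manual Kelvin-to-Celsius conversion are assumptions for illustration only.

```python
import numpy as np
import xarray as xr

# Two made-up temperature records with different names and units
temp_a = xr.DataArray(
    np.array([273.15, 283.15]), dims=["time"], name="T",
    attrs={"standard_name": "air_temperature", "units": "K"},
)
temp_b = xr.DataArray(
    np.array([0.0, 10.0]), dims=["time"], name="temp",
    attrs={"standard_name": "air_temperature", "units": "degC"},
)

# The shared standard_name says these are the same physical quantity
same_quantity = temp_a.attrs["standard_name"] == temp_b.attrs["standard_name"]

# The units attribute says a conversion is needed before comparing values
same_units = temp_a.attrs["units"] == temp_b.attrs["units"]

# Convert Kelvin to Celsius; arithmetic drops attributes by default, so restore them
temp_a_c = temp_a - 273.15
temp_a_c.attrs = {"standard_name": "air_temperature", "units": "degC"}
```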

From the <a href="http://cfconventions.org/" target="blank">CF website</a>:
>“The Climate and Forecast metadata conventions (CF) are a community-developed standard designed to promote the processing and sharing of climate and forecast model and observational output data, and derived data products. The conventions define metadata that provide a definitive description of what the data in each variable represents, and the spatial and temporal properties of the data. This enables users of data from different sources to decide which quantities are comparable, and facilitates building applications with powerful extraction, regridding, and display capabilities. The CF convention includes a standard name table, which defines strings that identify physical quantities.”

Continuing where we left off, we will now explore the potential of missing data in this dataset. To do this, we'll begin by isolating a single variable for interrogation, temperature (on isobaric surfaces).

In [None]:
# isolate temperature in its own variable
# Note the syntax Dataset['variable']
temperature = data['Temperature_isobaric']

# display preview
temperature

No new info yet, but notice that temperature is a DataArray rather than a Dataset. A Dataset is composed of DataArrays. 

In pandas, a DataFrame contains multiple Series (columns).

In xarray, a Dataset contains multiple DataArrays (arrays).
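
To make the relationship concrete, the sketch below builds a tiny Dataset from two DataArrays and then pulls one back out. The variable names, shapes, and values are all made up for illustration.

```python
import numpy as np
import xarray as xr

# Two made-up 2D arrays that share the same dimensions
temperature_da = xr.DataArray(np.full((2, 3), 280.0), dims=["y", "x"], attrs={"units": "K"})
humidity_da = xr.DataArray(np.full((2, 3), 55.0), dims=["y", "x"], attrs={"units": "%"})

# A Dataset bundles named DataArrays together
toy_ds = xr.Dataset({"temperature": temperature_da, "humidity": humidity_da})

# Selecting one variable from the Dataset returns a DataArray
selected = toy_ds["temperature"]
```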

Continuing on, we can use the `isnull()` method to pick out any missing values in the DataArray. 

In [None]:
# find nulls in temperature
temperature.isnull()

In [None]:
# try to sum the True null values
temperature.isnull().sum()
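
When nulls are present, it is often useful to know where they sit in the array. The sketch below uses a made-up two-level DataArray (the name `toy_temp` and its values are hypothetical) and sums the null mask along one dimension, giving a count for each vertical level.

```python
import numpy as np
import xarray as xr

# Made-up temperatures on two levels, with one missing value on the first level
values = np.array([[280.0, np.nan, 281.0],
                   [270.0, 271.0, 272.0]])
toy_temp = xr.DataArray(values, dims=["level", "x"], coords={"level": [85000, 50000]})

# Total number of nulls in the whole array
total_nulls = int(toy_temp.isnull().sum().item())

# Number of nulls on each level, found by summing over the x dimension only
nulls_per_level = toy_temp.isnull().sum(dim="x")
```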

It looks like there are no null values, but let's check the range of values in the array to see if anything looks off. We can use xarray's built-in histogram plotting, `xr.plot.hist()` (also available on a DataArray as the `.plot.hist()` method), to do this. 

In [None]:
# Preview a histogram of temperature
xr.plot.hist(temperature)

That seems reasonable. What about another variable? We can combine the selection of the DataArray and the plot in one line, as below. 

In [None]:
# Preview a histogram of Relative Humidity
xr.plot.hist(data['Relative_humidity_isobaric'])

Do these values look reasonable for relative humidity to you? What could explain the skew toward 0%?

In these histograms, every value from every dimension gets dumped into the same pool and divided into buckets. 

What if we want to look for specific features in the data? In that case, we need to reduce our dimensionality from 4 (x, y, z, and t) to just 2 (x and y) for a 2D plot. 

xarray has many indexing tools that make these selections possible. Let's take a closer look at the `.sel()` method. We will call this method using the format:

`DataArray.sel(dimension = 'value')`

Let's first choose a single vertical level. We should start by examining which levels are available. 

In [None]:
# look at available values in the isobaric3 coordinate
temperature['isobaric3']

Then we choose one level and enter it into our selection. 

In [None]:
# Note the syntax DataArray.sel(dimension='value')
temperature_selection = temperature.sel(isobaric3 = 50000)
temperature_selection

Now our temperature selection is a DataArray with dimensionality reduced to 3 (x, y, t).

On our quest for a 2D array, we next will choose a single time to select. 

From the above xarray HTML preview, we can preview the times available for the data and see that we have data available from 5 Sept 2017 at 12Z to 6 Sept 2017 at 9Z every 6 hours. We can select a specific time from this list, or we can practice using the `method` parameter to choose a time that's nearest to a time that may not exist within the coordinate. 

We can also use multiple selections in a single line, so let's rewrite our selection to include both time and isobaric level.

In [None]:
# apply two selections in one operation
temperature_selection = temperature.sel(isobaric3=50000, time1='2017-09-06T02', method='nearest')
temperature_selection
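
The behavior of `method='nearest'` is easier to see on a small array. The sketch below builds a made-up DataArray that reuses the dimension names `time1` and `isobaric3` (the times, levels, and values are assumptions, not the contents of the GFS file). It then requests a time and a level that do not exist in the coordinates.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up coordinates: four times and three pressure levels (Pa)
times = pd.to_datetime(["2017-09-05T12", "2017-09-05T18", "2017-09-06T00", "2017-09-06T06"])
levels = [100000, 75000, 50000]

# Values 0..11, so each (time, level) pair is easy to identify
toy = xr.DataArray(
    np.arange(12, dtype=float).reshape(4, 3),
    dims=["time1", "isobaric3"],
    coords={"time1": times, "isobaric3": levels},
)

# Neither 02Z nor 52000 Pa exists; nearest picks 00Z and 50000 Pa
nearest = toy.sel(time1="2017-09-06T02", isobaric3=52000, method="nearest")
```

Note that `method='nearest'` applies to every dimension named in the same `.sel()` call, so a mistyped level would also be matched to its nearest neighbor without raising an error.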

With our dimensionality now reduced to 2 (x, y), we can plot the array and look at features within the data!

For quick and simple plots like the ones we use in an EDA, we can use xarray's built-in plotting tool, the `.plot()` method. 

In [None]:
temperature_selection.plot()

With these selection tools, we can quickly string together code to view samples of our data on 2D planes. But be careful of dimension names!

In [None]:
# relative humidity uses the isobaric3 vertical dimension
rh_selection = data['Relative_humidity_isobaric'].sel(time1='2017-09-05T12', isobaric3=75000)
rh_selection.plot()

In [None]:
# vertical velocity uses a different vertical dimension, isobaric1
w_selection = data['Vertical_velocity_pressure_isobaric'].sel(time1='2017-09-06T12', isobaric1=100000)
w_selection.plot()
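
One way to avoid dimension-name mistakes is to check the `.dims` attribute of each variable before selecting. The sketch below uses a made-up Dataset in which two variables sit on different vertical dimensions. The short names `rh` and `w`, and all values, are hypothetical.

```python
import numpy as np
import xarray as xr

# Made-up Dataset: two variables with different vertical dimensions
toy_ds = xr.Dataset(
    {
        "rh": (("isobaric3", "x"), np.zeros((2, 3))),
        "w": (("isobaric1", "x"), np.ones((3, 3))),
    },
    coords={"isobaric3": [75000, 50000], "isobaric1": [100000, 75000, 50000]},
)

# Inspect the dimension names of each variable first
rh_dims = toy_ds["rh"].dims
w_dims = toy_ds["w"].dims

# Then select using the dimension name that belongs to that variable
w_level = toy_ds["w"].sel(isobaric1=100000)
```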

<div class="alert alert-success">
    <b>Exercise</b>: 

Scavenger hunt challenge!
    
<b>Goal:</b>

When the timer starts, open the CMIPC.nc file in this notebook and find the following information:
    
- Source (sensor/model/instrument)
- Spatial region or extent
- Valid day(s)/time(s) or forecast times
- Available dimensions and their respective lengths
- Variables and their units

</div>

<a name="proj"></a>
## Introduction to the Day 2 Project

Today you worked a lot with data that was provided for you, and now it's your opportunity to explore data of your choosing. 

Recall the purpose of exploratory data analysis: to examine the appropriateness of data for a specific task or research question. 

Before the next class period, you are asked to do the following:
- <b>Choose any research question you are interested in </b>
    - This may or may not be related to your current work. Follow your curiosity!
- <b>Find data that may support your research question</b>
    - It’s okay if you're not certain the data will be appropriate!
    - Choose at least one file/resource to examine in Python
    - Any data source is fine; you don’t need to know how to read it yet.  
    - Any file type
- <b>Review the Remote Data Access lesson below</b>
    - Watch the short video below, then try accessing a remote dataset following the demo in the video
    - You may try Unidata's THREDDS Data Server: https://thredds.ucar.edu/thredds/catalog/catalog.html
    - Or NCEI's THREDDS Data Server: https://www.ncei.noaa.gov/thredds/catalog.html
    - Or any other THREDDS Data Server you can find

You will have time in the next session to complete an EDA on your data of choice. 

#### Remote Data Access Lesson

<video width="600" src="https://elearning.unidata.ucar.edu/metpy/QuantitativeAnalysisILT/SiphonTDS/IntroSiphonTDS.mp4"  
       controls>
</video>

<a href="https://elearning.unidata.ucar.edu/metpy/QuantitativeAnalysisILT/SiphonTDS/IntroSiphonTDS.mp4" target="blank">Video source</a>

In [None]:
# Practice your code here