# Introduction to Python for Earth Scientists

These notebooks have been developed by Calum Chamberlain, Finnigan Illsley-Kemp and John Townend at [Victoria University of Wellington-Te Herenga Waka](https://www.wgtn.ac.nz) for use by Earth Science graduate students. 

The notebooks cover material that we think will be of particular benefit to those students with little or no previous experience of computer-based data analysis. We presume very little background in command-line or code-based computing, and have compiled this material with an emphasis on general tasks that a grad student might encounter on a daily basis. 

In 2022, this material will be delivered at the start of Trimester 1 in conjunction with [ESCI451 Active Earth](https://www.wgtn.ac.nz/courses/esci/451/2022/offering?crn=32176). Space and pandemic alert levels permitting, interested students not enrolled in ESCI451 are encouraged to come along too, but please contact Calum, Finn, or John first.

| Notebook | Contents | Data |
| --- | --- | --- |
| [1A](ESCI451_Module_1A.ipynb) | Introduction to programming, Python, and Jupyter notebooks | - |
| [1B](ESCI451_Module_1B.ipynb) | Basic data types and variables, getting data, and plotting with Matplotlib | Geodetic positions |
| [2A](ESCI451_Module_2A.ipynb) | More complex plotting, introduction to Numpy | Geodetic positions; DFDP-2B temperatures |
| **[2B](ESCI451_Module_2B.ipynb)** | **Using Pandas to load, peruse and plot data** | **Earthquake catalogue**  |
| [3A](ESCI451_Module_3A.ipynb) | Working with Pandas dataframes | Geochemical data set; GNSS data |
| [3B](ESCI451_Module_3B.ipynb) | Simple time series analysis using Pandas | Historical temperature records |
| [4A](ESCI451_Module_4A.ipynb) | Making maps with PyGMT | Earthquake catalogue |
| [4B](ESCI451_Module_4B.ipynb) | Gridded data and vectors | Ashfall data and GNSS |

The content may change in response to students' questions or current events. Each of the four modules has been designed to take about three hours, with a short break between the two parts.

# This notebook

1. An introduction to Pandas and dataframes
   - Loading data into a dataframe
   - Visualising the dataframe
   - Dataframe statistics
   - Sorting and slicing a dataframe
   - Editing dataframes

# An introduction to Pandas and dataframes
 
<img alt="Pandas logo" align="right" style="width:30%" src="https://dev.pandas.io/static/img/pandas.svg">


So far we have looked at some fairly simple datasets.  NumPy is great for multi-dimensional arrays, but
book-keeping can be tricky and somewhat counterintuitive.  Pandas is our friend here.  Pandas adds metadata to our data, and allows
us to interact with data using names and words, rather than indexes. This means that we can
write much clearer code (yay).  It's also really good at working with data that you would previously have
interacted with in spreadsheets.  Spreadsheets are the source of **many** errors, and keeping data and
results in the same file is almost criminal! Your data are sacred and should **never be in the same
file that you process them in!**

The Pandas [GitHub README](https://github.com/pandas-dev/pandas/blob/master/README.md) outlines why you should
care about Pandas:

> **pandas** is a Python package providing fast, flexible, and expressive data structures designed to 
make working with "relational" or "labeled" data both easy and intuitive. It aims to be the 
fundamental high-level building block for doing practical, **real world** data analysis in Python.
Additionally, it has the broader goal of becoming the **most powerful and flexible open source 
data analysis / manipulation tool available in any language**. It is already well on its way towards 
this goal.

When Pandas says **real world**, think messy data. Measurements of properties of the Earth are *almost always*
messy: data are missed when power fails, equipment breaks, or it is too wet to get into the field;
almost all Earth science datasets are noisy; and almost all Earth science data are multi-dimensional and
relational (e.g. multiple variables at one particular place and/or time).  Pandas is really good at coping
with this mess, and **will make your life easier!**

In [None]:
# Enable interactive plots.
%matplotlib notebook

## Loading data into a dataframe

To explore some of the functionality of Pandas, we need a dataset. One large and freely accessible
geoscience dataset in New Zealand is the GeoNet earthquake catalogue. This contains hundreds of thousands
of earthquakes, so should be fun to play around with.

To start off with, we need to get the data.  We could manually query the 
[Quake Search](https://quakesearch.geonet.org.nz/) web-app, but that means we need to
click lots of buttons, and isn't great for just exploring a dataset.  Let's do it
programmatically.  We will build a function, but let's look at the steps along the way.

### Building a query

The Quake Search page can be queried by generating a specific web request in the form:

`"https://quakesearch.geonet.org.nz/csv?bbox={min-longitude},{min-latitude},{max-longitude},{max-latitude}&minmag={min-magnitude}&maxmag={max-magnitude}&mindepth={min-depth}&maxdepth={max-depth}&startdate={start-time}&enddate={end-time}"`

We can build that as a string really easily, using variables in place of the placeholders in curly brackets:

In [None]:
format_string = (
    "https://quakesearch.geonet.org.nz/csv?bbox="
    "{min_longitude},{min_latitude},{max_longitude},"
    "{max_latitude}&minmag={min_magnitude}"
    "&maxmag={max_magnitude}&mindepth={min_depth}"
    "&maxdepth={max_depth}&startdate={start_time}"
    "&enddate={end_time}")

min_latitude = -49.0
max_latitude = -40.0
min_longitude = 164.0
max_longitude = 182.0
min_magnitude = 0.0
max_magnitude = 9.0
min_depth = 0.0  # in km
max_depth = 500.0
start_time = "2019-1-1T00:00:00"
end_time = "2020-1-1T00:00:00"

query_string = format_string.format(
    min_latitude=min_latitude,
    max_latitude=max_latitude,
    min_longitude=min_longitude,
    max_longitude=max_longitude,
    min_magnitude=min_magnitude,
    max_magnitude=max_magnitude,
    min_depth=min_depth,
    max_depth=max_depth,
    start_time=start_time,
    end_time=end_time)

print(query_string)

See what we did? We specified the format of the query string, then specified the particular search criteria we were interested in, and then put those two elements together to construct the query string. Because we have used variables in place of parts of the string, we can change our query really easily.  
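
The steps above can be wrapped in a small function so that a new query only needs new arguments. This is a minimal sketch rather than a definitive implementation: the name `build_query` and its default values are our own choices for illustration, and only the URL pattern comes from the text above.

```python
# Same URL pattern as above, kept as a module-level template.
FORMAT_STRING = (
    "https://quakesearch.geonet.org.nz/csv?bbox="
    "{min_longitude},{min_latitude},{max_longitude},"
    "{max_latitude}&minmag={min_magnitude}"
    "&maxmag={max_magnitude}&mindepth={min_depth}"
    "&maxdepth={max_depth}&startdate={start_time}"
    "&enddate={end_time}")


def build_query(min_latitude, max_latitude, min_longitude, max_longitude,
                min_magnitude=0.0, max_magnitude=9.0,
                min_depth=0.0, max_depth=500.0,
                start_time="2019-1-1T00:00:00",
                end_time="2020-1-1T00:00:00"):
    """Return a Quake Search CSV query string (hypothetical helper)."""
    return FORMAT_STRING.format(
        min_latitude=min_latitude, max_latitude=max_latitude,
        min_longitude=min_longitude, max_longitude=max_longitude,
        min_magnitude=min_magnitude, max_magnitude=max_magnitude,
        min_depth=min_depth, max_depth=max_depth,
        start_time=start_time, end_time=end_time)


# Same region as before, but only events of magnitude 4 and above.
larger_events_query = build_query(-49.0, -40.0, 164.0, 182.0, min_magnitude=4.0)
print(larger_events_query)
```

Keeping the search criteria as function arguments means the format string is written once and every later query is a one-line call.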

If you click that link we just constructed you should download a file called *earthquakes.csv*. What we really want though is to download that file and look at it in Python straight away.  To do that
we can use the `requests` package to make a web-request:

In [None]:
import requests

response = requests.get(query_string)
print(response)

All being well, that should have output `<Response [200]>`. The value of 200 is the HTTP status code for success, which simply says that things went as planned.

The `Response` object holds the content that we requested from the web in its `.content` attribute.  Let's have a look at the first 1000 bytes of the response:

In [None]:
print(response.content[0:1000])

These are the contents of the `earthquakes.csv` file, and we can write them to a file in the data directory.  The
contents that we have downloaded are in binary (which `print` converted to a string before
displaying it), so we have to open the file we want to write to using the `wb` mode, which means
"open the file in **b**inary mode with **w**rite permission":

In [None]:
with open("data/earthquakes.csv", "wb") as f:
    f.write(response.content)
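
To make the difference between bytes and text concrete, here is a small sketch using a made-up two-row stand-in for `response.content` (the values are invented for illustration, not real catalogue entries). It shows that bytes can be decoded to a string, and that Pandas can read from an in-memory file-like object as well as from a file on disk.

```python
import io

import pandas as pd

# A tiny, invented stand-in for `response.content`: note the b prefix (bytes).
raw = (b"publicid,magnitude,depth\n"
       b"2019p000001,2.1,12.0\n"
       b"2019p000002,3.4,33.5\n")

# Bytes need decoding before they can be treated as ordinary text.
text = raw.decode("utf-8")
header = text.splitlines()[0]
print(header)

# io.BytesIO wraps the bytes so that they behave like an open binary file.
quakes_in_memory = pd.read_csv(io.BytesIO(raw))
print(quakes_in_memory)
```

Writing the download to disk, as we did above, is still worthwhile: it keeps an untouched copy of the data that you can re-read without repeating the query.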

Now we could read those data in using some convoluted looping and NumPy arrays, or we could
just get Pandas to read it using the 
[pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
function.  This will quickly parse that large csv file into a Pandas 
[dataframe](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html):

In [None]:
import pandas as pd  # It is a normal convention to rename pandas as pd for short

earthquakes = pd.read_csv("data/earthquakes.csv")

print(earthquakes[0:5])

Dataframes are really handy ways of handling "spreadsheet"-type data, because they provide a convenient way of labelling the columns. Here we have printed out the first five rows (starting from zero, remember) of the `earthquakes` dataframe we have created. This shows the catalogue information for five earthquakes, arranged in columns labelled `publicid`, `eventtype`, `origintime`, etc. You can see a full list of the column names with the following command:

In [None]:
import pandas as pd

earthquakes = pd.read_csv("data/earthquakes.csv")

earthquakes.columns

We can access the contents of those columns pretty easily too:

In [None]:
import pandas as pd

earthquakes = pd.read_csv("data/earthquakes.csv")

print(earthquakes["origintime"][0:10])

See how in this case we've specified both a column (`origintime`) and a number of rows (the first ten).

Each column is a Pandas [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series), which is similar to a NumPy array
and has a lot of the same functionality.

You will note that the time columns (`origintime` and `modificationtime`) have not been
read in ("parsed") in the most helpful way: we can see what the strings represent but can't yet treat them as dates or times directly. To get around this, we can tell pandas to read those columns in as `datetime` objects
using the `parse_dates` argument.  While we're at it, we can also get rid of the warning about values in column
0 having multiple dtypes (short for data types) by setting the `dtype` argument for the `publicid` column.

In [None]:
earthquakes = pd.read_csv(
    "data/earthquakes.csv",
    parse_dates=["origintime", "modificationtime"],
    dtype={"publicid": str})

print(earthquakes["origintime"][0:10])

Now the `dtype` of the `origintime` column is reported as `datetime64`, which is a 64-bit precision
`datetime` number. We'll leave more detailed discussion of dates and times until the next module and for the time being we'll explore the dataframe itself in a bit more detail. 
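
A quick way to check what `parse_dates` and `dtype` have done is to compare the `.dtypes` of the dataframe with and without them. The sketch below uses two invented rows in place of the real catalogue, and assumes time strings in the ISO format shown; the exact dtype names printed can differ between Pandas versions.

```python
import io

import pandas as pd

# Invented rows for illustration; one publicid looks like a number.
csv_text = (
    "publicid,origintime,magnitude\n"
    "2019p000001,2019-01-01T00:05:12.000Z,2.1\n"
    "3456789,2019-01-02T13:45:00.000Z,3.4\n")

# Default read: the times stay as plain strings.
as_text = pd.read_csv(io.StringIO(csv_text))

# Ask for datetimes, and force publicid to be read as a string.
as_dates = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["origintime"],
    dtype={"publicid": str})

print(as_text.dtypes)
print(as_dates.dtypes)

# Once parsed, date components are available through the .dt accessor.
years = as_dates["origintime"].dt.year
print(years)
```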

Before we do that, however, let's quickly address one other minor formatting issue. You might notice that some of the column names have a leading space in them.  GeoNet doesn't format its tables particularly nicely, and those leading spaces are annoying. Let's rename the columns to remove the spaces: first we make a dictionary that maps the original name to the new name, then we use the `.rename` method on the dataframe:

In [None]:
import pandas as pd

earthquakes = pd.read_csv(
    "data/earthquakes.csv",
    parse_dates=["origintime", "modificationtime"],
    dtype={"publicid": str})

column_mapper = dict()
for column in earthquakes.columns:
    # Use the strip method to remove spaces
    column_mapper[column] = column.strip()
    
earthquakes = earthquakes.rename(columns=column_mapper)
print(earthquakes.columns)
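
The dictionary approach spells out each step, which is useful when learning. As a more compact alternative, `.rename` also accepts a function that is applied to every column name. Here is a sketch with a small invented dataframe (the column names mimic the leading-space problem; the values are made up):

```python
import pandas as pd

# Invented example with leading spaces in two of the column names.
messy = pd.DataFrame({
    "publicid": ["2019p000001", "2019p000002"],
    " magnitude": [2.1, 3.4],
    " depth": [12.0, 33.5]})

# str.strip is called on each column name in turn.
tidy = messy.rename(columns=str.strip)

print(list(messy.columns))
print(list(tidy.columns))
```

Note that `.rename` returns a new dataframe, so `messy` keeps its original column names unless you reassign the result.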

## Data visualisation

Now we have a nicely named dataframe, let's have a look at some of the data.
First let's look at magnitude against time. We could use matplotlib directly, but pandas
has some handy plotting shortcuts built in - I have put all the parts from above together here as you would in your own script:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

earthquakes = pd.read_csv(
    "data/earthquakes.csv",
    parse_dates=["origintime", "modificationtime"],
    dtype={"publicid": str})
column_mapper = dict()
for column in earthquakes.columns:
    # Use the strip method to remove spaces
    column_mapper[column] = column.strip()
    
earthquakes = earthquakes.rename(columns=column_mapper)

# Plot the data!

earthquakes.plot(x="origintime", y="magnitude", kind="scatter")
plt.show()

Here we specified the `x` argument as the column name we wanted to plot on the x-axis, and
`y` as the other column name.  Pandas has a few different plotting options that can
be specified by the `kind` argument; you can find out more about them in the
[visualization user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).

With a catalogue that spans November 2016, you can clearly see the large-magnitude Kaikōura earthquake standing out from everything else (the 2019 query above does not include it).

Another helpful plot might be a histogram. The syntax for that is pretty straightforward too and here are two examples:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

earthquakes = pd.read_csv(
    "data/earthquakes.csv",
    parse_dates=["origintime", "modificationtime"],
    dtype={"publicid": str})
column_mapper = dict()
for column in earthquakes.columns:
    # Use the strip method to remove spaces
    column_mapper[column] = column.strip()
    
earthquakes = earthquakes.rename(columns=column_mapper)

# Plot the data!

earthquakes.hist(column='depth', bins=25)
plt.show()

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

earthquakes = pd.read_csv(
    "data/earthquakes.csv",
    parse_dates=["origintime", "modificationtime"],
    dtype={"publicid": str})
column_mapper = dict()
for column in earthquakes.columns:
    # Use the strip method to remove spaces
    column_mapper[column] = column.strip()
    
earthquakes = earthquakes.rename(columns=column_mapper)

# Plot the data!

earthquakes.hist(column=['depth', 'magnitude'])
plt.show()

### Exercise:

Pick a specific region based on latitude and longitude ([this website](http://bboxfinder.com/) is
really helpful for finding bounding boxes) and get a dataframe spanning a longer period of
time.  Plot the magnitude vs. time graph for that region.

In [None]:
# Your answer here.  Call your dataframe something different to `earthquakes`

## Dataframe statistics

We can also obtain some basic stats from our dataframe, like the median magnitude...

In [None]:
import pandas as pd

earthquakes = pd.read_csv(
    "data/earthquakes.csv",
    parse_dates=["origintime", "modificationtime"],
    dtype={"publicid": str})
column_mapper = dict()
for column in earthquakes.columns:
    # Use the strip method to remove spaces
    column_mapper[column] = column.strip()
    
earthquakes = earthquakes.rename(columns=column_mapper)

# Calculate the median!
print(earthquakes["magnitude"].median())

... or the maximum depth:

In [None]:
print(earthquakes['depth'].max())

There are lots of other useful ways you can extract descriptive statistics from your dataframe, which are documented [here](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#computations-descriptive-stats).
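
As a brief illustration of a few of those methods, the sketch below uses a short invented list of depths (not real catalogue values) and shows the standard deviation, a quantile, and the `.describe` summary.

```python
import pandas as pd

# Four invented depths in km, purely for illustration.
depths = pd.Series([5.0, 12.0, 33.0, 150.0], name="depth")

# Spread of the values about the mean.
print(depths.std())

# The 0.5 quantile is the same thing as the median.
middle = depths.quantile(0.5)
print(middle)

# .describe gathers count, mean, std, min, quartiles and max in one go.
summary = depths.describe()
print(summary)
```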

### Exercise:

What is the mean, maximum and minimum magnitude in our dataframe?

In [None]:
# Your answer here

## Sorting the dataframe

Something we often need to do is to sort a dataset based on the value of one parameter.

In [None]:
import pandas as pd

earthquakes = pd.read_csv(
    "data/earthquakes.csv",
    parse_dates=["origintime", "modificationtime"],
    dtype={"publicid": str})
column_mapper = dict()
for column in earthquakes.columns:
    # Use the strip method to remove spaces
    column_mapper[column] = column.strip()
    
earthquakes = earthquakes.rename(columns=column_mapper)

earthquakes.sort_values(by=["latitude"], ascending=False)

You see how we have sorted the dataframe, but the index remains as it was? We can fix that so that the index is reset by passing the `ignore_index` argument to `.sort_values`:

In [None]:
earthquakes.sort_values(by=["latitude"], ascending=False, ignore_index=True)
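
One thing to be aware of is that `.sort_values` returns a new, sorted dataframe and leaves the original untouched, so the result needs to be assigned to a name if you want to keep it. A small sketch with invented values:

```python
import pandas as pd

# Three invented events, deliberately out of magnitude order.
quakes = pd.DataFrame({
    "publicid": ["a", "b", "c"],
    "magnitude": [2.1, 4.7, 3.4]})

# Keep the sorted result by assigning it to a new name.
by_magnitude = quakes.sort_values(
    by=["magnitude"], ascending=False, ignore_index=True)

print(quakes)        # still in the original order
print(by_magnitude)  # largest magnitude first, index reset to 0, 1, 2
```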

### Exercise:

Sort the dataframe by depth.

In [None]:
# Your answer here.

## Slicing dataframes

We can also select subsets of our dataframe; this is commonly referred to as "slicing".  Say you had downloaded the whole catalogue
and realised that you only wanted events shallower than 20 km depth. The `.loc` indexer is used to slice the dataframe to only those rows meeting the specific criteria:

In [None]:
import pandas as pd

earthquakes = pd.read_csv(
    "data/earthquakes.csv",
    parse_dates=["origintime", "modificationtime"],
    dtype={"publicid": str})
column_mapper = dict()
for column in earthquakes.columns:
    # Use the strip method to remove spaces
    column_mapper[column] = column.strip()
    
earthquakes = earthquakes.rename(columns=column_mapper)

earthquakes.loc[earthquakes["depth"] <= 20.0]

We can chain multiple conditions together using the "&" operator. Here's what we can do if we only want the earthquakes shallower than 20 km and 
larger than magnitude 4:

In [None]:
import pandas as pd

earthquakes = pd.read_csv(
    "data/earthquakes.csv",
    parse_dates=["origintime", "modificationtime"],
    dtype={"publicid": str})
column_mapper = dict()
for column in earthquakes.columns:
    # Use the strip method to remove spaces
    column_mapper[column] = column.strip()
    
earthquakes = earthquakes.rename(columns=column_mapper)

earthquakes.loc[(earthquakes["depth"] <= 20.0) &
                (earthquakes["magnitude"] > 4.0)]
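
It can help to see what is going on inside the square brackets. Each comparison produces a Series of `True`/`False` values (often called a mask), and `.loc` keeps the rows where the mask is `True`. Besides `&` (and), masks can be combined with `|` (or) and inverted with `~` (not). The sketch below uses four invented events:

```python
import pandas as pd

# Four invented events for illustration.
quakes = pd.DataFrame({
    "depth": [5.0, 25.0, 12.0, 150.0],
    "magnitude": [2.1, 4.7, 4.3, 5.0]})

# Each comparison gives one True/False value per row.
shallow = quakes["depth"] <= 20.0
large = quakes["magnitude"] > 4.0
print(shallow)

# Combine or invert the masks before passing them to .loc.
shallow_and_large = quakes.loc[shallow & large]
shallow_or_large = quakes.loc[shallow | large]
not_shallow = quakes.loc[~shallow]

print(shallow_and_large)
print(not_shallow)
```

The parentheses around each condition in the cell above are needed because `&` and `|` bind more tightly than comparisons like `<=`; naming the masks first, as here, avoids that problem.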

### Exercise:

Select earthquakes deeper than 80 km that lie between -44 degrees and -42 degrees latitude.

In [None]:
# Your answer here

## Editing dataframes

Sometimes you need to add or remove rows or columns from your dataframe, or you might want to change how the dataframe is indexed. This kind of dataframe editing is relatively straightforward in Pandas, although there are often multiple ways to accomplish the same goal. In this section we will try to demonstrate some of these methods.

### Removing columns

Say we didn't care about the "modificationtime" column. We can use the [`.drop`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) method to get rid of columns or rows; in this case we pass `columns="modificationtime"` to remove that column:

In [None]:
import pandas as pd

earthquakes = pd.read_csv(
    "data/earthquakes.csv",
    parse_dates=["origintime", "modificationtime"],
    dtype={"publicid": str})
column_mapper = dict()
for column in earthquakes.columns:
    # Use the strip method to remove spaces
    column_mapper[column] = column.strip()
    
earthquakes = earthquakes.rename(columns=column_mapper)

earthquakes = earthquakes.drop(columns="modificationtime")
print(earthquakes.columns)

If we know that we only want a few columns, then a simpler way than dropping lots of columns might be to just read in the columns we care about, which we can do with the `usecols` argument of `pd.read_csv`:

In [None]:
import pandas as pd

earthquakes = pd.read_csv(
    "data/earthquakes.csv",
    usecols=["origintime", "magnitude", "latitude", "longitude", "depth", "publicid"],
    parse_dates=["origintime"],
    dtype={"publicid": str})

print(earthquakes)
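
The names given to `usecols` have to match the column names in the file exactly, leading spaces included. If your copy of the file has the leading spaces mentioned earlier, one option is the `skipinitialspace` argument of `pd.read_csv`, which skips spaces that follow a delimiter. The sketch below uses an invented one-row file to show the idea; whether you need it depends on how your downloaded file is formatted.

```python
import io

import pandas as pd

# Invented file contents with spaces after some of the commas in the header.
csv_text = (
    "publicid,origintime, latitude, magnitude\n"
    "2019p000001,2019-01-01T00:05:12.000Z,-41.5,2.1\n")

# skipinitialspace=True drops the space after each comma, so the clean
# names can be used directly in usecols.
subset = pd.read_csv(
    io.StringIO(csv_text),
    skipinitialspace=True,
    usecols=["publicid", "latitude", "magnitude"])

print(subset.columns)
```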

We often find that we want to add more information to our dataframes. For example, we might want to add a column that is the distance away from some place we care about, say VUW. We will talk more about applying functions to dataframes in a couple of notebooks' time, but for now we will import a function for calculating the distance and loop through our dataframe:

In [None]:
from helpers.geodetics import globe_distance
import pandas as pd

earthquakes = pd.read_csv(
    "data/earthquakes.csv",
    usecols=["origintime", "magnitude", "latitude", "longitude", "depth", "publicid"],
    parse_dates=["origintime"],
    dtype={"publicid": str})

vuw_lat, vuw_lon = -41.2901, 174.768

distances = []
for row in earthquakes.itertuples():
    distance = globe_distance(vuw_lat, vuw_lon, row.latitude, row.longitude)
    distances.append(distance)
    
print(distances[0:10])
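
The `globe_distance` helper is imported rather than shown, so its internals are not visible here. To give a feel for what such a function might do, below is a minimal sketch of the haversine great-circle distance. It is an assumption for illustration only: the name `haversine_km`, the spherical Earth of radius 6371 km, and the output in kilometres are our choices, and the real helper may use a different method or return slightly different values.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees.

    Hypothetical stand-in, not the helpers.geodetics implementation.
    """
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


vuw_lat, vuw_lon = -41.2901, 174.768

# Distance from VUW to itself, then one degree of latitude at the equator.
print(haversine_km(vuw_lat, vuw_lon, vuw_lat, vuw_lon))
print(haversine_km(0.0, 0.0, 1.0, 0.0))
```

The argument order (latitude then longitude for each point) follows the way `globe_distance` is called in the cell above.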

To add that column into our dataframe we can simply assign a new column as if the dataframe was a dictionary using syntax like:

`dataframe["new_column"] = distances`


In [None]:
earthquakes["distance"] = distances

earthquakes.describe()

Using all of the above - what is the largest magnitude earthquake that happened within 100 km of VUW in our data period?

In [None]:
# Your answer here - hint, use slicing to get events within 100km and find the max of the magnitude column.