# Introduction to Python for Earth Scientists

These notebooks have been developed by Calum Chamberlain, Finnigan Illsley-Kemp and John Townend at [Victoria University of Wellington-Te Herenga Waka](https://www.wgtn.ac.nz) for use by Earth Science graduate students. 

The notebooks cover material that we think will be of particular benefit to those students with little or no previous experience of computer-based data analysis. We presume very little background in command-line or code-based computing, and have compiled this material with an emphasis on general tasks that a grad student might encounter on a daily basis. 

In 2022, this material will be delivered at the start of Trimester 1 in conjunction with [ESCI451 Active Earth](https://www.wgtn.ac.nz/courses/esci/451/2022/offering?crn=32176). Space and pandemic alert levels permitting, interested students not enrolled in ESCI451 are encouraged to come along too, but please contact Calum, Finn, or John first.

| Notebook | Contents | Data |
| --- | --- | --- |
| **[1A](ESCI451_Module_1A.ipynb)** | **Introduction to programming, Python, and Jupyter notebooks** | **-** |
| [1B](ESCI451_Module_1B.ipynb) | Basic data types and variables, getting data, and plotting with Matplotlib | Geodetic positions |
| [2A](ESCI451_Module_2A.ipynb) | Logic, more complex plotting, introduction to Numpy | Geodetic positions; DFDP-2B temperatures |
| [2B](ESCI451_Module_2B.ipynb) | Using Pandas to load, peruse and plot data | Earthquake catalogue  |
| [3A](ESCI451_Module_3A.ipynb) | Working with Pandas dataframes | Geochemical data set; GNSS data |
| [3B](ESCI451_Module_3B.ipynb) | Simple time series analysis using Pandas | Historical Temperature Records |
| [4A](ESCI451_Module_4A.ipynb) | Making maps with Cartopy | Earthquake catalogue |
| [4B](ESCI451_Module_4B.ipynb) | Working with gridded data | DEMs and Ashfall data |

The content may change in response to students' questions or current events. Each of the four modules has been designed to take about three hours, with a short break between its two parts.


# This notebook

1. What, why and how do we programme?
   - Why programming?
   - An example of lots of data
   - Why Python?
   - Jupyter notebooks
2. Using Python on your own computer
3. Hello World!


# What, why and how do we programme?

## Why programming (what's wrong with Excel!?)

- **Reproducibility:** If someone can't replicate your 
  work, why should we trust it to be true?
- **Safety:** Your data and your processing should not
  overlap.  Your raw data should be sacred.
- **Speed:** You want a result, and you want it yesterday... Learn how to write good code 
    (and change the clock on your computer) and you can...
- **Complexity:** Being able to solve complex problems logically, in a way that others can follow
    (and reproduce) is essential to natural sciences. *Hint: Writing good code is as much about the*
    *quality of your documentation as it is about the quality of your code*.
- **Data scale:** Data in natural sciences is noisy, and large. Ideally to understand the natural world
    we would have data from every place at every time throughout the Earth. We don't have that, but
    our datasets are growing...
    


<img alt="XKCD spreadsheets" align="center" style="width:80%" src="https://imgs.xkcd.com/comics/spreadsheets.png">

## An example of lots of data

Let's consider what happens if we're dealing with data from a fairly standard seismological network. 

- long durations (multi-year);
- multiple locations;
- modest sampling rates.

For example, [SAMBA](http://ds.iris.edu/mda/9F/?starttime=2008-01-01T00:00:00&endtime=2020-12-31T23:59:59),
the Southern Alps Microearthquake Borehole Array around Mt. Cook, has been recording since 2008.

<img alt="SAMBA" align="right" style="width:100%" src="images/COVA_pano.jpg">

SAMBA records at 200 Hz (200 samples per second). How many seconds per day?

In [1]:
seconds_per_day = 60 * 60 * 24
print(f"There are {seconds_per_day} seconds in a day")

There are 86400 seconds in a day


How many samples per day?

In [2]:
sampling_rate = 200.0
samples_per_day = seconds_per_day * sampling_rate
print(f"SAMBA records {samples_per_day} samples per day")

SAMBA records 17280000.0 samples per day


So, more than 17 million samples per day. But that is just for one channel: SAMBA seismographs
have three channels, a vertical and two horizontals, so how many samples per day for one station?

In [3]:
number_of_channels = 3
samples_per_day_per_station = samples_per_day * number_of_channels
print(f"One station records {samples_per_day_per_station} samples per day")

One station records 51840000.0 samples per day


Nearly 52 million samples per day per station. SAMBA is made up of 13 stations, so our dataset gets bigger
still:

In [4]:
number_of_stations = 13
samples_per_day_total = samples_per_day_per_station * number_of_stations
print(f"SAMBA records {samples_per_day_total} samples per day")

SAMBA records 673920000.0 samples per day


Nearly 674 million samples per day across the network. So what is that over the first 10 years of operation?

In [5]:
days_per_year = 365.25  # Roughly
samples_per_year = days_per_year * samples_per_day_total
print(f"SAMBA records about {samples_per_year} samples per year.")
samples_per_decade = samples_per_year * 10
print(
    f"In 10 years of operation SAMBA recorded something like {samples_per_decade:e} samples")

SAMBA records about 246149280000.0 samples per year.
In 10 years of operation SAMBA recorded something like 2.461493e+12 samples


More than 2 trillion samples.

Try working with that in a spreadsheet...

Of course, this is just one example of a large dataset - and it's hard to imagine a situation in which a scientist needed to work with all 2 trillion measurements in a completely unstructured way. However, the SAMBA dataset gives an idea about the sorts of volumes of data that could be encountered.
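
The step-by-step arithmetic above can be collected into one small function, which makes it easy to try the same estimate for other networks. This is only a sketch: the function name and its arguments are our own invention, and the 4 bytes per sample used for the storage estimate is an assumption for illustration, not a property of the SAMBA data.

```python
SECONDS_PER_DAY = 60 * 60 * 24
DAYS_PER_YEAR = 365.25  # Roughly


def total_samples(sampling_rate, n_channels, n_stations, n_years):
    """Estimate the number of samples recorded by a continuously running network."""
    samples_per_day = SECONDS_PER_DAY * sampling_rate * n_channels * n_stations
    return samples_per_day * DAYS_PER_YEAR * n_years


samba_samples = total_samples(
    sampling_rate=200.0, n_channels=3, n_stations=13, n_years=10)
print(f"About {samba_samples:e} samples in 10 years")

# ASSUMPTION: 4 bytes per sample, uncompressed - for illustration only
bytes_per_sample = 4
terabytes = samba_samples * bytes_per_sample / 1e12
print(f"Roughly {terabytes:.1f} TB if each sample took {bytes_per_sample} bytes")
```

Try changing the sampling rate or the number of stations to see how quickly the total grows.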

## Why Python?

In this course we will be using the Python programming language to help us learn how to automate tasks in geoscience. Python is a relatively friendly language, but it still has lots of **rules** that you need to follow to make your code run. In the first three notebooks we will introduce some of those rules and start you on your way to 
[zen](https://www.python.org/dev/peps/pep-0020/).

So, why Python?
1. Open-source, community-driven (free) software;
2. Simple syntax, fast to make mistakes and helpful error messages;
3. Community libraries to do lots of complex tasks 
   (e.g. [ObsPy](https://github.com/obspy/obspy/wiki) for seismology, [CartoPy](https://pypi.org/project/Cartopy/) for making maps and handling geographic projections, and [SciPy](https://www.scipy.org/) as an umbrella environment for computational science)

<img alt="xkcd Python" align="center" style="width:80%" src="https://imgs.xkcd.com/comics/python.png">

Python is a useful language in its own right, but one of *the best* things about Python is all the packages written to extend it. These packages are usually distributed via [PyPI](https://pypi.org) and/or [Anaconda](https://conda.io), and are easy to install. This means that you often don't have to write (much of) your own code! Most of the time someone out there knows better than you how to do something, so you get to use their code and focus on the important things.

Most Python packages (and all good ones) have documentation.  If you find yourself stuck, or thinking *I wish I could do this*, it is worth having a search online for what you want, or what you are stuck on.  With Python, installing other packages can be quite simple using either [conda](https://conda.io/en/latest/) or [pip](https://pypi.org/project/pip/).
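
Documentation is also available from inside Python itself. As a small illustration (using the standard-library `math` module simply because it is always installed), the built-in `help` function prints the documentation for a function, and the same text is stored on the function's `__doc__` attribute:

```python
import math

# Print the documentation for math.sqrt
help(math.sqrt)

# The same text is stored as a string on the function itself
sqrt_doc = math.sqrt.__doc__
print(sqrt_doc)
```

In a Jupyter notebook you can get the same information by typing `math.sqrt?` in a cell and running it.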

Python is an interpreted language (rather than a compiled language like Fortran or C). Because of this it is easy to iterate and see your results. You can interact with your code in a step-by-step way, so it is simple to understand the logic of your code. However, because of the interpreted nature of the language, Python is rarely the fastest choice. To combat this, Python can be (and has been) extended by compiled sections of code, meaning that time-critical sections of code can be sped up. This has led to quite a few libraries that use *Python as glue* to hold together faster sections of code written in C, Fortran, or other languages. We will introduce one of these fast packages, *numpy*, later: *numpy* is at the heart of almost all scientific Python applications.
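
To get a feel for the difference between interpreted and compiled code without installing anything, the sketch below times two ways of adding up the same numbers: a loop written in Python, and the built-in `sum` function, which (assuming the standard CPython interpreter) is implemented in C. The function name `loop_sum` and the size of the list are arbitrary choices for this illustration, and the timings will differ from computer to computer.

```python
import timeit

values = list(range(100_000))


def loop_sum(numbers):
    """Add numbers one at a time in interpreted Python code."""
    total = 0
    for number in numbers:
        total += number
    return total


# Both approaches give the same answer...
assert loop_sum(values) == sum(values)

# ...but they usually take different amounts of time to get there
loop_time = timeit.timeit(lambda: loop_sum(values), number=20)
builtin_time = timeit.timeit(lambda: sum(values), number=20)
print(f"Python loop:  {loop_time:.4f} s for 20 runs")
print(f"Built-in sum: {builtin_time:.4f} s for 20 runs")
```

The built-in version is typically several times faster. Packages like *numpy* apply the same idea on a much larger scale.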

Python itself is open-source and runs almost anywhere, and is used for a whole range of purposes, from science to web-pages, data analysis and more: Dropbox was written almost exclusively in 
[Python](https://blogs.dropbox.com/tech/2018/09/how-we-rolled-out-one-of-the-largest-python-3-migrations-ever/).

## What are Jupyter notebooks and how do we use them?

This is a Jupyter notebook! [Jupyter notebooks](https://www.jupyter.org) provide inline interactive Python shells (i.e. an interface for entering and running real Python code) interspersed with explanations and other details that are formatted in something known as "markdown". Notebooks are increasingly used to document the actual code researchers are using to do their analysis alongside the interpretations and analysis. In fact, [some scientific papers have now been written in Jupyter notebooks](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks#reproducible-academic-publications), which enables readers to test the work for themselves. They are a great way to *show your work* while explaining what you did in more extensive prose. We are using them for teaching purposes because they let us play with the code and explain the ideas behind the code.

Notebooks like the ones we've prepared for these modules are designed to be used interactively in a web browser.  You should run through them, change some values, see what works, try and play
with variables and experiment.  There will be sections that you are expected to fill in
marked as **Exercise:**.  Shout out if and when you have problems.

In these notebooks we'll provide a brief introduction to Python for newcomers, with a firm emphasis on doing the sorts of things that graduate students and more senior researchers do in Earth Sciences.  The focus of these notebooks is to introduce you to the simple data types and logic in Python, and a couple of handy packages.  There are many other great tutorials out there for more in-depth ideas, e.g.
- [The Python tutorial](https://docs.python.org/3/tutorial/)
- [LearnPython](https://www.learnpython.org/)
- [Data Carpentry](https://datacarpentry.org/lessons/) and [Software Carpentry](https://software-carpentry.org/) for data literacy and research computing skills
- And many more!

Let us know if you want to play around with any other data.

Remember that this course is supposed to be a brief and generalised look over some of the key ideas in Python and useful libraries as they relate to Earth Science research. Feel free to ask for more specific advice on topics you're working on and we'll do our best to help.

### Getting set-up on a CO501 computer
1. Log into your Linux account
2. Open a terminal and set up your conda virtual environment
```bash
startconda esci451
```
3. Clone the notebook repository
```bash
git clone https://github.com/calum-chamberlain/ESCI451-Python.git
```
4. Start Jupyter:
```bash
cd ESCI451-Python
jupyter notebook
```

# Using Python on your own computer

As we've emphasized above, we've chosen to use Jupyter notebooks for these modules. However, there are other ways that you can interact with Python, especially as you become more experienced. One way is to use a Python shell. On macOS or Linux systems, open a terminal and type `python3` to start a Python 3.x shell. (In case you're interested, on some older systems the default `python` command starts Python 2.7, which is no longer being developed and has been superseded by Python 3.x.) On Windows, open the command line and type `C:\python3\python.exe`: the exact path depends on where Python is installed, and you might have to check your Python version.
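
If you are not sure which version of Python you have started, you can ask Python itself. The short sketch below uses the standard-library `sys` module; it deliberately avoids newer syntax so that it also runs (and warns you) under Python 2.

```python
import sys

# Full version string of the interpreter that is running this code
print(sys.version)

# The individual parts of the version are available as numbers
major = sys.version_info.major
minor = sys.version_info.minor
print("Running Python {0}.{1}".format(major, minor))

if major < 3:
    print("This is Python 2: these notebooks need Python 3")
```

You can also check from the terminal, without starting a shell, by running `python3 --version`.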

To get a more interactive, nicely coloured interface, try using the [IPython](https://ipython.org/) shell, which you can install using conda (or pip).

## Using the same environment on your own computer as CO501

To get your computer set up with the esci451 environment you need to:
1. Install [miniconda](https://docs.conda.io/en/latest/miniconda.html) or [anaconda](https://www.anaconda.com/products/individual) (which comes with a GUI)
2. Download the `environment.yml` file from [here](https://raw.githubusercontent.com/calum-chamberlain/ESCI451-Python/master/environment.yml) and save it somewhere.
3. Open a terminal (macOS or Linux) or the Command Prompt (Windows) and navigate to where you saved the `environment.yml` file.
4. Run `conda env create -f environment.yml`. This might take a while - carry on with the notebooks while this runs!

This will make an `esci451` environment. To start this environment and run Jupyter, open a terminal (or command prompt) and run:
```bash
conda activate esci451
jupyter notebook
```
Then you can navigate to your local notebooks!
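
Once the environment is active, a quick way to confirm that it is working is to check whether the main packages can be found. The sketch below is a suggestion rather than part of the official set-up: the list of package names is an assumption based on the module table at the top of this notebook, and the real `environment.yml` may contain different or additional packages. The helper name `check_packages` is also just for illustration.

```python
from importlib.util import find_spec

# ASSUMPTION: package names taken from the module table, not from environment.yml
packages = ["numpy", "pandas", "matplotlib", "cartopy"]


def check_packages(names):
    """Return a dictionary mapping each package name to True if Python can find it."""
    return {name: find_spec(name) is not None for name in names}


availability = check_packages(packages)
for name, available in availability.items():
    status = "found" if available else "MISSING"
    print(f"{name:12s} {status}")
```

If anything is reported as missing, check that you ran `conda activate esci451` in the same terminal before starting Jupyter.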

# Getting started - "Hello World!"

The first program written in most languages is a simple "Hello World!" program, that just outputs the phrase "Hello World!"
to the screen. In Python this is embarrassingly simple (run the code by clicking the arrow button at the top, or by hitting
*ctrl-enter*):

In [6]:
print("Hello World!")

Hello World!


What we did is call the `print` function with the *argument* `"Hello World!"`. Encapsulating *Hello World!* in
quotes tells Python that we want this to be a *string* type. Strings hold characters; other types hold other
kinds of data.

The `print` function takes whatever we gave it as an argument and prints that to screen (we see the output of our
code in Jupyter notebooks just beneath the *cell* that we ran the code in).
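
As a small extension of the same idea, the sketch below gives `print` more than one argument and uses the built-in `type` function to ask Python what type a value has. The variable names are arbitrary and chosen only for this illustration.

```python
greeting = "Hello World!"

# print accepts more than one argument, separated by commas
print(greeting, "It is nice to meet you.")

# The type function reports what type a value has
greeting_type = type(greeting)
number_type = type(42)
print(greeting_type)
print(number_type)
```

The last two lines print `<class 'str'>` for the string and `<class 'int'>` for the whole number.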

In the next notebook, we'll look at different data types and start playing with real data.