# Introduction to Python for Earth Scientists

<img alt="xkcd 303 Compiling
    The #1 Programmer Excuse for Legitimately Slacking Off: 'My code's compiling.'
    [Two programmers are sword-fighting on office chairs in a hallway. An unseen manager calls them back to work through an open office door.]
    Manager: Hey! Get back to work!
    Cueball: Compiling!
    Manager: Oh. Carry on."  align="right" style="width:30%" src="https://imgs.xkcd.com/comics/compiling.png">

## Motivation for these sessions

1. Why should you want to learn to program?
2. Why are we teaching you Python?
3. Introduce you to some fundamentals of programming.
4. Demonstrate why *DRY* (Don't Repeat Yourself) is important, and not just for reducing effort.
5. Get you set up to work on your own machines!


<img alt="El servicing an ECLIPSE network seismometer site dressed in smart clothing." align="right" style="width:15%" src="images/el_service_smart.jpg">

## Dr. Eleanor (El) Mestel

- Postdoctoral Fellow in Geophysics (Room CO503)
- PhD from VUW: Taupō volcano seismology and co-production of research

### Using Python in my research
- Learned Python for my Master's and used it throughout my research
- Using large seismic datasets: waveforms, station metadata, earthquake catalogues, etc. 
- Helps me to understand the processes and software in my analysis (even those written in other programming languages!)
- Makes analysis simpler, faster and more reproducible

# Why programming?!

1. Short term:
    - You need to know some programming to get through ESCI451 and other courses this year!

2. Medium term:
    - Your research projects probably involve working with data - programming can help you!
    - You might want to make some pretty plots during your research - we can show you how.

3. Long term:
    - If you stay in research, analysing data in a *reproducible* style will probably be part of your life.
    - For those who do not stay in academia, programming skills are among the most useful and transferable skills you will learn at university. A quick job search on 19 February 2024 listed 114 jobs in Wellington related to Python alone.

## Why programming (what's wrong with Excel!?)
   
<img alt="XKCD 2180 spreadsheets
    [Cueball is at his computer. In the air on either side of him are an angel version of Cueball, with a halo and wings, and a devil version of Cueball, with horns and a pitchfork. The angel's dialogue appears in regular print, while the devil's dialogue appears in white print in black speech balloons.]
    Angel: Don't use a spreadsheet! Do it right.
    Devil: But a spreadsheet would be so easy.
    Angel: In the long run you'll regret it!
    [Closeup on Cueball, the angel, and the devil.]
    Angel: Take the time to write real code.
    Devil: Just paste the data! Tinker until it works!
    Devil: Build a labyrinth of REGEXREPLACE() and ARRAYFORMULA()!
    Devil: Feel the power!
    [Closeup on the devil.]
    Angel (off-panel): Fight the temptation!
    Devil: Ever tried QUERY() in Google Sheets? It lets you treat a block of cells like a database and run SQL queries on them.
    [Another shot of Cueball at his computer with the angel and devil at either side.]
    Angel: Don't listen to
    Angel: ...wait, really?
    Devil: Yes, and let me tell you about IMPORTHTML()...
    Angel: Oooh..." align="center" style="width:70%" src="https://imgs.xkcd.com/comics/spreadsheets.png">

You've spent pretty much all your undergraduate career (in SGEES anyway) without much programming, so why do you need to learn now?

**Reproducibility**, **Safety**,  **Speed**,  **Complexity**,  **Working with large data** 

## Why programming (what's wrong with Excel!?)

- **Reproducibility:** If someone can't replicate your 
  work, why should we trust it to be true?
- **Safety:** Your data and your processing should not
  overlap.  Your raw data should be sacred. [Poor use of Excel resulted in the loss of 16,000 COVID-19 records in the UK.](https://www.theguardian.com/politics/2020/oct/05/how-excel-may-have-caused-loss-of-16000-covid-tests-in-england)
- **Speed:** You want a result, and you want it yesterday... Learn how to write good code 
    (and change the clock-speed on your computer) and you can...
- **Complexity:** Being able to solve complex problems logically, in a way that others can follow
    (and reproduce) is essential to natural sciences. *Hint: Writing good code is as much about the*
    *quality of your documentation as it is about the quality of your code*.
- **Data scale:** Data in the natural sciences is noisy and large. Ideally, to understand the natural world
    we would have data from every place at every time throughout the Earth. We don't have that, but
    our datasets are growing...

## Data scale: seismology bias

Seismographs deployed for:
- long durations (multi-year);
- multiple locations;
- modest sampling rates.

For example: [SAMBA](http://ds.iris.edu/mda/9F/?starttime=2008-01-01T00:00:00&endtime=2020-12-31T23:59:59),
the Southern Alps Microearthquake Borehole Array around Aoraki Mt. Cook, has been recording since 2008.

<img alt="Panorama including a SAMBA site (COVA) in the mountains" align="right" style="width:100%" src="images/COVA_pano.jpg">

SAMBA records at 200 Hz (200 samples per second). How many seconds are there per day?

In [None]:
seconds_per_day = 60 * 60 * 24
print(f"There are {seconds_per_day} seconds in a day")

How many samples per day?

In [None]:
sampling_rate = 200.0
samples_per_day = seconds_per_day * sampling_rate
print(f"SAMBA records {samples_per_day} samples per day")

So, more than 17 million samples per day.  But that is just for one channel: SAMBA seismographs
have three channels (one vertical and two horizontal), so how many samples per day for one station?

In [None]:
number_of_channels = 3
samples_per_day_per_station = samples_per_day * number_of_channels
print(f"One station records {samples_per_day_per_station} samples per day")

Nearly 52 million samples per day per station. SAMBA is made up of 13 stations, so our dataset gets bigger
still:

In [None]:
number_of_stations = 13
samples_per_day_total = samples_per_day_per_station * number_of_stations
print(f"SAMBA records {samples_per_day_total} samples per day")

Nearly 674 million samples per day across the network. So what is that over the first 10 years of operation?

In [None]:
days_per_year = 365.25 # Roughly
samples_per_year = days_per_year * samples_per_day_total
print(f"SAMBA records {samples_per_year} samples per year.")
samples_per_decade = samples_per_year * 10
print(f"In 10 years of operation SAMBA recorded {samples_per_decade:e} samples")

Roughly 2.5 trillion samples.

Try working with that in a spreadsheet...
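
To put that number in context, the sketch below repeats the calculation as one reusable function (a small example of *DRY*) and compares the result with the 1,048,576-row limit of a modern Excel worksheet. The function name `samples_recorded` and its parameters are made up for this illustration, and storing one sample per row is only an assumption about how the data might be laid out.

```python
import math

EXCEL_MAX_ROWS = 1_048_576  # Row limit of a modern Excel worksheet (.xlsx)


def samples_recorded(sampling_rate, n_channels, n_stations, n_days):
    """Total number of samples recorded by a network (hypothetical helper)."""
    seconds_per_day = 60 * 60 * 24
    return sampling_rate * seconds_per_day * n_channels * n_stations * n_days


# Same numbers as the cells above: 200 Hz, 3 channels, 13 stations, 10 years
samba_decade = samples_recorded(
    sampling_rate=200, n_channels=3, n_stations=13, n_days=365.25 * 10)

# Assumption: one sample per row, every worksheet filled completely
worksheets_needed = math.ceil(samba_decade / EXCEL_MAX_ROWS)
print(f"{samba_decade:.2e} samples would fill {worksheets_needed:,} worksheets")
```

Because the calculation lives in one function, changing the number of stations or the sampling rate means editing one argument rather than re-running a chain of cells.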

And that is barely scratching the surface of dataset sizes in Earth Science. Consider the thousands of seismic stations around the globe or, more recently, the development of distributed acoustic sensing (DAS) fibre-optic technology, which generates **terabytes of data per day.**
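
For a rough sense of how sample counts turn into storage, the sketch below converts the 10-year SAMBA total into bytes. The 4 bytes per sample is an assumption (an uncompressed 32-bit number); real seismic formats often compress the data, so treat the result as an order-of-magnitude estimate only.

```python
BYTES_PER_SAMPLE = 4  # Assumption: each sample is an uncompressed 32-bit number

# 200 Hz * seconds per day * 3 channels * 13 stations * days per year * 10 years
samples_per_decade = 200 * 60 * 60 * 24 * 3 * 13 * 365.25 * 10

bytes_per_decade = samples_per_decade * BYTES_PER_SAMPLE
terabytes_per_decade = bytes_per_decade / 1e12  # 1 TB = 10**12 bytes
print(f"Roughly {terabytes_per_decade:.1f} TB for 10 years of SAMBA data")
```

Under that assumption a decade of SAMBA data is on the order of ten terabytes, a volume that a system producing terabytes per day would reach in days to weeks.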

## Why Python?

Our first notebook covers this a little more, but the key things for me are:

1. Open-source, community driven (often free) software;
2. Simple syntax, so you can make (and fix) mistakes quickly, with helpful error messages;
3. Community libraries to do lots of complex tasks 
   (e.g. [ObsPy](https://github.com/obspy/obspy/wiki) for seismology);
4. Widely used: Python is in the top three most popular programming languages according to [TIOBE](https://www.tiobe.com/tiobe-index/), and possibly the most popular *interpreted* language.
    1. There is a wealth of information online from others.
    2. Python is useable across a huge range of tasks.


# How to use these notebooks:

There is a series of Jupyter notebooks here.  You can run them interactively in your
browser.  Run through them, change some values, see what works, and experiment
with the variables.  Sections that you are expected to fill in are
marked as **Exercise:**.  Shout out if and when you have problems.

Let us know if you want to play around with any other data.

Remember that this course is supposed to be a brief look over some of the key ideas in Python
and some useful libraries; it is not complete!

## Getting set up on a CO501 computer
1. Log into your Linux account
2. Open a terminal and set up your conda virtual environment
```bash
startconda esci451
```
3. Clone the notebook repository
```bash
git clone https://github.com/calum-chamberlain/ESCI451-Python.git
```
4. Start Jupyter:
```bash
cd ESCI451-Python  # Or wherever you made the folder containing the notebooks
jupyter notebook
```
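
Once Jupyter is running, a quick check like the one below, run in a new notebook cell, shows which Python you are using and whether some libraries can be found. The package names listed are assumptions about what the `esci451` environment contains; edit the list to match what you expect to use.

```python
import importlib.util
import sys

# Report the Python version and where the interpreter lives
print(f"Python {sys.version.split()[0]} running from {sys.executable}")

# Assumed package names: change these to the libraries you expect to use
packages = ["numpy", "matplotlib", "obspy"]

# find_spec returns None when a top-level package cannot be found
status = {name: importlib.util.find_spec(name) is not None for name in packages}
for name, found in status.items():
    print(f"{name}: {'found' if found else 'NOT found'}")
```

If a package is reported as not found, check that the conda environment was activated in the terminal before you started Jupyter.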