<a href="https://colab.research.google.com/github/ds4geo/ds4geo/blob/master/DS4GEO_L1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Science for Geoscientists - Winter Semester 2020**
# **Session 1**

Welcome to Python running in a Jupyter Notebook in Google Colab!
Python, Jupyter, Colab and these notebooks will be explained fully later.
Notebooks like this comprise the lecture notes and practical exercises for the course.

To make the notebook interactive and run the Python code, click the "Open in Colab" button at the top.


# Part 1.1 - Data Ice-breaker - PPT
???


# Part 1.2 - Course Introduction - PPT

*   Who am I?
*   What is data science and why is it important?
*   Who are you?
*   Structure of Session 1





# Part 1.3 - Super Simple Python Plotting Example - *Walkthrough*

No theory until we have seen at least one simple Python script in action!

We are going to load, view and plot the famous Lisiecki and Raymo (2005) "LR04" benthic d18O stack with just a few lines of Python code.

Below is a "code cell".
Follow the instructions in the comments (lines starting with #).


In [None]:
# Click here and run the code cell by pressing the play button, or ctrl+enter
# Import some libraries with useful functionality
import pandas as pd # For loading and manipulating data
import matplotlib.pyplot as plt # For plotting data

In [None]:
# Load the LR04 stack from a CSV file hosted in the course GitHub repository using pandas
dat_LR04 = pd.read_csv(r"https://raw.githubusercontent.com/ds4geo/ds4geo/master/data/timeseries/LR04stack_short.csv")

In [None]:
# Display an overview of the loaded data
dat_LR04

In [None]:
# Plot the data using matplotlib
plt.plot(dat_LR04["Time"], dat_LR04["d18O"])

# Part 1.4 - A Short Introduction to Python - PPT

*Short notes here*

# Part 1.5 - A Short Introduction to Jupyter Notebooks and Google Colab

**Jupyter**

In this course, the notes and most of the assignments are Jupyter Notebooks. Jupyter lets you combine text, images, code and code outputs in a single document, making it perfect for data storytelling.

For more detail, see:
* https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/What%20is%20the%20Jupyter%20Notebook.html
* https://www.earthdatascience.org/courses/intro-to-earth-data-science/open-reproducible-science/jupyter-python/

**Google Colab(oratory)**

Google Colab is a web-based system built on Jupyter Notebooks, in which the Python code is executed on Google cloud servers. We use it in this course to avoid having to set up Python on personal computers (but you are encouraged to do this in your own time).

For more detail, see:
* https://research.google.com/colaboratory/faq.html
* https://colab.research.google.com/notebooks/basic_features_overview.ipynb (use this as reference later for explanation of many useful features like code completion)


# Part 1.6 - Plotting LR04

The above example uses simplified data and the most basic plotting.
Now we will load data from the actual data file published by Lisiecki and Raymo (2005), and we will improve the plotting. To do this we will need to use additional arguments and functions.
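
Two of the additional `pd.read_csv` arguments used in this part are `sep` and `header`. The sketch below shows how they behave on a small made-up file held in memory. The file contents and the name `dat_example` are assumptions for illustration only and are not the real LR04 data.

```python
import io
import pandas as pd

# Made-up file contents for illustration only (NOT the real LR04 data)
lines = [
    "Example stack file",          # non-blank line 0
    "Values below are invented",   # non-blank line 1
    "",                            # blank line: skipped by pd.read_csv by default
    "Units: ka and per mil",       # non-blank line 2
    "Time (ka)\td18O\terror",      # non-blank line 3, so header=3
    "0\t3.23\t0.03",
    "1\t3.25\t0.04",
    "2\t3.18\t0.03",
]
fake_file = io.StringIO("\n".join(lines))

# sep="\t" splits columns on tabs; header=3 selects the fourth non-blank line
dat_example = pd.read_csv(fake_file, sep="\t", header=3)
print(dat_example.columns.tolist())
print(dat_example.shape)
```

Changing `header` to another value and re-running is a quick way to see how the column names depend on it.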


In [None]:
'''
Multi-line comments can be written as a string enclosed in triple quotes.

Load the original LR04 stack from a tab delimited txt file
Check the data file here: https://github.com/ds4geo/ds4geo/blob/master/data/timeseries/LR04stack.txt

You'll notice the column headings are on line/row 5 and the data starts on row 6.
Therefore we need to tell pd.read_csv which row the headings are on.
We can do this by adding a keyword argument called "header".
pd.read_csv ignores blank lines by default, and Python starts counting at 0, so we pass header=3

The delimiter of this file is tab, not comma, so we also need to use the sep argument with the value "\t" for tab.

'''
# Read the original LR04 stack
dat_LR04 = pd.read_csv(r"https://raw.githubusercontent.com/ds4geo/ds4geo/master/data/timeseries/LR04stack.txt", sep="\t", header=3)


In [None]:
# Display an overview of the loaded data. The data is stored as a pandas DataFrame
dat_LR04

In [None]:
# The long column names are annoying to type out while coding, so we can rename them if we want.
# The DataFrame object dat_LR04 has an attribute called columns.
# We can use the core Python print function to display objects in the cell output (or terminal)
print(dat_LR04.columns)

# We can overwrite the column names by assigning a list of new ones to the columns attribute.
dat_LR04.columns = ["Time", "d18O", "error"]

# Print the column names again to see the change
print(dat_LR04.columns)
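
Assigning to `columns` replaces every name at once. As a hedged alternative, the pandas `rename` method changes only the columns you list. The sketch below uses a made-up DataFrame; the names `dat_example` and `dat_renamed` and the column headings are assumptions for illustration.

```python
import pandas as pd

# Made-up values and column names for illustration only
dat_example = pd.DataFrame({
    "Time (ka)": [0, 1, 2],
    "Benthic d18O (per mil)": [3.23, 3.25, 3.18],
})

# rename returns a new DataFrame; columns not listed keep their names
dat_renamed = dat_example.rename(columns={"Benthic d18O (per mil)": "d18O"})
print(dat_renamed.columns.tolist())
```

Because `rename` returns a new DataFrame, `dat_example` itself keeps its original column names.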

In [None]:
# Plot the data with matplotlib
plt.plot(dat_LR04["Time"], dat_LR04["d18O"])

In [None]:
# That shows the whole record which isn't easy to read
# Now we add some basic plotting controls: axes limits

# Plot the data
plt.plot(dat_LR04["Time"], dat_LR04["d18O"])

# Limit the X axis to the last 140 ka
plt.xlim(0, 140)
# Limit the Y axis to a sensible range
plt.ylim(3, 5.5)

In [None]:
# This record is often shown flipped vertically so warm interglacial periods are "up"
# Sometimes it is also shown flipped horizontally so time goes left to right

# Plot the data
plt.plot(dat_LR04["Time"], dat_LR04["d18O"])

# Flip the X axis by reversing the order of the limits
plt.xlim(140, 0)
# Flip the Y axis in the same way
plt.ylim(5.5, 3)
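
Reversing the limits works when you already know the range you want. As a sketch of an alternative, matplotlib axes objects also have `invert_xaxis` and `invert_yaxis` methods, which flip an axis without choosing explicit limits. The values below are made up for illustration and are not real d18O data.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only (not real d18O data)
time = [0, 20, 40, 60, 80, 100, 120, 140]
d18O = [3.2, 4.9, 4.6, 4.5, 4.4, 4.2, 3.3, 4.8]

plt.plot(time, d18O)

# plt.gca() returns the axes object that plt.plot has just drawn on
ax = plt.gca()

# Flip both axes while keeping the automatically chosen ranges
ax.invert_xaxis()
ax.invert_yaxis()
```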

# Part 1.7 - Guided Plotting of NGRIP oxygen isotope record

Try plotting the NGRIP oxygen isotope record. Each step is prepared, but you will need to fill in the correct arguments where you see the $ symbol. You should also add short comments for each line/section.

The dataset can be found here:

For viewing: https://github.com/ds4geo/ds4geo/blob/master/data/timeseries/NGRIP_chronology_20.tab

Raw version for loading: https://raw.githubusercontent.com/ds4geo/ds4geo/master/data/timeseries/NGRIP_chronology_20.tab

In [None]:
# $
# dat_NGRIP = pd.read_csv("https://raw.githubusercontent.com/ds4geo/ds4geo/master/data/timeseries/NGRIP_chronology_20.tab", sep="\t", header=20)
dat_NGRIP = pd.read_csv($, sep=$, header=$)

In [None]:
# $
dat_NGRIP

In [None]:
# $
# dat_NGRIP.columns = ["age", "depth", "error", "d18O"]
dat_NGRIP.columns = [$]

In [None]:
# $
# plt.plot(dat_NGRIP["age"], dat_NGRIP["d18O"])
# plt.xlim(40, 30)
plt.plot($)
plt.xlim($)

# Part 1.8 - Plotting Alpine flood record from scratch

Using the previous examples, plot the Wirth et al. 2013 Alpine flood record.

For viewing: https://github.com/ds4geo/ds4geo/blob/master/data/timeseries/Alps_flood_de.txt

Raw version for loading: https://raw.githubusercontent.com/ds4geo/ds4geo/master/data/timeseries/Alps_flood_de.txt

Beware, I've added a quirk in the data which will require you to use an additional argument when loading the data. Create an empty code cell and type `pd.read_csv?` and run the cell to see the `pd.read_csv` documentation.

Also, try to add some important features to your plot such as a legend and axes labels. To find useful functions, type `plt.` and wait a second. A scrollable list of available functions/methods will appear.
See also here: https://colab.research.google.com/notebooks/basic_features_overview.ipynb#scrollTo=d4L9TOP9QSHn

Remember to add comments to your code!
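
As a minimal sketch of the plot features mentioned above, the cell below adds a legend and axis labels to two made-up series. The data and label text are assumptions for illustration only, so replace them with the flood record columns.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only
age = [0, 1, 2, 3, 4]
series_a = [2, 3, 1, 4, 2]
series_b = [1, 2, 2, 3, 5]

# The label argument sets the text shown for each line in the legend
plt.plot(age, series_a, label="series A")
plt.plot(age, series_b, label="series B")

# Draw the legend and label both axes
plt.legend()
plt.xlabel("age (made-up units)")
plt.ylabel("value (made-up units)")
```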


In [None]:
# dat_flood = pd.read_csv("", sep=";", header=9, decimal=",")
# dat_flood.columns = ["age", "north", "south"]
# plt.plot(dat_flood["age"], dat_flood["north"], label="north")
# plt.plot(dat_flood["age"], dat_flood["south"], label="south")
# plt.legend()
# plt.xlabel("age (ka)")
# plt.ylabel("")
# plt.xlim(10,0)

# Part 1.9 - Python Objects

Python has a number of built-in object types. More are provided by imported libraries, and it is possible to create new types.
The type of an object determines how/what data is stored and what can be done with the object.
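
A short sketch of how the type decides which operations are allowed. The variable names are chosen only for this illustration.

```python
number = 5
text = "Geology "

# Multiplying a string by an int repeats the string
repeated = text * 2

# Adding a string and an int is not defined, so Python raises a TypeError
try:
    text + number
    message = "no error"
except TypeError as err:
    message = f"TypeError: {err}"

# Converting the int to a string first makes the operation valid
combined = text + str(number)

print(repeated)
print(message)
print(combined)
```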

In [None]:
# Create object/variable "a" and assign integer value 5
a = 5
# See which type "a" is
type(a)

In [None]:
# Add an int to an int
a + 2

In [None]:
# Create object "b" and assign float value 2.5
b = 2.5
# See which type "b" is
type(b)

In [None]:
# Add an integer and a float
a+b

In [None]:
# Create some strings
c = "Geology "
d = "Rocks!"
type(c)

In [None]:
# Concatenate strings with the + operator
c+d

In [None]:
# Create a list out of the objects we just made
e = [a, b, c, d]
type(e)

In [None]:
# View e
e

In [None]:
# Get help and info on an object:
e?

In [None]:
# Objects have methods depending on their type. Methods are functions that
# belong to an object and act on it.
# Use dir to list object methods. Ignore those of the form __x__ for now
dir(e)

In [None]:
# See help on "append" method
e.append?

In [None]:
# Try appending something
e.append("(Not really!)")
e

In [None]:
# That's not right, perhaps the "remove" method can help:
e.remove("(Not really!)")
e
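
Note that list methods such as `append` and `remove` change the list itself rather than returning a new one. A minimal sketch, using a made-up list named `rocks`:

```python
rocks = ["granite", "basalt"]

# append modifies the list in place and returns None
result = rocks.append("gneiss")
print(result)
print(rocks)

# To leave the original list unchanged, build a new list instead
more_rocks = rocks + ["schist"]
print(rocks)
print(more_rocks)
```

A common mistake is writing `rocks = rocks.append("gneiss")`, which replaces the list with `None`.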

For more detail see:

*   https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.01-Understanding-Data-Types.ipynb
*   https://medium.com/analytics-vidhya/data-types-in-python-506009234f89



# Part 1.10 - Advanced/Object Based Plotting

Matplotlib provides a more advanced, object-based system for plotting and controlling figures. You explicitly create figure and axes objects and use their methods to control all aspects of the figure.

**Task**:
Create a plot with both the LR04 and NGRIP records with a shared x axis but separate y axes, and view only the time period covered by both records.

Try using `plt.subplots` to create a figure and axis which you can then use for plotting. `ax.twinx` will also be useful.

Use code completion and the help documentation. The help documentation usually contains a useful examples section.

Remember to comment your code!

In [None]:
# Create a figure containing a single axes object
fig, ax = plt.subplots()
# Create a second axis in the same location and with a linked x scale with ax.twinx()
ax2 = ax.twinx()

ax.plot(dat_NGRIP["age"], dat_NGRIP["d18O"], color="orange", label="NGRIP")
ax2.plot(dat_LR04["Time"], dat_LR04["d18O"], color="blue", label="LR04")
ax.set_xlim(0,45) # x axes are linked, so only need to set one of them
ax2.set_ylim(5.5,3)
# Label the axes and add a title
ax.set_xlabel("time (ka)")
ax.set_ylabel("NGRIP ice d18O")
ax2.set_ylabel("LR04 benthic d18O")
ax.set_title("Comparison of LR04 and NGRIP ice d18O for mid-MIS3 to present")
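
The `label` arguments passed to `plot` are only displayed once a legend is created. With twin axes, each axes object holds its own lines, so one possible approach (a sketch, not the only way) is to collect the handles and labels from both and draw a single legend. The data below is made up for illustration.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only
age = [0, 10, 20, 30, 40]
ice = [-35, -36, -41, -43, -40]
benthic = [3.2, 3.9, 4.9, 4.7, 4.5]

fig, ax = plt.subplots()
ax2 = ax.twinx()
ax.plot(age, ice, color="orange", label="ice core (made up)")
ax2.plot(age, benthic, color="blue", label="benthic stack (made up)")

# Collect the line handles and labels from both axes
handles1, labels1 = ax.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()

# Draw one legend containing the entries from both axes
ax2.legend(handles1 + handles2, labels1 + labels2, loc="upper right")
```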

# Part 1.11 - Week 1 Assignment

This course is all about working with data. Example data is provided for each class, but you will be expected to try out what you learn on datasets of your choice during class, and you will need such datasets for the homework assignments and main project.

**Task**

Find and submit 3 datasets for shared use by the rest of the class:
* At least 1 should be a timeseries like the NGRIP or LR04 datasets we used this week, where the data is ordered by time, depth, distance, etc. (Geoscience is full of these!)
* At least 1 should be non-ordered, for example, a database of lakes with metadata (e.g. area, depth, etc.), tephra REE chemistries, data on fault plane orientations, etc. This data may have a spatial component but does not have to.
* Datasets should be freely available on the internet, OR be your own data which you are willing to share for use in this course.
* If sharing your own data, file sizes should be at most 25 MB (you can split the dataset into multiple files if required, or make a subset).
* It should be possible to load the data easily with pandas (e.g. csv, excel, etc.) or geopandas (shapefile, geojson, gpkg).
* The topic of the data is open - it does not need to be geoscience related.
* Large, interesting and complex datasets make the course more interesting for everyone involved.
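
One quick way to check that a candidate dataset loads easily with pandas is to read it and inspect its shape and column types. The sketch below uses a made-up lake table held in memory; the contents and the name `dat_check` are assumptions for illustration, and in practice you would pass your own file path or URL to `pd.read_csv`.

```python
import io
import pandas as pd

# Made-up lake table standing in for a candidate dataset (illustration only)
fake_csv = io.StringIO(
    "lake,area_km2,max_depth_m\n"
    "Lake A,12.5,80\n"
    "Lake B,3.1,25\n"
)
dat_check = pd.read_csv(fake_csv)

print(dat_check.shape)   # number of rows and columns
print(dat_check.dtypes)  # numeric columns should show a numeric type such as int64 or float64
print(dat_check.head())  # first few rows
```

If a column of numbers is not reported with a numeric type, the file probably needs extra arguments when loading.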

**Submission**

* Data should be submitted by sending a link (to public dataset or e.g. shareable Google Drive link) to x@uibk.ac.at
* The **deadline** is 23:59 on 13th October 2020.
* This assignment comprises 5% of the assessment for the course. Full marks are awarded for providing 3 datasets that fulfil the above criteria; otherwise no marks are given.


**Useful sources of geoscience data**

* http://www.pangaea.de
* Supplementary data from journal articles
* Google!

**Useful sources of other data**

* https://inspire-geoportal.ec.europa.eu/ (spatial data)
* https://data.europa.eu/
* https://www.data.gv.at/
* https://datasetsearch.research.google.com/
* lists from https://www.wikipedia.org (convert to csv files using: https://wikitable2csv.ggor.de/)


# Part 1.12 - Course Structure
https://github.com/ds4geo/ds4geo/blob/master/Course%20Overview

# References


*   LR04: Lisiecki and Raymo (2005)
*   NGRIP
*   Alpine flood: Wirth et al. (2013)


