<a href="https://colab.research.google.com/github/ds4geo/ds4geo/blob/master/WS%202020%20Course%20Notes/Session%202.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Science for Geoscientists - Winter Semester 2020**
# **Session 2 - Python Basics for Data Science - 14th October 2020**

In the previous session, we handled data in a very simple way using the Pandas library. In this session we will introduce some basic python object types for handling data, and learn how to index/slice data (extract only certain parts of the data/object). Specifically, we will cover lists and dictionaries. We will also introduce the two most important python flow controls: if statements and for loops.





# Part 2.1 - Data Icebreaker - Data Skepticism - *Discussion*
Discussion: The United States is the largest exporter of agricultural products in the world. Which country is second? Discuss why.

# Part 2.2 - Student Submitted Datasets - Discussion

Overview of submitted datasets here: https://github.com/ds4geo/ds4geo/blob/master/student_submitted_data/Assignment%201%20data%20catalogue.csv

Uploaded data here: https://github.com/ds4geo/ds4geo/blob/master/student_submitted_data/

Notable mentions:
* Interesting data formats
 * SQL database
 * Shapefiles (geospatial)
* Non-geological data
 * Meteorological
 * US politics
 * Archaeological site data
 * Greenland sociology data
* Geological data:
 * Rock chemistries
 * Tectonic/structural data
 * All sorts of sediment core data
 * Surface processes and monitoring data
 * Much more


# Part 2.3 - Python Objects - *Walkthrough*

Python has a number of in-built object types. More are provided by imported libraries, and it is possible to create new types.
The type of an object determines how/what data is stored and what can be done with the object.

In [30]:
# Create object/variable "a" and assign integer value 5
a = 5
# See which type "a" is
type(a)

int

In [31]:
# Add an int to an int
a + 2

7

In [32]:
# Create object "b" and assign float value 2.5
b = 2.5
# See which type "b" is
type(b)

float

In [33]:
# Add an integer and a float
a+b

7.5

In [34]:
# Create some strings
c = "Geology "
d = "Rocks!"
type(c)

str

In [35]:
# Concatenate strings with the + operator
c+d

'Geology Rocks!'

In [36]:
# Create a list out of the objects we just made
e = [a, b, c, d]
type(e)

list

In [37]:
# View e
e

[5, 2.5, 'Geology ', 'Rocks!']

In [38]:
# Get help and info on an object:
e?

In [39]:
# Objects have methods depending on their type. Methods are functions applied on
# an object.
# Use dir to list object methods. Ignore those with form __x__
dir(e)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

In [40]:
# See help on "append" method
e.append?

In [41]:
# Try appending something
e.append("(Not really!)")
e

[5, 2.5, 'Geology ', 'Rocks!', '(Not really!)']

In [42]:
# That's not right, perhaps the "remove" method can help:
e.remove("(Not really!)")
e

[5, 2.5, 'Geology ', 'Rocks!']

In [None]:
# Overview of built-in types
# Most common:
my_list = [1,2,3] # list
my_string = "geology rocks" # string
my_int = 1 # integer
my_float = 1.5 # float
my_bool = True # boolean
my_tuple = (1,2,3) # tuple
my_dict = {"key": 2} # dictionary
my_set = {1,2,3} # set

# Code related types
import pandas as pd
pd # module
pd.read_csv # function
"geology rocks".strip # method or function

For more detail see:

*   Cheat sheet - https://www.datacamp.com/community/tutorials/python-data-science-cheat-sheet-basics
*   Good visualisation about lists - https://medium.com/@meghamohan/mutable-and-immutable-side-of-python-c2145cf72747
*   Colab notebook examples - https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.01-Understanding-Data-Types.ipynb
*   Python docs on built-in types and their methods - https://docs.python.org/3/library/stdtypes.html
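
As a small illustration of how an object's type determines what can be done with it, the sketch below (not part of the original walkthrough; the variable names are made up for this example) tries to add a string to an integer, and then converts between types with the built-in `int`, `float` and `str` functions.

```python
# A number stored as text is a string, not an int
depth_text = "5"
depth = 5

# Adding a str to an int is not allowed, so python raises a TypeError
try:
    depth_text + depth
    mixed_add_worked = True
except TypeError:
    mixed_add_worked = False
print("str + int worked:", mixed_add_worked)

# Convert explicitly between types with the built-in functions
as_int = int(depth_text)      # "5" -> 5
as_float = float(depth_text)  # "5" -> 5.0
as_text = str(depth)          # 5 -> "5"
print(as_int + depth, as_float + depth, as_text + depth_text)

# isinstance checks whether an object is of a given type
print(isinstance(depth, int), isinstance(depth_text, int))
```

Adding two strings concatenates them, which is why the last value printed on the second line is the text `55` rather than the number 10.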



# Part 2.4 - Lists, Dictionaries and Indexing - *Walkthrough*

Lists and dictionaries are built-in python objects useful for storing and handling data.

## 2.4.1 - Lists
Python lists are ordered collections of other python objects, separated by commas. They are defined by square brackets [ ]

In [43]:
a = [1,2,3] # List of integers
print("a:", a)

a: [1, 2, 3]


In [44]:
b = [1.5, 2.5, 3.5] # List of floats
print("b:", b)

b: [1.5, 2.5, 3.5]


In [45]:
# Lists can contain different types
c = [1, "data", 2.5]
print("c:", c)

c: [1, 'data', 2.5]


In [46]:
# Including other lists (nested)
d = [[1,2,3], [4,5,6]]
print("d:", d)

d: [[1, 2, 3], [4, 5, 6]]


In [47]:
e = [a, b]
print("e:", e)

e: [[1, 2, 3], [1.5, 2.5, 3.5]]


In [48]:
# They can contain any other python objects
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
f = [pd, np, plt] # But there's no reason to actually do this
print("f:", f)

f: [<module 'pandas' from '/usr/local/lib/python3.6/dist-packages/pandas/__init__.py'>, <module 'numpy' from '/usr/local/lib/python3.6/dist-packages/numpy/__init__.py'>, <module 'matplotlib.pyplot' from '/usr/local/lib/python3.6/dist-packages/matplotlib/pyplot.py'>]


In [49]:
# Earlier, you'll recall dir() can be used to find methods on objects
a = [1, 2, 3]
# Append adds a new item to the end of a list
a.append(4)
print("a:", a)

a: [1, 2, 3, 4]


In [50]:
# Extend joins two lists *in place*
a.extend(b) # notice we don't assign a result
print("a:", a)

a: [1, 2, 3, 4, 1.5, 2.5, 3.5]


In [51]:
# The + operator also joins two lists, but not *in place*: it returns a new list
h = a + b
print("h:", h)

h: [1, 2, 3, 4, 1.5, 2.5, 3.5, 1.5, 2.5, 3.5]


In [52]:
# sort does what it suggests, in place
a.sort()
print("a:", a)

a: [1, 1.5, 2, 2.5, 3, 3.5, 4]


In [None]:
# Tuples are another type very similar to lists except they can't be modified
# i.e. you cannot append to a tuple
# They are defined by parentheses ( ) instead of [ ]
a_tuple = (1, 2, 3)
print("a_tuple:", a_tuple)

# The specific reasons for using tuples are complex.
# You will see them in documentation, but usually you can just use a list
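
A minimal sketch of the difference in practice (an illustration added here, with made-up variable names): modifying a tuple raises an error, while the same operation on a list works. Tuples tend to appear wherever a fixed group of values belongs together, such as a coordinate pair.

```python
site_list = [50.59671, -3.98289]   # list: can be changed
site_tuple = (50.59671, -3.98289)  # tuple: cannot be changed

site_list[0] = 51.0  # works, lists are mutable

try:
    site_tuple[0] = 51.0  # fails, tuples are immutable
    tuple_changed = True
except TypeError:
    tuple_changed = False
print("tuple changed:", tuple_changed)

# A tuple can be "unpacked" into separate variables
lat, lon = site_tuple
print(lat, lon)

# To get a modifiable copy, convert the tuple to a list
editable = list(site_tuple)
editable.append(350)  # e.g. a hypothetical elevation value
print(editable)
```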

## 2.4.2 - Dictionaries
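Python dictionaries are collections of pairs known as keys and values. (They were unordered in older versions of python; since Python 3.7 they keep the order in which items were added.) They function like language dictionaries where you look up a word (the key) and see its definition or translation (the value).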
They are defined with braces { }, separated by commas, and colons : indicate the key-value relationships.

In [66]:
# Create a simple German to English language dictionary
De2En = {"Bier": "Beer", "Wurst": "Sausage"}
print(De2En)

{'Bier': 'Beer', 'Wurst': 'Sausage'}


In [67]:
# When making lists and dictionaries, you can wrap between lines for readability:
De2En = {"Bier": "Beer",
         "Wurst": "Sausage",
         "Kren": "Horseradish"}
print(De2En)

{'Bier': 'Beer', 'Wurst': 'Sausage', 'Kren': 'Horseradish'}


In [68]:
# Values can be any python object, e.g. lists:
rocks = {"igneous": ["Granite", "Basalt", "Rhyolite"],
         "Sedimentary": ["Sandstone", "Limestone"]}

In [69]:
# Keys can be some python objects (int, float, string, tuple), but not others (lists or dicts)
# Keys and values do not all have to be the same type
complex_dict = {0: "zero",
                "one point 5": 1.5,
                2.5: ["two", "point", "five"]}
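
Dictionaries also have methods for inspecting their contents. The sketch below creates its own small dictionary (`rock_types`, a made-up example) so that it runs on its own, and shows the standard `keys`, `values`, `items` and `get` methods along with the `in` operator.

```python
rock_types = {"Granite": "igneous",
              "Sandstone": "sedimentary",
              "Gneiss": "metamorphic"}

# keys, values and key-value pairs (converted to lists for printing)
print(list(rock_types.keys()))
print(list(rock_types.values()))
print(list(rock_types.items()))

# "in" tests whether a key exists
print("Granite" in rock_types)
print("Basalt" in rock_types)

# get returns a default instead of raising a KeyError for a missing key
print(rock_types.get("Basalt", "unknown"))
```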

In [70]:
# Dictionaries can also be nested like lists.
# Note the nesting is multi-line and aligned to improve readability
rock_dict = {"granite": {"type": "igneous",
                         "composition": {"quartz": 0.5,
                                         "feldspar": 0.2},
                         "locations": [(50.59671,-3.98289),
                                       (50.59591,-4.61987)]},
             "sandstone": {}}

## 2.4.3 - List and Dictionary Indexing
You can select objects/data from lists and dictionaries using square brackets [ ].
List indexing is based on numeric positions, while dictionary indexing is based on its keys.

**Note** python positional indexing (for lists, numpy, pandas, etc) always starts at 0. i.e. the first item is 0. This might seem counter intuitive at first, but when combined with some other features of python, it actually simplifies code in many situations!

In [53]:
# Remind ourselves what is in variable "c"
print(c)

[1, 'data', 2.5]


In [54]:
# Print positions 0, 1 and 2 of list "c"
print(c[0])
print(c[1])
print(c[2])

1
data
2.5


In [55]:
# If we try to index a position beyond the size of the list, we get an index error
# Uncomment this line to try it.
# (It is commented out so you can run all the code in this notebook without generating an error)
#print(c[3])

IndexError: ignored

In [56]:
# List indexing also works with negative numbers in reverse, with -1 being the last index
print(c)
print(c[-1]) # the last item in c

[1, 'data', 2.5]
2.5


In [57]:
# With nested objects, indexing can be stacked with sets of square brackets [ ][ ]
print(d)
print(d[1])
print(d[1][2])

[[1, 2, 3], [4, 5, 6]]
[4, 5, 6]
6


In [58]:
# Indexing tuples and strings works exactly the same way
print(a_tuple)
print(a_tuple[0])

(1, 2, 3)
1

In [59]:
print(c)
print(c[1])
print(c[1][2])

[1, 'data', 2.5]
data
t


In [60]:
# For lists, tuples and strings (and numpy - see later), ranges also work.
# Ranges are "half-open", i.e. include the first index, but not the last.
# This is so when you use a range of e.g. 2:4, you get a result of length 2, despite indexing starting at 0
print(a)
print(a[2:4])

[1, 1.5, 2, 2.5, 3, 3.5, 4]
[2, 2.5]


In [61]:
# Leaving out a number:
# Everything from position 2 onwards:
print(a[2:])

[2, 2.5, 3, 3.5, 4]


In [62]:
# Everything before position 4:
print(a[:4])

[1, 1.5, 2, 2.5]


In [63]:
# A third number can be added to define the step.
# for example, take every 2nd item from 0 to 5th position of a:
print(a[0:5:2])

[1, 2, 3]


In [64]:
# Or take every second item from the whole list:
print(a[::2])

[1, 2, 3, 4]
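
The half-open convention has some convenient consequences, sketched below with a made-up list `letters` (not from the original notebook): the length of a slice is the difference between its two indices, and splitting a list at any position gives two parts which join back into the original. A negative step reverses the order.

```python
letters = ["a", "b", "c", "d", "e", "f"]

# Length of a slice i:j is j - i
print(len(letters[2:5]))

# Splitting at position k gives two parts that re-join to the original
k = 2
first_part = letters[:k]
second_part = letters[k:]
print(first_part, second_part)
print(first_part + second_part == letters)

# A step of -1 runs through the list backwards
print(letters[::-1])

# Unlike single indexing, a slice past the end does not raise an error
print(letters[4:100])
```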


In [71]:
# Also useful is finding the length of lists, dicts and strings:
print("length of list a:", len(a))
print("length of dict rocks:", len(rocks))
print("length of string in position 1 of list c:", len(c[1]))

length of list a: 7
length of dict rocks: 2
length of string in position 1 of list c: 4


In [73]:
len(c[1])

4

In [74]:
# Dictionaries are indexed by their keys:
print(De2En["Bier"])

Beer


In [75]:
# An example of indexing nested objects
print(rocks["igneous"])
print(rocks["igneous"][1])
print(rock_dict["granite"])
print(rock_dict["granite"]["composition"])
print(rock_dict["granite"]["composition"]["quartz"])

['Granite', 'Basalt', 'Rhyolite']
Basalt
{'type': 'igneous', 'composition': {'quartz': 0.5, 'feldspar': 0.2}, 'locations': [(50.59671, -3.98289), (50.59591, -4.61987)]}
{'quartz': 0.5, 'feldspar': 0.2}
0.5


In [76]:
# You can also expand dictionaries using indexing assignment:
print(De2En)
De2En["Semmel"] = "Bread roll"
print(De2En)

{'Bier': 'Beer', 'Wurst': 'Sausage', 'Kren': 'Horseradish'}
{'Bier': 'Beer', 'Wurst': 'Sausage', 'Kren': 'Horseradish', 'Semmel': 'Bread roll'}


In [77]:
print(rocks)
rocks["metamorphic"] = ["Gneiss, Schist"]
print(rocks)

{'igneous': ['Granite', 'Basalt', 'Rhyolite'], 'Sedimentary': ['Sandstone', 'Limestone']}
{'igneous': ['Granite', 'Basalt', 'Rhyolite'], 'Sedimentary': ['Sandstone', 'Limestone'], 'metamorphic': ['Gneiss, Schist']}


In [78]:
# And you can use methods on the objects indexed:
rocks["igneous"].append("Gabbro")
print(rocks)

{'igneous': ['Granite', 'Basalt', 'Rhyolite', 'Gabbro'], 'Sedimentary': ['Sandstone', 'Limestone'], 'metamorphic': ['Gneiss, Schist']}
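
Indexing assignment also works on lists, where it replaces the item at an existing position, and on existing dictionary keys, where it overwrites the value. The `del` statement removes an item. A short sketch with made-up data:

```python
minerals = ["quartz", "feldspar", "mica"]
minerals[1] = "plagioclase"   # replace the item at position 1
print(minerals)

composition = {"quartz": 0.5, "feldspar": 0.2}
composition["quartz"] = 0.6   # overwrite the value of an existing key
del composition["feldspar"]   # remove a key and its value
print(composition)
```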


# Part 2.5 - Indexing exercise - *Workshop: 15 minutes*
Follow the instructions below.

In [80]:
# From this list
base_list = ["a", "b", "c", "d", "e",
             "f", "g", "h", "i", "j",
             "k", "l", "m", "n", "o",
             "p", "q", "r", "s", "t",
             "u", "v", "w", "x", "y", "z"]

In [None]:
# Create the following lists from base_list using indexing (e.g. print(base_list[0])):
# 1. ["a", "b", "c", "d", "e"]
# 2. ["x", "y", "z"]
# 3. ["a", "f", "k", "p", "u", "z"]
# 4. ["m", "p", "s", "v", "y"]


In [82]:
base_list[0:5]

['a', 'b', 'c', 'd', 'e']

In [None]:
# 5. From the dictionary "rocks", get the last sedimentary rock type
# 6. From the dictionary "complex_dict", get the string "five"
# 7. From the dictionary "rock_dict", get the y coordinate of the second location of granite


# Part 2.6 - Advanced/Object Based Plotting - *Workshop: 10 minutes*

Matplotlib provides a more advanced system for plotting and controlling figures. One explicitly creates figure and axes objects and uses their methods to control all aspects of creating figures.

**Task**:
Create a plot with both the LR04 and NGRIP records (see Session 1) with a shared x axis but separate y axes, and view only the time period covered by both records.

Try using `plt.subplots` to create a figure and axis which you can then use for plotting. `ax.twinx` will also be useful.

Use code completion and the help documentation. The help documentation usually contains a useful examples section.

Remember to comment your code!
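
As a starting point, here is a minimal sketch of the object-based approach using two short made-up series (the values and names below are invented for illustration and are not the LR04 or NGRIP data). It assumes Matplotlib is installed, as it is on Colab.

```python
import matplotlib.pyplot as plt

# Two invented series sharing the same x values
age = [0, 20, 40, 60, 80, 100]
series_1 = [3.2, 3.9, 4.6, 4.4, 4.1, 3.8]
series_2 = [-35, -38, -42, -41, -40, -37]

# Create a figure object and an axes object
fig, ax = plt.subplots(figsize=(8, 4))

# Plot the first series using the axes' plot method
ax.plot(age, series_1, color="tab:blue")
ax.set_xlabel("Age (invented units)")
ax.set_ylabel("Series 1", color="tab:blue")

# twinx creates a second axes sharing the same x axis,
# with its own y axis on the right
ax2 = ax.twinx()
ax2.plot(age, series_2, color="tab:red")
ax2.set_ylabel("Series 2", color="tab:red")

# Limit the view to a chosen x range
ax.set_xlim(20, 80)

plt.show()
```

Because the two axes share their x axis, setting the limits on `ax` also applies to `ax2`.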

# Part 2.7 - Python flow controls - *Walkthrough*
Flow controls turn scripting lists of commands into complex programs.
There are two main types:
* **If** statements which allow code to be executed based on a decision statement.
* **Loops** which allow a program to run through the same block of code many times.

If statements and loops are often used in combination.

Most languages use parentheses or curly brackets to indicate the code which defines the decision or loop, along with the code to be executed based on the decision or within the loop. Python however follows a different approach.
If statements and loops (along with a number of other more advanced statements) are defined by indentation with spaces or tabs. The number of spaces or tabs does not matter as long as it is consistent within a block, but 4 spaces is typical.

A helpful resource for constructing python code with flow controls from block diagrams can be found here (be sure to select python as the language):

https://developers.google.com/blockly

If you are not familiar with python syntax, this tool might be quite helpful!

## Part 2.7.1 If statements
If statements contain a condition which should evaluate to True or False, followed by an indented block to execute if the condition is True. Optionally, an else statement can follow, with its own indented block to execute if the condition is False.

```
if <condition>:
      <code if true>
else:
      <code if false>
```


In [85]:
# A condition is an expression which evaluates to True or False
a = 5
a > 10

False

In [86]:
a = 5
if a > 1: # test, is a greater than 1
  # Indented block with code if above is True
  print(a, "is greater than 1")

5 is greater than 1


In [87]:
# This won't print anything
a = 0
if a > 1:
  print(a, "is greater than 1")

In [88]:
# Add an else - executed if the original if statement is False
a = 0
if a > 1:
  print(a, "is greater than 1")
else: # at indentation level of the original if
  print(a, "is not greater than 1")

0 is not greater than 1


In [89]:
# Ifs (and loops) can be nested:
a = 5
if a > 1:
  if a < 9:
    print(a, "is greater than 1 and less than 9")
  else: # Note the levels of indentation
    print(a,"is greater than or equal to 9")
else:
  print(a,"is less than or equal to 1")

# try different values of a

5 is greater than 1 and less than 9
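
Nested ifs like the one above can often be written more flatly with `elif` (short for "else if"), which is standard python but is not otherwise covered in this walkthrough. A minimal sketch, using a hypothetical variable `grain_size_mm` and illustrative thresholds loosely following the common gravel/sand/mud grain-size classes:

```python
grain_size_mm = 0.5  # try different values

# Conditions are tested from top to bottom; only the first True block runs
if grain_size_mm > 2:
    size_class = "gravel"
elif grain_size_mm > 0.0625:
    size_class = "sand"
else:
    size_class = "mud"

print(grain_size_mm, "mm is", size_class)
```

Because the conditions are checked in order, the `elif` branch does not need to repeat the test that the value is at most 2.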


In [90]:
# We can use if statements to achieve complex things like:
print(rocks)
new_rock = "Dolomite"
if new_rock not in rocks["Sedimentary"]:
  rocks["Sedimentary"].append(new_rock)

print(rocks)

{'igneous': ['Granite', 'Basalt', 'Rhyolite', 'Gabbro'], 'Sedimentary': ['Sandstone', 'Limestone'], 'metamorphic': ['Gneiss, Schist']}
{'igneous': ['Granite', 'Basalt', 'Rhyolite', 'Gabbro'], 'Sedimentary': ['Sandstone', 'Limestone', 'Dolomite'], 'metamorphic': ['Gneiss, Schist']}


## Part 2.7.2. (For) Loops
The most common types of loop in python are for loops.
For loops iterate over items in an iterable object (like a list). In other words: *for* each item in *x*, do *y*.
We begin with the for statement, which defines the iterable and the variable name used to represent the individual items on each iteration of the loop, followed by the indented code to execute on each iteration:


```
for <item name> in <iterable>:
        <code to execute>
```




In [91]:
print(b)
# Print each item in "b", one by one
for i in b:
  print(i)

[1.5, 2.5, 3.5]
1.5
2.5
3.5


In [92]:
# Iterating over a dictionary returns its keys
print(rocks)
for k in rocks:
  print(k)

{'igneous': ['Granite', 'Basalt', 'Rhyolite', 'Gabbro'], 'Sedimentary': ['Sandstone', 'Limestone', 'Dolomite'], 'metamorphic': ['Gneiss, Schist']}
igneous
Sedimentary
metamorphic


In [93]:
# Loops can be nested
for k in rocks: # Loop over the dict keys
  for r in rocks[k]: # loop over the list items which are the values of the dict
    print(r, "is a", k, "rock")

Granite is a igneous rock
Basalt is a igneous rock
Rhyolite is a igneous rock
Gabbro is a igneous rock
Sandstone is a Sedimentary rock
Limestone is a Sedimentary rock
Dolomite is a Sedimentary rock
Gneiss, Schist is a metamorphic rock


In [94]:
# the built-in range function allows iterating over a series of integers
# try range? to see the help docs
for i in range(5):
  print(i)

0
1
2
3
4
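
`range` also accepts a start and a step, following the same half-open convention as slicing. A short sketch (wrapping `range` in `list` is only done here so that the values can be printed in one go):

```python
# range(stop): from 0 up to, but not including, stop
print(list(range(5)))

# range(start, stop)
print(list(range(2, 6)))

# range(start, stop, step)
every_fifth = list(range(0, 20, 5))
print(every_fifth)

# A negative step counts down
countdown = list(range(5, 0, -1))
print(countdown)
```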


In [95]:
# We can do more interesting things in the loop:
for i in range(10):
  # First 10 powers of 2
  x = 2 ** i
  print(x)

1
2
4
8
16
32
64
128
256
512


In [96]:
# It's common to want to create a list out of the results of the loop.
# Create an empty list:
powers = []
for i in range(10):
  x = 2 ** i
  # append the result to the list
  powers.append(x)

print(powers)


[1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
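
The create-an-empty-list-then-append pattern above can also be written in one line as a list comprehension, a standard python feature not otherwise covered in this session. A short sketch comparing the two (the name `powers_comp` is made up for this example):

```python
# Loop version, as above
powers = []
for i in range(10):
    powers.append(2 ** i)

# Equivalent list comprehension: [<expression> for <item> in <iterable>]
powers_comp = [2 ** i for i in range(10)]

print(powers_comp)
print(powers == powers_comp)
```

The loop version is easier to extend with extra steps, while the comprehension is compact for simple cases.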


In [97]:
rocks["igneous"]

['Granite', 'Basalt', 'Rhyolite', 'Gabbro']

In [98]:
"""Sometimes it's helpful to know the iteration number as well.
When the iterable yields more than one value per iteration (e.g. with enumerate, zip,
dict.items() or lists of lists), one can "unpack" them by providing multiple
variable names in the for statement.
"""
for c, i in enumerate(rocks["igneous"]): # c is counter, i is object
  print("The number", c, "igneous rock is", i)

The number 0 igneous rock is Granite
The number 1 igneous rock is Basalt
The number 2 igneous rock is Rhyolite
The number 3 igneous rock is Gabbro
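
The same unpacking works with `zip` and `dict.items()`, both mentioned above. The sketch below uses small made-up lists and a made-up dictionary so that it runs on its own:

```python
samples = ["S1", "S2", "S3"]
depths_m = [0.5, 1.2, 2.8]

# zip pairs up items from two (or more) iterables
for name, depth in zip(samples, depths_m):
    print(name, "was taken at", depth, "m")

rock_types = {"Granite": "igneous", "Sandstone": "sedimentary"}

# dict.items() gives (key, value) pairs
for rock, rock_type in rock_types.items():
    print(rock, "is of type", rock_type)

# Collect the pairs from zip into a dictionary
depth_lookup = dict(zip(samples, depths_m))
print(depth_lookup)
```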


In [99]:
# We can also combine loops and if statements:
rocks_beginning_with_g = []
for rock in rocks["igneous"]: # Loop over igneous rocks list
  if rock[0] == "G": # See if first letter is G
    rocks_beginning_with_g.append(rock) # If so, append to list
print(rocks_beginning_with_g)

['Granite', 'Gabbro']


# Part 2.8 - Flow control exercise - *Workshop: 15 minutes*

In [None]:
# Create the following lists using loops:
# 1. [3,6,9,12,15,18....300] 
# 2. The above list by a different approach
# 3. A list of all integers divisible by 3 and 7 between 1 and 1000


# Part 2.9 - Other python plotting libraries - *Mini-lecture*
Matplotlib is by far the most popular plotting library for python. It is extremely powerful and flexible for creating static publication quality graphics. However, other plotting libraries exist with different focuses:

* Matplotlib for static graphics, quick plotting and publication quality graphics, (integrated into many other libraries)
* Plotly for interactive or static/publication quality
* Bokeh for interactive/web plots
* Seaborn for quick but powerful stats plots
* Altair - different approach to making plots based on how we reason about data

This example notebook provides examples of each:

https://colab.research.google.com/notebooks/charts.ipynb#scrollTo=N-u5cYwpS-y0

It is highly recommended to go through the examples and experiment with these libraries in your own time and in the assignments. However, be sure to also get experience with Matplotlib because it is so common.



# Part 2.10 - Week 2 Assignment
One of the most critical basic skills in data science is data visualisation, whether making quick plots to explore data, or publication quality graphics, or anything in between. Through this course we will routinely plot data in the classes and assignments.

**Task**

Create a Jupyter notebook in which you load data and make at least 2 different plots.

*Datasets*

You can use datasets from the ds4geo github repository (e.g. which were submitted as the week 1 assignment) or any dataset which you can load directly from a url (e.g. from http://www.pangaea.de).
(At any time you can submit extra datasets to be hosted in the ds4geo repository for sharing with the rest of the class. Please follow the same instructions as for the Week 1 Assignment.)

*Plots*

You can make any type of plots (doesn't have to be a time-series), but the type should be appropriate to the data.
The plots do not need to be very complicated, but try to make clear, readable and interesting plots to share with the rest of the class.

For inspiration about types of plot, see here: https://colab.research.google.com/notebooks/charts.ipynb#scrollTo=N-u5cYwpS-y0

*Libraries*

You may use any python plotting library, but it is recommended to use Matplotlib (or its Pandas integration - see pd.DataFrame.plot?) unless you are already experienced with it.

*How to get help*

Don't be afraid to use google! Typing "python" followed by a description of what you want to do often results in useful Stackoverflow forum discussions with code snippets. e.g. "python matplotlib two x axes"

If you can't figure something out, note it down in your notebook. We will have the opportunity to answer such questions later in the course.

Remember to comment your code!

**Submission**

* Submit the assignment here: https://github.com/ds4geo/ds4geo_ws2020/tree/master/Assignments/Session%202
* See readme about how to use github for assessment submissions here: https://github.com/ds4geo/ds4geo/blob/master/Github%20Assignment%20Readme.md
* The **deadline** is 23:59 on 20th October 2020.
* This assignment comprises 5% of the assessment for the course. Marks are awarded for effort.

Submitted notebooks are available to the whole class, and will be discussed in Session 3.

