# Introduction to Python

## Data Management Boot Camp

### August 19, 2019

David Naughton  
Web Development  
UMN Libraries  
naughton@umn.edu

## Why Python?

![xkcd: Python](https://imgs.xkcd.com/comics/python.png)  
[xkcd.com/353/](https://xkcd.com/353/)

## Why Python? Seriously...
* Named after Monty Python's Flying Circus! (Sorry...)
* Easy to learn and fun to use. Actually important.
* Open source
* One of the most popular languages overall, and especially popular for research computing.




## Why Python? pandas!
* [pandas](https://pandas.pydata.org/) is a Python data analysis library. It's what makes Python competitive with R.
* Part of the Python [SciPy](https://www.scipy.org/) ecosystem of packages, and built on [NumPy](http://www.numpy.org/).


## Why Python? Jupyter Notebooks!
* Text, images, and _executable_ code, all in a single shareable object!
* This workshop itself is a Jupyter Notebook.
* GitHub renders them automatically: https://github.com/UMNLibraries/python-workshop-2019-08
* Many Jupyter installations, such as Anaconda, include pandas by default.


In [None]:
import pandas as pd

## Python 2 vs. 3
Python 2 was released in 2000 and will fall out of support in 2020. pandas dropped Python 2 support in July 2019. Python 3 was released in 2008 and is the future of the language. Use Python 3 unless you are working with existing Python 2 code or need a package (library) that does not support Python 3, which is rare.
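
Two differences that commonly surface when moving code from Python 2 to Python 3 are that `print` is a function and that `/` always performs true division. A minimal sketch, assuming it is run under Python 3:

```python
import sys

# In Python 3, print is a function, so it needs parentheses.
print('Running Python', sys.version_info.major)

# In Python 3, / always performs true division; // performs floor division.
true_quotient = 7 / 2
floor_quotient = 7 // 2
print(true_quotient, floor_quotient)
```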

## Hello, World!
This is a complete Python program:

In [None]:
print('Hello, World!')

## HelloWorld.java
```java
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World");
    }
}
```

## Exercise: Lake Superior's Seiche
A seiche is a standing wave in an enclosed or partially enclosed body of water.

[Wikipedia: Seiche](https://en.wikipedia.org/wiki/Seiche)

**Please note!** Some of what follows may seem contrived, because it is. Some things we would never do in a real project, but we do here to illustrate Python and programming concepts. As we advance through the exercise, we will get closer and closer to a real-world approach.

## Lake Superior Water Levels
Get the Lake Superior station number from [NOAA Great Lakes Low Water Datums](https://tidesandcurrents.noaa.gov/gldatums.html), then request water levels referenced to the [International Great Lakes Datum (IGLD)](https://opendap.co-ops.nos.noaa.gov/axis/webservices/waterlevelrawsixmin/index.jsp).

## Reading Files

In [None]:
with open('CO-OPS__9099064__wl.csv') as file:
    for line in file:
        print(line)
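
Each line read from a file keeps its trailing newline, and `print` adds another, so the output above appears double-spaced. Here is a minimal sketch of one way to avoid that. It uses `io.StringIO` with made-up sample rows as a stand-in for the real file, so the column names and values are assumptions:

```python
import io

# Hypothetical sample data standing in for the real CSV file.
sample_file = io.StringIO('date_time,water_level\n2017-06-26 00:00,601.50\n')

clean_lines = []
for line in sample_file:
    # rstrip('\n') removes the trailing newline that each line carries.
    clean_lines.append(line.rstrip('\n'))

for clean_line in clean_lines:
    print(clean_line)
```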

## Whitespace is significant!
What happens when we do this?

In [None]:
with open('CO-OPS__9099064__wl.csv') as file:
    for line in file:
    print(line)

**Important note:** Four-space indentation is an extremely strong convention in the Python community.
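
One way to look at the resulting error without stopping a notebook cell is to hand the badly indented source to the built-in `compile()` and catch the exception. This is only a sketch; the exact wording of the message varies between Python versions:

```python
# The loop body below is deliberately not indented.
source = 'for number in [1, 2, 3]:\nprint(number)\n'

message = 'no error'
try:
    # compile() parses the source text without running it.
    compile(source, '<example>', 'exec')
except IndentationError as error:
    message = f'{type(error).__name__}: {error.msg}'

print(message)
```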

## Variables

In [None]:
file_name = 'CO-OPS__9099064__wl.csv'
with open(file_name) as file:
    for line in file:
        print(line)

## Types & Objects
Everything in Python is some _type_ of _object_. Objects contain _attributes_: usually data, plus related functions called _methods_.

Python is _dynamically typed_ because it determines these types at runtime.

In [None]:
file_name = 'CO-OPS__9099064__wl.csv'
print('file_name is a ', type(file_name))
with open(file_name) as file:
    print('file is a ', type(file))
    for line in file:
        print('line is a ', type(line))
        print(f'line = {line}') # f-strings!
        break
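
Dynamic typing also means that the type belongs to the object, not to the variable name. A small sketch, with illustrative values, in which one name refers to objects of three different types in turn:

```python
# The same name can refer to objects of different types over time.
value = 'CO-OPS__9099064__wl.csv'
first_type = type(value)   # str

value = 3
second_type = type(value)  # int

value = [601.5, 601.6]
third_type = type(value)   # list

print(first_type, second_type, third_type)
```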

## Strong Typing
Python is also _strongly typed_. For example, some operations require operands of specific types.

What happens when we do this?

In [None]:
file_name = 'CO-OPS__9099064__wl.csv'
file_name + 3
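
The cell above raises a `TypeError`. The following sketch catches that error so the cell keeps running, then shows one possible fix, an explicit conversion with `str()`:

```python
file_name = 'CO-OPS__9099064__wl.csv'

try:
    file_name + 3
except TypeError as error:
    # Python refuses to guess how to combine a str and an int.
    error_message = str(error)
    print('TypeError:', error_message)

# An explicit conversion makes the intent clear.
combined = file_name + str(3)
print(combined)
```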

We'll come back to types, but first we'll cover some type differences that are conceptual, but not captured in Python types.

## Procedures
Procedures give a name to some collection of actions. They do not return useful output: in Python, calling one evaluates to `None`.

In [None]:
def print_lines(file_name):
    with open(file_name) as file:
        for line in file:
            print(line)
            #pass

print('print_lines is a ', type(print_lines))
            
file_name = 'CO-OPS__9099064__wl.csv'            
print_lines(file_name)
#output = print_lines(file_name)
#print(output)
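
The commented-out lines above hint at what a procedure returns. A self-contained sketch of the same idea, using a hypothetical `print_greeting` procedure that needs no file:

```python
def print_greeting(name):
    # This procedure only has a side effect: printing.
    print(f'Hello, {name}!')

output = print_greeting('World')
print(output)  # A function without a return statement returns None.
```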

## Functions
Functions, unlike procedures, return output. _Pure functions_ always return the same output for the same input.

In [None]:
def file_to_list(file_name):
    lines = [] # list
    with open(file_name) as file:
        for line in file:
            lines.append(line) # append is a method (function) in the list class
    return lines

print('file_to_list is a ', type(file_to_list))
            
file_name = 'CO-OPS__9099064__wl.csv'
lines = file_to_list(file_name)
print(lines)


## Named Arguments
Python allows arguments to be passed by name when calling a function. These are often called _keyword arguments_. The syntax may be confusing at first!

In [None]:
def file_to_limited_list(file_name, limit):
    lines = []
    with open(file_name) as file:
        for line in file:
            lines.append(line)
            if len(lines) == limit:
                break
    return lines
    
file_name = 'CO-OPS__9099064__wl.csv'
limit = 3
lines = file_to_limited_list(file_name, limit)
#lines = file_to_limited_list(file_name=file_name, limit=limit)
#lines = file_to_limited_list(limit=limit, file_name=file_name) # Helps to avoid incorrect argument order.
#lines = file_to_limited_list(limit, file_name) # What happens when we do this?
print(lines)
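
The same calling styles can be tried without a file. In this sketch, `first_items` is a hypothetical stand-in for `file_to_limited_list` that works on a list. Note that the error raised by the swapped call depends on what the function body does with its arguments:

```python
def first_items(items, limit):
    # Return at most `limit` items from the start of the list.
    return items[:limit]

letters = ['a', 'b', 'c', 'd']

by_position = first_items(letters, 2)
by_name = first_items(items=letters, limit=2)
by_name_reordered = first_items(limit=2, items=letters)
print(by_position, by_name, by_name_reordered)

try:
    first_items(2, letters)  # Positional arguments in the wrong order.
except TypeError as error:
    swapped_error = str(error)
    print('TypeError:', swapped_error)
```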

## Functional Programming

### (An even further diversion!)

[_Functional programming (FP)_](https://en.wikipedia.org/wiki/Functional_programming) encourages using [_pure functions_](https://en.wikipedia.org/wiki/Pure_function), which always return the same output for the same input, without side effects. FP discourages unnecessary state and state changes.

Is `print` a pure function?

In [None]:
output = print('Hello, World!')
type(output)
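
To make the distinction concrete, here is a sketch that contrasts a pure function with one that is not pure. Both function names, and the `history` list, are hypothetical and exist only for illustration:

```python
def add(a, b):
    # Pure: the result depends only on the arguments.
    return a + b

history = []

def add_and_record(a, b):
    # Not pure: it changes state outside the function (a side effect),
    # and its output depends on that state.
    history.append(a + b)
    return sum(history)

print(add(2, 3), add(2, 3))
first_call = add_and_record(2, 3)
second_call = add_and_record(2, 3)
print(first_call, second_call)
```

The two calls to `add` give the same result, while the two calls to `add_and_record` do not, even though the arguments are identical.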

## FP vs. OOP
Like most popular languages, Python supports both [_functional programming (FP)_](https://en.wikipedia.org/wiki/Functional_programming) and [_object-oriented programming (OOP)_](https://en.wikipedia.org/wiki/Object-oriented_programming).

[_When is OOP better than FP and vice-versa?_](https://www.quora.com/Computer-Programming/When-is-OOP-better-than-FP-and-vice-versa) It depends. Here I focus on FP, because I find it a fast start and a good fit for research.

OK, let's get back to types!

## Lists, a.k.a. Arrays

In [None]:
file_name = 'CO-OPS__9099064__wl.csv'
with open(file_name) as file:
    for line in file:
        row = line.strip().split(',') # split is a string method that returns a list
        print('row is a ', type(row))
        print(row)
        print(row[0], row[1])
        break
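
A few more list operations, sketched on a single made-up line so that no file is needed. The values are assumptions and the real file's columns may differ:

```python
# Hypothetical line; the real file's columns may differ.
line = '2017-06-26 00:00,601.50\n'

row = line.strip().split(',')
print(len(row))      # Number of items in the list.
print(row[0])        # Indexes start at 0.
print(row[-1])       # Negative indexes count from the end.

row.append('extra')  # Lists are mutable: they can grow and change.
print(row)
```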

## What about the header?

In [None]:
file_name = 'CO-OPS__9099064__wl.csv'
with open(file_name) as file:
    line_count = -1
    for line in file:
        line_count = line_count + 1
        if line_count == 0:
            continue
        row = line.strip().split(',')
        print(row[0], row[1])
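
Counting lines works, but a file object is also an iterator, so another option is to consume the header with `next()` before the loop starts. A sketch of that alternative, using `io.StringIO` and made-up rows in place of the real file:

```python
import io

# Hypothetical sample data standing in for the real CSV file.
sample_file = io.StringIO(
    'date_time,water_level\n'
    '2017-06-26 00:00,601.50\n'
    '2017-06-26 00:06,601.52\n'
)

header = next(sample_file)  # Consume the first line before looping.
rows = []
for line in sample_file:
    rows.append(line.strip().split(','))

print(rows)
```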

## Python csv Module

In [None]:
import csv
file_name = 'CO-OPS__9099064__wl.csv'
with open(file_name) as file:
    has_header = csv.Sniffer().has_header(file.read(1024))
    file.seek(0)  # Rewind.
    reader = csv.reader(file)
    if has_header:
        next(reader)  # Skip header row.
    for row in reader:
        print(row[0], row[1])
        break
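
One reason to prefer `csv.reader` over `split(',')` is that it understands quoted fields. The row below is invented for illustration and is not taken from the NOAA file:

```python
import csv
import io

# Hypothetical row in which a quoted field contains a comma.
sample_text = 'station,"Duluth, MN",601.50\n'

naive_row = sample_text.strip().split(',')
csv_row = next(csv.reader(io.StringIO(sample_text)))

print(len(naive_row), naive_row)  # split breaks the quoted field apart.
print(len(csv_row), csv_row)      # csv.reader keeps it as one field.
```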

## Dictionaries, a.k.a. Associative Arrays, a.k.a. Hash Tables, a.k.a. Hashes

In [None]:
file_name = 'CO-OPS__9099064__wl.csv'
with open(file_name) as file:
    reader = csv.DictReader(file)
    for row in reader:
        print('row is a', type(row))
        print(row)
        print(row['date_time'], row['water_level'])
        break
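
A short sketch of common dictionary operations, using a hand-written row with made-up values. The key `no_such_column` is deliberately absent, to show how `get()` behaves:

```python
# Hypothetical values; the keys match the column names used above.
row = {'date_time': '2017-06-26 00:00', 'water_level': '601.50'}

print(row['water_level'])  # Look up a value by its key.

# get() returns a default instead of raising KeyError for absent keys.
fallback = row.get('no_such_column', 'missing')
print(fallback)

row['water_level_feet'] = float(row['water_level'])  # Add a new key.
for key, value in row.items():
    print(key, '=', value)
```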

## Doing stuff with water levels...

In [None]:
file_name = 'CO-OPS__9099064__wl.csv'
with open(file_name) as file:
    reader = csv.DictReader(file)
    water_levels = []
    for row in reader:
        water_levels.append(row['water_level'])
    water_levels_sum = sum(water_levels)

## Will this fix our problem?

In [None]:
file_name = 'CO-OPS__9099064__wl.csv'
with open(file_name) as file:
    reader = csv.DictReader(file)
    water_levels = []
    for row in reader:
        water_level = float(row['water_level'])
        water_levels.append(water_level)
    water_levels_sum = sum(water_levels)

## Investigating further...

In [None]:
import re # regular expressions
file_name = 'CO-OPS__9099064__wl.csv'
with open(file_name) as file:
    reader = csv.DictReader(file)
    water_levels = []
    for row in reader:
        if not (re.match(r'^\d+', row['water_level'])):
            print('"{}"'.format(row['water_level']))
            break
        water_level = float(row['water_level'])
        water_levels.append(water_level)
    water_levels_sum = sum(water_levels)
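
If the offending values turn out to be blank strings, which is an assumption here, one possible approach is to skip rows that `float()` cannot convert. A sketch using made-up data with one deliberately blank reading:

```python
import csv
import io

# Hypothetical sample data; the second reading is deliberately blank.
sample_file = io.StringIO(
    'date_time,water_level\n'
    '2017-06-26 00:00,601.50\n'
    '2017-06-26 00:06,\n'
    '2017-06-26 00:12,601.25\n'
)

water_levels = []
skipped = 0
for row in csv.DictReader(sample_file):
    try:
        water_levels.append(float(row['water_level']))
    except ValueError:
        # float('') raises ValueError, so count the row and move on.
        skipped += 1

water_levels_sum = sum(water_levels)
print(water_levels_sum, skipped)
```

Whether skipping is appropriate depends on the analysis. Counting the skipped rows at least makes the gap visible.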

## pandas

In [None]:
import matplotlib
%matplotlib inline
import csv
import pandas as pd

### Dataframes

In [None]:
file_name = 'CO-OPS__9099064__wl.csv'
df = pd.read_csv(
    file_name,
    header=0,
    parse_dates=['date_time'], # parse_dates=True would only try to parse the index.
    float_precision='high', # Without this, pandas adds a ridiculous level of precision to some values.
)
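
pandas handles blank values without extra code: `read_csv` reads them as `NaN`, and methods such as `sum()` skip `NaN` by default. A sketch with made-up data, assuming pandas is installed:

```python
import io

import pandas as pd

# Hypothetical sample data; the second reading is deliberately blank.
sample_file = io.StringIO(
    'date_time,water_level\n'
    '2017-06-26 00:00,601.50\n'
    '2017-06-26 00:06,\n'
    '2017-06-26 00:12,601.25\n'
)
sample_df = pd.read_csv(sample_file)

# The blank reading becomes NaN, which sum() skips by default.
missing_count = int(sample_df['water_level'].isna().sum())
level_sum = float(sample_df['water_level'].sum())
print(missing_count, level_sum)
```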

### Matplotlib Axes Object

In [None]:
ax = df.plot(
    x='date_time',
    y='water_level',
    title='Lake Superior Levels (IGLD 1985)',
    legend=False,
)

### Labeling the Axes

In [None]:
ax = df.plot(
    x='date_time',
    y='water_level',
    title='Lake Superior Levels (IGLD 1985)',
    legend=False,
)
ax.set_xlabel('Days from June 26, 2017')
ax.set_ylabel('Water Level IGLD (feet)')

### Mean Difference

In [None]:
df['water_level_mean_diff'] = df.water_level - df.water_level.mean()
ax_mean_diff = df.plot(
    x='date_time',
    y='water_level_mean_diff',
    title='Lake Superior Levels (IGLD 1985)',
    legend=False,
)
ax_mean_diff.set_xlabel('Days from June 26, 2017')
ax_mean_diff.set_ylabel('Water Level Mean Diff IGLD (feet)')
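
The subtraction above is applied to every row at once. A tiny sketch with invented readings, chosen so the mean is easy to check by hand:

```python
import pandas as pd

# Hypothetical readings chosen so the mean is easy to check by hand.
sample_df = pd.DataFrame({'water_level': [601.0, 602.0, 603.0]})

mean_level = sample_df.water_level.mean()
# Subtracting a single number from a column subtracts it from every row.
sample_df['water_level_mean_diff'] = sample_df.water_level - mean_level
print(mean_level)
print(sample_df)
```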

## Further Learning
* #python channel on [Tech People UMN Slack](https://tech-people-umn.slack.com/)
* [UMN Tech People Coworking](https://umnhackerhours.github.io/) every Wednesday afternoon!
* [Dive Into Python](https://www.cmi.ac.in/~madhavan/courses/prog2-2012/docs/diveintopython3/index.html)
* [Learn Python the Hard Way](https://learnpythonthehardway.org/)
* [Official Python Documentation](https://docs.python.org/3/index.html)

## Questions?

## Thank you!