# Overview

### Contents

- **Introduction to Python**
  - The Python Interpreter
  - First Steps with Python
  - Importing Libraries
  - About the Data
  - Arrays and their Attributes
  - Getting Help
  - More on Arrays
  - Basic Data Visualization
- **Repeating Tasks with Loops**
  - Sequences
  - More Complex Loops
- **Analyzing Data from Multiple Files**
  - Looping over Files
  - Generating a Plot
  - Putting it All Together
- **Conditional Evaluation**
  - Conditional Expressions in Python
  - Checking our Data
- **Creating Functions for Reuse**
  - Composing Multiple Functions
  - Cleaning Up our Analysis Code
  - Positional versus Keyword Arguments
  - Documenting Functions
- **Understanding and Handling Errors**
- **Defensive Programming**
  - Assertions
  - Test-Driven Development
- **Python at the Command Line**
- **Analyzing and Optimizing Performance**
  - Benchmarking
- **Connecting to SQLite with Python**

# Introduction to Python

## The Python Interpreter

### Jupyter Notebook

The Python interpreter we'll interact with in Jupyter Notebook is the same interpreter we could use from the command line. To launch Jupyter Notebook:

- In GNU/Linux or Mac OS X, launch the Terminal and type `jupyter notebook`; then press ENTER.
- In Windows, launch the Command Prompt and type `jupyter notebook`; then press ENTER.

## First Steps with Python

## Importing Libraries

## About the Data

The data we're using for this lesson are **monthly averages of surface air temperatures** from 1948 to 2016 for five different locations. They are derived from the NOAA NCEP CPC Monthly Global Surface Air Temperature Data Set, which has a 0.5 degree spatial resolution.

**What is the unit for air temperature used in this dataset?** Recall that when we assign a value to a variable, we don't see any output on the screen. To see our Barrow temperature data, we can use the `print()` function again.

```py
print(barrow)
```

The data are formatted such that:

- Each column is a monthly mean, from January (1) through December (12)
- Each row is a year, from 1948 (1) through 2016 (69)

[More information on the data can be found here.](http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/.CPC/.GHCN_CAMS/.gridded/.deg0p5/.temp/)

## Arrays and their Attributes

**How many rows and columns are there in the `barrow` array?**

**Challenge: What does each of the following code samples do?**

```py
barrow[0]
barrow[0,]
barrow[-1]
barrow[-3:-1]
```

### Slicing NumPy Arrays

**Challenge: What's the mean monthly temperature in August of 2016? Converted to degrees Fahrenheit?**

Degrees F can be calculated from degrees K by the formula:

$$
T_F = \left(T_K \times \frac{9}{5}\right) - 459.67
$$
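As a quick check of the formula, here is a worked example in plain Python. The input value is the freezing point of water, chosen for illustration; it is not taken from the temperature data.

```py
# Freezing point of water in degrees Kelvin (illustrative value)
temp_k = 273.15

# Apply the conversion formula from above
temp_f = (temp_k * 9 / 5) - 459.67

# Should print a value of about 32 degrees F
print(round(temp_f, 2))
```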

### Calculating on NumPy Arrays

**Convert the first year of Barrow air temperatures from degrees Kelvin to degrees Celsius.**

**Calculate the monthly average of the first two years of air temperatures in Barrow.** (Consider this the average of the monthly averages.) Then convert to Celsius.

**What is the overall mean temperature in any month in Barrow between 1948 and 2016 in degrees C?**

**How cold was the coldest February in Barrow, by monthly mean temperatures, in degrees C?**

**Challenge: What are the minimum, maximum, and mean monthly temperatures for August in Barrow, in degrees C?**

## Getting Help

## More on Arrays

**What is the mean temperature in 1948? In 1949? And so on...**

**What, then, does the following function call give us?**

```py
barrow.mean(axis = 0)
```

## Basic Data Visualization

# Repeating Tasks with Loops

## Sequences

### Character Strings

### Lists and Tuples

**How many times is the value `"Barrow"` found in the `cities` list?**

### Performing Calculations with Lists

## More Complex Loops

### Looping over Sequences

**Write a `for` loop that iterates through the letters of your favorite city, putting each letter inside a list. The result should be a list with an element for each letter.**

Hint: You can create an empty list like this:

```py
letters = []
```

Hint: You can confirm you have the right result by comparing it to:

```py
list("my favorite city")
```

### Challenge: Sequences and Mutability

**Which of the sequences we've learned about are immutable (i.e., they can't be changed)?**

- Strings are (immutable / mutable)?
- Lists are (immutable / mutable)?
- Tuples are (immutable / mutable)?

**And what does this mean for working with each data type?**

```py
"birds".upper()

[1, 2, 3].append(4)

(1, 2, 3)
```


# Analyzing Data from Multiple Files

## First Step: Looping over Files

## Second Step: Generating a Plot

## Third Step: Putting It All Together

### Challenge: Integrating over Multiple File Datasets

**For each location (each file), plot the difference between that location's mean temperature and the mean across all locations.**

Hint: One way to calculate the mean across five (5) files is by adding the 5 arrays together, then dividing by 5. You can add arrays together in a loop like this:

```py
# Assumes NumPy was imported as np and that filenames is a list of file paths
# Start with an array full of zeros that is 69 elements long
running_total = np.zeros(69)

for fname in filenames:
    data = np.loadtxt(fname, delimiter = ',')
    running_total = running_total + data.mean(axis = 1)
```

Hint: How do you difference two arrays? Remember how the plus, `+`, and minus, `-`, operators work on arrays?
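Arithmetic operators on NumPy arrays work element by element, which a small sketch with made-up numbers can illustrate. The arrays `location_mean` and `overall_mean` are hypothetical stand-ins, not values from the temperature files.

```py
import numpy as np

# Hypothetical annual means (degrees K) for one location and across all locations
location_mean = np.array([261.0, 262.5, 260.0])
overall_mean = np.array([270.0, 270.5, 271.0])

# Subtraction pairs up elements by position and returns a new array
difference = location_mean - overall_mean
print(difference)
```

Both arrays must have the same length (or otherwise compatible shapes) for this to work.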

# Conditional Evaluation

This code can be represented by the following workflow.

![](./python-flowchart-conditional.png)

### Challenge: Conditional Expressions

**How can you make this code print "Greater" by changing only one line?**

```py
a_number = 42

if a_number > 100:
    print('Greater')
    
else:
    print('Not greater')
    
print('Done')
```

**There are two (2) one-line changes you could make. Can you find them both?**

## Conditional Expressions in Python

**What does each of the following evaluate to, `True` or `False`?**

```py
1 < 2
1 <= 1
3 == 3
2 != 3
```

## Checking our Data

### Challenge: Fitting a Line over Multiple File Datasets

**Write a `for` loop, with an `if` statement inside, that calculates a line of best fit for each dataset's temperature anomalies and prints out a message as to whether that trend line is positive or negative.**

Hint: After fitting each trend line with:

```py
results = sm.OLS(y_data, x_data).fit()
b0, b1 = results.params
```

we want to know whether `b1`, the slope of the line, is positive or negative.

# Creating Functions for Reuse

At this point, what have we learned?

- What sequences are and how to create them;
- A list is created by comma-separated values inside square brackets;
- A tuple is created by comma-separated values inside parentheses;
- Lists are mutable--their elements can be changed;
- Strings and tuples are immutable--the letters or other elements in these sequences cannot be changed;
- How to use `glob.glob()` to create a list of files whose names match a given pattern;
- How to use `if` statements to test a condition;
- How to use `elif` and `else` statements to test alternative conditions;
- Conditional operators including `==`, `>=`, `<=`, `and`, and `or`;
- `X and Y` is only true if both X and Y are true;
- `X or Y` is true if either X, Y, or both are true;
- How to implement a test over multiple inputs using an `if` statement inside a `for` loop.
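A few of these points can be checked directly in the interpreter. The sketch below uses made-up values; the city names, coordinates, and temperature thresholds are illustrative assumptions, not data from the lesson.

```py
# Lists are mutable: an element can be replaced in place
cities = ['Barrow', 'Fairbanks', 'Juneau']
cities[0] = 'Anchorage'

# Tuples are immutable: assigning to an element raises a TypeError
coordinates = (71.3, -156.8)
try:
    coordinates[0] = 0.0
    tuple_changed = True
except TypeError:
    tuple_changed = False

# An if statement inside a for loop tests each input in turn
labels = []
for temp_k in [250.0, 275.0, 300.0]:
    if temp_k > 260.0 and temp_k < 290.0:
        labels.append('moderate')
    elif temp_k <= 260.0 or temp_k >= 290.0:
        labels.append('extreme')

print(cities, tuple_changed, labels)
```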

## Composing Multiple Functions

Now that we've created a function that converts temperatures in degrees Kelvin to degrees Celsius, let's see if we can write a function that converts from degrees Celsius to degrees Fahrenheit.

$$
T_F = \left(T_C \times \frac{9}{5}\right) + 32
$$

**Now, what if we want to convert temperatures in degrees Kelvin to degrees Fahrenheit?**
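One way to approach this is to compose two smaller functions. The sketch below is a guess at how that might look; the function names are assumptions, and the Kelvin-to-Celsius step uses the standard offset of 273.15.

```py
def kelvin_to_celsius(temp_k):
    '''Convert degrees Kelvin to degrees Celsius.'''
    return temp_k - 273.15

def celsius_to_fahrenheit(temp_c):
    '''Convert degrees Celsius to degrees Fahrenheit.'''
    return (temp_c * 9 / 5) + 32

def kelvin_to_fahrenheit(temp_k):
    '''Convert degrees Kelvin to degrees Fahrenheit via degrees Celsius.'''
    return celsius_to_fahrenheit(kelvin_to_celsius(temp_k))

# The freezing point of water: 273.15 K is 0 degrees C and 32 degrees F
print(kelvin_to_fahrenheit(273.15))
```

Composing small functions like this means each conversion is written, and can be checked, only once.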

## Cleaning Up our Analysis Code

## Positional versus Keyword Arguments

## Documenting Functions

### Challenge: Functions

**Create one (or both, for an extra challenge) of the following functions...**

- A function called `fences` that takes an input character string and surrounds it on both sides with another string, e.g., "pasture" becomes "|pasture|" or "@pasture@" if "|" or "@", respectively, is provided.
- A function called `rescale` that takes an array and returns a corresponding array of values scaled to lie in the range 0.0 to 1.0.

Hint: Strings can be concatenated with the plus operator.

```py
'cat' + 's'
```

Hint: If $x_0$ and $x_1$ are the lowest and highest values in an array, respectively, then the replacement value for any element $x$, scaled to between 0.0 and 1.0, should be:

$$
\frac{x - x_0}{x_1 - x_0}
$$
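To see the formula at work before wrapping it in a function, here is a worked example on a short, made-up list of plain Python numbers. The variable names are assumptions, and a NumPy array would need its own handling.

```py
# Made-up values to rescale
values = [2.0, 4.0, 10.0]

# x0 and x1 are the lowest and highest values, as in the formula
x0 = min(values)
x1 = max(values)

# Apply (x - x0) / (x1 - x0) to every element
scaled = [(x - x0) / (x1 - x0) for x in values]
print(scaled)
```

Note that the formula divides by zero when all of the values are equal, which is a case worth deciding on before writing `rescale`.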

# Understanding and Handling Errors

# Defensive Programming

Up to this point, we've introduced the basic tools of programming:

- Variables and lists,
- File input and output,
- Loops,
- Conditionals, and
- Functions.

## Assertions

## Test-Driven Development

For example, suppose we need to find where two or more time series overlap. The range of each time series is represented as a pair of numbers, which are the time the interval started and ended. The output is the largest range that they all include.

![](./python-overlapping-ranges.svg)

Here are three tests for `range_overlap()`.

```py
assert range_overlap([ (0.0, 1.0) ]) == (0.0, 1.0)
assert range_overlap([ (2.0, 3.0), (2.0, 4.0) ]) == (2.0, 3.0)
assert range_overlap([ (0.0, 1.0), (0.0, 2.0), (-1.0, 1.0) ]) == (0.0, 1.0)
```

We're missing a test case, however; what should happen when the ranges don't overlap at all? Or if they share a boundary?

```py
assert range_overlap([ (0.0, 1.0), (5.0, 6.0) ]) == ???
assert range_overlap([ (0.0, 1.0), (1.0, 2.0) ]) == ???
```

**Since we're planning to use the `range_overlap()` function to generate a time series for the horizontal axis in a plot, we'll decide:**

- Every overlap has to have non-zero width, and,
- We will return the special value `None` when there's no overlap.

`None` is built into Python and means "nothing here."

```py
assert range_overlap([ (0.0, 1.0), (5.0, 6.0) ]) is None
assert range_overlap([ (0.0, 1.0), (1.0, 2.0) ]) is None
```
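To make these tests concrete, here is one possible sketch of `range_overlap()` that satisfies all five assertions. It illustrates the two decisions above and is not necessarily the implementation developed in the lesson; the variable names are assumptions.

```py
def range_overlap(ranges):
    '''Return the common overlap among a list of (low, high) pairs, or None.'''
    # The overlap begins at the largest of the starting values...
    overlap_low = max(low for (low, high) in ranges)
    # ...and ends at the smallest of the ending values
    overlap_high = min(high for (low, high) in ranges)

    # Zero-width or negative-width results count as no overlap
    if overlap_low >= overlap_high:
        return None

    return (overlap_low, overlap_high)

print(range_overlap([ (0.0, 1.0), (0.0, 2.0), (-1.0, 1.0) ]))
print(range_overlap([ (0.0, 1.0), (1.0, 2.0) ]))
```

This sketch does not handle an empty list of ranges; `max()` raises a `ValueError` in that case.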

### A Potential Solution

### Challenge: Fix the Range Overlap Function

Fix `range_overlap()`; re-run `test_range_overlap()` after each change you make.

## When in Doubt...

```py
import this
```

## Testing for Quality Control

# Python at the Command Line

We've seen a lot of tools and techniques for improving our productivity through reproducible Python code. So far, however, we've been working exclusively within Jupyter Notebook. Jupyter Notebook is great for interactive, exploratory work in Python and encourages literate programming, as we discussed earlier. A Notebook is a great place to demonstrate to your future self or your peers how some Python code works.

But when it's time to scale up your work and process data, you want to be on the command line, for all the reasons we saw when we discussed the Unix shell earlier.

Let's explore Python programs at the command line using the following *Python script,* `temp_extremes.py`.

```py
'''
Reports the min and max July temperatures for each file
that matches the given filename pattern.
'''

import csv
import os
import sys
import glob

def main():
    # Get the user-specified directory
    directory = sys.argv[1]

    # Pattern to use in searching for files
    filename_pattern = os.path.join(directory, '*temperature.csv')

    for filename in glob.glob(filename_pattern):
        july_temps = []

        # While the file is open...
        with open(filename, 'r') as stream:
            # Use a function to read the file
            reader = csv.reader(stream)

            # Each row is a year
            for row in reader:
                # Add this year's July temperature to the list, as a number
                # so that max() and min() compare numerically, not as text
                july_temps.append(float(row[6]))

        # A human-readable name for the file        
        pretty_name = os.path.basename(filename)
        print(pretty_name, '--Hottest July mean temp. was', max(july_temps), 'deg K')
        print(pretty_name, '--Coolest July mean temp. was', min(july_temps), 'deg K')
        
        
if __name__ == '__main__':
    main()
```

We can run this script on the command line by typing the following:

```sh
$ python3 temp_extremes.py .
```

Remember that the single dot, `.`, represents the current working directory, which is where all of our temperature CSV files are located.
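To see which files the filename pattern picks up, here is a small sketch that builds a throwaway directory with the standard `tempfile` module. The file names in it are invented for the demonstration and may not match the lesson's files.

```py
import glob
import os
import tempfile

# Create a temporary directory holding three empty files
with tempfile.TemporaryDirectory() as directory:
    for name in ['barrow.temperature.csv', 'fairbanks.temperature.csv', 'notes.txt']:
        open(os.path.join(directory, name), 'w').close()

    # Build the pattern the same way temp_extremes.py does
    filename_pattern = os.path.join(directory, '*temperature.csv')
    matches = sorted(os.path.basename(f) for f in glob.glob(filename_pattern))

# Only the names ending in "temperature.csv" are matched
print(matches)
```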

## Our First Python Script

**Let's investigate this command-line program, line by line. We'll start to write this in our own text editor, calling the file `myscript.py` for now.**

**The docstring.** 

```py
'''
Reports the min and max July temperatures for each file
that matches the given filename pattern.
'''
```

**Module imports.** 

```py
import csv
import os
import sys
import glob
```

### Encapsulating to Keep the Namespace Clean

What do we have so far?

```py
import csv
import os
import sys
import glob

# Get the user-specified directory
directory = sys.argv[1]

# Pattern to use in searching for files
filename_pattern = os.path.join(directory, '*temperature.csv')
files = glob.glob(filename_pattern)
print(files)
```

**Why does this work? Let's enter an interactive session again.**

## Alternative Command-Line Tools

`sys.argv` is a rather crude tool for processing command-line arguments. There are a couple of alternatives I suggest you look into if you are going to be writing command-line programs in Python:

- `argparse`, another built-in library, which handles common cases in a systematic way. Check out [this tutorial](http://docs.python.org/dev/howto/argparse.html).
- `Fire`, [a very new Python module from Google](https://opensource.googleblog.com/2017/03/python-fire-command-line.html), which can turn any Python object (function, class, etc.) into a command-line API.
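As a rough sketch of the `argparse` approach, the directory that `temp_extremes.py` reads from `sys.argv[1]` could instead be declared as a positional argument. The function name `parse_arguments` and the help text are assumptions made for illustration.

```py
import argparse

def parse_arguments(argument_list = None):
    '''Parse command-line arguments; uses sys.argv[1:] when given None.'''
    parser = argparse.ArgumentParser(
        description = 'Reports the min and max July temperatures for each file')
    parser.add_argument('directory', help = 'directory containing the CSV files')
    return parser.parse_args(argument_list)

# Passing a list stands in for what the user would type after the script name
args = parse_arguments(['.'])
print(args.directory)
```

With this in place, running the script without a directory produces a usage message rather than an `IndexError`.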

### Optional: Try out Python Fire

[Installation instructions and source code here.](https://github.com/google/python-fire)

## Modularization

```
package_name/
    __init__.py
    module1.py
    subpackage/
        __init__.py
        module2.py
```

Which could be used as:

```py
import package_name.module1
from package_name.module1 import some_function
from package_name.subpackage import module2
from package_name.subpackage.module2 import another_function
```

### Installing Your Project as a Module

We haven't covered installing new Python modules, but when the time is right for you to package your code together as a single module (e.g., as `package_name`, in the example above), [consider installing your module "in development mode" first.](https://packaging.python.org/tutorials/distributing-packages/#working-in-development-mode)

## Unit Testing

**A good test is one that verifies that a *small* and *specific* part of your code is working.** If a test is written too generally, then we don't actually know what went wrong when the test fails. In general, when we test a specific part of a larger code base, we are *unit testing.*

Python has a built-in unit testing module. Let's try it out.

```py
import unittest

class Main(unittest.TestCase):
    
    def test_range_overlap_with_disjoint_ranges(self):
        'Should return None for ranges that do not intersect'
        self.assertEqual(range_overlap([ (0.0, 1.0), (5.0, 6.0) ]), None)
        self.assertEqual(range_overlap([ (0.0, 1.0), (1.0, 2.0) ]), None)
        
    def test_range_overlap_with_single_range(self):
        'Should return same input when single range is provided'
        self.assertEqual(range_overlap([ (0.0, 1.0) ]), (0.0, 1.0))
        
        
if __name__ == '__main__':
    unittest.main()
```

# Analyzing and Optimizing Performance

> "Premature optimization is the root of all evil." - Sir Tony Hoare (later popularized by Donald Knuth)

## Benchmarking

## Line and Memory Profiling

Line and memory profilers aren't available in the Anaconda installation I had you use, [but you can read all about this topic on this excellent blog post.](https://www.huyng.com/posts/python-performance-analysis)

# Connecting to SQLite with Python

**What is one way we could access one row at a time?**
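One possibility is to loop over the cursor itself, which yields one row per iteration. The sketch below uses an in-memory database so that it is self-contained; the table name `temperatures` and its contents are hypothetical.

```py
import sqlite3

# An in-memory database avoids touching any file on disk
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute('CREATE TABLE temperatures (city TEXT, temp_k REAL)')
cursor.executemany('INSERT INTO temperatures VALUES (?, ?)',
    [('Barrow', 261.0), ('Fairbanks', 270.5)])

# Iterating over the cursor fetches one row (a tuple) at a time
rows = []
for row in cursor.execute('SELECT city, temp_k FROM temperatures ORDER BY city'):
    rows.append(row)

# Close the connection once we are done with it
connection.close()
print(rows)
```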

## Best Practices with Database Connections