# Overview

This lesson introduces Python as an environment for reproducible scientific data analysis and programming. The materials are based on the Software Carpentry [Programming with Python lesson](http://swcarpentry.github.io/python-novice-inflammation/).

**At the end of this lesson, you will be able to:**

- Read and write basic Python code;
- Import and export tabular data with Python;
- Subset and filter tabular data;
- Understand different data types and data formats;
- Understand pandas Data Frames and how they help organize tabular data;
- Devise and interpret data processing workflows;
- Automate your data cleaning and analysis with Python;
- Visualize your data using matplotlib and pandas;
- Connect to a SQLite database using Python.

This lesson will introduce Python as a *general purpose programming language.* Python is a great programming language to use for a wide variety of applications, including:

- Natural language processing or text analysis;
- Web development and web publishing;
- Web scraping or other unstructured data mining;
- Image processing;
- Spatial data analysis;
- (And many others.)

## License

As with [the Software Carpentry lesson](http://swcarpentry.github.io/python-novice-inflammation/license/), this lesson is licensed for open use under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).

# Introduction to Python

Python is a general purpose programming language that allows for the rapid development of scientific workflows. Python's main advantages are:

- It is open-source software, supported by the [Python Software Foundation](https://www.python.org/psf/);
- It is available on all platforms, including Windows, Mac OS X, and GNU/Linux;
- It can be used to program any kind of task (it is a *general purpose* language);
- It supports multiple *programming paradigms* (a fancy term computer scientists use to describe the different ways people like to design software);
- **Most importantly, it has a large and diverse community of users who share Python code they've already written to do a wide variety of things.**

## The Python Interpreter

The only language that computers really understand is machine language, or binary: ones and zeros. Anything we tell computers to do has to be translated to binary for computers to execute.

Python is what we call an *interpreted language.* This means that a program called the interpreter translates Python code into instructions the computer can execute as it reads the code. This distinguishes Python from languages like C or C++, which have to be *compiled* to machine code *before* they are run. The details aren't important to us; **what is important is that we can use Python in two ways:**

- We can use the Python interpreter in **interactive mode;**
- Or, we can execute Python code that is stored in a text file, called a script.

### Jupyter Notebook

For this lesson, we'll be using the Python interpreter that is embedded in Jupyter Notebook. Jupyter Notebook is a fancy, browser-based environment for **literate programming,** the combination of Python scripts with rich text for telling a story about the task you set out to do with Python. This is a powerful way to collect the code, the analysis, the context, and the results in a single place.

The Python interpreter we'll interact with in Jupyter Notebook is the same interpreter we could use from the command line. To launch Jupyter Notebook:

- In GNU/Linux or Mac OS X, launch the Terminal and type: `jupyter notebook`; then press ENTER.
- In Windows, launch the Command Prompt and type `jupyter notebook`; then press ENTER.

Let's try out the Python interpreter.

In [1]:
print('Hello, world!')

Hello, world!


Alternatively, we could save that one line of Python code to a text file with a `*.py` extension and then execute that file. We'll see that towards the end of this lesson.

## First Steps with Python

In *interactive mode,* the Python interpreter does three things for us, in order:

1. Reads our input;
2. Evaluates or executes the input command, if it can;
3. Prints the output for us to see, then waits for the next input.

This is called a **read, evaluate, print loop (REPL).** Let's try it out.

In [7]:
5 * 11

55

We can use Python as a fancy calculator, like any programming language.
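
The read, evaluate, print loop described above can be imitated in a few lines of Python. This is only a rough sketch, not how the interpreter is actually implemented: the list `inputs` is a made-up stand-in for what a user would type, the built-in `eval()` function stands in for the evaluation step, and the `results` list exists purely so we can inspect what happened.

```python
# A toy imitation of a read, evaluate, print loop (REPL).
# `inputs` is a hypothetical stand-in for lines typed by a user.
inputs = ['5 * 11', '2 + 3']
results = []                 # for inspection only; the real interpreter keeps no such list

for line in inputs:          # 1. read one line of input
    value = eval(line)       # 2. evaluate the expression
    print(value)             # 3. print the result, then move on to the next input
    results.append(value)
```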

When we perform calculations with Python, or run any Python statement that produces output, if we don't explicitly save that output somewhere, then we can't access it again. Python prints the output to the screen, but it doesn't keep a record of the output.

**In order to save the output of an arbitrary Python statement, we have to assign that output to a variable. We do this using the equal sign operator:**

In [15]:
weight_kg = 5 * 11

Notice there is no output associated with running this command. That's because the output we saw earlier has instead been saved to the *variable* named `weight_kg`.

If we want to retrieve this output, we can ask Python for the value associated with the variable named `weight_kg`.

In [16]:
weight_kg

55

As we saw earlier, we can also use the **function** `print()` to explicitly print the value to the screen.

In [17]:
print('Weight in pounds:', 2.2 * weight_kg)

Weight in pounds: 121.00000000000001


A function like `print()` can take multiple **arguments,** or inputs to the function. In the example above, we've provided two arguments: two different things to print to the screen in a sequence.
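
Here is a small sketch of how `print()` treats several arguments. The `sep` keyword argument and the built-in `round()` function are not used elsewhere in this lesson; they are shown here only as optional extras, and the name `weight_lb_rounded` is chosen just for this illustration.

```python
weight_kg = 5 * 11

# By default, print() separates its arguments with a single space
print('Weight in pounds:', 2.2 * weight_kg)

# The optional `sep` keyword argument changes the separator
print('Weight in pounds', 2.2 * weight_kg, sep=': ')

# Floating point arithmetic is approximate, which explains the trailing digits;
# round() trims the value to one decimal place for display
weight_lb_rounded = round(2.2 * weight_kg, 1)
print('Weight in pounds:', weight_lb_rounded)
```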

We can also change a variable's assigned value.

In [20]:
weight_kg = 57.5
print('Weight in pounds:', 2.2 * weight_kg)

Weight in pounds: 126.50000000000001


**If we imagine the variable as a sticky note with a name written on it, assignment is like putting the sticky note on a particular value.**

This means that assigning a value to one variable does *not* change the values of other variables. For example:

In [21]:
weight_lb = 2.2 * weight_kg
weight_lb

126.50000000000001

In [22]:
weight_kg = 100.0
weight_lb

126.50000000000001

Since `weight_lb` doesn't depend on where its initial value came from, it isn't automatically updated when `weight_kg` changes. This is different from how, say, spreadsheets work.
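
The sticky-note idea can be checked step by step. In this sketch, the helper names `before` and `unchanged` are introduced only for the demonstration; the point is that `weight_lb` keeps its old value until we recompute it ourselves.

```python
weight_kg = 57.5
weight_lb = 2.2 * weight_kg          # computed once, from the current value of weight_kg
before = weight_lb                   # remember the value for comparison

weight_kg = 100.0                    # move the weight_kg "sticky note" to a new value
unchanged = (weight_lb == before)    # weight_lb still holds the old result
print('weight_lb unchanged:', unchanged)

weight_lb = 2.2 * weight_kg          # to update weight_lb, we have to recompute it
print('Weight in pounds:', weight_lb)
```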

## Importing Libraries

What are some tasks you're hoping to complete with Python? Alternatively, what kinds of things have you done in other programming languages?

**When you're thinking of starting a new computer-aided analysis or building a new software tool, there's always the possibility that someone else has created just the piece of software you need to get your job done faster.** Because Python is a popular, general-purpose, and open-source programming language with a long history, there's a wealth of completed software tools out there written in Python for you to use. Each of these software *libraries* extends the basic functionality of Python to let you do new and better things.

**The Python Package Index (PyPI)** is the place to start when you're looking for a piece of Python software to use. We'll talk about that later.

For now, we'll load a Python package that is already available on our systems. **NumPy** is a numerical computing library that allows us to both represent sophisticated data structures and perform calculations on them.

In [23]:
import numpy

Now that we've imported `numpy`, we have access to new tools and functions that we didn't have before. For instance, we can use `numpy` to read in tabular data for us to work with.

In [26]:
numpy.loadtxt('barrow.temperature.csv', delimiter = ',')

array([[ 245.66  ,  247.52  ,  245.28  ,  256.32  ,  262.9   ,  272.46  ,
         278.06  ,  275.51  ,  269.34  ,  261.68  ,  251.35  ,  242.52  ],
       [ 248.12  ,  242.64  ,  252.04  ,  248.61  ,  262.84  ,  271.93  ,
         277.45  ,  278.92  ,  274.4   ,  266.77  ,  258.69  ,  248.2   ],
       [ 252.3   ,  240.39  ,  248.66  ,  255.41  ,  265.84  ,  274.64  ,
         278.87  ,  278.26  ,  273.36  ,  265.77  ,  261.22  ,  248.8   ],
       [ 239.52  ,  242.16  ,  243.65  ,  256.21  ,  266.    ,  274.64  ,
         279.66  ,  279.17  ,  273.23  ,  264.89  ,  260.47  ,  249.39  ],
       [ 244.95  ,  241.02  ,  244.75  ,  251.51  ,  262.25  ,  272.85  ,
         278.78  ,  277.02  ,  270.84  ,  264.87  ,  259.14  ,  248.89  ],
       [ 244.04  ,  242.96  ,  245.17  ,  255.92  ,  265.94  ,  275.58  ,
         277.5   ,  275.5   ,  272.59  ,  261.36  ,  254.54  ,  245.16  ],
       [ 245.15  ,  241.35  ,  249.29  ,  255.02  ,  264.12  ,  273.83  ,
         277.92  ,  279.6   ,  2

The expression `numpy.loadtxt()` is a **function call** that asks Python to run the function `loadtxt()` which belongs to the `numpy` library. Here, the word `numpy` is the **namespace** to which a function belongs. **This dotted notation is used everywhere in Python to refer to the parts of things as `thing.component`.**

Because the `loadtxt()` function and others belong to the `numpy` library, to access them we will always have to type `numpy.` in front of the function name. This can get tedious, especially in interactive mode, so Python allows us to come up with a new namespace as an alias.

In [27]:
import numpy as np

The `np` alias for the `numpy` library is a very common alias; so common, in fact, that you can get help for NumPy functions by looking up `np` and the function name in a search engine.
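
An alias is just a second name for the same library. The short check below is a sketch of that idea; the variable names `same_library` and `same_function` are made up for this example.

```python
import numpy
import numpy as np

# Both names refer to the same library...
same_library = np is numpy

# ...so either prefix reaches the very same function
same_function = np.loadtxt is numpy.loadtxt

print(same_library, same_function)
```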

With this alias, the `loadtxt()` function is now called as:

In [28]:
np.loadtxt('barrow.temperature.csv', delimiter = ',')

array([[ 245.66  ,  247.52  ,  245.28  ,  256.32  ,  262.9   ,  272.46  ,
         278.06  ,  275.51  ,  269.34  ,  261.68  ,  251.35  ,  242.52  ],
       [ 248.12  ,  242.64  ,  252.04  ,  248.61  ,  262.84  ,  271.93  ,
         277.45  ,  278.92  ,  274.4   ,  266.77  ,  258.69  ,  248.2   ],
       [ 252.3   ,  240.39  ,  248.66  ,  255.41  ,  265.84  ,  274.64  ,
         278.87  ,  278.26  ,  273.36  ,  265.77  ,  261.22  ,  248.8   ],
       [ 239.52  ,  242.16  ,  243.65  ,  256.21  ,  266.    ,  274.64  ,
         279.66  ,  279.17  ,  273.23  ,  264.89  ,  260.47  ,  249.39  ],
       [ 244.95  ,  241.02  ,  244.75  ,  251.51  ,  262.25  ,  272.85  ,
         278.78  ,  277.02  ,  270.84  ,  264.87  ,  259.14  ,  248.89  ],
       [ 244.04  ,  242.96  ,  245.17  ,  255.92  ,  265.94  ,  275.58  ,
         277.5   ,  275.5   ,  272.59  ,  261.36  ,  254.54  ,  245.16  ],
       [ 245.15  ,  241.35  ,  249.29  ,  255.02  ,  264.12  ,  273.83  ,
         277.92  ,  279.6   ,  2

`np.loadtxt()` has two arguments: the name of the file we want to read, and the delimiter that separates values on a line. These both need to be character strings (or strings for short), so we put them in quotes.

Finally, note that we haven't stored the Barrow temperature data because we haven't assigned it to a variable. Let's fix that.

In [31]:
barrow = np.loadtxt('barrow.temperature.csv', delimiter = ',')
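
If the CSV file isn't at hand, the same call can be tried on a small in-memory stand-in. This is a sketch under stated assumptions: `csv_text` and `sample` are hypothetical names, the text holds only the first three values of the first two rows shown above, and `io.StringIO` is used because `np.loadtxt()` also accepts a file-like object in place of a file name.

```python
import io
import numpy as np

# A small in-memory stand-in for a CSV file: two lines, three comma-separated values each
csv_text = '245.66,247.52,245.28\n248.12,242.64,252.04\n'

# Same two arguments as before: where to read from, and the delimiter between values
sample = np.loadtxt(io.StringIO(csv_text), delimiter=',')

# Because the result was assigned to a variable, we can use it again later
print(sample)
```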

## About the Data

The data we're using for this lesson are **monthly averages of surface air temperatures** from 1948 to 2016 for five different locations. They are derived from the NOAA NCEP CPC Monthly Global Surface Air Temperature Data Set, which has a 0.5 degree spatial resolution.

**What is the unit for air temperature used in this dataset?** Recall that when we assign a value to a variable, we don't see any output on the screen. To see our Barrow temperature data, we can use the `print()` function again.

In [32]:
print(barrow)

[[ 245.66    247.52    245.28    256.32    262.9     272.46    278.06
   275.51    269.34    261.68    251.35    242.52  ]
 [ 248.12    242.64    252.04    248.61    262.84    271.93    277.45
   278.92    274.4     266.77    258.69    248.2   ]
 [ 252.3     240.39    248.66    255.41    265.84    274.64    278.87
   278.26    273.36    265.77    261.22    248.8   ]
 [ 239.52    242.16    243.65    256.21    266.      274.64    279.66
   279.17    273.23    264.89    260.47    249.39  ]
 [ 244.95    241.02    244.75    251.51    262.25    272.85    278.78
   277.02    270.84    264.87    259.14    248.89  ]
 [ 244.04    242.96    245.17    255.92    265.94    275.58    277.5     275.5
   272.59    261.36    254.54    245.16  ]
 [ 245.15    241.35    249.29    255.02    264.12    273.83    277.92
   279.6     273.92    266.54    256.88    244.83  ]
 [ 248.89    240.27    245.6     253.41    264.44    272.96    277.09
   274.26    271.2     260.74    248.42    246.11  ]
 [ 244.43    240.

The data are formatted such that:

- Each column is the monthly mean, January (1) through December (12)
- Each row is a year, starting from January 1948 (1) through December 2016 (69)

[More information on the data can be found here.](http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/.CPC/.GHCN_CAMS/.gridded/.deg0p5/.temp/)
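
Given that layout, a position in the table can be translated into a calendar year and month with simple arithmetic. The sketch below assumes positions are counted from zero, as Python does (explained further on in this lesson); the variable names are chosen only for illustration.

```python
first_year = 1948        # the first row holds the values for 1948

row, column = 68, 11     # the last row and the last column, counting from zero

year = first_year + row  # rows advance one year at a time
month = column + 1       # columns run from January (1) through December (12)

print(year, month)
```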

## Arrays and their Attributes

Now that our data are stored in memory, we can start asking substantive questions about them. First, let's ask how Python represents the value stored in the `barrow` variable.

In [34]:
type(barrow)

numpy.ndarray

This output indicates that `barrow` currently refers to an N-dimensional array created by the NumPy library.

**A NumPy array contains one or more elements of the same data type.** The `type()` function only tells us that we have a NumPy array. We can find out the type of data contained in the array by asking for the *data type* of the array.

In [35]:
barrow.dtype

dtype('float64')

This tells us that the NumPy array's elements are 64-bit *floating point,* or decimal numbers.

**In the last example,** we accessed an **attribute** of the `barrow` array called `dtype`. Because `dtype` is not a function, we don't call it using a pair of parentheses. We'll talk more about this later but, for now, it's sufficient to distinguish between these examples:

- `np.loadtxt()` - A function that takes arguments, which go inside the parentheses
- `barrow.dtype` - An attribute of the `barrow` array; the `dtype` of an array doesn't depend on anything, so `dtype` is not a function and it does not take arguments
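
To see what "elements of the same data type" means in practice, we can build a tiny array by hand. The `np.array()` function is not otherwise used in this lesson and the name `mixed` is made up; this is just a sketch showing that NumPy converts mixed whole and decimal numbers to a single type.

```python
import numpy as np

# A mix of whole numbers and a decimal number
mixed = np.array([1, 2.5, 3])

# Every element has been stored as a floating point number
print(mixed.dtype)   # prints float64
print(mixed)
```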

**How many rows and columns are there in the `barrow` array?**

In [36]:
barrow.shape

(69, 12)

We see there are 69 rows and 12 columns.

The `shape` attribute, like the `dtype`, is a piece of information that was generated and stored when we first created the `barrow` array. This extra information, `shape` and `dtype`, describes `barrow` in the same way an adjective describes a noun. **We use the same dotted notation here as we did with the `loadtxt()` function because they have the same part-and-whole relationship.**
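
The difference between an attribute and a function can be checked directly. In this sketch, `sample` is a small hand-made array standing in for `barrow`, and the built-in `callable()` function (not otherwise used in this lesson) reports whether something can be called with parentheses.

```python
import numpy as np

# A small stand-in array: 2 rows, 3 columns
sample = np.array([[245.66, 247.52, 245.28],
                   [248.12, 242.64, 252.04]])

# shape is a stored pair of numbers, so it can be split into rows and columns
rows, columns = sample.shape
print(rows, columns)

# Attributes are stored values, not functions, so they cannot be called
print(callable(sample.shape))   # an attribute
print(callable(np.loadtxt))     # a function
```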

**To access the elements of the `barrow` array, we use square-brackets as follows.**

In [38]:
barrow[0, 0]

245.66

The `0, 0` element is the element in the first row and the first column. Python starts counting from zero, not from one, just like other languages in the C family (including C++, Java, and Perl).

**With this bracket notation, remember that rows are counted first, then columns.** For instance, this is the value in the first row and second column of the array:

In [39]:
barrow[0, 1]

247.52000000000001
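
Swapping the two indices selects a different element, which is an easy mistake to make. Here is a minimal sketch on a small hand-made array (`sample` and the two result names are hypothetical):

```python
import numpy as np

sample = np.array([[245.66, 247.52, 245.28],
                   [248.12, 242.64, 252.04]])

# Row index first, then column index
first_row_second_column = sample[0, 1]
second_row_first_column = sample[1, 0]

print(first_row_second_column, second_row_first_column)
```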

We can make a larger selection with **slicing.** For instance, **here is the first year of monthly average temperatures, all 12 of them, for Barrow:**

In [42]:
barrow[0, 0:12]

array([ 245.66,  247.52,  245.28,  256.32,  262.9 ,  272.46,  278.06,
        275.51,  269.34,  261.68,  251.35,  242.52])

The notation `0:12` can be read, "Start at index 0 and go up to, *but not including,* index 12." The up-to-but-not-including is important; we have 12 values in the array but, since we started counting at zero, there isn't a value at index 12.

In [43]:
barrow[0, 12]

IndexError: index 12 is out of bounds for axis 1 with size 12
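
The following sketch contrasts the slice `0:12`, which works, with the single index `12`, which does not. The array `first_year` is a hand-typed copy of the twelve values shown above, and the `try`/`except` construct, which is not covered in this lesson, is used only so the example can run to completion.

```python
import numpy as np

# The twelve monthly values of the first row, typed in by hand
first_year = np.array([245.66, 247.52, 245.28, 256.32, 262.9, 272.46,
                       278.06, 275.51, 269.34, 261.68, 251.35, 242.52])

# 0:12 stops just before index 12, so it selects all twelve values
all_months = first_year[0:12]
print(len(all_months))

# The last valid index is 11; asking for index 12 raises an IndexError
index_error_raised = False
try:
    first_year[12]
except IndexError as error:
    index_error_raised = True
    print('IndexError:', error)
```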

Slices don't have to start at zero. We can also leave out the lower bound when we want everything from the beginning, or the upper bound when we want everything through the end.

Here are the last six monthly averages of the first three years, written two different ways:

In [47]:
barrow[0:3, 6:12]

array([[ 278.06,  275.51,  269.34,  261.68,  251.35,  242.52],
       [ 277.45,  278.92,  274.4 ,  266.77,  258.69,  248.2 ],
       [ 278.87,  278.26,  273.36,  265.77,  261.22,  248.8 ]])

In [48]:
barrow[:3, 6:]

array([[ 278.06,  275.51,  269.34,  261.68,  251.35,  242.52],
       [ 277.45,  278.92,  274.4 ,  266.77,  258.69,  248.2 ],
       [ 278.87,  278.26,  273.36,  265.77,  261.22,  248.8 ]])

If we don't include a number at all, then the `:` symbol indicates "take everything."

In [50]:
barrow[0, :]

array([ 245.66,  247.52,  245.28,  256.32,  262.9 ,  272.46,  278.06,
        275.51,  269.34,  261.68,  251.35,  242.52])
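
The slicing rules can be verified on a small hand-made array. In this sketch, `sample` is a hypothetical 3-by-4 stand-in for `barrow`, and `np.array_equal()` is used to confirm that the explicit and shortened slices select the same values.

```python
import numpy as np

sample = np.array([[245.66, 247.52, 245.28, 256.32],
                   [248.12, 242.64, 252.04, 248.61],
                   [252.3,  240.39, 248.66, 255.41]])

# Last two columns of the first two rows, written two ways
explicit = sample[0:2, 2:4]
shortened = sample[:2, 2:]
same_selection = bool(np.array_equal(explicit, shortened))
print(same_selection)

# A bare colon takes everything along that axis: here, the whole first row
whole_row = sample[0, :]
print(whole_row)
```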