# Python, Numpy, and Pandas

Python has three main environments in which we will manipulate data. First, Python has its own native data structures that can be used effectively. However, these Python structures are often more general than a researcher might want. `Numpy` provides some extra structure on numerical arrays that is often helpful in scientific computing. For traditional data analysis where the unit of analysis is an observation, the `pandas` library is a great environment in Python for working with data.

## 1. Python

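Python objects include element types (e.g., int, float, complex, string) and container types that hold other objects (e.g., list, tuple, set, dictionary). This section loosely calls the containers sequence data types, although strictly speaking only lists, tuples, and strings are sequences; sets and dictionaries are not indexed by position.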

### 1.1 Python Element Types: Numerical Types
The `type()` built-in function returns the type of an object. The following two examples show the difference between an integer and a float.

In [None]:
type(3)

In [None]:
type(3.)

You can perform traditional float division with the `/` operator. In Python 3, `/` returns a float even when both operands are integers.

In [None]:
15.0 / 4.0

In [None]:
15 / 4

You can perform integer (floor) division with the `//` operator, which rounds the quotient down to the nearest integer at or below it rather than to the nearest integer.

In [None]:
15 // 4

You can use the modulo operator `%`, which gives you the remainder of a division.

In [None]:
7 % 3
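
The difference between floor division and rounding is easiest to see with negative operands. The following is a minimal sketch that uses only built-ins; the variable names are illustrative.

```python
# Floor division rounds toward negative infinity, not to the nearest integer.
q_pos = 15 // 4      # 3, although 15 / 4 is 3.75
q_neg = -15 // 4     # -4, the largest integer not exceeding -3.75
r_neg = -15 % 4      # 1, the remainder takes the sign of the divisor

# divmod returns the quotient and the remainder together.
q, r = divmod(15, 4)

# The two operators satisfy a == (a // b) * b + (a % b).
identity_holds = (-15 == q_neg * 4 + r_neg)
print(q_pos, q_neg, r_neg, q, r, identity_holds)
```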

We won't use complex numbers in this class, but you can create complex numbers with the `complex()` function or the `j` suffix, and access their parts with the `.real` and `.imag` attributes.

In [None]:
x = complex(2,3)
print(x)
print(x.real)

In [None]:
y = 4 + 5j
print(y)
print(y.imag)

### 1.2 Python Element Types: Strings
Strings are an important data type. They can be created by enclosing characters in double quotes "" or single quotes ''. Strings support operations such as concatenation with `+`.

In [None]:
str1 = "I love"
str2 = 'the MACSS program'
str3 = str1 + ' ' + str2 + '!'
print(str3)

You can pull out particular elements of a string. For example, the 10th element of `str3` is index 9 and should be the "e" in "the".

In [None]:
str3[9]

The last element of `str3` should be the exclamation point "!" which is index -1, and the second-to-last element should be the "m" in "program" which is index -2.

In [None]:
print(str3[-1])
print(str3[-2])

We can also pull out slices of strings.

In [None]:
print(str3[2:9])
print(str3[:-4])
print(str3[-4:])

And a second colon followed by an integer `n` will give us every `n`th element, starting from the first.

In [None]:
print(str3[::2])
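
The general slice form is `start:stop:step`, where `stop` is excluded. The sketch below rebuilds the same string under the illustrative name `s` and shows a plain slice, a stride, and a negative step.

```python
s = "I love the MACSS program!"

first_word = s[2:6]      # characters at indices 2, 3, 4, and 5
every_other = s[::2]     # start at index 0 and step by 2
reversed_s = s[::-1]     # a negative step walks through the string backwards

print(first_word)
print(every_other)
print(reversed_s)
```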

### 1.3 Python Sequence Types: List
A Python `list` is created by enclosing comma-separated values with square brackets []. Entries of a list do not have to be of the same type. Accessing entries of a list uses the same indexing and slicing operations as were demonstrated with strings.

In [None]:
my_list = ["Hello", 93.8, "world", 10]
my_list

In [None]:
print(my_list[0])
print(my_list[-1])
print(my_list[-2])

Common list methods (functions) include `append()`, `insert()`, `remove()`, and `pop()`.

In [None]:
next_list = [1,2]
print(next_list)
next_list.append(4)
print(next_list)

You can use the `.insert(x, y)` method to insert element `y` at index `x` of the list. Elements from that index onward shift one position to the right.

In [None]:
next_list.insert(2, 3)
print(next_list)

You can use the `.remove(y)` function to remove the first instance of the element `y` from a list.

In [None]:
your_list = [1, 'hey', 7, 'hey', 'cool', 36]
print(your_list)

In [None]:
your_list.remove('hey')
print(your_list)

The `.pop(x)` method will remove and return the element at index `x` of a list. If you leave the argument blank, it removes and returns the last element of the list.

In [None]:
num_list = [10, 20, 30, 40, 50]
print(num_list)
print(num_list.pop(3))
print(num_list.pop())

A last note about lists is that they are mutable objects. That is, when you replace, change, add to, or take away from the list, it changes the single instance of that object in the computer's memory. Other objects, such as tuples that we will cover soon (and strings covered previously), are immutable. This distinction is important for functional and object oriented programming. You often want immutable objects as the input to and output of a function. For this reason, tuples are the go-to container object for passing arguments to functions.
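
A short sketch of what mutability means in practice. The names `a`, `b`, `c`, and `t` are illustrative and not part of the examples above.

```python
# Two names bound to the same list see the same changes.
a = [1, 2, 3]
b = a            # b is another name for the same object, not a copy
b.append(4)
print(a)         # a has changed too

# A copy is an independent object.
c = a.copy()
c.append(5)
print(a, c)

# A tuple cannot be modified in place; attempting it raises TypeError.
t = (1, 2, 3)
try:
    t[0] = 99
    tuple_changed = True
except TypeError:
    tuple_changed = False
print(tuple_changed)
```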

### 1.4 Python Sequence Types: Set
A Python `set` is an unordered collection of distinct objects. Objects can be added
to or removed from a set after its creation (mutable). Initialize a set with curly braces { },
separating the values by commas, or use set() to create an empty set. Empty braces `{}` create an empty dictionary, not a set.

In [None]:
gym_members = {'Doe, John', 'Doe, John', 'Smith, Jane', 'Brown, Bob'}
gym_members

Like mathematical sets, Python `sets` have operations like `union` and `intersection`.

In [None]:
gym_members.intersection({'Brown, Bob', 'Smith, Jane', 'Jones, William'})

In [None]:
gym_members.union({'Brown, Bob', 'Smith, Jane', 'Jones, William'})
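
The set operations also have operator forms. The sketch below uses a second, made-up membership set named `pool` for illustration.

```python
gym = {'Doe, John', 'Smith, Jane', 'Brown, Bob'}
pool = {'Brown, Bob', 'Smith, Jane', 'Jones, William'}   # hypothetical second set

both = gym & pool          # same as gym.intersection(pool)
either = gym | pool        # same as gym.union(pool)
gym_only = gym - pool      # same as gym.difference(pool)

# Converting a list to a set drops duplicates; the original order is not kept.
unique_ids = set([3, 1, 3, 2, 1])

print(sorted(both), sorted(gym_only), sorted(unique_ids))
```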

### 1.5 Python Sequence Types: Tuple
A Python `tuple` (pronounced "tuh-pul") is created by enclosing comma-separated values with parentheses (). Entries of a `tuple` do not have to be of the same type. A `tuple` has fewer built-in operations than a `list`. Also, the tuple is an immutable object in that it cannot be changed after assignment. Any operations that behave like they are changing the `tuple` are actually making copies of the `tuple` with the changes. Accessing entries of a `tuple` uses the same indexing and slicing operations as were demonstrated with `lists` and `strings`.

The immutability of the `tuple` makes it the ideal object for passing arguments into functions and returning objects from functions. A tuple can be a collection of any object. It can be a collection of `lists`, `dicts`, `Series`, or `DataFrames`.

In [None]:
tup1 = (1, 'three', float(6.2), 'five', int(100))
tup1

In [None]:
tup1[:2]

You could dig out the fourth element of the string that is the second element of the tuple by chaining two index operations.

In [None]:
tup1[1][3]

You can unpack the contents of a tuple with a comma-separated sequence of values.

In [None]:
mynumber, sixovertwo, wishheight, siblings, oldage = tup1

In [None]:
(timetogo, milestoMich) = tup1[3:]
print(timetogo)
print(milestoMich)
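
Tuples are how a Python function returns several values at once. The following sketch uses a hypothetical helper, `min_max_mean`, written only for this illustration.

```python
def min_max_mean(values):
    """Return the minimum, maximum, and mean of a non-empty sequence."""
    return min(values), max(values), sum(values) / len(values)

stats = min_max_mean([10, 20, 30, 40])   # the three results arrive as one tuple
lo, hi, avg = stats                       # and can be unpacked into three names
print(stats, lo, hi, avg)
```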

### 1.6 Python Sequence Types: Dictionary
Unlike a `list`, a Python `dict` (dictionary) is not indexed by position. A dictionary stores key-value pairs, called items, and its values are indexed by its `keys`. Since Python 3.7, a dictionary preserves the order in which its keys were inserted. Dictionaries are initialized with curly braces, colons, and commas. Use dict() or {} to create an empty dictionary. Dictionaries are a good way to organize objects that are associated with keywords or names.

In [None]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
data

You can list all the `keys` of a dictionary by using the `.keys()` method. The keys are returned in insertion order (Python 3.7 and later), not in sorted order.

In [None]:
data.keys()

You can list all of the values associated with the `keys` using the `.values()` method.

In [None]:
data.values()

You can select the values associated with a particular `key` in the `dict`.

In [None]:
data['pop']

And you can select particular elements of the value stored under a key.

In [None]:
data['pop'][-2:]
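
A few other common dictionary operations, sketched on a smaller made-up version of `data`. The `'debt'` key and its default string are illustrative.

```python
data = {'state': ['Ohio', 'Nevada'], 'year': [2000, 2001]}

data['pop'] = [1.5, 2.4]            # add a new key-value pair
key_order = list(data.keys())        # insertion order (Python 3.7 and later)

# .get() returns a default instead of raising KeyError for a missing key.
missing = data.get('debt', 'not recorded')

# .items() yields (key, value) pairs.
for key, values in data.items():
    print(key, values)
```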

## 2. Numpy

`Numpy` is a powerful Python package for manipulating numerical data structures called arrays. The name `numpy` stands for "numerical Python".

A vector and a matrix are both examples of arrays. A vector is a one-dimensional array, and a matrix is a two-dimensional array. In numerical methods, one will often want to organize and manipulate data that has many dimensions. Arrays are the ideal Euclidean structure for numerical data.

`Numpy` is an essential package for working with numerical data. Even though you can often perform the operations you would like to execute with Python's native data types, `numpy` will often provide the most convenient and efficient functionality.

We can create a 1-dimensional `numpy` array by importing numpy and using the `array()` function.

In [None]:
import numpy as np

In [None]:
arr1 = np.array([8, 4, 6, 0, 2])
arr1

A 2-dimensional array looks a little more cumbersome to input manually.

In [None]:
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
arr2

In [None]:
arr2.shape

You can slice elements of the array with `numpy`'s indexing. The first dimension indexes rows, the second indexes columns, and higher dimensions have no simple geometric representation. As with all of Python, the first element's index number is 0, and the stop value of a slice is excluded. To slice from index `m` through index `n` inclusive, use `m:n+1`. Further:
* a scalar `m` means the `m+1`th element
* an empty `:` means the entire dimension
* two colons followed by an integer `::p` means every `p`th element, starting from the first
* a colon followed by an integer `:n` gives a slice from the first element to the `n`th element
* an integer followed by a colon `m:` gives a slice from the `m+1`th element to the last element

In [None]:
arr2[:1, 1:]
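
The slicing rules can be checked on a small array whose values make the positions obvious. This is a minimal sketch; `arr` is an illustrative array, not one used elsewhere in this notebook.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)   # 3 rows and 4 columns holding 0, 1, ..., 11

row1 = arr[1]              # second row, because indices are zero-based
block = arr[0:2, 1:3]      # rows 0-1 and columns 1-2 (stop index excluded)
col_last = arr[:, -1]      # every row, last column
every_other = arr[:, ::2]  # every second column, starting at column 0

print(row1, block, col_last, every_other, sep="\n")
```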

We can generate some uniformly distributed random numbers between 0 and 1 to fill a 3-dimensional array.

In [None]:
threeD = np.random.uniform(0, 1, (3, 3, 3))
threeD

In [None]:
threeD.shape

In [None]:
threeD[:, :, 0]

`Numpy` also has functions that extract particular parts of a matrix, such as `np.diag()` for the main diagonal.

In [None]:
np.diag(threeD[:, :, 0])

You can also take noncontiguous and nonlinear slices using Boolean masks.

In [None]:
(threeD[:, :, 0] < 0.3) | (threeD[:, :, 0] > 0.9)

In [None]:
threeD[:,:,0][(threeD[:, :, 0] < 0.3) | (threeD[:, :, 0] > 0.9)]

Notice here that if you made an identity matrix (=1 on diagonal, =0 otherwise) that had Boolean values (True or False), you could pull out the diagonal elements with that object, exactly as the `np.diag()` function did.

In [None]:
ident_num = np.eye(3)
print(ident_num)
ident_bool = np.eye(3, dtype=bool)
print(ident_bool)

In [None]:
threeD[:, :, 0][ident_bool]
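
Because `threeD` is random, its masks differ from run to run. The sketch below repeats the same masking steps on a fixed, made-up matrix `m` so the result can be followed by hand.

```python
import numpy as np

m = np.array([[0.1, 0.5, 0.95],
              [0.4, 0.2, 0.6],
              [0.7, 0.8, 0.05]])

mask = (m < 0.3) | (m > 0.9)      # Boolean array with the same shape as m
selected = m[mask]                # 1-D array of matching values, read row by row
diag = m[np.eye(3, dtype=bool)]   # same values as np.diag(m)

print(mask)
print(selected)
print(diag)
```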

Of course, `numpy` has matrix algebra operations. But `numpy`'s default is elementwise operations. This is what you want, because most numerical operations on arrays are elementwise.

In [None]:
print(arr2)
print(arr2.T)
newvec = np.array([0.5, 2])
print(newvec)
np.dot(arr2.T, newvec)
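
To make the contrast between elementwise and matrix operations concrete, the sketch below uses the `@` operator, which performs matrix multiplication on `numpy` arrays. The names `A` and `v` are illustrative copies of the arrays above.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
v = np.array([0.5, 2])                 # shape (2,)

elementwise = A * A    # multiplies matching entries, shape (2, 3)
matvec = A.T @ v       # matrix-vector product, same as np.dot(A.T, v)
gram = A @ A.T         # (2, 3) times (3, 2) gives shape (2, 2)

print(elementwise)
print(matvec)
print(gram)
```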

In [None]:
print(arr2)
arr2 + np.ones((2, 3))

In [None]:
arr2 + 1

In [None]:
arr2 + np.ones(3)

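The following "broadcasting" will not work and raises a `ValueError`, because an array of shape (2, 3) cannot be combined with an array of shape (2,). (More on broadcasting later.)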

In [None]:
arr2 + np.ones(2)
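
A minimal sketch of why this fails and one way to make it work. `Numpy` compares shapes starting from the last dimension, and two dimensions are compatible when they are equal or one of them is 1. The array `col` is an illustrative addition.

```python
import numpy as np

arr2 = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)

# Shapes (2, 3) and (2,) fail: the trailing dimensions 3 and 2 differ
# and neither of them is 1.
try:
    arr2 + np.ones(2)
    broadcast_ok = True
except ValueError:
    broadcast_ok = False

# Reshaping to a column of shape (2, 1) makes the shapes compatible:
# each row of arr2 is shifted by the matching entry of col.
col = np.array([10, 20]).reshape(2, 1)
shifted = arr2 + col
print(broadcast_ok)
print(shifted)
```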

## 3. Pandas

`Pandas` is a Python library for high-level data structures, created by Wes McKinney. The package name `pandas` is derived from the term "panel data". In his book, *Python for Data Analysis*, McKinney (2013) states:
> I started building pandas in early 2008 during my tenure at AQR, a quantitative investment management firm. At the time, I had a distinct set of requirements that were not well-addressed by any single tool at my disposal.
* Data structures with labeled axes supporting automatic or explicit data alignment. This prevents common errors resulting from misaligned data and working with differently-indexed data coming from different sources.
* Integrated time series functionality
* The same data structures handle both time series data and non-time series data.
* Arithmetic operations and reductions (like summing across an axis) would pass on the metadata (axis labels).
* Flexible handling of missing data.
* Merge and other relational operations found in popular databases (SQL-based, for example)

> I wanted to be able to do all of these things in one place, preferably in a language well-suited to general purpose software development. Python was a good candidate language for this, but at that time there was not an integrated set of data structures and tools providing this functionality. (p. 111)

The two main data structures in pandas are the `Series` object and the `DataFrame` object.

### 3.1 Pandas: Series
A `Series` is a one-dimensional array-like object containing an array of data (of any `numpy` data type) and an associated array of data labels, called its *index*. [Note: To run many of the pandas operations in the following cells, you will need to execute the `import pandas as pd` command and the `from pandas import Series, DataFrame` command.]

In [None]:
import pandas as pd
from pandas import Series, DataFrame

In [None]:
obj = Series([4, 7, -5, 3])
obj

In [None]:
obj.values

In [None]:
obj.index

You can create a `Series` with a customized index, as opposed to the default of simple index numbers, by supplying a list of index labels.

In [None]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

In [None]:
obj2.values

In [None]:
obj2.index

It is these customized indices that set the pandas `Series` object apart from the `numpy` array. With `Series`, you can use values in the index when selecting single values or a set of values.

In [None]:
obj2['a']

In [None]:
obj2[['c', 'a', 'd']]

`Numpy` array operations, such as filtering with a boolean array, scalar multiplication or applying math functions, will preserve the index-value link.

In [None]:
obj2

In [None]:
obj2 > 0

In [None]:
obj2[obj2 > 0]

In [None]:
obj2 * 2

In [None]:
import numpy as np

np.exp(obj2)

In [None]:
np.log(obj2)

The `Series` object has many of the same properties as a fixed-length, ordered `dict`. The `Series` is a one-to-one mapping of index values to data values. It can be substituted into many functions that expect a `dict`.

In [None]:
'b' in obj2

In [None]:
'e' in obj2

Data stored as a Python `dict` can be easily transformed into a pandas `Series`, with the `dict` keys becoming the index. Note that recent versions of pandas keep the keys in insertion order, whereas older versions displayed the indices in sorted order.

In [None]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3

If we create another object with the `dict` named `sdata`, but we label indices that do not match up exactly with the `keys` of the `dict`, the `Series` object will select the indices that do match up and place the missing value of `NaN` for the indices that do not match up.

In [None]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4

Two useful pandas methods for detecting missing values (`NaN`) are the `.isnull()` method and the `.notnull()` method. Both are also available as the top-level functions `pd.isnull()` and `pd.notnull()`.

In [None]:
pd.isnull(obj4)

In [None]:
obj4.isnull()

In [None]:
obj4.notnull()

In [None]:
obj4[obj4.notnull()]
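
Once missing values are detected, they are usually dropped or replaced. The following sketch uses the pandas methods `.dropna()` and `.fillna()` on a made-up `Series` named `s`; replacing `NaN` with 0 is only an illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
              index=['California', 'Ohio', 'Oregon', 'Texas'])

n_missing = s.isnull().sum()   # number of NaN entries
dropped = s.dropna()           # same result as s[s.notnull()]
filled = s.fillna(0)           # replace NaN with a chosen value

print(n_missing)
print(dropped)
print(filled)
```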

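As was mentioned earlier, one of the main benefits of `pandas` is that its indices are treated as a key associative feature. In contrast to `numpy` arrays, which combine elements by position, a pandas `Series` will automatically align on index labels in arithmetic operations. A label that appears in only one of the two objects produces `NaN` in the result.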

In [None]:
obj3

In [None]:
obj4

In [None]:
obj3 + obj4
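
A small sketch of label alignment on two made-up `Series`, `s1` and `s2`, whose labels are in different orders and only partly overlap. It also uses the `.add()` method with `fill_value`, which treats a label missing from one side as the given value instead of producing `NaN`.

```python
import pandas as pd

s1 = pd.Series({'Ohio': 1.0, 'Texas': 2.0, 'Utah': 3.0})
s2 = pd.Series({'Texas': 10.0, 'Ohio': 20.0, 'Oregon': 30.0})

total = s1 + s2                           # labels in only one Series give NaN
total_filled = s1.add(s2, fill_value=0)   # treat a missing label as 0

print(total)
print(total_filled)
```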

We can assign a `.name` attribute to a `Series` object as a whole as well as to the index of the `Series`. This is valuable for labeling data. It is also valuable for efficient manipulation of index values.

In [None]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4

You can change the index values in place by assigning a new list of labels to the `index` attribute. You might do this if your data comes with index values that are not as descriptive as you would like.

In [None]:
obj

In [None]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

### 3.2 Pandas: DataFrame
McKinney (2013) describes the pandas `DataFrame` object.
> A `DataFrame` represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The `DataFrame` has both row and column index; it can be thought of as a `dict` of `Series` (one for all sharing the same index). Compared with other such `DataFrame`-like structures you may have used before (like R's `data.frame`), row-oriented and column-oriented operations in `DataFrame` are treated roughly symmetrically. (p. 115)

The `DataFrame` is the standard data structure that you would think of when using programs like Stata, SAS, or R. As with the univariate `Series` object, the `DataFrame` allows for traditional data analysis facility while interacting with all of Python's other functionality. You will notice that many of the methods available to pandas `DataFrames` are also available in `numpy`. Their methods are usually equivalent, but the advantage of performing operations with the `DataFrame` is its respect for the index values.

In [None]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame

You can reorder the columns by passing a list of exactly which columns you would like and in what order.

In [None]:
DataFrame(data, columns=['year', 'state', 'pop'])

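If you would like to change the names of the columns, you can assign a new list to the `columns` attribute of the `DataFrame`. The new names are matched to the existing columns by position, so list them in the current column order.

In [None]:
# Current column order is state, year, pop (the insertion order of `data` in recent pandas)
frame.columns = ['State', 'Year', 'Pop']
frame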

When creating the object, we can pass in a column that is not contained in `data`. This will result in a column with that name filled with missing values `NaN`. Further, we can do the same operations with the `index` labels as we did with the `column` labels.

In [None]:
print(DataFrame(data))
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])
frame2

In [None]:
frame2.index

In [None]:
frame2.columns

You can retrieve a column from a `DataFrame` as a `Series` object either by using a `dict`-like notation or by attribute.

In [None]:
frame2['state']

In [None]:
frame2.state

You can also retrieve a row of a `DataFrame` as a `Series` by using the `.loc` indexer with the row's index label. (Older pandas code used `.ix` for this, which has since been removed.)

In [None]:
frame2.loc['three']
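
Rows can be selected by label or by integer position. This is a minimal sketch, assuming a recent pandas release in which `.loc` is the label-based indexer and `.iloc` is the position-based one; the small `df` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                   'year': [2000, 2001, 2001],
                   'pop': [1.5, 1.7, 2.4]},
                  index=['one', 'two', 'three'])

by_label = df.loc['three']       # row selected by index label
by_position = df.iloc[2]         # the same row selected by integer position
cell = df.loc['two', 'pop']      # single value: row label, then column label
subset = df.loc[['one', 'three'], ['state', 'pop']]   # chosen rows and columns

print(by_label)
print(cell)
print(subset)
```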

We can fill in the `debt` column by assigning a scalar, which is repeated in every row, or an array whose length matches the number of rows.

In [None]:
frame2.debt = 16.5
frame2

In [None]:
frame2['debt'] = np.arange(5)
frame2

The following example shows how nicely data can be combined based on index values. Suppose we know some `debt` values that are associated with certain `index` values and we want to incorporate that information into the `DataFrame`.

In [None]:
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
val

In [None]:
frame2.debt = val
frame2
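
The three kinds of column assignment behave differently, which the sketch below shows side by side on a made-up three-row `df`. Bracket notation is used here because it also works for creating a column that does not exist yet.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001, 2002]}, index=['one', 'two', 'three'])

df['debt'] = 16.5                  # a scalar is repeated down the column
scalar_fill = df['debt'].tolist()

df['debt'] = np.arange(3.0)        # an array must match the number of rows
array_fill = df['debt'].tolist()

# A Series is aligned on index labels; rows without a match receive NaN.
df['debt'] = pd.Series([-1.2, -1.7], index=['two', 'three'])
print(df)
```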

We can create new columns in the `DataFrame` based on other columns.

In [None]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

And we can delete columns using the `del` keyword.

In [None]:
del frame2['eastern']
frame2

You can take slices of your data by including a list with the columns you want along with a standard `numpy`-type slicing argument.

In [None]:
frame2[['year', 'state', 'pop']][:-2]

You could also explicitly list the particular observations that you want using a `DataFrame` call.

In [None]:
DataFrame(frame2, columns=['year', 'state', 'pop'], index=['two', 'five'])
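
Observations can also be chosen by a condition on their values rather than by listing labels. The following is a sketch on a rebuilt copy of the example data under the illustrative name `df`; the thresholds are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                   'year': [2000, 2001, 2002, 2001, 2002],
                   'pop': [1.5, 1.7, 3.6, 2.4, 2.9]},
                  index=['one', 'two', 'three', 'four', 'five'])

# Combine conditions with & (and) or | (or), each wrapped in parentheses.
ohio_recent = df[(df['state'] == 'Ohio') & (df['year'] >= 2001)]

# .loc accepts a Boolean row filter together with a list of columns.
large = df.loc[df['pop'] > 2, ['state', 'pop']]

print(ohio_recent)
print(large)
```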

## References

* McKinney, Wes, *Python for Data Analysis*, O'Reilly Media, Inc. (2013).
* [Python labs](http://www.acme.byu.edu/?page_id=2067), Applied and Computational Mathematics Emphasis (ACME), Brigham Young University.