# Chapter 5: Pandas

While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data. Since becoming an open source project in 2010, pandas has matured into a quite large library that’s applicable in a broad set of real-world use cases. The developer community has grown to over 800 distinct contributors, who’ve been helping build the project as they’ve used it to solve their day-to-day data problems. Throughout the rest of the book, I use the following import convention for pandas:

In [None]:
import pandas as pd
import numpy as np
import pandas_datareader.data as web

## Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

### Series

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its *index*. The simplest Series is formed from only an array of data.

In [None]:
obj = pd.Series([4, 7, -5, 3])
obj

The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers $0$ through $N - 1$ (where $N$ is the length of the data) is created. You can get the array representation and index object of the Series via its values and index attributes, respectively:

In [None]:
obj.values

In [None]:
obj.index

Often it is desirable to create a Series with an index identifying each data point with a label:

In [None]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values:

In [None]:
obj2['a']

In [None]:
obj2['d'] = 6
obj2[['c', 'a', 'd']]

Here ['c', 'a', 'd'] is interpreted as a list of indices, even though it contains strings instead of integers.

Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

In [None]:
obj2[obj2 > 0]

In [None]:
obj2  * 2

In [None]:
np.exp(obj2)
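
To make the preserved index-value link explicit, here is a minimal sketch that checks the labels after each kind of operation. The variable names (s, positive, doubled, roots) are my own and only for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = s[s > 0]        # boolean filtering keeps the labels of the surviving values
doubled = s * 2            # scalar multiplication keeps every label
roots = np.sqrt(positive)  # a NumPy ufunc returns a Series, not a bare ndarray

print(list(positive.index))   # labels 'd', 'b', 'c' remain; 'a' was filtered out
print(doubled['a'])           # the value stored under 'a', doubled
print(type(roots).__name__)   # Series
```

In each case the result can still be addressed by label, which a plain NumPy array would not allow.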

Another way to think about a Series is as a fixed-length, ordered dictionary, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict.

In [None]:
'b' in obj2

In [None]:
'e' in obj2

In [None]:
sdata = {'Ohio' : 35000, 'Texas' : 71000, 'Oregon' : 16000, 'Utah' : 50000}
obj3 = pd.Series(sdata)
obj3

When you pass only a dictionary, the index in the resulting Series will contain the dictionary's keys. Older pandas versions placed the keys in sorted order, while recent versions keep the dictionary's insertion order. You can override this by passing the dict keys in the order you want them to appear in the resulting Series:

In [None]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4

Here, three values found in *sdata* were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which pandas uses to mark missing or NA values. Since 'Utah' was not included in states, it is excluded from the resulting object. The isnull and notnull functions in pandas should be used to detect missing data:

In [None]:
pd.isnull(obj4)

In [None]:
pd.notnull(obj4)

Series also has these as instance methods:

In [None]:
obj4.isnull()

A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations:

In [None]:
obj4 + obj3

This can be thought of as a join operation in database terms.

A Series's index can be altered in-place by assignment:

In [None]:
obj

In [None]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

### DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and a column index; it can be thought of as a dictionary of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. The exact details of DataFrame's internals are outside the scope of this book. There are many ways to construct a DataFrame, though one of the most common is from a dictionary of equal-length lists or NumPy arrays.

In [None]:
data = {'state' : ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year' : [2000, 2001, 2002, 2001, 2002, 2003],
        'pop' : [1.5, 1.7, 3.6, 2.4, 2.9, 2.3]}
frame = pd.DataFrame(data)

The resulting DataFrame will have its index assigned automatically, as with Series. Older pandas versions placed the columns in sorted order, while recent versions keep the order of the keys in the dictionary:

In [None]:
frame

If you are using the Jupyter notebook, pandas DataFrame objects will be displayed as a more browser-friendly HTML table. For large DataFrames, the *head* method selects only the first five rows:

In [None]:
frame.head()

If you specify a sequence of columns, the DataFrame's columns will be arranged in that order:

In [None]:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

If you pass a column that isn't contained in the dict, it will appear with missing values in the results:

In [None]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                            index = ['one', 'two', 'three', 'four', 'five', 'six'])
frame2

In [None]:
frame2.columns

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:

In [None]:
frame2['state']

In [None]:
frame2.year

Rows can also be retrieved by name with the special *loc* attribute (or by position with *iloc*, covered later in this chapter):

In [None]:
frame2.loc['three']

Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values:

In [None]:
frame2['debt'] = 16.5
frame2 

In [None]:
frame2['debt'] = np.arange(6.)
frame2

When you are assigning lists or arrays to a column, the value's length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame's index, inserting missing values in any holes:

In [None]:
val = pd.Series([-1.2, -1.5, -1.7], index = ['two', 'four', 'five'])
frame2['debt'] = val
frame2
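
The two assignment rules above can be contrasted in a small sketch. The frame, the 'zzz' label, and the variable names here are made up for illustration and are not part of the running example:

```python
import pandas as pd

demo = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                     'year': [2000, 2001, 2001]},
                    index=['one', 'two', 'three'])

# A list is matched by position, so a wrong length is rejected.
length_error = None
try:
    demo['debt'] = [1.0, 2.0]          # two values for three rows
except ValueError as exc:
    length_error = str(exc)

# A Series is matched by label: 'two' is filled, 'one' and 'three' become NaN,
# and the label 'zzz' (absent from demo's index) is silently dropped.
demo['debt'] = pd.Series([-1.2, 9.9], index=['two', 'zzz'])

print(length_error)
print(demo)
```

Note that the Series did not need to have the same length as the DataFrame; only its labels matter.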

Assigning a column that does not exist will create a new column. The del keyword will delete columns as with a dict.

As an example of del, first add a new column of boolean values indicating whether the state column equals 'Ohio':

In [None]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

The del keyword can then be used to remove this column:

In [None]:
del frame2['eastern']
frame2.columns

Another common form of data is a nested dict of dicts:

In [None]:
pop = {'Nevada' : {2001 : 2.4, 2002 : 2.9},
       'Ohio' : {2000: 1.5, 2001 : 1.7, 2002 : 3.6}}

If the nested dict is passed to the DataFrame, pandas will interpret the outer dict keys as the columns and the inner keys as the row indices:

In [None]:
frame3 = pd.DataFrame(pop)
frame3

You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array:

In [None]:
frame3.T

The keys in the inner dictionaries are combined to form the index in the result (older pandas versions also sorted them). This isn't true if an explicit index is specified:

In [None]:
pd.DataFrame(pop, index=[2001, 2002, 2003])

Dicts of Series are treated in much the same way:

In [None]:
pdata = {'Ohio' : frame3['Ohio'][:-1],
         'Nevada' : frame3['Nevada'][:2]}
pd.DataFrame(pdata)

If a DataFrame's index and columns have their name attributes set, these will also be displayed:

In [None]:
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

As with Series, the values attribute returns the data contained in the DataFrame as a two-dimensional ndarray:

In [None]:
frame3.values

If the DataFrame's columns are different dtypes, the dtype of the values array will be chosen to accommodate all of the columns:

In [None]:
frame2.values

### Index Objects

Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:

In [None]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

In [None]:
index[1:]

Index objects are immutable and thus can't be modified by the user. The following assignment raises a TypeError:

In [None]:
index[1] = 'd'
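
As a minimal sketch of what immutability means in practice, the block below catches the error and then relabels through a new object instead. The use of Series.rename here is my own choice of illustration, not something the text above prescribes:

```python
import pandas as pd

labels_abc = pd.Index(['a', 'b', 'c'])

message = None
try:
    labels_abc[1] = 'd'              # item assignment is not supported on an Index
except TypeError as exc:
    message = str(exc)

# To change a label, build a new object rather than mutating the Index.
ser_abc = pd.Series(range(3), index=labels_abc)
renamed = ser_abc.rename({'b': 'd'})  # returns a new Series with a new Index

print(message)
print(list(labels_abc))   # the original Index is untouched
print(list(renamed.index))
```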

Immutability makes it safer to share Index objects among data structures:

In [None]:
labels = pd.Index(np.arange(3))
labels

In [None]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2

In [None]:
obj2.index is labels

In addition to being array-like, an Index also behaves like a fixed-size set:

In [None]:
frame3

In [None]:
frame3.columns

In [None]:
'Ohio' in frame3.columns

In [None]:
2003 in frame3.index

Unlike Python sets, a pandas Index can contain duplicate labels:

In [None]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels
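
Duplicate labels affect what a selection returns. The following sketch, with made-up labels and names, shows that selecting a repeated label yields every matching entry, while a unique label yields a scalar:

```python
import pandas as pd

dup_ser = pd.Series(range(4), index=['foo', 'foo', 'bar', 'baz'])

both_foo = dup_ser['foo']   # a Series holding both 'foo' entries
only_baz = dup_ser['baz']   # a single scalar value

print(dup_ser.index.is_unique)  # False, because 'foo' appears twice
print(both_foo)
print(only_baz)
```

Code that assumes a label lookup always returns a scalar can therefore break on data with duplicate labels; checking index.is_unique first is one way to guard against that.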

## Essential Functionality

### Reindexing

An important method on pandas objects is *reindex*, which means to create a new object with the data *conformed* to a new index. Consider an example:

In [None]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

Calling *reindex* on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:

In [None]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The method option allows us to do this, using a method such as *ffill*, which forward-fills the values:

In [None]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index = [0, 2, 4])
obj3

In [None]:
obj3.reindex(range(6), method='ffill')
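
To see what the method option changes, this short sketch reindexes the same data with and without forward-filling. The names colors, plain, and filled are illustrative only:

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

plain = colors.reindex(range(6))                   # new labels 1, 3, 5 get missing values
filled = colors.reindex(range(6), method='ffill')  # each gap copies the previous value

print(plain.isna().sum())   # 3 labels had no data
print(list(filled))
```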

With DataFrame, reindex can alter either the (row) index, columns, or both. When passed only a sequence, it reindexes the rows in the result:

In [None]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                    index = ['a', 'c', 'd'],
                    columns = ['Ohio', 'Texas', 'California'])
frame

In [None]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

The columns can be reindexed with the *columns* keyword:

In [None]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns = states)

### Dropping entries from an axis

Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis:

In [None]:
obj = pd.Series(np.arange(5.), index = ['a', 'b', 'c', 'd', 'e'])
obj

In [None]:
new_obj = obj.drop('c')
new_obj

With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index = ['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns = ['one', 'two', 'three', 'four'])
data

Calling *drop* with a sequence of labels will drop values from the row labels (axis = 0):

In [None]:
data.drop(['Colorado', 'Ohio'])

You can drop values from the columns by passing axis = 1 or axis = 'columns':

In [None]:
data.drop('two', axis=1)

Many functions, like drop, which modify the size or shape of a Series or DataFrame, can manipulate an object *in-place* without returning a new object. Be careful with inplace, as it destroys any data that is dropped:

In [None]:
obj.drop('c', inplace = True)
obj
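
The difference between the two calling styles can be checked directly. In this sketch (names are my own), the default call leaves the original untouched, while the in-place call returns None and mutates it:

```python
import numpy as np
import pandas as pd

letters = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

trimmed = letters.drop(['b', 'c'])        # new object; letters still has 5 entries
length_after_copy_drop = len(letters)

result = letters.drop('c', inplace=True)  # mutates letters and returns None

print(len(trimmed), length_after_copy_drop)
print(result)
print(list(letters.index))
```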

### Indexing, Selection and Filtering

Series indexing works analogously to NumPy array indexing, except you can use the Series's index values instead of only integers. Here are some examples of this:

In [None]:
obj = pd.Series(np.arange(4.), index = ['a', 'b', 'c', 'd'])
obj

In [None]:
obj['b']

In [None]:
obj[1]

In [None]:
obj[2:4]

Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive:

In [None]:
obj['b':'c']
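
The inclusive endpoint is easy to verify by comparing a label slice against positional slices over the same data. This is a small sketch with illustrative names, using iloc for the positional side:

```python
import numpy as np
import pandas as pd

abcd = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = abcd['b':'c']        # label slice: both 'b' and 'c' are included
by_position = abcd.iloc[1:3]    # positional slice: stop position 3 is excluded
short = abcd.iloc[1:2]          # positions 1 up to (not including) 2: only 'b'

print(list(by_label.index))
print(list(by_position.index))
print(list(short.index))
```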

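A DataFrame can also be filtered with a boolean array. Here we select the rows of the earlier data DataFrame where the value in column 'three' is greater than 5: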

In [None]:
data[data['three'] > 5]

### Selecting with loc and iloc

For label-indexing on the rows of a DataFrame, I introduce the special indexing operators *loc* and *iloc*. They enable you to select a subset of the rows and columns from a DataFrame with NumPy-like notation using either axis labels (loc) or integers (iloc).

As a preliminary example, let's select a single row and multiple columns by label:

In [None]:
data.loc['Colorado', ['two', 'three']]

We'll then perform some similar selections with integers using *iloc*:

In [None]:
data.iloc[2, [3, 0, 1]]

In [None]:
data.iloc[2]
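
As a rough sketch of how the two operators relate, the block below rebuilds the same 4 x 4 table under a different name (table, so the running data variable is not touched) and makes the same selection both ways:

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
                     columns=['one', 'two', 'three', 'four'])

by_label = table.loc['Utah', ['four', 'one', 'two']]  # row and columns by label
by_position = table.iloc[2, [3, 0, 1]]                # the same cells by integer position

# Label slices work with loc too, and include the endpoint 'Utah'.
up_to_utah = table.loc[:'Utah', 'two']

print(by_label.equals(by_position))  # True
print(list(up_to_utah))
```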

### Integer indexes

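Working with pandas objects indexed by integers can trip up new users, because indexing behaves differently than with built-in Python data structures such as lists and tuples. For example, the following code raises an error: with an integer index, pandas treats -1 as a label, and no such label exists.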

In [None]:
ser = pd.Series(np.arange(3.))
ser
ser[-1]

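With a non-integer index, there is no such ambiguity, although recent pandas versions deprecate this positional fallback in favor of ser2.iloc[-1]: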

In [None]:
ser2 = pd.Series(np.arange(3.), index = ['a', 'b', 'c'])
ser2[-1]
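
One way to sidestep the ambiguity is to always state the intent with loc (labels) or iloc (positions). The following is a minimal sketch under that assumption; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

int_ser = pd.Series(np.arange(3.))   # default integer index 0, 1, 2

raised = False
try:
    int_ser[-1]                      # -1 is looked up as a label, which does not exist
except KeyError:
    raised = True

last = int_ser.iloc[-1]              # explicitly positional: the last element
label_slice = int_ser.loc[:1]        # label-based: labels 0 and 1, endpoint included
position_slice = int_ser.iloc[:1]    # position-based: only position 0

print(raised, last)
print(len(label_slice), len(position_slice))
```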

### Arithmetic and Data Alignment

An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels. Let's look at an example:

In [None]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index = ['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index = ['a', 'c', 'e', 'f', 'g'])
s1

In [None]:
s2

In [None]:
s1 + s2

The internal data alignment introduces missing values in the label locations that don't overlap. Missing values will then propagate in further arithmetic computations.
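
The sketch below counts the missing values that alignment introduces and shows them propagating. It also uses the add method with a fill_value argument, which the text above does not cover; I include it only as one possible way to treat an absent label as zero:

```python
import pandas as pd

left = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
right = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

total = left + right    # labels 'd', 'f', 'g' exist on one side only, so they become NaN
scaled = total * 2      # the NaN values carry through further arithmetic

# Assumption for illustration: treat a label missing from one side as 0.
filled = left.add(right, fill_value=0)

print(total.isna().sum(), scaled.isna().sum())
print(filled)
```

Whether zero is a sensible substitute depends on the data; for sums it often is, for other operations it may not be.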

In the case of DataFrame, alignment is performed on both the rows and the columns:

In [None]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), 
                   columns = list('bcd'),
                   index = ['Ohio', 'Texas', 'Colorado'])

df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                   columns = list('bde'),
                   index = ['Utah', 'Ohio', 'Texas', 'Oregon'])
                   
df1

In [None]:
df2

Adding these together returns a DataFrame whose index and columns are the unions of the ones in each DataFrame:

In [None]:
df1 + df2

### Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work with pandas objects:

In [None]:
frame = pd.DataFrame(np.random.randn(4, 3), 
                     columns=list('bde'),
                     index = ['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

In [None]:
np.abs(frame)

Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame's apply method does exactly this.

In [None]:
f = lambda x : x.max() - x.min()
frame.apply(f)

In [None]:
frame.apply(f, axis = 'columns')
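
Because the frame above holds random numbers, the direction of apply is easier to follow on fixed values. This sketch uses a small hand-written table (small and spread are hypothetical names) where the results can be checked by eye:

```python
import pandas as pd

small = pd.DataFrame([[1.0, 5.0, 3.0],
                      [4.0, 2.0, 9.0]],
                     columns=list('bde'),
                     index=['Utah', 'Ohio'])

spread = lambda x: x.max() - x.min()

per_column = small.apply(spread)                  # one result per column: b, d, e
per_row = small.apply(spread, axis='columns')     # one result per row: Utah, Ohio

print(per_column.tolist())   # [3.0, 3.0, 6.0]
print(per_row.tolist())      # [4.0, 7.0]

# The same column result is available from built-in methods without apply.
print(per_column.equals(small.max() - small.min()))
```

With the default axis the function receives each column in turn; with axis='columns' it receives each row.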

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary. The function passed to apply need not return a scalar value; it can also return a Series with multiple values:

In [None]:
def f(x):
    return pd.Series([x.min(), x.max()], index = ['min', 'max'])

frame.apply(f)

Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating-point value in frame. You can do this with applymap (in pandas 2.1 and later, applymap is deprecated in favor of the equivalent DataFrame.map):

In [None]:
format = lambda x : '%.2f' % x

frame.applymap(format)

The reason for the name applymap is that Series has a map method for applying an element-wise function:

In [None]:
frame['e'].map(format)

### Sorting and Ranking

Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the *sort_index* method, which returns a new, sorted object:

In [None]:
obj = pd.Series(range(4), index = list('dabc'))
obj.sort_index()

With a DataFrame, you can sort by index on either axis:

In [None]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index = ['three', 'one'],
                     columns = list('dabc'))
frame.sort_index()

In [None]:
frame.sort_index(axis = 1)

## Summarizing and Computing Descriptive Statistics

pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics, methods that extract a single value from a Series or a Series of values from the rows or columns of a DataFrame. Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data. Consider a small DataFrame:

In [None]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], 
                   [np.nan, np.nan], [0.75, -1.3]],
                   index = list('abcd'),
                   columns = ['one', 'two'])
df

Calling a DataFrame's sum method returns a Series containing column sums:

In [None]:
df.sum()

In [None]:
df.sum(axis = 'columns')
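
The built-in missing-data handling can be made visible by toggling the skipna option, which these reduction methods accept. The sketch below rebuilds the same small table under a different name (stats_df); note that the result for an all-NaN row has differed across pandas versions, and recent versions return 0.0 for its sum:

```python
import numpy as np
import pandas as pd

stats_df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                         [np.nan, np.nan], [0.75, -1.3]],
                        index=list('abcd'),
                        columns=['one', 'two'])

col_sums = stats_df.sum()                              # NaN values are skipped
row_sums = stats_df.sum(axis='columns')                # row 'c' has nothing to add
strict = stats_df.sum(axis='columns', skipna=False)    # any NaN makes the row's sum NaN
row_means = stats_df.mean(axis='columns')              # mean of an all-NaN row is NaN

print(col_sums)
print(row_sums)
print(strict)
print(row_means)
```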

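Other methods are *accumulations*, such as cumsum. Another type of method, like describe, is neither a reduction nor an accumulation; it produces multiple summary statistics in one shot: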

In [None]:
df.cumsum()

In [None]:
df.describe()

### Correlation and Covariance

Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let's consider some DataFrames of stock prices and volumes obtained from Yahoo! Finance using the add-on pandas-datareader package.

In [None]:
all_data = {ticker : web.get_data_yahoo(ticker) for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}
price = pd.DataFrame({ticker : data['Adj Close'] for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker : data['Volume'] for ticker, data in all_data.items()})

Compute percent changes of the prices.

In [None]:
returns = price.pct_change()
returns.tail()

The corr method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, cov computes the covariance.

In [None]:
returns['MSFT'].corr(returns['IBM'])

In [None]:
returns['MSFT'].cov(returns['IBM'])
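
Since the stock data requires a network download, here is a self-contained sketch on made-up numbers that shows how corr and cov use only the overlapping, non-NA pairs. The two Series are synthetic and chosen so the result can be worked out by hand:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan])
y = pd.Series([2.0, 4.0, 6.0, np.nan, 10.0])

# Only positions 0, 1, 2 have values in both Series, so only the pairs
# (1, 2), (2, 4), (3, 6) enter the computation.
correlation = x.corr(y)   # y is exactly 2 * x on those pairs, so this is 1
covariance = x.cov(y)     # sample covariance: ((-1)(-2) + 0 + (1)(2)) / (3 - 1) = 2

print(correlation, covariance)
```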

### Unique Values, Value Counts and Membership

Another class of related methods extracts information about the values contained in a one-dimensional Series. To illustrate these, consider this example:

In [None]:
obj = pd.Series(list('cadaabbcc'))
uniques = obj.unique()
uniques
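
The other two ideas named in this section's heading, value counts and membership, can be sketched on the same data. The names letters_ser and mask are my own; value_counts and isin are existing Series methods:

```python
import pandas as pd

letters_ser = pd.Series(list('cadaabbcc'))

uniques_seen = letters_ser.unique()     # unique values in order of first appearance
counts = letters_ser.value_counts()     # how often each value occurs
mask = letters_ser.isin(['b', 'c'])     # boolean mask: True where the value is 'b' or 'c'

print(list(uniques_seen))
print(counts)
print(letters_ser[mask])
```

The unique values are not necessarily returned in sorted order; they can be sorted after the fact if needed.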