<center> 
# R406: Using Python for data analysis and modelling

<br> <br> 

## Lecture 11: Pandas — main data structures, indexing, selecting, filtering and sorting

<br>

<center> **Andrey Vassilev**

<br> 

<center> **2016/2017**
 

# Outline

1. An overview of Pandas
2. Main data structures
3. Basic operations on Pandas objects

# Main facts about Pandas

- Pandas is a Python package that offers rich data processing and analysis functionality.
- In particular, it can work with series of observations and tabular heterogeneous data (think a dataset consisting of several time series or observations on different subjects).
- Pandas allows us to clean, transform, filter, sort etc. a dataset.
- Pandas also allows us to split, merge and extract various representations of our data.
- Pandas can interact with different data sources.
- It has sophisticated date-time functionality.

# Pandas data structures

- The main data structures in Pandas are: 
    - `Series` 
    - `DataFrame` 
    - `Panel` (deprecated since Pandas 0.20 and removed in 0.25)
- The key ones are the first two.
- These structures can be treated as nestable by dimension: 
   - The `Series` is 1D and can be used as a building block of a `DataFrame`.
   - The `DataFrame` is 2D and can serve as the building block of a `Panel`.
   - The `Panel` is 3D and is the most general (but least used) data structure. In current Pandas versions the usual substitute is a `DataFrame` with a hierarchical index.

To start exploring the various Pandas structures we first import the relevant modules:

In [None]:
import pandas as pd # another established convention
import numpy as np
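
The nesting by dimension described above can be checked directly. The following is a minimal sketch; the names `x`, `y` and `frame` and the values are made up for illustration.

```python
import pandas as pd

# Two 1D Series sharing the same default index (illustrative values)
x = pd.Series([1.0, 2.0, 3.0])
y = pd.Series([10, 20, 30])

# A 2D DataFrame assembled from the two Series
frame = pd.DataFrame({"x": x, "y": y})

print(x.ndim, frame.ndim)   # number of dimensions of each object
print(type(frame["x"]))     # selecting a single column gives back a Series
```

Each column of the `DataFrame` keeps its own data type, which is what makes heterogeneous tabular data possible.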

# Series

A `Series` can be created from a list.

In [None]:
s = pd.Series([1,4,-2,0,np.nan,3])
s

A `Series` object has several main characteristics.

It has an index.

In [None]:
s.index

This default index is trivial because it coincides with the familiar positional indexing for sequences. We can replace it with a more interesting index:

In [None]:
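# Month-end frequency: the first date is 2017-01-31, the first month end on or after the start date
# (recent Pandas versions spell this frequency "ME")
dt = pd.date_range(start="2017-01-11",periods=len(s),freq="M")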
print(dt)
s.index=dt
s

You can inspect the contents of a `Series` by using the `head()` and `tail()` methods.

In [None]:
s.head()

In [None]:
s.head(3) # Try changing it to 2 or 4

In [None]:
s.tail()

In [None]:
s.values # You can extract the values as an array

In [None]:
s.iloc[4] = 8 # assignment by position; this replaces the missing value
s.describe()

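A `Series` can be created from a dictionary. The dictionary keys will be used as the index. Older Pandas versions sorted the keys, while versions from 0.23 on (with Python 3.6 or later) keep the insertion order of the dictionary.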

In [None]:
s = pd.Series({"a":1,"b":3,"f":4, "c":-2.2})
s
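
Since the ordering of the index can depend on the Pandas version in use, it is safer to sort explicitly whenever the order matters. A small sketch using the `sort_index()` and `sort_values()` methods of `Series` (the variable names are illustrative):

```python
import pandas as pd

s_dict = pd.Series({"a": 1, "b": 3, "f": 4, "c": -2.2})

by_label = s_dict.sort_index()    # order by index label: a, b, c, f
by_value = s_dict.sort_values()   # order by value, smallest first

print(by_label)
print(by_value)
```

Both methods return a new `Series` and leave `s_dict` unchanged.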

You can also create it by simultaneously passing values and index.

In [None]:
s = pd.Series(np.random.rand(5),index = ["e"+str(i) for i in range(1,6)])
s

An element of a `Series` can be accessed by its index, "dictionary-style"...

In [None]:
s['e2']

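... or by its position. Plain `s[1]` on a non-integer index is deprecated in recent Pandas versions, so we use the position-based indexer `iloc`:

In [None]:
s.iloc[1]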

A `Series` object also supports slicing:

In [None]:
s[1:3]

Slicing can also be done with respect to the index labels (notice that it includes the end label, unlike position-based slicing):

In [None]:
s['e1':'e3']
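
The difference between the two kinds of slicing is easiest to see when both slices stop at the same element. The sketch below uses the explicit indexers `iloc` (positions) and `loc` (labels), which are introduced later in the lecture; the name `s_demo` and the values are illustrative.

```python
import numpy as np
import pandas as pd

labels = ["e" + str(i) for i in range(1, 6)]   # e1, ..., e5
s_demo = pd.Series(np.arange(5), index=labels)

# The label e4 sits at position 3, so both slices "stop" at the same element
by_position = s_demo.iloc[1:3]     # positions 1 and 2: the endpoint is excluded
by_label = s_demo.loc["e2":"e4"]   # labels e2, e3 and e4: the endpoint is included

print(list(by_position.index))
print(list(by_label.index))
```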

# `DataFrame`s

The `DataFrame` is the Pandas data structure that holds tabular data. It can be created from a NumPy array and takes an index argument, just like a `Series`. In addition, it takes a `columns` argument specifying column names.

In [None]:
df = pd.DataFrame(np.array([[2,1,3,5],[34,36,29,35]]).T,
                  index = pd.date_range(start="2005",periods=4,freq='A'),
                  columns = ['A','B'])
df

In [None]:
df.index

In [None]:
df.columns

A `DataFrame` column can be accessed by direct indexing:

In [None]:
df['B']

However, a slice is assumed to refer to the index, i.e. to rows:

In [None]:
df['A':'B'] 
# Notice that the error message says that 
# the string provided is not a date!

Similarly, you need a slice to access rows. The following is an error because Pandas assumes you are trying to provide a column name:

In [None]:
df['2005-12-31']

This, on the other hand, works:

In [None]:
df['2005-12-31':'2006-12-31']

Or a trivial type of slice if you need to access a single row:

In [None]:
df['2005-12-31':'2005-12-31']

Since the previous conventions may be inconvenient in some use cases, Pandas offers a more flexible way to access elements.

The `iloc` indexer (integer location) allows us to specify positions:

In [None]:
df.iloc[0]

In [None]:
type(df.iloc[0])

In [None]:
df.iloc[-1]

In [None]:
df.iloc[1:3]

In [None]:
df.iloc[:3]

It can be used to access rows and columns simultaneously:

In [None]:
df.iloc[1,0]

In [None]:
df.iloc[:,0]

In [None]:
df.iloc[2,:] # equivalent to df.iloc[2]

The `loc` indexer allows us to refer to rows and columns by label instead of position.

In [None]:
df.loc['20071231'] # You can also provide the date string in this format

In [None]:
df.loc['20071231':'20071231']

In [None]:
df.loc['20061231':'20071231']

In [None]:
df.loc['20061231':]

In [None]:
df.loc[:,'B']

Incidentally, `iloc` and `loc` also work for `Series` objects.
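
The distinction matters most when the labels are themselves integers. A minimal sketch with a `Series` whose integer labels deliberately differ from the positions (the name `s_int` and the values are illustrative):

```python
import pandas as pd

# Integer labels that do not coincide with the positions 0, 1, 2
s_int = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])

print(s_int.loc[0])    # label 0    -> 20.0
print(s_int.iloc[0])   # position 0 -> 10.0
```

Using `loc` or `iloc` explicitly removes any doubt about which of the two is meant.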

It is possible to select a custom subset of the data by passing a list.

In [None]:
df.iloc[[0,2,3],:]

In [None]:
tmpidx = df.index # change the index temporarily
                  # to avoid complications with dates
df.index = list('abcd')
df.loc[['a','c','d'],'B':]

In [None]:
# restore index
df.index = tmpidx
del tmpidx

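Older Pandas versions also had a hybrid indexer `ix` that could take a combination of labels and positions. It was deprecated in version 0.20 and removed in 1.0, so the same selections are now written with `loc` and `iloc`.

In [None]:
df.loc['2007-12-31'].iloc[0] # formerly df.ix['2007-12-31',0]

In [None]:
df.loc['2007-12-31':'2008-12-31',df.columns[1]] # formerly df.ix['2007-12-31':'2008-12-31',1]

In [None]:
df.loc[df.index[1:3],'A'] # formerly df.ix[1:3,'A']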

## Ways of creating `DataFrame`s

Apart from passing an array, we can also pass a list of lists:

In [None]:
df1 = pd.DataFrame([[2,1,3,5],[34,36,29,35]],
                  index = ['A','B'],
                  columns = range(4))
df1

Or we can create the `DataFrame` from a dictionary of `Series` objects.

In [None]:
s1 = pd.Series(np.random.rand(6),index = range(6,0,-1)) # We can index backward
s2 = pd.Series(np.random.rand(6),index = range(6,0,-1)) 
df1 = pd.DataFrame({'Ser1':s1,'Ser2':s2})
df1

Notice what happens when the indexes of the series are different:

In [None]:
s1 = pd.Series(np.random.rand(6),index = range(6,0,-1)) # We can index backward
s3 = pd.Series(np.random.rand(6),index = list('abcdef')) 
df2 = pd.DataFrame({'Ser1':s1,'Ser3':s3})
df2
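
The same alignment can be seen on a smaller example where the two indexes overlap only partially. This is a sketch with made-up names (`left`, `right`) and values; `dropna()` is the standard method for discarding rows with missing entries.

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# The row index of the result is the union of the two indexes
combined = pd.DataFrame({"left": left, "right": right})
print(combined)                # entries without a matching label are NaN

complete = combined.dropna()   # keep only the rows present in both Series
print(complete)
```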

## Indexes

The last example hints at some of the properties of indexes. They behave like ordered sets and are designed this way in order to facilitate operations like various joins of datasets.

First, an index can be created as an independent object and passed to a `Series` or `DataFrame` constructor later.

In [None]:
i1 = pd.Index(list('abcde'))
i2 = pd.Index(list('acdghkl'))
print(i1) 
print(i2)

You can access the elements of an index by position or using a slice:

In [None]:
i1[2]

In [None]:
i2[1:5:2]

But indexes are immutable. This is a conscious design choice to safeguard the integrity of data transformations and merges.

In [None]:
# This raises an error
i2[2] = 'z'
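
If a label really has to change, the usual way is to build a new index rather than modify the existing one. A minimal sketch (the names `idx`, `labels` and `new_idx` are illustrative):

```python
import pandas as pd

idx = pd.Index(list("acdghkl"))

raised = False
try:
    idx[2] = "z"            # item assignment is not supported
except TypeError as err:
    raised = True
    print("TypeError:", err)

# Copy the labels to a list, change the copy and build a new Index from it
labels = idx.tolist()
labels[2] = "z"
new_idx = pd.Index(labels)

print(idx)       # the original index is unchanged
print(new_idx)
```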

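Indexes also support set operations (again useful when combining datasets). Recent Pandas versions expose them as methods, and the operator forms `&`, `|` and `^` no longer perform set operations on an `Index`:

In [None]:
i1.intersection(i2) # formerly i1 & i2

In [None]:
i1.union(i2) # formerly i1 | i2

In [None]:
i1.symmetric_difference(i2) # formerly i1 ^ i2

In [None]:
i1.difference(i2) # i1-i2 does not compute a set difference for Index objects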
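
To see why set operations on indexes are useful when combining datasets, consider two series observed on partially different labels. The sketch below is illustrative only: the names `sales`, `costs` and `profit` and the numbers are invented.

```python
import pandas as pd

sales = pd.Series([5, 7, 9], index=list("abc"))
costs = pd.Series([1, 2, 3], index=list("bcd"))

common = sales.index.intersection(costs.index)   # labels present in both
everything = sales.index.union(costs.index)      # labels present in either

# Restrict both series to the common labels before subtracting
profit = sales.loc[common] - costs.loc[common]
print(profit)

# Reindexing to the union introduces NaN for the label missing from sales
print(sales.reindex(everything))
```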

# More on selection and assignment

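A column of a `DataFrame` can be accessed as an attribute, provided its name is a valid Python identifier and does not clash with an existing `DataFrame` attribute or method.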

In [None]:
df.B # equivalent to df['B']
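
Attribute access is convenient but does not work for every column name. A small sketch of the two typical failure cases, with column names chosen purely for illustration:

```python
import pandas as pd

frame = pd.DataFrame({"B": [1, 2], "count": [3, 4], "unit price": [5.0, 6.0]})

print(frame.B.tolist())               # works: "B" is a valid, unused attribute name
print(callable(frame.count))          # frame.count is the DataFrame method, not the column
print(frame["count"].tolist())        # bracket access always reaches the column
print(frame["unit price"].tolist())   # names with spaces need brackets as well
```

For this reason bracket access is the safer choice in code that has to work with arbitrary column names.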

We can assign using a slice:

In [None]:
df.loc['20051231':'20071231','A'] = [111]*3
df

And we can add an entire column:

In [None]:
df['C'] = np.random.rand(4)
df

While we have been working with numeric values up to here, there is nothing to prevent us from having columns of different types:

In [None]:
df['D'] = ['red', 'blue', 'green', 'yellow']
df['E'] = [True, True, False, True]
# df.pop('D') 
df.dtypes

We can delete columns like this:

In [None]:
del df['D']
df

Or like this:

In [None]:
df.pop('E')
df

Or, if we need to delete many columns, we can just keep what we need:

In [None]:
df = df[['A','B']]
df

Rows in a `DataFrame` can be deleted by means of `drop()`. Note that it returns a new object and leaves the original unchanged. To keep the result, either assign it back to the variable or pass `inplace=True`.

In [None]:
df.drop(df.index[0]) # Drop the row that corresponds to the first index label

In [None]:
df # still the old one

In [None]:
df = df.drop(df.index[0])
df

In [None]:
df.drop(df.index[1],inplace=True)
df
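
The same method can drop several rows at once, and it can drop columns as well. The sketch below assumes Pandas 0.21 or later, where `drop()` accepts the `columns` keyword (older versions use `axis=1` instead); the frame and its labels are made up.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
                     index=["r1", "r2", "r3"])

fewer_rows = frame.drop(["r1", "r3"])    # drop several rows by label
fewer_cols = frame.drop(columns=["B"])   # drop a column instead of a row

print(fewer_rows)
print(fewer_cols)
print(frame.shape)   # the original is unchanged
```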

Replace `df` with a new one to use for the following demonstration.

In [None]:
df = pd.DataFrame(np.array([[-4.31464978,  4.18579587, -3.95827137,  0.43225809],
                           [-1.00034678,  4.32407815,  4.79826565, -4.52343789],
                           [ 3.43708467,  1.2913998 ,  4.12525004, -0.55061573],
                           [ 3.54330653,  4.45819847,  4.15887073,  4.50748233],
                           [ 4.1124862 ,  4.18789329, -1.5093025 ,  3.1387294 ]]), 
                  index = range(5),columns=list('ABCD'))
df

# Filtering

We can filter a dataframe based on a condition that is evaluated element by element over the whole dataframe. The entries that fail the condition are replaced with `NaN`s.

In [None]:
df[df>0]
# An equivalent way would be df.where(df>0)

The `where()` method allows us to replace the `NaN`s with a specified value or with the corresponding entries of another object:

In [None]:
df.where(df>0,999)
# try also df.where(df>0,-df)
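
A small sketch of what the second argument of `where()` does, using a made-up two-by-two frame: entries that satisfy the condition are kept, and the others are taken from the replacement.

```python
import pandas as pd

frame = pd.DataFrame({"A": [-1.5, 2.0], "B": [3.0, -4.0]})

kept = frame.where(frame > 0)              # failing entries become NaN
flipped = frame.where(frame > 0, -frame)   # failing entries are taken from -frame

print(kept)
print(flipped)
print(flipped.equals(frame.abs()))   # with no zeros present, this matches the absolute value
```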

We can also filter a dataframe based on the values of a specific column:

In [None]:
df[df['A']<3]

In [None]:
df[ (df['A']>-2) & (df['A']<3.5) ]
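
The expression inside the brackets is itself a boolean `Series` with one entry per row, so it can be stored, inspected and negated. The parentheses are needed because `&` binds more tightly than the comparison operators, and the Python keyword `and` does not work element by element. A sketch with illustrative data:

```python
import pandas as pd

frame = pd.DataFrame({"A": [-4.0, -1.0, 3.4, 3.6, 4.1]})

# A boolean mask: True for the rows with -2 < A < 3.5
inside = (frame["A"] > -2) & (frame["A"] < 3.5)
print(inside.tolist())

print(frame[inside])     # rows where the mask is True
print(frame[~inside])    # ~ negates the mask: rows outside the interval
```

Similarly, `|` combines two conditions with a logical "or".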

# Sorting

Sometimes we want to rearrange our dataframe based on the values of certain columns. This can be done by using `sort_values()`:

In [None]:
df.sort_values('B')

In [None]:
df.sort_values('A',ascending=False) # sort in descending order

In [None]:
df.loc[1,'B'] = df.loc[2,'B']
print(df)
df.sort_values(['B','C']) # sort by two columns to break ties

In [None]:
# apply ascending vs descending sort to different columns
df.sort_values(['B','C'],ascending=[True,False]) 

Sorting can also be forced to happen in-place using the familiar `inplace` argument.
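
A minimal sketch of in-place sorting on a made-up frame. Note that with `inplace=True` the method returns `None`, so its result should not be assigned back. The original row order can be recovered with `sort_index()` as long as the index was sorted to begin with.

```python
import pandas as pd

frame = pd.DataFrame({"B": [3, 1, 2]}, index=[0, 1, 2])

result = frame.sort_values("B", inplace=True)   # modifies frame itself
print(result)                                   # None
sorted_order = frame.index.tolist()             # row labels after sorting by B
print(frame)

frame.sort_index(inplace=True)   # back to the original row order
print(frame)
```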