In [1]:
# Ignore, this code will be explained later
import os
import sys

module_path = os.path.abspath(os.path.join('../..'))
if module_path not in sys.path: sys.path.append(module_path)

from src import utils

In [2]:
%reload_ext post_content
%post_content register YOUR_USER_NAME

PostContent loaded
Registered


# Pandas - an overview

The Pandas library provides a fantastic interface to tabular data, that is, data made up of rows and columns. Tables are usually used to represent "business" datasets such as customer or product information. Generally, tables are organized so rows represent observations and columns represent features. Such "tabular" datasets are different from other datasets used in machine learning, such as images, videos, graphs, etc.

We will study three main objects within the Pandas library: Dataframes and two objects which make up dataframes, Series and Indexes.

![](images/dataframes.jpg)

*Hint* Review earlier lectures which provide quickstart introductions to Numpy and Pandas. These notes take a more systematic approach to describing Pandas.

In [3]:
import numpy as np
import pandas as pd

# Series

Pandas series are similar to Python's built-in lists and numpy arrays. Here is an example:

In [4]:
pd.Series(["Bart", "Homer", "Lisa", "Maggie", "Marge"])

0      Bart
1     Homer
2      Lisa
3    Maggie
4     Marge
dtype: object

In [5]:
#Python list
["Bart", "Homer", "Lisa", "Maggie", "Marge"]

['Bart', 'Homer', 'Lisa', 'Maggie', 'Marge']

In [6]:
#Numpy array
np.array(["Bart", "Homer", "Lisa", "Maggie", "Marge"])

array(['Bart', 'Homer', 'Lisa', 'Maggie', 'Marge'], dtype='<U6')

#### Series, like Numpy arrays, have types

In [7]:
type([1, 2, 3])

list

In [8]:
type([1.2, 2, 3])

list

In [9]:
type([1.2, 2, "Homer"])

list

In [10]:
pd.Series([1, 2, 3])

0    1
1    2
2    3
dtype: int64

In [11]:
pd.Series([1.2, 2, 3])

0    1.2
1    2.0
2    3.0
dtype: float64

In [12]:
pd.Series([1.2, 2, "Homer"])

0      1.2
1        2
2    Homer
dtype: object

Notice that, unlike Python lists, Series objects have types associated with them. One of the reasons Pandas and Numpy are so much faster than native Python is that all elements are expected to be the same type, so operations on them can be optimized for performance.

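**String type** Since Pandas 1.0, string series can have their own dedicated data type, `string`, but you have to ask for it. In the Pandas version used for these notes, text is stored by default with `object`, the generic, catch-all type, as the outputs above show.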
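Here is a minimal sketch of requesting the dedicated string type explicitly. It assumes Pandas 1.0 or later; the variable names `as_string` and `mixed` are only for illustration.

```python
import pandas as pd

names = ["Bart", "Homer", "Lisa"]

# Ask for the dedicated string dtype explicitly (assumes Pandas >= 1.0).
as_string = pd.Series(names, dtype="string")
print(as_string.dtype)

# A Series with mixed content still falls back to the generic object dtype.
mixed = pd.Series([1.2, 2, "Homer"])
print(mixed.dtype)
```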

#### More complex keys

Notice that Pandas series are nothing more than a wrapper around Numpy arrays (at least in Pandas 1.0):

In [13]:
ss = pd.Series(["Bart", "Homer", "Lisa", "Maggie", "Marge"])
ss

0      Bart
1     Homer
2      Lisa
3    Maggie
4     Marge
dtype: object

In [14]:
ss.values

array(['Bart', 'Homer', 'Lisa', 'Maggie', 'Marge'], dtype=object)

In [15]:
type(ss.values)

numpy.ndarray

In [16]:
type(np.array(["Bart", "Homer", "Lisa", "Maggie", "Marge"]))

numpy.ndarray

Recall that Numpy was created to be a library for numeric matrix manipulation. It makes sense to ask, "what is the value at index zero?" It makes no sense, in matrix math, to ask, "what is the value at index 'Homer'?"

Pandas, however, is designed to work with datasets which may have categories and text. If a matrix contains ages of people, it is perfectly reasonable to ask, "what is the 'age' value at index 'Homer'?"

In [17]:
pd.Series(["Bart", "Homer", "Lisa", "Maggie", "Marge"])

0      Bart
1     Homer
2      Lisa
3    Maggie
4     Marge
dtype: object

In [18]:
ss2 = pd.Series([12, 38, 10, 2, 36], index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

Notice that Series provide the ability to add a custom `index`:

In [19]:
ss2["Homer"]

38
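Since a labelled Series behaves a lot like a dictionary, it can also be built directly from one. The sketch below uses the same ages as above; the variable name `ages` is just for illustration.

```python
import pandas as pd

# The dictionary keys become the index, the dictionary values become the values.
ages = pd.Series({"Bart": 12, "Homer": 38, "Lisa": 10, "Maggie": 2, "Marge": 36})

print(ages["Homer"])      # look up a value by its label
print("Lisa" in ages)     # membership tests check the index, like dict keys
print(list(ages.index))   # labels keep the dictionary's insertion order
```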

#### Series combine the properties of lists and dictionaries

Recall that Python's core lists are accessed by zero based integer values. Elements in a list are ordered, which means we can use the _slice_ notation to access multiple items:

In [20]:
["Bart", "Homer", "Lisa", "Maggie", "Marge"][2:4]

['Lisa', 'Maggie']

Dictionaries let us provide our own keys. However, they cannot be sliced:

In [77]:
{"Homer":38, "Marge":36, "Bart":12, "Lisa":10, "Maggie":6}["Bart":"Homer"] #<= dictionaries don't understand slicing

TypeError: unhashable type: 'slice'

Pandas Series combine properties of lists and dictionaries, along with the performance of Numpy:

In [22]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [23]:
ss2[0]

12

In [24]:
ss2[2:4]

Lisa      10
Maggie     2
dtype: int64

In [25]:
ss2["Lisa":"Marge"]

Lisa      10
Maggie     2
Marge     36
dtype: int64

**Caveat** Notice that when we sliced with the implicit (built-in) index values, `2:4`, the value at position 4 wasn't included in the result (as expected). However, when we sliced with explicit index values (the index we provided), `"Lisa":"Marge"`, the last value _is_ included. Series provides a way around this confusion via `.loc` and `.iloc`.

![](images/series.jpg)

In [26]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [27]:
ss2.values

array([12, 38, 10,  2, 36], dtype=int64)

In [28]:
ss2.index

Index(['Bart', 'Homer', 'Lisa', 'Maggie', 'Marge'], dtype='object')
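To make the "values plus index" picture concrete, the sketch below takes a Series apart and puts it back together. It uses `.to_numpy()`, which returns the same kind of array as `.values`; the names `ages` and `rebuilt` are illustrative.

```python
import pandas as pd

ages = pd.Series([12, 38, 10, 2, 36], index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])

values = ages.to_numpy()   # the Numpy array holding the data
labels = ages.index        # the Index holding the labels

# Putting the two pieces back together gives an equal Series.
rebuilt = pd.Series(values, index=labels)
print(rebuilt.equals(ages))

# The two pieces line up element by element.
print(dict(zip(labels, values)))
```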

### Caveat: `s.columnA` vs `s['columnA']`

So far, we have accessed elements of a Series using the syntax `s['columnA']`. However, the syntax `s.columnA` is also allowed. For example:

In [29]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [30]:
ss2['Marge']

36

In [31]:
ss2.Marge

36

The 'dot' syntax is very convenient, since it looks like calling a normal method on an object. It is slightly shorter to type and provides better IDE help. In a cell, type `ss2.` then press the [TAB] key. You will notice a dropdown list. Type "M" and you will see Marge's and Maggie's names pop up. You are using autocomplete to find elements of a series!

Also notice that you can call actual operations on a series:

In [32]:
ss2.sum()

98

If you type `series.XYZ`, is that referring to an element indexed by the key "XYZ", or to the function `XYZ()`? When both exist, the built-in attribute or method wins, and the element can only be reached with brackets. This confusion is the reason why, unless we are sure that there is no conflict, the safer option is to use the syntax `s['columnA']`.
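A small sketch of such a conflict, using a made-up Series `stats` whose labels are "sum" and "total". One label happens to share its name with a Series method, the other does not.

```python
import pandas as pd

# Hypothetical Series: the label "sum" collides with the Series.sum() method.
stats = pd.Series([10, 20], index=["sum", "total"])

print(stats["sum"])         # bracket syntax always returns the element
print(stats.total)          # dot syntax works here: there is no method called "total"
print(callable(stats.sum))  # dot syntax finds the method, not the element
print(stats.sum())          # so this adds up all the values
```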

### Indexing Series

Similar to Numpy, values can be retrieved in several ways:
1. Implicit index (similar to lists)
2. Explicit index or label (similar to dictionaries)
3. Slicing
4. Masking
5. Fancy indexing

#### Implicit index

In [33]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

Given the series above, where the names of the Simpson family are the index and their ages are the values, we can get Homer's age, which is in the second row, by directly requesting the value at implicit index 1 (remember that Python is zero based):

In [34]:
ss2[1]

38
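The same positional lookup can be written with `.iloc`, which always means "position". This is a sketch rather than a required style, but as far as I know newer Pandas releases warn about plain `series[1]` on a Series with a non-integer index, so `.iloc` is the safer spelling. The name `ages` is illustrative.

```python
import pandas as pd

ages = pd.Series([12, 38, 10, 2, 36], index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])

second = ages.iloc[1]    # value in the second row (zero based)
last = ages.iloc[-1]     # negative positions count from the end

print(second)
print(last)
```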

#### Explicit index or labels

Much like dictionaries, the key associated with Marge's name will return her age:

In [35]:
ss2["Marge"]

36

**Caveat** Notice that the same syntax, `series[index]`, is used to access data both implicitly and explicitly. When the Series has a non-integer index (as `ss2` does), Pandas treats an integer as an implicit index and anything else as an explicit index. What if our explicit index was also made of integers? See `.loc` and `.iloc`.

#### Slicing

Similar to Numpy arrays and Python lists, Series can be sliced:

In [36]:
ss2["Lisa":"Marge"]

Lisa      10
Maggie     2
Marge     36
dtype: int64

In [37]:
ss2[2:4]

Lisa      10
Maggie     2
dtype: int64

Notice that slicing with explicit values includes the last item, while slicing with implicit indexes does not.
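The difference is easier to keep straight when the kind of slice is spelled out. A minimal sketch, with illustrative variable names, using `.iloc` for positions and `.loc` for labels:

```python
import pandas as pd

ages = pd.Series([12, 38, 10, 2, 36], index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])

# Positional slice: the end position (4) is excluded, as with Python lists.
by_position = ages.iloc[2:4]

# Label slice: the end label ("Marge") is included.
by_label = ages.loc["Lisa":"Marge"]

print(list(by_position.index))
print(list(by_label.index))
```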

Remember that negative indexes can be used, just like normal Python lists. Below we select items which start at "second to last" and end at the last item:

In [38]:
ss2[-2:]

Maggie     2
Marge     36
dtype: int64

#### Masking

Providing a `True` or `False` value for each element results in the elements corresponding to `False` being filtered out.

In [39]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [40]:
mask = [True, False, True, True, True]
ss2[mask]

Bart      12
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [41]:
ss2[[True, False, True, True, True]]

Bart      12
Lisa      10
Maggie     2
Marge     36
dtype: int64

Filter on the index

In [42]:
ss2.index != "Homer"

array([ True, False,  True,  True,  True])

In [43]:
ss2[ss2.index != "Homer"]

Bart      12
Lisa      10
Maggie     2
Marge     36
dtype: int64

Filter on the values

In [44]:
ss2.values <30

array([ True, False,  True,  True, False])

In [45]:
ss2[ss2.values < 30]

Bart      12
Lisa      10
Maggie     2
dtype: int64

In [46]:
ss2 < 30

Bart       True
Homer     False
Lisa       True
Maggie     True
Marge     False
dtype: bool

Notice that you don't have to call `.values`; comparing the Series directly gives cleaner code:

In [47]:
ss2[ss2 < 30]

Bart      12
Lisa      10
Maggie     2
dtype: int64
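Masks can also be combined. The sketch below uses `&` (and), `|` (or) and `~` (not); each comparison needs its own parentheses because these operators bind more tightly than `<` and `>`. Variable names are illustrative.

```python
import pandas as pd

ages = pd.Series([12, 38, 10, 2, 36], index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])

# Older than 5 AND younger than 30.
school_age = ages[(ages > 5) & (ages < 30)]

# Younger than 5 OR older than 37.
extremes = ages[(ages < 5) | (ages > 37)]

# NOT younger than 30.
adults = ages[~(ages < 30)]

print(list(school_age.index))
print(list(extremes.index))
print(list(adults.index))
```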

#### Fancy indexing

If you know the specific list of items, you can ask the Series to return them directly:

In [48]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [49]:
ss2[[2,3,4]]

Lisa      10
Maggie     2
Marge     36
dtype: int64

Notice that the Series returns the elements in the order in which you list them:

In [50]:
ss2[[4,3,2]]

Marge     36
Maggie     2
Lisa      10
dtype: int64

You can also use the explicit index instead of the implicit index:

In [51]:
ss2[["Marge", "Lisa"]]

Marge    36
Lisa     10
dtype: int64
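What happens when one of the requested labels does not exist? The sketch below assumes Pandas 1.0 or later, where asking `.loc` for a list containing a missing label raises a `KeyError`. `reindex` is one way to get a placeholder instead. The label "Ned" and the variable names are made up for illustration.

```python
import pandas as pd

ages = pd.Series([12, 38, 10, 2, 36], index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])

wanted = ["Marge", "Lisa", "Ned"]   # "Ned" is not in the index

# Asking for a missing label is an error (assumes Pandas >= 1.0).
try:
    ages.loc[wanted]
    missing_raised = False
except KeyError:
    missing_raised = True
print(missing_raised)

# reindex returns every requested label and fills the missing ones with NaN.
with_gaps = ages.reindex(wanted)
print(with_gaps)
```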

### `.loc` and `.iloc` or confusion when explicit indexes are integers

Notice that the series we have been working with has strings as the explicit index and integers as the values:

In [52]:
pd.Series([12, 38, 10, 2, 36], index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

What if we flip that around, so that the ages become the index and the names become the values?

In [68]:
ss3 = pd.Series(["Bart", "Homer", "Lisa", "Maggie", "Marge"], index=[12, 38, 10, 2, 36])
ss3

12      Bart
38     Homer
10      Lisa
2     Maggie
36     Marge
dtype: object

**Caveat** If I ask for `ss3[2]`, am I asking for the element at location 2 or for the character whose age is 2?

In [69]:
ss3[2]

'Maggie'

In order to avoid this confusion, Pandas provides the indexers `.loc`, which expects explicit index values (aka labels), and `.iloc`, which expects implicit index values (positions):

In [70]:
ss3.loc[2]

'Maggie'

In [71]:
ss3.iloc[2]

'Lisa'
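The same two indexers also accept lists and slices, so the label versus position distinction carries over to fancy indexing and slicing. A short sketch with an illustrative Series `names`:

```python
import pandas as pd

names = pd.Series(["Bart", "Homer", "Lisa", "Maggie", "Marge"], index=[12, 38, 10, 2, 36])

by_label = names.loc[2]            # the character whose label (age) is 2
by_position = names.iloc[2]        # the character in the third row

labels_list = names.loc[[2, 12]]   # fancy indexing with labels
rows_slice = names.iloc[1:3]       # positional slice, end excluded

print(by_label, by_position)
print(list(labels_list))
print(list(rows_slice))
```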

#### Special indexer: `.at[]`

Notice that we have several ways of accessing elements: implicit index, labels, slicing, masking and fancy indexing. The same syntax, `s[XXX]`, `s.loc[XXX]` or `s.iloc[XXX]`, can be used with any of the methods above. There are times when you want exactly one scalar value to be returned. In such cases, use the label-based syntax `s.at[XXX]`:

In [74]:
ss3.at[2]

'Maggie'

In [76]:
ss3.at[1:2] # <= slicing is prohibited, since it would return multiple values

ValueError: At based indexing on an integer index can only have integer indexers
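`.at` looks values up by label. Its positional counterpart is `.iat`, and both can also be used to set a single value. A minimal sketch, with illustrative variable names and a made-up replacement value:

```python
import pandas as pd

names = pd.Series(["Bart", "Homer", "Lisa", "Maggie", "Marge"], index=[12, 38, 10, 2, 36])

by_label = names.at[2]       # scalar lookup by label, like .loc
by_position = names.iat[2]   # scalar lookup by position, like .iloc
print(by_label, by_position)

# Both indexers can also assign a single value.
names.at[2] = "Maggie Simpson"
print(names.at[2])
```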