# Pandas

Here are some notes and examples of using the Pandas library for manipulating tabular data.

The example below reads a CSV file and returns it as a `DataFrame` object. This object type is provided by Pandas and represents a table of data (more formally called a "relation"). Pandas can read from many other data sources through its family of `read_*` functions, such as `read_excel` and `read_json`.

In [1]:
import pandas as pd  # Make the library available via a convenient name.

athletes = pd.read_csv("../data/Tokyo-2020-Athletes.csv")
athletes


Unnamed: 0,Name,NOC,Discipline
0,AALERUD Katrine,Norway,Cycling Road
1,ABAD Nestor,Spain,Artistic Gymnastics
2,ABAGNALE Giovanni,Italy,Rowing
3,ABALDE Alberto,Spain,Basketball
4,ABALDE Tamara,Spain,Basketball
...,...,...,...
11080,ZWICKER Martin Detlef,Germany,Hockey
11081,ZWOLINSKA Klaudia,Poland,Canoe Slalom
11082,ZYKOVA Yulia,ROC,Shooting
11083,ZYUZINA Ekaterina,ROC,Sailing
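
The other `read_*` functions follow the same pattern as `read_csv`. The sketch below is only an illustration and is not part of the original notebook: the strings `csv_text` and `json_text` are made-up stand-ins for real files, wrapped in `io.StringIO` so that the example runs without any data on disk.

```python
import io
import pandas as pd

# Made-up sample data standing in for files on disk (an assumption for this sketch).
csv_text = (
    "Name,NOC,Discipline\n"
    "AALERUD Katrine,Norway,Cycling Road\n"
    "ABAD Nestor,Spain,Artistic Gymnastics\n"
)
json_text = '[{"Name": "ABAGNALE Giovanni", "NOC": "Italy", "Discipline": "Rowing"}]'

from_csv = pd.read_csv(io.StringIO(csv_text))     # Same call as above, different source.
from_json = pd.read_json(io.StringIO(json_text))  # A list of records becomes rows.

print(from_csv.shape)           # (number of rows, number of columns)
print(list(from_json.columns))  # Column names come from the record keys.
```

Both calls return an ordinary `DataFrame`, so everything that follows applies regardless of where the data came from.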


Columns can be selected from the DataFrame using the usual Python indexing operator `[]`. A single column name returns a Series; a list of names returns a DataFrame. For example:

In [9]:
just_names = athletes["Name"]                       # Get a column of names (a Series).
names_and_sport = athletes[["Name", "Discipline"]]  # Get two columns (a DataFrame).

print("The first name:", just_names.iloc[0])        # A Series is one dimensional. Select an item by position.
names_and_sport.head(10)                            # A DataFrame is two dimensional.

The first name: AALERUD Katrine


Unnamed: 0,Name,Discipline
0,AALERUD Katrine,Cycling Road
1,ABAD Nestor,Artistic Gymnastics
2,ABAGNALE Giovanni,Rowing
3,ABALDE Alberto,Basketball
4,ABALDE Tamara,Basketball
5,ABALO Luc,Handball
6,ABAROA Cesar,Rowing
7,ABASS Abobakr,Swimming
8,ABBASALI Hamideh,Karate
9,ABBASOV Islam,Wrestling
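
The difference between the two forms of `[]` is easiest to see by checking the type and shape of each result. The following sketch uses a small hand-built table, `demo`, as a stand-in for the athletes data; the name `demo` and its three rows are assumptions made for illustration.

```python
import pandas as pd

# A tiny stand-in table (the first rows of the athletes data shown above).
demo = pd.DataFrame({
    "Name": ["AALERUD Katrine", "ABAD Nestor", "ABAGNALE Giovanni"],
    "NOC": ["Norway", "Spain", "Italy"],
    "Discipline": ["Cycling Road", "Artistic Gymnastics", "Rowing"],
})

as_series = demo["Name"]    # A single column name gives a Series.
as_frame = demo[["Name"]]   # A list of names gives a DataFrame, even with one column.

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```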


It is possible to select specific rows by integer position using `iloc`. Rows are numbered starting at zero and, as is usual for Python slices, the end of the range is excluded. Thus `0:5` selects rows 0 through 4, inclusive (5 rows).

In [10]:
athletes.iloc[0:5]

Unnamed: 0,Name,NOC,Discipline
0,AALERUD Katrine,Norway,Cycling Road
1,ABAD Nestor,Spain,Artistic Gymnastics
2,ABAGNALE Giovanni,Italy,Rowing
3,ABALDE Alberto,Spain,Basketball
4,ABALDE Tamara,Spain,Basketball


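You can also select specific columns at the same time by giving a second range after the comma. Column positions are also zero-based with an excluded end, so `0:2` selects the first two columns.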

In [11]:
athletes.iloc[0:5, 0:2]


Unnamed: 0,Name,NOC
0,AALERUD Katrine,Norway
1,ABAD Nestor,Spain
2,ABAGNALE Giovanni,Italy
3,ABALDE Alberto,Spain
4,ABALDE Tamara,Spain
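
Besides ranges, `iloc` also accepts negative positions, lists of positions, and single integers. The sketch below shows these on a small hand-built table; `demo` and its contents are stand-ins chosen for illustration, not the real data file.

```python
import pandas as pd

# Stand-in for the first five rows of the athletes table.
demo = pd.DataFrame({
    "Name": ["AALERUD Katrine", "ABAD Nestor", "ABAGNALE Giovanni",
             "ABALDE Alberto", "ABALDE Tamara"],
    "NOC": ["Norway", "Spain", "Italy", "Spain", "Spain"],
    "Discipline": ["Cycling Road", "Artistic Gymnastics", "Rowing",
                   "Basketball", "Basketball"],
})

last_two = demo.iloc[-2:]           # Negative positions count from the end.
picked = demo.iloc[[0, 2], [0, 2]]  # Lists select arbitrary rows and columns.
one_cell = demo.iloc[1, 0]          # Two integers give a single value.

print(one_cell)              # ABAD Nestor
print(list(picked.columns))  # ['Name', 'Discipline']
```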


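You can select rows based on a condition, and specific columns based on their names, using `loc`. Unlike `iloc`, which works with integer positions, `loc` works with labels and Boolean conditions.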

In [13]:
athletes.loc[athletes["NOC"] == "Norway", ["Name", "NOC"]]

Unnamed: 0,Name,NOC
0,AALERUD Katrine,Norway
50,ABELVIK ROED Magnus,Norway
941,BERGERUD Torbjoern,Norway
1044,BJOERNSEN Kristian,Norway
1073,BLUMMENFELT Kristian,Norway
...,...,...
9900,TUFTE Olaf Karl,Norway
9930,TUXEN Anne,Norway
9963,ULLVANG Lars Magne,Norway
10443,WARHOLM Karsten,Norway
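
Notice in the output above that the selected rows keep their original index labels (0, 50, 941, and so on). The sketch below illustrates what that means for later selections. It uses a small hand-built table, `demo`, whose index labels are set by hand to mimic that situation; the table is an assumption made for illustration.

```python
import pandas as pd

# Stand-in table; the index labels are chosen by hand to mimic the output above.
demo = pd.DataFrame(
    {
        "Name": ["AALERUD Katrine", "ABAD Nestor", "ABAGNALE Giovanni", "ABELVIK ROED Magnus"],
        "NOC": ["Norway", "Spain", "Italy", "Norway"],
    },
    index=[0, 1, 2, 50],
)

norway = demo.loc[demo["NOC"] == "Norway", ["Name", "NOC"]]

print(list(norway.index))       # [0, 50]  (labels survive the selection)
print(norway.loc[50, "Name"])   # By label:    ABELVIK ROED Magnus
print(norway.iloc[1]["Name"])   # By position: ABELVIK ROED Magnus

# Label slices with loc include the end label; position slices with iloc do not.
print(len(demo.loc[0:2]))   # 3
print(len(demo.iloc[0:2]))  # 2
```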


If you want all columns, you can use a simplified syntax that selects from the DataFrame directly (without using `loc` or `iloc`):

In [14]:
athletes[athletes["NOC"] == "Norway"]

Unnamed: 0,Name,NOC,Discipline
0,AALERUD Katrine,Norway,Cycling Road
50,ABELVIK ROED Magnus,Norway,Handball
941,BERGERUD Torbjoern,Norway,Handball
1044,BJOERNSEN Kristian,Norway,Handball
1073,BLUMMENFELT Kristian,Norway,Triathlon
...,...,...,...
9900,TUFTE Olaf Karl,Norway,Rowing
9930,TUXEN Anne,Norway,Diving
9963,ULLVANG Lars Magne,Norway,Canoe Sprint
10443,WARHOLM Karsten,Norway,Athletics
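
The condition inside the brackets is itself a Series of Boolean values, one per row, and can be stored in a variable and inspected. The following sketch shows this on a small stand-in table; the names `demo` and `mask` are illustrative assumptions.

```python
import pandas as pd

# Stand-in table for illustration.
demo = pd.DataFrame({
    "Name": ["AALERUD Katrine", "ABAD Nestor", "ABAGNALE Giovanni", "ABELVIK ROED Magnus"],
    "NOC": ["Norway", "Spain", "Italy", "Norway"],
})

mask = demo["NOC"] == "Norway"   # Element-wise comparison gives a Boolean Series.
print(mask.tolist())             # [True, False, False, True]
print(mask.sum())                # 2  (True counts as 1, so this counts matching rows)

shorthand = demo[mask]           # Simplified syntax.
explicit = demo.loc[mask]        # The same selection spelled out with loc.
print(shorthand.equals(explicit))  # True
```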


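The usual Python bitwise operators `&` (and), `|` (or), and `~` (not) can be used to make more complex conditions. Each condition must be wrapped in parentheses because these operators bind more tightly than comparisons such as `==`.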

In [20]:
athletes[(athletes["NOC"] == "Norway") & (athletes["Discipline"] == "Triathlon")]

Unnamed: 0,Name,NOC,Discipline
1073,BLUMMENFELT Kristian,Norway,Triathlon
4183,IDEN Gustav,Norway,Triathlon
6448,MILLER Lotte,Norway,Triathlon
9359,STORNES Casper,Norway,Triathlon
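
The other operators work the same way as `&`. The sketch below, on a small stand-in table named `demo` (an assumption for illustration), shows `|` and `~`, along with the Series method `isin` as one way to test a column against several values. It also shows that the Python keyword `and` does not work here, since it asks for a single truth value of a whole Series.

```python
import pandas as pd

# Stand-in table for illustration.
demo = pd.DataFrame({
    "Name": ["AALERUD Katrine", "ABAD Nestor", "ABAGNALE Giovanni",
             "ABALDE Alberto", "BLUMMENFELT Kristian", "ABAROA Cesar"],
    "NOC": ["Norway", "Spain", "Italy", "Spain", "Norway", "Chile"],
    "Discipline": ["Cycling Road", "Artistic Gymnastics", "Rowing",
                   "Basketball", "Triathlon", "Rowing"],
})

either = demo[(demo["NOC"] == "Spain") | (demo["Discipline"] == "Rowing")]  # or
not_spain = demo[~(demo["NOC"] == "Spain")]                                # not
several = demo[demo["NOC"].isin(["Norway", "Italy"])]                      # any of a list

print(list(either.index))     # [1, 2, 3, 5]
print(list(not_spain.index))  # [0, 2, 4, 5]
print(list(several.index))    # [0, 2, 4]

# The keyword `and` needs one True/False value, which a Series cannot provide.
try:
    demo[(demo["NOC"] == "Norway") and (demo["Discipline"] == "Triathlon")]
except ValueError as error:
    print("ValueError:", error)
```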


It is possible to do simple math operations on Series objects directly (they are done element-wise across the entire Series). However, asking for the `len()` of a Series returns, unsurprisingly, the number of elements in the Series. Computing a new Series that holds the length of each element of another requires a more general approach using `apply`.

The example below augments the DataFrame with an additional column, computed from the "Name" column by calculating the length of each name. Note that `apply` calls a function, here a lambda expression, on each element of the input Series to compute the result Series.

In [23]:
athletes["NameLength"] = athletes["Name"].apply(lambda item: len(item))
athletes.head(10)

Unnamed: 0,Name,NOC,Discipline,NameLength
0,AALERUD Katrine,Norway,Cycling Road,15
1,ABAD Nestor,Spain,Artistic Gymnastics,11
2,ABAGNALE Giovanni,Italy,Rowing,17
3,ABALDE Alberto,Spain,Basketball,14
4,ABALDE Tamara,Spain,Basketball,13
5,ABALO Luc,France,Handball,9
6,ABAROA Cesar,Chile,Rowing,12
7,ABASS Abobakr,Sudan,Swimming,13
8,ABBASALI Hamideh,Islamic Republic of Iran,Karate,16
9,ABBASOV Islam,Azerbaijan,Wrestling,13
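
To make the distinction between these operations concrete, the sketch below puts them side by side on a small stand-in Series. The name `names` and its three values are assumptions for illustration. It also shows the `.str.len()` accessor, which Pandas provides for string columns and which is an alternative to `apply` for this particular task.

```python
import pandas as pd

# Stand-in Series holding the first three names from the table above.
names = pd.Series(["AALERUD Katrine", "ABAD Nestor", "ABAGNALE Giovanni"])

print(len(names))            # 3  (number of elements, not their lengths)

lengths = names.apply(len)   # Call len on each element; same result as the lambda form.
via_str = names.str.len()    # String accessor alternative to apply.
doubled = lengths * 2        # Math on a Series is done element-wise.

print(lengths.tolist())      # [15, 11, 17]
print(via_str.tolist())      # [15, 11, 17]
print(doubled.tolist())      # [30, 22, 34]
```

Passing `len` directly works because `apply` only needs something it can call with one element at a time; the lambda in the cell above does the same job.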
