# Pandas Quick Reference Guide

### Import the Pandas library

The typical way to import the Pandas library is the following:

In [3]:
import pandas as pd

This allows you to access Pandas functions by using `pd.<function name>`.

### Read a dataset from a CSV (.csv) file

Use the `read_csv` function to read in a dataset from a CSV file.
This function returns a Pandas `DataFrame`. ([Further Reading](https://pandas.pydata.org/docs/user_guide/io.html#csv-text-files))

In this reference, we will be using `degrees.csv` as our example dataset. [^dataset]

Examples:

In [4]:
# When the data file is in the same folder as your notebook, write the file name.
data = pd.read_csv('degrees.csv')

# When the data file is in a different folder than your notebook, you need
# to write the path to the file.
# other_data = pd.read_csv('data/degrees.csv')
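
If you do not have `degrees.csv` at hand, the following minimal sketch shows the same call on in-memory text. The two rows are copied from the tables in this guide, and the names `csv_text` and `sample` are only illustrative.

```python
import io
import pandas as pd

# Two rows copied from the tables in this guide; this is not the full degrees.csv file.
csv_text = """Name,Degree,Absentia,Spring,Fall,Total
Honours Bachelor of Arts,HBA,,1599,345,1944
Honours Bachelor of Science,HBSC,3,2737,447,3187
"""

# read_csv accepts a file-like object as well as a file path.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
# Empty fields in the CSV text are read as NaN.
print(sample['Absentia'].isnull().tolist())
```

Empty fields become `NaN`, which is why the `Absentia` column in this guide is mostly missing.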

### Get an overview of a `DataFrame`
The datasets we will be working with are often quite large. The `pandas` library has a few methods that help us take a quick glance at our data or retrieve information about it.

#### `.info` and `.head`
Use `.head()` to view the first few rows of the table.
Use `.info()` to print a summary of the `DataFrame`: its column names, non-null counts, data types, and memory usage. ([Further Reading](https://pandas.pydata.org/docs/user_guide/basics.html#head-and-tail))

Example:

In [33]:
data.head()

|   | Name | Degree | Absentia | Spring | Fall | Total |
|---|---|---|---|---|---|---|
| 0 | Honours Bachelor of Arts | HBA | NaN | 1599.0 | 345.0 | 1944.0 |
| 1 | Bachelor of Arts - 4 Yr. | BA-4 | NaN | NaN | NaN | NaN |
| 2 | Bachelor of Arts | BA | NaN | 31.0 | 12.0 | 43.0 |
| 3 | Honours Bachelor of Science | HBSC | 3.0 | 2737.0 | 447.0 | 3187.0 |
| 4 | Bachelor of Science | BSC | NaN | 40.0 | 23.0 | 63.0 |


In [34]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      6 non-null      object 
 1   Degree    6 non-null      object 
 2   Absentia  1 non-null      float64
 3   Spring    5 non-null      float64
 4   Fall      5 non-null      float64
 5   Total     5 non-null      float64
dtypes: float64(4), object(2)
memory usage: 416.0+ bytes
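
Both `.head()` and its counterpart `.tail()` take an optional number of rows. The sketch below rebuilds two columns of the example table by hand (an assumption made so the snippet runs without `degrees.csv`).

```python
import pandas as pd

# Two columns of the example table, typed in by hand from this guide.
data = pd.DataFrame({
    'Degree': ['HBA', 'BA-4', 'BA', 'HBSC', 'BSC', 'BCOM'],
    'Total': [1944.0, float('nan'), 43.0, 3187.0, 63.0, 683.0],
})

first_two = data.head(2)   # first 2 rows instead of the default 5
last_two = data.tail(2)    # last 2 rows

print(first_two)
print(last_two)
```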


#### `.columns`
Use `.columns` to return the column labels of the `DataFrame` as an `Index` object. ([Further Reading](https://pandas.pydata.org/docs/user_guide/10min.html#viewing-data))

Example:

In [35]:
print(data.columns)

# You can use list() to make the output cleaner
print(list(data.columns))

Index(['Name', 'Degree', 'Absentia', 'Spring', 'Fall', 'Total'], dtype='object')
['Name', 'Degree', 'Absentia', 'Spring', 'Fall', 'Total']


#### `.dtypes`
Use the `.dtypes` attribute to find the data type of each column. It returns a `Series` whose index holds the column names. ([Further Reading](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes))

Example:

In [36]:
data.dtypes

Name         object
Degree       object
Absentia    float64
Spring      float64
Fall        float64
Total       float64
dtype: object

#### `.size`
The `.size` attribute returns how many elements are in the `DataFrame` or `Series` as an `int`. This
means it returns the number of rows multiplied by the number of columns.

Example:

In [37]:
# this returns 36
data.size

36

#### `.shape`
The `.shape` attribute tells us the dimensions of our `DataFrame`, returned as the tuple:
`(number of rows, number of columns)`

Example:

In [38]:
# this returns (6, 6), since our table has 6 rows and 6 columns, respectively
data.shape

(6, 6)
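
The relationship between `.shape`, `.size`, and `len()` can be checked directly. This is a small sketch on a hand-built two-column excerpt of the example table, so the numbers differ from the full six-column dataset.

```python
import pandas as pd

# Hand-built excerpt of the example table (6 rows, 2 columns).
data = pd.DataFrame({
    'Degree': ['HBA', 'BA-4', 'BA', 'HBSC', 'BSC', 'BCOM'],
    'Total': [1944.0, float('nan'), 43.0, 3187.0, 63.0, 683.0],
})

n_rows, n_cols = data.shape

print(data.size == n_rows * n_cols)  # size counts every cell, including NaN
print(len(data) == n_rows)           # len() gives the number of rows
print(data['Total'].size)            # for a Series, size is just its length
```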

### Extract `DataFrame` columns by name

To extract certain columns from the `DataFrame`, you can use the names of the columns like so: `df[['col1', 'col2', ...]]` ([Further Reading](https://pandas.pydata.org/docs/user_guide/10min.html#selection))

Example:

In [39]:
data[['Name', 'Fall']]

|   | Name | Fall |
|---|---|---|
| 0 | Honours Bachelor of Arts | 345.0 |
| 1 | Bachelor of Arts - 4 Yr. | NaN |
| 2 | Bachelor of Arts | 12.0 |
| 3 | Honours Bachelor of Science | 447.0 |
| 4 | Bachelor of Science | 23.0 |
| 5 | Bachelor of Commerce | 71.0 |
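
The number of brackets matters: a list of names returns a `DataFrame`, while a single name returns a `Series`. A minimal sketch, using a hand-built excerpt of the example table:

```python
import pandas as pd

# Hand-built excerpt of the example table.
data = pd.DataFrame({
    'Degree': ['HBA', 'BA-4', 'BA', 'HBSC', 'BSC', 'BCOM'],
    'Fall': [345.0, float('nan'), 12.0, 447.0, 23.0, 71.0],
})

as_frame = data[['Fall']]   # list of column names -> DataFrame with one column
as_series = data['Fall']    # single column name   -> Series

print(type(as_frame).__name__)
print(type(as_series).__name__)
```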


### Extract `DataFrame` rows by criteria

You can extract the rows that satisfy a condition with the following syntax: `df[df['col1'] > 0]`. ([Further Reading](https://pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing))

Example:

In [40]:
data[data['Fall'] > 50]

|   | Name | Degree | Absentia | Spring | Fall | Total |
|---|---|---|---|---|---|---|
| 0 | Honours Bachelor of Arts | HBA | NaN | 1599.0 | 345.0 | 1944.0 |
| 3 | Honours Bachelor of Science | HBSC | 3.0 | 2737.0 | 447.0 | 3187.0 |
| 5 | Bachelor of Commerce | BCOM | NaN | 612.0 | 71.0 | 683.0 |
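
It can help to look at the condition on its own. The comparison produces a boolean `Series` (often called a mask), and indexing with it keeps the rows marked `True`. The sketch below uses a hand-built excerpt of the example table; the name `mask` is only illustrative.

```python
import pandas as pd

# Hand-built excerpt of the example table.
data = pd.DataFrame({
    'Degree': ['HBA', 'BA-4', 'BA', 'HBSC', 'BSC', 'BCOM'],
    'Fall': [345.0, float('nan'), 12.0, 447.0, 23.0, 71.0],
})

# One True/False value per row; comparing NaN with a number gives False.
mask = data['Fall'] > 50
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True.
filtered = data[mask]
print(list(filtered['Degree']))
```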


#### More Complex Criteria
You can combine different conditions using boolean operators:
- Use `|` for `or`, meaning you want at least one of the conditions to be true
- Use `&` for `and`, meaning you want both conditions to be true
- Use `~` for `not`, meaning you want the condition to be false

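To keep or drop rows with null (`NaN`) values, use the `.isnull()` and `.notnull()` methods: `.isnull()` is true where a value is missing, and `.notnull()` is true where a value is present.

Wrap each comparison in parentheses `()`. The operators `&`, `|`, and `~` bind more tightly than comparisons such as `>`, so the parentheses are required around each comparison, and they also let you control which conditions are evaluated first.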

Example:

In [41]:
# this will return the rows with a Null absentia and more than 50 Fall certificates.
data[(data['Absentia'].isnull()) & (data['Fall'] > 50)]

|   | Name | Degree | Absentia | Spring | Fall | Total |
|---|---|---|---|---|---|---|
| 0 | Honours Bachelor of Arts | HBA | NaN | 1599.0 | 345.0 | 1944.0 |
| 5 | Bachelor of Commerce | BCOM | NaN | 612.0 | 71.0 | 683.0 |


In [42]:
# this will return rows with a null Absentia and either fewer than 1,000 Spring certificates or more than 50 Fall certificates
data[(data['Absentia'].isnull()) & ((data['Spring'] < 1000) | (data['Fall'] > 50))]

|   | Name | Degree | Absentia | Spring | Fall | Total |
|---|---|---|---|---|---|---|
| 0 | Honours Bachelor of Arts | HBA | NaN | 1599.0 | 345.0 | 1944.0 |
| 2 | Bachelor of Arts | BA | NaN | 31.0 | 12.0 | 43.0 |
| 4 | Bachelor of Science | BSC | NaN | 40.0 | 23.0 | 63.0 |
| 5 | Bachelor of Commerce | BCOM | NaN | 612.0 | 71.0 | 683.0 |
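
The `~` operator and `.notnull()` are not used in the examples above, so here is a short sketch of both, again on a hand-built excerpt of the example table. The variable names are illustrative only.

```python
import pandas as pd

nan = float('nan')

# Hand-built excerpt of the example table.
data = pd.DataFrame({
    'Degree': ['HBA', 'BA-4', 'BA', 'HBSC', 'BSC', 'BCOM'],
    'Absentia': [nan, nan, nan, 3.0, nan, nan],
    'Spring': [1599.0, nan, 31.0, 2737.0, 40.0, 612.0],
})

# Rows where Absentia has a value, written two equivalent ways.
has_absentia = data[data['Absentia'].notnull()]
also_has_absentia = data[~data['Absentia'].isnull()]

# Rows where Spring is below 100 OR missing.
small_or_missing = data[(data['Spring'] < 100) | (data['Spring'].isnull())]

print(list(has_absentia['Degree']))
print(list(small_or_missing['Degree']))
```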


### Sort `DataFrame` rows

Use the `.sort_values` function to sort your `DataFrame` by the value of its columns or rows. ([Further Reading](https://pandas.pydata.org/docs/user_guide/basics.html#by-values))

Example:

In [43]:
data.sort_values(by=["Total"])

|   | Name | Degree | Absentia | Spring | Fall | Total |
|---|---|---|---|---|---|---|
| 2 | Bachelor of Arts | BA | NaN | 31.0 | 12.0 | 43.0 |
| 4 | Bachelor of Science | BSC | NaN | 40.0 | 23.0 | 63.0 |
| 5 | Bachelor of Commerce | BCOM | NaN | 612.0 | 71.0 | 683.0 |
| 0 | Honours Bachelor of Arts | HBA | NaN | 1599.0 | 345.0 | 1944.0 |
| 3 | Honours Bachelor of Science | HBSC | 3.0 | 2737.0 | 447.0 | 3187.0 |
| 1 | Bachelor of Arts - 4 Yr. | BA-4 | NaN | NaN | NaN | NaN |
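
By default the sort is ascending and rows with `NaN` in the sort column go last. The sketch below shows the `ascending` and `na_position` parameters of `.sort_values` on a hand-built excerpt of the example table.

```python
import pandas as pd

# Hand-built excerpt of the example table.
data = pd.DataFrame({
    'Degree': ['HBA', 'BA-4', 'BA', 'HBSC', 'BSC', 'BCOM'],
    'Total': [1944.0, float('nan'), 43.0, 3187.0, 63.0, 683.0],
})

# Largest totals first; the NaN row still ends up last.
descending = data.sort_values(by='Total', ascending=False)
print(list(descending['Degree']))

# Ascending order, but with the NaN row moved to the top.
nan_first = data.sort_values(by='Total', na_position='first')
print(list(nan_first['Degree']))
```

Like most `DataFrame` methods in this guide, `.sort_values` returns a new `DataFrame` and leaves `data` unchanged.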


### Obtain the "n largest" or "n smallest" rows from a `DataFrame`
Use the `.nlargest` function to return the first `n` rows sorted by certain columns in descending order. ([Further Reading](https://pandas.pydata.org/docs/user_guide/basics.html#smallest-largest-values))

Example:

In [44]:
# You can sort using one column, like so:
data.nlargest(3, 'Total')

# Or with multiple columns. This will order by largest in 'Spring' then in 'Fall'
data.nlargest(3, ['Spring', 'Fall'])

|   | Name | Degree | Absentia | Spring | Fall | Total |
|---|---|---|---|---|---|---|
| 3 | Honours Bachelor of Science | HBSC | 3.0 | 2737.0 | 447.0 | 3187.0 |
| 0 | Honours Bachelor of Arts | HBA | NaN | 1599.0 | 345.0 | 1944.0 |
| 5 | Bachelor of Commerce | BCOM | NaN | 612.0 | 71.0 | 683.0 |


Similarly, use the `.nsmallest` function to return the first `n` rows sorted by certain columns in ascending order.

Example:

In [45]:
# You can sort using one column, like so:
data.nsmallest(3, 'Total')

# Or with multiple columns. This will order by smallest in 'Spring' then in 'Fall'
data.nsmallest(3, ['Spring', 'Fall'])

|   | Name | Degree | Absentia | Spring | Fall | Total |
|---|---|---|---|---|---|---|
| 2 | Bachelor of Arts | BA | NaN | 31.0 | 12.0 | 43.0 |
| 4 | Bachelor of Science | BSC | NaN | 40.0 | 23.0 | 63.0 |
| 5 | Bachelor of Commerce | BCOM | NaN | 612.0 | 71.0 | 683.0 |
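
One way to think about `.nlargest` and `.nsmallest` is as a sort followed by `.head(n)`. The sketch below compares the two approaches on a hand-built excerpt of the example table, where all totals are distinct; it is an illustration of the idea, not a claim about how pandas implements these methods.

```python
import pandas as pd

# Hand-built excerpt of the example table.
data = pd.DataFrame({
    'Degree': ['HBA', 'BA-4', 'BA', 'HBSC', 'BSC', 'BCOM'],
    'Total': [1944.0, float('nan'), 43.0, 3187.0, 63.0, 683.0],
})

top3 = data.nlargest(3, 'Total')
top3_by_sort = data.sort_values(by='Total', ascending=False).head(3)

bottom2 = data.nsmallest(2, 'Total')
bottom2_by_sort = data.sort_values(by='Total').head(2)

print(list(top3['Degree']), list(top3_by_sort['Degree']))
print(list(bottom2['Degree']), list(bottom2_by_sort['Degree']))
```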


### Renaming `DataFrame` columns

Use the function `.rename(columns={'col1': 'new col1', ...})` to rename columns in a `DataFrame`. ([Further Reading](https://pandas.pydata.org/docs/user_guide/basics.html#basics-rename))

Example:

In [46]:
data.rename(columns={'Degree': 'Degree Code'})

|   | Name | Degree Code | Absentia | Spring | Fall | Total |
|---|---|---|---|---|---|---|
| 0 | Honours Bachelor of Arts | HBA | NaN | 1599.0 | 345.0 | 1944.0 |
| 1 | Bachelor of Arts - 4 Yr. | BA-4 | NaN | NaN | NaN | NaN |
| 2 | Bachelor of Arts | BA | NaN | 31.0 | 12.0 | 43.0 |
| 3 | Honours Bachelor of Science | HBSC | 3.0 | 2737.0 | 447.0 | 3187.0 |
| 4 | Bachelor of Science | BSC | NaN | 40.0 | 23.0 | 63.0 |
| 5 | Bachelor of Commerce | BCOM | NaN | 612.0 | 71.0 | 683.0 |
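
Note that `.rename` returns a new `DataFrame` rather than changing the original, so the result has to be stored in a variable if you want to keep it. A minimal sketch on a hand-built excerpt of the example table (the name `renamed` is illustrative):

```python
import pandas as pd

# Hand-built excerpt of the example table.
data = pd.DataFrame({
    'Name': ['Honours Bachelor of Arts', 'Bachelor of Arts'],
    'Degree': ['HBA', 'BA'],
})

renamed = data.rename(columns={'Degree': 'Degree Code'})

print(list(data.columns))     # the original keeps its column names
print(list(renamed.columns))  # the copy has the new name
```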


### Calculating descriptive statistics for a `DataFrame`
You can read more about the following operations in the [descriptive statistics section of the user guide](https://pandas.pydata.org/docs/user_guide/basics.html#descriptive-statistics).

Use the `.count` function to return how many non-NaN values there are on one axis.

Example:

In [47]:
data.count()

Name        6
Degree      6
Absentia    1
Spring      5
Fall        5
Total       5
dtype: int64

Use the `.sum` function to sum up values on one axis.

Example:

In [6]:
data.sum(numeric_only=True)

Absentia       3.0
Spring      5019.0
Fall         898.0
Total       5920.0
dtype: float64

Use the `.mean` function to find the mean of values on one axis.

Example:

In [5]:
data.mean(numeric_only=True)

Absentia       3.0
Spring      1003.8
Fall         179.6
Total       1184.0
dtype: float64

Use the `.median` function to find the median of values on one axis.

Example:

In [7]:
data.median(numeric_only=True)

Absentia      3.0
Spring      612.0
Fall         71.0
Total       683.0
dtype: float64

Use the `.min` and `.max` functions to find the minimum/maximum values respectively on one axis.

Example:

In [51]:
data.min()

Name        Bachelor of Arts
Degree                    BA
Absentia                 3.0
Spring                  31.0
Fall                    12.0
Total                   43.0
dtype: object

The `.describe` function provides the descriptive statistics listed above, and a few more: ([Further Reading](https://pandas.pydata.org/docs/user_guide/basics.html#summarizing-data-describe))

In [52]:
data.describe()

|   | Absentia | Spring | Fall | Total |
|---|---|---|---|---|
| count | 1.0 | 5.0 | 5.0 | 5.0 |
| mean | 3.0 | 1003.8 | 179.6 | 1184.0 |
| std | NaN | 1160.495885 | 202.031681 | 1360.067278 |
| min | 3.0 | 31.0 | 12.0 | 43.0 |
| 25% | 3.0 | 40.0 | 23.0 | 63.0 |
| 50% | 3.0 | 612.0 | 71.0 | 683.0 |
| 75% | 3.0 | 1599.0 | 345.0 | 1944.0 |
| max | 3.0 | 2737.0 | 447.0 | 3187.0 |
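
The statistics above run down each column. The same methods also work on a single column, and passing `axis=1` runs them across each row instead. The sketch below uses a hand-built excerpt of the example table.

```python
import pandas as pd

nan = float('nan')

# Hand-built excerpt of the example table.
data = pd.DataFrame({
    'Degree': ['HBA', 'BA-4', 'BA', 'HBSC', 'BSC', 'BCOM'],
    'Spring': [1599.0, nan, 31.0, 2737.0, 40.0, 612.0],
    'Fall': [345.0, nan, 12.0, 447.0, 23.0, 71.0],
})

# Statistics on one column return a single number; NaN values are skipped.
spring_mean = data['Spring'].mean()
fall_max = data['Fall'].max()
print(spring_mean, fall_max)

# axis=1 sums across the columns of each row.
# A row whose values are all NaN sums to 0.0, because NaN is skipped by default.
row_sums = data[['Spring', 'Fall']].sum(axis=1)
print(row_sums.tolist())
```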


### Adding a computed column to a `DataFrame`
To add a new column to a `DataFrame`, use `df['new column'] = new_column_content`. ([Further Reading](https://pandas.pydata.org/docs/user_guide/dsintro.html#column-selection-addition-deletion))

Example:

In [53]:
data['Non-absentia Total'] = data['Spring'] + data['Fall']

data.head()

|   | Name | Degree | Absentia | Spring | Fall | Total | Non-absentia Total |
|---|---|---|---|---|---|---|---|
| 0 | Honours Bachelor of Arts | HBA | NaN | 1599.0 | 345.0 | 1944.0 | 1944.0 |
| 1 | Bachelor of Arts - 4 Yr. | BA-4 | NaN | NaN | NaN | NaN | NaN |
| 2 | Bachelor of Arts | BA | NaN | 31.0 | 12.0 | 43.0 | 43.0 |
| 3 | Honours Bachelor of Science | HBSC | 3.0 | 2737.0 | 447.0 | 3187.0 | 3184.0 |
| 4 | Bachelor of Science | BSC | NaN | 40.0 | 23.0 | 63.0 | 63.0 |
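
Arithmetic between columns propagates `NaN`: if either operand is missing, the result is missing. When a missing value should count as zero, `.fillna(0)` can be applied first. The sketch below, on a hand-built copy of the example table's numeric columns, recomputes the total this way; the column name `Check Total` is hypothetical.

```python
import pandas as pd

nan = float('nan')

# Numeric columns of the example table, typed in by hand from this guide.
data = pd.DataFrame({
    'Absentia': [nan, nan, nan, 3.0, nan, nan],
    'Spring': [1599.0, nan, 31.0, 2737.0, 40.0, 612.0],
    'Fall': [345.0, nan, 12.0, 447.0, 23.0, 71.0],
    'Total': [1944.0, nan, 43.0, 3187.0, 63.0, 683.0],
})

# Treat a missing Absentia count as 0 so it does not turn the whole sum into NaN.
data['Check Total'] = data['Spring'] + data['Fall'] + data['Absentia'].fillna(0)

# Compare against the Total column on the rows where Total is present.
checked = data[data['Total'].notnull()]
matches = (checked['Check Total'] == checked['Total']).all()
print(matches)
```

The row with no `Spring` or `Fall` value still has `NaN` in the new column, since only `Absentia` was filled.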


[^dataset]: The dataset we are using in this notebook is extracted from the [2021 Annual Report on Degrees Diplomas and Certificates](https://governingcouncil.utoronto.ca/system/files/2022-01/Annual%20Report%20on%20Degrees%20Diplomas%20and%20Certificates%20%282021%29.pdf), specifically for the Faculty of Arts & Science.