# Week 7: The Pandas Library

### Feb 28, 2024

### Krishnapriya Vishnubhotla

## Python Libraries

- A collection of data types and methods.
  
  
- We have already used libraries in Python
  - The built-in `str` type defines string operations like `str.split()` and `str.replace()`.
  - It is documented as part of the Python Standard Library
      
      
- There are several external libraries that can be *imported* and used for specific tasks:
    - plotting (`matplotlib`, `seaborn`)
    - statistics (`scipy`)
    - graphics, audio (`pygame`)
    
    
- `Pandas`: for data science
    - Reading and writing data files
    - Data operations
    - Visualizations

## Importing a Library

- A statement that tells Python to load the library so we can access its functions

In [None]:
import pandas

In [None]:
# sample usage
data = pandas.read_csv('PanTHERIA_WR05_Aug2008.csv')

In [None]:
display(data)

In [None]:
# give the library an easy nickname
import pandas as pd

In [None]:
data = pd.read_csv('PanTHERIA_WR05_Aug2008.csv')
display(data)

## Part 1: What is a Dataframe?

In [None]:
print(type(data))

In [None]:
help(data)

In [None]:
display(data)

In [None]:
# The shape of the dataframe
print(data.shape)
num_rows = data.shape[0]
num_cols = data.shape[1]
print("There are {} rows and {} columns in the dataframe".format(num_rows, num_cols))

In [None]:
# display the first (or last) few rows
data.head(10)

In [None]:
data.tail()

**Note**: There is a difference between accessing *properties* (`data.shape`) and calling *methods* (`data.head()`) of a dataframe.

- Properties are accessed without parentheses; methods are called with parentheses.
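As a small illustration of the property/method distinction, here is a sketch on a tiny made-up dataframe (the name `toy` and its values are invented for this example and are not part of the PanTHERIA data):

```python
import pandas as pd

# A tiny hypothetical dataframe standing in for the real data
toy = pd.DataFrame({
    "Genus": ["Canis", "Felis", "Ursus"],
    "Mass": [30000, 4000, 200000],
})

# Property: no parentheses, the value is simply looked up
print(toy.shape)

# Method: parentheses are required to actually run it
print(toy.head(2))

# Forgetting the parentheses gives back the method itself, not a dataframe
print(type(toy.head))
```

If a cell prints something like `<bound method ...>`, it usually means the parentheses were left off a method call.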

In our dataframe (and generally), each row represents a datapoint, and each column represents a certain attribute, or aspect, of a datapoint.

In [None]:
# obtain the list of column names
print(data.columns)

In [None]:
# The above is not exactly a list, but we can convert it to a list.
list_of_columns = list(data.columns)
print(len(list_of_columns))
print(list_of_columns[:5])

### Data types
You might have noticed that the columns of our dataframe store different *types* of information.

In [None]:
data.head()

- `MSW05_Order` is a text attribute (what regular Python calls the `str` type).
- `1-1_ActivityCycle` is a numerical attribute (type `int`)

We can access the data types of the dataframe columns using the `dataframe.dtypes` property.

In [None]:
data.dtypes # a property

`pandas` uses its own custom data types, which loosely map onto the base Python types we've encountered before:
- `float64` corresponds to the `float` type.
- `int64` to `int`
- text fields are by default assigned the generic `object` type by pandas, 
    - but we can change that!
    
The [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes) comprehensively lists the different `dtypes` in Pandas...but this is not too important right now.
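To see this mapping on a small scale, here is a sketch using a made-up three-column dataframe (the column names and values are invented for illustration). The text column is expected to show up as `object`, as described above, although newer Pandas versions may report a dedicated string type instead.

```python
import pandas as pd

# Hypothetical data: one text column, one whole-number column, one decimal column
toy = pd.DataFrame({
    "Order": ["Carnivora", "Rodentia"],
    "ActivityCycle": [1, 3],
    "BodyMass_g": [31756.5, 19.7],
})

# One dtype per column
print(toy.dtypes)

# Look up the dtype of a single column by its name
print(toy.dtypes["ActivityCycle"])
print(toy.dtypes["BodyMass_g"])
```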

In [None]:
# to summarize
data.info() # a method!

## Part 2: Data Cleaning

In any data science project, the raw data generally needs to be **pre-processed** -- reworked into a format that is cleaner and more understandable.
- We've already seen an example of this: not considering datapoints with a -999 or NA value.

Pre-processing can involve:
- data selection (filter our data down to a subset of rows/columns)
- data transformation (convert the raw values to a different scale/type)
- other stuff (normalization, imputation, etc.)

Let us start with some column operations.

### Renaming Columns

The list of column names in our dataframe is quite messy and unintuitive. We can change the column names using the `DataFrame.rename()` method.

This method requires us to supply an **argument** that maps the current column name(s) to the new, desired column name(s). 

What do you think the best data type is for specifying such a mapping?

In [None]:
# A dictionary!

old_to_new = {
    'MSW05_Genus': 'Genus',
    'MSW05_Species': 'Species',
    '1-1_ActivityCycle': 'Activity Cycle',
    '5-1_AdultBodyMass_g': 'Adult Body Mass (g)',
    '2-1_AgeatEyeOpening_d': 'Age at Eye Opening (days)',
    '17-1_MaxLongevity_m': 'Max Longevity (months)'
}

# help(pd.DataFrame.rename)


In [None]:
data_renamed = data.rename(columns=old_to_new)

In [None]:
display(data_renamed)

In [None]:
data_renamed.columns

### Converting column types

We can ask Pandas to *automatically choose* the best column types for an existing `DataFrame`.
This is done with the `DataFrame.convert_dtypes()` method.

In [None]:
data_converted = data_renamed.convert_dtypes()

In [None]:
data_converted.dtypes
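A minimal sketch of what `convert_dtypes()` does, on a made-up dataframe (names and values are invented for this example). The exact name printed for the text type may differ between Pandas versions.

```python
import pandas as pd

# Hypothetical data with a text column and a whole-number column
toy = pd.DataFrame({
    "Genus": ["Canis", "Felis"],
    "ActivityCycle": [1, 3],
})

# Before conversion: the default dtypes chosen when the dataframe was created
print(toy.dtypes)

# convert_dtypes() returns a NEW dataframe; toy itself is not changed
converted = toy.convert_dtypes()

# After conversion: a string type for text, and the nullable Int64 for integers
print(converted.dtypes)
```

The capital-I `Int64` type is useful for the next step, because it can store missing values alongside whole numbers.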

### Replacing values
We have seen that missing values are usually indicated by a special value in data files (e.g., -999 or "NA").

In many programming languages, a special `null` value is used to indicate missing data. Pandas provides its own missing-value marker, `pd.NA`. We will replace all -999 values in the PanTHERIA dataset with `pd.NA`.

In [None]:
print(pd.NA)

In [None]:
data_converted

In [None]:
data_converted_with_na = data_converted.replace(-999, pd.NA)

In [None]:
data_converted_with_na
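The same replacement can be checked on a tiny made-up dataframe (the values below are invented for illustration). This sketch also shows that `replace()` returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

# Hypothetical data where -999 marks a missing body mass
toy = pd.DataFrame({
    "Genus": ["Canis", "Felis", "Ursus"],
    "Mass": [30000, -999, 200000],
})

# Swap every -999 for the missing-value marker
cleaned = toy.replace(-999, pd.NA)
print(cleaned)

# isna() marks which entries are now missing
print(cleaned["Mass"].isna())

# The original dataframe still contains -999
print(toy["Mass"].tolist())
```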

### Selecting Columns
We often do not need ALL the information that is in a data file to answer a data science question of interest -- our answer probably depends on a subset of the columns (attributes) or rows (datapoints). 

We can extract a subset of columns by:
- Specifying the **list** of columns that we want to keep
- Selecting the subset of columns using the square-bracket `[list, of, columns]` notation

(We will see how to select rows later)

In [None]:
columns_to_keep = ['Genus', 'Species', 'Adult Body Mass (g)', 'Max Longevity (months)']

In [None]:
data_filt_colums = data_converted_with_na[columns_to_keep]

In [None]:
data_filt_colums

## Part 3: Column-level Operations

So far, we have performed operations on the dataframe as a whole:
- Replacing -999 values in all columns and rows with `pd.NA`
- Converting all the columns to the recommended `dtype`
- Renaming multiple columns

A powerful feature of Pandas is that it allows us to perform specific operations on individual columns. For example: 
- multiplying all the values in a column by 100
- adding the values of two columns together
- Converting a specific column to a dtype, or replacing values in a single column

In [None]:
# Let us first retrieve a single column of our dataframe
data_filt_colums.columns

In [None]:
mass_info = data_filt_colums['Adult Body Mass (g)']

In [None]:
display(mass_info)

In [None]:
# NOTE: a list inside the brackets selects a column SUBSET and returns a DataFrame, not a Series!
type(data_filt_colums[['Adult Body Mass (g)']])

In [None]:
# What is the data type of this variable?
print(type(mass_info))

### Series

A Pandas `Series` represents the data contained in a single column. It is like a `list`, but allows us to apply powerful Pandas methods on it.

In [None]:
mass_info.shape #property

In [None]:
mass_info.dtype #property

In [None]:
mass_info.name #property

In [None]:
mass_info.info() #method

### Some information about the values in a column

We can extract several interesting statistics about the values in a column:
- `nunique()` will tell us the number of *unique* values
- `value_counts()` will give us the number of times each unique value occurs
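A quick sketch of both methods on a short made-up `Series` (the genus names are chosen only for illustration):

```python
import pandas as pd

# Hypothetical column of genus names with some repeats
genera = pd.Series(["Sorex", "Canis", "Sorex", "Felis", "Sorex"], name="Genus")

# Number of distinct values
print(genera.nunique())

# How often each distinct value occurs, most frequent first
counts = genera.value_counts()
print(counts)

# Look up the count for one value, much like a dictionary lookup
print(counts["Sorex"])
```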

In [None]:
genus_info = data_filt_colums['Genus']

In [None]:
genus_info

In [None]:
# how many unique values?
n_unique_genus = genus_info.nunique()
print(n_unique_genus)

In [None]:
# how many times does each genus appear?
genus_value_counts = genus_info.value_counts()
genus_value_counts

In [None]:
type(genus_value_counts)

In [None]:
# almost like a dictionary!
genus_value_counts['Sorex']

In [None]:
genus_info.unique()

In [None]:
# obtain a list of the unique genus values
genus_list = list(genus_info.unique()) #genus_info.unique() is the method
print(len(genus_list))
print(genus_list[:5])

### Numerical Statistics

Here are five simple *descriptive statistics* that we can use to describe a collection of numbers:
- count (i.e., size; number of elements)
- sum
- mean (average)
- min
- max
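Here is a sketch of these five statistics on a short made-up `Series` that contains one missing value (the numbers are invented for illustration). It assumes the nullable `Int64` type, so that `pd.NA` can sit next to whole numbers.

```python
import pandas as pd

# Hypothetical masses, with one missing entry
masses = pd.Series([10, 20, pd.NA, 30], dtype="Int64")

print(len(masses))      # total number of entries, missing or not
print(masses.count())   # number of non-missing entries
print(masses.sum())     # missing entries are skipped
print(masses.mean())    # the mean is taken over non-missing entries only
print(masses.min(), masses.max())
```

Note that `len()` and `count()` disagree whenever a column has missing values.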

In [None]:
mass_info

In [None]:
mass_info.count()
# Non <NA> values only!

In [None]:
print("Sum: ", mass_info.sum())
print("Mean: {}, Minimum: {}, Maximum: {}".format(mass_info.mean(), mass_info.min(), mass_info.max()))

In [None]:
# all together
mass_info.describe()

### Data transformation on Series

Let us check out how we can apply a data transformation to all the values in a series, using a single command.

In [None]:
# convert the mass values from grams to kilograms (i.e., kilograms = grams/1000)
mass_in_kg = mass_info/1000

In [None]:
mass_in_kg

In [None]:
type(mass_in_kg)

In [None]:
# apply some rounding...
mass_in_kg_rounded = mass_in_kg.round(2)
# no need to write for loops!

In [None]:
mass_in_kg_rounded
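To make the "no for loops" point concrete, here is a sketch that does the same conversion twice on made-up values: once with an explicit loop and once with a single `Series` expression. The variable names are invented for this example.

```python
import pandas as pd

# Hypothetical masses in grams
mass_g = pd.Series([31756.5, 19.7, 4573.0])

# The loop version: visit each value, convert it, and collect the results
mass_kg_loop = []
for grams in mass_g:
    mass_kg_loop.append(round(grams / 1000, 2))

# The Series version: the division and rounding apply to every value at once
mass_kg = (mass_g / 1000).round(2)

print(mass_kg_loop)
print(mass_kg.tolist())
```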

### Operations involving multiple columns

Say we want to compute a value that uses information from two (or multiple) columns. 

We can do this by storing the values in each column in two `Series` variables, and specifying the operation.

In [None]:
data_filt_colums

In [None]:
longetivity_info = data_filt_colums['Max Longevity (months)']

In [None]:
mass_info.shape, longetivity_info.shape

In [None]:
long_mass_ratio = longetivity_info / mass_info

In [None]:
long_mass_ratio

### Adding new columns!

In [None]:
pd.set_option('mode.chained_assignment', None)
# don't worry about this: it silences a warning Pandas may show when we add a column to a subset of another dataframe

In [None]:
data_filt_colums['Longevity-to-mass-ratio'] = long_mass_ratio
# This modifies the dataframe!

In [None]:
data_filt_colums
# note how the NA values are handled.
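The NA handling can be seen more easily on a three-row made-up dataframe (column names and values are invented for illustration, and the nullable `Int64` type is assumed). Whenever either input is missing, the result for that row is missing too.

```python
import pandas as pd

# Hypothetical data: row 1 is missing longevity, row 2 is missing mass
toy = pd.DataFrame({
    "Mass": pd.Series([1000, 50, pd.NA], dtype="Int64"),
    "Longevity": pd.Series([120, pd.NA, 24], dtype="Int64"),
})

# Row-by-row division of two columns, stored as a new column
toy["Ratio"] = toy["Longevity"] / toy["Mass"]

print(toy)
print(toy["Ratio"].isna())
```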

## Part 4: Row operations

We have seen how to name, select, and access columns.

Another type of data selection is to choose only a subset of the rows.

There are three main ways of selecting rows:
- By name (`DataFrame.loc`)
- By position (`DataFrame.iloc`)
- With a Boolean filter (`DataFrame[Bool-Series]`)

In [None]:
# rows are also indexed by names -- these are the bold values at the start of each row.
data_filt_colums.head()

By default, Pandas indexes rows with a numerical ordering starting from 0.

(We can also specify our own names, but we will not cover that right now.)

In [None]:
# extract the row with name 0
row_name_zero = data_filt_colums.loc[0]

In [None]:
row_name_zero

In [None]:
# what is the type?
type(row_name_zero)

# it is a series with the columns as names, and the attributes as values

In [None]:
row_name_zero['Genus']

In [None]:
# we can specify a list of row names to extract a subset of the dataframe
row_names_to_extract = [0, 2, 4, 6, 8]
subset_of_data = data_filt_colums.loc[row_names_to_extract]
display(subset_of_data)

In [None]:
# If we want to specify the position, rather than name, we use iloc
# get the first row
data_filt_colums.iloc[0]


In [None]:
# get the first and tenth row
data_filt_colums.iloc[[0, 9]]

In [None]:
# this is useful when the row name and the row position do not always match
display(subset_of_data.loc[4])
display(subset_of_data.iloc[4])
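The difference between name and position is easier to follow when the values make the row obvious. In this sketch on made-up data, each value is ten times its original row name:

```python
import pandas as pd

# Hypothetical dataframe: the row named n holds the value 10 * n
toy = pd.DataFrame({"Value": [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]})

# Keep every other row; the row NAMES are kept as they were
subset = toy.loc[[0, 2, 4, 6, 8]]
print(subset.index.tolist())

# loc: the row whose NAME is 4
print(subset.loc[4]["Value"])

# iloc: the row at POSITION 4 (the fifth row), whose name is 8
print(subset.iloc[4]["Value"])
```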

## Boolean Filtering

The most common scenario when filtering rows is when we want to select datapoints **based on some conditions**. 

These conditions are constraints that are placed on the values in one or more columns.

In [None]:
# remind ourselves of the data
data_filt_colums

Say we want to impose the following condition: keep only those rows (species) with Body Mass >= 100 kg (100 * 1000 grams).

We can do this in two steps:
- Create a *boolean `Series`* that stores `True` for the rows we want to keep, and `False` for the other rows.
- Use this `Series` to index `data_filt_colums` by using square bracket notation.

In [None]:
# Step 1
is_large = data_filt_colums['Adult Body Mass (g)'] >= 100000
display(is_large)

In [None]:
# Step 2
data_filt_colums[is_large]

### Note: lots of square brackets!

One of the tricky things about `DataFrame`s is that there are different ways of obtaining subsets of the dataset that all have very similar code syntax:

```python
data_filt_colums[...]
```

The key principle is that **the type of the value inside the square brackets determines what kind of "subsetting" operation is being performed**.

| Type inside `[...]` | Example                    | Filtering On | Return type   | Which columns?  | Which rows? |
|---------------------|----------------------------|---------------|--------|-----------|-------|
| `str`               | `data_filt_colums["Adult Body Mass (g)"]` | Columns | `Series` | The one specified | All rows |
| `list` of `str`     | `data_filt_colums[["Genus", "Species"]]` | Columns | `DataFrame` | The ones specified | All rows |
| `Series` of `bool`  | `data_filt_colums[is_large]` | Rows | `DataFrame` | All columns | The ones specified |
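The three rows of the table can be checked directly. This sketch uses a small made-up dataframe (names and values invented for illustration) and prints the type and shape that each kind of indexing returns.

```python
import pandas as pd

# Hypothetical data with three species
toy = pd.DataFrame({
    "Genus": ["Canis", "Ursus", "Sorex"],
    "Species": ["lupus", "arctos", "araneus"],
    "Mass": [30000, 200000, 10],
})

# str inside the brackets: one column, returned as a Series
one_column = toy["Mass"]
print(type(one_column), one_column.shape)

# list of str inside the brackets: several columns, returned as a DataFrame
some_columns = toy[["Genus", "Species"]]
print(type(some_columns), some_columns.shape)

# Series of bool inside the brackets: some rows, all columns, returned as a DataFrame
is_large = toy["Mass"] >= 100000
some_rows = toy[is_large]
print(type(some_rows), some_rows.shape)
```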

### Logical operators: `&` and `|`

Sometimes we want to filter on two conditions.
To start, suppose we have these two boolean `Series`:

In [None]:
is_large = data_filt_colums["Adult Body Mass (g)"] >= 100000
is_long_lived = data_filt_colums["Max Longevity (months)"] >= 240

In [None]:
is_large

In [None]:
is_long_lived

There are two common ways to filter based on a combination of these two conditions.

**Filter 1**: find rows where the species is large **and** is long-lived.
To do this, we use the `&` operator to combine the two `Series`.

**Filter 2**: find rows where the species is large **or** is long-lived.
To do this, we use the `|` operator to combine the two `Series`.

In [None]:
filter1 = is_large & is_long_lived
filter2 = is_large | is_long_lived

In [None]:
display(filter1)

In [None]:
display(filter2)

In [None]:
data_filt_colums[filter1]

In [None]:
data_filt_colums[filter2]
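The conditions can also be combined in a single line, without first storing each boolean `Series` in a variable. In that case each comparison needs its own parentheses, because `&` and `|` are applied before `>=` otherwise. Python's `and` and `or` keywords do not work element by element on a `Series`. The sketch below uses made-up data (names and values invented for illustration).

```python
import pandas as pd

# Hypothetical data: mass in grams, longevity in months
toy = pd.DataFrame({
    "Genus": ["Canis", "Ursus", "Sorex", "Homo"],
    "Mass": [30000, 200000, 10, 60000],
    "Longevity": [200, 400, 24, 1400],
})

# Large AND long-lived: both conditions must hold for a row to be kept
both = toy[(toy["Mass"] >= 100000) & (toy["Longevity"] >= 240)]
print(both)

# Large OR long-lived: at least one condition must hold
either = toy[(toy["Mass"] >= 100000) | (toy["Longevity"] >= 240)]
print(either)
```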

## Further Reading

`pandas` is the most complex part of Python we've studied so far in this course, and so we expect you'll need to review and practice more as we dive deeper into this library.

The official Pandas website has some great introductory materials, including:

- [10 minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html)
- [*Getting Started* tutorials](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html)
