# Intro to `pandas`: the `DataFrame`

We'll explore the Pandas package for simple data handling tasks using geoscience data examples. 

Pandas introduces the concept of a `DataFrame` in Python. If you're familiar with R, it's pretty much the same idea! There is a useful cheat sheet [here](https://www.datacamp.com/community/blog/pandas-cheat-sheet-python#gs.59HV6BY).

The main purpose of Pandas is to allow easy manipulation of data in tabular form. Perhaps the most important idea that makes Pandas great for data science is that it will always preserve **alignment** between data and labels.

## Meet the `DataFrame`

The most important data structure in Pandas is `pd.DataFrame`, which is a 2D structure — a table or spreadsheet — that can hold various types of Python objects indexed by a special `index` column (or multiple indices). Columns are nearly always labelled using strings. In a sense, you can think of DataFrames as a dict of names mapped to Series.

An easy way to think about a `DataFrame` is to imagine it as a table or an Excel spreadsheet with a lot of superpowers.

Let's define one using a small dataset, defined as a nested `list`:

In [None]:
data =  [[101, 2.13, 'corroded'],
         [102, 1.45, 'cracked'],
         [104, 4.11, 'pitted'],
         [107, 0.28, 'good'],
         [108, -0.08, 'good'],
         [109, 0.75, 'cracked']]
data

Make a `DataFrame` from `data`:

In [None]:
import pandas as pd

df = pd.DataFrame(data, columns=['id', 'roughness', 'condition'])
df

Note the special display of DataFrame objects in the Jupyter Notebook. This will help you recognize if you are looking at a DataFrame or a Series.

Recall that we earlier instantiated a DataFrame from a dict; we can cast it to a dict as well:
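As a minimal sketch of that cast, assuming the same small dataset as above, `DataFrame.to_dict()` with its default orientation returns a dict of column names, each mapped to a dict of index labels and values:

```python
import pandas as pd

data = [[101, 2.13, 'corroded'],
        [102, 1.45, 'cracked'],
        [104, 4.11, 'pitted']]
df = pd.DataFrame(data, columns=['id', 'roughness', 'condition'])

# Default orientation: {column name: {index label: value}}.
as_dict = df.to_dict()
as_dict
```

The outer keys are the column names, and the inner keys are the row labels of the default `RangeIndex`.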

This representation might help explain why the 'first class' index names of the DataFrame are the column names:
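One way to see this, sketched here on the same small dataset, is that plain square-bracket indexing, iteration and membership tests all work on column names, just as they work on the keys of a dict:

```python
import pandas as pd

data = [[101, 2.13, 'corroded'],
        [102, 1.45, 'cracked'],
        [104, 4.11, 'pitted']]
df = pd.DataFrame(data, columns=['id', 'roughness', 'condition'])

# Square brackets look up a column by name and return a Series.
roughness = df['roughness']

# Iterating over a DataFrame yields its column names, like dict keys.
names = list(df)

# Membership tests also check the column names, not the row labels.
has_roughness = 'roughness' in df
```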

The same as for `Series`, the primary way to index or slice into the DataFrame is with `.loc[]`.

The default `RangeIndex` (positional integer index) can be explicitly replaced in the same way as for `Series`:

    df.index = [1001, 1002, 1004, 1007, 1008, 1009]
    
But often, we already have the index we want in the DataFrame and therefore want to 'promote' one or more of the DataFrame columns:
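A minimal sketch of that promotion, assuming we want the `id` column as the index, uses `DataFrame.set_index()`. It returns a new DataFrame, so we assign the result back to `df`:

```python
import pandas as pd

data = [[101, 2.13, 'corroded'],
        [102, 1.45, 'cracked'],
        [104, 4.11, 'pitted'],
        [107, 0.28, 'good'],
        [108, -0.08, 'good'],
        [109, 0.75, 'cracked']]
df = pd.DataFrame(data, columns=['id', 'roughness', 'condition'])

# Promote the 'id' column to be the row index; it is no longer a data column.
df = df.set_index('id')
df
```

Passing a list of column names instead of a single name would produce a multi-level index.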

## Selecting data with `.loc[]`

Two useful ways to get at a selection of the data are `.loc[]` (an indexable attribute) and `.query()` (a method).

`.loc` works as for `Series`, but now you can select columns too. Like NumPy indexing, the two 'selectors' are separated by commas:

    df.loc[<row selector>, <column selector>]

For example:
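Here is a short sketch of a few selector combinations. It assumes `id` has been promoted to the index, and the particular labels and threshold are chosen only for illustration:

```python
import pandas as pd

data = [[101, 2.13, 'corroded'],
        [102, 1.45, 'cracked'],
        [104, 4.11, 'pitted'],
        [107, 0.28, 'good'],
        [108, -0.08, 'good'],
        [109, 0.75, 'cracked']]
df = pd.DataFrame(data, columns=['id', 'roughness', 'condition']).set_index('id')

# A single cell: row label 104, column 'roughness'.
one_value = df.loc[104, 'roughness']

# A label slice of rows (both ends are included) and a list of columns.
some_rows = df.loc[102:107, ['condition', 'roughness']]

# A Boolean row selector combined with a single column name.
rough_conditions = df.loc[df['roughness'] > 2, 'condition']
```

Note that, unlike positional slicing in NumPy, a label slice such as `102:107` includes the end label.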

<div style="background: #e0ffe0; border: solid 2px #d0f0d0; border-radius:3px; padding: 1em; color: darkgreen">

<h3>EXERCISE</h3>

- Get the roughness of all cracked samples.
- See if you can get the roughness of all samples that are either cracked or pitted.
</div>

## Selecting data with `.query()`

We can pass a 'query string' to `.query()`, using column names as 'variables' in the string. For example, this selection:
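As an illustration of the kind of selection meant here (the threshold and condition are made up, and `id` is assumed to be the index), a Boolean mask built from two columns might look like this:

```python
import pandas as pd

data = [[101, 2.13, 'corroded'],
        [102, 1.45, 'cracked'],
        [104, 4.11, 'pitted'],
        [107, 0.28, 'good'],
        [108, -0.08, 'good'],
        [109, 0.75, 'cracked']]
df = pd.DataFrame(data, columns=['id', 'roughness', 'condition']).set_index('id')

# Each comparison needs its own parentheses, and '&' replaces 'and'.
mask = (df['roughness'] < 1) & (df['condition'] == 'good')
selected = df.loc[mask]
selected
```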

The same selection looks like this as a query:
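A sketch of such a query, again with a made-up threshold and with `id` assumed to be the index, compared against the equivalent Boolean mask:

```python
import pandas as pd

data = [[101, 2.13, 'corroded'],
        [102, 1.45, 'cracked'],
        [104, 4.11, 'pitted'],
        [107, 0.28, 'good'],
        [108, -0.08, 'good'],
        [109, 0.75, 'cracked']]
df = pd.DataFrame(data, columns=['id', 'roughness', 'condition']).set_index('id')

# Column names act as variables; string values need their own quotes.
queried = df.query("roughness < 1 and condition == 'good'")

# The equivalent selection with a Boolean mask, for comparison.
masked = df.loc[(df['roughness'] < 1) & (df['condition'] == 'good')]
```

The query string is usually easier to read, especially when several conditions are combined.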

If there's a space in the column name, delimit with backticks. But note that certain other characters will not work.
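A small sketch of the backtick syntax. The column names `sample id` and `surface finish` are hypothetical, invented here only because they contain spaces:

```python
import pandas as pd

# Hypothetical column names containing spaces.
df = pd.DataFrame({
    'sample id': [101, 102, 104],
    'surface finish': [2.13, 1.45, 4.11],
})

# Backticks delimit a column name that is not a valid Python identifier.
rough = df.query("`surface finish` > 2")
rough
```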

## `df.apply()`

Functional programming is a useful paradigm, but it takes a bit of getting used to.

The fundamental concepts are **map** (transform data items somehow)...

In [None]:
# Instead of a loop:
def sqrt(x):
    return x**0.5

data = [2, 3, 9]

list(map(sqrt, data))
# There are several ways to do this in Pandas.

...and **reduce** (summarize data items somehow, usually in a cumulative way):

In [None]:
# Aggregate a function over data:
def product(a, b):
    return a * b

from functools import reduce

reduce(product, data)
# This is called 'agg' in Pandas.

Now suppose we have this:

| Surface Roughness     | Description                      |
|-----------------------|----------------------------------|
| < 0.1 µm              | Pristine                         |
| 0.1 µm - 0.5 µm       | Very smooth                      |
| 0.5 µm - 1.6 µm       | Smooth                           |
| 1.6 µm - 3.2 µm       | Moderately rough                 |
| 3.2 µm - 6.3 µm       | Rough                            |
| ≥ 6.3 µm              | Very rough                       |

How can we fill a column with these descriptions?

<div style="background: #e0ffe0; border: solid 2px #d0f0d0; border-radius:3px; padding: 1em; color: darkgreen">

<h3>EXERCISE</h3>

Write a function that takes a roughness value and returns an appropriate description.
</div>

We cannot pass a Series to this function, but we can pass the function to the data via one of the functional methods:

- `map` can use a mapping (dict) or a function, and performs the operation element-wise on all of the data.
- `transform` takes a column and produces a column; it will process every column of a DataFrame.
- `agg` reduces a column to a scalar; it will process every column of a DataFrame.
- `apply` smartly tries to `transform` or `agg`, depending on the function.

It's definitely a bit confusing. But in general, I find I tend to use `map` on Series, and `apply` on DataFrames.

For our purpose here, we can use `map`, `transform`, or `apply`.
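Here is a sketch of all three on a Series. To avoid giving away the exercise, it uses a hypothetical two-class stand-in function, `coarse_label`, with a single made-up threshold at 1.6 µm; your own function from the exercise would slot in the same way:

```python
import pandas as pd

def coarse_label(ra):
    """Hypothetical stand-in for the exercise function: one threshold only."""
    return 'rough' if ra >= 1.6 else 'smooth'

roughness = pd.Series([2.13, 1.45, 4.11, 0.28, -0.08, 0.75], name='roughness')

# All three call the function once per element and return a new Series.
via_map = roughness.map(coarse_label)
via_apply = roughness.apply(coarse_label)
via_transform = roughness.transform(coarse_label)

# By contrast, agg reduces the whole Series to a single value.
largest = roughness.agg('max')
```

In the notebook, the result would typically be assigned to a new column, for example `df['description'] = df['roughness'].map(...)`, where the column name is your choice.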

## Save a snapshot

Before continuing, let's save the current state:

If anything goes wrong, you can read it again with:
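A minimal sketch of both steps as a round trip, assuming `id` is the index. A temporary directory is used here only so the example leaves no files behind; in the notebook you would write to a path of your choice, such as `./corrosion.csv`:

```python
import tempfile
from pathlib import Path

import pandas as pd

data = [[101, 2.13, 'corroded'],
        [102, 1.45, 'cracked'],
        [104, 4.11, 'pitted']]
df = pd.DataFrame(data, columns=['id', 'roughness', 'condition']).set_index('id')

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'corrosion.csv'

    # The index is written as the first column of the file.
    df.to_csv(path)

    # Tell read_csv which column to use as the index when reading back.
    restored = pd.read_csv(path, index_col='id')
```

Without `index_col`, the saved index would come back as an ordinary column and a new `RangeIndex` would be created.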

For large datasets you might want to use `pickle`, `feather` or HDF5 formats for better performance. But CSV files are highly portable and human readable, so they are nearly always the best choice for small datasets.

## Adding data

Add more data row-wise. This adds a new record, so it must have a value for every column.
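One way to do this, sketched here with a made-up record and with `id` assumed to be the index, is to assign to a new label with `.loc[]`:

```python
import pandas as pd

data = [[101, 2.13, 'corroded'],
        [102, 1.45, 'cracked'],
        [104, 4.11, 'pitted'],
        [107, 0.28, 'good'],
        [108, -0.08, 'good'],
        [109, 0.75, 'cracked']]
df = pd.DataFrame(data, columns=['id', 'roughness', 'condition']).set_index('id')

# Hypothetical new record: label 110 does not exist yet, so the
# DataFrame is enlarged. The values follow the column order.
df.loc[110] = [1.02, 'pitted']
df
```

For adding many rows at once, building a second DataFrame and using `pd.concat()` is usually a better fit.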

To add a new feature or attribute we want to add a new column — a `Series` or something that can be interpreted as such.

Add a new column with a "complete" list, array or series.

In [None]:
df['vessel'] = ['FPSO', 'FPSO', 'Handysize', 'Handysize', 'FPSO', 'Unknown']
df

Alternatively, you can broadcast a value or calculation.
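A sketch of both kinds of broadcast. The `source` value is made up, and the unit conversion assumes that `roughness` is in µm (1 µm is about 39.37 µin); neither is confirmed by the data itself:

```python
import pandas as pd

data = [[101, 2.13, 'corroded'],
        [102, 1.45, 'cracked'],
        [104, 4.11, 'pitted'],
        [107, 0.28, 'good'],
        [108, -0.08, 'good'],
        [109, 0.75, 'cracked']]
df = pd.DataFrame(data, columns=['id', 'roughness', 'condition']).set_index('id')

# A scalar is broadcast to every row (hypothetical value).
df['source'] = 'survey'

# A calculation on an existing column is applied element-wise.
# Assumption: roughness is in µm, so this converts to microinches.
df['Ra (µin)'] = df['roughness'] * 39.37
df
```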

By the way, now we can see how the backticks work with `df.query()`.

### Snapshot again

Check `df.shape`. It should be `(6, 5)`. 

If it is, let's save again. If not, fix it first.

<div style="background: #e0ffe0; border: solid 2px #d0f0d0; border-radius:3px; padding: 1em; color: darkgreen">


<h3>EXERCISE</h3>

* Add a new record (row) with a row index of 300 and your own name as the `source`. Make up the other values as needed.
* Add a new boolean Series (with `True` or `False` values) `is_fatigued`. If a record has a `condition` other than `good`, it should be `True`.
* Replace values of 'Unknown' with `np.nan`.
* Sort the DataFrame by its index.
* Create a subset of the current dataframe with only the `vessel`, `Ra (µin)` and `is_fatigued` columns, in that order; name it `dg`.
</div>

If everything looks good and `df` has shape (7, 6), let's save it again.

In [None]:
df.shape

In [None]:
df.to_csv('./corrosion.csv')

## `groupby`

It's often convenient to gather data according to categorical values, for example "by vessel" in this small dataset.
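As a minimal sketch using the `vessel` column added earlier, we can group by vessel and then reduce each group, here with the mean roughness and the number of records:

```python
import pandas as pd

data = [[101, 2.13, 'corroded'],
        [102, 1.45, 'cracked'],
        [104, 4.11, 'pitted'],
        [107, 0.28, 'good'],
        [108, -0.08, 'good'],
        [109, 0.75, 'cracked']]
df = pd.DataFrame(data, columns=['id', 'roughness', 'condition']).set_index('id')
df['vessel'] = ['FPSO', 'FPSO', 'Handysize', 'Handysize', 'FPSO', 'Unknown']

# Split the rows by vessel, then reduce each group to one number.
mean_by_vessel = df.groupby('vessel')['roughness'].mean()

# The number of records in each group.
count_by_vessel = df.groupby('vessel').size()
```

The result is a Series indexed by the group labels, so `mean_by_vessel['FPSO']` gives the mean for that vessel type.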

We can also iterate over the groups:
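A short sketch of that iteration: each step yields the group label and a DataFrame containing only that group's rows.

```python
import pandas as pd

data = [[101, 2.13, 'corroded'],
        [102, 1.45, 'cracked'],
        [104, 4.11, 'pitted'],
        [107, 0.28, 'good'],
        [108, -0.08, 'good'],
        [109, 0.75, 'cracked']]
df = pd.DataFrame(data, columns=['id', 'roughness', 'condition']).set_index('id')
df['vessel'] = ['FPSO', 'FPSO', 'Handysize', 'Handysize', 'FPSO', 'Unknown']

sizes = {}
for vessel, group in df.groupby('vessel'):
    # 'group' is an ordinary DataFrame holding one vessel's records.
    print(vessel, list(group.index))
    sizes[vessel] = len(group)
```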

## Styling

The rich displays in Jupyter can be customized a great deal, for example:

Check out the other examples at https://pandas.pydata.org/docs/user_guide/style.html

## Concatenating DataFrames

Sometimes more data comes along and we need to combine two or more DataFrames. We can use `pd.concat(dfs)` for that. I find I usually want to align the datasets as much as possible, in terms of column names etc, and then do an 'outer' join.
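To illustrate what the join does, here is a sketch with two tiny, hypothetical DataFrames that share only one column. An 'outer' join keeps the union of the columns and fills the gaps with `NaN`, while an 'inner' join keeps only the columns that both have:

```python
import pandas as pd

# Two hypothetical DataFrames with partly different columns.
a = pd.DataFrame({'vessel': ['FPSO', 'Handysize'], 'Ra': [84, 57]},
                 index=[101, 102])
b = pd.DataFrame({'vessel': ['ULCC'], 'is_fatigued': [True]},
                 index=[200])

# Outer join (the default): union of columns, missing values become NaN.
outer = pd.concat([a, b], join='outer')

# Inner join: only the columns common to both.
inner = pd.concat([a, b], join='inner')
```

This is why aligning the column names first matters: columns that differ only in their name end up as separate, half-empty columns in the outer join.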

Here's some new, and incomplete, data:

In [None]:
dh = pd.DataFrame({
 'vessel': {
    200: 'ULCC',
    205: 'ULCC',
    225: 'FPSO',
 },
 'Ra': {
    200: 40,
    205: 65,
    225: 5,
 },
 'is_fatigued': {
    200: False,
    205: True,
    225: False,
 },
})

dh

<div style="background: #e0ffe0; border: solid 2px #d0f0d0; border-radius:3px; padding: 1em; color: darkgreen">

<h3>EXERCISE</h3>

What do we need to do to align these two DataFrames?

1. Name the index **id**.
2. <u>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</u>
3. <u>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</u>
4. <u>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</u>
5. <u>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</u>

Do those things!
</div>

<div style="background: #e0f0ff; border: solid 2px #d0e0f0; border-radius:3px; padding: 1em; color: navy">

<h3>High performance Pandas</h3>

Note that for very large datasets, there are a few optional dependencies and settings you can use to speed up certain operations (big data, Boolean comparisons, lots of NaNs, etc).

Read more in the docs, eg:

- https://pandas.pydata.org/docs/user_guide/basics.html#accelerated-operations
- https://pandas.pydata.org/docs/getting_started/install.html#performance-dependencies-recommended
</div>

<hr />

<p style="color:gray">©2025 Matt Hall / Equinor. Licensed CC-BY. Remix and share!</p>