# Intro to `pandas` for analytics

We'll explore the Pandas package for simple data handling tasks using geoscience data examples. 

In [None]:
import pandas as pd

## What is Pandas?

Pandas introduces the concept of a `DataFrame` in Python. If you're familiar with R, it's pretty much the same idea! There is a useful cheat sheet [here](https://www.datacamp.com/community/blog/pandas-cheat-sheet-python#gs.59HV6BY).

The main purpose of Pandas is to allow easy manipulation of data in tabular form. Perhaps the most important idea that makes Pandas great for data science is that it always preserves **alignment** between data and labels.

Pandas can ingest data from a lot of different formats, but we can make a little DataFrame from a Python dictionary:

In [None]:
df = pd.DataFrame({
    'company': ['Equinor', 'Aramco', 'ConocoPhillips', 'Total Energies', 'Shell'],
    'ISO3166-1': ['NO', 'SA', 'US', 'FR', 'GB'],
    'mkt cap': [64.2, 1710, 127, 155, 220],    
})

df 

All DataFrames have at least one **index** (effectively the name of each row). This particular one has three **columns** (each one with a name) and five **rows**.

We can get at columns via their names:
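As a minimal sketch, here is column access on a cut-down, two-column copy of the `df` above (rebuilt here only so the cell runs on its own). A single name in square brackets returns a `Series`; a list of names returns a `DataFrame`.

```python
import pandas as pd

# Cut-down copy of the DataFrame above (two columns only).
df = pd.DataFrame({
    'company': ['Equinor', 'Aramco', 'ConocoPhillips', 'Total Energies', 'Shell'],
    'mkt cap': [64.2, 1710, 127, 155, 220],
})

companies = df['company']    # One name -> a Series
caps_only = df[['mkt cap']]  # A list of names -> a DataFrame

print(companies)
print(type(caps_only))
```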

And we can get convenient summaries of the data:
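For instance, `DataFrame.describe()` gives count, mean, spread and quartiles for the numeric columns. The sketch below uses a cut-down copy of the `df` above so that it runs on its own.

```python
import pandas as pd

# Cut-down copy of the DataFrame above (two columns only).
df = pd.DataFrame({
    'company': ['Equinor', 'Aramco', 'ConocoPhillips', 'Total Energies', 'Shell'],
    'mkt cap': [64.2, 1710, 127, 155, 220],
})

summary = df.describe()          # Numeric columns only, by default
mean_cap = df['mkt cap'].mean()  # Individual statistics are methods too

print(summary)
print(mean_cap)
```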

We'll look at DataFrames in depth, but first we need to get to know the `Series` object that represents each column of data.

## Meet the `Series`

The basic data structure in Pandas is `Series`, which represents a _column_ in a table or spreadsheet.

In terms of Python data structures, a `Series` is based on a 1-dimensional NumPy array, which is like a `list` with some superpowers.

In [None]:
s = pd.Series([15, 22, 30, 41, 56, 69, 70])
s

In [None]:
ss = pd.Series([10, 20, 30, 40, 50, 60, 70])  # Elementwise compare to another Series.
                                              # NOTE: Both Series must have identical index labels.
s == ss

We can cast to other data types:
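A short sketch of casting with `Series.astype()`, which returns a new `Series` and leaves the original alone. The variable names are only for illustration.

```python
import pandas as pd

s = pd.Series([15, 22, 30, 41, 56, 69, 70])

as_float = s.astype(float)  # Integers -> floats
as_text = s.astype(str)     # Integers -> strings

print(as_float)
print(as_text)
```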

There are lots of convenience functions, like `.plot()` for example (requires `matplotlib`):

<div style="background: #e0ffe0; border: solid 2px #d0f0d0; border-radius:3px; padding: 1em; color: darkgreen">

<h3>EXERCISE</h3>

- How can you access the fourth value?
- Change the first value to 11.
- What happens if you try to change it to `11.1` (i.e. a float)? Why?
- Make a horizontal bar plot of `s` instead of a line plot.
- Can you reverse the y-axis?
</div>

## The (row) index

One difference from arrays and lists is that a Pandas `Series` has an explicit `index`. By default this is a `RangeIndex`:

In [None]:
s.index

But the index can be anything: arbitrary values, timestamps, strings, whatever.

For example, they could be depths in a well (note that the last two values are out of order):

In [None]:
i = [1000, 1100, 1250, 1350, 1500, 1999, 1600]
s.index = i  # Explicitly assign the values in i as the index of our Series
s

The plot follows the row order; it does not sort by the index:

We can sort the index:
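A minimal sketch using `Series.sort_index()`, with the depth-indexed `s` rebuilt here so the cell is self-contained:

```python
import pandas as pd

s = pd.Series([15, 22, 30, 41, 56, 69, 70],
              index=[1000, 1100, 1250, 1350, 1500, 1999, 1600])

sorted_s = s.sort_index()  # Rows reordered by index label; values travel with them
print(sorted_s)
```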

This does **not** change the index in-place:
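To check this, call the method without keeping the result and then look at the original index. To keep the sorted version, assign it to a name (`s_sorted` is just an illustrative name).

```python
import pandas as pd

s = pd.Series([15, 22, 30, 41, 56, 69, 70],
              index=[1000, 1100, 1250, 1350, 1500, 1999, 1600])

s.sort_index()              # Result is discarded...
print(s.index.tolist())     # ...so the original order is unchanged

s_sorted = s.sort_index()   # Assign the result to keep it
print(s_sorted.index.tolist())
```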

<div style="background: #e0f0ff; border: solid 2px #d0e0f0; border-radius:3px; padding: 1em; color: navy">

Methods on `Series` return `Series` (similar to `string` and `array` methods, but unlike `list` methods).

Most of these methods have an `inplace` argument, but its use is not recommended (for example, [by the Ruff linter](https://docs.astral.sh/ruff/rules/pandas-use-of-inplace-argument/)).
</div>

Non-default indices do affect how 'native' indexing works: it is no longer positional...
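A small sketch of what this looks like with the depth index: square brackets now look up the label, so `s[1000]` works but `s[0]` fails because no row is labelled `0`.

```python
import pandas as pd

s = pd.Series([15, 22, 30, 41, 56, 69, 70],
              index=[1000, 1100, 1250, 1350, 1500, 1999, 1600])

print(s[1000])  # Looks up the label 1000, not position 1000

try:
    s[0]        # There is no label 0 in this index...
    zero_found = True
except KeyError:
    zero_found = False  # ...so pandas raises a KeyError

print(zero_found)
```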

But for some strange reason, slices do still work 🤔
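For example, with the same depth-indexed `s`, an integer slice in plain square brackets is still interpreted by position. This is the behaviour I would expect with an integer index, but it is worth checking in your own pandas version.

```python
import pandas as pd

s = pd.Series([15, 22, 30, 41, 56, 69, 70],
              index=[1000, 1100, 1250, 1350, 1500, 1999, 1600])

first_two = s[0:2]  # Positional slice: the first two rows, whatever their labels
print(first_two)
```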

Whatever, indexing is weird now, so if we want to use this explicit index, we need another way. That's where `.loc` comes in.

## `Series.loc[]`

The syntax might look odd at first because `.loc` is not a method, it's an indexable attribute.
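A minimal sketch of label-based lookup with `.loc`, using the depth-indexed `s` (rebuilt here so the cell runs on its own). Note the square brackets rather than parentheses.

```python
import pandas as pd

s = pd.Series([15, 22, 30, 41, 56, 69, 70],
              index=[1000, 1100, 1250, 1350, 1500, 1999, 1600])

at_1100 = s.loc[1100]  # The value whose index label is 1100
at_1999 = s.loc[1999]  # Labels are looked up regardless of row order

print(at_1100, at_1999)
```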

`Series.loc` is more flexible than this though. Like NumPy arrays, it supports two other features:

**Indexing with a collection**, e.g. a list of index labels, of arbitrary length:
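As a sketch, passing a list of labels to `.loc` returns a new `Series` with those rows, in the order requested:

```python
import pandas as pd

s = pd.Series([15, 22, 30, 41, 56, 69, 70],
              index=[1000, 1100, 1250, 1350, 1500, 1999, 1600])

picked = s.loc[[1600, 1000, 1250]]  # Note the double brackets: a list inside .loc[]
print(picked)
```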

**Indexing with a Boolean collection** _with the same length as the `Series` itself_. This is very useful. For example, let's apply a Boolean condition:
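A minimal sketch: the comparison `s > 40` produces a Boolean `Series` aligned with `s`, and `.loc` keeps only the rows where it is `True`.

```python
import pandas as pd

s = pd.Series([15, 22, 30, 41, 56, 69, 70],
              index=[1000, 1100, 1250, 1350, 1500, 1999, 1600])

mask = s > 40      # Boolean Series with the same index as s
big = s.loc[mask]  # Keep only the rows where the mask is True

print(mask)
print(big)
```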

You can also perform Boolean operations on the index:
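For example, a sketch that selects by depth rather than by value. Conditions can be combined with `&` (and) or `|` (or), with each condition wrapped in parentheses.

```python
import pandas as pd

s = pd.Series([15, 22, 30, 41, 56, 69, 70],
              index=[1000, 1100, 1250, 1350, 1500, 1999, 1600])

deep = s.loc[s.index > 1400]                             # Rows deeper than 1400 m
middle = s.loc[(s.index > 1100) & (s.index < 1600)]      # Rows between two depths

print(deep)
print(middle)
```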

<div style="background: #e0f0ff; border: solid 2px #d0e0f0; border-radius:3px; padding: 1em; color: navy">

Index values don't have to be unique — but it's a good idea for performance and for understandability if they are. What's more, non-unique indices are not compatible with certain representations, e.g. `.to_dict()` or `.to_json()`.
</div>

<div style="background: #e0ffe0; border: solid 2px #d0f0d0; border-radius:3px; padding: 1em; color: darkgreen">

<h3>EXERCISE</h3>

- How do you access the value at 1500 m?
- How do you access the value at **positional** index 5? (You might need to Google this.)
- What happens if you try to access a value at a non-existent depth?
- How many values does `s.loc[:1500]` return? Is this surprising?
- Change the values at 1000 m and 1350 m to `np.nan`.
- Use `s.fillna()` to replace the NaNs, but don't overwrite `s`.
- Use `Series.interpolate()` to replace the NaNs by interpolation.
</div>

In [None]:
# If you have not already imported numpy, you can do it here
import numpy as np

In [None]:
# If you wish to 'reset' the Series s for this exercise, in case you have overwritten
# or changed it, run this cell
s = pd.Series([11.1, 22.0, 30.0, 41.0, 56.0, 69.0, 70.0])
i = [1000, 1100, 1250, 1350, 1500, 1999, 1600]
s.index = i 
s

---
## More on column types

### Strings

We've seen that columns have types. The type affects how some operations happen:

In [None]:
s = pd.Series(['aaa.', 222., '333.', 'ddd.'])
s

There's a mixture of objects, so a mixture of things can happen when we process the column:
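As a sketch of what "a mixture of things" means, multiplying the column by 2 repeats the strings but doubles the float, because each element is handled according to its own Python type:

```python
import pandas as pd

s = pd.Series(['aaa.', 222., '333.', 'ddd.'])

doubled = s * 2  # str * 2 repeats the text; float * 2 is arithmetic
print(doubled)
```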

Pandas has two ways to store strings. From the [docs](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes):

> * `object` dtype, which can hold any Python object, including strings.
> * `StringDtype`, which is dedicated to strings.
> 
> Generally, we recommend using StringDtype. See [Text data types](https://pandas.pydata.org/docs/user_guide/text.html#text-types) for more.
>
> Finally, arbitrary objects may be stored using the object dtype, but should be avoided to the extent possible (for performance and interoperability with other libraries and methods. See [object conversion](https://pandas.pydata.org/docs/user_guide/basics.html#basics-object-conversion)).
> 
> A convenient dtypes attribute for DataFrame returns a Series with the data type of each column.

So let's cast our column to the dedicated string type:

In [None]:
s = s.astype('string')  # NB not `str`
s

We still must access the vectorized string operations via `Series.str` like so:
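A short sketch with two of the `.str` methods, using a string-typed `Series` like the one above (rebuilt here, with the float already written as text, so the cell runs on its own):

```python
import pandas as pd

s = pd.Series(['aaa.', '222.0', '333.', 'ddd.'], dtype='string')

upper = s.str.upper()  # Applies str.upper() to every element
lengths = s.str.len()  # Number of characters in each element

print(upper)
print(lengths)
```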

<div style="background: #e0ffe0; border: solid 2px #d0f0d0; border-radius:3px; padding: 1em; color: darkgreen">

<h3>EXERCISE</h3>

- Use the `split` string method on this column (don't overwrite it though).
- Can you use the `replace` string method to replace the dots with hyphens (the `-` character)?
</div>

In [None]:
s = pd.Series(['18.06.25', '22.06.25', '30.06.25'])
s

Let's convert the format to [an ISO 8601-compliant format](https://en.wikipedia.org/wiki/ISO_8601), e.g. YYYY-MM-DD:

In [None]:
# Regexplanation
#  \d+    Matches one or more (+) digits (\d).
#  \.     Matches a literal dot. (Dot means 'any char' otherwise.)
#  (...)  Defines a group to capture.
#
#  20     Literal '20'
#  \2     References captured group 2.
#  -      Literal '-'

s.str.replace(r'(\d+)\.(\d+)\.(\d+)', r'20\3-\2-\1', regex=True)

The result from the `replace` string method can also be achieved using the `replace` method directly on the `Series` (no `s.str` first). This requires some tweaking of the arguments though, and `s.replace` is usually intended for more general replacements of values in a column, not for substring replacements.

These are still strings though — it would be better to convert to the dedicated type for dates, Pandas `Timestamp`:

### Dates

We'll look at this in more detail later, but for now it's good to know that there are some special tricks for handling dates.

In [None]:
s

In [None]:
pd.to_datetime(s) 

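Depending on your Pandas version, that call may emit a warning because the day-first format is ambiguous. We can avoid the warning by being explicit about the format: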

In [None]:
d = pd.to_datetime(s, format='%d.%m.%y')
d

Now we can format any way we like... There are [lots of format codes!](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes)

In [None]:
d.dt.strftime('%d %B %Y')

There's also a `Timedelta` type for differences between datetimes:

In [None]:
d[1] - d[0]

As we'll see later, datetimes make good indices for time series data.

### Categories

It makes sense to use the [Pandas `categorical` type](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html) for categorical variables. These are good for a few cases:

> * A string variable consisting of only a few different values (for example lithologies in a well log). Converting such a string variable to a categorical variable will save some memory.
> * The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
> * As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).

In [None]:
s = pd.Series(list('aaaabbbccccccccddde'), dtype='category')
s

In [None]:
s.describe()
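The second point in the list above (logical versus lexical order) can be sketched with an ordered categorical. The example data and the names `sizes` and `ordered` are made up for illustration.

```python
import pandas as pd

sizes = pd.Series(['two', 'one', 'three', 'one'])

# Declare the categories in their logical order and mark them as ordered.
logical = pd.CategoricalDtype(categories=['one', 'two', 'three'], ordered=True)
ordered = sizes.astype(logical)

print(sizes.sort_values().tolist())    # Lexical (alphabetical) order
print(ordered.sort_values().tolist())  # Logical order of the categories
print(ordered.max())                   # Largest category, not last alphabetically
```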

### Money

Do not use floating point values for money. There are too many gotchas like this:

In [None]:
0.10 + 0.20 - 0.3

Safer to use `decimal.Decimal` or use integer values of 'cents' as your units (if you do not have to deal with arbitrary precision fractional conversions etc).
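The integer-cents option can be sketched in plain Python. The amounts are made up; the idea is to do all arithmetic on whole cents and only format as a decimal amount for display.

```python
# The same sum as above, in floats and in integer cents.
float_result = 0.10 + 0.20 - 0.30  # Leaves a tiny non-zero residue
cents_result = 10 + 20 - 30        # Integer arithmetic is exact

# Hypothetical prices of 19.99 and 5.01, stored as cents.
total_cents = 1999 + 501
units, cents = divmod(total_cents, 100)  # Split into whole units and cents
formatted = f"{units}.{cents:02d}"       # Format only for display

print(float_result == 0, cents_result == 0)
print(formatted)
```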

First, let's look at the `Decimal` type. The slightly weird thing is that we represent everything now as strings:

In [None]:
from decimal import Decimal

a, b, c = Decimal("0.1"), Decimal("0.2"), Decimal("0.3")
a + b - c

`Decimal` also keeps track of precision, so trailing zeros survive arithmetic:

In [None]:
d = Decimal("20.00")

d / 2

Imprecise things can still happen:

In [None]:
d / 3

In [None]:
from decimal import ROUND_DOWN

TWO_PLACES = Decimal('.01')

(d / 3).quantize(TWO_PLACES, rounding=ROUND_DOWN)

Or we can set a trap for inexact quantities:

In [None]:
from decimal import Inexact, Context

INEXACT = Context(traps=[Inexact])

(d / 3).quantize(TWO_PLACES, rounding=ROUND_DOWN, context=INEXACT)

Read more! https://docs.python.org/3/library/decimal.html#recipes

Anyway, Pandas does not have monetary types, but we can use `Decimal` in Pandas by ensuring that all the items in a column are of that type, for example:

In [None]:
import pandas as pd

s = pd.Series([3.6667, 54, 41.0111, 17.99])

s.apply(lambda x: Decimal(str(x)).quantize(TWO_PLACES))

<hr />

<p style="color:gray">©2025 Matt Hall / Equinor. Licensed CC-BY. Please share and re-use.</p>