# Intro to Pandas

Pandas is a library for working with tabular data. If your data would fit into a spreadsheet or CSV file, pandas is a good way to handle it. The library is large, with [extensive documentation](https://pandas.pydata.org/pandas-docs/stable/), so this notebook only scratches the surface. The aim is to help you automate some of the routine data processing tasks that you would normally do in Excel.

We will first import the libraries that we need. As is common in scientific Python, we will need `numpy` and `matplotlib`'s `pyplot`. In addition, we will import `pandas` and `seaborn` (the latter is another plotting library built on `matplotlib`).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Loading data

The easiest way to load data using `pandas` is with the relevant `read_*` function. There is a range of these for different data formats and sources, including `read_csv`, `read_sql`, and `read_clipboard`. For our dataset, we will use `read_excel`. They all work in roughly the same manner, but each accepts its own set of additional arguments.

In [None]:
df = pd.read_excel('./data/RPC_4_lithologies.xlsx')

This creates a DataFrame. A DataFrame holds a number of named Series, which are analogous to columns in a spreadsheet, and an Index, which holds the row labels.
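To see these pieces without the Excel file, here is a minimal sketch that reads a small table from an in-memory string with `read_csv`. The values and the name `demo` are invented for illustration and are not taken from the real dataset.

```python
import io
import pandas as pd

# Invented values standing in for the real spreadsheet.
csv_text = """RPC,Lithology,Vp,Vs
100,sandstone,3500,2000
104,shale,3000,1500
109,dolomite,5500,3000
"""

demo = pd.read_csv(io.StringIO(csv_text))

vp = demo['Vp']            # one column, returned as a Series
print(type(vp).__name__)   # Series
print(list(demo.columns))  # the column names
print(demo.index)          # the default index: positions 0, 1, 2
```

The other `read_*` functions return a DataFrame in the same way, so everything that follows applies regardless of the file format.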

In [None]:
df

The `RPC` column is a unique identifier, which makes it potentially suitable for use as an index, but it is unfortunately not sequential, so we will keep the default one instead.

## Inspecting data

Pandas offers a few useful ways to see what data we have available in a DataFrame:

In [None]:
df.head(8) # An int here will display that many rows.

In [None]:
df.tail()

Some statistical information for numerical fields can be found using `describe`:

In [None]:
df.describe()

Note that in many cases we get back a new DataFrame from a given function. This can be treated the same as any other DataFrame.
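As a small sketch of that point, the result of `describe` can itself be indexed like any other DataFrame. The values below are invented, and `.loc` (label-based selection) is covered later in this notebook.

```python
import pandas as pd

# Invented values; the column names mirror the dataset used here.
demo = pd.DataFrame({'Vp': [3000.0, 3500.0, 4000.0],
                     'Vs': [1500.0, 1800.0, 2100.0]})

summary = demo.describe()            # a DataFrame of summary statistics
mean_vp = summary.loc['mean', 'Vp']  # pick one statistic by row and column label
print(summary.loc[['mean', 'std']])  # or keep only some of the statistics
```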

## Selecting data

### Selecting by column:

In [None]:
df.Vs # equivalent to df['Vs']

In [None]:
df[['Vp', 'Vs', 'Lithology']] # pass a list of column names to select a subset of columns

### Selecting by row

When selecting by row, either the index or the position can be used.

The first option, `loc`, selects by index label. Note that the stop value is included, unlike standard Python slicing. It is also possible to use a time series as the index, in which case slices are given as dates or times rather than integers.

In [None]:
df.loc[:10]

Selecting by position with `iloc` has the same behaviour as standard Python slices.

In [None]:
df.iloc[:10]

In the case of this dataset, the two give very similar results, but `loc` can also be used with labels such as times or dates, rather than integer positions. `iloc` always uses the integer position of a row in the DataFrame, whatever its label.
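The difference is easier to see when the index labels do not match the positions. The following sketch uses an invented DataFrame whose index runs 10, 20, 30, 40.

```python
import pandas as pd

# Invented values with a non-default integer index, so labels and positions differ.
demo = pd.DataFrame({'Vp': [3000, 3200, 3400, 3600]}, index=[10, 20, 30, 40])

by_label = demo.loc[:20]       # labels up to and including 20 -> 2 rows
by_position = demo.iloc[:2]    # positions 0 and 1 -> 2 rows
by_label_3 = demo.loc[:3]      # no labels <= 3 in this sorted index -> 0 rows
by_position_3 = demo.iloc[:3]  # positions 0, 1, 2 -> 3 rows
```

With the default index the labels are the positions, which is why `df.loc[:10]` returns just one more row than `df.iloc[:10]`.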

Just like in numpy, boolean conditions can be used to select subsets of data:

In [None]:
df['Lithology'] == 'sandstone'

In [None]:
sandstones = df.loc[df['Lithology'] == 'sandstone']
sandstones
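Conditions can also be combined. This is a sketch on invented data; the variable names are hypothetical. Each comparison needs its own parentheses, and the element-wise operators `&`, `|` and `~` are used in place of `and`, `or` and `not`.

```python
import pandas as pd

# Invented values for illustration.
demo = pd.DataFrame({
    'Lithology': ['sandstone', 'shale', 'sandstone', 'dolomite'],
    'Vp': [3500, 3000, 4200, 5500],
})

# Two conditions joined with & (element-wise "and").
fast_sandstones = demo.loc[(demo['Lithology'] == 'sandstone') & (demo['Vp'] > 4000)]

# isin() is a compact way to test membership in several values at once.
dolomite_or_shale = demo.loc[demo['Lithology'].isin(['dolomite', 'shale'])]
```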

## Simple Plotting

The DataFrame has a built-in `plot` function, which can plot given Series.

In [None]:
df[['Vp', 'Vs']].plot()

The `kind` keyword can change the type of plot that is created.

In [None]:
sandstones['Vp'].plot(kind='hist', bins=25)

If desired, a given Series (or set of Series) can be plotted using standard `matplotlib.pyplot` functions as well.

In [None]:
fig, ax = plt.subplots()
ax.plot(df['Vp'], label='Vp')
ax.plot(df['Vs'], label='Vs')
ax.set_ylabel('Velocity [m/s]')
plt.legend()
plt.show()

In [None]:
fig, ax = plt.subplots()
_ = ax.hist(sandstones['Vp'], bins=25)

In [None]:
low_densities = df.loc[df['Rho'] <= 2000]
low_densities

In [None]:
set(df.Lithology)

In [None]:
fig, ax = plt.subplots()
ax.scatter(low_densities['Vp'], low_densities['Rho'])
ax.set_xlabel('Velocity [m/s]')
ax.set_ylabel('Density [kg/m3]')  # Rho is in kg/m3 (cf. the 2000 cutoff above)

## Removing null values

Notice that in the output from `df.describe`, the `Rho` and `Rho_n` columns have a lower count (752) than the remaining columns (800). This implies that there is missing data in those columns.

In [None]:
df.describe()

We can remove the rows containing missing data easily with `.dropna`. By default it drops rows (indices) with a NaN, but it can do it for columns too.

`inplace=True` gets us the same effect as `df = df.dropna()`. This option exists for a number of DataFrame methods. _Use it with caution_: it changes the original DataFrame.

In [None]:
df.dropna(inplace=True)
df.describe()
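The options mentioned above can be tried on a small example first. This sketch uses invented values and leaves the original DataFrame untouched by assigning each result to a new name.

```python
import numpy as np
import pandas as pd

# Invented values with one missing density.
demo = pd.DataFrame({
    'Vp': [3000.0, 3500.0, 4000.0],
    'Rho': [2300.0, np.nan, 2500.0],
})

rows_dropped = demo.dropna()                 # drops the row containing the NaN
cols_dropped = demo.dropna(axis=1)           # drops the 'Rho' column instead
subset_dropped = demo.dropna(subset=['Vp'])  # only looks for NaNs in 'Vp': nothing dropped
print(demo.isna().sum())                     # count of missing values per column
```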

Since we have dropped values, we will now get gaps in our index, at around 500 to 600.

In [None]:
plt.plot(df.index)

Gaps like this can make label-based slices behave unexpectedly, so we will reset the index to remove the gap:

In [None]:
df.reset_index(drop=True, inplace=True) # drop=True discards the old labels instead of keeping them as an 'index' column.

In [None]:
plt.plot(df.index)
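One detail worth checking when resetting: by default `reset_index` keeps the old labels in a new column named `index`, while `drop=True` discards them. A minimal sketch on invented data:

```python
import pandas as pd

# Invented values; dropping two rows leaves a gap in the index.
demo = pd.DataFrame({'Vp': [3000, 3200, 3400, 3600]})
demo = demo.drop(index=[1, 2])          # remaining labels: 0 and 3

kept = demo.reset_index()               # old labels kept in a new 'index' column
dropped = demo.reset_index(drop=True)   # old labels discarded

print(list(kept.columns))
print(list(dropped.columns))
```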

## Aggregation

A very useful tool in pandas is grouping by the values in a field. This uses the `groupby` method, followed by the function that you wish to compute for each group. Common options are `mean`, `median`, `sum`, `count`, `max`, and `min`.

In [None]:
df.groupby('Lithology').count()

In [None]:
df.groupby('Lithology').median(numeric_only=True) # numeric_only skips text columns, which have no median.

`groupby` is a very flexible, powerful tool. The [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) is extensive, and this demo will not go into it in detail. In this case, grouping by 'Lithology' seems natural, because we might expect the different lithologies to have different P- and S-wave velocities, along with a different Rho.

In [None]:
grouped = df.groupby(['Lithology'])[['Vs', 'Vp', 'Rho']]

We can now obtain some aggregate stats per group:

In [None]:
grouped.agg(['size', 'mean', 'median', 'std']).T # Function names as strings; .T transposes the table so it prints more compactly here.
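When different columns need different statistics, `agg` also accepts keyword arguments (named aggregation). The sketch below uses invented values, and the output column names `n`, `vp_median` and `rho_mean` are hypothetical.

```python
import pandas as pd

# Invented values for illustration.
demo = pd.DataFrame({
    'Lithology': ['sandstone', 'sandstone', 'shale', 'shale', 'shale'],
    'Vp': [3400.0, 3600.0, 2900.0, 3000.0, 3100.0],
    'Rho': [2300.0, 2400.0, 2450.0, 2500.0, 2550.0],
})

# Named aggregation: each keyword becomes an output column,
# built from a (source column, function name) pair.
stats = demo.groupby('Lithology').agg(
    n=('Vp', 'size'),
    vp_median=('Vp', 'median'),
    rho_mean=('Rho', 'mean'),
)
print(stats)
```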

We can also group by multiple columns. In this case we will get a count of `Vp` values when we group by `Lithology` and then `Description`.

In order to see how this works completely, we will temporarily overwrite the number of rows shown using a context manager. Notice that each record is grouped into a lithology and then a description, and we get the count of each group.

In [None]:
with pd.option_context('display.max_rows', None):
    print(df.groupby(['Lithology', 'Description'])['Vp'].count())

From the above, we can see that the limestone consistently has only one or two `Vp` values for each description, while the shales are more variable. The sandstones have fewer different descriptions, but some of those have many `Vp` values associated with them. The dolomites have fewer descriptions again, but all have at least 13 `Vp` values. This may affect what sort of statistics we can derive from this dataset.
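The result of a multi-column `groupby` is indexed by both grouping columns, so individual groups can be pulled out with `.loc`. A minimal sketch, using invented lithologies and descriptions rather than the real ones:

```python
import pandas as pd

# Invented values for illustration.
demo = pd.DataFrame({
    'Lithology': ['sandstone', 'sandstone', 'sandstone', 'shale'],
    'Description': ['clean', 'clean', 'silty', 'organic'],
    'Vp': [3400, 3600, 3300, 2900],
})

counts = demo.groupby(['Lithology', 'Description'])['Vp'].count()

sandstone_counts = counts.loc['sandstone']        # all descriptions within one lithology
clean_count = counts.loc[('sandstone', 'clean')]  # one (lithology, description) pair
print(counts)
```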

We can also plot the median `Vp`, `Vs` and `Rho` for each lithology.

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
grouped.agg('median').T.plot(marker='o', lw=0, ax=ax)
#grouped.agg('min').T.plot(marker='*', lw=0, ax=ax)
#grouped.agg('max').T.plot(marker='+', lw=0, ax=ax)

The dolomites have the highest median values for both `Vp` and `Vs`, with limestone notably lower. Shale and sandstone are between these, and are quite similar in value. The `Rho` has less scatter.

## Adding data

Recall in the _Intro to Functions_ notebook we created a function to calculate acoustic impedance, given a rho and Vp. We can use this to create a new `impedance` Series.

In [None]:
def impedance(rho, vp):
    """
    Calculate acoustic impedance from Rho and Vp.

    args:
        rho: [float] density
        vp: [float] p-wave velocity

    returns:
        z: [float] acoustic impedance
    """
    z = rho * vp
    return z

In order to add it to the DataFrame, we use a similar approach as with dictionaries, where we assign a specific column the values. If the column does not exist, it will be created for us.

In [None]:
impedance(df['Rho'], df['Vp'])

In [None]:
df['Impedance'] = impedance(df['Rho'], df['Vp'])
df
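The same pattern works for any element-wise calculation. In this sketch (invented values), arithmetic on two Series produces a new Series, and assigning to an existing column name overwrites it.

```python
import pandas as pd

# Invented densities (kg/m3) and velocities (m/s).
demo = pd.DataFrame({'Rho': [2300.0, 2500.0], 'Vp': [3000.0, 4000.0]})

# Arithmetic on Series is element-wise, so no loop is needed.
demo['Impedance'] = demo['Rho'] * demo['Vp']

# Assigning to an existing column name overwrites it (here: rescale by 1e6).
demo['Impedance'] = demo['Impedance'] / 1e6
print(demo)
```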

We could also see how different the result of the Rho calculated by Gardner's equation ( $ \rho = 310\ V_\mathrm{P}^{\,0.25}\ \ \mathrm{kg}/\mathrm{m}^3 $ ) is from the measured Rho.

In [None]:
def gardner(vp, alpha=310, beta=0.25):
    '''
    Calculate Gardner's equation, given a Vp. Alpha and beta are optional.
    
    Args:
        vp: [float] p-wave velocity
        alpha: [float] multiplicative coefficient (default 310)
        beta: [float] exponent applied to vp (default 0.25)
        
    Returns:
        rho: [float] density
    '''
    rho = alpha * vp**beta
    return rho

In [None]:
df['Rho_gardner'] = gardner(df['Vp'])
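As a quick sanity check of the equation, here is a sketch with velocities chosen so that the fourth root is a whole number (2401 = 7⁴ and 4096 = 8⁴). The function is repeated so the block runs on its own, and the velocities are invented.

```python
import pandas as pd

def gardner(vp, alpha=310, beta=0.25):
    """Gardner's relation: density from P-wave velocity."""
    return alpha * vp**beta

# Invented velocities in m/s with exact fourth roots of 7 and 8.
vp = pd.Series([2401.0, 4096.0])
rho = gardner(vp)   # approximately 310 * 7 = 2170 and 310 * 8 = 2480 kg/m3
print(rho)
```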

We can see how far off the Gardner equation is by plotting the difference from the measured values, and then saving that error as a new column in the DataFrame.

In [None]:
sns.histplot(df.Rho - df.Rho_gardner, kde=True) # histplot replaces the deprecated distplot.

In [None]:
df['Gardner_error'] = df.Rho - df.Rho_gardner

The `Rho` and `Rho_n` are very similar, so we will remove `Rho_n`.

In [None]:
sns.histplot(df.Rho - df.Rho_n, kde=True) # histplot replaces the deprecated distplot.

In [None]:
fig, ax = plt.subplots()
ax.hist(df['Rho'], bins=50, alpha=0.7, label='rho')
ax.hist(df['Rho_gardner'], bins=50, alpha=0.7, label='Gardner rho')
plt.legend()

In [None]:
df.drop(['Rho_n'], axis=1, inplace=True) # axis=1 means that we want to drop columns.
df

## Applying functions per row

Sometimes we may have a function that requires input per row. An example might be where the lithology affects the calculation that we want to use by means of an optional argument.

We will change the parameters of Gardner's equation by the lithology of the sample. This requires a function that will work on the row:

In [None]:
def variable_gardner(row):
    if row['Lithology'] == 'dolomite':
        alpha, beta = 250, 0.28
    elif row['Lithology'] == 'limestone':
        alpha, beta = 250, 0.28
    elif row['Lithology'] == 'shale':
        alpha, beta = 350, 0.25
    elif row['Lithology'] == 'sandstone':
        alpha, beta = 380, 0.23
    else:
        alpha, beta = 310, 0.25
    return gardner(row['Vp'], alpha, beta)

With this function, we can work through the DataFrame row-wise, and `apply` the function on each row. The resulting Series can be added to `df` in the normal way.

In [None]:
df['Rho_v_gardner'] = df.apply(variable_gardner, axis=1)
df['VGardner_error'] = df.Rho - df.Rho_v_gardner
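Row-wise `apply` is flexible but can be slow on large tables. One possible alternative, sketched here on invented data, is to look up the parameters for each lithology with `map` and then do the calculation on whole columns. The names `PARAMS` and `DEFAULT` are hypothetical; the parameter values are copied from `variable_gardner` above.

```python
import pandas as pd

def gardner(vp, alpha=310, beta=0.25):
    """Gardner's relation: density from P-wave velocity."""
    return alpha * vp**beta

# Same (alpha, beta) pairs as in variable_gardner.
PARAMS = {
    'dolomite': (250, 0.28),
    'limestone': (250, 0.28),
    'shale': (350, 0.25),
    'sandstone': (380, 0.23),
}
DEFAULT = (310, 0.25)

# Invented values; 'granite' is not in PARAMS and exercises the default.
demo = pd.DataFrame({
    'Lithology': ['dolomite', 'shale', 'sandstone', 'granite'],
    'Vp': [5500.0, 3000.0, 3500.0, 6000.0],
})

# Look up each row's parameters, falling back to the default for unknown lithologies.
alpha = demo['Lithology'].map(lambda lith: PARAMS.get(lith, DEFAULT)[0])
beta = demo['Lithology'].map(lambda lith: PARAMS.get(lith, DEFAULT)[1])

# Series arithmetic is element-wise, so each row uses its own alpha and beta.
demo['Rho_v_gardner'] = gardner(demo['Vp'], alpha, beta)
print(demo)
```

Keeping the parameters in a dictionary also makes them easier to adjust than a chain of `if`/`elif` branches.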

In [None]:
fig, ax = plt.subplots()
ax.hist(df['Rho'], bins=50, alpha=0.7, label='rho')
ax.hist(df['Rho_gardner'], bins=50, alpha=0.7, label='Gardner rho')
ax.hist(df['Rho_v_gardner'], bins=50, alpha=0.7, label='Variable Gardner rho')
plt.legend()

With enough knowledge of the different sensible ranges for `alpha` and `beta` for a given lithology, we can improve the fit of the `variable_gardner` results for each lithology. Currently we are overestimating our rho fairly noticeably.

## Plotting with Seaborn

Seaborn is a nice wrapper around Matplotlib with a focus on statistical plots. It makes some things much simpler than in standard Matplotlib. We can start by selecting some data that we are interested in from the available columns.

In [None]:
df.columns

In [None]:
to_plot = ['Rho', 'Rho_gardner', 'Rho_v_gardner', 'Vp', 'VGardner_error']
g = sns.PairGrid(df, hue='Lithology', vars=to_plot, diag_sharey=False)
g.map_lower(sns.scatterplot, alpha=0.4)
g.map_upper(sns.kdeplot, alpha=0.4)
g.map_diag(sns.kdeplot)
g.add_legend()

This should now give us a better handle on the reasonable ranges in which we can expect our densities and velocities to vary based on lithology.

## Writing files

Since we have processed our data (by adding new calculated values), we should write these changes out to a file. This is straightforward using one of the `.to_*` methods. Common ones for storing data for future use are `.to_csv`, `.to_excel`, and `.to_hdf`. It is also possible to write to SQL databases or to convert to other in-memory formats such as a dict or an xarray Dataset.

In [None]:
df.to_xarray()

When writing to Excel or csv, the index will be added as a column. Should you not need that (for this example they are simply ascending numbers), then use `index=False` in the call to your `to_*` function.

In [None]:
df.to_excel('./data/edited_RPC_4_lithologies.xlsx', sheet_name='lithologies', index=False)
df.to_csv('./data/edited_RPC_4_lithologies.csv', index=False)
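To see the effect of `index=False` without touching the disk, this sketch writes an invented DataFrame to a string (which `to_csv` returns when no path is given) and reads it back.

```python
import io
import pandas as pd

# Invented values for illustration.
demo = pd.DataFrame({'Lithology': ['shale', 'dolomite'], 'Vp': [3000, 5500]})

with_index = demo.to_csv()                # first column holds the index
without_index = demo.to_csv(index=False)  # data columns only

print(with_index.splitlines()[0])     # ,Lithology,Vp
print(without_index.splitlines()[0])  # Lithology,Vp

# Reading the index-free text back recovers the same columns and values.
round_trip = pd.read_csv(io.StringIO(without_index))
```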

We can also use a context manager to append to an existing Excel file, or to write to multiple sheets within it:

In [None]:
with pd.ExcelWriter('./data/edited_RPC_4_lithologies.xlsx', mode='a') as writer:
    df.to_excel(writer, sheet_name='processed RPC4', index=False)

<hr />
<img src="https://avatars1.githubusercontent.com/u/1692321?v=3&s=200" style="float:center" width="40px" />
<p><center>© 2021 <a href="http://www.agilegeoscience.com/">Agile Geoscience</a> — <a href="https://creativecommons.org/licenses/by/4.0/">CC-BY</a></center></p>