# Intro to `pandas` for machine learning

We'll explore the Pandas package for simple data handling tasks using geoscience data examples. 

Pandas introduces the concept of a `DataFrame` in Python. If you're familiar with R, it's pretty much the same idea! There's a useful cheat sheet [here](https://www.datacamp.com/community/blog/pandas-cheat-sheet-python#gs.59HV6BY).

The main purpose of Pandas is to allow easy manipulation of data in tabular form. Perhaps the most important idea that makes Pandas great for data science is that it always preserves **alignment** between data and labels.

In [1]:
import pandas as pd
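
To make the alignment idea concrete, here is a minimal sketch using two made-up Series. The names `vp` and `rho` and all the values are illustrative assumptions, not part of the dataset used later. The two Series hold the same labels in a different order, and the arithmetic still pairs the right values together.

```python
import pandas as pd

# Illustrative values only: velocity and density keyed by rock type,
# deliberately stored in a different order.
vp = pd.Series([2.13, 3.45, 2.45], index=['sandstone', 'limestone', 'shale'])
rho = pd.Series([2.30, 2.71, 2.55], index=['shale', 'sandstone', 'limestone'])

# Arithmetic pairs values by label, not by position.
impedance = vp * rho
print(impedance)
```

With plain lists or NumPy arrays, the same multiplication would silently combine the sandstone velocity with the shale density.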

The most common data structure in Pandas is the `DataFrame`, which is a 2D structure that can hold various types of Python objects indexed by an `index` array (or multiple `index` arrays). Columns are usually labelled as well using strings. A column on its own is a different data type, called a `Series`.

An easy way to think about a `DataFrame` is if you imagine it as an Excel spreadsheet.

Let's define one using a small dataset:

In [2]:
data =  [[2.13, 'sandstone'],
         [3.45, 'limestone'],
         [2.45, 'shale']]
data

[[2.13, 'sandstone'], [3.45, 'limestone'], [2.45, 'shale']]

Make a `DataFrame` from `data`:

In [3]:
df = pd.DataFrame(data, columns=['velocity', 'lithology'])
df

,velocity,lithology
0,2.13,sandstone
1,3.45,limestone
2,2.45,shale


Accessing the data is a bit more complex than with NumPy arrays, but for good reasons.

In [4]:
df['lithology']

0    sandstone
1    limestone
2        shale
Name: lithology, dtype: object
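
Since a single column is a `Series` rather than a `DataFrame`, it can help to check which one we get back. The sketch below rebuilds the small example so that it runs on its own; the names `one_column` and `sub_frame` are just for illustration.

```python
import pandas as pd

df = pd.DataFrame([[2.13, 'sandstone'], [3.45, 'limestone'], [2.45, 'shale']],
                  columns=['velocity', 'lithology'])

one_column = df['lithology']      # a single label returns a Series
sub_frame = df[['lithology']]     # a list of labels returns a DataFrame

print(type(one_column).__name__)  # Series
print(type(sub_frame).__name__)   # DataFrame
```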

In [5]:
df[0]  # Can't get at rows like this.

KeyError: 0

In [6]:
df.loc[0:1]  # Inclusive slice, unlike anything else in Python or NumPy

,velocity,lithology
0,2.13,sandstone
1,3.45,limestone


In [7]:
df['id'] = [101, 102, 103]
df = df.set_index('id')
df.head()

id,velocity,lithology
101,2.13,sandstone
102,3.45,limestone
103,2.45,shale


In [8]:
df.loc[0]

KeyError: 0

In [9]:
# Skip this unless folks ask for it.
df.iloc[0:1]  # Works like NumPy, rarely needed

id,velocity,lithology
101,2.13,sandstone
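
To see the difference between label-based and position-based slicing side by side, here is a small self-contained sketch. It rebuilds the example with the `id` index; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'velocity': [2.13, 3.45, 2.45],
                   'lithology': ['sandstone', 'limestone', 'shale']},
                  index=pd.Index([101, 102, 103], name='id'))

by_label = df.loc[101:102]   # labels 101 and 102: the end label is included
by_position = df.iloc[0:1]   # position 0 only: the end position is excluded

print(len(by_label), len(by_position))  # 2 1
```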


In [10]:
df.loc[df['velocity'] < 3]  # Loc provides *selectors* like NumPy arrays.

id,velocity,lithology
101,2.13,sandstone
103,2.45,shale


In [11]:
df.loc[df['velocity'] < 3, 'lithology']  # Both rows and columns.


id
101    sandstone
103        shale
Name: lithology, dtype: object
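
The selector passed to `loc` is just a boolean Series, so we can build it separately and combine several of them. This is a sketch on the same small example; the mask names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'velocity': [2.13, 3.45, 2.45],
                   'lithology': ['sandstone', 'limestone', 'shale']},
                  index=pd.Index([101, 102, 103], name='id'))

slow = df['velocity'] < 3                                # one True/False per row
clastic = df['lithology'].isin(['sandstone', 'shale'])   # membership test

# Combine masks with & (and), | (or) and ~ (not).
# Comparisons written inline need their own parentheses.
slow_clastics = df.loc[slow & clastic]
fast_or_shale = df.loc[(df['velocity'] > 3) | (df['lithology'] == 'shale'), 'lithology']

print(slow.tolist())  # [True, False, True]
```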

## Adding data

Add more data (row-wise):

In [12]:
df.loc[99] = [3.5, 'dolomite']

In [13]:
df.loc[3] = [2.6, 'shale']
df

id,velocity,lithology
101,2.13,sandstone
102,3.45,limestone
103,2.45,shale
99,3.5,dolomite
3,2.6,shale
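
Assigning through `loc` is convenient for one record at a time. When there are several records to add, one alternative is to collect them in a second DataFrame and join the two with `pd.concat`. This is a sketch of that approach, not the method used in the cells above; `new_rows` and `combined` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'velocity': [2.13, 3.45, 2.45],
                   'lithology': ['sandstone', 'limestone', 'shale']},
                  index=pd.Index([101, 102, 103], name='id'))

# The new records, with the same columns and their own id labels.
new_rows = pd.DataFrame({'velocity': [3.5, 2.6],
                         'lithology': ['dolomite', 'shale']},
                        index=pd.Index([99, 3], name='id'))

# Stack the two frames vertically; columns are matched by name.
combined = pd.concat([df, new_rows])
print(combined)
```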


Add a new column with a "complete" list, array or series. Alternatively, you can broadcast a value or calculation.

In [14]:
df['new_column'] = ['x', 'y', 'z', 'a', 'b']
df

id,velocity,lithology,new_column
101,2.13,sandstone,x
102,3.45,limestone,y
103,2.45,shale,z
99,3.5,dolomite,a
3,2.6,shale,b


In [15]:
df['source'] = 'Agile'
df['velocity_ms'] = df['velocity'] * 1000
df

id,velocity,lithology,new_column,source,velocity_ms
101,2.13,sandstone,x,Agile,2130.0
102,3.45,limestone,y,Agile,3450.0
103,2.45,shale,z,Agile,2450.0
99,3.5,dolomite,a,Agile,3500.0
3,2.6,shale,b,Agile,2600.0
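
To illustrate why a new column needs a "complete" list: a list of the wrong length is rejected, whereas a Series is aligned on the index and any missing labels are filled with `NaN`. This is a small sketch with made-up column names (`short`, `partial`).

```python
import pandas as pd

df = pd.DataFrame({'velocity': [2.13, 3.45, 2.45]},
                  index=pd.Index([101, 102, 103], name='id'))

# A list must have exactly one value per row, otherwise pandas raises.
message = None
try:
    df['short'] = ['x', 'y']
except ValueError as err:
    message = str(err)
print(message)

# A Series is matched on its index instead; label 102 has no value here.
df['partial'] = pd.Series({101: 'x', 103: 'z'})
print(df)
```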



## Exercise

* Add a new column, `ms`, with `True` or `False` if the `velocity` is in m/s.
* Add a new record with `id` 87, `velocity_ms` of 2225 and a `lithology` of sandstone. Use your own name as the `source`.
* Create a subset of the current dataframe named `df2`. We want only the `velocity_ms`, `lithology` and `source` columns.
* Change the `velocity` of only our shales in `df` to be 1000 times their current values.

In [None]:
df['ms'] = df['velocity'] > 10
df.loc[87] = [2.225, 'sandstone', 'm', 'Agile', 2225, False]
df2 = df[['velocity_ms', 'lithology', 'source']]

In [18]:
df.loc[df.lithology=='shale', 'velocity'] = df.velocity * 1000
df

id,velocity,lithology,new_column,source,velocity_ms
101,2.13,sandstone,x,Agile,2130.0
102,3.45,limestone,y,Agile,3450.0
103,2450.0,shale,z,Agile,2450.0
99,3.5,dolomite,a,Agile,3500.0
3,2600.0,shale,b,Agile,2600.0
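
The assignment above works even though the right-hand side covers every row, because pandas aligns it on the index and uses only the values for the selected rows. An equivalent sketch, shown here on a cut-down copy of the data, scales the selected cells in place; `is_shale` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'velocity': [2.13, 3.45, 2.45],
                   'lithology': ['sandstone', 'limestone', 'shale']},
                  index=pd.Index([101, 102, 103], name='id'))

is_shale = df['lithology'] == 'shale'

# Multiply only the velocity cells of the shale rows.
df.loc[is_shale, 'velocity'] *= 1000
print(df)
```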


## Column types

It's especially worth checking that your numbers are actually numbers (e.g. floats or ints), because they look the same as strings in the various renderings of DataFrames and Series.

In [16]:
s = pd.Series([23, '23'])
s

0    23
1    23
dtype: object

In [17]:
s * 2

0      46
1    2323
dtype: object
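
One way to repair a column like this is `pd.to_numeric`, which converts number-like strings to real numbers. Here is a minimal sketch; the second Series, with its `'n/a'` entry, is a made-up example of a value that cannot be converted.

```python
import pandas as pd

s = pd.Series([23, '23'])
numeric = pd.to_numeric(s)        # both entries become integers
print((numeric * 2).tolist())     # [46, 46]

# Entries that cannot be parsed can be turned into NaN instead of raising.
messy = pd.Series(['2.5', 'n/a'])
coerced = pd.to_numeric(messy, errors='coerce')
print(coerced)
```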

Our DataFrame is not too bad already...

In [18]:
df.dtypes

velocity       float64
lithology       object
new_column      object
source          object
velocity_ms    float64
dtype: object

#### Categories

It makes sense to use the [Pandas `categorical` type](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html) for categorical variables. Let's do that:

In [19]:
cols = ['lithology', 'new_column']
df[cols] = df[cols].apply(pd.Categorical)

df.dtypes

velocity        float64
lithology      category
new_column     category
source           object
velocity_ms     float64
dtype: object

We can now use those to do some useful things, for example:

In [20]:
df['lithology'].describe()

count         5
unique        4
top       shale
freq          2
Name: lithology, dtype: object

They can also be useful for larger datasets with many repeated values, where they can potentially speed up processing and use less memory.
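
As a rough sketch of what a categorical column stores, and of the memory saving for repeated values: each entry is held as a small integer code that points into a list of unique categories. The 100,000-row Series below is synthetic, and the size of the saving will vary with the data.

```python
import pandas as pd

lith = pd.Series(['sandstone', 'limestone', 'shale', 'dolomite', 'shale'],
                 dtype='category')

print(list(lith.cat.categories))  # ['dolomite', 'limestone', 'sandstone', 'shale']
print(lith.cat.codes.tolist())    # [2, 1, 3, 0, 3]

# Synthetic column with many repeats, to compare memory use.
big = pd.Series(['sandstone', 'shale'] * 50_000)
as_text = big.memory_usage(deep=True)
as_category = big.astype('category').memory_usage(deep=True)
print(as_text, as_category)
```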


#### Strings
Although strings are added as `object` by default, there is a new-ish (in v1.0) data type for text data, `pd.StringDtype()`.

In [20]:
density = [
    'Medium',
    'Low',
    'Med',
    3214,
    'High',
]

First we'll try accepting the default `object` data type:

In [21]:
s = pd.Series(density)  # No dtype given, so we get the default object type

In [22]:
s.str.isdigit()

0    False
1    False
2    False
3      NaN
4    False
dtype: object

In [23]:
s.str.replace('Med$', 'Medium', regex=True)

0    Medium
1       Low
2    Medium
3       NaN
4      High
dtype: object

Note that if we wanted to add this list to the DataFrame as a column via a Series, its default index (0 to 4) would not match the DataFrame index, so we would have to pass `index=df.index` explicitly to the Series constructor.

If we use a string type instead, then some things will work more smoothly.

In [24]:
s = pd.Series(density, dtype='string')  # Or pd.StringDtype()

In [25]:
s.dtype

string[python]

Using this does not, however, give us direct access to string methods — e.g. you still need `str` here — but at least things work more consistently:

In [26]:
s.str.replace('Med$', 'Medium', regex=True)

0    Medium
1       Low
2    Medium
3      3214
4      High
dtype: string

With the string type, methods that returned `NaN` for the mixed-content `object` Series now give a result for every element:

In [27]:
s.str.isdigit()

0    False
1    False
2    False
3     True
4    False
dtype: boolean
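
Because the result is a clean boolean Series, we can use it directly as a selector to find the suspicious entries. In this sketch every item is converted with `str()` before building the Series; that explicit conversion is an assumption made here so the example does not depend on how a given pandas version treats non-string items.

```python
import pandas as pd

density = ['Medium', 'Low', 'Med', 3214, 'High']

# Convert each entry to text first, then use the dedicated string type.
s = pd.Series([str(item) for item in density], dtype='string')

is_number = s.str.isdigit()   # True where the text is all digits
suspect = s[is_number]        # the entries that look like numbers
cleaned = s.str.replace('Med$', 'Medium', regex=True)

print(suspect.tolist())       # ['3214']
```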

## Reading and writing files

Pandas also reads files from disk in tabular form (see [this list](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) of all the formats that it can read and write). A very common one is Excel's `.xlsx` format, so let's load one!

The data is the same as used in this study: http://www.kgs.ku.edu/PRS/publication/2003/ofr2003-30/index.html

From that poster:

> The Panoma Field (2.9 TCF gas) produces from Permian Council Grove Group marine carbonates and nonmarine silicilastics in the Hugoton embayment of the Anadarko Basin. It and the Hugoton Field, which has produced from the Chase Group since 1928, the top of which is 300 feet shallower have combined to produce 27 TCF gas, making it the largest gas producing area in North America. Both fields are stratigraphic traps with their updip west and northwest limits nearly coincident. Maximum recoveries in the Panoma are attained west of center of the field. Deeper production includes oil and gas from Pennsylvanian Lansing-Kansas City, Marmaton, and Morrow and the Mississippian.

For Excel files, we can load specific sheets by passing the `sheet_name` argument:

In [37]:
df = pd.read_excel('https://geocomp.s3.amazonaws.com/data/Panoma_Field_Permian-RAW.xlsx', sheet_name='data')
df.head()

,Well Name,Depth,Formation,RelPos,Marine,GR,ILD,DeltaPHI,PHIND,PE,Facies,Completion Date,Index,Source
0,SHRIMPLIN,851.3064,A1 SH,1.0,1,77.45,4.613176,9.9,11.915,4.6,3.0,2010-03-26,63,KGS
1,SHRIMPLIN,851.4588,A1 SH,0.979,1,78.26,4.581419,14.2,12.565,4.1,3.0,2010-03-26,64,KGS
2,SHRIMPLIN,851.6112,A1 SH,0.957,1,79.05,4.549881,14.8,13.05,3.6,3.0,2010-03-26,65,KGS
3,SHRIMPLIN,851.7636,A1 SH,0.936,1,86.1,4.518559,13.9,13.115,3.5,3.0,2010-03-26,66,KGS
4,SHRIMPLIN,851.916,A1 SH,0.915,1,74.58,4.436086,13.5,13.3,3.4,3.0,2010-03-26,67,KGS


Without it, we get the first sheet, which in this case is not the data that we want, but it may still be useful:

In [22]:
pd.read_excel('https://geocomp.s3.amazonaws.com/data/Panoma_Field_Permian-RAW.xlsx')

,Unnamed: 0,0
0,Index,Index for sorting records
1,Well Name,Name of the well that the record is from
2,Depth,Depth below surface
3,Formation,Which formation the record is from. See accomp...
4,RelPos,Position of the record relative to a known depth
5,Marine,Whether a record is of a marine rock or not
6,GR,Gamma Ray measurements
7,DeltaPHI,
8,ILD,
9,PHIND,Nuclear Density for porosity


Other formats are usually loaded in a similar way, using the `pd.read_*` pattern: `pd.read_csv`, `pd.read_json` and so on.
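
As a quick illustration of the `pd.read_*` pattern without needing a file on disk, `pd.read_csv` also accepts a file-like object. The sketch below wraps a short piece of CSV text, typed in by hand in the style of the table above, in `io.StringIO`; `csv_text` and `logs` are illustrative names.

```python
import io

import pandas as pd

# A few hand-typed records standing in for a CSV file.
csv_text = (
    "Well Name,Depth,GR\n"
    "SHRIMPLIN,851.3064,77.45\n"
    "SHRIMPLIN,851.4588,78.26\n"
)

logs = pd.read_csv(io.StringIO(csv_text))
print(logs.dtypes)
```

The numeric columns are parsed as floats automatically, which is worth confirming with `dtypes` after any load.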


## Exercise

* Select only the `GR`, `PHIND` and `PE` columns.
* Select only the records for the `LUKE G U` well. How many rows are there?
* Select the `Formation` and `GR` columns only for the `LUKE G U` well.
* Select only the records for the `LUKE G U` well below 850 m. How many rows are there?
* Select the `GR` column for the `LUKE G U` well below 850 m. What is the maximum value?

In [38]:
df[['GR', 'PHIND', 'PE']]

,GR,PHIND,PE
0,77.450,11.915,4.600
1,78.260,12.565,4.100
2,79.050,13.050,3.600
3,86.100,13.115,3.500
4,74.580,13.300,3.400
...,...,...,...
4894,86.078,16.150,3.161
4895,88.855,16.750,3.118
4896,90.490,16.780,3.168
4897,90.975,16.995,3.154


In [39]:
df.loc[df['Well Name'] == 'LUKE G U']

,Well Name,Depth,Formation,RelPos,Marine,GR,ILD,DeltaPHI,PHIND,PE,Facies,Completion Date,Index,Source
1386,LUKE G U,795.6804,A1 SH,1.000,1,74.90,6.053409,9.3,11.75,4.084,3.0,2012-10-01,1449,KGS
1387,LUKE G U,795.8328,A1 SH,0.981,1,83.80,5.559043,12.0,13.10,3.501,3.0,2012-10-01,1450,KGS
1388,LUKE G U,795.9852,A1 SH,0.962,1,86.97,5.321083,12.9,12.55,3.400,3.0,2012-10-01,1451,KGS
1389,LUKE G U,796.1376,A1 SH,0.943,1,84.43,5.105050,13.2,12.00,3.400,3.0,2012-10-01,1452,KGS
1390,LUKE G U,796.2900,A1 SH,0.925,1,78.51,5.105050,11.8,11.40,3.400,3.0,2012-10-01,1453,KGS
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1842,LUKE G U,865.6320,C LM,0.074,2,79.48,4.886524,7.7,26.15,2.500,8.0,2012-10-01,1905,KGS
1843,LUKE G U,865.7844,C LM,0.059,2,62.88,4.886524,12.6,23.70,2.700,8.0,2012-10-01,1906,KGS
1844,LUKE G U,865.9368,C LM,0.044,2,41.04,4.295364,13.7,10.65,3.200,8.0,2012-10-01,1907,KGS
1845,LUKE G U,866.0892,C LM,0.029,2,33.99,3.689776,3.2,3.90,3.800,8.0,2012-10-01,1908,KGS


In [40]:
df.loc[df['Well Name'] == 'LUKE G U', ['Formation', 'GR']]

,Formation,GR
1386,A1 SH,74.90
1387,A1 SH,83.80
1388,A1 SH,86.97
1389,A1 SH,84.43
1390,A1 SH,78.51
...,...,...
1842,C LM,79.48
1843,C LM,62.88
1844,C LM,41.04
1845,C LM,33.99


In [41]:
df.loc[(df['Well Name'] == 'LUKE G U') & (df['Depth'] > 850)]

,Well Name,Depth,Formation,RelPos,Marine,GR,ILD,DeltaPHI,PHIND,PE,Facies,Completion Date,Index,Source
1743,LUKE G U,850.0872,C SH,0.737,1,72.69,3.388442,5.0,18.10,3.1,3.0,2012-10-01,1806,KGS
1744,LUKE G U,850.2396,C SH,0.719,1,73.82,3.318945,5.8,16.30,3.1,3.0,2012-10-01,1807,KGS
1745,LUKE G U,850.3920,C SH,0.702,1,77.10,3.111716,7.6,15.50,3.2,3.0,2012-10-01,1808,KGS
1746,LUKE G U,850.5444,C SH,0.684,1,78.49,3.040885,13.0,13.30,3.2,3.0,2012-10-01,1809,KGS
1747,LUKE G U,850.6968,C SH,0.667,1,78.59,2.917427,14.3,14.25,3.3,3.0,2012-10-01,1810,KGS
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1842,LUKE G U,865.6320,C LM,0.074,2,79.48,4.886524,7.7,26.15,2.5,8.0,2012-10-01,1905,KGS
1843,LUKE G U,865.7844,C LM,0.059,2,62.88,4.886524,12.6,23.70,2.7,8.0,2012-10-01,1906,KGS
1844,LUKE G U,865.9368,C LM,0.044,2,41.04,4.295364,13.7,10.65,3.2,8.0,2012-10-01,1907,KGS
1845,LUKE G U,866.0892,C LM,0.029,2,33.99,3.689776,3.2,3.90,3.8,8.0,2012-10-01,1908,KGS


In [42]:
df.loc[(df['Well Name'] == 'LUKE G U') & (df['Depth'] > 850), ['GR']].max()

GR    172.0
dtype: float64
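
The result above is a Series because the column was requested as a list, `['GR']`. Asking for the single label `'GR'` returns a plain number instead, and summing the boolean selector is a quick way to count rows. The sketch below uses a tiny made-up table (`logs`) rather than the real data, so the numbers are illustrative only.

```python
import pandas as pd

# A tiny made-up stand-in for the well-log table.
logs = pd.DataFrame({
    'Well Name': ['LUKE G U', 'LUKE G U', 'LUKE G U', 'SHRIMPLIN'],
    'Depth': [849.0, 850.5, 851.0, 851.3],
    'GR': [74.9, 83.8, 62.9, 77.45],
})

deep_luke = (logs['Well Name'] == 'LUKE G U') & (logs['Depth'] > 850)

n_rows = int(deep_luke.sum())                   # each True counts as 1
as_series = logs.loc[deep_luke, ['GR']].max()   # list of columns: Series of maxima
as_scalar = logs.loc[deep_luke, 'GR'].max()     # single column: one number

print(n_rows, as_scalar)  # 2 83.8
```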

## Writing Data Out

Writing data out is just as simple as reading it in, using one of the DataFrame's `to_*` methods. In this case, we will go with a simple CSV file:

In [23]:
df.to_csv('../data/Panoma_Field_Permian_RAW.csv', index=False)
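
To see what `index=False` changes, we can call `to_csv` without a path, in which case it returns the CSV text as a string. This sketch uses a two-row frame with a named index, built only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'velocity': [2.13, 3.45],
                   'lithology': ['sandstone', 'limestone']},
                  index=pd.Index([101, 102], name='id'))

without_index = df.to_csv(index=False)  # index labels are dropped
with_index = df.to_csv()                # index is written as the first column

print(without_index.splitlines()[0])    # velocity,lithology
print(with_index.splitlines()[0])       # id,velocity,lithology
```

Dropping the index makes sense when it is only a row counter; if it carries meaning, as `id` does here, it is usually worth keeping.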

If we are comfortable with SQL, or have an existing database, we may wish to write our DataFrame as a table there. We will use [sqlite3](https://docs.python.org/3/library/sqlite3.html), the Python standard library's interface to [SQLite](https://www.sqlite.com/index.html). If you have an existing database, you may prefer to use [SQLAlchemy](https://docs.sqlalchemy.org/) to create the connection instead:

In [None]:
import sqlite3

In [None]:
connection = sqlite3.connect('../data/panoma.db')
df.to_sql('panoma_raw', con=connection, if_exists='replace', index=False)

You could use `read_sql` to get data from a SQL database instead of reading a file.
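
Here is a minimal round-trip sketch of that idea. It uses an in-memory SQLite database (`':memory:'`) and three hand-typed records so that nothing is written to disk; the table name matches the one above, but the data and the query are illustrative assumptions.

```python
import sqlite3

import pandas as pd

logs = pd.DataFrame({
    'Well Name': ['SHRIMPLIN', 'SHRIMPLIN', 'SHRIMPLIN'],
    'Depth': [851.3064, 851.4588, 851.6112],
    'GR': [77.45, 78.26, 79.05],
})

# An in-memory database that disappears when the connection closes.
connection = sqlite3.connect(':memory:')
logs.to_sql('panoma_raw', con=connection, if_exists='replace', index=False)

# Let the database do the filtering, then load the result as a DataFrame.
result = pd.read_sql('SELECT * FROM panoma_raw WHERE GR > 78', con=connection)
connection.close()

print(result)
```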

<hr />

<p style="color:gray">©2022 Agile Geoscience. Licensed CC-BY.</p>