# Intro to `pandas` for machine learning

We'll explore the Pandas package for simple data handling tasks using geoscience data examples. 

This notebook introduces the concept of a `DataFrame` in Python. If you're familiar with R, it's pretty much the same idea! There is a useful cheat sheet [here](https://www.datacamp.com/community/blog/pandas-cheat-sheet-python#gs.59HV6BY).

The main purpose of Pandas is to allow easy manipulation of data in tabular form. Perhaps the most important idea that makes Pandas great for data science is that it will always preserve **alignment** between data and labels.
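
To make the idea of alignment concrete, here is a minimal sketch using two small `Series` with made-up values (the names `a`, `b` and `total` are just for illustration). The labels are the same but in a different order, and the arithmetic matches values by label rather than by position:

```python
import pandas as pd

# Two Series with the same labels in a different order (made-up values).
a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0, 30.0], index=['z', 'y', 'x'])

# Addition pairs values up by label, not by position,
# so 'x' gets 1.0 + 30.0 rather than 1.0 + 10.0.
total = a + b
print(total)
```

With plain NumPy arrays the same addition would be positional, so keeping track of which value belongs to which label would be up to us.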

In [2]:
import pandas as pd

The most common data structure in Pandas is the `DataFrame`: a 2D structure that can hold various types of Python objects, indexed by an `index` array (or multiple `index` arrays). Columns are usually labelled as well, using strings.

An easy way to think about a `DataFrame` is to imagine it as an Excel spreadsheet.

Let's define one using a small dataset:

In [3]:
data =  [[2.13, 'sandstone'],
         [3.45, 'limestone'],
         [2.45, 'shale']]
data

[[2.13, 'sandstone'], [3.45, 'limestone'], [2.45, 'shale']]

Make a `DataFrame` from `data`:

In [4]:
df = pd.DataFrame(data, columns=['velocity', 'lithology'])
df

|   | velocity | lithology |
|---|----------|-----------|
| 0 | 2.13 | sandstone |
| 1 | 3.45 | limestone |
| 2 | 2.45 | shale |
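
A list of rows is not the only way in. As a sketch, the same table can also be built from a dictionary of columns, which is often handier when the data already lives in separate lists or arrays. The names `df_from_dict` and `df_from_rows` are only for this illustration:

```python
import pandas as pd

# Column-oriented: each key becomes a column name.
columns = {
    'velocity': [2.13, 3.45, 2.45],
    'lithology': ['sandstone', 'limestone', 'shale'],
}
df_from_dict = pd.DataFrame(columns)

# Row-oriented: the same values, with the column names passed separately.
rows = [[2.13, 'sandstone'], [3.45, 'limestone'], [2.45, 'shale']]
df_from_rows = pd.DataFrame(rows, columns=['velocity', 'lithology'])

# Both routes should give the same table.
print(df_from_dict.equals(df_from_rows))
```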


Accessing the data is a bit more complex than with NumPy arrays, but for good reasons:

In [5]:
df['lithology']

0    sandstone
1    limestone
2        shale
Name: lithology, dtype: object

In [6]:
df[0]  # Can't get at rows like this.

KeyError: 0

In [None]:
df.loc[0:1]  # Inclusive slice, unlike anything else in Python or NumPy

In [None]:
df['id'] = [101, 102, 103]
df = df.set_index('id')
df.head()

In [None]:
df.loc[0]  # KeyError: .loc looks up labels, and the index is now 101, 102, 103.

In [None]:
df.iloc[0:1]  # Works like NumPy, rarely needed

In [None]:
df.loc[df['velocity'] < 3]  # .loc accepts boolean masks, like NumPy arrays.

In [None]:
df.loc[df['velocity'] < 3, 'lithology']  # Both rows and columns.
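
To pull the access patterns together, here is a small self-contained sketch. It rebuilds a three-row table with the `id` index (as `df_demo`, a name used only here) and compares label-based and position-based slicing, then applies a boolean mask:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'velocity': [2.13, 3.45, 2.45],
     'lithology': ['sandstone', 'limestone', 'shale']},
    index=pd.Index([101, 102, 103], name='id'),
)

# Label-based slice: both endpoints are included, so this gives two rows.
by_label = df_demo.loc[101:102]

# Position-based slice: the end is excluded, as in NumPy, so this gives one row.
by_position = df_demo.iloc[0:1]

# A boolean mask selects rows; the second argument selects a column.
mask = df_demo['velocity'] < 3
slow = df_demo.loc[mask, 'lithology']

print(len(by_label), len(by_position))
print(slow)
```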


## Adding data

Add more data (row-wise):

In [None]:
df.loc[99] = [3.5, 'dolomite']

In [None]:
df.loc[3] = [2.6, 'shale']
df
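
Assigning with `.loc` is convenient for one row at a time. For several rows at once, one option is `pd.concat`; the sketch below assumes the new rows are already in their own small `DataFrame` (`new_rows`, a hypothetical name) with matching column names:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'velocity': [2.13, 3.45], 'lithology': ['sandstone', 'limestone']},
    index=[101, 102],
)

# The rows to add, with the same columns and their own index labels.
new_rows = pd.DataFrame(
    {'velocity': [3.5, 2.6], 'lithology': ['dolomite', 'shale']},
    index=[99, 3],
)

# concat stacks the frames vertically and keeps each frame's index labels.
combined = pd.concat([df_demo, new_rows])
print(combined)
```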

Add a new column with a "complete" list, array or Series (one value per row):

In [None]:
df['new_column'] = ['x', 'y', 'z', 'a', 'b']
df
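
The word "complete" matters because a plain list has to match the number of rows, whereas a `Series` is aligned on its index. A rough sketch of the difference, using a throwaway `df_demo` and made-up values:

```python
import pandas as pd

df_demo = pd.DataFrame({'velocity': [2.13, 3.45, 2.45]}, index=[101, 102, 103])

# A Series covering only two of the three labels, in a different order.
partial = pd.Series(['b', 'a'], index=[103, 101])

# Values are placed by label; the row with no match (102) gets NaN.
df_demo['flag'] = partial

# A list carries no labels, so a list of the wrong length is rejected.
try:
    df_demo['bad'] = ['x', 'y']
except ValueError as err:
    print('ValueError:', err)

print(df_demo)
```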

## Column types

It's especially worth checking that your numbers are actually numbers (e.g. floats or ints), because they look the same as strings in the various renderings of DataFrames and Series.

In [None]:
s = pd.Series([23, '23'])
s

In [None]:
s * 2
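
If a column turns out to hold number-like strings, one way to repair it is `pd.to_numeric`. This is a sketch rather than a recipe; the `messy` values are invented to show what `errors='coerce'` does with something that cannot be parsed:

```python
import pandas as pd

s = pd.Series([23, '23'])

# With mixed types, the integer doubles but the string is repeated.
doubled = s * 2

# to_numeric parses the string, giving a proper integer Series.
clean = pd.to_numeric(s)

# With errors='coerce', unparseable entries become NaN instead of raising.
messy = pd.Series(['1.5', 'n/a'])
coerced = pd.to_numeric(messy, errors='coerce')

print(doubled.tolist())
print(clean.dtype, coerced.tolist())
```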

Our DataFrame is not too bad already...

In [None]:
df.dtypes

It makes sense to use the [Pandas `categorical` type](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html) for categorical variables. Let's do that:

In [None]:
cols = ['lithology', 'new_column']
df[cols] = df[cols].apply(pd.Categorical)

df.dtypes
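
Under the hood, a categorical column stores each distinct value once and keeps a small integer code per row. As a minimal illustration (using `astype('category')`, which is another common way to do the conversion, on a made-up `lith` Series):

```python
import pandas as pd

lith = pd.Series(['sandstone', 'limestone', 'shale', 'shale']).astype('category')

# The distinct values; inferred categories are sorted when the values are sortable.
categories = list(lith.cat.categories)

# One integer per row, pointing into the list of categories.
codes = lith.cat.codes.tolist()

print(categories)
print(codes)
```

Those integer codes are often a convenient starting point when a machine learning library needs numeric labels.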

## Reading and Writing Files

Pandas also reads files from disk in tabular form (see [the list of all the formats it can read and write](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)). A very common one is CSV, so let's load one!

The data is the same as used in this study: http://www.kgs.ku.edu/PRS/publication/2003/ofr2003-30/index.html

From that poster:

> The Panoma Field (2.9 TCF gas) produces from Permian Council Grove Group marine carbonates and nonmarine siliciclastics in the Hugoton embayment of the Anadarko Basin. It and the Hugoton Field, which has produced from the Chase Group since 1928, the top of which is 300 feet shallower have combined to produce 27 TCF gas, making it the largest gas producing area in North America. Both fields are stratigraphic traps with their updip west and northwest limits nearly coincident. Maximum recoveries in the Panoma are attained west of center of the field. Deeper production includes oil and gas from Pennsylvanian Lansing-Kansas City, Marmaton, and Morrow and the Mississippian.

In [7]:
df = pd.read_csv('https://geocomp.s3.amazonaws.com/data/Panoma_Field_Permian_RAW.csv')
df.head()

|   | Index | Well Name | Depth | Formation | RelPos | Marine | GR | ILD | DeltaPHI | PHIND | PE | Facies | Completion Date | Source |
|---|-------|-----------|-------|-----------|--------|--------|----|-----|----------|-------|----|--------|-----------------|--------|
| 0 | 63 | SHRIMPLIN | 851.3064 | A1 SH | 1.0 | 1.0 | 77.45 | 4.6132 | 9.9 | 11.915 | 4.6 | Nonmarine fine siltstone | 2010-03-26 | KGS |
| 1 | 64 | SHRIMPLIN | 851.4588 | A1 SH | 0.979 | 1.0 | 78.26 | 4.5814 | 14.2 | 12.565 | 4.1 | Nonmarine fine siltstone | 2010-03-26 | KGS |
| 2 | 65 | SHRIMPLIN | 851.6112 | A1 SH | 0.957 | 1.0 | 79.05 | 4.5499 | 14.8 | 13.05 | 3.6 | Nonmarine fine siltstone | 2010-03-26 | KGS |
| 3 | 66 | SHRIMPLIN | 851.7636 | A1 SH | 0.936 | 1.0 | 86.1 | 4.5186 | 13.9 | 13.115 | 3.5 | Nonmarine fine siltstone | 2010-03-26 | KGS |
| 4 | 67 | SHRIMPLIN | 851.916 | A1 SH | 0.915 | 1.0 | 74.58 | 4.4361 | 13.5 | 13.3 | 3.4 | Nonmarine fine siltstone | 2010-03-26 | KGS |
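
`read_csv` can also set column types while loading, which ties in with the earlier point about checking dtypes. The sketch below reads a tiny inline CSV (a cut-down imitation of the file above, held in the hypothetical `csv_text`) so that it runs without a network connection:

```python
import io
import pandas as pd

# A two-row stand-in for the real file, with only a few of its columns.
csv_text = """Well Name,Depth,Facies,Completion Date
SHRIMPLIN,851.3064,Nonmarine fine siltstone,2010-03-26
SHRIMPLIN,851.4588,Nonmarine fine siltstone,2010-03-26
"""

wells = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=['Completion Date'],   # parse this column as datetimes
    dtype={'Facies': 'category'},      # store this column as a categorical
)
print(wells.dtypes)
```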


Writing data out is similarly simple:

In [None]:
df.to_csv('../data/Panoma_Field_Permian-RAW.csv')
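
One thing to be aware of: by default `to_csv` writes the index as an extra, unnamed first column, and reading the file back then produces a column called `Unnamed: 0`. Here is a small sketch of the round trip, using in-memory buffers instead of files and a made-up `df_demo`:

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'Depth': [851.3064, 851.4588], 'GR': [77.45, 78.26]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
reread = pd.read_csv(with_index)

# index=False leaves the index out, so the round trip gives back the same columns.
no_index = io.StringIO()
df_demo.to_csv(no_index, index=False)
no_index.seek(0)
reread_clean = pd.read_csv(no_index)

print(list(reread.columns))
print(list(reread_clean.columns))
```

If the index does carry meaning, another option is to keep it in the file and pass `index_col=0` to `read_csv` when loading.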

If we are comfortable with SQL, or have an existing database, we may wish to write our DataFrame as a table there. We will use [sqlite3](https://docs.python.org/3/library/sqlite3.html). If you have an existing database, you may prefer to look at [SQLAlchemy](https://docs.sqlalchemy.org/) to create the connection instead:

In [8]:
import sqlite3

In [12]:
connection = sqlite3.connect('../data/panoma.db')
df.to_sql('panoma_raw', con=connection, if_exists='replace', index=False)

4899
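
Once the table exists, we can query it with ordinary SQL and get a `DataFrame` back using `pd.read_sql`. The sketch below uses an in-memory SQLite database and a made-up three-row table (`panoma_demo`) so it does not touch any files, and it closes the connection when done:

```python
import sqlite3
import pandas as pd

df_demo = pd.DataFrame({'Depth': [851.3064, 851.4588, 851.6112],
                        'GR': [77.45, 78.26, 79.05]})

# ':memory:' creates a temporary database that disappears on close.
connection = sqlite3.connect(':memory:')
try:
    df_demo.to_sql('panoma_demo', con=connection, index=False)

    # Let the database do the filtering, then load only the result.
    high_gr = pd.read_sql(
        'SELECT Depth, GR FROM panoma_demo WHERE GR > 78',
        connection,
    )
finally:
    connection.close()

print(high_gr)
```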

<hr />

<p style="color:gray">©2022 Agile Geoscience. Licensed CC-BY.</p>