# Introduction to IO (Input/Output)

We will inevitably need to read data from various places and in various formats before we can do anything with it. This notebook is an overview of some common formats and common ways to read and/or write them. It is by no means an exhaustive list of what can be read in Python, so if you have specific requests, please do reach out.

The following will not import everything upfront. We will start with some generic formats, and then some more specialised subsurface/geoscience formats.

## Loading test data

In order to illustrate the various I/O operations below, we'll start by loading some data files to work with covering `*.csv`, `*.xlsx`, `*.geojson`, `*.shp`, `*.las`, `*.sgy`, `*.dlis`:

# TODO: add `*.shp` file

In [1]:
import pooch

# TODO: remember to change path to `../data` in prod
spot = pooch.create(path='./data', base_url="https://geocomp.s3.amazonaws.com/data/",
                    registry={"Norway_field_production_monthly.csv": "md5:26e7f45b8bb9807e0c8f03d993cc973e",
                              "L-30_Depth-DT-RHOB.csv": "md5:b28af3b948694cc59b743b0e119b3220",
                              "FMI_run2_feature_picks.xlsx": "md5:c8384e3701e04a74a320350d6b73657b",
                              "Offshore_wells.geojson": "md5:fb9a743840a105158785addb191392fb",
                              "B-41.las": "md5:8496be6d22b71e7d8b6f9afe63f8d2a4",
                              "F3_8-bit_int.sgy": "md5:cbde973eb6606da843f40aedf07793e4",
                              "FMI_Run3_processed.dlis": None,
                              "F3_Demo_0_FS4.dat": None,
                             })

## CSV or TSV files

A very common format: plain text with a delimiter character (often `,` or `;`, or a tab for TSV files) separating the columns, and newlines separating the records. There are a number of ways to load these, depending on the intended use-case. Numpy and Pandas are probably the most common.

In [2]:
import numpy as np
import pandas as pd

In [3]:
fname = spot.fetch("L-30_Depth-DT-RHOB.csv")
depth, dt, rhob = np.genfromtxt(fname, delimiter=',')
depth.shape, dt.shape, rhob.shape

((25621,), (25621,), (25621,))

In [4]:
depth, dt, rhob = np.loadtxt(fname, delimiter=',')
depth.shape, dt.shape, rhob.shape

((25621,), (25621,), (25621,))
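
The tuple unpacking above relies on the first axis of the array holding one entry per log, that is, each curve being stored along a row. For the more usual layout with a header row and one column per curve, `unpack=True` transposes the result so the same unpacking still works. A minimal sketch, assuming a small made-up table (the inline text and its values are illustrative and not taken from the L-30 file):

```python
import io

import numpy as np

# Hypothetical column-oriented CSV: one header row, one column per curve.
text = """depth,dt,rhob
100.0,80.5,2.35
100.5,81.0,2.36
101.0,79.8,2.40
"""

# skiprows=1 skips the header line; unpack=True returns one array per column.
depth_col, dt_col, rhob_col = np.loadtxt(
    io.StringIO(text), delimiter=",", skiprows=1, unpack=True
)
print(depth_col.shape, dt_col.shape, rhob_col.shape)
```

The variables are given a `_col` suffix here only to avoid overwriting the arrays loaded above.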

In [5]:
# TODO: Change path in prod to `../data`
np.savetxt('./data/L-30_depth_np_export', depth)

In [6]:
fname = spot.fetch("Norway_field_production_monthly.csv")
df = pd.read_csv(fname)
df.head()

| | prfInformationCarrier | prfYear | prfMonth | prfPrdOilNetMillSm3 | prfPrdGasNetBillSm3 | prfPrdNGLNetMillSm3 | prfPrdCondensateNetMillSm3 | prfPrdOeNetMillSm3 | prfPrdProducedWaterInFieldMillSm3 | prfNpdidInformationCarrier |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24/9-12 S (Frosk) | 2019 | 8 | 0.01705 | 0.00068 | 0.0 | 0.0 | 0.01772 | 0.00061 | 31140456 |
| 1 | 24/9-12 S (Frosk) | 2019 | 9 | 0.05557 | 0.00323 | 0.0 | 0.0 | 0.0588 | 0.0 | 31140456 |
| 2 | 24/9-12 S (Frosk) | 2019 | 10 | 0.04403 | 0.00258 | 0.0 | 0.0 | 0.04661 | 0.0 | 31140456 |
| 3 | 24/9-12 S (Frosk) | 2019 | 11 | 0.0535 | 0.00299 | 0.0 | 0.0 | 0.05648 | 0.0 | 31140456 |
| 4 | 24/9-12 S (Frosk) | 2019 | 12 | 0.05825 | 0.00297 | 0.0 | 0.0 | 0.06123 | 9e-05 | 31140456 |


In [7]:
df.columns

Index(['prfInformationCarrier', 'prfYear', 'prfMonth', 'prfPrdOilNetMillSm3',
       'prfPrdGasNetBillSm3', 'prfPrdNGLNetMillSm3',
       'prfPrdCondensateNetMillSm3', 'prfPrdOeNetMillSm3',
       'prfPrdProducedWaterInFieldMillSm3', 'prfNpdidInformationCarrier'],
      dtype='object')

In [8]:
# TODO: Change path in prod to `../data`
df.loc[df['prfYear'] == 2021,
       'prfPrdOilNetMillSm3'].to_csv('./data/Norway_prod_2021_OilNetMillSm3.csv', index=False)
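
The delimiter mentioned at the top of this section is not always a comma. As a hedged sketch, the snippet below reads a semicolon-delimited table that uses decimal commas (a common combination in European exports) and writes it back out as tab-separated text. The table contents and column names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited export with decimal commas.
text = "well;depth_m;porosity\nA-1;1500,5;0,21\nA-2;1620,0;0,18\n"

# sep sets the column delimiter; decimal tells pandas how to parse the numbers.
df_semi = pd.read_csv(io.StringIO(text), sep=";", decimal=",")
print(df_semi.dtypes)

# The same writer produces TSV when given a tab separator.
# With no path argument, to_csv returns the text instead of writing a file.
tsv = df_semi.to_csv(sep="\t", index=False)
print(tsv)
```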

## Excel Files

Pandas is the easiest route here. It relies on optional libraries that you will need to install separately: `openpyxl` for `*.xlsx` files and `xlrd` for legacy `*.xls` files.

In [9]:
fname = spot.fetch('FMI_run2_feature_picks.xlsx')
df = pd.read_excel(fname)
df.head()

| | TDEP | Azimuth | Dip_TRU | FRACTURE_APERTURE[0] | FVA (mean fracture height) | FVAH (mean hydraulic fracture height) | Induced_Fracture_Height_N | Breakout_Azimuth_N | Breakout_Dip_Azimuth | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7442.4995 | 352.3611 | 85.95914 | 68.98955 | 0.253895 | 0.278163 | | | | Conductive_Part_Resistive_Fracture |
| 1 | 7442.681591 | 308.8001 | 88.82356 | | | | | 128.7822 | 38.80008 | Breakout |
| 2 | 7443.234932 | 307.3956 | 88.85078 | | | | | 307.3779 | 37.39557 | Breakout |
| 3 | 7444.356259 | 285.515 | 81.06979 | 42.64809 | 0.160323 | 0.43383 | | | | Conductive_Part_Resistive_Fracture |
| 4 | 7445.044424 | 92.84357 | 20.3742 | 215.7491 | 0.20795 | 0.236389 | | | | Conductive_Part_Resistive_Fracture |


It is worth noting that you can either read individual worksheets, or load multiple ones into one dictionary. As with `*.csv` files, we can easily write back to disk from pandas:

In [10]:
df.columns

Index(['TDEP', 'Azimuth', 'Dip_TRU', 'FRACTURE_APERTURE[0]',
       'FVA (mean fracture height)', 'FVAH (mean hydraulic fracture height)',
       'Induced_Fracture_Height_N', 'Breakout_Azimuth_N',
       'Breakout_Dip_Azimuth', 'Type'],
      dtype='object')

In [11]:
df_sample = df.loc[(60 < df['Azimuth']) &
                   (df['Azimuth'] < 120) &
                   (df['FRACTURE_APERTURE[0]'] > 50),
                   ['TDEP', 'Azimuth', 'Dip_TRU', 'FRACTURE_APERTURE[0]']
                  ]
df_sample

| | TDEP | Azimuth | Dip_TRU | FRACTURE_APERTURE[0] |
|---|---|---|---|---|
| 4 | 7445.044424 | 92.84357 | 20.3742 | 215.7491 |
| 68 | 7500.826235 | 67.46991 | 26.24434 | 160.5575 |
| 69 | 7501.285741 | 82.76354 | 24.28477 | 150.5226 |
| 70 | 7501.523337 | 73.33259 | 26.63311 | 180.6272 |
| 112 | 7532.65198 | 114.4092 | 80.00818 | 147.0968 |
| 124 | 7541.789573 | 115.3005 | 56.43068 | 68.38709 |


In [12]:
# TODO: Change path in prod to `../data`
with pd.ExcelWriter("./data/Large_fracs_east_sample.xlsx") as writer:
    df_sample.to_excel(writer)
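
Reading individual worksheets, or all of them into one dictionary, is controlled by the `sheet_name` argument of `pd.read_excel`. The sketch below is a self-contained round trip through a temporary workbook. The two small tables and the sheet names `picks` and `tops` are made up for illustration, and it assumes `openpyxl` is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two small, made-up tables to act as worksheets.
picks = pd.DataFrame({"TDEP": [7442.5, 7443.2], "Type": ["Fracture", "Breakout"]})
tops = pd.DataFrame({"Formation": ["A", "B"], "MD": [7400.0, 7500.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.xlsx"

    # Write each DataFrame to its own named worksheet.
    with pd.ExcelWriter(path) as writer:
        picks.to_excel(writer, sheet_name="picks", index=False)
        tops.to_excel(writer, sheet_name="tops", index=False)

    # A single worksheet, selected by name (an integer position also works).
    tops_only = pd.read_excel(path, sheet_name="tops")

    # sheet_name=None loads every worksheet into a dict keyed by sheet name.
    sheets = pd.read_excel(path, sheet_name=None)

print(list(sheets))
```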

## Databases

There are numerous ways of reading a database, and the best choice partly depends on the type of database. Pandas can read and write SQL, so it is a reasonable starting point.

For a more powerful and flexible option, consider [sqlalchemy](https://www.sqlalchemy.org/).
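
As a minimal sketch of the pandas route, the snippet below writes a DataFrame to an in-memory SQLite database using the standard-library `sqlite3` module, then queries it back with `pd.read_sql`. The table name `production` and its contents are made up for illustration. Other database engines generally need a SQLAlchemy connection instead.

```python
import sqlite3

import pandas as pd

# Made-up production figures, purely for illustration.
production = pd.DataFrame({
    "field": ["Alpha", "Alpha", "Beta"],
    "year": [2020, 2021, 2021],
    "oil": [1.2, 1.1, 0.4],
})

# ":memory:" creates a throwaway database that lives only in RAM.
conn = sqlite3.connect(":memory:")
try:
    # Write the DataFrame to a new table named "production".
    production.to_sql("production", conn, index=False)

    # Run a query and get the result back as a DataFrame.
    result = pd.read_sql(
        "SELECT field, SUM(oil) AS total_oil "
        "FROM production WHERE year = 2021 "
        "GROUP BY field ORDER BY field",
        conn,
    )
finally:
    conn.close()

print(result)
```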

## JSON

JavaScript Object Notation (JSON) is a very common format for exchanging information on the internet, so you may get it back from various Application Programming Interfaces (APIs). It is very similar to a Python `dict`, which is how JSON objects are usually handled once they are loaded. There is a built-in library for working with it, logically enough named `json`. As well as reading and writing files (`json.load` and `json.dump`), it can handle JSON held in a string (`json.loads` and `json.dumps`).

In [13]:
import json

In [14]:
#json.load()

In [15]:
#json.dump()
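
A short sketch of the four main `json` functions, using a made-up record about a hypothetical well. `dumps` and `loads` work on strings, while `dump` and `load` work on open files.

```python
import json
import tempfile
from pathlib import Path

# A made-up record; JSON maps naturally onto dicts, lists, strings and numbers.
well = {
    "name": "Example-1",
    "td_m": 3250.5,
    "logs": ["GR", "DT", "RHOB"],
    "deviated": False,
    "operator": None,
}

# String round trip: dict -> JSON text -> dict.
text = json.dumps(well, indent=2)
restored = json.loads(text)

# File round trip: dict -> file on disk -> dict.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "well.json"
    with open(path, "w") as f:
        json.dump(well, f)
    with open(path) as f:
        from_file = json.load(f)

print(restored == well, from_file == well)
```

Note that Python's `False` and `None` are written as `false` and `null` in the JSON text, and converted back on loading.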

<hr/>

The following are more geoscience or subsurface data formats.

## Shapefiles

These are a common geographical information system format, originally developed by Esri. A simple way to load these is to use geopandas:

In [16]:
import geopandas as gpd

In [17]:
#gdf = gpd.read_file()

Geopandas delegates file handling to a backend library (traditionally `fiona`; recent releases default to `pyogrio`), so it can handle many formats in addition to shapefiles. In the `fiona` listing below, formats marked `'r'` can be read, `'w'` can be written, and `'a'` can be appended to.

In [18]:
import fiona
fiona.supported_drivers

{'ARCGEN': 'r',
 'DXF': 'rw',
 'CSV': 'raw',
 'OpenFileGDB': 'r',
 'ESRIJSON': 'r',
 'ESRI Shapefile': 'raw',
 'FlatGeobuf': 'rw',
 'GeoJSON': 'raw',
 'GeoJSONSeq': 'rw',
 'GPKG': 'raw',
 'GML': 'rw',
 'OGR_GMT': 'rw',
 'GPX': 'rw',
 'GPSTrackMaker': 'rw',
 'Idrisi': 'r',
 'MapInfo File': 'raw',
 'DGN': 'raw',
 'PCIDSK': 'rw',
 'OGR_PDS': 'r',
 'S57': 'r',
 'SQLite': 'raw',
 'TopoJSON': 'r'}

## LAS files

`lasio` is a library that can read LAS 2.0 files, but `welly` is a wrapper around it that may be nicer for everyday use:

In [19]:
from welly import Well, Project

In [20]:
# TODO: Change path in prod to `../data`
fname = spot.fetch('B-41.las')
w = Well.from_las(fname)
w

PENOBSCOT B-41,PENOBSCOT B-41.1
crs,CRS({})
location,44'10'02; N LAT|60'06'32; W LO
county,NOVA SCOTIA SHELF
latitude,0
longitude,0
td,
data,"CALD, CALS, DEPTH:2, DRHO, DT, GRD, GRS, ILD, ILM, LL8, NPHISS, RHOB, SP"


Welly can also load an entire directory of LAS files into a `Project`; see the [welly docs](https://code.agilescientific.com/welly/userguide/Projects.html) for an example.

## SEG-Y

The SEG-Y format is widely used, although any individual file can be tricky to load. Equinor has written a low-level library named [`segyio`](https://github.com/equinor/segyio) which can (with some effort in some cases) read and write SEG-Y files and headers.

In [21]:
import segyio

In [22]:
fname = spot.fetch('F3_8-bit_int.sgy')

with segyio.open(fname) as s:
    vol = segyio.cube(s)
vol.shape

(225, 300, 463)

Because `segyio` is intended for relatively low-level operations, it can take a fair amount of work to get a file loaded. An alternative built on top of it is the SEG-Y Swiss Army Knife ([SEGYSAK](https://segysak.readthedocs.io/en/latest/index.html)), which is intended to make common operations a little easier. It also interfaces with `xarray`, which adds labelled dimensions and coordinates on top of numpy arrays and is well worth a look.

In [23]:
from segysak.segy import segy_loader

In [None]:
segy_loader(fname)

Loading as 3D
Fast direction is INLINE_3D
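
To give a flavour of why `xarray` is worth a look, here is a minimal sketch that wraps a small random array (standing in for a seismic volume, not real data) in a labelled `DataArray`. The dimension names `iline`, `xline` and `twt` and the coordinate values are assumptions chosen for illustration.

```python
import numpy as np
import xarray as xr

# Small random volume standing in for seismic amplitudes (not real data).
rng = np.random.default_rng(0)
amplitudes = rng.normal(size=(3, 4, 5))

# Attach names to the axes and coordinate values to each position.
cube = xr.DataArray(
    amplitudes,
    dims=("iline", "xline", "twt"),
    coords={
        "iline": [100, 101, 102],
        "xline": [300, 301, 302, 303],
        "twt": np.arange(0, 20, 4),  # 0, 4, 8, 12, 16 ms
    },
    name="amplitude",
)

# Select by label rather than by integer position.
inline = cube.sel(iline=101)

# Label-based slices include both end points.
shallow = cube.sel(twt=slice(0, 8))

print(inline.shape, shallow.shape)
```

Selecting by inline number or time, rather than by array index, avoids a lot of bookkeeping about which index corresponds to which line.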

## DLIS files

Equinor has also written a library named `dlisio` that can handle DLIS files:

In [None]:
from dlisio import dlis

fname = spot.fetch('FMI_Run3_processed.dlis')

with dlis.load(fname) as (f, *tail):
    print(f.describe())

In [None]:
for ch in f.channels:
    print(ch)

## Other Assorted Formats

The subsurface world is filled with all sorts of other formats. Agile Scientific has written a library named `gio` that can handle a variety of these, such as OpendTect horizons, Surfer 7 grids, and ZMaps. These are loaded as `xarray` objects. The documentation has [more details](https://code.agilescientific.com/gio/index.html).

In [None]:
import gio

In [None]:
fname = spot.fetch('F3_Demo_0_FS4.dat')
data = gio.read_odt(fname)

In [None]:
data