# Introduction to IO (Input/Output)

We inevitably need to read data from various places and formats before we can do anything with it. This notebook is an overview of some common formats and common ways to read and write them. It is not an exhaustive list of what Python can read, so if you have specific requests, please do reach out.

Imports are introduced as they are needed rather than all upfront. We start with some generic formats, then move on to more specialised subsurface/geoscience formats.

## Loading test data

In order to illustrate the various I/O operations below, we'll start by loading some data files to work with covering `*.csv`, `*.xlsx`, `*.geojson`, `*.shp`, `*.las`, `*.sgy`, `*.dlis`:

# TODO: add `*.shp` file

In [1]:
import pooch

# TODO: remember to change path to `../data` in prod
spot = pooch.create(path='./data', base_url="https://geocomp.s3.amazonaws.com/data/",
                    registry={"Norway_field_production_monthly.csv": "md5:26e7f45b8bb9807e0c8f03d993cc973e",
                              "L-30_Depth-DT-RHOB.csv": "md5:b28af3b948694cc59b743b0e119b3220",
                              "FMI_run2_feature_picks.xlsx": "md5:c8384e3701e04a74a320350d6b73657b",
                              "Offshore_wells.geojson": "md5:fb9a743840a105158785addb191392fb",
                              "B-41.las": "md5:8496be6d22b71e7d8b6f9afe63f8d2a4",
                              "F3_8-bit_int.sgy": "md5:c1039c439abdd240f72efb39108ca186",
                              "FMI_Run3_processed.dlis": "md5:a1258f9868e7bb55aedb7015a4f04fc4",
                             })
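
Each registry entry pairs a filename with a hash, and `pooch` checks every download against it, so a corrupted or changed file is caught when it is fetched. As a rough illustration of what that check amounts to (not pooch's actual implementation), the sketch below recomputes a digest with the standard library and compares it with a registry-style entry. The helper name `hash_matches` and the example bytes are made up for this sketch.

```python
import hashlib


def hash_matches(data: bytes, registry_entry: str) -> bool:
    """Compare the digest of `data` with an 'algorithm:hexdigest' entry."""
    # Split "md5:abc123..." into the algorithm name and the expected digest.
    algorithm, _, expected = registry_entry.partition(":")
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest == expected.lower()


# Hypothetical content standing in for a downloaded file.
payload = b"depth,dt,rhob\n100.0,80.5,2.30\n"
entry = "md5:" + hashlib.md5(payload).hexdigest()

print(hash_matches(payload, entry))            # unchanged bytes
print(hash_matches(payload + b"extra", entry)) # modified bytes
```

If a real file fails this kind of check, `pooch` raises an error instead of silently handing back the wrong data.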

## CSV or TSV files

CSV (comma-separated values) and TSV (tab-separated values) files are very common: plain text with a delimiter character (often `,`, `;` or a tab) separating the columns, and newlines separating records. There are a number of ways to load these, depending on the intended use case. NumPy and pandas are probably the most common.

In [2]:
import numpy as np
import pandas as pd

In [3]:
fname = spot.fetch("L-30_Depth-DT-RHOB.csv")
depth, dt, rhob = np.genfromtxt(fname, delimiter=',')
depth.shape, dt.shape, rhob.shape

((25621,), (25621,), (25621,))

In [4]:
depth, dt, rhob = np.loadtxt(fname, delimiter=',')
depth.shape, dt.shape, rhob.shape

((25621,), (25621,), (25621,))
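
Unpacking the result of `np.loadtxt` or `np.genfromtxt` straight into names iterates over the rows of the array, so it only works when each variable occupies one row of the file. For a file laid out the other way, with one column per variable and a header row, `unpack=True` transposes the result first. The sketch below assumes such a column-oriented layout; the values are invented and read from an in-memory string rather than from disk.

```python
import io

import numpy as np

# Invented, column-oriented data with a header row.
text = (
    "depth,dt,rhob\n"
    "100.0,80.5,2.30\n"
    "100.5,81.0,2.31\n"
    "101.0,79.5,2.33\n"
    "101.5,78.0,2.35\n"
)

# skiprows=1 drops the header; unpack=True returns one array per column.
depth_demo, dt_demo, rhob_demo = np.loadtxt(
    io.StringIO(text), delimiter=",", skiprows=1, unpack=True
)

print(depth_demo.shape, dt_demo.shape, rhob_demo.shape)
```

`np.genfromtxt` accepts `unpack=True` as well, and uses `skip_header` rather than `skiprows` for the header line.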

In [5]:
# TODO: Change path in prod to `../data`
np.savetxt('./data/L-30_depth_np_export', depth)

In [6]:
fname = spot.fetch("Norway_field_production_monthly.csv")
df = pd.read_csv(fname)
df.head()

Unnamed: 0,prfInformationCarrier,prfYear,prfMonth,prfPrdOilNetMillSm3,prfPrdGasNetBillSm3,prfPrdNGLNetMillSm3,prfPrdCondensateNetMillSm3,prfPrdOeNetMillSm3,prfPrdProducedWaterInFieldMillSm3,prfNpdidInformationCarrier
0,24/9-12 S (Frosk),2019,8,0.01705,0.00068,0.0,0.0,0.01772,0.00061,31140456
1,24/9-12 S (Frosk),2019,9,0.05557,0.00323,0.0,0.0,0.0588,0.0,31140456
2,24/9-12 S (Frosk),2019,10,0.04403,0.00258,0.0,0.0,0.04661,0.0,31140456
3,24/9-12 S (Frosk),2019,11,0.0535,0.00299,0.0,0.0,0.05648,0.0,31140456
4,24/9-12 S (Frosk),2019,12,0.05825,0.00297,0.0,0.0,0.06123,9e-05,31140456
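
An `Unnamed: 0` column like the one above usually means the file was written with its index included and then read back without saying so. A minimal sketch of the options, using a small invented DataFrame and an in-memory buffer in place of a real file:

```python
import io

import pandas as pd

# Invented data standing in for a production table.
df_demo = pd.DataFrame({"year": [2019, 2020], "oil": [0.017, 0.055]})

buffer = io.StringIO()
df_demo.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # index comes back as 'Unnamed: 0'

buffer.seek(0)
clean = pd.read_csv(buffer, index_col=0)  # treat first column as the index

# Alternatively, leave the index out when writing; sep sets the delimiter.
semicolon_text = df_demo.to_csv(sep=";", index=False)

print(list(naive.columns))
print(list(clean.columns))
print(semicolon_text)
```

Which option fits depends on whether the index carries meaning; for a plain row counter, `index=False` on writing is usually the simplest.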


In [13]:
df.columns

Index(['prfInformationCarrier', 'prfYear', 'prfMonth', 'prfPrdOilNetMillSm3',
       'prfPrdGasNetBillSm3', 'prfPrdNGLNetMillSm3',
       'prfPrdCondensateNetMillSm3', 'prfPrdOeNetMillSm3',
       'prfPrdProducedWaterInFieldMillSm3', 'prfNpdidInformationCarrier'],
      dtype='object')

In [19]:
# TODO: Change path in prod to `../data`
oil_2021 = df.loc[df['prfYear'] == 2021, 'prfPrdOilNetMillSm3']
oil_2021.to_csv('./data/Norway_oil_2021.csv')  # example filename
oil_2021

17       0.02617
18       0.02174
19       0.00600
20       0.00000
21       0.00000
          ...   
22459    0.00000
22460    0.00000
22461    0.00000
22462    0.00000
22463    0.00000
Name: prfPrdOilNetMillSm3, Length: 620, dtype: float64

In [None]:
df.to_csv('./data/Norway_field_production_export.csv', index=False)  # example filename

## Excel Files

The easiest option for Excel files is definitely pandas. You will also need to install `openpyxl`, since pandas uses it in the background as an optional dependency for `*.xlsx` files (`xlrd` is only needed for the legacy `*.xls` format).

In [None]:
df = pd.read_excel(spot.fetch("FMI_run2_feature_picks.xlsx"))

It is worth noting that you can either read individual worksheets with the `sheet_name` argument, or pass `sheet_name=None` to load every worksheet into one dictionary of DataFrames keyed by sheet name.

In [None]:
df.to_excel('./data/FMI_feature_picks_export.xlsx')  # example filename

## Databases

There are numerous ways of reading a database, and the right one partly depends on the type of database. Pandas can read and write SQL, so it is a reasonable starting point.

For a more powerful and flexible option, consider [sqlalchemy](https://www.sqlalchemy.org/).
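
As a minimal sketch of the pandas route, the example below writes a small DataFrame to an in-memory SQLite database using the built-in `sqlite3` module, then reads part of it back with a query. The table name, column names and values are invented for illustration; a real database would need its own connection details.

```python
import sqlite3

import pandas as pd

# Invented data; table and column names are hypothetical.
production = pd.DataFrame({
    "field": ["A", "A", "B"],
    "year": [2019, 2020, 2020],
    "oil": [0.05, 0.04, 0.10],
})

conn = sqlite3.connect(":memory:")  # temporary database, nothing on disk

# Write the DataFrame to a table, without the index column.
production.to_sql("production", conn, index=False)

# Read back only the rows and columns we want.
recent = pd.read_sql(
    "SELECT field, oil FROM production WHERE year = 2020", conn
)
conn.close()

print(recent)
```

Pushing the filtering into the query means only the matching rows are loaded into memory, which matters for large tables.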

## JSON

JavaScript Object Notation (JSON) is a very common format for exchanging information on the internet, so you may get it back from various Application Programming Interfaces (APIs). It is very similar to a Python `dict`, which is how the data is usually handled once loaded. There is a built-in library for working with it, logically enough named `json`. As well as files (`json.load` and `json.dump`), it can handle JSON held in a string (`json.loads` and `json.dumps`).

In [None]:
import json

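In [None]:
with open(spot.fetch("Offshore_wells.geojson")) as f:  # GeoJSON is plain JSON
    data = json.load(f)

In [None]:
with open('./data/Offshore_wells_export.json', 'w') as f:  # example filename
    json.dump(data, f)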
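
When the JSON arrives as a string, for example in the body of an API response, the string variants do the same job without touching the file system. A small sketch with an invented record (the keys and values are made up):

```python
import json

# Invented JSON text, as might be returned by an API.
text = '{"well": "B-41", "curves": ["DT", "RHOB"], "onshore": false}'

record = json.loads(text)       # JSON string -> Python dict
record["curves"].append("GR")   # work with it like any other dict

round_trip = json.dumps(record, indent=2)  # Python dict -> JSON string

print(type(record).__name__)
print(round_trip)
```

Note that JSON's `false` becomes Python's `False` on loading, and is converted back on dumping.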

<hr/>

The following are more geoscience or subsurface data formats.

## Shapefiles

These are a common geographic information system (GIS) format, originally developed by Esri. A simple way to load these is to use geopandas:

In [None]:
import geopandas as gpd

In [None]:
gdf = gpd.read_file(spot.fetch("Offshore_wells.geojson"))  # no shapefile in the test data yet

Because geopandas uses `fiona` in the background for file handling, it can handle the following formats in addition to shapefiles. Formats marked `'r'` can be read from, those marked `'w'` can be written to, and those marked `'a'` can be appended to.

In [None]:
import fiona
fiona.supported_drivers

In [None]:
gdf.to_file('./data/Offshore_wells_export.shp')  # example filename; writes a shapefile

## LAS files

`lasio` is a library that can read LAS2 files, but `welly` is a wrapper around it that may be nicer for everyday use:

In [None]:
from welly import Well, Project

In [None]:
w = Well.from_las(spot.fetch("B-41.las"))

Welly can also load an entire directory of LAS files into a `Project`:

In [None]:
p = Project.from_las('./data/*.las')

## SEG-Y

The SEG-Y format is widely used, although any given file can be tricky to load. Equinor has written a low-level library named [`segyio`](https://github.com/equinor/segyio) which can (with some effort in some cases) read and write SEG-Y files and headers.

In [None]:
import segyio

In [None]:
with segyio.open(spot.fetch("F3_8-bit_int.sgy")) as s:
    vol = segyio.tools.cube(s)  # read the whole volume as a 3D array

Because `segyio` is intended for relatively low-level operations, it can take a fair amount of work to get things working. An alternative built on top of it is the SEG-Y Swiss Army Knife ([SEGYSAK](https://segysak.readthedocs.io/en/latest/index.html)), which is intended to make common operations a little easier. It also interfaces with `xarray`, which builds on numpy by adding labelled dimensions and coordinates, and is well worth a look.

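In [None]:
from segysak.segy import segy_loader

In [None]:
# Assumes inline/crossline numbers sit at the standard trace header bytes 189 and 193.
ds = segy_loader(spot.fetch("F3_8-bit_int.sgy"), iline=189, xline=193)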

## DLIS files

Equinor has written a library named `dlisio` that can handle DLIS files:

In [None]:
import dlisio

In [None]:
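# Still to be confirmed. Sketch assuming dlisio >= 0.3; one DLIS file can hold several logical files.
f, *tail = dlisio.dlis.load(spot.fetch("FMI_Run3_processed.dlis"))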

## Other Assorted Formats

The subsurface world is filled with all sorts of other formats. Agile Scientific has written a library named `gio` that can handle a variety of these, such as OpendTect horizons, Surfer 7 grids, and ZMaps. These are loaded as `xarray` objects. The documentation has [more details](https://code.agilescientific.com/gio/index.html).

In [None]:
import gio

In [None]:
data = gio.read_odt(fname)