# Data for Playground

The Playground accepts data in a JSON file, structured like this:

[
  {"x": 100, "y":  2.3, "label": -1},
  {"x": 200, "y": -1.2, "label": -1},
  {"x": 340, "y": 14.9, "label":  1}
]
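Before loading a file into the Playground, it can help to check that it parses as JSON and that every record has the three expected keys. Here's a minimal sketch of such a check. It is not part of the Playground: the helper name `check_playground_records` is hypothetical, and the rule that all three values must be numbers is my assumption based on the structure above.

```python
import json
from numbers import Real

REQUIRED_KEYS = {"x", "y", "label"}


def check_playground_records(text):
    """Parse JSON text and return the records if they look like Playground data.

    Hypothetical helper: assumes a JSON array of objects, each with
    numeric 'x', 'y' and 'label' values.
    """
    records = json.loads(text)  # Raises ValueError on invalid JSON.
    if not isinstance(records, list):
        raise ValueError("Expected a JSON array of records")
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"Record {i} is not a JSON object")
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"Record {i} is missing keys: {sorted(missing)}")
        for key in sorted(REQUIRED_KEYS):
            value = record[key]
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, Real):
                raise ValueError(f"Record {i} has a non-numeric {key!r}")
    return records


sample = '[{"x": 100, "y": 2.3, "label": -1}, {"x": 340, "y": 14.9, "label": 1}]'
records = check_playground_records(sample)
print(len(records), "records look OK")
```

Note that JSON requires double-quoted keys, so a Python-style literal with single quotes fails at the `json.loads` step. pandas writes missing values as `null`, which this check also rejects.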

For classification problems, `label` should be -1 for the negative class and 1 for the positive class. Any other values will be min/max scaled to those numbers.

For regression problems, `label` should be min/max scaled to the range [-1, 1]. The Playground does this for you, so you probably don't need to worry about it.
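To make the label scaling concrete, here's a minimal sketch of min/max scaling to [-1, 1]. The Playground's own code isn't shown here, so this is the standard formula under my assumptions rather than the app's implementation; the function name `minmax_scale` is just for illustration.

```python
def minmax_scale(values, lo=-1.0, hi=1.0):
    """Linearly rescale values so the minimum maps to lo and the maximum to hi."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        raise ValueError("Cannot min/max scale constant values")
    span = vmax - vmin
    return [lo + (hi - lo) * (v - vmin) / span for v in values]


# Classification labels given as 0 and 1 end up as -1 and 1.
print(minmax_scale([0, 1, 1, 0]))        # [-1.0, 1.0, 1.0, -1.0]

# Regression labels are spread linearly between -1 and 1.
print(minmax_scale([12.0, 14.5, 17.0]))  # [-1.0, 0.0, 1.0]
```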

`x` and `y` will be standardized, that is, scaled to zero mean and a standard deviation of 1. (The train/test split is done in the app; scaling uses statistics from the training set only.)
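Here's a small sketch of what standardizing with training-set statistics means in practice: the mean and standard deviation come from the training values only, and the same two numbers are then applied to the test values. The function names are hypothetical, and I'm assuming the population standard deviation (as scikit-learn's `StandardScaler` uses); the app may differ in that detail.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Return the mean and population standard deviation of the training values."""
    mean = fmean(train_values)
    std = pstdev(train_values, mu=mean)
    if std == 0:
        raise ValueError("Cannot standardize constant values")
    return mean, std


def standardize(values, mean, std):
    """Apply previously fitted statistics to any set of values."""
    return [(v - mean) / std for v in values]


train = [100.0, 200.0, 300.0]
test = [400.0]

# Fit on the training set only, then reuse the statistics for the test set.
mean, std = fit_standardizer(train)
train_z = standardize(train, mean, std)
test_z = standardize(test, mean, std)

print(train_z)
print(test_z)
```

The training values come out with zero mean and unit standard deviation. The test values generally won't, because they are scaled with statistics they didn't contribute to.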

## Classification: Sand-shale data from RPC

Let's import some data. Here I'm using an extract from the [Rock Property Catalog](https://subsurfwiki.org/wiki/Rock_Property_Catalog).

In [3]:
import numpy as np
import pandas as pd

df = pd.read_csv('https://geocomp.s3.amazonaws.com/data/RPC_simple.csv')

# Label sandstone as the positive class (1) and everything else as negative (-1).
df['label'] = np.where(df['Lithology'] == 'sandstone', 1, -1)

# Use the Vp and Rho_n columns as the two features.
df[['x', 'y']] = df[['Vp', 'Rho_n']]
df[['x', 'y', 'label']].to_json('sand-shale.json', orient='records')
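To see what `orient='records'` produces without downloading anything, here's a sketch on a tiny stand-in DataFrame. The two rows and their values are made up for illustration and are not taken from the RPC; the labels follow the -1/1 convention described at the top.

```python
import json

import pandas as pd

# Hypothetical stand-in for the RPC extract: two rows with made-up values.
df = pd.DataFrame({
    "Vp": [3000.0, 2500.0],
    "Rho_n": [2.4, 2.3],
    "Lithology": ["sandstone", "shale"],
})

# Sandstone is the positive class (1); everything else is negative (-1).
df["label"] = df["Lithology"].eq("sandstone").map({True: 1, False: -1})
out = df.rename(columns={"Vp": "x", "Rho_n": "y"})[["x", "y", "label"]]

# With no path argument, to_json returns the JSON text instead of writing a file.
text = out.to_json(orient="records")
records = json.loads(text)
print(records)
```

Each row becomes one JSON object with `x`, `y` and `label` keys, which is the structure the Playground expects.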

## Regression: DTS data from R-39 well

In [9]:
import numpy as np
import pandas as pd
import welly

r39 = welly.Well.from_las('https://geocomp.s3.amazonaws.com/data/R-39.las', index='original')

# Clean up the shear sonic (DTS) log: treat negative values as missing,
# then interpolate across the gaps.
dts = r39.data['DT4S']
dts[dts < 0] = np.nan
r39.data['DT4S'] = dts.interpolate()

# Export RHOB as x, DT4P as y and DT4S as the label, sampled with step=2.0.
data = r39.data_as_matrix(keys=['RHOB', 'DT4P', 'DT4S'], step=2.0)
df = pd.DataFrame(data, columns=['x', 'y', 'label'])
df.to_json('dts.json', orient='records')
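If you don't have welly installed, the clean-up step above can be illustrated with plain pandas. This is a sketch on made-up numbers, not the R-39 data, and it assumes linear interpolation; the interpolation welly applies to the curve may differ in detail.

```python
import pandas as pd

# Made-up shear sonic values; the negative numbers stand in for null samples.
dts = pd.Series([250.0, -999.25, 270.0, -999.25, -999.25, 300.0])

# Mask the negative values as NaN, then fill the gaps by linear interpolation.
clean = dts.where(dts >= 0).interpolate()

print(clean.tolist())
```

The gaps are filled with evenly spaced values between their valid neighbours, so no missing values remain in the exported label.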

## Regression: Porosity data

I'm using a small dataset originally from [**Geoff Bohling**](http://people.ku.edu/~gbohling/) at the Kansas Geological Survey. I can no longer find the data online, but here's what he says about it:

> Our example data consist of vertically averaged porosity values, in percent, in Zone A of the Big Bean Oil Field (fictitious, but based on real data). Porosity values are available from 85 wells distributed throughout the field, which is approximately 20 km in east-west extent and 16 km north-south. The porosities range from 12% to 17%.

[Read more about it](http://discoverspatial.in/wp-content/uploads/2018/04/IntroToGeostatistics-Geoff-Bohling.pdf)

In [12]:
import pandas as pd

fname = "https://geocomp.s3.amazonaws.com/data/ZoneA.dat"

df = pd.read_csv(fname,
                 sep=' ',
                 header=9,
                 usecols=[0, 1, 2, 3, 4],
                 names=['x', 'y', 'thick', 'por', 'perm'],
                 dtype="float64",
                 na_values=[-999.9999],
                )

df.head()

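|   | x | y | thick | por | perm |
|---|---|---|---|---|---|
| 0 | 12100.0 | 8300.0 | 37.1531 | 14.6515 | 2.8547 |
| 1 | 5300.0 | 8700.0 | 31.4993 | 14.5093 | NaN |
| 2 | 3500.0 | 13900.0 | 36.9185 | 14.0639 | NaN |
| 3 | 5100.0 | 1900.0 | 24.0156 | 15.1084 | 1.1407 |
| 4 | 9900.0 | 13700.0 | 35.0411 | 13.919 | NaN |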


In [13]:
# Use porosity as the regression target.
df['label'] = df.por
df[['x', 'y', 'label']].to_json('porosity.json', orient='records')

--- 

&copy; 2021 Agile Scientific