# Data for Playground

Let's look at one of the most fundamental types of machine learning task: a binary classification problem. By definition, there are two classes. You can think of them as a 'positive' and 'negative' class.

First we'll import some data. I'm using an extract from the [Rock Property Catalog](https://subsurfwiki.org/wiki/Rock_Property_Catalog).

## Sand-shale data from RPC

In [32]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('https://geocomp.s3.amazonaws.com/data/RPC_simple.csv')
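# Encode lithology as a binary label: +1 for sandstone, -1 for everything else.
df['label'] = df['Lithology'].eq('sandstone').map({True: 1, False: -1})
# Standardize Vp and density, then double so one standard deviation spans 2 units.
df[['vp', 'rho']] = 2 * StandardScaler().fit_transform(df[['Vp', 'Rho_n']])
# Export as x, y, label records.
df[['vp', 'rho', 'label']].rename(columns={'vp': 'x', 'rho': 'y'}).to_json('sand-shale.json', orient='records')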

In [31]:
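# Note: 'vp' and 'rho' already include the factor of 2 applied in the cell above,
# so the extra 2* here prints values on twice the scale written to sand-shale.json.
for x, y, label in df[['vp', 'rho', 'label']].values:
    print(f"{{x: {2*x:.3f}, y: {2*y:.3f}, label: {int(label)}}},")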

{x: -1.941, y: -2.535, label: 1},
{x: -2.066, y: -2.713, label: 1},
{x: -1.058, y: -2.318, label: 1},
{x: -1.525, y: -1.815, label: 1},
{x: 1.369, y: 0.914, label: 1},
{x: 1.596, y: 0.788, label: 1},
{x: 2.189, y: 0.201, label: 1},
{x: -0.572, y: -3.122, label: 1},
{x: -0.610, y: -1.572, label: 1},
{x: -0.498, y: -2.226, label: 1},
{x: -0.838, y: -1.723, label: 1},
{x: -1.089, y: -2.806, label: 1},
{x: -1.123, y: -2.069, label: 1},
{x: -1.182, y: -1.753, label: 1},
{x: 1.798, y: -1.877, label: 1},
{x: -1.012, y: -3.496, label: 1},
{x: -1.831, y: -2.473, label: 1},
{x: -0.354, y: -1.795, label: 1},
{x: 4.614, y: -2.477, label: 1},
{x: -0.344, y: -2.718, label: 1},
{x: 0.757, y: 0.937, label: 1},
{x: -0.955, y: -2.311, label: 1},
{x: 1.680, y: 0.531, label: 1},
{x: -1.941, y: -2.997, label: 1},
{x: -1.493, y: -2.186, label: 1},
{x: -1.799, y: -2.525, label: 1},
{x: 2.213, y: 0.516, label: 1},
{x: -2.013, y: -2.891, label: 1},
{x: -1.752, y: -2.815, label: 1},
{x: 1.722, y: 0.204, label: 
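
As a rough sketch of what the scaling step in this section does, the following reproduces the z-score calculation with plain NumPy. The four rows are made-up stand-ins for the `Vp` and `Rho_n` columns, not values from the catalog. It assumes `StandardScaler`'s default behaviour: subtract each column's mean and divide by its population standard deviation.

```python
import numpy as np

# Hypothetical toy values standing in for the 'Vp' and 'Rho_n' columns.
X = np.array([
    [3000.0, 2300.0],
    [3500.0, 2450.0],
    [4200.0, 2550.0],
    [5000.0, 2650.0],
])

# Z-score each column: subtract the column mean, then divide by the
# column standard deviation (population form, ddof=0).
mean = X.mean(axis=0)
std = X.std(axis=0)
Z = (X - mean) / std

# Double the result, as the notebook does, so one standard deviation
# corresponds to 2 units on each axis.
scaled = 2 * Z

print(scaled.round(3))
```

After this step each column has zero mean and a standard deviation of 2, which puts both features on a comparable scale regardless of their original units.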

## DTS data from R-39 well

In [None]:
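import numpy as np
import pandas as pd
import welly
from sklearn.preprocessing import MinMaxScaler

# Load the well, then resample the density and two sonic logs with a step of 2.0.
r39 = welly.Well.from_las('https://geocomp.s3.amazonaws.com/data/R-39.las', index='original')
data = r39.data_as_matrix(keys=['RHOB', 'DT4P', 'DT4S'], step=2.0)
# Shuffle the rows in place so the records are no longer ordered by depth.
np.random.shuffle(data)
# Scale every column to [-1, 1].
minmax = MinMaxScaler(feature_range=(-1, 1))
data = minmax.fit_transform(data)
# Stretch the two features (RHOB, DT4P) to [-6, 6]; DT4S stays in [-1, 1] as the label.
data *= np.array([6, 6, 1])
df = pd.DataFrame(data, columns=['x', 'y', 'label'])
df.to_json('dts.json', orient='records')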
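
The min-max step above can be sketched without scikit-learn. This is a minimal illustration, assuming the usual linear mapping of each column's minimum and maximum onto the ends of the target range. The helper `minmax_scale` and the four sample rows are hypothetical and introduced only for this example.

```python
import numpy as np

def minmax_scale(a, lo=-1.0, hi=1.0):
    """Linearly map each column of `a` onto [lo, hi], ignoring NaNs."""
    a_min = np.nanmin(a, axis=0)
    a_max = np.nanmax(a, axis=0)
    return lo + (a - a_min) / (a_max - a_min) * (hi - lo)

# Hypothetical rows of (RHOB, DT4P, DT4S); not values from the R-39 well.
data = np.array([
    [2.30, 80.0, 150.0],
    [2.45, 70.0, 130.0],
    [2.60, 60.0, 110.0],
    [2.50, 65.0, 125.0],
])

# Scale all three columns to [-1, 1] ...
scaled = minmax_scale(data)
# ... then stretch only the two feature columns to [-6, 6].
scaled = scaled * np.array([6, 6, 1])

print(scaled.round(3))
```

Multiplying by `[6, 6, 1]` relies on NumPy broadcasting: each row is multiplied element-wise by the same three factors, so the label column is left in [-1, 1].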

## Porosity data

I'm using a small dataset originally from [**Geoff Bohling**](http://people.ku.edu/~gbohling/) at the Kansas Geological Survey. I can no longer find the data online, but here's what he says about it:

> Our example data consist of vertically averaged porosity values, in percent, in Zone A of the Big Bean Oil Field (fictitious, but based on real data). Porosity values are available from 85 wells distributed throughout the field, which is approximately 20 km in east-west extent and 16 km north-south. The porosities range from 12% to 17%.

[Read more about it](http://discoverspatial.in/wp-content/uploads/2018/04/IntroToGeostatistics-Geoff-Bohling.pdf)

In [2]:
import pandas as pd

fname = "https://geocomp.s3.amazonaws.com/data/ZoneA.dat"

df = pd.read_csv(fname,
                 sep=' ',
                 header=9,
                 usecols=[0, 1, 2, 3, 4],
                 names=['x', 'y', 'thick', 'por', 'perm'],
                 dtype="float64",
                 na_values=[-999.9999],
                )

df.head()

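|   | x | y | thick | por | perm |
|---|---|---|---|---|---|
| 0 | 12100.0 | 8300.0 | 37.1531 | 14.6515 | 2.8547 |
| 1 | 5300.0 | 8700.0 | 31.4993 | 14.5093 | NaN |
| 2 | 3500.0 | 13900.0 | 36.9185 | 14.0639 | NaN |
| 3 | 5100.0 | 1900.0 | 24.0156 | 15.1084 | 1.1407 |
| 4 | 9900.0 | 13700.0 | 35.0411 | 13.919 | NaN |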


In [None]:
from sklearn.preprocessing import MinMaxScaler

dg = df.copy()
norm = MinMaxScaler(feature_range=(-5.5, 5.5))
dg[['x', 'y']] = norm.fit_transform(df[['x', 'y']])
dg['label'] = 2 * ((dg.por - dg.por.min()) / (dg.por.max() - dg.por.min()) - 0.5)
dg[['x', 'y', 'label']].to_json('porosity.json', orient='records')
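
The `label` expression above is a min-max rescaling of porosity onto $[-1, 1]$. Writing $\phi$ for `por` (notation introduced here only for illustration), it can be rearranged as:

$$
\text{label} = 2\left(\frac{\phi - \phi_{\min}}{\phi_{\max} - \phi_{\min}} - \frac{1}{2}\right) = -1 + 2\,\frac{\phi - \phi_{\min}}{\phi_{\max} - \phi_{\min}}
$$

So the lowest-porosity well gets a label of $-1$, the highest gets $+1$, and a porosity midway between the two extremes gets $0$. This is the same mapping that `MinMaxScaler(feature_range=(-1, 1))` would apply to a single column.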

--- 

&copy; 2021 Agile Scientific