## Getting testing data coordinates from S3

In this notebook we will download a DataFrame that contains the coordinates of the testing data.

The target columns, however, are all blank (`NaN`).

We expect competitors to run their predictive models and fill in the blanks
with their predictions at the given inline/crossline/two-way time (IL/XL/TWT) locations.

There are two blind wells for evaluation.

Once results are submitted, we will calculate R<sup>2</sup> scores against the ground truth data
and post the results on the leaderboard.

Doing imports.

In [1]:
from pandas import read_json

## Loading AWS Credentials

See Tutorial #1 for setting up credentials on a local machine.

## Getting the Blank DataFrame

The blind wells are in the same format as the training wells.

Their coordinates (inline, crossline, and two-way time values) are provided.

We expect you to run your feature extraction, feature engineering, and predictions around or at
these locations and populate this DataFrame without adding or removing rows or columns, or otherwise changing its shape.

We will then evaluate the results by comparing them to the ground truth data.

See the output of this cell for what it looks like.

**Any result that is not the same shape as this DataFrame will not be considered.**

**Any result that has `NaN` values in the results DataFrame will not be considered.**
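
Both rules can be checked locally before uploading. The following is a minimal sketch, not the organizers' official validation: `check_submission` is a hypothetical helper, and the toy `template` and `filled` frames (with made-up values) stand in for the blank DataFrame and your populated results.

```python
import numpy as np
import pandas as pd


def check_submission(template: pd.DataFrame, result: pd.DataFrame) -> list:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    # Rule 1: the result must have exactly the same shape as the blank template.
    if result.shape != template.shape:
        problems.append(f"shape {result.shape} differs from template {template.shape}")
    # Rule 2: the result must not contain any NaN values.
    if result.isna().any().any():
        nan_columns = result.columns[result.isna().any()].tolist()
        problems.append(f"NaN values in columns: {nan_columns}")
    return problems


# Toy stand-in for the blank template (hypothetical values).
template = pd.DataFrame(
    {
        "inline": [2407.2, 2407.3],
        "xline": [1389.0, 1389.1],
        "rhob": [np.nan, np.nan],
        "p_impedance": [np.nan, np.nan],
        "s_impedance": [np.nan, np.nan],
    }
)

# Fill the blank columns with made-up predictions.
filled = template.assign(
    rhob=[2.3, 2.4],
    p_impedance=[25000.0, 26000.0],
    s_impedance=[13000.0, 13500.0],
)

print(check_submission(template, filled))    # []
print(check_submission(template, template))  # one problem: the NaN columns
```

An empty list means the frame passes both checks; anything else describes what would cause the result to be rejected.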

In [None]:
well_bucket = 's3://sagemaker-gitc2021/poseidon/wells/'
well_file = 'poseidon_geoml_testing_wells_blank.json.gz'

well_df = read_json(
    path_or_buf=well_bucket + well_file,
    compression='gzip',
)

well_df.set_index(['well_id', 'twt'], inplace=True)

well_df

Below are the statistics. As you can see, the `rhob`, `p_impedance`, and `s_impedance` columns are blank (their count is 0).

In [3]:
well_df.describe()

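| | inline | xline | rhob | p_impedance | s_impedance |
|---|---|---|---|---|---|
| count | 1773.0 | 1773.0 | 0.0 | 0.0 | 0.0 |
| mean | 3007.407384 | 1766.345919 | NaN | NaN | NaN |
| std | 386.799776 | 584.865566 | NaN | NaN | NaN |
| min | 2407.1763 | 1389.0076 | NaN | NaN | NaN |
| 25% | 2408.2571 | 1389.0613 | NaN | NaN | NaN |
| 50% | 3256.5844 | 1389.2113 | NaN | NaN | NaN |
| 75% | 3257.0021 | 2671.0803 | NaN | NaN | NaN |
| max | 3257.3135 | 2675.9386 | NaN | NaN | NaN |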


We will accept results uploaded to the S3 buckets listed below.

Please use the following paths and file names as a template. The code cell after the
explanation has an example.

Submitted predictions should be in the same units as the original logs, so no unit transformation should be
necessary. The seismic velocities are in *meters/second*. Converting them may not be necessary,
since the conversion is just a scalar factor; however, if you like, we recommend converting them to *feet/second*.

**Impedances:** *(feet/second) x (grams/cm<sup>3</sup>)*, i.e. velocity x density

**Density:** *grams/cm<sup>3</sup>*
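
As a sketch of the arithmetic only, assuming a velocity in meters/second and a density in grams/cm<sup>3</sup>: one foot is exactly 0.3048 meters, so dividing by 0.3048 converts to feet/second, and impedance is velocity times density. The function names and sample values below are illustrative and do not come from the competition data.

```python
METERS_PER_FOOT = 0.3048  # exact, by definition of the international foot


def to_feet_per_second(velocity_m_per_s: float) -> float:
    """Convert a velocity from meters/second to feet/second."""
    return velocity_m_per_s / METERS_PER_FOOT


def impedance(velocity_ft_per_s: float, density_g_per_cm3: float) -> float:
    """Impedance in (feet/second) x (grams/cm^3): velocity times density."""
    return velocity_ft_per_s * density_g_per_cm3


# Made-up sample values: 3048 m/s is about 10000 ft/s.
velocity_ft = to_feet_per_second(3048.0)
sample_impedance = impedance(velocity_ft, 2.5)

print(round(velocity_ft, 3))
print(round(sample_impedance, 3))
```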

#### **Intermediate Results:**

`bucket =` *`s3://sagemaker-gitc2021/poseidon/wells/submissions/intermediate/`*

`file_name =` *`TeamName_Intermediate_Results_YYYYMMDD.json`*

#### **Final:**

`bucket =` *`s3://sagemaker-gitc2021/poseidon/wells/submissions/final/`*

`file_name =` *`TeamName_Final_Results_YYYYMMDD.json`*

#### **If you are submitting multiple results per day, please add a two-digit integer (like 01, 02, etc.) after Intermediate/Final**
This will look like: `file_name =` *`TeamName_Intermediate05_Results_YYYYMMDD.json`*
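
The naming template can be assembled programmatically to avoid typos. This is a small sketch: `submission_file_name` and its parameters are hypothetical names introduced here, and the team name and date are placeholders.

```python
from datetime import date
from typing import Optional


def submission_file_name(
    team: str, stage: str, day: date, sequence: Optional[int] = None
) -> str:
    """Build a file name following the template above."""
    if stage not in ("Intermediate", "Final"):
        raise ValueError("stage must be 'Intermediate' or 'Final'")
    # Optional two-digit counter for multiple submissions on the same day.
    suffix = "" if sequence is None else f"{sequence:02d}"
    return f"{team}_{stage}{suffix}_Results_{day:%Y%m%d}.json"


print(submission_file_name("MyTeam", "Intermediate", date(2021, 4, 16)))
# MyTeam_Intermediate_Results_20210416.json
print(submission_file_name("MyTeam", "Intermediate", date(2021, 4, 16), sequence=5))
# MyTeam_Intermediate05_Results_20210416.json
```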

Final submissions must be in the `.json` format. This can be achieved by using the
following code snippet. This assumes your populated DataFrame variable is named `my_result`.

In [None]:
bucket = 's3://sagemaker-gitc2021/poseidon/wells/submissions/intermediate/'

file_name = 'MyTeam_Intermediate_Results_20210416.json'

# Making sure extension is in the file name.
if not file_name.lower().endswith('.json'):
    file_name += '.json'

my_result.reset_index(inplace=True)
my_result.to_json(
    path_or_buf=bucket + file_name,
    double_precision=4,
)
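
Before uploading, it may be worth confirming that the JSON round-trips cleanly. The sketch below writes to an in-memory string instead of S3 and reads it back the same way the blank DataFrame was loaded; the toy `my_result` frame and its values are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Toy stand-in for a populated result, indexed like the blank DataFrame.
my_result = pd.DataFrame(
    {
        "well_id": ["well_a", "well_a", "well_b"],
        "twt": [1000.0, 1004.0, 1000.0],
        "inline": [2407.2, 2407.3, 3256.6],
        "xline": [1389.0, 1389.1, 2671.1],
        "rhob": [2.31234567, 2.42, 2.38],
        "p_impedance": [25000.5, 26000.25, 25500.0],
        "s_impedance": [13000.1, 13500.2, 13250.3],
    }
).set_index(["well_id", "twt"])

# With no path given, to_json returns the JSON text instead of writing a file.
json_text = my_result.reset_index().to_json(double_precision=4)

# Read it back and restore the index.
restored = pd.read_json(StringIO(json_text)).set_index(["well_id", "twt"])

print(restored.shape == my_result.shape)  # shape survives the round trip
print(restored.notna().all().all())       # no NaN values were introduced
```

Note that `double_precision=4` keeps four decimal places, so values read back can differ slightly from the originals.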

This will be our scoring metric:

In [None]:
import sklearn.metrics

r2_columns = ['p_impedance', 's_impedance', 'rhob']  # Get only relevant data columns
# Keep rows where at least one ground-truth column is not NaN (drops all-NaN rows)
not_na = ground_truth[r2_columns].notna().any(axis=1)
sklearn.metrics.r2_score(
    y_true=ground_truth[r2_columns][not_na],
    y_pred=team_df[r2_columns][not_na],
    multioutput='variance_weighted',
)
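
For intuition about `multioutput='variance_weighted'`: the per-column R<sup>2</sup> scores are averaged with weights proportional to each column's variance in the ground truth. The sketch below reproduces that arithmetic with NumPy on made-up numbers. `variance_weighted_r2` is a hypothetical helper, not the organizers' scoring code, and it assumes no ground-truth column is constant.

```python
import numpy as np


def variance_weighted_r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """R^2 over several columns, weighted by each column's variance in y_true."""
    # Residual and total sums of squares, computed per column.
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    per_column_r2 = 1.0 - ss_res / ss_tot
    # Columns with a larger spread in the ground truth get a larger weight.
    weights = ss_tot / ss_tot.sum()
    return float((weights * per_column_r2).sum())


# Made-up data: three rows, two columns on very different scales.
y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
perfect = y_true.copy()
mean_only = np.tile(y_true.mean(axis=0), (3, 1))

print(variance_weighted_r2(y_true, perfect))    # approximately 1: perfect prediction
print(variance_weighted_r2(y_true, mean_only))  # 0.0: no better than predicting the mean
```

Because of the weighting, columns with large numeric spread (such as impedances) influence the score more than columns with small spread (such as density).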