# Borehole extraction demo

Here's a short guide on running borehole.extraction in a few scenarios.

## Initialising variables
First, import the extraction module from the borehole package.

Then call extraction.init(). This initialises the lists of column and key terms that tables will be searched for. The terms are hardcoded as strings in extraction.py so that you can edit them easily, and init() converts them to lists.
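
To illustrate the string-to-list idea, here is a minimal sketch. The names `BH_COLUMN_TERMS_STR` and `parse_terms`, the newline separator and the two sample terms are assumptions made for this example; the actual variables and format in extraction.py may differ.

```python
# Hypothetical names and format: the real variables in extraction.py may differ.
BH_COLUMN_TERMS_STR = """
hole
hole no.
"""


def parse_terms(terms_str):
    """Split a newline-separated string into a list of stripped, non-empty terms."""
    return [line.strip() for line in terms_str.splitlines() if line.strip()]


bh_column_terms = parse_terms(BH_COLUMN_TERMS_STR)
print(bh_column_terms)
```

Keeping one term per line means a term can be added or removed without touching list syntax, which is presumably the "easy editing" the design is after.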

In [1]:
import sys
sys.path.append('../')
from borehole import extraction

extraction.init()

Before you go further, you may want to set a few variables.

By default, extraction results are saved to the file named by the bhcsv_all variable, which is 'bh_refs_all_tables.csv'. To save elsewhere, pass a different filename to the extraction functions as a parameter.

In [2]:
result_fname = 'example_result.csv'

By default (with training=True in the function parameters), the program looks for table source files in the folder named by the paths.training_file_folder variable, so make sure that location contains a populated tables/ folder.
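
If you'd like to confirm that the folder is in place before extracting, a quick check along these lines may help. This is a sketch only: `tables_folder_is_populated` is a hypothetical helper, not part of the borehole package, and it treats "populated" as "contains at least one file".

```python
from pathlib import Path


def tables_folder_is_populated(training_folder):
    """Return True if <training_folder>/tables exists and holds at least one file."""
    tables_dir = Path(training_folder) / "tables"
    return tables_dir.is_dir() and any(p.is_file() for p in tables_dir.iterdir())


# Usage (pass the value of paths.training_file_folder):
# tables_folder_is_populated(paths.training_file_folder)
```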

Next, decide whether to run borehole extraction for a set list of report IDs or for all report IDs that appear in filenames at a location.
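
For the second option, the report IDs have to be recovered from filenames. The sketch below shows one way that could be done by hand. It assumes the report ID is the first run of digits in each filename, which may not match your naming scheme, and `report_ids_from_filenames` is a hypothetical helper rather than a function from the borehole package.

```python
import re
from pathlib import Path


def report_ids_from_filenames(folder, pattern=r"\d+"):
    """Return the unique report IDs found in a folder's filenames, sorted as strings.

    Assumes the report ID is the first match of `pattern` in each filename.
    """
    ids = set()
    for path in Path(folder).iterdir():
        match = re.search(pattern, path.name)
        if path.is_file() and match:
            ids.add(match.group())
    return sorted(ids)
```

Files with no digits in their name are skipped, and an ID that appears in several filenames is returned once.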

## Extracting boreholes for given report IDs

If you're extracting for a specific set of report IDs, define that set as a list of strings, for example:

In [3]:
reports_str = '25335 34372 35500 36675 40923 41674 41720 41932 44638 48384 48406'
reportIDs = reports_str.split()
reportIDs

['25335',
 '34372',
 '35500',
 '36675',
 '40923',
 '41674',
 '41720',
 '41932',
 '44638',
 '48384',
 '48406']

And run extraction:

In [4]:
for e in reportIDs:
    extraction.extract_bh(e, fname=result_fname)

Borehole extraction for  25335  saved to  example_result.csv
Borehole extraction for  34372  saved to  example_result.csv
Borehole extraction for  35500  saved to  example_result.csv
Borehole extraction for  36675  saved to  example_result.csv
Borehole extraction for  40923  saved to  example_result.csv
Borehole extraction for  41674  saved to  example_result.csv
Borehole extraction for  41720  saved to  example_result.csv
Borehole extraction for  41932  saved to  example_result.csv
Borehole extraction for  44638  saved to  example_result.csv
Borehole extraction for  48384  saved to  example_result.csv
Borehole extraction for  48406  saved to  example_result.csv
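
For longer lists you may prefer a loop that carries on when one report fails. Whether extract_bh raises an exception on a problematic report isn't shown in this guide, so treat the following as a sketch under that assumption; `run_for_reports` is a hypothetical helper that takes the extraction function as an argument.

```python
def run_for_reports(extract, report_ids, fname):
    """Call extract(report_id, fname=fname) for each ID; return the IDs that raised."""
    failed = []
    for report_id in report_ids:
        try:
            extract(report_id, fname=fname)
        except Exception as err:  # broad on purpose, so the batch keeps going
            print(f"Extraction failed for {report_id}: {err}")
            failed.append(report_id)
    return failed


# Usage with the names from this guide:
# failed = run_for_reports(extraction.extract_bh, reportIDs, result_fname)
```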


Finish by running manage_data(result_fname) to remove duplicates from the saved file.

In [6]:
extraction.manage_data(result_fname)
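
The internals of manage_data() aren't covered here. As a generic illustration of what removing duplicates from a CSV file can involve, the sketch below keeps the header and the first occurrence of each data row. `drop_duplicate_rows` is a hypothetical helper and is not how the package necessarily does it.

```python
import csv


def drop_duplicate_rows(in_fname, out_fname):
    """Copy a CSV, keeping the header and the first occurrence of each data row.

    Returns the number of unique data rows written. Use two different filenames.
    """
    seen = set()
    with open(in_fname, newline="") as src, open(out_fname, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is not None:
            writer.writerow(header)
        for row in reader:
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                writer.writerow(row)
    return len(seen)
```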

And here's what the extraction file looks like:

In [10]:
import pandas as pd
df = pd.read_csv(result_fname)
df.head()

Unnamed: 0,DocID,File,BH,grid_loc_1,grid_loc_2,geo_loc_1,geo_loc_2,BH source,grid_loc_1 source,grid_loc_2 source,geo_loc_1 source,geo_loc_2 source
0,25335,105,,,,,,Hole,,,,
1,25335,25,,,,,,Hole No.,,,,
2,25335,296,25 3,,,,,HOLE,,,,
3,25335,2,M 670,,,,,Hole,,,,
4,25335,2,M 1673,,,,,Hole,,,,
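
A quick way to get a feel for the results is to count the values in one column. The sketch below does this for the BH source column, which appears to record the table header each borehole value was found under. It uses only the standard library and the five rows shown above, pasted inline so it runs on its own; `count_by_column` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# The five rows shown above, pasted inline so the sketch is self-contained.
SAMPLE = """\
Unnamed: 0,DocID,File,BH,grid_loc_1,grid_loc_2,geo_loc_1,geo_loc_2,BH source,grid_loc_1 source,grid_loc_2 source,geo_loc_1 source,geo_loc_2 source
0,25335,105,,,,,,Hole,,,,
1,25335,25,,,,,,Hole No.,,,,
2,25335,296,25 3,,,,,HOLE,,,,
3,25335,2,M 670,,,,,Hole,,,,
4,25335,2,M 1673,,,,,Hole,,,,
"""


def count_by_column(csv_file, column):
    """Count how often each value appears in one column of an open CSV file."""
    return Counter(row[column] for row in csv.DictReader(csv_file))


counts = count_by_column(io.StringIO(SAMPLE), "BH source")
print(counts.most_common())
```

To run it on the real output, open the file with `open(result_fname, newline="")` and pass the file object in place of the `io.StringIO` wrapper.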


## Extracting boreholes from all tables in your trainingFiles

This is the simplest case to run: just call extract_for_all_docids(). It includes the init() and manage_data() steps.
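
In terms of ordering, that amounts to the manual steps from the previous section: initialise once, extract for every report ID, then clean the result file once at the end. The sketch below spells out that order with the three steps passed in as functions. `run_all` is a hypothetical stand-in, and how extract_for_all_docids() actually gathers its report IDs isn't shown in this guide.

```python
def run_all(report_ids, init, extract, manage_data, fname):
    """Run init once, extract for every report ID, then clean the result file once."""
    init()
    for report_id in report_ids:
        extract(report_id, fname=fname)
    manage_data(fname)


# Usage with the names from this guide:
# run_all(reportIDs, extraction.init, extraction.extract_bh,
#         extraction.manage_data, result_fname)
```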

And that's it for simple execution. There's more you can experiment with: changing the source location of your table files (with training=False and/or by setting extrafolder), restricting the table files to those already predicted to contain boreholes (with bh=True; see borehole.tables for that process), and of course editing the column and key terms for boreholes and locations that are searched for.