# Table Classification Guide

This notebook walks through some of the functions in borehole/tables.py and explains why they live together. The borehole package is centred on getting borehole information out of tables. The tables.py file contains many operations that act on tables, all of which were initially written to support table classification: deciding whether a table contains borehole references or not, which shortens the later search for those references.

The more general helper functions in that file (like get_tables()) are pretty self-explanatory, so this guide focuses on table classification. The smaller set of tables produced by the classifier is what you'll need in order to use the bh=True parameter in extraction.

In [1]:
import sys
sys.path.append('../')  # need this to be able to import borehole package files
from borehole import tables

## Automatically classify and save all bh tables

If you just want to run all your tables through the classifier so that you can use bh=True, execute nothing more than the import above and the following line of code.

By default, this function looks in the 'tables' folder inside your paths.training_file_folder, processes every table file it finds there, and saves the results to 'bh_tables' if the result files do not yet exist.

Despite having already been run for this directory, the function still prints output: some files have tables but no bh_tables, and when there are no bh_tables, no file is saved, so those files are picked up again on the next run. This differs from textract tables, where a file is saved even when there are no tables to save; bh_tables could be changed to behave the same way.
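The "only if the result files do not yet exist" check can be sketched roughly as below. This is a minimal illustration, not the code in tables.py: the function name `files_missing_results`, the file names, and the assumption that a result file shares its name with the source file are all hypothetical.

```python
import tempfile
from pathlib import Path


def files_missing_results(training_folder):
    """Return names of files in 'tables' with no counterpart in 'bh_tables'.

    Hypothetical helper: assumes result files reuse the source file name.
    """
    tables_dir = Path(training_folder) / "tables"
    results_dir = Path(training_folder) / "bh_tables"
    return sorted(
        path.name
        for path in tables_dir.iterdir()
        if path.is_file() and not (results_dir / path.name).exists()
    )


# Demonstration on a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as folder:
    root = Path(folder)
    (root / "tables").mkdir()
    (root / "bh_tables").mkdir()
    for name in ("report_a.json", "report_b.json"):
        (root / "tables" / name).write_text("{}")
    # Only report_a has a saved result, so report_b is still pending.
    (root / "bh_tables" / "report_a.json").write_text("{}")
    pending = files_missing_results(root)
    print(pending)
```

Under this scheme, a file that never produces a result (because it has no borehole tables) stays in the pending list on every run, which is consistent with the repeated output described above.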

In [2]:
tables.save_all_bh_tables()

Getting borehole tables for  10232 _ 1
Saved  10232 _ 1  bh tables to file
Getting borehole tables for  105814 _ 1
Saved  105814 _ 1  bh tables to file
Getting borehole tables for  111200 _ 1
Saved  111200 _ 1  bh tables to file
Getting borehole tables for  11127 _ 1
File has no natural tables
Getting borehole tables for  11128 _ 1
Saved  11128 _ 1  bh tables to file
Getting borehole tables for  11148 _ 1
Saved  11148 _ 1  bh tables to file
Getting borehole tables for  1229 _ 1
Saved  1229 _ 1  bh tables to file
Getting borehole tables for  1664 _ 1
File has no natural tables
Getting borehole tables for  18509 _ 1
File has no natural tables
Getting borehole tables for  20170 _ 1
Saved  20170 _ 1  bh tables to file
Getting borehole tables for  21622 _ 1
Saved  21622 _ 1  bh tables to file
Getting borehole tables for  21838 _ 1
Saved  21838 _ 1  bh tables to file
Getting borehole tables for  22343 _ 1
File has no natural tables
Getting borehole tables for  22568 _ 1
Saved  22568 _ 1  bh 

If you want to know more than that, read on.

## Table classification model

Table classification is based on a labelled dataset of table content stored as text, and a scikit-learn pipeline consisting of a TfidfVectorizer and a Complement Naive Bayes classifier for binary classification.

The dataset is programmatically generated from tables files; samples can then be labelled in the csv or one by one in a Python notebook. The classes are 0 (the table doesn't contain any reference to borehole names) and 1 (it does). Labelling can be difficult for some tables. Labelling inside the notebook can be more accurate because the table structure is shown for context, but it is also much slower because samples are labelled one at a time. Apart from structure, labelling is also a challenge when it's not fully clear whether the table contains borehole references. The rule of thumb I used was that if it's not clear to me, it wouldn't be clear to the classifier, so the table should be labelled 0. By "clear" I mean that the table contains not only the name/reference itself (which is unique to the borehole and therefore not common enough to be a vocabulary term in the vectoriser) but also key terms that indicate the presence of boreholes.

For example, a table might use a borehole name in a way that only implies it is a borehole, so that you could make an educated guess from knowing the pattern of borehole names, yet contain no indicator terms that label it as such. That table should be classed as 0, whereas tables with explicit references are 1s. Take a look at the dataset in report/datasets/boreholes/table_content.csv for examples.
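A pipeline of this shape might look something like the sketch below. The step names (list2str, tfidf, cnb) are taken from the training output later in this guide; everything else, including the `cells_to_text` helper, the default hyperparameters, and the toy data, is an assumption for illustration and not the code in tables.py.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def cells_to_text(tables):
    """Join each table's list of cell strings into one document string."""
    return [" ".join(str(cell) for cell in cells) for cells in tables]


def build_pipeline():
    """Hypothetical three-step pipeline: list -> string -> TF-IDF -> CNB."""
    return Pipeline(
        steps=[
            ("list2str", FunctionTransformer(cells_to_text)),
            ("tfidf", TfidfVectorizer()),
            ("cnb", ComplementNB()),
        ]
    )


# Toy data, invented for this example: each sample is a table's cell contents.
toy_tables = [
    ["Borehole", "Depth (m)", "BH01", "120"],
    ["Well name", "Total depth", "DDH-7", "350"],
    ["Tenement", "Sub block", "EPM1", "A01"],
    ["Start time", "End time", "00:00", "06:00"],
]
toy_labels = [1, 1, 0, 0]

clf = build_pipeline()
clf.fit(toy_tables, toy_labels)
predictions = clf.predict(toy_tables)
print(predictions)
```

Complement Naive Bayes is a variant of multinomial Naive Bayes that scikit-learn documents as particularly suited to imbalanced datasets, which fits a problem where most tables are class 0.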

As the dataset is already created, running tables.create_dataset() can be skipped. If you don't have a dataset, run this function without parameters. 

Once you've got a dataset with some labels, run tables.train(0) to train the model on it. The 0 means that you don't want to label any more samples manually before training. You must pass it if you have unlabelled samples in your dataset, or the function will default to giving you 10 samples to label first (following the principle of active learning: incrementally improving your model by labelling the data that will most help it improve).

In [7]:
clf = tables.train(0)

training with all labelled samples
test set size:  0.2
[Pipeline] .......... (step 1 of 3) Processing list2str, total=   0.0s
[Pipeline] ............. (step 2 of 3) Processing tfidf, total=   0.6s
[Pipeline] ............... (step 3 of 3) Processing cnb, total=   0.0s
Test set results: 
0.8966942148760331
[[596  74]
 [  1  55]]
For manually annotated:
0.9181141439205955
[[3079  291]
 [   6  251]]
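The accuracy figures above can be recomputed from the confusion matrices, which is a quick way to see what the numbers mean. The sketch below assumes scikit-learn's usual layout (rows are true labels, columns are predicted labels, in the order 0 then 1); that layout is an assumption about how train() builds the matrix, and `summarise` is a hypothetical helper.

```python
def summarise(confusion):
    """Compute accuracy, plus precision and recall for class 1.

    Assumes rows are true labels and columns are predicted labels.
    """
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Matrices copied from the training output above.
test_set = [[596, 74], [1, 55]]
manually_annotated = [[3079, 291], [6, 251]]

for name, matrix in (("test set", test_set), ("manual", manually_annotated)):
    scores = summarise(matrix)
    print(name, {key: round(value, 4) for key, value in scores.items()})
```

The accuracies come out at about 0.8967 and 0.9181, matching the reported values. If the assumed layout is right, the test-set recall for class 1 is 55/56 while precision is 55/129: the model rarely misses a borehole table but lets through a fair number of false positives.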


If you do wish to use active learning and label samples in the notebook, run train() with a num_queries parameter > 0. Your labels will be used in the training and also saved to the dataset file.

We can see this below: once a preliminary model has been trained, the table is shown to the user, the program shows its own prediction for the sample, and the user is asked to enter the correct label, 0 or 1 (negative or positive for containing borehole names). Only these values are valid; entering anything else will re-prompt for a correct input.

Once a sample's label is recorded, the output window is refreshed and the next sample for labelling is queried, until all samples have been done. The model is then re-trained with this new information and the new accuracy and confusion matrix are shown (for a 20% validation set).
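The re-prompting behaviour described above can be sketched as a small validation loop. This is an illustration only, not the code used by train(): `ask_label` and its `read`/`write` parameters are hypothetical, and are injectable so the example runs without a keyboard.

```python
def ask_label(prediction, read=input, write=print):
    """Show the model's prediction and keep asking until the user enters 0 or 1."""
    write(f"Model prediction: {prediction}")
    while True:
        answer = read("Enter the correct label (0 or 1): ").strip()
        if answer in ("0", "1"):
            return int(answer)
        write("Invalid label, please enter 0 or 1.")


# Scripted answers stand in for a user: two invalid entries, then a valid one.
scripted = iter(["yes", "2", "1"])
label = ask_label(0, read=lambda prompt: next(scripted), write=lambda msg: None)
print(label)  # prints 1
```

In an interactive session you would call `ask_label(prediction)` with the defaults, so that the prompt uses the built-in input().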

In [3]:
tables.train(1)

   Start Time  End Time   Dur (hrs)   Depth Start (mKB)   Depth End (mKB)    Phase  Op Code  Activity Code  Time P-T-X                                          Operation 
0       00:00     06:00          6.0                 4.3               4.3    MIRU   RIGMNT           RGRP           P   Continued with flare & vent lines, re-instal s...
1       06:00     11:30          5.5                 4.3               4.3    MIRU   RIGMNT           RGRP           P   line on LCM tank. fill active tanks with Drill...
2          NaN       NaN         NaN                 NaN               NaN      NaN      NaN            NaN         NaN  mast, install cusion sub, upper kelly sub, sav...
3       11:30     12:00          0.5                 4.3               4.3    MIRU   RIGMNT           SFTY           P         Conducted weekly safety meeting with crew. 
4       12:00     14:30          2.5                 4.3               4.3    MIRU   RIGMNT           RGRP           P   Re-Align pipe arm & conn



ActiveLearner(X_training=array([["['Name ', 'TENEMENT ', 'SUB_BLOCK ', 'A01 ', 'EPM13858 and EPM13859 ', 'A02 ', 'EPM13858 ', 'A03 ', 'EPM13861 ', 'A05 ', 'EPM13861 ', 'A06 ', 'EPM13861 ', 'A07 ', 'EPM13862 ', 'A08 ', 'EPM13863 ', 'A12 ', 'EPM13857 ', 'A13 ', 'EPM13857 ', 'A14 ', 'EPM13857 ', 'A28 ', 'EPM13857 ', 'A30 ', 'EPM13857 ', 'A40 ', 'EPM13859 and EPM13860 ', 'A41 ', 'EPM13860 ']"],
       ["['reeatlen/Source ', '...
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,