This notebook clusters our data on the item_for_selection column and draws sample_n samples for the user to tag. The output of this notebook is a csv in which the user tags all relevant entities.

Expected input:
Preprocessed csv (from 00)

Expected output:
csv for NER tagging

In [9]:
import utils as ut
import os 
import hjson as json
import pandas as pd
from sklearn.model_selection import train_test_split

In [10]:
# Read in our params file (the context manager closes the file even if parsing fails)
with open('input_params.hjson') as f:
    params = json.load(f)

## Map json variable names to notebook variable names

spend_col = params['core']['spend_col']
model_name = params['core']['ner_model_name']


In [11]:
## Local nb variables
# SQL query params
db_type = params['core']['db_type']
sql_code = params['core']['sql_code']

# Modelling params
sample_n = params['nb_one']['sample_n']

model_type = 'named_entity_recognition' # We don't expect this parameter to change, so it's not being externalized

# Algorithm specific params
max_size = 1000
use_pretrained_model = params['core']['use_pretrained_model']
model_architecture = params['core']['model_architecture']
model_path = params['core']['model_path']
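
The two cells above assume that input_params.hjson has a core section and an nb_one section containing every key they read. A minimal sketch of a pre-flight check follows. The helper name missing_param_keys and the REQUIRED_KEYS table are hypothetical (they are not part of utils), and the values in example_params are placeholders, not real settings.

```python
# Hypothetical helper (not part of utils): list the params keys this notebook
# reads that are absent from the loaded file.
REQUIRED_KEYS = {
    'core': ['spend_col', 'ner_model_name', 'db_type', 'sql_code',
             'use_pretrained_model', 'model_architecture', 'model_path'],
    'nb_one': ['sample_n'],
}

def missing_param_keys(params, required=REQUIRED_KEYS):
    """Return 'section.key' strings for every required key that is missing."""
    missing = []
    for section, keys in required.items():
        section_values = params.get(section, {})
        for key in keys:
            if key not in section_values:
                missing.append(f'{section}.{key}')
    return missing

# Placeholder values for illustration only
example_params = {'core': {'spend_col': 'spend', 'ner_model_name': 'demo'}, 'nb_one': {}}
print(missing_param_keys(example_params))
```

Running a check like this right after loading the file would surface a missing key by name instead of as a KeyError partway through the cell.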


In [12]:
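def trim_df_for_sampling(df, max_size, spend_col):
    # Limit our NER input to a max size so we don't overflow

    n_before = df.shape[0]
    if n_before > max_size:
        print('Input df too large. Pruning...')
        # Drop the bottom 10% by spend; the top 90% of data tends to be cleaner
        df = df[df[spend_col] > df[spend_col].quantile(.1)]
        # The filter above can leave fewer than max_size rows, so cap n to avoid a ValueError
        df = df.sample(n=min(max_size, df.shape[0]), random_state=7)
        # Note: this is a fraction of the original rows (0-1), not a percentage
        pct = df.shape[0]/n_before
        print(f'Percent of data kept: {pct}')
    return df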

def k_means_sampling(df, sample_n, use_pretrained_model, model_path, model_architecture, item_col='item_for_selection'):
    # Use k means to sample our dataset

    CRS = ut.ClustResample(
        df = df,
        input_col = item_col,
        sample_n = sample_n,
        use_pretrained_model = use_pretrained_model,
        model_path = model_path,
        model_architecture = model_architecture
    )

    CRS.bert_vecs()
    CRS.kmeans()
    CRS.sample_centroids()
    CRS.format_for_ner_labelling()
    return CRS
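
The internals of ut.ClustResample are not shown in this notebook. Assuming it embeds each item, runs k-means with sample_n clusters, and keeps the item nearest each centroid, the selection step could look roughly like the sketch below. The function name nearest_to_centroids is hypothetical, scikit-learn's KMeans stands in for whatever ClustResample actually uses, and random 2-D points stand in for the BERT vectors.

```python
import numpy as np
from sklearn.cluster import KMeans

def nearest_to_centroids(vectors, sample_n, random_state=7):
    """Return sorted row indices of the points closest to each k-means centroid."""
    vectors = np.asarray(vectors, dtype=float)
    km = KMeans(n_clusters=sample_n, n_init=10, random_state=random_state).fit(vectors)
    distances = km.transform(vectors)        # shape (n_rows, sample_n)
    closest_rows = distances.argmin(axis=0)  # one row index per centroid
    # A set guards against the rare case of one row being nearest to two centroids
    return sorted({int(i) for i in closest_rows})

# Toy stand-in for embeddings: three well-separated 2-D blobs of four points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
toy_vectors = np.vstack([c + rng.normal(scale=0.1, size=(4, 2)) for c in centers])
picked = nearest_to_centroids(toy_vectors, sample_n=3)
print(picked)
```

Sampling one representative per cluster spreads the label set across distinct kinds of items, which is presumably why clustering is preferred here over a plain random sample.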

In [13]:
# Run the functions in the notebook
df = pd.read_csv(f'{model_type}/{model_name}/data/{model_name}_preprocessed.csv')
df = trim_df_for_sampling(df, max_size, spend_col)
CRS = k_means_sampling(df, sample_n, use_pretrained_model, model_path, model_architecture)
CRS.out.to_csv(f'{model_type}/{model_name}/data/{model_name}_label_set_round1_n{sample_n}.csv',index = False)


Input df too large. Pruning...
Percent of data kept: 0.0021695974311966414


Some layers from the model checkpoint at epoch_95 were not used when initializing TFBertModel: ['mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFBertModel were not initialized from the model checkpoint at epoch_95 and are newly initialized: ['bert/pooler/dense/kernel:0', 'bert/pooler/dense/bias:0']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
1000it [00:00, 5717.96it/s]


### Validation tests

In [14]:
# Preview the df
df = CRS.out
df

Unnamed: 0,item_for_selection,position,word,tag
295,"(v) honey - sesame tofu (av,s)(s",0,(v),
296,"(v) honey - sesame tofu (av,s)(s",1,honey,
297,"(v) honey - sesame tofu (av,s)(s",2,-,
298,"(v) honey - sesame tofu (av,s)(s",3,sesame,
299,"(v) honey - sesame tofu (av,s)(s",4,tofu,
...,...,...,...,...
105,wrap chicken caesar,1,chicken,
106,wrap chicken caesar,2,caesar,
249,wt ham sandwich,0,wt,
250,wt ham sandwich,1,ham,
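
The preview above has one row per token, with the token's position within its item and an empty tag column for the labeller to fill in. A minimal sketch of that reshaping follows, assuming simple whitespace tokenization (which matches the rows shown, but is not confirmed to be what CRS.format_for_ner_labelling does). The function name explode_for_tagging is hypothetical.

```python
def explode_for_tagging(items, item_col='item_for_selection'):
    """Turn each item string into one row per whitespace-separated token."""
    rows = []
    for item in items:
        for position, word in enumerate(item.split()):
            # tag is left empty so the user can fill it in during labelling
            rows.append({item_col: item, 'position': position, 'word': word, 'tag': ''})
    return rows

for row in explode_for_tagging(['wt ham sandwich']):
    print(row)
```

Passing the resulting list of dicts to pd.DataFrame would give a frame with the same four columns that the validation cells check for.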


In [16]:
# Make sure our output df has the right columns
# If this fails, make sure that the columns tested for actually reside in your dataset

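expected_cols = ['item_for_selection', 'position', 'word', 'tag']
missing_cols = [col for col in expected_cols if col not in df.columns]
assert not missing_cols, f'Missing columns in output df: {missing_cols}'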

In [17]:
# Make sure our output df has at least one row
# If this fails, make sure your data wasn't accidentally dropped

assert df.shape[0] > 0, 'Output df is empty'