This notebook experiments with a question-answering (QA) approach to classifying documents according to whether they give evidence that the submitting company has provided modern slavery training to its employees.

The motivation is to apply transfer learning: models pre-trained to extract an answer (as a span) from a document (the context) can automate the identification of the small subsets of each document that might be relevant to modern slavery training. These smaller subsets can then make human labelling of additional documents more efficient, or be fed into another model (perhaps a transformer trained for sequence classification) that can only handle a limited number of tokens.

The tricky part of this task is that many of the documents are longer than the maximum input length of even the longest-context models (e.g. Longformer at 4096 tokens). Most documents will need to be divided into smaller chunks, with the pre-trained QA model run on each chunk, but the best way to do this isn't obvious. Some options being considered include:
* Divide by paragraph. This seems to make the most sense, as each paragraph should be about a single topic such as training. However, documents in the dataset were imported in various ways, and there is no easy way to tell where one paragraph ends and the next begins: often (but not always) there are line breaks in the middle of sentences where the text wrapped to the next line in the original document.
* Divide by sentence. This could also be considered, but some sections of documents are not written in sentence form (for example, they may be bulleted) and contain no periods for long stretches of text. Some sentences in the dataset are also longer than the maximum token length all by themselves.
* Divide by a fixed number of tokens. This is tricky because a break could fall in the middle of the section that talks about training, causing it to be missed altogether. If the chunks overlap, it is difficult to work out how to recombine the relevant sections. Also, the tokenizer loses some of the text when tokenization is reversed (for example, the answers decoded later in this notebook come back lowercased), making it difficult to compare the output to the input.
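
One way the fixed-size option might be made easier to recombine is to keep character offsets for every chunk, so that hits from overlapping chunks can be merged back in terms of the original text. The sketch below is only an illustration under a loud assumption: whitespace-separated words stand in for model tokens, and `sliding_windows` and `merge_spans` are hypothetical helper names, not part of this notebook.

```python
import re

def sliding_windows(text, window=400, stride=200):
    """Return (start_char, end_char, chunk) for overlapping word windows.

    Assumption: words approximate tokens; a real version would count model tokens.
    """
    spans = [m.span() for m in re.finditer(r'\S+', text)]
    windows = []
    for first in range(0, len(spans), stride):
        last = min(first + window, len(spans)) - 1
        start, end = spans[first][0], spans[last][1]
        # slicing the original text means nothing is lost to the tokenizer
        windows.append((start, end, text[start:end]))
        if last == len(spans) - 1:
            break
    return windows

def merge_spans(spans):
    """Union character ranges that overlap, e.g. hits from neighbouring windows."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

sample = 'one two three four five six seven'
sample_windows = sliding_windows(sample, window=4, stride=2)
```

Because each chunk is a slice of the original string, any answer found in a chunk can be located in the source document by adding the chunk's `start_char`.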

Each of these options is currently being pursued, alongside a search for other possibilities. This notebook is a work in progress and is not yet well documented.
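
For the paragraph option, one possible heuristic is to treat a line as a continuation of the previous one when the previous line lacks sentence-ending punctuation and the new line starts with a lowercase letter. This is an untested assumption rather than something validated on this dataset, and `unwrap_lines` is a hypothetical helper name.

```python
import re

def unwrap_lines(text):
    """Merge lines that look like mid-sentence wraps into single paragraphs."""
    paragraphs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        prev = paragraphs[-1] if paragraphs else ''
        # heuristic: previous line has no closing punctuation and this line starts lowercase
        continues = (
            prev
            and not re.search(r'[.!?:]$', prev)
            and line[0].islower()
        )
        if continues:
            paragraphs[-1] = prev + ' ' + line
        else:
            paragraphs.append(line)
    return paragraphs

wrapped = (
    'Modern Slavery Statement\n'
    'Our Compliance Teams have been externally trained in the exploitation of human labour and we\n'
    'have received training in the Stronger Together principles.\n'
    'Our suppliers'
)
unwrapped = unwrap_lines(wrapped)
```

The heuristic will miss wraps that happen to land before a capitalised word, so it is best read as a starting point for experimentation.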

In [1]:
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import re
import pickle

PyTorch version 1.6.0 available.
TensorFlow version 2.3.0 available.


In [2]:
df_labeled=pd.read_csv('train (3).csv',index_col=0)
df_hidden=pd.read_csv('test (3).csv',index_col=0)

In [3]:
# any characters repeated more than 4 times will be shortened to 4 repetitions: 
# https://stackoverflow.com/questions/10072744/remove-repeating-characters-from-words
df_labeled['TEXT']=df_labeled['TEXT'].apply(lambda x: re.sub(r'(.)\1{4,}', r'\1\1\1\1', str(x)))
df_hidden['TEXT']=df_hidden['TEXT'].apply(lambda x: re.sub(r'(.)\1{4,}', r'\1\1\1\1', str(x)))
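
As a quick illustration of what the substitution above does, the sketch below applies the same pattern to a few made-up strings (the inputs and the `collapse_repeats` name are hypothetical). Runs of five or more identical characters are cut to four, shorter runs are left alone, and because `.` does not match a newline by default, runs of blank lines are not affected.

```python
import re

def collapse_repeats(text):
    # same pattern as the cell above: a character followed by 4+ copies of itself
    return re.sub(r'(.)\1{4,}', r'\1\1\1\1', text)

toc_line = collapse_repeats('Contents........ 3')
short_run = collapse_repeats('----')
long_run = collapse_repeats('-----')
blank_lines = collapse_repeats('a\n\n\n\n\n\nb')
```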

In [4]:
#df_labeled['paragraphs']=df_labeled['TEXT'].apply(lambda x: [paragraph for paragraph in x.splitlines() if paragraph!=''])
#df_hidden['paragraphs']=df_hidden['TEXT'].apply(lambda x: [paragraph for paragraph in x.splitlines() if paragraph!=''])

In [5]:
model_name = 'ktrapeznikov/albert-xlarge-v2-squad-v2'

In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
model.eval()

loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/ktrapeznikov/albert-xlarge-v2-squad-v2/config.json from cache at C:\Users\dhilg/.cache\torch\transformers\c63acbd2ffb1762d161c0c366bb4a0dd5312f615847b87d4cf7be001ca562cab.0aa9a4e13357b14219e56e90005cf95adfc4fbb59ad847974267550fef9c2f6f
Model config AlbertConfig {
  "architectures": [
    "AlbertForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 8192,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 16,
  "num_hidden_groups": 1,
  "num_hidden_layers": 24,
  "num_memory_blocks": 0,
  "output_past": true,
  "pad_toke

AlbertForQuestionAnswering(
  (albert): AlbertModel(
    (embeddings): AlbertEmbeddings(
      (word_embeddings): Embedding(30000, 128, padding_idx=0)
      (position_embeddings): Embedding(512, 128)
      (token_type_embeddings): Embedding(2, 128)
      (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): AlbertTransformer(
      (embedding_hidden_mapping_in): Linear(in_features=128, out_features=2048, bias=True)
      (albert_layer_groups): ModuleList(
        (0): AlbertLayerGroup(
          (albert_layers): ModuleList(
            (0): AlbertLayer(
              (full_layer_layer_norm): LayerNorm((2048,), eps=1e-12, elementwise_affine=True)
              (attention): AlbertAttention(
                (query): Linear(in_features=2048, out_features=2048, bias=True)
                (key): Linear(in_features=2048, out_features=2048, bias=True)
                (value): Linear(in_features=2048, out_features=

In [7]:
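questions=['Is there training provided?', 
           'Has training been done?']
# token count of the longest question, including the special tokens the tokenizer adds around it
max_question_tokens=max([tokenizer(question, return_tensors='pt')['input_ids'].size()[1] for question in questions])
# tokens left for the context within the model's 512-token limit
max_context_tokens=512-max_question_tokens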
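
The arithmetic behind this budget can be checked by hand. The sketch below assumes ALBERT's usual layout of `[CLS] text [SEP]` for a single sequence and `[CLS] question [SEP] context [SEP]` for a pair, and uses a hypothetical 6-token question; the helper names are made up for illustration.

```python
MAX_LEN = 512  # max_position_embeddings reported in the model config

def single_length(n_tokens):
    # [CLS] text [SEP]
    return n_tokens + 2

def pair_length(q_tokens, p_tokens):
    # [CLS] question [SEP] context [SEP]
    return q_tokens + p_tokens + 3

q_tokens = 6                                         # hypothetical question length
max_question_tokens = single_length(q_tokens)        # what tokenizer(question) would report
max_context_tokens = MAX_LEN - max_question_tokens   # budget used for paragraphs

# a paragraph is measured with its own [CLS]/[SEP], so its raw token count is 2 lower
largest_paragraph = max_context_tokens - 2
longest_pair = pair_length(q_tokens, largest_paragraph)
```

Under these assumptions the largest accepted paragraph still leaves one position spare, because the special tokens are counted twice when question and paragraph are measured separately.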

In [24]:
def split_paragraph(paragraph):
    """
    For paragraphs that are longer than the maximum sequence length of the model, split them into two smaller chunks,
    keeping sentences intact while maintaining maximum overlap.
    This assumes no paragraph is so long that it needs to be split into 3 or more chunks. If one is, a ValueError is
    raised and this function will need to be rewritten.
    """
    # drop empty pieces, e.g. the one left after the paragraph's final period
    sentences=[sentence.strip() for sentence in paragraph.split('.') if sentence.strip()]
    if len(sentences) < 2:
        raise ValueError('Paragraph cannot be split on sentence boundaries.')
    
    new_paragraphs=[]
    
    # build curr_paragraph forward from start of paragraph
    curr_paragraph=sentences[0]+'.'
    i=1
    while i < len(sentences) and tokenizer(curr_paragraph+' '+sentences[i]+'.', return_tensors='pt')['input_ids'].size()[1] < max_context_tokens:
        curr_paragraph+=' '+sentences[i]+'.'
        i+=1
    new_paragraphs.append(curr_paragraph)
    
    # build curr_paragraph backward from end of paragraph
    curr_paragraph=sentences[-1]+'.'
    j=len(sentences)-2
    while j >= 0 and tokenizer(sentences[j]+'. '+curr_paragraph, return_tensors='pt')['input_ids'].size()[1] < max_context_tokens:
        curr_paragraph=sentences[j]+'. '+curr_paragraph
        j-=1
    new_paragraphs.append(curr_paragraph)
            
    # the two chunks cover the whole paragraph only if they meet or overlap
    if j < i:
        return new_paragraphs
    raise ValueError('Paragraph must be split into >2 paragraphs. The function does not currently support this and needs to be modified.')

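def text_to_paragraphs(text):
    """
    Splits text into paragraphs at line breaks, dropping empty lines. Paragraphs longer than the maximum sequence
    length of the model are further divided with split_paragraph.
    """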
    base_paragraphs=[paragraph for paragraph in text.splitlines() if paragraph!='']
    final_paragraphs=[]
    for paragraph in base_paragraphs:
        if tokenizer(paragraph, return_tensors='pt')['input_ids'].size()[1] <= max_context_tokens:
            final_paragraphs.append(paragraph)
        else:
            for new_paragraph in split_paragraph(paragraph):
                final_paragraphs.append(new_paragraph)
    return final_paragraphs
        
def group_paragraphs(paragraphs):
    """
    Groups paragraphs together until they are as large as possible without exceeding the maximum sequence length of the 
    model
    """
    contexts=[]
    curr_context=''
    for paragraph in paragraphs:
        if tokenizer(paragraph, return_tensors='pt')['input_ids'].size()[1] > max_context_tokens:
            # save the context built so far, then split the over-long paragraph
            if curr_context:
                contexts.append(curr_context)
            temp_paragraphs=split_paragraph(paragraph)
            contexts.extend(temp_paragraphs[:-1])
            # the last chunk starts the next context so later paragraphs can still be grouped with it
            curr_context=temp_paragraphs[-1]
        elif not curr_context:
            curr_context=paragraph
        else:
            new_tokens=tokenizer(curr_context+'\n'+paragraph, return_tensors='pt')['input_ids'].size()[1]

            if new_tokens > max_context_tokens:
                # if this paragraph would put the context over the max length, add the curr_context to the list of 
                # contexts, and start a new curr_context with this paragraph
                contexts.append(curr_context)
                curr_context=paragraph
            else:
                curr_context+='\n'+paragraph
    # add whatever is left after the last paragraph
    if curr_context:
        contexts.append(curr_context)
    return contexts
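
If paragraphs ever need three or more chunks, a greedy packer is one possible replacement for the two-chunk split. The sketch below is an assumption-laden illustration, not the notebook's method: `pack_sentences` is a hypothetical name, the default `count_tokens` counts whitespace-separated words rather than model tokens, and the chunks do not overlap (unlike `split_paragraph`).

```python
def pack_sentences(sentences, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily pack whole sentences into as few chunks as possible."""
    chunks, current = [], ''
    for sentence in sentences:
        candidate = (current + ' ' + sentence).strip()
        if current and count_tokens(candidate) > max_tokens:
            # adding this sentence would overflow, so close the current chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

packed = pack_sentences(['a b c.', 'd e.', 'f g h i.', 'j.'], max_tokens=5)
```

A single sentence longer than `max_tokens` still ends up as its own over-long chunk here, so that case would need separate handling. To count real tokens, a function wrapping the notebook's tokenizer could be passed as `count_tokens`.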

In [25]:
%%time
#df_labeled['contexts']=df_labeled['TEXT'].apply(lambda x: group_paragraphs([paragraph for paragraph in x.splitlines() if paragraph!='']))
#df_hidden['contexts']=df_hidden['TEXT'].apply(lambda x: group_paragraphs([paragraph for paragraph in x.splitlines() if paragraph!='']))
df_labeled['paragraphs']=df_labeled['TEXT'].apply(lambda x: text_to_paragraphs(x))
df_hidden['paragraphs']=df_hidden['TEXT'].apply(lambda x: text_to_paragraphs(x))

Token indices sequence length is longer than the specified maximum sequence length for this model (545 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (548 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (557 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (516 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (558 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for thi

Wall time: 16.6 s


In [26]:
df_class1=df_labeled.loc[df_labeled['LABEL_MSA_training_binary']==1]
df_class1.head()

ID,TEXT,LABEL_MSA_training_binary,LABEL_MSA_training_original,contexts,paragraphs
2,MODERN SLAVERY ACT STATEMENT\nOUR BUSINESS Zal...,1,"Procurement / purchasing, Recruitment / HR, ...",[MODERN SLAVERY ACT STATEMENT\nOUR BUSINESS Za...,"[MODERN SLAVERY ACT STATEMENT, OUR BUSINESS Za..."
4,Modern Slavery Act Statement\nIntroduction fro...,1,Employees (all),[Modern Slavery Act Statement\nIntroduction fr...,"[Modern Slavery Act Statement, Introduction fr..."
5,\nMERRILL CORPORATION\nANTI-SLAVERY & HUMAN T...,1,Employees (all),[\nMERRILL CORPORATION\nANTI-SLAVERY & HUMAN ...,"[, MERRILL CORPORATION, ANTI-SLAVERY & HUMAN ..."
8,Jones Lang LaSalle Incorporated\nModern Slaver...,1,"Suppliers, Employees (all)",[Jones Lang LaSalle Incorporated\nModern Slave...,"[Jones Lang LaSalle Incorporated, Modern Slave..."
11,Shiva Hotels Group LLP\nModern Slavery Stateme...,1,Employees (all),[Shiva Hotels Group LLP\nModern Slavery Statem...,"[Shiva Hotels Group LLP, Modern Slavery Statem..."


# Short statements misclassified???

In [171]:
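# requires the 'num_tokens' column, which is computed in the Scratch section at the end of this notebook
df_class1_sorted=df_class1.sort_values('num_tokens')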
df_class1_sorted.tail(120)

ID,TEXT,LABEL_MSA_training_binary,LABEL_MSA_training_original,tokens,num_tokens
434,Modern Slavery Statement\nYear ended 31 March ...,1,Employees (all),"[tensor(2), tensor(773), tensor(9822), tensor(...",1193
132,MODERN SLAVERY AND HUMAN TRAFFICKING STATEMENT...,1,Employees (all),"[tensor(2), tensor(773), tensor(9822), tensor(...",1208
245,Jefferies\nJefferies International Limited\nVi...,1,"Procurement / purchasing, Recruitment / HR, ...","[tensor(2), tensor(3292), tensor(8337), tensor...",1222
452,STORES\n\nHELP\n\nMore \n\nSmart Rewards My A...,1,"Leadership, Employees (all)","[tensor(2), tensor(4134), tensor(448), tensor(...",1233
279,J D Wetherspoon plc\nAnti-slavery and human tr...,1,"Procurement / purchasing, Leadership, Employ...","[tensor(2), tensor(487), tensor(13), tensor(43...",1240
...,...,...,...,...,...
257,Co-op Modern Slavery and Human Trafficking Sta...,1,"Procurement / purchasing, Leadership, Suppliers","[tensor(2), tensor(326), tensor(8), tensor(259...",7693
531,MARKS AND SPENCER GROUP PLC\nModern Slavery St...,1,"Leadership, Suppliers, Employees (all)","[tensor(2), tensor(4872), tensor(17), tensor(6...",7792
497,Modern Slavery Statement\nSeptember 2016 - Mar...,1,Procurement / purchasing,"[tensor(2), tensor(773), tensor(9822), tensor(...",10438
131,AMNESTY INTERNATIONAL MODERN SLAVERY ACT 2015 ...,1,Employees (all),"[tensor(2), tensor(18304), tensor(294), tensor...",11052


In [28]:
for i in range(12):
    print('-'*80)
    print(f'i={i} ID={df_class1_sorted.iloc[i].name}')
    print('-'*80)
    print(df_class1_sorted.iloc[i]['TEXT'])
    print('\n\n')

--------------------------------------------------------------------------------
i=0 ID=366
--------------------------------------------------------------------------------
Modern Slavery Statement for Financial Year 2016/17
Our business Our policies
Our suppliers
Approval for this statement Simon Fuller





--------------------------------------------------------------------------------
i=1 ID=46
--------------------------------------------------------------------------------
Arcadia Group Ltd
Modern Slavery Act – Transparency Statement
Modern Slavery Statement
September 2016 - August 2017
Read here
Modern Slavery Statement
September 2015 - August 2016
Read here
© Arcadia Group Limited • Accessibility • Modern Slavery Act • Candidate Privacy • Legal Notice • Privacy Notice • CA Transparency Act • Contacts
This site uses cookies. Details of which can be found here. Accept





--------------------------------------------------------------------------------
i=2 ID=67
----------------

# Single Example

In [27]:
ID=100 #279, 100, 526

In [28]:
%%time
answers=[]
for question in questions:
    curr_answers=[]
    with torch.no_grad():
        for i, paragraph in enumerate(df_labeled.loc[ID,'paragraphs']):
            #inputs=tokenizer(question, context, padding='max_length', max_length=512, return_tensors='pt')
            inputs=tokenizer(question, paragraph, return_tensors='pt')
            input_ids = inputs["input_ids"].tolist()[0]
            
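            # index the outputs so this works whether the model returns a tuple or a ModelOutput object
            outputs = model(**inputs)
            answer_start_scores, answer_end_scores = outputs[0], outputs[1]
            # the highest-scoring start and end tokens are chosen independently of each other
            answer_start = torch.argmax(answer_start_scores)
            answer_end = torch.argmax(answer_end_scores) + 1
            # '[CLS]' is how the model signals no answer; '' results when the end comes before the start
            answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))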
            print(f'{question}\n{answer}\n\n')
            
            if answer!='[CLS]' and answer!='':
                curr_answers.append({
                    'answer text':answer,
                    'paragraph_id':i,
                    'paragraph_text':paragraph
                })
    answers.append(curr_answers)
answers

Is there training provided?
[CLS]


Is there training provided?
[CLS]


Is there training provided?
[CLS]


Is there training provided?
[CLS]


Is there training provided?
[CLS]


Is there training provided?
[CLS]


Is there training provided?
[CLS]


Is there training provided?
[CLS]


Is there training provided?
[CLS]


Is there training provided?
[CLS]


Is there training provided?
[CLS]


Is there training provided?
[CLS]


Is there training provided?
[CLS]


Is there training provided?
[CLS]


Is there training provided?
[CLS]


Is there training provided?
[CLS]


Is there training provided?
[CLS]


Is there training provided?
[CLS]


Is there training provided?
we have embraced the stronger together programme


Is there training provided?
[CLS]


Is there training provided?
our compliance teams have been externally trained in the exploitation of human labour


Is there training provided?
have received training in the stronger together principles and instigated measures throughout

[[{'answer text': 'we have embraced the stronger together programme',
   'paragraph_id': 18,
   'paragraph_text': 'We have embraced the Stronger Together Programme across our sites and are Stronger Together'},
  {'answer text': 'our compliance teams have been externally trained in the exploitation of human labour',
   'paragraph_id': 20,
   'paragraph_text': 'Our Compliance Teams have been externally trained in the exploitation of human labour and we'},
  {'answer text': 'have received training in the stronger together principles and instigated measures throughout each',
   'paragraph_id': 21,
   'paragraph_text': 'have received training in the Stronger Together principles and instigated measures throughout each'}],
 [{'answer text': 'we have embraced the stronger together programme',
   'paragraph_id': 18,
   'paragraph_text': 'We have embraced the Stronger Together Programme across our sites and are Stronger Together'},
  {'answer text': 'our compliance teams have been externally tra
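
The cell above picks the start and end tokens independently, which can produce an end before the start. A common alternative for SQuAD v2-style models is to score every valid span and compare the best one against the "no answer" score at position 0. The sketch below shows that idea on plain Python lists; `best_span`, `max_answer_len`, and the example scores are hypothetical and not taken from this notebook.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return ((start, end, score), null_score) for the best span with start <= end."""
    # position 0 is the [CLS] token, used by SQuAD v2 models to mean "no answer"
    null_score = start_scores[0] + end_scores[0]
    best = (None, None, float('-inf'))
    for s in range(1, len(start_scores)):
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = start_scores[s] + end_scores[e]
            if score > best[2]:
                best = (s, e, score)
    return best, null_score

# made-up scores for a 4-token input
span, null_score = best_span([0.1, 2.0, 0.3, 0.2], [0.1, 0.2, 0.1, 3.0])
has_answer = span[2] > null_score
```

The gap between the best span score and the null score could also serve as a rough per-paragraph relevance score, which might help when ranking paragraphs for human labelling.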

In [31]:
print(df_labeled.loc[526,'TEXT'])

THE H&M GROUP MODERN SLAVERY STATEMENT
FINANCIAL YEAR 2015/2016
This statement has been published in accordance with the Modern Slavery Act 2015 and sets out the steps H & M Hennes & Mauritz AB and all companies included in the H & M group (hereby referred to as H&M) have taken during the financial year 2015/2016 to address modern slavery in its value chain – both its own business and its supply chain.
1

THE H&M GROUP
The H&M group is one of the world’s leading fashion companies. With our brands – H&M, COS, Monki, & Other Stories, Weekday and Cheap Monday – we want to inspire fashion fans across the globe to dress their personal style. Each of our brands has its own unique identity and they are united by a passion for fashion and a drive to dress customers in a sustainable way.
We want to make fashion sustainable and sustainability fashionable. We are dedicated to creating a better fashion future and we use our size and scale to drive development towards a more circular, fair and equ

# Scratch

In [34]:
df_labeled['tokens']=df_labeled['TEXT'].apply(lambda x: tokenizer(x, return_tensors='pt')['input_ids'])
df_labeled['num_tokens']=df_labeled['tokens'].apply(lambda x: x.size()[1])
df_labeled['num_chars']=df_labeled['TEXT'].apply(lambda x: len(x))
df_hidden['tokens']=df_hidden['TEXT'].apply(lambda x: tokenizer(x, return_tensors='pt')['input_ids'])
df_hidden['num_tokens']=df_hidden['tokens'].apply(lambda x: x.size()[1])
df_hidden['num_chars']=df_hidden['TEXT'].apply(lambda x: len(x))

Token indices sequence length is longer than the specified maximum sequence length for this model (870 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1259 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1690 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (4146 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (746 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for 

Token indices sequence length is longer than the specified maximum sequence length for this model (1246 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (589 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2122 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2117 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1751 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for

Token indices sequence length is longer than the specified maximum sequence length for this model (1262 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1019 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (943 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (695 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1188 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for 

Token indices sequence length is longer than the specified maximum sequence length for this model (585 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (791 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (559 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (4184 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (934 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for th

Token indices sequence length is longer than the specified maximum sequence length for this model (10625 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3490 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1373 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (592 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3925 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length fo

Token indices sequence length is longer than the specified maximum sequence length for this model (1048 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1060 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1056 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1498 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (7843 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length fo

Token indices sequence length is longer than the specified maximum sequence length for this model (40724 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1101 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1415 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1228 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (803 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length fo

Token indices sequence length is longer than the specified maximum sequence length for this model (3740 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (846 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1347 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1407 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1384 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for

Token indices sequence length is longer than the specified maximum sequence length for this model (1144 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1115 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1673 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1559 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1232 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length fo

Token indices sequence length is longer than the specified maximum sequence length for this model (1238 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (556 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (5112 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3569 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3136 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for

Token indices sequence length is longer than the specified maximum sequence length for this model (1222 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1020 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (6627 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (858 > 512). Running this sequence through the model will result in indexing errors

In [35]:
df_labeled['num_chars'].sum()+df_hidden['num_chars'].sum()

7175390

In [36]:
df_labeled['num_tokens'].sum()+df_hidden['num_tokens'].sum()

1385999

In [37]:
(df_labeled['num_chars'].sum()+df_hidden['num_chars'].sum())/(df_labeled['num_tokens'].sum()+df_hidden['num_tokens'].sum())

5.1770527972963905
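
The ratio above works out to roughly 5.18 characters per token across the labeled and hidden documents combined. As a rough sketch, and assuming (the notebook does not say so) that the ratio is meant for sizing text chunks before tokenization, it can turn a token budget into an approximate character budget. The helper name `approx_chars_for_tokens` is hypothetical, and because the ratio is a corpus-wide average, individual documents will deviate from it.

```python
# Totals copied from the outputs of the two cells above.
total_chars = 7175390
total_tokens = 1385999
chars_per_token = total_chars / total_tokens

def approx_chars_for_tokens(target_tokens, ratio=chars_per_token):
    """Estimate how many characters of text fit in `target_tokens` tokens.

    Hypothetical helper: the estimate is only as good as the average ratio.
    """
    return int(target_tokens * ratio)

print(approx_chars_for_tokens(64))   # character budget for a 64-token chunk
print(approx_chars_for_tokens(512))  # character budget for a 512-token window
```

Any chunk cut this way would still need to be tokenized and checked against the model's real limit.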

In [None]:
target_tokens=64

# all

In [None]:
%%time
# Embed document 303 one window at a time and pickle each window's output
# separately, so the overlapping windows never all sit in memory at once.
if torch.cuda.is_available():
    model.to('cuda')

for index, row in df_labeled.iterrows():
    if index == 303:
        tokens=row['tokens']
        if torch.cuda.is_available():
            tokens = tokens.to('cuda')
        num_tokens=tokens.size()[1]
        windows=list_windows(num_tokens,512,4)
        
        with torch.no_grad():
            for i, (start, end) in enumerate(windows):
                # Move the output to the CPU so the pickle can be loaded without a GPU.
                with open(f'embeddings_{index}_{i}.pkl','wb') as f:
                    pickle.dump(model(tokens[:,start:end])[0].cpu(),f)
                print(i)
    elif index == -1:
        # In-memory variant kept for reference; index -1 is not expected to match any row.
        tokens=row['tokens']
        if torch.cuda.is_available():
            tokens = tokens.to('cuda')
        num_tokens=tokens.size()[1]
        windows=list_windows(num_tokens,512,4)
        vectors=torch.zeros((len(windows),num_tokens,1024))

        with torch.no_grad():
            for i, (start, end) in enumerate(windows):
                vectors[i,start:end]=model(tokens[:,start:end])[0]
        # Average over the windows covering each token; zeros mark "not covered".
        avg_vectors = vectors.sum(dim=0)/(vectors!=0).sum(dim=0)
        with open(f'embeddings_{index}.pkl','wb') as f:
            pickle.dump(avg_vectors,f)
        print(index)

In [None]:
num_tokens=df_labeled.loc[303,'tokens'].size()[1]
windows=list_windows(num_tokens,512,4)
windows
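
`list_windows` is defined earlier in the notebook and is not shown in this section. To make the `(num_tokens, 512, 4)` arguments more concrete, here is a minimal sketch of one plausible sliding-window scheme. It assumes the third argument is an overlap factor, so that consecutive windows start `size // overlap_factor` tokens apart; the real function may behave differently, and the name `sketch_windows` is hypothetical.

```python
def sketch_windows(num_tokens, size, overlap_factor):
    """Return (start, end) pairs of overlapping windows; `end` is exclusive.

    Assumption: windows start `size // overlap_factor` tokens apart, so most
    tokens fall inside `overlap_factor` different windows.
    """
    stride = max(1, size // overlap_factor)
    windows = []
    start = 0
    while True:
        end = min(start + size, num_tokens)  # clip the last window to the document
        windows.append((start, end))
        if end >= num_tokens:
            break
        start += stride
    return windows

print(sketch_windows(10, 4, 2))
```

With a window of 4 tokens and an overlap factor of 2, a 10-token document yields windows starting every 2 tokens, the last one ending exactly at the final token.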

In [None]:
%%time
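num_tokens=df_labeled.loc[303,'tokens'].size()[1]
windows=list_windows(num_tokens,512,4)

# For every token position, list the indices of the windows that contain it.
appears_in={}
for i in range(num_tokens):
    appears_in[i]=[]
    for j, (start, end) in enumerate(windows):
        if start <= i < end:
            appears_in[i].append(j)
# Collapse consecutive tokens covered by the same windows into a single
# (first_token, last_token) -> [window indices] entry; both ends are inclusive.
lookup={}
for k, v in appears_in.items():
    if k==0:
        start=0
        curr_val=v
    elif v != curr_val:
        lookup[(start,k-1)]=curr_val
        start=k
        curr_val=v
# Close the final run, which the loop above leaves open.
lookup[(start,k)]=curr_val
lookup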
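
The run-length grouping above is easier to follow on a toy input. The sketch below wraps the same logic in a function (the name `build_lookup` and the example windows are made up for illustration) and applies it to a 10-token document split into four overlapping windows.

```python
def build_lookup(num_tokens, windows):
    """Map inclusive token ranges to the list of window indices covering them."""
    lookup = {}
    run_start = 0
    current = None
    for i in range(num_tokens):
        # Indices of all windows whose half-open range [start, end) contains token i.
        covering = [j for j, (start, end) in enumerate(windows) if start <= i < end]
        if i == 0:
            current = covering
        elif covering != current:
            # Coverage changed: close the previous run and start a new one.
            lookup[(run_start, i - 1)] = current
            run_start, current = i, covering
    if num_tokens > 0:
        lookup[(run_start, num_tokens - 1)] = current
    return lookup

toy_windows = [(0, 4), (2, 6), (4, 8), (6, 10)]
toy_lookup = build_lookup(10, toy_windows)
print(toy_lookup)
# {(0, 1): [0], (2, 3): [0, 1], (4, 5): [1, 2], (6, 7): [2, 3], (8, 9): [3]}
```

Each key is a stretch of tokens that shares the same set of windows, which is what lets the averaging step load only the windows it needs for that stretch.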

In [None]:
num_tokens

In [None]:
%%time
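# Rebuild the per-token average for document 303 from the pickled windows,
# keeping only the windows needed for the current token range in memory.
loaded_vectors={}
avg_vectors=None
for k,v in lookup.items():
    # Drop windows with a lower index than any needed here; later ranges never use them.
    for i in list(loaded_vectors.keys()):
        if i < min(v):
            del loaded_vectors[i]
    vectors=torch.zeros((len(v),k[1]-k[0]+1,1024))
    for j, i in enumerate(v):
        if i not in loaded_vectors:
            with open(f'embeddings_303_{i}.pkl','rb') as f:
                loaded_vectors[i]=pickle.load(f)
        # Convert document token positions into positions inside window i.
        offset=windows[i][0]
        start=k[0]-offset
        end=k[1]-offset+1
        vectors[j]=loaded_vectors[i][0,start:end,:]
    # Every row of `vectors` is a real window output here, so the non-zero count
    # equals len(v) unless an embedding value happens to be exactly 0.
    segment_avg=vectors.sum(dim=0)/(vectors!=0).sum(dim=0)
    if avg_vectors is None:
        avg_vectors=segment_avg
    else:
        avg_vectors=torch.cat((avg_vectors,segment_avg),0)
print(avg_vectors.size())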
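
The averaging expression `vectors.sum(dim=0)/(vectors!=0).sum(dim=0)` infers how many windows cover a position by counting non-zero values. The toy sketch below shows the same overlap average with an explicit counter instead, using plain Python lists and one number per token in place of each 1024-dimensional embedding so that it runs without a model. The function name `average_overlaps` and the sample values are made up for illustration.

```python
def average_overlaps(num_tokens, windows, window_outputs):
    """Average per-token values over every window that covers the token.

    `windows` holds (start, end) pairs with `end` exclusive, and
    `window_outputs[j]` holds one value per token of window j.
    """
    totals = [0.0] * num_tokens
    counts = [0] * num_tokens
    for (start, end), output in zip(windows, window_outputs):
        for offset, value in enumerate(output):
            totals[start + offset] += value
            counts[start + offset] += 1
    # A token covered by no window has no defined average.
    return [t / c if c else float('nan') for t, c in zip(totals, counts)]

toy_windows = [(0, 4), (2, 6)]
toy_outputs = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]]
averaged = average_overlaps(6, toy_windows, toy_outputs)
print(averaged)  # [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```

Tokens 2 and 3 sit in both windows, so their value is the mean of the two outputs; the remaining tokens keep the value from their single window. Counting coverage explicitly avoids treating a genuine zero in an embedding as "not covered".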

In [None]:
with open('embeddings_303.pkl','wb') as f:
    pickle.dump(avg_vectors,f)

In [None]:
avg_vectors=None
for i in range(3):
    new_vector=torch.randn((5,8))
    if avg_vectors is None:
        avg_vectors=new_vector
    else:
        avg_vectors=torch.cat((avg_vectors,new_vector),1)
avg_vectors.size()

In [None]:
%%time
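# Embed every hidden document: run the model over overlapping windows and
# average the outputs for tokens that fall in more than one window.
for index, row in df_hidden.iterrows():
    tokens=row['tokens']
    if torch.cuda.is_available():
        tokens = tokens.to('cuda')
    num_tokens=tokens.size()[1]
    windows=list_windows(num_tokens,512,4)
    # One row per window; positions a window does not cover stay at zero.
    vectors=torch.zeros((len(windows),num_tokens,1024))
    
    with torch.no_grad():
        for i, (start, end) in enumerate(windows):
            vectors[i,start:end]=model(tokens[:,start:end])[0]
    # Divide by the number of non-zero entries, i.e. the windows covering each token.
    avg_vectors = vectors.sum(dim=0)/(vectors!=0).sum(dim=0)
    with open(f'hidden_embeddings_{index}.pkl','wb') as f:
        pickle.dump(avg_vectors,f)
    print(index)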

In [None]:
with open('embeddings.pkl','wb') as f:
    pickle.dump(embeddings,f)

In [None]:
with open('indeces.pkl','wb') as f:
    pickle.dump(indeces,f)

In [None]:
%%time
all_embeddings=torch.cat(embeddings,0)
all_embeddings.size()

In [None]:
%%time
u,s,v=all_embeddings.svd()

In [None]:
%%time
# Save the SVD factors and the stacked embeddings, closing each file after writing.
for name, obj in [('u',u),('s',s),('v',v),('embeddings_large',all_embeddings)]:
    with open(f'{name}.pkl','wb') as f:
        pickle.dump(obj,f)

In [None]:
df_labeled['num_tokens']

In [None]:
df_labeled