## Development Notebook: build and test base layers for Anathem Transformer (aka Silo'd Transformer)

### Notes
- the google-minature models have the same vocab size and heads as bert-large-ucased
- the minature-google papers discusses the classification and distallation tasks & corpus's including:
    - *NLI* (Natural language inference involves classifying pairs of sentences (a premise and a hypothesis) as entailment, contradiction, or neutral. This task is representative of the scenario in which proxy data is non-trivial to gather (Gururangan et al., 2018). We chose MNLI (Williams et al., 2018) as our target dataset. Since strictly in-domain data is difficult to obtain, we supplement DT with two other sentence-pair datasets: SNLI (Bowman et al., 2015) and QQP (Chen et al., 2018).
    - *sentiment analysis* -
- the MTEB leader best model is e5-large (24 layers) which uses the CLS token. It is also "instruction fine-tuned", requiring query and passage prefixes.
- distillation example: https://github.com/philschmid/knowledge-distillation-transformers-pytorch-sagemaker/blob/master/knowledge-distillation.ipynb
    - they set temperature to 2: which results in a flatter probability distribution. I could make this dynamic -> start 0.5 progress to 1
    - they set alpha to 0.5, which balances label-loss vs distil-loss

#### Loss MLM - hf example:
- https://github.com/huggingface/transformers/blob/601ac5b1dc1438f00d09696588f2deb0f045ae3b/src/transformers/modeling_bert.py#L1001-L1004
    - notice that when initializing CrossEntropyLoss, the ignore index is -100, so, when I make the masked-token objective, I can compute the loss by masking out all -100?


#### DataCollator for Masked MLM - hf example
- https://github.com/huggingface/transformers/blob/ee88ae59940fd4b2c8fc119373143d7a1175c651/src/transformers/data/data_collator.py#L607


# Dataset specifics

### From the Google mini-architectures:
- with labels: Williams 2018 (NLI-task): citation: https://aclanthology.org/N18-1101/; available at https://huggingface.co/datasets/multi_nli  
    - how should I process these? [sep] or sentence pairs? or both?
    - I could do sentence-pairs for teaching & labels, I guess (why not)
    - I could also include concatenated text, stricly with labels (what would be the point of this though? Better sub-sectioning the input data, not so much a sentence-vector thing
- with no-labels, used for teaching: Since strictly in-domain data is difficult to obtain, we supplement DT with two other sentence-pair datasets: SNLI (Bowman et al., 2015) and QQP (Chen et al., 2018).

### 1) MLM Tasks
- Pile (multi-domain, books, wiki, law, and more) - curate and remove twitter  
    - see urls at: https://github.com/EleutherAI/the-pile/blob/master/the_pile/datasets.py
    - https://the-eye.eu/public/AI/pile_preliminary_components/
- Supplements to pile:  
    - https://huggingface.co/datasets/him1411/EDGAR10-Q - numeric filings
    - eloukas/edgar-corpus - annual reports (but it is in weird sections)
    - LEDGAR .jsonl https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A - this can be streamed too
    - Pile of Law - https://huggingface.co/datasets/pile-of-law/pile-of-law - but cannot be streamed
- JanosAudran/financial-reports-sec - SEC financial reports in small sentences
- RefinedWeb - a competitor to Pile, curated common-crawl - https://arxiv.org/abs/2306.01116
- CNN_dailymail? ag_news?

### A) Retrieval Tasks
In general, what loss would I use for the QA & retrieval tasks? Distillation is obvious, but what about
- SQUAD - has QA pairs - squad_v2
    - good for distillation
- ORCA - has GPT-like prompting QA pairs: https://huggingface.co/datasets/Open-Orca/OpenOrca/viewer/Open-Orca--OpenOrca/train?row=29
- Simple-Wiki https://huggingface.co/datasets/embedding-data/simple-wiki - has paraphrases
- embedding-data/coco_captions_quintets - multiple captions as paraphrases
- embedding-data/simple-wiki - pairs of paraphrases from wikipedia
- embedding-data/SPECTER - triplets of {anchor, pos, neg}, small headline-like snippets in technical /statistical /science fields
- https://huggingface.co/embedding-data - has a lot of retrieval tasks
- LLukas22/scidocs - titles and abstracts
- LEDGAR - can possible do triplets on same label
- Rahmaa/ElsevieR_ClEaN - possible relation between title and abstract
- embedding-data/WikiAnswers - 25 question paraphrases (maybe no answers)

### B) QA Tasks
- squad_2
- WikiHow - used by S-BERT (questions and articles) - needs to be manually downloaded - https://github.com/mahnazkoupaee/WikiHow-Dataset/  
    - but see: wanicca/WikiHowQA-mnbvc - looks good
- trivia_qa - 680 question, ans, evidence triplets. But, the context strings are very long (like wikipedia) and the questions are almost pop culture
- LLukas22/fiqa - financial QA, like conversations
- embedding-data/WikiAnswers - question-duplicates as paraphrases
- embedding-data/QQP_triplets - question-duplicates plus negatives (Quora)
- LLukas22/lfqa_preprocessed - question and answers 226k
- gbharti/finance-alpaca (like FIQA - finance Q&A)
- embedding-data/PAQ_pairs - wikipedia question & answers
- the_pile_stack_exchange - single texts, but can be split into question, answer
- cais/mmlu - multiple choice, but some of the answers are longers (need to filter)
- sciq - science questions - see question and support
- wiki_qa - wikipedia QA
- qasc - high-school questions - can combine the "facts" into a support
- pubmed_qa - science QA with answers
- EnglishDictionary - auto convert "What is the definition of X'?

## C) NER tasks
- tner/ontonotes5 - has > 12 entities and 59.9k
- tner/multinerd - 23 entiteis and 157k test set - see also tner/wikineural which has a 98.8k training set?
-


# Teacher Models

## Embeddings
Mteb leaderboard

- instructor-xl / large - this does best, but it prepends instructions that are domain specific (like science this, or wikipedia that.... it could be possible to do that with the Pile dataset, possible) https://huggingface.co/hkunlp/instructor-xl
- https://huggingface.co/intfloat/e5-large-v2 - winner otherwise






#### Playing Around with novel architectures

In [None]:
%pip install torch transformers datasets zstandard rank_bm25 langdetect
#%pip install langdetect
from langdetect import detect





In [None]:
from textblob import TextBlob

def has_many_errors(text, threshold=0.5):
    blob = TextBlob(text)

    # Get a list of misspelled words
    misspelled = blob.words.spellcheck()

    # Filter words that are not recognized
    misspelled = [word[0] for word in misspelled if word[1] == '']

    # Calculate the ratio of misspelled words to total words
    misspelled_ratio = len(misspelled) / len(blob.words) if len(blob.words) > 0 else 0

    return misspelled_ratio >= threshold

# Example usage
text1 = "This is a sample English text with a few misspelled words."
text2 = "Thsi is a smaple Enlgish text wtih a feew misspeled wrdos."
text3 = "Это русский текст."

In [None]:
has_many_errors(text1)


**********************************************************************
  Resource [93mpunkt[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt/PY3/english.pickle[0m

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************



MissingCorpusError: ignored

In [None]:
from transformers import AutoModel, AutoTokenizer
import torch
from torch.utils.data import DataLoader, DataSet
from typing import List, Optional
from torch import nn
import torch.nn.functional as F
from torch.cuda import is_available
if is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

from transformers.models.bert.modeling_bert import BertEncoder
from transformers.activations import ACT2FN
import copy

model_string = 'google/bert_uncased_L-12_H-512_A-8' # 'distilroberta-base
tokenizer = AutoTokenizer.from_pretrained(model_string)
basemod = AutoModel.from_pretrained(model_string)
basemod.to(device)

ImportError: ignored

In [None]:
text = [
    "A standard indemnity clause is a waiver clause that states that one party won't hold the other liable for damages, losses, or costs associated with issues.",
    "It usually consists of two elements: a trigger event or circumstance and a payment obligation2. The trigger event or circumstance is the breach of the agreement, misconduct, or negligence of the indemnifying party or its affiliates"
]

In [None]:
from transformers import BertTokenizer


class CustomTokenizer:
    def __init__(self, model_string='google/bert_uncased_L-12_H-512_A-8', n_cls_prepend = 4, n_pad_to_multiple_of=4):
        self.base_tokenizer = AutoTokenizer.from_pretrained(model_string)
        self.n_cls_prepend = n_cls_prepend
        self.n_pad_to_multiple_of = n_pad_to_multiple_of
        for k in dir(self.base_tokenizer):
            if not (k[0]=='_' or k=='tokenize' or k=='encode' or k=='build_inputs_with_special_tokens' or k == 'batch_encode_plus'):
                setattr(self,k,getattr(self.base_tokenizer, k))

    def __call__(self, text, pad_to_multiple_of=None, add_special_tokens = True, return_tensors=None, *args, **kwargs):
        if pad_to_multiple_of is None:
            pad_to_multiple_of = self.n_pad_to_multiple_of

        # run through base tokenizer
        tokens = self.base_tokenizer(
            text,
            pad_to_multiple_of=(pad_to_multiple_of if not add_special_tokens else False),
            add_special_tokens=add_special_tokens,
            return_tensors=return_tensors if (not add_special_tokens) else None,
            *args,
            **kwargs
        )
        if add_special_tokens:
            tokens = self._prepend_extra_cls_tokens_because_of_maxpooling(tokens, return_tensors)

        return tokens

    def _num_pad_tokens(self, token_list):
        """Calculates how many PAD tokens to append to sequence to make a multiple of X"""
        return (self.n_pad_to_multiple_of - ((len(token_list)+(self.n_cls_prepend-1)) % self.n_pad_to_multiple_of)) % self.n_pad_to_multiple_of

    def _prepend_extra_cls_tokens_because_of_maxpooling(self, tokens, return_tensors=None):
        n_cls_prepend = self.n_cls_prepend
        # prepend (n-1) CLS tokens to the front of the token_ids (because of maxpooling)
        # also pad so that the total length is a multiple of n_cls_prepend
        #num_pad_tokens = (self.n_pad_to_multiple_of - ((len_tokens+(n_cls_prepend-1)) % self.n_pad_to_multiple_of)) % self.n_pad_to_multiple_of
        tokens['input_ids'] = [
            [self.cls_token_id]*(n_cls_prepend-1)+input_id + [self.pad_token_id]*self._num_pad_tokens(input_id)
            for input_id
            in tokens['input_ids']
        ]
        tokens['attention_mask'] = [
            [1]*(n_cls_prepend-1)+attnmask +[0]*self._num_pad_tokens(attnmask)
            for attnmask
            in tokens['attention_mask']
        ]
        if 'token_type_ids' in tokens.keys():
            tokens['token_type_ids'] = [
                [toktypeid[0]]*(n_cls_prepend-1)+toktypeid +[toktypeid[-1]]*self._num_pad_tokens(toktypeid)
                for toktypeid
                in tokens['token_type_ids']
            ]
        if return_tensors == 'pt':
            for k,v in tokens.items():
                tokens[k] = torch.LongTensor(v)
        return tokens

    def encode(self, text, pad_to_multiple_of=4, add_special_tokens = True, *args, **kwargs):
        encoded = self.base_tokenizer.encode(text, pad_to_multiple_of=False, add_special_tokens=add_special_tokens, *args, **kwargs)
        if add_special_tokens:
            encoded = [self.cls_token_id]*(pad_to_multiple_of-1) + encoded
        if bool(pad_to_multiple_of):
            num_pad_tokens = (pad_to_multiple_of - (len(encoded) % pad_to_multiple_of)) % pad_to_multiple_of
            encoded += [self.pad_token_id] * num_pad_tokens
        return encoded

    def tokenize(self, text, add_special_tokens=True, *args, **kwargs):
        toks = self.base_tokenizer.tokenize(text, add_special_tokens=add_special_tokens, *args, **kwargs)
        if add_special_tokens:
            toks = [self.cls_token] * (self.n_cls_prepend-1) + toks
        return toks

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ):
        out = self.base_tokenizer.build_inputs_with_special_tokens(token_ids_0, token_ids_1)
        return [self.cls_token_id]*3 + out

    def batch_encode_plus(self, batch_text_or_text_pairs, *args, **kwargs):
        batched_encoded = self.base_tokenizer.batch_encode_plus( batch_text_or_text_pairs, *args, **kwargs)
        #batched_encoded.update({'foo':'bar'})
        return batched_encoded



# Note, if I use the vanilla LineByLineTextDataset, it just calls tokenizer.__call__ turns on the `use_special_tokens`, and it pads to a multiple of optional
# .. so somehow I need to ensure that, whatever base function it calls as part of the tokenizer pipeline, it will continue using MY new function
# the tokenizer.__call__ DOES NOT use `encode` nor `tokenize` otherwise my modifications would manifest
# looks like `prepare_for_model` (and maybe `batch_prepare_for_model`) is what adds special tokens?
# looks like `prepare_for_model` just calls `build_inputs_with_special_tokens`, so maybe intervene there?
#         if add_special_tokens:
#            sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
#            token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
# editing `build_inputs_with_special_tokens` didn't work either

# FOOFU:
# see how .pad works: https://github.com/huggingface/transformers/blob/c5454eba9eac00a3e7d0a46a3d25aacd43187f1e/src/transformers/tokenization_utils_base.py#L2887
# notice the `self.model_input_names[0]` list for a tokenizer -> I should update this for my unique inputs
# ... and there is also a ._pad function

ModuleNotFoundError: ignored

In [None]:
tokenizer2 = CustomTokenizer()
tokenizer2.pad_token_id

In [None]:
#toks = tokenizer2.encode(text[0], add_special_tokens=True)
#print(len(toks)) # works
#print(toks[:10])

tokens = tokenizer2(text, padding='longest', return_tensors=None) # doesn't work, obviously
#print(tokens)
print(len(tokens['input_ids'][0]))
print(len(tokens['attention_mask'][0]))

print(len(tokens['input_ids'][1]))
print(len(tokens['attention_mask'][1]))

tokens

#tokenizer2.batch_encode_plus(text, add_special_tokens=True) # doesn't work


In [None]:
dir(basemod)
# base embedding layers
layer_emb = copy.deepcopy(basemod._modules['embeddings'])


In [None]:
# base trasnformers (full)
layer_basetransformer = copy.deepcopy(basemod._modules['encoder']._modules['layer']._modules['0'])

In [None]:
# text
text = [
    "A standard indemnity clause is a waiver clause that states that one party won't hold the other liable for damages, losses, or costs associated with legal issues1.",
    "It usually consists of two elements: a trigger event or circumstance and a payment obligation2. The trigger event or circumstance is the breach of the agreement, willful misconduct, or negligence of the indemnifying party or its affiliates"
]

import math

#padding_length = int(math.ceil(max_length / 4)) * 4
tokens = tokenizer(text,padding=True, return_tensors='pt', pad_to_multiple_of=4)
input_shape = tokens['input_ids'].size()

# change token padding to be multiple of 4
#ideal_length = int(math.ceil(input_shape[-1] / 4)) * 4 # should be a multiple of 4
#if input_shape[-1]!=ideal_length:
#  tokens = tokenizer(text,padding='max_length', max_length = ideal_length, return_tensors='pt')
#  input_shape = tokens['input_ids'].size()

token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
tokens['token_type_ids'] = token_type_ids
past_key_values_length =0

# need to extend attention mask
extended_attention_mask = basemod.get_extended_attention_mask(tokens['attention_mask'], input_shape)
tokens['extended_attention_mask'] = extended_attention_mask
print(tokens.keys())
print(tokens['input_ids'].shape)


In [None]:
silo_dimensions = {0:basemod.config.hidden_size,
                  1:basemod.config.hidden_size//2,
                  2:basemod.config.hidden_size//4,
                  }
reintegration_dim = silo_dimensions[1] + silo_dimensions[2]


NameError: ignored

In [None]:
embedding_output = layer_emb(
            input_ids=tokens['input_ids'],
            position_ids=tokens.get('position_ids',None),
            token_type_ids=tokens['token_type_ids'],
            inputs_embeds=None,
            past_key_values_length=past_key_values_length
)
print(embedding_output.shape)

NameError: ignored

In [None]:
# basemodel transformer outputs: *full bert model
out_l1 = layer_basetransformer(
    hidden_states = embedding_output,
    attention_mask = tokens['extended_attention_mask'],#tokens['attention_mask'],
    head_mask=None,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    #past_key_values=0,
    #use_cache=None,
    output_attentions=True,
    #output_hidden_states=True,
    #return_dict=True
)

hidden_states_l1 = out_l1[0]
self_attention_l1 = out_l1[1]

NameError: ignored

In [None]:
# Next Layer:
# Query -> max pool and reduce  hidden dimension // 2
# Key -> reduce hidden_dim // 2
# value -> reduce hidden_dim //2
#maxpool_l2 = nn.MaxPool2d((2,1), stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=True)

maxpool_l2 = nn.Sequential(
    nn.Dropout(0.05),
    nn.MaxPool2d((2,1), stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=True),
)

maxpool_l2_attn = nn.MaxPool1d((2), stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=True)

In [None]:
# reduce dimension of hidden states
hiddens_states_l1_reduced = maxpool_l2(hidden_states_l1)
print(hidden_states_l1.shape)
print(hiddens_states_l1_reduced.shape)

# reduce dimension of attention mask
attention_mask_l1_reduced = maxpool_l2_attn(tokens['attention_mask'].float())
print(attention_mask_l1_reduced.shape)

# extend the dimension of the reduced attention_mask
print(input_shape)
extended_attention_mask_l1_reduced = basemod.get_extended_attention_mask(attention_mask_l1_reduced, attention_mask_l1_reduced.shape)
print(tokens['extended_attention_mask'].shape)
print(extended_attention_mask_l1_reduced.shape)

torch.Size([2, 48, 768])
torch.Size([2, 24, 768])
torch.Size([2, 24])
torch.Size([2, 48])
torch.Size([2, 1, 1, 48])
torch.Size([2, 1, 1, 24])


In [None]:
# Try to do Multi Headed attenion with differently sized query and value

In [None]:
import torch
import torch.nn as nn
import math
from typing import Optional, Tuple
import copy

class BertSelfAttnDimensionReduction(nn.Module):
    """Bert Attention Layer that uses a dimension-reduced version of the query, so to reduce the dimension of the outputs"""
    def __init__(
        self,
        config,
        hidden_size_input=768,
        hidden_size_query = None,
        position_embedding_type=None,
        dim_reduction = 2
    ):
        """Special type of Bert Self attention that reduces the dimension of the inputs by half"""
        super().__init__()
        if (config.hidden_size // dim_reduction) % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
                f"heads ({config.num_attention_heads})"
            )
        self.dim_reduction = dim_reduction
        self.hidden_size_input = hidden_size_input
        self.hidden_size_reduced = hidden_size_input // dim_reduction
        if hidden_size_query is None:
            hidden_size_query = hidden_size_input
        self.hidden_size_query = hidden_size_query
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(self.hidden_size_reduced / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(self.hidden_size_query, self.all_head_size)
        self.key = nn.Linear(self.hidden_size_input, self.all_head_size)
        self.value = nn.Linear(self.hidden_size_input, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.position_embedding_type = position_embedding_type or getattr(
            config, "position_embedding_type", "absolute"
        )
        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            self.max_position_embeddings = config.max_position_embeddings
            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)

        self.is_decoder = config.is_decoder

    def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor:
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.FloatTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        encoder_hidden_states: Optional[torch.FloatTensor] = None,
        encoder_attention_mask: Optional[torch.FloatTensor] = None,
        past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        output_attentions: Optional[bool] = False,
    ) -> Tuple[torch.Tensor]:
        mixed_query_layer = self.query(hidden_states)

        # If this is instantiated as a cross-attention module, the keys
        # and values come from an encoder; the attention mask needs to be
        # such that the encoder's padding tokens are not attended to.

        key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))
        value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))
        query_layer = self.transpose_for_scores(mixed_query_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))

        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            query_length, key_length = query_layer.shape[2], key_layer.shape[2]
            if use_cache:
                position_ids_l = torch.tensor(key_length - 1, dtype=torch.long, device=hidden_states.device).view(
                    -1, 1
                )
            else:
                position_ids_l = torch.arange(query_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
            position_ids_r = torch.arange(key_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
            distance = position_ids_l - position_ids_r

            positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
            positional_embedding = positional_embedding.to(dtype=query_layer.dtype)  # fp16 compatibility

            if self.position_embedding_type == "relative_key":
                relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores
            elif self.position_embedding_type == "relative_key_query":
                relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key

        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if encoder_attention_mask is not None:
            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
            #print(attention_scores.shape)
            #print(attention_scores.shape)
            attention_scores = attention_scores + encoder_attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.functional.softmax(attention_scores, dim=-1)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)

        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(new_context_layer_shape)

        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)

        if self.is_decoder:
            outputs = outputs + (past_key_value,)
        return outputs

bertlayer_l2_reduction = BertSelfAttnDimensionReduction(
    config=basemod.config,
    hidden_size_input=basemod.config.hidden_size,
    position_embedding_type=basemod.config.position_embedding_type,
    dim_reduction = 2
)

bertlayer_l3_reduction = BertSelfAttnDimensionReduction(
    config=basemod.config,
    hidden_size_input=basemod.config.hidden_size // 2,
    position_embedding_type=basemod.config.position_embedding_type,
    dim_reduction = 2
)

In [None]:
out_l2 = bertlayer_l2_reduction(
        hidden_states = hiddens_states_l1_reduced,
        attention_mask = extended_attention_mask_l1_reduced,
        head_mask=None,
        encoder_hidden_states = hidden_states_l1,
        encoder_attention_mask= tokens['extended_attention_mask'],
        past_key_value=None,
        output_attentions=False
    )
hidden_states_l2 = out_l2[0]
print(hidden_states_l2.shape)

torch.Size([2, 24, 384])


In [None]:
# Next dimension reduction:
hiddens_states_l2_reduced = maxpool_l2(hidden_states_l2)
print(hidden_states_l2.shape)
print(hiddens_states_l2_reduced.shape)

# reduce dimension of attention mask
attention_mask_l2_reduced = maxpool_l2_attn(attention_mask_l1_reduced.float())
print(attention_mask_l2_reduced.shape)

# extend the dimension of the reduced attention_mask
extended_attention_mask_l2_reduced = basemod.get_extended_attention_mask(attention_mask_l2_reduced, attention_mask_l2_reduced.shape)
print(extended_attention_mask_l2_reduced.shape)

if True:
  out_l3 = bertlayer_l3_reduction(
        hidden_states = hiddens_states_l2_reduced, # input has been maxpooled
        attention_mask = extended_attention_mask_l2_reduced,
        head_mask=None,
        encoder_hidden_states = hidden_states_l2,
        encoder_attention_mask= extended_attention_mask_l1_reduced,
        past_key_value=None,
        output_attentions=False
    )
  hidden_states_l3 = out_l3[0]
  print(hidden_states_l3.shape)


# The outputs of the bertlayer_l3_reduction can now run through a usual BertLayer for 3 times

torch.Size([2, 24, 384])
torch.Size([2, 12, 384])
torch.Size([2, 12])
torch.Size([2, 1, 1, 12])
torch.Size([2, 12, 192])


In [None]:
# The outputs of the bertlayer_l3_reduction can now run through a usual BertLayer for 3 times

config_lowres_encoder = copy.deepcopy(basemod.config)
config_lowres_encoder.hidden_size = config_lowres_encoder.hidden_size//4
config_lowres_encoder.num_hidden_layers = 3
print(config_lowres_encoder)

# The outputs of the bertlayer_l3_reduction can now run through a usual BertLayer for 3 times
encoder_lowres = BertEncoder(config_lowres_encoder)

RobertaConfig {
  "_name_or_path": "distilroberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 192,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 3,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.29.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}



In [None]:
out_encoder_lowres = encoder_lowres(
    hidden_states=hidden_states_l3,
    attention_mask=extended_attention_mask_l2_reduced,
    head_mask = None,
    return_dict=True,
)
hidden_states_lowres = out_encoder_lowres[0]
print(hidden_states_lowres.shape)

torch.Size([2, 12, 192])


In [None]:
## Upresolution Layer: up-resolution from dim-3 to dim-2 is as follows:
# hs_l3 -> upsampled sequence-length as hs-l2
# -> could have another attention-based mechanism that expands dimension of hs-l2

class InterpolateCombo(nn.Module):
    """there could also be an attentive way to do this"""
    def __init__(self, scale_factor=2, dropout=0.05, alpha=0.667):
        """Arguments:
        :param scaler_factor: float, multiple of up-scaling
        :param dropout: float, dropout proportion
        :param alpha: float, mixture weight between nearest-neighbor vs linear-interpolation
        """
        super(InterpolateCombo, self).__init__()
        self.interp = nn.functional.interpolate
        self.scale_factor = scale_factor
        self.dropout = nn.Dropout(dropout)
        self.a = alpha

    def forward(self, x):
        x_trans = x.transpose(-2,-1)
        z = self.a*self.interp(x_trans, mode='nearest',scale_factor=self.scale_factor) + (1-self.a)*self.interp(x_trans, mode='linear',scale_factor=self.scale_factor)
        z = self.dropout(z)
        return z.transpose(-2,-1)

#hidden_states_upscaled_3to2_nearest = nn.functional.interpolate(hidden_states_rowres.transpose(-2,-1), scale_factor=2, mode='nearest').transpose(-2,-1)
#hidden_states_upscaled_3to2_linear = nn.functional.interpolate(hidden_states_rowres.transpose(-2,-1), scale_factor=2, mode='linear').transpose(-2,-1)

upscaler_x2 = InterpolateCombo(scale_factor=2)

In [None]:
hidden_states_upscaled3to2 = upscaler_x2(hidden_states_lowres)


In [None]:
## BertAttentiveIntegrator

class BertCrossAttention(nn.Module):
    def __init__(
        self,
        config,
        hidden_size,
        hidden_size_query,
        hidden_size_keyvalue=None,
        position_embedding_type=None
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.hidden_size_query = hidden_size_query
        if hidden_size_keyvalue is None:
            hidden_size_keyvalue = hidden_size
        self.hidden_size_keyvalue = hidden_size_keyvalue
        if self.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                f"The hidden size ({self.hidden_size}) is not a multiple of the number of attention "
                f"heads ({config.num_attention_heads})"
            )

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(self.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(self.hidden_size_query, self.all_head_size)
        self.key = nn.Linear(self.hidden_size_keyvalue, self.all_head_size)
        self.value = nn.Linear(self.hidden_size_keyvalue, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.position_embedding_type = position_embedding_type or getattr(
            config, "position_embedding_type", "absolute"
        )
        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            self.max_position_embeddings = config.max_position_embeddings
            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)

        self.is_decoder = config.is_decoder

    def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor:
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.FloatTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        query_hidden_states: Optional[torch.FloatTensor] = None,
        query_attention_mask: Optional[torch.FloatTensor] = None,
        past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        output_attentions: Optional[bool] = False,
    ) -> Tuple[torch.Tensor]:
        mixed_query_layer = self.query(query_hidden_states)

        # If this is instantiated as a cross-attention module, the keys
        # and values come from an encoder; the attention mask needs to be
        # such that the encoder's padding tokens are not attended to.
        key_layer = self.transpose_for_scores(self.key(hidden_states))
        value_layer = self.transpose_for_scores(self.value(hidden_states))
        query_layer = self.transpose_for_scores(mixed_query_layer)

        use_cache = past_key_value is not None
        if self.is_decoder:
            # if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.
            # Further calls to cross_attention layer can then reuse all cross-attention
            # key/value_states (first "if" case)
            # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of
            # all previous decoder key/value_states. Further calls to uni-directional self-attention
            # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case)
            # if encoder bi-directional self-attention `past_key_value` is always `None`
            past_key_value = (key_layer, value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))

        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            query_length, key_length = query_layer.shape[2], key_layer.shape[2]
            if use_cache:
                position_ids_l = torch.tensor(key_length - 1, dtype=torch.long, device=hidden_states.device).view(
                    -1, 1
                )
            else:
                position_ids_l = torch.arange(query_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
            position_ids_r = torch.arange(key_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
            distance = position_ids_l - position_ids_r

            positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
            positional_embedding = positional_embedding.to(dtype=query_layer.dtype)  # fp16 compatibility

            if self.position_embedding_type == "relative_key":
                relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores
            elif self.position_embedding_type == "relative_key_query":
                relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key

        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.functional.softmax(attention_scores, dim=-1)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)

        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(new_context_layer_shape)

        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)

        if self.is_decoder:
            outputs = outputs + (past_key_value,)
        return outputs

In [None]:
bertlayer_l3_to_l2_crossattn = BertCrossAttention(
        config=basemod.config,
        hidden_size=silo_dimensions[1],
        hidden_size_query=silo_dimensions[2],
        position_embedding_type=None
    )

In [None]:
print(hidden_states_upscaled3to2.shape)
print(hidden_states_l2.shape)
print(attention_mask_l1_reduced.shape)
print(extended_attention_mask_l1_reduced.shape)

torch.Size([2, 24, 192])
torch.Size([2, 24, 384])
torch.Size([2, 24])
torch.Size([2, 1, 1, 24])


In [None]:
out_l2_postencode = bertlayer_l3_to_l2_crossattn(
    hidden_states = hidden_states_l2,
    attention_mask = extended_attention_mask_l1_reduced,
    head_mask = None,
    query_hidden_states = hidden_states_upscaled3to2,
    query_attention_mask = attention_mask_l1_reduced
)
hidden_states_l2_postencode = out_l2_postencode[0]
print(hidden_states_l2_postencode.shape)
assert hidden_states_l2_postencode.shape == hidden_states_l2.shape

torch.Size([2, 24, 384])


In [None]:
print(basemod.config.hidden_size)
print(basemod.config.intermediate_size)
print(basemod.config.intermediate_size/basemod.config.hidden_size)

768
3072
4.0


In [None]:
# how does bert actually work?
"""
input = x

BertLayer:
- BertAttention
--- x2 = BertSelfAttention(x)
--- x3 = BertSelfOutput(x2,x) -> lnorm(drop(f(x2)) + x)
- BertIntermediate (expension:  4*hidden_size)
--- x4_ex = activation(f(x3)) # expansion (4*)
- BertOutput
--- x5 = lnorm(drop(f(x4_ex)) + x3 )


inputs = x_l2, x_l3_up

BertIntegrativeLayer:
- x2 = BertCrossAttention(k,v=x_l2, q=x_l3_up)
- x3 = lnorm(drop(f(x2)) + x_l2)
- x4_ex = activation( f(cat(x3, x_l3_up))  )
- x5 = lnorm(drop(f(x4_ex)) + x3)
"""


class BertIntegrativeLayer(nn.Module):
    """Vanilla Bert Layer, but integrates other hiddens states from a parallel transformers stack typically low-re"""
    def __init__(
            self,
            config,
            hidden_size,
            hidden_size_query,
            intermediate_size=None
        ):
        super().__init__()
        #self.chunk_size_feed_forward = config.chunk_size_feed_forward
        #self.seq_len_dim = 1
        self.cat = torch.cat
        if intermediate_size is None:
            intermediate_size = int(4*hidden_size)
        self.intermediate_size = intermediate_size
        self.hidden_size = hidden_size
        self.hidden_size_query = hidden_size_query
        self.hidden_size_concat = int(hidden_size + hidden_size_query)

        # cross attention between (low-res) query and hidden layers below
        self.attention = BertCrossAttention(
            config,
            hidden_size,
            hidden_size_query,
            position_embedding_type="absolute"
        )
        self.is_decoder = config.is_decoder
        #self.intermediate = BertIntermediate(config)
        #self.output = BertOutput(config)
        #- x2 = BertCrossAttention(k,v=x_l2, q=x_l3_up)
        #- x3 = lnorm(drop(f(x2)) + x_l2)
        #- x4_ex = activation( f(cat(x3, x_l3_up))  )
        #- x5 = lnorm(drop(f(x4_ex)) + x3)

        # corresponds to BertAttention SelfOutput
        self.output_attn = nn.Linear(self.hidden_size, self.hidden_size)
        self.lnorm_attn = nn.LayerNorm(self.hidden_size, eps=config.layer_norm_eps)
        self.dropout_attn = nn.Dropout(config.hidden_dropout_prob)

        # corresponds to BertIntermediate
        self.intermediate = nn.Linear(self.hidden_size_concat, self.intermediate_size)
        if isinstance(config.hidden_act, str):
            self.intermediate_act_fn = ACT2FN[config.hidden_act]
        else:
            self.intermediate_act_fn = config.hidden_act

        # corresponds to BertOutput
        self.output_intm = nn.Linear(self.intermediate_size, self.hidden_size)
        self.lnorm_intm = nn.LayerNorm(self.hidden_size, eps=config.layer_norm_eps)
        self.dropout_intm = nn.Dropout(config.hidden_dropout_prob)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.FloatTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        query_hidden_states: Optional[torch.FloatTensor] = None,
        query_attention_mask: Optional[torch.FloatTensor] = None,
        past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        output_attentions: Optional[bool] = False,
    ) -> Tuple[torch.Tensor]:
        # decoder uni-directional self-attention cached key/values tuple is at positions 1,2
        self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None

        # cross attn between hiddens states and (low-res) query vector
        cross_attn_outputs = self.attention(
            hidden_states = hidden_states,
            attention_mask = attention_mask,
            head_mask = head_mask,
            query_hidden_states = query_hidden_states,
            query_attention_mask = query_attention_mask
        )
        cross_hidden_states = cross_attn_outputs[0]

        # first Add+Norm skip connection (BertSelfOutput)
        cross_hidden_states = self.dropout_attn(self.output_attn(cross_hidden_states))
        hidden_states = self.lnorm_attn(cross_hidden_states + hidden_states)

        # intermediate expension
        intermediate_states = self.intermediate_act_fn(self.intermediate(
            self.cat((hidden_states, query_hidden_states),axis=2)
        ))
        assert intermediate_states.shape[0]==hidden_states.shape[0]
        assert intermediate_states.shape[1]==hidden_states.shape[1]

        # BertOutput
        intermediate_states = self.dropout_intm(self.output_intm(intermediate_states))
        out_states = self.lnorm_intm(intermediate_states + hidden_states)

        #- x2 = BertCrossAttention(k,v=x_l2, q=x_l3_up)
        #- x3 = lnorm(drop(f(x2)) + x_l2)
        #- x4_ex = activation( f(cat(x3, x_l3_up))  )
        #- x5 = lnorm(drop(f(x4_ex)) + x3)
        return out_states


In [None]:

# from low-res to mid-res
bert_integrative_layer_midres = BertIntegrativeLayer(
    basemod.config,
    hidden_size=silo_dimensions[1],
    hidden_size_query=silo_dimensions[2],
    intermediate_size=silo_dimensions[1]*4,
)

# from mid-res to high-res
bert_integrative_layer_hires = BertIntegrativeLayer(
    basemod.config,
    hidden_size=silo_dimensions[0],
    hidden_size_query=reintegration_dim,
    intermediate_size=silo_dimensions[0]*4,
)

In [None]:
hidden_states_midres = bert_integrative_layer_midres(
    hidden_states = hidden_states_l2,
    attention_mask = extended_attention_mask_l1_reduced,
    head_mask = None,
    query_hidden_states = hidden_states_upscaled3to2,
    query_attention_mask = attention_mask_l1_reduced
)
print(hidden_states_midres.shape)
assert hidden_states_midres.shape == hidden_states_l2.shape

torch.Size([2, 24, 384])


In [None]:
# upscale the l2 and l3 to the full dimension
upscaler_x4 = InterpolateCombo(scale_factor=4)
hidden_states_upscaled3to1 = upscaler_x4(hidden_states_lowres)
hidden_states_upscaled2to1 = upscaler_x2(hidden_states_midres)

hidden_states_upscaled = torch.cat(
    (hidden_states_upscaled2to1, hidden_states_upscaled3to1),
    axis=2)

print(hidden_states_upscaled.shape)

torch.Size([2, 48, 576])


In [None]:
# final layer to bring it up to full dimension
hidden_states_hires = bert_integrative_layer_hires(
    hidden_states = hidden_states_l1,
    attention_mask = extended_attention_mask,
    head_mask = None,
    query_hidden_states = hidden_states_upscaled,
    query_attention_mask = extended_attention_mask
)
print(hidden_states_hires.shape)
assert hidden_states_hires.shape == hidden_states_l1.shape

torch.Size([2, 48, 768])


In [None]:
hidden_states_hires.shape

torch.Size([2, 48, 768])

In [None]:
attention_mask_l1_reduced.shape

torch.Size([2, 24])

### The Reduce and Integrate layer:
- this is like a Transformer block, but:
- does dimension reduction along sequence and embedding-dim
- includes a skip connection from previous hidden-states of the same dimension

In [None]:



# this is the layer that just does cross-attention between a seq-reduced query and full-size value and key


"""
input = x

BertLayer:
- BertAttention
--- x2 = BertSelfAttention(x)
--- x3 = BertSelfOutput(x2,x) -> lnorm(drop(f(x2)) + x)
- BertIntermediate (expension:  4*hidden_size)
--- x4_ex = activation(f(x3)) # expansion (4*)
- BertOutput
--- x5 = lnorm(drop(f(x4_ex)) + x3 )


inputs = x_l2, x_l3_up

BertIntegrativeLayer:
- x2 = BertCrossAttention(k,v=x_l2, q=x_l3_up)
- x3 = lnorm(drop(f(x2)) + x_l2)
- x4_ex = activation( f(cat(x3, x_l3_up))  )
- x5 = lnorm(drop(f(x4_ex)) + x3)


BertReduceAddIntegrativeLayer
inputs = x_l1, x_l1_reduced, x_l2_prev
- x2 = BertCrossAttention(k,v=x_l1, q= cat(x_l1_reduced, x_l2_prev) ) -notice three inputs
- x3 = lnorm(drop(f(x2)) + x_l2_prev)
- x4_ex = activation( f(cat(x3, x_l1_reduced))  )
- x5 = lnorm(drop(f(x4_ex)) + x3)
"""


class BertReduceAddIntegrativeLayer(nn.Module):
    """Bert Layer that does dimenion reduction along embedding-dimenion and integrations a skip connection"""
    def __init__(
            self,
            config,
            hidden_size,
            hidden_size_input=None,
            hidden_size_query=None,
            intermediate_size=None,
            dim_reduction=2,
            do_concat_hidden_and_query = True
        ):
        super().__init__()
        #self.chunk_size_feed_forward = config.chunk_size_feed_forward
        #self.seq_len_dim = 1
        self.cat = torch.cat
        self.do_concat_hidden_and_query = do_concat_hidden_and_query
        assert bool(do_concat_hidden_and_query), 'not implemented: concatenation of query and hidden-states must happen'
        self.hidden_size = hidden_size
        if dim_reduction is None:
            dim_reduction = 2
        self.dim_reduction = dim_reduction
        if intermediate_size is None:
            intermediate_size = int(4*hidden_size)
        self.intermediate_size = intermediate_size
        if hidden_size_input is None:
            hidden_size_input = hidden_size
        self.hidden_size_input = hidden_size_input
        if hidden_size_query is None:
            hidden_size_query = hidden_size_input
        self.hidden_size_query = hidden_size_query + do_concat_hidden_and_query*hidden_size
        self.hidden_size_concat = int(hidden_size + hidden_size_input)

        # cross attention between (low-res) query and hidden layers below
        self.attention = BertSelfAttnDimensionReduction(
            config,
            hidden_size_input=self.hidden_size_input,
            hidden_size_query = self.hidden_size_query,
            position_embedding_type="absolute",
            dim_reduction = self.dim_reduction
        )
        self.is_decoder = config.is_decoder
        #inputs = x_l1, x_l1_reduced, x_l2_prev
        #- x2 = BertCrossAttention(k,v=x_l1, q= cat(x_l1_reduced, x_l2_prev) ) -notice three inputs
        #- x3 = lnorm(drop(f(x2)) + x_l2_prev)
        #- x4_ex = activation( f(cat(x3, x_l1_reduced))  )
        #- x5 = lnorm(drop(f(x4_ex)) + x3)

        # corresponds to BertAttention SelfOutput
        self.output_attn = nn.Linear(self.hidden_size, self.hidden_size)
        self.lnorm_attn = nn.LayerNorm(self.hidden_size, eps=config.layer_norm_eps)
        self.dropout_attn = nn.Dropout(config.hidden_dropout_prob)

        # corresponds to BertIntermediate
        self.intermediate = nn.Linear(self.hidden_size_concat, self.intermediate_size)
        if isinstance(config.hidden_act, str):
            self.intermediate_act_fn = ACT2FN[config.hidden_act]
        else:
            self.intermediate_act_fn = config.hidden_act

        # corresponds to BertOutput
        self.output_intm = nn.Linear(self.intermediate_size, self.hidden_size)
        self.lnorm_intm = nn.LayerNorm(self.hidden_size, eps=config.layer_norm_eps)
        self.dropout_intm = nn.Dropout(config.hidden_dropout_prob)

    def forward(
        self,
        inputs: torch.Tensor, # higher-resolution inputs for key and values (long sequence dimension)
        hidden_states: torch.Tensor, # previous hidden-states for skip connection (short squence-dim, low-res)
        attention_mask: Optional[torch.FloatTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        query_hidden_states: torch.FloatTensor = None, # hidden-states for query (short squence-dim, low-res)
        query_attention_mask: Optional[torch.FloatTensor] = None,
        past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        output_attentions: Optional[bool] = False,
    ) -> Tuple[torch.Tensor]:
        # decoder uni-directional self-attention cached key/values tuple is at positions 1,2
        self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None

        if self.do_concat_hidden_and_query:
            query_hidden_states_plus = torch.cat((query_hidden_states, hidden_states),axis=2)
        # cross attn between (low-res) query vector and (high-res) key-values
        cross_attn_outputs = self.attention(
            query_hidden_states_plus, # query (short seq-dim, high-res)
            attention_mask=attention_mask,
            head_mask=head_mask,
            encoder_hidden_states = inputs, # for key/value (longer sequence dimension, high-res)
            past_key_value=past_key_value,
            output_attentions=output_attentions,
        )
        cross_hidden_states = cross_attn_outputs[0]

        # first Add+Norm skip connection (BertSelfOutput)
        cross_hidden_states = self.dropout_attn(self.output_attn(cross_hidden_states))
        hidden_states = self.lnorm_attn(cross_hidden_states + hidden_states)

        # intermediate expension
        intermediate_states = self.intermediate_act_fn(self.intermediate(
            self.cat((hidden_states, query_hidden_states),axis=2)
        ))
        assert intermediate_states.shape[0]==hidden_states.shape[0]
        assert intermediate_states.shape[1]==hidden_states.shape[1]

        # BertOutput
        intermediate_states = self.dropout_intm(self.output_intm(intermediate_states))
        out_states = self.lnorm_intm(intermediate_states + hidden_states)

        #inputs = x_l1, x_l1_reduced, x_l2_prev
        #- x2 = BertCrossAttention(k,v=x_l1, q= cat(x_l1_reduced, x_l2_prev) ) -notice three inputs
        #- x3 = lnorm(drop(f(x2)) + x_l2_prev)
        #- x4_ex = activation( f(cat(x3, x_l1_reduced))  )
        #- x5 = lnorm(drop(f(x4_ex)) + x3)
        return out_states


In [None]:
# initialize the mid-resolution BertReduceAndIntegrate layer
bert_reduce_add_integrate_midres = BertReduceAddIntegrativeLayer(
    config,
    hidden_size = silo_dimensions[1], # size of mid-res
    hidden_size_input=silo_dimensions[0],
    hidden_size_query=silo_dimensions[0],
    intermediate_size=silo_dimensions[1]*3,
    dim_reduction=2,
    do_concat_hidden_and_query = True
)

bert_reduce_add_integrate_lowres = BertReduceAddIntegrativeLayer(
    config,
    hidden_size = silo_dimensions[2], # size of mid-res
    hidden_size_input=silo_dimensions[1],
    hidden_size_query=silo_dimensions[1],
    intermediate_size=silo_dimensions[2]*3,
    dim_reduction=2,
    do_concat_hidden_and_query = True
)

In [None]:
# Reduce sequence-dim from l1->l2, and from high-res->mid-res
hidden_states_hires_reduced = maxpool_l2(hidden_states_hires)
assert hidden_states_hires_reduced.shape[1] == hidden_states_midres.shape[1] # reduced-seq-dim should be same as mid-res hidden-states
print(hidden_states_midres.shape)
hidden_states_midres = bert_reduce_add_integrate_midres(
    inputs = hidden_states_hires, # from highres outputs previous layer (key, values)
    hidden_states = hidden_states_midres, # previous hidden-states for skip connection (short squence-dim, low-res)
    attention_mask = extended_attention_mask_l1_reduced,
    head_mask=None,
    query_hidden_states = hidden_states_hires_reduced # reduced version of high-res inputs (reduced along sequence dimenion)
)
print(hidden_states_midres.shape)

torch.Size([2, 24, 384])
torch.Size([2, 24, 384])


In [None]:
# Reduce sequence-dim from l1->l2, and from high-res->mid-res
hidden_states_midres_reduced = maxpool_l2(hidden_states_midres)
assert hidden_states_midres_reduced.shape[1] == hidden_states_lowres.shape[1] # reduced-seq-dim should be same as mid-res hidden-states
print(hidden_states_midres_reduced.shape)

if True:
  print(hidden_states_lowres.shape)
  hidden_states_lowres = bert_reduce_add_integrate_lowres(
      inputs = hidden_states_midres, # from highres outputs previous layer (key, values)
      hidden_states = hidden_states_lowres, # previous hidden-states for skip connection (short squence-dim, low-res)
      attention_mask = extended_attention_mask_l2_reduced,
      head_mask=None,
      query_hidden_states = hidden_states_midres_reduced # reduced version of high-res inputs (reduced along sequence dimenion)
  )
  print(hidden_states_lowres.shape)

torch.Size([2, 12, 384])
torch.Size([2, 12, 192])
torch.Size([2, 12, 192])


In [None]:
try:
    from transformers.modeling_utiles import get_extended_attention_mask
except:
    def get_extended_attention_mask(self, attention_mask: torch.Tensor, input_shape: Tuple[int], device: device) -> torch.Tensor:
        """
        Makes broadcastable attention and causal masks so that future and masked tokens are ignored.

        Arguments:
            attention_mask (:obj:`torch.Tensor`):
                Mask with ones indicating tokens to attend to, zeros for tokens to ignore.
            input_shape (:obj:`Tuple[int]`):
                The shape of the input to the model.
            device: (:obj:`torch.device`):
                The device of the input to the model.

        Returns:
            :obj:`torch.Tensor` The extended attention mask, with a the same dtype as :obj:`attention_mask.dtype`.
        """
        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
        # ourselves in which case we just need to make it broadcastable to all heads.
        if attention_mask.dim() == 3:
            extended_attention_mask = attention_mask[:, None, :, :]
        elif attention_mask.dim() == 2:
            # Provided a padding mask of dimensions [batch_size, seq_length]
            # - if the model is a decoder, apply a causal mask in addition to the padding mask
            # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]
            if self.config.is_decoder:
                batch_size, seq_length = input_shape
                seq_ids = torch.arange(seq_length, device=device)
                causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]
                # in case past_key_values are used we need to add a prefix ones mask to the causal mask
                # causal and attention masks must have same type with pytorch version < 1.3
                causal_mask = causal_mask.to(attention_mask.dtype)

                if causal_mask.shape[1] < attention_mask.shape[1]:
                    prefix_seq_len = attention_mask.shape[1] - causal_mask.shape[1]
                    causal_mask = torch.cat(
                        [
                            torch.ones(
                                (batch_size, seq_length, prefix_seq_len), device=device, dtype=causal_mask.dtype
                            ),
                            causal_mask,
                        ],
                        axis=-1,
                    )

                extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]
            else:
                extended_attention_mask = attention_mask[:, None, None, :]
        else:
            raise ValueError(
                "Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(
                    input_shape, attention_mask.shape
                )
            )

        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
        # masked positions, this operation will create a tensor which is 0.0 for
        # positions we want to attend and -10000.0 for masked positions.
        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.
        extended_attention_mask = extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
        return extended_attention_mask

### Base-Layer nn.Module

In [None]:
from transformers import AutoModel, AutoTokenizer, AutoConfig
import torch
from torch import nn
from torch import Tensor

from transformers.models.bert.modeling_bert import BertEncoder
from transformers.activations import ACT2FN
from typing import List, Optional, Tuple, Union

def make_config(
    modelstring = "distilroberta-base",
    num_transformer_stacks = 2, # number of transformer stacks
    scale_ratio2 = 0.5, # reduce sequence-length by X, from high-res to mid-res
    scale_ratio3 = 0.25, # reduce sequence-length by Y, from high-res to low-res
    multipler_intermediate2 = 4.0, # intermeidate size is a multiple of hidden size
    multipler_intermediate3 = 4.0, # intermeidate size is a multiple of hidden size
    num_layers_l2 = 1, # mid-res encoder
    num_layers_l3 = 3, # low-res encoder
    dropout_scaling = 0.05, # dropout when performing downscaling from one-sequence length to next
    use_cheap_integrator_for_stacks = [],
    do_mlm=False,# whether to output MLM token predictions
    do_cls=False,# whether to output a pooled sentence-vector for sequence classification
):
    #if True:
    #modelstring = "distilroberta-base"
    #scale_ratio2 = 0.5
    #scale_ratio3 = 0.25
    #scale_intermediate2 = 4
    #scale_intermediate3 = 4
    base_config = AutoConfig.from_pretrained(modelstring)
    config_l2 = copy.deepcopy(base_config)
    config_l3 = copy.deepcopy(base_config)
    setattr(base_config,'model_string', modelstring)
    setattr(base_config,'num_transformer_stacks',num_transformer_stacks)
    setattr(base_config,'num_layers_l2', num_layers_l2)
    setattr(base_config,'num_layers_l3', num_layers_l3)
    setattr(base_config,'scale_ratio2', scale_ratio2)
    setattr(base_config,'scale_ratio3', scale_ratio3)
    setattr(base_config,'scale_factor2', int(1/base_config.scale_ratio2))
    setattr(base_config,'scale_factor3', int(1/base_config.scale_ratio3*base_config.scale_ratio2))
    setattr(base_config,"hidden_size_l2", int(base_config.hidden_size * scale_ratio2))
    setattr(base_config,"hidden_size_l3", int(base_config.hidden_size * scale_ratio3))
    setattr(base_config,"intermediate_size_l1", int(base_config.hidden_size_l2*multipler_intermediate2))
    setattr(base_config,"intermediate_size_l2", int(base_config.hidden_size_l3*multipler_intermediate3))
    setattr(base_config,"query_size1", base_config.hidden_size_l2 + base_config.hidden_size_l3)
    setattr(base_config,"query_size2", base_config.hidden_size_l3)
    setattr(base_config,"dropout_scaling", dropout_scaling)
    setattr(base_config,"use_cheap_integrator_for_stacks", use_cheap_integrator_for_stacks)
    setattr(base_config, "do_mlm", do_mlm)
    setattr(base_config, "do_cls", do_cls)

    # make the configuration for the l2 mid-res encoder
    config_l2.hidden_size = base_config.hidden_size_l2
    config_l2.num_hidden_layers = num_layers_l2
    setattr(base_config, 'config_l2', config_l2)

    # make the configuration for the l3 encoder
    config_l3.hidden_size = base_config.hidden_size_l3
    config_l3.num_hidden_layers = num_layers_l3
    setattr(base_config, 'config_l3', config_l3)
    return base_config


def initialize_baselayers(config, basemod = None, tokenizer=None, stack_id=0):
    """Initializes the embeddings and first stack of layers for the Anathem transformers"""
    # initialize the basemodel
    if basemod is None:
        basemod = AutoModel.from_pretrained(config.model_string)
    if tokenizer is None:
        # download pretrained tokenizer
        tokenizer = AutoTokenizer.from_pretrained(config.model_string)

    device = basemod.device
    setattr(config, 'device', device)

    # get basemodel's embeddings
    layer_embedding = copy.deepcopy(basemod._modules['embeddings'])

    # get basemodel's first transformer block
    layer_basetransformer = copy.deepcopy(basemod._modules['encoder']._modules['layer']._modules['0'])

    # initialize the maxpooling downsamplers
    maxpool = nn.Sequential(
        nn.Dropout(config.dropout_scaling),
        nn.MaxPool2d((2,1), stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=True)
    )
    # pooling the attention has no dropout
    maxpool_attn = nn.MaxPool1d((2), stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=True)

    # initialize downsampling attention layers
    bert_reducer_l2 = BertSelfAttnDimensionReduction(
        config=config,
        hidden_size_input=config.hidden_size,
        position_embedding_type=config.position_embedding_type,
        dim_reduction = config.scale_factor2
    )
    # 1/4 hidden size
    bert_reducer_l3 = BertSelfAttnDimensionReduction(
        config=config,
        hidden_size_input=config.hidden_size_l2,
        position_embedding_type=config.position_embedding_type,
        dim_reduction = config.scale_factor3
    )

    # initialize the mid-resolution BertEncoder
    bert_encoder_midres = BertEncoder(config.config_l2)
    # initialize the low-resolution BertEncoder
    bert_encoder_lowres = BertEncoder(config.config_l3)

    # initailize the upscalers
    upscaler_x2 = InterpolateCombo(scale_factor=config.scale_factor3, dropout=config.dropout_scaling)
    upscaler_x4 = InterpolateCombo(scale_factor=int(1/config.scale_ratio3), dropout=config.dropout_scaling)

    # initialize the BertIntegrative Layers: low res to mid res
    bert_integrative_layer_2 = BertIntegrativeLayer(
        config,
        hidden_size=config.hidden_size_l2,
        hidden_size_query=config.hidden_size_l3,
        intermediate_size=config.intermediate_size_l2
    )

    do_cheap_integrator = (stack_id in config.use_cheap_integrator_for_stacks)
    # from mid-res to high-res
    if not do_cheap_integrator:
        # cheap (non-transformer) method to integrate high- and mid-res hidden states
        bert_integrative_layer_1 = CheapMLPIntegrativeLayer(
            config,
            hidden_size=config.hidden_size,
            hidden_size_query=config.query_size1,
            intermediate_size=config.intermediate_size_l1
        )
    else:
        # full Transformer layer as mid-to-highres upscaling
        BertIntegrativeLayer(
            config,
            hidden_size=config.hidden_size,
            hidden_size_query=config.query_size1,
            intermediate_size=config.intermediate_size_l1//2
        )

    return (
        tokenizer,
        basemod,
        layer_embedding,
        layer_basetransformer,
        maxpool,
        maxpool_attn,
        bert_reducer_l2,
        bert_reducer_l3,
        bert_encoder_midres,
        bert_encoder_lowres,
        upscaler_x2,
        upscaler_x4,
        bert_integrative_layer_2,
        bert_integrative_layer_1
    )

def initialize_midlayers(config, basemod=None, tokenizer=None):
    """Initializes all the intermediate layers for the Anathem transformers"""
    # initialize the maxpooling downsamplers
    maxpool = nn.Sequential(
        nn.Dropout(config.dropout_scaling),
        nn.MaxPool2d((2,1), stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=True)
    )
    # pooling the attention has no dropout
    maxpool_attn = nn.MaxPool1d((2), stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=True)

    # initialize bert attentive downsampling and skipconnection (1/2 embedding dim)
    bert_reduceintegrator_l2 = BertReduceAddIntegrativeLayer(
        config,
        config.hidden_size_l2, # size of mid-res
        hidden_size_input=config.hidden_size, # size full-resolution
        hidden_size_query=config.hidden_size, # size full-resolution
        intermediate_size=config.intermediate_size_l1, # BertIntermediate dimension (expansion *4 the hiddensize)
        dim_reduction=config.scale_factor2, # reduce embedding dimension by factor of 2
        do_concat_hidden_and_query = True
    )

    # 1/4 the size
    bert_reduceintegrator_l3 = BertReduceAddIntegrativeLayer(
        config,
        config.hidden_size_l3, # size of mid-res
        hidden_size_input=config.hidden_size_l2, # size full-resolution
        hidden_size_query=config.hidden_size_l2, # size full-resolution
        intermediate_size=config.intermediate_size_l2, # BertIntermediate dimension
        dim_reduction=config.scale_factor3, # reduce embedding dimension by factor of 2
        do_concat_hidden_and_query = True
    )

    # initialize the low-resolution BertEncoder
    bert_encoder_midres = BertEncoder(config.config_l2)
    bert_encoder_lowres = BertEncoder(config.config_l3)

    # initailize the upscalers
    upscaler_x2 = InterpolateCombo(scale_factor=config.scale_factor3, dropout=config.dropout_scaling)
    upscaler_x4 = InterpolateCombo(scale_factor=int(1/config.scale_ratio3), dropout=config.dropout_scaling)

    # initialize the BertIntegrative Layers: low res to mid res
    bert_integrative_layer_2 = BertIntegrativeLayer(
        config,
        hidden_size=config.hidden_size_l2,
        hidden_size_query=config.hidden_size_l3,
        intermediate_size=config.intermediate_size_l2
    )

    # from mid-res to high-res
    bert_integrative_layer_1 = BertIntegrativeLayer(
        config,
        hidden_size=config.hidden_size,
        hidden_size_query=config.query_size1,
        intermediate_size=config.intermediate_size_l1
    )

    return (
        maxpool,
        maxpool_attn,
        bert_reduceintegrator_l2,
        bert_reduceintegrator_l3,
        bert_encoder_midres,
        bert_encoder_lowres,
        upscaler_x2,
        upscaler_x4,
        bert_integrative_layer_2,
        bert_integrative_layer_1
    )


class AnathemBaseModule(nn.Module):
    """First Sstack of layers with embeddings, that go full circle form high-res to low-res back to high res"""
    def __init__(
            self,
            config,
            basemod=None,
            tokenizer=None,
            past_key_values_length = None,
            device = None
        ):
        super().__init__()
        self.config = config

        # initalize the layers
        (
            tokenizer, basemod,
            layer_embedding,
            layer_basetransformer,
            maxpool,
            maxpool_attn,
            bert_reducer_l2,
            bert_reducer_l3,
            bert_encoder_midres,
            bert_encoder_lowres,
            upscaler_x2,
            upscaler_x4,
            bert_integrative_layer_2,
            bert_integrative_layer_1
        ) = initialize_baselayers(config, basemod, tokenizer)

        self.get_extended_attention_mask = basemod.get_extended_attention_mask
        self.embedding = layer_embedding
        self.layer_basetransformer = layer_basetransformer
        self.maxpool = maxpool
        self.maxpool_attn = maxpool_attn
        self.bert_reducer_l2 = bert_reducer_l2
        self.bert_reducer_l3 = bert_reducer_l3
        self.bert_encoder_midres = bert_encoder_midres
        self.bert_encoder_lowres = bert_encoder_lowres
        self.upscaler_x2 = upscaler_x2
        self.upscaler_x4 = upscaler_x4
        self.bert_integrative_layer_2 = bert_integrative_layer_2
        self.bert_integrative_layer_1 = bert_integrative_layer_1
        if device is None:
            self.to(basemod.device)
            #print(self.device)
            self.device = basemod.device
        else:
            self.to(device)
            self.device = device

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        encoder_hidden_states: Optional[torch.Tensor] = None,
        encoder_attention_mask: Optional[torch.Tensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = False
    ):
        input_shape = input_ids
        past_key_values_length =0 if past_key_values is None else len(past_key_values)

        # extend attention mask
        extended_attention_mask_l1 = self.get_extended_attention_mask(attention_mask, input_shape, self.device)
        # downsample the attention mask to l2 dimension
        attention_mask_l2 = self.maxpool_attn(attention_mask.float())
        extended_attention_mask_l2 = self.get_extended_attention_mask(attention_mask_l2,attention_mask_l2.shape, self.device)
        # downsample the attention mask to l3 dimension
        attention_mask_l3 = self.maxpool_attn(attention_mask_l2.float())
        extended_attention_mask_l3 = self.get_extended_attention_mask(attention_mask_l3,attention_mask_l3.shape, self.device)

        # embed
        embedding_output = self.embedding(
            input_ids = input_ids,
            position_ids = position_ids,
            token_type_ids = token_type_ids,
            #input_embeds=None,
            past_key_values_length = past_key_values_length
        )

        # first transformer block (vanilla transformer)
        out_l1 = self.layer_basetransformer(
            hidden_states = embedding_output,
            attention_mask = extended_attention_mask_l1,
            head_mask=head_mask,
            encoder_hidden_states=None,
            encoder_attention_mask=None,
            output_attentions=output_attentions
        )
        hidden_states_l1 = out_l1[0]

        # downsample to sequence 1 to length sequence 2
        hiddens_states_l1_reduced = self.maxpool(hidden_states_l1)

        # reduce dimenion on sequence 2
        out_l2 = self.bert_reducer_l2(
            hidden_states = hiddens_states_l1_reduced,
            attention_mask = extended_attention_mask_l2,
            head_mask=head_mask,
            encoder_hidden_states = hidden_states_l1,
            encoder_attention_mask= extended_attention_mask_l1,
            past_key_value=past_key_values,
            output_attentions=output_attentions,
        )
        hidden_states_l2 = out_l2[0]

        # Vanilla transformers block at mid-resolution (1/2 seq-length)
        out_encoder = self.bert_encoder_midres(
            hidden_states=hidden_states_l2,
            attention_mask=extended_attention_mask_l2,
            head_mask = head_mask,
            return_dict=return_dict
        )
        hidden_states_l2 = out_encoder[0]

        # reduce sequence length (1/4 seq-length)
        hiddens_states_l2_reduced = self.maxpool(hidden_states_l2)

        # reduce dimenion on sequence 2
        out_l3 = self.bert_reducer_l3(
            hidden_states = hiddens_states_l2_reduced,
            attention_mask = extended_attention_mask_l3,
            head_mask=head_mask,
            encoder_hidden_states = hidden_states_l2,
            encoder_attention_mask= extended_attention_mask_l2,
            past_key_value=past_key_values,
            output_attentions=output_attentions,
        )
        hidden_states_l3 = out_l3[0]

        #print(hidden_states_l3.shape)
        #print(extended_attention_mask_l3.shape)
        # BertEncoder at low-res
        out_encoder = self.bert_encoder_lowres(
            hidden_states=hidden_states_l3,
            attention_mask=extended_attention_mask_l3,
            head_mask = head_mask,
            return_dict=return_dict
        )
        hidden_states_l3 = out_encoder[0]

        # upscaling: l3 to l2
        hidden_states_upscaled3to2 = self.upscaler_x2(hidden_states_l3)

        # integrate sequence-2 and upscaled sequence-3
        hidden_states_l2 = self.bert_integrative_layer_2(
            hidden_states = hidden_states_l2,
            attention_mask = extended_attention_mask_l2,
            head_mask = head_mask,
            query_hidden_states = hidden_states_upscaled3to2,
            query_attention_mask = attention_mask_l2
        )

        # upscaling: l3/l2 to l1 sequence length
        hidden_states_upscaled3to1 = self.upscaler_x4(hidden_states_l3)
        hidden_states_upscaled2to1 = self.upscaler_x2(hidden_states_l2)
        hidden_states_upscaled = torch.cat((
            hidden_states_upscaled2to1, hidden_states_upscaled3to1
        ),axis=2)

        # integrate low-resolution information back to original dimension
        hidden_states_l1 = self.bert_integrative_layer_1(
            hidden_states = hidden_states_l1,
            attention_mask = extended_attention_mask_l1,
            head_mask = head_mask,
            query_hidden_states = hidden_states_upscaled,
            query_attention_mask = extended_attention_mask_l1
        )
        if not return_dict:
            return (
                (hidden_states_l1, hidden_states_l2, hidden_states_l3),
                (extended_attention_mask_l1, extended_attention_mask_l2, extended_attention_mask_l3)
            )
        return {
            "hidden_states": (hidden_states_l1, hidden_states_l2, hidden_states_l3),
            "attention":(extended_attention_mask_l1, extended_attention_mask_l2, extended_attention_mask_l3)
        }


class AnathemMidModule(nn.Module):
    """Stack of layers that go full circle form high-res to low-res back to high res"""
    def __init__(
            self,
            config,
            basemod=None,
            tokenizer=None,
            past_key_values_length = None,
            device=None,
        ):
        super().__init__()
        self.config = config

        # initalize the layers
        (
            maxpool,
            maxpool_attn,
            bert_reducerintegrator_l2,
            bert_reducerintegrator_l3,
            bert_encoder_midres,
            bert_encoder_lowres,
            upscaler_x2,
            upscaler_x4,
            bert_integrative_layer_2,
            bert_integrative_layer_1
        ) = initialize_midlayers(config, basemod, tokenizer)

        self.get_extended_attention_mask = get_extended_attention_mask
        self.maxpool = maxpool
        self.maxpool_attn = maxpool_attn
        self.bert_reducerintegrator_l2 = bert_reducerintegrator_l2
        self.bert_reducerintegrator_l3 = bert_reducerintegrator_l3
        self.bert_encoder_midres = bert_encoder_midres
        self.bert_encoder_lowres = bert_encoder_lowres
        self.upscaler_x2 = upscaler_x2
        self.upscaler_x4 = upscaler_x4
        self.bert_integrative_layer_2 = bert_integrative_layer_2
        self.bert_integrative_layer_1 = bert_integrative_layer_1
        if device is None:
            self.to(basemod.device)
            #print(self.device)
            self.device = basemod.device
        else:
            self.to(device)
            self.device = device

    def forward(
        self,
        hidden_states_highres: torch.Tensor,
        hidden_states_midres: torch.Tensor,
        hidden_states_lowres: torch.Tensor,
        attention_mask: Optional[List[torch.FloatTensor]] = None,
        extended_attention_mask_highres: Optional[List[torch.FloatTensor]] = None,
        extended_attention_mask_midres: Optional[List[torch.FloatTensor]] = None,
        extended_attention_mask_lowres: Optional[List[torch.FloatTensor]] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = False
    ):
        input_shape = hidden_states_highres.shape[:2]
        past_key_values_length =0 if past_key_values is None else len(past_key_values)

        # extend attention mask
        if extended_attention_mask_highres is None:
            extended_attention_mask_highres = self.get_extended_attention_mask(attention_mask, input_shape, self.device)
        if extended_attention_mask_midres is None:
            attention_mask_midres = self.maxpool_attn(attention_mask.float())
            extended_attention_mask_midres = self.get_extended_attention_mask(attention_mask_midres,attention_mask_midres.shape, self.device)
        if extended_attention_mask_lowres is None:
           attention_mask_lowres = self.maxpool_attn(attention_mask_midres.float())
           extended_attention_mask_lowres = self.get_extended_attention_mask(attention_mask_lowres,attention_mask_lowres.shape, self.device)

        # downsample to sequence 1 to length sequence 2
        hiddens_states_l1_reduced = self.maxpool(hidden_states_highres)

        # reduce dimenion on sequence 2
        hidden_states_l2 = self.bert_reducerintegrator_l2(
            inputs = hidden_states_highres, # from highres outputs previous layer (key, values)
            hidden_states = hidden_states_midres, # previous hidden-states for skip connection (short squence-dim, low-res)
            attention_mask = extended_attention_mask_midres,
            head_mask=None,
            query_hidden_states = hiddens_states_l1_reduced
        )

        # Vanilla transformers at mid-resolution (1/2 sequence-length)
        out_encoder = self.bert_encoder_midres(
            hidden_states=hidden_states_l2,
            attention_mask=extended_attention_mask_midres,
            head_mask = None,
            return_dict=return_dict
        )
        hidden_states_l2 = out_encoder[0]

        # reduce sequence length (to 1/4 sequence-length)
        hiddens_states_l2_reduced = self.maxpool(hidden_states_l2)

        # reduce dimenion on sequence 2
        hidden_states_l3 = self.bert_reducerintegrator_l3(
            inputs = hidden_states_midres, # from highres outputs previous layer (key, values)
            hidden_states = hidden_states_lowres, # previous hidden-states for skip connection (short squence-dim, low-res)
            attention_mask = extended_attention_mask_lowres,
            head_mask=None,
            query_hidden_states = hiddens_states_l2_reduced
        )

        # BertEncoder at low-res
        out_encoder = self.bert_encoder_lowres(
            hidden_states=hidden_states_l3,
            attention_mask=extended_attention_mask_lowres,
            head_mask = None,
            return_dict=return_dict
        )
        hidden_states_lowres = out_encoder[0]

        # upscaling: l3 to l2
        hidden_states_upscaled3to2 = self.upscaler_x2(hidden_states_lowres)

        # integrate sequence-2 and upscaled sequence-3
        hidden_states_midres = self.bert_integrative_layer_2(
            hidden_states = hidden_states_l2,
            attention_mask = extended_attention_mask_midres,
            head_mask = None,
            query_hidden_states = hidden_states_upscaled3to2        )

        # upscaling: l3/l2 to l1 sequence length
        hidden_states_upscaled3to1 = self.upscaler_x4(hidden_states_lowres)
        hidden_states_upscaled2to1 = self.upscaler_x2(hidden_states_midres)
        hidden_states_upscaled = torch.cat((
            hidden_states_upscaled2to1, hidden_states_upscaled3to1
        ),axis=2)

        # integrate low-resolution information back to original dimension
        hidden_states_highres = self.bert_integrative_layer_1(
            hidden_states = hidden_states_highres,
            attention_mask = extended_attention_mask_highres,
            head_mask = None,
            query_hidden_states = hidden_states_upscaled,
            query_attention_mask = extended_attention_mask_highres
        )
        if not return_dict:
            return (
                (hidden_states_highres, hidden_states_midres, hidden_states_lowres),
                (extended_attention_mask_highres, extended_attention_mask_midres, extended_attention_mask_lowres)
            )
        return {
            "hidden_states": (hidden_states_highres, hidden_states_midres, hidden_states_lowres),
            "attention":(extended_attention_mask_highres, extended_attention_mask_midres, extended_attention_mask_lowres)
        }

class BertClassificationHead(nn.Module):
    def __init__(self, config, n_classes = 1, activation = 'sigmoid', device=None):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size*2, n_classes)
        if activation == 'tanh':
            self.activation = nn.Tanh()
        elif activation == 'relu':
            self.activation = nn.ReLU()
        elif activation == 'sigmoid':
            self.activation = torch.sigmoid
        elif activation == 'none':
            self.activation = lambda x: x
        if device is not None:
            self.to(device)

    def forward(self, hidden_states, attention_mask) -> torch.Tensor:
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        output_vectors=[]
        first_token_tensor = hidden_states[:, 0]
        output_vectors.append(first_token_tensor)
        # mean pooling
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
        sum_embeddings = torch.sum(hidden_states * input_mask_expanded, 1)
        sum_mask = input_mask_expanded.sum(1)
        sum_mask = torch.clamp(sum_mask, min=1e-9)
        output_vectors.append(sum_embeddings / sum_mask)
        # concatenate
        pooled_output = torch.concat(output_vectors, axis=1)
        #print(pooled_output.shape)
        logits = self.dense(pooled_output)
        return self.activation(logits)


def tokenize_anathem(text, device=device):
    #padding_length = int(math.ceil(max_length / 4)) *
    tokens = tokenizer(text,padding=True, return_tensors='pt', pad_to_multiple_of=4)
    input_shape = tokens['input_ids'].size()

    # change token padding to be multiple of 4
    #ideal_length = int(math.ceil(input_shape[-1] / 4)) * 4 # should be a multiple of 4
    #if input_shape[-1]!=ideal_length:
    #  tokens = tokenizer(text,padding='max_length', max_length = ideal_length, return_tensors='pt')
    #  input_shape = tokens['input_ids'].size()

    token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
    tokens['token_type_ids'] = token_type_ids
    for k,v in tokens.items():
        tokens[k] = v.to(device)

    return tokens

In [None]:
#config = make_config('distilroberta-base')
#config = make_config('t5-small') # can't use t5 because it uses relative
config = make_config('google/bert_uncased_L-12_H-512_A-8') #

if False:
  (tokenizer,basemod,layer_embedding,layer_basetransformer,maxpool,maxpool_attn,bert_reducer_l2,
   bert_reducer_l3,bert_encoder_lowres,upscaler_x2,upscaler_x4,bert_integrative_layer_2,bert_integrative_layer_1) = initialize(config)

# make the basemod and tokenizer
basemod = AutoModel.from_pretrained(config.model_string)
basemod.to(device)
tokenizer = AutoTokenizer.from_pretrained(config.model_string)



Some weights of the model checkpoint at google/bert_uncased_L-12_H-512_A-8 were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# the Anathem encoder includes the embeddings and first transformer block
anathem_encoder1 = AnathemBaseModule(config, basemod, tokenizer)
anathem_encoder2 = AnathemMidModule(config, basemod)

In [None]:
cls_head = BertClassificationHead(config, n_classes = 3, activation = 'none',device=device)


In [None]:
text = [
    "* Welcome home to this gorgeously upgraded, beautifully maintained, three-bedroom home with double attached garage. Drive up to this quiet cul-de-sac and let the experience begin. On the main floor, you’ll notice the abundance of natural light. There is a separate office with view over the front of the property. The layout was customized, with a great open living space. The kitchen is a chef’s dream, with a breakfast bar, granite countertops, stainless steel appliance package, a pantry, and a view out to the sunny west facing yard.",
    "There’s room for formal dining and the family room has a gas fireplace to relax by on the cooler nights. Out back, there’s a stunner of a deck, perfect for BBQ season! Upstairs, you’ll find a massive bonus room with tons of windows. There are two, secondary bedrooms and the master suite is amazing",
]

In [None]:
tokens = tokenize_anathem(text,device)

In [None]:
#stack 1
out1 = anathem_encoder1(
      input_ids = tokens['input_ids'],
      attention_mask = tokens['attention_mask'],
      token_type_ids = tokens['token_type_ids']
)
(hidden_states, extended_attention_masks) = out1



In [None]:
# stack2
out2 = anathem_encoder2(
      hidden_states_highres = hidden_states[0],
      hidden_states_midres = hidden_states[1],
      hidden_states_lowres = hidden_states[2],
      extended_attention_mask_highres = extended_attention_masks[0],
      extended_attention_mask_midres = extended_attention_masks[1],
      extended_attention_mask_lowres = extended_attention_masks[2]
)
(hidden_states, extended_attention_masks) = out2

cls_head(hidden_states[0], tokens['attention_mask'])



tensor([[-0.8376, -0.3891, -0.6668],
        [-0.8747, -0.3621, -0.7735]], device='cuda:0',
       grad_fn=<AddmmBackward0>)

In [None]:
out1[0][0].shape

torch.Size([2, 48, 768])

In [None]:
####

In [None]:
## Next steps, do something simple like sentiment analysis

In [None]:
from datasets import list_datasets, load_dataset
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split
import numpy as np
from tqdm import tqdm
from torch.optim import AdamW
from sklearn.metrics import precision_recall_fscore_support
from scipy.special import softmax
#datasets_list = list_datasets()
#[k for k in datasets_list if 'phrasebank' in k]


In [None]:
#[k for k in datasets_list if 'phrasebank' in k]

dataset = load_dataset('financial_phrasebank', 'sentences_75agree')

# split
idx_train, idx_val = train_test_split(np.arange(len(dataset['train']['sentence'])), test_size=0.1)
dataset_train = [{'text':dataset['train']['sentence'][idx], 'label':dataset['train']['label'][idx]}  for idx in idx_train]
dataset_val = [{'text':dataset['train']['sentence'][idx], 'label':dataset['train']['label'][idx]} for idx in idx_val]



  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
print(len(dataset_train)); print(len(dataset_val))

3107
346


In [None]:
class MyDataset(Dataset):
    """torch dataset."""

    def __init__(self, dataset):
        self.data = dataset
        self.n = len(self.data)

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        unit = self.data[idx]
        return unit

In [None]:
ds_train = MyDataset(dataset_train)
ds_val = MyDataset(dataset_val)

In [None]:
batch_size_train = 12
batch_size_val = 36
lr = 0.00005
eval_iter = 20
n_epochs = 1

In [None]:
dl_train = DataLoader(ds_train, batch_size=batch_size_train, shuffle=True)
dl_val = DataLoader(ds_val, batch_size=batch_size_val, shuffle=False)

In [None]:
optimizer = AdamW(list(anathem_encoder1.parameters()) + list(anathem_encoder2.parameters()) + list(cls_head.parameters()), lr=lr)

In [None]:

optimizer.zero_grad()
anathem_encoder1.train()
anathem_encoder2.train()
cls_head.train()
for epoch in range(n_epochs):

  for iteration, batch in enumerate(tqdm(dl_train, disable=True)):

      # tokenize the batch
      tokens = tokenize_anathem(batch['text'],device)
      target = batch['label'].to(device)

      optimizer.zero_grad()

      out1 = anathem_encoder1(
        input_ids = tokens['input_ids'],
        attention_mask = tokens['attention_mask'],
        token_type_ids = tokens['token_type_ids']
      )
      (hidden_states, extended_attention_masks) = out1

      features,_ = anathem_encoder2(
          hidden_states_highres = hidden_states[0],
          hidden_states_midres = hidden_states[1],
          hidden_states_lowres = hidden_states[2],
          extended_attention_mask_highres = extended_attention_masks[0],
          extended_attention_mask_midres = extended_attention_masks[1],
          extended_attention_mask_lowres = extended_attention_masks[2]
      )

      # prediction
      preds = cls_head(features[0], tokens['attention_mask'])

      # loss
      loss = nn.functional.cross_entropy(preds, target)
      loss.backward()
      optimizer.step()

      # do evaluation
      if ((iteration+1) % eval_iter)==0:
          anathem_encoder1.eval()
          anathem_encoder2.eval()
          cls_head.eval()
          # tokenize the eval
          eval_logits = []
          eval_targets = []
          for i, batch_eval in enumerate(tqdm(dl_val, disable=True)):
              with torch.no_grad():
                  # tokenize the batch
                  tokens_eval = tokenize_anathem(batch_eval['text'], device)
                  labels_eval = batch_eval['label'].to(device)
                  out_eval1 = anathem_encoder1(
                      input_ids = tokens_eval['input_ids'],
                      attention_mask = tokens_eval['attention_mask'],
                      token_type_ids = tokens_eval['token_type_ids']
                  )
                  (hidden_states, extended_attention_masks) = out_eval1
                  features,_ = anathem_encoder2(
                      hidden_states_highres = hidden_states[0],
                      hidden_states_midres = hidden_states[1],
                      hidden_states_lowres = hidden_states[2],
                      extended_attention_mask_highres = extended_attention_masks[0],
                      extended_attention_mask_midres = extended_attention_masks[1],
                      extended_attention_mask_lowres = extended_attention_masks[2]
                  )
                  # prediction
                  batch_logits = cls_head(features[0], tokens_eval['attention_mask'])
                  eval_logits+=batch_logits.detach().tolist()
                  eval_targets+=labels_eval.detach().tolist()

          eval_prec,eval_recall,eval_f1,eval_support = precision_recall_fscore_support(eval_targets, np.array(eval_logits).argmax(axis=1),zero_division=0)
          print('E:%d; i:%d: f1:%0.3f (%0.3f); prec:%0.3f (%0.3f); rec:%0.3f (%0.3f)' % (epoch, iteration, eval_f1.mean(), eval_f1.min(), eval_prec.mean(), eval_prec.min(), eval_recall.mean(), eval_recall.min()))
          cls_head.train()
          anathem_encoder1.train()
          anathem_encoder2.train()






E:0; i:19: f1:0.402 (0.000); prec:0.352 (0.000); rec:0.469 (0.000)




E:0; i:39: f1:0.326 (0.000); prec:0.400 (0.000); rec:0.372 (0.000)




E:0; i:59: f1:0.459 (0.158); prec:0.531 (0.405); rec:0.485 (0.095)




E:0; i:79: f1:0.506 (0.305); prec:0.583 (0.450); rec:0.494 (0.231)




E:0; i:99: f1:0.499 (0.190); prec:0.555 (0.383); rec:0.551 (0.116)




E:0; i:119: f1:0.552 (0.280); prec:0.663 (0.568); rec:0.534 (0.179)




E:0; i:139: f1:0.661 (0.469); prec:0.708 (0.600); rec:0.636 (0.385)




KeyboardInterrupt: ignored

In [None]:
target

tensor([1, 0, 2, 1, 1, 0, 1, 1, 1, 1, 1, 2])

## Test performance speed

In [None]:
# how many parameters in the model in total
from math import prod
nparam = 0
for encoder in [anathem_encoder1, anathem_encoder2]:
    for na,l in encoder.named_parameters():
        nparam+=prod(l.data.shape)
print('Number of parameters for anathem: %d' % nparam)
# 33676544

Number of parameters for anathem: 33283328


In [None]:
# compare this to distilbert
#other_mod = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
other_mod = AutoModel.from_pretrained('google/bert_uncased_L-12_H-512_A-8')

Some weights of the model checkpoint at google/bert_uncased_L-12_H-512_A-8 were not used when initializing BertModel: ['cls.predictions.decoder.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
nparam = 0
for na,l in other_mod.named_parameters():
    nparam+=prod(l.data.shape)

print('Number of parameters for other-mod: %d' % nparam)

# number of parameters for anathem-trans: 33676544 (google/bert_uncased_L-12_H-512_A-8)
# number of parametres for anathem-trans: 78973824 (includng 2 more mid-res encoders)
# number of parameters for anathem-trans: 73062528 (with a 768 dimension)
# Number of parameters for distilroberta: 82118400 (with a 768 dimension)
# Number of parameters  all-MiniLM-L6-v2: 22713216
# Number of parameters google/bert_uncased_L-12_H-512_A-8: 53982720 (512 dim, 12L)


Number of parameters for other-mod: 53982720


## Test Performance Speed at inference (CPU)
- distilroberta-base: 10 batches: 23.517s , CPU
- oogle/bert_uncased_L-12_H-512_A-8: 10 batches: 12.44s, CPU
- anathem (distilroberta-768): 10 batches, 23.23s,
- anathem ((google/bert_uncased_L-12_H-512_A-8)): 10 batches, ~7.5s, CPU

## Test Performance Speed at inference (GPU)
- anathem ((google/bert_uncased_L-12_H-512_A-8)): 30 batches, 0.79s, GPU
- google/bert_uncased_L-12_H-512_A-8: 30 batches: 0.8 GPU


In [None]:
import time

In [None]:
time1 = time.time()
for iteration, batch in enumerate(tqdm(dl_train, disable=True)):
    if iteration>30:
        time2 = time.time()
        print(time2-time1)
        break
    with torch.no_grad():
        tokens = tokenize_anathem(batch['text'])
        (hidden_states, extended_attention_masks) = anathem_encoder1(
            input_ids = tokens['input_ids'],
            attention_mask = tokens['attention_mask'],
            token_type_ids = tokens['token_type_ids']
        )
        features,_ = anathem_encoder2(
            hidden_states_highres = hidden_states[0],
            hidden_states_midres = hidden_states[1],
            hidden_states_lowres = hidden_states[2],
            extended_attention_mask_highres = extended_attention_masks[0],
            extended_attention_mask_midres = extended_attention_masks[1],
            extended_attention_mask_lowres = extended_attention_masks[2]
        )

0.8027215003967285


In [None]:
time3 = time.time()
for iteration, batch in enumerate(tqdm(dl_train, disable=True)):
    if iteration>30:
        time4 = time.time()
        print(time4-time3)
        break
    with torch.no_grad():
        tokens = tokenize_anathem(batch['text'])
        out = basemod(
            input_ids = tokens['input_ids'],
            attention_mask = tokens['attention_mask'],
            token_type_ids = tokens['token_type_ids']
        )

0.7066085338592529


In [None]:
eval

array([0.        , 0.86464646, 0.52173913])

In [None]:
eval_prec,eval_recall,eval_f1,eval_support = precision_recall_fscore_support(eval_targets, np.array(eval_logits).argmax(axis=1),zero_division=0)

## Variant: Possibly Faster Integrative Layer

The above version uses a BertIntegrativeLayer that uses the high-res hidden-states as the key/values, and the upscaled-low res as the query

This variant flips it: the high-res is the query (thereby upscaling via attention) and the low-res are the value and keys

#### Varient #2 has slightly fewer parameters: 33283328 vs 336

In [None]:
%pip install torch transformers datasets zstandard rank_bm25 langdetect


Collecting transformers
  Downloading transformers-4.32.1-py3-none-any.whl (7.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.5/7.5 MB[0m [31m12.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting datasets
  Downloading datasets-2.14.4-py3-none-any.whl (519 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m519.3/519.3 kB[0m [31m29.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting zstandard
  Downloading zstandard-0.21.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.7/2.7 MB[0m [31m27.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m981.5/981.5 kB[0m [31m31.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting huggin

In [None]:
from transformers import AutoModel, AutoTokenizer, AutoConfig, AutoModelForMaskedLM
from torch.utils.data import DataLoader, Dataset
import torch
from typing import List, Optional, Tuple, Union
from torch import nn
import torch.nn.functional as F
from torch.cuda import is_available
if is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

from transformers.models.bert.modeling_bert import BertEncoder
from transformers.tokenization_utils_base import BatchEncoding
from transformers.activations import ACT2FN
import copy
import math
from langdetect import detect

from transformers import BertTokenizer

from typing import TYPE_CHECKING, Any, Dict, List, NamedTuple, Optional, Sequence, Tuple, Union
from transformers.utils import PaddingStrategy

EncodedInput = List[int]

In [None]:
class CustomTokenizer:
    def __init__(
        self,
        model_string='google/bert_uncased_L-12_H-512_A-8',
        n_cls_prepend = 4,
        n_pad_to_multiple_of=4,
        downscale_multiple=2
    ):
        # initialize the tokenizer from the base model
        self.base_tokenizer = AutoTokenizer.from_pretrained(model_string)
        # how many cls tokens to prepend to the fullsize data
        self.n_cls_prepend = n_cls_prepend
        self.n_pad_to_multiple_of = n_pad_to_multiple_of
        for k in dir(self.base_tokenizer):
            if not ((k[0]=='_') or (k in ['tokenize','encode','build_inputs_with_special_tokens','batch_encode_plus','encode_plus','pad'])):
                setattr(self,k,getattr(self.base_tokenizer, k))
        self.downscale_multiple = downscale_multiple
        # downscale attention
        self.maxpool_attn = nn.MaxPool1d(
            (self.downscale_multiple), stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=True
        )

        # ensure excess_token_ids are included for .pad operations
        if 'excess_cls_ids' not in self.base_tokenizer.model_input_names:
            self.base_tokenizer.model_input_names += ['excess_cls_ids']

    def __call__(self, text, pad_to_multiple_of=None, add_special_tokens = True, return_tensors=None, *args, **kwargs):
        if pad_to_multiple_of is None:
            pad_to_multiple_of = self.n_pad_to_multiple_of
        tokens = self.base_tokenizer(
            text,
            pad_to_multiple_of=(pad_to_multiple_of if not add_special_tokens else False),
            add_special_tokens=add_special_tokens,
            return_tensors=return_tensors if (not add_special_tokens) else None,
            *args,
            **kwargs
        )
        if add_special_tokens:
            tokens = self._batch_prepend_extra_cls_tokens_because_of_maxpooling(tokens, return_tensors)

        # downscale the attention, add to tokens
        tokens = self.downscale_attention(
            tokens, downscale_multiple=[self.downscale_multiple, self.downscale_multiple],name='attention_mask'
        )
        # dowscale the excess_cls_tokens, add to tokens
        tokens = self.downscale_attention(
            tokens, downscale_multiple=[self.downscale_multiple, self.downscale_multiple],name='excess_cls_ids'
        )
        return tokens

    def __len__(self):
        return len(self.base_tokenizer)

    def _num_pad_tokens(self, token_list):
        """Calculates how many PAD tokens to append to sequence to make a multiple of X"""
        return (self.n_pad_to_multiple_of - ((len(token_list)+(self.n_cls_prepend-1)) % self.n_pad_to_multiple_of)) % self.n_pad_to_multiple_of

    def _prepend_extra_cls_tokens_because_of_maxpooling(self, tokens,return_tensors=None):
        n_cls_prepend = self.n_cls_prepend
        # prepend (n-1) CLS tokens to the front of the token_ids (because of maxpooling)
        # also pad so that the total length is a multiple of n_cls_prepend
        #num_pad_tokens = (self.n_pad_to_multiple_of - ((len_tokens+(n_cls_prepend-1)) % self.n_pad_to_multiple_of)) % self.n_pad_to_multiple_of
        tokens['input_ids'] = [self.cls_token_id]*(n_cls_prepend-1)+tokens['input_ids'] + [self.pad_token_id]*self._num_pad_tokens(tokens['input_ids'])
        tokens['excess_cls_ids'] = [0]*(n_cls_prepend)+tokens['attention_mask'][1:] +[0]*self._num_pad_tokens(tokens['attention_mask'])
        tokens['attention_mask'] = [1]*(n_cls_prepend-1)+tokens['attention_mask'] +[0]*self._num_pad_tokens(tokens['attention_mask'])
        if 'token_type_ids' in tokens.keys():
            tokens['token_type_ids'] = [
                tokens['token_type_ids'][0]
            ]*(n_cls_prepend-1) + tokens['token_type_ids'] + [tokens['token_type_ids'][-1]]*self._num_pad_tokens(tokens['token_type_ids'])
        if return_tensors == 'pt':
            for k,v in tokens.items():
                tokens[k] = torch.LongTensor(v)
        return tokens

    def _batch_prepend_extra_cls_tokens_because_of_maxpooling(self, tokens,return_tensors=None):
        n_cls_prepend = self.n_cls_prepend
        # prepend (n-1) CLS tokens to the front of the token_ids (because of maxpooling)
        # also pad so that the total length is a multiple of n_cls_prepend
        #num_pad_tokens = (self.n_pad_to_multiple_of - ((len_tokens+(n_cls_prepend-1)) % self.n_pad_to_multiple_of)) % self.n_pad_to_multiple_of
        tokens['input_ids'] = [
            [self.cls_token_id]*(n_cls_prepend-1)+input_id + [self.pad_token_id]*self._num_pad_tokens(input_id)
            for input_id
            in tokens['input_ids']
        ]
        tokens['excess_cls_ids'] = [
            [0]*(n_cls_prepend)+attnmask[1:] +[0]*self._num_pad_tokens(attnmask)
            for attnmask
            in tokens['attention_mask']
        ]
        tokens['attention_mask'] = [
            [1]*(n_cls_prepend-1)+attnmask +[0]*self._num_pad_tokens(attnmask)
            for attnmask
            in tokens['attention_mask']
        ]
        if 'token_type_ids' in tokens.keys():
            tokens['token_type_ids'] = [
                # we use the token_type_ids
                [toktypeid[0]]*(n_cls_prepend-1)+toktypeid +[toktypeid[-1]]*self._num_pad_tokens(toktypeid)
                for toktypeid
                in tokens['token_type_ids']
            ]
        if return_tensors == 'pt':
            for k,v in tokens.items():
                tokens[k] = torch.LongTensor(v)
        return tokens

    def encode(self, text, pad_to_multiple_of=4, add_special_tokens = True, *args, **kwargs):
        encoded = self.base_tokenizer.encode(text, pad_to_multiple_of=False, add_special_tokens=add_special_tokens, *args, **kwargs)
        if add_special_tokens:
            encoded = [self.cls_token_id]*(pad_to_multiple_of-1) + encoded
        if bool(pad_to_multiple_of):
            num_pad_tokens = (pad_to_multiple_of - (len(encoded) % pad_to_multiple_of)) % pad_to_multiple_of
            encoded += [self.pad_token_id] * num_pad_tokens
        return encoded

    def encode_plus(self, text, add_special_tokens=True, return_tensors=None, *args, **kwargs):
        tokens = self.base_tokenizer.encode_plus(text, add_special_tokens=add_special_tokens, return_tensors=return_tensors, *args, **kwargs)
        if add_special_tokens:
            tokens = self._prepend_extra_cls_tokens_because_of_maxpooling(tokens, return_tensors)
        return tokens

    def tokenize(self, text, add_special_tokens=True, *args, **kwargs):
        toks = self.base_tokenizer.tokenize(text, add_special_tokens=add_special_tokens, *args, **kwargs)
        if add_special_tokens:
            toks = [self.cls_token] * (self.n_cls_prepend-1) + toks
        return toks

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ):
        out = self.base_tokenizer.build_inputs_with_special_tokens(token_ids_0, token_ids_1)
        return [self.cls_token_id]*3 + out

    def batch_encode_plus(self, batch_text_or_text_pairs, *args, **kwargs):
        batched_encoded = self.base_tokenizer.batch_encode_plus( batch_text_or_text_pairs, *args, **kwargs)
        batched_encoded.update({'foo':'bar'})
        return batched_encoded

    def downscale_attention(self, tokens, downscale_multiple=None, name = 'attention_mask'):
        """
        Reduces the sequence-dimenion by self.downscale_multiple using nn.maxpool
        Adds the downscale attention to the tokens dictionary
        """
        if downscale_multiple is None:
            downscale_multiple = [self.downscale_multiple, self.downscale_multiple]

        # fullsize attention
        attn = tokens[name]
        if not isinstance(attn, torch.Tensor):
            attn = torch.Tensor(attn)

        for i, mult in enumerate(downscale_multiple):
            name_of_downsized_attn = '%s_l%d' % (name, i+2)
            with torch.no_grad():
                attn = self.maxpool_attn(attn.float())
            tokens[name_of_downsized_attn] = attn
        return tokens

    def pad(
        self,
        encoded_inputs,
        pad_to_multiple_of=4,
        return_tensors=None,
        padding: Union[bool, str, PaddingStrategy] = True,
        max_length: Optional[int] = None,
        *args,
        **kwargs
    ):
        """Pad a list of tokenized-inputs to the same batch-length, with special processing of Anathem-specific inputs"""

        # which are conventional inputs and which are anathem specific
        conventional_input_nm = [k for k in encoded_inputs[0].keys() if k in ['input_ids', 'token_type_ids','attention_mask']]
        unconventional_input_nm = [k for k in encoded_inputs[0].keys() if k not in conventional_input_nm]

        # pad the vanilla inputs
        conventional_encoded_inputs = self.base_tokenizer.pad([
                {k:v for k,v in encoded_input.items() if k in conventional_input_nm}
                for encoded_input in encoded_inputs
            ], pad_to_multiple_of=pad_to_multiple_of, return_tensors=return_tensors, padding=padding, max_length=max_length, *args, **kwargs
        )

        # deal with the remaining inputs
        padding_strategy, _, max_length, _ = self.base_tokenizer._get_padding_truncation_strategies(
            padding=padding, max_length=max_length, verbose=False
        )

        #required_input = encoded_inputs[][self.model_input_names[0]]
        # this is stupid, I need to pad each input in batch individually
        special_anathem_inputs = [
                {k:v for k,v in encoded_input.items() if k in unconventional_input_nm}
                for encoded_input in encoded_inputs
        ]
        special_anathem_encoded_inputs = self.pad_special_anathem_inputs(
            special_anathem_inputs=special_anathem_inputs,
            encoded_inputs=conventional_encoded_inputs,
            max_length=max_length,
            padding_strategy=padding_strategy,#: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
            pad_to_multiple_of=pad_to_multiple_of,
            return_tensors=return_tensors
        )
        # let's see if I can just insert into the conventional_encode_inputs
        conventional_encoded_inputs.update(special_anathem_encoded_inputs) # apparently I can just append..

        # downscale the attention and add to inputs
        conventional_encoded_inputs = self.downscale_attention(
            conventional_encoded_inputs,
            downscale_multiple=[self.downscale_multiple, self.downscale_multiple],
            name='attention_mask'
        )
        # dowscale the excess_cls_tokens, add to tokens
        conventional_encoded_inputs = self.downscale_attention(
            conventional_encoded_inputs,
            downscale_multiple=[self.downscale_multiple, self.downscale_multiple],
            name='excess_cls_ids'
        )
        return conventional_encoded_inputs

    def pad_special_anathem_inputs(
        self,
        special_anathem_inputs,
        encoded_inputs,
        max_length: Optional[int] = None,
        padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
        pad_to_multiple_of: Optional[int] = None,
        return_tensors=None,
    ):
        required_input = encoded_inputs[self.model_input_names[0]]
        batch_size,max_length = required_input.shape
        #print(batch_size,max_length)
        assert batch_size == len(special_anathem_inputs)
        assert isinstance(special_anathem_inputs, list)
        padding_strategy = PaddingStrategy.MAX_LENGTH
        special_anathem_batch_outputs = {}
        for i in range(batch_size):
            inputs = special_anathem_inputs[i] #{k: v[i] for k, v in special_anathem_inputs.items()}
            assert isinstance(inputs, dict)
            outputs = self._pad_special_anathem_input(
                inputs,
                max_length=max_length,
                padding_strategy=padding_strategy,
                pad_to_multiple_of=pad_to_multiple_of
            )
            for key, value in outputs.items():
                if key not in special_anathem_batch_outputs:
                    special_anathem_batch_outputs[key] = []
                special_anathem_batch_outputs[key].append(value)

        return BatchEncoding(special_anathem_batch_outputs, tensor_type=return_tensors) # returning because of failure

    def _pad_special_anathem_input(
        self,
        special_anathem_input,
        max_length: Optional[int] = None,
        padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
        pad_to_multiple_of: Optional[int] = None
    ) -> dict:
        """
        Pad encoded Anathem-specific inputs (on left/right and up to predefined length or max length in the batch)
        """
        assert isinstance(special_anathem_input, dict)
        len_required_input = len(special_anathem_input[list(special_anathem_input.keys())[0]])
        if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
            max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of

        needs_to_be_padded = padding_strategy != PaddingStrategy.DO_NOT_PAD and len_required_input != max_length

        # Initialize attention mask if not present
        if needs_to_be_padded:
            special_anathem_outputs = dict.fromkeys(special_anathem_input.keys())
            difference = max_length - len_required_input
            if self.padding_side == "right":
                for k in special_anathem_input.keys():
                    special_anathem_outputs[k] = special_anathem_input[k] + [0] * difference
            elif self.padding_side == "left":
                for k in special_anathem_input.keys():
                    special_anathem_outputs[k] = [0] * difference + special_anathem_input[k]
            else:
                raise ValueError("Invalid padding strategy:" + str(self.padding_side))

            return special_anathem_outputs
        return special_anathem_input

In [None]:
tokenizer = CustomTokenizer(
        model_string='google/bert_uncased_L-12_H-512_A-8',
        n_cls_prepend = 4,
        n_pad_to_multiple_of=4,
        downscale_multiple=2
    )

Downloading (…)lve/main/config.json:   0%|          | 0.00/384 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.


In [None]:
tokenizer.base_tokenizer.model_input_names

['input_ids', 'token_type_ids', 'attention_mask', 'excess_cls_ids']

In [None]:
text = [
    "A standard [MASK] clause is a waiver clause that states that one party won't hold the other liable for damages, losses, or costs associated with issues.",
    "It usually consists of two elements: a trigger event or circumstance and a [MASK] obligation. The trigger event or circumstance is the [MASK] of the agreement, misconduct, or negligence of the indemnifying party or its affiliates"
]

tokens = tokenizer(text, return_tensors='pt', padding=True)

In [None]:
# FOOFU
# in the vanilla DataCollatorForLanguageModelling, if the data is pretokenized (unpadded)
#    then collator will simply "pad", the input_ids and the attention_mask (but not the generated excess_cls_ids, nor the attention_mask_l2 or l3)
#    ... but, I created these _l2,_l3 assuming that everything was already padded properly
# so, adding excess_token_ids to _model_names_inputs (or whatev, doesn't automatically cause the behaviour I wanted)
# the error is because the _pad specifically only handles special_token_ids and token_type_ids in a very specific way
#... there is no generic list_of_names to enforce padding of generic inputs.

# options:
# --- make an updated "pad" function for the tokenizer, that will likewise apply padding
tokens = [tokenizer.encode_plus(txt, add_special_tokens=True) for txt in text]

for tok in tokens:
    for k,v in tok.items():
        print(k,len(v))
        print(k,v)
print('---')

pad_out = tokenizer.pad(tokens, pad_to_multiple_of=4, return_tensors='pt')
print('CONVENTIONAL')
print(pad_out)

#for k,v in tokenizer.base_tokenizer.pad(tokens, pad_to_multiple_of=4, return_tensors='pt').items():
print('SPECIAL')
print(pad_out)
for k,v in pad_out.items():
    print(k, len(v))
    for j in v:
        print(len(j))


# still need to do: reduce attention_mask
# return as tensor
# merge and make a BatchEncoding

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


input_ids 40
input_ids [101, 101, 101, 101, 1037, 3115, 103, 11075, 2003, 1037, 23701, 6299, 11075, 2008, 2163, 2008, 2028, 2283, 2180, 1005, 1056, 2907, 1996, 2060, 20090, 2005, 12394, 1010, 6409, 1010, 2030, 5366, 3378, 2007, 3314, 1012, 102, 0, 0, 0]
token_type_ids 40
token_type_ids [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask 40
attention_mask [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
excess_cls_ids 40
excess_cls_ids [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
input_ids 48
input_ids [101, 101, 101, 101, 2009, 2788, 3774, 1997, 2048, 3787, 1024, 1037, 9495, 2724, 2030, 25652, 1998, 1037, 103, 14987, 1012, 1996, 9495, 2724, 2030, 25652, 2003, 1996, 103, 1997, 1996, 3820, 1010, 23337, 1010, 2030, 27988, 1997, 1996, 27427, 6633, 3490, 14116, 2

In [None]:
type(pad_out)

transformers.tokenization_utils_base.BatchEncoding

In [None]:
class BertSelfAttnDimensionReduction(nn.Module):
    """Bert Attention Layer that uses a dimension-reduced version of the query, so to reduce the dimension of the outputs"""
    def __init__(
        self,
        config,
        hidden_size_input=768,
        hidden_size_query = None,
        position_embedding_type=None,
        dim_reduction = 2
    ):
        """Special type of Bert Self attention that reduces the dimension of the inputs by half"""
        super().__init__()
        if (config.hidden_size // dim_reduction) % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
                f"heads ({config.num_attention_heads})"
            )
        self.dim_reduction = dim_reduction
        self.hidden_size_input = hidden_size_input
        self.hidden_size_reduced = hidden_size_input // dim_reduction
        if hidden_size_query is None:
            hidden_size_query = hidden_size_input
        self.hidden_size_query = hidden_size_query
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(self.hidden_size_reduced / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(self.hidden_size_query, self.all_head_size)
        self.key = nn.Linear(self.hidden_size_input, self.all_head_size)
        self.value = nn.Linear(self.hidden_size_input, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.position_embedding_type = position_embedding_type or getattr(
            config, "position_embedding_type", "absolute"
        )
        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            self.max_position_embeddings = config.max_position_embeddings
            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)

        self.is_decoder = config.is_decoder

    def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor:
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.FloatTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        encoder_hidden_states: Optional[torch.FloatTensor] = None,
        encoder_attention_mask: Optional[torch.FloatTensor] = None,
        past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        output_attentions: Optional[bool] = False,
    ) -> Tuple[torch.Tensor]:
        mixed_query_layer = self.query(hidden_states)

        # If this is instantiated as a cross-attention module, the keys
        # and values come from an encoder; the attention mask needs to be
        # such that the encoder's padding tokens are not attended to.

        key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))
        value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))
        query_layer = self.transpose_for_scores(mixed_query_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))

        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            query_length, key_length = query_layer.shape[2], key_layer.shape[2]
            if use_cache:
                position_ids_l = torch.tensor(key_length - 1, dtype=torch.long, device=hidden_states.device).view(
                    -1, 1
                )
            else:
                position_ids_l = torch.arange(query_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
            position_ids_r = torch.arange(key_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
            distance = position_ids_l - position_ids_r

            positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
            positional_embedding = positional_embedding.to(dtype=query_layer.dtype)  # fp16 compatibility

            if self.position_embedding_type == "relative_key":
                relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores
            elif self.position_embedding_type == "relative_key_query":
                relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key

        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if encoder_attention_mask is not None:
            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
            #print(attention_scores.shape)
            #print(attention_scores.shape)
            attention_scores = attention_scores + encoder_attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.functional.softmax(attention_scores, dim=-1)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)

        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(new_context_layer_shape)

        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)

        if self.is_decoder:
            outputs = outputs + (past_key_value,)
        return outputs


class InterpolateCombo(nn.Module):
    """there could also be an attentive way to do this"""
    def __init__(self, scale_factor=2, dropout=0.05, alpha=0.667):
        """Arguments:
        :param scaler_factor: float, multiple of up-scaling
        :param dropout: float, dropout proportion
        :param alpha: float, mixture weight between nearest-neighbor vs linear-interpolation
        """
        super(InterpolateCombo, self).__init__()
        self.interp = nn.functional.interpolate
        self.scale_factor = scale_factor
        self.dropout = nn.Dropout(dropout)
        self.a = alpha

    def forward(self, x):
        x_trans = x.transpose(-2,-1)
        z = self.a*self.interp(x_trans, mode='nearest',scale_factor=self.scale_factor) + (1-self.a)*self.interp(x_trans, mode='linear',scale_factor=self.scale_factor)
        z = self.dropout(z)
        return z.transpose(-2,-1)


class BertCrossAttention(nn.Module):
    def __init__(
        self,
        config,
        hidden_size,
        hidden_size_query,
        hidden_size_keyvalue=None,
        position_embedding_type=None
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.hidden_size_query = hidden_size_query
        if hidden_size_keyvalue is None:
            hidden_size_keyvalue = hidden_size
        self.hidden_size_keyvalue = hidden_size_keyvalue
        if self.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                f"The hidden size ({self.hidden_size}) is not a multiple of the number of attention "
                f"heads ({config.num_attention_heads})"
            )

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(self.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(self.hidden_size_query, self.all_head_size)
        self.key = nn.Linear(self.hidden_size_keyvalue, self.all_head_size)
        self.value = nn.Linear(self.hidden_size_keyvalue, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.position_embedding_type = position_embedding_type or getattr(
            config, "position_embedding_type", "absolute"
        )
        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            self.max_position_embeddings = config.max_position_embeddings
            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)

        self.is_decoder = config.is_decoder

    def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor:
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.FloatTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        query_hidden_states: Optional[torch.FloatTensor] = None,
        query_attention_mask: Optional[torch.FloatTensor] = None,
        past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        output_attentions: Optional[bool] = False,
    ) -> Tuple[torch.Tensor]:
        mixed_query_layer = self.query(query_hidden_states)

        # If this is instantiated as a cross-attention module, the keys
        # and values come from an encoder; the attention mask needs to be
        # such that the encoder's padding tokens are not attended to.
        key_layer = self.transpose_for_scores(self.key(hidden_states))
        value_layer = self.transpose_for_scores(self.value(hidden_states))
        query_layer = self.transpose_for_scores(mixed_query_layer)

        use_cache = past_key_value is not None
        if self.is_decoder:
            # if cross_attention save Tuple(torch.Tensor, torch.Tensor) of all cross attention key/value_states.
            # Further calls to cross_attention layer can then reuse all cross-attention
            # key/value_states (first "if" case)
            # if uni-directional self-attention (decoder) save Tuple(torch.Tensor, torch.Tensor) of
            # all previous decoder key/value_states. Further calls to uni-directional self-attention
            # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case)
            # if encoder bi-directional self-attention `past_key_value` is always `None`
            past_key_value = (key_layer, value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))

        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            query_length, key_length = query_layer.shape[2], key_layer.shape[2]
            if use_cache:
                position_ids_l = torch.tensor(key_length - 1, dtype=torch.long, device=hidden_states.device).view(
                    -1, 1
                )
            else:
                position_ids_l = torch.arange(query_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
            position_ids_r = torch.arange(key_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
            distance = position_ids_l - position_ids_r

            positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
            positional_embedding = positional_embedding.to(dtype=query_layer.dtype)  # fp16 compatibility

            if self.position_embedding_type == "relative_key":
                relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores
            elif self.position_embedding_type == "relative_key_query":
                relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key

        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        if attention_mask is not None:
            # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
            attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.functional.softmax(attention_scores, dim=-1)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)

        # Mask heads if we want to
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        context_layer = torch.matmul(attention_probs, value_layer)

        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(new_context_layer_shape)

        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)

        if self.is_decoder:
            outputs = outputs + (past_key_value,)
        return outputs


class BertReduceAddIntegrativeLayer(nn.Module):
    """Bert Layer that does dimenion reduction along embedding-dimenion and integrations a skip connection"""
    def __init__(
            self,
            config,
            hidden_size,
            hidden_size_input=None,
            hidden_size_query=None,
            intermediate_size=None,
            dim_reduction=2,
            do_concat_hidden_and_query = True
        ):
        super().__init__()
        #self.chunk_size_feed_forward = config.chunk_size_feed_forward
        #self.seq_len_dim = 1
        self.cat = torch.cat
        self.do_concat_hidden_and_query = do_concat_hidden_and_query
        assert bool(do_concat_hidden_and_query), 'not implemented: concatenation of query and hidden-states must happen'
        self.hidden_size = hidden_size
        if dim_reduction is None:
            dim_reduction = 2
        self.dim_reduction = dim_reduction
        if intermediate_size is None:
            intermediate_size = int(4*hidden_size)
        self.intermediate_size = intermediate_size
        if hidden_size_input is None:
            hidden_size_input = hidden_size
        self.hidden_size_input = hidden_size_input
        if hidden_size_query is None:
            hidden_size_query = hidden_size_input
        self.hidden_size_query = hidden_size_query + do_concat_hidden_and_query*hidden_size
        self.hidden_size_concat = int(hidden_size + hidden_size_input)

        # cross attention between (low-res) query and hidden layers below
        self.attention = BertSelfAttnDimensionReduction(
            config,
            hidden_size_input=self.hidden_size_input,
            hidden_size_query = self.hidden_size_query,
            position_embedding_type="absolute",
            dim_reduction = self.dim_reduction
        )
        self.is_decoder = config.is_decoder
        #inputs = x_l1, x_l1_reduced, x_l2_prev
        #- x2 = BertCrossAttention(k,v=x_l1, q= cat(x_l1_reduced, x_l2_prev) ) -notice three inputs
        #- x3 = lnorm(drop(f(x2)) + x_l2_prev)
        #- x4_ex = activation( f(cat(x3, x_l1_reduced))  )
        #- x5 = lnorm(drop(f(x4_ex)) + x3)

        # corresponds to BertAttention SelfOutput
        self.output_attn = nn.Linear(self.hidden_size, self.hidden_size)
        self.lnorm_attn = nn.LayerNorm(self.hidden_size, eps=config.layer_norm_eps)
        self.dropout_attn = nn.Dropout(config.hidden_dropout_prob)

        # corresponds to BertIntermediate
        self.intermediate = nn.Linear(self.hidden_size_concat, self.intermediate_size)
        if isinstance(config.hidden_act, str):
            self.intermediate_act_fn = ACT2FN[config.hidden_act]
        else:
            self.intermediate_act_fn = config.hidden_act

        # corresponds to BertOutput
        self.output_intm = nn.Linear(self.intermediate_size, self.hidden_size)
        self.lnorm_intm = nn.LayerNorm(self.hidden_size, eps=config.layer_norm_eps)
        self.dropout_intm = nn.Dropout(config.hidden_dropout_prob)

    def forward(
        self,
        inputs: torch.Tensor, # higher-resolution inputs for key and values (long sequence dimension)
        hidden_states: torch.Tensor, # previous hidden-states for skip connection (short squence-dim, low-res)
        attention_mask: Optional[torch.FloatTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        query_hidden_states: torch.FloatTensor = None, # hidden-states for query (short squence-dim, low-res)
        query_attention_mask: Optional[torch.FloatTensor] = None,
        past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        output_attentions: Optional[bool] = False,
    ) -> Tuple[torch.Tensor]:
        # decoder uni-directional self-attention cached key/values tuple is at positions 1,2
        self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None

        if self.do_concat_hidden_and_query:
            query_hidden_states_plus = torch.cat((query_hidden_states, hidden_states),axis=2)
        # cross attn between (low-res) query vector and (high-res) key-values
        cross_attn_outputs = self.attention(
            query_hidden_states_plus, # query (short seq-dim, high-res)
            attention_mask=attention_mask,
            head_mask=head_mask,
            encoder_hidden_states = inputs, # for key/value (longer sequence dimension, high-res)
            past_key_value=past_key_value,
            output_attentions=output_attentions,
        )
        cross_hidden_states = cross_attn_outputs[0]

        # first Add+Norm skip connection (BertSelfOutput)
        cross_hidden_states = self.dropout_attn(self.output_attn(cross_hidden_states))
        hidden_states = self.lnorm_attn(cross_hidden_states + hidden_states)

        # intermediate expension
        intermediate_states = self.intermediate_act_fn(self.intermediate(
            self.cat((hidden_states, query_hidden_states),axis=2)
        ))
        assert intermediate_states.shape[0]==hidden_states.shape[0]
        assert intermediate_states.shape[1]==hidden_states.shape[1]

        # BertOutput
        intermediate_states = self.dropout_intm(self.output_intm(intermediate_states))
        out_states = self.lnorm_intm(intermediate_states + hidden_states)

        #inputs = x_l1, x_l1_reduced, x_l2_prev
        #- x2 = BertCrossAttention(k,v=x_l1, q= cat(x_l1_reduced, x_l2_prev) ) -notice three inputs
        #- x3 = lnorm(drop(f(x2)) + x_l2_prev)
        #- x4_ex = activation( f(cat(x3, x_l1_reduced))  )
        #- x5 = lnorm(drop(f(x4_ex)) + x3)
        return out_states

try:
    from transformers.modeling_utils import get_extended_attention_mask
except:
    def get_extended_attention_mask(self, attention_mask: torch.Tensor, input_shape: Tuple[int], device: device) -> torch.Tensor:
        """
        Makes broadcastable attention and causal masks so that future and masked tokens are ignored.

        Arguments:
            attention_mask (:obj:`torch.Tensor`):
                Mask with ones indicating tokens to attend to, zeros for tokens to ignore.
            input_shape (:obj:`Tuple[int]`):
                The shape of the input to the model.
            device: (:obj:`torch.device`):
                The device of the input to the model.

        Returns:
            :obj:`torch.Tensor` The extended attention mask, with a the same dtype as :obj:`attention_mask.dtype`.
        """
        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
        # ourselves in which case we just need to make it broadcastable to all heads.
        if attention_mask.dim() == 3:
            extended_attention_mask = attention_mask[:, None, :, :]
        elif attention_mask.dim() == 2:
            # Provided a padding mask of dimensions [batch_size, seq_length]
            # - if the model is a decoder, apply a causal mask in addition to the padding mask
            # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]
            if self.config.is_decoder:
                batch_size, seq_length = input_shape
                seq_ids = torch.arange(seq_length, device=device)
                causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]
                # in case past_key_values are used we need to add a prefix ones mask to the causal mask
                # causal and attention masks must have same type with pytorch version < 1.3
                causal_mask = causal_mask.to(attention_mask.dtype)

                if causal_mask.shape[1] < attention_mask.shape[1]:
                    prefix_seq_len = attention_mask.shape[1] - causal_mask.shape[1]
                    causal_mask = torch.cat(
                        [
                            torch.ones(
                                (batch_size, seq_length, prefix_seq_len), device=device, dtype=causal_mask.dtype
                            ),
                            causal_mask,
                        ],
                        axis=-1,
                    )

                extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]
            else:
                extended_attention_mask = attention_mask[:, None, None, :]
        else:
            raise ValueError(
                "Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(
                    input_shape, attention_mask.shape
                )
            )

        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
        # masked positions, this operation will create a tensor which is 0.0 for
        # positions we want to attend and -10000.0 for masked positions.
        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.
        extended_attention_mask = extended_attention_mask.to(dtype=self.dtype)  # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
        return extended_attention_mask



In [None]:


# how does bert actually work?
"""
input = x

BertLayer:
- BertAttention
--- x2 = BertSelfAttention(x)
--- x3 = BertSelfOutput(x2,x) -> lnorm(drop(f(x2)) + x)
- BertIntermediate (expension:  4*hidden_size)
--- x4_ex = activation(f(x3)) # expansion (4*)
- BertOutput
--- x5 = lnorm(drop(f(x4_ex)) + x3 )


inputs = x_l2, x_l3_up

BertIntegrativeLayer:
- x2 = BertCrossAttention(k,v=x_l2, q=x_l3_up)
- x3 = lnorm(drop(f(x2)) + x_l2)
- x4_ex = activation( f(cat(x3, x_l3_up))  )
- x5 = lnorm(drop(f(x4_ex)) + x3)
"""


class BertIntegrativeLayer(nn.Module):
    """Vanilla Bert Layer, but integrates other hiddens states from a parallel transformers stack typically low-re"""
    def __init__(
            self,
            config,
            hidden_size, # dimensions of the (high-res) hiddens states; same dimension as output
            hidden_size_keyvalues, # dimensions of (low-res) states used as key/values; 1/2 sequence-length and dim
            hidden_size_query_to_concat=None, # dimensions of (low-res) to concat to hidden_states; 1/2 sequence-length and dim
            intermediate_size=None
        ):
        super().__init__()
        #self.chunk_size_feed_forward = config.chunk_size_feed_forward
        #self.seq_len_dim = 1
        self.cat = torch.cat
        self.hidden_size = hidden_size
        self.hidden_size_keyvalues = hidden_size_keyvalues
        if hidden_size_query_to_concat is None:
            hidden_size_query_to_concat = hidden_size_keyvalues
        self.hidden_size_query_to_concat = hidden_size_query_to_concat
        self.hidden_size_query = int(hidden_size + hidden_size_query_to_concat)
        self.hidden_size_concat = int(hidden_size + hidden_size_query_to_concat)
        if intermediate_size is None:
            intermediate_size = int(4*hidden_size)
        self.intermediate_size = intermediate_size

        # cross attention between (low-res) query and hidden layers below
        self.attention = BertCrossAttention(
            config,
            hidden_size= self.hidden_size, # high dim output
            hidden_size_query = self.hidden_size_query, # high dim query
            hidden_size_keyvalue = self.hidden_size_keyvalues, # low-dim keyvalues
            position_embedding_type="absolute"
        )
        self.is_decoder = config.is_decoder
        #self.intermediate = BertIntermediate(config)
        #self.output = BertOutput(config)
        #- x2 = BertCrossAttention(k,v=x_l2, q=x_l3_up)
        #- x3 = lnorm(drop(f(x2)) + x_l2)
        #- x4_ex = activation( f(cat(x3, x_l3_up))  )
        #- x5 = lnorm(drop(f(x4_ex)) + x3)

        # corresponds to BertAttention SelfOutput
        self.output_attn = nn.Linear(self.hidden_size, self.hidden_size)
        self.lnorm_attn = nn.LayerNorm(self.hidden_size, eps=config.layer_norm_eps)
        self.dropout_attn = nn.Dropout(config.hidden_dropout_prob)

        # corresponds to BertIntermediate
        self.intermediate = nn.Linear(self.hidden_size_concat, self.intermediate_size)
        if isinstance(config.hidden_act, str):
            self.intermediate_act_fn = ACT2FN[config.hidden_act]
        else:
            self.intermediate_act_fn = config.hidden_act

        # corresponds to BertOutput
        self.output_intm = nn.Linear(self.intermediate_size, self.hidden_size)
        self.lnorm_intm = nn.LayerNorm(self.hidden_size, eps=config.layer_norm_eps)
        self.dropout_intm = nn.Dropout(config.hidden_dropout_prob)

    def forward(
        self,
        hidden_states: torch.Tensor, # high-res hidden states (same dimensions as output), used as query
        attention_mask: Optional[torch.FloatTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        keyvalue_hidden_states: torch.Tensor=None, # low-res hidden-states (1/2 seq-dim) used for key-value pairs
        query_to_concat_hidden_states: torch.Tensor=None, # to concatenate to query
        query_attention_mask: Optional[torch.FloatTensor] = None,
        past_key_value: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        output_attentions: Optional[bool] = False,
    ) -> Tuple[torch.Tensor]:
        # decoder uni-directional self-attention cached key/values tuple is at positions 1,2
        self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None

        # cross attn between hiddens states and (low-res) query vector
        cross_attn_outputs = self.attention(
            hidden_states = keyvalue_hidden_states,
            attention_mask = attention_mask,
            head_mask = head_mask,
            query_hidden_states = torch.cat((hidden_states, query_to_concat_hidden_states),axis=2),
            query_attention_mask = query_attention_mask
        )
        cross_hidden_states = cross_attn_outputs[0]
        assert cross_hidden_states.shape[1]==hidden_states.shape[1], f"{cross_hidden_states.shape[1]},{cross_hidden_states.shape[2]} vs {hidden_states.shape[1]},{hidden_states[2]}"
        assert cross_hidden_states.shape[2]==hidden_states.shape[2]


        # first Add+Norm skip connection (BertSelfOutput)
        cross_hidden_states = self.output_attn(cross_hidden_states)
        cross_hidden_states = self.dropout_attn(cross_hidden_states)
        hidden_states = self.lnorm_attn(cross_hidden_states + hidden_states)

        # intermediate expension
        intermediate_states = self.cat((hidden_states, query_to_concat_hidden_states),axis=2)
        intermediate_states = self.intermediate(intermediate_states)
        intermediate_states = self.intermediate_act_fn(intermediate_states)
        assert intermediate_states.shape[0]==hidden_states.shape[0]
        assert intermediate_states.shape[1]==hidden_states.shape[1]

        # BertOutput
        out_states = self.output_intm(intermediate_states)
        out_states = self.dropout_intm(out_states)
        out_states = self.lnorm_intm(out_states + hidden_states)

        #- x2 = BertCrossAttention(k,v=x_l2, q=x_l3_up)
        #- x3 = lnorm(drop(f(x2)) + x_l2)
        #- x4_ex = activation( f(cat(x3, x_l3_up))  )
        #- x5 = lnorm(drop(f(x4_ex)) + x3)
        return out_states



In [None]:


# how does bert actually work?
"""
input = x

BertLayer:
- BertAttention
--- x2 = BertSelfAttention(x)
--- x3 = BertSelfOutput(x2,x) -> lnorm(drop(f(x2)) + x)
- BertIntermediate (expension:  4*hidden_size)
--- x4_ex = activation(f(x3)) # expansion (4*)
- BertOutput
--- x5 = lnorm(drop(f(x4_ex)) + x3 )


inputs = x_l2, x_l3_up

BertIntegrativeLayer:
- x2 = BertCrossAttention(k,v=x_l2, q=x_l3_up)
- x3 = lnorm(drop(f(x2)) + x_l2)
- x4_ex = activation( f(cat(x3, x_l3_up))  )
- x5 = lnorm(drop(f(x4_ex)) + x3)
"""


class CheapMLPIntegrativeLayer(nn.Module):
    """Cheap (non-transformer) Integrator layer that merges a (low-res) layers with higher-res"""
    def __init__(
            self,
            config,
            hidden_size, # dimensions of the (high-res) hiddens states; same dimension as output
            hidden_size_keyvalues=None, # dimensions of (low-res) states used as key/values; 1/2 sequence-length and dim
            hidden_size_query_to_concat=None, # dimensions of (low-res) to concat to hidden_states; 1/2 sequence-length and dim
            intermediate_size=None
        ):
        super().__init__()
        #self.chunk_size_feed_forward = config.chunk_size_feed_forward
        #self.seq_len_dim = 1
        self.cat = torch.cat
        self.hidden_size = hidden_size
        if hidden_size_keyvalues is None:
            hidden_size_keyvalues = hidden_size
        self.hidden_size_keyvalues = hidden_size_keyvalues
        if hidden_size_query_to_concat is None:
            hidden_size_query_to_concat = hidden_size_keyvalues
        self.hidden_size_query_to_concat = hidden_size_query_to_concat
        self.hidden_size_query = int(hidden_size + hidden_size_query_to_concat)
        if intermediate_size is None:
            intermediate_size = int(2*hidden_size)
        self.intermediate_size = intermediate_size

        # expand hidden-size to a multiple
        self.dense_expander = nn.Linear(
            self.hidden_size_query,
            self.intermediate_size
        ) # deflate back to same size as hidden-state
        self.dense_deflator = nn.Linear(
            self.intermediate_size,
            self.hidden_size
        )

        # intermediate activation function
        self.intermediate_act_fn = nn.RReLU(0.0625, 0.125)

        # corresponds to BertOutput
        self.lnorm = nn.LayerNorm(self.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(
        self,
        hidden_states: torch.Tensor, # high-res hidden states (same dimensions as output), used as query
        attention_mask = None, # ignored
        head_mask = None, # ignored
        keyvalue_hidden_states =None, # ignored
        query_to_concat_hidden_states: torch.Tensor=None, # to concatenate to hidden_states
        query_attention_mask = None, # ignored
        past_key_value = None, # ignored
        output_attentions = False, # ignored
    ) -> torch.Tensor:

        # concat (lowres) to hidden-states
        inputs = self.cat((hidden_states, query_to_concat_hidden_states),axis=2)
        # expand x2 dimension
        intermediate_states = self.dense_expander(inputs)
        # activation (leaky relue)
        intermediate_states = self.intermediate_act_fn(intermediate_states)
        # like BertOutput
        out_states = self.dense_deflator(intermediate_states)
        # dropout
        out_states = self.dropout(out_states)
        # combine with hidden-state inputs
        out_states = self.lnorm(out_states + hidden_states)

        return out_states



In [None]:

def make_config(
    modelstring = "distilroberta-base",
    num_transformer_stacks = 3,
    scale_ratio2 = 0.5,
    scale_ratio3 = 0.25,
    multiplier_intermediate2 = 4.0,
    multiplier_intermediate3 = 4.0,
    num_layers_l2 = 1, # mid-res encoder
    num_layers_l3 = 3, # low-res encoder
    dropout_scaling = 0.05,
    do_cheap_integrator = [1],
    sequence_classification_intermediate_dim = None, # default is the same as the basemodel hidden-dim
    sequence_classification_out_dim = None, # default is x2 same as the basemodel hidden-dim
    do_mlm =False,
    do_cls = False
):
    #if True:
    #modelstring = "distilroberta-base"
    #scale_ratio2 = 0.5
    #scale_ratio3 = 0.25
    #scale_intermediate2 = 4
    #scale_intermediate3 = 4
    base_config = AutoConfig.from_pretrained(modelstring)
    config_l2 = copy.deepcopy(base_config)
    config_l3 = copy.deepcopy(base_config)
    setattr(base_config, 'model_string', modelstring)
    setattr(base_config,'num_transformer_stacks', num_transformer_stacks)
    setattr(base_config,'num_layers_l2', num_layers_l2)
    setattr(base_config,'num_layers_l3', num_layers_l3)
    setattr(base_config,'scale_ratio2', scale_ratio2)
    setattr(base_config,'scale_ratio3', scale_ratio3)
    setattr(base_config,'scale_factor2', int(1/base_config.scale_ratio2))
    setattr(base_config,'scale_factor3', int(1/base_config.scale_ratio3*base_config.scale_ratio2))
    setattr(base_config,"hidden_size_l2", int(base_config.hidden_size * scale_ratio2))
    setattr(base_config,"hidden_size_l3", int(base_config.hidden_size * scale_ratio3))
    setattr(base_config,"intermediate_size_l1", int(base_config.hidden_size_l2*multiplier_intermediate2))
    setattr(base_config,"intermediate_size_l2", int(base_config.hidden_size_l3*multiplier_intermediate3))
    setattr(base_config,"query_size1", base_config.hidden_size_l2 + base_config.hidden_size_l3)
    setattr(base_config,"query_size2", base_config.hidden_size_l3)
    setattr(base_config,"dropout_scaling", dropout_scaling)
    setattr(base_config,"use_cheap_integrator_for_stacks", do_cheap_integrator)
    setattr(base_config, "do_mlm", do_mlm)
    setattr(base_config, "do_cls", do_cls)

    # hidden dimension
    setattr(
        base_config,
        "sequence_classification_intermediate_dim",
        sequence_classification_intermediate_dim  if sequence_classification_intermediate_dim is not None else [
            int(base_config.hidden_size*s)
            for s in [1, scale_ratio2, scale_ratio3]
        ]
    )
    # final dimension outputed for sequence classification
    setattr(
        base_config,
        "sequence_classification_out_dim",
        sequence_classification_out_dim  if sequence_classification_out_dim is not None else base_config.hidden_size*2
    )


    # make the configuration for the l2 mid-res encoder
    config_l2.hidden_size = base_config.hidden_size_l2
    config_l2.num_hidden_layers = num_layers_l2
    setattr(base_config, 'config_l2', config_l2)

    # make the configuration for the l3 encoder
    config_l3.hidden_size = base_config.hidden_size_l3
    config_l3.num_hidden_layers = num_layers_l3
    setattr(base_config, 'config_l3', config_l3)
    return base_config

def initialize_baselayers(config, basemod = None, tokenizer=None, stack_id=0):
    """Initializes the embeddings and first stack of layers for the Anathem transformers"""
    # initialize the basemodel
    if basemod is None:
        basemod = AutoModel.from_pretrained(config.model_string)
    if tokenizer is None:
        # download pretrained tokenizer
        tokenizer = AutoTokenizer.from_pretrained(config.model_string)

    device = basemod.device
    setattr(config, 'device', device)

    # get basemodel's embeddings
    layer_embedding = copy.deepcopy(basemod._modules['embeddings'])

    # get basemodel's first transformer block
    layer_basetransformer = copy.deepcopy(basemod._modules['encoder']._modules['layer']._modules['0'])

    # initialize the maxpooling downsamplers
    maxpool = nn.Sequential(
        nn.Dropout(config.dropout_scaling),
        nn.MaxPool2d((2,1), stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=True)
    )
    # pooling the attention has no dropout
    maxpool_attn = nn.MaxPool1d((2), stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=True)

    # initialize downsampling attention layers
    bert_reducer_l2 = BertSelfAttnDimensionReduction(
        config=config,
        hidden_size_input=config.hidden_size,
        position_embedding_type=config.position_embedding_type,
        dim_reduction = config.scale_factor2
    )
    # 1/4 hidden size
    bert_reducer_l3 = BertSelfAttnDimensionReduction(
        config=config,
        hidden_size_input=config.hidden_size_l2,
        position_embedding_type=config.position_embedding_type,
        dim_reduction = config.scale_factor3
    )

    # initialize the mid-resolution BertEncoder
    bert_encoder_midres = BertEncoder(config.config_l2)
    # initialize the low-resolution BertEncoder
    bert_encoder_lowres = BertEncoder(config.config_l3)

    # initailize the upscalers
    upscaler_x2 = InterpolateCombo(scale_factor=config.scale_factor3, dropout=config.dropout_scaling)
    upscaler_x4 = InterpolateCombo(scale_factor=int(1/config.scale_ratio3), dropout=config.dropout_scaling)

    # initialize the BertIntegrative Layers: low res to mid res
    bert_integrater_l2 = BertIntegrativeLayer(
        config,
        hidden_size=config.hidden_size_l2,
        hidden_size_keyvalues = config.hidden_size_l3,
        hidden_size_query_to_concat=config.hidden_size_l3,
        intermediate_size=config.intermediate_size_l2
    )

    # from mid-res to high-res
    do_cheap_integrator = (stack_id in config.use_cheap_integrator_for_stacks)
    # from mid-res to high-res
    if not do_cheap_integrator:
        bert_integrater_l1 = BertIntegrativeLayer(
            config,
            hidden_size=config.hidden_size,
            hidden_size_keyvalues = config.hidden_size_l2,
            hidden_size_query_to_concat=config.hidden_size_l2,
            intermediate_size=config.intermediate_size_l1
        )
    else:
        bert_integrater_l1 = CheapMLPIntegrativeLayer(
            config,
            hidden_size=config.hidden_size,
            hidden_size_query_to_concat=config.hidden_size_l2,
            intermediate_size=config.hidden_size*2
        )

    return (
        tokenizer,
        basemod,
        layer_embedding,
        layer_basetransformer,
        maxpool,
        maxpool_attn,
        bert_reducer_l2,
        bert_reducer_l3,
        bert_encoder_midres,
        bert_encoder_lowres,
        upscaler_x2,
        upscaler_x4,
        bert_integrater_l2,
        bert_integrater_l1
    )

def initialize_midlayers(config, basemod=None, tokenizer=None, stack_id=1):
    """Initializes all the intermediate layers for the Anathem transformers"""
    # initialize the maxpooling downsamplers
    maxpool = nn.Sequential(
        nn.Dropout(config.dropout_scaling),
        nn.MaxPool2d((2,1), stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=True)
    )
    # pooling the attention has no dropout
    maxpool_attn = nn.MaxPool1d((2), stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=True)

    # initialize bert attentive downsampling and skipconnection (1/2 embedding dim)
    bert_reduceintegrator_l2 = BertReduceAddIntegrativeLayer(
        config,
        config.hidden_size_l2, # size of mid-res
        hidden_size_input=config.hidden_size, # size full-resolution
        hidden_size_query=config.hidden_size, # size full-resolution
        intermediate_size=config.intermediate_size_l1, # BertIntermediate dimension (expansion *4 the hiddensize)
        dim_reduction=config.scale_factor2, # reduce embedding dimension by factor of 2
        do_concat_hidden_and_query = True
    )

    # 1/4 the size
    bert_reduceintegrator_l3 = BertReduceAddIntegrativeLayer(
        config,
        config.hidden_size_l3, # size of mid-res
        hidden_size_input=config.hidden_size_l2, # size full-resolution
        hidden_size_query=config.hidden_size_l2, # size full-resolution
        intermediate_size=config.intermediate_size_l2, # BertIntermediate dimension
        dim_reduction=config.scale_factor3, # reduce embedding dimension by factor of 2
        do_concat_hidden_and_query = True
    )

    # initialize the low-resolution BertEncoder
    bert_encoder_midres = BertEncoder(config.config_l2)
    bert_encoder_lowres = BertEncoder(config.config_l3)

    # initailize the upscalers
    upscaler_x2 = InterpolateCombo(scale_factor=config.scale_factor3, dropout=config.dropout_scaling)
    upscaler_x4 = InterpolateCombo(scale_factor=int(1/config.scale_ratio3), dropout=config.dropout_scaling)

    # initialize the BertIntegrative Layers: from low-res to mide-res
    bert_integrater_l2 = BertIntegrativeLayer(
        config,
        hidden_size=config.hidden_size_l2,
        hidden_size_keyvalues = config.hidden_size_l3,
        hidden_size_query_to_concat=config.hidden_size_l3,
        intermediate_size=config.intermediate_size_l2
    )

    do_cheap_integrator = (stack_id in config.use_cheap_integrator_for_stacks)
    if not do_cheap_integrator:
        # from mid-res to high-res
        bert_integrater_l1 = BertIntegrativeLayer(
            config,
            hidden_size=config.hidden_size,
            hidden_size_keyvalues = config.hidden_size_l2,
            hidden_size_query_to_concat=config.hidden_size_l2,
            intermediate_size=config.intermediate_size_l1
        )
    else:
        bert_integrater_l1 = CheapMLPIntegrativeLayer(
            config,
            hidden_size=config.hidden_size,
            hidden_size_query_to_concat=config.hidden_size_l2,
            intermediate_size=config.hidden_size*2
        )

    return (
        maxpool,
        maxpool_attn,
        bert_reduceintegrator_l2,
        bert_reduceintegrator_l3,
        bert_encoder_midres,
        bert_encoder_lowres,
        upscaler_x2,
        upscaler_x4,
        bert_integrater_l2,
        bert_integrater_l1
    )


def initialize_finaltransformerlayers(config, basemod=None, tokenizer=None, names_encoder_module = 'encoder', stack_id=3):
    """Initializes the final BertLayer before output, but copying the final BertLayer from `Basemod`"""
    # initialize the maxpooling downsamplers
    assert basemod is not None, "`initialize_finaltransformerlayers` requires the basemod to instantiate the final transformer block"

    # get the Encoder stacks
    assert names_encoder_module in basemod._modules.keys(), 'expected %s in basemod._modules' % names_encoder_module
    basemod_encoder_stack = get_to_bertlayer(basemod, target_layer_name = names_encoder_module)

    # get the name of the final transformer block (-1) in encoder
    names_of_final_transformer_block = list(basemod_encoder_stack._modules['layer']._modules.keys())[-1]

    # get the final transformer block (NN weights pretrained)
    bert_finaltransformer_block = basemod_encoder_stack._modules['layer']._modules[
        names_of_final_transformer_block
    ]

    return copy.deepcopy(bert_finaltransformer_block)

def get_to_bertlayer(basemod, target_layer_name = 'encoder', model_string = None):
    """Clumsily locates a particular layer within a pretrained bert model"""
    if  target_layer_name in basemod._modules.keys():
        return basemod._modules[target_layer_name]
    elif target_layer_name in basemod._modules['bert']._modules.keys():
        return basemod._modules['bert']

In [None]:

class AnathemBaseModule(nn.Module):
    """First Sstack of layers with embeddings, that go full circle form high-res to low-res back to high res"""
    def __init__(
            self,
            config,
            basemod=None,
            tokenizer=None,
            past_key_values_length = None,
            device = None,
            stack_id=0
        ):
        super().__init__()
        self.config = config

        # initalize the layers
        (
            tokenizer, basemod,
            layer_embedding,
            layer_basetransformer,
            maxpool,
            maxpool_attn,
            bert_reducer_l2,
            bert_reducer_l3,
            bert_encoder_midres,
            bert_encoder_lowres,
            upscaler_x2,
            upscaler_x4,
            bert_integrater_l2,
            bert_integrater_l1
        ) = initialize_baselayers(config, basemod, tokenizer, stack_id=0)

        self.get_extended_attention_mask = basemod.get_extended_attention_mask
        self.embedding = layer_embedding
        self.layer_basetransformer = layer_basetransformer
        self.maxpool = maxpool
        self.maxpool_attn = maxpool_attn
        self.bert_reducer_l2 = bert_reducer_l2
        self.bert_reducer_l3 = bert_reducer_l3
        self.bert_encoder_midres = bert_encoder_midres
        self.bert_encoder_lowres = bert_encoder_lowres
        self.upscaler_x2 = upscaler_x2
        self.upscaler_x4 = upscaler_x4
        self.bert_integrater_l2 = bert_integrater_l2
        self.bert_integrater_l1 = bert_integrater_l1
        self.stack_id = 0
        if device is None:
            self.to(basemod.device)
            #print(self.device)
            self.device = basemod.device
        else:
            self.to(device)
            self.device = device

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        attention_mask_l2: Optional[torch.Tensor] = None,
        attention_mask_l3: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        encoder_hidden_states: Optional[torch.Tensor] = None,
        encoder_attention_mask: Optional[torch.Tensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = False
    ):
        input_shape = input_ids
        past_key_values_length =0 if past_key_values is None else len(past_key_values)

        # extend attention mask
        extended_attention_mask_l1 = self.get_extended_attention_mask(attention_mask, input_shape, self.device)
        # downsample the attention mask to l2 dimension
        if attention_mask_l2 is None:
            attention_mask_l2 = self.maxpool_attn(attention_mask.float())
        extended_attention_mask_l2 = self.get_extended_attention_mask(attention_mask_l2,attention_mask_l2.shape, self.device)
        # downsample the attention mask to l3 dimension
        if attention_mask_l2 is None:
            attention_mask_l3 = self.maxpool_attn(attention_mask_l2.float())
        extended_attention_mask_l3 = self.get_extended_attention_mask(attention_mask_l3,attention_mask_l3.shape, self.device)

        # embed
        embedding_output = self.embedding(
            input_ids = input_ids,
            position_ids = position_ids,
            token_type_ids = token_type_ids,
            #input_embeds=None,
            past_key_values_length = past_key_values_length
        )

        # first transformer block (vanilla transformer)
        out_l1 = self.layer_basetransformer(
            hidden_states = embedding_output,
            attention_mask = extended_attention_mask_l1,
            head_mask=head_mask,
            encoder_hidden_states=None,
            encoder_attention_mask=None,
            output_attentions=output_attentions
        )
        hidden_states_l1 = out_l1[0]

        # downsample to sequence 1 to length sequence 2
        hiddens_states_l1_reduced = self.maxpool(hidden_states_l1)

        # reduce dimenion on sequence 2
        out_l2 = self.bert_reducer_l2(
            hidden_states = hiddens_states_l1_reduced,
            attention_mask = extended_attention_mask_l2,
            head_mask=head_mask,
            encoder_hidden_states = hidden_states_l1,
            encoder_attention_mask= extended_attention_mask_l1,
            past_key_value=past_key_values,
            output_attentions=output_attentions,
        )
        hidden_states_l2 = out_l2[0]

        # Vanilla transformers block at mid-resolution (1/2 seq-length)
        out_encoder = self.bert_encoder_midres(
            hidden_states=hidden_states_l2,
            attention_mask=extended_attention_mask_l2,
            head_mask = head_mask,
            return_dict=return_dict
        )
        hidden_states_l2 = out_encoder[0]

        # reduce sequence length (1/4 seq-length)
        hiddens_states_l2_reduced = self.maxpool(hidden_states_l2)

        # reduce dimenion on sequence 2
        out_l3 = self.bert_reducer_l3(
            hidden_states = hiddens_states_l2_reduced,
            attention_mask = extended_attention_mask_l3,
            head_mask=head_mask,
            encoder_hidden_states = hidden_states_l2,
            encoder_attention_mask= extended_attention_mask_l2,
            past_key_value=past_key_values,
            output_attentions=output_attentions,
        )
        hidden_states_l3 = out_l3[0]

        #print(hidden_states_l3.shape)
        #print(extended_attention_mask_l3.shape)
        # BertEncoder at low-res
        out_encoder = self.bert_encoder_lowres(
            hidden_states=hidden_states_l3,
            attention_mask=extended_attention_mask_l3,
            head_mask = head_mask,
            return_dict=return_dict
        )
        hidden_states_l3 = out_encoder[0]

        # upscaling: l3 to l2
        hidden_states_upscaled3to2 = self.upscaler_x2(hidden_states_l3)

        # integrate sequence-2 and upscaled sequence-3
        hidden_states_l2 = self.bert_integrater_l2(
            hidden_states = hidden_states_l2,
            attention_mask = extended_attention_mask_l3,
            head_mask = head_mask,
            keyvalue_hidden_states = hidden_states_l3,
            query_to_concat_hidden_states = hidden_states_upscaled3to2,
            query_attention_mask = attention_mask_l2
        )

        # upscaling: l3/l2 to l1 sequence length
        #hidden_states_upscaled3to1 = self.upscaler_x4(hidden_states_l3)
        hidden_states_upscaled2to1 = self.upscaler_x2(hidden_states_l2)
        #hidden_states_upscaled = torch.cat((
        #    hidden_states_upscaled2to1, hidden_states_upscaled3to1
        #),axis=2)

        # integrate low-resolution information back to original dimension
        hidden_states_l1 = self.bert_integrater_l1(
            hidden_states = hidden_states_l1,
            attention_mask = extended_attention_mask_l2,
            head_mask = head_mask,
            keyvalue_hidden_states = hidden_states_l2,
            query_to_concat_hidden_states = hidden_states_upscaled2to1,
            query_attention_mask = extended_attention_mask_l2
        )
        if not return_dict:
            return (
                (hidden_states_l1, hidden_states_l2, hidden_states_l3),
                (extended_attention_mask_l1, extended_attention_mask_l2, extended_attention_mask_l3),
                (attention_mask, attention_mask_l2, attention_mask_l3)
            )
        return {
            "hidden_states": (hidden_states_l1, hidden_states_l2, hidden_states_l3),
            "extended_attention_masks":(extended_attention_mask_l1, extended_attention_mask_l2, extended_attention_mask_l3),
            "attention_masks":(attention_mask, attention_mask_l2, attention_mask_l3)
        }


class AnathemMidModule(nn.Module):
    """Stack of layers that go full circle form high-res to low-res back to high res"""
    def __init__(
            self,
            config,
            basemod=None,
            tokenizer=None,
            past_key_values_length = None,
            device=None,
            stack_id = 1
        ):
        super().__init__()
        self.config = config

        # initalize the layers
        (
            maxpool,
            maxpool_attn,
            bert_reducerintegrator_l2,
            bert_reducerintegrator_l3,
            bert_encoder_midres,
            bert_encoder_lowres,
            upscaler_x2,
            upscaler_x4,
            bert_integrater_l2,
            bert_integrater_l1
        ) = initialize_midlayers(config, basemod, tokenizer, stack_id)

        self.get_extended_attention_mask = get_extended_attention_mask
        self.maxpool = maxpool
        self.maxpool_attn = maxpool_attn
        self.bert_reducerintegrator_l2 = bert_reducerintegrator_l2
        self.bert_reducerintegrator_l3 = bert_reducerintegrator_l3
        self.bert_encoder_midres = bert_encoder_midres
        self.bert_encoder_lowres = bert_encoder_lowres
        self.upscaler_x2 = upscaler_x2
        self.upscaler_x4 = upscaler_x4
        self.bert_integrater_l2 = bert_integrater_l2
        self.bert_integrater_l1 = bert_integrater_l1
        if device is None:
            self.to(basemod.device)
            #print(self.device)
            self.device = basemod.device
        else:
            self.to(device)
            self.device = device

    def forward(
        self,
        hidden_states_highres: torch.Tensor,
        hidden_states_midres: torch.Tensor,
        hidden_states_lowres: torch.Tensor,
        attention_mask: Optional[List[torch.FloatTensor]] = None,
        extended_attention_mask_highres: Optional[List[torch.FloatTensor]] = None,
        extended_attention_mask_midres: Optional[List[torch.FloatTensor]] = None,
        extended_attention_mask_lowres: Optional[List[torch.FloatTensor]] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = False
    ):
        input_shape = hidden_states_highres.shape[:2]
        past_key_values_length =0 if past_key_values is None else len(past_key_values)

        # extend attention mask
        if extended_attention_mask_highres is None:
            extended_attention_mask_highres = self.get_extended_attention_mask(attention_mask, input_shape, self.device)
        if extended_attention_mask_midres is None:
            attention_mask_midres = self.maxpool_attn(attention_mask.float())
            extended_attention_mask_midres = self.get_extended_attention_mask(attention_mask_midres,attention_mask_midres.shape, self.device)
        if extended_attention_mask_lowres is None:
           attention_mask_lowres = self.maxpool_attn(attention_mask_midres.float())
           extended_attention_mask_lowres = self.get_extended_attention_mask(attention_mask_lowres,attention_mask_lowres.shape, self.device)

        # downsample to sequence 1 to length sequence 2
        hiddens_states_l1_reduced = self.maxpool(hidden_states_highres)

        # reduce dimenion on sequence 2
        hidden_states_l2 = self.bert_reducerintegrator_l2(
            inputs = hidden_states_highres, # from highres outputs previous layer (key, values)
            hidden_states = hidden_states_midres, # previous hidden-states for skip connection (short squence-dim, low-res)
            attention_mask = extended_attention_mask_midres,
            head_mask=None,
            query_hidden_states = hiddens_states_l1_reduced
        )

        # Vanilla transformers at mid-resolution (1/2 sequence-length)
        out_encoder = self.bert_encoder_midres(
            hidden_states=hidden_states_l2,
            attention_mask=extended_attention_mask_midres,
            head_mask = None,
            return_dict=return_dict
        )
        hidden_states_l2 = out_encoder[0]

        # reduce sequence length (to 1/4 sequence-length)
        hiddens_states_l2_reduced = self.maxpool(hidden_states_l2)

        # reduce dimenion on sequence 2
        hidden_states_l3 = self.bert_reducerintegrator_l3(
            inputs = hidden_states_midres, # from highres outputs previous layer (key, values)
            hidden_states = hidden_states_lowres, # previous hidden-states for skip connection (short squence-dim, low-res)
            attention_mask = extended_attention_mask_lowres,
            head_mask=None,
            query_hidden_states = hiddens_states_l2_reduced
        )

        # BertEncoder at low-res
        out_encoder = self.bert_encoder_lowres(
            hidden_states=hidden_states_l3,
            attention_mask=extended_attention_mask_lowres,
            head_mask = None,
            return_dict=return_dict
        )
        hidden_states_lowres = out_encoder[0]

        # upscaling: l3 to l2
        hidden_states_upscaled3to2 = self.upscaler_x2(hidden_states_lowres)

        # integrate sequence-2 and upscaled sequence-3
        hidden_states_midres = self.bert_integrater_l2(
            hidden_states = hidden_states_l2,
            attention_mask = extended_attention_mask_lowres,
            head_mask = None,
            keyvalue_hidden_states = hidden_states_lowres,
            query_to_concat_hidden_states = hidden_states_upscaled3to2
        )
        #hidden_states_midres = self.bert_integrative_layer_2(
        #    hidden_states = hidden_states_l2,
        #    attention_mask = extended_attention_mask_midres,
        #    head_mask = None,
        #    query_hidden_states = hidden_states_upscaled3to2)

        # upscaling: l3/l2 to l1 sequence length
        #hidden_states_upscaled3to1 = self.upscaler_x4(hidden_states_lowres)
        hidden_states_upscaled2to1 = self.upscaler_x2(hidden_states_midres)
        #hidden_states_upscaled = torch.cat((hidden_states_upscaled2to1, hidden_states_upscaled3to1),axis=2)

        # integrate low-resolution information back to original dimension
        hidden_states_highres = self.bert_integrater_l1(
            hidden_states = hidden_states_highres,
            attention_mask = extended_attention_mask_midres,
            head_mask = None,
            keyvalue_hidden_states = hidden_states_midres,
            query_to_concat_hidden_states = hidden_states_upscaled2to1
        )

        if not return_dict:
            return (
                (hidden_states_highres, hidden_states_midres, hidden_states_lowres),
                (extended_attention_mask_highres, extended_attention_mask_midres, extended_attention_mask_lowres)
            )
        return {
            "hidden_states": (hidden_states_highres, hidden_states_midres, hidden_states_lowres),
            "attention":(extended_attention_mask_highres, extended_attention_mask_midres, extended_attention_mask_lowres)
        }


class AnathemEncoder(nn.Module):
    """Anathem cores stacks of layers, from embeddings to final transformer block"""
    def __init__(
            self,
            config,
            basemod=None,
            tokenizer=None,
            past_key_values_length = None,
            device=None,
        ):
        super().__init__()
        self.config = config
        self.device = device

        # initialize embedings and first stack
        self.anathem_base_stack = AnathemBaseModule(
            config,
            basemod,
            tokenizer,
            past_key_values_length,
            device,
        )

        # initialize all subsequence stacks
        self.anathem_mid_stack = nn.ModuleList([
            AnathemMidModule(
                config,
                basemod,
                tokenizer,
                past_key_values_length,
                device,
                stack_id = i
            ) for i in range(1, self.config.num_transformer_stacks)
        ])

        # initialize the final transformer modules
        self.final_transformer_block = initialize_finaltransformerlayers(
            config,
            basemod,
            tokenizer,
            stack_id=self.config.num_transformer_stacks+1
        )

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        attention_mask_l2: Optional[torch.Tensor] = None,
        attention_mask_l3: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        encoder_hidden_states: Optional[torch.Tensor] = None,
        encoder_attention_mask: Optional[torch.Tensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        use_cache: Optional[bool] = False,
        output_attentions: Optional[bool] = False,
        output_hidden_states: Optional[bool] = False,
        return_dict: Optional[bool] = False
    ):

        # embed and run through first stack of transformers
        hidden_states, extended_attention_masks, attention_masks = self.anathem_base_stack(
            input_ids=input_ids,
            attention_mask=attention_mask,
            attention_mask_l2=attention_mask_l2,
            attention_mask_l3=attention_mask_l3,
            token_type_ids=token_type_ids, #: Optional[torch.Tensor] = None,
            position_ids=position_ids,#: Optional[torch.Tensor] = None,
            head_mask=head_mask,#: Optional[torch.Tensor] = None,
            inputs_embeds=None,#: Optional[torch.Tensor] = None,
            encoder_hidden_states=None,#: Optional[torch.Tensor] = None,
            encoder_attention_mask=None,#: Optional[torch.Tensor] = None,
            past_key_values=past_key_values,#: Optional[List[torch.FloatTensor]] = None,
            use_cache=use_cache,#: Optional[bool] = None,
            output_attentions=output_attentions,#: Optional[bool] = None,
            output_hidden_states=output_hidden_states,#: Optional[bool] = None,
            return_dict=return_dict
        )

        # middle stack of transformers
        for i, anathem_stack in enumerate(self.anathem_mid_stack):

            # run through each stack (1-2)
            hidden_states, extended_attention_masks = anathem_stack(
                hidden_states_highres = hidden_states[0],
                hidden_states_midres = hidden_states[1],
                hidden_states_lowres = hidden_states[2],
                extended_attention_mask_highres = extended_attention_masks[0],
                extended_attention_mask_midres = extended_attention_masks[1],
                extended_attention_mask_lowres = extended_attention_masks[2]
            )

        # hidden states (high,med,low resolution)
        hidden_states_highres, hidden_states_midres, hidden_states_lowres = hidden_states

        # run through final transformer block (pretrained)
        out_final = self.final_transformer_block(
            hidden_states = hidden_states_highres,
            attention_mask = extended_attention_masks[0],
            head_mask=None,
            encoder_hidden_states=None,
            encoder_attention_mask=None,
            output_attentions=output_attentions
        )
        #print(type(out_final))
        #print(len(out_final))
        hidden_states_highres = out_final[0]
        if not output_attentions:
            return (hidden_states_highres, hidden_states_midres, hidden_states_lowres), attention_masks

        attention_final = out_final[1]
        return (hidden_states_highres, hidden_states_midres, hidden_states_lowres), attention_masks, attention_final


class BertGenericClassificationHead(nn.Module):
    """Instantiates a basic classification head that takes the CLS token and mean of the final layer for classification"""
    def __init__(self, config, n_classes = 1, activation = 'sigmoid', device=None):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size*2, n_classes)
        if activation == 'tanh':
            self.activation = nn.Tanh()
        elif activation == 'relu':
            self.activation = nn.ReLU()
        elif activation == 'sigmoid':
            self.activation = torch.sigmoid
        elif activation == 'none':
            self.activation = lambda x: x
        if device is not None:
            self.to(device)

    def forward(self, hidden_states, attention_mask) -> torch.Tensor:
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token.
        output_vectors=[]
        first_token_tensor = hidden_states[:, 0]
        output_vectors.append(first_token_tensor)
        # mean pooling
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
        sum_embeddings = torch.sum(hidden_states * input_mask_expanded, 1)
        sum_mask = input_mask_expanded.sum(1)
        sum_mask = torch.clamp(sum_mask, min=1e-9)
        output_vectors.append(sum_embeddings / sum_mask)
        # concatenate
        pooled_output = torch.concat(output_vectors, axis=1)
        #print(pooled_output.shape)
        logits = self.dense(pooled_output)
        return self.activation(logits)


class AnathemMultiSiloPooler(nn.Module):
    """
    Pools the token-embeddings along the sequence dimenions for a final sentence-vector.
    The pooling occuras across all three 'silos'
    The pooling consists of the CLS token as well as mean pooling, concatenated token
    Use the pooling outputs prior to any sequenceClassification
    """
    def __init__(
        self,
        config,
        dim_out = None,
        mean_activation = nn.Tanhshrink,
        out_activation = None,
        dims_in = None,
        p_dropout=None,
        device=None
    ):
        super().__init__()

        # dimensions of the hiddens states being processed as inputs
        if dims_in is None:
            try:
                dims_in = config.sequence_classification_intermediate_dim
            except:
                dims_in = [dim_out, dim_out//2, dim_out//4]
        self.dims_in = dims_in
        self.dim_in = sum(dims_in)
        self.hidden_size = config.hidden_size
        if dim_out is None:
            try:
                dim_out = config.sequence_classification_out_dim
            except:
                dim_out = config.hidden_size*2
        self.dim_out = dim_out
        self.mean_activation = mean_activation

        #self.dense = nn.Linear(config.hidden_size*2, n_classes)
        if out_activation == 'none' or out_activation is None:
            self.activation = lambda x: x
        elif out_activation == 'tanh':
            self.activation = nn.Tanh()
        elif out_activation == 'relu':
            self.activation = nn.ReLU()
        elif out_activation == 'sigmoid':
            self.activation = torch.sigmoid

        if device is not None:
            self.to(device)

        # linear layer operating on the concatenated CLS tokens from all silos
        self.cls_pooler = nn.Sequential(
            nn.Dropout(p_dropout),
            nn.Linear(self.dim_in, int(self.hidden_size)),
        )

        # pre-mean-pooling (one for each silo)
        #self.pre_poolers = [nn.Sequential(
        #    nn.Dropout(p_dropout),
        #    nn.Linear(dim,dim)
        #    ) for dim in self.dims_in
        # ]
        self.pre_poolers = nn.Sequential(
            nn.Dropout(p_dropout),
            self.mean_activation
        )

        # sequential layer to concatenate the mean tokens from multiple tokens
        self.mean_pooler = nn.Linear(self.dim_in, self.hidden_size)

    def forward(self, hidden_states, attention_masks, excess_cls_ids=None) -> torch.Tensor:
        """Combines CLS token and mean-pooling for the sentence-vectorization"""

        # CLS/first-tokens from all silos, all concatenated together
        first_token_tensors = self._get_cls_tokens_all_silos(hidden_states)

        # mean pooling
        mean_pooled_tensors = self._mean_pool_all_silos(hidden_states, attention_masks, excess_cls_ids)

        # concatenate CLS and mean
        pooled_output = torch.concat((first_token_tensors, mean_pooled_tensors), axis=1)

        return self.activation(pooled_output)

    def _get_cls_token(self, hidden_state):
        """Grabs the CLS token from a hidden-states"""
        return hidden_state[:, 0]

    def _get_cls_tokens_all_silos(self, hidden_states):
        """Grabs the CLS token from all hidden_states"""
        first_tokens = [
            self._get_cls_token(hidden_state) for hidden_state in hidden_states
        ]
        # concat all first tokens
        all_first_tokens_cat = torch.cat(first_tokens,axis=1)
        # run the concatenated first-tokens through Dense
        all_first_tokens_out = self.cls_pooler(all_first_tokens_cat)
        return all_first_tokens_out

    def _mean_pool(self, hidden_state, attention_mask=None, excess_cls_id=None):
        """Pool along a sequence dimension (for just one silo)"""
        if excess_cls_id is None:
            excess_cls_id = attention_mask
        input_mask_expanded = excess_cls_id.unsqueeze(-1).expand(hidden_state.size()).float()
        sum_embeddings = torch.sum(hidden_state * input_mask_expanded, 1)
        sum_mask = input_mask_expanded.sum(1)
        sum_mask = torch.clamp(sum_mask, min=1e-9)
        return sum_embeddings / sum_mask

    def _mean_pool_all_silos(self, hidden_states, attention_masks=None, excess_cls_ids=None):
        """Pool along a sequence dimension (for all silos)"""
        if excess_cls_ids is None:
            excess_cls_ids = attention_masks

        # pre-pool: dense-layer before pooling
        hidden_states = [
            self.pre_poolers(hidden_state) for hidden_state in hidden_states
        ]

        # mean pool each silo
        mean_pooled_states = [
            self._mean_pool(
                hidden_state=hidden_state, excess_cls_id=excess_cls_id
            ) for hidden_state, excess_cls_id
            in zip(hidden_states, excess_cls_ids)
        ]

        # concat all mean-pooled states
        all_mean_pooled_states = torch.cat(mean_pooled_states,axis=1)
        # run the concatenated meanpooled states through Dense
        all_mean_pooled_states = self.mean_pooler(all_mean_pooled_states)
        return all_mean_pooled_states


In [None]:
class AnathemTransformer(nn.Module):
    def __init__(
        self,
        config=None,
        device=None,
        do_mlm = None,
        do_cls = None
    ):
        super().__init__()

        # default config
        if config is None:
            config = make_config()
        self.config = config
        self.do_mlm = config.do_mlm if do_mlm is None else do_mlm
        self.do_cls = config.do_cls if do_cls is None else do_cls

        # device
        if device is None:
            if torch.cuda.is_available():
                device = torch.device('cuda')
            else:
                device = torch.device('cpu')
        self.device= device

        # get the basemodel (and its masked LM head
        self.model_string = self.config.model_string
        basemodelLM_pretrained = AutoModelForMaskedLM.from_pretrained(self.model_string)

        # get the Pretrained BertEncoder
        basemod_pretrained = get_to_bertlayer(
            basemodelLM_pretrained,
            target_layer_name = 'encoder'
        )

        # make the tokenizer (based on pretrained)
        self.tokenizer = CustomTokenizer(
            model_string=self.config.model_string,
            n_cls_prepend = int(1/config.scale_ratio3),
            n_pad_to_multiple_of= int(1/config.scale_ratio3)
        )

        # make the Embedding and first layers (pretrained)
        self.encoder = AnathemEncoder(
            self.config,
            basemod=basemod_pretrained,
            tokenizer=self.tokenizer ,
            past_key_values_length = None,
            device=self.device,
        )

        # get the Pretrained maskedLM head
        if self.do_mlm:
            # perform maskedLM
            self.mlm = get_to_bertlayer(
                basemodelLM_pretrained,
                target_layer_name = 'cls'
            )
        else:
            self.mlm = lambda x : x

        # make the sequence-classification head
        if self.do_cls:
            self.pooler = AnathemMultiSiloPooler(
                config=self.config,
                mean_activation = nn.Tanhshrink(),
                dims_in = self.config.sequence_classification_intermediate_dim,
                p_dropout=self.config.hidden_dropout_prob,
                device=self.device
            )

    def _get_name(self):
        return 'ANATHEM_MODEL_FOR_MLM'

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        attention_mask_l2: Optional[torch.Tensor] = None,
        attention_mask_l3: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        excess_cls_ids: Optional[torch.Tensor] = None,
        excess_cls_ids_l2: Optional[torch.Tensor] = None,
        excess_cls_ids_l3: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        encoder_hidden_states: Optional[torch.Tensor] = None,
        encoder_attention_mask: Optional[torch.Tensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = False
    ):

        # run through base-layer (embeddings, transformer-block, 1 anathem stack)
        outputs_encoder = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            attention_mask_l2=attention_mask_l2, # optional downsized attention mask for sequence-dim 1/2
            attention_mask_l3=attention_mask_l3, # optional downsized attention mask for sequence-dim 1/4
            token_type_ids=token_type_ids, #: Optional[torch.Tensor] = None,
            position_ids=position_ids,#: Optional[torch.Tensor] = None,
            head_mask=head_mask,#: Optional[torch.Tensor] = None,
            inputs_embeds=None,#: Optional[torch.Tensor] = None,
            encoder_hidden_states=None,#: Optional[torch.Tensor] = None,
            encoder_attention_mask=None,#: Optional[torch.Tensor] = None,
            past_key_values=past_key_values,#: Optional[List[torch.FloatTensor]] = None,
            use_cache=use_cache,#: Optional[bool] = None,
            output_attentions=output_attentions,#: Optional[bool] = None,
            output_hidden_states=output_hidden_states,#: Optional[bool] = None,
            return_dict=False
        )
        if output_attentions:
            hidden_states, extended_attention_masks, attention = outputs_encoder
        else:
            hidden_states, extended_attention_masks = outputs_encoder
            attention = None

        out_mlm = {'logits':None}
        out_pooled_vector = None
        hidden_states_highres, hidden_states_midres, hiddenstates_lowres = hidden_states

        # MLM outputs
        if self.do_mlm:
            out_mlm = self.mlm(hidden_states_highres)

        # sequence pooling (for classification)
        if self.do_cls:
            out_pooled_vector = self.pooler(
                hidden_states=hidden_states,
                attention_masks=(attention_mask, attention_mask_l2, attention_mask_l3),
                excess_cls_ids=(excess_cls_ids, excess_cls_ids_l2, excess_cls_ids_l3)
            )
        #
        if return_dict:
            return {
                'hidden_states':(hidden_states_highres, hidden_states_midres, hiddenstates_lowres),
                'pooled':out_pooled_vector,
                'logits':out_mlm['logits'],
                'attention':attention,
                'extended_attention_masks':extended_attention_masks
            }
        return hidden_states, out_pooled_vector, out_mlm, attention, extended_attention_masks


In [None]:
modelstring_teacher_mlm = 'bert-base-uncased'
model_string = "google/bert_uncased_L-4_H-512_A-8"

config = make_config(
    modelstring = model_string,
    num_transformer_stacks = 3,
    scale_ratio2 = 0.5,
    scale_ratio3 = 0.25,
    multiplier_intermediate2 = 4.0,
    multiplier_intermediate3 = 4.0,
    num_layers_l2 = 1, # mid-res encoder
    num_layers_l3 = 3, # low-res encoder
    dropout_scaling = 0.05,
    do_cheap_integrator = [1],
    do_mlm=True,
    do_cls=True
)

Downloading (…)lve/main/config.json:   0%|          | 0.00/383 [00:00<?, ?B/s]

In [None]:

anamod = AnathemTransformer(
        config=config,
        device=None,
        do_mlm = True,
        do_cls = True
    )

teacher_mlm = AutoModelForMaskedLM.from_pretrained(modelstring_teacher_mlm)


from torch import Tensor
class TeacherEmbedder:

    def __init__(self, pretrained_name = 'intfloat/e5-large-v2'):
        self.pretrained_name = pretrained_name
        self.teacher_tokenizer = AutoTokenizer.from_pretrained(pretrained_name)
        self.teacher_embedder = AutoModel.from_pretrained(pretrained_name)

    @staticmethod
    def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
        last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
        return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

    def forward(self, input_text, prepend = 'passage: '):
        input_text = [prepend + s for s in input_text]
        with torch.no_grad():
            batch_dict = self.teacher_tokenizer(input_text, max_length=512, padding=True, truncation=True, return_tensors='pt')
            outputs = self.teacher_embedder(**batch_dict)
            embeddings = self.average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
        return embeddings

    def __call__(self, input_text, prepend = 'passage: '):
        return self.forward(input_text)


teacher_emb = TeacherEmbedder()

Downloading pytorch_model.bin:   0%|          | 0.00/116M [00:00<?, ?B/s]

Some weights of the model checkpoint at google/bert_uncased_L-4_H-512_A-8 were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.


Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Downloading (…)okenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/616 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

In [None]:

print(anamod.mlm) # MLM head

BertOnlyMLMHead(
  (predictions): BertLMPredictionHead(
    (transform): BertPredictionHeadTransform(
      (dense): Linear(in_features=512, out_features=512, bias=True)
      (transform_act_fn): GELUActivation()
      (LayerNorm): LayerNorm((512,), eps=1e-12, elementwise_affine=True)
    )
    (decoder): Linear(in_features=512, out_features=30522, bias=True)
  )
)


In [None]:
text = [
    "A standard [MASK] clause is a waiver clause that states that one party won't hold the other liable for damages, losses, or costs associated with issues.",
    "It usually consists of two elements: a trigger event or circumstance and a [MASK] obligation. The trigger event or circumstance is the [MASK] of the agreement, misconduct, or negligence of the indemnifying party or its affiliates"
]

inputs = anamod.tokenizer(text, add_special_tokens=True, return_tensors='pt', padding='longest')

print(inputs.keys())
inputs

outputs = anamod.forward(
    input_ids = inputs['input_ids'],
    attention_mask = inputs['attention_mask'],
    attention_mask_l2 = inputs['attention_mask_l2'],
    attention_mask_l3 = inputs['attention_mask_l3'],
    excess_cls_ids = inputs['excess_cls_ids'],
    excess_cls_ids_l2 = inputs['excess_cls_ids_l2'],
    excess_cls_ids_l3 = inputs['excess_cls_ids_l3']
)
# hidden_states, out_pooled_vector, out_mlm, attention, extended_attention_masks

outputs_teacher_mlm = teacher_mlm(input_ids = inputs['input_ids'], attention_mask=inputs['attention_mask'])


print(outputs[0][0].shape) # full hidden state sequence
print(outputs[0][1].shape) # mid hidden state sequence
print(outputs[0][2].shape) # small hidden state sequence
print(outputs[1].shape) # sentencevector
print(outputs[2].shape) # mlm outputs

#
print(outputs_teacher_mlm['logits'].shape) # Teacher shape mlm

predicted_token_ids1 = outputs_teacher_mlm[0][0].argmax(dim=-1)
predicted_token_ids2 = outputs[2][0].argmax(dim=-1)

print('Bert Base')
print(anamod.tokenizer.convert_ids_to_tokens(outputs_teacher_mlm[0][0].argmax(dim=-1)))
print('Anamod')
print(anamod.tokenizer.convert_ids_to_tokens(outputs[2][0].argmax(dim=-1)))


print('Bert Base')
print(anamod.tokenizer.convert_ids_to_tokens(outputs_teacher_mlm[0][1].argmax(dim=-1)))
print('Anamod')
print(anamod.tokenizer.convert_ids_to_tokens(outputs[2][1].argmax(dim=-1)))

# try to embed text with the teacher_emb
text2 = input_texts = [
    'query: how much protein should a female eat',
    'query: summit define',
    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "passage: Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments."
]
sentence_embeddings = teacher_emb(text2)
print(sentence_embeddings.shape)

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'excess_cls_ids', 'attention_mask_l2', 'attention_mask_l3', 'excess_cls_ids_l2', 'excess_cls_ids_l3'])




torch.Size([2, 48, 512])
torch.Size([2, 24, 256])
torch.Size([2, 12, 128])
torch.Size([2, 1024])
torch.Size([2, 48, 30522])
torch.Size([2, 48, 30522])
Bert Base
['.', '.', '.', '.', 'a', 'standard', 'liability', 'clause', 'is', 'a', 'wai', '##ver', 'clause', 'that', 'states', 'that', 'one', 'party', 'won', "'", 't', 'hold', 'the', 'other', 'liable', 'for', 'damages', ',', 'losses', ',', 'or', 'costs', 'associated', 'with', 'issues', '.', 's', '.', '.', 'it', '.', 'the', 'it', 'it', 'it', 'parties', 'one', 'party']
Anamod
['-', 'the', '-', '-', 'a', '-', '-', '-', '.', 'a', '-', '-', '.', '.', 'is', '.', 'the', '.', '-', "'", 's', '.', 'the', 'other', ',', 'for', 'me', ',', 'my', ',', 'or', 'the', '-', 'with', 'the', '.', 'the', 'he', 'he', 'he', '-', '-', ',', ',', ',', 'the', '-', ',']
Bert Base
['.', '.', '.', '.', 'it', 'usually', 'consists', 'of', 'two', 'elements', ':', 'a', 'trigger', 'event', 'or', 'circumstance', 'and', 'a', 'trigger', 'obligation', '.', 'the', 'trigger', 'even

Downloading builder script:   0%|          | 0.00/28.8k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/28.7k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/27.9k [00:00<?, ?B/s]

Downloading and preparing dataset glue/mrpc to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data: 0.00B [00:00, ?B/s]

Downloading data: 0.00B [00:00, ?B/s]

Downloading data: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/3668 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/408 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1725 [00:00<?, ? examples/s]

Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


Map:   0%|          | 0/1725 [00:00<?, ? examples/s]

{'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', id=None), 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None), 'idx': Value(dtype='int32', id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'excess_cls_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
{'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'excess_cls_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}


In [None]:

## Test a batched inference routine: including loss calculations
## steps:
## 1) tokenize inputs internal to a torch dataset (encode_plus?)
## 2) loop through dataloader, with a MLM collator also set?
## 3) do inference using teacher
## 5) do inference using anathem
## 6) loss
from torch.utils.data import DataLoader, Dataset
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling
from torch.optim import AdamW
from math import prod

# load dummy dataset
dataset_glue = load_dataset('glue', 'mrpc', split='test') # small set

# tokens = [tokenizer.encode_plus(txt, add_special_tokens=True) for txt in text]
# tokenize
dataset_glue = dataset_glue.map(lambda e: tokenizer.encode_plus(e['sentence1'], add_special_tokens=True))
print(dataset_glue.features)
dataset_glue = dataset_glue.remove_columns(column_names = ['sentence1','sentence2','idx','label'])
print(dataset_glue.features)
_ = """
{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'excess_cls_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
 """

# MLM collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# MLM distillation loss function (kl-divergence between teacher and student outputs)
loss_fn_mlm_distil = nn.KLDivLoss(reduction="batchmean")
loss_fn_mlm_labels = nn.CrossEntropyLoss(ignore_index=-100) # non-masked tokens have -100
weights_mlm_distil = 0.5
weights_mlm_labels = (1-weights_mlm_distil)

# dataloader with MLM collator
dl_mlm = DataLoader(dataset_glue, collate_fn=data_collator, batch_size=4)

# optimizer
optimizer = AdamW(anamod.parameters(), lr = 0.00001)
# (model.parameters(), lr=learning_rate)

# MLM objective
teacher_mlm.eval()
distillation_temperature = 1.0

for step_i, batch in enumerate(dl_mlm):

    # do inference using anathem model
    # hidden_states, out_pooled_vector, out_mlm, attention, extended_attention_masks
    outputs = anamod.forward(
        input_ids = batch['input_ids'],
        attention_mask = batch['attention_mask'],
        attention_mask_l2 = batch['attention_mask_l2'],
        attention_mask_l3 = batch['attention_mask_l3'],
        excess_cls_ids = batch['excess_cls_ids'],
        excess_cls_ids_l2 = batch['excess_cls_ids_l2'],
        excess_cls_ids_l3 = batch ['excess_cls_ids_l3']
    )

    # hidden_states, out_pooled_vector, out_mlm, attention, extended_attention_masks
    with torch.no_grad():
        outputs_teacher_mlm = teacher_mlm(
            input_ids = batch['input_ids'],
            attention_mask=batch['attention_mask']
        )

    # FOOFU
    assert outputs[2].size() == outputs_teacher_mlm.logits.size()
    # Soften probabilities and compute distillation loss
    loss_mlm_distil = loss_fn_mlm_distil(
            F.log_softmax(outputs[2] / distillation_temperature, dim=-1),
            F.softmax(outputs_teacher_mlm.logits / distillation_temperature, dim=-1)
        ) * (distillation_temperature ** 2) * weights_mlm_distil
    # label loss
    loss_mlm_labels = loss_fn_mlm_labels(
        outputs[2].view(-1, anamod.config.vocab_size),
        batch['labels'].view(-1)
    ) * weights_mlm_labels
    # Return weighted student loss
    #loss = self.args.alpha * student_loss + (1. - self.args.alpha) * loss_logits
    #return (loss, outputs_student) if return_outputs else loss
    optimizer.zero_grad()
    # Backward pass: compute gradient of the loss with respect to model
    (loss_mlm_distil+loss_mlm_labels).backward()
    #
    optimizer.step()

    if ((step_i+1) % 20) ==0:
        raise NotImplementedError('hit %d' % step_i)



{'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', id=None), 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None), 'idx': Value(dtype='int32', id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'excess_cls_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}
{'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'excess_cls_ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}


NotImplementedError: ignored

## MultiTask Training: adapted from s-bert

In [None]:
### Normal label-based losses (MLI
# -- https://huggingface.co/datasets/multi_nli
dataset_nli3 = load_dataset('multi_nli', split='train') # 383k examples

# I think I should keep the text untokenize for the multi-task, maybe use the default collator from sbert
dataset_nli3 = dataset_nli3.remove_columns(
    column_names = ['promptID', 'pairID', 'premise_binary_parse', 'premise_parse','hypothesis_binary_parse', 'hypothesis_parse', 'genre']
)

dl_mli3 = DataLoader(dataset_nli3, batch_size=4, shuffle=True)


# make a classification head

Downloading builder script:   0%|          | 0.00/5.14k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.88k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/8.67k [00:00<?, ?B/s]

Downloading and preparing dataset multi_nli/default to /root/.cache/huggingface/datasets/multi_nli/default/0.0.0/591f72eb6263d1ab527561777936b199b714cda156d35716881158a2bd144f39...


Downloading data:   0%|          | 0.00/227M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/392702 [00:00<?, ? examples/s]

Generating validation_matched split:   0%|          | 0/9815 [00:00<?, ? examples/s]

Generating validation_mismatched split:   0%|          | 0/9832 [00:00<?, ? examples/s]

Dataset multi_nli downloaded and prepared to /root/.cache/huggingface/datasets/multi_nli/default/0.0.0/591f72eb6263d1ab527561777936b199b714cda156d35716881158a2bd144f39. Subsequent calls will reuse this data.


In [None]:

class ClassifierMNLI3(nn.Module):
    """Bert Attention Layer that uses a dimension-reduced version of the query, so to reduce the dimension of the outputs"""
    def __init__(
        self,
        hidden_size = 512,
        do_subtract = True,
        dropout = 0.1,
        n_labels = 3
    ):
        """Special type of Bert Self attention that reduces the dimension of the inputs by half"""
        super().__init__()

        self.hidden_size = hidden_size
        self.do_subtract = do_subtract
        self.dropout_p = dropout
        self.n_labels = n_labels
        self.size_of_concatenated_inputs = self.hidden_size*2*2 + self.do_subtract*self.hidden_size*2

        # final output
        self.layer = nn.Sequential(
            nn.Dropout(self.dropout_p),
            nn.Linear(self.size_of_concatenated_inputs, self.n_labels)
        )
    def forward(self, input1, input2):
        features_concat = torch.concat((
            input1,
            input2,
            torch.sub(input1,input2)
        ),axis=1)
        return self.layer(features_concat)


# Make classifier for MNLI labelled data
classifier_mnli3 = ClassifierMNLI3(
    hidden_size = anamod.config.hidden_size,
    n_labels=3
)
classifier_mnli3.train()
anamod.train()
optimizer = torch.optim.AdamW(
    list(anamod.encoder.parameters()) +  list(anamod.pooler.parameters()) + list(classifier_mnli3.parameters()),
    lr=0.0001
)

# make loss function (3 labels)
loss_fn_nmli3 = nn.CrossEntropyLoss()
weights_mnli_distil = 0.5
weights_mnli_labels = (1-weights_mnli_distil)

loss_fn_mnli3_distil = nn.MSELoss()


In [None]:
for i, batch_mnli in enumerate(dl_mli3):
    optimizer.zero_grad()
    # get tokens
    tokens_mnli_1 = anamod.tokenizer(batch_mnli['premise'],pad_to_multiple_of=4, add_special_tokens = True, return_tensors='pt', padding='longest')
    tokens_mnli_2 = anamod.tokenizer(batch_mnli['hypothesis'],pad_to_multiple_of=4, add_special_tokens = True, return_tensors='pt', padding='longest')

    # student embeddings
    out_student_mnli1 = anamod.forward(
            input_ids = tokens_mnli_1['input_ids'],
            attention_mask = tokens_mnli_1['attention_mask'],
            attention_mask_l2 = tokens_mnli_1['attention_mask_l2'],
            attention_mask_l3 = tokens_mnli_1['attention_mask_l3'],
            excess_cls_ids = tokens_mnli_1['excess_cls_ids'],
            excess_cls_ids_l2 = tokens_mnli_1['excess_cls_ids_l2'],
            excess_cls_ids_l3 = tokens_mnli_1 ['excess_cls_ids_l3']
    )
    out_student_mnli2 = anamod.forward(
            input_ids = tokens_mnli_2['input_ids'],
            attention_mask = tokens_mnli_2['attention_mask'],
            attention_mask_l2 = tokens_mnli_2['attention_mask_l2'],
            attention_mask_l3 = tokens_mnli_2['attention_mask_l3'],
            excess_cls_ids = tokens_mnli_2['excess_cls_ids'],
            excess_cls_ids_l2 = tokens_mnli_2['excess_cls_ids_l2'],
            excess_cls_ids_l3 = tokens_mnli_2 ['excess_cls_ids_l3']
    )

    # raw sentence-vectors from student
    feature_student_mnli1, feature_student_mnli2 = out_student_mnli1[1], out_student_mnli2[1]
    # mnli predictions n labels
    pred_mnli3 = classifier_mnli3(feature_student_mnli1, feature_student_mnli2)
    # mnli binary loss
    loss_cls_nmli3 = loss_fn_nmli3(pred_mnli3, batch_mnli['label']) * weights_nmli_labels
    #loss_cls_nmli3.backward()

    # NEXT do distillation loss with teacher
    feature_teacher_nmli1 = teacher_emb(input_text=batch_mnli['premise'], prepend = 'passage: ')
    feature_teacher_nmli2 = teacher_emb(input_text=batch_mnli['hypothesis'], prepend = 'passage: ')
    # MNLI distillation loss
    loss_mnli_distil = (
        loss_fn_mnli3_distil(feature_student_mnli1, feature_teacher_nmli1) + loss_fn_mnli3_distil(feature_student_mnli2, feature_teacher_nmli2)
    )*weights_nmli_distil
    # backprop
    (loss_mnli_distil + loss_cls_nmli3).backward()

    # update weights
    optimizer.step()

    if (i+1)%3 ==0:
        print(loss_cls_nmli3.detach().item())





0.6361832022666931
0.5656223297119141
0.3880550265312195


KeyboardInterrupt: ignored

In [None]:
# Combine the teacher training with classification
optimizer = AdamW(list(anamod.parameters()) + list(classifier_mnli3.parameters()), lr = 0.00001)
# (model.parameters(), lr=learning_rate)

# MLM objective
teacher_mlm.eval()
distillation_temperature = 1.0
for i,(batch_mlm, batch_mnli) in enumerate(zip(dl_mlm, dl_mli3)):
    optimizer.zero_grad()
    # do inference using anathem model
    # hidden_states, out_pooled_vector, out_mlm, attention, extended_attention_masks
    outputs = anamod.forward(
        input_ids = batch['input_ids'],
        attention_mask = batch['attention_mask'],
        attention_mask_l2 = batch['attention_mask_l2'],
        attention_mask_l3 = batch['attention_mask_l3'],
        excess_cls_ids = batch['excess_cls_ids'],
        excess_cls_ids_l2 = batch['excess_cls_ids_l2'],
        excess_cls_ids_l3 = batch ['excess_cls_ids_l3']
    )

    # hidden_states, out_pooled_vector, out_mlm, attention, extended_attention_masks
    with torch.no_grad():

        # mlm teacher outputs
        outputs_teacher_mlm = teacher_mlm(
            input_ids = batch['input_ids'],
            attention_mask=batch['attention_mask']
        )
        # to do this, I'd need to have the original text, and NOT pre-tokenized text
        #teacher_emb(input_text=batch['premise'], prepend = 'passage: ')

    # FOOFU
    assert outputs[2].size() == outputs_teacher_mlm.logits.size()
    # Soften probabilities and compute distillation loss
    #loss_function = nn.KLDivLoss(reduction="batchmean")
    loss_mlm_distil = loss_fn_mlm_distil(
            F.log_softmax(outputs[2] / distillation_temperature, dim=-1),
            F.softmax(outputs_teacher_mlm.logits / distillation_temperature, dim=-1)
        ) * (distillation_temperature ** 2) * weights_mlm_distil
    #loss_mlm_distil.backward()
    loss_mlm_labels = loss_fn_mlm_labels(
        outputs[2].view(-1, anamod.config.vocab_size),
        batch['labels'].view(-1)
    ) * weights_mlm_labels

    # loss on paragraph embedding

    # BACKPROP MLM label loss and distilloss
    (loss_mlm_distil+loss_mlm_labels).backward()
    # Return weighted student loss
    #loss = self.args.alpha * student_loss + (1. - self.args.alpha) * loss_logits
    #return (loss, outputs_student) if return_outputs else loss

    # NLI task: get tokens
    tokens_mnli_1 = anamod.tokenizer(batch_mnli['premise'],pad_to_multiple_of=4, add_special_tokens = True, return_tensors='pt', padding='longest')
    tokens_mnli_2 = anamod.tokenizer(batch_mnli['hypothesis'],pad_to_multiple_of=4, add_special_tokens = True, return_tensors='pt', padding='longest')

    # student embeddings
    out_student_mnli1 = anamod.forward(
            input_ids = tokens_mnli_1['input_ids'],
            attention_mask = tokens_mnli_1['attention_mask'],
            attention_mask_l2 = tokens_mnli_1['attention_mask_l2'],
            attention_mask_l3 = tokens_mnli_1['attention_mask_l3'],
            excess_cls_ids = tokens_mnli_1['excess_cls_ids'],
            excess_cls_ids_l2 = tokens_mnli_1['excess_cls_ids_l2'],
            excess_cls_ids_l3 = tokens_mnli_1 ['excess_cls_ids_l3']
    )
    out_student_mnli2 = anamod.forward(
            input_ids = tokens_mnli_2['input_ids'],
            attention_mask = tokens_mnli_2['attention_mask'],
            attention_mask_l2 = tokens_mnli_2['attention_mask_l2'],
            attention_mask_l3 = tokens_mnli_2['attention_mask_l3'],
            excess_cls_ids = tokens_mnli_2['excess_cls_ids'],
            excess_cls_ids_l2 = tokens_mnli_2['excess_cls_ids_l2'],
            excess_cls_ids_l3 = tokens_mnli_2 ['excess_cls_ids_l3']
    )
    # raw sentence-vectors from student
    feature_student_mnli1, feature_student_mnli2 = out_student_mnli1[1], out_student_mnli2[1]
    # labels
    pred_mnli3 = classifier_mnli3(feature_student_mnli1, feature_student_mnli2)
    # binary loss
    loss_cls_nmli3 = loss_fn_nmli3(pred_mnli3, batch_mnli['label'])
    #loss_cls_nmli3.backward()
    feature_teacher_nmli1 = teacher_emb(input_text=batch_mnli['premise'], prepend = 'passage: ')
    feature_teacher_nmli2 = teacher_emb(input_text=batch_mnli['hypothesis'], prepend = 'passage: ')
    # MNLI distillation loss
    loss_mnli_distil = (
        loss_fn_mnli3_distil(feature_student_mnli1, feature_teacher_nmli1) + loss_fn_mnli3_distil(feature_student_mnli2, feature_teacher_nmli2)
    )*weights_nmli_distil
    # backprop
    (loss_mnli_distil + loss_cls_nmli3).backward()
    # Backward pass: compute gradient of the loss with respect to model
    optimizer.step()

    if (i+1)%4 ==0:
        print(loss_cls_nmli3.detach().item())



1.3287630081176758
1.1084638833999634
1.1774473190307617
1.0645709037780762
1.091556429862976
1.1649658679962158
1.319928765296936
1.1654601097106934
0.9826673865318298
1.1563453674316406
1.0446501970291138
1.1165382862091064
1.1049705743789673
0.9217707514762878
1.14559006690979
1.1429061889648438
0.9149771928787231
1.207316279411316
1.1845396757125854
1.2629420757293701
0.9769338369369507
1.0895546674728394
1.0898280143737793
1.1648684740066528
0.9611557126045227
1.044935703277588
1.144046425819397
1.099448561668396
1.0884103775024414
1.142393946647644
1.0853071212768555
1.1239224672317505
1.0658488273620605
1.1993112564086914
0.9642707109451294
1.182077407836914
1.3221166133880615
1.1279082298278809
1.0723700523376465
1.1399314403533936
1.0013256072998047
1.1049387454986572
1.0147031545639038
1.2314361333847046
1.0651648044586182
1.1327135562896729
0.9887092709541321
1.0250582695007324
1.1199613809585571
1.094027042388916
1.091330885887146
1.098750114440918
1.1193275451660156
1.1657

In [None]:
class TrainerMultiTask:
    """Adapted from the uklab/sentence-transformers .fit() function"""
    def __init__(
            self,
            do_reload = True,
            epochs_total_lifetime = 5,
            scheduler: str = 'WarmupLinear',
            warmup_steps: int = 10000,
            optimizer_class: Type[Optimizer] = torch.optim.AdamW,
            optimizer_params : Dict[str, object]= {'lr': 2e-5},
            weight_decay: float = 0.01,
            evaluation_steps: int = 0,
            output_path: str = None,
            save_best_model: bool = True,
            max_grad_norm: float = 2.0,
            use_amp: bool = False,
            callback: Callable[[float, int, int], None] = None,
            show_progress_bar: bool = False,
            checkpoint_path: str = 'checkpoint.pt',
            checkpoint_path_optimizer: str = 'checkpoint_optimizer.pt',
            checkpoint_path_scheduler: str = 'checkpoint_scheduler.pt',
            checkpoint_path_trainer_state: str = 'checkpoint_trainer_state.json',
            checkpoint_save_steps: int = 500,
            checkpoint_save_total_limit: int = 0,
            do_minimize_global_objective: Int = 1
        ):
            self.epochs_global = -1 # track the total number of epochs
            self.epochs_total_lifetime = epochs_total_lifetime # total number of epochs over lifetime
            self.global_step = 0 # track the toatl number of steps
            self.do_minimize = do_minimize_global_objective
            self.best_score = 9999999 if self.do_minimize else -9999999
            self.output_path = output_path
            self.checkpoint_path = checkpoint_path
            self.checkpoint_path_optimizer = checkpoint_path_optimizer
            self.checkpoint_path_scheduler = checkpoint_path_scheduler
            self.checkpoint_path_trainer_state = checkpoint_path_trainer_state
            self.scheduler_state_dict = None
            self.optimizer_state_dict = None
            self.trainer_state = None
            self.loss_models_states = None
            if do_reload:
                print('attempting to reload cached model, optimizer, scheduler, and saved trainer sate')
                model_state, loss_models_states = self.load_saved_model(self.checkpoint_path)
                self.model_state = model_state
                self.loss_models_states = loss_models_states
                self.scheduler_state_dicts = self.load_saved_scheduler(self.checkpoint_path_scheduler)
                self.optimizer_state_dicts = self.load_saved_optimizer(self.checkpoint_path_optimizer)
                self.trainer_state = self.load_saved_trainer_state(self.checkpoint_path_trainer_state)

    def fit(self,
            train_objectives: Iterable[Tuple[DataLoader, nn.Module]],
            model=None,
            weights_train_objectives:List = None,
            teachers: List = None,
            evaluator: SentenceEvaluator = None,
            epochs: int = 1,
            epochs_total_lifetime = None,
            steps_per_epoch = None,
            scheduler: str = None, # 'WarmupLinear',
            warmup_steps: int = 10000,
            optimizer_class: Type[Optimizer] = torch.optim.AdamW,
            optimizer_params : Dict[str, object]= {'lr': 2e-5},
            weight_decay: float = 0.01,
            evaluation_steps: int = 0,
            save_best_model: bool = True,
            max_grad_norm: float = 2.0,
            use_amp: bool = False,
            callback: Callable[[float, int, int], None] = None,
            show_progress_bar: bool = True,
            checkpoint_path = None,
            checkpoint_path_optimizer= None,
            checkpoint_path_scheduler= None,
            checkpoint_path_trainer_config= None,
            checkpoint_save_steps: int = 500,
            checkpoint_save_total_limit: int = 2
            ):
        """
        Train the model with the given training objective
        Each training objective is sampled in turn for one batch.
        We sample only as many batches from each objective as there are in the smallest one
        to make sure of equal training with each dataset.

        :param train_objectives: Tuples of (DataLoader, LossFunction). Pass more than one for multi-task learning
        :param evaluator: An evaluator (sentence_transformers.evaluation) evaluates the model performance during training on held-out dev data. It is used to determine the best model that is saved to disc.
        :param epochs: Number of epochs for training
        :param steps_per_epoch: Number of training steps per epoch. If set to None (default), one epoch is equal the DataLoader size from train_objectives.
        :param scheduler: Learning rate scheduler. Available schedulers: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts
        :param warmup_steps: Behavior depends on the scheduler. For WarmupLinear (default), the learning rate is increased from o up to the maximal learning rate. After these many training steps, the learning rate is decreased linearly back to zero.
        :param optimizer_class: Optimizer
        :param optimizer_params: Optimizer parameters
        :param weight_decay: Weight decay for model parameters
        :param evaluation_steps: If > 0, evaluate the model using evaluator after each number of training steps
        :param output_path: Storage path for the model and evaluation files
        :param save_best_model: If true, the best model (according to evaluator) is stored at output_path
        :param max_grad_norm: Used for gradient normalization.
        :param use_amp: Use Automatic Mixed Precision (AMP). Only for Pytorch >= 1.6.0
        :param callback: Callback function that is invoked after each evaluation.
                It must accept the following three parameters in this order:
                `score`, `epoch`, `steps`
        :param show_progress_bar: If True, output a tqdm progress bar
        :param checkpoint_path: Folder to save checkpoints during training
        :param checkpoint_save_steps: Will save a checkpoint after so many steps
        :param checkpoint_save_total_limit: Total number of checkpoints to store
        """
        if self.model_state is not None:
            print('reloading saved model state into model')
            model.load_state_dict(self.model_state)
            self.model = model

        # paths (optional update)
        self.checkpoint_path = checkpoint_path if checkpoint_path is not None else self.checkpoint_path
        self.checkpoint_path_optimizer = checkpoint_path_optimizer if checkpoint_path_optimizer is not None else self.checkpoint_path_optimizer
        self.checkpoint_path_scheduler = checkpoint_path_scheduler if checkpoint_path_scheduler is not None else self.checkpoint_path_scheduler
        self.checkpoint_path_trainer_state = checkpoint_path_trainer_state if checkpoint_path_trainer_state is not None else self.checkpoint_path_trainer_state
        self._target_device = model.device
        self.max_grad_norm = max_grad_norm
        self.weight_decay = weight_decay
        self.warmup_steps = warmup_steps
        self.optimizer_params = optimizer_params
        self.evaluation_steps = evaluation_steps

        if use_amp:
            from torch.cuda.amp import autocast
            scaler = torch.cuda.amp.GradScaler()

        #self.to(self._target_device)

        dataloaders = [dataloader for dataloader, _ in train_objectives]

        # Use smart batching
        if len(collators)==0 or collators is None:
            print('using default batch collators')
        for dli, dataloader in enumerate(dataloaders):
            if dataloader.collate_fn is None:
                print('using default batch collators for dataloader %d' % dli)
                dataloader.collate_fn = self.smart_batching_collate

        loss_models = [loss for _, loss in train_objectives]
        for midx, loss_model in enumerate(loss_models):
            if self.loss_models_states is not None:
                # reload each loss_model.classifier's saved states
                if hassattr(loss_model, 'classifier'):
                    loss_model.classifier.load_state_dict(self.loss_models_states[midx])
            loss_model.to(self._target_device)

        if steps_per_epoch is None or steps_per_epoch == 0:
            steps_per_epoch = min([len(dataloader) for dataloader in dataloaders])

        if epochs_total_lifetime is None:
            epochs_total_lifetime = self.epochs_total_lifetime
        num_train_steps = int(steps_per_epoch * epochs_total_lifetime)

        # Prepare optimizers
        #optimizers = []
        #schedulers = []
        #for model_idx, loss_model in enumerate(loss_models):
        #    param_optimizer = list(loss_model.named_parameters())#
        #    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
        #    optimizer_grouped_parameters = [
        #        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': weight_decay},
        #        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
        #    ]
        #    optimizer = optimizer_class(optimizer_grouped_parameters, **optimizer_params)
        #    scheduler_obj = self._get_scheduler(optimizer, scheduler=scheduler, warmup_steps=warmup_steps, t_total=num_train_steps)
        #    if self.optimizer_state_dicts is not None:
        #        # reload optimizer states
        #        optimizer.load_state_dict(self.optimizer_state_dicts[model_idx])
        #    if self.scheduler_state_dicts is not None:
        #        # relead scheduler states
        #        scheduler_obj.load_state_dict(self.scheduler_state_dicts[model_idx])
        #    optimizers.append(optimizer)
        #    schedulers.append(scheduler_obj)

        # from: https://stackoverflow.com/questions/46377599/when-to-use-individual-optimizers-in-pytorch
        optimizer_parameters = set()
        for model_idx, loss_model in enumerate(loss_models):
            optimizer_parameters |= loss_model.named_parameters()

        no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
        optimizer_grouped_parameters = [
            {'params': [p for n, p in optimizer_parameters if not any(nd in n for nd in no_decay)], 'weight_decay': weight_decay},
            {'params': [p for n, p in optimizer_parameters if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
        ]

        optimizer = optimizer_class(optimizer_grouped_parameters, **optimizer_params)
        scheduler_obj = self._get_scheduler(optimizer, scheduler=scheduler, warmup_steps=warmup_steps, t_total=num_train_steps)
        if self.optimizer_state_dicts is not None:
            # reload optimizer states
            #optimizer.load_state_dict(self.optimizer_state_dicts[model_idx])
            optimizer.load_state_dict(self.optimizer_state_dicts)
        if self.scheduler_state_dicts is not None:
            # relead scheduler states
            #scheduler_obj.load_state_dict(self.scheduler_state_dicts[model_idx])
            scheduler_obj.load_state_dict(self.scheduler_state_dicts)

        global_step = self.global_step
        data_iterators = [iter(dataloader) for dataloader in dataloaders]

        num_train_objectives = len(train_objectives)

        for epoch in trange(epochs, desc="Epoch", disable=not show_progress_bar):
            self.epochs_global += epoch
            training_steps = 0

            for loss_model in loss_models:
                loss_model.zero_grad()
                loss_model.train()

            for _ in trange(steps_per_epoch, desc="Iteration", smoothing=0.05, disable=not show_progress_bar):

                # loop through multiple tasks
                for train_idx in range(num_train_objectives):
                    loss_model = loss_models[train_idx]
                    loss_weight = weights_train_objectives[train_idx]
                    teacher = teachers[train_idx]
                    optimizer = optimizers[train_idx]
                    scheduler = schedulers[train_idx]
                    data_iterator = data_iterators[train_idx]

                    try:
                        data = next(data_iterator)
                    except StopIteration:
                        data_iterator = iter(dataloaders[train_idx])
                        data_iterators[train_idx] = data_iterator
                        data = next(data_iterator)

                    features, labels = data
                    features = list(map(lambda batch: batch_to_device(batch, self._target_device), features))
                    if labels is not None:
                        labels = labels.to(self._target_device)

                    loss_value = loss_model(features, labels, teacher=teacher)
                    loss_value *= loss_weight
                    loss_value.backward()

                torch.nn.utils.clip_grad_norm_(loss_model.parameters(), max_grad_norm)
                optimizers.step()
                optimizers.zero_grad()
                schedulers.step()

                # TODO: integrate amp: https://discuss.pytorch.org/t/ddp-amp-gradient-accumulation-calling-optimizer-step-leads-to-nan-loss/162624
                training_steps += 1
                global_step += 1
                self.global_step = global_step

                if evaluation_steps > 0 and training_steps % evaluation_steps == 0:
                    self._eval_during_training(evaluator, output_path, save_best_model, epoch, training_steps, callback)

                    for loss_model in loss_models:
                        loss_model.zero_grad()
                        loss_model.train()

                if self.checkpoint_path is not None and checkpoint_save_steps is not None and checkpoint_save_steps > 0 and global_step % checkpoint_save_steps == 0:
                    self._save_checkpoint(
                        model, optimizers, schedulers, loss_models, checkpoint_save_total_limit, global_step
                    )

            self._eval_during_training(evaluator, output_path, save_best_model, epoch, -1, callback)

        #if evaluator is None and output_path is not None:   #No evaluator, but output path: save final model version
        #    self.save(output_path)

        if checkpoint_path is not None:
            self._save_checkpoint(
                model, optimizers, schedulers, loss_models, checkpoint_save_total_limit, global_step
            )

    def evaluate(self, evaluator: SentenceEvaluator, output_path: str = None):
        """
        Evaluate the model

        :param evaluator:
            the evaluator
        :param output_path:
            the evaluator can write the results to this path
        """
        if output_path is not None:
            os.makedirs(output_path, exist_ok=True)
        return evaluator(self, output_path)

    def _eval_during_training(self, evaluator, output_path, save_best_model, epoch, steps, callback):
        """Runs evaluation during the training"""
        eval_path = output_path
        if output_path is not None:
            os.makedirs(output_path, exist_ok=True)
            eval_path = os.path.join(output_path, "eval")
            os.makedirs(eval_path, exist_ok=True)

        if evaluator is not None:
            score = evaluator(self, output_path=eval_path, epoch=epoch, steps=steps)
            if callback is not None:
                callback(score, epoch, steps)
            if score > self.best_score:
                self.best_score = score
                if save_best_model:
                    self.save(output_path)

    def _save_checkpoint(
        self,
        model,
        optimizers,
        schedulers,
        loss_models,
        checkpoint_save_total_limit,
        step,
        checkpoint_path = None,
        checkpoint_path_optimizer = None,
        checkpoint_path_scheduler = None,
        checkpoint_path_trainer_state =None
    ):
        # Store new checkpoint
        checkpoint_path = checkpoint_path if checkpoint_path is not None else self.checkpoint_path
        checkpoint_path_optimizer = checkpoint_path_optimizer if checkpoint_path_optimizer is not None else self.checkpoint_path_optimizer
        checkpoint_path_scheduler = checkpoint_path_scheduler if checkpoint_path_scheduler is not None else self.checkpoint_path_scheduler
        checkpoint_path_trainer_state = checkpoint_path_trainer_state if checkpoint_path_trainer_state is not None else self.checkpoint_path_trainer_state

        # model states
        self.model_state = model.state_dict()
        self.loss_models_states = [self._grab_loss_states(loss_model) for loss_models]
        torch.save({
            'epochs_global':self.epochs_global, 'global_step':self.global_step, 'step':step,
            'model_state_dict':self.model_state,
            'loss_models_state_dicts':self.loss_models_states,
        }, "%s-%08g" % (checkpoint_path, step))

        # optimizer
        self.optimizer_state_dicts = optimizers.state_dict() #[opt.state_dict() for opt in optimizers],
        torch.save({
            'epochs_global':self.epochs_global, 'global_step':self.global_step, 'step':step,
            'optimizer_state_dicts':self.optimizer_state_dicts,
        }, "%s-%08g" % (checkpoint_path_optimizer, step))

        # scheduler
        self.scheduler_state_dicts = schedulers.state_dict() #[scheduler.state_dict() for scheduler in schedulers]
        torch.save({
            'epochs_global':self.epochs_global, 'global_step':self.global_step, 'step':step,
            'scheduler_state_dicts':self.scheduler_state_dicts,
        }, "%s-%08g" % (checkpoint_path_scheduler, step))

        # trainer info
        with open(checkpoint_path_trainer_state, 'w') as jcon:
            trainer_objs_to_save = {
                'epochs_global':self.epochs_global, 'global_step':self.global_step, 'step':step,
                'max_grad_norm':self.max_grad_norm,
                'weight_decay':self.weight_decay,
                'warmup_steps':self.warmup_steps,
                'optimizer_params':self.optimizer_params,
                'evaluation_steps':self.evaluation_steps,
                'checkpoint_path_optimizer': "%s-%08g" % (checkpoint_path_optimizer, step),
                'checkpoint_path_scheduler': "%s-%08g" % (checkpoint_path_scheduler, step),
            }
            json.dump(trainer_objs_to_save, jcon)

        # Delete old checkpoints
        if checkpoint_save_total_limit is not None and checkpoint_save_total_limit > 0:
            old_checkpoints = []
            dir_to_checkpoints = "/".join(checkpoint_path.split('/')[:-1])
            for f in os.listdir(dir_to_checkpoints):
                if bool(re.search('(\-[0-9]+$',f)) & (checkpoint_path in f):
                    # get step of saved checkpoint
                    old_pt_step = int(re.search('(?<=\-)[0-9]+$',f).group())
                    old_checkpoints.append({
                        'step': old_pt_step, 'path': os.path.join(dir_to_checkpoints, f)
                    })

            if len(old_checkpoints) > checkpoint_save_total_limit:
                old_checkpoints = sorted(old_checkpoints, key=lambda x: x['step'])
                oldest_step = old_checkpoints[0]['step']
                for old_checkpoint in old_checkpoints:
                    if old_checkpoint['step']==oldest_step:
                        print('deleting old checkpoint: %s' % old_checkpoint['path'])
                        shutil.rmtree(old_checkpoint['path'])

    def _grab_loss_states(loss_model):
        """Gets the loss_model.state_dict() for a model embedded in a loss function"""
        return loss_model.classifier.state_dict()

    def load_saved_model(checkpoint_path=None):
        """reload saved model"""
        checkpoint_path = self.checkpoint_path if checkpoint_path is None else checkpoint_path
        saved_dict = torch.load(checkpoint_path)
        return saved_dict['model_state_dict'], saved_dict['loss_models_state_dicts']

    def load_saved_scheduler(checkpoint_path_scheduler=None):
        """reload saved model"""
        checkpoint_path_scheduler = self.checkpoint_path_scheduler if checkpoint_path_scheduler is None else checkpoint_path_scheduler
        saved_dict = torch.load(checkpoint_path_scheduler)
        return saved_dict['scheduler_state_dicts']

    def load_saved_optimizer(checkpoint_path_optimizer=None):
        """reload saved model"""
        checkpoint_path_optimizer = self.checkpoint_path_optimizer if checkpoint_path_optimizer is None else checkpoint_path_optimizer
        saved_dict = torch.load(checkpoint_path_optimizer)
        return saved_dict['optimizer_state_dicts']

    def load_saved_trainer_state(checkpoint_path_trainer_state):
        checkpoint_path_trainer_state = self.checkpoint_path_trainer_state if checkpoint_path_trainer_state is None else checkpoint_path_trainer_state
        with open(checkpoint_path_trainer_state, 'r') as jcon:
            trainer_state = json.load(jcon)
        self.epochs_global = trainer_state['epochs_global']
        self.global_step = trainer_state['global_step']
        self.step = trainer_state['step']
        self.max_grad_norm = trainer_state['max_grad_norm']
        self.weight_decay = trainer_state['weight_decay']
        self.warmup_steps = trainer_state['warmup_steps']
        self.optimizer_params = trainer_state['optimizer_params']
        self.evaluation_steps = trainer_state['evaluation_steps']

    def _load_auto_model(self, model_name_or_path):
        """
        Creates a simple Transformer + Mean Pooling model and returns the modules
        """
        logger.warning("No sentence-transformers model found with name {}. Creating a new one with MEAN pooling.".format(model_name_or_path))
        transformer_model = Transformer(model_name_or_path)
        pooling_model = Pooling(transformer_model.get_word_embedding_dimension(), 'mean')
        return [transformer_model, pooling_model]

    def _load_sbert_model(self, model_path):
        """
        Loads a full sentence-transformers model
        """
        # Check if the config_sentence_transformers.json file exists (exists since v2 of the framework)
        config_sentence_transformers_json_path = os.path.join(model_path, 'config_sentence_transformers.json')
        if os.path.exists(config_sentence_transformers_json_path):
            with open(config_sentence_transformers_json_path) as fIn:
                self._model_config = json.load(fIn)

            if '__version__' in self._model_config and 'sentence_transformers' in self._model_config['__version__'] and self._model_config['__version__']['sentence_transformers'] > __version__:
                logger.warning("You try to use a model that was created with version {}, however, your version is {}. This might cause unexpected behavior or errors. In that case, try to update to the latest version.\n\n\n".format(self._model_config['__version__']['sentence_transformers'], __version__))

        # Check if a readme exists
        model_card_path = os.path.join(model_path, 'README.md')
        if os.path.exists(model_card_path):
            try:
                with open(model_card_path, encoding='utf8') as fIn:
                    self._model_card_text = fIn.read()
            except:
                pass

        # Load the modules of sentence transformer
        modules_json_path = os.path.join(model_path, 'modules.json')
        with open(modules_json_path) as fIn:
            modules_config = json.load(fIn)

        modules = OrderedDict()
        for module_config in modules_config:
            module_class = import_from_string(module_config['type'])
            module = module_class.load(os.path.join(model_path, module_config['path']))
            modules[module_config['name']] = module

        return modules

    @staticmethod
    def load(input_path):
        return SentenceTransformer(input_path)

    @staticmethod
    def _get_scheduler(optimizer, scheduler: str, warmup_steps: int, t_total: int):
        """
        Returns the correct learning rate scheduler. Available scheduler: constantlr, warmupconstant, warmuplinear, warmupcosine, warmupcosinewithhardrestarts
        """
        scheduler = scheduler.lower()
        if scheduler == 'constantlr':
            return transformers.get_constant_schedule(optimizer)
        elif scheduler == 'warmupconstant':
            return transformers.get_constant_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps)
        elif scheduler == 'warmuplinear':
            return transformers.get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total)
        elif scheduler == 'warmupcosine':
            return transformers.get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total)
        elif scheduler == 'warmupcosinewithhardrestarts':
            return transformers.get_cosine_with_hard_restarts_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total)
        else:
            raise ValueError("Unknown scheduler {}".format(scheduler))

    @property
    def device(self) -> device:
        """
        Get torch.device from module, assuming that the whole module has one device.
        """
        try:
            return next(self.parameters()).device
        except StopIteration:
            # For nn.DataParallel compatibility in PyTorch 1.5

            def find_tensor_attributes(module: nn.Module) -> List[Tuple[str, Tensor]]:
                tuples = [(k, v) for k, v in module.__dict__.items() if torch.is_tensor(v)]
                return tuples

            gen = self._named_members(get_members_fn=find_tensor_attributes)
            first_tuple = next(gen)
            return first_tuple[1].device

    @property
    def tokenizer(self):
        """
        Property to get the tokenizer that is used by this model
        """
        return self.model.tokenizer

    #@tokenizer.setter
    #def tokenizer(self, value):
    #    self._first_module().tokenizer = value

    @property
    def max_seq_length(self):
        """
        Property to get the maximal input sequence length for the model. Longer inputs will be truncated.
        """
        return self.model._first_module().max_seq_length

    @max_seq_length.setter
    def max_seq_length(self, value):
        """
        Property to set the maximal input sequence length for the model. Longer inputs will be truncated.
        """
        self.model._first_module().max_seq_length = value

SyntaxError: ignored

### Load a Standard Dataset for MLM task

Also need to grab datasets here: https://arxiv.org/pdf/1908.08962.pdf

```
    The Pile dataset looks good: https://pile.eleuther.ai/
    https://arxiv.org/abs/2101.00027
    PubMed Central, ArXiv, GitHub, the FreeLaw Project, Stack Exchange, the US
    Patent and Trademark Office, PubMed, Ubuntu IRC, HackerNews, YouTube, PhilPapers, and NIH ExPorter.
    We also introduce OpenWebText2 and
    BookCorpus2, which are extensions of the original
    OpenWebText (Gokaslan and Cohen, 2019) and
    BookCorpus (Zhu et al., 2015; Kobayashi, 2018)
    datasets, respectively.
    In addition, we incorporate several existing highquality datasets: Books3 (Presser, 2020), Project Gutenberg (PG-19) (Rae et al., 2019), OpenSubtitles (Tiedemann, 2016), English Wikipedia, DM Mathematics (Saxton et al., 2019), EuroParl
    (Koehn, 2005), and

    ABout the law:
    and other metadata, we focused specifically on
    court opinions due to an abundance of full-text
    entries. This data is entirely within the public domain.

```

Scientific Papers: You can use the scientific_papers dataset, which includes a large collection of scientific papers from various domains. It covers research articles from fields such as computer science, physics, biology, and more.

Patents: The patent_citations dataset contains patent text data along with citation information, making it suitable for training language models with a focus on technical and scientific domains.

ArXiv: The arxiv dataset includes research papers from the arXiv repository, covering a wide range of scientific disciplines. It can be used to enhance the exposure of your model to academic literature.

PubMed: The pubmed dataset consists of abstracts from biomedical research articles indexed in PubMed. It is a valuable resource if you want to incorporate biomedical and life sciences content into your MLM pretraining.

joelito/Multi_Legal_Pile - use subset `en_all` to access EU-courts, and other datasets


Looks like streaming data is available:
https://huggingface.co/learn/nlp-course/chapter5/4?fw=pt

In [1]:
### Load a standard dataset
%pip install transformers datasets zstandard rank_bm25 langdetect pynndescent
# need the zstandard to use the streaming data function from huggingface datasets

Collecting transformers
  Downloading transformers-4.33.2-py3-none-any.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m38.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting datasets
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m519.6/519.6 kB[0m [31m41.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting zstandard
  Downloading zstandard-0.21.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.7/2.7 MB[0m [31m74.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m981.5/981.5 kB[0m [31m53.9 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting pynnde

In [2]:
import lzma
from datasets import load_dataset
from itertools import islice
from datasets import interleave_datasets # for interweaving streaming datasets
#from transformers import BertTokenizer, LineByLineTextDataset, DataCollatorForLanguageModeling
from spacy.lang.en import English
import spacy
import re
import random
import numpy as np
import os
import pickle
from langdetect import detect
import copy
from math import prod

In [3]:
def check_is_code(text):
    """Estimates a ratio of special char (that may indicate math/code notation); less than 10% is code for normal text"""
    nchar = min(5000,len(text))
    nchar_after_removespecialchar = len(re.sub(r"[\<\>\_\@\^\=\+\*\$\{\[\]\}\(\)\/\\\.]",'',text[:5000]))
    ratio_specialchar = 1-nchar_after_removespecialchar/nchar
    return ratio_specialchar

def check_language(text, special_char_threshold=0.10):
    """Verifies that a string is: i) English, and ii) not overly mathematical/code"""
    ratio_specialchar = check_is_code(text)
    if ratio_specialchar>=special_char_threshold:
        return False, ratio_specialchar
    try:
        is_eng = detect(text[:200]+" hello")=='en'
        return is_eng, -1
    except:
        return False, -1

if False:
    bad_language = []
    good_language = []
    foo = load_dataset("EleutherAI/the_pile_deduplicated", split='train',streaming=True).shuffle(buffer_size=20000).take(20000)
    for e in foo:
        is_good, ratiospecialchar = check_language(e['text'])
        if not is_good:
            bad_language.append((e['text'], ratiospecialchar))
        else:
            if ratiospecialchar>0.025:
                good_language.append((e['text'], ratiospecialchar))

    print(len(bad_language)); print(len(good_language))
    bad_language = [p for p in bad_language if p[-1]>0]
    bad_language = sorted(zip([score for _,score in bad_language],[w for w,_ in bad_language]))


In [4]:

CHAR_PER_WORD = 6.36
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("sentencizer")
config = {
    'max_seq_length':512,
    'min_seq_length':48,
    'max_chunk_size':6,
    'min_sentence_len':20,
    'seed':42
}


class ExampleProcessor:
    def __init__(
        self,
        config=config,
        char_per_word = CHAR_PER_WORD,
        nlp =nlp,
    ):
        self.nlp = nlp
        self.char_per_word = char_per_word
        self.max_seq_length = config.get('max_seq_length', 512) # maximum word-length for chunks for mlm objective (else split)
        self.min_seq_length = config.get('min_seq_length', 128) # min sequence length for chunks (else discard
        self.max_chunk_size = config.get('max_chunk_size', 5) # maximum number of chunks of text to take (each ~512 in length)
        self.min_sentence_len = config.get('min_sentence_len', 20) # for next-sentence, min sentence size to merge together
        self.seed = config.get('seed', 42)
        self.max_chunk_length = self.max_chunk_size * self.max_seq_length
        self.max_chunk_length_char = int(self.max_chunk_length*self.char_per_word)
        self.min_seq_length_char = int(self.min_seq_length*self.char_per_word)
        self.min_sentence_length_char = int(self.min_sentence_len*self.char_per_word)

    @staticmethod
    def split_into_chunks(text, chunk_char_size, overlapping_size = 50):
        chunks = []
        start = 0
        end = chunk_char_size + overlapping_size
        while start < len(text):
            chunk = text[start:end]
            period_index = chunk.find(". ")
            if period_index != -1:
                chunk = chunk[period_index + 1:]
            else:
                first_space_index = chunk.find(" ")
                if first_space_index != -1:
                    chunk = chunk[first_space_index + 1:]
            # Check if the chunk has been split and contains more than one word
            #if start > 0 and " " in chunk:
            if end < len(text) and " " in chunk and chunk[-1]!=" ":
                last_space_index = chunk.rfind(" ")
                chunk = chunk[:last_space_index]
            chunks.append(chunk)
            start += chunk_char_size
            end += chunk_char_size
        return chunks

    def split_chunk_into_sentences(self, chunk, discard_first_sentence=True, discard_last_sentence=True ):
        doc = self.nlp(chunk)
        MAX_CHAR_LEN = int(self.max_seq_length*self.char_per_word)
        sentences = [sent.text for sent in doc.sents]
        if discard_first_sentence:
            sentences = sentences[1:]
        if discard_last_sentence:
            sentences = sentences[:-1]

        super_list_concatenated = [] # accumulates concatenated sentences
        super_list_raw_sentences = [] # accumulates raw sentences (for next-sentence prediction)
        buffer = []
        buffer_len = 0

        for sentence in sentences:
            sentence_len = len(sentence)

            if buffer_len + sentence_len > MAX_CHAR_LEN:
                super_list_concatenated.append(" ".join(buffer))
                super_list_raw_sentences.extend(buffer)
                buffer = []
                buffer_len = 0

            buffer.append(sentence)
            buffer_len += sentence_len

        if buffer:  # If there are any remaining sentences in the buffer
            super_list_concatenated.append(" ".join(buffer))
            super_list_raw_sentences.extend(buffer)

        return super_list_concatenated, super_list_raw_sentences

    def _sample_chunk_span(self, text, max_chunk_length_char):
        chunks = self.split_into_chunks(text, max_chunk_length_char)
        # randomly sample from the chunks
        #FOOBAR SAMPLE FROM CHUNKS
        return random.choice(chunks)

    def is_too_small_quickcheck(self, text, textlen=None):
        if textlen is None: textlen = len(text.strip())
        return textlen < self.min_seq_length_char*0.9

    def is_too_small(self, nwords):
        return nwords < self.min_seq_length

    def is_larger_than_max_chunk_quickcheck(self, text, textlen):
        """if it is larger than a chunksize, then we need to sample chunks"""
        if textlen is None: textlen = len(text.strip())
        return textlen > self.max_chunk_length_char

    def is_short_than_a_chunk(self, text, textlen):
        """if it is shorter than a chunk, then we'll take all text, in chunks"""
        if textlen is None: textlen = len(text.strip())
        return textlen < self.max_chunk_length_char

    def is_smaller_than_two_paragraphs(self, text):
        charlen = len(text)
        if charlen < (1.5*self.max_seq_length*self.char_per_word):
            return True, re.split(r"[\s\n\r]+",text.strip())
        if charlen > (2.5*self.max_seq_length*self.char_per_word):
            return False, None
        # inbetween cases, split and calculate the number of words
        textsplit = re.split(r"[\s\n\r]+",text.strip())
        nwords = len(textsplit)
        if nwords < 1.2*self.max_seq_length:
            return True, textsplit
        return False, textsplit

    @staticmethod
    def preprocess_sentences(list_of_sentences, min_sentence_char_length):
        """Merges small sentences in a sequence of sentence, until the strings are greater than `min_sentence_char_length`"""
        processed_sentences = []
        buffer = ""

        for sentence in list_of_sentences:
            if len(sentence) < min_sentence_char_length:
                buffer = buffer + " " + sentence
                if (len(buffer)>=min_sentence_char_length):
                    processed_sentences.append(buffer.strip())
                    buffer = ""
            else:
                if (len(buffer)<min_sentence_char_length):
                    to_add = buffer + " " + sentence
                    processed_sentences.append(to_add.strip())
                    buffer = ""
                else:
                    processed_sentences.extend([buffer.strip(), sentence.strip()])

        if buffer:  # If there are any remaining sentences in the buffer
            processed_sentences.append(buffer)

        return processed_sentences

    def process(self, text):
        """Chunks and samples large portions of text"""

        charlen = len(text.strip())

        # DISCARD if it is too small for copus
        if self.is_too_small_quickcheck(text, charlen):

            return {'text':[], 'do_accept':False, 'sentences':[]}

        # sample span of chunks: if it larger than our max chunk size
        if self.is_larger_than_max_chunk_quickcheck(text, charlen):
            text_span_chunks = self._sample_chunk_span(text, self.max_chunk_length_char)
        else:
            text_span_chunks = text

        # check if it smaller, than 1.5 seqlen, then we just accept it all as one unit to truncate later in tokenizer
        is_smaller_than_2_paras, textsplit = self.is_smaller_than_two_paragraphs(text_span_chunks)

        if is_smaller_than_2_paras:

            # check if less than minsize
            if self.is_too_small(len(textsplit)):
                # if too small, return nothing
                return {'text':[], 'do_accept':False, 'sentences':[]}

            # return text to be truncated
            return {'text':[text_span_chunks], 'do_accept':True, 'sentences':[]}

        # leftover cases: text that needs to be chunked into ~512 / max_seq_len
        text_to_return, sentences_to_return = self.split_chunk_into_sentences(text_span_chunks)

        # return text strings as list of chunks, flag
        return {
            'text':text_to_return,
            'do_accept':True,
            'sentences':self.preprocess_sentences(sentences_to_return, self.min_sentence_length_char),
        }

    def __call__(self, text):
        return self.process(text)
if False:
    example_processor = ExampleProcessor(config=data_streaming_config, char_per_word = CHAR_PER_WORD, nlp =nlp)
    text = """As the aircraft approached Pearl Harbor, the weather cleared, as if on cue. This enabled the strike formations to use the battery of searchlights at Kahuku Point as a navigation aid to guide them toward their targets. Dawn was now breaking. As sunlight streamed over the horizon, the airborne strike force pressed home its attack over Pearl Harbor, achieving complete surprise. Dive-bombers and torpedo planes went to work on the ships lying at anchor along Battleship Row, where the U.S. Navy's capital ships were berthed. Fighter aircraft peeled off and strafed the airfield, hitting parked planes, fuel storage tanks, and hangars. Army Air Corps pilots rushed to take off after the attacking force, but by the time they were aloft, the attackers had completed their strikes and vanished. Failing to locate the attackers, the Army aircraft returned to base, whereupon a second wave of carrier strike aircraft hit them. A _New York Times_ reporter on the scene reported that the attacks were "unopposed by the defense, which was caught virtually napping. Surveying the results, the American defenders were filled with anger—and relief. The attack, executed on the morning of Sunday, _February 7, 1932_ , occurred at the outset of a U.S. Army-Navy war game called Grand Joint Exercise 4. Rear Admiral Harry Yarnell, commander of the newly commissioned American aircraft carriers _Saratoga_ and _Lexington_ , had launched the attacking planes. The "bombs" dropped were flour bags, which could be found splattered on the Navy's ships still sitting at anchor. Surveying the results, the American defenders were filled with anger—and relief. The attack, executed on the morning of Sunday, _February 7, 1932_ , occurred at the outset of a U.S. Army-Navy war game called Grand Joint Exercise 4. Rear Admiral Harry Yarnell, commander of the newly commissioned American aircraft carriers _Saratoga_ and _Lexington_ , had launched the attacking planes. The "bombs" dropped were flour bags, which could be found splattered on the Navy's ships still sitting at anchor.Red-faced, the Army Air Corps commanders sought to minimize the attack's results. They argued that the damage incurred to Hickam Field was minimal, and asserted that they had found and attacked Yarnell's carriers. Finally, they protested the attack on legal grounds—it was improper to begin a war on Sunday! The war game's umpires sided with the Army. Their report made no mention of Yarnell's attack but concluded that "it is doubtful if air attacks can be launched against Oahu in the face of strong defensive aviation without subjecting the attacking carriers to the danger of material damage and consequent great loss in the attacking] air force. Nearly ten years later carriers of the Imperial Japanese Navy, attacking Pearl Harbor on Sunday, December 7, 1941, proved that Admiral Yarnell, not the umpires or the Army, had gauged the future correctly. The admiral had been willing to confront uncomfortable possibilities, whereas others had not. Although America was shocked by the Japanese attack, many in the Navy were not. As Admiral Chester W. Nimitz, the architect of the Navy's victorious campaign against Japan, ruefully admitted, "Nothing that happened in the Pacific was strange or unexpected. ## **THE DAWN OF BLITZKRIEG**"""
    text += text
    text += text
    text += text
    text += text
    foo = example_processor(text = text)
    foo,is_good, foo_sentences = foo.values()
    print(is_good)
    print('mlm_sentences')
    print(foo)

    print('next sentences:') # this seems to be working okay
    print(foo_sentences)
    print(len(foo_sentences))


    # works: test the process_sentences
    print(example_processor.preprocess_sentences(["This is fine.","foo",'sh',"This is fine and long.","This is also find and long.",'No', "This is long and good."], 10))

    # works, this returns split sentences
    example_processor.split_chunk_into_sentences(
        chunk="This is the first sentence. This is the 2nd sentence and another. I'm the third sentence. Hello, this is me. 5th sentence here. And finally its me.",
        discard_first_sentence=True, discard_last_sentence=True
    )

In [None]:
### Random (Smark) Negative Generator


In [None]:
import datasets
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import pynndescent
import numpy as np
import os
from math import prod

class NegativeExampleGenerator:
    """Builds a queryable corpus of negative examples using ANN and approximate TFIDF vectors"""
    def __init__(
            self,
            n_reps = 1,
            n_takes = 5000,
            #dataset_name = 'cerebras/SlimPajama-627B',
            tfidf_nfeatures = 3000,
            nchar_max_paragraph=3000,
            nword_max=100,
            nchar_max_word=4,
            max_sent_total = 5,
            corpus = None,
            save_cache = '/tmp/negative_corpus_cache.pkl'
    ):
        self.stopwords = [
            'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself',
            'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its',
            'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom',
            'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
            'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but',
            'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against',
            'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up',
            'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
            'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
            'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 'can', 'will',
            'just', 'don', 'should', 'now'
        ]
        self.n_reps = n_reps
        self.n_takes = n_takes
        self.tfidf_nfeatures = tfidf_nfeatures
        self.nchar_max_paragraph = nchar_max_paragraph
        self.nword_max = nword_max
        self.nchar_max_word = nchar_max_word
        self.max_sent_total = max_sent_total
        self.save_cache = save_cache
        if corpus is None:
            # fetch the corpus from streaming data (or reload from cache if available)
            print('warning: `corpus` is empty. Generating default corpus from RedPajama')
            self.corpus_static = self.fetch_default_corpus(self.save_cache)
        else:
            assert isinstance(corpus,list)
            assert len(corpus)>0
            print('using predefined corpus of length: %s' % len(corpus))
            self.corpus_static = corpus

        # build an ann index
        self.build_ann_index(self.corpus_static)

    def fetch_default_corpus(self, cache_file):
        """fetches streaming corpus and converts to a static list of data"""
        corpus_static = []
        if os.path.isfile(cache_file):
            print('reloading negative corpus %s for NegativeExampleGenerator' % cache_file)
            with open(cache_file, 'rb') as pcon:
                corpus_static = pickle.load(pcon)
                self.n_reps = pickle.load(pcon)
                self.n_takes = pickle.load(pcon)
                self.tfidf_nfeatures = pickle.load(pcon)
                self.nchar_max_paragraph = pickle.load(pcon)
                self.nword_max = pickle.load(pcon)
                self.nchar_max_word = pickle.load(pcon)
                self.max_sent_total = pickle.load(pcon)
        else:
            print('fetching streaming corpus for negatives (RedPajama)(%s reps)' % self.n_reps)
            # first do random draws from the corpora
            redpajama_set_name_support = ["RedPajamaCommonCrawl", "RedPajamaC4", "RedPajamaStackExchange", "RedPajamaWikipedia","RedPajamaBook", "RedPajamaArxiv"]
            for i_rep in range(self.n_reps):
                # load the streaming datasets (RedPajama)
                corpus_streaming = datasets.load_dataset(
                    'cerebras/SlimPajama-627B',
                    split="train",
                    streaming=True
                ).shuffle(
                    buffer_size = self.n_takes
                ).filter(
                    lambda x : x['meta']['redpajama_set_name'] in redpajama_set_name_support
                ).take(
                    self.n_takes
                ).remove_columns('meta')
                # convert streaming data to static and check language
                this_corpus_static = [
                    e['text'] for e in corpus_streaming #if langdetect(e['text'][:200]+' hello')=='en'
                    if check_language(e['text'])[0]
                ]
                # take only a few sentences per text
                this_corpus_static = [
                    self.limit_text_to_k_sentences(s, k=self.max_sent_total) for s in this_corpus_static
                ]
                # filtering again non-english
                this_corpus_static = [
                    s for s in this_corpus_static
                    if check_language(s)[0]
                ]
                # add
                corpus_static += this_corpus_static
                if (i_rep % 5)==0:
                    print('size of negative corpus: %d' % len(corpus_static))

            print('finished collecting streaming examples for negative corpus. Saving to %s' % self.save_cache)
            # save the cache
            with open(self.save_cache, 'wb') as pcon:
                pickle.dump(corpus_static, pcon)
                pickle.dump(self.n_reps, pcon)
                pickle.dump(self.n_takes, pcon)
                pickle.dump(self.tfidf_nfeatures, pcon)
                pickle.dump(self.nchar_max_paragraph, pcon)
                pickle.dump(self.nword_max, pcon)
                pickle.dump(self.nchar_max_word, pcon)
                picke.dump(self.max_sent_total, pcon)

        return corpus_static

    def build_ann_index(self, corpus):
        """vectorizes a corpus and builds an ann index"""
        # stem words in preparation for tfidf vectorizer
        corpus_processed = [
            self.preprocess_text_to_index(s) for s in corpus
        ]
        # convert the corpus into tfidfvectors
        self.tfidfvectorizer = TfidfVectorizer(max_features=self.tfidf_nfeatures)
        self.tfidfvectorizer.fit(corpus_processed)
        self.corpus_vectors = self.tfidfvectorizer.transform(corpus_processed)

        # build the ann index
        self.ann_index = pynndescent.NNDescent(self.corpus_vectors)
        print('finished building the ANN index')

    @staticmethod
    def limit_text_to_k_sentences(text, k=5):
        """splits text into sentences, then limits the paragraph to just `k` sentences"""
        if len(text)<400:
            return text
        text = text[:10000]
        sentences = [s for s in re.split(r"(?<=\w\w\.)\s+",text) if len(s)>1]
        n_sent = len(sentences)
        if n_sent<=k:
            return text
        # if larger than limit, pick a (pseudo)random set of sentences
        random_sent_start_max_offset = n_sent-k
        random_sent_start_offset = ord(sentences[-1][:10][-1]) % random_sent_start_max_offset
        random_sent_end_offset = random_sent_start_offset + k
        return " ".join(sentences[random_sent_start_offset:random_sent_end_offset])

    def preprocess_text_to_index(self, text):
        """converts text into small k-character word stems before passing to TFIDF"""
        ptext = text[:self.nchar_max_paragraph].lower()
        ptext = ' '.join([
            w[:self.nchar_max_word] for w in ptext.split(' ')[:self.nword_max]
            if (w not in self.stopwords)
        ])
        ptext = re.sub("\W+",' ',ptext).strip()
        return ptext

    def process_query(self,text):
        """Vectorizes query text for retrieval"""
        query_processed = self.preprocess_text_to_index(text)
        return self.tfidfvectorizer.transform([query_processed]), query_processed

    def find_negative(self, query_text, k=1, skip=1):
        """Finds similar text to the query text, skipping the first `skip` and returning `k` top matches"""
        query_vector, query_processed = self.process_query(query_text)
        ann_idx,scores = self.ann_index.query(query_vector, k = k+skip)
        retrieved_text = [
            self.corpus_static[i] for i in ann_idx[0][skip:]
        ]
        retrieved_text = [
            s for s in retrieved_text
            if (
                s.lower().replace(" ","")[:100]!=query_text.lower().replace(" ","")[:100]
            )
        ]
        if len(retrieved_text)>0:
            return retrieved_text, scores
        # check that the texts are different
        skip+=1
        return self.find_negative(query_text, k=k, skip=skip)

# Build the Negative Corpus
negative_example_generator= NegativeExampleGenerator(
    n_reps = 1, #
    n_takes = 40000,
    tfidf_nfeatures = 4000,
    nchar_max_paragraph=3000,
    nword_max=100,
    nchar_max_word=4,
    save_cache = 'negative_corpus_cache.pkl'
)



In [None]:
# test query
neg_retrievals,_ = negative_example_generator.find_negative(
    "MIT is an elite education institution based in Boston Massatusetts and is one of the first institutions of higher learning in the USA, dating back to the founding fathers. Recently, it has become embroiled in a series of scandels to do with free speech and allegations of scientific misconduct",
    k=1, skip=1
)
for _ in neg_retrievals: print(_)

كما أنّ ضرب الأسواق المدنية المزدحمة، الذي أسفر عن مقتل ما يقرب من مائة من مواطنيها، من قبل الحكومة أمر غير مقبول في أي ظرف من الظروف،" قال السيّد دي مستورا. هجمات أمس تأتي في أعقاب القصف العشوائي على دمشق الاسبوع الماضي من قبل جماعات المعارضة المسلحة وقطع إمدادات المياه، وجميعها تدابير تؤثر على المدنيين وهو أمر غير مقبول. هذه الهجمات الاخيرة هي مثال آخر على وحشية الصراع الدائر. "يجب السماح بوصول المساعدات الإنسانية دون قيد أو شرط ويجب أن يُتوقّف القتل.


#### A Sample of 1000 will have...
... approximately 1523 samples of 512-long examples

In [5]:
# FUNCTIONS TO MAKE THE TRAINING AND VAL SETs
import numpy as np
import pickle
import os
import pickle

## convert the streaming dataset in a static dataset
def convert_streaming_dataset_to_static_corpus(
    streaming_dataset,
    skip=0,
    take=1000
):
    """Takes a streaming_dataset and converts it into a list of examples"""
    if skip !=0:
        dataset_to_make_static = streaming_dataset.skip(skip).take(take)
    else:
        dataset_to_make_static = streaming_dataset.take(take)

    examples_static_mlm = [] # data for MLM objective
    examples_static_nextsentence = [] # data for next sentence task
    for i, example in enumerate(dataset_to_make_static):
        # chunk text into ~512 text-strings, and sentences
        examples_processed = example_processor(text = example['text'])
        # chunk, accept/reject, sentences
        example_parsed, do_accept, parsed_sentences = examples_processed.values()
        if is_do_acceptgood:
            # mlm gets the chunks of text-strings
            examples_static_mlm.extend(example_parsed)
            if len(parsed_sentences)>15:
                # sentences for next sentence prediction: make triplet of s1,s2,opposite, where opposites get label=1
                examples_static_nextsentence.extend(
                    convert_sequence_into_nextsentence_pairs(parsed_sentences)
                )
                #FOOFU - STOPPED HERE TO FIGURE OUT WHY MY NEXT-SENTENCE STUFF IS SO LONG
        if (i+1)%100==0:
            print("...streaming size: " % len(examples_static_mlm))

    return examples_static_mlm, examples_static_nextsentence

def convert_sequence_into_nextsentence_pairs(list_of_sentences):
    """Converts a list of sentences into a list of dicts, with next-sentence pairs"""
    n = len(list_of_sentences)

    def opposite(i,n):
        return (i + round(n/2+1)) % n

    list_of_nextsentence_pairs = []
    # loop through sequence, make triplet of anchor1+anchor2, next and an opposite
    #for o1a, o1b, o2 in zip(range(0,n-2), range(1,n-1), range(2,n)):
    for o1a, o1b, o1c, o2 in zip(range(0,n-3), range(1,n-2), range(2,n-1), range(3,n)):
        # anchor text is three sentences
        s_anchor = list_of_sentences[o1a] + " " + list_of_sentences[o1b] + " " +  list_of_sentences[o1c]
        # target is the fourth (next-sentence)
        s_next = list_of_sentences[o2]
        s_opposite = list_of_sentences[opposite(o1b,n)]
        list_of_nextsentence_pairs.append(
            {
                "anchor":s_anchor,
                "next":s_next,
                "opposite":s_opposite
            }
        )
    return list_of_nextsentence_pairs

print(convert_sequence_into_nextsentence_pairs(['a','b','c','d','e','f']))




[{'anchor': 'a b c', 'next': 'd', 'opposite': 'f'}, {'anchor': 'b c d', 'next': 'e', 'opposite': 'a'}, {'anchor': 'c d e', 'next': 'f', 'opposite': 'b'}]


In [6]:
TEXTSEPARATOR = "%0XTEXTXEPARAT0RX%0"

def chunk_docs_into_chunks_and_sentences(
    list_of_strings,
    nlp=None,
    config_chunking=None,
    seed = 42,
    fieldname='text',
    min_number_of_sentence_for_nextsentence_prediction = 15
):
    """Splits long docs into chunks that do next exceet max_seq_len, as well as sentences for next-sentence-prediction """
    if nlp is None:
        nlp = spacy.load("en_core_web_sm")
        nlp.add_pipe("sentencizer")

    if config_chunking is None:
        config_chunking = {
            'max_seq_length':512,
            'min_seq_length':48,
            'max_chunk_size':6,
            'min_sentence_len':20,
            'seed':seed
        }
    else:
        config_chunking.update({'seed':seed})

    # initialize the example processor
    example_processor = ExampleProcessor(
        config=config_chunking, char_per_word = CHAR_PER_WORD, nlp =nlp
    )

    examples_static_chunks = [] # data for MLM objective
    examples_static_nextsentence = [] # data for next sentence task
    for i, example in enumerate(list_of_strings):
        # chunk text into ~512 text-strings, and sentences
        if isinstance(example,str):
            examples_processed = example_processor(text = example)
        elif isinstance(example,dict):
            examples_processed = example_processor(text = example[fieldname])
        # chunk, accept/reject, sentences
        example_parsed, do_accept, parsed_sentences = examples_processed.values()
        if do_accept:
            # mlm gets the text-strings chunked to size 512
            examples_static_chunks.extend(example_parsed)
            if len(parsed_sentences)> min_number_of_sentence_for_nextsentence_prediction: #4:
                # sentences for next sentence prediction: make triplet of s1,s2,opposite, where opposites get label=1
                examples_static_nextsentence.extend(
                    convert_sequence_into_nextsentence_pairs(parsed_sentences)
                )

    return examples_static_chunks, examples_static_nextsentence

def nwords_quick(text):
    return len([w for w in text.split(" ") if len(w)>0])

def flatten(list_of_lists):
    return [subl for l in list_of_lists for subl in l]

def initialize_and_get_mlm_streaming_datasets(
    data_streaming_config,
    streaming_cleaning_functions,
    start_proportion = None,
    epoch=0,
    seed=42,
    path_to_val_cache = 'cache_val_mlm.pkl',
    path_to_train_cache_epoch = 'cache_train_mlm_%03g.pkl',
    do_check_english = True
):
    """Converts stream of unlabelled text data into static datasets for: MLM task and next-sentence-prediction task"""
    # list of files to stream
    files = data_streaming_config['files']
    # number of examples to take from stream for validation set
    val_size = data_streaming_config['val_size']
    # number of examples to take from stream for training set
    train_chunk_size = data_streaming_config['train_chunk_size']
    min_seq_len = data_streaming_config['min_seq_length']
    # normalization constant for normalizing the weights into probabilities
    probability_normalization_const = sum([x[2] for x in files])

    # where to initialize start-stream for training data
    if start_proportion is None:
        start_proportion = np.random.RandomState(seed+epoch).uniform()*0.99

    # reload cached files
    path_to_train_cache = None if not '%03g' in path_to_train_cache_epoch else path_to_train_cache_epoch % epoch
    do_make_valset = not os.path.isfile(path_to_val_cache)
    do_make_trainset = not os.path.isfile(path_to_train_cache)
    if not do_make_valset:
        print('RELOADING VAL-MLM SET: iter=%s' % path_to_val_cache)
        with open(path_to_val_cache,'rb') as pcon:
            datalist_val_mlm_static = pickle.load(pcon)
            datalist_val_sentences_static = pickle.load(pcon)
            epoch = pickle.load(pcon)
            log_source_val = pickle.load(pcon)
        print('VAL-MLM SET SIZE: %d' % len(datalist_val_mlm_static))
    else:
        datalist_val_mlm_static, datalist_val_sentences_static, log_source_val = [],[],{}
    if not do_make_trainset:
        print('RELOADING VAL-QA SET: iter=%s' % path_to_val_cache)
        with open(path_to_train_cache,'rb') as pcon:
            datalist_train_mlm_static = pickle.load(pcon)
            datalist_train_sentences_static = pickle.load(pcon)
            epoch = pickle.load(pcon)
            log_source_train = pickle.load(pcon)
        print('TRAIN-MLM EPOCH-%d SET SIZE: %d' % (epoch, len(datalist_train_mlm_static)))
    else:
        datalist_train_mlm_static, datalist_train_sentences_static,log_source_train = [],[],{}

    if (do_make_trainset or do_make_valset):

        # initialize the nlp-sentencizer for chunking
        nlp = spacy.load("en_core_web_sm")
        nlp.add_pipe("sentencizer")

        # loop through datasets
        for (mlm_nm, set_nm, prob, dataset_size, special_handling, partition_shuffle, threshold_specialchar), dataset_key in zip(
            files, streaming_cleaning_functions.keys()
        ):
            if prob ==0:
                continue
            prob /= probability_normalization_const

            # get cleaning & filter functions for streaming data functionality
            clean_func, filter_func, removefeature_names = streaming_cleaning_functions[dataset_key]

            # set arguments for the load_dataset (huggingface repos)
            load_dataset_args = {
                'path':mlm_nm, 'name':set_nm, 'split':'train', 'streaming':True
            }
            # for other non-huggingface repos, path needs to be a "builder"
            if mlm_nm.endswith('.jsonl') or mlm_nm.endswith('.jsonl.zip') or mlm_nm.endswith('.jsonl.zst'):
                load_dataset_args.update({'path':'json','data_files':mlm_nm})

            # special proecssing of datasets with multiple partitions
            if bool(partition_shuffle): # or str(epoch)=='val':

                n_files, n_per_file = partition_shuffle
                dataset_size = n_per_file
                print('trying %s initialization (shuffling through %d files)' % (mlm_nm, n_files))

                # whether there is a filter
                if filter_func is None:
                    dset_stream = load_dataset(**load_dataset_args)
                else:
                    dset_stream = load_dataset(**load_dataset_args).filter(filter_func)

                # validation set
                if do_make_valset:
                    # take from stream
                    n_valset_take = max(int(prob*val_size), 1)
                    print('take %d from %s validation'% (n_valset_take, mlm_nm))
                    dset_stream_val = dset_stream.take(n_valset_take).map(clean_func).remove_columns(removefeature_names)
                    # convert stream to a static set (and check english language)
                    dset_static_val_thisset =[
                        e['text'] for e in dset_stream_val
                        if bool(re.search(r"\w+",e['text'][:200])) and (nwords_quick(e['text'][:10000])>min_seq_len)
                    ]
                # training set
                if do_make_trainset:
                    # randomly skip a bunch from this set
                    skip_to_start = int(start_proportion*n_per_file)
                    take_from_this_set = max(int(round(train_chunk_size*prob)),1)
                    print('take %d from %s training'% (take_from_this_set, mlm_nm))
                    # shuffle: take a random data partition (from the dataset's list of files)
                    dset_stream_train = dset_stream.shuffle(
                        seed = seed+epoch, buffer_size = skip_to_start+take_from_this_set,
                    )
                    dset_stream_train = dset_stream_train.skip(
                        skip_to_start # random skip through dataset to new start position
                    ).take(
                        take_from_this_set # take this amount for the training ste
                    ).map(clean_func).remove_columns(removefeature_names)
                    # convert training to static dataset
                    dset_static_train_thisset =[
                        e['text'] for e in dset_stream_train
                        if bool(re.search(r"\w+",e['text'][:200])) and (nwords_quick(e['text'][:10000])>min_seq_len)
                    ]
            else:
                # regular streaming
                print('trying %s initialization' % mlm_nm)
                # whether there is a filter
                if filter_func is None:
                    dset_stream = load_dataset(**load_dataset_args).map(clean_func).remove_columns(removefeature_names)
                else:
                    dset_stream = load_dataset(**load_dataset_args).filter(filter_func).map(clean_func).remove_columns(removefeature_names)
                # take from stream
                n_valset_take = max(int(prob*val_size), 1) # size of valset
                print('take %d from %s validation'% (n_valset_take, mlm_nm))
                skip_to_start = int(start_proportion*(dataset_size-n_valset_take)) # random point to skip to
                n_train_take = max(int(round(train_chunk_size*prob)),1) # size of train set
                print('take %d from %s train'% (n_train_take, mlm_nm))
                if do_make_valset:
                    dset_stream_val = dset_stream.take(n_valset_take)
                    # checking for: existence of any words and ii) size of sequence meets minimum criteria
                    dset_static_val_thisset = [
                        e['text'] for e in dset_stream_val
                        if bool(re.search(r"\w+",e['text'][:200])) and (nwords_quick(e['text'][:10000])>min_seq_len)
                    ]
                if do_make_trainset:
                    dset_stream_train = dset_stream.skip(n_valset_take+skip_to_start).take(n_train_take)
                    # checking for: existence of any words and ii) size of sequence meets minimum criteria
                    dset_static_train_thisset = [
                        e['text'] for e in dset_stream_train
                        if bool(re.search(r"\w+",e['text'][:200])) and (nwords_quick(e['text'][:10000])>min_seq_len)
                    ]
            print('Done getting streams/reloading from %s' % mlm_nm)
            # check language, chunk sentences
            if do_make_valset:
                # discard non-english
                dset_static_val_thisset =[
                    e for e in dset_static_val_thisset
                    if check_language(e, threshold_specialchar)[0]
                ]
                print('done val language check')
                # split multi-answers that I want made into separate texts
                dset_static_val_thisset = [
                    e for e in dset_static_val_thisset
                    if TEXTSEPARATOR not in e
                ] + flatten([
                    e.split(TEXTSEPARATOR) for e in dset_static_val_thisset
                    if TEXTSEPARATOR in e
                ])
                # chunk the docs (512-tokens and next-sentence prediction sentences)
                dset_val_chunked_for_mlm, dset_val_nextsentence = chunk_docs_into_chunks_and_sentences(
                    list_of_strings=dset_static_val_thisset,
                    config_chunking=copy.deepcopy(data_streaming_config),
                    seed=seed+epoch,
                    nlp=nlp
                )
                print('done val longtext chunking')
                # add to val set
                datalist_val_mlm_static.extend(dset_val_chunked_for_mlm)
                datalist_val_sentences_static.extend(dset_val_nextsentence)
                # log the sources of text
                log_source_val[dataset_key] = len(dset_val_chunked_for_mlm)

            # check language, chunk sentences
            if do_make_trainset:
                # discard non-english
                dset_static_train_thisset =[
                    e for e in dset_static_train_thisset
                    if check_language(e, threshold_specialchar)[0]
                ]
                print('done train language check')
                # split multi-answers that I want made into separate texts
                dset_static_val_thisset = [
                    e for e in dset_static_train_thisset
                    if TEXTSEPARATOR not in e
                ] + flatten([
                    e.split(TEXTSEPARATOR) for e in dset_static_train_thisset
                    if TEXTSEPARATOR in e
                ])
                # chunk the docs (512-tokens and next-sentence prediction sentences)
                dset_train_chunked_for_mlm, dset_train_nextsentence = chunk_docs_into_chunks_and_sentences(
                    list_of_strings=dset_static_train_thisset,
                    config_chunking=copy.deepcopy(data_streaming_config),
                    seed=seed+epoch,
                    nlp=nlp
                )
                print('done trains longtext chunking')

                # ensure that none of the examples in the traning set are in the validation set
                if do_make_valset:
                    dset_train_chunked_for_mlm = [
                        s for s in dset_train_chunked_for_mlm
                        if s not in dset_val_chunked_for_mlm
                    ]
                    dset_train_nextsentence = [
                        tlt for tlt in dset_train_nextsentence
                        if (
                            tlt['anchor'] not in [
                                vtlt['anchor'] for vtlt in dset_val_nextsentence
                            ]
                        )
                    ]

                # add to training set
                datalist_train_mlm_static.extend(dset_train_chunked_for_mlm)
                datalist_train_sentences_static.extend(dset_train_nextsentence)
                # log the sources of text
                log_source_train[dataset_key] = len(dset_train_chunked_for_mlm)

        print('Done collecting streaming data')

    if do_make_valset:
        print('saving streamed validation data: %s' % path_to_val_cache)
        with open(path_to_val_cache,'wb') as pcon:
            pickle.dump(datalist_val_mlm_static, pcon)
            pickle.dump(datalist_val_sentences_static, pcon)
            pickle.dump(epoch,pcon)
            pickle.dump(log_source_val, pcon)
    if do_make_trainset:
        print('saving streamed training for epoch %d: %s' % (epoch, path_to_train_cache))
        with open(path_to_train_cache,'wb') as pcon:
            pickle.dump(datalist_train_mlm_static, pcon)
            pickle.dump(datalist_train_sentences_static, pcon)
            pickle.dump(epoch,pcon)
            pickle.dump(log_source_train,pcon)
    return {
        'train':{
            'mlm':datalist_train_mlm_static,
            'nextsentence':datalist_train_sentences_static
        },
        'val':{
            'mlm':datalist_val_mlm_static,
            'nextsentence':datalist_val_sentences_static
        },
        'epoch':epoch,
        'index_stream':start_proportion,
        'log_source':{'train':log_source_train, 'val':log_source_val}
    }

In [7]:
DEBATESUM_EXTREMIST_FILTER_OUT1 = ['13th Aff - DDI 2020 AT.html5', '13th Amendment Case Neg - DDI 2020 GG.html5', '13th Amendment Neg - DDI 2020 FS.html5', '13th Neg - DDI 2020 AT.html5', '13th Neg - DDI 2020 KM.html5', '1ac Airports - DDI 2015 KQ.html5', '1ac Borders - DDI 2015 KQ.html5', '1ac critical financial surveillance pedagogy - DDI 2015 KS.html5', '2 for 1 DA - JDI 2017.html5', '2020 DA - Berkeley 2019.html5', '2020 DA - MSDI 2020.html5', '2020 Election DA - JDI 2020.html5', '2020 Election-Starter - Georgetown 2020.html5', '2020 Elections DA - Michigan7 2019 CCPW.html5', '2nd Session Packet - WTO topic - TDI 2021.html5', 'A Door Into the Ocean Affirmative - HSS 2014.html5', 'A Door Into the Ocean Negative - HSS 2014.html5', 'AIIB Aff Supplement - Michigan7 2016.html5', 'AIIB Aff Wave 1 - Michigan7 2016.html5', 'AIIB Aff-Neg - JDI 2016.html5', 'AIIB Neg Starter - Michigan7 2016.html5', 'AIIB Neg Updates - MNDI 2016.html5', 'ALPR Negative - Michigan7 2015.html5', 'ANTS Aff - Michigan7 2019 HKMM.html5', 'ASAT Aff-Neg - Rohan - Wake 2016 RKS.html5', 'AT - Advantage CPs Updates - Michigan 7 2022 CPWW.html5', 'AT - Afropessism K - Michigan 7 2022 BFHR.html5', 'AT - Baudrillard K - Michigan 7 2022 K LAB.html5', 'AT - Cap K - Michigan 7 2022 CPWW.html5', 'AT - Cap K - UTNIF 2022.html5', 'AT - Cybernetics K - Starter - Michigan 7 2022.html5', 'AT - DOD Tradeoff DA Starter - Michigan 7 2022.html5', 'AT - Dept of State CP - SDI 2022.html5', 'AT - Empire K - Michigan Classic 2022 BBE.html5', 'AT - Fem IR K - CNDI 2022.html5', 'AT - Fem IR K - Starter - Michigan 7 2022.html5', 'AT - Fund DOD CP - Michigan 7 2022 BEJJ.html5', 'AT - IR Ks - Analytic Eclecticism - Michigan Classic 2022 MMP.html5', 'AT - Imperialism K Supplement - CNDI 2022.html5', 'AT - Leahy Law CP - Michigan 7 2022 FMPS.html5', 'AT - Midterms GOP Good DA - Michigan 7 2022 CPWW.html5', 'AT - Militarism K - CNDI 2022.html5', 'AT - Militarism K - Emory 2022.html5', 'AT - Orientalism K Updates - Michigan 7 2022 CPWW.html5', 'AT - Primacy DA - UTNIF 2022.html5', 'AT - Queer IR K - Michigan 7 2022 CPWW.html5', 'AT - Queer IR K - Michigan Classic 2022 BBE.html5', 'AT - Racial IR K - Michigan 7 2022 FMPS.html5', 'AT - Russian Relations DAs - Michigan 7 2022 CPWW.html5', 'AT - Security K - MSDI 2022.html5', 'AT - Security K - Michigan Classic 2022 BBE.html5', 'AT - Settler Colonialism K - Michigan 7 2022 BFHR.html5', 'AT - Settler Colonialism K - Michigan 7 2022 K LAB.html5', 'AT - Turkey PIC Addendum - UTNIF 2022.html5', 'AT - War Powers Act CP - Michigan 7 2022 FMPS.html5', 'AT Antiblackness Survival Strategies and Word PICs - Ruth - Wake 2016 RKS.html5', 'AT Baudrillard - Wake 2019.html5', 'AT Framework - Northwestern 2015 Sophomores .html5', 'AT Framing Contentions - DDI 2020 GG.html5', 'AT Kritik - Northwestern 2015 6WS.html5', 'AT Queer Terror K - Northwestern 2015.html5', 'AT Rights Malthus - Michigan7 2014 GRAMS.html5', 'AT Third Space - Katie - Wake 2016 RKS Seniors.html5', 'ATC Politics Updates - UTNIF 2017.html5', 'Ableism - Michigan7 2015.html5', 'Ableism K - Michigan7 2021 BFPSW.html5', 'Ableism K v. K Affirmatives - Northwestern 2015.html5', 'Abolish ICE Neg - DDI 2018 KM.html5', 'Abolish ICE aff - Gonzaga 2020 LB.html5', 'Abolish ICE neg - Gonzaga 2020 LB.html5', 'Abolish Policing 1.0 - Gonzaga 2020 LO.html5', 'AbolishICE Aff and Neg Updates 1 - SDI 2020.html5', 'AbolishICE Aff and Neg Updates 2 - SDI 2020.html5', 'AbolishICE Negative - SDI 2020.html5', 'Abolition Aff - Georgetown 2020.html5', 'Abolition Aff Neg Starter - UTNIF 2020.html5', 'Abolition K - Berkeley 2020 Starter Pack.html5', 'Abolition K - JDI 2020.html5', 'Abolition K - Northwestern 2020 BW.html5', 'Abolition K - Session 2 - UTNIF 2017.html5', 'Abolition K - UTNIF 2020.html5', 'Abolition K Starter - Georgetown 2020.html5', 'Abolition K Supplement - Berkeley 2020 Wave 2.html5', 'Abolition K and Aff Answers - Gonzaga 2020 MM.html5', 'Abolition Neg - Georgetown 2020.html5', 'Abolitionist Pedagogy Aff - Berkeley 2017.html5', 'Abolitionist Pedagogy Neg - Berkeley 2017.html5', 'Academic Achievement Core - Wake 2017.html5', 'Academy K - Michigan7 2021 K Lab.html5', 'Academy K - Wake 2019.html5', 'Activism File - UTNIF 2018.html5', 'Addendum - DA - Progressive Opposition DA - Michigan7 2020 BFHPR.html5', 'Addendum - DNA Aff - MichiganClassic 2020 LOSVW.html5', 'Advantage Answers File - UNT 2017.html5', 'Advantage CP - Berkeley 2017.html5', 'Advantage CP - DDI 2018.html5', 'Advantage CP Core - MichiganClassic 2016.html5', 'Advantage CP Core - Northwestern 2015 6WS.html5', 'Advantage CP Toolbox - Northwestern 2014.html5', 'Advantage CPs  - Michigan7 2017 OW.html5', 'Advantage CPs - Anthony - Wake 2016 RKS.html5', 'Advantage CPs - Berkeley 2019.html5', 'Advantage CPs - HSS 2015.html5', 'Advantage CPs - HSS 2017.html5', 'Advantage CPs - Michigan7 2013 PCFJV.html5', 'Advantage CPs - Michigan7 2015.html5', 'Advantage CPs - Michigan7 2016.html5', 'Advantage CPs - Michigan7 2021 BFHPR.html5', 'Advantage CPs - MichiganClassic 2021 BMZ.html5', 'Advantage CPs - Northwestern 2018.html5', 'Advantage CPs - SDI 2021 Scholars.html5', 'Advantage CPs - Wake 2017.html5', 'Advantage CPs Core Supplement - MichiganClassic 2017 OW.html5', 'Aegis Neg - DDI 2019 LO.html5', 'Aerial Surveillance Affirmative - Michigan7 2015.html5', 'Aerial Surveillance Affirmative - Northwestern 2015.html5', 'Aerial Surveillance Negative - Northwestern 2015.html5', 'Aff - AI Clarity - Michigan Classic 2022 CGNO.html5', 'Aff - AI Ethics - Starter - Michigan 7 2022.html5', 'Aff - AI Ethics 2 - MNDI 2022 PHA.html5', 'Aff - AI LAWs - MSDI 2022.html5', 'Aff - AI Logistics - CNDI 2022.html5', 'Aff - AI Subs - Michigan 7 2022 BFHR.html5', 'Aff - AI TEVV - MNDI 2022 PHA.html5', 'Aff - Abolish ICE - MichiganClassic 2020 ACV.html5', 'Aff - Ban OCOs - MSDI 2022.html5', 'Aff - Becoming War - Michigan 7 2022 K LAB.html5', 'Aff - Black Disability - Michigan 7 2022 K LAB.html5', 'Aff - Collateral Consequences - Michigan7 2020 BFHPR.html5', 'Aff - Corporate Crime - Michigan7 2020 BFHPR.html5', 'Aff - Cyber Article 5 - Northwestern 2022.html5', 'Aff - Cyber Space Assets 2 - Michigan 7 2022 BFHR.html5', 'Aff - Cybersecurity - NAUDL 2022.html5', 'Aff - Cyborg Writing - Michigan 7 2022 BFHR.html5', 'Aff - Digital Cyclops - Michigan 7 2022 BFHR.html5', 'Aff - Digital Cyclops 2 - Michigan 7 2022 BFHR.html5', 'Aff - Disease - UTNIF 2022.html5', 'Aff - Disinformation - CNDI 2022.html5', 'Aff - Fem IR - Michigan 7 2022 K LAB.html5', 'Aff - Gendered LAWs - Michigan 7 2022 FMPS.html5', 'Aff - Guantanamo - Michigan7 2020 EHJPS.html5', 'Aff - Imperialism 1AC - Michigan 7 2022 K LAB.html5', 'Aff - Information Warfare - Michigan 7 2022 BEJJ.html5', 'Aff - Information Warfare Addendum - Michigan Classic 2022 HJV.html5', 'Aff - Intellectual Property - Michigan 7 2022 FMPS.html5', 'Aff - K Answer Updates - Michigan7 2020 BFHPR.html5', 'Aff - Marijuana Supplement 2 - Michigan7 2020 BFHPR.html5', 'Aff - Neg - Indeterminate Sentencing - Michigan7 2020 EHJPS.html5', 'Aff - Neg - Mandatory Minimum Sentencing - Michigan7 2020 HKMM.html5', 'Aff - Neg DNA Database Reform - Michigan7 2020 Starter Pack.html5', 'Aff - Neg Death Penalty - Michigan7 2020 Starter Pack.html5', 'Aff - Neg Death Penalty Supplement - MichiganClassic 2020 MMP.html5', 'Aff - Neg Juvenile Justice - Michigan7 2020 CCPTW.html5', 'Aff - Neg Police Militarization Supplement - MichiganClassic 2020 MMP.html5', 'Aff - Neg Sex Workers - MichiganClassic 2020 LOSVW.html5', 'Aff - Neg White Collar Crime Grading - MichiganClassic 2020 LOSVW.html5', 'Aff - OCO Info Sharing - Georgetown 2022.html5', 'Aff - OCOs - Starter - Michigan 7 2022.html5', 'Aff - PRISM - Michigan 7 2022 CPWW.html5', 'Aff - Police Militarization 1ac - Michigan7 2020 Starter Pack.html5', 'Aff - Policing - Michigan7 2020 EHJPS.html5', 'Aff - Policing 2 - Michigan7 2020 EHJPS.html5', 'Aff - Rememory - Michigan 7 2022 BFHR.html5', 'Aff - Sett Col - Michigan7 2020 K Lab.html5', 'Aff - Techno Orientalism - Michigan 7 2022 BFHR.html5', 'Aff - War on Drugs - Michigan7 2020 BFHPR.html5', 'Aff - Warren Terror - Michigan 7 2022 K LAB.html5', 'Aff - Warren Terror Supplement - Michigan 7 2022 K LAB.html5', 'Aff Ans. to Deleuze - Michigan7 2020 K Lab.html5', 'Aff Critique Updates - SDI 2016.html5', 'Aff Economics K - Michigan7 2013.html5', 'Aff K - Deleuze - Michigan7 2020 K Lab.html5', 'Aff K Answers  - MichiganClassic 2019 BFHMRS.html5', 'Aff K Toolbox 2 - Michigan7 2013.html5', 'Aff K of TSA Case Neg - Northwestern 2015 6WS.html5', 'Aff Neg - Environmental Crimes - Michigan7 2020 BFHPR.html5', 'Aff Neolib K 2 - Michigan7 2013.html5', 'Aff Neolib K 3 - Michigan7 2013.html5', 'Aff Schopenhauer K - Michigan7 2013.html5', 'Aff Supplement - SDI 2017 EER.html5', 'Aff Tournament Updates - SDI 2017 PSW.html5', 'Affective Correspondences Aff-Neg - Ruth - Wake 2016 RKS.html5', 'Afghanistan Aff - Michigan7 2016.html5', 'Africa Brain Drain - MichiganClassic 2018 BO.html5', 'Africom Aff - Wake 2019.html5', 'Afro Asia Aff - Michigan7 2016.html5', 'Afro Asia Aff Supplement - Michigan7 2016.html5', 'Afro Pessimism Core - SDI 2015.html5', 'Afro Pessimism Critique - HSS 2015.html5', 'Afro Pessimism-Athanasopolous - Wake 2016 RKS Workshop.html5', 'Afro-Asia Neg - Michigan7 2016.html5', 'Afro-Orientalism Aff - Michigan7 2016.html5', 'Afro-Orientalism Kritik - Wake 2016 RKS K Lab.html5', 'Afro-Orientalism Neg - Michigan7 2016.html5', 'Afro-Pessimism Basic - Wake 2017.html5', 'Afro-Pessimism Core - RKS - Wake 2017.html5', 'Afro-Pessimism K - Michigan7 2016.html5', 'Afro-Pessimism K Answers - Michigan7 2016.html5', 'AfroPessimism - Gonzaga 2014.html5', 'Afrofuturism Aff   Neg - Wake 2017.html5', 'Afrofuturism Aff Supplement - Michigan7 2016.html5', 'Afrofuturism Critique - Michigan7 2015.html5', 'Afrofuturism Neg - DDI 2014 MS.html5', 'Afrofuturism case neg - DDI 2014 SWS.html5', 'Afropessimism - SDI 2018 PSW.html5', 'Afropessimism 2.0 - UTNIF 2020.html5', 'Afropessimism Aff - Wake 2019.html5', 'Afropessimism Aff Neg - Michigan7 2021 K Lab.html5', 'Afropessimism Answers - HSS 2016.html5', 'Afropessimism K - Michigan7 2021 K Lab.html5', 'Afropessimism K - Northwestern 2017.html5', 'Afropessimism Updated - Wake 2016 RKS K Lab.html5', 'Ag DA Supplement - Michigan7 2021 HKMLR.html5', 'Ag Efficiency Aff - Berkeley 2021.html5', 'Ag Efficiency Neg - Berkeley 2021.html5', 'Ag Runoff Aff Neg - UTNIF 2021.html5', 'Ag Runoff Case Neg - MSDI 2021.html5', 'Ag Subsidies Aff - Michigan7 2021 BFHPR.html5', 'Ag Subsidies Case Neg - Michigan7 2021 BFHPR.html5', 'Agamben Aff - UTNIF 2017.html5', 'Agamben Affirmative and Neg - Wake 2018.html5', 'Agamben Affirmative and Negative - Northwestern 2015.html5', 'Agamben Case Neg - UTNIF 2017.html5', 'Agamben Critique 2 - Michigan7 2015.html5', 'Agamben Critique 3 - Michigan7 2015.html5', 'Agamben Critique Answers - Michigan7 2015.html5', 'Agamben K - Master File - UTNIF 2018.html5', 'Agamben K - MichiganClassic 2017 OW.html5', 'Agamben K - Northwestern 2015.html5', 'Agamben Kritik - Michigan7 2018 MMMR.html5', 'Agamben Supplement - MichiganClassic 2015.html5', 'Agenda DA - JDI 2020.html5', 'Agenda Links Iran Sanctions DA - Michigan7 2016.html5', 'Agenda Politics - Northwestern 2014.html5', 'Agenda Politics - Wake 2017.html5', 'Agent CPs - Northwestern 2015 6WS.html5', 'Algae Counterplan - UNT 2013.html5', 'Algal Biofuels Affirmative - HSS 2014.html5', 'Algal Biofuels Negative - HSS 2014.html5', 'Alien Futurity Negative - UTNIF 2018.html5', 'All American Aff - Wake 2019.html5', 'Alliance DA Supplement - Berkeley 2019.html5', 'Alliance Impact Core - Michigan7 2019 HKMM.html5', 'Allied Prolif DA - MichiganClassic 2016.html5', 'Allies DA - Michigan7 2019 BFHR.html5', 'Amendment CP - Northwestern 2015.html5', 'Amendment CP Updates - Berkeley 2017.html5', 'American Exceptionalism Critique - SDI 2014.html5', 'Anarch-AntiMilitarism Aff - Michigan7 2019 HKMM.html5', 'Anthro Answers - Michigan7 2014.html5', 'Anthro K - Michigan7 2014 BEFJR.html5', 'Anthro K - Northwestern 2014.html5', 'Anthro K Aff Answers - Michigan7 2014 BEFJR.html5', 'Anthropocentrism Aff and Neg - Lauryn - Wake 2016 RKS K Lab.html5', 'Anthropocentrism Critique - Baylor 2014.html5', 'Anthropocentrism Critique - Berkeley 2014.html5', 'Anthropocentrism Critique - Georgetown 2014.html5', 'Anthropocentrism Critique - HSS 2014.html5', 'Anthropocentrism Critique - Samford 2014.html5', 'Anthropocentrism Critique Answers - HSS 2014.html5', 'Anthropocentrism K - Michigan7 2021 HKMLR.html5', 'Anti-Blackness Critique - UTNIF 2015.html5', 'Anti-Blackness K - Wave 1 - Michigan7 2017 AFMMKK.html5', 'Anti-Blackness Supplement - Michigan7 2017 BFHHR.html5', 'Anti-Settler Education K - UTNIF 2017.html5', 'Antiblackness - Michigan7 2018 K Lab.html5', 'Antiblackness Answers Compiled - Wake 2018.html5', 'Antiblackness K - Berkeley 2017.html5', 'Antiblackness K - MichiganClassic 2019 K Lab.html5', 'Antiblackness Kritik - SDI 2019.html5', 'Antiblackness Updates - Berkeley 2017.html5', 'Antiblackness and Pan Answers - Northwestern 2015.html5', 'Antilab Grab Bag - Georgetown 2014.html5', 'Apoc Warming K - Michigan7 2014 BEFJR.html5', 'Apocalyptic Discourse K - Northwestern 2014.html5', 'Appeasement DA  - Michigan7 2019 FFPSV.html5', 'Appeasement DA - Berkeley 2016.html5', 'Appeasement DA - DDI 2016.html5', 'Appeasement DA - Michigan7 2016.html5', 'Appeasement DA - NDCA 2016.html5', 'Appeasement DA 2 - Michigan7 2013.html5', 'Appeasement DA Answers - NDCA 2016.html5', 'Appeasement DA Updates - Michigan7 2016.html5', 'Appeasement Disadvantage - UTNIF 2013.html5', 'Appeasement Disadvantages - Northwestern 2013 4WeekSeniors.html5', 'Aquaculture Affirmative - Georgetown 2014.html5', 'Aquaculture Affirmative - JDI 2014.html5', 'Aquaculture Affirmative - Samford 2014.html5', 'Aquaculture Negative - Samford 2014.html5', 'Aquarius Reef Base Aff and Neg - Michigan7 2014 GRAMS.html5', 'Aquatic Invasive Species Affirmative - Michigan7 2014 GRAMS.html5', 'Arctic Aff Wave 1 - Michigan7 2016.html5', 'Arctic Aff Wave 2 - Michigan7 2016.html5', 'Arctic Aff-Neg - JDI 2016.html5', 'Arctic Coop Neg - Michigan7 2016.html5', 'Arctic Coop Neg 2 - Michigan7 2016.html5', 'Arctic Coop Neg 3 - Michigan7 2016.html5', 'Arctic Mapping Affirmative - Michigan7 2014 GRAMS.html5', 'Arctic Mapping Negative - Michigan7 2014 GRAMS.html5', 'Arctic OCS Affirmative - JDI 2014.html5', 'Arctic OCS Negative - JDI 2014.html5', 'Area Studies K - Michigan7 2013.html5', 'Armed Drones AFF - Wake 2015.html5', 'Arms Control K  - Michigan7 2019 Starter Pack.html5', 'Arms Sales K - Scholars - Gonzaga 2019.html5', 'Arms Sales Updates - Michigan7 2019 FFPSV.html5', 'Art Education K - Supplement - Michigan7 2017 AFMMKK.html5', 'Artic Aff - DDI 2014 SWS.html5', 'Artic Coop neg - DDI 2014 MS.html5', 'Artic Ports Neg - DDI 2014 TW.html5', 'Asia as method k - DDI 2016 CT.html5', 'Asian Masterfile - Wake 2018.html5', 'Assemblage Negative - DDI 2015 ST.html5', 'Assessments Neg - DDI 2017 ST.html5', 'Assurances DA Answers - MSDI 2016.html5', 'Asylum Aff - DDI 2018 KM.html5', 'Asylum Aff - Gonzaga 2018 Scholars.html5', 'Asylum Aff 2 - UNT 2018.html5', 'Asylum Case Neg - DDI 2018 AT.html5', 'Asylum Neg - Gonzaga 2018 Scholars.html5', 'Auctions CP - Northwestern 2018.html5', 'Autonomous Education K - UTNIF 2017.html5', 'BIE Neg Wave 1 - DDI 2017 ST.html5', 'BIT Aff - DDI 2016 HS.html5', 'BIT Aff - Michigan7 2016.html5', 'BIT Aff-Neg - Northwestern 2016.html5', 'BIT Affirmative - Additional 2AC Blocks - SDI 2016.html5', 'BIT Affirmative - MSDI 2016.html5', 'BIT Neg - Michigan7 2016.html5', 'BIT Negative - Berkeley 2016.html5', 'BLACKLANTIS Aff Neg - Michigan7 2021 K Lab.html5', 'BMD Affirmative - Berkeley 2016.html5', 'BMD Negative - Berkeley 2016.html5', 'Backdoor Negative - DDI 2015 CT.html5', 'Backdoors Affirmative - DDI 2015 SWS.html5', 'Backdoors Negative - DDI 2015 MM.html5', 'Backlash DA and Answers - Gonzaga 2018 DMB.html5', 'Ban Prisons Affirmative and Negative - Northwestern 2015.html5', 'Bank of the South CP - Michigan7 2013 CFJPV.html5', 'Barracudas Supplement - Berkeley 2014.html5', 'Base DA - DDI 2018 AT.html5', 'Base DA - Michigan7 2017 BFHR.html5', 'Base DA - Northwestern 2018.html5', 'Base DA - Packet - SDI 2018.html5', 'Base DA Updates - Berkeley 2018.html5', 'Base DA Updates - Michigan7 2018 HJPV.html5', 'Base DA Updates - Northwestern 2018.html5', 'Bataille Aff   Neg - Michigan7 2017 AFMMKK.html5', 'Bataille Aff Neg - Michigan7 2016.html5', 'Bataille Aff Updates - Michigan7 2014.html5', 'Bataille Aff master file - Wake 2019.html5', 'Bataille K - Michigan7 2016.html5', 'Bataille K - Michigan7 2019 K Lab.html5', 'Bataille K Answers - Michigan7 2016.html5', 'Bataille K Supplement - Michigan7 2016.html5', 'Bataille Ks - Wake 2018.html5', 'Bataille Supplement - Michigan7 2017 BFHHR.html5', 'Bataillie Neg - Michigan7 2016.html5', 'Baudrillard   STEM Aff - Wake 2017.html5', 'Baudrillard - Wake 2019.html5', 'Baudrillard Aff   Neg - Michigan7 2019 K Lab.html5', 'Baudrillard Aff - Michigan7 2016.html5', 'Baudrillard Aff and Neg - Michigan7 2018 K Lab.html5', 'Baudrillard K - Michigan7 2016.html5', 'Baudrillard K - Wake 2017.html5', 'Baudrillard Link File - Michigan7 2017 AFMMKK.html5', 'Baudrillard Neg - Michigan7 2016.html5', 'Baudrillard Supplement - Michigan7 2017 BFHHR.html5', 'Biden Mechanism Affirmative - SDI 2019.html5', 'Biden Mechanism Negative - SDI 2019.html5', 'Biodiversity Bad 3.0 - Michigan7 2014 BEFJR.html5', 'Biometric Surveillance Aff - Neg - MichiganClassic 2020 LOSVW.html5', 'Biometrics Aff Addendum - Northwestern 2015 6WS.html5', 'Biometrics Aff and Neg - Northwestern 2015 Sophomores .html5', 'Biometrics Negative Supplement - Northwestern 2015.html5', 'Biopiracy K - DDI 2014 KQ.html5', 'Biopolitical Borders Kritik - Berkeley 2018.html5', 'Biopolitics K - Gonzaga 2018 DMB.html5', 'Biopolitics K - Gonzaga 2020.html5', 'Biopolitics K - Wake 2018.html5', 'Biopolitics K Supplement - Michigan7 2021 CCPW.html5', 'Biopower K - Berkeley 2017.html5', 'Biopower K - JDI 2015.html5', 'Biopower K - JDI 2020.html5', 'Black Atlantic Affirmative - HSS 2014.html5', 'Black Cybernetics Aff - Michigan7 2019 K Lab.html5', 'Black FW - Wake 2019.html5', 'Black Fem K   Reparations - Wake 2017.html5', 'Black Fem K  AfroFuturism Supplement - Wake 2017.html5', 'Black Feminism - DDI 2015 ST.html5', 'Black Framework - Wake 2017.html5', 'Black Genealogy Case Neg - DDI 2018 AT.html5', 'Black Genealogy Neg - DDI 2018 KM.html5', 'Black Islamophobia Aff and Neg - Wake 2018 RKS.html5', 'Black Marxism K - Berkeley 2020 Wave 3.html5', 'Black Nationalism Affirmative - HSS 2015.html5', 'Black Nationalism Negative - HSS 2015.html5', 'Black Nihilism - Wake 2016.html5', 'Black Palestinian Aff  - Wake 2019 (1).html5', 'Blood Quantum Affirmative and Neg - Wake 2018.html5', 'Body Cameras Negative - MSDI 2020.html5', 'Body Cavity Searches Negative - DDI 2015 CT.html5', 'Border Art Aff and Neg - UTNIF 2018.html5', 'Border Drones Affirmative - Northwestern 2015 6WS.html5', 'Border Drones Negative - Northwestern 2015 6WS.html5', 'Border Surveillance Negative - Michigan7 2015.html5', 'Borders 1ac - DDI 2015 KS.html5', 'Borders Affirmative and Negative - Gonzaga 2013.html5', 'Borders Critique - Emory 2015.html5', 'Borders Impact File - Northwestern 2018.html5', 'Borders K - DDI 2015 ST.html5', 'Borders K - Starter Packet - Michigan7 2018.html5', 'Borders K - Wake 2018.html5', 'Borders Kritik - Georgetown 2018.html5', 'Borders Kritik - UTNIF 2013.html5', 'Borders Neg - DDI 2015 KQ.html5', 'Borders Negative - DDI 2015 ST.html5', 'Brown Feminist Killjoy - Wake 2018.html5', 'Buddhism K 2 - Michigan7 2013.html5', 'Bulk Data Affirmative - JDI 2015.html5', 'Bulk Data Affirmative - MSDI 2015.html5', 'Bulk Data Collection Negative - JDI 2015.html5', 'CAT AFF - MichiganClassic 2019 MPP.html5', 'CAT Neg  - MichiganClassic 2019 MPP.html5', 'CBA Aff - DDI 2021 AT.html5', 'CBA Case Neg - DDI 2021 AT.html5', 'CBA Case Neg - DDI 2021 KM.html5', 'CBFM - DDI 2014 KQ.html5', 'CBM and NFU Aff-Neg - JDI 2016.html5', 'CCP Collapse Core - Michigan7 2016.html5', 'CCP Collapse Good - Northwestern 2016.html5', 'CCP DA - MSDI 2016.html5', 'CCP Democracy Turn - HSS 2016.html5', 'CCS Counterplan - JDI 2014.html5', 'CDC Tradeoff DA - UTNIF 2017.html5', 'CDCL Aff-Neg - JDI 2016.html5', 'CIR DA 1 - Michigan7 2013.html5', 'CMSP Case neg - DDI 2014 TW.html5', 'CMSP and Exploration Disadvantage - MSDI 2014.html5', 'CMSP neg - DDI 2014 MS.html5', 'COINTELPRO Aff and Negative Upgrade - Northwestern 2015 6WS.html5', 'COINTELPRO Affirmative - Northwestern 2015 6WS.html5', 'COINTELPRO Negative - Northwestern 2015 6WS.html5', 'CP - Advantage CPs - MNDI 2022 PHA.html5', 'CP - Advantage CPs - Michigan 7 2022 CPWW.html5', 'CP - Advantage CPs - Michigan Classic 2022 CGNO.html5', 'CP - Burdensharing QPQ - Michigan 7 2022 BEJJ.html5', 'CP - Civil Military CP - Michigan 7 2022 BFHR.html5', 'CP - Congress - Michigan7 2020 Starter Pack.html5', 'CP - Courts - Michigan7 2020 BFHPR.html5', 'CP - Dept of State CP - Packet - Michigan 7 2022.html5', 'CP - EU CP - Michigan 7 2022 BFHR.html5', "CP - Executive CP's - Michigan7 2020 BFHPR.html5", 'CP - State Referendum - Michigan7 2020 BFHPR.html5', 'CP - States - Michigan7 2020 BFHPR.html5', 'CP - States Courts - MichiganClassic 2020 LOSVW.html5', 'CP - Supreme Court - Michigan 7 2022 BFHR.html5', 'CP - Turkey PIC - Michigan 7 2022 BFHR.html5', 'CP - UN CP - Michigan 7 2022 BFHR.html5', 'CP - Unilateral CP - CNDI 2022.html5', 'CP - Unilateral CP - MSDI 2022.html5', 'CRT Case Neg - DDI 2017 AS.html5', 'CTE Aff   Neg - Wave 2 - Michigan7 2017 CPPR.html5', 'CTE Aff - Wave 1 - Michigan7 2017 BFHHR.html5', 'CTE Aff Supplement - MichiganClassic 2017 CFJ.html5', 'CTE Neg - Michigan7 2017 BFHHR.html5', 'CTE Tradeoff DA - MSDI 2017.html5']

DEBATESUM_EXTREMIST_FILTER_OUT2 = ['Camp Tournament Updates - SDI 2018 DLGM.html5', 'Camp Updates  - MichiganClassic 2017 OW.html5', 'Camp Updates - Final - Michigan7 2017 AFMKK.html5', 'Cap Good Core - Wake 2018.html5', 'Cap Good Updates - MichiganClassic 2021 MMP.html5', 'Cap K  - JDI 2021.html5', 'Cap K - Berkeley 2020 Wave 4.html5', 'Cap K - DDIx 2021.html5', 'Cap K - Links - Wake 2018.html5', 'Cap K - Michigan7 2014.html5', 'Cap K - SDI 2018 BGHT.html5', 'Cap K - Starter Packet - Wake 2018.html5', 'Cap K - UNT 2018.html5', 'Cap K - Wake 2019.html5', 'Cap K Links - Michigan7 2014.html5', 'Cap K Preinstitute Set - Wake 2019.html5', 'Cap K and Answers - Gonzaga 2020 LO.html5', 'Cap K and Fem K Updates - SDI 2017.html5', 'Cap K of Race based Affs - Northwestern 2014.html5', 'Cap K vs Identity Teams - Northwestern 2015.html5', 'Cap K vs K Affirmatives - Michigan7 2018 CPPWW.html5', 'Cap K vs Race Affs - Michigan7 2014.html5', 'Cap Kritik vs K Affirmatives - Michigan7 2018 BFHPR.html5', 'Cap Supplement - UTNIF 2020.html5', 'Cap and Neolib Kritik - Michigan7 2018 CPWW.html5', 'Cap and Semiocap K - MichiganClassic 2019 K Lab.html5', 'Cap and Trade CP - Michigan7 2014 CHHJPV.html5', 'Cap v. K Affirmatives - DDI 2015 SWS.html5', 'Cap vs K Affs - Michigan7 2019 BFHR.html5', 'Capitalism Critique - Berkeley 2014.html5', 'Capitalism Critique - Georgetown 2014.html5', 'Capitalism Critique - MSDI 2014.html5', 'Capitalism Critique - Michigan7 2015.html5', 'Capitalism Critique - SDI 2015.html5', 'Capitalism Critique - SDI 2016.html5', 'Capitalism Critique - UNT 2014.html5', 'Capitalism Critique Supplement - Berkeley 2014.html5', 'Capitalism Critique vs Non-Traditional Affirmatives - HSS 2014.html5', 'Capitalism Critique vs Non-Traditional Affirmatives Answers - HSS 2014.html5', 'Capitalism Critique vs Policy Affirmatives - HSS 2014.html5', 'Capitalism Critique vs Policy Affirmatives 2 - HSS 2014.html5', 'Capitalism Impact Core - Michigan7 2021 EHJJPP.html5', 'Capitalism K - Berkeley 2019.html5', 'Capitalism K - Marc - Wake 2016 RKS.html5', 'Capitalism K - Northwestern 2021 DFW.html5', 'Capitalism K - SDI 2017 BHT.html5', 'Capitalism K - Wake 2017.html5', 'Capitalism K Answers Supplement - Emory 2016.html5', 'Capitalism K vs K Affs - Michigan7 2017 BFHHR.html5', 'Capitalism Kritik - Gonzaga 2013.html5', 'Capitalism Kritik - NDCA 2016.html5', 'Capitalism Kritik - UTNIF 2013.html5', 'Capitalism Kritik Answers - NDCA 2016.html5', 'Capitalism Kritik Supplement - Gonzaga 2013.html5', 'Capitalism Kritik Updates - UNT 2013.html5', 'Capitol Police Neg - DDI 2020 HL.html5', 'Carcerality - Wake 2018 RKS.html5', 'Cede the Political - Michigan7 2018 CPPWW.html5', 'Cede the Political DA - Michigan7 2019 CCPW.html5', 'Cede the Political DA - Michigan7 2021 CCPW.html5', 'Census Affirmative - Michigan7 2015.html5', 'Charter School Desegregation Aff - HSS 2017.html5', 'Charter School Desegregation Neg - HSS 2017.html5', 'Circumvention - Berkeley 2018.html5', 'Circumvention - Georgetown 2020.html5', 'Circumvention - MSDI 2015.html5', 'Circumvention - Northwestern 2017.html5', 'Circumvention - Northwestern 2018.html5', 'Circumvention Answers - Northwestern 2015 6WS.html5', 'Circumvention Core - Berkeley 2019.html5', 'Circumvention Core - SDI 2018 KUZ.html5', 'Circumvention DA - Michigan7 2015.html5', 'Circumvention File - Michigan7 2017 AFMMKK.html5', 'Circumvention and Gradualism - Northwestern 2015 6WS.html5', 'Citizenship K - Links - Wake 2018.html5', 'Citizenship K - Master - Wake 2018.html5', 'Citizenship K - Michigan7 2018 K Lab.html5', 'Citizenship K - Packet - SDI 2018.html5', 'Citizenship Test Case Neg - DDI 2018 AT.html5', 'Civic Education Aff - Michigan7 2017 CPPR.html5', 'Civic Education Neg  - Michigan7 2017 BFHHR.html5', 'Civic Education Neg - Michigan7 2017 CPPR.html5', 'Civil Forfeiture Aff - Neg - Berkeley 2020 Wave 4.html5', 'Climate Aff-Neg Starter Pack - Northwestern 2016.html5', 'Climate Case Neg - DDI 2016 HS.html5', 'Climate Cooperation Negative - Berkeley 2016.html5', 'Climate Financing CP - Sophomores - Gonzaga 2019.html5', 'Climate Literacy Aff - Berkeley 2017.html5', 'Climate Migration 1AC - UTNIF 2018.html5', 'Climate Migration Negative - UTNIF 2018.html5', 'Climate Neg - Michigan7 2016.html5', 'Climate Offsets CP - Michigan7 2019 BFHR.html5', 'Climate Refugees Aff and Neg - Michigan7 2018 HJPV.html5', 'Cloud Seeding Case Neg - DDI 2021 GG.html5', 'Cloud Seeding Case Neg - DDI 2021 KM.html5', 'Cloud Seeding Case Neg - DDI 2021 KS.html5', 'Coal Disadvantage - WSDI 2014.html5', 'Coastal Marine Spatial Planning Affirmative - JDI 2014.html5', 'Colonial Cartography Critique - WSDI 2014.html5', 'Coloniality K - Northwestern 2021.html5', 'Coloniality Kritik - DDI 2013.html5', 'Coloniality Kritik - UNT 2013.html5', 'Coloniality Kritik Supplement - DDI 2013.html5', 'Colorblindness Critique - HSS 2015.html5', 'Colorblindness K  - Michigan7 2017 BFHHR.html5', 'Commissions CP - Michigan7 2018 CPWW.html5', 'Common Core Aff - Wake 2017.html5', 'Common Core Affirmative - HSS 2015.html5', 'Common Core Neg - Wake 2017.html5', 'Common Core Negative - HSS 2015.html5', 'Communicative Engagement CP - Berkeley 2016.html5', 'Communicative Engagement Case Neg - DDI 2016 BAM.html5', 'Communicative Engagement Case Neg - DDI 2016 HS.html5', 'Communicative Engagement Neg - DDI 2016 MS.html5', 'Community based fisheries case neg - DDI 2014 SWS.html5', 'Competitive Grants CP - Michigan7 2017 BFHR.html5', 'Competitiveness Bad - Michigan7 2017 FFRSV.html5', 'Competitiveness Bad DA   Answers - Michigan7 2017 BFHR.html5', 'Computer Science Education Aff - UTNIF 2017.html5', 'Computer Science Education Neg - UTNIF 2017.html5', 'Con Con CP - SDI 2021.html5', 'Con Con and Con Amend CP - Compiled - Berkeley 2020 Wave 4.html5', 'Condition CP Starter Pack - Northwestern 2016.html5', 'Congress & Distinguish CP - DDI 2017.html5', 'Congress Advantage - Georgetown 2019.html5', 'Congressional Elections DA - Michigan7 2016.html5', 'Congressional Oversight Neg - DDI 2019 LO.html5', 'Connolly File - Michigan7 2018 MMMR.html5', 'Conquest K - Gonzaga 2021.html5', 'Conquest K Supplement - Gonzaga 2021.html5', 'Conrad 30 Negative - MichiganClassic 2018 FH.html5', 'Consult Congress CP - Michigan7 2019 FFPSV.html5', 'Consult India CP - JDI 2016.html5', 'Consult Indigenous CP - Michigan7 2021 CCPW.html5', 'Consult States CP - JDI 2014.html5', 'Consult the Indigineous Counterplan - Northwestern 2013 Sophomores.html5', 'Consumption K - MichiganClassic 2014.html5', 'Coordinated Mapping Aff - DDI 2014 MS.html5', 'Core Packet - Varsity - TDI 2021.html5', 'Corporate Inversions Politics DA - Michigan7 2014.html5', 'Cosmopolitanism Critique - Michigan7 2015.html5', 'Cosmopolitanism K - Georgetown 2016.html5', 'Cosmopolitanism K - SDI 2018 BJMSS.html5', 'Cosmopolitanism Kritik - Michigan7 2018 K Lab.html5',  'Court Capital DA - SDI 2020.html5', 'Court Capital DA - Wake 2015.html5', 'Court Clog DA - Georgetown 2020.html5', 'Court Clog DA - Version 2 - Michigan7 2018 CPWW.html5', 'Court DAs   Answers - Wave 1 - Michigan7 2017 BFHHR.html5', 'Court DAs - Berkeley 2018.html5', 'Court DAs - Wave 2 - Michigan7 2017 BFHHR.html5', 'Court Generics - Berkeley 2017.html5', 'Court Legitimacy DA - DDI 2017.html5', 'Court Legitimacy DA - Michigan7 2015.html5', 'Court Packing DA - Michigan Draft - Berkeley 2020 Wave 4.html5', 'Court Politics DA - Michigan7 2021 BFHPR.html5', 'Court Politics DA - Michigan7 2021 EHJJPP.html5', 'Court Stripping Turn - DDI 2017.html5', 'Courts Affirmative - SDI 2015.html5', 'Courts Affirmative - Samford 2015.html5', 'Courts Affirmative Supplement - SDI 2015.html5', 'Courts CP  - JDI 2021.html5', 'Courts CP - Berkeley 2020 Wave 2.html5', 'Courts CP - Gonzaga 2018 Scholars.html5', 'Courts CP - JDI 2017.html5', 'Courts CP - JDI 2020.html5', 'Courts CP - MSDI 2020.html5', 'Courts CP - Michigan7 2017 HJPPV.html5', 'Courts CP%2C XO CP%2C Circumvention - DDI 2015 SWS.html5', 'Courts Core - SDI 2017 NNP.html5', 'Courts Neg Wave 2 - Berkeley 2017.html5', 'Courts Negative - Michigan7 2015.html5', 'Courts Politics DA - Berkeley 2020 Starter Pack.html5', 'Courts Politics DA - JDI 2017.html5', 'Courts Politics DA - Northwestern 2015.html5', 'Courts Turns - UTNIF 2015.html5', 'Courts v. Congress - UNT 2015.html5', 'Cp - Con Con Preview - Michigan7 2020 Starter Pack.html5', 'Credibility Bad DA - SDI 2019.html5', 'Crime DA - MSDI 2015.html5', 'Criminology Answers - UTNIF 2020.html5', 'Critical Cartography Affirmative and Negative - Gonzaga 2014.html5', 'Critical Geography Critique - UTNIF 2015.html5', 'Critical Marijuana Aff and Neg - JDI 2020.html5', 'Critical Neglect Affirmative and Negative - Northwestern 2013 6WeekSeniors.html5', 'Critical Neglect Negative Supplement - Northwestern 2013 6WeekSeniors.html5', 'Critical Policing Aff and Neg - JDI 2020.html5', 'Critical Race Theory Critique - Michigan7 2015.html5', 'Critical Race Theory K  - DDI 2017.html5', 'Critical Terror Studies - DDI 2015 SWS.html5', 'Critique Aff Negative - UNT 2014.html5', 'Critique Affirmative - UNT 2014.html5', 'Critique Answers - Georgetown 2014.html5', 'Critique Answers - Reformism Good - SDI 2015.html5', 'Critique Answers - WSDI 2014.html5', 'Critique Answers Supplement - Michigan7 2015.html5', 'Critique Immigration Affirmative - Michigan7 2015.html5', 'Critique Immigration Negative - Michigan7 2015.html5', 'Crypto 2acs - DDI 2015 ST.html5', 'Crypto Affirmative - DDI 2015 ST.html5', 'Crypto Negative - DDI 2015 ST.html5', 'Cthulhu Mythos Affirmative - Gonzaga 2014.html5', 'Cthulhu Mythos Negative - Gonzaga 2014.html5', 'Cuba Aff - SCDI 2013.html5', 'Cuba Affirmative - Advanced - DDIx 2013.html5', 'Cuba Affirmative Update - Emory 2013.html5', 'Cuba Core - Northwestern 2013 Plus One.html5', 'Cuba Embargo Negative - Emory 2013.html5', 'Cuba Embargo Negative - GMU 2013.html5', 'Cuba Embargo Negative - JDI 2013.html5', 'Cuba Embargo Negative - Sanctions and Russia - SDI 2013.html5', 'Cuba Embargo Negative Supplement 2 - SDI 2013.html5', 'Cuba Embargo Negative Updates - Northwestern 2013 6WeekJuniors.html5', 'Cuba Food Aff - Michigan7 2013 ACHM.html5', 'Cuba Hospitality Affirmative - Northwestern 2013 6WeekJuniors.html5', 'Cuba Hospitality Negative - Northwestern 2013 6WeekJuniors.html5', 'Cuba K Aff and Neg - Wave 2 - Michigan7 2013 ACHM.html5', 'Cuba Neoliberalism Aff and Neg - SDI 2013.html5', 'Cuba Oil Affirmative - JDI 2013.html5', 'Cuba Oil Negative - Georgia 2014.html5', 'Cuba Rum Affirmative - DDI 2013 CM.html5', 'Cuba Science Cooperation Affirmative - Gonzaga 2013.html5', 'Cuba Sugar Ethanol Aff and Neg - UNT 2013.html5', 'Cuba Sugar Ethanol Kritik Aff and Neg - JDI 2013.html5', 'Cuba TRI Kritik Negative - UTNIF 2013.html5', 'Cuba Telecommunications Affirmative - Northwestern 2013 6Weekseniors.html5', 'Cuba Terror Aff - MichiganClassic 2013 CT.html5', 'Cuba Terror List Aff - K Version - MichiganClassic 2013 CT.html5', 'Cuba Terror List Affirmative - Gonzaga 2013.html5', 'Cuba Terror List Affirmative - HSS 2013.html5', 'Cuba Tourism Affirmative - MSDI 2013.html5', 'Cuba Trade Aff Wave 2 - Michigan7 2013 HJPP.html5', 'Cuba Trade Aff Wave 3 - Michigan7 2013 HJPP.html5', 'Cuba Travel Ban Aff - MNDI 2013 CT.html5', 'Cuba Travel Ban Aff - Michigan7 2013 ACHM.html5', 'Cuba Travel Neg - MNDI 2013 CT.html5', 'Cuban Embargo Negative - DDI 2013 AC.html5', 'Cuban ICT Negative - Northwestern 2013 6WeekJuniors.html5', 'Cultural Competency - Neg Supplement - SDI 2017 EER.html5', 'Cultural Competency Aff - SDI 2017 EER.html5', 'Cultural Competency Neg - SDI 2017 EER.html5', 'Curtiss Wright Aff - DDI 2019 KM.html5', 'Curtiss Wright Neg - DDI 2019 KM.html5', 'Cyber Aff - Michigan7 2016.html5', 'Cyber Case Neg - DDI 2016 CT.html5', 'Cyber DA - MSDI 2015.html5', 'Cyber Dams Aff Neg - DDI 2021 GDDI.html5', 'Cyber Neg - Michigan7 2016.html5', 'Cybernetics Aff   Neg Updates - Michigan7 2017 AFMMKK.html5', 'Cybernetics K - Michigan7 2017 AFMMKK.html5', 'Cybernetics K Answers - Michigan7 2017 AFMMKK.html5', 'Cybernetics Kritik - Michigan7 2018 K Lab.html5', 'Cyborgs Affirmative - Michigan7 2015.html5', 'Cyborgs Negative - Michigan7 2015.html5', 'DA - 2020 Elections 2 - Michigan7 2020 BFHPR.html5', 'DA - 2020 Elections Preview - Michigan7 2020 Starter Pack.html5', 'DA - 2020 Elections Preview 3 - Reproductive Rights Impact - Michigan7 2020 Starter Pack.html5', 'DA - 2020 Elections Updates - MichiganClassic 2020 MMP.html5', 'DA - Agenda Link Core - Michigan7 2020 BFHPR.html5', 'DA - Assurance DA - Northwestern 2022.html5', 'DA - BLM - Michigan7 2020 BFHPR.html5', 'DA - China Good - CNDI 2022.html5', 'DA - Court Capital - Michigan7 2020 Starter Pack.html5', 'DA - Court Clog - Michigan7 2020 Starter Pack.html5', 'DA - Court Packing - Michigan7 2020 BFHPR.html5', 'DA - DOD Tradeoff - UTNIF 2022.html5', 'DA - DOD Tradeoff DA - Harvard 2022.html5', 'DA - Democracy Bad - Michigan7 2020 BFHPR.html5', 'DA - Elections Supplement - Michigan7 2020 CCPTW.html5', 'DA - Elections Updates - MichiganClassic 2020 BFMZ.html5', 'DA - Elections Wave 1 - Michigan7 2020 FFPSVV.html5', 'DA - Federalism - Michigan7 2020 BFHPR.html5', 'DA - Food Innovation DA - MSDI 2022.html5', 'DA - Midterms - CNDI 2022.html5', 'DA - Midterms - Michigan 7 2022 FMPS.html5', 'DA - Midterms - Michigan Classic 2022 BVL.html5', 'DA - Midterms GOP Good - Michigan Classic 2022 CS.html5', 'DA - Midterms Updates - Michigan Classic 2022 MMP.html5', 'DA - NATO Cohesion - GDI 2022.html5', 'DA - NDAA - Michigan7 2020 BFHPR.html5', 'DA - Oversight DA - Michigan 7 2022 BFHR.html5', 'DA - Police Unions - Michigan7 2020 HKMM.html5', 'DA - Politics BBB - Michigan 7 2022 BFHR.html5', 'DA - Politics Competitiveness - Michigan 7 2022 BFHR.html5', 'DA - Primacy - UTNIF 2022.html5', 'DA - Russia Relations - Michigan 7 2022 CPWW.html5', 'DA - Senate Elections - Michigan7 2020 BFHPR.html5', 'DA - Stimulus - Michigan7 2020 BFHPR.html5', 'DA - Strategic Concept - Michigan 7 2022 BFHR.html5', 'DA - Strategic Concept v2 - Michigan 7 2022 BFHR.html5', 'DACA Neg - Starter Packet - Wake 2018.html5', 'DC Vouchers Aff - HSS 2017.html5', 'DC Vouchers Neg - HSS 2017.html5', 'DDX Packet - DDIx 2018.html5', 'DOE Tradeoff DA   Answers  - Michigan7 2017 BFHHR.html5', 'DREAM Act - Military PIC - SDI 2018 BGHT.html5', 'DREAM Act Aff - Military Advantage - SDI 2018 BJMSS.html5', 'DREAM Act Aff Compiled - SDI 2018 BJMSS.html5', 'DREAM Act Aff Neg - Berkeley 2018.html5', 'DREAM Act Affirmative - Packet - SDI 2018.html5', 'DREAM Act Neg Compiled - SDI 2018 BJMSS.html5', 'DREAM Act Neg Supplement - SDI 2018 DGLM.html5', 'DREAM Act Negative - Packet - SDI 2018.html5', 'DREAM Aff Updates - Wake 2018.html5', 'Dams Aff Neg - Michigan7 2021.html5', 'Dark Deleuze Answers  - Wake 2018.html5', 'Dark Deleuze K - Wake 2018 RKS.html5', 'De-Dev - Michigan7 2014.html5', 'De-Development - SDI 2019.html5', 'DeDev Core - Michigan7 2017 CMMW.html5', 'DeDev and Growth Good - Michigan7 2016.html5', 'DeDevelopment Updates - Michigan7 2018 CPWW.html5', 'DeSchooling K  - Michigan7 2017 BFHHR.html5', 'DeSchooling K - Wake 2017.html5', 'DeSchooling Kritik - UNT 2017.html5', 'DeVos Credibility DA - MSDI 2017.html5', 'DeVos DA   Answers  - Michigan7 2017.html5', 'Death Bad and Good Core - Wake 2018.html5', 'Death Penalty Aff - DDI 2020 HL.html5', 'Death Penalty Aff - Neg - Berkeley 2020 Wave 2.html5', 'Death Penalty Aff and Neg Updates - JDI 2020.html5', 'Death Penalty Affirmative - SDI 2020.html5', 'Death Penalty Case Neg - DDI 2020 GG.html5', 'Death Penalty Expansion - Aff and Neg - Gonzaga 2020 MM.html5', 'Death Penalty K version - Georgetown 2020.html5', 'Death Penalty Neg - DDI 2020 AT.html5', 'Death Penalty Neg - DDI 2020 FS.html5', 'Death Penalty Neg - DDI 2020 KM.html5', 'Death Penalty Neg Updates - SDI 2020.html5', 'Death Penalty Negative - MSDI 2020.html5', 'Death Penalty Negative - SDI 2020.html5', 'Debt Ceiling DA - Berkeley 2017.html5', 'Debt Ceiling DA - Michigan7 2017 BFHHR.html5', 'Debt Ceiling Politics DA - Berkeley 2019.html5', 'Debt Ceiling Politics DA - DDI 2019 KM.html5', 'Debt Negative - DDI 2015 ST.html5', 'Decarceration Affirmative - SDI 2020.html5', 'Decarceration Negative - SDI 2020.html5', 'Decoloniality Answers - DDI 2013 SS.html5', 'Decoloniality Kritik - UTNIF 2013.html5', 'Decoloniality Neg - DDI 2017 ST.html5', 'Decoloniality Negative - DDI 2013 AC.html5', 'Decolonization Neg - DDI 2017 AS.html5', 'Dedev - DDI 2016 KQ.html5', 'Dedev - DDI 2020 GG.html5', 'Dedev - JDI 2017.html5', 'Dedev Neg - DDI 2020 AT.html5', 'Dedevelopment - DDI 2014 KQ.html5', 'Dedevelopment - JDI 2016.html5', 'Dedevelopment - Northwestern 2014.html5', 'Dedevelopment Good - Emory 2014.html5', 'Dedevelopment Impact File - Northwestern 2018.html5', 'Deep Ecology K - DDI 2021.html5', 'Defending Apocalyptic Representations - Northwestern 2015 6WS.html5',  'Deferred Action Affirmative - MSDI 2015.html5', 'Dehumanism Aff and Neg - Packet - SDI 2018.html5', 'Deleuze - DDI 2015 SWS.html5', 'Deleuze Aff   Neg - Michigan7 2019 K Lab.html5', 'Deleuze Aff Neg - Michigan7 2021 BFHPR.html5', 'Deleuze K - Michigan7 2016.html5', 'Deleuze Kritik - Michigan7 2018 K Lab.html5', 'Deleuze Pedagogy Aff - Michigan7 2017 AFMMKK.html5', 'Delooze - Wake 2019.html5', 'Derrida K - UNT 2018.html5', 'Derrida Kritik - DDI 2013 CM.html5', 'Derrida Terror Negative - DDI 2015 CT.html5', 'Deschooling K  - DDI 2017.html5', 'Deschooling K  - Gonzaga 2017.html5', 'Deschooling K - MSDI 2017.html5', 'Deschooling K - Northwestern 2017.html5', 'Desegregation Aff   Neg - Starter Set - Michigan7 2017.html5', 'Desegregation Aff   Neg - Wave 2 - Michigan7 2017 HJPPV.html5', 'Desegregation Aff   Neg - Wave 3 - Michigan7 2017 HJPPV.html5', 'Desegregation Aff Neg Supplement - MNDI 2017 GJJS.html5', 'Destroy the Oceans - Michigan7 2014 GRAMS.html5', 'Detention Affirmative - Michigan7 2018 BFHPR.html5', 'Detention Affirmative - Starter - UTNIF 2018.html5', 'Detention Negative - Michigan7 2018 BFHPR.html5', 'Detention Negative - Starter - UTNIF 2018.html5', 'Development Assistance Affirmative - DDI 2013 KQ.html5', 'Development Discourse K - SCDI 2013.html5', 'Development Kritik Affirmative Answers - Northwestern 2013 4WeekSeniors.html5', 'Development Kritik Supplement - JDI 2013.html5', 'Development PIC - DDI 2014 KQ.html5', 'Development Word PIC - DDI 2013 CM.html5', 'Development Word PIC Answers - HSS 2014.html5',  'Dip Cap DA - Michigan7 2016.html5', 'Dip Cap DA Wave 2 - MichiganClassic 2016.html5', 'Diplomatic Capital DA - JDI 2016.html5']

DEBATESUM_EXTREMIST_FILTER_OUT3 = ['Diplomatic Capital Disadvantage - Gonzaga 2013.html5', 'Disability Aff - Wake 2017.html5', 'Disability Aff Supplement - Berkeley 2017.html5', 'Disability K - Michigan7 2018 FFGSV.html5', 'Disability K - Wake 2017.html5', 'Discourse Critique - Gonzaga 2014.html5', 'Discourse Kritiks - DDI 2013 SS.html5', 'Disruption Aff - Berkeley 2017.html5', 'Disruption Neg - Berkeley 2017.html5', 'Diversionary War Answers - Michigan7 2017.html5', 'Diversity Visas Aff and Neg - Michigan7 2018 FFGSV.html5', 'Do It Elsewhere CP - Northwestern 2014.html5', 'Domestic Detention Affirmative - JDI 2015.html5', 'Domestic Violence Aff - Starter Packet - Michigan7 2018.html5', 'Domestic Violence Aff and Neg Supplement - MNDI 2018 GJJJ.html5', 'Domestic Violence Aff and Neg and Gender K - Michigan7 2018 BFHPR.html5', 'Domestic Violence Affirmative - MichiganClassic 2018.html5', 'Domestic Violence Neg - Starter Packet - Michigan7 2018.html5', 'Domestic Violence Negative - MichiganClassic 2018.html5', 'Domestic Word PIC - Michigan7 2015.html5', 'Dress Code Aff - Berkeley 2017.html5', 'Dress Code Neg - Berkeley 2017.html5', 'Drug Courts Neg - DDI 2020 FS.html5', 'Drug Decriminalization Affirmative - MSDI 2020.html5', 'Drug Treatment Aff -Starter - Georgetown 2020.html5', 'EB Visas Aff and Neg - Wake 2018.html5', 'EB-5 Aff - Version 2 - Michigan7 2018 BFHPR.html5', 'EB-5 Affirmative - Wave 1 - Michigan7 2018 BFHPR.html5', 'EB-5 Negative - Wave 2 - Michigan7 2018 BFHPR.html5', 'EB5 Aff and Neg - Northwestern 2018.html5', 'EB5 Aff and Neg Updates - Northwestern 2018.html5', 'EBSA neg - DDI 2014 CM.html5', 'EBSA neg - DDI 2014 MS.html5', 'ECPA Affirmative - Northwestern 2015 6WS.html5', 'ECPA Negative - Northwestern 2015 6WS.html5', 'EE Case Neg - DDI 2017 AS.html5', 'EEZ Leasing Aff - Northwestern 2014.html5', 'EEZ Leasing Neg - Northwestern 2014.html5', 'EEZ Mapping Aff - Northwestern 2014.html5', 'EPA Tradeoff DA - DDI 2021.html5', 'EPA Tradeoff DA - Michigan7 2021 BFHPR.html5', 'EPA Tradeoff DA File 2 - Michigan7 2021 BFHPR.html5',  'Eco Feminism Critique - HSS 2014.html5', 'Eco Feminism Critique - SDI 2014.html5', 'Eco K - Northwestern 2014.html5', 'Eco Securitization K - Northwestern 2014.html5', 'Eco-Feminism Critique - WSDI 2014.html5', 'Eco-Managerialism Critique - WSDI 2014.html5', 'Eco-Socialism Critique - WSDI 2014.html5', 'EcoFem Case neg - DDI 2014 TW.html5', 'EcoFem V2 Aff - DDI 2014 SWS.html5', 'EcoPhenomenology 2AC - SDI 2014.html5', 'EcoPhenomenology Critique - SDI 2014.html5', 'Ecodoomsaying Critique - Emory 2014.html5', 'Ecofem neg - DDI 2014 MS.html5', 'Ecofeminism - Michigan7 2014.html5', 'Ecofeminism Critique - Berkeley 2014.html5', 'Ecofeminism K - DDI 2021.html5', 'Ecofeminism K - Michigan7 2014.html5', 'Ecofeminism neg - DDI 2014 KQ.html5', 'Ecological Loss Aff Neg - Michigan7 2021 K Lab.html5', 'Ecomarxism Critique - UTNIF 2014.html5', 'Economics Critique Core - JDI 2014.html5', 'Ecophenomenology Critique - HSS 2014.html5', 'Education Key File - HSS 2017.html5', 'Educational Commons Aff - Michigan7 2017 BFHHR.html5', 'Educational Commons Neg - Michigan7 2017 BFHHR.html5', 'Educational Futurism K - Michigan7 2017 AFMMKK.html5', 'Educational Futurism K - Wake 2017.html5', 'Effective Altruism K - Michigan7 2021 BFHPR.html5', 'Election DA Starter Pack - Northwestern 2016.html5', 'Elections DA - Berkeley 2016.html5', 'Elections DA - Berkeley 2020 Starter Pack.html5', 'Elections DA - DDI 2015 SWS.html5', 'Elections DA - Impact Turns - Wake 2016 RKS Seniors.html5', 'Elections DA - Michigan7 2016.html5', 'Elections DA - Northwestern 2014.html5', 'Elections DA - Northwestern 2015 6WS.html5', 'Elections DA - SDI 2020.html5', 'Elections DA - Samford 2020.html5', 'Elections DA - UTNIF 2015.html5', 'Elections DA - Updates 1 - SDI 2016.html5', 'Elections DA - Updates 2 - SDI 2016.html5', 'Elections DA - Updates 3 - SDI 2016.html5', 'Elections DA - Wave 1 - Wake 2016 RKS.html5', 'Elections DA Answers - Emory 2016.html5', 'Elections DA Starter - Michigan7 2016.html5', 'Elections DA Supplement - SDI 2016.html5', 'Elections DA Updates 1 - MichiganClassic 2016.html5', 'Elections DA v K - MichiganClassic 2016.html5', 'Elections Disadvantage - HSS 2016.html5', 'Embargo Word PIC - DDI 2013 CM.html5', 'Embassies Negative - DDI 2015 MM.html5', 'Embodiment Critique - UTNIF 2015.html5', 'Employment Visas Aff Neg - Berkeley 2018.html5', 'Enclosure Kritik - Northwestern 2013 4WeekSeniors.html5', 'Endangered Species Aff -Louis - Wake 2016 RKS.html5', 'Endangered Species Neg - Louis - Wake 2016 RKS.html5', 'Energy Disadvantage - Berkeley 2014.html5', 'Energy Prices DA - Michigan7 2013.html5', 'Env Managerialism  - Gonzaga 2021.html5', 'Environment Core - Baylor 2014.html5', 'Environment Critique - JDI 2014.html5', 'Environment DA - Michigan7 2018 BFHPR.html5', 'Environment DA - Northwestern 2014.html5', 'Environment DA Answers - Michigan7 2018 MMMR.html5', 'Environment Disadvantages - Gonzaga 2014.html5', 'Environment Impact Turns - Michigan7 2021 BFHPR.html5', 'Environment K - MSDI 2021.html5', 'Environment Management K - Michigan7 2021 CCPW.html5', 'Environmental Apocalyptic Framing Critique - HSS 2014.html5', 'Environmental Crimes Aff Masterfile - DDI 2020 GG.html5', 'Environmental Education Aff - JDI 2017.html5', 'Environmental Education Neg - JDI 2017.html5', 'Environmental Justice K - Berkeley 2021.html5', 'Environmental Justice Kritik generic - DDI 2014 Security Kritik generic.html5', 'Environmental K Answers - Michigan7 2014 BEFJR.html5', 'Environmental Management and Security Critique Answers - Gonzaga 2014.html5', 'Environmental Personhood Case Neg - DDI 2021 FJ.html5', 'Environmental Personhood Case Neg - DDI 2021 GG.html5', 'Environmental Personhood Case Neg - DDI 2021 HL.html5', 'Environmental Security K - Michigan7 2014 BEFJR.html5', 'Ephemeral Streams Case Neg - Berkeley 2021.html5', 'Epistemic Anxiety K - Berkeley 2016.html5', 'Equalize Funding Neg - DDI 2017 ST.html5', 'Eurocentrism K - Wake 2017.html5',  'Executive Counterplan - Gonzaga 2013.html5', 'Executive Order Counterplan - GMU 2014.html5', 'Executive Order Counterplan - JDI 2014.html5', 'Executive Power DA - Georgetown 2019.html5', 'Executive Power DA and CP - DDI 2019 Generic.html5', 'Exploration Critique - Wake 2014.html5', 'Exploration K - DDI 2014 KQ.html5', 'Export Controls Affirmative - Berkeley 2016.html5',  'FBI Drug Testing Affirmative - HSS 2015.html5', 'FDA Aff and Neg - Northwestern 2015 6WS.html5', 'Fabulation Case neg - Wake 2019.html5', 'Failed States Kritik - JDI 2013.html5', 'Family Aff - Gonzaga 2018 Sophomores.html5', 'Family Aff and Neg 2.0 - Gonzaga 2018 Sophomores.html5', 'Family Immigration Aff Neg - Northwestern 2018.html5', 'Family Separation Aff and Neg - SDI 2018 BJMSS.html5', 'Famine K - Northwestern 2014.html5', 'Farm Bill DA - MichiganClassic 2018 AKZ.html5', 'Farmworkers Aff and Neg - MichiganClassic 2018 BO.html5', 'Fear of the Ocean Affirmative - Michigan7 2014 GRAMS.html5', 'Federal Evidence Aff-Neg - Berkeley 2020 Starter Pack.html5', 'Federal Prisons Neg - UNT 2017.html5', 'Fem IR K - Michigan7 2014.html5', 'Fem IR K - Michigan7 2019 HKMM.html5', 'Fem IR Kritik - DDI 2013 KQ.html5', 'Fem IR Saudi Aff - DDI 2019 KS .html5', 'Fem K Supplement - MichiganClassic 2018 FH.html5', 'Fem Open Borders Aff and Neg - Michigan7 2018 CPWW.html5', 'Fem Psychoanalysis K - Wake 2016 RKS K Lab.html5', 'Feminism Critique - HSS 2017.html5', 'Feminism Critique - Samford 2015.html5', 'Feminism Critique of Privacy - MichiganClassic 2015.html5', 'Feminism IR Critique - Gonzaga 2014.html5', 'Feminism K - Wake 2017.html5', 'Feminism Kritik - GMU 2013.html5', 'Feminism Kritik - Gonzaga 2013.html5', 'Feminism Ks Answers  - Wake 2018.html5', 'Feminist Killjoy K - Wake 2016 RKS K Lab.html5', 'Feminist Killjoy Supplement - Wake 2016 RKS K Lab.html5', 'Feminist Materialism K - Wake 2016 RKS K Lab.html5', 'Feminist Pedagogy K - Berkeley 2017.html5', 'Feminist Terror K - Northwestern 2015.html5', 'Fiat, Hope and Pragmatism Core - Wake 2018.html5', 'Filipino Aff - Northwestern 2014.html5', 'Final Patch Update File - Michigan7 2017 BFHHR.html5', 'Final Update File - Michigan7 2017 CMMW.html5', 'Final Updates  - Michigan7 2017 BCPPR.html5', 'Final Updates - Michigan7 2018 BFHPR.html5', 'Final Updates - Michigan7 2018 CPPWW.html5', 'Final Updates - Michigan7 2018 K Lab.html5', 'Florida Disadvantage - UTNIF 2014.html5', 'Foreign Embassies Negative Supplement - Michigan7 2015.html5', 'Foreign Students Negative Supplement - Northwestern 2015.html5', 'Forensic Ecology Aff and Neg - Wake 2019.html5', 'Foucault Affirmative - UTNIF 2015.html5', 'Foucault Critique - Michigan7 2015.html5', 'Foucault K - Michigan7 2017 CPPR.html5', 'Foucault K - Northwestern 2015.html5', 'Foucault Negative - UTNIF 2015.html5', 'Fracking Aff - MSDI 2021.html5', 'Fracking Aff Neg - Berkeley 2021.html5', 'Fracking Aff Neg - Northwestern 2021 DFW.html5', 'Fracking Aff Neg - UTNIF 2021.html5', 'Fracking Case Neg - MSDI 2021.html5', 'Fracking Case Neg - Michigan7 2021 BFHPR.html5', 'Fracking Neg Addendum - Northwestern 2021 DFW.html5', 'Fracking Supplement - MichiganClassic 2021 MMP.html5', 'Framework - Berkeley 2016.html5', 'Framework - Berkeley 2017.html5', 'Framework - Berkeley 2018.html5', 'Framework - Cap K vs K Affs 2 - Michigan 7 2022 BFHR.html5', 'Framework - Georgetown 2014.html5', 'Framework - Gonzaga 2017.html5', 'Framework - Gonzaga 2018.html5', 'Framework - Max - Wake 2016 RKS.html5', 'Framework - Michigan7 2014 GRAMS.html5', 'Framework - Michigan7 2015.html5', 'Framework - Michigan7 2019 HKMM.html5', 'Framework - Neg K Affs - Michigan 7 2022 BFHR.html5', 'Framework - Neg vs K Affs Toolbox - Michigan 7 2022 FMPS.html5', 'Framework - SDI 2013 Starter.html5', 'Framework - SDI 2016.html5', 'Framework - SDI 2018 BJMSS.html5', 'Framework - Scholars - Gonzaga 2019.html5', 'Framework - UNT 2014.html5', 'Framework - UNT 2018.html5', 'Framework - Wake 2016 RKS K Lab.html5', 'Framework - Wake 2018.html5', 'Framework Addendum - Northwestern 2015 6WS.html5', 'Framework Addendum - Wake 2016 RKS K Lab.html5', 'Framework Answers - Berkeley 2017.html5', 'Framework Booster - Michigan7 2016.html5', 'Framework Core - Michigan7 2016.html5', 'Framework Core - Wave 1 - Michigan7 2017.html5', 'Framework Opening Packet - SDI 2015.html5', 'Framework Supplement - Michigan7 2016.html5', 'Framework Supplement 3 - Michigan7 2016.html5', 'Framework Updates - Michigan7 2014 HHJPV.html5', 'Framing - Michigan7 2021 BFHPR.html5', 'Free Black Girls Aff - Michigan7 2017 AFMMKK.html5', 'Free Market CP  Answers - Michigan7 2017 HJPPV.html5', 'Free Market CP - UTNIF 2017.html5', 'Free Market CP Updates - Michigan7 2017 BFHHR.html5', 'Free Market Core - SDI 2017 EER.html5', 'Free Trade - UNT 2013.html5', 'Freirean Dialogue Aff - SDI 2017 BHT.html5', 'Freirean Dialogue Neg - SDI 2017 BHT.html5', 'Frontier K - Michigan7 2014 GRAMS.html5', 'Fugitivity Affirmative - Michigan7 2015.html5', 'Fugitivity Master File - Michigan7 2017 AFMMKK.html5', 'Fugitivity Negative - Michigan7 2015.html5', 'Funding Equity Neg - MSDI 2017.html5', 'Fusion Centers Aff and Neg Upgrades - JDI 2015.html5', 'Fusion Centers Affirmative - JDI 2015.html5', 'Fusion Centers Affirmative and Negative - MichiganClassic 2015.html5', 'Fusion Centers Negative - MNDI 2015.html5', 'Fusion Centers Updates - JDI 2015.html5', 'Futurity K - UTNIF 2018.html5', 'Gameworks - Michigan7 2015.html5', 'Gender Ableism Aff - Wake 2016 RKS K Lab.html5', 'Gender Asylum Affirmative - Northwestern 2018.html5', 'Gender Critique - UNT 2014.html5', 'Gender Critique - UTNIF 2014.html5', 'Gender Critique - UTNIF 2015.html5', 'Gender IR K - Wake 2016 RKS K Lab.html5', 'Gender IR K - Wave 1 - Wake 2016 RKS K Lab.html5', 'Gender K - Michigan7 2016.html5', 'Gender K - Michigan7 2017 FFRSV.html5', 'Gender K - Michigan7 2021 BFPSW.html5', 'Gender Kritik - Michigan7 2018 MMMR.html5', 'Gender Kritik - SDI 2019.html5', 'Gender Negative - SDI 2019.html5', 'Gender Neutral Bathroom CP - SDI 2017 EER.html5', 'Gender Privacy K - DDI 2015 SWS.html5', 'Gendered Language - Michigan7 2014 CFJMP.html5', 'Generic Aff Answers - Northwestern 2014.html5', 'Generic Critique Answers - Michigan7 2015.html5', 'Geoengineering neg - DDI 2014 MS.html5', 'Geography K 1 - Michigan7 2013.html5', 'Gift Kritik - DDI 2013 SS.html5', 'Global Local and Consumption K - Northwestern 2014.html5', 'Google EMRs DA - HSS 2015.html5', 'Grand Bargain Aff - DDI 2016 HS.html5', 'Grand Bargain Case Neg - DDI 2016 KQ .html5', 'Grants Aff-Neg - Berkeley 2017.html5', 'Greasetrap masterfile - Wake 2019.html5', 'Gree Tech Case Neg - DDI 2016 KQ.html5', 'Green Finance Aff Neg - MichiganClassic 2016.html5', 'Green Psychoanalysis - Wake 2018.html5', 'Green Tech Aff - DDI 2016 KQ.html5', 'Green Tech Case Neg - DDI 2016 CT .html5', 'Growth Bad - JDI 2014.html5', 'Growth Bad Core - SDI 2017 PSW.html5', 'Growth Good - JDI 2017.html5', 'Guantanamo Bay Affirmative - SDI 2013.html5', 'Guantanamo Bay Negative - SDI 2013.html5', 'Guest Worker Aff and Neg - Michigan7 2018 CPPWW.html5', 'Guidance CP - Michigan7 2021 BFHPR.html5', 'Gumbs Aff - Michigan7 2021 K Lab.html5', 'Gumbs Case Neg - Michigan7 2021 BFPSW.html5', 'Gumbs Case Neg - Michigan7 2021 K Lab.html5', 'H1-B Negative - Emory 2018.html5', 'HHHpretournmentupdates - Georgetown 2020.html5', 'HR Condition CP - Michigan7 2016.html5', 'HRIA CP - Northwestern 2018.html5', 'HUMINT Advantage Answers - HSS 2015.html5', 'Handmaids Negative - DDI 2015 ST.html5', 'Handmaids Negative - DDI 2015 SWS.html5', 'Handmaids Tale Affirmative - DDI 2015 CT.html5', 'Hauntology K - Wake 2017.html5', 'Health Care Coop Case Neg - DDI 2016 MS.html5', 'Health Diplomacy Aff - Michigan7 2016.html5', 'Health Diplomacy Neg - Michigan7 2016.html5', 'Health Surveillance Affirmative and Negative - Michigan7 2015.html5', 'Health care cooperation - DDI 2016 MS.html5', 'Healthcare cooperation 2.0 - DDI 2016 MS.html5', 'Heg Bad Impact Core - Michigan7 2019 FFPSV.html5', 'Heg Core - Gonzaga 2017.html5', 'Heg Core - Wake 2018.html5', 'Heg Good Impact Core - Michigan7 2019 FFPSV.html5', 'Heg Impact File - Michigan7 2021 BFPSW.html5', 'Heg bad - DDI 2014 MS.html5', 'Hegemony - UNT 2013.html5', 'Hegemony Answers - Michigan7 2017 BFHHR.html5', 'Hegemony Bad - SDI 2013.html5', 'Hegemony Bad 3.0 - Michigan7 2014 BEFJR.html5', 'Hegemony Bad Core - Michigan7 2018 BFHPR.html5', 'Hegemony Bad and Answers - HSS 2017 PSW.html5', 'Hegemony Core - Berkeley 2017.html5', 'Hegemony Core - Michigan7 2014 CHHJPV.html5', 'Hegemony Core - Michigan7 2014 GRAMS.html5', 'Hegemony Core - Michigan7 2015.html5', 'Hegemony Core - Michigan7 2018 BFHPR.html5', 'Hegemony Core - SDI 2018 PS.html5', 'Hegemony Core - SDI 2019.html5', 'Hegemony Good   Bad - Michigan7 2017.html5', 'Hegemony Good - Michigan7 2017 BFHHR.html5', 'Hegemony Impact File - Northwestern 2015.html5', 'Hei Ren Aff and Neg - NDCA 2016.html5', 'Heidegger Critique - Berkeley 2014.html5', 'Heidegger Critique - JDI 2014.html5', 'Heidegger Critique - UTNIF 2014.html5', 'Heidegger Critique Wave 2 - JDI 2014.html5', 'Heidegger K - Berkeley 2017.html5', 'Heidegger K - Michigan7 2014.html5', 'Heidegger K - Michigan7 2021.html5', 'Heidegger K - Northwestern 2014.html5', 'Heidegger Supplement - Michigan7 2021 K Lab.html5', 'Heidegger case neg - DDI 2014 SWS.html5', 'High Skilled Aff - Georgetown 2018.html5', 'High Skilled Aff Updates - Georgetown 2018.html5', 'High Skilled Aff and Neg - Updates - Michigan7 2018 BFHPR.html5', 'High Skilled Aff and Neg Updates - Michigan7 2018 BFHPR.html5', 'High Skilled Immigrants Case Neg - DDI 2018 AT.html5', 'High Skilled Neg - Georgetown 2018.html5', 'High Skilled Workers Aff - SDI 2018 BJMSS.html5', 'High Skilled Workers Neg - SDI 2018 BJMSS.html5', 'High Tech Agriculture Advantage - HSS 2016.html5', 'High-Skilled Immigration Negative - Northwestern 2018.html5', 'Hip-Hop Pedagogy K - UTNIF 2017.html5', 'Historical Materialism Critique - UTNIF 2015.html5', 'Historical Materialism K - Michigan7 2016.html5', 'Horse Trading DA - Michigan7 2017 BFHHR.html5', 'Horse Trading DA - Michigan7 2018 MMMR.html5', 'Horse Trading DA - SDI 2018 NR.html5', 'Horse Trading DA - UTNIF 2018.html5', 'Human Rights Aff   Neg - Michigan7 2019 Starter Pack.html5', 'Human Rights Aff 2.0 - Michigan7 2019 HJPP.html5', 'Human Rights Condition CPs - Michigan7 2013 ACHM.html5', 'Human Rights Conditions CP - Berkeley 2016.html5', 'Human Rights Credibility Bad DA - Michigan7 2015.html5', 'Human Rights Neg 2-0 - Michigan7 2019 HJPP.html5', 'Human Trafficking Aff and Neg - SDI 2018 BGHT.html5', 'Humanities DA - Berkeley 2017.html5', 'ICBMs Aff - DDI 2021 KM.html5', 'ICBMs Case Neg - DDI 2021 AT.html5', 'ICBMs Case Neg - DDI 2021 GG.html5', 'ICBMs Case Neg - DDI 2021 KM.html5', 'ICE Aff - Berkeley 2017.html5', 'ICE Aff - Neg - Berkeley 2020 Wave 4.html5', 'ICE Aff-Neg Updates - Berkeley 2017.html5', 'ICE Affirmative Supplement - Michigan7 2015.html5', 'ICE Affirmative and Negative - Michigan7 2015.html5', 'ICE Neg - Berkeley 2017.html5', 'ICJ CP - JDI 2015.html5', 'IDEA Aff - Michigan7 2017 FFRSV.html5', 'IDEA Neg - Michigan7 2017 FFRSV.html5', 'IDEA Neg Updates - Michigan7 2017 BFHHR.html5', 'IDEA Quality of Life Aff - HSS 2017.html5', 'IDEA Quality of Life Neg - HSS 2017.html5', 'ILAW K - MichiganClassic 2014 SS.html5', 'ILaw Adv Aff Neg - TDI 2021 student research.html5', 'ILaw Core - SDI 2018 HLR.html5', 'INCSEA Aff - Michigan7 2016.html5', 'INSCEA Neg - Michigan7 2016.html5', 'IOOS Aff Neg - Northwestern 2014 6 week.html5', 'IPR Aff - DDI 2016 MS.html5', 'IPR Case Neg - DDI 2016 BAM.html5', 'IPR Case Neg - DDI 2016 CT.html5', 'IPR Case Neg - DDI 2016 MS.html5', 'IR Core  - Michigan7 2019 HJPP.html5', 'ISS Affirmative - MSDI 2016.html5', 'ISS Negative - MSDI 2016.html5', 'Ice Breakers Neg - Wake 2016 RKS Seniors.html5', 'Icebreakers Affirmative - Michigan7 2014 CHHJPV.html5', 'Icebreakers Updates 3.0 - Michigan7 2014.html5', 'Identity Critique - Michigan7 2015.html5', 'Identity K Aff Answers - Michigan7 2014.html5', 'Identity Negative - Michigan7 2014.html5', 'Ideology K - UTNIF 2017.html5', 'Illegal Immigrant K 1 - Michigan7 2013.html5', 'Illegitimacy Aff - Wake 2018.html5', 'Illinois Senate Elections DA - SDI 2016.html5', 'Immigration Bad - Northwestern 2013 6WeekJuniors.html5', 'Immigration Court Reform CP - MichiganClassic 2018 BO.html5', 'Immigration Critique - Michigan7 2015.html5',  'Immigration Enforcement Neg - MSDI 2020.html5']

DEBATESUM_EXTREMIST_FILTER_OUT4 = ['Impact Turns - UTNIF 2017.html5', 'Impact Turns Aff   Neg - Michigan7 2019 BFHMRS.html5', 'Impact Turns Core - Michigan7 2017 CBPPR.html5', 'Impacts - Democracy Bad - Michigan 7 2022 BFHR.html5', 'Impacts - Heg Bad - Michigan 7 2022 CPWW.html5', 'Impacts - Heg Good - Michigan 7 2022 CPWW.html5', 'Impacts - Heg Good Bad Supplement - Michigan 7 2022 BEJJ.html5', 'Impacts - Hegemony - Mean Green 2022.html5', 'Impacts - Impact Updates - Michigan Classic 2022 MMP.html5',  'Impeachment DA - UTNIF 2017.html5', 'Imperceptible Movements Kritik - Michigan7 2018 K Lab.html5', 'Imperial Capital K - UTNIF 2018.html5', 'Imperialism - Terror DA K  - Michigan7 2019 K Lab.html5', 'Imperialism K - Berkeley 2016.html5', 'Imperialism K 2 - Michigan7 2013.html5', 'Imperialism Kritik - Georgetown 2018.html5', 'Imperialism Kritik - JDI 2013.html5',  'Indigenous CP vs Fisheries - DDI 2014 MS.html5', 'Infrastructure Politics - Michigan7 2021 BFHPR.html5', 'Infrastructure Politics - SDI 2021.html5', 'Infrastructure Politics DA - MSDI 2021.html5', 'Infrastructure Politics File 3 - Michigan7 2021 BFHPR.html5',  'Innovation Bad - Inequality - Michigan7 2018 BFHPR.html5', 'Intercommunalism Aff - Wake 2016 Early Bird AS.html5', 'Intercommunalism Neg - Wake 2016 Early Bird AS.html5', 'International Actor CPs - Northwestern 2014.html5', 'International Agent CPs - Berkeley 2018.html5', 'International CPs - Northwestern 2014.html5',  'International Relations Feminism Kritik - Berkeley 2013.html5', 'International Relations Theories - Georgetown 2016.html',  'Intralocality K - Michigan7 2014 BEFRJ.html5', 'Intro Packet - Mandatory Minimums and Case Neg - DDIx 2020.html5', 'Iran Politics DA - WSDI 2015.html5',  'Iron Fertilization neg - DDI 2014 CM.html5', 'Iron Triangle Neg - DDI 2019 KM.html5', 'Iron Triangle Neg - DDI 2019 LO.html5', 'Islamophobia Affirmative - DDI 2015 SWS.html5', 'Islamophobia Affirmative - HSS 2015.html5', 'Islamophobia Affirmative - Michigan7 2015.html5', 'Islamophobia Negative - DDI 2015 MM.html5', 'Islamophobia Negative - DDI 2015 ST.html5', 'Islamophobia Negative - DDI 2015 SWS.html5', 'Islamophobia Negative - HSS 2015.html5', 'Islamophobia Negative - Michigan7 2015.html5', 'Islamophombia Neg - DDI 2015 KQ.html5', 'Israel Aff   Neg - Michigan7 2019 Starter Pack.html5', 'Israel Aff Supplement - MichiganClassic 2019 HJO.html5', 'Israel Affirmative - SDI 2019.html5', 'Israel BDS Aff Preinstitute Set - Wake 2019.html5', 'Israel BDS Neg Preinstitute Set - Wake 2019.html5',  'Israel FMF Aff Neg - Berkeley 2019.html5', 'Israel Militant Anti-Militarism Aff   Neg - Michigan7 2019 K Lab.html5', 'Israel Set Col aff and neg - Scholars - Gonzaga 2019.html5', 'Israel Zionism Aff Neg - Berkeley 2019.html5', 'Judicial Activism DA - Michigan7 2018 CPWW.html5',  'Judicial Grounds CP  - Michigan7 2017 BFHHR.html5', 'K - AI Imperialism - CNDI 2022.html5', 'K - Abolition - Michigan7 2020 HKMM.html5', 'K - Abolition - MichiganClassic 2020 ACV.html5', 'K - Abolition Addendum - Michigan7 2020 K Lab.html5', 'K - Abolition Updates - MichiganClassic 2020 BFMZ.html5', 'K - Afropessimism - Michigan 7 2022 K LAB.html5', 'K - Afropessimism - Michigan7 2020 K Lab.html5', 'K - Afropessimism Wave 1 - Michigan7 2020 K Lab.html5', 'K - Anti Blackness Updates - Michigan Classic 2022 MMP.html5', 'K - Baudrillard - Michigan 7 2022 K LAB.html5', 'K - Baudrillard Supplement - Michigan Classic 2022 BBE.html5', 'K - CRT - Michigan7 2020 CCPTW.html5', 'K - Cap - Michigan7 2020 CCPTW.html5', 'K - Cap K - Michigan 7 2022 CPWW.html5', 'K - Cap K - Starter - Michigan 7 2022.html5', 'K - Cap K - UTNIF 2022.html5', 'K - Cap K Updates - Michigan Classic 2022 MMP.html5', 'K - Cap K v K Affs - Michigan7 2020 BFHPR.html5', 'K - Cap K vs K Affs - Michigan7 2020 CCPTW.html5',  'K - Cybernetics - Michigan 7 2022 FMPS.html5', 'K - Cybernetics Supplement - Michigan Classic 2022 BBE.html5', 'K - Deleuze Supplement - Michigan7 2020 K Lab.html5', 'K - Disability - Michigan 7 2022 K LAB.html5', 'K - Empire - Michigan 7 2022 K LAB.html5', 'K - Fem IR - Starter - Michigan 7 2022.html5', 'K - Fem IR 2 - Michigan 7 2022 K LAB.html5', 'K - Fem IR 3 - Michigan 7 2022 K LAB.html5', 'K - Fem IR 4 - Michigan 7 2022 K LAB.html5', 'K - Fem IR K - CNDI 2022.html5', 'K - Fem IR Supplement - CNDI 2022.html5', 'K - Fem IR Supplement - Michigan Classic 2022 BBE.html5', 'K - Fem IR Supplement - Michigan Classic 2022 BLV.html5', 'K - Final K Supplement - Michigan 7 2022 FMPS.html5', 'K - Foucault - MichiganClassic 2020 LOSVW.html5', 'K - IR Imperialism - UTNIF 2022.html5', 'K - Imperialism - Michigan 7 2022 BFHR.html5', 'K - Militarism - CNDI 2022.html5', 'K - Militarism - Emory 2022.html5', 'K - Militarism - Michigan Classic 2022 CGNO.html5', 'K - Militarism Supplement - CNDI 2022.html5', 'K - Necropolitics - Michigan7 2020 K Lab.html5', 'K - New Wave - Michigan7 2020 K Lab.html5', 'K - Orientalism - Michigan 7 2022 CPWW.html5', 'K - Psychoanalysis - Michigan 7 2022 BFHR.html5', 'K - Queer IR - Michigan 7 2022 CPWW.html5', 'K - Queer IR Supplement - Michigan Classic 2022 BBE.html5', 'K - Racial IR - Michigan 7 2022 FMPS.html5', 'K - Racial IR K - Georgetown 2022.html5', 'K - Racial IR Supplement - Michigan Classic 2022 BBE.html5', 'K - Racialized Security - GDI 2022.html5', 'K - Security - Michigan 7 2022 BEJJ.html5', 'K - Security - Packet - SDI 2022.html5', 'K - Security - Starter - GDI 2022.html5', 'K - Security - UTNIF 2022.html5', 'K - Security K - Georgetown 2022.html5', 'K - Security K - MSDI 2022.html5', 'K - Security K - Mean Green 2022.html5', 'K - Security K - Northwestern 2022.html5', 'K - Security Supplement - GDI 2022.html5', 'K - Security Supplement - Michigan Classic 2022 BBE.html5', 'K - Sett Col - Berkeley 2020 Wave 4.html5', 'K - Sett Col - Michigan7 2020 FFPSVV.html5', 'K - Sett Col Preview - Michigan7 2020 Starter Pack.html5', 'K - Settler Colonialism - Michigan 7 2022 K LAB.html5', 'K - Will to Technology - Michigan 7 2022 BFHR.html5', 'K Aff Answers - Michigan7 2016.html5', 'K Aff Neg - Michigan7 2019 BFHR.html5', 'K Aff Supplement - Michigan7 2019 FFPSV.html5', 'K Affirmatives - Answers - Michigan7 2018 BFHPR.html5', 'K Answers Final - Michigan7 2016.html5', 'K Answers Update - Michigan7 2021 BFHPR.html5', 'K Lab Neg Supplement - Michigan7 2016.html5', 'K Lab Updates 7-26 - Michigan7 2021 K Lab.html5', 'K Link Supplement vs Space Col - Michigan7 2021 BFPSW.html5', 'K Links Supplement - Michigan7 2016.html5', 'K Updates Supplement - Michigan7 2021 HKMLR.html5', 'K of Debate Aff - Michigan7 2016.html5', 'K of Debate Case Neg - Michigan7 2016.html5', 'K- Mouths Shut - UNT 2015.html5', 'KJN Wave 3 - JDI 2020.html5', 'Kant Aff Neg - TDI 2021.html5', 'Kavanaugh Confirmation Politics DA - Northwestern 2018.html5', 'Kinship Aff and Neg - Michigan7 2018 K Lab.html5', 'Kritik Answer Supplement - Michigan7 2017 BFHHR.html5', 'Kritik Answers - Aff and Neg - Wave 3 - Michigan7 2018 BFHPR.html5', 'Kritik Answers - Berkeley 2018.html5', 'Kritik Answers - DDI 2015 SWS.html5', 'Kritik Answers - MSDI 2016.html5', 'Kritik Answers - Michigan7 2018.html5', 'Kritik Answers - Wave 1 - Michigan7 2018 BFHPR.html5', 'Kritik Answers - Wave 2 - Michigan7 2018 BFHPR.html5', 'Kritik Answers Supplement - Michigan7 2017 FFRSV.html5', 'Kritik Updates - Wave 2  - Michigan7 2017 AFMMKK.html5', 'Kritikal Cuba Affirmative - DDI 2013 KQ.html5', 'Kritikal water  - JDI 2021.html5', 'Kunlun Physiognomy Aff - Michigan7 2016.html5', 'LGBT Policy Affirmative - Michigan7 2015.html5', 'LGBT Policy Negative - Michigan7 2015.html5', 'LNG Exports Negative - MNDI 2014 LT.html5', 'LOST Affirmative - Michigan7 2014 CFJKMP.html5', 'LOST Affirmative Supplement - Gonzaga 2014.html5', 'Lab Achievement Gap K - DDI 2017 ST.html5', 'Lab Victimology K - Wake 2017.html5', 'Labor Markets DA - Wake 2018.html5', 'Labor Politics DA - Berkeley 2018.html5', 'Lat Crit Kritik - Gonzaga 2013.html5', 'Latin America Democracy Toolbox - Northwestern 2013 6WeekJuniors.html5', 'Latin America Growth Core - Northwestern 2013 6WeekSeniors.html5', 'Latin America Instability Core - DDI 2013.html5', 'Latin American Neoliberalism Kritik - Emory 2013.html5', 'Lead Aff - MSDI 2021.html5', 'Lead Case Neg - MSDI 2021.html5', 'Leftist Disadvantages - UTNIF 2014.html5', 'Legal Status CP - DDI 2018.html5', 'Legalism Critique - Georgia 2015.html5', 'Legalism Critique - HSS 2015.html5', 'Legalism Critique - SDI 2015.html5', 'Legalism Critique - UTNIF 2015.html5', 'Legalism K - MSDI 2020.html5', 'Legalism K - Northwestern 2015 6WS.html5', 'Legalism K - SDI 2018 PSW.html5', 'Legalism K Core - SDI 2018 BJMSS.html5', 'Legalism and Foucault Updates - Michigan7 2015.html5', 'Legalization Aff and Neg Updates - SDI 2020.html5', 'Legalization Affirmative - SDI 2020.html5', 'Legalization Negative - SDI 2020.html5', 'Legislative Veto Neg - DDI 2019 LO.html5', 'Leprosy Aff and Neg - Wake 2018.html5', 'Leprosy K - Full File - Wake 2018.html5', 'Leverage CP - Asia - DDI 2019 Generic.html5', 'Liberal Internationalism Kritik - GMU 2013.html5', 'Liberal Internationalism Kritik Supplement - GMU 2013.html5', 'Liberal Militarism K - DDI 2019 Generic.html5', 'Liberal Order Bad   Good - Michigan7 2019 CCPW.html5', 'Liberal Pacification Kritik - HSS 2019.html5', 'Liberalism Good - Michigan7 2019 CCPW.html5', 'Ligotti masterfile - Wake 2019.html5', 'Lingchi Aff - Michigan7 2016.html5', 'Lingchi Case Neg - Michigan7 2016.html5', 'Linguistic Imperialism K - Michigan7 2016.html5', 'Linguistic Indeterminacy DA v K - Michigan7 2016.html5', 'Linguistic Terrorism Aff - Wake 2018.html5', 'Loan Shift DA - UTNIF 2017.html5', 'Logistics Aff Neg - Michigan7 2021 K Lab.html5', 'Low Skill Aff Neg - Berkeley 2018.html5', 'Luke K Updates - Michigan7 2014 GRAMS.html5', 'Lunches Aff - Emory 2017.html5', 'MLAT Neg - DDI 2020 AT.html5', 'MLATs Case Neg - DDI 2020 GG.html5', 'MPAs Aff Neg - Michigan7 2021 BFHPR.html5', 'MPAs Neg - Samford 2021.html5', 'MSP case negs - DDI 2014 SWS.html5', 'MSP neg - DDI 2014 CM.html5', 'MSP neg - DDI 2014 KQ.html5', 'MTCR Neg - DDI 2019 LO.html5', 'Makah Neg Updates - Michigan7 2014.html5', 'Makah Whaling Aff - Northwestern 2014.html5', 'Makah Whaling Aff and Neg - Gonzaga 2014.html5', 'Makah Whaling Aff and Neg - Michigan7 2014 CFJMP.html5', 'Makah Whaling Aff and Neg - Michigan7 2014 GRAMS.html5', 'Makah Whaling Affirmative - HSS 2014.html5', 'Makah Whaling Neg - Northwestern 2014.html5', 'Makah Whaling Negative - HSS 2014.html5', 'Man3 Aff   Neg - Michigan7 2017 AFMMKK.html5', 'Managerialism Kritik - JDI 2013.html5', 'Mandatory Minimums Aff - Gonzaga 2020 MM.html5', 'Mandatory Minimums Negative - MSDI 2020.html5', 'Mann Act 1ac - DDI 2015 KS.html5', 'Mann Act Negative - DDI 2015 MM.html5', 'Manufacturing and Naval Power Bad - Michigan7 2014.html5', 'Maquiladoras Affirmative - DDI 2013 KQ.html5', 'Marihuana Aff - Neg - Berkeley 2020 Starter Pack.html5', 'Marijuana Aff Supplement - Berkeley 2020 Wave 2.html5', 'Marijuana Decrim Aff - DDI 2020 FS.html5', 'Marine Protected Areas Aff Neg - Michigan7 2021.html5', 'Marine Reserves Affirmative - Georgetown 2014.html5', 'Marine Reserves Negative - SDI 2014.html5', 'Market Economy Status Aff - Michigan7 2016.html5', 'Market Economy Status Neg - Michigan7 2016.html5', 'Marxism Kritik - DDI 2014 SWS.html5', 'Medical Cooperation Case Neg - DDI 2016 BAM.html5', 'Medical Microbes Aff - Northwestern 2014.html5', 'Medical Records Affirmative - JDI 2015.html5', 'Medical Records Supplement - JDI 2015.html5', 'Melancholy Aff - Wake 2018.html5', 'Mental Health Neg - MSDI 2017.html5', 'Metadata Affirmative - Georgia 2015.html5', 'Metadata Negative - Georgia 2015.html5', 'Metaphors Bad - Michigan7 2014 GRAMS.html5', 'Mexican Renewables Aff and Neg Updates - Michigan7 2013 ACHM.html5', 'Mexican Renewables Neg - Michigan7 2013 BFJR.html5', 'Mexico ACE Border Ports Affirmative - Northwestern 2013 6WeekJuniors.html5', 'Mexico ACE Border Ports Negative - Wave 1 - Northwestern 2013 6WeekJuniors.html5', 'Mexico Aff - DDI 2021 GG.html5', 'Mexico Aff - DDI 2021 KS.html5', 'Mexico Aff and Neg Wave 1 - Sophomores - Gonzaga 2019.html5', 'Mexico Aff and Neg supplement - Sophomores - Gonzaga 2019.html5', 'Mexico Case Neg - DDI 2021 AT.html5', 'Mexico Case Neg - DDI 2021 FJ.html5', 'Mexico Case Neg - DDI 2021 HL.html5', 'Mexico Case Neg - DDI 2021 KM.html5', 'Mexico Case Neg - DDI 2021 KS.html5', 'Mexico Energy Affirmative - Northwestern 2013 6WeekSeniors.html5', 'Mexico Energy Affirmative - Wave 1 - Northwestern 2013 Sophomores.html5', 'Mexico Energy Negative - Northwestern 2013 6WeekJuniors.html5', 'Mexico Energy Negative Supplement - Northwestern 2013 6WeekSeniors.html5', 'Mexico Grand Bargain Affirmative - MSDI 2013.html5', 'Mexico Guest Workers Advantages - HSS 2013.html5', 'Mexico Guest Workers Affirmative - HSS 2013.html5', 'Mexico Guest Workers Negative - HSS 2013.html5', 'Mexico HRC Affirmative - HSS 2013.html5', 'Mexico HRC Negative - HSS 2013.html5', 'Mexico Honduras  - Wake 2019.html5', 'Mexico Human Rights Conditions CP - HSS 2013.html5', 'Mexico Judicial Reform Affirmative - MSDI 2013.html5', 'Mexico Judicial Reform Negative - MSDI 2013.html5', 'Mexico Labor-LGBTQ QPQ CPs - Michigan7 2013 BFJR.html5', 'Mexico Maquiladoras Affirmative - Northwestern 2013 4WeekJuniors.html5', 'Mexico Maquiladoras Affirmative and Negative Supplement - Northwestern 2013 4WeekJuniors.html5', 'Mexico Neg - DDI 2019 KM.html5', 'Mexico Neg - DDI 2019 LO.html5', 'Mexico Neg Supplement - Michigan7 2013 HJPP.html5', 'Mexico Negative - MSDI 2013.html5', 'Mexico Negative - Wake 2013.html5', 'Mexico Open Borders Affirmative - HSS 2013.html5', 'Mexico POEs Affirmative - UNT 2013.html5', 'Mexico TRCs Affirmative - JDI 2013.html5', 'Mexico TTIP Affirmative - HSS 2013.html5', 'Mexico Transport Negative - Gonzaga 2013.html5', 'Mexico Visa Affirmative - Northwestern 2013 6WeekSeniors.html5', 'Mexico Women in Juarez Affirmative - JDI 2013.html5', 'Mexico Women in Juarez Negative - JDI 2013.html5', 'Microfinance Affirmative - DDI 2013 KQ.html5', 'Microfinance Negative - DDI 2013 KQ.html5', 'Microfinancing Negative - DDI 2013 SS.html5', 'Middle Passage Aff - Michigan7 2014 BEFJR.html5', 'Middle Passage Aff and Neg - SDI 2018 NR.html5', 'Middle Passage Affirmative - HSS 2014.html5', 'Middle Passage Affirmative and Negative - Berkeley 2014.html5', 'Middle Passage Critique - UNT 2014.html5', 'Middle Passage Neg - Michigan7 2014 BEFJR.html5', 'Middle Passage Negative - HSS 2014.html5', 'Middle Passage Negative Supplement - SDI 2014.html5', 'Middle Passage Updates - Michigan7 2014 BEFJR.html5', 'Middle Passage case negs - DDI 2014 SWS.html5', 'Middle passage neg - DDI 2014 KQ.html5', 'Midterm Elections DA - MNDI 2014 AM.html5', 'Midterms - Updates - MNDI 2018 GJJJ.html5', 'Midterms - Wave 2 - Michigan7 2018 BFHPR.html5', 'Midterms DA - DDI 2018.html5', 'Midterms DA - Northwestern 2017.html5', 'Midterms DA - Northwestern 2018.html5', 'Midterms DA - Packet - SDI 2018.html5', 'Midterms DA - Starter - UTNIF 2018.html5', 'Midterms DA - Starter - Wake 2018.html5', 'Midterms DA - Starter Packet - Michigan7 2018.html5', 'Midterms DA Answers - Packet - SDI 2018.html5', 'Midterms DA Updates - Berkeley 2018.html5', 'Midterms DA Wave 1 - Berkeley 2017.html5', 'Midterms Dems Good DA - JDI 2017.html5', 'Midterms Disadvantage - UTNIF 2014.html5', 'Midterms Disadvantage - WSDI 2014.html5', 'Midterms Disadvantage Update - UTNIF 2014.html5', 'Midterms Updates - Berkeley 2017.html5', 'Midterms Updates - SDI 2018 GMRS.html5', 'Militarism Aff - DDI 2019 LO.html5', 'Militarism Aff-Neg - Michigan7 2019 K Lab.html5', 'Militarism Affirmative - Gonzaga 2014.html5', 'Militarism K - Berkeley 2019.html5', 'Militarism K Supplement - Berkeley 2019.html5', 'Militarism K Supplement - MichiganClassic 2019 HJO.html5', 'Militarism K and PIC - Gonzaga 2018 Scholars.html5', 'Militarism Neg - DDI 2019 LO.html5', 'Military CP - Michigan7 2014 CHHJPV.html5', 'Military Counterplan - JDI 2014.html5', 'Military ESA Aff - MichiganClassic 2017 MZ.html5', "Military ESA's Aff  - JDI 2017.html5", 'Military ESAs Neg  - MichiganClassic 2017 MZ.html5', 'Military Engagement CP - Michigan7 2016.html5', 'Military Impact Aid Aff - Wake 2017.html5', 'Military Recruitment Aff - Berkeley 2017.html5', 'Military Recruitment Aff-Neg - Berkeley 2017.html5', 'Military Recruitment Neg - Berkeley 2017.html5', 'Military in Schools Aff - Gonzaga 2017.html5', 'Milliken 1ac - DDI 2017 AS.html5', 'Milliken Aff  - Berkeley 2017.html5', 'Milliken Aff - SDI 2017.html5', 'Milliken Aff - Wake 2017.html5', 'Milliken Aff Updates - Berkeley 2017.html5', 'Milliken Case Neg - Wake 2017.html5', 'Milliken Neg - Berkeley 2017.html5', 'Milliken Neg - DDI 2017 ST.html5', 'Milliken Neg - SDI 2017.html5', 'Milliken v Bradley 2ac - DDI 2017 AS.html5', 'Misc Case Updates - MichiganClassic 2016.html5', 'Misc K Answers - Michigan7 2016.html5', 'Misc Supplement - Berkeley 2020 Wave 4.html5', 'Miscelleanous - NeoLib%2C Shunning%2C Cuba Embargo - Berkeley 2013.html5', 'Model Minority Aff - Berkeley 2017.html5', 'Model Minority Aff - Wake 2018.html5', 'Model Minority Aff Updates - Berkeley 2017.html5', 'Model Minority K - Berkeley 2017.html5', 'Model Minority K - Wake 2018.html5', 'Model Minority Neg - Berkeley 2017.html5', 'Modern Water Aff Neg - Michigan7 2021 K Lab.html5', 'Morality Starter - SDI 2014.html5', 'Morton K - Michigan7 2021 K Lab.html5', 'Mosques Negative - DDI 2015 ST.html5', 'Moten Case Negative - DDI 2015 SWS.html5', 'Moten Masterfile - Wake 2018.html5', 'Moten Neg - DDI 2020 AT.html5', 'Moten Neg - DDI 2020 FS.html5', 'Mourning Aff - Wake 2018.html5', 'Movements DA - Berkeley 2020 Wave 2.html5', 'Multitude Security Kritik - UTNIF 2013.html5', 'Multivalent Oppression K - Northwestern 2015.html5', 'Myth of the Model Minority Aff - Wake 2017.html5', 'NACTI Aff - MNDI 2013 DM.html5', 'NAIF Neg - Michigan7 2013 HJPP.html5', 'NEPA CP - DDI 2021.html5', 'NEPA CP - Michigan7 2014 HJPV.html5', 'NEPA Environmental Assessment Counterplan - Northwestern 2013 6WeekJuniors.html5', 'NGA CP - Michigan7 2021 CCPW.html5', 'NGA CP - SDI 2021.html5', 'NIH Tradeoff DA - UTNIF 2017.html5', 'NSA AFF - Wake 2015.html5', 'NSA Aff - UNT 2015.html5', 'NSA Affirmative and Negative Supplement - Michigan7 2015.html5', 'NSA Courts Aff and Neg - JDI 2015.html5', 'NSA NEG - Wake 2015.html5', 'NSA Neg - UNT 2015.html5', 'NSD Starter Pack 2021 - Right to strike - NSD 2021.html5', 'NSLA Neg - DDI 2017 ST.html5', 'NWS CP - Michigan7 2021 BFHPR.html5', 'National Standards Aff - Neg - Northwestern 2017.html5', 'Nationalism Kritik - Northwestern 2018.html5', 'Native American Education Neg - SDI 2017 BHL.html5', 'Native American Immigration Affirmative - SDI 2018 HLR.html5', 'Native American Immigration Negative - SDI 2018 HLR.html5', 'Native Americans Aff - MSDI 2017.html5', 'Native Immersion Aff - Berkeley 2017.html5', 'Native Immersion Neg - Berkeley 2017.html5', 'Native Languages Aff - UNT 2017.html5', 'Native Mining Aff - DDI 2021 KM.html5', 'Native Mining Case Neg - DDI 2021 KS.html5', 'Native Water Rights Aff - MGC 2021.html5', 'Native Water Rights Neg - MGC 2021.html5', 'Natives Aff - Michigan7 2021 EHJJPP.html5', 'Natives Aff - Neg - Northwestern 2017.html5', 'Natives Aff Updates Wave 1 - DDI 2017 ST.html5', 'Natives Case Neg - Michigan7 2021 EHJJPP.html5', 'Natives Education 1ac v1 - DDI 2017 ST.html5', 'Natives Education 1ac v2 - DDI 2017 ST.html5', 'Natives Education Aff   Neg - Michigan7 2017 FFRSV.html5', 'Natives Education Aff - Version 2 - Michigan7 2017 FFRSV.html5', 'Natives Education Neg Update - Michigan7 2017.html5', 'Natives Education Supplement - Michigan7 2017 CPBR.html5', 'Natives Neg - DDI 2017 AS.html5', 'Natural Disasters Aff - Michigan7 2016.html5', 'Natural Disasters Neg - Michigan7 2016.html5', 'Natural Gas Add-ons and Answers - Northwestern 2014.html5', 'Natural Gas Aff and Neg - Michigan7 2014.html5', 'Natural Gas Neg - Michigan7 2014.html5']


DEBATESUM_EXTREMIST_FILTER_OUT5 = ['NecroPolitics Aff  - Wake 2017.html5', 'Necropolitics K - Wake 2017.html5', 'Necropolitics Neg - Georgetown 2020.html5', 'Neg - AI Clarity - Michigan Classic 2022 CGNO.html5', 'Neg - AI LAWs - NAUDL 2022.html5', 'Neg - AI TEVV - MNDI 2022 PHA.html5', 'Neg - Abolish ICE - MichiganClassic 2020 ACV.html5', 'Neg - Ban OCOs - MSDI 2022.html5', 'Neg - Cognitive Biotechnology - Michigan 7 2022 CPWW.html5', 'Neg - Collateral Consequences - Michigan7 2020 BFHPR.html5', 'Neg - Corporate Crime - Michigan7 2020 BFHPR.html5', 'Neg - Cyber 5G - Michigan 7 2022 BEJJ.html5', 'Neg - Cyber Info Sharing - Packet - SDI 2022.html5', 'Neg - Cyber Space Assets - Michigan 7 2022 BEJJ.html5', 'Neg - Cyber Space Assets - Michigan 7 2022 BFHR.html5', 'Neg - Cybersecurity - NAUDL 2022.html5', 'Neg - Cyborg Writing - Michigan 7 2022 BFHR.html5', 'Neg - Death Penalty 2 - Michigan7 2020 EHJPS.html5', 'Neg - Digital Cyclops - Michigan 7 2022 BFHR.html5', 'Neg - Disease - UTNIF 2022.html5', 'Neg - Disinformation - CNDI 2022.html5', 'Neg - Gendered LAWs - Michigan 7 2022 FMPS.html5', 'Neg - Guantanamo - Michigan7 2020 EHJPS.html5', 'Neg - Information Warfare - Michigan 7 2022 BEJJ.html5', 'Neg - Intellectual Property - Michigan 7 2022 FMPS.html5', 'Neg - K Affs Misc - Michigan7 2020 BFHPR.html5', 'Neg - Marijuana Decriminalization - Michigan7 2020 Starter Pack.html5', 'Neg - Marijuana Supplement 2 - Michigan7 2020 BFHPR.html5', 'Neg - Misc Updates 1 - Michigan7 2020 BFHPR.html5', 'Neg - OCOs - Starter - Michigan 7 2022.html5', 'Neg - OCOs 2 - Michigan 7 2022 BFHR.html5', 'Neg - PGMs - Michigan 7 2022 BFHR.html5', 'Neg - Policing - Michigan7 2020 EHJPS.html5', 'Neg - Rememory - Michigan 7 2022 BFHR.html5', 'Neg - Sett Col - Michigan7 2020 K Lab.html5', 'Neg - Solvency Takeouts - Michigan 7 2022 BFHR.html5', 'Neg - Solvency Takeouts - Michigan 7 2022 FMPS.html5', 'Neg - Techno Orientalism - Michigan 7 2022 BFHR.html5', 'Neg - War on Drugs - Michigan7 2020 BFHPR.html5', 'Neg Updates - 4 Week Tourney - SDI 2017 PSW.html5', 'Negative Framework Updates - Michigan7 2014 BEFJR.html5', 'Neocolonialism K - Wake 2016 RKS K Lab.html5', 'Neolib K - Georgetown 2021.html5', 'Neolib K - Michigan7 2016.html5', 'Neolib K -- AT Impact Cards - Michigan7 2013.html5', 'Neolib K -- Ethics Cards - Michigan7 2013.html5', 'Neolib K 2 - Michigan7 2013.html5', 'Neolib K Answers - Michigan7 2016.html5', 'Neolib K Supplement - Michigan7 2018 BFHPR.html5', 'Neolib K vs Race Affs - Michigan7 2013.html5', 'Neolib K-Starter - Georgetown 2020.html5', 'Neoliberalism Addendum State-phobia K - Northwestern 2015.html5', 'Neoliberalism Critique - HSS 2014.html5', 'Neoliberalism Critique - MSDI 2015.html5', 'Neoliberalism Critique - Michigan7 2015.html5', 'Neoliberalism Critique - UTNIF 2015.html5', 'Neoliberalism Critique Answers - HSS 2014.html5', 'Neoliberalism Critique Supplement - MSDI 2015.html5', 'Neoliberalism Generic - DDI 2013.html5', 'Neoliberalism Generic - DDI 2016.html5', 'Neoliberalism Impacts - DDI 2015 MM.html5', 'Neoliberalism K  - Gonzaga 2017.html5', 'Neoliberalism K  - Starter Set - Michigan7 2017.html5', 'Neoliberalism K - Berkeley 2016.html5', 'Neoliberalism K - Berkeley 2021.html5', 'Neoliberalism K - DDI 2018.html5', 'Neoliberalism K - Gonzaga 2015.html5', 'Neoliberalism K - JDI 2016.html5', 'Neoliberalism K - JDI 2020.html5', 'Neoliberalism K - MichiganClassic 2019 EGW.html5', 'Neoliberalism K - Northwestern 2015.html5', 'Neoliberalism K - SDI 2016.html5', 'Neoliberalism K Answers - Berkeley 2017 Cubs.html5', 'Neoliberalism K Answers - Starter Pack - UNT 2017.html5', 'Neoliberalism K Updates - Berkeley 2017.html5', 'Neoliberalism K Updates - Michigan7 2017 CPPR.html5', 'Neoliberalism K Updates - SDI 2016.html5', 'Neoliberalism K vs K Affs - Michigan7 2017.html5', 'Neoliberalism Kritik - Berkeley 2018.html5', 'Neoliberalism Kritik - JDI 2013.html5', 'Neoliberalism Kritik - Samford 2013.html5', 'Neoliberalism Kritik - UTNIF 2013.html5', 'Neoliberalism Kritik - Wake 2013.html5', 'Neoliberalism Kritik Answers - HSS 2013.html5', 'Neoliberalism Kritik Supplement - Northwestern 2013 6WeekJuniors.html5', 'Neoliberalism Kritik Supplement - Northwestern 2013 Sophomores.html5', 'Neoliberalism Kritik Update - Emory 2013.html5', 'Neoliberalism Kritik Wave 2 - Berkeley 2018.html5', 'Neoliberalism Kritik Wave 3 - Berkeley 2018.html5', 'Neoliberalism Link -- Prisons - DDI 2015 MM.html5', 'Neoliberalism Supplement - Michigan7 2015.html5', 'Neoliberalism Supplement - SDI 2013.html5', 'Net Widening DA - MSDI 2020.html5', 'NetWidening turns - Gonzaga 2020 LO.html5', 'New Aff Blocks - SDI Tourney Final - SDI 2017 BLRS.html5', 'New Jim Code - UTNIF 2020.html5', 'Nietzsche K - DDI 2015 SWS.html5', 'Nietzsche K - Michigan7 2019 BFHR.html5', 'Nietzsche Kritik - Berkeley 2018.html5', 'Nietzsche Kritik - Michigan7 2018 K Lab.html5', 'Nietzschean Agonism Aff   Neg - Michigan7 2017 AFKKMM.html5', 'Nietzschean Agonism Neg  - Michigan7 2017 BFHHR.html5', 'Nigeria Aff  - MichiganClassic 2019 BAZ.html5', 'Nigeria Neg - MichiganClassic 2019 BAZ.html5', 'No Borders Affirmative - Starter - UTNIF 2018.html5', 'No Borders Negative - Starter - UTNIF 2018.html5', 'No One Is illegal Aff and Neg - Wake 2018.html5', 'No War K - DDI 2016 KQ.html5', 'Nomads Affirmative - Georgetown 2014.html5', 'Nommo 1ac  neg 2.0 - Wake 2019.html5', 'Nonviolence K - MichiganClassic 2019 FH.html5', 'North Korea Affirmative - HSS 2016.html5', 'North Korea BMD Affirmative - MSDI 2016.html5', 'North Korea Disaster Planning Aff - Northwestern 2016.html5', 'North Korea Disaster Planning Neg - Northwestern 2016.html5', 'Nuclear Dialogue Aff - MichiganClassic 2016.html5', 'Nuclear Dialogue Case Neg - DDI 2016 BAM.html5', 'Nuclear Dialogue Neg - MichiganClassic 2016.html5', 'Nuclear Energy Aff - Northwestern 2016.html5', 'Nuclear Energy Coop Aff - Michigan7 2016.html5', 'Nuclear Energy Coop Neg - Michigan7 2016.html5', 'Nuclear Energy Neg - Northwestern 2016.html5', 'Nuclear Fear K - DDI 2016 KQ.html5', 'Nuclear Ks - Michigan7 2013.html5', 'Nuclear Lab to Lab Affirmative - HSS 2016.html5', 'Nuclear Lab to Lab Negative - HSS 2016.html5', 'Nuclear Shipping - Michigan7 2014 GRAMS.html5', 'Nuclear Shipping Aff - DDI 2014 KQ.html5', 'Nuclear Shipping neg - DDI 2014 MS.html5', 'Nullification CP - Northwestern 2015.html5', 'Nurses Affirmative - Wave 1 - Michigan7 2018 BFHPR.html5', 'Nurses Affirmative - Wave 2 - Michigan7 2018 BFHPR.html5', 'Nurses Neg - DDI 2018 KM.html5', 'Nurses Negative - Michigan7 2018 BFHPR.html5', 'OCS Affirmative Environment Advantage - UNT 2014.html5', 'OCS Affirmative Wave 2 - MSDI 2014.html5', 'OCS Drilling Affirmative 1 - HSS 2014.html5', 'OCS Drilling Negative 1 - HSS 2014.html5', 'OSD Negative - Michigan7 2014 GRAMS.html5', 'OSEA Affirmative Wave 2 - Georgetown 2014.html5', 'OSEA Negative Wave 2 - Georgetown 2014.html5', 'OSMR aff v 2 - DDI 2014 SWS.html5', 'OTEC Aff - DDI 2014 TW.html5', 'OTEC Aff - Michigan7 2014 CJFP.html5', 'OTEC Aff - Northwestern 2014.html5', 'OTEC Aff and Neg - GMU 2014.html5', 'OTEC Affirmative - UNT 2014.html5', 'OTEC Affirmative Novice Version - Gonzaga 2014.html5', 'OTEC Neg - Michigan7 2014 CJFP.html5', 'OTEC Neg - Northwestern 2014.html5', 'OTEC Negative - HSS 2014.html5', 'OTEC Negative - MSDI 2014.html5', 'OTEC Negative - WSDI 2014.html5', 'OTEC Updates - Michigan7 2014.html5', 'Objectivism K - Wake 2017.html5', 'Ocean Acidification Critique Affirmative - UTNIF 2014.html5', 'Ocean Adv Answers - Michigan7 2014.html5', 'Ocean Affect Aff - Northwestern 2014.html5', 'Ocean Affect Neg - Northwestern 2014.html5', 'Ocean Biodiversity Core - SDI 2014.html5', 'Ocean Borders K - Michigan7 2014.html5', 'Ocean Discourse Critique - WSDI 2014.html5', 'Ocean Drones Affirmative - Michigan7 2014 BEJFR.html5', 'Ocean Drones Negative - Michigan7 2014 BEJFR.html5', 'Ocean Drones and OTEC - Michigan7 2014.html5', 'Ocean Energy Production Critique - Wake 2014.html5', 'Ocean Impact Defense - UNT 2014.html5', 'Ocean Security Critique - MSDI 2014.html5', 'Oceanic Ontology Critique - UTNIF 2014.html5', 'Oceans Critique - GMU 2014.html5', 'Oceans Development Critique - Emory 2014.html5', 'Oceans Development Critique Answers - Emory 2014.html5', 'Odyssey Aff 2.0 - Michigan7 2014 GRAMS.html5', 'Odyssey Affirmative - Michigan7 2014 GRAMS.html5', 'Odyssey Negative - Michigan7 2014 GRAMS.html5', 'Offshore Drilling Aff - DDI 2021 HL.html5', 'Offshore Drilling Case Neg - DDI 2021 GG.html5', 'Offshore Drilling Case Neg - DDI 2021 HL.html5', 'Offshore Drilling Negative - SDI 2014.html5', 'Offshore Natural Gas Negative - Emory 2014.html5', 'Offshore Wind 3.0 - Michigan7 2014 BEFJR.html5', 'Oil Affirmative - SDI 2014.html5', 'Oil DA  - Michigan7 2019 HKMM.html5', 'Oil DA - Berkeley 2017.html5', 'Oil DA - DDI 2019 KS .html5', 'Oil DA - Michigan7 2014 GJPP.html5', 'Oil Dependence Core - HSS 2014.html5', 'Oil Disadvantage - JDI 2014.html5', 'Oil Drilling Affirmative - Michigan7 2014 GJPP.html5', 'Oil Negative - Wake 2014.html5', 'Onshore CPs Upgrades - MichiganClassic 2014 SS.html5', 'Onto-Proletarian K - Wake 2018.html5', 'Ontological Terror Aff and Neg - Michigan7 2018 K Lab.html5', 'Ontological Terror Aff and Neg - Wake 2018 RKS.html5', 'Ontological Terror Masterfile - Wake 2018.html5', 'Opacity K - Northwestern 2015.html5', 'Open Borders Aff Neg - Berkeley 2018.html5', 'Open Borders Aff and Neg - Starter Packet - Michigan7 2018.html5', 'Open Borders Planless Aff and Neg - Gonzaga 2018 DMB.html5', 'Opium Mourning Aff - Michigan7 2016.html5', 'Opium Mourning Neg - Michigan7 2016.html5', 'Opt Out Neg - DDI 2017 ST.html5', 'Orientalism Critique - SDI 2016.html5', 'Orientalism K - Berkeley 2016.html5', 'Orientalism K - Berkeley 2019.html5', 'Orientalism K - Michigan7 2013.html5', 'Orientalism K - Michigan7 2016.html5', 'Orientalism Masterfile - Wake 2019.html5', 'Orion Sonar Disadvantage - Georgetown 2014.html5', 'Overfishing Core - Gonzaga 2014.html5', 'Overheating DA - Michigan7 2018 MMMR.html5', 'Overload Core - Michigan7 2015.html5', 'Overpopulation DA - Master - Wake 2018.html5', 'PCLOB Process CP - Northwestern 2015.html5', 'PFAS - Gonzaga 2021.html5', 'PICs Core - Northwestern 2015 6WS.html5', 'PQD - SDI 2021 2 week.html5', 'PRISM Affirmative - MNDI 2015.html5', 'PRISM Affirmative Supplement - MNDI 2015.html5', 'PSG Asylum Aff - Georgetown 2018.html5', 'PTD Aff - Michigan7 2021 BFHPR.html5', 'PTD Case Neg - Michigan7 2021 BFHPR.html5', 'Pakistan Affirmative - HSS 2016.html5', 'Pakistan Negative - HSS 2016.html5', 'Pan Aff - Wake 2016 RKS K Lab.html5', 'Pan K - DDI 2015 ST.html5', 'Pan K - MNDI 2016.html5', 'Pan K and Fem Answers - Michigan7 2016.html5', 'Pan Neg - Wake 2016 RKS K Lab.html5', 'Pandemic Aff-Neg - Christina - Wake 2016 RKS.html5', 'Path to Citizenship Affirmative - Northwestern 2018.html5', 'Path to Citizenship Negative - Northwestern 2018.html5', 'Performance Args - Wake 2019.html5', 'Performance Debate Good Bad - Georgetown 2014.html5', 'Performance Master File - Wake 2018 RKS.html5', 'Performance and Method Answers  - Wake 2018.html5', 'Personal Ocean Exploration Affirmative - Wake 2014.html5', 'Personal Ocean Exploration Negative - Wake 2014.html5', 'Pessimism - Starter Packet - Wake 2018.html5', 'Pessimism Critique Answers - SDI 2015.html5', 'Physics Aff - HSS 2017.html5', 'Physics Neg - HSS 2017.html5', 'Picking Losers - MichiganClassic 2014 SS.html5', 'Pinker Answers - Michigan7 2017 FFRSV.html5', 'Pirates Critique - UTNIF 2014.html5', 'Pirates case neg - DDI 2014 TW.html5', 'Plague Reps K - JDI 2015.html5', 'Planless Aff Masterfile - DDI 2020 GG.html5', 'Plastics Disadvantage - UTNIF 2014.html5', 'Plea Bargaining Aff - Neg - Berkeley 2020 Wave 2.html5', 'Plenary Power DA - Michigan7 2018 BFHPR.html5', 'Plenary Powers Aff - Berkeley 2017.html5', 'Plenary Powers Neg - Berkeley 2017.html5', 'PoMo K - Michigan7 2021 K Lab.html5', 'Poland Aff   Neg - Michigan7 2019.html5', 'Police Militarization Aff and Neg - Northwestern 2020.html5', 'Police Union Backlash DA - MSDI 2020.html5', 'Policing Negative - DDI 2015 CT.html5', 'Politic of Funk - Michigan7 2017 AFKKMM.html5', 'Political Geography K - Berkeley 2016.html5', 'Political Geography K Generic - DDI 2016.html5', 'Political Theology - Michigan7 2019 K Lab.html5', 'Politics - Georgia 2015.html5', 'Politics - Negative - UNT 2013.html5', 'Politics - TPA - Michigan7 2015.html5', 'Politics - TPP - HSS 2015.html5', 'Politics - UTNIF 2015.html5', 'Politics Core - UTNIF 2013.html5', 'Politics Core 2.0 - SDI 2017 BLRS.html5', 'Politics Core generic - DDI 2014.html5', 'Politics DA - CIR - Northwestern 2013 Starter Packet.html5', 'Politics DA - Emory 2021.html5', 'Politics DA - SCDI 2013.html5', 'Politics DA - Starter Set - Michigan7 2017.html5', 'Politics DA - UNT 2015.html5', 'Politics DA - UNT 2017.html5', 'Politics DA Updates - Berkeley 2017.html5', 'Politics DA Updates - Michigan7 2017 BFHHR.html5', 'Politics Disadvantage - Immigration - HSS 2013.html5', 'Politics Disadvantage - Internal Links - HSS 2013.html5', 'Politics Disadvantage - Links - HSS 2013.html5', 'Politics Disadvantage - Wake 2013.html5', 'Politics Generic - DDI 2013.html5', 'Politics Internal Links Core - HSS 2014.html5', 'Politics Links - Georgia 2014.html5', 'Politics NSA Reform Disadvantage - Emory 2014.html5', 'Politics Update - Michigan7 2014 GRAMS.html5', 'Politics Update - MichiganClassic 2021 MMP.html5', 'Politics Updates - Michigan7 2018 BFHPR.html5', 'Politics Updates - SDI 2019.html5', 'Politlcs Elections - DDI 2016.html5', 'Population DA - Georgetown 2018.html5', 'Populism Good and Bad - Michigan7 2018 BFHPR.html5', 'Postcolonial Feminism Kritik - UTNIF 2013.html5', 'Postcolonial Feminism Kritik Answers - UTNIF 2013.html5', 'Postcolonialism Kritik - Gonzaga 2013.html5', 'Posthumanism Aff - Wake 2017.html5', 'Preciado v Psychoanalysis - Michigan7 2016.html5', 'Presidential Powers DA - Samford 2019.html5', 'Pressure CP - Berkeley 2016.html5', 'Pressure Core - Berkeley 2016.html5', 'Prisons Aff   Neg - Wave 1 - Michigan7 2017 BFHHR.html5', 'Prisons Aff - JDI 2017.html5', 'Prisons Affirmative - Michigan7 2015.html5', 'Prisons Neg - DDI 2015 KQ.html5', 'Prisons Neg - JDI 2017.html5', 'Prisons Negative - DDI 2015 MM.html5', 'Prisons Negative - DDI 2015 SWS.html5', 'Prisons Negative - Michigan7 2015.html5', 'Prisons Negative Addendum - Northwestern 2015 6WS.html5', 'Privacy Affirmative Version 2 - Northwestern 2015 6WS.html5', 'Privacy Core - HSS 2015.html5', 'Privacy Critique - Michigan7 2015.html5', 'Privacy Generic - DDI 2015 SWS.html5', 'Privacy K - Northwestern 2015.html5', 'Private CP - DDI 2021.html5', 'Privatization CP - Berkeley 2017.html5', 'Privatization CP - Gonzaga 2017.html5', 'Privatization CP - Michigan7 2021 BFHPR.html5', 'Privatization CP - Northwestern 2017.html5', 'Privatization Core - MichiganClassic 2021 FOPVW.html5', 'Privatization DA - UTNIF 2015.html5', 'Privitization CP - Northwestern 2014.html5', 'Process Counterplans - Northwestern 2013 6WeekSeniors.html5', 'Prolif reps K Starter Set - Gonzaga 2019.html5', 'Proliferation - Michigan7 2019 BFHR.html5', 'Prosecutorial Discretion CP - SDI 2018 BGHT.html5', 'Psychoanalysis Bad - Michigan7 2019 CCPW.html5', 'Psychoanalysis Critique - Michigan7 2015.html5', 'Psychoanalysis Critique - UTNIF 2015.html5', 'Psychoanalysis K  - Michigan7 2017 BFHHR.html5', 'Psychoanalysis K - Michigan7 2014 GRAMS.html5', 'Psychoanalysis K - Michigan7 2014 HJPV.html5', 'Psychoanalysis K - Michigan7 2016.html5', 'Psychoanalysis K - Michigan7 2021 HKMLR.html5', 'Psychoanalysis K - Wake 2017.html5', 'Psychoanalysis K - Wake 2019.html5', 'Psychoanalysis K Answers - Michigan7 2016.html5', 'Psychoanalysis K Answers - Michigan7 2017 BFHHR.html5', 'Psychoanalysis K Answers - Supplement - Michigan7 2017.html5', 'Psychoanalysis K vs K Affs - Michigan7 2018 BFHPR.html5', 'Puar Aff and Neg - Wake 2018.html5', 'Puar Critique - Michigan7 2015.html5', 'Public Charge Aff  - DDI 2018 AT.html5', 'Public Charge Aff and Neg - Michigan7 2018 CPWW.html5', 'Public Charge Aff and Neg 2.0 - Gonzaga 2018 Sophomores.html5', 'Public Charge Case Neg - DDI 2018 AT.html5', 'Public Charge Neg - DDI 2018 KM.html5', 'Push Out Aff   Neg - Wake 2017.html5', 'Qatar Aff  - Michigan7 2019 BFHR.html5', 'Qatar Neg - Michigan7 2019 BFHR.html5', 'Qualified Immunity Aff Neg - UTNIF 2020.html5', 'Qualified Immunity Supplement - UTNIF 2020.html5', 'Quantum Life Aff and Neg - Michigan7 2018 K Lab.html5', 'Quare Atlantic Negative - JDI 2014.html5', 'Quashie masterfile - Wake 2019.html5', 'Queer China Aff-Neg - Wave 2 - Wake 2016 RKS.html5', 'Queer Delinquents K - UTNIF 2017.html5', 'Queer Ecology Critique - Georgetown 2014.html5', 'Queer IR K - Berkeley 2019.html5', 'Queer Immigration Aff Neg - Berkeley 2018.html5', 'Queer Inhumanism K - Michigan7 2016.html5', 'Queer K - DDI 2015 ST.html5', 'Queer Migration K - Michigan7 2018 K Lab.html5', 'Queer Nothingness - UTNIF 2020.html5', 'Queer Pessimism Answers - Wake 2018.html5', 'Queer Pessimism neg - DDI 2014 KQ.html5', 'Queer Terror K - Northwestern 2015.html5', 'Queer Theory Answers - Michigan7 2016.html5', 'Queer Theory K - Michigan7 2017 AFKKMM.html5', 'Queer Theory K Answers - Michigan7 2017 AFKKMM.html5', 'Queer Theory Ks - UTNIF 2018.html5', 'Queer Toxicity Aff Supplement - Wake 2016 RKS K Lab.html5', 'Queer Toxicity Neg - Michigan7 2016.html5', 'Queer Trans Ks - Michigan7 2016.html5', 'Queerness Affirmative and Negative - Michigan7 2015.html5', 'Queerness K - Berkeley 2017.html5', 'Queerness Supplement - Michigan7 2016.html5', 'R-Spec File - Wake 2019.html5', 'REE Aff - Northwestern 2014.html5', 'REHY Sex Ed Aff   Neg - MichiganClassic 2017 OW.html5', 'REHY Sex Ed Aff - UTNIF 2017.html5', 'REHY Sex Ed Neg - UTNIF 2017.html5', 'RFS Addendum - Michigan7 2021 BFPSW.html5', 'RFS Aff - Michigan7 2021 BFHPR.html5', 'ROC 2ac K answers Preinstitute Set - Wake 2019 (1).html5', 'ROC Aff and Neg Preinstitute Set - Wake 2019.html5', 'ROC Grand Bargain Aff - DDI 2016 KQ.html5', 'ROC aff supplement - Wake 2019.html5', 'RPP Process CP - Georgetown 2020.html5', 'RRS Lab Supplement - Gonzaga 2017.html5', 'RTE Neg - Michigan7 2017 BFHHLR.html5', 'Race Binary Kritik - Northwestern 2013 6WeekJuniors.html5', 'Race Census Affirmative - Michigan7 2015.html5', 'Race Labor K - UNT 2018.html5', 'Race War v2 - Wake 2019.html5', 'Racial Capitalism K - Michigan7 2021 K Lab.html5', 'Racial Profiling Negative - MSDI 2020.html5', 'Racial Surveillance Affirmative - SDI 2015.html5', 'Racial Surveillance Negative - SDI 2015.html5', 'Radical Autonomy K - Northwestern 2014.html5', 'Radical Convivial Ed Aff - UTNIF 2017.html5', 'Radical Convivial Ed Neg - UTNIF 2017.html5', 'Radical Thought Aff (Baudrillard) - Michigan7 2017 AFKKMM.html5', 'Radical Thought Neg (Baudrillard) - Michigan7 2017 AFKKMM.html5', 'Realism Good - Michigan7 2019 CCPW.html5', 'Reappropriation ans for word pics - DDI 2014 SWS.html5', 'Red Atlantic neg - DDI 2014 MS.html5', 'Red Atlantic neg - DDI 2014 TW.html5', 'Referendum CP - Northwestern 2015.html5', 'Reform Arms Sales CP - Emory 2019.html5', 'Reform Fatigue DA - MSDI 2017.html5', 'Reform Good - Michigan7 2020 CCPTW.html5', 'Refugees Aff - Gonzaga 2018 Scholars.html5', 'Refugees Aff Neg - Berkeley 2018.html5', 'Refugees Aff and Neg - Gonzaga 2018 Pre-Institute Packet.html5', 'Refugees Aff and Neg - Northwestern 2018.html5', 'Refugees Aff and Neg - SDI 2018 SP.html5', 'Refugees Aff and Neg - Wake 2018.html5', 'Refugees Affirmative - SDI 2018 BJMSS.html5', 'Refugees Affirmative - Wave 1 - Michigan7 2018 HJPV.html5', 'Refugees Neg - Gonzaga 2018 Scholars.html5', 'Refugees Neg Addendum - SDI 2018 SP.html5', 'Refugees Negative - SDI 2018 BJMSS.html5', 'Reg Neg CP - Berkeley 2021.html5', 'Reg Neg CP - Michigan7 2021 BFHPR.html5', 'Reg Neg CP - Northwestern 2014.html5', 'Regional Natives Aff - MichiganClassic 2021 FOPVW.html5', 'Regulation CPs  - Michigan7 2017 HJPPV.html5', 'Reject the Res Aff-Neg - Wake 2016 RKS K Lab.html5', 'Relations and Drug War Impact Defense - Georgetown 2013.html5', 'Religious Surveillance Aff Supplement - Michigan7 2015.html5', 'Religious Surveillance Aff and Neg Supplement 2 - Michigan7 2015.html5', 'Religious Surveillance Neg Supplement - Michigan7 2015.html5', 'Remittances DA - SDI 2018 BGHT.html5', 'Remote Sensing Affirmative - Michigan7 2014 CFJP.html5', 'Renewables Disadvantage - UNT 2013.html5', 'Renewables Disadvantage - Wake 2014.html5', 'Reparations Aff   Neg - Wake 2017.html5', 'Reps Defenses - Michigan7 2019 CCPW.html5', 'Reps of Suffering K - Michigan7 2013.html5', 'Reschooling Aff - Berkeley 2017.html5', 'Reschooling Neg - Berkeley 2017.html5', 'Research CP - Northwestern 2017.html5', 'Revisionism Yes-No - MichiganClassic 2019 RW.html5', 'Right to Education Aff - Gonzaga 2017.html5', 'Right to Education Aff - JDI 2017.html5', 'Right to Education Aff - Michigan7 2017 BFHHR.html5', 'Right to Education Aff - Northwestern 2017.html5', 'Right to Education Case Neg - DDI 2017 AS.html5', 'Right to Education Neg - DDI 2017 ST.html5', 'Right to Education Neg - Gonzaga 2017.html5', 'Right to Education Neg - JDI 2017.html5', 'Right to Education Neg - Northwestern 2017.html5', 'Rights K - DDI 2017 ST.html5', 'Rights Malthus - JDI 2015.html5', 'Rights Malthus - Northwestern 2015 6WS.html5', 'Rights Malthus - UTNIF 2015.html5', 'Rights Malthus DA Supplement - Michigan7 2015.html5', 'Risk Analysis Core - Michigan7 2015.html5', 'Risk Assessment Core - Michigan7 2017 HJPPV.html5', 'River Rights Aff - Michigan7 2021 EHJJPP.html5', 'River Rights Aff Neg - Northwestern 2021.html5', 'River Rights Case Neg - Michigan7 2021 EHJJPP.html5', 'Rodriguez 1ac v2 - DDI 2017 ST.html5', 'Rodriguez Aff - DDI 2017 ST.html5', 'Rodriguez Funding Aff  - MSDI 2017.html5', 'Rodriguez Updates - DDI 2017 ST.html5', 'Root Cause Core - HSS 2014.html5', 'Rubbish Affirmative - UTNIF 2014.html5', 'Rural Education Aff   Neg  - MichiganClassic 2017 MT.html5', 'Russia Alliance DA - Michigan7 2016.html5', 'Russia Arctic CP and DA - Northwestern 2014.html5', 'Russia CP - Michigan7 2013 BJFR.html5', 'Russia Counterplan and Disadvantage - SDI 2013.html5', 'Russia DA - Berkeley 2019.html5', 'Russia DA - MSDI 2021.html5', 'Russia DA - Michigan7 2019 Starter Pack.html5', 'Russia DA 2 - Michigan7 2019 BFHR.html5', 'Russia Disadvantage - Wake 2013.html5', 'Russia Fill In DA - Gonzaga 2019.html5', 'Russia SOI DA - Michigan7 2014 GRAMS.html5', 'S Visas Aff Neg - Berkeley 2018.html5', 'S%26ED Aff Wave 2 - Michigan7 2016.html5', 'SAFE Kit Testing Negative - MSDI 2020.html5', 'SCS Affirmative - Berkeley 2016.html5', 'SCS Affirmative Updates - Emory 2016.html5']


DEBATESUM_EXTREMIST_FILTER_OUT6 = ['SCS Grand Bargain Negative - MSDI 2016.html5', 'SCS I-Law Neg - Michigan7 2016.html5', 'SCS ILaw Aff - Michigan7 2016.html5', 'SDWA Aff Neg - Michigan7 2021 BFPSW.html5', 'SEC Affirmative - HSS 2015.html5', 'SED Aff - Michigan7 2016.html5', 'SED Neg - Michigan7 2016.html5', 'SEL Neg - MSDI 2017.html5', 'SIV Aff T Update - DDI 2018 AT.html5', 'SIV Neg - DDI 2018 KM.html5', 'SMR neg - DDI 2014 KQ.html5', 'SP1 XO 12333 Negative - UTNIF 2015.html5', 'SSD Aff and Neg 2.0 - Michigan7 2014 CHHJPV.html5', 'SSD Affirmative - Michigan7 2014 CHHJPV.html5', 'SSD Affirmative Upgrades - Michigan7 2014.html5', 'SSD Negative Updates - Michigan7 2014.html5', 'SSRA Aff Supplement - SDI 2015.html5', 'SSRA Affirmative - Emory 2015.html5', 'SSRA Affirmative - Northwestern 2015.html5', 'SSRA Affirmative - SDI 2015.html5', 'SSRA Affirmative and Negative - Northwestern 2015.html5', 'STEM Aff  - Gonzaga 2017.html5', 'STEM Aff  - MSDI 2017.html5', 'STEM Aff - Starter Pack - UNT 2017.html5', 'STEM Neg - DDI 2017 AS.html5', 'STEM Neg - Northwestern 2017.html5', 'STEM Supplement - Berkeley 2017.html5', 'Sabotage Aff  - Wake 2019.html5', 'Saltwater Slavery Affirmative - SDI 2014.html5', 'Sanctions Core - MSDI 2013.html5', 'Sarbanes Oxley Affirmative and Negative - Northwestern 2015.html5', 'Satellite K Aff Answers - Michigan7 2016.html5', 'Satellite K Answers - Michigan7 2016.html5', 'Satellite Ks - Michigan7 2016.html5', 'Satire Updates - Michigan7 2014.html5', 'Say No Negative - SDI 2016.html5', 'Schmitt Critique - Michigan7 2015.html5', 'Schmitt Kritik - Northwestern 2013 6WeekSeniors.html5', 'School Abolition Neg - Michigan7 2017.html5', 'School Choice Bad and Good - SDI 2017.html5', 'School Choice CP Supplement - HSS 2017.html5', 'School Discipline Aff - Neg - JDI 2017.html5', 'School Lunches Aff - Neg - Northwestern 2017.html5', 'School Lunches Aff - Version 2 - Wake 2017.html5', 'School to Prison Pipeline Aff - Berkeley 2017.html5', 'School to Prison Pipeline Aff - HSS 2017.html5', 'School to Prison Pipeline Neg - Berkeley 2017.html5', 'School to Prison Pipeline Neg - HSS 2017.html5', 'School to Prison Pipeline Supplement - Berkeley 2017.html5', 'School to Prison Pipeline aff - UNT 2015.html5', 'Science Good Bad - Michigan7 2014 GRAMS.html5', 'Sea Turtle Protection Affirmative - JDI 2014.html5', 'Sea Turtle Protection Negative - JDI 2014.html5', 'Seaborgs Aff K Updates - Michigan7 2014.html5', 'Seaborgs Aff and Neg - Michigan7 2014 CHHJPV.html5', 'Seaborgs Affirmative Upgrades - Michigan7 2014.html5', 'Secrecy CP - MichiganClassic 2015.html5', 'Section 702 K Affirmative - Gonzaga 2015.html5', 'Section 702 Negative - DDI 2015 CT.html5', 'Secularism K - UTNIF 2017.html5', 'Security Critique - Georgia 2014.html5', 'Security Critique - JDI 2014.html5', 'Security Critique - SDI 2016.html5', 'Security Critique - UTNIF 2014.html5', 'Security Critique Link Updates - HSS 2016.html5', 'Security K  - 4 Week Lab - Gonzaga 2019.html5', 'Security K  - Michigan7 2019 BFHR.html5', 'Security K - Berkeley 2016.html5', 'Security K - DDI 2014 MS.html5', 'Security K - DDI 2015 KQ.html5', 'Security K - JDI 2015.html5', 'Security K - Michigan7 2016.html5', 'Security K - Michigan7 2018 FFGSV.html5', 'Security K - Michigan7 2021 BFPSW.html5', 'Security K - WSDI 2015.html5', 'Security K - Wake 2019.html5', 'Security K Answers - Michigan7 2016.html5', 'Security K Generic - DDI 2016.html5', 'Security K Links - Wake 2016 RKS Seniors.html5', 'Security K Starter Set - Gonzaga 2019.html5', 'Security K Supplement - JDI 2015.html5', 'Security K supplement - 4 Week Lab - Gonzaga 2019.html5', 'Security Kritik - Berkeley 2018.html5', 'Security Kritik - Emory 2013.html5', 'Security Kritik - Emory 2019.html5', 'Security Kritik - Georgia 2013.html5', 'Security Kritik - JDI 2013.html5', 'Security Kritik - Northwestern 2013 6WeekSeniors.html5', 'Security Kritik - SDI 2019.html5', 'Security Kritik Addendum - Northwestern 2013 6WeekSeniors.html5', 'Security Kritik Generic - DDI 2013.html5', 'Security and Feminism K - Wake 2015.html5', 'Security and Feminism K Answers - Wake 2015.html5', 'Security supplement- Scholars - Gonzaga 2019.html5', 'Segregation Aff - Neg - Northwestern 2017.html5', 'Sequestration Aff - Northwestern 2014.html5', 'Sequestration Neg - Northwestern 2014.html5', 'Set Col - Wake 2019.html5', 'Set Col 2.0 - Wake 2019.html5', 'Settler Colomialism Aff - Wake 2018 RKS.html5', 'Settler Colonialism Aff   Neg - Wake 2017.html5', 'Settler Colonialism Aff - Berkeley 2018.html5', 'Settler Colonialism Answers - Michigan7 2021 BFPSW.html5', 'Settler Colonialism K  - Michigan7 2019 K Lab.html5', 'Settler Colonialism K - Berkeley 2019.html5', 'Settler Colonialism K - DDI 2018.html5', 'Settler Colonialism K - DDI 2021.html5', 'Settler Colonialism K - Michigan7 2019 BFHR.html5', 'Settler Colonialism K - Michigan7 2019 CPWW.html5', 'Settler Colonialism K - Packet - SDI 2018.html5', 'Settler Colonialism K - SDI 2021.html5', 'Settler Colonialism K - UTNIF 2021.html5', 'Settler Colonialism K - Wake 2017.html5', 'Settler Colonialism Kritik - Berkeley 2018.html5', 'Settler Colonialism Kritik - Michigan7 2018 FFGSV.html5', 'Settler Colonialism Kritik - Michigan7 2018 HJPV.html5', 'Settler Colonialism Kritik - Northwestern 2018.html5', 'Settler Enclosure K - MGC 2021.html5', 'Settlerism Aff and Neg - Michigan7 2018 K Lab.html5', 'Settlerism Answers  - Wake 2018.html5', 'Settlerism K - Berkeley 2017.html5', 'Settlerism K - DDI 2015 SWS.html5', 'Settlerism K - UTNIF 2018.html5', 'Settlerism K - Wake 2018 RKS.html5', 'Settlerism Links - Wake 2018.html5', 'Settlerism Supplement - Michigan7 2021 BFPSW.html5', 'Sex Education Aff - JDI 2017.html5', 'Sex Education Neg - JDI 2017.html5', 'Sexual Difference Critique - HSS 2014.html5', 'Shipping Disadvantage - UTNIF 2014.html5', 'Shunning Disadvantage - Gonzaga 2013.html5', 'Shunning K - MSDI 2016.html5', 'Shunning Kritik - MSDI 2013.html5', 'Skilled Immigration Aff and Neg - Starter Packet - Michigan7 2018.html5', 'Small Arms Aff Neg - Berkeley 2019.html5', 'Social Movements AFF - Wake 2015.html5', 'Social Movements NEG - Wake 2015.html5', 'Soft Power Core - SDI 2015.html5', 'Solvency Core - Berkeley 2017.html5', 'Sousveillance Critique - UTNIF 2015.html5', 'South Sudan Refugees Negative - Michigan7 2018 BFHPR.html5', 'Space Aff - DDI 2016 ct.html5', 'Space Case Neg - DDI 2016 CT.html5', 'Space Case Neg v CT - DDI 2016 HS.html5', 'Space Col Supplement - Michigan7 2021 EHJJPP.html5', 'Space Coop Neg - Michigan7 2016.html5', 'Space Cooperation Affirmative - SDI 2016.html5', 'Space Cooperation Case Neg - DDI 2016 BAM.html5', 'Space Cooperation Negative - SDI 2016.html5', 'Space Elevator Aff - Michigan7 2014.html5', 'Space Elevator Neg - Michigan7 2014.html5', 'Space Exploration Aff Neg - TDI 2021.html5', 'Space Trade-off DA - Michigan7 2014 GRAMS.html5', 'Spanos K - Wake 2019.html5', 'Spark - Michigan7 2019 BFHR.html5', 'Special Interest Visas Aff and Neg - SDI 2018 NR.html5', 'Special Interest Visas Aff and Neg Updated - SDI 2018 BGHT.html5', 'Special Needs Affirmative - Michigan7 2015.html5', 'Special Needs Negative - Michigan7 2015.html5', 'Speciesism Critique - Samford 2014.html5', 'Specters Affirmative Wave 2 - UTNIF 2014.html5', 'Specters Negative - UTNIF 2014.html5', 'Specters Negative Wave 2 - UTNIF 2014.html5', 'Speed K - Gonzaga 2017.html5', 'Spending DA - DDI 2017.html5', 'Spending DA - Emory 2017.html5', 'Spending DA - HSS 2017.html5', 'Spending DA - MSDI 2017.html5', 'Spending DA - Northwestern 2017.html5', 'Spending Disadvantage - HSS 2014.html5', 'Standardized Testing Neg - MSDI 2017.html5', 'Starter Pack - Gonzaga 2017.html5', 'Startup Visas Aff Neg - Berkeley 2018.html5', 'State Budget Advantage Final - UNT 2017.html5', 'State CP DAs - Berkeley 2017.html5', 'State Dept Tradeoff DA - Michigan7 2016.html5', 'State Reformism Good Core - HSS 2014.html5', 'States CP - Berkeley 2017.html5', 'States CP - Berkeley 2020 Starter Pack.html5', 'States CP - Berkeley 2021.html5', 'States CP - DDI 2017.html5', 'States CP - Emory 2017.html5', 'States CP - Gonzaga 2017.html5', 'States CP - JDI 2017.html5', 'States CP - JDI 2021.html5', 'States CP - MSDI 2017.html5', 'States CP - MSDI 2020.html5', 'States CP - MSDI 2021.html5', 'States CP - Michigan7 2021 BFHPR.html5', 'States CP - Northwestern 2014.html5', 'States CP - SDI 2018 BGHT.html5', 'States CP - SDI 2021.html5', 'States CP - Starter Pack - JDI 2017.html5', 'States CP - Starter Set - Michigan7 2017.html5', 'States CP - WSDI 2015.html5', 'States CP - Wave 2 - Michigan7 2017 BFHHR.html5', 'States CP 2.0 - Georgetown 2020.html5', 'States CP Answers - Northwestern 2017.html5', 'States CP Supplement - HSS 2017.html5', 'States CP Updates - Berkeley 2017.html5', 'States CP Wave 2 - Berkeley 2017.html5', 'States CP and Federalism - Georgetown 2020.html5', 'States CP and Federalism - Georgetown 2021.html5', 'States CP-Federalism DA - DDI 2021.html5', 'States Federalism - Starter Pack - UNT 2017.html5', 'States and Federalism - Michigan7 2014.html5', 'Stem Aff - HSS 2017.html5', 'Stem Neg - HSS 2017.html5', 'Stick UP  - Wake 2019.html5', 'Stop and Frisk Aff-Neg - Berkeley 2020 Starter Pack.html5', 'Stop and Frisk Affirmative - Michigan7 2015.html5', 'Stored Communications Act Negative - HSS 2015.html5', 'Strategic Ambiguity Critique - SDI 2015.html5', 'Strategic Dialogue Affirmative - SDI 2016.html5', 'Strategic Dialogue Negative - SDI 2016.html5', 'Strikes CP - JDI 2016.html5', 'Student Privacy Neg - MSDI 2017.html5', 'Subjugated Knowledge K Master File - UTNIF 2017.html5', 'Substantive Due Process DA - Gonzaga 2017.html5', 'Sudan Aff - Michigan7 2016.html5', 'Sudan Neg - Michigan7 2016.html5', 'Suffering Reps K - Northwestern 2014.html5', 'Surveillance Aff Neg - Samford 2015.html5', 'Surveillance Assemblages Affirmative - DDI 2015 CT.html5', 'Surveillance Critiques - Michigan7 2015.html5', 'Suspend Whiteness Critique - SDI 2016.html5', 'Syria Affirmative - Northwestern 2018.html5', 'T - Abolition - Michigan7 2020 FFPSVV.html5', 'T - Engagement - Michigan7 2016.html5', 'T - Framework - Michigan 7 2022 FMPS.html5', 'T - Framework - Michigan7 2020 HKMM.html5', 'T - Framework Addendum - Michigan7 2020 BFHPR.html5', 'T - no plan - DDI 2014 SWS.html5', 'T Visas Critical Aff Neg - Berkeley 2018.html5', 'T Visas Policy Aff Neg - Berkeley 2018.html5', 'T-Bonds QPQ CP - Michigan7 2016.html5', 'T-Framework - Michigan7 2021 HKMLR.html5', 'TDL-CNN Aff Neg - MichiganClassic 2021 FOPVW.html5', 'THAAD Affirmative - 2AC Blocks - SDI 2016.html5', 'THAAD Affirmative - 2AC Blocks Supplement - SDI 2016.html5', 'THAAD Affirmative Supplement 1 - SDI 2016.html5', 'THAAD Affirmative Supplement 2 - SDI 2016.html5', 'THAAD Negative - SDI 2016.html5', 'TPP Aff - DDI 2016 CT.html5', 'TPP Aff-Neg Starter Pack - Northwestern 2016.html5', 'TPP Affirmative - Berkeley 2016.html5', 'TPP Case Neg - DDI 2016 HS.html5', 'TPP Neg - Michigan7 2013 HJPP.html5', 'TPP Politics DA - Michigan7 2016.html5', 'TPP Withdrawal Affirmative - Berkeley 2016.html5', 'TPP Withdrawal Negative - Berkeley 2016.html5', 'TRIG Aff Neg - Berkeley 2018.html5', 'TSA Aff and Neg Upgrade - Northwestern 2015 6WS.html5', 'TSA Affirmative - DDI 2015 SWS.html5', 'TSA Body Scanners Negative - Michigan7 2015.html5', 'Tactical Carnival Aff Neg - Berkeley 2019.html5', 'Taiwan Advantage - Georgetown 2016.html5', 'Tax Credit Scholarships CP - UNT 2017.html5', 'Tax Cuts Politics DA - Berkeley 2017.html5', 'Tax Reform Politics DA - HSS 2017.html5', 'Tax Reform Politics DA - Version 2 - Michigan7 2017 BFHHR.html5', 'Tax Reform Politics Starter DA - MSDI 2017.html5', 'Tax Reform Politics Updates - 4 Week Tourney - HSS 2017 GMMS.html5', 'Tax Reform Politics Updates - SDI 2017 GMMS.html5', 'Teacher Tenure Aff - Neg - JDI 2017.html5', 'Tear Gas Aff and Neg - Sophomores - Gonzaga 2019.html5', 'Tech Industry Advantage - HSS 2015.html5', 'Tech Leadership Bad DA - Michigan7 2014 BEFJR.html5', 'Tech Sector Advantage - HSS 2016.html5', 'Techno-Orientalism K - Wake 2017.html5', 'Technocracy K - Northwestern 2017.html5', 'Technology Critique - Michigan7 2015.html5', 'Temporary CP - Michigan7 2018 BFHPR.html5', 'Temporary Visa CPs - UTNIF 2018.html5', 'Terror 2acs - DDI 2015 ST.html5', 'Terror Case Neg - DDI 2015 MM.html5', 'Terror DA - Berkeley 2018.html5', 'Terror DA - Gonzaga 2018 DMB.html5', 'Terror DA Updates - Berkeley 2018.html5', 'Terror List Affirmative - DDI 2013 AC.html5', 'Terror Talk K - Michigan7 2013.html5', 'Terror Talk Kritik - HSS 2013.html5', 'Terror Talk Security Affirmative - DDI 2015 CT.html5', 'Terrorism DA - DDI 2018.html5', 'Terrorism DA - Georgetown 2015.html5', 'Terrorism DA - JDI 2015.html5', 'Terrorism DA - MSDI 2015.html5', 'Terrorism DA - Samford 2015.html5', 'Terrorism DA Answers - MSDI 2015.html5', 'Terrorism DA Supplement - Michigan7 2015.html5', 'Terrorism DA Updates - MNDI 2015.html5', 'Terrorism Reps K - Northwestern 2015.html5', 'Thai Slavery K Aff - DDI 2014 SWS.html5', 'Thai slavery case neg - DDI 2014 TW.html5', 'Theory File - JDI 2015.html5', 'Third Party Affirmative - Northwestern 2015.html5', 'Third Party Negative - Northwestern 2015.html5', 'Third World Consciousness Aff-Neg - Wake 2016 RKS K Lab.html5', 'Third World bad - DDI 2014 TW.html5', 'Third Worldism Cap K - Wake 2018.html5', 'This is not an aff aff and neg - Michigan7 2014 CHHJPV.html5', 'Title 1 Aff - Gonzaga 2017.html5', 'Title 1 Neg - Gonzaga 2017.html5', 'Title 1 Neg - Version 2 - Michigan7 2017 CPPR.html5', 'Title I Aff - Neg Supplement - Gonzaga 2017.html5', 'Title I Aff - Wave 2 - Michigan7 2017 HJPPV.html5', 'Title I Financing Aff - Berkeley 2017.html5', 'Title I Financing Neg - Berkeley 2017.html5', 'Title I Portability Aff - Neg - JDI 2017.html5', 'Tohono Affirmative - DDI 2015 SWS.html5', 'Tohono Negative - DDI 2015 CT.html5', 'Tohono Negative - DDI 2015 SWS.html5', 'Topic DAs - UTNIF 2015.html5', 'Topic Education and Framework Impacts - Northwestern 2015.html5', 'Topic Impact Core - JDI 2015.html5', 'Topic Link Supplement - Michigan7 2021 K Lab.html5', 'Topicality   Framework File - Wake 2017.html5', 'Topicality - Berkeley 2018.html5', 'Topicality - Blake - Wake 2016 RKS.html5', 'Topicality - Emory 2018.html5', 'Topicality - Gonzaga 2018 Sophomores.html5', 'Topicality - Gonzaga 2021.html5', 'Topicality - HSS 2013.html5', 'Topicality - JDI 2021.html5', 'Topicality - MSDI 2017.html5', 'Topicality - MSDI 2020.html5', 'Topicality - MSDI 2021.html5', 'Topicality - Michigan7 2014 BEFJR.html5', 'Topicality - Northwestern 2016 6WI.html5', 'Topicality - SDI 2016.html5', 'Topicality - SDI 2019.html5', 'Topicality - SDI 2020.html5', 'Topicality - Samford 2014.html5', 'Topicality - Starter Packet - UNT 2018.html5', 'Topicality - UTNIF 2013.html5', 'Topicality - Wake 2015.html5', 'Topicality Aff Supplement Wave 4 - Berkeley 2017.html5', 'Topicality Core - Various Versions - Michigan7 2018 FFGSV.html5', 'Topicality Education Supplement Wave 4 - Berkeley 2017.html5', 'Topicality Engagement is QPQ - NDCA 2016.html5', 'Topicality Engagement is QPQ Aff Answers - NDCA 2016.html5', 'Topicality Funding Supplement Wave 4 - Berkeley 2017.html5', 'Topicality Not Framework - China - HSS 2016.html5', 'Topicality Regulation Supplement Wave 4 - Berkeley 2017.html5', 'Topicality Substantial Supplement Wave 4 - Berkeley 2017.html5', 'Topicality Supplement - Michigan7 2015.html5', 'Topicality Supplement - Michigan7 2016.html5', 'Topicality Supplement - Michigan7 2019 FFPSV.html5', 'Topicality Supplement - UTNIF 2015.html5', 'Topicality Supplement 2 - Michigan7 2016.html5', 'Topicality Supplement Wave 3 - Berkeley 2017.html5', 'Topicality Voting Issues - SDI 2014.html5', 'Topicality v K Affs - DDI 2017 ST.html5', 'Torture Trade Aff - Michigan7 2019 CCPW.html5', 'Totalizing the West K - DDI 2017 ST.html5', 'Tournament Updates - DDI 2018 KM.html5', 'Toxicity Aff - Michigan7 2016.html5', 'Track 2 CP - Berkeley 2016.html5', 'Trade Core - Berkeley 2013.html5', 'Trade Core - DDI 2013.html5', 'Tradeoff Disadvantage - UNT 2014.html5', 'Trafficking Aff and Neg - Michigan7 2018 FFGSV.html5', 'Trafficking Case Neg - DDI 2018 AT.html5', 'Trafficking Neg - DDI 2018 KM.html5', 'Trans Rage K - Michigan7 2016.html5', 'Trans-Bathroom Neg - Starter Pack - HSS 2017.html5', 'Transgenic Fish Affirmative and Negative - Berkeley 2014.html5', 'Translators Aff and Neg - Wave 1 - Michigan7 2018 BFHPR.html5', 'Translators Negative - Michigan7 2018 BFHPR.html5', 'Transnational Anti-Blackness Answers - UTNIF 2018.html5', 'Transnational Anti-Blackness K - UTNIF 2018.html5', 'Transphobia K - DDI 2015 ST.html5', 'Travel Ban Affirmative - Michigan7 2018 MMR.html5', 'Travel Ban Affirmative - SDI 2018 BJMSS.html5', 'Travel Ban Court Politics - HSS 2017.html5', 'Travel Ban Court Politics 2.0 - SDI 2017 BHT.html5', 'Travel Ban Negative - Michigan7 2018 FFGSV.html5', 'Travel Ban Negative - Michigan7 2018 MMMR.html5', 'Tribal Mining Case Neg - DDI 2021 AT.html5', 'Tribal Mining Case Neg - DDI 2021 GG.html5', 'Tribal Mining Case Neg - DDI 2021 KM.html5', 'Trump Agenda Bad Supplement - MichiganClassic 2017 GJJS.html5', 'Trump Bad DA - Michigan7 2017.html5', 'Trump Bad DA - Version 2 - Michigan7 2017 BFHHR.html5', 'Trump Base DA - HSS 2017 BHT.html5', 'Trump Impact Core - Michigan7 2018 FFGSV.html5', 'Trump Impact Core - Michigan7 2018 MMMR.html5', 'Tsunami Warning Affirmative - HSS 2014.html5', 'Tsunamis Negative - SDI 2014.html5', 'U Visas Aff and Neg - SDI 2018 PSW.html5', 'U Visas Affirmative - Michigan7 2018 HJPV.html5', 'UN UPR CP - Michigan7 2015.html5', 'US - Earth Relations K Affirmative - Michigan7 2014 BEFJR.html5', 'USBR Tradeoff DA - DDI 2021.html5', 'USCIS Clog DA - Michigan7 2018 CPWW.html5', 'USFG Topicality - SDI 2014.html5', 'USICA  DA - JDI 2021.html5', 'USMCA Politics DA - Michigan7 2019 Starter Pack.html5', 'USMCA Politics DA 2 - Michigan7 2019 BFHR.html5', 'Ukraine aff and neg - Wake 2019.html5', 'Undercommons K - Berkeley 2017.html5', 'Undocumented Immigrants Aff   Neg  - Michigan7 2017 CMMW.html5', 'Unfunded Mandates   Spending DAs - Michigan7 2017 FFRSV.html5', 'Ungovernability K  - Michigan7 2019 Starter Pack.html5', 'Ungovernability K - Michigan7 2019 HKMM.html5', 'Unions DA - Northwestern 2018.html5', 'Update File - Michigan7 2018 FFGSV.html5', 'Update File - MichiganClassic 2018 BO.html5', 'Update File - Wave 3 - Michigan7 2017 HJPPV.html5', 'Updates - Wave 3 - Michigan7 2017 FFRSV.html5', 'Use of Force Standard Aff and Neg - Georgetown 2020.html5', 'Utopian Borders Aff and Neg - Michigan7 2018 CPWW.html5', 'Vaccines Aff   Neg - Wave 1 - Michigan7 2017 FFRSV.html5', 'Vaccines Aff   Neg Updates - Michigan7 2017 FFRSV.html5', 'Vaccines Aff - UTNIF 2017.html5', 'Vaccines DA - HSS 2015.html5', 'Vaccines Neg - UTNIF 2017.html5', 'Value Added Achievement Bad - DDI 2017 AS.html5', 'Venezuela Aff - SCDI 2013.html5', 'Venezuela Affirmative - MSDI 2013.html5', 'Venezuela Affirmative - Wake 2013.html5', 'Venezuela Conditions CP - HSS 2013.html5', 'Venezuela Debt Relief Kritik Affirmative - DDI 2013.html5', 'Venezuela K - Michigan7 2013.html5', 'Venezuela Politics DA 1 - Michigan7 2013.html5', 'Venezuela Ports Aff and Neg - Michigan7 2013.html5', 'Venezuela QPQ Negative - HSS 2013.html5', 'Venezuela Relations Disadvantage - MSDI 2013.html5', 'Venezuela Starter Pack - UTNIF 2013.html5', 'Video Surveillance Affirmative - Michigan7 2015.html5', 'Video Surveillance Negative - Michigan7 2015.html5']


DEBATESUM_EXTREMIST_FILTER_OUT7 = ['Vigilantism DA - MSDI 2020.html5', 'Visa Aff - DDI 2017 AS.html5', 'Visuality and Identity - DDI 2015 SWS.html5', 'Vol CP + Ag DA - Gonzaga 2021.html5', 'Vouchers Aff  - MSDI 2017.html5', 'Vouchers Neg - MSDI 2017.html5', 'WCC Neg - DDI 2020 FS.html5', 'WOTUS Aff - DDI 2021 FJ.html5', 'WOTUS Aff - DDIx 2021.html5', 'WOTUS Aff - Michigan7 2021 EHJJPP.html5', 'WOTUS Aff - SDI 2021.html5', 'WOTUS Case Neg - DDI 2021 AT.html5', 'WOTUS Case Neg - DDI 2021 KM.html5', 'WOTUS Natives Aff Neg - SDI 2021.html5', 'WOTUS Neg Supplement - MichiganClassic 2021 MMP.html5', 'WOTUS Updates - Michigan7 2021 EHJJPP.html5', 'Wag the Dog DA - Berkeley 2017.html5', 'Wages DA - Berkeley 2018.html5', 'Wages DA - Northwestern 2018.html5', 'Wages DA - UNT 2018.html5', 'Wages DA - Version 2 - SDI 2018 BJMSS.html5', 'Wages DA - Wave 2 - Michigan7 2018 BFHPR.html5', 'Wakanda CP - Michigan7 2019 K Lab.html5', 'Wake Work Aff   Neg - Michigan7 2017 AFMMKK.html5', 'War Impact Core - Michigan7 2014.html5', 'War Powers DA - Michigan7 2015.html5', 'Warming Aff - DDI 2016 MS.html5', 'Warming Aff-Neg - JDI 2016.html5', 'Warming Core - Gonzaga 2013.html5', 'Warming Core - Gonzaga 2014.html5', 'Water Capitalism K - Michigan7 2021 CCPW.html5', 'Water Colonialism K - Michigan7 2021 BFHPR.html5', 'Water Colonialism K - Michigan7 2021.html5', 'Water Infrastructure Aff - Michigan7 2021.html5', 'Water Infrastructure Aff Neg - Michigan7 2021 BFPSW.html5', 'Water Infrastructure Case Neg - Michigan7 2021.html5', 'Water Security K  - Gonzaga 2021.html5', 'Water Trading Aff - DDI 2021 GDDI.html5', 'Water Trading Case Neg - DDI 2021 GDDI.html5', 'Water Wars Impact Core - Michigan7 2021 HKMLR.html5', 'Weaponitis K - Sophomores - Gonzaga 2019.html5', 'Weaponitis Kritik - Georgetown 2019.html5', 'Weapons Focus K - DDI 2019 Generic.html5', 'Welfare Aff and Neg - Northwestern 2015.html5', 'Welfare DA - Berkeley 2018.html5', 'Welfare Neg Supplement - Northwestern 2015.html5', 'Welfare Surveillance Aff and Neg - Northwestern 2015.html5', 'Welfare Surveillance Affirmative - SDI 2015.html5', 'Welfare Surveillance Negative - SDI 2015.html5', 'Western Epistemology K - Michigan7 2014.html5', 'White Collar Crime Neg - DDI 2020 HL.html5', 'White FW - Wake 2019.html5', 'Whitewashing Counterplan - DDI 2013 CM.html5', 'Wikimedia Affirmative - Northwestern 2015.html5', 'Wind Power Affirmative 2AC - SDI 2014.html5', 'Word PICs - Michigan7 2018 HJPV.html5', 'Workplace Raids Affirmative - HSS 2015.html5', 'Workplace Raids Negative - HSS 2015.html5', 'Yes No War - Michigan7 2014 GRAMS.html5', 'Yes War - JDI 2015.html5', 'ZTP Negative - MSDI 2020.html5', 'Zambia Aff - Michigan7 2016.html5', 'Zambia Neg - Michigan7 2016.html5', 'Zapatistas Affirmative - DDI 2013 AC.html5', 'Zelman Aff - Berkeley 2017.html5', 'Zero Days Negative - Michigan7 2015.html5', 'Zero Tolerance Aff  - MSDI 2017.html5', 'Zero Tolerance Aff - DDI 2017 AS.html5', 'Zero Tolerance Neg - DDI 2017 AS.html5', 'Zero Tolerance Neg - MSDI 2017.html5', 'Zero Tolerance Policies Aff Wave 1 - DDI 2017 ST.html5', 'Zika Politics DA - Berkeley 2016.html5', 'Zong v1 neg - DDI 2014 MS.html5', 'Zong v2 neg - DDI 2014 MS.html5', 'anti blackness master file - Wake 2019.html5', 'cap good - DDI 2014 SWS.html5', 'cards for rks tournament - Wake 2019.html5', 'communicative engagement aff - DDI 2016 CT.html5', 'community fisheries case neg - DDI 2014 SWS.html5', 'ecofem case neg - DDI 2014 SWS.html5', 'heidegger neg - DDI 2014 KQ.html5', 'politics - infrastructure - JDI 2021.html5']


In [None]:

foo = load_dataset("Hellisotherpeople/DebateSum",split='train',streaming=True).filter(
    lambda x : (x['OriginalDebateFileName'] in DEBATESUM_EXTREMIST_FILTER_OUT2) and (len(x['Full-Document'].split(' '))>500)
)
bar = [e for e in foo]

Downloading readme:   0%|          | 0.00/4.25k [00:00<?, ?B/s]

In [None]:
s = [e for e in bar if e['OriginalDebateFileName'] in [
'DOD and Navy Counterplan - SDI 2014.html5'
]]

In [79]:
 #     'In the following passage from a financial statement "{XX^^CONTENT^^XX}", what does the number "{XX^^NUM^^XX}" represent?\nAnswer: "{XX^^ANS^^XX}"',

def filter_finer139(x):
    return sum(x['ner_tags'])>0

def clean_finer139_for_mlm(x):
    tokens,tags = x['tokens'], x['ner_tags']
    passage = re.sub("(?<=\w)\s+(?=[\’\,\;\.\”\)])","",re.sub("(?<=\d)\s\%","%",re.sub("\$\s(?=\d)","$"," ".join(tokens))))
    #print(passage)
    concept_answers = [
        (w,FINER139_CLASSES[t].strip(),i)
        for i,(w,t) in enumerate(zip(tokens, tags)) if t!=0
    ]
    which_take = ord(passage[:20].replace(" ","")[-1])
    ans_triplet = concept_answers[which_take % len(concept_answers)]
    # get all answers of same classe
    other_answers_of_same_class = [a for a in concept_answers if a[1]==ans_triplet[1]]
    if len(other_answers_of_same_class)==1:
        tnumber,ansclass,idx = ans_triplet
        is_dollar = "$" if tokens[idx-1]=='$' else ""
        is_unit = tokens[idx+1] if tokens[idx+1].lower() in ['million','billion','hundred','thousand','percent','%','hundred-thousand'] else ""
        tnumber = (is_dollar + tnumber + " "+is_unit).strip().replace(' %',"%")
        ansprefix = {0:'a',1:'an',2:"the",3:'the'}[int(ansclass[0] in ['a','e','i','o','u','y'])+2*(ansclass[-1]=='s')]
        ansclass = {0:ansclass.title(), 1:ansclass}[which_take % 2]
        template = TEMPLATE_FINER139[which_take % len(TEMPLATE_FINER139)]
        text = template.replace("{NUMBER}",tnumber).replace("{ANSWER}",ansclass).replace("{PREFIX}",ansprefix).replace('{PASSAGE}',passage)
        return {'text':text}
    else:
        multi_number= []
        for ans_triplet in other_answers_of_same_class:
            tnumber,ansclass,idx = ans_triplet
            is_dollar = "$" if tokens[idx-1]=='$' else ""
            is_unit = tokens[idx+1] if tokens[idx+1].lower() in ['million','billion','hundred','thousand','percent','%','hundred-thousand'] else ""
            tnumber = (is_dollar + tnumber + " "+is_unit).strip().replace(' %',"%")
            multi_number.append(tnumber)
        multi_number = list(set(multi_number))
        # combine the multi numbers
        sep = {0:'/',1:' / ', 2:' and ', 3: ' & ', 4:' + '}[which_take % 5]
        tnumber = ", ".join(multi_number[:-1]) + sep + multi_number[-1]
        ansprefix = {0:'a',1:'an',2:"the",3:'the'}[int(ansclass[0] in ['a','e','i','o','u','y'])+2*(ansclass[-1]=='s')]
        ansclass = {0:ansclass.title(), 1:ansclass}[which_take % 2]
        template = TEMPLATE_FINER139[which_take % len(TEMPLATE_FINER139)]
        text = template.replace("{NUMBER}",tnumber).replace("{ANSWER}",ansclass).replace("{PREFIX}",ansprefix).replace('{PASSAGE}',passage)
        return {'text':text}

TEMPLATE_FINER139 =[
    'In the following passage from a financial statement "{PASSAGE}", what does the number "{NUMBER}" represent?\nAnswer: "{ANSWER}"',
    'Context: {PASSAGE}.\nQuestion: What financial concept does the value "{NUMBER}" pertain to?\nAnswer: {PREFIX} {ANSWER}',
    "Look at the number {NUMBER} in this statement: '{PASSAGE}'. What is {NUMBER}?\nAnswer: {PREFIX} {ANSWER}",
    "Here is a sentence from a company's financial report: '{PASSAGE}'\nQuestion: what is it's {ANSWER}?\nAnswer: '{NUMBER}'",
    "Please extract {PREFIX} {ANSWER} from this paragraph:\n'{PASSAGE}'\nAnswer: {NUMBER}",
    "Question: I need to find an example of {PREFIX} {ANSWER} in the following financial statement: '{PASSAGE}'.\nANSWER: The value {NUMBER} represents the company's {ANSWER}",
    "CONTEXT: {PASSAGE}\nQUESTION: what is the '{ANSWER}'?\nANSWER: {NUMBER}",
    "QUESTION: what is the {ANSWER}?\nCONTEXT: {PASSAGE}\nANSWER: {NUMBER}",
    "The company reported: '{PASSAGE}'. Find the numeric value representing its '{ANSWER}'\nAssistant: Certainly, the value is: {NUMBER}",
    "Human: What does the reported number {NUMBER} refer to in this financial disclosure: '{PASSAGE}'.\nAssistant: {ANSWER}",
    "Q: According to the following passage, what is the company's reported '{ANSWER}'?\nCONTEXT: {PASSAGE}\nAns: '{NUMBER}'",
    "CONTEXT: {PASSAGE}\nQUESTION: what does the value {NUMBER} mean to the company?\nANSWER: {ANSWER} is {NUMBER}",
    "Human: I need an example of '{ANSWER}'. Please write an example of a company's financial statement where {ANSWER} is {NUMBER}.\nAssistant: Certainly, here is an example financial statement: '{PASSAGE}'",
] + [
    'Based on the financial statement excerpt "{PASSAGE}", can you explain what the number(s) "{NUMBER}" represents?\nAnswer: "{ANSWER}"',
    'I came across this in a financial report: "{PASSAGE}". What exactly does the number(s) "{NUMBER}" signify?\nAnswer: {PREFIX} {ANSWER}',
    "Could you help me understand the value '{NUMBER}' in the statement '{PASSAGE}'? What does it represent?\nAnswer: {PREFIX} {ANSWER}",
    "From the financial report, I found the sentence '{PASSAGE}'.\nWhat is the significance of '{NUMBER}'?\nAnswer: '{ANSWER}'",
    "Could you please extract the {ANSWER} from this passage:\n'{PASSAGE}'\nAnswer: {NUMBER}",
    "I'm looking for an example of {PREFIX} {ANSWER} in a financial statement. Can you find it in this passage: '{PASSAGE}'?\nAnswer: The value {NUMBER} represents the company's {ANSWER}",
    "In this context: '{PASSAGE}', can you tell me what the {ANSWER} refers to?\nAnswer: {NUMBER}",
    "What is the value(s) of {ANSWER} in the context of '{PASSAGE}'?\nAnswer: {NUMBER}",
    "According to the company's report: '{PASSAGE}', where can I find the numeric representation of its '{ANSWER}'?\nAssistant: Certainly, the value is: {NUMBER}",
    "Could you explain what the reported number {NUMBER} signifies in this financial disclosure: '{PASSAGE}'?\nAssistant: {ANSWER}",
    "Given the following passage, can you tell me what the company's reported '{ANSWER}' is?\nContext: {PASSAGE}\nAnswer: '{NUMBER}'",
    "What is the {ANSWER} according to this passage: '{PASSAGE}'?\nAnswer: {ANSWER} is {NUMBER}",
    "I need an example of '{ANSWER}'. Can you provide an example of a company's financial statement where {ANSWER} is {NUMBER}?\nAssistant: Certainly, here is an example financial statement: '{PASSAGE}'",
    "I came across this in a financial report: '{PASSAGE}'. What exactly does the number '{NUMBER}' signify?\nAnswer: {PREFIX} {ANSWER}",
    "Can you explain what '{NUMBER}' represents in this financial statement excerpt: '{PASSAGE}'?\nAnswer: {PREFIX} {ANSWER}",
    "Given this paragraph '{PASSAGE}'. Can you tell me what the value '{NUMBER}' stands for?\nAnswer: '{ANSWER}'",
    "I need to find an example of {PREFIX} {ANSWER} in the following financial statement: '{PASSAGE}'.\nANSWER: The value {NUMBER} represents the company's {ANSWER}",
    "In this context: '{PASSAGE}', can you tell me what the value of '{ANSWER}' is?\nAnswer: {NUMBER}",
    "Could you please explain what the reported number {NUMBER} means in this financial disclosure: '{PASSAGE}'?\nAssistant: the value(s) refer to '{ANSWER}'",
    "Based on the financial statement excerpt '{PASSAGE}', can you explain what the number '{NUMBER}' represents?\nAnswer: '{ANSWER}'",
    "Could you explain what '{NUMBER}' represents in this financial statement excerpt: '{PASSAGE}'?\nAnswer: '{ANSWER}'",
    "From the company's financial report: '{PASSAGE}'. What is the significance of '{NUMBER}'?\nAnswer: '{ANSWER}'",
    "I need to find an example of '{ANSWER}'. Please write an example of a company's financial statement where {ANSWER} is {NUMBER}.\nAssistant: Certainly, here is an example financial statement: '{PASSAGE}'",
    "Can you explain what '{NUMBER}' represents in this financial statement excerpt: '{PASSAGE}'?\nAnswer: {PREFIX} {ANSWER}",
    "I need an example of '{ANSWER}'. Can you provide an example of a company's financial statement where {ANSWER} is {NUMBER}?\nAssistant: Certainly, here is an example financial statement: '{PASSAGE}'",
    "Could you help me understand the value '{NUMBER}' in the statement '{PASSAGE}'? What does it represent?\nAnswer: {PREFIX} {ANSWER}",
    "Here is a sentence from a company's financial report: '{PASSAGE}'\nQuestion: what is it's {ANSWER}?\nAnswer: '{NUMBER}'",
    "Look at the number {NUMBER} in this statement: '{PASSAGE}'. What is {NUMBER}?\nAnswer: {PREFIX} {ANSWER}",
    "Please extract {PREFIX} {ANSWER} from this paragraph:\n'{PASSAGE}'\nAnswer: {NUMBER}",
    "Context: {PASSAGE}.\nQuestion: What financial concept does the value '{NUMBER}' pertain to?\nAnswer: {PREFIX} {ANSWER}",
    "In the following passage from a financial statement '{PASSAGE}', what does the number '{NUMBER}' represent?\nAnswer: '{ANSWER}'",
    "What is the {ANSWER} according to this passage: '{PASSAGE}'?\nAnswer: {ANSWER} is {NUMBER}",
    "CONTEXT: {PASSAGE}\nQUESTION: what does the value(s) {NUMBER} mean to the company?\nANSWER: '{ANSWER}' is {NUMBER}",
    "The company reported: '{PASSAGE}'. Find the numeric value representing its '{ANSWER}'\nAssistant: Certainly, the value is: {NUMBER}",
    "Human: What does the reported number {NUMBER} refer to in this financial disclosure: '{PASSAGE}'.\nAssistant: {ANSWER}",
    "Q: According to the following passage, what is the company's reported '{ANSWER}'?\nCONTEXT: {PASSAGE}\nAns: '{NUMBER}'",
    "In this context: '{PASSAGE}', can you tell me what '{NUMBER}' refers to?\nAnswer: {ANSWER}",
    "What is the quantitative meaning of '{ANSWER}' according to this reported-paragraph: '{PASSAGE}'?\nAnswer: {NUMBER}",
    "According to the company's report: '{PASSAGE}', where can I find the numeric representation of its '{ANSWER}'?\nAssistant: Certainly, the value is: {NUMBER}",
    "Could you explain what the reported number {NUMBER} signifies in this financial disclosure: '{PASSAGE}'?\nAssistant: {ANSWER}",
    "Given the following passage, can you tell me what the company's reported '{ANSWER}' is?\nContext: {PASSAGE}\nAnswer: '{NUMBER}'",
    "What is the '{ANSWER}' according to this passage: '{PASSAGE}'?\nAnswer: {ANSWER} is {NUMBER}",
    "I need an example of '{ANSWER}'. Can you provide an example of a company's financial statement where {ANSWER} is {NUMBER}?\nAssistant: Certainly, here is an example financial statement: '{PASSAGE}'"
]

FINER139_CLASSES ={0: 'I do not know',
 1: 'accrual for environmental loss contingencies',
 2: 'weighted average useful life of acquired finite-lived intangible assets',
 3: 'weighted average useful life of acquired finite-lived intangible assets',
 4: 'allocated expense for share-based compensation',
 5: 'amortization of financing costs',
 6: 'amortization of intangible assets',
 7: 'amortization of intangible assets',
 8: 'securities excluded from earnings per share computation due to antidilution',
 9: 'securities excluded from earnings per share computation due to antidilution',
 10: 'area of real estate property',
 11: 'area of real estate property',
 12: 'charges for impairment of assets',
 13: 'number of shares issued for equity interests in business acquisitions',
 14: 'percentage of voting interests acquired in business acquisition',
 15: 'percentage of voting interests acquired in business acquisition',
 16: 'acquisition-related costs in business combinations',
 17: 'consideration transferred in business combinations',
 18: 'contingent consideration liability in business combinations',
 19: 'intangible assets (other than goodwill) acquired and liabilities assumed in business combinations',
 20: 'intangible assets acquired and liabilities assumed in business combinations',
 21: 'amortization of capitalized contract costs',
 22: 'fair value disclosure of cash and cash equivalents',
 23: 'exercise price of warrants or rights in a specific class',
 24: 'shares reserved for future issuance in common stock capital',
 25: 'dividends per share declared in common stock',
 26: 'par or stated value per share of common stock',
 27: 'common stock shares authorized',
 28: 'common stock shares authorized',
 29: 'common stock shares outstanding',
 30: 'concentration risk percentage1',
 31: 'contract with customer liability',
 32: 'contract with customer liability revenue recognized',
 33: 'cumulative effect of new accounting principle in period of adoption',
 34: 'debt instrument basis spread on variable rate1',
 35: 'debt instrument carrying amount',
 36: 'conversion price of convertible debt instrument',
 37: 'face value of debt instrument',
 38: 'face value of debt instrument',
 39: 'fair value of debt instrument',
 40: 'effective interest rate of debt instrument',
 41: 'stated interest rate of debt instrument',
 42: 'maturity date of debt instrument',
 43: 'maturity date of debt instrument',
 44: 'redemption price of debt instrument',
 45: 'term of debt instrument',
 46: 'term of debt instrument',
 47: 'unamortized discount on debt instrument',
 48: 'weighted average interest rate of debt',
 49: 'gross deferred finance costs',
 50: 'net deferred finance costs',
 51: 'defined benefit plan contributions by employer',
 52: 'defined contribution plan cost recognized',
 53: 'depreciation',
 54: 'derivative fixed interest rate',
 55: 'derivative notional amount',
 56: 'consideration for disposal group (including discontinued operation)',
 57: 'effective income tax rate for continuing operations',
 58: 'reconciliation of effective income tax rate to federal statutory income tax rate',
 59: 'total compensation costs not-yet-recognized of non-vested awards (for employee service share-based compensation programs)',
 60: 'total compensation costs not-yet-recognized of non-vested awards (for employee service share-based compensation programs) for period of recognition',
 61: 'total compensation costs not-yet-recognized of non-vested awards (for employee service share-based compensation programs) for period of recognition',
 62: 'total compensation costs not-yet-recognized of non-vested awards other than options (for employee service share-based compensation programs)',
 63: 'tax benefits from compensation expense (for employee service share-based compensation programs)',
 64: 'ownership percentage (for equity method investment)',
 65: 'ownership percentage (equity method investment)',
 66: 'equity method investments',
 67: 'useful life of finite lived intangible assets',
 68: 'useful life of finite lived intangible assets',
 69: 'gains losses on extinguishment of debt',
 70: 'goodwill',
 71: 'goodwill impairment loss',
 72: 'guarantee obligations maximum exposure',
 73: 'income (loss) from equity method investments',
 74: 'income tax expense benefit',
 75: 'interest expense',
 76: 'interest expense debt',
 77: 'lease and rental expense',
 78: 'lessee operating lease renewal term',
 79: 'lessee operating lease renewal term',
 80: 'lessee operating lease term of contract',
 81: 'lessee operating lease term of contract',
 82: 'letters of credit outstanding amount',
 83: 'line of credit',
 84: 'fee percentage for line of credit facility commitment',
 85: 'current borrowing capacity of the line of credit facility',
 86: 'line of credit facility interest rate at period end',
 87: 'maximum borrowing capacity line of credit facility',
 88: 'line of credit facility remaining borrowing capacity',
 89: 'line of credit facility unused capacity commitment fee percentage',
 90: 'long term debt',
 91: 'fair value of long term debt',
 92: 'loss contingency accrual at carrying value',
 93: 'value of damages sought (loss contingency)',
 94: 'estimate of possible loss for the loss contingency',
 95: 'loss contingency pending claims number',
 96: 'pending claims number (loss contingency)',
 97: 'minority-interest ownership percentage by noncontrolling owners',
 98: 'minority interest ownership percentage, by parent',
 99: 'number of operating segments',
 100: 'number of real estate properties',
 101: 'quantity of real estate properties',
 102: 'number of reportable segments',
 103: 'operating lease cost',
 104: 'operating lease expense',
 105: 'operating lease liability',
 106: 'operating lease payments',
 107: 'operating lease right of use asset',
 108: 'operating lease weighted average discount rate percent',
 109: 'weighted-average remaining lease-term of operating lease',
 110: "operating lease's weighted-average remaining term",
 111: 'operating leases rent net expense',
 112: 'operating loss carryforwards',
 113: 'gross payments to acquire businesses',
 114: 'payments to acquire businesses, net of cash acquired',
 115: 'dividend rate of preferred stock (as percent)',
 116: 'preferred stock shares authorized',
 117: 'preferred stock shares authorized',
 118: 'proceeds from issuance of common stock',
 119: 'useful life of property, plants and equipment',
 120: 'useful life of equipment, property and plants',
 121: 'requested rate of increase (or decrease) pertaining to public utilities',
 122: 'related party transaction amounts of transaction',
 123: 'related party transaction amounts of transaction',
 124: 'transaction expenses from transactions with related party',
 125: 'transaction expenses incurred from transacting with related parties',
 126: 'repayments of debt',
 127: 'expected costs related to restructuring',
 128: 'charges related to restructuring',
 129: 'company revenue from contract with customer excluding assessed tax',
 130: 'revenue from contract with customer including assessed tax',
 131: 'company revenue from related parties',
 132: 'revenue remaining performance obligation',
 133: 'revenues',
 134: 'number of shares during stock-sale tranaction',
 135: 'number of shares issued in transaction for sale of stocks',
 136: 'sale of stock price per share',
 137: 'share-based compensation',
 138: 'award vesting period for stock-based compensation arrangement by share-based payment',
 139: 'share based compensation arrangement by share based payment (award vesting period)',
 140: 'share-based compensation arrangement equity instruments (other than options grants)',
 141: 'equity instruments (other than options grants) for stock-based compensation arrangement',
 142: 'weighted-average grant date for the fair value of share based compensation arrangement',
 143: 'non-vested number of equity instruments, for share based compensation not pertaining to options',
 144: 'total fair value of stock-based compensation arrangement (not including options vested) for the reporting period',
 145: 'number of shares authorized for share-based compensation',
 146: 'number of shares authorized for stock-based compensation',
 147: 'number of shares available for grant (for stock-based compensation)',
 148: 'total intrinsic value of exercised options in the reporting period (as part of stock-based awards)',
 149: 'options grants for share-based compensation for the reporting period',
 150: 'weighted-average grant date for the fair value of share based compensation arrangement',
 151: 'share price',
 152: 'award vesting rights percentage for for share-based compensation arrangement by sharebased payment award',
 153: 'vesting rights percentage for share-based compensation arrangement',
 154: 'expiration period of share-based compensation',
 155: 'expiration period for stock-based awards/compensation',
 156: 'new issues issued during period',
 157: 'new issues issued during the reporting period',
 158: 'stock repurchase program authorized amount',
 159: 'repurchase amount of remaining authorized stock in the repurchase program',
 160: 'stock repurchased and retired during the quarter',
 161: 'number of shares repurchased during period',
 162: 'number of shares repurchased during the reporting period',
 163: 'prior year claims and claims adjustment expense for property casualty insurance underwriters (supplemental information)',
 164: "treasury stocks' average cost per share acquired",
 165: 'treasury stock shares acquired',
 166: 'amount of treasury stock acquired',
 167: 'cost method for acquired treasury stock value ',
 168: 'unrecognized tax benefits',
 169: 'unrecognized tax benefits that would impact effective tax rate',
 170: 'gross deferred finance costs',
 171: 'common stock par or stated value per share',
 172: 'loss contingency estimate of possible loss',
 173: 'defined contribution plan recognized cost',
 174: 'fair value of debt instrument',
 175: 'recognized revenue of contract with customer liability',
 176: 'revenue remaining performance obligation',
 177: 'total compensation cost of employee share-based compensation nonvested awards not yet recognized',
 178: 'stated percentage of interest rate for debt instrument',
 179: 'operating loss carryforwards',
 180: 'minority interest ownership percentage by noncontrolling owners',
 181: 'interest expense',
 182: 'long term debt',
 183: 'share based compensation',
 184: 'debt-weighted average interest rate',
 185: 'debt instrument carrying amount',
 186: 'debt instrument convertible conversion price',
 187: 'income tax expense benefit',
                   # done
 188: 'total compensation cost for share-based payment award options granted in the period (weighted average grant date fair value)',
 189: 'nonvested awards - total compensation cost not yet recognized for share-based awards (excluding options) for employee service share-based compensation',
 190: 'equity method investments',
 191: 'unamortized discount on debt instruments',
 192: 'gains/losses on extinguishment of debt',
 193: 'number of shares available for grant for share-based payment awards',
 194: 'recognized identifiable assets acquired and liabilities assumed, intangible assets (other than goodwill) pertaining to business combination',
 195: 'preferred stock: dividend rate percentage',
 196: 'revenue from contracts with customers (including assessed tax)',
 197: 'operating lease: weighted average discount rate percentage',
 198: 'line of credit',
 199: 'maximum borrowing capacity of line of credit facility',
 200: 'effective income tax rate reconciliation at federal statutory income tax rate',
 201: 'commitment fee percentage for line of credit facility',
 202: 'business combination: consideration transferred',
 203: 'common stock dividends per share declared',
 204: 'basis spread on variable rate of debt instrument',
 205: 'disposal group (including discontinued operations): consideration',
 206: 'gross number of share-based payment award options granted in the period (share-based compensation arrangement)',
 207: 'common stock: shares outstanding',
 208: 'amortization of financing costs',
 209: 'line of credit facility: current borrowing capacity',
 210: 'treasury stock value (acquired - cost method)',
 211: 'nonvested number of equity instruments other than options (share-based compensation arrangement)',
 212: 'debt instrument: effective interest rate percentage',
 213: 'sale of stock: price per share',
 214: 'capitalized contract cost amortization',
 215: 'restructuring charges',
 216: 'total fair value of vested equity instruments other than options in period (share-based compensation arrangement)',
 217: 'accrual for environmental loss contingencies',
 218: 'fair value disclosure of cash and cash equivalents',
 219: 'proceeds from issuance of common stock',
 220: 'revenues',
 221: 'recognized identifiable assets acquired and liabilities assumed, for intangibles (due to business combination)',
 222: 'letters of credit: outstanding amount',
 223: 'weighted average grant date fair value of equity instruments (other than options) granted in the period',
 224: 'operating lease payments',
 225: 'line of credit facility: remaining borrowing capacity',
 226: 'payments to acquire businesses (gross)',
 227: 'average cost per share of treasury stock acquired',
 228: 'deferred finance costs (net)',
 229: 'stock repurchase program: authorized amount',
 230: 'interest expense on debt',
 231: 'contract with customer: liability',
 232: 'operating lease expense',
 233: 'depreciation',
 234: 'allocated share-based compensation expense',
 235: 'loss contingency accrual at carrying value',
 236: 'unused capacity commitment fee percentage for line of credit facility',
 237: 'prior year claims and claims adjustment expense for property casualty insurance underwriters (supplemental information)',
 238: 'operating lease liability',
 239: 'revenue from related parties',
 240: 'payments to acquire businesses (net of cash acquired)',
 241: 'business combination: contingent consideration liability',
 242: 'loss contingency: damages sought value',
 243: 'number of operating segments',
 244: 'business acquisition: equity interests issued or issuable - number of shares issued',
 245: 'operating lease: right of use asset',
 246: 'business combination: acquisition-related costs',
 247: 'unrecognized tax benefits',
 248: 'guarantee obligations: maximum exposure',
 249: 'restructuring and related costs: expected cost',
 250: 'defined benefit plan contributions by employer',
 251: 'operating lease cost',
 252: 'derivative: fixed interest rate',
 253: 'goodwill',
 254: 'goodwill impairment loss',
 255: 'common stock capital: shares reserved for future issuance',
 256: 'stock repurchased and retired during period: shares',
 257: 'tax benefit from compensation expense for employee service share-based compensation',
 258: 'income (loss) from equity method investments',
 259: 'number of reportable segments',
 260: 'fair value of long-term debt',
 261: 'repayments of debt',
 262: 'concentration risk percentage',
 263: 'debt instrument: redemption price percentage',
 264: 'cumulative effect of new accounting principle in period of adoption',
 265: 'share price',
 266: 'unrecognized tax benefits that would impact effective tax rate',
 267: 'total intrinsic value of options exercised in the period for share-based compensation arrangement',
 268: 'effective income tax rate (continuing operations)',
 269: 'revenue from contracts with customers (excluding assessed tax)',
 270: 'stock repurchase program: remaining authorized repurchase amount',
 271: 'interest rate for line of credit facility at the end of the reporting period',
 272: 'exercise price of warrant or other exercised right',
 273: 'operating leases rent expense (net)',
 274: 'lease and rental expense',
 275: 'requested rate increase (or decrease) amount (public utilities)',
 276: 'minority interest ownership percentage by parent',
 277: 'asset impairment charges',
 278: 'notional amount of derivative'}


 TEMPLATES_SQUAD = [
    'Q: {QUESTION}\nCONTEXT: {PASSAGE}\nA: {ANSWER}',
    'Human: {QUESTION}\nCONTEXT:{PASSAGE}\nAssistant: {ANSWER}',
    'Given the following passage "{PASSAGE}", {QUESTION}.\nANSWER: {ANSWER}',
    "{QUESTION} Answer based on the following text: '{PASSAGE}'.\nAnswer: {ANSWER}",
    'I have a question: {QUESTION}.\nThe answer is in this excerpt: "{PASSAGE}". Please answer my question.\n\nAssistant: Certainly! The answer is "{ANSWER}"',
    'Human: Given this paragraph "{PASSAGE}", {QUESTION}.\n\nExpert: According the preceding paragraph, the answer to the question "{QUESTION}" is "{ANSWER}"',
    "Question: {QUESTION}\n\nContext: {PASSAGE}\n\n Answer: {ANSWER}",
    "User: I have a question about this passage: {PASSAGE}\n\nAssistant: Yes, please ask me your question?\n\nUser: {QUESTION}\n\nAssistant: The answer is {ANSWER}",
] + [
    "Context: '{PASSAGE}. Based on the preceding context, {QUESTION}\n\n. Assistant: the answer is {ANSWER}",
    'Human: {QUESTION}\nCONTEXT:{PASSAGE}\nAssistant: {ANSWER}',
    'Given the following passage "{PASSAGE}", {QUESTION}.\nANSWER: {ANSWER}',
    "{QUESTION} Please answer based on the following text: '{PASSAGE}'.\nAnswer: {ANSWER}",
    'I have a question: {QUESTION}.\nThe answer is in this excerpt: "{PASSAGE}".\nPlease answer my question.\n\nAssistant: Certainly! The answer is "{ANSWER}"',
    'Human: Given this paragraph "{PASSAGE}", {QUESTION}.\n\nExpert: According the preceding paragraph, the correct answer is "{ANSWER}"',
    "Question: {QUESTION}\n\nContext: {PASSAGE}\n\n Answer: {ANSWER}",
    "USER: Here is a passage to do with '{ANSWER}': {PASSAGE}.\nWhat is an appropriate exam question based on this passage?\n\nRESPONDENT: Here is an example question based on your passage: '{PASSAGE}'",
    "User: I have a question about this passage: {PASSAGE}\n\nAssistant: Yes, what is your question?\n\nUser: {QUESTION}\n\nAssistant: The answer is {ANSWER}",
    "Human: I have a question about this paragraph: {PASSAGE}\n\nAssistant: What is the question?\n\nHuman: {QUESTION} Can you answer?\n\nAssistant: Certainly, the correct answer is '{ANSWER}'",
    "User: Can you provide context for this question '{QUESTION}'\n\nAssistant: Certainly, here is the context: '{PASSAGE}'\n\nUser: What is the answer to the question?\n\nAnswer: '{ANSWER}'",
    "I need information about {QUESTION}.\nContext: {PASSAGE}\nAnswer: {ANSWER}",
    "I'd like to know '{QUESTION}' based on the information in this paragraph: '{PASSAGE}', what is the answer?\nAnswer: {ANSWER}",
    "What is the answer to the question: {QUESTION}\nHere is the paragraph necessary to answer: '{PASSAGE}'\nAnswer: {ANSWER}",
    "Please provide context for this question: {QUESTION}\nAssistant: Certainly, here is the context: '{PASSAGE}'\nWhat is the answer to the question?\nAnswer: {ANSWER}",
    'User: I would like to know "{QUESTION}". I have this information to address the question: "{PASSAGE}". \nBased on that text,  what is the answer?\n\nAssistant\'s Answer: "{ANSWER}"',
    "What is the answer to the question: {QUESTION}?\nHere is some context: '{PASSAGE}'\nAnswer: {ANSWER}",
    'Question: {QUESTION}\n\nCONTEXT: {PASSAGE}\n\nAnswer: {ANSWER}',
    "What is the answer to the question: {QUESTION}?\nHere is the information required to answer: '{PASSAGE}'\nAnswer: {ANSWER}",
    "Human: Please answer the following question based on the information in the 'Context'.\nQuestion: {QUESTION}\nContext:{PASSAGE}\n\nAI: Okay, I believe the answer is '{ANSWER}'",
    "Human: Please answer the following question based on the information below.\nQuestion: {QUESTION}\nContext:{PASSAGE}\n\nAI: Okay, I believe the correct answer to the question '{QUESTION}' is '{ANSWER}'",
    "I'm trying to understand '{QUESTION}', given the information in this text: '{PASSAGE}'\n\nRespondent: Okay, I can answer that. The answer is {ANSWER}",
    "Please help me with {QUESTION}.\nGiven this passage: '{PASSAGE}', what is the answer?\nAnswer: {ANSWER}",
    "I have a question: '{QUESTION}'.\nBased on this context, what is the answer?\nCONTEXT: '{PASSAGE}'\n\nAnswer: {ANSWER}",
    "User: What is the answer to the following question '{QUESTION}'\n\nPlease use this additional context to answer the question: '{PASSAGE}'\n\nAssistant: {ANSWER}",
    'Q: {QUESTION}\nCONTEXT: {PASSAGE}\nA: {ANSWER}',
]

def random_by_char(text, take=3, charlim=10):
    nums = [ord(ch) for ch in 'xqz'+text.replace(' ','')[:charlim]][(-1*take):]
    return 3*prod(nums[:2])-nums[-1]

def clean_squad(x):
    """Converst squad triplet (q,context, a) into a pseudo-conversation using templates"""
    passagetext = x['context']
    template = TEMPLATES_SQUAD[random_by_char(passagetext) % len(TEMPLATES_SQUAD)]
    text = template.replace(
        "{QUESTION}", x['question']
    ).replace(
        "{ANSWER}", x['answers']['text'][0]
    ).replace(
        "{PASSAGE}", passagetext
    )
    return {'text':text}


In [136]:

def clean_stream_refinedweb(x):
    x['text'] = x['content']
    return x

def clean_stream_arxiv(x):
    x['text'] = x['abstract']
    return x

def clean_stream_pubmedsum(x):
    x['text'] = x['article']
    return x

def remove_first_http_url(text):
    """Removes http strings from hackersnews"""
    pattern = r'http[s]*://[^ ]+'
    return re.sub(pattern, '', text, 1)

#def parse_hacker_news(text):
#    """removes hackernews' thread separators ----- ===== ~~~ and removes urls"""
#    return remove_first_http_url(" ".join([" ".join(j.split('\n')[1:]) for j in text.replace("------\n","~~~\n").replace("======\n","~~~\n").split("~~~\n")]))
#def clean_hackernews(x):
#    x['text'] = parse_hacker_news(x['text'])
#    return x

def clean_ledgarmlm(x):
    x['text'] = x['provision']
    return x

def clean_casetextbook(example):
    """Removes tables and excess \n includes somes specifics for Saylor books footmatter"""
    # discards the first 8 percent
    #discard = int(0.08*len(example['text']))
    #example['text'] = example['text'][discard:].replace('\n'," ")
    example = clean_irs_advice_mlm(example)
    # discard the first 8 lines ~ they are usually boilerplate text
    example['text'] = '\n'.join(example['text'].split('\n')[8:])
    example['text'] = example['text'].replace("Saylor URL: http://www.saylor.org/books"," ").replace("Saylor.org", " ").replace('Saylor Books', " ")
    return example

def clean_edgarcorpus(example):
    example['text'] = example['section_1'] + "\n" + example['section_2'] + "\n" + example['section_3'] + "\n" + example['section_7']
    return example

def clean_elseiver_mlm(example):
    example['text'] = example['Clean_Title'] + " - " + example['Clean_Summary'] + "\n" + example['Clean_Text']
    return example

def clean_financial_news_mlm(example):
    example['text'] = example['title'] + "\n" + example['text']
    return example

def filter_pileall_mlm(x):
    return x['meta']['pile_set_name'] in ['NIH ExPorter','OpenWebText2','PubMed Abstracts','StackExchange','Wikipedia (en)','ArXiv']

def filter_europarl_mlm(x):
    return len(x['text'])>60*7 # at least a small paragraphs

def clean_courtlistener(x):
    text = x['text']
    # remove the tables just makes things worse
    #text = "\n".join([s for s in text.split('\n') if not is_potential_table(s)])
    text = text.replace(".\n",'XxXx').replace(":\n",'YyYy').replace("-\n",'').replace("\n"," ").replace('XxXx','.\n').replace('YyYy',':\n')
    return {'text':text}

def clean_irs_advice_mlm(x):
    text = x['text']
    pattern = r'\x0C'
    text = re.sub(pattern, "", text) # ^L characters
    text = re.sub(r'^[\d,.%$+\-\s\=]+\n?$',"",text,flags=re.MULTILINE | re.DOTALL)
    text = re.sub(r'\-{10,}',"",text)
    text = re.sub(r'^(.*)?[Pp]age\s\d+\n?$',"",text,flags=re.MULTILINE)
    #x['text'] = text.replace("\n"," ").strip()
    text = text.replace(".\n",'XxXx').replace(":\n",'YyYy').replace("-\n",'').replace("\n"," ").replace('XxXx','.\n').replace('YyYy',':\n')
    # find tables and remove
    text = "\n".join([s for s in text.split('\n') if not is_potential_table(s)])
    return {'text':text}

def clean_secproceedings_mlm(x):
    text = x['text']
    if 'I.\n' in text:
        text = "".join(re.split(r"^I.\n", text, flags=re.MULTILINE)[1:])
    else:
        text = '\n'.join(text.split('\n')[10:])
    # I don't remember what this removes
    pattern = r'\x0C'
    s = re.sub(pattern, "", text) # ^L characters
    # removes a number ( or (12)) that is just a line with no text
    text = re.sub(r'^(\()*\d+[\.\)]?\n?$', '', text,flags=re.MULTILINE)
    # remove sentence-breaks
    s = s.replace(".\n",'XxXx').replace(":\n",'YyYy').replace("-\n",'').replace("\n"," ").replace('XxXx','.\n').replace('YyYy',':\n')
    s = s.replace('¶',' ')
    x['text'] = s
    return x

def filter_notcodelike(x):
    """checks if text has a lot of non-alphanumeric characters that indicates it is probably computer code / math notation"""
    ratio_specialchar = check_is_code(x['text'])
    return ratio_specialchar<0.1

def clean_hackernews(x):
    x['text']= x['Title'] + ' ' + remove_first_http_url(x['Text'])
    return x

def filter_hackernews(x):
    return (len(x['Text']) > 60) and (check_is_code(x['Text'])<0.1)

#def filter_bigpatent(x):
#    return 'SUMMARY' in x['description'] and 'BACKGROUND' in x['description']

def clean_bigpatent(x):
    start_offset=0; end_offset=1
    if 'BACKGROUND OF THE INVENTION' in x['description']:
        start_offset = x['description'].index('BACKGROUND OF THE INVENTION')+27
    elif 'BACKGROUND OF INVENTION' in x['description']:
        start_offset = x['description'].index('BACKGROUND OF INVENTION')+23
    elif 'BACKGROUND' in x['description']:
        start_offset = x['description'].index('BACKGROUND')+10
    if 'SUMMARY OF THE INVENTION' in x['description']:
        end_offset = x['description'].index('SUMMARY OF THE INVENTION')
    elif 'SUMMARY OF INVENTION' in x['description']:
        end_offset = x['description'].index('SUMMARY OF INVENTION')
    elif 'SUMMARY' in x['description']:
        end_offset = x['description'].index('SUMMARY')
    if end_offset < (start_offset+20):
        return {
            'text': '\n'.join(x['description'].split('\n')[:4]) + x['abstract']
        }
    background_text = x['description'][start_offset:end_offset]
    # remove all [xxxx] number breaks
    background_text = re.sub("\[[0-9]+\]\s*","", background_text)
    return {
        'text': (background_text.strip() + "\n"+ x['abstract'])
    }

def clean_govreport(x):
    x['text'] = x['document']
    return x

def filter_debatesum(x):
    """fiters out extremist/hateful content from the debatesum dataset (auto-labelled, poor precision)"""
    if len(str(x['Full-Document']).split(' '))<400:
        return False
    if x['OriginalDebateFileName'] in DEBATESUM_EXTREMIST_FILTER_OUT1:
        return False
    if x['OriginalDebateFileName'] in DEBATESUM_EXTREMIST_FILTER_OUT2:
        return False
    if x['OriginalDebateFileName'] in DEBATESUM_EXTREMIST_FILTER_OUT3:
        return False
    if x['OriginalDebateFileName'] in DEBATESUM_EXTREMIST_FILTER_OUT4:
        return False
    if x['OriginalDebateFileName'] in DEBATESUM_EXTREMIST_FILTER_OUT5:
        return False
    if x['OriginalDebateFileName'] in DEBATESUM_EXTREMIST_FILTER_OUT6:
        return False
    if x['OriginalDebateFileName'] in DEBATESUM_EXTREMIST_FILTER_OUT7:
        return False
    return True

def clean_debatesum(x):
    x['text'] = re.sub(r'\s+'," ",str(x['Full-Document']))
    return x

## world bank processing functions
def is_potential_table(line):
    """Checks if a text is a table"""
    num_and_special_count = sum([len(w) for w in re.findall(r'[0-9.,=\(\)$€£*\%\-\/\:]+', line)])
    total_chars = len(line)
    if total_chars==0:
        return True
    num_and_special_ratio = num_and_special_count / total_chars
    if num_and_special_ratio > 0.25:
        return True
    words = line.split()
    upper_title_words = [word for word in words if word.isupper() or word.istitle()]
    ratio = len(upper_title_words) / len(words) if len(words) > 0 else 0
    if ratio > 0.5:
        return True
    # Heuristic 3: Check if the line starts with a number (common in tables)
    if re.search("^[0-9]", line) and re.search("[0-9]$", line):
        return True
    nchar = len(re.sub(r"\s+","",line))
    nspace = len(re.sub(r"\w+","",line))
    if nspace !=0 and nchar/nspace <3.2:
        return True
    return False

def clean_worldbank(x):
    # remove weird characters
    pattern = r'\x0C'
    s = re.sub(pattern, "", x['document_text']) # ^L characters
    s = re.sub(r'\\\.', ".", s) # \. artifacts
    # remove excess/inccorect \n breaks
    s = s.replace(".\n",'XxXx').replace(":\n",'YyYy').replace("-\n",'').replace("\n"," ").replace('XxXx','.\n').replace('YyYy',':\n')
    s = s.replace('¶',' ')
    # discard text that is clearly tabular
    s_cleaned = "\n".join([
        p for p in s.split('\n') if not is_potential_table(p)
    ])
    return {'text':s_cleaned}

LIST_OF_HOSTGUEST_PAIRS = [
    ("Interviewer", "Interviewee"),
    ("Host", "Guest"),("Talk Show Host", "Guest Speaker"),("Person 1", "Person 2"),
    ("Podcast Host", "Guest Speaker"),
    ("Moderator", "Panelist"),
    ("Questioner", "Responder"),
    ("Facilitator", "Participant"),
    ("Host", "Expert"),
    ("Quizmaster", "Contestant"),
    ("Speaker", "Contributor"),
    ("Interviewer", "Discussant"),
    ("Agent", "Subject"),
    ("Facilitator", "Delegate"),
    ("Show Host", "Commentator"),
    ("Questioner", "Commentator"),
    ("Speaker", "Respondent"),
    ("Moderator", "Guest Speaker"),
    ("Examiner", "Candidate"),
    ("Instructor", "Student"),
    ("Host", "Expert"),
    ("Interviewer", "Respondent"),
    ("MC", "Panelist"),("Speaker1", "Speaker2"),
]

def clean_lexfridmanchat(x):
    convo = x['conversations']
    nm1,nm2 = LIST_OF_HOSTGUEST_PAIRS[ord(convo[0]['value'].replace(' ',"")[:20][-1]) % len(LIST_OF_HOSTGUEST_PAIRS)]
    map_to_names = {'human':nm1, 'gpt':nm2}
    text = '\n'.join([
        '%s: "%s"' % (map_to_names[talkfrag['from']], talkfrag['value']) for talkfrag in convo
    ])
    return {'text':text}

def clean_essayforum(x):
    return {'text':x['Correct Grammar']}

TEXTSEPARATOR = "%0XTEXTXEPARAT0RX%0"

# variants of Question, Context, Answer for ask historians
TEMPLATES_ASKHISTORIANS = {
    'no_context':[
        "Q:{QUESTION}|||A:{ANSWER}",
        "Q:{QUESTION}\nA:{ANSWER}",
        "{QUESTION}\n\n{ANSWER}",
        "QUESTION:{QUESTION}\nANSWER:{ANSWER}",
        "QUESTION:{QUESTION}|||ANSWER:{ANSWER}",
        "Human:{QUESTION}\n\nAssistant:{ANSWER}",
        "User:{QUESTION}\n\nRespondent:{ANSWER}",
        "Speaker-1:{QUESTION}\n\nSpeaker-2:{ANSWER}",
        "Speaker-A:{QUESTION}\n\nSpeaker-B:{ANSWER}",
    ],
    "context":[
        "Q:{QUESTION} {SELFTEXT}\nA:{ANSWER}",
        "Q:{SELFTEXT} {QUESTION}\n\nA:{ANSWER}",
        "QUESTION:{QUESTION}\n{SELFTEXT}\nANSWER:{ANSWER}",
        "Background:{SELFTEXT}\nQuestion:{QUESTION}\n\nANSWER:{ANSWER}",
        "Intro:{SELFTEXT}\nQuestion:{QUESTION}\n\nANSWER:{ANSWER}",
        "USER:{SELFTEXT} Question:{QUESTION}\n\nRespondent:{ANSWER}",
        "Human:{SELFTEXT} {QUESTION}\n\nAssistant:{ANSWER}",
        "Speaker A:{SELFTEXT} {QUESTION}\n\Speaker B:{ANSWER}",
        "Human:I have a question: {QUESTION}\nAssistant: Can you elaborate a little more?\nHuman:{SELFTEXT}\nAssistant:{ANSWER}"
        "Human:{QUESTION}\nAssistant: Can you rephrase the question, but with more detail?\nHuman:{SELFTEXT}\nAssistant:{ANSWER}"
        "Human:{QUESTION} {SELFTEXT}\n\nAssistant:{ANSWER}",
        "User:{QUESTION}  {SELFTEXT}\n\nRespondent:{ANSWER}",
        "Speaker-1:{SELFTEXT} {QUESTION}\n\nSpeaker-2:{ANSWER}",
        "User:{QUESTION} {SELFTEXT}\n\nBot:{ANSWER}",
        "User:{QUESTION} {SELFTEXT}\n\nBot:{ANSWER}",

    ],}


def filter_askhistorians(x):
    if len(x['answers']['text'])==0:
        return False
    if len(x['title'])<10:
        return False
    if '?' not in x['title']:
        return False
    return x

def make_text_for_askhistorians(question, answers, selftext):
    """splits answers into multiple separable texts
    [print(k+'\n-----\n') for k in make_text_for_askhistorians("Hello?", ["a1",'a2','a3'], "this is ome text").split('%0XTEXTXEPARAT0RX%0')]
    """
    if len(answers)>1:
        out_texts_listed = []
        for answer in answers:
            out_texts_listed.append(make_text_for_askhistorians(
                question, [answer], selftext
            ))
        return TEXTSEPARATOR.join(out_texts_listed)

    # no self text
    if len(selftext)<2:
        # find a template
        template = TEMPLATES_ASKHISTORIANS['no_context'][
                random_by_char(answers[0]) % len(TEMPLATES_ASKHISTORIANS['no_context'])
        ]
        out_text = template.replace("{QUESTION}",question).replace("{ANSWER}",answers[0])
        return out_text
    # yes self-text
    out_texts_listed = [make_text_for_askhistorians(question, answers, "")]
    # find a template
    template = TEMPLATES_ASKHISTORIANS['context'][
        random_by_char(answers[0]) % len(TEMPLATES_ASKHISTORIANS['context'])
    ]
    out_texts_listed += [template.replace("{QUESTION}",question).replace("{ANSWER}",answers[0]).replace("{SELFTEXT}", selftext)]
    return TEXTSEPARATOR.join(out_texts_listed)


def clean_askhistorians(x):
    out_text=  make_text_for_askhistorians(
        x['title'], x['answers']['text'], x['selftext']
    )
    return {'text':out_text}

PAIRS_OF_USERASSISTANT_NAMES = [("Human","Assistant")]*3 + [("User","Assistant")]*3+[
        ("Human","Respondent"), ("User","Respondent"), ("Person 1","Person 2"), ("Speaker 1","Speaker 2"), ("User","Helper"),("Client","Agent"), ("Human","Agent")
]

def clean_isotonicconversations(x):
    """Randomizes the names of speakers and question-askers, as well as removes other non-natural language text"""
    text = x['text']
    if len(list(re.findall("####Human####:",text)) + list(re.findall(r"\n+human\:",text))) == 1:
         names_of_agents = [('Question', "Answer")]*2 + [("Q",'A')] + [('Question', "Response")]
    else:
        names_of_agents = PAIRS_OF_USERASSISTANT_NAMES
    name_human, name_bot = names_of_agents[random_by_char(text,charlim=20) % len(names_of_agents)]
    # easy replace: expect template of ####human###
    text = text.replace("####Human####",name_human).replace("####Assistant####",name_bot)
    # more difficult for weird follow-up questions
    if 'human:' in text.lower() or 'humans:' in text.lower():
        text = re.sub('\n+Huma(n|ns)\:',"\n\n"+name_human+":", text, flags=re.MULTILINE)
    if 'assistant:' in text.lower():
        text = re.sub('\n+Assistant\:',"\n\n"+name_bot+":", text, flags=re.MULTILINE)
    text = text.replace('<|stop|>',"").replace('\nOutput:',"")
    # remove lines that are just a single number
    #text = re.sub("\:\n(?=\d)","^XxXx^",text,flag).
    return {'text':text}

def clean_legalcontractslong(x):
    # remove pagination like 5 -----
    text = re.sub(r"\s*\d+\s*\-+","",x['text'])
    # remove extended -------
    text = re.sub(r"\s*\-{4,}\s*","",text)
    # remove extended ——————————————
    text = re.sub(r"\s*\—{4,}\s*","",text)
    # remove extended __
    text = re.sub(r"\s*\_{3,}\s*","___",text)
    # remove multiple line breaks
    text = re.sub(r"\n+","\n",text,flags=re.MULTILINE)
    # calculate the word count per line
    lines = [l.strip() for l in text.split('\n') if len(l.strip())>0]
    if len(lines)<5:
        return {'text':text}
    # calculate word count
    wc = [len([w for w in l.strip().split(' ') if len(w)>0]) for l in lines]
    # calculate densities
    density = [wc[0]] + [
        sum(l)/3 for l in zip(wc[:-2], wc[1:-1], wc[2:])
    ] + [wc[-1]]
    # threshold
    threshold_on_density = 3.01
    # remove sections likely to be pagination/signatures and other table-like-stuff
    filtered_text = "\n".join([
        l for d,l in zip(density[:-2], lines) if d>=threshold_on_density
    ])
    # attach sentences that are incorrectly split into paragras
    filtered_text = filtered_text.replace(
        ".\n",'XxXx'
    ).replace(":\n",'YyYy').replace(";\n",'ZzZz').replace("-\n",'').replace("\n"," ").replace(
        'XxXx','.\n'
    ).replace('YyYy',':\n').replace('ZzZz',';\n')
    #
    #faketext = "\n".join([
    #    "d:%0.3f:::%s" % (d,s) for d,s in zip(density, lines)
    #])
    return {'text':filtered_text}

In [137]:
#print('NEED TO VERIFY IF PILE DEDUP CAN BE STREAMED: YES')
print('ADD "santoshtyss/us-court-cases" to have case-law"')
print('remove more debates from the filter')

print('Consider filtering big_patent for just the backgrounds')


mlm_streaming_cleaning_functions = {
    #'EleutherAI/pile/all':(lambda x: x, filter_pileall_mlm, ['meta']), # GONE
    'EleutherAI/the_pile_deduplicated':(lambda x: x, filter_notcodelike, []),
    # monology/pile-uncopyrighted -> this seems like the original pile that I was using, with book3 removed
    "tiiuae/falcon-refinedweb":(clean_stream_refinedweb, None, ['url', 'timestamp', 'dump', 'segment', 'image_urls','content']),
    'Skylion007/openwebtext':(lambda x : x, None, []),
    "Cohere/wikipedia-22-12":(lambda x : x, None, ['id', 'title', 'url', 'wiki_id', 'views', 'paragraph_id', 'langs']),
    "Multi-Domain-Expert-Layers/the_pile_books3_packed_128k":(lambda x: x, None, ['meta']),
    "nRuaif/book2-lite-cleaned":(lambda x: {'text':x['text'][1000:]}, None, []),
    "macrocosm/arxiv_abstracts":(clean_stream_arxiv, None, ['embeddings', 'doi','abstract']),
    "ccdv/pubmed-summarization":(clean_stream_pubmedsum, None, ['abstract','article']),
    #"conceptofmind/pile_uspto_backgrounds":(lambda x : x ,None, ['meta']),
    'big_patent':(clean_bigpatent, None, ['description', 'abstract']),
    "pile-of-law/pile-of-law/euro_parl":(lambda x : x, filter_europarl_mlm, ['created_timestamp', 'downloaded_timestamp', 'url']),
    #"philArchive": fails, but available as subset in eloukas/edgar-corpus as domain=='PhilPapers'
    'kerinin/hackernews-stories':(clean_hackernews, filter_hackernews, ['Title','Text','labels']),
    "https://the-eye.eu/public/AI/pile_v2/data/NIH_ExPORTER_awarded_grant_text.jsonl.zst":(lambda x:x, None,['meta']),
    "https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A/download?path=%2F&files=LEDGAR_2016-2019.jsonl.zip":(clean_ledgarmlm,None,['provision','source']),
    "pile-of-law/pile-of-law/r_legaladvice":(lambda x : x, None, ['created_timestamp', 'downloaded_timestamp', 'url']),
    "pile-of-law/pile-of-law/exam_outlines":(lambda x : x, None, ['created_timestamp', 'downloaded_timestamp', 'url']),
    "pile-of-law/pile-of-law/cc_casebooks":(clean_casetextbook, None, ['created_timestamp', 'downloaded_timestamp', 'url']), # clean_casetextbook
    "eloukas/edgar-corpus":(
        clean_edgarcorpus, None, [
            'filename', 'cik', 'year', 'section_1A', 'section_1B', 'section_4', 'section_1', 'section_2', 'section_3', 'section_7',
            'section_5', 'section_6', 'section_8', 'section_9', 'section_10', 'section_7A', 'section_9A', 'section_9B',
            'section_11', 'section_12', 'section_13', 'section_14', 'section_15'
        ]),
    "Rahmaa/ElsevieR_ClEaN":(clean_elseiver_mlm, None, ['Unnamed: 0', 'Clean_Title', 'Clean_Text', 'Clean_Summary']),
    'ashraq/financial-news-articles':(clean_financial_news_mlm, None, ['title','url']),
    'pile-of-law/pile-of-law/courtlistener_opinions':(clean_courtlistener, None, ['created_timestamp', 'downloaded_timestamp', 'url']),
    "pile-of-law/pile-of-law/sec_administrative_proceedings":(clean_secproceedings_mlm, None, ['created_timestamp', 'downloaded_timestamp', 'url']),
    "pile-of-law/pile-of-law/irs_legal_advice_memos":(clean_irs_advice_mlm, None, ['created_timestamp', 'downloaded_timestamp', 'url']),
    'launch/gov_report':(clean_govreport, None, ['id','document','summary']),
    'izumi-lab/open-text-books':(lambda x: x, None, []),
    'gigant/ted_descriptions':(lambda x: {'text':x['descr']}, None, []),
    'Skelebor/book_titles_and_descriptions':(lambda x : {'text': x['description']},lambda x : len(str(x['description']))>80, []),
    'joelito/legal_case_document_summarization':(lambda x: {'text':x['summary']}, None, []),
    'joelito/legal-mc4/en':(lambda x:x, None, ['url','timestamp','matches']),
    'Hellisotherpeople/DebateSum':(clean_debatesum, filter_debatesum,[
        '#CharsAbstract', '#CharsDocument', '#CharsExtract', '#WordsAbstract', '#WordsDocument', '#WordsExtract', 'AbsCompressionRatio', 'Abstract', 'Citation',
        'DebateCamp', 'ExtCompressionRatio', 'Extract', 'Tag', 'Unnamed: 0', 'Year', 'Full-Document','OriginalDebateFileName'
    ]),
    'lukesjordan/worldbank-project-documents':(clean_worldbank, None, ['project_id','document_text','document_type']),
    '64bits/lex_fridman_podcast_for_llm_vicuna':(clean_lexfridmanchat, None, ['conversations','id']),
    'nid989/EssayFroum-Dataset':(clean_essayforum, None, ['Cleaned Essay','Correct Grammar']),
    "nlpaueb/finer-139":(clean_finer139_for_mlm, filter_finer139, ['ner_tags','tokens','id']),
    'squad':(clean_squad, None, ['context','question','answers','title','id']),
    'Pavithree/askHistorians':(clean_askhistorians, filter_askhistorians, ['q_id','title','selftext','document','subreddit','url','answers']),
    "Isotonic/human_assistant_conversation":(clean_isotonicconversations, None, ["prompt","response"]),
    "albertvillanova/legal_contracts":(clean_legalcontractslong, None,[]),
}

# entries: url, subset, probability, size, option(name of postprocess subsetting), shuffle?
mlm_files = [
    ('EleutherAI/the_pile_deduplicated', None, 16.21, 134000000, 'mlm', (1650, 81405), 0.1), # 1650 files each with ~?
    # monology/pile-uncopyrighted -> this seems like the original pile that I was using, with book3 removed
    ("tiiuae/falcon-refinedweb", None, 17.11, 968000000, "mlm", (5534, 174000), 0.1), # CC; has 5534 files as parquet (each with ~174919)
    ('Skylion007/openwebtext', None, 5.0, 4000000, 'mlm', (21, 213000), 0.1),
    ("Cohere/wikipedia-22-12", 'en', 35.0, 8590000, "mlm",(351, 100000), 0.16), # wikipedia has 351 files (each with 100000 examples)
    ("Multi-Domain-Expert-Layers/the_pile_books3_packed_128k", None, 4.8/2, 34500, "mlm", (15, 9900), 0.15), # has 15 files (each with with ~9978/9983)
    ("nRuaif/book2-lite-cleaned", None, 4.8/2, 81500, "mlm", (818, 100), 0.1),
    ("macrocosm/arxiv_abstracts", None, 3.6, 2250000, "mlm", (23, 2250000//23), 0.12), # set to zero because in PILE (has 23 parquet files)
    ("ccdv/pubmed-summarization", None, 0, 120000, "mlm", False, 0.12), # 3.75 set to zero because elsiever and pubmed in Pile below
    ('big_patent', 'all', 0.60, 154000, 'mlm', False, 0.15), # use as an alternative to /NIH_ExPORTER_awarded_grant_text.jsonl.zst
    ("pile-of-law/pile-of-law",'euro_parl', 0.55, 7254, "mlm", False, 0.1),
    # I think I should remove the hackernews because it was originally included as a discussion-tree in pile
    ('kerinin/hackernews-stories', None, 0, 31300, 'mlm', (8, 52220), 0.1), # 1.7 hackernews stories alternative: this was originally included because of discussion
    ("https://the-eye.eu/public/AI/pile_v2/data/NIH_ExPORTER_awarded_grant_text.jsonl.zst", None, 0, 985651, "mlm", False, 0.15), # still works, but may fail eventually
    ("https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A/download?path=%2F&files=LEDGAR_2016-2019.jsonl.zip", None, 6.25, 1200000, "mlm", False, 0.2),
    ("pile-of-law/pile-of-law",'r_legaladvice', 1.63, 109740, "mlm", False, 0.15),
    ("pile-of-law/pile-of-law",'exam_outlines', 0.1, 12, "mlm",False, 0.2), # useless (but interesting)
    ("pile-of-law/pile-of-law",'cc_casebooks',0.5, 59 ,"mlm",False, 0.2),
    ("eloukas/edgar-corpus", "full", 2.15, 47000, "mlm",(28, 4000), 0.15), # has 28 files each with 1k-5k (variable amount of data: 1styear 1060 vs 5508 in 2018
    ("Rahmaa/ElsevieR_ClEaN", None, 1.7, 31600, "mlm", False, 0.15),
    ('ashraq/financial-news-articles', None, 1.0, 306000, "mlm", (2, 153100), 0.1), # has 2 files (each with 153121)
    ('pile-of-law/pile-of-law','courtlistener_opinions',  1.25, 1000000 , "mlm", (16, 229000), 0.1), # has 16 files (each with 229678 to 526543)
    ('pile-of-law/pile-of-law',"sec_administrative_proceedings", 0.9, 10805, "mlm", False, 0.1), # 118.4 MiB
    ('pile-of-law/pile-of-law',"irs_legal_advice_memos", 0.76, 442, "mlm", False,0.18), # 35.8 MiB
    ('launch/gov_report','plain_text',0.55, 17500, 'mlm', False, 0.1),
    ('izumi-lab/open-text-books',None,  3.45, 150000, 'mlm', False, 0.15),
    ('gigant/ted_descriptions',None, 0, 5705, 'mlm', False, 0.2), # too small and irrelevant
    ('Skelebor/book_titles_and_descriptions', None, 2.24, 1000000,'mlm', (2, 1000000//2), 0.2),
    ('joelito/legal_case_document_summarization',None, 2.2, 7700, 'mlm', False, 0.2),
    ('joelito/legal-mc4','en', 1.1, 180000, 'mlm', False, 0.1),
    ('Hellisotherpeople/DebateSum', None, 1.58, 24647, 'mlm',False, 0.9),
    ('lukesjordan/worldbank-project-documents', None, 0.35, 15700, 'mlm', False, 0.08),
    ('64bits/lex_fridman_podcast_for_llm_vicuna',None, 0.52, 17200,'mlm',False,0.5),
    ('nid989/EssayFroum-Dataset',None, 0.71, 25600,'mlm',False,0.5),
    ('nlpaueb/finer-139',None, 1.05, 179195, 'mlm',False, 0.8),
    ('squad',None, 1.59, 87600, 'mlm', False, 0.2),
    ('Pavithree/askHistorians',None, 0.59, 51300,'mlm',False, 0.8),
    ("Isotonic/human_assistant_conversation",None,1.05, 58700, 'mlm',(3, 195590),0.09),
    ("albertvillanova/legal_contracts", None, 1.0, 106000, 'mlm', False, 0.15),
] #

# entries: url, subset, probability, size, option(name of postprocess subsetting), shuffle?
# looks like the Pile is finally gone
# monology/pile says it will be available in december
# the_pile_openwebtext2 -> substitute? (no)
# could just use: EleutherAI/the_pile_deduplicated and then filter for english and exclude too much special characters

print([k[2] for k in mlm_files])
total_prob = sum([k[2] for k in mlm_files])
for url, f in zip(mlm_files,mlm_streaming_cleaning_functions.keys()):
    print("%0.3f" % (url[2]/total_prob) + "  "+ url[0] + " ||| " + f + '\n')

data_streaming_config = {
    'files':mlm_files,
    'val_size':2000,
    'min_seq_length':48,
    'max_seq_length':512,
    'max_chunk_size':6,
    'train_chunk_size':6000,
    'max_chunk_start':1000000,
    "seed":42,
}


ADD "santoshtyss/us-court-cases" to have case-law"
remove more debates from the filter
Consider filtering big_patent for just the backgrounds
[16.21, 17.11, 5.0, 35.0, 2.4, 2.4, 3.6, 0, 0.6, 0.6, 0, 0, 6.25, 1.63, 0.1, 0.5, 2.15, 1.7, 1.0, 1.35, 0.9, 0.76, 0.55, 3.4, 0, 2.24, 2.2, 1.1, 1.58, 0.35, 0.5, 0.71, 1.0, 1.55, 0.56, 1.0, 1.0]
0.139  EleutherAI/the_pile_deduplicated ||| EleutherAI/the_pile_deduplicated

0.146  tiiuae/falcon-refinedweb ||| tiiuae/falcon-refinedweb

0.043  Skylion007/openwebtext ||| Skylion007/openwebtext

0.299  Cohere/wikipedia-22-12 ||| Cohere/wikipedia-22-12

0.021  Multi-Domain-Expert-Layers/the_pile_books3_packed_128k ||| Multi-Domain-Expert-Layers/the_pile_books3_packed_128k

0.021  nRuaif/book2-lite-cleaned ||| nRuaif/book2-lite-cleaned

0.031  macrocosm/arxiv_abstracts ||| macrocosm/arxiv_abstracts

0.000  ccdv/pubmed-summarization ||| ccdv/pubmed-summarization

0.005  big_patent ||| big_patent

0.005  pile-of-law/pile-of-law ||| pile-of-law/pile-of-law/

In [138]:
data_streaming_config_mlm = {
    'files':mlm_files,
    'val_size':10, #2000,
    'min_seq_length':48,
    'max_seq_length':512,
    'max_chunk_size':6,
    'train_chunk_size':2000,
    'max_chunk_start':1000000,
    "seed":42,
}

#!rm cache_*
dataset_static_mlm = initialize_and_get_mlm_streaming_datasets(
    data_streaming_config=data_streaming_config_mlm,
    streaming_cleaning_functions=mlm_streaming_cleaning_functions,
    start_proportion = None,
    epoch=1,
    seed=42,
    path_to_val_cache = 'cache_val_mlm.pkl',
    path_to_train_cache_epoch = 'cache_train_mlm_%03g.pkl',
    do_check_english = True
)

trying EleutherAI/the_pile_deduplicated initialization (shuffling through 1650 files)


Resolving data files:   0%|          | 0/1650 [00:00<?, ?it/s]

take 1 from EleutherAI/the_pile_deduplicated validation
take 277 from EleutherAI/the_pile_deduplicated training
Done getting streams/reloading from EleutherAI/the_pile_deduplicated
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying tiiuae/falcon-refinedweb initialization (shuffling through 5534 files)


Resolving data files:   0%|          | 0/5534 [00:00<?, ?it/s]

take 1 from tiiuae/falcon-refinedweb validation
take 292 from tiiuae/falcon-refinedweb training
Done getting streams/reloading from tiiuae/falcon-refinedweb
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying Skylion007/openwebtext initialization (shuffling through 21 files)
take 1 from Skylion007/openwebtext validation
take 85 from Skylion007/openwebtext training
Done getting streams/reloading from Skylion007/openwebtext
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying Cohere/wikipedia-22-12 initialization (shuffling through 351 files)


Repo card metadata block was not found. Setting CardData to empty.


take 2 from Cohere/wikipedia-22-12 validation
take 598 from Cohere/wikipedia-22-12 training
Done getting streams/reloading from Cohere/wikipedia-22-12
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying Multi-Domain-Expert-Layers/the_pile_books3_packed_128k initialization (shuffling through 15 files)
take 1 from Multi-Domain-Expert-Layers/the_pile_books3_packed_128k validation
take 41 from Multi-Domain-Expert-Layers/the_pile_books3_packed_128k training
Done getting streams/reloading from Multi-Domain-Expert-Layers/the_pile_books3_packed_128k
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying nRuaif/book2-lite-cleaned initialization (shuffling through 818 files)


Resolving data files:   0%|          | 0/816 [00:00<?, ?it/s]

take 1 from nRuaif/book2-lite-cleaned validation
take 41 from nRuaif/book2-lite-cleaned training
Done getting streams/reloading from nRuaif/book2-lite-cleaned
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying macrocosm/arxiv_abstracts initialization (shuffling through 23 files)


Resolving data files:   0%|          | 0/24 [00:00<?, ?it/s]

take 1 from macrocosm/arxiv_abstracts validation
take 62 from macrocosm/arxiv_abstracts training
Done getting streams/reloading from macrocosm/arxiv_abstracts
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying big_patent initialization
take 1 from big_patent validation
take 10 from big_patent train
Done getting streams/reloading from big_patent
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying pile-of-law/pile-of-law initialization


Downloading builder script:   0%|          | 0.00/20.9k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/25.6k [00:00<?, ?B/s]

Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/pile-of-law--pile-of-law/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60
INFO:datasets.info:Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/pile-of-law--pile-of-law/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60


take 1 from pile-of-law/pile-of-law validation
take 10 from pile-of-law/pile-of-law train
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.euro_parl.jsonl.xz
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.euro_parl.jsonl.xz
Done getting streams/reloading from pile-of-law/pile-of-law
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A/download?path=%2F&files=LEDGAR_2016-2019.jsonl.zip initialization


Using custom data configuration default-5bd96a4d9bfd62a2
INFO:datasets.builder:Using custom data configuration default-5bd96a4d9bfd62a2
Loading Dataset Infos from /usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/json
INFO:datasets.info:Loading Dataset Infos from /usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/json


take 1 from https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A/download?path=%2F&files=LEDGAR_2016-2019.jsonl.zip validation
take 107 from https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A/download?path=%2F&files=LEDGAR_2016-2019.jsonl.zip train
Done getting streams/reloading from https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A/download?path=%2F&files=LEDGAR_2016-2019.jsonl.zip
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying pile-of-law/pile-of-law initialization


Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/pile-of-law--pile-of-law/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60
INFO:datasets.info:Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/pile-of-law--pile-of-law/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60


take 1 from pile-of-law/pile-of-law validation
take 28 from pile-of-law/pile-of-law train
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.r_legaldvice.jsonl.xz
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.r_legaldvice.jsonl.xz
Done getting streams/reloading from pile-of-law/pile-of-law
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying pile-of-law/pile-of-law initialization


Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/pile-of-law--pile-of-law/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60
INFO:datasets.info:Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/pile-of-law--pile-of-law/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60


take 1 from pile-of-law/pile-of-law validation
take 2 from pile-of-law/pile-of-law train
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.examoutlines.jsonl.xz
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.examoutlines.jsonl.xz
Done getting streams/reloading from pile-of-law/pile-of-law
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying pile-of-law/pile-of-law initialization


Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/pile-of-law--pile-of-law/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60
INFO:datasets.info:Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/pile-of-law--pile-of-law/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60


take 1 from pile-of-law/pile-of-law validation
take 9 from pile-of-law/pile-of-law train
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.cc_casebooks.jsonl.xz
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.cc_casebooks.jsonl.xz
Done getting streams/reloading from pile-of-law/pile-of-law
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying eloukas/edgar-corpus initialization (shuffling through 28 files)


https://huggingface.co/datasets/eloukas/edgar-corpus/resolve/main/edgar-corpus.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/a97879a87e4a23ed18a404fbc53698af174a89130d54d2662fbb31361b29975c.a6b5c996cf6cb814c357c6be4099b2fe8418ed3506d14517915cfb2f1dbc5549.py.incomplete
INFO:datasets.utils.file_utils:https://huggingface.co/datasets/eloukas/edgar-corpus/resolve/main/edgar-corpus.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/a97879a87e4a23ed18a404fbc53698af174a89130d54d2662fbb31361b29975c.a6b5c996cf6cb814c357c6be4099b2fe8418ed3506d14517915cfb2f1dbc5549.py.incomplete


Downloading builder script:   0%|          | 0.00/4.64k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/eloukas/edgar-corpus/resolve/main/edgar-corpus.py in cache at /root/.cache/huggingface/datasets/downloads/a97879a87e4a23ed18a404fbc53698af174a89130d54d2662fbb31361b29975c.a6b5c996cf6cb814c357c6be4099b2fe8418ed3506d14517915cfb2f1dbc5549.py
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/eloukas/edgar-corpus/resolve/main/edgar-corpus.py in cache at /root/.cache/huggingface/datasets/downloads/a97879a87e4a23ed18a404fbc53698af174a89130d54d2662fbb31361b29975c.a6b5c996cf6cb814c357c6be4099b2fe8418ed3506d14517915cfb2f1dbc5549.py
creating metadata file for /root/.cache/huggingface/datasets/downloads/a97879a87e4a23ed18a404fbc53698af174a89130d54d2662fbb31361b29975c.a6b5c996cf6cb814c357c6be4099b2fe8418ed3506d14517915cfb2f1dbc5549.py
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/a97879a87e4a23ed18a404fbc53698af174a89130d54d2662fbb31361b29975c.a6b5c996cf6cb814c357c6be4099b2fe8418ed35

Downloading readme:   0%|          | 0.00/43.7k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/eloukas/edgar-corpus/resolve/main/README.md in cache at /root/.cache/huggingface/datasets/downloads/d6a140d0ea4425f1b16ac44ee14d5071bfe4ff24dc22630baaec36ad3d95c028.47bb1a34b4776937628da47f28595b0d93431703a469e86e4354812a30299cc1
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/eloukas/edgar-corpus/resolve/main/README.md in cache at /root/.cache/huggingface/datasets/downloads/d6a140d0ea4425f1b16ac44ee14d5071bfe4ff24dc22630baaec36ad3d95c028.47bb1a34b4776937628da47f28595b0d93431703a469e86e4354812a30299cc1
creating metadata file for /root/.cache/huggingface/datasets/downloads/d6a140d0ea4425f1b16ac44ee14d5071bfe4ff24dc22630baaec36ad3d95c028.47bb1a34b4776937628da47f28595b0d93431703a469e86e4354812a30299cc1
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/d6a140d0ea4425f1b16ac44ee14d5071bfe4ff24dc22630baaec36ad3d95c028.47bb1a34b4776937628da47f28595b0d93431703a469e86e4354812a30299

take 1 from eloukas/edgar-corpus validation
take 37 from eloukas/edgar-corpus training
Done getting streams/reloading from eloukas/edgar-corpus
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying Rahmaa/ElsevieR_ClEaN initialization


https://huggingface.co/datasets/Rahmaa/ElsevieR_ClEaN/resolve/091860c29d4d69c06bf41f15090e03c787424fda/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/cbac8c5c9aaa279c9275b6eb38eed4f9eadea05912fa872f3392e7c41407638d.094a50cb8064faee8c4cb789efc018e2ed5bcdc14dcae7d757da917ebaf6626b.incomplete
INFO:datasets.utils.file_utils:https://huggingface.co/datasets/Rahmaa/ElsevieR_ClEaN/resolve/091860c29d4d69c06bf41f15090e03c787424fda/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/cbac8c5c9aaa279c9275b6eb38eed4f9eadea05912fa872f3392e7c41407638d.094a50cb8064faee8c4cb789efc018e2ed5bcdc14dcae7d757da917ebaf6626b.incomplete


Downloading readme:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

storing https://huggingface.co/datasets/Rahmaa/ElsevieR_ClEaN/resolve/091860c29d4d69c06bf41f15090e03c787424fda/README.md in cache at /root/.cache/huggingface/datasets/downloads/cbac8c5c9aaa279c9275b6eb38eed4f9eadea05912fa872f3392e7c41407638d.094a50cb8064faee8c4cb789efc018e2ed5bcdc14dcae7d757da917ebaf6626b
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/Rahmaa/ElsevieR_ClEaN/resolve/091860c29d4d69c06bf41f15090e03c787424fda/README.md in cache at /root/.cache/huggingface/datasets/downloads/cbac8c5c9aaa279c9275b6eb38eed4f9eadea05912fa872f3392e7c41407638d.094a50cb8064faee8c4cb789efc018e2ed5bcdc14dcae7d757da917ebaf6626b
creating metadata file for /root/.cache/huggingface/datasets/downloads/cbac8c5c9aaa279c9275b6eb38eed4f9eadea05912fa872f3392e7c41407638d.094a50cb8064faee8c4cb789efc018e2ed5bcdc14dcae7d757da917ebaf6626b
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/cbac8c5c9aaa279c9275b6eb38eed4f9eadea05912fa872f3392

take 1 from Rahmaa/ElsevieR_ClEaN validation
take 29 from Rahmaa/ElsevieR_ClEaN train
Done getting streams/reloading from Rahmaa/ElsevieR_ClEaN
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying ashraq/financial-news-articles initialization (shuffling through 2 files)


https://huggingface.co/datasets/ashraq/financial-news-articles/resolve/9920e8130b63513c598a6cdde10df3e2728bccef/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/b2785c62f11a7b7b0c0607784cd7180cf38a1457530150136d1f465f6f7b6977.78203cceda51700ab31adfc29b11cb1e3a368c608b7286797757649eaf892e7c.incomplete
INFO:datasets.utils.file_utils:https://huggingface.co/datasets/ashraq/financial-news-articles/resolve/9920e8130b63513c598a6cdde10df3e2728bccef/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/b2785c62f11a7b7b0c0607784cd7180cf38a1457530150136d1f465f6f7b6977.78203cceda51700ab31adfc29b11cb1e3a368c608b7286797757649eaf892e7c.incomplete


Downloading readme:   0%|          | 0.00/543 [00:00<?, ?B/s]

storing https://huggingface.co/datasets/ashraq/financial-news-articles/resolve/9920e8130b63513c598a6cdde10df3e2728bccef/README.md in cache at /root/.cache/huggingface/datasets/downloads/b2785c62f11a7b7b0c0607784cd7180cf38a1457530150136d1f465f6f7b6977.78203cceda51700ab31adfc29b11cb1e3a368c608b7286797757649eaf892e7c
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/ashraq/financial-news-articles/resolve/9920e8130b63513c598a6cdde10df3e2728bccef/README.md in cache at /root/.cache/huggingface/datasets/downloads/b2785c62f11a7b7b0c0607784cd7180cf38a1457530150136d1f465f6f7b6977.78203cceda51700ab31adfc29b11cb1e3a368c608b7286797757649eaf892e7c
creating metadata file for /root/.cache/huggingface/datasets/downloads/b2785c62f11a7b7b0c0607784cd7180cf38a1457530150136d1f465f6f7b6977.78203cceda51700ab31adfc29b11cb1e3a368c608b7286797757649eaf892e7c
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/b2785c62f11a7b7b0c0607784cd7180cf3

take 1 from ashraq/financial-news-articles validation
take 17 from ashraq/financial-news-articles training
Done getting streams/reloading from ashraq/financial-news-articles
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying pile-of-law/pile-of-law initialization (shuffling through 16 files)


Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/pile-of-law--pile-of-law/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60
INFO:datasets.info:Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/pile-of-law--pile-of-law/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60


take 1 from pile-of-law/pile-of-law validation
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.courtlisteneropinions.0.jsonl.xz
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.courtlisteneropinions.1.jsonl.xz


Exception ignored in: <generator object PileOfLaw._generate_examples at 0x7ca0f02cfa70>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/iterable_dataset.py", line 233, in __iter__
    yield from self.generate_examples_fn(**self.kwargs)
RuntimeError: generator ignored GeneratorExit
Exception ignored in: <generator object ExamplesIterable.__iter__ at 0x7ca0f02ce880>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/iterable_dataset.py", line 233, in __iter__
    yield from self.generate_examples_fn(**self.kwargs)
RuntimeError: generator ignored GeneratorExit


take 23 from pile-of-law/pile-of-law training
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.courtlisteneropinions.7.jsonl.xz
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.courtlisteneropinions.2.jsonl.xz


Exception ignored in: <generator object PileOfLaw._generate_examples at 0x7ca0f02cdd20>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/iterable_dataset.py", line 261, in __iter__
    yield from self.generate_examples_fn(**kwargs_with_shuffled_shards)
RuntimeError: generator ignored GeneratorExit
Exception ignored in: <generator object ShuffledDataSourcesExamplesIterable.__iter__ at 0x7ca0f02cfa70>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/iterable_dataset.py", line 261, in __iter__
    yield from self.generate_examples_fn(**kwargs_with_shuffled_shards)
RuntimeError: generator ignored GeneratorExit


Done getting streams/reloading from pile-of-law/pile-of-law
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying pile-of-law/pile-of-law initialization


Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/pile-of-law--pile-of-law/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60
INFO:datasets.info:Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/pile-of-law--pile-of-law/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60


take 1 from pile-of-law/pile-of-law validation
take 15 from pile-of-law/pile-of-law train
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.sec.jsonl.xz
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.sec.jsonl.xz
Done getting streams/reloading from pile-of-law/pile-of-law
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying pile-of-law/pile-of-law initialization


Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/pile-of-law--pile-of-law/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60
INFO:datasets.info:Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/pile-of-law--pile-of-law/c1090502f95031ebfad49ede680394da5532909fa46b7a0452be8cddecc9fa60


take 1 from pile-of-law/pile-of-law validation
take 13 from pile-of-law/pile-of-law train
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.irs_legal_advice_memos.jsonl.xz
Error reading file: https://huggingface.co/datasets/pile-of-law/pile-of-law/resolve/main/data/train.irs_legal_advice_memos.jsonl.xz
Done getting streams/reloading from pile-of-law/pile-of-law
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying launch/gov_report initialization


https://huggingface.co/datasets/launch/gov_report/resolve/main/gov_report.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/8592f359583c23cec40549ae340147d412ae4ff68cb41ef637a3946cf068096c.beda861b3eaffcb858aee41560c69e53169e1d43c68036e03d0fe55058d10a05.py.incomplete
INFO:datasets.utils.file_utils:https://huggingface.co/datasets/launch/gov_report/resolve/main/gov_report.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/8592f359583c23cec40549ae340147d412ae4ff68cb41ef637a3946cf068096c.beda861b3eaffcb858aee41560c69e53169e1d43c68036e03d0fe55058d10a05.py.incomplete


Downloading builder script:   0%|          | 0.00/10.1k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/launch/gov_report/resolve/main/gov_report.py in cache at /root/.cache/huggingface/datasets/downloads/8592f359583c23cec40549ae340147d412ae4ff68cb41ef637a3946cf068096c.beda861b3eaffcb858aee41560c69e53169e1d43c68036e03d0fe55058d10a05.py
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/launch/gov_report/resolve/main/gov_report.py in cache at /root/.cache/huggingface/datasets/downloads/8592f359583c23cec40549ae340147d412ae4ff68cb41ef637a3946cf068096c.beda861b3eaffcb858aee41560c69e53169e1d43c68036e03d0fe55058d10a05.py
creating metadata file for /root/.cache/huggingface/datasets/downloads/8592f359583c23cec40549ae340147d412ae4ff68cb41ef637a3946cf068096c.beda861b3eaffcb858aee41560c69e53169e1d43c68036e03d0fe55058d10a05.py
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/8592f359583c23cec40549ae340147d412ae4ff68cb41ef637a3946cf068096c.beda861b3eaffcb858aee41560c69e53169e1d43c68036e03d

Downloading readme:   0%|          | 0.00/6.69k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/launch/gov_report/resolve/main/README.md in cache at /root/.cache/huggingface/datasets/downloads/fc200ae7e538ce53944ce22c4c6ae32a88eaff8460b85d0c0162b3561bd47b66.99c73aa01f57c69db33c491f995aa8edfb756af728bed02dc3c0966b7750ff66
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/launch/gov_report/resolve/main/README.md in cache at /root/.cache/huggingface/datasets/downloads/fc200ae7e538ce53944ce22c4c6ae32a88eaff8460b85d0c0162b3561bd47b66.99c73aa01f57c69db33c491f995aa8edfb756af728bed02dc3c0966b7750ff66
creating metadata file for /root/.cache/huggingface/datasets/downloads/fc200ae7e538ce53944ce22c4c6ae32a88eaff8460b85d0c0162b3561bd47b66.99c73aa01f57c69db33c491f995aa8edfb756af728bed02dc3c0966b7750ff66
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/fc200ae7e538ce53944ce22c4c6ae32a88eaff8460b85d0c0162b3561bd47b66.99c73aa01f57c69db33c491f995aa8edfb756af728bed02dc3c0966b7750ff66
Lo

take 1 from launch/gov_report validation
take 9 from launch/gov_report train
Done getting streams/reloading from launch/gov_report
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying izumi-lab/open-text-books initialization


https://huggingface.co/datasets/izumi-lab/open-text-books/resolve/1245fefd628d37483366b8e707fdc5650fd3c48e/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/3a6cafa442d623fa1a9fafa6256d2bde1a822e6981244e62afbdcf597f27d9f3.e56fddbfa20be2788374fbcc61e92813080be196c0cacb299e34c84c57efcef2.incomplete
INFO:datasets.utils.file_utils:https://huggingface.co/datasets/izumi-lab/open-text-books/resolve/1245fefd628d37483366b8e707fdc5650fd3c48e/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/3a6cafa442d623fa1a9fafa6256d2bde1a822e6981244e62afbdcf597f27d9f3.e56fddbfa20be2788374fbcc61e92813080be196c0cacb299e34c84c57efcef2.incomplete


Downloading readme:   0%|          | 0.00/488 [00:00<?, ?B/s]

storing https://huggingface.co/datasets/izumi-lab/open-text-books/resolve/1245fefd628d37483366b8e707fdc5650fd3c48e/README.md in cache at /root/.cache/huggingface/datasets/downloads/3a6cafa442d623fa1a9fafa6256d2bde1a822e6981244e62afbdcf597f27d9f3.e56fddbfa20be2788374fbcc61e92813080be196c0cacb299e34c84c57efcef2
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/izumi-lab/open-text-books/resolve/1245fefd628d37483366b8e707fdc5650fd3c48e/README.md in cache at /root/.cache/huggingface/datasets/downloads/3a6cafa442d623fa1a9fafa6256d2bde1a822e6981244e62afbdcf597f27d9f3.e56fddbfa20be2788374fbcc61e92813080be196c0cacb299e34c84c57efcef2
creating metadata file for /root/.cache/huggingface/datasets/downloads/3a6cafa442d623fa1a9fafa6256d2bde1a822e6981244e62afbdcf597f27d9f3.e56fddbfa20be2788374fbcc61e92813080be196c0cacb299e34c84c57efcef2
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/3a6cafa442d623fa1a9fafa6256d2bde1a822e698124

take 1 from izumi-lab/open-text-books validation
take 58 from izumi-lab/open-text-books train
Done getting streams/reloading from izumi-lab/open-text-books
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying Skelebor/book_titles_and_descriptions initialization (shuffling through 2 files)


https://huggingface.co/datasets/Skelebor/book_titles_and_descriptions/resolve/main/dataset_infos.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/6dcb956807b8b6aaf3a753564d5464713b7ffeaac04908f47e091dae787fb14e.095651d2bbba48c3c8e07c927c14c89687ed181f2f2c31088706cc34e1e1dde5.incomplete
INFO:datasets.utils.file_utils:https://huggingface.co/datasets/Skelebor/book_titles_and_descriptions/resolve/main/dataset_infos.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/6dcb956807b8b6aaf3a753564d5464713b7ffeaac04908f47e091dae787fb14e.095651d2bbba48c3c8e07c927c14c89687ed181f2f2c31088706cc34e1e1dde5.incomplete


Downloading metadata:   0%|          | 0.00/1.20k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/Skelebor/book_titles_and_descriptions/resolve/main/dataset_infos.json in cache at /root/.cache/huggingface/datasets/downloads/6dcb956807b8b6aaf3a753564d5464713b7ffeaac04908f47e091dae787fb14e.095651d2bbba48c3c8e07c927c14c89687ed181f2f2c31088706cc34e1e1dde5
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/Skelebor/book_titles_and_descriptions/resolve/main/dataset_infos.json in cache at /root/.cache/huggingface/datasets/downloads/6dcb956807b8b6aaf3a753564d5464713b7ffeaac04908f47e091dae787fb14e.095651d2bbba48c3c8e07c927c14c89687ed181f2f2c31088706cc34e1e1dde5
creating metadata file for /root/.cache/huggingface/datasets/downloads/6dcb956807b8b6aaf3a753564d5464713b7ffeaac04908f47e091dae787fb14e.095651d2bbba48c3c8e07c927c14c89687ed181f2f2c31088706cc34e1e1dde5
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/6dcb956807b8b6aaf3a753564d5464713b7ffeaac04908f47e091dae787fb14e.095651d2b

take 1 from Skelebor/book_titles_and_descriptions validation
take 38 from Skelebor/book_titles_and_descriptions training
Done getting streams/reloading from Skelebor/book_titles_and_descriptions
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying joelito/legal_case_document_summarization initialization


https://huggingface.co/datasets/joelito/legal_case_document_summarization/resolve/176a3f11b7ef453947b486c1de843068d108acef/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/de9e48164b70a1b900a9fa88b72938a55fe54984edc38a894dcaaf1edcda10c2.9c20a1a696c0f24d17f910f503c09ec1eeaf660d470fa30f899e3f069646900f.incomplete
INFO:datasets.utils.file_utils:https://huggingface.co/datasets/joelito/legal_case_document_summarization/resolve/176a3f11b7ef453947b486c1de843068d108acef/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/de9e48164b70a1b900a9fa88b72938a55fe54984edc38a894dcaaf1edcda10c2.9c20a1a696c0f24d17f910f503c09ec1eeaf660d470fa30f899e3f069646900f.incomplete


Downloading readme:   0%|          | 0.00/2.60k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/joelito/legal_case_document_summarization/resolve/176a3f11b7ef453947b486c1de843068d108acef/README.md in cache at /root/.cache/huggingface/datasets/downloads/de9e48164b70a1b900a9fa88b72938a55fe54984edc38a894dcaaf1edcda10c2.9c20a1a696c0f24d17f910f503c09ec1eeaf660d470fa30f899e3f069646900f
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/joelito/legal_case_document_summarization/resolve/176a3f11b7ef453947b486c1de843068d108acef/README.md in cache at /root/.cache/huggingface/datasets/downloads/de9e48164b70a1b900a9fa88b72938a55fe54984edc38a894dcaaf1edcda10c2.9c20a1a696c0f24d17f910f503c09ec1eeaf660d470fa30f899e3f069646900f
creating metadata file for /root/.cache/huggingface/datasets/downloads/de9e48164b70a1b900a9fa88b72938a55fe54984edc38a894dcaaf1edcda10c2.9c20a1a696c0f24d17f910f503c09ec1eeaf660d470fa30f899e3f069646900f
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/de9e48164b70

take 1 from joelito/legal_case_document_summarization validation
take 38 from joelito/legal_case_document_summarization train
Done getting streams/reloading from joelito/legal_case_document_summarization
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying joelito/legal-mc4 initialization


https://huggingface.co/datasets/joelito/legal-mc4/resolve/main/legal-mc4.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/3bed0c147bfb100a1b112f6cc6ffb22c50bf5799c7c3714007cac6b504af973f.b84ac56a6004b2fc44e4fefe4c03b9985e565ee1bc4cf4495b05ac02699fbbdb.py.incomplete
INFO:datasets.utils.file_utils:https://huggingface.co/datasets/joelito/legal-mc4/resolve/main/legal-mc4.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/3bed0c147bfb100a1b112f6cc6ffb22c50bf5799c7c3714007cac6b504af973f.b84ac56a6004b2fc44e4fefe4c03b9985e565ee1bc4cf4495b05ac02699fbbdb.py.incomplete


Downloading builder script:   0%|          | 0.00/4.31k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/joelito/legal-mc4/resolve/main/legal-mc4.py in cache at /root/.cache/huggingface/datasets/downloads/3bed0c147bfb100a1b112f6cc6ffb22c50bf5799c7c3714007cac6b504af973f.b84ac56a6004b2fc44e4fefe4c03b9985e565ee1bc4cf4495b05ac02699fbbdb.py
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/joelito/legal-mc4/resolve/main/legal-mc4.py in cache at /root/.cache/huggingface/datasets/downloads/3bed0c147bfb100a1b112f6cc6ffb22c50bf5799c7c3714007cac6b504af973f.b84ac56a6004b2fc44e4fefe4c03b9985e565ee1bc4cf4495b05ac02699fbbdb.py
creating metadata file for /root/.cache/huggingface/datasets/downloads/3bed0c147bfb100a1b112f6cc6ffb22c50bf5799c7c3714007cac6b504af973f.b84ac56a6004b2fc44e4fefe4c03b9985e565ee1bc4cf4495b05ac02699fbbdb.py
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/3bed0c147bfb100a1b112f6cc6ffb22c50bf5799c7c3714007cac6b504af973f.b84ac56a6004b2fc44e4fefe4c03b9985e565ee1bc4cf4495b05

Downloading readme:   0%|          | 0.00/12.3k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/joelito/legal-mc4/resolve/main/README.md in cache at /root/.cache/huggingface/datasets/downloads/713b46c86ba9022773d298e2ccae8af90ee704d43385c7515f6e1188e42ba8a2.dd9adffcbb25e2d8bc7a52263093267fe5ee4d259608b7867f9533fac4076631
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/joelito/legal-mc4/resolve/main/README.md in cache at /root/.cache/huggingface/datasets/downloads/713b46c86ba9022773d298e2ccae8af90ee704d43385c7515f6e1188e42ba8a2.dd9adffcbb25e2d8bc7a52263093267fe5ee4d259608b7867f9533fac4076631
creating metadata file for /root/.cache/huggingface/datasets/downloads/713b46c86ba9022773d298e2ccae8af90ee704d43385c7515f6e1188e42ba8a2.dd9adffcbb25e2d8bc7a52263093267fe5ee4d259608b7867f9533fac4076631
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/713b46c86ba9022773d298e2ccae8af90ee704d43385c7515f6e1188e42ba8a2.dd9adffcbb25e2d8bc7a52263093267fe5ee4d259608b7867f9533fac4076631
Lo

take 1 from joelito/legal-mc4 validation
take 19 from joelito/legal-mc4 train
Done getting streams/reloading from joelito/legal-mc4
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying Hellisotherpeople/DebateSum initialization


https://huggingface.co/datasets/Hellisotherpeople/DebateSum/resolve/d65bea5f7a48f9af06453e855dbffe1753a0f508/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/09258cb7fca93fc1751a4952007d2dbd3ccb6c05c0fe88b3e4e18c70c7d2f7ac.b1d75e4446247940af3d4a609e5b2334adeeca5160b579f63f0cbf442302c7eb.incomplete
INFO:datasets.utils.file_utils:https://huggingface.co/datasets/Hellisotherpeople/DebateSum/resolve/d65bea5f7a48f9af06453e855dbffe1753a0f508/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/09258cb7fca93fc1751a4952007d2dbd3ccb6c05c0fe88b3e4e18c70c7d2f7ac.b1d75e4446247940af3d4a609e5b2334adeeca5160b579f63f0cbf442302c7eb.incomplete


Downloading readme:   0%|          | 0.00/4.25k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/Hellisotherpeople/DebateSum/resolve/d65bea5f7a48f9af06453e855dbffe1753a0f508/README.md in cache at /root/.cache/huggingface/datasets/downloads/09258cb7fca93fc1751a4952007d2dbd3ccb6c05c0fe88b3e4e18c70c7d2f7ac.b1d75e4446247940af3d4a609e5b2334adeeca5160b579f63f0cbf442302c7eb
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/Hellisotherpeople/DebateSum/resolve/d65bea5f7a48f9af06453e855dbffe1753a0f508/README.md in cache at /root/.cache/huggingface/datasets/downloads/09258cb7fca93fc1751a4952007d2dbd3ccb6c05c0fe88b3e4e18c70c7d2f7ac.b1d75e4446247940af3d4a609e5b2334adeeca5160b579f63f0cbf442302c7eb
creating metadata file for /root/.cache/huggingface/datasets/downloads/09258cb7fca93fc1751a4952007d2dbd3ccb6c05c0fe88b3e4e18c70c7d2f7ac.b1d75e4446247940af3d4a609e5b2334adeeca5160b579f63f0cbf442302c7eb
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/09258cb7fca93fc1751a4952007d2dbd3ccb6c05

take 1 from Hellisotherpeople/DebateSum validation
take 27 from Hellisotherpeople/DebateSum train
Done getting streams/reloading from Hellisotherpeople/DebateSum
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying lukesjordan/worldbank-project-documents initialization


https://huggingface.co/datasets/lukesjordan/worldbank-project-documents/resolve/c435ecfd98f198f2ea0e741591d347423ff056e7/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/7bb3c2b0a5c56b702f43accda31033e3311ce1419c66e79fb041b5c33e153f50.2bdaa713397083611c0501f63cbc0acf0b68ec058942afbb7500bf10623b1df9.incomplete
INFO:datasets.utils.file_utils:https://huggingface.co/datasets/lukesjordan/worldbank-project-documents/resolve/c435ecfd98f198f2ea0e741591d347423ff056e7/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/7bb3c2b0a5c56b702f43accda31033e3311ce1419c66e79fb041b5c33e153f50.2bdaa713397083611c0501f63cbc0acf0b68ec058942afbb7500bf10623b1df9.incomplete


Downloading readme:   0%|          | 0.00/4.63k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/lukesjordan/worldbank-project-documents/resolve/c435ecfd98f198f2ea0e741591d347423ff056e7/README.md in cache at /root/.cache/huggingface/datasets/downloads/7bb3c2b0a5c56b702f43accda31033e3311ce1419c66e79fb041b5c33e153f50.2bdaa713397083611c0501f63cbc0acf0b68ec058942afbb7500bf10623b1df9
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/lukesjordan/worldbank-project-documents/resolve/c435ecfd98f198f2ea0e741591d347423ff056e7/README.md in cache at /root/.cache/huggingface/datasets/downloads/7bb3c2b0a5c56b702f43accda31033e3311ce1419c66e79fb041b5c33e153f50.2bdaa713397083611c0501f63cbc0acf0b68ec058942afbb7500bf10623b1df9
creating metadata file for /root/.cache/huggingface/datasets/downloads/7bb3c2b0a5c56b702f43accda31033e3311ce1419c66e79fb041b5c33e153f50.2bdaa713397083611c0501f63cbc0acf0b68ec058942afbb7500bf10623b1df9
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/7bb3c2b0a5c56b70

take 1 from lukesjordan/worldbank-project-documents validation
take 6 from lukesjordan/worldbank-project-documents train
Done getting streams/reloading from lukesjordan/worldbank-project-documents
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying 64bits/lex_fridman_podcast_for_llm_vicuna initialization


https://huggingface.co/datasets/64bits/lex_fridman_podcast_for_llm_vicuna/resolve/22ce5eaa1e0015e37cede361d7147738679af2d4/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/4bd20262c630a62e81fcd24f85a51c9a02f66cae76f7bdf473a836c745d60518.fdb38b86eb866e3b19b6bdb7f3df0050bb45f72f811c511d4872061102797b17.incomplete
INFO:datasets.utils.file_utils:https://huggingface.co/datasets/64bits/lex_fridman_podcast_for_llm_vicuna/resolve/22ce5eaa1e0015e37cede361d7147738679af2d4/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/4bd20262c630a62e81fcd24f85a51c9a02f66cae76f7bdf473a836c745d60518.fdb38b86eb866e3b19b6bdb7f3df0050bb45f72f811c511d4872061102797b17.incomplete


Downloading readme:   0%|          | 0.00/1.97k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/64bits/lex_fridman_podcast_for_llm_vicuna/resolve/22ce5eaa1e0015e37cede361d7147738679af2d4/README.md in cache at /root/.cache/huggingface/datasets/downloads/4bd20262c630a62e81fcd24f85a51c9a02f66cae76f7bdf473a836c745d60518.fdb38b86eb866e3b19b6bdb7f3df0050bb45f72f811c511d4872061102797b17
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/64bits/lex_fridman_podcast_for_llm_vicuna/resolve/22ce5eaa1e0015e37cede361d7147738679af2d4/README.md in cache at /root/.cache/huggingface/datasets/downloads/4bd20262c630a62e81fcd24f85a51c9a02f66cae76f7bdf473a836c745d60518.fdb38b86eb866e3b19b6bdb7f3df0050bb45f72f811c511d4872061102797b17
creating metadata file for /root/.cache/huggingface/datasets/downloads/4bd20262c630a62e81fcd24f85a51c9a02f66cae76f7bdf473a836c745d60518.fdb38b86eb866e3b19b6bdb7f3df0050bb45f72f811c511d4872061102797b17
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/4bd20262c630

take 1 from 64bits/lex_fridman_podcast_for_llm_vicuna validation
take 9 from 64bits/lex_fridman_podcast_for_llm_vicuna train
Done getting streams/reloading from 64bits/lex_fridman_podcast_for_llm_vicuna
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying nid989/EssayFroum-Dataset initialization


https://huggingface.co/datasets/nid989/EssayFroum-Dataset/resolve/73d805de8c0299677d1037085f4272949da330ef/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/dec2e69624f46513d3200c23bec332a0cebb802c17a149d56eb13ddb5c9ee96b.dfd6587e58d78649c8fd0eacb7047496cbd5c5126aacc96ccce8485b144b2e82.incomplete
INFO:datasets.utils.file_utils:https://huggingface.co/datasets/nid989/EssayFroum-Dataset/resolve/73d805de8c0299677d1037085f4272949da330ef/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/dec2e69624f46513d3200c23bec332a0cebb802c17a149d56eb13ddb5c9ee96b.dfd6587e58d78649c8fd0eacb7047496cbd5c5126aacc96ccce8485b144b2e82.incomplete


Downloading readme:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

storing https://huggingface.co/datasets/nid989/EssayFroum-Dataset/resolve/73d805de8c0299677d1037085f4272949da330ef/README.md in cache at /root/.cache/huggingface/datasets/downloads/dec2e69624f46513d3200c23bec332a0cebb802c17a149d56eb13ddb5c9ee96b.dfd6587e58d78649c8fd0eacb7047496cbd5c5126aacc96ccce8485b144b2e82
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/nid989/EssayFroum-Dataset/resolve/73d805de8c0299677d1037085f4272949da330ef/README.md in cache at /root/.cache/huggingface/datasets/downloads/dec2e69624f46513d3200c23bec332a0cebb802c17a149d56eb13ddb5c9ee96b.dfd6587e58d78649c8fd0eacb7047496cbd5c5126aacc96ccce8485b144b2e82
creating metadata file for /root/.cache/huggingface/datasets/downloads/dec2e69624f46513d3200c23bec332a0cebb802c17a149d56eb13ddb5c9ee96b.dfd6587e58d78649c8fd0eacb7047496cbd5c5126aacc96ccce8485b144b2e82
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/dec2e69624f46513d3200c23bec332a0cebb802c17a1

take 1 from nid989/EssayFroum-Dataset validation
take 12 from nid989/EssayFroum-Dataset train
Done getting streams/reloading from nid989/EssayFroum-Dataset
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying nlpaueb/finer-139 initialization


https://huggingface.co/datasets/nlpaueb/finer-139/resolve/main/finer-139.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/a5fec3087527d120b900829086bbc4d8e59ba573cb38feaf605c2deb09f39a45.81e852345f8ad63ed894c4683d8245865735f379658add909f748da831df90f2.py.incomplete
INFO:datasets.utils.file_utils:https://huggingface.co/datasets/nlpaueb/finer-139/resolve/main/finer-139.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/a5fec3087527d120b900829086bbc4d8e59ba573cb38feaf605c2deb09f39a45.81e852345f8ad63ed894c4683d8245865735f379658add909f748da831df90f2.py.incomplete


Downloading builder script:   0%|          | 0.00/19.1k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/nlpaueb/finer-139/resolve/main/finer-139.py in cache at /root/.cache/huggingface/datasets/downloads/a5fec3087527d120b900829086bbc4d8e59ba573cb38feaf605c2deb09f39a45.81e852345f8ad63ed894c4683d8245865735f379658add909f748da831df90f2.py
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/nlpaueb/finer-139/resolve/main/finer-139.py in cache at /root/.cache/huggingface/datasets/downloads/a5fec3087527d120b900829086bbc4d8e59ba573cb38feaf605c2deb09f39a45.81e852345f8ad63ed894c4683d8245865735f379658add909f748da831df90f2.py
creating metadata file for /root/.cache/huggingface/datasets/downloads/a5fec3087527d120b900829086bbc4d8e59ba573cb38feaf605c2deb09f39a45.81e852345f8ad63ed894c4683d8245865735f379658add909f748da831df90f2.py
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/a5fec3087527d120b900829086bbc4d8e59ba573cb38feaf605c2deb09f39a45.81e852345f8ad63ed894c4683d8245865735f379658add909f74

Downloading metadata:   0%|          | 0.00/15.9k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/nlpaueb/finer-139/resolve/main/dataset_infos.json in cache at /root/.cache/huggingface/datasets/downloads/6522ce87565d0ff0d9c96d2955847214619b9655753105e29033512409f6512d.0c11b76f6b98fa7750ca92f7e3c85fe34138a9597dff3b8e83e679565acf526b
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/nlpaueb/finer-139/resolve/main/dataset_infos.json in cache at /root/.cache/huggingface/datasets/downloads/6522ce87565d0ff0d9c96d2955847214619b9655753105e29033512409f6512d.0c11b76f6b98fa7750ca92f7e3c85fe34138a9597dff3b8e83e679565acf526b
creating metadata file for /root/.cache/huggingface/datasets/downloads/6522ce87565d0ff0d9c96d2955847214619b9655753105e29033512409f6512d.0c11b76f6b98fa7750ca92f7e3c85fe34138a9597dff3b8e83e679565acf526b
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/6522ce87565d0ff0d9c96d2955847214619b9655753105e29033512409f6512d.0c11b76f6b98fa7750ca92f7e3c85fe34138a9597dff3b8e8

Downloading readme:   0%|          | 0.00/9.55k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/nlpaueb/finer-139/resolve/main/README.md in cache at /root/.cache/huggingface/datasets/downloads/2ee93f905eda81763dc68760c00d6cffc70a2ce8e4f052b496333517b6cf2cd8.d5e853aa8948c6cccd2a8eb1c6ba85f5483087d9c9525e8d780e6f3594ae9139
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/nlpaueb/finer-139/resolve/main/README.md in cache at /root/.cache/huggingface/datasets/downloads/2ee93f905eda81763dc68760c00d6cffc70a2ce8e4f052b496333517b6cf2cd8.d5e853aa8948c6cccd2a8eb1c6ba85f5483087d9c9525e8d780e6f3594ae9139
creating metadata file for /root/.cache/huggingface/datasets/downloads/2ee93f905eda81763dc68760c00d6cffc70a2ce8e4f052b496333517b6cf2cd8.d5e853aa8948c6cccd2a8eb1c6ba85f5483087d9c9525e8d780e6f3594ae9139
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/2ee93f905eda81763dc68760c00d6cffc70a2ce8e4f052b496333517b6cf2cd8.d5e853aa8948c6cccd2a8eb1c6ba85f5483087d9c9525e8d780e6f3594ae9139
No

take 1 from nlpaueb/finer-139 validation
take 17 from nlpaueb/finer-139 train
Done getting streams/reloading from nlpaueb/finer-139
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying squad initialization


https://huggingface.co/datasets/squad/resolve/main/squad.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/86cbb67316ccaf9f769f141ddcba24adb96d8adde79e68aab51ec4a80b08b6af.121650427388673ffe2b913edcacf8f9873edf1c4d19761102687f28484e39a5.py.incomplete
INFO:datasets.utils.file_utils:https://huggingface.co/datasets/squad/resolve/main/squad.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/86cbb67316ccaf9f769f141ddcba24adb96d8adde79e68aab51ec4a80b08b6af.121650427388673ffe2b913edcacf8f9873edf1c4d19761102687f28484e39a5.py.incomplete


Downloading builder script:   0%|          | 0.00/5.27k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/squad/resolve/main/squad.py in cache at /root/.cache/huggingface/datasets/downloads/86cbb67316ccaf9f769f141ddcba24adb96d8adde79e68aab51ec4a80b08b6af.121650427388673ffe2b913edcacf8f9873edf1c4d19761102687f28484e39a5.py
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/squad/resolve/main/squad.py in cache at /root/.cache/huggingface/datasets/downloads/86cbb67316ccaf9f769f141ddcba24adb96d8adde79e68aab51ec4a80b08b6af.121650427388673ffe2b913edcacf8f9873edf1c4d19761102687f28484e39a5.py
creating metadata file for /root/.cache/huggingface/datasets/downloads/86cbb67316ccaf9f769f141ddcba24adb96d8adde79e68aab51ec4a80b08b6af.121650427388673ffe2b913edcacf8f9873edf1c4d19761102687f28484e39a5.py
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/86cbb67316ccaf9f769f141ddcba24adb96d8adde79e68aab51ec4a80b08b6af.121650427388673ffe2b913edcacf8f9873edf1c4d19761102687f28484e39a5.py
https://huggingf

Downloading metadata:   0%|          | 0.00/2.36k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/squad/resolve/main/dataset_infos.json in cache at /root/.cache/huggingface/datasets/downloads/e7830d4cb0750ec97f15dc68a057421d47be7b87942399068020c2a738d5691f.dbf664a8a4fbbcee29722cc663e703085eae5022d24daefc08d5cfcbe4085c0a
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/squad/resolve/main/dataset_infos.json in cache at /root/.cache/huggingface/datasets/downloads/e7830d4cb0750ec97f15dc68a057421d47be7b87942399068020c2a738d5691f.dbf664a8a4fbbcee29722cc663e703085eae5022d24daefc08d5cfcbe4085c0a
creating metadata file for /root/.cache/huggingface/datasets/downloads/e7830d4cb0750ec97f15dc68a057421d47be7b87942399068020c2a738d5691f.dbf664a8a4fbbcee29722cc663e703085eae5022d24daefc08d5cfcbe4085c0a
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/e7830d4cb0750ec97f15dc68a057421d47be7b87942399068020c2a738d5691f.dbf664a8a4fbbcee29722cc663e703085eae5022d24daefc08d5cfcbe4085c0a
https://

Downloading readme:   0%|          | 0.00/7.67k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/squad/resolve/main/README.md in cache at /root/.cache/huggingface/datasets/downloads/94fe703883a1850055695505cee42dcb38fbdbecd11abd45ef317f8650ecd86e.4810d3cc74275fb6a7b58ede6680c4b0bd760c8c0f507d71121fd6f66b8d68b9
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/squad/resolve/main/README.md in cache at /root/.cache/huggingface/datasets/downloads/94fe703883a1850055695505cee42dcb38fbdbecd11abd45ef317f8650ecd86e.4810d3cc74275fb6a7b58ede6680c4b0bd760c8c0f507d71121fd6f66b8d68b9
creating metadata file for /root/.cache/huggingface/datasets/downloads/94fe703883a1850055695505cee42dcb38fbdbecd11abd45ef317f8650ecd86e.4810d3cc74275fb6a7b58ede6680c4b0bd760c8c0f507d71121fd6f66b8d68b9
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/94fe703883a1850055695505cee42dcb38fbdbecd11abd45ef317f8650ecd86e.4810d3cc74275fb6a7b58ede6680c4b0bd760c8c0f507d71121fd6f66b8d68b9
No config specified, defau

take 1 from squad validation
take 26 from squad train
Done getting streams/reloading from squad
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying Pavithree/askHistorians initialization


https://huggingface.co/datasets/Pavithree/askHistorians/resolve/9603afe1e507fdc70f80ab3c532872fb217c7cc5/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/7ec44ec396acd9ad13c3cdd6c95b3c15db399a5069b7a74995cd7a3a85fa1559.1d3d831df91fe91f2f89ca74f20f86b288cde58f4be1328c21c11155f35f83d0.incomplete
INFO:datasets.utils.file_utils:https://huggingface.co/datasets/Pavithree/askHistorians/resolve/9603afe1e507fdc70f80ab3c532872fb217c7cc5/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/7ec44ec396acd9ad13c3cdd6c95b3c15db399a5069b7a74995cd7a3a85fa1559.1d3d831df91fe91f2f89ca74f20f86b288cde58f4be1328c21c11155f35f83d0.incomplete


Downloading readme:   0%|          | 0.00/70.0 [00:00<?, ?B/s]

storing https://huggingface.co/datasets/Pavithree/askHistorians/resolve/9603afe1e507fdc70f80ab3c532872fb217c7cc5/README.md in cache at /root/.cache/huggingface/datasets/downloads/7ec44ec396acd9ad13c3cdd6c95b3c15db399a5069b7a74995cd7a3a85fa1559.1d3d831df91fe91f2f89ca74f20f86b288cde58f4be1328c21c11155f35f83d0
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/Pavithree/askHistorians/resolve/9603afe1e507fdc70f80ab3c532872fb217c7cc5/README.md in cache at /root/.cache/huggingface/datasets/downloads/7ec44ec396acd9ad13c3cdd6c95b3c15db399a5069b7a74995cd7a3a85fa1559.1d3d831df91fe91f2f89ca74f20f86b288cde58f4be1328c21c11155f35f83d0
creating metadata file for /root/.cache/huggingface/datasets/downloads/7ec44ec396acd9ad13c3cdd6c95b3c15db399a5069b7a74995cd7a3a85fa1559.1d3d831df91fe91f2f89ca74f20f86b288cde58f4be1328c21c11155f35f83d0
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/7ec44ec396acd9ad13c3cdd6c95b3c15db399a5069b7a749

take 1 from Pavithree/askHistorians validation
take 10 from Pavithree/askHistorians train
Done getting streams/reloading from Pavithree/askHistorians
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying Isotonic/human_assistant_conversation initialization (shuffling through 3 files)


https://huggingface.co/datasets/Isotonic/human_assistant_conversation/resolve/eefe292fe4eec3bcc82a59c662bb8380510356cf/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/fe95c9902b07c6b2cb2d2edf583bbc21e15a8fb4ca50eba9d074802440406657.6aecf67d425d2fce014536356acafbab7380a685f31a9847397474162e187ac2.incomplete
INFO:datasets.utils.file_utils:https://huggingface.co/datasets/Isotonic/human_assistant_conversation/resolve/eefe292fe4eec3bcc82a59c662bb8380510356cf/README.md not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/fe95c9902b07c6b2cb2d2edf583bbc21e15a8fb4ca50eba9d074802440406657.6aecf67d425d2fce014536356acafbab7380a685f31a9847397474162e187ac2.incomplete


Downloading readme:   0%|          | 0.00/473 [00:00<?, ?B/s]

storing https://huggingface.co/datasets/Isotonic/human_assistant_conversation/resolve/eefe292fe4eec3bcc82a59c662bb8380510356cf/README.md in cache at /root/.cache/huggingface/datasets/downloads/fe95c9902b07c6b2cb2d2edf583bbc21e15a8fb4ca50eba9d074802440406657.6aecf67d425d2fce014536356acafbab7380a685f31a9847397474162e187ac2
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/Isotonic/human_assistant_conversation/resolve/eefe292fe4eec3bcc82a59c662bb8380510356cf/README.md in cache at /root/.cache/huggingface/datasets/downloads/fe95c9902b07c6b2cb2d2edf583bbc21e15a8fb4ca50eba9d074802440406657.6aecf67d425d2fce014536356acafbab7380a685f31a9847397474162e187ac2
creating metadata file for /root/.cache/huggingface/datasets/downloads/fe95c9902b07c6b2cb2d2edf583bbc21e15a8fb4ca50eba9d074802440406657.6aecf67d425d2fce014536356acafbab7380a685f31a9847397474162e187ac2
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/fe95c9902b07c6b2cb2d

take 1 from Isotonic/human_assistant_conversation validation
take 17 from Isotonic/human_assistant_conversation training
Done getting streams/reloading from Isotonic/human_assistant_conversation
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
trying albertvillanova/legal_contracts initialization


Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/albertvillanova--legal_contracts/fdb450d43ecae9e66d9f6dcf189b79a6b75059ce81f948673512b81b5146bfc1
INFO:datasets.info:Loading Dataset Infos from /root/.cache/huggingface/modules/datasets_modules/datasets/albertvillanova--legal_contracts/fdb450d43ecae9e66d9f6dcf189b79a6b75059ce81f948673512b81b5146bfc1


take 1 from albertvillanova/legal_contracts validation
take 17 from albertvillanova/legal_contracts train
Done getting streams/reloading from albertvillanova/legal_contracts
done val language check
done val longtext chunking
done train language check
done trains longtext chunking
Done collecting streaming data
saving streamed validation data: cache_val_mlm.pkl
saving streamed training for epoch 1: cache_train_mlm_001.pkl


In [139]:
import json
sums = {}
for setnm,setcnt in dataset_static_mlm['log_source'].items():
    #if setnm=='train':
    #    continue
    for dnm,dcnt in setcnt.items():
        if dnm not in sums: sums[dnm]=0
        sums[dnm]+=dcnt
print(json.dumps({k:round(v/sum([a for a in sums.values()]),4) for k,v in sums.items()},indent=3))

#dataset_static_mlm['log_source']

print('TRAIN')
print(json.dumps(dataset_static_mlm['log_source']['train'],indent=3))
print('VAL')
print(json.dumps(dataset_static_mlm['log_source']['val'],indent=3))

# epoch 0
epoch_1 = {
   "EleutherAI/the_pile_deduplicated": 0.1413,
   "tiiuae/falcon-refinedweb": 0.1142,
   "Skylion007/openwebtext": 0.0395,
   "Cohere/wikipedia-22-12": 0.123,
   "Multi-Domain-Expert-Layers/the_pile_books3_packed_128k": 0.0796,
   "nRuaif/book2-lite-cleaned": 0.0744,
   "macrocosm/arxiv_abstracts": 0.0199,
   "big_patent": 0.0059,
   "pile-of-law/pile-of-law/euro_parl": 0.0144,
   "https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A/download?path=%2F&files=LEDGAR_2016-2019.jsonl.zip": 0.0222,
   "pile-of-law/pile-of-law/r_legaladvice": 0.0101,
   "pile-of-law/pile-of-law/exam_outlines": 0.0069,
   "pile-of-law/pile-of-law/cc_casebooks": 0.0209,
   "eloukas/edgar-corpus": 0.0516,
   "Rahmaa/ElsevieR_ClEaN": 0.0477,
   "ashraq/financial-news-articles": 0.0065,
   "pile-of-law/pile-of-law/courtlistener_opinions": 0.0173,
   "pile-of-law/pile-of-law/sec_administrative_proceedings": 0.0176,
   "pile-of-law/pile-of-law/irs_legal_advice_memos": 0.0206,
   "launch/gov_report": 0.0186,
   "izumi-lab/open-text-books": 0.0173,
   "Skelebor/book_titles_and_descriptions": 0.0098,
   "joelito/legal_case_document_summarization": 0.0144,
   "joelito/legal-mc4/en": 0.0199,
   "Hellisotherpeople/DebateSum": 0.0134,
   "lukesjordan/worldbank-project-documents": 0.0134,
   "64bits/lex_fridman_podcast_for_llm_vicuna": 0.0069,
   "nid989/EssayFroum-Dataset": 0.0042,
   "nlpaueb/finer-139": 0.0039,
   "squad": 0.0088,
   "Pavithree/askHistorians": 0.0082,
   "Isotonic/human_assistant_conversation": 0.0049,
   "albertvillanova/legal_contracts": 0.0228
}

{
   "EleutherAI/the_pile_deduplicated": 0.1413,
   "tiiuae/falcon-refinedweb": 0.1142,
   "Skylion007/openwebtext": 0.0395,
   "Cohere/wikipedia-22-12": 0.123,
   "Multi-Domain-Expert-Layers/the_pile_books3_packed_128k": 0.0796,
   "nRuaif/book2-lite-cleaned": 0.0744,
   "macrocosm/arxiv_abstracts": 0.0199,
   "big_patent": 0.0059,
   "pile-of-law/pile-of-law/euro_parl": 0.0144,
   "https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A/download?path=%2F&files=LEDGAR_2016-2019.jsonl.zip": 0.0222,
   "pile-of-law/pile-of-law/r_legaladvice": 0.0101,
   "pile-of-law/pile-of-law/exam_outlines": 0.0069,
   "pile-of-law/pile-of-law/cc_casebooks": 0.0209,
   "eloukas/edgar-corpus": 0.0516,
   "Rahmaa/ElsevieR_ClEaN": 0.0477,
   "ashraq/financial-news-articles": 0.0065,
   "pile-of-law/pile-of-law/courtlistener_opinions": 0.0173,
   "pile-of-law/pile-of-law/sec_administrative_proceedings": 0.0176,
   "pile-of-law/pile-of-law/irs_legal_advice_memos": 0.0206,
   "launch/gov_report": 0.0186,
   "iz

In [261]:
#!rm *.pkl
def insert_newlines(text, chars_per_line=130):
    return '\n'.join(text[i:i+chars_per_line] for i in range(0, len(text), chars_per_line))

np.random.choice(dataset_static_mlm['train']['nextsentence'])
print(insert_newlines(np.random.choice(dataset_static_mlm['train']['mlm'])))



Facts This memo relies on facts provided by the taxpayer and available in the public records.
 Taxpayer manufactures Product and i
s based in City, State X. States X, Y, and Z filed suit against Taxpayer in the federal court accusing it of fixing a minimum pric
e for one of its more sought-after products. According to the complaint, Taxpayer engaged in anti-competitive practices in order t
o discourage price competition among retailers and keep prices higher than they otherwise would be. Taxpayer’s alleged policy expl
icitly forbade retailers from advertising Product below a dictated price. If any retailer violated the price policy, they would lo
se access to Taxpayer’s products for one year. This penalty was allegedly enforced against some retailers; however, Taxpayer often
 resumed its relationship with the retailers before the end of the one year embargo.
 The states alleged that this was an anti-com
petitive practice, preventing customers from purchasing low-priced Products, and pr


### Q&A Triplets!

Here I make a triplet dataset of query, positive answer, and negatives (if available)

B) QA Tasks
- squad_2
- WikiHow - used by S-BERT (questions and articles) - needs to be manually downloaded - https://github.com/mahnazkoupaee/WikiHow-Dataset/
- trivia_qa - 680 question, ans, evidence triplets. But, the context strings are very long (like wikipedia) and the questions are almost pop culture
- LLukas22/fiqa - financial QA, like conversations
- embedding-data/WikiAnswers - question-duplicates as paraphrases
- embedding-data/QQP_triplets - question-duplicates plus negatives (Quora)
- DONE LLukas22/lfqa_preprocessed - question and answers 226k (from REDDIT)
- DONE gbharti/finance-alpaca (like FIQA - finance Q&A) on 14k?
- DONE embedding-data/PAQ_pairs - wikipedia question & answers
- GONE the_pile_stack_exchange - single texts, but can be split into question, answer
- DONE donfu/oa-stackexchange - 6.3 million (AND GROWING -- must monitor)
- cais/mmlu - multiple choice, but some of the answers are longers (need to filter)
- DONE sciq - science questions - see question and support
- DONE wiki_qa - wikipedia QA
- qasc - high-school questions - can combine the "facts" into a support
- pubmed_qa - science QA with answers
- DONE JoBeer/eclassTrainST - can easily convert into question-answer pairs
- DONE - dictinonary -
- DONE POLICY QA - alzoubi36/policy_qa (has contracts and questions about contract)
=ra)
- DONE sc2qa/sc2q_commoncrawl - qa on common crawl with 45k
- DONE: yahoo_answers_topics (filter for 6=business; 3=education; 9=govt)
- TODO: wikihow:
-- need to save locally, then I think it can be streamed. Need to follow it's internal instructions
-- urllib.request.urlretrieve('https://public.boxcloud.com/d/1/b1!_xh0QGyf95mEUFLjiTuiMfD08KRjjfr5iLY8-codMzDJX8aMjQ4l8HqGYMVmT_OkhEHzS5cTD2PWy54NII7Egr9sotD9S17Pbf1ZmFeG7Rslpq-bLO7cpMBCHzDMUUahQqPX8bi42hrdxHBSIECK46tb1eum9GySp39bgzW5I0HckhGWzIPU3XeAdZ3IY38MVTVFqo5Y_CAfAwBbuN-ZSX7h3oJ3UzNqeGdCMfkBfQNcPgd0Hs283KDUEH0ZL4X7dwsakK5NyKcyiyZ5iwrCkTXzDksrf4ezJOPtRWlPVNsuq0PRuUOrLF-ynXQJglhqUtlXPwVPZ5FuhggADk5vCTDZRdBsIodyLdin8h8hwcYLEUTXBlMNKsrlmMmwsWluTxtlDoSbD41bp8YNW8bB50-dDBasJd6YjbKf7FpyH-RZh6LEr4VqJhX8BYwJD6jqKjtfKgY4QWqsczY1DN0W8aOkgRjqdUUjVNcM36Be-ueqP6fN0GkWC-jEEY7uwjp2imOQRd_CccvfEVvndJI2vhm4YSbwcnCWkX_-weiUtiebMBL-K8t7KLzw0J2frAZeKKKvLGwfZ3pactzK__XRMRiFwL5sRttWd2ctcgNs8VXGxd_XLxMBiyJutZtmdCyv00QuOL8H_t-Kld5n7dltppTU6h-b0zCoYDnM14yZYqDkQ-TSf1UVJUqwcyjLaHS357iFaJmRK5KwA5yc15sKZuh26KBmAia-XcElWfdoqbhzJhzDcBzdaPFWulYzTkdVY42sFBUI_ZBwKaMTDDddr1nMdagiJeTIPZ4XBJnEa8nLiXiozWB7wfn7Ce-aRoZ7Prf3chnflmiaI6dLP4LAYomD3EI9rtPzWmBCg6Gp9ASriydlLtHzmD_lHVacoax0y0Mft2EWf3ClyjhsQJo8cpL3vw0-69ol6MLvXgygOZWfEmhDBA5gClHuUjMRIVJnkc4sJryXi7Mnhjnx0B7uWKNWj-J2P-A8zEg4371Do6QJcrVvmCkGVxo71iErLfAjBU__KyZmSQ221vp9NJjqXJBTwivUtOXSF98sKoYfCC2AT0_Umq_qx4m3ucyYnwVeeV09_oMmtZOtnhcNV_cMNerTGk6qP54u9JlImevbd289CT5urlCxham79o8Aaxz1An__gofnL7_ZLF2lhT7X-S6e0gDMUjk73JusvbyWW8DhqbnUZ-obcI33qGl9AJzLM35nI5mO-WPEYgE-Z1DTKA../download', '/tmp/foo.zip')



In [262]:
#JoBeer/eclassTrainST
#foo =  load_dataset('gart-labor/eclassTrainST',split='train',streaming=True).map(clean_eclassTrainST).remove_columns(['text', 'entailment', 'contradiction', 'label'])
#foo =  load_dataset('gbharti/finance-alpaca',split='train',streaming=True)  # good, financial questions
#foo =  load_dataset('gart-labor/eclassTrainST',split='train',streaming=True) # NAD; just for paraphrased questions, not for QA
# foo =  load_dataset('parquet',data_files = 'https://huggingface.co/datasets/gart-labor/eclassTrainST/resolve/main/data/eval-00001-of-00001-d8aa08935841e6a9.parquet',split='train',streaming=False) # NAD; just for paraphrased questions, not for QA
#foo =  load_dataset('wiki_qa',split='train',streaming=True) # excellent; with negatives and positives
#foo =  load_dataset('THUDM/webglm-qa',split='train',streaming=True) # excellent; with negatives and positives
#foo = load_dataset("sciq",split='train',streaming=False) #
#foo = load_dataset("LLukas22/lfqa_preprocessed", split='train',streaming=True)

def clean_govreportqa(x):
    q_raw = x['question_summary_pairs']['question']
    a_raw = x['question_summary_pairs']['summary']
    if len(q_raw)==1:
        q_concat = q_raw[0]
        a_concat = a_raw[0]
    elif len(q_raw)<=3 and len(q_raw)>1:
        q_proc = [q[0].lower() + q[1:].strip('?') for q in q_raw]
        q_concat = ', '.join(q_proc[:-1]) + ', and ' + q_proc[-1] + '?'
        a_concat = ' '.join(a_raw)
    else:
        q_raw = q_raw[:2] + [random.choice(q_raw[2:])]
        q_proc = [q[0].lower() + q[1:].strip('?') for q in q_raw]
        q_concat = ', '.join(q_proc[:-1]) + ', and ' + q_proc[-1] + '?'
        a_concat = ' '.join(a_raw)
    x['query']=q_concat
    x['positives']=[a_concat]
    x['negatives']=[]
    return x



https://huggingface.co/datasets/launch/gov_report_qs/resolve/main/gov_report_qs.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/c07298c715d56a2e2a365a9d17248c047537e0a20eb5395b6b5c876c5ee69c24.d6843a5e64f4465492599bbda4467dc6175d23754e2bd48960a09e13fa2421c0.py.incomplete
INFO:datasets.utils.file_utils:https://huggingface.co/datasets/launch/gov_report_qs/resolve/main/gov_report_qs.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/c07298c715d56a2e2a365a9d17248c047537e0a20eb5395b6b5c876c5ee69c24.d6843a5e64f4465492599bbda4467dc6175d23754e2bd48960a09e13fa2421c0.py.incomplete


Downloading builder script:   0%|          | 0.00/13.3k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/launch/gov_report_qs/resolve/main/gov_report_qs.py in cache at /root/.cache/huggingface/datasets/downloads/c07298c715d56a2e2a365a9d17248c047537e0a20eb5395b6b5c876c5ee69c24.d6843a5e64f4465492599bbda4467dc6175d23754e2bd48960a09e13fa2421c0.py
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/launch/gov_report_qs/resolve/main/gov_report_qs.py in cache at /root/.cache/huggingface/datasets/downloads/c07298c715d56a2e2a365a9d17248c047537e0a20eb5395b6b5c876c5ee69c24.d6843a5e64f4465492599bbda4467dc6175d23754e2bd48960a09e13fa2421c0.py
creating metadata file for /root/.cache/huggingface/datasets/downloads/c07298c715d56a2e2a365a9d17248c047537e0a20eb5395b6b5c876c5ee69c24.d6843a5e64f4465492599bbda4467dc6175d23754e2bd48960a09e13fa2421c0.py
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/c07298c715d56a2e2a365a9d17248c047537e0a20eb5395b6b5c876c5ee69c24.d6843a5e64f4465492599bbda4467dc6175d23

Downloading readme:   0%|          | 0.00/8.16k [00:00<?, ?B/s]

storing https://huggingface.co/datasets/launch/gov_report_qs/resolve/main/README.md in cache at /root/.cache/huggingface/datasets/downloads/b8456fbad33d1aeb4d6ae6d786a851ebf9bfb8ab08c4f56d4c18c3f0835d5b1b.537ee53f0a040d3cfff696ab5e04fe3d5a351779417a4280df1a2c31d38e5eca
INFO:datasets.utils.file_utils:storing https://huggingface.co/datasets/launch/gov_report_qs/resolve/main/README.md in cache at /root/.cache/huggingface/datasets/downloads/b8456fbad33d1aeb4d6ae6d786a851ebf9bfb8ab08c4f56d4c18c3f0835d5b1b.537ee53f0a040d3cfff696ab5e04fe3d5a351779417a4280df1a2c31d38e5eca
creating metadata file for /root/.cache/huggingface/datasets/downloads/b8456fbad33d1aeb4d6ae6d786a851ebf9bfb8ab08c4f56d4c18c3f0835d5b1b.537ee53f0a040d3cfff696ab5e04fe3d5a351779417a4280df1a2c31d38e5eca
INFO:datasets.utils.file_utils:creating metadata file for /root/.cache/huggingface/datasets/downloads/b8456fbad33d1aeb4d6ae6d786a851ebf9bfb8ab08c4f56d4c18c3f0835d5b1b.537ee53f0a040d3cfff696ab5e04fe3d5a351779417a4280df1a2c31d38e5

{'doc_id': 'CRS_R43130', 'summary_paragraph_index': 3, 'document_sections': {'title': ['Politics and Governance Under President Zuma', 'Governance Challenges', 'Youth Population: Political Potential and Character'], 'paragraphs': ["The ANC has held a large majority in the National Assembly (parliament) since the first universal suffrage elections in 1994, and currently holds a majority just shy of the two-thirds required to amend the constitution. The parliament elects the country's president and, as a result, the ANC has controlled the executive branch since 1994. The ANC customarily nominates its party president to serve as national president, with some exceptions (see textbox below). The assembly elected the incumbent ANC president, Jacob Zuma, to his first term as national president in 2009.\nDespite rising criticism of Zuma and internal party challenges to his candidacy, Zuma mounted a carefully crafted, successful campaign to ensure his reelection as head of the ANC at a late 201

In [263]:
from torch.utils import data as torch_data
from rank_bm25 import BM25Okapi
import pandas as pd
import os

In [270]:
print("""TODO:
Blacks law: try streaming this CSV file (or just get it):
https://raw.githubusercontent.com/LexPredict/lexpredict-legal-dictionary/master/sources/blacks_second_edition/blacks_second_edition_terms.csv
And probably filter for length, add a context
""") # nah it is really dirty

def clean_govreportqa(x):
    q_raw = x['question_summary_pairs']['question']
    a_raw = x['question_summary_pairs']['summary']
    if len(q_raw)==1:
        q_concat = q_raw[0]
        a_concat = a_raw[0]
    elif len(q_raw)<=3 and len(q_raw)>1:
        q_proc = [q[0].lower() + q[1:].strip('?') for q in q_raw]
        q_concat = ', '.join(q_proc[:-1]) + ', and ' + q_proc[-1] + '?'
        a_concat = ' '.join(a_raw)
    else:
        q_raw = q_raw[:2] + [random.choice(q_raw[2:])]
        q_proc = [q[0].lower() + q[1:].strip('?') for q in q_raw]
        q_concat = ', '.join(q_proc[:-1]) + ', and ' + q_proc[-1] + '?'
        a_concat = ' '.join(a_raw)
    x['query']=q_concat
    x['positives']=[a_concat]
    x['negatives']=[]
    x['type'] = 'qa_triplet'
    return x

POLICYQA_PREPEND = [
    "Regarding the online Terms of Service and Data Protection policies: %s",
    "Considering your data security, %s",
    "With respect to user privacy guidelines, %s",
    "As outlined in the data protection commitments, %s",
    "Pertaining to the terms of service, %s",
    "Addressing information usage, %s",
    "Pertaining to data confidentiality, %s",
    "In connection to your privacy assurance, %s",
    "Touching upon online data policies, %s",
    "In alignment with your user data safeguards, %s",
    "Taking a closer look at data security, %s",
    "In light of your user-information protocols, %s",
    "Within the context of user data sharing, %s",
    "In consideration for user data ownership, %s",
    "In consideration of data securty and user-data protection, %s",
    "In regards to your online services, %s",
    "Regarding your data handling procedures and policies, %s",
    "In the context of user information disclosure and privacy, %s",
    "As per the information security policies, %s",
    "In the context of the Terms of Service: %s",
    "%s (in the context of your online Terms of Service)"
]

STACKEXCHANGE_NONQUANT_DOMAINS = [
    "stackexchange-"+k for k in [
        "academia",
        "aviation",
        "bicycles",
        "biology",
        "buddhism",
        "chemistry",
        "chess",
        "christianity",
        "coffee",
        "cogsci",
        "cooking",
        "crafts",
        "cseducators",
        "diy",
        "drones",
        "earthscience",
        "ebooks",
        "electronics",
        "english",
        "expatriates",
        "fitness",
        "freelancing",
        "gardening",
        "gaming",
        "genealogy",
        "ham",
        "hardwarerecs",
        "health",
        "hinduism",
        "history",
        "homebrew",
        "hsm",
        "interpersonal",
        "iot",
        "islam",
        "judaism",
        "law",
        "lifehacks",
        "linguistics",
        "literature",
        "martialarts",
        "materials",
        "mechanics",
        "moderators",
        "money",
        "music",
        "mythology",
        "outdoors",
        "parenting",
        "patents",
        "pets",
        "philosophy",
        "pm",
        "politics",
        "security",
        "skeptics",
        "softwarerecs",
        "sustainability",
        "travel",
        "vegetarianism",
        "woodworking",
        "workplace",
        "worldbuilding",
        "writers"
        ]
    ]

list_of_dictionary_paraphrases = [
    "Define the term: %s",
    "What is the definition of the following word or expression: %s",
    "Define the following term: %s",
    "what does the following term mean: %s",
    "What is the definition of the following word: %s",
    'Provide the definition for the term "%s".',
    'Explain the meaning of the word "%s".',
    'Elucidate the definition of "%s".',
    'Clarify the term "%s".',
    'What does the word "%s" signify?',
    'Provide a definition for "%s".',
    'How is the term "%s" defined?',
    'What exactly is meant by "%s"?',
    'Share the definition of "%s".',
    'Offer a definition for word or expression "%s".',
    'Explain what "%s" refers to.',
    'Define the term "%s", please.',
    'What\'s the definition of "%s"?',
    'Please elucidate "%s".',
    'How is "%s" defined?',
    'Explain the concept behind "%s".',
    'What is meant by the word "%s"?',
    'Can you give the definition of "%s"?',
    'Could you provide the meaning of "%s"?',
    'Please offer the definition of "%s".'
]


def filter_dictionary(x):
    """get definitions of only medium sized words with large definitions"""
    if x['word'] is None:
        return False
    return len(x['definition'])>100 and len(x['word'].replace(" ",''))>=4

def clean_dictionary(x):
    """converts a dictionary term into a question, sampling randomly from 20 template questions"""
    idx_random_question_template = ord(x['definition'].replace(' ','')[-6]) % len(list_of_dictionary_paraphrases)
    question_template =list_of_dictionary_paraphrases[idx_random_question_template]
    x['query'] = question_template % x['word']
    x['positives'] = [x['definition']]
    x['negatives'] = []
    x['type'] = 'qa_triplet'
    return x

def clean_webglmqa(x):
    x['query']=x['question']
    x['positives'] = [x['answer']]
    x['negatives'] = []
    x['type'] = 'qa_triplet'
    return x

def clean_stream_PAQ_pairs(x):
    x['query'] = x['set'][0]
    x['positives'] = [x['set'][1]]
    x['negatives'] = []
    x['type'] = 'qa_triplet'
    return x

def clean_stream_finance_alpaca(x):
    x['query'] = x['instruction']
    x['positives'] = [x['output']]
    x['negatives'] = []
    x['type'] = 'qa_triplet'
    return x

def clean_stream_wiki_qa(x):
    x['query'] = x['question']
    is_pos = x['label']
    answer = x['answer']
    pos = [answer] if is_pos else []
    neg = [answer] if (not is_pos) else []
    x['positives'] = pos
    x['negatives'] = neg
    x['type'] = 'qa_triplet'
    return x

def clean_stream_oa_stackexchange(x):
    x['query'] = x['INSTRUCTION']
    x['positives'] = [x['RESPONSE']]
    x['negatives'] = []
    x['type'] = 'qa_triplet'
    return x

def clean_stream_sciqa(x):
    x['query'] = x['question']
    x['positives'] = [x['support']]
    x['negatives'] = []
    x['type'] = 'qa_triplet'
    return x

def clean_lfqa(x):
    x['query'] = x['question']
    x['positives'] = [x['answer']]
    x['negatives'] = []
    x['type'] = 'qa_triplet'
    return x

def filter_os_stackexchange(x):
    return x['SOURCE'] in STACKEXCHANGE_NONQUANT_DOMAINS

def get_name_and_description_eclassTrainST(text):
    description, name = text.split("; Name:")
    return description.replace("Description: ","").strip(), name.strip()

def clean_eclassTrainST(x):
    """This set isn't really about entailment/contradiction; it is really a dictionary"""
    description, name = get_name_and_description_eclassTrainST(x['text'])
    pos, _ = get_name_and_description_eclassTrainST(x['entailment'])
    extra, _ = get_name_and_description_eclassTrainST(x['contradiction'])
    x['query'] = 'What is a "%s"?' % name
    x['positives'] = [pos]
    x['negatives'] = []
    # add the entailment as positive, contradiction as negatives
    if x['label'] == 'entailment':
        x['positives'].append(extra)
    else:
        x['negatives'] = [extra]
    x['type'] = 'qa_triplet'
    return x

# do to: alzoubi36/policy_qa - policy questions
def clean_policyqa(x):
    """Adds more context to the questions about data security in the alzoubi36/policy_qa qa set """
    idx_random_question_template = ord(x['context'].replace(' ','')[-5]) % len(POLICYQA_PREPEND)
    question_template =POLICYQA_PREPEND[idx_random_question_template]
    q = x['question']
    q = q[0].lower() + q[1:]
    x['query'] = question_template % q # ['id', 'title', 'context', 'question', 'answers']
    x['positives'] = [x['context']]
    # fetch a negative from the negative corpus
    negatives_random, _ = negative_example_generator.find_negative(x['context'], k = 1, skip=10)
    x['negatives'] = negatives_random
    x['type'] = 'qa_triplet'
    return x

def clean_sc2qa(x):
    x['query'] = x['question']
    x['positives'] = [x['article']]
    x['negatives'] = []
    x['type'] = 'qa_triplet'
    return x

def clean_yahooanswers(x):
    x['query'] = (x['question_title'] + " " + x['question_content']).strip() # 'question_title', 'question_content'
    x['positives'] = [x['best_answer']]
    x['negatives'] = []
    x['type'] = 'qa_triplet'
    return x

def filter_yahooanswers(x):
    """Yahoo news filtering (filter for 6=business; 3=education; 9=govt)"""
    return x['topic'] in [3,6,9] and len(x['question_title'])>10 and len(x['best_answer'])>10


def clean_businessbook(x):
    """17k business books cleaning"""
    x['query'] = x['question']
    x['positives'] = [x['answer']]
    x['negatives'] = []
    x['type'] = 'qa_triplet'
    return x

#dict_keys(['question_id', 'question', 'document_title', 'answer', 'label'])
qa_streaming_cleaning_functions = {
    'embedding-data/PAQ_pairs':(clean_stream_PAQ_pairs, None, ['query','positives','negatives'],['set']),
    'gbharti/finance-alpaca':(clean_stream_finance_alpaca,None, ['query','positives','negatives'],['input', 'output', 'text', 'instruction']),
    'wiki_qa':(clean_stream_wiki_qa, None, ['query','positives','negatives'],['question_id', 'question', 'document_title', 'answer', 'label']),
    'donfu/oa-stackexchange':(clean_stream_oa_stackexchange, filter_os_stackexchange, ['query','positives','negatives'], ['INSTRUCTION', 'RESPONSE', 'SOURCE', 'METADATA']),
    'gart-labor/eclassTrainST':(clean_eclassTrainST, None, ['query','positives','negatives'], ['text', 'entailment', 'contradiction', 'label']),
    'THUDM/webglm-qa':( clean_webglmqa, None, ['query','positives','negatives'], ['question','answer','references']),
    'sciqa': (clean_stream_sciqa, None, ['query','positives','negatives'], ['question', 'distractor3', 'distractor1', 'distractor2', 'correct_answer', 'support']),
    'LLukas22/lfqa_preprocessed':(clean_lfqa, None, ['query','positives','negatives'], ['question','answer','context']), #REDDIT QUESTION ANSWERS (ASK historians, ask me like I'M FIVE)
    'npvinHnivqn/EnglishDictionary':(clean_dictionary, filter_dictionary, ['query','positives','negatives'], ['word','definition']), # dictionaries
    'alzoubi36/policy_qa':(clean_policyqa, None, ['query','positives','negatives'],  ['id', 'title', 'context', 'question', 'answers'] ), # PRIVACYGLUE
    'sc2qa/sc2q_commoncrawl':(clean_sc2qa, None, ['query','positives','negatives'], ['question','article','url']),
    'yahoo_answers_topics':(clean_yahooanswers, filter_yahooanswers, ['query','positives','negatives'], ['id', 'topic', 'question_title', 'question_content', 'best_answer']),
    'launch/gov_report_qs':(clean_govreportqa, None, ['query','positives','negatives'],['doc_id', 'summary_paragraph_index', 'document_sections', 'question_summary_pairs']),
    'theoldmandthesea/17k_business_book':(clean_businessbook, None, ['query','positives','negatives'], ['question','answer','book']),
}

DEFAULT_PROB_QA = 0.1
qa_files = [
    ('embedding-data/PAQ_pairs',None, DEFAULT_PROB_QA, 7.29*10**6, 'qa_triplet', False), # wikipedia pop culture pairs # get from 'set', 7.29*10**6
    ('gbharti/finance-alpaca',None, DEFAULT_PROB_QA, 6.89*10**5, 'qa_triplet', False), # Stanford's Alpaca (https://github.com/tatsu-lab/stanford_alpaca) and FiQA (https://sites.google.com/view/fiqa/) with another 1.3k pairs custom generated using GPT3.5
    ('wiki_qa',None, DEFAULT_PROB_QA, 20.4*10**3, 'qa_triplet', False), # Wiki Question Answering corpus from Microsoft. with multiple negatives that are similar!
    ('donfu/oa-stackexchange',None, DEFAULT_PROB_QA*2, 6330000, 'qa_triplet', (14, int(6330000//14))), # stack-exchange question-answer pairs, across lots of domains; notice the original is 6.6 million, but there is a filter
    ('gart-labor/eclassTrainST', None, 0.02, 450912, 'qa_triplet', False), # questions about trade / business stuff
    ('THUDM/webglm-qa', None, DEFAULT_PROB_QA, 43600, 'qa_triplet', False),
    ('sciq',None, DEFAULT_PROB_QA, 11679, 'qa_triplet', False), # science questions from Allenai, with a question and support
    ('LLukas22/lfqa_preprocessed', None, DEFAULT_PROB_QA, 226000,'qa_triplet',False),# REDDIT QUESTION ANSWERS (ASK historians, ask me like I'M FIVE)
    ('npvinHnivqn/EnglishDictionary',None, DEFAULT_PROB_QA/4, 30864, 'qa_triplet',False), # 0.05 original size: 11200, post-file 30865
    ('alzoubi36/policy_qa', None, DEFAULT_PROB_QA/4, 17100,  'qa_triplet',False),
    ('sc2qa/sc2q_commoncrawl',None, DEFAULT_PROB_QA, 44500, 'qa_triplet', False),
    ('yahoo_answers_topics', None, DEFAULT_PROB_QA, 401357,'qa_triplet', False),
    ('launch/gov_report_qs','paragraph', DEFAULT_PROB_QA/5, 4878, 'qa_triplet', False),
    ('theoldmandthesea/17k_business_book', None, DEFAULT_PROB_QA/4, 17480, 'qa_triplet', False),
]

qadata_streaming_config = {
    'files':qa_files,
    'max_seq_length':512,
    'prepend_q': 'query: ',
    'prepend_a': 'passage: ',
    'val_size':1000,
    'train_chunk_size':5000,
    'seed':42,
}

def initialize_qa_streaming_datasets(data_streaming_config, streaming_cleaning_functions):
    files = data_streaming_config['files']
    qa_streaming_datsets, qa_probabilities, qa_datasizes = [],[],[]
    for (qa_nm, set_nm, prob, dataset_size, special_handling, partition_shuffle) in files:

        if prob ==0:
            continue
        # get cleaning & filter functions for streaming data / map & filters
        clean_func, filter_func, feature_names, removefeature_names = streaming_cleaning_functions[qa_nm]

        # arguments for the load_dataset (huggingface repos)
        load_dataset_args = {
            'path':qa_nm, 'name':set_nm, 'split':'train', 'streaming':True
        }
        # for other non-huggingface repos, path needs to be a "builder"
        if qa_nm.endswith('.jsonl') or qa_nm.endswith('.jsonl.zip') or qa_nm.endswith('.jsonl.zst'):
            load_dataset_args.update({'path':'json','data_files':qa_nm})

        print('trying %s' % qa_nm)
        if filter_func is None:
            dset_stream = load_dataset(**load_dataset_args).map(clean_func).remove_columns(removefeature_names)
        else:
            dset_stream = load_dataset(**load_dataset_args).filter(filter_func).map(clean_func).remove_columns(removefeature_names)

        qa_streaming_datsets.append(dset_stream)
        qa_probabilities.append(prob);
        qa_datasizes.append(dataset_size)

    print('done initializing the QA streaming datasets')
    return qa_streaming_datsets, qa_probabilities, qa_datasizes

def streaming_skip(skip, list_of_streaming_datasets, probabilities, datasizes, seed=42, convert_to_static = False):
    """Function loops through a list of streaming datasets, skips a first K values based on the probabilities, and returns them"""
    out = []
    normalized_p = [p/sum(probabilities) for p in probabilities]
    for dset, p, size in list_of_streaming_datasets, normalized_p, datasizes:
        skip_in_this_set = max(0,int(p)*skip)
        out.append(dset.skip(skip_in_this_set))
    return out

def streaming_take(skip, start_proportion, chunksize, list_of_streaming_datasets, probabilities, datasizes,  convert_to_static = False):
    """Takes some examples based on a starting point within the dataset, as a proportion of its total size"""
    out = []
    normalized_p = [p/sum(probabilities) for p in probabilities]
    for j, (dset, p, size) in enumerate(zip(list_of_streaming_datasets, normalized_p, datasizes)):
        #print(type(dset))
        #print(type(p))
        #print(type(size))
        # skip for valset
        skip_in_this_set = int(round(p*skip))
        # afterwards, where to start?
        skip_to_start = int(start_proportion*(size-skip_in_this_set))
        take_from_this_set = int(round(chunksize*p))
        if skip_to_start>0:
            dset_skipped = dset.skip(skip_in_this_set+skip_to_start).take(take_from_this_set)
        else:
            dset_skipped = dset.take(take_from_this_set)

        if not convert_to_static:
            # option to return the streaming dataset
            out.append(dset_skipped)
        else:
            # option just to convert the streaming dataset to static outputs
            for example in dset_skipped:
                example['source_id'] = j
                out.append(example)
        print('done %d' % j)
    return out

def train_test_splits_from_stream_qa(
    streaming_dataset,
    val_size = 100,#2000,
    epoch = 0,
    chunk_size = 500,#6000,
    path_to_val_cache = 'val_qa_cache.pkl',
    probabilities = None,
    datasizes = None,
    seed=42
):
    """
    val_size = 2000, number of streaming-iter to skip, reserved for the val-sze
    epoch = 0, epoch will change the seed when sampling the chunk idx for making the training set
    chunk_size = 5000, # number of streaming-iter to select the training data chunk
    max_chunk_start = 2000000, # randomly sample within this interval for streaming chunks
    """
    if os.path.isfile(path_to_val_cache):
        print('RELOADING VAL-QA SET: iter=%s' % path_to_val_cache)
        with open(path_to_val_cache,'rb') as pcon:
            val_corpus_list = pickle.load(pcon)
        print('VAL-QA SET SIZE: %d' % len(val_corpus_list))
    else:
        # stream validation set
        print('STREAMING VAL-QA DATA: %d' % val_size)
        val_corpus_list = streaming_take(
            skip=0,
            start_proportion=0,
            chunksize=val_size,
            list_of_streaming_datasets=streaming_dataset,
            probabilities=probabilities,
            datasizes=datasizes,
            convert_to_static = True
        )
        print('REALIZED VAL-QA DATA: %d' % len(val_corpus_list))
        # save the validation corpus
        print('SAVING VAL-QA SET: %s' % path_to_val_cache)
        with open(path_to_val_cache,'wb') as pcon:
            pickle.dump(val_corpus_list, pcon)

    # take a random interger to start the streaming of training data
    # starts at a random position
    train_start_proportion = np.random.RandomState(seed + epoch).random()*0.99
    print(train_start_proportion)

    # stream training data
    print('STREAMING TRAIN QA-DATA: %d STARTING AT: %0.3f' % (chunk_size,train_start_proportion))
    train_corpus_list = streaming_take(
            skip=val_size,
            start_proportion=train_start_proportion,
            chunksize=chunk_size,
            list_of_streaming_datasets=streaming_dataset,
            probabilities=probabilities,
            datasizes=datasizes,
            convert_to_static = True
        )

    print('REALISED TRAIN QA-DATA SIZE: %d' % len(train_corpus_list))
    return {
        'train':train_corpus_list,
        'val':val_corpus_list,
        'epoch':0,
        'index_stream':train_start_proportion
    }

def initialize_and_get_triplet_streaming_datasets(
    data_streaming_config,
    streaming_cleaning_functions,
    start_proportion = None,
    epoch=0,
    seed=42,
    path_to_val_cache = 'cache_val_qa.pkl',
    path_to_train_cache_epoch = 'cache_train_qa_%03g.pkl',
    do_check_english = True,
    name = 'QA' #
):
    """Converts stream of unlabelled text data into static datasets for: for Triplet data tasks (QA-task/IR-task)"""
    # list of files to stream
    print('Initializing the streaming-QA to static-dataset procedure...')
    files = data_streaming_config['files']
    # number of examples to take from stream for validation set
    val_size = data_streaming_config['val_size']
    # number of examples to take from stream for training set
    train_chunk_size = data_streaming_config['train_chunk_size']
    min_seq_len = data_streaming_config.get('min_seq_length', 48)
    # normalization constant for normalizing the weights into probabilities
    probability_normalization_const = sum([x[2] for x in files])

    # where to initialize start-stream for training data
    if start_proportion is None:
        start_proportion = np.random.RandomState(seed+epoch).uniform()*0.99

    # reload cached files
    path_to_train_cache = None if not '%03g' in path_to_train_cache_epoch else path_to_train_cache_epoch % epoch
    do_make_valset = not os.path.isfile(path_to_val_cache)
    do_make_trainset = not os.path.isfile(path_to_train_cache)
    if not do_make_valset:
        print(f'RELOADING VAL-{name} SET: iter=%s' % path_to_val_cache)
        with open(path_to_val_cache,'rb') as pcon:
            datalist_val_triplet_static = pickle.load(pcon)
        print(f'VAL-{name} SET SIZE: %d' % len(datalist_val_triplet_static))
    else:
        datalist_val_triplet_static = []
    if not do_make_trainset:
        print(f'RELOADING VAL-{name} SET: iter=%s' % path_to_val_cache)
        with open(path_to_train_cache,'rb') as pcon:
            datalist_train_triplet_static = pickle.load(pcon)
        print(f'TRAIN-{name} EPOCH-%d SET SIZE: %d' % (epoch, len(datalist_train_triplet_static)))
    else:
        datalist_train_triplet_static = []

    if (do_make_trainset or do_make_valset):

        # loop through datasets
        for (data_nm, set_nm, prob, dataset_size, special_handling, partition_shuffle), dataset_key in zip(
            files, streaming_cleaning_functions.keys()
        ):
            if prob ==0:
                continue
            prob /= probability_normalization_const

            # get cleaning & filter functions for streaming data functionality
            clean_func, filter_func, feature_names, removefeature_names = streaming_cleaning_functions[dataset_key]

            # set arguments for the load_dataset (huggingface repos)
            load_dataset_args = {
                'path':data_nm, 'name':set_nm, 'split':'train', 'streaming':True
            }
            # for other non-huggingface repos, path needs to be a "builder"
            if data_nm.endswith('.jsonl') or data_nm.endswith('.jsonl.zip') or data_nm.endswith('.jsonl.zst'):
                load_dataset_args.update({'path':'json','data_files':data_nm})

            # special proecssing of datasets with multiple partitions
            if bool(partition_shuffle): # or str(epoch)=='val':

                n_files, n_per_file = partition_shuffle
                dataset_size = n_per_file
                print('trying %s initialization (shuffling through %d files)' % (data_nm, n_files))

                # whether there is a filter
                if filter_func is None:
                    dset_stream = load_dataset(**load_dataset_args)
                else:
                    dset_stream = load_dataset(**load_dataset_args).filter(filter_func)

                # validation set
                if do_make_valset:
                    # take from stream
                    n_valset_take = max(int(prob*val_size), 1)
                    if n_valset_take==1:
                        print(prob)
                        print(val_size)
                    print('take %d from %s validation'% (n_valset_take, data_nm))
                    dset_stream_val = dset_stream.take(n_valset_take).map(clean_func).remove_columns(removefeature_names)
                    # convert stream to a static set and do check
                    dset_static_val_thisset = [
                        e for e in dset_stream_val if bool(re.search(r"\w+",e['query'][:200]))
                    ]
                # training set
                if do_make_trainset:
                    # randomly skip a bunch from this set
                    skip_to_start = int(start_proportion*n_per_file)
                    take_from_this_set = max(int(round(train_chunk_size*prob)),1)
                    print('take %d from %s training'% (take_from_this_set, data_nm))
                    # shuffle: take a random data partition (from the dataset's list of files)
                    dset_stream_train = dset_stream_val.shuffle(
                        seed = seed+epoch, buffer_size = skip_to_start+take_from_this_set,
                    )
                    dset_stream_train = dset_stream_train.skip(
                        skip_to_start # random skip through dataset to new start position
                    ).take(
                        take_from_this_set # take this amount for the training ste
                    ).map(clean_func).remove_columns(removefeature_names)
                    # convert training to static dataset
                    dset_static_train_thisset = [
                        e for e in dset_stream_train if bool(re.search(r"\w+",e['query'][:200]))
                    ]
            else:
                # regular streaming
                print('trying %s initialization' % data_nm)
                # whether there is a filter
                if filter_func is None:
                    dset_stream = load_dataset(**load_dataset_args).map(clean_func).remove_columns(removefeature_names)
                else:
                    dset_stream = load_dataset(**load_dataset_args).filter(filter_func).map(clean_func).remove_columns(removefeature_names)
                # take from stream
                n_valset_take = max(int(prob*val_size), 1) # size of valset
                if n_valset_take==1:
                    print(prob)
                    print(val_size)
                print('take %d from %s validation'% (n_valset_take, data_nm))
                skip_to_start = int(start_proportion*(dataset_size-n_valset_take)) # random point to skip to
                n_train_take = max(int(round(train_chunk_size*prob)),1) # size of train set
                print('take %d from %s train'% (n_train_take, data_nm))
                if do_make_valset:
                    dset_stream_val = dset_stream.take(n_valset_take)
                    dset_static_val_thisset = [
                        e for e in dset_stream_val if bool(re.search(r"\w+",e['query'][:200]))
                    ]
                if do_make_trainset:
                    dset_stream_train = dset_stream.skip(n_valset_take+skip_to_start).take(n_train_take)
                    dset_static_train_thisset = [
                        e for e in dset_stream_train if bool(re.search(r"\w+",e['query'][:200]))
                    ]
            print('Done getting streams/reloading from %s' % data_nm)
            # check language, chunk sentences
            if do_make_valset:
                # discard non-english
                dset_static_val_thisset =[
                    e for e in dset_static_val_thisset if check_language(e['query'])[0] #detect(e['query'][:200]+" hello")=='en'
                ]
                print('done val language check')
                # add to val set
                datalist_val_triplet_static.extend(dset_static_val_thisset)

            # check language, chunk sentences
            if do_make_trainset:
                # discard non-english
                dset_static_train_thisset =[
                    e for e in dset_static_train_thisset if check_language(e['query'])[0] #detect(e['query'][:200] +" hello")=='en'
                ]
                print('done train language check')

                # ensure that none of the examples in the traning set are in the validation set
                if do_make_valset:
                    val_queries = set([q['query'] for q in dset_static_val_thisset])
                    dset_static_train_thisset = [
                        s for s in dset_static_train_thisset if s['query'] not in val_queries
                    ]

                # add to training set
                datalist_train_triplet_static.extend(dset_static_train_thisset)

        print(f'Done collecting {name} streaming data')

    if do_make_valset:
        print('saving streamed %s validation data: %s' % (name, path_to_val_cache))
        with open(path_to_val_cache,'wb') as pcon:
            pickle.dump(datalist_val_triplet_static, pcon)

    if do_make_trainset:
        print('saving streamed %s training for epoch %d: %s' % (name, epoch, path_to_train_cache))
        with open(path_to_train_cache,'wb') as pcon:
            pickle.dump(datalist_train_triplet_static, pcon)

    return {
        'train':datalist_train_triplet_static,
        'val':datalist_val_triplet_static,
        'epoch':epoch,
        'index_stream':start_proportion
    }


class DatasetTriplets(torch_data.Dataset):
    def __init__(
        self,
        list_of_data=None,
        n_negatives= 3,
        topk_negatives_discard = 15, # get top kth most-similar results, discard first k, to use as negative
        focal_text_name ='query',
        positives_text_name ='positives',
        negativess_text_name ='negatives',
        seed = 32,
        negative_corpus_method = 'bm25', # how to sample (pseudo)negatives internally
        label_processor_class = None # (optional) function to process negatives
    ):
        self.n_negatives = n_negatives
        self.topk_negatives_discard = topk_negatives_discard
        self.data = {}
        self.focal_text_name =focal_text_name
        self.positives_text_name = positives_text_name
        self.negativess_text_name = negativess_text_name
        self.seed = 42
        self.random = np.random.RandomState(self.seed)
        self.label_processor_class = label_processor_class
        self.negative_corpus_method = negative_corpus_method
        assert negative_corpus_method in ['bm25','ann-tfidf']

        if list_of_data is not None and len(list_of_data)>0:

            # loop through the data and add each triplets: export a panda df as final data
            self.df = self.process(list_of_data)

    def process(self, list_of_data):
        """Makes (query,pos,neg)-triplets, converts samples to dataframe for pytorch iteration"""

        # loop through the data and add each triplets
        self._loop_through_list_of_data_and_add_to_selfdata(
            list_of_data = list_of_data
        )

        # add positives to self.data
        self._find_positives_and_add_to_data()

        # add negatives to self.data
        self._find_negatives_and_add_to_data()

        # harden the dataset to pandas dataframe
        df = self.sample_data_and_make_static_dataframe(self.data)
        return df

    def _loop_through_list_of_data_and_add_to_selfdata(
        self,
        list_of_data
    ):
        """loops through and adds the positive/focal texts and negatives"""
        for raw_example in list_of_data:
            # add each element to the data
            self._add_triplet_to_data(
                focal_texts=raw_example[self.focal_text_name],
                positve_texts=raw_example[self.positives_text_name],
                negative_texts=raw_example[self.negativess_text_name],
            )
        self.focal_texts_as_keys = list(self.data.keys())

    def _add_triplet_to_data(
        self,
        focal_texts,
        positve_texts,
        negative_texts
    ):
        """add focal text to the data"""
        do_add_focals = False
        if isinstance(focal_texts,list):
            focal_text = sort(focal_texts)[0]
            do_add_focals = True
        elif isinstance(focal_texts, str):
            focal_text = focal_texts
        if focal_text not in self.data.keys():
            self.data[focal_text] = {'positives':[], 'negatives':[]}
        self.data[focal_text]['positives'] += [p for p in positve_texts if p not in self.data[focal_text]['positives']]
        #if negative_texts is None:
        #    print(focal_texts)
        #    print(positve_texts)
        #    print(negative_texts)
        self.data[focal_text]['negatives'] += negative_texts if (negative_texts is not None) else []
        if do_add_focals:
            self.data[focal_text]['positives'] += focal_texts[1:]

    def _build_corpus_of_potential_negatives(self):
        # grab positives as default negatives
        potential_corpus = [
            self.data[k]['positives'][:1] for k in self.focal_texts_as_keys
        ]
        # insert NEGATIVE if empty for an entry
        potential_corpus = [
            'NEGATIVE' if (not bool(s)) else s[0] for s in potential_corpus
        ]

        # negatives by BM25
        if self.negative_corpus_method == 'bm25':

            # tokenize for BM25
            print('building negatives via BM25')
            tokenized_corpus = [s.lower().split(" ") for s in potential_corpus]
            # compile BM25 corpus
            bm25 = BM25Okapi(tokenized_corpus)
            return {'retriever':bm25, 'corpus':potential_corpus}

        elif self.negative_corpus_method == 'ann-tfidf':
            print('building negatives via ANN-TFIDF')
            potential_corpus = [
                s for s in potential_corpus
                if len(s)>40 and len(s.split(" "))>10
            ]
            negative_example_generator= NegativeExampleGenerator(
                n_reps = 1, #
                tfidf_nfeatures = 4000,
                nchar_max_paragraph=3000,
                nword_max=100,
                nchar_max_word=4,
                save_cache = 'negative_sampler_%d-%s.pkl' % (len(potential_corpus), potential_corpus[0][0]),
                corpus = potential_corpus
            )
            return {'retriever':negative_example_generator, 'corpus':potential_corpus}

    def _find_negative(
        self,
        focal_text_as_query,
        positive_examples=None,
        use_focal_text = True,
        use_positives=True,
        neg_retriever=None,
        corpus = None
    ):
        """Given a query, uses BM25 to find similar but wrong answers, to serve as triplet negatives; for a single query"""
        bmquery = (focal_text_as_query if use_focal_text else "") + " " + ("" if (not use_positives) else positive_examples[0])
        bmquery = bmquery.strip()
        if self.negative_corpus_method == 'bm25':
            # make the query tokens
            bmquery_tokenized = bmquery.lower().split(" ")
            # search by BM25
            top_results = neg_retriever.get_top_n(
                bmquery_tokenized, corpus, n=self.topk_negatives_discard + self.n_negatives
            )
        elif self.negative_corpus_method == 'ann-tfidf':
            # query the ANN index
            top_results,_ = neg_retriever.find_negative(
                bmquery, k=self.n_negatives+2, skip=self.topk_negatives_discard
            )

        top_results = [
            s for s in top_results
            if (
                s not in positive_examples+[focal_text_as_query]
            )
        ]
        # remove any text that is equivalent to the query / focal texts
        potential_negatives = top_results[-1*self.n_negatives:]
        return potential_negatives

    def _find_positives_and_add_to_data(self):
        """For data that has a label, this can be used to artifically find and create synthetic positives"""
        pass

    def _find_negatives_and_add_to_data(self):
        """Uses BM25 to find similar but wrong answers, to serve as triplet negatives; loop over data"""

        # build bm25 corpus or tfidf-ANN index
        neg_corpus = self._build_corpus_of_potential_negatives()

        # loop through data, find examples which don't have negatives
        for k,d in self.data.items():
            if not bool(d['negatives']):
                negatives = self._find_negative(
                    focal_text_as_query=k,
                    positive_examples=d['positives'],
                    use_focal_text = True,
                    use_positives=bool(d['positives']),
                    neg_retriever=neg_corpus['retriever'],
                    corpus = neg_corpus['corpus']
                )
                d['negatives']+= negatives
        print('done finding negatives')

    def sample_data_and_make_static_dataframe(self, seed = 42):
        focals =[]
        pos =[]
        neg = []
        for query,d in self.data.items():
            for j in range(min(self.n_negatives, len(d['negatives']))):
                if len(d['positives'])==0:
                    continue
                elif len(d['positives'])==1:
                    pos+=d['positives']
                elif len(d['positives'])>1:
                    pos.append(self.random.choice(d['positives']))
                neg.append(d['negatives'][j])
                focals.append(query)
        df = pd.DataFrame({'query':focals, 'pos':pos, 'neg':neg})
        return df

    def __len__(self):
        return len(self.df)

    def __getitem__(self,idx):
        #key = self.focal_texts_as_keys[idx]
        #return {**{'query':key}, **self.data[key]}
        return self.df.iloc[idx].to_dict()



TODO:
Blacks law: try streaming this CSV file (or just get it):
https://raw.githubusercontent.com/LexPredict/lexpredict-legal-dictionary/master/sources/blacks_second_edition/blacks_second_edition_terms.csv
And probably filter for length, add a context



In [271]:
qadata_streaming_config = {
    'files':qa_files,
    'max_seq_length':512,
    'val_size':2000,
    'train_chunk_size':5000,
    'seed':42,
}

# !rm cache_*
qa_statics_datsets = initialize_and_get_triplet_streaming_datasets(
    data_streaming_config = qadata_streaming_config,
    streaming_cleaning_functions = qa_streaming_cleaning_functions,
    start_proportion = None,
    epoch=0,
    seed=42,
    path_to_val_cache = 'cache_val_qa.pkl',
    path_to_train_cache_epoch = 'cache_train_qa_%03g.pkl',
    do_check_english = True,
    name = 'QA' #
)

Initializing the streaming-QA to static-dataset procedure...
trying embedding-data/PAQ_pairs initialization


Using custom data configuration default-f55751402dbf0730
INFO:datasets.builder:Using custom data configuration default-f55751402dbf0730
Loading Dataset Infos from /usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/json
INFO:datasets.info:Loading Dataset Infos from /usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/json


5.7204670875786356e-06
2000
take 1 from embedding-data/PAQ_pairs validation
take 1 from embedding-data/PAQ_pairs train


KeyboardInterrupt: ignored

In [None]:
print(qa_statics_datsets.keys())
#qa_statics_datsets['train']

for i,e in enumerate(qa_statics_datsets['train'][::100]):
    if i>200:
        break
    print("-------\nQ:%s\nA:%s" % (e['query'], e['positives'][0].replace("\n"," ") if bool(e['positives']) else e['negatives'][0].replace("\n"," ")))


In [None]:
# this takes a long time to query for negatives (maybe I should just use the other negaitve generator)

NEGATIVE_CORPUS_METHOD_QA = 'ann-tfidf' #'bm25'
qa_torchdataset_val = DatasetTriplets(
    list_of_data = qa_statics_datsets['val'],
    n_negatives= 3,
    focal_text_name ='query',
    positives_text_name ='positives',
    negativess_text_name ='negatives',
    topk_negatives_discard=15, # to get similar but different negatives, use BM25 and discard these topk
    negative_corpus_method = NEGATIVE_CORPUS_METHOD_QA
)

#
if True:
    qa_torchdataset_train = DatasetTriplets(
        list_of_data = qa_statics_datsets['train'],
        n_negatives= 3,
        focal_text_name ='query',
        positives_text_name ='positives',
        negativess_text_name ='negatives',
        topk_negatives_discard=15, # to get similar but different negatives, use BM25 and discard these topk
        negative_corpus_method = NEGATIVE_CORPUS_METHOD_QA
    )

building negatives via ANN-TFIDF
using predefined corpus of length: 1662
finished building the ANN index
done finding negatives
building negatives via ANN-TFIDF
using predefined corpus of length: 3041
finished building the ANN index
done finding negatives


In [None]:
print(len(qa_torchdataset_train))
qa_torchdataset_train[2700]

# WORKS: done with the QA sets (need to expand amount of data)

9151


{'query': "How do they make 3D movies? Why do we have to wear glasses? What's their role?",
 'pos': '3D movies use a variety of technologies to create the illusion of depth and realism. Specialty glasses are used to create the 3D effect by allowing each eye to receive different images. The glasses use either shutters, color filters, or polarized lenses to receive the images so that your brain can put the 3D effect together. Without the glasses, the 3D movie will look blurred and may be too uncomfortable for some people to watch[2]. The glasses also enhance the depth perception of the images so that they seem more lifelike and as if they are leaping from the screen[3]. There are several different types of 3D technologies in use today, but they all work together to send each eye different perspectives of the same image[5].',
 'neg': 'Looking directly at the sun can cause a condition called solar retinopathy, which is when solar radiation damages the eyes and can even lead to permanent bl

### A) Retrieval Tasks
In general, what loss would I use for the QA & retrieval tasks? Distillation is obvious, but what about
- SQUAD - has QA pairs - squad_v2
    - good for distillation
- ORCA - has GPT-like prompting QA pairs: https://huggingface.co/datasets/Open-Orca/OpenOrca/viewer/Open-Orca--OpenOrca/train?row=29
- DONE Simple-Wiki https://huggingface.co/datasets/embedding-data/simple-wiki - has paraphrases
- DONE embedding-data/coco_captions_quintets - multiple captions as paraphrases
- DONE embedding-data/simple-wiki - pairs of paraphrases from wikipedia
- DONE embedding-data/SPECTER - triplets of {anchor, pos, neg}, small headline-like snippets in technical /statistical /science fields
- https://huggingface.co/embedding-data - has a lot of retrieval tasks
- LLukas22/scidocs - titles and abstracts
- DONE allenai/scirepeval - cite_prediction - has query,pos, neg based on citations
- DONE - LEDGAR - can possible do triplets on same label
- Rahmaa/ElsevieR_ClEaN - possible relation between title and abstract
- embedding-data/WikiAnswers - 25 question paraphrases (maybe no answers)
- cnn_dailymail - summarization possiblility 287k (beware |||?)
- multi_news - another summarization 45k (beware |||?)
- DONE xsum - BBC extreme summarization 204k
- DONE lighteval/legal_summarization - legal summization of bills (BillSum 18.8k)
- gigaword - small paraphrases
- SKIP launch/gov_report # this could be used for LONG document summaries/retrieval


In [None]:
#foo =  load_dataset("embedding-data/simple-wiki",split='train',streaming=True)
#foo =  load_dataset("embedding-data/coco_captions_quintets",split='train',streaming=True).take(2000)
#foo =  load_dataset("embedding-data/SPECTER",split='train',streaming=True)
#foo = load_dataset(**{'path': 'embedding-data/SPECTER', 'name':None, 'split':'train', 'streaming':True})
#foo =  load_dataset("paws",'labeled_final',split='train',streaming=True)
#foo =  load_dataset("embedding-data/QQP_triplets",None,split='train',streaming=True)
#foo =  load_dataset("",None,split='train',streaming=True)
#foo =  load_dataset("",None,split='train',streaming=True)
#foo = load_dataset("allenai/scirepeval", 'cite_prediction',None, split='train',streaming=True)
# foo = load_dataset(**{'path': 'allenai/scirepeval', 'name':'cite_prediction', 'split':'train', 'streaming':True})
#foo = load_dataset('json', data_files="https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A/download?path=%2F&files=LEDGAR_2016-2019.jsonl.zip", split="train", streaming=False)
#foo = load_dataset(**{'path': 'json', 'name':None, 'data_files':'https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A/download?path=%2F&files=LEDGAR_2016-2019.jsonl.zip', 'split':'train', 'streaming':True})
foo =  load_dataset("lighteval/legal_summarization","BillSum",split='train',streaming=True)

if True:
    # embedding-data/WikiAnswers
    for j,e in enumerate(foo):
        print(e)
        #print(len(e['set']))
        if j > 100:
            break
    print(e.keys())

In [None]:


def clean_legalsum(x):
    MAX_CHAR_LEN_BILLSUM = int(6.7*600)
    text = x['article'][:MAX_CHAR_LEN_BILLSUM]
    if 'SEC. 2.' in text:
        text = ".".join(text.split('SEC. 2.')[1].split('.')[1:])
    else:
        if 'SHORT TITLE' in text:
             text = text.split('SHORT TITLE')[1]
    x['query'] = x['summary']
    x['positives'] = [text.strip()]
    x['negatives'] = []
    x['type'] = 'sts_triplet'
    return x

def clean_xsum(x):
    x['query'] = x['summary']
    x['negatives'] = []
    x['positives'] = [x['document']]
    x['type'] = 'sts_triplet'
    return x

def clean_eurlex(x):
    x['query'] = x['text']
    x['negatives'] = []
    x['positives'] = []
    x['type'] = 'sts_by_textlabel'
    x['label'] = x['eurovoc_concepts']
    return x

def clean_allenai_citeprediction(x):
    x['query'] = x['query']['abstract']
    pos = x['pos']['abstract']
    x['positives'] = [pos] if pos is not None else []
    neg = x['neg']['abstract']
    x['negatives'] = [neg] if neg is not None else []
    x['type'] = 'sts_triplet'
    return x

def clean_simple_wiki(x):
    x['query'] = x['set'][0]
    x['positives'] = [x['set'][1]]
    x['negatives'] = []
    x['type'] = 'sts_triplet'
    return x

def clean_coco_captions_quintets(x):
    x['query'] = x['set'][0]
    x['positives'] = x['set'][1:]
    x['negatives'] = []
    x['type'] = 'sts_triplet'
    return x

def clean_specter(x):
    x['query'] = x['set'][0]
    x['positives'] = [x['set'][1]]
    x['negatives'] = [x['set'][2]]
    x['type'] = 'sts_triplet'
    return x

def clean_paws(x):
    x['query'] = x['sentence1']
    x['positives'] = [x['sentence2']]
    x['negatives'] = []
    x['type'] = 'sts_triplet'
    return x

def clean_qqp(x):
    x['query'] = x['set']['query']
    x['positives'] = x['set']['pos']
    x['negatives'] = x['set']['neg']
    x['type'] = 'sts_triplet'
    return x

def clean_ledgarlabelled(x):
    x['query'] = x['provision']
    x['negatives'] = []
    x['positives'] = []
    x['type'] = 'sts_by_textlabel'
    return x


def clean_debatesum(x):
    x['query'] = x['Abstract']
    x['positives'] = [x['Extract']]
    x['negatives'] = []
    x['type'] = 'sts_triplet'
    return x


def filter_chatgptparaphrases(x):
    return x['category']=='sentence'

def clean_chatgptparaphrases(x):
    x['query'] = x['text']
    x['positives'] = eval(x['paraphrases'])
    x['negatives'] = []
    x['type'] = 'sts_triplet'
    return x

def clean_gigaword(x):
    x['query'] = x['summary']
    x['positives'] = [x['document']]
    x['negatives'] = []
    x['type'] = 'sts_triplet'
    return x

#dict_keys(['question_id', 'question', 'document_title', 'answer', 'label'])
sts_streaming_cleaning_functions = {
    'xsum':(clean_xsum, None, ['query','positives','negatives'],['summary','id','document']),
    'embedding-data/simple-wiki':(clean_simple_wiki, None, ['query','positives','negatives'],['set']),
    'embedding-data/coco_captions_quintets':(clean_coco_captions_quintets,None, ['query','positives','negatives'],['set']),
    'embedding-data/SPECTER':(clean_specter,None, ['query','positives','negatives'],['set']),
    'paws':(clean_paws,None, ['query','positives','negatives'],['id', 'sentence1', 'sentence2', 'label']),
    'embedding-data/QQP_triplets':(clean_qqp,None, ['query','positives','negatives'],['set']),
    "allenai/scirepeval":(clean_allenai_citeprediction, None,  ['query','positives','negatives'], ['pos','neg']),
    "lighteval/legal_summarization":(clean_legalsum, None, ['query','positives','negatives'], ['article', 'summary']),
    "https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A/download?path=%2F&files=LEDGAR_2016-2019.jsonl.zip":(
        clean_ledgarlabelled, None, ['query','label'], ['provision','source']
    ),
    "eurlex":(clean_eurlex, None,  ['query','positives','negatives'], ['celex_id', 'title', 'text', 'eurovoc_concepts']),
    'humarin/chatgpt-paraphrases':(clean_chatgptparaphrases, filter_chatgptparaphrases, ['query','positives','negatives'], ['text','paraphrases','category','source']),
    'gigaword':(clean_gigaword, None, ['query','positives','negatives'], ['document','summary'])
}

DEFAULT_PROB = 1.0
sts_files = [
    # dataset name, subset, take_probability, dataset size
    ('xsum', None, DEFAULT_PROB, 204000, 'sts_by_triplet', False),
    ('embedding-data/simple-wiki',None, DEFAULT_PROB, 102000, 'sts_by_triplet', False), # wikipedia paraphrases
    ('embedding-data/coco_captions_quintets',None, DEFAULT_PROB,82800, 'sts_by_triplet', False), # caption paraphrases
    ('embedding-data/SPECTER',None, DEFAULT_PROB,684000, 'sts_by_triplet', False), # ?
    ('paws','labeled_final',DEFAULT_PROB, 49400, 'sts_by_triplet', False), # paws paraphrases
    ('embedding-data/QQP_triplets',None,DEFAULT_PROB, 102000, 'sts_by_triplet', False), # quora?
    ("allenai/scirepeval", 'cite_prediction_new', DEFAULT_PROB, 1300000, 'sts_by_triplet', False), # ?
    ("lighteval/legal_summarization","BillSum", DEFAULT_PROB, 18900, 'sts_by_triplet', False),
    ('https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A/download?path=%2F&files=LEDGAR_2016-2019.jsonl.zip', None, DEFAULT_PROB, 1000000, 'sts_by_label', False),
    ('eurlex', None, DEFAULT_PROB, 45000, 'sts_by_label', False),
    ('humarin/chatgpt-paraphrases',None, DEFAULT_PROB, 172059, 'sts_by_triplet', False),
    ('gigaword', None, DEFAULT_PROB, 2000000, 'sts_by_triplet',False),
]

stsdata_streaming_config = {
    'files':sts_files,
    'max_seq_length':512,
    'prepend_q': 'passage: ',
    'prepend_a': 'passage: ',
    'val_size':100,
    'train_chunk_size':500,
    'seed':42,
}


In [None]:
stsdata_streaming_config = {
    'files':sts_files,
    'max_seq_length':512,
    'val_size':200,
    'train_chunk_size':500,
    'seed':42,
}

sts_statics_datsets = initialize_and_get_triplet_streaming_datasets(
    data_streaming_config = stsdata_streaming_config,
    streaming_cleaning_functions = sts_streaming_cleaning_functions,
    start_proportion = None,
    epoch=0,
    seed=42,
    path_to_val_cache = 'cache_val_sts.pkl',
    path_to_train_cache_epoch = 'cache_train_sts_%03g.pkl',
    do_check_english = True,
    name = 'STS' #
)


if False:
    print('old functions')
    # initialize streaming data for sts tasks
    sts_streaming_datsets, sts_probabilities, sts_datasizes = initialize_qa_streaming_datasets(
        stsdata_streaming_config,
        sts_streaming_cleaning_functions
    )

    # split and make-static (train and val sets, non-streaming)
    sts_statics_datsets = train_test_splits_from_stream_qa(
        streaming_dataset=sts_streaming_datsets,
        val_size = 100,#2000,
        epoch = 0,
        chunk_size = 2000,#6000,
        path_to_val_cache = 'val_sts_cache.pkl',
        probabilities = sts_probabilities,
        datasizes = sts_datasizes,
        seed=stsdata_streaming_config['seed']
    )


Initializing the streaming-QA to static-dataset procedure...
RELOADING VAL-STS SET: iter=cache_val_sts.pkl
VAL-STS SET SIZE: 196
trying humarin/chatgpt-paraphrases initialization


Using custom data configuration default-81f2c6c048238397
INFO:datasets.builder:Using custom data configuration default-81f2c6c048238397
Loading Dataset Infos from /usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/csv
INFO:datasets.info:Loading Dataset Infos from /usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/csv


take 200 from humarin/chatgpt-paraphrases validation
take 500 from humarin/chatgpt-paraphrases train
Done getting streams/reloading from humarin/chatgpt-paraphrases
done train language check
Done collecting STS streaming data
saving streamed STS training for epoch 0: cache_train_sts2_000.pkl


In [None]:
for i,e in enumerate(sts_statics_datsets['train'][::24]):
  if i>20:
    break
  print(e)

In [None]:
sts_statics_datsets['train'][0]

{'query': 'Speaker Martin concluded that Eisenhower worked too much through subordinates in dealing with Congress, with results, "often the reverse of what he has desired" because Members of Congress, "resent having some young fellow who was picked up by the White House without ever having been elected to office himself coming around and telling them \'The Chief wants this\'.',
 'positives': '[\'Speaker Martin stated that Eisenhower relied heavily on subordinates to handle Congress, leading to outcomes that were often contrary to his intentions. This was due to the fact that Members of Congress disliked being instructed by inexperienced individuals who were appointed by the White House and had never been elected to office themselves.\', "According to Speaker Martin, Eisenhower\'s approach to dealing with Congress involved delegating too much responsibility to subordinates, resulting in outcomes that were frequently the opposite of what he had intended. This was due to the fact that Mem

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
import numpy as np
from multiprocessing import Pool
# Download stopwords and lemmatization resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
#lemmatizer = WordNetLemmatizer()
#stemmer = PorterStemmer()
#stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:

class LabelProcesser:

    def __init__(
        self,
        pos_thres = 0.97,
        neg_thres = 0.9,
        min_similarity_matrix_pos =0.34,
        max_similarity_matrix_pos = 0.30,
        examples=None, seed=42, textname='text',labelname='label'
    ):
        self.pos_thres = pos_thres # jaccard similarity index max
        self.neg_thres = neg_thres # jaccard similarity index max
        self.min_similarity_matrix = min_similarity_matrix_pos # threshold the similarity matrix by this, else 0
        self.max_similarity_matrix = max_similarity_matrix_neg # threshold the similarity matrix by this
        #self.lemmatizer = WordNetLemmatizer()
        #self.stemmer = PorterStemmer()
        #self.stop_words = set(stopwords.words('english'))
        #self.random = np.random.RandomState(seed)
        self.label_corpus =None
        self.label2stem =None
        self.textname=textname
        self.labelname=labelname

        if examples is not None and len(examples)>0:

            # build corpus from examples
            label_corpus, label2stem = self.build_corpus_by_labels(examples)
            self.label_corpus = label_corpus
            self.label2stem = label2stem

            # build label-similarity matrix
            self.SimMat = self.compute_similarity_matrix(list(self.label_corpus.keys()))

    def preprocess_label(self, text):
        pass

    @staticmethod
    def jaccard_similarity(tokens1, tokens2):
        set1 = set(tokens1)
        set2 = set(tokens2)
        intersection = set1.intersection(set2)
        union = set1.union(set2)
        similarity_score = len(intersection) / len(union)
        return similarity_score

    def build_corpus_by_labels(self, list_of_dict_with_labels_and_text):
        """Makes a dictionary of (tokenized/stemmed) labels:List[str] as the corpus by labels"""
        pass

    def _compute_similarity_for_processor_func(self, pair):
        """to be used internally with Pool map similarity functions"""
        idx, j, tokens1, tokens2 = pair
        return idx, j, self.jaccard_similarity(tokens1, tokens2)

    def compute_similarity_matrix(self, corpus):
        """Csompute similarity using calculate_similarity"""
        corpus_size = len(corpus)

        # Create an empty similarity matrix
        similarity_matrix = np.zeros((corpus_size, corpus_size))

        # Generate all pairwise combinations of indices and texts
        pairs = [(i, j, corpus[i], corpus[j]) for i in range(corpus_size) for j in range(i + 1, corpus_size)]

        # Use parallel processing to compute similarities efficiently
        with Pool() as pool:
            results = pool.map(self._compute_similarity_for_processor_func, pairs)

        # Fill in the similarity matrix
        for i,j, similarity in results:
            #i, j = divmod(idx, corpus_size)
            similarity_matrix[i, j] = similarity
            similarity_matrix[j, i] = similarity

        # threshold the similarity matrx -- no, because that will creat positives in the negatives
        return similarity_matrix

    @staticmethod
    def is_in(tuple1, tuple2):
        """is a in b or b in a"""
        s1=set(tuple1); s2 = set(tuple2)
        if not bool(s1.difference(s2)):
            return True
        return not bool(s2.difference(s1))

    @staticmethod
    def _quick_text_hash(text):
        return re.sub("\W+","",text.lower())

    def find_positive(
        self,
        query_text, # text of anchor/query (used to ensure not too similar, like an exact match)
        query_labelstem, # processed label (often a multi-label)
        corpus_keys, # corpus keys of other labels to find matches
        max_candidates=15
    ):
        """find positive match, based on best overlap of multi-label"""
        # first, check if there are other text with same label
        query_label_hash = self._quick_text_hash(query_text)

        # get all text with same label
        best_candidates_text = [
            s for s in self.label_corpus[query_labelstem] if self._quick_text_hash(s)!=query_label_hash
        ]
        if len(best_candidates_text)==0:
            # no similar text: need to find text with overlapping labelss
            kidx = corpus_keys.index(query_labelstem)
            # get similarities with other keys
            k_similarities = self.SimMat[kidx]
            if k_similarities.max()==0:
                #print("%s has no matches:" % '-'.join(query_labelstem))
                return []
            else:
                idx_bests = np.argsort(-1*k_similarities)[:max_candidates]
                # get most similar labels
                label_candidates = [
                    corpus_keys[j] for j in idx_bests if k_similarities[j]>= self.min_similarity_matrix
                ]
                # assert that the labels are AT LEAST inside of each other -- otherwise, no match
                label_candidates = [
                    lab for lab in label_candidates if self.is_in(lab, query_labelstem)
                ]
                if len(label_candidates)==0:
                    #print("%s has no matches:" % '-'.join(query_labelstem))
                    return []

                # get the text of the top candidate text
                best_candidates_text = [subs for s in [
                    self.label_corpus[lab] for lab in label_candidates
                ] for subs in s][:100]

                # ensure candidate texts are not the same
                best_candidates_text = [
                  s for s in self.label_corpus[query_labelstem] if self._quick_text_hash(s)!=query_label_hash
                ]
                if len(best_candidates_text)==0:
                    #print("%s has no matches:" % '-'.join(query_labelstem))
                    return []

        # grab first candidate text htat is NOT a high jaccard similarity
        best_candidates_text = best_candidates_text[::-1]
        top_match = None
        query_text_tokenized = [w for w in query_text.split(" ") if bool(re.search("\w+",w))]
        while top_match is None and len(best_candidates_text)>0:
            candidate_text = best_candidates_text.pop()
            # check that they aren't too similar in text
            candidate_text_tokenized = [w for w in candidate_text.split(" ") if bool(re.search("\w+",w))]
            candidate_sim_score = self.jaccard_similarity(query_text_tokenized, candidate_text_tokenized)
            if candidate_sim_score < self.pos_thres:
                top_match = candidate_text
                return [top_match]
        #print("%s has no matches:" % '-'.join(query_labelstem))
        #print('Its candidate pool was:')
        #print(best_candidates_text[:4])
        return []

    def find_positives(self, examples):
        if True:
            # find positives
            for idx, example in enumerate(examples):
                pos = self.find_positive(
                    query_text=example[self.textname],
                    query_labelstem=self.label2stem[tuple(example[self.labelname])],
                    corpus_keys = list(self.label_corpus.keys()),
                )
                example.update({'positives':pos})
                examples[idx] = example

        return examples

    def find_negative(self, query_text, query_labelstem, corpus_keys, max_candidates=15, n_negatives=1):
        # first, check if there are other text with same label
        query_label_hash = self._quick_text_hash(query_text)
        # get similarities with other keys
        kidx = corpus_keys.index(query_labelstem)
        k_similarities = self.SimMat[kidx]
        if k_similarities.max()==0:
            best_candidate_label = query_labelstem
            while best_candidate_label == query_labelstem:
                best_candidate_label = self.random.choice(corpus_keys)
        else:
            idx_bests = np.argsort(-1*k_similarities)[:max_candidates]
            # get most similar labels
            label_candidates = [
                corpus_keys[j] for j in idx_bests if (k_similarities[j]!=0 and k_similarities[j] <= self.max_similarity_matrix)
            ]
            # assert that the labels have some disjoint labels
            label_candidates = [
                lab for lab in label_candidates if not self.is_in(lab, query_labelstem)
            ] # disjoint entirely
            # sample randomly from candidate labels
            if len(label_candidates)>0:
                best_candidate_label_idx = self.random.choice(np.arange(len(label_candidates)))
                best_candidate_label = label_candidates[best_candidate_label_idx]
            # sample randomly from entire corpus
            elif len(label_candidates)==0:
                # pick random
                best_candidate_label = query_labelstem
                while best_candidate_label == query_labelstem:
                    best_candidate_label_idx = self.random.choice(np.arange(len(corpus_keys)))
                    best_candidate_label = corpus_keys[best_candidate_label_idx]

        # grab best text
        best_candidates_text = self.label_corpus[best_candidate_label]
        if len(best_candidates_text)==0:
            return []

        # ensure texts and query are not the same
        best_candidates_text = [
            s for s in best_candidates_text if self._quick_text_hash(s)!=query_label_hash
        ]
        if len(best_candidates_text)==0:
            return []

        # ensure texts are not very similar
        top_matches = []
        query_text_tokenized = [w for w in query_text.split(" ") if bool(re.search("\w+",w))]
        while len(top_matches) < n_negatives and len(best_candidates_text)>0:
            candidate_text = best_candidates_text.pop()
            # check that they aren't too similar in text
            candidate_text_tokenized = [w for w in candidate_text.split(" ") if bool(re.search("\w+",w))]
            candidate_sim_score = self.jaccard_similarity(query_text_tokenized, candidate_text_tokenized)
            if candidate_sim_score < self.neg_thres:
                top_matches.append(candidate_text)
                if len(top_matches)==n_negatives:
                    return top_matches
        # no matches
        return []

    def find_negatives(self, examples, n_negatives=1):
        if True:
            # find negatives
            for idx, example in enumerate(examples):
                neg = self.find_negative(
                    query_text=example[self.textname],
                    query_labelstem=self.label2stem[tuple(example[self.labelname])],
                    corpus_keys = list(self.label_corpus.keys()),
                    n_negatives=1
                )
                example.update({'negatives':neg})
                examples[idx] = example

        return examples


class LabelProcesserLedgar(LabelProcesser):
    """Preprocesses labels of LEDGAR for semantic similarity, as well as functionality for finding positive and negative pairs"""

    def __init__(
        self,
        pos_thres = 0.97,
        neg_thres = 0.9,
        min_similarity_matrix_pos =0.33,
        max_similarity_matrix_neg=0.3,
        examples=None,
        seed=42,
        textname='text',
        labelname='label'
    ):
        self.pos_thres = pos_thres # jaccard similarity index max
        self.neg_thres = neg_thres # jaccard similarity index max
        self.min_similarity_matrix = min_similarity_matrix_pos # threshold the similarity matrix by this, else 0
        self.max_similarity_matrix = max_similarity_matrix_neg # threshold the similarity matrix by this, else 0
        self.lemmatizer = WordNetLemmatizer()
        self.stemmer = PorterStemmer()
        self.stop_words = set(stopwords.words('english'))
        self.random = np.random.RandomState(seed)
        self.label_corpus =None
        self.label2stem =None
        self.textname=textname
        self.labelname=labelname
        #print(self.preprocess_label("The Borrowers’ obligation"))
        #print(self.preprocess_label("The Borrower's obligations"))

        if examples is not None and len(examples)>0:

            # build corpus from examples
            label_corpus, label2stem = self.build_corpus_by_labels(examples)
            self.label_corpus = label_corpus
            self.label2stem = label2stem

            # build label-similarity matrix
            self.SimMat = self.compute_similarity_matrix(list(self.label_corpus.keys()))

    def preprocess_label(self, text):
        if isinstance(text,str):
            tokens = word_tokenize(text.lower())
            # Remove stop words
            filtered_tokens = [token for token in tokens if token not in self.stop_words]
            # Perform lemmatization and stemming
            processed_tokens = [self.lemmatizer.lemmatize(self.stemmer.stem(token)) for token in filtered_tokens]
            processed_tokens = [w for w in processed_tokens if w not in ["'", "’", "’s", "'s", "(",")", ",", "."]]
            # Return the lemmatized and stop word-free tokens as a string
            return sorted(processed_tokens)

        elif isinstance(text,list):
            if len(text)==1:
                return self.preprocess_label(text[0])
            all_labels = [self.preprocess_label(l) for l in text]
            return sorted([subl for l in all_labels for subl in l])
        else:
            raise NotImplementedError(text)

    def build_corpus_by_labels(self, list_of_dict_with_labels_and_text):
        """Makes a dictionary of (tokenized/stemmed) labels:List[str] as the corpus by labels"""
        label_corpus = {}
        label2lem = {}
        for example in list_of_dict_with_labels_and_text:
            label = example[self.labelname]
            s = example[self.textname]
            if tuple(label) not in label2lem:
                labelstemmed = tuple(self.preprocess_label(label))
                label2lem[tuple(label)] = labelstemmed
            else:
                labelstemmed = label2lem[tuple(label)]
            if labelstemmed not in label_corpus.keys():
                label_corpus[labelstemmed] = []
            if s not in label_corpus[labelstemmed]:
                label_corpus[labelstemmed].append(s)

        # next, calculate the similarities between all pairs of keys
        return label_corpus, label2lem


class DatasetTripletsSimilarityByCoLabel(DatasetTriplets):

    def process(self, list_of_data):
        """Makes (query,pos,neg)-triplets, converts samples to dataframe for pytorch iteration"""

        # initialize the LabelProcessor
        label_processor = self.label_processor_class(
            examples = list_of_data,
            textname = self.focal_text_name
        )

        # find positives
        list_of_data = label_processor.find_positives(list_of_data)

        # only do ones with positives (otherwise no point)
        #list_of_data = [example for example in list_of_data if len(example['positives'])>0]
        #print(len(list_of_data))

        # find negatives
        list_of_data = label_processor.find_negatives(list_of_data, n_negatives=self.n_negatives)
        print(len(list_of_data))

        # loop through the data and add each triplets
        self._loop_through_list_of_data_and_add_to_selfdata(list_of_data = list_of_data)

        # harden the dataset to pandas dataframe
        df = self.sample_data_and_make_static_dataframe(self.data)
        return df #pd.DataFrame({})

    def _build_corpus_of_potential_negatives(self):
        pass

    def _find_negative(self):
        pass

    def _find_positives_and_add_to_data(self):
        """For data that has a label, this can be used to artifically find and create synthetic positives"""
        pass

    def _find_negatives_and_add_to_data(self):
       pass


In [None]:
sts_statics_datsets['train'][0]

{'query': "A man who tried to cut the throat of his estranged wife's aunt has been jailed for 22 years.",
 'negatives': [],
 'positives': ['Farai Kambarani, 26, was convicted of the attempted murder of social worker Ruth Nayamazana, who he wrongly blamed for not letting him see his child.\nLuton Crown Court heard his victim, who he also punched in the head 10 to 20 times, still lives in fear.\nKambarani was given a 22-year jail sentence with a three-year extension on licence.\nThe court heard Kambarani, from Wolverhampton, shunted a car into the back of Ruth Nayamazana\'s vehicle in Saxon Gate car park in Milton Keynes on 22 August last year.\nWhen she got out, he repeatedly punched her in the head.\nIn the witness box, the 34-year-old said he pulled out a small knife and used it against the side of her throat.\nShe said: "I was screaming. I thought he was going to cut my throat. The blood started gushing out."\nKambarani, a former carer for elderly people, was also convicted criminal 

In [None]:
class LabelProcesserEurlex(LabelProcesser):
    """Preprocesses labels of EURLEX for semantic similarity, as well as functionality for finding positive and negative pairs"""

    def __init__(self, pos_thres = 0.97, neg_thres = 0.9, min_similarity_matrix_pos =0.33, max_similarity_matrix_neg =0.30,  examples=None, seed=42, textname='text',labelname='label'):
        self.pos_thres = pos_thres # jaccard similarity index max
        self.neg_thres = neg_thres # jaccard similarity index max
        self.min_similarity_matrix = min_similarity_matrix_pos # threshold the similarity matrix by this, else 0
        self.max_similarity_matrix = max_similarity_matrix_neg # threshold the similarity matrix by this, else 0
        self.random = np.random.RandomState(seed)
        self.label_corpus =None
        self.label2stem =None
        self.textname=textname
        self.labelname=labelname
        #print(self.preprocess_label("The Borrowers’ obligation"))
        #print(self.preprocess_label("The Borrower's obligations"))

        if examples is not None and len(examples)>0:

            # build corpus from examples
            label_corpus, label2stem = self.build_corpus_by_labels(examples)
            self.label_corpus = label_corpus
            self.label2stem = label2stem

            # build label-similarity matrix
            self.SimMat = self.compute_similarity_matrix(list(self.label_corpus.keys()))

    def preprocess_label(self, text):
        # eurlex labels are already "tokenized" into integers of concepts
        if isinstance(text,str):
            return text
        elif isinstance(text,list):
            if len(text)==1:
                return text
            return sorted(list(set(text)))
        else:
            raise NotImplementedError(text)

    def build_corpus_by_labels(self, list_of_dict_with_labels_and_text):
        """Makes a dictionary of (tokenized/stemmed) labels:List[str] as the corpus by labels"""
        label_corpus = {}
        label2lem = {}
        for example in list_of_dict_with_labels_and_text:
            label = example[self.labelname]
            s = example[self.textname]
            if tuple(label) not in label2lem:
                labelstemmed = tuple(self.preprocess_label(label))
                label2lem[tuple(label)] = labelstemmed
            else:
                labelstemmed = label2lem[tuple(label)]
            if labelstemmed not in label_corpus.keys():
                label_corpus[labelstemmed] = []
            if s not in label_corpus[labelstemmed]:
                label_corpus[labelstemmed].append(s)

        # next, calculate the similarities between all pairs of keys
        return label_corpus, label2lem

In [None]:
sts_statics_datsets['train'][0]

label_processer_eurlex = LabelProcesserEurlex(
    pos_thres = 0.97,
    neg_thres = 0.9,
    min_similarity_matrix_pos =0.33,
    examples=sts_statics_datsets['train'],
    seed=42,
    textname='query',
    labelname='label'
)

In [None]:
sts_statics_datsets['train'] = label_processer_eurlex.find_positives(sts_statics_datsets['train'])

sts_statics_datsets['train'] = label_processer_eurlex.find_negatives(sts_statics_datsets['train'], n_negatives=3)
#print(len(list_of_data))

In [None]:
foo = [e for e in sts_statics_datsets['train'] if bool(e['positives'])]

In [None]:
sts_torchdataset_train_eurlex = DatasetTripletsSimilarityByCoLabel(
    list_of_data=[
        example for example in sts_statics_datsets['train'] if example['type']=='sts_by_textlabel'
    ],
    n_negatives= 3,
    focal_text_name ='query',
    positives_text_name ='positives',
    negativess_text_name ='negatives',
    seed = 42,
    label_processor_class = LabelProcesserEurlex
)

100


  best_candidate_label = self.random.choice(corpus_keys)


In [None]:
sts_torchdataset_train_ledgar = DatasetTripletsSimilarityByCoLabel(
    list_of_data=[
        example for example in sts_statics_datsets['train'] if example['type']=='sts_by_textlabel'
    ],
    n_negatives= 3,
    focal_text_name ='query',
    positives_text_name ='positives',
    negativess_text_name ='negatives',
    seed = 42,
    label_processor_class = LabelProcesserLedgar
)

  best_candidate_label = self.random.choice(corpus_keys)


100


In [None]:
sts_torchdataset_train_eurlex[-1]

{'query': 'Seller has not received any written notice of any pending or threatened condemnation of any portion of the Properties.',
 'pos': 'If the whole or any substantial (more than 25%) part of the Premises shall be condemned by eminent domain for any public or quasi-public purpose, this Lease shall terminate on the date of the vesting of title, and Tenant shall have no claim against Landlord for the value of any unexpired portion of the term of the Lease, nor shall Tenant be entitled to any part of the condemnation award. If less than a substantial part of the Premises is condemned, this Lease shall not terminate, but Rent shall abate in proportion to the portion of the Premises condemned.',
 'neg': "Tenant has inspected the Premises prior to entering this Lease and hereby accepts the Premises in its “As Is” condition. Landlord shall keep the foundation, outer walls, roof and buried conduits of the Premises in good repair except the Landlord shall not be called on to make any such 

In [None]:
for example in sts_statics_datsets['train']:
    if example['type']=='sts_by_textlabel':
        assert 'label' in example.keys()


In [None]:
labelprocessor = LabelProcesserLedgar(examples = [
  example for example in sts_statics_datsets['train'] if example['type']=='sts_by_textlabel'
])

foopos = labelprocessor.find_positives([
  example for example in sts_statics_datsets['train'] if example['type']=='sts_by_textlabel'
])

print(sum([bool(d['positives']) for d in foopos])/len(foopos))

fooneg = labelprocessor.find_negatives([
  example for example in sts_statics_datsets['train'] if example['type']=='sts_by_textlabel'
])

print(sum([bool(d['negatives']) for d in fooneg])/len(fooneg))

['borrow', 'oblig']
['borrow', 'oblig']
0.4376


  best_candidate_label = self.random.choice(corpus_keys)


1.0


In [None]:
NEGATIVE_CORPUS_METHOD_STS ='ann-tfidf'
# convert to torch dataset (val)
sts_torchdataset_val = DatasetTriplets(
    list_of_data = [
       x for x in sts_statics_datsets['val'] if x.get('type','na') == 'sts_triplet'
    ],
    n_negatives= 3,
    focal_text_name ='query',
    positives_text_name ='positives',
    negativess_text_name ='negatives',
    topk_negatives_discard=15, # to get similar but different negatives, use BM25 and discard these topk
    negative_corpus_method = NEGATIVE_CORPUS_METHOD_STS

)
# convert to torch dataset (train)
print('STS DatasetTriplet')
sts_torchdataset_train = DatasetTriplets(
    list_of_data = [
       x for x in sts_statics_datsets['train'] if x.get('type','na')== 'sts_triplet'
    ],
    n_negatives= 3,
    focal_text_name ='query',
    positives_text_name ='positives',
    negativess_text_name ='negatives',
    topk_negatives_discard=15, # to get similar but different negatives, use BM25 and discard these topk
    negative_corpus_method = NEGATIVE_CORPUS_METHOD_STS
)

building negatives via ANN-TFIDF
using predefined corpus of length: 111
finished building the ANN index
done finding negatives
STS DatasetTriplet
building negatives via ANN-TFIDF
using predefined corpus of length: 286
finished building the ANN index
done finding negatives


In [None]:
sts_torchdataset_train[-4]

{'query': 'Elaine Sullivan Act - Amends title XVIII (Medicare) of the Social Security Act to require emergency departments to contact family members, a specified healthcare agent, or a surrogate decisionmaker of an incapacitated patient within 24 hours of arrival at the emergency department. Authorizes the Secretary of Health and Human Services to make grants to qualified not-for-profit organizations for the purpose of assisting them to establish and operate voluntary next of kin registries.',
 'pos': "(a) In General.--Section 1866(a)(1) of the Social Security Act (42 U.S.C. 1395cc(a)(1)) is amended-- (1) in subparagraph (U), by striking ``and'' at the end; (2) in subparagraph (V), by striking the period at the end and inserting ``, and''; and (3) by inserting after subparagraph (V) the following new subparagraph: ``(W) in the case of a hospital (as defined in section 1861(e)) with an emergency department, to adopt and enforce a policy to ensure compliance with the requirements of subs

### Pair Classifications Datasets
- one datasets are naturally pair-based (NLI, cannot-datasets); some like the multi-label dataets can be made into a "same class / different class" binary dataset (ag_news, ; others like sentiment


###### Datasets
- DONE snli (550k, 1 file) - naturally pair classification  
    - 3 labels: 0,1,2
- DONE multi_nli - (393k, 1 file)
- NO ag_news classification - (a couple of labels -- only 4)
- DONE heegyu/news-category-dataset - (maybe multiple categories)
- dbpedia_14 (560k, 1 file)- news classification or topic ? (~14 labels corresponding to art or building types)
    - 14 classes
- ccdv/patent-classification - 25k (abstract) (maybe skip)
- fkdosilovic/docee-event-classification (21.9k, 1 file) - 59 labels (news-event like elections, diasters)
- NO scholarly360/contracts-classification-instruction-llm-experiments - 6.05k (clauses) -- no, I think these are just the auto-labels from LEDGAR
- NO 'rcds/swiss_judgment_prediction','mt_en', (59703 examples) (NO, it is autotranslated)
- DONE **'tum-nlp/cannot-dataset'** - like entailment, but contains paraphrases & negations
- NO sentiment analysis -- ?
- NO samchain/BIS_Speeches_97_23 - next sentence prediction
- next sentence prediction from MLM

MASKING: a mask vector will be used to focus the loss only on the appropriate dataset

In [None]:
#from datasets import load_dataset
foo =load_dataset("nlpaueb/finer-139", split="train") # very big, maybe just download the val

# these can be converted to a smaller subset by stripping the B- and I-
tag_names_full = foo.features["ner_tags"].feature.names
print(tag_names_full)

['O', 'B-AccrualForEnvironmentalLossContingencies', 'B-AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife', 'I-AcquiredFiniteLivedIntangibleAssetsWeightedAverageUsefulLife', 'B-AllocatedShareBasedCompensationExpense', 'B-AmortizationOfFinancingCosts', 'B-AmortizationOfIntangibleAssets', 'I-AmortizationOfIntangibleAssets', 'B-AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount', 'I-AntidilutiveSecuritiesExcludedFromComputationOfEarningsPerShareAmount', 'B-AreaOfRealEstateProperty', 'I-AreaOfRealEstateProperty', 'B-AssetImpairmentCharges', 'B-BusinessAcquisitionEquityInterestsIssuedOrIssuableNumberOfSharesIssued', 'B-BusinessAcquisitionPercentageOfVotingInterestsAcquired', 'I-BusinessAcquisitionPercentageOfVotingInterestsAcquired', 'B-BusinessCombinationAcquisitionRelatedCosts', 'B-BusinessCombinationConsiderationTransferred1', 'B-BusinessCombinationContingentConsiderationLiability', 'B-BusinessCombinationRecognizedIdentifiableAssetsAcquiredAndLiabilitiesAssum

In [None]:
labels = set()
for i,e in enumerate(foo):
    labels |= {e['label']}
    if i > 2:
        break

print(e.keys())



dict_keys(['example_id', 'citing_prompt', 'holding_0', 'holding_1', 'holding_2', 'holding_3', 'holding_4', 'label'])


In [None]:
DEFAULT_COLUMNS = ['pair1','pair2','label','type','cls_id','n_labels']

DBPEDIA_L2 = {
    'Tower': 0, 'NaturalPlace': 1, 'Presenter': 2, 'RacingDriver': 3, 'FloweringPlant': 4, 'SportFacility': 5, 'Venue': 6, 'Database': 7,
    'EducationalInstitution': 8, 'Olympics': 9, 'Race': 10, 'VolleyballPlayer': 11, 'Infrastructure': 12, 'MusicalWork': 13, 'Genre': 14, 'ComicsCharacter': 15,
    'Song': 16, 'MusicalArtist': 17, 'Settlement': 18, 'Tournament': 19, 'Engine': 20, 'Politician': 21, 'Coach': 22, 'SocietalEvent': 23, 'Person': 24,
    'LegalCase': 25, 'AmusementParkAttraction': 26, 'GridironFootballPlayer': 27, 'Cleric': 28, 'FootballLeagueSeason': 29, 'MotorcycleRider': 30, 'SportsTeam': 31,
    'SportsEvent': 32, 'Satellite': 33, 'Eukaryote': 34, 'RaceTrack': 35, 'Boxer': 36, 'Wrestler': 37, 'Scientist': 38, 'Building': 39, 'Actor': 40, 'Plant': 41,
    'Cartoon': 42, 'NaturalEvent': 43, 'SportsLeague': 44, 'RouteOfTransportation': 45, 'OrganisationMember': 46, 'FictionalCharacter': 47, 'Horse': 48,
    'ClericalAdministrativeRegion': 49, 'PeriodicalLiterature': 50, 'WrittenWork': 51, 'Writer': 52, 'CelestialBody': 53, 'WinterSportPlayer': 54,
    'SportsTeamSeason': 55, 'Company': 56, 'Animal': 57, 'Broadcaster': 58, 'BritishRoyalty': 59, 'Organisation': 60, 'Athlete': 61, 'Group': 62, 'Stream': 63,
    'Artist': 64, 'Station': 65, 'SportsManager': 66, 'BodyOfWater': 67, 'Software': 68, 'Comic': 69, 'other': 70
}

DBPEDIA_L3 = {
    'TelevisionStation': 0, 'NetballPlayer': 1, 'FigureSkater': 2, 'BadmintonPlayer': 3, 'School': 4, 'River': 5, 'WomensTennisAssociationTournament': 6,
    'ShoppingMall': 7, 'GreenAlga': 8, 'Winery': 9, 'Religious': 10, 'SumoWrestler': 11, 'Planet': 12, 'Swimmer': 13, 'Curler': 14, 'Astronaut': 15,
    'MemberOfParliament': 16, 'MythologicalFigure': 17, 'CanadianFootballTeam': 18, 'OlympicEvent': 19, 'Senator': 20, 'Album': 21, 'PublicTransitSystem': 22,
    'Photographer': 23, 'Library': 24, 'Village': 25, 'Play': 26, 'Legislature': 27, 'AdultActor': 28, 'Lake': 29, 'Earthquake': 30,
    'SupremeCourtOfTheUnitedStatesCase': 31, 'Airline': 32, 'Road': 33, 'SoccerPlayer': 34, 'BaseballSeason': 35, 'CultivatedVariety': 36, 'Judge': 37,
    'PlayboyPlaymate': 38, 'GolfPlayer': 39, 'RadioHost': 40, 'WrestlingEvent': 41, 'Theatre': 42, 'Saint': 43, 'CollegeCoach': 44, 'VideoGame': 45,
    'NCAATeamSeason': 46, 'Museum': 47, 'GolfCourse': 48, 'ComicsCreator': 49, 'Cycad': 50, 'Bird': 51, 'Stadium': 52, 'Magazine': 53, 'Manga': 54,
    'Newspaper': 55, 'BaseballPlayer': 56, 'Reptile': 57, 'Diocese': 58, 'ChessPlayer': 59, 'SoccerLeague': 60, 'Grape': 61, 'Architect': 62, 'Monarch': 63,
    'Cave': 64, 'Skater': 65, 'HorseRace': 66, 'RadioStation': 67, 'MilitaryPerson': 68, 'EurovisionSongContestEntry': 69, 'Fish': 70,
    'NationalFootballLeagueSeason': 71, 'PoliticalParty': 72, 'Single': 73, 'Skier': 74, 'MixedMartialArtsEvent': 75, 'Philosopher': 76,
    'Hospital': 77, 'BasketballTeam': 78, 'Mountain': 79, 'RailwayStation': 80, 'Comedian': 81, 'Galaxy': 82, 'AmericanFootballPlayer': 83,
    'Cardinal': 84, 'Mollusca': 85, 'Journalist': 86, 'OfficeHolder': 87, 'Glacier': 88, 'Rower': 89, 'Baronet': 90, 'RollerCoaster': 91,
    'BaseballLeague': 92, 'ArtificialSatellite': 93, 'Dam': 94, 'MilitaryUnit': 95, 'Engineer': 96, 'Restaurant': 97, 'HockeyTeam': 98,
    'GaelicGamesPlayer': 99, 'Hotel': 100, 'Publisher': 101, 'Fungus': 102, 'AutomobileEngine': 103, 'Moss': 104, 'FormulaOneRacer': 105,
    'Cricketer': 106, 'IceHockeyPlayer': 107, 'Mayor': 108, 'MartialArtist': 109, 'RaceHorse': 110, 'Canoeist': 111, 'BeachVolleyballPlayer': 112,
    'RecordLabel': 113, 'Musical': 114, 'BusinessPerson': 115, 'ArtistDiscography': 116, 'SoccerClubSeason': 117, 'Ambassador': 118, 'Gymnast': 119,
    'RailwayLine': 120, 'Town': 121, 'CyclingTeam': 122, 'LacrossePlayer': 123, 'HollywoodCartoon': 124, 'MilitaryConflict': 125, 'RugbyClub': 126,
    'Racecourse': 127, 'Pope': 128, 'RoadTunnel': 129, 'Economist': 130, 'University': 131, 'President': 132, 'Bodybuilder': 133, 'DartsPlayer': 134,
    'Canal': 135, 'CricketGround': 136, 'Crustacean': 137, 'SpeedwayRider': 138, 'Cyclist': 139, 'MusicGenre': 140, 'Volcano': 141, 'Medician': 142,
    'Castle': 143, 'Anime': 144, 'BasketballPlayer': 145, 'Model': 146, 'SoccerManager': 147, 'Chef': 148, 'SportsTeamMember': 149, 'Convention': 150,
    'Airport': 151, 'HandballTeam': 152, 'FootballMatch': 153, 'ClassicalMusicComposition': 154, 'Conifer': 155, 'RugbyLeague': 156, 'Fern': 157,
    'HistoricBuilding': 158, 'ChristianBishop': 159, 'BusCompany': 160, 'VoiceActor': 161, 'SoccerTournament': 162, 'GolfTournament': 163, 'HorseRider': 164,
    'SolarEclipse': 165, 'Prison': 166, 'CyclingRace': 167, 'AustralianRulesFootballPlayer': 168, 'BasketballLeague': 169, 'Bridge': 170, 'Noble': 171,
    'Arachnid': 172, 'ComicStrip': 173, 'AnimangaCharacter': 174, 'Bank': 175, 'Amphibian': 176, 'Poet': 177, 'LawFirm': 178, 'NascarDriver': 179,
    'Congressman': 180, 'FashionDesigner': 181, 'BiologicalDatabase': 182, 'CricketTeam': 183, 'HandballPlayer': 184, 'MountainPass': 185, 'Band': 186,
    'Brewery': 187, 'AcademicJournal': 188, 'Insect': 189, 'Jockey': 190, 'ClassicalMusicArtist': 191, 'Governor': 192, 'PokerPlayer': 193, 'Poem': 194,
    'TennisPlayer': 195, 'Historian': 196, 'ScreenWriter': 197, 'MusicFestival': 198, 'TennisTournament': 199, 'TradeUnion': 200, 'BeautyQueen': 201,
    'AustralianFootballTeam': 202, 'AmateurBoxer': 203, 'SquashPlayer': 204, 'Painter': 205, 'RugbyPlayer': 206, 'MountainRange': 207, 'Lighthouse': 208,
    'TableTennisPlayer': 209, 'SoapCharacter': 210, 'IceHockeyLeague': 211, 'HorseTrainer': 212, 'Election': 213, 'GrandPrix': 214, 'PrimeMinister': 215,
    'Entomologist': 216, 'BroadcastNetwork': 217, 'FilmFestival': 218, 'other': 219
    }

DOCEEEVENTS = {
    'Famous Person - Death': 0, 'Strike': 1, 'Awards ceremony': 2, 'Road Crash': 3, 'Famous Person - Commit Crime - Accuse': 4, 'New wonders in nature': 5,
    'Droughts': 6, 'Mudslides': 7, 'Shipwreck': 8, 'Government Policy Changes': 9, 'Famous Person - Commit Crime - Sentence': 10, 'Tsunamis': 11,
    'Insect Disaster': 12, 'Government Job change - Election': 13, 'Famous Person - Sick': 14, 'Train collisions': 15, 'Financial Crisis': 16, 'Earthquakes': 17,
    'Protest_Online Condemnation': 18, 'Tear Up Agreement': 19, 'Famine': 20, 'Organization Established': 21, 'Gas explosion': 22, 'Military Exercise': 23,
    'Sign Agreement': 24, 'Armed Conflict': 25, 'Famous Person - Commit Crime - Arrest': 26, 'Withdraw from an Organization': 27,
    'Famous Person - Give a speech': 28, 'Organization Closed': 30, 'Famous Person - Commit Crime - Release': 31, 'Fire': 32, 'Financial Aid': 33,
    'Bank Robbery': 34, 'Disease Outbreaks': 35, 'Riot': 36, 'Hurricanes_Tornado_Storm_Blizzard': 37, 'Air crash': 38,
    'Government Job change - Appoint_Inauguration': 39, 'Famous Person - Recovered': 40, 'Break historical records': 41, 'Join in an Organization': 42,
    'Famous Person - Marriage': 43, 'Diplomatic Talks _ Diplomatic_Negotiation_ Summit Meeting': 44, 'Organization Fine': 45, 'Floods': 46,
    'Sports Competition': 47, 'Volcano Eruption': 48, 'New achievements in aerospace': 49, 'Regime Change': 50, 'Government Job change - Resignation_Dismissal': 51,
    'Mine Collapses': 52, 'Famous Person - Divorce': 53, 'Mass Poisoning': 54, 'New archeological discoveries': 55, 'Famous Person - Commit Crime - Investigate': 56,
    'Diplomatic Visit': 57, 'Organization Merge': 58, 'Environment Pollution': 59, 'other': 60,
}

# categories for news categories
NEWSCATEGORIES = {
    'WORLDPOST': 0, 'PARENTS': 1, 'COMEDY': 2,'MONEY': 3, 'WOMEN': 4,'GOOD NEWS': 5,'WEIRD NEWS': 6,'TECH': 8,'ARTS & CULTURE': 9,
    'WEDDINGS': 10,'EDUCATION': 11,'CRIME': 13,'FIFTY': 14,'STYLE': 15,'SPORTS': 16,'TASTE': 17,'COLLEGE': 18,'THE WORLDPOST': 19,'WORLD NEWS': 20,
    'GREEN': 21,'CULTURE & ARTS': 22,'POLITICS': 23, 'WELLNESS': 24,'HOME & LIVING': 25,'MEDIA': 26,'SCIENCE': 27,'HEALTHY LIVING': 28,
    'U.S. NEWS': 29,'ARTS': 30,'FOOD & DRINK': 31,'ENTERTAINMENT': 32,'ENVIRONMENT': 33,'IMPACT': 34,'RELIGION': 35,
    'PARENTING': 36,'STYLE & BEAUTY': 37,'BUSINESS': 38,'TRAVEL': 39,'OTHER':40
}

def clean_snli(x):
    x['pair1'] = x['premise']
    x['pair2'] = x['hypothesis']
    x['type'] = 'pair_classification'
    x['cls_id'] = 'snli'
    x['n_labels'] = 3
    return x

def clean_contractnli(x):
    x['pair1'] = x['premise']
    x['pair2'] = x['hypothesis']
    x['type'] = 'pair_classification'
    x['cls_id'] = 'contractnli'
    x['n_labels'] = 3
    return x

def clean_mnli(x):
    x['pair1'] = x['premise']
    x['pair2'] = x['hypothesis']
    #x['label'] = []
    x['type'] = 'pair_classification'
    x['cls_id'] = 'mnli'
    x['n_labels'] = 3
    return x

def clean_cannotdatast(x):
    x['pair1'] = x['premise']
    x['pair2'] = x['hypothesis']
    x['type'] = 'pair_classification'
    x['cls_id'] = 'cannotdataset'
    x['n_labels'] = 2
    return x

def clean_newscategory(x):
    x['pair1'] = x['headline'] + ". " + x['short_description']
    x['pair2'] = None
    x['label'] = NEWSCATEGORIES.get(x['category'],NEWSCATEGORIES['OTHER'])
    x['type'] = 'classification'
    x['cls_id'] = 'newscategory'
    x['n_labels'] = 40
    return x

def clean_doceeevents(x):
    x['pair1'] = x['text']
    x['pair2'] = None
    x['label'] = DOCEEEVENTS.get(x['event_type'],DOCEEEVENTS['other'])
    x['type'] = 'classification'
    x['cls_id'] = 'doceeevents'
    x['n_labels'] = 61
    return x

def clean_dbpedia_l2(x):
    x['pair1'] = x['text']
    x['pair2'] = None
    x['label'] = DBPEDIA_L2.get(x['l2'],DBPEDIA_L2['other'])
    x['type'] = 'classification'
    x['cls_id'] = 'dbpedia_l2'
    x['n_labels'] = 71 # 219
    return x

def clean_dbpedia_l3(x):
    x['pair1'] = x['text']
    x['pair2'] = None
    x['label'] = DBPEDIA_L3.get(x['l3'],DBPEDIA_L3['other'])
    x['type'] = 'classification'
    x['cls_id'] = 'dbpedia_l3'
    x['n_labels'] = 220
    return x

def clean_casehold_positives(x):
    x['pair1'] = x['citing_prompt'].split('(<HOLDING>)')[0]
    correct_holding_id = int(x['label'])
    correct_holding_text = x['holding_%d' % correct_holding_id]
    x['pair2'] = correct_holding_text
    x['label'] = 1
    x['type'] = 'pair_classification'
    x['cls_id'] = 'casehold'
    x['n_labels'] = 2
    return x

def clean_casehold_negatives(x):
    x['pair1'] = x['citing_prompt'].split('(<HOLDING>)')[0]
    correct_holding_id = int(x['label'])
    incorrect_holding_id = (correct_holding_id+1) % 4
    incorrect_holding_text = x['holding_%d' % incorrect_holding_id]
    x['pair2'] = incorrect_holding_text
    x['label'] = 0
    x['type'] = 'pair_classification'
    x['cls_id'] = 'casehold'
    x['n_labels'] = 2
    return x

def filter_snli(x):
    return x['label']!=-1

def filter_newscategory(x):
    return x['category'] not in ['LATINO VOICES',"QUEER VOICES", "BLACK VOICES"]

def clean_mtopintent(x):
    # id (int64)	text (string)	label (int32)	label_text (string)
    x['pair1'] = x['text']
    x['pair2'] = None
    x['type'] = 'classification'
    x['cls_id'] = 'mtopintent'
    x['n_labels'] = 113
    return x

cls_streaming_cleaning_functions = {
    'snli':(clean_snli, filter_snli, DEFAULT_COLUMNS,['hypothesis','premise']),
    'multi_nli':(clean_mnli, None, DEFAULT_COLUMNS, ['promptID', 'pairID', 'premise', 'premise_binary_parse', 'premise_parse', 'hypothesis', 'hypothesis_binary_parse','hypothesis_parse','genre']),
    'tum-nlp/cannot-dataset':(clean_cannotdatast, None, DEFAULT_COLUMNS,['hypothesis','premise']),
    'kiddothe2b/contract-nli/contractnli_a':(clean_contractnli, None, DEFAULT_COLUMNS, ['premise','hypothesis']),
    'kiddothe2b/contract-nli/contractnli_b':(clean_contractnli, None, DEFAULT_COLUMNS, ['premise','hypothesis']),
    'heegyu/news-category-dataset':(clean_newscategory, filter_newscategory, DEFAULT_COLUMNS, ['category', 'headline', 'authors', 'link', 'short_description', 'date']),
    'fkdosilovic/docee-event-classification':(clean_doceeevents, None, DEFAULT_COLUMNS, ['title', 'text', 'event_type', 'date', 'metadata']),
    'DeveloperOats/DBPedia_Classes_level2':(clean_dbpedia_l2, None, DEFAULT_COLUMNS, ['text','l1','l2','l3']),
    'DeveloperOats/DBPedia_Classes_level3':(clean_dbpedia_l3, None, DEFAULT_COLUMNS, ['text','l1','l2','l3']),
    'casehold/casehold_positives':(clean_casehold_positives, None, DEFAULT_COLUMNS, ['example_id', 'citing_prompt', 'holding_0', 'holding_1', 'holding_2', 'holding_3', 'holding_4']),
    'casehold/casehold_negatives':(clean_casehold_negatives, None, DEFAULT_COLUMNS, ['example_id', 'citing_prompt', 'holding_0', 'holding_1', 'holding_2', 'holding_3', 'holding_4']),
    'mteb/mtop_intent':(clean_mtopintent, None, DEFAULT_COLUMNS, ['label_text','id','text']),
}

DEFAULT_PROB = 1.0
print('TODO use the FINER-149 dataset')
cls_files = [
    # dataset name, subset, take_probability, dataset size
    ('snli', None, DEFAULT_PROB/2, 550000, 'pair_classification', False),
    ('multi_nli', None, DEFAULT_PROB, 393000, 'pair_classification', False),
    ('tum-nlp/cannot-dataset', None, DEFAULT_PROB, 77400, 'pair_classification', False),
    ('kiddothe2b/contract-nli','contractnli_a', DEFAULT_PROB//3, 6200, 'pair_classification', False),
    ('kiddothe2b/contract-nli','contractnli_b', DEFAULT_PROB//3, 7190, 'pair_classification', False),
    ('heegyu/news-category-dataset', None, DEFAULT_PROB/2, 210000, 'classification', False),
    ('fkdosilovic/docee-event-classification', None, DEFAULT_PROB/2, 21949, 'classification', False),
    ('DeveloperOats/DBPedia_Classes', None, DEFAULT_PROB/2, 241000, 'classification', False),
    ('DeveloperOats/DBPedia_Classes', None, DEFAULT_PROB/2, 241000,'classification', False),
    ('casehold/casehold', 'all', DEFAULT_PROB/2, 53100, 'pair_classification', False),
    ('casehold/casehold', 'all', DEFAULT_PROB/2, 53100, 'pair_classification', False),
    ('mteb/mtop_intent', 'en',DEFAULT_PROB, 15700, 'classification',False)
]

clsdata_streaming_config = {
    'files':cls_files,
    'max_seq_length':512,
    'val_size':500,
    'train_chunk_size':1000,
    'seed':42,
}

print([k1[0]+"|||"+k2 for k1,k2 in zip(cls_files, list(cls_streaming_cleaning_functions.keys()))])

['snli|||snli', 'multi_nli|||multi_nli', 'tum-nlp/cannot-dataset|||tum-nlp/cannot-dataset', 'kiddothe2b/contract-nli|||kiddothe2b/contract-nli/contractnli_a', 'kiddothe2b/contract-nli|||kiddothe2b/contract-nli/contractnli_b', 'heegyu/news-category-dataset|||heegyu/news-category-dataset', 'fkdosilovic/docee-event-classification|||fkdosilovic/docee-event-classification', 'DeveloperOats/DBPedia_Classes|||DeveloperOats/DBPedia_Classes_level2', 'DeveloperOats/DBPedia_Classes|||DeveloperOats/DBPedia_Classes_level3', 'casehold/casehold|||casehold/casehold_positives', 'casehold/casehold|||casehold/casehold_negatives', 'mteb/mtop_intent|||mteb/mtop_intent']


In [None]:
def initialize_and_get_classification_streaming_datasets(
    data_streaming_config,
    streaming_cleaning_functions,
    start_proportion = None,
    epoch=0,
    seed=42,
    path_to_val_cache = 'cache_val_cls.pkl',
    path_to_train_cache_epoch = 'cache_train_cls_%03g.pkl',
    do_check_english = True,
    name = 'CLS' #
):
    """Converts stream of unlabelled text data into static datasets for: pair-classification tasks"""
    # list of files to stream
    files = data_streaming_config['files']
    # number of examples to take from stream for validation set
    val_size = data_streaming_config['val_size']
    # number of examples to take from stream for training set
    train_chunk_size = data_streaming_config['train_chunk_size']
    min_seq_len = data_streaming_config.get('min_seq_length', 48)
    # normalization constant for normalizing the weights into probabilities
    probability_normalization_const = sum([x[2] for x in files])

    # where to initialize start-stream for training data
    if start_proportion is None:
        start_proportion = np.random.RandomState(seed+epoch).uniform()*0.99

    # reload cached files
    path_to_train_cache = None if not '%03g' in path_to_train_cache_epoch else path_to_train_cache_epoch % epoch
    do_make_valset = not os.path.isfile(path_to_val_cache)
    do_make_trainset = not os.path.isfile(path_to_train_cache)
    if not do_make_valset:
        print(f'RELOADING VAL-{name} SET: iter=%s' % path_to_val_cache)
        with open(path_to_val_cache,'rb') as pcon:
            datalist_val_triplet_static = pickle.load(pcon)
        print(f'VAL-{name} SET SIZE: %d' % len(datalist_val_triplet_static))
    else:
        datalist_val_triplet_static = []
    if not do_make_trainset:
        print(f'RELOADING VAL-{name} SET: iter=%s' % path_to_val_cache)
        with open(path_to_train_cache,'rb') as pcon:
            datalist_train_triplet_static = pickle.load(pcon)
        print(f'TRAIN-{name} EPOCH-%d SET SIZE: %d' % (epoch, len(datalist_train_triplet_static)))
    else:
        datalist_train_triplet_static = []

    if (do_make_trainset or do_make_valset):

        # loop through datasets
        for (data_nm, set_nm, prob, dataset_size, special_handling, partition_shuffle), dataset_key in zip(
            files, streaming_cleaning_functions.keys()
        ):
            if prob ==0:
                continue
            prob /= probability_normalization_const

            # get cleaning & filter functions for streaming data functionality
            clean_func, filter_func, feature_names, removefeature_names = streaming_cleaning_functions[dataset_key]

            # set arguments for the load_dataset (huggingface repos)
            load_dataset_args = {
                'path':data_nm, 'name':set_nm, 'split':'train', 'streaming':True
            }
            # for other non-huggingface repos, path needs to be a "builder"
            if data_nm.endswith('.jsonl') or data_nm.endswith('.jsonl.zip') or data_nm.endswith('.jsonl.zst'):
                load_dataset_args.update({'path':'json','data_files':data_nm})

            # special proecssing of datasets with multiple partitions
            if bool(partition_shuffle): # or str(epoch)=='val':

                n_files, n_per_file = partition_shuffle
                dataset_size = n_per_file
                print('trying %s initialization (shuffling through %d files)' % (data_nm, n_files))

                # whether there is a filter
                if filter_func is None:
                    dset_stream = load_dataset(**load_dataset_args)
                else:
                    dset_stream = load_dataset(**load_dataset_args).filter(filter_func)

                # validation set
                if do_make_valset:
                    # take from stream
                    n_valset_take = max(int(prob*val_size), 1)
                    print('take %d from %s validation'% (n_valset_take, data_nm))
                    dset_stream_val = dset_stream.take(n_valset_take).map(clean_func).remove_columns(removefeature_names)
                    # convert stream to a static set and do check
                    dset_static_val_thisset = [
                        e for e in dset_stream_val if bool(re.search(r"\w+",e['pair1'][:200]))
                    ]
                # training set
                if do_make_trainset:
                    # randomly skip a bunch from this set
                    skip_to_start = int(start_proportion*n_per_file)
                    take_from_this_set = max(int(round(train_chunk_size*prob)),1)
                    print('take %d from %s training'% (take_from_this_set, data_nm))
                    # shuffle: take a random data partition (from the dataset's list of files)
                    dset_stream_train = dset_stream_val.shuffle(
                        seed = seed+epoch, buffer_size = skip_to_start+take_from_this_set,
                    )
                    dset_stream_train = dset_stream_train.skip(
                        skip_to_start # random skip through dataset to new start position
                    ).take(
                        take_from_this_set # take this amount for the training ste
                    ).map(clean_func).remove_columns(removefeature_names)
                    # convert training to static dataset
                    dset_static_train_thisset = [
                        e for e in dset_stream_train if bool(re.search(r"\w+",e['pair1'][:200]))
                    ]
            else:
                # regular streaming
                print('trying %s initialization' % data_nm)
                # whether there is a filter
                if filter_func is None:
                    dset_stream = load_dataset(**load_dataset_args).map(clean_func).remove_columns(removefeature_names)
                else:
                    dset_stream = load_dataset(**load_dataset_args).filter(filter_func).map(clean_func).remove_columns(removefeature_names)
                # take from stream
                n_valset_take = max(int(prob*val_size), 1) # size of valset
                print('take %d from %s validation'% (n_valset_take, data_nm))
                skip_to_start = int(start_proportion*(dataset_size-n_valset_take)) # random point to skip to
                n_train_take = max(int(round(train_chunk_size*prob)),1) # size of train set
                print('take %d from %s train'% (n_train_take, data_nm))
                if do_make_valset:
                    dset_stream_val = dset_stream.take(n_valset_take)
                    dset_static_val_thisset = [
                        e for e in dset_stream_val if bool(re.search(r"\w+",e['pair1'][:200]))
                    ]
                if do_make_trainset:
                    dset_stream_train = dset_stream.skip(n_valset_take+skip_to_start).take(n_train_take)
                    dset_static_train_thisset = [
                        e for e in dset_stream_train if bool(re.search(r"\w+",e['pair1'][:200]))
                    ]
            print('Done getting streams/reloading from %s' % data_nm)
            # check language
            if do_make_valset:
                # discard non-english
                dset_static_val_thisset =[
                    e for e in dset_static_val_thisset if check_language(e['pair1'])[0] #detect(e['pair1'][:200]+" hello")=='en'
                ]
                print('done val language check')
                # add to val set
                datalist_val_triplet_static.extend(dset_static_val_thisset)

            # check language
            if do_make_trainset:
                # discard non-english
                dset_static_train_thisset =[
                    e for e in dset_static_train_thisset if check_language(e['pair1'])[0]
                ]
                print('done train language check')

                # ensure that none of the examples in the traning set are in the validation set
                def hashtest(text1,text2):
                    texthash = text1.lower()
                    texthash+= "" if text2 is None else text2[:1000].lower()
                    return texthash

                if do_make_valset:
                    val_queries = set([hashtest(q['pair1'],q['pair2']) for q in dset_static_val_thisset])
                    dset_static_train_thisset = [
                        s for s in dset_static_train_thisset if hashtest(s['pair1'],s['pair2']) not in val_queries
                    ]

                # add to training set
                datalist_train_triplet_static.extend(dset_static_train_thisset)

        print(f'Done collecting {name} streaming data')

    if do_make_valset:
        print('saving streamed %s validation data: %s' % (name, path_to_val_cache))
        with open(path_to_val_cache,'wb') as pcon:
            pickle.dump(datalist_val_triplet_static, pcon)

    if do_make_trainset:
        print('saving streamed %s training for epoch %d: %s' % (name, epoch, path_to_train_cache))
        with open(path_to_train_cache,'wb') as pcon:
            pickle.dump(datalist_train_triplet_static, pcon)

    return {
        'train':datalist_train_triplet_static,
        'val':datalist_val_triplet_static,
        'epoch':epoch,
        'index_stream':start_proportion
    }


In [None]:
clsdata_streaming_config = {
    'files':cls_files,
    'max_seq_length':512,
    'val_size':500,
    'train_chunk_size':2000,
    'seed':42,
}

cls_statics_datsets = initialize_and_get_classification_streaming_datasets(
    data_streaming_config=clsdata_streaming_config,
    streaming_cleaning_functions=cls_streaming_cleaning_functions,
    start_proportion = None,
    epoch=0,
    seed=42,
    path_to_val_cache = 'cache_val_cls.pkl',
    path_to_train_cache_epoch = 'cache_train_cls_%03g.pkl',
    do_check_english = True,
    name = 'CLS' #
)

trying snli initialization
take 38 from snli validation
take 154 from snli train
Done getting streams/reloading from snli
done val language check
done train language check
trying multi_nli initialization
take 76 from multi_nli validation
take 308 from multi_nli train
Done getting streams/reloading from multi_nli
done val language check
done train language check
trying tum-nlp/cannot-dataset initialization
take 76 from tum-nlp/cannot-dataset validation
take 308 from tum-nlp/cannot-dataset train
Done getting streams/reloading from tum-nlp/cannot-dataset
done val language check
done train language check
trying heegyu/news-category-dataset initialization
take 38 from heegyu/news-category-dataset validation
take 154 from heegyu/news-category-dataset train
Done getting streams/reloading from heegyu/news-category-dataset
done val language check
done train language check
trying fkdosilovic/docee-event-classification initialization
take 38 from fkdosilovic/docee-event-classification validation


Downloading readme:   0%|          | 0.00/1.77k [00:00<?, ?B/s]

take 38 from DeveloperOats/DBPedia_Classes validation
take 154 from DeveloperOats/DBPedia_Classes train
Done getting streams/reloading from DeveloperOats/DBPedia_Classes
done val language check
done train language check
trying DeveloperOats/DBPedia_Classes initialization
take 38 from DeveloperOats/DBPedia_Classes validation
take 154 from DeveloperOats/DBPedia_Classes train
Done getting streams/reloading from DeveloperOats/DBPedia_Classes
done val language check
done train language check
trying casehold/casehold initialization


Downloading builder script:   0%|          | 0.00/8.71k [00:00<?, ?B/s]

take 38 from casehold/casehold validation
take 154 from casehold/casehold train
Done getting streams/reloading from casehold/casehold
done val language check
done train language check
trying casehold/casehold initialization
take 38 from casehold/casehold validation
take 154 from casehold/casehold train
Done getting streams/reloading from casehold/casehold
done val language check
done train language check
trying mteb/mtop_intent initialization
take 76 from mteb/mtop_intent validation
take 308 from mteb/mtop_intent train
Done getting streams/reloading from mteb/mtop_intent
done val language check
done train language check
Done collecting CLS streaming data
saving streamed CLS validation data: cache_val_cls.pkl
saving streamed CLS training for epoch 0: cache_train_cls_000.pkl


In [None]:
!rm *.pkl

In [None]:
# TODO - make dataloaders for pair_classification, classification, and next-sentence-prediction

In [None]:
import torch.utils.data as torch_data

In [None]:

class DatasetPairClassification(torch_data.Dataset):
    def __init__(
        self,
        list_of_data=None,
        text1_name ='pair1',
        text2_name ='pair2',
        label_name = 'label',
        datasetname_name = 'cls_id',
        classificationtype_name = 'type',
        nlabels_name = 'n_labels',
        seed = 42
    ):
        self.data = {} # internal data preprocessed
        self.datasets = [] # list of names of datasets in Dataset class
        self.label2int = {} #maps {label:int} dictionary
        self.label2dataset = {} #maps {label:mask}
        self.label2mask = {}
        self.dataset_classification_types = {} # dataset types (pair-classification, classificaiton)
        self.text1_name = text1_name
        self.text2_name = text2_name
        self.label_name = label_name#'label',
        self.datasetname_name = datasetname_name #'cls_id',
        self.classificationtype_name = classificationtype_name#'type',
        self.nlabels_name = nlabels_name #'n_labels'
        self.seed = seed

        # random state
        self.np_random = np.random.RandomState(seed)

        if list_of_data is not None and len(list_of_data)>0:

            # loop through the data and add each triplets: export a panda df as final data
            self.df = self.process(list_of_data, False)

    def process(self, list_of_data, inplace=True):
        """convert the raw examples to dataset"""
        # loop through the data and add each triplets
        self._loop_through_list_of_data_and_add_to_selfdata(
            list_of_data = list_of_data
        )

        # add positives to self.data
        self._find_positives_and_add_to_data()

        # add negatives to self.data
        self._find_negatives_and_add_to_data()

        # make mask for loss function
        self._convert_labelint_to_vectors()

        # make mask for loss function
        self._make_mask()

        # harden the dataset to pandas dataframe
        data_flatten = self.flatten_data(self.data)
        if not inplace:
            return data_flatten
        self.df = data_flatten

    def _loop_through_list_of_data_and_add_to_selfdata(
            self,
            list_of_data
        ):
        """loops through and adds the text pair and label"""
        for raw_example in list_of_data:

            # add each element to the data
            self._add_unit_to_data(
                text1 = raw_example[self.text1_name],
                text2= raw_example[self.text2_name],
                label= raw_example[self.label_name],
                n_labels= raw_example[self.nlabels_name],
                dataset_name= raw_example[self.datasetname_name],
                method = raw_example[self.classificationtype_name]
            )

    def _find_positives_and_add_to_data(self):
        """Finds data with the same label, and adds them as positives"""
        which_clsdatasets_lack_positives = [
            datasetname for datasetname, datasettype
            in self.dataset_classification_types.items()
            if datasettype == 'classification'
        ]
        for datasetname in which_clsdatasets_lack_positives:

            # all unique labels in subdataset
            ulabels_in_clsdataset = sorted(list(set([
                (d['class'],d['label']) for d in self.data[datasetname]
                if d['label'] == self.label2int['%s_%d' % (datasetname, 1)]
            ])))

            # loop through label classes
            for labelclass,label in ulabels_in_clsdataset:

                # other samples with the same class (and positive)
                # `class` is the original dataset class, label = {different, same}
                idx_this_class = [
                    i for i,d
                    in enumerate(self.data[datasetname])
                    if d['class'] == labelclass and d['label']==label
                ]

                idx_this_class_need_positives = [
                    i for i,d
                    in enumerate(self.data[datasetname])
                    if d['class'] == labelclass and d['label']==label
                    and d['text2'] is None
                ]

                # subsample within by permutation
                idx_sample_within = self.np_random.permutation(idx_this_class)

                # get text of permuted-indicies, assign as positive for each sample
                for i,j in zip(idx_this_class_need_positives, idx_sample_within[:len(idx_this_class_need_positives)]):

                    self.data[datasetname][i]['text2'] = self.data[datasetname][j]['text1']

    def _find_negatives_and_add_to_data(self):
        """Finds data with the same label, and adds them as positives"""
        which_clsdatasets_lack_negatives = [
            datasetname for datasetname, datasettype
            in self.dataset_classification_types.items()
            if datasettype == 'classification'
        ]
        for datasetname in which_clsdatasets_lack_negatives:

            # all unique labels in subdataset
            ulabels_in_clsdataset = sorted(list(set([
                (d['class'],d['label']) for d in self.data[datasetname]
                if d['label'] == self.label2int['%s_%d' % (datasetname, 0)]
            ])))

            # loop through label classes
            for labelclass,label in ulabels_in_clsdataset:

                # other samples with the same class (and positive)
                # `class` is the original dataset class, label = {different, same}
                idx_this_class = [
                    i for i,d
                    in enumerate(self.data[datasetname])
                    if d['class'] == labelclass and d['label']==label
                ]
                # indices of all other data
                idx_this_other_class = [
                    i for i,d
                    in enumerate(self.data[datasetname])
                    if d['class'] != labelclass
                ]

                # subsample within by permutation
                idx_sample_otherlabels= self.np_random.choice(idx_this_other_class, size =len(idx_this_class))

                # get text of permuted-indicies, assign as positive for each sample
                for i,j in zip(idx_this_class, idx_sample_otherlabels):

                    self.data[datasetname][i]['text2'] = self.data[datasetname][j]['text1']

    def _convert_labelint_to_vectors(self):
        """Loops through data and converts each labelinteger into a vector for multi-label loss"""
        for datasetname, dataset in self.data.items():
            for example in dataset:
                example.update({
                    'labelvector':self._convert_labelint_to_vector([example['label']])
                })

    def _convert_labelint_to_vector(self, labelints):
        """Loops through data and converts each labelinteger into a vector for multi-label loss"""
        v = np.zeros(len(self.label2int))
        for labelint in labelints:
            v[labelint]=1
        return v

    def _make_mask(self):
        """for each sample, the loss should only pertain to labels within the same dataset, not other datasets -- by masking"""
        if (
            len(self.label2mask)!=self.label2dataset
        ) or bool(
            set(list(self.label2mask.keys())).symmetric_differnce(set(list(self.label2dataset.keys())))
        ):
            # make the self.label2mask
            for label,dataset in self.label2dataset.items():
                #
                self.label2mask[self.label2int[label]] = self._convert_labelint_to_vector([
                    self.label2int[l] for l,dset in self.label2dataset.items() if dset==dataset
                ])

        # loop through data and insert mask into each sample
        for datasetname, dataset in self.data.items():
            for example in dataset:
                example.update({
                    'mask':self.label2mask[example['label']]
                })

    def _add_labels_to_label2int(self, dataset_labels_as_globalname, dataset_name):
        for globallabel in dataset_labels_as_globalname:
            if globallabel not in self.label2int.keys():
                next_label_int = len(self.label2int)
                self.label2int[globallabel] = next_label_int
                self.label2dataset[globallabel] = dataset_name

    def _add_unit_to_data(
        self,
        text1,
        text2,
        label,
        n_labels,
        dataset_name,
        method
    ):
        """Adds one unit of processed data to the internal self.data"""
        if method == 'pair_classification':

            # pair classification: two texts with a label of the relationship between pair
            self._add_text_pair_to_data(
                text1,
                text2,
                label,
                n_labels,
                dataset_name
            )

        elif method == 'classification':

            # classification: single texts, with negatives needing to be deduced later
            self._add_textclass_to_data(
                text1,
                label,
                dataset_name
            )

    def _add_text_pair_to_data(
        self,
        text1,
        text2,
        label,
        n_labels,
        dataset_name
    ):
        """add a text pair to the self data: specifically for pair_classification"""
        if dataset_name not in self.data.keys():
            print('encountered new dataset for pair-classification: %s' % dataset_name)
            self.data[dataset_name] = []
            self.datasets += [dataset_name]
            self.dataset_classification_types[dataset_name] = 'pair_classification'

            # common naming for all labels across all datasets
            dataset_labels_as_globalname = [
                "%s_%d" % (dataset_name, l) for l in range(n_labels)
            ]

            self._add_labels_to_label2int(dataset_labels_as_globalname, dataset_name)

        if text2 is not None:

            self.data[dataset_name].append({
                'text1':text1,
                'text2':text2,
                'label':self.label2int["%s_%d" % (dataset_name, label)],
                'mask':None
            })

    def _add_textclass_to_data(
        self,
        text,
        classlabel,
        dataset_name
    ):
        """add a text to the self data: specifically for classification"""
        if dataset_name not in self.data.keys():
            print('encountered new dataset for classification: %s' % dataset_name)
            self.data[dataset_name] = []
            # register dataset to list of datasets
            self.datasets += [dataset_name]
            # map datset classification types
            self.dataset_classification_types[dataset_name] = 'classification'

            # common naming for all labels across all datasets
            dataset_labels_as_globalname = [
                "%s_%d" % (dataset_name, l) for l in [0,1]
            ]

            self._add_labels_to_label2int(dataset_labels_as_globalname, dataset_name)

        # positives and negatives must be added seperately (same label, different label)
        for label in [0,1]:

            self.data[dataset_name].append({
                'text1':text,
                'text2':None,
                'mask':None,
                'label':self.label2int["%s_%d" % (dataset_name, label)],
                'class':classlabel
            })

    def flatten_data(self, data):
        """Converts data to a giant list"""
        data_all_flat = []
        for datasetname, subdataset in self.data.items():
            data_all_flat += subdataset
        return data_all_flat

    def integrate_another_dataset(
        self,
        list_of_newdata,
        function_to_reformatdata,
        dataset_name
    ):
        """
        Adds new data to the existing self.df
        Arguments:
        :param list_of_newdata: list of data to integrate/add
        :function_to_reformatdata: function that converts the data in list_of_newdata[idx]
        """
        # check that the unit of data has the required fields
        newtestdata = function_to_reformatdata(list_of_newdata[0])
        assert isinstance(newtestdata, list), 'function_to_reformatdata must output a list of reformated data'
        assert not bool(set(['text1','text2','class']).difference(set(newtestdata[0].keys()))), 'new data must have `text1`, `text2`,`class`'
        classlabels = set()
        newdata_converted_all = []

        # loop through and convert all data to acceptable format for internal datasets
        for newdata in list_of_newdata:
            newdata_converted = function_to_reformatdata(newdata)
            for unit in newdata_converted:
                classlabels |= set([unit['class']])
            newdata_converted_all += newdata_converted

        # loop through and ingest all converted data into the self.data internal dataset
        for unit in newdata_converted_all:
            self._add_text_pair_to_data(
                text1=unit['text1'],
                text2=unit['text2'],
                label=unit['class'],
                n_labels=len(classlabels),
                dataset_name=dataset_name
            )
        # remake the mask for ALL data, given the new label sets
        self._convert_labelint_to_vectors()
        # make mask for loss function
        self._make_mask()
        # harden the dataset to pandas dataframe
        data_flatten = self.flatten_data(self.data)
        self.df = data_flatten
        print('done integrating new dataset %s' % dataset_name)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        return self.df[idx]

In [None]:
# initialize the (empty) CLS-dataset
dataset_paircls = DatasetPairClassification(
    list_of_data=None,
    text1_name ='pair1',
    text2_name ='pair2',
    label_name = 'label',
    datasetname_name = 'cls_id',
    classificationtype_name = 'type',
    nlabels_name = 'n_labels',
    seed = 42
)

# add the data to empty datas
dataset_paircls.process(cls_statics_datsets['train'], inplace=True)

print('Size of pair-classification dataset %d' % len(dataset_paircls))

encountered new dataset for pair-classification: snli
encountered new dataset for pair-classification: mnli
encountered new dataset for pair-classification: cannotdataset
encountered new dataset for classification: newscategory
encountered new dataset for classification: doceeevents
encountered new dataset for classification: dbpedia_l2
encountered new dataset for classification: dbpedia_l3
encountered new dataset for pair-classification: casehold
encountered new dataset for classification: mtopintent
Size of pair-classification dataset 2862


In [None]:
dataset_paircls[2211]

{'text1': 'with “intentionally and knowingly” causing the death of a police officer. The trial court’s charge to the jury authorized conviction upon a finding that he “intentionally or knowingly” caused such result. In his first point of error, appellant complains that the trial judge improperly permitted conviction on a theory other than that alleged in the indictment. This Court has long approved the practice of prosecuting authorities to plead culpable mental states conjunctively and submit them for jury consideration disjunctively whenever the statutory language is disjunctive. Ely v. State, 582 S.W.2d 416, 421 (Tex.Cr.App.1979) (on original submission); Cowan v. State, 562 S.W.2d 236, 240 (Tex.Cr.App.1978) (rehearing denied en banc). But cf. Hunter v. State, 576 S.W.2d 395 (Tex.Cr.App.1979) ',
 'text2': 'recognizing by citing several cases that diametric opposing lines of case law authority exist with one line suggesting that information need not allege each element of offense cha

In [None]:
## integrate the nextsentence data
def function_to_reformatnextsentence(x):
    """reformats a triplet into one positive pair and one negative pair"""
    return [
        {"text1":x['anchor'], "text2":x['next'], "class":1},
        {"text1":x['anchor'], "text2":x['opposite'], "class":0},
    ]

# integrate the next sentence data
dataset_paircls.integrate_another_dataset(
    list_of_newdata = dataset_static_mlm['train']['nextsentence'],
    function_to_reformatdata = function_to_reformatnextsentence,
    dataset_name = 'nextsentence'
)

print('Size of pair-classification dataset %d' % len(dataset_paircls))

print('types of datasets in classification task:')
print(dataset_paircls.dataset_classification_types)

done integrating new dataset nextsentence
Size of pair-classification dataset 7986
types of datasets in classification task:
{'snli': 'pair_classification', 'mnli': 'pair_classification', 'cannotdataset': 'pair_classification', 'newscategory': 'classification', 'doceeevents': 'classification', 'dbpedia_l2': 'classification', 'dbpedia_l3': 'classification', 'casehold': 'pair_classification', 'nextsentence': 'pair_classification'}


In [None]:
dataset_paircls[1006]

{'text1': "Breathe Easy, Jar Jar Binks Won't Be In 'Star Wars: The Force Awakens'. Crisis averted.",
 'text2': 'Omarion Wasn\'t Happy About Grammy Nomination Snub. "As an artist you look forward to being acknowledged by the game."',
 'mask': array([0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0.]),
 'label': 9,
 'class': 32,
 'labelvector': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0.])}