<div align="center">

# 🚀 Spearecode Preprocessing 🚀

</div>

<br>

Welcome to the **Spearecode Preprocessing Notebook**! This notebook will guide you through the necessary preprocessing steps to prepare a toy dataset for Language Model training. We will focus on making the dataset more suitable for training by performing the following steps:

1. 📚 **Loading the dataset**: We'll start by importing the dataset from a file or external source.
2. 📦 **Chunking the text**: The dataset will be divided into smaller chunks or segments, making it easier to process during training.
3. 💬 **Tokenization**: Each chunk of text will be split into individual tokens (words or subwords), which are the basic units for language models.
4. 📊 **Basic Exploratory Data Analysis (EDA)**: We'll analyze the dataset's characteristics, such as token frequency, to gain insights and identify potential issues.

After completing the preprocessing and EDA, the toy dataset will be converted into `TFRecords` format. This efficient binary format is designed for use with TensorFlow and will enable seamless integration with your Language Model training pipeline.

Let's dive in and start preprocessing the dataset! 🎉


<br><br>

<div align="center">

# 🌟 Table of Contents 🌟

</div>

---

0. [**Setup**](#setup)
1. [**Loading the Dataset**](#loading-the-dataset)
2. [**Chunking the Text**](#chunking-the-text)
3. [**Tokenization**](#tokenization)
4. [**Basic Exploratory Data Analysis (EDA)**](#basic-eda)
5. [**Converting to TFRecords**](#converting-to-tfrecords)

---



<br>

<div align="center">

## 🛠️ Setup <a name="setup"></a>

</div>

<br>

In this section, we'll import required libraries and methods from our utilities file. We will also define relevant paths and high level information we may need later. We also run a few basic Tensorflow setup steps to ensure optimal and reproducible runs.

In [2]:
### IMPORTS ###
import os
import sys
import random
import numpy as np
import pandas as pd
from glob import glob
import tensorflow as tf
import sentencepiece as spm
from tqdm.notebook import tqdm; tqdm.pandas()

PROJECT_DIR = os.path.dirname(os.getcwd())
sys.path.insert(0, PROJECT_DIR) # project root into path

from spearecode.preprocessing_utils import load_from_txt_file, preprocess_shakespeare, save_to_txt_file, print_check_speare
from spearecode.general_utils import tf_xla_jit, tf_set_memory_growth, seed_it_all, flatten_l_o_l, print_ln

### DEFINE PATHS ###
NBS_PATH = os.path.join(PROJECT_DIR, "nbs")
DATA_PATH = os.path.join(PROJECT_DIR, "data")
SS_TEXT_PATH = os.path.join(DATA_PATH, "t8.shakespeare.txt")
PREPROCESSED_FULL_TEXT_PATH = SS_TEXT_PATH.replace(".txt", "_preprocessed.txt")

<br>

<div align="center">

## 📚 Loading the Dataset <a name="loading-the-dataset"></a>

</div>

<br>

In this section, we'll import the dataset from a file or external source. The dataset will be read into memory, allowing us to manipulate and process the text as needed throughout the preprocessing steps.


In [4]:
raw_text = load_from_txt_file(SS_TEXT_PATH)
ss_text = preprocess_shakespeare(raw_text)
save_to_txt_file(ss_text, PREPROCESSED_FULL_TEXT_PATH)
print_check_speare(ss_text)


... DATASET INFO:
	NUMBER OF CHARS --> 5,419,872
	NUMBER OF LINES --> 120,696


... FIRST 1000 CHARACTERS:



1
  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But as the riper should by time decease,
  His tender heir might bear his memory:
  But thou contracted to thine own bright eyes,
  Feed'st thy light's flame with self-substantial fuel,
  Making a famine where abundance lies,
  Thy self thy foe, to thy sweet self too cruel:
  Thou that art now the world's fresh ornament,
  And only herald to the gaudy spring,
  Within thine own bud buriest thy content,
  And tender churl mak'st waste in niggarding:
    Pity the world, or else this glutton be,
    To eat the world's due, by the grave and thee.

2
  When forty winters shall besiege thy brow,
  And dig deep trenches in thy beauty's field,
  Thy youth's proud livery so gazed on now,
  Will be a tattered weed of small worth held:  
  Then being asked, where all thy beauty lies,
  Where al

<br>

<div align="center">

## 📦 Chunking the Text <a name="chunking-the-text"></a>

</div>

<br>

Once the dataset is loaded, we'll divide it into smaller chunks or segments. This step is crucial for making the dataset more similar to code files (which is the type of data we will be using during the other parallel streams).

I implement two simple methods:
1. A basic double newline split **(`\n\n`)** resulting in 6294 chunks
2. Using Langchain RecursiveTextSplitter to chunk to a particular text length
    * This allows us to specify our desired text length and even overlap the chunks.
        * Note we allow for a small amount of overlap and this may cause some leakage... but whatever.
    * **We will use this method for our purposes.**
    


In [188]:
def do_rcts_chunking(text, chunk_size=1024, chunk_overlap=128, length_fn=len):
    """
    Perform Recursive Character Text Splitting (RCTS) chunking on the input text.
    
    Args:
        text (str): The input text to be chunked.
        chunk_size (int): The maximum size of each chunk.
        chunk_overlap (int): The number of overlapping characters between adjacent chunks.
        length_fn (callable, optional): Function to calculate the length of the text. Defaults to len.
    
    Returns:
        list: A list of chunked text segments.
    """
    # Import the RecursiveCharacterTextSplitter from langchain.text_splitter module
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    
    # Instantiate the text splitter with the specified parameters
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=length_fn,
    )
    
    # Split the input text into chunks
    docs = text_splitter.create_documents([text])
    
    # Return the list of chunked text segments
    return [x.page_content for x in docs if len(x.page_content)>1]

def do_basic_chunking(text, chunk_delimeter="\n\n", max_length=1800, min_length=300):
    """
    Perform basic chunking on the input text using the specified delimiter.
    
    Args:
        text (str): The input text to be chunked.
        chunk_delimeter (str, optional): The delimiter used to split the text. Defaults to "\n\n".
    
    Returns:
        list: A list of chunked text segments.
    """
    # Split the input text based on the specified delimiter (ensure no empty chunks by stripping from ends)
    raw_docs = text.strip(chunk_delimeter).split(chunk_delimeter)
    tmp_docs = []
    docs = []
    
    while len(raw_docs)>0:
        doc = raw_docs.pop()
        
        if len(doc)>max_length:
            raw_docs+=doc.split("\n")
        elif len(doc)<min_length:
            tmp_docs.append(doc)
        else:
            docs.append(doc)
            
        if len("\n".join(tmp_docs))>min_length:
            docs.append("\n".join(tmp_docs))
            tmp_docs = []
    if tmp_docs:
        docs.append("\n".join(tmp_docs))
    
    # Return the list of chunked text segments
    return docs

In [189]:
# Feel free to pass non-default kwargs 
#    -- otherwise the rcts chunks will overlap by 64 and be 512 characters long
CHUNK_STYLE = "basic" # one of ['basic' | 'rcts']
basic_chunks = do_basic_chunking(ss_text)
rcts_chunks = do_rcts_chunking(ss_text)

print("\n... FIRST BASIC CHUNK ...\n")
print(basic_chunks[0])

print("\n... FIRST RCTS CHUNK ...\n")
print(rcts_chunks[0])

print("\n... EXAMPLE RANDOM BASIC CHUNK ...\n")
print(random.sample(basic_chunks, 1)[0])

print("\n... EXAMPLE RANDOM RCTS CHUNK ...\n")
print(random.sample(rcts_chunks, 1)[0])

print("\n... LAST BASIC CHUNK ...\n")
print(basic_chunks[-1])

print("\n... LAST RCTS CHUNK ...\n")
print(rcts_chunks[-1])



... FIRST BASIC CHUNK ...

  'O, that infected moisture of his eye,
  O, that false fire which in his cheek so glowed,
  O, that forced thunder from his heart did fly,
  O, that sad breath his spongy lungs bestowed,
  O, all that borrowed motion, seeming owed,
  Would yet again betray the fore-betrayed,
  And new pervert a reconciled maid.'

... FIRST RCTS CHUNK ...

1
  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But as the riper should by time decease,
  His tender heir might bear his memory:
  But thou contracted to thine own bright eyes,
  Feed'st thy light's flame with self-substantial fuel,
  Making a famine where abundance lies,
  Thy self thy foe, to thy sweet self too cruel:
  Thou that art now the world's fresh ornament,
  And only herald to the gaudy spring,
  Within thine own bud buriest thy content,
  And tender churl mak'st waste in niggarding:
    Pity the world, or else this glutton be,
    To eat the world's due, by the g

<br>

<div align="center">

## 💬 Tokenization <a name="tokenization"></a>

</div>

<br>

In this section, we'll tokenize the text, which involves splitting the chunks into individual tokens (words or subwords). Tokenization is an essential step in preprocessing, as it helps the Language Model understand the basic units of the text and learn meaningful patterns.

* We will train our tokenizer on the non-chunked dataset (after basic preprocessing), however, we will 


In [190]:
# Setup model directory if not already setup
MODEL_DIR = os.path.join(os.path.dirname(DATA_PATH), "models")
if not os.path.isdir(MODEL_DIR): os.makedirs(MODEL_DIR, exist_ok=True)

# User defined parameters (matching alphafold and code tokenization standards)
MODEL_PATH = os.path.join(MODEL_DIR, 'spearecode')
USER_DEFINED_SYMBOLS = ["\n","\t","\r","\f","\v"]
VOCAB_SIZE = 8_000
CHAR_COVERAGE = 1.0000

# Tokenizer parameters (and some defaults)
base_tokenizer_kwargs = dict(
    input = PREPROCESSED_FULL_TEXT_PATH,
    vocab_size=VOCAB_SIZE,
    character_coverage=CHAR_COVERAGE,
    pad_id=0, unk_id=1, bos_id=2, eos_id=3,
    remove_extra_whitespaces=False,
    allow_whitespace_only_pieces=True,
    add_dummy_prefix=False,
    user_defined_symbols=USER_DEFINED_SYMBOLS,
    normalization_rule_name="identity",
    num_threads=os.cpu_count(),
)

unigram_tokenizer_kwargs = base_tokenizer_kwargs.copy()
unigram_tokenizer_kwargs.update(dict(
    model_prefix=MODEL_PATH+"_unigram",
    model_type="unigram",
))

bpe_tokenizer_kwargs = base_tokenizer_kwargs.copy()
bpe_tokenizer_kwargs.update(dict(
    model_prefix=MODEL_PATH+"_bpe",
    model_type="bpe",
))

# train_tokenizer(ALL_TXT_PATHS, MODEL_PATH, VOCAB_SIZE, TOKENIZER_STYLE)
spm.SentencePieceTrainer.Train(**unigram_tokenizer_kwargs)
spm.SentencePieceTrainer.Train(**bpe_tokenizer_kwargs)

sp_uni = spm.SentencePieceProcessor()
sp_uni.load(f'{unigram_tokenizer_kwargs["model_prefix"]}.model')
uni_encoder = lambda x: sp_uni.encode(x)
uni_decoder = lambda x: sp_uni.decode(x)

sp_bpe = spm.SentencePieceProcessor()
sp_bpe.load(f'{bpe_tokenizer_kwargs["model_prefix"]}.model')
bpe_encoder = lambda x: sp_bpe.encode(x)
bpe_decoder = lambda x: sp_bpe.decode(x)

In [191]:
from IPython.display import HTML
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors


def replace_rightmost_newline(s, replacement='<br>'):
    parts = s.rsplit('\\n', 1)
    return replacement.join(parts)


def get_color(value, cmap='Pastel1', transparency=0.5):
    """
    Returns an HTML-formatted string representing the background color for a token.

    Args:
        value (int): The index of the token to color.
        cmap (str, optional): The name of the colormap to use. Defaults to 'Pastel1'.

    Returns:
        str: An HTML-formatted string representing the background color.
    """
    colormap = plt.get_cmap(cmap)
    return f"background-color: rgba{tuple([int(x*255) for x in colormap(value % colormap.N)[:-1]]+[transparency,])};"


def get_line_viz(token_lines, decoder, cmap='Pastel1', font_family='Courier New',
                 transparency=0.75, font_size='1.1em', unk_token='???', font_weight=300, padding='0px',
                 margin_right='0px', border_radius='0px', display_inline=False):
    """
    Generates an HTML string to visualize the tokenization of a text.

    Args:
        token_lines (list):
            – A list of lists of integer tokens.
        decoder (function, optional):
            – A function that maps an integer to the representative string
            – If this is None, the tokens are assumed to be strings not integers
        cmap (str, optional):
            – The name of the colormap to use. Defaults to 'Pastel1'.
        font_family (str, optional):
            – The font family to use for tokens. Defaults to 'Courier New'.
        transparency (float, optional):
            background transparency
        font_size (str, optional):
            – The font size to use for tokens. Defaults to '1.1em'.
        unk_token (str, optional):
            – The string to use for unknown tokens. Defaults to '???'
        font_weight (str, optional):
            – The font weight to use for tokens. Defaults to 'bold'.
        padding (str, optional):
            – The padding to use for tokens. Defaults to '2px'.
        margin_right (str, optional):
            – The right margin to use for tokens. Defaults to '5px'.
        border_radius (str, optional):
            – The border radius to use for tokens. Defaults to '3px'.
        display_inline (bool, optional):
            – Whether to display the HTML inline. Defaults to False.

    Returns:
        str: An HTML string representing the tokenized text with styling.
    """
    html = f"<style>span.token {{font-family: {font_family} !important; font-size: {font_size} !important; font-weight: {font_weight} !important; " \
           f"padding: {padding} !important; margin-right: {margin_right} !important; border-radius: {border_radius} !important;}}</style>"

    html += "<div style='background-color: #C0C0C0; padding: 25px; border-radius: 8px; margin-left: 10px; margin-right: 10px; margin-top: 20px; margin-bottom: 20px;'>"
    
    for token_line in token_lines:
        for i, token in enumerate(token_line):
            color = get_color(i, cmap, transparency)
            try:
                html += f"<span class='token' style='{color}'>{decoder(token).replace(' ', '&nbsp;')}</span>".replace('\t', '\\t').replace('\n', '\\n').replace('\r', '\\r').replace('\f', '\\f').replace('\v', '\\v')
            except TypeError:
                html += f"<span class='token' style='{color}'>{unk_token}</span>"
        html = replace_rightmost_newline(html)
    html += "</div>"
    
    if display_inline:
        HTML(html)

    return html

def plot_tokenization(text, encoder, decoder, split_on="\n"):
    display(HTML(get_line_viz([encoder(x+split_on) for x in text.split(split_on)], decoder)))

In [192]:
print("\n... BPE TOKENIZATION:")
plot_tokenization(basic_chunks[0], bpe_encoder, bpe_decoder)

print("\n... UNIGRAM TOKENIZATION:")
plot_tokenization(basic_chunks[0], uni_encoder, uni_decoder)


... BPE TOKENIZATION:



... UNIGRAM TOKENIZATION:


<br>

<div align="center">

## 📊 Basic Exploratory Data Analysis (EDA) <a name="basic-eda"></a>

</div>

<br>

Here, we'll perform a basic EDA on the dataset to gain insights and identify potential issues. This analysis may include examining token frequency, distribution of chunk lengths, and other relevant characteristics. This information can be helpful in understanding the dataset's structure and guiding further preprocessing decisions.


In [237]:
def get_n_chars(text):
    return len(text)

def get_n_lines(text):
    return len(text.split("\n"))

def get_n_tokens(tokens):
    return len(tokens)

def tokenize(text, encoder):
    return encoder(text)

def check_chunks(n_tokens, min_chunk_size=128, max_chunk_size=2048):
    if min_chunk_size<=n_tokens<max_chunk_size:
        return True
    else:
        return False
    
def get_metadata_df(df, drop_col_strings=('content',), additional_drop_strs=None):
    if additional_drop_strs is not None: 
        drop_col_strings = list(additional_drop_strs)+additional_drop_strs
    return df.copy().drop(
        columns=[_c for _c in df.columns if any(True if _x in _c else False for _x in drop_col_strings)]
    )
    
basic_chunk_df = pd.DataFrame({"content":basic_chunks})
basic_chunk_df["uni_token_content"] = basic_chunk_df["content"].progress_apply(lambda x: tokenize(x, uni_encoder))
basic_chunk_df["bpe_token_content"] = basic_chunk_df["content"].progress_apply(lambda x: tokenize(x, bpe_encoder))
basic_chunk_df["n_uni_tokens"] = basic_chunk_df["uni_token_content"].apply(get_n_tokens)
basic_chunk_df["n_bpe_tokens"] = basic_chunk_df["bpe_token_content"].apply(get_n_tokens)
basic_chunk_df["n_chars"] = basic_chunk_df["content"].apply(get_n_chars)
basic_chunk_df["n_lines"] = basic_chunk_df["content"].apply(get_n_lines)
basic_chunk_df["valid_uni_chunk"] = basic_chunk_df["n_uni_tokens"].apply(check_chunks)
basic_chunk_df["valid_bpe_chunk"] = basic_chunk_df["n_bpe_tokens"].apply(check_chunks)

basic_chunk_df_meta = get_metadata_df(basic_chunk_df)
display(basic_chunk_df)


rcts_chunk_df = pd.DataFrame({"content":rcts_chunks})
rcts_chunk_df["uni_token_content"] = rcts_chunk_df["content"].progress_apply(lambda x: tokenize(x, uni_encoder))
rcts_chunk_df["bpe_token_content"] = rcts_chunk_df["content"].progress_apply(lambda x: tokenize(x, bpe_encoder))
rcts_chunk_df["n_uni_tokens"] = rcts_chunk_df["uni_token_content"].apply(get_n_tokens)
rcts_chunk_df["n_bpe_tokens"] = rcts_chunk_df["bpe_token_content"].apply(get_n_tokens)
rcts_chunk_df["n_chars"] = rcts_chunk_df["content"].apply(get_n_chars)
rcts_chunk_df["n_lines"] = rcts_chunk_df["content"].apply(get_n_lines)
rcts_chunk_df["valid_uni_chunk"] = rcts_chunk_df["n_uni_tokens"].apply(check_chunks)
rcts_chunk_df["valid_bpe_chunk"] = rcts_chunk_df["n_bpe_tokens"].apply(check_chunks)

rcts_chunk_df_meta = get_metadata_df(rcts_chunk_df)
display(rcts_chunk_df)

  0%|          | 0/13896 [00:00<?, ?it/s]

  0%|          | 0/13896 [00:00<?, ?it/s]

Unnamed: 0,content,uni_token_content,bpe_token_content,n_uni_tokens,n_bpe_tokens,n_chars,n_lines,valid_uni_chunk,valid_bpe_chunk
0,"'O, that infected moisture of his eye,\n O,...","[44, 13, 792, 9, 29, 4987, 6305, 779, 20, 37, ...","[9, 7953, 7952, 7938, 82, 4994, 76, 1065, 408,...",95,101,315,7,False,False
1,'Thus merely with the garment of a Grace \n...,"[44, 13, 5722, 4696, 35, 15, 5163, 20, 21, 472...","[9, 7953, 2971, 3439, 121, 91, 22, 2011, 517, ...",94,97,315,7,False,False
2,'That not a heart which in his level came\n ...,"[44, 13, 6379, 30, 21, 157, 128, 26, 37, 3626,...","[9, 7953, 222, 77, 13, 378, 359, 61, 104, 7129...",95,98,331,7,False,False
3,"'In him a plenitude of subtle matter,\n App...","[44, 13, 455, 115, 43, 21, 684, 159, 226, 408,...","[9, 7953, 470, 112, 13, 325, 39, 3988, 48, 614...",100,96,316,7,False,False
4,"'For lo, his passion, but an art of craft,\n...","[44, 13, 4020, 1420, 9, 37, 1201, 9, 53, 114, ...","[9, 7953, 338, 165, 7938, 104, 2627, 7938, 170...",91,95,316,7,False,False
...,...,...,...,...,...,...,...,...,...
13891,"4\n Unthrifty loveliness why dost thou spend,...","[2714, 4, 12, 1046, 256, 102, 1275, 1404, 84, ...","[7993, 4, 9, 5070, 7928, 3843, 7937, 252, 7933...",176,183,619,15,True,True
13892,3\n Look in thy glass and tell the face thou ...,"[678, 4, 12, 552, 26, 54, 1885, 17, 175, 15, 3...","[7991, 4, 9, 2377, 61, 163, 3612, 47, 465, 22,...",174,183,643,15,True,True
13893,2\n When forty winters shall besiege thy brow...,"[695, 4, 12, 187, 4144, 1738, 14, 61, 6423, 54...","[7990, 4, 9, 656, 7126, 1618, 3323, 189, 1076,...",185,188,662,15,True,True
13894,1\n From fairest creatures we desire increase...,"[615, 4, 12, 318, 598, 2968, 3404, 67, 676, 36...","[7985, 4, 9, 1251, 5259, 5836, 92, 1570, 6677,...",179,192,643,15,True,True


  0%|          | 0/7699 [00:00<?, ?it/s]

  0%|          | 0/7699 [00:00<?, ?it/s]

Unnamed: 0,content,uni_token_content,bpe_token_content,n_uni_tokens,n_bpe_tokens,n_chars,n_lines,valid_uni_chunk,valid_bpe_chunk
0,1\n From fairest creatures we desire increase...,"[615, 4, 12, 318, 598, 2968, 3404, 67, 676, 36...","[7985, 4, 9, 1251, 5259, 5836, 92, 1570, 6677,...",179,192,643,15,True,True
1,2\n When forty winters shall besiege thy brow...,"[695, 4, 12, 187, 4144, 1738, 14, 61, 6423, 54...","[7990, 4, 9, 656, 7126, 1618, 3323, 189, 1076,...",185,188,662,15,True,True
2,3\n Look in thy glass and tell the face thou ...,"[678, 4, 12, 552, 26, 54, 1885, 17, 175, 15, 3...","[7991, 4, 9, 2377, 61, 163, 3612, 47, 465, 22,...",174,183,643,15,True,True
3,"4\n Unthrifty loveliness why dost thou spend,...","[2714, 4, 12, 1046, 256, 102, 1275, 1404, 84, ...","[7993, 4, 9, 5070, 7928, 3843, 7937, 252, 7933...",176,183,619,15,True,True
4,5\n Those hours that with gentle work did fra...,"[3284, 4, 12, 1694, 1160, 29, 35, 439, 709, 13...","[7994, 4, 9, 4430, 2303, 82, 91, 523, 1551, 33...",169,176,652,15,True,True
...,...,...,...,...,...,...,...,...,...
7694,"'""Lo, this device was sent me from a nun,\n O...","[13, 967, 220, 263, 9, 40, 2604, 88, 599, 31, ...","[7953, 7988, 4864, 7938, 116, 5184, 256, 1157,...",277,282,944,23,True,True
7695,"'""How mighty then you are, O hear me tell!\n ...","[13, 967, 272, 1092, 1337, 121, 23, 62, 9, 77,...","[7953, 7988, 796, 2696, 326, 43, 193, 7938, 12...",309,298,983,23,True,True
7696,"'""Now all these hearts that do on mine depend,...","[13, 967, 3343, 245, 56, 154, 1013, 29, 55, 64...","[7953, 7988, 1178, 173, 417, 1938, 82, 138, 12...",285,283,977,23,True,True
7697,"'For lo, his passion, but an art of craft,\n ...","[13, 4020, 1420, 9, 37, 1201, 9, 53, 114, 188,...","[7953, 338, 165, 7938, 104, 2627, 7938, 170, 1...",289,292,965,23,True,True


<br>

**DATASET VERSIONS**
* v1:
    * No filtering, no chunks removed (the above dataframe)
* v2
    * Split into individual datasets for bpe, unigram within each chunking technique (4 datasets total)
    * Rename columns to not specify tokenization method to allow for more generalization across interaction
* v3
    * Drop small chunks
    * Drop really big chunks
* v4
    * Evenly pad/truncated tokenized sequences up to reasonable length (close to max length --> 90th percentile?)


In [238]:
def pad_truncate_centered(tokenized_str, fixed_length=384, pad_value=0):
    """
    Pad or truncate the tokenized strings such that they have a 
    fixed length and are centered around the middle of the string.

    Args:
        tokenized_str (list): List of integers representing tokenized string.
        fixed_length (int, optional): The desired fixed length for the output list. Defaults to 384.
        pad_value (int, optional): The value to use for padding. Defaults to 0.

    Returns:
        list: Padded or truncated list of integers with the specified fixed length.
    """
    n_tokens = len(tokenized_str)

    # If the number of tokens is less than the fixed length, pad the tokenized string
    if n_tokens < fixed_length:
        n_pad = fixed_length - n_tokens
        n_left_pad = n_pad // 2
        n_right_pad = n_pad - n_left_pad
        return [pad_value] * n_left_pad + tokenized_str + [pad_value] * n_right_pad

    # If the number of tokens is greater than or equal to the fixed length, truncate the tokenized string
    else:
        n_remove = n_tokens - fixed_length
        n_left_remove = n_remove // 2
        n_right_remove = n_remove - n_left_remove
        return tokenized_str[n_left_remove:n_tokens - n_right_remove]
    

def drop_str_from_col_names(df, s_to_drop):
    df.columns = [_c.replace(s_to_drop, "").replace("__", "_").strip("_") for _c in df.columns]
    return df


def save_ds_version(df, output_suffix, version_str="v1", meta_df=None, _meta_dir=META_DIR, _ds_dir=DATASET_DIR): 
    ds_csv_path = os.path.join(_ds_dir, version_str)
    df.to_csv(ds_csv_path+f"_{output_suffix.strip('_')}.csv", index=False)
    
    if meta_df is None:
        meta_df = get_metadata_df(df)
    
    ds_meta_path = os.path.join(_meta_dir, version_str)
    meta_df.to_csv(ds_meta_path+f"_{output_suffix.strip('_')}.csv", index=False)

    
DATASET_DIR = os.path.join(DATA_PATH, "datasets")
if not os.path.isdir(DATASET_DIR): os.makedirs(DATASET_DIR, exist_ok=True)

META_DIR = os.path.join(DATASET_DIR, "meta")
if not os.path.isdir(META_DIR): os.makedirs(META_DIR, exist_ok=True)
    
    
######################################## v1 ########################################
# Save the previously created datasets along with the manually created metadata
####################################################################################
save_ds_version(rcts_chunk_df, "rcts", "v1", meta_df=rcts_chunk_df_meta)
save_ds_version(basic_chunk_df, "basic", "v1", meta_df=basic_chunk_df_meta)
####################################################################################

######################################## v2 ########################################
# Split bpe and unigram into their own dataframes (meta is generated automatically)
####################################################################################
rcts_uni_chunk_df = rcts_chunk_df.copy().drop(columns=[_c for _c in rcts_chunk_df.columns if "bpe" in _c])
basic_uni_chunk_df = basic_chunk_df.copy().drop(columns=[_c for _c in basic_chunk_df.columns if "bpe" in _c])
rcts_bpe_chunk_df = rcts_chunk_df.copy().drop(columns=[_c for _c in rcts_chunk_df.columns if "uni" in _c])
basic_bpe_chunk_df = basic_chunk_df.copy().drop(columns=[_c for _c in basic_chunk_df.columns if "uni" in _c])

# Rename columns
rcts_uni_chunk_df = drop_str_from_col_names(rcts_uni_chunk_df, "uni")
rcts_bpe_chunk_df = drop_str_from_col_names(rcts_bpe_chunk_df, "bpe")
basic_uni_chunk_df = drop_str_from_col_names(basic_uni_chunk_df, "uni")
basic_bpe_chunk_df = drop_str_from_col_names(basic_bpe_chunk_df, "bpe")

save_ds_version(rcts_uni_chunk_df, "rcts_uni", "v2")
save_ds_version(rcts_bpe_chunk_df, "rcts_bpe", "v2")
save_ds_version(basic_uni_chunk_df, "basic_uni", "v2")
save_ds_version(basic_bpe_chunk_df, "basic_bpe", "v2")
####################################################################################

######################################## v3 ########################################
# Filtering (big and little get dropped)
####################################################################################

# Filter and drop valid chunk col
for _df in [rcts_uni_chunk_df, rcts_bpe_chunk_df, basic_uni_chunk_df, basic_bpe_chunk_df]:
    _df = _df[_df.valid_chunk].drop(columns=["valid_chunk"]).reset_index(drop=True)

# Save
save_ds_version(rcts_uni_chunk_df, "rcts_uni", "v3")
save_ds_version(rcts_bpe_chunk_df, "rcts_bpe", "v3")
save_ds_version(basic_uni_chunk_df, "basic_uni", "v3")
save_ds_version(basic_bpe_chunk_df, "basic_bpe", "v3")
####################################################################################

# ######################################## v4 ########################################
# # Padding and truncation --> basic upper limit of 384 (assuming context lengths of 64-128)
# ####################################################################################
FIXED_CHUNK_SIZE = 384

for _df in [rcts_uni_chunk_df, rcts_bpe_chunk_df, basic_uni_chunk_df, basic_bpe_chunk_df]:
    _df["token_content"] = _df["token_content"].apply(lambda x: pad_truncate_centered(x, FIXED_CHUNK_SIZE))

save_ds_version(rcts_uni_chunk_df, "rcts_uni", "v4")
save_ds_version(rcts_bpe_chunk_df, "rcts_bpe", "v4")
save_ds_version(basic_uni_chunk_df, "basic_uni", "v4")
save_ds_version(basic_bpe_chunk_df, "basic_bpe", "v4")
# ####################################################################################

In [236]:
rcts_bpe_chunk_df

Unnamed: 0,content,token_content,n_tokens,n_chars,n_lines,valid_chunk
0,1\n From fairest creatures we desire increase...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",192,643,15,True
1,2\n When forty winters shall besiege thy brow...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",188,662,15,True
2,3\n Look in thy glass and tell the face thou ...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",183,643,15,True
3,"4\n Unthrifty loveliness why dost thou spend,...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",183,619,15,True
4,5\n Those hours that with gentle work did fra...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",176,652,15,True
...,...,...,...,...,...,...
7694,"'""Lo, this device was sent me from a nun,\n O...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",282,944,23,True
7695,"'""How mighty then you are, O hear me tell!\n ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",298,983,23,True
7696,"'""Now all these hearts that do on mine depend,...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",283,977,23,True
7697,"'For lo, his passion, but an art of craft,\n ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",292,965,23,True


<br>

<div align="center">

## 💾 Converting to TFRecords <a name="converting-to-tfrecords"></a>

</div>

<br>

Finally, after completing the preprocessing steps and EDA, we'll convert the toy dataset into the `TFRecords` format. This efficient binary format is designed for use with TensorFlow and will enable seamless integration with your Language Model training pipeline.



In [256]:
def _bytes_feature(value, is_list=False):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
    
    if not is_list:
        value = [value]
    
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))

def _float_feature(value, is_list=False):
    """Returns a float_list from a float / double."""
        
    if not is_list:
        value = [value]
        
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

def _int64_feature(value, is_list=False):
    """Returns an int64_list from a bool / enum / int / uint."""
        
    if not is_list:
        value = [value]
        
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

    
def serialize_raw(token_ids):
    """
    Creates a tf.Example message ready to be written to a file from N features.

    Args:
        token_ids (list of ints): A list of integers representing the tokens for each string
    
    Returns:
        A tf.Example Message ready to be written to file
    """
    
    # Create a dictionary mapping the feature name to the 
    # tf.Example-compatible data type.
    feature = {
        "token_content": _int64_feature(token_ids, is_list=True),
    }
    
    # Create a Features message using tf.train.Example.
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()



def write_tfrecords(ds, n_ex, output_suffix, version_str, n_ex_per_rec=10_000, serialize_fn=serialize_raw, out_dir="./tfrecords"):
    """"""
    n_recs = int(np.ceil(n_ex/n_ex_per_rec))
    
    # Make dataset generator iterable
    ds = iter(ds)
        
    # Dataset directory
    if not os.path.isdir(out_dir): os.makedirs(out_dir, exist_ok=True)
    out_dir = os.path.join(out_dir, f"{output_suffix.strip('_')}_{version_str}")
    if not os.path.isdir(out_dir): os.makedirs(out_dir, exist_ok=True)
        
    # Create tfrecords
    for i in tqdm(range(n_recs), total=n_recs):
        print(f"\n... Writing TFRecord {i+1} of {n_recs} ({n_ex_per_rec} per TFRecord)...\n")
        tfrec_path = os.path.join(out_dir, f"{(i+1):02}_{n_recs:02}.tfrec")
        
        # This makes the tfrecord
        with tf.io.TFRecordWriter(tfrec_path) as writer:
            for ex in tqdm(range(n_ex_per_rec), total=n_ex_per_rec):
                try:
                    example = serialize_fn(next(ds))
                    writer.write(example)
                except:
                    break
                    
                    

TFRECORD_DIR = os.path.join(DATASET_DIR, "tfrecords")
N_PER = 1000 # artificially low to replicate tfrecord amounts expected


for _df in [rcts_uni_chunk_df, rcts_bpe_chunk_df, basic_uni_chunk_df, basic_bpe_chunk_df]:
    write_tfrecords(
        rcts_uni_chunk_df['token_content'], len(rcts_uni_chunk_df), "rcts_uni", "v4", 
        out_dir=TFRECORD_DIR, n_ex_per_rec=N_PER
    )
    write_tfrecords(
        rcts_bpe_chunk_df['token_content'], len(rcts_bpe_chunk_df), "rcts_bpe", "v4", 
        out_dir=TFRECORD_DIR, n_ex_per_rec=N_PER
    )
    write_tfrecords(
        basic_uni_chunk_df['token_content'], len(basic_uni_chunk_df), "basic_uni", "v4", 
        out_dir=TFRECORD_DIR, n_ex_per_rec=N_PER
    )
    write_tfrecords(
        basic_bpe_chunk_df['token_content'], len(basic_bpe_chunk_df), "basic_bpe", "v4", 
        out_dir=TFRECORD_DIR, n_ex_per_rec=N_PER
    )

  0%|          | 0/8 [00:00<?, ?it/s]


... Writing TFRecord 1 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 2 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 3 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 4 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 5 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 6 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 7 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 8 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/8 [00:00<?, ?it/s]


... Writing TFRecord 1 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 2 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 3 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 4 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 5 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 6 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 7 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 8 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/14 [00:00<?, ?it/s]


... Writing TFRecord 1 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 2 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 3 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 4 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 5 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 6 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 7 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 8 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 9 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 10 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 11 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 12 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 13 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 14 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/14 [00:00<?, ?it/s]


... Writing TFRecord 1 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 2 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 3 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 4 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 5 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 6 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 7 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 8 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 9 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 10 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 11 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 12 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 13 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 14 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/8 [00:00<?, ?it/s]


... Writing TFRecord 1 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 2 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 3 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 4 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 5 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 6 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 7 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 8 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/8 [00:00<?, ?it/s]


... Writing TFRecord 1 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 2 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 3 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 4 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 5 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 6 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 7 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 8 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/14 [00:00<?, ?it/s]


... Writing TFRecord 1 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 2 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 3 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 4 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 5 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 6 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 7 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 8 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 9 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 10 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 11 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 12 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 13 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 14 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/14 [00:00<?, ?it/s]


... Writing TFRecord 1 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 2 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 3 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 4 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 5 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 6 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 7 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 8 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 9 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 10 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 11 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 12 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 13 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 14 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/8 [00:00<?, ?it/s]


... Writing TFRecord 1 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 2 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 3 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 4 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 5 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 6 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 7 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 8 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/8 [00:00<?, ?it/s]


... Writing TFRecord 1 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 2 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 3 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 4 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 5 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 6 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 7 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 8 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/14 [00:00<?, ?it/s]


... Writing TFRecord 1 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 2 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 3 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 4 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 5 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 6 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 7 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 8 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 9 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 10 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 11 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 12 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 13 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 14 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/14 [00:00<?, ?it/s]


... Writing TFRecord 1 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 2 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 3 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 4 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 5 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 6 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 7 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 8 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 9 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 10 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 11 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 12 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 13 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 14 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/8 [00:00<?, ?it/s]


... Writing TFRecord 1 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 2 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 3 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 4 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 5 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 6 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 7 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 8 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/8 [00:00<?, ?it/s]


... Writing TFRecord 1 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 2 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 3 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 4 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 5 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 6 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 7 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 8 of 8 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/14 [00:00<?, ?it/s]


... Writing TFRecord 1 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 2 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 3 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 4 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 5 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 6 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 7 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 8 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 9 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 10 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 11 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 12 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 13 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 14 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/14 [00:00<?, ?it/s]


... Writing TFRecord 1 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 2 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 3 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 4 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 5 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 6 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 7 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 8 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 9 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 10 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 11 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 12 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 13 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]


... Writing TFRecord 14 of 14 (1000 per TFRecord)...



  0%|          | 0/1000 [00:00<?, ?it/s]

<br>

**Check dataset**

In [270]:
def parse_example(example_proto, max_seq_len=384, default_value=0):
    feature_map = {
        'token_content': tf.io.FixedLenFeature(
            shape=[max_seq_len], dtype=tf.int64, default_value=[default_value]*max_seq_len
        ) # tf.io.VarLenFeature(tf.int64),
    }
    features = tf.io.parse_single_example(example_proto, features=feature_map)['token_content']
    
    ### HOW FOR VARIABLE
    # tf.io.VarLenFeature(tf.int64),
    #parsed_example = tf.io.parse_single_example(example_proto, feature_map)
    #features = tf.sparse.to_dense(parsed_example['token_content'])
    
    return features

def load_tfrecord_dataset(input_files):
    raw_dataset = tf.data.TFRecordDataset(input_files)
    parsed_dataset = raw_dataset.map(parse_example)
    return parsed_dataset

TFREC_PATHS = sorted(glob(os.path.join(TFRECORD_DIR, "rcts_bpe_v4", "*.tfrec")))
rcts_bpe_v4_ds = load_tfrecord_dataset(TFREC_PATHS)

In [284]:
plot_tokenization(bpe_decoder(next(iter(rcts_bpe_v4_ds)).numpy().tolist()), bpe_encoder, bpe_decoder)
plot_tokenization(bpe_decoder(rcts_bpe_chunk_df["token_content"][0]), bpe_encoder, bpe_decoder)

In [276]:
next(iter(rcts_bpe_v4_ds)).numpy().shape

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 7985,
 4,
 9,
 1251,
 5259,
 5836,
 92,
 1570,
 6677,
 7938,
 4,
 9,
 222,
 7377,
 1509,
 7953,
 7930,
 3811,
 667,
 498,
 850,
 7938,
 4,
 9,
 302,
 127,
 22,
 531,
 586,
 337,
 199,
 446,
 4148,
 516,
 7938,
 4,
 9,
 1008,
 2017,
 2356,
 667,
 629,
 104,
 4160,
 7981,
 4,
 9,
 302,
 110,
 664,
 109,
 2798,
 45,
 887,
 548,
 3335,
 613,
 7938,
 4,
 9,
 7967,
 1271,
 7953,
 53,
 163,
 1093,
 7953,
 7930,
 7315,
 91,
 1545,
 7972,
 7930,
 1575,
 5679,
 1559,
 28,
 7935,
 278,
 7938,
 4,
 9,
 6759,
 13,
 2548,
 209,
 466,
 436,
 1186,
 342,
 1493,
 7938,
 4,
 9,
 1255,
 1545,
 163,
 3848,
 7938,
 45,
 163,
 55