In [None]:
import pandas as pd
import re
from transformers import RobertaTokenizer
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders, processors

data = pd.read_csv("labeled_dataset.csv")
data.drop('Unnamed: 0', axis=1, inplace=True)
# data

# print(data.describe(include="all").T)

for log_entry in data['log_entry']:
    # Perform operations on log_entry here; avoid printing every entry, which can freeze VS Code
    # print(log_entry)
    pass

# data.head()

In [None]:
#### Step 1: Prep Work

# Extract the raw log text from the JSON content in the DataFrame.
# TODO: the goal is a list of all corpus text in the JSONs; duplicates are kept on purpose.
# Whether to remove special characters is still to be decided.
def extract_text_from_df(df):
    return df['log_entry'].tolist()


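# If necessary, extend the pattern with |/ to strip slashes as well
def tokenize_text(corpus, remove_special_chars=False):
    domain_corpus = []
    for text in corpus:
        if remove_special_chars:
            # Strip runs of two or more '=', pipe characters, and single quotes.
            # Alternation is used instead of a [...] character class, which would
            # also strip '(', ')', '+', and single '=' characters.
            text = re.sub(r"==+|\||'", '', text)
        # Split on whitespace; duplicates are kept
        domain_corpus.extend(text.split())
    return domain_corpus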

corpus = extract_text_from_df(data)
# print(len(corpus))
# print(corpus)
domain_corpus = tokenize_text(corpus)
#print(domain_corpus)
print(len(domain_corpus))

In [None]:
#### Step 2: Extract Tokens using BPE

# Load the original RoBERTa tokenizer
roberta_tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Get the original vocabulary
roberta_vocab = roberta_tokenizer.get_vocab()
print(f"ROBERTA Tokens: {len(roberta_vocab.keys())}")

# Initialize a BPE tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

# Adjust the vocabulary size to match RoBERTa's
trainer = trainers.BpeTrainer(
    vocab_size=50265,
    min_frequency=1,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.train_from_iterator(domain_corpus, trainer)

# Get the new vocabulary from the trained tokenizer
new_vocab = tokenizer.get_vocab()
print(f"New Tokens: {len(new_vocab.keys())}")

In [None]:
#### Step 3: Merge Token List

# Find mutual and distinct tokens
mutual_tokens = set(roberta_vocab.keys()).intersection(set(new_vocab.keys()))
distinct_tokens = set(new_vocab.keys()).difference(set(roberta_vocab.keys()))

# Ensure the size of the new vocabulary matches the original
assert len(mutual_tokens) + len(distinct_tokens) == 50265, "Vocabulary size mismatch"

# Custom tokenizer (from https://arxiv.org/abs/2204.02685)

<font size="4">
For building the tokenizer, we employ a byte pair encoding (BPE) method to build a vocabulary of words and subwords from the cybersecurity corpora, as it has been shown to perform better than word-based tokenizers. The character-based encoding used in BPE allows for the learning of a small subword vocabulary that can encode any input text without introducing any "unknown" tokens. Our objective is to create a vocabulary that retains the tokens already provided in RoBERTa’s tokenizer while also incorporating additional unique cybersecurity-related tokens. In this context, we extract 50,265 tokens from the cybersecurity corpora to generate the initial token vocabulary Ψ<sub>Sec</sub>. We intentionally make the size of Ψ<sub>Sec</sub> the same as that of RoBERTa’s token vocabulary Ψ<sub>RoBERTa</sub>, as we intended to imitate the original RoBERTa design.
If Ψ<sub>Sec</sub> represents the vocabulary set of SecureBERT and Ψ<sub>RoBERTa</sub> denotes the vocabulary set of the original RoBERTa, both of size 50,265, then Ψ<sub>Sec</sub> shares 32,592 mutual tokens with Ψ<sub>RoBERTa</sub>, leaving 17,673 tokens that contribute uniquely to the cybersecurity corpus, such as *firewall, breach, crack, ransomware, malware, phishing, mysql, kaspersky, obfuscated, and vulnerability*, which RoBERTa’s tokenizer breaks down into byte pairs:
<br>
<br>
V<sub>mutual</sub> = Ψ<sub>Sec</sub> ∩ Ψ<sub>RoBERTa</sub> → 32,592 tokens <br>
V<sub>distinct</sub> = Ψ<sub>Sec</sub> − Ψ<sub>RoBERTa</sub> → 17,673 tokens<br>
<br>
Studies show that using complete words (not subwords) for terms that are common in a specific domain can enhance performance during training, since alignments may be more challenging to learn when target tokens require attention from multiple source tokens. Hence, we keep all mutual terms and assign them their original indices, that is, their indices in RoBERTa’s tokenizer, while the remaining new tokens are assigned random indices with no conflict. Ultimately, we develop a customized tokenizer with a vocabulary size similar to that of the original model, which includes tokens commonly seen in cybersecurity corpora in addition to cross-domain tokens. Our tokenizer encodes mutual tokens (V<sub>mutual</sub>) as the original model does, ensuring that the model returns the appropriate pre-trained weights, while for new terms (V<sub>distinct</sub>) the indices, and accordingly the weights, would be random.

</font>
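
The set algebra in the quoted passage can be illustrated on a toy scale. The sketch below is not from the paper: `roberta_toy` and `sec_toy` are made-up six-token vocabularies standing in for the two 50,265-token vocabularies. It only shows that the mutual and distinct sets partition the domain vocabulary, which is why 32,592 + 17,673 = 50,265 in the quoted figures.

```python
# Hypothetical toy vocabularies (token -> index); the real ones have 50,265 entries each.
roberta_toy = {"<s>": 0, "the": 1, "fire": 2, "wall": 3, "mal": 4, "ware": 5}
sec_toy = {"<s>": 0, "the": 1, "firewall": 2, "malware": 3, "fire": 4, "phishing": 5}

# Tokens present in both vocabularies
v_mutual = set(sec_toy) & set(roberta_toy)
# Tokens that only the domain vocabulary contains
v_distinct = set(sec_toy) - set(roberta_toy)

# The two sets are disjoint and together cover the domain vocabulary
assert len(v_mutual) + len(v_distinct) == len(sec_toy)

print(sorted(v_mutual))
print(sorted(v_distinct))
```

Here *firewall* and *malware* end up in the distinct set because the toy base vocabulary only knows their pieces (*fire*, *wall*, *mal*, *ware*).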

## GENEXT ANSWER

To implement a custom tokenizer using Byte Pair Encoding (BPE) that incorporates both the original RoBERTa tokens and additional domain-specific tokens, follow these steps:

1. **Prepare the Corpus**: Gather your domain-specific corpus.
2. **Extract Tokens using BPE**: Use a BPE tokenizer to extract tokens from your corpus.
3. **Merge Token Lists**: Combine the original RoBERTa tokens with the new domain-specific tokens.
4. **Assign Indices**: Ensure mutual tokens retain their original indices, and assign new indices to the additional tokens.
5. **Create the Custom Tokenizer**: Build the tokenizer using the merged vocabulary.

### Step-by-Step Implementation

#### Step 1: Prepare the Corpus
Ensure you have your domain-specific text corpus ready for tokenization.

```python
# Example: Load your domain-specific corpus
domain_corpus = ["Your domain-specific text data goes here."]
```

#### Step 2: Extract Tokens using BPE
Use the `tokenizers` library to perform BPE tokenization.

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders, processors

# Initialize a BPE tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

# Train the tokenizer on your domain-specific corpus
trainer = trainers.BpeTrainer(vocab_size=50265, special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
tokenizer.train_from_iterator(domain_corpus, trainer)
```
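
To make concrete what BPE training does, here is a minimal pure-Python sketch of the merge-learning loop. It is an illustration only, not the `tokenizers` implementation: it works on characters rather than bytes, skips pre-tokenization and special tokens, and its tie-breaking rule is an arbitrary choice made for determinism. The function name `learn_bpe_merges` and the toy word list are hypothetical.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Represent each distinct word as a tuple of symbols, weighted by its frequency
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken lexicographically (an assumption)
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

toy_merges = learn_bpe_merges(["malware", "malware", "firewall", "firmware"], num_merges=3)
print(toy_merges)
```

On this toy input the frequent fragment "ware" is assembled within three merges, which is the mechanism by which domain terms that recur in a corpus become single tokens.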

#### Step 3: Merge Token Lists
Load the original RoBERTa tokenizer and merge its tokens with the new ones.

```python
from transformers import RobertaTokenizer

# Load the original RoBERTa tokenizer
roberta_tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Get the original vocabulary
roberta_vocab = roberta_tokenizer.get_vocab()

# Get the new vocabulary from the trained tokenizer
new_vocab = tokenizer.get_vocab()

# Find mutual and distinct tokens
mutual_tokens = set(roberta_vocab.keys()).intersection(set(new_vocab.keys()))
distinct_tokens = set(new_vocab.keys()).difference(set(roberta_vocab.keys()))

# Ensure the size of the new vocabulary matches the original
assert len(mutual_tokens) + len(distinct_tokens) == 50265
```

#### Step 4: Assign Indices
Assign indices to the mutual and distinct tokens.

```python
# Create a new vocabulary dictionary
custom_vocab = {}

# Assign original indices to mutual tokens
for token in mutual_tokens:
    custom_vocab[token] = roberta_vocab[token]

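# Assign the indices left unused by the mutual tokens to the distinct tokens,
# so every id stays inside RoBERTa's original range [0, len(roberta_vocab)).
# Tokens are sorted only to make the assignment reproducible; the pairing is arbitrary.
used_indices = set(custom_vocab.values())
free_indices = [i for i in range(len(roberta_vocab)) if i not in used_indices]
for token, index in zip(sorted(distinct_tokens), free_indices):
    custom_vocab[token] = index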
```
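
Before handing a merged vocabulary to a tokenizer, a quick consistency check can catch id collisions or shifted indices. The helper below is a sketch, not part of any library: `check_index_assignment` and the toy vocabularies are hypothetical names, and the function only inspects plain token-to-id dictionaries such as `custom_vocab` and `roberta_vocab`.

```python
def check_index_assignment(custom_vocab, base_vocab):
    """Return basic consistency facts about a merged token -> id mapping."""
    ids = list(custom_vocab.values())
    shared = set(custom_vocab) & set(base_vocab)
    return {
        # No two tokens may share an id
        "ids_unique": len(ids) == len(set(ids)),
        # Tokens known to the base vocabulary must keep their base id
        "shared_ids_preserved": all(custom_vocab[t] == base_vocab[t] for t in shared),
        # Largest id in use; an embedding matrix needs at least max_id + 1 rows
        "max_id": max(ids),
    }

# Hypothetical toy vocabularies for illustration
base_toy = {"<s>": 0, "the": 1, "fire": 2, "wall": 3}
custom_toy = {"<s>": 0, "the": 1, "firewall": 2, "malware": 3}

report = check_index_assignment(custom_toy, base_toy)
print(report)
```

Comparing `max_id + 1` with the number of rows in the pre-trained embedding matrix shows whether the model's embeddings would need to be resized for the new vocabulary.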

#### Step 5: Create the Custom Tokenizer
Build the custom tokenizer using the merged vocabulary.

```python
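import json

# A BPE model needs merge rules as well as a vocabulary, so reuse the merges
# learned in Step 2, read from the trained tokenizer's serialized state.
# Depending on the library version, merges are stored as "a b" strings or [a, b] pairs.
state = json.loads(tokenizer.to_str())
merges = [tuple(m.split(" ")) if isinstance(m, str) else tuple(m)
          for m in state["model"]["merges"]]

# Create a new tokenizer with the custom vocabulary
tokenizer = Tokenizer(models.BPE(vocab=custom_vocab, merges=merges))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# Register the special tokens so they are never split; they keep their ids from custom_vocab
tokenizer.add_special_tokens(["<s>", "<pad>", "</s>", "<unk>", "<mask>"])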

# Save the tokenizer
tokenizer.save("custom_tokenizer.json")

# Load and use the custom tokenizer
custom_tokenizer = Tokenizer.from_file("custom_tokenizer.json")

# Example usage
encoded = custom_tokenizer.encode("Example text to tokenize.")
print(encoded.tokens)
```

### Explanation

1. **Prepare the Corpus**: Load your domain-specific text data.
2. **Extract Tokens using BPE**: Train a BPE tokenizer on your corpus to extract tokens.
3. **Merge Token Lists**: Combine the RoBERTa tokens with the new tokens, ensuring mutual tokens are preserved.
4. **Assign Indices**: Assign indices to mutual tokens as per the original RoBERTa tokenizer and assign new indices to distinct tokens.
5. **Create the Custom Tokenizer**: Build and save the custom tokenizer with the merged vocabulary.

This approach ensures that your custom tokenizer can handle both general and domain-specific tokens effectively, leveraging the pre-trained weights of RoBERTa for mutual tokens and incorporating new tokens for domain-specific terms.

In [None]:
# Example usage to encode text

text = "'/var/cache/bazel/execroot/ddad/application/foresight/activities/foresight_road_data/road_data_links_builder/src/map_api_links_converters.cpp', 'INFO:', '[pool-5-thread-3]', '/var/cache/bazel/execroot/ddad/application/foresight/activities/foresight_road_data/road_data_links_builder/src/map_api_links_utility.cpp', 'INFO:', '[pool-5-thread-12]', '/var/cache/bazel/execroot/ddad/application/foresight/activities/foresight_road_data/road_data_links_builder/src/map_api_links_warning_sign_converter.cpp', 'INFO:', '[pool-5-thread-15]', '/var/cache/bazel/execroot/ddad/application/foresight/activities/foresight_road_data/road_data_links_builder/src/map_api_node_data_list.cpp', 'INFO:', '[pool-5-thread-7]', '/var/cache/bazel/execroot/ddad/application/foresight/activities/foresight_road_data/road_data_links_builder/src/map_api_node_data_list_builder.cpp', 'INFO:', '[pool-5-thread-14]', '/var/cache/bazel/execroot/ddad/application/foresight/activities/foresight_road_data/road_data_links_builder/src/map_api_operator_equality.cpp', 'INFO:', '[pool-5-thread-5]', '/var/cache/bazel/execroot/ddad/application/foresight/activities/foresight_road_data/road_data_links_builder/src/pool/multiple_element_pool_hash_map.cpp', 'INFO:', '[pool-5-thread-8]', '/var/cache/bazel/execroot/ddad/application/foresight/activities/foresight_road_data/road_data_links_builder/src/pool/multiple_element_pool_vector.cpp', 'INFO:', '[pool-5-thread-5]', '/var/cache/bazel/execroot/ddad/application/foresight/activities/test/multi_component/most_probable_path_multi_component_test.cpp', 'INFO:', '[pool-5-thread-14]', "

encoded_text = roberta_tokenizer.encode(text)
print(f"Encoded text: {encoded_text}")

In [None]:
# Overview of the unique characters and per-character counts in the corpus words (pre-regex)
from collections import Counter

# Assumption: `words` is the whitespace-split corpus built in Step 1
words = domain_corpus

def word_count(words):
    """Return a dict mapping each distinct word to its number of occurrences."""
    # Counter makes a single pass instead of calling list.count() once per unique word
    return dict(Counter(words))

# print(word_count(words))

# Count every character across all words; list.count(char) would only count
# words that consist of exactly that one character
char_counts = Counter("".join(words))
uniques = set(char_counts)
print(uniques)

# for char, amount in char_counts.most_common():
#     print(f"Amount of {char} in words: {amount}")

In [None]:
# Set the post-processing to add the special tokens around each input, adapted from the BERT
# template in the quicktour (https://huggingface.co/docs/tokenizers/quicktour).
# The tokenizer above was trained with <s> and </s> rather than [CLS] and [SEP],
# so token_to_id("[CLS]") would return None; the RoBERTa-style template is used instead.

from tokenizers.processors import TemplateProcessing
tokenizer.post_processor = TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> </s> $B </s>",
    special_tokens=[
        ("<s>", tokenizer.token_to_id("<s>")),
        ("</s>", tokenizer.token_to_id("</s>")),
    ],
)

I have the following script to download JSON files from an AWS bucket. I want to change it so that it:
1. checks all objects in the bucket
2. chooses a predefined number (e.g. 20) of them at random
3. saves them in a directory called "logs" using the same names as in the bucket
4. if an object with the same name is already in the directory, chooses another one at random until 20 new ones are downloaded