# 🛠️ Advanced Tokenization with SentencePiece in NLP

## 🙏 Acknowledgement to Inspirational Work
A special thanks to the insightful notebook by [datafan07](https://www.kaggle.com/code/datafan07/train-your-own-tokenizer). This notebook served as a foundational inspiration for my exploration into the world of tokenizers and their impact on model performance.

## 🌟 Emphasizing SentencePiece: A Google Innovation
In this project, I delve deep into the capabilities of SentencePiece, a powerful tokenizer developed by Google. SentencePiece is unique in its approach to tokenization, which is crucial for Neural Network-based text generation systems.

### Why SentencePiece? 🤖
- **Model-Agnostic and Data-Driven**: It's designed to be model-agnostic and purely data-driven, training directly from raw sentences without the need for pre-tokenization. This makes SentencePiece incredibly flexible and adaptable.
- **Language Independence**: It treats text as a sequence of Unicode characters, without any language-specific logic, making it universally applicable.
- **Subword Algorithms Support**: SentencePiece supports BPE (Byte-Pair Encoding) and unigram language models, providing robust options for text segmentation.
- **Subword Regularization and BPE-Dropout**: These features in SentencePiece enhance the robustness and accuracy of NMT (Neural Machine Translation) models by introducing variability in the tokenization process.
- **Efficiency and Self-Containment**: It's fast, lightweight, and ensures consistency in tokenization/detokenization as long as the same model file is used.
- **NFKC-Based Normalization**: This form of normalization ensures text consistency before tokenization.
- **Direct Vocabulary ID Generation**: SentencePiece can generate vocabulary ID sequences directly from raw sentences, which is a significant advantage in processing efficiency.
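
To give a feel for the BPE option listed above, here is a minimal pure-Python sketch of how BPE merges are learned. It is a toy illustration, not SentencePiece's implementation: the helper `learn_bpe_merges` and the tiny word list are assumptions made up for this example.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # The most frequent pair becomes the next merge rule.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the pair with the merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

# Hypothetical toy corpus: word frequencies are encoded by repetition.
toy_words = ["hug"] * 5 + ["pug"] * 3 + ["hugs"] * 2
merges, segmented = learn_bpe_merges(toy_words, num_merges=2)
print(merges)
print(dict(segmented))
```

Frequent character sequences end up as single vocabulary entries, while rarer words stay split into smaller pieces.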

### 📚 Technical Highlights
SentencePiece stands out from other implementations like `subword-nmt` and WordPiece in several ways, such as its support for multiple subword algorithms, subword regularization, customizable normalization, and direct ID generation.

### 🌐 Overview and Application
SentencePiece alleviates open vocabulary problems in neural machine translation by implementing effective sub-word units. Its approach to handling raw sentences and treating whitespace as a basic symbol allows for lossless conversions and language-independent processing. Furthermore, SentencePiece's subword regularization and BPE-dropout methods contribute to the enhanced performance and robustness of NMT models.
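
To make the whitespace-as-symbol idea concrete, here is a minimal pure-Python sketch. It is not the SentencePiece implementation: `to_pieces` and `from_pieces` are hypothetical helpers, fixed-size chunking stands in for a trained subword model, and normalization is ignored. It only shows why escaping spaces with `▁` (U+2581) lets detokenization be a plain string join.

```python
WS = "\u2581"  # "▁", the meta symbol used to mark whitespace

def to_pieces(text, piece_len=3):
    """Escape spaces, then cut into fixed-size pieces (stand-in for a subword model)."""
    escaped = WS + text.replace(" ", WS)  # leading marker acts as a dummy prefix
    return [escaped[i:i + piece_len] for i in range(0, len(escaped), piece_len)]

def from_pieces(pieces):
    """Join the pieces and turn the whitespace markers back into spaces."""
    joined = "".join(pieces)
    return joined.replace(WS, " ")[1:]  # drop the space produced by the dummy prefix

sample = "Hello world, this is  a test."
pieces = to_pieces(sample)
restored = from_pieces(pieces)
print(pieces)
print(restored == sample)
```

Because the spaces live inside the pieces themselves, no language-specific detokenization rules are needed to recover the original string.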

### 🚀 In This Project
I focus on integrating SentencePiece-style BPE, via the `SentencePieceBPETokenizer` class from Hugging Face's `tokenizers` library, into the text processing pipeline. Instead of traditional preprocessing steps such as typo correction, the aim is to let the tokenizer's vocabulary capture linguistic nuances (including typos) directly.

### 🚨 Important Note
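This methodology relies heavily on the public leaderboard score and on the test set (both the tokenizer and the TF-IDF vocabulary are fit on test data), so it may overfit to the public leaderboard. A careful, strategic approach is advised.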

---

With this enhanced focus on SentencePiece, let's embark on a journey of sophisticated text processing and modeling, beginning with our initial setup and library imports.

## 📚 Library Imports and Preparations

In this cell, we're setting up our environment with all the necessary libraries and tools for our analysis and modeling. Here's a quick rundown of what each import is for:

- `sys`, `os`, and `gc`: System-level operations, environment variables, and garbage collection for memory management.
- `pandas` (🐼): Our go-to library for data manipulation and analysis.
- `sklearn`: A crucial library for machine learning, providing tools for model selection (`StratifiedKFold`) and evaluation (`roc_auc_score`).
- `numpy` (🔢): Essential for numerical operations.
- `TfidfVectorizer`: To convert text data into a matrix of TF-IDF features.
- `tokenizers` and `datasets` from Hugging Face: Advanced tools for efficient text tokenization and dataset handling.
- `tqdm`: For displaying progress bars during lengthy operations.
- `transformers`: Provides `PreTrainedTokenizerFast`, which wraps our trained tokenizer.
- `SGDClassifier`, `MultinomialNB`, `VotingClassifier`: Scikit-learn classifiers and the voting wrapper used to combine them.
- `LGBMClassifier` and `CatBoostClassifier`: Gradient-boosted tree models that join the ensemble.

🧑‍💻 Let's go ahead and import these libraries to kickstart our data processing and modeling journey!

---


In [7]:
import sys
import os
import gc

import pandas as pd
from sklearn.model_selection import StratifiedKFold
import numpy as np
from sklearn.metrics import roc_auc_score

from sklearn.feature_extraction.text import TfidfVectorizer

from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
    SentencePieceBPETokenizer
)

from datasets import Dataset
from tqdm.auto import tqdm
from transformers import PreTrainedTokenizerFast
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.svm import SVC

from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier

In [2]:
if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
    pass
else:
    sub = pd.read_csv('/sample1_submission.csv')
    sub.to_csv('submission.csv', index=False)
    sys.exit()

SystemExit: 

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


In [9]:
test = pd.read_csv('data/test_essays.csv')
sub = pd.read_csv('data/sample_submission.csv')
org_train = pd.read_csv('data/train_essays.csv')
train = pd.read_csv("data/train_v2_drcat_02.csv", sep=',')

In [10]:
train = train.drop_duplicates(subset=['text'])
train.reset_index(drop=True, inplace=True)
y_train = train['label'].values

In [11]:
LOWERCASE = False
VOCAB_SIZE = 30522

## 🧩 Building and Training the Tokenizer

### Tokenizer Setup
Firstly, we're utilizing the `SentencePieceBPETokenizer` from the `tokenizers` library. This tokenizer is based on Byte-Pair Encoding (BPE), which is highly effective for natural language processing tasks.

- **Initialization**: We begin by creating a raw tokenizer instance.
- **Normalization and Pre-tokenization**: We set up normalization (NFC, plus optional lowercase conversion controlled by `LOWERCASE`) and a byte-level pre-tokenizer.
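
One detail to watch when composing a normalizer list with an optional lowercase step: Python's conditional expression binds more loosely than `+`, so an unparenthesized `a + b if flag else c` drops `a` whenever the flag is false. The sketch below uses plain strings in place of normalizer objects; the function names are hypothetical.

```python
def build_steps_unparenthesized(lowercase):
    # Parsed as: (["nfc"] + ["lowercase"]) if lowercase else []
    return ["nfc"] + ["lowercase"] if lowercase else []

def build_steps_parenthesized(lowercase):
    # The always-on step stays outside the conditional.
    return ["nfc"] + (["lowercase"] if lowercase else [])

print(build_steps_unparenthesized(False))
print(build_steps_parenthesized(False))
```

With the flag off, only the parenthesized version keeps the always-on step.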

### Special Tokens and Training
- **Special Tokens**: We define special tokens like `[UNK]`, `[PAD]`, `[CLS]`, `[SEP]`, and `[MASK]`, which are crucial for certain language models.
- **Trainer Setup**: We create a trainer with a specified vocabulary size and include our special tokens.

### Tokenizer Training
- **Dataset Preparation**: We're converting our test dataset into a Hugging Face `Dataset` object for efficient handling.
- **Training Process**: The tokenizer is then trained using an iterator over our dataset. It's important to note that we're training on the test set here.
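
The iterator-based training described above boils down to yielding the corpus in fixed-size batches. Here is a minimal sketch on a plain Python list; `batch_iterator` and the sample texts are assumptions for illustration, with the batch size of 300 mirroring the code cell that follows.

```python
def batch_iterator(texts, batch_size=300):
    """Yield consecutive slices of `texts`, each holding at most `batch_size` items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]

# Hypothetical stand-in for the essay texts.
sample_texts = [f"essay {i}" for i in range(7)]
batch_sizes = [len(batch) for batch in batch_iterator(sample_texts, batch_size=3)]
print(batch_sizes)
```

Feeding batches rather than single strings keeps memory use bounded while still letting the trainer see the whole corpus.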

### Tokenization of Text Data
- **Finalizing the Tokenizer**: We wrap our raw tokenizer with `PreTrainedTokenizerFast` for compatibility with transformer models, specifying our special tokens.
- **Tokenization of Test and Train Sets**: Finally, we tokenize both our test and training text data, preparing it for further analysis or model input.

🔄 With our tokenizer now trained and text data tokenized, we're set to move forward with the next steps in our analysis or modeling process.

---


In [12]:
# Creating Byte-Pair Encoding tokenizer
raw_tokenizer = SentencePieceBPETokenizer()

# Adding normalization and pre_tokenizer
# NFC is always applied; lowercasing is appended only when LOWERCASE is True.
raw_tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFC()] + ([normalizers.Lowercase()] if LOWERCASE else [])
)
raw_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
# Adding special tokens and creating trainer instance
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]

# Creating huggingface dataset object
dataset = Dataset.from_pandas(test[['text']])

def train_corp_iter():
    """
    A generator function for iterating over a dataset in chunks.
    """    
    for i in range(0, len(dataset), 300):
        yield dataset[i : i + 300]["text"]

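# Training from iterator. REMEMBER: it's training on the test set...
# Pass the vocabulary size and special tokens so that both actually take effect.
raw_tokenizer.train_from_iterator(
    train_corp_iter(),
    vocab_size=VOCAB_SIZE,
    special_tokens=special_tokens,
)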

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object = raw_tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

tokenized_texts_test = []

# Tokenize test set with new tokenizer
for text in tqdm(test['text'].tolist()):
    tokenized_texts_test.append(tokenizer.tokenize(text))


# Tokenize train set
tokenized_texts_train = []

for text in tqdm(train['text'].tolist()):
    tokenized_texts_train.append(tokenizer.tokenize(text))

100%|██████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3007.39it/s]
100%|███████████████████████████████████████████████████████████████████████████| 44868/44868 [03:10<00:00, 235.68it/s]


In [13]:
def dummy(text):
    """
    A dummy function to use as tokenizer for TfidfVectorizer. It returns the text as it is since we already tokenized it.
    """
    return text

## 📈 Fitting the TfidfVectorizer

### Vectorizer Setup on Test Set
- **Initialization**: We set up our `TfidfVectorizer` with specific parameters, including an n-gram range of 3 to 5 and a custom tokenizer (`dummy`) defined above.
- **Fitting**: The vectorizer is then fitted on the tokenized test texts. This step involves learning the vocabulary and idf from the test set.
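
With `analyzer='word'` and an n-gram range of 3 to 5, each feature is a run of three to five consecutive tokens joined by spaces. The sketch below reproduces that idea in plain Python; `word_ngrams` is a hypothetical helper written for illustration, not scikit-learn's internal function.

```python
def word_ngrams(tokens, min_n=3, max_n=5):
    """Return all n-grams of `tokens` with min_n <= n <= max_n, joined by spaces."""
    ngrams = []
    for n in range(min_n, min(max_n, len(tokens)) + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

# A short, already-tokenized text in the byte-level style ("Ġ" marks a space).
sample_tokens = ["Ġ", "A", "a", "a", "Ġ"]
features = word_ngrams(sample_tokens)
print(features)
```

Short texts with fewer than three tokens produce no features at all, which is one reason a vocabulary learned from only a handful of texts stays small.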

### Extracting Vocabulary
- After fitting, we extract the vocabulary (`vocab`) from the vectorizer and print it out for inspection.

### Applying Vectorizer to Train Set
- **Adjustment for Training Set**: We create a new `TfidfVectorizer` instance, this time using the vocabulary learned from the test set. This ensures consistency in feature representation between the test and training sets.
- **Transformation**: We then fit and transform the tokenized training texts, and transform the tokenized test texts using this vectorizer.
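
To illustrate what fixing the vocabulary means for the weights, here is a plain-Python sketch: the feature set comes from the test documents, while the document frequencies (and therefore the IDF values) come from the training documents. It assumes scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The helper names and toy documents are hypothetical, and this is not how `TfidfVectorizer` is implemented internally.

```python
import math

def document_frequency(docs, vocab):
    """Count, for each vocabulary term, how many documents contain it."""
    df = {term: 0 for term in vocab}
    for doc in docs:
        for term in set(doc) & vocab:
            df[term] += 1
    return df

def smoothed_idf(docs, vocab):
    """Smoothed IDF computed on `docs`, restricted to the terms in `vocab`."""
    n_docs = len(docs)
    df = document_frequency(docs, vocab)
    return {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

# Toy documents, each already expanded into n-gram features.
test_docs = [["a b c", "b c d"]]
train_docs = [["a b c", "x y z"], ["a b c"], ["q r s"]]

vocab = {term for doc in test_docs for term in doc}  # vocabulary from the test set
idf = smoothed_idf(train_docs, vocab)                # weights from the train set
print(idf)
```

Features that occur only in the training data (`"x y z"`, `"q r s"`) never enter the matrix, and a test-only feature such as `"b c d"` receives the largest IDF because no training document contains it.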

### Cleanup
- Post-processing, we delete the vectorizer and run garbage collection to free up memory. This is an important step in managing resources, especially when working with large datasets.

🔍 By learning the vocabulary on the test set first and then fitting the vectorizer on the train set with that vocabulary, we keep only the n-gram features that actually occur in the test data and represent both sets in the same feature space.

---


In [14]:
# Fitting TfidfVectorizer on the test set
def fitting_vectorizer_on_test(train_tokens, test_tokens):
    """Learn the n-gram vocabulary on the test set, then fit TF-IDF on the train set with it."""
    vectorizer = TfidfVectorizer(
        ngram_range=(3, 5),
        lowercase=False,
        sublinear_tf=True,
        analyzer='word',
        tokenizer=dummy,      # inputs are already tokenized
        preprocessor=dummy,
        token_pattern=None,   # strip_accents='unicode' is left disabled
    )

    vectorizer.fit(test_tokens)

    # Getting vocab
    vocab = vectorizer.vocabulary_

    print(vocab)

    # Here we fit our vectorizer on the train set, but this time we use the vocabulary from the test fit.
    vectorizer = TfidfVectorizer(
        ngram_range=(3, 5),
        lowercase=False,
        sublinear_tf=True,
        vocabulary=vocab,
        analyzer='word',
        tokenizer=dummy,
        preprocessor=dummy,
        token_pattern=None,   # strip_accents='unicode' is left disabled
    )

    tf_train = vectorizer.fit_transform(train_tokens)
    tf_test = vectorizer.transform(test_tokens)

    del vectorizer
    gc.collect()
    return tf_train, tf_test

Just some sanity checks...

## 🌐 Building and Training the Ensemble Model

### Model Instantiation
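- **Multinomial Naive Bayes**: We create a `MultinomialNB` model with `alpha=0.02`. This model is well-suited for classification with discrete or frequency-like features (such as word counts or TF-IDF values).
- **SGD Classifier**: A `SGDClassifier` is initialized with a high number of maximum iterations (`max_iter=8000`) and a stopping tolerance of `1e-4`. The loss function is set to `"modified_huber"`, which is robust to outliers and supports probability estimates.
- **LightGBM and CatBoost**: Two gradient-boosted tree classifiers, `LGBMClassifier` and `CatBoostClassifier`, use hyperparameters taken from the public notebooks credited in the MODELS section below.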
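
For reference, the `modified_huber` loss mentioned above can be written in a few lines. This sketch follows the definition given in the scikit-learn documentation (quadratic near the margin, linear for badly misclassified points); the function name is a hypothetical helper, and labels are assumed to be encoded as -1 or +1.

```python
def modified_huber(y, score):
    """Modified Huber loss for a label y in {-1, +1} and a raw decision score."""
    margin = y * score
    if margin >= -1:
        # Quadratically smoothed hinge: zero once the margin reaches 1.
        return max(0.0, 1.0 - margin) ** 2
    # Linear penalty for strongly misclassified points limits the pull of outliers.
    return -4.0 * margin

for s in (2.0, 0.0, -1.0, -3.0):
    print(s, modified_huber(1, s))
```

The linear branch is what makes the loss less sensitive to outliers than a purely squared penalty.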

### Ensemble Creation
- **Voting Ensemble**: We combine the four classifiers defined in `calculate_voting` (Naive Bayes, SGD, LightGBM, and CatBoost) into a `VotingClassifier`. This ensemble uses soft voting, where the predicted probabilities from each model are averaged to make the final decision.
- **Weights Assignment**: The raw weights `[9.5, 43, 42, 42]` are normalized to sum to 1, so Naive Bayes contributes about 7% of the vote while SGD, LightGBM, and CatBoost contribute roughly 31% each.
- **Parallel Processing**: The `n_jobs=-1` parameter enables the use of all available CPU cores for parallel processing, speeding up the training process.
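
Soft voting is simply a weighted average of each model's predicted probabilities. The sketch below shows the arithmetic in plain Python using the raw weights from `calculate_voting`; the `soft_vote` helper and the probability values are made up for illustration.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-model probability lists (one list per model)."""
    total = sum(weights)
    normalized = [w / total for w in weights]
    n_samples = len(probabilities[0])
    return [
        sum(w * model_probs[i] for w, model_probs in zip(normalized, probabilities))
        for i in range(n_samples)
    ]

raw_weights = [9.5, 43, 42, 42]  # mnb, sgd, lgb, cat
# Hypothetical positive-class probabilities for two samples from each model.
model_probabilities = [
    [0.9, 0.2],  # mnb
    [0.8, 0.1],  # sgd
    [0.7, 0.3],  # lgb
    [0.6, 0.4],  # cat
]
blended = soft_vote(model_probabilities, raw_weights)
print(blended)
```

Only the ratios between the weights matter, so normalizing them first leaves the blended probabilities unchanged.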

### Model Training
- The ensemble model is then fitted on our TF-IDF transformed training data (`tf_train`) and the target variable (`y_train`).

### Resource Management
- After training the model, we run garbage collection to efficiently manage memory, an important step when working with large datasets and complex models.

🤖 With this ensemble approach, we aim to leverage the strengths of all four models and potentially improve our overall prediction accuracy.

---


## MODELS
LGBM Parameters from:
https://www.kaggle.com/code/siddhvr/llm-daigt-sub
CatBoost parameters from:
https://www.kaggle.com/code/batprem/llm-daigt-preprocessing-bypass-catboost-added/notebook

In [15]:
def calculate_voting(tf_train, tf_test, y_train):
    clf = MultinomialNB(alpha=0.02)

    sgd_model = SGDClassifier(max_iter=8000, tol=1e-4, loss="modified_huber")
    # LightGBM hyperparameters ('n_iter' is a LightGBM alias for the number of boosting rounds)
    p6 = {
        'n_iter': 1500,
        'verbose': -1,
        'objective': 'binary',
        'metric': 'auc',
        'learning_rate': 0.05073909898961407,
        'colsample_bytree': 0.726023996436955,
        'colsample_bynode': 0.5803681307354022,
        'lambda_l1': 8.562963348932286,
        'lambda_l2': 4.893256185259296,
        'min_data_in_leaf': 115,
        'max_depth': 23,
        'max_bin': 898,
    }
    lgb = LGBMClassifier(**p6)

    cat=CatBoostClassifier(
        iterations=1000,
        verbose=0,
        l2_leaf_reg=6.6591278779517808,
        learning_rate=0.005689066836106983,
        allow_const_label=True
    )
    
    weights = [9.5, 43, 42, 42]
    # Creating the ensemble model
    ensemble = VotingClassifier(estimators=[
        ('mnb', clf),
        ('sgd', sgd_model),
        ('lgb', lgb), 
        ('cat', cat)],
        weights = [w/sum(weights) for w in weights],
        voting='soft',
        n_jobs=-1)

    # Fit the ensemble model
    ensemble.fit(tf_train, y_train)
    final_preds = ensemble.predict_proba(tf_test)[:,1]
    # Garbage collection
    gc.collect()
    return(final_preds)

In [16]:
### Execute the processing

In [18]:
tf_train, tf_test = fitting_vectorizer_on_test(tokenized_texts_train, tokenized_texts_test)  
final_preds_sentencePiece = calculate_voting(tf_train, tf_test, y_train)
_ = gc.collect()

{'Ġ A a': 24, 'A a a': 0, 'a a Ġ': 12, 'a Ġ bb': 15, 'Ġ bb b': 33, 'bb b Ġccc': 19, 'b Ġccc .': 18, 'Ġ A a a': 25, 'A a a Ġ': 1, 'a a Ġ bb': 13, 'a Ġ bb b': 16, 'Ġ bb b Ġccc': 34, 'bb b Ġccc .': 20, 'Ġ A a a Ġ': 26, 'A a a Ġ bb': 2, 'a a Ġ bb b': 14, 'a Ġ bb b Ġccc': 17, 'Ġ bb b Ġccc .': 35, 'Ġ B bb': 27, 'B bb Ġccc': 3, 'bb Ġccc Ġddd': 21, 'Ġccc Ġddd .': 38, 'Ġ B bb Ġccc': 28, 'B bb Ġccc Ġddd': 4, 'bb Ġccc Ġddd .': 22, 'Ġ B bb Ġccc Ġddd': 29, 'B bb Ġccc Ġddd .': 5, 'Ġ CC C': 30, 'CC C Ġddd': 9, 'C Ġddd Ġ': 6, 'Ġddd Ġ ee': 39, 'Ġ ee e': 36, 'ee e .': 23, 'Ġ CC C Ġddd': 31, 'CC C Ġddd Ġ': 10, 'C Ġddd Ġ ee': 7, 'Ġddd Ġ ee e': 40, 'Ġ ee e .': 37, 'Ġ CC C Ġddd Ġ': 32, 'CC C Ġddd Ġ ee': 11, 'C Ġddd Ġ ee e': 8, 'Ġddd Ġ ee e .': 41}


KeyboardInterrupt: 

## Byte-Pair Encoding Tokenizer
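Same stuff as before, just a little different at the beginning: here we build a plain `Tokenizer` around a `BPE` model and train it with an explicit `BpeTrainer`.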

In [None]:
raw_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Adding normalization and pre_tokenizer
# NFC is always applied; lowercasing is appended only when LOWERCASE is True.
raw_tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFC()] + ([normalizers.Lowercase()] if LOWERCASE else [])
)
raw_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
# Adding special tokens and creating trainer instance
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.BpeTrainer(vocab_size=VOCAB_SIZE, special_tokens=special_tokens)
# Creating huggingface dataset object
dataset = Dataset.from_pandas(test[['text']])
def train_corp_iter(): 
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]
raw_tokenizer.train_from_iterator(train_corp_iter(), trainer=trainer)
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw_tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
tokenized_texts_test = []

for text in tqdm(test['text'].tolist()):
    tokenized_texts_test.append(tokenizer.tokenize(text))

tokenized_texts_train = []

for text in tqdm(train['text'].tolist()):
    tokenized_texts_train.append(tokenizer.tokenize(text))
    
tf_train, tf_test = fitting_vectorizer_on_test(tokenized_texts_train, tokenized_texts_test) 
final_preds_bytePair = calculate_voting(tf_train, tf_test, y_train)


100%|███████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 172.39it/s]
  0%|                                                                             | 14/44868 [00:06<5:22:47,  2.32it/s]

## 🚀 Final Submission Preparation

### Generating Predictions
- We average the predictions of the SentencePiece and Byte-Pair Encoding pipelines into `final_preds` and assign the result to a column named `generated` in our submission DataFrame `sub`.

### Creating Submission File
- **Export to CSV**: The DataFrame `sub` is then exported to a CSV file named `submission.csv`. We ensure `index=False` to exclude the DataFrame index from the CSV file, as per typical submission format requirements in competitions like Kaggle.
- **File Inspection**: After saving, we display the DataFrame `sub` to visually confirm its structure and the predictions it contains.
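
As a rough illustration of this last step without pandas, the sketch below averages two prediction lists and serializes them as CSV text. The helper names, the `id` column, and the sample values are assumptions for this example; only the `generated` column name comes from the notebook.

```python
import csv
import io

def average_predictions(preds_a, preds_b):
    """Element-wise mean of two equally long prediction lists."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction vectors must have the same length")
    return [(a + b) / 2 for a, b in zip(preds_a, preds_b)]

def to_submission_csv(ids, preds):
    """Build CSV text with an `id` column (assumed) and a `generated` column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "generated"])
    writer.writerows(zip(ids, preds))
    return buffer.getvalue()

# Hypothetical predictions from the two tokenizer pipelines.
averaged = average_predictions([0.25, 0.75], [0.75, 0.25])
csv_text = to_submission_csv(["a", "b"], averaged)
print(csv_text)
```

The length check guards against blending predictions that were produced for different sets of rows.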

### Final Step
- 📁 The `submission.csv` file is now ready to be uploaded as our competition entry. This file encapsulates our model's predictions and represents the culmination of our data processing, model training, and prediction efforts.

🎉 With the submission file created and verified, we've reached the end of our data science journey for this project. It's time to submit our predictions and see how our model performs!

---


In [20]:
final_preds = (final_preds_sentencePiece + final_preds_bytePair) / 2
final_preds

NameError: name 'final_preds_sentencePiece' is not defined

In [None]:
sub['generated'] = final_preds
sub.to_csv('submission.csv', index=False)
sub