# Notebook submission for The Learning Agency Lab - PII Data Detection

<font size="4"> This notebook is a submission for the Kaggle competition ‘The Learning Agency Lab - PII Data Detection.’ The goal is to identify personal identifiable information within a corpus of student essays. Detailed information regarding the competition may be found at the link below:
https://www.kaggle.com/c/pii-detection-removal-from-educational-data


<font size="4">To begin, import the required packages. Note the "Weights and Biases" package is used to log the project. Wandb is a great tool to track the results of data science projects. 

<font size="4">https://wandb.ai/site

In [1]:
#Import packages
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification
from sklearn.metrics import classification_report, precision_recall_fscore_support
from collections import Counter
import datasets
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
import pandas as pd
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import gc
import re
import random
from itertools import chain

import wandb
# Initialize wandb
wandb.init(project="pii-detection")

#Set random seeds for reproducibility
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

seed_value = 42  
set_seed(seed_value)

import time
# Capture the start time
start_time = time.time()
print("Start Time: ", time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(start_time)))



2024-08-07 23:18:39.610248: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-07 23:18:39.610362: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-07 23:18:39.741724: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Start Time:  2024-08-07 23:40:48


# Load Data
<font size="4">Load the training data from the provided JSON file and create a Hugging Face Dataset object. The data is split into training and evaluation sets.

In [2]:
# Load dataset
dataset = datasets.load_dataset('json', data_files='/kaggle/input/pii-detection-removal-from-educational-data/train.json')

# Convert to DataFrame and preprocess
df = dataset['train'].to_pandas()
all_labels = [label for sublist in df['labels'] for label in sublist]
unique_labels = list(set(all_labels))
id2label = {i: label for i, label in enumerate(unique_labels)}
label2id = {label: i for i, label in enumerate(unique_labels)}
all_labels = list(label2id.keys())  # Get all unique labels from mapping
num_labels = len(all_labels)  # Update the num_labels

# Tokenizer and Model
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForTokenClassification.from_pretrained("microsoft/deberta-v3-base", num_labels=num_labels, id2label=id2label, label2id=label2id)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Split Dataset
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_dataset = datasets.Dataset.from_pandas(train_df)
test_dataset = datasets.Dataset.from_pandas(test_df)
dataset_dict = datasets.DatasetDict({'train': train_dataset, 'test': test_dataset})





Generating train split: 0 examples [00:00, ? examples/s]

tokenizer_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/579 [00:00<?, ?B/s]

spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]



pytorch_model.bin:   0%|          | 0.00/371M [00:00<?, ?B/s]

  return self.fget.__get__(instance, owner)()
Some weights of DebertaV2ForTokenClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


<font size="4">We can explore the dataset object. 

In [3]:
print('Dataset length', len(dataset))
dataset

Dataset length 1


DatasetDict({
    train: Dataset({
        features: ['document', 'full_text', 'tokens', 'trailing_whitespace', 'labels'],
        num_rows: 6807
    })
})

<font size="4">The dataset has four columns, 'document', 'tokens', 'labels', and 'trailing_whitespace'. Pandas can be used to visualize the dataset and get a better understanding of the structure. Note that pandas is used for visualization purposes only. The datasets library is needed to use with the Hugging Face model. Some may argue that it is not best practice to not include Pandas in final production code (possible slower performance). 

In [4]:
df = pd.read_json('/kaggle/input/pii-detection-removal-from-educational-data/train.json')
df.head()

Unnamed: 0,document,full_text,tokens,trailing_whitespace,labels
0,7,Design Thinking for innovation reflexion-Avril...,"[Design, Thinking, for, innovation, reflexion,...","[True, True, True, True, False, False, True, F...","[O, O, O, O, O, O, O, O, O, B-NAME_STUDENT, I-..."
1,10,Diego Estrada\n\nDesign Thinking Assignment\n\...,"[Diego, Estrada, \n\n, Design, Thinking, Assig...","[True, False, False, True, True, False, False,...","[B-NAME_STUDENT, I-NAME_STUDENT, O, O, O, O, O..."
2,16,Reporting process\n\nby Gilberto Gamboa\n\nCha...,"[Reporting, process, \n\n, by, Gilberto, Gambo...","[True, False, False, True, True, False, False,...","[O, O, O, O, B-NAME_STUDENT, I-NAME_STUDENT, O..."
3,20,Design Thinking for Innovation\n\nSindy Samaca...,"[Design, Thinking, for, Innovation, \n\n, Sind...","[True, True, True, False, False, True, False, ...","[O, O, O, O, O, B-NAME_STUDENT, I-NAME_STUDENT..."
4,56,Assignment: Visualization Reflection Submitt...,"[Assignment, :, , Visualization, , Reflecti...","[False, False, False, False, False, False, Fal...","[O, O, O, O, O, O, O, O, O, O, O, O, B-NAME_ST..."


<font size="4">The pandas data frame produces a better visualization of the structure of the JSON file. Each row represents an essay written by a student. There are 6807 Rows in the data frame, each row representing a student essay. The column 'full text' contains the full essay. The 'Tokens' column contains the text separated by tokens. The trailing white space column is a list of placeholders indicating if a token contains a trailing white space. Finally, the labels column represents a label for each token. Each token is as  one of our desired categories of PII, or 'O' if the token does not belong to PII a category.
    
<font size="4">We can view the first full essay as an example. A function 'format_text' is used to make the document more readable for humans.

In [5]:
def format_text(text):
    # Add paragraph breaks
    formatted_text = text.replace('\n\n', '\n\n<p>\n\n')

    # Add bullet points to list items
    formatted_text = re.sub(r'•\s', '\n- ', formatted_text)

    # Handle remaining single newlines
    formatted_text = re.sub(r'(?<!\n)\n(?!\n)', ' ', formatted_text)
    
    # Remove leading and trailing spaces
    formatted_text = re.sub(r'\s+\n', '\n', formatted_text)
    formatted_text = re.sub(r'\n\s+', '\n', formatted_text)

    return formatted_text

# Example usage
raw_text = df['full_text'][0]

print(format_text(raw_text))


Design Thinking for innovation reflexion-Avril 2021-Nathalie Sylla
<p>
Challenge & selection
<p>
The tool I use to help all stakeholders finding their way through the complexity of a project is the  mind map.
<p>
What exactly is a mind map? According to the definition of Buzan T. and Buzan B. (1999, Dessine-moi  l'intelligence. Paris: Les Éditions d'Organisation.), the mind map (or heuristic diagram) is a graphic  representation technique that follows the natural functioning of the mind and allows the brain's  potential to be released. Cf Annex1
<p>
This tool has many advantages:
<p>
-  It is accessible to all and does not require significant material investment and can be done  quickly
<p>
-  It is scalable
<p>
-  It allows categorization and linking of information
<p>
-  It can be applied to any type of situation: notetaking, problem solving, analysis, creation of  new ideas
<p>
-  It is suitable for all people and is easy to learn
<p>
-  It is fun and encourages exchanges
<p>
-  It 

<font size="4"> We can also get the distribution of the unique labels. 

In [6]:
# Count the frequency of each label in the 'train' dataset
label_freq = Counter(chain(*train_dataset['labels']))

# Display the frequency of each label
for label, freq in label_freq.items():
    print(f"Label: {label}, Frequency: {freq}")


Label: O, Frequency: 3984659
Label: B-NAME_STUDENT, Frequency: 1102
Label: I-NAME_STUDENT, Frequency: 852
Label: B-ID_NUM, Frequency: 68
Label: B-URL_PERSONAL, Frequency: 82
Label: B-EMAIL, Frequency: 36
Label: B-USERNAME, Frequency: 6
Label: B-PHONE_NUM, Frequency: 4
Label: I-PHONE_NUM, Frequency: 12
Label: I-URL_PERSONAL, Frequency: 1
Label: I-ID_NUM, Frequency: 1
Label: B-STREET_ADDRESS, Frequency: 1
Label: I-STREET_ADDRESS, Frequency: 10


<font size="4">The data set is heavily distributed with tokens that do not belong to a PII category (ie 'O' labels). We will use a focal loss and class weights with our model to help with this class imbalance. This will help ensure our model does not simply predict all 'O's' given the large distribution of those labels.

# Tokenization and Alignment of Labels

<font size="4">Explanations of the unique labels in the data set are below. These are the desired PII categories we seek to identify in the student essays. Note the tokens are split using a 'piece wise' tokenizer format. 

# **Explanation of Labels:**

<font size="4">

- **B-EMAIL**: Beginning of an email address.
- **B-ID_NUM**: Beginning of an identification number.
- **B-NAME_STUDENT**: Beginning of a student's name.
- **B-PHONE_NUM**: Beginning of a phone number.
- **B-STREET_ADDRESS**: Beginning of a street address.
- **B-URL_PERSONAL**: Beginning of a personal URL.
- **B-USERNAME**: Beginning of a username.
- **I-ID_NUM**: Inside an identification number.
- **I-NAME_STUDENT**: Inside a student's name.
- **I-PHONE_NUM**: Inside a phone number.
- **I-STREET_ADDRESS**: Inside a street address.
- **I-URL_PERSONAL**: Inside a personal URL.
- **O**: Outside of any named entity.

**Piecewise Tokenization:**

The DeBERTa model uses a 'piecewise' tokenizer. This type of tokenizer breaks down text into smaller subword units, which is useful for handling rare words and morphological variations. It ensures that even if a word is not in the vocabulary, the tokenizer can still represent it using smaller known subword units.

**Tokenization Process:**

- **Token Splitting**: The text is split into tokens based on whitespace and punctuation.
- **Subword Tokenization**: Each token is further split into subwords. For example, the word "unhappiness" might be split into "un", "happi", and "ness".
- **Label Alignment**: Labels are aligned with the subword tokens. If a token is split into multiple subwords, the label for the original token is assigned to the first subword, and a special label (typically -100) is assigned to the subsequent subwords.

Here's an example of how a token and its label might be split and aligned:

Original token: "unhappiness" (Label: B-EMOTION)
Subword tokens: ["un", "happi", "ness"]
Aligned labels: [B-EMOTION, -100, -100]

The alignment ensures that the model learns to identify entities correctly even when tokens are split into subwords.
    
</font>


<font size="4"> A function to tokenize the input data and align the labels with the tokenized inputs s also defined. his function handles the trailing whitespace correctly. The map function, which is included in the datasets library, is used to map the function to all of the documents in the data set. This will create the tokenized dataset required for the Hugging Face Deberta model.


In [7]:
#Define function to tokenize and align labels
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        is_split_into_words=True,
        truncation=True,
        padding='max_length',
        max_length=1024,
        return_offsets_mapping=True
    )

    batch_original_tokens = []
    batch_tokenized_tokens = []
    batch_label_ids = []
    batch_input_ids = []
    batch_attention_masks = []
    batch_token_type_ids = []

    for i, label in enumerate(examples["labels"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        original_tokens = examples["tokens"][i]
        tokenized_tokens = tokenizer.convert_ids_to_tokens(tokenized_inputs["input_ids"][i])

        previous_word_idx = None
        label_ids = []
        original_token_list = []
        tokenized_token_list = []
        input_id_list = []
        attention_mask_list = []
        token_type_id_list = []

        for j, word_idx in enumerate(word_ids):
            if word_idx is None:
                label_ids.append(-100)
                current_original_token = ''  # Special token
            elif word_idx == previous_word_idx:
                label_ids.append(-100)
                current_original_token = ''  # Subword token
            else:
                label_ids.append(label2id[label[word_idx]])
                current_original_token = original_tokens[word_idx]

            original_token_list.append(current_original_token)
            tokenized_token_list.append(tokenized_tokens[j])
            input_id_list.append(tokenized_inputs["input_ids"][i][j])
            attention_mask_list.append(tokenized_inputs["attention_mask"][i][j])
            if "token_type_ids" in tokenized_inputs:
                token_type_id_list.append(tokenized_inputs["token_type_ids"][i][j])
            else:
                token_type_id_list.append(0)
            previous_word_idx = word_idx  # Update for the next iteration

        batch_original_tokens.append(original_token_list)
        batch_tokenized_tokens.append(tokenized_token_list)
        batch_label_ids.append(label_ids)
        batch_input_ids.append(input_id_list)
        batch_attention_masks.append(attention_mask_list)
        batch_token_type_ids.append(token_type_id_list)

    # Include BERT-required columns
    return {
        "original_tokens": batch_original_tokens,
        "tokenized_tokens": batch_tokenized_tokens,
        "labels": batch_label_ids,
        "input_ids": batch_input_ids,  
        "attention_mask": batch_attention_masks,
        "token_type_ids": batch_token_type_ids  
    }

# Tokenize
tokenized_datasets = dataset_dict.map(tokenize_and_align_labels, batched=True)

gc.collect()


Map:   0%|          | 0/5445 [00:00<?, ? examples/s]

Map:   0%|          | 0/1362 [00:00<?, ? examples/s]

1389

<font size="4">It is important to ensure the tokenizer works correctly. The function below displays a comparison of the original tokens to the processed tokens.

In [8]:
import datasets

# Function to print the first few elements of each relevant column
def print_comparison(dataset, num_elements=20):
    first_document = dataset[0]
    original_tokens = first_document['original_tokens'][:num_elements]
    tokenized_tokens = first_document['tokenized_tokens'][:num_elements]
    labels = first_document['labels'][:num_elements]
    input_ids = first_document['input_ids'][:num_elements]
    attention_mask = first_document['attention_mask'][:num_elements]
    token_type_ids = first_document['token_type_ids'][:num_elements]

    # Print the columns in a readable format
    for i in range(num_elements):
        print(f"Original Token: {original_tokens[i]:<15} | "
              f"Tokenized Token: {tokenized_tokens[i]:<20} | "
              f"Label: {labels[i]:<5} | "
              f"Input ID: {input_ids[i]:<10} | "
              f"Attention Mask: {attention_mask[i]:<5} | "
              f"Token Type ID: {token_type_ids[i]}")

# Example usage with the train dataset
print("Comparison of first 20 elements for the first document in the train dataset:")
print_comparison(tokenized_datasets['train'], num_elements=20)


Comparison of first 20 elements for the first document in the train dataset:
Original Token:                 | Tokenized Token: [CLS]                | Label: -100  | Input ID: 1          | Attention Mask: 1     | Token Type ID: 0
Original Token: Visualization   | Tokenized Token: ▁Visualization       | Label: 1     | Input ID: 51146      | Attention Mask: 1     | Token Type ID: 0
Original Token: tool            | Tokenized Token: ▁tool                | Label: 1     | Input ID: 1637       | Attention Mask: 1     | Token Type ID: 0
Original Token: to              | Tokenized Token: ▁to                  | Label: 1     | Input ID: 264        | Attention Mask: 1     | Token Type ID: 0
Original Token: control         | Tokenized Token: ▁control             | Label: 1     | Input ID: 719        | Attention Mask: 1     | Token Type ID: 0
Original Token: Humidity        | Tokenized Token: ▁Humidity            | Label: 1     | Input ID: 79916      | Attention Mask: 1     | Token Type ID: 0
Origi

<font size="4">The tokenizer appears to have processed the original tokens correctly. We can remove the unnecessary column from the tokenized data set and only keep the columns required for training.

In [9]:
# List of columns to keep
columns_to_keep = ['labels', 'input_ids', 'attention_mask', 'token_type_ids']

# Function to remove unnecessary columns
def remove_unnecessary_columns(dataset, columns_to_keep):
    return dataset.remove_columns([column for column in dataset.column_names if column not in columns_to_keep])

# Apply the function to both train and test datasets
tokenized_datasets['train'] = remove_unnecessary_columns(tokenized_datasets['train'], columns_to_keep)
tokenized_datasets['test'] = remove_unnecessary_columns(tokenized_datasets['test'], columns_to_keep)

# Deberta Model Training
<font size="4">Recall the data set is imbalanced, with the vast majority of labels belonging to a non-PII category 'O.' Class weights with a weighted loss function will be used to help with this imbalance. This means the model will put more focus on the minority classes and will put less weight on the majority classes when training. Also, a custom compute metrics function with a classification report is defined for evaluation.

In [10]:
# Calculate class weights
class_weights = compute_class_weight('balanced', classes=np.unique(all_labels), y=all_labels)
class_weights = torch.tensor(class_weights, dtype=torch.float).to(device)

In [11]:
# Define data collator
data_collator = DataCollatorForTokenClassification(tokenizer)

# Define Focal Loss
class FocalLoss(nn.Module):
    def __init__(self, gamma=2, alpha=None):
        super(FocalLoss, self).__init__()
        self.gamma = gamma
        self.alpha = alpha
        if isinstance(alpha, (float, int)):
            self.alpha = torch.Tensor([alpha, 1 - alpha])
        if isinstance(alpha, list):
            self.alpha = torch.Tensor(alpha)

    def forward(self, inputs, targets):
        BCE_loss = F.cross_entropy(inputs, targets, reduction='none', weight=self.alpha)
        pt = torch.exp(-BCE_loss)
        F_loss = ((1 - pt) ** self.gamma) * BCE_loss
        return F_loss.mean()

# Define data collator
data_collator = DataCollatorForTokenClassification(tokenizer)

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_labels = [[id2label[label] for label in doc if label != -100] for doc in labels]
    true_predictions = [
        [id2label[pred] for pred, label in zip(doc, labels[i]) if label != -100]
        for i, doc in enumerate(predictions)
    ]

    # Flatten the lists
    true_labels_flat = [item for sublist in true_labels for item in sublist]
    true_predictions_flat = [item for sublist in true_predictions for item in sublist]

    # Calculate metrics
    results = precision_recall_fscore_support(true_labels_flat, true_predictions_flat, average='weighted')
    
    # Compute classification report
    class_report = classification_report(
        true_labels_flat, true_predictions_flat, labels=all_labels, zero_division=0  # Explicitly set labels and handle zero division
    )

    print("Classification Report:\n", class_report)
    return {
        "precision": results[0],
        "recall": results[1],
        "f1": results[2],
        "accuracy": (results[2] * results[1])
    }

# Custom Trainer with custom loss function
class CustomTrainer(Trainer):
    def __init__(self, loss_fn, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.loss_fn = loss_fn

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss = self.loss_fn(logits.view(-1, logits.shape[-1]), labels.view(-1))
        
        return (loss, outputs) if return_outputs else loss

# Choose the loss function (WeightedLoss or FocalLoss)
loss_fn = FocalLoss(gamma=2, alpha=class_weights)

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=5,
    weight_decay=0.01,
    report_to='none'
)

# Trainer
trainer = CustomTrainer(
    loss_fn=loss_fn,
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)



In [12]:
# Train the model
trainer.train()

# Evaluate the model
eval_results = trainer.evaluate()
print(eval_results)


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.0031,0.000119,0.999875,0.999891,0.999873,0.999764
2,0.0002,9.9e-05,0.999865,0.999876,0.999865,0.999741
3,0.0001,9.2e-05,0.999897,0.999901,0.999897,0.999798
4,0.0,8.5e-05,0.999908,0.999917,0.999912,0.999828
5,0.0,7.9e-05,0.99991,0.999911,0.999907,0.999818


  _warn_prf(average, modifier, msg_start, len(result))


Classification Report:
                   precision    recall  f1-score   support

        I-ID_NUM       0.00      0.00      0.00         0
               O       1.00      1.00      1.00    887890
      B-USERNAME       0.00      0.00      0.00         0
  I-NAME_STUDENT       0.95      0.97      0.96       234
  I-URL_PERSONAL       0.00      0.00      0.00         0
I-STREET_ADDRESS       0.00      0.00      0.00         9
     I-PHONE_NUM       0.00      0.00      0.00         3
  B-URL_PERSONAL       1.00      0.15      0.26        27
         B-EMAIL       0.75      1.00      0.86         3
        B-ID_NUM       0.69      0.90      0.78        10
  B-NAME_STUDENT       0.93      0.92      0.93       253
     B-PHONE_NUM       0.00      0.00      0.00         2
B-STREET_ADDRESS       0.00      0.00      0.00         1

       micro avg       1.00      1.00      1.00    888432
       macro avg       0.41      0.38      0.37    888432
    weighted avg       1.00      1.00      1.0

  _warn_prf(average, modifier, msg_start, len(result))


Classification Report:
                   precision    recall  f1-score   support

        I-ID_NUM       0.00      0.00      0.00         0
               O       1.00      1.00      1.00    887890
      B-USERNAME       0.00      0.00      0.00         0
  I-NAME_STUDENT       0.96      0.94      0.95       234
  I-URL_PERSONAL       0.00      0.00      0.00         0
I-STREET_ADDRESS       0.00      0.00      0.00         9
     I-PHONE_NUM       0.00      0.00      0.00         3
  B-URL_PERSONAL       1.00      0.33      0.50        27
         B-EMAIL       0.75      1.00      0.86         3
        B-ID_NUM       0.69      0.90      0.78        10
  B-NAME_STUDENT       0.86      0.95      0.90       253
     B-PHONE_NUM       0.00      0.00      0.00         2
B-STREET_ADDRESS       0.00      0.00      0.00         1

       micro avg       1.00      1.00      1.00    888432
       macro avg       0.40      0.39      0.38    888432
    weighted avg       1.00      1.00      1.0

  _warn_prf(average, modifier, msg_start, len(result))


Classification Report:
                   precision    recall  f1-score   support

        I-ID_NUM       0.00      0.00      0.00         0
               O       1.00      1.00      1.00    887890
      B-USERNAME       0.00      0.00      0.00         0
  I-NAME_STUDENT       0.96      0.98      0.97       234
  I-URL_PERSONAL       0.00      0.00      0.00         0
I-STREET_ADDRESS       0.00      0.00      0.00         9
     I-PHONE_NUM       0.00      0.00      0.00         3
  B-URL_PERSONAL       0.64      1.00      0.78        27
         B-EMAIL       0.75      1.00      0.86         3
        B-ID_NUM       0.53      0.90      0.67        10
  B-NAME_STUDENT       0.89      0.96      0.92       253
     B-PHONE_NUM       0.00      0.00      0.00         2
B-STREET_ADDRESS       0.00      0.00      0.00         1

       micro avg       1.00      1.00      1.00    888432
       macro avg       0.37      0.45      0.40    888432
    weighted avg       1.00      1.00      1.0

  _warn_prf(average, modifier, msg_start, len(result))


Classification Report:
                   precision    recall  f1-score   support

        I-ID_NUM       0.00      0.00      0.00         0
               O       1.00      1.00      1.00    887890
      B-USERNAME       0.00      0.00      0.00         0
  I-NAME_STUDENT       0.95      0.99      0.97       234
  I-URL_PERSONAL       0.00      0.00      0.00         0
I-STREET_ADDRESS       0.00      0.00      0.00         9
     I-PHONE_NUM       0.00      0.00      0.00         3
  B-URL_PERSONAL       0.87      0.96      0.91        27
         B-EMAIL       0.75      1.00      0.86         3
        B-ID_NUM       0.64      0.90      0.75        10
  B-NAME_STUDENT       0.89      0.96      0.93       253
     B-PHONE_NUM       0.00      0.00      0.00         2
B-STREET_ADDRESS       0.00      0.00      0.00         1

       micro avg       1.00      1.00      1.00    888432
       macro avg       0.39      0.45      0.42    888432
    weighted avg       1.00      1.00      1.0

  _warn_prf(average, modifier, msg_start, len(result))


Classification Report:
                   precision    recall  f1-score   support

        I-ID_NUM       0.00      0.00      0.00         0
               O       1.00      1.00      1.00    887890
      B-USERNAME       0.00      0.00      0.00         0
  I-NAME_STUDENT       0.97      0.97      0.97       234
  I-URL_PERSONAL       0.00      0.00      0.00         0
I-STREET_ADDRESS       1.00      0.22      0.36         9
     I-PHONE_NUM       0.00      0.00      0.00         3
  B-URL_PERSONAL       0.86      0.93      0.89        27
         B-EMAIL       0.75      1.00      0.86         3
        B-ID_NUM       0.75      0.90      0.82        10
  B-NAME_STUDENT       0.88      0.96      0.92       253
     B-PHONE_NUM       0.00      0.00      0.00         2
B-STREET_ADDRESS       0.00      0.00      0.00         1

       micro avg       1.00      1.00      1.00    888432
       macro avg       0.48      0.46      0.45    888432
    weighted avg       1.00      1.00      1.0

  _warn_prf(average, modifier, msg_start, len(result))


Classification Report:
                   precision    recall  f1-score   support

        I-ID_NUM       0.00      0.00      0.00         0
               O       1.00      1.00      1.00    887890
      B-USERNAME       0.00      0.00      0.00         0
  I-NAME_STUDENT       0.97      0.97      0.97       234
  I-URL_PERSONAL       0.00      0.00      0.00         0
I-STREET_ADDRESS       1.00      0.22      0.36         9
     I-PHONE_NUM       0.00      0.00      0.00         3
  B-URL_PERSONAL       0.86      0.93      0.89        27
         B-EMAIL       0.75      1.00      0.86         3
        B-ID_NUM       0.75      0.90      0.82        10
  B-NAME_STUDENT       0.88      0.96      0.92       253
     B-PHONE_NUM       0.00      0.00      0.00         2
B-STREET_ADDRESS       0.00      0.00      0.00         1

       micro avg       1.00      1.00      1.00    888432
       macro avg       0.48      0.46      0.45    888432
    weighted avg       1.00      1.00      1.0

# Results
<font size="4"> The model achieved extremely high precision, recall, and F1 scores, all near or above 0.9999. This indicates the model performs exceptionally well in classifying the majority class 'O', however, this can be misleading due to the class imbalance. Some minority classes have a precision and recall of 0.00 due to insufficient samples. The number of samples in the evaluation set can be seen in the 'support' column in the classification report.
    
<font size="4">A focal loss was used to help with the imbalance data set, however additional steps could be taken to help with this. Generating synthetic data with data augmentation could help, however this is CPU intensive and time consuming. For example, an attempt was made to generate 100 synthetic augmented examples for each of the underrepresented pii categories; the processing time for just 100 examples per category was over 90 minutes. It would be unfeasible to attempt a CPU augmentation for a data set of this size.
    

# Submission

In [13]:
# Load test dataset
test_dataset = datasets.load_dataset('json', data_files='/kaggle/input/pii-detection-removal-from-educational-data/test.json')['train']

def tokenize_test_set(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        is_split_into_words=True,
        truncation=True,
        padding='max_length',
        max_length=1536,
        return_offsets_mapping=True
    )

    batch_original_tokens = []
    batch_tokenized_tokens = []
    batch_input_ids = []
    batch_attention_masks = []
    batch_token_type_ids = []
    batch_offset_mappings = []  # New line to store offset mappings

    for i in range(len(examples["tokens"])):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        original_tokens = examples["tokens"][i]
        tokenized_tokens = tokenizer.convert_ids_to_tokens(tokenized_inputs["input_ids"][i])

        original_token_list = []
        tokenized_token_list = []
        input_id_list = []
        attention_mask_list = []
        token_type_id_list = []

        for j, word_idx in enumerate(word_ids):
            current_original_token = '' if word_idx is None else original_tokens[word_idx]

            original_token_list.append(current_original_token)
            tokenized_token_list.append(tokenized_tokens[j])
            input_id_list.append(tokenized_inputs["input_ids"][i][j])
            attention_mask_list.append(tokenized_inputs["attention_mask"][i][j])
            if "token_type_ids" in tokenized_inputs:
                token_type_id_list.append(tokenized_inputs["token_type_ids"][i][j])
            else:
                token_type_id_list.append(0)

        batch_original_tokens.append(original_token_list)
        batch_tokenized_tokens.append(tokenized_token_list)
        batch_input_ids.append(input_id_list)
        batch_attention_masks.append(attention_mask_list)
        batch_token_type_ids.append(token_type_id_list)
        # Append offset_mapping for the current sample
        batch_offset_mappings.append(tokenized_inputs['offset_mapping'][i])

    return {
        "original_tokens": batch_original_tokens,
        "tokenized_tokens": batch_tokenized_tokens,
        "input_ids": batch_input_ids,
        "attention_mask": batch_attention_masks,
        "token_type_ids": batch_token_type_ids,
        "offset_mapping": batch_offset_mappings,
    }

test_dataset = test_dataset.map(tokenize_test_set, batched=True)

# List of columns to keep
columns_to_keep = ['document', 'input_ids', 'attention_mask', 'token_type_ids']
test_dataset = test_dataset.remove_columns([col for col in test_dataset.column_names if col not in columns_to_keep])

# Define the confidence threshold
confidence_threshold = 0.95

# Predict on test dataset
test_predictions = trainer.predict(test_dataset)
pred_probs = test_predictions.predictions
preds = np.argmax(pred_probs, axis=2)
max_probs = np.max(pred_probs, axis=2)

# Create submission file
submission = []

for i, doc in enumerate(pred_probs):
    word_ids = test_dataset[i]['input_ids']
    for j, word_id in enumerate(word_ids):
        if word_id != tokenizer.pad_token_id:
            row_id = len(submission)
            document_id = test_dataset[i]['document']
            token_id = j
            pred_label = id2label[preds[i][j]]
            if pred_label != 'O' and max_probs[i][j] >= confidence_threshold:  # Exclude outside labels and apply threshold
                submission.append([row_id, document_id, token_id, pred_label])

submission_df = pd.DataFrame(submission, columns=["row_id", "document", "token", "label"])

# Save to CSV
submission_df.to_csv('submission.csv', index=False)

print("Submission file created successfully!")

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Submission file created successfully!


In [14]:
submission_df

Unnamed: 0,row_id,document,token,label
0,0,7,11,B-NAME_STUDENT
1,1,7,12,I-NAME_STUDENT
2,2,7,13,I-NAME_STUDENT
3,3,7,460,B-NAME_STUDENT
4,4,7,461,I-NAME_STUDENT
5,5,7,462,I-NAME_STUDENT
6,6,7,706,B-NAME_STUDENT
7,7,7,707,I-NAME_STUDENT
8,8,7,708,I-NAME_STUDENT
9,9,10,1,B-NAME_STUDENT


# Conclusion

<font size="4"> We were able to use the Hugging Face Deberta v3 Base model for this task. Some notable observations that I found during this experiment were the following:

**<font size="4">1. The importance of handling extreme imbalance in datasets:** <font size="4">Various techniques were attempted to address the extreme data imbalance. Undersampling the majority class does not leave enough data to properly train the model, while oversampling creates a data set too large to train in a reasonable time. Additional research could be completed to find time-efficient techniques for handling extreme data imbalance.

**<font size="4">2. Importance of verifying correct tokenization:** <font size="4">I found that tokenizing and aligning the data was the most challenging part of this experiment. It is important to choose a tokenizer that best works for a given task. It is also important to visually and programmatically check that the tokenizer has worked correctly.

**<font size="4">3. Robust options available for PII detection:** <font size="4">Hugging Face provides hundreds of free-to-use models for experimentation. Also, other packages such as Spacy and Microsoft Presidio offer lightweight solutions for similar tasks with smaller data sets. It is important to choose an appropriate model for the given structure of the data set, data set size, etc.

<font size="4">I am appreciative of other Kaggle users who have submitted and made their notebooks public for this competition. Participating in this competition has allowed me to learn about various techniques and packages for PII detection. Thank you to the Kaggle platform and the host of this competition. Please feel free to leave comments, feedback, suggestions, etc. Feedback from the Kaggle and data science community would be greatly appreciated.


In [15]:
# Capture the end time
end_time = time.time()
print("End Time: ", time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(end_time)))

# Calculate the duration
duration = end_time - start_time

# Convert duration to minutes and seconds
minutes = int(duration // 60)
seconds = int(duration % 60)

print(f"Notebook Run Time Duration: {minutes} minutes and {seconds} seconds")

End Time:  2024-08-08 01:33:50
Notebook Run Time Duration: 113 minutes and 2 seconds
