# DistilBERT - finetuned_ai4privacy_v2

DistilBERT is a streamlined version of BERT designed for faster training and inference on off-cloud servers, which makes it a suitable choice for applications with limited computational resources (Wei et al., 2022). It retains 97% of BERT's language understanding capabilities while being 60% faster and 40% smaller (Sanh et al., 2019).

The model has demonstrated its efficacy across various applications, such as suicidal intention detection in social media posts (Ananthakrishnan et al., 2022) and opinion holder detection (Al-Mahmud & Shimada, 2022). This versatility is attributable to its robust foundational architecture derived from BERT, coupled with enhancements in efficiency and speed.

1. Wei, F., Yang, J., Mao, Q., Qin, H., & Dabrowski, A. (2022). An Empirical Comparison of DistilBERT, Longformer and Logistic Regression for Predictive Coding. 2022 IEEE International Conference on Big Data (Big Data). 

    - URL: https://ieeexplore.ieee.org/document/10020486
    
2. Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv. 

    - URL: https://www.semanticscholar.org/paper/DistilBERT%2C-a-distilled-version-of-BERT%3A-smaller%2C-Sanh-Debut/a54b56af24bb4873ed0163b77df63b92bd018ddc

3. Ananthakrishnan, G., Jayaraman, A. K., Trueman, T., Mitra, S., A K, A., & Murugappan, A. (2022). Suicidal Intention Detection in Tweets Using BERT-Based Transformers. 2022 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS). 

    - URL: https://ieeexplore.ieee.org/document/10037677

4. Al-Mahmud, & Shimada, K. (2022). Dataset Construction and Classification Based on Pre-trained Models for Opinion Holder Detection. 2022 12th International Congress on Advanced Applied Informatics (IIAI-AAI). 

    - URL: https://ieeexplore.ieee.org/document/9894524

### DistilBERT:
Paper: https://arxiv.org/abs/1910.01108

Huggingface: https://huggingface.co/distilbert-base-uncased

Documentation: https://huggingface.co/docs/transformers/main/en/model_doc/distilbert

### ai4privacy:
Huggingface: https://huggingface.co/ai4privacy

Model: https://huggingface.co/Isotonic/distilbert_finetuned_ai4privacy_v2

GitHub: https://github.com/Sripaad/ai4privacy


In [1]:
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score
from IPython.display import display, HTML
from collections import Counter
import numpy as np

display(HTML("<style>.container { width:100% !important; }</style>"))

In [2]:
model_name = "Isotonic/distilbert_finetuned_ai4privacy_v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

In [3]:
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer)

text = "My name is Max and im from Frankfurt. I study at the University of Mannheim."
results = nlp(text)

results


[{'entity': 'B-FIRSTNAME',
  'score': 0.86211693,
  'index': 4,
  'word': 'max',
  'start': 11,
  'end': 14},
 {'entity': 'B-STATE',
  'score': 0.9750011,
  'index': 8,
  'word': 'frankfurt',
  'start': 27,
  'end': 36},
 {'entity': 'B-STATE',
  'score': 0.9271666,
  'index': 16,
  'word': 'mannheim',
  'start': 67,
  'end': 75}]
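
The `word` field in the output above is lowercased, but `start` and `end` are character offsets into the original string. A minimal sketch of how those offsets could be used to mask the detected spans follows. The helper name `mask_entities` is hypothetical, and the sketch assumes one token per entity, as in the output shown; an entity split into `B-`/`I-` tokens would get one placeholder per token.

```python
def mask_entities(text, entities):
    """Replace each detected span with a [LABEL] placeholder using character offsets."""
    masked = text
    # Work from right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = ent["entity"].split("-", 1)[-1]  # drop the "B-" / "I-" prefix
        masked = masked[:ent["start"]] + f"[{label}]" + masked[ent["end"]:]
    return masked


text = "My name is Max and im from Frankfurt. I study at the University of Mannheim."
# Offsets and labels copied from the pipeline output above.
entities = [
    {"entity": "B-FIRSTNAME", "start": 11, "end": 14},
    {"entity": "B-STATE", "start": 27, "end": 36},
    {"entity": "B-STATE", "start": 67, "end": 75},
]

masked = mask_entities(text, entities)
print(text[11:14])  # original casing recovered from the offsets
print(masked)
```

Slicing the original text this way also keeps the original casing and punctuation, which the lowercased `word` field does not.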

In [4]:
labels = model.config.id2label
print(labels)

{0: 'O', 1: 'B-PHONEIMEI', 2: 'I-PHONEIMEI', 3: 'B-JOBAREA', 4: 'B-FIRSTNAME', 5: 'I-FIRSTNAME', 6: 'B-VEHICLEVIN', 7: 'I-VEHICLEVIN', 8: 'B-AGE', 9: 'B-GENDER', 10: 'I-GENDER', 11: 'B-HEIGHT', 12: 'I-HEIGHT', 13: 'B-BUILDINGNUMBER', 14: 'I-BUILDINGNUMBER', 15: 'B-MASKEDNUMBER', 16: 'I-MASKEDNUMBER', 17: 'B-PASSWORD', 18: 'I-PASSWORD', 19: 'B-DOB', 20: 'I-DOB', 21: 'B-IPV6', 22: 'I-IPV6', 23: 'B-NEARBYGPSCOORDINATE', 24: 'I-NEARBYGPSCOORDINATE', 25: 'B-USERAGENT', 26: 'I-USERAGENT', 27: 'B-TIME', 28: 'I-TIME', 29: 'B-JOBTITLE', 30: 'I-JOBTITLE', 31: 'B-COUNTY', 32: 'B-EMAIL', 33: 'I-EMAIL', 34: 'B-ACCOUNTNUMBER', 35: 'I-ACCOUNTNUMBER', 36: 'B-PIN', 37: 'I-PIN', 38: 'B-EYECOLOR', 39: 'I-EYECOLOR', 40: 'B-LASTNAME', 41: 'I-LASTNAME', 42: 'I-JOBAREA', 43: 'B-IPV4', 44: 'I-IPV4', 45: 'B-DATE', 46: 'I-DATE', 47: 'B-STREET', 48: 'I-STREET', 49: 'B-CITY', 50: 'I-CITY', 51: 'B-PREFIX', 52: 'I-PREFIX', 53: 'B-CREDITCARDISSUER', 54: 'B-CREDITCARDNUMBER', 55: 'I-CREDITCARDNUMBER', 56: 'I-CREDITCA

In [5]:
data = pd.read_json("../../data/dataset_english.json")
data

Unnamed: 0,masked_text,unmasked_text,privacy_mask,span_labels,bio_labels,tokenised_text
0,A students assessment was found on device bear...,A students assessment was found on device bear...,"{'[PHONEIMEI_1]': '06-184755-866851-3', '[JOBA...","[[0, 57, O], [57, 75, PHONEIMEI_1], [75, 138, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, B-PHON...","[a, student, s, assessment, was, found, on, de..."
1,"Dear [FIRSTNAME_1], as per our records, your l...","Dear Omer, as per our records, your license 78...","{'[FIRSTNAME_1]': 'Omer', '[VEHICLEVIN_1]': '7...","[[0, 5, O], [5, 9, FIRSTNAME_1], [9, 44, O], [...","[O, B-FIRSTNAME, I-FIRSTNAME, O, O, O, O, O, O...","[dear, om, ##er, ,, as, per, our, records, ,, ..."
2,[FIRSTNAME_1] could you please share your reco...,Kattie could you please share your recomndatio...,"{'[FIRSTNAME_1]': 'Kattie', '[AGE_1]': '72', '...","[[0, 6, FIRSTNAME_1], [6, 75, O], [75, 77, AGE...","[B-FIRSTNAME, I-FIRSTNAME, O, O, O, O, O, O, O...","[kat, ##tie, could, you, please, share, your, ..."
3,Emergency supplies in [BUILDINGNUMBER_1] need ...,Emergency supplies in 16356 need a refill. Use...,"{'[BUILDINGNUMBER_1]': '16356', '[MASKEDNUMBER...","[[0, 22, O], [22, 27, BUILDINGNUMBER_1], [27, ...","[O, O, O, B-BUILDINGNUMBER, I-BUILDINGNUMBER, ...","[emergency, supplies, in, 1635, ##6, need, a, ..."
4,"The [AGE_1] old child at [BUILDINGNUMBER_1], h...","The 88 old child at 5862, has showcased an unu...","{'[AGE_1]': '88', '[BUILDINGNUMBER_1]': '5862'...","[[0, 4, O], [4, 6, AGE_1], [6, 20, O], [20, 24...","[O, B-AGE, O, O, O, B-BUILDINGNUMBER, I-BUILDI...","[the, 88, old, child, at, 58, ##6, ##2, ,, has..."
...,...,...,...,...,...,...
43496,"Hello [FIRSTNAME_1], your cognitive therapy ap...","Hello Nellie, your cognitive therapy appointme...","{'[FIRSTNAME_1]': 'Nellie', '[DATE_1]': '8/21'...","[[0, 6, O], [6, 12, FIRSTNAME_1], [12, 66, O],...","[O, B-FIRSTNAME, O, O, O, O, O, O, O, O, B-DAT...","[hello, nellie, ,, your, cognitive, therapy, a..."
43497,"Dear [FIRSTNAME_1], we appreciate your active ...","Dear Jalon, we appreciate your active involvem...","{'[FIRSTNAME_1]': 'Jalon', '[CREDITCARDNUMBER_...","[[0, 5, O], [5, 10, FIRSTNAME_1], [10, 159, O]...","[O, B-FIRSTNAME, I-FIRSTNAME, O, O, O, O, O, O...","[dear, ja, ##lon, ,, we, appreciate, your, act..."
43498,"Dear [SEX_1] at [ZIPCODE_1], we are raising fu...","Dear Female at 32363-2779, we are raising fund...","{'[SEX_1]': 'Female', '[ZIPCODE_1]': '32363-27...","[[0, 5, O], [5, 11, SEX_1], [11, 15, O], [15, ...","[O, B-SEX, O, B-ZIPCODE, I-ZIPCODE, I-ZIPCODE,...","[dear, female, at, 323, ##6, ##3, -, 277, ##9,..."
43499,"Hello [FIRSTNAME_1], we encourage you to pay t...","Hello Tito, we encourage you to pay the fees o...","{'[FIRSTNAME_1]': 'Tito', '[ETHEREUMADDRESS_1]...","[[0, 6, O], [6, 10, FIRSTNAME_1], [10, 137, O]...","[O, B-FIRSTNAME, O, O, O, O, O, O, O, O, O, O,...","[hello, tito, ,, we, encourage, you, to, pay, ..."


In [6]:
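def convert_privacy_mask(privacy_mask):
    """Map placeholder keys such as '[FIRSTNAME_1]' to their bare entity type ('FIRSTNAME')."""
    new_dict = {}
    for key, value in privacy_mask.items():
        # Strip the surrounding brackets, then drop the numeric suffix after the underscore
        new_key = key.strip('[]').split('_')[0]
        # Note: a later placeholder of the same type overwrites an earlier one
        new_dict[new_key] = value
    return new_dict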

data['data_results'] = data['privacy_mask'].apply(convert_privacy_mask)

data[['privacy_mask', 'data_results']].head()


Unnamed: 0,privacy_mask,data_results
0,"{'[PHONEIMEI_1]': '06-184755-866851-3', '[JOBA...","{'PHONEIMEI': '06-184755-866851-3', 'JOBAREA':..."
1,"{'[FIRSTNAME_1]': 'Omer', '[VEHICLEVIN_1]': '7...","{'FIRSTNAME': 'Omer', 'VEHICLEVIN': '78B5R2MVF..."
2,"{'[FIRSTNAME_1]': 'Kattie', '[AGE_1]': '72', '...","{'FIRSTNAME': 'Kattie', 'AGE': '72', 'GENDER':..."
3,"{'[BUILDINGNUMBER_1]': '16356', '[MASKEDNUMBER...","{'BUILDINGNUMBER': '16356', 'MASKEDNUMBER': '5..."
4,"{'[AGE_1]': '88', '[BUILDINGNUMBER_1]': '5862'...","{'AGE': '88', 'BUILDINGNUMBER': '5862', 'PASSW..."
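
Because `convert_privacy_mask` drops the numeric suffix, two placeholders of the same type in one record would collapse into a single key. The sketch below keeps every value in a list instead. It assumes a record can contain more than one placeholder of a type, which the numeric suffix suggests but the rows shown here do not confirm. The function name `group_privacy_mask` and the example dictionary are hypothetical.

```python
from collections import defaultdict


def group_privacy_mask(privacy_mask):
    """Collect all values per entity type instead of keeping only the last one."""
    grouped = defaultdict(list)
    for key, value in privacy_mask.items():
        # "[FIRSTNAME_2]" -> "FIRSTNAME_2" -> "FIRSTNAME"
        entity_type = key.strip("[]").rsplit("_", 1)[0]
        grouped[entity_type].append(value)
    return dict(grouped)


# Hypothetical record with two placeholders of the same type.
example = {"[FIRSTNAME_1]": "Omer", "[FIRSTNAME_2]": "Kattie", "[AGE_1]": "72"}
grouped = group_privacy_mask(example)
print(grouped)
```

With list values, the later comparison against the model output would need to compare lists or sets rather than single strings.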


In [41]:
data_small = data.head(3)

In [42]:
# an empty list to store the simplified results
model_simplified_results = []

# Iterate over the rows of the DataFrame
for index, row in data_small.iterrows():
    text = row['unmasked_text']
    # Apply the model to the text
    result = nlp(text)

    # Simplify the model output and extract entity and word
    simplified_dict = {}
    for item in result:
        # Remove the "B-" or "I-" prefix from the entity label
        entity = item['entity'][2:]
        word = item['word']
        # Accumulate tokens for each entity type, separated by spaces
        # (subword pieces keep their "##" marker, e.g. 'om ##er')
        simplified_dict[entity] = simplified_dict.get(entity, "") + " " + word

    # Strip extra spaces from the accumulated words for each entity
    simplified_dict = {k: v.strip() for k, v in simplified_dict.items()}

    # Append the simplified dictionary to the list
    model_simplified_results.append(simplified_dict)

# Add a new column with the model results; assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the head(3) slice
data_small = data_small.assign(model_results=model_simplified_results)


In [43]:
data_small[["unmasked_text", "masked_text", "data_results", "model_results"]]

Unnamed: 0,unmasked_text,masked_text,data_results,model_results
0,A students assessment was found on device bear...,A students assessment was found on device bear...,"{'PHONEIMEI': '06-184755-866851-3', 'JOBAREA':...",{'PHONEIMEI': '06 - 1847 ##55 - 86 ##6 ##85 ##...
1,"Dear Omer, as per our records, your license 78...","Dear [FIRSTNAME_1], as per our records, your l...","{'FIRSTNAME': 'Omer', 'VEHICLEVIN': '78B5R2MVF...","{'FIRSTNAME': 'om ##er', 'SSN': '78 ##b', 'DOB..."
2,Kattie could you please share your recomndatio...,[FIRSTNAME_1] could you please share your reco...,"{'FIRSTNAME': 'Kattie', 'AGE': '72', 'GENDER':...","{'FIRSTNAME': 'kat ##tie', 'AGE': '72', 'GENDE..."
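
The `##` markers in `model_results` above are WordPiece continuation tokens, so a value such as `'om ##er'` cannot match the reference `'Omer'` even after lowercasing. A minimal sketch of gluing the pieces back together before comparison follows. The helper name `merge_wordpieces` is hypothetical and not part of the notebook.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens: a token starting with '##' continues the previous word."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # append the piece without the "##" marker
        else:
            words.append(tok)
    return " ".join(words)


# Token strings taken from the model_results column above.
print(merge_wordpieces("om ##er".split()))
print(merge_wordpieces("kat ##tie".split()))
```

This does not remove the spaces that tokenization puts around punctuation (for example around the hyphens in the IMEI value), so slicing the original text with the `start`/`end` offsets may be the more robust route. The pipeline's `aggregation_strategy` argument is another option for grouping tokens; it is not used in this notebook.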


In [44]:
print(data_small['data_results'].apply(type).value_counts())
print(data_small['model_results'].apply(type).value_counts())

data_results
<class 'dict'>    3
Name: count, dtype: int64
model_results
<class 'dict'>    3
Name: count, dtype: int64


In [45]:
def flatten_results(data, entity_types):
    # lists to store flattened true and predicted values
    y_true = []
    y_pred = []

    for _, row in data.iterrows():
        true_entities = row['data_results']  # Actual entities from the data 
        predicted_entities = row['model_results'] # Predicted entities from model

        for entity in entity_types:
            actual = true_entities.get(entity, '')
            predicted = predicted_entities.get(entity, '')
            y_true.append(actual.lower())
            y_pred.append(predicted.lower())

    return y_true, y_pred


# set of all entity types present in the reference annotations of data_small
entity_types = list(set([entity for row in data_small['data_results'] for entity in row.keys()]))
entity_types

['PHONEIMEI', 'VEHICLEVIN', 'JOBAREA', 'GENDER', 'FIRSTNAME', 'HEIGHT', 'AGE']

In [46]:
# Flatten the results
y_true, y_pred = flatten_results(data_small, entity_types)

y_true

['06-184755-866851-3',
 '',
 'optimization',
 '',
 '',
 '',
 '',
 '',
 '78b5r2mvfahj48500',
 '',
 '',
 'omer',
 '',
 '',
 '',
 '',
 '',
 'intersex person',
 'kattie',
 '158centimeters',
 '72']

In [47]:
# calculate metrics for each entity
metrics = {}
# NOTE: y_true and y_pred hold entity *values* (e.g. 'omer'), one per (record, entity type) pair.
# The comparisons below match those values against the lowercased entity *type name*
# (e.g. 'firstname'), so both binary lists contain only zeros unless a value happens to
# equal a type name. With zero_division=0, every per-entity score is then reported as 0.
for entity in entity_types:
    entity_true = [1 if y == entity.lower() else 0 for y in y_true]
    entity_pred = [1 if y == entity.lower() else 0 for y in y_pred]

    precision = precision_score(entity_true, entity_pred, zero_division=0)
    recall = recall_score(entity_true, entity_pred, zero_division=0)
    f1 = f1_score(entity_true, entity_pred, zero_division=0)

    metrics[entity] = {'precision': precision, 'recall': recall, 'f1_score': f1}

# calculate overall metrics: y_true and y_pred are treated as multiclass string labels,
# so the micro average equals the share of exact string matches, including positions
# where both values are empty
overall_precision = precision_score(y_true, y_pred, average='micro', zero_division=0)
overall_recall = recall_score(y_true, y_pred, average='micro', zero_division=0)
overall_f1 = f1_score(y_true, y_pred, average='micro', zero_division=0)

# add the overall metrics to the metrics dictionary
metrics['overall'] = {'precision': overall_precision, 'recall': overall_recall, 'f1_score': overall_f1}

# metrics in a formatted way
for entity, entity_metrics in metrics.items():
    print(f"Entity: {entity}")
    print(f" Precision: {entity_metrics['precision']:.2f}")
    print(f" Recall: {entity_metrics['recall']:.2f}")
    print(f" F1-Score: {entity_metrics['f1_score']:.2f}\n")



Entity: PHONEIMEI
 Precision: 0.00
 Recall: 0.00
 F1-Score: 0.00

Entity: VEHICLEVIN
 Precision: 0.00
 Recall: 0.00
 F1-Score: 0.00

Entity: JOBAREA
 Precision: 0.00
 Recall: 0.00
 F1-Score: 0.00

Entity: GENDER
 Precision: 0.00
 Recall: 0.00
 F1-Score: 0.00

Entity: FIRSTNAME
 Precision: 0.00
 Recall: 0.00
 F1-Score: 0.00

Entity: HEIGHT
 Precision: 0.00
 Recall: 0.00
 F1-Score: 0.00

Entity: AGE
 Precision: 0.00
 Recall: 0.00
 F1-Score: 0.00

Entity: overall
 Precision: 0.71
 Recall: 0.71
 F1-Score: 0.71
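
The per-entity scores above are all 0.00 because the code compares entity values with entity type names, and the overall score counts positions where both sides are empty as matches. One possible alternative, sketched below in plain Python, counts an exact match as a true positive, a wrong or spurious prediction as a false positive, and a missed or wrong reference value as a false negative. The function name `per_entity_scores` and the toy records are assumptions for illustration, not results from this notebook.

```python
def per_entity_scores(true_dicts, pred_dicts, entity_types):
    """Exact-match precision, recall, and F1 per entity type."""
    scores = {}
    for entity in entity_types:
        tp = fp = fn = 0
        for true, pred in zip(true_dicts, pred_dicts):
            actual = true.get(entity, "").lower()
            predicted = pred.get(entity, "").lower()
            if actual and predicted == actual:
                tp += 1
            else:
                if predicted:  # predicted something wrong or spurious
                    fp += 1
                if actual:     # reference value missed or mismatched
                    fn += 1
            # both empty: true negative, not counted
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[entity] = {"precision": precision, "recall": recall, "f1_score": f1}
    return scores


# Toy records (hypothetical), mimicking the shape of data_results / model_results.
true_dicts = [{"FIRSTNAME": "Omer"}, {"FIRSTNAME": "Kattie", "AGE": "72"}]
pred_dicts = [{"FIRSTNAME": "om ##er"}, {"FIRSTNAME": "kattie", "AGE": "72"}]

scores = per_entity_scores(true_dicts, pred_dicts, ["FIRSTNAME", "AGE"])
for entity, s in scores.items():
    print(entity, s)
```

Exact matching is strict: the unmerged `'om ##er'` counts as both a false positive and a false negative, so token merging or offset-based extraction would need to happen before this step.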



In [48]:
print(y_true[:20])  # Shows the first 20 actual values
print(y_pred[:20])  # Shows the first 20 predicted values


['06-184755-866851-3', '', 'optimization', '', '', '', '', '', '78b5r2mvfahj48500', '', '', 'omer', '', '', '', '', '', 'intersex person', 'kattie', '158centimeters']
['06 - 1847 ##55 - 86 ##6 ##85 ##1 - 3', '', 'optimization', '', '', '', '', '', '', '', '', 'om ##er', '', '', '', '', '', 'inter ##se ##x person', 'kat ##tie', '158 ##cent ##imeters']
