# ----------- DistilBERT - finetuned_ai4privacy_v2 -----------

DistilBERT is a streamlined version of the BERT model, designed for faster training and inference. This makes it a suitable choice for applications with limited computational resources, such as deployments on off-cloud servers (Wei et al., 2022). As a distilled version of BERT, it retains 97% of BERT's language understanding capabilities while being 60% faster and 40% smaller (Sanh et al., 2019).

The model has demonstrated its efficacy across various applications, such as suicidal intention detection in social media posts (Ananthakrishnan et al., 2022) and opinion holder detection (Al-Mahmud & Shimada, 2022). This versatility is attributable to its robust foundational architecture derived from BERT, coupled with enhancements in efficiency and speed.

1. Wei, F., Yang, J., Mao, Q., Qin, H., & Dabrowski, A. (2022). An Empirical Comparison of DistilBERT, Longformer and Logistic Regression for Predictive Coding. 2022 IEEE International Conference on Big Data (Big Data). 

    - URL: https://ieeexplore.ieee.org/document/10020486
    
2. Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv. 

    - URL: https://www.semanticscholar.org/paper/DistilBERT%2C-a-distilled-version-of-BERT%3A-smaller%2C-Sanh-Debut/a54b56af24bb4873ed0163b77df63b92bd018ddc

3. Ananthakrishnan, G., Jayaraman, A. K., Trueman, T., Mitra, S., A K, A., & Murugappan, A. (2022). Suicidal Intention Detection in Tweets Using BERT-Based Transformers. 2022 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS). 

    - URL: https://ieeexplore.ieee.org/document/10037677

4. Al-Mahmud, & Shimada, K. (2022). Dataset Construction and Classification Based on Pre-trained Models for Opinion Holder Detection. 2022 12th International Congress on Advanced Applied Informatics (IIAI-AAI). 

    - URL: https://ieeexplore.ieee.org/document/9894524

### DistilBERT:
Paper: https://arxiv.org/abs/1910.01108

Huggingface: https://huggingface.co/distilbert-base-uncased

Documentation: https://huggingface.co/docs/transformers/main/en/model_doc/distilbert

### ai4privacy:
Huggingface: https://huggingface.co/ai4privacy

Model: https://huggingface.co/Isotonic/distilbert_finetuned_ai4privacy_v2

GitHub: https://github.com/Sripaad/ai4privacy


In [36]:
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score
from IPython.display import display, HTML

display(HTML("<style>.container { width:100% !important; }</style>"))


In [37]:
model_name = "Isotonic/distilbert_finetuned_ai4privacy_v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

In [38]:
nlp = pipeline("token-classification", model=model, tokenizer=tokenizer)

text = "My name is Max and im from Frankfurt. I study at the University of Mannheim."
results = nlp(text)

results


[{'entity': 'B-FIRSTNAME',
  'score': 0.86211693,
  'index': 4,
  'word': 'max',
  'start': 11,
  'end': 14},
 {'entity': 'B-STATE',
  'score': 0.9750011,
  'index': 8,
  'word': 'frankfurt',
  'start': 27,
  'end': 36},
 {'entity': 'B-STATE',
  'score': 0.9271666,
  'index': 16,
  'word': 'mannheim',
  'start': 67,
  'end': 75}]
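
The pipeline output above is reported per WordPiece token, so longer values arrive as several entries (a `B-` token followed by `I-` tokens, often with `##` continuation pieces). The following is a minimal sketch of how such token-level predictions could be merged back into character spans using only the `start`/`end` offsets. The helper name `merge_token_predictions` and the sample offsets for "Omer" are illustrative assumptions, not part of the notebook. The `transformers` token-classification pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs a comparable grouping internally.

```python
def merge_token_predictions(text, predictions):
    """Merge adjacent token predictions of the same entity type into spans."""
    spans = []
    for pred in predictions:
        # "B-FIRSTNAME" -> prefix "B", label "FIRSTNAME"
        prefix, _, label = pred["entity"].partition("-")
        last = spans[-1] if spans else None
        # Extend the previous span when this token continues it: same label,
        # not an explicit beginning (B-), and at most one character apart.
        if (last is not None and last["label"] == label and prefix != "B"
                and pred["start"] <= last["end"] + 1):
            last["end"] = pred["end"]
        else:
            spans.append({"label": label, "start": pred["start"], "end": pred["end"]})
    # Recover the surface text from the original string (keeps the casing)
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


# Hypothetical token-level output for the name "Omer" split into "om" + "##er"
sample_text = "Dear Omer, as per our records"
sample_predictions = [
    {"entity": "B-FIRSTNAME", "word": "om", "start": 5, "end": 7},
    {"entity": "I-FIRSTNAME", "word": "##er", "start": 7, "end": 9},
]
merged = merge_token_predictions(sample_text, sample_predictions)
print(merged)
# [{'label': 'FIRSTNAME', 'start': 5, 'end': 9, 'text': 'Omer'}]
```

Slicing the original text by offsets, rather than concatenating the `word` fields, avoids having to strip `##` markers and preserves the original capitalization.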

In [39]:
labels = model.config.id2label
print(labels)

{0: 'O', 1: 'B-PHONEIMEI', 2: 'I-PHONEIMEI', 3: 'B-JOBAREA', 4: 'B-FIRSTNAME', 5: 'I-FIRSTNAME', 6: 'B-VEHICLEVIN', 7: 'I-VEHICLEVIN', 8: 'B-AGE', 9: 'B-GENDER', 10: 'I-GENDER', 11: 'B-HEIGHT', 12: 'I-HEIGHT', 13: 'B-BUILDINGNUMBER', 14: 'I-BUILDINGNUMBER', 15: 'B-MASKEDNUMBER', 16: 'I-MASKEDNUMBER', 17: 'B-PASSWORD', 18: 'I-PASSWORD', 19: 'B-DOB', 20: 'I-DOB', 21: 'B-IPV6', 22: 'I-IPV6', 23: 'B-NEARBYGPSCOORDINATE', 24: 'I-NEARBYGPSCOORDINATE', 25: 'B-USERAGENT', 26: 'I-USERAGENT', 27: 'B-TIME', 28: 'I-TIME', 29: 'B-JOBTITLE', 30: 'I-JOBTITLE', 31: 'B-COUNTY', 32: 'B-EMAIL', 33: 'I-EMAIL', 34: 'B-ACCOUNTNUMBER', 35: 'I-ACCOUNTNUMBER', 36: 'B-PIN', 37: 'I-PIN', 38: 'B-EYECOLOR', 39: 'I-EYECOLOR', 40: 'B-LASTNAME', 41: 'I-LASTNAME', 42: 'I-JOBAREA', 43: 'B-IPV4', 44: 'I-IPV4', 45: 'B-DATE', 46: 'I-DATE', 47: 'B-STREET', 48: 'I-STREET', 49: 'B-CITY', 50: 'I-CITY', 51: 'B-PREFIX', 52: 'I-PREFIX', 53: 'B-CREDITCARDISSUER', 54: 'B-CREDITCARDNUMBER', 55: 'I-CREDITCARDNUMBER', 56: 'I-CREDITCA

In [40]:
data = pd.read_json("../../data/dataset_english.json")
data

Unnamed: 0,masked_text,unmasked_text,privacy_mask,span_labels,bio_labels,tokenised_text
0,A students assessment was found on device bear...,A students assessment was found on device bear...,"{'[PHONEIMEI_1]': '06-184755-866851-3', '[JOBA...","[[0, 57, O], [57, 75, PHONEIMEI_1], [75, 138, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, B-PHON...","[a, student, s, assessment, was, found, on, de..."
1,"Dear [FIRSTNAME_1], as per our records, your l...","Dear Omer, as per our records, your license 78...","{'[FIRSTNAME_1]': 'Omer', '[VEHICLEVIN_1]': '7...","[[0, 5, O], [5, 9, FIRSTNAME_1], [9, 44, O], [...","[O, B-FIRSTNAME, I-FIRSTNAME, O, O, O, O, O, O...","[dear, om, ##er, ,, as, per, our, records, ,, ..."
2,[FIRSTNAME_1] could you please share your reco...,Kattie could you please share your recomndatio...,"{'[FIRSTNAME_1]': 'Kattie', '[AGE_1]': '72', '...","[[0, 6, FIRSTNAME_1], [6, 75, O], [75, 77, AGE...","[B-FIRSTNAME, I-FIRSTNAME, O, O, O, O, O, O, O...","[kat, ##tie, could, you, please, share, your, ..."
3,Emergency supplies in [BUILDINGNUMBER_1] need ...,Emergency supplies in 16356 need a refill. Use...,"{'[BUILDINGNUMBER_1]': '16356', '[MASKEDNUMBER...","[[0, 22, O], [22, 27, BUILDINGNUMBER_1], [27, ...","[O, O, O, B-BUILDINGNUMBER, I-BUILDINGNUMBER, ...","[emergency, supplies, in, 1635, ##6, need, a, ..."
4,"The [AGE_1] old child at [BUILDINGNUMBER_1], h...","The 88 old child at 5862, has showcased an unu...","{'[AGE_1]': '88', '[BUILDINGNUMBER_1]': '5862'...","[[0, 4, O], [4, 6, AGE_1], [6, 20, O], [20, 24...","[O, B-AGE, O, O, O, B-BUILDINGNUMBER, I-BUILDI...","[the, 88, old, child, at, 58, ##6, ##2, ,, has..."
...,...,...,...,...,...,...
43496,"Hello [FIRSTNAME_1], your cognitive therapy ap...","Hello Nellie, your cognitive therapy appointme...","{'[FIRSTNAME_1]': 'Nellie', '[DATE_1]': '8/21'...","[[0, 6, O], [6, 12, FIRSTNAME_1], [12, 66, O],...","[O, B-FIRSTNAME, O, O, O, O, O, O, O, O, B-DAT...","[hello, nellie, ,, your, cognitive, therapy, a..."
43497,"Dear [FIRSTNAME_1], we appreciate your active ...","Dear Jalon, we appreciate your active involvem...","{'[FIRSTNAME_1]': 'Jalon', '[CREDITCARDNUMBER_...","[[0, 5, O], [5, 10, FIRSTNAME_1], [10, 159, O]...","[O, B-FIRSTNAME, I-FIRSTNAME, O, O, O, O, O, O...","[dear, ja, ##lon, ,, we, appreciate, your, act..."
43498,"Dear [SEX_1] at [ZIPCODE_1], we are raising fu...","Dear Female at 32363-2779, we are raising fund...","{'[SEX_1]': 'Female', '[ZIPCODE_1]': '32363-27...","[[0, 5, O], [5, 11, SEX_1], [11, 15, O], [15, ...","[O, B-SEX, O, B-ZIPCODE, I-ZIPCODE, I-ZIPCODE,...","[dear, female, at, 323, ##6, ##3, -, 277, ##9,..."
43499,"Hello [FIRSTNAME_1], we encourage you to pay t...","Hello Tito, we encourage you to pay the fees o...","{'[FIRSTNAME_1]': 'Tito', '[ETHEREUMADDRESS_1]...","[[0, 6, O], [6, 10, FIRSTNAME_1], [10, 137, O]...","[O, B-FIRSTNAME, O, O, O, O, O, O, O, O, O, O,...","[hello, tito, ,, we, encourage, you, to, pay, ..."


In [41]:
def convert_privacy_mask(privacy_mask):
    new_dict = {}
    for key, value in privacy_mask.items():
        new_key = key.strip('[]').split('_')[0]
        new_dict[new_key] = value
    return new_dict

data['data_results'] = data['privacy_mask'].apply(convert_privacy_mask)

data[['privacy_mask', 'data_results']].head()


Unnamed: 0,privacy_mask,data_results
0,"{'[PHONEIMEI_1]': '06-184755-866851-3', '[JOBA...","{'PHONEIMEI': '06-184755-866851-3', 'JOBAREA':..."
1,"{'[FIRSTNAME_1]': 'Omer', '[VEHICLEVIN_1]': '7...","{'FIRSTNAME': 'Omer', 'VEHICLEVIN': '78B5R2MVF..."
2,"{'[FIRSTNAME_1]': 'Kattie', '[AGE_1]': '72', '...","{'FIRSTNAME': 'Kattie', 'AGE': '72', 'GENDER':..."
3,"{'[BUILDINGNUMBER_1]': '16356', '[MASKEDNUMBER...","{'BUILDINGNUMBER': '16356', 'MASKEDNUMBER': '5..."
4,"{'[AGE_1]': '88', '[BUILDINGNUMBER_1]': '5862'...","{'AGE': '88', 'BUILDINGNUMBER': '5862', 'PASSW..."
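
Because `convert_privacy_mask` drops the numeric suffix, two placeholders of the same type in one row would collapse into a single key, and the later value would overwrite the earlier one. That does not matter when only the presence of a label is evaluated, but it loses values. The sketch below keeps every value in a list instead. The name `group_privacy_mask` and the example mask (including the `[FIRSTNAME_2]` placeholder) are hypothetical and only serve as an illustration.

```python
from collections import defaultdict


def group_privacy_mask(privacy_mask):
    """Map each entity type to the list of all values carrying that type."""
    grouped = defaultdict(list)
    for key, value in privacy_mask.items():
        # "[FIRSTNAME_2]" -> "FIRSTNAME_2" -> "FIRSTNAME"
        label = key.strip("[]").rsplit("_", 1)[0]
        grouped[label].append(value)
    return dict(grouped)


# Hypothetical mask with two placeholders of the same type
example_mask = {"[FIRSTNAME_1]": "Omer", "[FIRSTNAME_2]": "Kattie", "[AGE_1]": "72"}
print(group_privacy_mask(example_mask))
# {'FIRSTNAME': ['Omer', 'Kattie'], 'AGE': ['72']}
```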


In [42]:
data_small = data.head(100)

In [43]:
def extract_entities_and_words(model_output):
    extracted_data = []
    for item in model_output:
        # Strip the "B-" or "I-" prefix and extract the entity type and the word
        entity = item['entity'][2:]
        word = item['word']
        extracted_data.append({entity: word})
    return extracted_data

In [44]:
# Empty list to collect the simplified results
model_simplified_results = []

# Iterate over the rows of the DataFrame
for index, row in data_small.iterrows():
    text = row['unmasked_text']
    # Apply the model to the text
    result = nlp(text)
    # Simplify the model output and extract entity type and word
    simplified_and_extracted_result = extract_entities_and_words(result)
    # Append to the list
    model_simplified_results.append(simplified_and_extracted_result)

# Add the simplified results as a new column; assign() returns a new DataFrame,
# which avoids pandas' SettingWithCopyWarning on the head(100) slice
data_small = data_small.assign(model_results=model_simplified_results)

# Display the rows for inspection
data_small


Unnamed: 0,masked_text,unmasked_text,privacy_mask,span_labels,bio_labels,tokenised_text,data_results,model_results
0,A students assessment was found on device bear...,A students assessment was found on device bear...,"{'[PHONEIMEI_1]': '06-184755-866851-3', '[JOBA...","[[0, 57, O], [57, 75, PHONEIMEI_1], [75, 138, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, B-PHON...","[a, student, s, assessment, was, found, on, de...","{'PHONEIMEI': '06-184755-866851-3', 'JOBAREA':...","[{'PHONEIMEI': '06'}, {'PHONEIMEI': '-'}, {'PH..."
1,"Dear [FIRSTNAME_1], as per our records, your l...","Dear Omer, as per our records, your license 78...","{'[FIRSTNAME_1]': 'Omer', '[VEHICLEVIN_1]': '7...","[[0, 5, O], [5, 9, FIRSTNAME_1], [9, 44, O], [...","[O, B-FIRSTNAME, I-FIRSTNAME, O, O, O, O, O, O...","[dear, om, ##er, ,, as, per, our, records, ,, ...","{'FIRSTNAME': 'Omer', 'VEHICLEVIN': '78B5R2MVF...","[{'FIRSTNAME': 'om'}, {'FIRSTNAME': '##er'}, {..."
2,[FIRSTNAME_1] could you please share your reco...,Kattie could you please share your recomndatio...,"{'[FIRSTNAME_1]': 'Kattie', '[AGE_1]': '72', '...","[[0, 6, FIRSTNAME_1], [6, 75, O], [75, 77, AGE...","[B-FIRSTNAME, I-FIRSTNAME, O, O, O, O, O, O, O...","[kat, ##tie, could, you, please, share, your, ...","{'FIRSTNAME': 'Kattie', 'AGE': '72', 'GENDER':...","[{'FIRSTNAME': 'kat'}, {'FIRSTNAME': '##tie'},..."
3,Emergency supplies in [BUILDINGNUMBER_1] need ...,Emergency supplies in 16356 need a refill. Use...,"{'[BUILDINGNUMBER_1]': '16356', '[MASKEDNUMBER...","[[0, 22, O], [22, 27, BUILDINGNUMBER_1], [27, ...","[O, O, O, B-BUILDINGNUMBER, I-BUILDINGNUMBER, ...","[emergency, supplies, in, 1635, ##6, need, a, ...","{'BUILDINGNUMBER': '16356', 'MASKEDNUMBER': '5...","[{'ZIPCODE': '1635'}, {'ZIPCODE': '##6'}, {'SS..."
4,"The [AGE_1] old child at [BUILDINGNUMBER_1], h...","The 88 old child at 5862, has showcased an unu...","{'[AGE_1]': '88', '[BUILDINGNUMBER_1]': '5862'...","[[0, 4, O], [4, 6, AGE_1], [6, 20, O], [20, 24...","[O, B-AGE, O, O, O, B-BUILDINGNUMBER, I-BUILDI...","[the, 88, old, child, at, 58, ##6, ##2, ,, has...","{'AGE': '88', 'BUILDINGNUMBER': '5862', 'PASSW...","[{'AGE': '88'}, {'BUILDINGNUMBER': '58'}, {'BU..."
...,...,...,...,...,...,...,...,...
95,Need to make payment of [CURRENCYSYMBOL_1][AMO...,Need to make payment of ₩471362.04 for your pr...,"{'[CURRENCYSYMBOL_1]': '₩', '[AMOUNT_1]': '471...","[[0, 24, O], [24, 25, CURRENCYSYMBOL_1], [25, ...","[O, O, O, O, O, B-CURRENCYSYMBOL, O, O, I-AMOU...","[need, to, make, payment, of, ₩, ##47, ##13, #...","{'CURRENCYSYMBOL': '₩', 'AMOUNT': '471362.04',...","[{'CURRENCYSYMBOL': '₩'}, {'AMOUNT': '##6'}, {..."
96,"Customer - [USERNAME_1], as part of our recent...","Customer - Joe_Schuster53, as part of our rece...","{'[USERNAME_1]': 'Joe_Schuster53', '[PHONEIMEI...","[[0, 11, O], [11, 25, USERNAME_1], [25, 232, O...","[O, O, B-USERNAME, I-USERNAME, I-USERNAME, I-U...","[customer, -, joe, _, schuster, ##53, ,, as, p...","{'USERNAME': 'Joe_Schuster53', 'PHONEIMEI': '0...","[{'USERNAME': 'joe'}, {'USERNAME': '_'}, {'USE..."
97,"For our market research, we will be focusing o...","For our market research, we will be focusing o...","{'[COUNTY_1]': 'Jefferson County', '[IP_1]': '...","[[0, 48, O], [48, 64, COUNTY_1], [64, 168, O],...","[O, O, O, O, O, O, O, O, O, O, B-COUNTY, I-COU...","[for, our, market, research, ,, we, will, be, ...","{'COUNTY': 'Jefferson County', 'IP': '6d4c:a3e...","[{'COUNTY': 'jefferson'}, {'COUNTY': 'county'}..."
98,"Welcome [FIRSTNAME_1], our Cryptocurrency 101 ...","Welcome Aidan, our Cryptocurrency 101 course u...","{'[FIRSTNAME_1]': 'Aidan', '[LITECOINADDRESS_1...","[[0, 8, O], [8, 13, FIRSTNAME_1], [13, 92, O],...","[O, B-FIRSTNAME, O, O, O, O, O, O, O, O, O, O,...","[welcome, aidan, ,, our, crypt, ##oc, ##ur, ##...","{'FIRSTNAME': 'Aidan', 'LITECOINADDRESS': '37q...","[{'FIRSTNAME': 'aidan'}, {'BUILDINGNUMBER': '1..."


In [45]:
data_small[["unmasked_text", "masked_text", "data_results", "model_results"]]

Unnamed: 0,unmasked_text,masked_text,data_results,model_results
0,A students assessment was found on device bear...,A students assessment was found on device bear...,"{'PHONEIMEI': '06-184755-866851-3', 'JOBAREA':...","[{'PHONEIMEI': '06'}, {'PHONEIMEI': '-'}, {'PH..."
1,"Dear Omer, as per our records, your license 78...","Dear [FIRSTNAME_1], as per our records, your l...","{'FIRSTNAME': 'Omer', 'VEHICLEVIN': '78B5R2MVF...","[{'FIRSTNAME': 'om'}, {'FIRSTNAME': '##er'}, {..."
2,Kattie could you please share your recomndatio...,[FIRSTNAME_1] could you please share your reco...,"{'FIRSTNAME': 'Kattie', 'AGE': '72', 'GENDER':...","[{'FIRSTNAME': 'kat'}, {'FIRSTNAME': '##tie'},..."
3,Emergency supplies in 16356 need a refill. Use...,Emergency supplies in [BUILDINGNUMBER_1] need ...,"{'BUILDINGNUMBER': '16356', 'MASKEDNUMBER': '5...","[{'ZIPCODE': '1635'}, {'ZIPCODE': '##6'}, {'SS..."
4,"The 88 old child at 5862, has showcased an unu...","The [AGE_1] old child at [BUILDINGNUMBER_1], h...","{'AGE': '88', 'BUILDINGNUMBER': '5862', 'PASSW...","[{'AGE': '88'}, {'BUILDINGNUMBER': '58'}, {'BU..."
...,...,...,...,...
95,Need to make payment of ₩471362.04 for your pr...,Need to make payment of [CURRENCYSYMBOL_1][AMO...,"{'CURRENCYSYMBOL': '₩', 'AMOUNT': '471362.04',...","[{'CURRENCYSYMBOL': '₩'}, {'AMOUNT': '##6'}, {..."
96,"Customer - Joe_Schuster53, as part of our rece...","Customer - [USERNAME_1], as part of our recent...","{'USERNAME': 'Joe_Schuster53', 'PHONEIMEI': '0...","[{'USERNAME': 'joe'}, {'USERNAME': '_'}, {'USE..."
97,"For our market research, we will be focusing o...","For our market research, we will be focusing o...","{'COUNTY': 'Jefferson County', 'IP': '6d4c:a3e...","[{'COUNTY': 'jefferson'}, {'COUNTY': 'county'}..."
98,"Welcome Aidan, our Cryptocurrency 101 course u...","Welcome [FIRSTNAME_1], our Cryptocurrency 101 ...","{'FIRSTNAME': 'Aidan', 'LITECOINADDRESS': '37q...","[{'FIRSTNAME': 'aidan'}, {'BUILDINGNUMBER': '1..."


In [46]:

def prepare_data_for_evaluation(data_row):
    # Extract the actual entity types
    actual_entities = set(data_row['data_results'].keys())
    # Extract the entity types predicted by the model and drop duplicate entries
    predicted_entities = set([list(entity.keys())[0] for entity in data_row['model_results']])
    return actual_entities, predicted_entities

# Lists for the actual and predicted entity types
y_true = []
y_pred = []

for index, row in data_small.iterrows():
    actual, predicted = prepare_data_for_evaluation(row)
    y_true.append(actual)
    y_pred.append(predicted)

In [47]:
y_true

[{'JOBAREA', 'PHONEIMEI'},
 {'FIRSTNAME', 'VEHICLEVIN'},
 {'AGE', 'FIRSTNAME', 'GENDER', 'HEIGHT'},
 {'BUILDINGNUMBER', 'MASKEDNUMBER'},
 {'AGE', 'BUILDINGNUMBER', 'PASSWORD'},
 {'DOB', 'IPV6'},
 {'GENDER', 'PASSWORD'},
 {'NEARBYGPSCOORDINATE', 'USERAGENT'},
 {'FIRSTNAME', 'NEARBYGPSCOORDINATE', 'PASSWORD'},
 {'ACCOUNTNUMBER', 'COUNTY', 'EMAIL', 'EYECOLOR', 'JOBTITLE', 'PIN', 'TIME'},
 {'FIRSTNAME', 'LASTNAME', 'MASKEDNUMBER'},
 {'FIRSTNAME', 'IPV4', 'IPV6', 'JOBAREA', 'JOBTITLE'},
 {'BUILDINGNUMBER',
  'CITY',
  'DATE',
  'FIRSTNAME',
  'NEARBYGPSCOORDINATE',
  'STREET',
  'TIME'},
 {'COUNTY', 'MIDDLENAME', 'PHONEIMEI', 'PREFIX', 'TIME'},
 {'FIRSTNAME', 'PREFIX'},
 {'CREDITCARDISSUER', 'CREDITCARDNUMBER'},
 {'CREDITCARDNUMBER', 'IPV4', 'JOBTITLE'},
 {'NEARBYGPSCOORDINATE', 'STREET'},
 {'CREDITCARDISSUER', 'DOB', 'FIRSTNAME', 'MASKEDNUMBER'},
 {'IPV4', 'JOBAREA'},
 {'CITY', 'EMAIL', 'LASTNAME', 'PREFIX'},
 {'FIRSTNAME', 'LASTNAME', 'MIDDLENAME'},
 {'FIRSTNAME', 'STATE'},
 {'VEHICLEVIN'

In [48]:
y_pred

[{'JOBAREA', 'PHONEIMEI'},
 {'DOB', 'FIRSTNAME', 'SSN'},
 {'AGE', 'FIRSTNAME', 'GENDER', 'HEIGHT'},
 {'SSN', 'ZIPCODE'},
 {'AGE', 'BUILDINGNUMBER', 'PASSWORD'},
 {'DOB', 'IPV6'},
 {'GENDER', 'PASSWORD'},
 {'NEARBYGPSCOORDINATE', 'USERAGENT'},
 {'FIRSTNAME', 'NEARBYGPSCOORDINATE', 'PASSWORD'},
 {'COUNTY', 'EMAIL', 'EYECOLOR', 'JOBTITLE', 'PIN', 'SSN', 'TIME'},
 {'FIRSTNAME', 'LASTNAME', 'SSN'},
 {'FIRSTNAME', 'IP', 'IPV4', 'JOBAREA', 'JOBTITLE', 'MAC'},
 {'BUILDINGNUMBER',
  'CITY',
  'DATE',
  'FIRSTNAME',
  'NEARBYGPSCOORDINATE',
  'STREET',
  'TIME'},
 {'COUNTY', 'LASTNAME', 'PREFIX', 'SSN', 'TIME'},
 {'FIRSTNAME', 'PREFIX'},
 {'CREDITCARDISSUER', 'CREDITCARDNUMBER', 'SSN'},
 {'CREDITCARDNUMBER', 'IPV4', 'JOBTITLE', 'SSN'},
 {'NEARBYGPSCOORDINATE', 'STREET'},
 {'CREDITCARDISSUER', 'DATE', 'FIRSTNAME', 'SSN'},
 {'IPV4', 'JOBAREA'},
 {'COUNTY', 'EMAIL', 'LASTNAME', 'PREFIX'},
 {'FIRSTNAME', 'LASTNAME', 'MIDDLENAME'},
 {'FIRSTNAME', 'STATE'},
 {'CURRENCYCODE', 'DOB', 'GENDER', 'SSN', 'V

In [51]:
def calculate_metrics(data):
    TP = FP = FN = 0

    for index, row in data.iterrows():
        true_entities = set(row['data_results'].keys())
        predicted_entities = set(entity for entity_dict in row['model_results'] for entity in entity_dict.keys())

        TP += len(true_entities & predicted_entities)
        FP += len(predicted_entities - true_entities)
        FN += len(true_entities - predicted_entities)

    precision = TP / (TP + FP) if TP + FP > 0 else 0
    recall = TP / (TP + FN) if TP + FN > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0
        
    return precision, recall, f1, TP

# calculate_metrics returns four values (precision, recall, f1, TP)
precision, recall, f1, tp = calculate_metrics(data_small)
print(f"Precision: {precision}, Recall: {recall}, F1-Score: {f1}")


Precision: 0.7596439169139466, Recall: 0.8178913738019169, F1-Score: 0.7876923076923077
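
To make the counting concrete, the sketch below applies the same micro-averaged definitions to the first five rows of `y_true` and `y_pred` shown earlier. The function name `micro_prf` is illustrative. Note that this evaluation compares sets of entity types per row, so it measures whether a label was detected anywhere in the text, not whether the correct span was tagged.

```python
def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over per-row label sets."""
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))  # labels found in both
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))  # predicted but not annotated
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))  # annotated but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, (tp, fp, fn)


# First five rows of y_true and y_pred from the outputs above
true_rows = [
    {"JOBAREA", "PHONEIMEI"},
    {"FIRSTNAME", "VEHICLEVIN"},
    {"AGE", "FIRSTNAME", "GENDER", "HEIGHT"},
    {"BUILDINGNUMBER", "MASKEDNUMBER"},
    {"AGE", "BUILDINGNUMBER", "PASSWORD"},
]
pred_rows = [
    {"JOBAREA", "PHONEIMEI"},
    {"DOB", "FIRSTNAME", "SSN"},
    {"AGE", "FIRSTNAME", "GENDER", "HEIGHT"},
    {"SSN", "ZIPCODE"},
    {"AGE", "BUILDINGNUMBER", "PASSWORD"},
]

p, r, f, counts = micro_prf(true_rows, pred_rows)
print(counts)                            # (10, 4, 3)
print(f"{p:.3f}, {r:.3f}, {f:.3f}")      # 0.714, 0.769, 0.741
```

Rows 1 and 3 account for all errors here: each contributes two false positives, and together they contribute three false negatives.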


In [52]:
from collections import defaultdict

# Initialize a dictionary to store metrics for each entity type
metrics = defaultdict(lambda: {'TP': 0, 'FP': 0, 'FN': 0})

# Iterate through each row of the DataFrame
for index, row in data_small.iterrows():
    # Extract actual entities from 'data_results' and predicted entities from 'model_results'
    # 'data_results' is a dict (entity type -> value); 'model_results' is a list of one-entry dicts
    true_entities = set(row['data_results'].keys())
    predicted_entities = {entity for entity_dict in row['model_results'] for entity in entity_dict.keys()}

    # Calculate True Positives and False Negatives
    for entity in true_entities:
        if entity in predicted_entities:
            metrics[entity]['TP'] += 1  # True Positive: Entity is correctly predicted
        else:
            metrics[entity]['FN'] += 1  # False Negative: Entity is missed by the model

    # Calculate False Positives
    for entity in predicted_entities:
        if entity not in true_entities:
            metrics[entity]['FP'] += 1  # False Positive: Entity is incorrectly predicted

# Calculate Precision, Recall, and F1-Score for each entity type
for entity, counts in metrics.items():
    # Precision: Proportion of correct positive predictions
    precision = counts['TP'] / (counts['TP'] + counts['FP']) if counts['TP'] + counts['FP'] > 0 else 0

    # Recall: Proportion of actual positives that were correctly identified
    recall = counts['TP'] / (counts['TP'] + counts['FN']) if counts['TP'] + counts['FN'] > 0 else 0

    # F1-Score: Harmonic mean of Precision and Recall
    f1 = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0

    # Print the metrics for each entity
    print(f"{entity}: Precision = {precision}, Recall = {recall}, F1-Score = {f1}")