The NER tags used in this notebook come from the dataset introduced in the paper "A Dataset of GDPR Compliant NER for Privacy Policies". The dataset targets GDPR compliance: privacy policies are annotated with tags that represent key GDPR-related entities.

Here’s a breakdown of what this dataset includes:

- 33 GDPR-related tags were designed to capture important concepts, terms, and legal rights under GDPR.
- These tags are based on the Data Privacy Vocabulary (DPV) notation and include entities such as:
  - Personal Data (PD)
  - Consent (CONS)
  - Data Controller (DC)
  - Data Processor (DP)
  - Retention (RET)
  - Data Subject Rights (DSR), such as Right to Access (DSR15) and Right to Erasure (DSR17)
- The dataset was created by annotating privacy policy documents with these NER tags, aiming to assist in detecting and extracting key GDPR components from privacy policies. This allows for automated compliance checking by identifying whether a policy mentions the necessary GDPR elements, such as consent, data retention, and data subject rights.

The dataset is tailored to GDPR compliance and is useful for training NER models to automatically detect GDPR-related entities in privacy policy text.


Datasets like this typically draw on publicly available privacy policies from major websites, platforms, and services, chosen to cover a range of sectors and compliance efforts.

In [None]:
!pip install bertopic

Collecting bertopic
  Downloading bertopic-0.16.3-py3-none-any.whl.metadata (23 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.38.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting sentence-transformers>=0.4.1 (from bertopic)
  Downloading sentence_transformers-3.1.1-py3-none-any.whl.metadata (10 kB)
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap_learn-0.5.6-py3-none-any.whl.metadata (21 kB)
Collecting pynndescent>=0.5 (from umap-learn>=0.5.0->bertopic)
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Downloading bertopic-0.16.3-py3-none-any.whl (143 kB)
Downloading hdbscan-0.8.38.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.2 MB)

In [None]:
import json
import os
import string
from collections import Counter

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from google.colab import drive
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
# from bertopic import BERTopic

# Mount Google Drive so the dataset under /content/drive is readable
drive.mount('/content/drive')

# Show full cell contents instead of truncating long strings
pd.set_option('display.max_colwidth', None)

# Stopword list and tokenizer models used by NLTK
nltk.download('stopwords')
nltk.download('punkt')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### 1. Load the data

In [None]:
file_path = '/content/drive/MyDrive/compliance/ner/gdpr-compliant-ner.conll'

In [None]:
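def conll_to_dataframe(file_path):
    """Read a two-column CoNLL file (token, tag) into a DataFrame.

    Blank lines are kept as (None, None) rows so that the boundaries
    they mark remain visible in the resulting frame.
    """
    words = []
    tags = []

    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            if line.strip():
                # Expects exactly two whitespace-separated columns per line
                word, tag = line.split()
                words.append(word)
                tags.append(tag)
            else:
                # Blank line: record a placeholder row
                words.append(None)
                tags.append(None)

    df = pd.DataFrame({'Word': words, 'Tag': tags})

    return df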

In [None]:
df = conll_to_dataframe(file_path)
print(df.shape)

(271472, 2)


Named Entity Recognition (NER) tasks typically use tags like B-Entity (Beginning of an entity), I-Entity (Inside an entity), and O (Outside) to annotate words in a sentence.
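
To make the BIO scheme concrete, the sketch below groups a tagged token sequence into entity spans. The function name `bio_to_spans` and the sample sentence are illustrative assumptions, not part of the dataset's tooling. A stray `I-` tag without a matching predecessor is treated as the start of a new entity, which is one common convention rather than the only one.

```python
def bio_to_spans(words, tags):
    """Group BIO-tagged tokens into (label, text) entity spans."""
    spans = []
    label, buffer = None, []
    for word, tag in zip(words, tags):
        prefix, _, name = tag.partition("-")
        # An I- tag continues the open entity only if the label matches
        continues = prefix == "I" and name == label
        if not continues:
            if buffer:
                spans.append((label, " ".join(buffer)))
            label, buffer = None, []
            if prefix in ("B", "I"):
                label = name
        if label is not None:
            buffer.append(word)
    if buffer:
        spans.append((label, " ".join(buffer)))
    return spans


# Hypothetical example sentence, tagged by hand for illustration
words = ["We", "collect", "your", "email", "address", "."]
tags = ["B-DC", "B-P", "B-PD", "I-PD", "I-PD", "O"]
spans = bio_to_spans(words, tags)
print(spans)
```

Here `B-PD` opens a Personal Data entity, the two `I-PD` tokens extend it, and `O` closes it.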

In [None]:
df.tail()

Unnamed: 0,Word,Tag
271467,You,B-DS
271468,can,O
271469,also,O
271470,delete,B-P
271471,,


### 2. Check for missing values

In [None]:
print(df['Tag'].isna().sum())
print(df['Word'].isna().sum())

44
44


### 3. Drop the missing values

In [None]:
df_cleaned = df.dropna()

df_cleaned

Unnamed: 0,Word,Tag
0,Privacy,O
1,Policy,O
2,Effective,O
3,:,O
4,June,O
...,...,...
271466,.,O
271467,You,B-DS
271468,can,O
271469,also,O
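
Dropping the `None` rows also discards the blank-line boundaries recorded by the loader. If those boundaries matter later (for example, to avoid counting tag transitions across them), one option is to number the blocks before dropping. This is a minimal sketch on a toy frame; the column name `block_id` is hypothetical, and whether the blank lines in this file separate sentences or whole documents is an assumption that should be checked against the data.

```python
import pandas as pd

# Toy frame mimicking the loader's output: None rows mark blank lines
df = pd.DataFrame({
    "Word": ["Privacy", "Policy", None, "You", "can", None],
    "Tag": ["O", "O", None, "B-DS", "O", None],
})

# Every blank row increments the counter, so tokens between two
# blank rows share the same block id
df["block_id"] = df["Word"].isna().cumsum()

# Now the placeholder rows can be dropped without losing the boundaries
df_cleaned = df.dropna(subset=["Word", "Tag"]).reset_index(drop=True)
print(df_cleaned)
```

Later steps can then use `df_cleaned.groupby("block_id")` to process each block separately.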


In [None]:
eda_df = df_cleaned.copy()
eda_df.head()

Unnamed: 0,Word,Tag
0,Privacy,O
1,Policy,O
2,Effective,O
3,:,O
4,June,O


### 4. Tag Distribution

In [None]:
tag_distribution = eda_df['Tag'].value_counts()

fig = px.bar(x=tag_distribution.index, y=tag_distribution.values,
             labels={'x': 'NER Tags', 'y': 'Count'},
             title='NER Tag Distribution')

fig.update_layout(
    xaxis_title="NER Tags",
    yaxis_title="Count",
    xaxis_tickangle=-45,
    height=600,
    width=1000,
    title_font_size=16
)

fig.show()


In [None]:
# Exclude the dominant 'O' (outside) tag so the entity tags are readable
filtered_df = eda_df[eda_df['Tag'] != 'O']

tag_distribution = filtered_df['Tag'].value_counts()

fig = px.bar(x=tag_distribution.index, y=tag_distribution.values,
             labels={'x': 'NER Tags', 'y': 'Count'},
             title='NER Tag Distribution (excluding O)')

fig.update_layout(
    xaxis_title="NER Tags",
    yaxis_title="Count",
    xaxis_tickangle=-45,
    height=600,
    width=1000,
    title_font_size=16
)

fig.show()


  1. RP - Required Purpose

  2. PD - Personal Data

  3. OM - Organisational Measure

  4. P - Processing

  5. NPD - Non-Personal Data

  6. LI - Legitimate Interest

  7. RET - Retention

  8. TP - Third Party

  9. CONS - Consent

  10. DC - Data Controller

  11. R - Recipient

  12. DSO - Data Source

  13. LB - Legal Basis

  14. TM - Technical Measure

  15. SNEU - Scale Non-EU

  16. RI - Right

  17. CONT - Contract

  18. DS - Data Subject

  19. DP - Data Processor

  20. NRP - Not-Required Purpose

  21. DSR21 - Art. 21 Right to Object

  22. DSR15 - Art. 15 Right to Access by the Data Subject

  23. DSR17 - Art. 17 Right to Erasure

  24. A - Authority

  25. DPO - Data Protection Officer

  26. DSR18 - Art. 18 Right to Restriction of Processing

  27. SEU - Scale EU

  28. DSR16 - Art. 16 Right to Rectification

  29. DSR20 - Art. 20 Right to Data Portability

  30. ADM - Automated Decision Making

  31. LC - Lodge Complaint

  32. DSR19 - Art. 19 Notification Obligations
  
  33. DSR22 - Art. 22 Automated Individual Decision Making
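
The legend above can be kept as a lookup table so that plots and reports show readable names instead of abbreviations. This is a sketch: the names `TAG_DESCRIPTIONS` and `describe_tag` are hypothetical, and the descriptions are copied from the list above.

```python
import re

# Tag abbreviations and descriptions, copied from the legend above
TAG_DESCRIPTIONS = {
    "RP": "Required Purpose",
    "PD": "Personal Data",
    "OM": "Organisational Measure",
    "P": "Processing",
    "NPD": "Non-Personal Data",
    "LI": "Legitimate Interest",
    "RET": "Retention",
    "TP": "Third Party",
    "CONS": "Consent",
    "DC": "Data Controller",
    "R": "Recipient",
    "DSO": "Data Source",
    "LB": "Legal Basis",
    "TM": "Technical Measure",
    "SNEU": "Scale Non-EU",
    "RI": "Right",
    "CONT": "Contract",
    "DS": "Data Subject",
    "DP": "Data Processor",
    "NRP": "Not-Required Purpose",
    "DSR21": "Art. 21 Right to Object",
    "DSR15": "Art. 15 Right to Access by the Data Subject",
    "DSR17": "Art. 17 Right to Erasure",
    "A": "Authority",
    "DPO": "Data Protection Officer",
    "DSR18": "Art. 18 Right to Restriction of Processing",
    "SEU": "Scale EU",
    "DSR16": "Art. 16 Right to Rectification",
    "DSR20": "Art. 20 Right to Data Portability",
    "ADM": "Automated Decision Making",
    "LC": "Lodge Complaint",
    "DSR19": "Art. 19 Notification Obligations",
    "DSR22": "Art. 22 Automated Individual Decision Making",
}


def describe_tag(tag):
    """Map a BIO tag such as 'B-PD' to its description.

    'O' and any unknown tag are returned without the BIO prefix.
    """
    base = re.sub(r"^[BI]-", "", tag)
    return TAG_DESCRIPTIONS.get(base, base)


print(describe_tag("B-PD"))
print(describe_tag("I-DSR17"))
```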

In [None]:
# Strip the B-/I- prefix so both halves of an entity count toward one label
eda_df['Tag_new'] = eda_df['Tag'].str.replace(r'^B-|^I-', '', regex=True)

filtered_df = eda_df[eda_df['Tag_new'] != 'O']

tag_distribution = filtered_df['Tag_new'].value_counts()

fig = px.bar(x=tag_distribution.index, y=tag_distribution.values,
             labels={'x': 'NER Tags', 'y': 'Count'},
             title='Tag Distribution (B-/I- prefixes merged)')

fig.update_layout(
    xaxis_title="NER Tags",
    yaxis_title="Count",
    xaxis_tickangle=-45,
    height=600,
    width=1000,
    title_font_size=16
)

fig.show()


In [None]:
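# Under BIO tagging each entity starts with one B- tag, so counting
# B- tags per label gives the number of entities (not their lengths)
entity_counts = eda_df[eda_df['Tag'].str.startswith('B-')].groupby('Tag').size()

fig = px.bar(entity_counts,
             title='Entity Count by Tag',
             labels={'index': 'Entity Tags', 'value': 'Count'},
             height=600)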

fig.update_layout(
    xaxis_title="Entity Tags",
    yaxis_title="Count",
    xaxis_tickangle=-45
)

fig.show()

The dataset has a reported inter-annotator agreement of Cohen’s Kappa = 0.64, which is moderate-to-substantial agreement (on the commonly used Landis and Koch scale, 0.61 to 0.80 counts as "substantial"). The annotation is therefore reasonably consistent but not perfect, and some inconsistencies remain, potentially due to differences in sentence segmentation. Overall, it provides a useful foundation for future research and tools aimed at improving the accessibility and usability of privacy policies.
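
For reference, Cohen's Kappa compares the observed agreement between two annotators with the agreement expected by chance. The sketch below computes it for two toy tag sequences. The annotations are invented for illustration, and whether the reported 0.64 was computed per token or per entity span is not stated here, so the token-level setup is an assumption.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e) for two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty items")
    n = len(labels_a)

    # Observed agreement: share of items with identical labels
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical token-level annotations from two annotators
annotator_1 = ["O", "O", "B-PD", "I-PD", "O", "B-P", "O", "O"]
annotator_2 = ["O", "O", "B-PD", "O", "O", "B-P", "O", "B-DS"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In this toy case the annotators agree on 6 of 8 tokens, but because `O` dominates both sequences, much of that agreement is expected by chance and the Kappa is noticeably lower than 0.75.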

In [None]:
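# Pair each tag with its successor. Zipping a plain list avoids hundreds of
# thousands of slow .iloc lookups. Blank rows were dropped earlier, so a few
# pairs span the boundaries those rows marked.
tag_sequence = eda_df['Tag'].tolist()
transitions = list(zip(tag_sequence, tag_sequence[1:]))
transition_counts = Counter(transitions)
print(transition_counts)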

Counter({('O', 'O'): 161023, ('I-RP', 'I-RP'): 20892, ('I-PD', 'I-PD'): 7889, ('I-OM', 'I-OM'): 5586, ('B-PD', 'I-PD'): 3944, ('O', 'B-PD'): 3758, ('I-LI', 'I-LI'): 3615, ('I-RET', 'I-RET'): 3383, ('I-PD', 'O'): 3362, ('I-NPD', 'I-NPD'): 2759, ('O', 'B-P'): 2752, ('B-P', 'O'): 2168, ('B-RP', 'I-RP'): 1843, ('I-RP', 'O'): 1576, ('I-CONS', 'I-CONS'): 1540, ('O', 'B-DC'): 1490, ('I-DSO', 'I-DSO'): 1454, ('I-TP', 'I-TP'): 1380, ('O', 'B-RP'): 1352, ('I-R', 'I-R'): 1269, ('I-P', 'I-P'): 1230, ('I-LB', 'I-LB'): 1173, ('B-DC', 'O'): 1033, ('O', 'B-TP'): 997, ('I-TM', 'I-TM'): 911, ('B-TP', 'I-TP'): 885, ('I-CONT', 'I-CONT'): 862, ('O', 'B-NPD'): 857, ('B-NPD', 'I-NPD'): 851, ('I-TP', 'O'): 804, ('I-RI', 'I-RI'): 759, ('O', 'B-CONS'): 712, ('I-NPD', 'O'): 702, ('I-DC', 'I-DC'): 677, ('O', 'B-TM'): 673, ('I-SNEU', 'I-SNEU'): 622, ('B-PD', 'O'): 615, ('O', 'B-R'): 550, ('O', 'B-DS'): 550, ('I-NRP', 'I-NRP'): 527, ('B-P', 'B-PD'): 508, ('I-DSR21', 'I-DSR21'): 493, ('B-CONS', 'I-CONS'): 489, ('B-R

This output summarizes how often each tag is followed by another. Frequent pairs such as ('B-PD', 'I-PD') confirm well-formed multi-token entities, while pairs such as ('B-P', 'B-PD') show entities that sit directly next to each other. Checking these transitions helps validate the tag structure before training an NER model for GDPR compliance checking.
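
One way to turn the transition counts into a validation step is to flag pairs that break the BIO scheme, where an `I-X` tag may only follow `B-X` or `I-X`. This is a minimal sketch. The helper name `is_valid_bio_transition` is hypothetical, the first four counts are taken from the output above, and the last entry is an invented invalid pair added purely to show the check firing.

```python
from collections import Counter


def is_valid_bio_transition(prev_tag, next_tag):
    """Return True if next_tag may legally follow prev_tag under BIO."""
    if not next_tag.startswith("I-"):
        return True  # 'O' and any 'B-*' tag may follow anything
    label = next_tag[2:]
    return prev_tag in (f"B-{label}", f"I-{label}")


transition_counts = Counter({
    ("O", "O"): 161023,
    ("B-PD", "I-PD"): 3944,
    ("O", "B-PD"): 3758,
    ("B-P", "B-PD"): 508,
    ("O", "I-PD"): 2,  # hypothetical invalid pair, not from the real output
})

# Keep only the transitions that violate the scheme
invalid = {
    pair: count
    for pair, count in transition_counts.items()
    if not is_valid_bio_transition(*pair)
}
print(invalid)
```

An empty result on the real counts would suggest the annotation follows BIO consistently; any hits point to rows worth inspecting by hand.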

In [None]:
from transformers import BertTokenizer, BertForTokenClassification
import torch

# This checkpoint was fine-tuned on CoNLL-2003 (PER, ORG, LOC, MISC),
# so it has no GDPR-specific labels and serves only as a baseline.
model_name = 'dbmdz/bert-large-cased-finetuned-conll03-english'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)

policy_text = "Your privacy is important. We collect personal data such as your name, email, etc."
inputs = tokenizer(policy_text, return_tensors="pt", truncation=True, padding=True)

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# Pick the highest-scoring label id for every token
predicted_tags = torch.argmax(logits, dim=2)

# Map ids back to token strings and label names
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
tags = [model.config.id2label[tag_id.item()] for tag_id in predicted_tags[0]]

token_tag_pairs = list(zip(tokens, tags))
token_tag_pairs





Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[('[CLS]', 'O'),
 ('Your', 'O'),
 ('privacy', 'O'),
 ('is', 'O'),
 ('important', 'O'),
 ('.', 'O'),
 ('We', 'O'),
 ('collect', 'O'),
 ('personal', 'O'),
 ('data', 'O'),
 ('such', 'O'),
 ('as', 'O'),
 ('your', 'O'),
 ('name', 'O'),
 (',', 'O'),
 ('email', 'O'),
 (',', 'O'),
 ('etc', 'O'),
 ('.', 'O'),
 ('[SEP]', 'O')]

In [None]:
import spacy
from spacy.training.example import Example

# Start from a blank English pipeline (a pre-trained one could be fine-tuned instead)
nlp = spacy.blank("en")

# Add the NER component if the pipeline does not have one yet
if "ner" not in nlp.pipe_names:
    ner = nlp.add_pipe("ner", last=True)
else:
    ner = nlp.get_pipe("ner")

# Register every label that appears in the training data
ner.add_label("PERSONAL_DATA")
ner.add_label("CONSENT")

# Training data: (text, {"entities": [(start_char, end_char, label)]}), end exclusive
TRAIN_DATA = [
    # "personal data" covers characters 11 to 24. The earlier end offset of 23
    # cut off the final "a", which is what produced the W030 warning.
    ("We collect personal data such as your name and email.", {"entities": [(11, 24, "PERSONAL_DATA")]}),
    ("The consent form must be signed.", {"entities": [(4, 11, "CONSENT")]}),
    # More training examples here
]

# Initialize the weights. The blank pipeline contains only "ner",
# so no other components need to be disabled during training.
optimizer = nlp.initialize()
for i in range(10):  # Number of training iterations
    losses = {}
    for text, annotations in TRAIN_DATA:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotations)
        nlp.update([example], sgd=optimizer, drop=0.2, losses=losses)
    print(f"Losses at iteration {i}: {losses}")

# Test the trained model
test_text = "We collect personal data such as your name."
doc = nlp(test_text)
for ent in doc.ents:
    print(ent.text, ent.label_)



[W030] Some entities could not be aligned in the text "We collect personal data such as your name and ema..." with entities "[(11, 23, 'PERSONAL_DATA')]". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.



Losses at iteration 0: {'ner': 13.967061430215836}
Losses at iteration 1: {'ner': 13.443223297595978}
Losses at iteration 2: {'ner': 12.113514438271523}
Losses at iteration 3: {'ner': 9.587828852236271}
Losses at iteration 4: {'ner': 5.677579149603844}
Losses at iteration 5: {'ner': 2.814053500071168}
Losses at iteration 6: {'ner': 1.8879609094001353}
Losses at iteration 7: {'ner': 1.6659003438930995}
Losses at iteration 8: {'ner': 1.7120502286935562}
Losses at iteration 9: {'ner': 1.797586587169679}
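
The W030 warning above came from a hand-counted end offset that was one character short. Deriving offsets from the phrase itself avoids this kind of off-by-one error. The helper below is a sketch with a hypothetical name, `find_entity`, and it only handles the first occurrence of a phrase. spaCy additionally requires the offsets to fall on token boundaries, which this helper does not check.

```python
def find_entity(text, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase in text.

    The end offset is exclusive, matching spaCy's entity annotation format.
    """
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return (start, start + len(phrase), label)


text = "We collect personal data such as your name and email."
span = find_entity(text, "personal data", "PERSONAL_DATA")
print(span)  # (11, 24, 'PERSONAL_DATA')

consent_span = find_entity("The consent form must be signed.", "consent", "CONSENT")
print(consent_span)  # (4, 11, 'CONSENT')
```

The alignment itself can then be confirmed with `spacy.training.offsets_to_biluo_tags`, as the warning message suggests.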
