<a href="https://colab.research.google.com/github/felipecasali-usp/mba-tcc-identify-lgpdsensitive-data/blob/main/02_transformers_pii_identification_mixed_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

!mkdir data
!mkdir models

Mounted at /content/drive


# Fine Tuning Transformer for PII Identification (LGPD)

### Introduction

In this tutorial we will be fine tuning a transformer model for the **PII Identification (LGPD)** problem.
Currently there is a big challenge in identifying sensitive data in datasets. This is part of the TCC for the USP MBA in IA & BigData.

#### Flow of the notebook

The notebook will be divided into seperate sections to provide a organized walk through for the process used. This process can be modified for individual use cases. The sections are:

1. [Importing Python Libraries and preparing the environment](#section01)
2. [Importing and Pre-Processing the domain data](#section02)
3. [Preparing the Dataset and Dataloader](#section03)
4. [Creating the Neural Network for Fine Tuning](#section04)
5. [Fine Tuning the Model](#section05)
6. [Validating the Model Performance](#section06)
7. [Saving the model and artifacts for Inference in Future](#section07)

#### Technical Details

This script leverages on multiple tools designed by other teams. Details of the tools used below. Please ensure that these elements are present in your setup to successfully implement this script.

 - Data:
	 - We are using the News aggregator dataset available at by [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/News+Aggregator)
	 - We are referring only to the first csv file from the data dump: `newsCorpora.csv`
	 - There are `422937` rows of data.  Where each row has the following data-point:
		 - ID Numeric ID  
		 - TITLE News title  
		 - URL Url  
		 - PUBLISHER Publisher name  
		 - CATEGORY News category (b = business, t = science and technology, e = entertainment, m = health)  
		 - STORY Alphanumeric ID of the cluster that includes news about the same story  
		 - HOSTNAME Url hostname  
		 - TIMESTAMP Approximate time the news was published, as the number of milliseconds since the epoch 00:00:00 GMT, January 1, 1970


 - Language Model Used:
	 - DistilBERT this is a smaller transformer model as compared to BERT or Roberta. It is created by process of distillation applied to Bert.
	 - [Blog-Post](https://medium.com/huggingface/distilbert-8cf3380435b5)
	 - [Research Paper](https://arxiv.org/abs/1910.01108)
     - [Documentation for python](https://huggingface.co/transformers/model_doc/distilbert.html)


 - Hardware Requirements:
	 - Python 3.6 and above
	 - Pytorch, Transformers and All the stock Python ML Libraries
	 - GPU enabled setup


 - Script Objective:
	 - The objective of this script is to fine tune DistilBERT to be able to classify a news headline into the following categories:
		 - Business
		 - Technology
		 - Health
		 - Entertainment


<a id='section01'></a>
### Importing Python Libraries and preparing the environment

At this step we will be importing the libraries and modules needed to run our script. Libraries are:
* Pandas
* Pytorch
* Pytorch Utils for Dataset and Dataloader
* Transformers
* DistilBERT Model and Tokenizer

Followed by that we will preapre the device for CUDA execeution. This configuration is needed if you want to leverage on onboard GPU.

In [2]:
import pandas as pd
import random
import hashlib

# Load the original dataset
file_path = "/content/drive/MyDrive/Colab Notebooks/data/K3241.K03200Y9.D30708.SOCIOCSV"
df = pd.read_csv(file_path, sep=';')

df.head()

Unnamed: 0,CNPJ_BASICO,COD_IDENT_SOCIO,NOME_SOCIO,CNPJCPF_SOCIO,QUALIFIC_SOCIO,DATA_INGRESSO,PAIS,REP_LEGAL,NOME_REP_LEGAL,QUALIFIC_REP_LEGAL,COD_FAIXA_ETARIA
0,1879005,2,LUCIANO FONSECA,***477633**,16,20050912,,***000000**,,0,7
1,1879008,2,ERNESTO ODONE ALVES CUNHA,***594908**,49,19970513,,***000000**,,0,6
2,1879008,2,ELAINE APARECIDA DE ALMEIDA CUNHA,***989406**,22,20060116,,***000000**,,0,5
3,5204955,2,EMERSON SANTIAGO,***432268**,49,20060130,,***000000**,,0,5
4,5204955,2,DANIELA COSTA SANTIAGO,***246208**,49,20151127,,***000000**,,0,5


In [3]:
# Function to generate fictitious but valid CNPJ numbers
def generate_fake_cnpj():
    base_numbers = [str(random.randint(0, 9)) for _ in range(8)]
    base_numbers.extend(['0', '0', '0', '1'])
    verification_digits = [0, 0]

    # Calculate the first verification digit
    for i in range(12):
        verification_digits[0] += int(base_numbers[i]) * (5 - (i % 4))

    verification_digits[0] = (verification_digits[0] % 11)
    if verification_digits[0] < 2:
        verification_digits[0] = 0
    else:
        verification_digits[0] = 11 - verification_digits[0]

    base_numbers.append(str(verification_digits[0]))

    # Calculate the second verification digit
    for i in range(13):
        verification_digits[1] += int(base_numbers[i]) * (6 - (i % 5))

    verification_digits[1] = (verification_digits[1] % 11)
    if verification_digits[1] < 2:
        verification_digits[1] = 0
    else:
        verification_digits[1] = 11 - verification_digits[1]

    base_numbers.append(str(verification_digits[1]))

    return ''.join(base_numbers)

def gerar_cpf_hash():
    # Gerar os 9 primeiros dígitos aleatórios do CPF
    cpf = [random.randint(0, 9) for _ in range(9)]

    # Cálculo dos dígitos verificadores
    # Primeiro dígito verificador
    d1 = 0
    for i, digit in enumerate(cpf):
        d1 += digit * (10 - i)
    d1 = 11 - (d1 % 11)
    d1 = d1 if d1 < 10 else 0
    cpf.append(d1)

    # Segundo dígito verificador
    d2 = 0
    for i, digit in enumerate(cpf):
        d2 += digit * (11 - i)
    d2 = 11 - (d2 % 11)
    d2 = d2 if d2 < 10 else 0
    cpf.append(d2)

    # Formatar o CPF no formato XXX.XXX.XXX-XX
    cpf_str = f"{cpf[0]}{cpf[1]}{cpf[2]}.{cpf[3]}{cpf[4]}{cpf[5]}.{cpf[6]}{cpf[7]}{cpf[8]}-{cpf[9]}{cpf[10]}"

    # Aplicando a função de hash SHA-256 ao CPF
    cpf_formatado = hashlib.sha256(cpf_str.encode()).hexdigest()

    return cpf_str

# Create a new column with fictitious CPFs
new_cpf_column = [gerar_cpf_hash() for _ in range(len(df))]
df.insert(0, "SOCIO_CPF", new_cpf_column)

# Create a new column with fictitious CNPJ numbers
new_cnpj_column = [generate_fake_cnpj() for _ in range(len(df))]
df.insert(0, "CNPJ", new_cnpj_column)

In [4]:
# List of Brazilian state codes
brazilian_states = [
    'AC', 'AL', 'AP', 'AM', 'BA', 'CE', 'DF', 'ES', 'GO', 'MA', 'MT', 'MS',
    'MG', 'PA', 'PB', 'PR', 'PE', 'PI', 'RJ', 'RN', 'RS', 'RO', 'RR', 'SC', 'SP', 'SE', 'TO'
]

# Add a new column with random Brazilian state codes
df['ESTADO'] = [random.choice(brazilian_states) for _ in range(len(df))]


In [5]:
# Save the modified DataFrame to a new CSV file
new_file_path = "./data/SOCIOS.modified.csv"
df.to_csv(new_file_path, sep=';', index=False, header=True)

In [6]:
# Select the first 500 rows
new_dataset = df.head(50000)

# Save the new dataset to a CSV file
new_dataset.to_csv('./data/SOCIOS.modified.small.csv', index=False)  # Replace with desired file name


In [7]:
!pip install transformers
# Importing the libraries needed
import pandas as pd
import torch
import transformers
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertModel, DistilBertTokenizer

Collecting transformers
  Downloading transformers-4.32.0-py3-none-any.whl (7.5 MB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/7.5 MB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.1/7.5 MB[0m [31m3.5 MB/s[0m eta [36m0:00:03[0m[2K     [91m━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.6/7.5 MB[0m [31m23.0 MB/s[0m eta [36m0:00:01[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m7.5/7.5 MB[0m [31m73.6 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.5/7.5 MB[0m [31m57.3 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m268.8/268.8 kB[0m [31m33.8 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (fr

In [8]:
# Setting up the device for GPU usage

from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

<a id='section02'></a>
### Importing and Pre-Processing the domain data

We will be working with the data and preparing for fine tuning purposes.
*Assuming that the `SOCIOS.modified.csv` is already downloaded in your `data` folder*

Import the file in a dataframe and give it the headers as per the documentation.
Cleaning the file to remove the unwanted columns and create an additional column for training.
The final Dataframe will be something like this:

CNPJ|SOCIO_CPF|NOME_SOCIO|ESTADO|chk_CNPJ|chk_SOCIO_CPF|chk_NOME_SOCIO|chk_ESTADO|
|--|--|--|--|--|--|--|--|
|xxx.xxx.xxx/xxxx-xx|xxx.xxx.xxx-xx| First Second Last |xx|--|--|--|--|

After mixing the classified data:

DATA|DATA_CLASSFIER|
|--|--|
|xxx.xxx.xxx/xxxx-xx|0|
|xxx.xxx.xxx-xx|1|
|AAAAA AAAAAA AAAAAA|2|
|AA|3|

In [9]:
# Função para criar a tabela "tipos_dados_sensiveis"
def criar_tipos_dados_sensiveis():
    tipos_dados_sensiveis_data = {
        'regex': [
            r'(?i)[A-Za-z]+ [A-Za-z]+',
            r'(?i)\d{3}\.\d{3}\.\d{3}-\d{2}',
            r'^(?!(\d)\1{13})\d{14}$',
            r'(?i)\(\d{2}\) \d{4,5}-\d{4}',
            r'(?i)Rua .+|Avenida .+|Travessa .+',
            r'\d{5}-\d{3}',
            r'^(?:AC|AL|AP|AM|BA|CE|DF|ES|GO|MA|MT|MS|MG|PA|PB|PR|PE|PI|RJ|RN|RS|RO|RR|SC|SP|SE|TO)$'
            # Adicione mais expressões regulares para outros tipos de dados sensíveis
        ],
        'tipo_dado': ['nome', 'cpf', 'cnpj', 'telefone', 'logradouro', 'cep', 'estado']
        # Adicione mais tipos de dados sensíveis aqui
    }
    return pd.DataFrame(tipos_dados_sensiveis_data)

# Chamando a função para criar a tabela "tipos_dados_sensiveis"
tipos_dados_sensiveis = criar_tipos_dados_sensiveis()

print(tipos_dados_sensiveis)

                                               regex   tipo_dado
0                            (?i)[A-Za-z]+ [A-Za-z]+        nome
1                      (?i)\d{3}\.\d{3}\.\d{3}-\d{2}         cpf
2                             ^(?!(\d)\1{13})\d{14}$        cnpj
3                        (?i)\(\d{2}\) \d{4,5}-\d{4}    telefone
4                  (?i)Rua .+|Avenida .+|Travessa .+  logradouro
5                                        \d{5}-\d{3}         cep
6  ^(?:AC|AL|AP|AM|BA|CE|DF|ES|GO|MA|MT|MS|MG|PA|...      estado


In [10]:
import re

# Import the csv into pandas dataframe and add the headers
df = pd.read_csv('./data/SOCIOS.modified.small.csv', sep=',')

# # Removing unwanted columns and only leaving title of news and the category which will be the target
df = df[['CNPJ','SOCIO_CPF','NOME_SOCIO','ESTADO']]
df.head()

Unnamed: 0,CNPJ,SOCIO_CPF,NOME_SOCIO,ESTADO
0,18570236000123,068.586.116-30,LUCIANO FONSECA,RJ
1,67747063000162,033.341.745-30,ERNESTO ODONE ALVES CUNHA,AM
2,79457697000136,320.025.791-13,ELAINE APARECIDA DE ALMEIDA CUNHA,PR
3,31442633000196,560.387.250-08,EMERSON SANTIAGO,CE
4,41603778000184,386.913.490-93,DANIELA COSTA SANTIAGO,AL


In [11]:
for column in df.columns:
    # Iterar sobre as expressões regulares da tabela tipos_dados_sensiveis
    for _, row in tipos_dados_sensiveis.iterrows():
        regex_name = row["tipo_dado"]
        regex_expr = row["regex"]
        # Aplicar a expressão regular em cada registro da coluna atual e criar a nova coluna
        df[f"chk_{column}_{regex_name}"] = df[column].apply(lambda x: str(regex_name) if  re.search(regex_expr, str(x)) is not None else "0")

# Visualizar o DataFrame df após a análise
df.head()


Unnamed: 0,CNPJ,SOCIO_CPF,NOME_SOCIO,ESTADO,chk_CNPJ_nome,chk_CNPJ_cpf,chk_CNPJ_cnpj,chk_CNPJ_telefone,chk_CNPJ_logradouro,chk_CNPJ_cep,...,chk_NOME_SOCIO_logradouro,chk_NOME_SOCIO_cep,chk_NOME_SOCIO_estado,chk_ESTADO_nome,chk_ESTADO_cpf,chk_ESTADO_cnpj,chk_ESTADO_telefone,chk_ESTADO_logradouro,chk_ESTADO_cep,chk_ESTADO_estado
0,18570236000123,068.586.116-30,LUCIANO FONSECA,RJ,0,0,cnpj,0,0,0,...,0,0,0,0,0,0,0,0,0,estado
1,67747063000162,033.341.745-30,ERNESTO ODONE ALVES CUNHA,AM,0,0,cnpj,0,0,0,...,0,0,0,0,0,0,0,0,0,estado
2,79457697000136,320.025.791-13,ELAINE APARECIDA DE ALMEIDA CUNHA,PR,0,0,cnpj,0,0,0,...,0,0,0,0,0,0,0,0,0,estado
3,31442633000196,560.387.250-08,EMERSON SANTIAGO,CE,0,0,cnpj,0,0,0,...,0,0,0,0,0,0,0,0,0,estado
4,41603778000184,386.913.490-93,DANIELA COSTA SANTIAGO,AL,0,0,cnpj,0,0,0,...,0,0,0,0,0,0,0,0,0,estado


In [12]:
df.to_csv('./data/SOCIOS.classified.csv',index=False)

In [13]:
# Find and remove columns filled with only zeros
non_zero_columns = [col for col in df.columns if not (df[col] == "0").all()]
df_filtered = df[non_zero_columns]

df_filtered.head()

Unnamed: 0,CNPJ,SOCIO_CPF,NOME_SOCIO,ESTADO,chk_CNPJ_cnpj,chk_SOCIO_CPF_cpf,chk_NOME_SOCIO_nome,chk_ESTADO_estado
0,18570236000123,068.586.116-30,LUCIANO FONSECA,RJ,cnpj,cpf,nome,estado
1,67747063000162,033.341.745-30,ERNESTO ODONE ALVES CUNHA,AM,cnpj,cpf,nome,estado
2,79457697000136,320.025.791-13,ELAINE APARECIDA DE ALMEIDA CUNHA,PR,cnpj,cpf,nome,estado
3,31442633000196,560.387.250-08,EMERSON SANTIAGO,CE,cnpj,cpf,nome,estado
4,41603778000184,386.913.490-93,DANIELA COSTA SANTIAGO,AL,cnpj,cpf,nome,estado


In [14]:
df_filtered.to_csv('./data/SOCIOS.classified.non-zero.csv',index=False)

In [15]:
import pandas as pd
pd = pd.DataFrame(df_filtered)

# Converting the codes to appropriate categories using a dictionary
my_dict = {
    'cnpj':'0',
    'cpf':'1',
    'nome':'2',
    'estado':'3'
}

# Iterate over columns and replace values
for col in pd.columns:
    if col.startswith('chk_'):
        pd[col] = pd[col].apply(lambda x: my_dict.get(x, x))

pd.head()

Unnamed: 0,CNPJ,SOCIO_CPF,NOME_SOCIO,ESTADO,chk_CNPJ_cnpj,chk_SOCIO_CPF_cpf,chk_NOME_SOCIO_nome,chk_ESTADO_estado
0,18570236000123,068.586.116-30,LUCIANO FONSECA,RJ,0,1,2,3
1,67747063000162,033.341.745-30,ERNESTO ODONE ALVES CUNHA,AM,0,1,2,3
2,79457697000136,320.025.791-13,ELAINE APARECIDA DE ALMEIDA CUNHA,PR,0,1,2,3
3,31442633000196,560.387.250-08,EMERSON SANTIAGO,CE,0,1,2,3
4,41603778000184,386.913.490-93,DANIELA COSTA SANTIAGO,AL,0,1,2,3


In [16]:
pd.dtypes

CNPJ                    int64
SOCIO_CPF              object
NOME_SOCIO             object
ESTADO                 object
chk_CNPJ_cnpj          object
chk_SOCIO_CPF_cpf      object
chk_NOME_SOCIO_nome    object
chk_ESTADO_estado      object
dtype: object

In [17]:
# Converting Multiple columns to int

# Convert columns starting with 'chk_' to integers
for col in pd.columns:
    if col.startswith('chk_'):
        pd = pd.astype({col:"int"})


In [18]:
pd.dtypes

CNPJ                    int64
SOCIO_CPF              object
NOME_SOCIO             object
ESTADO                 object
chk_CNPJ_cnpj           int64
chk_SOCIO_CPF_cpf       int64
chk_NOME_SOCIO_nome     int64
chk_ESTADO_estado       int64
dtype: object

In [19]:
import pandas as pd

# Load the original dataset
file_path = "./data/SOCIOS.classified.non-zero.csv"
unique_dataset  = pd.read_csv(file_path, sep=',')

In [20]:
# Splitting into data and classifier DataFrames
data_columns = ['CNPJ', 'SOCIO_CPF', 'NOME_SOCIO', 'ESTADO']
data_dataset = unique_dataset[data_columns]

classifier_columns = ['chk_CNPJ_cnpj', 'chk_SOCIO_CPF_cpf', 'chk_NOME_SOCIO_nome', 'chk_ESTADO_estado']
classifier_dataset = unique_dataset[classifier_columns]

# Print or use the resulting data and classifier DataFrames
print("Data Dataset:")
print(data_dataset)

print("\nClassifier Dataset:")
print(classifier_dataset)

Data Dataset:
                 CNPJ       SOCIO_CPF                         NOME_SOCIO  \
0      18570236000123  068.586.116-30                    LUCIANO FONSECA   
1      67747063000162  033.341.745-30          ERNESTO ODONE ALVES CUNHA   
2      79457697000136  320.025.791-13  ELAINE APARECIDA DE ALMEIDA CUNHA   
3      31442633000196  560.387.250-08                   EMERSON SANTIAGO   
4      41603778000184  386.913.490-93             DANIELA COSTA SANTIAGO   
...               ...             ...                                ...   
49995  71557210000190  808.072.016-93   ANTONIO ALEXANDRE RIBEIRO VIEIRA   
49996  88517715000199  381.902.551-00            ALEXANDRE BRASIL VIEIRA   
49997  99483637000131  456.638.303-24                SARAH BRASIL VIEIRA   
49998  64704720000192  360.371.265-00                   LANI SIAN VARGAS   
49999  91543430000177  973.215.453-50             MARCELO SILVEIRA GOMES   

      ESTADO  
0         RJ  
1         AM  
2         PR  
3         CE 

In [21]:
# Create a new DataFrame for the combined dataset
combined_data = []
combined_classifier = []

# Iterate through each pair of data and classifier columns
for data_col, classifier_col in zip(data_dataset.columns, classifier_dataset.columns):
    data_values = data_dataset[data_col]
    classifier_values = classifier_dataset[classifier_col]

    # Add the values to the combined lists
    combined_data.extend(data_values)
    combined_classifier.extend(classifier_values)

# Create a new DataFrame from the combined lists
combined_dataset = pd.DataFrame({'Data': combined_data, 'Classifier': combined_classifier})

# Print or use the combined dataset
print(combined_dataset)

                  Data Classifier
0       18570236000123       cnpj
1       67747063000162       cnpj
2       79457697000136       cnpj
3       31442633000196       cnpj
4       41603778000184       cnpj
...                ...        ...
199995              RJ     estado
199996              AL     estado
199997              RS     estado
199998              SC     estado
199999              GO     estado

[200000 rows x 2 columns]


In [22]:
combined_dataset.dtypes

Data          object
Classifier    object
dtype: object

In [38]:
import pandas as pd
pd = combined_dataset

# Converting the codes to appropriate categories using a dictionary
my_dict = {
    'cnpj':'0',
    'cpf':'1',
    'nome':'2',
    'estado':'3'
}

# Iterate over columns and replace values
for col in pd.columns:
    if col.startswith('Classifier'):
        pd[col] = pd[col].apply(lambda x: my_dict.get(x, x))


pd.head(5)


Unnamed: 0,Data,Classifier
0,18570236000123,0
1,67747063000162,0
2,79457697000136,0
3,31442633000196,0
4,41603778000184,0


In [39]:
pd.tail(5)

Unnamed: 0,Data,Classifier
199995,RJ,3
199996,AL,3
199997,RS,3
199998,SC,3
199999,GO,3


In [24]:
pd.to_csv('./data/SOCIOS_02.mixed.csv',sep=";",index=False)

<a id='section03'></a>
### Preparing the Dataset and Dataloader

We will start with defining few key variables that will be used later during the training/fine tuning stage.
Followed by creation of Dataset class - This defines how the text is pre-processed before sending it to the neural network. We will also define the Dataloader that will feed  the data in batches to the neural network for suitable training and processing.
Dataset and Dataloader are constructs of the PyTorch library for defining and controlling the data pre-processing and its passage to neural network. For further reading into Dataset and Dataloader read the [docs at PyTorch](https://pytorch.org/docs/stable/data.html)

#### *Triage* Dataset Class
- This class is defined to accept the Dataframe as input and generate tokenized output that is used by the DistilBERT model for training.
- We are using the DistilBERT tokenizer to tokenize the data in the `TITLE` column of the dataframe.
- The tokenizer uses the `encode_plus` method to perform tokenization and generate the necessary outputs, namely: `ids`, `attention_mask`
- To read further into the tokenizer, [refer to this document](https://huggingface.co/transformers/model_doc/distilbert.html#distilberttokenizer)
- `target` is the encoded category on the news headline.
- The *Triage* class is used to create 2 datasets, for training and for validation.
- *Training Dataset* is used to fine tune the model: **80% of the original data**
- *Validation Dataset* is used to evaluate the performance of the model. The model has not seen this data during training.

#### Dataloader
- Dataloader is used to for creating training and validation dataloader that load data to the neural network in a defined manner. This is needed because all the data from the dataset cannot be loaded to the memory at once, hence the amount of dataloaded to the memory and then passed to the neural network needs to be controlled.
- This control is achieved using the parameters such as `batch_size` and `max_len`.
- Training and Validation dataloaders are used in the training and validation part of the flow respectively

In [25]:
# Defining some key variables that will be used later on in the training
MAX_LEN = 512
TRAIN_BATCH_SIZE = 4
VALID_BATCH_SIZE = 2
EPOCHS = 2
LEARNING_RATE = 1e-05
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/411 [00:00<?, ?B/s]

In [26]:
import pandas as df

file_path = "./data/SOCIOS_02.mixed.csv"
df = df.read_csv(file_path, sep=';')

df.count()

Data          199998
Classifier    200000
dtype: int64

In [27]:
print(df.head(10))

             Data  Classifier
0  18570236000123           0
1  67747063000162           0
2  79457697000136           0
3  31442633000196           0
4  41603778000184           0
5  14962482000156           0
6   9531936000102           0
7  36127770000100           0
8  37075367000146           0
9  91712461000114           0


In [28]:
class Triage(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __getitem__(self, index):
        data_a = str(self.data.Data[index])
        data_a = " ".join(data_a.split())
        inputs = self.tokenizer.encode_plus(
            data_a,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True,
            truncation=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'targets': torch.tensor(self.data.Classifier[index], dtype=torch.long)
        }

    def __len__(self):
        return self.len

In [29]:
# Creating the dataset and dataloader for the neural network

train_size = 0.8
train_dataset=df.sample(frac=train_size,random_state=200)
test_dataset=df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)


print("FULL Dataset: {}".format(df.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

training_set = Triage(train_dataset, tokenizer, MAX_LEN)
testing_set = Triage(test_dataset, tokenizer, MAX_LEN)

FULL Dataset: (200000, 2)
TRAIN Dataset: (160000, 2)
TEST Dataset: (40000, 2)


In [30]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

<a id='section04'></a>
### Creating the Neural Network for Fine Tuning

#### Neural Network
 - We will be creating a neural network with the `DistillBERTClass`.
 - This network will have the DistilBERT Language model followed by a `dropout` and finally a `Linear` layer to obtain the final outputs.
 - The data will be feed to the DistilBERT Language model as defined in the dataset.
 - Final layer outputs is what will be compared to the `encoded category` to determine the accuracy of models prediction.
 - We will initiate an instance of the network called `model`. This instance will be used for training and then to save the final trained model for future inference.

#### Loss Function and Optimizer
 - `Loss Function` and `Optimizer` and defined in the next cell.
 - The `Loss Function` is used the calculate the difference in the output created by the model and the actual output.
 - `Optimizer` is used to update the weights of the neural network to improve its performance.

#### Further Reading
- You can refer to my [Pytorch Tutorials](https://github.com/abhimishra91/pytorch-tutorials) to get an intuition of Loss Function and Optimizer.
- [Pytorch Documentation for Loss Function](https://pytorch.org/docs/stable/nn.html#loss-functions)
- [Pytorch Documentation for Optimizer](https://pytorch.org/docs/stable/optim.html)
- Refer to the links provided on the top of the notebook to read more about DistiBERT.

In [31]:
# Creating the customized model, by adding a drop out and a dense layer on top of distil bert to get the final output for the model.

class DistillBERTClass(torch.nn.Module):
    def __init__(self):
        super(DistillBERTClass, self).__init__()
        self.l1 = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.3)
        self.classifier = torch.nn.Linear(768, 4)

    def forward(self, input_ids, attention_mask):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

In [32]:
model = DistillBERTClass()
model.to(device)

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

DistillBERTClass(
  (l1): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(i

In [33]:
# Creating the loss function and optimizer
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

<a id='section05'></a>
### Fine Tuning the Model

After all the effort of loading and preparing the data and datasets, creating the model and defining its loss and optimizer. This is probably the easier steps in the process.

Here we define a training function that trains the model on the training dataset created above, specified number of times (EPOCH), An epoch defines how many times the complete data will be passed through the network.

Following events happen in this function to fine tune the neural network:
- The dataloader passes data to the model based on the batch size.
- Subsequent output from the model and the actual category are compared to calculate the loss.
- Loss value is used to optimize the weights of the neurons in the network.
- After every 5000 steps the loss value is printed in the console.

As you can see just in 1 epoch by the final step the model was working with a miniscule loss of 0.0002485 i.e. the output is extremely close to the actual output.

In [34]:
# Function to calculate the accuracy of the model

def calcuate_accu(big_idx, targets):
    n_correct = (big_idx==targets).sum().item()
    return n_correct

In [35]:
# Defining the training function on the 80% of the dataset for tuning the distilbert model

def train(epoch):
    global tr_loss, n_correct
    tr_loss = 0
    n_correct = 0
    nb_tr_steps = 0
    nb_tr_examples = 0
    model.train()
    for _,data in enumerate(training_loader, 0):
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.long)

        outputs = model(ids, mask)
        loss = loss_function(outputs, targets)
        tr_loss += loss.item()
        big_val, big_idx = torch.max(outputs.data, dim=1)
        n_correct += calcuate_accu(big_idx, targets)

        nb_tr_steps += 1
        nb_tr_examples+=targets.size(0)

        if _%5000==0:
            loss_step = tr_loss/nb_tr_steps
            accu_step = (n_correct*100)/nb_tr_examples
            print(f"Training Loss per 5000 steps: {loss_step}")
            print(f"Training Accuracy per 5000 steps: {accu_step}")

        optimizer.zero_grad()
        loss.backward()
        # # When using GPU
        optimizer.step()

    print(tr_loss)
    print(f'The Total Accuracy for Epoch {epoch}: {(n_correct*100)/nb_tr_examples}')
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Training Loss Epoch: {epoch_loss}")
    print(f"Training Accuracy Epoch: {epoch_accu}")

    return

In [41]:
for epoch in range(EPOCHS):
    train(epoch)

Training Loss per 5000 steps: 0.02039811573922634
Training Accuracy per 5000 steps: 100.0
Training Loss per 5000 steps: 0.006118703328943032
Training Accuracy per 5000 steps: 99.9000199960008
Training Loss per 5000 steps: 0.003738875281423613
Training Accuracy per 5000 steps: 99.93750624937506
Training Loss per 5000 steps: 0.0029770986164525115
Training Accuracy per 5000 steps: 99.95166988867409
Training Loss per 5000 steps: 0.0025202601131663325
Training Accuracy per 5000 steps: 99.9600019999
Training Loss per 5000 steps: 0.002214353880748805
Training Accuracy per 5000 steps: 99.9670013199472
Training Loss per 5000 steps: 0.001845638442674356
Training Accuracy per 5000 steps: 99.97250091663611
Training Loss per 5000 steps: 0.0019132595478872578
Training Accuracy per 5000 steps: 99.97142938773177
70.98590587247016
The Total Accuracy for Epoch 0: 99.97375
Training Loss Epoch: 0.001774647646811754
Training Accuracy Epoch: 99.97375
Training Loss per 5000 steps: 1.430509428246296e-06
Train

<a id='section06'></a>
### Validating the Model

During the validation stage we pass the unseen data(Testing Dataset) to the model. This step determines how good the model performs on the unseen data.

This unseen data is the 20% of `SOCIOS.csv` which was seperated during the Dataset creation stage.
During the validation stage the weights of the model are not updated. Only the final output is compared to the actual value. This comparison is then used to calculate the accuracy of the model.

As you can see the model is predicting the correct category of a given headline to a 99.9% accuracy.

In [42]:
def valid(model, testing_loader):
    global tr_loss, nb_tr_steps, nb_tr_examples
    nb_tr_steps = 0
    nb_tr_examples = 0
    model.eval()
    n_correct = 0; n_wrong = 0; total = 0
    with torch.no_grad():
        for _, data in enumerate(testing_loader, 0):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.long)
            outputs = model(ids, mask).squeeze()
            loss = loss_function(outputs, targets)
            tr_loss += loss.item()
            big_val, big_idx = torch.max(outputs.data, dim=1)
            n_correct += calcuate_accu(big_idx, targets)

            nb_tr_steps += 1
            nb_tr_examples+=targets.size(0)

            if _%1000==0:
                loss_step = tr_loss/nb_tr_steps
                accu_step = (n_correct*100)/nb_tr_examples
                print(f"Validation Loss per 100 steps: {loss_step}")
                print(f"Validation Accuracy per 100 steps: {accu_step}")
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Validation Loss Epoch: {epoch_loss}")
    print(f"Validation Accuracy Epoch: {epoch_accu}")

    return epoch_accu


In [43]:
print('This is the validation section to print the accuracy and see how it performs')
print('Here we are leveraging on the dataloader crearted for the validation dataset, the approcah is using more of pytorch')

acc = valid(model, testing_loader)
print("Accuracy on test data = %0.2f%%" % acc)

This is the validation section to print the accuracy and see how it performs
Here we are leveraging on the dataloader crearted for the validation dataset, the approcah is using more of pytorch
Validation Loss per 100 steps: 23.81480569151008
Validation Accuracy per 100 steps: 100.0
Validation Loss per 100 steps: 0.023792478849550606
Validation Accuracy per 100 steps: 100.0
Validation Loss per 100 steps: 0.014688755031693184
Validation Accuracy per 100 steps: 99.97501249375313
Validation Loss per 100 steps: 0.009795672744130358
Validation Accuracy per 100 steps: 99.98333888703766
Validation Loss per 100 steps: 0.0073477578148661805
Validation Accuracy per 100 steps: 99.98750312421895
Validation Loss per 100 steps: 0.005878801158768331
Validation Accuracy per 100 steps: 99.99000199960008
Validation Loss per 100 steps: 0.005958796872603313
Validation Accuracy per 100 steps: 99.98333611064822
Validation Loss per 100 steps: 0.005349670273999593
Validation Accuracy per 100 steps: 99.97857448

<a id='section07'></a>
### Saving the Trained Model Artifacts for inference

This is the final step in the process of fine tuning the model.

The model and its vocabulary are saved locally. These files are then used in the future to make inference on new inputs of news headlines.

Please remember that a trained neural network is only useful when used in actual inference after its training.

In the lifecycle of an ML projects this is only half the job done. We will leave the inference of these models for some other day.

In [44]:
# Saving the files for re-use

output_model_file = '/content/drive/MyDrive/Colab Notebooks/models/pytorch_distilbert_lgpd.bin'
output_vocab_file = '/content/drive/MyDrive/Colab Notebooks/models/vocab_distilbert_lgpd.bin'

model_to_save = model
torch.save(model_to_save, output_model_file)
tokenizer.save_vocabulary(output_vocab_file)

print('All files saved')
print('This tutorial is completed')

All files saved
This tutorial is completed
