# Article 4

## Purpose

This lesson aims to build a model from natural language text and to show how to prepare the data for training that model.



## Topic

To test this, I used a dataset from Kaggle that contains emails labeled as spam or not spam.

In [1]:
! pip install transformers datasets evaluate huggingface_hub



In [2]:
from huggingface_hub import notebook_login
import pandas as pd
from datasets import Dataset,DatasetDict
from transformers import AutoTokenizer,TrainingArguments,Trainer,AutoModelForSequenceClassification
import evaluate
import sklearn
import numpy as np



After importing the necessary libraries, I loaded the CSV locally into a DataFrame so that we can modify the data if necessary.

In [3]:
df = pd.read_csv('spam_assassin.csv')
df

Unnamed: 0,text,target
0,From ilug-admin@linux.ie Mon Jul 29 11:28:02 2...,0
1,From gort44@excite.com Mon Jun 24 17:54:21 200...,1
2,From fork-admin@xent.com Mon Jul 29 11:39:57 2...,1
3,From dcm123@btamail.net.cn Mon Jun 24 17:49:23...,1
4,From ilug-admin@linux.ie Mon Aug 19 11:02:47 2...,0
...,...,...
5791,From ilug-admin@linux.ie Mon Jul 22 18:12:45 2...,0
5792,From fork-admin@xent.com Mon Oct 7 20:37:02 20...,0
5793,Received: from hq.pro-ns.net (localhost [127.0...,1
5794,From razor-users-admin@lists.sourceforge.net T...,0


In [4]:
df.describe(include='object')

Unnamed: 0,text
count,5796
unique,5329
top,Return-Path: ler@lerami.lerctr.org Delivery-Da...
freq,6


Here I cast the `target` column to float and renamed it to `label`, in preparation for fine-tuning the pretrained model later on. Afterwards we can inspect the data to check that everything is right.

In [5]:
df = df.astype({'target':'float'})
df2 = df.rename(columns={'target': 'label'})
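
The float cast matters because the model is later created with `num_labels=1`. With a single output, the Hugging Face sequence classification head is trained as a regression with a mean squared error loss, so the labels need to be floats. As a minimal sketch of that loss in plain Python (the function name and the example numbers are hypothetical, not taken from the training run):

```python
def mean_squared_error(predictions, labels):
    """Average of the squared differences between predictions and float labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical raw outputs for three emails and their float labels (1.0 = spam).
example_outputs = [0.9, 0.1, 1.2]
example_labels = [1.0, 0.0, 1.0]

loss = mean_squared_error(example_outputs, example_labels)
print(round(loss, 4))
```

Nothing in this loss keeps the output inside the 0 to 1 range, which is why raw predictions can land slightly above 1 or below 0.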

In [6]:
df2

Unnamed: 0,text,label
0,From ilug-admin@linux.ie Mon Jul 29 11:28:02 2...,0.0
1,From gort44@excite.com Mon Jun 24 17:54:21 200...,1.0
2,From fork-admin@xent.com Mon Jul 29 11:39:57 2...,1.0
3,From dcm123@btamail.net.cn Mon Jun 24 17:49:23...,1.0
4,From ilug-admin@linux.ie Mon Aug 19 11:02:47 2...,0.0
...,...,...
5791,From ilug-admin@linux.ie Mon Jul 22 18:12:45 2...,0.0
5792,From fork-admin@xent.com Mon Oct 7 20:37:02 20...,0.0
5793,Received: from hq.pro-ns.net (localhost [127.0...,1.0
5794,From razor-users-admin@lists.sourceforge.net T...,0.0


In [7]:
ds_train = Dataset.from_pandas(df2)
ds_train

Dataset({
    features: ['text', 'label'],
    num_rows: 5796
})

Now I converted the DataFrame into a Dataset so that the text can be turned into tokens, which is what allows the model to work with natural language.

When choosing the pretrained model, I tried to use the same one shown in the class, but I ran into some installation difficulties.
So I searched Hugging Face for other models and chose distilbert-base-uncased because it is faster, which suits a learning environment.

In [8]:
model_nm = 'distilbert-base-uncased'

Now we just need to tokenize the text using the tokenizer of the chosen model.

In [9]:
toks = AutoTokenizer.from_pretrained(model_nm)

In [11]:
def tokenizer_function(examples):
    return toks(examples["text"], truncation=True)

In [12]:
tokenized = ds_train.map(tokenizer_function, batched=True)

100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:05<00:00,  1.06ba/s]


Below we can see the columns that were created and an example of the first email before and after tokenization. If you go through it carefully and look at repeated words, you will see that they map to the same numbers.

In [13]:
tokenized

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 5796
})

In [14]:
row = tokenized[0]
row['text'], row['input_ids']

 [101,
  2013,
  6335,
  15916,
  1011,
  4748,
  10020,
  1030,
  11603,
  1012,
  29464,
  12256,
  21650,
  2756,
  2340,
  1024,
  2654,
  1024,
  6185,
  2526,
  2709,
  1011,
  4130,
  1024,
  1026,
  6335,
  15916,
  1011,
  4748,
  10020,
  1030,
  11603,
  1012,
  29464,
  1028,
  5359,
  1011,
  2000,
  1024,
  1061,
  2100,
  2100,
  2100,
  1030,
  2334,
  15006,
  2102,
  1012,
  5658,
  22074,
  2378,
  2278,
  1012,
  4012,
  2363,
  1024,
  2013,
  2334,
  15006,
  2102,
  1006,
  2334,
  15006,
  2102,
  1031,
  13029,
  1012,
  1014,
  1012,
  1014,
  1012,
  1015,
  1033,
  1007,
  2011,
  6887,
  16429,
  2891,
  1012,
  13625,
  1012,
  5658,
  22074,
  2378,
  2278,
  1012,
  4012,
  1006,
  2695,
  8873,
  2595,
  1007,
  2007,
  9686,
  20492,
  2361,
  8909,
  17350,
  29097,
  2683,
  22932,
  16932,
  2546,
  2005,
  1026,
  1046,
  2213,
  1030,
  2334,
  15006,
  2102,
  1028,
  1025,
  12256,
  1010,
  2756,
  21650,
  2526,
  5757,
  1024,
  2423,
  1024,
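
The repetition can also be checked programmatically. The sketch below copies the first 35 ids from the output above and searches for the nine ids that follow the first two tokens, presumably the sender address. The helper `find_occurrences` is a hypothetical name introduced only for this illustration.

```python
def find_occurrences(sequence, pattern):
    """Return every start index at which `pattern` appears inside `sequence`."""
    width = len(pattern)
    return [
        start
        for start in range(len(sequence) - width + 1)
        if sequence[start:start + width] == pattern
    ]


# First 35 ids copied from the output above.
input_ids = [
    101, 2013, 6335, 15916, 1011, 4748, 10020, 1030, 11603, 1012, 29464,
    12256, 21650, 2756, 2340, 1024, 2654, 1024, 6185, 2526, 2709, 1011,
    4130, 1024, 1026, 6335, 15916, 1011, 4748, 10020, 1030, 11603, 1012,
    29464, 1028,
]

# The nine ids that come right after the first two tokens.
address_ids = input_ids[2:11]

positions = find_occurrences(input_ids, address_ids)
print(positions)  # [2, 25]
```

The same run of ids appears at two positions, which is consistent with identical text always producing identical tokens.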

At this point we split the dataset randomly into one part for training and another for validation. The 75%/25% split was chosen arbitrarily.

In [15]:
dds = tokenized.train_test_split(0.25, seed=50)
dds

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 4347
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1449
    })
})
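
The row counts above follow directly from the 25% fraction. As a rough sketch of the idea of a seeded random split, in plain Python: this is not the algorithm used by `train_test_split`, so the rows selected would differ, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(num_rows, test_fraction, seed):
    """Shuffle row indices with a fixed seed and cut them into train/test lists."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    test_size = round(num_rows * test_fraction)
    return indices[test_size:], indices[:test_size]


train_idx, test_idx = split_indices(num_rows=5796, test_fraction=0.25, seed=50)
print(len(train_idx), len(test_idx))  # 4347 1449
```

Fixing the seed makes the split reproducible, so the same rows end up in the validation set on every run.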

With the dataset ready, I then prepared the metric that will be used to evaluate the model. Although the class used the Pearson metric, I decided to use the 'precision' metric available from Hugging Face to try a different approach. With other metrics, the model did not seem to be making the right predictions.

In [16]:
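def metrics(eval_pred):
    # Load the 'precision' metric from the Hugging Face evaluate library
    metric = evaluate.load('precision')
    # eval_pred holds the raw model outputs and the reference labels
    logits, labels = eval_pred
    # Limit the raw outputs to the [0, 1] range before scoring
    predictions = np.clip(logits, 0, 1)
    return metric.compute(predictions=predictions, references=labels)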
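
Precision is a classification metric: of all the emails predicted as spam, it is the fraction that really are spam. Since the model outputs a continuous score, one common way to get 0/1 predictions is to threshold the score. The sketch below assumes a 0.5 threshold, which is my assumption and not what the cell above does (it only clips). The function name and the example scores are hypothetical.

```python
def precision_at_threshold(scores, labels, threshold=0.5):
    """Precision = true positives / all predicted positives, after thresholding."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    true_positives = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    predicted_positives = sum(predicted)
    if predicted_positives == 0:
        # Nothing was predicted as spam, so precision is undefined; report 0.0.
        return 0.0
    return true_positives / predicted_positives


# Hypothetical scores for five emails and their true labels (1 = spam).
scores = [0.97, -0.01, 0.62, 0.04, 1.01]
labels = [1, 0, 0, 0, 1]

result = precision_at_threshold(scores, labels)
print(result)
```

Here three emails are predicted as spam and two of them are truly spam, so the precision is 2/3. Returning 0.0 when there are no positive predictions mirrors the default behaviour of scikit-learn, which also emits a warning in that case.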

Now it is time to define the training parameters. This part, as mentioned in the class, is fairly standard and does not need much modification.
The batch size and the number of epochs were found by trial and error until I reached good values: the batch size was increased until it used all of my GPU memory, and the number of epochs was chosen to avoid overfitting.
After that, we just start the training.

In [17]:
bs = 32
epochs = 4
lr = 8e-5
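
These values determine how many optimization steps the run takes. With the 4347 training rows from the split, a batch size of 32 and 4 epochs, the arithmetic gives 544 steps, which matches the "Total optimization steps" line in the training log (single device, no gradient accumulation). The sketch below also illustrates the general shape of a cosine schedule with warmup, as requested by `lr_scheduler_type='cosine'` and `warmup_ratio=0.1` in the training arguments. It is my own simplified version, not the library's code, and rounding the warmup step count up is an assumption.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps per epoch when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


def cosine_lr_with_warmup(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup up to `peak_lr`, then a half-cosine decay towards zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = steps_per_epoch(4347, 32) * 4
warmup_steps = math.ceil(0.1 * total_steps)  # assumption: rounded up

print(total_steps)  # 544
for step in (0, warmup_steps, total_steps):
    print(step, cosine_lr_with_warmup(step, total_steps, warmup_steps, 8e-5))
```

The learning rate starts at zero, reaches the peak of 8e-5 at the end of the warmup, and decays back towards zero by the last step.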

In [18]:
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine',
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)

trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=toks, compute_metrics=metrics)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'classifier

In [19]:
trainer.train();

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 4347
  Num Epochs = 4
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 544
  Number of trainable parameters = 66954241
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Precision
1,No log,0.030424,0.0
2,No log,0.012165,0.9975
3,No log,0.011293,0.997462
4,0.033700,0.008538,0.996689


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1449
  Batch size = 64
  _warn_prf(average, modifier, msg_start, len(result))

The model achieved very good precision! But to make sure this is not some kind of overfitting, I followed the usual practice and tested the model on a new CSV with different emails, taken from a different spam dataset. I selected some of them and applied the same data preparation as before.

In [37]:
test = pd.read_csv('spam_test.csv')
test2 = test.rename(columns={'target': 'label'})
ds_test = Dataset.from_pandas(test2)
test_tokenized = ds_test.map(tokenizer_function, batched=True)

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13.33ba/s]


In [40]:
test_tokenized

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 49
})

In [41]:
preds = trainer.predict(test_tokenized).predictions.astype(float)
preds

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 49
  Batch size = 64


array([[ 0.99556041],
       [-0.01485257],
       [ 0.95708787],
       [ 1.00250733],
       [-0.00619153],
       [-0.00544256],
       [ 1.00747418],
       [-0.0089712 ],
       [-0.00653905],
       [ 0.03442222],
       [-0.01448531],
       [ 0.00758614],
       [-0.00554579],
       [ 0.04764483],
       [ 1.00512576],
       [ 0.00123645],
       [-0.00685828],
       [-0.01745206],
       [-0.00856242],
       [ 0.00847327],
       [-0.01538832],
       [-0.01034953],
       [-0.01477222],
       [ 1.00492048],
       [ 1.0067848 ],
       [ 1.00312841],
       [ 1.01755738],
       [ 1.01044309],
       [ 1.000646  ],
       [ 0.99413502],
       [-0.01599978],
       [ 1.0028677 ],
       [-0.0136067 ],
       [-0.01726106],
       [ 0.01838026],
       [-0.00998145],
       [ 1.00364876],
       [-0.01431229],
       [-0.01103633],
       [ 1.01125586],
       [-0.01235565],
       [-0.01044114],
       [-0.01505763],
       [-0.00119178],
       [ 0.99303335],
       [-0

We can see that, although some numbers go above 1 or below 0, the result was quite accurate. Now we just need to process the predictions so that they show values between 0 and 1.
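
As a small illustration of that step in plain Python, the sketch below clips the first five raw predictions shown above and then, as an extra step that is not part of this notebook, turns them into spam / not-spam decisions with an assumed 0.5 threshold. The helper `clip` is a hypothetical stand-in for `np.clip`.

```python
def clip(value, low=0.0, high=1.0):
    """Limit `value` to the closed interval [low, high]."""
    return max(low, min(high, value))


# First five raw predictions copied from the output above.
raw_predictions = [0.99556041, -0.01485257, 0.95708787, 1.00250733, -0.00619153]

clipped = [clip(value) for value in raw_predictions]
# Assumption: scores of 0.5 or more are treated as spam (1), the rest as not spam (0).
decisions = [1 if value >= 0.5 else 0 for value in clipped]

print(clipped)    # [0.99556041, 0.0, 0.95708787, 1.0, 0.0]
print(decisions)  # [1, 0, 1, 1, 0]
```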

In [42]:
preds = np.clip(preds, 0, 1)
preds

array([[0.99556041],
       [0.        ],
       [0.95708787],
       [1.        ],
       [0.        ],
       [0.        ],
       [1.        ],
       [0.        ],
       [0.        ],
       [0.03442222],
       [0.        ],
       [0.00758614],
       [0.        ],
       [0.04764483],
       [1.        ],
       [0.00123645],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.00847327],
       [0.        ],
       [0.        ],
       [0.        ],
       [1.        ],
       [1.        ],
       [1.        ],
       [1.        ],
       [1.        ],
       [1.        ],
       [0.99413502],
       [0.        ],
       [1.        ],
       [0.        ],
       [0.        ],
       [0.01838026],
       [0.        ],
       [1.        ],
       [0.        ],
       [0.        ],
       [1.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.99303335],
       [0.        ],
       [0.        ],
       [0.   