<a href="https://colab.research.google.com/github/viniciused26/fastaiOnCampus/blob/main/lesson04_160147816_viniciused26_vinicius_edwardo_pereira_oliveira.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Does this text belong to Lovecraft?
The idea of this project is to build a model that can distinguish Lovecraft's texts, based on excerpts from his books and short stories.

Setting up the environment.

In [1]:
import pandas as pd
import numpy as np

Organizing the dataset.

In [2]:
train_df = pd.read_excel('texts.xlsx')
train_df['is_lovecraft'] = train_df['is_lovecraft'].astype(float)
train_df['input'] = 'TEXT: ' + train_df.text 
train_df.head()

Unnamed: 0,text,is_lovecraft,input
0,I am forced into speech because men of science...,1.0,TEXT: I am forced into speech because men of s...
1,In the end I must rely on the judgment and sta...,1.0,TEXT: In the end I must rely on the judgment a...
2,"It is further against us that we are not, in t...",1.0,TEXT: It is further against us that we are not...
3,"The crowning abnormality, of course, was the c...",1.0,"TEXT: The crowning abnormality, of course, was..."
4,"But whatever had happened, it was hideous and ...",1.0,"TEXT: But whatever had happened, it was hideou..."


Tokenizing the text.

In [3]:
!pip install datasets
from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(train_df)
ds

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Dataset({
    features: ['text', 'is_lovecraft', 'input'],
    num_rows: 100
})

Using the same model as the lesson: https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners

In [4]:
model_nm = 'microsoft/deberta-v3-small'

In [5]:
!pip install --no-cache-dir transformers sentencepiece accelerate
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm, use_fast = False)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [6]:
tokz.tokenize("The animals were restless in their stalls, whickering and snorting at the scent of blood.")

['▁The',
 '▁animals',
 '▁were',
 '▁restless',
 '▁in',
 '▁their',
 '▁stalls',
 ',',
 '▁wh',
 'icker',
 'ing',
 '▁and',
 '▁snorting',
 '▁at',
 '▁the',
 '▁scent',
 '▁of',
 '▁blood',
 '.']

Let's represent the tokens numerically.

In [7]:
def tok_func(x): return tokz(x["input"])
tok_ds = ds.map(tok_func, batched=True)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

In [8]:
row = tok_ds[0]
row['input'], row['input_ids']

('TEXT: I am forced into speech because men of science have refused to follow my advice without knowing why.',
 [1,
  54453,
  294,
  273,
  481,
  2705,
  352,
  2890,
  401,
  842,
  265,
  1693,
  286,
  4977,
  264,
  1111,
  312,
  1678,
  497,
  2843,
  579,
  260,
  2])

In [9]:
tok_ds = tok_ds.rename_columns({'is_lovecraft':'labels'})

Using the correlation metric from the lesson: https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners

In [10]:
def corr(x,y): return np.corrcoef(x,y)[0][1]

In [11]:
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}
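To make the metric concrete, here is a minimal pure-Python sketch of the Pearson coefficient that `np.corrcoef` computes for two sequences. The function name `pearson` and the label and score lists are hypothetical, chosen only for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical labels and model scores, for illustration only.
labels = [1.0, 1.0, 0.0, 0.0]
good_scores = [0.9, 0.7, 0.2, 0.1]
inverted_scores = [0.1, 0.2, 0.7, 0.9]
halved_scores = [s * 0.5 for s in good_scores]

print(pearson(good_scores, labels))      # close to +1
print(pearson(inverted_scores, labels))  # close to -1
print(pearson(halved_scores, labels))    # same value as good_scores
```

Because the coefficient ignores positive rescaling, a high Pearson value says the scores are ordered like the labels, not that they sit near 0 and 1.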

Splitting the dataset for training.

In [12]:
dds = tok_ds.train_test_split(0.20, seed=42)
dds

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 80
    })
    test: Dataset({
        features: ['text', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 20
    })
})
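As a rough illustration of what a seeded 80/20 split does, the sketch below shuffles row indices with a fixed seed and cuts off the test portion. This is not the `datasets` implementation and will not reproduce the same rows; `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n_rows, test_fraction, seed):
    """Shuffle row indices with a fixed seed and cut off a test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    # Return (train indices, test indices).
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(100, 0.20, seed=42)
print(len(train_idx), len(test_idx))  # 80 20
```

Fixing the seed makes the split repeatable, so the same rows stay in the evaluation set across runs.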

In [13]:
from transformers import TrainingArguments,Trainer

In [14]:
bs = 64
epochs = 4
lr = 8e-5

In [16]:
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine',
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')
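The combination of `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` can be sketched with the usual formula: linear warmup, then a half-cosine decay. This is a hand-written approximation, not the library's code, and it assumes a single device with no gradient accumulation; `lr_at_step` is a hypothetical name.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr):
    """Linear warmup followed by a half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

steps_per_epoch = math.ceil(80 / 64)         # 80 training rows, batch size 64 -> 2
total_steps = steps_per_epoch * 4            # 4 epochs -> 8 optimizer steps
warmup_steps = math.ceil(0.1 * total_steps)  # warmup_ratio=0.1 -> 1 step

for step in range(total_steps + 1):
    print(step, f"{lr_at_step(step, total_steps, warmup_steps, 8e-5):.2e}")
```

With only 80 training rows and a batch size of 64, the whole run is about 8 optimizer steps under these assumptions, so the schedule is very short.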

In [17]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=corr_d)

Downloading pytorch_model.bin:   0%|          | 0.00/286M [00:00<?, ?B/s]

Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2ForSequenceClassification: ['lm_predictions.lm_head.dense.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.classifier.weight', 'mask_predictions.LayerNorm.bias', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.bias', 'mask_predictions.dense.bias', 'mask_predictions.classifier.bias']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from 

With the model created and the datasets defined, we create the trainer and run it.

In [18]:
trainer.train();



Epoch,Training Loss,Validation Loss,Pearson
1,No log,0.371415,0.47862
2,No log,0.246313,0.652252
3,No log,0.260755,0.819448
4,No log,0.267698,0.838853


Now we need a test dataset. We repeat the initial tokenization process. Note that this reloads the same `texts.xlsx` used for training, so the predictions below are not made on unseen data.

In [20]:
test_df = pd.read_excel('texts.xlsx')
test_df['input'] = 'TEXT: ' + test_df.text 
test_ds = Dataset.from_pandas(test_df)
test_tok_ds = test_ds.map(tok_func, batched=True)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Running the prediction.

In [22]:
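# predictions are continuous scores (num_labels=1); astype(int) truncates
# toward zero, so scores between -1 and 1 all become 0.
preds = trainer.predict(test_tok_ds).predictions.astype(int)
preds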

array([[0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
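An all-zero array is consistent with how float-to-int conversion works: it drops the fractional part, as Python's `int()` does, rather than rounding. The sketch below contrasts truncation with a 0.5 threshold. The scores are hypothetical, not the model's real outputs, and the 0.5 cut-off is an assumption.

```python
# Hypothetical raw regression outputs (num_labels=1), for illustration only.
raw_scores = [0.93, 0.61, 0.48, 0.07, -0.12, 1.04]

# Truncation drops the fractional part, so 0.93 becomes 0.
truncated = [int(s) for s in raw_scores]

# Thresholding at 0.5 (an assumed cut-off) maps scores to class 0 or 1.
thresholded = [1 if s >= 0.5 else 0 for s in raw_scores]

print(truncated)    # [0, 0, 0, 0, 0, 1]
print(thresholded)  # [1, 1, 0, 0, 0, 1]
```

With NumPy arrays, the equivalent threshold would be `(predictions >= 0.5).astype(int)`.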
    

In [29]:
prediction_res  = []
for i in preds:
    if i == 1:
        prediction_res.append('True')
    else:
        prediction_res.append('False')

In [36]:
test_results = []
test_df['is_lovecraft'] = test_df['is_lovecraft'].astype(str) 
for i in test_df['is_lovecraft']:
    test_results.append(i)

In [38]:
wrongs = 0 
for i in range(len(prediction_res)):
    if prediction_res[i] != test_results[i]:
        wrongs += 1
        print(f'{i}: {prediction_res[i]} != {test_results[i]} ')
print(f'\nError rate: {(wrongs/len(prediction_res))*100}%')

0: False != True 
1: False != True 
2: False != True 
3: False != True 
4: False != True 
5: False != True 
6: False != True 
7: False != True 
8: False != True 
9: False != True 
10: False != True 
11: False != True 
12: False != True 
13: False != True 
14: False != True 
15: False != True 
16: False != True 
17: False != True 
18: False != True 
19: False != True 
20: False != True 
21: False != True 
22: False != True 
23: False != True 
24: False != True 
25: False != True 
26: False != True 
27: False != True 
28: False != True 
29: False != True 
30: False != True 
31: False != True 
32: False != True 
33: False != True 
34: False != True 
35: False != True 
36: False != True 
37: False != True 
38: False != True 
39: False != True 
40: False != True 
41: False != True 
42: False != True 
43: False != True 
44: False != True 
45: False != True 
46: False != True 
47: False != True 
48: False != True 
49: False != True 

Error rate: 50.0%
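The error rate is the share of mismatches among the rows actually compared, so dividing by the list length keeps the figure tied to the data. A minimal sketch, where `error_rate` is a hypothetical helper and the label lists are made up to resemble `prediction_res` and `test_results`:

```python
def error_rate(predicted, expected):
    """Return mismatching positions and the percentage of mismatches."""
    if len(predicted) != len(expected):
        raise ValueError("both lists must have the same length")
    wrong = [i for i, (p, e) in enumerate(zip(predicted, expected)) if p != e]
    return wrong, 100 * len(wrong) / len(predicted)

# Hypothetical string labels, shaped like prediction_res and test_results.
predicted = ['False', 'False', 'True', 'False']
expected = ['True', 'False', 'True', 'True']

wrong, rate = error_rate(predicted, expected)
print(wrong, f"{rate}%")  # [0, 3] 50.0%
```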


Deploying the model.

In [39]:
!pip install --upgrade huggingface_hub
import huggingface_hub
from huggingface_hub import login
login()

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/



In [40]:
from huggingface_hub import HfApi
api = HfApi()
api.create_repo(repo_id="lesson-4")

RepoUrl('https://huggingface.co/viniciused26/lesson-4', endpoint='https://huggingface.co', repo_type='model', repo_id='viniciused26/lesson-4')

In [41]:
# The first positional argument of push_to_hub is the commit message, not a repo name.
trainer.push_to_hub("lesson-4")

Cloning https://huggingface.co/viniciused26/outputs into local empty directory.


Upload file pytorch_model.bin:   0%|          | 1.00/541M [00:00<?, ?B/s]

Upload file spm.model:   0%|          | 1.00/2.35M [00:00<?, ?B/s]

Upload file training_args.bin:   0%|          | 1.00/3.75k [00:00<?, ?B/s]

To https://huggingface.co/viniciused26/outputs
   e2dad04..0e7e565  main -> main

   e2dad04..0e7e565  main -> main

To https://huggingface.co/viniciused26/outputs
   0e7e565..ebc0ae1  main -> main

   0e7e565..ebc0ae1  main -> main



'https://huggingface.co/viniciused26/outputs/commit/0e7e565fc8012e7f354d391f07ff8bcac8cdbeb7'

# Conclusion
The error rate turned out quite high. I believe this is due to the small number of texts and how similar they are: there are not many names or words that stand out. Discerning the author's writing from tokenization alone does not seem to be enough. Perhaps a larger sample, with names of characters or places from this author's literature, would help the model distinguish the texts better.