# Lesson 04, poem classification.


The goal of this notebook is to build a model that classifies the genre of the poems presented to it. Two datasets are used,
one to train the model and one to test it. Both are CSV files with 2 columns, Genre and Poem.

Number of rows with data in each dataset:
- Train: 840
- Test: 149

Possible genres:
- Music
- Death
- Affection
- Environment

# Accessing the train and test dataframes with pandas

In [1]:
import pandas as pd
import numpy as np

In [2]:
train_dataset_path = '/kaggle/input/poem-classification-nlp/Poem_classification - train_data.csv'
test_dataset_path = '/kaggle/input/poem-classification-nlp/Poem_classification - test_data.csv'

train_df = pd.read_csv(train_dataset_path)
test_df = pd.read_csv(test_dataset_path)

# Data cleaning

## Genre to number

Each genre is expressed as a number, so the model can output a number instead of a string.

In [42]:
train_df['GenNumber'] = train_df['Genre'].map({'Music':0.0, 'Death':0.25, 'Affection':0.5, 'Environment':0.75})
test_df['GenNumber'] = test_df['Genre'].map({'Music':0.0, 'Death':0.25, 'Affection':0.5, 'Environment':0.75})

train_df

Unnamed: 0,Genre,Poem,GenNumber,input
0,Music,,0.00,Genre: Music; Poem: : GenNumber: 0 0.00\...
1,Music,In the thick brushthey spend the...,0.00,Genre: Music; Poem: In the thick...
2,Music,Storms are generous. ...,0.00,Genre: Music; Poem: Storms are generous. ...
3,Music,—After Ana Mendieta Did you carry around the ...,0.00,Genre: Music; Poem: —After Ana Mendieta Did y...
4,Music,for Aja Sherrard at 20The portent may itself ...,0.00,Genre: Music; Poem: for Aja Sherrard at 20The...
...,...,...,...,...
836,Environment,Why make so much of fragmentary blue In here a...,0.75,Genre: Environment; Poem: Why make so much of ...
837,Environment,"Woman, I wish I didn't know your name. What co...",0.75,"Genre: Environment; Poem: Woman, I wish I didn..."
838,Environment,"Yonder to the kiosk, beside the creek, Paddle ...",0.75,"Genre: Environment; Poem: Yonder to the kiosk,..."
839,Environment,You come to fetch me from my work to-night Whe...,0.75,Genre: Environment; Poem: You come to fetch me...


## Replacing missing values in Poem with a blank string

In [43]:
train_df.fillna(' ', inplace=True)
test_df.fillna(' ', inplace=True)
train_df

Unnamed: 0,Genre,Poem,GenNumber,input
0,Music,,0.00,Genre: Music; Poem: : GenNumber: 0 0.00\...
1,Music,In the thick brushthey spend the...,0.00,Genre: Music; Poem: In the thick...
2,Music,Storms are generous. ...,0.00,Genre: Music; Poem: Storms are generous. ...
3,Music,—After Ana Mendieta Did you carry around the ...,0.00,Genre: Music; Poem: —After Ana Mendieta Did y...
4,Music,for Aja Sherrard at 20The portent may itself ...,0.00,Genre: Music; Poem: for Aja Sherrard at 20The...
...,...,...,...,...
836,Environment,Why make so much of fragmentary blue In here a...,0.75,Genre: Environment; Poem: Why make so much of ...
837,Environment,"Woman, I wish I didn't know your name. What co...",0.75,"Genre: Environment; Poem: Woman, I wish I didn..."
838,Environment,"Yonder to the kiosk, beside the creek, Paddle ...",0.75,"Genre: Environment; Poem: Yonder to the kiosk,..."
839,Environment,You come to fetch me from my work to-night Whe...,0.75,Genre: Environment; Poem: You come to fetch me...


In [44]:
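# NOTE: str(<Series>) stringifies the whole GenNumber column, so the same column dump
# is appended to every row (see the output below); <Series>.astype(str) would append
# only each row's own value. The input also contains the genre, i.e. the label itself.
train_df['input'] = "Genre: " + train_df.Genre + "; Poem: " + train_df.Poem + ": GenNumber: " + str(train_df.GenNumber)
test_df['input'] = "Genre: " + test_df.Genre + "; Poem: " + test_df.Poem + ": GenNumber: " + str(test_df.GenNumber)

train_df.input.head()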

0    Genre: Music; Poem:  : GenNumber: 0      0.00\...
1    Genre: Music; Poem:               In the thick...
2    Genre: Music; Poem:    Storms are generous.   ...
3    Genre: Music; Poem:  —After Ana Mendieta Did y...
4    Genre: Music; Poem:  for Aja Sherrard at 20The...
Name: input, dtype: object
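
The tail of row 0 above (`GenNumber: 0      0.00\...`) comes from calling `str()` on a whole Series. A minimal sketch of the difference between that and an element-wise conversion, on a made-up two-row frame (`toy_df` and its contents are hypothetical, not rows from the dataset):

```python
import pandas as pd

# Hypothetical two-row frame standing in for train_df.
toy_df = pd.DataFrame({
    "Genre": ["Music", "Death"],
    "Poem": ["la la", "ashes"],
    "GenNumber": [0.0, 0.25],
})

# str(Series) renders the entire column as one string,
# so concatenating it would give every row the same dump.
whole_column = "GenNumber: " + str(toy_df.GenNumber)

# astype(str) converts each value separately,
# so each row gets only its own number.
per_row = "GenNumber: " + toy_df.GenNumber.astype(str)

print(repr(whole_column))
print(per_row.tolist())
```

With `astype(str)` the concatenation stays row-aligned, which is usually what is intended when building one text field per row.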

# Tokenization using DeBERTa

Tokenization splits the text into tokens and maps each token to an integer id.
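
To make the idea concrete, here is a toy sketch of the text-to-ids step. It is not how the DeBERTa tokenizer works internally (that one uses a learned subword vocabulary); the whitespace split and the `build_vocab` and `encode` helpers are hypothetical and only illustrate the lookup from token to integer.

```python
def build_vocab(texts):
    """Assign an integer id to every distinct lowercase word, in first-seen order."""
    vocab = {"[UNK]": 0}  # id 0 is reserved for words not seen while building
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Replace each word with its id, falling back to the [UNK] id."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]


vocab = build_vocab(["the valley of the shadow of death"])
ids = encode("the shadow of night", vocab)
print(ids)  # "night" is not in the vocabulary, so it maps to the [UNK] id
```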

In [29]:
from transformers import AutoModelForSequenceClassification,AutoTokenizer

pre_train = 'microsoft/deberta-v3-small'
toke_maker = AutoTokenizer.from_pretrained(pre_train)

# Quick test of toke_maker
toke_maker.tokenize(
"I walk through the valley of the shadow of death "
"And I fear no evil because I'm blind to it all "
"And my mind, my gun they comfort me "
"Because I know I'll kill my enemies when they come")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


['▁I',
 '▁walk',
 '▁through',
 '▁the',
 '▁valley',
 '▁of',
 '▁the',
 '▁shadow',
 '▁of',
 '▁death',
 '▁And',
 '▁I',
 '▁fear',
 '▁no',
 '▁evil',
 '▁because',
 '▁I',
 "'",
 'm',
 '▁blind',
 '▁to',
 '▁it',
 '▁all',
 '▁And',
 '▁my',
 '▁mind',
 ',',
 '▁my',
 '▁gun',
 '▁they',
 '▁comfort',
 '▁me',
 '▁Because',
 '▁I',
 '▁know',
 '▁I',
 "'",
 'll',
 '▁kill',
 '▁my',
 '▁enemies',
 '▁when',
 '▁they',
 '▁come']

In [30]:
# Tokenize the 'input' column of a batch of rows from the dataset.
# Missing poems were replaced by ' ' earlier, so the text is never null.
def give_toke(element):
    return toke_maker(element['input'])

In [45]:
# Convert the pandas DataFrames to Dataset objects, which transformers works with.
from datasets import Dataset, DatasetDict
train_ds = Dataset.from_pandas(train_df)
test_ds = Dataset.from_pandas(test_df)
train_ds

Dataset({
    features: ['Genre', 'Poem', 'GenNumber', 'input'],
    num_rows: 841
})

In [46]:
# Obtain tokens from dataset.
train_tokens = train_ds.map(give_toke, batched=True)
test_tokens = test_ds.map(give_toke, batched=True)
line = train_tokens[1] # I'm using 1 cause 0 is empty string in this train_tokens.
line.keys()

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

dict_keys(['Genre', 'Poem', 'GenNumber', 'input', 'input_ids', 'token_type_ids', 'attention_mask'])

In [33]:
# Show line and ids.
line['input'], line['input_ids']

('Genre: Music; Poem: \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 In the thick brushthey spend the hottest part of the day,\xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 soaking their hoovesin the trickle of mountain water\xa0 \xa0 \xa0 \xa0 \xa0 \xa0 \xa0 the ravine hoardson behalf of the oleander.\xa0 \xa0 \xa0 \xa0 \xa0 \xa0: GenNumber: 0      0.00\n1      0.00\n2      0.00\n3      0.00\n4      0.00\n       ... \n836    0.75\n837    0.75\n838    0.75\n839    0.75\n840    0.75\nName: GenNumber, Length: 841, dtype: float64',
 [1,
  33063,
  294,
  2515,
  346,
  54694,
  294,
  344,
  262,
  3901,
  5063,
  6366,
  1446,
  262,
  11914,
  465,
  265,
  262,
  406,
  261,
  18065,
  308,
  55042,
  547,
  262,
  30471,
  265,
  3439,
  529,
  262,
  48099,
  42074,
  3336,
  3910,
  265,
  262,
  28728,
  21673,
  260,
  877,
  6481,
  36869,
  294,
  767,
  767,
  260,
  962,
  376,
  767,
  260,
  962,
  392,
  767,
  260,
  962,
  404,
  767,
  260,
  962,
  453,
  767,
  260,
  962,
  323,
  260,
  260

In [34]:
train_tokens = train_tokens.rename_columns({'GenNumber':'labels'})
train_tokens

Dataset({
    features: ['Genre', 'Poem', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 841
})

# Defining the Pearson correlation

In [12]:
##??np.corrcoef

In [35]:
def correlacao(a,b):
    return np.corrcoef(a,b)[0][1]

def format_correlacao(args):
    dicionario = {'pearson': correlacao(*args)}
    return dicionario
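
For reference, the Pearson coefficient is the covariance of the two sequences divided by the product of their standard deviations. A minimal pure-Python sketch of that arithmetic (the `pearson` helper and the sample numbers are hypothetical, written only to show the calculation):

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences of numbers."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    # Sums of squared deviations (unnormalized variances).
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


labels = [0.0, 0.25, 0.5, 0.75]
preds = [0.1, 0.35, 0.6, 0.85]  # the labels shifted by a constant
print(pearson(labels, preds))
```

Because the coefficient ignores constant shifts and rescaling, a value close to 1 shows that predictions rise with the labels, not that each prediction lands on the right target value.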

## Splitting the training dataset into (train, validation)

And the test dataset? It is used at the end of the notebook to evaluate the model.

In [36]:
dataset_train_validation = train_tokens.train_test_split(0.25, seed=10)
dataset_train_validation

DatasetDict({
    train: Dataset({
        features: ['Genre', 'Poem', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 630
    })
    test: Dataset({
        features: ['Genre', 'Poem', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 211
    })
})
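
The 630/211 split is consistent with rounding the held-out share up: 25% of 841 rows is 210.25. The arithmetic, assuming the held-out size is rounded up (this rounding rule is inferred from the output above, not checked against the library source):

```python
import math

total_rows = 841       # num_rows reported for train_tokens
test_fraction = 0.25   # first argument passed to train_test_split

validation_rows = math.ceil(total_rows * test_fraction)  # 210.25 rounds up
train_rows = total_rows - validation_rows

print(train_rows, validation_rows)
```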

In [37]:
from transformers import TrainingArguments, Trainer

In [38]:
bs = 16
epochs = 4
lr = 8e-5

args = TrainingArguments(
    'outputs',
    learning_rate=lr,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',
    fp16=True,
    evaluation_strategy="epoch",
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs,
    weight_decay=0.01,
    report_to='none')
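
As a rough illustration of what `warmup_ratio=0.1` combined with `lr_scheduler_type='cosine'` means, the sketch below computes a learning rate per step: a linear ramp over the first 10% of steps, then a half-cosine decay to zero. The `lr_at` helper is hypothetical and the formula is the commonly used one, assumed here rather than taken from the library; `total_steps=80` matches the `global_step` reported by the training run in this notebook.

```python
import math


def lr_at(step, base_lr=8e-5, total_steps=80, warmup_ratio=0.1):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 4, 8, 44, 80):
    print(step, lr_at(step))
```

Under these assumptions the rate peaks at step 8 and is back to half the peak at step 44, the midpoint of the decay phase.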

In [17]:
#??AutoModelForSequenceClassification.from_pretrained

In [39]:
model = AutoModelForSequenceClassification.from_pretrained(pre_train, num_labels=1)
trainer = Trainer(
    model, args, 
    train_dataset=dataset_train_validation['train'],
    eval_dataset=dataset_train_validation['test'],
    tokenizer=toke_maker,
    compute_metrics=format_correlacao)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['pooler.dense.weight', 'classifier.bias', 'pooler.dense.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [40]:
trainer.train()

You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Pearson
1,No log,0.089018,0.584029
2,No log,0.007454,0.978287
3,No log,0.003376,0.991028
4,No log,0.003565,0.992119


TrainOutput(global_step=80, training_loss=0.05410410761833191, metrics={'train_runtime': 38.8699, 'train_samples_per_second': 64.832, 'train_steps_per_second': 2.058, 'total_flos': 106350962389500.0, 'train_loss': 0.05410410761833191, 'epoch': 4.0})

# Testing the model

In [69]:
predictions = trainer.predict(test_tokens).predictions.astype(float)
predictions

array([[-0.08906467],
       [-0.05857813],
       [-0.0473283 ],
       [-0.05148886],
       [-0.08475068],
       [-0.05614309],
       [-0.08436757],
       [-0.07095812],
       [-0.05352484],
       [-0.07165819],
       [-0.05133129],
       [-0.07060902],
       [ 0.19357182],
       [ 0.25681582],
       [ 0.17280775],
       [ 0.22983736],
       [ 0.19352446],
       [ 0.20002493],
       [ 0.19206983],
       [ 0.29333666],
       [ 0.19712353],
       [ 0.21615067],
       [ 0.23266932],
       [ 0.22462741],
       [ 0.18236068],
       [ 0.48875222],
       [ 0.58774918],
       [ 0.68393379],
       [ 0.61456925],
       [ 0.58230036],
       [ 0.57748991],
       [ 0.53173977],
       [ 0.59790862],
       [ 0.5443576 ],
       [ 0.63920313],
       [ 0.55092692],
       [ 0.56303924],
       [ 0.50779581],
       [ 0.64202148],
       [ 0.60877478],
       [ 0.60810065],
       [ 0.64680851],
       [ 0.60569751],
       [ 0.58813488],
       [ 0.58700997],
       [ 0

In [70]:
# ...

genero = [
    'Music', 'Death', 'Affection', 'Environment'
]

def exact_predict(value):
    if value <= (0 + 0.1250):
        return genero[0]
    elif value <= (0.25 + 0.1250):
        return genero[1]
    elif value <= (0.5 + 0.1250):
        return genero[2]
    else:
        return genero[3]
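
The thresholds in `exact_predict` are the midpoints between neighbouring target values, so the function amounts to picking the genre whose target number is closest to the prediction. A sketch of that formulation (`TARGETS` and `nearest_genre` are hypothetical names; the values mirror the mapping used for `GenNumber`):

```python
TARGETS = {"Music": 0.0, "Death": 0.25, "Affection": 0.5, "Environment": 0.75}


def nearest_genre(value):
    """Return the genre whose target number is closest to the predicted value."""
    # min() keeps the first genre on a tie, matching the `<=` in exact_predict.
    return min(TARGETS, key=lambda genre: abs(TARGETS[genre] - value))


for value in (-0.089, 0.19, 0.49, 0.68):
    print(value, nearest_genre(value))
```

Written this way, changing or adding a target value only requires editing the dictionary, since the cut points follow from it.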

In [71]:
predictions = list(map(exact_predict,predictions))
predictions

['Music',
 'Music',
 'Music',
 'Music',
 'Music',
 'Music',
 'Music',
 'Music',
 'Music',
 'Music',
 'Music',
 'Music',
 'Death',
 'Death',
 'Death',
 'Death',
 'Death',
 'Death',
 'Death',
 'Death',
 'Death',
 'Death',
 'Death',
 'Death',
 'Death',
 'Affection',
 'Affection',
 'Environment',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Environment',
 'Affection',
 'Affection',
 'Affection',
 'Environment',
 'Affection',
 'Affection',
 'Environment',
 'Affection',
 'Affection',
 'Affection',
 'Environment',
 'Affection',
 'Affection',
 'Environment',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Environment',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affection',
 'Affect

In [72]:
# The number of test rows must equal the number of predictions.
len(test_df['Genre']), len(predictions)

(150, 150)

In [75]:
quantia_de_erros = sum(list(map(lambda a,b: 0 if a == b else 1, test_df['Genre'], predictions)))
quantia_de_erros

13

In [76]:
porcentagem_de_acertos = 1 - (quantia_de_erros/len(predictions))
porcentagem_de_acertos * 100

91.33333333333333
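
Overall accuracy does not show which genres get confused with each other. A minimal sketch that counts (true, predicted) pairs with `collections.Counter`; the `true_genres` and `predicted_genres` lists are made-up stand-ins for `test_df['Genre']` and `predictions`, not results from this run:

```python
from collections import Counter

# Made-up stand-ins for test_df['Genre'] and predictions.
true_genres = ["Music", "Death", "Affection", "Affection", "Environment"]
predicted_genres = ["Music", "Death", "Affection", "Environment", "Environment"]

# Count every (true, predicted) pair; pairs whose two genres differ are errors.
pair_counts = Counter(zip(true_genres, predicted_genres))
errors = {pair: n for pair, n in pair_counts.items() if pair[0] != pair[1]}
accuracy = 1 - sum(errors.values()) / len(true_genres)

print(errors)
print(accuracy)
```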

# Exporting the model

In [78]:
trainer.save_model('./model')

# DEPLOY

[DEPLOY](https://huggingface.co/HellSank/poems)