## Introduction

This work applies the BERT model to a text classification task.

We analyze dataset X, obtained from Kaggle. The dataset has two columns: one with the text of the news item and one indicating whether the news item is fake.

We use `Python` (version 3.11.3), `R` (version 4.2.1), the `RStudio` software (version 2023.06.0+421), the Python packages `pandas` (version 2.0.3), `transformers` (version 4.30.2), and `torch` (version 2.0.1), and R packages.

## Setup

In [2]:
import os
import pandas as pd
import math
import numpy as np
from statistics import mean
# import nltk
import torch
from transformers import BertTokenizer, BertForSequenceClassification

In [3]:
# Build paths with os.path.join so they also work outside Windows
TRUE_PATH = os.path.join("dados", "DataSet_Misinfo_TRUE.csv")
FAKE_PATH = os.path.join("dados", "DataSet_Misinfo_FAKE.csv")
FINAL_PATH = os.path.join(os.path.dirname(TRUE_PATH), "data.csv")

def get_processed_df(true_path: str, fake_path: str, final_path: str) -> None:
    """Build the processed dataset and write it to a CSV file.

    - Concatenates the true-news and fake-news datasets
    - Removes rows with missing text
    - Lowercases the text
    - Removes rows whose text and flag are both duplicated
    - If a text still appears more than once (with different flags), removes every occurrence

    true_path: Path to the dataset with the true news
    fake_path: Path to the dataset with the fake news
    final_path: Path of the .CSV file where the processed dataset is written
    """
    df_true = pd.read_csv(true_path)
    df_fake = pd.read_csv(fake_path)

    df_true.drop('Unnamed: 0', axis=1, inplace=True)
    df_fake.drop('Unnamed: 0', axis=1, inplace=True)

    # Label convention: 1 = true news, 0 = fake news
    df_true['flag'] = 1
    df_fake['flag'] = 0

    df = pd.concat([df_true, df_fake])
    df = df[~df['text'].isnull()]

    df['text'] = df['text'].str.lower()

    df.drop_duplicates(inplace=True)
    df.drop_duplicates(subset='text', keep=False, inplace=True, ignore_index=True)

    df.to_csv(final_path, index=False)
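
The two `drop_duplicates` calls in `get_processed_df` behave differently, which is easy to miss. The sketch below reproduces them on a small made-up DataFrame (the `toy` data is an assumption for illustration, not taken from the dataset): the first call collapses exact (text, flag) repeats into one row, and the second call, because of `keep=False`, discards every row of a text that still appears more than once.

```python
import pandas as pd

# Made-up data: "a" is an exact duplicate, "b" appears with conflicting flags
toy = pd.DataFrame({
    "text": ["a", "a", "b", "b", "c"],
    "flag": [1, 1, 1, 0, 0],
})

# Step 1: identical (text, flag) rows are collapsed into a single row
step1 = toy.drop_duplicates()

# Step 2: texts that still repeat (here "b", with flags 1 and 0) are removed entirely
step2 = step1.drop_duplicates(subset="text", keep=False, ignore_index=True)

print(step1)
print(step2)
```

On this toy input, only the rows for "a" and "c" survive the second step, so texts labeled both true and fake do not reach the model.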

## Descriptive statistics

In [4]:
# Run only if the full processed dataframe has not been generated yet
# get_processed_df(true_path=TRUE_PATH, fake_path=FAKE_PATH, final_path=FINAL_PATH)

Reading the processed data

In [5]:
df = pd.read_csv(FINAL_PATH)

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68600 entries, 0 to 68599
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    68600 non-null  object
 1   flag    68600 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 1.0+ MB


| | text | flag |
|---|---|---|
| 0 | the head of a conservative republican faction ... | 1 |
| 1 | transgender people will be allowed for the fir... | 1 |
| 2 | the special counsel investigation of links bet... | 1 |
| 3 | trump campaign adviser george papadopoulos tol... | 1 |
| 4 | president donald trump called on the u.s. post... | 1 |


Computing the number of words in each text

In [9]:
len(df['text'][23750].split())

1890

In [10]:
print(max([len(i.split()) for i in df['text'].dropna()]))

24234
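
The cells above look at one article and at the maximum only. A minimal sketch of a fuller summary, assuming the same whitespace-based word count, is shown below on made-up texts (the `toy` DataFrame and the threshold of 4 words are illustrative assumptions). On the real data the threshold of interest would be closer to BERT's input limit, keeping in mind that the word count is typically lower than the WordPiece token count.

```python
import pandas as pd

# Made-up texts standing in for df['text']
toy = pd.DataFrame({"text": ["one two three", "four five", "six seven eight nine ten"]})

# Whitespace-delimited word count per row, vectorized
word_counts = toy["text"].str.split().str.len()

# Summary statistics (count, mean, quartiles, max) of the word counts
summary = word_counts.describe()

# Share of texts longer than an illustrative threshold
threshold = 4
share_over_threshold = (word_counts > threshold).mean()

print(summary)
print(share_over_threshold)
```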


In [7]:
df.groupby('flag').size()

flag
0    34074
1    34526
dtype: int64
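
To read the class balance as proportions, the counts reported above can be divided by their total. This is a small sketch that hard-codes those two counts; on the loaded data, `df['flag'].value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# Counts per flag as reported above (0 = fake news, 1 = true news)
counts = pd.Series({0: 34074, 1: 34526}, name="n")

# Proportion of each class in the processed dataset
shares = counts / counts.sum()

print(shares.round(4))
```

Both classes are close to half of the 68,600 rows, so the dataset is roughly balanced.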

## Importing the pre-trained BERT model

We import the BERT base model because we lack the computational capacity to run BERT large. More specifically, we use BERT base uncased, which does not distinguish between uppercase and lowercase letters.

In [8]:
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)  # Change num_labels according to your classification task

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

In [17]:
# Tokenize and preprocess input text
def preprocess_text(text):
    inputs = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        truncation=True,
        padding='max_length',
        max_length=512,  # bert-base-uncased accepts at most 512 tokens, including [CLS] and [SEP]
        return_tensors='pt'
    )
    return inputs

# Example text for classification
text = "This is an example sentence for classification."

# Tokenize a sample article without truncation to inspect its WordPiece tokens
inputs = tokenizer.encode_plus(df['text'][23750])
print(df['text'][23750])
print(tokenizer.convert_ids_to_tokens(inputs['input_ids']))

the man entered the red robin restaurant inside the staten island mall two minutes after 6 p. m. on a friday. he walked straight past the booths and tables and entered the men’s room. a manager would find him there seven minutes later, lying on the floor with a needle and foaming at the mouth. his name was jonathan ayers, 27, and he was declared dead within the hour that evening, sept. 9, apparently of a heroin overdose. mr. ayers’s fatal overdose was the latest addition to a body count without precedent. so far in 2016, there have been 71 deaths that appear to be from heroin overdoses on the island, the richmond county district attorney’s office said, already on pace to more than double the record set two years ago. nine people died of heroin overdoses in a recent   period, prosecutors said. mr. ayers left behind an account of his addiction. after his death, his mother, ann ayers, and brother, christopher, found a journal he had kept for the last couple of years that chronicled the li
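
Articles such as the one above are much longer than the 512 positions that `bert-base-uncased` supports, so `truncation=True` together with `padding='max_length'` decides what the model actually sees. The sketch below mimics that behavior in plain Python on lists of token IDs. It does not call the tokenizer: `truncate_and_pad` is a hypothetical helper, the inputs are made-up ID lists, and the special-token IDs are those of the `bert-base-uncased` vocabulary.

```python
from typing import Dict, List

# Special-token IDs in the bert-base-uncased vocabulary
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0
MAX_LEN = 512  # maximum sequence length supported by BERT base


def truncate_and_pad(token_ids: List[int], max_length: int = MAX_LEN) -> Dict[str, List[int]]:
    """Hypothetical helper: keep the head of the sequence, add [CLS]/[SEP], pad to max_length."""
    body = token_ids[: max_length - 2]  # reserve two slots for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 marks real tokens
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 marks padding
    }


long_doc = list(range(1000, 3000))  # stand-in for a 2,000-token article
short_doc = [2023, 2003, 2019]      # stand-in for a 3-token sentence

long_enc = truncate_and_pad(long_doc)
short_enc = truncate_and_pad(short_doc)

# Both encodings have the same length; only the number of real tokens differs
print(len(long_enc["input_ids"]), sum(long_enc["attention_mask"]))
print(len(short_enc["input_ids"]), sum(short_enc["attention_mask"]))
```

With head truncation, everything after the first 510 content tokens of a long article is discarded, which is worth keeping in mind given that the longest text in this dataset has 24,234 words.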