## Introduction

This work presents an application of the BERT model to the task of text classification.

We analyze dataset X, obtained from Kaggle. The dataset has two columns: one with the text of the news article and one indicating whether the article is fake or not.

We use the languages `Python` (version 3.11.3) and `R` (version 4.2.1), the `RStudio` software (version 2023.06.0+421), the Python packages `pandas` (version 2.0.3), `transformers` (version 4.30.2) and `torch` (version 2.0.1), and the R packages .

## Setup

In [12]:
import os
import pandas as pd
import math
import numpy as np
from statistics import mean
# import nltk
import torch

from transformers import BertTokenizer, BertForSequenceClassification

In [5]:
TRUE_PATH = r"dados\DataSet_Misinfo_TRUE.csv"
FAKE_PATH = r"dados\DataSet_Misinfo_FAKE.csv"
FINAL_PATH = os.path.join(os.path.split(TRUE_PATH)[0], 'data.csv')

def get_processed_df(true_path: str, fake_path: str, final_path: str) -> None:
    """Build the processed dataset and write it to a CSV file.

    Steps:
    - Concatenate the true-news and fake-news datasets
    - Drop rows with missing text
    - Lowercase the text and strip curly quotes
    - Drop rows whose text and flag are both duplicated
    - Drop every text that still appears more than once, i.e. with
      conflicting flags (keep=False removes all copies, not all but one)

    true_path: Path to the dataset with the true news
    fake_path: Path to the dataset with the fake news
    final_path: Path of the .CSV file that receives the processed dataset
    """
    df_true = pd.read_csv(true_path)
    df_fake = pd.read_csv(fake_path)

    df_true.drop('Unnamed: 0', axis=1, inplace=True)
    df_fake.drop('Unnamed: 0', axis=1, inplace=True)

    df_true['flag'] = 1
    df_fake['flag'] = 0

    df = pd.concat([df_true, df_fake])
    df = df[~df['text'].isnull()]

    df['text'] = df['text'].str.lower().str.replace('“', '').str.replace('”', '')

    df.drop_duplicates(inplace=True)
    df.drop_duplicates(subset='text', keep=False, inplace=True, ignore_index=True)

    df.to_csv(final_path, index=False)
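
The two `drop_duplicates` calls behave differently, which is easy to miss. The sketch below illustrates this on a made-up toy frame (the values are assumptions for illustration, not rows from the dataset): the first call keeps one copy of each exact `(text, flag)` pair, while the second, with `keep=False`, drops every text that still appears more than once.

```python
import pandas as pd

# Toy data (hypothetical): "a" is an exact duplicate, "b" has conflicting flags.
toy = pd.DataFrame({
    "text": ["a", "a", "b", "b", "c"],
    "flag": [1, 1, 1, 0, 0],
})

# Step 1: remove exact (text, flag) duplicates, keeping the first copy.
step1 = toy.drop_duplicates()

# Step 2: remove ALL rows whose text is still repeated (conflicting labels).
step2 = step1.drop_duplicates(subset="text", keep=False, ignore_index=True)

print(step2)
```

In this toy case only the texts `a` and `c` survive, because `b` is labelled both true and fake.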

## Descriptive statistics

In [6]:
# Run only if the full processed dataframe has not been generated yet
get_processed_df(true_path=TRUE_PATH, fake_path=FAKE_PATH, final_path=FINAL_PATH)

Reading the processed data

In [3]:
df = pd.read_csv(r"dados\data.csv")

In [4]:
df['text'] = df['text'].str.replace('“', '').str.replace('”', '')
df['text'][27884]

'tel aviv  —   immigration lawyers from groups financed by billionaire george soros, a champion of open border policies, were signatories to a lawsuit filed saturday to block president donald trump’s executive order halting visas for 90 days for immigrants and   from syria, somalia, sudan, libya, yemen, iran, and iraq. [the executive order further suspended the entry of all refugees for 120 days, indefinitely blocks syrian refugees from entering and lowers the ceiling to 50, 000 for refugees allowed to enter the u. s. during fiscal year 2017 .  the new york times first reported on the lawsuit: at least one case quickly prompted a legal challenge as lawyers representing two iraqi refugees held at kennedy international airport in new york filed a motion early saturday seeking to have their clients released. they also filed a motion for class certification, in an effort to represent all refugees and other immigrants who they said were being unlawfully detained at ports of entry. the suit 

In [41]:
# 'tamanho' (size) = number of whitespace-separated words in each text
df['tamanho'] = [len(i.split()) for i in df['text']]

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68600 entries, 0 to 68599
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   text     68600 non-null  object
 1   flag     68600 non-null  int64 
 2   tamanho  68600 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.6+ MB


|   | text | flag | tamanho |
|---|------|------|---------|
| 0 | the head of a conservative republican faction ... | 1 | 746 |
| 1 | transgender people will be allowed for the fir... | 1 | 396 |
| 2 | the special counsel investigation of links bet... | 1 | 454 |
| 3 | trump campaign adviser george papadopoulos tol... | 1 | 373 |
| 4 | president donald trump called on the u.s. post... | 1 | 849 |


Computing the distribution of the number of words per text

In [42]:
mean(df['tamanho'])

482.30418367346937

In [5]:
np.quantile(df['tamanho'], q=[.1, .2, .3, .4, .5, .6, .7, .8, .9])

array([  46.,  100.,  196.,  290.,  358.,  421.,  523.,  700., 1051.])

In [75]:
len(df[df['tamanho'] < 450])/len(df)

0.6332944606413994
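
The word-count statistics above can be reproduced on a small scale with the standard library alone. This is a minimal sketch on made-up texts; the names `texts`, `word_counts` and `cutoff` are illustrative assumptions, and the cutoff of 4 merely stands in for the 450-word threshold used in the notebook.

```python
from statistics import mean

# Hypothetical toy corpus, not taken from the dataset.
texts = ["a b c", "a b", "a b c d e"]

# Same definition as the 'tamanho' column: whitespace-separated words.
word_counts = [len(t.split()) for t in texts]

# Average length and share of texts strictly below the cutoff.
cutoff = 4
average_length = mean(word_counts)
share_below = sum(n < cutoff for n in word_counts) / len(word_counts)

print(word_counts, average_length, share_below)
```

Note that these are word counts, not BERT token counts: WordPiece may split one word into several tokens, so a text below the word cutoff is not guaranteed to fit a given token limit.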

In [7]:
df.groupby('flag').size()

flag
0    34074
1    34526
dtype: int64

## Importing the pre-trained BERT model

We import the BERT base model, since we do not have enough computational capacity to run BERT large. More specifically, we use BERT base uncased, which does not distinguish between uppercase and lowercase letters.
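
To make "uncased" concrete, the sketch below mimics the text normalization applied before WordPiece in uncased BERT checkpoints: lowercasing followed by accent stripping. It is a standard-library approximation for illustration only; the helper name `uncased_normalize` is an assumption and is not part of `transformers`.

```python
import unicodedata

def uncased_normalize(text: str) -> str:
    """Approximate the normalization of an uncased BERT tokenizer."""
    # Lowercase, then decompose accented characters into base + combining mark.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop the combining marks (Unicode category "Mn"), keeping the base letters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(uncased_normalize("Notícia FALSA"))
```

Under this normalization, `Fake`, `FAKE` and `fake` all reach the model as the same input, which is also why lowercasing the text during preprocessing loses nothing for this checkpoint.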

In [13]:
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
#model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)  # Change num_labels according to your classification task

In [None]:
# Tokenize and preprocess input text
def preprocess_text(text):
    inputs = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        truncation=True,
        padding='max_length',
        max_length=512,  # Adjust according to your input length requirements
        return_tensors='pt'
    )
    return inputs

# Example text for classification (kept for reference; not used below)
text = "This is an example sentence for classification."

# Tokenize a toy string with irregular spacing to inspect the tokenizer output.
# Replace the literal with df['text'][5] to tokenize a real article.
inputs = tokenizer.encode_plus('text                  a   q',
        add_special_tokens=True,
        truncation=True,
        padding='max_length',
        max_length=512)
print(inputs['input_ids'])
print(tokenizer.convert_ids_to_tokens(inputs['input_ids']))
# Number of non-padding tokens (the [PAD] id is 0 in this vocabulary)
print(len([i for i in inputs['input_ids'] if i != 0]))
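
What `truncation=True` together with `padding='max_length'` does to a sequence can be sketched without loading the tokenizer. The helper `pad_or_truncate` below is hypothetical and written only for illustration; it assumes the special-token ids of the `bert-base-uncased` vocabulary (`[CLS]` = 101, `[SEP]` = 102, `[PAD]` = 0) and a single-sentence input.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids in the bert-base-uncased vocabulary

def pad_or_truncate(token_ids, max_length=512):
    """Return (input_ids, attention_mask), both of length max_length."""
    # Reserve two positions for [CLS] and [SEP], cutting the text if needed.
    body = token_ids[: max_length - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    # Real tokens get mask 1; padding positions get mask 0.
    attention_mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Short inputs are filled with `[PAD]` up to `max_length`, while long inputs lose their tail. With `max_length=512`, anything beyond the first 510 WordPiece tokens of an article is discarded.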

In [72]:
df.head()

|   | text | flag | tamanho |
|---|------|------|---------|
| 0 | the head of a conservative republican faction ... | 1 | 746 |
| 1 | transgender people will be allowed for the fir... | 1 | 396 |
| 2 | the special counsel investigation of links bet... | 1 | 454 |
| 3 | trump campaign adviser george papadopoulos tol... | 1 | 373 |
| 4 | president donald trump called on the u.s. post... | 1 | 849 |
