# **RNN Model Notebook**

@authors: miguelrocha and Grupo 03

The GloVe file glove.6B.50d.txt, which provides the embeddings, is not included in this repository because of its size. To run the notebook, download it from the link below and place it at helpers/glove.6B.50d.txt:

https://www.kaggle.com/datasets/watts2/glove6b50dtxt

In [174]:
# Notebook Imports
import numpy as np
import pandas as pd
import re
from collections import Counter
import pickle
import random
import time

from helpers.dataset import Dataset
from helpers.activation import TanhActivation, ReLUActivation
from helpers.losses import BinaryCrossEntropy
from helpers.metrics import accuracy
from models.rnn_model import RNN

**Modification to the Optimizer class**

In [175]:
class Optimizer:
    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.velocity = {}  # Maps a parameter key to its gradient velocity
        self.learning_rate = learning_rate
        self.momentum = momentum

    def update(self, param, grad):
        """Update the weights using gradient descent with momentum."""

        # Key the velocity by the id() of the NumPy array.
        # The key is only stable while the caller keeps passing the same array object.
        param_id = id(param)

        if param_id not in self.velocity:
            self.velocity[param_id] = np.zeros_like(grad)

        # Momentum update (exponential moving average of the gradient)
        self.velocity[param_id] = self.momentum * self.velocity[param_id] + (1 - self.momentum) * grad
        return param - self.learning_rate * self.velocity[param_id]  # Returns the new weights as a new array
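
Because `update` returns a new array, `id(param)` changes whenever the caller rebinds a weight to the returned value, so the stored velocity may not be found on the next step. A minimal sketch of one alternative, assuming each parameter can be given a string name. The class `NamedMomentumOptimizer` and its `name` argument are hypothetical and are not part of the `helpers` package:

```python
import numpy as np

class NamedMomentumOptimizer:
    """Momentum SGD that keys velocities by an explicit parameter name (hypothetical)."""

    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = {}

    def update(self, name, param, grad):
        # Reuse the velocity stored under `name`, or start from zeros.
        v = self.velocity.get(name, np.zeros_like(grad))
        v = self.momentum * v + (1 - self.momentum) * grad
        self.velocity[name] = v
        return param - self.learning_rate * v

opt = NamedMomentumOptimizer(learning_rate=0.1, momentum=0.5)
w = np.array([1.0, 1.0])
g = np.array([1.0, -1.0])
w = opt.update("W", w, g)  # velocity becomes 0.5 * g
w = opt.update("W", w, g)  # velocity becomes 0.75 * g, so momentum accumulates
```

With a name as the key, the velocity survives even though `w` is a different array object after every step.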



### **Data Preparation**

**Initial analysis of the datasets and merging them for joint processing**

In [176]:
# Paths to the TRAINING files
input_csv1 = "../tarefa_1/clean_input_datasets/gpt_vs_human_data_set_inputs.csv"
output_csv1 = "../tarefa_1/clean_output_datasets/gpt_vs_human_data_set_outputs.csv"

# Paths to the FINAL TEST files
input_csv2 = "../tarefa_1/clean_input_datasets/dataset2_stor_inputs.csv"
output_csv2 = "../tarefa_1/clean_output_datasets/dataset2_stor_outputs.csv"

# Load the training datasets
df_input1 = pd.read_csv(input_csv1, sep="\t")
df_output1 = pd.read_csv(output_csv1, sep="\t")

# Load the test datasets
df_input2 = pd.read_csv(input_csv2, sep="\t")
df_output2 = pd.read_csv(output_csv2, sep="\t")

# Join inputs and labels on the ID column
df_train = pd.merge(df_input1, df_output1, on="ID")
df_test = pd.merge(df_input2, df_output2, on="ID")

# Concatenate train and test so the same preprocessing is applied to both
df_dataset1_merged = pd.concat([df_train, df_test], ignore_index=True)

# Show the first 5 rows of the full dataset
print("\nFull dataset - first 5 rows:")
print(df_dataset1_merged.head())

print("\nFull dataset - last 5 rows:")
print(df_dataset1_merged.tail())


Full dataset - first 5 rows:
  ID                                               Text  Label
0  0  Advanced electromagnetic potentials are indige...  Human
1  1  This research paper investigates the question ...     AI
2  2  We give an algorithm for finding network encod...  Human
3  3  The paper presents an efficient centralized bi...     AI
4  4  We introduce an exponential random graph model...  Human

Full dataset - last 5 rows:
          ID                                               Text  Label
4148   D2-96  Though a part of the continent of North Americ...  Human
4149   D2-97  There has been a steady increase in the number...     AI
4150   D2-98  Plasticizers like phthalates were thought to b...     AI
4151   D2-99  The main causes of lung cancer are multifacete...     AI
4152  D2-100  It is an approximation useful in chemistry, bu...  Human
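
`pd.merge` performs an inner join by default, so any ID present in only one of the two files is dropped without warning. A small sketch with toy frames (the values are illustrative, not taken from the datasets) that makes the join assumptions explicit:

```python
import pandas as pd

inputs = pd.DataFrame({"ID": ["0", "1", "2"], "Text": ["a b", "c d", "e f"]})
outputs = pd.DataFrame({"ID": ["0", "1", "2"], "Label": ["Human", "AI", "Human"]})

# validate="one_to_one" raises MergeError if an ID is duplicated on either side.
merged = pd.merge(inputs, outputs, on="ID", validate="one_to_one")

# An inner join drops unmatched IDs, so compare the row counts explicitly.
no_rows_lost = len(merged) == len(inputs) == len(outputs)
```

If `no_rows_lost` is false, some texts have no label (or the reverse) and the merged dataset is smaller than expected.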


**Remove special characters and punctuation, and convert to lowercase**

In [177]:
# Text-cleaning function
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # Drop everything except letters, digits and whitespace
    return text

df_dataset1_merged["clean_text"] = df_dataset1_merged["Text"].apply(clean_text)

# Keep only the desired columns and rename clean_text to Text
df_dataset1_merged = df_dataset1_merged[["ID", "clean_text", "Label"]].rename(columns={"clean_text": "Text"})

print("Clean text - first 5 rows:")
print(df_dataset1_merged.head())

Clean text - first 5 rows:
  ID                                               Text  Label
0  0  advanced electromagnetic potentials are indige...  Human
1  1  this research paper investigates the question ...     AI
2  2  we give an algorithm for finding network encod...  Human
3  3  the paper presents an efficient centralized bi...     AI
4  4  we introduce an exponential random graph model...  Human


**Remove stopwords**

In [178]:
# List of common stopwords
stopwords = {
    "the", "of", "and", "in", "to", "is", "a", "that", "for", "are", "on", "with",
    "as", "at", "by", "from", "this", "it", "an", "be", "or", "which", "was", "were"
}

# Function to remove stopwords
def remove_stopwords(text):
    words = text.split()  # Split into words
    filtered_words = [word for word in words if word not in stopwords]  # Drop stopwords
    return " ".join(filtered_words)  # Join the words back together

# Apply to the dataset
df_dataset1_merged["Text"] = df_dataset1_merged["Text"].apply(remove_stopwords)

# Show the first 5 rows after stopword removal
print("Text without stopwords - first 5 rows:")
print(df_dataset1_merged.head())


Text without stopwords - first 5 rows:
  ID                                               Text  Label
0  0  advanced electromagnetic potentials indigenous...  Human
1  1  research paper investigates question whether a...     AI
2  2  we give algorithm finding network encoding dec...  Human
3  3  paper presents efficient centralized binary mu...     AI
4  4  we introduce exponential random graph model ne...  Human
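
The two cleaning steps can be traced on a single sentence. The sketch below repeats the two functions so that it runs on its own; the sample sentence is made up for illustration:

```python
import re

STOPWORDS = {
    "the", "of", "and", "in", "to", "is", "a", "that", "for", "are", "on", "with",
    "as", "at", "by", "from", "this", "it", "an", "be", "or", "which", "was", "were",
}

def clean_text(text):
    text = text.lower()
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)

def remove_stopwords(text):
    return " ".join(word for word in text.split() if word not in STOPWORDS)

sample = "The state-of-the-art model is 95% accurate, and it was trained on arXiv data."
cleaned = clean_text(sample)
filtered = remove_stopwords(cleaned)
print(filtered)  # stateoftheart model 95 accurate trained arxiv data
```

Note that deleting punctuation fuses hyphenated words into one token (`stateoftheart`), which may then be missing from the embedding vocabulary.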


**Create embeddings and label encoding**

In [179]:
# Map labels to numeric values
label_map = {"Human": 0, "AI": 1}
df_dataset1_merged["Label"] = df_dataset1_merged["Label"].map(label_map)

# Load GloVe
EMBEDDING_DIM = 50  # Embedding dimension

embedding_dict = {}
with open("helpers/glove.6B.50d.txt", "r", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.array(values[1:], dtype="float32")
        embedding_dict[word] = vector

print(f"Total words loaded from GloVe: {len(embedding_dict)}")

# Convert words to embeddings
def text_to_embedding(text, embedding_dict, embedding_dim=50):
    words = text.split()
    embeddings = [embedding_dict.get(word, np.zeros(embedding_dim)) for word in words]  # GloVe vector, or zeros for unknown words

    # If the list is empty, return a single zero vector
    if len(embeddings) == 0:
        embeddings = [np.zeros(embedding_dim)]

    return embeddings

df_dataset1_merged["Embedding"] = df_dataset1_merged["Text"].apply(lambda x: text_to_embedding(x, embedding_dict, EMBEDDING_DIM))


Total words loaded from GloVe: 400000
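
Words that are missing from the GloVe vocabulary are replaced by zero vectors. The sketch below shows the lookup on a toy 3-dimensional dictionary (the vectors are invented for illustration, not real GloVe values) and measures the share of out-of-vocabulary tokens in a sequence:

```python
import numpy as np

toy_embeddings = {
    "paper": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "model": np.array([0.4, 0.5, 0.6], dtype="float32"),
}

def text_to_embedding(text, embedding_dict, embedding_dim):
    zero = np.zeros(embedding_dim, dtype="float32")
    vectors = [embedding_dict.get(word, zero) for word in text.split()]
    return vectors if vectors else [zero]

seq = np.array(text_to_embedding("paper stateoftheart model", toy_embeddings, 3))

# A row made only of zeros marks a token that was not found in the dictionary.
oov_rate = float(np.mean(~seq.any(axis=1)))
```

Here `seq` has shape `(3, 3)` and one of the three tokens is unknown, so `oov_rate` is one third. A high rate on the real data would suggest that the cleaning step is producing tokens GloVe does not know.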


**Standardize the sequence length**


In [None]:
# Standardize the sequence length
MAX_SEQUENCE_LENGTH = 130  # Several values were tested; 130 gave the best results

def pad_embedding_sequence(seq, max_length, embedding_dim):
    seq = np.array(seq)  # Make sure the sequence is a NumPy array

    if seq.shape[0] == 0:  # For an empty sequence, create an array of zeros
        seq = np.zeros((1, embedding_dim))

    if seq.shape[0] > max_length:  # Truncate if longer
        return seq[:max_length]

    padding = np.zeros((max_length - seq.shape[0], embedding_dim))  # Create the padding
    return np.vstack([seq, padding])  # Append the padding at the end

# Apply padding to the embedding sequences
df_dataset1_merged["Embedding"] = df_dataset1_merged["Embedding"].apply(
    lambda x: pad_embedding_sequence(x, MAX_SEQUENCE_LENGTH, EMBEDDING_DIM)
)

# Convert to a NumPy array to feed the model
X = np.array(df_dataset1_merged["Embedding"].tolist())
y = np.array(df_dataset1_merged["Label"])  # Numeric labels

print("Final data shape for the model:", X.shape)  # Should be (n_samples, MAX_SEQUENCE_LENGTH, EMBEDDING_DIM)

# Keep only the desired columns and rename "Embedding" to "Text"
df_dataset1_merged = df_dataset1_merged[["ID", "Embedding", "Label"]].rename(columns={"Embedding": "Text"})

print("Dataset after embedding - first 5 rows:")
print(df_dataset1_merged.head())


Final data shape for the model: (4153, 130, 50)
Dataset after embedding - first 5 rows:
  ID                                               Text  Label
0  0  [[-0.3009699881076813, -0.11428999900817871, 0...      0
1  1  [[0.7125800251960754, 0.6449199914932251, 0.05...      1
2  2  [[0.5738700032234192, -0.32728999853134155, 0....      0
3  3  [[-0.7121599912643433, 0.028648000210523605, 0...      1
4  4  [[0.5738700032234192, -0.32728999853134155, 0....      0


**Normalizing the embeddings**


In [181]:
# Normalize each embedding sequence (zero mean, unit variance)
def normalize_embedding(emb):
    mean = np.mean(emb, axis=0)  # Mean over the timesteps, per embedding dimension
    std = np.std(emb, axis=0) + 1e-8  # Standard deviation (the epsilon avoids division by zero)
    return (emb - mean) / std

# Apply the per-sequence normalization to the embeddings
df_dataset1_merged["Text"] = df_dataset1_merged["Text"].apply(normalize_embedding)

# Convert to a NumPy array to train the model
X = np.array(df_dataset1_merged["Text"].tolist())
y = np.array(df_dataset1_merged["Label"])  # Numeric labels

print("Final data shape for the model:", X.shape)  # Should be (n_samples, MAX_SEQUENCE_LENGTH, EMBEDDING_DIM)

# Print the updated dataset
print("\nDataset after normalizing the embeddings:")
print(df_dataset1_merged.head())


Final data shape for the model: (4153, 130, 50)

Dataset after normalizing the embeddings:
  ID                                               Text  Label
0  0  [[-1.3483198034668138, -0.22574247577337878, 1...      0
1  1  [[1.4213576609868754, 1.7255806047138122, 0.11...      1
2  2  [[0.905188811450096, -1.1959995997633175, -0.0...      0
3  3  [[-2.165679718307827, 0.18190477259827562, -0....      1
4  4  [[0.7698617016091831, -1.0859811428972335, -0....      0
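
Because the statistics are computed over every row of the padded sequence, the zero-padding rows take part in the mean and standard deviation and are no longer zero afterwards. A small sketch with toy values (two real timesteps, two padding rows, embedding dimension 2) shows the effect:

```python
import numpy as np

def normalize_embedding(emb):
    mean = np.mean(emb, axis=0)
    std = np.std(emb, axis=0) + 1e-8
    return (emb - mean) / std

# Two real timesteps followed by two rows of zero padding.
emb = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [0.0, 0.0],
                [0.0, 0.0]])

normed = normalize_embedding(emb)
padding_rows = normed[2:]  # these were all zeros before normalization
```

Each column of `normed` has zero mean and unit variance, but `padding_rows` now holds the same non-zero vector in every padded position, so the network sees a constant input rather than zeros at those timesteps.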


**Drop the ID column**

In [182]:
if "ID" in df_dataset1_merged.columns:
    df_dataset1_merged = df_dataset1_merged.drop(columns=["ID"])

print("Final data shape for the model:", X.shape)  # Should be (n_samples, MAX_SEQUENCE_LENGTH, EMBEDDING_DIM)

# Print the updated dataset
print("\nDataset after drop:")
print(df_dataset1_merged.head())

Final data shape for the model: (4153, 130, 50)

Dataset after drop:
                                                Text  Label
0  [[-1.3483198034668138, -0.22574247577337878, 1...      0
1  [[1.4213576609868754, 1.7255806047138122, 0.11...      1
2  [[0.905188811450096, -1.1959995997633175, -0.0...      0
3  [[-2.165679718307827, 0.18190477259827562, -0....      1
4  [[0.7698617016091831, -1.0859811428972335, -0....      0


**Dataset split**

Training dataset:

- 70%: Training
- 15%: Validation
- 15%: Test

Evaluation dataset:

- 100%: Final test


In [183]:
# Set a global seed for reproducibility
SEED = 42
np.random.seed(SEED)
random.seed(SEED)

######################################################### test dataset
# Set aside the last 100 rows (the second dataset) for the final evaluation
df_eval_final = df_dataset1_merged.tail(100)

# Remove those rows from the dataset before shuffling
df_remaining = df_dataset1_merged.iloc[:-100]
#########################################################

# Shuffle the remaining dataset
df_remaining = df_remaining.sample(frac=1, random_state=SEED).reset_index(drop=True)

# Proportions for training (70%), validation (15%) and test (15%)
train_ratio = 0.7
val_ratio = 0.15  # 15% validation
test_ratio = 0.15  # 15% test

# Indices for the split
train_index = int(len(df_remaining) * train_ratio)
val_index = train_index + int(len(df_remaining) * val_ratio)

# Split into training, validation and test sets
df_train = df_remaining.iloc[:train_index]
df_val = df_remaining.iloc[train_index:val_index]
df_test = df_remaining.iloc[val_index:]

# Print the dataset sizes
print(f"Training set size: {df_train.shape}")
print(f"Validation set size: {df_val.shape}")
print(f"Test set size: {df_test.shape}")
print(f"Final evaluation set size: {df_eval_final.shape}")

# Convert to NumPy arrays
X_train, y_train = np.array(df_train["Text"].tolist()), np.array(df_train["Label"])
X_val, y_val = np.array(df_val["Text"].tolist()), np.array(df_val["Label"])
X_test, y_test = np.array(df_test["Text"].tolist()), np.array(df_test["Label"])
X_eval_final, y_eval_final = np.array(df_eval_final["Text"].tolist()), np.array(df_eval_final["Label"])

# Print the data shapes
print("Data shapes:")
print(f"   Training: {X_train.shape}")
print(f"   Validation: {X_val.shape}")
print(f"   Test: {X_test.shape}")
print(f"   Final evaluation: {X_eval_final.shape}")


Training set size: (2837, 2)
Validation set size: (607, 2)
Test set size: (609, 2)
Final evaluation set size: (100, 2)
Data shapes:
   Training: (2837, 130, 50)
   Validation: (607, 130, 50)
   Test: (609, 130, 50)
   Final evaluation: (100, 130, 50)
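
The set sizes printed above follow directly from the index arithmetic. The sketch below reproduces it with the row counts reported by the notebook (4153 rows in total, 100 held out); the variable names `n_total`, `n_eval` and `n_remaining` are introduced here only for illustration:

```python
n_total = 4153   # rows in the merged dataset
n_eval = 100     # rows held out for the final evaluation
n_remaining = n_total - n_eval

train_index = int(n_remaining * 0.7)
val_index = train_index + int(n_remaining * 0.15)

n_train = train_index
n_val = val_index - train_index
n_test = n_remaining - val_index  # the test set absorbs the rounding remainder
```

Because `int()` truncates, the test split ends up two rows larger than the validation split (609 against 607).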


**Final check of the dataset**

In [184]:
print("\n First 5 rows of the TRAINING set:")
print(df_train.head())

print("\n First 5 rows of the VALIDATION set:")
print(df_val.head())

print("\n First 5 rows of the TEST set:")
print(df_test.head())

print("\n First 5 rows of the FINAL EVALUATION set:")
print(df_eval_final.head())


 First 5 rows of the TRAINING set:
                                                Text  Label
0  [[-0.6009678327132941, 0.12333267828806424, -0...      0
1  [[0.9114800543113507, 1.4439433160725084, 0.09...      1
2  [[1.3506935653698395, 1.6531935857404196, 0.29...      1
3  [[1.310270918702656, 1.5259911587058368, 0.093...      1
4  [[1.8283282955354128, 1.7640411958771838, 0.24...      1

 First 5 rows of the VALIDATION set:
                                                   Text  Label
2837  [[-1.9058246305264064, 1.747933859364225, 2.07...      0
2838  [[1.068762991628599, 1.1935346455879021, 0.110...      1
2839  [[-1.9646222319222544, -0.06567262018296684, 0...      0
2840  [[1.0562077640374348, -1.1135655477551067, -0....      0
2841  [[-1.6292355387267745, 0.9655902363131101, -1....      0

 First 5 rows of the TEST set:
                                                   Text  Label
3444  [[1.2944574828066835, 1.107379356904149, 0.373...  

### **Building the RNN model from scratch (without TensorFlow/scikit-learn)**

**Weight initialization**

First, we define the weights of the network:

- W_xh: weights from the input to the recurrent (hidden) units.
- W_hh: weights of the recurrent connections (hidden to hidden).
- W_hy: weights from the hidden state to the final prediction.
- b_h and b_y: biases of the hidden layer and of the output.

In [185]:
# Define hyperparameters
input_size = 50    # Embedding dimension
hidden_size = 64   # Number of units in the hidden layer
output_size = 1    # Binary output (0 or 1)
learning_rate = 0.01

# Initialize weights
np.random.seed(42)  # For reproducibility
W_xh = np.random.randn(input_size, hidden_size) * 0.01  # Input-to-hidden weights
W_hh = np.random.randn(hidden_size, hidden_size) * 0.01 # Hidden-to-hidden (recurrent) weights
W_hy = np.random.randn(hidden_size, output_size) * 0.01 # Hidden-to-output weights

# Biases
b_h = np.zeros((1, hidden_size))
b_y = np.zeros((1, output_size))

print("Weights and biases initialized!")

Weights and biases initialized!
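
To make the role of each weight matrix concrete, the sketch below runs a plain (Elman-style) recurrent forward pass over a random batch. It is an illustration under the shapes defined above, not the implementation in `models.rnn_model`, whose internals are not shown in this notebook; the batch size of 4 and the random input are assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
input_size, hidden_size, output_size = 50, 64, 1
batch_size, timesteps = 4, 130

W_xh = rng.standard_normal((input_size, hidden_size)) * 0.01
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.01
W_hy = rng.standard_normal((hidden_size, output_size)) * 0.01
b_h = np.zeros((1, hidden_size))
b_y = np.zeros((1, output_size))

X = rng.standard_normal((batch_size, timesteps, input_size))
h = np.zeros((batch_size, hidden_size))  # initial hidden state

for t in range(timesteps):
    # h_t = tanh(x_t W_xh + h_{t-1} W_hh + b_h)
    h = np.tanh(X[:, t, :] @ W_xh + h @ W_hh + b_h)

logits = h @ W_hy + b_y                # one score per sequence, from the last hidden state
probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid turns the score into P(label = 1)
```

Only the hidden state after the last timestep feeds the output here, which mirrors how the training loops in this notebook read `y_pred[:, -1, :]`.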


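**Cost function (binary cross-entropy)**

In [186]:
def binary_cross_entropy(y_true, y_pred):
    y_pred = np.clip(y_pred, 1e-8, 1 - 1e-8)  # Keeps log(y_pred) and log(1 - y_pred) finite
    # np.mean already averages over the batch; the extra division by the batch size
    # only rescales the logged loss (the gradient is computed separately in the training loop).
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)) / y_pred.shape[0]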
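
The training loops in this notebook use `(y_pred_final - y_batch) / batch_size` as the gradient passed to the RNN. That expression is the derivative of the mean binary cross-entropy with respect to the pre-sigmoid scores. The sketch below checks this numerically with finite differences; `mean_bce` is a helper defined here for the check (the plain mean, without the extra division) and the values of `y` and `z` are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_bce(y_true, y_pred, eps=1e-8):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y = np.array([[1.0], [0.0], [1.0]])
z = np.array([[0.3], [-1.2], [2.0]])  # pre-sigmoid scores

# Analytic gradient of the mean BCE with respect to z.
analytic = (sigmoid(z) - y) / y.shape[0]

# Central finite differences as an independent check.
numeric = np.zeros_like(z)
step = 1e-6
for i in range(z.shape[0]):
    dz = np.zeros_like(z)
    dz[i, 0] = step
    numeric[i, 0] = (mean_bce(y, sigmoid(z + dz)) - mean_bce(y, sigmoid(z - dz))) / (2 * step)
```

The two arrays agree to within numerical precision, which confirms the gradient expression.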

**Mini-Batches**

In [187]:
def get_mini_batches(X, y, batch_size=16, shuffle=True):
    """Split the data into mini-batches."""
    n_samples = X.shape[0]
    indices = np.arange(n_samples)
    if shuffle:
        np.random.shuffle(indices)
    
    for start in range(0, n_samples, batch_size):
        end = min(start + batch_size, n_samples)
        yield X[indices[start:end]], y[indices[start:end]]
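
A short usage sketch of the generator on toy arrays shows that the last batch is smaller when the number of samples is not a multiple of `batch_size`, and that shuffling keeps each input paired with its label. The function is repeated so that the example runs on its own:

```python
import numpy as np

def get_mini_batches(X, y, batch_size=16, shuffle=True):
    n_samples = X.shape[0]
    indices = np.arange(n_samples)
    if shuffle:
        np.random.shuffle(indices)
    for start in range(0, n_samples, batch_size):
        end = min(start + batch_size, n_samples)
        yield X[indices[start:end]], y[indices[start:end]]

X = np.arange(20).reshape(10, 2)  # row i is [2*i, 2*i + 1]
y = np.arange(10)                 # label i belongs to row i

sizes = [xb.shape[0] for xb, _ in get_mini_batches(X, y, batch_size=4, shuffle=False)]

# With shuffling, every row must still travel with its own label.
aligned = all((xb[:, 0] == 2 * yb).all() for xb, yb in get_mini_batches(X, y, batch_size=4))
```

Here `sizes` is `[4, 4, 2]`, so any code that divides by the batch size should use the actual batch length, as the training loops do with `y_batch.shape[0]`.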


**Hyperparameter optimization (initial)**

In [188]:
# Sigmoid activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Re-initialize the weights with Xavier-style scaling, sqrt(1 / fan_in)
W_xh = np.random.randn(input_size, hidden_size) * np.sqrt(1. / input_size)
W_hh = np.random.randn(hidden_size, hidden_size) * np.sqrt(1. / hidden_size)
W_hy = np.random.randn(hidden_size, output_size) * np.sqrt(1. / hidden_size)

HYPERPARAMS = [
    {"epochs": 5, "batch_size": 8, "learning_rate": 0.01, "momentum": 0.9, "bptt_trunc": 2},
    {"epochs": 10, "batch_size": 16, "learning_rate": 0.005, "momentum": 0.95, "bptt_trunc": 3},
    {"epochs": 7, "batch_size": 8, "learning_rate": 0.007, "momentum": 0.8, "bptt_trunc": 2},
]

best_accuracy = 0
best_params = None
best_model = None

# Try each hyperparameter combination
for params in HYPERPARAMS:
    print(f"\nTesting hyperparameters: {params}")

    rnn = RNN(
        n_units=20,
        # activation=ReLUActivation(),
        activation=TanhActivation(),
        bptt_trunc=params["bptt_trunc"],
        input_shape=(MAX_SEQUENCE_LENGTH, EMBEDDING_DIM),
        epochs=params["epochs"],
        batch_size=params["batch_size"],
        learning_rate=params["learning_rate"],
        momentum=params["momentum"],
        loss=BinaryCrossEntropy,
        metric=accuracy
    )

    optimizer = Optimizer(learning_rate=params["learning_rate"])
    rnn.initialize(optimizer)

    for epoch in range(params["epochs"]):
        total_loss = 0
        for X_batch, y_batch in get_mini_batches(X_train, y_train, params["batch_size"]):
            y_pred = rnn.forward_propagation(X_batch)
            y_pred_final = sigmoid(y_pred[:, -1, :])  # Apply sigmoid to the last output

            loss = binary_cross_entropy(y_batch.reshape(-1, 1), y_pred_final)
            grad_loss = (y_pred_final - y_batch.reshape(-1, 1)) / y_batch.shape[0]

            grad_loss_expanded = np.zeros_like(y_pred)
            grad_loss_expanded[:, -1, :] = grad_loss

            rnn.backward_propagation(grad_loss_expanded)

            total_loss += loss

        print(f"Epoch {epoch+1}/{params['epochs']} - Loss: {total_loss:.4f}")

    # Evaluation on the validation set
    preds = rnn.predict(X_val)

    # Debug: shape of `preds`
    print(f"Shape of preds: {preds.shape}")

    # Add a trailing axis if `preds` is 1D
    if preds.ndim == 1:
        preds = preds[:, np.newaxis]

    acc = accuracy(y_val, preds)

    print(f"Accuracy with these hyperparameters: {acc:.4f}")

    if acc > best_accuracy:
        best_accuracy = acc
        best_params = params
        best_model = rnn

print(f"\nBest combination found: {best_params} with accuracy {best_accuracy:.4f}")


Testing hyperparameters: {'epochs': 5, 'batch_size': 8, 'learning_rate': 0.01, 'momentum': 0.9, 'bptt_trunc': 2}
Epoch 1/5 - Loss: 29.9710
Epoch 2/5 - Loss: 24.7932
Epoch 3/5 - Loss: 18.0956
Epoch 4/5 - Loss: 14.9464
Epoch 5/5 - Loss: 13.4533
Shape of preds: (607,)
Accuracy with these hyperparameters: 0.8715

Testing hyperparameters: {'epochs': 10, 'batch_size': 16, 'learning_rate': 0.005, 'momentum': 0.95, 'bptt_trunc': 3}
Epoch 1/10 - Loss: 7.8057
Epoch 2/10 - Loss: 7.7698
Epoch 3/10 - Loss: 7.6781
Epoch 4/10 - Loss: 7.2010
Epoch 5/10 - Loss: 6.8687
Epoch 6/10 - Loss: 6.3363
Epoch 7/10 - Loss: 5.8161
Epoch 8/10 - Loss: 5.1602
Epoch 9/10 - Loss: 4.6612
Epoch 10/10 - Loss: 4.4035
Shape of preds: (607,)
Accuracy with these hyperparameters: 0.8649

Testing hyperparameters: {'epochs': 7, 'batch_size': 8, 'learning_rate': 0.007, 'momentum': 0.8, 'bptt_trunc': 2}
Epoch 1/7 - Loss: 30.7943
Epoch 2/7 - Loss: 29.9650
Epoch 3/7 - Loss: 26.0493
Epoch 4/7 - Loss: 20.

**Train the final model with the best-accuracy hyperparameters (from the previous step)**

In [189]:
final_rnn = RNN(
    n_units=20,
    # activation=ReLUActivation(),
    activation=TanhActivation(),
    bptt_trunc=best_params["bptt_trunc"],
    input_shape=(MAX_SEQUENCE_LENGTH, EMBEDDING_DIM),
    epochs=best_params["epochs"],
    batch_size=best_params["batch_size"],
    learning_rate=best_params["learning_rate"],
    momentum=best_params["momentum"],
    loss=BinaryCrossEntropy,
    metric=accuracy
)

final_optimizer = Optimizer(learning_rate=best_params["learning_rate"])
final_rnn.initialize(final_optimizer)

for epoch in range(best_params["epochs"]):
    total_loss = 0
    for X_batch, y_batch in get_mini_batches(X_train, y_train, best_params["batch_size"]):
        y_pred = final_rnn.forward_propagation(X_batch)
        y_pred_final = sigmoid(y_pred[:, -1, :])  # Apply sigmoid to the last output

        loss = binary_cross_entropy(y_batch.reshape(-1, 1), y_pred_final)

        # Gradient of the loss with respect to the pre-sigmoid output
        grad_loss = (y_pred_final - y_batch.reshape(-1, 1)) / y_batch.shape[0]

        # Expand to 3 dimensions to match the RNN output
        grad_loss_expanded = np.zeros_like(y_pred)  # (batch_size, timesteps, output_size)
        grad_loss_expanded[:, -1, :] = grad_loss  # Only the last timestep receives a gradient

        # Backpropagate the expanded gradient
        final_rnn.backward_propagation(grad_loss_expanded)

        total_loss += loss

    print(f"Final training - Epoch {epoch+1}/{best_params['epochs']} - Loss: {total_loss:.4f}")

# Test the final model
y_test_pred = final_rnn.predict(X_test)

print(f"Shape of y_test_pred: {y_test_pred.shape}")  # Debug

# If 1D, expand to 2D
if y_test_pred.ndim == 1:
    y_test_pred = y_test_pred[:, np.newaxis]

# If 2D (batch_size, timesteps), take the last timestep
if y_test_pred.ndim == 2:
    y_test_pred_final = y_test_pred[:, -1]  # No trailing `:` because the result is already 1D
else:
    y_test_pred_final = y_test_pred[:, -1, :]  # Only if 3D

y_test_pred_labels = (y_test_pred_final > 0.5).astype(int)

y_test_true = y_test.flatten()
# Note: this rebinds the name `accuracy` from the imported metric function to a float
accuracy = np.mean(y_test_pred_labels == y_test_true)
print(f"\nFinal accuracy on the test set: {accuracy:.4f}")

# Build a DataFrame with expected vs predicted values
df_results = pd.DataFrame({
    "expected_value": y_test_true,
    "predicted_value_raw": y_test_pred_final.flatten(),  # Raw prediction before thresholding
    "predicted_value": y_test_pred_labels.flatten()  # Final binary value (0 or 1)
})

# Show the predictions for comparison
print("\nComparison between expected and predicted values:")
print(df_results)


Final training - Epoch 1/7 - Loss: 30.5953
Final training - Epoch 2/7 - Loss: 27.7803
Final training - Epoch 3/7 - Loss: 22.4836
Final training - Epoch 4/7 - Loss: 17.6398
Final training - Epoch 5/7 - Loss: 15.4857
Final training - Epoch 6/7 - Loss: 14.2848
Final training - Epoch 7/7 - Loss: 13.3325
Shape of y_test_pred: (609,)

Final accuracy on the test set: 0.8736

Comparison between expected and predicted values:
     expected_value  predicted_value_raw  predicted_value
0                 1                  1.0                1
1                 0                  0.0                0
2                 1                  1.0                1
3                 0                  0.0                0
4                 1                  1.0                1
..              ...                  ...              ...
604               1                  1.0                1
605               1                  1.0                1
606               0                  1.0                1

**Evaluating the model on the final test dataset**

In [190]:
# Evaluate the final model on the held-out dataset
y_test_pred2 = final_rnn.predict(X_eval_final)

print(f"Shape of y_test_pred2: {y_test_pred2.shape}")  # Debug

# If 1D, expand to 2D
if y_test_pred2.ndim == 1:
    y_test_pred2 = y_test_pred2[:, np.newaxis]

# If 2D (batch_size, timesteps), take the last timestep
if y_test_pred2.ndim == 2:
    y_test_pred_final2 = y_test_pred2[:, -1]  # No trailing `:` because the result is already 1D
else:
    y_test_pred_final2 = y_test_pred2[:, -1, :]  # Only if 3D

y_test_pred_labels2 = (y_test_pred_final2 > 0.5).astype(int)

y_test_true2 = y_eval_final.flatten()
accuracy = np.mean(y_test_pred_labels2 == y_test_true2)
print(f"\nAccuracy on the final evaluation set: {accuracy:.4f}")

# Build a DataFrame with expected vs predicted values
df_results = pd.DataFrame({
    "expected_value": y_test_true2,
    "predicted_value_raw": y_test_pred_final2.flatten(),  # Raw prediction before thresholding
    "predicted_value": y_test_pred_labels2.flatten()  # Final binary value (0 or 1)
})

# Show the predictions for comparison
print("\nComparison between expected and predicted values:")
print(df_results)

######################################################################### create the CSV file with the predictions

# Create IDs for each sample in the format "D2-1", "D2-2", etc.
id_column = [f"D2-{i}" for i in range(1, len(y_test_pred_labels2) + 1)]

# Convert labels to "Human" and "AI"
labels = np.where(y_test_pred_labels2.flatten() == 1, "AI", "Human")

# Build a DataFrame with ID and Label
df_output = pd.DataFrame({
    "ID": id_column,
    "Label": labels
})

# Save as a tab-separated CSV
df_output.to_csv("rnn_predictions.csv", index=False, sep='\t')

print("File 'rnn_predictions.csv' created successfully!")

######################################################################### accuracy between our prediction .csv file and the expected results

# Compare two CSV files to compute the accuracy
path_csv1 = "rnn_predictions.csv"  # Path to the first file
path_csv2 = "../tarefa_1/clean_output_datasets/dataset2_stor_outputs.csv"  # Path to the second file (expected results)

# Load the files
df1 = pd.read_csv(path_csv1, sep='\t')
df2 = pd.read_csv(path_csv2, sep='\t')

# Make sure both are sorted the same way
df1 = df1.sort_values(by="ID").reset_index(drop=True)
df2 = df2.sort_values(by="ID").reset_index(drop=True)

# Compute the accuracy
accuracy = np.mean(df1["Label"] == df2["Label"])
print(f"Accuracy between the two files: {accuracy:.4f}")


Shape of y_test_pred2: (100,)

Accuracy on the final evaluation set: 0.6700

Comparison between expected and predicted values:
    expected_value  predicted_value_raw  predicted_value
0                0                  0.0                0
1                0                  1.0                1
2                1                  1.0                1
3                0                  0.0                0
4                0                  0.0                0
..             ...                  ...              ...
95               0                  0.0                0
96               1                  1.0                1
97               1                  0.0                0
98               1                  1.0                1
99               0                  0.0                0

[100 rows x 3 columns]
File 'rnn_predictions.csv' created successfully!
Accuracy between the two files: 0.6700
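
Comparing the two files row by row after sorting works only if both contain exactly the same set of IDs. A sketch of an alternative that aligns the rows by joining on `ID`, using toy frames in place of the real files (the labels below are invented for illustration):

```python
import pandas as pd

predicted = pd.DataFrame({"ID": ["D2-1", "D2-2", "D2-10"], "Label": ["AI", "Human", "AI"]})
expected = pd.DataFrame({"ID": ["D2-10", "D2-1", "D2-2"], "Label": ["AI", "AI", "AI"]})

# Align rows by ID; validate="one_to_one" raises MergeError on duplicated IDs.
joined = predicted.merge(expected, on="ID", suffixes=("_pred", "_true"), validate="one_to_one")

csv_accuracy = (joined["Label_pred"] == joined["Label_true"]).mean()
all_ids_matched = len(joined) == len(predicted) == len(expected)
```

In this toy case two of the three labels agree, and `all_ids_matched` confirms that no row was lost in the join.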




### **Analysis of results**

**Training on dataset: gpt_vs_human**

- During training: 0.87 - 0.9

- On dataset1: 0.66

- On dataset2: 0.8 - 1.0

- On ai_human: 0.51

- On the dataset provided by the professor: 0.66

**Training on dataset: ai_human**

- During training: 0.81 - 0.84

- On gpt_vs_human: 0.49

### **Hyperparameter tuning based on the previous model: search over 3600 different combinations**

The loop shown below was run over 3600 combinations. For brevity, the code is currently executed with only the best combination found:

**Best combination found: {'epochs': 5, 'batch_size': 8, 'learning_rate': 0.01, 'momentum': 0.8, 'bptt_trunc': 6} with accuracy 0.8929**
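
The figure of 3600 is the product of the sizes of the five value lists used in the full search (6 x 4 x 5 x 6 x 5). The sketch below builds the same grid with `itertools.product` as an equivalent way of writing the nested comprehension; the dictionary name `grid` is introduced here for illustration:

```python
from itertools import product

grid = {
    "epochs": [5, 10, 15, 20, 25, 30],
    "batch_size": [8, 16, 32, 64],
    "learning_rate": [0.01, 0.005, 0.001, 0.0005, 0.0001],
    "momentum": [0.7, 0.8, 0.85, 0.9, 0.95, 0.99],
    "bptt_trunc": [2, 3, 4, 5, 6],
}

# One dict per combination, in the same order as nested for-loops over the lists.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
n_combinations = len(combinations)
```

Keeping the grid in a single dictionary makes it easy to shrink or extend one dimension without touching the loop.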

In [191]:
# Earlier cells rebound `accuracy` to a float; restore the metric function before using it again
print(f"Type of accuracy before the call: {type(accuracy)}")
if not callable(accuracy):  # If it is no longer a function
    del accuracy  # Remove the overwritten variable
    from helpers.metrics import accuracy  # Re-import


# Hyperparameters for the extensive search
# HYPERPARAMS = [
#     {"epochs": ep, "batch_size": bs, "learning_rate": lr, "momentum": mo, "bptt_trunc": bt}
#     for ep in [5, 10, 15, 20, 25, 30]
#     for bs in [8, 16, 32, 64]
#     for lr in [0.01, 0.005, 0.001, 0.0005, 0.0001]
#     for mo in [0.7, 0.8, 0.85, 0.9, 0.95, 0.99]
#     for bt in [2, 3, 4, 5, 6]
# ]

# Only the best hyperparameters found previously
HYPERPARAMS = [
    {"epochs": ep, "batch_size": bs, "learning_rate": lr, "momentum": mo, "bptt_trunc": bt}
    for ep in [5]
    for bs in [8]
    for lr in [0.01]
    for mo in [0.8]
    for bt in [6]
]

best_accuracy = 0
best_params = None
best_model = None

start_time = time.time()
MAX_TIME = 21600  # 6 hours in seconds

# Hyperparameter search
for params in HYPERPARAMS:
    if time.time() - start_time > MAX_TIME:
        break

    print(f"\nTesting hyperparameters: {params}")

    rnn = RNN(
        n_units=20,
        activation=TanhActivation(),
        bptt_trunc=params["bptt_trunc"],
        input_shape=(MAX_SEQUENCE_LENGTH, EMBEDDING_DIM),
        epochs=params["epochs"],
        batch_size=params["batch_size"],
        learning_rate=params["learning_rate"],
        momentum=params["momentum"],
        loss=BinaryCrossEntropy,
        metric=accuracy
    )

    optimizer = Optimizer(learning_rate=params["learning_rate"])
    rnn.initialize(optimizer)

    for epoch in range(params["epochs"]):
        total_loss = 0
        for X_batch, y_batch in get_mini_batches(X_train, y_train, params["batch_size"]):
            y_pred = rnn.forward_propagation(X_batch)
            y_pred_final = sigmoid(y_pred[:, -1, :])

            loss = binary_cross_entropy(y_batch.reshape(-1, 1), y_pred_final)
            grad_loss = (y_pred_final - y_batch.reshape(-1, 1)) / y_batch.shape[0]

            grad_loss_expanded = np.zeros_like(y_pred)
            grad_loss_expanded[:, -1, :] = grad_loss

            rnn.backward_propagation(grad_loss_expanded)
            total_loss += loss

        print(f"Epoch {epoch+1}/{params['epochs']} - Loss: {total_loss:.4f}")

    preds = rnn.predict(X_val)
    if preds.ndim == 1:
        preds = preds[:, np.newaxis]
    acc_value = accuracy(y_val, preds)

    print(f"Accuracy with these hyperparameters: {acc_value:.4f}")

    if acc_value > best_accuracy:
        best_accuracy = acc_value
        best_params = params
        best_model = rnn

print(f"\nBest combination found: {best_params} with accuracy {best_accuracy:.4f}")

Type of accuracy before the call: <class 'numpy.float64'>

Testing hyperparameters: {'epochs': 5, 'batch_size': 8, 'learning_rate': 0.01, 'momentum': 0.8, 'bptt_trunc': 6}
Epoch 1/5 - Loss: 30.7250
Epoch 2/5 - Loss: 26.9777
Epoch 3/5 - Loss: 18.3877
Epoch 4/5 - Loss: 15.3894
Epoch 5/5 - Loss: 13.6599
Accuracy with these hyperparameters: 0.8764

Best combination found: {'epochs': 5, 'batch_size': 8, 'learning_rate': 0.01, 'momentum': 0.8, 'bptt_trunc': 6} with accuracy 0.8764


**Treino do modelo final, com os melhores hiperpar√¢metros**

In [192]:
final_rnn = RNN(
    n_units=20,
    activation=TanhActivation(),
    bptt_trunc=best_params["bptt_trunc"],
    input_shape=(MAX_SEQUENCE_LENGTH, EMBEDDING_DIM),
    epochs=best_params["epochs"],
    batch_size=best_params["batch_size"],
    learning_rate=best_params["learning_rate"],
    momentum=best_params["momentum"],
    loss=BinaryCrossEntropy,
    metric=accuracy
)

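# Pass the selected momentum; otherwise Optimizer falls back to its default value
final_optimizer = Optimizer(learning_rate=best_params["learning_rate"], momentum=best_params["momentum"])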
final_rnn.initialize(final_optimizer)

for epoch in range(best_params["epochs"]):
    total_loss = 0
    for X_batch, y_batch in get_mini_batches(X_train, y_train, best_params["batch_size"]):
        y_pred = final_rnn.forward_propagation(X_batch)
        y_pred_final = sigmoid(y_pred[:, -1, :])  # Apply sigmoid to the last timestep's output

        loss = binary_cross_entropy(y_batch.reshape(-1, 1), y_pred_final)

        # Gradient of the loss with respect to the pre-sigmoid output
        grad_loss = (y_pred_final - y_batch.reshape(-1, 1)) / y_batch.shape[0]

        # Expand to 3 dimensions to match the RNN output
        grad_loss_expanded = np.zeros_like(y_pred)  # (batch_size, timesteps, output_size)
        grad_loss_expanded[:, -1, :] = grad_loss  # Only the last timestep receives a gradient

        # Backpropagate the expanded gradient
        final_rnn.backward_propagation(grad_loss_expanded)

        total_loss += loss

    print(f"Final training - Epoch {epoch+1}/{best_params['epochs']} - Loss: {total_loss:.4f}")

# Evaluate the final model
y_test_pred = final_rnn.predict(X_test)

print(f"Shape of y_test_pred: {y_test_pred.shape}")  # Debug

# If the output is 1D, expand it to 2D
if y_test_pred.ndim == 1:
    y_test_pred = y_test_pred[:, np.newaxis]

# If the output is 2D (batch_size, timesteps), take the last timestep
if y_test_pred.ndim == 2:
    y_test_pred_final = y_test_pred[:, -1]  # No trailing `:`, the result is already 1D
else:
    y_test_pred_final = y_test_pred[:, -1, :]  # Only when the output is 3D

y_test_pred_labels = (y_test_pred_final > 0.5).astype(int)

y_test_true = y_test.flatten()
# Use a separate name so the imported `accuracy` metric function is not overwritten
test_accuracy = np.mean(y_test_pred_labels == y_test_true)
print(f"\nFinal accuracy on the test set: {test_accuracy:.4f}")

# Build a DataFrame with expected vs. predicted values
df_results = pd.DataFrame({
    "expected_value": y_test_true,
    "predicted_value_raw": y_test_pred_final.flatten(),  # Raw value before thresholding
    "predicted_value": y_test_pred_labels.flatten()  # Final binary value (0 or 1)
})

# Show the predictions for comparison
print("\nComparison between expected and predicted values:")
print(df_results)

Final training - Epoch 1/5 - Loss: 30.5587
Final training - Epoch 2/5 - Loss: 23.2034
Final training - Epoch 3/5 - Loss: 17.1216
Final training - Epoch 4/5 - Loss: 14.4594
Final training - Epoch 5/5 - Loss: 13.0550
Shape of y_test_pred: (609,)

Final accuracy on the test set: 0.8736

Comparison between expected and predicted values:
     expected_value  predicted_value_raw  predicted_value
0                 1                  1.0                1
1                 0                  0.0                0
2                 1                  1.0                1
3                 0                  0.0                0
4                 1                  1.0                1
..              ...                  ...              ...
604               1                  1.0                1
605               1                  1.0                1
606               0                  1.0                1
607               1                  1.0                1
608               0      

**Testing on the dataset provided by the professor**

In [193]:
# Evaluate the final model
y_test_pred2 = final_rnn.predict(X_eval_final)

print(f"Shape of y_test_pred2: {y_test_pred2.shape}")  # Debug

# If the output is 1D, expand it to 2D
if y_test_pred2.ndim == 1:
    y_test_pred2 = y_test_pred2[:, np.newaxis]

# If the output is 2D (batch_size, timesteps), take the last timestep
if y_test_pred2.ndim == 2:
    y_test_pred_final2 = y_test_pred2[:, -1]  # No trailing `:`, the result is already 1D
else:
    y_test_pred_final2 = y_test_pred2[:, -1, :]  # Only when the output is 3D

y_test_pred_labels2 = (y_test_pred_final2 > 0.5).astype(int)

y_test_true2 = y_eval_final.flatten()
# Use a separate name so the imported `accuracy` metric function is not overwritten
eval_accuracy = np.mean(y_test_pred_labels2 == y_test_true2)
print(f"\nFinal accuracy on the test set: {eval_accuracy:.4f}")

# Build a DataFrame with expected vs. predicted values
df_results = pd.DataFrame({
    "expected_value": y_test_true2,
    "predicted_value_raw": y_test_pred_final2.flatten(),  # Raw value before thresholding
    "predicted_value": y_test_pred_labels2.flatten()  # Final binary value (0 or 1)
})

# Show the predictions for comparison
print("\nComparison between expected and predicted values:")
print(df_results)


Shape of y_test_pred2: (100,)

Final accuracy on the test set: 0.7500

Comparison between expected and predicted values:
    expected_value  predicted_value_raw  predicted_value
0                0                  0.0                0
1                0                  1.0                1
2                1                  1.0                1
3                0                  0.0                0
4                0                  0.0                0
..             ...                  ...              ...
95               0                  0.0                0
96               1                  1.0                1
97               1                  1.0                1
98               1                  1.0                1
99               0                  0.0                0

[100 rows x 3 columns]
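
The two evaluation cells above repeat the same shape handling before thresholding. One way to factor it out is a small helper like the one sketched below. The name `to_binary_labels` and its `threshold` parameter are hypothetical and not part of the notebook's helpers; the sketch assumes that 3D predictions have shape `(batch, timesteps, outputs)` and that the last timestep and last column carry the score of interest.

```python
import numpy as np

def to_binary_labels(pred, threshold=0.5):
    """Reduce 1D, 2D or 3D predictions to a 1D array of 0/1 labels (hypothetical helper)."""
    pred = np.asarray(pred)
    if pred.ndim == 3:
        pred = pred[:, -1, :]                        # keep only the last timestep
    scores = pred.reshape(pred.shape[0], -1)[:, -1]  # last column; 1D input becomes (N, 1) first
    return (scores > threshold).astype(int)

one_d = np.array([0.9, 0.2, 0.7])                        # shape (3,)
three_d = np.array([[[0.1], [0.8]], [[0.6], [0.3]]])     # shape (2, 2, 1)

print(to_binary_labels(one_d))    # [1 0 1]
print(to_binary_labels(three_d))  # [1 0]
```

Because the result is always 1D, it can be compared directly with a flattened label vector without relying on broadcasting.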


**Creating the CSV file with the final predictions for the dataset provided by the professor**

In [194]:
# Generate IDs for each prediction in the format D2-1, D2-2, ...
ids = [f"D2-{i+1}" for i in range(len(y_test_pred_labels2))]

# Map 0 to "Human" and 1 to "AI"
labels = ["Human" if pred == 0 else "AI" for pred in y_test_pred_labels2.flatten()]

# Create a DataFrame with ID and Label columns
df_predictions = pd.DataFrame({
    "ID": ids,
    "Label": labels
})

# Save the predictions to a CSV file using a tab separator to match the exact format
df_predictions.to_csv("previsoes_rnn.csv", sep="\t", index=False)

print("\nPredictions saved to previsoes_rnn.csv successfully!")

# Compare the two CSV files to compute the accuracy
path_csv1 = "previsoes_rnn.csv"
# Forward slashes avoid the invalid escape sequences (`\c`, `\d`) of the backslash version
path_csv2 = "../tarefa_1/clean_output_datasets/dataset2_stor_outputs.csv"

# Load the files
df1 = pd.read_csv(path_csv1, sep='\t')
df2 = pd.read_csv(path_csv2, sep='\t')

# Make sure both files are in the same order
df1 = df1.sort_values(by="ID").reset_index(drop=True)
df2 = df2.sort_values(by="ID").reset_index(drop=True)

# Compute the accuracy (separate name so the imported `accuracy` function is not overwritten)
file_accuracy = np.mean(df1["Label"] == df2["Label"])
print(f"Accuracy between the two files: {file_accuracy:.4f}")


Predictions saved to previsoes_rnn.csv successfully!
Accuracy between the two files: 0.7500
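
Sorting both frames by `ID` and comparing row by row works when the two files contain exactly the same IDs. An alternative that does not depend on row order is to merge on the `ID` column, as in the sketch below. It uses small in-memory frames instead of the real files, and `label_accuracy` is a hypothetical helper name; only the `ID` and `Label` column names come from the cell above.

```python
import pandas as pd

def label_accuracy(pred_df, truth_df):
    """Fraction of matching labels after aligning both frames on the ID column."""
    merged = pred_df.merge(
        truth_df,
        on="ID",
        how="inner",
        suffixes=("_pred", "_true"),
        validate="one_to_one",   # raises if an ID is duplicated in either frame
    )
    if len(merged) != len(truth_df):
        raise ValueError("some IDs in the ground truth have no prediction")
    return float((merged["Label_pred"] == merged["Label_true"]).mean())

# Toy data: rows are deliberately in a different order in each frame
pred = pd.DataFrame({"ID": ["D2-1", "D2-2", "D2-10"], "Label": ["AI", "Human", "AI"]})
truth = pd.DataFrame({"ID": ["D2-10", "D2-1", "D2-2"], "Label": ["AI", "AI", "AI"]})

print(f"Accuracy: {label_accuracy(pred, truth):.4f}")
```

With the merge, a missing or duplicated ID produces an explicit error instead of silently shifting the comparison.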



### **Analysis of results for the best combination found**

**Best combination found: {'epochs': 5, 'batch_size': 8, 'learning_rate': 0.01, 'momentum': 0.8, 'bptt_trunc': 6} with accuracy 0.8929**

**Training with dataset: gpt_vs_human**

- During training: accuracy between 0.87 and 0.9

- On dataset1: 0.60

- On dataset2: between 0.8 and 1

**Dataset provided by the professor**

- 0.66  (according to the prediction)