# 📚 AI-TEXTIFICATION PROJECT
## 👅 Natural Language Processing
## 💻 Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas
## 🏫 Universidad Nacional Autónoma de México

<hr>

### 🤖 AI-TEXTIFICATION
### 📓 NOTEBOOK [01]: PREPARATION
### 📄 Authorship detection in AI vs. human texts:

#### 🔵 **Task A**: Binary Classification:
1. Human text.
2. Artificial Intelligence text.

#### 🔵 **Task B**: Multiclass Classification:
1. ChatGPT text.
2. Cohere text.
3. Davinci text.
4. Dolly text.
5. Human text.

**👬 Authors:**
* León Rosas Manuel Alejandro.
* Ramos Herrera Iván Alejandro.


# [01] 🎯 Objective

**IN THIS NOTEBOOK THE DATASET IS OBTAINED AND PREPROCESSING TRANSFORMATIONS ARE APPLIED TO IT IN SEVERAL COMBINATIONS:**

1. **PA: Cleaned => Lemma => UNK => ~GENSIM Embeddings~**.
2. **PB: Cleaned => Lemma => UNK => ~OWN Embeddings~**.
3. **PC: Cleaned => UNK => ~GENSIM Embeddings~**.
4. **PD: Cleaned => UNK => ~OWN Embeddings~**.
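
The four combinations differ only in which steps are chained. A minimal sketch of that idea follows; the `toy_*` functions are hypothetical stand-ins invented for illustration, not the notebook's real steps, and the real UNK step works on corpus-level frequencies rather than a fixed vocabulary.

```python
from functools import reduce


def compose(*steps):
    """Chain text-level steps left to right into a single function."""
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)


# Hypothetical stand-ins for the real steps (illustration only).
def toy_clean(text):
    # Lowercases and collapses whitespace.
    return " ".join(text.lower().split())


def toy_lemma(text):
    # Crude placeholder: strips a trailing "s"; NOT a real lemmatizer.
    return " ".join(
        w[:-1] if w.endswith("s") and len(w) > 3 else w for w in text.split()
    )


def toy_unk(text, vocabulary=frozenset({"the", "cat", "cats", "sleep"})):
    # Replaces out-of-vocabulary words with "unk".
    return " ".join(w if w in vocabulary else "unk" for w in text.split())


PIPELINES = {
    "PA/PB": compose(toy_clean, toy_lemma, toy_unk),  # Cleaned => Lemma => UNK
    "PC/PD": compose(toy_clean, toy_unk),             # Cleaned => UNK
}

sample = "The  CATS sleep soundly"
results = {name: run(sample) for name, run in PIPELINES.items()}
print(results)
```

Within each pair (PA/PB and PC/PD) the text preprocessing is identical; only the embedding method applied afterwards changes.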

# [02] 📓 Dataset Selection

## SemEval2024-task8
### Source: https://github.com/mbzuai-nlp/SemEval2024-task8
### Dataset information:

#### **SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection**

> Large language models (LLMs) are becoming mainstream and easily accessible, ushering in an explosion of machine-generated content over various channels, such as news, social media, question-answering forums, educational, and even academic contexts. Recent LLMs, such as ChatGPT and GPT-4, generate remarkably fluent responses to a wide variety of user queries. The articulate nature of such generated texts makes LLMs attractive for replacing human labor in many scenarios. However, this has also resulted in concerns regarding their potential misuse, such as spreading misinformation and causing disruptions in the education system. Since humans perform only slightly better than chance when classifying machine-generated vs. human-written text, there is a need to develop automatic systems to identify machine-generated text with the goal of mitigating its potential misuse.

> We offer three subtasks over two paradigms of text generation: (1) full text when a considered text is entirely written by a human or generated by a machine; and (2) mixed text when a machine-generated text is refined by a human or a human-written text paraphrased by a machine.


**Subtasks**

> **Subtask A.** Binary Human-Written vs. Machine-Generated Text Classification: Given a full text, determine whether it is human-written or machine-generated. There are two tracks for subtask A: monolingual (only English sources) and multilingual.

> **Subtask B.** Multi-Way Machine-Generated Text Classification: Given a full text, determine who generated it. It can be human-written or generated by a specific language model.

> **Subtask C.** Human-Machine Mixed Text Detection: Given a mixed text, where the first part is human-written and the second part is machine-generated, determine the boundary, where the change occurs.

**Downloading**

| Task | File ID |
|------|---------|
| Whole dataset | 14DulzxuH5TDhXtviRVXsH5e2JTY2POLi |
| Subtask A | 1CAbb3DjrOPBNm0ozVBfhvrEh9P9rAppc |
| Subtask B | 11YeloR2eTXcTzdwI04Z-M2QVvIeQAU6- |
| Subtask C | 16bRUuoeb_LxnCkcKM-ed6X6K5t_1C6mL |
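
The IDs in the table can be turned into the folder URLs that `gdown --folder` receives later in this notebook. The sketch below only builds strings and does not download anything. The helper names are hypothetical, and treating the Subtask C ID as a folder is an assumption, since the notebook only downloads Subtasks A and B this way.

```python
# IDs copied from the table above.
SUBTASK_FOLDER_IDS = {
    "A": "1CAbb3DjrOPBNm0ozVBfhvrEh9P9rAppc",
    "B": "11YeloR2eTXcTzdwI04Z-M2QVvIeQAU6-",
    "C": "16bRUuoeb_LxnCkcKM-ed6X6K5t_1C6mL",  # assumed to be a folder as well
}


def folder_url(subtask):
    """Builds the Google Drive folder URL for a subtask ID."""
    return f"https://drive.google.com/drive/folders/{SUBTASK_FOLDER_IDS[subtask]}"


def gdown_command(subtask):
    """Returns the command as an argument list, e.g. for subprocess.run."""
    return ["gdown", "--folder", folder_url(subtask)]


print(" ".join(gdown_command("A")))
```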



# [03] 📖 PREPARATION

## [A] 😀 Binary Classification [Human: 0 | Machine: 1]

## Data Acquisition

In [None]:
import pandas as pd
import numpy as np
import nltk
import re
from utils import sigmoid, get_batches, compute_pca, get_dict
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from collections import Counter
nltk.download("punkt")
nltk.download("wordnet")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
!pip install gdown



In [None]:
# Dataset SubTaskA:
!gdown --folder https://drive.google.com/drive/folders/1CAbb3DjrOPBNm0ozVBfhvrEh9P9rAppc

Retrieving folder list
Processing file 1e_G-9a66AryHxBOwGWhriePYCCa4_29e subtaskA_dev_monolingual.jsonl
Processing file 123UQ92LxtHaVTbNYlmjnG1CWwD-x7wDL subtaskA_dev_multilingual.jsonl
Processing file 1HeCgnLuDoUHhP-2OsTSSC3FXRLVoI6OG subtaskA_train_monolingual.jsonl
Processing file 13-9-DakCeLFbPgCiVIU0v6_BCQx0ppz6 subtaskA_train_multilingual.jsonl
Retrieving folder list completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1e_G-9a66AryHxBOwGWhriePYCCa4_29e
To: /content/SubtaskA/subtaskA_dev_monolingual.jsonl
100% 10.8M/10.8M [00:00<00:00, 50.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=123UQ92LxtHaVTbNYlmjnG1CWwD-x7wDL
To: /content/SubtaskA/subtaskA_dev_multilingual.jsonl
100% 21.2M/21.2M [00:00<00:00, 154MB/s]
Downloading...
From: https://drive.google.com/uc?id=1HeCgnLuDoUHhP-2OsTSSC3FXRLVoI6OG
To: /content/SubtaskA/subtaskA_train_monolingual.jsonl
100% 347M/347M [00:02<00:00, 122MB/s]
Downloa

In [None]:
dataAdev = pd.read_json("/content/SubtaskA/subtaskA_dev_monolingual.jsonl", lines=True).set_index("id")
dataAtrain = pd.read_json("/content/SubtaskA/subtaskA_train_monolingual.jsonl", lines=True).set_index("id")
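
The `.jsonl` files hold one JSON object per line, which is why `lines=True` is passed to `pd.read_json`. As a rough illustration of the format, the sketch below parses two made-up records with the standard library only. The field names mirror the columns shown in this notebook, while the texts and the `read_jsonl` helper are invented.

```python
import io
import json

# Two made-up records in JSON Lines format (one JSON object per line).
SAMPLE_JSONL = io.StringIO(
    '{"id": 0, "text": "first made-up text", "label": 1, "model": "chatGPT", "source": "wikihow"}\n'
    '{"id": 1, "text": "second made-up text", "label": 0, "model": "human", "source": "peerread"}\n'
)


def read_jsonl(handle):
    """Parses every non-empty line as an independent JSON document."""
    return [json.loads(line) for line in handle if line.strip()]


records = read_jsonl(SAMPLE_JSONL)

# Rough equivalent of .set_index("id"): key each record by its id.
by_id = {record.pop("id"): record for record in records}
print(by_id[1])
```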

In [None]:
dataAdev

id,text,label,model,source
0,Giving gifts should always be enjoyable. Howe...,1,bloomz,wikihow
1,Yveltal (Japanese: ユベルタル) is one of the main a...,1,bloomz,wikihow
2,If you'd rather not annoy others by being rude...,1,bloomz,wikihow
3,If you're interested in visiting gravesite(s) ...,1,bloomz,wikihow
4,The following are some tips for becoming succe...,1,bloomz,wikihow
...,...,...,...,...
4995,The paper deals with an interesting applicatio...,0,human,peerread
4996,This manuscript tries to tackle neural network...,0,human,peerread
4997,The paper introduced a regularization scheme t...,0,human,peerread
4998,Inspired by the analysis on the effect of the ...,0,human,peerread


In [None]:
dataAtrain

id,text,label,model,source
0,Forza Motorsport is a popular racing game that...,1,chatGPT,wikihow
1,Buying Virtual Console games for your Nintendo...,1,chatGPT,wikihow
2,Windows NT 4.0 was a popular operating system ...,1,chatGPT,wikihow
3,How to Make Perfume\n\nPerfume is a great way ...,1,chatGPT,wikihow
4,How to Convert Song Lyrics to a Song'\n\nConve...,1,chatGPT,wikihow
...,...,...,...,...
119752,"The paper is an interesting contribution, prim...",0,human,peerread
119753,\nWe thank the reviewers for all their comment...,0,human,peerread
119754,The authors introduce a semi-supervised method...,0,human,peerread
119755,This paper proposes the Neural Graph Machine t...,0,human,peerread


In [None]:
# Checking which authors (models) are present:
dataAdev["model"].unique()

array(['bloomz', 'human'], dtype=object)

In [None]:
dataAtrain["model"].unique()

array(['chatGPT', 'cohere', 'davinci', 'dolly', 'human'], dtype=object)

In [None]:
dataA1 = dataAdev[["text", "label"]]
dataA2 = dataAtrain[["text", "label"]]

In [None]:
dataA1

id,text,label
0,Giving gifts should always be enjoyable. Howe...,1
1,Yveltal (Japanese: ユベルタル) is one of the main a...,1
2,If you'd rather not annoy others by being rude...,1
3,If you're interested in visiting gravesite(s) ...,1
4,The following are some tips for becoming succe...,1
...,...,...
4995,The paper deals with an interesting applicatio...,0
4996,This manuscript tries to tackle neural network...,0
4997,The paper introduced a regularization scheme t...,0
4998,Inspired by the analysis on the effect of the ...,0


In [None]:
dataA2

id,text,label
0,Forza Motorsport is a popular racing game that...,1
1,Buying Virtual Console games for your Nintendo...,1
2,Windows NT 4.0 was a popular operating system ...,1
3,How to Make Perfume\n\nPerfume is a great way ...,1
4,How to Convert Song Lyrics to a Song'\n\nConve...,1
...,...,...
119752,"The paper is an interesting contribution, prim...",0
119753,\nWe thank the reviewers for all their comment...,0
119754,The authors introduce a semi-supervised method...,0
119755,This paper proposes the Neural Graph Machine t...,0


## Exploration

In [None]:
# Exploring class imbalance (dev split):
human_texts_counts = (dataAdev["label"] == 0).sum()
machine_texts_counts = (dataAdev["label"] == 1).sum()
print("Instances of [HUMAN] texts [LABEL: 0]", human_texts_counts)
print("Instances of [MACHINE] texts [LABEL: 1]", machine_texts_counts)

Instances of [HUMAN] texts [LABEL: 0] 2500
Instances of [MACHINE] texts [LABEL: 1] 2500


In [None]:
# Exploring class imbalance (train split):
human_texts_counts = (dataAtrain["label"] == 0).sum()
machine_texts_counts = (dataAtrain["label"] == 1).sum()
print("Instances of [HUMAN] texts [LABEL: 0]", human_texts_counts)
print("Instances of [MACHINE] texts [LABEL: 1]", machine_texts_counts)

Instances of [HUMAN] texts [LABEL: 0] 63351
Instances of [MACHINE] texts [LABEL: 1] 56406


**✅ The dataset is approximately balanced (2,500 vs. 2,500 in dev; 63,351 vs. 56,406 in train), so no resampling is needed!**
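
To put a number on "approximately balanced", the class shares and the majority-to-minority ratio can be computed from the label counts. The sketch below uses only the standard library and rebuilds a label list from the train counts printed above. The helper names are hypothetical.

```python
from collections import Counter


def class_distribution(labels):
    """Maps each label to (count, share of the total)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (count, count / total) for label, count in sorted(counts.items())}


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest one."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


# Label list rebuilt from the train counts printed above.
train_labels = [0] * 63351 + [1] * 56406

distribution = class_distribution(train_labels)
ratio = imbalance_ratio(train_labels)
print(distribution, round(ratio, 2))
```

A ratio close to 1 supports skipping resampling; what counts as close enough is a judgment call rather than a fixed rule.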

## Preparation

### Cleaned

In [None]:
# Symbols that will be replaced by text tokens:
symbols_replacement = {
  "(": " xparenthesis ",
  ")": " parenthesisx ",
  ",": " xcomma ",
  ".": " xpoint ",
  ";": " xpointcomma ",
  "\"": " xdoublequote ",
  "\'": " xsimplequote ",
  "-": " xdash ",
  "?": " xinterrogation ",
  "!": " xadmiration ",
  "&": " xand "
}

# GENERAL TEXT CLEANING:
def clean_text(text):

  # Lowercase:
  text = text.lower()

  # Replaces each symbol with its text equivalent:
  for symbol, replacement in symbols_replacement.items():
    text = text.replace(symbol, replacement)

  # Removes everything that is not an ASCII letter or digit
  # (accented and non-Latin characters are dropped as well):
  text = re.sub(r'[^a-zA-Z0-9]', " ", text)

  # Removes "\n", "\t" and repeated spaces:
  text = " ".join(text.split())

  return text
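
A quick way to see what the cleaning does is to run it on a short string. The sketch below is a condensed copy of `clean_text` with the same symbol table, repeated only so the example is self-contained. The sample strings are made up.

```python
import re

# Same symbol table as in the cell above.
SYMBOLS = {
    "(": " xparenthesis ", ")": " parenthesisx ", ",": " xcomma ",
    ".": " xpoint ", ";": " xpointcomma ", '"': " xdoublequote ",
    "'": " xsimplequote ", "-": " xdash ", "?": " xinterrogation ",
    "!": " xadmiration ", "&": " xand ",
}


def clean_text(text):
    text = text.lower()
    for symbol, replacement in SYMBOLS.items():
        text = text.replace(symbol, replacement)
    # Anything that is not an ASCII letter or digit becomes a space.
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # Collapses newlines, tabs and repeated spaces.
    return " ".join(text.split())


once = clean_text("Hello, world! (v2.0)")
twice = clean_text(once)
non_ascii = clean_text("Café ユベルタル")

print(once)
print(repr(non_ascii))
```

Two properties are worth noting. The function is idempotent, so applying it again to already cleaned text changes nothing. Characters outside `a-z0-9`, such as accented letters or Japanese script, are removed rather than transliterated.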

In [None]:
# Applies the cleaning:
dataAcleanDev = dataAdev.copy()
dataAcleanDev["text"] = dataAcleanDev["text"].apply(clean_text)
dataAcleanDev = dataAcleanDev[["text", "label"]]

dataAcleanTrain = dataAtrain.copy()
dataAcleanTrain["text"] = dataAcleanTrain["text"].apply(clean_text)
dataAcleanTrain = dataAcleanTrain[["text", "label"]]

In [None]:
dataAcleanDev

id,text,label
0,giving gifts should always be enjoyable xpoint...,1
1,yveltal xparenthesis japanese parenthesisx is ...,1
2,if you xsimplequote d rather not annoy others ...,1
3,if you xsimplequote re interested in visiting ...,1
4,the following are some tips for becoming succe...,1
...,...,...
4995,the paper deals with an interesting applicatio...,0
4996,this manuscript tries to tackle neural network...,0
4997,the paper introduced a regularization scheme t...,0
4998,inspired by the analysis on the effect of the ...,0


In [None]:
dataAcleanTrain

id,text,label
0,forza motorsport is a popular racing game that...,1
1,buying virtual console games for your nintendo...,1
2,windows nt 4 xpoint 0 was a popular operating ...,1
3,how to make perfume perfume is a great way to ...,1
4,how to convert song lyrics to a song xsimplequ...,1
...,...,...
119752,the paper is an interesting contribution xcomm...,0
119753,we thank the reviewers for all their comments ...,0
119754,the authors introduce a semi xdash supervised ...,0
119755,this paper proposes the neural graph machine t...,0


### General utilities

#### Lemmatization

In [None]:
# Lemmatization:
# A single lemmatizer instance is reused across calls:
lemmatizer = WordNetLemmatizer()

def lemma_text(text):
  # Converts to lowercase:
  text = text.lower()

  # Tokenizes the text:
  tokens = word_tokenize(text)

  # Lemmatizes each token with WordNetLemmatizer (tokens are treated as nouns by default):
  lemmas = [lemmatizer.lemmatize(token) for token in tokens]

  # Joins the lemmas back into a single text:
  lemmatized_text = " ".join(lemmas)

  return lemmatized_text

#### Low-frequency words

In [None]:
# Low frequencies:
def replace_unk(df, text_column, min_frequency):

  # Tokenizes on whitespace and counts word frequencies over the whole column:
  all_words = " ".join(df[text_column]).split()
  frequencies = Counter(all_words)

  # Replaces words whose frequency is below min_frequency with "unk"
  # (note: this modifies df in place and also returns it):
  df[text_column] = df[text_column].apply(
    lambda x: " ".join("unk" if frequencies[word] < min_frequency else word for word in x.split())
  )

  return df
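
The same idea can be tried on a plain list of strings, which makes the effect of the threshold easy to inspect. This is a minimal sketch with a hypothetical helper name and a made-up corpus. Unlike the DataFrame version, it returns a new list and leaves its input untouched.

```python
from collections import Counter


def replace_rare_words(texts, min_frequency, unk_token="unk"):
    """Returns new texts where words seen fewer than min_frequency times become unk_token."""
    frequencies = Counter(word for text in texts for word in text.split())
    return [
        " ".join(
            word if frequencies[word] >= min_frequency else unk_token
            for word in text.split()
        )
        for text in texts
    ]


corpus = ["the cat sat", "the dog sat", "the bird flew"]
replaced = replace_rare_words(corpus, min_frequency=2)
print(replaced)
```

One design question this makes visible: the frequency table is computed from the same texts it is applied to. Computing it on the training split and reusing it on the dev split would be an alternative worth considering.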

#### OWN EMBEDDINGS

In [None]:
class Embeddings:

  # Constructor:
  def __init__(self, vocabulary_size: int, dimension: int):
    """
      ATTRIBUTES:
        - vocabulary_size [int]: Size of the vocabulary.
        - dimension [int]: Dimension of the desired embedding.
    """
    self.V = vocabulary_size
    self.D = dimension

    # Initialize the weights and biases of the feedforward NN for CBOW:
    # Xavier-style scaling (variance 1 / fan_in):
    np.random.seed(11)
    self.W1 = np.random.randn(self.D, self.V) * np.sqrt(1 / self.V)
    self.W2 = np.random.randn(self.V, self.D) * np.sqrt(1 / self.D)
    self.b1 = np.zeros((self.D, 1))
    self.b2 = np.zeros((self.V, 1))
    self.grad_W1 = self.grad_W2 = self.grad_b1 = self.grad_b2 = None
    assert self.W1.shape == ((self.D, self.V))
    assert self.W2.shape == ((self.V, self.D))


  # Activation function:
  def softmax(self, x):
    """Returns the computation for Softmax(x), to the output layer"""
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x, axis=0, keepdims=True)


  # FEEDFORWARD NEURAL NETWORK FUNCTIONS:
  def cost(self, y_real, y_predicted, batch_size):
    """Computes the Cross Entropy function(y_real, y_predicted, batch_size)"""
    logprobs = np.multiply(np.log(y_predicted),y_real) + np.multiply(np.log(1 - y_predicted), 1 - y_real)
    cost = - 1/batch_size * np.sum(logprobs)
    cost = np.squeeze(cost)
    return cost


  def forward(self, x):
    """Computes the forward step of the feedforward NN as Wx + b across the layers:
    [h1 = ReLU(W1 * x + b1)] => [a2 = (W2 * h1 + b2)]"""

    # Preactivation a1:
    a1 = self.W1 @ x + self.b1
    # h1 = ReLU(a1):
    h1 = np.maximum(0, a1)
    # Preactivation a2:
    a2 = self.W2 @ h1 + self.b2
    return a2, h1


  def backward(self, x, y_predicted, y_real, h, batch_size):
    """Computes backpropagation from the output layer back to a1."""
    # Error propagated to the hidden layer:
    l1 = self.W2.T @ (y_predicted - y_real)
    # ReLU derivative: the gradient only flows where the hidden unit was active (h > 0):
    l1 = l1 * (h > 0)

    # Gradients:
    self.grad_W1 = (1 / batch_size) * np.dot(l1, x.T)
    self.grad_W2 = (1 / batch_size) * np.dot(y_predicted - y_real, h.T)
    self.grad_b1 = (1 / batch_size) * np.sum(l1, axis=1, keepdims=True)
    self.grad_b2 = (1 / batch_size) * np.sum(y_predicted - y_real, axis=1, keepdims=True)


  def embeddings(self, text, wordIndexes, iterations, learning_rate=0.001, batch_size=20):
    """Updates weights and biases with gradient descent for up to `iterations` batches and returns the embeddings (averaged weights)."""
    # Separate step counter, so the stop condition compares against `iterations`:
    step = 0
    for x, y in get_batches(text, wordIndexes, self.V, 2, batch_size):
      # Forward step:
      a2, h1 = self.forward(x)
      # Prediction:
      y_predicted = self.softmax(a2)
      # Cost:
      cost = -np.sum(y * np.log(y_predicted)) / batch_size
      # if ((step + 1) % 10 == 0):
      #  print(f"iterations: {step + 1} cost: {cost:.6f}")
      # Backpropagation step:
      self.backward(x, y_predicted, y, h1, batch_size)

      # Updating the weights and biases:
      self.W1 -= learning_rate * self.grad_W1
      self.W2 -= learning_rate * self.grad_W2
      self.b1 -= learning_rate * self.grad_b1
      self.b2 -= learning_rate * self.grad_b2

      step += 1
      if step == iterations:
          break
      # Learning-rate decay every 100 steps:
      if step % 100 == 0:
          learning_rate *= 0.66

    # Returns the embeddings:
    return (self.W1.T + self.W2) / 2.0
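
One way to gain confidence in a hand-written backward pass is a numerical gradient check. The sketch below is not the class above. It is a standalone toy network with the same shape conventions (`W1` of size D x V, `W2` of size V x D, inputs as columns), random made-up data and hypothetical function names. It compares the analytic gradient of the mean cross-entropy with respect to `W1`, using the ReLU derivative as a 0/1 mask, against central finite differences.

```python
import numpy as np


def softmax(z):
    # Column-wise softmax with max-subtraction for numerical stability.
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)


def loss(W1, b1, W2, b2, x, y):
    # Mean cross-entropy of a one-hidden-layer ReLU network over the batch.
    h = np.maximum(0, W1 @ x + b1)
    p = softmax(W2 @ h + b2)
    return -np.sum(y * np.log(p)) / x.shape[1]


def analytic_grad_W1(W1, b1, W2, b2, x, y):
    a1 = W1 @ x + b1
    h = np.maximum(0, a1)
    p = softmax(W2 @ h + b2)
    # Chain rule: the ReLU derivative is a 0/1 mask on the pre-activation.
    delta = (W2.T @ (p - y)) * (a1 > 0)
    return delta @ x.T / x.shape[1]


def numeric_grad_W1(W1, b1, W2, b2, x, y, eps=1e-6):
    # Central finite differences, one entry of W1 at a time.
    grad = np.zeros_like(W1)
    for index in np.ndindex(*W1.shape):
        plus, minus = W1.copy(), W1.copy()
        plus[index] += eps
        minus[index] -= eps
        grad[index] = (loss(plus, b1, W2, b2, x, y) - loss(minus, b1, W2, b2, x, y)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
V, D, B = 6, 3, 4  # toy vocabulary size, embedding dimension, batch size
W1, b1 = rng.normal(size=(D, V)), rng.normal(size=(D, 1))
W2, b2 = rng.normal(size=(V, D)), rng.normal(size=(V, 1))
x = rng.random((V, B))
y = np.eye(V)[:, rng.integers(0, V, size=B)]  # one-hot targets as columns

analytic = analytic_grad_W1(W1, b1, W2, b2, x, y)
numeric = numeric_grad_W1(W1, b1, W2, b2, x, y)
max_diff = np.abs(analytic - numeric).max()
print("max abs difference:", max_diff)
```

If the mask were replaced by `np.maximum(0, delta)`, the two gradients would generally disagree, which is the kind of mistake this check is meant to catch.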


In [None]:
# Example of use:
with open("shakespeare.txt", "r") as xfile:
  text = xfile.read()
lines = text.split("\n")
tokens = nltk.word_tokenize(text)
words = [ ch.lower() for ch in tokens if ch.isalpha() or ch == "." ]
print("Number of tokens:", len(tokens),"\n", words[:16])

Number of tokens: 63521 
 ['o', 'for', 'a', 'muse', 'of', 'fire', 'that', 'would', 'ascend', 'the', 'brightest', 'heaven', 'of', 'invention', 'a', 'kingdom']


In [None]:
# Getting the embeddings:
word2Index, Index2Word = get_dict(text)
EMB = Embeddings(vocabulary_size=len(word2Index), dimension=50)
embeds = EMB.embeddings(text, word2Index, iterations=1000, learning_rate=0.001, batch_size=1)

In [None]:
# Visualizing the word vectors:
from matplotlib import pyplot
words = ["king", "queen","lord","man", "woman","dog","wolf",
         "rich","happy","sad"]

# idx = [word2Index[word] for word in words]
# X = embeds[idx, :]
# print(X.shape, idx)

In [None]:
# result = compute_pca(X, 2)
# pyplot.scatter(result[:, 0], result[:, 1])
# for i, word in enumerate(words):
#   pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
# pyplot.show()

In [None]:
def own_embeddings(dataframe, text_column="text", dimensions=50, iterations=1000, learning_rate=0.001, batch_size=1):
  # Builds the word dictionary:
  word2Index, _ = get_dict(" ".join(dataframe[text_column]))

  # Initializes the embeddings object:
  EMB = Embeddings(vocabulary_size=len(word2Index), dimension=dimensions)

  # Gets embeddings for each text in the DataFrame:
  dataframe["embeddings"] = dataframe[text_column].apply(lambda texto: get_own_embeddings(texto, word2Index, EMB, iterations, learning_rate, batch_size))

  return dataframe

def get_own_embeddings(text, word2Index, embed_model, iterations, learning_rate, batch_size):
  # Forwards the caller's hyperparameters instead of hard-coded values:
  return embed_model.embeddings(text, word2Index, iterations=iterations, learning_rate=learning_rate, batch_size=batch_size)

#### PYTHON LIBRARY EMBEDDINGS

In [None]:
!pip install gensim



In [None]:
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Tokenize the sentences:
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in lines]

# Train Word2Vec model:
model = Word2Vec(sentences=tokenized_sentences, vector_size=50, window=5, min_count=1, workers=4)

# Save the model:
model.save("word2vec.model")

# Get the word vector for a specific word:
word_vector = model.wv["little"]
print(f"Vector for \"little\": {word_vector}")

# Similarity between two words:
similarity = model.wv.similarity("kingdom", "king")
print(f"Similarity between \"kingdom\" and \"king\": {similarity}")

Vector for "little": [-0.04396478 -0.04938854 -0.00606027  0.1021896  -0.06823986 -0.12721178
  0.11723934  0.25505534 -0.19398691 -0.13440281  0.05841015 -0.18300968
  0.03100637  0.20222826 -0.14868276  0.12527372  0.11266765 -0.03033224
 -0.29242936 -0.12783493  0.01100952  0.16719273  0.28449255 -0.00273758
  0.05475254  0.03364776 -0.00928153  0.01515056 -0.08387332  0.00763966
  0.13602145 -0.06342961 -0.06859242 -0.11698745 -0.02638629  0.11907171
  0.20634604  0.03257493  0.11899865 -0.06488971  0.20469421 -0.05844789
 -0.06282072  0.00431382  0.31327993  0.08990625 -0.0544945   0.00780122
  0.1237552   0.03068915]
Similarity between "kingdom" and "king": 0.9749098420143127


In [None]:
def gensim_embeddings(dataframe, text_column="text", dimensions=50, window=5, min_count=1, workers=4):
  # Tokenizes the text:
  tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in dataframe[text_column]]

  # Trains the Word2Vec model:
  model = Word2Vec(sentences=tokenized_sentences, vector_size=dimensions, window=window, min_count=min_count, workers=workers)

  # Gets embeddings for each text in the DataFrame:
  dataframe["embeddings"] = dataframe[text_column].apply(lambda text: get_embeddings(text, model))

  return dataframe


def get_embeddings(text, model):
  # Tokenizes and processes the text:
  tokens = word_tokenize(text.lower())

  # Gets the average of the word embedding vectors:
  embeddings = [model.wv[token] for token in tokens if token in model.wv]
  if not embeddings:
    # If no token has an embedding, returns a vector of zeros
    # (the dimensionality is stored in model.vector_size):
    return np.zeros(model.vector_size)
  return np.mean(embeddings, axis=0)
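
The document-level vector here is a plain average of the word vectors found in the model, with a zero vector as fallback. A minimal sketch of that pooling step follows, using a small dictionary as a stand-in for `model.wv`. The two-dimensional vectors and the `mean_pool` name are made up for illustration.

```python
def mean_pool(tokens, vectors, dimensions):
    """Averages the vectors of known tokens; returns zeros if none is known."""
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        return [0.0] * dimensions
    # zip(*found) groups the i-th component of every vector together.
    return [sum(component) / len(found) for component in zip(*found)]


# Made-up 2-dimensional vectors standing in for a trained model.
toy_vectors = {"king": [1.0, 0.0], "queen": [0.0, 1.0]}

pooled = mean_pool(["king", "queen", "unknown"], toy_vectors, dimensions=2)
fallback = mean_pool(["unknown"], toy_vectors, dimensions=2)
print(pooled, fallback)
```

Averaging discards word order, so two texts with the same words in a different order receive the same vector.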

### **PA**: Cleaned => Lemma => UNK => ~Python Embeddings~

In [None]:
# DEV:
# Cleaned (clean_text leaves already cleaned text unchanged):
PA_dev = dataAcleanDev.copy()
PA_dev["text"] = PA_dev["text"].apply(clean_text)

# Lemma:
PA_dev["text"] = PA_dev["text"].apply(lemma_text)

# Low frequencies:
PA_dev = replace_unk(PA_dev, "text", 4)

# Saves the dataset with the preprocessed text:
PA_dev.to_csv("/content/drive/MyDrive/Datasets/AITextification/TaskA-DevAB.csv", index=False)

In [None]:
# TRAIN:
# Cleaned (clean_text leaves already cleaned text unchanged):
PA_train = dataAcleanTrain.copy()
PA_train["text"] = PA_train["text"].apply(clean_text)

# Lemma:
PA_train["text"] = PA_train["text"].apply(lemma_text)

# Low frequencies:
PA_train = replace_unk(PA_train, "text", 4)

# Saves the dataset with the preprocessed text:
PA_train.to_csv("/content/drive/MyDrive/Datasets/AITextification/TaskA-TrainAB.csv", index=False)

### **PB**: Cleaned => Lemma => UNK => ~Own Embeddings~

In [None]:
# Cleaned:
# datasetPB = dataA.copy()
# datasetPB["text"] = datasetPB["text"].apply(clean_text)

# Lemma:
# datasetPB["text"] = datasetPB["text"].apply(lemma_text)

# Low frequencies:
# datasetPB = replace_unk(datasetPB, "text", 4)

In [None]:
# Dataset with embeddings:
# datasetPB = own_embeddings(datasetPB, "text", dimensions=50, iterations=1000, learning_rate=0.001, batch_size=1)
# datasetPB.to_csv("/content/drive/MyDrive/Datasets/AITextification/PB-embeddings.csv", index=False)

### **PC**: Cleaned => UNK => ~Python Embeddings~

In [None]:
# DEV:
# Cleaned (clean_text leaves already cleaned text unchanged):
PC_dev = dataAcleanDev.copy()
PC_dev["text"] = PC_dev["text"].apply(clean_text)

# Low frequencies:
PC_dev = replace_unk(PC_dev, "text", 4)

# Saves the dataset with the preprocessed text:
PC_dev.to_csv("/content/drive/MyDrive/Datasets/AITextification/TaskA-DevCD.csv", index=False)

In [None]:
# TRAIN:
# Cleaned (clean_text leaves already cleaned text unchanged):
PC_train = dataAcleanTrain.copy()
PC_train["text"] = PC_train["text"].apply(clean_text)

# Low frequencies:
PC_train = replace_unk(PC_train, "text", 4)

# Saves the dataset with the preprocessed text:
PC_train.to_csv("/content/drive/MyDrive/Datasets/AITextification/TaskA-TrainCD.csv", index=False)

In [None]:
# Dataset with embeddings:
# datasetPC = gensim_embeddings(datasetPC, "text", dimensions=50)
# datasetPC.to_csv("/content/drive/MyDrive/Datasets/AITextification/PC-embeddings.csv", index=False)

### **PD**: Cleaned => UNK => ~Own Embeddings~

In [None]:
# Cleaned:
# datasetPD = dataA.copy()
# datasetPD["text"] = datasetPD["text"].apply(clean_text)

# Low frequencies:
# datasetPD = replace_unk(datasetPD, "text", 4)

In [None]:
# Dataset with embeddings:
# datasetPD = own_embeddings(datasetPD, "text", dimensions=50, iterations=1000, learning_rate=0.001, batch_size=1)
# datasetPD.to_csv("/content/drive/MyDrive/Datasets/AITextification/PD-embeddings.csv", index=False)

## [B] 🤖 Multiclass Classification [AI Models]

In [None]:
# Dataset SubTaskB:
!gdown --folder https://drive.google.com/drive/folders/11YeloR2eTXcTzdwI04Z-M2QVvIeQAU6-

Retrieving folder list
Processing file 1oh9c-d0fo3NtETNySmCNLUc6H1j4dSWE subtaskB_dev.jsonl
Processing file 1k5LMwmYF7PF-BzYQNE2ULBae79nbM268 subtaskB_train.jsonl
Retrieving folder list completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1oh9c-d0fo3NtETNySmCNLUc6H1j4dSWE
To: /content/SubtaskB/subtaskB_dev.jsonl
100% 4.93M/4.93M [00:00<00:00, 142MB/s]
Downloading...
From: https://drive.google.com/uc?id=1k5LMwmYF7PF-BzYQNE2ULBae79nbM268
To: /content/SubtaskB/subtaskB_train.jsonl
100% 155M/155M [00:01<00:00, 86.7MB/s]
Download completed


In [None]:
dataBdev = pd.read_json("/content/SubtaskB/subtaskB_dev.jsonl", lines=True).set_index("id")
dataBtrain = pd.read_json("/content/SubtaskB/subtaskB_train.jsonl", lines=True).set_index("id")

In [None]:
dataBdev

id,text,model,source,label
1844,"Overall, I found the paper ""Machine Comprehens...",chatGPT,peerread,1
1845,"This paper ""Machine Comprehension Using Match-...",chatGPT,peerread,1
1846,The paper presents an end-to-end neural archit...,chatGPT,peerread,1
1847,This paper proposes an end-to-end neural archi...,chatGPT,peerread,1
1848,Title: Incorporating long-range consistency in...,chatGPT,peerread,1
...,...,...,...,...
14560,The paper Energy-Based Spherical Sparse Coding...,dolly,peerread,5
14561,"Dear Author, I have reviewed your submitted pa...",dolly,peerread,5
14562,Denoising Auto-Encoders (DAE) have been used i...,dolly,peerread,5
14563,"The paper Revisiting Denoising Auto-Encoders, ...",dolly,peerread,5


In [None]:
dataBtrain

id,text,model,source,label
0,Forza Motorsport is a popular racing game that...,chatGPT,wikihow,1
1,Buying Virtual Console games for your Nintendo...,chatGPT,wikihow,1
2,Windows NT 4.0 was a popular operating system ...,chatGPT,wikihow,1
3,How to Make Perfume\n\nPerfume is a great way ...,chatGPT,wikihow,1
4,How to Convert Song Lyrics to a Song'\n\nConve...,chatGPT,wikihow,1
...,...,...,...,...
71022,"During the Cold War, the United States was po...",cohere,reddit,2
71023,"The ""continuity thesis"" is the idea that ther...",cohere,reddit,2
71024,"In the early Middle Ages, the pagan Norse wer...",cohere,reddit,2
71025,There are many similarities between the langu...,cohere,reddit,2


In [None]:
# Checking which authors (models) are present:
dataBdev["model"].unique()

array(['chatGPT', 'human', 'davinci', 'cohere', 'bloomz', 'dolly'],
      dtype=object)

In [None]:
dataBtrain["model"].unique()

array(['chatGPT', 'human', 'cohere', 'davinci', 'bloomz', 'dolly'],
      dtype=object)

In [None]:
dataB1 = dataBdev[["text", "label"]]
dataB2 = dataBtrain[["text", "label"]]

In [None]:
dataB1

id,text,label
1844,"Overall, I found the paper ""Machine Comprehens...",1
1845,"This paper ""Machine Comprehension Using Match-...",1
1846,The paper presents an end-to-end neural archit...,1
1847,This paper proposes an end-to-end neural archi...,1
1848,Title: Incorporating long-range consistency in...,1
...,...,...
14560,The paper Energy-Based Spherical Sparse Coding...,5
14561,"Dear Author, I have reviewed your submitted pa...",5
14562,Denoising Auto-Encoders (DAE) have been used i...,5
14563,"The paper Revisiting Denoising Auto-Encoders, ...",5


In [None]:
dataB2

id,text,label
0,Forza Motorsport is a popular racing game that...,1
1,Buying Virtual Console games for your Nintendo...,1
2,Windows NT 4.0 was a popular operating system ...,1
3,How to Make Perfume\n\nPerfume is a great way ...,1
4,How to Convert Song Lyrics to a Song'\n\nConve...,1
...,...,...
71022,"During the Cold War, the United States was po...",2
71023,"The ""continuity thesis"" is the idea that ther...",2
71024,"In the early Middle Ages, the pagan Norse wer...",2
71025,There are many similarities between the langu...,2


## Exploration

In [None]:
# Exploring class imbalance (dev split, label 1 vs. all other labels).
# Caution: in the tables above, label 1 appears next to chatGPT, so the label
# used for human texts should be verified before relying on these names.
human_texts_counts = (dataB1["label"] == 1).sum()
machine_texts_counts = dataB1["label"].isin([0, 2, 3, 4, 5]).sum()
print("Instances of [HUMAN] texts [LABEL: 1]", human_texts_counts)
print("Instances of [MACHINE] texts [LABEL: 0, 2, 3, 4, 5]", machine_texts_counts)

Instances of [HUMAN] texts [LABEL: 1] 500
Instances of [MACHINE] texts [LABEL: 0, 2, 3, 4, 5] 2500


In [None]:
# Exploring class imbalance (train split, label 1 vs. all other labels).
# Caution: the same label-mapping caveat as in the previous cell applies.
human_texts_counts = (dataB2["label"] == 1).sum()
machine_texts_counts = dataB2["label"].isin([0, 2, 3, 4, 5]).sum()
print("Instances of [HUMAN] texts [LABEL: 1]", human_texts_counts)
print("Instances of [MACHINE] texts [LABEL: 0, 2, 3, 4, 5]", machine_texts_counts)

Instances of [HUMAN] texts [LABEL: 1] 11995
Instances of [MACHINE] texts [LABEL: 0, 2, 3, 4, 5] 59032


**🛑 The dataset is imbalanced (roughly one text in the first group for every five in the second), so the F1-score will be an important metric!**
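
A small worked example shows why accuracy alone can mislead on imbalanced data. In the sketch below the labels are made up, with two texts of class 0 and eight of class 1, and a trivial classifier always predicts class 1. The function names are hypothetical, and macro averaging is one choice among several.

```python
def f1_for_label(y_true, y_pred, label):
    """F1-score of a single class, treated as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


# Made-up labels: 2 texts of class 0, 8 texts of class 1.
y_true = [0] * 2 + [1] * 8
# A trivial classifier that always predicts the majority class.
y_pred = [1] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_f1={macro:.3f}")
```

The trivial classifier reaches 80% accuracy while never recognizing the minority class, and the macro F1-score of about 0.44 exposes that.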

## Preparation

### Cleaned

In [None]:
# Applies the cleaning:
dataBcleanDev = dataB1.copy()
dataBcleanDev["text"] = dataBcleanDev["text"].apply(clean_text)

dataBcleanTrain = dataB2.copy()
dataBcleanTrain["text"] = dataBcleanTrain["text"].apply(clean_text)

In [None]:
dataBcleanDev

id,text,label
1844,overall xcomma i found the paper xdoublequote ...,1
1845,this paper xdoublequote machine comprehension ...,1
1846,the paper presents an end xdash to xdash end n...,1
1847,this paper proposes an end xdash to xdash end ...,1
1848,title incorporating long xdash range consisten...,1
...,...,...
14560,the paper energy xdash based spherical sparse ...,5
14561,dear author xcomma i have reviewed your submit...,5
14562,denoising auto xdash encoders xparenthesis dae...,5
14563,the paper revisiting denoising auto xdash enco...,5


In [None]:
dataBcleanTrain

id,text,label
0,forza motorsport is a popular racing game that...,1
1,buying virtual console games for your nintendo...,1
2,windows nt 4 xpoint 0 was a popular operating ...,1
3,how to make perfume perfume is a great way to ...,1
4,how to convert song lyrics to a song xsimplequ...,1
...,...,...
71022,during the cold war xcomma the united states w...,2
71023,the xdoublequote continuity thesis xdoublequot...,2
71024,in the early middle ages xcomma the pagan nors...,2
71025,there are many similarities between the langua...,2


### **PA**: Cleaned => Lemma => UNK => ~Python Embeddings~

In [None]:
# DEV:
# Cleaned (clean_text leaves already cleaned text unchanged):
PA_TB_dev = dataBcleanDev.copy()
PA_TB_dev["text"] = PA_TB_dev["text"].apply(clean_text)

# Lemma:
PA_TB_dev["text"] = PA_TB_dev["text"].apply(lemma_text)

# Low frequencies:
PA_TB_dev = replace_unk(PA_TB_dev, "text", 4)

# Saves the dataset with the preprocessed text:
PA_TB_dev.to_csv("/content/drive/MyDrive/Datasets/AITextification/TaskB-DevAB.csv", index=False)

In [None]:
# TRAIN:
# Cleaned (clean_text leaves already cleaned text unchanged):
PA_TB_train = dataBcleanTrain.copy()
PA_TB_train["text"] = PA_TB_train["text"].apply(clean_text)

# Lemma:
PA_TB_train["text"] = PA_TB_train["text"].apply(lemma_text)

# Low frequencies:
PA_TB_train = replace_unk(PA_TB_train, "text", 4)

# Saves the dataset with the preprocessed text:
PA_TB_train.to_csv("/content/drive/MyDrive/Datasets/AITextification/TaskB-TrainAB.csv", index=False)

### **PB**: Cleaned => Lemma => UNK => ~Own Embeddings~

In [None]:
# ...

### **PC**: Cleaned => UNK => ~Python Embeddings~

In [None]:
# DEV:
# Cleaned (clean_text leaves already cleaned text unchanged):
PC_TB_dev = dataBcleanDev.copy()
PC_TB_dev["text"] = PC_TB_dev["text"].apply(clean_text)

# Low frequencies:
PC_TB_dev = replace_unk(PC_TB_dev, "text", 4)

# Saves the dataset with the preprocessed text:
PC_TB_dev.to_csv("/content/drive/MyDrive/Datasets/AITextification/TaskB-DevCD.csv", index=False)

In [None]:
# TRAIN:
# Cleaned (clean_text leaves already cleaned text unchanged):
PC_TB_train = dataBcleanTrain.copy()
PC_TB_train["text"] = PC_TB_train["text"].apply(clean_text)

# Low frequencies:
PC_TB_train = replace_unk(PC_TB_train, "text", 4)

# Saves the dataset with the preprocessed text:
PC_TB_train.to_csv("/content/drive/MyDrive/Datasets/AITextification/TaskB-TrainCD.csv", index=False)

### **PD**: Cleaned => UNK => ~Own Embeddings~

In [None]:
# ...

# [04] 👬 Authors
* León Rosas Manuel Alejandro.
* Ramos Herrera Iván Alejandro.