# Exercise 5: Probabilistic Model

## Objective
- Apply preprocessing techniques step by step, evaluating the impact of each stage on the number of tokens and on the final vocabulary.
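
The objective asks for the effect of each stage on token count and vocabulary size, which can be measured with a small helper. The following is a minimal sketch, not part of the original notebook: `corpus_stats` and `stage_report` are hypothetical names, and the toy corpus stands in for the real variables (`tokenized_corpus`, `lowercase_corpus`, and so on), which share the same documents → sentences → tokens layout.

```python
from typing import Dict, List, Tuple

# Corpus layout assumed here: documents -> sentences -> tokens
Corpus = List[List[List[str]]]


def corpus_stats(corpus: Corpus) -> Tuple[int, int]:
    """Return (total number of tokens, vocabulary size) for a nested corpus."""
    tokens = [tok for doc in corpus for sent in doc for tok in sent]
    return len(tokens), len(set(tokens))


def stage_report(stages: Dict[str, Corpus]) -> List[Tuple[str, int, int]]:
    """Compute (stage name, tokens, vocabulary) for each stage, in insertion order."""
    return [(name, *corpus_stats(corpus)) for name, corpus in stages.items()]


# Toy stand-in for the real corpora: one document with one sentence
toy_stages = {
    "tokenized": [[["To", "be", "or", "not", "to", "be", "."]]],
    "lowercased": [[["to", "be", "or", "not", "to", "be", "."]]],
    "cleaned": [[["to", "be", "or", "not", "to", "be"]]],
}

for name, n_tokens, n_vocab in stage_report(toy_stages):
    print(f"{name:<12} tokens={n_tokens:<4} vocabulary={n_vocab}")
```

With the real data, the dictionary would map each stage name to the corresponding corpus variable once all cells have been run.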

## Part 0: Loading the Corpus

In [48]:
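import ir_datasets

# Load the ArguAna collection from the BEIR benchmark
dataset = ir_datasets.load("beir/arguana")
# Iterate over the relevance judgments; nothing is displayed here,
# the loop only forces the qrels to be read
for qrel in dataset.qrels_iter():
    pass
print(dataset)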

Dataset(id='beir/arguana', provides=['docs', 'queries', 'qrels'])


## Part 1: Tokenization

### Activity
1. Tokenize the documents.

In [49]:
# Libraries needed for tokenization
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

# Split every document into sentences, then each sentence into word tokens
def tokenize_corpus(dataset):
    tokenized_corpus = []
    for document in dataset.docs_iter():
        sentences = sent_tokenize(document.text)
        tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
        tokenized_corpus.append(tokenized_sentences)
    return tokenized_corpus

# Tokenize the corpus
tokenized_corpus = tokenize_corpus(dataset)
print(f"Tokenized corpus contains {len(tokenized_corpus)} documents.")
print("Tokenized corpus:", tokenized_corpus[:2])

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\wil_s\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Tokenized corpus contains 8674 documents.
Tokenized corpus: [[['You', 'don', '’', 't', 'have', 'to', 'be', 'vegetarian', 'to', 'be', 'green', '.'], ['Many', 'special', 'environments', 'have', 'been', 'created', 'by', 'livestock', 'farming', '–', 'for', 'example', 'chalk', 'down', 'land', 'in', 'England', 'and', 'mountain', 'pastures', 'in', 'many', 'countries', '.'], ['Ending', 'livestock', 'farming', 'would', 'see', 'these', 'areas', 'go', 'back', 'to', 'woodland', 'with', 'a', 'loss', 'of', 'many', 'unique', 'plants', 'and', 'animals', '.'], ['Growing', 'crops', 'can', 'also', 'be', 'very', 'bad', 'for', 'the', 'planet', ',', 'with', 'fertilisers', 'and', 'pesticides', 'polluting', 'rivers', ',', 'lakes', 'and', 'seas', '.'], ['Most', 'tropical', 'forests', 'are', 'now', 'cut', 'down', 'for', 'timber', ',', 'or', 'to', 'allow', 'oil', 'palm', 'trees', 'to', 'be', 'grown', 'in', 'plantations', ',', 'not', 'to', 'create', 'space', 'for', 'meat', 'production', '.'], ['British', 'farmer'
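
In the output above, the typographic apostrophe in “don’t” is split into three tokens (`'don'`, `'’'`, `'t'`). One possible remedy, sketched here as an assumption rather than as part of the original notebook, is to map curly quotes to their ASCII equivalents before tokenizing. `normalize_quotes` and `QUOTE_MAP` are hypothetical names, and the set of characters covered is illustrative.

```python
# Map typographic quotes to ASCII equivalents (illustrative selection)
QUOTE_MAP = str.maketrans({
    "\u2019": "'",  # right single quotation mark, as in don’t
    "\u2018": "'",  # left single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Replace curly quotes with straight ones before tokenization."""
    return text.translate(QUOTE_MAP)


print(normalize_quotes("You don\u2019t have to be vegetarian to be green."))
```

Inside `tokenize_corpus`, this would be applied as `sent_tokenize(normalize_quotes(document.text))`. With a straight apostrophe, `word_tokenize` should treat the contraction as `'do'` and `"n't"` instead of three fragments.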

## Part 2: Normalization

### Activity
1. Convert all tokens to lowercase.
2. Remove punctuation and non-alphabetic symbols.

In [50]:
# Convert all tokens to lowercase
def lowercase_tokens(tokenized_corpus):
    lowercase_corpus = []
    for document in tokenized_corpus:
        lowercase_document = [[word.lower() for word in sentence] for sentence in document]
        lowercase_corpus.append(lowercase_document)
    return lowercase_corpus

# Lowercase the tokenized corpus
lowercase_corpus = lowercase_tokens(tokenized_corpus)
print("Lowercased corpus:", lowercase_corpus[:2])

Lowercased corpus: [[['you', 'don', '’', 't', 'have', 'to', 'be', 'vegetarian', 'to', 'be', 'green', '.'], ['many', 'special', 'environments', 'have', 'been', 'created', 'by', 'livestock', 'farming', '–', 'for', 'example', 'chalk', 'down', 'land', 'in', 'england', 'and', 'mountain', 'pastures', 'in', 'many', 'countries', '.'], ['ending', 'livestock', 'farming', 'would', 'see', 'these', 'areas', 'go', 'back', 'to', 'woodland', 'with', 'a', 'loss', 'of', 'many', 'unique', 'plants', 'and', 'animals', '.'], ['growing', 'crops', 'can', 'also', 'be', 'very', 'bad', 'for', 'the', 'planet', ',', 'with', 'fertilisers', 'and', 'pesticides', 'polluting', 'rivers', ',', 'lakes', 'and', 'seas', '.'], ['most', 'tropical', 'forests', 'are', 'now', 'cut', 'down', 'for', 'timber', ',', 'or', 'to', 'allow', 'oil', 'palm', 'trees', 'to', 'be', 'grown', 'in', 'plantations', ',', 'not', 'to', 'create', 'space', 'for', 'meat', 'production', '.'], ['british', 'farmer', 'and', 'former', 'editor', 'simon', 'fa

In [51]:
# Remove punctuation and non-alphabetic symbols
def remove_punctuation(tokenized_corpus):
    cleaned_corpus = []
    for document in tokenized_corpus:
        # Keep only purely alphabetic tokens; any token containing digits or
        # punctuation (e.g. '.', '18', 'one-off') is dropped entirely
        cleaned_document = [[word for word in sentence if word.isalpha()] for sentence in document]
        cleaned_corpus.append(cleaned_document)
    return cleaned_corpus

# Clean the lowercased corpus
cleaned_corpus = remove_punctuation(lowercase_corpus)
print("Cleaned corpus:", cleaned_corpus[:2])

Cleaned corpus: [[['you', 'don', 't', 'have', 'to', 'be', 'vegetarian', 'to', 'be', 'green'], ['many', 'special', 'environments', 'have', 'been', 'created', 'by', 'livestock', 'farming', 'for', 'example', 'chalk', 'down', 'land', 'in', 'england', 'and', 'mountain', 'pastures', 'in', 'many', 'countries'], ['ending', 'livestock', 'farming', 'would', 'see', 'these', 'areas', 'go', 'back', 'to', 'woodland', 'with', 'a', 'loss', 'of', 'many', 'unique', 'plants', 'and', 'animals'], ['growing', 'crops', 'can', 'also', 'be', 'very', 'bad', 'for', 'the', 'planet', 'with', 'fertilisers', 'and', 'pesticides', 'polluting', 'rivers', 'lakes', 'and', 'seas'], ['most', 'tropical', 'forests', 'are', 'now', 'cut', 'down', 'for', 'timber', 'or', 'to', 'allow', 'oil', 'palm', 'trees', 'to', 'be', 'grown', 'in', 'plantations', 'not', 'to', 'create', 'space', 'for', 'meat', 'production'], ['british', 'farmer', 'and', 'former', 'editor', 'simon', 'farrell', 'also', 'states', 'many', 'vegans', 'and', 'vegeta
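
The `isalpha()` filter in the cell above discards any token that contains a non-letter, so tokens such as `u.n.` or `one-off` from the first document disappear completely rather than being cleaned. The sketch below contrasts that behaviour with a possible alternative that deletes ASCII punctuation first and then keeps whatever remains alphabetic. `filter_only` and `strip_then_filter` are hypothetical names, and the sample tokens are hand-picked for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def filter_only(sentence):
    """Behaviour of the notebook cell: keep tokens that are purely alphabetic."""
    return [tok for tok in sentence if tok.isalpha()]


def strip_then_filter(sentence):
    """Alternative: delete ASCII punctuation first, then keep alphabetic remainders."""
    stripped = (tok.translate(PUNCT_TABLE) for tok in sentence)
    return [tok for tok in stripped if tok.isalpha()]


sample = ["u.n.", "calculation", "18", "%", "one-off", "emissions", ","]
print(filter_only(sample))        # ['calculation', 'emissions']
print(strip_then_filter(sample))  # ['un', 'calculation', 'oneoff', 'emissions']
```

Which variant is preferable depends on the retrieval task: the second keeps more vocabulary but merges hyphenated words into single strings.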

## Part 3: Stopword Removal

### Activity
1. Remove stopwords using a standard list.

In [52]:
# Remove stopwords using NLTK's standard English list
from nltk.corpus import stopwords
nltk.download('stopwords')

# Drop every token that appears in the stopword list
def remove_stopwords(tokenized_corpus):
    stop_words = set(stopwords.words('english'))
    filtered_corpus = []
    for document in tokenized_corpus:
        filtered_document = [[word for word in sentence if word not in stop_words] for sentence in document]
        filtered_corpus.append(filtered_document)
    return filtered_corpus

# Filter the cleaned corpus
filtered_corpus = remove_stopwords(cleaned_corpus)
print("Filtered corpus:", filtered_corpus[:2])


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\wil_s\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Filtered corpus: [[['vegetarian', 'green'], ['many', 'special', 'environments', 'created', 'livestock', 'farming', 'example', 'chalk', 'land', 'england', 'mountain', 'pastures', 'many', 'countries'], ['ending', 'livestock', 'farming', 'would', 'see', 'areas', 'go', 'back', 'woodland', 'loss', 'many', 'unique', 'plants', 'animals'], ['growing', 'crops', 'also', 'bad', 'planet', 'fertilisers', 'pesticides', 'polluting', 'rivers', 'lakes', 'seas'], ['tropical', 'forests', 'cut', 'timber', 'allow', 'oil', 'palm', 'trees', 'grown', 'plantations', 'create', 'space', 'meat', 'production'], ['british', 'farmer', 'former', 'editor', 'simon', 'farrell', 'also', 'states', 'many', 'vegans', 'vegetarians', 'rely', 'one', 'source', 'calculation', 'livestock', 'generates', 'global', 'carbon', 'emissions', 'figure', 'contains', 'basic', 'mistakes'], ['attributes', 'deforestation', 'ranching', 'cattle', 'rather', 'logging', 'development'], ['also', 'muddles', 'emissions', 'deforestation', 'also', 'refu

## Part 4: Stemming or Lemmatization

### Activity
1. Apply stemming.
2. Apply lemmatization.
3. Compare both techniques.

In [53]:
# Apply stemming with the Porter algorithm
from nltk.stem import PorterStemmer

def stem_tokens(tokenized_corpus):
    stemmer = PorterStemmer()
    stemmed_corpus = []
    for document in tokenized_corpus:
        stemmed_document = [[stemmer.stem(word) for word in sentence] for sentence in document]
        stemmed_corpus.append(stemmed_document)
    return stemmed_corpus

# Stem the filtered corpus
stemmed_corpus = stem_tokens(filtered_corpus)
print("Stemmed corpus:", stemmed_corpus[:2])

Stemmed corpus: [[['vegetarian', 'green'], ['mani', 'special', 'environ', 'creat', 'livestock', 'farm', 'exampl', 'chalk', 'land', 'england', 'mountain', 'pastur', 'mani', 'countri'], ['end', 'livestock', 'farm', 'would', 'see', 'area', 'go', 'back', 'woodland', 'loss', 'mani', 'uniqu', 'plant', 'anim'], ['grow', 'crop', 'also', 'bad', 'planet', 'fertilis', 'pesticid', 'pollut', 'river', 'lake', 'sea'], ['tropic', 'forest', 'cut', 'timber', 'allow', 'oil', 'palm', 'tree', 'grown', 'plantat', 'creat', 'space', 'meat', 'product'], ['british', 'farmer', 'former', 'editor', 'simon', 'farrel', 'also', 'state', 'mani', 'vegan', 'vegetarian', 'reli', 'one', 'sourc', 'calcul', 'livestock', 'gener', 'global', 'carbon', 'emiss', 'figur', 'contain', 'basic', 'mistak'], ['attribut', 'deforest', 'ranch', 'cattl', 'rather', 'log', 'develop'], ['also', 'muddl', 'emiss', 'deforest', 'also', 'refut', 'statement', 'meat', 'product', 'ineffici', 'scientist', 'calcul', 'global', 'ratio', 'amount', 'use', 

In [54]:
# Apply lemmatization with WordNet
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

def lemmatize_tokens(tokenized_corpus):
    lemmatizer = WordNetLemmatizer()
    lemmatized_corpus = []
    for document in tokenized_corpus:
        # lemmatize() treats every word as a noun unless a POS tag is given
        lemmatized_document = [[lemmatizer.lemmatize(word) for word in sentence] for sentence in document]
        lemmatized_corpus.append(lemmatized_document)
    return lemmatized_corpus

# Lemmatize the filtered corpus
lemmatized_corpus = lemmatize_tokens(filtered_corpus)
print("Lemmatized corpus:", lemmatized_corpus[:2])

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\wil_s\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Lemmatized corpus: [[['vegetarian', 'green'], ['many', 'special', 'environment', 'created', 'livestock', 'farming', 'example', 'chalk', 'land', 'england', 'mountain', 'pasture', 'many', 'country'], ['ending', 'livestock', 'farming', 'would', 'see', 'area', 'go', 'back', 'woodland', 'loss', 'many', 'unique', 'plant', 'animal'], ['growing', 'crop', 'also', 'bad', 'planet', 'fertiliser', 'pesticide', 'polluting', 'river', 'lake', 'sea'], ['tropical', 'forest', 'cut', 'timber', 'allow', 'oil', 'palm', 'tree', 'grown', 'plantation', 'create', 'space', 'meat', 'production'], ['british', 'farmer', 'former', 'editor', 'simon', 'farrell', 'also', 'state', 'many', 'vegan', 'vegetarian', 'rely', 'one', 'source', 'calculation', 'livestock', 'generates', 'global', 'carbon', 'emission', 'figure', 'contains', 'basic', 'mistake'], ['attribute', 'deforestation', 'ranching', 'cattle', 'rather', 'logging', 'development'], ['also', 'muddle', 'emission', 'deforestation', 'also', 'refutes', 'statement', 'me
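
To make the comparison between the two techniques concrete, the stemmed and lemmatized outputs can be aligned token by token. The sketch below uses the second sentence of the first document, copied from the outputs above. `differing_pairs` is a hypothetical helper, and it assumes both lists come from the same input sentence so that positions line up.

```python
def differing_pairs(stemmed, lemmatized):
    """Return distinct (stem, lemma) pairs where the techniques disagree, in first-seen order."""
    pairs = []
    for stem, lemma in zip(stemmed, lemmatized):
        if stem != lemma and (stem, lemma) not in pairs:
            pairs.append((stem, lemma))
    return pairs


# Second sentence of the first document, as printed in the outputs above
stemmed_sentence = ['mani', 'special', 'environ', 'creat', 'livestock', 'farm', 'exampl',
                    'chalk', 'land', 'england', 'mountain', 'pastur', 'mani', 'countri']
lemmatized_sentence = ['many', 'special', 'environment', 'created', 'livestock', 'farming',
                       'example', 'chalk', 'land', 'england', 'mountain', 'pasture', 'many',
                       'country']

for stem, lemma in differing_pairs(stemmed_sentence, lemmatized_sentence):
    print(f"{stem:<10} vs {lemma}")
```

The pairs show the usual trade-off: Porter stemming produces truncated forms such as `mani` or `countri` that are not dictionary words, while the lemmatizer returns real words. Because `lemmatize()` is called without a part-of-speech tag, it treats each token as a noun, which is why verb forms such as `created` and `farming` are left unchanged.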

In [55]:
# Compare the results of each stage
print("\nComparison of tokenization steps:")
for i in range(2):
    print(f"\nDocument {i+1}:")
    print("Original:", dataset.docs_iter()[i].text)
    print("Tokenized:", tokenized_corpus[i])
    print("Lowercased:", lowercase_corpus[i])
    print("Cleaned:", cleaned_corpus[i])
    print("Filtered:", filtered_corpus[i])
    print("Stemmed:", stemmed_corpus[i])
    print("Lemmatized:", lemmatized_corpus[i])


Comparison of tokenization steps:

Document 1:
Original: You don’t have to be vegetarian to be green. Many special environments have been created by livestock farming – for example chalk down land in England and mountain pastures in many countries. Ending livestock farming would see these areas go back to woodland with a loss of many unique plants and animals. Growing crops can also be very bad for the planet, with fertilisers and pesticides polluting rivers, lakes and seas. Most tropical forests are now cut down for timber, or to allow oil palm trees to be grown in plantations, not to create space for meat production.  British farmer and former editor Simon Farrell also states: “Many vegans and vegetarians rely on one source from the U.N. calculation that livestock generates 18% of global carbon emissions, but this figure contains basic mistakes. It attributes all deforestation from ranching to cattle, rather than logging or development. It also muddles up one-off emissions from defo