# Exercise 5: Probabilistic Model

## Lab Objective
- Apply preprocessing techniques step by step, evaluating the impact of each stage on the number of tokens and on the final vocabulary.
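
One way to quantify that impact is to count tokens and unique tokens after each stage. The sketch below is not part of the original notebook: `corpus_stats` is a hypothetical helper and the two stages are toy data, assuming the documents -> sentences -> tokens nesting used throughout this exercise.

```python
def corpus_stats(nested_corpus):
    """Return (token_count, vocabulary_size) for a corpus shaped as
    documents -> sentences -> tokens."""
    tokens = [tok for doc in nested_corpus for sent in doc for tok in sent]
    return len(tokens), len(set(tokens))

# Toy data (hypothetical), standing in for two stages of the pipeline
raw_stage = [[["The", "Pens", "rule", "!"], ["the", "pens", "rule"]]]
lower_stage = [[[tok.lower() for tok in sent] for sent in doc] for doc in raw_stage]

for name, stage in [("raw", raw_stage), ("lowercased", lower_stage)]:
    n_tokens, vocab_size = corpus_stats(stage)
    print(f"{name}: {n_tokens} tokens, {vocab_size} unique")
```

Calling the same helper on each intermediate corpus would give a per-stage table of token count and vocabulary size.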

## Part 0: Loading the Corpus

In [3]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
corpus = newsgroups.data

## Part 1: Tokenization

### Activity
1. Tokenize the documents.

In [9]:
# Libraries required for tokenization
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

# Split each document into sentences, then each sentence into word tokens
def tokenize_corpus(corpus):
    tokenized_corpus = []
    for document in corpus:
        sentences = sent_tokenize(document)
        tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
        tokenized_corpus.append(tokenized_sentences)
    return tokenized_corpus

# Tokenize the corpus
tokenized_corpus = tokenize_corpus(corpus)
print(f"Tokenized corpus contains {len(tokenized_corpus)} documents.")
print("Tokenized corpus:", tokenized_corpus[:2]) 

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\wil_s\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Tokenized corpus contains 18846 documents.
Tokenized corpus: [[['I', 'am', 'sure', 'some', 'bashers', 'of', 'Pens', 'fans', 'are', 'pretty', 'confused', 'about', 'the', 'lack', 'of', 'any', 'kind', 'of', 'posts', 'about', 'the', 'recent', 'Pens', 'massacre', 'of', 'the', 'Devils', '.'], ['Actually', ',', 'I', 'am', 'bit', 'puzzled', 'too', 'and', 'a', 'bit', 'relieved', '.'], ['However', ',', 'I', 'am', 'going', 'to', 'put', 'an', 'end', 'to', 'non-PIttsburghers', "'", 'relief', 'with', 'a', 'bit', 'of', 'praise', 'for', 'the', 'Pens', '.'], ['Man', ',', 'they', 'are', 'killing', 'those', 'Devils', 'worse', 'than', 'I', 'thought', '.'], ['Jagr', 'just', 'showed', 'you', 'why', 'he', 'is', 'much', 'better', 'than', 'his', 'regular', 'season', 'stats', '.'], ['He', 'is', 'also', 'a', 'lot', 'fo', 'fun', 'to', 'watch', 'in', 'the', 'playoffs', '.'], ['Bowman', 'should', 'let', 'JAgr', 'have', 'a', 'lot', 'of', 'fun', 'in', 'the', 'next', 'couple', 'of', 'games', 'since', 'the', 'Pens', 'a

## Part 2: Normalization

### Activity
1. Convert all tokens to lowercase.
2. Remove punctuation and non-alphabetic symbols.

In [10]:
# Convert every token to lowercase
def lowercase_tokens(tokenized_corpus):
    lowercase_corpus = []
    for document in tokenized_corpus:
        lowercase_document = [[word.lower() for word in sentence] for sentence in document]
        lowercase_corpus.append(lowercase_document)
    return lowercase_corpus
# Lowercase the tokenized corpus
lowercase_corpus = lowercase_tokens(tokenized_corpus)
print("Lowercased corpus:", lowercase_corpus[:2])

Lowercased corpus: [[['i', 'am', 'sure', 'some', 'bashers', 'of', 'pens', 'fans', 'are', 'pretty', 'confused', 'about', 'the', 'lack', 'of', 'any', 'kind', 'of', 'posts', 'about', 'the', 'recent', 'pens', 'massacre', 'of', 'the', 'devils', '.'], ['actually', ',', 'i', 'am', 'bit', 'puzzled', 'too', 'and', 'a', 'bit', 'relieved', '.'], ['however', ',', 'i', 'am', 'going', 'to', 'put', 'an', 'end', 'to', 'non-pittsburghers', "'", 'relief', 'with', 'a', 'bit', 'of', 'praise', 'for', 'the', 'pens', '.'], ['man', ',', 'they', 'are', 'killing', 'those', 'devils', 'worse', 'than', 'i', 'thought', '.'], ['jagr', 'just', 'showed', 'you', 'why', 'he', 'is', 'much', 'better', 'than', 'his', 'regular', 'season', 'stats', '.'], ['he', 'is', 'also', 'a', 'lot', 'fo', 'fun', 'to', 'watch', 'in', 'the', 'playoffs', '.'], ['bowman', 'should', 'let', 'jagr', 'have', 'a', 'lot', 'of', 'fun', 'in', 'the', 'next', 'couple', 'of', 'games', 'since', 'the', 'pens', 'are', 'going', 'to', 'beat', 'the', 'pulp',

In [11]:
# Remove punctuation and non-alphabetic symbols
import string
def remove_punctuation(tokenized_corpus):
    table = str.maketrans('', '', string.punctuation)
    cleaned_corpus = []
    for document in tokenized_corpus:
        # isalpha() keeps only purely alphabetic tokens, so a token such as
        # 'non-pittsburghers' is dropped whole and translate() is a no-op here
        cleaned_document = [[word.translate(table) for word in sentence if word.isalpha()] for sentence in document]
        cleaned_corpus.append(cleaned_document)
    return cleaned_corpus
# Clean the lowercased corpus
cleaned_corpus = remove_punctuation(lowercase_corpus)
print("Cleaned corpus:", cleaned_corpus[:2])

Cleaned corpus: [[['i', 'am', 'sure', 'some', 'bashers', 'of', 'pens', 'fans', 'are', 'pretty', 'confused', 'about', 'the', 'lack', 'of', 'any', 'kind', 'of', 'posts', 'about', 'the', 'recent', 'pens', 'massacre', 'of', 'the', 'devils'], ['actually', 'i', 'am', 'bit', 'puzzled', 'too', 'and', 'a', 'bit', 'relieved'], ['however', 'i', 'am', 'going', 'to', 'put', 'an', 'end', 'to', 'relief', 'with', 'a', 'bit', 'of', 'praise', 'for', 'the', 'pens'], ['man', 'they', 'are', 'killing', 'those', 'devils', 'worse', 'than', 'i', 'thought'], ['jagr', 'just', 'showed', 'you', 'why', 'he', 'is', 'much', 'better', 'than', 'his', 'regular', 'season', 'stats'], ['he', 'is', 'also', 'a', 'lot', 'fo', 'fun', 'to', 'watch', 'in', 'the', 'playoffs'], ['bowman', 'should', 'let', 'jagr', 'have', 'a', 'lot', 'of', 'fun', 'in', 'the', 'next', 'couple', 'of', 'games', 'since', 'the', 'pens', 'are', 'going', 'to', 'beat', 'the', 'pulp', 'out', 'of', 'jersey', 'anyway'], ['i', 'was', 'very', 'disappointed', 'n
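
The cleaned output above no longer contains 'non-pittsburghers': the `isalpha()` filter discards any token that holds a non-alphabetic character. If the goal were to keep such words and strip only the offending characters, a variant like the following could be used. This is a sketch, not the notebook's method; `strip_non_alpha` is a hypothetical name, and the pattern assumes lowercased ASCII input (accented letters would be stripped too).

```python
import re

def strip_non_alpha(tokenized_corpus):
    """Remove non a-z characters from each token; drop tokens left empty."""
    cleaned_corpus = []
    for document in tokenized_corpus:
        cleaned_document = []
        for sentence in document:
            stripped = (re.sub(r"[^a-z]", "", word) for word in sentence)
            cleaned_document.append([word for word in stripped if word])
        cleaned_corpus.append(cleaned_document)
    return cleaned_corpus

# One sentence taken from the lowercased output of the first document
sample = [[["to", "non-pittsburghers", "'", "relief", "."]]]
print(strip_non_alpha(sample))
```

The trade-off is that hyphenated words are fused into a single token rather than removed, which changes the vocabulary in a different way.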

## Part 3: Stopword Removal

### Activity
1. Remove stopwords using a standard list.

In [17]:
# Remove stopwords using NLTK's standard English list
from nltk.corpus import stopwords
nltk.download('stopwords')
# Drop every token that appears in the stopword set
def remove_stopwords(tokenized_corpus):
    stop_words = set(stopwords.words('english'))
    filtered_corpus = []
    for document in tokenized_corpus:
        filtered_document = [[word for word in sentence if word not in stop_words] for sentence in document]
        filtered_corpus.append(filtered_document)
    return filtered_corpus
# Remove stopwords from the cleaned corpus
filtered_corpus = remove_stopwords(cleaned_corpus)
print("Filtered corpus:", filtered_corpus[:2])


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\wil_s\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Filtered corpus: [[['sure', 'bashers', 'pens', 'fans', 'pretty', 'confused', 'lack', 'kind', 'posts', 'recent', 'pens', 'massacre', 'devils'], ['actually', 'bit', 'puzzled', 'bit', 'relieved'], ['however', 'going', 'put', 'end', 'relief', 'bit', 'praise', 'pens'], ['man', 'killing', 'devils', 'worse', 'thought'], ['jagr', 'showed', 'much', 'better', 'regular', 'season', 'stats'], ['also', 'lot', 'fo', 'fun', 'watch', 'playoffs'], ['bowman', 'let', 'jagr', 'lot', 'fun', 'next', 'couple', 'games', 'since', 'pens', 'going', 'beat', 'pulp', 'jersey', 'anyway'], ['disappointed', 'see', 'islanders', 'lose', 'final', 'regular', 'season', 'game'], ['pens', 'rule'], []], [['brother', 'market', 'video', 'card', 'supports', 'vesa', 'local', 'bus', 'ram'], ['anyone', 'diamond', 'stealth', 'pro', 'local', 'bus', 'orchid', 'farenheit', 'ati', 'graphics', 'ultra', 'pro', 'vlb', 'card', 'please', 'post', 'email'], ['thank'], ['matt']]]
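
The first document in the filtered output ends with an empty list, a sentence whose tokens were all removed by the earlier steps. If empty sentences are unwanted downstream, they could be dropped with a small filter. The helper name `drop_empty_sentences` is hypothetical and not part of the original notebook.

```python
def drop_empty_sentences(tokenized_corpus):
    """Keep only sentences that still contain at least one token."""
    return [[sentence for sentence in document if sentence]
            for document in tokenized_corpus]

# Abbreviated from the filtered output above
sample = [[["pens", "rule"], []], [["thank"], ["matt"]]]
print(drop_empty_sentences(sample))
```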


## Part 4: Stemming or Lemmatization

### Activity
1. Apply stemming.
2. Apply lemmatization.
3. Compare both techniques.

In [14]:
# Apply stemming with the Porter algorithm
from nltk.stem import PorterStemmer
def stem_tokens(tokenized_corpus):
    stemmer = PorterStemmer()
    stemmed_corpus = []
    for document in tokenized_corpus:
        stemmed_document = [[stemmer.stem(word) for word in sentence] for sentence in document]
        stemmed_corpus.append(stemmed_document)
    return stemmed_corpus
# Stem the filtered corpus
stemmed_corpus = stem_tokens(filtered_corpus)
print("Stemmed corpus:", stemmed_corpus[:2])

Stemmed corpus: [[['sure', 'basher', 'pen', 'fan', 'pretti', 'confus', 'lack', 'kind', 'post', 'recent', 'pen', 'massacr', 'devil'], ['actual', 'bit', 'puzzl', 'bit', 'reliev'], ['howev', 'go', 'put', 'end', 'relief', 'bit', 'prais', 'pen'], ['man', 'kill', 'devil', 'wors', 'thought'], ['jagr', 'show', 'much', 'better', 'regular', 'season', 'stat'], ['also', 'lot', 'fo', 'fun', 'watch', 'playoff'], ['bowman', 'let', 'jagr', 'lot', 'fun', 'next', 'coupl', 'game', 'sinc', 'pen', 'go', 'beat', 'pulp', 'jersey', 'anyway'], ['disappoint', 'see', 'island', 'lose', 'final', 'regular', 'season', 'game'], ['pen', 'rule'], []], [['brother', 'market', 'video', 'card', 'support', 'vesa', 'local', 'bu', 'ram'], ['anyon', 'diamond', 'stealth', 'pro', 'local', 'bu', 'orchid', 'farenheit', 'ati', 'graphic', 'ultra', 'pro', 'vlb', 'card', 'pleas', 'post', 'email'], ['thank'], ['matt']]]


In [15]:
# Apply lemmatization with WordNet
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
def lemmatize_tokens(tokenized_corpus):
    lemmatizer = WordNetLemmatizer()
    lemmatized_corpus = []
    for document in tokenized_corpus:
        # lemmatize() defaults to pos='n' (noun), so verb and adjective forms
        # such as 'going' or 'confused' are left unchanged
        lemmatized_document = [[lemmatizer.lemmatize(word) for word in sentence] for sentence in document]
        lemmatized_corpus.append(lemmatized_document)
    return lemmatized_corpus
# Lemmatize the filtered corpus
lemmatized_corpus = lemmatize_tokens(filtered_corpus)
print("Lemmatized corpus:", lemmatized_corpus[:2])

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\wil_s\AppData\Roaming\nltk_data...


Lemmatized corpus: [[['sure', 'bashers', 'pen', 'fan', 'pretty', 'confused', 'lack', 'kind', 'post', 'recent', 'pen', 'massacre', 'devil'], ['actually', 'bit', 'puzzled', 'bit', 'relieved'], ['however', 'going', 'put', 'end', 'relief', 'bit', 'praise', 'pen'], ['man', 'killing', 'devil', 'worse', 'thought'], ['jagr', 'showed', 'much', 'better', 'regular', 'season', 'stats'], ['also', 'lot', 'fo', 'fun', 'watch', 'playoff'], ['bowman', 'let', 'jagr', 'lot', 'fun', 'next', 'couple', 'game', 'since', 'pen', 'going', 'beat', 'pulp', 'jersey', 'anyway'], ['disappointed', 'see', 'islander', 'lose', 'final', 'regular', 'season', 'game'], ['pen', 'rule'], []], [['brother', 'market', 'video', 'card', 'support', 'vesa', 'local', 'bus', 'ram'], ['anyone', 'diamond', 'stealth', 'pro', 'local', 'bus', 'orchid', 'farenheit', 'ati', 'graphic', 'ultra', 'pro', 'vlb', 'card', 'please', 'post', 'email'], ['thank'], ['matt']]]
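
To make the comparison between the two techniques concrete, the tokens on which they disagree can be listed side by side. The sketch below hard-codes the first sentence of the first document as it appears in the filtered, stemmed, and lemmatized outputs above, so it runs without NLTK; the variable names are illustrative only.

```python
# First sentence of document 1, copied from the outputs above
filtered = ['sure', 'bashers', 'pens', 'fans', 'pretty', 'confused', 'lack',
            'kind', 'posts', 'recent', 'pens', 'massacre', 'devils']
stemmed = ['sure', 'basher', 'pen', 'fan', 'pretti', 'confus', 'lack',
           'kind', 'post', 'recent', 'pen', 'massacr', 'devil']
lemmatized = ['sure', 'bashers', 'pen', 'fan', 'pretty', 'confused', 'lack',
              'kind', 'post', 'recent', 'pen', 'massacre', 'devil']

# Keep only the positions where the stem and the lemma differ
differing = [(word, stem, lemma)
             for word, stem, lemma in zip(filtered, stemmed, lemmatized)
             if stem != lemma]

for word, stem, lemma in differing:
    print(f"{word:10} stem={stem:10} lemma={lemma}")
```

In this sentence the stemmer produces truncated forms that are not dictionary words ('pretti', 'confus', 'massacr'), while the lemmatizer returns real words but leaves 'bashers' and 'confused' untouched.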


In [16]:
# Compare the results of each preprocessing step
print("\nComparison of tokenization steps:")
for i in range(2):
    print(f"\nDocument {i+1}:")
    print("Original:", corpus[i])
    print("Tokenized:", tokenized_corpus[i])
    print("Lowercased:", lowercase_corpus[i])
    print("Cleaned:", cleaned_corpus[i])
    print("Filtered:", filtered_corpus[i])
    print("Stemmed:", stemmed_corpus[i])
    print("Lemmatized:", lemmatized_corpus[i])


Comparison of tokenization steps:

Document 1:
Original: 

I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!


Tokenized: [['I', 'am', 'sure', 'some', 'bashers', 'of', 'Pens', 'fans', 'are', 'pretty', 'confused', 'about', 'the', 'lack', 'of', 'any', 'kind', 'of', 'posts', 'about', 'the', 'recent', 'Pens', 'massacre', 'of', 'the', 'Devils'