Paulo Sánchez


Juan M. González-Campo


Pedro Marroquín

In [49]:
corpus = """
On a warm spring morning, the city slowly came to life. The streets were bathed in golden light as the first commuters began their daily routines. Shops opened their doors, the aroma of freshly brewed coffee wafting from the corner cafés. Joggers passed through the park, nodding politely at each other, while newspaper vendors shouted out headlines to the early risers. Everything moved with a quiet rhythm, as if the city itself was taking a deep breath before the rush began.

At the university, the campus buzzed with energy. Students hurried across the lawns with backpacks slung over one shoulder, balancing coffee cups and textbooks. In lecture halls, professors prepared their presentations, flipping through slides and checking microphones. The library, though quieter, was no less active—pages turned rapidly, keyboards clicked in steady patterns, and study groups gathered around large wooden tables, whispering ideas and scribbling equations on notepads. The smell of ink, paper, and a hint of stress hung in the air.

Meanwhile, in the countryside just outside the city, life moved at a slower pace. Farmers tended to their crops with practiced ease, guiding tractors through rows of green fields under the watchful gaze of distant hills. Children ran barefoot on dirt roads, chasing each other under the sun, their laughter echoing through the valley. The village storekeeper arranged fresh produce on wooden crates, greeting every passerby with a smile and a nod. Unlike the hurried tempo of the city, the countryside offered a comforting stillness that many longed for but rarely experienced.
"""



# Preprocessing

In [50]:
# Preprocess the corpus
import re

def preprocess_text(text):
    """Lowercase the text, strip punctuation and digits, and split it into tokens."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)        # delete punctuation
    text = re.sub(r'\d+', '', text)            # delete digits
    text = re.sub(r'\s+', ' ', text).strip()   # collapse runs of whitespace
    return text.split(' ')

preprocessed_corpus = preprocess_text(corpus)
print(preprocessed_corpus[:10])

['on', 'a', 'warm', 'spring', 'morning', 'the', 'city', 'slowly', 'came', 'to']
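Because `preprocess_text` deletes punctuation outright, two words joined only by a dash are fused into one token; this is why the unigram counts further down contain `activepages`. The following is a minimal sketch of one alternative, assuming it is acceptable to split on every punctuation mark: punctuation is replaced with a space instead of being removed. The name `tokenize_with_spaces` is hypothetical and not part of the notebook.

```python
import re

def tokenize_with_spaces(text):
    """Hypothetical variant of preprocess_text: punctuation becomes a space."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation instead of deleting it
    text = re.sub(r"\d+", " ", text)       # replace digits with a space as well
    return text.split()                    # split() with no argument collapses whitespace

# \u2014 is the dash that joins "active" and "pages" in the corpus
sample = "no less active\u2014pages turned rapidly"
print(tokenize_with_spaces(sample))
# ['no', 'less', 'active', 'pages', 'turned', 'rapidly']
```

The trade-off is that contractions such as `don't` would split into `don` and `t`, and the counts printed by the notebook would change slightly if this tokenizer were used.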


In [51]:
from collections import defaultdict
import math 

class bigram:
    def __init__(self, corpus):
        """
        Initialize the bigram model from a text corpus.
        """
        self.corpus = corpus
        self.unigram_counts = defaultdict(int)
        self.bigram_counts = defaultdict(int)
        self.tokens = preprocess_text(corpus)
        self.total_palabras = len(self.tokens)
        self._train()

    def _train(self):
        """
        Train the bigram model by counting unigrams and bigrams.
        """

        # Count each token together with the bigram it starts
        for i in range(len(self.tokens) - 1):
            self.unigram_counts[self.tokens[i]] += 1
            bigram = (self.tokens[i], self.tokens[i + 1])
            self.bigram_counts[bigram] += 1
        # Count the last token, which starts no bigram
        self.unigram_counts[self.tokens[-1]] += 1


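    def bigram_prob(self, word1, word2):
        """
        Compute the maximum-likelihood bigram probability P(word2 | word1).
        """
        # .get() avoids inserting unseen words into the defaultdicts,
        # which would silently enlarge the vocabulary
        count_w1 = self.unigram_counts.get(word1, 0)
        if count_w1 == 0:
            return 0.0
        return self.bigram_counts.get((word1, word2), 0) / count_w1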
    
   
    
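    def bigram_prob_laplace(self, w1, w2):
        """
        Bigram probability with Laplace (add-one) smoothing.
        """
        V = len(self.unigram_counts)  # vocabulary size
        # .get() keeps unseen words from being added to the counts (and to V)
        return (self.bigram_counts.get((w1, w2), 0) + 1) / (self.unigram_counts.get(w1, 0) + V)
    
    def bigram_prob_add_k(self, w1, w2, k=0.01):
        """
        Bigram probability with add-k smoothing.
        """
        V = len(self.unigram_counts)  # vocabulary size
        return (self.bigram_counts.get((w1, w2), 0) + k) / (self.unigram_counts.get(w1, 0) + k * V)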
    
    def prob_sentence(self, sentence, method='normal', k=0.01):
        """
        Compute the probability of a sentence as the product of its bigram probabilities.
        method is 'normal' (no smoothing), 'laplace', or 'add_k'.
        """
        words = preprocess_text(sentence)
        prob = 1.0
        for i in range(len(words) - 1):
            if method == 'normal':
                prob *= self.bigram_prob(words[i], words[i + 1])
            elif method == 'laplace':
                prob *= self.bigram_prob_laplace(words[i], words[i + 1])
            elif method == 'add_k':
                prob *= self.bigram_prob_add_k(words[i], words[i + 1], k)
            else:
                # Fail loudly instead of returning 1.0 for a misspelled method
                raise ValueError(f"Unknown method: {method}")
        return prob
    
    def entropy(self):
        """
        Compute the entropy of the bigram model: the average negative log2
        probability per bigram, measured on the training tokens.
        """
        entropy = 0.0
        N = self.total_palabras - 1  # number of bigrams in the corpus
        for i in range(N):
            word1 = self.tokens[i]
            word2 = self.tokens[i + 1]
            prob = self.bigram_prob(word1, word2)
            if prob > 0:
                entropy -= math.log2(prob)
        return entropy / N
    
    def perplexity(self):
        """
        Compute the perplexity of the bigram model as 2 ** entropy.
        """
        # An entropy of 0 corresponds to a perplexity of 1, not infinity
        return 2 ** self.entropy()
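
The add-k formula used in the class, (count(w1, w2) + k) / (count(w1) + k·V), is meant to yield a proper distribution over the next word. The sketch below checks this on a toy corpus, assuming plain `Counter` objects instead of the class above; `add_k_prob` and the toy sentence are hypothetical and used only for illustration.

```python
import math
from collections import Counter

def add_k_prob(bigram_counts, unigram_counts, w1, w2, k=1.0):
    """Add-k estimate of P(w2 | w1); k=1 gives Laplace smoothing."""
    vocab_size = len(unigram_counts)
    numerator = bigram_counts.get((w1, w2), 0) + k
    denominator = unigram_counts.get(w1, 0) + k * vocab_size
    return numerator / denominator

# Toy corpus, chosen only for illustration
tokens = "the city woke and the city slept".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Sum P(w2 | "the") over every word in the vocabulary
total = sum(add_k_prob(bigram_counts, unigram_counts, "the", w2)
            for w2 in unigram_counts)
print(math.isclose(total, 1.0))

# Querying an unseen word must not change the vocabulary size
vocab_before = len(unigram_counts)
add_k_prob(bigram_counts, unigram_counts, "park", "city")
print(len(unigram_counts) == vocab_before)
```

The sum is exactly 1 only for context words whose unigram count equals the number of bigrams they start. The final token of a corpus is counted once more than that, so its smoothed distribution falls slightly short of 1.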
    

    
    

In [52]:
bigrama = bigram(corpus)

# Print unigrams and bigrams
print("Unigrams:")
for word, count in bigrama.unigram_counts.items():
    print(f"{word}: {count}")
print("\nBigrams:")
# The loop variable is named pair so it does not overwrite the bigram class
for pair, count in bigrama.bigram_counts.items():
    print(f"{pair}: {count}")

Unigrams:
on: 4
a: 8
warm: 1
spring: 1
morning: 1
the: 24
city: 4
slowly: 1
came: 1
to: 3
life: 2
streets: 1
were: 1
bathed: 1
in: 5
golden: 1
light: 1
as: 2
first: 1
commuters: 1
began: 2
their: 5
daily: 1
routines: 1
shops: 1
opened: 1
doors: 1
aroma: 1
of: 6
freshly: 1
brewed: 1
coffee: 2
wafting: 1
from: 1
corner: 1
cafés: 1
joggers: 1
passed: 1
through: 4
park: 1
nodding: 1
politely: 1
at: 3
each: 2
other: 2
while: 1
newspaper: 1
vendors: 1
shouted: 1
out: 1
headlines: 1
early: 1
risers: 1
everything: 1
moved: 2
with: 5
quiet: 1
rhythm: 1
if: 1
itself: 1
was: 2
taking: 1
deep: 1
breath: 1
before: 1
rush: 1
university: 1
campus: 1
buzzed: 1
energy: 1
students: 1
hurried: 2
across: 1
lawns: 1
backpacks: 1
slung: 1
over: 1
one: 1
shoulder: 1
balancing: 1
cups: 1
and: 6
textbooks: 1
lecture: 1
halls: 1
professors: 1
prepared: 1
presentations: 1
flipping: 1
slides: 1
checking: 1
microphones: 1
library: 1
though: 1
quieter: 1
no: 1
less: 1
activepages: 1
turned: 1
rapidly: 1
keyboards:

In [53]:
entropia = bigrama.entropy()
print(f"\nEntropy of the bigram model: {entropia:.4f}")
perplejidad = bigrama.perplexity()
print(f"Perplexity of the bigram model: {perplejidad:.4f}")



Entropy of the bigram model: 0.9304
Perplexity of the bigram model: 1.9058
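
These values are measured on the same tokens the model was trained on, so they are likely optimistic. The sketch below illustrates the gap on toy data, assuming add-k smoothing so that unseen bigrams do not produce a zero probability. The function `held_out_perplexity`, the toy sentences, and the choice to include held-out words in the vocabulary are all assumptions made for this example, not part of the notebook.

```python
import math
from collections import Counter

def held_out_perplexity(train_tokens, eval_tokens, k=0.01):
    """Perplexity of an add-k bigram model trained on train_tokens, measured on eval_tokens."""
    unigram_counts = Counter(train_tokens)
    bigram_counts = Counter(zip(train_tokens, train_tokens[1:]))
    # Assumption: the vocabulary covers both training and evaluation words
    vocab_size = len(set(train_tokens) | set(eval_tokens))

    pairs = list(zip(eval_tokens, eval_tokens[1:]))
    log_sum = 0.0
    for w1, w2 in pairs:
        prob = (bigram_counts[(w1, w2)] + k) / (unigram_counts[w1] + k * vocab_size)
        log_sum += math.log2(prob)
    entropy = -log_sum / len(pairs)   # average negative log2 probability per bigram
    return 2 ** entropy

# Toy data, chosen only for illustration
train = "the city woke and the city slept".split()
held_out = "the city slept and the park woke".split()

print(held_out_perplexity(train, train))      # evaluated on the training text
print(held_out_perplexity(train, held_out))   # evaluated on unseen text
```

On this toy data the second value is larger than the first, because the held-out sentence contains bigrams the model never saw.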


In [57]:
# Sentence probability with Laplace smoothing, add-k smoothing, and no smoothing

frase_prueba = "The students walked through the city."

print("\nBigram probabilities with Laplace smoothing:")
prob_laplace = bigrama.prob_sentence(frase_prueba, method='laplace')
print(f"Probability of the sentence '{frase_prueba}': {prob_laplace:.10f}")
print("\nBigram probabilities with add-k smoothing:")
prob_add_k = bigrama.prob_sentence(frase_prueba, method='add_k', k=0.01)
print(f"Probability of the sentence '{frase_prueba}': {prob_add_k:.10f}")
print("\nBigram probabilities without smoothing:")
prob_normal = bigrama.prob_sentence(frase_prueba, method='normal')
print(f"Probability of the sentence '{frase_prueba}': {prob_normal:.10f}")



Bigram probabilities with Laplace smoothing:
Probability of the sentence 'The students walked through the city.': 0.0000000001

Bigram probabilities with add-k smoothing:
Probability of the sentence 'The students walked through the city.': 0.0000000004

Bigram probabilities without smoothing:
Probability of the sentence 'The students walked through the city.': 0.0000000000
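
Products of many small probabilities quickly become hard to read at fixed precision, and for longer sentences they can underflow to zero. A common remedy is to add log-probabilities instead of multiplying probabilities. The sketch below shows the idea, assuming any function that returns P(w2 | w1); `sentence_log_prob`, `toy_prob`, and the toy probabilities are hypothetical.

```python
import math

def sentence_log_prob(words, cond_prob):
    """Sum of natural-log bigram probabilities; -inf if any bigram has probability 0."""
    total = 0.0
    for w1, w2 in zip(words, words[1:]):
        prob = cond_prob(w1, w2)
        if prob == 0.0:
            return float("-inf")   # log(0) is undefined, so report -inf explicitly
        total += math.log(prob)
    return total

# Toy conditional probabilities, chosen only for illustration
toy_table = {("the", "city"): 0.5, ("city", "slept"): 0.25}

def toy_prob(w1, w2):
    return toy_table.get((w1, w2), 0.0)

log_p = sentence_log_prob(["the", "city", "slept"], toy_prob)
print(log_p)             # log(0.5 * 0.25)
print(math.exp(log_p))   # back to a probability, approximately 0.125
```

With the notebook's model, a call such as `sentence_log_prob(preprocess_text(frase_prueba), bigrama.bigram_prob_laplace)` should give the log of the Laplace-smoothed sentence probability, assuming the class and cells above have been run.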


## Activity for classmates

### Perplexity decreases when...
a) The vocabulary is larger


b) The corpus is shorter


c) The model predicts the words of the text better