Paulo Sánchez


Juan M. González-Campo


Pedro Marroquín

In [20]:
corpus="""Anyone living in the United States in the early 1990s and paying even
a whisper of attention to the nightly news or a daily paper could be
forgiven for having been scared out of his skin.
The culprit was crime. It had been rising relentlessly—a graph
plotting the crime rate in any American city over recent decades
looked like a ski slope in profile—and it seemed now to herald the

end of the world as we knew it. Death by gunfire, intentional and oth-
erwise, had become commonplace. So too had carjacking and crack

dealing, robbery and rape. Violent crime was a gruesome, constant
companion. And things were about to get even worse. Much worse.
All the experts were saying so.
The cause was the so-called superpredator. For a time, he was
everywhere. Glowering from the cover of newsweeklies. Swaggering
his way through foot-thick government reports. He was a scrawny,
big-city teenager with a cheap gun in his hand and nothing in his
heart but ruthlessness. There were thousands out there just like him,

F R E A KO N O M I C S
we were told, a generation of killers about to hurl the country into
deepest chaos.

In 1995 the criminologist James Alan Fox wrote a report for the

U.S. attorney general that grimly detailed the coming spike in mur-
ders by teenagers. Fox proposed optimistic and pessimistic scenarios.

In the optimistic scenario, he believed, the rate of teen homicides
would rise another 15 percent over the next decade; in the pessimistic
scenario, it would more than double. “The next crime wave will get so
bad,” he said, “that it will make 1995 look like the good old days.”"""


In [21]:
# Corpus preprocessing
import re
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove single-character words
    text = re.sub(r'\b\w{1}\b', '', text)
    # Collapse whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove digits (this runs after the single-character filter, so "1990s" leaves "s")
    text = re.sub(r'\d+', '', text)
    # Split into tokens
    text = text.split(' ')
    return text
preprocessed_corpus = preprocess_text(corpus)
print(preprocessed_corpus[:10])

['anyone', 'living', 'in', 'the', 'united', 'states', 'in', 'the', 'early', 's']
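
The trailing 's' in the output above is what remains of "1990s": digits are stripped only after the single-character filter has already run. Stripping a standalone number such as "1995" at that late stage also leaves two adjacent spaces, which `split(' ')` turns into an empty token. A possible reordering is sketched below, assuming such leftovers are unwanted. The name `preprocess_text_v2` is hypothetical and is not part of the original notebook; swapping it in would change the counts and the perplexity reported in the later cells.

```python
import re

def preprocess_text_v2(text):
    # Hypothetical variant of preprocess_text; not from the original notebook.
    text = text.lower()
    # Replace punctuation with a space so dash-joined words stay separate.
    text = re.sub(r'[^\w\s]', ' ', text)
    # Remove digits first, so "1990s" is reduced to "s" ...
    text = re.sub(r'\d+', '', text)
    # ... and the single-character filter can then discard it.
    text = re.sub(r'\b\w\b', '', text)
    # split() with no argument collapses whitespace and never yields ''.
    return text.split()

sample = "In the early 1990s, crime rose relentlessly\u2014a graph."
print(preprocess_text_v2(sample))
```

Replacing punctuation with a space instead of deleting it is a design choice: it keeps "relentlessly" and "a" apart, but it also splits hyphenated words such as "so-called" into two tokens.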


In [22]:
from collections import defaultdict
# Count unigrams and bigrams
unigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)

for i in range(len(preprocessed_corpus) - 1):
    unigram_counts[preprocessed_corpus[i]] += 1
    bigram = (preprocessed_corpus[i], preprocessed_corpus[i + 1])
    bigram_counts[bigram] += 1

# Last token (the loop above stops one position before the end)
unigram_counts[preprocessed_corpus[-1]] += 1
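
The same counts can be obtained more compactly with `collections.Counter` and `zip`, which also removes the need to handle the last token separately. The sketch below uses a short hypothetical token list instead of the notebook's corpus, so the numbers are illustrative only; the `_alt` names are likewise not from the original.

```python
from collections import Counter

# Hypothetical token list standing in for preprocessed_corpus.
tokens = ["the", "crime", "rate", "the", "crime", "wave"]

# Counter counts every token, including the last one, in a single pass.
unigram_counts_alt = Counter(tokens)
# zip(tokens, tokens[1:]) pairs each token with its successor.
bigram_counts_alt = Counter(zip(tokens, tokens[1:]))

print(unigram_counts_alt["the"])            # 2
print(bigram_counts_alt[("the", "crime")])  # 2
print(len(bigram_counts_alt))               # 4 distinct bigrams
```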

In [23]:
# Print unigrams and bigrams
print("Unigrams:")
for word, count in unigram_counts.items():
    print(f"{word}: {count}")
print("\nBigrams:")
for bigram, count in bigram_counts.items():
    print(f"{bigram}: {count}")

Unigrams:
anyone: 1
living: 1
in: 10
the: 21
united: 1
states: 1
early: 1
s: 1
and: 7
paying: 1
even: 2
whisper: 1
of: 6
attention: 1
to: 4
nightly: 1
news: 1
or: 1
daily: 1
paper: 1
could: 1
be: 1
forgiven: 1
for: 3
having: 1
been: 2
scared: 1
out: 2
his: 4
skin: 1
culprit: 1
was: 5
crime: 4
it: 5
had: 3
rising: 1
relentlesslya: 1
graph: 1
plotting: 1
rate: 2
any: 1
american: 1
city: 1
over: 2
recent: 1
decades: 1
looked: 1
like: 3
ski: 1
slope: 1
profileand: 1
seemed: 1
now: 1
herald: 1
end: 1
world: 1
as: 1
we: 2
knew: 1
death: 1
by: 2
gunfire: 1
intentional: 1
oth: 1
erwise: 1
become: 1
commonplace: 1
so: 3
too: 1
carjacking: 1
crack: 1
dealing: 1
robbery: 1
rape: 1
violent: 1
gruesome: 1
constant: 1
companion: 1
things: 1
were: 4
about: 2
get: 2
worse: 2
much: 1
all: 1
experts: 1
saying: 1
cause: 1
socalled: 1
superpredator: 1
time: 1
he: 4
everywhere: 1
glowering: 1
from: 1
cover: 1
newsweeklies: 1
swaggering: 1
way: 1
through: 1
footthick: 1
government: 1
reports: 1
scrawny: 1


In [24]:
import math
# Bigram probability without smoothing
prob_bigrams = {}
log_prob_sum = 0
N = len(preprocessed_corpus) - 1  # number of bigrams

for i in range(len(preprocessed_corpus) - 1):
    w1, w2 = preprocessed_corpus[i], preprocessed_corpus[i+1]
    bigram = (w1, w2)
    prob = bigram_counts[bigram] / unigram_counts[w1]
    prob_bigrams[bigram] = prob
    log_prob_sum += -math.log2(prob)  # accumulated for the entropy

entropy = log_prob_sum / N
perplexity = 2 ** entropy
print(f"Entropy (no smoothing): {entropy:.4f}")
print(f"Perplexity (no smoothing): {perplexity:.4f}")


Entropy (no smoothing): 1.0291
Perplexity (no smoothing): 2.0407
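
Because the model above is evaluated on the same text it was counted from, every bigram has a non-zero count and `math.log2` never receives 0. On unseen text that would no longer hold. One common remedy is add-one (Laplace) smoothing, P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V), where V is the vocabulary size. The sketch below is a minimal illustration under stated assumptions: the helper `laplace_perplexity`, the toy token lists, and the choice to include test words in V are all hypothetical and not taken from the original notebook.

```python
import math
from collections import Counter

def laplace_perplexity(train_tokens, test_tokens):
    # Hypothetical helper, not part of the original notebook.
    unigrams = Counter(train_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    # Assumption: V covers train and test words so unseen words get some mass.
    vocab_size = len(set(train_tokens) | set(test_tokens))

    pairs = list(zip(test_tokens, test_tokens[1:]))
    log_prob_sum = 0.0
    for w1, w2 in pairs:
        # Counter returns 0 for unseen keys, so no KeyError is raised here.
        prob = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)
        log_prob_sum += -math.log2(prob)

    entropy = log_prob_sum / len(pairs)
    return entropy, 2 ** entropy

# Toy data for illustration only.
train = ["the", "crime", "rate", "the", "crime", "wave"]
test = ["the", "crime", "report"]
entropy_l, perplexity_l = laplace_perplexity(train, test)
print(f"Entropy (add-one): {entropy_l:.4f}")
print(f"Perplexity (add-one): {perplexity_l:.4f}")
```

With smoothing, the unseen bigram ("crime", "report") receives a small but non-zero probability, so the perplexity stays finite.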
