<a href="https://colab.research.google.com/github/cbadenes/curso-pln/blob/main/notebooks/02_modelos_ngramas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Trigram Model of the Reuters Corpus

The [Reuters Corpus](https://www.nltk.org/book/ch02.html) in NLTK contains 10,788 news documents totaling about 1.7 million tokens, counting punctuation. The documents are classified into 90 topics and split into two sets called "training" and "test"; for example, the text with fileid `test/14826` is a document drawn from the test set. This split is meant for training and testing algorithms that automatically detect the topic of a document.

## 1) Loading the Data

In [9]:
from collections import Counter
import nltk
nltk.download('reuters')
nltk.download('punkt')
nltk.download('punkt_tab')
!unzip -o -q /root/nltk_data/corpora/reuters.zip -d /root/nltk_data/corpora
from nltk.corpus import reuters

print("counting words...")
total_count = len(reuters.words())
print("Total number of words:", total_count)
counts = Counter(reuters.words())
print("Top 5 most common words:", counts.most_common(n=5))

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


counting words...
Total number of words: 1720901
Top 5 most common words: [('.', 94687), (',', 72360), ('the', 58251), ('of', 35979), ('to', 34035)]


## 2) N-grams

Get the bigrams:

In [10]:
from nltk import bigrams, trigrams

def get_bigrams(sentence, pad=False):
    return list(bigrams(sentence, pad_left=pad, pad_right=pad))

def get_trigrams(sentence, pad=False):
    return list(trigrams(sentence, pad_left=pad, pad_right=pad))

sentence = "Natural language processing is a subfield of linguistics, computer science, and artificial intelligence"

get_bigrams(sentence.split(" "))

[('Natural', 'language'),
 ('language', 'processing'),
 ('processing', 'is'),
 ('is', 'a'),
 ('a', 'subfield'),
 ('subfield', 'of'),
 ('of', 'linguistics,'),
 ('linguistics,', 'computer'),
 ('computer', 'science,'),
 ('science,', 'and'),
 ('and', 'artificial'),
 ('artificial', 'intelligence')]

Get the bigrams with padding (start and end of sentence):

In [11]:
get_bigrams(sentence.split(" "),pad=True)

[(None, 'Natural'),
 ('Natural', 'language'),
 ('language', 'processing'),
 ('processing', 'is'),
 ('is', 'a'),
 ('a', 'subfield'),
 ('subfield', 'of'),
 ('of', 'linguistics,'),
 ('linguistics,', 'computer'),
 ('computer', 'science,'),
 ('science,', 'and'),
 ('and', 'artificial'),
 ('artificial', 'intelligence'),
 ('intelligence', None)]

Get the trigrams:

In [12]:
get_trigrams(sentence.split(" "))

[('Natural', 'language', 'processing'),
 ('language', 'processing', 'is'),
 ('processing', 'is', 'a'),
 ('is', 'a', 'subfield'),
 ('a', 'subfield', 'of'),
 ('subfield', 'of', 'linguistics,'),
 ('of', 'linguistics,', 'computer'),
 ('linguistics,', 'computer', 'science,'),
 ('computer', 'science,', 'and'),
 ('science,', 'and', 'artificial'),
 ('and', 'artificial', 'intelligence')]

Get the trigrams with padding (start and end of sentence):

In [13]:
get_trigrams(sentence.split(" "),pad=True)

[(None, None, 'Natural'),
 (None, 'Natural', 'language'),
 ('Natural', 'language', 'processing'),
 ('language', 'processing', 'is'),
 ('processing', 'is', 'a'),
 ('is', 'a', 'subfield'),
 ('a', 'subfield', 'of'),
 ('subfield', 'of', 'linguistics,'),
 ('of', 'linguistics,', 'computer'),
 ('linguistics,', 'computer', 'science,'),
 ('computer', 'science,', 'and'),
 ('science,', 'and', 'artificial'),
 ('and', 'artificial', 'intelligence'),
 ('artificial', 'intelligence', None),
 ('intelligence', None, None)]
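
To make the padding behaviour seen above explicit, here is a minimal pure-Python sketch that reproduces it without NLTK. The helper name `padded_ngrams` and the short example sentence are assumptions for illustration, not part of the notebook; the sketch assumes padding with `n - 1` symbols on each side, which matches the outputs shown above.

```python
def padded_ngrams(tokens, n, pad_symbol=None):
    """Return the n-grams of `tokens`, padded with n-1 symbols on each side."""
    padding = [pad_symbol] * (n - 1)
    padded = padding + list(tokens) + padding
    # Slide a window of size n over the padded sequence
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

tokens = "the market is open".split()
print(padded_ngrams(tokens, 2))
print(padded_ngrams(tokens, 3))
```

With this scheme a sentence of k tokens yields k + n - 1 padded n-grams, which agrees with the 14 bigrams and 15 trigrams listed above for the 13-token example sentence.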

## 3) Counting Occurrences

In [14]:
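from collections import defaultdict

# model[(w1, w2)][w3] holds how many times w3 follows the context (w1, w2)
model = defaultdict(lambda: defaultdict(lambda: 0))

for sentence in reuters.sents():
    for w1, w2, w3 in get_trigrams(sentence, pad=True):
        model[(w1, w2)][w3] += 1
# len(model) counts the distinct (w1, w2) contexts, not the trigrams themselves
print("Distinct two-word contexts:", len(model))

Distinct two-word contexts: 398630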


How many times does "economists" follow "what the"?

In [15]:
model["what", "the"]["economists"]

2

And "nonexistingword"?

In [16]:
print(model["what", "the"]["nonexistingword"])

0
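
The lookup above returns 0 because the model is built from nested `defaultdict` objects, but reading a missing key from a `defaultdict` also inserts that key. The sketch below illustrates this on a toy model and shows one possible way to read counts without modifying the structure. The names `toy_model` and `safe_count` are hypothetical helpers introduced only for this example.

```python
from collections import defaultdict

toy_model = defaultdict(lambda: defaultdict(int))
toy_model[("what", "the")]["economists"] += 2

print(len(toy_model[("what", "the")]))  # one follower stored so far

# Indexing a missing word returns 0 and, as a side effect, stores the key
_ = toy_model[("what", "the")]["nonexistingword"]
print(len(toy_model[("what", "the")]))  # the zero-count entry is now stored too

def safe_count(model, context, word):
    """Read a count without inserting missing keys (dict.get skips the default factory)."""
    followers = model.get(context)
    return followers.get(word, 0) if followers is not None else 0

print(safe_count(toy_model, ("never", "seen"), "word"))
print(("never", "seen") in toy_model)
```

The extra zero-count entries are harmless for the probabilities computed later, but they do grow the dictionary, so a read-only accessor of this kind may be preferable when querying many unseen words.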


How many sentences start with "The"?

In [17]:
model[None, None]["The"]

8839

Let's convert the counts into probabilities (normalization):

In [18]:
for w1_w2 in model:
    # Divide each count by the total number of continuations seen for this context
    context_total = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= context_total
print("done!")

done!


What is the probability that "economists" follows "what the"?

In [19]:
model["what", "the"]["economists"]

0.043478260869565216
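
The normalization step corresponds to the maximum-likelihood estimate of a trigram probability. Writing $C(\cdot)$ for a corpus count (a notation assumed here, not used in the notebook), the estimate is:

$$
P(w_3 \mid w_1, w_2) = \frac{C(w_1, w_2, w_3)}{\sum_{w} C(w_1, w_2, w)}
$$

For this example the numerator is the count of 2 obtained earlier, and the resulting value equals 1/23, which implies that the context "what the" was followed by some token 46 times in the corpus:

$$
P(\text{economists} \mid \text{what}, \text{the}) = \frac{2}{46} \approx 0.0435
$$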

And the probability that a sentence starts with "The"?

In [20]:
model[None, None]["The"]

0.16154324146501936

## 4) Text Generation

Which words are most likely to follow "The market"?

In [21]:
words = model["The","market"]
for word in sorted(words, key=words.get, reverse=True)[:5]:
    print(word, words[word])

is 0.37735849056603776
had 0.07547169811320754
now 0.07547169811320754
has 0.07547169811320754
doesn 0.03773584905660377


Let's generate a random sentence:

In [26]:
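import random

# Seed context; switch to [None, None] to generate from the start of a sentence
text = ["The", "market"]
#text = [None, None]

sentence_finished = False

while not sentence_finished:
    # Inverse-CDF sampling: draw r in [0, 1) and pick the first word whose
    # cumulative probability reaches r
    r = random.random()
    accumulator = .0

    for word in model[tuple(text[-2:])].keys():
        accumulator += model[tuple(text[-2:])][word]

        if accumulator >= r:
            text.append(word)
            break

    # Two consecutive padding symbols mark the end of the sentence
    if text[-2:] == [None, None]:
        sentence_finished = True

print(' '.join([t for t in text if t]))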

The market is still too early to predict after last year , had a difficult economic climate .
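
The sampling loop above can also be written with `random.choices`, which draws from a weighted distribution in a single call. The following is a minimal sketch on a hand-made toy model, not the Reuters model: `toy_model`, `generate`, and the `max_len` safeguard are assumptions introduced for this example.

```python
import random

# Hypothetical toy model: context -> {next_word: probability}
toy_model = {
    (None, None): {"the": 1.0},
    (None, "the"): {"market": 0.5, "bank": 0.5},
    ("the", "market"): {"rose": 1.0},
    ("the", "bank"): {"rose": 1.0},
    ("market", "rose"): {None: 1.0},
    ("bank", "rose"): {None: 1.0},
    ("rose", None): {None: 1.0},
}

def generate(model, seed=(None, None), rng=None, max_len=50):
    """Sample words until two padding symbols appear or max_len is reached."""
    rng = rng or random.Random()
    text = list(seed)
    while len(text) < max_len:
        followers = model.get(tuple(text[-2:]))
        if not followers:
            break  # unseen context: stop instead of looping forever
        words = list(followers)
        weights = [followers[w] for w in words]
        text.append(rng.choices(words, weights=weights, k=1)[0])
        if text[-2:] == [None, None]:
            break  # end-of-sentence padding reached
    return " ".join(t for t in text if t is not None)

print(generate(toy_model, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance makes a run reproducible, and the `max_len` cap and the unseen-context check guard against a loop that never terminates.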
