# Building an encyclopedia

This project builds an encyclopedia using natural language processing concepts. It implements the following steps: extracting words from the documents of the Reuters corpus, fetching content for each word from the Wikipedia API, cleaning that content by removing unnecessary characters and markup, and training a model that selects the best description of each word through perplexity analysis.

## Import dependencies

In [1]:
# Import libraries
import re
import nltk
from nltk.corpus import reuters
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE, Laplace

import requests

import json
from joblib import Parallel, delayed

import math
from time import sleep

In [31]:
# Download the NLTK resources
if True:
    nltk.download('reuters')
    nltk.download('punkt')

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\EduardoFM\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\EduardoFM\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


## Import the dataset of words to study

In [5]:
# Build the list of words
words = list(map(str.lower, reuters.words()))
words[0:10]

['asian', 'exporters', 'fear', 'damage', 'from', 'u', '.', 's', '.-', 'japan']

In [6]:
# Filter the words: more than 5 occurrences, more than 2 characters, letters a-z only
freq_reuters = nltk.FreqDist(words)
filtered = list(set(word for word in words if (freq_reuters[word] > 5 and len(word) > 2 and re.fullmatch(r"[a-z]*", word))))

filtered[0: 10]

['connection',
 'inc',
 'rejecting',
 'drawing',
 'surprise',
 'principally',
 'arbitragers',
 'pursuant',
 'photo',
 'ixl']
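
The filtering criteria can be checked on a small hand-made token list. The sketch below is only an illustration: `filter_words` and the toy tokens are hypothetical names and data, and `collections.Counter` stands in for `nltk.FreqDist` so that the example runs without the corpus.

```python
import re
from collections import Counter

def filter_words(tokens, min_count=5, min_len=2):
    """Keep words that occur more than min_count times, are longer than
    min_len characters and contain only the letters a-z."""
    freq = Counter(tokens)
    return sorted(
        w for w in freq
        if freq[w] > min_count and len(w) > min_len and re.fullmatch(r"[a-z]+", w)
    )

# Toy tokens (made up): only "trade" and "oil" satisfy all three criteria
tokens = ["trade"] * 6 + ["u"] * 10 + ["1987"] * 8 + ["rare"] * 2 + ["oil"] * 7
print(filter_words(tokens))
```

Iterating over the distinct words of the counter, instead of over every token, gives the same set as the cell above without the explicit `set(...)` step.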

In [7]:
# Sort the list and save it to a file to avoid reading the corpus again
filtered = sorted(filtered)

with open("storage/good_words.txt", "w") as dump:
    for word in filtered:
        dump.write(f"{word}\n")

## Download the Wikipedia pages to build the dataset

In [8]:
# Read the text file with the words
with open("storage/good_words.txt", "r") as load:
    good_words = load.read().split("\n")

In [9]:
# Define the parameters for the API
BATCH_SIZE = 50
URL = "https://en.wikipedia.org/w/api.php"
PARAMS = {
    "action": "query",
    "prop": "revisions",
    "rvprop": "content",
    "rvslots": "main",
    "rvsection": "0",
    "titles": "",
    "format": "json",
}

# Define a function that splits the words into batches
def split_batches(words, batch_size=BATCH_SIZE):
    k = 0
    while k < len(words):
        yield words[k:(k + batch_size)]
        k += batch_size

main_texts = {}
error_log = []

# Start the session
S = requests.Session()
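
As a quick illustration of how the batching behaves, the sketch below repeats the `split_batches` generator and applies it to a short hypothetical list with a batch size of 3. Each batch is then joined with `|`, which is how the titles are passed to the API in the next cell.

```python
def split_batches(words, batch_size=50):
    """Yield consecutive slices of at most batch_size items."""
    k = 0
    while k < len(words):
        yield words[k:(k + batch_size)]
        k += batch_size

# Hypothetical input: seven one-letter "words" split into batches of three
batches = list(split_batches(list("abcdefg"), batch_size=3))
titles = ["|".join(batch) for batch in batches]

print(batches)
print(titles)
```

The last batch is simply shorter than the others, so no word is dropped when the list length is not a multiple of the batch size.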

In [10]:
print(f"{len(good_words)} good words to download")
print()

# Fetch each batch of the initial good words
for k, batch in enumerate(split_batches(good_words)):
    try:
        print(f'\rProcessing batch #{k + 1:05}', end='')
        PARAMS['titles'] = '|'.join(batch)
        r = S.get(url=URL, params=PARAMS)
        r_json = r.json()

        # Reverse map of normalized titles.
        # 'normalized' is absent when no title in the batch had to be normalized.
        title_map = {}
        for item in r_json['query'].get('normalized', []):
            title_map[item['to']] = item['from']
            
        # Get texts.
        texts = {}
        for pageid, page_content in r_json['query']['pages'].items():
            # Negative page ids mark missing pages.
            if int(pageid) < 0:
                continue
            text = page_content['revisions'][0]['slots']['main']['*']
            if page_content['title'] in title_map:
                w = title_map[page_content['title']]
            else:
                w = page_content['title']
                
            texts[w] = text

        # Add to global dict.
        main_texts.update(texts)

    except Exception as e:
        error_log.append((e, r))
        
print()
print()
print("Download complete")

9179 good words to download

Processing batch #00184

Download complete
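
To make the title handling in the loop easier to follow, here is a minimal offline sketch. The response below is hand-written to mimic the shape the loop expects (`query` → `normalized` and `pages`); the page id and text are made up, and `extract_texts` is a hypothetical helper, not part of the notebook.

```python
# Hand-written stand-in for an API response (page id and text are invented)
mock_response = {
    "query": {
        "normalized": [{"from": "hanover", "to": "Hanover"}],
        "pages": {
            "1001": {
                "title": "Hanover",
                "revisions": [{"slots": {"main": {"*": "'''Hanover''' is a city."}}}],
            },
            "-1": {"title": "Qwxyz", "missing": ""},
        },
    }
}

def extract_texts(response):
    """Map each requested word to its page text, skipping missing pages."""
    query = response.get("query", {})
    # Reverse map: normalized title -> word as it was requested
    title_map = {item["to"]: item["from"] for item in query.get("normalized", [])}
    texts = {}
    for pageid, page in query.get("pages", {}).items():
        if int(pageid) < 0:  # negative ids mark missing pages
            continue
        text = page["revisions"][0]["slots"]["main"]["*"]
        texts[title_map.get(page["title"], page["title"])] = text
    return texts

print(extract_texts(mock_response))
```

The reverse map is what lets the result be keyed by the lowercase corpus word rather than by the capitalized page title.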


### Fetch the redirected words

In [11]:
# Function that cleans the target titles found in redirect pages
def treat_redirected(text):
    
    text = text.split("[")[-1].split("]")[0].split(" (")[0]
    text = text.strip(" ")
    
    if (re.fullmatch(r"[a-z_ ]*", text.lower())):
        return text
    
    return "Remove Me"
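
The sketch below repeats `treat_redirected` and runs it on a few hand-written redirect strings (the inputs are illustrative, not taken from the downloaded data). It shows the three cases: a plain target, a target with a parenthetical qualifier, and a target rejected by the letter-only filter.

```python
import re

def treat_redirected(text):
    # Keep what sits inside the last [[...]] and drop a trailing " (qualifier)"
    text = text.split("[")[-1].split("]")[0].split(" (")[0]
    text = text.strip(" ")

    # Accept only targets made of letters, underscores and spaces
    if re.fullmatch(r"[a-z_ ]*", text.lower()):
        return text

    return "Remove Me"

print(treat_redirected("#REDIRECT [[United States]]"))
print(treat_redirected("#REDIRECT [[Mercury (planet)]]"))
print(treat_redirected("#REDIRECT [[U.S. dollar]]"))
```

Targets containing digits or punctuation fall back to the `"Remove Me"` placeholder, which has to be discarded before the second download.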

In [12]:
# Look through the downloaded texts for words whose page is only a redirect
redirected_words = set(treat_redirected(main_texts[word]) for word in main_texts.keys() if "#redirect" in main_texts[word][:50].lower())
# Drop the placeholder returned for targets rejected by treat_redirected
redirected_words.discard("Remove Me")
redirected_words = list(redirected_words)

print(f"{len(redirected_words)} redirected words to download")
print()

# Fetch each batch of the redirected words
for k, batch in enumerate(split_batches(redirected_words)):
    try:
        print(f'\rProcessing batch #{k + 1:05}', end='')
        PARAMS['titles'] = '|'.join(batch)
        r = S.get(url=URL, params=PARAMS)
        r_json = r.json()

        # Reverse map of normalized titles.
        # 'normalized' is absent when no title in the batch had to be normalized.
        title_map = {}
        for item in r_json['query'].get('normalized', []):
            title_map[item['to']] = item['from']
            
        # Get texts.
        texts = {}
        for pageid, page_content in r_json['query']['pages'].items():
            # Negative page ids mark missing pages.
            if int(pageid) < 0:
                continue
            text = page_content['revisions'][0]['slots']['main']['*']
            if page_content['title'] in title_map:
                w = title_map[page_content['title']]
            else:
                w = page_content['title']
                
            texts[w.lower()] = text

        # Add to global dict.
        main_texts.update(texts)

    except Exception as e:
        error_log.append((e, r))
        
print()
print()
print("Download complete")

2157 redirected words to download

Processing batch #00044

Download complete


### Remove pages without content

In [13]:
# Remove words whose page is still a redirect
words = main_texts.keys()
not_good_data = list(set(word for word in words if "#redirect" in main_texts[word][:50].lower()))
print(f"{len(not_good_data)} redirect pages to remove")
for word in not_good_data:
    main_texts.pop(word)
    
# Remove words that lead to disambiguation pages
words = main_texts.keys()
not_good_data = list(set(word for word in words if "may refer to:" in main_texts[word]))
print(f"{len(not_good_data)} disambiguation pages to remove")
for word in not_good_data:
    main_texts.pop(word)
    
# Remove words that lead to very short pages or to Wiktionary redirects
words = main_texts.keys()
not_good_data = list(set(word for word in words if "{{Short pages monitor}}" in main_texts[word]))
not_good_data = list(set(not_good_data + [word for word in words if "{{wiktionary redirect}}" in main_texts[word]]))

print(f"{len(not_good_data)} pages with errors")
for word in not_good_data:
    main_texts.pop(word)

2663 redirect pages to remove
2651 disambiguation pages to remove
32 pages with errors


In [14]:
# Export the results to files
with open('storage/texts.json', 'w') as f:
    json.dump(main_texts, f, indent=4)

with open('storage/errors.txt', 'w') as f:
    for e, r in error_log:
        f.write(f'{e} ({type(e)})\nContent:\n{r.headers}\n{"*"*100}\n')

## Clean the texts

In [15]:
# Load the downloaded texts from the JSON file
with open("storage/texts.json", "r") as load:
    main_texts = json.load(load)

### Cleaning demonstration

In [16]:
sample = "hanover"

In [17]:
# Define the test variable
clean = main_texts[sample]
clean

'{{about|the German city|other uses|Hanover (disambiguation)}}\n{{Redirect|Hannover}}\n{{short description|Capital of Lower Saxony, Germany}}\n{{Infobox German location\n|name               = Hanover\n|German_name        = Hannover\n|type               = City\n|image_photo        = {{Photomontage|position=center\n| photo1a = Hannover Blick Neues Rathaus 01.jpg\n| photo2a = Hannover_old_townhall_Karmarschstrasse_Mitte_Hannover_Germany_01.jpg\n| photo2b = Marktkirche_St_Georgii_et_Jacobi_Mitte_Hannover_Germany.jpg\n| photo3a = Herrenhäuser gärten 2.jpg\n| photo3b = Neues Rathaus bei Nacht.jpg\n| photo4a = Universität Hannover - Hauptgebäude - B02.jpg\n   | size = 280\n   | spacing = 2\n   | color = #FFFFFF\n   | border = 0\n}}\n|image_caption = Clockwise from top: View over the city centre, [[Marktkirche St. Georgii et Jacobi|Market Church of Sts. George and James]], [[New Town Hall (Hanover)|New Town Hall]], [[University of Hanover]], [[Herrenhausen Gardens]], [[Altes Rathaus, Hanover|O

In [18]:
# Remove HTML references (<ref> and <sup> tags)
clean = re.sub(r"<ref.*?(/ref>|/>)", "", clean, flags=re.DOTALL|re.MULTILINE)
clean = re.sub(r"<sup.*?(/sup>|/>)", "", clean, flags=re.DOTALL|re.MULTILINE)
clean

"{{about|the German city|other uses|Hanover (disambiguation)}}\n{{Redirect|Hannover}}\n{{short description|Capital of Lower Saxony, Germany}}\n{{Infobox German location\n|name               = Hanover\n|German_name        = Hannover\n|type               = City\n|image_photo        = {{Photomontage|position=center\n| photo1a = Hannover Blick Neues Rathaus 01.jpg\n| photo2a = Hannover_old_townhall_Karmarschstrasse_Mitte_Hannover_Germany_01.jpg\n| photo2b = Marktkirche_St_Georgii_et_Jacobi_Mitte_Hannover_Germany.jpg\n| photo3a = Herrenhäuser gärten 2.jpg\n| photo3b = Neues Rathaus bei Nacht.jpg\n| photo4a = Universität Hannover - Hauptgebäude - B02.jpg\n   | size = 280\n   | spacing = 2\n   | color = #FFFFFF\n   | border = 0\n}}\n|image_caption = Clockwise from top: View over the city centre, [[Marktkirche St. Georgii et Jacobi|Market Church of Sts. George and James]], [[New Town Hall (Hanover)|New Town Hall]], [[University of Hanover]], [[Herrenhausen Gardens]], [[Altes Rathaus, Hanover|O

In [19]:
# Remove template markup (between double braces); the non-greedy match stops at the first "}}", so nested templates leave fragments behind
clean = re.sub(r"\{\{(?:[^\'\'\'])*?\}\}", "", clean, flags=re.DOTALL|re.MULTILINE)
clean

"\n\n\n\n|image_caption = Clockwise from top: View over the city centre, [[Marktkirche St. Georgii et Jacobi|Market Church of Sts. George and James]], [[New Town Hall (Hanover)|New Town Hall]], [[University of Hanover]], [[Herrenhausen Gardens]], [[Altes Rathaus, Hanover|Old Town Hall]]\n|image_flag         = Flagge Hanover.svg\n|image_coa          = Coat of arms of Hannover.svg\n|coordinates        = \n|image_plan         = Hannover in H.svg\n|state              = Lower Saxony\n|district           = Hannover\n|elevation          = 55\n|area               = 204.01\n|area_metro         = <!-- Metropolitan area, in km². XXX.XX (no commas or other text) -->\n<!-- |population         = 518386  filled via Gemeindeschlüssel \n|pop_date              = 31 December 2013\n|pop_ref            = \n-->\n|pop_metro          = 1119032\n|postal_code        = 30001 - 30669\n|area_code          = 0511\n|licence            = H\n|Gemeindeschlüssel  = 03 2 41 001\n|NUTS               = <!-- NUTS value: DEX

In [20]:
# Drop everything before the first bold marker ('''), where the lead text starts
splited_clean = clean.split("'''")[1:]

if len(splited_clean) == 0:
    pass
else:
    clean = "".join(splited_clean)

clean

"Hanover (;  ; ) is the capital and largest city of the German [[States of Germany|state]] of [[Lower Saxony]]. Its 535,061 (2017) inhabitants make it the [[List of cities in Germany by population|13th-largest city]] in [[Germany]] as well as the third-largest city in [[Northern Germany]] after [[Hamburg]] and [[Bremen]]. Hanover's [[urban area]] comprises the towns of [[Garbsen]], [[Langenhagen]] and [[Laatzen]] and has a population of about 791,000 (2018). The [[Hanover Region]] has approximately 1.16 million inhabitants (2019).\n\nThe city lies at the [[confluence]] of the [[River Leine]] (progression: ) and its [[tributary]] [[Ihme]], in the south of the [[North German Plain]], and is the largest city in the [[Hannover–Braunschweig–Göttingen–Wolfsburg Metropolitan Region]]. It is the fifth-largest city in the [[Low German]] dialect area after Hamburg, [[Dortmund]], [[Essen]] and Bremen.\n\nBefore it became the capital of Lower Saxony in 1946, Hanover was the capital of the [[Princi

In [21]:
# Replace bracketed links without a pipe with the linked word itself
clean = re.sub(r"\[\[((?:[^|])*?)\]\]", r"\1", clean, flags=re.DOTALL|re.MULTILINE)
clean

"Hanover (;  ; ) is the capital and largest city of the German [[States of Germany|state]] of Lower Saxony. Its 535,061 (2017) inhabitants make it the [[List of cities in Germany by population|13th-largest city]] in Germany as well as the third-largest city in Northern Germany after Hamburg and Bremen. Hanover's urban area comprises the towns of Garbsen, Langenhagen and Laatzen and has a population of about 791,000 (2018). The Hanover Region has approximately 1.16 million inhabitants (2019).\n\nThe city lies at the confluence of the River Leine (progression: ) and its tributary Ihme, in the south of the North German Plain, and is the largest city in the Hannover–Braunschweig–Göttingen–Wolfsburg Metropolitan Region. It is the fifth-largest city in the Low German dialect area after Hamburg, Dortmund, Essen and Bremen.\n\nBefore it became the capital of Lower Saxony in 1946, Hanover was the capital of the Principality of Calenberg (1636–1692), the Electorate of Hanover (1692–1814), the Ki

In [22]:
# Replace piped links with the text after the pipe; the pipe itself is kept here and stripped in the next step
clean = re.sub(r"\[\[(?:[^|]|)*(.*?)\]\]", r"\1", clean, flags=re.DOTALL|re.MULTILINE)
clean

"Hanover (;  ; ) is the capital and largest city of the German |state of Lower Saxony. Its 535,061 (2017) inhabitants make it the |13th-largest city in Germany as well as the third-largest city in Northern Germany after Hamburg and Bremen. Hanover's urban area comprises the towns of Garbsen, Langenhagen and Laatzen and has a population of about 791,000 (2018). The Hanover Region has approximately 1.16 million inhabitants (2019).\n\nThe city lies at the confluence of the River Leine (progression: ) and its tributary Ihme, in the south of the North German Plain, and is the largest city in the Hannover–Braunschweig–Göttingen–Wolfsburg Metropolitan Region. It is the fifth-largest city in the Low German dialect area after Hamburg, Dortmund, Essen and Bremen.\n\nBefore it became the capital of Lower Saxony in 1946, Hanover was the capital of the Principality of Calenberg (1636–1692), the Electorate of Hanover (1692–1814), the Kingdom of Hanover (1814–1866), the Province of Hanover of the Kin

In [23]:
# Clean up other leftover characters
clean = re.sub(r"\n|\'", r"", clean, flags=re.DOTALL|re.MULTILINE)
clean = re.sub(r"}}", r"", clean, flags=re.DOTALL|re.MULTILINE)
clean = re.sub(r"\|", r"", clean, flags=re.DOTALL|re.MULTILINE)
clean = re.sub(r"\(;|\(,", r"(SPECIALCHAR", clean, flags=re.DOTALL|re.MULTILINE)
clean = re.sub(r"\(SPECIALCHAR.*?\)", "", clean, flags=re.DOTALL|re.MULTILINE)
clean

'Hanover  is the capital and largest city of the German state of Lower Saxony. Its 535,061 (2017) inhabitants make it the 13th-largest city in Germany as well as the third-largest city in Northern Germany after Hamburg and Bremen. Hanovers urban area comprises the towns of Garbsen, Langenhagen and Laatzen and has a population of about 791,000 (2018). The Hanover Region has approximately 1.16 million inhabitants (2019).The city lies at the confluence of the River Leine (progression: ) and its tributary Ihme, in the south of the North German Plain, and is the largest city in the Hannover–Braunschweig–Göttingen–Wolfsburg Metropolitan Region. It is the fifth-largest city in the Low German dialect area after Hamburg, Dortmund, Essen and Bremen.Before it became the capital of Lower Saxony in 1946, Hanover was the capital of the Principality of Calenberg (1636–1692), the Electorate of Hanover (1692–1814), the Kingdom of Hanover (1814–1866), the Province of Hanover of the Kingdom of Prussia (1

### Full cleaning

In [24]:
def clean_string(clean, remove_special=True):
    # Remove HTML references (<ref> and <sup> tags)
    clean = re.sub(r"<ref.*?(/ref>|/>)", "", clean, flags=re.DOTALL|re.MULTILINE)
    clean = re.sub(r"<sup.*?(/sup>|/>)", "", clean, flags=re.DOTALL|re.MULTILINE)

    # Remove template markup (between double braces); nested templates leave fragments behind
    clean = re.sub(r"\{\{(?:[^\'\'\'])*?\}\}", "", clean, flags=re.DOTALL|re.MULTILINE)

    # Drop everything before the first bold marker ('''), where the lead text starts
    splited_clean = clean.split("'''")[1:]

    if splited_clean:
        clean = "".join(splited_clean)

    # Replace bracketed links without a pipe with the linked word itself
    clean = re.sub(r"\[\[((?:[^|])*?)\]\]", r"\1", clean, flags=re.DOTALL|re.MULTILINE)

    # Replace piped links with the text after the pipe (the pipe is stripped below)
    clean = re.sub(r"\[\[(?:[^|]|)*(.*?)\]\]", r"\1", clean, flags=re.DOTALL|re.MULTILINE)

    # Clean up other leftover characters
    clean = re.sub(r"\n|\'", r"", clean, flags=re.DOTALL|re.MULTILINE)
    clean = re.sub(r"}}", r"", clean, flags=re.DOTALL|re.MULTILINE)
    clean = re.sub(r"\|", r"", clean, flags=re.DOTALL|re.MULTILINE)
    clean = re.sub(r"\(;|\(,", r"(SPECIALCHAR", clean, flags=re.DOTALL|re.MULTILINE)
    
    # Optionally drop parentheses that were left holding only pronunciation debris
    if remove_special:
        clean = re.sub(r"\(SPECIALCHAR.*?\)", "", clean, flags=re.DOTALL|re.MULTILINE)
    
    return clean
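
The template-removal step stops at the first `}}`, so a template nested inside another one (as in the Hanover infobox of the demonstration) leaves part of the outer template behind. One possible alternative, which this notebook does not use, is to remove innermost templates repeatedly until nothing changes. The sketch below assumes a hypothetical helper name `strip_templates` and a made-up wikitext snippet. Unlike the regex above, it also removes templates that contain apostrophes.

```python
import re

# Matches a template that contains no other braces, i.e. an innermost one
INNERMOST_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")

def strip_templates(text, max_passes=20):
    """Remove {{...}} templates from the inside out, one nesting level per pass."""
    for _ in range(max_passes):
        new_text = INNERMOST_TEMPLATE.sub("", text)
        if new_text == text:  # nothing left to remove
            break
        text = new_text
    return text

# Made-up snippet with one template nested inside another
wikitext = "{{Infobox|image={{Photomontage|size=280}}|name=Hanover}}'''Hanover''' is a city."
print(strip_templates(wikitext))
```

The `max_passes` bound is only a safety limit for malformed markup; each pass removes one level of nesting.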

In [25]:
for word in list(main_texts.keys()):
    
    new_text = clean_string(main_texts[word])
    
    if len(new_text) == 0:
        main_texts.pop(word)
    else:
        main_texts[word] = new_text

In [26]:
# Export the results to files
with open('storage/clean_texts.json', 'w') as f:
    json.dump(main_texts, f, indent=4)

## Split the text into sentences with the Punkt tokenizer

In [27]:
# Load the cleaned texts from the JSON file
with open("storage/clean_texts.json", "r") as load:
    main_texts = json.load(load)

In [28]:
sample = "hanover"

In [32]:
main_sents = {}
sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
for key, value in main_texts.items():
    main_sents[key] = sent_tokenizer.tokenize(value)

In [33]:
main_sents[sample]

['Hanover  is the capital and largest city of the German state of Lower Saxony.',
 'Its 535,061 (2017) inhabitants make it the 13th-largest city in Germany as well as the third-largest city in Northern Germany after Hamburg and Bremen.',
 'Hanovers urban area comprises the towns of Garbsen, Langenhagen and Laatzen and has a population of about 791,000 (2018).',
 'The Hanover Region has approximately 1.16 million inhabitants (2019).The city lies at the confluence of the River Leine (progression: ) and its tributary Ihme, in the south of the North German Plain, and is the largest city in the Hannover–Braunschweig–Göttingen–Wolfsburg Metropolitan Region.',
 'It is the fifth-largest city in the Low German dialect area after Hamburg, Dortmund, Essen and Bremen.Before it became the capital of Lower Saxony in 1946, Hanover was the capital of the Principality of Calenberg (1636–1692), the Electorate of Hanover (1692–1814), the Kingdom of Hanover (1814–1866), the Province of Hanover of the King
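
Some sentences in the output above are fused (for example `(2019).The city`) because the cleaning step deletes newlines without leaving a space, so the tokenizer sees no boundary between paragraphs. A minimal sketch of one possible adjustment, not applied in this notebook, is to collapse each run of newlines into a single space. The sample text below is made up.

```python
import re

# Made-up text with a paragraph break between two sentences
raw = "The region has many inhabitants.\n\nThe city lies on a river. It's large."

# Behaviour of the cleaning step: newlines and apostrophes are deleted
fused = re.sub(r"\n|'", "", raw)

# Alternative: turn each run of newlines into one space, then drop apostrophes
spaced = re.sub(r"\s*\n+\s*", " ", raw).replace("'", "")

print(fused)
print(spaced)
```

With the space preserved, a sentence tokenizer has the usual period-plus-space cue at paragraph boundaries.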

In [34]:
# Export the results to files
with open('storage/def_sents.json', 'w') as f:
    json.dump(main_sents, f, indent=4)

## Perplexity analysis

To select the sentence that best defines each keyword, we use perplexity analysis: a method that measures how "strange" each sentence looks from the perspective of a model trained on a large body of text.
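
For reference, the usual textbook definition of perplexity for a bigram model is given below. The symbols are assumptions introduced here: $N$ is the number of scored n-grams, $c(\cdot)$ is a training count and $|V|$ is the vocabulary size. As far as we know, NLTK's `perplexity` computes $2^{H}$ over whatever n-grams it receives (here unigrams and bigrams from the everygram pipeline), so the values printed below are not a pure bigram perplexity.

```latex
% Cross-entropy of a sentence s = w_1 ... w_N under a bigram model
H(s) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_{i-1})

% Perplexity: lower values mean the sentence is less "strange" to the model
\mathrm{PP}(s) = 2^{H(s)}

% Laplace (add-one) estimate of the bigram probability
P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}\, w_i) + 1}{c(w_{i-1}) + |V|}
```

The sentence with the lowest perplexity is the one the model finds most predictable, which is the selection criterion used in the rest of this section.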

In [35]:
# Load the tokenized sentences from the JSON file
with open("storage/def_sents.json", "r") as load:
    main_sents = json.load(load)

### Using the Reuters corpus as the base

In this section the Reuters corpus is used as the training base for the model. The initial assumption is that a large set of texts yields a model broadly trained on both frequent and rare words, which can then measure the degree of "strangeness" of each sentence.

In [36]:
# Data preparation
reuters_sentences = reuters.sents()
reuters_train, reuters_vocab = padded_everygram_pipeline(2, reuters_sentences)

reuters_train = list(list(t) for t in reuters_train)
reuters_vocab = list(reuters_vocab)

# Model training
lm_reuters = Laplace(2)
lm_reuters.fit(reuters_train, reuters_vocab)
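
To show what a Laplace-smoothed bigram model measures, here is a hand-rolled sketch in plain Python on a made-up two-sentence corpus. It is an illustration under simplifying assumptions (bigrams only, no unknown-word token), so its numbers are not expected to match those of `nltk.lm.Laplace`. The names `train`, `perplexity` and `bigrams` are hypothetical helpers.

```python
import math
from collections import Counter

def bigrams(tokens):
    """Pad a sentence with start/end markers and return its bigrams."""
    padded = ["<s>"] + tokens + ["</s>"]
    return list(zip(padded, padded[1:]))

def train(sentences):
    """Count bigrams and their contexts, and collect the vocabulary."""
    bigram_counts, context_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        for prev, word in bigrams(sentence):
            bigram_counts[(prev, word)] += 1
            context_counts[prev] += 1
            vocab.update((prev, word))
    return bigram_counts, context_counts, vocab

def perplexity(sentence, model):
    """2 ** (average negative log2 of the add-one smoothed bigram probabilities)."""
    bigram_counts, context_counts, vocab = model
    pairs = bigrams(sentence)
    log_sum = sum(
        math.log2((bigram_counts[pair] + 1) / (context_counts[pair[0]] + len(vocab)))
        for pair in pairs
    )
    return 2 ** (-log_sum / len(pairs))

# Toy corpus (made up)
model = train([["the", "cat", "sat"], ["the", "dog", "sat"]])
seen = perplexity(["the", "cat", "sat"], model)
scrambled = perplexity(["sat", "cat", "the"], model)
print(seen, scrambled)
```

A sentence whose word order appeared in training gets a lower perplexity than the same words scrambled, which is the behaviour the selection step relies on.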

In [37]:
sample = "hanover"

In [38]:
# Compute the perplexities for the sample
text_sentences = [t.split() for t in main_sents[sample]]
test, _ = padded_everygram_pipeline(2, text_sentences)
test = list(list(t) for t in test)

import time
start_time = time.time()

# Compute the perplexity of each sentence and keep the lowest one
idx = 0
min_value = math.inf
for i, s in enumerate(test):
    px = lm_reuters.perplexity(s)
    
    if min_value > px:
        min_value = px
        idx = i

    print(px)
print(f"Execution time: {time.time() - start_time} seconds")

2517.139793945666
8372.738449782728
12312.347223746885
6315.5021930547955
7722.020235677807
10478.996008500326
14306.6105660475
16213.089441452663
7638.787064473857
11221.220325754459
5488.853729160127
19418.311305672127
10551.841006068196
4051.3891598621826
5014.1451339978485
Execution time: 0.006981611251831055 seconds


As the tests with the Reuters corpus as the base show, the execution time for a single keyword is high, which makes running the code over the entire set of encyclopedia words very expensive: approximately 1 h 30 min for the current set. For this reason the complete test with this base, although implemented, was not executed in this notebook.

In [39]:
def find_lowest_perplexity_reuter_parallel(main_sents, words, do_print=False):
    
    encyclopedia_reuters = {}
    for i, word in enumerate(words):
        idx, min_value = find_lowest_perplexity_reuters(main_sents, word)
        encyclopedia_reuters[word] = main_sents[word][idx]
        
    return encyclopedia_reuters
        
def find_lowest_perplexity_reuters(main_sents, word, do_print=False):

    # Data preparation
    text_sentences = [t.split() for t in main_sents[word]]
    test, _ = padded_everygram_pipeline(2, text_sentences)
    test = list(list(t) for t in test)

    # Compute the perplexity of each sentence and keep the lowest one
    idx = 0 
    min_value = math.inf
    for i, s in enumerate(test):
        px = lm_reuters.perplexity(s)

        if(min_value > px):
            min_value = px
            idx = i

        if do_print:
            print(px)
            
    return idx, min_value 

#### Running serially

In [40]:
# encyclopedia_reuters = {}
# words = list(main_sents.keys())
# total = len(words)
# for i, word in enumerate(words):
#     idx, min_value = find_lowest_perplexity_reuters(main_sents, word)
#     encyclopedia_reuters[word] = main_sents[word][idx]
    
#     print(f"\r{i+1} out of {total}", end='')

#### Running in parallel

In [41]:
# n_jobs = 12
# encyclopedia_reuters = {}
# words = list(main_sents.keys())
# part = round(len(words)/n_jobs)

# words_batch = []
# for i in range(n_jobs):
#     if ((i+1)*part > len(words)):
#         words_batch.append(words[i*part:])
#     else:
#         words_batch.append(words[i*part:(i+1)*part])

In [42]:
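# encyclopedia_reuters = {}
# results = Parallel(n_jobs=n_jobs, verbose=50)(delayed(find_lowest_perplexity_reuter_parallel)(main_sents, words) for words in words_batch)
#
# # Merge the partial dictionaries returned by each job
# for result in results:
#     encyclopedia_reuters.update(result)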

In [43]:
# # Export the results to files
# with open('storage/encyclopedia_reuters.json', 'w') as f:
#     json.dump(encyclopedia_reuters, f, indent=4)

### Using the text itself as the base

In this section the document of the keyword itself is used as the training base for the model. The initial assumption is that the sentence that best defines the word contains the elements used most frequently in its specific context.

In [44]:
sample = "hanover"

In [45]:
# Data preparation
text_sentences = [t.split() for t in main_sents[sample]]
train, vocab = padded_everygram_pipeline(2, text_sentences)

train = list(list(t) for t in train)
vocab = list(vocab)

# Model training
lm = Laplace(2)
lm.fit(train, vocab)

# Compute the perplexity of each sentence and keep the lowest one
idx = 0 
min_value = math.inf
for i, s in enumerate(train):
    px = lm.perplexity(s)
    
    if(min_value > px):
        min_value = px
        idx = i

    print(px)

77.23595126135935
132.66110823535138
137.8052817783378
111.49717703151089
89.3669246617187
130.71641079116478
118.04072560239162
143.31842453140845
124.88363853010559
161.86166212548235
120.47089382568275
113.8615045296886
126.2236923379113
123.00294373187059
133.52740330778033


In [46]:
def find_lowest_perplexity_same_parallel(main_sents, words, do_print=False):
    
    encyclopedia_same = {}
    for i, word in enumerate(words):
        idx, min_value = find_lowest_perplexity_same(main_sents, word)
        encyclopedia_same[word] = main_sents[word][idx]
        
    return encyclopedia_same

def find_lowest_perplexity_same(main_sents, word, do_print=False):

    # Data preparation
    text_sentences = [t.split() for t in main_sents[word]]
    train, vocab = padded_everygram_pipeline(2, text_sentences)

    train = list(list(t) for t in train)
    vocab = list(vocab)

    # Model training
    lm = Laplace(2)
    lm.fit(train, vocab)
    
    # Compute the perplexity of each sentence and keep the lowest one
    idx = 0 
    min_value = math.inf
    for i, s in enumerate(train):
        px = lm.perplexity(s)
        
        if(min_value > px):
            min_value = px
            idx = i
        
        if do_print:
            print(px)
            
    return idx, min_value 

In [47]:
encyclopedia_same = {}
words = list(main_sents.keys())
total = len(words)
for i, word in enumerate(words):
    idx, min_value = find_lowest_perplexity_same(main_sents, word)
    encyclopedia_same[word] = main_sents[word][idx]
    
    print(f"\r{i+1} out of {total}", end='')

3238 out of 3238

#### Running in parallel

In [48]:
n_jobs = 12
words = list(main_sents.keys())
# Round up so that n_jobs batches always cover every word
part = math.ceil(len(words) / n_jobs)

# Slicing past the end of a list is safe, so no bounds check is needed
words_batch = [words[i*part:(i+1)*part] for i in range(n_jobs)]
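
When a word list is split into one batch per job, the chunk size matters: if it is obtained by rounding to the nearest integer and the division rounds down, `n_jobs` slices of that size do not reach the end of the list. With 29 items and 12 jobs, for instance, a chunk size of `round(29 / 12) = 2` covers only 24 items. The sketch below uses a hypothetical helper, `make_batches`, that rounds up instead.

```python
import math

def make_batches(items, n_jobs):
    """Split items into at most n_jobs consecutive batches that cover every item."""
    if not items:
        return []
    part = math.ceil(len(items) / n_jobs)  # round up so nothing is left over
    return [items[i:i + part] for i in range(0, len(items), part)]

# Toy input: 29 items distributed over 12 jobs
items = list(range(29))
batches = make_batches(items, n_jobs=12)
print(len(batches), [len(b) for b in batches])
```

Rounding up may produce fewer batches than jobs, which only leaves some workers idle and does not lose data.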

In [49]:
encyclopedia_same = {}
results = Parallel(n_jobs=n_jobs, verbose=50)(delayed(find_lowest_perplexity_same_parallel)(main_sents, words) for words in words_batch)

for result in results:
    encyclopedia_same.update(result)

[Parallel(n_jobs=12)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done   1 tasks      | elapsed:    4.4s
[Parallel(n_jobs=12)]: Done   2 out of  12 | elapsed:    4.5s remaining:   22.8s
[Parallel(n_jobs=12)]: Done   3 out of  12 | elapsed:    4.6s remaining:   13.9s
[Parallel(n_jobs=12)]: Done   4 out of  12 | elapsed:    4.9s remaining:    9.9s
[Parallel(n_jobs=12)]: Done   5 out of  12 | elapsed:    5.0s remaining:    7.1s
[Parallel(n_jobs=12)]: Done   6 out of  12 | elapsed:    5.1s remaining:    5.1s
[Parallel(n_jobs=12)]: Done   7 out of  12 | elapsed:    5.2s remaining:    3.7s
[Parallel(n_jobs=12)]: Done   8 out of  12 | elapsed:    5.2s remaining:    2.5s
[Parallel(n_jobs=12)]: Done   9 out of  12 | elapsed:    5.4s remaining:    1.7s
[Parallel(n_jobs=12)]: Done  10 out of  12 | elapsed:    5.4s remaining:    1.0s
[Parallel(n_jobs=12)]: Done  12 out of  12 | elapsed:    5.8s remaining:    0.0s
[Parallel(n_jobs=12)]: Done  12 out of  12 | elapse

In [50]:
# Export the results to files
with open('storage/encyclopedia_same.json', 'w') as f:
    json.dump(encyclopedia_same, f, indent=4)

## Explore the encyclopedias

In [51]:
# Load the data from the JSON files
with open("storage/encyclopedia_same.json", "r") as load:
    encyclopedia_same = json.load(load)
    
#with open("storage/encyclopedia_reuters.json", "r") as load:
#    encyclopedia_reuters = json.load(load)

In [52]:
#encyclopedia_reuters

In [53]:
encyclopedia_same

{'abbett': 'Abbett is a surname.',
 'abdul': 'Abdul (also transliterated as Abdal, Abdel, Abdil, Abdol, Abdool, or Abdoul; , {{translarDINʿAbd al-) is the most frequent transliteration of the combination of the Arabic word Abd  and the definite prefix al / el .It is the initial component of many compound names, names made of two words.',
 'ablaze': 'The film uses stock footage from two other films.',
 'abroad': 'The Moise A. Khayrallah Center for Lebanese Diaspora Studies awarded Alexander the 2020 Khayrallah Art Prize for the film.',
 'acceleration': 'The orientation of an objects acceleration is given by the orientation of the net force acting on that object.',
 'acceptance': 'Acceptance in human psychology is a persons assent to the reality of a situation, recognizing a process or condition (often a negative or uncomfortable situation) without attempting to change it or protest it.',
 'accepted': 'The story takes place in Wickliffe and a fictitious college town called Harmon in Ohio

In [54]:
encyclopedia_same["hanover"]

'Hanover  is the capital and largest city of the German state of Lower Saxony.'

## Conclusion

With the encyclopedia built from models trained on each keyword's own context, we can observe that most descriptions are pertinent sentences. Some words, however, received inadequate descriptions. This may have happened because their documents contain very little written content, which hampers the training of the model used for the perplexity analysis.
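
One way to probe this hypothesis would be to list the entries whose source document has very few sentences and compare them with the descriptions judged inadequate. The sketch below assumes a dictionary shaped like `main_sents` (word → list of sentences). The helper name `short_entries`, the threshold of 3 and the toy data are all hypothetical.

```python
def short_entries(sentences_by_word, min_sentences=3):
    """Return the words whose document has fewer than min_sentences sentences."""
    return sorted(
        word for word, sents in sentences_by_word.items()
        if len(sents) < min_sentences
    )

# Toy data (made up) in the same shape as main_sents
toy_sents = {
    "word_a": ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."],
    "word_b": ["Only one sentence."],
}
print(short_entries(toy_sents))
```

If the inadequate descriptions concentrate in this list, that would support the explanation above; if not, other causes such as the cleaning step would deserve a closer look.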