In [1]:
#!pip install wikipedia
import wikipedia

# Web scraper that collects a corpus from Wikipedia in real time

The `Wikipedia_Crawler` class uses the `wikipedia` module to load a _corpus_ in real time.<br>

In [2]:
from my_nlp_classes import Wikipedia_Crawler

First, we have to set the desired language of the searches. 

In [3]:
wikipedia.set_lang('pt')

The crawler searches Wikipedia for each input query and downloads the content of the matching pages.
- `query_list` is the list of query strings to look for in Wikipedia
- `max_results` is the maximum number of pages to load per query expression in the `query_list`
- `max_V` will be used later to limit the vocabulary size of our custom dictionary
- `skip_top` is the number of most frequent words (in descending order of frequency) to discard
- `max_len_sentence` will be used later to limit the number of words kept in each encoded sentence
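
The search-and-download step can be sketched independently of `Wikipedia_Crawler`, whose internals are not shown here. In this minimal sketch, `collect_pages`, `search_fn`, `fetch_fn`, and the `fake_index` data are hypothetical names introduced only for illustration; the search and download functions are passed in as arguments so the example runs offline.

```python
from typing import Callable, Dict, List

def collect_pages(query_list: List[str],
                  max_results: int,
                  search_fn: Callable[[str, int], List[str]],
                  fetch_fn: Callable[[str], str]) -> Dict[str, str]:
    """Return {page title: page text} with at most max_results pages per query."""
    pages: Dict[str, str] = {}
    for query in query_list:
        for title in search_fn(query, max_results)[:max_results]:
            if title not in pages:  # skip pages already found by an earlier query
                pages[title] = fetch_fn(title)
    return pages

# Offline stand-ins (made-up data) so the sketch runs without network access.
fake_index = {'Rainha': ['Rainha', 'A Rainha'],
              'Rei': ['Rei', 'Rei (xadrez)', 'Choque Rei', 'Rei Leão']}

demo_pages = collect_pages(['Rainha', 'Rei'], 3,
                           search_fn=lambda q, n: fake_index[q],
                           fetch_fn=lambda t: f'texto de {t}')
print(list(demo_pages))
# ['Rainha', 'A Rainha', 'Rei', 'Rei (xadrez)', 'Choque Rei']
```

With the `wikipedia` package, one plausible wiring is `search_fn=lambda q, n: wikipedia.search(q, results=n)` and `fetch_fn=lambda t: wikipedia.page(t).content`. Page lookups can raise `wikipedia.exceptions.DisambiguationError` or `wikipedia.exceptions.PageError`, so a real crawler would likely need to handle those.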

In [4]:
query_list  = ['Rainha','Rei','Mulher','Homem'] # 'Brasil','Japão','Família','Tempo','Calendário'  
max_results = 3
max_V       = 2000
skip_top    = 0
max_len_sentence = 100

In the next cell, we instantiate the crawler with its basic query parameters.

In [5]:
wiki_crawler = Wikipedia_Crawler(query_list=query_list,max_results_per_query=max_results)

To find the desired wiki pages, we call `query_wiki`. <br>
The method `get_wiki_sentences` downloads the pages and tokenizes their sentences. <br>
Then, `get_all_tokens` lists all unique tokens (words) in the data. <br>
Finally, `count_vocabulary` counts the absolute frequency of each token. Both of these last two methods tokenize the text as they go.
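
As a rough illustration of what sentence splitting, tokenization, and frequency counting involve, here is a minimal sketch using only the standard library. The helpers `split_sentences` and `tokenize` are hypothetical, and the rules they apply (lowercasing, splitting on sentence-final punctuation, dropping punctuation) are assumptions; the class may tokenize differently.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p.lower() for p in parts if p]

def tokenize(sentence):
    """Keep runs of word characters (hyphenated compounds stay whole); drop punctuation."""
    return re.findall(r'\w+(?:-\w+)*', sentence)

text = 'O rei governa. A rainha governa o reino! O Homem-Aranha observa.'
sentences = split_sentences(text)
tokenized = [tokenize(s) for s in sentences]

# Unique tokens and their absolute frequencies over the whole corpus
all_tokens = sorted({t for sent in tokenized for t in sent})
counts = Counter(t for sent in tokenized for t in sent)
print(counts.most_common(2))
# [('o', 3), ('governa', 2)]
```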

In [6]:
wiki_crawler.query_wiki()
wiki_crawler.get_wiki_sentences()
wiki_crawler.get_all_tokens()
wiki_crawler.count_vocabulary()

Using every token may be prohibitively expensive, so we limit the vocabulary. <br>
- `generate_vocabulary` creates the `word2idx` and `idx2word` dictionaries
  - the dictionaries are limited to `max_V` words ($+ 3$ tags)
  - note that the `skip_top` most frequent words are discarded
- `encode_all_sentence` encodes sentences of words into sequences (lists) of the corresponding indexes
  - encoded sentences have additional `"<START>"` and `"<END>"` tokens
  - out-of-vocabulary words are coded with the index of the `"<OOV>"` tag
  - encoded sentences are truncated to at most `max_len_sentence` actual words ($+ 2$ start/end tags)
  - encoded sentences are padded by default, unless `pad_sentences` is set to `False`
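
The rules above can be sketched as follows. This is a sketch under stated assumptions, not the class's actual code: the helper names are hypothetical, and the index layout (0 reserved for padding, tags at 1 to 3, words from 4) is chosen for illustration and may differ from what `Wikipedia_Crawler` assigns. It also assumes that skipped words do not count toward the vocabulary limit.

```python
from collections import Counter

PAD_IDX = 0                           # assumed padding value, not stored in the dictionary
TAGS = ['<START>', '<END>', '<OOV>']  # assumed to take indices 1, 2, 3

def build_vocabulary(counts, max_len_vocabulary, skip_top_words=0):
    """Map the most frequent words to indices, after dropping the top skip_top_words."""
    ranked = [w for w, _ in counts.most_common()]
    kept = ranked[skip_top_words:skip_top_words + max_len_vocabulary]
    word2idx = {tag: i for i, tag in enumerate(TAGS, start=1)}
    for word in kept:
        word2idx[word] = len(word2idx) + 1
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

def encode_sentence(tokens, word2idx, max_len_sentence, pad=True):
    """Truncate, map words to indices (unknown words to <OOV>), add tags, then pad."""
    body = [word2idx.get(t, word2idx['<OOV>']) for t in tokens[:max_len_sentence]]
    coded = [word2idx['<START>']] + body + [word2idx['<END>']]
    if pad:
        coded += [PAD_IDX] * (max_len_sentence + 2 - len(coded))
    return coded

toy_counts = Counter({'de': 5, 'a': 4, 'rainha': 3, 'rei': 2, 'xadrez': 1})
w2i, i2w = build_vocabulary(toy_counts, max_len_vocabulary=3, skip_top_words=1)
print(encode_sentence(['a', 'rainha', 'de', 'xadrez'], w2i, max_len_sentence=5))
# [1, 4, 5, 3, 3, 2, 0]
```

In the toy run, `de` is skipped as the most frequent word and `xadrez` falls outside the three-word limit, so both are encoded as `<OOV>`. The dictionary ends up with `max_len_vocabulary + 3` entries, matching the "$+ 3$ tags" rule.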

In [7]:
wiki_crawler.generate_vocabulary(max_len_vocabulary=max_V, skip_top_words=skip_top)
wiki_crawler.encode_all_sentence(max_len_sentence=max_len_sentence, pad_sentences=True) # pad_sentences is True by default

Check where the corpus came from, along with some other results.

In [8]:
S = len(wiki_crawler.sentences)           
T = len(wiki_crawler.coded_sentences[0]) # max_len_sentence + 2
V = len(wiki_crawler.word2idx)           # max_V + 3 (tags)

print(f' Corpus derived from the following wiki pages: \n ---- {wiki_crawler.titles_list}')
print(f' Number of sentences: ------ {S}')
print(f' Max. words in sentence: --- {T-2}')
print(f' Vocabulary Size: ---------- {V}')

 Corpus derived from the following wiki pages: 
 ---- ['Rainha', 'A Rainha', 'Rainha Elizabeth', 'Rei', 'Rei (xadrez)', 'Choque Rei', 'Mulher', 'Mulher, Mulher', 'Mulher-Gato', 'Homem', 'Homem-Aranha', 'Homem-Aranha no cinema']
 Number of sentences: ------ 1301
 Max. words in sentence: --- 100
 Vocabulary Size: ---------- 2003


Here is how we can check the generated sentences.  

In [9]:
num_sentence = 5

real_sentence    = wiki_crawler.sentences[num_sentence]
encoded_sentence = wiki_crawler.coded_sentences[num_sentence]
decoded_sentence = wiki_crawler.decode_one_sentence(encoded_sentence)

print(f' Real Text sentence: \n    {real_sentence}', end=2*'\n')
print(f' Encoded Sentence: \n    {encoded_sentence}', end=2*'\n')
print(f' Decoded Sentence: \n    '+' '.join(decoded_sentence))

 Real Text sentence: 
    entre os reis davi de judá e israel, não é mencionado uma única rainha reinante; apesar de atália, embora a bíblia se refira negativamente como uma usurpadora.

 Encoded Sentence: 
    [1, 49, 17, 543, 1392, 3, 1934, 6, 1935, 26, 23, 1936, 12, 375, 40, 195, 153, 3, 1937, 186, 4, 221, 20, 1938, 1393, 15, 12, 1939, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

 Decoded Sentence: 
    <START> entre os reis davi de judá e israel não é mencionado uma única rainha reinante apesar de atália embora a bíblia se refira negativamente como uma usurpadora <END>


In [10]:
word = 'mulher'
idx = 9
print(f' Word "{word}" corresponds to index {wiki_crawler.word2idx[word]}')
print(f' Index {idx} corresponds to word "{wiki_crawler.idx2word[idx]}"')

 Word "mulher" corresponds to index 36
 Index 9 corresponds to word "do"
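
Since `idx2word` is the inverse of `word2idx`, decoding amounts to one lookup per index. The sketch below assumes padding is the value 0 and is dropped on decoding, as the decoded output above suggests; `decode_sentence` and the toy dictionaries are hypothetical and are not the class's `decode_one_sentence`.

```python
def decode_sentence(coded, idx2word, pad_idx=0):
    """Look each index up in idx2word, skipping padding positions."""
    return [idx2word[i] for i in coded if i != pad_idx]

# Toy dictionaries (made-up indices) to show the inverse relationship
toy_word2idx = {'<START>': 1, '<END>': 2, 'de': 3, 'a': 4}
toy_idx2word = {i: w for w, i in toy_word2idx.items()}

decoded = decode_sentence([1, 4, 3, 2, 0, 0], toy_idx2word)
print(' '.join(decoded))
# <START> a de <END>
```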
