In [1]:
import processing_queries as pq
import generate_inverted_list as gil
import indexer
import searcher

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/casalecchi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/casalecchi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 1) Processing Queries

First we read the configuration file `pc.cfg`. It names the file that holds the queries; we open that file and extract the root of the XML document. From that root, two methods are called, each generating a CSV file: the first writes the queries and the second writes the expected results for those queries. Both files are saved in the `results` directory.

In [2]:
read, queries, expected = pq.read_config("pc.cfg")
xml_root = pq.get_xml_root(read)
pq.get_queries_file(queries, xml_root)
pq.get_expected_file(expected, xml_root)
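
As a rough illustration of what these two helpers might do, the sketch below parses a small in-memory XML document and builds the rows of both CSV files. The tag names (`QUERY`, `QueryNumber`, `QueryText`, `Item`), the `score` attribute, and the rule that counts one vote per non-zero digit of `score` are assumptions made for this example. The functions `queries_rows`, `expected_rows` and `to_csv` are hypothetical and are not the `processing_queries` API.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the real query file may use different tags.
SAMPLE_XML = """
<FILEQUERY>
  <QUERY>
    <QueryNumber>00001</QueryNumber>
    <QueryText>What are the effects
      of calcium?</QueryText>
    <Records>
      <Item score="1222">139</Item>
      <Item score="0001">166</Item>
    </Records>
  </QUERY>
</FILEQUERY>
"""


def queries_rows(root):
    """Return one (QueryNumber, QueryText) row per query."""
    rows = []
    for query in root.iter("QUERY"):
        number = int(query.findtext("QueryNumber"))
        # Collapse line breaks and repeated spaces, then upper-case.
        text = " ".join(query.findtext("QueryText").split()).upper()
        rows.append((number, text))
    return rows


def expected_rows(root):
    """Return one (QueryNumber, DocNumber, DocVotes) row per judged document."""
    rows = []
    for query in root.iter("QUERY"):
        number = int(query.findtext("QueryNumber"))
        for item in query.iter("Item"):
            # Assumption: each non-zero digit of `score` is one vote.
            votes = sum(1 for digit in item.get("score") if digit != "0")
            rows.append((number, int(item.text), votes))
    return rows


def to_csv(header, rows):
    """Serialize rows as semicolon-separated CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


root = ET.fromstring(SAMPLE_XML)
queries_csv = to_csv(["QueryNumber", "QueryText"], queries_rows(root))
expected_csv = to_csv(["QueryNumber", "DocNumber", "DocVotes"], expected_rows(root))
```

The semicolon delimiter matches the `sep=";"` used when the files are read back with pandas.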

Now let's see how these two files are organized.

In [3]:
import pandas as pd

queries_df = pd.read_csv(queries, sep=";")
expected_df = pd.read_csv(expected, sep=";")

In [4]:
queries_df.head()

Unnamed: 0,QueryNumber,QueryText
0,1,WHAT ARE THE EFFECTS OF CALCIUM ON THE PHYSICA...
1,2,CAN ONE DISTINGUISH BETWEEN THE EFFECTS OF MUC...
2,3,HOW ARE SALIVARY GLYCOPROTEINS FROM CF PATIENT...
3,4,WHAT IS THE LIPID COMPOSITION OF CF RESPIRATOR...
4,5,IS CF MUCUS ABNORMAL?


In [5]:
expected_df.head()

Unnamed: 0,QueryNumber,DocNumber,DocVotes
0,1,139,4
1,1,151,4
2,1,166,1
3,1,311,1
4,1,370,2


## 2) Generating the Inverted List

We start by reading the configuration file `gli.cfg`, which gives the XML files that will be read to build our document collection and the file to which the inverted list will be written.

In [6]:
read_files, write_file = gil.read_config_file("gli.cfg")
gil.get_tokens_file(read_files, write_file)

The inverted list file is organized as follows: for each token, it stores the list of documents in which the token appears. If a token appears X times in a document, that document's number is listed X times in the token's list.

In [7]:
inverted_list = pd.read_csv(write_file, sep=";", converters={"Appearance": pd.eval})
inverted_list.head()

Unnamed: 0,Token,Appearance
0,SIGNIFIC,"[1, 6, 19, 24, 30, 47, 52, 53, 54, 62, 62, 65,..."
1,PSEUDOMONA,"[1, 1, 1, 7, 8, 18, 18, 61, 61, 62, 62, 62, 62..."
2,AERUGINOSA,"[1, 1, 1, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, ..."
3,INFECT,"[1, 1, 1, 6, 6, 6, 16, 18, 48, 48, 57, 58, 58,..."
4,RESPIRATORI,"[1, 1, 1, 6, 6, 7, 7, 8, 11, 11, 11, 15, 17, 2..."
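
The structure described above can be sketched in a few lines. This is only an illustration, assuming the documents have already been tokenized; the function `build_inverted_list` and the sample data are hypothetical, and the tokenization, stop-word removal and stemming done by `generate_inverted_list` are not reproduced here.

```python
from collections import defaultdict


def build_inverted_list(docs):
    """Map each token to the documents it occurs in, one entry per occurrence.

    docs: {doc_number: [token, token, ...]} (already preprocessed)
    """
    inverted = defaultdict(list)
    for doc_number in sorted(docs):
        for token in docs[doc_number]:
            # A token seen X times in a document adds that document X times.
            inverted[token].append(doc_number)
    return dict(inverted)


# Hypothetical, already-stemmed sample documents.
docs = {
    1: ["PSEUDOMONA", "AERUGINOSA", "PSEUDOMONA"],
    6: ["AERUGINOSA", "INFECT"],
}
inverted = build_inverted_list(docs)
```

Keeping repeated document numbers means the term frequency can later be recovered by counting, without a separate frequency field.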


## 3) Indexer

Its configuration file tells it to read a CSV file, the inverted list generated by the previous module, and names the file to which the model will be written. The model is a matrix holding the weight of each term in each document, computed with standard tf-idf. To use normalized tf instead, set the variable `type_tf` to `tfn`.

First we build a term-document matrix from the inverted list generated by the previous module. From that matrix, the model with the weights can then be generated. The two cells below build the term-document matrix and then the model.

In [8]:
tokens, model = indexer.read_config_file("index.cfg")
matrix = indexer.get_term_document_matrix(tokens)
matrix

Token,1,6,19,24,30,47,52,53,54,62,...,909,330,537,580,799,558,908,117,1011,940
SIGNIFIC,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PSEUDOMONA,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AERUGINOSA,3.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
INFECT,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
RESPIRATORI,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
THROMBOSI,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MONOSPECIF,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CONSENT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PATCHI,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
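
A term-document matrix of this shape can be derived from an inverted list by counting how often each document number occurs in a token's list. The sketch below is a plain-Python illustration with a hypothetical function name and sample data; it is not the code of `indexer.get_term_document_matrix`. Ordering the columns by first appearance is an assumption suggested by the column order in the output above.

```python
from collections import Counter


def term_document_matrix(inverted):
    """Return (doc_ids, matrix) where matrix[token] is a row of raw counts."""
    # Unique document numbers, in the order they first appear in the lists.
    doc_ids = list(
        dict.fromkeys(doc for postings in inverted.values() for doc in postings)
    )
    matrix = {}
    for token, postings in inverted.items():
        counts = Counter(postings)  # occurrences of the token per document
        matrix[token] = [float(counts[doc]) for doc in doc_ids]
    return doc_ids, matrix


# Hypothetical inverted list.
inverted = {"PSEUDOMONA": [1, 1, 1, 7], "INFECT": [1, 6, 6]}
doc_ids, matrix = term_document_matrix(inverted)
```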


In [9]:
type_tf = "tfn"
# Keep the returned model itself: wrapping it in tqdm() would bind `model`
# to a progress-bar object that is never iterated.
model = indexer.get_model(matrix, type_tf)

100%|███████████████████████████████████████| 6278/6278 [05:44<00:00, 18.20it/s]
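
The exact weighting formula inside `indexer.get_model` is not shown in this notebook. As a minimal sketch of one common formulation, the code below takes the weight of a term in a document as `tf * idf`, with `idf = ln(N / df)`, and for the normalized variant divides `tf` by the largest raw count in that document. The function name `tf_idf`, its `normalize_tf` flag, the logarithm base and the normalization rule are all assumptions.

```python
import math


def tf_idf(counts, normalize_tf=False):
    """Compute tf-idf weights.

    counts: {term: {doc_id: raw_count}}, listing only documents with count > 0.
    Returns {term: {doc_id: weight}} covering every document.
    """
    doc_ids = sorted({doc for postings in counts.values() for doc in postings})
    n_docs = len(doc_ids)
    # Largest raw count in each document, used only for the normalized variant.
    max_tf = {
        doc: max(postings.get(doc, 0) for postings in counts.values())
        for doc in doc_ids
    }
    weights = {}
    for term, postings in counts.items():
        idf = math.log(n_docs / len(postings))  # len(postings) is the document frequency
        row = {}
        for doc in doc_ids:
            tf = postings.get(doc, 0)
            if normalize_tf:
                tf = tf / max_tf[doc]
            row[doc] = tf * idf
        weights[term] = row
    return weights


# Hypothetical counts: "A" occurs in both documents, "B" only in document 1.
counts = {"A": {1: 2, 2: 1}, "B": {1: 1}}
standard = tf_idf(counts)
normalized = tf_idf(counts, normalize_tf=True)
```

A term that occurs in every document gets `idf = 0` under this formulation, so it contributes nothing to the similarity computed later.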

## 4) Searcher

Now that we have the model and the queries, we can run the searches. The configuration file names two input files, containing the model and the queries, and one output file, which receives the corresponding results.

It works as follows: with the model and the queries in memory, it joins the two into a single DataFrame, shown further below. Every word in a query gets a weight of 1. Each column of the DataFrame is the vector of a document or of a query. For each query, the cosine similarity against every document is computed, yielding one value per document. These values populate a new DataFrame called `ranking`, which has the queries as columns, the document numbers as the index, and the cosine similarity as the values.

With this ranking in hand, the file containing the query results is built. It is a CSV file with two fields:

* Query number
* A list with three items: \[position of the document in the query's ranking, document number, cosine similarity value\]
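
The similarity step can be illustrated with a small, self-contained sketch. It assumes document vectors and the query vector share the same vocabulary order; the vocabulary, the weights and the helper `cosine` are hypothetical and stand in for whatever `searcher` does internally.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical vocabulary and document weight vectors (one column of the model each).
vocabulary = ["MUCU", "ABNORM", "LIPID"]
documents = {"139": [0.4, 0.0, 0.3], "151": [0.0, 0.0, 0.9]}

# Every word of the query gets weight 1; all other terms get 0.
query_tokens = ["MUCU", "ABNORM"]
query_vector = [1.0 if term in query_tokens else 0.0 for term in vocabulary]

scores = {doc: cosine(vector, query_vector) for doc, vector in documents.items()}
```

Because cosine similarity divides by both vector lengths, a long document does not score higher merely for containing more terms.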

In [10]:
model_file, queries_file, results_file = searcher.read_config_file("busca.cfg")

Now we read the files already generated by the previous modules.

In [12]:
model = searcher.get_model(model_file)
queries = searcher.get_queries(queries_file)
queries.head()

QueryNumber,QueryText
1,"[EFFECT, CALCIUM, PHYSIC, PROPERTI, MUCU, PATI..."
2,"[ONE, DISTINGUISH, EFFECT, MUCU, HYPERSECRET, ..."
3,"[SALIVARI, GLYCOPROTEIN, PATIENT, DIFFER, NORM..."
4,"[LIPID, COMPOSIT, RESPIRATORI, SECRET]"
5,"[MUCU, ABNORM]"


With these files in memory, we pass them to the `get_ranking` method to generate the `ranking` DataFrame.

In [13]:
ranking = searcher.get_ranking(model, queries)

Inside `get_ranking`, the model is joined with the queries, which end up in the last columns. The cell below illustrates what that operation produces.

In [14]:
searcher.insert_queries(model, queries).head()

Unnamed: 0,1,6,19,24,30,47,52,53,54,62,...,Q90,Q91,Q92,Q94,Q95,Q96,Q97,Q98,Q99,Q100
SIGNIFIC,0.126373,0.084249,0.068931,0.084249,0.151648,0.126373,0.252747,0.252747,0.37912,0.126373,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PSEUDOMONA,0.613622,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.511352,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AERUGINOSA,0.607672,0.81023,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.101279,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
INFECT,0.485316,0.323544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.727975,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
RESPIRATORI,0.53587,0.238164,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
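
In plain Python, a join of this kind could look like the sketch below: each query becomes an extra column named `Q<number>` holding 1.0 for terms that occur in the query and 0.0 otherwise. The function shown is a hypothetical stand-in rather than the body of `searcher.insert_queries`, and dropping query words that are missing from the model's vocabulary is an assumption.

```python
def insert_queries(model, queries):
    """Append one binary column per query to a copy of the model.

    model:   {term: {doc_id: weight}}
    queries: {query_number: [token, ...]}
    """
    combined = {term: dict(row) for term, row in model.items()}  # leave the input untouched
    for term, row in combined.items():
        for number, tokens in queries.items():
            row[f"Q{number}"] = 1.0 if term in tokens else 0.0
    return combined


# Hypothetical model weights and stemmed queries.
model = {
    "MUCU": {1: 0.5, 6: 0.0},
    "ABNORM": {1: 0.0, 6: 0.2},
    "LIPID": {1: 0.1, 6: 0.0},
}
queries = {4: ["LIPID", "COMPOSIT"], 5: ["MUCU", "ABNORM"]}
combined = insert_queries(model, queries)
```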


The finished ranking is shown next: each query is a column, the document numbers form the index, and each value is the cosine similarity between document i and query j.

In [15]:
ranking.head()

Unnamed: 0,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,...,Q90,Q91,Q92,Q94,Q95,Q96,Q97,Q98,Q99,Q100
1,0.017383,0.184724,0.017383,0.086635,0.0,0.012291,0.029738,0.070738,0.0,0.021289,...,0.154672,0.019179,0.024583,0.0,0.016093,0.015054,0.016093,0.023222,0.024583,0.019042
6,0.018739,0.123805,0.018739,0.049812,0.0,0.013251,0.016229,0.040671,0.0,0.022951,...,0.136055,0.0,0.026501,0.0,0.017349,0.016229,0.222084,0.0,0.026501,0.020528
19,0.029854,0.0,0.084833,0.0,0.0,0.02111,0.073468,0.041421,0.0,0.074193,...,0.025854,0.0,0.04222,0.0,0.036669,0.025854,0.02764,0.0,0.04222,0.047745
24,0.023859,0.0,0.051974,0.0,0.020955,0.06781,0.045011,0.014571,0.0,0.029221,...,0.020662,0.065809,0.033741,0.0,0.058058,0.020662,0.03329,0.024768,0.050851,0.026136
30,0.015868,0.0,0.023621,0.0,0.0,0.01122,0.03118,0.007753,0.0,0.019434,...,0.013742,0.0,0.022441,0.0,0.240507,0.031098,0.030955,0.0,0.022441,0.017382


Now the CSV file containing the results is created and then displayed. Documents whose cosine similarity with a given query is zero are not included in the results for that query.

In [16]:
searcher.get_results(results_file, ranking)

In [17]:
results_df = pd.read_csv(results_file, sep=";")
results_df

Unnamed: 0,QueryNumber,DocInfos
0,1,"[1, '437', 0.27588934462955456]"
1,1,"[2, '498', 0.20844373959940954]"
2,1,"[3, '484', 0.18440699663023122]"
3,1,"[4, '754', 0.17310807431022798]"
4,1,"[5, '957', 0.16879740146000236]"
...,...,...
75720,100,"[826, '565', 0.002573691145570111]"
75721,100,"[827, '955', 0.0025226037139205845]"
75722,100,"[828, '1003', 0.0020904265444869935]"
75723,100,"[829, '1120', 0.0017396581810780723]"
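
The rows above can be thought of as the output of a small ranking step per query: drop the zero scores, sort by similarity in descending order, and number the positions from 1. The sketch below shows that idea with a hypothetical `results_rows` helper and made-up scores; it is not the code of `searcher.get_results`. Since `DocInfos` comes back from `pd.read_csv` as text, the last lines also show one way to turn such a cell back into a list with `ast.literal_eval`.

```python
import ast


def results_rows(query_number, scores):
    """Build (QueryNumber, [position, doc_number, similarity]) rows for one query.

    scores: {doc_number: cosine_similarity}
    """
    # Keep only documents with a non-zero similarity, best match first.
    ranked = sorted(
        ((doc, score) for doc, score in scores.items() if score > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [
        (query_number, [position, doc, score])
        for position, (doc, score) in enumerate(ranked, start=1)
    ]


# Hypothetical scores for query 1.
rows = results_rows(1, {"498": 0.20, "12": 0.0, "437": 0.27})

# Parsing a DocInfos cell that was read back from the CSV as a string.
doc_info = ast.literal_eval("[1, '437', 0.27588934462955456]")
```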
