In [3]:
import processing_queries as pq
import generate_inverted_list as gil
import indexer
import searcher

## 1) Processing Queries

First we read the configuration file `pc.cfg`. We then take the queries file named in that configuration and extract the root of the XML document. With that root, we call two methods that generate CSV files: the first writes a file of queries, and the second writes a file with the expected results for those queries. Both files are saved in the `results` directory.

In [4]:
read, queries, expected = pq.read_config("pc.cfg")
xml_root = pq.get_xml_root(read)
pq.get_queries_file(queries, xml_root)
pq.get_expected_file(expected, xml_root)
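
As a rough illustration of the extraction step described above, the sketch below parses a small XML string and writes the queries as `;`-separated rows. It is not the code of `processing_queries`: the tag names (`FILEQUERY`, `QUERY`, `QueryNumber`, `QueryText`), the sample data, and the helper `queries_to_csv` are assumptions made for the example, and the text normalization (collapsing whitespace, upper-casing) is only a guess based on the output shown further down.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the real tag names in the queries file may differ.
SAMPLE_XML = """
<FILEQUERY>
  <QUERY>
    <QueryNumber>00001</QueryNumber>
    <QueryText>Is CF mucus abnormal?</QueryText>
  </QUERY>
  <QUERY>
    <QueryNumber>00002</QueryNumber>
    <QueryText>What is the lipid composition
      of CF respiratory secretions?</QueryText>
  </QUERY>
</FILEQUERY>
""".strip()


def queries_to_csv(xml_root, out):
    """Write one 'QueryNumber;QueryText' row per <QUERY> element (hypothetical helper)."""
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    writer.writerow(["QueryNumber", "QueryText"])
    for query in xml_root.iter("QUERY"):
        number = int(query.findtext("QueryNumber"))
        # Collapse line breaks and repeated spaces, then upper-case the text.
        text = " ".join(query.findtext("QueryText").split()).upper()
        writer.writerow([number, text])


sample_root = ET.fromstring(SAMPLE_XML)
buffer = io.StringIO()  # stands in for the output file
queries_to_csv(sample_root, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example independent of the `results` directory; passing an open file object instead would produce a file on disk.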

Now let's look at how these two files are organized.

In [7]:
import pandas as pd

queries_df = pd.read_csv(queries, sep=";")
expected_df = pd.read_csv(expected, sep=";")

In [9]:
queries_df.head()

Unnamed: 0,QueryNumber,QueryText
0,1,WHAT ARE THE EFFECTS OF CALCIUM ON THE PHYSICA...
1,2,CAN ONE DISTINGUISH BETWEEN THE EFFECTS OF MUC...
2,3,HOW ARE SALIVARY GLYCOPROTEINS FROM CF PATIENT...
3,4,WHAT IS THE LIPID COMPOSITION OF CF RESPIRATOR...
4,5,IS CF MUCUS ABNORMAL?


In [10]:
expected_df.head()

Unnamed: 0,QueryNumber,DocNumber,DocVotes
0,1,139,4
1,1,151,4
2,1,166,1
3,1,311,1
4,1,370,2


## 2) Generating the Inverted List

We start by reading the configuration file `gli.cfg` to obtain the XML files that will be read to build our document collection, and the file to which the inverted list will be written.

In [11]:
read_files, write_file = gil.read_config_file("gli.cfg")
gil.get_tokens_file(read_files, write_file)

The inverted-list file is organized as follows: for each token, it stores the list of documents in which that token appears. If a token appears X times in a document, that document's number is repeated X times in the token's list.

In [13]:
inverted_list = pd.read_csv(write_file, sep=";")
inverted_list.head()

Unnamed: 0,Token,Appearance
0,SIGNIFIC,"[1, 6, 19, 24, 30, 47, 52, 53, 54, 62, 62, 65,..."
1,PSEUDOMONA,"[1, 1, 1, 7, 8, 18, 18, 61, 61, 62, 62, 62, 62..."
2,AERUGINOSA,"[1, 1, 1, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, ..."
3,INFECT,"[1, 1, 1, 6, 6, 6, 16, 18, 48, 48, 57, 58, 58,..."
4,RESPIRATORI,"[1, 1, 1, 6, 6, 7, 7, 8, 11, 11, 11, 15, 17, 2..."
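
The layout described above, where a document number is repeated once per occurrence of the token, can be sketched on a toy collection. The documents, their tokens, and the helper `build_inverted_list` are hypothetical; how `generate_inverted_list` actually tokenizes and stems the text is not shown here.

```python
from collections import defaultdict

# Hypothetical, already-tokenized documents keyed by document number.
toy_docs = {
    1: ["PSEUDOMONA", "AERUGINOSA", "INFECT", "PSEUDOMONA"],
    6: ["AERUGINOSA", "INFECT"],
    7: ["PSEUDOMONA"],
}


def build_inverted_list(docs):
    """Map each token to the documents it occurs in, one entry per occurrence."""
    inverted = defaultdict(list)
    for doc_number in sorted(docs):
        for token in docs[doc_number]:
            inverted[token].append(doc_number)
    return dict(inverted)


toy_inverted = build_inverted_list(toy_docs)
print(toy_inverted["PSEUDOMONA"])  # [1, 1, 7]
```

Keeping the duplicates means the term frequency of a token in a document can later be recovered simply by counting how often the document number occurs in the token's list.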


## 3) Indexer

The indexer's configuration file names a CSV file to read, which is the inverted list generated by the previous module, and a file to which the model will be written. The model is a matrix holding the weight of each term in each document, computed with standard tf-idf. Normalized tf can be used instead by setting the variable `type_tf` to `tfn`.

First we need to build a term-document matrix from the inverted list generated by the previous module. From that matrix, the model with the weights can then be generated. The term-document matrix and the model are shown in turn.

In [14]:
tokens, model = indexer.read_config_file("index.cfg")
matrix = indexer.get_term_document_matrix(tokens)
matrix

Token,1,6,19,24,30,47,52,53,54,62,...,909,330,537,580,799,558,908,117,1011,940
SIGNIFIC,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PSEUDOMONA,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AERUGINOSA,3.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
INFECT,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
RESPIRATORI,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
THROMBOSI,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MONOSPECIF,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CONSENT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PATCHI,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
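
A minimal sketch of how such a matrix can be derived from the inverted list, assuming the `Appearance` column comes back from the CSV as strings such as `"[1, 1, 7]"`. The rows and the helper `term_document_counts` are hypothetical, and plain Python containers are used instead of a DataFrame to keep the example self-contained; `indexer.get_term_document_matrix` may work differently.

```python
import ast
from collections import Counter

# Hypothetical (Token, Appearance) rows; each list is still a string,
# as it would be right after reading the CSV file.
toy_rows = [
    ("PSEUDOMONA", "[1, 1, 7]"),
    ("AERUGINOSA", "[1, 6]"),
]


def term_document_counts(rows):
    """Return (doc_numbers, {token: [count per document]})."""
    # literal_eval turns the string "[1, 1, 7]" back into a list of ints.
    counts = {
        token: Counter(ast.literal_eval(appearance))
        for token, appearance in rows
    }
    doc_numbers = sorted({doc for counter in counts.values() for doc in counter})
    # A Counter returns 0 for documents in which the token never occurs.
    matrix = {
        token: [float(counter[doc]) for doc in doc_numbers]
        for token, counter in counts.items()
    }
    return doc_numbers, matrix


toy_doc_numbers, toy_matrix = term_document_counts(toy_rows)
print(toy_doc_numbers)           # [1, 6, 7]
print(toy_matrix["PSEUDOMONA"])  # [2.0, 0.0, 1.0]
```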


In [None]:
type_tf = "tfn"
model = indexer.get_model(matrix, type_tf)
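
To make the difference between the two `type_tf` settings concrete, here is a sketch of one common tf-idf formulation. The exact formulas used by `indexer.get_model` are not shown in this notebook, so the following are assumptions: idf is taken as `log(N / n_i)` with the natural logarithm, and the normalized variant divides each raw count by the largest count in the same document. The function `tf_idf` and the toy data are hypothetical.

```python
import math


def tf_idf(matrix, type_tf="tf"):
    """matrix: {token: [raw count per document]}; returns weights of the same shape."""
    n_docs = len(next(iter(matrix.values())))
    # Largest raw count in each document, used only by the normalized variant.
    max_tf = [max(row[j] for row in matrix.values()) for j in range(n_docs)]
    weights = {}
    for token, row in matrix.items():
        doc_freq = sum(1 for count in row if count > 0)
        # Assumed idf: log(N / n_i); a token present in every document gets 0.
        idf = math.log(n_docs / doc_freq) if doc_freq else 0.0
        weights[token] = []
        for j, count in enumerate(row):
            tf = count / max_tf[j] if type_tf == "tfn" and max_tf[j] else count
            weights[token].append(tf * idf)
    return weights


toy_counts = {
    "PSEUDOMONA": [2.0, 0.0, 1.0],
    "AERUGINOSA": [1.0, 1.0, 0.0],
    "INFECT": [1.0, 1.0, 1.0],
}
raw_weights = tf_idf(toy_counts)                  # raw term frequency
norm_weights = tf_idf(toy_counts, type_tf="tfn")  # normalized term frequency
print(raw_weights["PSEUDOMONA"])
print(norm_weights["PSEUDOMONA"])
```

In this toy data `INFECT` occurs in every document, so its idf, and therefore every one of its weights, is zero under both settings.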