# Boolean search with an incidence matrix

The term-document incidence matrix indexes each document by the terms it contains. With it we can run Boolean searches (using the AND, OR, and NOT operators) that retrieve the documents satisfying a Boolean query.

Here we present a simple way to build a Boolean search model.

In [1]:
from glob import glob
from re import compile
from collections import defaultdict
from itertools import chain, combinations
import numpy as np
import pandas as pd

### Boolean operators

We define the Boolean operators as logical operations under the convention $True = 1$ and $False = 0$. NOT is a unary operator, while AND and OR are binary.

In [2]:
def AND(x,y):
    """
    Boolean AND: returns 1 only when both inputs are 1.
    """
    if x == 1 and y == 1:
        return 1
    else:
        return 0
    
def NOT(x):
    """
    Boolean NOT: maps 1 to 0 and 0 to 1.
    """
    if x == 1:
        return 0
    elif x == 0:
        return 1
    
def OR(x,y):
    """
    Boolean OR: returns 0 only when both inputs are 0.
    """
    if x == 0 and y == 0:
        return 0
    else:
        return 1
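
As a quick sanity check on the three operators, the sketch below enumerates every 0/1 input pair and confirms De Morgan's law. The operators are redefined compactly so the snippet runs on its own; the names `truth_table` and `de_morgan_holds` are illustrative and not part of the notebook.

```python
from itertools import product

# Compact restatements of the operators defined above
def AND(x, y):
    return 1 if x == 1 and y == 1 else 0

def OR(x, y):
    return 0 if x == 0 and y == 0 else 1

def NOT(x):
    return 1 if x == 0 else 0

# One row per input pair: (x, y, x AND y, x OR y, NOT x)
truth_table = [
    (x, y, AND(x, y), OR(x, y), NOT(x))
    for x, y in product((0, 1), repeat=2)
]
for row in truth_table:
    print(row)

# De Morgan's law: NOT(x AND y) equals (NOT x) OR (NOT y) for every input
de_morgan_holds = all(
    NOT(AND(x, y)) == OR(NOT(x), NOT(y))
    for x, y in product((0, 1), repeat=2)
)
print(de_morgan_holds)
```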

### Tokenization

To work with a collection of documents, we must first obtain its terms by tokenizing the documents (and possibly applying stemming or lemmatization). Here we define a very simple tokenization function that:

* Lowercases the text and removes a fixed set of punctuation characters.
* Splits the result into tokens on whitespace.

In [3]:
#Raw string, so the character class needs no escaping
regex = compile(r'[-_{}(),;:"#/.¡!¿?·]')
def tokenize(text):
    """
    Tokenization function.
    
    Arguments
    ---------
    text : str
        Text string to tokenize
        
    Returns
    -------
    tokens : list
        List of tokens
    """
    #Convert to lowercase
    lower_text = text.strip().lower()
    #Remove the punctuation characters matched by regex
    alphanumeric = regex.sub('', lower_text)
    #Split into tokens on whitespace
    tokens = alphanumeric.split()
    
    return tokens
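
To see what this tokenizer does in practice, here is a small self-contained sketch applied to a made-up sentence (the sample text is an assumption, not taken from the collection). Because punctuation is deleted rather than replaced by a space, hyphenated words end up merged into a single token.

```python
from re import compile

# Same character class as the tokenizer above, written as a raw string
regex = compile(r'[-_{}(),;:"#/.¡!¿?·]')

def tokenize(text):
    """Lowercase, delete the listed punctuation, split on whitespace."""
    return regex.sub('', text.strip().lower()).split()

# Hypothetical sample sentence
sample = "¡Hola, Mundo! La física-cuántica (1900) es... ¿difícil?"
tokens = tokenize(sample)
print(tokens)
# ['hola', 'mundo', 'la', 'físicacuántica', '1900', 'es', 'difícil']
```

If merged compounds are undesirable, one option would be to substitute a space instead of the empty string, at the cost of splitting tokens such as "co-author" into two terms.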

## Boolean search model

We now build a class for a Boolean search model from a collection of documents. The model stores the terms and documents of the collection. We focus on the following functions:

* Incidence-matrix construction (build_incidence_matrix()): builds the incidence matrix that indexes the documents by their terms. It relies on a document-reading function (get_documents()).
* Term representation (vectorize()): retrieves the Boolean vector of a term from the incidence matrix.
* Boolean search functions (searchAND(), searchOR(), and searchNOT()): retrieve the documents that satisfy Boolean queries over terms.
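
To make the structure concrete, the following sketch builds an incidence matrix for a tiny, made-up collection held in memory, using plain Python lists. The names `toy_docs`, `toy_terms`, and `toy_matrix` are hypothetical, and whitespace splitting stands in for the tokenizer.

```python
# Hypothetical toy collection: document name -> text
toy_docs = {
    "d0": "el campo eléctrico",
    "d1": "los campos y el campo",
    "d2": "un perro",
}

# Each document is reduced to its set of terms
toy_collection = {name: set(text.split()) for name, text in toy_docs.items()}
toy_names = list(toy_collection)
toy_terms = sorted(set().union(*toy_collection.values()))

# Rows are terms, columns are documents; 1 means "term occurs in document"
toy_matrix = [
    [1 if term in toy_collection[name] else 0 for name in toy_names]
    for term in toy_terms
]

# The Boolean vector of a term is its row in the matrix
campo_row = toy_matrix[toy_terms.index("campo")]
campos_row = toy_matrix[toy_terms.index("campos")]
print(campo_row)   # [1, 1, 0]
print(campos_row)  # [0, 1, 0]
```

Reading across a row tells us which documents contain a term; a Boolean query then reduces to element-wise operations on these rows.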

In [4]:
class BooleanRetrieval(object):
    """
    Boolean retrieval model over a collection of documents.
    
    docs_idx : dict
        Dictionary mapping each document index to its file name
    terms : list
        List of terms
    documents : list
        List of documents
    collection : dict
        Dictionary mapping each document index to its list of tokens
    incidence_matrix : array
        Term-document incidence matrix
    """
    def __init__(self):
        self.docs_idx = {}
        self.terms = []
        self.documents = []
        self.collection = {}
        self.incidence_matrix = None 
    
    def get_documents(self,directory):
        """
        Builds the document collection from a directory of files.
        
        Arguments
        ---------
        directory : str
            Directory where the documents are stored; it must end with
            the path separator, e.g. 'wikipedia/'
        """
        #Start the document index
        idx = 0
        for filename in glob(directory+'*'):
            #Read each file in the directory and close it afterwards
            with open(filename,'r') as f:
                text = f.read()
            #Tokenize the document
            tokenized_text = tokenize(text)
            #Store the index -> file name mapping
            self.docs_idx[idx] = filename
            #Add the document's sorted unique tokens to the collection
            self.collection[idx] = sorted(set(tokenized_text))
            #Advance the index
            idx += 1
            
        #Build the list of terms
        self.terms = list(set(chain(*self.collection.values())))
        #Build the list of documents
        self.documents = list(self.docs_idx.values())
        
    def build_incidence_matrix(self, directory):
        """
        Builds the incidence matrix and stores it in self.incidence_matrix.
        
        Arguments
        ---------
        directory : str
            Directory where the documents are stored
        """
        #Build the collection from the directory
        self.get_documents(directory)
        #Initialize the incidence matrix with 0s
        self.incidence_matrix = np.zeros((len(self.terms),len(self.collection)))
        
        #Iterate over terms (rows) and documents (columns)
        for i,t in enumerate(self.terms):
            for j,(d, terms_doc) in enumerate(self.collection.items()):
                #If the term occurs in the document, set the entry to 1
                if t in terms_doc:
                    self.incidence_matrix[i,j] = 1

    def vectorize(self, term):
        """
        Returns the Boolean vector of a term.
        
        Arguments
        ---------
        term : str
            Term to represent
        
        Returns
        -------
        vector : array
            Boolean vector taken from the term's row of the incidence matrix.
        """
        #Get the index of the term
        term_idx = self.terms.index(term)
        #Retrieve the matrix row at that index
        vector = self.incidence_matrix[term_idx]
        
        return vector
    
    def searchAND(self,term1,term2):
        """
        Applies the AND operator to the Boolean vectors of two terms.
        
        Arguments
        ---------
        term1, term2 : str
            Terms to combine.
        
        Yields
        ------
            Documents that satisfy the AND query over the two terms
        """
        #Get the Boolean vectors
        vector1 = self.vectorize(term1)
        vector2 = self.vectorize(term2)
        
        #Resulting vector
        ANDVector = []
        for i in range(len(self.documents)):
            #Apply the AND operator
            ANDVector.append(AND(vector1[i], vector2[i]))
        
        #Yield the documents that match
        #the query
        for d,bit in enumerate(ANDVector):
            if bit == 1:
                yield self.documents[d]
    
    def searchOR(self,term1,term2):
        """
        Applies the OR operator to the Boolean vectors of two terms.
        
        Arguments
        ---------
        term1, term2 : str
            Terms to combine.
        
        Yields
        ------
            Documents that satisfy the OR query over the two terms
        """
        #Get the Boolean vectors of the terms
        vector1 = self.vectorize(term1)
        vector2 = self.vectorize(term2)
        
        #Resulting vector
        ORVector = []
        for i in range(len(self.documents)):
            #Apply the OR operator
            ORVector.append(OR(vector1[i], vector2[i]))
        
        #Yield the documents that match
        #the query
        for d,bit in enumerate(ORVector):
            if bit == 1:
                yield self.documents[d]
    
    def searchNOT(self,term):
        """
        Applies the NOT operator to the Boolean vector of a term.
        
        Arguments
        ---------
        term : str
            Term to negate.
        
        Yields
        ------
            Documents that satisfy the NOT query for the term
        """
        #Get the Boolean vector
        vector = self.vectorize(term)
        
        #Resulting vector
        NOTVector = []
        for i in range(len(self.documents)):
            #Apply the NOT operator
            NOTVector.append(NOT(vector[i]))
        
        #Yield the documents that match
        #the query
        for d,bit in enumerate(NOTVector):
            if bit == 1:
                yield self.documents[d]
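
The search methods above loop over documents one bit at a time. As an alternative sketch (not the notebook's implementation), the same retrieval can be expressed with NumPy's element-wise logical functions. The vectors `u` and `v`, the document names, and the helper `retrieve` are all hypothetical.

```python
import numpy as np

# Hypothetical 0/1 rows of an incidence matrix for two terms over four documents
documents = ["a.txt", "b.txt", "c.txt", "d.txt"]
u = np.array([1.0, 1.0, 0.0, 0.0])
v = np.array([1.0, 0.0, 1.0, 0.0])

def retrieve(mask, documents):
    """Return the documents at the positions where the Boolean mask is True."""
    return [documents[i] for i in np.flatnonzero(mask)]

and_docs = retrieve(np.logical_and(u, v), documents)
or_docs = retrieve(np.logical_or(u, v), documents)
not_docs = retrieve(np.logical_not(u), documents)

print(and_docs)  # ['a.txt']
print(or_docs)   # ['a.txt', 'b.txt', 'c.txt']
print(not_docs)  # ['c.txt', 'd.txt']
```

For a large collection this avoids a Python-level loop per document, although the explicit version in the class is arguably easier to follow when first learning the model.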

We can now test the Boolean search model by building the incidence matrix, along with the lists of terms and documents.

In [5]:
#Directory of documents
directory = 'wikipedia/'

#Create the Boolean search model
model = BooleanRetrieval()
#Build the incidence matrix
model.build_incidence_matrix(directory)

print(model.incidence_matrix)
print("Terms: {}\nDocuments: {}".format(model.terms[:10], model.documents[:5]))

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 1.]]
Terms: ['inmigran', 'glándula', 'engañosos', 'cambió', 'autónomo', 'diferenciarse', 'lakatos', 'nanotube', 'amigo', 'tocaba']
Documents: ['wikipedia/cuantica (2).txt', 'wikipedia/fractal (1).txt', 'wikipedia/bioinfo (1).txt', 'wikipedia/ifai (4).txt', 'wikipedia/taylor (1).txt']


We will use pandas to display the incidence matrix more clearly and to see the relationships between terms and documents.

In [6]:
#Display the incidence matrix with pandas
IncidenceMatrix = pd.DataFrame(data=model.incidence_matrix,index=model.terms,columns=model.documents)
IncidenceMatrix

Unnamed: 0,wikipedia/cuantica (2).txt,wikipedia/fractal (1).txt,wikipedia/bioinfo (1).txt,wikipedia/ifai (4).txt,wikipedia/taylor (1).txt,wikipedia/ciencia_historia (1).txt,wikipedia/ech (1).txt,wikipedia/religion (1).txt,wikipedia/perro (1).txt,wikipedia/cine (2).txt,...,wikipedia/cuantica (3).txt,wikipedia/economia (2).txt,wikipedia/bioinfo (2).txt,wikipedia/ciencia_historia (3).txt,wikipedia/acustica (3).txt,wikipedia/sushi (2).txt,wikipedia/coca (2).txt,wikipedia/celula (7).txt,wikipedia/twitter (2).txt,wikipedia/condensador (3).txt
inmigran,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
glándula,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
engañosos,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
cambió,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
autónomo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
co,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
o26h11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
av,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
deberá,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


pandas also lets us look up the rows that represent specific terms with .loc[].

In [7]:
#Look up terms and their representations
IncidenceMatrix.loc[['campo','campos']]

Unnamed: 0,wikipedia/cuantica (2).txt,wikipedia/fractal (1).txt,wikipedia/bioinfo (1).txt,wikipedia/ifai (4).txt,wikipedia/taylor (1).txt,wikipedia/ciencia_historia (1).txt,wikipedia/ech (1).txt,wikipedia/religion (1).txt,wikipedia/perro (1).txt,wikipedia/cine (2).txt,...,wikipedia/cuantica (3).txt,wikipedia/economia (2).txt,wikipedia/bioinfo (2).txt,wikipedia/ciencia_historia (3).txt,wikipedia/acustica (3).txt,wikipedia/sushi (2).txt,wikipedia/coca (2).txt,wikipedia/celula (7).txt,wikipedia/twitter (2).txt,wikipedia/condensador (3).txt
campo,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
campos,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
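
Once the matrix lives in a DataFrame, a Boolean query can also be expressed directly on the rows returned by `.loc[]`. The sketch below uses a small made-up DataFrame (`toy`, with invented document names) rather than the notebook's `IncidenceMatrix`.

```python
import pandas as pd

# Hypothetical term-document matrix: rows are terms, columns are documents
toy = pd.DataFrame(
    [[1, 1, 0], [0, 1, 0], [0, 0, 1]],
    index=["campo", "campos", "perro"],
    columns=["d0.txt", "d1.txt", "d2.txt"],
)

# Element-wise AND of two term rows gives a Boolean mask over the columns
mask = (toy.loc["campo"] == 1) & (toy.loc["campos"] == 1)

# Keep only the column labels (documents) where the mask is True
matching = list(toy.columns[mask.to_numpy()])
print(matching)  # ['d1.txt']
```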


#### Retrieving documents with Boolean search

Finally, we can apply the different operators to retrieve the documents that satisfy the Boolean query we pose.

In [8]:
#Retrieve documents using Boolean operators
for result in model.searchAND('campo','campos'):
    print(result)

wikipedia/bioinfo (1).txt
wikipedia/ciencia_historia (1).txt
wikipedia/mate (2).txt


In [9]:
#Using the NOT operator
for result in model.searchNOT('un'):
    print(result)

wikipedia/religion (2).txt
wikipedia/sociedad (4).txt
wikipedia/salmonella (2).txt
wikipedia/pylori (1).txt


This kind of model supports more complex searches that combine several Boolean operators. For example, AND operations can be chained to retrieve documents that contain more than two terms, or different Boolean operators can be combined.

In [12]:
#Vectorize two terms
u = model.vectorize('campo')
v = model.vectorize('campos')

#Exclusive OR operator
def XOR(x, y):
    return AND(NOT(AND(x,y)), OR(x,y))

#Compound query
result = [XOR(u[i],v[i]) for i in range(len(u))]

#Retrieve the documents that match the query
for d,bit in enumerate(result):
    if bit == 1:
        print(model.documents[d])

wikipedia/cuantica (2).txt
wikipedia/mate (1).txt
wikipedia/sociedad (2).txt
wikipedia/condensador (2).txt
wikipedia/politica (1).txt
wikipedia/capoeira (2).txt
wikipedia/cuantica (1).txt
wikipedia/condensador (4).txt
wikipedia/ifai (2).txt
wikipedia/sociedad (1).txt
wikipedia/economia (1).txt
wikipedia/cuantica (3).txt
wikipedia/economia (2).txt
wikipedia/bioinfo (2).txt
wikipedia/condensador (3).txt
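
The chained AND mentioned earlier can be sketched with `functools.reduce`, which folds the AND operation over any number of term vectors. The vectors, document names, and the helpers `and_vectors` and `search_all` below are hypothetical and only illustrate the idea.

```python
from functools import reduce

# Hypothetical Boolean vectors for three terms over five documents
documents = ["d0", "d1", "d2", "d3", "d4"]
vectors = {
    "campo":    [1, 1, 0, 1, 0],
    "campos":   [1, 0, 0, 1, 1],
    "cuántica": [1, 1, 1, 0, 0],
}

def AND(x, y):
    return 1 if x == 1 and y == 1 else 0

def and_vectors(u, v):
    """Element-wise AND of two equal-length 0/1 vectors."""
    return [AND(a, b) for a, b in zip(u, v)]

def search_all(terms):
    """Documents that contain every term in `terms` (a chained AND)."""
    combined = reduce(and_vectors, (vectors[t] for t in terms))
    return [documents[i] for i, bit in enumerate(combined) if bit == 1]

print(search_all(["campo", "campos"]))              # ['d0', 'd3']
print(search_all(["campo", "campos", "cuántica"]))  # ['d0']
```

Each additional term can only shrink the result set, which is why long AND queries over a small collection often return nothing.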
