<a href="https://colab.research.google.com/github/gibranfp/CursoDatosMasivosI/blob/main/notebooks/3b_indice_inverso.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document search with an inverted index
In this notebook we build a document search engine using an inverted index.

## Document search by words
First we look at how to search for documents by the words they contain.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

We download the _20 newsgroups_ dataset, removing headers, footers and quoted text from each post.

In [2]:
db = fetch_20newsgroups(remove=('headers','footers','quotes'))

Let's see what a document looks like.

In [3]:
print(db.data[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


We import the NLTK library and define our tokenizer and lemmatizer.

In [4]:
import nltk
nltk.download(['punkt','averaged_perceptron_tagger','wordnet'])

from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.corpus.reader.wordnet import NOUN, VERB, ADV, ADJ

morphy_tag = {
    'JJ' : ADJ,
    'JJR' : ADJ,
    'JJS' : ADJ,
    'VB' : VERB,
    'VBD' : VERB,
    'VBG' : VERB,
    'VBN' : VERB,
    'VBP' : VERB,
    'VBZ' : VERB,
    'RB' : ADV,
    'RBR' : ADV,
    'RBS' : ADV
}

def doc_a_tokens(doc):
  tagged = pos_tag(word_tokenize(doc.lower()))
  lemmatizer = WordNetLemmatizer()
  tokens = []
  for p,t in tagged:
    tokens.append(lemmatizer.lemmatize(p, pos=morphy_tag.get(t, NOUN)))

  return tokens

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
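
The `morphy_tag` dictionary above lists every Penn Treebank tag explicitly and falls back to `NOUN` for anything else. As a minimal sketch of the same idea, the mapping can also be written by tag prefix. The function name `penn_to_wordnet` is hypothetical, and the single-letter codes are the values of NLTK's WordNet constants, written out here so the example runs without NLTK.

```python
# WordNet part-of-speech codes (same values as the NLTK WordNet constants).
ADJ, VERB, ADV, NOUN = 'a', 'v', 'r', 'n'

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to NOUN."""
    for prefix, pos in (('JJ', ADJ), ('VB', VERB), ('RB', ADV)):
        if tag.startswith(prefix):
            return pos
    return NOUN

tags = ['VBD', 'JJR', 'RBS', 'NNS', 'DT']
mapped = [penn_to_wordnet(t) for t in tags]
print(mapped)
```

Tags that are neither adjectives, verbs nor adverbs (for example `NNS` or `DT`) are lemmatized as nouns, which matches the `morphy_tag.get(t, NOUN)` fallback in `doc_a_tokens`.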


We store the preprocessed collection as a list of strings, one per document.

In [5]:
corpus = []
for d in db.data:
  d = d.replace('\n',' ').replace('\r',' ').replace('\t',' ')
  tokens = doc_a_tokens(d)
  corpus.append(' '.join(tokens))

We obtain the bags of words of the preprocessed documents with scikit-learn's `CountVectorizer` class.

In [6]:
v = CountVectorizer(stop_words='english', max_features=5000, max_df=0.8)
bolsas = v.fit_transform(corpus)
print('Componentes de primer documento: {0}'.format(corpus[0]))
print('Bolsa de primer documento: [\n{0}]'.format(bolsas[0]))

Componentes de primer documento: i be wonder if anyone out there could enlighten me on this car i saw the other day . it be a 2-door sport car , look to be from the late 60s/ early 70 . it be call a bricklin . the door be really small . in addition , the front bumper be separate from the rest of the body . this be all i know . if anyone can tellme a model name , engine spec , year of production , where this car be make , history , or whatever info you have on this funky look car , please e-mail .
Bolsa de primer documento: [
  (0, 4904)	1
  (0, 983)	4
  (0, 3987)	1
  (0, 1407)	1
  (0, 1603)	2
  (0, 4253)	1
  (0, 2776)	2
  (0, 2666)	1
  (0, 1649)	1
  (0, 307)	1
  (0, 3742)	1
  (0, 4181)	1
  (0, 440)	1
  (0, 4063)	1
  (0, 3851)	1
  (0, 848)	1
  (0, 2638)	1
  (0, 2998)	1
  (0, 1725)	1
  (0, 4230)	1
  (0, 4972)	1
  (0, 3569)	1
  (0, 2835)	1
  (0, 2268)	1
  (0, 2417)	1
  (0, 2825)	1]
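
Each printed entry above is a `(row, column)` pair followed by a count: the row is the document, the column is the index of a word in the vocabulary, and the value is how many times that word occurs. The following is a minimal pure-Python sketch of that representation on a toy collection. The names `docs`, `vocab` and `bag` are hypothetical, and the sketch ignores stop words, `max_features` and `max_df`.

```python
from collections import Counter

docs = ['car engine car', 'engine model year', 'space mission']

# Vocabulary: each distinct word gets a column index (alphabetical order).
vocab = {w: j for j, w in enumerate(sorted({w for d in docs for w in d.split()}))}

def bag(doc):
    """Return a sorted list of (column index, count) pairs for one document."""
    counts = Counter(doc.split())
    return sorted((vocab[w], c) for w, c in counts.items() if w in vocab)

bags = [bag(d) for d in docs]
print(vocab)
print(bags[0])
```

Only the nonzero counts are stored, which is what makes the sparse representation compact when the vocabulary has thousands of words.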


We define the inverted index class with a method that retrieves the documents containing the words in a list.

In [7]:
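class IndiceInverso:
  def __getitem__(self, idx):
    # Posting list (document ids) of term idx
    return self.ifs[idx]

  def recupera(self, l):
    # l is a list of (term, count) pairs; for each document we count
    # how many of those terms it contains
    docs = Counter()
    for (i,_) in l:
      docs.update(self.ifs[i])

    return docs

  def from_csr(self, csr):
    # One posting list per term (column); the rows of csr are documents
    self.ifs = [[] for _ in range(csr.shape[1])]
    coo = csr.tocoo()
    for i,j in zip(coo.row, coo.col):
      self.ifs[j].append(i)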

We instantiate our `IndiceInverso` class and build the structure from our bags of words.

In [8]:
ifs = IndiceInverso()
ifs.from_csr(bolsas)
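
To make the structure concrete, here is a minimal sketch of the same idea on a toy collection, without SciPy. The names `bags`, `postings` and `retrieve` are hypothetical and only illustrate how posting lists are built and queried.

```python
from collections import Counter

# Toy bags of words: one list of (term index, count) pairs per document.
bags = [
    [(0, 2), (1, 1)],          # document 0
    [(1, 1), (3, 1), (5, 1)],  # document 1
    [(2, 1), (4, 1)],          # document 2
]
n_terms = 6

# Build the posting lists: postings[j] holds the documents that contain term j.
postings = [[] for _ in range(n_terms)]
for doc_id, bag_ in enumerate(bags):
    for term, _count in bag_:
        postings[term].append(doc_id)

def retrieve(query_terms):
    """Count, for each document, how many query terms it contains."""
    hits = Counter()
    for term in query_terms:
        hits.update(postings[term])
    return hits

print(postings)
print(retrieve([1, 3]).most_common())
```

Only the posting lists of the query terms are visited, so documents that share no word with the query are never touched.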

We define a function that converts CSR sparse arrays to lists of lists.

In [9]:
def csr_to_ldb(csr):
  ldb = [[] for _ in range(csr.shape[0])]
  coo = csr.tocoo()
  for i,j,v in zip(coo.row, coo.col, coo.data):
    ldb[i].append((j, v))

  return ldb

We generate some queries and compute their bags of words.

In [10]:
consultas = []
for c in ['nasa space mission satellite','government crime enforcement security']:
  tokens = doc_a_tokens(c)
  consultas.append(' '.join(tokens))

bc = v.transform(consultas)
cl = csr_to_ldb(bc)

We use the inverted index to retrieve the documents that contain the words of the first query, sorted by number of matching words, and display the top retrieved document.

In [11]:
recs = ifs.recupera(cl[0])
top = recs.most_common()[0]
print(recs.most_common())
print(db.data[top[0]])

[(59, 4), (153, 4), (545, 4), (1830, 4), (2800, 4), (3285, 4), (3564, 4), (3864, 4), (4425, 4), (5356, 4), (6197, 4), (6719, 4), (7554, 4), (8525, 4), (9096, 4), (9154, 4), (9986, 4), (10855, 4), (11198, 4), (953, 3), (1071, 3), (1459, 3), (3044, 3), (3665, 3), (4443, 3), (5125, 3), (5207, 3), (5880, 3), (6148, 3), (6502, 3), (6518, 3), (9067, 3), (9868, 3), (10734, 3), (432, 3), (4166, 3), (5877, 3), (6572, 3), (8569, 3), (9232, 3), (10295, 3), (13, 2), (988, 2), (1176, 2), (2142, 2), (3137, 2), (3296, 2), (3727, 2), (3932, 2), (4276, 2), (4706, 2), (5071, 2), (6236, 2), (7139, 2), (7234, 2), (7945, 2), (8637, 2), (9758, 2), (10422, 2), (533, 2), (799, 2), (812, 2), (1571, 2), (1691, 2), (1761, 2), (2061, 2), (2453, 2), (2624, 2), (2837, 2), (2912, 2), (2950, 2), (2992, 2), (3097, 2), (3272, 2), (3295, 2), (3990, 2), (4088, 2), (4307, 2), (4312, 2), (4614, 2), (4625, 2), (4773, 2), (5376, 2), (5729, 2), (6151, 2), (6256, 2), (6869, 2), (6964, 2), (7005, 2), (7081, 2), (7448, 2), (7465

We repeat the process for the second query.

In [12]:
recs = ifs.recupera(cl[1])
top = recs.most_common()[0]
print(recs.most_common())
print(db.data[top[0]])

[(2350, 4), (4498, 4), (4682, 4), (5612, 4), (6635, 4), (8445, 4), (8534, 4), (9007, 4), (9396, 4), (10575, 4), (591, 3), (2077, 3), (3514, 3), (4300, 3), (4499, 3), (4610, 3), (5258, 3), (6715, 3), (7240, 3), (7333, 3), (7344, 3), (7367, 3), (8114, 3), (9181, 3), (10054, 3), (10328, 3), (10433, 3), (11199, 3), (1182, 3), (1379, 3), (1660, 3), (2175, 3), (2352, 3), (2569, 3), (5503, 3), (5844, 3), (5898, 3), (6935, 3), (9115, 3), (9365, 3), (70, 2), (133, 2), (325, 2), (378, 2), (658, 2), (760, 2), (786, 2), (981, 2), (1037, 2), (1188, 2), (1726, 2), (1756, 2), (2039, 2), (2068, 2), (2200, 2), (2227, 2), (2272, 2), (2741, 2), (3121, 2), (3141, 2), (3275, 2), (3301, 2), (3473, 2), (3495, 2), (3511, 2), (3545, 2), (3694, 2), (4125, 2), (4192, 2), (4253, 2), (4376, 2), (4620, 2), (4811, 2), (4869, 2), (5338, 2), (5744, 2), (5789, 2), (5829, 2), (5891, 2), (6543, 2), (6783, 2), (6911, 2), (7150, 2), (7151, 2), (7441, 2), (7631, 2), (7850, 2), (7900, 2), (7910, 2), (8031, 2), (8235, 2), (83

## Searching for similar documents
Now we search for documents that are similar to a query document.

First we take one document to serve as the query and display it.

In [13]:
dc = db.data[0]
print(dc)

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


We obtain its bag of words.

In [14]:
tokens = doc_a_tokens(dc)
bolsa_dc = v.transform([' '.join(tokens)])
print('Componentes para consulta: {0}'.format(tokens))
print('Bolsa para consulta: [\n{0}]'.format(bolsa_dc))

Componentes para consulta: ['i', 'be', 'wonder', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'i', 'saw', 'the', 'other', 'day', '.', 'it', 'be', 'a', '2-door', 'sport', 'car', ',', 'look', 'to', 'be', 'from', 'the', 'late', '60s/', 'early', '70', '.', 'it', 'be', 'call', 'a', 'bricklin', '.', 'the', 'door', 'be', 'really', 'small', '.', 'in', 'addition', ',', 'the', 'front', 'bumper', 'be', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', '.', 'this', 'be', 'all', 'i', 'know', '.', 'if', 'anyone', 'can', 'tellme', 'a', 'model', 'name', ',', 'engine', 'spec', ',', 'year', 'of', 'production', ',', 'where', 'this', 'car', 'be', 'make', ',', 'history', ',', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'look', 'car', ',', 'please', 'e-mail', '.']
Bolsa para consulta: [
  (0, 307)	1
  (0, 440)	1
  (0, 848)	1
  (0, 983)	4
  (0, 1407)	1
  (0, 1603)	2
  (0, 1649)	1
  (0, 1725)	1
  (0, 2268)	1
  (0, 2417)	1
  (0, 2638)	1
  (0, 2666)	

We define a function that performs brute-force search given a distance or similarity function.

In [15]:
def fuerza_bruta(base, consulta, fd):
  medidas = np.zeros(base.shape[0])
  for i,x in enumerate(base):
    medidas[i] = fd(consulta, x)

  return medidas
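
As a minimal sketch of the same pattern on plain Python lists, the function below scores a query against every item of a small collection. The names `brute_force` and `dot` are hypothetical, and the dot product is only a stand-in for whichever measure is plugged in.

```python
def brute_force(base, query, measure):
    """Score the query against every item in base with the given measure."""
    return [measure(query, x) for x in base]

def dot(x, y):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(x, y))

base = [[1, 0, 2], [0, 3, 0], [2, 1, 2]]
query = [1, 0, 1]
scores = brute_force(base, query, dot)

# Index of the best-scoring item.
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)
```

The measure is evaluated once per item, so the cost grows linearly with the size of the collection regardless of how many items actually share words with the query.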

We define the cosine similarity function.

In [16]:
def similitud_coseno(x, y):
  x = x.toarray()[0]
  y = y.toarray()[0]
  # Product of the two vector norms; zero if either bag is empty
  pnorma = np.sqrt(x @ x) * np.sqrt(y @ y)

  if pnorma > 0:
    return (x @ y) / pnorma
  else:
    return np.nan
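
A small worked example may help to see what this measure captures. The sketch below reimplements it on plain lists (the name `cosine` is hypothetical) and evaluates three cases: vectors with the same direction, vectors with no terms in common, and an empty vector.

```python
import math

def cosine(x, y):
    """Cosine similarity of two equal-length lists; NaN if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms > 0 else math.nan

a = [1, 2, 0]
b = [2, 4, 0]   # same direction as a, twice the counts
c = [0, 0, 3]   # no terms in common with a

print(cosine(a, b), cosine(a, c), cosine(a, [0, 0, 0]))
```

Doubling every count leaves the similarity at (approximately) 1, so the measure does not depend on document length, while documents without shared terms score 0. The NaN case is why the search below uses `np.nanmax` and `np.nanargmax`.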

In [17]:
sims = fuerza_bruta(bolsas[1:], bolsa_dc, similitud_coseno)

In [18]:
print('Similitud máxima es {0} de documento {1}'.format(np.nanmax(sims), np.nanargmax(sims)+ 1))

Similitud máxima es 0.4589534811637672 de documento 6997


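We inspect the most similar document. The index is shifted by 1 because the search ran over `bolsas[1:]`, which leaves out document 0 (the query itself).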

In [19]:
print(db.data[np.nanargmax(sims) + 1])


Perhaps instead of this silly argument about what backup lights
are for, couldn't we agree that they serve the dual purpose of
letting people behind your car know that you have it in reverse
and that they can also light up the area behind your car while
you're backing up so you can see?

Backup lamps on current models are much brighter than they used
to be on older cars. Those on my Taurus Wagon are quite bright
enough to illuminate a good area behind the car, and they're 
MUCH brighter than those on my earlier cars from the 60s and 70s. 

Insofar as Vettes having side backup lights, look at a '92 or '93
model (or perhaps a year or two earlier too) and you'll see
red side marker lamps and white side marker lamps both near the
car's hindquarters.  Those aren't just white reflectors. 


We define the Euclidean distance.

In [20]:
def distancia_euclidiana(x, y):
  x = x.toarray()[0]
  y = y.toarray()[0]
  return np.sqrt(np.sum((x - y)**2))
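
One property worth keeping in mind is that the Euclidean distance on raw counts depends on document length. The following toy sketch (with hypothetical names and made-up vectors) compares a long document with the same word proportions as the query against a short document that shares no words with it.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

query = [2, 2, 0]
long_doc = [6, 6, 0]   # same word proportions as the query, three times longer
short_doc = [0, 0, 1]  # shares no words with the query

d_long = euclidean(query, long_doc)
d_short = euclidean(query, short_doc)
print(d_long, d_short)
```

In this example the unrelated short document ends up closer to the query than the related long one, so the nearest neighbour under this distance is not necessarily the document with the most words in common.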

We repeat the previous process with the Euclidean distance.

In [21]:
euclids = fuerza_bruta(bolsas[1:], bolsa_dc, distancia_euclidiana)
print('Distancia mínima es {0} de documento {1}'.format(np.nanmin(euclids), np.nanargmin(euclids) + 1))

Distancia mínima es 6.324555320336759 de documento 445


We display the document.

In [22]:
print(db.data[np.nanargmin(euclids) + 1])







Bzzt.
The manta was a two-door sedan in the US.
It had a 1900 engine.
Was sometimes referred to as an Opel 1900.
Manta's are also ve hot and fun cars too.





















We do the same for the Jaccard and MinMax similarities.

In [23]:
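def similitud_jaccard(x, y):
  x = x.toarray()[0]
  y = y.toarray()[0]
  # Number of terms present in both bags
  inter = np.count_nonzero(x * y)
  union = np.count_nonzero(x) + np.count_nonzero(y) - inter
  # Undefined when both bags are empty
  return inter / union if union > 0 else np.nan

def similitud_minmax(x, y):
  x = x.toarray()[0]
  y = y.toarray()[0]
  # Sum of element-wise minima over sum of element-wise maxima
  mn = np.sum(np.minimum(x, y))
  mx = np.sum(np.maximum(x, y))
  return mn / mx if mx > 0 else np.nan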
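
The two measures can be checked by hand on a small example. The sketch below reimplements them on plain lists of non-negative counts (the names `jaccard` and `minmax` are hypothetical) and evaluates them on two toy vectors.

```python
def jaccard(x, y):
    """Jaccard similarity on the sets of terms with a nonzero count."""
    sx = {i for i, v in enumerate(x) if v}
    sy = {i for i, v in enumerate(y) if v}
    union = sx | sy
    return len(sx & sy) / len(union) if union else float('nan')

def minmax(x, y):
    """MinMax similarity: sum of element-wise minima over sum of maxima."""
    mx = sum(max(a, b) for a, b in zip(x, y))
    return sum(min(a, b) for a, b in zip(x, y)) / mx if mx else float('nan')

x = [4, 1, 0, 2]
y = [1, 1, 3, 0]
print(jaccard(x, y), minmax(x, y))
```

Here two of the four terms are shared, so the Jaccard similarity is 2/4, while the minima add up to 2 and the maxima to 10, so the MinMax similarity is 2/10. Jaccard only looks at which words are present, whereas MinMax also takes the counts into account; on binary vectors the two coincide.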

We compute the similarities with all the documents.

In [24]:
js = fuerza_bruta(bolsas[1:], bolsa_dc, similitud_jaccard)
mms = fuerza_bruta(bolsas[1:], bolsa_dc, similitud_minmax)
print('Similitud de Jaccard máxima es {0} de documento {1}'.format(np.nanmax(js), np.nanargmax(js) + 1))
print('Similitud de MinMax máxima es {0} de documento {1}'.format(np.nanmax(mms), np.nanargmax(mms) + 1))

Similitud de Jaccard máxima es 0.16049382716049382 de documento 5282
Similitud de MinMax máxima es 0.14736842105263157 de documento 5282


We display the document with the highest Jaccard similarity.

In [25]:
print(db.data[np.nanargmax(js) + 1])

Alright, beat this automobile sighting.

Driving along just a hair north of Atlanta, I noticed an old, run down
former car dealership which appeared to deal with, and repair, older
rare or exotic foreign sports cars. I saw:

Ford GT-40 (!), the famous model from Ford, that seemed to win most of 
its races in the late 60s, including Le-Mans 4 or 6 times.

Two Jensen Interceptors, one a convertable, one a hatchback?

Porsche 911 (boring compared to the rest)

THREE Ferarries, a Mondial, a 308 prepared for racing, and a red 60s model
that I couldn't identify.

And at the bottom, a late 70s MG convertable.

Outside there was a rotting Rover 3500 saloon, which was never regularly
sold in the U.S.

And in the showroom, there was a small italian body, either an Alpha Romeo
or a Lancia. It was about the size of an Austin Mini.
The trunklid was missing, exposing a boot with a voltage regulator 
in the upper left corner of the wall, and a chunk of metal removed from
the floor on the right hand s

We do the same for the MinMax similarity.

In [26]:
print(db.data[np.nanargmax(mms) + 1])

Alright, beat this automobile sighting.

Driving along just a hair north of Atlanta, I noticed an old, run down
former car dealership which appeared to deal with, and repair, older
rare or exotic foreign sports cars. I saw:

Ford GT-40 (!), the famous model from Ford, that seemed to win most of 
its races in the late 60s, including Le-Mans 4 or 6 times.

Two Jensen Interceptors, one a convertable, one a hatchback?

Porsche 911 (boring compared to the rest)

THREE Ferarries, a Mondial, a 308 prepared for racing, and a red 60s model
that I couldn't identify.

And at the bottom, a late 70s MG convertable.

Outside there was a rotting Rover 3500 saloon, which was never regularly
sold in the U.S.

And in the showroom, there was a small italian body, either an Alpha Romeo
or a Lancia. It was about the size of an Austin Mini.
The trunklid was missing, exposing a boot with a voltage regulator 
in the upper left corner of the wall, and a chunk of metal removed from
the floor on the right hand s

## Exercise
+ Use the inverted index to speed up the search for the Jaccard, MinMax and cosine similarities.
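
As a possible starting point, note that a document sharing no word with the query has similarity 0 under all three measures, so only the documents found through the posting lists of the query terms need to be scored. The sketch below illustrates this for Jaccard only, on a toy collection. All names (`docs`, `postings`, `jaccard_with_index`) are hypothetical, and it is one way to approach the exercise rather than a reference solution.

```python
from collections import Counter

# Toy collection: each document is a dict {term index: count}.
docs = [{0: 2, 1: 1}, {1: 1, 3: 1, 5: 1}, {2: 1, 4: 1}, {0: 1, 3: 2}]

# Inverted index: term index -> list of documents containing it.
postings = {}
for doc_id, bag in enumerate(docs):
    for term in bag:
        postings.setdefault(term, []).append(doc_id)

def jaccard_with_index(query):
    """Jaccard similarity for every document sharing a term with the query."""
    # The counter holds the intersection size: one increment per shared term.
    shared = Counter()
    for term in query:
        shared.update(postings.get(term, []))
    return {d: n / (len(query) + len(docs[d]) - n) for d, n in shared.items()}

query = {0: 1, 1: 2}
sims = jaccard_with_index(query)
print(sims)
```

Document 2 never appears in the result because none of its terms are in the query. MinMax and cosine would additionally need the counts of the shared terms, which this sketch does not store in the posting lists.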