<a href="https://colab.research.google.com/github/blancavazquez/CursoDatosMasivosI/blob/master/notebooks/3b_indice_inverso.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document search with an inverted index
In this notebook we build a document search engine using an inverted index.

## Searching documents by words
First we look at how to search for documents by the words they contain.

In [1]:
from collections import Counter
import re
import codecs
from math import log

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_20newsgroups

VOCMAX = 5000

We download the _20 newsgroups_ dataset

In [2]:
db = fetch_20newsgroups(remove=('headers','footers','quotes'))

We look at what a document looks like

In [3]:
print(db.data[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


We import the NLTK library and define our tokenizer and lemmatizer

In [4]:
import nltk
nltk.download(['punkt','averaged_perceptron_tagger','wordnet'])

from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.corpus.reader.wordnet import NOUN, VERB, ADV, ADJ

morphy_tag = {
    'JJ' : ADJ,
    'JJR' : ADJ,
    'JJS' : ADJ,
    'VB' : VERB,
    'VBD' : VERB,
    'VBG' : VERB,
    'VBN' : VERB,
    'VBP' : VERB,
    'VBZ' : VERB,
    'RB' : ADV,
    'RBR' : ADV,
    'RBS' : ADV
}

def doc_a_tokens(doc):
  tagged = pos_tag(word_tokenize(doc.lower()))
  lemmatizer = WordNetLemmatizer()
  tokens = []
  for p,t in tagged:
    tokens.append(lemmatizer.lemmatize(p, pos=morphy_tag.get(t, NOUN)))

  return tokens

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


We preprocess the collection and convert it into a list of strings, one per document

In [5]:
corpus = []
for d in db.data:
  d = d.replace('\n',' ').replace('\r',' ').replace('\t',' ')
  d = ' '.join([''.join([c.lower() for c in p if c.isalnum()]) for p in d.split()])
  tokens = doc_a_tokens(d)
  corpus.append(' '.join(tokens))

We download and load the list of English _stopwords_ (a file with one word per line)

In [6]:
!wget -q -O stopwords_english.txt \
         https://raw.githubusercontent.com/pan-webis-de/authorid/master/data/stopwords_english.txt

# Load the stopwords into a set so that membership tests are fast
with open('stopwords_english.txt', encoding='utf-8') as f:
  stopwords = {line.strip() for line in f}

We process every word of the whole corpus to build the vocabulary, ordered by frequency

In [7]:
# Regular expression that splits a string into words
term_re = re.compile(r"\w+", re.UNICODE)

# Count the occurrences of each word
corpus_freq = Counter()
doc_freq = Counter()
for d in corpus:
  # Remove numbers from the string (document) being processed
  d = re.sub(r'\d+', '', d)

  # Split the string into a list of words
  terms = [t for t in term_re.findall(d) if t not in stopwords and len(t) > 2]

  # Increase the counter for every occurrence of a word in the document
  for t in terms:
    corpus_freq[t] += 1

  # Increase the counter once for every distinct word in the document
  for t in set(terms):
    doc_freq[t] += 1

# Build a dictionary with the VOCMAX most frequent words:
# word -> (id, corpus frequency, document frequency, IDF)
vocabulary = {entry[0]:(i, entry[1], doc_freq[entry[0]], log(len(corpus) / doc_freq[entry[0]])) \
              for i, entry in enumerate(corpus_freq.most_common(VOCMAX))}
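
To make the tuple stored for each word concrete, here is a minimal sketch on a hypothetical three-document corpus. The `toy_*` names are illustrative and are not part of the notebook. It computes the corpus frequency, the document frequency and the IDF `log(N / df)` following the same steps as the cell above.

```python
from collections import Counter
from math import log

# Hypothetical toy corpus, already tokenized (illustrative only)
toy_corpus = [
  ['car', 'engine', 'car'],
  ['engine', 'space'],
  ['space', 'mission', 'space'],
]

toy_corpus_freq = Counter()
toy_doc_freq = Counter()
for terms in toy_corpus:
  toy_corpus_freq.update(terms)       # every occurrence counts
  toy_doc_freq.update(set(terms))     # each document counts once per word

# word -> (id, corpus frequency, document frequency, IDF)
toy_vocabulary = {
  w: (i, cf, toy_doc_freq[w], log(len(toy_corpus) / toy_doc_freq[w]))
  for i, (w, cf) in enumerate(toy_corpus_freq.most_common())
}

for w, entry in toy_vocabulary.items():
  print(w, entry)
```

In this sketch `space` gets id 0 because it is the most frequent word, and `mission`, which appears in a single document, receives the largest IDF.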

We create a dictionary that maps ids back to words

In [8]:
id_a_palabra = {v[0]: k for k,v in vocabulary.items()}

We build the bags of words of the preprocessed documents

In [9]:
bolsas = []
for d in corpus:
  d = re.sub(r'\d+', '', d)
  ids = Counter([vocabulary[t][0] for t in term_re.findall(d) \
                 if t in vocabulary and t not in stopwords])
  bolsas.append(sorted(ids.items()))
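
As an illustration of the representation, the following sketch builds one bag from a hypothetical word-to-id mapping (`toy_ids` and `toy_bag` are assumed names; the notebook's `vocabulary` stores a tuple per word, of which only the id is needed here). Each bag is a list of `(word id, count)` pairs sorted by id, and words outside the vocabulary are dropped.

```python
from collections import Counter

# Hypothetical vocabulary: word -> id (illustrative only)
toy_ids = {'space': 0, 'car': 1, 'engine': 2}

def toy_bag(tokens, ids):
  """Return a sorted list of (word id, count) pairs, ignoring unknown words."""
  counts = Counter(ids[t] for t in tokens if t in ids)
  return sorted(counts.items())

print(toy_bag(['car', 'engine', 'car', 'bumper'], toy_ids))
# [(1, 2), (2, 1)]
```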

We define the inverted index class, with a method that retrieves the documents containing the words in a list

In [10]:
class IndiceInverso:
  def __getitem__(self, idx):
    return self.ifs[idx]

  def __repr__(self):
    contenido = ['%d::%s' % (i, self.ifs[i]) for i in range(len(self.ifs))]
    return "<IFS :%s >" % ('\n'.join(contenido))

  def __str__(self):
    contenido = ['%d::%s' % (i, self.ifs[i]) for i in range(len(self.ifs))]
    return '\n'.join(contenido)

  def recupera(self, l):
    return Counter([j for (i,_) in l for j in self.ifs[i]])

  def construye(self, bd, tvoc):
    self.ifs = [[] for _ in range(tvoc)]

    for i,d in enumerate(bd):
      for p in d:
        self.ifs[p[0]].append(i)
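
The idea behind `construye` and `recupera` can be sketched with plain functions on a hypothetical set of three bags. The names `toy_bags`, `build_postings` and `retrieve` are illustrative and are not part of the notebook. Each word id keeps the list of documents that contain it, and retrieval counts how many distinct query words each document matches.

```python
from collections import Counter

# Hypothetical bags: each document is a list of (word id, count) pairs
toy_bags = [
  [(0, 1), (2, 3)],   # document 0 contains words 0 and 2
  [(1, 2)],           # document 1 contains word 1
  [(0, 1), (1, 1)],   # document 2 contains words 0 and 1
]

def build_postings(bags, vocab_size):
  """Posting lists: for each word id, the documents that contain it."""
  postings = [[] for _ in range(vocab_size)]
  for doc_id, bag in enumerate(bags):
    for word_id, _ in bag:
      postings[word_id].append(doc_id)
  return postings

def retrieve(postings, query_bag):
  """Count, for each document, how many distinct query words it contains."""
  return Counter(d for word_id, _ in query_bag for d in postings[word_id])

toy_postings = build_postings(toy_bags, 3)
print(toy_postings)
# [[0, 2], [1, 2], [0]]
print(retrieve(toy_postings, [(0, 1), (1, 1)]).most_common())
# [(2, 2), (0, 1), (1, 1)]
```

Document 2 ranks first because it is the only one that contains both query words.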

We instantiate our `IndiceInverso` class and build the structure from our bags of words

In [11]:
ifs = IndiceInverso()
ifs.construye(bolsas, VOCMAX)

We write a few queries and compute their bags of words

In [12]:
consultas = ['nasa space mission satellite', 'government crime enforcement security']
bolsas_consultas = []
for c in consultas:
  c = re.sub(r'\d+', '', c)
  # Skip stopwords and words outside the vocabulary (indexing them would raise a KeyError)
  ids = Counter([vocabulary[t][0] for t in term_re.findall(c) \
                 if t not in stopwords and t in vocabulary])
  bolsas_consultas.append(sorted(ids.items()))

We use the inverted index to retrieve the documents that contain the words of the first query, sorted by number of matching words, and display the top-ranked document

In [13]:
recs = ifs.recupera(bolsas_consultas[0])
top = recs.most_common()[0]
print(recs.most_common())
print(db.data[top[0]])

[(59, 4), (153, 4), (545, 4), (1830, 4), (2800, 4), (3285, 4), (3564, 4), (4425, 4), (5356, 4), (6197, 4), (6719, 4), (7554, 4), (8525, 4), (9096, 4), (9154, 4), (9986, 4), (10855, 4), (11198, 4), (432, 3), (953, 3), (1071, 3), (3044, 3), (3864, 3), (4166, 3), (5125, 3), (5207, 3), (5877, 3), (5880, 3), (6572, 3), (9067, 3), (9868, 3), (13, 2), (533, 2), (799, 2), (812, 2), (988, 2), (1459, 2), (1691, 2), (1761, 2), (2061, 2), (2142, 2), (2453, 2), (2624, 2), (2837, 2), (2912, 2), (2950, 2), (3137, 2), (3272, 2), (3295, 2), (3296, 2), (3727, 2), (3818, 2), (3990, 2), (4088, 2), (4276, 2), (4307, 2), (4312, 2), (4443, 2), (4614, 2), (4625, 2), (4706, 2), (4840, 2), (5071, 2), (5376, 2), (5969, 2), (6236, 2), (6256, 2), (6387, 2), (6964, 2), (7234, 2), (7448, 2), (7465, 2), (7545, 2), (8006, 2), (8083, 2), (8167, 2), (8569, 2), (9101, 2), (9232, 2), (9333, 2), (9483, 2), (9635, 2), (9864, 2), (10422, 2), (10498, 2), (10530, 2), (10693, 2), (10734, 2), (10936, 2), (1176, 2), (3665, 2), (6

We repeat the same process for the second query

In [14]:
recs = ifs.recupera(bolsas_consultas[1])
top = recs.most_common()[0]
print(recs.most_common())
print(db.data[top[0]])

[(2350, 4), (4498, 4), (4682, 4), (5612, 4), (6635, 4), (8445, 4), (8534, 4), (9007, 4), (9396, 4), (10575, 4), (591, 3), (1182, 3), (1379, 3), (1660, 3), (2077, 3), (2175, 3), (2352, 3), (2569, 3), (3514, 3), (4499, 3), (4610, 3), (5258, 3), (5503, 3), (5844, 3), (5898, 3), (6715, 3), (6935, 3), (7240, 3), (7333, 3), (7344, 3), (7367, 3), (8114, 3), (9115, 3), (9181, 3), (9365, 3), (10054, 3), (10328, 3), (11199, 3), (10433, 3), (37, 2), (70, 2), (133, 2), (267, 2), (378, 2), (545, 2), (658, 2), (760, 2), (786, 2), (817, 2), (853, 2), (981, 2), (1037, 2), (1114, 2), (1188, 2), (1293, 2), (1368, 2), (1468, 2), (1526, 2), (1726, 2), (1735, 2), (1756, 2), (2039, 2), (2060, 2), (2068, 2), (2120, 2), (2157, 2), (2272, 2), (2581, 2), (2586, 2), (2741, 2), (2825, 2), (2907, 2), (2941, 2), (3121, 2), (3141, 2), (3144, 2), (3157, 2), (3301, 2), (3410, 2), (3433, 2), (3455, 2), (3473, 2), (3475, 2), (3495, 2), (3511, 2), (3545, 2), (3552, 2), (3742, 2), (3746, 2), (3776, 2), (4125, 2), (4192, 2

## Searching for similar documents
Now we search for documents that are similar to a query document.

First we pick one document to serve as the query and display it

In [15]:
dc = 0 
print(db.data[dc])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


We get its bag of words

In [16]:
bolsa_dc = bolsas[dc]

We define a function that performs a brute-force search given a distance or similarity function

In [17]:
def fuerza_bruta(base, consulta, fd):
  medidas = np.zeros(len(base))
  for i,x in enumerate(base):
    medidas[i] = fd(consulta, x)

  return medidas

We define the cosine similarity function, which weights each word count by the word's IDF

In [18]:
def similitud_coseno(x, y):
  ax = np.zeros(VOCMAX)
  for e in x:
    ax[e[0]] = e[1] * vocabulary[id_a_palabra[e[0]]][-1]

  ay = np.zeros(VOCMAX)
  for e in y:
    ay[e[0]] = e[1] * vocabulary[id_a_palabra[e[0]]][-1]

  pnorma = (np.sqrt(ax @ ax) * np.sqrt(ay @ ay))

  if pnorma > 0:
    return (ax @ ay) / pnorma
  else: 
    return np.nan 
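
The function above fills two dense vectors of length `VOCMAX` for every comparison. As a sketch of the same computation, cos(x, y) = (x · y) / (‖x‖ ‖y‖) with entries weighted by IDF, the version below works directly on the sparse bags. `sparse_cosine` and `toy_idf` are hypothetical names, and the IDF values are made up for illustration.

```python
from math import sqrt

def sparse_cosine(x, y, idf):
  """Cosine similarity between two bags of (word id, count) pairs, weighted by idf[word id]."""
  wx = {i: c * idf[i] for i, c in x}
  wy = {i: c * idf[i] for i, c in y}
  # Only words present in both bags contribute to the dot product
  dot = sum(w * wy[i] for i, w in wx.items() if i in wy)
  norm = sqrt(sum(w * w for w in wx.values())) * sqrt(sum(w * w for w in wy.values()))
  return dot / norm if norm > 0 else float('nan')

toy_idf = [1.0, 2.0, 0.5]            # hypothetical IDF values indexed by word id
a = [(0, 2), (1, 1)]
b = [(0, 2), (1, 1)]
c = [(2, 4)]
print(sparse_cosine(a, b, toy_idf))  # identical bags: similarity close to 1
print(sparse_cosine(a, c, toy_idf))  # no words in common: similarity 0
```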

In [19]:
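# Exclude the query document from the search base. Since dc = 0, position j in
# bolsas_base corresponds to document j + 1 in db.data
bolsas_base = [b for i,b in enumerate(bolsas) if i != dc]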
sims = fuerza_bruta(bolsas_base, bolsa_dc, similitud_coseno)

In [20]:
print('Maximum similarity is {0} for document {1}'.format(np.nanmax(sims), np.nanargmax(sims)+ 1))

Maximum similarity is 0.37713346874515374 for document 6055


We inspect the most similar document

In [21]:
print(db.data[np.nanargmax(sims) + 1])


[ These two paragraphs are from two different posts.  In splicing them 
  together it is not my intention to change Steve's meaning or misrepresent
  him in any way.  I don't *think* I've done so. ]


Part of what started this was my earlier example of Illinois, USA requiring
anyone doing more than X automobile transfers a year (X = 10, I think)
to become licensed as a used car dealer.  In addition, it requirs anyone
with a used car dealer's license to own at least 10 cars at a time, all the
time. 

Let me continue with this example and try to answer Steve's questions.

Steve, let's say you have the talent and inclination to fix up and resell
cars.  Either you've gotten good enough at it in your spare time to bump
up against these limits, or you would like to do it full-time but these
stupid, arbitrary laws prevent you from starting out small and pulling
yourself up.  So I'm protected from a hungry neighborhood competitor willing
to take a low profit while working extra hard to fulfil

We do the same for the Jaccard and MinMax similarities

In [22]:
def similitud_jaccard(x, y):
  ax = np.zeros(VOCMAX)
  for e in x:
    ax[e[0]] = 1

  ay = np.zeros(VOCMAX)
  for e in y:
    ay[e[0]] = 1

  inter = np.count_nonzero(ax * ay)
  return inter / (np.count_nonzero(ax) + np.count_nonzero(ay) - inter)

def similitud_minmax(x, y):
  ax = np.zeros(VOCMAX)
  for e in x:
    ax[e[0]] = e[1]

  ay = np.zeros(VOCMAX)
  for e in y:
    ay[e[0]] = e[1]

  c = np.vstack((ax,ay))
  mn = np.sum(np.min(c, axis=0))
  mx = np.sum(np.max(c, axis=0))
  return mn / mx
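
A small worked example may help to check both measures by hand. The sketch below reimplements them on sparse bags, without the dense `VOCMAX` vectors. `sparse_jaccard` and `sparse_minmax` are hypothetical names, and unlike the functions above they return NaN when both bags are empty.

```python
def sparse_jaccard(x, y):
  """Shared distinct words divided by all distinct words in either bag."""
  sx = {i for i, _ in x}
  sy = {i for i, _ in y}
  union = len(sx | sy)
  return len(sx & sy) / union if union > 0 else float('nan')

def sparse_minmax(x, y):
  """Sum of element-wise minimum counts divided by the sum of maximum counts."""
  cx, cy = dict(x), dict(y)
  keys = cx.keys() | cy.keys()
  mn = sum(min(cx.get(k, 0), cy.get(k, 0)) for k in keys)
  mx = sum(max(cx.get(k, 0), cy.get(k, 0)) for k in keys)
  return mn / mx if mx > 0 else float('nan')

a = [(0, 2), (1, 1)]
b = [(1, 3), (2, 1)]
print(sparse_jaccard(a, b))  # 1 shared word out of 3 distinct words: 1/3
print(sparse_minmax(a, b))   # minimum sum 1, maximum sum 2 + 3 + 1 = 6: 1/6
```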

We compute the similarities with all the documents

In [23]:
js = fuerza_bruta(bolsas_base, bolsa_dc, similitud_jaccard)
mms = fuerza_bruta(bolsas_base, bolsa_dc, similitud_minmax)
print('Maximum Jaccard similarity is {0} for document {1}'.format(np.nanmax(js), np.nanargmax(js) + 1))
print('Maximum MinMax similarity is {0} for document {1}'.format(np.nanmax(mms), np.nanargmax(mms) + 1))

Maximum Jaccard similarity is 0.14814814814814814 for document 7135
Maximum MinMax similarity is 0.13043478260869565 for document 9767


We display the document with the highest Jaccard similarity

In [24]:
print(db.data[np.nanargmax(js) + 1])


Sorry, not so -- the changes in sunrise and sunset times are not
quite synchronized.  For example, neither the earliest sunrise nor the
latest sunset comes on the longest day of the year.


The same for the MinMax similarity

In [25]:
print(db.data[np.nanargmax(mms) + 1])

For sale - Mazda 323

	1986 Mazda 323
	White exterior, Grey interior.
	75,000 miles
	Interior in very good condition.
	Exterior in good condition

	Pioneer DX 680 car stereo.
		- CD player
		- 18 FM presets, 6 AM
		- removable faceplate
		- seperate component speakers professionally mounted
		  in the doors.

The car has been well maintained.  I wax it often and keep the interior
clean.  Its a good running car with a solid body (no rust thru, tiny
spots of surface rust.  When I see a spot I touch it up.)  The stereo
makes the car.  I have had no mechanical problems with it.

I'm looking for $900.00 firm.  The car has an average wholesale value of 
about $900.00 without the stereo.  The stereo cost me $500.00 last July.

If you are interested, call or Email me at:


## Exercise
+ Use the inverted index to speed up the search for the Jaccard, MinMax and cosine similarities
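
One possible starting point, offered as a sketch and not as the intended solution: a document that shares no word with the query has similarity 0 under all three measures, so only the documents found in the posting lists of the query words need to be scored. The names `candidates`, `search` and the toy data below are hypothetical, and Jaccard is used only as an example measure.

```python
def jaccard(x, y):
  """Jaccard similarity between two bags of (word id, count) pairs."""
  sx = {i for i, _ in x}
  sy = {i for i, _ in y}
  union = len(sx | sy)
  return len(sx & sy) / union if union > 0 else 0.0

def candidates(postings, query_bag):
  """Documents that share at least one word with the query."""
  found = set()
  for word_id, _ in query_bag:
    found.update(postings[word_id])
  return found

def search(bags, postings, query_bag, sim):
  """Score only the candidate documents and rank them by similarity."""
  scores = [(d, sim(query_bag, bags[d])) for d in sorted(candidates(postings, query_bag))]
  return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: four bags and the posting lists built from them
toy_bags = [[(0, 1), (2, 3)], [(1, 2)], [(0, 1), (1, 1)], [(3, 1)]]
toy_postings = [[0, 2], [1, 2], [0], [3]]

print(search(toy_bags, toy_postings, [(0, 1), (1, 1)], jaccard))
```

In this toy case document 3 is never scored because it shares no word with the query. On the real collection the saving depends on how common the query words are.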