<a href="https://colab.research.google.com/github/adautofbn/ri_labs/blob/master/lab07/avaliacao_sys.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [529]:
import pandas as pd
import numpy as np
import nltk
import re
import collections
import bisect
from nltk.tokenize import RegexpTokenizer

nltk.download('stopwords')
result = pd.read_csv('https://raw.githubusercontent.com/adautofbn/ri_labs/master/lab06/results.csv')
                                            # Resultados adquiridos do site el-pais em abril de 2019
json = pd.read_json('https://raw.githubusercontent.com/adautofbn/ri_labs/master/lab07/results_final.json')
feedback = {json['query'][i]:json['docs'][i] for i in range(10)}

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Refazemos o conjunto de indíces invertidos para que possam ser feitos os calculos dos modelos vetoriais.

In [0]:
tknz = RegexpTokenizer(r'([A-Za-zÁáÉéÍíÓóÚúÃãÕõÇçÂâÊê]{3,27})')
stopwords = nltk.corpus.stopwords.words('portuguese') 
indexes = {}
M = result.text.count()

for i in range(len(result)):
  text = result.text[i]
  words = [word for word in tknz.tokenize(text.lower())
           if not bool(re.search(r'\d', word))
           and word not in stopwords and len(word) >= 3]  
  for t in words:
    if t not in indexes.keys():
      indexes[t] = []
    indexes[t].append(i)
    
for elem in indexes.items():
  d = dict(collections.Counter(elem[1]))
  indexes[elem[0]] = list(d.items())
  
for word in indexes:
  k = len(indexes[word])
  IDF = round(np.log((M+1)/k),2)
  indexes[word].append(IDF)

## 1. Escolha um documento dentre aqueles da base do aluno Bernardi e crie uma consulta que você acha que tem boas chances de recuperar este documento. Em seguida, avalie os resultados de tal consulta usando a métrica de avaliação Reciprocal Rank

Escolhemos o documento 213, que trata das consequências do atentado em Suzano na sociedade. Várias ameaças e mensagens de apoio aos terroristas, em redes sociais, foram tratadas com nenhuma ou mínima tolerância devido ao trauma que o atentado gerou.

In [531]:
ndoc = 213
document = result.loc[ndoc]
query = 'armas suzano'

document.title

'A tensão em escolas e universidades na esteira do massacre de Suzano'

Abaixo temos as definições de modelos vetoriais feitas no laboratório anterior.

In [0]:
def binary_vsm(query, document):
  score = 0
  query_tokens = query.split()
  doc_tokens = document.split()
  
  for token in query_tokens:
    score += (token in doc_tokens)
    
  return score

In [0]:
def tf_vsm(query, document):
  score = 0
  doc_tokens = document.split()
  query_tokens = query.split()
  
  for word in query_tokens:
    score += doc_tokens.count(word)
  
  return score

In [0]:
def bm25_vsm(query, document, k):
  score = 0
  doc_tokens = document.split()
  query_tokens = query.split()
  
  words = [word for word in query_tokens if word in doc_tokens]
    
  for word in words:
    cwd = doc_tokens.count(word)
    dfw = 0
    if word in indexes:
      dfw = len(indexes[word][:-1])
    score += (((k+1) * cwd) / (cwd + k)) * np.log10(((M+1) / dfw)) if dfw != 0 else 0
  
  return round(score,2)

In [0]:
def tfidf_vsm(query, document):
  score = 0
  doc_tokens = document.split()
  query_tokens = query.split()
  
  for word in query_tokens:
    cwd = doc_tokens.count(word)
    if word in indexes:
      score += cwd * indexes[word][-1]
  
  return round(score,2)

E aqui, temos a criação das tabelas com os top-k documentos em cada modelo.

In [0]:
def create_topk_models(query,k):
  db = []
  dtf = []
  dtfidf = []
  dbm25 = []
  for i in range(len(result)):
    doc = result.text[i].lower()
    bisect.insort(db, (binary_vsm(query, doc), i))
    bisect.insort(dtf, (tf_vsm(query,doc), i))
    bisect.insort(dtfidf, (tfidf_vsm(query,doc), i))
    bisect.insort(dbm25, (bm25_vsm(query,doc,20), i))
  
  db.reverse()
  dtf.reverse()
  dtfidf.reverse()
  dbm25.reverse()
  
  return db[:k], dtf[:k], dtfidf[:k], dbm25[:k]

In [0]:
top_binary, top_tf, top_tfidf, top_bm25 = create_topk_models(query,10)

### Resultados dos modelos

In [538]:
query_df = pd.DataFrame()

query_df['Binary'] = top_binary
query_df['TF'] = top_tf
query_df['TF-IDF'] = top_tfidf
query_df['BM25'] = top_bm25

query_df.index+=1
query_df

Unnamed: 0,Binary,TF,TF-IDF,BM25
1,"(2, 213)","(13, 21)","(33.81, 238)","(11.7, 213)"
2,"(1, 238)","(8, 213)","(31.59, 21)","(11.42, 238)"
3,"(1, 235)","(7, 238)","(31.44, 213)","(8.73, 21)"
4,"(1, 187)","(2, 235)","(4.86, 235)","(2.02, 235)"
5,"(1, 184)","(2, 149)","(4.86, 149)","(2.02, 149)"
6,"(1, 173)","(1, 187)","(2.43, 187)","(1.06, 187)"
7,"(1, 172)","(1, 184)","(2.43, 184)","(1.06, 184)"
8,"(1, 164)","(1, 173)","(2.43, 173)","(1.06, 173)"
9,"(1, 159)","(1, 172)","(2.43, 172)","(1.06, 172)"
10,"(1, 153)","(1, 164)","(2.43, 164)","(1.06, 164)"


### Reciprocal Rank

In [539]:
def reciprocal_rank(tuples, docId):
  n = 1.0;
  for r,doc in tuples:
    if doc == docId:
      return  [round(1 / n, 2)]
    else:
      n += 1

rank_df = pd.DataFrame()
rank_df['Binary'] = reciprocal_rank(query_df['Binary'], ndoc)
rank_df['TF'] = reciprocal_rank(query_df['TF'], ndoc)
rank_df['TF-IDF'] = reciprocal_rank(query_df['TF-IDF'], ndoc)
rank_df['BM25'] = reciprocal_rank(query_df['BM25'], ndoc)
rank_df.index+=1
rank_df

Unnamed: 0,Binary,TF,TF-IDF,BM25
1,1.0,0.5,0.33,1.0


## 2.  A partir do gabarito fornecido em OBS1, calcule o MAP para cada algoritmo abaixo e aponte qual obteve o melhor resultado. Para os cálculos do MAP, considere que um documento é relevante para uma dada consulta se este documento estiver entre os documentos do gabarito para essa consulta, senão ele deve ser considerado irrelevante.

In [0]:
def doc_indexes(model):
  return [doc for score,doc in model]

def intersection(a,b):
  return [elem for elem in a if elem in b]

def calc_AP(query):
  relevant_docs = []

  for doc_info in feedback[query]:
    row = result.loc[result.url == doc_info['URL']]
    relevant_docs.append(row.index[0])
  
  binary, tf, tfidf, bm25 = create_topk_models(query, 5)
  binary = doc_indexes(binary)
  tf = doc_indexes(tf)
  tfidf = doc_indexes(tfidf)
  bm25 = doc_indexes(bm25)
  
  ap_binary = len(intersection(binary, relevant_docs)) / len(binary)
  ap_tf = len(intersection(tf, relevant_docs)) / len(tf)
  ap_tfidf = len(intersection(tfidf, relevant_docs)) / len(tfidf)
  ap_bm25 = len(intersection(bm25, relevant_docs)) / len(bm25)
  
  return ap_binary, ap_tf, ap_tfidf, ap_bm25
  
def calc_MAP(queries):
  sum_binary = 0
  sum_tf = 0
  sum_tfidf = 0
  sum_bm25 = 0
  
  for query in queries:
    ap_binary, ap_tf, ap_tfidf, ap_bm25 = calc_AP(query)
    sum_binary += ap_binary
    sum_tf += ap_tf
    sum_tfidf += ap_tfidf
    sum_bm25 += ap_bm25
  
  map_binary = round(sum_binary / len(queries),2)
  map_tf = round(sum_tf / len(queries),2)
  map_tfidf = round(sum_tfidf / len(queries),2)
  map_bm25 = round(sum_bm25 / len(queries),2)
  
  return map_binary, map_tf, map_tfidf, map_bm25

In [0]:
map_binary, map_tf, map_tfidf, map_bm25 = calc_MAP(feedback.keys())

### Resultados

In [542]:
rank_df = pd.DataFrame()
rank_df['Binary'] = [map_binary]
rank_df['TF'] = [map_tf]
rank_df['TF-IDF'] = [map_tfidf]
rank_df['BM25'] = [map_bm25]
rank_df.index+=1
rank_df

Unnamed: 0,Binary,TF,TF-IDF,BM25
1,0.1,0.02,0.18,0.18


Por definição, o valor do Mean Average Precision (MAP) varia entre 0 e 1. Os valores calculados indicam que todos os modelos possuem uma precisão baixa na busca por documentos específicos fornecidos no gabarito.

## 3. Repita Q2 usando a avaliação multi-nível DCG. Utilize o campo "level" do gabarito para o cálculo do DCG e do idealDCG. Use uma janela de 5 documentos.

### Cálculos das métricas

In [0]:
def calc_dcg(model, levels):
  dcg = 0.0
  for i in range(1,len(model)+1):
    doc = model[i-1]
    level = get_level(doc, levels)
    dcg += (2^level) / np.log2(i + 1.0)
    
  return dcg

def dcg_models(query):
  relevant_docs = {}

  for doc_info in feedback[query]:
    row = result.loc[result.url == doc_info['URL']]
    relevant_docs[row.index[0]] = doc_info['level']
    
  binary, tf, tfidf, bm25 = create_topk_models(query, 5)
  binary, tf, tfidf, bm25 = all_docs(binary,tf,tfidf,bm25)
  
  dcg_binary = round(calc_dcg(binary, set_levels(binary, relevant_docs)),2)
  dcg_tf = round(calc_dcg(tf, set_levels(tf, relevant_docs)),2)
  dcg_tfidf = round(calc_dcg(tfidf, set_levels(tfidf, relevant_docs)),2)
  dcg_bm25 = round(calc_dcg(bm25, set_levels(bm25, relevant_docs)),2)
  
  return dcg_binary, dcg_tf, dcg_tfidf, dcg_bm25

def idcg_models(query):
  relevant_docs = {}

  for doc_info in feedback[query]:
    row = result.loc[result.url == doc_info['URL']]
    relevant_docs[row.index[0]] = doc_info['level']
    
  binary, tf, tfidf, bm25 = create_topk_models(query, 5)
  binary = doc_indexes(binary)
  tf = doc_indexes(tf)
  tfidf = doc_indexes(tfidf)
  bm25 = doc_indexes(bm25)
  
  levels_binary, levels_tf, levels_tfidf, levels_bm25 = all_levels(binary,tf,tfidf,bm25,relevant_docs)
  
  binary, tf, tfidf, bm25 = extract_docs(levels_binary, levels_tf, levels_tfidf, levels_bm25)
  
  idcg_binary = round(calc_dcg(binary, levels_binary),2)
  idcg_tf = round(calc_dcg(tf, levels_tf),2)
  idcg_tfidf = round(calc_dcg(tfidf, levels_tfidf),2)
  idcg_bm25 = round(calc_dcg(bm25, levels_bm25),2)
  
  return idcg_binary, idcg_tf, idcg_tfidf, idcg_bm25

In [0]:
queries_results = {}
for query in feedback.keys():
  dcg_binary, dcg_tf, dcg_tfidf, dcg_bm25 = dcg_models(query)
  idcg_binary, idcg_tf, idcg_tfidf, idcg_bm25 = idcg_models(query)
  
  binary = (dcg_binary, idcg_binary)
  tf = (dcg_tf, idcg_tf)
  tfidf = (dcg_tfidf, idcg_tfidf)
  bm25 = (dcg_bm25, idcg_bm25)
  
  results = [binary, tf, tfidf, bm25]
  
  queries_results[query] = results

#### Métodos auxiliares

In [0]:
def set_levels(m, d):
  model = [(0, doc) for doc in m if doc not in d]
  dic = [(v, k) for k, v in d.items()]
  
  res = model + dic
  res.sort(reverse=True)
  
  return res

def get_level(d, l):
  for level,doc in l:
    if doc == d:
      return level

def all_docs(bi,tf,tfidf,bm):
  return doc_indexes(bi), doc_indexes(tf), doc_indexes(tfidf), doc_indexes(bm)
    
def all_levels(bi,tf,tfidf,bm, rd):
  return set_levels(bi,rd), set_levels(tf,rd), set_levels(tfidf,rd), set_levels(bm,rd)

def extract_docs(bi,tf,tfidf,bm):
  return [doc for level,doc in bi], [doc for level,doc in tf], [doc for level,doc in tfidf], [doc for level,doc in bm]

### Resultados

In [546]:
results_df = pd.DataFrame()
results_df['Query'] = feedback.keys()
results_df['Binary'] = [queries_results[query][0] for query in feedback.keys()]
results_df['TF'] = [queries_results[query][1] for query in feedback.keys()]
results_df['TF-IDF'] = [queries_results[query][2] for query in feedback.keys()]
results_df['BM25'] = [queries_results[query][3] for query in feedback.keys()]
results_df.index+=1

results_df

Unnamed: 0,Query,Binary,TF,TF-IDF,BM25
1,território palestino,"(5.9, 15.65)","(5.9, 15.65)","(5.51, 14.98)","(5.51, 14.98)"
2,recessão mundial,"(11.58, 14.98)","(9.77, 14.98)","(9.77, 14.98)","(9.77, 14.98)"
3,ditadura militar,"(5.9, 17.17)","(5.9, 17.17)","(5.9, 17.17)","(5.9, 17.17)"
4,muro das lamentações,"(18.08, 19.29)","(5.9, 21.3)","(19.29, 19.29)","(19.29, 19.29)"
5,brasil e argentina,"(8.9, 17.5)","(5.9, 18.17)","(7.79, 17.5)","(7.79, 17.5)"
6,golpe militar,"(5.9, 20.67)","(5.9, 20.67)","(8.4, 20.04)","(8.4, 20.04)"
7,governo bolsonaro,"(5.9, 16.54)","(5.9, 16.54)","(5.9, 16.54)","(5.9, 16.54)"
8,ministro da economia,"(5.9, 17.17)","(5.9, 17.17)","(5.9, 17.17)","(5.9, 17.17)"
9,prisão de Temer,"(5.9, 13.43)","(5.9, 13.43)","(10.29, 12.05)","(10.29, 12.05)"
10,Congresso Nacional,"(5.9, 9.65)","(5.9, 9.65)","(5.9, 9.65)","(5.9, 9.65)"


Acima, temos a disposição dos resultados encontrados para as métricas DCG e IDCG, respectivamente nas tuplas, para os modelos vetoriais usados.