# Detecting False Claims About COVID-19

## Searching the CORD-19 Database

The search index is available here:

https://github.com/castorini/anserini/blob/master/docs/experiments-cord19.md#pre-built-indexes-all-versions

In [1]:
import os
import pandas as pd

os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/jdk-11.0.7.jdk/Contents/Home"

import spacy
nlp = spacy.load("en_core_web_lg")

In [2]:
from pyserini.search import pysearch
import json

##from sentence_transformers import SentenceTransformer
#model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

searcher = pysearch.SimpleSearcher('lucene-index-cord19-paragraph-2020-07-16')

In [3]:
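import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Load the Universal Sentence Encoder from a local copy of the model.
embed = hub.load("./universal-sentence-encoder-large_5")

def embed_fn(sentences):
    """Return one embedding (as a plain list of floats) per input sentence."""
    embeddings = embed(sentences)
    return np.array(embeddings).tolist()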

In [4]:
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()

def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    #normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return punc_free
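
To make the effect of `clean` concrete, here is a minimal, self-contained sketch of the same steps. It assumes a tiny hand-picked stop-word set (`STOP_SKETCH`, a hypothetical name) in place of NLTK's English list, so the output of the notebook's own `clean` may differ for other inputs.

```python
import string

# Assumption: a tiny stand-in for stopwords.words('english'), for illustration only.
STOP_SKETCH = {"is", "in", "the", "a", "of"}
PUNCT = set(string.punctuation)

def clean_sketch(doc):
    """Lowercase, drop stop words, then strip punctuation characters."""
    stop_free = " ".join(w for w in doc.lower().split() if w not in STOP_SKETCH)
    return "".join(ch for ch in stop_free if ch not in PUNCT)

example = clean_sketch("COVID-19 is transmitted in the flight!")
print(example)  # covid19 transmitted flight
```

Because stop words are filtered before punctuation is stripped, a token such as `the,` is not recognized as a stop word and survives as `the`. Stripping punctuation first would avoid this, at the cost of changing the function's behavior.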

In [5]:
# QUERY
query = 'covid 19 in flight transmission'

results = searcher.search(query, 300)

scores = []
titles = []
texts = []
dois = []
journals = []
dates = []
authors = []
url = []

num_res = 0
for result in results:
    items = result.lucene_document.get("contents").split("\n")
    if len(items[-1]) == 0:
        continue
    texts.append(items[-1])
    
    
    scores.append(result.score)
    titles.append(result.lucene_document.get("title"))
    dois.append(result.lucene_document.get("doi"))
    journals.append(result.lucene_document.get("journal"))
    dates.append(result.lucene_document.get("publish_time"))
    authors.append(result.lucene_document.get("author_string"))
    url.append(result.lucene_document.get("url"))
    num_res+=1
    if num_res>200:
        break
    
data = {"title":titles,
       "score":scores,
       "text":texts,
       "doi":dois,
       "journal":journals,
       "date":dates,
       "authors":authors,
       "url":url}

busquedaDF = pd.DataFrame(data=data)
busquedaDF.head(100)

index,title,score,text,doi,journal,date,authors,url
0,Potential transmission of SARS-CoV-2 on a flig...,9.7510,Previous studies suggest that the basic reprod...,10.1016/j.tmaid.2020.101816,Travel Med Infect Dis,2020-07-06,"Chen, Junfang; He, Hanqing; Cheng, Wei; Liu, Y...",https://www.ncbi.nlm.nih.gov/pubmed/32645477/
1,Potential transmission of SARS-CoV-2 on a flig...,9.7078,"In conclusion, a flight-related outbreak of CO...",10.1016/j.tmaid.2020.101816,Travel Med Infect Dis,2020-07-06,"Chen, Junfang; He, Hanqing; Cheng, Wei; Liu, Y...",https://www.ncbi.nlm.nih.gov/pubmed/32645477/
2,Potential transmission of SARS-CoV-2 on a flig...,9.6819,Here we describe the investigation of the outb...,10.1016/j.tmaid.2020.101816,Travel Med Infect Dis,2020-07-06,"Chen, Junfang; He, Hanqing; Cheng, Wei; Liu, Y...",https://www.ncbi.nlm.nih.gov/pubmed/32645477/
3,Potential transmission of SARS-CoV-2 on a flig...,9.6023,"With regard to Case 16, he did not have a hist...",10.1016/j.tmaid.2020.101816,Travel Med Infect Dis,2020-07-06,"Chen, Junfang; He, Hanqing; Cheng, Wei; Liu, Y...",https://www.ncbi.nlm.nih.gov/pubmed/32645477/
4,Potential transmission of SARS-CoV-2 on a flig...,9.5887,We have described an outbreak of COVID-19 that...,10.1016/j.tmaid.2020.101816,Travel Med Infect Dis,2020-07-06,"Chen, Junfang; He, Hanqing; Cheng, Wei; Liu, Y...",https://www.ncbi.nlm.nih.gov/pubmed/32645477/
...,...,...,...,...,...,...,...,...
95,In-flight Transmission Cluster of COVID-19: A ...,8.8601,The copyright holder for this preprint this ve...,10.1101/2020.03.28.20040097,,2020-03-30,"Yang, Naibin; Shen, Yuefei; Shi, Chunwei; Ma, ...",https://doi.org/10.1101/2020.03.28.20040097
96,Estimating the Impact of Control Measures to P...,8.8270,There is of course the limitation here that al...,10.1101/2020.06.10.20127977,,2020-06-12,"Wilson, N.; Baker, M. G.; Eichner, M.",http://medrxiv.org/cgi/content/short/2020.06.1...
97,In-flight Transmission Cluster of COVID-19: A ...,8.8033,All Patients received antiviral treatment with...,10.1101/2020.03.28.20040097,,2020-03-30,"Yang, Naibin; Shen, Yuefei; Shi, Chunwei; Ma, ...",https://doi.org/10.1101/2020.03.28.20040097
98,In-flight Transmission Cluster of COVID-19: A ...,8.7622,The copyright holder for this preprint this ve...,10.1101/2020.03.28.20040097,,2020-03-30,"Yang, Naibin; Shen, Yuefei; Shi, Chunwei; Ma, ...",https://doi.org/10.1101/2020.03.28.20040097


In [6]:
from scipy.spatial.distance import cosine
import numpy as np
from nltk.tokenize import sent_tokenize

distances = []
sentencias = []
query_emb = embed_fn([query])
#query_emb = model.encode([query])
for index, texto in enumerate(texts):
    doc = nlp(texto)
    sents = [s.text for s in doc.sents]
    #sents = sent_tokenize(texto)
    
    embedding = embed_fn(sents)
    #embedding = model.encode(sents)
    dist_min = 2
    sentencia = ""
    for i, emb in enumerate(embedding):
        if len(sents[i])<1:
            continue
        dist = cosine(embedding[i], query_emb[0])
        #dists.append(dist)
        if dist < dist_min:
            dist_min = dist
            sentencia = sents[i]
            
    distances.append(dist_min)
    sentencias.append(sentencia)

distances = np.array(distances)
min_dist_idx = distances.argsort()[:20]

print(min_dist_idx)
print(distances[min_dist_idx])

[153  11  10  37 109   1  72   0  45  30   2 141   7  29   9 103 163  13
  96 113]
[0.4917704  0.60661877 0.60661877 0.63400304 0.67025772 0.67736514
 0.67842131 0.67991125 0.68605991 0.69118957 0.6938599  0.69493142
 0.69732826 0.71752676 0.72351193 0.72597138 0.73407303 0.73763626
 0.73852464 0.74120745]
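
The re-ranking in the cell above keeps, for each paragraph, the sentence closest to the query and then sorts paragraphs by that distance. The sketch below reproduces that logic on toy 2-D vectors in plain Python. The vectors and the helper names (`cosine_distance`, `closest_sentence`) are illustrative assumptions that stand in for the Universal Sentence Encoder embeddings and `scipy.spatial.distance.cosine`.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; ranges from 0 (same direction) to 2 (opposite)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def closest_sentence(query_vec, sent_vecs, sents):
    """Return (distance, sentence) for the sentence nearest to the query."""
    best_dist, best_sent = 2.0, ""  # 2 is the largest possible cosine distance
    for vec, sent in zip(sent_vecs, sents):
        dist = cosine_distance(vec, query_vec)
        if dist < best_dist:
            best_dist, best_sent = dist, sent
    return best_dist, best_sent

# Toy data (hypothetical): two "paragraphs", each with two sentence vectors.
query_vec = [1.0, 0.0]
paragraphs = [
    (["s0a", "s0b"], [[0.0, 1.0], [-1.0, 0.0]]),
    (["s1a", "s1b"], [[1.0, 1.0], [2.0, 0.0]]),
]

best = [closest_sentence(query_vec, vecs, sents) for sents, vecs in paragraphs]
# Paragraph indices ordered by their best sentence distance, like argsort().
ranking = sorted(range(len(best)), key=lambda i: best[i][0])
print(ranking, [best[i][1] for i in ranking])
```

Since each paragraph is scored by a single sentence, a long paragraph with one on-topic sentence ranks the same as a short paragraph consisting only of that sentence.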


In [7]:
busqueda_filtro_DF = busquedaDF.loc[min_dist_idx]
busqueda_filtro_DF

index,title,score,text,doi,journal,date,authors,url
153,Transmission of Influenza on International Fli...,7.787,Transmission of ILIs on board these aircraft c...,10.3201/eid1707.101135,Emerg Infect Dis,2011-07-16,"Foxwell, A. Ruth; Roberts, Leslee; Lokuge, Kam...",https://www.ncbi.nlm.nih.gov/pubmed/21762571/
11,Potential transmission of SARS-CoV-2 on a flig...,9.547599,"BACKGROUND: Between January 24, 2020 and Febru...",10.1016/j.tmaid.2020.101816,Travel Med Infect Dis,2020-07-06,"Chen, Junfang; He, Hanqing; Cheng, Wei; Liu, Y...",https://www.ncbi.nlm.nih.gov/pubmed/32645477/
10,Potential transmission of SARS-CoV-2 on a flig...,9.5476,"BACKGROUND: Between January 24, 2020 and Febru...",,Travel Med Infect Dis,2020,"Chen, Junfang; He, Hanqing; Cheng, Wei; Liu, Y...",
37,In-flight Transmission Cluster of COVID-19: A ...,9.3376,The copyright holder for this preprint this ve...,10.1101/2020.03.28.20040097,,2020-03-30,"Yang, Naibin; Shen, Yuefei; Shi, Chunwei; Ma, ...",https://doi.org/10.1101/2020.03.28.20040097
109,The Predictive Capacity of Air Travel Patterns...,8.3791,Infection during flights has been found to hav...,10.3390/ijerph17103356,Int J Environ Res Public Health,2020-05-12,"Christidis, Panayotis; Christodoulou, Aris",https://doi.org/10.3390/ijerph17103356
1,Potential transmission of SARS-CoV-2 on a flig...,9.7078,"In conclusion, a flight-related outbreak of CO...",10.1016/j.tmaid.2020.101816,Travel Med Infect Dis,2020-07-06,"Chen, Junfang; He, Hanqing; Cheng, Wei; Liu, Y...",https://www.ncbi.nlm.nih.gov/pubmed/32645477/
72,Estimating the Impact of Control Measures to P...,9.0149,There are several publications that suggest tr...,10.1101/2020.06.10.20127977,,2020-06-12,"Wilson, N.; Baker, M. G.; Eichner, M.",http://medrxiv.org/cgi/content/short/2020.06.1...
0,Potential transmission of SARS-CoV-2 on a flig...,9.751,Previous studies suggest that the basic reprod...,10.1016/j.tmaid.2020.101816,Travel Med Infect Dis,2020-07-06,"Chen, Junfang; He, Hanqing; Cheng, Wei; Liu, Y...",https://www.ncbi.nlm.nih.gov/pubmed/32645477/
45,In-flight Transmission Cluster of COVID-19: A ...,9.1024,Objectives: No data were available about in-fl...,10.1101/2020.03.28.20040097,,2020-03-30,"Yang, Naibin; Shen, Yuefei; Shi, Chunwei; Ma, ...",https://doi.org/10.1101/2020.03.28.20040097
30,Potential transmission of SARS-CoV-2 on a flig...,9.4196,Air travel for leisure and business purposes h...,10.1016/j.tmaid.2020.101816,Travel Med Infect Dis,2020-07-06,"Chen, Junfang; He, Hanqing; Cheng, Wei; Liu, Y...",https://www.ncbi.nlm.nih.gov/pubmed/32645477/


In [8]:
for index, row in busqueda_filtro_DF.iterrows():
    print(row["title"])
    print("####")
    print(row["text"])
    print("####")
    print(row["doi"])
    print("####")
    print(sentencias[index])
    print("")

Transmission of Influenza on International Flights, May 2009
####
Transmission of ILIs on board these aircraft clustered closely with a passenger who was symptomatic during the flight or may have been in contact with an infectious passenger for >15 minutes during the flight. This finding is similar to transmission of pandemic (H1N1) 2009 noted on a long-haul flight to New Zealand in 2009 (6). Recent studies on the transmission of pandemic (H1N1) 2009 in ferrets demonstrated preference for aerosol and droplet transmission of the virus (15,16). A similar study investigating the disease in a tour group in China indicated droplet transmission from coughing or talking with the index case-patient as being the main mode of transmission (17). The cabin in the A380-800 (flight 1) allows for a 10% wider seat in economy class than does the 747-400 (flight 2) (18), and modern ventilation systems in aircraft circulate air around bands of seat rows rather than the through length of the aircraft (19)

In [1]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-large-nli-mean-tokens')

In [6]:
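from scipy.spatial.distance import cosine  # cosine distance helper used below
# Cosine distance between a claim and its negation
arr_ = model.encode(['garlic is not a good option to fight against deseases','garlic is a good option to fight against deseases'])
cosine(arr_[0], arr_[1])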



0.40097153186798096
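
For reference, `scipy.spatial.distance.cosine` returns the cosine distance rather than the cosine similarity, so the value above should be read against the following definition, where $u$ and $v$ are the two sentence embeddings:

```latex
d(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert_2 \, \lVert v \rVert_2}, \qquad 0 \le d(u, v) \le 2
```

A distance of about 0.40 therefore corresponds to a cosine similarity of about 0.60 between the claim and its negation.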