<h1>Angela Barriga - Assignment 3</h1>

<h2>
Motivation
</h2>

Since this assignment is an open task, I would like to start by introducing the problem I aimed to solve and why I thought it would be an interesting exercise.
I'm Spanish and I'm in Spain at the moment; sadly, we are the most infected country in Europe right now. While I've been lucky enough to have most of my family and friends safe and healthy, a couple of my best friends got the virus. They are fine and didn't require serious medical help, but they have lost their senses of smell and taste (anosmia and ageusia in medical terms). These symptoms are becoming more and more common (https://edition.cnn.com/2020/04/12/health/smell-taste-coronavirus-pandemic-wellness-intl-gbr/index.html), but there is very little research about them, and doctors don't know how long the senses stay lost or which treatment might work. My friends have already spent a month and a half with no smell or taste at all. Although they are otherwise healthy, it is a distressing situation that affects their daily lives and mental health, since they don't know when they will recover.

<h2>
    Goal
</h2>

I would like to identify a set of articles that relate to COVID-19 and to both anosmia and ageusia. I will discard papers that discuss the loss of only one of the senses, since most people lose both at the same time.

<h3>
    Starting point
</h3>

I start by loading the CSV obtained as output in this kernel: https://www.kaggle.com/xhlulu/cord-19-eda-parse-json-and-generate-clean-csv/
With this I can directly start working with the data in a clean format.

In [2]:
# Here I load the CSV files 

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import time
from heapq import nlargest 
import os

biorxiv = pd.read_csv("/data/ex3/biorxiv_clean.csv")
clean_comm_use = pd.read_csv("/data/ex3/clean_comm_use.csv")
clean_noncomm_use = pd.read_csv("/data/ex3/clean_noncomm_use.csv")
clean_pmc = pd.read_csv("/data/ex3/clean_pmc.csv")

<h3>
    Medical word embedding
</h3>

After some testing with the regular word2vec repository, I found that it offered few synonyms for anosmia and ageusia (for ageusia it didn't find any matches at all).
So I decided to look for a medical word vector.

I checked the following links. Most of the vectors failed to load due to memory errors, but I managed to load the fourth resource from the last link:

<ul>
    <li>https://github.com/ncbi-nlp/BioSentVec/</li>
    <li>https://github.com/epfml/sent2vec</li>
    <li>https://github.com/RaRe-Technologies/gensim-data/issues/5</li>
    <li>http://evexdb.org/pmresources/vec-space-models/</li>
</ul>
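
Looking up synonyms in a word embedding amounts to ranking the vocabulary by cosine similarity to the query word. The sketch below illustrates the idea with hypothetical three-dimensional toy vectors (the values in TOY_VECTORS are invented for illustration and are not taken from the PubMed model). It returns (word, score) pairs, the same shape gensim's most_similar returns, and an empty list for an out-of-vocabulary word.

```python
import math

# Hypothetical toy vectors, invented purely for illustration
TOY_VECTORS = {
    "anosmia": [0.9, 0.1, 0.0],
    "hyposmia": [0.8, 0.2, 0.1],
    "ageusia": [0.1, 0.9, 0.0],
    "fever": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Return the topn (word, score) pairs closest to `word`."""
    if word not in vectors:
        return []  # out-of-vocabulary query: nothing to rank
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


print(most_similar("anosmia", TOY_VECTORS))
```

A general-purpose model that has rarely or never seen a term such as ageusia has no useful vector for it, which is one plausible reason a model trained on biomedical text works better here.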

In [3]:
#Load the word vector

from gensim.test.utils import datapath
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(datapath("/data/biomodel/PubMed-w2v.bin"), binary=True)

<h3>
    Filtering papers
</h3>

Here I filter papers related to anosmia and ageusia, respectively. I take as keywords anosmia, smell, ageusia and taste (plus the closely related hyposmia, olfaction, hypogeusia and gustatory) and, for each keyword, retrieve its 5 most similar words from the word vector loaded previously (e.g., for anosmia the top-scored synonym is hyposmia). Then, I filter each set of papers using these synonyms. This ensures that terms related or very similar to the ones I'm looking for are also included in the search. I merge the results for anosmia and for ageusia separately, removing repeated papers, so that I can measure how many papers mention each symptom individually and how many mention both.

From here until the end of the code, I took some inspiration for operations and vector handling from:

<ul>
    <li>https://www.kaggle.com/saga21/cord-19-data-extraction-functions</li>
<li>https://www.kaggle.com/jihyeseo/word2vec-gensim-play-look-for-similar-words</li>
<li>https://www.kaggle.com/luisblanche/cord-19-use-doc2vec-to-match-articles-to-tasks</li>
</ul>
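
One caveat of the filtering function in the next cell is that it relies on plain, case-sensitive substring tests (`x in paper.text`), so a term can match inside a longer word and a capitalised occurrence can be missed. The following is a minimal sketch of a stricter alternative, assuming whole-word, case-insensitive matching is what is wanted; `contains_any_word` is a hypothetical helper, not part of the original notebook.

```python
import re


def contains_any_word(text, word_list):
    """True if any term in word_list occurs in text as a whole word, ignoring case."""
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in word_list) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Plain substring test: "taste" is found inside "distasteful"
print(any(w in "A distasteful result" for w in ["taste"]))
# Whole-word test: no match for "taste" here
print(contains_any_word("A distasteful result", ["taste"]))
# Case-insensitive: "Anosmia" at the start of a sentence still matches
print(contains_any_word("Anosmia was reported", ["anosmia"]))
```

Switching to this matcher would change the paper counts reported below, so it is best read as an option for tightening the search rather than as a description of how those counts were produced.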


In [4]:
#Extracts the words from a list of (word, score) pairs, as returned by most_similar
def dictKeystoList(data):
    keys = []
    for pair in data:
        keys.append(pair[0])
    return keys

#Filter papers using a list of words
def filter_papers_word_list(word_list, repository):
    papers_id_list = []
    for idx, paper in repository.iterrows():
        if any(x in paper.text for x in word_list):
            papers_id_list.append(paper.paper_id)

    return papers_id_list

#Getting top synonyms
anosmia_syn = model.most_similar(positive=['anosmia'], topn = 5)
hyposmia_syn = model.most_similar(positive=['hyposmia'], topn = 5)

smell_syn = model.most_similar(positive=['smell'], topn = 5)
olfaction_syn = model.most_similar(positive=['olfaction'], topn = 5)


#Merging all the results
smell_related_words = list(set().union(dictKeystoList(anosmia_syn), dictKeystoList(hyposmia_syn),
                        dictKeystoList(smell_syn), dictKeystoList(olfaction_syn)))

#Getting top synonyms
ageusia_syn = model.most_similar(positive=['ageusia'], topn = 5)
hypogeusia_syn = model.most_similar(positive=['hypogeusia'], topn = 5)

gustatory_syn = model.most_similar(positive=['gustatory'], topn = 5)
taste_syn = model.most_similar(positive=['taste'], topn = 5)

#Merging all the results
taste_related_words = list(set().union(dictKeystoList(ageusia_syn), dictKeystoList(hypogeusia_syn),
                        dictKeystoList(gustatory_syn), dictKeystoList(taste_syn)))

In [5]:
#Filtering for anosmia and ageusia related papers
biorxiv_smell_papers = filter_papers_word_list(smell_related_words, biorxiv)
biorxiv_taste_papers = filter_papers_word_list(taste_related_words, biorxiv)

comm_smell_papers = filter_papers_word_list(smell_related_words, clean_comm_use)
comm_taste_papers = filter_papers_word_list(taste_related_words, clean_comm_use)

noncomm_smell_papers = filter_papers_word_list(smell_related_words, clean_noncomm_use)
noncomm_taste_papers = filter_papers_word_list(taste_related_words, clean_noncomm_use)

pmc_smell_papers = filter_papers_word_list(smell_related_words, clean_pmc)
pmc_taste_papers = filter_papers_word_list(taste_related_words, clean_pmc)

#Intersection to get papers where both symptoms appear
biorxiv_both_papers = set(biorxiv_smell_papers).intersection(set(biorxiv_taste_papers))
comm_both_papers = set(comm_smell_papers).intersection(set(comm_taste_papers))
noncomm_both_papers = set(noncomm_smell_papers).intersection(set(noncomm_taste_papers))
pmc_both_papers = set(pmc_smell_papers).intersection(set(pmc_taste_papers))

#Synonyms for COVID-19, I did this manually, since the vector wouldn't give me good results
corona_words = ['coronavirus', 'covid-19', 'COVID-19', 'SARS-CoV-2', 'sars-cov-2']

#Filter papers related with COVID-19
biorxiv_c = filter_papers_word_list(corona_words, biorxiv)
comm_c = filter_papers_word_list(corona_words, clean_comm_use)
noncomm_c = filter_papers_word_list(corona_words, clean_noncomm_use)
pmc_c = filter_papers_word_list(corona_words, clean_pmc)

#Intersect papers related to both symptoms and those where COVID-19 synonyms appear
biorxiv_covid = set(biorxiv_c).intersection(set(biorxiv_both_papers))
comm_covid = set(comm_c).intersection(set(comm_both_papers))
noncomm_covid = set(noncomm_c).intersection(set(noncomm_both_papers))
pmc_covid = set(pmc_c).intersection(set(pmc_both_papers))

#Print the length of each set of papers
print("Total length of paper sets:")
print("BIORXIV PAPERS: ", len(biorxiv))
print("COMM PAPERS: ", len(clean_comm_use))
print("NONCOMM PAPERS: ", len(clean_noncomm_use))
print("PMC PAPERS: ", len(clean_pmc))
print()
print("Length of paper sets filtered by smell/taste:")
print("BIORXIV SMELL PAPERS: ",len(biorxiv_smell_papers))
print("BIORXIV TASTE PAPERS: ",len(biorxiv_taste_papers))
print("COMM SMELL PAPERS: ",len(comm_smell_papers))
print("COMM TASTE PAPERS: ",len(comm_taste_papers))
print("NONCOMM SMELL PAPERS: ",len(noncomm_smell_papers))
print("NONCOMM TASTE PAPERS: ",len(noncomm_taste_papers))
print("PMC SMELL PAPERS: ",len(pmc_smell_papers))
print("PMC TASTE PAPERS: ",len(pmc_taste_papers))
print()
print("Length of intersection taste and smell:")
print("BIORXIV BOTH PAPERS: ",len(biorxiv_both_papers))
print("COMM BOTH PAPERS: ",len(comm_both_papers))
print("NONCOMM BOTH PAPERS: ",len(noncomm_both_papers))
print("PMC BOTH PAPERS: ",len(pmc_both_papers))
print()
print("Length of intersection covid and symptoms:")
print("BIORXIV COVID PAPERS: ",len(biorxiv_covid))
print("COMM COVID PAPERS: ",len(comm_covid))
print("NONCOMM COVID PAPERS: ",len(noncomm_covid))
print("PMC COVID PAPERS: ",len(pmc_covid))

Total length of paper sets:
BIORXIV PAPERS:  885
COMM PAPERS:  9118
NONCOMM PAPERS:  2353
PMC PAPERS:  16959

Length of paper sets filtered by smell/taste:
BIORXIV SMELL PAPERS:  10
BIORXIV TASTE PAPERS:  8
COMM SMELL PAPERS:  147
COMM TASTE PAPERS:  162
NONCOMM SMELL PAPERS:  43
NONCOMM TASTE PAPERS:  41
PMC SMELL PAPERS:  803
PMC TASTE PAPERS:  611

Length of intersection taste and smell:
BIORXIV BOTH PAPERS:  4
COMM BOTH PAPERS:  36
NONCOMM BOTH PAPERS:  12
PMC BOTH PAPERS:  235

Length of intersection covid and symptoms:
BIORXIV COVID PAPERS:  1
COMM COVID PAPERS:  12
NONCOMM COVID PAPERS:  5
PMC COVID PAPERS:  146


<h3>
    Filtering results
</h3>

As can be seen, only a small fraction of the papers mention one of the symptoms, and the set shrinks further when requiring both at the same time. Also requiring COVID-19 terms reduces it even more, leaving a total of 164 papers out of the initial 29315. This supports what I explained in the motivation: the literature on this topic is quite scarce.
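
The totals quoted above can be checked directly from the printed counts. This short snippet only re-adds the numbers shown in the output; the dictionary names are mine.

```python
# Counts copied from the printed output above
totals = {"biorxiv": 885, "comm": 9118, "noncomm": 2353, "pmc": 16959}
covid_and_symptoms = {"biorxiv": 1, "comm": 12, "noncomm": 5, "pmc": 146}

n_total = sum(totals.values())
n_kept = sum(covid_and_symptoms.values())
share = 100 * n_kept / n_total  # percentage of papers that survive all filters

print(n_kept, "of", n_total, "papers,", round(share, 2), "%")
```

In other words, roughly half a percent of the corpus passes all three filters.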

<h3>
    Extracting conclusions
</h3>

I thought it could be interesting to display the conclusion section from those papers having one, since it's a faster way to know what was achieved in the papers. This extraction wasn't very fruitful: only 9 papers of the 164 have a conclusion section.
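
One possible reason for the low hit rate is that the extraction only recognises the exact heading "Conclusion" on its own line, so variants such as "Conclusions" or "CONCLUSION" are missed. Below is a minimal sketch of a more tolerant matcher, assuming the heading sits on a line of its own; `find_conclusion` and the list of accepted headings are hypothetical, and the function returns everything after the first matching heading.

```python
import re

# Accept a few common heading variants, in any letter case, alone on a line
HEADING = re.compile(
    r"^\s*(?:concluding remarks|conclusions?)\s*:?\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def find_conclusion(text):
    """Return the text after the first conclusion heading, or None if there is none."""
    match = HEADING.search(text)
    if match is None:
        return None
    return text[match.end():].strip()


print(find_conclusion("Introduction\n\nSome text.\n\nCONCLUSIONS\n\nWe found X."))
print(find_conclusion("Some text. In conclusion, nothing is a heading here."))
```

I have not re-run the notebook with this matcher, so whether it would recover more than 9 conclusions on this data remains untested.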

<h3>
    Extracting summaries
</h3>

I also thought it would be interesting to obtain a summary of each paper. To do this I follow the approach of this kernel: https://www.kaggle.com/saga21/cord-19-data-extraction-functions. In short, it computes a TF-IDF score for each word in the document and keeps the 10 sentences whose words score highest (sentences of 50 words or more are given a score of 0).
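
To make the scoring idea concrete, here is a dependency-free sketch on a toy text. It rests on one assumption worth stating: when the vectorizer is fitted on a single document, the inverse-document-frequency factor is the same for every word, so the word score reduces to a sublinear term frequency, 1 + ln(tf), followed by L2 normalisation. The tiny stop-word list and the toy text are invented for illustration, and the tokenisation only approximates scikit-learn's.

```python
import math
import re
from collections import Counter
from heapq import nlargest

STOP_WORDS = {"the", "a", "of", "and", "in", "is", "to"}  # tiny illustrative list


def word_scores(text):
    """Score each word as 1 + ln(tf), L2-normalised over the document."""
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower())
              if t not in STOP_WORDS]
    raw = {w: 1.0 + math.log(c) for w, c in Counter(tokens).items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}


def sentence_score(sentence, scores, max_words=50):
    """Sum the word scores of a sentence; overly long sentences score 0."""
    words = sentence.split()
    if len(words) >= max_words:
        return 0.0
    return sum(scores.get(w.lower(), 0.0) for w in words)


def summarize(text, n=2):
    """Return the n highest-scoring sentences, best first."""
    scores = word_scores(text)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return nlargest(n, sentences, key=lambda s: sentence_score(s, scores))


toy = ("Anosmia is the loss of smell. "
       "Patients report anosmia and ageusia together. "
       "The weather was nice.")
print(summarize(toy, n=2))
```

Because the score is a sum over words, longer sentences containing frequent terms tend to win, which is why very long sentences are excluded with a cutoff.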


In [6]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
#!pip3 install -U spacy
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

# Term frequency - inverse document frequency function
def tfidf(data):
    tfidf_vectorizer = TfidfVectorizer(analyzer='word', stop_words = 'english', sublinear_tf=True)
    matrix = tfidf_vectorizer.fit_transform(data)
    return matrix, tfidf_vectorizer


# Define the sentence scoring function
def get_sentence_score(sentence: str, word_scores: dict):
    words = sentence.split()
    if len(words) < 50:
        score = sum([word_scores.get(w.lower(),0) for w in words])
    else:
        score=0
    return score


# Summary extraction function
def extract_summary(df, paper_id):

    list_corpus = list(df[df.paper_id == paper_id].text)
    tfidf_matrix, tfidf_vectorizer = tfidf(list_corpus)
    # get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
    word_scores_df = pd.DataFrame(tfidf_matrix.toarray(), columns = tfidf_vectorizer.get_feature_names_out())   # extract a df with the words' scores
    word_scores = dict(zip(list(word_scores_df.columns), list(word_scores_df.iloc[0])))  # convert to dict

    # Split into sentences
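    # Use .str.replace for both substitutions; a plain Series.replace would only match whole cell values
    sentences_list = [nlp(s) for s in list(df[df.paper_id == paper_id].text.str.replace('\n', '.').str.replace('\r', '.'))]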
    sentences_list = str(sentences_list[0]).split('.')   # Split sentences by .
    sentences_scores = {}

    # Assign scores and join the top10 sentences into the final summary
    for s in sentences_list:
        sentences_scores[s] = get_sentence_score(s, word_scores)

    top10_sentences = nlargest(10, sentences_scores, key=sentences_scores.get)
    top10_sentences = [s for s in top10_sentences ]
    summary = ' '.join(top10_sentences)
        
    return summary

def extract_conclusion(df, papers_id_list):
    data = df.loc[df['paper_id'].isin(papers_id_list)].copy()  # copy, so the new column is not written to a slice of df
    conclusion = []
    for idx, paper in data.iterrows():
        paper_text = paper.text
        if "\nConclusion\n" in paper.text:
            conclusion.append(paper_text.split('\nConclusion\n')[1])
        else:
            conclusion.append("No Conclusion section")
    data['conclusion'] = conclusion
        
    return data


pd.reset_option('^display.', silent=True)

biorxiv_conclusion = extract_conclusion(biorxiv, biorxiv_covid)
print("Biorxiv papers with conclusion: ", len(biorxiv_conclusion[biorxiv_conclusion.conclusion != "No Conclusion section"]))

comm_conclusion = extract_conclusion(clean_comm_use, comm_covid)
print("Comm papers with conclusion: ", len(comm_conclusion[comm_conclusion.conclusion != "No Conclusion section"]))

noncomm_conclusion = extract_conclusion(clean_noncomm_use, noncomm_covid)
print("Noncomm papers with conclusion: ", len(noncomm_conclusion[noncomm_conclusion.conclusion != "No Conclusion section"]))

pmc_conclusion = extract_conclusion(clean_pmc, pmc_covid)
print("Pmc papers with conclusion: ", len(pmc_conclusion[pmc_conclusion.conclusion != "No Conclusion section"]))

Biorxiv papers with conclusion:  0
Comm papers with conclusion:  3
Noncomm papers with conclusion:  0
Pmc papers with conclusion:  6



<h3>
    Displaying papers
</h3>

Next, I filter papers containing terms related to regaining the senses of smell and taste and display them on screen as pandas DataFrames. This is useful for identifying specific papers quickly.

In [7]:
pd.options.display.max_colwidth = 50

def extract_pandas(df, papers_id_list):
    data = df.loc[df['paper_id'].isin(papers_id_list)].copy()  # copy, so columns can be added safely later
    return data

brxv_ids = filter_papers_word_list(["treatment", "cure", "reappearance", "regain"], biorxiv_conclusion)
brxv = extract_pandas(biorxiv_conclusion, brxv_ids)
brxv['summary'] = brxv['paper_id'].apply(lambda x: extract_summary(brxv, x))
brxv.head(50)

Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography,conclusion,summary
465,7fd337f16780aba8f1e599ce7516dc9b1d80a546,Neurological Manifestations of Hospitalized Pa...,"Ling Mao, Mengdie Wang, Shengcai Chen, Quanwei...","Ling Mao, Mengdie Wang, Shengcai Chen, Quanwei...",Abstract\n\nTo study the neurological manifest...,"Introduction\n\nIn December 2019, many unexpla...",A novel coronavirus from patients with pneumon...,"[{'first': 'Ling', 'middle': [], 'last': 'Mao'...","{'BIBREF0': {'ref_id': 'b0', 'title': 'A novel...",No Conclusion section,We found that patients with CNS symptoms had l...


In [8]:
nlp.max_length= 3157615

comm_ids = filter_papers_word_list(["treatment", "cure", "reappearance", "regain"], comm_conclusion)
comm = extract_pandas(comm_conclusion, comm_ids)
comm['summary'] = comm['paper_id'].apply(lambda x: extract_summary(comm, x))
comm.head(50)

Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography,conclusion,summary
741,3f1dd22fd366efca2448379f7becf69de113353a,"Randomized, double-blind, placebo- controlled ...","Pranab K Mukherjee, Frank Esper, Ken Buchheit,...","Pranab K Mukherjee, Frank Esper, Ken Buchheit,...",Abstract\n\nBackground: Current prevention opt...,Background\n\nUpper respiratory infections (UR...,Seasonal influenza in adults and children-diag...,"[{'first': 'Pranab', 'middle': ['K'], 'last': ...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Seasona...",No Conclusion section,[47] compared the efficacy of commercially av...
2109,a5293bb4f17ad25a72133cdd9eee8748dd6a4b8d,A1 Pirfenidone inhibits TGF-b1-induced extrace...,"Heung-Man Lee, Il-Ho Park, Jae-Min Shin","Heung-Man Lee (Guro Hospital), Il-Ho Park (Kor...",,\n\npurpose of this study was to demonstrate t...,Longterm treatment with fasudil improves bleom...,"[{'first': 'Heung-Man', 'middle': [], 'last': ...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Longter...",\nIn wheat allergy patients we need to know th...,"Young-Hee Nam, Dong Sub Jeon, Soo-Keol Lee Don..."
3223,c3f83fef0e35a95a6ff12ffacf58969db6172761,Ciliary Beating Recovery in Deficient Human Ai...,"B Chhin, D Negre, O Merrot, J Pham, Y Tourneur","B Chhin, D Negre, O Merrot, J Pham, Y Tourneur",Abstract\n\nPrimary Ciliary Dyskinesia is a he...,Introduction\n\nPrimary Ciliary Dyskinesia (PC...,"A human syndrome caused by immotile cilia, B A...","[{'first': 'B', 'middle': [], 'last': 'Chhin',...","{'BIBREF0': {'ref_id': 'b0', 'title': 'A human...",No Conclusion section,"Finally, we showed that DNAI1-mutated HAECs t..."
4033,24a8a71cbe308f92fd40a91b81fd4483363bd36b,The use of biochar in animal feeding Distribut...,"Hans-Peter Schmidt, Nikolas Hagemann, Kathleen...",Hans-Peter Schmidt (Ithaka Institute for Carbo...,"Abstract\n\nBiochar, that is, carbonized bioma...",INTRODUCTION\n\nBiochar is produced by pyrolys...,Influence of feedstock and pyrolysis temperatu...,"[{'first': 'Hans-Peter', 'middle': [], 'last':...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Influen...",No Conclusion section,The biochar treatment did not reveal any sign...
4646,ad459f664c704e1ac4f0f8d97d0e194914bb9601,molecules Essential Oils as Antimicrobial Agen...,"Katarzyna Wińska, Wanda Mączka, Jacek Łyczko, ...",Katarzyna Wińska (Wrocław University of Enviro...,Abstract\n\nHerbs and the essential oils deriv...,Introduction\n\nEssential oils (EOs) are defin...,"Natural Antimicrobial Agents, J.-M Mérillon, C...","[{'first': 'Katarzyna', 'middle': [], 'last': ...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Natural...",No Conclusion section,"For EO obtained from cinnamon bark, it is cha..."
4966,7120211f7eda128e441ed22e3e4d8d3eda07b771,Effect of oral care gel on the quality of life...,"Yumiko Nagao, Michio Sata","Yumiko Nagao, Michio Sata",Abstract\n\nBackground: Oral lichen planus (OL...,Background\n\nThere are 170 million chronic he...,Epidemiology and clinical aspects on hepatitis...,"[{'first': 'Yumiko', 'middle': [], 'last': 'Na...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Epidemi...",No Conclusion section,The patients marked on the line the point tha...
5477,49f674332aef03ed0e231eaed321de8cb65b6644,A Single Dose of Azithromycin Does Not Improve...,"Gabrielle B Mccallum, Peter S Morris, Mark D C...",Gabrielle B Mccallum (Charles Darwin Universit...,"Abstract\n\nObjective: Bronchiolitis, one of t...","Introduction\n\nWorldwide, bronchiolitis remai...","Lower respiratory tract infections, A B Chang,...","[{'first': 'Gabrielle', 'middle': ['B'], 'last...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Lower r...",\nIn children hospitalised with moderate to se...,In children hospitalised with moderate to seve...
5864,c27cd721cc62f4294aeb0159e587efb9391ce8ad,molecules Antioxidant and Anti-Inflammatory Ac...,"Maria Graça, Miguel","Maria Graça (IBB, Ed. 8, Campus de Gambelas, 8...",Abstract\n\nEssential oils are complex mixture...,Introduction\n\nAn essential oil is defined in...,Essential oils and volatiles: sample preparati...,"[{'first': 'Maria', 'middle': [], 'last': 'Gra...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Essenti...",No Conclusion section,Amomum tsao-ko oil with low antioxidant activi...
6784,04c80346278c76be38e11fba9ee1a0fccc13c57f,Candidiasis and other oral mucosal lesions dur...,"Yumiko Nagao, Kouji Hashimoto, Michio Sata","Yumiko Nagao, Kouji Hashimoto, Michio Sata",Abstract\n\nBackground: Oral lichen planus (OL...,Background\n\nHepatitis C is a major global pu...,Interferon therapy reduces the risk for hepato...,"[{'first': 'Yumiko', 'middle': [], 'last': 'Na...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Interfe...","\nIn conclusion, our data show that IFN therap...","On the other hand, in a Japanese Phase III tr..."
7212,59ac4eae51d6084c2346ebd27125a3b56e25d72e,Clinical Aspects of Feline Retroviruses: A Review,Katrin Hartmann,"Katrin Hartmann (LMU University of Munich, Vet...",Abstract\n\nFeline leukemia virus (FeLV) and f...,Introduction\n\nFeline leukemia virus (FeLV) a...,Report of the national felv/fiv awareness proj...,"[{'first': 'Katrin', 'middle': [], 'last': 'Ha...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Report ...",No Conclusion section,"Initially both, regressive and progressive in..."


In [9]:
noncomm_ids = filter_papers_word_list(["treatment", "cure", "reappearance", "regain"], noncomm_conclusion)
noncomm = extract_pandas(noncomm_conclusion, noncomm_ids)
noncomm['summary'] = noncomm['paper_id'].apply(lambda x: extract_summary(noncomm, x))
noncomm.head(50)

Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography,conclusion,summary
9,0b627ea5840e5e48709565840313984dc900af8d,Macroevolutionary immunology: a role for immun...,"Eric S Loker, Larry J Dishaw, Lynn Martin","Eric S Loker (University of New Mexico, 87131,...",Abstract\n\nAn emerging picture of the nature ...,INTRODUCTION\n\nRecent years have witnessed a ...,Is MHC enough for understanding wildlife immun...,"[{'first': 'Eric', 'middle': ['S'], 'last': 'L...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Is MHC ...",No Conclusion section,One general factor favoring this is the devel...
83,6e99aa3d7287b2e04d3c2a975ba94212158ac45c,,,,,\n\nThe purpose of this study was to investiga...,"3 School of Veterinary Medicine, University of...",[],"{'BIBREF0': {'ref_id': 'b0', 'title': '3 Schoo...",No Conclusion section,Samples with normal creatinine values were di...
585,5b0eeabeded14c9e02fc7bcfacb52a8fa0b3ae3c,Intrinsically Disordered Proteins e24684-1 Int...,Vladimir N Uversky,Vladimir N Uversky (University of South Florid...,Abstract\n\nThe ability of a protein to fold i...,Introduction\n\nIntrinsically disordered prote...,"Intrinsically disordered proteins from A to Z,...","[{'first': 'Vladimir', 'middle': ['N'], 'last'...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Intrins...",No Conclusion section,"74 Furthermore, C cap residues with polar sid..."
1069,0e18097a83b2801c4865c72065b7933fa24face3,Nonproliferative and Proliferative Lesions of ...,"Meg Ferrell Ramos, ( Co-Chair, Julia Baker, El...","Meg Ferrell Ramos (AbbVie, Inc, North Chicago,...",,Introduction\n\nThe INHAND Project (Internatio...,"Proliferative lesions of the eye in rats, OSS,...","[{'first': 'Meg', 'middle': ['Ferrell'], 'last...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Prolife...",No Conclusion section,"Atrophy of the inner retina (inner nuclear, in..."
1812,4b34166a75b4835a2743fabba60bf7ff23cf0087,Asthma associated with denatonium benzoate in ...,"Kou-Huang Chen, Kuo-Mou Chung, Ju-Hui Chung, K...","Kou-Huang Chen, Kuo-Mou Chung, Ju-Hui Chung, K...",Abstract\n\nRationale: Denatonium benzoate is ...,Introduction\n\nDenatonium benzoate is a usefu...,Occupational Safety and Health Administration....,"[{'first': 'Kou-Huang', 'middle': [], 'last': ...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Occupat...",No Conclusion section,Denatonium benzoate is a useful indicator to e...


In [10]:
pmc_ids = filter_papers_word_list(["treatment", "cure", "reappearance", "regain"], pmc_conclusion)
pmc = extract_pandas(pmc_conclusion, pmc_ids)
pmc['summary'] = pmc['paper_id'].apply(lambda x: extract_summary(pmc, x))
pd.set_option('display.max_rows', 150)
pmc.head(150)


Unnamed: 0,paper_id,title,authors,affiliations,abstract,text,bibliography,raw_authors,raw_bibliography,conclusion,summary
230,001916486b2012ccc611fe18d4763f9e81517cbb,Domestic Cats as Laboratory Animals,"Brenda Griffin, Henry J Baker","Brenda Griffin, Henry J Baker",,A. Unique Contributions of Cats to Biomedical ...,"Sphingomyelin lipidosis in a cat, H J Baker, V...","[{'first': 'Brenda', 'middle': [], 'last': 'Gr...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Sphingo...",No Conclusion section,Addition of cats of this type into facilities...
383,03f4b1b76ebfc66e6c1deb6d8a1c1524019acf4e,"A Review of Respiratory System Anatomy, Physio...",Melissa A Kling,Melissa A Kling (Mercer University School of M...,,\n\ncounterparts. Lack of genetic diversity wi...,Harkness and Wagner's biology and medicine of ...,"[{'first': 'Melissa', 'middle': ['A'], 'last':...","{'BIBREF0': {'ref_id': 'b0', 'title': ""Harknes...",No Conclusion section,"To accommodate for this high metabolic rate, ..."
490,cf936d1f9760d4c79afd790138c20e4ebeb25167,Infectious Diarrhea,Larry Goodman,Larry Goodman,Abstract\n\nIn Brief 269\n\n,Infectious Diarrhea\n\nAbstract-Infectious dia...,Closing the gap. Health Policy Task Force on I...,"[{'first': 'Larry', 'middle': [], 'last': 'Goo...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Closing...",No Conclusion section,Stool examination for ova and parasites and f...
755,1d825b62269d880d74b9c86622ebd565e3a15df8,CHAP TER 6 Integumentary System,"Elizabeth A Mauldin, Jeanine Peters-Kennedy","Elizabeth A Mauldin, Jeanine Peters-Kennedy",,\n\nmaintains hydration of the body by restric...,A study of the number and distribution of cuta...,"[{'first': 'Elizabeth', 'middle': ['A'], 'last...","{'BIBREF1': {'ref_id': 'b1', 'title': 'A study...",No Conclusion section,Other histopathologic features in donkeys inc...
1211,67aebb235254bde5b5fbbe42cf353d069ee63080,,,,,Introduction\n\nThis chapter considers the pri...,"USDA, , , 2007; USDA, , , 2007; Comprehensive ...",[],"{'BIBREF9': {'ref_id': 'b9', 'title': 'USDA', ...",No Conclusion section,Foals that ingest colostrum within the first ...
1243,9050e548cf6ad266a22fd009f7757fa3dfa126e7,"Respiratory Infections, Acute",Mark Rogan,Mark Rogan (Mater Misercordiae University Hosp...,,Introduction\n\nAcute respiratory tract infect...,Management of acute bacterial rhinosinusitis: ...,"[{'first': 'Mark', 'middle': [], 'last': 'Roga...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Managem...",\nAcute upper and lower respiratory tract infe...,The British Thoracic Society (BTS) criteria f...
1275,932bcdd2f6c6d487386e8bf983e4874387771be1,1 -18 Plenary lectures and concurrent symposia,"P C Doherty, S J Turner, A L Vescovi, H Wekerl...","P C Doherty, S J Turner, A L Vescovi, H Wekerl...",Abstract\n\nVirus-specific CD8+ T cell-mediate...,"\n\nInflammation is a local, protective respon...",Results: There was no difference concerning ag...,"[{'first': 'P', 'middle': ['C'], 'last': 'Dohe...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Results...",No Conclusion section,"Results: Lower clinical scores, less body wei..."
1699,d3286f3332c99b4459a23b555d5eb26cb3e96074,FERRET BEHAVIOR Natural Behavior and Domestica...,,,Abstract\n\nThe question as to why ferrets wer...,\n\nPolecats tend to be solitary and very terr...,Imprinting on prey odours in ferrets (Mustela ...,[],"{'BIBREF0': {'ref_id': 'b0', 'title': 'Imprint...",No Conclusion section,The fact that 60% of unfamiliar pairings did ...
1729,972d0d1d5953c25eba72b4d3899a739665fede37,Biology and Diseases of Rabbits,"Mark A Suckow, David W Brammer, Howard G Rush,...","Mark A Suckow, David W Brammer, Howard G Rush,...",,\n\nBunolagus monticularis (Bushman rabbit) Br...,Characterization of eaeA gene from rabbit ente...,"[{'first': 'Mark', 'middle': ['A'], 'last': 'S...","{'BIBREF0': {'ref_id': 'b0', 'title': 'Charact...",No Conclusion section,The relatively large body size and blood volu...
1732,941a9a3f0138f7c01ff930023a4aa428a18170e1,Synopses of Research Articles,,,,\n\ncomparing the genomes of related organisms...,The transcriptome of the intraerythrocytic dev...,[],"{'BIBREF0': {'ref_id': 'b0', 'title': 'The tra...",No Conclusion section,Using chemical inhibitors and mutant human T ...


<h3>
     Kaggle tasks
</h3>

Finally, I wanted to check whether any of my filtered papers could help answer any of the Kaggle tasks. To do so, I store the text describing each task and run a nearest-neighbours algorithm to find the papers most closely related to each task. I use the text and the abstract of each paper to measure its proximity to the tasks. In the end, 3 papers successfully match at least one task.
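
As a rough illustration of the matching idea, the sketch below ranks papers by cosine similarity between simple word-count vectors of the task text and of each paper. This is a simplified stand-in chosen to keep the example self-contained: the vectorisation, the function names and the toy papers are all assumptions of mine and may differ from the representation used in the actual cells.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: lowercase token -> count."""
    return Counter(re.findall(r"[a-z0-9-]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_papers(task_text, papers, k=1):
    """Return the ids of the k papers most similar to the task text."""
    task_vec = vectorize(task_text)
    ranked = sorted(papers,
                    key=lambda pid: cosine(task_vec, vectorize(papers[pid])),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy papers, invented for illustration
toy_papers = {
    "paper_a": "incubation period and transmission of the virus in humans",
    "paper_b": "taste receptors in laboratory rabbits",
}
toy_task = "What is known about transmission and incubation of the virus?"
print(nearest_papers(toy_task, toy_papers, k=1))
```

With raw counts, common words such as "the" and "of" contribute to the similarity, so in practice a stop-word list or a weighted or learned representation is preferable.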

In [11]:
task_1 = """What is known about transmission, incubation, and environmental stability of COVID-19? What do we know about natural history, transmission, and diagnostics for the virus? What have we learned about infection prevention and control?
Range of incubation periods for the disease in humans (and how this varies across age and health status) and how long individuals are contagious, even after recovery.
Prevalence of asymptomatic shedding and transmission (e.g., particularly children).
Seasonality of transmission.
Physical science of the coronavirus (e.g., charge distribution, adhesion to hydrophilic/phobic surfaces, environmental survival to inform decontamination efforts for affected areas and provide information about viral shedding).
Persistence and stability on a multitude of substrates and sources (e.g., nasal discharge, sputum, urine, fecal matter, blood).
Persistence of virus on surfaces of different materials (e.g., copper, stainless steel, plastic).
Natural history of the virus and shedding of it from an infected person
Implementation of diagnostics and products to improve clinical processes
Disease models, including animal models for infection, disease and transmission
Tools and studies to monitor phenotypic change and potential adaptation of the virus
Immune response and immunity
Effectiveness of movement control strategies to prevent secondary transmission in health care and community settings
Effectiveness of personal protective equipment (PPE) and its usefulness to reduce risk of transmission in health care and community settings
Role of the environment in transmission"""

task_2 = """What do we know about COVID-19 risk factors? What have we learned from epidemiological studies?
Data on potential risks factors
Smoking, pre-existing pulmonary disease
Co-infections (determine whether co-existing respiratory/viral infections make the virus more transmissible or virulent) and other co-morbidities
Neonates and pregnant women
Socio-economic and behavioral factors to understand the economic impact of the virus and whether there were differences.
Transmission dynamics of the virus, including the basic reproductive number, incubation period, serial interval, modes of transmission and environmental factors
Severity of disease, including risk of fatality among symptomatic hospitalized patients, and high-risk patient groups
Susceptibility of populations
Public health mitigation measures that could be effective for control"""

task_3 = """What do we know about virus genetics, origin, and evolution? What do we know about the virus origin and management measures at the human-animal interface?
Real-time tracking of whole genomes and a mechanism for coordinating the rapid dissemination of that information to inform the development of diagnostics and therapeutics and to track variations of the virus over time.
Access to geographic and temporal diverse sample sets to understand geographic distribution and genomic differences, and determine whether there is more than one strain in circulation. Multi-lateral agreements such as the Nagoya Protocol could be leveraged.
Evidence that livestock could be infected (e.g., field surveillance, genetic sequencing, receptor binding) and serve as a reservoir after the epidemic appears to be over.
Evidence of whether farmers are infected, and whether farmers could have played a role in the origin.
Surveillance of mixed wildlife- livestock farms for SARS-CoV-2 and other coronaviruses in Southeast Asia.
Experimental infections to test host range for this pathogen.
Animal host(s) and any evidence of continued spill-over to humans
Socioeconomic and behavioral risk factors for this spill-over
Sustainable risk reduction strategies"""

task_4 = """What do we know about vaccines and therapeutics? What has been published concerning research and development and evaluation efforts of vaccines and therapeutics?
Effectiveness of drugs being developed and tried to treat COVID-19 patients.
Clinical and bench trials to investigate less common viral inhibitors against COVID-19 such as naproxen, clarithromycin, and minocycline that may exert effects on viral replication.
Methods evaluating potential complication of Antibody-Dependent Enhancement (ADE) in vaccine recipients.
Exploration of use of best animal models and their predictive value for a human vaccine.
Capabilities to discover a therapeutic (not vaccine) for the disease, and clinical effectiveness studies to discover therapeutics, to include antiviral agents.
Alternative models to aid decision makers in determining how to prioritize and distribute scarce, newly proven therapeutics as production ramps up. This could include identifying approaches for expanding production capacity to ensure equitable and timely distribution to populations in need.
Efforts targeted at a universal coronavirus vaccine.
Efforts to develop animal models and standardize challenge studies
Efforts to develop prophylaxis clinical studies and prioritize in healthcare workers
Approaches to evaluate risk for enhanced disease after vaccination
Assays to evaluate vaccine immune response and process development for vaccines, alongside suitable animal models [in conjunction with therapeutics]"""


task_5 = """What do we know about the effectiveness of non-pharmaceutical interventions? What is known about equity and barriers to compliance for non-pharmaceutical interventions?
Guidance on ways to scale up NPIs in a more coordinated way (e.g., establish funding, infrastructure and authorities to support real time, authoritative (qualified participants) collaboration with all states to gain consensus on consistent guidance and to mobilize resources to geographic areas where critical shortfalls are identified) to give us time to enhance our health care delivery system capacity to respond to an increase in cases.
Rapid design and execution of experiments to examine and compare NPIs currently being implemented. DHS Centers for Excellence could potentially be leveraged to conduct these experiments.
Rapid assessment of the likely efficacy of school closures, travel bans, bans on mass gatherings of various sizes, and other social distancing approaches.
Methods to control the spread in communities, barriers to compliance and how these vary among different populations.
Models of potential interventions to predict costs and benefits that take account of such factors as race, income, disability, age, geographic location, immigration status, housing status, employment status, and health insurance status.
Policy changes necessary to enable the compliance of individuals with limited resources and the underserved with NPIs.
Research on why people fail to comply with public health advice, even if they want to do so (e.g., social or financial costs may be too high).
Research on the economic impact of this or any pandemic. This would include identifying policy and programmatic alternatives that lessen/mitigate risks to critical government services, food distribution and supplies, access to critical household supplies, and access to health diagnoses, treatment, and needed care, regardless of ability to pay."""

task_6 = """What do we know about diagnostics and surveillance? What has been published concerning systematic, holistic approach to diagnostics (from the public health surveillance perspective to being able to predict clinical outcomes)?
How widespread current exposure is to be able to make immediate policy recommendations on mitigation measures. Denominators for testing and a mechanism for rapidly sharing that information, including demographics, to the extent possible. Sampling methods to determine asymptomatic disease (e.g., use of serosurveys (such as convalescent samples) and early detection of disease (e.g., use of screening of neutralizing antibodies such as ELISAs).
Efforts to increase capacity on existing diagnostic platforms and tap into existing surveillance platforms.
Recruitment, support, and coordination of local expertise and capacity (public, private—commercial, and non-profit, including academic), including legal, ethical, communications, and operational issues.
National guidance and guidelines about best practices to states (e.g., how states might leverage universities and private laboratories for testing purposes, communications to public health officials and the public).
Development of a point-of-care test (like a rapid influenza test) and rapid bed-side tests, recognizing the tradeoffs between speed, accessibility, and accuracy.
Rapid design and execution of targeted surveillance experiments calling for all potential testers using PCR in a defined area to start testing and report to a specific entity. These experiments could aid in collecting longitudinal samples, which are critical to understanding the impact of ad hoc local interventions (which also need to be recorded).
Separation of assay development issues from instruments, and the role of the private sector to help quickly migrate assays onto those devices.
Efforts to track the evolution of the virus (i.e., genetic drift or mutations) and avoid locking into specific reagents and surveillance/detection schemes.
Latency issues and when there is sufficient viral load to detect the pathogen, and understanding of what is needed in terms of biological and environmental sampling.
Use of diagnostics such as host response markers (e.g., cytokines) to detect early disease or predict severe disease progression, which would be important to understanding best clinical practice and efficacy of therapeutic interventions.
Policies and protocols for screening and testing.
Policies to mitigate the effects on supplies associated with mass testing, including swabs and reagents.
Technology roadmap for diagnostics.
Barriers to developing and scaling up new diagnostic tests (e.g., market forces), how future coalition and accelerator models (e.g., Coalition for Epidemic Preparedness Innovations) could provide critical funding for diagnostics, and opportunities for a streamlined regulatory environment.
New platforms and technology (e.g., CRISPR) to improve response times and employ more holistic approaches to COVID-19 and future diseases.
Coupling genomics and diagnostic testing on a large scale.
Enhance capabilities for rapid sequencing and bioinformatics to target regions of the genome that will allow specificity for a particular variant.
Enhance capacity (people, technology, data) for sequencing with advanced analytics for unknown pathogens, and explore capabilities for distinguishing naturally-occurring pathogens from intentional.
One Health surveillance of humans and potential sources of future spillover or ongoing exposure for this organism and future pathogens, including both evolutionary hosts (e.g., bats) and transmission hosts (e.g., heavily trafficked and farmed wildlife and domestic food and companion species), inclusive of environmental, demographic, and occupational risk factors.
        """

task_7 = """What has been published about medical care? What has been published concerning surge capacity and nursing homes? What has been published concerning efforts to inform allocation of scarce resources? What do we know about personal protective equipment? What has been published concerning alternative methods to advise on disease management? What has been published concerning processes of care? What do we know about the clinical characterization and management of the virus?
Resources to support skilled nursing facilities and long term care facilities.
Mobilization of surge medical staff to address shortages in overwhelmed communities
Age-adjusted mortality data for Acute Respiratory Distress Syndrome (ARDS) with/without other organ failure – particularly for viral etiologies
Extracorporeal membrane oxygenation (ECMO) outcomes data of COVID-19 patients
Outcomes data for COVID-19 after mechanical ventilation adjusted for age.
Knowledge of the frequency, manifestations, and course of extrapulmonary manifestations of COVID-19, including, but not limited to, possible cardiomyopathy and cardiac arrest.
Application of regulatory standards (e.g., EUA, CLIA) and ability to adapt care to crisis standards of care level.
Approaches for encouraging and facilitating the production of elastomeric respirators, which can save thousands of N95 masks.
Best telemedicine practices, barriers and facilitators, and specific actions to remove/expand them within and across state boundaries.
Guidance on the simple things people can do at home to take care of sick people and manage disease.
Oral medications that might potentially work.
Use of AI in real-time health care delivery to evaluate interventions, risk factors, and outcomes in a way that could not be done manually.
Best practices and critical challenges and innovative solutions and technologies in hospital flow and organization, workforce protection, workforce allocation, community-based support resources, payment, and supply chain management to enhance capacity, efficiency, and outcomes.
Efforts to define the natural history of disease to inform clinical care, public health interventions, infection prevention control, transmission, and clinical trials
Efforts to develop a core clinical outcome set to maximize usability of data across a range of trials
Efforts to determine adjunctive and supportive interventions that can improve the clinical outcomes of infected patients (e.g. steroids, high flow oxygen)
"""

task_8 = """What has been published concerning ethical considerations for research? What has been published concerning social sciences at the outbreak response?
Efforts to articulate and translate existing ethical principles and standards to salient issues in COVID-2019
Efforts to embed ethics across all thematic areas, engage with novel ethical issues that arise and coordinate to minimize duplication of oversight
Efforts to support sustained education, access, and capacity building in the area of ethics
Efforts to establish a team at WHO that will be integrated within multidisciplinary research and operational platforms and that will connect with existing and expanded global networks of social sciences.
Efforts to develop qualitative assessment frameworks to systematically collect information related to local barriers and enablers for the uptake and adherence to public health measures for prevention and control. This includes the rapid identification of the secondary impacts of these measures. (e.g. use of surgical masks, modification of health seeking behaviors for SRH, school closures)
Efforts to identify how the burden of responding to the outbreak and implementing public health measures affects the physical and psychological health of those providing care for Covid-19 patients and identify the immediate needs that must be addressed.
Efforts to identify the underlying drivers of fear, anxiety and stigma that fuel misinformation and rumor, particularly through social media.
"""

task_9 = """What has been published about information sharing and inter-sectoral collaboration? What has been published about data standards and nomenclature? What has been published about governmental public health? What do we know about risk communication? What has been published about communicating with high-risk populations? What has been published to clarify community measures? What has been published about equity considerations and problems of inequity?
Methods for coordinating data-gathering with standardized nomenclature.
Sharing response information among planners, providers, and others.
Understanding and mitigating barriers to information-sharing.
How to recruit, support, and coordinate local (non-Federal) expertise and capacity relevant to public health emergency response (public, private, commercial and non-profit, including academic).
Integration of federal/state/local public health surveillance systems.
Value of investments in baseline public health response infrastructure preparedness
Modes of communicating with target high-risk populations (elderly, health care workers).
Risk communication and guidelines that are easy to understand and follow (include targeting at risk populations’ families too).
Communication that indicates potential risk of disease to all population groups.
Misunderstanding around containment and mitigation.
Action plan to mitigate gaps and problems of inequity in the Nation’s public health capability, capacity, and funding to ensure all citizens in need are supported and can access information, surveillance, and treatment.
Measures to reach marginalized and disadvantaged populations.
Data systems and research priorities and agendas incorporate attention to the needs and circumstances of disadvantaged populations and underrepresented minorities.
Mitigating threats to incarcerated people from COVID-19, assuring access to information, prevention, diagnosis, and treatment.
Understanding coverage policies (barriers and opportunities) related to testing, treatment, and care
"""
list_of_tasks = [task_1, task_2, task_3, task_4, task_5, task_6, task_7, task_8, task_9]
complete_df = pd.concat([brxv, comm, noncomm, pmc])

import gensim

def read_corpus(df, column, tokens_only=False):
    for i, line in enumerate(df[column]):
        
        tokens = gensim.parsing.preprocess_string(line)
        if tokens_only:
            yield tokens
        else:
            # For training data, add tags
            yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

def get_doc_vector(doc):
    tokens = gensim.parsing.preprocess_string(doc)
    vector = model.infer_vector(tokens)
    return vector
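
To make the shape of the training data clearer, here is a minimal sketch of what `read_corpus` yields. It does not use gensim: `Tagged` and `simple_tokens` are hypothetical stand-ins for `TaggedDocument` and `preprocess_string` (the real preprocessing also stems words and removes stopwords), and the example texts are made up. The point it illustrates is that tags restart at 0 on every call, so concatenating two corpora built from the same DataFrame gives two documents per tag.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim.models.doc2vec.TaggedDocument (words, tags)
Tagged = namedtuple("Tagged", ["words", "tags"])

def simple_tokens(text):
    # Crude stand-in for gensim.parsing.preprocess_string: lowercase and split only
    return text.lower().split()

def read_corpus_sketch(texts, tokens_only=False):
    # Same control flow as read_corpus, but over a plain list of strings
    for i, line in enumerate(texts):
        tokens = simple_tokens(line)
        yield tokens if tokens_only else Tagged(tokens, [i])

# Made-up example rows: one body text and one abstract per paper
texts = ["Loss of smell", "Loss of taste"]
abstracts = ["Anosmia report", "Ageusia report"]

combined = list(read_corpus_sketch(texts)) + list(read_corpus_sketch(abstracts))
tags = [doc.tags[0] for doc in combined]
distinct_tags = sorted(set(tags))
print(tags)
print(distinct_tags)
```

Because each paper's text and abstract share one tag, a model trained on such a list would hold one vector per paper rather than one per document in the list.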

In [12]:
from sklearn.neighbors import NearestNeighbors

# Work on an explicit copy so that adding a column later does not trigger SettingWithCopyWarning
train_df = complete_df.dropna(subset=['abstract']).copy()
train_summary = list(read_corpus(train_df, 'text')) + list(read_corpus(train_df, 'abstract'))


model = gensim.models.doc2vec.Doc2Vec(dm=1, vector_size=100, min_count=2, epochs=20, seed=42, workers=3)
model.build_vocab(train_summary)
model.train(train_summary, total_examples=model.corpus_count, epochs=model.epochs)

# Gensim 3.x attribute; in Gensim 4+ the same array is exposed as model.dv.vectors
summary_vectors = model.docvecs.vectors_docs
array_of_tasks = [get_doc_vector(task) for task in list_of_tasks]

train_df['summary_vector'] = [vec for vec in summary_vectors]
train_array = train_df['summary_vector'].values.tolist()

ball_tree = NearestNeighbors(algorithm='ball_tree', leaf_size=20).fit(train_array)

distances, indices = ball_tree.kneighbors(array_of_tasks, n_neighbors=2)

for i, info in enumerate(list_of_tasks):
    print("Task ", i+1, "= ", info[:100])
    df =  train_df.iloc[indices[i]]
    abstracts = df['summary']
    titles = df['title']
    dist = distances[i]
    for l in range(len(dist)):
        print("Text Index ", indices[i][l])
        print("Distance to task ",distances[i][l])
        print("Title ",titles.iloc[l])
        print("Summary ",abstracts.iloc[l])
        print()

Task  1 =  What is known about transmission, incubation, and environmental stability of COVID-19? What do we kn
Text Index  5
Distance to task  13.259137823313166
Title  Effect of oral care gel on the quality of life for oral lichen planus in patients with chronic HCV infection
Summary   The patients marked on the line the point that they felt represented their perception of their current state, such as dry mouth, breath odor, oral freshness, oral pain during rest, oral pain at a mealtimes, taste disorder, loss of appetite, sleep disorder, depressive mood and jitteriness This study included nine Japanese patients (4 males and 5 females) with OLP who were positive for HCV antibody (anti-HCV) and who visited our clinic at the Kurume University Hospital in Japan from November 2, 2011 to November 16, 2011  REFRECARE-H ® is an oral care gel (therapeutic dentifrice) containing hinokitiol which can remove stains on the teeth and general oral debris, and is effective in the prevention of breat
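
The lookup above relies on `NearestNeighbors.kneighbors`, which with its default settings returns Euclidean distances and row positions sorted from closest to farthest. As a rough illustration of that behaviour, here is a brute-force sketch in plain Python. The function name `k_nearest` and the toy vectors are my own assumptions for the example, and a ball tree reaches the same answer more efficiently on large inputs.

```python
import math

def k_nearest(points, query, k=2):
    """Return (distances, indices) of the k points closest to query (Euclidean)."""
    # Score every point, then sort by distance (ties broken by index)
    scored = sorted((math.dist(p, query), i) for i, p in enumerate(points))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy 2-D "document vectors" (made-up values, for illustration only)
doc_vectors = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [10.0, 10.0]]
task_vector = [0.0, 0.0]

distances, indices = k_nearest(doc_vectors, task_vector, k=2)
print(indices, distances)
```

The returned indices are positions in the input list, which is why the notebook reads the matching rows with `train_df.iloc[...]` rather than by label.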



<h2>
    Conclusions
</h2>

Apart from achieving the initial goal of identifying a set of papers that could shed some light on anosmia and ageusia treatment, I wanted to try to draw some conclusions myself:

<ul>
    <li>More research about these symptoms is needed.</li>
    <li>Treatments used in other respiratory infections could be useful.</li>
    <li>New studies should not only look into the respiratory and gustatory systems; the loss of these senses appears to stem from neurological damage in some cases.</li>
    <li>Some patients with similar conditions improved after 30 days of REFRECARE-H treatment.</li>
    <li>Animal testing seems to be successful when studying similar cases.</li>
</ul>