# Building a Gismo for the Covid Dataset

This tutorial shows how to build a Gismo from a Covid Dataset. The Gismo is the base object that is used to analyze and summarize the dataset (see for example the Covid Summarizer tutorial).

## Retrieving the Covid-19 dataset zip archive

The dataset can be downloaded from the [Kaggle website](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) by clicking the Download button once you are registered. You will get a zip file.

We assume you downloaded the archive and that it is available in some directory (adjust the parameters below according to your own settings).

In [1]:
from pathlib import Path
DATASET_DIR = Path("../../../../../../Datasets/covid")
ARCHIVE = Path("CORD-19-research-challenge.zip")
(DATASET_DIR / ARCHIVE).exists()

True
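
Before loading anything, it can be handy to peek at what the archive contains. The following is a minimal sketch that relies only on the standard `zipfile` module; the helper name `peek_archive` is hypothetical and is not part of sisu.

```python
import zipfile
from pathlib import Path


def peek_archive(archive_path, limit=5):
    """Return the names of the first `limit` members of a zip archive."""
    with zipfile.ZipFile(archive_path) as zf:
        return zf.namelist()[:limit]


# Usage, assuming DATASET_DIR and ARCHIVE are defined as in the cell above:
# peek_archive(DATASET_DIR / ARCHIVE)
```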

## Loading the corpus from zip

sisu provides a simple interface to load the archive in the form of a list of dictionaries. For testing purposes, you can specify the number of documents you want to retrieve. Here we retrieve the first 1,000 documents.

In [2]:
from sisu.datasets.covid import load_from_zip
source = load_from_zip(file=ARCHIVE, data_path=DATASET_DIR, max_docs=1000)

In [3]:
len(source)

1000

Each entry contains by default 5 keys (this can be tuned):
- `title`
- `abstract`
- `content`
- `id`
- `lang`
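
Since later steps index entries by these keys, a quick sanity check can catch incomplete entries early. This is a sketch only: `EXPECTED_KEYS` and `missing_keys` are hypothetical names introduced here, and the toy entry stands in for a real document.

```python
# The five default keys listed above.
EXPECTED_KEYS = {"title", "abstract", "content", "id", "lang"}


def missing_keys(entry, expected=EXPECTED_KEYS):
    """Return the sorted list of expected keys that are absent from `entry`."""
    return sorted(expected - set(entry))


# Toy entry (not taken from the dataset) that only has a title.
incomplete = {"title": "Some title"}
absent = missing_keys(incomplete)
```

On the real data, `all(not missing_keys(e) for e in source)` would confirm that every entry carries the default keys.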

For example, here are the first 10 titles:

In [4]:
[e['title'] for e in source[:10]]

['The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for the production of infectious virus. 2 3',
 'Analysis Title: Regaining perspective on SARS-CoV-2 molecular tracing and its implications',
 'Healthcare-resource-adjusted vulnerabilities towards the 2019-nCoV epidemic across China',
 'Real-time, MinION-based, amplicon sequencing for lineage typing of infectious bronchitis virus from upper respiratory samples',
 'A Combined Evidence Approach to Prioritize Nipah Virus Inhibitors',
 'Assessing spread risk of Wuhan novel coronavirus within and beyond China, January-April 2020: a travel network-based modelling study',
 'TWIRLS, an automated topic-wise inference method based on massive literature, suggests a possible mechanism via ACE2 for the pathological changes in the human host after coronavirus infection',
 'Title: Viruses are a dominant driver of protein adaptation in mammals',
 'The impact of regular school closure on seasonal i

Statistics on the languages used:

In [5]:
from collections import Counter
Counter([e['lang'] for e in source])

Counter({'en': 998, 'fr': 2})

## Converting the archive into a FileSource

Loading the whole archive can take a lot of time and memory.

In [6]:
import time
start = time.perf_counter()
source = load_from_zip(file=ARCHIVE, data_path=DATASET_DIR)
print(f"Time to load {len(source)} articles: {(time.perf_counter()-start):.2f} seconds.")

Time to load 33375 articles: 188.65 seconds.


If you only need to browse the source linearly or to access a few elements, it is worth saving the source as a FileSource. A FileSource behaves essentially like a list, except that the data stays on the hard drive instead of being loaded in memory.

In [7]:
from gismo.filesource import FileSource, create_file_source
create_file_source(source=source, filename="covid", path=DATASET_DIR)

After the FileSource has been created, it can be used instead of the in-memory list.

In [8]:
from gismo.filesource import FileSource
del source
start = time.perf_counter()
source = FileSource(filename="covid", path=DATASET_DIR)
print(f"Time to load {len(source)} articles: {(time.perf_counter()-start):.2f} seconds.")

Time to load 33375 articles: 0.03 seconds.


The main difference is that you cannot use slice indexing, only plain integer indices or iteration.

In [9]:
[source[i]['title'] for i in range(10)] # use range to avoid slice indexing

['The RNA pseudoknots in foot-and-mouth disease virus are dispensable for genome replication but essential for the production of infectious virus. 2 3',
 'Analysis Title: Regaining perspective on SARS-CoV-2 molecular tracing and its implications',
 'Healthcare-resource-adjusted vulnerabilities towards the 2019-nCoV epidemic across China',
 'Real-time, MinION-based, amplicon sequencing for lineage typing of infectious bronchitis virus from upper respiratory samples',
 'A Combined Evidence Approach to Prioritize Nipah Virus Inhibitors',
 'Assessing spread risk of Wuhan novel coronavirus within and beyond China, January-April 2020: a travel network-based modelling study',
 'TWIRLS, an automated topic-wise inference method based on massive literature, suggests a possible mechanism via ACE2 for the pathological changes in the human host after coronavirus infection',
 'Title: Viruses are a dominant driver of protein adaptation in mammals',
 'The impact of regular school closure on seasonal i
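
When a source supports iteration but not slices, `itertools.islice` offers a slice-like alternative. The sketch below uses a hypothetical stand-in class, `IndexOnlySource`, so that it runs without the dataset; it only mimics the behaviour described above (length, integer indexing, iteration) and is not the actual FileSource implementation.

```python
from itertools import islice


class IndexOnlySource:
    """Stand-in source: supports len, integer indexing and iteration, but not slices."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, i):
        if isinstance(i, slice):
            raise TypeError("slice indexing is not supported")
        return self._items[i]

    def __iter__(self):
        return iter(self._items)


def head(source, n=10, start=0):
    """Return `n` items beginning at position `start`, by iterating instead of slicing."""
    return list(islice(source, start, start + n))


toy = IndexOnlySource({"title": f"Article {i}"} for i in range(100))
titles = [d["title"] for d in head(toy, n=3)]
```

Note that `islice` still walks through the first `start` items, so for a window far from the beginning, plain indexing over a `range` is likely the better choice.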

In [10]:
Counter([e['lang'] for e in source])

Counter({'en': 32590, 'fr': 372, 'xx': 51, 'es': 295, 'de': 67})

## Building a corpus

A gismo corpus is essentially a list with instructions about how to convert items of the list to a text that will be used for the embedding (the text does not have to be comprehensible for humans).

For example, we will build a list of English articles with a non-trivial title, abstract, and content. Note that we can close the source afterwards to avoid keeping an open file.

In [12]:
english_source = [d for d in source if 
                  len(d['abstract']) > 140 and 
                  len(d['title']) > 20 and
                  len(d['content']) > 200 and
                  d['lang']=='en']
source.close()
len(english_source)

23114

Now we can associate the source and a text function (we use a text sanitizer that will extract content and do some cleaning).

In [13]:
from gismo.corpus import Corpus
from sisu.preprocessing.tokenizer import to_text_sanitized

corpus = Corpus(source=english_source, to_text=to_text_sanitized)
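
To make the role of the text function concrete, here is a hypothetical alternative to `to_text_sanitized`. It assumes only what is described above: the function receives one item of the source and returns the string to embed. The name `title_and_abstract` and the toy entry are illustrative and not part of sisu or gismo.

```python
def title_and_abstract(doc):
    """Hypothetical text function: embed only the lowercased title and abstract."""
    return f"{doc['title']} {doc['abstract']}".lower()


# Toy source with the same keys as the dataset entries.
toy_source = [
    {"title": "Pangolin homology", "abstract": "A short abstract.", "content": "..."},
]
texts = [title_and_abstract(d) for d in toy_source]

# With gismo installed, the function would be plugged in like the sanitizer above:
# corpus = Corpus(source=english_source, to_text=title_and_abstract)
```

Ignoring the full content in this way would trade precision for a smaller and faster embedding.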

## Building the embedding

From the corpus, one can create the Gismo dual embedding of documents into words and words into documents. We will use some stopwords to avoid cluttering the embedding with common words that do not bring much information.

In [32]:
from sisu.preprocessing.language import EN_STOP_WORDS
covid_stop_words = ['preprint', 'copyright', 'holder', 'reuse', 'doi', 'reads', 'fig', 'figure']

from sklearn.feature_extraction.text import CountVectorizer
from gismo.embedding import Embedding
vectorizer = CountVectorizer(min_df=5, dtype=float, stop_words=EN_STOP_WORDS+covid_stop_words)
embedding = Embedding(vectorizer)
embedding.fit_transform(corpus)
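
The effect of `min_df` and `stop_words` on the vocabulary can be illustrated without scikit-learn. The sketch below is a deliberate simplification: it splits on whitespace, whereas `CountVectorizer` uses its own tokenizer, and the function name `vocabulary` is hypothetical. What it does reproduce is the rule that a word is kept only if it appears in at least `min_df` documents and is not a stop word.

```python
from collections import Counter


def vocabulary(texts, min_df=2, stop_words=()):
    """Keep words found in at least `min_df` texts and absent from `stop_words`."""
    stop = set(stop_words)
    doc_freq = Counter()
    for text in texts:
        # A set ensures each word is counted once per text (document frequency).
        doc_freq.update(set(text.lower().split()))
    return sorted(w for w, n in doc_freq.items() if n >= min_df and w not in stop)


toy_texts = ["the virus spreads", "the virus mutates", "the figure shows a bat"]
vocab = vocabulary(toy_texts, min_df=2, stop_words=["the"])
```

Here only `virus` survives: `the` is frequent but listed as a stop word, and every other word appears in a single text.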

In [15]:
embedding.x

<23114x92278 sparse matrix of type '<class 'numpy.float64'>'
	with 18690530 stored elements in Compressed Sparse Row format>

The embedding graph relates 23,114 articles to 92,278 words through a bipartite graph of 18,690,530 relationships.

Average number of unique words per document:

In [16]:
18690530/23114

808.6237777970061

Average number of documents in which a word of the vocabulary appears:

In [17]:
18690530/92278

202.5458939292139
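
Both averages are the number of document-word relationships divided by the size of one side of the bipartite graph. The toy example below, with made-up documents, shows the same computation in plain Python.

```python
# Toy bipartite graph: each document is mapped to its set of unique words.
docs = {
    "d1": {"virus", "bat", "host"},
    "d2": {"virus", "host"},
    "d3": {"pangolin"},
}

n_docs = len(docs)
n_words = len(set().union(*docs.values()))              # size of the vocabulary
n_links = sum(len(words) for words in docs.values())    # document-word relationships

avg_words_per_doc = n_links / n_docs    # 6 / 3
avg_docs_per_word = n_links / n_words   # 6 / 4
```

For a SciPy sparse matrix `x`, such as the one displayed above, the same three quantities are available as `x.shape[0]`, `x.shape[1]`, and `x.nnz`, which avoids copying the numbers by hand.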

## The Gismo

A Gismo is simply the combination of a corpus and an embedding.

In [33]:
from gismo.gismo import Gismo
gismo = Gismo(corpus, embedding)

Small example: for a given query, propose the titles of relevant articles.

In [20]:
gismo.post_documents_item = lambda g, i: g.corpus[i]['title']
def propose_titles(query):
    success = gismo.rank(query)
    if success:
        return gismo.get_documents_by_rank()
    else:
        print(f"Not found anything about: {query}!")

In [21]:
propose_titles("flklkfl")

Not found anything about: flklkfl!


In [22]:
propose_titles("pangolin")

['Pangolin homology associated with 2019-nCoV',
 'Probable Pangolin Origin of SARS-CoV-2 Associated with the COVID-19 Outbreak',
 'Evidence of recombination in coronaviruses implicating pangolin origins of nCoV- 2019',
 'Evidence of the Recombinant Origin and Ongoing Mutations in Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2)',
 'Spike protein recognition of mammalian ACE2 predicts the host range and an optimized ACE2 for SARS-CoV-2 infection',
 'Viral Metagenomics Revealed Sendai Virus and Coronavirus Infection of Malayan Pangolins (Manis javanica)',
 'Emergence of SARS-CoV-2 through Recombination and Strong Purifying Selection Short Title: Recombination and origin of SARS-CoV-2 One Sentence Summary: Extensive Recombination and Strong Purifying Selection among coronaviruses from different hosts facilitate the emergence of SARS-CoV-2',
 'Mutations, Recombination and Insertion in the Evolution of 2019-nCoV',
 'SARS-CoV-2, an evolutionary perspective of interaction with hum

In [23]:
propose_titles("platypus")

['Widespread Divergence of the CEACAM/PSG Genes in Vertebrates and Humans Suggests Sensitivity to Selection',
 'Coevolution of activating and inhibitory receptors within mammalian carcinoembryonic antigen families',
 'Phylogenetic Distribution of CMP-Neu5Ac Hydroxylase (CMAH), the Enzyme Synthetizing the Proinflammatory Human Xenoantigen Neu5Gc',
 'Immunoglobulin heavy chain diversity in Pteropid bats: evidence for a diverse and highly specific antigen binding repertoire',
 'Evolutionary Dynamics of the Interferon-Induced Transmembrane Gene Family in Vertebrates',
 'Evolution of vertebrate interferon inducible transmembrane proteins',
 'A Comprehensive Phylogenetic and Structural Analysis of the Carcinoembryonic Antigen (CEA) Gene Family',
 'Chiropteran types I and II interferon genes inferred from genome sequencing traces by a statistical gene-family assembler',
 'A novel fast vector method for genetic sequence comparison OPEN',
 'Alignment-free method for DNA sequence clustering usin

In [29]:
propose_titles("marseille")

['Respiratory viruses within homeless shelters in Marseille, France',
 'Epidemiology of respiratory pathogen carriage in the homeless population within two shelters in Marseille, France, 2015e2017: cross sectional 1-day surveys',
 'Incidence of Hajj-associated febrile cough episodes among French pilgrims: a prospective cohort study on the influence of statin use and risk factors',
 'Hydroxychloroquine and azithromycin as a treatment of COVID-19: results of an open- label non-randomized clinical trial',
 'Infectious disease symptoms and microbial carriage among French medical students travelling abroad: A prospective study',
 'Acquisition of respiratory viruses and presence of respiratory symptoms in French pilgrims during the 2016 Hajj: A prospective cohort study',
 'The VIZIER project: Preparedness against pathogenic RNA viruses',
 "French Hajj pilgrims' experience with pneumococcal infection and vaccination: A knowledge, attitudes and practice (KAP) evaluation",
 'Journal Pre-proof S

A Gismo can serve many purposes, which are presented in other tutorials. Note that you can save your Gismo for later use.

In [34]:
gismo.save(filename="covid_gismo", path=DATASET_DIR, compress=True, erase=True)