# Lab: exploring the CORD-19 dataset

This lab uses the CORD-19 dataset from this year's assignment, specifically the 2020-03-27 release. We'll replicate and
extend the work done in the demo, to show how you might analyze the results you get and how you might take the work
further (whether for the assignment or for a future client). Note that you will not be allowed to submit this work as an
extension of your assignments.

### Naive sentiment analysis

--> 1. If you have not yet loaded the 2020-03-27 CORD-19 dataset, you should create a table out of it (e.g. via the UI). You
shouldn't need to do any pre-processing for this lab.

In [0]:
dbutils.fs.ls("dbfs:/FileStore/tables/CORD19/")

--> 2. Create a DataFrame from the data you loaded in Step 1. The cell below reads the uploaded CSV file directly.

In [0]:
# File location and type
file_location = 'dbfs:/FileStore/tables/CORD19/metadata_2020_03_27.csv'
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df_cord = spark.read.format(file_type) \
               .option("inferSchema", infer_schema) \
               .option("header", first_row_is_header) \
               .option("quote", "\"") \
               .option("escape", "\"") \
               .option("sep", delimiter) \
               .load(file_location)

# Inspect the inferred schema, then display the DataFrame
df_cord.printSchema()
display(df_cord)

cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,full_text_file,url
vho70jcx,f056da9c64fbf00a4645ae326e8a4339d015d155,biorxiv,SIANN: Strain Identification by Alignment to Near Neighbors,10.1101/001727,,,biorxiv,"Next-generation sequencing is increasingly being used to study samples composed of mixtures of organisms, such as in clinical applications where the presence of a pathogen at very low abundance may be highly important. We present an analytical method (SIANN: Strain Identification by Alignment to Near Neighbors) specifically designed to rapidly detect a set of target organisms in mixed samples that achieves a high degree of species- and strain-specificity by aligning short sequence reads to the genomes of near neighbor organisms, as well as that of the target. Empirical benchmarking alongside the current state-of-the-art methods shows an extremely high Positive Predictive Value, even at very low abundances of the target organism in a mixed sample. SIANN is available as an Illumina BaseSpace app, as well as through Signature Science, LLC. SIANN results are presented in a streamlined report designed to be comprehensible to the non-specialist user, providing a powerful tool for rapid species detection in a mixed sample. By focusing on a set of (customizable) target organisms and their near neighbors, SIANN can operate quickly and with low computational requirements while delivering highly accurate results.",2014-01-10,Samuel Minot; Stephen D Turner; Krista L Ternus; Dana R Kadavy,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/001727
i9tbix2v,daf32e013d325a6feb80e83d15aabc64a48fae33,biorxiv,Spatial epidemiology of networked metapopulation: An overview,10.1101/003889,,,biorxiv,"An emerging disease is one infectious epidemic caused by a newly transmissible pathogen, which has either appeared for the first time or already existed in human populations, having the capacity to increase rapidly in incidence as well as geographic range. Adapting to human immune system, emerging diseases may trigger large-scale pandemic spreading, such as the transnational spreading of SARS, the global outbreak of A(H1N1), and the recent potential invasion of avian influenza A(H7N9). To study the dynamics mediating the transmission of emerging diseases, spatial epidemiology of networked metapopulation provides a valuable modeling framework, which takes spatially distributed factors into consideration. This review elaborates the latest progresses on the spatial metapopulation dynamics, discusses empirical and theoretical findings that verify the validity of networked metapopulations, and the application in evaluating the effectiveness of disease intervention strategies as well.",2014-06-04,Lin WANG; Xiang Li,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/003889
62gfisc6,f33c6d94b0efaa198f8f3f20e644625fa3fe10d2,biorxiv,Sequencing of the human IG light chain loci from a hydatidiform mole BAC library reveals locus-specific signatures of genetic diversity,10.1101/006866,,,biorxiv,"Germline variation at immunoglobulin gene (IG) loci is critical for pathogen-mediated immunity, but establishing complete reference sequences in these regions is problematic because of segmental duplications and somatically rearranged source DNA. We sequenced BAC clones from the essentially haploid hydatidiform mole, CHM1, across the light chain IG loci, kappa (IGK) and lambda (IGL), creating single haplotype representations of these regions. The IGL haplotype is 1.25Mb of contiguous sequence with four novel V gene and one novel C gene alleles and an 11.9kbp insertion. The IGK haplotype consists of two 644kbp proximal and 466kbp distal contigs separated by a gap also present in the reference genome sequence. Our effort added an additional 49kbp of unique sequence extending into this gap. The IGK haplotype contains six novel V gene and one novel J gene alleles and a 16.7kbp region with increased sequence identity between the two IGK contigs, exhibiting signatures of interlocus gene conversion. Our data facilitated the first comparison of nucleotide diversity between the light and IG heavy (IGH) chain haplotypes within a single genome, revealing a three to six fold enrichment in the IGH locus, supporting the theory that the heavy chain may be more important in determining antigenic specificity.",2014-07-03,Corey T Watson; Karyn Meltz Steinberg; Tina A Graves-Lindsay; Rene L Warren; Maika Malig; Jacqueline E Schein; Richard K Wilson; Rob Holt; Evan Eichler; Felix Breden,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/006866
058r9486,4da8a87e614373d56070ed272487451266dce919,biorxiv,Bayesian mixture analysis for metagenomic community profiling.,10.1101/007476,,,biorxiv,"Deep sequencing of clinical samples is now an established tool for the detection of infectious pathogens, with direct medical applications. The large amount of data generated provides an opportunity to detect species even at very low levels, provided that computational tools can effectively interpret potentially complex metagenomic mixtures. Data interpretation is complicated by the fact that short sequencing reads can match multiple organisms and by the lack of completeness of existing databases, in particular for viral pathogens. This interpretation problem can be formulated statistically as a mixture model, where the species of origin of each read is missing, but the complete knowledge of all species present in the mixture helps with the individual reads assignment. Several analytical tools have been proposed to approximately solve this computational problem. Here, we show that the use of parallel Monte Carlo Markov chains (MCMC) for the exploration of the species space enables the identification of the set of species most likely to contribute to the mixture. The added accuracy comes at a cost of increased computation time. Our approach is useful for solving complex mixtures involving several related species. We designed our method specifically for the analysis of deep transcriptome sequencing datasets and with a particular focus on viral pathogen detection, but the principles are applicable more generally to all types of metagenomics mixtures. The work is implemented as a user friendly R package, available from CRAN: http://cran.r-project.org/web/packages/metaMix",2014-07-25,Sofia Morfopoulou; Vincent Plagnol,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/007476
wich35l7,eccef80cfbe078235df22398f195d5db462d8000,biorxiv,Mapping a viral phylogeny onto outbreak trees to improve host transmission inference,10.1101/010389,,,biorxiv,"Developing methods to reconstruct transmission histories for viral outbreaks could provide critical information to support locating sources of disease transmission. Phylogenetic methods used to measure the degree of relatedness among sequenced viral samples have proven useful in identifying potential outbreak sources. The complex nature of infectious disease, however, makes it difficult to assign a rigorously defined quantitative confidence value assessing the likelihood of a true direct transmission event using genetic data alone. A new method is presented to calculate a confidence value assessing the likelihood of a transmission event using both phylogenetic inference and limited knowledge of incubation and infectious duration times. The method is applied to simulations of a foot and mouth disease (FMD) outbreak to demonstrate how the combination of both phylogenetic and epidemiology data can be used to strengthen the assessment of the likelihood of direct transmission over methods using just phylogenetic data or infection timing data alone. The method is applied to a previous FMD outbreak to identify areas where over confidence in previously inferred direct transmission may exist. Combining knowledge from viral evolution and epidemiology within a single integrated transmission inference framework is an important approach to assess the potential likelihood of transmission events and makes clear how specific features of a virus' spread through the course of an outbreak will directly determine the potential for confidence in inferred host transmission links.",2014-11-11,Stephen P Velsko; Jonathan E Allen,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/010389
z3tgnzth,c41fdb2efd6d61384a92a84cbba3f8233629a41b,biorxiv,The infant airway microbiome in health and disease impacts later asthma development,10.1101/012070,,,biorxiv,"The nasopharynx (NP) is a reservoir for microbes associated with acute respiratory illnesses (ARI). The development of asthma is initiated during infancy, driven by airway inflammation associated with infections. Here, we report viral and bacterial community profiling of NP aspirates across a birth cohort, capturing all lower respiratory illnesses during their first year. Most infants were initially colonized with Staphylococcus or Corynebacterium before stable colonization with Alloiococcus or Moraxella, with transient incursions of Streptococcus, Moraxella or Haemophilus marking virus-associated ARIs. Our data identify the NP microbiome as a determinant for infection spread to the lower airways, severity of accompanying inflammatory symptoms, and risk for future asthma development. Early asymptomatic colonization with Streptococcus was a strong asthma predictor, and antibiotic usage disrupted asymptomatic colonization patterns.",2014-12-02,Shu Mei Teo; Danny Mok; Kym Pham; Merci Kusel; Michael Serralha; Niamh Troy; Barbara J Holt; Belinda J Hales; Michael L Walker; Elysia Hollams; Yury H Bochkov; Kristine Grindle; Sebastian L Johnston; James E Gern; Peter D Sly; Patrick G Holt; Kathryn E Holt; Michael Inouye,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/012070
1xxrnpg3,1dd898b5ca1ae70ec0e3cad89fc87a165002a99e,biorxiv,Using heterogeneity in the population structure of U.S. swine farms to compare transmission models for porcine epidemic diarrhoea,10.1101/017178,,,biorxiv,"ABSTRACTIn 2013, U.S. swine producers were confronted with the disruptive emergence of porcine epidemic diarrhoea (PED). Movement of animals among farms is hypothesised to have played a role in the spread of PED among farms. Via this or other mechanisms, the rate of spread may also depend on the geographic density of farms and climate. To evaluate such effects on a large scale, we analyse state-level counts of outbreaks with variables describing the distribution of farm sizes and types, aggregate flows of animals among farms, and an index of climate. Our first main finding is that it is possible for a correlation analysis to be sensitive to transmission model parameters. This finding is based on a global sensitivity analysis of correlations on simulated data that included a biased and noisy observation model based on the available PED data. Our second main finding is that flows are significantly associated with the reports of PED outbreaks. This finding is based on correlations of pairwise relationships and regression modeling of total and weekly outbreak counts. These findings illustrate how variation in population structure may be employed along with observational data to improve understanding of disease spread.",2015-03-27,Eamon B. O’Dea; Harry Snelson; Shweta Bansal,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/017178
8ilzm51q,33565294e6bc67fb7ee14dcae6cfdb08148f4ea5,biorxiv,"Big city, small world: Density, contact rates, and transmission of dengue across Pakistan.",10.1101/018481,,,biorxiv,"Macroscopic descriptions of populations commonly assume that encounters between individuals are well mixed; i.e., each individual has an equal chance of coming into contact with any other individual. Relaxing this assumption can be challenging though, due to the difficulty of acquiring detailed knowledge about the non-random nature of encounters. Here, we fitted a mathematical model of dengue virus transmission to spatial time series data from Pakistan and compared maximum-likelihood estimates of “mixing parameters” when disaggregating data across an urban-rural gradient. We show that dynamics across this gradient are subject not only to differing transmission intensities but also to differing strengths of nonlinearity due to differences in mixing. We furthermore show that neglecting spatial variation in mixing can lead to substantial underestimates of the level of effort needed to control a pathogen with vaccines or other control efforts. We complement this analysis with relevant contemporary environmental drivers of dengue.",2015-04-27,Moritz U. G. Kraemer; T. Alex Perkins; Derek A.T. Cummings; Rubeena Zakar; Simon I. Hay; David L. Smith; Robert C. Reiner,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/018481
wafvnbdu,3461d71f6890f7e5ba53bf168be3945cdb16d901,biorxiv,MERS-CoV recombination: implications about the reservoir and potential for adaptation,10.1101/020834,,,biorxiv,"Recombination is a process that unlinks neighbouring loci allowing for independent evolutionary trajectories within genomes of many organisms. If not properly accounted for, recombination can compromise many evolutionary analyses. In addition, when dealing with organisms that are not obligately sexually reproducing, recombination gives insight into the rate at which distinct genetic lineages come into contact. Since June, 2012, Middle East respiratory syndrome coronavirus (MERS-CoV) has caused 1106 laboratory-confirmed infections, with 421 MERS-CoV associated deaths as of April 16, 2015. Although bats are considered as the likely ultimate source of zoonotic betacoronaviruses, dromedary camels have been consistently implicated as the source of current human infections in the Middle East. In this paper we use phylogenetic methods and simulations to show that MERS-CoV genome has likely undergone numerous recombinations recently. Recombination in MERS-CoV implies frequent co-infection with distinct lineages of MERS-CoV, probably in camels given the current understanding of MERS-CoV epidemiology.",2015-06-12,Gytis Dudas; Andrew Rambaut,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/020834
4xocqn6o,1f9d3f9a1a0e8db6a086e0a2b5ba50cf9f235dae,biorxiv,On the causes of evolutionary transition:transversion bias,10.1101/027722,,,biorxiv,"A pattern in which nucleotide transitions are favored several-fold over transversions is common in molecular evolution. When this pattern occurs among amino acid replacements, explanations often invoke an effect of selection, on the grounds that transitions are more conservative in their effects on proteins. However, the underlying hypothesis of conservative transitions has never been tested directly. Here we assess support for this hypothesis using direct evidence: the fitness effects of mutations in actual proteins, measured via individual or paired growth experiments. We assembled data from 8 published studies, ranging in size from 24 to 757 single-nucleotide mutations that change an amino acid. Every study has the statistical power to reveal significant effects of amino acid exchangeability, and most studies have the power to discern a binary conservative-vs-radical distinction. However, only one study suggests that transitions are significantly more conservative than transversions. In the combined set of 1239 replacements, the chance that a transition is more conservative than a transversion is 53 % (95 % confidence interval, 50 % to 56 %), compared to the null expectation of 50 %. We show that this effect is not large compared to that of most biochemical factors, and is not large enough to explain the several-fold bias observed in evolution. In short, available data have the power to verify the ""conservative transitions"" hypothesis if true, but suggest instead that selection on proteins plays at best a minor role in the observed bias.",2015-09-28,Arlin Stoltzfus; Ryan W. Norris,,,,True,biorxiv_medrxiv,https://doi.org/10.1101/027722
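
The `quote` and `escape` options in the read cell matter because abstracts contain commas and doubled quotes (for example `""conservative transitions""` in the last row shown). As a minimal sketch of the same quoting rule outside Spark, Python's standard `csv` module applies it by default. The sample row is a shortened, made-up stand-in, not real data from the file:

```python
import csv
import io

# Hypothetical, shortened row in the same shape as the metadata file:
# the quoted fields contain commas and a doubled quote ("").
sample = 'abc123,"A title, with a comma","An ""escaped"" phrase, then more text",2015-09-28\n'

# csv.reader uses '"' as the quote character and treats "" inside a quoted
# field as one literal quote (doublequote=True by default).
row = next(csv.reader(io.StringIO(sample)))

print(len(row))  # number of parsed fields
print(row[2])    # the abstract-like field with its quotes restored
```

If the doubled quotes were not treated as escapes, a field like this could be split at the wrong place and shift every later column.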


--> 3. Extract the underlying RDD.

In [0]:
rdd_cord = df_cord.rdd

rdd_cord.take(2)

--> 4. Form a new RDD which contains the pairs (cord_uid, abstract).

In [0]:
slic_rdd = rdd_cord.map(lambda row: (row.cord_uid, row.abstract))

slic_rdd.take(2)

--> 5. Convert the RDD you created in Step 4 into an RDD which consists of pairs (cord_uid, list of words in abstract). Rows with no abstract are dropped first.

In [0]:
slic_rdd = slic_rdd.filter(lambda line: line[1] is not None).map(lambda line: (line[0], line[1].split()))

slic_rdd.take(5)
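
Note that `str.split()` with no arguments splits on runs of whitespace only, so punctuation stays attached to words and case is preserved. If that matters when you later match against word lists, one possible normalization is to lower-case each word and strip surrounding punctuation. This is an assumption, not a required step of the lab, and `tokenize` is a hypothetical helper name:

```python
import string

def tokenize(abstract):
    """Lower-case an abstract and strip ASCII punctuation from the ends of each word."""
    words = (w.strip(string.punctuation).lower() for w in abstract.split())
    # Drop tokens that were punctuation only (they become empty strings).
    return [w for w in words if w]

text = "SIANN can operate quickly, delivering highly accurate results."
print(text.split())    # plain split keeps 'quickly,' and 'results.'
print(tokenize(text))  # normalized tokens

# In the notebook this could be used in place of line[1].split() in the Step 5 map.
```

Only ASCII punctuation is stripped here, so typographic quotes such as those around “mixing parameters” in the sample data would remain attached.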

--> 6. Now we'll use the positive words available from

https://github.com/shekhargulati/sentiment-analysis-python/blob/master/opinion-lexicon-English/positive-words.txt

You'll need to download the file (click on Raw and save the page that appears) and remove the comments from
the top of the file on your own machine. Save the new file as positive_words.csv. You can add the header word
either to the file at this point or when you load the file through the UI into Databricks, but you should ensure that
the resulting table:
(a) is called positive_words, and
(b) has a single column, called words.
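
If you would rather script the clean-up than edit the file by hand, the following is a minimal sketch. It assumes the comment lines at the top of the download start with `;` (check your own copy), and `clean_lexicon` is a hypothetical helper name:

```python
def clean_lexicon(lines, header="words"):
    """Return the header followed by the words, skipping comments and blanks.

    Assumes comment lines start with ';'. Check this against your download.
    """
    words = [ln.strip() for ln in lines]
    words = [w for w in words if w and not w.startswith(";")]
    return [header] + words

# Small sample in the assumed layout of the downloaded file.
raw = [";;;;;;;;;;", "; Opinion Lexicon: Positive", ";", "", "a+", "abound", "abounds"]
cleaned = clean_lexicon(raw)
print("\n".join(cleaned))

# For the real file, something like this would apply (file names are assumptions):
# with open("positive-words.txt") as src, open("positive_words.csv", "w") as dst:
#     dst.write("\n".join(clean_lexicon(src)) + "\n")
```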

In [0]:
dbutils.fs.ls("/FileStore/tables/positive_words-2.csv")

In [0]:
positive_rdd = sc.textFile("/FileStore/tables/positive_words-2.csv")
positive_rdd.take(5)

--> 7. Do the same for negative words available from

https://github.com/shekhargulati/sentiment-analysis-python/blob/master/opinion-lexicon-English/negative-words.txt

In [0]:
dbutils.fs.ls("/FileStore/tables/negative_words.csv")

In [0]:
negative_rdd = sc.textFile("/FileStore/tables/negative_words.csv")
negative_rdd.take(5)
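
Because `sc.textFile` returns every line of the file, a header row such as `words` will appear as an ordinary element if you added it to the CSV. Below is a minimal plain-Python sketch of how the two lists might then be used for a naive score (positive matches minus negative matches). The sample words, the names `to_word_set` and `naive_score`, and the scoring rule itself are illustrative assumptions, not the lab's prescribed solution:

```python
def to_word_set(lines, header="words"):
    """Build a lookup set from lexicon lines, skipping blanks and the header row."""
    stripped = (ln.strip() for ln in lines)
    return {w for w in stripped if w and w != header}

def naive_score(words, positive, negative):
    """Count of positive words minus count of negative words in a token list."""
    pos = sum(1 for w in words if w in positive)
    neg = sum(1 for w in words if w in negative)
    return pos - neg

# Stand-ins for positive_rdd.collect() and negative_rdd.collect().
positive = to_word_set(["words", "accurate", "powerful", "valuable"])
negative = to_word_set(["words", "problematic", "difficult"])

tokens = "a powerful and accurate tool for a difficult problem".split()
print(naive_score(tokens, positive, negative))

# On the RDD from Step 5, the same function could be applied per abstract, e.g.
# slic_rdd.mapValues(lambda ws: naive_score(ws, positive, negative))
```

Using sets keeps each membership test cheap, which matters when the check runs once per word across every abstract.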