# CZI Software Mentions Dataset - Interacting with the Dataset


This notebook offers examples of **interacting** with the <b>CZI Software Mentions dataset</b>.<br>
The <b>CZI Software Mentions dataset</b> is a large dataset of software mentions mined from the literature.

**Dataset Overview**: Plain-text software mentions are extracted with a trained [SciBERT](#references_scibert) model from several sources: the NIH PubMed Central collection and papers provided by various publishers to the Chan Zuckerberg Initiative. The dataset provides sources, context and metadata, and, for a number of mentions, the disambiguated software entities and links. A full description of the dataset, methodology, algorithms and evaluation used to create the dataset can be found in our preprint, [A large dataset of software mentions in the biomedical literature](https://arxiv.org/abs/2209.00693), and on our [Github page](https://github.com/chanzuckerberg/software-mentions).


**The notebook is structured as follows:**

1. [Interacting with dataset](#dataset_interaction)
    1. [Description of the dataset](#dataset_interaction_description)
    2. [raw files](#dataset_interaction_raw)
        1. Example: [Query the dataset for a particular plain-text software mention](#dataset_interaction_raw_query_scipy)
        2. Example: [Retrieve texts in which a particular plain-text software mention appears](#dataset_interaction_raw_samples_scipy)
        3. Example: [Retrieve the most frequent plain-text software mentions in the corpus](#dataset_interaction_raw_most_frequent_mentions)
        4. Motivation: [Understanding curation labels](#dataset_interaction_raw_curation_labels)
    3. [disambiguated files](#dataset_interaction_disambiguated)
        1. Motivation: [Why do we need disambiguation? Examples showcasing string variation](#dataset_interaction_disambiguated_motivation)
        2. Example: [Query the dataset for a particular software, including all string variations](#dataset_interaction_disambiguated_dataset_query)
        3. Example: [Examples of disambiguated software terms](#dataset_interaction_disambiguated_examples)
        4. Example: [Retrieve the most frequent software mentions based on disambiguated dataset](#dataset_interaction_disambiguated_most_frequent_mentions)
    4. [linked files](#dataset_interaction_linked)
        1. Example: [Query the linked data](#dataset_interaction_linked_query)
        2. Example: [Exploration of metadata fields](#dataset_interaction_linked_metadata_fields)
2. Example: [Query the dataset for a particular software, including all string variations, and links](#dataset_interaction_linked_query) (Example for **scikit-learn**)

There is a different notebook, [CZI Software Mentions Dataset - Sample Use Cases](#link_here), that offers sample use cases for the dataset.

**The full list of resources we have available for the dataset is**:
1. [Preprint: A large dataset of software mentions in the biomedical literature](https://arxiv.org/abs/2209.00693)
2. [Github Repository](https://github.com/chanzuckerberg/software-mentions)
3. [Dataset](https://datadryad.org/stash/dataset/doi:10.5061/dryad.6wwpzgn2c?)
4. [Interacting with the Dataset](https://github.com/chanzuckerberg/software-mentions/blob/main/sample_notebooks/Interacting%20with%20the%20dataset.ipynb) - Jupyter Notebook
5. [Sample Use Cases](https://github.com/chanzuckerberg/software-mentions/blob/main/sample_notebooks/Sample%20Use%20Cases.ipynb) - Jupyter Notebook

For questions, please contact aistrate@chanzuckerberg.com

In [1]:
import pandas as pd
import numpy as np
pd.set_option('max_colwidth', 1000)

<a id='dataset_interaction'></a>

## Interacting with the dataset
We offer a brief overview of the dataset below. For a full description, including detailed information about the available files and fields and how they were obtained, please consult the dataset [README.md](https://datadryad.org/stash/dataset/doi:10.5061/dryad.6wwpzgn2c?) file or the Appendix section of our [preprint](https://arxiv.org/abs/2209.00693).

<a id='dataset_interaction_description'></a>
### Dataset Description


The notebook assumes that the dataset files are stored in a folder `data` that sits at the same level as the `sample_notebooks` directory. The assumed directory structure is the following:

- `sample_notebooks`
-  `data` 
    - `raw`
        - comm_raw.tsv.gz
        - non_comm_raw.tsv.gz
        - publishers_collection_raw.tsv.gz
    - `disambiguated`
        - comm_disambiguated.tsv.gz
    - `linked`
        - metadata.tsv.gz
        
Description of the folders is as follows:
 - [`raw`](#dataset_interaction_raw) : raw, plain-text software mentions, as extracted by the NER model
 - [`disambiguated`](#dataset_interaction_disambiguated): plain-text mentions mapped to software entities by our disambiguation algorithm
 - [`linked`](#dataset_interaction_linked) :  linked software mentions
        
Note that for the folder `raw`, you don't need to have all of the `comm`, `non_comm`, and `publishers_collection` files. You can choose as many as you would like to interact with. The description of these files is:

- `comm` : contains data extracted from the PMC OA **commercial** subset
- `non_comm` : contains data extracted from the PMC OA **non-commercial** subset
- `publishers_collection` : contains data extracted from the **CZI Publishers Collection**  

In the following sections, we will show how to interact with the different modalities of the dataset and touch on the motivation and different use cases each modality carries.

In [2]:
ROOT_DATA_DIR = '../data/'

<a id='dataset_interaction_raw'></a>
### raw files

The **raw** files contain raw, plain-text software mentions, as extracted by the NER model from each of the collections. As mentioned in the [Intro of this section](#dataset_interaction), there are three possible files to interact with:
 - comm_raw.tsv.gz
 - non_comm_raw.tsv.gz
 - publishers_collection_raw.tsv.gz
 
These files are quite large (~1GB zipped) and are stored as GZIP files. 
If you are using **pandas**, we recommend the approach below. <br>
The first cell reads the full file, which is what you need for a full analysis of the dataset. <br>
To speed up the computation while exploring, the commented-out cell that follows it reads only the first 5000000 rows by passing `nrows=num_rows_to_read`:

In [3]:
raw_df = pd.read_csv(ROOT_DATA_DIR + 'raw/comm_raw.tsv.gz', sep = '\\t', engine = 'python', compression = 'gzip')

In [4]:
# num_rows_to_read = 5000000
# raw_df = pd.read_csv(ROOT_DATA_DIR + 'raw/comm_raw.tsv.gz', sep = '\\t', engine = 'python', compression = 'gzip', nrows = num_rows_to_read)
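
If a full file does not fit comfortably in memory, one option is to stream it in chunks and keep only the rows of interest. The following is a minimal sketch, not part of the original analysis: the helper name `filter_mentions_in_chunks`, the chunk size, and the toy file it is demonstrated on are all hypothetical. It also assumes that the file parses with pandas' default parser and a plain tab separator; if it does not, pass the same `sep` and `engine` arguments used above.

```python
import gzip
import os
import tempfile

import pandas as pd


def filter_mentions_in_chunks(path, software_name, chunksize=100_000):
    """Stream a gzipped TSV and keep rows whose 'software' equals software_name."""
    matches = []
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of loading the whole file at once.
    for chunk in pd.read_csv(path, sep='\t', compression='gzip', chunksize=chunksize):
        matches.append(chunk[chunk['software'] == software_name])
    if not matches:
        return pd.DataFrame()
    return pd.concat(matches, ignore_index=True)


# Toy file with made-up rows, standing in for e.g. raw/non_comm_raw.tsv.gz
toy_rows = (
    "pmid\ttext\tsoftware\n"
    "1\tWe used scipy\tscipy\n"
    "2\tWe used R\tR\n"
    "3\tscipy again\tscipy\n"
)
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_raw.tsv.gz')
    with gzip.open(toy_path, 'wt', encoding='utf-8') as handle:
        handle.write(toy_rows)
    toy_matches = filter_mentions_in_chunks(toy_path, 'scipy', chunksize=2)

print(toy_matches)
```

With the real data, the call would look like `filter_mentions_in_chunks(ROOT_DATA_DIR + 'raw/non_comm_raw.tsv.gz', 'scipy')`.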

**Warning**: Reading a full file may take a while (up to an hour), especially for the `comm` and `publishers_collection` files. <br>
If you would like to quickly interact with the dataset, we recommend opening the `non_comm` file first, as it's the smallest one. <br><br>
Once the file is read, we can start exploring it! Let's look at a few samples:

In [5]:
raw_df.sample(3)

Unnamed: 0,license,location,pmcid,pmid,doi,pubdate,source,number,text,software,version,ID,curation_label
7610808,comm,comm/Dis_Markers/PMC7369658.nxml,7369658,32733620.0,10.1155/2020/8817652,2020,2.1. Data Acquisition and Integrative Analysis,3,"Based on R software and packages [21], we analyzed the above data to obtain differentially methylated genes and differentially expressed genes",R,,SM3,software
11554458,comm,comm/Sci_Rep/PMC5830603.nxml,5830603,29491348.0,10.1038/s41598-018-22015-3,2018,Computational Method,11,Modified ClayFF potential parameters were developed by Kerisit et al.; these predict contact between the silicate structure and water on the surface,ClayFF,,SM22146,not_curated
8486300,comm,comm/Front_Hum_Neurosci/PMC4267277.nxml,4267277,25566017.0,10.3389/fnhum.2014.00975,2014,SAM ANALYSIS,8,"Given that SPM2 uses standard brains from the MNI, the MNI coordinates were converted to Talairach coordinates using nonlinear transformation (Lancaster et al., 2000) in the mri3dX software package",mri3dX,,SM261667,not_curated


And the fields we have available:

In [6]:
raw_df.columns

Index(['license', 'location', 'pmcid', 'pmid', 'doi', 'pubdate', 'source',
       'number', 'text', 'software', 'version', 'ID', 'curation_label'],
      dtype='object')

Some field definitions that will be relevant for our work in this notebook: 
- **software** : the software mention extracted by the NER algorithm from **text**
- **version** : software version as extracted by the NER algorithm (note that not all mentions will have this)
- **text** : text that **software** is being extracted from
- **curation_label** : label assigned to software mention by our curators 

All of the other fields are related to metadata of the paper the mention is extracted from.  
For a full definition of all of the fields, please take a look at the [Dataset README.md](link) file in our [Github repo](https://github.com/chanzuckerberg/software-mentions). 

Now, let's look at a few examples of interacting with the dataset:
<a id='dataset_interaction_raw_query_scipy'></a>

#### 1. Query the dataset
Example of querying the dataset for the plain-text software mention 'scipy'

In [7]:
scipy_mentions_df = raw_df[raw_df['software'] == 'scipy']

In [8]:
scipy_mentions_df.head(3)

Unnamed: 0,license,location,pmcid,pmid,doi,pubdate,source,number,text,software,version,ID,curation_label
8488,comm,comm/ACS_Nano/PMC7905882.nxml,7905882,33556239.0,10.1021/acsnano.0c10632,2021,Cluster Analysis,40,The 83.4% confidence interval for the mean was calculated using the implementation in the scipy package,scipy,,SM3076,software
9849,comm,comm/ACS_Synth_Biol/PMC8486170.nxml,8486170,33449631.0,10.1021/acssynbio.0c00318,2021,fig_caption,5,Wavelength was computed using scipy FFT and peak finding standard libraries,scipy,,SM3076,software
9853,comm,comm/ACS_Synth_Biol/PMC8486170.nxml,8486170,33449631.0,10.1021/acssynbio.0c00318,2021,Computational Model,12,"Signal analysis (FFT) and visualization were carried out using the scipy and Matplotlib.pyplot libraries, respectively",scipy,,SM3076,software


<a id='dataset_interaction_raw_samples_scipy'></a>
#### 2. Samples of texts in which the mention 'scipy' appears

In [9]:
scipy_mentions_df[['pmid', 'source', 'text', 'software']].sample(10)

Unnamed: 0,pmid,source,text,software
12525610,32340276.0,,Function welch from the scipy library was used for that,scipy
13884611,30688649.0,Consensus and unsigned consensus,"For visual clarity, these values were interpolated by a third-degree univariate spline calculated using the python package scipy.interpolate.InterpolatedUnivariateSpline (this technique is guaranteed to intercept the measured values).",scipy
7320937,33208883.0,Surrogate modeling,We use the scipy.interpolate.griddata method from the Python Scipy package to implement 2D barycentric linear interpolation and treat each PCA component independently,scipy
13936236,33289631.0,DeepLabCut-Live! package,"It utilizes TensorFlow (Abadi et al., 2016), numpy (Svd et al., 2011), scipy (Virtanen et al., 2020), OpenCV (Bradski, 2000), and others",scipy
3865516,28495919.0,Statistical methods,"Since the data were not normally distributed, a two‐sided non‐parametrical Mann–Whitney test was applied, implemented in python using the function stats.mannwhitneyu in scipy, with Bonferroni correction on the minor fraction for each gene on the two cell types",scipy
8325546,30344485.0,Statistical analysis of distributions,"Kurtosis and skewness were tested using the scipy functions, stats.kurtosistest and stats.skewtest (https://docs.scipy.org)",scipy
9786263,31653678.0,Simulating a 10-locus system with neutral and selected variants:,"For Gaussian DES, the mean γ is zero, as above, and is found by numerical optimization using scipy (Jones ) to give the desired",scipy
3379536,31011536.0,Architecture,"numpy, scipy, and scikit-learn) and continuous integration (e.g",scipy
3171323,32958603.0,Connection specificity modules and network,The Pairwise distance between regulons is calculated by “scipy.spatial.distance.pdist” with Euclidean metric,scipy
6128102,25781329.0,Programming,"Data acquisition and analysis were done using the Python programming language with the pandas, numpy, scipy, and matplotlib extensions",scipy


<a id='dataset_interaction_raw_most_frequent_mentions'></a>
#### 3. Retrieve the most frequent software mentions in the corpus
Note that here, we define *frequency* as the number of unique papers a software mention appears in.

**Warning**: this command also takes a long time! Feel free to skip it if you're not interested in this section.

In [11]:
raw_df_aggregated = raw_df.groupby('software').nunique().sort_values(by = 'pmid', ascending = False)
raw_df_aggregated[:30]

Unnamed: 0_level_0,license,location,pmcid,pmid,doi,pubdate,source,number,text,version,ID,curation_label
software,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
SPSS,1,286647,286647,285279,279566,22,31053,691,291953,836,1,1
R,1,192460,192460,191817,191636,21,101864,662,299463,1042,1,1
GraphPad Prism,1,119661,119661,119357,118576,19,27484,295,129161,409,1,1
ImageJ,1,94715,94715,94428,93950,19,76479,349,159854,753,1,1
Excel,1,80247,80247,79826,79447,24,33460,444,96334,299,1,1
GraphPad,1,75614,75614,75438,74900,19,14637,240,76420,134,1,1
SAS,1,75133,75133,74919,74596,23,15059,318,88060,292,1,1
BLAST,1,54981,54981,54870,54660,24,54508,287,100939,122,1,1
Stata,1,46462,46462,46279,46175,20,7807,355,52525,383,1,1
MATLAB,1,46532,46532,46265,46417,21,39037,375,77802,595,1,1
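
Since only the `pmid` column is used for ranking, restricting the aggregation to that column avoids counting unique values in every other field, which may shorten the run time. A minimal sketch on hypothetical toy rows (swap in `raw_df` for real use):

```python
import pandas as pd

# Made-up rows standing in for raw_df
toy_df = pd.DataFrame({
    'pmid': [1, 1, 2, 3, 3, 4],
    'software': ['R', 'R', 'R', 'SPSS', 'ImageJ', 'SPSS'],
})

# Count unique papers per mention, looking at the pmid column only
papers_per_mention = (
    toy_df.groupby('software')['pmid']
    .nunique()
    .sort_values(ascending=False)
)
print(papers_per_mention.head(30))
```

The result is a Series indexed by mention, so it carries only the paper counts and none of the other per-column counts shown in the table above.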


<a id='dataset_interaction_raw_curation_labels'></a>
#### 4. Understanding curation labels

Now, we may notice that among the most frequent software mentions retrieved above, there are some obvious false positives picked up by the NER model, such as **'COVID'**, **'COVID-19'** or **'Google'**. This is why we engaged an expert team of biomedical curators to sanity-check the top 10k plain-text software mentions extracted by the NER model from the `comm` dataset. We offer more details about the methodology and guidelines in our [preprint](link), as well as on our [Github Repo](https://github.com/chanzuckerberg/software-mentions). The datasets contain a field ```curation_label``` that gives the label our curation team assigned to a software mention, or indicates that it was not curated at all. 

Let's explore the **curation_label** values a bit. First, let's see what values are available:

In [12]:
raw_df['curation_label'].unique()

array(['not_curated', 'unclear', 'software', 'not_software'], dtype=object)

Now, let's create the **mention2curation_label** mapping to map from a mention to its curation label

In [13]:
mentions_curation_label_df = raw_df[['software', 'curation_label']].drop_duplicates()
mentions = mentions_curation_label_df['software'].values
curation_labels = mentions_curation_label_df['curation_label'].values
mention2curation_label = {m : c for m, c in zip(mentions, curation_labels)}
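
The dictionary comprehension keeps the last label seen if a mention were ever paired with more than one label. The sketch below, run on hypothetical toy rows rather than the real data, shows one way to check that assumption and to build the same kind of mapping with pandas alone. The names `pairs`, `conflicting` and `mention2label` are illustrative.

```python
import pandas as pd

# Made-up rows standing in for raw_df
toy_df = pd.DataFrame({
    'software': ['R', 'R', 'COVID', 'ClayFF'],
    'curation_label': ['software', 'software', 'not_software', 'not_curated'],
})

pairs = toy_df[['software', 'curation_label']].drop_duplicates()

# Mentions that carry more than one distinct label (expected to be empty)
labels_per_mention = pairs.groupby('software')['curation_label'].nunique()
conflicting = labels_per_mention[labels_per_mention > 1]
print(conflicting)

# Same mapping as the comprehension, built from the de-duplicated pairs
mention2label = pairs.set_index('software')['curation_label'].to_dict()

# Series.map applies the dictionary to a whole column at once
toy_df['label_from_map'] = toy_df['software'].map(mention2label)
```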

Let's look at curation labels of the top most frequent mentions obtained above:

In [19]:
raw_df_aggregated = raw_df_aggregated.reset_index()
raw_df_aggregated['curation_label'] = raw_df_aggregated['software'].apply(lambda x: mention2curation_label[x])
raw_df_aggregated[:30]

Unnamed: 0,index,software,license,location,pmcid,pmid,doi,pubdate,source,number,text,version,ID,curation_label
0,0,SPSS,1,286647,286647,285279,279566,22,31053,691,291953,836,1,software
1,1,R,1,192460,192460,191817,191636,21,101864,662,299463,1042,1,software
2,2,GraphPad Prism,1,119661,119661,119357,118576,19,27484,295,129161,409,1,software
3,3,ImageJ,1,94715,94715,94428,93950,19,76479,349,159854,753,1,software
4,4,Excel,1,80247,80247,79826,79447,24,33460,444,96334,299,1,software
5,5,GraphPad,1,75614,75614,75438,74900,19,14637,240,76420,134,1,software
6,6,SAS,1,75133,75133,74919,74596,23,15059,318,88060,292,1,unclear
7,7,BLAST,1,54981,54981,54870,54660,24,54508,287,100939,122,1,software
8,8,Stata,1,46462,46462,46279,46175,20,7807,355,52525,383,1,software
9,9,MATLAB,1,46532,46532,46265,46417,21,39037,375,77802,595,1,software


We can see that the mentions `COVID`, `COVID-19`, `Google Scholar`, as well as `Google` have been correctly marked by our curators as **not_software**. <br>
We can now segment the dataset based on the curation_labels:

In [15]:
not_software_raw = raw_df[raw_df['curation_label'] == 'not_software']
unclear_raw = raw_df[raw_df['curation_label'] == 'unclear']
not_curated_raw = raw_df[raw_df['curation_label'] == 'not_curated']
software_raw = raw_df[raw_df['curation_label'] == 'software']
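
An equivalent way to get all four subsets at once is to group by the label. This is a small sketch on hypothetical toy rows; the dictionary name `subsets_by_label` is illustrative.

```python
import pandas as pd

# Made-up rows standing in for raw_df
toy_df = pd.DataFrame({
    'software': ['R', 'COVID', 'SAS', 'ClayFF', 'SPSS'],
    'curation_label': ['software', 'not_software', 'unclear', 'not_curated', 'software'],
})

# Number of rows per label
print(toy_df['curation_label'].value_counts())

# One DataFrame per label, keyed by the label value
subsets_by_label = {label: group for label, group in toy_df.groupby('curation_label')}
software_rows = subsets_by_label['software']
print(software_rows)
```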

Let's look at some more examples of mentions that have been marked by our curators as **not_software** and the context in which they appear:

In [16]:
not_software_raw.sample(10)[['pmid', 'text', 'software']]

Unnamed: 0,pmid,text,software
7059298,32341447.0,Data based on 100000 simulations for MOMP and minimal MOMP.,MOMP
9145575,33574758.0,RASFF is used in the European Union to help obtain the minimally required safety and quality of feed and food (i.e,RASFF
13577999,34574584.0,"For the scopes of the present study, it was fitted with the Easy Alarm YouTube application and included audio files presenting the verbal reminders for the start of the activities and the instructions concerning the activity steps",YouTube
3921529,33917796.0,Result of the potential DDIs at each documentation level of Micromedex.,Micromedex
5934161,32017786.0,"PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous",ONE
1146748,15784138.0,TIGR's in-house non-redundant protein database (NRAA) was searched and aligned to the Arabidopsis BACs using this tool,TIGR
9731133,33322033.0,"Nextflow and the containerization software can be run on Linux, Mac OS, and Windows supported through WSL2",Windows
853912,20096121.0,Analysis of the gene atlas data using PCA and NeatMap,PCA
4927004,34529649.0,"One study had explored the use of Google Trends data as a predictor [24], and the other used a set of meteorological information for generating outbreak predictions [25]",Google Trends
8070115,26339471.0,AGRIS: providing access to agricultural research data exploiting open data on the web,AGRIS


Similarly, we can look at examples of mentions that have been marked by our curators as **software**:

In [17]:
software_raw.sample(10)[['pmid', 'text', 'software']]

Unnamed: 0,pmid,text,software
6447805,25054223.0,Profiles were analyzed with the online analysis tools DAVID and STRING to identify enrichment for specific pathways and protein-protein interactions,DAVID
8734334,28321207.0,ResFinder is also not capable of detecting point mutations in genes such as gyrA that lead to a resistance phenotype,ResFinder
5327125,22530047.0,"All of these proteins were analyzed with Ingenuity Pathway Analysis (IPA) software to disclose connections between these proteins, and thus define pathways that could be involved in the molecular mechanisms of MS.",Ingenuity Pathway Analysis (IPA)
1530508,24559402.0,"Upper part of the plot shows the read coverage, middle part the reads and lower part the transcripts defined by Cufflinks as blue bars",Cufflinks
128627,21754056.0,"Data collection: CrystalClear (Rigaku, 2009 ▶); cell refinement: CrystalClear; data reduction: CrystalClear; program(s) used to solve structure: SHELXS97 (Sheldrick, 2008 ▶); program(s) used to refine structure: SHELXL97 (Sheldrick, 2008 ▶); molecular graphics: SHELXTL (Sheldrick, 2008 ▶); software used to prepare material for publication: CrystalStructure (Rigaku, 2009 ▶).",CrystalClear
12779804,33761993.0,Band intensity was quantified using the ImageJ software (1.44 P).,ImageJ
11536578,29367613.0,"Intron-spanning primers for Cdc42 (Genbank accession number: U37720.1) and GAPDH (GU214026.1) were designed using Primer3 and synthesized by Integrated DNA Technologies (Skokie, IL) as follows:",Primer3
11465334,28860520.0,"Using the function “bioenv” in the vegan package in R, TN was found to be the most correlated factor among all the soil chemical properties (SCPs) and their combinations",R
14745678,29382111.0,"All statistical analyses were performed using SPSS 13.0 statistical software (SPSS Inc., Chicago, IL, USA), and p values less than 0.05 were considered statistically significant.",SPSS
8523333,23630489.0,"Ocular artifact rejection was carried out using the Neuroscan Edit transform (derived from Semlitsch et al., 1986) followed by a second, automatic artifact rejection sweep, with exclusion parameters set at ±75 mV",Neuroscan


Same for **unclear**:

In [18]:
unclear_raw.sample(10)[['pmid', 'text', 'software']]

Unnamed: 0,pmid,text,software
10508661,23497556.0,"We used SHARE in our early experiments on HAI-related semantic querying for Clinical Intelligence purposes, reported in [19]",SHARE
3616993,31367295.0,Statistical analysis was performed using SAS software and GraphPad Prism 5.,SAS
8996379,29456487.0,"In particular, for the finest mesh resolution of 1 mm sources with a distance of 1.59 mm from the brain-CSF surface, DG-FEM yielded mean topographical errors (relative difference measure, RDM%) of 1.5% and mean magnitude errors (MAG%) of 0.1% for the magnetic field",FEM
10273477,33676583.0,Conclusion Urologists and staff affiliated with MUSIC implementation sites indicated that P3P focuses the treatment discussion on items that are important to patients,MUSIC
2507809,22734688.0,"This work suggests that, at least when applied in combination with other modeling methods, MSA can extract meaningful information from wild-type-only data sets",MSA
8850621,33154741.0,"HERFD-XANES data treatment was performed using ATHENA software (Ravel and Newville, 2005)",ATHENA
10088141,33301553.0,"For additional comparison, we applied VISION and PAGODA to the interferon stimulation and blood development datasets (Supplementary Fig",VISION
7847277,23482063.0,"All analyses were conducted using SAS software (version 9.2) and R 2.12.1 (R Project for Statistical Computing, Vienna, Austria)",SAS
7763542,31346416.0,"Estimates of genetic diversity, including average number of alleles per locus (A), observed and expected heterozygosity (H o and H e respectively), and the inbreeding coefficient (f), were calculated with GDA 1.1 (Lewis & Zaykin, 2001), while allelic richness (A r) was calculated with FSTAT 2.9.3 (Goudet, 1995)",GDA
3690776,32781747.0,"We used an analysis of cell viability, flow cytometry, the IncuCyte live-cell analysis system, and Western blotting to study its effects",IncuCyte


Note that for mentions that are marked as **unclear**, we don't recommend excluding them from analyses. They should rather be interpreted as *it cannot be assumed that this plain-text software mention will always be a true software mention when appearing in text*. The curators were only provided with 5 sentences per software mention, and they did not curate each individual sentence in which a mention appears. The evaluations are based solely on those 5 sentences. We offer a more in-depth discussion about this in our [preprint](https://arxiv.org/abs/2209.00693) and [curation documents](link).

This means that if we wanted to clean our dataset, we should only exclude the mentions marked as **not_software**:

In [20]:
raw_df_clean = raw_df[raw_df['curation_label'] != 'not_software']
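
To see how much the filter removes, it can help to compare row and mention counts before and after. The sketch below uses hypothetical toy rows; `drop_labels` is an illustrative helper, not part of the dataset tooling.

```python
import pandas as pd


def drop_labels(df, labels=('not_software',)):
    """Return the rows whose curation_label is not in labels."""
    return df[~df['curation_label'].isin(labels)]


# Made-up rows standing in for raw_df
toy_df = pd.DataFrame({
    'software': ['R', 'COVID', 'SAS', 'COVID', 'ClayFF'],
    'curation_label': ['software', 'not_software', 'unclear', 'not_software', 'not_curated'],
})

toy_clean = drop_labels(toy_df)
rows_removed = len(toy_df) - len(toy_clean)
mentions_removed = toy_df['software'].nunique() - toy_clean['software'].nunique()
print(rows_removed, mentions_removed)
```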

Looking at the top mentions again:

In [21]:
raw_df_clean_aggregated = raw_df_clean.groupby('software').nunique().sort_values(by = 'pmid', ascending = False)
raw_df_clean_aggregated[:30]

Unnamed: 0_level_0,license,location,pmcid,pmid,doi,pubdate,source,number,text,version,ID,curation_label
software,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
SPSS,1,286647,286647,285279,279566,22,31053,691,291953,836,1,1
R,1,192460,192460,191817,191636,21,101864,662,299463,1042,1,1
GraphPad Prism,1,119661,119661,119357,118576,19,27484,295,129161,409,1,1
ImageJ,1,94715,94715,94428,93950,19,76479,349,159854,753,1,1
Excel,1,80247,80247,79826,79447,24,33460,444,96334,299,1,1
GraphPad,1,75614,75614,75438,74900,19,14637,240,76420,134,1,1
SAS,1,75133,75133,74919,74596,23,15059,318,88060,292,1,1
BLAST,1,54981,54981,54870,54660,24,54508,287,100939,122,1,1
Stata,1,46462,46462,46279,46175,20,7807,355,52525,383,1,1
MATLAB,1,46532,46532,46265,46417,21,39037,375,77802,595,1,1


<a id='dataset_interaction_disambiguated_freq_samples'></a>
Now, by looking at the most frequent mentions in the dataset above, we can make another observation. There seem to be plain-text software mentions that point to the same software. For instance:
- **SPSS** and **SPSS Statistics** are the same software **SPSS**, 
- **GraphPad Prism**, **GraphPad** and **Prism** are likely pointing to the same software **GraphPad Prism**
- **MATLAB** and **Matlab** 
- **ImageJ** and **Image J**

If we want to gauge the full impact of a piece of software, we should aggregate the impact of all of its string variations. This is the issue we address in the section below.

<a id='dataset_interaction_disambiguated'></a>
### disambiguated files

The goal of disambiguation is to cluster together plain-text software mentions that point to the same software entity. In this section, we go over a [short description of the motivation](#dataset_interaction_disambiguated_motivation), as well as some examples of engaging with the disambiguated dataset. More in-depth details about the disambiguation algorithm, which is based on string similarity algorithms and [DBSCAN](#references_dbscan), can be found in our preprint, as well as on our Github repository page.


The **disambiguated** files contain plain-text mentions, as extracted by the NER model, mapped to software entities, as determined by our disambiguation algorithm. There is only one file to interact with:
- comm_disambiguated.tsv.gz

The file is quite large and is stored as a GZIP archive. As with the [**raw** datasets](#dataset_interaction_raw), if you are using pandas, we recommend reading it as shown below. The first cell reads the entire dataset; the commented-out cell after it reads only the first 1000000 rows:

In [20]:
disambiguated_df = pd.read_csv(ROOT_DATA_DIR + 'disambiguated/comm_disambiguated.tsv.gz', sep = '\\t', engine = 'python', compression = 'gzip')

In [21]:
# disambiguated_df = pd.read_csv(ROOT_DATA_DIR + 'disambiguated/comm_disambiguated.tsv.gz', sep = '\\t', engine = 'python', compression = 'gzip', nrows = 1000000)

Note that, due to its size, reading the entire file will take quite a long time (~20 min).

Let's start by looking at a few samples:

In [22]:
pd.set_option('max_colwidth', 50)
disambiguated_df.sample(3)

Unnamed: 0,license,location,pmcid,pmid,doi,pubdate,source,number,text,software,version,ID,curation_label,mapped_to_software
1340683,comm,comm/BMC_Evol_Biol/PMC3560262.nxml,3560262,23186303.0,10.1186/1471-2148-12-228,2012,Complex patterns of component re-assortment,69,"Importantly, our second BEAST-based approach p...",BEAST,,SM681,software,BEAST
2544450,comm,comm/BMJ/PMC6371944.nxml,6371944,30755451.0,10.1136/bmj.l236,2019,paper_title,0,Effectiveness and safety of electronically del...,REDUCE,,SM9911,unclear,REDUCE
9283459,comm,comm/Front_Plant_Sci/PMC5562724.nxml,5562724,28861102.0,10.3389/fpls.2017.01422,2017,"Transcript Assembly, Sequence Alignment and Qu...",5,To obtain expression profiles and predict gene...,Cufflinks,,SM624,software,Cufflinks


We can look at the fields we have available:

In [23]:
disambiguated_df.columns

Index(['license', 'location', 'pmcid', 'pmid', 'doi', 'pubdate', 'source',
       'number', 'text', 'software', 'version', 'ID', 'curation_label',
       'mapped_to_software'],
      dtype='object')

The fields in this dataset are largely the same as the ones in the **raw** dataset. As a reminder: 
- **software** : the software mention extracted by the NER algorithm from **text**
- **version** : software version as extracted by the NER algorithm (note that not all mentions will have this)
- **text** : text that **software** is being extracted from
- **curation_label** : label assigned to software mention by our curators 

These are all fields we've seen before in the **raw** dataset. However, now we get an extra field:

- **mapped_to_software** : software entity (or cluster) the **software** mention is predicted to be part of; this is the result of disambiguation

As a reminder, for a full definition of all of the fields, please take a look at the [Dataset_README.md](link) file in our [Github repo](https://github.com/chanzuckerberg/software-mentions). 

Now let's get started exploring the disambiguated dataset!

For mentions we were not able to disambiguate, we set the **mapped_to_software** field to the value of the **software** mention itself.

In [24]:
def populate_map_to_software(x):
    if x['mapped_to_software'] == 'not_disambiguated':
        return x['software']
    return x['mapped_to_software']

In [25]:
disambiguated_df['mapped_to_software'] = disambiguated_df.apply(populate_map_to_software, axis = 1)
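
Row-wise `apply` can be slow on a frame of this size. A vectorized alternative is sketched below on hypothetical toy rows. It assumes, as the function above does, that mentions we could not disambiguate carry the literal string `'not_disambiguated'`; the example mappings are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for disambiguated_df
toy_df = pd.DataFrame({
    'software': ['Matlab', 'mri3dX', 'Image J'],
    'mapped_to_software': ['MATLAB', 'not_disambiguated', 'ImageJ'],
})

# Keep mapped_to_software where it holds a real mapping;
# otherwise fall back to the plain-text mention itself.
is_mapped = toy_df['mapped_to_software'] != 'not_disambiguated'
toy_df['mapped_to_software'] = toy_df['mapped_to_software'].where(is_mapped, toy_df['software'])
print(toy_df)
```

`Series.where` keeps values where the condition is true and takes the replacement elsewhere, so the whole column is filled in one pass.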

<a id='dataset_interaction_disambiguated_motivation'></a>
#### 1. Why do we need disambiguation? 

First, let's understand more why we would want to disambiguate the software mentions. 
[In the previous section](#dataset_interaction_disambiguated_freq_samples), we already started to see that the same software can be extracted by the NER model from a paper under different string variations. Let's look at a few more examples, to drive the point home.

<a id='dataset_interaction_disambiguated_motivation_scipy'></a>
##### SciPy
We have already looked at **SciPy** in the previous section. But *'SciPy'* is not the only string under which authors mention the software **SciPy** in their papers, nor the only string the NER model might extract for the **SciPy** software entity. Some possible variations that the NER model could extract are: `['scipy', 'Scipy', 'SciPy', 'SCIPY']`. Let's see how these appear in the dataset. <br> 
Note that for simplification, we're only outputting the **pmid**, **text** and **software**.

In [26]:
relevant_fields = ['pmid', 'text', 'software']
num_samples = 5
raw_df[raw_df['software'] == 'scipy'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
8488,33556239.0,The 83.4% confidence interval for the mean was...,scipy
9849,33449631.0,Wavelength was computed using scipy FFT and pe...,scipy
9853,33449631.0,Signal analysis (FFT) and visualization were c...,scipy
9857,33449631.0,Wavelength was computed using scipy FFT and pe...,scipy
19111,34542700.0,Parameter optimisation was performed with the ...,scipy


In [27]:
raw_df[raw_df['software'] == 'Scipy'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
29105,32356790.0,Initially we use the widely applied damped lea...,Scipy
274113,33042750.0,Statistical analyses were performed using Pyth...,Scipy
278446,28261545.0,The core components of the Nanosurveyor stream...,Scipy
459150,31739644.0,The NumPy numerical software library [74] was ...,Scipy
459152,31739644.0,Matplotlib [76] package was used to obtain the...,Scipy


In [28]:
raw_df[raw_df['software'] == 'SciPy'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
8827,34156229.0,The system of differential equations describin...,SciPy
9462,34308060.0,As one drainage curve was obtained by averagin...,SciPy
29275,27126118.0,The length and direction of the ellipse axes a...,SciPy
29375,31692460.0,The pairing of acentric anomalous reflections ...,SciPy
29390,31692460.0,The paired intensities were also the basis of ...,SciPy


In [29]:
raw_df[raw_df['software'] == 'SCIPY'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
908446,26051821.0,The library relies on some standard Python lib...,SCIPY
908455,26051821.0,"Also, statistical routines provided by SCIPY a...",SCIPY
2883581,32182929.0,"Moreover, we compute pressure gradients along ...",SCIPY
3370011,30213144.0,The programme script for data conditioning was...,SCIPY
3544433,31835464.0,Statistical analysis has been performed in QII...,SCIPY
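
Rather than querying each casing one at a time, the case variants can be collected in a single pass. The sketch below is only an illustration on a toy frame: `toy_raw_df` is a hypothetical stand-in for `raw_df`, and its rows are made up. It lower-cases the **software** column before comparing, then counts each exact spelling.

```python
import pandas as pd

# Toy stand-in for raw_df; the rows are made up for illustration only.
toy_raw_df = pd.DataFrame({
    'pmid': [1, 2, 3, 4, 5],
    'software': ['scipy', 'SciPy', 'SCIPY', 'Scipy', 'numpy'],
})

# Compare case-insensitively by lower-casing the column before matching.
is_scipy = toy_raw_df['software'].str.lower() == 'scipy'

# Count how often each exact spelling occurs among the matches.
variant_counts = toy_raw_df.loc[is_scipy, 'software'].value_counts()
print(variant_counts)
```

Case-folding only catches differences in capitalization. Variations such as `'Image J'` or `'(BLAST)'` need more than this, which is part of the motivation for disambiguation.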


Let's look at a few more examples! Note that in each case, we are only showcasing a **few** string variations as an example. However, software entities can have hundreds of string variations, as extracted by the NER model. For more examples of this and a discussion, please visit our preprint.

<a id='dataset_interaction_disambiguated_motivation_BLAST'></a>
##### BLAST
Some examples of string variations are: `['BLAST', '(BLAST)', 'BLAST Whole Genome', 'BLAST) search']`

In [30]:
raw_df[raw_df['software'] == 'BLAST'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
1388,28324466.0,Comparative analysis was performed through BLA...,BLAST
1389,28324466.0,When non-homologous 406 sequences were aligned...,BLAST
1394,28324466.0,Complete BLAST alignments were two-way.,BLAST
1398,28324466.0,KAAS provides functional annotation of genes b...,BLAST
1401,28324450.0,The sequence data was checked by BLAST analysi...,BLAST


In [31]:
raw_df[raw_df['software'] == '(BLAST)'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
2250317,31455245.0,The phylogenetic analysis was performed with t...,(BLAST)
2531584,24074037.0,Basic local alignment search tool (BLAST) anal...,(BLAST)
7856314,34154664.0,"For all datasets, taxonomic identification was...",(BLAST)
8672076,26483785.0,Peptide prediction was performed by the IMG/M-...,(BLAST)
9729246,33238553.0,We analyzed HSP protein sequences using Protei...,(BLAST)


In [32]:
raw_df[raw_df['software'] == 'BLAST Whole Genome'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
1599222,26444974.0,"Through this tool, users can perform nucleotid...",BLAST Whole Genome
7533548,27616775.0,"Using the BLAST tools, users can perform nucle...",BLAST Whole Genome


In [33]:
raw_df[raw_df['software'] == 'BLAST) search'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
9688334,29495353.0,Through a conserved domain basic local alignme...,BLAST) search


<a id='dataset_interaction_disambiguated_motivation_sklearn'></a>
##### scikit-learn

Example string variations:
`['scikit-learn', 'scikit-learn python package', 'scikit-learn python library', 'scikit-learn python', 'scikit-learn library for Python', 'scikit-learn Python package2223']`

In [34]:
raw_df[raw_df['software'] == 'scikit-learn'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
8485,33556239.0,Local cluster density analysis was performed b...,scikit-learn
8487,33556239.0,Nearest neighbor analysis was performed as a k...,scikit-learn
18919,27699703.0,The software scikit-learn 0.17 (http://scikit-...,scikit-learn
36782,25945580.0,We then used scikit-learn v.0.13.1 (Pedregosa ...,scikit-learn
41236,25760616.0,The SVM classifier was implemented with the us...,scikit-learn


In [35]:
raw_df[raw_df['software'] == 'scikit-learn python package'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
4483062,32932717.0,A partition of plant species was performed usi...,scikit-learn python package
4641169,31658251.0,We then use the average precision score of the...,scikit-learn python package
5211977,30444868.0,Computations were performed using the LassoLar...,scikit-learn python package
8704869,33324368.0,Patterns of variance were investigated using p...,scikit-learn python package


In [36]:
raw_df[raw_df['software'] == 'scikit-learn python library'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
10625992,26855674.0,Classifiers chain and binary relevance models ...,scikit-learn python library
14178053,34203866.0,For all classification algorithms we used the ...,scikit-learn python library
14229776,33211021.0,"To address this question, we trained naïve Bay...",scikit-learn python library


In [37]:
raw_df[raw_df['software'] == 'scikit-learn python'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
4018106,34250262.0,The dataset was divided in 4 clusters using th...,scikit-learn python
5281565,31644551.0,"First, the patients were divided into two clus...",scikit-learn python
8327865,32063839.0,This data reduction step used the default recu...,scikit-learn python
9717103,32244427.0,PCA+kmeans: Here PCA and kmeans functions from...,scikit-learn python
10651581,34372940.0,Cross-validation studies were carried out usin...,scikit-learn python


In [38]:
raw_df[raw_df['software'] == 'scikit-learn library for Python'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
2087705,31195972.0,LASSO regressions of the three SCFAs and lacta...,scikit-learn library for Python


In [39]:
raw_df[raw_df['software'] == 'scikit-learn Python package2223'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
11255682,27113018.0,"In this study, we use the gaussian_process mod...",scikit-learn Python package2223


<a id='dataset_interaction_disambiguated_motivation_ImageJ'></a>
##### ImageJ

Example string variations:
`['ImageJ', 'Image J']`

In [40]:
raw_df[raw_df['software'] == 'ImageJ'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
105,33134427.0,The obtained images were processed by ImageJ (...,ImageJ
728,30725341.0,The ImageJ software platform (National Institu...,ImageJ
740,31696334.0,All the measurements were performed using Aviz...,ImageJ
1213,30923948.0,Once the photograph was imported into an image...,ImageJ
1224,30923948.0,Once the photograph was imported into an image...,ImageJ


In [41]:
raw_df[raw_df['software'] == 'Image J'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
1214,30923948.0,a Digital photograph exported from Image J wit...,Image J
1222,30923948.0,The photographs were imported into Image J dig...,Image J
1226,30923948.0,Digital photographs with embedded measurements...,Image J
1229,30923948.0,a Digital photograph exported from Image J wit...,Image J
1333,30715677.0,"Microneedle physical dimensions (height, width...",Image J


<a id='dataset_interaction_disambiguated_motivation_MATLAB'></a>
##### MATLAB
Example string variations:
`['MaTLab', 'MatLAB', 'MatLab', 'Matlab', 'MATLAB, Statistics and Machine Learning Toolbox', 'mAtLab)', 'Matlab)']`

In [42]:
raw_df[raw_df['software'] == 'MaTLab'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
3091108,29623956.0,The coding was done using MaTLab 2016b.,MaTLab
13246865,18387189.0,The model was realized with a combination of M...,MaTLab
13246867,18387189.0,The MaTLab program calculates the plume from a...,MaTLab


In [43]:
raw_df[raw_df['software'] == 'MatLAB'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
245819,20809745.0,Custom-written software in MatLAB (version 7.0...,MatLAB
274930,30937256.0,Videos were taken using Zeiss Axiocam ERc 5s (...,MatLAB
313798,32589613.0,Data were statistically analyzed using MatLAB ...,MatLAB
646735,19828075.0,A last set of implemented queries allows the P...,MatLAB
2592774,26425412.0,We provide a sample MatLAB code (Supporting In...,MatLAB


In [45]:
raw_df[raw_df['software'] == 'MatLab'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
1335,30715677.0,In order to investigate the influence of alias...,MatLab
1336,30715677.0,When any portion of the desired microneedle ar...,MatLab
1337,30715677.0,The MatLab code was used to produce image slic...,MatLab
1338,30715677.0,In order to alter the aspect ratio and spacing...,MatLab
1339,30715677.0,"To vary spacing, slices of microneedles measur...",MatLab


In [46]:
raw_df[raw_df['software'] == 'Matlab'][relevant_fields][:num_samples]

Unnamed: 0,pmid,text,software
268,33507428.0,All parameters in each specimen were measured ...,Matlab
406,33141272.0,Statistical analyses were performed in Matlab ...,Matlab
411,33141272.0,A Students t-test was performed to determine i...,Matlab
495,33914200.0,The Strain-stress curves in the elastic region...,Matlab
501,33914200.0,The comparison of the strain-stress curves was...,Matlab
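
To get a feel for why simple string clean-up is not enough, here is a small heuristic normalizer. It is a hypothetical helper written for this illustration (`normalize_mention` is not part of the dataset tooling, and it is not the disambiguation algorithm described in the preprint). It handles casing, stray brackets, trailing citation digits and internal whitespace, and nothing else.

```python
import re

def normalize_mention(mention):
    """Heuristic clean-up of an extracted mention (illustration only)."""
    cleaned = mention.strip()
    # Drop stray brackets picked up around the mention, e.g. '(BLAST)' or 'Matlab)'.
    cleaned = cleaned.strip('()[]')
    # Drop citation digits fused to the end of a word, e.g. 'package2223'.
    cleaned = re.sub(r'(?<=[A-Za-z])\d+$', '', cleaned)
    # Ignore case and internal whitespace, e.g. 'Image J' vs 'ImageJ'.
    return re.sub(r'\s+', '', cleaned).lower()

for mention in ['(BLAST)', 'Image J', 'mAtLab)', 'scikit-learn Python package2223']:
    print(repr(mention), '->', repr(normalize_mention(mention)))
```

The last example still does not collapse to `scikit-learn`, and the digit rule would also strip meaningful digits from names such as `BLAST2`. Rules like these get brittle quickly, which is why the dataset ships a dedicated disambiguated split.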


<a id='dataset_interaction_disambiguated_dataset_query'></a>
#### Query the dataset for a particular software, including all string variations

Now that we have seen why disambiguating software mentions is useful, let's see how we can query the disambiguated dataset and interact with it. Let's start by going back to the **scipy** entry. <BR>

In [47]:
software = 'scipy'
disambiguated_df[disambiguated_df['software'] == software].head(3)

Unnamed: 0,license,location,pmcid,pmid,doi,pubdate,source,number,text,software,version,ID,curation_label,mapped_to_software
8488,comm,comm/ACS_Nano/PMC7905882.nxml,7905882,33556239.0,10.1021/acsnano.0c10632,2021,Cluster Analysis,40,The 83.4% confidence interval for the mean was...,scipy,,SM3076,software,SciPy
9849,comm,comm/ACS_Synth_Biol/PMC8486170.nxml,8486170,33449631.0,10.1021/acssynbio.0c00318,2021,fig_caption,5,Wavelength was computed using scipy FFT and pe...,scipy,,SM3076,software,SciPy
9853,comm,comm/ACS_Synth_Biol/PMC8486170.nxml,8486170,33449631.0,10.1021/acssynbio.0c00318,2021,Computational Model,12,Signal analysis (FFT) and visualization were c...,scipy,,SM3076,software,SciPy


We see that the **mapped_to_software** field for the entries in which **software** is **scipy** is **SciPy**, which means that the plain-text software mention **scipy** got mapped to the **SciPy** software entity through disambiguation. We can also obtain this by:

In [48]:
mapped_to = disambiguated_df[disambiguated_df['software'] == software]['mapped_to_software'].values[0]
print('The', software, 'mention got mapped to the', mapped_to, 'software entity')

The scipy mention got mapped to the SciPy software entity


In [49]:
mapped_to

'SciPy'

Now, let's look at all the other entries that got mapped to the same entry:

In [50]:
mapped_to_df = disambiguated_df[disambiguated_df['mapped_to_software'] == mapped_to]

Now we can retrieve all of the string variations that map to the same entry:

In [51]:
string_variations = mapped_to_df['software'].unique()
print('Here are other potential string variations for', software, 'as given by the disambiguation algorithm')
print(string_variations)

Here are other potential string variations for scipy as given by the disambiguation algorithm
['scipy' 'SciPy' 'Scipy' 'SciPy Python package' 'Python-SciPy'
 'SciPy Python library' 'python-scipy' 'scipy Python'
 'Scipy Python package' 'Scipy1' 'Python scipy' 'Python Scipy package'
 'Python SciPy' 'scipy python library' 'SciPy python package'
 'Scipy Python library' 'Python-Scipy' 'SciPy3' 'scipy Python package'
 'SciPY' 'scipy python package' 'scipy29' 'Scipy python library'
 'Python Scipy' 'Python SciPy package' 'Python SciPy library'
 'Python scipy package' 'SciPy Python packages' 'Python scipy library'
 'python scipy package' 'Scipy Python Library' 'python scipy'
 'Python Scipy library' 'Scipy Python' 'sciPy' 'SciPy80'
 'Scientific Tools for Python' 'SciPy7' 'SciPy, Python library'
 'scipy python' 'scipy Python library' 'Scipy python' 'scipy®' 'scipy16'
 'SCIPy' 'SciPy30' 'SciPy62' 'SciPy42' 'SciPy52' 'SciPy79'
 'SciPy Python Library' 'scipy)' 'Python/Scipy' 'SciPy)'
 'Python packag
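
Beyond listing the variations, it is often useful to know how common each one is. The sketch below uses `value_counts` on a toy frame; `toy_disambiguated_df` is a hypothetical stand-in for `disambiguated_df` with made-up rows, and only the column names are taken from the dataset.

```python
import pandas as pd

# Toy stand-in for disambiguated_df; rows are made up for illustration only.
toy_disambiguated_df = pd.DataFrame({
    'pmid': [10, 11, 12, 13, 14],
    'software': ['scipy', 'SciPy', 'SciPy', 'Python-SciPy', 'limma'],
    'mapped_to_software': ['SciPy', 'SciPy', 'SciPy', 'SciPy', 'limma'],
})

# Keep only the rows mapped to the entity of interest.
toy_mapped_to_df = toy_disambiguated_df[toy_disambiguated_df['mapped_to_software'] == 'SciPy']

# Number of mentions per string variation, most frequent first.
variation_counts = toy_mapped_to_df['software'].value_counts()
print(variation_counts)
```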

<a id='dataset_interaction_disambiguated_examples'></a>
#### Examples of disambiguated software terms

Now let's look at a few more examples! For each of the software mentions below, we retrieve the software entity it gets mapped to and the other string variations mapped to that entity.

In [52]:
software_mentions = ['BLAST', 'sklearn', 'SciPy', 'SPSS', 'ImageJ', 'limma', 'BERT', 'scikit-image', 
                    'GraphPad Prism']

for software in software_mentions:
    mapped_to = disambiguated_df[disambiguated_df['software'] == software]['mapped_to_software'].values[0]
    mapped_to_df = disambiguated_df[disambiguated_df['mapped_to_software'] == mapped_to]
    string_variations = mapped_to_df['software'].unique()
    print(software, 'got mapped to the', mapped_to, 'software entity')
    print('There are', len(string_variations), 'other potential string variations')
    print('Some of these are:')
    print('='*30)
    print(string_variations[:50]) #only printing 50 for visibility

BLAST got mapped to the BLAST software entity
There are 309 other potential string variations
Some of these are:
['BLAST' 'BLASTP' 'Blastp' 'blastp' 'blast' 'BlastP' 'BLAST search'
 'BLAST®' 'BLASTp' 'BLASTs' 'BLAST+' 'Blast search' 'BLAST)'
 'Basic Local Alignment Search Tool' 'BLAST-p' 'BLAST +' 'Blast' 'blastP'
 'BLAST-P' 'Basic Local Alignment Search tool' 'blast search' 'blast+'
 'BLAST-searches' 'BLAST-search' 'blast+ package' 'NCBI Blast'
 'Basic local alignment search tool' 'Basic Local Alignment Search Tool74'
 'Basic Local Alignment Search Tool)' 'BLAST10' 'BLAST14' 'BLAST37'
 'Blast +' 'Blastp)' '(Basic Local Alignment Search Tool)' 'ncbi-blast'
 'Basic Local Alignment Search Tool®' 'blastP)' 'NCBI-BLAST'
 'blastp suite' 'BLAST.P' 'BLAST+ package' 'BLASTS' 'ncbi - blast +'
 'BLASTP2' 'BLAST2' 'NCBI-BLAST+' 'BLAST package' 'BLAST2P' 'BLAST3']
sklearn got mapped to the scikit-learn software entity
There are 124 other potential string variations
Some of these are:
['Scikit-Lear
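
The loop above indexes `.values[0]`, which raises an `IndexError` when a mention does not occur in the dataset. A more defensive lookup could look like the sketch below. `get_mapped_entity` is a hypothetical helper name and the toy frame is made up; only the column names come from the dataset.

```python
import pandas as pd

def get_mapped_entity(df, mention):
    """Return the entity a mention maps to, or None if the mention is absent."""
    matches = df.loc[df['software'] == mention, 'mapped_to_software']
    return matches.iloc[0] if not matches.empty else None

# Toy stand-in for disambiguated_df; rows are made up for illustration only.
toy_disambiguated_df = pd.DataFrame({
    'software': ['sklearn', 'scikit-learn'],
    'mapped_to_software': ['scikit-learn', 'scikit-learn'],
})

print(get_mapped_entity(toy_disambiguated_df, 'sklearn'))
print(get_mapped_entity(toy_disambiguated_df, 'not-a-real-mention'))
```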

<a id='dataset_interaction_disambiguated_most_frequent_mentions'></a>

#### Example: Retrieve the most frequent software mentions based on disambiguated dataset

Now let's look at getting the most frequent software in our dataset, based on the **mapped_to_software** field.
It is the same command as on the [**raw** dataset](#dataset_interaction_raw_most_frequent_mentions), except that we group by the **mapped_to_software** field instead of the **software** field. The counts are now aggregated over *all* the string variations produced by the disambiguation algorithm.

In [53]:
# Group the disambiguated split by entity so that counts aggregate over all string variations
disambiguated_df_aggregated = disambiguated_df.groupby('mapped_to_software').nunique().sort_values(by = 'pmid', ascending = False)
disambiguated_df_aggregated[:20]


Let's remind ourselves of the most frequent terms in the raw dataset:

In [54]:
raw_df_aggregated[:20]

Unnamed: 0,software,license,location,pmcid,pmid,doi,pubdate,source,number,text,version,ID,curation_label
0,SPSS,1,286647,286647,285279,279566,22,31053,691,291953,836,1,software
1,R,1,192460,192460,191817,191636,21,101864,662,299463,1042,1,software
2,GraphPad Prism,1,119661,119661,119357,118576,19,27484,295,129161,409,1,software
3,ImageJ,1,94715,94715,94428,93950,19,76479,349,159854,753,1,software
4,Excel,1,80247,80247,79826,79447,24,33460,444,96334,299,1,software
5,GraphPad,1,75614,75614,75438,74900,19,14637,240,76420,134,1,software
6,SAS,1,75133,75133,74919,74596,23,15059,318,88060,292,1,unclear
7,BLAST,1,54981,54981,54870,54660,24,54508,287,100939,122,1,software
8,Stata,1,46462,46462,46279,46175,20,7807,355,52525,383,1,software
9,MATLAB,1,46532,46532,46265,46417,21,39037,375,77802,595,1,software


Note that the counts on the **disambiguated** and **raw** dataset splits are expected to differ: the disambiguated counts aggregate over all the string variations mapped to each software entity, whereas the raw counts treat every distinct string separately.
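
The difference between the two aggregations is easiest to see on a small example. The frame below is a made-up stand-in (`toy_disambiguated_df` is hypothetical); it counts unique PMIDs once per exact string and once per mapped entity.

```python
import pandas as pd

# Toy stand-in for disambiguated_df; rows are made up for illustration only.
toy_disambiguated_df = pd.DataFrame({
    'pmid': [1, 2, 3, 3, 4],
    'software': ['ImageJ', 'Image J', 'ImageJ', 'Image J', 'SPSS'],
    'mapped_to_software': ['ImageJ', 'ImageJ', 'ImageJ', 'ImageJ', 'SPSS'],
})

# Unique PMIDs per exact string (raw view) and per entity (disambiguated view).
raw_counts = toy_disambiguated_df.groupby('software')['pmid'].nunique()
entity_counts = toy_disambiguated_df.groupby('mapped_to_software')['pmid'].nunique()

print(raw_counts)
print(entity_counts)
```

In this toy data the entity count for ImageJ (3) is smaller than the sum of the per-string counts (2 + 2), because one paper uses both spellings and is only counted once.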

<a id='dataset_interaction_linked'></a>
### linked files

We queried a number of databases, searching for exact matches for plain-text software mentions in the `comm` collection. The databases we queried are: 
- [PyPI Index](https://pypi.org/simple/)
- [Bioconductor Index](https://www.bioconductor.org/packages/release/bioc/)
- [CRAN Index](https://cran.r-project.org/web/packages/available_packages_by_name.html)
- [GitHub API](https://github.com)
- [SciCrunch API](https://scicrunch.org/resources)

We describe the methodology we used to obtain the links in detail in our preprint, as well as in our [GitHub repository](https://github.com/chanzuckerberg/software-mentions/tree/main/software-mentions-linker-disambiguator).

We compiled this information in the `metadata.tsv.gz` file, which is assumed to be under the `linked` directory, as described in the [Interacting with the dataset Section](#dataset_interaction)

The **linked** files contain metadata links, extracted through the linking algorithm. There is only one file to interact with:
- metadata.tsv.gz

The file is also stored as a GZIP archive. As with the [**raw** datasets](#dataset_interaction_raw), if you are using pandas, we recommend opening it as follows:

In [55]:
linked_df = pd.read_csv(ROOT_DATA_DIR + 'linked/metadata.tsv.gz', sep = '\\t', engine = 'python', compression = 'gzip', error_bad_lines = False)

Skipping line 79480: Expected 16 fields in line 79480, saw 17. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
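
The `error_bad_lines` argument was deprecated in pandas 1.3 and removed in pandas 2.0, where `on_bad_lines` takes its place. The sketch below shows a possible equivalent call on a tiny in-memory archive. The toy rows and the names `buffer` and `toy_linked_df` are made up for this illustration. It also uses a plain single-character tab separator, so quoted fields are honored and results may differ slightly from the call above on rows that contain quotes.

```python
import gzip
import io

import pandas as pd

# Toy stand-in for linked/metadata.tsv.gz: three columns, one malformed row with four fields.
rows = [
    "ID\tsoftware_mention\tsource",
    "SM1\tscipy\tPypi Index",
    "SM2\tlimma\tBioconductor Index\tunexpected extra field",
    "SM3\tBLAST\tSciCrunch API",
]
buffer = io.BytesIO(gzip.compress("\n".join(rows).encode("utf-8")))

# on_bad_lines='skip' drops rows with too many fields instead of raising an error.
toy_linked_df = pd.read_csv(buffer, sep="\t", compression="gzip", on_bad_lines="skip")
print(toy_linked_df)
```

Passing `on_bad_lines='warn'` instead reports each skipped line, which is closer to the message printed above.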


Let's start by looking at a few samples:

In [56]:
linked_df.sample(3)

Unnamed: 0,ID,software_mention,mapped_to,source,platform,package_url,description,homepage_url,other_urls,license,github_repo,github_repo_license,exact_match,RRID,reference,scicrunch_synonyms
111136,SM288033,IDDM,IDDM,Github API,,https://github.com/emigmo/IDDM,,,,,https://github.com/emigmo/IDDM,,True,,,
153189,SM796465,log4cplus,log4cplus,Github API,,https://github.com/log4cplus/log4cplus,log4cplus is a simple to use C++ logging API p...,,,,https://github.com/log4cplus/log4cplus,,True,,,
29929,"volume = {21},",,,,,,,,,,,,,,,
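
The last sampled row is not a real entry: its **ID** holds a text fragment and every other field is empty, most likely a side effect of how the file was parsed. One possible way to drop such rows is to keep only IDs that look like the others. This is an assumption based on the samples shown here (`SM` followed by digits), not a documented guarantee, and `toy_linked_df` is a made-up stand-in for `linked_df`.

```python
import pandas as pd

# Toy stand-in for linked_df, mirroring the three sampled rows above.
toy_linked_df = pd.DataFrame({
    'ID': ['SM288033', 'SM796465', 'volume = {21},'],
    'software_mention': ['IDDM', 'log4cplus', None],
})

# Assumption: valid IDs consist of 'SM' followed by digits only.
has_valid_id = toy_linked_df['ID'].astype(str).str.match(r'SM\d+$')
clean_linked_df = toy_linked_df[has_valid_id]
print(clean_linked_df)
```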


We can look at the fields we have available

In [57]:
linked_df.columns

Index(['ID', 'software_mention', 'mapped_to', 'source', 'platform',
       'package_url', 'description', 'homepage_url', 'other_urls', 'license',
       'github_repo', 'github_repo_license', 'exact_match', 'RRID',
       'reference', 'scicrunch_synonyms'],
      dtype='object')

We describe the fields in detail in the GitHub repository, under [Linking Schema](https://github.com/chanzuckerberg/software-mentions/tree/main/software-mentions-linker-disambiguator#linking-schema).

Now let's explore the information we have available!

<a id='dataset_interaction_linked'></a>
#### Examples of linked software terms

Let's start by querying a number of software mentions, to see what links we get. We can start with the previous set of software mentions we looked at: ***'BLAST', 'sklearn', 'SciPy', 'SPSS', 'ImageJ', 'limma', 'BERT', 'scikit-image', 'GraphPad Prism'***

##### BLAST

In [58]:
linked_df[linked_df['software_mention'] == 'BLAST']

Unnamed: 0,ID,software_mention,mapped_to,source,platform,package_url,description,homepage_url,other_urls,license,github_repo,github_repo_license,exact_match,RRID,reference,scicrunch_synonyms
41167,SM501,BLAST,BLAST Similarity Search,SciCrunch API,,https://scicrunch.org/browse/resources/SCR_008419,The goals of this sequencing effort are to pro...,['http://www.broad.mit.edu/cgi-bin/annotation/...,[],,[None],,False,SCR_008419,"[nan, '(BLAST Similarity Search, RRID:SCR_0084...","['blast similarity search', 'blast']"


##### sklearn

In [59]:
linked_df[linked_df['software_mention'] == 'sklearn']

Unnamed: 0,ID,software_mention,mapped_to,source,platform,package_url,description,homepage_url,other_urls,license,github_repo,github_repo_license,exact_match,RRID,reference,scicrunch_synonyms
8276,SM3414,sklearn,sklearn,Pypi Index,Pypi,https://pypi.org/project/sklearn,A set of python modules for machine learning a...,['https://pypi.python.org/pypi/scikit-learn/'],,,[None],,True,,,
41751,SM3414,sklearn,Sklearn,SciCrunch API,,https://scicrunch.org/browse/resources/SCR_019053,Software Python package part of nonnegative ma...,['https://scikit-learn.org/stable/modules/gene...,['https://github.com/scikit-learn/scikit-learn...,,"['https://github.com/scikit-learn/scikit-learn,']",,False,SCR_019053,"[nan, '(Sklearn, RRID:SCR_019053)']",['sklearn']
60459,SM3414,sklearn,sklearn,Github API,,https://github.com/KeithGalli/sklearn,Data & Code associated with my tutorial on the...,,,,https://github.com/KeithGalli/sklearn,,True,,,


##### SciPy

In [60]:
linked_df[linked_df['software_mention'] == 'SciPy']

Unnamed: 0,ID,software_mention,mapped_to,source,platform,package_url,description,homepage_url,other_urls,license,github_repo,github_repo_license,exact_match,RRID,reference,scicrunch_synonyms
41719,SM3207,SciPy,SciPy,SciCrunch API,,https://scicrunch.org/browse/resources/SCR_008058,A Python-based environment of open-source soft...,['http://www.scipy.org/'],[],,[None],,True,SCR_008058,"[nan, '(SciPy, RRID:SCR_008058)']","['scientific tools for python', 'scipy']"


##### SPSS

In [61]:
linked_df[linked_df['software_mention'] == 'SPSS']

Unnamed: 0,ID,software_mention,mapped_to,source,platform,package_url,description,homepage_url,other_urls,license,github_repo,github_repo_license,exact_match,RRID,reference,scicrunch_synonyms
41139,SM165,SPSS,SPSS,SciCrunch API,,https://scicrunch.org/browse/resources/SCR_002865,"Software package used for interactive, or batc...",['http://www-01.ibm.com/software/uk/analytics/...,['https://www.ibm.com/products/software'],,[None],,True,SCR_002865,"[nan, '(SPSS, RRID:SCR_002865)']","['spss software', 'spss']"
60116,SM165,SPSS,SPSS,Github API,,https://github.com/laribio/SPSS,SPSS workshop,,,,https://github.com/laribio/SPSS,,True,,,


##### ImageJ

In [62]:
linked_df[linked_df['software_mention'] == 'ImageJ']

Unnamed: 0,ID,software_mention,mapped_to,source,platform,package_url,description,homepage_url,other_urls,license,github_repo,github_repo_license,exact_match,RRID,reference,scicrunch_synonyms
41105,SM25,ImageJ,ImageJ,SciCrunch API,,https://scicrunch.org/browse/resources/SCR_003070,Open source Java based image processing softwa...,['https://imagej.net/'],"['http://www.nitrc.org/projects/incf_imagej,',...",,[None],,True,SCR_003070,"['http://www.ncbi.nlm.nih.gov/pubmed/29187165,...","['image j', 'imagej - image processing and ana..."


##### limma

In [63]:
linked_df[linked_df['software_mention'] == 'limma']

Unnamed: 0,ID,software_mention,mapped_to,source,platform,package_url,description,homepage_url,other_urls,license,github_repo,github_repo_license,exact_match,RRID,reference,scicrunch_synonyms
10649,SM2939,limma,limma,Bioconductor Index,Bioconductor,https://www.bioconductor.org/packages/release/...,Linear Models for Microarray Data,['http://bioinf.wehi.edu.au/limma'],,GPL (>=2),,,True,,https://doi.org/doi:10.18129/B9.bioc.limma,
41685,SM2939,limma,LIMMA,SciCrunch API,,https://scicrunch.org/browse/resources/SCR_010943,Software package for the analysis of gene expr...,['http://bioinf.wehi.edu.au/limma/'],"['https://omictools.com/limma-tool,', 'https:/...",,[None],,False,SCR_010943,"[nan, '(LIMMA, RRID:SCR_010943)']","['limma: linear models for microarray data', '..."
60319,SM2939,limma,limma,Github API,,https://github.com/mjktfw/limma,This is a read-only mirror of the Bioconductor...,,,,https://github.com/mjktfw/limma,,True,,,


##### BERT

In [64]:
linked_df[linked_df['software_mention'] == 'BERT']

Unnamed: 0,ID,software_mention,mapped_to,source,platform,package_url,description,homepage_url,other_urls,license,github_repo,github_repo_license,exact_match,RRID,reference,scicrunch_synonyms
43698,SM25010,BERT,BERT,SciCrunch API,,https://scicrunch.org/browse/resources/SCR_018008,Technique for Natural Language Processing pre-...,['https://github.com/google-research/bert'],[],,['https://github.com/google-research/bert'],,True,SCR_018008,"[nan, '(BERT, RRID:SCR_018008)']",['bert']


##### scikit-image

In [65]:
linked_df[linked_df['software_mention'] == 'scikit-image']

Unnamed: 0,ID,software_mention,mapped_to,source,platform,package_url,description,homepage_url,other_urls,license,github_repo,github_repo_license,exact_match,RRID,reference,scicrunch_synonyms
7884,SM3074,scikit-image,scikit-image,Pypi Index,Pypi,https://pypi.org/project/scikit-image,Image processing in Python,['https://scikit-image.org'],,OSI Approved :: BSD License,['https://github.com/scikit-image/scikit-image'],,True,,,
41698,SM3074,scikit-image,scikit-image,SciCrunch API,,https://scicrunch.org/browse/resources/SCR_021142,Open source software tool as collection of ima...,['https://scikit-image.org/'],[],,[None],,True,SCR_021142,"['http://dx.doi.org/10.7717/peerj.453', '(scik...",['scikit-image']
60373,SM3074,scikit-image,scikit-image,Github API,,https://github.com/scikit-image/scikit-image,Image processing in Python,,,,https://github.com/scikit-image/scikit-image,,True,,,
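
Instead of querying one mention at a time, several mentions can be looked up together with `isin` and summarized per mention. The sketch below is a minimal illustration: `toy_linked_df` is a hypothetical stand-in for `linked_df`, with a handful of values copied from the outputs above.

```python
import pandas as pd

# Toy stand-in for linked_df; values copied from the limma and BERT outputs above.
toy_linked_df = pd.DataFrame({
    'software_mention': ['limma', 'limma', 'limma', 'BERT'],
    'source': ['Bioconductor Index', 'SciCrunch API', 'Github API', 'SciCrunch API'],
    'exact_match': [True, False, True, True],
})

mentions = ['limma', 'BERT']
subset = toy_linked_df[toy_linked_df['software_mention'].isin(mentions)]

# List the sources that returned a link for each mention.
sources_per_mention = subset.groupby('software_mention')['source'].apply(list)
print(sources_per_mention)
```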


<a id='dataset_interaction_linked_metadata_fields'></a>
#### Exploration of metadata fields

In this section, we offer some examples of the type of metadata we are able to retrieve through the links we created. <br>
First, let's remind ourselves of the fields we have available:

In [66]:
linked_df.columns

Index(['ID', 'software_mention', 'mapped_to', 'source', 'platform',
       'package_url', 'description', 'homepage_url', 'other_urls', 'license',
       'github_repo', 'github_repo_license', 'exact_match', 'RRID',
       'reference', 'scicrunch_synonyms'],
      dtype='object')

And here is a full description of the metadata fields:

| field | description |
| :--- | :--- |
| **ID** 	| unique ID of software mention (generated by us) |
| **software_mention**	| plain-text software mention |
| **mapped_to**	| value the software_mention is being mapped to |
| **source**	| source of the mapping - eg Bioconductor Index, GitHub API |
| **platform**	| platform of software_mention - eg PyPI, CRAN |
| **package_url**	| URL linking software_mention to source |
| **description**	| description of software_mention |
| **homepage_url**	| homepage_url of software_mention |
| **other_urls**	| other related URLs |
| **license**	| software license |
| **github_repo**	| GitHub repository |
| **github_repo_license**	| GitHub repository license |
| **exact_match**	| whether or not this mapping was an exact match |
| **RRID**	| RRID for software_mention |
| **reference**	| journal articles linked to software_mention (identified either through DOI, pmid or RRID) |
| **scicrunch_synonyms**	| synonyms for software_mention, retrieved from Scicrunch |

In [67]:
pd.set_option('display.max_colwidth', 1000)  # show long text fields without truncation

##### package_url + description
Examples for the software entities:

In [68]:
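software = 'GraphPad Prism'  # set explicitly rather than relying on the value left over from the loop above
linked_df[linked_df['software_mention'] == software][['software_mention', 'package_url', 'description']]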

Unnamed: 0,software_mention,package_url,description
41114,GraphPad Prism,https://scicrunch.org/browse/resources/SCR_002798,"Statistical analysis software that combines scientific graphing, comprehensive curve fitting (nonlinear regression), understandable statistics, and data organization. Designed for biological research applications in pharmacology, physiology, and other biological fields for data analysis, hypothesis testing, and modeling."


##### homepage_url + other_urls
Examples for the software entities:

In [69]:
linked_df[linked_df['software_mention'] == software][['software_mention', 'homepage_url', 'other_urls']]

Unnamed: 0,software_mention,homepage_url,other_urls
41114,GraphPad Prism,['http://www.graphpad.com/'],"['http://graphpad-prism.software.informer.com/5.0/,', 'https://www.graphpad.com/guides/prism/7/user-guide/index.htm']"


##### github_repo + github_repo_license
Examples for the software entities:

In [70]:
linked_df[linked_df['software_mention'] == 'scipy'][['software_mention', 'github_repo', 'github_repo_license']]

Unnamed: 0,software_mention,github_repo,github_repo_license
7908,scipy,['https://github.com/scipy/scipy'],
41700,scipy,[None],
60375,scipy,https://github.com/scipy/scipy,
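
As the output shows, list-like fields are stored as strings, and their shape varies: a stringified list, `[None]`, a bare URL, or a missing value. The helper below is a best-effort sketch for turning such values into Python lists. `parse_list_field` is a hypothetical name, and the regular expression is a heuristic based only on the values visible in this notebook.

```python
import re

def parse_list_field(value):
    """Best-effort conversion of a stringified list field into a list of strings."""
    if not isinstance(value, str) or not value.strip():
        return []  # NaN or otherwise missing value
    text = value.strip()
    if not text.startswith('['):
        return [text]  # plain string, e.g. a bare GitHub URL
    # Pull out every quoted item; unquoted None / nan entries are skipped.
    items = re.findall(r"'([^']*)'|\"([^\"]*)\"", text)
    # Each match is a (single_quoted, double_quoted) pair; keep the non-empty one
    # and strip the stray trailing commas seen in some URLs.
    return [(single or double).rstrip(',').strip() for single, double in items]

print(parse_list_field("['https://github.com/scipy/scipy']"))
print(parse_list_field('[None]'))
print(parse_list_field('https://github.com/scipy/scipy'))
```

On a DataFrame, a helper like this could be applied column-wise, for example with `linked_df['github_repo'].apply(parse_list_field)`.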


##### RRID
Examples for the software entities:

In [71]:
linked_df[linked_df['software_mention'] == software][['software_mention', 'RRID']]

Unnamed: 0,software_mention,RRID
41114,GraphPad Prism,SCR_002798


##### reference
Examples for the software entities:

In [72]:
linked_df[linked_df['software_mention'] == 'ImageJ'][['software_mention', 'reference']]

Unnamed: 0,software_mention,reference
41105,ImageJ,"['http://www.ncbi.nlm.nih.gov/pubmed/29187165, 22930834', '(ImageJ, RRID:SCR_003070)']"


##### scicrunch_synonyms
Examples for the software entities:

In [73]:
linked_df[linked_df['software_mention'] == 'BLAST'][['software_mention', 'scicrunch_synonyms']]

Unnamed: 0,software_mention,scicrunch_synonyms
41167,BLAST,"['blast similarity search', 'blast']"


##### platform
Examples for the software entities:

In [74]:
linked_df[linked_df['software_mention'] == 'scipy'][['software_mention', 'platform']]

Unnamed: 0,software_mention,platform
7908,scipy,Pypi
41700,scipy,
60375,scipy,


##### license
Examples for the software entities:

In [75]:
linked_df[linked_df['software_mention'] == 'scipy'][['software_mention', 'license']]

Unnamed: 0,software_mention,license
7908,scipy,OSI Approved :: BSD License
41700,scipy,
60375,scipy,


<a id='dataset_interaction_linked_query'></a>
## Query the dataset for a particular software, retrieving all the data
In this section, we offer an example of querying the dataset end to end for a particular software entity, retrieving all of the available information. <br>
Let's take as an example the software mention **sklearn**, which maps to the **scikit-learn** entity.


###### 1. Retrieve software entity mapping

In [76]:
software_mention = 'sklearn'
mention_df = disambiguated_df[disambiguated_df['software'] == software_mention]

In [77]:
software_mention_mapping = disambiguated_df[disambiguated_df['software'] == software_mention]['mapped_to_software'].values[0]
print('Mapped', software_mention, 'to', software_mention_mapping)

Mapped sklearn to scikit-learn


###### 2. Retrieve string variations

In [78]:
string_variations = disambiguated_df[disambiguated_df['mapped_to_software'] == software_mention_mapping]['software'].unique()
print('String variations for', software_mention_mapping)
print('=' * 30)
print(string_variations)

String variations for scikit-learn
['Scikit-Learn' 'scikit-learn Python' 'scikit-learn' 'Scikit learn'
 'sklearn' 'scikit-learn Python library' 'Scikit-learn' 'Sklearn'
 'Python scikit-learn' 'scikit learn' 'scikit‐learn' 'Scikit‐learn'
 'Scikit-Learn Python' 'scikit-learn81' 'SciKit-Learn' 'sklearn Python'
 'Python package scikit-learn' 'Scikit-learn Python library'
 'Scikit Learn5' 'Scikit-learn Python package' 'Sci-Kit Learn'
 'Python Scikit-learn' 'Scikit Learn' 'scikit-learn Python package'
 'Python Scikit-learn library' 'Sci-kit Learn'
 'Python scikit-learn package' 'sci-kit learn' 'Scikit Learn Python'
 'SciKit Learn' 'scikit-learn2' 'Python scikit-learn library'
 'Sci-kit learn' 'Scikit-learn Python' 'Python Sklearn library'
 'Sklearn Python' 'Scikit-learn29' 'scikits-learn' 'Python Sklearn'
 'sci-kit-learn' 'scikit_learn' 'SciKit-learn Python library' 'SkLearn'
 'scikit -learn' 'Python Scikit-Learn' 'Python3 sklearn' 'SciKitLearn'
 'scikitlearn' 'ScikitLearn' 'SciKit-learn' 'P

###### 3. Retrieve the total number of unique publications the mention appears in

###### undisambiguated

In [79]:
num_counts_undisambiguated = disambiguated_df[disambiguated_df['software'] == software_mention]['pmid'].nunique()
print(software_mention, 'appears in', num_counts_undisambiguated, 'number of PMIDs')

sklearn appears in 556 number of PMIDs


###### disambiguated

In [80]:
num_counts_disambiguated = disambiguated_df[disambiguated_df['software'] == software_mention_mapping]['pmid'].nunique()
print(software_mention, 'is mapped to', software_mention_mapping, 'which appears in', num_counts_disambiguated, 'number of PMIDs')

sklearn is mapped to scikit-learn which appears in 2415 number of PMIDs
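
The query above matches only rows whose `software` string is exactly the mapped name. To count publications across every string variation of an entity, one option is to filter on `mapped_to_software` instead. The sketch below assumes the same column names as above; the toy DataFrame and the helper name `count_pmids_for_entity` are hypothetical, and the counts are illustrative rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical rows: four mentions of one entity under three spellings,
# plus one unrelated mention.
toy_df = pd.DataFrame({
    'pmid': [1, 1, 2, 3, 4],
    'software': ['sklearn', 'scikit-learn', 'Scikit-learn', 'scikit-learn', 'scipy'],
    'mapped_to_software': ['scikit-learn'] * 4 + ['SciPy'],
})

def count_pmids_for_entity(df, entity):
    """Count unique PMIDs over all string variations mapped to `entity`."""
    return df.loc[df['mapped_to_software'] == entity, 'pmid'].nunique()

# Exact-string count: only rows spelled exactly 'scikit-learn'.
exact = toy_df.loc[toy_df['software'] == 'scikit-learn', 'pmid'].nunique()
# Entity-level count: every spelling that maps to 'scikit-learn'.
all_variants = count_pmids_for_entity(toy_df, 'scikit-learn')

print('exact string:', exact)
print('all variations:', all_variants)
```

The entity-level count is never smaller than the exact-string count, and the gap between the two indicates how much of the usage is hidden behind spelling variants.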


###### 4. Retrieve metadata

In [81]:
software_mention_metadata = linked_df[linked_df['software_mention'] == software_mention]

In [82]:
software_mention_metadata

Unnamed: 0,ID,software_mention,mapped_to,source,platform,package_url,description,homepage_url,other_urls,license,github_repo,github_repo_license,exact_match,RRID,reference,scicrunch_synonyms
8276,SM3414,sklearn,sklearn,Pypi Index,Pypi,https://pypi.org/project/sklearn,A set of python modules for machine learning and data mining,['https://pypi.python.org/pypi/scikit-learn/'],,,[None],,True,,,
41751,SM3414,sklearn,Sklearn,SciCrunch API,,https://scicrunch.org/browse/resources/SCR_019053,"Software Python package part of nonnegative matrix factorization NMF. Features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with Python numerical and scientific libraries NumPy and SciPy.",['https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html'],"['https://github.com/scikit-learn/scikit-learn,', 'https://scikit-learn.org/stable/']",,"['https://github.com/scikit-learn/scikit-learn,']",,False,SCR_019053,"[nan, '(Sklearn, RRID:SCR_019053)']",['sklearn']
60459,SM3414,sklearn,sklearn,Github API,,https://github.com/KeithGalli/sklearn,Data & Code associated with my tutorial on the sci-kit learn machine learning library in python,,,,https://github.com/KeithGalli/sklearn,,True,,,
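
A mention can have several candidate links, one per source, and not all of them are flagged as exact matches. The sketch below shows one way to keep only the exact matches. It assumes that `exact_match` is loaded as a boolean column; the toy DataFrame reproduces a few fields from the output above, and the helper name `exact_links` is hypothetical.

```python
import pandas as pd

# Toy stand-in for linked_df, reproducing a subset of the columns and
# values from the three rows printed above.
toy_linked_df = pd.DataFrame({
    'software_mention': ['sklearn', 'sklearn', 'sklearn'],
    'source': ['Pypi Index', 'SciCrunch API', 'Github API'],
    'package_url': [
        'https://pypi.org/project/sklearn',
        'https://scicrunch.org/browse/resources/SCR_019053',
        'https://github.com/KeithGalli/sklearn',
    ],
    'exact_match': [True, False, True],  # assumed to be boolean
})

def exact_links(df, mention):
    """Return source and URL for rows of `mention` flagged as exact matches."""
    rows = df[(df['software_mention'] == mention) & df['exact_match']]
    return rows[['source', 'package_url']].reset_index(drop=True)

links = exact_links(toy_linked_df, 'sklearn')
print(links)
```

Note that an exact string match does not guarantee the right link: here the GitHub row points to a tutorial repository rather than the library itself, so links are worth inspecting before use.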


<a id='prompts'></a>

That's it! This notebook serves as a starting point for showcasing how particular research questions might be answered using the data available in the **CZI Software Mentions Dataset**. It is our hope that it sparks further analyses and interesting research directions. <br>

**A non-exhaustive list of potential questions to consider next:**
1. Understanding the context in which a software mention appears. Ideas include using topic modeling, pretrained language model embeddings, or models for citation intent classification.
2. Drawing better insights about the usage of software in particular fields by incorporating other metadata about a paper, such as MeSH terms, author-provided keywords, title, or abstract.
3. Exploring measures of impact for software, such as Eigenfactor.
4. Looking into the open-access policies of the most used or most impactful software, and how they vary across fields.
5. Exploring how software usage varies over time.


We are excited about the work that will build on top of the **CZI Software Mentions Dataset** and can't wait to see what other ideas you come up with!

### References

<a id='references_scibert'>1. Beltagy, Iz, Kyle Lo, and Arman Cohan. "SciBERT: A pretrained language model for scientific text." arXiv preprint arXiv:1903.10676 (2019).</a> <br>
<a id='references_dbscan'>2. Ester, Martin, et al. "A density-based algorithm for discovering clusters in large spatial databases with noise." KDD. Vol. 96. No. 34. 1996.</a>