# Co-mentioning  

* Remove preprints from articles
* Create co-mentioning matrix
* Check whether the mentioning articles are themselves about a tool in bio.tools


The mentions file should contain a list in which each element holds a tool name, a PMID, and the list of articles mentioning the tool.
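
As an illustration, the loaded structure might look roughly like the sketch below. The `Article` dataclass and all sample values are hypothetical stand-ins. Only the `name` and `articles` keys and the `id` and `pubType` attributes are taken from how the later cells access the data; the `pmid` key name is a guess, and the objects actually returned by `read_cites_from_json` may differ.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical stand-in for one article that mentions a tool."""
    id: str
    pubType: str


# Assumed shape: one dict per tool (sample values are invented for illustration)
example_tools = [
    {
        "name": "tool-a",
        "pmid": "0000001",
        "articles": [Article("A1", "research-article"), Article("A2", "preprint")],
    },
    {
        "name": "tool-b",
        "pmid": "0000002",
        "articles": [Article("A2", "preprint")],
    },
]

# Print each tool with the IDs of the articles that mention it
for tool in example_tools:
    print(tool["name"], [article.id for article in tool["articles"]])
```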

In [3]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [5]:
from bh24_literature_mining.utils import read_cites_from_json

path = '../var/biotools_cites.json' # REPLACE with the path to the mentions file

tools = read_cites_from_json(path)

print(f'Loaded {len(tools)} tools.')

Loaded 9453 tools.


### Remove preprints from articles in every tool

A published article can also appear in the list as a preprint, because the preprint and the final publication have different DOIs and other IDs.

In [6]:
for tool in tools:
    tool['articles'] = [article for article in tool['articles'] if article.pubType != 'preprint']

# Only the article lists shrink here; the number of tools stays the same
print(f'Filtered out preprints. Now {len(tools)} tools.')

Filtered out preprints. Now 9453 tools.


### Collect unique article IDs

In [8]:
publication_ids = list({article.id for tool in tools for article in tool['articles']})

print(f'Found {len(publication_ids)} unique publication IDs.')

Found 366828 unique publication IDs.


### Create binary matrix for tools vs. articles

In [12]:
import pandas as pd

# Work on a small subset first
tools_short = tools[:10]

# Collect each tool's article IDs in a set so that membership tests are fast
article_ids_per_tool = [{article.id for article in tool['articles']} for tool in tools_short]

# One row per tool, one column per article: True if the article mentions the tool
matrix = [[article_id in article_ids for article_id in publication_ids] for article_ids in article_ids_per_tool]

comentions_df = pd.DataFrame(matrix, columns=publication_ids)

# Set row names to tool names. The names must come from tools_short: taking them from
# all of `tools` raised
# "ValueError: Length mismatch: Expected axis has 10 elements, new values have 9453 elements"
comentions_df.index = [tool['name'] for tool in tools_short]
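
Before scaling the cell above from ten tools to all of them, it may help to estimate the size of the full matrix. The sketch below only does that arithmetic, using the counts printed earlier in this notebook. The one-byte-per-cell figure is an assumption that holds for a dense NumPy boolean array.

```python
n_tools = 9453        # from "Loaded 9453 tools."
n_articles = 366828   # from "Found 366828 unique publication IDs."

cells = n_tools * n_articles
bytes_per_cell = 1    # assumption: one byte per boolean, as in a dense NumPy bool array

print(f"{cells:,} cells")
print(f"about {cells * bytes_per_cell / 1e9:.1f} GB as a dense boolean array")
```

A nested Python list needs considerably more than this, since each entry is stored as a pointer (8 bytes on 64-bit CPython), so a sparse representation may be worth considering for the full data set.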

In [None]:
# Rows whose tool name (index label) occurs more than once.
# A boolean mask avoids repeating rows, which .loc with duplicated labels would do.
duplicated_rows = comentions_df[comentions_df.index.duplicated(keep=False)]

# Of those, keep only the groups in which every row is identical, i.e. exact duplicates
unique_rows = duplicated_rows.groupby(level=0).filter(lambda x: x.nunique().eq(1).all())

# Remove exact duplicates by dropping all but the first instance
comentions_matrix = pd.concat([comentions_df.drop(index=unique_rows.index), unique_rows.groupby(level=0).first()])

print("Deduplicated DataFrame:")
print(comentions_matrix.head())
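
To get from a tools-by-articles structure to actual co-mention counts, one option is to count the articles shared by each pair of tools. The following is a minimal sketch on hypothetical toy data. The `mentions` dictionary and the `comention_counts` helper are illustrative names and are not part of `bh24_literature_mining`.

```python
from itertools import combinations

# Hypothetical toy data: tool name -> set of IDs of the articles mentioning it
mentions = {
    "tool-a": {"A1", "A2", "A3"},
    "tool-b": {"A2", "A3"},
    "tool-c": {"A4"},
}


def comention_counts(mentions):
    """Count the shared articles for every unordered pair of tools."""
    counts = {}
    for left, right in combinations(sorted(mentions), 2):
        shared = len(mentions[left] & mentions[right])
        if shared:  # skip pairs that are never mentioned together
            counts[(left, right)] = shared
    return counts


print(comention_counts(mentions))
```

On the boolean matrix itself, the same counts should correspond to the off-diagonal entries of the matrix multiplied by its transpose, after casting the values to integers.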


### Check if articles are about a tool in bio.tools

In [None]:
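# Load bio.tools data
import json

biotoolspath = 'path/to/biotools.json' # REPLACE

with open(biotoolspath) as f:
    biotools = json.load(f)

# Collect the publication IDs of each bio.tools entry once (PMID preferred, PMCID as fallback)
biotools_pub_ids = [
    (tool['biotoolsID'], {article.get('pmid') or article.get('pmcid') for article in tool.get('publication', [])})
    for tool in biotools
]

# Pair every mentioning article that is itself a bio.tools publication with that tool's ID
publication_ids_in_biotools = [
    (pub_id, biotools_id)
    for pub_id in publication_ids
    for biotools_id, pub_ids in biotools_pub_ids
    if pub_id in pub_ids
]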

print(f'Matched {len(publication_ids_in_biotools)} publication IDs with a bio.tools tool.')