# Mapping results analysis

The mapping output consists of CSV files that are stored in different folders, according to the relation between the bibliographic resources (BRs) in the two collections. The folders are the following:
1. the folder containing bibliographic resources that are mapped in a 1:1 relation (each OC Meta BR is mapped to exactly one OpenAlex BR)
2. the folder containing multi-mapped bibliographic resources (each OC Meta BR is mapped to more than one OpenAlex BR)
3. the folder containing unmapped bibliographic resources (the CSV files only contain the OC Meta BRs for which no corresponding entity was found in OpenAlex).

With a sample configuration file like the following, the mapping results are written to the `mapped` (1), `multi_mapped` (2) and `non_mapped` (3) folders, all of which are inside the `mapping_output` folder.

```yaml
meta_tables:
  meta_dump_zip: '../data/oc_meta/oc_meta.zip'
  meta_ids_out: '../process/meta_ids'
  all_rows: True

openalex_works:
  inp_dir: '../data/openalex/data/works'
  out_dir: '../process/openalex_tables/works'
  entity_type: 'work'
openalex_sources:
  inp_dir: '../data/openalex/data/sources'
  out_dir: '../process/openalex_tables/sources'
  entity_type: 'source'

... # other configuration parameters

mapping:
  inp_dir: '../process/meta_ids/primary_ents'
  db_path: '../process/openalex.db'
  out_dir: '../mapping_output/mapped'
  multi_mapped_dir: '../mapping_output/multi_mapped'
  non_mapped_dir: '../mapping_output/non_mapped'
  type_field: True
  all_rows: True
```
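
As a quick sanity check before any counting, the three output folders can be inspected directly. The following is a minimal sketch using only the standard library; the helper name `count_csv_files` is hypothetical, and the folder paths are simply the ones from the sample configuration above.

```python
from pathlib import Path

# Output folders as named in the sample configuration above.
OUTPUT_DIRS = {
    'mapped': '../mapping_output/mapped',
    'multi_mapped': '../mapping_output/multi_mapped',
    'non_mapped': '../mapping_output/non_mapped',
}

def count_csv_files(folder):
    """Return the number of CSV files directly inside `folder` (0 if it does not exist)."""
    path = Path(folder)
    if not path.is_dir():
        return 0
    return sum(1 for f in path.iterdir() if f.is_file() and f.suffix.lower() == '.csv')

for name, folder in OUTPUT_DIRS.items():
    print(f'{name}: {count_csv_files(folder)} CSV file(s)')
```

A count of zero for a folder may simply mean that the mapping step has not been run yet, or that the paths differ from the sample configuration.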

## Basic quantitative analysis on input data and OC Meta BRs mapped to a single OpenAlex BR

First, we can perform basic counts on the input data and the OC Meta BRs that are mapped to a single OpenAlex BR. To make things easier, we can use some of the functions in `omid_openalex.mapping` and `omid_openalex.utils`. 

Let's start by counting the BRs in the input OC Meta data. Each of these BRs is represented as a row in the CSV files stored in the ZIP archive `oc_meta.zip` (see the configuration file above), so we can simply count the rows in those files. We should bear in mind, however, that not every BR available in the OC Meta collection has a CSV row of its own: journal issues and volumes, despite being first-class entities in OC Meta and represented as such in the triplestore, do not necessarily appear as separate rows in the OC Meta dump.[^1] 
Besides counting the total number of BRs represented as a CSV row (i.e. the number of BRs that are processed in the mapping step), we can gain more specific insights by counting, among these BRs, how many have external PIDs (i.e. any PID that is not OMID) and how many have at least one PID that is also supported by OpenAlex (i.e. one PID among DOI, PMID, PMCID, Wikidata ID and ISSN). It is worth noting that only the OC Meta BRs with at least one OpenAlex-supported PID are potentially mappable to any corresponding BR in OpenAlex. 
 
[^1]: In the CSV dump, journal volumes and issues are often represented only as values of the *volume* and *issue* fields, and they are not considered for the mapping process, which only takes into consideration the OMIDs and the external PIDs stored in the *id* field.

In [None]:
"""Count the number of rows in the input data."""
from omid_openalex.mapping import MetaProcessor

meta_dump_zip = '../data/oc_meta/oc_meta.zip'

tot_brs_count = 0
brs_with_external_pids_count = 0
brs_with_openalex_supported_pids_count = 0
mutually_supported_ids = {'doi', 'pmid', 'pmcid', 'issn', 'wikidata'}
for row in MetaProcessor.read_compressed_meta_dump(meta_dump_zip):
    tot_brs_count += 1
    ids = row['id'].split()
    if len(ids) > 1:
        brs_with_external_pids_count += 1
        if any(pid.split(':')[0] in mutually_supported_ids for pid in ids):  # `pid` avoids shadowing the built-in `id`
            brs_with_openalex_supported_pids_count += 1

print(f'Total number of OC Meta BRs represented as CSV rows: {tot_brs_count}')
print(f'Number of OC Meta BRs with external PIDs: {brs_with_external_pids_count}')
print(f'Number of OC Meta BRs with at least one PID that is also supported by OpenAlex: {brs_with_openalex_supported_pids_count}')
print(f'Number of non-mappable OC Meta BRs represented as CSV rows: {tot_brs_count - brs_with_openalex_supported_pids_count}')
print(f'Percentage of potentially mappable OC Meta BRs represented as CSV rows: {brs_with_openalex_supported_pids_count / tot_brs_count * 100:.2f}%')
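
To make the rule used in the cell above concrete, here is a small self-contained sketch applied to hand-written `id` values. The function name `classify_ids` and the sample identifiers are purely illustrative; the only things taken from the text are the space-separated `id` field and the set of PID schemes shared with OpenAlex. Unlike the cell above, it looks at the scheme of each identifier rather than at the number of identifiers.

```python
MUTUALLY_SUPPORTED = {'doi', 'pmid', 'pmcid', 'issn', 'wikidata'}

def classify_ids(id_field):
    """Return (has_external_pid, has_openalex_supported_pid) for one `id` cell."""
    schemes = [pid.split(':', 1)[0] for pid in id_field.split()]
    external = [s for s in schemes if s != 'omid']
    return bool(external), any(s in MUTUALLY_SUPPORTED for s in external)

# Illustrative values only (not taken from the actual dump).
print(classify_ids('omid:br/0601'))                       # (False, False)
print(classify_ids('omid:br/0602 isbn:9780000000002'))    # (True, False)
print(classify_ids('omid:br/0603 doi:10.1000/example'))   # (True, True)
```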

Similarly to what we did for the OC Meta data, we can count the number of BRs in the OpenAlex data. In this case, we can simply count the number of lines, i.e. JSON objects in the JSON-L files in the `works` and `sources` folders, which contain the BRs that are processed in the mapping step. Moreover, we can count how many of these BRs have at least one PID that is also supported by OC Meta.

In [None]:
from omid_openalex.mapping import OpenAlexProcessor

openalex_works_dir = '../data/openalex/data/works'
openalex_sources_dir = '../data/openalex/data/sources'

# Count Works
tot_works_count = 0
works_with_meta_supported_pids_count = 0
for line in OpenAlexProcessor.read_compressed_openalex_dump(openalex_works_dir):
    tot_works_count += 1
    for _ in OpenAlexProcessor.get_work_ids(line):
        works_with_meta_supported_pids_count += 1
        break

# Count Sources
tot_sources_count = 0
sources_with_meta_supported_pids_count = 0
for line in OpenAlexProcessor.read_compressed_openalex_dump(openalex_sources_dir):
    tot_sources_count += 1
    for _ in OpenAlexProcessor.get_source_ids(line):
        sources_with_meta_supported_pids_count += 1
        break

print(f'Total number of OpenAlex BRs: {tot_works_count + tot_sources_count}')
print(f'Total number of OpenAlex Works: {tot_works_count}')
print(f'Number of OpenAlex Works with at least one PID that is also supported by OC Meta: {works_with_meta_supported_pids_count}')
print(f'Total number of OpenAlex Sources: {tot_sources_count}')
print(f'Number of OpenAlex Sources with at least one PID that is also supported by OC Meta: {sources_with_meta_supported_pids_count}')
print(f'Total number of OpenAlex BRs with at least one PID that is also supported by OC Meta (i.e. mappable BRs in OpenAlex): {works_with_meta_supported_pids_count + sources_with_meta_supported_pids_count}')
print(f'Percentage of OpenAlex BRs with at least one PID that is also supported by OC Meta: {(works_with_meta_supported_pids_count + sources_with_meta_supported_pids_count) / (tot_works_count + tot_sources_count) * 100:.2f}%')
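
For readers who want to see what such a count involves without the library helpers, the following sketch reads gzip-compressed JSON Lines files with the standard library only. It assumes that the dump files end in `.gz` and that each JSON object carries its identifiers in an `ids` dictionary; the function name `count_entities` and the key set are assumptions made for illustration, not the behaviour of `OpenAlexProcessor`.

```python
import gzip
import json
from pathlib import Path

def count_entities(inp_dir, supported_keys):
    """Count JSON objects in all .gz files under `inp_dir`, and how many of them
    have a non-empty value for at least one key of `supported_keys` in `ids`."""
    total = 0
    with_supported = 0
    for gz_path in sorted(Path(inp_dir).rglob('*.gz')):
        with gzip.open(gz_path, 'rt', encoding='utf-8') as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                total += 1
                ids = json.loads(line).get('ids') or {}
                if any(ids.get(key) for key in supported_keys):
                    with_supported += 1
    return total, with_supported

# Hypothetical usage with the folder from the sample configuration:
# count_entities('../data/openalex/data/works', {'doi', 'pmid', 'pmcid'})
```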

Next, we can count the number of OC Meta BRs that are mapped to a single OpenAlex BR. With the sample configuration file above, these BRs would be stored in the CSV files in the `mapped` folder. We can simply count the number of rows in these files, which corresponds to the number of OC Meta BRs (from the rows of the CSV dump) that are mapped to a single OpenAlex BR.

In [None]:
from omid_openalex.utils import read_csv_tables

mapped_dir = '../mapping_output/mapped'

mapped_brs_count = 0
for row in read_csv_tables(mapped_dir):
    mapped_brs_count += 1

print(f'Total number of OC Meta BRs mapped to a single OpenAlex BR: {mapped_brs_count}')

Moreover, some of these OC Meta BRs might be mapped to the same OpenAlex BR as other OC Meta BRs (inverted multi-mapped). This happens, for example, when two BRs in OC Meta have been mistakenly assigned the same external PID: if that PID is also assigned to a BR in OpenAlex, both OC Meta BRs are mapped to that same OpenAlex BR. 
We can examine the BRs in the `mapped` folder and save the inverted multi-mapped BRs in a separate CSV file, in order to later distinguish them from other multi-mapped BRs, if necessary. In order to do so, we can use the `find_inverted_multi_mapped` function in `omid_openalex.analytics.helper`.

In [None]:
from omid_openalex.analytics.helper import find_inverted_multi_mapped

mapped_dir = '../mapping_output/mapped'
inverted_multi_mapped_dir = '../analysis/inverted_multi_mapped'  # directory where the inverted multi-mapped BRs are saved

# print the number of inverted multi-mapped OC Meta BRs and save them to a separate CSV file, together with the corresponding OpenAlex BRs
find_inverted_multi_mapped(mapped_dir, inverted_multi_mapped_dir)
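
The idea behind this check can be sketched in a few lines: group the OMIDs by the OpenAlex ID they are mapped to and keep the groups with more than one OMID. The snippet below is only an illustration on in-memory rows; it assumes rows with `omid` and `openalex_id` keys and is not the implementation of `find_inverted_multi_mapped`.

```python
from collections import defaultdict

def group_inverted_multi_mapped(rows):
    """Map each OpenAlex ID to the OMIDs pointing at it, keeping only
    the OpenAlex IDs shared by two or more OMIDs."""
    omids_by_oaid = defaultdict(list)
    for row in rows:
        omids_by_oaid[row['openalex_id']].append(row['omid'])
    return {oaid: omids for oaid, omids in omids_by_oaid.items() if len(omids) > 1}

# Made-up rows for illustration.
sample_rows = [
    {'omid': 'omid:br/061', 'openalex_id': 'W1'},
    {'omid': 'omid:br/062', 'openalex_id': 'W1'},
    {'omid': 'omid:br/063', 'openalex_id': 'W2'},
]
print(group_inverted_multi_mapped(sample_rows))
# {'W1': ['omid:br/061', 'omid:br/062']}
```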

Furthermore, we can count the number of multi-mapped OC Meta BRs (OC Meta BRs that are mapped to more than one OpenAlex BR). With the sample configuration file above, multi-mapped BRs would be stored in (one or more) CSV file(s) in the `multi_mapped` folder: again, we can simply count the number of table rows.

In [None]:
from omid_openalex.utils import read_csv_tables

multi_mapped_dir = '../mapping_output/multi_mapped'

multi_mapped_brs_count = 0
for row in read_csv_tables(multi_mapped_dir):
    multi_mapped_brs_count += 1

print(f'Total number of OC Meta BRs mapped to more than one OpenAlex BR: {multi_mapped_brs_count}')

Finally, we can count the number of OC Meta BRs that are not mapped to any OpenAlex BR. With the sample configuration file above, these BRs would be stored in the CSV files in the `non_mapped` folder.

In [None]:
from omid_openalex.utils import read_csv_tables

non_mapped_dir = '../mapping_output/non_mapped'

non_mapped_brs_count = 0
for row in read_csv_tables(non_mapped_dir):
    non_mapped_brs_count += 1

print(f'Total number of unmapped OC Meta BRs (from CSV dump): {non_mapped_brs_count}')
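
Once the three counts are available, it is useful to check that they add up and to express them as shares of the processed rows. The helper below is a hypothetical sketch; it assumes that every processed OC Meta row ends up in exactly one of the three output folders, which is worth verifying on your own data.

```python
def summarise_mapping(total, mapped, multi_mapped, non_mapped):
    """Return the share (in %) of each outcome and whether the counts add up to `total`."""
    if total <= 0:
        raise ValueError('total must be a positive number of rows')
    shares = {
        'mapped': mapped / total * 100,
        'multi_mapped': multi_mapped / total * 100,
        'non_mapped': non_mapped / total * 100,
    }
    consistent = (mapped + multi_mapped + non_mapped) == total
    return shares, consistent

# Toy numbers, only to show the call.
shares, consistent = summarise_mapping(total=200, mapped=150, multi_mapped=10, non_mapped=40)
print({k: f'{v:.2f}%' for k, v in shares.items()}, consistent)
```

With the variables from the cells above, the call would be `summarise_mapping(tot_brs_count, mapped_brs_count, multi_mapped_brs_count, non_mapped_brs_count)`.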

## Multi-mapped BRs analysis

Multi-mapped BRs are particularly interesting, since they can help reveal potential inconsistencies in the processed datasets. First, we can load the multi-mapped BRs (stored in a single file, named `multi_mapped_omids.csv`, inside the `multi_mapped` folder) into a pandas DataFrame. 
Then, we can gain a deeper understanding of their nature by using some functions from the `omid_openalex.analytics.helper` module.

In [146]:
import pandas as pd
from omid_openalex.utils import read_csv_tables

# multi-mapped BRs are stored in a single CSV file; replace this path with the one where you saved your results
mm_csv = '../mapping_output/multi_mapped/multi_mapped_omids.csv'

df = pd.read_csv(mm_csv, encoding='utf-8')
df.head()


| | omid | openalex_id | type |
|---|---|---|---|
| 0 | omid:br/06083 | W4235201711 W2550489455 | reference book |
| 1 | omid:br/060135 | W3116204541 W2504600547 | reference book |
| 2 | omid:br/060101 | W4234565634 W4254878166 | reference book |
| 3 | omid:br/060765 | W3081784037 W3035986120 | report |
| 4 | omid:br/06082 | W2889504338 W2551658000 | reference book |


For example, we can focus on journal articles and perform basic analysis with standard pandas operations.

In [147]:
# get multi-mapped JOURNAL ARTICLES and study them
jadf = df[df['type'] == 'journal article'].copy()  # work on a copy to avoid pandas' SettingWithCopyWarning
jadf['openalex_id'] = jadf['openalex_id'].str.split()  # convert space-separated strings to actual lists
oaid_counts = jadf['openalex_id'].apply(len)  # number of OpenAlex IDs for each OMID

print(f'Average number of OpenAlex IDs for one OMID: {oaid_counts.mean():.2f}')
print(f'Minimum number of OpenAlex IDs for one OMID: {oaid_counts.min()}')
print(f'Maximum number of OpenAlex IDs for one OMID: {oaid_counts.max()}')

Average number of OpenAlex IDs for one OMID: 2.15
Minimum number of OpenAlex IDs for one OMID: 2
Maximum number of OpenAlex IDs for one OMID: 1051

In order to get more complete results, we can also pass the type of bibliographic resource we want to analyse to the `analyse_mm_by_type()` function, together with the path to the CSV file storing multi-mapped BRs. This will output the distribution of OMIDs over the number of OpenAlex IDs each OMID is mapped to.

In [148]:
from omid_openalex.analytics.helper import analyse_mm_by_type

analyse_mm_by_type(mm_csv, 'journal article')







Total number of multi-mapped OMIDs of BRs of type journal article: 140426

Number of OMIDs of journal article multi-mapped to Work IDs only: 140426. The following illustrates how these are distributed over the number of OAIDs each OMID is mapped to: {2: 130280, 3: 7471, 4: 1601, 5: 506, 6: 201, 7: 83, 8: 39, 9: 13, 10: 27, 11: 18, 12: 12, 13: 11, 14: 10, 15: 10, 16: 4, 17: 13, 18: 14, 19: 5, 20: 3, 21: 4, 22: 5, 23: 8, 24: 5, 25: 4, 26: 4, 28: 5, 29: 1, 30: 1, 31: 6, 32: 2, 33: 2, 34: 2, 35: 2, 40: 2, 41: 1, 42: 2, 43: 4, 44: 1, 45: 1, 46: 2, 47: 1, 48: 1, 49: 1, 50: 1, 52: 1, 53: 1, 57: 1, 59: 2, 60: 2, 65: 1, 67: 1, 69: 1, 73: 1, 74: 1, 76: 1, 77: 1, 78: 1, 80: 1, 83: 2, 94: 2, 96: 1, 97: 1, 98: 1, 100: 1, 106: 2, 111: 2, 112: 2, 115: 1, 123: 1, 130: 1, 135: 1, 160: 1, 197: 1, 1051: 1}

Number of OMIDs of journal article multi-mapped to Source IDs only: 0. The following illustrates how these are distributed over the number of OAIDs each OMID is mapped to: {}
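
The distribution printed above is essentially a frequency count of the number of OpenAlex IDs per OMID. A minimal sketch of that computation follows, assuming the space-separated `openalex_id` format shown in the table earlier. This is not the code of `analyse_mm_by_type`, and the sample values are invented.

```python
from collections import Counter

def oaid_count_distribution(openalex_id_cells):
    """Return {number of OpenAlex IDs: number of OMIDs}, sorted by the number of IDs."""
    counts = Counter(len(cell.split()) for cell in openalex_id_cells)
    return dict(sorted(counts.items()))

sample_cells = ['W1 W2', 'W3 W4', 'W5 W6 W7']
print(oaid_count_distribution(sample_cells))  # {2: 2, 3: 1}
```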


The `filter_mm_df()` function makes it easier to filter the dataframe for a specific type of bibliographic resource among the ones specified by OC Meta (e.g. journal article, journal, book), for a specific OpenAlex entity type (Works or Sources) and for a given number of OpenAlex IDs at the same time.

In [149]:
from omid_openalex.analytics.helper import prepare_data_for_filtering, filter_mm_df

new_df = prepare_data_for_filtering(mm_csv)
filtered_df = filter_mm_df(new_df, res_type='journal article', composition='works', oaid_count=6)
filtered_df.head()

| | omid | openalex_id | type | oaid_count | composition |
|---|---|---|---|---|---|
| 676 | omid:br/0620144776 | W4253733061 W4250748934 W4233785629 W425195928... | journal article | 6 | works |
| 5459 | omid:br/06204258166 | W4232252855 W2068427257 W4245596445 W422992732... | journal article | 6 | works |
| 5584 | omid:br/06204264644 | W4244962988 W4251940503 W4233590687 W423690514... | journal article | 6 | works |
| 5670 | omid:br/06204269306 | W3124575407 W4298086179 W4300444114 W429844190... | journal article | 6 | works |
| 5969 | omid:br/06204285178 | W2586568347 W2590260806 W1898333710 W258633207... | journal article | 6 | works |
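
The `composition` column seen above can be thought of as a label derived from the prefixes of the OpenAlex IDs, since Work IDs start with `W` and Source IDs start with `S`. The function below is a hypothetical re-creation of that labelling; the values `'sources'` and `'mixed'` are assumptions, as only `'works'` appears in the output shown here.

```python
def infer_composition(openalex_ids):
    """Label a list of OpenAlex IDs by the entity types it contains."""
    prefixes = {oaid[0].upper() for oaid in openalex_ids if oaid}
    if prefixes == {'W'}:
        return 'works'
    if prefixes == {'S'}:
        return 'sources'
    return 'mixed'  # assumption: any other combination (including an empty list)

print(infer_composition('W4253733061 W4250748934'.split()))  # works
```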


For a manual analysis of the multi-mapped BRs it can be useful to directly access the metadata of the involved entities via the APIs of both OpenCitations and OpenAlex. To make it faster, we can use the `get_api_url()` function as follows.

In [150]:
from omid_openalex.analytics.helper import get_api_url
from pprint import pprint

pprint(get_api_url(filtered_df)[:3]) # print the API url for the first 3 rows of the dataframe obtained in the previous cell

[{'composition': 'works',
  'oaid_count': 6,
  'omid': 'https://opencitations.net/meta/api/v1/metadata/omid:br/0620144776',
  'openalex_id': ['https://api.openalex.org/W4253733061',
                  'https://api.openalex.org/W4250748934',
                  'https://api.openalex.org/W4233785629',
                  'https://api.openalex.org/W4251959288',
                  'https://api.openalex.org/W3102474143',
                  'https://api.openalex.org/W4244406985'],
  'type': 'journal article'},
 {'composition': 'works',
  'oaid_count': 6,
  'omid': 'https://opencitations.net/meta/api/v1/metadata/omid:br/06204258166',
  'openalex_id': ['https://api.openalex.org/W4232252855',
                  'https://api.openalex.org/W2068427257',
                  'https://api.openalex.org/W4245596445',
                  'https://api.openalex.org/W4229927320',
                  'https://api.openalex.org/W4229699789',
                  'https://api.openalex.org/W4239361117'],
  'type': 'journal arti
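
Judging from the output above, the URLs follow two simple patterns. The sketch below rebuilds them by hand for a single row; the function name `build_api_urls` is hypothetical, and the URL prefixes are copied from the printed output rather than from the helper's source code.

```python
OC_META_API = 'https://opencitations.net/meta/api/v1/metadata/'
OPENALEX_API = 'https://api.openalex.org/'

def build_api_urls(omid, openalex_ids):
    """Return the OC Meta API URL for `omid` and the OpenAlex API URLs for each ID."""
    return {
        'omid': OC_META_API + omid,
        'openalex_id': [OPENALEX_API + oaid for oaid in openalex_ids],
    }

urls = build_api_urls('omid:br/0620144776', ['W4253733061', 'W4250748934'])
print(urls['omid'])
print(urls['openalex_id'])
```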

A dynamic visualisation of the multi-mapped data can also provide valuable insights. For example, a distribution histogram shows at a glance which BR types are most frequently involved in multi-mapping, and how many OpenAlex IDs the OMIDs are typically multi-mapped to.

In [151]:
import plotly.express as px
from omid_openalex.analytics.helper import add_columns_to_df


viz_df = add_columns_to_df(df)
viz_df['type'] = viz_df['type'].fillna('unspecified')  # assign back instead of modifying the column in place

hist_data_df = viz_df.groupby(['oaid_count', 'type', 'composition']).size().reset_index(name='frequency')

# define new custom legend names (including the total number of occurrences for each type)    
legend_names = {brtype: f"{brtype} ({viz_df['type'].value_counts(dropna=False).get(brtype)})" for brtype in viz_df['type'].unique()}

fig = px.bar(hist_data_df, x='oaid_count', y='frequency', color='type', hover_data=['composition'], log_y=True)

fig.for_each_trace(lambda t: t.update(name = legend_names[t.name]))

fig.update_layout(title='Distribution of OpenAlex ID per OMID over number of multi-mapped OMIDs (grouped by type)',
                  xaxis_title='Number of OpenAlex IDs for a single OMID',
                  yaxis_title='Frequency (log)', yaxis_type='log')

# write html file of the log scale histogram
# fig.write_html('graphs/multi_mapped_dist.html')

fig.show()

### Multi-mapped BRs categorisation

To delve deeper into the causes of multi-mapping, we can attempt to categorise the involved BRs. This can be done by running the process in `mm_categ.py`. Once we obtain the results of the categorisation, stored in a JSON file, we can load them into two pandas DataFrames for ease of use. The category labels are explained below:

Categories for Works: 
- __*A*__: Multiple OpenAlex Works share the same DOI, PMID or PMCID.
- __*B*__: DOI(s) for preprint/postprint/version hosted in repository. Instances of this category are determined based on the DOI prefix, which is associated with the publisher. For example, the "10.22541" prefix is associated with the Authorea publishing company, which manages a large preprint server; the "10.17615" prefix is associated with the University of North Carolina at Chapel Hill, which curates/publishes the articles in the Carolina Digital Repository.
- __*C*__: Error in data source or 2 entities linked together by mistake (e.g. duplicated DOI).
- __*D*__: Version-marked DOI(s). This category includes preprint versions and detects them by checking for version number (e.g. "/v1") in the DOI value.
- __*E*__: DOI(s) coming from preprint servers (based on the presence of semantic indicators in the DOI suffix, e.g. "/arxiv" or "zenodo").
- __*F*__: Multiple DOIs all from the same publisher/DOI issuer: errata, letters, editorials, other.
- __*non classified*__: Non classified.

Categories for Sources: 
- __*A*__: Multiple OpenAlex Sources share the same ISSN/ISSN-L. Wikidata IDs are not considered.
- __*non classified*__: Non classified.
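
As an illustration of how the DOI-based categories B, D and E could be detected, here is a rough sketch built only from the hints given in the list above. The prefix set, the regular expression (which only matches a version marker at the end of the DOI) and the indicator strings are small illustrative samples, the order in which the checks are applied is an assumption, and none of this should be read as the actual logic of `mm_categ.py`.

```python
import re

REPOSITORY_PREFIXES = {'10.22541', '10.17615'}   # examples mentioned for category B
VERSION_PATTERN = re.compile(r'/v\d+$')           # e.g. a DOI ending in "/v1" (category D)
PREPRINT_INDICATORS = ('/arxiv', 'zenodo')        # examples mentioned for category E

def guess_category(doi):
    """Return 'B', 'D', 'E' or None for a single DOI, checking B, then D, then E."""
    doi = doi.lower()
    prefix, _, suffix = doi.partition('/')
    if prefix in REPOSITORY_PREFIXES:
        return 'B'
    if VERSION_PATTERN.search(doi):
        return 'D'
    if any(indicator in '/' + suffix for indicator in PREPRINT_INDICATORS):
        return 'E'
    return None

# Invented DOIs, for illustration only.
for doi in ['10.22541/au.1', '10.1234/abc/v1', '10.5281/zenodo.123', '10.1000/xyz']:
    print(doi, guess_category(doi))
```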

In [152]:
import pandas as pd
import json

json_file = 'tmp/mm_categories.json'  # replace with the JSON file storing the output of mm_categ.py
with open(json_file, 'r', encoding='utf-8') as f:  # the context manager closes the file after loading
    data = json.load(f)

# Create a DataFrame for "works" (rows: BR types, columns: categories)
works_df = pd.DataFrame(data["works"]).T.fillna(0).astype(int)

# Create a DataFrame for "sources" (rows: BR types, columns: categories)
sources_df = pd.DataFrame(data["sources"]).T.fillna(0).astype(int)

In [153]:
works_df.head()

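| | A | B | C | non classified | D | E | F |
|---|---|---|---|---|---|---|---|
| proceedings article | 477 | 10 | 452 | 666 | 108 | 16 | 0 |
| journal article | 38179 | 8722 | 35744 | 50579 | 10196 | 1030 | 805 |
| | 607 | 502 | 609 | 1503 | 1753 | 265 | 29 |
| book chapter | 341 | 8 | 1112 | 2002 | 21 | 4 | 36 |
| book | 27 | 1 | 581 | 8511 | 31 | 0 | 4 |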


In [154]:
sources_df.head()

| | A | non classified |
|---|---|---|
| journal | 4057 | 2345 |
| book series | 17 | 38 |
| series | 2 | 0 |


## Provenance analysis of unmapped resources

OC Meta BRs that have not been mapped to any BR in OpenAlex are also worth investigating. In particular, we are interested in the provenance of these resources, i.e. in understanding which of the data sources used by OC Meta (e.g. Crossref, DataCite) provide data that is stored in OC Meta but not in OpenAlex.
We can perform this analysis by running the process in `prov_analysis.py`. The results of this analysis will be stored in a JSON file at the path specified in the configuration file for this process.

First, we load the results into a DataFrame.

In [155]:
import json
import pandas as pd

prov_results_json = 'tmp/provenance_analysis_results.json'  # replace with path of JSON file storing results of provenance analysis process

with open(prov_results_json, 'r', encoding='utf-8') as f:  # the context manager closes the file after loading
    prov_analysis_results = json.load(f)
prov_df = pd.DataFrame(prov_analysis_results)

prov_df.head()

| | proceedings | journal issue | book | journal volume | dataset | | journal article | reference book | report | journal | ... | dissertation | book chapter | computer program | proceedings article | reference entry | series | web content | standard | data management plan | book section |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| https://api.crossref.org/ | {'omid_only': 5046154, 'other_pids': 31} | {'omid_only': 4667606, 'other_pids': 79827} | {'omid_only': 2405267, 'other_pids': 91797} | {'omid_only': 1440942, 'other_pids': 95} | {'omid_only': 0, 'other_pids': 46} | {'omid_only': 356955, 'other_pids': 324066} | {'omid_only': 0, 'other_pids': 4983} | {'omid_only': 186140, 'other_pids': 25} | {'omid_only': 0, 'other_pids': 15} | {'omid_only': 56561, 'other_pids': 52} | ... | | {'omid_only': 0, 'other_pids': 1230} | | {'omid_only': 0, 'other_pids': 84} | {'omid_only': 0, 'other_pids': 87} | {'omid_only': 9, 'other_pids': 17} | {'omid_only': 0, 'other_pids': 3} | {'omid_only': 0, 'other_pids': 4} | | {'omid_only': 0, 'other_pids': 3} |
| https://api.crossref.org/snapshots/monthly/2023/09/all.json.tar.gz | {'omid_only': 324639, 'other_pids': 0} | {'omid_only': 202652, 'other_pids': 15} | {'omid_only': 2626, 'other_pids': 16272} | {'omid_only': 106965, 'other_pids': 0} | | {'omid_only': 71227, 'other_pids': 149362} | {'omid_only': 0, 'other_pids': 484} | {'omid_only': 2286, 'other_pids': 1} | | {'omid_only': 5000, 'other_pids': 3} | ... | | {'omid_only': 0, 'other_pids': 1197} | | {'omid_only': 0, 'other_pids': 778} | | {'omid_only': 0, 'other_pids': 1} | {'omid_only': 0, 'other_pids': 26} | | | |
| https://doi.org/10.5281/zenodo.7845968 | {'omid_only': 0, 'other_pids': 11487} | {'omid_only': 1075, 'other_pids': 0} | {'omid_only': 0, 'other_pids': 1247} | | {'omid_only': 0, 'other_pids': 203240} | {'omid_only': 0, 'other_pids': 355075} | {'omid_only': 0, 'other_pids': 202018} | | {'omid_only': 0, 'other_pids': 1993} | {'omid_only': 0, 'other_pids': 19} | ... | {'omid_only': 0, 'other_pids': 7983} | {'omid_only': 0, 'other_pids': 690} | {'omid_only': 0, 'other_pids': 1992} | | | | | | {'omid_only': 0, 'other_pids': 4} | |
| https://nih.figshare.com/collections/iCite_Database_Snapshots_NIH_Open_Citation_Collection_/4586573/42 | {'omid_only': 786, 'other_pids': 0} | {'omid_only': 102115, 'other_pids': 0} | {'omid_only': 3757, 'other_pids': 0} | {'omid_only': 22602, 'other_pids': 0} | | {'omid_only': 1, 'other_pids': 153} | {'omid_only': 0, 'other_pids': 42009} | {'omid_only': 1, 'other_pids': 0} | | {'omid_only': 40080, 'other_pids': 1499} | ... | | | | | | {'omid_only': 14, 'other_pids': 2} | | | | |
| https://doi.org/10.5281/zenodo.7845968 https://api.crossref.org/ | {'omid_only': 0, 'other_pids': 17} | {'omid_only': 2830, 'other_pids': 37} | {'omid_only': 0, 'other_pids': 2} | {'omid_only': 1730, 'other_pids': 0} | {'omid_only': 0, 'other_pids': 190} | {'omid_only': 0, 'other_pids': 16} | {'omid_only': 0, 'other_pids': 57} | | | {'omid_only': 5, 'other_pids': 0} | ... | {'omid_only': 0, 'other_pids': 5} | | {'omid_only': 0, 'other_pids': 5} | | | | | | | |


We can create a simpler copy of the dataframe, where the dictionaries are replaced with the sum of the values.

In [156]:
prov_df_copy = prov_df.copy()
prov_df_copy.fillna(0, inplace=True)

for col in prov_df_copy.columns:
    prov_df_copy[col] = prov_df_copy[col].apply(lambda x: sum(x.values()) if isinstance(x, dict) else x)

prov_df_copy.head()

| | proceedings | journal issue | book | journal volume | dataset | | journal article | reference book | report | journal | ... | dissertation | book chapter | computer program | proceedings article | reference entry | series | web content | standard | data management plan | book section |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| https://api.crossref.org/ | 5046185 | 4747433 | 2497064 | 1441037 | 46 | 681021 | 4983 | 186165 | 15 | 56613 | ... | 0 | 1230 | 0 | 84 | 87 | 26 | 3 | 4 | 0 | 3 |
| https://api.crossref.org/snapshots/monthly/2023/09/all.json.tar.gz | 324639 | 202667 | 18898 | 106965 | 0 | 220589 | 484 | 2287 | 0 | 5003 | ... | 0 | 1197 | 0 | 778 | 0 | 1 | 26 | 0 | 0 | 0 |
| https://doi.org/10.5281/zenodo.7845968 | 11487 | 1075 | 1247 | 0 | 203240 | 355075 | 202018 | 0 | 1993 | 19 | ... | 7983 | 690 | 1992 | 0 | 0 | 0 | 0 | 0 | 4 | 0 |
| https://nih.figshare.com/collections/iCite_Database_Snapshots_NIH_Open_Citation_Collection_/4586573/42 | 786 | 102115 | 3757 | 22602 | 0 | 154 | 42009 | 1 | 0 | 41579 | ... | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 0 |
| https://doi.org/10.5281/zenodo.7845968 https://api.crossref.org/ | 17 | 2867 | 2 | 1730 | 190 | 16 | 57 | 0 | 0 | 5 | ... | 5 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


We can visualise the results in a bar chart showing the number of unmapped OC Meta BRs for each BR type.

In [157]:
# --------- PREPARE THE DATA ----------

# Create a DataFrame to make it easier to work with the data
prov_vis_df = prov_df.T  # Transpose the DataFrame so that BR types become the rows (index)

# Add new columns for sum of 'omid_only' and 'other_pids'
prov_vis_df['omid_only_sum'] = prov_vis_df.apply(lambda row: sum(item.get('omid_only', 0) if isinstance(item, dict) else 0 for item in row), axis=1)
prov_vis_df['other_pids_sum'] = prov_vis_df.apply(lambda row: sum(item.get('other_pids', 0) if isinstance(item, dict) else 0 for item in row), axis=1)
# Add new column for sum of all values
prov_vis_df['Number of BRs'] = prov_vis_df['omid_only_sum'] + prov_vis_df['other_pids_sum']

# Reset the index to have 'br types' as a regular column
prov_vis_df.reset_index(inplace=True)
prov_vis_df.rename(columns={'index': 'BR Type'}, inplace=True)
prov_vis_df['BR Type'] = prov_vis_df['BR Type'].replace('', 'Unspecified')  # Replace the empty type string with 'Unspecified'


# ----------- VISUALISATION -------------

# Create a new column for the legend labels
prov_vis_df['Legend Label'] = prov_vis_df['BR Type'] + ' (' + prov_vis_df['Number of BRs'].astype(str) + ')'

# Create the bar chart
fig = px.bar(prov_vis_df, x='BR Type', y='Number of BRs', text='Number of BRs', color='Legend Label',
             # labels={'Number of BRs': 'Number of BRs', 'omid_only_sum': 'Omid Only', 'With other PIDs': 'other_pids_sum'},
             title='Number of non-mapped BRs per BR Type',
             hover_name='Legend Label',
             hover_data=['omid_only_sum', 'other_pids_sum'],
             )
fig.update_layout(xaxis_title='BR Type', yaxis_title='Number of BRs')

# print to html file
# fig.write_html('graphs/non_mapped_brs_per_br_type.html')

fig.show()

Alternatively, we can visualise the relative contribution of each data source for a given type of bibliographic resource.

In [158]:
def get_tot_contribution_by_source(sources_for_type:dict):
    """Sum the BR counts per single source; a key listing several space-separated sources counts towards each of them."""
    res = dict()
    for k, v in sources_for_type.items():
        for single_source in k.split():  # a key without spaces yields just itself
            res[single_source] = res.get(single_source, 0) + sum(v.values())
    return res

def visualize_sources_proportion(sources_for_type:dict, title:str):
    data = get_tot_contribution_by_source(sources_for_type)
    labels = list(data.keys())
    values = list(data.values())

    fig = px.pie(values=values, names=labels, title=title)
    fig.show()
    

# visualize sources for unmapped datasets and the proportion of each source
visualize_sources_proportion(prov_analysis_results['dataset'], 'Dataset')