# Mapping results analysis

The mapping output consists of CSV files that are stored in different folders, according to the relation between the bibliographic resources (BRs) in the two collections. The folders are the following:
1. the folder containing bibliographic resources that are mapped in a 1:1 relation (each OC Meta BR is mapped to exactly one OpenAlex BR)
2. the folder containing multi-mapped bibliographic resources (each OC Meta BR is mapped to more than one OpenAlex BR)
3. the folder containing unmapped bibliographic resources (the CSV files only contain the OC Meta BRs for which no corresponding entity was found in OpenAlex).

Given a sample configuration file like the following, the mapping results folders would be `mapped` (1), `multi_mapped` (2) and `non_mapped` (3), all of which are inside the `mapping_output` folder.

```yaml
meta_tables:
  meta_dump_zip: '../data/oc_meta/oc_meta.zip'
  meta_ids_out: '../process/meta_ids'
  all_rows: True

openalex_works:
  inp_dir: '../data/openalex/data/works'
  out_dir: '../process/openalex_tables/works'
  entity_type: 'work'
openalex_sources:
  inp_dir: '../data/openalex/data/sources'
  out_dir: '../process/openalex_tables/sources'
  entity_type: 'source'

... # other configuration parameters

mapping:
  inp_dir: '../process/meta_ids/primary_ents'
  db_path: '../process/openalex.db'
  out_dir: '../mapping_output/mapped'
  multi_mapped_dir: '../mapping_output/multi_mapped'
  non_mapped_dir: '../mapping_output/non_mapped'
  type_field: True
  all_rows: True
```
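
As a quick illustration of how the three result folders relate to the configuration, the sketch below copies the relevant keys of the `mapping` section into a plain dictionary and derives the folder names from them. The dictionary is written by hand rather than loaded from the YAML file, and the descriptive labels are hypothetical names used only in this example.

```python
from pathlib import PurePosixPath

# The three output paths from the `mapping` section of the sample configuration,
# written out by hand (in practice they would be read from the YAML file).
mapping_conf = {
    'out_dir': '../mapping_output/mapped',
    'multi_mapped_dir': '../mapping_output/multi_mapped',
    'non_mapped_dir': '../mapping_output/non_mapped',
}

# Hypothetical labels, used only in this sketch.
labels = {
    'out_dir': '1:1 mapped',
    'multi_mapped_dir': 'multi-mapped',
    'non_mapped_dir': 'unmapped',
}

result_folders = {labels[key]: PurePosixPath(path) for key, path in mapping_conf.items()}

for label, folder in result_folders.items():
    print(f'{label}: {folder.name} (inside {folder.parent.name})')
```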

## Basic quantitative analysis on input data and OC Meta BRs mapped to a single OpenAlex BR

First, we can perform basic counts on the input data and the OC Meta BRs that are mapped to a single OpenAlex BR. To make things easier, we can use some of the functions in `omid_openalex.mapping` and `omid_openalex.utils`. 

Let's start by counting the number of BRs in the input OC Meta data. Each of these BRs is represented as a row in the CSV files (stored in the ZIP archive `oc_meta.zip`, see the configuration file above), so we can simply count the rows. Bear in mind, however, that not every BR available in the OC Meta collection has a CSV row of its own: journal issues and volumes, although they are first-class entities in OC Meta and are represented as such in the triplestore, do not necessarily have their own row in the OC Meta dump.[^1]
Besides counting the total number of BRs represented as a CSV row (i.e. the number of BRs that are processed in the mapping step), we can gain more specific insights by counting how many of them have external PIDs (i.e. any PID that is not an OMID) and how many have at least one PID that is also supported by OpenAlex (i.e. a DOI, PMID, PMCID, Wikidata ID or ISSN). Note that only the OC Meta BRs with at least one OpenAlex-supported PID can potentially be mapped to a corresponding BR in OpenAlex.
 
[^1]: In the CSV dump, journal volumes and issues are often represented only as values of the *volume* and *issue* fields, and they are not considered for the mapping process, which only takes into consideration the OMIDs and the external PIDs stored in the *id* field.

In [None]:
"""Count the number of rows in the input data."""
from omid_openalex.mapping import MetaProcessor

meta_dump_zip = '../data/oc_meta/oc_meta.zip'

tot_brs_count = 0
brs_with_external_pids_count = 0
brs_with_openalex_supported_pids_count = 0
mutually_supported_ids = {'doi', 'pmid', 'pmcid', 'issn', 'wikidata'}
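for row in MetaProcessor.read_compressed_meta_dump(meta_dump_zip):
    tot_brs_count += 1
    ids = row['id'].split()  # the `id` field holds space-separated PIDs
    # Each row is expected to carry its OMID, so more than one ID means
    # that the BR has at least one external PID
    if len(ids) > 1:
        brs_with_external_pids_count += 1
        # Compare the scheme prefix of each PID (the part before the first ':')
        if any(pid.split(':')[0] in mutually_supported_ids for pid in ids):
            brs_with_openalex_supported_pids_count += 1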

print(f'Total number of OC Meta BRs represented as CSV rows: {tot_brs_count}')
print(f'Number of OC Meta BRs with external PIDs: {brs_with_external_pids_count}')
print(f'Number of OC Meta BRs with at least one PID that is also supported by OpenAlex: {brs_with_openalex_supported_pids_count}')
print(f'Number of non-mappable OC Meta BRs represented as CSV rows: {tot_brs_count - brs_with_openalex_supported_pids_count}')
print(f'Percentage of potentially mappable OC Meta BRs represented as CSV rows: {brs_with_openalex_supported_pids_count / tot_brs_count * 100:.2f}%')
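
To make the two checks in the cell above easier to follow, here is a minimal, self-contained sketch that applies them to a few made-up `id` values. The function name `classify_id_field` and the sample identifiers are hypothetical. Unlike the cell above, the sketch detects external PIDs by looking for any prefix other than `omid` rather than by counting the IDs; the two approaches agree as long as every row carries exactly one OMID.

```python
SUPPORTED_BY_OPENALEX = frozenset({'doi', 'pmid', 'pmcid', 'issn', 'wikidata'})


def classify_id_field(id_field, supported=SUPPORTED_BY_OPENALEX):
    """Return (has_external_pid, has_openalex_supported_pid) for one `id` value."""
    # The scheme prefix is the part of each PID before the first ':'
    prefixes = {pid.split(':', 1)[0] for pid in id_field.split()}
    has_external = bool(prefixes - {'omid'})
    has_supported = bool(prefixes & supported)
    return has_external, has_supported


# Made-up rows, for illustration only
sample_rows = [
    {'id': 'omid:br/0601 doi:10.1000/example'},  # external PID, supported by OpenAlex
    {'id': 'omid:br/0602 isbn:9780000000002'},   # external PID, not in the supported set
    {'id': 'omid:br/0603'},                      # OMID only
]

results = [classify_id_field(row['id']) for row in sample_rows]
for row, result in zip(sample_rows, results):
    print(row['id'], '->', result)
```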

Similarly to what we did for the OC Meta data, we can count the number of BRs in the OpenAlex data. In this case, we can simply count the number of lines, i.e. JSON objects in the JSON-L files in the `works` and `sources` folders, which contain the BRs that are processed in the mapping step. Moreover, we can count how many of these BRs have at least one PID that is also supported by OC Meta.

In [None]:
from omid_openalex.mapping import OpenAlexProcessor

openalex_works_dir = '../data/openalex/data/works'
openalex_sources_dir = '../data/openalex/data/sources'

# Count Works
tot_works_count = 0
works_with_meta_supported_pids_count = 0
for line in OpenAlexProcessor.read_compressed_openalex_dump(openalex_works_dir):
    tot_works_count += 1
    # Count the Work once, as soon as get_work_ids yields at least one ID
    for _ in OpenAlexProcessor.get_work_ids(line):
        works_with_meta_supported_pids_count += 1
        break

# Count Sources
tot_sources_count = 0
sources_with_meta_supported_pids_count = 0
for line in OpenAlexProcessor.read_compressed_openalex_dump(openalex_sources_dir):
    tot_sources_count += 1
    # Count the Source once, as soon as get_source_ids yields at least one ID
    for _ in OpenAlexProcessor.get_source_ids(line):
        sources_with_meta_supported_pids_count += 1
        break

print(f'Total number of OpenAlex BRs: {tot_works_count + tot_sources_count}')
print(f'Total number of OpenAlex Works: {tot_works_count}')
print(f'Number of OpenAlex Works with at least one PID that is also supported by OC Meta: {works_with_meta_supported_pids_count}')
print(f'Total number of OpenAlex Sources: {tot_sources_count}')
print(f'Number of OpenAlex Sources with at least one PID that is also supported by OC Meta: {sources_with_meta_supported_pids_count}')
print(f'Total number of OpenAlex BRs with at least one PID that is also supported by OC Meta (i.e. mappable BRs in OpenAlex): {works_with_meta_supported_pids_count + sources_with_meta_supported_pids_count}')
print(f'Percentage of OpenAlex BRs with at least one PID that is also supported by OC Meta: {(works_with_meta_supported_pids_count + sources_with_meta_supported_pids_count) / (tot_works_count + tot_sources_count) * 100:.2f}%')
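
The internals of `get_work_ids` and `get_source_ids` are not shown here, so the following is only a rough sketch of the kind of check involved, not the library's implementation. It assumes that each JSON-L line is an OpenAlex entity with an `ids` object, and that the relevant keys are `doi`, `pmid` and `pmcid` for Works and `issn` and `wikidata` for Sources. The function name and the sample records are made up for illustration.

```python
import json

# Assumed keys of the OpenAlex `ids` object that carry PIDs also supported by OC Meta
WORK_PID_KEYS = ('doi', 'pmid', 'pmcid')
SOURCE_PID_KEYS = ('issn', 'wikidata')


def has_supported_pid(json_line, pid_keys):
    """Return True if the entity has a non-empty value for at least one of pid_keys."""
    ids = json.loads(json_line).get('ids') or {}
    return any(ids.get(key) for key in pid_keys)


# Made-up JSON-L lines, for illustration only
sample_work = json.dumps({
    'id': 'https://openalex.org/W0000000001',
    'ids': {'openalex': 'https://openalex.org/W0000000001',
            'doi': 'https://doi.org/10.1000/example'},
})
sample_source = json.dumps({
    'id': 'https://openalex.org/S0000000001',
    'ids': {'openalex': 'https://openalex.org/S0000000001', 'issn': None},
})

work_is_mappable = has_supported_pid(sample_work, WORK_PID_KEYS)
source_is_mappable = has_supported_pid(sample_source, SOURCE_PID_KEYS)
print(work_is_mappable, source_is_mappable)
```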

Next, we can count the number of OC Meta BRs that are mapped to a single OpenAlex BR. With the sample configuration file above, these BRs would be stored in the CSV files in the `mapped` folder. We can simply count the number of rows in these files, which corresponds to the number of OC Meta BRs (from the rows of the CSV dump) that are mapped to a single OpenAlex BR.

In [None]:
from omid_openalex.utils import read_csv_tables

mapped_dir = '../mapping_output/mapped'

mapped_brs_count = 0
for row in read_csv_tables(mapped_dir):
    mapped_brs_count += 1

print(f'Total number of OC Meta BRs mapped to a single OpenAlex BR: {mapped_brs_count}')
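
If `omid_openalex.utils` is not at hand, the same row count can be approximated with the standard library alone. The sketch below is a stand-in under stated assumptions, not the implementation of `read_csv_tables`: it assumes that every file in the folder is a UTF-8 CSV with a header row and a `.csv` extension. The function name `count_csv_rows` and the column names in the demo files are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def count_csv_rows(folder):
    """Count the data rows (header excluded) of all the *.csv files in a folder."""
    total = 0
    for csv_path in sorted(Path(folder).glob('*.csv')):
        with open(csv_path, newline='', encoding='utf-8') as f:
            total += sum(1 for _ in csv.DictReader(f))
    return total


# Demo on two small throwaway files
with tempfile.TemporaryDirectory() as tmp:
    for name, n_rows in (('a.csv', 2), ('b.csv', 3)):
        with open(Path(tmp) / name, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(['omid', 'openalex_id'])  # hypothetical column names
            writer.writerows([f'omid:br/{i}', f'W{i}'] for i in range(n_rows))
    demo_count = count_csv_rows(tmp)

print(demo_count)
```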

Note that some of these OC Meta BRs might be mapped to the same OpenAlex BR as other OC Meta BRs (inverted multi-mapped). This happens when two BRs in OC Meta have been mistakenly assigned the same external PID: if that PID is also assigned to a BR in OpenAlex, both OC Meta BRs are mapped to that same OpenAlex BR.
We can examine the BRs in the `mapped` folder and save the inverted multi-mapped BRs to a separate CSV file, so that they can later be distinguished from other multi-mapped BRs if necessary. To do so, we can use the `find_inverted_multi_mapped` function in `omid_openalex.analytics.helper`.

In [None]:
from omid_openalex.analytics.helper import find_inverted_multi_mapped

mapped_dir = '../mapping_output/mapped'
inverted_multi_mapped_dir = '../analysis/inverted_multi_mapped'  # directory where the inverted multi-mapped BRs are saved

# print the number of inverted multi-mapped OC Meta BRs and save them to a separate CSV file, together with the corresponding OpenAlex BRs
find_inverted_multi_mapped(mapped_dir, inverted_multi_mapped_dir)
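
The idea behind this check can be sketched in a few lines: group the 1:1 mapped rows by OpenAlex ID and keep the groups that contain more than one OMID. This is only an illustration of the concept, not the code of `find_inverted_multi_mapped`; the column names `omid` and `openalex_id` and the sample rows are assumptions made for the example.

```python
from collections import defaultdict

# Made-up rows standing in for the content of the `mapped` folder
mapped_rows = [
    {'omid': 'omid:br/0601', 'openalex_id': 'W1'},
    {'omid': 'omid:br/0602', 'openalex_id': 'W1'},  # same OpenAlex BR as the row above
    {'omid': 'omid:br/0603', 'openalex_id': 'W2'},
]

# Group the OMIDs by the OpenAlex ID they are mapped to
omids_by_openalex_id = defaultdict(set)
for row in mapped_rows:
    omids_by_openalex_id[row['openalex_id']].add(row['omid'])

# An OpenAlex BR reached by more than one OMID is an inverted multi-mapping
inverted_multi_mapped = {
    openalex_id: sorted(omids)
    for openalex_id, omids in omids_by_openalex_id.items()
    if len(omids) > 1
}

print(inverted_multi_mapped)
```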

Furthermore, we can count the number of multi-mapped OC Meta BRs (OC Meta BRs that are mapped to more than one OpenAlex BR). With the sample configuration file above, multi-mapped BRs would be stored in (one or more) CSV file(s) in the `multi_mapped` folder: again, we can simply count the number of table rows.

In [None]:
from omid_openalex.utils import read_csv_tables

multi_mapped_dir = '../mapping_output/multi_mapped'

multi_mapped_brs_count = 0
for row in read_csv_tables(multi_mapped_dir):
    multi_mapped_brs_count += 1

print(f'Total number of OC Meta BRs mapped to more than one OpenAlex BR: {multi_mapped_brs_count}')

Finally, we can count the number of OC Meta BRs that are not mapped to any OpenAlex BR. With the sample configuration file above, these BRs would be stored in the CSV files in the `non_mapped` folder.

In [None]:
from omid_openalex.utils import read_csv_tables

non_mapped_dir = '../mapping_output/non_mapped'

non_mapped_brs_count = 0
for row in read_csv_tables(non_mapped_dir):
    non_mapped_brs_count += 1

print(f'Total number of unmapped OC Meta BRs (from CSV dump): {non_mapped_brs_count}')
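
Once all the counts are available, they can be put side by side as shares of the total number of OC Meta BRs in the CSV dump. The sketch below uses a hypothetical helper, `summarize_counts`, and placeholder numbers that are not real results; in the notebook, the variables computed in the cells above would be passed instead. It does not assume that the three categories add up exactly to the total, so it also reports any rows left unaccounted for.

```python
def summarize_counts(total, mapped, multi_mapped, non_mapped):
    """Return each category's share of the total (in percent) and the unaccounted rows."""
    if total <= 0:
        raise ValueError('total must be a positive number of rows')
    counts = {'mapped': mapped, 'multi_mapped': multi_mapped, 'non_mapped': non_mapped}
    shares = {name: 100 * count / total for name, count in counts.items()}
    unaccounted = total - sum(counts.values())
    return shares, unaccounted


# Placeholder numbers, for illustration only (not actual mapping results)
shares, unaccounted = summarize_counts(total=10, mapped=6, multi_mapped=1, non_mapped=3)

for name, share in shares.items():
    print(f'{name}: {share:.2f}%')
print(f'Rows not accounted for by the three categories: {unaccounted}')
```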

## Multi-mapped BRs analysis
