## Analyzing the mapping results

In 29 983 cases, an OMID is mapped to more than one OpenAlex ID (OAID). The table below reports the number of such cases for each type of bibliographic resource (as assigned to the entities in OpenCitations Meta).

“Multi-mapped” OMIDs (one OMID mapped to multiple OAIDs), total: **29 983**

| type                | count  |
|---------------------|--------|
| reference book      | 69     |
| series              | 87     |
| standard            | 7      |
| book series         | 1 058  |
| journal             | 11 184 |
| journal article     | 10 001 |
| book                | 7 247  |
| proceedings article | 106    |
| reference entry     | 108    |
| book chapter        | 95     |
| report              | 12     |
| web content         | 4      |
| proceedings         | 4      |
| dataset             | 1      |


Let's analyse the cases of multi-mapped OMIDs more closely. We read them into a pandas DataFrame from the CSV file in which they are stored, so that we can manually inspect some of them and try to identify the reasons for the multiple mappings.
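
As a quick cross-check of the per-type counts in the table above, the same tally can be sketched without pandas, using only the standard library. The CSV content below is a toy stand-in: the OMIDs and OAIDs are made up, and the column layout (`omid`, `openalex_id`, `type`, with a header row) is an assumption based on how the file is read later in this notebook.

```python
import csv
import io
from collections import Counter

# Toy stand-in for the CSV of multi-mapped OMIDs (hypothetical IDs;
# assumed columns: omid, openalex_id, type, preceded by a header row)
sample_csv = io.StringIO(
    "omid,openalex_id,type\n"
    "meta:br/0601,W111 W222,book\n"
    "meta:br/0602,S333 S444,journal\n"
    "meta:br/0603,W555 W666 W777,book\n"
)

# DictReader consumes the first row as the header, so it is not counted as data
rows = list(csv.DictReader(sample_csv))

# Count the multi-mapped OMIDs for each type of bibliographic resource
type_counts = Counter(row["type"] for row in rows)
print(dict(type_counts))
```

Reading the header row explicitly, as `DictReader` does here, avoids having to drop it again afterwards.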

In [142]:
import csv
import gzip
import json
import os
import time
import warnings
from collections import Counter
from csv import DictReader
from io import TextIOWrapper
from os import listdir, makedirs
from os.path import join, abspath, splitext, basename, exists, isdir, isfile
from pprint import pprint
from typing import Callable, Dict, Generator, List, Literal, Union
from zipfile import ZipFile

import chart_studio.plotly as csp
import pandas as pd
import plotly.express as px
import plotly.io as pio
from tqdm import tqdm

In [143]:
multi_mapped_omids_path =  'multi_mapped_omids.csv'

In [144]:

def analyse_multi_mapped_omids(file_path:str, res_type='journal article'):
    """
    Analyse the multi-mapped OMIDs and return a tuple of three lists of rows (each row being a list), where each list contains the rows in which either only OpenAlex Work IDs, or only OpenAlex Source IDs, or both Work and Source IDs are present. The rows are filtered by the type of bibliographic resource (e.g. journal article, book, etc.). The function also prints basic statistics about the filtered data.
    :param file_path: the .csv file storing the table rows containing multi-mapped OMIDs
    :param res_type: the type of bibliographic resource to filter the data by
    :return: tuple of three lists, by composition of OAIDs (works, sources, or both)
    """
    df = pd.read_csv(file_path, sep=',', header=None, names=['omid', 'openalex_id', 'type'])


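    # work on a copy, so that the assignment below does not trigger pandas' SettingWithCopyWarning
    filter_type_df = df[df['type'] == res_type].copy()
    filter_type_df['openalex_id'] = filter_type_df['openalex_id'].str.split(' ')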

    all_works = []
    all_sources = []
    works_and_sources = []

    for index, row in filter_type_df.iterrows():
        starts_with_w = all(item.startswith('W') for item in row['openalex_id'])
        starts_with_s = all(item.startswith('S') for item in row['openalex_id'])

        if starts_with_w and not starts_with_s:
            all_works.append(row.tolist())
        elif starts_with_s and not starts_with_w:
            all_sources.append(row.tolist())
        else:
            works_and_sources.append(row.tolist())

    count_works = dict(Counter(len(item[1]) for item in all_works))
    count_sources = dict(Counter(len(item[1]) for item in all_sources))
    count_works_and_sources = dict(Counter(len(item[1]) for item in works_and_sources))

    print(f'Total number of OMIDs of {res_type} multi-mapped: {len(all_works) + len(all_sources) + len(works_and_sources)}', end='\n\n')

    print(f'Number of OMIDs of {res_type} multi-mapped to Work IDs only: {len(all_works)}. The following illustrates how these are distributed over the number of OAIDs each OMID is mapped to: {dict(sorted(count_works.items()))}', end='\n\n')
    print(f'Number of OMIDs of {res_type} multi-mapped to Source IDs only: {len(all_sources)}. The following illustrates how these are distributed over the number of OAIDs each OMID is mapped to: {dict(sorted(count_sources.items()))}', end='\n\n')
    print(f'Number of OMIDs of {res_type} multi-mapped to both Work and Source IDs: {len(works_and_sources)}. The following illustrates how these are distributed over the number of OAIDs each OMID is mapped to: {dict(sorted(count_works_and_sources.items()))}', end='\n\n')

    return all_works, all_sources, works_and_sources

In [145]:
all_works, all_sources, works_and_sources = analyse_multi_mapped_omids(multi_mapped_omids_path, res_type='book')






Total number of OMIDs of book multi-mapped: 7247

Number of OMIDs of book multi-mapped to Work IDs only: 7247. The following illustrates how these are distributed over the number of OAIDs each OMID is mapped to: {2: 6410, 3: 825, 4: 11, 5: 1}

Number of OMIDs of book multi-mapped to Source IDs only: 0. The following illustrates how these are distributed over the number of OAIDs each OMID is mapped to: {}

Number of OMIDs of book multi-mapped to both Work and Source IDs: 0. The following illustrates how these are distributed over the number of OAIDs each OMID is mapped to: {}



In [146]:
filter_threshold = 2

print('Works only: \n')
pprint([x for x in all_works if len(x[1]) == filter_threshold][:3])
print('Sources only: \n')
pprint([x for x in all_sources if len(x[1]) == filter_threshold][:3])
print('Sources and Works: \n')
pprint([x for x in works_and_sources if len(x[1]) == filter_threshold][:3])


Works only: 

[['meta:br/062101418122', ['W2212757987', 'W2909068687'], 'book'],
 ['meta:br/06103247581', ['W4252858052', 'W4249101945'], 'book'],
 ['meta:br/062101475435', ['W4207016656', 'W4238810106'], 'book']]
Sources only: 

[]
Sources and Works: 

[]
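
The classification by ID prefix used above can be sketched in isolation as follows. This is a minimal illustration, not the notebook's function: `classify_oaids` is a hypothetical helper, the two Work IDs are taken from the output above, and the Source IDs (`S1`, `S2`) are made up.

```python
from typing import List


def classify_oaids(oaids: List[str]) -> str:
    """Return 'works', 'sources' or 'both' depending on the OpenAlex ID prefixes."""
    if all(oaid.startswith("W") for oaid in oaids):
        return "works"
    if all(oaid.startswith("S") for oaid in oaids):
        return "sources"
    # as in the function above, anything else ends up in the mixed group
    return "both"


examples = [
    ["W2212757987", "W2909068687"],  # taken from the output above
    ["S1", "S2"],                    # hypothetical Source IDs
    ["W2212757987", "S1"],           # hypothetical mixed case
]
labels = [classify_oaids(ids) for ids in examples]
print(labels)
```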


# Working with dataframe data

In [147]:
# ------------ ➡️save the csv data to a dataframe -------------
MULTI_MAPPED_DF = pd.read_csv(multi_mapped_omids_path, sep=',', header=None, names=['omid', 'openalex_id', 'type'])

### Add columns to the dataframe

In [148]:
def add_columns_to_df(df):
    """
    Add to the dataframe the columns for the number of OAIDs for each OMID ('oaid_count' column) and the composition of the OAIDs (works, sources, or both) ('composition' column).
    :param df: the input dataframe, as it is read from the CSV file
    :return: a dataframe with the two additional columns
    """
    def count_oaids(ids):
        # number of space-separated OAIDs; 0 when the cell holds a single token
        n = len(ids.split(' '))
        return n if n > 1 else 0

    def get_composition(ids):
        # 'works', 'sources' or 'both' according to the ID prefixes; '' if no ID starts with W or S
        items = ids.split(' ')
        if all(item.startswith('W') for item in items):
            return 'works'
        if all(item.startswith('S') for item in items):
            return 'sources'
        if any(item.startswith(('W', 'S')) for item in items):
            return 'both'
        return ''

    df['oaid_count'] = df['openalex_id'].apply(count_oaids)
    df['composition'] = df['openalex_id'].apply(get_composition)

    # remove the extra header row (its 'composition' value is empty)
    df = df[df['composition'] != ''].reset_index(drop=True)
    return df

```
# Order the rows by 'type' and then by 'oaid_count', so that rows sharing the same type and number of OAIDs are adjacent
grouped_df = add_columns_to_df(MULTI_MAPPED_DF).sort_values(by=['type', 'oaid_count'], kind='stable')

# Reset the index after sorting
grouped_df = grouped_df.reset_index(drop=True)
```


To get stats regardless of the composition of the OAIDs, we can include all the three possibilities in the filtering criteria, like so:

`filtered_df = grouped_df[(grouped_df['composition'].isin(['works', 'sources', 'both'])) &
                           (grouped_df['type'] == 'journal') &
                           (grouped_df['oaid_count'] >= 2)]`
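
Listing all three composition values has the same effect as not constraining the composition at all. The sketch below illustrates that idea with plain dictionaries instead of a dataframe: a criterion set to `None` is simply skipped. The function name `filter_rows` and the sample rows are hypothetical and only serve the illustration.

```python
from typing import Dict, List, Optional


def filter_rows(rows: List[Dict], res_type: Optional[str] = None,
                composition: Optional[str] = None,
                oaid_count: Optional[int] = None) -> List[Dict]:
    """Keep the rows that match every criterion that is not None."""
    criteria = {"type": res_type, "composition": composition, "oaid_count": oaid_count}
    active = {key: value for key, value in criteria.items() if value is not None}
    return [row for row in rows if all(row[key] == value for key, value in active.items())]


# Hypothetical rows, shaped like the dataframe with the two additional columns
rows = [
    {"omid": "meta:br/0601", "type": "journal", "composition": "both", "oaid_count": 2},
    {"omid": "meta:br/0602", "type": "journal", "composition": "sources", "oaid_count": 3},
    {"omid": "meta:br/0603", "type": "book", "composition": "works", "oaid_count": 2},
]

journals = filter_rows(rows, res_type="journal")                   # any composition, any count
journals_with_two = filter_rows(rows, res_type="journal", oaid_count=2)
print(len(journals), len(journals_with_two))
```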

### Print the rows with URLs for OMIDs and OAIDs

In [149]:
def filter_multi_mapped_df(df, res_type: str, composition: Union[Literal['works', 'sources', 'both'], None], oaid_count: Union[int, None]=2):
    """
    Filter the dataframe by the type of resource, the composition of the OAIDs, and the number of OAIDs for a single OMID.
    :param df: a DF to which columns 'oaid_count' and 'composition' have been added
    :param res_type: the type of bibliographic resource to filter by (e.g. 'journal'); set to None to apply no filter on the type
    :param composition: only one at a time: 'works', 'sources', or 'both'; set to None if you want to get all three
    :param oaid_count: the exact number of OAIDs for a single OMID; set to None to apply no filter on the count
    :return: the filtered dataframe
    """
    if composition not in ['works', 'sources', 'both', None]:
        raise ValueError('The composition parameter must be one of the following: "works", "sources", "both", or None.')
    if res_type:
        if not oaid_count and not composition:
            filtered_df = df[(df['type'] == res_type)]
        elif not oaid_count and composition:
            filtered_df = df[(df['type'] == res_type) &
                             (df['composition'] == composition)]
        elif oaid_count and not composition:
            filtered_df = df[(df['type'] == res_type) &
                             (df['oaid_count'] == oaid_count)]
        else:
            filtered_df = df[(df['type'] == res_type) &
                             (df['composition'] == composition) &
                             (df['oaid_count'] == oaid_count)]
    else:
        if not oaid_count and not composition:
            warnings.warn('You need to specify at least one of the two parameters: composition or oaid_count. Otherwise, the whole dataframe is returned.', UserWarning)
            filtered_df = df
        elif not oaid_count and composition:
            filtered_df = df[(df['composition'] == composition)]
        elif oaid_count and not composition:
            filtered_df = df[(df['oaid_count'] == oaid_count)]
        else:
            filtered_df = df[(df['composition'] == composition) &
                             (df['oaid_count'] == oaid_count)]

    return filtered_df

def get_ids_uris(df, verbose=True):
    """
    Transform the dataframe into a list of dicts with OMIDs and OAIDs written as (clickable) URLs.
    :param df: any dataframe with columns 'omid' and 'openalex_id'
    :param verbose: if True, return the list of whole rows; if False, return only the 'omid' and 'openalex_id' columns
    :return: a list of dicts, one for each row of the dataframe
    """
    result = []
    oa_url = 'https://api.openalex.org/'
    oc_url = 'https://opencitations.net/meta/api/v1/metadata/'


    for row in df.to_dict(orient='records'):
        res_row = row
        res_row['omid'] = oc_url + row['omid'].replace('meta:', 'omid:')
        res_row['openalex_id'] = list((map(lambda x: oa_url + x, row['openalex_id'].split())))

        if verbose:
            result.append(res_row)
        else:
            result.append({'omid':res_row['omid'], 'openalex_id':res_row['openalex_id']})
    return result

# Analyse the multi-mapped OMIDs
The following cell illustrates how to use the code above to perform analyses on the data.

In [150]:

# 1) get the dataframe with the additional columns from the base dataframe; assign values to the variables used as funct. params
current_data_df = add_columns_to_df(MULTI_MAPPED_DF)

res_type = None
comp = None
n_oaid = None

# 2) filter the dataframe by the type of resource, the composition of the OAIDs, and the number of OAIDs for a single OMID
operational_df = filter_multi_mapped_df(current_data_df, res_type=res_type, composition=comp, oaid_count=n_oaid)

# 3) get the OMIDs and OAIDs as URLs (set verbose=False to get only the 'omid' and 'openalex_id' columns)
output = get_ids_uris(operational_df, verbose=False) # pretty-print the output variable in a separate cell to get clickable URLs
# pprint(output)


You need to specify at least one of the two parameters: composition or oaid_count. Otherwise, the whole dataframe is returned.



In [151]:
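# NOTE: MULTI_MAPPED_DF is read with header=None, so this count presumably also includes the CSV header row (one more than the 29 983 multi-mapped OMIDs)
print('Total number of multi-mapped OMIDs: ', len(MULTI_MAPPED_DF))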

Total number of multi-mapped OMIDs:  29984


In [152]:
print(len(output))
# serials_df = operational_df[~operational_df['type'].isin(['series', 'book series', 'journal'])]  # filter out the serials with negation (~)
serials_df = operational_df[operational_df['type'].isin(['series', 'book series', 'journal'])]  # not named 'filter', to avoid shadowing the built-in
print(len(serials_df))
print('series', len(operational_df[operational_df['type'].eq('series')]))
print('book series', len(operational_df[operational_df['type'].eq('book series')]))
print('journals', len(operational_df[operational_df['type'].eq('journal')]))
len(serials_df)

29983
12329
series 87
book series 1058
journals 11184


12329

## Visualizations
We will use Plotly Express to create the visualizations. The graphs can be exported as HTML files and uploaded to Plotly Chart Studio.


In [153]:
## Uncomment the line below if you want to render the graphs in the jupyter notebook via browser (it resets at each server connection?)
# pio.renderers.default = 'browser'

In [154]:
# Initialize a dataframe with the additional columns, added from the base dataframe
df = add_columns_to_df(MULTI_MAPPED_DF)

In [155]:
## DATA FOR THE HISTOGRAMS ↓↓↓
# Group by openalex_id_count and type, and count the number of occurrences
hist_data = df.groupby(['oaid_count', 'type', 'composition']).size().reset_index(name='frequency')
# hist_data = df.groupby(['oaid_count', 'type']).size().reset_index(name='frequency') # this is without the composition column!

In [156]:
# print(hist_data)

In [157]:
# # --------- ⚠️BARE COUNT HISTOGRAM (best for local use only)-----------
# fig_bare_count = px.bar(hist_data, x='oaid_count', y='frequency', color='type', hover_data=['composition'])
# # fig_bare_count = px.bar(hist_data, x='oaid_count', y='frequency', color='type')
# # Set the title and axis labels and show histograms
# fig_bare_count.update_layout(title='Distribution of OAID Counts (Bare Count)',
#                   xaxis_title='Number of OpenAlex IDs for a single OMID',
#                   yaxis_title='Frequency')
# fig_bare_count.show()

In [158]:
# --------- ⚠️LOG SCALE HISTOGRAM (suitable also for online display)-----------

# define new custom legend names (including the total number of occurrences for each type)
legend_names = {type_of_res : f"{type_of_res} ({df['type'].value_counts()[type_of_res]})" for type_of_res in df['type'].unique()}


# Create the histogram using plotly.express (logarithmic scale)
fig_log_scale = px.bar(hist_data, x='oaid_count', y='frequency', color='type', hover_data=['composition'], log_y=True)
# fig_log_scale = px.bar(hist_data, x='oaid_count', y='frequency', color='type', log_y=True)  # this is without the composition column!

fig_log_scale.for_each_trace(lambda t: t.update(name = legend_names[t.name]))

fig_log_scale.update_layout(title='Distribution of OAID counts by type (Log Scale)',
                  xaxis_title='Number of OpenAlex IDs for a single OMID',
                  yaxis_title='Frequency (log)', yaxis_type='log')

# write html file of the log scale histogram
fig_log_scale.write_html('../graphs/log_scale.html')

fig_log_scale.show()

# upload the log scale histogram to plotly chart studio
csp.plot(fig_log_scale, filename='oaid_count_distr_logarithmic', auto_open=False, sharing='public', fileopt='new')


'https://plotly.com/~eliarizzetto/123/'

In [159]:
legend_names

{'journal article': 'journal article (10001)',
 'reference book': 'reference book (69)',
 'series': 'series (87)',
 'book series': 'book series (1058)',
 'standard': 'standard (7)',
 'journal': 'journal (11184)',
 'book': 'book (7247)',
 'proceedings article': 'proceedings article (106)',
 'reference entry': 'reference entry (108)',
 'book chapter': 'book chapter (95)',
 'report': 'report (12)',
 'web content': 'web content (4)',
 'proceedings': 'proceedings (4)',
 'dataset': 'dataset (1)'}
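
To illustrate why a logarithmic y-axis helps here, the sketch below takes the distribution of OAID counts for books printed earlier in this notebook and compares the raw frequencies with their base-10 logarithms. It is only an arithmetic illustration and does not reproduce how Plotly draws the bars.

```python
import math

# Distribution of OAID counts for books, as printed earlier in this notebook
book_distribution = {2: 6410, 3: 825, 4: 11, 5: 1}

# Base-10 logarithm of each frequency: this is what a log-scaled axis spaces evenly
log_freqs = {n: math.log10(freq) for n, freq in book_distribution.items()}

for n, freq in book_distribution.items():
    print(f"{n} OAIDs: frequency={freq:>5}, log10={log_freqs[n]:.2f}")

# On a linear axis the tallest bar would be this many times taller than the shortest
linear_ratio = max(book_distribution.values()) / min(book_distribution.values())
print(linear_ratio)
```

On a linear axis the bars for 4 and 5 OAIDs would be practically invisible next to the bar for 2 OAIDs, whereas their logarithms stay within a few units of each other.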

## Composition

In [160]:

composition_data_df = df.groupby(['oaid_count', 'composition', 'type']).size().reset_index(name='comp_by_count_freq')

fig = px.pie(composition_data_df, values='comp_by_count_freq', names='composition', title='Composition of multi-mapped OMIDs')
# show labels upon the pie slices
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(title='Composition of multi-mapped OMIDs', showlegend=False)

fig.write_html('../graphs/composition.html')

fig.show()

# upload file to plotly chart studio
csp.plot(fig, filename='composition', auto_open=False, sharing='public', fileopt='new')

'https://plotly.com/~eliarizzetto/119/'

In [161]:
composition_data_df.query('oaid_count >= 20 and type == "journal"')

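|    | oaid_count | composition | type    | comp_by_count_freq |
|----|------------|-------------|---------|--------------------|
| 47 | 23         | both        | journal | 1                  |
| 48 | 29         | both        | journal | 1                  |
| 49 | 30         | both        | journal | 1                  |
| 50 | 38         | both        | journal | 1                  |
| 51 | 95         | both        | journal | 1                  |
| 52 | 153        | both        | journal | 1                  |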


# Analyse new results
The following analysis concerns the results obtained by mapping OMIDs to OAIDs with the new method, which uses only the ISSN (and no longer also the DOI) for serial publications, i.e. journals, series and book series. This eliminates the situation in which the same entity has both a Source OAID and a Work OAID.

The same analyses as above are performed on the new results, in order to compare them with the previous ones.
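
The rule behind the new method can be sketched as follows. This is only an illustration under stated assumptions: `ids_for_mapping` is a hypothetical helper, the `issn:`/`doi:` prefixed strings are made-up identifiers, and leaving the identifiers of non-serial resources untouched is an assumption, since the mapping code itself is not part of this notebook.

```python
from typing import List

# Types treated as serial publications in the new method
SERIAL_TYPES = {"journal", "series", "book series"}


def ids_for_mapping(res_type: str, ids: List[str]) -> List[str]:
    """Select the identifiers to use when mapping an OMID to OpenAlex."""
    if res_type in SERIAL_TYPES:
        # serial publications: use the ISSN only, ignore the DOI
        return [identifier for identifier in ids if identifier.startswith("issn:")]
    # assumption: all other resource types keep their identifiers unchanged
    return ids


journal_ids = ids_for_mapping("journal", ["issn:1234-5678", "doi:10.1234/example"])
article_ids = ids_for_mapping("journal article", ["doi:10.1234/example"])
print(journal_ids, article_ids)
```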

In [162]:
new_omids_path = 'new_multi_mapped_omids_072023.csv'

In [163]:
# GET DISTRIBUTION OF OAIDs COUNTS BY TYPE
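type_to_filter = '' # choose the type of resource to analyse; NOTE: the function keeps only the rows whose type equals this value, so an empty string matches no rows (hence the zero counts below)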
all_works, all_sources, works_and_sources = analyse_multi_mapped_omids(new_omids_path, res_type=type_to_filter)

Total number of OMIDs of  multi-mapped: 0

Number of OMIDs of  multi-mapped to Work IDs only: 0. The following illustrates how these are distributed over the number of OAIDs each OMID is mapped to: {}

Number of OMIDs of  multi-mapped to Source IDs only: 0. The following illustrates how these are distributed over the number of OAIDs each OMID is mapped to: {}

Number of OMIDs of  multi-mapped to both Work and Source IDs: 0. The following illustrates how these are distributed over the number of OAIDs each OMID is mapped to: {}



In [164]:
# GET DETAILED INFO WITH DATAFRAME
NEW_MULTI_MAPPED_DF = pd.read_csv(new_omids_path, sep=',', header=None, names=['omid', 'openalex_id', 'type'])

# 1) get the dataframe with the additional columns from the base dataframe; assign values to the variables used as funct. params
new_current_data_df = add_columns_to_df(NEW_MULTI_MAPPED_DF)

res_type = None
comp = None
n_oaid = None

# 2) filter the dataframe by the type of resource, the composition of the OAIDs, and the number of OAIDs for a single OMID
operational_df = filter_multi_mapped_df(new_current_data_df, res_type=res_type, composition=comp, oaid_count=n_oaid)

# 3) get the OMIDs and OAIDs as URLs (set verbose=False to get only the 'omid' and 'openalex_id' columns)
output = get_ids_uris(operational_df, verbose=False) # pretty-print the output variable in a separate cell to get clickable URLs
# pprint(output)


You need to specify at least one of the two parameters: composition or oaid_count. Otherwise, the whole dataframe is returned.



In [165]:
print(len(output))
# serials_df = operational_df[~operational_df['type'].isin(['series', 'book series', 'journal'])]  # filter out the serials with negation (~)
serials_df = operational_df[operational_df['type'].isin(['series', 'book series', 'journal'])]  # not named 'filter', to avoid shadowing the built-in
print(len(serials_df))
print('series', len(operational_df[operational_df['type'].eq('series')]))
print('book series', len(operational_df[operational_df['type'].eq('book series')]))
print('journals', len(operational_df[operational_df['type'].eq('journal')]))
len(serials_df)


23841
6187
series 4
book series 56
journals 6127


6187
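
The comparison between the two runs can be made explicit with a few lines of arithmetic. The sketch below only uses the figures printed above for the previous and the new results; nothing is read from the CSV files.

```python
# Figures printed above for the previous and the new mapping results
old = {"total": 29983, "series": 87, "book series": 1058, "journal": 11184}
new = {"total": 23841, "series": 4, "book series": 56, "journal": 6127}

serial_types = ("series", "book series", "journal")
old_serials = sum(old[t] for t in serial_types)
new_serials = sum(new[t] for t in serial_types)

# Absolute and relative reduction for each figure
for key in old:
    diff = old[key] - new[key]
    print(f"{key}: {old[key]} -> {new[key]} ({diff} fewer, {diff / old[key]:.1%})")

# Multi-mapped OMIDs that are not serial publications, in each run
old_non_serials = old["total"] - old_serials
new_non_serials = new["total"] - new_serials
print(old_non_serials, new_non_serials)
```

With these figures the number of non-serial multi-mapped OMIDs comes out the same in both runs, so the whole reduction is accounted for by journals, series and book series.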

### Visualizations of new results

Prepare data for the histogram and pie chart visualizations.

In [166]:
new_df = add_columns_to_df(NEW_MULTI_MAPPED_DF)
new_hist_data = new_df.groupby(['oaid_count', 'type', 'composition']).size().reset_index(name='frequency')
new_composition_data_df = new_df.groupby(['oaid_count', 'composition', 'type']).size().reset_index(name='comp_by_count_freq')


In [167]:
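# ----------------LOG SCALE HISTOGRAM----------------
# legend names with the totals computed on the new results (legend_names holds the totals of the previous results)
new_legend_names = {type_of_res : f"{type_of_res} ({count})" for type_of_res, count in new_df['type'].value_counts().items()}

fig_log_scale_new = px.bar(new_hist_data, x='oaid_count', y='frequency', color='type', hover_data=['composition'], log_y=True)

fig_log_scale_new.for_each_trace(lambda t: t.update(name = new_legend_names[t.name]))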

fig_log_scale_new.update_layout(title='Distribution of OAID counts by type (Log Scale) (new results)',
                  xaxis_title='Number of OpenAlex IDs for a single OMID',
                  yaxis_title='Frequency (log)', yaxis_type='log')

# write html file of the log scale histogram
fig_log_scale_new.write_html('../graphs/log_scale_new.html')

fig_log_scale_new.show()

# upload the log scale histogram to plotly chart studio
csp.plot(fig_log_scale_new, filename='oaid_count_distr_logarithmic_new', auto_open=False, sharing='public', fileopt='new')

'https://plotly.com/~eliarizzetto/176/'

In [168]:
# ----------------PIE CHART (composition)----------------
fig = px.pie(new_composition_data_df, values='comp_by_count_freq', names='composition', title='Composition of multi-mapped OMIDs')
# show labels upon the pie slices
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(title='Composition of multi-mapped OMIDs (new results)', showlegend=False)

fig.write_html('../graphs/composition_new.html')

fig.show()

# upload file to plotly chart studio
csp.plot(fig, filename='composition_new', auto_open=False, sharing='public', fileopt='new')


'https://plotly.com/~eliarizzetto/179/'

# Get full metadata about the multi-mapped OMIDs

In order to perform further (programmatic) analysis on the multi-mapped OMIDs, we need to get the full metadata about them. We can do this by querying the OpenAlex dump, using the OAIDs mapped to them as keys.

In [169]:
assert False

AssertionError: 

In [None]:
from helper import create_query_lists_oaid, get_full_metadata_for_oaids, unify_part_files

In [None]:
works_list, sources_list = create_query_lists_oaid(df) # call the function query_oa_dump with a list and a folder that are consistent with each other (i.e. that contain the same type of resource, either only works or only sources)

works_input_filepath = "D:/openalex_dump/data/works/**/*.gz"
works_output_filepath = "D:/multi_mapped_full_data/works/"
sources_input_filepath = "D:/openalex_dump/data/sources/**/*.gz"
sources_output_filepath = "D:/multi_mapped_full_data/sources/"

if __name__ == '__main__': # this guard may be needed for Dask to work properly
    process_start_time = time.perf_counter()
    print('Processing OA dump files...')
    ## Retrieve full metadata for multi-mapped OAIDs.
    ## Uncomment the lines below to run the functions
    # get_full_metadata_for_oaids(works_input_filepath, works_output_filepath, works_list) # only for Works
    # get_full_metadata_for_oaids(sources_input_filepath, sources_output_filepath, sources_list) # only for Sources

    process_end_time = time.perf_counter()
    print(f'Processed files and wrote output in {(process_end_time - process_start_time)/3600} hours. Output files are stored in .part files inside the folders {works_output_filepath} and {sources_output_filepath}')

    ## Put all multi-mapped OpenAlex records together
    ## Uncomment the lines below to run the functions
    print('Unifying part files storing full metadata of multi-mapped OAIDS...')
    process_start_time = time.perf_counter()
    # unify_part_files(works_output_filepath + '*.part', out_path='D:/multi_mapped_full_data/works/full_data.json') # create a single JSON-L file  for Works
    # unify_part_files(sources_output_filepath + '*.part', out_path='D:/multi_mapped_full_data/sources/full_data.json') # create a single JSON-L file  for Sources
    process_end_time = time.perf_counter()
    print(f'Created 2 JSON-L files for multi-mapped Works and Sources in  {(process_end_time - process_start_time)/60} minutes.')
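
The helper functions imported from `helper` are not shown in this notebook, so the following is only a minimal, sequential sketch of the underlying idea: scan gzipped JSON Lines files and keep the records whose ID is in a given set. The function name `iter_matching_records`, the toy records, and the assumption that each record stores its ID as a URL such as `https://openalex.org/W1` are illustrative and not taken from the helper module.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Iterable, Iterator, Set


def iter_matching_records(paths: Iterable[Path], wanted_ids: Set[str]) -> Iterator[dict]:
    """Yield the records whose short OpenAlex ID is in wanted_ids."""
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                # assumption: the 'id' field is a URL ending with the short ID
                short_id = record["id"].rsplit("/", 1)[-1]
                if short_id in wanted_ids:
                    yield record


# Toy demo: write a small gzipped JSON Lines file and query it
with tempfile.TemporaryDirectory() as tmp:
    part = Path(tmp) / "part_000.gz"
    with gzip.open(part, "wt", encoding="utf-8") as f:
        for oaid in ("W1", "W2", "W3"):
            f.write(json.dumps({"id": f"https://openalex.org/{oaid}", "title": f"Title {oaid}"}) + "\n")
    # consume the generator while the temporary file still exists
    matches = list(iter_matching_records([part], {"W1", "W3"}))

print([m["id"] for m in matches])
```

Using a set for `wanted_ids` keeps each membership test cheap, which matters when every line of a large dump has to be checked.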

# Load Merged IDs CSV into Dataframes

Template for loading the CSV files storing merged IDs from the OpenAlex dump into dataframes (one for Works, one for Sources):

```
import dask.bag
import pandas as pd
from dask.diagnostics import ProgressBar

merged_works_df = pd.read_csv('D:/merged_ids_reduced/works/merged_ids.csv', dtype={'id': 'string', 'merge_into_id': 'string'})
merged_sources_df = pd.read_csv('D:/merged_ids_reduced/sources/merged_ids.csv', dtype={'id': 'string', 'merge_into_id': 'string'})

# Use a set for fast membership tests (works_list is the list returned by create_query_lists_oaid)
works_set = set(works_list)

# Define a function that takes an ID and returns it if it is in the works_list, None otherwise
def get_oaid_if_in_works_list(i):
    return i if i in works_set else None

# Get the values of the merge_into_id column
values = merged_works_df['merge_into_id'].values

bag = dask.bag.from_sequence(values)
results = bag.map(get_oaid_if_in_works_list) # map the function to the values

with ProgressBar():
    results = results.compute()

# Print the merge targets that are also multi-mapped Work OAIDs
for result in results:
    if result is not None:
        print(result)
```
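
For small inputs the same check can be sketched without Dask, using only the standard library. The column names `merge_date`, `id` and `merge_into_id`, the toy IDs, and the variable `works_set` are assumptions made for the illustration; only `id` and `merge_into_id` appear in the template above.

```python
import csv
import io

# Toy stand-in for a merged_ids.csv file (hypothetical IDs and dates)
merged_csv = io.StringIO(
    "merge_date,id,merge_into_id\n"
    "2023-01-01,W10,W1\n"
    "2023-01-02,W11,W2\n"
    "2023-01-03,W12,W99\n"
)

# Hypothetical set of multi-mapped Work OAIDs
works_set = {"W1", "W2", "W3"}

# Map each merge target to the list of IDs that were merged into it
merged_into = {}
for row in csv.DictReader(merged_csv):
    merged_into.setdefault(row["merge_into_id"], []).append(row["id"])

# Keep only the merge targets that are also multi-mapped Work OAIDs
merge_targets = {oaid: merged_into[oaid] for oaid in sorted(works_set.intersection(merged_into))}
print(merge_targets)
```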
