# Merge, Inspect, and Select Publications 

*Last update: 01/22/2022*

With this notebook, you can merge publication lists from different queries. It uses the data files that are generated each time you search for scientific publications with one of our [query notebooks](https://github.com/ai4ki/LitRev_Toolbox.git). These data files are stored in the directory `results_json`.

If you follow steps 1 through 5, you end up with an Excel file that contains all your publication data without duplicates. Since there is no reliable universal identifier for a publication record, we use the papers' titles to identify duplicates. We account for possible differences in notation by pre-processing the titles. Note, however, that this method is not fail-safe, so you might want to inspect the final publication list manually to remove any remaining duplicates.
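
To illustrate the idea of title-based duplicate detection, here is a minimal sketch. It is not the code used in `ai4ki_utils`; the helper names `normalize_title` and `deduplicate` and the record key `'title'` are assumptions made for this example only.

```python
import re
import unicodedata


def normalize_title(title):
    """Reduce a title to a comparison key (hypothetical helper, not part of ai4ki_utils)."""
    # Decompose accented characters and drop the combining marks
    text = unicodedata.normalize('NFKD', title)
    text = ''.join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, replace punctuation with spaces, collapse whitespace
    text = re.sub(r'[^a-z0-9]+', ' ', text.lower())
    return text.strip()


def deduplicate(records):
    """Keep the first record for each normalized title."""
    seen = set()
    unique = []
    for record in records:
        key = normalize_title(record['title'])  # 'title' is an assumed key name
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {'title': 'Deep Learning for Cats: A Survey'},
    {'title': 'deep learning for cats - a survey.'},
    {'title': 'Classifying Dogs'},
]
unique_records = deduplicate(records)
print(len(unique_records))  # the first two titles collapse into one record
```

Titles that differ by more than case, accents, or punctuation (for example, a missing subtitle) still end up with different keys, which is why a manual check remains useful.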

Before exporting to Excel, you can manually inspect every record in the merged publication list, and remove papers that you deem irrelevant for your analysis.

## Relevance Scores
Along the way, you can calculate a number of relevance scores, which you might want to consider in your final assessment (see step 2). These scores are added to your final Excel file so that you can use Excel's sort and filter options to rank publications.

## Working with Jupyter notebooks

In case you are not familiar with Jupyter notebooks, this is how to go about it: To execute a piece of code, click inside a cell (the ones with `[]` to the left) and press Shift+Enter. Wait until the cell is done (the `*` in `[]` turns into a number), then move on to the next cell.

If you get incomprehensible error messages or the notebook gets stuck, choose "Restart & Clear Output" in the "Kernel" dropdown menu above and start afresh.

___
**Please help us to improve this tool by [emailing us](mailto:ai4ki.dev@gmail.com?subject=ai4ki-tools:%20Merge_Inspect_Select) your update ideas or error reports.**
___

## Preparation: Import libraries

In [1]:
import json  # needed for the JSON export in step 5
from os.path import join

import pandas as pd
from ai4ki_utils.merge_rank_utils import *

## Step 1: Fetch publication lists

In [None]:
# Read all JSON-files in dedicated directory
path = './results_json'
pub_data = get_pub_data(path=path)

# Get some useful global data
n_files = len(pub_data)
pub_data_lengths = [len(pub_data[i]) for i in range(n_files)]
mx_flen = max(pub_data_lengths)

### 1.1: Check publication lists for bad entries
*Run this cell to delete records without an entry in the title field, because they cause errors in later steps.*

In [None]:
check_pub_data(pub_data)

## Step 2: Calculate relevance scores
### 2.1: Normalized rank score
Scientific search engines like Semantic Scholar, Google Scholar, or CORE return a ranked list of results for each query (usually called 'relevance sorting'). We use a simple technique to preserve this information in the merging process.

The following cell calculates a 'Rank score' (RS) between 0 and 1 for each publication in the lists provided in step 1 (RS=0: top ranked). If a publication appears in more than one list, we calculate its average RS when merging the lists in step 3.
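
The exact formula is implemented in `rank_score` and is not reproduced here. As a rough sketch of the idea, assume that RS is the zero-based list position divided by the length of the longest list; the function names below are hypothetical.

```python
def rank_scores(n_items, max_len):
    """Return one score per list position; 0 marks the top-ranked item.

    Assumed formula for illustration: position / length of the longest list.
    """
    return [position / max_len for position in range(n_items)]


def average_score(scores):
    """Average the scores of a publication that was found in several lists."""
    return sum(scores) / len(scores)


max_len = 4                        # length of the longest result list
list_a = rank_scores(4, max_len)   # [0.0, 0.25, 0.5, 0.75]
list_b = rank_scores(2, max_len)   # [0.0, 0.25]

# A paper ranked third in list A and first in list B
merged = average_score([list_a[2], list_b[0]])
print(merged)  # 0.25
```

Dividing by the longest list length keeps scores from lists of different sizes on a common scale, so that a low rank in a short list is not penalized more than the same rank in a long one.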

In [None]:
# Calculate RS for each publication in each list and add to data
rank_score(pub_data, mx_flen, key='Rank score')

### 2.2: Title and abstract match score
The following cell calculates two relevance scores: 'Title match score' (TMS) and 'Abstract match score' (AMS).

Based on a list of keywords you provide, TMS is the number of keywords that appear in a publication's title. We divide this number by the total number of keywords to get a TMS value between 0 and 1. AMS is defined accordingly with respect to a publication's abstract (if available).

*Note: Separate your keywords with commas:* `deep learning, cats, dogs, classification algorithm`.<br>Make sure to spell your keywords correctly!
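
The definition above can be sketched as follows. This is an illustration, not the code behind `match_score`; the function name `keyword_match_score` is hypothetical, and the use of case-insensitive substring matching is an assumption.

```python
def keyword_match_score(text, keywords):
    """Fraction of keywords that occur in the text (case-insensitive substring match)."""
    if not text or not keywords:
        return 0.0
    text = text.lower()
    hits = sum(1 for keyword in keywords if keyword in text)
    return hits / len(keywords)


raw = 'deep learning, cats, dogs, classification algorithm'
keywords = [k.strip().lower() for k in raw.split(',')]

title = 'Deep Learning for Classifying Cats'
tms = keyword_match_score(title, keywords)  # 'deep learning' and 'cats' match
ams = keyword_match_score(None, keywords)   # no abstract available
print(tms, ams)  # 0.5 0.0
```

Note that 'classification algorithm' does not match 'Classifying' in this sketch: a keyword only counts if it appears verbatim, which is why spelling matters.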

In [None]:
# Calculate TMS and AMS for each publication in each list and add to data
keywords = input('Enter keywords: ')
keywords = keywords.split(',')
keywords = [k.strip().lower() for k in keywords]
match_score(pub_data, keywords)

### 2.3: Similarity score
The following cell calculates a similarity score (SimS) between a user-provided document and the abstract of a publication. SimS is based on the [TF-IDF algorithm](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (Term Frequency–Inverse Document Frequency). SimS ranges from 0 to 1, with a value of 1 indicating identical documents.
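
For readers who want to see how a TF-IDF-based similarity can work in principle, here is a simplified, pure-Python sketch. It is not the implementation behind `sim_score`: the function names are hypothetical, and the choices of a smoothed IDF variant and cosine similarity as the comparison measure are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a text into lowercase alphanumeric tokens."""
    return re.findall(r'[a-z0-9]+', text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF keeps weights positive for terms found in every document
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


document = 'Neural networks classify images of cats and dogs.'
abstracts = [
    'Neural networks classify images of cats and dogs.',
    'We study soil erosion in alpine valleys.',
]
vectors = tfidf_vectors([document] + abstracts)
scores = [cosine_similarity(vectors[0], vec) for vec in vectors[1:]]
print(scores)  # first score is (close to) 1, second is 0
```

The first abstract is identical to the document and therefore scores at the top of the range, while the second shares no terms with it and scores 0.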

Provide your document as a `.txt` file and upload it to the JupyterLab base directory. Run the following cell and enter the full filename:

In [None]:
# Load and read base document
filename = input('Enter filename: ')
with open(filename, 'r', encoding='utf-8') as f:
    document = f.read()
document = document.replace('\n',' ')

# Calculate similarity score for each publication in each list and add to data
sim_score(pub_data, document)

## Step 3: Merge publication data

In [None]:
# First, get the titles from the data and process them for matching
pub_data_titles, pub_data_idx = get_proc_titles(pub_data)

# Second, deduplicate and merge publication data
pub_data_merge_final = merge_pub_data(pub_data, pub_data_titles, pub_data_idx)
df_merge = pd.DataFrame(pub_data_merge_final)

### 3.1: Display merged publication data (optional)

In [None]:
# Preview merged results
n_rows = int(input('Enter number of rows to display: '))
c_sort = input('Enter name of column for sorting (0 = no sorting): ')

if c_sort != '0':
    try:
        df_merge.sort_values(by=c_sort, inplace=True, ascending=True)
        print('==> Displaying first {} items, sorted by column {}:'.format(n_rows, c_sort))
    except KeyError:
        # sort_values raises KeyError if the column does not exist
        print('ERROR: Selected sorting column does not exist. Check spelling!')
        print('==> Showing unsorted publication list instead:')
df_merge.head(n_rows)

## Step 4: Manually inspect merged publication data
*Walk through the merged publication records and decide which ones to keep.*<br>
*It's best to do this in one pass, but if you get tired, you can exit by entering 'stop'.*

In [None]:
ir_index, pub_data_final = inspect_pubs(pub_data_merge_final)

### 4.1: Continue manual inspection of merged publication data (just in case...)
*In case you interrupted manual inspection of publication data above, run the following cell!*

In [None]:
if ir_index != 0:
    ir_index, pub_data_cntd = inspect_pubs(pub_data_merge_final, idx_start=ir_index)
    # Add the newly inspected records one by one (assumes inspect_pubs returns a list of records)
    pub_data_final.extend(pub_data_cntd)

## Step 5: Export final publication list

In [None]:
XCL_XTNSN, JSN_XTNSN, BBT_XTNSN = '.xlsx', '.json', '.bib'
outfile = 'Final_Publication_List'

# Choose which data to export
print('Which publication data do you want to export?')
which_data = input('auto/manual ').strip().lower()
if which_data == 'manual':
    data_out = pub_data_final
else:
    # Use the automatically merged list for 'auto' or any unrecognized input
    data_out = pub_data_merge_final

# Export results to EXCEL file
df_out = pd.DataFrame(data_out)
df_out.to_excel(join('./results_merge',outfile+XCL_XTNSN), engine='xlsxwriter', index=False)

# Export results to JSON file
with open(join('./results_merge',outfile+JSN_XTNSN), 'w', encoding='utf-8') as f:
    json.dump(data_out, f)
    
# Export BibTeX data to .bib file (records without a BibTeX entry are skipped)
bibtex_data = '\n\n'.join([item['BibTex'] for item in data_out if item.get('BibTex') is not None])
with open(join('./results_merge',outfile+BBT_XTNSN), 'w', encoding='utf-8') as f:
    f.write(bibtex_data)