# Literature Search with CORE
 
*Last updated: 01/21/2022*
 
This notebook allows you to query the [CORE](https://core.ac.uk/) collection. It is based on the CORE API, which is documented [here](https://api.core.ac.uk/docs/v3). 

The results of your queries are stored in three formats: an Excel file and a JSON file with the full publication data, and a `.bib` file with only the BibTeX data for each publication (if available). You will need the JSON files if you later want to merge results using the notebook `Merge_Inspect_Select.ipynb`. The files are named `MyCORE_Search_[current date_time].xlsx/.json/.bib`. You will find them in the left navigation pane in folders named `results_excel/json/bibtex`.

You can also bulk download all PDFs available for your query (see Section I, Step 4).

**Note**: You need an API key to use this notebook. If you don't have one, you can request one [here](https://core.ac.uk/searchAssets/api-keys/register/).

## Section I: Query the CORE Collection
CORE allows you to use special query syntax. In particular, you can use the logical operators `OR` and `AND` (note that CORE treats a blank space as `AND`). You can also use round brackets to group and prioritize elements of the query, like so: `("deep learning" AND (applications OR models))` (make sure to wrap the entire query in brackets).

You can also restrict your search to certain fields, in particular to `title`, `abstract`, or `fullText`. To do that, you have to add the field parameter to *every* search term, like so `(title:"deep learning" AND (abstract:applications OR abstract:models))`.     
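
The field syntax above can be sketched in a few lines of Python. The helper `fielded` is hypothetical and not part of this notebook or of `ai4ki_utils`; it only assembles the query string and then URL-encodes it the same way Step 1 does with `urllib.parse.quote`.

```python
from urllib.parse import quote


def fielded(field, terms, operator="OR"):
    """Prefix every term with the field name and join the terms with a logical operator."""
    # Multi-word terms are wrapped in double quotes so they are searched as a phrase
    parts = [
        '{}:"{}"'.format(field, t) if " " in t else "{}:{}".format(field, t)
        for t in terms
    ]
    joined = (" " + operator + " ").join(parts)
    # Only groups with more than one term need their own brackets
    return "(" + joined + ")" if len(parts) > 1 else joined


title_part = fielded("title", ["deep learning"])
abstract_part = fielded("abstract", ["applications", "models"])

# Wrap the entire query in brackets, as recommended above
query = "(" + title_part + " AND " + abstract_part + ")"
print(query)         # (title:"deep learning" AND (abstract:applications OR abstract:models))
print(quote(query))  # URL-encoded form that is appended to the request URL
```

Building the string programmatically makes it harder to forget the field prefix on one of the terms.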

Note that a single query returns at most 100 results. If your query yields more than 100 hits, you have to repeat the same query until you have reached the desired number of results (see below for instructions on how to do this).
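
To illustrate how repeated requests walk through a longer result list, here is a minimal sketch of the offset arithmetic. The function `page_offsets` is hypothetical and only mirrors the expression `offset = (n_queries - 1) * limit` used in Step 2; it does not contact the API.

```python
import math


def page_offsets(n_wanted, limit=100):
    """Return the offset of each request needed to collect n_wanted results."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    n_requests = math.ceil(n_wanted / limit)
    # Request number n (counting from 1) starts at offset (n - 1) * limit
    return [(n - 1) * limit for n in range(1, n_requests + 1)]


print(page_offsets(250))  # [0, 100, 200]
```

In other words, collecting 250 results means running Step 2 three times.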

## Section II: Perform multiple queries in one batch

You can also send any number of queries to the CORE API in one batch. Go directly to Section II for instructions on how to do this.

## Working with Jupyter notebooks

In case you are not familiar with Jupyter notebooks, this is how to go about it: To execute a piece of code, click inside a cell (the ones with `[]` to the left) and press Shift+Enter. Wait until the cell is done (the `*` in `[]` then turns into a number) and move on to the next cell.

If you get incomprehensible error messages or the notebook gets stuck, choose "Restart & Clear Output" in the "Kernel" dropdown menu above.
___
**Please help us to improve this tool by [emailing us](mailto:ai4ki.dev@gmail.com?subject=ai4ki-tools:%20CORE%20Collection%20Search) your update ideas or error reports.**
___

## Preparations

### A: Provide API key
*Execute the cell and enter your key in the input field below.*

In [None]:
API_KEY = input('Enter your API key: ')
print('API key accepted: ', API_KEY)

### B: Import libraries
*You have to execute the following cell only once at the beginning of a session!*

In [2]:
import json
import os
import pandas as pd
import sys
import urllib.parse
sys.path.append('../')

from ai4ki_utils.core_utils import *
from ai4ki_utils.core_request import *
from datetime import datetime
from os.path import join

## Section I: Query the CORE Collection

### Step 1: Formulate search query
*Execute the cell and enter your query in the input field below.*

In [None]:
query = input('Enter query: ') 
print('Accepted query: ', query)
q = urllib.parse.quote(query)

# Create list for storing more than 100 results
results_store = []

# Count number of queries for convenience later
n_queries = 1

### Step 2: Post query

In [None]:
# Set the maximum number of results to be returned (absolute maximum is 100) 
limit = 100
offset = int((n_queries - 1)*limit)

# Construct the URL for the CORE API endpoint
url = 'https://api.core.ac.uk/v3/search/works?q=' + q
params = {
    'offset': str(offset),
    'limit': str(limit),
    'apiKey': API_KEY
}

# Make the request and fetch the data
status, data = core_request(url, params)
if status:
    n_papers = len(data['results'])
    print('==> Formatting publication data {beg} to {end}:'.format(beg=offset,end=n_papers + offset - 1))
    results_form = format_core_data(n_papers, data)
    results_store += results_form
    df_tmp = pd.DataFrame(results_form) # Create dataframe for display
    n_queries += 1

### Step 3: Preview results

In [None]:
# Preview results
n_rows = int(input('Enter number of rows to display: ')) 
print('==> Displaying first {n} rows: '.format(n=n_rows))

df_tmp.head(n_rows)

*If you want to export more than 100 results, repeat step 2 before moving on to steps 4 and 5!*

### Step 4: Bulk download available PDFs
*NOTE: By default, the PDFs of all results are downloaded (if available). If you only want the first `n_down`, change the parameter below accordingly.*<br>
*The parameter `fraction` controls how many characters of the publication title are used for the PDF name.*

In [None]:
# Download PDFs from CORE 
core_download(results_store, dir_path='./results_pdf', n_down=limit, fraction=50)

### Step 5: Write results to Excel, BibTeX, and JSON files
*Note that, after running the following cell, your query results will be deleted from memory!*

In [None]:
XCL_XTNSN, JSN_XTNSN, BBT_XTNSN = '.xlsx', '.json', '.bib'
time_stamp = datetime.now().strftime("%Y-%m-%d-%H%M%S")
outfile = "MyCORE_Search_" + str(time_stamp)

# Export results to EXCEL file
df_out = pd.DataFrame(results_store)
df_out.to_excel(join('./results_excel', outfile+XCL_XTNSN), engine='openpyxl', index=False)

# Export BibTex-Data to .bib-file
bibtex_data = '\n\n'.join([item['BibTex'] for item in results_store if item['BibTex'] is not None])
with open(join('./results_bibtex',outfile+BBT_XTNSN), 'w', encoding='utf-8') as f:
    f.write(bibtex_data)

# Export results to JSON file
with open(join('./results_json', outfile+JSN_XTNSN), 'w', encoding='utf-8') as f:
    json.dump(results_store, f)

# Delete variables for next run
del data, df_out, df_tmp, n_queries, n_papers, outfile, q, query, results_form, results_store, status, url

## Section II: Perform multiple queries in one batch

### Step 1: Load your queries
*Provide your queries in a `.txt`-file with one line for each query. Run the following cell to load your queries.* 
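
As an illustration of the expected file layout, the sketch below writes two sample queries to a temporary file and reads them back. The queries are made-up examples, and the file is created in a temporary folder so that nothing in your working directory is overwritten.

```python
import os
import tempfile

# Made-up example queries, one per line, each using the field syntax from Section I
sample_queries = [
    '(abstract:"deep learning" AND abstract:applications)',
    '(abstract:"machine learning" AND (abstract:applications OR abstract:models))',
]

# Write the queries to a file in a temporary folder
path = os.path.join(tempfile.mkdtemp(), "CORE_queries_abstract.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(sample_queries) + "\n")

# Read them back, stripping whitespace and skipping empty lines
with open(path, "r", encoding="utf-8") as f:
    loaded = [line.strip() for line in f if line.strip()]

for i, q in enumerate(loaded):
    print("Query {}: {}".format(i, q))
```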

In [None]:
# Open and read queries file
with open('CORE_queries_abstract.txt', 'r', encoding='utf-8') as f:
    user_queries = f.readlines()
# Strip leading and trailing whitespace (including the newline character) and skip empty lines
queries = [q.strip() for q in user_queries if q.strip()]
# Print the same list that is used (and indexed) in the following steps
for i, q in enumerate(queries):
    print('Query {}: {}'.format(i, q))

### Step 2: Compare the queries (optional)
*Run the following cell to get the number of papers for each query and indicators of how each pair of queries compares for the first 100 results.*

In [None]:
compare = comp_mult_queries(queries, API_KEY, min_match=0.7, min_rbo=0.5, p_value=0.97)

**Interpretation of the RBO value:**<br>

*Rank Biased Overlap (RBO) is a measure of similarity between two ranked lists that do not necessarily share the same items. `RBO=0` means no overlap, `RBO=1` means perfect overlap between the two lists. The `p_value` (the RBO persistence parameter, not a statistical p-value) controls how much the first `n` results contribute to the RBO value. We use a `p_value` of 0.97, which means that the top 10 results contribute roughly 50% to the RBO measure.*
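
The "roughly 50%" figure can be checked with the closed-form expression for the weight of the first `d` ranks given in the original RBO paper (Webber, Moffat and Zobel, 2010). The function name `rbo_top_weight` is hypothetical, and it is an assumption that `comp_mult_queries` follows the same RBO definition.

```python
import math


def rbo_top_weight(p, d):
    """Share of the total RBO weight carried by ranks 1..d for persistence p (0 < p < 1)."""
    # Partial sum of p^i / i for i = 1 .. d-1
    partial = sum(p ** i / i for i in range(1, d))
    return 1 - p ** (d - 1) + ((1 - p) / p) * d * (math.log(1 / (1 - p)) - partial)


# With p = 0.97, the top 10 results carry roughly 0.53 of the weight
for d in (10, 50, 100):
    print(d, round(rbo_top_weight(0.97, d), 2))
```

A smaller `p_value` concentrates the weight on fewer top results; with `p = 0.9`, for example, the first 10 ranks already carry about 86% of the weight.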

#### 2.1: Discard redundant queries
Based on the results of the comparison, you can discard queries that return redundant results. In the first line of the following cell, enter the numbers of the queries you want to discard. Then run the cell to delete the chosen queries from the list `queries`.

In [None]:
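delete_queries = [2,5,8]
# Remove from the highest index down so that earlier removals do not shift the remaining positions
for d in sorted(delete_queries, reverse=True):
    queries.pop(d)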

### Step 3: Process the queries
*Run the following cell to process your queries.<br>
The results of each query are stored in separate Excel, JSON, and BibTeX files named `MyCORE_Search_[query i]_[current date_time].*`.*

In [None]:
# For each query, set the number of papers to download.
limit = 100 # absolute maximum: 100
offset = 0

# Store query parameters in dictionary
params = {
    'offset': str(offset),
    'limit': str(limit),
    'apiKey': API_KEY
}

# For each query, make the request and fetch the data
proc_mult_queries(queries, params)