# Literature Search with Semantic Scholar
 
*Last updated: 01/21/2022*
 
This notebook allows you to query [Semantic Scholar](https://www.semanticscholar.org). It is based on the Semantic Scholar API, which is documented [here](https://api.semanticscholar.org/graph/v1). 

The results of your queries are stored in three formats: an Excel file and a JSON file with the full publication data, and a `.bib` file with only the BibTeX data for each publication (if available). You will need the JSON files if you later want to merge results using the notebook `Merge_Inspect_Select.ipynb`. The files are named `MySemSchol_Search_[current date_time].xlsx/.json/.bib`. You will find them in the left navigation pane, in the folders `results_excel`, `results_json`, and `results_bibtex`.

**Note**: This is not an application to be used at scale. Please use it with restraint (Semantic Scholar sets a limit of 100 requests per five minutes anyway).
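
As a rough illustration of what that limit means in practice, here is a minimal sketch of spacing requests evenly. It assumes the limit stated above (100 requests per five minutes); `min_interval` and `Throttle` are hypothetical helpers and not part of `ai4ki_utils`.

```python
import time

MAX_REQUESTS = 100        # limit quoted in the note above
WINDOW_SECONDS = 5 * 60   # five minutes

def min_interval(max_requests=MAX_REQUESTS, window=WINDOW_SECONDS):
    """Pause between requests that keeps evenly spaced calls within the limit."""
    return window / max_requests

class Throttle:
    """Hypothetical helper: blocks until `interval` seconds have passed since the last call."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `Throttle(min_interval()).wait()` before each request would leave about three seconds between consecutive requests.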

## Section I: Query the Semantic Scholar Database

Because the search is semantic, the API does not support special query syntax such as Boolean operators or wildcards. Use a natural-language query or the usual sequence of keywords instead.

Note that each query returns a maximum of 100 results (i.e., paper references). If your query yields more than 100 hits, you have to repeat the same query until you have reached the desired number of results (see below for instructions on how to do this).
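
The paging arithmetic behind this can be sketched as follows. The notebook itself handles paging by re-running Step 2, so this is only an illustration: `page_params` is a hypothetical helper, and the parameter names (`offset`, `limit`, `fields`) are the ones used in the request cells of this notebook.

```python
def page_params(n_wanted, fields, page_size=100):
    """Return one parameter dict per request needed to collect n_wanted results."""
    if not 1 <= page_size <= 100:
        raise ValueError('page_size must be between 1 and 100')
    pages = []
    for offset in range(0, n_wanted, page_size):
        pages.append({
            'offset': str(offset),
            # The last page may be shorter than page_size
            'limit': str(min(page_size, n_wanted - offset)),
            'fields': fields,
        })
    return pages

# 250 results need three requests, starting at offsets 0, 100, and 200
for p in page_params(250, 'title,year'):
    print(p['offset'], p['limit'])
```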

## Section II: Perform multiple queries in one batch

You can also send any number of different queries to Semantic Scholar in one batch. Go directly to Section II below for instructions on how to do this.


## Working with Jupyter notebooks

If you are not familiar with Jupyter notebooks, here is how to proceed: To execute a piece of code, click inside a cell (the ones with `[]` to the left) and press Shift+Enter. Wait until the cell is done, that is, until the `*` in `[]` has turned into a number, and then move on to the next cell.

If you get incomprehensible error messages or the notebook gets stuck, choose "Restart & Clear Output" in the "Kernel" dropdown menu above.

___
**Please help us to improve this tool by [emailing us](mailto:ai4ki.dev@gmail.com?subject=ai4ki-tools:%20Semantic%20Scholar%20Search) your update ideas or error reports.**
___

## Preparation: Import libraries
*You have to execute the following cell only once at the beginning of a session!*

In [4]:
import json
import pandas as pd
import os
import sys
import time
sys.path.append('../')

from datetime import datetime
from os.path import join
from ai4ki_utils.semschol_utils import *
from ai4ki_utils.semschol_request import semschol_request

## Section I: Query the Semantic Scholar Database

### Step 1: Formulate search query
*Run the following cell to perform a single query. Enter your query in the input field that pops up below.*

In [None]:
query = input('Enter query: ') 
print('Accepted query: ', query)
# Replace spaces with proper URL encoding 
query = query.strip().replace(' ','%20')
# Create list for storing more than 100 results
results_store = []
# Count number of queries for convenience later
n_queries = 1
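
The cell above only replaces spaces. If a query contains other reserved characters (for example `&`, `#`, or `+`), they would reach the URL unencoded. A minimal sketch of a more general encoding, using the standard library's `urllib.parse.quote`, could look like this; `encode_query` is a hypothetical helper, not a function from `ai4ki_utils`.

```python
from urllib.parse import quote

def encode_query(raw):
    """Strip surrounding whitespace and percent-encode every reserved character."""
    # safe='' also encodes '/', which quote() would otherwise leave untouched
    return quote(raw.strip(), safe='')

print(encode_query('  deep learning & NLP '))
```

For queries that contain only letters, digits, and spaces, this gives the same result as the `replace(' ', '%20')` call used in the cell.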

### Step 2: Post query

In [None]:
# Set the maximum number of results to be returned (absolute maximum is 100)
limit = 100
offset = int((n_queries - 1)*limit)

# Define the fields to be returned by the API request (can be changed, see API documentation)
FIELDS = 'title,url,authors,abstract,citationCount,externalIds,isOpenAccess,year,fieldsOfStudy' 

# Store query parameters in a dictionary
params = {
    'offset': str(offset),
    'limit': str(limit),
    'fields': FIELDS
}

# URL for the Semantic Scholar API endpoint
url='https://api.semanticscholar.org/graph/v1/paper/search?query=' + query

# Make the request and fetch the data
status, results = semschol_request(url, params)
if status: 
    n_papers = len(results['data'])
    print('==> Formatting publication data {beg} to {end}:'.format(beg=offset,end=n_papers + offset - 1))
    results_form = format_semschol_data(n_papers, results)
    results_store += results_form
    df_tmp = pd.DataFrame(results_form) # Create dataframe for display
    n_queries += 1

### Step 3: Preview results

In [None]:
# Preview results
n_rows = int(input('Enter number of rows to display: ')) 
print('==> Displaying first {n} rows: '.format(n=n_rows))

df_tmp.head(n_rows)

#### 3.1: Show results in Semantic Scholar (optional)

In [None]:
# Get equivalent Semantic Scholar search URL
sem_schol_url = 'https://www.semanticscholar.org/search?q=' + query + '&sort=relevance'
sem_schol_url = sem_schol_url.replace(' ', '%20')
print("Follow this link: ", sem_schol_url)

*If you want to export more than 100 results, repeat Step 2 before moving on to Step 4!*

### Step 4: Export publication data  
*Note that, after running the following cell, your query results will be deleted from memory!*

In [None]:
XCL_XTNSN, JSN_XTNSN, BBT_XTNSN = '.xlsx', '.json', '.bib'
time_stamp = datetime.now().strftime("%Y-%m-%d-%H%M%S")
outfile = "MySemSchol_Search_" + str(time_stamp)

# Export results to EXCEL file
df_out = pd.DataFrame(results_store)
df_out.to_excel(join('./results_excel', outfile+XCL_XTNSN), engine='openpyxl', index=False)

# Export BibTex-Data to .bib-file
bibtex_data = '\n\n'.join([item['BibTex'] for item in results_store if item['BibTex'] is not None])
with open(join('./results_bibtex',outfile+BBT_XTNSN), 'w', encoding='utf-8') as f:
    f.write(bibtex_data)
    
# Export results to JSON file
with open(join('./results_json',outfile+JSN_XTNSN), 'w', encoding='utf-8') as f:
    json.dump(results_store, f)

# Delete variables for next run
del df_out, df_tmp, n_queries, n_papers, outfile, query, results, results_form, results_store, status, url

## Section II: Perform multiple queries in one batch

### Step 1: Load your queries
*Provide your queries in a `.txt`-file with one line for each query. Run the following cell to load your queries.* 

In [None]:
# Open and read queries file; strip whitespace and skip blank lines
with open('SemSchol_queries.txt', 'r', encoding='utf-8') as f:
    user_queries = [line.strip() for line in f if line.strip()]
# Replace spaces with proper URL encoding
queries = [q.replace(' ', '%20') for q in user_queries]
# Print each query with its index (the index is used in Step 2.1)
for i, q in enumerate(user_queries):
    print('Query {}: {}'.format(i, q))

### Step 2: Compare the queries (optional)
*Run the following cell to get the number of papers for each query and indicators of how each pair of queries compares for the first 100 results.*

In [None]:
compare = comp_mult_queries(queries, min_match=0.7, min_rbo=0.5, p_value=0.97)

**Interpretation of the RBO value:**<br>

*Rank Biased Overlap (RBO) is a measure of similarity between two ranked lists that do not necessarily share the same items. `RBO=0` means no overlap, `RBO=1` means perfect overlap between the two lists. The `p_value` controls how much the first `n` results contribute to the RBO value: the smaller the `p_value`, the more weight the top-ranked results carry. We use a `p_value` of 0.97, which means that the top 10 results contribute roughly 50% to the RBO measure.*
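
To make the role of the `p_value` concrete, the sketch below computes the share of the total RBO weight carried by the first `d` ranks, using the closed-form expression from the original RBO paper (Webber, Moffat and Zobel, 2010), together with a simple truncated RBO score. Both functions are illustrative and use hypothetical names; how `comp_mult_queries` computes its RBO value internally is not shown in this notebook and may differ.

```python
import math

def rbo_top_weight(p, d):
    """Share of the total RBO weight that falls on the first d ranks."""
    tail = sum(p**i / i for i in range(1, d))
    return 1 - p**(d - 1) + ((1 - p) / p) * d * (math.log(1 / (1 - p)) - tail)

def rbo_truncated(list_a, list_b, p):
    """Truncated RBO: (1 - p) * sum over depths d of p**(d-1) * overlap(d) / d."""
    depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        agreement = len(seen_a & seen_b) / d  # fraction of shared items at depth d
        score += p**(d - 1) * agreement
    return (1 - p) * score

print(round(rbo_top_weight(0.97, 10), 2))  # share of weight on the top 10 results
```

Because the truncated score ignores everything beyond the evaluated depth, it is a lower bound on the full RBO value: two identical lists of length `k` score `1 - p**k` rather than exactly 1.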

#### 2.1: Discard redundant queries
Based on the results of the comparison, you can discard queries that return redundant results. In the first line of the following cell, enter the numbers of the queries you want to discard. Then run the cell to delete the chosen queries from the list `queries`.

In [None]:
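delete_queries = [2,5,8]
# Pop from the highest index down so earlier deletions do not shift later indices
for d in sorted(set(delete_queries), reverse=True):
    queries.pop(d)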

### Step 3: Process the queries
*Run the following cell to process your queries and export the results.<br>
The results of each query are stored in separate Excel, JSON, and BibTeX files named `MySemSchol_Search_[query i]_[current date_time].*`.*

In [None]:
# For each query, set the number of papers to download.
limit = 100
offset = 0

# Define the fields to be returned by the API request (can be changed, see API documentation)
FIELDS = 'title,url,authors,abstract,citationCount,externalIds,isOpenAccess,year,fieldsOfStudy' 

# Store query parameters in a dictionary
params = {
    'offset': str(offset),
    'limit': str(limit),
    'fields': FIELDS
}

# For each query, make the request and fetch the data
proc_mult_queries(queries, params)