# Literature Search with Google Scholar 

*Last updated: 01/21/2022*

This notebook allows you to query [Google Scholar](https://scholar.google.com/). 

Since Google doesn't like bots, it might block this application occasionally. When this happens, you can either try using a different IP address (see step 1.1 below) or go grab a coffee and wait it out. In any case, don't overdo it. If you're in an early phase of your research, consider using Google Scholar directly for experimenting with different search queries. We also limit each search query to return only the first 100 results (sorted by relevance).

The result of your query is stored in three formats: an Excel file and a JSON file with the full publication data, and a `.bib` file with only the BibTeX data for each publication (if available). You will need the JSON files if you later want to merge results using the notebook `Merge_Inspect_Select.ipynb`. The files are named `MyGS_Search_[current date_time].xlsx/.json/.bib`. You will find them in the folder `results` in the left navigation pane.
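
As an illustration of the naming scheme, the following minimal sketch builds the three output paths from a timestamp, using the same timestamp format as step 5. The helper `build_output_paths` and its `now` parameter are hypothetical and not part of this notebook.

```python
from datetime import datetime
from os.path import join


def build_output_paths(out_dir="../results", now=None):
    """Return the Excel, JSON and BibTeX paths for one search run.

    Hypothetical helper: mirrors the naming used in step 5 of this notebook.
    """
    now = now or datetime.now()
    time_stamp = now.strftime("%Y-%m-%d-%H%M%S")
    stem = "MyGS_Search_" + time_stamp
    # One path per file extension, all sharing the same timestamped stem
    return {ext: join(out_dir, stem + ext) for ext in (".xlsx", ".json", ".bib")}


# A fixed date is passed here only to make the example reproducible
paths = build_output_paths(now=datetime(2022, 1, 21, 14, 30, 5))
for ext, path in paths.items():
    print(ext, "->", path)
```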

## Searching with Google Scholar

While Google Scholar does support certain search operators, it is not possible to formulate complex queries by combining and grouping multiple Boolean operators. The best way to formulate your query is to use Google Scholar's Advanced Search (AS) option. Go to the end of this notebook to learn how to transfer a query constructed with AS to this application.


## Working with Jupyter notebooks

In case you are not familiar with Jupyter notebooks, this is how to go about it: To execute a piece of code, click inside a cell (the ones with `[]` to the left) and press Shift+Enter. Wait until the cell is done--that's when the `*` in `[]` has turned into a number--and move on to the next cell.

If you get incomprehensible error messages or the notebook gets stuck, choose "Restart & Clear Output" in the "Kernel" dropdown menu above and start afresh.
___
**Please help us to improve this tool by [emailing us](mailto:ai4ki.dev@gmail.com?subject=ai4ki-tools:%20Google%20Scholar%20Search) your update ideas or error reports.**
___

## Step 1: Import some libraries
*You have to execute the following cell only once at the beginning of a session!*<br>
*Note: If it throws an error, simply execute the cell again!*

In [None]:
import json
import os  # needed in step 5 to create the results folder
import pandas as pd
import sys
sys.path.append('../')

from ai4ki_utils.gs_utils import format_gs_data
from datetime import datetime
from os.path import join
from scholarly import scholarly, ProxyGenerator
from tqdm import tqdm

### 1.1: Use a proxy server (optional)
*If you receive the error* `MaxTriesExceededException` *after executing step 2, change `USE_PROXY` to `True`, run the cell, and try your query again.*

In [None]:
# Set the following variable to 'True' if you want to use a proxy server
USE_PROXY = False

if USE_PROXY:
    print('...connecting to proxy server--please be patient')
    pg = ProxyGenerator()
    pg.FreeProxies()
    scholarly.use_proxy(pg)

## Step 2: Formulate search query
### 2.1: Select publication year range
*In the cell below, change `start_year` and `end_year` to the desired values. Set them to `None`, if you don't want to limit your search.* 

In [None]:
start_year = None
end_year = None
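
Before sending a query, it can help to check that the year range is consistent. The following is a minimal sketch, not part of this notebook; the helper name `check_year_range` is hypothetical.

```python
def check_year_range(start_year, end_year):
    """Return True if the range is usable; None means 'no limit' on that side."""
    for year in (start_year, end_year):
        # Each bound must be either None or an integer year
        if year is not None and not isinstance(year, int):
            return False
    if start_year is not None and end_year is not None:
        # With both bounds set, the start must not lie after the end
        return start_year <= end_year
    return True


print(check_year_range(2015, 2020))   # True
print(check_year_range(2020, 2015))   # False
print(check_year_range(None, 2020))   # True
```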

### 2.2: Enter query string
*Execute the cell and enter your query string in the input field below.*<br>
*Be patient, as it might take some time to connect to the proxy server.*

In [None]:
query = input('Enter query: ') 
if len(query) < 256:  
    print('Accepted query: ', query)

    # Define search query generator object
    search_query = scholarly.search_pubs(query, year_low=start_year, year_high=end_year)
else:
    print('ERROR: Your search string is too long--has {} chars, must have fewer than 256!'.format(len(query)))
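
Note that when the query is too long, the cell above only prints a message and leaves `search_query` undefined, so step 3 would then fail. As a minimal sketch of a stricter check, the hypothetical helper `validate_query` below (not part of this notebook) raises an error instead, using the same 256-character limit.

```python
MAX_QUERY_LEN = 256  # same limit as in the cell above


def validate_query(query):
    """Return the stripped query, or raise ValueError if it is unusable."""
    query = query.strip()
    if not query:
        raise ValueError("Query is empty.")
    if len(query) >= MAX_QUERY_LEN:
        raise ValueError(
            "Query has {} chars, must have fewer than {}.".format(len(query), MAX_QUERY_LEN)
        )
    return query


print(validate_query('  "bayesian * learning" robotics  '))

try:
    validate_query("x" * 300)
except ValueError as err:
    print("Rejected:", err)
```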

## Step 3: Post query

In [None]:
# Set maximum number of results (don't change this!)
MAX_PUBS = 100

pub_counter = 0 
results = []
for pub in tqdm(search_query):
    pub_counter += 1
    results.append(pub)
    if pub_counter == MAX_PUBS: break
        
print('Fetched {n_pubs} publications from Google Scholar'.format(n_pubs=pub_counter))

# Reformat data for pretty Excel dump
print('==> Reformatting publication data:')
results_form = format_gs_data(pub_counter, results)
df = pd.DataFrame(results_form)
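
The counting loop above caps the number of fetched results at `MAX_PUBS`. An alternative way to express the same cap is `itertools.islice`, sketched below. The generator `fake_search` is a hypothetical stand-in for the object returned by `scholarly.search_pubs`, so the example runs without contacting Google Scholar.

```python
from itertools import islice

MAX_PUBS = 100


def fake_search():
    """Hypothetical stand-in for the (potentially very long) search generator."""
    n = 0
    while True:
        n += 1
        yield {"title": "Publication {}".format(n)}


# islice stops after MAX_PUBS items (or earlier if the generator is exhausted)
results = list(islice(fake_search(), MAX_PUBS))
pub_counter = len(results)
print('Fetched {n_pubs} publications'.format(n_pubs=pub_counter))
```

In the notebook itself, the same idea would presumably read `tqdm(islice(search_query, MAX_PUBS), total=MAX_PUBS)`, which also gives the progress bar a known length.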

## Step 4: Preview results (optional)

In [None]:
# Preview results
n_rows = int(input('Enter number of rows to display: ')) 
print('==> Displaying first {n} rows: '.format(n=n_rows))
df.head(n_rows)

## Step 5: Write results to Excel, JSON, and BibTeX files

In [None]:
XCL_XTNSN, JSN_XTNSN, BBT_XTNSN = '.xlsx', '.json', '.bib'
time_stamp = datetime.now().strftime("%Y-%m-%d-%H%M%S")
filename = "MyGS_Search_" + str(time_stamp)
out_dir = '../results'
if not os.path.exists(out_dir):
    os.makedirs(out_dir)

# Export results to Excel file
df.to_excel(join(out_dir, filename+XCL_XTNSN), engine='openpyxl', index=False)

# Export BibTeX data to .bib file
bibtex_data = '\n\n'.join([item['BibTex'] for item in results_form if item['BibTex'] is not None])
with open(join(out_dir, filename+BBT_XTNSN), 'w', encoding='utf-8') as f:
    f.write(bibtex_data)
    
# Export results to JSON file
with open(join(out_dir, filename+JSN_XTNSN), 'w', encoding='utf-8') as f:
    json.dump(results_form, f)
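
To illustrate how the exported JSON file can be read back later, here is a minimal round-trip sketch. It writes to a temporary folder rather than `../results`, and the sample records are made up: only the `BibTex` key is taken from the cell above, while the `Title` key is an assumption about the record layout.

```python
import json
import tempfile
from os.path import join

# Made-up sample records; real records come from format_gs_data
records = [
    {"Title": "Example paper A", "BibTex": "@article{a, title={Example paper A}}"},
    {"Title": "Example paper B", "BibTex": None},
]

with tempfile.TemporaryDirectory() as out_dir:
    path = join(out_dir, "MyGS_Search_example.json")
    # Write the records the same way as the export cell does
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    # Read them back, as a later merge step would
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# Same filter as in the .bib export: skip records without BibTeX data
bibtex_data = "\n\n".join(item["BibTex"] for item in loaded if item["BibTex"] is not None)
print(len(loaded), "records loaded")
print(bibtex_data)
```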

## Step 6: Prepare next query
*Execute the following cell if you want to start a new query. Note that, after executing this cell, the results of your previous query will be lost!*

In [None]:
del df, pub_counter, query, results, results_form, search_query

## Brief guide to advanced search with Google Scholar

Go to [Google Scholar](https://scholar.google.com) and select "Advanced Search" in the top-left menu. Construct your query by filling out the appropriate fields and hit search. In the search bar at the top of the results page you find the search string that corresponds to your query. Copy this string and use it as input in step 2 above. 

Note that you can use the asterisk (\*) as a placeholder for unknown or wildcard terms, for example `"bayesian * learning"` (returns papers with, for example, 'bayesian deep learning'). However, don't use the asterisk to find word variations; GS search does this automatically.

In case you want to construct your query manually, be aware that Google Scholar doesn't recognize the AND operator. Use a space or '+' instead!
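
As a minimal sketch of manual query construction, the hypothetical helper `build_query` below (not part of this notebook) joins terms with spaces, which act as the implicit AND, and wraps multi-word terms in double quotes so that they are searched as phrases.

```python
def build_query(terms):
    """Join search terms with spaces; wrap multi-word terms in double quotes."""
    parts = []
    for term in terms:
        term = term.strip()
        if not term:
            continue  # skip empty entries
        if " " in term and not term.startswith('"'):
            term = '"{}"'.format(term)  # treat multi-word terms as phrases
        parts.append(term)
    return " ".join(parts)


query = build_query(["bayesian * learning", "uncertainty", "robotics"])
print(query)  # "bayesian * learning" uncertainty robotics
```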

---
*Note: `scholarly` has the option to use a custom Google Scholar URL (method: `search_pubs_custom_url(url: str)`). We plan to implement this in the future so that the above detour won't be necessary.*
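
For orientation, the sketch below shows how such a custom URL path might be assembled with the standard library. It does not call `scholarly`. The helper `build_scholar_path` is hypothetical, and the parameter names `q`, `as_ylo`, and `as_yhi` are assumptions based on the URLs that Google Scholar shows in the browser after an advanced search; check them against a URL copied from your own search before relying on them.

```python
from urllib.parse import urlencode


def build_scholar_path(query, start_year=None, end_year=None):
    """Assemble a '/scholar?...' path from a query and an optional year range."""
    params = {"q": query}
    if start_year is not None:
        params["as_ylo"] = start_year  # assumed name of the lower year bound
    if end_year is not None:
        params["as_yhi"] = end_year    # assumed name of the upper year bound
    # urlencode escapes special characters and encodes spaces as '+'
    return "/scholar?" + urlencode(params)


print(build_scholar_path("bayesian deep learning", 2015, 2020))
# /scholar?q=bayesian+deep+learning&as_ylo=2015&as_yhi=2020
```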