# Google Scholar scraper

Getting publication information from Google Scholar is difficult because
search results only show 10 publications per page, and results cannot
be exported for further analysis.

This notebook uses the Selenium Python bindings to open Google Chrome,
navigate to Google Scholar, search a topic, pull citation information
from each result, navigate to the next page, and pull more results.
Finally, it saves the assembled citation information into a Pandas
DataFrame for export as an ASCII text file.

In [1]:
import os, re, time
import numpy as np
import pandas as pd
from selenium import webdriver
import scholar_scrape_methods as ssm

## Specify a topic to search on Google Scholar

In [2]:
topic = 'PEDOT:PSS'

## Open browser, go to Scholar, and search topic

In [3]:
chromedriver_path = os.path.join(os.getcwd(), 'chromedriver.exe')

browser = webdriver.Chrome(executable_path=chromedriver_path)

# go to the url, query a topic, and submit the query
browser.get('https://scholar.google.com/')
query = browser.find_element_by_id('gs_hdr_tsi')
query.send_keys(topic)
query.submit()
time.sleep(2)
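
The fixed `time.sleep(2)` above either waits longer than needed or not long enough on a slow connection. A minimal sketch of a polling alternative follows. `wait_until` is a hypothetical helper, not part of Selenium or of `scholar_scrape_methods`; Selenium also ships its own `WebDriverWait` class in `selenium.webdriver.support.ui` for the same purpose.

```python
import time


def wait_until(predicate, timeout=10.0, poll=0.25):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success. (Hypothetical helper for illustration.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within {} s'.format(timeout))
        time.sleep(poll)


# Possible use in place of time.sleep(2), assuming the results page URL
# contains '/scholar?' (an assumption about Google Scholar's URL layout):
# wait_until(lambda: '/scholar?' in browser.current_url)
```

The predicate is an ordinary callable, so the same helper can wait on any condition that can be checked through the `browser` object.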

## Read initial search results page

In [6]:
try:
    total_search_results = ssm.get_total_search_results(browser)
    max_pages = int(np.floor(total_search_results / 10))
    print('Total search results: {}'.format(total_search_results))
    print('Maximum results pages: {}'.format(max_pages))
except IndexError:
    print('Google has created a reCAPTCHA test for you.')
    print('Please complete the test in the browser window')
    print('and rerun this code and everything below it.')

Total search results: 103000
Maximum results pages: 10300
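
The source of `ssm.get_total_search_results` is not shown in this notebook, so the following is only a sketch of how such a count could be parsed. It assumes the results page carries a summary string of the form `About 103,000 results (0.05 sec)`. Both function names are hypothetical. The page count here uses a ceiling rather than the floor used above, so that a final, partially filled page is still counted.

```python
import math
import re


def parse_result_count(stats_text):
    """Extract the integer result count from a summary string.

    Assumes text such as 'About 103,000 results (0.05 sec)'.
    Raises ValueError if no count is found.
    """
    match = re.search(r'(\d[\d,]*)\s+results?', stats_text)
    if match is None:
        raise ValueError('no result count in {!r}'.format(stats_text))
    return int(match.group(1).replace(',', ''))


def pages_needed(total_results, per_page=10):
    """Number of pages required to show every result."""
    return math.ceil(total_results / per_page)


total = parse_result_count('About 103,000 results (0.05 sec)')
print(total, pages_needed(total))  # 103000 10300
print(pages_needed(105))           # 11, where floor would give 10
```

Raising `ValueError` with the offending text makes a failed parse easier to tell apart from other indexing errors.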


## Collect citation information from each publication

In [9]:
all_citations = []

for page_idx in range(10):
    print('reading page {}...'.format(1+page_idx))
    try:
        time.sleep(1)
        page_citations = ssm.get_citations(browser)
        all_citations += page_citations
        time.sleep(1)
        ssm.move_to_next_page(browser)
    except IndexError:
        time.sleep(3)
    time.sleep(1)

print('\nAcquired {} total citations'.format(len(all_citations)))

reading page 1...
acquired 10 citations
reading page 2...
acquired 10 citations
reading page 3...
acquired 10 citations
reading page 4...
acquired 10 citations
reading page 5...
acquired 10 citations
reading page 6...
acquired 10 citations
reading page 7...
reading page 8...
reading page 9...
reading page 10...

Acquired 60 total citations
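
In the output above, pages 7 through 10 report no citations, which is consistent with the `IndexError` branch being taken silently on each of them (possibly because of a reCAPTCHA, though the notebook does not confirm this). A minimal sketch of a loop that reports failed pages and stops after repeated failures is given below. `collect_pages` is a hypothetical function; its `read_page` and `go_next` arguments stand in for `ssm.get_citations(browser)` and `ssm.move_to_next_page(browser)`.

```python
def collect_pages(read_page, go_next, max_pages, max_failures=2):
    """Collect items page by page, stopping early after repeated failures.

    read_page: callable returning a list of items for the current page.
    go_next:   callable that advances to the next page.
    Stops once `max_failures` consecutive pages raise IndexError.
    """
    collected = []
    failures = 0
    for page_idx in range(max_pages):
        try:
            collected.extend(read_page())
            go_next()
            failures = 0  # a successful page resets the counter
        except IndexError:
            failures += 1
            print('page {} could not be read'.format(page_idx + 1))
            if failures >= max_failures:
                break
    return collected


# Possible use with the objects defined in this notebook:
# all_citations = collect_pages(lambda: ssm.get_citations(browser),
#                               lambda: ssm.move_to_next_page(browser),
#                               max_pages=10)
```

Passing the page actions in as callables keeps the loop logic separate from the browser, so it can be exercised without opening Chrome.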


## Assemble results into Pandas DataFrame

In [15]:
df = pd.DataFrame([c.split('"') for c in all_citations],
                  columns=['author', 'title', 'pub'])
df['title'] = [t[1:-1] if t.startswith(' ') else t[:-1] for t in df['title']]
df['author'] = [a[:-1] for a in df['author']]
df['pub'] = [p[:-1] for p in df['pub']]
# raw string avoids invalid-escape warnings for \( and \d
df['year'] = [int(re.split(r'\(+(\d+)\)', p)[-2]) for p in df['pub']]
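
The cell above relies on each citation containing exactly one quoted title and a parenthesized year; a citation without a year makes `re.split(...)[-2]` raise `IndexError`. A sketch of a more defensive per-citation parser is shown below. `parse_citation` is a hypothetical function, and the sample string is a made-up placeholder in the same `Authors. "Title." Publication (YYYY): pages.` layout, not a real search result.

```python
import re

YEAR_PATTERN = re.compile(r'\((\d{4})\)')


def parse_citation(citation):
    """Split 'Authors. "Title." Publication (YYYY): pages.' into a dict.

    The year is None when no parenthesized four-digit number is present.
    """
    parts = citation.split('"')
    if len(parts) != 3:
        raise ValueError('expected exactly one quoted title: {!r}'.format(citation))
    author, title, pub = (part.strip() for part in parts)
    years = YEAR_PATTERN.findall(pub)
    return {
        'author': author,
        'title': title.rstrip('.'),
        'pub': pub.rstrip('.'),
        'year': int(years[-1]) if years else None,
    }


# Placeholder citation for illustration only
sample = 'Doe, Jane, and John Roe. "A placeholder title." Journal of Examples 1, no. 2 (2020): 3-4.'
print(parse_citation(sample)['year'])  # 2020
```

A list of such dicts can be passed straight to `pd.DataFrame(...)` to get the same four columns.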

In [16]:
df

|   | author | title | pub | year |
|---|---|---|---|---|
| 0 | Lang, Udo, Elisabeth Müller, Nicola Naujoks, a... | Microscopical investigations of PEDOT: PSS thi... | Advanced Functional Materials 19, no. 8 (2009... | 2009 |
| 1 | Hong, Wenjing, Yuxi Xu, Gewu Lu, Chun Li, and ... | Transparent graphene/PEDOT–PSS composite films... | Electrochemistry Communications 10, no. 10 (2... | 2008 |
| 2 | Nardes, A. Mantovani, Martijn Kemerink, René A... | Microscopic understanding of the anisotropic c... | Advanced Materials 19, no. 9 (2007): 1196-1200 | 2007 |
| 3 | Hwang, Jaehyung, Fabrice Amy, and Antoine Kahn. | Spectroscopic study on sputtered PEDOT· PSS: R... | Organic electronics 7, no. 5 (2006): 387-396 | 2006 |
| 4 | Nardes, A. Mantovani, M. Kemerink, M. M. De Ko... | Conductivity, work function, and environmental... | Organic electronics 9, no. 5 (2008): 727-734 | 2008 |
| 5 | Vitoratos, E., S. Sakkopoulos, Evangelos Dalas... | Thermal degradation mechanisms of PEDOT: PSS | Organic Electronics 10, no. 1 (2009): 61-66 | 2009 |
| 6 | Jönsson, S. K. M., Jonas Birgerson, Xavier Cri... | The effects of solvents on the morphology and ... | Synthetic metals 139, no. 1 (2003): 1-10 | 2003 |
| 7 | Lang, Udo, Nicola Naujoks, and Jurg Dual. | Mechanical characterization of PEDOT: PSS thin... | Synthetic Metals 159, no. 5-6 (2009): 473-479 | 2009 |
| 8 | Timpanaro, S., Martijn Kemerink, F. J. Touwsla... | Morphology and conductivity of PEDOT/PSS films... | Chemical Physics Letters 394, no. 4-6 (2004):... | 2004 |
| 9 | Greczynski, Grzegorz, Th Kugler, and W. R. Sal... | Characterization of the PEDOT-PSS system by me... | Thin Solid Films 354, no. 1-2 (1999): 129-135 | 1999 |


## Export dataframe to file

In [17]:
ssm.results_to_file(df, topic)
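
How `ssm.results_to_file` names its output file is not shown here. One thing to watch for is that a topic such as `'PEDOT:PSS'` contains a colon, which is not allowed in Windows filenames. A minimal sketch of a sanitizing helper follows; `topic_to_filename` and the `.csv` default are assumptions for illustration.

```python
import re


def topic_to_filename(topic, extension='.csv'):
    """Turn a search topic into a filename safe on common filesystems.

    Runs of characters other than letters, digits, '.', '_' and '-'
    are replaced by a single underscore.
    """
    safe = re.sub(r'[^A-Za-z0-9._-]+', '_', topic).strip('_')
    return (safe or 'results') + extension


print(topic_to_filename('PEDOT:PSS'))  # PEDOT_PSS.csv
```

The DataFrame could then be written with, for example, `df.to_csv(topic_to_filename(topic), index=False)`.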

In [18]:
'''
elems = browser.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
    print(elem.get_attribute("id"))
    print(elem.
    '''


In [19]:
# get list of unique div classes on the page
#sorted(list(set([a.get_attribute('class') for a in browser.find_elements_by_tag_name('div')])))