# Scraping Google Scholar with Selenium and Python

Source: https://medium.com/ml-book/web-scraping-using-selenium-python-3be7b8762747

**Steps:**

1) Install Selenium <br> 
    conda install -c conda-forge selenium <br>
    depending on setup; alternative: pip install selenium
    
2) Download GeckoDriver <br>
    In this case: suited for Firefox - working on a Mac <br>
    https://github.com/mozilla/geckodriver/releases

In [111]:
from selenium import webdriver 

driver = webdriver.Firefox(executable_path= '/Users/sebastian/Downloads/geckodriver')

## First problem

How do we access all entries on Google Scholar? <br>
- <del>there is no API</del>
- empty search does not work
- empty space does not work

**Attempts: <br>**
- Advanced search using * as a wildcard: only works in combination with a word
- The option to limit the search by year only appears after a first search. Setting a timeframe and then removing the search term returns results that are no longer specific to the term, but the result counts indicate that something is wrong: a wider timeframe can report fewer results than a narrower one. We may only be getting a subset of results that still include the removed search term, but there is no indication of that. This approach should therefore be reviewed rigorously.
    - 2001 till 2024 = 1 040 000 Results (0,04 Sec.)
    - 2005 till 2024 = 1 150 000 Results (0,03 Sec.)
    
    - 2023 till 2023 = 6 960 000 Results (0,03 Sec.)
    - 2023 till 2024 = 3 030 000 Results (0,03 Sec.)
    
**Conclusion: <br>**
It seems that a result count between 1 million and 1.5 million acts as some sort of barrier. <br>
If the search is too unspecific, the result count is cut off somewhere in this range. Choosing smaller timeframes makes the results appear more reliable, even though there is little to verify that. Searching with the wildcard "*" works as long as a further constraint, such as a timeframe, is set. <br>

**Possible solution: <br>**
Search systematically by selecting only a subset of years in each search.
If we are interested in very old documents, we could work with bigger timeframes. In general, though, the number of results per year seems to grow the closer we get to the current year, which matches expectations.

**Exploring alternatives: <br>**
Using the Google Scholar API seems to be the better choice. However, getting all results without a search term persists as a challenge.
Additionally, the "Free Plan" provides only 100 searches per month. Even with 3 team members, 300 searches per month might be too few.

Questions: 
- How much of the data can we get with 1 search? 

In [36]:
driver.get('https://scholar.google.com/scholar?q=*&hl=de&as_sdt=0%2C5&as_ylo=2000&as_yhi=2000')

**Constraints: <br>**
1) Keep the interaction with the website as easy as possible from a coding point of view
2) Automate the process as much as possible

In order to satisfy both constraints we can use the structure of Google Scholar links.
Searching for the wildcard "*" and setting a timeframe once provides us with a working link. We can then simply change the years in that link and get another working search out of it.

Therefore creating those searches with a simple loop iterating from year x to year y should do the job. 
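
The loop described above could look roughly like the following sketch. It only builds the links and does not touch the browser. The function name `build_year_urls` and its parameters are assumptions made for illustration; the query parameters (`q`, `hl`, `as_sdt`, `as_ylo`, `as_yhi`) are copied from the link used in the cell above.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/scholar"

def build_year_urls(first_year, last_year, query="*", lang="de"):
    """Return one Google Scholar search URL per year in [first_year, last_year]."""
    urls = []
    for year in range(first_year, last_year + 1):
        params = {
            "q": query,        # wildcard search term
            "hl": lang,        # interface language, as in the original link
            "as_sdt": "0,5",   # copied unchanged from the original link
            "as_ylo": year,    # lower bound of the timeframe
            "as_yhi": year,    # upper bound of the timeframe
        }
        # urlencode percent-encodes "*" and "," in the parameter values
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls

year_urls = build_year_urls(2000, 2002)
for url in year_urls:
    print(url)
```

Each generated link can then be passed to `driver.get(...)` in turn. Wider ranges per search would only need `as_ylo` and `as_yhi` to differ.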

**Steps: <br>**

3) Start building a scraper suited for Google Scholar results. 

In [3]:
container_main_body = driver.find_elements_by_id('gs_res_ccl_mid')
len(container_main_body)

1

In [4]:
results = driver.find_elements_by_class_name('gs_ri')
len(results)

10

**Google Scholar Body: <br>**

The main content body sits in a container with the id "gs_res_ccl_mid". Inside it, each search result is wrapped in a sub-container with the classes "gs_r gs_or gs_scl". A sub-container can hold several child elements, such as a direct link to a PDF file. We focus on the child with the class "gs_ri", because it holds the core entry we are interested in.

**Questions: <br>**
- Is the result unique for each user? Is there a seed for the search or are the constraints the only factor?

In [9]:
titles = driver.find_elements_by_class_name('gs_rt')

for title in titles:
    title.click()
    print("Opened a page!")
    

Opened a page!
Opened a page!


StaleElementReferenceException: Message: The element with the reference 3894b0a1-cd45-48c2-8347-2be007cb4b13 is not known in the current browsing context


**Error <br>**
Iterating over the results as done above leads to a StaleElementReferenceException. The problem is that clicking a title navigates away from the results page, so the element references collected beforehand are no longer valid. To avoid that, we need to save the URLs before we interact with the page.

**ChatGPT 3.5 <br>**
ChatGPT proposes a slightly more advanced approach. It uses CSS selectors (find_elements_by_css_selector) instead of the by_id and by_class_name lookups used above.


    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import time

    # Initialize the WebDriver
    driver = webdriver.Chrome()

    # Navigate to Google Scholar
    driver.get("https://scholar.google.com")

    # Locate the search bar and input your query
    search_bar = driver.find_element_by_name("q")
    search_query = "Your search query here"
    search_bar.send_keys(search_query)
    search_bar.send_keys(Keys.RETURN)

    # Wait for the results to load
    time.sleep(2)

    # Collect the URLs of the search results
    result_links = driver.find_elements_by_css_selector('.gs_rt a')
    result_urls = [link.get_attribute('href') for link in result_links]

    # Iterate over the URLs and scrape metadata from each page
    for url in result_urls:
        driver.execute_script("window.open('{}', '_blank');".format(url))
        driver.switch_to.window(driver.window_handles[-1])
    
        # Here you can scrape metadata from the individual page
        # Example:
        meta_tags = driver.find_elements_by_css_selector('meta')
        for tag in meta_tags:
            print(tag.get_attribute('name'), tag.get_attribute('content'))
    
        # Close the tab after scraping
        driver.close()
    
        # Switch back to the main window/tab
        driver.switch_to.window(driver.window_handles[0])

    # Close the WebDriver
    driver.quit()





A few adjustments seem necessary. We want to wait inside the loop over the URLs, after each page is opened, instead of once upfront. We do not need the query part. After tuning the timer to give each page more time to load before we read the metadata, we are able to read the wanted information from 9 out of 10 pages.

Output for the one page where the information is missing:

--------------------------------------
 text/html; charset=UTF-8
 IE=Edge
robots noindex,nofollow
viewport width=device-width,initial-scale=1
 375
--------------------------------------

After checking, we can see that we encountered a captcha. Before we bring order to the scraped material, for example by saving the name of each meta tag in combination with its content, we want to solve this issue, because it is likely to occur more often.
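
Saving each meta tag name together with its content could be sketched as follows. The helper `meta_pairs_to_dict` is a hypothetical name, not part of Selenium. It assumes the pairs were read with `get_attribute('name')` and `get_attribute('content')` as in the loop above, and that tags without a name can be dropped. The sample pairs are taken from the captcha page output shown above.

```python
def meta_pairs_to_dict(pairs):
    """Turn (name, content) pairs into a dict, skipping tags without a name."""
    metadata = {}
    for name, content in pairs:
        if not name:
            # Tags without a name attribute show up with an empty name
            continue
        # A name can occur more than once on a page, so collect values in a list
        metadata.setdefault(name, []).append(content)
    return metadata

# Sample pairs copied from the captcha page output above
sample_pairs = [
    ("", "text/html; charset=UTF-8"),
    ("", "IE=Edge"),
    ("robots", "noindex,nofollow"),
    ("viewport", "width=device-width,initial-scale=1"),
]

page_metadata = meta_pairs_to_dict(sample_pairs)
print(page_metadata)
```

With a live page, the pairs could be built from the existing `meta_tags` list, for example `[(t.get_attribute('name'), t.get_attribute('content')) for t in meta_tags]`.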

In [40]:
import time

result_links = driver.find_elements_by_css_selector('.gs_rt a')
result_urls = [link.get_attribute('href') for link in result_links]

# Iterate over the URLs and scrape metadata from each page
for url in result_urls:
    driver.execute_script("window.open('{}', '_blank');".format(url))
    driver.switch_to.window(driver.window_handles[-1])
    
    time.sleep(4)
    
    # Here you can scrape metadata from the individual page
    # Example:
    meta_tags = driver.find_elements_by_css_selector('meta')
    for tag in meta_tags:
        print(tag.get_attribute('name'), tag.get_attribute('content'))
    
    
    
    driver.close()
    
    # Switch back to the main window/tab
    driver.switch_to.window(driver.window_handles[0])
    
    print("--------------------------------------")


 
 IE=edge
viewport width=device-width, initial-scale=1
applicable-device pc,mobile
access No
doi 10.1007/978-3-658-28565-4
title Die grenzenlose Unternehmung
description Die Neuauflage dieses Standardlehrbuches bietet ausgewählte theoretische Erklärungsansätze für die Entstehung neuer Organisationskonzepte vor dem Hintergrund von Digitalisierung und neuen Technologien und zeigt auf, welche Implikationen sich daraus für die Menschen in Organisationen ergeben.
citation_springer_api_url https://api.springernature.com/xmldata/jats?q=bookdoi:10.1007/978-3-658-28565-4&api_key=
 https://link.springer.com/book/10.1007/978-3-658-28565-4
 book
 SpringerLink
 Die grenzenlose Unternehmung
 https://media.springernature.com/w153/springer-static/cover/book/978-3-658-28565-4.jpg
 As0hBNJ8h++fNYlkq8cTye2qDLyom8NddByiVytXGGD0YVE+2CEuTCpqXMDxdhOMILKoaiaYifwEvCRlJ/9GcQ8AAAB8eyJvcmlnaW4iOiJodHRwczovL2RvdWJsZWNsaWNrLm5ldDo0NDMiLCJmZWF0dXJlIjoiV2ViVmlld1hSZXF1ZXN0ZWRXaXRoRGVwcmVjYXRpb24iLCJleHBpcnkiOjE3MTk1

**Solve the captcha problem: <br>**

A short search turns up five techniques that require varying effort.

1) proxies 
2) limit requests
3) headless browsers
4) randomized user agent
5) human emulation
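
Of these, limiting requests is the cheapest to try. The following is a minimal sketch of a randomized pause between page loads. The helper names `random_delay` and `polite_pause` and the default bounds of 2 to 6 seconds are assumptions for illustration, not tested values, and there is no evidence here that such a pause is enough to avoid the captcha.

```python
import random
import time

def random_delay(min_seconds=2.0, max_seconds=6.0, rng=random):
    """Pick a pause length uniformly between min_seconds and max_seconds."""
    if min_seconds > max_seconds:
        raise ValueError("min_seconds must not exceed max_seconds")
    return rng.uniform(min_seconds, max_seconds)

def polite_pause(min_seconds=2.0, max_seconds=6.0):
    """Sleep for a randomized interval and return how long the pause was."""
    delay = random_delay(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

In the scraping loop, `polite_pause()` could replace the fixed `time.sleep(4)`, so that the requests are not spaced at perfectly regular intervals.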

<span style="color:red">After discussing the structure of the data with the rest of the team, it became obvious that before we continue here, we should first scrape the information from the main page.</span>

In [44]:
import csv
import time
from selenium.common.exceptions import NoSuchElementException, ElementNotInteractableException


driver.get('https://scholar.google.com/scholar?start=0&q=blockhaus+1700+gr%C3%BCn+kuh&hl=de&as_sdt=0,5')

done = False

def navigate_to_next_page(driver):
    global done
    try:
        # Find the <b> tag containing the text "Weiter"
        next_button = driver.find_element_by_xpath("//b[contains(text(), 'Weiter')]")

        # Click on the <b> tag to navigate to the next page
        next_button.click()
    except NoSuchElementException:
        done = True
        print("No more next page button found. End of search results.")
    except ElementNotInteractableException:
        done = True
        print("Next page button is not interactable. End of search results.")

def scrape_google_scholar_results(url):
    # Set up the Selenium webdriver
    #driver = webdriver.Firefox()
    #driver.get(url)

    # Wait for the page to load
    driver.implicitly_wait(10)

    results = []

    while not done:
        # Find all the search result elements
        search_results = driver.find_elements_by_css_selector('.gs_ri')

        for result in search_results:
            time.sleep(0.5)
            
            title = result.find_element_by_css_selector('h3').text
            authors = result.find_element_by_css_selector('.gs_a').text.split('-')[0].strip()
            year = result.find_element_by_css_selector('.gs_a').text.split('-')[-1].strip()
            publisher = result.find_element_by_css_selector('.gs_a').text.split('-')[-2].strip()
            referenced = result.find_element_by_css_selector('.gs_fl a:nth-of-type(3)').text.split()[-1]
            #versions = result.find_element_by_css_selector('.gs_fl a:nth-of-type(5)').text.split()[-2]
            versions = "abc"


            results.append({
                'Title': title,
                'Authors': authors,
                'Year': year,
                'Publisher': publisher,
                'Referenced': referenced,
                'Versions': versions
            })
            

        # Close the webdriver
        #driver.quit()
        navigate_to_next_page(driver)

    return results

def print_results(results):
    for result in results:
        print("Title:", result['Title'])
        print("Authors:", result['Authors'])
        print("Year:", result['Year'])
        print("Publisher:", result['Publisher'])
        print("Referenced:", result['Referenced'])
        print("Versions:", result['Versions'])
        print("--------------------")

def export_to_csv(results, filename):
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['Title', 'Authors', 'Year', 'Publisher', 'Referenced', 'Versions']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()
        for result in results:
            writer.writerow(result)

# Example usage:
url = "YOUR_GOOGLE_SCHOLAR_SEARCH_URL_HERE"
results = scrape_google_scholar_results(url)

# Print the results
print_results(results)

# Export results to CSV
export_to_csv(results, 'Output_Scraping_Google_Scholar_Notbook/google_scholar_results.csv')

Next page button is not interactable. End of search results.
Title: Die touristische Erschließung des Fogarascher Gebirges in den Südkarpaten
Authors: KS Von Alfred, FM Schuster
Year: ceeol.com
Publisher: Zeitschrift für Siebenbürgische …, 2016
Referenced: Artikel
Versions: abc
--------------------
Title: [BUCH] Radensdorf: Geschichte (n) eines Spreewalddorfs
Authors: T Mietk
Year: books.google.com
Publisher: 2023
Referenced: Artikel
Versions: abc
--------------------
Title: [BUCH] Die Lüneburger Heide
Authors: R Linde
Year: books.google.com
Publisher: 2014
Referenced: 1
Versions: abc
--------------------
Title: [BUCH] Versuch einer Darstellung der deutschen Mundarten des ungrischen Berglandes mit Sprachproben und Erläuterungen
Authors: KJ Schröer
Year: books.google.com
Publisher: 1864
Referenced: 3
Versions: abc
--------------------
Title: [BUCH] 1914: Tagebuch
Authors: F Blümel
Year: books.google.com
Publisher: 2014
Referenced: Artikel
Versions: abc
--------------------
Title: [BUCH]

In [71]:
import csv
import time
import re
from selenium.common.exceptions import NoSuchElementException, ElementNotInteractableException


driver.get('https://scholar.google.com/scholar?start=0&q=blockhaus+1700+gr%C3%BCn+kuh&hl=de&as_sdt=0,5')

done = False

def navigate_to_next_page(driver):
    global done
    try:
        # Find the <b> tag containing the text "Weiter"
        next_button = driver.find_element_by_xpath("//b[contains(text(), 'Weiter')]")

        # Click on the <b> tag to navigate to the next page
        next_button.click()
    except NoSuchElementException:
        done = True
        print("No more next page button found. End of search results.")
    except ElementNotInteractableException:
        done = True
        print("Next page button is not interactable. End of search results.")
        
def extract_number(s):
    match = re.search(r'\d+', s)
    if match:
        return int(match.group())
    else:
        return 1

def scrape_google_scholar_results(url):
    # Set up the Selenium webdriver
    #driver = webdriver.Firefox()
    driver.get(url)

    # Wait for the page to load
    driver.implicitly_wait(10)

    results = []

    while not done:
        # Find all the search result elements
        search_results = driver.find_elements_by_css_selector('.gs_ri')

        for result in search_results:
            time.sleep(0.02)
            
            title = result.find_element_by_css_selector('h3').text
            subtitle = ''
            authors = ''
            year = ''
            publisher = ''
            referenced = ''
            versions = ''

            try:
                metadata = result.find_element_by_css_selector('.gs_a').text.split('-')
                authors = metadata[0].strip()
                year = metadata[-2].strip()
                
                if(year == authors):
                    year = None
                else:    
                    parts = year.split(",")
                    if len(parts) > 1:
                        year = parts[-1].strip()
                        subtitle = ", ".join(parts[:-1]).strip()
                    
                publisher = metadata[-1].strip()
            except NoSuchElementException:
                pass

            try:
                referenced = result.find_element_by_css_selector('.gs_fl a:nth-of-type(3)').text.split()[-1]
                if referenced == "Artikel":
                    referenced = 0
            except NoSuchElementException:
                pass

            try:
                # Check for the presence of the versions element, if it exists
                #versions = result.find_element_by_css_selector('.gs_fl a:nth-of-type(5)').text.split()[-2]
                versions = result.find_element_by_css_selector('.gs_fl a:nth-of-type(5)').text
                versions = extract_number(versions)
            except NoSuchElementException:
                pass

            results.append({
                'Title': title,
                'Subtitle': subtitle,
                'Authors': authors,
                'Year': year,
                'Publisher': publisher,
                'Referenced': referenced,
                'Versions': versions
            })
            

        # Close the webdriver
        #driver.quit()
        navigate_to_next_page(driver)

    return results

def print_results(results):
    for result in results:
        print("Title:", result['Title'])
        print("Subtitle:", result['Subtitle'])
        print("Authors:", result['Authors'])
        print("Year:", result['Year'])
        print("Publisher:", result['Publisher'])
        print("Referenced:", result['Referenced'])
        print("Versions:", result['Versions'])
        print("--------------------")

def export_to_csv(results, filename):
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['Title', 'Subtitle', 'Authors', 'Year', 'Publisher', 'Referenced', 'Versions']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()
        for result in results:
            writer.writerow(result)



#results = scrape_google_scholar_results(url)

# Print the results
#print_results(results)

# Export results to CSV
#export_to_csv(results, 'Output_Scraping_Google_Scholar_Notbook/google_scholar_results_v2.csv')

## Unsolved issues

One citation, "[ZITATION] Berichte über land-und forstwirtschaft in Deutsch-Ostafrika", is not scraped correctly.
For now we can leave this as it is; it is probably not an issue for the data we ultimately want to scrape.
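
One fragile spot in the parser is that the ".gs_a" line is split on every hyphen, so a hyphen inside a name or venue shifts all fields. Whether this explains the entry above has not been verified. The sketch below splits only on hyphens surrounded by whitespace. It assumes the line has the form "authors - venue, year - source", which is inferred from the output of the first scraper, and the function name `parse_gs_a` is made up for illustration.

```python
import re

def parse_gs_a(text):
    """Split a '.gs_a' line into authors, subtitle (venue), year and publisher."""
    # Split only on hyphens surrounded by whitespace, so that a hyphenated
    # name such as 'Müller-Schmidt' stays in one piece
    parts = [p.strip() for p in re.split(r"\s+-\s+", text.strip())]
    authors = parts[0]
    middle = parts[1] if len(parts) >= 2 else ""
    publisher = parts[-1] if len(parts) >= 3 else ""

    year = None
    subtitle = middle
    match = re.search(r"(\d{4})$", middle)   # a four-digit year at the end
    if match:
        year = int(match.group(1))
        subtitle = middle[:match.start()].rstrip(" ,")

    return {
        "Authors": authors,
        "Subtitle": subtitle,
        "Year": year,
        "Publisher": publisher,
    }

# Line reconstructed from the first entry in the output above
line = "KS Von Alfred, FM Schuster - Zeitschrift für Siebenbürgische …, 2016 - ceeol.com"
parsed = parse_gs_a(line)
print(parsed)
```

The returned keys match the column names used in `export_to_csv`, so the dictionary could be merged into a result row directly.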

In [72]:
url = 'https://scholar.google.com/scholar?as_vis=1&q=large+language+model&hl=de&as_sdt=0,5'

results = scrape_google_scholar_results(url)

# Print the results
#print_results(results)

# Export results to CSV
export_to_csv(results, 'gs_llm_1.csv')

No more next page button found. End of search results.


# Wall

It seems we have hit a wall.

Google Scholar only provides us with 100 pages (10 results each) per search, which is at most 1,000 results.
That is far below the amount of data we want, and restricting searches by time is not a viable way around it.


Idea: Use the Google Scholar Author pages.

pros:
- Author problem solved
- structured to a high degree 

cons:
- not all authors have a page on Google Scholar


Plan:
1) Search for a term like "large language model"
2) Scrape all authors from the first 100 pages
3) Keep track of the links to their specific pages
4) Keep track of authors that have no page, including the paper in which we found them (reuse code from above for this purpose)
5) Check the list of links for uniqueness
6) Scrape data from each author
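
The uniqueness check in step 5 could be sketched as below. It assumes that an author is identified by the `user` query parameter of the profile link, which matches the shape of the links the scraper collects from ".gs_a", and it ignores the other parameters such as `hl`. The function name `unique_author_links` is hypothetical, and the second and third sample links are variations made up to show the behaviour.

```python
from urllib.parse import urlparse, parse_qs

def unique_author_links(links):
    """Map each distinct 'user' ID to the first profile link seen for it."""
    unique = {}
    for link in links:
        user = parse_qs(urlparse(link).query).get("user")
        if not user:
            continue              # not a profile link, skip it
        # Keep only the first link per author ID
        unique.setdefault(user[0], link)
    return unique

sample_links = [
    "https://scholar.google.com/citations?user=3qb1AYwAAAAJ&hl=de&oi=sra",
    "https://scholar.google.com/citations?user=3qb1AYwAAAAJ&hl=en&oi=sra",
    "https://scholar.google.com/citations?user=KbrpC8cAAAAJ&hl=de&oi=sra",
]

authors = unique_author_links(sample_links)
print(len(authors), "unique authors")
```

Comparing on the ID rather than on the full link avoids counting the same author twice when only the language or referrer part of the link differs.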



In [113]:
import csv
import time
import re
from selenium.common.exceptions import NoSuchElementException, ElementNotInteractableException


driver.get('https://scholar.google.com/scholar?start=0&q=blockhaus+1700+gr%C3%BCn+kuh&hl=de&as_sdt=0,5')

done = False
page_counter = 1
author_links = []

def navigate_to_next_page(driver):
    global done
    global page_counter
    try:
        # Find the <b> tag containing the text "Weiter"
        next_button = driver.find_element_by_xpath("//b[contains(text(), 'Weiter')]")

        
        page_counter = page_counter + 1
        
        if (page_counter > 98):
            done = True
        else:
            next_button.click()
        
    except NoSuchElementException:
        done = True
        print("No more next page button found. End of search results.")
    except ElementNotInteractableException:
        done = True
        print("Next page button is not interactable. End of search results.")
        
def extract_number(s):
    match = re.search(r'\d+', s)
    if match:
        return int(match.group())
    else:
        return 1

def scrape_google_scholar_results(url):
    # Set up the Selenium webdriver
    #driver = webdriver.Firefox()
    driver.get(url)

    # Wait for the page to load
    driver.implicitly_wait(10)

    results = []
    

    while not done:
        # Find all the search result elements
        search_results = driver.find_elements_by_css_selector('.gs_ri')

        for result in search_results:
            time.sleep(0.02)
            
            title = result.find_element_by_css_selector('h3').text
            subtitle = ''
            authors = ''
            year = ''
            publisher = ''
            referenced = ''
            versions = ''

            try:
                metadata = result.find_element_by_css_selector('.gs_a').text.split('-')
                authors = metadata[0].strip()
                year = metadata[-2].strip()
                
                if(year == authors):
                    year = None
                else:    
                    parts = year.split(",")
                    if len(parts) > 1:
                        year = parts[-1].strip()
                        subtitle = ", ".join(parts[:-1]).strip()
                    
                publisher = metadata[-1].strip()
                
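                metadata = result.find_element_by_css_selector('.gs_a')
                # Collect this result's author profile links and add them to the
                # module-level list (assigning to author_links here would only
                # create a local variable and leave the global list empty)
                links = [a.get_attribute('href') for a in metadata.find_elements_by_css_selector('a')]
                author_links.extend(links)
                print(links)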
                
            except NoSuchElementException:
                pass

            try:
                referenced = result.find_element_by_css_selector('.gs_fl a:nth-of-type(3)').text.split()[-1]
                if referenced == "Artikel":
                    referenced = 0
            except NoSuchElementException:
                pass

            try:
                # Check for the presence of the versions element, if it exists
                #versions = result.find_element_by_css_selector('.gs_fl a:nth-of-type(5)').text.split()[-2]
                versions = result.find_element_by_css_selector('.gs_fl a:nth-of-type(5)').text
                versions = extract_number(versions)
            except NoSuchElementException:
                pass

            results.append({
                'Title': title,
                'Subtitle': subtitle,
                'Authors': authors,
                'Year': year,
                'Publisher': publisher,
                'Referenced': referenced,
                'Versions': versions
            })
            

        # Close the webdriver
        #driver.quit()
        navigate_to_next_page(driver)

    return results



def print_results(results):
    for result in results:
        print("Title:", result['Title'])
        print("Subtitle:", result['Subtitle'])
        print("Authors:", result['Authors'])
        print("Year:", result['Year'])
        print("Publisher:", result['Publisher'])
        print("Referenced:", result['Referenced'])
        print("Versions:", result['Versions'])
        print("--------------------")

def export_to_csv(results, filename):
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['Title', 'Subtitle', 'Authors', 'Year', 'Publisher', 'Referenced', 'Versions']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()
        for result in results:
            writer.writerow(result)

print(author_links)

#results = scrape_google_scholar_results(url)

# Print the results
#print_results(results)

# Export results to CSV
export_to_csv(results, 'Output_Scraping_Google_Scholar_Notbook/gs_test_page2.csv')

[]


In [101]:
url = 'https://scholar.google.com/scholar?as_vis=1&q=large+language+model&hl=de&as_sdt=0,5'

results = scrape_google_scholar_results(url)

# Print the results
#print_results(results)

# Export results to CSV
#export_to_csv(results, 'Output_Scraping_Google_Scholar_Notbook/gs_links.csv')

['https://scholar.google.com/citations?user=3qb1AYwAAAAJ&hl=de&oi=sra', 'https://scholar.google.com/citations?user=KbrpC8cAAAAJ&hl=de&oi=sra', 'https://scholar.google.com/citations?user=BE_lVTQAAAAJ&hl=de&oi=sra']
['https://scholar.google.com/citations?user=JUsooa0AAAAJ&hl=de&oi=sra', 'https://scholar.google.com/citations?user=4iG4IC4AAAAJ&hl=de&oi=sra']
['https://scholar.google.com/citations?user=TYqG0x4AAAAJ&hl=de&oi=sra', 'https://scholar.google.com/citations?user=Q7Ieos8AAAAJ&hl=de&oi=sra', 'https://scholar.google.com/citations?user=hBZ_tKsAAAAJ&hl=de&oi=sra', 'https://scholar.google.com/citations?user=KVeRu2QAAAAJ&hl=de&oi=sra', 'https://scholar.google.com/citations?user=go3sFxcAAAAJ&hl=de&oi=sra']
['https://scholar.google.com/citations?user=48GJrbsAAAAJ&hl=de&oi=sra', 'https://scholar.google.com/citations?user=206vNCEAAAAJ&hl=de&oi=sra', 'https://scholar.google.com/citations?user=oUYfjg0AAAAJ&hl=de&oi=sra']
['https://scholar.google.com/citations?user=JNhNacoAAAAJ&hl=de&oi=sra', '

KeyboardInterrupt: 