# Scraping Google Scholar with Selenium and Python

Source: https://medium.com/ml-book/web-scraping-using-selenium-python-3be7b8762747

**Steps:**

1) Install Selenium <br> 
    conda install -c conda-forge selenium <br>
    depending on setup; alternative: pip install selenium
    
2) Download GeckoDriver <br>
    In this case: suited for Firefox - working on a Mac <br>
    https://github.com/mozilla/geckodriver/releases

In [37]:
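from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Selenium 4 takes the driver path through a Service object; the
# executable_path argument of webdriver.Firefox is deprecated there and was later removed
driver = webdriver.Firefox(service=Service(executable_path='/Users/sebastian/Downloads/geckodriver'))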

## First problem

How do we access all entries on Google Scholar? <br>
- <del>there is no API</del>
- empty search does not work
- empty space does not work

**Attempts: <br>**
- Advanced search with * as a wildcard: only works in combination with a word
- The option to limit the search by year only appears after a search has returned results. Setting a timeframe and then removing the search term returns results that are no longer specific to the term, but the result counts indicate that something is wrong. We might only be getting a subset of results that still relate to the removed search term, although nothing indicates that. Either way, this approach should be reviewed rigorously.
    - 2001 to 2024: 1,040,000 results (0.04 s)
    - 2005 to 2024: 1,150,000 results (0.03 s)

    - 2023 to 2023: 6,960,000 results (0.03 s)
    - 2023 to 2024: 3,030,000 results (0.03 s)
    
**Conclusion: <br>**
It seems that there is some sort of ceiling between 1 million and 1.5 million results. <br>
If the search is too unspecific, the result count is cut off somewhere in this range. Choosing smaller timeframes appears to give more reliable results, even though there is little to verify that. Searching with the wildcard "*" works as long as a further constraint, such as a time range, is set. <br>

**Possible solution: <br>**
Search systematically by selecting only a subset of years in each search.
If we are interested in very old documents, we could work with larger timeframes, but in general the number of results per year seems to grow the closer we get to the current year. This matches expectations.
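
The year-by-year strategy could be sketched with a small helper that splits a range of years into search windows. This is only an illustration: the function name `year_windows`, the cut-off year `wide_until` and the window width `wide_span` are assumptions, not values derived from the experiments above.

```python
def year_windows(first_year, last_year, wide_until=None, wide_span=10):
    """Split [first_year, last_year] into inclusive (start, end) windows.

    Years up to `wide_until` are grouped into windows of `wide_span` years,
    every later year gets a window of its own. Both parameters are
    hypothetical tuning knobs.
    """
    windows = []
    year = first_year
    while year <= last_year:
        if wide_until is not None and year <= wide_until:
            # Old documents: use a wide window, but never cross the cut-off
            end = min(year + wide_span - 1, wide_until, last_year)
        else:
            # Recent documents: one search per year
            end = year
        windows.append((year, end))
        year = end + 1
    return windows


for start, end in year_windows(1980, 2002, wide_until=1999):
    print(start, end)
```

Whether ten-year windows stay below the result ceiling for old documents would still have to be checked against the real result counts.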

**Exploring alternatives: <br>**
Using the Google Scholar API seems to be the better choice. However, getting all results without a search term remains a challenge.
In addition, the "Free Plan" provides only 100 searches per month. Even with 3 team members, 300 searches per month might not be enough.

Questions: 
- How much of the data can we get with 1 search? 

In [38]:
driver.get('https://scholar.google.com/scholar?q=*&hl=de&as_sdt=0%2C5&as_ylo=2000&as_yhi=2000')

**Constraints: <br>**
1) Keep the interaction with the website as easy as possible from a coding point of view
2) Automate the process as much as possible

In order to satisfy both constraints, we can use the structure of Google Scholar links.
Searching for the wildcard "*" and setting a timeframe once gives us a working link. We can then simply change the years in that link to get another working search.

Therefore, creating those searches with a simple loop iterating from year x to year y should do the job.
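
A minimal sketch of such a loop, assuming the query parameters of the link used above (`q`, `hl`, `as_sdt`, `as_ylo`, `as_yhi`) keep their meaning. The helper name `build_search_url` and the year range 2000 to 2004 are made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/scholar"


def build_search_url(year_from, year_to, query="*", language="de"):
    """Build a Google Scholar search link restricted to a range of years."""
    params = {
        "q": query,
        "hl": language,
        "as_sdt": "0,5",       # copied from the working link above
        "as_ylo": year_from,   # first year of the timeframe
        "as_yhi": year_to,     # last year of the timeframe
    }
    # safe="*" keeps the wildcard readable instead of encoding it as %2A
    return f"{BASE_URL}?{urlencode(params, safe='*')}"


# One search per year
search_urls = [build_search_url(year, year) for year in range(2000, 2005)]

for url in search_urls:
    print(url)
```

Each generated link can then be passed to `driver.get(...)` in turn.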

**Steps: <br>**

3) Start building a scraper suited for Google Scholar results. 

In [3]:
from selenium.webdriver.common.by import By

container_main_body = driver.find_elements(By.ID, 'gs_res_ccl_mid')
len(container_main_body)

1

In [4]:
from selenium.webdriver.common.by import By

results = driver.find_elements(By.CLASS_NAME, 'gs_ri')
len(results)

10

**Google Scholar Body: <br>**

The main content sits in a container with the id "gs_res_ccl_mid". Inside it, each search result is wrapped in a sub-container with the classes "gs_r gs_or gs_scl". Such a sub-container can hold several child elements, for example a direct link to a PDF file. We focus on the child containers with the class "gs_ri", because they hold the actual entries.
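
The hierarchy described above could be walked as in the following sketch. It assumes that the title element "gs_rt" sits inside each "gs_ri" container and that entries without a link simply have no `a` element. The function name `extract_entries` is hypothetical, and the locator strategy is passed as a plain string so that the snippet does not need Selenium to be imported.

```python
# "css selector" is the string value behind selenium's By.CSS_SELECTOR
CSS = "css selector"


def extract_entries(driver):
    """Return one dict per 'gs_ri' container on the current results page."""
    entries = []
    # Only look at entries inside the main content container
    for container in driver.find_elements(CSS, "#gs_res_ccl_mid .gs_ri"):
        headings = container.find_elements(CSS, ".gs_rt")
        links = container.find_elements(CSS, ".gs_rt a")
        entries.append({
            # Some entries may lack a heading or a link, so fall back to None
            "title": headings[0].text if headings else None,
            "url": links[0].get_attribute("href") if links else None,
        })
    return entries
```

With a live session this would be called as `extract_entries(driver)` after a results page has been loaded.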

**Questions: <br>**
- Is the result unique for each user? Is there a seed for the search or are the constraints the only factor?

In [9]:
from selenium.webdriver.common.by import By

titles = driver.find_elements(By.CLASS_NAME, 'gs_rt')

for title in titles:
    title.click()
    print("Opened a page!")
    

Opened a page!
Opened a page!


StaleElementReferenceException: Message: The element with the reference 3894b0a1-cd45-48c2-8347-2be007cb4b13 is not known in the current browsing context


**Error <br>**
Iterating over the results as done above leads to a StaleElementReferenceException. The problem is that the element references we collected are no longer valid once we have interacted with the page. To avoid that, we need to save the URLs before we interact with the page.

**ChatGPT 3.5 <br>**
ChatGPT proposes a slightly more advanced approach, using CSS selectors instead of lookups by id.


    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import time

    # Initialize the WebDriver
    driver = webdriver.Chrome()

    # Navigate to Google Scholar
    driver.get("https://scholar.google.com")

    # Locate the search bar and input your query
    search_bar = driver.find_element_by_name("q")
    search_query = "Your search query here"
    search_bar.send_keys(search_query)
    search_bar.send_keys(Keys.RETURN)

    # Wait for the results to load
    time.sleep(2)

    # Collect the URLs of the search results
    result_links = driver.find_elements_by_css_selector('.gs_rt a')
    result_urls = [link.get_attribute('href') for link in result_links]

    # Iterate over the URLs and scrape metadata from each page
    for url in result_urls:
        driver.execute_script("window.open('{}', '_blank');".format(url))
        driver.switch_to.window(driver.window_handles[-1])
    
        # Here you can scrape metadata from the individual page
        # Example:
        meta_tags = driver.find_elements_by_css_selector('meta')
        for tag in meta_tags:
            print(tag.get_attribute('name'), tag.get_attribute('content'))
    
        # Close the tab after scraping
        driver.close()
    
        # Switch back to the main window/tab
        driver.switch_to.window(driver.window_handles[0])

    # Close the WebDriver
    driver.quit()





A few adjustments seem necessary. We want to wait inside the loop instead of upfront, and we do not need the query part. After some adjustments to the timer, which give the pages more time to load before we read the metadata, we are able to read the wanted information from 9 out of 10 pages.

Missing output:

--------------------------------------
 text/html; charset=UTF-8
 IE=Edge
robots noindex,nofollow
viewport width=device-width,initial-scale=1
 375
--------------------------------------

After checking, we can see that we ran into a captcha. Before we bring order to the scraped material, for example by saving the name of each meta tag together with its content, we want to solve this issue, because it is likely to occur more often.
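
Both ideas, storing each meta tag name with its content and noticing pages like the one above, could be sketched as follows. The heuristic in `looks_blocked` is an assumption based only on the missing output shown above (a "noindex" robots tag and no bibliographic tags), and both function names are hypothetical.

```python
def meta_tags_to_dict(pairs):
    """Map meta tag names to their content, skipping tags without a name."""
    return {name: content for name, content in pairs if name}


def looks_blocked(meta):
    """Guess whether a page is a captcha or block page (heuristic only).

    Assumption: such a page asks robots not to index it and carries no
    bibliographic meta tags such as title, description, doi or citation_*.
    """
    robots = (meta.get("robots") or "").lower()
    has_bibliographic_data = any(
        key in ("title", "description", "doi") or key.startswith("citation_")
        for key in meta
    )
    return "noindex" in robots and not has_bibliographic_data


# The (name, content) pairs of the missing output above
blocked_page = meta_tags_to_dict([
    ("", "text/html; charset=UTF-8"),
    ("", "IE=Edge"),
    ("robots", "noindex,nofollow"),
    ("viewport", "width=device-width,initial-scale=1"),
    ("", "375"),
])
print(looks_blocked(blocked_page))
```

In the scraping loop, the pairs would come from `tag.get_attribute('name')` and `tag.get_attribute('content')`. A page flagged this way could be queued for a later retry instead of being stored.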

In [40]:
import time

from selenium.webdriver.common.by import By

# Save the URLs first, so that we no longer depend on element references
# once we start opening pages
result_links = driver.find_elements(By.CSS_SELECTOR, '.gs_rt a')
result_urls = [link.get_attribute('href') for link in result_links]

# Iterate over the URLs and scrape metadata from each page
for url in result_urls:
    # Pass the URL as a script argument instead of formatting it into the JavaScript
    driver.execute_script("window.open(arguments[0], '_blank');", url)
    driver.switch_to.window(driver.window_handles[-1])

    # Give the page time to load before reading the meta tags
    time.sleep(4)

    # Print the name and content of every meta tag on the page
    meta_tags = driver.find_elements(By.CSS_SELECTOR, 'meta')
    for tag in meta_tags:
        print(tag.get_attribute('name'), tag.get_attribute('content'))
    
    
    
    driver.close()
    
    # Switch back to the main window/tab
    driver.switch_to.window(driver.window_handles[0])
    
    print("--------------------------------------")


 
 IE=edge
viewport width=device-width, initial-scale=1
applicable-device pc,mobile
access No
doi 10.1007/978-3-658-28565-4
title Die grenzenlose Unternehmung
description Die Neuauflage dieses Standardlehrbuches bietet ausgewählte theoretische Erklärungsansätze für die Entstehung neuer Organisationskonzepte vor dem Hintergrund von Digitalisierung und neuen Technologien und zeigt auf, welche Implikationen sich daraus für die Menschen in Organisationen ergeben.
citation_springer_api_url https://api.springernature.com/xmldata/jats?q=bookdoi:10.1007/978-3-658-28565-4&api_key=
 https://link.springer.com/book/10.1007/978-3-658-28565-4
 book
 SpringerLink
 Die grenzenlose Unternehmung
 https://media.springernature.com/w153/springer-static/cover/book/978-3-658-28565-4.jpg
 As0hBNJ8h++fNYlkq8cTye2qDLyom8NddByiVytXGGD0YVE+2CEuTCpqXMDxdhOMILKoaiaYifwEvCRlJ/9GcQ8AAAB8eyJvcmlnaW4iOiJodHRwczovL2RvdWJsZWNsaWNrLm5ldDo0NDMiLCJmZWF0dXJlIjoiV2ViVmlld1hSZXF1ZXN0ZWRXaXRoRGVwcmVjYXRpb24iLCJleHBpcnkiOjE3MTk1

**Solve the captcha problem: <br>**

A short search turns up five techniques that require varying amounts of effort:

1) proxies 
2) limit requests
3) headless browsers
4) randomized user agent
5) human emulation
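
As a first, low-effort step, technique 2 (limit requests) could be sketched with randomized pauses between page loads. The bounds of 3 and 8 seconds and the names `random_delay` and `throttled` are assumptions for illustration; whether such pauses are enough to avoid the captcha is untested.

```python
import random
import time


def random_delay(min_seconds=3.0, max_seconds=8.0):
    """Pick a pause length so that requests do not arrive at a fixed rhythm."""
    return random.uniform(min_seconds, max_seconds)


def throttled(urls, min_seconds=3.0, max_seconds=8.0, sleep=time.sleep):
    """Yield URLs one by one, pausing a random interval between them."""
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first URL, a random pause before every other one
            sleep(random_delay(min_seconds, max_seconds))
        yield url
```

In the scraping loop, `for url in result_urls:` would become `for url in throttled(result_urls):`. The `sleep` parameter only exists so that the pause can be replaced in tests.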
