# Scraping Google Scholar with Selenium and Python

Source: https://medium.com/ml-book/web-scraping-using-selenium-python-3be7b8762747

**Steps:**

1) Install Selenium <br> 
    conda install -c conda-forge selenium <br>
    Depending on the setup, the alternative is: pip install selenium
    
2) Download GeckoDriver <br>
    GeckoDriver is the driver for Firefox; this setup runs on a Mac. <br>
    https://github.com/mozilla/geckodriver/releases

In [9]:
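from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Selenium 4 takes the driver path via a Service object
# (the executable_path argument was deprecated and later removed).
driver = webdriver.Firefox(service=Service('/Users/sebastian/Downloads/geckodriver'))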

## First problem

How do we access all entries on Google Scholar? <br>
- There is no official API.
- An empty search does not work.
- A search for a single space does not work.

**Attempts: <br>**
- Advanced search with * as a wildcard: this only works in combination with a word.
- Limiting the search by year: the year filter only appears after a search has returned results. Setting a timeframe and then removing the search term returns results that are not specific to the term, but the result counts indicate that something is wrong: a wider timeframe can report fewer results than a narrower one it contains. We might only be getting a subset of results that still include the removed search term, but there is no indication of that. Either way, this approach should be reviewed rigorously.
    - 2001 to 2024 = 1,040,000 results (0.04 sec)
    - 2005 to 2024 = 1,150,000 results (0.03 sec)
    
    - 2023 to 2023 = 6,960,000 results (0.03 sec)
    - 2023 to 2024 = 3,030,000 results (0.03 sec)
    
**Conclusion: <br>**
It seems that there is some sort of barrier between 1 million and 1.5 million results. <br>
If the search is too unspecific, the result count is cut off somewhere in this range. However, with smaller timeframes the results appear to be more reliable, even though there is little to verify that. Searching with the wildcard "*" works as long as further constraints are used. <br>

**Possible solution: <br>**
Search systematically by selecting only a subset of years in each search.
If we are interested in very old documents, we could work with larger timeframes, but in general the number of results per year grows the closer we get to the current year, as one would expect. 
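The idea of wider timeframes for old documents and single years for recent ones can be sketched as follows. This is a minimal sketch, not a tested strategy: the function name `year_windows` and the parameters `recent_from` and `old_width` are assumptions made for illustration, and suitable values would have to be found by checking the result counts.

```python
def year_windows(first_year, last_year, recent_from, old_width):
    """Split [first_year, last_year] into inclusive (start, end) windows.

    Years before `recent_from` are grouped into windows of `old_width` years;
    years from `recent_from` onwards get one window each.
    """
    windows = []
    year = first_year
    while year <= last_year:
        if year < recent_from:
            # Wide window, but never reaching into the "recent" years.
            end = min(year + old_width - 1, recent_from - 1, last_year)
        else:
            # Recent years have many results, so search them one by one.
            end = year
        windows.append((year, end))
        year = end + 1
    return windows


# Hypothetical values: 5-year windows before 2000, single years afterwards.
print(year_windows(1990, 2003, recent_from=2000, old_width=5))
```

Each tuple can then be used as the lower and upper year limit of one search.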

**Exploring alternatives: <br>**
Using a third-party Google Scholar API seems to be the better choice. However, getting all results without a search term persists as a challenge. 
Additionally, the "Free Plan" provides only 100 searches per month. Even with 3 team members, 300 searches per month might be too few. 

Questions: 
- How much of the data can we get with 1 search? 
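To get a feeling for the quota, here is a rough back-of-the-envelope calculation. It assumes that one search returns a single page of 10 entries, which is the page size the scraper observes further down; whether an API search behaves the same way is an open assumption. The function name `months_needed` is made up for this sketch.

```python
import math


def months_needed(total_results, results_per_search, searches_per_month):
    """Return (searches, months) needed to page through all results."""
    searches = math.ceil(total_results / results_per_search)
    months = math.ceil(searches / searches_per_month)
    return searches, months


# 1,040,000 is the result count reported above for 2001 to 2024.
# results_per_search=10 is an assumption, not a documented API limit.
searches, months = months_needed(
    1_040_000, results_per_search=10, searches_per_month=300
)
print(searches, months)
```

Under these assumptions the free plan is far too small to cover even one of the searches listed above.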

In [10]:
driver.get('https://scholar.google.com/scholar?q=*&hl=de&as_sdt=0%2C5&as_ylo=2000&as_yhi=2000')

**Goals: <br>**
1) Keep the interaction with the website as simple as possible from a coding point of view
2) Automate the process as much as possible

To satisfy both goals we can use the structure of Google Scholar links.
Searching for the wildcard "*" and setting a timeframe once gives us a working link. We can then simply change the years in that link and get another working search out of it. 

Therefore, creating those searches with a simple loop iterating from year x to year y should do the job. 
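A minimal sketch of that loop: the query parameters (`q`, `hl`, `as_sdt`, `as_ylo`, `as_yhi`) are copied from the link used in the cell above, while the helper name `scholar_url` and its defaults are assumptions made for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/scholar"


def scholar_url(year_from, year_to, query="*", language="de"):
    """Build a search link with a lower and an upper year limit."""
    params = {
        "q": query,
        "hl": language,
        "as_sdt": "0,5",       # value taken unchanged from the working link
        "as_ylo": year_from,   # lower year limit
        "as_yhi": year_to,     # upper year limit
    }
    # safe="*" keeps the wildcard readable instead of encoding it as %2A.
    return f"{BASE_URL}?{urlencode(params, safe='*')}"


# One link per year, here for the years 2000 to 2002.
urls = [scholar_url(year, year) for year in range(2000, 2003)]
for url in urls:
    print(url)
```

Each link can then be passed to `driver.get(...)`. Pausing between page loads is probably a good idea, since rapid automated requests may get blocked.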

**Steps: <br>**

3) Start building a scraper suited for Google Scholar results. 

In [19]:
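from selenium.webdriver.common.by import By
# find_elements_by_id was removed in Selenium 4; use find_elements with a By locator
container_main_body = driver.find_elements(By.ID, 'gs_res_ccl_mid')
len(container_main_body)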

1

In [24]:
from selenium.webdriver.common.by import By  # needed for the By locator
results = driver.find_elements(By.CLASS_NAME, 'gs_ri')
len(results)

10

**Google Scholar Body: <br>**

Nested within further layers, the results body sits in a container with the id "gs_res_ccl_mid". This container holds further containers with the class attribute "gs_r gs_or gs_scl", and those hold the individual entries. Some of them have more than one child, for example where a direct link to a PDF is available, but we are only interested in the container(s) with the class "gs_ri". 
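The nesting described above suggests looking for the "gs_ri" containers inside the main container rather than on the whole page. The following is a sketch under assumptions: the helper name `collect_entries` is made up, and the locator strings are written out as plain strings so the block does not need a running browser (they have the same values as Selenium's `By.ID` and `By.CSS_SELECTOR`).

```python
# Locator strategy strings; same values as By.ID and By.CSS_SELECTOR
# in selenium.webdriver.common.by, so a real WebDriver accepts them.
BY_ID = "id"
BY_CSS = "css selector"


def collect_entries(driver):
    """Return the 'gs_ri' elements nested inside the 'gs_res_ccl_mid' container."""
    containers = driver.find_elements(BY_ID, "gs_res_ccl_mid")
    if not containers:
        # Nothing found: the layout may have changed or no result page was served.
        return []
    # Search relative to the container so matches elsewhere on the page are ignored.
    return containers[0].find_elements(BY_CSS, ".gs_ri")


# Usage with the driver from the cells above (needs a loaded result page):
# entries = collect_entries(driver)
# print(len(entries))
```

Returning an empty list instead of raising makes it easy to detect pages without results when looping over many years.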