# Synopsis

In this unit we will cover the use of `Selenium` to interact with webpages in real time.


# Read libraries

In [1]:
%load_ext autoreload
%autoreload 2

from colorama import Back, Fore, Style
from copy import copy, deepcopy
from pathlib import Path


## Special requirements

This notebook requires a separate environment in which `selenium` is installed.

In [2]:
from bs4 import BeautifulSoup
from IPython.display import HTML, display, Image
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support.expected_conditions import element_to_be_clickable
from string import punctuation, whitespace
from time import sleep

import json
import re
import requests

# When you need to interact with the website

Sometimes, the webpages that you wish to scrape require authentication, or require you to select from menus, click buttons, and so on.

Those things can be quite difficult to do using `requests`.  So, here comes `Selenium` to the rescue.

`Selenium` opens a real browser on your computer that is controlled by your code and, possibly, by you.

Below we see how to probe a webpage for the type of information that `Selenium` is able to use for navigation and for activation of the interactable elements in the page. 

## Google Scholar

In [3]:
# My google scholar page
#
gs_url = 'https://scholar.google.com/citations?hl=en&user=Jo0G0c0AAAAJ'

As you can see in the image, we want to click on the `SHOW MORE` button:
<center>
    <img src = 'Images/inspect_html.png' width = 800>
</center>

<br><br>
Like `BeautifulSoup`, `Selenium` can search for `HTML` elements using several locator strategies (for example, by id, by CSS selector, or by XPath). In this case, we will use the `XPath` approach:

<center>
    <img src = 'Images/inspect_html_copy_xpath.png' width = 800>
</center>

<br><br>
To do this, right-click on the relevant element in the browser's inspector pane. Then select `Copy` and the `XPath` option.

Pasting the clipboard content, you get:

> //*[@id="gsc_bpf_more"]

### Using the browser
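Before opening the browser, it may help to see what the copied expression selects. The sketch below is only an illustration: it uses the standard-library `xml.etree.ElementTree` (which supports a small subset of XPath) instead of `Selenium`, and the HTML fragment is a hypothetical, heavily simplified stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; the real profile page is far larger.
fragment = """<div id="gsc_bpf">
  <button id="gsc_bpf_more" type="button">Show more</button>
  <button id="gsc_other" type="button">Other</button>
</div>"""

root = ET.fromstring(fragment)

# ElementTree requires a leading '.' for a search relative to the root;
# otherwise this mirrors the copied expression //*[@id="gsc_bpf_more"]:
# "any element, at any depth, whose id attribute is gsc_bpf_more".
matches = root.findall(".//*[@id='gsc_bpf_more']")
matched = [(element.tag, element.text) for element in matches]
print(matched)
```

Only the button with the matching `id` is returned; the second button is ignored. The more specific `//button[@id = 'gsc_bpf_more']` used in the next cell adds the requirement that the element be a `button`.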

In [4]:
# We use sleep to allow time for information to be exchanged over the internet
#
with webdriver.Firefox() as browser:
    browser.get(gs_url)
    sleep(5)

    # Rows will store the list of publications
    # 
    rows = []

    # Check that there is an active button, meaning that there
    # are more records to be shown
    #
    while True:        
        button = browser.find_element( By.XPATH, 
                                       "//button[@id = 'gsc_bpf_more']" )
        if button.is_enabled():
            button.click()
            sleep(2)
        else:
            break

    # Now that the page shows all publications, grab its HTML while the
    # browser is still open: leaving the `with` block closes the browser.
    #
    page_source = browser.page_source

# We will use beautiful soup to extract information.
#
soup = BeautifulSoup(page_source, 'html.parser')
print(f"Retrieved {gs_url} and got soup!\n")

# First we get the body of the table displaying publications.
# Then, we retrieve them one by one
#
results_table = soup.find('tbody', {'id': 'gsc_a_b'})
for item in results_table.children:
    rows.append( item )

print(f"Retrieved {len(rows)} rows of data")


NoSuchElementException: Message: Unable to locate element: //button[@id = 'gsc_bpf_more']; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:193:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:511:5
dom.find/</<@chrome://remote/content/shared/DOM.sys.mjs:136:16
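
A `NoSuchElementException` like the one in the traceback above is raised when `find_element` is called and the element is not in the page at that moment, for example because the page has not finished loading or a different page was served. A fixed `sleep` cannot guarantee that the element is there. A common alternative is to poll until a condition holds or a timeout expires. The sketch below is a simplified, pure-Python stand-in for that idea, not `Selenium` code; `wait_for` and `button_present` are hypothetical names introduced for illustration.

```python
import time


def wait_for(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in for a page whose button only shows up on the third check.
attempts = {"count": 0}


def button_present():
    attempts["count"] += 1
    return "button" if attempts["count"] >= 3 else None


found = wait_for(button_present, timeout=2.0, poll=0.01)
print(found, attempts["count"])
```

In `Selenium` itself, the already imported `WebDriverWait` and `element_to_be_clickable` play this role: `WebDriverWait(browser, 10).until(element_to_be_clickable((By.XPATH, "//button[@id = 'gsc_bpf_more']")))` keeps checking for up to 10 seconds and returns the element once it can be clicked.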


In [None]:
type(button)

<br><br>

Next, we use the function `scrape_gs_profile` to extract information in usable form.

In [None]:
def scrape_gs_profile( rows ):
    """
    Takes html text of papers in Google Scholar profile and stores citations.

    inputs:
        rows -- list of html elements extracted from table of publications

    returns:
        gs_papers -- list of dictionaries
    """
    gs_papers = []
    for i, item in enumerate(rows):
        gs_papers.append({})

        parts = list( item.children )
        year = int(parts[2].text)

        text_for_cites = parts[1].text.strip(punctuation).replace(',', '')
        if text_for_cites.isnumeric():
            citations = int(text_for_cites)
        else:
            citations = 0

        title = parts[0].find('a').text
        authors = parts[0].find('div').text

        gs_papers[-1]['year'] = year
        gs_papers[-1]['citations'] = citations
        gs_papers[-1]['title'] = title
        gs_papers[-1]['author'] = authors.strip( whitespace + punctuation )

        first_author = gs_papers[-1]['author'].split(',')[0].split()[-1]
        gs_papers[-1]['_id'] = f"{year} {first_author.lower()} {title.lower()}"

        print(f"{year} - {citations:>4} - {first_author.lower()} {title.lower()}")

    return gs_papers

In [None]:
gs_papers = scrape_gs_profile( rows )

gs_papers[:5]
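
The text-cleaning steps inside `scrape_gs_profile` can be tried out on plain strings, without a browser. The sketch below isolates them into two hypothetical helpers, `parse_citations` and `parse_year`; the sample strings are made up to cover typical cases (thousands separators, a trailing marker, an empty cell) and are not taken from the actual page. Note that `parse_year` returns `None` for an empty cell, whereas `int(parts[2].text)` in the function above would raise a `ValueError`; that is an assumption about how one might want to handle missing years.

```python
from string import punctuation


def parse_citations(text):
    """Convert the text of a citation cell to an int, defaulting to 0."""
    # Strip punctuation from both ends, then drop thousands separators.
    cleaned = text.strip(punctuation).replace(',', '')
    return int(cleaned) if cleaned.isnumeric() else 0


def parse_year(text):
    """Convert the text of a year cell to an int, or None if it is empty."""
    cleaned = text.strip()
    return int(cleaned) if cleaned.isdigit() else None


samples = ['12', '1,234', '57*', '']
parsed = [parse_citations(sample) for sample in samples]
print(parsed)
print(parse_year('2024'), parse_year(''))
```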

## Scopus author profiles

In [5]:
# My Scopus author page
#
scopus_url = 'https://www.scopus.com/authid/detail.uri?authorId=57200155842'

In [7]:
n_selector = 'label.Select-module__mkduq:nth-child(1) > select:nth-child(2)'
table_css = 'ul.Stack-module__tT3r4:nth-child(4)'
next_css = '.page-item > button:nth-child(1)'

with webdriver.Firefox() as browser:
    scopus_papers = []
    browser.get(scopus_url)
    sleep(20)
    n_menu = Select( browser.find_element( By.CSS_SELECTOR, 
                                           n_selector) )
    
    n_menu.options[-1].click()
    print(Fore.BLUE, 'Changed number of records shown in page!\n', Style.RESET_ALL)
    sleep(5)
    
    # Rows will store the list of publications
    # 
    rows = []

    # Download records and then check that there is an active button, 
    # meaning that there are more pages of records to be shown
    #
    counter = 1
    print( 'There are:' ) 
    while True:
        # First we get the body of the table displaying publications.
        # Then, use BS to retrieve entries all at once
        #
        table = browser.find_element( By.CSS_SELECTOR, 
                                      table_css )
        soup = BeautifulSoup( table.get_attribute('innerHTML'), 'html.parser' )

        results_table = soup.find_all( 'li' )
        print( f"\t{len(results_table)} rows in results in page {counter}." )

        for item in results_table:
            rows.append( item )
 
        button = browser.find_element(By.CSS_SELECTOR, next_css)
        if button.is_enabled():
            button.click()
            counter += 1
            sleep(5)
        else:
            break

    print(Fore.BLUE, f"\nRetrieved {len(rows)} rows of data.", Style.RESET_ALL)



Changed number of records shown in page!

There are 200 rows in results

There are 13 rows in results


Retrieved 213 rows of data.


<br><br>

Next, we use the function `scrape_scopus_profile` to extract information in usable form.

In [8]:
def scrape_scopus_profile( rows ):
    """
    Takes html text of papers in Scopus and stores citations.

    inputs:
        rows -- list of rows extracted from soup object

    returns:
        scopus_papers -- list of dictionaries
    """
    scopus_papers = []
    for k, row in enumerate( rows ):
        # Extract title
        #
        title_element = row.find('h4')
        if not title_element:
            continue
        title = title_element.text

        # Extract authors
        #
        author_element = row.find('div', {'data-testid': 'author-list'} )
        authors = []
        for child in author_element.children:
            if child.text != ', ':
                authors.append(child.text)

        # Extract year and source
        #
        div_elements = row.find_all( 'span',
                                     {'class': 'Typography-module__lVnit Typography-module__fRnrd Typography-module__Nfgvc'} )

        if len(div_elements) > 1:
            source_year_vol_page  = div_elements[0].text
            match = re.search( r'\d{4}', source_year_vol_page )
            if match is None:
                # No four-digit year in this entry, so skip it instead of crashing
                continue

            source = source_year_vol_page[:match.start()].strip().rstrip(',')
            year = source_year_vol_page[match.start():match.end()]

        else:
            print('O'*30)
            year_vol_page = div_elements[0].text
            elements = year_vol_page.split(',')
            year = elements[0].strip()

            source_element = row.find('div', {'class': 'text-meta'})
            source = source_element.text

        # Extract citations
        #
        citations_element = row.find('div', {'data-testid': 'order'})
        text_for_cites = citations_element.text.rstrip('Citations')
        text_for_cites = text_for_cites.replace(',', '')
        if text_for_cites.isnumeric():
            citations = int(text_for_cites)
        else:
            citations = 0

        first_author = authors[0].split(',')[0].lower()
        print(f"{year} - {citations:>4} - {first_author} {title.lower()}")

        # There are issues with alternative last names for some of my publications
        #           Delete 'if' conditions below or replace in case you have similar
        #           issues
        #
        if first_author == 'nunes':
            first_author = 'amaral'
        if first_author == 'auto':
            first_author = 'moreira'

        # Scopus includes citations to versions published in book chapters and
        # includes statement about open access. These must be removed.
        #
        clean_title = title.lower().replace('open access', ' ')
        clean_title = clean_title.replace('book chapter', ' ').strip()
        scopus_id = f"{year} {first_author} {clean_title.lower()}"

        scopus_papers.append({'_id': scopus_id, 'title': title, 'year': year,
                              'authors': authors, 'journal': source,
                              'citations': citations})

    return scopus_papers
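
The way the function above separates the source from the year can be checked in isolation. The sketch below wraps that step in a hypothetical helper, `split_source_and_year`; the example strings are invented to resemble a "source, year, volume, pages" line and are assumptions about the format, not text copied from Scopus.

```python
import re


def split_source_and_year(text):
    """Split 'Source, 2024, ...' into (source, year) using the first 4-digit run."""
    match = re.search(r'\d{4}', text)
    if match is None:
        # No year found: return the cleaned text and mark the year as missing.
        return text.strip().rstrip(','), None
    source = text[:match.start()].strip().rstrip(',')
    return source, match.group()


# Hypothetical examples of the text found in a record
with_year = split_source_and_year('Nature Physics, 2024, 20(4), pp. 523-524')
without_year = split_source_and_year('eLife')
print(with_year)
print(without_year)
```

Because the pattern takes the first run of four digits, a source whose name itself contains four consecutive digits would be split in the wrong place, so it is worth spot-checking the extracted years.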
        



In [9]:
scopus_papers = scrape_scopus_profile( rows )

scopus_papers[:5]

2024 -    0 - nunes amaral artificial intelligence needs a scientific method-driven reset
2024 -    3 - richardson meta-research: understudied genes are lost in a leaky pipeline between genome-wide assays and reporting of results
2024 -    1 - castro on the improvement of handwritten text line recognition with octave convolutional recurrent neural networks
2024 -    0 - qiao energy metabolism modulates the regulatory impact of activators on gene expression
2023 -    2 - bernasek ratiometric sensing of pnt and yan transcription factor levels confers ultrasensitivity to photoreceptor fate transitions in drosophila
2023 -    1 - liu a new approach for extracting information from protein dynamics
2022 -   33 - stoeger aging is associated with a systemic length-associated transcriptome imbalance
2022 -   15 - lei forecasting the evolution of fast-changing transportation networks using machine learning
2022 -    0 - bechel the first step is recognizing there is a problem: a methodology for a

[{'_id': '2024 nunes amaral artificial intelligence needs a scientific method-driven reset',
  'title': 'Artificial intelligence needs a scientific method-driven reset',
  'year': '2024',
  'authors': ['Nunes Amaral, L.A.'],
  'journal': 'Nature Physics',
  'citations': 0},
 {'_id': '2024 richardson meta-research: understudied genes are lost in a leaky pipeline between genome-wide assays and reporting of results',
  'title': 'Meta-Research: Understudied genes are lost in a leaky pipeline between genome-wide assays and reporting of results',
  'year': '2024',
  'authors': ['Richardson, R., ',
   'Tejedor Navarro, H., ',
   'Nunes Amaral, L.A., ',
   'Stoeger, T.'],
  'journal': 'eLife',
  'citations': 3},
 {'_id': '2024 castro on the improvement of handwritten text line recognition with octave convolutional recurrent neural networks',
  'title': 'On the improvement of handwritten text line recognition with octave convolutional recurrent neural networks',
  'year': '2024',
  'authors': [