# Google DR Search

## Questions
* It seems that with Selenium we do not identify ourselves to the site. We only do so when using the `requests` package, by passing `headers={"Name": "Simon Ullrich - summer course project", "email": "simon.ullrich@sodas.ku.dk"}`.

## Comments
* Looping until the last result page is implemented
    * An error counter is included to terminate the while loop
* `time.sleep` is included for less suspicious behaviour
* CSS selector changed, because once the last page is reached, the old CSS selector might just have turned around and matched the link to the previous page.

## Importing packages

In [1]:
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
import numpy as np
import tqdm
import csv

# Import selenium
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

## Generate search terms

Search terms follow this format: we search for the month and year in the format that DR articles use in their timestamp. Manual Google searches returned relevant results mostly limited to the month provided. In this way we create a list of links to DR articles. All articles are located on the site https://www.dr.dk/nyheder or a subsite, which can be included in the Google search. An example of a search is: `"jan. 2012" AND "sygeplejersker*" site:https://www.dr.dk/nyheder`.

In [2]:
# Danish month abbreviations as they appear in DR article timestamps
months = ['jan.', 'feb.', 'mar.', 'apr.', 'maj', 'jun.', 'jul.', 'aug.', 'sep.', 'okt.', 'nov.', 'dec.']
# Generate empty list with search terms
search_terms = []
# Generate one search term per year-month combination
for year in range(2012, 2023):
    for month in months:
        # The trailing "\n" is typed as Enter by send_keys, which submits the search
        term = f'"{month} {year}" AND "sygeplejersker*" site:https://www.dr.dk/nyheder/ \n'
        search_terms.append(term)
# Create final list of search terms: the four future month-year combinations (Sep-Dec 2022) are deleted
search_terms = search_terms[:-4]

In [43]:
len(search_terms[:-4])

124

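We create a total of 128 search terms, one for each month-year combination between January 2012 and August 2022 (11 years × 12 months = 132, minus the 4 future months). The cell above reports 124 only because it slices four more terms off the already trimmed list.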
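
As a cross-check on the count, the same list can be built from an explicit end month instead of slicing off the tail. This is a minimal sketch, not the notebook's own code: the function name `build_search_terms` and its parameters are illustrative assumptions.

```python
from itertools import product

MONTHS = ['jan.', 'feb.', 'mar.', 'apr.', 'maj', 'jun.',
          'jul.', 'aug.', 'sep.', 'okt.', 'nov.', 'dec.']

def build_search_terms(first_year, last_year, last_month, keyword="sygeplejersker*"):
    """Return one Google query per month, from January of first_year up to
    and including month number last_month (1-12) of last_year."""
    terms = []
    for year, (month_no, month) in product(range(first_year, last_year + 1),
                                           enumerate(MONTHS, start=1)):
        # Skip the months of the final year that lie after the end month
        if year == last_year and month_no > last_month:
            continue
        terms.append(f'"{month} {year}" AND "{keyword}" site:https://www.dr.dk/nyheder/ \n')
    return terms

terms = build_search_terms(2012, 2022, last_month=8)
print(len(terms))  # 128
```

Stating the end month explicitly avoids having to remember how many trailing entries were already removed.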

## Scraping Google search to retrieve list of links to DR articles
With the following code we scrape Google searches to retrieve a list of links to DR articles, since the DR website does not provide a useful search function. We use Selenium to execute a Google search and retrieve DR article links: we execute the search, save the HTML of the first results page, and then move on to further result pages to retrieve more search results. Google intervenes when search results are scraped too fast. We therefore add a break when moving between pages. The break time takes random values between 0.25 and 3.5 seconds.
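
The random break can be wrapped in a small helper so that the bounds are validated in one place. This is a sketch under assumptions: the name `polite_pause` and the injectable `sleep` argument are illustrative and not part of the notebook, and the standard library `random` module is used here in place of NumPy.

```python
import random
import time

def polite_pause(low, high, sleep=time.sleep):
    """Sleep for a random duration between low and high seconds and return it."""
    if not 0 <= low <= high:
        raise ValueError("expected 0 <= low <= high")
    delay = random.uniform(low, high)
    sleep(delay)
    return delay

# Example: the 0.25 to 3.5 second break described above.
# A no-op sleep is injected so that the example returns immediately.
delay = polite_pause(0.25, 3.5, sleep=lambda seconds: None)
print(0.25 <= delay <= 3.5)  # True
```

Both bounds have to be passed as separate arguments. A single expression such as `30-45` would be read as one negative lower bound, and `time.sleep` rejects negative durations.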

### Defining a scraping function

In [191]:
html_list = []
finished_searches = 0
for i in tqdm.tqdm(search_terms[start:stop]):
    driverService = Service(r"C:\Users\jgb569\OneDrive - University of Copenhagen\Documents\Webscraping\chromedriver.exe") 
    driver = webdriver.Chrome(service = driverService)
    # Go to google
    driver.get('https://google.com')
    # Discard cookie message, reject cookies
    cookie = driver.find_element(By.ID, "W0wltc")
    cookie.click()
    # Search for DR news articles
    gsearch = driver.find_element(By.CSS_SELECTOR, "input[title='Søg']")
    gsearch.send_keys(i)
    # Get HTML for first search result page
    html = driver.page_source
    html_list.append(html)
    # Go to next result page
    next_page = driver.find_element(By.CSS_SELECTOR, ".NVbCr+ span")  # CSS selector should match only the next page, not the previous one
    next_page.click()
    # Error flag used when reaching the last search page:
    #   error = 0: there is another result page.
    #   error = 1: there is no further page on Google, so the loop stops.
    error = 0
    while error < 1:
        try:
            html2 = driver.page_source
            html_list.append(html2)
            # Google detects suspicious behavior and asks to solve some puzzle after 7 iterations. Trying random sleep time and scrolling down to element.
            time.sleep(np.random.uniform(5, 10))
            # Go to next result page
            next_page = driver.find_element(By.CSS_SELECTOR, "#pnnext .NVbCr+ span") #CSS selector only last not previous page
            next_page.click()
        except Exception:  # no next-page link found (or the click failed): stop paging
            error += 1
    finished_searches += 1
    # Pause 30 to 45 seconds between searches (uniform(30-45) would pass a single negative bound)
    time.sleep(np.random.uniform(30, 45))
    driver.quit()

NameError: name 'start' is not defined

### SCRAPE NOT HERE - PROBLEM: Page limit set, but continues to iterate over pages!!
The call of the scraping function is divided into smaller portions for a less suspicious Selenium experience. This did not fully work.
Jan-Aug 2012: All links
Sep-Dec 2012

In [None]:
html_list = []
finished_searches = 8
for i in tqdm.tqdm(search_terms[8:12]):
    driverService = Service(r"C:\Users\jgb569\OneDrive - University of Copenhagen\Documents\Webscraping\chromedriver.exe") 
    driver = webdriver.Chrome(service = driverService)
    # Go to google
    driver.get('https://google.com')
    # Discard cookie message, reject cookies
    cookie = driver.find_element(By.ID, "W0wltc")
    cookie.click()
    # Search for DR news articles
    gsearch = driver.find_element(By.CSS_SELECTOR, "input[title='Søg']")
    gsearch.send_keys(i)
    # Get HTML for first search result page
    html = driver.page_source
    html_list.append(html)
    # Go to next result page
    next_page = driver.find_element(By.CSS_SELECTOR, ".NVbCr+ span")  # CSS selector should match only the next page, not the previous one
    next_page.click()
    # Error flag used when reaching the last search page:
    #   error = 0: there is another result page.
    #   error = 1: there is no further page on Google, so the loop stops.
    error = 0
    page = 1
    # Stop at the page limit or at the first failed click; checking only `page` never ends once the last page is reached
    while page < 3 and error < 1:
        try:
            html2 = driver.page_source
            html_list.append(html2)
            # Google detects suspicious behavior and asks to solve some puzzle after 7 iterations. Trying random sleep time and scrolling down to element.
            time.sleep(np.random.uniform(5, 12))
            # Go to next result page
            next_page = driver.find_element(By.CSS_SELECTOR, "#pnnext .NVbCr+ span") #CSS selector only last not previous page
            next_page.click()
            page += 1
        except Exception:  # no next-page link found (or the click failed)
            error += 1
    finished_searches += 1
    time.sleep(np.random.uniform(30,40))
    driver.quit()

  0%|                                                                                            | 0/4 [00:00<?, ?it/s]
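
A page limit and an end-of-results check have to be combined in one loop. If the page counter only increases after a successful click and the failure counter is never checked, the loop keeps re-saving the last page. The sketch below shows one way to bound the loop. The names `collect_pages` and `FakeResults` are illustrative assumptions, and the fake class stands in for a browser session so the logic can be tried without Selenium.

```python
def collect_pages(get_html, go_to_next, max_pages):
    """Collect the HTML of up to max_pages result pages.

    get_html:   callable returning the HTML of the current page
    go_to_next: callable that moves to the next page and raises an
                exception when there is none
    """
    pages = []
    while len(pages) < max_pages:
        pages.append(get_html())
        if len(pages) == max_pages:
            break
        try:
            go_to_next()
        except Exception:
            # No further page: stop instead of re-saving the same page
            break
    return pages


class FakeResults:
    """Stand-in for a browser session with a fixed number of result pages."""

    def __init__(self, n_pages):
        self.n_pages = n_pages
        self.current = 1

    def html(self):
        return f"<html>page {self.current}</html>"

    def next(self):
        if self.current >= self.n_pages:
            raise LookupError("no next page")
        self.current += 1


short = FakeResults(n_pages=2)
print(len(collect_pages(short.html, short.next, max_pages=3)))  # 2
long_run = FakeResults(n_pages=10)
print(len(collect_pages(long_run.html, long_run.next, max_pages=3)))  # 3
```

With Selenium, `get_html` could be `lambda: driver.page_source` and `go_to_next` a function that finds the `"#pnnext .NVbCr+ span"` element and clicks it.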

### Scrape HERE (First 3 pages)

In [51]:
html_list = []
finished_searches = 111
for i in tqdm.tqdm(search_terms[111:124]):
    driverService = Service(r"C:\Users\jgb569\OneDrive - University of Copenhagen\Documents\Webscraping\chromedriver.exe") 
    driver = webdriver.Chrome(service = driverService)
    # Go to google
    driver.get('https://google.com')
    # Discard cookie message, reject cookies
    cookie = driver.find_element(By.ID, "W0wltc")
    cookie.click()
    # Search for DR news articles
    gsearch = driver.find_element(By.CSS_SELECTOR, "input[title='Søg']")
    gsearch.send_keys(i)
    # Get HTML for first search result page
    html = driver.page_source
    html_list.append(html)
    # Go to next result page
    next_page = driver.find_element(By.CSS_SELECTOR, ".NVbCr+ span")  # CSS selector should match only the next page, not the previous one
    next_page.click()
    # Get HTML for second search result page
    html2 = driver.page_source
    html_list.append(html2)
    # Google detects suspicious behavior and asks to solve some puzzle after 7 iterations. Trying random sleep time and scrolling down to element.
    time.sleep(np.random.uniform(5, 12))
    # Go to the third result page (its HTML is not saved before the driver quits)
    next_page = driver.find_element(By.CSS_SELECTOR, "#pnnext .NVbCr+ span")  # CSS selector should match only the next page, not the previous one
    next_page.click()
    finished_searches += 1
    driver.quit()

100%|██████████████████████████████████████████████████████████████████████████████████| 13/13 [04:34<00:00, 21.12s/it]


### Further search instances

### Create link list

In [53]:
# Preparing two empty lists
link_list = []
link_list_clean = []
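# Iterating over the results from scraping
for l in html_list:
    soup = BeautifulSoup(l, 'lxml')
    try:
        links = soup.find('div', class_ = 'v7W49e').find_all('a', href=True)
    except AttributeError:
        # No result container found on this page; `links` keeps its value from the previous page
        pass
    # Generate list with all links
    for i in links:
        temp = i['href']
        link_list.append(temp)
    # Getting rid of noise: links pointing to Google infrastructure rather than DR.
    # This loop runs once per page over the whole of link_list, so link_list_clean
    # contains many repeats; they are removed later with drop_duplicates.
    for link in link_list:
        if "webcache.googleusercontent" not in link:
            link_list_clean.append(link)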
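
In the cell above, the filtering loop sits inside the loop over pages, so every page re-appends all links collected so far. That would explain why the raw counts are so much larger than the counts after dropping duplicates. A minimal sketch of filtering each link once while keeping the order follows. The function name `clean_links` and the example URLs are hypothetical placeholders, not scraped data.

```python
def clean_links(links_per_page, blocked=("webcache.googleusercontent",)):
    """Flatten per-page link lists, drop blocked links and repeats, keep order."""
    seen = set()
    cleaned = []
    for page_links in links_per_page:
        for link in page_links:
            # Skip links that point to Google infrastructure
            if any(fragment in link for fragment in blocked):
                continue
            # Keep only the first occurrence of each link
            if link not in seen:
                seen.add(link)
                cleaned.append(link)
    return cleaned

# Placeholder hrefs, one inner list per result page
pages = [
    ["https://www.dr.dk/nyheder/example-a", "https://webcache.googleusercontent.com/x"],
    ["https://www.dr.dk/nyheder/example-a", "https://www.dr.dk/nyheder/example-b"],
]
print(clean_links(pages))
# ['https://www.dr.dk/nyheder/example-a', 'https://www.dr.dk/nyheder/example-b']
```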

#### From list to DataFrame

In [54]:
dr_links_012 = pd.DataFrame({'links':link_list_clean})
print(len(dr_links_012))

4436


In [None]:
### For the different scraping sessions

In [55]:
dr_links_012_backup = dr_links_012.copy()
dr_links_012_clean = dr_links_012_backup.drop_duplicates(subset = 'links')
len(dr_links_012_clean)

244

* `dr_links_001` has 29525 instances. After dropping duplicates 713 links remain.
* `dr_links_002` has 713 instances. After dropping duplicates 212 links remain.
* `dr_links_003` has 14336 instances. After dropping duplicates 506 links remain.
* `dr_links_004` has ?? instances. After dropping duplicates 80 links remain. (20 per month? Sep-Dec 2012, but I thought it is iterating over 3 pages)
* `dr_links_005` has 2571 instances. After dropping duplicates 222 links remain. (20 per month? jan-nov 2013)
* `dr_links_006` has 1323 instances. After dropping duplicates 155 links remain. 
* `dr_links_007` has 9307 instances. After dropping duplicates 447 links remain.
* `dr_links_008` has 980 instances. After dropping duplicates 143 links remain.
* `dr_links_009` has 16689 instances. After dropping duplicates 530 links remain.
* `dr_links_010` has 17456 instances. After dropping duplicates 397 links remain.
* `dr_links_011` has 313 instances. After dropping duplicates 64 links remain.
* `dr_links_012` has 4436 instances. After dropping duplicates 244 links remain.

#### Save first part of dataset

In [None]:
dr_links_001_clean.to_csv("dr_links_001_clean.csv")

In [158]:
dr_links_002_clean.to_csv("dr_links_002_clean.csv")

In [164]:
dr_links_003_clean.to_csv("dr_links_003_clean.csv")

In [7]:
dr_links_004_clean.to_csv("dr_links_004_clean.csv")

In [13]:
dr_links_005_clean.to_csv("dr_links_005_clean.csv")

In [18]:
dr_links_006_clean.to_csv("dr_links_006_clean.csv")

In [26]:
dr_links_007_clean.to_csv("dr_links_007_clean.csv")

In [32]:
dr_links_008_clean.to_csv("dr_links_008_clean.csv")

In [37]:
dr_links_009_clean.to_csv("dr_links_009_clean.csv")

In [42]:
dr_links_010_clean.to_csv("dr_links_010_clean.csv")

In [52]:
dr_links_011_clean.to_csv("dr_links_011_clean.csv")

In [56]:
dr_links_012_clean.to_csv("dr_links_012_clean.csv")
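
Assuming the twelve session files are later combined into one list, the sketch below reads the `links` column from each file and keeps the first occurrence of every link. The function name `merge_link_files` is an illustrative assumption. The standard library `csv` module is used instead of pandas so the example has no dependencies.

```python
import csv

def merge_link_files(paths):
    """Read the 'links' column of each CSV file and return the unique links in order."""
    seen = set()
    merged = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                link = row["links"]
                if link not in seen:
                    seen.add(link)
                    merged.append(link)
    return merged

# File names follow the pattern used in the cells above
paths = [f"dr_links_{i:03d}_clean.csv" for i in range(1, 13)]
print(paths[0], paths[-1])  # dr_links_001_clean.csv dr_links_012_clean.csv
```

Calling `merge_link_files(paths)` would then return the links of all sessions without repeats across files.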

### Calling the scraping function

# Backups

In [None]:
# Repeated call to scraping function. But there is an issue with it. The pauses are not implemented, must be implemented in function itself

html_list_dr = []
finished_searches_list_dr = []
for i in range(0, 32):
    html_list, finished_searches = scrape_and_run(search_range[i], search_range[i+1])
    html_list_dr.append(html_list)
    finished_searches_list_dr.append(finished_searches)
    time.sleep(np.random.uniform(10,25))
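
`search_range` is not defined anywhere in the notebook. Since the loop reads `search_range[i]` and `search_range[i+1]` for `i` from 0 to 31, it needs at least 33 boundaries. One plausible reading, stated here as an assumption, is 32 batches of 4 over the 128 search terms. The helper name `batch_boundaries` is illustrative.

```python
def batch_boundaries(n_terms, n_batches):
    """Return n_batches + 1 indices that split range(n_terms) into consecutive slices."""
    base, extra = divmod(n_terms, n_batches)
    boundaries = [0]
    for batch in range(n_batches):
        # The first `extra` batches take one additional term
        size = base + (1 if batch < extra else 0)
        boundaries.append(boundaries[-1] + size)
    return boundaries

search_range = batch_boundaries(128, 32)
print(len(search_range))                   # 33
print(search_range[:3], search_range[-1])  # [0, 4, 8] 128
```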

In [None]:
html_list = []

for i in tqdm.tqdm(search_terms):
    driverService = Service(r"C:\Users\jgb569\OneDrive - University of Copenhagen\Documents\Webscraping\chromedriver.exe") 
    driver = webdriver.Chrome(service = driverService)
    # Go to google
    driver.get('https://google.com')
    # Discard cookie message, reject cookies
    cookie = driver.find_element(By.ID, "W0wltc")
    cookie.click()
    # Search for DR news articles
    gsearch = driver.find_element(By.CSS_SELECTOR, "input[title='Søg']")
    gsearch.send_keys(i)
    # Get HTML for first search result page
    html = driver.page_source
    html_list.append(html)
    # Go to next result page
    next_page = driver.find_element(By.CSS_SELECTOR, ".NVbCr+ span")  # CSS selector should match only the next page, not the previous one
    next_page.click()
    # Error flag used when reaching the last search page:
    #   error = 0: there is another result page.
    #   error = 1: there is no further page on Google, so the loop stops.
    error = 0
    while error < 1:
        try:
            html2 = driver.page_source
            html_list.append(html2)
            # Google detects suspicious behavior and asks to solve some puzzle after 7 iterations. Trying random sleep time and scrolling down to element.
            time.sleep(np.random.uniform(0.25, 3.5))
            # Go to next result page
            next_page = driver.find_element(By.CSS_SELECTOR, "#pnnext .NVbCr+ span") #CSS selector only last not previous page
            next_page.click()
        except Exception:  # no next-page link found (or the click failed): stop paging
            error += 1
    driver.quit()

In [None]:
def scrape_and_run(start, stop):
    '''
    This function scrapes google results from the search_terms list.
    It allows dividing the scraping into smaller parts, since the entire list does not run completely through.
    
    start: takes an index of the search_terms list
    stop: takes a higher index of the search_terms list
    '''
    html_list = []
    finished_searches = 0
    for i in tqdm.tqdm(search_terms[start:stop]):
        driverService = Service(r"C:\Users\jgb569\OneDrive - University of Copenhagen\Documents\Webscraping\chromedriver.exe") 
        driver = webdriver.Chrome(service = driverService)
        # Go to google
        driver.get('https://google.com')
        # Discard cookie message, reject cookies
        cookie = driver.find_element(By.ID, "W0wltc")
        cookie.click()
        # Search for DR news articles
        gsearch = driver.find_element(By.CSS_SELECTOR, "input[title='Søg']")
        gsearch.send_keys(i)
        # Get HTML for first search result page
        html = driver.page_source
        html_list.append(html)
        # Go to next result page
        next_page = driver.find_element(By.CSS_SELECTOR, ".NVbCr+ span")  # CSS selector should match only the next page, not the previous one
        next_page.click()
        # Error flag used when reaching the last search page:
        #   error = 0: there is another result page.
        #   error = 1: there is no further page on Google, so the loop stops.
        error = 0
        while error < 1:
            try:
                html2 = driver.page_source
                html_list.append(html2)
                # Google detects suspicious behavior and asks to solve some puzzle after 7 iterations. Trying random sleep time and scrolling down to element.
                time.sleep(np.random.uniform(3, 5))
                # Go to next result page
                next_page = driver.find_element(By.CSS_SELECTOR, "#pnnext .NVbCr+ span") #CSS selector only last not previous page
                next_page.click()
            except Exception:  # no next-page link found (or the click failed): stop paging
                error += 1
        finished_searches += 1
        driver.quit()
    return html_list, finished_searches

## Same as For loop

In [None]:
html_list = []
finished_searches = 0
for i in tqdm.tqdm(search_terms[start:stop]):
    driverService = Service(r"C:\Users\jgb569\OneDrive - University of Copenhagen\Documents\Webscraping\chromedriver.exe") 
    driver = webdriver.Chrome(service = driverService)
    # Go to google
    driver.get('https://google.com')
    # Discard cookie message, reject cookies
    cookie = driver.find_element(By.ID, "W0wltc")
    cookie.click()
    # Search for DR news articles
    gsearch = driver.find_element(By.CSS_SELECTOR, "input[title='Søg']")
    gsearch.send_keys(i)
    # Get HTML for first search result page
    html = driver.page_source
    html_list.append(html)
    # Go to next result page
    next_page = driver.find_element(By.CSS_SELECTOR, ".NVbCr+ span")  # CSS selector should match only the next page, not the previous one
    next_page.click()
    # Error flag used when reaching the last search page:
    #   error = 0: there is another result page.
    #   error = 1: there is no further page on Google, so the loop stops.
    error = 0
    while error < 1:
        try:
            html2 = driver.page_source
            html_list.append(html2)
            # Google detects suspicious behavior and asks to solve some puzzle after 7 iterations. Trying random sleep time and scrolling down to element.
            time.sleep(np.random.uniform(5, 10))
            # Go to next result page
            next_page = driver.find_element(By.CSS_SELECTOR, "#pnnext .NVbCr+ span") #CSS selector only last not previous page
            next_page.click()
        except Exception:  # no next-page link found (or the click failed): stop paging
            error += 1
    finished_searches += 1
    # Pause 30 to 45 seconds between searches (uniform(30-45) would pass a single negative bound)
    time.sleep(np.random.uniform(30, 45))
    driver.quit()