# Python 101 
## Part VIII. - II.

---

## Web Scraping - Part III.

### Dynamically generated pages

Dynamically generated pages cannot be scraped by simply downloading them, because the content that JavaScript generates is not present in the downloaded HTML. For this case there is another library called selenium. It needs a browser to operate: a browser is started and every operation is executed inside it. The directory containing the browser's driver executable (here chromedriver) must be on the `PATH` so that selenium can find it.
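
To illustrate why a plain download is not enough, here is a minimal sketch using only the standard library. The page in `STATIC_HTML` is made up for this example (it is not taken from any real site): its image is created by a script, so a parser that only reads the markup never sees it.

```python
from html.parser import HTMLParser

# Hypothetical page: the <img> only exists after the script runs in a browser.
STATIC_HTML = """
<html><body>
  <div class="post-container"></div>
  <script>
    var img = document.createElement('img');
    img.src = 'https://example.com/cat.jpg';
    document.querySelector('.post-container').appendChild(img);
  </script>
</body></html>
"""


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag found in the markup."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            self.sources.append(dict(attrs).get('src'))


collector = ImageCollector()
collector.feed(STATIC_HTML)
print(collector.sources)  # [] because the script was never executed
```

A real browser would run the script and insert the image, which is why selenium drives one.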

In [None]:
!conda install selenium -y

In [None]:
import os
from helpers import get_download_dir, chromedriver_download

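# Download chromedriver with the helper from the local helpers module
chromedriver_download()
# os.pathsep is ';' on Windows and ':' on Linux/macOS
os.environ['PATH'] += os.pathsep + get_download_dir()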

In [None]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

#### a) Simple lookup
- initialize the browser which will be used by the library

In [None]:
driver = webdriver.Chrome()

- request a page

In [None]:
driver.get('http://9gag.com/random')

- find items

In [None]:
try:
    # Image post: take the <img> inside the post container
    media = (
        driver
        .find_element_by_class_name('post-container')
        .find_element_by_tag_name('img')
        .get_attribute('src')
    )
except NoSuchElementException:
    # No <img> found, so treat it as a video post and take its <source>
    media = (
        driver
        .find_element_by_class_name('post-container')
        .find_element_by_tag_name('video')
        .find_element_by_tag_name('source')
        .get_attribute('src')
    )

print(media)

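Available finder methods (`find_element_*` returns the first match or raises `NoSuchElementException`; `find_elements_*` returns a list of all matches, which may be empty):
- `find_element_by_tag_name(tag)`
- `find_elements_by_tag_name(tag)`
- `find_element_by_class_name(class)`
- `find_elements_by_class_name(class)`
- `find_element_by_id(id)`
- `find_element_by_css_selector(css_selector)`
- `find_elements_by_css_selector(css_selector)`

These `find_element(s)_by_*` helpers belong to the Selenium 3 API and were removed in Selenium 4.3, where `find_element(by, value)` and `find_elements(by, value)` are used instead.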
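
The generic `find_element(by, value)` and `find_elements(by, value)` methods exist in both Selenium 3 and Selenium 4, so a thin wrapper around them works on either version. The sketch below is only an illustration: `find` and `STRATEGIES` are names invented here, and the strategy strings are written out as plain text (they are the values selenium exposes as `By.TAG_NAME`, `By.CLASS_NAME`, `By.ID` and `By.CSS_SELECTOR`).

```python
# Locator strategy strings; selenium exposes the same values as
# By.TAG_NAME, By.CLASS_NAME, By.ID and By.CSS_SELECTOR.
STRATEGIES = {
    'tag': 'tag name',
    'class': 'class name',
    'id': 'id',
    'css': 'css selector',
}


def find(context, kind, value, many=False):
    """Look up elements below `context` (a driver or an element).

    With many=False the first match is returned (selenium raises
    NoSuchElementException if there is none); with many=True a list
    of all matches is returned, which may be empty.
    """
    strategy = STRATEGIES[kind]
    if many:
        return context.find_elements(strategy, value)
    return context.find_element(strategy, value)
```

With an open browser, `find(driver, 'class', 'post-container')` would then stand in for `driver.find_element_by_class_name('post-container')`.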

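#### CSS selectors
- `tagname`: elements with the given tag
- `.classname`: elements with the given class
- `#id`: the element with the given id
- `[attribute=value]`: elements whose attribute has the given value

Selectors separated by a space match nested elements, e.g. `.post-container img` is an `img` inside an element with class `post-container`.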
In [None]:
try:
    media = (driver
             .find_element_by_css_selector('#individual-post .post-container img')
             .get_attribute('src'))
except NoSuchElementException:
    media = (driver
             .find_element_by_css_selector('#individual-post .post-container video source')
             .get_attribute('src'))
    
media
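
An alternative to the `try`/`except` pattern is to use the plural finder, which returns an empty list instead of raising. The helper below is a sketch under that assumption; `first_src` is a name made up here, and `CSS_SELECTOR` is the plain string that selenium exposes as `By.CSS_SELECTOR`.

```python
CSS_SELECTOR = 'css selector'  # same value as selenium's By.CSS_SELECTOR


def first_src(driver, selectors):
    """Return the src of the first element matched by any selector, or None."""
    for selector in selectors:
        matches = driver.find_elements(CSS_SELECTOR, selector)
        if matches:
            return matches[0].get_attribute('src')
    return None


# Possible use with an open browser:
# media = first_src(driver, [
#     '#individual-post .post-container img',
#     '#individual-post .post-container video source',
# ])
```

Adding another media type then only means adding one more selector to the list.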

#### b) Interaction with the site
- request the page

In [None]:
driver.get('https://444.hu/kereses')

- find search field

In [None]:
search_field = driver.find_element_by_css_selector('#content-main input[name=q]')

- fill in search query

In [None]:
search_field.send_keys('migráns')

- find submit button and click on it

In [None]:
submit_button = driver.find_element_by_css_selector('#content-main input[type=submit]')
submit_button.click()

- find related content

In [None]:
urls = []
for article in driver.find_elements_by_class_name('card'):
    urls.append(article.find_element_by_tag_name('a').get_attribute('href'))
len(urls)
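
The loop above raises `NoSuchElementException` as soon as one card contains no link. A more tolerant variant, sketched here with the invented helper `collect_links`, uses the plural finder and simply skips such cards. `TAG_NAME` is the plain string that selenium exposes as `By.TAG_NAME`.

```python
TAG_NAME = 'tag name'  # same value as selenium's By.TAG_NAME


def collect_links(cards):
    """Return the href of the first <a> in each card, skipping cards without one."""
    links = []
    for card in cards:
        anchors = card.find_elements(TAG_NAME, 'a')
        if not anchors:
            continue  # card without a link
        href = anchors[0].get_attribute('href')
        if href:
            links.append(href)
    return links
```

With an open browser this could be called as `collect_links(driver.find_elements('class name', 'card'))`.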

- handle infinite scrolling

In [None]:
import time

def scrolldown():
    """Scroll to the bottom repeatedly until the page height stops growing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return True
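
On a page that keeps loading content indefinitely, the loop above never finishes. One possible safeguard, shown here as a sketch rather than a drop-in replacement, is to cap the number of scrolls. The function name `scroll_to_bottom` and the parameters `pause` and `max_rounds` are assumptions of this example; the driver is passed in instead of being read from a global.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls that made the page grow.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    grew = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
        grew += 1
    return grew
```

A fixed pause is still a guess about how long loading takes; a slow connection may need a larger value.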

In [None]:
urls = []
button = True
while button:
    print('.', end='')  # progress indicator

    scrolldown()
    for article in driver.find_elements_by_class_name('card'):
        urls.append(article.find_element_by_tag_name('a').get_attribute('href'))
    try:
        # Load the next batch of results, if there is one
        button = driver.find_element_by_css_selector('a.infinity-next.button')
        button.click()
    except NoSuchElementException:
        # No "next" button left, so all results are loaded
        button = False

In [None]:
len(urls)
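
If the page appends new cards to the existing ones instead of replacing them, the loop collects the earlier cards again on every pass and the count above includes duplicates. Assuming that is the case, the duplicates can be dropped while keeping the original order. `unique_in_order` is a helper name chosen for this sketch, and the sample list is made up.

```python
def unique_in_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(items))


collected = ['/a', '/b', '/a', '/c', '/b']
print(unique_in_order(collected))  # ['/a', '/b', '/c']
```

Applied to the notebook's data this would be `urls = unique_in_order(urls)`.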

#### Exercise:
Search for a specific car brand on hasznaltauto.hu and list the URLs of the cars on the first results page.