# Part II. Webscraping

---

## 1. Obtaining a webpage

The easiest way is to use a third-party library called __`requests`__.

In [None]:
import requests

We simply ask a server to give us an HTML document by requesting it through a URL.

In [None]:
existing_url = 'http://localhost:8000/test.html'
response = requests.get(existing_url)
print(response.status_code) # hopefully 200 -> successful download

In [None]:
not_existing_url = 'http://localhost:8000/test1.html'
response = requests.get(not_existing_url)
print(response.status_code) # unfortunately 404 -> not found

__Common status codes:__
- 200: success
- 301: permanent redirect
- 303: see other (redirect)
- 400: bad request
- 401: unauthorized
- 404: not found
- 500: internal server error
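
The official reason phrases for these codes can be looked up without any network access. A small sketch using the standard library's `http.HTTPStatus` (the `phrases` name is only illustrative):

```python
from http import HTTPStatus

# Map each status code from the list above to its standard reason phrase.
codes = (200, 301, 303, 400, 401, 404, 500)
phrases = {code: HTTPStatus(code).phrase for code in codes}

for code, phrase in phrases.items():
    print(code, phrase)
```

This can be handy when logging failed downloads, since `response.status_code` is a plain integer.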

In [None]:
response = requests.get(existing_url)
print(response.content.decode('utf-8'))

## 2. Parsing

There is a third-party module for this purpose called __`BeautifulSoup`__.

In [None]:
from bs4 import BeautifulSoup

Then create a soup from the downloaded document.

In [None]:
document = response.content
soup = BeautifulSoup(document, 'html.parser')

In [None]:
print(soup.prettify())

With the created soup (which is a parsed document) we can easily access any part of the document.  
With it we can:
- get the title of the document

In [None]:
print(soup.title)
print(type(soup.title))

- get the title text

In [None]:
print(soup.title.get_text())
print(type(soup.title.get_text()))

- get the text-only version of the page

In [None]:
print(soup.get_text())

- get all the links from the document

In [None]:
soup.find_all('a')

- get the actual urls from the tags

In [None]:
for url in soup.find_all('a'):
    print(url.get('href'))

During scraping, there are a lot of different tasks that must be solved in order to get the data we need. 
In this case, the demo document has important and unimportant parts. We only need the important parts.   
#### a) Let's find the important links!

In [None]:
important_urls = []
for url in soup.find_all('a'):
    href = url.get('href')  # None when the tag has no href attribute
    if href and 'important_part' in href:
        important_urls.append(href)
print(important_urls)

#### b) Find the important text in the document
- select every paragraph which has "important" class

In [None]:
soup.find_all('p', class_='important')

- Whoops, something is off! Investigate!

In [None]:
important_paragraphs = soup.find_all('p', class_='important')

- print the text of each tag and the `id` attribute of its parent

In [None]:
for p in important_paragraphs:
    print(p.get_text(), '>', p.parent.get('id'))

- We can see that the "fake" result comes from somewhere else

In [None]:
soup.find(id='not_main_section')

- We have a hidden fake section! Let's modify our search!

In [None]:
soup.find(id='main_content').find_all('p', class_='important')

#### c) Find the pictures of our interest
- Get the "nice" pictures from the **`div`** with **`random_images_1`** class!

In [None]:
(
    soup
    .find(id='main_content')
    .find('div', class_='random_images_1')
    .find_all('img', class_='nice')
)

- Whoops again. Filter out the result we don't like.

In [None]:
imgs = (
    soup
    .find(id='main_content')
    .find('div', class_='random_images_1')
    .find_all('img', class_='nice')
)
nice_imgs = []
for img in imgs:
    if 'not' not in img.get('class'):
        nice_imgs.append(img.get('src'))
print(nice_imgs)

Most important methods:
- `.find(tag, id, class_, attrs)`
- `.find_all(tag, id, class_, attrs)`
- `.get(attribute)`
- `.get_text()`
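
A compact recap of these four methods on a made-up HTML snippet (the snippet, the `demo_soup` name and the file names are assumptions for illustration, not the contents of `test.html`). It also shows why `class_='nice'` can return unwanted tags: `class` is a multi-valued attribute, so a tag with `class="not nice"` matches as well.

```python
from bs4 import BeautifulSoup

html = """
<div id="main_content">
  <p class="important">Keep me</p>
  <p>Skip me</p>
  <a href="/important_part/1">first</a>
  <img class="nice" src="a.png">
  <img class="not nice" src="b.png">
</div>
<div id="not_main_section">
  <p class="important">Fake</p>
</div>
"""

demo_soup = BeautifulSoup(html, 'html.parser')

# .find() returns the first match, .find_all() returns every match.
main = demo_soup.find(id='main_content')
important_texts = [p.get_text() for p in main.find_all('p', class_='important')]

# .get() reads an attribute; 'class' comes back as a list of class names.
link_targets = [a.get('href') for a in main.find_all('a')]
matched_classes = [img.get('class') for img in main.find_all('img', class_='nice')]

print(important_texts)
print(link_targets)
print(matched_classes)
```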

#### Exercise:
- Find the text of every **visible** headline (`h1`...`h6`) and subtitle

---

## 3. Querying webpages 

Collect the articles about migrants from index.hu.

This requires searching the site.
In the upper-left corner there is a search icon. Use it, and observe the resulting URL:

`https://index.hu/24ora/?tol=1999-01-01&ig=2018-04-11&word=1&s=migráns`

It has multiple parts:
- `https://` - protocol
- `index.hu` - base URL
- `/24ora/` - sub URL (path)
- `?tol=1999-01-01&ig=2018-04-11&word=1&s=migráns` - query

Let's investigate the query part a little more!  
Every query starts with a __`?`__ character followed by one or more key-value pairs. The key-value pairs are separated with the __`&`__ character. Based on this information, we can extract the query parameters:
- `tol`
- `ig`
- `word`
- `s`
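
The same decomposition can be done in code. A minimal sketch using the standard library's `urllib.parse` (the variable names are only illustrative):

```python
from urllib.parse import urlsplit, parse_qs

search_url = 'https://index.hu/24ora/?tol=1999-01-01&ig=2018-04-11&word=1&s=migráns'

# Split the URL into protocol, host, path and query string.
parts = urlsplit(search_url)
print(parts.scheme, parts.netloc, parts.path)

# parse_qs turns the query string into a dict; every value is a list,
# because a key may appear more than once in a query.
parsed_query = parse_qs(parts.query)
print(parsed_query)
```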

Use these values to construct our own request:

In [None]:
base_url = 'http://index.hu'
sub_url = '/24ora'
query = {
    'tol': '1999-01-01',
    'ig': '2018-04-11',
    'word': 1,
    's': 'migráns'
}

We can use the requests library to send the query:

In [None]:
resp = requests.get(url=base_url+sub_url, params=query) # `params` builds the query string; `data` would put the values in the request body
resp
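
To see what the query string built from such a dictionary looks like without sending anything, the standard library's `urlencode` can be used. This is only a rough offline sketch; `demo_query` and `full_url` are illustrative names, and the exact URL that `requests` produces may differ in small details.

```python
from urllib.parse import urlencode

demo_query = {
    'tol': '1999-01-01',
    'ig': '2018-04-11',
    'word': 1,
    's': 'migráns',
}

# Non-ASCII characters are percent-encoded as UTF-8 bytes.
query_string = urlencode(demo_query)
full_url = 'http://index.hu/24ora' + '?' + query_string
print(full_url)
```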

#### Exercise:
- Using the response, extract the urls inside the `<article>` tags!

You can see that only 30 results showed up. To get more, we can split the search into shorter time spans by setting the __`tol`__ and __`ig`__ parameters from a formattable string: __`'{year}-{month:0>2}-{day:0>2}'`__. This string can be formatted by providing the required parameters:
- year
- month
- day

like this:

In [None]:
'{year}-{month:0>2}-{day:0>2}'.format(year=2016, month=1, day=1)

The standard library has a useful module called __`datetime`__. You can use it to generate dates automatically.

In [None]:
import datetime

date = datetime.date(1999, 1, 1)
day_after_date = date + datetime.timedelta(days=1)
day_before_date = date - datetime.timedelta(days=1)
today = datetime.date.today()

print(day_before_date)
print(date)
print(day_after_date)
print(today)

print(today.year, today.month, today.day)

Create a loop which iterates through every day from 1999-01-01 until today and executes the same procedure you created previously. (Pro tip: create a function!) Observe the number of results!
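
One possible shape for the date iteration, leaving out the download and parsing steps. The helper names `iter_days` and `day_query` are made up for this sketch, and using the same day for both `tol` and `ig` is an assumption about how the search treats its bounds.

```python
import datetime

def iter_days(start, end):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += datetime.timedelta(days=1)

def day_query(day, word='migráns'):
    """Build the query parameters for a search restricted to one day."""
    stamp = '{year}-{month:0>2}-{day:0>2}'.format(
        year=day.year, month=day.month, day=day.day)
    return {'tol': stamp, 'ig': stamp, 'word': 1, 's': word}

# Try it on a short range first; swap the end date for datetime.date.today().
first_days = list(iter_days(datetime.date(1999, 1, 1), datetime.date(1999, 1, 3)))
for day in first_days:
    print(day_query(day))
```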

---

## 4. User agents

Let's pretend to be a browser instead of a script.

In [None]:
USER_AGENTS = [
    # Chrome
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36',
    # Firefox
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0',
    # Opera
    'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
    # Safari
    'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25',
    # Internet Explorer, probably a good idea to leave this one out...
    'Mozilla/5.0 (compatible; MSIE 10.6; Windows NT 6.1; Trident/5.0; InfoPath.2; SLCC1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET CLR 2.0.50727) 3gpp-gba UNTRUSTED/1.0',
]

Let's write a wrapper function to handle the user-agent string.

In [None]:
import random
def get_header(agents):
    return {'User-agent': random.choice(agents)}

#### Exercise:
Get the main articles from index.hu. Write a function that extracts and prints the current main articles! For each article, the output should contain:
- the title
- the article text
- the url
- every picture from the article

In [None]:
url = 'http://index.hu'
index_response = requests.get(url, headers=get_header(USER_AGENTS))

---

## 5. Dynamically generated pages

Dynamically generated pages cannot be parsed by simply downloading them, because the generated content is not present in the downloaded HTML. For this case there is another library called `selenium`. It requires a browser to operate: a browser is started and every operation is executed inside it. The browser's driver executable must be findable on the `PATH` in order to use it.

In [None]:
import os

download_dir = os.path.expanduser('~')
download_dir = os.path.join(download_dir, 'Downloads')

os.environ['PATH'] += os.pathsep + download_dir  # os.pathsep is ';' on Windows and ':' elsewhere
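
Before starting the browser, it may help to check whether the driver can actually be found. A small sketch, assuming the Chrome driver with its usual executable name `chromedriver`:

```python
import shutil

# shutil.which searches the directories listed in PATH, like a shell would.
# 'chromedriver' is an assumption; other browsers use other driver names.
driver_path = shutil.which('chromedriver')

if driver_path is None:
    print('chromedriver was not found on PATH')
else:
    print('chromedriver found at', driver_path)
```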

In [None]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

#### a) Simple lookup
- initialize the browser which will be used by the library

In [None]:
driver = webdriver.Chrome()

- request a page

In [None]:
driver.get('http://9gag.com/random')

- find items

In [None]:
try:
    media = (
        driver
        .find_element_by_class_name('post-container')
        .find_element_by_tag_name('img')
        .get_attribute('src')
    )
except NoSuchElementException:
    media = (
        driver
        .find_element_by_class_name('post-container')
        .find_element_by_tag_name('video')
        .find_element_by_tag_name('source')
        .get_attribute('src')
    )
    
print(media)

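Available finder methods (Selenium 3 API; Selenium 4 deprecated and later removed them in favor of `find_element(By.<STRATEGY>, value)` and `find_elements(By.<STRATEGY>, value)`):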
- `find_element_by_tag_name(tag)`
- `find_elements_by_tag_name(tag)`
- `find_element_by_class_name(class)`
- `find_elements_by_class_name(class)`
- `find_element_by_id(id)`
- `find_element_by_css_selector(css_selector)`
- `find_elements_by_css_selector(css_selector)`

#### CSS selectors
- `tagname`
- `.classname`
- `#id`
- `[attribute=value]`
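
These selectors can be tried out without a browser, because BeautifulSoup's `.select()` accepts the same CSS syntax. The HTML snippet and the `css_soup` name below are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div id="content-main">
  <form>
    <input name="q" type="text">
    <input type="submit" value="Search">
  </form>
  <p class="card">first</p>
  <p class="card">second</p>
</div>
"""

css_soup = BeautifulSoup(html, 'html.parser')

by_tag = css_soup.select('p')                    # tagname
by_class = css_soup.select('.card')              # .classname
by_id = css_soup.select('#content-main')         # #id
by_attribute = css_soup.select('input[name=q]')  # [attribute=value]

# Selectors separated by a space match descendants of the first part.
combined = css_soup.select('#content-main input[type=submit]')

print(len(by_tag), len(by_class), len(by_id), len(by_attribute), len(combined))
```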

In [None]:
driver.find_element_by_css_selector('#individual-post .post-container video source').get_attribute('src')

#### b) Interaction with the site
- request the page

In [None]:
driver.get('https://444.hu/kereses')

- find search field

In [None]:
search_field = driver.find_element_by_css_selector('#content-main input[name=q]')

- fill in search query

In [None]:
search_field.send_keys('migráns')

- find submit button and click on it

In [None]:
submit_button = driver.find_element_by_css_selector('#content-main input[type=submit]')
submit_button.click()

- find related content

In [None]:
urls = []
for article in driver.find_elements_by_class_name('card'):
    urls.append(article.find_element_by_tag_name('a').get_attribute('href'))
len(urls)

- a solution for infinite scrolling

In [None]:
import time

def scrolldown():
    lastHeight = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)
        newHeight = driver.execute_script("return document.body.scrollHeight")
        if newHeight == lastHeight:
            break
        lastHeight = newHeight
    return True
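
On a page that keeps loading new content, the loop above may run for a very long time. A hedged variant with an upper bound on the number of rounds is sketched below; the function name `scroll_to_bottom` and its `pause` and `max_rounds` parameters are made up for this example. It takes the driver as an argument instead of relying on a global.

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scroll rounds that were performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load new content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no  # nothing new was loaded, we reached the bottom
        last_height = new_height
    return max_rounds
```

With a running browser it would be called as `scroll_to_bottom(driver)`.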

In [None]:
scrolldown()
urls = []
for article in driver.find_elements_by_class_name('card'):
    urls.append(article.find_element_by_tag_name('a').get_attribute('href'))

In [None]:
len(urls)

#### Exercise:
Search for a specific brand of car on hasznaltauto.hu and list the car URLs from the first page.

## Further reading

- [web scraping tutorial](https://hackernoon.com/web-scraping-tutorial-with-python-tips-and-tricks-db070e70e071)
- [selenium with python blogpost](https://realpython.com/blog/python/modern-web-automation-with-python-and-selenium/)
- [another selenium blogpost](https://medium.com/@hoppy/how-to-test-or-scrape-javascript-rendered-websites-with-python-selenium-a-beginner-step-by-c137892216aa)