# Part II. Webscraping

---

### 1. Obtaining a webpage

The easiest way is to use a third-party library called __`requests`__.

In [None]:
import requests

We simply ask a server for an HTML document by requesting it through a URL.

In [None]:
existing_url = 'http://localhost:8000/test.html'
response = requests.get(existing_url)
print(response.status_code) # hopefully 200 -> successful download

In [None]:
not_existing_url = 'http://localhost:8000/test1.html'
response = requests.get(not_existing_url)
print(response.status_code) # unfortunately 404 -> not found

__Common status codes:__
- 200: success
- 301: permanent redirect
- 303: see other (redirect)
- 400: bad request
- 401: unauthorized
- 404: not found
- 500: internal server error
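
The official names of these codes do not have to be memorized. A minimal sketch, assuming only Python's standard library: the `http.HTTPStatus` enum maps a numeric code to its standard phrase. The helper name `describe_status` is introduced here purely for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not part of the standard set known to Python.
        return f'{code}: unknown status code'
    return f'{code}: {status.phrase}'


# Look up the codes from the list above.
for code in (200, 301, 303, 400, 401, 404, 500):
    print(describe_status(code))
```

With `requests`, the same number is available as `response.status_code`, so a call such as `describe_status(response.status_code)` would label the result of a download.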

In [None]:
response = requests.get(existing_url)
print(response.content.decode('utf-8'))

### 2. Parsing

There is a third-party module for this purpose called __`BeautifulSoup`__.

In [None]:
from bs4 import BeautifulSoup

Then create a soup from the downloaded document.

In [None]:
document = response.content
soup = BeautifulSoup(document, 'html.parser')

In [None]:
print(soup.prettify())

With the created soup (which is a parsed document) we can easily access any part of the document.  
It is able to:
- get the title of the document

In [None]:
print(soup.title)
print(type(soup.title))

- get the title text

In [None]:
print(soup.title.get_text())
print(type(soup.title.get_text()))

- get the text-only version of the page

In [None]:
print(soup.get_text())

- get all the links from the document

In [None]:
soup.find_all('a')

- get the actual urls from the tags

In [None]:
for url in soup.find_all('a'):
    print(url.get('href'))

In [None]:
soup.find()

Most important methods:
- `.find(name, attrs, id=..., class_=...)`: returns the first matching tag, or `None`
- `.find_all(name, attrs, id=..., class_=...)`: returns a list of every matching tag
- `.get(attribute)`
- `.get_text()`
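
To see these methods side by side without downloading anything, here is a small sketch on a hand-written HTML snippet. The snippet, its ids (`main`, `side`) and the variable names are made up for illustration and are not part of the test page used above.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, used only for illustration.
html = '''
<html><head><title>Demo</title></head>
<body>
  <div id="main"><p class="important">First</p><a href="/a.html">A</a></div>
  <div id="side"><p class="important">Second</p><a href="/b.html">B</a></div>
</body></html>
'''
demo_soup = BeautifulSoup(html, 'html.parser')

first_p = demo_soup.find('p', class_='important')         # first match only
all_p = demo_soup.find_all('p', class_='important')       # list of all matches
main_div = demo_soup.find(id='main')                      # lookup by id
side_div = demo_soup.find('div', attrs={'id': 'side'})    # same idea via attrs
links = [a.get('href') for a in demo_soup.find_all('a')]  # attribute values

print(first_p.get_text())
print(len(all_p))
print(main_div.find('a').get('href'))
print(side_div.get_text())
print(links)
```

Note that `.find()` and `.find_all()` can also be called on a tag that was found earlier (as with `main_div` here), which restricts the search to that tag's descendants.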

#### b) Find the important text in the document
- select every paragraph which has "important" class

In [None]:
soup.find_all('p', class_='important')

- Whoops, something's going on! Investigate!

In [None]:
important_paragraphs = soup.find_all('p', class_='important')

- print the text in the tags, and tags' parent's id attribute

In [None]:
for p in important_paragraphs:
    print(p.get_text(), '>', p.parent.get('id'))

- We can see that the "fake" result comes from somewhere else

In [None]:
soup.find(id='not_main_section')

- We have a hidden fake section! Let's modify our search!

In [None]:
soup.find(id='main_content').find_all('p', class_='important')

#### c) Find the pictures of our interest
- Get the "nice" pictures from the **`div`** with **`random_images_1`** class!

In [None]:
(
    soup
    .find(id='main_content')
    .find('div', class_='random_images_1')
    .find_all('img', class_='nice')
)

- Whoops again. Filter out the result we don't like.

In [None]:
imgs = (
    soup
    .find(id='main_content')
    .find('div', class_='random_images_1')
    .find_all('img', class_='nice')
)
nice_imgs = []
for img in imgs:
    if 'not' not in img.get('class'):
        nice_imgs.append(img.get('src'))
print(nice_imgs)
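
Why does the `'not' not in img.get('class')` check work? A short sketch on a made-up two-image snippet (the file names are placeholders): with the `html.parser` backend, BeautifulSoup returns the `class` attribute as a list of class names, and `class_='nice'` matches any tag that has `nice` among its classes.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one image has the single class "nice",
# the other has two classes, "not" and "nice".
html = '<img class="nice" src="a.png"><img class="not nice" src="b.png">'
demo_soup = BeautifulSoup(html, 'html.parser')

# Both images match, because each has "nice" among its classes.
matches = demo_soup.find_all('img', class_='nice')
for img in matches:
    print(img.get('class'), img.get('src'))

# Membership test on the list of classes, not a substring test on a string.
nice_only = [img.get('src') for img in matches if 'not' not in img.get('class')]
print(nice_only)
```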

### Exercise

- Find the text of every **visible** headline (`h1`...`h6`) and subtitle

---

### 3. Querying webpages 

Collect the articles about migrants from index.hu

This requires searching the site.
In the upper-left corner, there is a search icon. Use it, and observe the resulting url:

`https://index.hu/24ora/?tol=1999-01-01&ig=2018-04-11&word=1&s=migráns`

It has multiple parts:
- `https://` - protocol
- `index.hu` - base url
- `/24ora/` - sub url
- `?tol=1999-01-01&ig=2018-04-11&word=1&s=migráns` - query

Let's investigate the query part a little more!  
Every query starts with a __`?`__ character followed by one or more key-value pairs. The key-value pairs are separated by the __`&`__ character, and each key is separated from its value by __`=`__. Based on this information, we can extract the query parameters:
- `tol`
- `ig`
- `word`
- `s`
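
The same decomposition can be done programmatically. A minimal sketch using the standard library's `urllib.parse` on the search url shown above; the variable names `parts` and `query_params` are introduced here for illustration.

```python
from urllib.parse import urlsplit, parse_qs

url = 'https://index.hu/24ora/?tol=1999-01-01&ig=2018-04-11&word=1&s=migráns'

# Split the url into its main components.
parts = urlsplit(url)
print(parts.scheme)  # protocol
print(parts.netloc)  # base url (host)
print(parts.path)    # sub url (path)
print(parts.query)   # raw query string, without the leading '?'

# Turn the query string into a dictionary of key -> list of values.
query_params = parse_qs(parts.query)
for key, values in query_params.items():
    print(key, '=', values)
```

Each value is a list because a key may legally appear more than once in a query string.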

Use these values to construct our own request:

In [None]:
base_url = 'http://index.hu'
sub_url = '/24ora'
query = {
    'tol': '1999-01-01',
    'ig': '2018-04-11',
    'word': 1,
    's': 'migráns'
}

We can use the requests library to send the query:

In [None]:
resp = requests.get(url=base_url+sub_url, params=query) # `params` is encoded into the url's query string; `data` would go into the request body instead
resp
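
For a GET request, the key-value pairs end up in the url's query string. A minimal sketch, using only the standard library's `urllib.parse.urlencode`, of what the encoded url looks like for the dictionary above. The name `full_url` is introduced here for illustration, and `requests` may differ in small encoding details.

```python
from urllib.parse import urlencode

base_url = 'http://index.hu'
sub_url = '/24ora'
query = {
    'tol': '1999-01-01',
    'ig': '2018-04-11',
    'word': 1,
    's': 'migráns'
}

# urlencode joins the pairs with '&' and percent-encodes
# characters that are not allowed in a url (here: 'á').
full_url = base_url + sub_url + '?' + urlencode(query)
print(full_url)
```

Printing such a url before sending the request is a cheap way to check that the query matches the one observed in the browser.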

Using the response, extract the urls inside the `<article>` tags!

You can see that only 30 results showed up. We can narrow each query to a shorter time span by replacing the values of the __`tol`__ and __`ig`__ parameters with a formattable string: __`'{year}-{month:0>2}-{day:0>2}'`__. This string can be formatted by providing the required parameters:
- year
- month
- day

like this:

In [None]:
'{year}-{month:0>2}-{day:0>2}'.format(year=2016, month=1, day=1)

There is a useful standard-library module called __`datetime`__. You can use it to generate dates automatically.

In [None]:
import datetime

date = datetime.date(1999, 1, 1)
day_after_date = date + datetime.timedelta(days=1)
day_before_date = date - datetime.timedelta(days=1)
today = datetime.date.today()

print(day_before_date)
print(date)
print(day_after_date)
print(today)

print(today.year, today.month, today.day)

Create a loop which iterates through every day from 1999-01-01 until today and executes the same procedure you created previously. (Pro tip: create a function!) Observe the number of results!
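
One possible shape for the date part of that loop, as a sketch only: a generator that combines `datetime.timedelta` with the format string from above. The name `daily_dates` is hypothetical, and the example uses a short range so that the output stays readable; the downloading and parsing steps are left to you.

```python
import datetime


def daily_dates(start, end):
    """Yield one formatted date string per day, from start to end inclusive."""
    template = '{year}-{month:0>2}-{day:0>2}'
    day = start
    while day <= end:
        yield template.format(year=day.year, month=day.month, day=day.day)
        day += datetime.timedelta(days=1)


# A short range that crosses a year boundary.
dates = list(daily_dates(datetime.date(1999, 12, 30), datetime.date(2000, 1, 2)))
print(dates)
```

Each yielded string could then be used as both the `tol` and the `ig` value to query a single day at a time.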

---

### 4. User agents

Let's pretend to be a browser instead of a script by sending a browser-like `User-Agent` header with each request.

In [None]:
USER_AGENTS = [
    # Chrome
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36',
    # Firefox
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0',
    # Opera
    'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
    # Safari
    'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25',
    # Internet Explorer, probably a good idea to leave this one out...
    'Mozilla/5.0 (compatible; MSIE 10.6; Windows NT 6.1; Trident/5.0; InfoPath.2; SLCC1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET CLR 2.0.50727) 3gpp-gba UNTRUSTED/1.0',
]

Let's write a wrapper function to handle the user-agent string.

In [None]:
import random
def get_header(agents):
    return {'User-agent': random.choice(agents)}

### 5. Exercises

#### 1. Get the main articles from index.hu
Write a function that extracts and prints the current main articles! For each article, the output should contain:
- the title
- the article text
- the url
- every picture from the article

In [None]:
url = 'http://index.hu'
index_response = requests.get(url, headers=get_header(USER_AGENTS))

#### 2. Get the articles about migration from 444.hu

Write a function that prints the titles of the articles

In [None]:
url = 'https://444.hu/kereses'
query = '?q=migrans'