# Python 101 
## Part IX.

---

## Web Scraping - Part III.

### I. [SelectorGadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)

__Makes it easier to select the proper content from a website: the parts you need, and only those.__

1. Click on the SelectorGadget icon to activate it. It is located in the upper right corner.
2. Right after clicking it, a bar will appear in the bottom right corner of the Chrome window. You will also notice that as you move the cursor, things get framed. Do not panic, this is normal!
![frame](pics/bar.png)
3. You will probably want to get multiple instances of the same type of content (e.g. pictures from the main page of telex.hu). 
4. Rules for selection:
 - First click to mark an instance of the type of content you like
  ![example](pics/example_selector.png) <br></br>
 - The same type of content will also be framed. If there is something you want to exclude (e.g. the telex logo at the top or the tiny weather icon), click on one of them. Starting with the second click, you may exclude anything. The program is smart enough to figure out that if you did not want the telex logo, it is likely that you will want to exclude the weather icon as well. Therefore, it is going to be removed automatically.<br></br>
   ![example](pics/good_state.png)
- In the bottom right corner, you will see the magic selector (`.article_title img`) that matches all the content you want. Pass it to `soup.select()` to get a list of matching elements.

Let's try it!

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
url = "https://telex.hu"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser") # name the parser explicitly to avoid a warning

So far, this is business as usual. Let's get the pictures!

In [None]:
image_list = []
for image in soup.select(".article_title img"): # select will always return a list
    image_list.append(url + image.get("src")) # prefix is needed
image_list

Ooooor the way cool kids do it. List comprehension:

In [None]:
[url + image.get("src") for image in soup.select(".article_title img")]

![coolkids](https://a.wattpad.com/cover/163492905-352-k572763.jpg)
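
Gluing the prefix on with `+` works as long as every `src` is a relative path. If a page mixes relative and absolute addresses, `urljoin` from the standard library may be a safer choice. The snippet below is only a sketch: the two `src` values are made up for illustration and are not taken from telex.hu.

```python
from urllib.parse import urljoin

base_url = "https://telex.hu"

# Hypothetical src values of the kind an <img> tag may carry
sources = [
    "/assets/images/cover.jpg",           # relative to the site root
    "https://cdn.example.com/photo.jpg",  # already absolute
]

# urljoin adds the prefix only when the address is relative
image_urls = [urljoin(base_url, src) for src in sources]
print(image_urls)
```

With plain concatenation the second address would end up with the prefix in front of an already complete URL.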

#### Exercise I:
Search for a specific brand of car on hasznaltauto.hu and list the car URLs from the __first page__.

#### Exercise II:
Get some pieces of information on the real estate market of Budapest. Check out all the houses on [ingatlan.com](https://ingatlan.com/lista/elado+lakas+budapest) and get the following content for the __first page__.
- Price
- Unit price (displayed in _Ft/m2_)
- Number of rooms
- Area (displayed in _m2_)

Make sure you select a proper format for storing these variables! Printing them is not enough, save them!
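
One possible way to store such records (a sketch, not the only proper format) is a list of dictionaries that gets written out as CSV. The listings below are invented placeholders, and the column names are assumptions; the scraping itself is still up to you.

```python
import csv
import io

# Hypothetical listings; in the exercise these values come from the scraped page
listings = [
    {"price": "45.9 M Ft", "unit_price": "850 000 Ft/m2", "rooms": "2", "area": "54 m2"},
    {"price": "72 M Ft", "unit_price": "1 028 571 Ft/m2", "rooms": "3", "area": "70 m2"},
]

# An in-memory buffer keeps the example self-contained; to write a real file, use
# open("listings.csv", "w", newline="", encoding="utf-8") instead
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["price", "unit_price", "rooms", "area"])
writer.writeheader()
writer.writerows(listings)

csv_text = buffer.getvalue()
print(csv_text)
```

Keeping one dictionary per listing means a missing value in one listing does not shift the others out of place.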

### II. Dynamically generated pages

Dynamically generated pages cannot be parsed by simply downloading them, since the generated content won't be present in the downloaded HTML. For this case there is another library called selenium. It requires a browser to operate: a browser is started and every operation is executed inside it. The path of the browser driver (here chromedriver) must be set in order to use it.
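
To see why downloading is not enough, here is a tiny sketch with a made-up page source. The image only comes into existence when a browser runs the script, so a parser looking at the raw HTML finds no `<img>` tag at all. The page and the `ImageCollector` class are invented for this illustration; only the standard library is used.

```python
from html.parser import HTMLParser

# Hypothetical page source: the image is only created once the script runs in a browser
raw_html = """
<html><body>
  <div id="posts"></div>
  <script>
    document.getElementById("posts").innerHTML = "<img src='cat.jpg'>";
  </script>
</body></html>
"""

class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in the markup."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))

collector = ImageCollector()
collector.feed(raw_html)
print(collector.sources)  # no images: the script body is plain text to the parser
```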

In [None]:
!conda install selenium -y

In [None]:
import os
from helpers import get_download_dir, chromedriver_download

chromedriver_download()
os.environ['PATH'] += os.pathsep + get_download_dir()  # os.pathsep is ';' on Windows, ':' elsewhere

In [None]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

#### a) Simple lookup
- initialize the browser which will be used by the library

In [None]:
driver = webdriver.Chrome()

- request a page

In [None]:
driver.get('http://9gag.com/random')

- find items

In [None]:
try:
    media = (
        driver
        .find_element_by_class_name('post-container')
        .find_element_by_tag_name('img')
        .get_attribute('src')
    )
except NoSuchElementException:
    media = (
        driver
        .find_element_by_class_name('post-container')
        .find_element_by_tag_name('video')
        .find_element_by_tag_name('source')
        .get_attribute('src')
    )
    
print(media)

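Available finder methods (Selenium 3 style; Selenium 4.3 and later removed them in favour of `find_element(By.CSS_SELECTOR, css_selector)` and its siblings):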
- `find_element_by_tag_name(tag)`
- `find_elements_by_tag_name(tag)`
- `find_element_by_class_name(class)`
- `find_elements_by_class_name(class)`
- `find_element_by_id(id)`
- `find_element_by_css_selector(css_selector)`
- `find_elements_by_css_selector(css_selector)`

#### CSS selectors
- `tagname`
- `.classname`
- `#id`
- `[attribute=value]`

In [None]:
try:
    media = (driver
             .find_element_by_css_selector('#individual-post .post-container img')
             .get_attribute('src'))
except NoSuchElementException:
    media = (driver
             .find_element_by_css_selector('#individual-post .post-container video source')
             .get_attribute('src'))
    
media

#### b) Interaction with the site
- request the page

In [None]:
driver.get('https://444.hu/kereses')

- find search field

In [None]:
search_field = driver.find_element_by_css_selector('#content-main input[name=q]')

- fill in search query

In [None]:
search_field.send_keys('migráns')

- find submit button and click on it

In [None]:
submit_button = driver.find_element_by_css_selector('#content-main input[type=submit]')
submit_button.click()

- find related content

In [None]:
urls = []
for article in driver.find_elements_by_class_name('card'):
    urls.append(article.find_element_by_tag_name('a').get_attribute('href'))
len(urls)

- solution for infinite scrolldown

In [None]:
import time

def scrolldown():
    # Keep scrolling until the page height stops growing
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)  # wait for new content to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return True
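
On a page that never stops loading, a loop like this could run forever. A possible variant (a sketch only; `scroll_to_bottom`, `pause` and `max_rounds` are names made up here) takes the driver as an argument and gives up after a fixed number of rounds.

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls that were performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load new content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number  # nothing new appeared, we are at the bottom
        last_height = new_height
    return max_rounds  # stopped by the safety limit
```

With a live Selenium session it would be called as `scroll_to_bottom(driver)`. Passing the driver in, instead of relying on a global variable, also makes the function easy to try out with a stand-in object.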

In [None]:
urls = []
button = True
while button:
    print('.', end='')
    
    scrolldown()
    for article in driver.find_elements_by_class_name('card'):
        urls.append(article.find_element_by_tag_name('a').get_attribute('href'))
    try:
        button = driver.find_element_by_css_selector('a.infinity-next.button')
        button.click()
    except NoSuchElementException:
        button = False

In [None]:
len(urls)
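
If the page keeps the earlier cards when new ones are loaded, the loop above collects them again on every pass, so the list may contain duplicates. A short sketch of removing them while keeping the original order (the URLs below are made up):

```python
# Hypothetical result of collecting every card after each scroll
urls = [
    "https://444.hu/article-1",
    "https://444.hu/article-2",
    "https://444.hu/article-1",  # collected again on the second pass
    "https://444.hu/article-3",
]

# Dictionary keys are unique and keep their insertion order
unique_urls = list(dict.fromkeys(urls))
print(len(urls), len(unique_urls))
```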

### III. Querying microservices

Microservices - also known as the microservice architecture - is an architectural style that structures an application as a collection of services that are

- Highly maintainable and testable
- Loosely coupled
- Independently deployable
- Organized around business capabilities
- Owned by a small team

Source and more reading [here](https://microservices.io/).

####  Architectural design

![Architecture1](https://i.stack.imgur.com/b62O1.png)
![Architecture2](https://www.redhat.com/cms/managed-files/monolithic-vs-microservices.png)

####  General algorithm to uncover and exploit microservices
__Warning #1:__ Sometimes, the direct usage of microservices is forbidden for commercial purposes. Before you start building a business on it, read the related terms and conditions of the website. Occasional use should not normally result in any action against you.

__Warning #2:__ Not every website uses microservices (or they are restricted in some ways). Therefore, this method will __not__ work in every single case. Sometimes, parsing an HTML is just not something you can avoid. However, it is surely worth checking as you may retrieve the whole dataset without having to parse and clean anything. 

__Task__: Say you want to scrape the departing flights for a given day from [Budapest Liszt Ferenc Airport](https://www.bud.hu/indulo_jaratok). You need every detail that is accessible.

1. Open the website, right-click and choose Inspect. On the top bar, switch from the `Elements` tab to `Network`. If nothing is displayed here, refresh the page. This shows the network traffic that happens under the hood. There are pictures here, JavaScript code and a bunch of scary processes that we will avoid, don't worry. Order the requests by `Type`. In most cases, the `xhr` and `document` types are the ones we care about. If you click on one of the `xhr` entries, this is what should pop up. <br></br>
 ![micro0](pics/micro0.png) <br></br>
2. The `Headers` tab shows the input details of the request that was sent out to retrieve this specific content. The `Preview` and `Response` tabs show the result of the request: the former gives a nicer, rendered view, the latter the raw version.<br></br>
3. Now, the task is to find the entry that returns the pieces of flights data we need. Let's check all the ones with `Type` = `xhr` first and check their `Preview` tabs to find the right one. I think we have a winner here, this looks great: <br></br>
 ![micro2](pics/micro2.png)<br></br>
4. Click on the triangle that looks like a play button to expand an entry. Okay, this is very cool, we have it.
5. Next, we need to find a way to replicate it so that we can get the data programmatically. If only there was a way to retrieve the input data for this very request. Oh wait! This is what the `Headers` tab is there for, isn't it? It is!
6. Now, the `Headers` tab contains details in a non-Python format (this is not entirely true, but at this point you are not assumed to have the skills needed to transform it manually).
7. We are going to transform it with a third party service: https://curl.trillworks.com/
8. First, copy the [curl](https://en.wikipedia.org/wiki/CURL) equivalent of the request: right-click on it -> Copy -> Copy as cURL. The curl command is now on the clipboard. Go to https://curl.trillworks.com/ and paste it into the curl command box. This will generate the Python code we can use.
![micro3](pics/micro3.png)<br></br>
9. You are all done :) From now on, success only depends on your Python skills.

In [None]:
# This is the code snippet curl.trillworks.com generated for me:
import requests

cookies = {
    'cookie_bar': 'enabled',
    '_ga': 'GA1.2.270795426.1604223546',
    '_gid': 'GA1.2.1464611313.1604223546',
    'XSRF-TOKEN': 'eyJpdiI6ImFDdE11RUFSZWEwa0QrN3VJRVJhbFE9PSIsInZhbHVlIjoibFhGNENRK3RPeVhRUW5VS3ZGYkhyREJTU29kVEQzMVhIeVQzOWo1dTNscUd2RkQxN0xURUZJcDBRblVCdHRQMUNVbXFDQXBmbXk3ZVdSR1A0SlBkWGc9PSIsIm1hYyI6IjI5YTcyZjJlYzk4YmZmOGZmYTFlNTQxMWQ4ZGVmM2ZjMDVhYjMwOWU4MzhkNjI5MjNjYzAzMTBlNTFhYjA5ZjUifQ%3D%3D',
    'budhu_session': 'eyJpdiI6IlhJTHEraE5jYmJ0Z2lLXC9zeVk1VmRBPT0iLCJ2YWx1ZSI6IjFBT0FyQmhDaGc0UlwvM0Z5NDBYd1pOQzIxNlpHcGRqbGFGQ3NPOXI1NlZlaCtKWHZ0c3Z5UENkb0RxK1N5WkVpcHhBV1JxYUsybFU5aXRjampVU3FJUT09IiwibWFjIjoiNTQ3OWJlZTQzNjY3MzAwZmFlYzJiN2FlNTI4MTA5YjAyOWYxZWQ2ZDdmNmQ5MTkwNWYxMTEwNmM2YTc1Mjc5YSJ9',
}

headers = {
    'Connection': 'keep-alive',
    'Accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://www.bud.hu/indulo_jaratok',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
}

params = (
    ('mode', 'list'),
    ('lang', 'hun'),
    ('dir', '0'),
    ('flightdate_custom_from_date', 'today'),
    ('flightdate_custom_from_time', '10:30'),
)

response = requests.get('https://www.bud.hu/api/ajaxFlights/', headers=headers, params=params, cookies=cookies)

#NB. Original query string below. It seems impossible to parse and
#reproduce query strings 100% accurately so the one below is given
#in case the reproduced version is not "correct".
# response = requests.get('https://www.bud.hu/api/ajaxFlights/?mode=list&lang=hun&dir=0&flightdate_custom_from_date=today&flightdate_custom_from_time=10:30', headers=headers, cookies=cookies)
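
To get a feeling for how the `params` tuple turns into a query string, you can encode it by hand. This sketch uses `urlencode` from the standard library; `requests` builds its query string in a similar way, though the exact result may differ in details.

```python
from urllib.parse import urlencode

params = (
    ("mode", "list"),
    ("lang", "hun"),
    ("dir", "0"),
    ("flightdate_custom_from_date", "today"),
    ("flightdate_custom_from_time", "10:30"),
)

# Each (key, value) pair becomes key=value, and the pairs are joined with '&'
query_string = urlencode(params)
print(query_string)
```

Note that the colon in `10:30` comes out percent-encoded as `%3A`. Small differences like this are the reason the generated snippet also keeps the original query string in a comment.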


Always check the status code!

In [None]:
response.status_code

Remember, 200 is great: it means success. You usually do not need to include the cookies in the request, but that is up to you.

In [None]:
response = requests.get('https://www.bud.hu/api/ajaxFlights/', headers=headers, params=params) # deleted cookies from here
response.status_code
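
Codes other than 200 are worth reading too. The standard library knows the official names, so a small helper can turn a number into something more telling. This is just a sketch, and `describe_status` is a name made up for this example.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short, human-readable summary of an HTTP status code."""
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {HTTPStatus(code).phrase}: {outcome}"

for code in (200, 403, 404, 500):
    print(describe_status(code))
```

In the notebook you could call it as `describe_status(response.status_code)`.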

Now, as the response is in [JSON](https://en.wikipedia.org/wiki/JSON) format, we don't need to parse it with `BeautifulSoup`; we can simply convert it to a Python object. If you are not familiar with JSON, think of it as a Python dictionary or list that can contain further dictionaries and lists.

In [None]:
data = response.json() # interpreting it as JSON
type(data) # result object is a list this time

As there is no documentation on the format of the data, we need to uncover the pattern ourselves. But relax, it is usually straightforward. First, have a look at the first item of the list.

In [None]:
data[0] # First item of the list, a dictionary

This will probably be a list of dictionaries, each item containing pieces of information on one specific departing flight.
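
One way to uncover the pattern is to collect every key that appears in any of the items, since not every dictionary has to carry the same fields. The payload below is invented for illustration; the field names are assumptions and will not match what bud.hu actually returns.

```python
import json

# Hypothetical payload; the field names are made up for illustration
raw = (
    '[{"flight": "XX123", "city": "London", "gate": null},'
    ' {"flight": "YY456", "city": "Paris", "status": "Boarding"}]'
)

data = json.loads(raw)  # same kind of object as response.json() would give

# Gather the keys of every item to see which fields exist at all
all_keys = set()
for item in data:
    all_keys.update(item.keys())

print(type(data).__name__, len(data))
print(sorted(all_keys))
```

Note that JSON `null` arrives in Python as `None`, so missing values need a bit of care before you do any calculations.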

#### Exercise III: Let's hack the system!
![hackerman](https://wompampsupport.azureedge.net/fetchimage?siteId=7575&v=2&jpgQuality=100&width=700&url=https%3A%2F%2Fi.kym-cdn.com%2Fentries%2Ficons%2Ffacebook%2F000%2F021%2F807%2Fig9OoyenpxqdCQyABmOQBZDI0duHk2QZZmWg2Hxd4ro.jpg) <br></br>
 Change the parameters so that:
 
 - Instead of today, it will return flights from the day before (that is, yesterday). 
 - Instead of departing flights, it will return the arrivals.
 - Instead of showing flights after 10.30 AM, it will return all the flights.
 
__Warning #3:__ Note that every website has different microservices and hence different parameters. What we are doing is specific to [bud.hu](https://www.bud.hu/). When scraping another website, you need to uncover its parameter space and find the possibilities you have.
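
When exploring a parameter space, it helps to change one value at a time and look at what comes back. Below is a small sketch of such a helper. `with_overrides` is a made-up name, and `"eng"` is only a guess at a value, not something the API is known to accept.

```python
base_params = (
    ("mode", "list"),
    ("lang", "hun"),
)

def with_overrides(params, **overrides):
    """Return a new dict of parameters with some values replaced."""
    updated = dict(params)  # the original tuple stays untouched
    updated.update(overrides)
    return updated

# "eng" is a guess at a valid value, not something the API is known to accept
trial_params = with_overrides(base_params, lang="eng")
print(trial_params)
```

`requests.get` accepts a dictionary for `params` just as well as a tuple of pairs, so the result can be passed on directly.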

In [None]:
custom_params = 

response = requests.get('https://www.bud.hu/api/ajaxFlights/', headers=headers, params=custom_params)

In [None]:
# Check your result here

#### Exercise IV: 

In [None]:
# TODO