# Web Scraping
### Workshop 3

*Jun 29 - IACS-MACI Internship*

Web scraping is a computational technique for **automatic data extraction.**

Scraping can target `pdf` files, `docs`, and any other source of information. The term `web` indicates that the source is an HTML website.

There are **2 scraping approaches**, depending on the website:
- Static websites: HTML only
- Dynamic websites (also called applications): HTML + JavaScript
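
The distinction matters because an HTML parser only sees the markup the server sends; it does not run JavaScript. A minimal sketch, using two made-up HTML snippets (not taken from any real site) and the BeautifulSoup library used in the rest of this workshop:

```python
from bs4 import BeautifulSoup

# Hypothetical static page: the data is already present in the HTML.
static_html = """
<ul id="papers">
  <li>Paper A</li>
  <li>Paper B</li>
</ul>
"""

# Hypothetical dynamic page: the list stays empty until a browser runs the script.
dynamic_html = """
<ul id="papers"></ul>
<script>
  const list = document.getElementById("papers");
  for (const name of ["Paper A", "Paper B"]) {
    const li = document.createElement("li");
    li.textContent = name;
    list.appendChild(li);
  }
</script>
"""

static_items = BeautifulSoup(static_html, "html.parser").find_all("li")
dynamic_items = BeautifulSoup(dynamic_html, "html.parser").find_all("li")

print(len(static_items))   # items found directly in the markup
print(len(dynamic_items))  # nothing found: the script was never executed
```

For dynamic pages, a tool that drives a real browser is usually needed before the HTML can be parsed this way.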


Python offers more than one option for scraping. The following table compares the most popular libraries:

 <img src="images/scrap_python.png" /> 

(*Source: [Python Web Scrapping - Kite, 2020](https://www.youtube.com/watch?v=zucvHSQsKHA)*)

In this workshop we will use **[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)**.

In [3]:
import requests # to load html code
from bs4 import BeautifulSoup

The first step is to load HTML code from some URL we want to explore.

```
https://arxiv.org/search/?query=machine+learning&searchtype=all&abstracts=show&order=-announced_date_first&size=200
```

In this case, we use a list to save the keywords.

In [4]:
key_words = ['machine', 'learning']

str_keywords = '+'.join(key_words)

base = 'https://arxiv.org/search/?query={}&searchtype=all&abstracts=show&order=-announced_date_first&size=200'
url  = base.format(str_keywords)
print(url)

https://arxiv.org/search/?query=machine+learning&searchtype=all&abstracts=show&order=-announced_date_first&size=200
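
As an alternative to formatting the string by hand, the standard library can build the query string and take care of escaping. A minimal sketch, assuming the same parameters as the URL above (the `params` dictionary is an illustrative name, not part of the workshop code):

```python
from urllib.parse import urlencode

key_words = ['machine', 'learning']

# Query parameters copied from the URL above.
params = {
    'query': ' '.join(key_words),  # urlencode turns spaces into '+'
    'searchtype': 'all',
    'abstracts': 'show',
    'order': '-announced_date_first',
    'size': 200,
}

url = 'https://arxiv.org/search/?' + urlencode(params)
print(url)
```

This produces the same URL as the `format` version, and it also escapes keywords that contain special characters.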


Once the URL is defined, we need to open it from Python.

In [5]:
%%time
html_page = requests.get(url) 

CPU times: user 38.6 ms, sys: 2.38 ms, total: 40.9 ms
Wall time: 1.32 s
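
A request can fail or hang, so it can help to set a timeout and check the response status before parsing. A minimal sketch; the helper name `fetch_html` and the 10-second default are illustrative assumptions, not part of the workshop code:

```python
import requests

def fetch_html(url, timeout=10):
    """Return the HTML of `url` as text, or raise if the request fails."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

It could be called as `html_text = fetch_html(url)`, with the result passed to the parser in place of `html_page.text`.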


In [11]:
# html_page.text

Using `BeautifulSoup` we can give structure to the raw `HTML` code.

Note that [other parsers](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser) are available, for formats such as `XML`.

In [12]:
soup = BeautifulSoup(html_page.text, 'html.parser') # instantiate our scraper

In [14]:
# soup
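
To see what the parsed structure looks like without printing a whole page, here is a minimal sketch on a made-up HTML string (the snippet and the name `demo_soup` are illustrative only):

```python
from bs4 import BeautifulSoup

# Made-up HTML, used only to illustrate the parsed structure.
raw_html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='intro'>Hello <b>world</b></p></body></html>"
)

demo_soup = BeautifulSoup(raw_html, 'html.parser')

print(demo_soup.title.string)  # text inside <title>
print(demo_soup.p.get_text())  # all text inside the first <p>
print(demo_soup.prettify())    # indented view of the tag tree
```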

Now it's time to extract the information.

The usual way to locate it is with the browser's developer tools.

1. Open the website
2. Right click on the item that we want to explore  <img src="./images/ws_1.png" alt="Drawing" style="width: 600px;"/> 
3. Click on Inspect Element <img src="./images/ws_2.png" alt="Drawing" style="width: 200px;"/> 
4. Look at the HTML tags that our scraper needs to access <img src="./images/ws_3.png" alt="Drawing" style="width: 600px;"/> 

In this example we are going to get the **titles** of the articles. Thus, we need to extract the paragraphs ```<p>``` whose class attribute is ```class="title is-5 mathjax"```.

In [19]:
titles = soup.find_all('p', attrs = {"class": 'title is-5 mathjax'})
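
The same query can also be written as a CSS selector with `select`. A minimal sketch on made-up HTML that mimics the class names above (the snippet and `demo_soup` are illustrative, not the real arXiv page):

```python
from bs4 import BeautifulSoup

# Made-up HTML that reuses the class names from the search page.
html = """
<p class="title is-5 mathjax">First paper</p>
<p class="title is-5 mathjax">Second paper</p>
<p class="authors">Someone</p>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

# Match on the exact value of the class attribute.
by_attrs = demo_soup.find_all('p', attrs={'class': 'title is-5 mathjax'})

# Match any <p> that carries all three classes, in any order.
by_css = demo_soup.select('p.title.is-5.mathjax')

print([p.get_text() for p in by_attrs])
print([p.get_text() for p in by_css])
```

The selector form is a little more tolerant: it still matches if the site reorders the classes or adds another one.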

In [20]:
for title in titles:
    print(title.text.strip())  # text of the first title, without surrounding whitespace
    break

Learning a Single Neuron with Adversarial Label Noise via Gradient Descent


You can also access the object's attributes. For example:

In [22]:
for title in titles:
    print(title.attrs)  # dictionary with all the attributes of the tag
    class_value = title.attrs.get('class')
    print(class_value)
    break

{'class': ['title', 'is-5', 'mathjax']}
['title', 'is-5', 'mathjax']
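
Other attributes, such as the `href` of a link, can be read the same way. A minimal sketch on a made-up `<a>` tag (the URL is a placeholder):

```python
from bs4 import BeautifulSoup

# Made-up link tag, used only for illustration.
demo_soup = BeautifulSoup(
    '<a href="https://example.org/paper" class="link big">Paper</a>',
    'html.parser',
)
link = demo_soup.find('a')

print(link['href'])       # direct access; raises KeyError if the attribute is missing
print(link.get('title'))  # .get returns None if the attribute is missing
print(link.get('class'))  # multi-valued attributes come back as lists
```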


## Hands on code!

Create a DataFrame containing the title of each paper and the URL that points to it.

**Hint**: All the papers are listed in an ordered list, `<ol>`, so every article block sits inside an `<li>` tag. You have to iterate over those objects and capture the required information.

**Hint 2**: You can run queries on any tag you have already extracted, not only on the whole page. Go from the most general tag to the most specific.
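
The general-to-specific idea can be sketched on a made-up list that is unrelated to the arXiv page; the HTML, the class names, and the `rows` variable are all illustrative assumptions:

```python
from bs4 import BeautifulSoup

# Made-up HTML with a different structure from the exercise page.
html = """
<ul class="books">
  <li><span class="name">Book A</span> <a href="/books/a">details</a></li>
  <li><span class="name">Book B</span> <a href="/books/b">details</a></li>
</ul>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

book_list = demo_soup.find('ul', attrs={'class': 'books'})  # most general: the list
rows = []
for item in book_list.find_all('li'):                       # each entry in the list
    name = item.find('span', attrs={'class': 'name'})       # most specific: one field
    link = item.find('a')
    rows.append({'name': name.get_text(strip=True), 'url': link['href']})

print(rows)
```

Each query runs on the tag returned by the previous one, so the search stays inside the current entry. A list of dictionaries like `rows` can then be turned into a table.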

In [17]:
list_elements = **your code here**

In [None]:
titles_column  = [] # to save titles
link_column = [] # to save links

for item in list_elements:
    title = item.find('p', attrs = {"class": 'title is-5 mathjax'})  # search inside this <li> only
    link_content = *your code here*
    link_content = *your code here*