# Web Scraping
### Workshop 3

*Jun 29 - IACS-MACI Internship*

Web scraping is a computational technique for **automatic data extraction.**

Scraping can be applied to `pdf` files, `docs`, and other sources of information. The term `web` indicates that the source is an HTML website.

There are **2 scraping approaches**, depending on the website:
- Static websites: HTML only
- Dynamic websites (also called applications): HTML + JavaScript
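
A minimal sketch of why this distinction matters, using a made-up HTML snippet (the `id` values and the text are assumptions for illustration). A parser such as BeautifulSoup only sees what is in the HTML source and does not run JavaScript, so content that a script would add at load time is simply missing:

```python
from bs4 import BeautifulSoup

# Hypothetical page: one paragraph is in the source, another would be added by a script
html = """
<html><body>
  <p id="static">I am in the HTML source</p>
  <div id="app"></div>
  <script>
    document.getElementById("app").innerHTML = "<p>I am added by JavaScript</p>";
  </script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Only the paragraph written directly in the HTML is found
paragraphs = [p.text for p in soup.find_all("p")]
# The container that the script would fill is still empty
app_text = soup.find("div", id="app").text

print(paragraphs)
print(repr(app_text))
```

For dynamic websites the page usually has to be rendered by a browser (or the underlying data request has to be found) before the content can be parsed.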


In Python there is more than one option for scraping. The following table compares the most popular libraries:

 <img src="images/scrap_python.png" /> 

(*Source: [Python Web Scrapping - Kite, 2020](https://www.youtube.com/watch?v=zucvHSQsKHA)*)

In this workshop we will use **[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)**

In [2]:
import requests # to load html code
from bs4 import BeautifulSoup

The first step is to load HTML code from some URL we want to explore.

```
https://arxiv.org/search/?query=machine+learning&searchtype=all&abstracts=show&order=-announced_date_first&size=200
```

In this case, we use a list to save the keywords.

In [27]:
key_words = ['machine', 'learning']

str_keywords = '+'.join(key_words)

base = 'https://arxiv.org/search/?query={}&searchtype=all&abstracts=show&order=-announced_date_first&size=200'
url  = base.format(str_keywords)
print(url)

https://arxiv.org/search/?query=machine+learning&searchtype=all&abstracts=show&order=-announced_date_first&size=200
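
As a possible alternative to joining the keywords by hand, the standard library's `urllib.parse.urlencode` can build the query string and escape special characters. This is only a sketch: the helper name `build_search_url` is hypothetical, and the parameters are copied from the URL above.

```python
from urllib.parse import urlencode

def build_search_url(key_words, size=200):
    """Build an arXiv search URL from a list of keywords (hypothetical helper)."""
    params = {
        "query": " ".join(key_words),  # urlencode turns spaces into '+'
        "searchtype": "all",
        "abstracts": "show",
        "order": "-announced_date_first",
        "size": size,
    }
    return "https://arxiv.org/search/?" + urlencode(params)

print(build_search_url(["machine", "learning"]))
```

With simple keywords this should produce the same URL as the `format` approach; the difference only shows up for keywords that contain characters such as `+` or `&`, which get percent-encoded.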


Once the URL is defined, we need to request it from Python.

In [28]:
%%time
html_page = requests.get(url) 

CPU times: user 30.7 ms, sys: 159 µs, total: 30.8 ms
Wall time: 1.1 s
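
A request can fail or hang, so it may be worth checking the response before parsing it. The sketch below is one way to do that, not the only one: the helper name `fetch_html`, the 10 second timeout, and the `User-Agent` string are assumptions chosen for illustration.

```python
import requests

def fetch_html(url, timeout=10.0, session=None):
    """Download a page and return its HTML text, raising on HTTP errors."""
    # A requests.Session can be passed in to reuse connections;
    # otherwise the module-level requests.get is used.
    http = session or requests
    response = http.get(
        url,
        timeout=timeout,  # give up instead of waiting forever
        headers={"User-Agent": "workshop-scraper/0.1"},  # hypothetical identifier
    )
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

It could then be used as `html_text = fetch_html(url)` in place of reading `html_page.text` directly.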


In [29]:
# html_page.text

Using `BeautifulSoup` we can give structure to the raw `HTML` code.

Note that [other parsers](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser) are available, for example for `XML`.

In [30]:
soup = BeautifulSoup(html_page.text, 'html.parser') # instantiate our scraper

In [31]:
# soup
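
To get a feel for what the parsed object offers, here is a small sketch on a made-up HTML string (the page content is an assumption, not taken from arXiv). Tags can be reached as attributes or searched for, and `prettify()` prints the tree with indentation:

```python
from bs4 import BeautifulSoup

# Tiny hypothetical page used only for illustration
raw_html = (
    "<html><head><title>Demo page</title></head>"
    "<body><h1>Hello</h1><p class='intro'>First paragraph</p></body></html>"
)

demo_soup = BeautifulSoup(raw_html, "html.parser")

page_title = demo_soup.title.string    # text inside <title>
first_paragraph = demo_soup.find("p")  # first <p> tag in the document

print(page_title)
print(first_paragraph.text, first_paragraph.attrs)
print(demo_soup.prettify())            # indented view of the parsed tree
```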

Now it's time to extract the information.

A common way to find what to extract is to use the browser's developer tools.

1. Open the website
2. Right click on the item that we want to explore  <img src="./images/ws_1.png" alt="Drawing" style="width: 600px;"/> 
3. Click on Inspect Element <img src="./images/ws_2.png" alt="Drawing" style="width: 200px;"/> 
4. Look at the HTML tags that our scraper needs to target <img src="./images/ws_3.png" alt="Drawing" style="width: 600px;"/> 

In this example we are going to get the **titles** of the articles. Thus, we need to extract the paragraphs ```<p>``` whose class attribute is ```class="title is-5 mathjax"```

In [24]:
titles = soup.find_all('p', attrs = {"class": 'title is-5 mathjax'})

In [26]:
len(titles)

0

In [25]:
for title in titles:
    print(title.text.strip())
    break

You can also access the attributes of each object. For example:

In [15]:
for title in titles:
    print(title.attrs)
    class_value = title.attrs.get('class')
    print(class_value)
    break

{'class': ['title', 'is-5', 'mathjax']}
['title', 'is-5', 'mathjax']
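
Because `class` is stored as a list of separate class names, the way it is matched matters. The sketch below uses a made-up HTML snippet to compare two options: passing the full string in `attrs`, which matches the attribute value as written, and a CSS selector via `select`, which matches tags that carry all three classes in any order.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: same classes, written in a different order in the second <p>
html = """
<ol>
  <li><p class="title is-5 mathjax"> First paper </p></li>
  <li><p class="mathjax title is-5"> Second paper </p></li>
  <li><p class="authors">Someone</p></li>
</ol>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# Matches the class attribute as the exact string "title is-5 mathjax"
exact = demo_soup.find_all("p", attrs={"class": "title is-5 mathjax"})

# Matches any <p> that has all three classes, regardless of their order
by_css = demo_soup.select("p.title.is-5.mathjax")

print([p.text.strip() for p in exact])
print([p.text.strip() for p in by_css])
```

If a query unexpectedly returns no results, comparing these two forms is a quick way to check whether the class string is the problem.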


## Hands on code!

Create a DataFrame containing the title of each paper and the URL to it.

**Hint**: All the papers are listed in an ordered list, `<ol>`. Thus every article block is inside a `<li>` tag. You have to iterate over those objects, capturing the required information.

**Hint 2**: You can run a query on any tag returned by a previous query. Go from the most general to the most specific.

In [36]:
list_elements = soup.find_all('a', string='pdf')
list_titles = soup.find_all('p', attrs = {"class": 'title is-5 mathjax'})

In [39]:
titles_column  = [x.text.strip() for x in list_titles] # to save titles
link_column = [x.attrs.get('href') for x in list_elements] # to save links
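
The two lists above are collected independently, so they only line up if every paper has a `pdf` link. Following the first hint, a possible sketch is to iterate over each `<li>` block and read the title and link from the same block. The sample HTML, its identifiers, and the helper name `extract_rows` are made up for illustration; the real page structure should be checked in the browser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating two result blocks; the second has no pdf link
sample_html = """
<ol>
  <li>
    <p class="list-title">
      <a href="https://arxiv.org/abs/0000.00001">arXiv:0000.00001</a>
      [<a href="https://arxiv.org/pdf/0000.00001">pdf</a>]
    </p>
    <p class="title is-5 mathjax"> Paper with a PDF </p>
  </li>
  <li>
    <p class="list-title">
      <a href="https://arxiv.org/abs/0000.00002">arXiv:0000.00002</a>
    </p>
    <p class="title is-5 mathjax"> Paper without a PDF </p>
  </li>
</ol>
"""

def extract_rows(soup):
    """Return one dict per <li> block so title and link stay paired."""
    rows = []
    for item in soup.find_all("li"):
        title_tag = item.find("p", attrs={"class": "title is-5 mathjax"})
        if title_tag is None:
            continue  # not an article block
        pdf_tag = item.find("a", string="pdf")
        rows.append({
            "title": title_tag.text.strip(),
            "link": pdf_tag.get("href") if pdf_tag else None,
        })
    return rows

rows = extract_rows(BeautifulSoup(sample_html, "html.parser"))
print(rows)
```

A list of dicts like `rows` can be passed directly to `pd.DataFrame(rows)`.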

In [43]:
def make_clickable(val):
    # target _blank to open new window
    return '<a target="_blank" href="{}">{}</a>'.format(val, val)

In [45]:
import pandas as pd
df = pd.DataFrame()
df['title'] = titles_column
df['link'] = link_column
df.style.format({'link': make_clickable})

| | title | link |
|---|---|---|
| 0 | Visual Foresight With a Local Dynamics Model | https://arxiv.org/pdf/2206.14802 |
| 1 | Meta-Learning over Time for Destination Prediction Tasks | https://arxiv.org/pdf/2206.14801 |
| 2 | Understanding Generalization via Leave-One-Out Conditional Mutual Information | https://arxiv.org/pdf/2206.14800 |
| 3 | Generalized Permutants and Graph GENEOs | https://arxiv.org/pdf/2206.14798 |
| 4 | 3D-Aware Video Generation | https://arxiv.org/pdf/2206.14797 |
| 5 | On the Robustness of Dialogue History Representation in Conversational Question Answering: A Comprehensive Study and a New Prompt-based Method | https://arxiv.org/pdf/2206.14796 |
| 6 | ENS-10: A Dataset For Post-Processing Ensemble Weather Forecast | https://arxiv.org/pdf/2206.14786 |
| 7 | IBP Regularization for Verified Adversarial Robustness via Branch-and-Bound | https://arxiv.org/pdf/2206.14772 |
| 8 | Distilling Model Failures as Directions in Latent Space | https://arxiv.org/pdf/2206.14754 |
| 9 | An Auto-Regressive Formulation for Smoothing and Moving Mean with Exponentially Tapered Windows | https://arxiv.org/pdf/2206.14749 |


- asynchronous data collection
- get table data 