# Script to extract data from a URL

## Stage 1 - Data Extraction

### Modules to be used for data extraction

In [1]:
from urllib import request as req
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
import lxml
import pandas as pd

In [2]:
# storing the website address for data extraction

url = 'https://arxiv.org/search/?query=Vq+VAE&source=header&searchtype=all'
url2 = 'https://openreview.net/search?term=ICML++vae&content=all&group=all&source=all'

### Step 1 - Connection Establishment
> This step comprises three sub-steps:
 - Opening the client and establishing the connection
 - Reading the page content
 - Closing the client and the connection

In [3]:
# opening the client connection
client = urlopen(url) 

# reading the data from the html page and storing it
page_html = client.read()

# closing the client connection
client.close()
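
The three sub-steps above can also be wrapped in a `with` block, which closes the connection even if reading fails. This is a minimal sketch rather than part of the original notebook: the function name `fetch_html` and the `timeout` default are assumptions chosen for illustration.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=30):
    """Open `url`, read the raw page bytes, and close the connection."""
    # The context manager closes the client when the block exits,
    # including when read() raises an exception.
    with urlopen(url, timeout=timeout) as client:
        return client.read()
```

With this helper, cell 3 would reduce to `page_html = fetch_html(url)`.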

### Step 2 - Parsing of the web information

In [4]:
# the extracted page is parsed using BeautifulSoup
page_soup = bs(page_html, 'lxml')

In [5]:
page_soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- new favicon config and versions by realfavicongenerator.net -->
<link href="https://static.arxiv.org/static/base/0.16.8/images/icons/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="https://static.arxiv.org/static/base/0.16.8/images/icons/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="https://static.arxiv.org/static/base/0.16.8/images/icons/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<link href="https://static.arxiv.org/static/base/0.16.8/images/icons/site.webmanifest" rel="manifest"/>
<link color="#b31b1b" href="https://static.arxiv.org/static/base/0.16.8/images/icons/safari-pinned-tab.svg" rel="mask-icon"/>
<link href="https://static.arxiv.org/static/base/0.16.8/images/icons/favicon.ico" rel="shortcut icon"/>
<meta content="#b31b1b" name="msapplication-TileColor"/>
<meta cont

### Step 3 - Extraction of the web data (Web Scraping or Web Data extraction)
> This step pulls the required fields out of the parsed page. How it is written depends on the type and volume of the data to be extracted.

In [6]:
# each search result on the fetched page is an <li class="arxiv-result"> element;
# all of them are collected in the ind_doc_sec variable

ind_doc_sec = page_soup.find_all("li", {"class": "arxiv-result"})

#### Each search result on the website has the following fields:
> 1. `Doc_id`
> 2. `Tags`
> 3. `Doc_title`
> 4. `Authors`
> 5. `Abstract`
> 6. `Submitted` (submission date)
> 7. `Originally Announced` (original announcement date)
#### Each field can be identified and extracted by its tag and class in the HTML code.
> 1. `Doc_id` is inside the "p" - {"class":"list-title is-inline-block"}, which is in turn inside the "div" - {"class":"is-marginless"}
> 2. `Tags` are inside the "span" - {"class":"tag is-small is-grey tooltip is-tooltip-top"}, which is in turn inside the "div" - {"class":"tags is-inline-block"}
> 3. `Doc_title` is inside the "p" - {"class":"title is-5 mathjax"}
> 4. `Authors` are inside the "a" elements of the "p" - {"class":"authors"}
> 5. `Abstract` is inside the "span" - {"class":"abstract-full has-text-grey-dark mathjax"}
> 6. `Submitted` is on the first line of the "p" - {"class":"is-size-7"}
> 7. `Originally Announced` is on the second line of the same "p" - {"class":"is-size-7"}
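
The lookups listed above can be tried on a small inline HTML sample before running them on the live page. This is a sketch under stated assumptions: `SAMPLE_HTML` is a hypothetical, heavily simplified stand-in for one search result, and `text_or_none` is an illustrative helper that returns `None` instead of raising when an element is missing. It uses the built-in `html.parser` so it does not depend on `lxml`, and it matches tags on the single class `tag` rather than on the full class string.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for one arXiv search result.
SAMPLE_HTML = """
<ol>
  <li class="arxiv-result">
    <div class="is-marginless">
      <p class="list-title is-inline-block"><a href="#">arXiv:0000.00000</a></p>
      <div class="tags is-inline-block">
        <span class="tag is-small is-grey tooltip is-tooltip-top">cs.LG</span>
        <span class="tag is-small is-grey tooltip is-tooltip-top">stat.ML</span>
      </div>
    </div>
    <p class="title is-5 mathjax">An Example Title</p>
    <p class="authors"><a href="#">A. Author</a>, <a href="#">B. Author</a></p>
  </li>
</ol>
"""


def text_or_none(parent, name, css_class):
    """Return the stripped text of the first matching child, or None if absent."""
    node = parent.find(name, class_=css_class)
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
result = soup.find("li", class_="arxiv-result")

doc_id = text_or_none(result, "p", "list-title is-inline-block")
title = text_or_none(result, "p", "title is-5 mathjax")
# Matching on the single class "tag" accepts any tag span, whatever its colour class.
tags = [span.get_text(strip=True) for span in result.find_all("span", class_="tag")]
authors = [a.get_text(strip=True) for a in result.find("p", class_="authors").find_all("a")]
# The sample has no abstract, so this lookup yields None instead of an AttributeError.
abstract = text_or_none(result, "span", "abstract-full has-text-grey-dark mathjax")
```

Returning `None` for a missing element keeps one incomplete result from stopping the whole extraction.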

In [10]:
# initializing a dictionary with one list per field to store the data
data_final = {"Doc_id": [],
              "Tags": [],
              "Doc_title": [],
              "Authors": [],
              "Abstract": [],
              "Submitted": [],
              "Originally Announced": []}
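
`pd.DataFrame` raises a `ValueError` when the lists in such a dictionary differ in length. That can happen if an extraction run stops partway through a result. A quick length check after filling the dictionary makes this easier to spot. The helper names and the example dictionaries below are hypothetical and are not taken from the scraped data.

```python
def column_lengths(data):
    """Map each column name to the number of values collected for it."""
    return {name: len(values) for name, values in data.items()}


def has_equal_columns(data):
    """Return True when every column holds the same number of values."""
    return len(set(column_lengths(data).values())) <= 1


# Hypothetical example data: `partial` is missing one Tags entry.
complete = {"Doc_id": ["a", "b"], "Tags": [["x"], []]}
partial = {"Doc_id": ["a", "b"], "Tags": [["x"]]}
```

Calling `has_equal_columns(data_final)` before building the dataframe flags the problem early, and `column_lengths(data_final)` shows which field fell behind.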

In [11]:
# looping over the search results and extracting each field
for doc in ind_doc_sec:
    # Doc_id: keep the text before the first newline + non-breaking space
    doc_id = doc.find("div", {"class": "is-marginless"}).find("p", {"class": "list-title is-inline-block"}).text.strip().split('\n\xa0')[0]
    data_final["Doc_id"].append(doc_id)

    # Tags: one entry per subject-tag span
    tag_spans = doc.find("div", {"class": "tags is-inline-block"}).find_all("span", {"class": "tag is-small is-grey tooltip is-tooltip-top"})
    data_final["Tags"].append([span.text for span in tag_spans])

    data_final["Doc_title"].append(doc.find("p", {"class": "title is-5 mathjax"}).text.strip())

    # Authors: one entry per author link
    data_final["Authors"].append([a.text for a in doc.find("p", {"class": "authors"}).find_all("a")])

    data_final["Abstract"].append(doc.find("span", {"class": "abstract-full has-text-grey-dark mathjax"}).text.strip())

    # both dates share one paragraph: line 0 holds the submission date, line 1 the announcement
    date_lines = doc.find("p", {"class": "is-size-7"}).text.strip().split('\n')
    data_final["Submitted"].append(date_lines[0].replace('Submitted ', ''))
    data_final["Originally Announced"].append(date_lines[1].replace('      originally announced ', ''))
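
The two date fields above rely on an exact line break and a fixed run of spaces in the page text. A sketch that is less sensitive to layout collapses the whitespace first and then uses a regular expression. The function name `parse_dates` and the sample string are assumptions; the sample is reconstructed from the split logic above, not copied from the page.

```python
import re


def parse_dates(text):
    """Split a result's date line into (submitted, announced) strings."""
    # Collapse newlines and runs of spaces so the pattern does not depend on layout.
    flat = " ".join(text.split())
    match = re.search(r"Submitted (.+?); originally announced (.+?)\.?$", flat)
    if match is None:
        return None, None
    return match.group(1), match.group(2)


# Assumed shape of the paragraph text for one result.
sample = "Submitted 11 August, 2020;\n      originally announced August 2020."
submitted, announced = parse_dates(sample)
```

This variant also drops the trailing `;` and `.` that the string replacements leave in place.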

In [12]:
# converting the data into a dataframe
data_df1 = pd.DataFrame(data_final)
data_df1.head(5)

| | Doc_id | Tags | Doc_title | Authors | Abstract | Submitted | Originally Announced |
|---|---|---|---|---|---|---|---|
| 0 | arXiv:2008.04549 | [cs.SD] | Unsupervised Learning For Sequence-to-sequence... | [Haitong Zhang, Yue Lin] | Recently, sequence-to-sequence models with att... | 11 August, 2020; | August 2020. |
| 1 | arXiv:2008.02528 | [stat.ML] | Learning Sampling in Financial Statement Audit... | [Marco Schreyer, Timur Sattarov, Anita Gierbl,... | The audit of financial statements is designed ... | 6 August, 2020; | August 2020. |
| 2 | arXiv:2007.09923 | [cs.LG, eess.IV] | Incorporating Reinforced Adversarial Learning ... | [Kenan E. Ak, Ning Xu, Zhe Lin, Yilin Wang] | Autoregressive models recently achieved compar... | 20 July, 2020; | July 2020. |
| 3 | arXiv:2006.12150 | [] | Generating Annotated High-Fidelity Images Cont... | [Bryan G. Cardenas, Devanshu Arya, Deepak K. G... | Recent developments related to generative mode... | 24 June, 2020; v1 submitted 22 June, 2020; | June 2020. |
| 4 | arXiv:2006.07926 | [cs.CL] | UWSpeech: Speech to Speech Translation for Unw... | [Chen Zhang, Xu Tan, Yi Ren, Tao Qin, Kejun Zh... | Existing speech to speech translation systems ... | 14 June, 2020; | June 2020. |
