# Data fetch

These cells retrieve every paper published on the CVPR conference web page for all 3 days of the 2020 conference. They take a long time to execute, since they download every single PDF into a (pre-existing) folder called "data/pdfs", then extract the text into "data/txts" and write a stemmed version into "data/stemmed" (both folders must also exist beforehand). It is recommended not to run these cells unless there is a good reason to do so. Otherwise, the preferred way to get the data is to download the compressed files from:

- PDFs: https://upm365-my.sharepoint.com/:u:/g/personal/alejandro_alvarezco_alumnos_upm_es/ES9Qtwt0pHBEvUB76QJtweEBGogkajCFEkw3piJqza4eew?e=LvEGyB
- TXTs: https://upm365-my.sharepoint.com/:u:/g/personal/alejandro_alvarezco_alumnos_upm_es/EZrHe6K6cyxGjAstE2SHdPsBf51yJ-DzuD8Q2lrP9AcBSw?e=HCk3Jn
- Stemmed TXTs: https://upm365-my.sharepoint.com/:u:/g/personal/alejandro_alvarezco_alumnos_upm_es/EXk-jVXXyJdKlayNxS2gmNkBhX_J1Kd722F9uHgubrFtRQ?e=DjnJ4o


In [2]:
from bs4 import BeautifulSoup as bs
import requests

In [2]:
BASE_URL = "https://openaccess.thecvf.com/"
DAYS = [
    "CVPR2020?day=2020-06-16",
    "CVPR2020?day=2020-06-17",
    "CVPR2020?day=2020-06-18"
]

WARNING: the following cell takes around 2 min. to finish executing.

In [3]:
urls = []
names = []

for day in DAYS:
    url = BASE_URL + day

    r = requests.get(url)
    # Name the parser explicitly so bs4 does not have to guess one
    soup = bs(r.text, "html.parser")

    for link in soup.find_all('a'):
        # Only the anchors whose text is exactly "pdf" point to a paper
        if link.string != 'pdf': continue
        href = link.get('href')
        if "onclick" in str(link) or href is None: continue
        full_url = BASE_URL + href
        if full_url.endswith('.pdf'):
            urls.append(full_url)
            # The relative href doubles as the base for the local file name
            names.append(href)

print("Found {} papers".format(len(urls)))

# zip() is a one-shot iterator; keep a list so the pairs can be reused
names_urls = list(zip(names, urls))


Found 1466 papers
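
The cell above builds each PDF URL by concatenating `BASE_URL` and the `href`, which only works while every `href` is relative and has no leading slash. A minimal sketch of a more tolerant alternative, using `urljoin` from the standard library, is shown below. The helper name `build_pdf_url` and the sample hrefs are hypothetical and only serve as illustration; they are not taken from the conference page.

```python
from urllib.parse import urljoin

BASE_URL = "https://openaccess.thecvf.com/"

def build_pdf_url(href, base_url=BASE_URL):
    """Return the absolute URL for a PDF href, or None if it is not a PDF link."""
    if href is None or not href.endswith(".pdf"):
        return None
    # urljoin resolves both "papers/x.pdf" and "/papers/x.pdf" against the base
    return urljoin(base_url, href)

# Hypothetical hrefs, for illustration only
sample_hrefs = [
    "content/papers/example.pdf",
    "/content/papers/example.pdf",
    "menu.py",
    None,
]
resolved = [build_pdf_url(h) for h in sample_hrefs]
print(resolved)
```

Both PDF hrefs resolve to the same absolute URL, and anything that is not a PDF link comes back as `None`, so the caller can filter it out.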


WARNING: the following cell takes about 30 min. to execute.

In [4]:
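for name, link in names_urls:
    resp = requests.get(link)
    # Flatten the relative path into a single file name
    new_name = name.replace('/', '_')
    # The with block closes the file on exit, so no explicit close() is needed
    with open("data/pdfs/" + new_name, 'wb') as f:
        f.write(resp.content)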
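
Since the download takes about 30 minutes, an interrupted run is expensive to repeat from scratch. The sketch below is one possible way to resume: it lists only the papers whose target file is missing or empty. The helper name `pending_downloads` is hypothetical, and the sketch assumes the same naming rule as the cell above (slashes replaced by underscores).

```python
import os

def pending_downloads(names_urls, folder="data/pdfs/"):
    """Return (path, url) pairs whose target file is missing or empty."""
    pending = []
    for name, url in names_urls:
        # Same naming rule as the download cell: flatten the relative path
        path = os.path.join(folder, name.replace('/', '_'))
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            pending.append((path, url))
    return pending
```

The download loop could then iterate over `pending_downloads(names_urls)` and write `resp.content` to each returned path, skipping everything already on disk.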

In [6]:
import textract
import os

for file_name in os.listdir("data/pdfs/"):
    # Skip the compressed archive if it sits in the same folder
    if "zip" in file_name: continue
    # textract returns the extracted text as bytes
    text = textract.process("data/pdfs/"+file_name)
    with open("data/txts/"+file_name.replace(".pdf", ".txt"), 'wb') as f:
        f.write(text)



In [11]:
import nltk
from nltk.stem.porter import PorterStemmer

corpus = []
porter_stemmer = PorterStemmer()

for file_name in os.listdir("data/txts/"):
    if "zip" in file_name: continue
    # textract writes UTF-8 by default, so read the files back as UTF-8
    with open("data/txts/" + file_name, "r", encoding="utf-8") as f:
        try:
            text = f.read()
        except UnicodeDecodeError:
            print("UnicodeDecodeError in ", file_name)
            continue

    stems = []
    for token in nltk.word_tokenize(text):
        # Stem every part of a hyphenated token, not only the first two
        for word in token.split("-"):
            stems.append(porter_stemmer.stem(word))

    with open("data/stemmed/" + file_name, "w", encoding="utf-8") as f:
        for s in stems:
            f.write(s + " ")
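
The hyphen handling in the stemming cell can be isolated into a small function, which makes it easier to check on a few tokens before running it over the whole corpus. The following is a minimal sketch under stated assumptions: `stem_tokens` and `toy_stem` are hypothetical names, and `toy_stem` is only a stand-in so the example runs without NLTK; in the notebook the stemmer would be `porter_stemmer.stem`. Unlike the cell above, this version also drops the empty strings produced by leading, trailing or lone hyphens.

```python
from typing import Callable, Iterable, List

def stem_tokens(tokens: Iterable[str], stem: Callable[[str], str]) -> List[str]:
    """Stem each token, splitting hyphenated tokens into all of their parts."""
    stems = []
    for token in tokens:
        for part in token.split("-"):
            if part:  # skip empty strings from leading, trailing or lone hyphens
                stems.append(stem(part))
    return stems

# Stand-in stemmer for illustration only; the notebook uses PorterStemmer().stem
def toy_stem(word: str) -> str:
    word = word.lower()
    return word[:-1] if word.endswith("s") else word

example = stem_tokens(["State-of-the-art", "networks", "-"], toy_stem)
print(" ".join(example))
```

With the real stemmer, the call would be `stem_tokens(nltk.word_tokenize(text), porter_stemmer.stem)`.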