# ArXiv Scraper Notes

In [None]:
!pip install -U arxivscraper

Get all preprints in "Computation and Language" (`cs.CL`) from 2022-01-01 through 2023-02-05, roughly the past year.

In [None]:
import arxivscraper

scraper = arxivscraper.Scraper(
    category='cs',
    date_from='2022-01-01',
    date_until='2023-02-05',
    filters={'categories': ['cs.CL']}
)

Begin scraping...

In [None]:
output = scraper.scrape()

Convert records to Pandas dataframe:

In [None]:
import pandas as pd

cols = ['id', 'title', 'categories', 'abstract', 'doi', 'created', 'updated', 'authors']
df = pd.DataFrame(output, columns=cols)
df.head()

## Processing PDFs

Now we need to convert the PDFs to plaintext, which is almost always unreasonably difficult to do...

In [20]:
import requests

doi = "2210.03629"  # note: this is an arXiv ID, not a true DOI
url = f"https://arxiv.org/pdf/{doi}.pdf"

# download and save PDF to file
r = requests.get(url)
r.raise_for_status()  # stop here on an HTTP error instead of saving an error page
with open(f"{doi}.pdf", 'wb') as f:
    f.write(r.content)
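To repeat the download for many papers, the same steps could be wrapped in a small helper. This is only a sketch: `arxiv_pdf_url`, `download_pdfs`, the `pdfs` output directory, and the 3-second pause are hypothetical choices, not part of the original notebook. It uses the standard library's `urllib.request` in place of `requests` so the block has no third-party dependency, and it assumes new-style arXiv IDs with no slash in them.

```python
import time
import urllib.request
from pathlib import Path


def arxiv_pdf_url(arxiv_id: str) -> str:
    """Build the PDF URL for an arXiv ID, same pattern as the cell above."""
    return f"https://arxiv.org/pdf/{arxiv_id}.pdf"


def download_pdfs(arxiv_ids, out_dir="pdfs", delay=3.0):
    """Download each PDF once and return the local paths (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for arxiv_id in arxiv_ids:
        target = out / f"{arxiv_id}.pdf"
        if not target.exists():  # skip files that are already on disk
            # urlopen raises an HTTPError on 4xx/5xx responses
            with urllib.request.urlopen(arxiv_pdf_url(arxiv_id)) as resp:
                target.write_bytes(resp.read())
            time.sleep(delay)  # assumed pause between requests, to be polite
        saved.append(target)
    return saved
```

With the dataframe from earlier, this might be called as `download_pdfs(df['id'].head(5))`, assuming the `id` column holds IDs in the same form as the one used above.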

I haven't tested other, simpler PDF readers in a long time, so for now I'll just go with the easy option, PyPDF2.

In [18]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [36]:
import PyPDF2

data = {}

# open the PDF file in binary read mode
with open(f"{doi}.pdf", "rb") as file:
    # create a PDF reader object
    pdf = PyPDF2.PdfReader(file)
    # extract the text of every page and concatenate it
    data[doi] = "".join(page.extract_text() for page in pdf.pages)

In [37]:
# rough split into chunks wherever a line ends with a period
data[doi] = data[doi].split('.\n')
print(len(data[doi]))
print(data[doi][0])

434
REAC T: S YNERGIZING REASONING AND ACTING IN
LANGUAGE MODELS
Shunyu Yao*,1, Jeffrey Zhao2, Dian Yu2, Nan Du2, Izhak Shafran2, Karthik Narasimhan1, Yuan Cao2
1Department of Computer Science, Princeton University
2Google Research, Brain team
1{shunyuy,karthikn}@princeton.edu
2{jeffreyzhao,dianyu,dunan,izhak,yuancao}@google.com
ABSTRACT
While large language models (LLMs) have demonstrated impressive performance
across tasks in language understanding and interactive decision making, their
abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action
plan generation) have primarily been studied as separate topics. In this paper, we
explore the use of LLMs to generate both reasoning traces and task-speciﬁc actions
in an interleaved manner, allowing for greater synergy between the two: reasoning
traces help the model induce, track, and update action plans as well as handle
exceptions, while actions allow it to interface with and gather additional information
from exte

We can deal with the annoying formatting later; right now I don't think it's important.
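When the formatting does need attention, a first pass could look something like the sketch below. `clean_extracted_text` is a hypothetical helper, not something from PyPDF2. It assumes the main problems are ligature characters (such as the "ﬁ" in the output above), words hyphenated across line breaks, and hard line wraps. It will not repair the odd letter spacing in the title, it will wrongly merge a genuine hyphenated compound that happens to break at a line end, and NFKC normalization also folds other compatibility characters such as superscript digits.

```python
import re
import unicodedata


def clean_extracted_text(text: str) -> str:
    """Rough cleanup for text extracted from a PDF (hypothetical helper)."""
    # NFKC folds compatibility characters, e.g. the "fi" ligature U+FB01 becomes "fi"
    text = unicodedata.normalize("NFKC", text)
    # rejoin words hyphenated across a line break: "informa-\ntion" becomes "information"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # collapse remaining line breaks and runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()
```

Since `data[doi]` is a list of chunks after the split above, this could be applied as `[clean_extracted_text(chunk) for chunk in data[doi]]`.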