<center>
<h1>SCRAPING THE ARXIV DATA</h1>
</center>

<center><h4>This notebook scrapes arXiv (through its API) for papers whose primary category is "cs.CV" (Computer Vision), "stat.ML" / "cs.LG" (Machine Learning), or "cs.AI" (Artificial Intelligence). The papers are then saved to a CSV file.</h4></center>

<center>
        <img src="https://1.bp.blogspot.com/-qNgnU6Fb4mQ/YNYg4YdWyaI/AAAAAAAAV04/Bbx5Ez0Iz_4PFOpFxuL2bPMrfLqFHF_rgCLcBGAsYHQ/s791/Data%2BScraping%2Bseminar%2Btopics.jpg" alt="Data scraping illustration">
    </center>

In [1]:
%pip install arxiv








<h2>Import Libraries</h2>

In [2]:
import arxiv
import pandas as pd

from tqdm import tqdm
from pathlib import Path

##### Assign the path of the `data` directory, which sits one level above the current working directory, to the variable `PATH_DATA_BASE`.

In [3]:
PATH_DATA_BASE = Path.cwd().parent / "data"
print(PATH_DATA_BASE)

C:\Users\soulo\PaperMate\data


## Scraping the arXiv website

<p>Defining a list of keywords that we will use to query the arXiv API.</p>

In [4]:
query_keywords = [
    "\"image segmentation\"",
    "\"self-supervised learning\"",
    "\"representation learning\"",
    "\"image generation\"",
    "\"object detection\"",
    "\"transfer learning\"",
    "\"transformers\"",
    "\"adversarial training\"",
    "\"generative adversarial networks\"",
    "\"model compression\"",
    "\"few-shot learning\"",
    "\"natural language processing\"",
    "\"graph neural networks\"",
    "\"colorization\"",
    "\"depth estimation\"",
    "\"point cloud\"",
    "\"structured data\"",
    "\"optical flow\"",
    "\"reinforcement learning\"",
    "\"super resolution\"",
    "\"attention mechanisms\"",
    "\"tabular data\"",
    "\"unsupervised learning\"",
    "\"semi-supervised learning\"",
    "\"explainable AI\"",
    "\"radiance field\"",
    "\"decision tree\"",
    "\"time series analysis\"",
    "\"molecule generation\"",
    "\"large language models\"",
    "\"LLMs\"",
    "\"language models\"",
    "\"image classification\"",
    "\"document image classification\"",
    "\"encoder-decoder\"",
    "\"multimodal learning\"",
    "\"multimodal deep learning\"",
    "\"speech recognition\"",
    "\"generative models\"",
    "\"anomaly detection\"",
    "\"recommender systems\"",
    "\"robotics\"",
    "\"knowledge graphs\"",
    "\"cross-modal learning\"",
    "\"attention mechanisms\"",
    "\"unsupervised translation\"",
    "\"machine translation\"",
    "\"dialogue systems\"",
    "\"sentiment analysis\"",
    "\"question answering\"",
    "\"text summarization\"",
    "\"sequential modeling\"",
    "\"neurosymbolic AI\"",
    "\"fairness in AI\"",
    "\"transferable skills\"",
    "\"data augmentation\"",
    "\"neural architecture search\"",
    "\"active learning\"",
    "\"automated machine learning\"",
    "\"meta-learning\"",
    "\"domain adaptation\"",
    "\"time series forecasting\"",
    "\"weakly supervised learning\"",
    "\"self-supervised vision\"",
    "\"visual reasoning\"",
    "\"knowledge distillation\"",
    "\"hyperparameter optimization\"",
    "\"cross-validation\"",
    "\"explainable reinforcement learning\"",
    "\"meta-reinforcement learning\"",
    "\"generative models in NLP\"",
    "\"knowledge representation and reasoning\"",
    "\"zero-shot learning\"",
    "\"self-attention mechanisms\"",
    "\"ensemble learning\"",
    "\"online learning\"",
    "\"cognitive computing\"",
    "\"self-driving cars\"",
    "\"emerging AI trends\"",
    "\"Attention is all you need\"",
    "\"GPT\"",
    "\"BERT\"",
    "\"Transformers\"",
    "\"yolo\"",
    "\"speech recognisation\"",
    "\"LSTM\"",
    "\"GRU\"",
    
]
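
Each keyword costs a full API query, so repeated entries in the list mean repeated downloads. The sketch below shows one way to drop repeats before querying. `dedupe_keywords` is a hypothetical helper, not part of the notebook, and treating keywords that differ only in letter case as duplicates is an assumption about how the search behaves.

```python
def dedupe_keywords(keywords):
    """Return keywords with case-insensitive duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for kw in keywords:
        # Compare on a normalized form: drop surrounding quotes/whitespace, ignore case.
        key = kw.strip().strip('"').strip().casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(kw)
    return unique


# Illustrative sample taken from the list above.
sample = ['"attention mechanisms"', '"transformers"', '"attention mechanisms"', '"Transformers"']
print(dedupe_keywords(sample))
```

Calling `dedupe_keywords(query_keywords)` before the query loop would keep the original order and quoting of the remaining entries.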


<p>Next, we create a client and define a function that builds a search object for a given query. Each query returns at most 6000 results, sorted by last-updated date, and only papers whose primary category is cs.CV, stat.ML, cs.LG, or cs.AI are kept.</p>

In [5]:
client = arxiv.Client(num_retries=20, page_size=500)
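
To get a rough feel for how many page requests this configuration implies, the arithmetic can be sketched as below. `estimate_requests` is a hypothetical helper, and `n_queries=10` is an illustrative value rather than the length of the keyword list. Retries triggered by `num_retries` would come on top of this estimate.

```python
import math


def estimate_requests(max_results, page_size, n_queries):
    """Return (pages per query, total pages) assuming every query hits max_results."""
    pages_per_query = math.ceil(max_results / page_size)
    return pages_per_query, pages_per_query * n_queries


# 6000 results per query fetched in pages of 500; 10 queries is an assumed example.
pages, total = estimate_requests(max_results=6000, page_size=500, n_queries=10)
print(pages, total)
```

This is an upper bound: queries that return fewer than 6000 results need fewer pages.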

In [6]:


def query_with_keywords(query: str) -> tuple:
    """
    Query the arXiv API for research papers matching a query and keep only the
    results whose primary category is one of the selected categories.
    
    Args:
        query (str): The search query to be used for fetching research papers from arXiv.
    
    Returns:
        tuple: A tuple containing five parallel lists - terms, titles, abstracts, urls, and ids of the filtered research papers.
        
            terms (list): A list of lists, where each inner list contains the categories associated with a research paper.
            titles (list): A list of titles of the research papers.
            abstracts (list): A list of abstracts (summaries) of the research papers.
            urls (list): A list of URLs for the papers' detail page on the arXiv website.
            ids (list): A list of versioned arXiv IDs (the last path segment of each URL), e.g. "2210.06475v2".
    """
    
    # Create a search object with the query and sorting parameters.
    search = arxiv.Search(
        query=query,
        max_results=6000,
        sort_by=arxiv.SortCriterion.LastUpdatedDate
    )
    
    # Initialize empty lists for terms, titles, abstracts, urls, and ids.
    terms = []
    titles = []
    abstracts = []
    urls = []
    ids = []
    # For each result in the search...
    for res in tqdm(client.results(search), desc=query):
        # Check if the primary category of the result is in the specified list.
        if res.primary_category in ["cs.CV", "stat.ML", "cs.LG", "cs.AI"]:
            # If it is, append the result's categories, title, summary, url, and id to their respective lists.
            terms.append(res.categories)
            titles.append(res.title)
            abstracts.append(res.summary)
            urls.append(res.entry_id)
            ids.append(res.entry_id.split('/')[-1])

    # Return the five lists.
    return terms, titles, abstracts, urls, ids

In [9]:
all_titles = []
all_abstracts = []
all_terms = []
all_urls = []
all_ids = []

for query in query_keywords:
    terms, titles, abstracts, urls, ids = query_with_keywords(query)
    all_titles.extend(titles)
    all_abstracts.extend(abstracts)
    all_terms.extend(terms)
    all_urls.extend(urls)
    all_ids.extend(ids)

"image segmentation": 3303it [01:33, 35.47it/s]
"self-supervised learning": 0it [00:03, ?it/s]
"representation learning": 6000it [03:13, 30.98it/s]
"image generation": 2557it [01:36, 26.56it/s]
"object detection": 6000it [03:26, 29.04it/s]
"transfer learning": 5623it [03:09, 29.65it/s]
"transformers": 6000it [03:26, 29.00it/s]
"adversarial training": 2837it [01:16, 37.26it/s]
"generative adversarial networks": 5984it [03:38, 27.44it/s]
"model compression": 801it [00:21, 36.57it/s]
"few-shot learning": 0it [00:04, ?it/s]
"natural language processing": 6000it [03:38, 27.49it/s]
"graph neural networks": 5074it [04:20, 19.45it/s]
"colorization": 6000it [03:25, 29.18it/s]
"depth estimation": 1388it [00:45, 30.43it/s]
"point cloud": 5052it [02:32, 33.10it/s]
"structured data": 2104it [01:19, 26.62it/s]
"optical flow": 1651it [00:55, 29.82it/s]
"reinforcement learning": 6000it [03:02, 32.89it/s]
"super resolution": 3228it [01:34, 34.02it/s]
"attention mechanisms": 5488it [02:28, 36.83it/s]
"t

In [10]:
print(urls[50].split('/')[-1])

2210.06475v2
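
The last path segment of the URL carries both the paper identifier and its version suffix. If the two parts are needed separately, a small parser along these lines could be used. `split_arxiv_id` is a hypothetical helper, not part of the notebook or of the `arxiv` package. Like the notebook's `split('/')[-1]`, it only looks at the last segment, so old-style identifiers that contain a slash would lose their archive prefix.

```python
import re

# Base identifier followed by a trailing "v<number>" version suffix.
_ID_PATTERN = re.compile(r"^(?P<base>.+?)v(?P<version>\d+)$")


def split_arxiv_id(url_or_id):
    """Split 'http://arxiv.org/abs/2210.06475v2' into ('2210.06475', 2)."""
    last_segment = url_or_id.rstrip("/").split("/")[-1]
    match = _ID_PATTERN.match(last_segment)
    if match is None:
        # No version suffix found: return the segment unchanged.
        return last_segment, None
    return match.group("base"), int(match.group("version"))


print(split_arxiv_id("http://arxiv.org/abs/2210.06475v2"))
```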


In [11]:
print(ids[1])

2305.16165v2


In [12]:
print(terms[1])

['cs.LG', 'cs.CY']


In [13]:
print(titles[1])

A Conceptual Model for End-to-End Causal Discovery in Knowledge Tracing
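
Because one paper can match several keywords, the combined lists may contain the same ID more than once. A minimal sketch of dropping repeats while keeping the parallel lists aligned follows. `dedupe_by_id` is a hypothetical helper, and the demo titles are placeholders.

```python
def dedupe_by_id(ids, *columns):
    """Keep the first occurrence of each ID and filter the parallel columns to match."""
    seen = set()
    keep = []
    for index, paper_id in enumerate(ids):
        if paper_id not in seen:
            seen.add(paper_id)
            keep.append(index)
    filtered_ids = [ids[i] for i in keep]
    filtered_columns = [[column[i] for i in keep] for column in columns]
    return filtered_ids, filtered_columns


# Placeholder rows: the first and third entries share an ID.
ids_demo = ["2212.13504v3", "2307.14126v1", "2212.13504v3"]
titles_demo = ["Paper A", "Paper B", "Paper A"]
unique_ids, (unique_titles,) = dedupe_by_id(ids_demo, titles_demo)
print(unique_ids, unique_titles)
```

In the notebook this would be called as `dedupe_by_id(all_ids, all_titles, all_abstracts, all_terms, all_urls)`. Whether two versions of the same paper should count as duplicates is a separate design choice that this sketch does not make.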


### Let's look at the scraped data

In [14]:
arxiv_data = pd.DataFrame({
    'titles': all_titles,
    'abstracts': all_abstracts,
    'terms': all_terms,
    'urls': all_urls,
    'ids': all_ids,
})



In [15]:
arxiv_data

Unnamed: 0,titles,abstracts,terms,urls,ids
0,DAE-Former: Dual Attention-guided Efficient Tr...,Transformers have recently gained attention in...,[cs.CV],http://arxiv.org/abs/2212.13504v3,2212.13504v3
1,Multi-modal Learning with Missing Modality via...,The missing modality issue is critical but non...,[cs.CV],http://arxiv.org/abs/2307.14126v1,2307.14126v1
2,Unite-Divide-Unite: Joint Boosting Trunk and S...,High-accuracy Dichotomous Image Segmentation (...,[cs.CV],http://arxiv.org/abs/2307.14052v1,2307.14052v1
3,MDViT: Multi-domain Vision Transformer for Sma...,"Despite its clinical utility, medical image se...",[cs.CV],http://arxiv.org/abs/2307.02100v2,2307.02100v2
4,Learning Transferable Object-Centric Diffeomor...,Obtaining labelled data in medical image segme...,[cs.CV],http://arxiv.org/abs/2307.13645v1,2307.13645v1
...,...,...,...,...,...
93800,Ask the GRU: Multi-Task Learning for Deep Text...,In a variety of application domains the conten...,"[stat.ML, cs.CL, cs.LG, I.2.7; I.2.6]",http://arxiv.org/abs/1609.02116v2,1609.02116v2
93801,Faster Training of Very Deep Networks Via p-No...,A major contributing factor to the recent adva...,"[stat.ML, cs.LG, cs.NE]",http://arxiv.org/abs/1608.03639v1,1608.03639v1
93802,Drawing and Recognizing Chinese Characters wit...,Recent deep learning based approaches have ach...,[cs.CV],http://arxiv.org/abs/1606.06539v1,1606.06539v1
93803,Delving Deeper into Convolutional Networks for...,We propose an approach to learn spatio-tempora...,"[cs.CV, cs.LG, cs.NE]",http://arxiv.org/abs/1511.06432v4,1511.06432v4


### Save the data

Finally, we export the DataFrame to a CSV file.

In [20]:
PATH_DATA_BASE.mkdir(parents=True, exist_ok=True)  # make sure the target directory exists
arxiv_data.to_csv(PATH_DATA_BASE / 'data.csv', index=False)
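
One thing to keep in mind when loading the file later: a CSV cell can only hold text, so the `terms` lists come back as strings such as `"['cs.CV', 'cs.LG']"`. The sketch below uses an in-memory stand-in for `data.csv` with a placeholder row to show how `ast.literal_eval` can turn such a string back into a list. It is an illustration under those assumptions, not the notebook's own loading code.

```python
import ast
import csv
import io

# In-memory stand-in for data.csv; the row content is a placeholder.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["titles", "terms"])
writer.writerow(["Example paper", str(["cs.CV", "cs.LG"])])

# Read the file back: every cell is now a plain string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["terms"]

# literal_eval safely parses the string form of a Python list.
parsed = ast.literal_eval(raw)
print(type(raw).__name__, type(parsed).__name__)
print(parsed)
```

With pandas, the same conversion could be applied to the whole column after reading, for example with `df['terms'].apply(ast.literal_eval)`.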