<center>Notebook - 001 </center>

<center>
<h1>SCRAPING THE ARXIV DATA</h1>
</center>

<center><h4>This notebook scrapes the arXiv website for papers in the category "cs.CV" (Computer Vision) ,"stat.ML" / "cs.LG" (Machine Learning) and "cs.AI" (Artificial Intelligence). The papers are then saved in a csv file.</h4></center>

<center>
        <img src="https://1.bp.blogspot.com/-qNgnU6Fb4mQ/YNYg4YdWyaI/AAAAAAAAV04/Bbx5Ez0Iz_4PFOpFxuL2bPMrfLqFHF_rgCLcBGAsYHQ/s791/Data%2BScraping%2Bseminar%2Btopics.jpg" alt="Your Image">
    </center>

In [1]:
%pip install arxiv

Note: you may need to restart the kernel to use updated packages.




<h2>Import Libraries</h2>

In [2]:
import arxiv
import pandas as pd

from tqdm import tqdm
from pathlib import Path

##### Assigning the path to the data directory that is one level above the current working directory to the variable PATH_DATA_BASE.

In [3]:
PATH_DATA_BASE = Path.cwd().parent / "data"
print(PATH_DATA_BASE)

C:\Users\soulo\MACHINE_LEARNING\PaperMate\data


## Scraping the arXiv website

<p>Defining a list of keywords that we will use to query the arXiv API.</p>

In [4]:
query_keywords = [
    "\"image segmentation\"",
    "\"self-supervised learning\"",
    "\"representation learning\"",
    "\"image generation\"",
    "\"object detection\"",
    "\"transfer learning\"",
    "\"transformers\"",
    "\"adversarial training\"",
    "\"generative adversarial networks\"",
    "\"model compression\"",
    "\"few-shot learning\"",
    "\"natural language processing\"",
    "\"graph neural networks\"",
    "\"colorization\"",
    "\"depth estimation\"",
    "\"point cloud\"",
    "\"structured data\"",
    "\"optical flow\"",
    "\"reinforcement learning\"",
    "\"super resolution\"",
    "\"attention mechanisms\"",
    "\"tabular data\"",
    "\"unsupervised learning\"",
    "\"semi-supervised learning\"",
    "\"explainable AI\"",
    "\"radiance field\"",
    "\"decision tree\"",
    "\"time series analysis\"",
    "\"molecule generation\"",
    "\"large language models\"",
    "\"LLMs\"",
    "\"language models\"",
    "\"image classification\"",
    "\"document image classification\"",
    "\"encoder-decoder\"",
    "\"multimodal learning\"",
    "\"multimodal deep learning\"",
    "\"speech recognition\"",
    "\"generative models\"",
    "\"anomaly detection\"",
    "\"recommender systems\"",
    "\"robotics\"",
    "\"knowledge graphs\"",
    "\"cross-modal learning\"",
    "\"attention mechanisms\"",
    "\"unsupervised translation\"",
    "\"machine translation\"",
    "\"dialogue systems\"",
    "\"sentiment analysis\"",
    "\"question answering\"",
    "\"text summarization\"",
    "\"sequential modeling\"",
    "\"neurosymbolic AI\"",
    "\"fairness in AI\"",
    "\"transferable skills\"",
    "\"data augmentation\"",
    "\"neural architecture search\"",
    "\"active learning\"",
    "\"automated machine learning\"",
    "\"meta-learning\"",
    "\"domain adaptation\"",
    "\"time series forecasting\"",
    "\"weakly supervised learning\"",
    "\"self-supervised vision\"",
    "\"visual reasoning\"",
    "\"knowledge distillation\"",
    "\"hyperparameter optimization\"",
    "\"cross-validation\"",
    "\"explainable reinforcement learning\"",
    "\"meta-reinforcement learning\"",
    "\"generative models in NLP\"",
    "\"knowledge representation and reasoning\"",
    "\"zero-shot learning\"",
    "\"self-attention mechanisms\"",
    "\"ensemble learning\"",
    "\"online learning\"",
    "\"cognitive computing\"",
    "\"self-driving cars\"",
    "\"emerging AI trends\"",
    "\"Attention is all you need\"",
    "\"GPT\"",
    "\"BERT\"",
    "\"Transformers\"",
    "\"yolo\"",
    "\"speech recognisation\"",
    "\"LSTM\"",
    "\"GRU\"",
    "\"BERT - Bidirectinal Encoder Representation of Transformes\"",
    "\"Large Language Model\" ",
    "\"Stabel diffusion\"",
    "\"Attention is all you need\"",
    "\"Encoder-Decoder\"",
     "\"Paper Recommendatin systems\"",
     "\" Latent Dirichlet Allocation (LDA)\"",
     "\"Transformers\"",
     "\"Generative Pre-trained Transforme\"",
]


<p>Afterwards, we define a function that creates a search object using the given query. It sets the maximum number of results for each category to 6000 and sorts them by the last updated date. </p>

In [5]:
client = arxiv.Client(num_retries=20, page_size=500)

In [6]:


def query_with_keywords(query) -> tuple:
    """
    Query the arXiv API for research papers based on a specific query and filter results by selected categories.
    
    Args:
        query (str): The search query to be used for fetching research papers from arXiv.
    
    Returns:
        tuple: A tuple containing three lists - terms, titles, and abstracts of the filtered research papers.
        
            terms (list): A list of lists, where each inner list contains the categories associated with a research paper.
            titles (list): A list of titles of the research papers.
            abstracts (list): A list of abstracts (summaries) of the research papers.
            urls (list): A list of URLs for the papers' detail page on the arXiv website.
    """
    
    # Create a search object with the query and sorting parameters.
    search = arxiv.Search(
        query=query,
        max_results=6000,
        sort_by=arxiv.SortCriterion.LastUpdatedDate
    )
    
    # Initialize empty lists for terms, titles, abstracts, and urls.
    terms = []
    titles = []
    abstracts = []
    urls = []
    ids = []
    # For each result in the search...
    for res in tqdm(client.results(search), desc=query):
        # Check if the primary category of the result is in the specified list.
        if res.primary_category in ["cs.CV", "stat.ML", "cs.LG", "cs.AI" ,"cs.CL"]:
            # If it is, append the result's categories, title, summary, and url to their respective lists.
            terms.append(res.categories)
            titles.append(res.title)
            abstracts.append(res.summary)
            urls.append(res.entry_id)
            ids.append(res.entry_id.split('/')[-1])

    # Return the four lists.
    return terms, titles, abstracts, urls , ids

In [7]:
all_titles = []
all_abstracts = []
all_terms = []
all_urls = []
all_ids = []

for query in query_keywords:
    terms, titles, abstracts, urls , ids = query_with_keywords(query)
    all_titles.extend(titles)
    all_abstracts.extend(abstracts)
    all_terms.extend(terms)
    all_urls.extend(urls)
    all_ids.extend(ids)

"image segmentation": 3322it [01:27, 37.75it/s]
"self-supervised learning": 0it [00:03, ?it/s]
"representation learning": 6000it [02:42, 36.85it/s]
"image generation": 2580it [01:09, 36.98it/s]
"object detection": 6000it [02:56, 34.08it/s]
"transfer learning": 5642it [02:38, 35.55it/s]
"transformers": 6000it [02:39, 37.65it/s]
"adversarial training": 2855it [01:12, 39.43it/s]
"generative adversarial networks": 6000it [02:13, 45.09it/s]
"model compression": 800it [00:26, 30.71it/s]
"few-shot learning": 0it [00:03, ?it/s]
"natural language processing": 6000it [02:50, 35.25it/s]
"graph neural networks": 5095it [02:33, 33.27it/s]
"colorization": 6000it [02:22, 42.04it/s]
"depth estimation": 1390it [00:35, 38.89it/s]
"point cloud": 5070it [02:17, 36.77it/s]
"structured data": 2110it [01:16, 27.47it/s]
"optical flow": 1656it [00:45, 36.54it/s]
"reinforcement learning": 6000it [02:28, 40.30it/s]
"super resolution": 3237it [01:13, 44.33it/s]
"attention mechanisms": 5512it [02:27, 37.32it/s]
"t

In [10]:
print(all_ids[1])

2307.10123v2


In [13]:
print(all_terms[:10])

[['cs.CV'], ['cs.CV'], ['cs.CV'], ['cs.CV', 'cs.AI'], ['cs.CV'], ['cs.CV'], ['cs.LG', 'cs.CR', 'eess.IV'], ['cs.CV'], ['cs.CV'], ['cs.CV']]


In [15]:
print(all_titles[1])

Two Approaches to Supervised Image Segmentation


### lets see the data scraped

In [16]:
arxiv_data = pd.DataFrame({
    'titles': all_titles,
    'abstracts': all_abstracts,
    'terms': all_terms,
    'urls': all_urls,
    'ids':all_ids,
})



In [17]:
arxiv_data.head()

Unnamed: 0,titles,abstracts,terms,urls,ids
0,Point2Mask: Point-supervised Panoptic Segmenta...,Weakly-supervised image segmentation has recen...,[cs.CV],http://arxiv.org/abs/2308.01779v1,2308.01779v1
1,Two Approaches to Supervised Image Segmentation,Though performed almost effortlessly by humans...,[cs.CV],http://arxiv.org/abs/2307.10123v2,2307.10123v2
2,Semi-Siamese Network for Robust Change Detecti...,Automatic defect detection for 3D printing pro...,[cs.CV],http://arxiv.org/abs/2212.08583v2,2212.08583v2
3,Data-Centric Diet: Effective Multi-center Data...,This paper seeks to address the dense labeling...,"[cs.CV, cs.AI]",http://arxiv.org/abs/2308.01189v1,2308.01189v1
4,Prompt-Based Tuning of Transformer Models for ...,Medical image segmentation is a vital healthca...,[cs.CV],http://arxiv.org/abs/2305.18948v2,2305.18948v2


In [18]:
len(arxiv_data)

135321

### Save the data - Finally, we export the DataFrame to a csv file.

In [19]:
arxiv_data.to_csv(PATH_DATA_BASE / 'data.csv', index=False)