## Dataset Partitioning Script

In [3]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("Cornell-University/arxiv")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'arxiv' dataset.
Path to dataset files: /kaggle/input/arxiv


In [7]:

import json
original_dataset = path + "/arxiv-metadata-oai-snapshot.json"
with open(original_dataset, "r") as f:
  for i in range(3):
    line = f.readline()
    data = json.loads(line)
    print(data)

{'id': '0704.0001', 'submitter': 'Pavel Nadolsky', 'authors': "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan", 'title': 'Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies', 'comments': '37 pages, 15 figures; published version', 'journal-ref': 'Phys.Rev.D76:013009,2007', 'doi': '10.1103/PhysRevD.76.013009', 'report-no': 'ANL-HEP-PR-07-12', 'categories': 'hep-ph', 'license': None, 'abstract': '  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with data from the Fermilab

In [8]:
import json

subset = "arxiv_csAI_subset.json"
target_category = "cs.AI"

total_lines = 0
kept_lines = 0

with open(original_dataset, "r") as infile, open(subset, "w") as outfile:
    for line in infile:
        total_lines += 1
        data = json.loads(line)
        cats = data.get('categories', '')
        cats_list = cats.split() if isinstance(cats, str) else []
        if target_category in cats_list:
            # write the matching record to the subset file as one JSON line
            json.dump(data, outfile)
            outfile.write("\n")
            kept_lines += 1

print("finished")


finished
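
The filtering cell counts `total_lines` and `kept_lines` but only prints "finished". Below is a minimal sketch of the same filter wrapped in a function that returns both counts, so the subset size can be reported. The name `filter_by_category` is hypothetical and not part of the original notebook; the sketch assumes one JSON object per line, as in the snapshot file.

```python
import json


def filter_by_category(src_path, dst_path, target_category):
    """Copy records whose space-separated 'categories' field lists target_category.

    Hypothetical helper (not from the notebook). Returns (total_lines, kept_lines).
    """
    total_lines = 0
    kept_lines = 0
    with open(src_path, "r", encoding="utf-8") as infile, \
            open(dst_path, "w", encoding="utf-8") as outfile:
        for line in infile:
            if not line.strip():
                continue  # ignore blank lines instead of failing in json.loads
            total_lines += 1
            record = json.loads(line)
            cats = record.get("categories", "")
            cats_list = cats.split() if isinstance(cats, str) else []
            # Exact match on a whole category token, e.g. "cs.AI"
            if target_category in cats_list:
                outfile.write(json.dumps(record) + "\n")
                kept_lines += 1
    return total_lines, kept_lines
```

With the notebook's variables this could be called as `total, kept = filter_by_category(original_dataset, subset, "cs.AI")`, followed by `print(f"kept {kept} of {total} records")`.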


In [9]:
with open(subset, "r") as f:
  for i in range(3):
    line = f.readline()
    data = json.loads(line)
    print(data)

{'id': '0704.0047', 'submitter': 'Igor Grabec', 'authors': 'T. Kosel and I. Grabec', 'title': 'Intelligent location of simultaneously active acoustic emission sources:\n  Part I', 'comments': '5 pages, 5 eps figures, uses IEEEtran.cls', 'journal-ref': None, 'doi': None, 'report-no': None, 'categories': 'cs.NE cs.AI', 'license': None, 'abstract': '  The intelligent acoustic emission locator is described in Part I, while Part\nII discusses blind source separation, time delay estimation and location of two\nsimultaneously active continuous acoustic emission sources.\n  The location of acoustic emission on complicated aircraft frame structures is\na difficult problem of non-destructive testing. This article describes an\nintelligent acoustic emission source locator. The intelligent locator comprises\na sensor antenna and a general regression neural network, which solves the\nlocation problem based on learning from examples. Locator performance was\ntested on different test specimens. Tests

## Load Data

In [10]:
import pandas as pd
import json

In [11]:
file = subset

def get_data():
    with open(file) as f:
        for line in f:
            yield line

## Clean and Format Data

In [12]:
data = get_data()

cols = ['id', 'authors', 'title', 'update_date', 'categories', 'doi']

interested_data = []
for line in data:
  paper = json.loads(line)
  interested_data.append({col: paper.get(col) for col in cols})

df = pd.DataFrame(interested_data)
df.head(5)

| | id | authors | title | update_date | categories | doi |
|---|---|---|---|---|---|---|
| 0 | 0704.0047 | T. Kosel and I. Grabec | Intelligent location of simultaneously active ... | 2009-09-29 | cs.NE cs.AI | None |
| 1 | 0704.0050 | T. Kosel and I. Grabec | Intelligent location of simultaneously active ... | 2007-05-23 | cs.NE cs.AI | None |
| 2 | 0704.0304 | Carlos Gershenson | The World as Evolving Information | 2013-04-05 | cs.IT cs.AI math.IT q-bio.PE | 10.1007/978-3-642-18003-3_10 |
| 3 | 0704.0985 | Mohd Abubakr, R.M.Vinay | Architecture for Pseudo Acausal Evolvable Embe... | 2007-05-23 | cs.NE cs.AI | None |
| 4 | 0704.1028 | Jianlin Cheng | A neural network approach to ordinal regression | 2007-05-23 | cs.LG cs.AI cs.NE | None |


What share of the papers have a DOI, which gives a direct link (URL) to the paper?

In [13]:
total_rows = df.shape[0]
rows_with_links = df['doi'].notna().sum()

# The ratio is a fraction between 0 and 1; the % format spec scales it by 100
print(f"Percent Rows with Links = {rows_with_links / total_rows:.0%}")

Percent Rows with Links = 11%
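
Since the DOI is treated here as the route to the paper itself, the following sketch shows one way to turn a DOI into a clickable link through the public `https://doi.org/` resolver. The helper name `doi_to_url` is hypothetical and not part of the notebook; the sketch assumes missing DOIs arrive as `None` or NaN and does no percent-encoding, which DOIs with unusual characters may need.

```python
def doi_to_url(doi):
    """Return a doi.org resolver URL for a DOI string, or None if the DOI is missing.

    Hypothetical helper (not from the notebook).
    """
    # Missing values show up as None (or NaN, a float, inside a DataFrame)
    if not isinstance(doi, str) or not doi.strip():
        return None
    return "https://doi.org/" + doi.strip()


print(doi_to_url("10.1007/978-3-642-18003-3_10"))
# → https://doi.org/10.1007/978-3-642-18003-3_10
print(doi_to_url(None))
# → None
```

Applied to the DataFrame, this could look like `df['url'] = df['doi'].map(doi_to_url)`, where the `url` column name is again only an illustration.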


## Search Engine

### Most Basic Search
Prints the first 5 papers whose title contains the search phrase, ignoring case.

In [14]:
key_word = 'search engine'

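# regex=False treats the phrase literally; na=False skips rows with a missing title
results = df[df['title'].str.contains(key_word, case=False, regex=False, na=False)].head(5)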

print(results['title'])

1326    Intelligent Semantic Web Search Engines: A Bri...
6297               Towards the Ontology Web Search Engine
6749    An Innovative Approach for online Meta Search ...
9274            Realization of Ontology Web Search Engine
9920    Search Engine Guided Non-Parametric Neural Mac...
Name: title, dtype: object
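
The search above matches the exact phrase only, so a title such as "Engine design for semantic search" would be missed. As one possible next step, here is a sketch in plain Python that keeps titles containing every query word in any order. The function `search_titles` is hypothetical, and the third sample title is made up for illustration; the first two come from the outputs above.

```python
def search_titles(titles, query, limit=5):
    """Return up to `limit` (index, title) pairs whose title contains every query word.

    Hypothetical helper (not from the notebook). Matching is a case-insensitive
    substring test per word, so word order in the title does not matter.
    """
    words = query.lower().split()
    results = []
    for idx, title in enumerate(titles):
        if len(results) >= limit:
            break
        if not isinstance(title, str):
            continue  # skip missing titles
        lowered = title.lower()
        if all(word in lowered for word in words):
            results.append((idx, title))
    return results


sample_titles = [
    "Towards the Ontology Web Search Engine",
    "A neural network approach to ordinal regression",
    "Engine design for semantic search",  # made-up title for illustration
]
for idx, title in search_titles(sample_titles, "search engine"):
    print(idx, title)
```

With the DataFrame from this notebook, the same function could be called as `search_titles(df['title'].tolist(), key_word)`.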
