In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Overview

Read in and explore dataset:
https://www.kaggle.com/datasets/sumitm004/arxiv-scientific-research-papers-dataset

To review:

## Data Cleaning
- [X] Data structure (columns, data types, missing values)
- [ ] Identify duplicates
- [ ] Data distribution (histograms, box plots)
- [ ] Identify temporal coverage (date ranges)

## Data Exploration
- [ ] Look for class imbalance in target variable
- [ ] Generate derived features as needed and characterize
- [ ] Identify statistical relationships between features


# Functions

In [3]:
def create_col_info_df(df):
    """
    Take a dataframe and return a new dataframe with one row per column,
    summarizing missing values, datatype, and number of unique values.
    """
    missing_values = df.isnull().sum()
    pct_missing = (missing_values/len(df)) * 100
    datatypes = df.dtypes
    unique_values = df.nunique()

    col_info_df = pd.DataFrame({
        "Column name": df.columns,
        "Number of Missing Values": missing_values,
        "Percent Missing Values": pct_missing,
        "Datatype": datatypes,
        "Number of Unique Values": unique_values
    })
    # reset the index
    col_info_df = col_info_df.reset_index(drop=True)
    return col_info_df

# Data Cleaning

In [2]:
df = pd.read_csv('../data/raw/arXiv_scientific_dataset.csv')
display(df.head())

Unnamed: 0,id,title,category,category_code,published_date,updated_date,authors,first_author,summary,summary_word_count
0,cs-9308101v1,Dynamic Backtracking,Artificial Intelligence,cs.AI,8/1/93,8/1/93,['M. L. Ginsberg'],'M. L. Ginsberg',Because of their occasional need to return to ...,79
1,cs-9308102v1,A Market-Oriented Programming Environment and ...,Artificial Intelligence,cs.AI,8/1/93,8/1/93,['M. P. Wellman'],'M. P. Wellman',Market price systems constitute a well-underst...,119
2,cs-9309101v1,An Empirical Analysis of Search in GSAT,Artificial Intelligence,cs.AI,9/1/93,9/1/93,"['I. P. Gent', 'T. Walsh']",'I. P. Gent',We describe an extensive study of search in GS...,167
3,cs-9311101v1,The Difficulties of Learning Logic Programs wi...,Artificial Intelligence,cs.AI,11/1/93,11/1/93,"['F. Bergadano', 'D. Gunetti', 'U. Trinchero']",'F. Bergadano',As real logic programmers normally use cut (!)...,174
4,cs-9311102v1,Software Agents: Completing Patterns and Const...,Artificial Intelligence,cs.AI,11/1/93,11/1/93,"['J. C. Schlimmer', 'L. A. Hermens']",'J. C. Schlimmer',To support the goal of allowing users to recor...,187


In [5]:
col_inf_df=create_col_info_df(df)
display(col_inf_df)

Unnamed: 0,Column name,Number of Missing Values,Percent Missing Values,Datatype,Number of Unique Values
0,id,0,0.0,object,136238
1,title,0,0.0,object,136154
2,category,0,0.0,object,138
3,category_code,0,0.0,object,139
4,published_date,0,0.0,object,7259
5,updated_date,0,0.0,object,7196
6,authors,0,0.0,object,125548
7,first_author,0,0.0,object,77742
8,summary,0,0.0,object,136193
9,summary_word_count,0,0.0,int64,346


# Column information
* There are 10 columns.
* There is no missing data!
* Interestingly, there are 138 categories but 139 category codes; this discrepancy is worth exploring.
* I would have expected the 'id', 'title' and 'summary' columns to have the same number of unique values, but they do not. 
* Date columns need to be converted to datetime format.
* May be useful to create an 'author_count' column and 'title_word_count' column.
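
A minimal sketch of the two derived columns mentioned above, run on a toy frame rather than the real data. It assumes `authors` is stored as the string form of a Python list (as the `head()` output suggests) and that a simple whitespace split is good enough for counting title words; `toy` is a made-up name.

```python
import ast
import pandas as pd

# Toy rows shaped like the dataset's 'title' and 'authors' columns
toy = pd.DataFrame({
    "title": ["Dynamic Backtracking", "An Empirical Analysis of Search in GSAT"],
    "authors": ["['M. L. Ginsberg']", "['I. P. Gent', 'T. Walsh']"],
})

# Assumption: 'authors' holds the string repr of a list, so parse it safely
toy["author_count"] = toy["authors"].apply(lambda s: len(ast.literal_eval(s)))

# Count whitespace-separated tokens in each title
toy["title_word_count"] = toy["title"].str.split().str.len()

print(toy[["title", "author_count", "title_word_count"]])
```

If any `authors` value is not a valid list literal, `ast.literal_eval` will raise, which would itself be a useful data-quality signal.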

In [9]:
# Convert date columns to datetime; the raw CSV stores them as M/D/YY (e.g. 8/1/93)
df['published_date'] = pd.to_datetime(df['published_date'], format='%m/%d/%y')
df['updated_date'] = pd.to_datetime(df['updated_date'], format='%m/%d/%y')
print(df['published_date'].dtype)
print(df['updated_date'].dtype)

datetime64[ns]
datetime64[ns]
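
Once the dates are parsed, the "temporal coverage" item from the checklist could be approached along these lines. This is only a sketch on toy values; it assumes the raw strings follow the M/D/YY pattern seen in the `head()` output, and `toy` and `papers_per_year` are illustrative names.

```python
import pandas as pd

# Toy dates in the same M/D/YY style as the raw CSV
toy = pd.DataFrame({"published_date": ["8/1/93", "9/1/93", "12/17/13"]})
toy["published_date"] = pd.to_datetime(toy["published_date"], format="%m/%d/%y")

# Overall date range covered by the data
first, last = toy["published_date"].min(), toy["published_date"].max()
print(f"Coverage: {first.date()} to {last.date()}")

# Number of papers per publication year, in chronological order
papers_per_year = toy["published_date"].dt.year.value_counts().sort_index()
print(papers_per_year)
```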


In [18]:
# Find all duplicate rows
duplicate_rows = df[df.duplicated()]

# Count how many duplicate rows exist
num_duplicates = len(duplicate_rows)
print(f"Number of duplicate rows: {num_duplicates}")

# You can also count duplicates including the first occurrence
all_duplicates = df[df.duplicated(keep=False)]
num_all_duplicates = len(all_duplicates)
print(f"Number of rows that appear more than once: {num_all_duplicates}")

# View the duplicates
print(duplicate_rows.head())

Number of duplicate rows: 0
Number of rows that appear more than once: 0
Empty DataFrame
Columns: [id, title, category, category_code, published_date, updated_date, authors, first_author, summary, summary_word_count]
Index: []


There are no full duplicate rows, but there are some duplicate values in the 'title' and 'summary' columns that are worth exploring. 
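
One way to quantify this per column is to pass `subset` to `duplicated`. The sketch below uses a made-up toy frame to show how a dataset can have no full-row duplicates while individual columns still repeat.

```python
import pandas as pd

# Toy frame: every id is distinct, but one title and one summary repeat
toy = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4"],
    "title": ["Fair Active Learning", "Fair Active Learning", "Dynamic Backtracking", "Other"],
    "summary": ["s1", "s2", "s3", "s3"],
})

# Full-row duplicates: none, because the ids differ
full_duplicates = int(toy.duplicated().sum())

# Rows sharing a value with at least one other row, counted per column
shared = {
    col: int(toy.duplicated(subset=col, keep=False).sum())
    for col in ["title", "summary"]
}

print(full_duplicates, shared)
```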

In [None]:
# Create a DataFrame that groups by category and counts distinct category_codes
category_mapping = df.groupby('category')['category_code'].nunique().reset_index()

# Find categories that map to more than one category_code
multiple_mappings = category_mapping[category_mapping['category_code'] > 1]

print(f"Categories with multiple category codes: {len(multiple_mappings)}")
print(multiple_mappings)

# For the categories with multiple mappings, show the actual combinations
if len(multiple_mappings) > 0:
    for category in multiple_mappings['category']:
        print(f"\nCategory: {category}")
        print(df[df['category'] == category][['category', 'category_code']].drop_duplicates())


Categories with multiple category codes: 1
              category  category_code
98  Numerical Analysis              2

Category: Numerical Analysis
                 category category_code
39965  Numerical Analysis         cs.NA
40889  Numerical Analysis       math.NA


In [15]:
# Count rows where category_code is 'cs.NA'
cs_na_count = df[df['category_code'] == 'cs.NA'].shape[0]
print(f"Number of rows where category_code == 'cs.NA': {cs_na_count}")

# Count rows where category_code is 'math.NA'
math_na_count = df[df['category_code'] == 'math.NA'].shape[0]
print(f"Number of rows where category_code == 'math.NA': {math_na_count}")

# You can also check what categories these map to
if cs_na_count > 0:
    print("\nCategories for 'cs.NA':")
    print(df[df['category_code'] == 'cs.NA']['category'].value_counts())

if math_na_count > 0:
    print("\nCategories for 'math.NA':")
    print(df[df['category_code'] == 'math.NA']['category'].value_counts())

Number of rows where category_code == 'cs.NA': 25
Number of rows where category_code == 'math.NA': 56

Categories for 'cs.NA':
category
Numerical Analysis    25
Name: count, dtype: int64

Categories for 'math.NA':
category
Numerical Analysis    56
Name: count, dtype: int64


In [17]:
# Filter for rows where category is 'Numerical Analysis'
na_rows = df[df['category'] == 'Numerical Analysis']
display(na_rows)

Unnamed: 0,id,title,category,category_code,published_date,updated_date,authors,first_author,summary,summary_word_count
39965,abs-1312.6872v1,Matrix recovery using Split Bregman,Numerical Analysis,cs.NA,2013-12-17,2013-12-17,"['Anupriya Gogna', 'Ankita Shukla', 'Angshul M...",'Anupriya Gogna',In this paper we address the problem of recove...,157
39971,abs-1401.0159v1,Speeding-Up Convergence via Sequential Subspac...,Numerical Analysis,cs.NA,2013-12-31,2013-12-31,['Michael Zibulevsky'],'Michael Zibulevsky',This is an overview paper written in style of ...,161
39985,abs-1407.1399v1,Generalized Higher-Order Tensor Decomposition ...,Numerical Analysis,cs.NA,2014-07-05,2014-07-05,"['Fanhua Shang', 'Yuanyuan Liu', 'James Cheng']",'Fanhua Shang',Higher-order tensors are becoming prevalent in...,166
40351,abs-1601.07721v1,Distributed Low Rank Approximation of Implicit...,Numerical Analysis,cs.NA,2016-01-28,2016-01-28,"['David P. Woodruff', 'Peilin Zhong']",'David P. Woodruff',We study distributed low rank approximation in...,212
40889,abs-1707.09428v1,A unified method for super-resolution recovery...,Numerical Analysis,math.NA,2017-07-26,2017-07-26,"['Charles K. Chui', 'Hrushikesh N. Mhaskar']",'Charles K. Chui',"In this paper, motivated by diffraction of tra...",151
...,...,...,...,...,...,...,...,...,...,...
111897,abs-2309.05947v3,Tumoral Angiogenic Optimizer: A new bio-inspir...,Numerical Analysis,math.NA,2023-09-12,2023-09-20,"['Hernández Rodríguez', 'Matías Ezequiel']",'Hernández Rodríguez',"In this article, we propose a new metaheuristi...",185
112591,abs-1605.09232v3,Tradeoffs between Convergence Speed and Recons...,Numerical Analysis,cs.NA,2016-05-30,2018-02-15,"['Raja Giryes', 'Yonina C. Eldar', 'Alex M. Br...",'Raja Giryes',Solving inverse problems with iterative algori...,164
112632,abs-1706.04702v1,Deep learning-based numerical methods for high...,Numerical Analysis,math.NA,2017-06-15,2017-06-15,"['Weinan E', 'Jiequn Han', 'Arnulf Jentzen']",'Weinan E',We propose a new algorithm for solving parabol...,118
112975,abs-2408.14057v1,Revisiting time-variant complex conjugate matr...,Numerical Analysis,math.NA,2024-08-26,2024-08-26,"['Jiakuang He', 'Dongqing Wu']",'Jiakuang He',Large-scale linear equations and high dimensio...,134


'Numerical Analysis' is the only category that maps to two codes: cs.NA (25 rows) and math.NA (56 rows). This likely accounts for the 138 category names versus 139 category codes noted above. It is unclear whether this is a data entry error or whether the category is genuinely meant to carry two codes. I will leave it for now, but it may be worth exploring later.
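
The same check can be run in both directions (name to code and code to name) with a small helper. This is a sketch on toy data; `one_to_many` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Toy mapping with one category name shared by two codes
toy = pd.DataFrame({
    "category": ["Numerical Analysis", "Numerical Analysis", "Machine Learning"],
    "category_code": ["cs.NA", "math.NA", "cs.LG"],
})

def one_to_many(frame, key, value):
    """Return the entries of `key` that map to more than one distinct `value`."""
    counts = frame.groupby(key)[value].nunique()
    return counts[counts > 1]

# Category names that carry several codes
names_with_many_codes = one_to_many(toy, "category", "category_code")
# Codes that carry several category names
codes_with_many_names = one_to_many(toy, "category_code", "category")

print(names_with_many_codes)
print(codes_with_many_names)
```

If the second result is empty on the real data, `category_code` could serve as the unambiguous key and `category` as a display label.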


In [23]:
# Investigate why the number of unique 'title' values does not match the number of unique 'id' values
title_cts = df['title'].value_counts()
non_unique_titles = title_cts[title_cts > 1]
print(f"There are {len(non_unique_titles)} duplicate titles")
print(non_unique_titles)

There are 84 duplicate titles
title
Fairness in Reinforcement Learning                                                                             2
SCREEN: Learning a Flat Syntactic and Semantic Spoken Language Analysis\n  Using Artificial Neural Networks    2
Bridging the Gap Between Target Networks and Functional Regularization                                         2
Conditional Plausibility Measures and Bayesian Networks                                                        2
Interpretable Two-level Boolean Rule Learning for Classification                                               2
                                                                                                              ..
Standards for Language Resources                                                                               2
Fair Active Learning                                                                                           2
Hypertree Decompositions Revisited for PGMs                 

In [28]:
display(df[df['title'].isin(non_unique_titles.index.tolist())].sort_values(by='title'))

Unnamed: 0,id,title,category,category_code,published_date,updated_date,authors,first_author,summary,summary_word_count
130124,cmp-lg-9407014v1,Abstract Machine for Typed Feature Structures,Computation and Language (Legacy category),cmp-lg,1994-07-17,1994-07-17,"['Shuly Wintner', 'Nissim Francez']",'Shuly Wintner',This paper describes a first step towards the ...,107
130324,cmp-lg-9504009v1,Abstract Machine for Typed Feature Structures,Computation and Language (Legacy category),cmp-lg,1995-04-13,1995-04-13,"['Shuly Wintner', 'Nissim Francez']",'Shuly Wintner',This paper describes an abstract machine for l...,116
2472,abs-1408.2056v1,Active Sensing as Bayes-Optimal Sequential Dec...,Artificial Intelligence,cs.AI,2014-08-09,2014-08-09,"['Sheeraz Ahmad', 'Angela Yu']",'Sheeraz Ahmad',Sensory inference under conditions of uncertai...,177
10523,abs-1305.6650v1,Active Sensing as Bayes-Optimal Sequential Dec...,Artificial Intelligence,cs.AI,2013-05-28,2013-05-28,"['Sheeraz Ahmad', 'Angela J. Yu']",'Sheeraz Ahmad',Sensory inference under conditions of uncertai...,177
2467,abs-1408.2034v1,Approximate inference on planar graphs using L...,Artificial Intelligence,cs.AI,2014-08-09,2014-08-09,"['Vicenc Gomez', 'Hilbert Kappen', 'Michael Ch...",'Vicenc Gomez',We introduce novel results for approximate inf...,153
...,...,...,...,...,...,...,...,...,...,...
2447,abs-1407.7188v1,When Ignorance is Bliss,Artificial Intelligence,cs.AI,2014-07-27,2014-07-27,"['Peter D. Grunwald', 'Joseph Y. Halpern']",'Peter D. Grunwald',It is commonly-accepted wisdom that more infor...,120
17549,abs-2206.07940v4,When Rigidity Hurts: Soft Consistency Regulari...,Machine Learning,cs.LG,2022-06-16,2023-10-19,"['Harshavardhan Kamarthi', 'Lingkai Kong', 'Al...",'Harshavardhan Kamarthi',Probabilistic hierarchical time-series forecas...,211
21739,abs-2310.11569v2,When Rigidity Hurts: Soft Consistency Regulari...,Machine Learning,cs.LG,2023-10-17,2023-10-19,"['Harshavardhan Kamarthi', 'Lingkai Kong', 'Al...",'Harshavardhan Kamarthi',Probabilistic hierarchical time-series forecas...,211
2463,abs-1408.1692v1,When do Numbers Really Matter?,Artificial Intelligence,cs.AI,2014-08-07,2014-08-07,"['Hei Chan', 'Adnan Darwiche']",'Hei Chan',Common wisdom has it that small distinctions i...,140


Some of these appear to be true duplicate papers, even though other columns (id, dates, author spelling) differ between the rows. Only 84 titles are affected.

At least one pair is not a duplicate (same title, different authors), but many of the others have matching or near-matching author lists.

To start:
* Remove the older row where `title` and `first_author` are the same. I know this will miss some cases (Joris M. Mooij vs. Joris Mooij), but it is a start.
* Review the list again to see if there are other candidates for removal.
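
A minimal sketch of the first step, shown on toy rows. It assumes `published_date` has already been converted to datetime and that "older" means the earlier `published_date`; the ids and titles below are made up.

```python
import pandas as pd

# Toy rows: p1/p2 share title and first author; p3/p4 share only the title
toy = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4"],
    "title": ["Active Sensing", "Active Sensing", "Same Title", "Same Title"],
    "first_author": ["'Sheeraz Ahmad'", "'Sheeraz Ahmad'", "'A. One'", "'B. Two'"],
    "published_date": pd.to_datetime(
        ["2013-05-28", "2014-08-09", "2020-01-01", "2021-01-01"]
    ),
})

# Sort oldest to newest, keep the newest row per (title, first_author) pair,
# then restore the original row order
deduped = (
    toy.sort_values("published_date")
       .drop_duplicates(subset=["title", "first_author"], keep="last")
       .sort_index()
)

print(deduped[["id", "title", "first_author", "published_date"]])
```

Rows that share a title but have different first authors are left untouched, so the "same title, different authors" case is preserved for manual review.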