# Python Case Study - Papers

The goal of this exercise is to explore a dataset of 10,000 Computer Science papers, compiled into a CSV file through web scraping. Mission: gain insights through descriptive analysis, evaluate data quality, and query the data.

## Data Quality Assessment

Goal: check for missing values, duplicates, and inconsistencies, and propose solutions to improve the dataset's quality.

Before cleaning the data, it is necessary to verify that the records are read correctly.

In [3]:
import pandas as pd

df = pd.read_csv("papers.csv", quotechar='"')

At first glance, I noticed that some records spanned multiple rows because certain values contain line breaks ("\n"). Some values also contain commas, which could interfere with field separation.
To address this, I passed the **quotechar** argument to the read statement so that text enclosed in double quotes stays within a single field. The double quote is already the pandas default, so the argument mainly documents the intent. In the end, I confirmed that the commas were indeed problematic, but I could not confirm the same for "\n".
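As a minimal sketch of this quoting behavior, the snippet below reads a small in-memory CSV. The sample text and its two columns are invented for illustration and are not taken from papers.csv.

```python
import io
import pandas as pd

# Hypothetical two-record CSV: the first abstract contains a comma and a line break.
sample_csv = (
    'title,abstract\n'
    '"Paper A","First line, with a comma\nsecond line"\n'
    '"Paper B","Plain abstract"\n'
)

sample = pd.read_csv(io.StringIO(sample_csv), quotechar='"')

# The quoted field stays in one cell, so the file yields two records, not three.
print(sample.shape)
print(repr(sample.loc[0, "abstract"]))
```

Because the abstract is quoted, both the comma and the line break end up inside the same cell.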

In [4]:
df.shape

(10000, 7)

In [5]:
df.head()

Unnamed: 0,date,abstract,title,authors,subjects,venue,importing_date
0,2021-10-25 05:17:04,"For a long time, bone marrow cell morphology e...",Bone Marrow Cell Recognition: Training Deep Ob...,"Dehao Huang, Jintao Cheng, Rui Fan, Zhihao Su,...",Computer Vision and Pattern Recognition (cs.CV),arxiv,
1,2014-06-05 23:35:17,It is assumed that under suitable economic and...,Arbitrage-free exchange rate ensembles over a ...,Stan Palasek,General Economics (econ.GN); Social and Inform...,arxiv,
2,2023-03-17 16:26:03,Physics-based covariance models provide a syst...,Scalable Physics-based Maximum Likelihood Esti...,"Yian Chen, Mihai Anitescu",Computation (stat.CO); Numerical Analysis (mat...,arxiv,
3,,incorrect id format for 2210.156,Automatic extraction of materials and properti...,"Luca Foppiano, Pedro Baptista de Castro, Pedro...",Computation and Language (cs.CL); Superconduct...,arxiv,
4,2017-12-04 16:26:52,This paper presents a new isogeometric mortar ...,A segmentation-free isogeometric extended mort...,"Thang Xuan Duong, Laura De Lorenzis, Roger A. ...","Computational Engineering, Finance, and Scienc...",arxiv,


In [6]:
df.tail()

Unnamed: 0,date,abstract,title,authors,subjects,venue,importing_date
9995,2020-12-17 22:49:02,"In this work, a system for creating a relighta...",Relightable 3D Head Portraits from a Smartphon...,"Artem Sevastopolsky, Savva Ignatiev, Gonzalo F...",Computer Vision and Pattern Recognition (cs.CV),arxiv,
9996,2022-08-24 01:04:47,Transformer-based methods have achieved impres...,SwinFIR: Revisiting the SwinIR with Fast Fouri...,"Dafeng Zhang, Feiyu Huang, Shizhuo Liu, Xiaobi...",Computer Vision and Pattern Recognition (cs.CV),arxiv,
9997,,,Random Projection Trees Revisited.,,,nips,2024-03-02
9998,2021-09-28 10:18:27,"In this paper, we introduce a compositional me...",Compositional Abstractions of Interconnected D...,"Abdalla Swikir, Majid Zamani",Systems and Control (eess.SY),arxiv,
9999,2023-05-24 15:09:41,Manifolds discovered by machine learning model...,Short and Straight: Geodesics on Differentiabl...,"Daniel Kelshaw, Luca Magri",Machine Learning (cs.LG); Computational Geomet...,arxiv,


The next step is to find and handle **missing data**.

In [7]:
df.isnull().sum()

date              2934
abstract          2645
title                0
authors           2645
subjects          2645
venue                0
importing_date    7355
dtype: int64

There are many missing values. Only two columns have none: title and venue.
One field in particular is problematic: **importing_date**. More than 70% of its values are missing.
Moreover, judging by the few values it takes, it does not seem to be an important variable, so it can be removed from the dataframe.
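The share of missing values per column can be computed directly rather than by hand. The sketch below uses a small invented frame (the name toy and its values are assumptions, not the real data).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the papers data (values are invented).
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "importing_date": [np.nan, np.nan, np.nan, "2024-03-02"],
})

# isna().mean() gives the fraction of missing values per column;
# multiplying by 100 turns it into a percentage.
missing_share = toy.isna().mean().mul(100).round(1)
print(missing_share)
```

Applied to the real dataframe, the same expression would give the percentages discussed above.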

In [8]:
df["importing_date"].unique()

array([nan, '2024-03-03', '2024-03-02'], dtype=object)

In [9]:
df1 = df.drop(["importing_date"], axis=1)
df1.head()

Unnamed: 0,date,abstract,title,authors,subjects,venue
0,2021-10-25 05:17:04,"For a long time, bone marrow cell morphology e...",Bone Marrow Cell Recognition: Training Deep Ob...,"Dehao Huang, Jintao Cheng, Rui Fan, Zhihao Su,...",Computer Vision and Pattern Recognition (cs.CV),arxiv
1,2014-06-05 23:35:17,It is assumed that under suitable economic and...,Arbitrage-free exchange rate ensembles over a ...,Stan Palasek,General Economics (econ.GN); Social and Inform...,arxiv
2,2023-03-17 16:26:03,Physics-based covariance models provide a syst...,Scalable Physics-based Maximum Likelihood Esti...,"Yian Chen, Mihai Anitescu",Computation (stat.CO); Numerical Analysis (mat...,arxiv
3,,incorrect id format for 2210.156,Automatic extraction of materials and properti...,"Luca Foppiano, Pedro Baptista de Castro, Pedro...",Computation and Language (cs.CL); Superconduct...,arxiv
4,2017-12-04 16:26:52,This paper presents a new isogeometric mortar ...,A segmentation-free isogeometric extended mort...,"Thang Xuan Duong, Laura De Lorenzis, Roger A. ...","Computational Engineering, Finance, and Scienc...",arxiv


In [10]:
df1.isnull().sum()

date        2934
abstract    2645
title          0
authors     2645
subjects    2645
venue          0
dtype: int64

There are still too many missing values: between 26% and almost 30% of the records in each of the four affected columns. That is too much to ignore or to drop without a closer look.
So I decided to analyze NaN values by row as well, to see whether there are rows with many missing fields.

In [11]:
record_na = df.isna().sum(axis=1)
record_na.sort_values(ascending=False)

20      4
18      4
31      4
30      4
9989    4
       ..
19      1
9999    1
21      1
22      1
12      1
Length: 10000, dtype: int64

In [12]:
(record_na == 4).sum()

np.int64(2645)

I discovered that 2645 records have 4 missing values out of 6 fields, more than half of them! Comparing the column counts before and after the drop shows that these records are missing date, abstract, authors and subjects. Abstract, subjects and authors are essential for our analysis, so I decided to drop all these records. This costs about 26% of the dataframe.

In [13]:


df2 = df1.drop(index=record_na[record_na == 4].index)
df2.shape

(7355, 6)

The dataframe is smaller, but it is now much cleaner than before.
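The row-wise filter can also be expressed with dropna and its thresh parameter. The sketch below uses a three-row invented frame (toy is a hypothetical name) to check that both forms keep the same rows.

```python
import numpy as np
import pandas as pd

# Invented rows: row 1 mimics a record with four missing fields.
toy = pd.DataFrame({
    "date": ["2021-01-01", np.nan, np.nan],
    "abstract": ["text", np.nan, "text"],
    "title": ["A", "B", "C"],
    "authors": ["X", np.nan, "Y"],
    "subjects": ["cs.CV", np.nan, "cs.CL"],
    "venue": ["arxiv", "nips", "arxiv"],
})

# Count missing fields per row, then keep rows below the chosen cutoff.
na_per_row = toy.isna().sum(axis=1)
kept = toy[na_per_row < 4]

# Same result with dropna: require at least 3 non-missing values out of 6.
kept_alt = toy.dropna(thresh=3)
print(kept.index.tolist(), kept_alt.index.tolist())
```

The thresh form states the rule in terms of values that must be present, which some readers may find easier to audit.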

In [14]:
df2.head()

Unnamed: 0,date,abstract,title,authors,subjects,venue
0,2021-10-25 05:17:04,"For a long time, bone marrow cell morphology e...",Bone Marrow Cell Recognition: Training Deep Ob...,"Dehao Huang, Jintao Cheng, Rui Fan, Zhihao Su,...",Computer Vision and Pattern Recognition (cs.CV),arxiv
1,2014-06-05 23:35:17,It is assumed that under suitable economic and...,Arbitrage-free exchange rate ensembles over a ...,Stan Palasek,General Economics (econ.GN); Social and Inform...,arxiv
2,2023-03-17 16:26:03,Physics-based covariance models provide a syst...,Scalable Physics-based Maximum Likelihood Esti...,"Yian Chen, Mihai Anitescu",Computation (stat.CO); Numerical Analysis (mat...,arxiv
3,,incorrect id format for 2210.156,Automatic extraction of materials and properti...,"Luca Foppiano, Pedro Baptista de Castro, Pedro...",Computation and Language (cs.CL); Superconduct...,arxiv
4,2017-12-04 16:26:52,This paper presents a new isogeometric mortar ...,A segmentation-free isogeometric extended mort...,"Thang Xuan Duong, Laura De Lorenzis, Roger A. ...","Computational Engineering, Finance, and Scienc...",arxiv


In [15]:
df2.isnull().sum()

date        289
abstract      0
title         0
authors       0
subjects      0
venue         0
dtype: int64

We still have some missing dates, but the date is probably not essential here, and they affect only 3.9% of the remaining records, so I'll leave them.

In [16]:
df2[df2.duplicated(keep=False)].sort_values("abstract")

Unnamed: 0,date,abstract,title,authors,subjects,venue
3849,2024-02-18 17:17:17,"In this paper, we revisit model-free policy se...",Model-Free μ -Synthesis: A Nonsmooth Optimizat...,"Darioush Keivan, Xingang Guo, Peter Seiler, Ge...",Optimization and Control (math.OC); Machine Le...,arxiv
9682,2024-02-18 17:17:17,"In this paper, we revisit model-free policy se...",Model-Free μ -Synthesis: A Nonsmooth Optimizat...,"Darioush Keivan, Xingang Guo, Peter Seiler, Ge...",Optimization and Control (math.OC); Machine Le...,arxiv
450,2024-02-08 16:54:20,Introduction This study explores the use of th...,Using YOLO v7 to Detect Kidney in Magnetic Res...,"Pouria Yazdian Anari, Fiona Obiezu, Nathan Lay...",Image and Video Processing (eess.IV); Computer...,arxiv
2638,2024-02-08 16:54:20,Introduction This study explores the use of th...,Using YOLO v7 to Detect Kidney in Magnetic Res...,"Pouria Yazdian Anari, Fiona Obiezu, Nathan Lay...",Image and Video Processing (eess.IV); Computer...,arxiv
3397,2024-02-03 16:03:17,The rapid growth of automated and autonomous i...,Multimodal Co-orchestration for Exploring Stru...,"Boris N. Slautin, Utkarsh Pratiush, Ilia N. Iv...",Materials Science (cond-mat.mtrl-sci); Machine...,arxiv
6346,2024-02-03 16:03:17,The rapid growth of automated and autonomous i...,Multimodal Co-orchestration for Exploring Stru...,"Boris N. Slautin, Utkarsh Pratiush, Ilia N. Iv...",Materials Science (cond-mat.mtrl-sci); Machine...,arxiv
2321,2024-02-16 08:21:43,The strengthening of tensions in the cosmologi...,Late-time transition of M B inferred via neura...,"Purba Mukherjee, Konstantinos F. Dialektopoulo...",Cosmology and Nongalactic Astrophysics (astro-...,arxiv
3992,2024-02-16 08:21:43,The strengthening of tensions in the cosmologi...,Late-time transition of M B inferred via neura...,"Purba Mukherjee, Konstantinos F. Dialektopoulo...",Cosmology and Nongalactic Astrophysics (astro-...,arxiv
7486,2024-02-10 19:12:31,This work provides a theoretical framework for...,Generalization Error of Graph Neural Networks ...,"Gholamali Aminian, Yixuan He, Gesine Reinert, ...",Machine Learning (stat.ML); Information Theory...,arxiv
9754,2024-02-10 19:12:31,This work provides a theoretical framework for...,Generalization Error of Graph Neural Networks ...,"Gholamali Aminian, Yixuan He, Gesine Reinert, ...",Machine Learning (stat.ML); Information Theory...,arxiv


The check above shows fully duplicated records, so the next step removes them, keeping one copy of each.

In [17]:
df3 = df2.drop_duplicates()
df3.shape

(7347, 6)
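The difference between inspecting and removing duplicates can be seen on a small invented frame. This is only a sketch: toy and its values are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "A", "B"],
    "abstract": ["x", "x", "y"],
})

# keep=False marks every member of a duplicate group, which is handy for inspection.
dupes = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each group by default.
deduped = toy.drop_duplicates()
removed = len(toy) - len(deduped)
print(len(dupes), removed)
```

In the real data the same subtraction gives 7355 - 7347 = 8 removed rows.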

Finally, I noticed that some rows in the dataframe contain the value "incorrect id format" in the abstract field. There are 319 such records. That is not many, and since their title and subjects fields are not null, the reference to the article might still be useful, so I keep them.

In [18]:
df3[df3["abstract"].str.contains("incorrect", case=False, na=False)]


Unnamed: 0,date,abstract,title,authors,subjects,venue
3,,incorrect id format for 2210.156,Automatic extraction of materials and properti...,"Luca Foppiano, Pedro Baptista de Castro, Pedro...",Computation and Language (cs.CL); Superconduct...,arxiv
45,,incorrect id format for 912.2298,Performance Metrics Analysis of Torus Embedded...,"N. Gopalakrishna Kini, M. Sathish Kumar, H.S. ...",Networking and Internet Architecture (cs.NI),arxiv
69,,incorrect id format for 2208.051,KL-divergence Based Deep Learning for Discrete...,"Li Liu, Xiangeng Fang, Di Wang, Weijing Tang, ...",Machine Learning (stat.ML); Machine Learning (...,arxiv
73,,incorrect id format for 902.1275,Delay Performance Optimization for Multiuser D...,"Jalil Seifali Harsini, Farshad Lahouti",Information Theory (cs.IT),arxiv
75,,incorrect id format for 1805.08,Adversarial Noise Layer: Regularize Neural Net...,"Zhonghui You, Jinmian Ye, Kunming Li, Zenglin ...",Computer Vision and Pattern Recognition (cs.CV...,arxiv
...,...,...,...,...,...,...
9908,,incorrect id format for 1102.314,Capacity Region of\nK\n-User Discrete Memoryle...,"G.Abhinav, B.Sundar Rajan",Information Theory (cs.IT),arxiv
9917,,incorrect id format for 2109.109,Towards Multi-Agent Reinforcement Learning usi...,"Tobias Müller, Christoph Roch, Kyrill Schmid, ...",Artificial Intelligence (cs.AI); Multiagent Sy...,arxiv
9924,,incorrect id format for 905.3296,An Analysis of Bug Distribution in Object Orie...,"Alessandro Murgia, Giulio Concas, Michele Marc...",Software Engineering (cs.SE); Programming Lang...,arxiv
9939,,incorrect id format for 1212.51,A Polynomial Time Version of LLL With Deep Ins...,"Felix Fontein, Michael Schneider, Urs Wagner",Cryptography and Security (cs.CR); Combinatori...,arxiv
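One possible way to keep these records while preventing the placeholder text from being treated as a real abstract is to flag them and blank the field. This is only a sketch under that assumption: the column name abstract_missing is hypothetical and the toy rows are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "abstract": ["incorrect id format for 2210.156", "A real abstract."],
    "title": ["Paper A", "Paper B"],
})

# Flag rows whose abstract is the placeholder message rather than real text.
toy["abstract_missing"] = toy["abstract"].str.startswith("incorrect id format", na=False)

# Replace the placeholder with a proper missing value so text statistics ignore it.
toy["abstract"] = toy["abstract"].mask(toy["abstract_missing"])
print(toy)
```

With this approach the title, authors and subjects of the flagged rows remain available for the rest of the analysis.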


## Descriptive Analysis

Goal: dive deep into the dataset to reveal the number of papers, their sources, and other intriguing facts. What stories do the numbers tell?

In [34]:
df3.shape

(7347, 6)

In [40]:
df3['venue'].value_counts().head(10)

venue
arxiv    7347
Name: count, dtype: int64

In [None]:
df['venue'].value_counts().head(10)

venue
arxiv    7355
nips      326
chi       321
hicss     258
ijcai     216
acl       170
icml      152
emnlp     143
icpr      141
cikm      117
Name: count, dtype: int64

We can immediately notice that all 7347 papers in the cleaned dataframe come from arXiv. Comparing this result with the original dataset shows that the articles from all other sources were removed during data cleaning because they were too incomplete. arXiv therefore appears to be the most practical of these sources for collecting papers through this scraping.

In [43]:
df3['subjects'].str.split(',').explode().value_counts().head(10)

subjects
Computer Vision and Pattern Recognition (cs.CV)         729
Information Theory (cs.IT)                              284
Computation and Language (cs.CL)                        261
 Parallel                                               226
Machine Learning (cs.LG)                                185
Machine Learning (cs.LG); Machine Learning (stat.ML)    175
Distributed                                             146
Robotics (cs.RO)                                        135
Cryptography and Security (cs.CR)                       127
Numerical Analysis (math.NA)                            121
Name: count, dtype: int64

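Then, looking at the 'subjects' column, we notice that Computer Vision and Pattern Recognition is by far the most frequent entry, although with 729 occurrences out of 7347 papers it is not a majority. These counts should be read with caution: the column separates subjects with ";", while the cell above splits on ",". As a result, combined entries such as "Machine Learning (cs.LG); Machine Learning (stat.ML)" are counted as a single subject, and entries such as " Parallel" and "Distributed" are probably fragments of a subject name that itself contains commas.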
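A sketch of counting individual subjects by splitting on ";" instead is shown below. The two sample strings are invented, but they follow the format visible in the subjects column.

```python
import pandas as pd

toy = pd.Series([
    "Machine Learning (cs.LG); Machine Learning (stat.ML)",
    "Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)",
], name="subjects")

# Split on ';', then strip surrounding spaces so identical labels compare equal.
subject_counts = toy.str.split(";").explode().str.strip().value_counts()
print(subject_counts)
```

Splitting this way keeps a subject name that contains commas in one piece and counts each subject of a paper separately.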

In [None]:
# Remove details in parentheses (e.g. affiliations) from the authors field.
# The authors column holds plain strings, so a vectorized regex replace is enough.
df3 = df3.copy()  # work on an explicit copy to avoid chained-assignment warnings
df3['authors_clean'] = df3['authors'].astype(str).str.replace(r"\s*\(.*?\)", "", regex=True)


In [60]:
df3['authors_clean'].str.split(',').explode().value_counts().head(10)

authors_clean
H. Vincent Poor    13
Jiebo Luo          11
Yang Liu           10
Wei Wang            9
Jiashi Feng         9
Yi Yang             9
Yoshua Bengio       8
Wei Liu             8
Jun Wang            7
Xin Wang            7
Name: count, dtype: int64

From an analysis of the authors, the most prolific author in this dataset is H. Vincent Poor. However, several of the top names (for example Yang Liu, Wei Wang, Jun Wang and Xin Wang) are very common, so we cannot exclude the possibility that a single entry groups several distinct researchers who share the same name.
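Splitting on "," also leaves a leading space on every name except the first in each list, which can split one author into two entries. Whether this affects the counts above depends on the data; the sketch below (with invented names) shows how stripping whitespace before counting avoids it.

```python
import pandas as pd

toy = pd.Series(["Ada Lovelace, Alan Turing", "Alan Turing"], name="authors_clean")

# Without strip(), "Alan Turing" and " Alan Turing" would be counted separately.
author_counts = toy.str.split(",").explode().str.strip().value_counts()
print(author_counts)
```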

In [65]:
df3['abstract'].str.split().apply(len).describe()

count    7347.000000
mean      157.901320
std        61.997716
min         5.000000
25%       122.000000
50%       159.000000
75%       199.000000
max       413.000000
Name: abstract, dtype: float64

## Function Creation

Goal: write a function that filters the papers by given keywords appearing in their title or abstract. It could be very useful for researchers who need to find relevant papers.

In [33]:
def filter_papers(df, keywords):
    """Return the rows of df whose abstract or title contains any keyword.

    Matching is a case-sensitive substring test, so 'AI' also matches 'GAIN'.
    """
    pairs = df[['abstract', 'title']].values.tolist()
    matches = []

    # keep a row once, as soon as any keyword appears in its abstract or title
    for idx, (abstract, title) in enumerate(pairs):
        if any(key in abstract or key in title for key in keywords):
            matches.append(idx)

    return df.iloc[matches]

filter_papers(df3, ['AI'])

Unnamed: 0,date,abstract,title,authors,subjects,venue
23,2023-01-26 16:55:15,Memes can sway people's opinions over social m...,Characterizing the Entities in Harmful Memes: ...,"Shivam Sharma, Atharva Kulkarni, Tharun Suresh...",Computation and Language (cs.CL); Computers an...,arxiv
24,2021-05-03 11:42:27,"Building human-like agent, which aims to learn...",APPL: Adaptive Planner Parameter Learning,"Xuesu Xiao, Zizhao Wang, Zifan Xu, Bo Liu, Gar...",Robotics (cs.RO),arxiv
44,2013-03-27 19:39:17,It is suggested that an AI inference system sh...,Inference Policies,Paul E. Lehner,Artificial Intelligence (cs.AI),arxiv
66,2023-06-29 02:46:45,The development of Natural Language Generation...,The Future of AI-Assisted Writing,"Carlos Alves Pereira, Tanay Komarlu, Wael Mobe...",Human-Computer Interaction (cs.HC); Artificial...,arxiv
92,2023-03-20 20:28:26,Cognitive psychology delves on understanding p...,Mind meets machine: Unravelling GPT-4's cognit...,"Sifatkaur Dhingra, Manmeet Singh, Vaisakh SB, ...",Computation and Language (cs.CL); Artificial I...,arxiv
...,...,...,...,...,...,...
9799,2023-07-10 00:58:28,To achieve successful deployment of AI researc...,A Demand-Driven Perspective on Generative Audi...,"Sangshin Oh, Minsung Kang, Hyeongi Moon, Keunw...",Audio and Speech Processing (eess.AS); Artific...,arxiv
9856,2015-06-23 00:59:27,Integrating vision and language has long been ...,A Survey of Current Datasets for Vision and La...,"Francis Ferraro, Nasrin Mostafazadeh, Ting-Hao...",Computation and Language (cs.CL); Artificial I...,arxiv
9857,2023-08-09 04:48:55,As cyber threats evolve and grow progressively...,Data-Driven Intelligence can Revolutionize Tod...,"Iqbal H. Sarker, Helge Janicke, Leandros Magla...",Cryptography and Security (cs.CR),arxiv
9867,2020-11-02 07:08:19,Commonsense reasoning refers to the ability of...,PC-GAIN: Pseudo-label Conditional Generative A...,"Yufeng Wang, Dan Li, Xiang Li, Min Yang",Machine Learning (cs.LG); Machine Learning (st...,arxiv
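The result above includes titles such as "PC-GAIN", because a substring test for 'AI' also matches inside longer words. A possible refinement, sketched below, matches whole words only. The function name filter_papers_words, its case parameter and the toy rows are all hypothetical and not part of the original exercise.

```python
import re
import pandas as pd

def filter_papers_words(df, keywords, case=True):
    """Hypothetical variant of filter_papers that matches whole words only."""
    # \b restricts each keyword to whole-word matches; re.escape keeps
    # special characters in keywords literal.
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    # Search title and abstract together; missing text is treated as empty.
    text = df["title"].fillna("") + " " + df["abstract"].fillna("")
    return df[text.str.contains(pattern, case=case, regex=True)]

# Invented rows for illustration.
toy = pd.DataFrame({
    "title": ["The Future of AI-Assisted Writing", "PC-GAIN: Imputation Networks"],
    "abstract": ["Text generation tools.", "A method for missing data."],
})

result = filter_papers_words(toy, ["AI"])
print(result["title"].tolist())
```

Here only the first row is returned, since "AI" appears as a separate word in its title while the second row only contains it inside "GAIN". Passing case=False would make the search case-insensitive.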
