# Python Case Study - Papers

The goal of this exercise is to explore a dataset of 10,000 Computer Science papers, compiled in a CSV file through web scraping. Mission: gain insights through descriptive analysis, evaluate data quality, and query the data.

## Data Quality Assessment

Goal: investigate for missing values, duplicates, and any inconsistencies. Propose solutions to improve the dataset's quality.

Before cleaning the data, it is necessary to verify that the records are read correctly.

In [3]:
import pandas as pd

df = pd.read_csv("papers.csv", quotechar='"')

At first glance, I noticed that some records spanned multiple rows because certain values contain line breaks ("\n"). Additionally, some values contain commas, which could interfere with field separation.
To address this, I passed the **quotechar** argument to the read statement so that text enclosed in quotation marks stays within the same field (the double quote is also the pandas default, so the argument mainly makes the intent explicit). In the end, I confirmed that the commas were indeed a problem, but I could not confirm the same for "\n".
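As a minimal sketch of the behaviour described above, the toy CSV below (made up for illustration, not taken from papers.csv) has a quoted field containing both a comma and a line break. With the quote character set, pandas should read it as a single field.

```python
import io
import pandas as pd

# Toy CSV, invented for illustration: the first abstract contains
# a comma and an embedded line break inside a quoted field.
raw = (
    'title,abstract,venue\n'
    '"Paper A","First line, with a comma\nsecond line",arxiv\n'
    '"Paper B","Plain abstract",nips\n'
)

toy = pd.read_csv(io.StringIO(raw), quotechar='"')

print(toy.shape)                     # two records, three fields
print(repr(toy.loc[0, "abstract"]))  # comma and line break stay in one value
```

If the quotes were missing from the file, the same comma would split the abstract into two fields, which is the failure mode to look for when checking the import.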

In [6]:
df.shape

(10000, 7)

In [None]:
df.head()

Unnamed: 0,date,abstract,title,authors,subjects,venue,importing_date
0,2021-10-25 05:17:04,"For a long time, bone marrow cell morphology e...",Bone Marrow Cell Recognition: Training Deep Ob...,"Dehao Huang, Jintao Cheng, Rui Fan, Zhihao Su,...",Computer Vision and Pattern Recognition (cs.CV),arxiv,
1,2014-06-05 23:35:17,It is assumed that under suitable economic and...,Arbitrage-free exchange rate ensembles over a ...,Stan Palasek,General Economics (econ.GN); Social and Inform...,arxiv,
2,2023-03-17 16:26:03,Physics-based covariance models provide a syst...,Scalable Physics-based Maximum Likelihood Esti...,"Yian Chen, Mihai Anitescu",Computation (stat.CO); Numerical Analysis (mat...,arxiv,
3,,incorrect id format for 2210.156,Automatic extraction of materials and properti...,"Luca Foppiano, Pedro Baptista de Castro, Pedro...",Computation and Language (cs.CL); Superconduct...,arxiv,
4,2017-12-04 16:26:52,This paper presents a new isogeometric mortar ...,A segmentation-free isogeometric extended mort...,"Thang Xuan Duong, Laura De Lorenzis, Roger A. ...","Computational Engineering, Finance, and Scienc...",arxiv,
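Record 3 above carries a scraper error message ("incorrect id format for 2210.156") in place of an abstract. One possible way to flag such records is sketched below on a toy frame. The prefix string is taken from that single visible row, and the assumption that other faulty records share it is untested.

```python
import pandas as pd

# Toy frame, invented for illustration.
toy = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "abstract": ["A normal abstract.", "incorrect id format for 2210.156", None],
})

# Assumed marker of a failed scrape (seen in one row of the real data).
error_prefix = "incorrect id format"

# na=False treats missing abstracts as "not an error message".
is_error = toy["abstract"].str.startswith(error_prefix, na=False)
flagged = toy[is_error]

print(flagged)
```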


In [7]:
df.tail()

Unnamed: 0,date,abstract,title,authors,subjects,venue,importing_date
9995,2020-12-17 22:49:02,"In this work, a system for creating a relighta...",Relightable 3D Head Portraits from a Smartphon...,"Artem Sevastopolsky, Savva Ignatiev, Gonzalo F...",Computer Vision and Pattern Recognition (cs.CV),arxiv,
9996,2022-08-24 01:04:47,Transformer-based methods have achieved impres...,SwinFIR: Revisiting the SwinIR with Fast Fouri...,"Dafeng Zhang, Feiyu Huang, Shizhuo Liu, Xiaobi...",Computer Vision and Pattern Recognition (cs.CV),arxiv,
9997,,,Random Projection Trees Revisited.,,,nips,2024-03-02
9998,2021-09-28 10:18:27,"In this paper, we introduce a compositional me...",Compositional Abstractions of Interconnected D...,"Abdalla Swikir, Majid Zamani",Systems and Control (eess.SY),arxiv,
9999,2023-05-24 15:09:41,Manifolds discovered by machine learning model...,Short and Straight: Geodesics on Differentiabl...,"Daniel Kelshaw, Luca Magri",Machine Learning (cs.LG); Computational Geomet...,arxiv,


The following step is to find and manage **missing data**.

In [8]:
df.isnull().sum()

date              2934
abstract          2645
title                0
authors           2645
subjects          2645
venue                0
importing_date    7355
dtype: int64

There are many missing values. Only two columns have none: title and venue.
One field in particular is problematic: **importing_date**. More than 70% of its values are missing.
Moreover, looking at the values it takes, it does not seem to be a very important variable, so it can be removed from the DataFrame.

In [9]:
df["importing_date"].unique()

array([nan, '2024-03-03', '2024-03-02'], dtype=object)
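The "more than 70%" figure follows from the counts above (7,355 of 10,000 is 73.55%). A minimal sketch of how the share of missing values per column could be computed directly, on a toy frame with invented values:

```python
import pandas as pd

# Toy frame, illustrative values only (not the real dataset).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "date": ["2021-01-01", None, "2022-05-05", None],
    "importing_date": [None, None, None, "2024-03-02"],
})

# isna() yields booleans, so the column mean is the fraction missing.
missing_share = toy.isna().mean()

# Columns above a 70% threshold (the threshold is an assumption for this sketch).
mostly_missing = missing_share[missing_share > 0.7].index.tolist()

print(missing_share)
print(mostly_missing)
```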

In [11]:
df1 = df.drop(["importing_date"], axis=1)
df1.head()

Unnamed: 0,date,abstract,title,authors,subjects,venue
0,2021-10-25 05:17:04,"For a long time, bone marrow cell morphology e...",Bone Marrow Cell Recognition: Training Deep Ob...,"Dehao Huang, Jintao Cheng, Rui Fan, Zhihao Su,...",Computer Vision and Pattern Recognition (cs.CV),arxiv
1,2014-06-05 23:35:17,It is assumed that under suitable economic and...,Arbitrage-free exchange rate ensembles over a ...,Stan Palasek,General Economics (econ.GN); Social and Inform...,arxiv
2,2023-03-17 16:26:03,Physics-based covariance models provide a syst...,Scalable Physics-based Maximum Likelihood Esti...,"Yian Chen, Mihai Anitescu",Computation (stat.CO); Numerical Analysis (mat...,arxiv
3,,incorrect id format for 2210.156,Automatic extraction of materials and properti...,"Luca Foppiano, Pedro Baptista de Castro, Pedro...",Computation and Language (cs.CL); Superconduct...,arxiv
4,2017-12-04 16:26:52,This paper presents a new isogeometric mortar ...,A segmentation-free isogeometric extended mort...,"Thang Xuan Duong, Laura De Lorenzis, Roger A. ...","Computational Engineering, Finance, and Scienc...",arxiv


In [12]:
df1.isnull().sum()

date        2934
abstract    2645
title          0
authors     2645
subjects    2645
venue          0
dtype: int64

In [None]:
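# Count the missing fields of each record (importing_date is already dropped)
record_na = df1.isna().sum(axis=1)
record_na.sort_values(ascending=False)
# Drop the records that lack date, abstract, authors, and subjects at once
df2 = df1.drop(index=record_na[record_na == 4].index)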

The importing_date variable has a very high number of missing values, more than 70%. Given the goals of the analysis, it does not seem to be an important variable, so it is removed.

In [None]:
df2.shape

We sacrificed about 27% of the dataset, because those records had more than half of their fields missing.
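The "more than half of the fields missing" rule can be sketched on a toy frame as follows. The data and the variable names are invented for illustration, and the threshold is written generically instead of as the fixed value 4.

```python
import pandas as pd

# Toy frame with the same six columns, invented values.
toy = pd.DataFrame({
    "date": ["2021-01-01", None, None, "2020-02-02"],
    "abstract": ["text", None, "text", "text"],
    "title": ["A", "B", "C", "D"],
    "authors": ["x", None, "y", "z"],
    "subjects": ["cs.CV", None, "cs.LG", "cs.CL"],
    "venue": ["arxiv", "nips", "arxiv", "arxiv"],
})

na_per_row = toy.isna().sum(axis=1)            # missing fields in each record
mostly_empty = na_per_row > toy.shape[1] / 2   # more than half the fields missing
kept = toy[~mostly_empty]
removed_share = mostly_empty.mean()            # fraction of records dropped

print(kept.shape)
print(f"{removed_share:.0%} of records removed")
```

Only record B, which lacks four of six fields, is dropped here. Record C misses a single field and is kept.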

In [None]:
df2.isnull().sum()

Since the missing values in "date" correspond to 2.89% of the records, and the variable is not very important, we can ignore them.
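The 2.89% figure can be checked by hand from the counts shown earlier. The sketch below assumes that every record missing its abstract was also missing its date, so that all of them were removed in the previous step. The variable names are illustrative.

```python
# Counts taken from the isnull().sum() output earlier in the notebook.
total_records = 10_000
missing_date = 2_934
missing_abstract = 2_645  # assumed to be a subset of the records missing a date

# Records that still lack a date after the mostly-empty records are dropped.
remaining_missing = missing_date - missing_abstract

share_of_original = remaining_missing / total_records
share_of_kept = remaining_missing / (total_records - missing_abstract)

print(remaining_missing)             # 289
print(f"{share_of_original:.2%}")    # 2.89%
print(f"{share_of_kept:.2%}")
```

Relative to the records that remain after the drop, the share is somewhat higher than 2.89%, which is still small enough to support the same conclusion.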

In [None]:
df2[df2.duplicated(keep=False)].sort_values("abstract")

Finally, we remove the duplicate records.

In [None]:
df3 = df2.drop_duplicates()
df3.shape
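The difference between the inspection step and the removal step can be seen on a toy frame (invented values): **duplicated(keep=False)** marks every copy of a repeated record, while **drop_duplicates()** keeps the first occurrence and discards the rest.

```python
import pandas as pd

# Toy frame, illustrative only: rows 0 and 2 are identical.
toy = pd.DataFrame({
    "title": ["A", "B", "A", "C"],
    "abstract": ["alpha", "beta", "alpha", "gamma"],
})

# keep=False flags all copies, which is useful for inspecting them side by side.
all_copies = toy[toy.duplicated(keep=False)].sort_values("abstract")

# drop_duplicates() keeps the first copy of each repeated record.
deduped = toy.drop_duplicates()

print(all_copies)
print(deduped.shape)
```

Both calls compare all columns by default. If two scraped copies of the same paper differed in a single field, they would not be caught, and a narrower comparison through the `subset` parameter might be worth considering.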