# Exploratory data analysis of paper sentence dataset for citation classification

This notebook analyzes two paper sentence datasets with the goal of reworking them into a dataset of citing and non-citing sentences. A sentence is classified as citing if it contains a citation marker, so the main task is to find those markers reliably.

## Getting the data

The data can be downloaded from [here](https://www.dropbox.com/scl/fi/s56s5v01yp5r2v64m9bxr/dss.tar.gz?rlkey=jvobt45y7f5c3yyeopck1q7ch&e=1&dl=0). Extract the archive anywhere on your system and update `PATH_TO_DATA` accordingly.

Both datasets have the following format:
- `.txt` file for each paper, containing sentences separated by `\n============\n` and citations marked with `[number]` or `<PUBLICATION:paper_url>`.
- `.ref` file detailing each reference
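
As a rough illustration of this layout, the sketch below splits a made-up file body on the separator and checks each sentence for either marker style. The sample text, the helper names `split_sentences` and `has_citation_marker`, and the two loose patterns are assumptions for illustration, not part of the datasets or of the notebook.

```python
import re

SENTENCE_SEPARATOR = "\n============\n"

# Hypothetical file content, made up for illustration only
sample_text = SENTENCE_SEPARATOR.join(
    [
        "Sparse graphs were studied in [3].",
        "We build on <DBLP:http://dblp.org/rec/conf/cccg/LeeST05>.",
        "Let <formula> be a graph.",
    ]
)

# Loose patterns for the two marker styles (assumed, deliberately simple)
NUMERIC_MARKER = re.compile(r"\[\d+\]")
PUBLICATION_MARKER = re.compile(r"<[A-Z]+:[^>]*>")


def split_sentences(text: str) -> list[str]:
    """Split the body of one paper file into its sentences."""
    return text.split(SENTENCE_SEPARATOR)


def has_citation_marker(sentence: str) -> bool:
    """Return True if the sentence contains either marker style."""
    return bool(
        NUMERIC_MARKER.search(sentence) or PUBLICATION_MARKER.search(sentence)
    )


sentences = split_sentences(sample_text)
flags = [has_citation_marker(s) for s in sentences]

for flag, sentence in zip(flags, sentences):
    print(flag, sentence)
```

Note that the `<formula>` placeholder does not count as a marker here because the publication pattern requires an upper-case prefix followed by a colon.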

In [2]:
import pandas as pd
import os

from pathlib import Path

# Change as needed
PATH_TO_DATA = Path("C:\\Users\\Adrian\\Downloads\\dss")


In [3]:
files = []

for root, _, filenames in os.walk(PATH_TO_DATA):
    for filename in filenames:
        if filename.endswith(".txt"):
            files.append(os.path.join(root, filename))

print(f"Found {len(files)} files")

Found 160204 files


In [4]:
sentences = []

# Only the first 1000 files are read here
for file in files[:1000]:
    with open(file, "r", encoding="utf-8") as f:
        sentences.extend(f.read().split("\n============\n"))

print(f"Found {len(sentences)} sentences")

Found 150872 sentences


In [5]:
df = pd.DataFrame(sentences, columns=["sentence"])

# extract citation markers
citation_markers = df["sentence"].str.extract(r"(\[.*\]|<.*>)")[0]

citation_markers.value_counts()

0
<formula>                  20183
<formula> and <formula>     1734
<formula>, <formula>

The regex needs refinement: the author of the datasets replaced all formulas with `<formula>`, which the pattern above also matches, and the pattern must also handle other bracketed content that is not a citation.
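
To make the problem concrete, the sketch below compares the greedy pattern with a stricter one on a few made-up sentences. The `strict` pattern is a simplified assumption: it accepts bracketed numbers, numeric ranges and comma-separated lists, or an angle-bracket tag with an upper-case prefix and a colon. It is one possible tightening, not necessarily the best one for this data.

```python
import re

# Hypothetical sentences, made up for illustration only
samples = [
    "Let <formula> be a set, as in [12].",
    "The interval [a, b] contains <formula>.",
    "See <DBLP:http://dblp.org/rec/journals/gc/Tay93> for details.",
]

# The first attempt: anything in square or angle brackets
greedy = re.compile(r"(\[.*\]|<.*>)")

# Assumed stricter sketch: numeric citations or PREFIX:identifier tags
strict = re.compile(
    r"\[\d+(?:-\d+)?(?:, ?\d+(?:-\d+)?)*\]|<[A-Z]+:[^<>\s]*>"
)


def first_match(pattern: re.Pattern, text: str):
    """Return the first matched substring, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(0) if match else None


results = [(first_match(greedy, s), first_match(strict, s)) for s in samples]

for sentence, (loose, tight) in zip(samples, results):
    print(f"{sentence!r}: greedy={loose!r}, strict={tight!r}")
```

On these samples the greedy pattern picks up `<formula>` and `[a, b]`, while the stricter pattern only reports `[12]` and the `<DBLP:...>` tag.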

In [6]:
citation_markers = df["sentence"].str.extract(r" ?(\[\d+(?:-\d+|(?:, ?\d+(-\d+)?)*)+\]|<([A-Z]+:[a-zA-Z0-9._:/-]*)>) ?")[0]

citation_markers.value_counts()

0
<GC:>                                                     289
<GC:and>                                                  141
<GC:and.cover.minus.plus.thomas>                           58
<DBLP:http://dblp.org/rec/journals/corr/abs-0704-0229>     54
<DBLP:http://dblp.org/rec/journals/lmcs/BlassGRR07a>       47
                                                         ... 
<GC:and.ben.greville.israel.springer>                       1
<DBLP:http://dblp.org/rec/journals/gc/Tay93>                1
<DBLP:http://dblp.org/rec/conf/cccg/LeeST05>                1
<GC:johnson.lindenstrauss.matouek.the.variants>             1
<GC:computing.grid.lamanna.lhc.the>                         1
Name: count, Length: 7127, dtype: int64

In [7]:
import re

CITATION_REGEX = r" ?(\[\d+(?:-\d+|(?:, ?\d+(-\d+)?)*)+\]|<([A-Z]+:[a-zA-Z0-9._:/-]*)>) ?"
citation_pattern = re.compile(CITATION_REGEX)

# Strip citation markers from each sentence and flag sentences that contained one
df = pd.DataFrame(
    {
        "sentence": df["sentence"].map(lambda s: citation_pattern.sub("", s)),
        "citing": df["sentence"].map(lambda s: bool(citation_pattern.search(s))),
    }
)
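
One caveat worth checking: `CITATION_REGEX` consumes an optional space on both sides of a marker and replaces the whole match with an empty string, so a marker in the middle of a sentence can glue its neighbouring words together. The sketch below shows this on a made-up sentence and compares a hypothetical variant, `LEADING_SPACE_REGEX` (not part of the notebook), that consumes only the space before the marker. Whether the variant suits the real data is an assumption that would need checking.

```python
import re

# Same pattern as CITATION_REGEX in the notebook
CITATION_REGEX = r" ?(\[\d+(?:-\d+|(?:, ?\d+(-\d+)?)*)+\]|<([A-Z]+:[a-zA-Z0-9._:/-]*)>) ?"

# Hypothetical variant: only the space before the marker is consumed
LEADING_SPACE_REGEX = (
    r" ?(?:\[\d+(?:-\d+)?(?:, ?\d+(?:-\d+)?)*\]|<[A-Z]+:[a-zA-Z0-9._:/-]*>)"
)

# Made-up sentence for illustration only
sample = "As shown in [1] by Tay, the bound is tight [2, 3]."

# Removing the match with both surrounding spaces merges "in" and "by"
merged = re.sub(CITATION_REGEX, "", sample)

# Keeping the trailing space preserves the word boundary; strip() handles
# a marker at the very start of a sentence
spaced = re.sub(LEADING_SPACE_REGEX, "", sample).strip()

print(merged)
print(spaced)
```

The same comparison can be run on a handful of real citing sentences before deciding which behaviour the final dataset should have.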


In [8]:
df

| | sentence | citing |
|---|---|---|
| 0 | =1 The focus of this paper is decompositions o... | False |
| 1 | We use graphto mean a multigraph, possibly wit... | False |
| 2 | We say that a graph is <formula>-sparseif no s... | False |
| 3 | We call the range <formula> the upper range of... | False |
| 4 | In this paper, we present efficient algorithms... | False |
| ... | ... | ... |
| 150867 | The entropies defined below occur naturally in... | False |
| 150868 | The Boltzmann entropy with respect to the posi... | False |
| 150869 | A way to circumvent this problem is to conside... | False |
| 150870 | Let <formula> be a positive measurable function. | False |
