In [1]:
import json
import pickle

In [2]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [3]:
%cd /content/drive/My Drive/SciPara_2024_release

/content/drive/My Drive/SciPara_2024_release


# Load the dataset

In [4]:
with open('base_dataset_abstract_previous.pkl', 'rb') as f:
    loaded_dict = pickle.load(f)

The key_list holds the paper IDs of all papers in our dataset.

In [5]:
key_list = list(loaded_dict.keys())

In [6]:
key_list[0]

'paper_1'

For each paper, we have two entries: 'paper', the paper's body divided into paragraphs, and 'abstract', the paper's abstract.

In [7]:
loaded_dict["paper_1"].keys()

dict_keys(['paper', 'abstract'])

In [8]:
loaded_dict["paper_1"]['paper']

[{'id': 36211,
  'text': 'SKIP\nPaper title: Phylogenetic analyses provide the first insights into the evolution of OVATE family proteins in land plants. \n\n(#1) The OVATE gene was first identified as an important regulator of fruit shape in tomato, in which a naturally occurring premature stop codon in OVATE results in pear-shaped fruit with longitudinal elongation and neck constriction (Liu et al., 2002) .\n(#2) This revealed a previously uncharacterized class of regulatory genes in plant development, which encode proteins with a conserved 70 amino acid C-terminal domain.\n(#3) This domain was designated as the OVATE domain, also known as DUF623 (Domain of Unknown Function 623), and proteins containing this domain were designated OVATE family proteins (OFPs), which are found exclusively in plants (Hackbusch et al., 2005; Wang et al., 2007 Wang et al., , 2011 .',
  'label': [[0, 4, 'Bad structure'],
   [129, 132, 'Topic sentence'],
   [129, 133, 'Score 3'],
   [138, 143, 'Paragraph T

In [9]:
loaded_dict["paper_1"]['abstract']

'Background and aims: The OVATE gene encodes a nuclear-localized regulatory protein belonging to a distinct family of plant-specific proteins known as the OVATE family proteins (OFPs). OVATE was first identified as a key regulator of fruit shape in tomato, with nonsense mutants displaying pear-shaped fruits. However, the role of OFPs in plant development has been poorly characterized. Methods: Public databases were searched and a total of 265 putative OVATE protein sequences were identified from 13 sequenced plant genomes that represent the major evolutionary lineages of land plants. A phylogenetic analysis was conducted based on the alignment of the conserved OVATE domain from these 13 selected plant genomes. The expression patterns of tomato SlOFP genes were analysed via quantitative real-time PCR. The pattern of OVATE gene duplication resulting in the expansion of the gene family was determined in arabidopsis, rice and tomato. Key results: Genes for OFPs were found to be present in 

Below, we concatenate the paragraphs of all papers in the dataset into a single list.

In [10]:
paras_list = []

# Collect the paragraphs of every paper into one flat list
for paper in loaded_dict.values():
    paras_list.extend(paper['paper'])

For each paragraph, we have the paragraph id, the paragraph text, all labels the annotator placed on the paragraph, and the previous paragraph from the same paper (except for the very first paragraph of each paper).

In [11]:
paras_list[1]

{'id': 36212,
 'text': 'SKIP\n(#1) To date, OFPs have been primarily characterized in arabidopsis (AtOFPs) and demonstrated to regulate plant growth and development (Hackbusch et al., 2005; Pagnussat et al., 2007; Wang et al., 2007 Wang et al., , 2010 Li et al., 2011) .\n(#2) AtOFPs were shown to have close functional interactions with three amino acid loop extension (TALE) homeodomain proteins, and AtOFP1 and AtOFP5 regulate the sub-cellular localization of TALE homeoproteins (Hackbusch et al., 2005) .\n(#3) The plant TALE proteins are a conserved superclass of homeodomain proteins characterized by an extension of three amino acids between helices 1 and 2 of the homeodomain (Bertolino et al., 1995) , and comprise two sub-classes called the KNOTTED-like homeobox (KNOX) and BEL1-like homeodomain (BELL) proteins Tsiantis, 2009, 2010; Hamant and Pautot, 2010) .\n(#4) Furthermore, it has been well documented that interactions between KNOX and BELL proteins result in heterodimers regulating

In [12]:
paras_list[1]['text']

'SKIP\n(#1) To date, OFPs have been primarily characterized in arabidopsis (AtOFPs) and demonstrated to regulate plant growth and development (Hackbusch et al., 2005; Pagnussat et al., 2007; Wang et al., 2007 Wang et al., , 2010 Li et al., 2011) .\n(#2) AtOFPs were shown to have close functional interactions with three amino acid loop extension (TALE) homeodomain proteins, and AtOFP1 and AtOFP5 regulate the sub-cellular localization of TALE homeoproteins (Hackbusch et al., 2005) .\n(#3) The plant TALE proteins are a conserved superclass of homeodomain proteins characterized by an extension of three amino acids between helices 1 and 2 of the homeodomain (Bertolino et al., 1995) , and comprise two sub-classes called the KNOTTED-like homeobox (KNOX) and BEL1-like homeodomain (BELL) proteins Tsiantis, 2009, 2010; Hamant and Pautot, 2010) .\n(#4) Furthermore, it has been well documented that interactions between KNOX and BELL proteins result in heterodimers regulating plant development in a
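Judging from the output above, each sentence in a paragraph's text is introduced by a marker of the form (#n). A minimal sketch of splitting a paragraph into numbered sentences follows. The helper name split_sentences and the shortened sample string are hypothetical, and the regular expression assumes the marker pattern seen here holds across the dataset.

```python
import re

# Assumption: every sentence starts with a "(#n)" marker, as in the output above.
SENTENCE_RE = re.compile(r"\(#(\d+)\)\s*(.*?)(?=\n\(#\d+\)|\Z)", re.S)


def split_sentences(text):
    """Return (sentence_number, sentence_text) pairs found in a paragraph text."""
    return [(int(num), body.strip()) for num, body in SENTENCE_RE.findall(text)]


# Shortened, hypothetical sample in the same layout as paras_list[1]['text']
sample_text = (
    "SKIP\n(#1) To date, OFPs have been characterized ."
    "\n(#2) AtOFPs were shown to interact ."
)
sentences = split_sentences(sample_text)
for num, body in sentences:
    print(num, body)
```

Anything before the first marker (the leading SKIP token, or the paper title in a paper's first paragraph) is ignored by this sketch.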

In [13]:
paras_list[1]['label']

[[0, 4, 'Bad structure'],
 [5, 9, 'Score 4'],
 [5, 9, 'Topic sentence'],
 [19, 23, 'Paragraph Topic'],
 [74, 80, 'Paragraph Topic'],
 [246, 250, 'Score 2'],
 [246, 250, 'Supporting detail sentence'],
 [251, 257, 'Paragraph Topic'],
 [345, 349, 'Topic attribute 1'],
 [377, 383, 'Topic attribute 2'],
 [388, 394, 'Topic attribute 4'],
 [436, 441, 'Topic attribute 1'],
 [483, 487, 'Score 3'],
 [483, 487, 'Supporting detail sentence'],
 [498, 503, 'Topic attribute 1'],
 [747, 751, 'Topic attribute 3'],
 [757, 785, 'Topic attribute 6'],
 [844, 848, 'Supporting detail sentence'],
 [845, 848, 'Score 4'],
 [916, 920, 'Topic attribute 3'],
 [925, 938, 'Topic attribute 6'],
 [1156, 1160, 'Supporting detail sentence'],
 [1157, 1160, 'Score 3'],
 [1161, 1167, 'Topic attribute 2'],
 [1346, 1350, 'Score 4'],
 [1346, 1350, 'Supporting detail sentence'],
 [1386, 1392, 'Topic attribute 2'],
 [1498, 1502, 'Score 2'],
 [1498, 1502, 'Supporting detail sentence'],
 [1514, 1520, 'Topic attribute 2'],
 [1621,
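Each label entry appears to be a [start, end, name] triple whose offsets are character positions in the paragraph text, with the end position exclusive: [0, 4] covers 'SKIP' and [5, 9] covers '(#1)'. The sketch below recovers the text under each label. The helper name label_spans is hypothetical, and the sample is only the opening of paras_list[1] with its first five labels.

```python
def label_spans(paragraph):
    """Return (label_name, covered_text) pairs for one paragraph record.

    Assumes offsets are character indices into paragraph['text'], end-exclusive.
    """
    text = paragraph["text"]
    return [(name, text[start:end]) for start, end, name in paragraph["label"]]


# Opening of paras_list[1], cut short for illustration
sample_para = {
    "id": 36212,
    "text": "SKIP\n(#1) To date, OFPs have been primarily characterized in arabidopsis (AtOFPs)",
    "label": [
        [0, 4, "Bad structure"],
        [5, 9, "Score 4"],
        [5, 9, "Topic sentence"],
        [19, 23, "Paragraph Topic"],
        [74, 80, "Paragraph Topic"],
    ],
}

spans = label_spans(sample_para)
for name, covered in spans:
    print(f"{name!r:20} -> {covered!r}")
```

On this excerpt, the two 'Paragraph Topic' labels cover 'OFPs' and 'AtOFPs', which is consistent with the offsets being end-exclusive character positions.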

In [14]:
paras_list[1]['previous_para']

'SKIP\nPaper title: Phylogenetic analyses provide the first insights into the evolution of OVATE family proteins in land plants. \n\n(#1) The OVATE gene was first identified as an important regulator of fruit shape in tomato, in which a naturally occurring premature stop codon in OVATE results in pear-shaped fruit with longitudinal elongation and neck constriction (Liu et al., 2002) .\n(#2) This revealed a previously uncharacterized class of regulatory genes in plant development, which encode proteins with a conserved 70 amino acid C-terminal domain.\n(#3) This domain was designated as the OVATE domain, also known as DUF623 (Domain of Unknown Function 623), and proteins containing this domain were designated OVATE family proteins (OFPs), which are found exclusively in plants (Hackbusch et al., 2005; Wang et al., 2007 Wang et al., , 2011 .'
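The output above matches the text of the first paragraph of paper_1, which suggests that 'previous_para' is a copy of the preceding paragraph's 'text'. A minimal sketch of checking this for one paper follows. The helper name check_previous_links and the toy records are hypothetical, and the check assumes the first paragraph simply lacks a previous paragraph.

```python
def check_previous_links(paper_paragraphs):
    """Return indices of paragraphs whose 'previous_para' differs from the preceding text."""
    mismatches = []
    # Start at 1: the first paragraph of a paper has no previous paragraph.
    for i in range(1, len(paper_paragraphs)):
        expected = paper_paragraphs[i - 1]["text"]
        if paper_paragraphs[i].get("previous_para") != expected:
            mismatches.append(i)
    return mismatches


# Toy records in the same layout as the dataset (hypothetical content)
toy_paper = [
    {"id": 1, "text": "SKIP\n(#1) First paragraph."},
    {"id": 2, "text": "SKIP\n(#1) Second paragraph.",
     "previous_para": "SKIP\n(#1) First paragraph."},
    {"id": 3, "text": "SKIP\n(#1) Third paragraph.",
     "previous_para": "something else"},
]

mismatches = check_previous_links(toy_paper)
print(mismatches)
```

On the real data, the same function could be called as check_previous_links(loaded_dict["paper_1"]['paper']); an empty list would mean every link agrees.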

In [15]:
len(paras_list)

981
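With all paragraphs in one flat list, a tally of label names across the dataset takes only a few lines. This is a minimal sketch on toy records; the helper name count_labels is hypothetical, and it assumes every paragraph carries a 'label' list of [start, end, name] triples as shown earlier.

```python
from collections import Counter


def count_labels(paragraphs):
    """Count how often each label name occurs across a list of paragraph records."""
    counts = Counter()
    for para in paragraphs:
        counts.update(name for _start, _end, name in para["label"])
    return counts


# Toy records standing in for paras_list (hypothetical content)
toy_paras = [
    {"label": [[0, 4, "Bad structure"], [5, 9, "Topic sentence"]]},
    {"label": [[0, 4, "Bad structure"], [5, 9, "Score 4"]]},
]

label_counts = count_labels(toy_paras)
for name, n in label_counts.most_common():
    print(name, n)
```

On the real data, count_labels(paras_list) would give the distribution over all 981 paragraphs.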