The dataset I will use for this project consists of 100 scientific papers from the WING NUS group's Scisumm corpus, available at this [GitHub link](https://github.com/WING-NUS/scisumm-corpus). According to the authors of [Scisumm](https://cs.stanford.edu/~myasu/projects/scisumm_net/), a summary of a scientific paper should ideally incorporate the paper's impact on the research community, as reflected by citations. To facilitate research in citation-aware scientific paper summarization (Scisumm), the CL-Scisumm shared task has been organized since 2014 for papers in the computational linguistics and NLP domain. 

## Data Collection and Preprocessing

The data from Scisumm is in `.xml` format. [XML](https://www.indeed.com/career-advice/career-development/xml-file), the Extensible Markup Language, is a plain-text format used to structure data for storage and transport. Custom tags describe the structure of the document, and the text itself sits between those tags.
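
To make the idea of tags, attributes, and text concrete, here is a minimal sketch that parses a tiny XML string. The snippet is hypothetical (it is not taken from the corpus), and it uses the standard library's `xml.etree.ElementTree` rather than `lxml` so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet for illustration only, not taken from the corpus
snippet = (
    '<note priority="high">'
    "<to>Reader</to>"
    "<body>XML stores text inside tags.</body>"
    "</note>"
)

root = ET.fromstring(snippet)

print(root.tag)                 # name of the outermost tag
print(root.attrib["priority"])  # attribute stored on that tag

# Each child element has its own tag and its own text
children = [(child.tag, child.text) for child in root]
for tag, text in children:
    print(tag, "->", text)
```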

The structure of a sample paper is shown below: 


<img src='https://drive.google.com/uc?id=1wheFaobd2Bw6QSmMIn0azJIhC584lNzV'>

Each part of the paper is contained in a `SECTION` tag, and the sentences of that section are nested inside it. Each sentence is given a unique sentence identifier, or `sid`. 
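
As a rough illustration of this layout, the sketch below walks a miniature paper and groups sentences by the part they belong to. The sample document is hypothetical, and the element and attribute names (`PAPER`, `ABSTRACT`, `S`, `title`, `number`) are assumptions based on the description here and on the extraction code later in this notebook. It uses the standard library's `xml.etree.ElementTree` instead of `lxml.objectify`.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature paper; tag and attribute names are assumptions
sample = """
<PAPER>
  <S sid="0">A Tiny Example Paper</S>
  <ABSTRACT>
    <S sid="1">We describe a toy method.</S>
  </ABSTRACT>
  <SECTION title="Introduction" number="1">
    <S sid="2">Summaries are useful.</S>
    <S sid="3">We build one here.</S>
  </SECTION>
</PAPER>
"""


def sentences_by_part(paper_root):
    """Map each ABSTRACT/SECTION name to a list of (sid, text) pairs."""
    parts = {}
    for child in paper_root:
        sentences = child.findall("S")
        if not sentences:
            continue  # e.g. the top-level title sentence has no nested sentences
        # Sections are named by their title, other parts by their tag
        name = child.get("title") if child.tag == "SECTION" else child.tag
        parts[name] = [(int(s.get("sid")), s.text) for s in sentences]
    return parts


parts = sentences_by_part(ET.fromstring(sample))
for name, sents in parts.items():
    print(name, sents)
```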

To extract this data properly, I follow this process:
1. Get the list of all `.xml` file names.
2. Use `objectify` from the `lxml` library to extract the text contents of each file.
3. Extract the `abstract` and `conclusion` sections into separate lists for abstractive summarization.
4. Collate the text from every section into one whole text.
5. Append the `abstract`, `full_text`, and `conclusion` to a pandas DataFrame.
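
The steps above can be sketched end to end on two tiny, made-up papers written to a temporary folder. This is only an illustration under stated assumptions: the helper `extract_parts` is hypothetical, it uses the standard library's `xml.etree.ElementTree` instead of `lxml.objectify`, and it finds the abstract by tag and the conclusion by section title, which is an assumption about the files rather than something stated above. The rows are collected as plain dictionaries; passing them to `pd.DataFrame(rows)` would give the DataFrame from step 5.

```python
import glob
import os
import tempfile
import xml.etree.ElementTree as ET


def extract_parts(xml_path):
    """Return (abstract, full_text, conclusion) for one paper file."""
    root = ET.parse(xml_path).getroot()
    abstract, conclusion, pieces = "", "", []
    for child in root:
        # Join the sentences nested under this part of the paper
        text = " ".join(s.text.strip() for s in child.findall("S") if s.text)
        if not text:
            continue
        pieces.append(text)
        title = (child.get("title") or "").lower()
        if child.tag == "ABSTRACT":
            abstract = text
        elif child.tag == "SECTION" and title.startswith("conclusion"):
            conclusion = text
    return abstract, " ".join(pieces), conclusion


# Hypothetical paper template used only for this demonstration
TEMPLATE = (
    "<PAPER><S sid='0'>Title {n}</S>"
    "<ABSTRACT><S sid='1'>Abstract {n}.</S></ABSTRACT>"
    "<SECTION title='Introduction' number='1'><S sid='2'>Intro {n}.</S></SECTION>"
    "<SECTION title='Conclusion' number='2'><S sid='3'>Done {n}.</S></SECTION>"
    "</PAPER>"
)

with tempfile.TemporaryDirectory() as tmp:
    for n in (1, 2):
        doc_dir = os.path.join(tmp, f"doc{n}", "Documents_xml")
        os.makedirs(doc_dir)
        with open(os.path.join(doc_dir, f"doc{n}.xml"), "w", encoding="utf-8") as fh:
            fh.write(TEMPLATE.format(n=n))

    # Step 1: list the files. Steps 2-4: extract. Step 5: collect the rows.
    paths = sorted(glob.glob(os.path.join(tmp, "*", "*", "*.xml")))
    keys = ("abstract", "full_text", "conclusion")
    rows = [dict(zip(keys, extract_parts(path))) for path in paths]

print(rows[0])
```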

### Getting the list of all `.xml` file names from Scisumm

In [1]:
!wget https://cs.stanford.edu/~myasu/projects/scisumm_net/scisummnet_release1.1__20190413.zip

--2022-08-11 12:42:19--  https://cs.stanford.edu/~myasu/projects/scisumm_net/scisummnet_release1.1__20190413.zip
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19451729 (19M) [application/zip]
Saving to: ‘scisummnet_release1.1__20190413.zip’


2022-08-11 12:42:21 (10.7 MB/s) - ‘scisummnet_release1.1__20190413.zip’ saved [19451729/19451729]



In [2]:
!unzip scisummnet_release1.1__20190413.zip

Streaming output truncated to the last 5000 lines.
   creating: __MACOSX/scisummnet_release1.1__20190413/top1000_complete/C90-3030/summary/
  inflating: __MACOSX/scisummnet_release1.1__20190413/top1000_complete/C90-3030/summary/._C90-3030.gold.txt  
  inflating: __MACOSX/scisummnet_release1.1__20190413/top1000_complete/C90-3030/._summary  
  inflating: __MACOSX/scisummnet_release1.1__20190413/top1000_complete/._C90-3030  
   creating: scisummnet_release1.1__20190413/top1000_complete/P04-1036/
  inflating: scisummnet_release1.1__20190413/top1000_complete/P04-1036/citing_sentences_annotated.json  
   creating: __MACOSX/scisummnet_release1.1__20190413/top1000_complete/P04-1036/
  inflating: __MACOSX/scisummnet_release1.1__20190413/top1000_complete/P04-1036/._citing_sentences_annotated.json  
   creating: scisummnet_release1.1__20190413/top1000_complete/P04-1036/Documents_xml/
  inflating: scisummnet_release1.1__20190413/top1000_complete/P04-1036/Documents_xml/P04-1036.xml  


Then, I manually filtered the unzipped folder to only include 100 selected documents that **follow the format in the screenshot above** for the purpose of this project. The 100 documents are contained in the raw data `top100.csv`.
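
The filtering for this project was done by hand. For future iterations, a rough automated pre-check might look like the sketch below. The function `follows_expected_format` and its criteria (well-formed XML, an `ABSTRACT` with at least one sentence, and at least one `SECTION` where every section has a `title`) are assumptions about what "following the format" means, not the rule actually used to pick the 100 documents.

```python
import xml.etree.ElementTree as ET


def follows_expected_format(xml_text):
    """Heuristic check (assumed criteria) for the expected paper layout."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not well-formed XML

    abstract = root.find("ABSTRACT")
    sections = root.findall("SECTION")

    has_abstract = abstract is not None and len(abstract.findall("S")) > 0
    has_titled_sections = len(sections) > 0 and all(s.get("title") for s in sections)
    return has_abstract and has_titled_sections


# Hypothetical documents for illustration
good = (
    "<PAPER><ABSTRACT><S sid='1'>Text.</S></ABSTRACT>"
    "<SECTION title='Intro'><S sid='2'>More.</S></SECTION></PAPER>"
)
bad = "<PAPER><SECTION><S sid='1'>No abstract here.</S></SECTION></PAPER>"

print(follows_expected_format(good), follows_expected_format(bad))
```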

## Data Extraction

For the purpose of future iterations and/or selecting a different subset of scientific texts, the following code block was used to extract the `abstract`, `conclusion`, and `full_text` columns found in the `top100.csv` raw data file.

Note that the `xml` documents chosen must be placed into a separate folder containing all the individual folders per document as shown in the screenshot below.

<img src='https://drive.google.com/uc?id=1Qizt8ZFXS3-SgSY8W9iORGelCwcrsXql'>

Once the data has been set up like this, the code blocks below make it easy to extract these documents.
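
If the selection is repeated with a different subset, the folder could also be assembled with a short script instead of by hand. The sketch below is an assumption-laden illustration: the helper `copy_selected` is hypothetical, and the layout (`<paper_id>/Documents_xml/<paper_id>.xml`) is inferred from the unzip output above, not read from the screenshot. It runs on a throwaway folder tree so nothing in the real dataset is touched.

```python
import os
import shutil
import tempfile


def copy_selected(src_root, dst_root, paper_ids):
    """Copy each selected paper folder from src_root into dst_root."""
    os.makedirs(dst_root, exist_ok=True)
    copied = []
    for paper_id in paper_ids:
        src = os.path.join(src_root, paper_id)
        if not os.path.isdir(src):
            continue  # skip IDs that are not present in the source folder
        shutil.copytree(src, os.path.join(dst_root, paper_id), dirs_exist_ok=True)
        copied.append(paper_id)
    return copied


# Demonstration on a temporary folder tree with one placeholder document
with tempfile.TemporaryDirectory() as tmp:
    src_root = os.path.join(tmp, "top1000_complete")
    xml_dir = os.path.join(src_root, "P04-1036", "Documents_xml")
    os.makedirs(xml_dir)
    with open(os.path.join(xml_dir, "P04-1036.xml"), "w", encoding="utf-8") as fh:
        fh.write("<PAPER/>")

    dst_root = os.path.join(tmp, "top100")
    # "X00-0000" is a made-up ID that does not exist, so it is skipped
    copied = copy_selected(src_root, dst_root, ["P04-1036", "X00-0000"])
    copied_file_exists = os.path.isfile(
        os.path.join(dst_root, "P04-1036", "Documents_xml", "P04-1036.xml")
    )

print(copied, copied_file_exists)
```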

In [None]:
%%time

# Data collection libraries
from lxml import objectify
import pandas as pd
import numpy as np
import os
import glob
from glob import iglob

# Create xml file extraction function
def extract_xml(directory):
  xml_data = objectify.parse(directory)  # Parse XML data
  root = xml_data.getroot()  # Root element

  data = []
  cols = []
  for child in root.getchildren():
    # Collect the text of every element nested directly under this child
    data.append([subchild.text for subchild in child.getchildren()])

    # Elements other than 'SECTION' (such as the abstract) are named by their tag;
    # 'SECTION' elements are named by their 'title' attribute
    if child.tag != "SECTION":
      cols.append(child.tag)
    else:
      cols.append(child.attrib.get('title'))

  df = pd.DataFrame(data).T  # Create DataFrame and transpose it
  df.columns = cols  # Update column names

  # Get the abstract column (second column)
  abstract_list = df.iloc[:, 1].dropna()
  abstract = " ".join(abstract_list)

  # Get the conclusion column (penultimate column)
  conclusion_text = df.iloc[:, -2].dropna()
  conclusion = " ".join(conclusion_text)

  # Drop last column of a dataframe
  df = df.iloc[: , :-1]

  # Drop first column: S 
  df = df.iloc[:, 1:]

  # Iterate over all sections and join them together to get the text document
  text_list = []
  for column in df.columns:
    text_filtered = df[column].dropna()
    text = " ".join(text_filtered)
    text_list.append(text)

  final_text = " ".join(text_list)

  return abstract, final_text, conclusion

# Get the paths of all .xml files (each * matches one folder level, so
# recursive=True is not needed); sorting keeps the document order reproducible
file_directory = sorted(glob.glob("scisummnet_release1.1__20190413/top100/*/*/*.xml"))

# Check that the paths are correct
print(file_directory[0:5])

# Create lists for abstract, full text, and conclusion
abstract_list = []
full_text_list = []
conclusion_list = []

for counter, directory in enumerate(file_directory):
  abstract, full_text, conclusion = extract_xml(directory)
  abstract_list.append(abstract)
  full_text_list.append(full_text)
  conclusion_list.append(conclusion)

  print(f"XML extraction for document {counter} done!")

text_df = pd.DataFrame(list(zip(abstract_list, full_text_list, conclusion_list)), columns=["abstract", "full_text", "conclusion"])
print(text_df.head())

# Optional: Save the dataframe to a .csv file
# text_df.to_csv("top100.csv")
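
Because `extract_xml` picks the abstract and conclusion by column position, a document with a slightly different layout could yield an empty or misplaced field. A simple sanity check, sketched below with a hypothetical helper `find_incomplete`, flags rows where any of the three fields is empty. The sample rows are made up for illustration.

```python
def find_incomplete(rows):
    """Return the indices of rows where any of the three fields is empty."""
    return [
        i
        for i, fields in enumerate(rows)
        if any(not (field or "").strip() for field in fields)
    ]


# Made-up rows in (abstract, full_text, conclusion) order
rows = [
    ("An abstract.", "An abstract. Body. The end.", "The end."),
    ("", "Body only.", "Body only."),
]

incomplete = find_incomplete(rows)
print(incomplete)
```

With the lists built above, the same check could be run as `find_incomplete(zip(abstract_list, full_text_list, conclusion_list))`.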