<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 4 - eyamrog

The aim of this phase is to download the selected texts in PDF format from the preprint archive of [SciELO](https://scielo.org/) (Scientific Electronic Library Online), scrape their text, isolate the paragraphs of each article and review them with ChatGPT.

- [SciELO Preprints](https://preprints.scielo.org/index.php/scielo)

## Required Python packages

- beautifulsoup4
- PyMuPDF
- lxml
- pandas
- requests
- tqdm

## Importing the required libraries

In [7]:
import requests
from bs4 import BeautifulSoup
import fitz # PyMuPDF
import re
import pandas as pd
import os
import sys
import logging
from tqdm import tqdm

## Defining input variables

In [2]:
input_directory = 'cl_st1_ph3_eyamrog'
output_directory = 'cl_st1_ph4_eyamrog'

## Creating output directory

In [3]:
# Check if the output directory already exists. If it does, do nothing. If it doesn't exist, create it.
if os.path.exists(output_directory):
    print('Output directory already exists.')
else:
    try:
        os.makedirs(output_directory)
        print('Output directory successfully created.')
    except OSError as e:
        print('Failed to create the directory:', e)
        sys.exit(1)

Output directory successfully created.
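The check-then-create logic above can also be expressed with `os.makedirs(..., exist_ok=True)`, which does nothing when the directory is already there. The sketch below is only an illustration: the helper name `ensure_directory` and the temporary directory name are assumptions made for the example, not part of the notebook.

```python
import os
import tempfile

def ensure_directory(path):
    """Create `path` (and any missing parents) if needed; return True if it was created."""
    already_there = os.path.isdir(path)
    os.makedirs(path, exist_ok=True)  # No error if the directory already exists
    return not already_there

# Usage on a throwaway location so the example does not touch the project folders
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, 'cl_st1_ph4_demo')  # Hypothetical directory name
    first_call = ensure_directory(demo_dir)
    second_call = ensure_directory(demo_dir)
    print(first_call, second_call)
```

The return value makes it possible to keep the two status messages of the original cell while removing the explicit `os.path.exists` branch.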


## PDF scraping the `SciELO Preprints` archive

### Importing the data into a DataFrame

In [3]:
df_scielo_preprint_preChatGPT_en = pd.read_json(f'{input_directory}/scielo_erpp_pp.jsonl', lines=True)

In [4]:
df_scielo_preprint_preChatGPT_en.dtypes

Title           object
URL             object
Authors         object
Published       object
PDF Language    object
PDF URL         object
Submitted        int64
Posted           int64
Text            object
Text ID         object
dtype: object

In [5]:
df_scielo_preprint_preChatGPT_en['Submitted'] = pd.to_datetime(df_scielo_preprint_preChatGPT_en['Submitted'], unit='ms')
df_scielo_preprint_preChatGPT_en['Posted'] = pd.to_datetime(df_scielo_preprint_preChatGPT_en['Posted'], unit='ms')
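The `Submitted` and `Posted` columns are loaded as `int64`, presumably because the JSON Lines file stores the dates as milliseconds since the Unix epoch, which is what `unit='ms'` tells `pd.to_datetime` to expect. As a sanity check of that interpretation, the standard-library sketch below converts a date to epoch milliseconds and back. The helper names are hypothetical and the date is chosen only for illustration.

```python
from datetime import datetime, timezone

def to_epoch_ms(dt):
    """Return the number of whole milliseconds between the Unix epoch and `dt`."""
    return int(dt.timestamp() * 1000)

def from_epoch_ms(ms):
    """Return the timezone-aware UTC datetime for an epoch-millisecond value."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

example_date = datetime(2022, 11, 22, tzinfo=timezone.utc)  # Illustrative date
epoch_ms = to_epoch_ms(example_date)
round_trip = from_epoch_ms(epoch_ms)
print(epoch_ms, round_trip.date())
```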

In [6]:
df_scielo_preprint_preChatGPT_en

Unnamed: 0,Title,URL,Authors,Published,PDF Language,PDF URL,Submitted,Posted,Text,Text ID
0,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,Publication status: Preprint has been publishe...,t000000
1,Assembling the perfect bacterial genome using ...,https://preprints.scielo.org/index.php/scielo/...,"Ryan R. Wick, Louise M. Judd, Kathryn E. Holt",Submitted 11/11/2022 - Posted 11/11/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-11,2022-11-11,Publication status: Preprint has been publishe...,t000001
2,ON METHODOLOGY AND METHODS FOR ANALYSING CLASS...,https://preprints.scielo.org/index.php/scielo/...,Leonardo Goncalves Lago,Submitted 11/10/2022 - Posted 11/16/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-10,2022-11-16,Publication status: Preprint has been publishe...,t000002
3,"MOBILIZING LINGUISTIC AND SEMIOTIC RESOURCES, ...",https://preprints.scielo.org/index.php/scielo/...,"Estêvão Cabral, Marylin Martin-Jones",Submitted 11/07/2022 - Posted 11/07/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-07,2022-11-07,Publication status: Preprint has been submitte...,t000003
4,PORTUGUESE AO PÉ DO BERIMBAU: ON CAPOEIRA AS A...,https://preprints.scielo.org/index.php/scielo/...,"Mike Baynham, Jolana Hanusova",Submitted 11/04/2022 - Posted 11/04/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-04,2022-11-04,Publication status: Preprint has been submitte...,t000004
...,...,...,...,...,...,...,...,...,...,...
311,Challenges in the fight against the COVID-19 p...,https://preprints.scielo.org/index.php/scielo/...,Eduardo Alexandrino Servolo Medeiros,Submitted 04/15/2020 - Posted 04/15/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-15,2020-04-15,*Corresponding author. E-mail: edubalaccih@gma...,t000311
312,Information about the new coronavirus disease ...,https://preprints.scielo.org/index.php/scielo/...,Claudio Márcio Amaral de Oliveira Lima,Submitted 04/13/2020 - Posted 04/13/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-13,2020-04-13,V\nRadiol Bras. 2020 Mar/Abr;53(2):V–VI\n0100-...,t000312
313,ACE2 diversity in placental mammals reveals th...,https://preprints.scielo.org/index.php/scielo/...,"Bibiana Sampaio de Oliveira Fam, Pedro Vargas-...",Submitted 04/11/2020 - Posted 04/28/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-11,2020-04-28,Status: Preprint has been published in a journ...,t000313
314,Coronavirus 2: Analysis of Regularity of Compl...,https://preprints.scielo.org/index.php/scielo/...,Yuri Morales-López,Submitted 04/10/2020 - Posted 06/04/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-10,2020-06-04,Publication status: Preprint has been publishe...,t000314


### Extracting the articles in original PDF format and scraping them into TXT format

In [None]:
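# Configure logging to write to a file
# force=True (Python 3.8+) replaces any handler left over from a previous run of this cell
logging.basicConfig(
    filename = f"{output_directory}/scraping_log.txt",
    level = logging.INFO,
    format = '%(asctime)s - %(levelname)s - %(message)s',
    force = True
)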

def scrape_pdf(pdf_byte_stream, output_txt):
    try:
        # Opening the PDF from the byte stream; the context manager closes the document afterwards
        with fitz.open(stream=pdf_byte_stream, filetype='pdf') as doc:
            # Iterating through all the pages and concatenating the extracted text
            text = ''.join(page.get_text() for page in doc)

        # Writing the extracted text to a text file in UTF-8 encoding
        with open(output_txt, 'w', encoding='utf-8') as txt_file:
            txt_file.write(text)

        logging.info(f"Text successfully extracted and saved to {output_txt}")
    except Exception as e:
        logging.error(f"Error extracting text from PDF for {output_txt}: {e}")
def save_pdf(pdf_byte_stream, output_pdf):
    try:
        with open(output_pdf, 'wb') as pdf_file:
            pdf_file.write(pdf_byte_stream)
        
        logging.info(f"PDF successfully saved to {output_pdf}")
    except Exception as e:
        logging.error(f"Error saving PDF: {e}")

for index, row in df_scielo_preprint_preChatGPT_en.iterrows():
    text_id = row['Text ID']
    try:
        # NOTE: verify=False disables TLS certificate verification; drop it if the server certificate validates
        article_page = requests.get(row['PDF URL'], verify=False, timeout=60)
        article_page.raise_for_status()  # Treat HTTP errors as failures instead of parsing an error page
        soup = BeautifulSoup(article_page.content, 'lxml')
        # Finding the <a> tag with the class 'download'
        download_link = soup.find('a', class_='download')
        article_pdf_link = download_link.get('href') if download_link else None

        if article_pdf_link:
            article = requests.get(article_pdf_link, verify=False, timeout=60)
            article.raise_for_status()
            # Scraping the article from the byte stream and saving it in the output directory
            scrape_pdf(article.content, f"{output_directory}/{text_id}.txt")
            # Saving the article in the output directory
            save_pdf(article.content, f"{output_directory}/{text_id}.pdf")
        else:
            logging.warning(f"No PDF link found for Text ID {text_id}")
    except Exception as e:
        logging.error(f"Error processing row {index} (Text ID {text_id}): {e}")
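Because the download step writes whatever bytes the server returns, an HTML error page could end up saved with a `.pdf` extension. One possible safeguard, sketched below under the assumption that a valid file starts with the `%PDF-` header marker, is to check the first bytes before extraction. The function `looks_like_pdf` is a hypothetical helper, not part of the notebook or of PyMuPDF, and the byte strings are invented samples.

```python
def looks_like_pdf(data):
    """Return True if the byte string begins with the PDF header marker."""
    # Leading whitespace is tolerated here; that leniency is an assumption of this sketch
    return data.lstrip(b' \t\r\n').startswith(b'%PDF-')

pdf_bytes = b'%PDF-1.7\n% invented sample content'
html_bytes = b'<!DOCTYPE html><html><body>Not found</body></html>'

pdf_check = looks_like_pdf(pdf_bytes)
html_check = looks_like_pdf(html_bytes)
print(pdf_check, html_check)
```

In the loop, such a check could guard the calls to `scrape_pdf` and `save_pdf`, logging a warning for responses that fail it.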

## Tokenising the paragraphs of each article

### Manual inspection and clean up

Inspect each article in `TXT` format and:
- Remove titles and subtitles
- Remove headers and footers
- Remove elements such as tables, figures, references and appendices
- Separate paragraphs with an empty line, so that each block of consecutive lines constitutes one paragraph
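The inspection itself is manual, but running headers and footers tend to repeat verbatim on every page, so a small script can point out candidates before reading. The sketch below is only an aid under that assumption: `repeated_lines` is a hypothetical helper, the threshold of three occurrences is arbitrary, and the sample text is invented.

```python
from collections import Counter

def repeated_lines(text, min_count=3):
    """Return (line, count) pairs for non-empty lines occurring at least `min_count` times."""
    counts = Counter(line.strip() for line in text.split('\n') if line.strip())
    return [(line, n) for line, n in counts.most_common() if n >= min_count]

sample = '\n'.join([
    'SciELO Preprints - sample header',  # Invented header text
    'First page body line.',
    'SciELO Preprints - sample header',
    'Second page body line.',
    'SciELO Preprints - sample header',
    'Third page body line.',
])
candidates = repeated_lines(sample)
print(candidates)
```

Lines flagged this way still need a human decision, since short sentences or list items can legitimately repeat within an article.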

### Merging lines into paragraphs

In [8]:
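# Setting up logging. force=True (Python 3.8+) is required here: without it, basicConfig()
# does nothing once the root logger already has the handler configured in the scraping step
logging.basicConfig(
    filename = f"{output_directory}/paragraph_tokenise_log.txt",
    level = logging.INFO,
    format = '%(asctime)s - %(levelname)s - %(message)s',
    force = True
)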

# Defining a function to tokenise the paragraphs of each article
def paragraph_tokenise(text):
    lines = text.split('\n')
    paragraphs = []
    paragraph = ''
    
    for line in lines:
        if line.strip():
            cleaned_line = ' '.join(line.split())  # Collapse runs of whitespace within the line
            paragraph += ' ' + cleaned_line  # Join subsequent lines into a paragraph
        elif paragraph:
            # An empty line closes the paragraph consolidated so far; consecutive
            # empty lines are skipped so that no empty paragraphs are emitted
            paragraphs.append(paragraph.strip())
            paragraph = ''  # The paragraph variable is cleared out

    if paragraph:
        paragraphs.append(paragraph.strip())  # The last paragraph is added to the list of paragraphs
    
    tokenised_paragraphs = '\n'.join(paragraphs)  # The list of paragraphs is compiled into a text with each paragraph as a separate line
    
    return tokenised_paragraphs

# Defining a function to read the content of a TXT file
def read_txt_file(file_path):
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            return file.read()
    except Exception as e:
        logging.error(f"Error reading file {file_path}: {e}")
        return None

# Defining a function to save the paragraph-tokenised articles into TXT files
def save_paragraph_tokenised_file(output_text_content, output_file):
    try:
        with open(output_file, 'w', encoding='utf-8') as output_txt_file:
            output_txt_file.write(output_text_content)
        logging.info(f"Successfully saved tokenised file: {output_file}")
    except Exception as e:
        logging.error(f"Error saving file {output_file}: {e}")

# Iterating through each row in the DataFrame and tokenising the paragraphs of the corresponding TXT file
for index, row in tqdm(df_scielo_preprint_preChatGPT_en.iterrows(), total=df_scielo_preprint_preChatGPT_en.shape[0], desc='Processing files'):
    text_id = row['Text ID']
    txt_file_path = os.path.join(output_directory, f"{text_id}.txt")
    if os.path.exists(txt_file_path):
        text_content = read_txt_file(txt_file_path)
        if text_content:
            paragraph_tokenised_text_content = paragraph_tokenise(text_content)
            save_paragraph_tokenised_file(paragraph_tokenised_text_content, f"{output_directory}/{text_id}_tokenised.txt")
    else:
        logging.warning(f"File not found: {txt_file_path}")

Processing files: 100%|██████████| 316/316 [00:00<00:00, 708.93it/s]
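To make the merging rule concrete, the self-contained sketch below applies the same idea to a short invented sample: consecutive non-empty lines are joined with single spaces, and an empty line ends the paragraph. The function `merge_lines` is a compact variant written for this illustration, not the notebook's `paragraph_tokenise`, and it skips repeated empty lines.

```python
def merge_lines(text):
    """Join consecutive non-empty lines into one paragraph per output line."""
    paragraphs, current = [], []
    for line in text.split('\n'):
        words = line.split()  # Splitting on whitespace also collapses repeated spaces
        if words:
            current.append(' '.join(words))
        elif current:  # An empty line closes the paragraph collected so far
            paragraphs.append(' '.join(current))
            current = []
    if current:  # Flush the last paragraph
        paragraphs.append(' '.join(current))
    return '\n'.join(paragraphs)

sample = (
    'Corpus linguistics studies\n'
    'language  through   large\n'
    'collections of texts.\n'
    '\n'
    'Each paragraph ends\n'
    'at an empty line.\n'
)
result = merge_lines(sample)
print(result)
```

Note that a word hyphenated across a line break would be joined with a space after the hyphen, so such cases are best repaired during the manual clean-up.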


## Adding the column `Text Paragraphs` with the text extracted from each article

In [None]:
# Function to read the content of a TXT file
def read_txt_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

# Iterating through each row in the DataFrame and collecting the paragraph-tokenised text
texts = []
for index, row in df_scielo_preprint_preChatGPT_en.iterrows():
    text_id = row['Text ID']
    txt_file_path = os.path.join(output_directory, f"{text_id}_tokenised.txt")
    if os.path.exists(txt_file_path):
        text_content = read_txt_file(txt_file_path)
    else:
        text_content = None  # No tokenised file for this Text ID: keep the row with a missing value
    texts.append(text_content)

# Add the 'Text Paragraphs' column to the DataFrame
df_scielo_preprint_preChatGPT_en['Text Paragraphs'] = texts

In [None]:
df_scielo_preprint_preChatGPT_en

## Exporting to a file

In [None]:
df_scielo_preprint_preChatGPT_en.to_json(f"{output_directory}/scielo_erpp_pp.jsonl", orient='records', lines=True)
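With `orient='records'` and `lines=True`, `to_json` writes the JSON Lines format: one JSON object per row, one row per line. A quick way to check such an export without pandas is to parse it line by line, as in this minimal sketch. The records and the temporary file are invented for the example, and `read_jsonl` is a hypothetical helper.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Parse a JSON Lines file into a list of dictionaries, skipping blank lines."""
    with open(path, 'r', encoding='utf-8') as file:
        return [json.loads(line) for line in file if line.strip()]

# Invented records imitating two rows of the exported DataFrame
records = [
    {'Text ID': 't000000', 'Text Paragraphs': 'First paragraph.\nSecond paragraph.'},
    {'Text ID': 't000001', 'Text Paragraphs': None},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.jsonl')
    with open(path, 'w', encoding='utf-8') as file:
        for record in records:
            file.write(json.dumps(record, ensure_ascii=False) + '\n')
    loaded = read_jsonl(path)

print(len(loaded), loaded[1]['Text Paragraphs'])
```

Newlines inside a value are escaped in the file, so the paragraph-per-line text in `Text Paragraphs` survives the round trip while each record still occupies a single line.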