<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 3 - eyamrog

The aim of this phase is to download the selected texts from the preprint archive of [SciELO](https://scielo.org/) (Scientific Electronic Library Online), extract their text from the PDF files, inspect the articles, and keep those that fall within the scope of this study.

- [SciELO Preprints](https://preprints.scielo.org/index.php/scielo)

## Required Python packages

- beautifulsoup4
- PyMuPDF
- lxml
- pandas
- requests

## Importing the required libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import fitz # PyMuPDF
import re
import pandas as pd
import os
import sys
import logging

## Defining input variables

In [2]:
input_directory = 'cl_st1_ph2_eyamrog'
output_directory = 'cl_st1_ph3_eyamrog'

## Creating output directory

In [3]:
# Check if the output directory already exists. If it does, do nothing. If it doesn't exist, create it.
if os.path.exists(output_directory):
    print('Output directory already exists.')
else:
    try:
        os.makedirs(output_directory)
        print('Output directory successfully created.')
    except OSError as e:
        print('Failed to create the directory:', e)
        sys.exit(1)

Output directory successfully created.


## PDF Scraping `SciELO Preprints` archive

### Importing the data into a DataFrame

In [4]:
df_scielo_preprint_preChatGPT_en = pd.read_json(f'{input_directory}/scielo_preprint_preChatGPT_en.jsonl', lines=True)

In [5]:
df_scielo_preprint_preChatGPT_en.dtypes

Title           object
URL             object
Authors         object
Published       object
PDF Language    object
PDF URL         object
Submitted        int64
Posted           int64
dtype: object

In [6]:
df_scielo_preprint_preChatGPT_en['Submitted'] = pd.to_datetime(df_scielo_preprint_preChatGPT_en['Submitted'], unit='ms')
df_scielo_preprint_preChatGPT_en['Posted'] = pd.to_datetime(df_scielo_preprint_preChatGPT_en['Posted'], unit='ms')
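
The `unit='ms'` argument tells pandas to read the integers as milliseconds since the Unix epoch. A minimal standard-library sketch of the same conversion follows. The value `1669075200000` is an assumption chosen to correspond to midnight UTC on 2022-11-22; it was not read from the input file, and `ms_to_datetime` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def ms_to_datetime(ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# Illustrative value only (assumed, not taken from the dataset)
example_ms = 1669075200000
converted = ms_to_datetime(example_ms)
print(converted.date())
```

Unlike this sketch, `pd.to_datetime(..., unit='ms')` returns timezone-naive timestamps.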

### Creating the column `Text ID`

In [7]:
df_scielo_preprint_preChatGPT_en['Text ID'] = 't' + df_scielo_preprint_preChatGPT_en.index.astype(str).str.zfill(6)
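
Each identifier is the letter `t` followed by the row index, left-padded with zeros to six digits. A small sketch of the same format in plain Python, where `make_text_id` is a hypothetical helper name and not part of the notebook:

```python
def make_text_id(index, width=6):
    """Build an identifier such as 't000042' from a non-negative row index."""
    return 't' + str(index).zfill(width)

# Sample indices for illustration
ids = [make_text_id(i) for i in (0, 42, 359)]
print(ids)
```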

In [8]:
df_scielo_preprint_preChatGPT_en

Unnamed: 0,Title,URL,Authors,Published,PDF Language,PDF URL,Submitted,Posted,Text ID
0,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,t000000
1,Belching as a symptom of COVID-19: Case presen...,https://preprints.scielo.org/index.php/scielo/...,"Nuvia Novoa Acosta, Liuba Yamila Peña Galbán, ...",Submitted 11/14/2022 - Posted 11/29/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-14,2022-11-29,t000001
2,Assembling the perfect bacterial genome using ...,https://preprints.scielo.org/index.php/scielo/...,"Ryan R. Wick, Louise M. Judd, Kathryn E. Holt",Submitted 11/11/2022 - Posted 11/11/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-11,2022-11-11,t000002
3,ON METHODOLOGY AND METHODS FOR ANALYSING CLASS...,https://preprints.scielo.org/index.php/scielo/...,Leonardo Goncalves Lago,Submitted 11/10/2022 - Posted 11/16/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-10,2022-11-16,t000003
4,"MOBILIZING LINGUISTIC AND SEMIOTIC RESOURCES, ...",https://preprints.scielo.org/index.php/scielo/...,"Estêvão Cabral, Marylin Martin-Jones",Submitted 11/07/2022 - Posted 11/07/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-07,2022-11-07,t000004
...,...,...,...,...,...,...,...,...,...
356,Information about the new coronavirus disease ...,https://preprints.scielo.org/index.php/scielo/...,Claudio Márcio Amaral de Oliveira Lima,Submitted 04/13/2020 - Posted 04/13/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-13,2020-04-13,t000356
357,ACE2 diversity in placental mammals reveals th...,https://preprints.scielo.org/index.php/scielo/...,"Bibiana Sampaio de Oliveira Fam, Pedro Vargas-...",Submitted 04/11/2020 - Posted 04/28/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-11,2020-04-28,t000357
358,Coronavirus 2: Analysis of Regularity of Compl...,https://preprints.scielo.org/index.php/scielo/...,Yuri Morales-López,Submitted 04/10/2020 - Posted 06/04/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-10,2020-06-04,t000358
359,MULTIPROFESSIONAL ELECTRONIC PROTOCOL FOR DIGE...,https://preprints.scielo.org/index.php/scielo/...,"Osvaldo Malafaia, Faruk Abrão KALIL-FILHO, Jo...",Submitted 04/09/2020 - Posted 04/09/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-09,2020-04-09,t000359


### Extracting the articles in original PDF format and scraping them into TXT format

In [None]:
# Configure logging to write to a file
logging.basicConfig(
    filename=f"{output_directory}/scraping_log.txt",
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

def scrape_pdf(pdf_byte_stream, output_txt):
    try:
        # Opening the PDF from the in-memory byte stream; the context manager closes it afterwards
        with fitz.open(stream=pdf_byte_stream, filetype='pdf') as doc:
            # Extracting the text of every page and concatenating it in page order
            text = ''.join(page.get_text() for page in doc)
        
        # Writing the extracted text to a text file in UTF-8 encoding
        with open(output_txt, 'w', encoding='utf-8') as txt_file:
            txt_file.write(text)
        
        logging.info(f"Text successfully extracted and saved to {output_txt}")
    except Exception as e:
        logging.error(f"Error extracting text from PDF: {e}")

def save_pdf(pdf_byte_stream, output_pdf):
    try:
        with open(output_pdf, 'wb') as pdf_file:
            pdf_file.write(pdf_byte_stream)
        
        logging.info(f"PDF successfully saved to {output_pdf}")
    except Exception as e:
        logging.error(f"Error saving PDF: {e}")

for index, row in df_scielo_preprint_preChatGPT_en.iterrows():
    try:
        # verify=False disables TLS certificate verification; remove it if the server's certificate validates.
        # The 60-second timeout is an assumed value that keeps a stalled request from blocking the loop.
        article_page = requests.get(row['PDF URL'], verify=False, timeout=60)
        article_page.raise_for_status()  # Treating HTTP errors as failures instead of parsing an error page
        soup = BeautifulSoup(article_page.content, 'lxml')
        # Finding the <a> tag with the class 'download'
        download_link = soup.find('a', class_='download')
        article_pdf_link = download_link.get('href') if download_link else None

        if article_pdf_link:
            article = requests.get(article_pdf_link, verify=False, timeout=60)
            article.raise_for_status()
            # Scraping the article from the byte stream and saving it in the output directory
            scrape_pdf(article.content, f"{output_directory}/{row['Text ID']}.txt")
            # Saving the article in the output directory
            save_pdf(article.content, f"{output_directory}/{row['Text ID']}.pdf")
        else:
            logging.warning(f"No PDF link found for Text ID {row['Text ID']}")
    except Exception as e:
        logging.error(f"Error processing Text ID {row['Text ID']} (row {index}): {e}")
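
The loop above passes the `href` of the download link straight to `requests.get`, which works only when that attribute holds an absolute URL. Whether the archive always serves absolute links is not stated here, so the following is a hedged sketch of a defensive step: resolving the `href` against the page URL with `urllib.parse.urljoin`. The function name and the `example.org` URLs are hypothetical.

```python
from urllib.parse import urljoin

def resolve_pdf_link(page_url, href):
    """Return an absolute URL for href, resolving relative links against page_url."""
    return urljoin(page_url, href)

# Hypothetical URLs, for illustration only
page_url = 'https://example.org/preprint/view/1/2'
absolute = resolve_pdf_link(page_url, 'https://example.org/preprint/download/1/2/3')
relative = resolve_pdf_link(page_url, '/preprint/download/1/2/3')
print(absolute)
print(relative)
```

An absolute `href` passes through unchanged, so the extra call would be harmless if the links are already absolute.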

## Adding the column `Text` with the text extracted from each article

In [10]:
# Function to read the content of a TXT file
def read_txt_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

# Iterate through each row in the DataFrame and add the text content
texts = []
for index, row in df_scielo_preprint_preChatGPT_en.iterrows():
    text_id = row['Text ID']
    txt_file_path = os.path.join(output_directory, f"{text_id}.txt")
    if os.path.exists(txt_file_path):
        text_content = read_txt_file(txt_file_path)
    else:
        text_content = None  # or you can set it to an empty string or any default value
    texts.append(text_content)

# Add the 'Text' column to the DataFrame
df_scielo_preprint_preChatGPT_en['Text'] = texts

In [11]:
df_scielo_preprint_preChatGPT_en

Unnamed: 0,Title,URL,Authors,Published,PDF Language,PDF URL,Submitted,Posted,Text ID,Text
0,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,t000000,Publication status: Preprint has been publishe...
1,Belching as a symptom of COVID-19: Case presen...,https://preprints.scielo.org/index.php/scielo/...,"Nuvia Novoa Acosta, Liuba Yamila Peña Galbán, ...",Submitted 11/14/2022 - Posted 11/29/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-14,2022-11-29,t000001,Publication status: Not informed by the submit...
2,Assembling the perfect bacterial genome using ...,https://preprints.scielo.org/index.php/scielo/...,"Ryan R. Wick, Louise M. Judd, Kathryn E. Holt",Submitted 11/11/2022 - Posted 11/11/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-11,2022-11-11,t000002,Publication status: Preprint has been publishe...
3,ON METHODOLOGY AND METHODS FOR ANALYSING CLASS...,https://preprints.scielo.org/index.php/scielo/...,Leonardo Goncalves Lago,Submitted 11/10/2022 - Posted 11/16/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-10,2022-11-16,t000003,Publication status: Preprint has been publishe...
4,"MOBILIZING LINGUISTIC AND SEMIOTIC RESOURCES, ...",https://preprints.scielo.org/index.php/scielo/...,"Estêvão Cabral, Marylin Martin-Jones",Submitted 11/07/2022 - Posted 11/07/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-07,2022-11-07,t000004,Publication status: Preprint has been submitte...
...,...,...,...,...,...,...,...,...,...,...
356,Information about the new coronavirus disease ...,https://preprints.scielo.org/index.php/scielo/...,Claudio Márcio Amaral de Oliveira Lima,Submitted 04/13/2020 - Posted 04/13/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-13,2020-04-13,t000356,V\nRadiol Bras. 2020 Mar/Abr;53(2):V–VI\n0100-...
357,ACE2 diversity in placental mammals reveals th...,https://preprints.scielo.org/index.php/scielo/...,"Bibiana Sampaio de Oliveira Fam, Pedro Vargas-...",Submitted 04/11/2020 - Posted 04/28/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-11,2020-04-28,t000357,Status: Preprint has been published in a journ...
358,Coronavirus 2: Analysis of Regularity of Compl...,https://preprints.scielo.org/index.php/scielo/...,Yuri Morales-López,Submitted 04/10/2020 - Posted 06/04/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-10,2020-06-04,t000358,Publication status: Preprint has been publishe...
359,MULTIPROFESSIONAL ELECTRONIC PROTOCOL FOR DIGE...,https://preprints.scielo.org/index.php/scielo/...,"Osvaldo Malafaia, Faruk Abrão KALIL-FILHO, Jo...",Submitted 04/09/2020 - Posted 04/09/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-09,2020-04-09,t000359,\n \n \nArtigo Original \n \nVALIDAÇÃO DE PRO...


## Dropping articles that are not actually in English

Each article was inspected visually. The following ones turned out not to be in English and were selected to be discarded.

In [13]:
article_drop_list = [
    't000001',
    't000017',
    't000023',
    't000037',
    't000051',
    't000053',
    't000057',
    't000058',
    't000066',
    't000067',
    't000074',
    't000095',
    't000096',
    't000101',
    't000111',
    't000113',
    't000114',
    't000116',
    't000119',
    't000124',
    't000131',
    't000136',
    't000139',
    't000142',
    't000143',
    't000149',
    't000154',
    't000158',
    't000167',
    't000171',
    't000175',
    't000180',
    't000185',
    't000186',
    't000190',
    't000205',
    't000224',
    't000232',
    't000233',
    't000239',
    't000240',
    't000311',
    't000329',
    't000339',
    't000359'
]

In [14]:
# Creating a boolean mask for the rows whose 'Text ID' is in the list of articles to be dropped
mask = df_scielo_preprint_preChatGPT_en['Text ID'].isin(article_drop_list)

# Confirming the number of rows that are going to be dropped
counts = mask.value_counts()

# Display the counts
counts

Text ID
False    316
True      45
Name: count, dtype: int64
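
`isin` silently ignores identifiers that do not occur in the column, so a typo in the drop list would go unnoticed. A minimal sketch of a check that could precede the drop, written with plain sets; `find_unknown_ids` and the sample values are hypothetical:

```python
def find_unknown_ids(drop_list, known_ids):
    """Return the drop-list entries absent from known_ids, sorted for stable output."""
    return sorted(set(drop_list) - set(known_ids))

# Sample values for illustration only
known_ids = ['t000000', 't000001', 't000002']
drop_list = ['t000001', 't000099']
unknown = find_unknown_ids(drop_list, known_ids)
print(unknown)
```

An empty result would mean that every entry in the drop list matches a row.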

In [15]:
# Retaining the rows whose 'Text ID' is not in the list of articles to be dropped
df_scielo_preprint_preChatGPT_en = df_scielo_preprint_preChatGPT_en[~mask]
df_scielo_preprint_preChatGPT_en = df_scielo_preprint_preChatGPT_en.reset_index(drop=True)

In [16]:
df_scielo_preprint_preChatGPT_en

Unnamed: 0,Title,URL,Authors,Published,PDF Language,PDF URL,Submitted,Posted,Text ID,Text
0,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,t000000,Publication status: Preprint has been publishe...
1,Assembling the perfect bacterial genome using ...,https://preprints.scielo.org/index.php/scielo/...,"Ryan R. Wick, Louise M. Judd, Kathryn E. Holt",Submitted 11/11/2022 - Posted 11/11/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-11,2022-11-11,t000002,Publication status: Preprint has been publishe...
2,ON METHODOLOGY AND METHODS FOR ANALYSING CLASS...,https://preprints.scielo.org/index.php/scielo/...,Leonardo Goncalves Lago,Submitted 11/10/2022 - Posted 11/16/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-10,2022-11-16,t000003,Publication status: Preprint has been publishe...
3,"MOBILIZING LINGUISTIC AND SEMIOTIC RESOURCES, ...",https://preprints.scielo.org/index.php/scielo/...,"Estêvão Cabral, Marylin Martin-Jones",Submitted 11/07/2022 - Posted 11/07/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-07,2022-11-07,t000004,Publication status: Preprint has been submitte...
4,PORTUGUESE AO PÉ DO BERIMBAU: ON CAPOEIRA AS A...,https://preprints.scielo.org/index.php/scielo/...,"Mike Baynham, Jolana Hanusova",Submitted 11/04/2022 - Posted 11/04/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-04,2022-11-04,t000005,Publication status: Preprint has been submitte...
...,...,...,...,...,...,...,...,...,...,...
311,Challenges in the fight against the COVID-19 p...,https://preprints.scielo.org/index.php/scielo/...,Eduardo Alexandrino Servolo Medeiros,Submitted 04/15/2020 - Posted 04/15/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-15,2020-04-15,t000355,*Corresponding author. E-mail: edubalaccih@gma...
312,Information about the new coronavirus disease ...,https://preprints.scielo.org/index.php/scielo/...,Claudio Márcio Amaral de Oliveira Lima,Submitted 04/13/2020 - Posted 04/13/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-13,2020-04-13,t000356,V\nRadiol Bras. 2020 Mar/Abr;53(2):V–VI\n0100-...
313,ACE2 diversity in placental mammals reveals th...,https://preprints.scielo.org/index.php/scielo/...,"Bibiana Sampaio de Oliveira Fam, Pedro Vargas-...",Submitted 04/11/2020 - Posted 04/28/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-11,2020-04-28,t000357,Status: Preprint has been published in a journ...
314,Coronavirus 2: Analysis of Regularity of Compl...,https://preprints.scielo.org/index.php/scielo/...,Yuri Morales-López,Submitted 04/10/2020 - Posted 06/04/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-10,2020-06-04,t000358,Publication status: Preprint has been publishe...


In [17]:
# Redefining the column 'Text ID'
df_scielo_preprint_preChatGPT_en = df_scielo_preprint_preChatGPT_en.drop(columns=['Text ID']) # Dropping the previous one
df_scielo_preprint_preChatGPT_en['Text ID'] = 't' + df_scielo_preprint_preChatGPT_en.index.astype(str).str.zfill(6) # Creating a new one
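
After this cell, the new identifiers no longer match the names of the `.txt` and `.pdf` files saved earlier under the original identifiers. For example, the listings show that the article stored as `t000355` now carries `t000311`. One way to keep the link, offered as a sketch and not as part of the original workflow, is to record a mapping from old to new identifiers before renumbering. `build_id_mapping` is a hypothetical helper.

```python
def build_id_mapping(old_ids):
    """Map each retained original identifier to its new sequential identifier."""
    return {old: 't' + str(new_index).zfill(6) for new_index, old in enumerate(old_ids)}

# Original identifiers of the retained rows, in order (sample values only)
retained = ['t000000', 't000002', 't000003']
mapping = build_id_mapping(retained)
print(mapping['t000002'])
```

Such a mapping could be exported alongside the JSONL file, or used to rename the saved files.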

In [18]:
df_scielo_preprint_preChatGPT_en

Unnamed: 0,Title,URL,Authors,Published,PDF Language,PDF URL,Submitted,Posted,Text,Text ID
0,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,Publication status: Preprint has been publishe...,t000000
1,Assembling the perfect bacterial genome using ...,https://preprints.scielo.org/index.php/scielo/...,"Ryan R. Wick, Louise M. Judd, Kathryn E. Holt",Submitted 11/11/2022 - Posted 11/11/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-11,2022-11-11,Publication status: Preprint has been publishe...,t000001
2,ON METHODOLOGY AND METHODS FOR ANALYSING CLASS...,https://preprints.scielo.org/index.php/scielo/...,Leonardo Goncalves Lago,Submitted 11/10/2022 - Posted 11/16/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-10,2022-11-16,Publication status: Preprint has been publishe...,t000002
3,"MOBILIZING LINGUISTIC AND SEMIOTIC RESOURCES, ...",https://preprints.scielo.org/index.php/scielo/...,"Estêvão Cabral, Marylin Martin-Jones",Submitted 11/07/2022 - Posted 11/07/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-07,2022-11-07,Publication status: Preprint has been submitte...,t000003
4,PORTUGUESE AO PÉ DO BERIMBAU: ON CAPOEIRA AS A...,https://preprints.scielo.org/index.php/scielo/...,"Mike Baynham, Jolana Hanusova",Submitted 11/04/2022 - Posted 11/04/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-04,2022-11-04,Publication status: Preprint has been submitte...,t000004
...,...,...,...,...,...,...,...,...,...,...
311,Challenges in the fight against the COVID-19 p...,https://preprints.scielo.org/index.php/scielo/...,Eduardo Alexandrino Servolo Medeiros,Submitted 04/15/2020 - Posted 04/15/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-15,2020-04-15,*Corresponding author. E-mail: edubalaccih@gma...,t000311
312,Information about the new coronavirus disease ...,https://preprints.scielo.org/index.php/scielo/...,Claudio Márcio Amaral de Oliveira Lima,Submitted 04/13/2020 - Posted 04/13/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-13,2020-04-13,V\nRadiol Bras. 2020 Mar/Abr;53(2):V–VI\n0100-...,t000312
313,ACE2 diversity in placental mammals reveals th...,https://preprints.scielo.org/index.php/scielo/...,"Bibiana Sampaio de Oliveira Fam, Pedro Vargas-...",Submitted 04/11/2020 - Posted 04/28/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-11,2020-04-28,Status: Preprint has been published in a journ...,t000313
314,Coronavirus 2: Analysis of Regularity of Compl...,https://preprints.scielo.org/index.php/scielo/...,Yuri Morales-López,Submitted 04/10/2020 - Posted 06/04/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-10,2020-06-04,Publication status: Preprint has been publishe...,t000314


## Exporting to a file

In [19]:
df_scielo_preprint_preChatGPT_en.to_json(f"{output_directory}/scielo_erpp_pp.jsonl", orient='records', lines=True)