<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 2 - eyamrog

The aim of this phase is to scrape the preprint archive of [SciELO](https://scielo.org/) (Scientific Electronic Library Online) and build a table of metadata for each preprint.

- [SciELO Preprints](https://preprints.scielo.org/index.php/scielo)

## Required Python packages

- beautifulsoup4
- lxml
- pandas
- requests

## Importing the required libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import os
import sys

## Defining input variables

In [2]:
output_directory = 'cl_st1_ph2_eyamrog'

## Creating output directory

In [3]:
# Check if the output directory already exists. If it does, do nothing. If it doesn't exist, create it.
if os.path.exists(output_directory):
    print('Output directory already exists.')
else:
    try:
        os.makedirs(output_directory)
        print('Output directory successfully created.')
    except OSError as e:
        print('Failed to create the directory:', e)
        sys.exit(1)

Output directory already exists.
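
The check-then-create step above can also be written with `exist_ok=True`, which makes `os.makedirs` a no-op when the directory is already present. The following is a minimal sketch, not the notebook's own code: the helper name `ensure_directory` is hypothetical, and the demonstration runs in a temporary directory so that nothing is written to the working directory.

```python
import os
import tempfile

def ensure_directory(path):
    """Create `path` if it is missing; return True only when it had to be created (hypothetical helper)."""
    already_there = os.path.isdir(path)
    # exist_ok=True suppresses the error normally raised when the directory exists
    os.makedirs(path, exist_ok=True)
    return not already_there

# Demonstration in a throwaway location
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'cl_st1_ph2_eyamrog')
    created_first_time = ensure_directory(target)   # directory is created here
    created_second_time = ensure_directory(target)  # directory is already in place
    print(created_first_time, created_second_time)
```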


## Scraping `SciELO Preprints` archive

On 17/09/2024, the `SciELO Preprints` archive was listed across 173 web pages.

In [4]:
start_page = 1
end_page = 173

In [None]:
# Initialising an empty list to store the data
data = []

# Iterating through the URLs
for i in range(start_page, end_page + 1):
    url = f'https://preprints.scielo.org/index.php/scielo/preprints/index/{i}'
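    # SciELO's SSL certificate could not be verified ('SSLError: CERTIFICATE_VERIFY_FAILED'),
    # so SSL verification had to be disabled for these requests
    response = requests.get(url, verify=False, timeout=30)
    response.raise_for_status()  # Stop on an HTTP error instead of parsing an error page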
    soup = BeautifulSoup(response.content, 'lxml')
    
    # Find all <div> elements with the class 'obj_article_summary'
    articles = soup.find_all('div', class_='obj_article_summary')
    
    for article in articles:
        # Extract the title
        title_tag = article.find('div', class_='title').find('a')
        title = title_tag.get_text(strip=True)
        title_url = title_tag.get('href')
        
        # Extract the authors
        authors = article.find('div', class_='authors').get_text(strip=True)
        
        # Extract the published date
        published = article.find('div', class_='published').get_text(strip=True)
        
        # Extract the PDF link
        pdf_link_tag = article.find('a', class_='obj_galley_link pdf')
        pdf_link = pdf_link_tag.get('href') if pdf_link_tag else 'No PDF link'
        pdf_text = pdf_link_tag.get_text(strip=True) if pdf_link_tag else 'No PDF text'
        
        # Append the data to the list
        data.append({
            'Title': title,
            'URL': title_url,
            'Authors': authors,
            'Published': published,
            'PDF Language': pdf_text,
            'PDF URL': pdf_link
        })

# Create a pandas DataFrame from the data
df_scielo_preprint = pd.DataFrame(data)
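
The loop above assumes that every summary block contains a title, an authors line and a published line; a block missing one of them would raise an `AttributeError`. The sketch below shows one way to make the extraction tolerant of missing elements. It is an illustration under stated assumptions: the helper names `text_or_default` and `parse_summaries`, the default strings other than the two PDF ones, and the HTML fragment are all made up, and the built-in `html.parser` is used so that the example runs without `lxml`.

```python
from bs4 import BeautifulSoup

def text_or_default(parent, name, class_name, default):
    """Return the stripped text of the first matching child, or `default` if it is absent."""
    tag = parent.find(name, class_=class_name)
    return tag.get_text(strip=True) if tag else default

def parse_summaries(html):
    """Extract one dict per 'obj_article_summary' block (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')  # built-in parser; 'lxml' also works if installed
    rows = []
    for article in soup.find_all('div', class_='obj_article_summary'):
        title_div = article.find('div', class_='title')
        title_tag = title_div.find('a') if title_div else None
        pdf_tag = article.find('a', class_='obj_galley_link pdf')
        rows.append({
            'Title': title_tag.get_text(strip=True) if title_tag else 'No title',
            'URL': title_tag.get('href') if title_tag else 'No URL',
            'Authors': text_or_default(article, 'div', 'authors', 'No authors'),
            'Published': text_or_default(article, 'div', 'published', 'No date'),
            'PDF Language': pdf_tag.get_text(strip=True) if pdf_tag else 'No PDF text',
            'PDF URL': pdf_tag.get('href') if pdf_tag else 'No PDF link',
        })
    return rows

# Made-up fragment: the second entry has no authors and no PDF link
sample_html = """
<div class="obj_article_summary">
  <div class="title"><a href="https://example.org/preprint/1">First title</a></div>
  <div class="authors">A. Author</div>
  <div class="published">Submitted 01/02/2021 - Posted 01/03/2021</div>
  <a class="obj_galley_link pdf" href="https://example.org/preprint/1/pdf">PDF</a>
</div>
<div class="obj_article_summary">
  <div class="title"><a href="https://example.org/preprint/2">Second title</a></div>
  <div class="published">Submitted 02/02/2021 - Posted 02/03/2021</div>
</div>
"""
rows = parse_summaries(sample_html)
for row in rows:
    print(row)
```

Keeping the parsing in a function that takes an HTML string also lets it be checked on a saved page without sending any requests.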

In [6]:
df_scielo_preprint

Unnamed: 0,Title,URL,Authors,Published,PDF Language,PDF URL
0,The sun rises for everyone: popular entreprene...,https://preprints.scielo.org/index.php/scielo/...,Henrique Costa,Submitted 09/16/2024 - Posted 09/17/2024,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...
1,FROM TEXT TO FABRIC: RESONANCES OF GILBERTO FR...,https://preprints.scielo.org/index.php/scielo/...,"Rodrigo Medeiros, Lucas Oliveira",Submitted 09/13/2024 - Posted 09/17/2024,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...
2,Perception of formative and summative assessme...,https://preprints.scielo.org/index.php/scielo/...,"Martha Quiroz Figueroa, Mercedes María Lucas C...",Submitted 09/13/2024 - Posted 09/17/2024,PDF (Español (España)),https://preprints.scielo.org/index.php/scielo/...
3,SCIENTIFIC AND TECHNOLOGICAL LITERACY: CONTRIB...,https://preprints.scielo.org/index.php/scielo/...,"Débora Danieli Pontarollo Gonçalves, Marizete ...",Submitted 09/16/2024 - Posted 09/17/2024,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...
4,COGNITIVE AND METACOGNITIVE STRATEGIES IN DIGI...,https://preprints.scielo.org/index.php/scielo/...,"Fernando Silvio Cavalcante Pimentel, Daniela K...",Submitted 09/14/2024 - Posted 09/17/2024,PDF,https://preprints.scielo.org/index.php/scielo/...
...,...,...,...,...,...,...
3452,SARS-CoV-2: a clinical update,https://preprints.scielo.org/index.php/scielo/...,"Mateus da Silveira Cespedes, José Carlos Souza",Submitted 04/10/2020 - Posted 04/13/2020,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...
3453,MULTIPROFESSIONAL ELECTRONIC PROTOCOL FOR DIGE...,https://preprints.scielo.org/index.php/scielo/...,"Osvaldo Malafaia, Faruk Abrão KALIL-FILHO, Jo...",Submitted 04/09/2020 - Posted 04/09/2020,PDF,https://preprints.scielo.org/index.php/scielo/...
3454,Overview of confirmed cases of COVID-19 in fiv...,https://preprints.scielo.org/index.php/scielo/...,"NILA ALBUQUERQUE, Nathália Pedrosa",Submitted 04/09/2020 - Posted 04/09/2020,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...
3455,COVID-19 in the State of Ceará: behaviors and ...,https://preprints.scielo.org/index.php/scielo/...,"Danilo Lima, Aldo Dias, Renata Rabelo, Igor Cr...",Submitted 04/08/2020 - Posted 04/09/2020,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...


### Removing duplicates

In [7]:
df_scielo_preprint.drop_duplicates(subset='URL', keep='first', inplace=True)
df_scielo_preprint = df_scielo_preprint.reset_index(drop=True)
df_scielo_preprint.shape

(3297, 6)
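
The scraped table ran from index 0 to 3455, that is 3,456 rows, and 3,297 remain after this step, so 159 rows repeated a URL already seen. The same count can be obtained directly with `duplicated` before dropping anything. The sketch below uses a made-up four-row table; the column values are illustrative only.

```python
import pandas as pd

# Made-up data: the third row repeats the URL of the first
toy = pd.DataFrame({
    'Title': ['A', 'B', 'A (repeat)', 'C'],
    'URL': ['u/1', 'u/2', 'u/1', 'u/3'],
})

# duplicated() flags every repeat of a URL after its first occurrence
n_repeats = int(toy.duplicated(subset='URL', keep='first').sum())

# Same call as in the cell above, written without inplace=True
deduplicated = toy.drop_duplicates(subset='URL', keep='first').reset_index(drop=True)

print(n_repeats)
print(deduplicated.shape)
```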

### Initialising the columns `Submitted` and `Posted`

In [8]:
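# Creating the two columns by writing a placeholder date into row 0 (all other rows are left empty);
# every value, including the placeholder, is overwritten by the extraction step that follows
df_scielo_preprint.at[0, 'Submitted'] = '12/31/2024'
df_scielo_preprint.at[0, 'Posted'] = '12/31/2024'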

### Extracting the `Submitted` and `Posted` dates from the column `Published`

In [9]:
# Defining a function to extract dates using RegEx
def extract_dates(published):
    match = re.search(r'Submitted (\d{2}/\d{2}/\d{4}) - Posted (\d{2}/\d{2}/\d{4})', published)
    if match:
        return match.group(1), match.group(2)
    return None, None

# Applying the function to the 'Published' column and creating the new columns
df_scielo_preprint['Submitted'], df_scielo_preprint['Posted'] = zip(*df_scielo_preprint['Published'].apply(extract_dates))

In [10]:
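# The dates are written as MM/DD/YYYY, so the format is stated explicitly rather than inferred
df_scielo_preprint['Submitted'] = pd.to_datetime(df_scielo_preprint['Submitted'], format='%m/%d/%Y')
df_scielo_preprint['Posted'] = pd.to_datetime(df_scielo_preprint['Posted'], format='%m/%d/%Y')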

In [11]:
df_scielo_preprint.dtypes

Title                   object
URL                     object
Authors                 object
Published               object
PDF Language            object
PDF URL                 object
Submitted       datetime64[ns]
Posted          datetime64[ns]
dtype: object
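
The row-wise `extract_dates` function can also be expressed with the vectorised `Series.str.extract`, which returns one column per capture group. The following is a minimal sketch on made-up strings, assuming the same `Submitted MM/DD/YYYY - Posted MM/DD/YYYY` layout; the variable names are illustrative.

```python
import pandas as pd

# Made-up values in the layout of the 'Published' column, plus one that does not match
published = pd.Series([
    'Submitted 09/16/2024 - Posted 09/17/2024',
    'Submitted 04/08/2020 - Posted 04/09/2020',
    'no dates here',
])

# Named groups become the column names of the resulting DataFrame
pattern = r'Submitted (?P<Submitted>\d{2}/\d{2}/\d{4}) - Posted (?P<Posted>\d{2}/\d{2}/\d{4})'
dates = published.str.extract(pattern)

# Convert both columns; non-matching rows hold missing values and become NaT
for column in ['Submitted', 'Posted']:
    dates[column] = pd.to_datetime(dates[column], format='%m/%d/%Y')

print(dates)
```

Rows that do not match the pattern end up as `NaT`, which makes them easy to count with `dates['Submitted'].isna().sum()`.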

In [12]:
df_scielo_preprint

Unnamed: 0,Title,URL,Authors,Published,PDF Language,PDF URL,Submitted,Posted
0,The sun rises for everyone: popular entreprene...,https://preprints.scielo.org/index.php/scielo/...,Henrique Costa,Submitted 09/16/2024 - Posted 09/17/2024,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...,2024-09-16,2024-09-17
1,FROM TEXT TO FABRIC: RESONANCES OF GILBERTO FR...,https://preprints.scielo.org/index.php/scielo/...,"Rodrigo Medeiros, Lucas Oliveira",Submitted 09/13/2024 - Posted 09/17/2024,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...,2024-09-13,2024-09-17
2,Perception of formative and summative assessme...,https://preprints.scielo.org/index.php/scielo/...,"Martha Quiroz Figueroa, Mercedes María Lucas C...",Submitted 09/13/2024 - Posted 09/17/2024,PDF (Español (España)),https://preprints.scielo.org/index.php/scielo/...,2024-09-13,2024-09-17
3,SCIENTIFIC AND TECHNOLOGICAL LITERACY: CONTRIB...,https://preprints.scielo.org/index.php/scielo/...,"Débora Danieli Pontarollo Gonçalves, Marizete ...",Submitted 09/16/2024 - Posted 09/17/2024,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...,2024-09-16,2024-09-17
4,COGNITIVE AND METACOGNITIVE STRATEGIES IN DIGI...,https://preprints.scielo.org/index.php/scielo/...,"Fernando Silvio Cavalcante Pimentel, Daniela K...",Submitted 09/14/2024 - Posted 09/17/2024,PDF,https://preprints.scielo.org/index.php/scielo/...,2024-09-14,2024-09-17
...,...,...,...,...,...,...,...,...
3292,SARS-CoV-2: a clinical update,https://preprints.scielo.org/index.php/scielo/...,"Mateus da Silveira Cespedes, José Carlos Souza",Submitted 04/10/2020 - Posted 04/13/2020,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...,2020-04-10,2020-04-13
3293,MULTIPROFESSIONAL ELECTRONIC PROTOCOL FOR DIGE...,https://preprints.scielo.org/index.php/scielo/...,"Osvaldo Malafaia, Faruk Abrão KALIL-FILHO, Jo...",Submitted 04/09/2020 - Posted 04/09/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-09,2020-04-09
3294,Overview of confirmed cases of COVID-19 in fiv...,https://preprints.scielo.org/index.php/scielo/...,"NILA ALBUQUERQUE, Nathália Pedrosa",Submitted 04/09/2020 - Posted 04/09/2020,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...,2020-04-09,2020-04-09
3295,COVID-19 in the State of Ceará: behaviors and ...,https://preprints.scielo.org/index.php/scielo/...,"Danilo Lima, Aldo Dias, Renata Rabelo, Igor Cr...",Submitted 04/08/2020 - Posted 04/09/2020,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...,2020-04-08,2020-04-09


### Sorting the DataFrame by the column `Submitted` in descending order

In [13]:
df_scielo_preprint = df_scielo_preprint.sort_values(by='Submitted', ascending=False)
df_scielo_preprint = df_scielo_preprint.reset_index(drop=True)

In [14]:
df_scielo_preprint

Unnamed: 0,Title,URL,Authors,Published,PDF Language,PDF URL,Submitted,Posted
0,The sun rises for everyone: popular entreprene...,https://preprints.scielo.org/index.php/scielo/...,Henrique Costa,Submitted 09/16/2024 - Posted 09/17/2024,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...,2024-09-16,2024-09-17
1,Risk of Adverse Health Outcomes in Patients wi...,https://preprints.scielo.org/index.php/scielo/...,"Marcus Vinícius Bolívar Malachias, Sergio Eman...",Submitted 09/16/2024 - Posted 09/16/2024,PDF,https://preprints.scielo.org/index.php/scielo/...,2024-09-16,2024-09-16
2,SCIENTIFIC AND TECHNOLOGICAL LITERACY: CONTRIB...,https://preprints.scielo.org/index.php/scielo/...,"Débora Danieli Pontarollo Gonçalves, Marizete ...",Submitted 09/16/2024 - Posted 09/17/2024,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...,2024-09-16,2024-09-17
3,COGNITIVE AND METACOGNITIVE STRATEGIES IN DIGI...,https://preprints.scielo.org/index.php/scielo/...,"Fernando Silvio Cavalcante Pimentel, Daniela K...",Submitted 09/14/2024 - Posted 09/17/2024,PDF,https://preprints.scielo.org/index.php/scielo/...,2024-09-14,2024-09-17
4,PUBLIC POLICIES AND EVANGELICAL WOMEN IN POLITICS,https://preprints.scielo.org/index.php/scielo/...,Jamille Bezerra,Submitted 09/13/2024 - Posted 09/14/2024,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...,2024-09-13,2024-09-14
...,...,...,...,...,...,...,...,...
3292,Overview of confirmed cases of COVID-19 in fiv...,https://preprints.scielo.org/index.php/scielo/...,"NILA ALBUQUERQUE, Nathália Pedrosa",Submitted 04/09/2020 - Posted 04/09/2020,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...,2020-04-09,2020-04-09
3293,Education before the advance of COVID-19 in Pa...,https://preprints.scielo.org/index.php/scielo/...,Mirta Britez,Submitted 04/09/2020 - Posted 05/13/2020,PDF (Español (España)),https://preprints.scielo.org/index.php/scielo/...,2020-04-09,2020-05-13
3294,MULTIPROFESSIONAL ELECTRONIC PROTOCOL FOR DIGE...,https://preprints.scielo.org/index.php/scielo/...,"Osvaldo Malafaia, Faruk Abrão KALIL-FILHO, Jo...",Submitted 04/09/2020 - Posted 04/09/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-09,2020-04-09
3295,COVID-19 in the State of Ceará: behaviors and ...,https://preprints.scielo.org/index.php/scielo/...,"Danilo Lima, Aldo Dias, Renata Rabelo, Igor Cr...",Submitted 04/08/2020 - Posted 04/09/2020,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...,2020-04-08,2020-04-09


### Exporting to a file

In [15]:
df_scielo_preprint.to_json(f'{output_directory}/scielo_preprint_archive.jsonl', orient='records', lines=True)
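
When the exported file is read back, the `Submitted` and `Posted` columns are not recognised as dates automatically, because `read_json` only auto-converts columns with date-like names. The sketch below shows one way to round-trip the dates, assuming the same `orient='records', lines=True` layout. The toy table and file name are made up, and `date_format='iso'` is an optional choice that writes readable timestamps rather than the epoch milliseconds that pandas has traditionally used by default.

```python
import os
import tempfile
import pandas as pd

# Made-up two-row table with a datetime column
toy = pd.DataFrame({
    'Title': ['A', 'B'],
    'Submitted': pd.to_datetime(['2024-09-16', '2020-04-08']),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.jsonl')
    # One JSON object per line, with dates written as ISO 8601 strings
    toy.to_json(path, orient='records', lines=True, date_format='iso')
    # Name the date columns explicitly so that they are parsed back into datetimes
    restored = pd.read_json(path, lines=True, convert_dates=['Submitted'])

print(restored.dtypes)
```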

## Selecting articles submitted before the advent of ChatGPT

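ChatGPT was released by OpenAI on November 30, 2022. The filter below uses `<=`, so preprints submitted on the release date itself are kept in the selection.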

In [16]:
df_scielo_preprint_preChatGPT = df_scielo_preprint[df_scielo_preprint['Submitted'] <= '2022-11-30']
df_scielo_preprint_preChatGPT = df_scielo_preprint_preChatGPT.reset_index(drop=True)
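
The choice between `<=` and `<` decides whether preprints submitted on the release day are included. The sketch below illustrates the difference on three made-up rows; the stricter variant is shown only for comparison and is not what the cell above does.

```python
import pandas as pd

# Made-up rows: one day before, on, and one day after the release date
toy = pd.DataFrame({
    'Title': ['before', 'release day', 'after'],
    'Submitted': pd.to_datetime(['2022-11-29', '2022-11-30', '2022-12-01']),
})

cutoff = pd.Timestamp('2022-11-30')  # ChatGPT release date, as stated above

on_or_before = toy[toy['Submitted'] <= cutoff]     # same comparison as the cell above
strictly_before = toy[toy['Submitted'] < cutoff]   # hypothetical stricter variant

print(list(on_or_before['Title']))
print(list(strictly_before['Title']))
```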

In [17]:
df_scielo_preprint_preChatGPT

Unnamed: 0,Title,URL,Authors,Published,PDF Language,PDF URL,Submitted,Posted
0,"DIALOGUES BETWEEN EDUCATION, ART AND CULTURAL ...",https://preprints.scielo.org/index.php/scielo/...,"Bruno Nogueira, Daniela Nery Bracchi",Submitted 11/27/2022 - Posted 12/02/2022,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...,2022-11-27,2022-12-02
1,Leadership for the Implementation of the Organ...,https://preprints.scielo.org/index.php/scielo/...,"Verónica Ramírez-Sánchez, Maria del Carmen San...",Submitted 11/27/2022 - Posted 12/01/2022,PDF (Español (España)),https://preprints.scielo.org/index.php/scielo/...,2022-11-27,2022-12-01
2,Social network analysis as a methodology for r...,https://preprints.scielo.org/index.php/scielo/...,Rebecca Vargas-Bolaños,Submitted 11/23/2022 - Posted 12/01/2022,PDF (Español (España)),https://preprints.scielo.org/index.php/scielo/...,2022-11-23,2022-12-01
3,Standardized Learning Assessment in higher edu...,https://preprints.scielo.org/index.php/scielo/...,"Jorge Gutierrez, Luis Alan Acuña Gamboa",Submitted 11/22/2022 - Posted 11/25/2022,PDF (Español (España)),https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-25
4,State of the art on the Social Network Analysi...,https://preprints.scielo.org/index.php/scielo/...,Rebecca Vargas-Bolaños,Submitted 11/22/2022 - Posted 12/01/2022,PDF (Español (España)),https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-12-01
...,...,...,...,...,...,...,...,...
1971,Overview of confirmed cases of COVID-19 in fiv...,https://preprints.scielo.org/index.php/scielo/...,"NILA ALBUQUERQUE, Nathália Pedrosa",Submitted 04/09/2020 - Posted 04/09/2020,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...,2020-04-09,2020-04-09
1972,Education before the advance of COVID-19 in Pa...,https://preprints.scielo.org/index.php/scielo/...,Mirta Britez,Submitted 04/09/2020 - Posted 05/13/2020,PDF (Español (España)),https://preprints.scielo.org/index.php/scielo/...,2020-04-09,2020-05-13
1973,MULTIPROFESSIONAL ELECTRONIC PROTOCOL FOR DIGE...,https://preprints.scielo.org/index.php/scielo/...,"Osvaldo Malafaia, Faruk Abrão KALIL-FILHO, Jo...",Submitted 04/09/2020 - Posted 04/09/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-09,2020-04-09
1974,COVID-19 in the State of Ceará: behaviors and ...,https://preprints.scielo.org/index.php/scielo/...,"Danilo Lima, Aldo Dias, Renata Rabelo, Igor Cr...",Submitted 04/08/2020 - Posted 04/09/2020,PDF (Português (Brasil)),https://preprints.scielo.org/index.php/scielo/...,2020-04-08,2020-04-09


## Selecting the articles in English

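Preprints written in English are usually the ones whose download button is labelled simply `PDF`, with no language suffix such as `PDF (Português (Brasil))` or `PDF (Español (España))`. This is a heuristic rather than a guarantee, so the selection may include some non-English texts and miss some English ones.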

In [18]:
df_scielo_preprint_preChatGPT_en = df_scielo_preprint_preChatGPT[df_scielo_preprint_preChatGPT['PDF Language'] == 'PDF']
df_scielo_preprint_preChatGPT_en = df_scielo_preprint_preChatGPT_en.reset_index(drop=True)
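
Before relying on the bare `PDF` label, it can help to see which labels occur and what language each one names. The sketch below pulls the language out of the parentheses with `Series.str.extract`. The sample labels are taken from the listings above plus the `No PDF text` placeholder, and the variable names are illustrative.

```python
import pandas as pd

labels = pd.Series([
    'PDF (Português (Brasil))',
    'PDF (Español (España))',
    'PDF',
    'No PDF text',
])

# Capture whatever sits inside the outer parentheses; labels without them yield a missing value
language = labels.str.extract(r'^PDF \((.+)\)$', expand=False)

# Rows whose label is exactly 'PDF' are the ones the selection above treats as English
assumed_english = labels == 'PDF'

print(language.tolist())
print(assumed_english.tolist())
```

Running `value_counts()` on the real `PDF Language` column would show how many preprints fall under each label.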

In [19]:
df_scielo_preprint_preChatGPT_en

Unnamed: 0,Title,URL,Authors,Published,PDF Language,PDF URL,Submitted,Posted
0,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23
1,Belching as a symptom of COVID-19: Case presen...,https://preprints.scielo.org/index.php/scielo/...,"Nuvia Novoa Acosta, Liuba Yamila Peña Galbán, ...",Submitted 11/14/2022 - Posted 11/29/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-14,2022-11-29
2,Assembling the perfect bacterial genome using ...,https://preprints.scielo.org/index.php/scielo/...,"Ryan R. Wick, Louise M. Judd, Kathryn E. Holt",Submitted 11/11/2022 - Posted 11/11/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-11,2022-11-11
3,ON METHODOLOGY AND METHODS FOR ANALYSING CLASS...,https://preprints.scielo.org/index.php/scielo/...,Leonardo Goncalves Lago,Submitted 11/10/2022 - Posted 11/16/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-10,2022-11-16
4,"MOBILIZING LINGUISTIC AND SEMIOTIC RESOURCES, ...",https://preprints.scielo.org/index.php/scielo/...,"Estêvão Cabral, Marylin Martin-Jones",Submitted 11/07/2022 - Posted 11/07/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-07,2022-11-07
...,...,...,...,...,...,...,...,...
343,Information about the new coronavirus disease ...,https://preprints.scielo.org/index.php/scielo/...,Claudio Márcio Amaral de Oliveira Lima,Submitted 04/13/2020 - Posted 04/13/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-13,2020-04-13
344,ACE2 diversity in placental mammals reveals th...,https://preprints.scielo.org/index.php/scielo/...,"Bibiana Sampaio de Oliveira Fam, Pedro Vargas-...",Submitted 04/11/2020 - Posted 04/28/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-11,2020-04-28
345,Coronavirus 2: Analysis of Regularity of Compl...,https://preprints.scielo.org/index.php/scielo/...,Yuri Morales-López,Submitted 04/10/2020 - Posted 06/04/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-10,2020-06-04
346,MULTIPROFESSIONAL ELECTRONIC PROTOCOL FOR DIGE...,https://preprints.scielo.org/index.php/scielo/...,"Osvaldo Malafaia, Faruk Abrão KALIL-FILHO, Jo...",Submitted 04/09/2020 - Posted 04/09/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-09,2020-04-09


### Exporting to a file

In [20]:
df_scielo_preprint_preChatGPT_en.to_json(f'{output_directory}/scielo_preprint_preChatGPT_en.jsonl', orient='records', lines=True)