<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 5 - eyamrog

The aim of this phase is to extend the data set to the paragraph level by adding the `section` and `paragraph` categories.

## Required Python packages

- beautifulsoup4
- PyMuPDF
- lxml
- matplotlib
- openpyxl (used by `DataFrame.to_excel` when writing `.xlsx` files)
- pandas
- requests
- tqdm

## Importing the required libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import fitz # PyMuPDF
import re
import pandas as pd
import os
import sys
import logging
from tqdm import tqdm
import matplotlib.pyplot as plt
import json

## Defining input variables

In [2]:
input_directory = 'cl_st1_ph4_eyamrog'
output_directory = 'cl_st1_ph5_eyamrog'

## Creating output directory

In [3]:
# Check if the output directory already exists. If it does, do nothing. If it doesn't exist, create it.
if os.path.exists(output_directory):
    print('Output directory already exists.')
else:
    try:
        os.makedirs(output_directory)
        print('Output directory successfully created.')
    except OSError as e:
        print('Failed to create the directory:', e)
        sys.exit(1)

Output directory successfully created.
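
The same check can be written more compactly. The following is a minimal sketch, assuming a Python version in which `os.makedirs` accepts `exist_ok=True` (3.2 or later); the helper name `ensure_directory` is hypothetical, and the demonstration runs inside a temporary directory so that it does not touch the project folders.

```python
import os
import tempfile


def ensure_directory(path):
    """Create `path` if it is missing; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)


# Demonstration inside a throwaway temporary directory
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'cl_st1_ph5_eyamrog')
    created_first = ensure_directory(target)  # creates the directory
    created_again = ensure_directory(target)  # second call is a no-op
```

Unlike the cell above, this variant does not report whether the directory was newly created, so it suits scripts better than interactive inspection.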


## Extending the data set to the paragraph level

### Importing the data into a DataFrame

In [3]:
df_scielo_preprint_preChatGPT_en = pd.read_json(f'{input_directory}/scielo_chatgpt_erpp_pp.jsonl', lines=True)

In [4]:
df_scielo_preprint_preChatGPT_en.dtypes

Title                    object
URL                      object
Authors                  object
Published                object
PDF Language             object
PDF URL                  object
Submitted                 int64
Posted                    int64
Text                     object
Text ID                  object
Area of Knowledge        object
Text Paragraphs          object
Text Paragraphs Count     int64
Text ChatGPT             object
dtype: object

In [5]:
df_scielo_preprint_preChatGPT_en['Submitted'] = pd.to_datetime(df_scielo_preprint_preChatGPT_en['Submitted'], unit='ms')
df_scielo_preprint_preChatGPT_en['Posted'] = pd.to_datetime(df_scielo_preprint_preChatGPT_en['Posted'], unit='ms')
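
The `Submitted` and `Posted` columns are read as `int64` because, by default, `DataFrame.to_json` stores datetimes as epoch milliseconds, and `read_json` only converts columns whose names look date-like. The sketch below illustrates the conversion on a toy frame; `df_demo` and its two values are illustrative and not taken from the data set.

```python
import pandas as pd

# Toy frame: epoch milliseconds for 2022-11-22 and 2022-11-23 (UTC midnight)
df_demo = pd.DataFrame({
    'Submitted': [1669075200000],
    'Posted': [1669161600000],
})

# Convert each epoch-millisecond column to a datetime column
for column in ['Submitted', 'Posted']:
    df_demo[column] = pd.to_datetime(df_demo[column], unit='ms')
```

After the loop, both columns hold timestamps rather than integers, which is what the cell above does for the full DataFrame.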

In [6]:
df_scielo_preprint_preChatGPT_en

Unnamed: 0,Title,URL,Authors,Published,PDF Language,PDF URL,Submitted,Posted,Text,Text ID,Area of Knowledge,Text Paragraphs,Text Paragraphs Count,Text ChatGPT
0,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,Publication status: Preprint has been publishe...,t000000,Biological Sciences,"(Fern flora of Viçosa, Minas Gerais State, Bra...",29,In this research article focusing on the fern ...
1,Assembling the perfect bacterial genome using ...,https://preprints.scielo.org/index.php/scielo/...,"Ryan R. Wick, Louise M. Judd, Kathryn E. Holt",Submitted 11/11/2022 - Posted 11/11/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-11,2022-11-11,Publication status: Preprint has been publishe...,t000001,Biological Sciences,A perfect bacterial genome assembly is one whe...,31,A flawless bacterial genome assembly is charac...
2,ON METHODOLOGY AND METHODS FOR ANALYSING CLASS...,https://preprints.scielo.org/index.php/scielo/...,Leonardo Goncalves Lago,Submitted 11/10/2022 - Posted 11/16/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-10,2022-11-16,Publication status: Preprint has been publishe...,t000002,Applied Social Sciences,This article presents a theoretical work whose...,65,This article presents a theoretical work with ...
3,"MOBILIZING LINGUISTIC AND SEMIOTIC RESOURCES, ...",https://preprints.scielo.org/index.php/scielo/...,"Estêvão Cabral, Marylin Martin-Jones",Submitted 11/07/2022 - Posted 11/07/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-07,2022-11-07,Publication status: Preprint has been submitte...,t000003,"Linguistic, literature and arts",This paper is based on research of a socioling...,58,This paper presents a sociolinguistic and ethn...
4,PORTUGUESE AO PÉ DO BERIMBAU: ON CAPOEIRA AS A...,https://preprints.scielo.org/index.php/scielo/...,"Mike Baynham, Jolana Hanusova",Submitted 11/04/2022 - Posted 11/04/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-04,2022-11-04,Publication status: Preprint has been submitte...,t000004,"Linguistic, literature and arts",From its historical origins as a resistant and...,123,Originating as a defiant and violently suppres...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
311,Challenges in the fight against the COVID-19 p...,https://preprints.scielo.org/index.php/scielo/...,Eduardo Alexandrino Servolo Medeiros,Submitted 04/15/2020 - Posted 04/15/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-15,2020-04-15,*Corresponding author. E-mail: edubalaccih@gma...,t000311,Health Sciences,We are living the most important pandemic in r...,9,"The ongoing global pandemic, triggered by the ..."
312,Information about the new coronavirus disease ...,https://preprints.scielo.org/index.php/scielo/...,Claudio Márcio Amaral de Oliveira Lima,Submitted 04/13/2020 - Posted 04/13/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-13,2020-04-13,V\nRadiol Bras. 2020 Mar/Abr;53(2):V–VI\n0100-...,t000312,Health Sciences,"Coronavirus is a zoonotic virus, an RNA virus ...",9,The coronavirus is a zoonotic RNA virus belong...
313,ACE2 diversity in placental mammals reveals th...,https://preprints.scielo.org/index.php/scielo/...,"Bibiana Sampaio de Oliveira Fam, Pedro Vargas-...",Submitted 04/11/2020 - Posted 04/28/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-11,2020-04-28,Status: Preprint has been published in a journ...,t000313,Biological Sciences,The recent emergence of SARS-CoV-2 is responsi...,35,The recent emergence of SARS-CoV-2 has led to ...
314,Coronavirus 2: Analysis of Regularity of Compl...,https://preprints.scielo.org/index.php/scielo/...,Yuri Morales-López,Submitted 04/10/2020 - Posted 06/04/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-10,2020-06-04,Publication status: Preprint has been publishe...,t000314,Biological Sciences,This paper shows a technique for the detection...,15,This paper presents a technique for detecting ...


### Checking the maximum number of paragraphs per text

In [7]:
df_scielo_preprint_preChatGPT_en['Text Paragraphs Count'].max()

163

### Obtaining the list of Text IDs

In [8]:
text_id_list = df_scielo_preprint_preChatGPT_en['Text ID'].tolist()

In [9]:
with open(f"{output_directory}/text_ids.txt", 'w', encoding='utf8', newline='\n') as file:
    for text_id in text_id_list:
        file.write(f'{text_id}\n')

### Testing with a data subset

#### Creating the data subset

In [10]:
#df_test = df_scielo_preprint_preChatGPT_en.head(4) # Alternative command
df_test = df_scielo_preprint_preChatGPT_en.iloc[:4]
df_test = df_test.reset_index(drop=True)

In [11]:
df_test

Unnamed: 0,Title,URL,Authors,Published,PDF Language,PDF URL,Submitted,Posted,Text,Text ID,Area of Knowledge,Text Paragraphs,Text Paragraphs Count,Text ChatGPT
0,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,Publication status: Preprint has been publishe...,t000000,Biological Sciences,"(Fern flora of Viçosa, Minas Gerais State, Bra...",29,In this research article focusing on the fern ...
1,Assembling the perfect bacterial genome using ...,https://preprints.scielo.org/index.php/scielo/...,"Ryan R. Wick, Louise M. Judd, Kathryn E. Holt",Submitted 11/11/2022 - Posted 11/11/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-11,2022-11-11,Publication status: Preprint has been publishe...,t000001,Biological Sciences,A perfect bacterial genome assembly is one whe...,31,A flawless bacterial genome assembly is charac...
2,ON METHODOLOGY AND METHODS FOR ANALYSING CLASS...,https://preprints.scielo.org/index.php/scielo/...,Leonardo Goncalves Lago,Submitted 11/10/2022 - Posted 11/16/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-10,2022-11-16,Publication status: Preprint has been publishe...,t000002,Applied Social Sciences,This article presents a theoretical work whose...,65,This article presents a theoretical work with ...
3,"MOBILIZING LINGUISTIC AND SEMIOTIC RESOURCES, ...",https://preprints.scielo.org/index.php/scielo/...,"Estêvão Cabral, Marylin Martin-Jones",Submitted 11/07/2022 - Posted 11/07/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-07,2022-11-07,Publication status: Preprint has been submitte...,t000003,"Linguistic, literature and arts",This paper is based on research of a socioling...,58,This paper presents a sociolinguistic and ethn...


#### Assigning section and paragraph information to each paragraph according to the definitions of the dictionary `section_paragraph_mapping`

In [12]:
# Setting up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', handlers=[
    logging.FileHandler(f"{output_directory}/section_paragraph_test.log"),
    logging.StreamHandler()
])

# Reading the JSON file containing the dictionary 'section_paragraph_mapping'
with open(f"{output_directory}/section_paragraph_mapping.json", 'r') as json_file:
    section_paragraph_mapping = json.load(json_file)

# Initialising a list to store the new DataFrame rows
new_rows = []

# Iterating through the existing DataFrame with progress tracking
for index, row in tqdm(df_test.iterrows(), total=len(df_test), desc="Processing rows"):
    text_id = row['Text ID']
    paragraphs = row['Text Paragraphs'].split('\n')
    paragraph_index = 0

    if text_id in section_paragraph_mapping:
        sections = section_paragraph_mapping[text_id]
        dict_paragraph_count = sum(len(names) for names in sections.values())  # 'names' avoids shadowing the 'paragraphs' list
        
        if dict_paragraph_count > len(paragraphs):
            logging.info(f"Text ID: {text_id} - Dictionary has more paragraphs ({dict_paragraph_count}) than text ({len(paragraphs)})")
        elif dict_paragraph_count < len(paragraphs):
            logging.info(f"Text ID: {text_id} - Text has more paragraphs ({len(paragraphs)}) than dictionary ({dict_paragraph_count}). The remaining paragraphs in the text will be left unprocessed.")
        else:
            logging.info(f"Text ID: {text_id} - Dictionary and text have the same number of paragraphs ({len(paragraphs)})")
        
        for section, paragraph_names in sections.items():
            for paragraph_name in paragraph_names:
                if paragraph_index < len(paragraphs):
                    paragraph_text = paragraphs[paragraph_index]
                    new_rows.append({
                        'Text ID': text_id,
                        'Section': section,
                        'Paragraph': paragraph_name,
                        'Text Paragraph': paragraph_text
                    })
                    paragraph_index += 1
                else:
                    logging.warning(f"Not enough paragraphs in text for {text_id}, section: {section}, paragraph: {paragraph_name}")
                    break
            if paragraph_index >= len(paragraphs):
                break
    else:
        logging.warning(f"{text_id} not found in section_paragraph_mapping")

# Creating the new DataFrame
df_new_test = pd.DataFrame(new_rows)

# Merging the new DataFrame with the existing DataFrame on 'Text ID'
df_merged_test = pd.merge(df_test.drop(columns=['Text', 'Text Paragraphs Count', 'Text Paragraphs', 'Text ChatGPT']), df_new_test, on='Text ID', how='right')

logging.info("Merging of DataFrames complete.")

Processing rows:   0%|          | 0/4 [00:00<?, ?it/s]
2025-04-22 06:13:36,585 - INFO - Text ID: t000000 - Dictionary and text have the same number of paragraphs (29)
2025-04-22 06:13:36,587 - INFO - Text ID: t000001 - Dictionary and text have the same number of paragraphs (31)
2025-04-22 06:13:36,588 - INFO - Text ID: t000002 - Dictionary and text have the same number of paragraphs (65)
2025-04-22 06:13:36,590 - INFO - Text ID: t000003 - Dictionary and text have the same number of paragraphs (58)
Processing rows: 100%|██████████| 4/4 [00:00<00:00, 354.74it/s]
2025-04-22 06:13:36,614 - INFO - Merging of DataFrames complete.
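
The cell above implies that `section_paragraph_mapping` maps each Text ID to a dictionary of section names, each holding a list of paragraph labels. The sketch below reproduces the assignment on toy data; the Text ID `t999999`, the section names, and the paragraph texts are all hypothetical. It relies on dictionaries preserving insertion order (Python 3.7 or later), which is also what keeps the sections in document order after `json.load`.

```python
# Hypothetical mapping for one text: section name -> list of paragraph labels
toy_mapping = {
    't999999': {
        'Abstract': ['Paragraph 1'],
        'Introduction': ['Paragraph 1', 'Paragraph 2'],
    }
}
toy_paragraphs = 'First.\nSecond.\nThird.'.split('\n')

# Flatten the mapping into (section, label) pairs in document order
labels = [
    (section, name)
    for section, names in toy_mapping['t999999'].items()
    for name in names
]

# zip stops at the shorter sequence, so surplus labels or paragraphs are ignored
rows = []
for (section, name), text in zip(labels, toy_paragraphs):
    rows.append({
        'Text ID': 't999999',
        'Section': section,
        'Paragraph': name,
        'Text Paragraph': text,
    })
```

Pairing with `zip` gives the same rows as the index-based loop when the counts match; the notebook's version additionally logs any mismatch, which `zip` would silently hide.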


In [13]:
df_merged_test

Unnamed: 0,Title,URL,Authors,Published,PDF Language,PDF URL,Submitted,Posted,Text ID,Area of Knowledge,Section,Paragraph,Text Paragraph
0,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,t000000,Biological Sciences,Abstract,Paragraph 1,"(Fern flora of Viçosa, Minas Gerais State, Bra..."
1,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,t000000,Biological Sciences,"Introduction, Literature Review",Paragraph 1,At the end of the era in which the plant Class...
2,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,t000000,Biological Sciences,"Introduction, Literature Review",Paragraph 2,Brazil is represented by Dennstaedtiaceae with...
3,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,t000000,Biological Sciences,"Introduction, Literature Review",Paragraph 3,Pteridium is spread all across the globe (exce...
4,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,t000000,Biological Sciences,"Introduction, Literature Review",Paragraph 4,"In the State of Minas Gerais, Dennstaedtiaceae..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
178,"MOBILIZING LINGUISTIC AND SEMIOTIC RESOURCES, ...",https://preprints.scielo.org/index.php/scielo/...,"Estêvão Cabral, Marylin Martin-Jones",Submitted 11/07/2022 - Posted 11/07/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-07,2022-11-07,t000003,"Linguistic, literature and arts",Conclusion,Paragraph 2,"Second, an orientation to national identity is..."
179,"MOBILIZING LINGUISTIC AND SEMIOTIC RESOURCES, ...",https://preprints.scielo.org/index.php/scielo/...,"Estêvão Cabral, Marylin Martin-Jones",Submitted 11/07/2022 - Posted 11/07/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-07,2022-11-07,t000003,"Linguistic, literature and arts",Conclusion,Paragraph 3,"Third, the case of the Timorese migrants in No..."
180,"MOBILIZING LINGUISTIC AND SEMIOTIC RESOURCES, ...",https://preprints.scielo.org/index.php/scielo/...,"Estêvão Cabral, Marylin Martin-Jones",Submitted 11/07/2022 - Posted 11/07/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-07,2022-11-07,t000003,"Linguistic, literature and arts",Conclusion,Paragraph 4,"Fourth, the dual focus on mobility and on the ..."
181,"MOBILIZING LINGUISTIC AND SEMIOTIC RESOURCES, ...",https://preprints.scielo.org/index.php/scielo/...,"Estêvão Cabral, Marylin Martin-Jones",Submitted 11/07/2022 - Posted 11/07/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-07,2022-11-07,t000003,"Linguistic, literature and arts",Conclusion,Paragraph 5,"And, finally, undertaking detailed ethnographi..."


In [14]:
df_merged_test.at[0, 'Text Paragraph']

'(Fern flora of Viçosa, Minas Gerais State, Brazil: Dennstaedtiaceae, Lindsaeaceae and Saccolomataceae). As part of an ongoing project treating the ferns and lycophytes from the region of Viçosa, Minas Gerais State, Brazil, we here present the taxonomic treatment of the early divergent lineages of the leptosporangiate ferns: the families Dennstaedtiaceae, Lindsaeaceae and Saccolomataceae. We have been sampling the remnant forest patches since 2012; we also fully reviewed the collection of herbarium VIC and other online collections: F, IAN, NY, PH, RB, U, UC, UPCB, US, and WTU. In the region of Viçosa, six taxa belonging to those families occur Dennstaedtia cicutaria and Pteridium esculentum subsp. arachnoideum (Dennstaedtiaceae); Lindsaea lancea var. lancea, L. quadrangularis subsp. quadrangularis, and L. stricta var. stricta (Lindsaeaceae); and Saccoloma elegans (Saccolomataceae). Among these taxa, only L. quadrangularis subsp. quadrangularis and S. elegans are endemic to the Brazilia

In [15]:
df_merged_test.at[1, 'Text Paragraph']

'At the end of the era in which the plant Classification Systems were mostly based on morphological data, Dennstaedtiaceae was considered a big family, comprehending three subfamilies (or tribes) and about 17 genera (Tryon & Tryon 1982, Kramer 1990). With the advent of molecular data, those subfamilies were raised to family-level, some families were created, and now the early- diverging leptosporangiate ferns are represented by Cystodiaceae (one genus), Dennstaedtiaceae with a stricter circumscription (11-12 genera), Lindsaeaceae (seven genera), Lonchitidaceae (one genus), Saccolomataceae (one or two genera), as well as the great Pteridaceae (+50 genera) (Smith et al. 2006, PPG I 2016, Shang et al. 2018, Schwartsburd et al. 2020).'

In [16]:
df_new_test.head(30)

Unnamed: 0,Text ID,Section,Paragraph,Text Paragraph
0,t000000,Abstract,Paragraph 1,"(Fern flora of Viçosa, Minas Gerais State, Bra..."
1,t000000,"Introduction, Literature Review",Paragraph 1,At the end of the era in which the plant Class...
2,t000000,"Introduction, Literature Review",Paragraph 2,Brazil is represented by Dennstaedtiaceae with...
3,t000000,"Introduction, Literature Review",Paragraph 3,Pteridium is spread all across the globe (exce...
4,t000000,"Introduction, Literature Review",Paragraph 4,"In the State of Minas Gerais, Dennstaedtiaceae..."
5,t000000,Methodology,Paragraph 1,"In the region of Viçosa, Minas Gerais State, B..."
6,t000000,Methodology,Paragraph 2,We have been sampling the remnant forest patch...
7,t000000,"Results, Discussion, Conclusion",Paragraph 1,"Dennstaedtiaceae, Lindsaeaceae, and Saccolomat..."
8,t000000,"Results, Discussion, Conclusion",Paragraph 2,"Rhizomes generaly long-creeping, solenostelic ..."
9,t000000,"Results, Discussion, Conclusion",Paragraph 3,"The family is cosmopolitan, composed of eleven..."


In [17]:
df_new_test.iloc[124:183]

Unnamed: 0,Text ID,Section,Paragraph,Text Paragraph
124,t000002,"Discussion, Conclusion",Paragraph 7,"Here, I argue for a complementarity stance mai..."
125,t000003,Abstract,Paragraph 1,This paper is based on research of a socioling...
126,t000003,Introduction,Paragraph 1,"This paper draws on research, of a sociolingui..."
127,t000003,Introduction,Paragraph 2,"In this paper, we focus on one particular life..."
128,t000003,Introduction,Paragraph 3,Our research revealed the situated and multifa...
129,t000003,Introduction,Paragraph 4,The orienting theories for our research come f...
130,t000003,Literature Review,Paragraph 1,The last decade of the twentieth century and t...
131,t000003,Literature Review,Paragraph 2,By the first decade of the twenty first centur...
132,t000003,Literature Review,Paragraph 3,"For most of the twentieth century, the dominan..."
133,t000003,Literature Review,Paragraph 4,Along with this critique of essentialised noti...


#### Exporting to a file

In [18]:
df_merged_test.to_json(f"{output_directory}/scielo_erpp_pp_test.jsonl", orient='records', lines=True)

#### Revising the paragraphs with ChatGPT

The programme `cl_st1_ph5_test_eyamrog.py` was adapted to run from the command line, which is more efficient when running on an AWS EC2 instance.

- Update the `.env` file with the actual OpenAI API key
- Activate the corresponding Python environment
- Run it as follows:

```
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_eyamrog$ nohup python cl_st1_ph5_test_eyamrog.py cl_st1_ph5_eyamrog cl_st1_ph5_eyamrog &
```
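
The command passes two positional arguments, which match the values of `input_directory` and `output_directory` in that order. The sketch below shows one way a command-line script could read such arguments and an API key; it is not the code of `cl_st1_ph5_test_eyamrog.py`. The function names and the assumption that the `.env` file exposes the key as the environment variable `OPENAI_API_KEY` are hypothetical.

```python
import argparse
import os


def parse_arguments(argv):
    """Parse the two positional directories passed on the command line."""
    parser = argparse.ArgumentParser(description='Revise paragraphs (sketch).')
    parser.add_argument('input_directory')
    parser.add_argument('output_directory')
    return parser.parse_args(argv)


def get_api_key(environ=os.environ):
    """Return the API key, or raise a clear error if it is not set."""
    key = environ.get('OPENAI_API_KEY', '')  # assumed variable name
    if not key:
        raise RuntimeError('OPENAI_API_KEY is not set; check the .env file.')
    return key


# Simulated invocation with the arguments shown in the command above
args = parse_arguments(['cl_st1_ph5_eyamrog', 'cl_st1_ph5_eyamrog'])
api_key = get_api_key({'OPENAI_API_KEY': 'placeholder-key'})
```

Failing early on a missing key is useful under `nohup`, where an error would otherwise only surface later in `nohup.out`.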

### Creating the dictionary `section_paragraph_mapping`

While the dictionary `section_paragraph_mapping` is being built up, use the programme `section_paragraph_mapping_validate.py` to validate it.
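
The validation programme itself is not reproduced in this notebook. As an illustration only, the sketch below shows the kind of checks such a validator might perform, assuming the structure implied by the processing code (Text ID, then section name, then a list of paragraph labels). The function `validate_mapping` and its arguments are hypothetical.

```python
def validate_mapping(mapping, paragraph_counts):
    """Return a list of problems found in a section/paragraph mapping.

    mapping: dict of Text ID -> {section name -> list of paragraph labels}
    paragraph_counts: dict of Text ID -> expected number of paragraphs
    """
    problems = []
    for text_id, sections in mapping.items():
        if not isinstance(sections, dict):
            problems.append(f'{text_id}: sections must be a dictionary')
            continue
        total = 0
        for section, names in sections.items():
            is_list_of_str = isinstance(names, list) and all(
                isinstance(name, str) for name in names
            )
            if not is_list_of_str:
                problems.append(f'{text_id}/{section}: labels must be a list of strings')
                continue
            total += len(names)
        expected = paragraph_counts.get(text_id)
        if expected is not None and total != expected:
            problems.append(f'{text_id}: mapping has {total} paragraphs, text has {expected}')
    return problems


good = {'t999999': {'Abstract': ['Paragraph 1'], 'Introduction': ['Paragraph 1', 'Paragraph 2']}}
bad = {'t999999': {'Abstract': 'Paragraph 1'}}  # a string instead of a list

problems_good = validate_mapping(good, {'t999999': 3})
problems_bad = validate_mapping(bad, {'t999999': 1})
```

The expected counts could be taken from the `Text Paragraphs Count` column, so that mismatches are caught before the assignment step logs them.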

### Assigning section and paragraph information to each paragraph according to the definitions of the dictionary `section_paragraph_mapping`

In [19]:
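# Setting up logging
# force=True (Python 3.8+) replaces the handlers configured for the test run above;
# without it, this second call to basicConfig is ignored and the new log file is never written
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True, handlers=[
    logging.FileHandler(f"{output_directory}/section_paragraph.log"),
    logging.StreamHandler()
])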

# Reading the JSON file containing the dictionary 'section_paragraph_mapping'
with open(f"{output_directory}/section_paragraph_mapping.json", 'r') as json_file:
    section_paragraph_mapping = json.load(json_file)

# Initialising a list to store the new DataFrame rows
new_rows = []

# Iterating through the existing DataFrame with progress tracking
for index, row in tqdm(df_scielo_preprint_preChatGPT_en.iterrows(), total=len(df_scielo_preprint_preChatGPT_en), desc="Processing rows"):
    text_id = row['Text ID']
    paragraphs = row['Text Paragraphs'].split('\n')
    paragraph_index = 0

    if text_id in section_paragraph_mapping:
        sections = section_paragraph_mapping[text_id]
        dict_paragraph_count = sum(len(names) for names in sections.values())  # 'names' avoids shadowing the 'paragraphs' list
        
        if dict_paragraph_count > len(paragraphs):
            logging.info(f"Text ID: {text_id} - Dictionary has more paragraphs ({dict_paragraph_count}) than text ({len(paragraphs)})")
        elif dict_paragraph_count < len(paragraphs):
            logging.info(f"Text ID: {text_id} - Text has more paragraphs ({len(paragraphs)}) than dictionary ({dict_paragraph_count}). The remaining paragraphs in the text will be left unprocessed.")
        else:
            logging.info(f"Text ID: {text_id} - Dictionary and text have the same number of paragraphs ({len(paragraphs)})")
        
        for section, paragraph_names in sections.items():
            for paragraph_name in paragraph_names:
                if paragraph_index < len(paragraphs):
                    paragraph_text = paragraphs[paragraph_index]
                    new_rows.append({
                        'Text ID': text_id,
                        'Section': section,
                        'Paragraph': paragraph_name,
                        'Text Paragraph': paragraph_text
                    })
                    paragraph_index += 1
                else:
                    logging.warning(f"Not enough paragraphs in text for {text_id}, section: {section}, paragraph: {paragraph_name}")
                    break
            if paragraph_index >= len(paragraphs):
                break
    else:
        logging.warning(f"{text_id} not found in section_paragraph_mapping")

# Creating the new DataFrame
df_new = pd.DataFrame(new_rows)

# Merging the new DataFrame with the existing DataFrame on 'Text ID'
df_merged = pd.merge(df_scielo_preprint_preChatGPT_en.drop(columns=['Text', 'Text Paragraphs Count', 'Text Paragraphs', 'Text ChatGPT']), df_new, on='Text ID', how='right')

logging.info("Merging of DataFrames complete.")

Processing rows:   0%|          | 0/316 [00:00<?, ?it/s]
2025-04-22 06:14:13,214 - INFO - Text ID: t000000 - Dictionary and text have the same number of paragraphs (29)
2025-04-22 06:14:13,215 - INFO - Text ID: t000001 - Dictionary and text have the same number of paragraphs (31)
2025-04-22 06:14:13,217 - INFO - Text ID: t000002 - Dictionary and text have the same number of paragraphs (65)
2025-04-22 06:14:13,218 - INFO - Text ID: t000003 - Dictionary and text have the same number of paragraphs (58)
2025-04-22 06:14:13,220 - INFO - Text ID: t000004 - Dictionary and text have the same number of paragraphs (123)
2025-04-22 06:14:13,221 - INFO - Text ID: t000005 - Dictionary and text have the same number of paragraphs (15)
2025-04-22 06:14:13,222 - INFO - Text ID: t000006 - Dictionary and text have the same number of paragraphs (38)
2025-04-22 06:14:13,223 - INFO - Text ID: t000007 - Dictionary and text have the same number of paragraphs (28)
2025-04-22 06:14:13,224 - INFO - Text ID: t0000

In [20]:
df_merged

Unnamed: 0,Title,URL,Authors,Published,PDF Language,PDF URL,Submitted,Posted,Text ID,Area of Knowledge,Section,Paragraph,Text Paragraph
0,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,t000000,Biological Sciences,Abstract,Paragraph 1,"(Fern flora of Viçosa, Minas Gerais State, Bra..."
1,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,t000000,Biological Sciences,"Introduction, Literature Review",Paragraph 1,At the end of the era in which the plant Class...
2,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,t000000,Biological Sciences,"Introduction, Literature Review",Paragraph 2,Brazil is represented by Dennstaedtiaceae with...
3,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,t000000,Biological Sciences,"Introduction, Literature Review",Paragraph 3,Pteridium is spread all across the globe (exce...
4,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,t000000,Biological Sciences,"Introduction, Literature Review",Paragraph 4,"In the State of Minas Gerais, Dennstaedtiaceae..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
11522,COVID-19 in Brazil: advantages of a socialized...,https://preprints.scielo.org/index.php/scielo/...,"Julio Croda, Wanderson Kleber de Oliveira, Ro...",Submitted 04/06/2020 - Posted 04/08/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-06,2020-04-08,t000315,Health Sciences,"Literature Review, Methodology, Results, Discu...",Paragraph 14,"During the Zika epidemic, Brazil led the disco..."
11523,COVID-19 in Brazil: advantages of a socialized...,https://preprints.scielo.org/index.php/scielo/...,"Julio Croda, Wanderson Kleber de Oliveira, Ro...",Submitted 04/06/2020 - Posted 04/08/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-06,2020-04-08,t000315,Health Sciences,"Literature Review, Methodology, Results, Discu...",Paragraph 15,Although Brazil is attempting to implement mea...
11524,COVID-19 in Brazil: advantages of a socialized...,https://preprints.scielo.org/index.php/scielo/...,"Julio Croda, Wanderson Kleber de Oliveira, Ro...",Submitted 04/06/2020 - Posted 04/08/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-06,2020-04-08,t000315,Health Sciences,"Literature Review, Methodology, Results, Discu...",Paragraph 16,"Regarding cultural differences, the use of mas..."
11525,COVID-19 in Brazil: advantages of a socialized...,https://preprints.scielo.org/index.php/scielo/...,"Julio Croda, Wanderson Kleber de Oliveira, Ro...",Submitted 04/06/2020 - Posted 04/08/2020,PDF,https://preprints.scielo.org/index.php/scielo/...,2020-04-06,2020-04-08,t000315,Health Sciences,"Literature Review, Methodology, Results, Discu...",Paragraph 17,Physical distancing is a measure that should b...


#### Detecting empty paragraphs in the column `Text Paragraph`

##### Detecting `NaN` (null value) in the column `Text Paragraph`

In [21]:
print(df_merged['Text Paragraph'].isnull().sum())

0


##### Detecting empty strings in the column `Text Paragraph`

In [22]:
print((df_merged['Text Paragraph'] == '').sum())

8


##### Detecting both `NaN` and empty strings in the column `Text Paragraph`

In [23]:
# isna() catches both None and NaN; the previous 'x is None' test would miss NaN
print((df_merged['Text Paragraph'].isna() | (df_merged['Text Paragraph'] == '')).sum())

8
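
To find out which texts contain the empty paragraphs, the same mask can be grouped by `Text ID`. The sketch below uses a toy frame with hypothetical IDs; treating whitespace-only strings as empty is an additional assumption that the checks above do not make.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Text ID': ['t999990', 't999990', 't999991', 't999991'],
    'Text Paragraph': ['Some text', '', np.nan, '   '],
})

column = df_demo['Text Paragraph']
# Missing values, empty strings and whitespace-only strings all count as empty
is_empty = column.isna() | (column.fillna('').str.strip() == '')

empty_count = int(is_empty.sum())
# Number of empty paragraphs per text
empty_by_text = df_demo[is_empty].groupby('Text ID').size().to_dict()
```

Applied to `df_merged`, `df_merged[is_empty]` would list the affected rows together with their `Section` and `Paragraph` labels, which helps trace them back to the mapping.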


#### Inspecting a few paragraphs

In [24]:
df_merged.at[3411, 'Text Paragraph']

''

In [25]:
df_merged.at[6610, 'Text Paragraph']

'𝐻0 : 𝛽2 = 𝛽3 = 𝛽5 = 𝛽6 = 0 (2) 𝐻1 : 𝛽2 ≠𝛽3 ≠𝛽5 ≠𝛽6 ≠0 (3)'

In [26]:
df_merged.at[2611, 'Text Paragraph']

'Whether the purpose of NATO expansion is seen as an attempt to prevent the formation of a Eurasian alliance (KLARE, 2022), a “justifiable response to the […] entreaties of new Central and Eastern European democracies” (SAROTTE, 2021) that contributed to the frustration of East-West cooperation, or “the most fateful error in the entire post-Cold War era” (KENNAN, 1997) always depends on one’s standpoint; there is no “objective position”. It is a central contention of Bakhtinian dialogism that we do not only engage in talk about discourse, but with discourse, and that a form of dialogical understanding always includes evaluation and response Todorov (1984, p.16). In the continuously changing realm of politics, any analyst not only observes political processes, but also shapes them, so that “decision and standpoint are inseparably bound up together” (MANNHEIM, 1936, p.152). Individual standpoints, also sometimes called researcher bias, thus unavoidably permeate any type of analysis (DAVI

#### Exporting to a file

In [27]:
df_merged.to_json(f"{output_directory}/scielo_erpp_pp.jsonl", orient='records', lines=True)

In [28]:
df_merged.to_excel(f"{output_directory}/scielo_erpp_pp.xlsx")

## Revising the paragraphs with ChatGPT

The programme `cl_st1_ph5_eyamrog.py` was adapted to run from the command line, which is more efficient when running on an AWS EC2 instance.

- Update the `.env` file with the actual OpenAI API key
- Activate the corresponding Python environment
- Run it as follows:

```
(my_env) eyamrog@Rog-ASUS:~/work/cl_st1_eyamrog$ nohup python cl_st1_ph5_eyamrog.py cl_st1_ph5_eyamrog cl_st1_ph5_eyamrog &
```