<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 1 - Phase 3_1 - eyamrog

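The aim of this phase is to prepare a small test dataset: the first text of the Phase 3 SciELO preprint data plus two manually added texts (`t000001` and `t000002`).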

## Required Python packages

- pandas
- PyMuPDF
- python-docx

## Importing the required libraries

In [1]:
import pandas as pd
import pymupdf
from docx import Document
import re
import os
import sys

## Defining input variables

In [2]:
input_directory = 'cl_st1_ph3_eyamrog'
output_directory = 'cl_st1_ph31_eyamrog'

## Creating output directory

In [3]:
# Check if the output directory already exists. If it does, do nothing. If it doesn't exist, create it.
if os.path.exists(output_directory):
    print('Output directory already exists.')
else:
    try:
        os.makedirs(output_directory)
        print('Output directory successfully created.')
    except OSError as e:
        print('Failed to create the directory:', e)
        sys.exit(1)

Output directory successfully created.
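
A more compact alternative to the check above is sketched below. It relies on the `exist_ok=True` parameter of `os.makedirs`, which makes the call succeed silently when the directory already exists. The directory name `cl_st1_ph31_demo` and the temporary parent directory are hypothetical and serve only to keep the example from touching the real output directory.

```python
import os
import tempfile

# Hypothetical sandbox: work inside a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp_root:
    demo_directory = os.path.join(tmp_root, 'cl_st1_ph31_demo')

    # The first call creates the directory; the second call does nothing and raises no error.
    os.makedirs(demo_directory, exist_ok=True)
    os.makedirs(demo_directory, exist_ok=True)

    created = os.path.isdir(demo_directory)

print(created)
```

The explicit `if`/`else` version used in this notebook remains preferable when a message distinguishing the two cases is wanted.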


## Preparing data for testing

### Importing the data into a DataFrame

In [4]:
df_scielo_preprint_preChatGPT_en = pd.read_json(f'{input_directory}/scielo_erpp_pp.jsonl', lines=True)

In [5]:
df_scielo_preprint_preChatGPT_en.dtypes

Title           object
URL             object
Authors         object
Published       object
PDF Language    object
PDF URL         object
Submitted        int64
Posted           int64
Text ID         object
Text            object
dtype: object

In [6]:
df_scielo_preprint_preChatGPT_en['Submitted'] = pd.to_datetime(df_scielo_preprint_preChatGPT_en['Submitted'], unit='ms')
df_scielo_preprint_preChatGPT_en['Posted'] = pd.to_datetime(df_scielo_preprint_preChatGPT_en['Posted'], unit='ms')
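
The `Submitted` and `Posted` columns arrive as `int64` because the JSONL file stores dates as milliseconds since the Unix epoch. The sketch below illustrates the conversion on a hypothetical two-row DataFrame (`df_demo` and its values are assumptions chosen to mirror the dates of `t000000`).

```python
import pandas as pd

# Hypothetical frame with epoch-millisecond integers, mimicking the 'Submitted' column.
df_demo = pd.DataFrame({'Submitted': [1669075200000, 1669161600000]})

# unit='ms' tells pandas that the integers count milliseconds since 1970-01-01 UTC.
df_demo['Submitted'] = pd.to_datetime(df_demo['Submitted'], unit='ms')

submitted_dates = df_demo['Submitted'].dt.strftime('%Y-%m-%d').tolist()
print(submitted_dates)
```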

### Dropping all texts except the first

In [7]:
df_scielo_preprint_preChatGPT_en = df_scielo_preprint_preChatGPT_en.loc[:0]
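
Note that `.loc` slices by label and includes the end label, so `.loc[:0]` keeps the row labelled `0`. With the default integer index this is the same row that the positional slice `.iloc[:1]` selects. A minimal illustration on a hypothetical DataFrame (`df_demo` is an assumption, not part of the study data):

```python
import pandas as pd

# Hypothetical frame with a default integer index (0, 1, 2).
df_demo = pd.DataFrame({'Text ID': ['t000000', 't000001', 't000002']})

first_by_label = df_demo.loc[:0]      # label-based slice: the end label 0 is included
first_by_position = df_demo.iloc[:1]  # position-based slice: the end position 1 is excluded

print(len(first_by_label), first_by_label.equals(first_by_position))
```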

In [8]:
df_scielo_preprint_preChatGPT_en

Unnamed: 0,Title,URL,Authors,Published,PDF Language,PDF URL,Submitted,Posted,Text ID,Text
0,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,t000000,"(Fern flora of Viçosa, Minas Gerais State, Bra..."


### Including `t000001` manually

Copy `t000001.pdf` into `output_directory` before running the following cell.

In [9]:
def scrape_pdf(pdf_path, output_txt):
    # Opening the PDF file; the context manager closes it when the block ends
    with pymupdf.open(pdf_path) as doc:
        # Extracting the plain text of every page and concatenating it
        text = ''.join(page.get_text() for page in doc)

    # Writing the extracted text to a text file in UTF-8 encoding
    with open(output_txt, 'w', encoding='utf-8') as txt_file:
        txt_file.write(text)

scrape_pdf(f"{output_directory}/t000001.pdf", f"{output_directory}/t000001.txt")

Edit `t000001.txt` manually to ensure one paragraph per line before running the following cell.
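
Part of this manual edit could be pre-processed with a heuristic such as the sketch below. It is not the procedure used in this study: the function `join_wrapped_lines` is hypothetical, and it assumes that paragraphs in the extracted text are separated by blank lines, which PDF extraction does not guarantee. It also rejoins words split by a hyphen at a line break, which would wrongly merge genuine hyphenated compounds that happen to wrap. The result should therefore still be reviewed by hand.

```python
import re

def join_wrapped_lines(raw_text):
    """Collapse hard-wrapped lines into one paragraph per line (heuristic sketch)."""
    # Assumption: paragraphs are separated by at least one blank line.
    blocks = re.split(r'\n\s*\n', raw_text)
    paragraphs = []
    for block in blocks:
        # Rejoin words hyphenated across a line break, e.g. 'analy-\nsis'.
        block = re.sub(r'(\w)-\n(\w)', r'\1\2', block)
        # Replace the remaining line breaks and whitespace runs with a single space.
        block = re.sub(r'\s+', ' ', block).strip()
        if block:
            paragraphs.append(block)
    return '\n'.join(paragraphs)

# Hypothetical sample imitating hard-wrapped PDF text.
sample = 'This article presents a research analy-\nsis of literacy.\n\nA second\nparagraph.'
cleaned = join_wrapped_lines(sample)
print(cleaned)
```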

In [10]:
# Function to read the content of a TXT file
def read_txt_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

data = []

# Append the data to the list
data.append({
    'Title': 'Critical Literacies, Meaning Making and New Epistemological Perspectives',
    'URL': 'https://revistas.unal.edu.co/index.php/male/article/view/10712/',
    'Authors': 'Walkyria Monte Mór',
    'Published': 'Submitted 01/01/2008 - Posted 01/01/2008',
    'PDF Language': 'PDF',
    'PDF URL': 'https://revistas.unal.edu.co/index.php/male/article/view/10712/',
    'Submitted': '2008-01-01',
    'Posted': '2008-01-01',
    'Text ID': 't000001',
    'Text': read_txt_file(f"{output_directory}/t000001.txt")
})

# Convert the data list to a DataFrame
df_new_row = pd.DataFrame(data)

# Append the new row to the existing DataFrame
df_scielo_preprint_preChatGPT_en = pd.concat([df_scielo_preprint_preChatGPT_en, df_new_row], ignore_index=True)

In [11]:
df_scielo_preprint_preChatGPT_en

Unnamed: 0,Title,URL,Authors,Published,PDF Language,PDF URL,Submitted,Posted,Text ID,Text
0,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22 00:00:00,2022-11-23 00:00:00,t000000,"(Fern flora of Viçosa, Minas Gerais State, Bra..."
1,"Critical Literacies, Meaning Making and New Ep...",https://revistas.unal.edu.co/index.php/male/ar...,Walkyria Monte Mór,Submitted 01/01/2008 - Posted 01/01/2008,PDF,https://revistas.unal.edu.co/index.php/male/ar...,2008-01-01,2008-01-01,t000001,This article presents a research analysis in w...


### Including `t000002` manually

Copy `t000002.docx` into `output_directory` before running the following cell.

In [12]:
def scrape_docx(docx_path, output_txt):
    # Opening the DOCX file
    doc = Document(docx_path)

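    # Initialising an empty list to store the text of each paragraph
    text_list = []

    # Iterating through the body paragraphs and extracting their text
    # (python-docx does not include text inside tables in doc.paragraphs)
    for paragraph in doc.paragraphs:
        text_list.append(paragraph.text)

    # Joining the paragraphs with one paragraph per line
    text = '\n'.join(text_list)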
    
    # Writing the extracted text to a text file in UTF-8 encoding
    with open(output_txt, 'w', encoding='utf-8') as txt_file:
        txt_file.write(text)

scrape_docx(f"{output_directory}/t000002.docx", f"{output_directory}/t000002.txt")

Edit `t000002.txt` manually to ensure one paragraph per line before running the following cell.

In [13]:
# read_txt_file() is defined in the cell that includes `t000001` and is reused here

data = []

# Append the data to the list
data.append({
    'Title': 'THE RELEVANCE OF AI-POWERED TOOLS IN THE ENGLISH ACADEMIC WRITING OF BRAZILIAN SCHOLARS IN APPLIED LINGUISTICS AND IN THE VISUALISATION OF RESEARCH DATA',
    'URL': '',
    'Authors': 'Rogério Yamada',
    'Published': 'Submitted 01/06/2023 - Posted 01/06/2023',
    'PDF Language': 'DOCX',
    'PDF URL': '',
    'Submitted': '2023-06-01',
    'Posted': '2023-06-01',
    'Text ID': 't000002',
    'Text': read_txt_file(f"{output_directory}/t000002.txt")
})

# Convert the data list to a DataFrame
df_new_row = pd.DataFrame(data)

# Append the new row to the existing DataFrame
df_scielo_preprint_preChatGPT_en = pd.concat([df_scielo_preprint_preChatGPT_en, df_new_row], ignore_index=True)

In [14]:
df_scielo_preprint_preChatGPT_en['Submitted'] = pd.to_datetime(df_scielo_preprint_preChatGPT_en['Submitted'])
df_scielo_preprint_preChatGPT_en['Posted'] = pd.to_datetime(df_scielo_preprint_preChatGPT_en['Posted'])
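
This second conversion is needed because the manually added rows carry their dates as strings, so concatenating them with the existing datetime column leaves `Submitted` and `Posted` without a datetime dtype (hence the mixed date formats in the earlier output). The sketch below reproduces the effect on two hypothetical single-row DataFrames (`df_existing` and `df_manual` are assumptions).

```python
import pandas as pd

# Hypothetical existing data: 'Submitted' already holds datetime values.
df_existing = pd.DataFrame({'Submitted': pd.to_datetime(['2022-11-22'])})

# Hypothetical manually added row: the date is a plain string.
df_manual = pd.DataFrame({'Submitted': ['2008-01-01']})

df_combined = pd.concat([df_existing, df_manual], ignore_index=True)

# After concatenation the column mixes timestamps and strings.
is_datetime_before = pd.api.types.is_datetime64_any_dtype(df_combined['Submitted'])

# Converting again gives the whole column a datetime dtype.
df_combined['Submitted'] = pd.to_datetime(df_combined['Submitted'])
is_datetime_after = pd.api.types.is_datetime64_any_dtype(df_combined['Submitted'])

print(is_datetime_before, is_datetime_after)
```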

In [15]:
df_scielo_preprint_preChatGPT_en

Unnamed: 0,Title,URL,Authors,Published,PDF Language,PDF URL,Submitted,Posted,Text ID,Text
0,"(Fern flora of Viçosa, Minas Gerais State, Bra...",https://preprints.scielo.org/index.php/scielo/...,"Nelson Túlio Lage Pena, Pedro Bond Schwartsburd",Submitted 11/22/2022 - Posted 11/23/2022,PDF,https://preprints.scielo.org/index.php/scielo/...,2022-11-22,2022-11-23,t000000,"(Fern flora of Viçosa, Minas Gerais State, Bra..."
1,"Critical Literacies, Meaning Making and New Ep...",https://revistas.unal.edu.co/index.php/male/ar...,Walkyria Monte Mór,Submitted 01/01/2008 - Posted 01/01/2008,PDF,https://revistas.unal.edu.co/index.php/male/ar...,2008-01-01,2008-01-01,t000001,This article presents a research analysis in w...
2,THE RELEVANCE OF AI-POWERED TOOLS IN THE ENGLI...,,Rogério Yamada,Submitted 01/06/2023 - Posted 01/06/2023,DOCX,,2023-06-01,2023-06-01,t000002,The recent advent of new-generation Artificial...


### Exporting to a file

In [16]:
df_scielo_preprint_preChatGPT_en.to_json(f"{output_directory}/test_erpp_pp.jsonl", orient='records', lines=True)
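
When the exported file is read back, the date columns are expected to come back as epoch-millisecond integers, as they did at the start of this notebook. The sketch below checks that round trip on a hypothetical one-row DataFrame, using an in-memory string instead of a file. `date_format='epoch'` and `date_unit='ms'` are passed explicitly here on the assumption that they match the defaults used by the call above.

```python
import io
import pandas as pd

# Hypothetical one-row frame standing in for the real dataset.
df_demo = pd.DataFrame({
    'Text ID': ['t000000'],
    'Submitted': pd.to_datetime(['2022-11-22']),
})

# With no path argument, to_json returns the JSON Lines content as a string.
jsonl_text = df_demo.to_json(orient='records', lines=True,
                             date_format='epoch', date_unit='ms')
print(jsonl_text)

# Reading the content back: 'Submitted' is an integer column until converted.
df_back = pd.read_json(io.StringIO(jsonl_text), lines=True)
df_back['Submitted'] = pd.to_datetime(df_back['Submitted'], unit='ms')
restored_dates = df_back['Submitted'].dt.strftime('%Y-%m-%d').tolist()
print(restored_dates)
```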