<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 2 - Phase 5 - eyamrog

The aim of this phase is to compile the `QJPP` corpus (Human-Authored Reference Corpus).

## Required Python packages

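- pandas
- nltk
- matplotlib
- openpyxl (or another Excel writer engine supported by pandas, for the `.xlsx` exports)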

## Import the required libraries

In [1]:
import pandas as pd
import os
import sys
import nltk
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt
import random

## Define input variables

In [2]:
input_directory = 'cl_st2_ph4_eyamrog'
output_directory = 'cl_st2_ph5_eyamrog'
files_directory = 'cl_st2_ph2_eyamrog'

## Create output directory

In [3]:
# Check if the output directory already exists. If it does, do nothing. If it doesn't exist, create it.
if os.path.exists(output_directory):
    print('Output directory already exists.')
else:
    try:
        os.makedirs(output_directory)
        print('Output directory successfully created.')
    except OSError as e:
        print('Failed to create the directory:', e)
        sys.exit(1)

Output directory already exists.


## Import the data into a DataFrame

In [4]:
df_qjpp = pd.read_json(f"{input_directory}/df_qjpp.jsonl", lines=True)

In [5]:
df_qjpp['Published'] = pd.to_datetime(df_qjpp['Published'], unit='ms')
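The `unit='ms'` conversion above suggests that `Published` is stored in the JSONL file as epoch milliseconds, which `pd.read_json` does not turn into dates for a column with this name. A minimal sketch of that behaviour, using an invented one-record sample (the title and the `sample_jsonl` and `df_sample` names are assumptions for illustration, not part of the corpus):

```python
import io

import pandas as pd

# Hypothetical one-record JSONL sample; 1666828800000 ms is 2022-10-27 00:00:00 UTC
sample_jsonl = '{"Title": "Example article", "Published": 1666828800000}\n'

df_sample = pd.read_json(io.StringIO(sample_jsonl), lines=True)

# 'Published' is expected to arrive as an integer column, so convert it explicitly
df_sample['Published'] = pd.to_datetime(df_sample['Published'], unit='ms')

print(df_sample['Published'].iloc[0])
```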

## Data Wrangling

### Drop unused columns

In [6]:
df_qjpp = df_qjpp.drop(columns=['Open Access', 'Open Access 1', 'Article Type'])

### Reorder the columns

In [7]:
df_qjpp.columns.tolist()

['Title',
 'URL',
 'Authors',
 'Published',
 'PDF URL',
 'Discipline',
 'Journal',
 'ID',
 'Vol/Issue',
 'DOI']

In [8]:
reordered_columns = [
    'Journal',
    'Title',
    'Authors',
    'Published',
    'Vol/Issue',
    'URL',
    'DOI',
    'PDF URL',
    'Discipline',
    'ID'
]

In [9]:
df_qjpp = df_qjpp[reordered_columns + [col for col in df_qjpp.columns if col not in reordered_columns]]

### Handling missing values

In [10]:
df_qjpp.isna().sum()

Journal        0
Title          0
Authors        0
Published      0
Vol/Issue     22
URL            0
DOI           89
PDF URL       51
Discipline     0
ID             0
dtype: int64

In [11]:
df_qjpp[['Vol/Issue', 'DOI', 'PDF URL']] = df_qjpp[['Vol/Issue', 'DOI', 'PDF URL']].fillna('Not defined')

In [12]:
df_qjpp.isna().sum().sum()

0

### Adding the `Text ID` column

In [13]:
prefix = 't'

In [14]:
df_qjpp['Text ID'] = prefix + df_qjpp.index.astype(str).str.zfill(6)
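The identifier is built from the row index, zero-padded to six digits. A small sketch on a toy DataFrame (the titles and the `df_demo` name are invented for illustration) shows the resulting pattern:

```python
import pandas as pd

# Toy frame standing in for df_qjpp (titles are invented for illustration)
df_demo = pd.DataFrame({'Title': ['First', 'Second', 'Third']})

prefix = 't'
df_demo['Text ID'] = prefix + df_demo.index.astype(str).str.zfill(6)

# The identifiers later name the text files, so they should be unique
assert df_demo['Text ID'].is_unique

print(df_demo['Text ID'].tolist())
# → ['t000000', 't000001', 't000002']
```

Because the identifiers derive from the index, they are consecutive only while the DataFrame keeps its default index; rows filtered out earlier without resetting the index would leave gaps in the numbering.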

### Export into a file

In [15]:
df_qjpp.to_json(f"{output_directory}/df_qjpp1.jsonl", orient='records', lines=True)

In [16]:
df_qjpp.to_excel(f"{output_directory}/df_qjpp1.xlsx", index=False)

### Fetching the text files

#### Import the data into a DataFrame

In [4]:
df_qjpp1 = pd.read_json(f"{output_directory}/df_qjpp1.jsonl", lines=True)

In [5]:
df_qjpp1['Published'] = pd.to_datetime(df_qjpp1['Published'], unit='ms')

#### Fetching the files

**The following cell was disabled to prevent inadvertently overwriting the manually reviewed texts.**

### Manually review the texts and clean them up

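Mark each section of every text with a `Section:` line, using the labels of the following scheme (labels may be combined when an article merges sections, e.g. `Section: Introduction, Literature Review`):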
- Abstract, Introduction, Literature Review, Methodology, Results, Discussion, Conclusion, Acknowledgements

**Only run the remaining notebook sections when the manual revision is done.**

### Break down the texts into sections and paragraphs

In [6]:
# Prepare to collect rows
data = []

# Loop through each 'Text ID' in df_qjpp1
for _, row in df_qjpp1.iterrows():
    text_id = row['Text ID']

    paragraph_count = 0
    section = None
    file_path = os.path.join(output_directory, f"{text_id}.txt")

    if not os.path.isfile(file_path):
        print(f"Missing file: {file_path}")
        continue

    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = ' '.join(line.split())  # Collapse whitespace runs and trim both ends

            if line.startswith('Section:'):
                section_name = line.partition(':')[2].strip()
                # If a section line is blank or only has the colon, we gracefully assign the fallback name 'Undefined Section'
                section = section_name if section_name else 'Undefined Section'
                paragraph_count = 0  # Resetting paragraph count for new section

            elif line:
                paragraph_count += 1
                data.append({
                    'Text ID': text_id,
                    'Section': section,
                    'Paragraph': f"Paragraph {paragraph_count}",
                    'Text Paragraph': line
                })

# Create final DataFrame
df_qjpp1_section_paragraph = pd.DataFrame(data)
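The loop above relies on each reviewed file carrying `Section:` marker lines followed by one paragraph per line. The sketch below applies the same logic to an invented in-memory sample, so the behaviour can be checked without touching the reviewed files (the sample text and the `rows` and `df_sample_paragraphs` names are assumptions for illustration):

```python
import io

import pandas as pd

# Invented sample in the layout the loop expects: 'Section:' marker lines
# followed by one paragraph per line
sample_text = """Section: Abstract
First abstract   paragraph.

Section: Introduction, Literature Review
First introduction paragraph.
Second introduction paragraph.
Section:
Paragraph under an unnamed section.
"""

rows = []
section = None
paragraph_count = 0

for line in io.StringIO(sample_text):
    line = ' '.join(line.split())  # collapse whitespace runs, trim both ends

    if line.startswith('Section:'):
        # An empty label falls back to 'Undefined Section'
        section = line.partition(':')[2].strip() or 'Undefined Section'
        paragraph_count = 0  # numbering restarts in every section
    elif line:
        paragraph_count += 1
        rows.append({
            'Section': section,
            'Paragraph': f"Paragraph {paragraph_count}",
            'Text Paragraph': line,
        })

df_sample_paragraphs = pd.DataFrame(rows)
print(df_sample_paragraphs)
```

Blank lines are skipped, and any paragraph that appears before the first `Section:` marker would be stored with a missing section, which is why the later inspection step drops missing values before listing the labels.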

In [7]:
df_qjpp1_section_paragraph.head(55)

Unnamed: 0,Text ID,Section,Paragraph,Text Paragraph
0,t000000,Abstract,Paragraph 1,The severe acute respiratory syndrome coronavi...
1,t000000,"Introduction, Literature Review",Paragraph 1,"Since its emergence in Wuhan, China, at the en..."
2,t000000,"Introduction, Literature Review",Paragraph 2,"By January 2022, approximately 49% of the glob..."
3,t000000,"Introduction, Literature Review",Paragraph 3,Low-income and lower-middle-income countries h...
4,t000000,"Introduction, Literature Review",Paragraph 4,In high-income countries that have achieved hi...
5,t000000,"Introduction, Literature Review",Paragraph 5,"Through the use of a detailed global model, ma..."
6,t000000,Results,Paragraph 1,"By the end of 2021, nearly 50% of the global p..."
7,t000000,Results,Paragraph 2,We found that increased vaccine sharing would ...
8,t000000,Results,Paragraph 3,The bulk of these benefits are not seen in lat...
9,t000000,Results,Paragraph 4,If the increased infection seen in high-income...


#### Create a dictionary of the `Section` column for inspection

In [8]:
# Create a list of the unique section labels, sorted alphabetically
section_list = sorted(df_qjpp1_section_paragraph['Section'].dropna().unique())

# Create the dictionary with scheme 's1', 's2', etc.
section_dict = {f"s{i+1}": section for i, section in enumerate(section_list)}

# Display the dictionary
section_dict

{'s1': 'Abstract',
 's2': 'Acknowledgements',
 's3': 'Conclusion',
 's4': 'Discussion',
 's5': 'Discussion, Conclusion',
 's6': 'Introduction',
 's7': 'Introduction, Literature Review',
 's8': 'Introduction, Literature Review, Methodology, Results, Discussion',
 's9': 'Introduction, Literature Review, Methodology, Results, Discussion, Conclusion',
 's10': 'Introduction, Literature Review, Results, Discussion, Conclusion',
 's11': 'Literature Review',
 's12': 'Literature Review, Methodology',
 's13': 'Literature Review, Methodology, Results',
 's14': 'Literature Review, Methodology, Results, Discussion',
 's15': 'Literature Review, Methodology, Results, Discussion, Conclusion',
 's16': 'Methodology',
 's17': 'Methodology, Results',
 's18': 'Methodology, Results, Discussion',
 's19': 'Results',
 's20': 'Results, Discussion',
 's21': 'Results, Discussion, Conclusion'}
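The dictionary above can also be checked automatically against the review scheme. The following is only a sketch: `check_section_label` is a hypothetical helper, and it assumes that combined labels are comma-separated and follow the order of the scheme, as all 21 labels listed above do.

```python
# Scheme labels as listed in the manual review instructions
SCHEME = ['Abstract', 'Introduction', 'Literature Review', 'Methodology',
          'Results', 'Discussion', 'Conclusion', 'Acknowledgements']


def check_section_label(label):
    """Return a list of problems found in one (possibly combined) section label."""
    parts = [part.strip() for part in label.split(',')]

    # Flag any component that is not part of the scheme (e.g. a typo)
    problems = [f"unknown label: {part!r}" for part in parts if part not in SCHEME]

    # Assumption: components of a combined label follow the scheme order
    known_positions = [SCHEME.index(part) for part in parts if part in SCHEME]
    if known_positions != sorted(known_positions):
        problems.append('labels out of scheme order')

    return problems


# Invented labels: one valid, one with a typo, one out of order
sample_labels = ['Introduction, Literature Review', 'Results, Methdology', 'Discussion, Results']
report = {label: check_section_label(label) for label in sample_labels}
print(report)
```

In the notebook, the same helper could be applied to the values of `section_dict` to list only the labels that need another look.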

#### Inspect duplicated texts and correct

The following articles had duplicated paragraphs due to web scraping errors. **They were manually corrected**.
- laph000151 - t000062
- aran000005 - t000181
- aran000015 - t000183
- aran000019 - t000184
- aran000023 - t000186
- aran000028 - t000188
- aran000030 - t000189
- aran000035 - t000192
- aran000040 - t000195
- aran000043 - t000197
- aran000054 - t000198
- aran000066 - t000203
- aran000076 - t000205
- aran000078 - t000206

The majority of the errors occurred in the `Annual Reviews` journal.

In [9]:
df_qjpp1_section_paragraph_duplicated = df_qjpp1_section_paragraph[df_qjpp1_section_paragraph.duplicated(subset='Text Paragraph', keep=False)]

In [10]:
df_qjpp1_section_paragraph_duplicated.to_excel(f"{output_directory}/df_qjpp1_section_paragraph_duplicated.xlsx", index=False)
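With `keep=False`, `duplicated` flags every occurrence of a repeated paragraph, including paragraphs that happen to recur in different texts. The sketch below contrasts that with a narrower check restricted to repeats inside a single text; the narrower variant is an assumption offered for comparison, not the method used in this notebook, and the toy data is invented.

```python
import pandas as pd

# Toy data: 'Alpha.' repeats inside t000001, 'Beta.' repeats across two texts
df_demo = pd.DataFrame({
    'Text ID': ['t000001', 't000001', 't000001', 't000002'],
    'Text Paragraph': ['Alpha.', 'Beta.', 'Alpha.', 'Beta.'],
})

# As in the cell above: any paragraph text occurring more than once, in any text
across_texts = df_demo[df_demo.duplicated(subset='Text Paragraph', keep=False)]

# Narrower variant (assumption): repeats within the same text only
within_text = df_demo[df_demo.duplicated(subset=['Text ID', 'Text Paragraph'], keep=False)]

print(len(across_texts), len(within_text))
# → 4 2
```

The broader check is the safer one for manual inspection, since scraping errors can also copy a paragraph from one article into another.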

### Merge `df_qjpp1_section_paragraph` into `df_qjpp1` to obtain `df_qjpp`

In [11]:
df_qjpp = df_qjpp1.merge(df_qjpp1_section_paragraph, on='Text ID', how='left')
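A left merge keeps every metadata row, so a text whose file was missing or empty would survive as a single row with missing `Section`, `Paragraph` and `Text Paragraph` values. A minimal sketch of how such rows could be detected, using invented toy frames (`df_meta`, `df_paragraphs` and the journal names are assumptions for illustration):

```python
import pandas as pd

# Toy metadata: two texts, only the first one has paragraphs
df_meta = pd.DataFrame({
    'Text ID': ['t000000', 't000001'],
    'Journal': ['Journal A', 'Journal B'],
})
df_paragraphs = pd.DataFrame({
    'Text ID': ['t000000', 't000000'],
    'Section': ['Abstract', 'Results'],
    'Paragraph': ['Paragraph 1', 'Paragraph 1'],
    'Text Paragraph': ['Some text.', 'More text.'],
})

# validate='one_to_many' raises if a 'Text ID' is repeated in the metadata
df_merged = df_meta.merge(df_paragraphs, on='Text ID', how='left', validate='one_to_many')

# Texts that ended up without any paragraph
texts_without_paragraphs = df_merged.loc[df_merged['Text Paragraph'].isna(), 'Text ID'].tolist()
print(texts_without_paragraphs)
# → ['t000001']
```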

In [12]:
df_qjpp

Unnamed: 0,Journal,Title,Authors,Published,Vol/Issue,URL,DOI,PDF URL,Discipline,ID,Text ID,Section,Paragraph,Text Paragraph
0,Nature Medicine,Retrospectively modeling the effects of increa...,"Sam Moore, Edward M. Hill, Matt J. Keeling",2022-10-27,Not defined,https://www.nature.com/articles/s41591-022-020...,Not defined,https://www.nature.com/articles/s41591-022-020...,Health Sciences,natm000005,t000000,Abstract,Paragraph 1,The severe acute respiratory syndrome coronavi...
1,Nature Medicine,Retrospectively modeling the effects of increa...,"Sam Moore, Edward M. Hill, Matt J. Keeling",2022-10-27,Not defined,https://www.nature.com/articles/s41591-022-020...,Not defined,https://www.nature.com/articles/s41591-022-020...,Health Sciences,natm000005,t000000,"Introduction, Literature Review",Paragraph 1,"Since its emergence in Wuhan, China, at the en..."
2,Nature Medicine,Retrospectively modeling the effects of increa...,"Sam Moore, Edward M. Hill, Matt J. Keeling",2022-10-27,Not defined,https://www.nature.com/articles/s41591-022-020...,Not defined,https://www.nature.com/articles/s41591-022-020...,Health Sciences,natm000005,t000000,"Introduction, Literature Review",Paragraph 2,"By January 2022, approximately 49% of the glob..."
3,Nature Medicine,Retrospectively modeling the effects of increa...,"Sam Moore, Edward M. Hill, Matt J. Keeling",2022-10-27,Not defined,https://www.nature.com/articles/s41591-022-020...,Not defined,https://www.nature.com/articles/s41591-022-020...,Health Sciences,natm000005,t000000,"Introduction, Literature Review",Paragraph 3,Low-income and lower-middle-income countries h...
4,Nature Medicine,Retrospectively modeling the effects of increa...,"Sam Moore, Edward M. Hill, Matt J. Keeling",2022-10-27,Not defined,https://www.nature.com/articles/s41591-022-020...,Not defined,https://www.nature.com/articles/s41591-022-020...,Health Sciences,natm000005,t000000,"Introduction, Literature Review",Paragraph 4,In high-income countries that have achieved hi...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14382,Corpora,Learner corpus research in New Zealand,"Anna Siyanova-Chanturia, Jean Parkinson, and T...",2022-09-27,"Volume 17, Issue Supplement",https://www.euppublishing.com/doi/full/10.3366...,https://doi.org/10.3366/cor.2022.0250,https://www.euppublishing.com/doi/pdf/10.3366/...,"Linguistic, literature and arts",corp000008,t000299,"Literature Review, Methodology, Results, Discu...",Paragraph 32,"However, compared with the writers of New Zeal..."
14383,Corpora,Learner corpus research in New Zealand,"Anna Siyanova-Chanturia, Jean Parkinson, and T...",2022-09-27,"Volume 17, Issue Supplement",https://www.euppublishing.com/doi/full/10.3366...,https://doi.org/10.3366/cor.2022.0250,https://www.euppublishing.com/doi/pdf/10.3366/...,"Linguistic, literature and arts",corp000008,t000299,"Literature Review, Methodology, Results, Discu...",Paragraph 33,Prommas’s (2020 ) finding of the lower use of ...
14384,Corpora,Learner corpus research in New Zealand,"Anna Siyanova-Chanturia, Jean Parkinson, and T...",2022-09-27,"Volume 17, Issue Supplement",https://www.euppublishing.com/doi/full/10.3366...,https://doi.org/10.3366/cor.2022.0250,https://www.euppublishing.com/doi/pdf/10.3366/...,"Linguistic, literature and arts",corp000008,t000299,"Literature Review, Methodology, Results, Discu...",Paragraph 34,The findings of this mixed-methods study were ...
14385,Corpora,Learner corpus research in New Zealand,"Anna Siyanova-Chanturia, Jean Parkinson, and T...",2022-09-27,"Volume 17, Issue Supplement",https://www.euppublishing.com/doi/full/10.3366...,https://doi.org/10.3366/cor.2022.0250,https://www.euppublishing.com/doi/pdf/10.3366/...,"Linguistic, literature and arts",corp000008,t000299,Conclusion,Paragraph 1,"In this paper, we attempted to highlight the s..."


### Export into a file

In [13]:
df_qjpp.to_json(f"{output_directory}/df_qjpp.jsonl", orient='records', lines=True)

In [14]:
df_qjpp.to_excel(f"{output_directory}/df_qjpp.xlsx", index=False)