<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 2 - Phase 4 - eyamrog

The aim of this phase is to design the `QJPP` corpus.

## Required Python packages

- pandas
- numpy

## Import the required libraries

In [1]:
import pandas as pd
import os
import sys
import numpy as np

## Define input variables

In [2]:
input_directory = 'cl_st2_ph2_eyamrog'
output_directory = 'cl_st2_ph4_eyamrog'

## Create output directory

In [3]:
# Check if the output directory already exists. If it does, do nothing. If it doesn't exist, create it.
if os.path.exists(output_directory):
    print('Output directory already exists.')
else:
    try:
        os.makedirs(output_directory)
        print('Output directory successfully created.')
    except OSError as e:
        print('Failed to create the directory:', e)
        sys.exit(1)

Output directory already exists.


### Create output subdirectories

In [4]:
def create_directory(path):
    """Creates a subdirectory if it doesn't exist."""
    if not os.path.exists(path):
        try:
            os.makedirs(path)
            print(f"Successfully created the directory: {path}")
        except OSError as e:
            print(f"Failed to create the {path} directory: {e}")
            sys.exit(1)
    else:
        print(f"Directory already exists: {path}")

## List documents without `Abstract`

Identify documents that may not be `Research Articles` by checking for the absence of an `Abstract` section.

In [5]:
def get_files_without_abstract(directory):
    """Return the IDs (filenames without '.txt') of text files lacking an abstract marker."""
    no_abstract = []

    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            file_path = os.path.join(directory, filename)

            with open(file_path, 'r', encoding='utf-8') as f:
                has_abstract = any(
                    line.strip().startswith('Abstract:') or
                    line.strip().startswith('Section: ABSTRACT')
                    for line in f
                )

            if not has_abstract:
                no_abstract.append(os.path.splitext(filename)[0]) # Remove '.txt'

    return no_abstract
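
The matching rule used above can be exercised in isolation. The following is a minimal sketch, not part of the original notebook: the helper name `has_abstract_marker`, the `demo...` file names and their contents are all hypothetical. It relies on `str.startswith` accepting a tuple of prefixes. Note that the check is case-sensitive, so a line such as `ABSTRACT:` would not count as a marker.

```python
import os
import tempfile


def has_abstract_marker(path):
    """Return True if any line starts with one of the two abstract markers."""
    markers = ('Abstract:', 'Section: ABSTRACT')
    with open(path, 'r', encoding='utf-8') as f:
        # str.startswith accepts a tuple, so both markers are tested at once
        return any(line.strip().startswith(markers) for line in f)


# Build two throwaway files (hypothetical IDs and contents) to exercise the rule
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        'demo000001.txt': 'Title: A\nAbstract: Some summary.\nSection: INTRODUCTION\n',
        'demo000002.txt': 'Title: B\nSection: INTRODUCTION\nBody text.\n',
    }
    for name, text in samples.items():
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
            f.write(text)

    # Collect the IDs of the files in which no marker was found
    flagged = sorted(
        os.path.splitext(name)[0]
        for name in os.listdir(tmp)
        if name.endswith('.txt') and not has_abstract_marker(os.path.join(tmp, name))
    )

print(flagged)  # ['demo000002']
```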

In [6]:
def get_urls_from_ids(df, id_list, id_column='ID', url_column='URL'):
    """
    Given a DataFrame and a list of IDs, return a list of corresponding URLs.

    Parameters:
    - df: pandas DataFrame containing at least 'ID' and 'URL' columns
    - id_list: list of ID values to match
    - id_column: column name in df that contains IDs (default 'ID')
    - url_column: column name in df that contains URLs (default 'URL')

    Returns:
    - List of URLs matching the given IDs
    """
    filtered = df[df[id_column].isin(id_list)]
    return filtered[url_column].tolist()
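
One property of `get_urls_from_ids` is worth keeping in mind when reading its output: because it filters with `isin`, the URLs come back in DataFrame row order rather than in the order of `id_list`, and IDs that are absent from the DataFrame are silently skipped. A small sketch with placeholder data (the `demo...` IDs and `example.org` URLs are hypothetical):

```python
import pandas as pd


def get_urls_from_ids(df, id_list, id_column='ID', url_column='URL'):
    # Same logic as the function defined above
    filtered = df[df[id_column].isin(id_list)]
    return filtered[url_column].tolist()


# Hypothetical metadata; the IDs and URLs are placeholders
df_demo = pd.DataFrame({
    'ID': ['demo000001', 'demo000002', 'demo000003'],
    'URL': ['https://example.org/a', 'https://example.org/b', 'https://example.org/c'],
})

# The requested order is reversed and one ID does not exist in the DataFrame
urls = get_urls_from_ids(df_demo, ['demo000003', 'demo000001', 'demo999999'])
print(urls)  # ['https://example.org/a', 'https://example.org/c']
```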

## QJPP Design

### Define `QJPP Text Count` proportional to `EL2AP Text Count`

In [7]:
qjpp_dict = {
    'Discipline': [
        'Health Sciences',
        'Biological Sciences',
        'Human Sciences',
        'Applied Social Sciences',
        'Linguistic, literature and arts'
    ],
    'EL2AP Text Count': [
        45, 46, 26, 21, 15
    ]
}

In [8]:
df_qjpp_design = pd.DataFrame(qjpp_dict)

In the QJPP reference corpus, the discipline with the lowest text count is `Linguistic, literature and arts`, with 20 texts. The text counts for the other disciplines follow the same proportion of 20 QJPP texts per 15 EL2AP texts, rounded up to the nearest integer.

In [9]:
proportion_factor = 20/15

In [10]:
df_qjpp_design['QJPP Text Count'] = np.ceil(
    df_qjpp_design['EL2AP Text Count'] * proportion_factor
).astype(int)

In [11]:
df_qjpp_design

Unnamed: 0,Discipline,EL2AP Text Count,QJPP Text Count
0,Health Sciences,45,60
1,Biological Sciences,46,62
2,Human Sciences,26,35
3,Applied Social Sciences,21,28
4,"Linguistic, literature and arts",15,20
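
The counts above can be cross-checked with exact rational arithmetic. This is only a verification sketch, not the notebook's method: it swaps the floating-point factor `20/15` for `fractions.Fraction`, so that products landing exactly on an integer (45 × 4/3 = 60, 21 × 4/3 = 28) cannot be affected by rounding error before the ceiling is taken.

```python
import math
from fractions import Fraction

el2ap_counts = [45, 46, 26, 21, 15]

# 20 QJPP texts per 15 EL2AP texts, kept as an exact ratio (4/3)
proportion = Fraction(20, 15)

# Round each proportional count up to the nearest integer
qjpp_counts = [math.ceil(n * proportion) for n in el2ap_counts]
print(qjpp_counts)  # [60, 62, 35, 28, 20]
```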


### Split the text counts among the QJPP journals

#### Map the journals

In [12]:
qjpp_journal_map = {
    'Health Sciences': [
        'Nature Medicine',
        'Annual Review of Public Health',
        'Lancet Public Health',
        'New England Journal of Medicine'
    ],
    'Biological Sciences': [
        'Cell',
        'American Journal of Human Biology'
    ],
    'Human Sciences': [
        'Annual Review of Anthropology',
        'Journal of Human Evolution'
    ],
    'Applied Social Sciences': [
        'Journal of Applied Social Science',
        'Journal of Social Issues',
        'Social Science & Medicine'
    ],
    'Linguistic, literature and arts': [
        'Applied Corpus Linguistics',
        'Journal of English Linguistics'
    ]
}

#### Break down the counts

In [13]:
# Create a list of rows for the new DataFrame
rows = []

for _, row in df_qjpp_design.iterrows():
    discipline = row['Discipline']
    total = row['QJPP Text Count']
    journals = qjpp_journal_map[discipline]
    share = total / len(journals)

    for journal in journals:
        rows.append({
            'Discipline': discipline,
            'Journal': journal,
            'QJPP Text Count': share
        })

# New DataFrame
df_qjpp_design_split = pd.DataFrame(rows)

# Round up to the nearest integer
df_qjpp_design_split['QJPP Text Count'] = np.ceil(df_qjpp_design_split['QJPP Text Count']).astype(int)

df_qjpp_design_split

Unnamed: 0,Discipline,Journal,QJPP Text Count
0,Health Sciences,Nature Medicine,15
1,Health Sciences,Annual Review of Public Health,15
2,Health Sciences,Lancet Public Health,15
3,Health Sciences,New England Journal of Medicine,15
4,Biological Sciences,Cell,31
5,Biological Sciences,American Journal of Human Biology,31
6,Human Sciences,Annual Review of Anthropology,18
7,Human Sciences,Journal of Human Evolution,18
8,Applied Social Sciences,Journal of Applied Social Science,10
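9,Applied Social Sciences,Journal of Social Issues,10
10,Applied Social Sciences,Social Science & Medicine,10
11,"Linguistic, literature and arts",Applied Corpus Linguistics,10
12,"Linguistic, literature and arts",Journal of English Linguistics,10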
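
Because each journal's share is rounded up independently, the journal counts can add up to more than the discipline total: `Human Sciences` yields 18 + 18 = 36 against a target of 35, and `Applied Social Sciences` yields 3 × 10 = 30 against 28. That may well be intended as a safety margin. If the discipline totals were instead meant to be met exactly (an assumption), one alternative is an even split based on `divmod`. The helper `split_evenly` below is hypothetical and not part of the notebook.

```python
def split_evenly(total, n_parts):
    """Split total into n_parts integers that differ by at most 1 and sum to total."""
    base, extra = divmod(total, n_parts)
    # The first `extra` parts receive one additional text
    return [base + 1 if i < extra else base for i in range(n_parts)]


# Discipline totals and number of journals, taken from the design above
design = {
    'Health Sciences': (60, 4),
    'Biological Sciences': (62, 2),
    'Human Sciences': (35, 2),
    'Applied Social Sciences': (28, 3),
    'Linguistic, literature and arts': (20, 2),
}

splits = {name: split_evenly(total, n) for name, (total, n) in design.items()}
for name, parts in splits.items():
    print(name, parts, sum(parts))
```

With this split, `Human Sciences` becomes `[18, 17]` and `Applied Social Sciences` becomes `[10, 9, 9]`, while the other disciplines are unchanged.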


## Health Sciences

### [Nature Medicine](https://www.nature.com/nm/)

#### Create output subdirectory

In [14]:
# 'Nature Medicine'
id = 'natm'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\natm


#### Import the data into a DataFrame

In [15]:
df_nature_medicine_open_access = pd.read_json(f"{input_directory}/nature_medicine_open_access.jsonl", lines=True)

In [16]:
df_nature_medicine_open_access['Published'] = pd.to_datetime(df_nature_medicine_open_access['Published'], unit='ms')
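
The same two-step import is repeated for every journal below. As a self-contained illustration, the sketch here reads two made-up JSON Lines records from memory and converts the `Published` column, assuming (as the `unit='ms'` argument suggests) that it stores Unix epoch milliseconds. The record contents are hypothetical.

```python
from io import StringIO

import pandas as pd

# Two hypothetical JSON Lines records; 'Published' holds epoch milliseconds
jsonl = StringIO(
    '{"ID": "demo000001", "Published": 1656633600000}\n'
    '{"ID": "demo000002", "Published": 1640995200000}\n'
)

# One JSON object per line, hence lines=True
df_demo = pd.read_json(jsonl, lines=True)

# Interpret the integers as milliseconds since 1970-01-01
df_demo['Published'] = pd.to_datetime(df_demo['Published'], unit='ms')
print(df_demo)
```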

#### Identify documents without `Abstract`

The documents are actually `Research Articles` and **should not be excluded**.

In [17]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_nature_medicine_open_access, no_abstract)
no_abstract_urls

['natm000052', 'natm000063']


['https://www.nature.com/articles/s41591-022-01866-4',
 'https://www.nature.com/articles/s41591-022-01837-9']

### [Annual Review of Public Health](https://www.annualreviews.org/content/journals/publhealth)

#### Create output subdirectory

In [18]:
# 'Annual Review of Public Health'
id = 'arph'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\arph


#### Import the data into a DataFrame

In [19]:
df_ar_public_health = pd.read_json(f"{input_directory}/ar_public_health.jsonl", lines=True)

In [20]:
df_ar_public_health['Published'] = pd.to_datetime(df_ar_public_health['Published'], unit='ms')

#### Identify documents without `Abstract`

The documents are `Introductions` and **should be excluded**.

In [21]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_ar_public_health, no_abstract)
no_abstract_urls

['arph000029', 'arph000057']


['https://www.annualreviews.org/content/journals/10.1146/annurev-pu-42-012821-100001',
 'https://www.annualreviews.org/content/journals/10.1146/annurev-pu-41-012720-100001']

### [Lancet Public Health](https://www.thelancet.com/journals/lanpub/home)

#### Create output subdirectory

In [22]:
# 'Lancet Public Health'
id = 'laph'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\laph


#### Import the data into a DataFrame

In [23]:
df_lancet_public_health_open_access = pd.read_json(f"{input_directory}/lancet_public_health_open_access.jsonl", lines=True)

In [24]:
df_lancet_public_health_open_access['Published'] = pd.to_datetime(df_lancet_public_health_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [25]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_lancet_public_health_open_access, no_abstract)
no_abstract_urls

[]


[]

### [New England Journal of Medicine](https://www.nejm.org/)

#### Create output subdirectory

In [7]:
# 'New England Journal of Medicine'
id = 'nejm'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\nejm


#### Import the data into a DataFrame

In [8]:
df_new_england_journal_of_medicine_open_access = pd.read_json(f"{input_directory}/new_england_journal_of_medicine_open_access.jsonl", lines=True)

In [9]:
df_new_england_journal_of_medicine_open_access['Published'] = pd.to_datetime(df_new_england_journal_of_medicine_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

These documents are non-standard `Research Articles`. **They should be excluded**.

In [10]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_new_england_journal_of_medicine_open_access, no_abstract)
no_abstract_urls

['nejm000048', 'nejm000081', 'nejm000216', 'nejm000265', 'nejm000428', 'nejm000540', 'nejm000568', 'nejm000578', 'nejm000619', 'nejm000653', 'nejm000662']


['https://www.nejm.org/doi/full/10.1056/NEJMra1908412',
 'https://www.nejm.org/doi/full/10.1056/NEJMra1901594',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2026131',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2035343',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2030281',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2200583',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2117706',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2106441',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2206573',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2208860',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2200092']

##### Empty files

Many of the documents were empty, and the remaining ones are non-standard `Research Articles`. **The empty documents have been re-scraped and are now adequate**.

```
# PowerShell command
(Get-ChildItem -Path . -File -Recurse -Filter *.txt | Where-Object { $_.Length -eq 0 }).FullName
```
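
For readers working outside PowerShell, a rough Python equivalent of the command above might look as follows. This is a sketch rather than the procedure actually used; the function name `find_empty_txt_files` is hypothetical.

```python
from pathlib import Path


def find_empty_txt_files(root='.'):
    """Return the absolute paths of all zero-byte .txt files under root, recursively."""
    return sorted(
        str(p.resolve())
        for p in Path(root).rglob('*.txt')
        if p.is_file() and p.stat().st_size == 0
    )


# List the empty files below the current working directory
for path in find_empty_txt_files('.'):
    print(path)
```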

##### Documents without `Abstract` that are not empty

These documents are non-standard `Research Articles`. **They should be excluded**.

## Biological Sciences

### [Cell](https://www.cell.com/cell/home)

#### Create output subdirectory

In [32]:
# 'Cell'
id = 'cell'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\cell


#### Import the data into a DataFrame

In [33]:
df_cell_open_access = pd.read_json(f"{input_directory}/cell_open_access.jsonl", lines=True)

In [34]:
df_cell_open_access['Published'] = pd.to_datetime(df_cell_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [35]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_cell_open_access, no_abstract)
no_abstract_urls

[]


[]

##### Retracted

This article has been retracted and **should be excluded**.

In [36]:
retracted = ['cell000085']
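
How the exclusion lists are applied is not shown in this phase. One straightforward possibility, offered only as a sketch, is to drop the listed IDs from the metadata DataFrame with a negated `isin` mask. The DataFrame below is hypothetical; only `cell000085` comes from the notebook.

```python
import pandas as pd

# Hypothetical metadata table; only the 'ID' column matters here
df_demo = pd.DataFrame({'ID': ['cell000084', 'cell000085', 'cell000086']})

retracted = ['cell000085']

# Keep every row whose ID is not in the exclusion list
df_kept = df_demo[~df_demo['ID'].isin(retracted)].reset_index(drop=True)
print(df_kept['ID'].tolist())  # ['cell000084', 'cell000086']
```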

### [American Journal of Human Biology](https://onlinelibrary.wiley.com/journal/15206300)

#### Create output subdirectory

In [37]:
# 'American Journal of Human Biology'
id = 'ajhb'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\ajhb


#### Import the data into a DataFrame

In [38]:
df_american_journal_human_biology_open_access = pd.read_json(f"{input_directory}/american_journal_human_biology_open_access.jsonl", lines=True)

In [39]:
df_american_journal_human_biology_open_access['Published'] = pd.to_datetime(df_american_journal_human_biology_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

The documents are not `Research Articles` and **should be excluded**.

In [40]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_american_journal_human_biology_open_access, no_abstract)
no_abstract_urls

['ajhb000000', 'ajhb000004', 'ajhb000005', 'ajhb000006', 'ajhb000007', 'ajhb000010', 'ajhb000013', 'ajhb000014', 'ajhb000015', 'ajhb000017', 'ajhb000019', 'ajhb000020', 'ajhb000021', 'ajhb000033', 'ajhb000035', 'ajhb000042', 'ajhb000043']


['https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23389',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23407',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23408',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23409',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23430',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23475',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23511',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23474',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23478',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23568',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23562',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23594',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23593',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23607',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23714',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23739',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23740']

## Human Sciences

### [Annual Review of Anthropology](https://www.annualreviews.org/content/journals/anthro)

#### Create output subdirectory

In [41]:
# 'Annual Review of Anthropology'
id = 'aran'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\aran


#### Import the data into a DataFrame

In [42]:
df_ar_anthropology = pd.read_json(f"{input_directory}/ar_anthropology.jsonl", lines=True)

In [43]:
df_ar_anthropology['Published'] = pd.to_datetime(df_ar_anthropology['Published'], unit='ms')

#### Identify documents without `Abstract`

The documents are not `Research Articles` and **should be excluded**.

In [44]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_ar_anthropology, no_abstract)
no_abstract_urls

['aran000000', 'aran000031', 'aran000057']


['https://www.annualreviews.org/content/journals/10.1146/annurev-an-51-082222-100001',
 'https://www.annualreviews.org/content/journals/10.1146/annurev-an-50-081621-100001',
 'https://www.annualreviews.org/content/journals/10.1146/annurev-an-49-081420-100001']

### [Journal of Human Evolution](https://www.sciencedirect.com/journal/journal-of-human-evolution)

#### Create output subdirectory

In [45]:
# 'Journal of Human Evolution'
id = 'jhue'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\jhue


#### Import the data into a DataFrame

In [46]:
df_journal_human_evolution_open_access = pd.read_json(f"{input_directory}/journal_human_evolution_open_access.jsonl", lines=True)

In [47]:
df_journal_human_evolution_open_access['Published'] = pd.to_datetime(df_journal_human_evolution_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [48]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_journal_human_evolution_open_access, no_abstract)
no_abstract_urls

[]


[]

## Applied Social Sciences

### [Journal of Applied Social Science](https://journals.sagepub.com/home/jax)

#### Create output subdirectory

In [49]:
# 'Journal of Applied Social Science'
id = 'jasc'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\jasc


#### Import the data into a DataFrame

In [50]:
df_journal_applied_social_science_open_access = pd.read_json(f"{input_directory}/journal_applied_social_science_open_access.jsonl", lines=True)

In [51]:
df_journal_applied_social_science_open_access['Published'] = pd.to_datetime(df_journal_applied_social_science_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [52]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_journal_applied_social_science_open_access, no_abstract)
no_abstract_urls

[]


[]

### [Journal of Social Issues](https://spssi.onlinelibrary.wiley.com/journal/15404560)

#### Create output subdirectory

In [53]:
# 'Journal of Social Issues'
id = 'jsoi'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\jsoi


#### Import the data into a DataFrame

In [54]:
df_journal_social_issues_open_access = pd.read_json(f"{input_directory}/journal_social_issues_open_access.jsonl", lines=True)

In [55]:
df_journal_social_issues_open_access['Published'] = pd.to_datetime(df_journal_social_issues_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [56]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_journal_social_issues_open_access, no_abstract)
no_abstract_urls

[]


[]

### [Social Science & Medicine](https://www.sciencedirect.com/journal/social-science-and-medicine)

#### Create output subdirectory

In [57]:
# 'Social Science & Medicine'
id = 'socm'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\socm


#### Import the data into a DataFrame

In [58]:
df_social_science_medicine_open_access = pd.read_json(f"{input_directory}/social_science_medicine_open_access.jsonl", lines=True)

In [59]:
df_social_science_medicine_open_access['Published'] = pd.to_datetime(df_social_science_medicine_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

All of the documents are valid `Research Articles` without `Abstract`. **They should not be excluded**.

In [60]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_social_science_medicine_open_access, no_abstract)
no_abstract_urls

['socm000081', 'socm000118', 'socm000131', 'socm000181', 'socm000182', 'socm000203', 'socm000300', 'socm000380', 'socm000539']


['https://www.sciencedirect.com//science/article/pii/S0277953620304548',
 'https://www.sciencedirect.com//science/article/pii/S0277953620307243',
 'https://www.sciencedirect.com//science/article/pii/S0277953620306419',
 'https://www.sciencedirect.com//science/article/pii/S0277953621000162',
 'https://www.sciencedirect.com//science/article/pii/S027795362100023X',
 'https://www.sciencedirect.com//science/article/pii/S0277953620308571',
 'https://www.sciencedirect.com//science/article/pii/S0277953621004469',
 'https://www.sciencedirect.com//science/article/pii/S0277953620307656',
 'https://www.sciencedirect.com//science/article/pii/S0277953621008431']

#### Documents with incomplete scraping

The following documents were scraped incompletely, probably because the scraping programme was not designed to handle their HTML structure. They **should be excluded**.

In [61]:
small_files = ['socm000528', 'socm000550', 'socm000657']

small_files_urls = get_urls_from_ids(df_social_science_medicine_open_access, small_files)
small_files_urls

['https://www.sciencedirect.com//science/article/pii/S0277953622001460',
 'https://www.sciencedirect.com//science/article/pii/S0277953621008212',
 'https://www.sciencedirect.com//science/article/pii/S0277953622005329']
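
The notebook does not say how these three files were identified. One plausible approach, assumed here purely for illustration, is to flag text files whose size falls below some threshold. The function name `find_small_files`, the `max_bytes` default and the demo files are all hypothetical, and a sensible threshold would have to be chosen by inspecting the actual corpus.

```python
import os
import tempfile


def find_small_files(directory, max_bytes=1000):
    """Return IDs of .txt files no larger than max_bytes (an arbitrary threshold)."""
    small = []
    for filename in sorted(os.listdir(directory)):
        file_path = os.path.join(directory, filename)
        if filename.endswith('.txt') and os.path.getsize(file_path) <= max_bytes:
            small.append(os.path.splitext(filename)[0])  # Remove '.txt'
    return small


# Demonstration on throwaway files with hypothetical IDs
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'demo000001.txt'), 'w', encoding='utf-8') as f:
        f.write('Title only\n')        # stands in for a truncated scrape
    with open(os.path.join(tmp, 'demo000002.txt'), 'w', encoding='utf-8') as f:
        f.write('Body text. ' * 500)   # stands in for a complete article
    demo_small = find_small_files(tmp)

print(demo_small)  # ['demo000001']
```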

## Linguistics, literature and arts

### [Applied Corpus Linguistics](https://www.sciencedirect.com/journal/applied-corpus-linguistics)

#### Create output subdirectory

In [62]:
# 'Applied Corpus Linguistics'
id = 'apcl'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\apcl


#### Import the data into a DataFrame

In [63]:
df_applied_corpus_linguistics_open_access = pd.read_json(f"{input_directory}/applied_corpus_linguistics_open_access.jsonl", lines=True)

In [64]:
df_applied_corpus_linguistics_open_access['Published'] = pd.to_datetime(df_applied_corpus_linguistics_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [65]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_applied_corpus_linguistics_open_access, no_abstract)
no_abstract_urls

[]


[]

### [Journal of English Linguistics](https://journals.sagepub.com/home/eng)

#### Create output subdirectory

In [66]:
# 'Journal of English Linguistics'
id = 'jenl'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\jenl


#### Import the data into a DataFrame

In [67]:
df_journal_english_linguistics_open_access = pd.read_json(f"{input_directory}/journal_english_linguistics_open_access.jsonl", lines=True)

In [68]:
df_journal_english_linguistics_open_access['Published'] = pd.to_datetime(df_journal_english_linguistics_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

The documents are not `Research Articles` and **should be excluded**.

In [69]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_journal_english_linguistics_open_access, no_abstract)
no_abstract_urls

['jenl000015', 'jenl000016']


['https://journals.sagepub.com/doi/abs/10.1177/00754242221126692',
 'https://journals.sagepub.com/doi/abs/10.1177/00754242221126533']