<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 2 - Phase 4 - eyamrog

The aim of this phase is to design the `QJPP` corpus.

## Required Python packages

- pandas

## Import the required libraries

In [1]:
import pandas as pd
import os
import sys

## Define input variables

In [2]:
input_directory = 'cl_st2_ph2_eyamrog'
output_directory = 'cl_st2_ph4_eyamrog'

## Create output directory

In [3]:
# Create the output directory if it does not already exist; exit on failure.
if os.path.exists(output_directory):
    print('Output directory already exists.')
else:
    try:
        os.makedirs(output_directory)
        print('Output directory successfully created.')
    except OSError as e:
        print('Failed to create the directory:', e)
        sys.exit(1)

Output directory already exists.


### Create output subdirectories

In [4]:
def create_directory(path):
    """Creates a subdirectory if it doesn't exist."""
    if not os.path.exists(path):
        try:
            os.makedirs(path)
            print(f"Successfully created the directory: {path}")
        except OSError as e:
            print(f"Failed to create the {path} directory: {e}")
            sys.exit(1)
    else:
        print(f"Directory already exists: {path}")
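
As a hedged aside, the existence check and the creation step above can be collapsed into a single call when the status messages are not needed. The sketch below assumes a hypothetical `ensure_directory` helper (it is not part of this notebook) and works inside a temporary directory so that it does not touch the project folders. It relies on the `exist_ok=True` argument of `os.makedirs`; other failures, such as missing permissions, would still raise `OSError`.

```python
import os
import tempfile


def ensure_directory(path):
    """Hypothetical compact variant: create `path` if missing, do nothing otherwise."""
    os.makedirs(path, exist_ok=True)
    return path


# Demonstration in a throwaway location; 'cl_demo' and 'natm' are illustrative names.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'cl_demo', 'natm')
    ensure_directory(target)  # creates both directory levels
    ensure_directory(target)  # second call finds the directory and does nothing
    print(os.path.isdir(target))
```

The version used in this notebook remains preferable when the printed confirmation is wanted in the cell output.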

## List documents without `Abstract`

Identify documents that may not be `Research Articles` by checking for the absence of an `Abstract` section.

In [5]:
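def get_files_without_abstract(directory):
    """
    Return the IDs of the '.txt' files in `directory` that lack an abstract.

    A file counts as having an abstract if any line, once stripped, starts
    with 'Abstract:' or 'Section: ABSTRACT'. IDs are filenames without '.txt'.
    """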
    no_abstract = []

    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            file_path = os.path.join(directory, filename)

            with open(file_path, 'r', encoding='utf-8') as f:
                has_abstract = any(
                    line.strip().startswith('Abstract:') or
                    line.strip().startswith('Section: ABSTRACT')
                    for line in f
                )

            if not has_abstract:
                no_abstract.append(os.path.splitext(filename)[0]) # Remove '.txt'

    return no_abstract
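
The line test inside the function above can be tried in isolation. The following is a minimal sketch, assuming a hypothetical `has_abstract_marker` helper and invented sample lines; it applies the same two prefixes to an in-memory list instead of a file. Note that the match is case-sensitive and only considers the start of each stripped line.

```python
# The two prefixes checked by get_files_without_abstract.
ABSTRACT_MARKERS = ('Abstract:', 'Section: ABSTRACT')


def has_abstract_marker(lines):
    """Hypothetical helper: True if any stripped line starts with one of the markers."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return any(line.strip().startswith(ABSTRACT_MARKERS) for line in lines)


# Invented sample texts, for illustration only.
with_abstract = [
    "Title: Example\n",
    "  Abstract: A short summary.\n",
    "Section: INTRODUCTION\n",
]
without_abstract = [
    "Title: Example\n",
    "Section: INTRODUCTION\n",
    "The abstract: is mentioned mid-line.\n",
]

print(has_abstract_marker(with_abstract))     # leading spaces are stripped before the test
print(has_abstract_marker(without_abstract))  # a marker in mid-line does not count
```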

In [6]:
def get_urls_from_ids(df, id_list, id_column='ID', url_column='URL'):
    """
    Given a DataFrame and a list of IDs, return a list of corresponding URLs.

    Parameters:
    - df: pandas DataFrame containing at least 'ID' and 'URL' columns
    - id_list: list of ID values to match
    - id_column: column name in df that contains IDs (default 'ID')
    - url_column: column name in df that contains URLs (default 'URL')

    Returns:
    - List of URLs matching the given IDs
    """
    filtered = df[df[id_column].isin(id_list)]
    return filtered[url_column].tolist()
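
Two properties of this lookup are worth keeping in mind when reading the outputs in the sections that follow. The sketch below uses an invented toy DataFrame (the IDs and `example.org` URLs are placeholders, not corpus data) to show that the `isin` filter returns rows in DataFrame order rather than in the order of `id_list`, and that IDs missing from the DataFrame are dropped silently. The `id_to_url` mapping is a hypothetical variant that keeps each ID next to its URL.

```python
import pandas as pd

# Toy data, invented for illustration; the real DataFrames are loaded from the JSONL files.
df_toy = pd.DataFrame({
    'ID': ['demo000000', 'demo000001', 'demo000002'],
    'URL': ['https://example.org/a', 'https://example.org/b', 'https://example.org/c'],
})

# The last ID does not exist in df_toy.
id_list = ['demo000002', 'demo000000', 'demo000099']

# Same filter as get_urls_from_ids: matching rows come back in DataFrame order.
filtered = df_toy[df_toy['ID'].isin(id_list)]
urls = filtered['URL'].tolist()
print(urls)

# Hypothetical variant: pair each ID with its URL so the correspondence stays explicit.
id_to_url = dict(zip(filtered['ID'], filtered['URL']))
print(id_to_url)
```

If the number of returned URLs is smaller than the number of IDs, at least one ID had no matching row.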

## Health Sciences

### [Nature Medicine](https://www.nature.com/nm/)

#### Create output subdirectory

In [7]:
# 'Nature Medicine'
id = 'natm'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\natm


#### Import the data into a DataFrame

In [8]:
df_nature_medicine_open_access = pd.read_json(f"{input_directory}/nature_medicine_open_access.jsonl", lines=True)

In [9]:
df_nature_medicine_open_access['Published'] = pd.to_datetime(df_nature_medicine_open_access['Published'], unit='ms')
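
The conversion above uses `unit='ms'`, which suggests that the `Published` values in the JSONL files are stored as integers counting milliseconds since the Unix epoch (an assumption based on the code, since the files themselves are not shown here). A minimal sketch with invented values illustrates what the call does:

```python
import pandas as pd

# Invented epoch-millisecond values, for illustration only.
df_demo = pd.DataFrame({'Published': [0, 1_600_000_000_000]})

# Interpret the integers as milliseconds since 1970-01-01 00:00:00 UTC.
df_demo['Published'] = pd.to_datetime(df_demo['Published'], unit='ms')

print(df_demo['Published'].dt.strftime('%Y-%m-%d').tolist())
# ['1970-01-01', '2020-09-13']
```

The same conversion is repeated for each journal's DataFrame in the sections below.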

#### Identify documents without `Abstract`

The two documents flagged below are in fact `Research Articles` and should not be excluded.

In [10]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_nature_medicine_open_access, no_abstract)
no_abstract_urls

['natm000052', 'natm000063']


['https://www.nature.com/articles/s41591-022-01866-4',
 'https://www.nature.com/articles/s41591-022-01837-9']

### [Annual Review of Public Health](https://www.annualreviews.org/content/journals/publhealth)

#### Create output subdirectory

In [11]:
# 'Annual Review of Public Health'
id = 'arph'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\arph


#### Import the data into a DataFrame

In [12]:
df_ar_public_health = pd.read_json(f"{input_directory}/ar_public_health.jsonl", lines=True)

In [13]:
df_ar_public_health['Published'] = pd.to_datetime(df_ar_public_health['Published'], unit='ms')

#### Identify documents without `Abstract`

In [14]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_ar_public_health, no_abstract)
no_abstract_urls

['arph000029', 'arph000057']


['https://www.annualreviews.org/content/journals/10.1146/annurev-pu-42-012821-100001',
 'https://www.annualreviews.org/content/journals/10.1146/annurev-pu-41-012720-100001']

### [Lancet Public Health](https://www.thelancet.com/journals/lanpub/home)

#### Create output subdirectory

In [15]:
# 'Lancet Public Health'
id = 'laph'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\laph


#### Import the data into a DataFrame

In [16]:
df_lancet_public_health_open_access = pd.read_json(f"{input_directory}/lancet_public_health_open_access.jsonl", lines=True)

In [17]:
df_lancet_public_health_open_access['Published'] = pd.to_datetime(df_lancet_public_health_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [18]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_lancet_public_health_open_access, no_abstract)
no_abstract_urls

[]


[]

### [New England Journal of Medicine](https://www.nejm.org/)

#### Create output subdirectory

In [19]:
# 'New England Journal of Medicine'
id = 'nejm'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\nejm


#### Import the data into a DataFrame

In [20]:
df_new_england_journal_of_medicine_open_access = pd.read_json(f"{input_directory}/new_england_journal_of_medicine_open_access.jsonl", lines=True)

In [21]:
df_new_england_journal_of_medicine_open_access['Published'] = pd.to_datetime(df_new_england_journal_of_medicine_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [22]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_new_england_journal_of_medicine_open_access, no_abstract)
no_abstract_urls

['nejm000048', 'nejm000081', 'nejm000138', 'nejm000177', 'nejm000197', 'nejm000212', 'nejm000216', 'nejm000224', 'nejm000252', 'nejm000265', 'nejm000318', 'nejm000374', 'nejm000375', 'nejm000394', 'nejm000397', 'nejm000406', 'nejm000428', 'nejm000480', 'nejm000507', 'nejm000513', 'nejm000538', 'nejm000540', 'nejm000542', 'nejm000547', 'nejm000552', 'nejm000568', 'nejm000578', 'nejm000619', 'nejm000620', 'nejm000653', 'nejm000662', 'nejm000666']


['https://www.nejm.org/doi/full/10.1056/NEJMra1908412',
 'https://www.nejm.org/doi/full/10.1056/NEJMra1901594',
 'https://www.nejm.org/doi/full/10.1056/NEJMoa2016638',
 'https://www.nejm.org/doi/full/10.1056/NEJMoa2004967',
 'https://www.nejm.org/doi/full/10.1056/NEJMoa2000962',
 'https://www.nejm.org/doi/full/10.1056/NEJMoa2017699',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2026131',
 'https://www.nejm.org/doi/full/10.1056/NEJMoa2027906',
 'https://www.nejm.org/doi/full/10.1056/NEJMoa2035389',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2035343',
 'https://www.nejm.org/doi/full/10.1056/NEJMoa1901281',
 'https://www.nejm.org/doi/full/10.1056/NEJMoa2035908',
 'https://www.nejm.org/doi/full/10.1056/NEJMoa2028973',
 'https://www.nejm.org/doi/full/10.1056/NEJMoa1708120',
 'https://www.nejm.org/doi/full/10.1056/NEJMoa2109682',
 'https://www.nejm.org/doi/full/10.1056/NEJMoa2114255',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2030281',
 'https://www.nejm.org/doi/full/10.1056/NEJMoa21

## Biological Sciences

### [Cell](https://www.cell.com/cell/home)

#### Create output subdirectory

In [23]:
# 'Cell'
id = 'cell'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\cell


#### Import the data into a DataFrame

In [24]:
df_cell_open_access = pd.read_json(f"{input_directory}/cell_open_access.jsonl", lines=True)

In [25]:
df_cell_open_access['Published'] = pd.to_datetime(df_cell_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [26]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_cell_open_access, no_abstract)
no_abstract_urls

[]


[]

### [American Journal of Human Biology](https://onlinelibrary.wiley.com/journal/15206300)

#### Create output subdirectory

In [27]:
# 'American Journal of Human Biology'
id = 'ajhb'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\ajhb


#### Import the data into a DataFrame

In [28]:
df_american_journal_human_biology_open_access = pd.read_json(f"{input_directory}/american_journal_human_biology_open_access.jsonl", lines=True)

In [29]:
df_american_journal_human_biology_open_access['Published'] = pd.to_datetime(df_american_journal_human_biology_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [30]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_american_journal_human_biology_open_access, no_abstract)
no_abstract_urls

['ajhb000000', 'ajhb000004', 'ajhb000005', 'ajhb000006', 'ajhb000007', 'ajhb000010', 'ajhb000013', 'ajhb000014', 'ajhb000015', 'ajhb000017', 'ajhb000019', 'ajhb000020', 'ajhb000021', 'ajhb000033', 'ajhb000035', 'ajhb000042', 'ajhb000043']


['https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23389',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23407',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23408',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23409',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23430',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23475',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23511',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23474',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23478',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23568',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23562',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23594',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23593',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23607',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23714',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23739',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23740

## Human Sciences

### [Annual Review of Anthropology](https://www.annualreviews.org/content/journals/anthro)

#### Create output subdirectory

In [31]:
# 'Annual Review of Anthropology'
id = 'aran'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\aran


#### Import the data into a DataFrame

In [32]:
df_ar_anthropology = pd.read_json(f"{input_directory}/ar_anthropology.jsonl", lines=True)

In [33]:
df_ar_anthropology['Published'] = pd.to_datetime(df_ar_anthropology['Published'], unit='ms')

#### Identify documents without `Abstract`

In [34]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_ar_anthropology, no_abstract)
no_abstract_urls

['aran000000', 'aran000031', 'aran000057']


['https://www.annualreviews.org/content/journals/10.1146/annurev-an-51-082222-100001',
 'https://www.annualreviews.org/content/journals/10.1146/annurev-an-50-081621-100001',
 'https://www.annualreviews.org/content/journals/10.1146/annurev-an-49-081420-100001']

### [Journal of Human Evolution](https://www.sciencedirect.com/journal/journal-of-human-evolution)

#### Create output subdirectory

In [35]:
# 'Journal of Human Evolution'
id = 'jhue'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\jhue


#### Import the data into a DataFrame

In [36]:
df_journal_human_evolution_open_access = pd.read_json(f"{input_directory}/journal_human_evolution_open_access.jsonl", lines=True)

In [37]:
df_journal_human_evolution_open_access['Published'] = pd.to_datetime(df_journal_human_evolution_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [38]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_journal_human_evolution_open_access, no_abstract)
no_abstract_urls

[]


[]

## Applied Social Sciences

### [Journal of Applied Social Science](https://journals.sagepub.com/home/jax)

#### Create output subdirectory

In [39]:
# 'Journal of Applied Social Science'
id = 'jasc'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\jasc


#### Import the data into a DataFrame

In [40]:
df_journal_applied_social_science_open_access = pd.read_json(f"{input_directory}/journal_applied_social_science_open_access.jsonl", lines=True)

In [41]:
df_journal_applied_social_science_open_access['Published'] = pd.to_datetime(df_journal_applied_social_science_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [42]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_journal_applied_social_science_open_access, no_abstract)
no_abstract_urls

[]


[]

### [Journal of Social Issues](https://spssi.onlinelibrary.wiley.com/journal/15404560)

#### Create output subdirectory

In [43]:
# 'Journal of Social Issues'
id = 'jsoi'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\jsoi


#### Import the data into a DataFrame

In [44]:
df_journal_social_issues_open_access = pd.read_json(f"{input_directory}/journal_social_issues_open_access.jsonl", lines=True)

In [45]:
df_journal_social_issues_open_access['Published'] = pd.to_datetime(df_journal_social_issues_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [46]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_journal_social_issues_open_access, no_abstract)
no_abstract_urls

[]


[]

### [Social Science & Medicine](https://www.sciencedirect.com/journal/social-science-and-medicine)

#### Create output subdirectory

In [47]:
# 'Social Science & Medicine'
id = 'socm'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\socm


#### Import the data into a DataFrame

In [48]:
df_social_science_medicine_open_access = pd.read_json(f"{input_directory}/social_science_medicine_open_access.jsonl", lines=True)

In [49]:
df_social_science_medicine_open_access['Published'] = pd.to_datetime(df_social_science_medicine_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [50]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_social_science_medicine_open_access, no_abstract)
no_abstract_urls

['socm000081', 'socm000118', 'socm000131', 'socm000181', 'socm000182', 'socm000203', 'socm000300', 'socm000380', 'socm000539']


['https://www.sciencedirect.com//science/article/pii/S0277953620304548',
 'https://www.sciencedirect.com//science/article/pii/S0277953620307243',
 'https://www.sciencedirect.com//science/article/pii/S0277953620306419',
 'https://www.sciencedirect.com//science/article/pii/S0277953621000162',
 'https://www.sciencedirect.com//science/article/pii/S027795362100023X',
 'https://www.sciencedirect.com//science/article/pii/S0277953620308571',
 'https://www.sciencedirect.com//science/article/pii/S0277953621004469',
 'https://www.sciencedirect.com//science/article/pii/S0277953620307656',
 'https://www.sciencedirect.com//science/article/pii/S0277953621008431']

## Linguistics, Literature and Arts

### [Applied Corpus Linguistics](https://www.sciencedirect.com/journal/applied-corpus-linguistics)

#### Create output subdirectory

In [51]:
# 'Applied Corpus Linguistics'
id = 'apcl'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\apcl


#### Import the data into a DataFrame

In [52]:
df_applied_corpus_linguistics_open_access = pd.read_json(f"{input_directory}/applied_corpus_linguistics_open_access.jsonl", lines=True)

In [53]:
df_applied_corpus_linguistics_open_access['Published'] = pd.to_datetime(df_applied_corpus_linguistics_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [54]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_applied_corpus_linguistics_open_access, no_abstract)
no_abstract_urls

[]


[]

### [Journal of English Linguistics](https://journals.sagepub.com/home/eng)

#### Create output subdirectory

In [55]:
# 'Journal of English Linguistics'
id = 'jenl'
path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
create_directory(path1)

Directory already exists: cl_st2_ph4_eyamrog\jenl


#### Import the data into a DataFrame

In [56]:
df_journal_english_linguistics_open_access = pd.read_json(f"{input_directory}/journal_english_linguistics_open_access.jsonl", lines=True)

In [57]:
df_journal_english_linguistics_open_access['Published'] = pd.to_datetime(df_journal_english_linguistics_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [58]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_journal_english_linguistics_open_access, no_abstract)
no_abstract_urls

['jenl000015', 'jenl000016']


['https://journals.sagepub.com/doi/abs/10.1177/00754242221126692',
 'https://journals.sagepub.com/doi/abs/10.1177/00754242221126533']