<center>
<img src="https://laelgelcpublic.s3.sa-east-1.amazonaws.com/lael_50_years_narrow_white.png.no_years.400px_96dpi.png" width="300" alt="LAEL 50 years logo">
<h3>APPLIED LINGUISTICS GRADUATE PROGRAMME (LAEL)</h3>
</center>
<hr>

# Corpus Linguistics - Study 2 - Phase 4 - eyamrog

The aim of this phase is to perform an initial compilation of the `QJPP` corpus.

## Required Python packages

- pandas
- numpy
- openpyxl (used by pandas to write the `.xlsx` export)

## Import the required libraries

In [1]:
import pandas as pd
import os
import sys
import numpy as np
import random

## Define input variables

In [2]:
input_directory = 'cl_st2_ph2_eyamrog'
output_directory = 'cl_st2_ph4_eyamrog'

## Create output directory

In [3]:
# Check if the output directory already exists. If it does, do nothing. If it doesn't exist, create it.
if os.path.exists(output_directory):
    print('Output directory already exists.')
else:
    try:
        os.makedirs(output_directory)
        print('Output directory successfully created.')
    except OSError as e:
        print('Failed to create the directory:', e)
        sys.exit(1)

Output directory already exists.


### Create output subdirectories

In [4]:
def create_directory(path):
    """Creates a subdirectory if it doesn't exist."""
    if not os.path.exists(path):
        try:
            os.makedirs(path)
            print(f"Successfully created the directory: {path}")
        except OSError as e:
            print(f"Failed to create the {path} directory: {e}")
            sys.exit(1)
    else:
        print(f"Directory already exists: {path}")

## List documents without `Abstract`

Identify documents that may not be `Research Articles` by checking for the absence of an `Abstract` section.

In [5]:
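def get_files_without_abstract(directory):
    """Return the IDs (file names without '.txt') of the text files in `directory` that contain no abstract marker."""
    no_abstract = []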

    for filename in os.listdir(directory):
        if filename.endswith('.txt'):
            file_path = os.path.join(directory, filename)

            with open(file_path, 'r', encoding='utf-8') as f:
                has_abstract = any(
                    line.strip().startswith('Abstract:') or
                    line.strip().startswith('Section: ABSTRACT')
                    for line in f
                )

            if not has_abstract:
                no_abstract.append(os.path.splitext(filename)[0]) # Remove '.txt'

    return no_abstract
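The check above can be tried on throwaway files before pointing it at the real corpus. The following is a minimal sketch, not part of the original notebook: the helper `has_abstract_marker` and the `demo…` file names are hypothetical, and the two markers are the ones tested by `get_files_without_abstract`.

```python
import os
import tempfile

def has_abstract_marker(lines):
    """Return True if any line starts with one of the two abstract markers."""
    markers = ('Abstract:', 'Section: ABSTRACT')
    return any(line.strip().startswith(markers) for line in lines)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical sample documents: one with an abstract, one without
    samples = {
        'demo000001.txt': 'Title: A\nAbstract: Some text.\n',
        'demo000002.txt': 'Title: B\nSection: INTRODUCTION\n',
    }
    for name, text in samples.items():
        with open(os.path.join(tmp_dir, name), 'w', encoding='utf-8') as f:
            f.write(text)

    # Collect the IDs of the files that carry no abstract marker
    flagged = []
    for name in sorted(os.listdir(tmp_dir)):
        with open(os.path.join(tmp_dir, name), 'r', encoding='utf-8') as f:
            if not has_abstract_marker(f):
                flagged.append(os.path.splitext(name)[0])

print(flagged)
```

Only the file without either marker should be flagged. An empty file would also be flagged, since it contains no lines at all.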

In [6]:
def get_urls_from_ids(df, id_list, id_column='ID', url_column='URL'):
    """
    Given a DataFrame and a list of IDs, return a list of corresponding URLs.

    Parameters:
    - df: pandas DataFrame containing at least 'ID' and 'URL' columns
    - id_list: list of ID values to match
    - id_column: column name in df that contains IDs (default 'ID')
    - url_column: column name in df that contains URLs (default 'URL')

    Returns:
    - List of URLs matching the given IDs
    """
    filtered = df[df[id_column].isin(id_list)]
    return filtered[url_column].tolist()
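One detail worth keeping in mind: filtering with `isin` returns rows in DataFrame order, not in the order of `id_list`. The sketch below uses a hypothetical three-row DataFrame (`df_demo`) to show the difference and one possible order-preserving alternative based on an index lookup.

```python
import pandas as pd

# Hypothetical data, only for illustration
df_demo = pd.DataFrame({
    'ID': ['demo000001', 'demo000002', 'demo000003'],
    'URL': ['https://example.org/a', 'https://example.org/b', 'https://example.org/c'],
})
id_list = ['demo000003', 'demo000001']

# Same filter as get_urls_from_ids: rows come back in DataFrame order
urls_df_order = df_demo[df_demo['ID'].isin(id_list)]['URL'].tolist()

# Alternative: look the IDs up through an index to follow the order of id_list
urls_list_order = df_demo.set_index('ID').loc[id_list, 'URL'].tolist()

print(urls_df_order)
print(urls_list_order)
```

The index-based lookup raises a `KeyError` when an ID is missing from the DataFrame, whereas the `isin` filter silently skips it, so the two are not interchangeable in every situation.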

## Sample articles

In [7]:
def sample_articles(df, sampled_article_count, exclusion_article_list):
    """
    Sample articles randomly.

    Parameters:
    - df: Full DataFrame
    - sampled_article_count: Number of sampled articles
    - exclusion_article_list: List of article IDs to be excluded

    Returns:
    - DataFrame with the sampled articles
    """
    # Get the full article list
    full_article_list = list(df['ID'].unique())

    # Remove the excluded articles from the full list
    article_list = [item for item in full_article_list if item not in exclusion_article_list]

    # Ensure there are enough unique 'ID' values left to sample from
    if len(article_list) < sampled_article_count:
        raise ValueError(
            f"Not enough unique 'ID' values to sample {sampled_article_count} items."
        )

    # Generate a random sample of 'ID's
    random_id_list = random.sample(article_list, sampled_article_count)

    # Keep only the sampled articles
    df_sampled = df[df['ID'].isin(random_id_list)]

    return df_sampled
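Because `random.sample` draws from the global random state, every run of the notebook selects a different set of articles. If a repeatable selection is wanted, one option is a private, seeded generator. This is a sketch under that assumption, not the sampling used in this notebook; `sample_ids`, `df_demo` and the seed value are hypothetical.

```python
import random
import pandas as pd

def sample_ids(ids, count, seed):
    """Hypothetical helper: draw `count` IDs with a private, seeded generator."""
    rng = random.Random(seed)
    # Sorting first makes the draw independent of the input order
    return rng.sample(sorted(ids), count)

# Hypothetical data, only for illustration
df_demo = pd.DataFrame({'ID': [f"demo{i:06d}" for i in range(10)]})

first = sample_ids(df_demo['ID'].unique(), 3, seed=42)
second = sample_ids(df_demo['ID'].unique(), 3, seed=42)

# Same filter as in sample_articles
df_demo_sampled = df_demo[df_demo['ID'].isin(first)]
print(first == second, df_demo_sampled.shape)
```

With the same seed and the same set of IDs, both calls return the same selection.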

## QJPP Design

### Define `QJPP Text Count` proportional to `EL2AP Text Count`

In [8]:
qjpp_dict = {
    'Discipline': [
        'Health Sciences',
        'Biological Sciences',
        'Human Sciences',
        'Applied Social Sciences',
        'Linguistic, literature and arts'
    ],
    'EL2AP Text Count': [
        45, 46, 26, 21, 15
    ]
}

In [9]:
df_qjpp_design = pd.DataFrame(qjpp_dict)

For the QJPP reference corpus, the discipline with the lowest text count is `Linguistic, literature and arts`, with 29 texts. The text counts in the other disciplines should follow the same proportion of 29 QJPP texts per 15 EL2AP texts.

In [10]:
proportion_factor = 29/15

In [11]:
df_qjpp_design['QJPP Text Count'] = np.ceil(
    df_qjpp_design['EL2AP Text Count'] * proportion_factor
).astype(int)

In [12]:
df_qjpp_design

Unnamed: 0,Discipline,EL2AP Text Count,QJPP Text Count
0,Health Sciences,45,87
1,Biological Sciences,46,89
2,Human Sciences,26,51
3,Applied Social Sciences,21,41
4,"Linguistic, literature and arts",15,29
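The counts in the table can be cross-checked with integer arithmetic, which avoids any risk of `np.ceil` rounding a floating-point product that is only marginally above a whole number. The sketch below is a verification aid rather than the notebook's method; the names `el2ap_counts`, `QJPP_BASE` and `EL2AP_BASE` are introduced here for illustration.

```python
# EL2AP text counts per discipline, as defined in qjpp_dict
el2ap_counts = {
    'Health Sciences': 45,
    'Biological Sciences': 46,
    'Human Sciences': 26,
    'Applied Social Sciences': 21,
    'Linguistic, literature and arts': 15,
}

# Proportion of 29 QJPP texts per 15 EL2AP texts
QJPP_BASE, EL2AP_BASE = 29, 15

# Ceiling division on integers: -((-a) // b) equals ceil(a / b) without floats
qjpp_counts = {
    discipline: -((-count * QJPP_BASE) // EL2AP_BASE)
    for discipline, count in el2ap_counts.items()
}

for discipline, count in qjpp_counts.items():
    print(discipline, count)
```

The resulting values agree with the `QJPP Text Count` column shown above.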


### Splitting the quantities among QJPP's journals

#### Map the journals

In [13]:
qjpp_journal_map = {
    'Health Sciences': [
        'Nature Medicine',
        'Annual Review of Public Health',
        'Lancet Public Health',
        'New England Journal of Medicine'
    ],
    'Biological Sciences': [
        'Cell',
        'American Journal of Human Biology'
    ],
    'Human Sciences': [
        'Annual Review of Anthropology',
        'Journal of Human Evolution'
    ],
    'Applied Social Sciences': [
        'Journal of Applied Social Science',
        'Journal of Social Issues',
        'Social Science & Medicine'
    ],
    'Linguistic, literature and arts': [
        'Applied Corpus Linguistics',
        'Journal of English Linguistics',
        'Corpora'
    ]
}

#### Break down the counts

In [14]:
# Create a list of rows for the new DataFrame
rows = []

for _, row in df_qjpp_design.iterrows():
    discipline = row['Discipline']
    total = row['QJPP Text Count']
    journals = qjpp_journal_map[discipline]
    share = total / len(journals)

    for journal in journals:
        rows.append({
            'Discipline': discipline,
            'Journal': journal,
            'QJPP Text Count': share
        })

# New DataFrame
df_qjpp_design_split = pd.DataFrame(rows)

# Round up to the nearest integer
df_qjpp_design_split['QJPP Text Count'] = np.ceil(df_qjpp_design_split['QJPP Text Count']).astype(int)

# Manually adjust the values in the 'Human Sciences' discipline
df_qjpp_design_split.loc[
    df_qjpp_design_split['Journal'] == 'Annual Review of Anthropology',
    'QJPP Text Count'
] = 29

df_qjpp_design_split.loc[
    df_qjpp_design_split['Journal'] == 'Journal of Human Evolution',
    'QJPP Text Count'
] = 22

# Manually adjust the values in the 'Linguistic, literature and arts' discipline
df_qjpp_design_split.loc[
    df_qjpp_design_split['Journal'] == 'Applied Corpus Linguistics',
    'QJPP Text Count'
] = 5

df_qjpp_design_split.loc[
    df_qjpp_design_split['Journal'] == 'Journal of English Linguistics',
    'QJPP Text Count'
] = 15

df_qjpp_design_split.loc[
    df_qjpp_design_split['Journal'] == 'Corpora',
    'QJPP Text Count'
] = 9

# Include the total 'QJPP Text Count' row
total_qjpp_text_count = df_qjpp_design_split['QJPP Text Count'].sum()

total_row = ['', 'Total', total_qjpp_text_count]
df_qjpp_design_split.loc[len(df_qjpp_design_split.index)] = total_row

df_qjpp_design_split

Unnamed: 0,Discipline,Journal,QJPP Text Count
0,Health Sciences,Nature Medicine,22
1,Health Sciences,Annual Review of Public Health,22
2,Health Sciences,Lancet Public Health,22
3,Health Sciences,New England Journal of Medicine,22
4,Biological Sciences,Cell,45
5,Biological Sciences,American Journal of Human Biology,45
6,Human Sciences,Annual Review of Anthropology,29
7,Human Sciences,Journal of Human Evolution,22
8,Applied Social Sciences,Journal of Applied Social Science,14
9,Applied Social Sciences,Journal of Social Issues,14
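Rounding each journal's share up means that a discipline's journals can add up to slightly more than the discipline target (for example 4 × 22 = 88 for Health Sciences against a target of 87). If an exact total were preferred, one alternative is to distribute the remainder one text at a time. This is a sketch of that alternative, not the approach taken in this notebook; `split_evenly` is a hypothetical helper.

```python
def split_evenly(total, parts):
    """Split `total` into `parts` integers that differ by at most 1 and sum to `total`."""
    base, remainder = divmod(total, parts)
    # The first `remainder` parts receive one extra item
    return [base + 1 if i < remainder else base for i in range(parts)]

health_sciences = split_evenly(87, 4)
applied_social_sciences = split_evenly(41, 3)

print(health_sciences, sum(health_sciences))
print(applied_social_sciences, sum(applied_social_sciences))
```

Which journals receive the extra text is arbitrary in this sketch; in practice that choice could follow the number of articles available per journal.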


##### Create a LaTeX table

In [15]:
title = 'QJPP Text Count'
filename = 'df_qjpp_design_split_paragraph_count'
caption = title
label = f"tab:{filename}"
tex_filename = f"{filename}.tex"

In [16]:
tex_table = df_qjpp_design_split.to_latex(index=False, longtable=True, decimal=',', caption=caption, label=label)

In [17]:
with open(f"{output_directory}/{tex_filename}", 'w', encoding='utf8', newline='\n') as file:
    file.write(tex_table)

## Health Sciences

### [Nature Medicine](https://www.nature.com/nm/)

#### Create output subdirectory

In [18]:
# 'Nature Medicine'
id = 'natm'
#path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
#create_directory(path1)

#### Import the data into a DataFrame

In [19]:
df_nature_medicine_open_access = pd.read_json(f"{input_directory}/nature_medicine_open_access.jsonl", lines=True)

In [20]:
df_nature_medicine_open_access['Published'] = pd.to_datetime(df_nature_medicine_open_access['Published'], unit='ms')
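The `unit='ms'` argument suggests that `Published` is stored in the JSONL files as integer milliseconds since the Unix epoch. That is an assumption, since the input files are not shown here. A minimal sketch of the conversion, using a hypothetical value:

```python
import pandas as pd

# Hypothetical value: 2022-10-27 00:00:00 UTC expressed in epoch milliseconds
published_raw = pd.Series([1666828800000])

# Interpret the integers as milliseconds since 1970-01-01
published = pd.to_datetime(published_raw, unit='ms')

print(published.iloc[0])
```

Without `unit='ms'`, pandas would interpret the same integers as nanoseconds and return dates very close to 1970-01-01.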

##### Correct the data set's `Discipline` column and export to a file

In [21]:
df_nature_medicine_open_access['Discipline'] = 'Health Sciences'

In [22]:
df_nature_medicine_open_access.to_json(f"{output_directory}/nature_medicine_open_access.jsonl", orient='records', lines=True)

#### Identify documents without `Abstract`

The documents are actually `Research Articles` and **should not be excluded**.

In [23]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_nature_medicine_open_access, no_abstract)
no_abstract_urls

['natm000052', 'natm000063']


['https://www.nature.com/articles/s41591-022-01866-4',
 'https://www.nature.com/articles/s41591-022-01837-9']

#### Sample articles

In [24]:
df = df_nature_medicine_open_access
sampled_article_count = 22
exclusion_article_list = []

In [25]:
df_nature_medicine_open_access_sampled = sample_articles(df, sampled_article_count, exclusion_article_list)

In [26]:
df_nature_medicine_open_access_sampled.shape

(22, 9)

### [Annual Review of Public Health](https://www.annualreviews.org/content/journals/publhealth)

#### Create output subdirectory

In [27]:
# 'Annual Review of Public Health'
id = 'arph'
#path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
#create_directory(path1)

#### Import the data into a DataFrame

In [28]:
df_ar_public_health = pd.read_json(f"{input_directory}/ar_public_health.jsonl", lines=True)

In [29]:
df_ar_public_health['Published'] = pd.to_datetime(df_ar_public_health['Published'], unit='ms')

#### Identify documents without `Abstract`

The documents are `Introductions` and **should be excluded**.

In [30]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_ar_public_health, no_abstract)
no_abstract_urls

['arph000029', 'arph000057']


['https://www.annualreviews.org/content/journals/10.1146/annurev-pu-42-012821-100001',
 'https://www.annualreviews.org/content/journals/10.1146/annurev-pu-41-012720-100001']

#### Sample articles

In [31]:
df = df_ar_public_health
sampled_article_count = 22
exclusion_article_list = no_abstract

In [32]:
df_ar_public_health_sampled = sample_articles(df, sampled_article_count, exclusion_article_list)

In [33]:
df_ar_public_health_sampled.shape

(22, 9)

### [Lancet Public Health](https://www.thelancet.com/journals/lanpub/home)

#### Create output subdirectory

In [34]:
# 'Lancet Public Health'
id = 'laph'
#path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
#create_directory(path1)

#### Import the data into a DataFrame

In [35]:
df_lancet_public_health_open_access = pd.read_json(f"{input_directory}/lancet_public_health_open_access.jsonl", lines=True)

In [36]:
df_lancet_public_health_open_access['Published'] = pd.to_datetime(df_lancet_public_health_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [37]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_lancet_public_health_open_access, no_abstract)
no_abstract_urls

[]


[]

#### Sample articles

In [38]:
df = df_lancet_public_health_open_access
sampled_article_count = 22
exclusion_article_list = []

In [39]:
df_lancet_public_health_open_access_sampled = sample_articles(df, sampled_article_count, exclusion_article_list)

In [40]:
df_lancet_public_health_open_access_sampled.shape

(22, 10)

### [New England Journal of Medicine](https://www.nejm.org/)

#### Create output subdirectory

In [41]:
# 'New England Journal of Medicine'
id = 'nejm'
#path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
#create_directory(path1)

#### Import the data into a DataFrame

In [42]:
df_new_england_journal_of_medicine_open_access = pd.read_json(f"{input_directory}/new_england_journal_of_medicine_open_access.jsonl", lines=True)

In [43]:
df_new_england_journal_of_medicine_open_access['Published'] = pd.to_datetime(df_new_england_journal_of_medicine_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

These documents are non-standard `Research Articles`. **They should be excluded**.

In [44]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_new_england_journal_of_medicine_open_access, no_abstract)
no_abstract_urls

['nejm000048', 'nejm000081', 'nejm000216', 'nejm000265', 'nejm000428', 'nejm000540', 'nejm000568', 'nejm000578', 'nejm000619', 'nejm000653', 'nejm000662']


['https://www.nejm.org/doi/full/10.1056/NEJMra1908412',
 'https://www.nejm.org/doi/full/10.1056/NEJMra1901594',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2026131',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2035343',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2030281',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2200583',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2117706',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2106441',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2206573',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2208860',
 'https://www.nejm.org/doi/full/10.1056/NEJMra2200092']

##### Empty files

Many of the documents listed above were empty when first scraped; **they have been re-scraped and are now adequate**. The remaining ones are non-standard `Research Articles`.

```
# PowerShell command
(Get-ChildItem -Path . -File -Recurse -Filter *.txt | Where-Object { $_.Length -eq 0 }).FullName
```
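For readers working outside PowerShell, roughly the same check can be written in Python. The sketch below is an assumed equivalent, not a command used in the original study; `find_empty_txt_files` and the `demo…` file names are hypothetical.

```python
import os
import tempfile

def find_empty_txt_files(root):
    """Return the paths of zero-byte .txt files under `root`, searched recursively."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith('.txt') and os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical sample files: one empty, one with content
    open(os.path.join(tmp_dir, 'demo000001.txt'), 'w').close()
    with open(os.path.join(tmp_dir, 'demo000002.txt'), 'w', encoding='utf-8') as f:
        f.write('Title: B\n')

    empty_names = [os.path.basename(p) for p in find_empty_txt_files(tmp_dir)]

print(empty_names)
```

Unlike the PowerShell filter on Windows, `str.endswith('.txt')` is case-sensitive, so files named `.TXT` would be skipped by this sketch.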

##### Documents without `Abstract` that are not empty

These documents are non-standard `Research Articles`. **They should be excluded**.

#### Sample articles

In [45]:
df = df_new_england_journal_of_medicine_open_access
sampled_article_count = 22
exclusion_article_list = no_abstract

In [46]:
df_new_england_journal_of_medicine_open_access_sampled = sample_articles(df, sampled_article_count, exclusion_article_list)

In [47]:
df_new_england_journal_of_medicine_open_access_sampled.shape

(22, 12)

## Biological Sciences

### [Cell](https://www.cell.com/cell/home)

#### Create output subdirectory

In [48]:
# 'Cell'
id = 'cell'
#path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
#create_directory(path1)

#### Import the data into a DataFrame

In [49]:
df_cell_open_access = pd.read_json(f"{input_directory}/cell_open_access.jsonl", lines=True)

In [50]:
df_cell_open_access['Published'] = pd.to_datetime(df_cell_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [51]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_cell_open_access, no_abstract)
no_abstract_urls

[]


[]

##### Retracted

This article has been retracted and **should be excluded**.

In [52]:
retracted = ['cell000085']

#### Sample articles

In [53]:
df = df_cell_open_access
sampled_article_count = 45
exclusion_article_list = retracted

In [54]:
df_cell_open_access_sampled = sample_articles(df, sampled_article_count, exclusion_article_list)

In [55]:
df_cell_open_access_sampled.shape

(45, 10)

### [American Journal of Human Biology](https://onlinelibrary.wiley.com/journal/15206300)

#### Create output subdirectory

In [56]:
# 'American Journal of Human Biology'
id = 'ajhb'
#path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
#create_directory(path1)

#### Import the data into a DataFrame

In [57]:
df_american_journal_human_biology_open_access = pd.read_json(f"{input_directory}/american_journal_human_biology_open_access.jsonl", lines=True)

In [58]:
df_american_journal_human_biology_open_access['Published'] = pd.to_datetime(df_american_journal_human_biology_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

The documents are not `Research Articles` and **should be excluded**.

In [59]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_american_journal_human_biology_open_access, no_abstract)
no_abstract_urls

['ajhb000000', 'ajhb000004', 'ajhb000005', 'ajhb000006', 'ajhb000007', 'ajhb000010', 'ajhb000013', 'ajhb000014', 'ajhb000015', 'ajhb000017', 'ajhb000019', 'ajhb000020', 'ajhb000021', 'ajhb000033', 'ajhb000035', 'ajhb000042', 'ajhb000043']


['https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23389',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23407',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23408',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23409',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23430',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23475',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23511',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23474',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23478',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23568',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23562',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23594',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23593',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23607',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23714',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23739',
 'https://onlinelibrary.wiley.com/doi/10.1002/ajhb.23740']

#### Sample articles

In [60]:
df = df_american_journal_human_biology_open_access
sampled_article_count = 45
exclusion_article_list = no_abstract

In [61]:
df_american_journal_human_biology_open_access_sampled = sample_articles(df, sampled_article_count, exclusion_article_list)

In [62]:
df_american_journal_human_biology_open_access_sampled.shape

(45, 12)

## Human Sciences

### [Annual Review of Anthropology](https://www.annualreviews.org/content/journals/anthro)

#### Create output subdirectory

In [63]:
# 'Annual Review of Anthropology'
id = 'aran'
#path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
#create_directory(path1)

#### Import the data into a DataFrame

In [64]:
df_ar_anthropology = pd.read_json(f"{input_directory}/ar_anthropology.jsonl", lines=True)

In [65]:
df_ar_anthropology['Published'] = pd.to_datetime(df_ar_anthropology['Published'], unit='ms')

#### Identify documents without `Abstract`

The documents are not `Research Articles` and **should be excluded**.

In [66]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_ar_anthropology, no_abstract)
no_abstract_urls

['aran000000', 'aran000031', 'aran000057']


['https://www.annualreviews.org/content/journals/10.1146/annurev-an-51-082222-100001',
 'https://www.annualreviews.org/content/journals/10.1146/annurev-an-50-081621-100001',
 'https://www.annualreviews.org/content/journals/10.1146/annurev-an-49-081420-100001']

#### Sample articles

In [67]:
df = df_ar_anthropology
sampled_article_count = 29
exclusion_article_list = no_abstract

In [68]:
df_ar_anthropology_sampled = sample_articles(df, sampled_article_count, exclusion_article_list)

In [69]:
df_ar_anthropology_sampled.shape

(29, 9)

### [Journal of Human Evolution](https://www.sciencedirect.com/journal/journal-of-human-evolution)

#### Create output subdirectory

In [70]:
# 'Journal of Human Evolution'
id = 'jhue'
#path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
#create_directory(path1)

#### Import the data into a DataFrame

In [71]:
df_journal_human_evolution_open_access = pd.read_json(f"{input_directory}/journal_human_evolution_open_access.jsonl", lines=True)

In [72]:
df_journal_human_evolution_open_access['Published'] = pd.to_datetime(df_journal_human_evolution_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [73]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_journal_human_evolution_open_access, no_abstract)
no_abstract_urls

[]


[]

#### Sample articles

In [74]:
df = df_journal_human_evolution_open_access
sampled_article_count = 22
exclusion_article_list = []

In [75]:
df_journal_human_evolution_open_access_sampled = sample_articles(df, sampled_article_count, exclusion_article_list)

In [76]:
df_journal_human_evolution_open_access_sampled.shape

(22, 13)

## Applied Social Sciences

### [Journal of Applied Social Science](https://journals.sagepub.com/home/jax)

#### Create output subdirectory

In [77]:
# 'Journal of Applied Social Science'
id = 'jasc'
#path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
#create_directory(path1)

#### Import the data into a DataFrame

In [78]:
df_journal_applied_social_science_open_access = pd.read_json(f"{input_directory}/journal_applied_social_science_open_access.jsonl", lines=True)

In [79]:
df_journal_applied_social_science_open_access['Published'] = pd.to_datetime(df_journal_applied_social_science_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [80]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_journal_applied_social_science_open_access, no_abstract)
no_abstract_urls

[]


[]

#### Sample articles

In [81]:
df = df_journal_applied_social_science_open_access
sampled_article_count = 14
exclusion_article_list = []

In [82]:
df_journal_applied_social_science_open_access_sampled = sample_articles(df, sampled_article_count, exclusion_article_list)

In [83]:
df_journal_applied_social_science_open_access_sampled.shape

(14, 12)

### [Journal of Social Issues](https://spssi.onlinelibrary.wiley.com/journal/15404560)

#### Create output subdirectory

In [84]:
# 'Journal of Social Issues'
id = 'jsoi'
#path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
#create_directory(path1)

#### Import the data into a DataFrame

In [85]:
df_journal_social_issues_open_access = pd.read_json(f"{input_directory}/journal_social_issues_open_access.jsonl", lines=True)

In [86]:
df_journal_social_issues_open_access['Published'] = pd.to_datetime(df_journal_social_issues_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [87]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_journal_social_issues_open_access, no_abstract)
no_abstract_urls

[]


[]

#### Sample articles

In [88]:
df = df_journal_social_issues_open_access
sampled_article_count = 14
exclusion_article_list = []

In [89]:
df_journal_social_issues_open_access_sampled = sample_articles(df, sampled_article_count, exclusion_article_list)

In [90]:
df_journal_social_issues_open_access_sampled.shape

(14, 12)

### [Social Science & Medicine](https://www.sciencedirect.com/journal/social-science-and-medicine)

#### Create output subdirectory

In [91]:
# 'Social Science & Medicine'
id = 'socm'
#path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
#create_directory(path1)

#### Import the data into a DataFrame

In [92]:
df_social_science_medicine_open_access = pd.read_json(f"{input_directory}/social_science_medicine_open_access.jsonl", lines=True)

In [93]:
df_social_science_medicine_open_access['Published'] = pd.to_datetime(df_social_science_medicine_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

All of the documents are valid `Research Articles` without `Abstract`. **They should not be excluded**.

In [94]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_social_science_medicine_open_access, no_abstract)
no_abstract_urls

['socm000081', 'socm000118', 'socm000131', 'socm000181', 'socm000182', 'socm000203', 'socm000300', 'socm000380', 'socm000539']


['https://www.sciencedirect.com//science/article/pii/S0277953620304548',
 'https://www.sciencedirect.com//science/article/pii/S0277953620307243',
 'https://www.sciencedirect.com//science/article/pii/S0277953620306419',
 'https://www.sciencedirect.com//science/article/pii/S0277953621000162',
 'https://www.sciencedirect.com//science/article/pii/S027795362100023X',
 'https://www.sciencedirect.com//science/article/pii/S0277953620308571',
 'https://www.sciencedirect.com//science/article/pii/S0277953621004469',
 'https://www.sciencedirect.com//science/article/pii/S0277953620307656',
 'https://www.sciencedirect.com//science/article/pii/S0277953621008431']

#### Documents with incomplete scraping

The following documents were only partially scraped, probably because the scraping programme was not designed to handle their HTML structure. They **should be excluded**.

In [95]:
small_files = ['socm000528', 'socm000550', 'socm000657']

small_files_urls = get_urls_from_ids(df_social_science_medicine_open_access, small_files)
small_files_urls

['https://www.sciencedirect.com//science/article/pii/S0277953622001460',
 'https://www.sciencedirect.com//science/article/pii/S0277953621008212',
 'https://www.sciencedirect.com//science/article/pii/S0277953622005329']

#### Sample articles

In [96]:
df = df_social_science_medicine_open_access
sampled_article_count = 14
exclusion_article_list = small_files

In [97]:
df_social_science_medicine_open_access_sampled = sample_articles(df, sampled_article_count, exclusion_article_list)

In [98]:
df_social_science_medicine_open_access_sampled.shape

(14, 13)

## Linguistics, literature and arts

### [Applied Corpus Linguistics](https://www.sciencedirect.com/journal/applied-corpus-linguistics)

#### Create output subdirectory

In [99]:
# 'Applied Corpus Linguistics'
id = 'apcl'
#path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
#create_directory(path1)

#### Import the data into a DataFrame

In [100]:
df_applied_corpus_linguistics_open_access = pd.read_json(f"{input_directory}/applied_corpus_linguistics_open_access.jsonl", lines=True)

In [101]:
df_applied_corpus_linguistics_open_access['Published'] = pd.to_datetime(df_applied_corpus_linguistics_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [102]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_applied_corpus_linguistics_open_access, no_abstract)
no_abstract_urls

[]


[]

#### Sample articles

In [103]:
df = df_applied_corpus_linguistics_open_access
sampled_article_count = 5
exclusion_article_list = []

In [104]:
df_applied_corpus_linguistics_open_access_sampled = sample_articles(df, sampled_article_count, exclusion_article_list)

In [105]:
df_applied_corpus_linguistics_open_access_sampled.shape

(5, 12)

#### Use the full data set

In [106]:
df_applied_corpus_linguistics_open_access_sampled = df_applied_corpus_linguistics_open_access

### [Journal of English Linguistics](https://journals.sagepub.com/home/eng)

#### Create output subdirectory

In [107]:
# 'Journal of English Linguistics'
id = 'jenl'
#path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
#create_directory(path1)

#### Import the data into a DataFrame

In [108]:
df_journal_english_linguistics_open_access = pd.read_json(f"{input_directory}/journal_english_linguistics_open_access.jsonl", lines=True)

In [109]:
df_journal_english_linguistics_open_access['Published'] = pd.to_datetime(df_journal_english_linguistics_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

The documents are not `Research Articles` and **should be excluded**.

In [110]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_journal_english_linguistics_open_access, no_abstract)
no_abstract_urls

['jenl000015', 'jenl000016']


['https://journals.sagepub.com/doi/abs/10.1177/00754242221126692',
 'https://journals.sagepub.com/doi/abs/10.1177/00754242221126533']

#### Sample articles

In [111]:
df = df_journal_english_linguistics_open_access
sampled_article_count = 15
exclusion_article_list = no_abstract

In [112]:
df_journal_english_linguistics_open_access_sampled = sample_articles(df, sampled_article_count, exclusion_article_list)

In [113]:
df_journal_english_linguistics_open_access_sampled.shape

(15, 12)

### [Corpora](https://www.euppublishing.com/journal/cor)

#### Create output subdirectory

In [114]:
# 'Corpora'
id = 'corp'
#path1 = os.path.join(output_directory, id)
path2 = os.path.join(input_directory, id)
#create_directory(path1)

#### Import the data into a DataFrame

In [115]:
df_corpora_open_access = pd.read_json(f"{input_directory}/corpora_open_access.jsonl", lines=True)

In [116]:
df_corpora_open_access['Published'] = pd.to_datetime(df_corpora_open_access['Published'], unit='ms')

#### Identify documents without `Abstract`

In [117]:
no_abstract = get_files_without_abstract(path2)
print(no_abstract)

no_abstract_urls = get_urls_from_ids(df_corpora_open_access, no_abstract)
no_abstract_urls

[]


[]

#### Sample articles

In [118]:
df = df_corpora_open_access
sampled_article_count = 9
exclusion_article_list = []

In [119]:
df_corpora_open_access_sampled = sample_articles(df, sampled_article_count, exclusion_article_list)

In [120]:
df_corpora_open_access_sampled.shape

(9, 12)

## Concatenate the DataFrames

In [121]:
df_qjpp = pd.concat([
    df_nature_medicine_open_access_sampled,
    df_ar_public_health_sampled,
    df_lancet_public_health_open_access_sampled,
    df_new_england_journal_of_medicine_open_access_sampled,
    df_cell_open_access_sampled,
    df_american_journal_human_biology_open_access_sampled,
    df_ar_anthropology_sampled,
    df_journal_human_evolution_open_access_sampled,
    df_journal_applied_social_science_open_access_sampled,
    df_journal_social_issues_open_access_sampled,
    df_social_science_medicine_open_access_sampled,
    df_applied_corpus_linguistics_open_access_sampled,
    df_journal_english_linguistics_open_access_sampled,
    df_corpora_open_access_sampled
], ignore_index=True)
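After concatenation, it can be useful to confirm that each journal contributes the number of texts planned in the design. The sketch below illustrates such a check on hypothetical data (`df_demo`, `expected_counts`); in the notebook, the same comparison could be made between `df_qjpp` and `df_qjpp_design_split`.

```python
import pandas as pd

# Hypothetical concatenated data, only for illustration
df_demo = pd.DataFrame({
    'Journal': ['Journal A'] * 3 + ['Journal B'] * 2,
    'ID': ['a000001', 'a000002', 'a000003', 'b000001', 'b000002'],
})

# Hypothetical design: number of texts planned per journal
expected_counts = {'Journal A': 3, 'Journal B': 2}

# Count the unique article IDs actually present per journal
actual_counts = df_demo.groupby('Journal')['ID'].nunique()

# Keep only the journals whose actual count differs from the plan
mismatches = {
    journal: (planned, int(actual_counts.get(journal, 0)))
    for journal, planned in expected_counts.items()
    if int(actual_counts.get(journal, 0)) != planned
}

print(mismatches)
```

An empty dictionary means that every journal matches its planned count; any entry lists the planned count followed by the actual one.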

In [122]:
df_qjpp

Unnamed: 0,Title,URL,Authors,Published,PDF URL,Open Access,Discipline,Journal,ID,Vol/Issue,DOI,Article Type,Open Access 1
0,Retrospectively modeling the effects of increa...,https://www.nature.com/articles/s41591-022-020...,"Sam Moore, Edward M. Hill, Matt J. Keeling",2022-10-27,https://www.nature.com/articles/s41591-022-020...,Open Access,Health Sciences,Nature Medicine,natm000005,,,,
1,Effects of elevated systolic blood pressure on...,https://www.nature.com/articles/s41591-022-019...,"Christian Razo, Catherine A. Welgan, Gregory A...",2022-10-10,https://www.nature.com/articles/s41591-022-019...,Open Access,Health Sciences,Nature Medicine,natm000015,,,,
2,Association of step counts over time with the ...,https://www.nature.com/articles/s41591-022-020...,"Hiral Master, Jeffrey Annis, Evan L. Brittain",2022-10-10,https://www.nature.com/articles/s41591-022-020...,Open Access,Health Sciences,Nature Medicine,natm000016,,,,
3,A real-world comparison of tisagenlecleucel an...,https://www.nature.com/articles/s41591-022-019...,"Emmanuel Bachy, Steven Le Gouill, Franck Morsc...",2022-09-22,https://www.nature.com/articles/s41591-022-019...,Open Access,Health Sciences,Nature Medicine,natm000020,,,,
4,Pilot study of responsive nucleus accumbens de...,https://www.nature.com/articles/s41591-022-019...,"Rajat S. Shivacharan, Camarin E. Rolle, Casey ...",2022-08-29,https://www.nature.com/articles/s41591-022-019...,Open Access,Health Sciences,Nature Medicine,natm000030,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,The ordering of relative clauses and determine...,https://www.euppublishing.com/doi/full/10.3366...,Jiajin Xu and Zhaoxia Liu,2022-09-27,https://www.euppublishing.com/doi/pdf/10.3366/...,Free access,"Linguistic, literature and arts",Corpora,corp000004,"Volume 17, Issue Supplement",https://doi.org/10.3366/cor.2022.0246,Research article,
296,Exploring the use of make + noun collocations ...,https://www.euppublishing.com/doi/full/10.3366...,Ryo Sawaguchi and Atsushi Mizumoto,2022-09-27,https://www.euppublishing.com/doi/pdf/10.3366/...,Free access,"Linguistic, literature and arts",Corpora,corp000005,"Volume 17, Issue Supplement",https://doi.org/10.3366/cor.2022.0247,Research article,
297,"Learner corpus research in Hong Kong: past, pr...",https://www.euppublishing.com/doi/full/10.3366...,"Kanglong Liu, Joyce Oiwun Cheung, and Nan Zhao",2022-09-27,https://www.euppublishing.com/doi/pdf/10.3366/...,Free access,"Linguistic, literature and arts",Corpora,corp000006,"Volume 17, Issue Supplement",https://doi.org/10.3366/cor.2022.0248,Research article,
298,A corpus-based study of native speakers’ and T...,https://www.euppublishing.com/doi/full/10.3366...,Yi-Ching Lin and Siaw-Fong Chung,2022-09-27,https://www.euppublishing.com/doi/pdf/10.3366/...,Free access,"Linguistic, literature and arts",Corpora,corp000007,"Volume 17, Issue Supplement",https://doi.org/10.3366/cor.2022.0249,Research article,


### Export to a file

In [123]:
df_qjpp.to_json(f"{output_directory}/df_qjpp.jsonl", orient='records', lines=True)

In [124]:
df_qjpp.to_excel(f"{output_directory}/df_qjpp.xlsx", index=False)