# NER on CORD-19 Dataset

## 1. Import necessary modules and set the environment

The `setup_nltk` function checks for the required NLTK packages and downloads any that are missing.

In [1]:
# # download necessary packages
# %pip install nltk
# %pip install spacy

In [2]:
import json
import re
import nltk
import spacy
import tarfile
import os
import pandas as pd
import requests
from tqdm import tqdm
from pathlib import Path
from collections import defaultdict
from NLTK_utils import (
    download_dataset,
    setup_nltk,
    stream_cord19_data,
    clean_text,
    extract_entities_nltk,
    extract_entities_spacy,
    calculate_metrics,
)

# PROJECT_ROOT = Path(__file__).resolve().parent
DATA_DIR = "data/"
CORD19_FILENAME = 'cord-19_2022-06-02.tar.gz'
CORD19_FILE_PATH = DATA_DIR + CORD19_FILENAME
CORD19_URL = "https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2022-06-02.tar.gz"
OUTPUT_DIR = "output/"

os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)

setup_nltk()
# download spacy model
print("Downloading spaCy model...")
spacy.cli.download("en_core_web_sm")

Checking NLTK packages...
Downloading spaCy model...
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 15.4 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


## 2. Download the dataset

Because the full dataset is very large, we use a download function that streams the file in chunks and can resume an interrupted download. Its signature is:
```python
download_dataset(url, dest_path, chunk_size=1024*1024)
```

Parameters:
- `url`: the URL from which the dataset is downloaded.
- `dest_path`: the path where the dataset is stored. The dataset is a compressed archive, so `dest_path` should end with `.tar.gz`.
- `chunk_size`: the size in bytes of each downloaded chunk; defaults to `1024*1024` (1 MiB).

The stream processor `stream_cord19_data` later reads this archive directly, without unpacking it to disk.
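
The implementation of `download_dataset` lives in `NLTK_utils` and is not shown in this notebook. As a rough illustration of how a resumable, chunked download can work, here is a minimal sketch using only the standard library. The names `resume_headers` and `download_resumable` are hypothetical, and the sketch assumes the server honours HTTP `Range` requests.

```python
import os
import urllib.error
import urllib.request


def resume_headers(dest_path):
    """Return (offset, headers) for continuing a partial download."""
    offset = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    # Ask only for the bytes we do not have yet.
    headers = {"Range": f"bytes={offset}-"} if offset else {}
    return offset, headers


def download_resumable(url, dest_path, chunk_size=1024 * 1024):
    """Hypothetical sketch: download url to dest_path, resuming if possible."""
    _offset, headers = resume_headers(dest_path)
    request = urllib.request.Request(url, headers=headers)
    try:
        response = urllib.request.urlopen(request)
    except urllib.error.HTTPError as err:
        if err.code == 416:
            # Requested range starts at the end of the file: nothing left to fetch.
            return
        raise
    with response:
        # 206 means the server accepted the Range header, so append;
        # any other success status means it sent the whole file, so restart.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest_path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

Writing in fixed-size chunks keeps memory use bounded regardless of the archive size, which matters for a multi-gigabyte file.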

In [3]:
# 2. download dataset
download_dataset(CORD19_URL, CORD19_FILE_PATH)

Dataset already exists and is complete: data/cord-19_2022-06-02.tar.gz
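
Once the archive is on disk, it can be read without unpacking it. The internals of `stream_cord19_data` and the exact layout of the CORD-19 archive are not shown in this notebook, so the following is only a generic sketch of the idea: it streams JSON members out of a `.tar.gz` file one at a time. The function name `stream_json_members` is hypothetical.

```python
import json
import tarfile


def stream_json_members(archive_path, limit=None):
    """Yield parsed JSON documents from a .tar.gz without extracting it to disk."""
    yielded = 0
    # "r|gz" reads the archive as a forward-only stream, member by member.
    with tarfile.open(archive_path, mode="r|gz") as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith(".json")):
                continue
            # In stream mode a member must be read before moving to the next one.
            handle = archive.extractfile(member)
            yield json.load(handle)
            yielded += 1
            if limit is not None and yielded >= limit:
                break
```

Because the function is a generator, stopping after `limit` documents means the rest of the archive is never decompressed.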


## 3. Hyperparameters configuration

There are two hyperparameters:
- **MAX_NUMBER:** the number of articles to analyze. The dataset is very large, so analyzing all of it is impractical for a demonstration; only the first **MAX_NUMBER** articles are processed.
- **MAX_LENGTH:** the number of characters at which `cleaned_text` is truncated. Article bodies are usually long, so analyzing only the first **MAX_LENGTH** cleaned characters gives a quicker demonstration. If **MAX_LENGTH** is set to `None`, the whole text is analyzed.

You can configure both parameters in the cell below.

In [4]:
MAX_NUMBER = 10
MAX_LENGTH = 2000
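
The effect of **MAX_LENGTH** can be illustrated with a small helper. The name `truncate` is hypothetical; it relies on the fact that slicing a string with `None` as the upper bound returns the whole string.

```python
def truncate(text, max_length=None):
    """Return the first max_length characters, or the whole text if max_length is None."""
    # text[:None] is equivalent to text[:], so None means "no truncation".
    return text[:max_length]
```

Note one difference from the analysis cell: there the check is `if MAX_LENGTH:`, so a value of `0` also disables truncation, whereas this helper would return an empty string for `0`.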

## 4. Perform analysis

The whole analysis is wrapped in the `analyze` function in the cell below. It calls two entity extraction functions and one performance analysis function.

### Entity extraction

The NLTK entity extraction function is defined as:
```python
extract_entities_nltk(text)
```

The only parameter is the input text.

The spaCy entity extraction function is defined as:
```python
extract_entities_spacy(text, nlp_model)
```
There are two parameters: `text`, the input text, and `nlp_model`, the loaded spaCy model.

Both functions return a list of entities.
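
The bodies of these two functions are defined in `NLTK_utils` and are not reproduced here. The sketch below shows one common way to obtain entity lists from each library, under the assumption that entities are returned as plain strings. All function names ending in `_sketch`, and `entities_from_chunks`, are hypothetical. The NLTK variant needs the tokenizer, POS tagger, and named-entity chunker data to be installed.

```python
def entities_from_chunks(chunked):
    """Collect entity strings from the result of nltk.ne_chunk."""
    entities = []
    for node in chunked:
        # Named-entity chunks are subtrees with a label; plain tokens are (word, tag) tuples.
        if hasattr(node, "label"):
            entities.append(" ".join(word for word, _tag in node.leaves()))
    return entities


def extract_entities_nltk_sketch(text):
    """Tokenize, POS-tag, and chunk each sentence, then collect the chunks."""
    import nltk  # imported lazily so the helpers above work without NLTK

    entities = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        entities.extend(entities_from_chunks(nltk.ne_chunk(tagged)))
    return entities


def extract_entities_spacy_sketch(text, nlp_model):
    """Run the spaCy pipeline and return the text of every detected entity span."""
    return [ent.text for ent in nlp_model(text).ents]
```

The two libraries use different tokenizers and label sets, so the same mention may come back with slightly different boundaries from each, which directly affects any exact-match comparison between them.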

### Performance analysis

The performance analysis measures NLTK's results relative to spaCy's. Because the dataset is unlabeled, there is no ground truth against which to compute NLTK's precision, recall, and F1 score. We therefore treat the output of the more mature spaCy pipeline as the reference and measure how closely NLTK agrees with it. The performance analysis function is defined as:
```python
calculate_metrics(reference_entities, candidate_entities)
```

- `reference_entities`: the entities extracted by the reference method, here spaCy.
- `candidate_entities`: the entities extracted by the method being assessed, here NLTK.

This function returns a dictionary of performance metrics with the following shape:
```python
{
    "precision": round(precision, 4),
    "recall": round(recall, 4),
    "f1_score": round(f1, 4),
    "overlap_count": tp,
    "nltk_only_count": fp,
    "spacy_only_count": fn
}
```

The entities extracted from each paper and the per-paper performance metrics are saved in `output/` as `.csv` files for closer inspection.
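
As one way to inspect the saved entities, the sketch below counts how many entities each model produced per paper. It uses only the standard `csv` module and relies on the `paper_id`, `model`, and `entity` columns written by the analysis cell. The function name `count_entities_by_model` is hypothetical.

```python
import csv
from collections import Counter


def count_entities_by_model(csv_path):
    """Return a Counter keyed by (paper_id, model) with the number of entity rows."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            counts[(row["paper_id"], row["model"])] += 1
    return counts
```

After the analysis has run, calling `count_entities_by_model("output/extracted_entities.csv")` gives a quick view of which papers have the largest gap between the two models.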

In [5]:
def analyze():
    # initialize models
    print("Initializing models...")
    
    # load spaCy
    try:
        nlp_spacy = spacy.load("en_core_web_sm")
    except OSError:
        print("SpaCy model not found. Please run: python -m spacy download en_core_web_sm")
        return

    # list to store results
    all_entities_data = []
    performance_metrics = []

    # process the first MAX_NUMBER papers (MAX_NUMBER is set in the configuration cell)
    print(f"\n--- Processing {MAX_NUMBER} papers from CORD-19 ---")
    
    paper_generator = stream_cord19_data(CORD19_FILE_PATH, limit=MAX_NUMBER)
    
    for i, paper in enumerate(paper_generator):
        paper_id = paper['id']
        print(f"Processing [{i+1}/{MAX_NUMBER}]: {paper_id}")
        
        cleaned_text = clean_text(paper['text'])
        
        # 1. run the models
        # note: for a fair comparison, both methods receive the same truncated text.
        # for a quick demonstration, only the first MAX_LENGTH characters are used
        if MAX_LENGTH:
            eval_text = cleaned_text[:MAX_LENGTH]
        else:
            eval_text = cleaned_text
        
        ents_nltk = extract_entities_nltk(eval_text)
        ents_spacy = extract_entities_spacy(eval_text, nlp_spacy)
        
        # 2. store the output for following analysis
        for ent in ents_nltk:
            all_entities_data.append({'paper_id': paper_id, 'model': 'NLTK', 'entity': ent})
        for ent in ents_spacy:
            all_entities_data.append({'paper_id': paper_id, 'model': 'spaCy', 'entity': ent})
            
        # 3. compute the performance (use spaCy as the Silver Standard)
        metrics = calculate_metrics(reference_entities=ents_spacy, candidate_entities=ents_nltk)
        metrics['paper_id'] = paper_id
        performance_metrics.append(metrics)

    # ---------------------------------------------------------
    # store and display the results
    # ---------------------------------------------------------
    
    # store lists of entities
    df_entities = pd.DataFrame(all_entities_data)
    entities_csv_path = OUTPUT_DIR + "extracted_entities.csv"
    df_entities.to_csv(entities_csv_path, index=False)
    print(f"\n[Saved] All extracted entities saved to: {entities_csv_path}")
    
    # store performances
    df_perf = pd.DataFrame(performance_metrics)
    perf_csv_path = OUTPUT_DIR + "performance_metrics.csv"
    df_perf.to_csv(perf_csv_path, index=False)
    print(f"[Saved] Performance metrics saved to: {perf_csv_path}")
    
    # print average performances
    if not df_perf.empty:
        print("\n" + "="*40)
        print("Average Performance (NLTK vs spaCy as Baseline)")
        print("="*40)
        print(df_perf[['precision', 'recall', 'f1_score']].mean())
        print("="*40)
        print("Note: Since CORD-19 is unlabeled, we treat spaCy results as the")
        print("'Silver Standard' (Ground Truth) to evaluate NLTK's relative performance.")

# run the analysis
analyze()

Initializing models...

--- Processing 10 papers from CORD-19 ---
Opening dataset: data/cord-19_2022-06-02.tar.gz...
Processing [1/10]: d1aafb70c066a2068b02786f8929fd9c900897fb
Processing [2/10]: PMC35282
Processing [3/10]: 6b0567729c2143a66d737eb0a2f63f2dce2e5a7d
Processing [4/10]: PMC59543
Processing [5/10]: 06ced00a5fc04215949aa72528f2eeaae1d58927
Processing [6/10]: PMC59549
Processing [7/10]: 348055649b6b8cf2b9a376498df9bf41f7123605
Processing [8/10]: PMC59574
Processing [9/10]: 5f48792a5fa08bed9f56016f4981ae2ca6031b32
Processing [10/10]: PMC59580

[Saved] All extracted entities saved to: output/extracted_entities.csv
[Saved] Performance metrics saved to: output/performance_metrics.csv

Average Performance (NLTK vs spaCy as Baseline)
precision    0.53916
recall       0.27565
f1_score     0.33390
dtype: float64
Note: Since CORD-19 is unlabeled, we treat spaCy results as the
'Silver Standard' (Ground Truth) to evaluate NLTK's relative performance.
