<a href="https://www.nvidia.com/dli"> <img src="../images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# Assessment: Authorship Attribution
### (NVIDIA NeMo v1.0)

Authorship attribution is a type of text classification problem.  Instead of categorizing text by _topic_, as you did in the disease text classification problem, the objective is to classify the text by _author_.  

The inherent assumption in trying to solve a problem like this is that there is *some difference between the styles* of the authors in question, *which can be discerned by a model*.  Is that the case for BERT et al.?  Is a language model able to "understand" written style? 

### Table of Contents
[The Problem](#The-Problem)<br>
[Scoring](#Scoring)<br>
[Step 1: Prepare the Data](#Step-1:-Prepare-the-Data)<br>
[Step 2: Prepare the Model Configuration](#Step-2:-Prepare-the-Model-Configuration)<br>
[Step 3: Prepare the Trainer Configuration](#Step-3:-Prepare-the-Trainer-Configuration)<br>
[Step 4: Train](#Step-4:-Train)<br>
[Step 5: Infer](#Step-5:-Infer)<br>
[Step 6: Submit Your Assessment](#Step-6:-Submit-You-Assessment)

---
# The Problem
### The Federalist Papers - History Mystery!

The [Federalist Papers](https://en.wikipedia.org/wiki/The_Federalist_Papers) are a set of essays written between 1787 and 1788 by [Alexander Hamilton](https://en.wikipedia.org/wiki/Alexander_Hamilton), [James Madison](https://en.wikipedia.org/wiki/James_Madison) and [John Jay](https://en.wikipedia.org/wiki/John_Jay).  Initially published under the pseudonym 'Publius', their intent was to encourage the ratification of the then-new Constitution of the United States of America.  In later years, a list emerged identifying the author of each of the 85 papers.  Nevertheless, for a subset of these papers the author is still in question.  The problem of Federalist Papers authorship attribution has been the subject of much past research in statistical NLP.  Now you will try to answer this question with your own BERT-based model.
<img style="float: right;" src="images/HandM.png" width=400>
                                                                                                           
In concrete terms, the problem is identifying, for each one of the disputed papers, whether Alexander Hamilton or James Madison is the author.  For this exercise, you can assume that each paper has a single author, i.e., that no collaboration took place (though *that* is not 100% certain!), and that each author has a well-defined writing style that is displayed across all the identified papers. 

### Your Project
You are provided with labeled `train.tsv` and `dev.tsv` datasets for the project.  There are 10 test sets, one for each of the disputed papers.  All datasets are contained in the `data/federalist_papers_HM` directory.  

Each "sentence" is actually a group of sentences of approximately 256 words.  The labels are '0' for HAMILTON and '1' for MADISON.  There are more papers by Hamilton in the example files than by Madison.  The validation set has been created with approximately the same distribution of the two labels as in the training set.
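
Because the two labels are not balanced, it helps to know the majority-class share before judging an accuracy figure: a model that always answers HAMILTON already scores that share. The following is a minimal sketch, assuming each line holds the text, a tab, and then the label. `label_distribution` is a hypothetical helper (not part of NeMo), and the sample lines are toy data, not taken from the real files.

```python
from collections import Counter

def label_distribution(lines):
    """Count labels in lines shaped like '<text>\t<label>' (assumed layout)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Everything after the last tab is treated as the label
        _, _, label = line.rpartition("\t")
        counts[label] += 1
    return counts

# Toy lines for illustration only
sample = [
    "first toy passage\t0\n",
    "second toy passage\t0\n",
    "third toy passage\t1\n",
]
counts = label_distribution(sample)
majority_share = max(counts.values()) / sum(counts.values())
print(counts, round(majority_share, 2))
```

On the real training file, the same share gives the accuracy of a trivial "always the majority author" baseline.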

Your task is to build neural networks using NeMo, as you did in Lab 2.  You'll train your model and test it.  Then you'll use provided collation code to see what answers your model gives to the "history mystery"! 

Along the way, you'll save code snippets that will be tested with the autograder once you are done.  Submission instructions are provided at the end of the notebook for this part.

---
# Scoring
You will be assessed on your ability to set up and train a model for the project, rather than the final result.  This coding assessment is worth 70 points, divided as follows:

### Rubric

| Step                                 | Graded                                                    | FIXMEs?  | Points |
|--------------------------------------|-----------------------------------------------------------|----------|--------|
| 1. Prepare the Data                  | Fix data format (correct format)                          |  2       | 10     |
| 2. Prepare the Model Configuration   | Set model parameters for override                         |  3       | 15     |
| 3. Prepare the Trainer Configuration | Set trainer parameters for override                       |  3       | 15     |
| 4. Train                             | Run the Trainer (training logs indicate training correct) |  4       | 20     |
| 5. Infer                             | Run Inference (results indicate working project)          |  0       | 10     |

Although you are very capable at this point of building the project without any help at all, some scaffolding is provided, including specific names for variables.  This is for the benefit of the autograder, so please use these constructs for your assessment.  Also, this assessment tests the command line training method and assumes the use of the `text_classification_with_bert.py` script.  You are free to change parameters such as model name, sequence length, batch size, learning rate, number of epochs, and so on to improve your model as you see fit.

Once you are confident that you've built a reliable model, follow the instructions for submission at the end of the notebook.

### Resources and Hints
* **Example code:**<br>
In the file browser at your left, you'll find the `lab2_reference_notebooks` directory.  This contains solution notebooks from Lab 2 for text classification and NER to use as examples.
* **Language model (PRETRAINED_MODEL_NAME):**<br>
You may find it useful to try different language models to better discern style.  Specifically, it may be that capitalization is important, which would mean you'd want to try a "cased" model.
* **Maximum sequence length (MAX_SEQ_LENGTH):**<br>
Values that can be used for MAX_SEQ_LENGTH are 64, 128, or 256.  Larger models (BERT-large, Megatron) may require a smaller MAX_SEQ_LENGTH to avoid an out-of-memory error.
* **Number of Classes (NUM_CLASSES):**<br>
For the Federalist Papers, we are only concerned with HAMILTON and MADISON.  The papers by John Jay have been excluded from the dataset.
* **Batch size (BATCH_SIZE):**<br>
Larger batch sizes train faster, but large language models tend to use up the available memory quickly.
* **Memory usage:**<br>
Some of the models are very large.  If you get "RuntimeError: CUDA out of memory" during training, reduce the batch size or sequence length, or choose a smaller language model.  Then restart the kernel and run the notebook again from the beginning.
* **Accuracy and loss:**<br>
It is definitely possible to achieve 95% or more model accuracy for this project.  In addition to changes in accuracy as the model trains, pay attention to the loss value.  You want the loss value to be dropping and getting very small for best results.
* **Number of epochs (MAX_EPOCHS):**<br>
You may need to run more epochs for your model (or not!).
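
To make the batch-size hint concrete, the arithmetic below shows how batch size changes the number of optimizer steps per epoch. It is a sketch for the single-GPU setting with `accumulate_grad_batches: 1`, as in the default trainer config. `steps_per_epoch` is a hypothetical helper, and the training-set size of 1000 is an assumed figure, not the real one.

```python
import math

def steps_per_epoch(num_examples, batch_size, drop_last=False):
    """Optimizer steps in one pass over the data (single GPU, no gradient accumulation)."""
    if drop_last:
        # An incomplete final batch is discarded
        return num_examples // batch_size
    # An incomplete final batch still counts as one step
    return math.ceil(num_examples / batch_size)

num_examples = 1000  # hypothetical training-set size
for batch_size in (16, 32, 64):
    print(batch_size, steps_per_epoch(num_examples, batch_size))
```

Fewer steps per epoch usually means faster epochs, but each step then holds more activations in GPU memory.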

---
# Step 1: Prepare the Data

In [32]:
# Import useful utilities for grading
import os
import json
import glob
from omegaconf import OmegaConf

def get_latest_model():  
    nemo_model_paths = glob.glob('nemo_experiments/TextClassification/*/checkpoints/*.nemo')
    # Sort newest first
    nemo_model_paths.sort(reverse=True)
    return nemo_model_paths[0]
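
`get_latest_model` relies on a plain string sort. That works when run directories are named with zero-padded timestamps such as `2021-07-08_16-01-49` (the form that appears in the restore log in Step 5), because such names sort in chronological order. The sketch below illustrates this and adds a guard for the case where no checkpoint exists yet. `latest` is a hypothetical variant, and the second and third paths are invented for the example.

```python
def latest(paths):
    """Return the newest path, assuming zero-padded timestamp directory names."""
    if not paths:
        raise FileNotFoundError("no .nemo checkpoints found; has training finished?")
    # Reverse lexicographic order puts the most recent timestamp first
    return sorted(paths, reverse=True)[0]

paths = [
    "nemo_experiments/TextClassification/2021-07-08_09-30-00/checkpoints/TextClassification.nemo",
    "nemo_experiments/TextClassification/2021-07-08_16-01-49/checkpoints/TextClassification.nemo",
    "nemo_experiments/TextClassification/2021-07-07_23-59-59/checkpoints/TextClassification.nemo",
]
print(latest(paths))
```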

The data is located in the data directory - see the list in the following cell:

In [33]:
DATA_DIR = '/dli/task/data/federalist_papers_HM'
!ls $DATA_DIR

cached_dev_nemo_format.tsv_BertTokenizer_256_28996_-1_0_False.pkl
cached_dev_nemo_format.tsv_BertTokenizer_256_30522_-1_0_False.pkl
cached_train_nemo_format.tsv_BertTokenizer_256_28996_-1_0_True.pkl
cached_train_nemo_format.tsv_BertTokenizer_256_30522_-1_0_True.pkl
dev.tsv
dev_nemo_format.tsv
test.tsv
test49.tsv
test50.tsv
test51.tsv
test52.tsv
test53.tsv
test54.tsv
test55.tsv
test56.tsv
test57.tsv
test62.tsv
train.tsv
train_nemo_format.tsv


## Data Format (graded)
The data is not in the correct format for NeMo text classification.  Correct the data and save the new datasets in the DATA_DIR as `train_nemo_format.tsv` and `dev_nemo_format.tsv`.  You do not need to do anything with any of the test files.
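
A quick sanity check can catch a leftover header row before training starts. The sketch below assumes the target layout is one example per line, with the text, a tab, and an integer label, and no header. `find_format_problems` is a hypothetical helper, and the header text in the toy input is invented.

```python
def find_format_problems(lines, num_classes=2):
    """Return (line_number, reason) pairs for lines that break the assumed layout."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text, tab, label = line.rstrip("\n").rpartition("\t")
        if not tab:
            problems.append((number, "no tab separator"))
        elif not text.strip():
            problems.append((number, "empty text"))
        elif not label.isdigit() or int(label) >= num_classes:
            problems.append((number, f"unexpected label {label!r}"))
    return problems

# Toy input: an invented header row followed by two well-formed examples
toy = ["sentence\tlabel\n", "some toy text\t0\n", "more toy text\t1\n"]
print(find_format_problems(toy))      # the header row is reported
print(find_format_problems(toy[1:]))  # clean once the first line is dropped
```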

Complete the <i><strong style="color:green;">#FIXME</strong></i> lines and run the save cell.

In [34]:
# Correct the format for train.tsv and dev.tsv
#   and save the updates in train_nemo_format.tsv and dev_nemo_format.tsv
# `sed 1d` deletes the first line of each file, dropping the header row
!sed 1d $DATA_DIR/train.tsv > $DATA_DIR/train_nemo_format.tsv
!sed 1d $DATA_DIR/dev.tsv > $DATA_DIR/dev_nemo_format.tsv

In [35]:
# check your work
print("*****\ntrain_nemo_format.tsv sample\n*****")
!head -n 3 $DATA_DIR/train_nemo_format.tsv
print("\n\n*****\ndev_nemo_format.tsv sample\n*****")
!head -n 3 $DATA_DIR/dev_nemo_format.tsv

*****
train_nemo_format.tsv sample
*****
Concerning Dangers from Dissensions Between the States For the Independent Journal .To the People of the State of New York : THE three last numbers of this paper have been dedicated to an enumeration of the dangers to which we should be exposed , in a state of disunion , from the arms and arts of foreign nations .I shall now proceed to delineate dangers of a different and , perhaps , still more alarming kind -- those which will in all probability flow from dissensions between the States themselves , and from domestic factions and convulsions .These have been already in some instances slightly anticipated ; but they deserve a more particular and more full investigation .A man must be far gone in Utopian speculations who can seriously doubt that , if these States should either be wholly disunited , or only united in partial confederacies , the subdivisions into which they might be thrown would have frequent and violent contests with each other .To

In [36]:
# Run to save for assessment- DO NOT CHANGE
import os.path
DATA_DIR = '/dli/task/data/federalist_papers_HM'
step1 = []
try:
    with open(os.path.join(DATA_DIR,'train_nemo_format.tsv')) as f:
        content = f.readlines()
        step1 += content[:2]
    with open(os.path.join(DATA_DIR,'dev_nemo_format.tsv')) as f:
        content = f.readlines()
        step1 += content[:2]
except:
    pass
                
with open("my_assessment/step1.json", "w") as outfile: 
    json.dump(step1, outfile) 

---
# Step 2: Prepare the Model Configuration
Review the default model configuration and available language models.

In [37]:
# Take a look at the default model portion of the config file
CONFIG_DIR = "/dli/task/nemo/examples/nlp/text_classification/conf"
CONFIG_FILE = "text_classification_config.yaml"

config = OmegaConf.load(CONFIG_DIR + "/" + CONFIG_FILE)
print(OmegaConf.to_yaml(config.model))

nemo_path: text_classification_model.nemo
tokenizer:
  tokenizer_name: ${model.language_model.pretrained_model_name}
  vocab_file: null
  tokenizer_model: null
  special_tokens: null
language_model:
  pretrained_model_name: bert-base-uncased
  lm_checkpoint: null
  config_file: null
  config: null
classifier_head:
  num_output_layers: 2
  fc_dropout: 0.1
class_labels:
  class_labels_file: null
dataset:
  num_classes: ???
  do_lower_case: false
  max_seq_length: 256
  class_balancing: null
  use_cache: false
train_ds:
  file_path: null
  batch_size: 64
  shuffle: true
  num_samples: -1
  num_workers: 3
  drop_last: false
  pin_memory: false
validation_ds:
  file_path: null
  batch_size: 64
  shuffle: false
  num_samples: -1
  num_workers: 3
  drop_last: false
  pin_memory: false
test_ds:
  file_path: null
  batch_size: 64
  shuffle: false
  num_samples: -1
  num_workers: 3
  drop_last: false
  pin_memory: false
optim:
  name: adam
  lr: 2.0e-05
  betas:
  - 0.9
  - 0.999
  weight_decay:

In [38]:
# See what BERT-like language models are available
from nemo.collections import nlp as nemo_nlp
nemo_nlp.modules.get_pretrained_lm_models_list()

['megatron-bert-345m-uncased',
 'megatron-bert-345m-cased',
 'megatron-bert-uncased',
 'megatron-bert-cased',
 'biomegatron-bert-345m-uncased',
 'biomegatron-bert-345m-cased',
 'bert-base-uncased',
 'bert-large-uncased',
 'bert-base-cased',
 'bert-large-cased',
 'bert-base-multilingual-uncased',
 'bert-base-multilingual-cased',
 'bert-base-chinese',
 'bert-base-german-cased',
 'bert-large-uncased-whole-word-masking',
 'bert-large-cased-whole-word-masking',
 'bert-large-uncased-whole-word-masking-finetuned-squad',
 'bert-large-cased-whole-word-masking-finetuned-squad',
 'bert-base-cased-finetuned-mrpc',
 'bert-base-german-dbmdz-cased',
 'bert-base-german-dbmdz-uncased',
 'cl-tohoku/bert-base-japanese',
 'cl-tohoku/bert-base-japanese-whole-word-masking',
 'cl-tohoku/bert-base-japanese-char',
 'cl-tohoku/bert-base-japanese-char-whole-word-masking',
 'TurkuNLP/bert-base-finnish-cased-v1',
 'TurkuNLP/bert-base-finnish-uncased-v1',
 'wietsedv/bert-base-dutch-cased',
 'distilbert-base-uncased

## Set Parameters (graded)
Complete the <i><strong style="color:green;">#FIXME</strong></i> lines and run the save cell.

In [39]:
# set the values
NUM_CLASSES = 2 
MAX_SEQ_LENGTH = 256 
BATCH_SIZE = 16 
PATH_TO_TRAIN_FILE = "/dli/task/data/federalist_papers_HM/train_nemo_format.tsv"
PATH_TO_VAL_FILE = "/dli/task/data/federalist_papers_HM/dev_nemo_format.tsv"
PRETRAINED_MODEL_NAME = 'bert-base-cased' # change as desired
LR = 1e-4 # change as desired; only takes effect if passed to the script as model.optim.lr=$LR

In [40]:
# Run to save for assessment- DO NOT CHANGE
with open("my_assessment/step2.json", "w") as outfile: 
    json.dump([MAX_SEQ_LENGTH, NUM_CLASSES, BATCH_SIZE], outfile) 

---
# Step 3: Prepare the Trainer Configuration
Review the default trainer and exp_manager configurations.

In [41]:
print(OmegaConf.to_yaml(config.trainer))
print(OmegaConf.to_yaml(config.exp_manager))

gpus: 1
num_nodes: 1
max_epochs: 100
max_steps: null
accumulate_grad_batches: 1
gradient_clip_val: 0.0
amp_level: O0
precision: 32
accelerator: ddp
log_every_n_steps: 1
val_check_interval: 1.0
resume_from_checkpoint: null
num_sanity_val_steps: 0
checkpoint_callback: false
logger: false

exp_dir: null
name: TextClassification
create_tensorboard_logger: true
create_checkpoint_callback: true



## Set Parameters (graded)
Set the automatic mixed precision to level 1 with FP16 precision.  Set the MAX_EPOCHS to a reasonable level, perhaps between 5 and 20.  <br>Complete the <i><strong style="color:green;">#FIXME</strong></i> lines and run the save cell.

In [42]:
# set the values
MAX_EPOCHS = 5
AMP_LEVEL = 'O1' 
PRECISION = 16 

In [43]:
# Run to save for assessment - DO NOT CHANGE
with open("my_assessment/step3.json", "w") as outfile: 
    json.dump([MAX_EPOCHS, AMP_LEVEL, PRECISION], outfile) 

---
# Step 4: Train

### Run the Trainer (graded)
Complete the <i><strong style="color:green;">#FIXME</strong></i> in the following cell for train and validation batch sizes, amp level, and precision.  Then train and run the save cell!
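
The training script accepts dotted `key=value` overrides, where each dot steps one level deeper into the nested config. As a rough illustration of that idea only (a plain-dict sketch, not how the script or OmegaConf implements it), the hypothetical helper `apply_overrides` below applies such overrides to a few defaults copied from the configs printed earlier.

```python
import copy

def apply_overrides(config, overrides):
    """Return a copy of a nested dict with dotted 'a.b.c=value' overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Walk down, creating intermediate levels if they are missing
            node = node.setdefault(name, {})
        node[leaf] = raw_value  # values stay strings in this sketch
    return result

defaults = {
    "trainer": {"max_epochs": 100, "amp_level": "O0", "precision": 32},
    "model": {"train_ds": {"batch_size": 64}},
}
overrides = [
    "trainer.max_epochs=5",
    "trainer.amp_level=O1",
    "trainer.precision=16",
    "model.train_ds.batch_size=16",
]
merged = apply_overrides(defaults, overrides)
print(merged["trainer"], merged["model"]["train_ds"])
```

Any key that is not overridden keeps its default, which is why a variable that is set in the notebook but never passed on the command line has no effect on training.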

In [44]:
%%time
# Run the training script, overriding the config values in the command line
TC_DIR = "/dli/task/nemo/examples/nlp/text_classification"


!python $TC_DIR/text_classification_with_bert.py \
        model.dataset.num_classes=$NUM_CLASSES \
        model.dataset.max_seq_length=$MAX_SEQ_LENGTH \
        model.train_ds.file_path=$PATH_TO_TRAIN_FILE \
        model.validation_ds.file_path=$PATH_TO_VAL_FILE \
        model.infer_samples=[] \
        trainer.max_epochs=$MAX_EPOCHS \
        model.language_model.pretrained_model_name=$PRETRAINED_MODEL_NAME \
        trainer.amp_level=$AMP_LEVEL \
        trainer.precision=$PRECISION \
        model.train_ds.batch_size=$BATCH_SIZE \
        model.validation_ds.batch_size=$BATCH_SIZE 

    Use OmegaConf.to_yaml(cfg)
    
    
[NeMo I 2021-07-08 16:01:49 text_classification_with_bert:110] 
    Config Params:
    trainer:
      gpus: 1
      num_nodes: 1
      max_epochs: 5
      max_steps: null
      accumulate_grad_batches: 1
      gradient_clip_val: 0.0
      amp_level: O1
      precision: 16
      accelerator: ddp
      log_every_n_steps: 1
      val_check_interval: 1.0
      resume_from_checkpoint: null
      num_sanity_val_steps: 0
      checkpoint_callback: false
      logger: false
    model:
      nemo_path: text_classification_model.nemo
      tokenizer:
        tokenizer_name: ${model.language_model.pretrained_model_name}
        vocab_file: null
        tokenizer_model: null
        special_tokens: null
      language_model:
        pretrained_model_name: bert-base-cased
        lm_checkpoint: null
        config_file: null
        config: null
      classifier_head:
        num_output_layers: 2
        fc_dropout: 0.1
      class_labels:
        class_labe

In [45]:
# Run to save for assessment- DO NOT CHANGE
cmd_log = os.path.join(os.path.dirname(os.path.dirname(get_latest_model())),'cmd-args.log')
lightning_logs = os.path.join(os.path.dirname(os.path.dirname(get_latest_model())),'lightning_logs.txt')

with open(cmd_log, "r") as f:
    cmd = f.read()
    cmd_list = cmd.split()
with open("my_assessment/step4.json", "w") as outfile: 
    json.dump(cmd_list, outfile) 
    
with open(lightning_logs, "r") as f:
    log = f.readlines()
with open("my_assessment/step4_lightning.json", "w") as outfile:
    json.dump(log, outfile)

---
# Step 5: Infer

### Run Inference (graded)
Run the inference blocks to see and save the results. (Note: there is nothing to fix here)

In [46]:
# Run inference for assessment - DO NOT CHANGE
from nemo.collections import nlp as nemo_nlp

# Instantiate the model by restoring from the latest .nemo checkpoint
model = nemo_nlp.models.TextClassificationModel.restore_from(get_latest_model())

# Directory that holds the test files
DATA_DIR = '/dli/task/data/federalist_papers_HM'

test_files = [
    'test49.tsv',
    'test50.tsv',
    'test51.tsv',
    'test52.tsv',
    'test53.tsv',
    'test54.tsv', 
    'test55.tsv',
    'test56.tsv',
    'test57.tsv',
    'test62.tsv',
]
results = []
for test_file in test_files:
    # get as list and remove header row
    filepath = os.path.join(DATA_DIR, test_file)
    with open(filepath, "r") as f:
        lines = f.readlines()
    del lines[0]
    
    results.append(model.classifytext(lines, batch_size = 1, max_seq_length = 256))
print(results)

Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.
[NeMo W 2021-07-08 16:04:09 modelPT:137] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    file_path: /dli/task/data/federalist_papers_HM/train_nemo_format.tsv
    batch_size: 16
    shuffle: true
    num_samples: -1
    num_workers: 3
    drop_last: false
    pin_memory: false
    
[NeMo W 2021-07-08 16:04:09 modelPT:144] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    file_path: /dli/task/data/federalist_papers_HM/dev_nemo_format.tsv
    batch_size: 16
    shuffle: false
    num_samples: -1
    num_workers: 3
    drop_last: false
    pin_memory: false
    
[NeMo W 2021-07-08 1

[NeMo I 2021-07-08 16:04:13 modelPT:434] Model TextClassificationModel was successfully restored from nemo_experiments/TextClassification/2021-07-08_16-01-49/checkpoints/TextClassification.nemo.


[NeMo W 2021-07-08 16:04:13 text_classification_dataset:250] Found 7 out of 7 sentences with more than 256 subtokens. Truncated long sentences from the end.
[NeMo W 2021-07-08 16:04:13 text_classification_dataset:250] Found 4 out of 4 sentences with more than 256 subtokens. Truncated long sentences from the end.
[NeMo W 2021-07-08 16:04:13 text_classification_dataset:250] Found 8 out of 8 sentences with more than 256 subtokens. Truncated long sentences from the end.
[NeMo W 2021-07-08 16:04:14 text_classification_dataset:250] Found 7 out of 7 sentences with more than 256 subtokens. Truncated long sentences from the end.
[NeMo W 2021-07-08 16:04:14 text_classification_dataset:250] Found 9 out of 9 sentences with more than 256 subtokens. Truncated long sentences from the end.
[NeMo W 2021-07-08 16:04:15 text_classification_dataset:250] Found 8 out of 8 sentences with more than 256 subtokens. Truncated long sentences from the end.
[NeMo W 2021-07-08 16:04:15 text_classification_dataset:25

[[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]


In [47]:
# Run to save for assessment- DO NOT CHANGE
author = []
for result in results:
    avg_result = sum(result) / len(result)
    if avg_result < 0.5:
        author.append("HAMILTON")
        print("HAMILTON")
    else:
        author.append("MADISON")
        print("MADISON")
        
with open("my_assessment/step5.json", "w") as outfile: 
    json.dump(author, outfile) 

HAMILTON
HAMILTON
HAMILTON
HAMILTON
HAMILTON
HAMILTON
HAMILTON
HAMILTON
HAMILTON
HAMILTON
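
The save cell averages the per-chunk predictions, which for 0/1 labels amounts to a majority vote, with an exact tie going to MADISON. The sketch below also reports how lopsided a vote was. `attribute` is a hypothetical helper, and the sample list is the sixth entry of the results printed above.

```python
def attribute(predictions):
    """Majority vote over 0/1 chunk predictions; returns (author, share of MADISON votes)."""
    madison_share = sum(predictions) / len(predictions)
    author = "HAMILTON" if madison_share < 0.5 else "MADISON"
    return author, madison_share

print(attribute([0, 0, 0, 0, 1, 0, 0, 0]))  # ('HAMILTON', 0.125)
```

A share close to 0.5 would indicate a paper where the chunk-level predictions disagree and the attribution is less stable.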


---
<a id="Step-6:-Submit-You-Assessment"></a>
# Step 6: Submit Your Assessment
How were your results?  According to an earlier [machine learning analysis using support vector machines](http://pages.cs.wisc.edu/~gfung/federalist.pdf), Madison was the most likely true author of all the disputed papers (assuming no collaboration).  It is possible to get the "all MADISON" answer using the tools you have.  If you are so inclined, you can keep trying, though **a particular result is *NOT* required to pass the assessment**.

If you are satisfied that you have completed the code correctly, and that your training and inference are working correctly, you can submit your project to the autograder as follows:

1. Go back to the GPU launch page and click the checkmark to run the assessment:

<img src="../images/assessment_checkmark.png" width=600>

2. That's it!  If you passed, you'll receive a pop-up window saying so, and the points will be credited to your progress.  If not, you'll receive feedback in the pop-up window. 

<img src="../images/assessment_pass_popup.png" width=600>

You can always check your assessment progress in the course progress tab.  Note that partial credit for the coding assessment won't be visible here; the score shows up as either 0 or 70 points.  Be sure to complete the questions on Transformer and Deployment on the same course page to qualify for your final certificate!

<img src="../images/progress.png" width=600>
