<a href="https://colab.research.google.com/github/ericphann/dsba6188-group6-project/blob/main/ecfr_ner_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

__DSBA6188: Final Project__  
Group 6: Search Wizards

# Introduction

This notebook walks through building named-entity recognition (NER) models for eCFR Title 12 using spaCy.  
The annotations are a mix of manual labels and few-shot labels generated by Chat-GPT 3.5 v3. All annotations were made locally using Prodigy.





# Import packages

In [1]:
from google.colab import files
import spacy
from spacy import displacy

# Import datasets (not required)

Note: This section is optional; the ```corpus``` folders already contain the configs and files required for each model.

## Few-Shot Training Data

Our few-shot training data (```ecfr-few-shot.jsonl```) was produced using spaCy and Chat-GPT 3.5 v3. Labels and annotation guidelines were developed by Chat-GPT 3.5 v3 and are available in the write-up. Please see ```spacy-llm-config.cfg``` for configuration details and ```few_shot_examples.yml``` for the few-shot examples used.

After downloading the training data locally, please upload it below.

In [None]:
uploaded = files.upload()

Saving ecfr-few-shot.jsonl to ecfr-few-shot.jsonl
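
As a quick sanity check on an uploaded annotation file, the label distribution can be tallied. The sketch below assumes the usual Prodigy NER export layout (one JSON object per line with a `text` field and an optional `spans` list of `start`/`end`/`label` entries); the helper name `label_counts`, the sample records, and the `REGULATOR` label spelling are made up for illustration.

```python
import json
from collections import Counter

def label_counts(jsonl_text):
    """Count entity labels across Prodigy-style JSONL records (assumed layout)."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # "spans" may be missing on examples that contain no entities
        for span in record.get("spans", []):
            counts[span["label"]] += 1
    return counts

# Two made-up records standing in for the contents of the uploaded file
sample = "\n".join([
    json.dumps({"text": "The NCUA may place a credit union into conservatorship.",
                "spans": [{"start": 4, "end": 8, "label": "REGULATOR"}]}),
    json.dumps({"text": "No entities here."}),
])
print(label_counts(sample))
```

In Colab, `files.upload()` returns a dict mapping file names to bytes, so the same helper could be called as `label_counts(uploaded["ecfr-few-shot.jsonl"].decode("utf-8"))`.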


## Manual Validation Data

Our validation data (```ecfr-manual.jsonl```) was manually labelled by the team. We followed the labels and annotation guidelines developed by Chat-GPT 3.5 v3. Please refer to the write-up for details.  

After downloading the validation data locally, please upload it below.

In [None]:
uploaded = files.upload()

Saving ecfr-validation.jsonl to ecfr-validation.jsonl


# Import spaCy configs and files

First, let's download ```en_core_web_sm```.

In [2]:
%%capture
!python -m spacy download en_core_web_sm

Now let's import our spaCy configs and files, either through the following code blocks or by dragging them directly into the files panel on the left. These were all created locally using Prodigy's ```data-to-spacy``` recipe. Some notable parameters were set manually in each ```config.cfg``` because the datasets are large:
*   ```eval_frequency``` = 200
*   ```max_epochs``` = 20
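
To confirm that a given `config.cfg` carries these two values, the file can be read as plain INI text. This is only a lightweight check under the assumption that both keys live in the `[training]` section, as they do in standard spaCy configs; spaCy parses configs with its own loader, and `training_settings` and the sample fragment are illustrative.

```python
import configparser

# Illustrative fragment only; a real config.cfg has many more sections
sample_cfg = """
[training]
max_epochs = 20
eval_frequency = 200
"""

def training_settings(cfg_text):
    """Return the two manually set training values from config text."""
    # interpolation=None leaves spaCy-style ${...} references untouched
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    section = parser["training"]
    return {
        "max_epochs": section.getint("max_epochs"),
        "eval_frequency": section.getint("eval_frequency"),
    }

print(training_settings(sample_cfg))  # {'max_epochs': 20, 'eval_frequency': 200}
```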



## few-shot-corpus

Be sure to mirror the directory in Colab as shown below
or you may experience problems.


*   /few-shot-corpus
  * config.cfg
  * dev.spacy
  * train.spacy
  * /labels
      * ner.json


In [3]:
# upload config.cfg, dev.spacy, train.spacy
uploaded = files.upload()

Saving config.cfg to config.cfg
Saving dev.spacy to dev.spacy
Saving train.spacy to train.spacy


In [4]:
# upload ner.json (/labels contents)
uploaded = files.upload()

Saving ner.json to ner.json


**Organize these files accordingly before uploading anything else!**
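
If dragging files around in the panel is inconvenient, the same layout could be produced in code. The sketch below is one possible approach, not part of the original workflow: `organize_corpus` is a hypothetical helper that moves freshly uploaded files from the working directory into the folder structure shown above. The demo runs in a temporary directory with empty placeholder files so it can be tried anywhere.

```python
import shutil
import tempfile
from pathlib import Path

def organize_corpus(corpus_name, root="."):
    """Move uploaded files under `root` into <corpus_name>/ and <corpus_name>/labels/."""
    root = Path(root)
    corpus_dir = root / corpus_name
    labels_dir = corpus_dir / "labels"
    labels_dir.mkdir(parents=True, exist_ok=True)
    for name in ("config.cfg", "dev.spacy", "train.spacy"):
        src = root / name
        if src.exists():
            shutil.move(str(src), str(corpus_dir / name))
    ner = root / "ner.json"
    if ner.exists():
        shutil.move(str(ner), str(labels_dir / "ner.json"))
    return corpus_dir

# Demo with empty placeholder files in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("config.cfg", "dev.spacy", "train.spacy", "ner.json"):
        (Path(tmp) / name).touch()
    organize_corpus("few-shot-corpus", root=tmp)
    layout = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*"))

print("\n".join(layout))
```

In Colab the call would simply be `organize_corpus("few-shot-corpus")` right after the two upload cells, and likewise for the other two corpora.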

## manual-corpus

Be sure to mirror the directory in Colab as shown below
or you may experience problems.


*   /manual-corpus
  * config.cfg
  * dev.spacy
  * train.spacy
  * /labels
      * ner.json

In [5]:
# upload config.cfg, dev.spacy, train.spacy
uploaded = files.upload()

Saving config.cfg to config.cfg
Saving dev.spacy to dev.spacy
Saving train.spacy to train.spacy


In [6]:
# upload ner.json (/labels contents)
uploaded = files.upload()

Saving ner.json to ner.json


**Organize these files accordingly before uploading anything else!**

## mixed-corpus

Be sure to mirror the directory in Colab as shown below
or you may experience problems.


*   /mixed-corpus
  * config.cfg
  * dev.spacy
  * train.spacy
  * /labels
      * ner.json

In [7]:
# upload config.cfg, dev.spacy, train.spacy
uploaded = files.upload()

Saving config.cfg to config.cfg
Saving dev.spacy to dev.spacy
Saving train.spacy to train.spacy


In [8]:
# upload ner.json (/labels contents)
uploaded = files.upload()

Saving ner.json to ner.json


**Organize these files accordingly before uploading anything else!**

# Training the models

We will train and evaluate 3 models:


*   Few-shot train/validation (```few-shot-corpus```)
*   Manual train/validation (```manual-corpus```)
*   Few-shot train/manual validation (```mixed-corpus```)



Let's check to see if we can use Colab's T4 GPU and set it as our preference.

In [9]:
gpu = spacy.prefer_gpu()
print(gpu)

True
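
The training commands in the next sections pass `--gpu-id 0`, which relies on the check above printing `True`. If no GPU is available, spaCy's CLI accepts `--gpu-id -1` to train on CPU. The sketch below shows one way to assemble the command from that flag; `build_train_command` is a hypothetical helper, not something the notebook defines.

```python
def build_train_command(corpus, output, use_gpu):
    """Assemble the `spacy train` arguments for one corpus folder."""
    gpu_id = "0" if use_gpu else "-1"  # -1 tells spaCy to use the CPU
    return [
        "python", "-m", "spacy", "train", f"./{corpus}/config.cfg",
        "--paths.train", f"./{corpus}/train.spacy",
        "--paths.dev", f"./{corpus}/dev.spacy",
        "--gpu-id", gpu_id,
        "--output", f"./{output}",
    ]

# CPU fallback for the few-shot corpus
print(" ".join(build_train_command("few-shot-corpus", "few-shot-model", use_gpu=False)))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`, with `use_gpu` set from the value returned by `spacy.prefer_gpu()`.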


## Few-shot Training & Validation

Let's train the model using part of ```ecfr-few-shot.jsonl``` as training data and part as validation data. Because our dataset is large, we use only 80 few-shot examples for training and 20 additional ones for validation.

In [17]:
!python -m spacy train ./few-shot-corpus/config.cfg --paths.train ./few-shot-corpus/train.spacy --paths.dev ./few-shot-corpus/dev.spacy --gpu-id 0  --output ./few-shot-model

ℹ Saving to output directory: few-shot-model
ℹ Using GPU: 0
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00   1065.22    0.00    0.00    0.00    0.00
  2     200       2434.73   7966.80    9.47   40.91    5.36    0.09
  5     400        509.11   2253.38   51.55   53.90   49.40    0.52
  8     600       1617.82   1692.45   33.63   33.94   33.33    0.34
 10     800        280.17   1312.56   58.82   61.29   56.55    0.59
 13    1000        285.32    997.52   33.33   34.62   32.14    0.33
 16    1200        427.54    805.82   53.29   53.61   52.98    0.53
 19    1400        548.08    676.20   53.01   53.66   52.38    0.53
✔ Saved pipeline to output directory
few-shot-model/model-last


Training for about 20 epochs took roughly 4m 22s, with a best F1 score of 0.59.
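
The best checkpoint can also be read off the training table programmatically. The sketch below pastes the table printed above into a string and picks the row with the highest `ENTS_F`; the column names and the `parse_rows` helper are illustrative choices.

```python
log = """\
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00   1065.22    0.00    0.00    0.00    0.00
  2     200       2434.73   7966.80    9.47   40.91    5.36    0.09
  5     400        509.11   2253.38   51.55   53.90   49.40    0.52
  8     600       1617.82   1692.45   33.63   33.94   33.33    0.34
 10     800        280.17   1312.56   58.82   61.29   56.55    0.59
 13    1000        285.32    997.52   33.33   34.62   32.14    0.33
 16    1200        427.54    805.82   53.29   53.61   52.98    0.53
 19    1400        548.08    676.20   53.01   53.66   52.38    0.53
"""

COLUMNS = ("epoch", "step", "loss_tok2vec", "loss_ner",
           "ents_f", "ents_p", "ents_r", "score")

def parse_rows(text):
    """Turn the numeric rows of a spaCy training table into dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(COLUMNS):
            continue  # header line has a different token count
        try:
            values = [float(p) for p in parts]
        except ValueError:
            continue  # separator line of dashes
        rows.append(dict(zip(COLUMNS, values)))
    return rows

best = max(parse_rows(log), key=lambda row: row["ents_f"])
print(int(best["epoch"]), int(best["step"]), best["ents_f"])  # 10 800 58.82
```

The peak at step 800 corresponds to the 0.59 score quoted above.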

Run the following code block if you'd like to download the resulting model locally.

In [18]:
!zip -r ./few-shot-model.zip ./few-shot-model
from google.colab import files
files.download("./few-shot-model.zip")

updating: few-shot-model/ (stored 0%)
updating: few-shot-model/model-best/ (stored 0%)
updating: few-shot-model/model-best/ner/ (stored 0%)
updating: few-shot-model/model-best/ner/moves (deflated 57%)
updating: few-shot-model/model-best/ner/model (deflated 8%)
updating: few-shot-model/model-best/ner/cfg (deflated 36%)
updating: few-shot-model/model-best/vocab/ (stored 0%)
updating: few-shot-model/model-best/vocab/strings.json (deflated 75%)
updating: few-shot-model/model-best/vocab/lookups.bin (stored 0%)
updating: few-shot-model/model-best/vocab/vectors.cfg (stored 0%)
updating: few-shot-model/model-best/vocab/vectors (deflated 45%)
updating: few-shot-model/model-best/vocab/key2row (stored 0%)
updating: few-shot-model/model-best/tok2vec/ (stored 0%)
updating: few-shot-model/model-best/tok2vec/model (deflated 8%)
updating: few-shot-model/model-best/tok2vec/cfg (stored 0%)
updating: few-shot-model/model-best/tokenizer (deflated 81%)
updating: few-shot-model/model-best/config.cfg (deflat


## Manual Training & Validation

Let's train the model using part of ```ecfr-manual.jsonl``` as training data and part as validation data. Because our dataset is large, we use only 80 manual examples for training and 20 additional ones for validation.

In [13]:
!python -m spacy train ./manual-corpus/config.cfg --paths.train ./manual-corpus/train.spacy --paths.dev ./manual-corpus/dev.spacy --gpu-id 0  --output ./manual-model

ℹ Saving to output directory: manual-model
ℹ Using GPU: 0
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00    242.33    0.31    0.17    1.92    0.00
  2     200       1140.44   6275.21    6.59   23.08    3.85    0.07
  5     400       3348.08   4276.95    4.62   23.53    2.56    0.05
  7     600       1855.54   2771.84    8.10   10.99    6.41    0.08
 10     800        457.77   2229.18   13.33   14.73   12.18    0.13
 13    1000        555.28   1906.27    7.81   10.00    6.41    0.08
 15    1200        845.60   1758.85    8.40   10.38    7.05    0.08
 18    1400        596.12   1364.76   10.19    8.20   13.46    0.10
✔ Saved pipeline to output directory
manual-model/model-last


Training for about 20 epochs took roughly 4m 36s, with a best F1 score of 0.13.

Run the following code block if you'd like to download the resulting model locally.

In [16]:
!zip -r ./manual-model.zip ./manual-model
from google.colab import files
files.download("./manual-model.zip")

updating: manual-model/ (stored 0%)
updating: manual-model/model-best/ (stored 0%)
updating: manual-model/model-best/ner/ (stored 0%)
updating: manual-model/model-best/ner/moves (deflated 57%)
updating: manual-model/model-best/ner/model (deflated 7%)
updating: manual-model/model-best/ner/cfg (deflated 36%)
updating: manual-model/model-best/vocab/ (stored 0%)
updating: manual-model/model-best/vocab/strings.json (deflated 74%)
updating: manual-model/model-best/vocab/lookups.bin (stored 0%)
updating: manual-model/model-best/vocab/vectors.cfg (stored 0%)
updating: manual-model/model-best/vocab/vectors (deflated 45%)
updating: manual-model/model-best/vocab/key2row (stored 0%)
updating: manual-model/model-best/tok2vec/ (stored 0%)
updating: manual-model/model-best/tok2vec/model (deflated 8%)
updating: manual-model/model-best/tok2vec/cfg (stored 0%)
updating: manual-model/model-best/tokenizer (deflated 81%)
updating: manual-model/model-best/config.cfg (deflated 62%)
updating: manual-model/mod


## Few-shot Training / Manual Validation (Mixed)

Let's train the model using part of ```ecfr-few-shot.jsonl``` as training data and part of ```ecfr-manual.jsonl``` as validation data. Because our dataset is large, we use only 80 few-shot examples for training and 20 manual examples for validation.

In [19]:
!python -m spacy train ./mixed-corpus/config.cfg --paths.train ./mixed-corpus/train.spacy --paths.dev ./mixed-corpus/dev.spacy --gpu-id 0  --output ./mixed-model

✔ Created output directory: mixed-model
ℹ Saving to output directory: mixed-model
ℹ Using GPU: 0
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00   1234.67    0.00    0.00    0.00    0.00
  2     200       2912.50   8780.14    4.02    3.79    4.27    0.04
  5     400        118.26   2019.53    6.48    6.15    6.84    0.06
  8     600       5443.40   2474.78    7.72    7.04    8.55    0.08
 10     800       1323.84   1608.70    8.37    9.18    7.69    0.08
 13    1000       2455.21   1212.95    7.56    7.44    7.69    0.08
 16    1200        344.86    837.09    4.78    4.48    5.13    0.05
 18    1400        467.97    751.98    5.56    5.19    5.98    0.06
✔ Saved pipeline to output directory
mix

Training for about 20 epochs took roughly 4m 29s, with a best F1 score of 0.08.

Run the following code block if you'd like to download the resulting model locally.

In [20]:
!zip -r ./mixed-model.zip ./mixed-model
from google.colab import files
files.download("./mixed-model.zip")

  adding: mixed-model/ (stored 0%)
  adding: mixed-model/model-best/ (stored 0%)
  adding: mixed-model/model-best/ner/ (stored 0%)
  adding: mixed-model/model-best/ner/moves (deflated 59%)
  adding: mixed-model/model-best/ner/model (deflated 8%)
  adding: mixed-model/model-best/ner/cfg (deflated 36%)
  adding: mixed-model/model-best/vocab/ (stored 0%)
  adding: mixed-model/model-best/vocab/strings.json (deflated 75%)
  adding: mixed-model/model-best/vocab/lookups.bin (stored 0%)
  adding: mixed-model/model-best/vocab/vectors.cfg (stored 0%)
  adding: mixed-model/model-best/vocab/vectors (deflated 45%)
  adding: mixed-model/model-best/vocab/key2row (stored 0%)
  adding: mixed-model/model-best/tok2vec/ (stored 0%)
  adding: mixed-model/model-best/tok2vec/model (deflated 8%)
  adding: mixed-model/model-best/tok2vec/cfg (stored 0%)
  adding: mixed-model/model-best/tokenizer (deflated 81%)
  adding: mixed-model/model-best/config.cfg (deflated 62%)
  adding: mixed-model/model-best/meta.json 


# Evaluating the models

While we can evaluate quantitatively using the F1 scores from above, this section will focus on qualitative evaluation.  

We have selected 5 examples not used in training and annotated them with each of our models. There were no specific selection criteria other than that the examples should ideally be meaningful and short.
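
For context on the quantitative side, the `ENTS_P`, `ENTS_R` and `ENTS_F` columns in the training tables are entity-level precision, recall and F1. The sketch below shows how such scores can be computed under an exact-match rule (a predicted entity counts only if its start, end and label all agree with a gold entity). The `prf` helper, the offsets, and the label names are made up for illustration, and this is not spaCy's own scorer.

```python
def prf(gold, predicted):
    """Entity-level precision, recall and F1 with exact span-and-label matching."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical (start, end, label) spans for one sentence
gold = {(4, 8, "REGULATOR"), (60, 82, "INSTITUTION")}
pred = {(4, 8, "REGULATOR"), (30, 44, "INSTITUTION")}

print(prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Under this rule a span that is off by even one character counts as both a false positive and a false negative, which is one reason scores on long regulatory phrases can be low.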

## Few-shot Model

First, we must load in the best few-shot model.

In [25]:
nlp = spacy.load("./few-shot-model/model-best")



In [26]:
doc = nlp("Notwithstanding any other provision of this title, the NCUA may, without any administrative due process,\
 immediately place into conservatorship or liquidation any corporate credit union that has been categorized as critically undercapitalized.")
displacy.render(doc, style="ent")

In [27]:
doc = nlp("Loans made under title III by banks for cooperatives and agricultural credit banks may be made to eligible domestic parties\
 domiciled within any territory that may be served by Farm Credit institutions under section 1.2 of the Act and to eligible foreign parties without regard to domicile.")
displacy.render(doc, style="ent")

In [28]:
doc = nlp("If an interlocutory appeal or collateral attack is brought in any court concerning all or any part of an adjudicatory proceeding,\
 the challenged adjudicatory proceeding shall continue without regard to the pendency of that court proceeding.\
  No default or other failure to act as directed in the adjudicatory proceeding within the times prescribed in this subpart shall\
   be excused based on the pendency before any court of any interlocutory appeal or collateral attack."
)
displacy.render(doc, style="ent")



In [33]:
doc = nlp("""If the Board of Directors finds that a savings association is a special supervisory association
 under the provisions of section 8(a)(8)(B) of the FDIA (12 U.S.C. 1818(a)(8)(B))
 for purposes of temporary suspension of insured status, the Board of Directors shall serve
  upon the association its findings with regard to the determination that the capital of the association,
   as computed using applicable accounting standards, has suffered a material decline; that such association or its directors or officers,
    is engaging in an unsafe or unsound practice in conducting the business of the association; that such association is in an unsafe or unsound condition
     to continue operating as an insured association; or that such association or its directors or officers, has violated any law, rule, regulation, order, condition
      imposed in writing by any Federal banking agency, or any written agreement, or that the association failed to enter into a capital improvement plan acceptable
       to the Corporation prior to January, 1990."""
)
displacy.render(doc, style="ent")

In [32]:
doc = nlp("The conservator or receiver may enforce any contract entered into by\
 the regulated entity pursuant to the provisions and subject to the restrictions of section 1367(d)(13) of the Safety and Soundness Act.")
displacy.render(doc, style="ent")

Notes:


*   The model labels almost nothing other than __regulator__ and produced only one __institution__ label. This may be due to a label imbalance in the training data.
*   Overall, the model produces very few labels; example 3 has none at all.



## Manual Model

In [34]:
nlp = spacy.load("./manual-model/model-best")



In [35]:
doc = nlp("Notwithstanding any other provision of this title, the NCUA may, without any administrative due process,\
 immediately place into conservatorship or liquidation any corporate credit union that has been categorized as critically undercapitalized.")
displacy.render(doc, style="ent")



In [36]:
doc = nlp("Loans made under title III by banks for cooperatives and agricultural credit banks may be made to eligible domestic parties\
 domiciled within any territory that may be served by Farm Credit institutions under section 1.2 of the Act and to eligible foreign parties without regard to domicile.")
displacy.render(doc, style="ent")

In [37]:
doc = nlp("If an interlocutory appeal or collateral attack is brought in any court concerning all or any part of an adjudicatory proceeding,\
 the challenged adjudicatory proceeding shall continue without regard to the pendency of that court proceeding.\
  No default or other failure to act as directed in the adjudicatory proceeding within the times prescribed in this subpart shall\
   be excused based on the pendency before any court of any interlocutory appeal or collateral attack."
)
displacy.render(doc, style="ent")

In [38]:
doc = nlp("""If the Board of Directors finds that a savings association is a special supervisory association under the provisions
 of section 8(a)(8)(B) of the FDIA (12 U.S.C. 1818(a)(8)(B)) for purposes of temporary suspension of insured status,
  the Board of Directors shall serve upon the association its findings with regard to the determination that the capital of the association,
   as computed using applicable accounting standards, has suffered a material decline; that such association or its directors or officers,
    is engaging in an unsafe or unsound practice in conducting the business of the association; that such association is in an unsafe or unsound condition
     to continue operating as an insured association; or that such association or its directors or officers, has violated any law, rule, regulation, order, condition
      imposed in writing by any Federal banking agency, or any written agreement, or that the association failed to enter into a capital improvement plan acceptable
       to the Corporation prior to January, 1990."""
)
displacy.render(doc, style="ent")

In [39]:
doc = nlp("The conservator or receiver may enforce any contract entered into by\
 the regulated entity pursuant to the provisions and subject to the restrictions of section 1367(d)(13) of the Safety and Soundness Act.")
displacy.render(doc, style="ent")

Notes:

*   Even worse than the few-shot model: fewer labels overall, and 3 out of 5 examples have no labels at all.
*   Successfully identified an __ACT__ entity.



## Mixed Model

In [40]:
nlp = spacy.load("./mixed-model/model-best")

In [41]:
doc = nlp("Notwithstanding any other provision of this title, the NCUA may, without any administrative due process,\
 immediately place into conservatorship or liquidation any corporate credit union that has been categorized as critically undercapitalized.")
displacy.render(doc, style="ent")

In [42]:
doc = nlp("Loans made under title III by banks for cooperatives and agricultural credit banks may be made to eligible domestic parties\
 domiciled within any territory that may be served by Farm Credit institutions under section 1.2 of the Act and to eligible foreign parties without regard to domicile.")
displacy.render(doc, style="ent")

In [43]:
doc = nlp("If an interlocutory appeal or collateral attack is brought in any court concerning all or any part of an adjudicatory proceeding,\
 the challenged adjudicatory proceeding shall continue without regard to the pendency of that court proceeding.\
  No default or other failure to act as directed in the adjudicatory proceeding within the times prescribed in this subpart shall\
   be excused based on the pendency before any court of any interlocutory appeal or collateral attack."
)
displacy.render(doc, style="ent")

In [44]:
doc = nlp("""If the Board of Directors finds that a savings association is a special supervisory association under the provisions
 of section 8(a)(8)(B) of the FDIA (12 U.S.C. 1818(a)(8)(B)) for purposes of temporary suspension of insured status,
  the Board of Directors shall serve upon the association its findings with regard to the determination that the capital of the association,
   as computed using applicable accounting standards, has suffered a material decline; that such association or its directors or officers,
    is engaging in an unsafe or unsound practice in conducting the business of the association; that such association is in an unsafe or unsound condition
     to continue operating as an insured association; or that such association or its directors or officers, has violated any law, rule, regulation, order, condition
      imposed in writing by any Federal banking agency, or any written agreement, or that the association failed to enter into a capital improvement plan acceptable
       to the Corporation prior to January, 1990."""
)
displacy.render(doc, style="ent")

In [45]:
doc = nlp("The conservator or receiver may enforce any contract entered into by\
 the regulated entity pursuant to the provisions and subject to the restrictions of section 1367(d)(13) of the Safety and Soundness Act.")
displacy.render(doc, style="ent")