**Project Milestone 3**  
Group 6: Search Wizards

# Dependencies

Install and import necessary packages.

In [None]:
%%capture
!pip install datasets

In [None]:
import pandas as pd
import json
import spacy

from spacy import displacy
from datasets import load_dataset

# Data Preprocessing

Import the dataset from Hugging Face.

In [None]:
data = load_dataset("wesslen/ecfr-title-12")

Let's check the data structure.

In [None]:
data['train'][0]

{'text': '(a) (1) Compliance with the requirements of this part shall be enforced under - , (i) Section 8 of the Federal Deposit Insurance Act, by the appropriate Federal banking agency, as defined in section 3(q) of the Federal Deposit Insurance Act (12 U.S.C. 1813(q)), with respect to - , (A) National banks, federal savings associations, and federal branches and federal agencies of foreign banks;, (B) Member banks of the Federal Reserve System (other than national banks), branches and agencies of foreign banks (other than federal branches, federal Agencies, and insured state branches of foreign banks), commercial lending companies owned or controlled by foreign banks, and organizations operating under section 25 or 25A of the Federal Reserve Act;, (C) Banks and state savings associations insured by the Federal Deposit Insurance Corporation (other than members of the Federal Reserve System), and insured state branches of foreign banks;, (ii) The Federal Credit Union Act (12 U.S.C. 175

To build our model, we need only the `text` field from each record in the dataset.

In [None]:
def prepare_training_data(data):
    train_data = []
    for item in data:
        text = item.get('text', '')  # Get text, defaulting to an empty string if the key is missing
        train_data.append(text)
    return train_data

In [None]:
train_data = prepare_training_data(data['train'])

We now have a list of raw text strings from our dataset.

In [None]:
train_data[0]

'(a) (1) Compliance with the requirements of this part shall be enforced under - , (i) Section 8 of the Federal Deposit Insurance Act, by the appropriate Federal banking agency, as defined in section 3(q) of the Federal Deposit Insurance Act (12 U.S.C. 1813(q)), with respect to - , (A) National banks, federal savings associations, and federal branches and federal agencies of foreign banks;, (B) Member banks of the Federal Reserve System (other than national banks), branches and agencies of foreign banks (other than federal branches, federal Agencies, and insured state branches of foreign banks), commercial lending companies owned or controlled by foreign banks, and organizations operating under section 25 or 25A of the Federal Reserve Act;, (C) Banks and state savings associations insured by the Federal Deposit Insurance Corporation (other than members of the Federal Reserve System), and insured state branches of foreign banks;, (ii) The Federal Credit Union Act (12 U.S.C. 1751 et seq.
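Before passing these strings to a model, it can help to confirm that the list looks as expected. The following is a minimal sketch, not part of the original notebook: `summarize_texts` is a hypothetical helper, and `sample_texts` is made-up stand-in data so the block runs without downloading the dataset.

```python
from statistics import mean


def summarize_texts(texts):
    """Return basic size statistics for a list of strings.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    lengths = [len(t) for t in texts]
    return {
        "count": len(texts),
        # Strings that are empty or contain only whitespace
        "empty": sum(1 for t in texts if not t.strip()),
        "min_chars": min(lengths) if lengths else 0,
        "max_chars": max(lengths) if lengths else 0,
        "mean_chars": mean(lengths) if lengths else 0.0,
    }


# Made-up stand-in data; in the notebook this would be `train_data`.
sample_texts = ["(a) Scope.", "", "(b) Definitions apply to this part."]
summary = summarize_texts(sample_texts)
print(summary)
```

In the notebook, calling `summarize_texts(train_data)` would give the same statistics for the real texts.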

# Building the Model

We are going to use the spaCy model **en_core_web_sm** to generate NER labels.

In [None]:
# Load the small English pipeline: tokenizer, tagger, parser, and NER
# (en_core_web_sm does not ship with static word vectors)
nlp = spacy.load("en_core_web_sm")



Let's see what labels the model generates when we run some of our text through it, using displaCy to visualize the entities.

In [None]:
doc = nlp(train_data[0])
displacy.render(doc, style="ent")

The model produces several labels that are useful for our purposes. For this proof of concept, we will use the following out-of-the-box labels:
*  ORG
*  LAW
*  NORP



In [None]:
desired_labels = {"ORG", "LAW", "NORP"}
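The set above works as an allow-list: an entity is kept only when its label is a member of the set. As a minimal sketch of that filtering step, the block below uses hand-written `(text, label)` pairs in place of real spaCy entities, so it runs without a model. The helper name `count_desired_entities` and the sample pairs are assumptions for illustration; with a real `doc`, the pairs could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter

desired_labels = {"ORG", "LAW", "NORP"}


def count_desired_entities(pairs, labels):
    """Count (text, label) pairs whose label is in `labels`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return Counter((text, label) for text, label in pairs if label in labels)


# Hand-written stand-ins for (ent.text, ent.label_) pairs.
sample_pairs = [
    ("the Federal Trade Commission", "ORG"),
    ("the Federal Trade Commission", "ORG"),
    ("The Federal Aviation Act", "LAW"),
    ("12", "CARDINAL"),  # dropped: CARDINAL is not in desired_labels
]

counts = count_desired_entities(sample_pairs, desired_labels)
for (text, label), n in counts.most_common():
    print(f"{n:>3}  {label:<5} {text}")
```

Counting matches instead of printing each one keeps repeated mentions of the same entity from dominating the output.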

Let's see the entity labels produced for the first **5** texts in our dataset.

In [None]:
for i, text in enumerate(train_data[:5]):
    print(f"Example {i}")
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in desired_labels:
            print(ent.text, ent.label_)
    print("\n")

Example 0
the Federal Deposit Insurance Act ORG
Federal banking agency ORG
section 3(q LAW
the Federal Deposit Insurance Act ORG
1813(q ORG
the Federal Reserve System ORG
Agencies ORG
section 25 or LAW
the Federal Reserve Act ORG
the Federal Deposit Insurance Corporation ORG
the Federal Reserve System ORG
The Federal Credit Union Act ORG
the Administrator of the National Credit Union Administration ORG
National Credit Union Administration Board ORG
The Federal Aviation Act LAW
Transportation ORG
The Securities Exchange Act ORG
the Securities and Exchange Commission ORG
the Federal Deposit Insurance Act ORG
section 1(b LAW
Federal Trade Commission ORG
the Consumer Financial Protection Act ORG
the Federal Trade Commission ORG
the Federal Trade Commission ORG
the Federal Trade Commission Act ORG
the Federal Trade Commission Act ORG
the Federal Trade Commission ORG
the Federal Trade Commission Act ORG
the Federal Trade Commission ORG
the Federal Trade Commission ORG
the Federal Trade Commi