**Project Milestone 2**  
Group 6: Search Wizards


    How is the data organized?
    Are there any metadata (e.g., author name, title, etc.) that can help you?
    How is the unstructured text represented (e.g., sentence, paragraph, section or full document)?
    Is there a more appropriate unit of analysis (e.g., would it be easier to chunk your documents down to sentence level)?
    How are you evaluating the quality of your data?
    What are challenges with the data?


**Upload the dataset onto Google Colab.**

In [None]:
from google.colab import files
uploaded = files.upload()

import pandas as pd
import json

Saving ecfr-title-12.jsonl to ecfr-title-12.jsonl


In [None]:
# Name of the uploaded JSON Lines file (one JSON record per line)
file_name = 'ecfr-title-12.jsonl'

In [None]:
print(uploaded.keys())


dict_keys(['ecfr-title-12.jsonl'])


In [None]:
# Access the contents of the uploaded file
file_content = uploaded[file_name]

# Convert the bytes content to a string
file_content_str = file_content.decode('utf-8')

# peek at contents of the file
print(file_content_str[:50])

{"text":"","meta":{"chapter":["II","II","II","II",


In [None]:
# put each record (delineated by a line) into a list
data = []
for line in file_content_str.split('\n'):
    if line.strip():  # Check if the line is not empty
        data.append(json.loads(line))

**Q: How is the data organized?**

Let's take a look at the first 5 items in our list.

In [None]:
for item in data[:5]:
    print(type(item))
    print(item, "\n")

<class 'dict'>
{'text': '', 'meta': {'chapter': ['II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'II', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'III', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VI', 'VII', 'VII', 'VII', 'VII', 'VII', 'VII', 'VII', 'VII', 'VII', 'VII'

Each item is a dict. The strings of text are long and verbose, as expected from federal regulations, and they are hard to read in this raw form.

Let's try to convert it into a dataframe.

In [None]:
df = pd.DataFrame(data)

print(df.head(), "\n")
print(df.dtypes)

                                                text  \
0                                                      
1  (a) (1) Compliance with the requirements of th...   
2  (a) <E T="04">Federal Register. The Board publ...   
3  (a) <E T="04">Federal Register. The Committee ...   
4  (a) A Bank shall make long-term advances only ...   

                                                meta  
0  {'chapter': ['II', 'II', 'II', 'II', 'II', 'II...  
1  {'chapter': ['II'], 'chapter_title': ['CHAPTER...  
2  {'chapter': ['II'], 'chapter_title': ['CHAPTER...  
3  {'chapter': ['II'], 'chapter_title': ['CHAPTER...  
4  {'chapter': ['XII'], 'chapter_title': ['CHAPTE...   

text    object
meta    object
dtype: object


This is a bit easier to read. Each record has two fields: `text`, which holds the string of regulation text, and `meta`, which holds its metadata. The metadata is itself a dictionary, so the records are nested, which is common for JSON.

We can also read the JSONL file directly with pandas instead of first converting the data into a list.

In [None]:
df = pd.read_json(file_name, lines=True)
df.head(10)

Unnamed: 0,text,meta
0,,"{'chapter': ['II', 'II', 'II', 'II', 'II', 'II..."
1,(a) (1) Compliance with the requirements of th...,"{'chapter': ['II'], 'chapter_title': ['CHAPTER..."
2,"(a) <E T=""04"">Federal Register. The Board publ...","{'chapter': ['II'], 'chapter_title': ['CHAPTER..."
3,"(a) <E T=""04"">Federal Register. The Committee ...","{'chapter': ['II'], 'chapter_title': ['CHAPTER..."
4,(a) A Bank shall make long-term advances only ...,"{'chapter': ['XII'], 'chapter_title': ['CHAPTE..."
5,(a) A Bank shall require each member to mainta...,"{'chapter': ['XII'], 'chapter_title': ['CHAPTE..."
6,(a) A Bank shall require the borrower to certi...,"{'chapter': ['XII'], 'chapter_title': ['CHAPTE..."
7,(a) A Board-regulated institution described in...,"{'chapter': ['II'], 'chapter_title': ['CHAPTER..."
8,(a) A Board-regulated institution that is an a...,"{'chapter': ['II'], 'chapter_title': ['CHAPTER..."
9,(a) A Farm Credit Bank or agricultural credit ...,"{'chapter': ['VI'], 'chapter_title': ['CHAPTER..."


In [None]:
df.describe(include='object')

Unnamed: 0,text,meta
count,4666.0,4666
unique,4666.0,4666
top,,"{'chapter': ['II', 'II', 'II', 'II', 'II', 'II..."
freq,1.0,1


In [None]:
# no null values (an empty string, as in record 0, is not counted as null)
df.isnull().sum()

text    0
meta    0
dtype: int64

It looks like the first record is the table of contents for the entire document, or all the metadata combined. The records also do not appear to be in any meaningful order: the chapter attribute in meta jumps around from record to record (II, II, II, XII, XII, XII, II, II, VI for records 1 through 9).

**Q: Are there any metadata (e.g., author name, title, etc.) that can help you?**

Yes. Every record carries a `meta` dictionary, and the first record is effectively a table of contents for the entire document: its `text` is empty and its `meta` combines the metadata of all the other records.

Comparing this table of contents with the eCFR website shows that some content present on the website is missing from the dataset, such as the entirety of Chapter I. Other gaps in the data appear to mirror the website, e.g., reserved chapters.

In [None]:
df.iloc[0, 1]

{'chapter': ['II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'II',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'III',
  'VI',
  'VI',
  'VI',
  'VI',
  'VI',
  'VI',
  'VI',
  'VI',
  'VI',
  'VI',
  'VI',
  'VI',
  'VI',
  'VI',
  'VI',
  'VI',
  'VI',
  'VI'

Let's take a look at a metadata example more closely.

In [None]:
df.iloc[1, 1]

{'chapter': ['II'],
 'chapter_title': ['CHAPTER II - FEDERAL RESERVE SYSTEM (CONTINUED)'],
 'subchapter': ['A'],
 'subchapter_title': ['SUBCHAPTER A - BOARD OF GOVERNORS OF THE FEDERAL RESERVE SYSTEM (CONTINUED)'],
 'part': ['235'],
 'part_title': ['PART 235 - DEBIT CARD INTERCHANGE FEES AND ROUTING (REGULATION II)'],
 'section': ['235.9'],
 'section_title': ['§ 235.9   Administrative enforcement.']}

We can see the various metadata attributes that each record might have, such as the chapter, chapter title, etc. Together they represent the hierarchical structure of the original document. This metadata allows us to find and situate the text within the overall document, from the chapter down to the section.
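
To work with this hierarchy more directly, the `meta` dictionary could be flattened so that each attribute becomes a plain string. The following is a minimal sketch, assuming every attribute in a regular record is a one-element list as in the example above; `flatten_meta` is a hypothetical helper name, not something provided by the dataset or by pandas.

```python
def flatten_meta(meta):
    """Turn {'chapter': ['II'], ...} into {'chapter': 'II', ...}.

    Assumption: each value is a list. Only the first element is kept,
    and an empty list becomes None.
    """
    return {key: (values[0] if values else None) for key, values in meta.items()}


# Example record copied from the metadata shown above
sample_meta = {
    'chapter': ['II'],
    'chapter_title': ['CHAPTER II - FEDERAL RESERVE SYSTEM (CONTINUED)'],
    'subchapter': ['A'],
    'subchapter_title': ['SUBCHAPTER A - BOARD OF GOVERNORS OF THE FEDERAL RESERVE SYSTEM (CONTINUED)'],
    'part': ['235'],
    'part_title': ['PART 235 - DEBIT CARD INTERCHANGE FEES AND ROUTING (REGULATION II)'],
    'section': ['235.9'],
    'section_title': ['§ 235.9   Administrative enforcement.'],
}

flat = flatten_meta(sample_meta)
print(flat['chapter'], flat['part'], flat['section'])
```

Applied to the whole column, something like `pd.DataFrame(df['meta'].apply(flatten_meta).tolist())` should give one column per attribute. The table-of-contents record (row 0) would need to be dropped first, since its lists hold many values.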

**Q: How is the unstructured text represented (e.g., sentence, paragraph, section or full document)?**

Let's take a look at a text example more closely.

In [None]:
df.iloc[1,0]

'(a) (1) Compliance with the requirements of this part shall be enforced under - , (i) Section 8 of the Federal Deposit Insurance Act, by the appropriate Federal banking agency, as defined in section 3(q) of the Federal Deposit Insurance Act (12 U.S.C. 1813(q)), with respect to - , (A) National banks, federal savings associations, and federal branches and federal agencies of foreign banks;, (B) Member banks of the Federal Reserve System (other than national banks), branches and agencies of foreign banks (other than federal branches, federal Agencies, and insured state branches of foreign banks), commercial lending companies owned or controlled by foreign banks, and organizations operating under section 25 or 25A of the Federal Reserve Act;, (C) Banks and state savings associations insured by the Federal Deposit Insurance Corporation (other than members of the Federal Reserve System), and insured state branches of foreign banks;, (ii) The Federal Credit Union Act (12 U.S.C. 1751 et seq.

Each text value is one entire section of Title 12 of the eCFR. This is somewhat problematic as raw text because each section also contains nested subsections, e.g., (a), (1), (i), (A), which have been flattened into a single string. The flattening adds stray punctuation, such as a comma inserted between consecutive subsections.
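
If the subsections are ever needed separately, the flattened string could be split at the subsection markers. This is only a heuristic sketch: it assumes a new subsection starts wherever a comma and whitespace are followed by a marker such as `(a) `, `(1) `, `(i) ` or `(A) `, which may occasionally also match ordinary cross-references in the prose. `split_subsections` is a hypothetical helper name.

```python
import re

# Heuristic: a subsection starts where ", " is followed by a marker such as
# "(a) ", "(1) ", "(i) " or "(A) ". The lookahead keeps the marker in the piece.
SUBSECTION_BREAK = re.compile(r",\s+(?=\([0-9A-Za-z]+\)\s)")


def split_subsections(text):
    """Split one section's text into its flattened subsection pieces."""
    return [piece.strip() for piece in SUBSECTION_BREAK.split(text) if piece.strip()]


# Shortened version of the section text shown above
sample_text = (
    "(a) (1) Compliance with the requirements of this part shall be enforced under - , "
    "(i) Section 8 of the Federal Deposit Insurance Act (12 U.S.C. 1813(q)), with respect to - , "
    "(A) National banks;, "
    "(B) Member banks of the Federal Reserve System"
)

pieces = split_subsections(sample_text)
for piece in pieces:
    print(piece)
```

Note that the citation `(12 U.S.C. 1813(q)),` is left intact, because the text after that comma does not start with a marker.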

**Q: Is there a more appropriate unit of analysis (e.g., would it be easier to chunk your documents down to sentence level)?**

Depending on the application, there may be a more appropriate unit of analysis. From a general standpoint, however, we think chunking by section works well: it is perhaps the smallest meaningful unit for the eCFR Title 12. Additional chunking would require reformatting the metadata with additional attributes, such as subsections (a), (1), etc. We think this adds too many dimensions for limited returns, especially since many of the subsections are incomplete clauses, highly contextual, or dependent on their own subsections. Therefore, keeping the records at the section level is good for now.

**Q: How are you evaluating the quality of your data?**

**Completeness**: Ensure that the data contains all the necessary fields or attributes and that no critical information is missing. Evaluate if there are any gaps or null values in the dataset.  
**Consistency**: Verify that the data follows a consistent format, structure, and conventions throughout. Inconsistent data formatting or conflicting information across different records can indicate quality issues.  
**Validity**: Assess whether the data conforms to defined standards, rules, or constraints. Validate the data against predefined criteria or business rules to ensure its validity.
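
These three criteria can be turned into simple counts. The sketch below is an illustration under assumptions, not the checks we actually ran: `quality_report` is a hypothetical helper, the sample records are made up for the example, and the set of expected metadata keys is our own choice.

```python
def quality_report(records, expected_keys):
    """Count simple completeness, consistency and validity problems."""
    report = {"empty_text": 0, "missing_meta_keys": 0, "duplicate_text": 0}
    seen_texts = set()
    for record in records:
        text = record.get("text", "")
        meta = record.get("meta", {})
        if not text.strip():
            report["empty_text"] += 1         # completeness: no text at all
        elif text in seen_texts:
            report["duplicate_text"] += 1     # consistency: repeated section text
        seen_texts.add(text)
        if not expected_keys.issubset(meta):
            report["missing_meta_keys"] += 1  # validity: hierarchy attribute absent
    return report


# Made-up sample records for illustration; real ones would come from `data`
sample_records = [
    {"text": "", "meta": {"chapter": ["II", "III"]}},  # like the table-of-contents record
    {"text": "(a) Example text.", "meta": {"chapter": ["II"], "part": ["235"], "section": ["235.9"]}},
    {"text": "(a) Example text.", "meta": {"chapter": ["II"], "part": ["235"]}},  # duplicate, no section
]

report = quality_report(sample_records, expected_keys={"chapter", "part", "section"})
print(report)
```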


**Q: What are challenges with the data?**

**Complexity and Length**: The text is lengthy and contains complex legal language, making it challenging to interpret and understand, especially for individuals without a legal background.  
**Hierarchy**: The text outlines items under different sections and subsections, creating a hierarchical structure that may require careful analysis to decipher the relationships. It may also require extra work expanding the meta column to extract these various attributes.   
**Formatting**: As previously mentioned, sections are composed of subsections, but these are flattened into raw text, which leaves formatting artifacts such as extra commas. Some records also contain leftover markup tags and URLs, which may be a challenge. See the following example (`<E T="04">` tags, stray commas, a URL).


In [None]:
df.iloc[2,0]

'(a) <E T="04">Federal Register. The Board publishes in the <E T="04">Federal Register for the guidance of the public:, (1) Descriptions of the Board\'s central and field organization;, (2) Statements of the general course and method by which the Board\'s functions are channeled and determined, including the nature and requirements of procedures;, (3) Rules of procedure, descriptions of forms available and the place where they may be obtained, and instructions on the scope and contents of all papers, reports, and examinations;, (4) Substantive rules, interpretations of general applicability, and statements of general policy;, (5) Every amendment, revision, or repeal of the foregoing in paragraphs (a)(1) through (4) of this section; and, (6) Other notices as required by law., (b) Publications. The Board maintains a list of publications on its website (at www.federalreserve.gov/publications). Most publications issued by the Board, including available back issues, may be downloaded from t
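
A light cleaning pass could remove the most visible artifacts in this example. This is a minimal sketch, assuming the only markup left in the text is the `<E T="..">` emphasis tag and that a stray comma always follows `;`, `:`, `.` or `-` directly before a subsection marker; `clean_text` is a hypothetical helper name, and other records may need additional rules.

```python
import re

TAG = re.compile(r'</?E(?:\s+T="\d+")?>')            # e.g. <E T="04"> (and </E>, if present)
STRAY_COMMA = re.compile(r'([;:.\-])\s*,\s+(?=\()')  # e.g. ";, (2)" or "- , (i)"


def clean_text(text):
    """Drop leftover emphasis tags and the comma inserted before a subsection."""
    text = TAG.sub("", text)
    return STRAY_COMMA.sub(r"\1 ", text)


# Shortened version of the record shown above
sample = (
    '(a) <E T="04">Federal Register. The Board publishes in the <E T="04">Federal Register '
    "for the guidance of the public:, (1) Descriptions of the Board's central and field organization;, "
    "(6) Other notices as required by law., (b) Publications."
)

cleaned = clean_text(sample)
print(cleaned)
```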

**Additional exploration using BERTopic.**

In [None]:
%%capture
!pip install bertopic

In [None]:
from bertopic import BERTopic

In [None]:
# use the first 1000 records as a sample
docs = df.iloc[:1000, 0]
docs.head(5)

0                                                     
1    (a) (1) Compliance with the requirements of th...
2    (a) <E T="04">Federal Register. The Board publ...
3    (a) <E T="04">Federal Register. The Committee ...
4    (a) A Bank shall make long-term advances only ...
Name: text, dtype: object

In [None]:
# took about 4 mins
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

In [None]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,288,-1_the_of_or_to,"[the, of, or, to, and, in, that, for, by, any]","[(a) Definitions., Financial asset means cash ..."
1,0,166,0_the_of_or_to,"[the, of, or, to, be, and, in, shall, any, may]",[(a) Applicability. Except as provided in para...
2,1,99,1_union_credit_the_of,"[union, credit, the, of, and, to, unions, in, ...",[(a) After the board of directors has complied...
3,2,85,2_of_the_and_bank,"[of, the, and, bank, or, to, in, act, foreign,...",[(a) Certain de novo activities. A bank holdin...
4,3,63,3_the_of_fdic_or,"[the, of, fdic, or, to, for, and, in, institut...",[(a) CECL transition provision. (1) Except as ...
5,4,35,4_capital_ratio_boardregulated_buffer,"[capital, ratio, boardregulated, buffer, amoun...",[(a) Capital conservation buffer and leverage ...
6,5,26,5_account_the_institution_or,"[account, the, institution, or, to, of, deposi...",[(a) Authority. Under section 19(a) of the Fed...
7,6,26,6_reserve_security_bookentry_federal,"[reserve, security, bookentry, federal, securi...",[(a) A Participant's Security Entitlement is c...
8,7,24,7_holding_mutual_company_association,"[holding, mutual, company, association, saving...",[(a) Conversion - (1) Generally. A mutual hold...
9,8,18,8_agricultural_farm_credit_bank,"[agricultural, farm, credit, bank, or, and, fa...","[(a) As a condition for extending funding, dis..."


In [None]:
topic_model.get_topic(0)

[('the', 0.03944225045099486),
 ('of', 0.03380615558241565),
 ('or', 0.0325400638581085),
 ('to', 0.031584509149201016),
 ('be', 0.025028308549783766),
 ('and', 0.022789807522002888),
 ('in', 0.021934208761928312),
 ('shall', 0.02191009231942508),
 ('any', 0.021166190917716696),
 ('may', 0.020258225561499827)]

In [None]:
topic_model.get_document_info(docs).head(10)

Unnamed: 0,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Probability,Representative_document
0,,-1,-1_the_of_or_to,"[the, of, or, to, and, in, that, for, by, any]","[(a) Definitions., Financial asset means cash ...",the - of - or - to - and - in - that - for - b...,0.0,False
1,(a) (1) Compliance with the requirements of th...,2,2_of_the_and_bank,"[of, the, and, bank, or, to, in, act, foreign,...",[(a) Certain de novo activities. A bank holdin...,of - the - and - bank - or - to - in - act - f...,0.703403,False
2,"(a) <E T=""04"">Federal Register. The Board publ...",0,0_the_of_or_to,"[the, of, or, to, be, and, in, shall, any, may]",[(a) Applicability. Except as provided in para...,the - of - or - to - be - and - in - shall - a...,1.0,False
3,"(a) <E T=""04"">Federal Register. The Committee ...",0,0_the_of_or_to,"[the, of, or, to, be, and, in, shall, any, may]",[(a) Applicability. Except as provided in para...,the - of - or - to - be - and - in - shall - a...,1.0,False
4,(a) A Bank shall make long-term advances only ...,-1,-1_the_of_or_to,"[the, of, or, to, and, in, that, for, by, any]","[(a) Definitions., Financial asset means cash ...",the - of - or - to - and - in - that - for - b...,0.0,False
5,(a) A Bank shall require each member to mainta...,9,9_capital_bank_the_undercapitalized,"[capital, bank, the, undercapitalized, directo...",[(a) Discretionary reclassification. Where the...,capital - bank - the - undercapitalized - dire...,1.0,False
6,(a) A Bank shall require the borrower to certi...,-1,-1_the_of_or_to,"[the, of, or, to, and, in, that, for, by, any]","[(a) Definitions., Financial asset means cash ...",the - of - or - to - and - in - that - for - b...,0.0,False
7,(a) A Board-regulated institution described in...,-1,-1_the_of_or_to,"[the, of, or, to, and, in, that, for, by, any]","[(a) Definitions., Financial asset means cash ...",the - of - or - to - and - in - that - for - b...,0.0,False
8,(a) A Board-regulated institution that is an a...,4,4_capital_ratio_boardregulated_buffer,"[capital, ratio, boardregulated, buffer, amoun...",[(a) Capital conservation buffer and leverage ...,capital - ratio - boardregulated - buffer - am...,0.820268,False
9,(a) A Farm Credit Bank or agricultural credit ...,8,8_agricultural_farm_credit_bank,"[agricultural, farm, credit, bank, or, and, fa...","[(a) As a condition for extending funding, dis...",agricultural - farm - credit - bank - or - and...,1.0,False


In [None]:
df.iloc[:10, 1]

0    {'chapter': ['II', 'II', 'II', 'II', 'II', 'II...
1    {'chapter': ['II'], 'chapter_title': ['CHAPTER...
2    {'chapter': ['II'], 'chapter_title': ['CHAPTER...
3    {'chapter': ['II'], 'chapter_title': ['CHAPTER...
4    {'chapter': ['XII'], 'chapter_title': ['CHAPTE...
5    {'chapter': ['XII'], 'chapter_title': ['CHAPTE...
6    {'chapter': ['XII'], 'chapter_title': ['CHAPTE...
7    {'chapter': ['II'], 'chapter_title': ['CHAPTER...
8    {'chapter': ['II'], 'chapter_title': ['CHAPTER...
9    {'chapter': ['VI'], 'chapter_title': ['CHAPTER...
Name: meta, dtype: object

The topics generated were not what we expected. We expected the topics to resemble the chapters that the records come from. For example, records 1, 2, 3, 7, and 8 are all from Chapter II, yet they are spread across three different topics (2, 0, and 4), with record 7 left unassigned as an outlier (topic -1).
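
The mismatch can be tabulated by counting topics per chapter. The sketch below uses only the chapter and topic values for records 1 through 9, copied by hand from the two outputs above, so it illustrates the comparison rather than a result over the full sample.

```python
from collections import Counter, defaultdict

# Chapter and assigned topic for records 1-9, copied from the outputs above
chapters = ['II', 'II', 'II', 'XII', 'XII', 'XII', 'II', 'II', 'VI']
topics = [2, 0, 0, -1, 9, -1, -1, 4, 8]

# For each chapter, count how many records fall into each topic
topics_by_chapter = defaultdict(Counter)
for chapter, topic in zip(chapters, topics):
    topics_by_chapter[chapter][topic] += 1

for chapter, counts in topics_by_chapter.items():
    print(chapter, dict(counts))
```

If topics followed chapters, each chapter would map to a single topic. Here Chapter II alone is split across topics 2, 0, and 4 plus the outlier bucket.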