# Fine Tuning Data

Current outlook:

1. Get GPUs.
2. Split the documents using a recursive splitter in LangChain.
3. Fine-tune BERT base (and/or LegalBERT) to classify the subtitle given a section of text.
4. Create 3 vector stores, one per embedding model: BERT base, LegalBERT, and the fine-tuned BERT.
5. Compare cluster compactness across the vector stores.
6. Query with text known to be in the data, which lets us measure precision and recall.
7. Look at making this a RAG system and compare it with a tiny GPT, or one that was not trained on much legal data.

In [1]:
from lxml import etree

NS = {'uslm': 'http://xml.house.gov/schemas/uslm/1.0',
      'xhtml': 'http://www.w3.org/1999/xhtml'}

In [2]:
def parse_subtitles(file_path):
    with open(file_path, 'rb') as f:
        tree = etree.parse(f)
    
    subtitles = tree.findall('.//uslm:subtitle', namespaces=NS)
    parsed = []

    for subtitle in subtitles:
        # Extract the subtitle heading; guard against a missing or empty <heading>
        heading = subtitle.find('uslm:heading', namespaces=NS)
        heading_text = (heading.text or "").strip() if heading is not None else ""

        # Get all text under subtitle (including paragraphs and nested tags)
        content_texts = []
        # Here, use `.//uslm:p` to get paragraphs or `.//` to get all text nodes under subtitle
        for elem in subtitle.findall('.//uslm:p', namespaces=NS):
            text = ' '.join(elem.itertext()).strip()
            if text:
                content_texts.append(text)

        parsed.append({
            "subtitle": heading_text,
            "content": "\n".join(content_texts)
        })

    return parsed

# Example usage
subtitles = parse_subtitles("./usc26.xml")

In [None]:
print(f"There are {len(subtitles)} subtitles collected")

The subtitle headings are:

In [None]:
[d['subtitle'] for d in subtitles]

['Income Taxes',
 'Estate and Gift Taxes',
 'Employment Taxes',
 'Miscellaneous Excise Taxes',
 'Alcohol, Tobacco, and Certain Other Excise Taxes',
 'Procedure and Administration',
 'The Joint Committee on Taxation',
 'Financing of Presidential Election Campaigns',
 'Trust Fund Code',
 'Coal Industry Health Benefits',
 'Group Health Plan Requirements']

Length of each subtitle's content, in characters:

In [None]:
[len(d['content']) for d in subtitles]

[10495568,
 429618,
 926266,
 1099066,
 688201,
 3266091,
 18316,
 53768,
 250834,
 43385,
 98520]

Example subtitle:

In [None]:
subtitles[-5]['subtitle']

'The Joint Committee on Taxation'

In [None]:
subtitles[-5]['content']

'1976— Pub. L. 94–455, title XIX, §\u202f1907(b)(1) ,  Oct. 4, 1976 ,  90 Stat. 1836 , struck out “Internal Revenue” in heading of subtitle G.\nThere shall be a joint congressional committee known as the Joint Committee on Taxation (hereinafter in this subtitle referred to as the “Joint Committee”).\n1976— Pub. L. 94–455  struck out “Internal Revenue” after “Committee on”.\nPub. L. 94–455, title XIX, §\u202f1907(c) ,  Oct. 4, 1976 ,  90 Stat. 1836 , provided that:  “The amendments made by this section [amending this section and sections 8004, 8021, and 8023 of this title and enacting provisions set out below] shall take effect on the first day of the first month which begins more than 90 days after the date of the enactment of this Act [ Oct. 4, 1976 ].”\nPub. L. 94–455, title XIX, §\u202f1907(a)(5) ,  Oct. 4, 1976 ,  90 Stat. 1836 , provided that:  “All references in any other statute, or in any rule, regulation, or order, to the Joint Committee on Internal Revenue Taxation shall be c

## LangChain doc splitter
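
A minimal sketch of the splitting step, assuming the `subtitles` list of `{"subtitle", "content"}` dicts returned by `parse_subtitles`. The plan is to use LangChain's `RecursiveCharacterTextSplitter`; the stand-in below is a simplified pure-Python version written only to illustrate the idea, not LangChain's implementation. The names `split_recursive` and `make_labeled_chunks`, the separator list, and the `chunk_size` values are all hypothetical. It tries coarse separators first, falls back to finer ones, and tags each chunk with the index of its subtitle so the pair can later serve as a classification example.

```python
def split_recursive(text, chunk_size=1000, separators=("\n", ". ", " ")):
    """Split text into pieces of at most chunk_size characters.

    Tries the coarsest separator first and recurses with finer ones on any
    piece that is still too long; falls back to hard cuts as a last resort.
    """
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # A single piece is too long: split it with the finer separators.
            chunks.extend(split_recursive(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


def make_labeled_chunks(subtitles, chunk_size=1000):
    """Return (chunk_text, label_id) pairs plus the label_id -> heading list."""
    label_names = [d["subtitle"] for d in subtitles]
    examples = [
        (chunk, label_id)
        for label_id, d in enumerate(subtitles)
        for chunk in split_recursive(d["content"], chunk_size)
    ]
    return examples, label_names


# Tiny stand-in for the parsed subtitles, just to exercise the functions.
demo = [
    {"subtitle": "Income Taxes",
     "content": "First sentence. Second sentence.\nNew paragraph here."},
    {"subtitle": "Employment Taxes", "content": "word " * 50},
]
examples, label_names = make_labeled_chunks(demo, chunk_size=40)
for chunk, label_id in examples[:3]:
    print(label_names[label_id], "|", chunk)
```

On the real data, the same call would be `make_labeled_chunks(subtitles)`, with the chunk size chosen to fit the tokenizer limit of the model being fine-tuned.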

## Fine Tuning Setup (start with BERT base)

Need to emphasize this section; it is a large part of the DL work.
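
One thing the setup has to handle is class imbalance: by the character counts listed earlier, "Income Taxes" is more than 500 times larger than "The Joint Committee on Taxation". The sketch below computes inverse-frequency class weights from those counts. It assumes the number of chunks per subtitle is roughly proportional to its character count, and `inverse_frequency_weights` is a hypothetical helper, not part of any library.

```python
# Character counts per subtitle, copied from the output earlier in this notebook.
char_counts = {
    "Income Taxes": 10495568,
    "Estate and Gift Taxes": 429618,
    "Employment Taxes": 926266,
    "Miscellaneous Excise Taxes": 1099066,
    "Alcohol, Tobacco, and Certain Other Excise Taxes": 688201,
    "Procedure and Administration": 3266091,
    "The Joint Committee on Taxation": 18316,
    "Financing of Presidential Election Campaigns": 53768,
    "Trust Fund Code": 250834,
    "Coal Industry Health Benefits": 43385,
    "Group Health Plan Requirements": 98520,
}


def inverse_frequency_weights(counts):
    """Weight each class by total / (n_classes * count), so rare classes weigh more."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


class_weights = inverse_frequency_weights(char_counts)

# Print from the most common class (smallest weight) to the rarest.
for name, weight in sorted(class_weights.items(), key=lambda kv: kv[1]):
    print(f"{weight:8.2f}  {name}")
```

Weights like these could be passed to a weighted loss (for example the `weight` argument of `torch.nn.CrossEntropyLoss`); downsampling the largest subtitles is another option worth comparing.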

## Fine Tuning Setup (start with LegalBERT)

Need to emphasize this section; it is a large part of the DL work.

## Vector DB Creation (x4)

## Vector DB Evaluation & Comparison (x4)

Look at precision and recall:

    Ask for items you know are in the database (fragments of text from LangChain documents);
        if it returns the same item -> hit,
        else -> miss
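
The hit/miss check can be sketched as follows, with toy 2-D vectors standing in for real chunk embeddings and a plain dict standing in for the vector store. The names `cosine`, `top_k_ids`, and `hit_rate_at_k` are hypothetical. Because each query has exactly one correct chunk, the hit rate at k is the same as recall@k, and precision@k is that value divided by k.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_ids(query_vec, store, k):
    """Return the ids of the k stored vectors most similar to query_vec."""
    ranked = sorted(store, key=lambda doc_id: cosine(query_vec, store[doc_id]),
                    reverse=True)
    return ranked[:k]


def hit_rate_at_k(queries, store, k=1):
    """queries: list of (query_vec, expected_id). A hit means expected_id is in the top k."""
    hits = sum(expected in top_k_ids(vec, store, k) for vec, expected in queries)
    return hits / len(queries)


# Toy vectors standing in for chunk embeddings from one of the models.
store = {"chunk_a": [1.0, 0.0], "chunk_b": [0.0, 1.0], "chunk_c": [0.7, 0.7]}

# Each query stands in for the embedding of a text fragment taken from a known chunk.
queries = [
    ([0.9, 0.1], "chunk_a"),
    ([0.1, 0.9], "chunk_b"),
    ([1.0, 0.2], "chunk_c"),  # deliberately closer to chunk_a: a miss at k=1
]

recall_at_1 = hit_rate_at_k(queries, store, k=1)
recall_at_2 = hit_rate_at_k(queries, store, k=2)
print(f"recall@1 = {recall_at_1:.2f}, precision@1 = {recall_at_1 / 1:.2f}")
print(f"recall@2 = {recall_at_2:.2f}, precision@2 = {recall_at_2 / 2:.2f}")
```

Running the same query set against each vector store gives directly comparable numbers per embedding model.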

Look at cluster compactness

Project the embedding space with PCA or t-SNE and color-code points by subtitle to see whether they group together. Compare this across the different models.
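
A numeric score can complement the visual projection. The sketch below uses one possible compactness measure, chosen here for illustration rather than taken from any standard: the mean distance of each point to its own subtitle centroid, divided by the mean distance between centroids. Lower values mean tighter, better-separated clusters. `compactness_ratio` is a hypothetical helper, and the toy 2-D points stand in for real embeddings.

```python
import math
from collections import defaultdict
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def compactness_ratio(embeddings, labels):
    """Mean point-to-own-centroid distance over mean centroid-to-centroid distance.

    Requires at least two distinct labels. Lower is better.
    """
    by_label = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        by_label[label].append(vec)
    centroids = {label: centroid(vecs) for label, vecs in by_label.items()}

    intra = sum(
        math.dist(vec, centroids[label]) for vec, label in zip(embeddings, labels)
    ) / len(embeddings)
    pairs = list(combinations(centroids.values(), 2))
    inter = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return intra / inter


# Two toy "models": same labels, but the second spreads each class out more.
labels = [0, 0, 1, 1]
tight = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
loose = [[0.0, 0.0], [0.0, 8.0], [10.0, 0.0], [10.0, 8.0]]

tight_ratio = compactness_ratio(tight, labels)
loose_ratio = compactness_ratio(loose, labels)
print(f"tight: {tight_ratio:.2f}, loose: {loose_ratio:.2f}")
```

Computing this ratio on the full-dimensional embeddings, rather than on the 2-D projection, avoids distortions introduced by PCA or t-SNE.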


## RAG System Setup (Small Model)

RAG can be useful for LLMs that are missing information (e.g. from compression or lack of current data)

Use BERT base and the best-performing fine-tuned model.
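
A minimal sketch of the prompt-assembly part of a RAG pipeline: retrieved chunks are placed ahead of the question so the small model can answer from them. Retrieval and generation are out of scope here. `build_rag_prompt`, the prompt wording, and the `max_context_chars` budget are all hypothetical choices, not a fixed design.

```python
def build_rag_prompt(question, retrieved_chunks, max_context_chars=2000):
    """Assemble a prompt that puts retrieved chunks before the question.

    Chunks are assumed to arrive ranked best-first; lower-ranked chunks are
    dropped once the character budget is used up.
    """
    context_parts, used = [], 0
    for i, chunk in enumerate(retrieved_chunks, start=1):
        if used + len(chunk) > max_context_chars:
            break
        context_parts.append(f"[{i}] {chunk}")
        used += len(chunk)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Example chunk: a shortened sentence from the Joint Committee subtitle shown earlier.
prompt = build_rag_prompt(
    "What is the Joint Committee on Taxation?",
    ["There shall be a joint congressional committee known as the "
     "Joint Committee on Taxation."],
)
print(prompt)
```

Sending the same question with and without the context block gives a simple with-RAG versus without-RAG comparison for the small model.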

## RAG System Comparison (Qualitative)

A few in-data questions

A few out-of-data questions

