# I-9 Assistant

This assistant helps a business vet an employee's I-9 submission and provides next steps if the form is incomplete.

## Business problem

The I-9 (Employment Eligibility Verification) form serves several critical business needs:

Legal Compliance
- Required by federal law for every new hire in the U.S.
- Employers must verify both identity and employment eligibility within 3 business days of hire
- Failure to properly complete and maintain I-9s can result in significant fines ($234-$2,332 per violation as of 2024) and potential criminal penalties
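
The 3-business-day window is easy to miscount around weekends. The sketch below is one way an assistant could compute the deadline. The function name `section2_deadline` is hypothetical, and the code assumes that counting starts on the day after the employee's first day of work and that only Saturdays and Sundays are skipped (federal holidays are ignored).

```python
from datetime import date, timedelta


def section2_deadline(first_day: date, business_days: int = 3) -> date:
    """Return the date that falls `business_days` weekdays after `first_day`.

    Assumption: only weekends are skipped; federal holidays are not handled.
    """
    current = first_day
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


# An employee who starts on Monday, 8 January 2024
print(section2_deadline(date(2024, 1, 8)))  # 2024-01-11 (Thursday)
```

A start date late in the week rolls over the weekend: a Thursday start lands on the following Tuesday.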

Risk Management
- Protects against hiring unauthorized workers, which can lead to:
  - Department of Labor investigations
  - Immigration and Customs Enforcement (ICE) audits
  - Loss of business licenses in some jurisdictions
  - Reputational damage
- Creates a clear audit trail of employment eligibility verification

Workplace Security
- Helps ensure all employees are who they claim to be
- Establishes a consistent verification process across all new hires
- Supports overall workplace safety and security measures

Business Operations
- Required for payroll and tax documentation
- Often needed for government contracts and certifications
- May be necessary for business loans or corporate transactions
- Essential for maintaining clean corporate records


## Install OpenAI, Tavily, and LangChain dependencies

In [None]:
# Quote each requirement: an unquoted ">=" is treated by the shell as an output redirect
!pip install "langchain>=0.2.0" "langchain-openai>=0.1.7" "langchain-community>=0.2.0" "langgraph>=0.1.1" "langchain-chroma>=0.1.2"
!pip install python-dotenv
!pip install -qU pypdf

In [None]:
import os

'''
# for Colab users
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
TAVILY_API_KEY = userdata.get('TAVILY_API_KEY')

os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
os.environ['TAVILY_API_KEY'] = TAVILY_API_KEY

from google.colab import drive
drive.mount('/content/drive')
'''
from dotenv import load_dotenv
load_dotenv()  # take environment variables from .env.
OPENAI_API_KEY = os.environ['OPENAI_API_KEY'] 
TAVILY_API_KEY = os.environ['TAVILY_API_KEY']
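
Indexing `os.environ` directly raises a bare `KeyError` when a key is missing from `.env`. A minimal sketch of a friendlier check follows. The helper names `missing_keys` and `require_keys` are hypothetical and not part of the notebook.

```python
import os
from typing import Mapping, Sequence


def missing_keys(names: Sequence[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names that are unset or empty in `env`."""
    return [name for name in names if not env.get(name)]


def require_keys(names: Sequence[str], env: Mapping[str, str] = os.environ) -> None:
    """Raise one readable error that lists every missing variable."""
    missing = missing_keys(names, env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))


# Demonstration with a plain dict standing in for the real environment
example_env = {"OPENAI_API_KEY": "sk-example"}
print(missing_keys(["OPENAI_API_KEY", "TAVILY_API_KEY"], example_env))  # ['TAVILY_API_KEY']
```

In the notebook, calling `require_keys(["OPENAI_API_KEY", "TAVILY_API_KEY"])` right after `load_dotenv()` would report all missing keys at once.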

## Build a Search Index for Instructions

To build the vector store, the instructions PDF is loaded page by page with LangChain's `PyPDFLoader`, split into overlapping chunks with `RecursiveCharacterTextSplitter`, and indexed in a Chroma vector database.

First, load the manual, which has been copied to `data/i-9instr.pdf`. (The original can be found at https://www.uscis.gov/sites/default/files/document/forms/i-9instr.pdf )

In [None]:
file_path = (
    "./data/i-9instr.pdf"
)

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(file_path)
pages = []
# Top-level `async for` works in Jupyter/IPython; in a plain script use `pages = loader.load()`
async for page in loader.alazy_load():
    pages.append(page)
    ## TODO: add metadata about the document being loaded

API reference: [PyPDFLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html)

To load a whole folder of PDFs, see [Langchain Pdf Loader Multiple Files](https://www.restack.io/docs/langchain-knowledge-pdf-loader-multiple-files-cat-ai)
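
As a rough illustration of the folder case, the sketch below only collects the PDF paths. The helper `find_pdfs` is hypothetical, and the demonstration uses a temporary directory with empty placeholder files rather than real forms.

```python
import tempfile
from pathlib import Path


def find_pdfs(folder: str) -> list[Path]:
    """Return the PDF files directly inside `folder`, sorted by name.

    Note: the "*.pdf" pattern is case-sensitive on Linux and macOS.
    """
    return sorted(p for p in Path(folder).glob("*.pdf") if p.is_file())


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "i-9instr.pdf").write_bytes(b"")
    (Path(tmp) / "notes.txt").write_text("not a pdf")
    found = [p.name for p in find_pdfs(tmp)]

print(found)  # ['i-9instr.pdf']
```

Each returned path could then be passed to `PyPDFLoader(str(path))` and the resulting pages appended to a single `pages` list.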

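Split the loaded pages into chunks of at most 1,500 characters, with 200 characters of overlap between neighbouring chunks.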

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split the pages into document chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500,
                                               chunk_overlap=200)
doc_chunks = text_splitter.split_documents(pages)
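
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on paragraph, line, and word boundaries). The function `window_chunks` is a hypothetical illustration only.

```python
def window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


print(window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.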

In [None]:
doc_chunks
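
Displaying every chunk can be noisy for a long PDF. A small summary of chunk lengths is often enough to confirm the splitter behaved as expected. The helper `summarize_lengths` below is a hypothetical sketch that works on any iterable of strings.

```python
def summarize_lengths(texts) -> dict:
    """Return count, min, max, and mean character length of `texts`."""
    lengths = [len(t) for t in texts]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }


print(summarize_lengths(["abc", "abcde", "a"]))
# {'count': 3, 'min': 1, 'max': 5, 'mean': 3.0}
```

In the notebook this could be called as `summarize_lengths(chunk.page_content for chunk in doc_chunks)`; the maximum should not exceed the configured `chunk_size`.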

## Create a Vector DB and persist on disk

The following code uses [LangChain Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma/)

If you want to use an in-memory vector store instead, see [InMemoryVectorStore](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.in_memory.InMemoryVectorStore.html).


In [None]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings


# Create document embeddings and store in Vector DB
openai_embed_model = OpenAIEmbeddings()

# create vector DB of docs and embeddings - takes < 30s on Colab
chroma_db = Chroma.from_documents(documents=doc_chunks,
                                  collection_name='rag_i9_db',
                                  embedding=openai_embed_model,
                                  # need to set the distance function to cosine else it uses euclidean by default
                                  # check https://docs.trychroma.com/guides#changing-the-distance-function
                                  collection_metadata={"hnsw:space": "cosine"},
                                  persist_directory="./i9_db")
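
The choice of distance function can change which chunks rank as closest. The toy example below compares the two metrics on hand-made 2-D vectors, assuming that cosine distance is computed as one minus cosine similarity and that the default `l2` space is squared Euclidean distance. The vectors and function names are illustrative only and are not real embeddings.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: depends only on the angle between vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def squared_l2_distance(a, b):
    """Squared Euclidean distance: also sensitive to vector length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = [1.0, 0.0]
same_direction = [3.0, 0.0]    # same angle as the query, but longer
nearby_rotated = [1.0, 1.0]    # closer in space, but 45 degrees away

for name, vec in [("same_direction", same_direction), ("nearby_rotated", nearby_rotated)]:
    print(name, round(cosine_distance(query, vec), 4), squared_l2_distance(query, vec))
```

Under cosine distance `same_direction` is the nearest neighbour, while under squared Euclidean distance `nearby_rotated` wins, which is why the metric is set explicitly in `collection_metadata`.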

TODO: Add metadata with document name into Chroma
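
One possible way to approach this TODO is to tag every chunk before it is handed to Chroma, since LangChain documents carry a `metadata` dict next to their `page_content`. The sketch below uses a hypothetical `FakeDocument` stand-in so it runs without LangChain installed. The helper `tag_documents` and the `document_name` key are also assumptions, not an established convention.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def tag_documents(docs, **extra):
    """Merge the `extra` key/value pairs into each document's metadata in place."""
    for doc in docs:
        doc.metadata.update(extra)
    return docs


docs = [FakeDocument("Section 1. Employee Information", {"page": 0})]
tag_documents(docs, document_name="i-9instr.pdf")
print(docs[0].metadata)  # {'page': 0, 'document_name': 'i-9instr.pdf'}
```

Applied to the notebook, `tag_documents(doc_chunks, document_name="i-9instr.pdf")` would run before `Chroma.from_documents`, so that the name should be stored alongside each chunk and be available for filtering later.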