# IP Agreement Clause Generator


This project develops a system that generates new clauses for intellectual property (IP) agreements used by lawyers. The generator grounds these clauses in prior art and best practices drawn from existing law-firm data. The generated clauses are intended to lead to favorable economic and legal outcomes for the assignee.

Lawyers can input requests within the context of an original document. The system retrieves similar clauses from a dataset, ranks them from best to worst based on their outcomes, and generates a clause that can be seamlessly integrated into the original document. The best clause is defined as one with the most favorable economic and legal outcomes.



## 1 - Dependencies

In [24]:
# LangChain dependencies
from langchain.document_loaders.pdf import PyPDFDirectoryLoader  # PDF directory loader
from langchain.text_splitter import RecursiveCharacterTextSplitter  # Text splitter
from langchain.embeddings import OpenAIEmbeddings  # OpenAI embeddings
from langchain.schema import Document  # Document schema
from langchain.vectorstores.chroma import Chroma  # Chroma vector store
from langchain.chat_models import ChatOpenAI  # Chat model wrapper
from langchain.prompts import ChatPromptTemplate, PromptTemplate  # Prompt templates
from langchain_openai import OpenAI  # Completion model wrapper
from dotenv import load_dotenv  # Loads the API key from a .env file
import fitz  # PyMuPDF, used to read PDF pages
import os  # Operating system functionalities
import shutil  # High-level file operations
import warnings

# Suppress all warnings (this also covers UserWarning)
warnings.filterwarnings('ignore')


## 2 - Split into chunks of text

The `split_text` function splits the text content of a single `Document` object into smaller chunks.

- Parameters (passed to `RecursiveCharacterTextSplitter`):
  - `chunk_size`: Size of each chunk in characters.
  - `chunk_overlap`: Overlap between consecutive chunks, in characters.
  - `length_function`: Function to compute the length of the text.
  - `add_start_index`: Flag to add the start index to each chunk's metadata.

- Returns:
  - A list of `Document` objects representing the split text chunks.
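
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of fixed-window chunking. It is only an illustration: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line boundaries, so its chunk edges will generally differ. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=400, chunk_overlap=100):
    """Illustrative fixed-window chunker (not LangChain's algorithm).

    Returns (start_index, chunk) pairs; consecutive chunks share
    `chunk_overlap` characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


sample = "x" * 1000
for start, chunk in sliding_chunks(sample):
    print(start, len(chunk))
```

With the values used in this notebook, each new chunk starts 300 characters after the previous one, so a clause that straddles a chunk boundary is still likely to appear whole in at least one chunk.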


In [6]:
def split_text(document):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,  # Size of each chunk in characters
        chunk_overlap=100,  # Overlap between consecutive chunks
        length_function=len,  # Function to compute the length of the text
        add_start_index=True,  # Flag to add start index to each chunk
    )
    # Split document into smaller chunks
    chunks = text_splitter.split_documents([document])
    # Keep only the text; chunk metadata (including start_index) is discarded here
    return [Document(page_content=chunk.page_content) for chunk in chunks]

## 3 - Model Initialization & Prompt Creation



In [7]:
gptPrompt = """
You are a lawyer extracting the legal clauses from a given text. I will give you a text, and based on it you will return a text that contains only the legal clauses and nothing else.
Text: {question}
Answer:"""

lawyer_template = PromptTemplate(
    input_variables=["question"],
    template=gptPrompt
)

- **Define the GPT Prompt:**
  - `gptPrompt` is a string that serves as a template for the task of extracting legal clauses from a given text.
  - The template instructs the model to:
    - Act as a lawyer.
    - Extract and return only the legal clauses from the provided text.
  - The placeholder `{question}` will be replaced with the actual text input when used.
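
The substitution can be illustrated with plain `str.format`, which uses the same `{name}` placeholder convention as `PromptTemplate`'s default f-string format. This sketch does not call LangChain; `demo_prompt` and `fill_prompt` are hypothetical names.

```python
demo_prompt = """
You are a lawyer extracting legal clauses.
Text: {question}
Answer:"""


def fill_prompt(template, question):
    """Substitute the page text for the {question} placeholder."""
    return template.format(question=question)


filled = fill_prompt(demo_prompt, "The Licensee shall pay royalties quarterly.")
print(filled)
```

Braces inside the substituted text are left untouched, because only the template itself is parsed for placeholders.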

In [8]:
load_dotenv()

True

- **Load Environment Variables:**
  - `load_dotenv()` is a function from the `dotenv` library that loads environment variables from a `.env` file into the environment.
  - This function is used to ensure that environment variables (such as API keys, database URLs, or configuration settings) defined in a `.env` file are available to the application.
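
A missing key would otherwise only surface later as an authentication error, so it can help to check for it right after loading. This is a minimal sketch, assuming the key is stored under `OPENAI_API_KEY` (the variable the OpenAI client reads by default); `require_env` is a hypothetical helper.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

Calling `require_env("OPENAI_API_KEY")` directly after `load_dotenv()` would then fail fast with a readable message when the `.env` file is missing or incomplete.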


In [9]:
llm=OpenAI()

- **Initialize OpenAI Model:**
  - `llm = OpenAI()` initializes an instance of the OpenAI language model.
  - This object represents the language model provided by OpenAI and is used to interact with the model for generating text, answering questions, or performing other language-related tasks.


In [10]:
llm_chain= lawyer_template | llm

- **Combine Prompt Template with Model:**
  - `llm_chain = lawyer_template | llm` creates a processing chain by combining the `lawyer_template` with the OpenAI model instance (`llm`).
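  - This uses the `|` operator from the LangChain Expression Language (LCEL): the formatted prompt produced by `lawyer_template` is passed as input to `llm`, and calling `llm_chain.invoke(...)` runs both steps in order.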
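
How `|` can build a pipeline is easiest to see in a toy model. The sketch below is not LangChain's implementation; the `Step` class is hypothetical and only shows how Python's `__or__` hook lets two components be composed so that the output of the first feeds the second.

```python
class Step:
    """Toy stand-in for a chainable component (not a LangChain class)."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `a | b` yields a new Step that runs a, then passes its result to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Hypothetical stand-ins for the prompt template and the model.
fake_template = Step(lambda inputs: f"Text: {inputs['question']}\nAnswer:")
fake_llm = Step(lambda prompt: prompt.upper())

toy_chain = fake_template | fake_llm
print(toy_chain.invoke({"question": "assignment of patents"}))
```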



In [16]:
def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    final_text = ""
    for page in doc:
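        text = page.get_text()
        # Ask the LLM to keep only the legal clauses found on this page
        final_text += llm_chain.invoke({"question": text}) + "\n"
    doc.close()
    return final_text

- **Extract Clauses from a PDF:**
  - `def extract_text_from_pdf(pdf_path):` defines a function that reads a PDF page by page with PyMuPDF (`fitz`), passes each page's text through `llm_chain` (one LLM call per page), and concatenates the extracted clauses.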


In [17]:
def list_all_paths(folder_path):
    paths = []
    for root, directories, files in os.walk(folder_path):
        for name in files:
            # Keep only PDF files; directories cannot be opened by fitz
            if name.lower().endswith(".pdf"):
                paths.append(os.path.join(root, name))
    return paths

- **List PDF Paths:**
  - `def list_all_paths(folder_path):` defines a function that walks the specified folder recursively and lists the paths of all PDF files in it.


In [19]:
paths=list_all_paths(r'D:\RAG\Law LLM (RAG)\1. RAG\data')

## 4 - Save to a vector database using Chroma


In [14]:
CHROMA_PATH = r"D:\RAG\Law LLM (RAG)\1. RAG\chroma2"

In [26]:
db = Chroma(persist_directory=CHROMA_PATH, embedding_function=OpenAIEmbeddings())

- **Database Initialization:**
  - `CHROMA_PATH = r"D:\RAG\Law LLM (RAG)\1. RAG\chroma2"` sets the path for the Chroma database.
  - `db = Chroma(persist_directory=CHROMA_PATH, embedding_function=OpenAIEmbeddings())` opens the Chroma database persisted at the specified path, creating it if it does not exist, and uses OpenAI embeddings to vectorize documents.


In [32]:


def save_to_chroma(chunks: list[Document], db):
    # Add new chunks to the database
    db.add_documents(chunks)
    db.persist()
    print(f"Appended {len(chunks)} chunks to {CHROMA_PATH}.")



- **Save to Chroma Function:**
  - `def save_to_chroma(chunks: list[Document], db):` defines a function to save document chunks to the Chroma database.
  - `db.add_documents(chunks)` adds the new chunks to the database.
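  - `db.persist()` writes the changes to disk. Chroma 0.4 and later persists documents automatically, so this call may be unnecessary or deprecated depending on the installed version.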
  - `print(f"Appended {len(chunks)} chunks to {CHROMA_PATH}.")` prints the number of chunks added to the database.


In [None]:
i=1
for path in paths:
    text=extract_text_from_pdf(path)
    document = Document(page_content=text)
    chunks = split_text(document)
    save_to_chroma(chunks,db)
    print("document "+str(i)+" processed")
    i+=1

- **Processing and Saving Documents:**
  - `i = 1` initializes a counter to keep track of processed documents.
  - `for path in paths:` iterates over all paths in the previously generated list.
  - `text = extract_text_from_pdf(path)` extracts text from the current PDF path.
  - `document = Document(page_content=text)` creates a `Document` object with the extracted text.
  - `chunks = split_text(document)` splits the document text into smaller chunks.
  - `save_to_chroma(chunks, db)` saves the chunks to the Chroma database.
  - `print("document " + str(i) + " processed")` prints a message indicating the current document has been processed.
  - `i += 1` increments the counter.
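
Because each document triggers paid LLM calls, one unreadable file should ideally not abort the whole run. The following is a sketch of the same loop with per-document error handling; the function `ingest_all` and its injected callables are hypothetical stand-ins for `extract_text_from_pdf`, `split_text`, and `save_to_chroma`.

```python
def ingest_all(paths, extract, split, save):
    """Run extract -> split -> save for each path, collecting failures.

    `extract`, `split`, and `save` are injected callables so the loop can be
    exercised without a PDF library, an LLM, or a database.
    """
    failed = []
    for i, path in enumerate(paths, start=1):
        try:
            chunks = split(extract(path))
            save(chunks)
            print(f"document {i} processed")
        except Exception as exc:  # keep going; report the failure afterwards
            failed.append((path, str(exc)))
            print(f"document {i} failed: {exc}")
    return failed


# Toy run with in-memory stand-ins.
stored = []


def fake_extract(path):
    if path.endswith(".bad"):
        raise ValueError("cannot read file")
    return f"text of {path}"


failures = ingest_all(
    ["a.pdf", "b.bad", "c.pdf"],
    extract=fake_extract,
    split=lambda text: [text[:4], text[4:]],
    save=stored.extend,
)
```

The returned list of `(path, error)` pairs makes it possible to retry only the documents that failed.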

## 5 - Query vector database for relevant data

In [17]:
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}
You are required to rank the retrieved clauses from the most favorable economic conditions to the worst conditions. The evaluation should consider factors such as economic stability, growth prospects, inflation rates, employment levels, and other relevant economic indicators. Additionally, the legal outcomes should encompass potential liabilities, risks, and the protection afforded to the assignee under different economic scenarios, including the legal outcomes for the assignee. Do not generate any new clauses. Just rewrite the entire text of the existing clauses.
Answer the question based on the following query: {question}


"""

- **Prompt Template Definition:**
  - `PROMPT_TEMPLATE` is a string that defines a template for answering a question based on provided context.
  - The template instructs the model to:
    - Rank the retrieved clauses from the most favorable economic conditions to the worst conditions.
    - Consider factors such as economic stability, growth prospects, inflation rates, employment levels, and other relevant economic indicators.
    - Evaluate potential liabilities, risks, and the protection afforded to the assignee under different economic scenarios, including the legal outcomes for the assignee.
    - Only rewrite the existing clauses and not generate any new clauses.

## 6 - User Input & Generation of Effective Clauses

In [22]:
query_text="""(b) Trademarks. The Parties acknowledge and agree that certain rights and obligations with respect to the use by the SpinCo Group of
certain Honeywell Trademarks shall be set forth in the Trademark License Agreement. To the extent there is a conflict between the terms of this
Agreement and the Trademark License Agreement, the terms of the Trademark License Agreement shall control."""

# Search the DB.
results = db.similarity_search_with_relevance_scores(query_text, k=10)
if len(results) == 0 or results[0][1] < 0.7:
    print(f"Unable to find matching results.")
else:
    # Build the prompt and query the model only when a good match exists
    context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
    prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    prompt = prompt_template.format(context=context_text, question=query_text)

    model = ChatOpenAI()
    response_text = model.predict(prompt)

    sources = [doc.metadata.get("source", None) for doc, _score in results]
    formatted_response = f"Response:\n {response_text}"
    print(formatted_response)

Response:
 1. Section 3.01(a) - Honeywell grants SpinCo and its group members a non-exclusive, royalty-free, perpetual, sublicenseable worldwide license to use the Honeywell Shared IP, excluding Trademarks, for the purposes of the SpinCo Business prior to the Distribution Date.

2. Section 4.01(a) - SpinCo grants Honeywell and its subsidiaries a non-exclusive, royalty-free, perpetual, sublicenseable worldwide license to use the SpinCo Shared IP for the same purposes as used prior to the Distribution Date.

3. Section 3.01(b) - The terms of the Trademark License Agreement shall control in case of any conflicts between that agreement and the terms of this Agreement regarding the use of certain Honeywell Trademarks by the SpinCo Group.

4. Section 4.02(b) - SpinCo must promptly notify Honeywell of any infringement issues.

5. Section 4.01(b) - Honeywell has the right to challenge the validity or enforceability of a transfer of more than fifty percent of voting power, shares, or equity of 

- **Query Text Definition:**
  - `query_text` is a string that contains the text to be searched in the database.

- **Perform Similarity Search:**
  - `results = db.similarity_search_with_relevance_scores(query_text, k=10)`
    - Searches the database for documents most similar to the `query_text`.
    - Retrieves the top 10 results along with their relevance scores.

- **Check for Matching Results:**
  - `if len(results) == 0 or results[0][1] < 0.7:`
    - Checks if no results were found or if the top result's relevance score is below 0.7.
    - If either condition is true, prints a message indicating that no matching results were found.
    - `print(f"Unable to find matching results.")`

- **Prepare Context Text:**
  - `context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])`
    - Joins the page content of the retrieved documents into a single string, with each document's content separated by `\n\n---\n\n`.

- **Create Prompt Template:**
  - `prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)`
    - Creates a prompt template using the predefined `PROMPT_TEMPLATE`.

- **Format the Prompt:**
  - `prompt = prompt_template.format(context=context_text, question=query_text)`
    - Formats the prompt template with the `context_text` and `query_text`.

- **Model Prediction:**
  - `model = ChatOpenAI()`
    - Initializes the OpenAI chat model with its default settings.
  - `response_text = model.predict(prompt)`
    - Generates a response from the model based on the formatted prompt.

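- **Retrieve Sources and Format the Response:**
  - `sources = [doc.metadata.get("source", None) for doc, _score in results]`
    - Extracts the `source` metadata from the retrieved documents. The list is not used in the printed output, and chunks created by `split_text` carry no `source` metadata, so these values are `None`.
  - `formatted_response = f"Response:\n {response_text}"`
    - Prefixes the response text with a `Response:` label.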

- **Print Formatted Response:**
  - `print(formatted_response)`
    - Prints the formatted response to the console.
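
The retrieval gate and context assembly can be sketched without a database. The example below assumes results arrive as `(text, score)` pairs sorted by descending score, mirroring the `(Document, score)` tuples above; `build_context` is a hypothetical helper, and returning `None` on a weak match is one possible way to signal that generation should be skipped.

```python
SEPARATOR = "\n\n---\n\n"


def build_context(results, min_score=0.7):
    """Join retrieved texts, or return None when the best match is too weak."""
    if not results or results[0][1] < min_score:
        return None
    return SEPARATOR.join(text for text, _score in results)


good = [("Clause A", 0.91), ("Clause B", 0.84)]
weak = [("Clause C", 0.42)]

print(build_context(good))
print(build_context(weak))
```

Only the top result is compared against the threshold, so lower-ranked clauses with weaker scores still enter the context once the best match passes.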
