
## **LlamaParse: Transform Unstructured Data with Ease**  
LlamaParse converts unstructured sources such as PDFs, HTML, and text files into structured output that LLM pipelines can consume directly. 📄 It handles parsing at scale, exposes options for customizing how documents are parsed (for example, a parsing instruction or a multimodal model), and plugs into LlamaIndex workflows. 🔗 This notebook uses it to build a RAG pipeline over legal documents and to compare multimodal parsing with GPT-4o mini and GPT-4o.


## **Building a RAG Pipeline over Legal Documents**

### **Setup and Installation**



In [None]:
%pip install llama-index llama-parse

In [None]:
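import os

from google.colab import userdata

# Read API keys from Colab's secret store; the names must match the secrets saved in the Colab sidebar.
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
os.environ["LLAMA_CLOUD_API_KEY"] = userdata.get("LLAMA-CLOUD-API")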
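
Outside Colab, `google.colab.userdata` is not available. The following is a minimal sketch of a fallback, assuming the keys have already been exported as environment variables. `require_env` is a hypothetical helper, not part of LlamaIndex or LlamaParse.

```python
import os


def require_env(name: str, env=None) -> str:
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper for running this notebook outside Colab.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return value


# Usage (assumes both keys were exported in the shell beforehand):
# openai_key = require_env("OPENAI_API_KEY")
# llama_key = require_env("LLAMA_CLOUD_API_KEY")
```

Failing early with the variable name in the message tends to be easier to debug than an authentication error raised later by the API client.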

### **📥 Downloading and Extracting Dataset 📂**


In [None]:
!wget https://github.com/user-attachments/files/16447759/data.zip -O data.zip
!unzip -o data.zip
!rm data.zip
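
The shell commands above depend on `wget` and `unzip` being installed. A portable sketch using only the Python standard library is shown here; `extract_archive` is a hypothetical helper name, and the download step is left as a commented usage line.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path: str, dest_dir: str = ".") -> list[str]:
    """Extract a zip archive into dest_dir and return the names of the extracted files."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        # Directory entries end with "/", so keep only real files.
        return [name for name in archive.namelist() if not name.endswith("/")]


# Usage (same archive URL as the wget command above):
# import urllib.request
# urllib.request.urlretrieve(
#     "https://github.com/user-attachments/files/16447759/data.zip", "data.zip"
# )
# print(extract_archive("data.zip"))
```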

In [None]:
import nest_asyncio

# Jupyter already runs an event loop; nest_asyncio lets LlamaParse's async calls run inside it.
nest_asyncio.apply()

### **📂 Parsing US Legal Documents with LlamaParse ⚖️**

In [None]:
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
    parsing_instruction="Provided are a series of US legal documents.",
    use_vendor_multimodal_model=True,
    vendor_multimodal_model_name="openai-gpt4o",
    show_progress=True,
)

DATA_DIR = "data"


def get_data_files(data_dir=DATA_DIR) -> list[str]:
    files = []
    for f in os.listdir(data_dir):
        fname = os.path.join(data_dir, f)
        if os.path.isfile(fname):
            files.append(fname)
    return files


files = get_data_files()

In [None]:
documents = parser.load_data(
    files,
    extra_info={"name": "US legal documents provided by the Library of Congress."},
)

Parsing files: 100%|██████████| 8/8 [02:01<00:00, 15.23s/it]
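
Before indexing, it can help to confirm what the parser returned. This is a small sketch, assuming each parsed object exposes `text` and `metadata` attributes as LlamaIndex `Document` objects do. `summarize_documents` is a hypothetical helper.

```python
def summarize_documents(docs, preview_chars: int = 80) -> list[dict]:
    """Build a short summary (length, metadata, preview) for each parsed document."""
    summaries = []
    for i, doc in enumerate(docs):
        text = doc.text or ""
        summaries.append(
            {
                "index": i,
                "chars": len(text),
                "metadata": dict(doc.metadata),
                # Collapse whitespace so the preview fits on one line.
                "preview": " ".join(text.split())[:preview_chars],
            }
        )
    return summaries


# Usage with the parsed output from the cell above:
# for row in summarize_documents(documents):
#     print(row)
```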


### **🔍 Setting Up VectorStore Index for Legal Documents 📚**







In [None]:
from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
    Settings,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

embed_model = OpenAIEmbedding(model="text-embedding-3-large")
llm = OpenAI("gpt-4o")

Settings.llm = llm
Settings.embed_model = embed_model

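PERSIST_DIR = "./storage_legal"

if not os.path.exists(PERSIST_DIR):
    # First run: embed the parsed documents and persist the index to disk.
    # Passing documents straight to the constructor indexes each one as a single node;
    # VectorStoreIndex.from_documents(documents) would split them into chunks first.
    index = VectorStoreIndex(documents, embed_model=embed_model)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Later runs: reload the persisted index instead of re-embedding.
    ctx = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(ctx)

query_engine = index.as_query_engine()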

### **📝 Querying Legal Document Index for Information 🔍**

In [None]:
from IPython.display import display, Markdown

response = query_engine.query(
    "Where did the majority of Barre Savings Bank's loans go?"
)
display(Markdown(str(response)))
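
The response object also carries the retrieved chunks, which is useful for checking that an answer is grounded in the source text. This sketch assumes `response.source_nodes` is a list of items with `score` and `node` attributes, as LlamaIndex's `NodeWithScore` provides. `format_sources` is a hypothetical helper.

```python
def format_sources(source_nodes, max_chars: int = 200) -> list[str]:
    """Render each retrieved chunk as 'score | snippet' for quick inspection."""
    lines = []
    for item in source_nodes:
        # Collapse whitespace and truncate so each source fits on one line.
        snippet = " ".join(item.node.get_content().split())[:max_chars]
        score = "n/a" if item.score is None else f"{item.score:.3f}"
        lines.append(f"{score} | {snippet}")
    return lines


# Usage after a query:
# for line in format_sources(response.source_nodes):
#     print(line)
```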

In [None]:
response = query_engine.query(
    "Why does Mr. Kubarych believe foreign markets are so important?"
)
display(Markdown(str(response)))

In [None]:
response = query_engine.query(
    "Who is against the proposal of offshore drilling in CA and why?"
)
display(Markdown(str(response)))

In [None]:
response = query_engine.query(
    "What is the purpose of the Ocean Science and Technology Subcommittee?"
)
display(Markdown(str(response)))
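
The four query cells repeat the same pattern. A sketch that batches them is shown here, assuming only that the engine exposes a `query` method as used above. `ask_all` is a hypothetical helper.

```python
def ask_all(engine, questions: list[str]) -> dict[str, str]:
    """Run each question through the query engine and collect the answers as text."""
    answers = {}
    for question in questions:
        answers[question] = str(engine.query(question))
    return answers


# Usage with the engine built above:
# results = ask_all(query_engine, [
#     "Where did the majority of Barre Savings Bank's loans go?",
#     "What is the purpose of the Ocean Science and Technology Subcommittee?",
# ])
# for q, a in results.items():
#     print(f"Q: {q}\nA: {a}\n")
```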

### **Multimodal Parsing using GPT-4o mini**

### **📥 Downloading Llama3.1 Blog PDF 📝**

In [None]:
!wget "https://www.dropbox.com/scl/fi/8iu23epvv3473im5rq19g/llama3.1_blog.pdf?rlkey=5u417tbdox4aip33fdubvni56&st=dzozd11e&dl=1" -O "data/llama3.1_blog.pdf"


### **Initialize LlamaParse**


In [None]:
from llama_index.core.schema import TextNode
from typing import List
import json


def get_text_nodes(json_list: List[dict]) -> List[TextNode]:
    """Wrap each parsed page's Markdown in a TextNode tagged with its page number."""
    text_nodes = []
    for page in json_list:
        text_node = TextNode(text=page["md"], metadata={"page": page["page"]})
        text_nodes.append(text_node)
    return text_nodes


def save_jsonl(data_list, filename):
    """Save a list of dictionaries as JSON Lines."""
    with open(filename, "w") as file:
        for item in data_list:
            json.dump(item, file)
            file.write("\n")


def load_jsonl(filename):
    """Load a list of dictionaries from JSON Lines."""
    data_list = []
    with open(filename, "r") as file:
        for line in file:
            data_list.append(json.loads(line))
    return data_list


# Parse the blog PDF with GPT-4o mini as the multimodal model.
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
    use_vendor_multimodal_model=True,
    vendor_multimodal_model_name="openai-gpt-4o-mini",
    invalidate_cache=True,
)
json_objs = parser.get_json_result("./data/llama3.1_blog.pdf")
json_list = json_objs[0]["pages"]
docs = get_text_nodes(json_list)

Started parsing the file under job_id c7cd9ead-5a69-4aad-bef0-5b33ff83346e


In [None]:
save_jsonl([d.dict() for d in docs], "docs.jsonl")
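
JSON Lines stores one JSON object per line, which is why the helper writes a newline after each record. The following self-contained round-trip sketch uses compact copies of the two helpers defined earlier; `round_trip` is a hypothetical function and the sample records are made up for illustration.

```python
import json
import os
import tempfile


def save_jsonl(data_list, filename):
    """Write one JSON object per line."""
    with open(filename, "w") as file:
        for item in data_list:
            file.write(json.dumps(item) + "\n")


def load_jsonl(filename):
    """Read one JSON object per line, skipping blank lines."""
    with open(filename, "r") as file:
        return [json.loads(line) for line in file if line.strip()]


def round_trip(records):
    """Write records to a temporary JSONL file and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "records.jsonl")
        save_jsonl(records, path)
        return load_jsonl(path)


# Illustrative records only; real entries come from `d.dict()` on each node.
sample = [
    {"text": "page one", "metadata": {"page": 1}},
    {"text": "page two", "metadata": {"page": 2}},
]
assert round_trip(sample) == sample
```

Because each line is an independent JSON document, a file like this can be appended to or streamed line by line without loading everything into memory.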


In [None]:
from llama_index.core import Document

docs_dicts = load_jsonl("docs.jsonl")
# model_validate replaces parse_obj, which is deprecated since Pydantic V2.
docs = [Document.model_validate(d) for d in docs_dicts]


### **Setup GPT-4o baseline**


In [None]:
from llama_parse import LlamaParse

# Same PDF, parsed with GPT-4o as the multimodal model for comparison.
parser_gpt4o = LlamaParse(
    result_type="markdown",
    use_vendor_multimodal_model=True,
    vendor_multimodal_model_name="openai-gpt4o",
    # invalidate_cache=True
)
json_objs_gpt4o = parser_gpt4o.get_json_result("./data/llama3.1_blog.pdf")
json_list_gpt4o = json_objs_gpt4o[0]["pages"]
docs_gpt4o = get_text_nodes(json_list_gpt4o)

Started parsing the file under job_id a9760515-355e-4089-b8ff-63642cee140d


In [None]:
save_jsonl([d.dict() for d in docs_gpt4o], "docs_gpt4o.jsonl")


In [None]:
from llama_index.core import Document

docs_gpt4o_dicts = load_jsonl("docs_gpt4o.jsonl")
# model_validate replaces parse_obj, which is deprecated since Pydantic V2.
docs_gpt4o = [Document.model_validate(d) for d in docs_gpt4o_dicts]


### **View Results**
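
To compare the GPT-4o mini and GPT-4o parses of the same page directly, a line diff is one option. This sketch uses the standard library's `difflib`; `diff_pages` is a hypothetical helper that takes plain strings, and the default labels are only illustrative.

```python
import difflib


def diff_pages(
    text_a: str,
    text_b: str,
    label_a: str = "gpt-4o-mini",
    label_b: str = "gpt-4o",
) -> str:
    """Return a unified diff between two parsed versions of the same page."""
    diff = difflib.unified_diff(
        text_a.splitlines(),
        text_b.splitlines(),
        fromfile=label_a,
        tofile=label_b,
        lineterm="",
    )
    return "\n".join(diff)


# Usage with the two parses loaded earlier (index 4 is the fifth page):
# print(diff_pages(docs[4].get_content(), docs_gpt4o[4].get_content()))
```

An empty result means the two parses agree line for line on that page.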


In [None]:
print(docs[4].get_content(metadata_mode="all"))


page: 5

# Llama 3.1 Model Evaluation

## Benchmark Results

| Category        | Llama 3.1 8B | Gemma 2 9B IT | Mistral 7B Instruct | Llama 3.1 70B | Mistral 8x22B Instruct | GPT 3.5 Turbo |
|-----------------|---------------|----------------|----------------------|----------------|------------------------|----------------|
| General         |               |                |                      |                |                        |                |
| MMLU            | 73.0          | 72.3           | 60.5                 | 86.0           | 79.9                   | 69.8           |
| MMLU PRO        | 48.3          | 36.9           | 36.9                 | 66.4           | 56.3                   | 49.2           |
| iEval           | 80.4          | 73.6           | 57.6                 | 87.5           | 69.9                   |                |
| Code            |               |                |                      |                |                        |                |