# Pipeline Steps:
1. Sourcing:
- Selenium-based scraper for the CEAT website

2. Parsing:
- LlamaParse API with JSON output
- Each section is returned as a JSON object along with its metadata
- Combine the markdown from all pages into a single markdown string
- Use LangChain's MarkdownHeaderTextSplitter on level 1 and level 2 headers to create chunks
- Use RecursiveCharacterTextSplitter to split the chunks further
- **Todo**: Convert into Document objects using LlamaIndex with metadata, then split
  
3. Tagging:
- **Todo**: Use an LLM to extract keywords and add them as tags in the metadata. Use these tags to enhance keyword-based search and query routing.

4. Embedding generation:
- Chroma DB with Llama3 8B embeddings
- **Todo**: Shift to OpenAI Embeddings with a hosted vector store

5. Query Construction
6. Router
7. Metadata Search
8. Top-K Similarity Search

**Rough Flow**:
- Indexing: doc => LlamaParse => markdown splitter + tags + metadata => recursive splitter => embedding => vector store
- Querying: query => enhance using LLM => embedding => top-k similarity search, enhanced via metadata search

# Sourcing
## Download Reports from the CEAT Website

In [None]:
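import os
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait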

In [None]:
# Set up the WebDriver
download_dir = os.getcwd() + "/ceat"  # Download and Save the pdf to this folder
print(download_dir)
os.makedirs(download_dir, exist_ok=True)

# Set Chrome preferences to automate downloads
options = Options()
prefs = {
        "download.default_directory": download_dir,
        "download.prompt_for_download": False,  # To automatically download the PDF
        # "download.directory_upgrade": True,
        "plugins.always_open_pdf_externally": True  # It will not open the PDF in the browser
    }
options.add_experimental_option("prefs", prefs)
# options.add_argument("--headless")

# Initialize WebDriver with options
service = Service(executable_path="./chromedriver_mac_arm64/chromedriver")  # Update the path to where you've downloaded chromedriver
driver = webdriver.Chrome(service=service, options=options)
print(driver)

# URL of the webpage
url = 'https://www.ceat.com/investors/annual-reports.html'

# Navigate to the URL
driver.get(url)

# Collect the report links; 'a.btn-icon' is the CSS selector used here for the report download links
elements = driver.find_elements(By.CSS_SELECTOR, 'a.btn-icon')

report_urls = [element.get_attribute('href') for element in elements]
print("report-urls", report_urls)

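# With the preferences above, navigating to a PDF URL downloads the file.
# A bare WebDriverWait(driver, 10) does not pause on its own (it only waits
# when .until(...) is called), so the fixed sleep below covers the downloads.
for report_url in report_urls:
    driver.get(report_url)
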
print("postwait")

time.sleep(60)
driver.quit()
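
A fixed `time.sleep(60)` either wastes time or cuts off large reports. A minimal sketch of an alternative is to poll the download folder until Chrome has no partial files left. The helper name `wait_for_downloads` and its parameters are hypothetical, and it assumes Chrome's usual behaviour of writing in-progress downloads with a `.crdownload` suffix.

```python
import os
import time


def wait_for_downloads(directory, timeout=120.0, poll_interval=1.0):
    """Block until `directory` holds no in-progress Chrome downloads.

    Returns True if no `.crdownload` files remain before `timeout` seconds
    elapse, otherwise False.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Chrome renames the partial file once the download completes
        partial = [name for name in os.listdir(directory) if name.endswith(".crdownload")]
        if not partial:
            return True
        time.sleep(poll_interval)
    return False
```

Usage would be `wait_for_downloads(download_dir)` just before `driver.quit()`. Note that the helper returns immediately if no download has started yet, so a short pause after the last `driver.get` may still be needed.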

# Parsing
- LlamaParse API with JSON output

In [None]:
!pip install llama-parse

In [None]:
import nest_asyncio
from llama_parse import LlamaParse


In [None]:
nest_asyncio.apply()

## Environment Variables
- LlamaParse API key
- **Todo**: OpenAI, Database Connections

In [2]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Access the API key
api_key = os.getenv('LLAMAPARSE_APIKEY')
# print(api_key)


In [None]:
# LLAMAPARSE_APIKEY = "llx-"

In [None]:
PARSING_INSTRUCTION = """The file is the annual financial report of CEAT, a tyre manufacturing company in India.
It is consistently formatted, with text in paragraphs and relevant diagrams and charts around it.
It contains many images of people, tyres (the company's products), vehicles, etc."""

In [None]:
# Use the key loaded from the .env file above
parser = LlamaParse(api_key=api_key,
                    verbose=True,
                    parsing_instruction=PARSING_INSTRUCTION)

In [None]:
json_objs = parser.get_json_result("./ceat/CEAT Limited Annual Report FY16.pdf")

In [None]:
pages = json_objs[0]["pages"]

## Write parsed response to a json file (as backup)
- **Todo**: Set up a database to save the parsed content

In [None]:
file_path = "./llama_parse_data/2016pdf.json"

In [None]:
import json

with open(file_path, 'w') as file:
    json.dump(json_objs, file, indent=4)

print(f"Data successfully written to {file_path}")

## Load parsed response from json file
- **Todo**: Read from Database

```python
import json

file_path = "./llama_parse_data/2016pdf.json"
with open(file_path, 'r') as file:
    json_docs = json.load(file)
pages = json_docs[0]["pages"]
```

## Combine all the markdown as a single markdown text

In [None]:
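# Join the per-page markdown into one string; MarkdownHeaderTextSplitter.split_text expects a str, not a list
md_pages = "\n\n".join(element["md"] for element in pages)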

## Use LangChain MarkdownHeaderTextSplitter on level 1 and level 2 headers to create chunks

In [None]:
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.schema.document import Document

In [None]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(md_pages)
md_header_splits
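
To make the shape of `md_header_splits` concrete, here is a simplified pure-Python stand-in for header-based splitting. It is not LangChain's implementation: the function name `split_by_headers` is hypothetical, it only handles `#` and `##`, and it ignores edge cases such as `#` characters inside code fences. Each chunk carries the headers it sits under as metadata, which is what later steps can filter on.

```python
import re

HEADER_KEYS = {1: "Header 1", 2: "Header 2"}
HEADER_RE = re.compile(r"^(#{1,2})\s+(.*\S)\s*$")


def split_by_headers(markdown):
    """Group markdown lines into chunks keyed by their enclosing # and ## headers."""
    chunks = []
    current_headers = {}
    buffer = []

    def flush():
        # Emit the text collected so far under the headers currently in effect
        content = "\n".join(buffer).strip()
        if content:
            chunks.append({"content": content, "metadata": dict(current_headers)})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces its own level and clears any deeper level
            for other_level, key in HEADER_KEYS.items():
                if other_level >= level:
                    current_headers.pop(key, None)
            current_headers[HEADER_KEYS[level]] = match.group(2)
        else:
            buffer.append(line)
    flush()
    return chunks
```

For example, text under `## Financials` inside `# Overview` ends up with both `Header 1` and `Header 2` in its metadata, while text directly under a later `# Governance` carries only `Header 1`.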

## Use RecursiveCharacterTextSplitter to split it down further

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
# Char-level splits
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 100
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
)

# Split
splits = text_splitter.split_documents(md_header_splits)
splits
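
The tagging step from the pipeline outline could slot in at this point, once chunks exist. The sketch below is only a placeholder for the planned LLM call: `extract_keywords` uses plain term frequency with a tiny illustrative stopword list, and both function names are hypothetical. Tags are stored as one comma-separated string on the assumption that the vector store only accepts scalar metadata values.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real run would need a fuller one
STOPWORDS = {"the", "and", "of", "to", "in", "a", "for", "is", "on", "with", "as", "by", "was", "at"}


def extract_keywords(text, top_n=5):
    """Stand-in for the planned LLM call: return the most frequent non-stopword terms."""
    words = re.findall(r"[a-z][a-z0-9-]+", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


def tag_chunks(chunks, extractor=extract_keywords):
    """Return copies of (text, metadata) pairs with a comma-separated `tags` field added."""
    tagged = []
    for text, metadata in chunks:
        tags = extractor(text)
        # Build a new dict so the caller's metadata is left untouched
        tagged.append((text, {**metadata, "tags": ",".join(tags)}))
    return tagged
```

With the LangChain documents above, the same idea would be `doc.metadata["tags"] = ",".join(extract_keywords(doc.page_content))` for each `doc` in `splits`. Swapping in an LLM only requires passing a different `extractor`.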

## Set up Llama3 8B

In [None]:
from langchain_community.embeddings.ollama import OllamaEmbeddings

In [None]:
MODEL_NAME = "llama3"
BASE_URL = "http://localhost:11434"

In [None]:
def get_embedding_function():
    embeddings = OllamaEmbeddings(model=MODEL_NAME, base_url=BASE_URL)
    return embeddings

# Embeddings

In [None]:
embedding_function = get_embedding_function()
print(embedding_function)

## Vectorstore: ChromaDB
- Set up Chroma DB with Llama3 8B
- **Todo**: Set up a hosted vector store

In [None]:
CHROMA_PATH_LLAMA = "chroma_llama"

In [None]:
# Import from langchain_community, matching the OllamaEmbeddings import above
from langchain_community.vectorstores import Chroma

In [None]:
db = Chroma.from_documents(splits, embedding_function, persist_directory=CHROMA_PATH_LLAMA)

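## Similarity Search using Cosine Similarity
- Note: Chroma uses squared L2 distance by default; cosine distance needs `collection_metadata={"hnsw:space": "cosine"}` when the collection is created
- **Todo**: Query reconstruction and expansion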

### PoC: Top-K Similarity Search

In [None]:
query = "What is the EBITDA of CEAT?"

In [None]:
docs = db.similarity_search(query)
print(docs)

In [None]:
docs_with_score = db.similarity_search_with_score(query)
print(docs_with_score)
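
For reference, top-k cosine similarity with an optional metadata filter can be sketched in plain Python as below. This is an illustration of the idea behind steps 7 and 8 of the outline, not how Chroma works internally; the names `cosine_similarity`, `top_k`, and `where` are hypothetical. Keep in mind that `similarity_search_with_score` on Chroma returns a distance, so lower means closer, whereas this sketch returns a similarity, so higher means closer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, records, k=4, where=None):
    """Rank (vector, metadata) records by cosine similarity to the query.

    `where` is an optional dict of exact-match metadata filters applied
    before ranking. Returns up to `k` (score, metadata) pairs, best first.
    """
    candidates = [
        (cosine_similarity(query_vector, vector), metadata)
        for vector, metadata in records
        if where is None or all(metadata.get(key) == value for key, value in where.items())
    ]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return candidates[:k]
```

Filtering before ranking keeps irrelevant chunks (for example, a different report year) from taking up slots in the top k.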