## Docling (PDF to Markdown)

In [46]:
import docling

In [47]:
dir(docling)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 'backend',
 'datamodel',
 'document_converter',
 'exceptions',
 'models',
 'pipeline',
 'utils']

In [48]:
from docling.document_converter import DocumentConverter

In [49]:
pdfs = ["./Tech Innovators Corp sample FS.pdf","./BIT301_Tutorial_1.pdf", "./agile-project-management-whats-the-story.pdf"]

In [50]:
# Create the converter; the document path is passed later to convert()
converter = DocumentConverter()

In [51]:
# Convert the document to a structured format
structured_data = converter.convert(pdfs[0])

In [52]:
output = structured_data.document.export_to_markdown()

In [53]:
type(output)

str

In [54]:
output[:1000]  # Display the first 1000 characters of the output

'## Tech Innovators Corp.\n\nFinancial Statements for the Years Ended December 31, 2022 and 2021 (All amounts in USD)\n\n## Balance Sheet\n\n| Assets                     | 2022        | 2021        |\n|----------------------------|-------------|-------------|\n| Current Assets             |             |             |\n| Cash &Equivalents          | $15,000,000 | $12,000,000 |\n| Accounts Receivable        | $10,000,000 | $8,500,000  |\n| Inventory                  | $5,000,000  | $4,200,000  |\n| Total Current Assets       | $30,000,000 | $24,700,000 |\n| Non-Current Assets         |             |             |\n| Property, Plant &Equipment | $25,000,000 | $22,000,000 |\n| Long-term Investments      | $8,000,000  | $6,500,000  |\n| Total Non-Current Assets   | $33,000,000 | $28,500,000 |\n| Total Assets               | $63,000,000 | $53,200,000 |\n| Liabilities &Equity        | 2022        | 2021        |\n| Current Liabilities        |             |             |\n| Accounts Payable 

In [55]:
from pathlib import Path

# Lowercase the path and keep only the file name (drops "./" and directories)
input_file = Path(pdfs[0].lower()).name
# Path.stem removes only the final extension, so names containing dots stay intact
input_file_name = Path(input_file).stem.replace(" ", "_")
output_file = f"{input_file_name}.md"


In [56]:
input_file, output_file

('tech innovators corp sample fs.pdf', 'tech_innovators_corp_sample_fs.md')
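The same naming rule can be wrapped in a small function and applied to every entry in `pdfs`. This is a minimal sketch; `markdown_name` is a hypothetical helper name, not part of Docling.

```python
from pathlib import Path


def markdown_name(pdf_path: str) -> str:
    """Return a lowercase, underscore-separated .md file name for a PDF path."""
    stem = Path(pdf_path).stem.lower()     # drop directories and the final extension only
    return "_".join(stem.split()) + ".md"  # collapse runs of whitespace into one underscore


pdfs = [
    "./Tech Innovators Corp sample FS.pdf",
    "./BIT301_Tutorial_1.pdf",
    "./agile-project-management-whats-the-story.pdf",
]
names = [markdown_name(p) for p in pdfs]
print(names)
```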

In [57]:
output_dir = "./markdown_docs/"

In [58]:
import os

os.makedirs(output_dir, exist_ok=True)  # create the folder if it is missing
output_path = os.path.join(output_dir, output_file)
# Save the Markdown text to a file
with open(output_path, "w", encoding="utf-8") as file:
    file.write(output)
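Only `pdfs[0]` is converted above. A loop over the whole list could look like the sketch below. `convert_all` is a hypothetical helper, and the conversion step is passed in as a callable so the loop itself does not depend on Docling being installed.

```python
from pathlib import Path
from typing import Callable, Iterable


def convert_all(
    pdf_paths: Iterable[str],
    to_markdown: Callable[[str], str],
    output_dir: str,
) -> list[Path]:
    """Convert each PDF with `to_markdown` and write one .md file per input."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    written = []
    for pdf_path in pdf_paths:
        # Same naming rule as above: lowercase stem, spaces to underscores
        name = "_".join(Path(pdf_path).stem.lower().split()) + ".md"
        target = out_dir / name
        target.write_text(to_markdown(pdf_path), encoding="utf-8")
        written.append(target)
    return written
```

In this notebook the callable would presumably be `lambda p: converter.convert(p).document.export_to_markdown()`, i.e. `convert_all(pdfs, that_lambda, output_dir)`.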

## Using LLM

In [59]:
import ollama

In [68]:
models = {
    "llama":["llama3.1:latest"],
    "deepseek":["deepseek-r1:8b"],
    "gemma":["gemma3:latest"],
    "qwen":["qwen3:latest"],
}

In [61]:
from ollama import GenerateResponse

In [None]:
# prompt
prompt = f"""
You are an expert financial analyst. You are analyzing the financial statement of an organization.

Your goal is to extract key insights from the <document> content provided below.

<instructions>
1. Summarize the financial performance of the organization.
2. Identify key financial metrics and ratios.
3. Provide actual values used in calculations.
4. Analyze the trends in revenue, expenses, and profit margins.
5. Highlight any significant trends or anomalies in the financial data.
6. Provide a brief overview of the organization's financial health.
7. Ensure the analysis is concise and focused on the most critical aspects of the financial statement.
</instructions>

<document>
{output} 
</document>

Follow the instructions carefully and provide a comprehensive analysis based on the provided financial statement.
State calculated metrics clearly, and do not include any speculative or unverified information.
Focus on factual data and insights derived from the financial statement.
Do not include any additional information or commentary outside of the analysis.
"""

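An f-string is evaluated when its cell runs, so `prompt` keeps whatever `output` held at that moment and goes stale if the cell is edited or `output` changes afterwards. One way around this is to fill a template at call time. The sketch below assumes a hypothetical `build_prompt` helper and abbreviates the instruction list for space.

```python
from string import Template

# Hypothetical template; the instruction list is shortened for illustration.
ANALYSIS_TEMPLATE = Template("""\
You are an expert financial analyst.

<instructions>
1. Summarize the financial performance of the organization.
2. Identify key financial metrics and ratios.
</instructions>

<document>
$document
</document>
""")


def build_prompt(document: str) -> str:
    """Fill the template at call time so the prompt always matches `document`."""
    # Only the template is scanned for placeholders, so "$" signs inside
    # the document text (for example "$15,000,000") are inserted verbatim.
    return ANALYSIS_TEMPLATE.substitute(document=document.strip())
```

Calling `build_prompt(output)` right before each request keeps the prompt in sync with the most recently converted document.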
In [63]:
prompt

"\nYou are an expert financial analyst.You're analyzing financial statement for a organization.\n\nYour goal is to extract key insights from the <document> content provided below.\n\n<instructions>\n1. Summarize the financial performance of the organization.\n2. Identify key financial metrics and ratios.\n3. Highlight any significant trends or anomalies in the financial data.\n4. Provide a brief overview of the organization's financial health.\n5. Include any relevant comparisons to industry benchmarks or previous periods.\n6. Ensure the analysis is concise and focused on the most critical aspects of the financial statement.\n</instructions>\n\n<document>\n## Tech Innovators Corp.\n\nFinancial Statements for the Years Ended December 31, 2022 and 2021 (All amounts in USD)\n\n## Balance Sheet\n\n| Assets                     | 2022        | 2021        |\n|----------------------------|-------------|-------------|\n| Current Assets             |             |             |\n| Cash &Equival

In [64]:
def call_llama(prompt, think=False, model="qwen3:latest"):
    """Send a prompt to a local Ollama model (any family, despite the name)."""
    response = ollama.generate(
        model=model,  # Any model already pulled locally
        prompt=prompt,
        think=think,  # Top-level argument, not an option; needs a recent ollama-python
        options={
            "temperature": 0.2,  # Lower values give more deterministic output
            "top_p": 0.8,  # Nucleus sampling cutoff
            "num_ctx": 8192,  # Context length in tokens (Ollama's name for this option)
        },
        # format="json",  # Uncomment to request JSON output
    )

    # Alternative: the chat endpoint
    # response = ollama.chat(
    #     model=model,
    #     messages=[{"role": "user", "content": prompt}],
    # )
    # return response["message"]["content"]
    return response

In [65]:
def write_response_to_file(response, output_file):
    with open(f"{output_dir}summary_{output_file}", "w", encoding="utf-8") as file:
        file.write(response)

In [69]:
response = call_llama(prompt, think=True, model=models["qwen"][0])
if response.done:
    write_response_to_file(response.response, output_file)
    print("Response completed successfully.")
    # print("Response:", response.response)
else:
    print("Response not completed yet. Status:", response.done)

Response completed successfully.
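Reasoning models such as qwen3 can embed a `<think>...</think>` block in the returned text, as the NER output later in this notebook shows. Whether that happens depends on the client version and on how thinking is configured. A minimal sketch for removing such blocks before saving the summary follows; `strip_thinking` is a hypothetical helper.

```python
import re

# Non-greedy match across newlines; an unclosed <think> tag is left untouched.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove <think>...</think> spans and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


sample = "<think>\nCheck the balance sheet first.\n</think>\n\nTotal assets rose to $63,000,000."
print(strip_thinking(sample))  # prints: Total assets rose to $63,000,000.
```

With this in place, the write step could become `write_response_to_file(strip_thinking(response.response), output_file)`.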


In [67]:
# Define the prompt for NER
prompt_ner = f"""
You are a natural language assistant.
Identify entities such as person names, organization names, locations, and dates in the provided Markdown content.

Markdown content:
---
{output}
---

Return only the identified names, locations and dates.
"""

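The prompt above asks for plain text, which is hard to post-process. If it were changed to request a JSON object instead (an assumption, not what this notebook does), the reply could be parsed as sketched below. The key names in `EXPECTED_KEYS` and the sample reply are hypothetical.

```python
import json

# Hypothetical schema; the prompt would have to ask for exactly these keys.
EXPECTED_KEYS = ("persons", "organizations", "locations", "dates")


def parse_entities(raw: str) -> dict[str, list[str]]:
    """Parse a JSON object of entity lists, tolerating text around the JSON."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start:end + 1])
    # Missing keys become empty lists so callers always see the same shape
    return {key: [str(v) for v in data.get(key, [])] for key in EXPECTED_KEYS}


# Made-up reply for illustration only
reply = 'Here you go: {"persons": ["Jerry Manas"], "locations": ["Iceland"], "dates": ["2015"]}'
entities = parse_entities(reply)
print(entities)
```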
In [None]:
# Print only the generated text rather than the whole response object
print(call_llama(prompt_ner).response)

<think>
Okay, let's see. The user wants me to return only the identified names, locations, and dates from the provided text. First, I need to go through the text carefully.

Starting with the author's name: Jerry Manas is mentioned. Then there's a mention of Tom Peters, who is a management guru, and Pat Williams, Senior VP of the Orlando Magic. Locations include places like Iceland, where Jerry appeared on a National TV program discussing economic recovery. Dates mentioned are 2015, when Gartner predicted three quarters of knowledge-based project work would be by virtual teams. Also, the book "Napoleon on Project Management" and "Managing the Gray Areas" are referenced, but those are titles, not dates. The Planview Enterprise demo is mentioned, but no specific date there. 

Wait, the user specified to return only names, locations, and dates. So I need to make sure I don't include any other information. Let me check again. 

Names: Jerry Manas, Tom Peters, Pat Williams, Angela Ahrendts,
