# Detailed Explanation of the Company Research Agent Notebook


## Cell 1: Package Installation

transformers: For accessing pre-trained language models

accelerate: Handles device placement and memory-efficient model loading

bitsandbytes: Enables 4-bit quantization for memory efficiency

tavily-python: API client for web search functionality

nest-asyncio: Enables nested async operations in Jupyter

beautifulsoup4: HTML parsing for content cleaning

Key Insight: These dependencies enable the core functionality of the agent - efficient model loading, web searching, and content processing.

In [1]:
# Install required packages
!pip install transformers accelerate bitsandbytes tavily-python nest-asyncio beautifulsoup4

Collecting bitsandbytes
  Downloading bitsandbytes-0.46.1-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting tavily-python
  Downloading tavily_python-0.7.10-py3-none-any.whl.metadata (7.5 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cublas_cu12-

Purpose: Imports all necessary libraries:

Core Python utilities (os, re, asyncio)

PyTorch for deep learning

BeautifulSoup for HTML parsing

Hugging Face Transformers for language model

Tavily for web search API

nest_asyncio for async support

Design Choice: The combination of these libraries enables efficient asynchronous operations and memory-optimized model loading.

In [2]:
import os
import torch
import asyncio
import re
from getpass import getpass
from bs4 import BeautifulSoup
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from transformers import BitsAndBytesConfig
from tavily import TavilyClient
import nest_asyncio

Purpose: Enables nested asynchronous operations in Jupyter notebooks. Jupyter and Colab kernels already run an event loop, so a plain asyncio.run() call from a cell raises a RuntimeError unless the loop is patched to allow re-entry.

Why Important: Lets the notebook call asyncio.run() from a cell to drive the web searches.

In [3]:
# Apply nest_asyncio so asyncio.run() works inside the notebook's running event loop
nest_asyncio.apply()
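
To see why the patch is needed, the sketch below reproduces the situation in plain Python, without a notebook. Here `outer` plays the role of the kernel's already-running loop; both `inner` and `outer` are illustrative names that do not appear in the notebook.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # Inside this coroutine a loop is already running, much like in a notebook cell.
    coro = inner()
    try:
        # Without nest_asyncio, starting a second loop here is rejected.
        asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

After `nest_asyncio.apply()`, the same nested call is permitted, which is what lets the later cells call `asyncio.run()` directly.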

Purpose: Configures 4-bit quantization to optimize memory usage:

load_in_4bit: Reduces model memory footprint

nf4 quantization: 4-bit NormalFloat, a data type designed for normally distributed weights

float16 compute: Weights are dequantized to float16 for the actual matrix multiplications

Double quantization: Also quantizes the quantization constants, further reducing memory requirements

Performance Impact: Allows the 8B parameter model to fit on a free Colab GPU (T4, 16 GB), which could not comfortably hold the roughly 16 GB of float16 weights.

In [4]:
# Configuration for 4-bit quantization
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)
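
The memory saving can be estimated with simple arithmetic. The sketch below is a rough, weights-only estimate: it assumes about 8 billion parameters and ignores activations, the KV cache, quantization constants, and any layers kept in higher precision. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # assumption: roughly 8 billion parameters

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # unquantized float16 weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"float16 weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:   ~{nf4_gb:.1f} GB")
```

The float16 figure is consistent with the 16.1G `model.safetensors` download shown in the model-loading output; real usage with 4-bit weights will be somewhat higher than this estimate because of the overheads listed above.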

Purpose: Loads the pre-trained language model with optimizations:

Uses Llama-3-8B-Web, a Llama 3 8B model fine-tuned by McGill-NLP for web navigation tasks

Applies 4-bit quantization from previous configuration

Automatically maps model to available GPU

Uses float16 precision for efficiency

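Key Insight: The notebook relies on this web-focused model to summarize search results. No comparison against a base LLM is shown, so any quality advantage for summarization is an assumption rather than a measured result.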

In [19]:
# Load model with optimized settings
model_id = "McGill-NLP/Llama-3-8B-Web"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True
)

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/335 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/707 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/16.1G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/143 [00:00<?, ?B/s]

Purpose: Creates optimized text generation pipeline:

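max_new_tokens=574: Default cap on generated tokens (generate_summary later overrides it with 400 per call)

temperature=0.7: Controls creativity vs. determinism

do_sample=True: Enables probabilistic sampling

pad_token_id=tokenizer.eos_token_id: Reuses the end-of-sequence token for padding, a common workaround when the tokenizer defines no dedicated pad token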

Design Choice: These parameters produce concise yet comprehensive summaries while maintaining reasonable generation speed.

In [20]:
# Create optimized text-generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=574,        # Reduced for faster generation
    temperature=0.7,
    do_sample=True,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id
)

Device set to use cuda:0
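
To illustrate what the temperature setting does, the sketch below applies a temperature-scaled softmax to a made-up set of logits. It is a standalone illustration of the general technique, not the pipeline's internal code; `softmax_with_temperature` and the logit values are assumptions chosen for the example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

baseline = softmax_with_temperature(logits, 1.0)
sharper = softmax_with_temperature(logits, 0.7)

print("T=1.0:", [round(p, 3) for p in baseline])
print("T=0.7:", [round(p, 3) for p in sharper])
```

A temperature below 1.0 shifts probability toward the highest-scoring token, so sampling with `temperature=0.7` is less random than sampling at 1.0 while still allowing some variation.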


Purpose: Securely collects and initializes the Tavily API key. Using getpass() ensures the API key isn't visible during input.

Why Tavily: Tavily specializes in research-focused web searches with source citation, which is crucial for this application.

In [7]:
# Get Tavily API key securely
TAVILY_API_KEY = getpass("Enter your Tavily API key: ")
tavily = TavilyClient(api_key=TAVILY_API_KEY)

Enter your Tavily API key: ··········


Defines the structure of the final company report. This comprehensive structure ensures all business-relevant aspects of a company are covered.

In [21]:
# Define report sections
sections = [
    "Overview",
    "Financials",
    "Operations",
    "Market",
    "Digital",
    "Recent News",
    "Employer"
]

Provides optimized search templates for each report section. These templates are engineered to return the most relevant information for each business dimension.

In [22]:
# Optimized prompts for search queries
query_prompts = {
    "Overview": "{company_name} founding date CEO headquarters",
    "Financials": "{company_name} revenue funding financial report",
    "Operations": "{company_name} employee count locations",
    "Market": "{company_name} competitors market share",
    "Digital": "{company_name} social media followers",
    "Recent News": "{company_name} news last week",
    "Employer": "{company_name} Glassdoor ratings jobs"
}

Gives the LLM precise instructions for processing search results. Key features:

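100-word target for conciseness (requested in the prompt, not enforced in code)

Source citation requirements (URLs plus 1-5 confidence scores)

Formatting constraints (no HTML elements in the answer)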

In [23]:
# Optimized section instructions
section_instructions = {
    "Overview": "Provide a 100-word overview of {company_name} including founding date, leadership, and key milestones. Cite sources with URLs and confidence scores (1-5). Complete the entire answer properly, do not add any html element into the answer",
    "Financials": "Provide a 100-word financial summary of {company_name} including revenue and funding. Cite sources with URLs and confidence scores (1-5). Complete the entire answer properly, do not add any html element into the answer",
    "Operations": "Describe {company_name}'s operations in 100 words including employee count and locations. Cite sources with URLs and confidence scores (1-5). Complete the entire answer properly, do not add any html element into the answer",
    "Market": "Analyze {company_name}'s market position in 100 words including competitors. Cite sources with URLs and confidence scores (1-5). Complete the entire answer properly, do not add any html element into the answer",
    "Digital": "Detail {company_name}'s digital presence in 100 words including social media stats. Cite sources with URLs and confidence scores (1-5). Complete the entire answer properly, do not add any html element into the answer",
    "Recent News": "Summarize recent news about {company_name} in 100 words. Cite sources with URLs and confidence scores (1-5). Complete the entire answer properly, do not add any html element into the answer",
    "Employer": "Evaluate {company_name} as an employer in 100 words. Cite sources with URLs and confidence scores (1-5). Complete the entire answer properly, do not add any html element into the answer"
}

Sanitizes raw HTML content from web searches by:

Removing all HTML tags

Collapsing whitespace

Filtering out AI boilerplate text

Truncating to 500 characters with ellipsis

In [24]:
# Function to clean HTML content
def clean_content(content):
    """Remove HTML tags and truncate content"""
    # Remove HTML tags
    clean_text = BeautifulSoup(content, "html.parser").get_text()
    # Remove excessive whitespace
    clean_text = re.sub(r'\s+', ' ', clean_text)
    # Remove AI system messages
    clean_text = re.sub(r'You are an AI assistant.*?;', '', clean_text, flags=re.DOTALL)
    # Truncate to 500 characters and mark the cut with an ellipsis
    return clean_text[:500] + '...' if len(clean_text) > 500 else clean_text
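
The slice above cuts at exactly 500 characters, which can split a word in half. If whole words should be preserved, one possible variant is sketched below; `truncate_words` is a hypothetical helper that is not part of the notebook.

```python
def truncate_words(text: str, limit: int = 500, suffix: str = "...") -> str:
    """Truncate text to at most `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # If the cut landed in the middle of a word, drop the partial word.
    if not text[limit].isspace() and " " in cut:
        cut = cut.rsplit(" ", 1)[0]
    return cut.rstrip() + suffix


sample = "alpha beta gamma"
shortened = truncate_words(sample, limit=8)
print(shortened)
```

With `limit=8` the raw slice would end in the partial word "be", so the helper backs up to the previous space before appending the ellipsis.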

Simple helper that formats search queries using the predefined templates.

In [25]:
# Function to generate search query
def generate_query(section, company_name):
    return query_prompts[section].format(company_name=company_name)

Performs asynchronous web searches with:

Special handling for news (7-day recency filter)

Error handling to prevent full pipeline failures

Results limit (3) for efficiency

In [26]:
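# Async function for parallel searches
async def async_search(query, section):
    try:
        # tavily.search is a blocking call, so run it in a worker thread
        # to let asyncio.gather overlap the requests
        if section == "Recent News":
            return await asyncio.to_thread(
                tavily.search, query=query, topic="news", days=7, max_results=3
            )
        return await asyncio.to_thread(tavily.search, query=query, max_results=3)
    except Exception as e:
        print(f"Error searching {section}: {e}")
        return {"results": []}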

Coordinates parallel execution of all section searches. This is the core concurrency manager that enables efficient data gathering.

In [27]:
# Async search manager
async def perform_searches(company_name):
    tasks = []
    for section in sections:
        query = generate_query(section, company_name)
        tasks.append(async_search(query, section))
    return await asyncio.gather(*tasks)
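
`asyncio.gather` only overlaps work when each coroutine yields control while it waits; a blocking call inside an `async def` holds the event loop until it returns. The sketch below shows one way to overlap blocking calls by running them in worker threads with `asyncio.to_thread` (Python 3.9+). `fake_search` is a hypothetical stand-in for a blocking search client, and the code is written as a standalone script rather than a notebook cell.

```python
import asyncio
import time


def fake_search(query: str) -> dict:
    """Hypothetical stand-in for a blocking search call."""
    time.sleep(0.3)  # simulate network latency
    return {"results": [query]}


async def search_in_thread(query: str) -> dict:
    # Run the blocking function in a worker thread so the loop stays free.
    return await asyncio.to_thread(fake_search, query)


async def run_all(queries):
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(search_in_thread(q) for q in queries))


queries = ["overview", "financials", "operations", "market"]

start = time.perf_counter()
responses = asyncio.run(run_all(queries))
elapsed = time.perf_counter() - start

print(f"{len(responses)} searches finished in {elapsed:.2f}s")
```

Because results come back in input order, the caller can pair each response with its section by index, as the main function does.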

The core intelligence function that:

Structures search results into a standardized prompt

Leverages the LLM to generate a focused summary

Extracts only the relevant summary portion

Returns sources for citation

Prompt Engineering: The [INSTRUCTION]/[SEARCH RESULTS]/[SUMMARY] format reliably produces well-structured outputs.

In [28]:
# Function to generate summary with token optimization
def generate_summary(section, company_name, search_results):
    if not search_results:
        return "No relevant information found", []

    # Format results efficiently
    formatted_results = "\n\n".join(
        f"Source {i+1}: {res['title']}\nURL: {res['url']}\nContent: {clean_content(res['content'])}"
        for i, res in enumerate(search_results)
    )

    # Create optimized prompt
    instructions = section_instructions[section].format(company_name=company_name)
    prompt = f"""
    [INSTRUCTION]
    {instructions}

    [SEARCH RESULTS]
    {formatted_results}

    [SUMMARY]
    """

    # Generate summary
    try:
        output = pipe(
            prompt,
            max_new_tokens=400,
            do_sample=True,
            temperature=0.7,
            top_p=0.9
        )[0]['generated_text']

        # Extract only the summary part
        summary = output.split("[SUMMARY]")[-1].strip()
        return summary, [res['url'] for res in search_results]
    except Exception as e:
        print(f"Error generating {section} summary: {e}")
        return "Summary generation failed", []
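
The extraction step works because the text-generation output includes the prompt followed by the model's continuation. The sketch below isolates that step on a made-up output string; `extract_summary` and the sample text are illustrative only.

```python
def extract_summary(generated_text: str, marker: str = "[SUMMARY]") -> str:
    """Return the text after the last occurrence of the marker."""
    return generated_text.split(marker)[-1].strip()


# Made-up example of prompt plus continuation
fake_output = (
    "[INSTRUCTION]\nSummarize the company.\n\n"
    "[SEARCH RESULTS]\nSource 1: Example\n\n"
    "[SUMMARY]\n   Acme was founded in 1999.  "
)

summary = extract_summary(fake_output)
print(summary)
```

Note that `split(...)[-1]` keeps only the text after the last marker, so if the model happens to emit `[SUMMARY]` again in its answer, everything before that repetition is discarded.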

The main function orchestrates the full workflow:

Parallel search execution

Sequential summary generation (one section at a time on the single GPU)

Report compilation with Markdown formatting

Source aggregation

In [29]:
# Main function with performance improvements
def main(company_name):
    print(f"\n🔍 Starting research on {company_name}...")

    # Run all searches in parallel
    print("⚡ Conducting parallel searches...")
    search_responses = asyncio.run(perform_searches(company_name))

    report_sections = []
    all_sources = set()

    # Process each section
    for i, section in enumerate(sections):
        print(f"📝 Generating {section} section...")
        search_results = search_responses[i].get("results", [])
        summary, sources = generate_summary(section, company_name, search_results)
        report_sections.append((section, summary, sources))
        all_sources.update(sources)

    # Compile report
    report = f"# Comprehensive Report: {company_name}\n\n"
    for section, summary, sources in report_sections:
        report += f"## {section}\n{summary}\n\n"
        report += "### Sources\n" + "\n".join(f"- {url}" for url in sources) + "\n\n"

    print("\n✅ Report generated successfully!")
    print(f"\n=== Final Report ===\n{report}")
    print("### All Sources\n" + "\n".join(f"- {url}" for url in sorted(all_sources)))

In [31]:
# Run with example company
if __name__ == "__main__":
    company_name = "Interactive Cares"
    main(company_name)


🔍 Starting research on Interactive Cares...
⚡ Conducting parallel searches...
📝 Generating Overview section...
📝 Generating Financials section...
📝 Generating Operations section...
📝 Generating Market section...
📝 Generating Digital section...
📝 Generating Recent News section...
📝 Generating Employer section...

✅ Report generated successfully!

=== Final Report ===
# Comprehensive Report: Interactive Cares

## Overview
- Interactive Cares was founded in 2020. 
     - Interactive Cares is headquartered in Dhaka, Bangladesh. 
     - The company provides a virtual Edtech platform for creating employability. 
     - On a mission to bridge the skill gap and make India and Bangladesh employable. 
     - The platform aims to make learning interactive and fun. 
     - It was founded by Saadat Bin Mostafiz and Md. Tanvir Bashir. 
     - As of 2022, the company has raised $800K in funding. 
     - The company's mission is to bridge the skill gap and make India and Bangladesh employable. 
     