<a href="https://colab.research.google.com/github/anurag-adk/CSVision/blob/main/CSVision.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CSVision👀 - An AI Agent for Company Info Extraction

## Project Overview
CSVision is an intelligent agent that extracts company information from paragraphs of text using LangChain's LCEL (LangChain Expression Language) framework and writes the result to a CSV file.

### Objectives
- Extract company names, founding dates and founders from text
- Generate CSV output with structured data
- Use LCEL Runnable Interface for parsing
- Implement Tools and Tool calling for automation

### Expected Output Format
- Company Name (string)
- Founding Date (YYYY-MM-DD format)
- Founders (comma-separated string)

## Environment Setup and Dependencies

Step 1: Generate a Gemini API key in Google AI Studio.<br>
Step 2: In Colab, create a secret named `GEMINI_API_KEY` and set its value to the key from Google AI Studio.<br>

Next, install the required packages to set up the environment.

In [None]:
!pip install langchain-google-genai langchain langgraph pandas pydantic -qU

In [None]:
# Import necessary libraries
import os
import pandas as pd
import requests
import json
import re


from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel, Field
from google.colab import userdata

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool
from langchain_core.messages import HumanMessage, ToolMessage, SystemMessage, AIMessage
from langgraph.prebuilt import create_react_agent

In [None]:
# Set your Google API key
os.environ["GOOGLE_API_KEY"] = userdata.get("GEMINI_API_KEY")
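
The cell above relies on `google.colab.userdata`, which is only importable inside Colab. A minimal sketch of a fallback for other environments, assuming the key may also be exported as an ordinary environment variable; `load_api_key` is a hypothetical helper and not part of the notebook:

```python
import os


def load_api_key(name: str = "GEMINI_API_KEY") -> str:
    """Return the API key from Colab secrets, or from the environment otherwise.

    Hypothetical helper: the notebook itself calls userdata.get directly.
    """
    try:
        # Only importable inside Google Colab
        from google.colab import userdata
        key = userdata.get(name)
    except ImportError:
        # Assumption: outside Colab the key is exported as an environment variable
        key = os.environ.get(name)

    if not key:
        raise RuntimeError(f"No value found for {name}")
    return key
```

With this helper, the assignment would become `os.environ["GOOGLE_API_KEY"] = load_api_key()`.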

In [None]:
# Initialize the Gemini 2.5 Flash Lite model
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash-lite",
    temperature=0.7,
    max_tokens=None
)

# Simple invocation
response = llm.invoke("Can you help me extract company info?")
print(response.content)

Absolutely! I can help you extract company information. To do that effectively, I need you to tell me **what kind of company information you're looking for and where you expect to find it.**

Please be as specific as possible. Here are some common types of company information people look for, and the contexts in which they might be found:

**What Kind of Company Information Are You Looking For?**

*   **Basic Identification:**
    *   Company Name (official, legal name)
    *   Company Website URL
    *   Company Type (e.g., public, private, non-profit, government)
    *   Industry/Sector
    *   Headquarters Location (city, state/province, country)
*   **Financial Information:**
    *   Stock Ticker Symbol (if publicly traded)
    *   Revenue (annual, quarterly)
    *   Profit/Net Income
    *   Market Capitalization
    *   Number of Employees
    *   Funding Rounds (for startups/private companies)
    *   Key Financial Ratios
*   **Operational Information:**
    *   Products/Service

## Structured Output Setup

This section defines the structured output schema for extracting company information using Pydantic models. The `CompanyInfo` and `CompanyExtractionResult` classes ensure that the extracted data (`company_name`, `founding_date` and `founders`) is consistently formatted for further processing and CSV generation.

In [None]:
# Define the desired structure for company info
class CompanyInfo(BaseModel):
    """Information about a company."""
    company_name: str = Field(..., description="The full official company name")
    founding_date: str = Field(..., description="The founding date")
    founders: List[str] = Field(..., description="List of founders' name")

class CompanyExtractionResult(BaseModel):
    """Result containing multiple companies extracted from text."""
    companies: List[CompanyInfo] = Field(..., description="List of extracted companies")

# Display the schema
print("Company Info Schema:")
print(json.dumps(CompanyInfo.model_json_schema(), indent=2))

Company Info Schema:
{
  "description": "Information about a company.",
  "properties": {
    "company_name": {
      "description": "The full official company name",
      "title": "Company Name",
      "type": "string"
    },
    "founding_date": {
      "description": "The founding date",
      "title": "Founding Date",
      "type": "string"
    },
    "founders": {
      "description": "List of founders' name",
      "items": {
        "type": "string"
      },
      "title": "Founders",
      "type": "array"
    }
  },
  "required": [
    "company_name",
    "founding_date",
    "founders"
  ],
  "title": "CompanyInfo",
  "type": "object"
}


In [None]:
# Initialize the model with structured output
structured_llm = llm.with_structured_output(CompanyExtractionResult)

# Test with a sample text
sample_text = "Nvidia Corporation was founded on April 5, 1993, by Jensen Huang, Chris Malachowsky, and Curtis Priem."

result = structured_llm.invoke(f"""
Extract company information from this text:
{sample_text}

For each company, provide:
- Company name (full official name)
- Founding date (as mentioned)
- Founders (all people mentioned)
""")

print("Structured LLM test:")
for company in result.companies:
    print(f"Company: {company.company_name}")
    print(f"Founded: {company.founding_date}")
    print(f"Founders: {company.founders}")

Structured LLM test:
Company: Nvidia Corporation
Founded: April 5, 1993
Founders: ['Jensen Huang', 'Chris Malachowsky', 'Curtis Priem']


## Date Formatting

The `format_date` function defined in this section standardizes various date formats into the **YYYY-MM-DD** format. This is necessary for consistent CSV output. The following code tests the function with a variety of date inputs, including year-only, year-month and full date formats, as well as textual descriptions, to verify its robustness and accuracy.

In [None]:
# Utility function for date formatting
def format_date(date_string: str) -> str:
    """Format various date formats to YYYY-MM-DD"""
    date_string = date_string.strip()

    # Handle different date formats
    if len(date_string) == 4: # When only Year is present
        return f"{date_string}-01-01"
    elif len(date_string) == 7: # When only Year and Month are present
        return f"{date_string}-01"
    elif len(date_string) == 10: # When all Year, Month and Day are present
        return date_string
    else:
        try:
            # Extract year using regex
            year_match = re.search(r'\b(18|19|20)\d{2}\b', date_string)
            if year_match:
                year = year_match.group()

                # Month mapping
                month_map = {
                    'january': '01', 'february': '02', 'march': '03', 'april': '04',
                    'may': '05', 'june': '06', 'july': '07', 'august': '08',
                    'september': '09', 'october': '10', 'november': '11', 'december': '12'
                }

                month = '01'
                for month_name, month_num in month_map.items():
                    if month_name.lower() in date_string.lower():
                        month = month_num
                        break

                # Extract day
                day_match = re.search(r'\b([1-3]?\d)\b', date_string)
                day = day_match.group().zfill(2) if day_match and int(day_match.group()) <= 31 else '01'

                return f"{year}-{month}-{day}"
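        except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
            pass

    # Fallback: no recognizable year was found in the input
    return f"{date_string}-01-01"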

In [None]:
# Test the date formatting function
test_dates = ["1998", "2003-11", "1997-07-12", "November 11, 1883", "April 13, 2001"]
print("Date formatting tests:")
for date in test_dates:
    print(f"{date} -> {format_date(date)}")

Date formatting tests:
1998 -> 1998-01-01
2003-11 -> 2003-11-01
1997-07-12 -> 1997-07-12
November 11, 1883 -> 1883-11-11
April 13, 2001 -> 2001-04-13
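
`format_date` relies on string length and regular expressions, so unexpected inputs still produce a date-like string. As a point of comparison, here is a minimal sketch of a stricter variant built on `datetime.strptime` that returns `None` when nothing matches. The names `KNOWN_FORMATS` and `parse_date_strict` are hypothetical, the format list is an assumption based on the inputs tested above, and `%B` matches English month names only under an English or C locale:

```python
from datetime import datetime
from typing import Optional

# Assumed set of input formats, tried in order (not taken from the notebook)
KNOWN_FORMATS = ["%Y-%m-%d", "%B %d, %Y", "%B %Y", "%Y-%m", "%Y"]


def parse_date_strict(date_string: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = date_string.strip()
    for fmt in KNOWN_FORMATS:
        try:
            # Missing month or day components default to 1
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


for sample in ["1998", "2003-11", "November 11, 1883", "August 2008", "unknown"]:
    print(f"{sample} -> {parse_date_strict(sample)}")
```

Returning `None` leaves the caller to decide how to handle an unparseable date, rather than writing a malformed value into the CSV.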


## Info Extraction Tools

The project uses LangChain’s tool-calling capabilities to automate the extraction and storage of company information. Two tools are defined: `extract_companies`, which processes input text and returns structured company data as a JSON string, and `save_to_csv`, which parses that JSON, formats the dates and writes the records to a CSV file. These tools form the backbone of the agent’s workflow.

In [None]:
# Define tools

@tool(description="Extract company info from provided text")
def extract_companies(text: str) -> str:
    """Extract company info from provided text"""
    try:
        # Use structured LLM to extract data
        extraction_prompt = f"""
        Extract company information from this text. For each company, provide:
        1. Company name (full official name including Corporation, Inc., LLC, etc.)
        2. Founding date (exact format from text)
        3. Founders (all people mentioned as founders)

        Text: {text}
        """

        result = structured_llm.invoke(extraction_prompt)

        # Convert to JSON for agent communication
        return json.dumps([company.model_dump() for company in result.companies])
    except Exception as e:
        return f"Error extracting companies: {str(e)}"

In [None]:
@tool(description="Save extracted company data to CSV file")
def save_to_csv(companies_data: str) -> str:
    """Save extracted company data to CSV file"""
    try:
        # Parse JSON data
        data = json.loads(companies_data)

        # Create DataFrame
        df_data = []
        for company in data:
            df_data.append({
                'company_name': company['company_name'],
                'founding_date': format_date(company['founding_date']),
                'founders': ', '.join(company['founders'])
            })

        df = pd.DataFrame(df_data)

        # Save to CSV
        filename = 'CSVision_company_info.csv'
        df.to_csv(filename, index=False)

        return f"Successfully saved {len(df)} companies to {filename}"
    except Exception as e:
        return f"Error saving to CSV: {str(e)}"

In [None]:
# Check the list of tools available to our agent
tools = [extract_companies, save_to_csv]

print(f"Available tools: {[tool.name for tool in tools]}")

Available tools: ['extract_companies', 'save_to_csv']
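
The two tools communicate through a JSON string, and `extract_companies` returns a plain error message when extraction fails, which `json.loads` in `save_to_csv` cannot parse. A minimal sketch of checking the payload before it is used, assuming the three field names from `CompanyInfo`; `parse_companies_payload` is a hypothetical helper and not part of the notebook:

```python
import json

# Field names taken from the CompanyInfo schema defined earlier
REQUIRED_KEYS = ("company_name", "founding_date", "founders")


def parse_companies_payload(payload: str) -> list:
    """Parse the JSON hand-off between the tools and check its shape."""
    try:
        records = json.loads(payload)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Payload is not valid JSON: {exc}") from exc

    if not isinstance(records, list):
        raise ValueError("Expected a JSON array of company records")

    for position, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"Record {position} is not a JSON object")
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"Record {position} is missing: {', '.join(missing)}")
    return records


example = json.dumps([{
    "company_name": "Nvidia Corporation",
    "founding_date": "April 5, 1993",
    "founders": ["Jensen Huang", "Chris Malachowsky", "Curtis Priem"],
}])
print(len(parse_companies_payload(example)))
```

A check like this could run at the top of `save_to_csv` so that a failed extraction is reported clearly instead of surfacing as a JSON parsing error.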


## Agent Initialization

Initialize the AI agent using LangGraph. This agent will coordinate tool usage and language understanding to extract company info efficiently.

In [None]:
# Create agent using LangGraph
agent_executor = create_react_agent(llm, tools)

print("AI Agent created successfully!")

AI Agent created successfully!


## Input Essay for Info Extraction

The essay below is a sample text that narrates the origins of major corporations, including their founding dates and founders. It serves as the primary input for **CSVision**, which will extract structured company info and save it to a CSV file.

*You can replace this sample essay with your own text to extract company details from any paragraph-based input.*

In [None]:
# Sample essay
essay_text = """
In the ever-evolving landscape of global commerce, the origin stories of major corporations are not merely tales of personal ambition and entrepreneurial spirit but also reflections of broader socio-economic trends and technological revolutions that have reshaped industries. These narratives, which often begin with modest ambitions, unfold into chronicles of innovation and strategic foresight that define industries and set benchmarks for future enterprises.

Early Foundations: Pioneers of Industry
One of the earliest examples is The Coca-Cola Company, founded on May 8, 1886, by Dr. John Stith Pemberton in Atlanta, Georgia. Initially sold at Jacob's Pharmacy as a medicinal beverage, Coca-Cola would become one of the most recognized brands worldwide, revolutionizing the beverage industry.
Similarly, Sony Corporation was established on May 7, 1946, by Masaru Ibuka and Akio Morita in Tokyo, Japan. Starting with repairing and building electrical equipment in post-war Japan, Sony would grow to pioneer electronics, entertainment, and technology.
As the mid-20th century progressed, McDonald's Corporation emerged as a game-changer in the fast-food industry. Founded on April 15, 1955, in Des Plaines, Illinois, by Ray Kroc, McDonald's built upon the original concept of Richard and Maurice McDonald to standardize and scale fast-food service globally. Around the same period, Intel Corporation was established on July 18, 1968, by Robert Noyce and Gordon Moore in Mountain View, California

driving advancements in semiconductors and microprocessors that became the backbone of modern computing.

The Rise of Technology Titans
Samsung Electronics Co., Ltd., founded on January 13, 1969, by Lee Byung-chul in Su-dong, South Korea, initially focused on producing electrical appliances like televisions and refrigerators. As Samsung expanded into semiconductors, telecommunications, and digital media, it
grew into a global technology leader. Similarly, Microsoft Corporation was founded on April 4, 1975, by Bill Gates and Paul Allen in Albuquerque, New Mexico, with the vision of placing a computer on every desk and in every home.
In Cupertino, California, Apple Inc. was born on April 1, 1976, founded by Steve Jobs, Steve Wozniak, and Ronald Wayne. Their mission to make personal computing accessible and elegant revolutionized technology and design. A few years later, Oracle Corporation was established on June 16, 1977, by Larry Ellison, Bob Miner, and Ed Oates in Santa Clara, California.
Specializing in relational databases, Oracle would become a cornerstone of enterprise software and cloud computing.
NVIDIA Corporation, founded on April 5, 1993, by Jensen Huang, Chris Malachowsky, and Curtis Priem in Santa Clara, California, began with a focus on graphics processing units (GPUs) for gaming. Today, NVIDIA is a leader in artificial intelligence, deep learning, and autonomous systems, showcasing the power of continuous innovation.

E-Commerce and the Internet Revolution
The 1990s witnessed a dramatic shift toward e-commerce and internet technologies. Amazon.com Inc. was founded on July 5, 1994, by Jeff Bezos in a garage in Bellevue, Washington, with the vision of becoming the world's largest online bookstore. This vision rapidly expanded to encompass
e-commerce, cloud computing, and digital streaming. Similarly, Google LLC was founded on September 4, 1998, by Larry Page and Sergey Brin, PhD students at Stanford University, in a garage in Menlo Park, California.
Google's mission to "organize the world's information" transformed how we search, learn, and connect.
In Asia, Alibaba Group Holding Limited was founded on June 28, 1999, by Jack Ma and 18 colleagues in Hangzhou, China. Originally an e-commerce platform connecting manufacturers with buyers, Alibaba expanded into cloud

computing, digital entertainment, and financial technology, becoming a global powerhouse.
In Europe, SAP SE was founded on April 1, 1972, by Dietmar Hopp,
Hans-Werner Hector, Hasso Plattner, Klaus Tschira, and Claus Wellenreuther in Weinheim, Germany. Specializing in enterprise resource planning (ERP) software, SAP revolutionized how businesses manage operations and data.

Social Media and Digital Platforms
The 2000s brought a wave of social media and digital platforms that reshaped communication and commerce. LinkedIn Corporation was founded on December 28, 2002, by Reid Hoffman and a team from PayPal and Socialnet.com in Mountain View, California, focusing on professional networking.
Facebook, Inc. (now Meta Platforms, Inc.) was launched on February 4, 2004, by Mark Zuckerberg and his college roommates in Cambridge, Massachusetts, evolving into a global social networking behemoth.
Another transformative platform, Twitter, Inc., was founded on March 21, 2006, by Jack Dorsey, Biz Stone, and Evan Williams in San Francisco, California. Starting as a microblogging service, Twitter became a critical tool for communication and social commentary. Spotify AB, founded on April 23, 2006, by Daniel Ek and Martin Lorentzon in Stockholm, Sweden, leveraged streaming technology to democratize music consumption, fundamentally altering the music industry.
In the realm of video-sharing, YouTube LLC was founded on February 14, 2005, by Steve Chen, Chad Hurley, and Jawed Karim in San Mateo, California. YouTube became the leading platform for user-generated video content, influencing global culture and media consumption.

Innovators in Modern Technology
Tesla, Inc., founded on July 1, 2003, by a group including Elon Musk, Martin Eberhard, Marc Tarpenning, JB Straubel, and Ian Wright, in San Carlos, California, championed the transition to sustainable energy with its electric vehicles and energy solutions. Airbnb, Inc., founded in August 2008 by Brian Chesky, Joe Gebbia, and Nathan Blecharczyk in San Francisco, California, disrupted traditional hospitality with its peer-to-peer lodging platform.
In the realm of fintech, PayPal Holdings, Inc. was established in December 1998 by Peter Thiel, Max Levchin, Luke Nosek, and Ken Howery in Palo Alto,

California. Originally a cryptography company, PayPal became a global leader in online payments. Stripe, Inc., founded in 2010 by Patrick and John Collison in Palo Alto, California, followed suit, simplifying online payments and enabling digital commerce.
Square, Inc. (now Block, Inc.), founded on February 20, 2009, by Jack Dorsey and Jim McKelvey in San Francisco, California, revolutionized mobile payment systems with its simple and accessible card readers.

Recent Disruptors
Zoom Video Communications, Inc. was founded on April 21, 2011, by Eric Yuan in San Jose, California. Initially designed for video conferencing, Zoom became essential during the COVID-19 pandemic, transforming remote work and communication. Slack Technologies, LLC, founded in 2009 by Stewart Butterfield, Eric Costello, Cal Henderson, and Serguei Mourachov in Vancouver, Canada, redefined workplace communication with its innovative messaging platform.
Rivian Automotive, Inc., founded on June 23, 2009, by RJ Scaringe in Plymouth, Michigan, entered the electric vehicle market with a focus on adventure and sustainability. SpaceX, established on March 14, 2002, by Elon Musk in Hawthorne, California, revolutionized aerospace with reusable rockets and ambitious plans for Mars exploration.
TikTok, developed by ByteDance and launched in September 2016 by Zhang Yiming in Beijing, China, revolutionized short-form video content, becoming a cultural phenomenon worldwide.

Conclusion
These corporations, with their diverse beginnings and visionary founders, exemplify the interplay of innovation, timing, and strategic foresight that shapes industries and transforms markets. From repairing electronics in post-war Japan to building global e-commerce empires and redefining space exploration, their stories are milestones in the narrative of global economic transformation. Each reflects not only the aspirations of their founders but also the technological advancements and socio-economic trends of their time, serving as inspirations for future innovators.
"""

print("Company essay verified")

Company essay verified


In [None]:
# Execute the agent
print("Running AI Agent to extract company info...\n")

response = agent_executor.invoke({
    "messages": [
        HumanMessage(content=f"""
        Hi! I need you to extract company information from the following essay and save it to a CSV file.

        For each company, please extract:
        1. Company name (full official name)
        2. Founding date
        3. Founders (all people mentioned)

        After extraction, save the data to a CSV file.

        Essay text:
        {essay_text}
        """)
    ]
})

print("Agent execution completed!")
print("\nAgent response:")
print(response["messages"][-1].content)

Running AI Agent to extract company info...



Agent execution completed!

Agent response:
I have successfully extracted the company information and saved it to a CSV file named 'vision_company_info.csv'.


## CSV Output Verification

The following code verifies the output by checking the existence of the file, its size and its contents. The output is displayed in a formatted table that matches the expected format (S.N, Company Name, Founded in, Founded by), ensuring the extracted data meets the assignment's requirements.

In [None]:
# Verify and display the CSV output
filename = 'CSVision_company_info.csv'

if os.path.exists(filename):
    df = pd.read_csv(filename)

    print(f"CSV file '{filename}' created successfully!")
    print(f"Total companies extracted: {len(df)}")
    print(f"File size: {os.path.getsize(filename)} bytes")

    # Unformatted CSV
    # print("\nCSV Content:")
    # print(df.to_string(index=False))

    # Display in expected format from assignment
    print("\nFinal Output:")
    print("=" * 50)
    print("S.N,Company Name,Founded in,Founded by")
    for i, row in df.iterrows():
        print(f"{i+1},{row['company_name']},{row['founding_date']},\"{row['founders']}\"")

else:
    print("CSV file not found. Please check the agent execution above.")

CSV file 'CSVision_company_info.csv' created successfully!
Total companies extracted: 28
File size: 1726 bytes

Final Output:
S.N,Company Name,Founded in,Founded by
1,The Coca-Cola Company,1886-05-08,"Dr. John Stith Pemberton"
2,Sony Corporation,1946-05-07,"Masaru Ibuka, Akio Morita"
3,McDonald's Corporation,1955-04-15,"Ray Kroc"
4,Intel Corporation,1968-07-18,"Robert Noyce, Gordon Moore"
5,Samsung Electronics Co., Ltd.,1969-01-13,"Lee Byung-chul"
6,Microsoft Corporation,1975-04-04,"Bill Gates, Paul Allen"
7,Apple Inc.,1976-04-01,"Steve Jobs, Steve Wozniak, Ronald Wayne"
8,Oracle Corporation,1977-06-16,"Larry Ellison, Bob Miner, Ed Oates"
9,NVIDIA Corporation,1993-04-05,"Jensen Huang, Chris Malachowsky, Curtis Priem"
10,Amazon.com Inc.,1994-07-05,"Jeff Bezos"
11,Google LLC,1998-09-04,"Larry Page, Sergey Brin"
12,Alibaba Group Holding Limited,1999-06-28,"Jack Ma"
13,SAP SE,1972-04-01,"Dietmar Hopp, Hans-Werner Hector, Hasso Plattner, Klaus Tschira, Claus Wellenreuther"
14,LinkedIn Corpo
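
The display loop above joins fields by hand, so a company name that contains a comma (row 5, `Samsung Electronics Co., Ltd.`) yields a line that no longer parses as four columns. A minimal sketch of producing the same layout with the standard `csv` module, which quotes such fields automatically; `render_rows` is a hypothetical helper and the sample rows are copied from the output above:

```python
import csv
import io


def render_rows(rows) -> str:
    """Render (name, founding_date, founders) tuples as quoted CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(["S.N", "Company Name", "Founded in", "Founded by"])
    for index, (name, founded, founders) in enumerate(rows, start=1):
        # Fields containing commas are wrapped in double quotes by the writer
        writer.writerow([index, name, founded, founders])
    return buffer.getvalue()


sample_rows = [
    ("Sony Corporation", "1946-05-07", "Masaru Ibuka, Akio Morita"),
    ("Samsung Electronics Co., Ltd.", "1969-01-13", "Lee Byung-chul"),
]
print(render_rows(sample_rows))
```

The file written by `df.to_csv` is not affected, since pandas already quotes fields that contain the delimiter; only the hand-formatted preview is.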

## Streaming Analysis of Company Impact

This section uses the Gemini model’s streaming capabilities to print an analysis as it is generated. Only the first 800 characters of the essay are sent (`essay_text[:800]`), so the analysis covers the introduction and the first company mentioned rather than the full text. The prompt asks what types of companies appear and what their significance is in business history, which complements the structured CSV output with qualitative insight.

In [None]:
# Streaming response
print("Streaming response for company extraction analysis:")
print("-" * 50)

# Use streaming to get real-time analysis
for chunk in llm.stream(f"""
Analyze the following essay and explain what types of companies are mentioned and their significance in business history:

{essay_text[:800]}...

Provide insights about these companies' impact on their respective industries.
"""):
    print(chunk.content, end="", flush=True)

print("\n\nStreaming analysis completed!")

Streaming response for company extraction analysis:
--------------------------------------------------
Here's an analysis of the provided essay excerpt, focusing on the types of companies mentioned and their significance in business history:

**Types of Companies Mentioned:**

The essay excerpt explicitly mentions **The Coca-Cola Company**.

**Significance in Business History:**

*   **The Coca-Cola Company:**
    *   **Industry Revolutionized:** The beverage industry.
    *   **Significance:**
        *   **Pioneer of a Global Brand:** Coca-Cola's story exemplifies the creation of a truly global, instantly recognizable brand. Its success wasn't just about selling a drink; it was about building a powerful identity and a consistent experience across diverse cultures and markets.
        *   **Marketing and Advertising Innovation:** While not detailed in this short excerpt, Coca-Cola is historically renowned for its pioneering work in advertising and marketing. They mastered the art of c
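
The streaming loop prints each chunk and then discards it. A minimal sketch of keeping the full text while still printing incrementally, using a hypothetical `fake_stream` generator as a stand-in for `llm.stream` so the example runs without an API key; with the real model, the value appended would be `chunk.content`:

```python
from typing import Iterable, Iterator


def fake_stream(text: str, size: int = 12) -> Iterator[str]:
    """Stand-in for a streaming model: yield the text in small pieces."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


def collect_stream(chunks: Iterable[str]) -> str:
    """Print chunks as they arrive and return the concatenated text."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


full_text = collect_stream(fake_stream("The essay excerpt mentions The Coca-Cola Company."))
print(len(full_text))
```

Collecting the chunks makes it possible to save or post-process the analysis after streaming finishes.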