# AI Agent for Market Research & Competitive Analysis

This notebook implements an AI agent designed to automate market research. The agent takes a company name and stock ticker, performs a multi-faceted analysis by gathering real-time financial data and news, and presents the results from two distinct investor perspectives: a short-term Market Investor and a long-term Value Investor.

**Core Components:**
1.  **Agent Framework:** LangChain
2.  **LLM (for generation):** Google's `gemini-2.5-flash` (for better rate limits)
3.  **Embedding Model:** `all-MiniLM-L6-v2` (Open-source from Hugging Face)
4.  **Vector Store:** FAISS (In-memory, local, and free)
5.  **Tools:**
    * Custom Google Search Tool
    * Web Scraper Tool (with PDF support)
    * Yahoo Finance Tool (with EPS)

## Step 1: Install Dependencies

First, we install the required Python libraries. The cell checks whether `langchain` is already importable and skips the installation if it is, which saves time on subsequent runs.
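
Checking a single library is a shortcut. If you would rather confirm several packages before skipping the install, one possible approach is sketched below. The helper `missing_modules` and the `required` list are assumptions for illustration; adjust the list to match the import names in your requirements file.

```python
import importlib.util

def missing_modules(module_names):
    """Return the top-level module names that cannot be found on this system."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Hypothetical list of import names; not taken from the actual requirements file
required = ["langchain", "faiss", "yfinance"]
print(missing_modules(required))
```

An empty list means every module was found. The sketch only handles top-level names, since `find_spec` raises an error for a dotted name whose parent package is missing.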

In [None]:
import importlib.util

# Use langchain as a proxy: if it is importable, assume the other requirements are installed too.
if importlib.util.find_spec("langchain") is None:
  print("Installing dependencies...")
  # If you are working from a fork, replace the GitHub username in the URL below
  !pip install -q -r https://raw.githubusercontent.com/eriktaylor/ai-agent-moat/main/requirements.txt
else:
  print("Dependencies are already installed.")

## Step 2: Securely Set Up API Keys

We need a Google API key and a Custom Search Engine ID. We'll use Colab's built-in **Secrets Manager** to store them, which keeps the values out of the notebook's source and out of anything you share or commit.

**Instructions:**
1.  Click the **key icon (🔑)** in the left sidebar of Colab.
2.  Click **"Add a new secret"**.
3.  Create a secret with the name `GOOGLE_API_KEY` and paste your Google AI Studio API key as the value.
4.  Create another secret named `GOOGLE_CSE_ID` with your Custom Search Engine ID.

You can get a Google API Key from [Google AI Studio](https://aistudio.google.com/app/apikey) and set up a Custom Search Engine [here](https://programmablesearchengine.google.com/controlpanel/all) to get a Search Engine ID.
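
If you also run this notebook outside Colab, the secrets lookup needs a fallback. The sketch below shows one way to do that, assuming the same secret names are exported as environment variables. `load_secret` is a hypothetical helper, not part of Colab or LangChain.

```python
import os

def load_secret(name, colab_getter=None, environ=os.environ):
    """Return a secret from the Colab getter if one is supplied, else from the environment."""
    value = None
    if colab_getter is not None:
        try:
            value = colab_getter(name)
        except Exception:
            value = None  # Secret missing or not accessible; fall back to the environment
    if not value:
        value = environ.get(name)
    if not value:
        raise RuntimeError(f"Secret {name!r} is not set.")
    return value

# Demo with an explicit mapping instead of real credentials
print(load_secret("GOOGLE_CSE_ID", environ={"GOOGLE_CSE_ID": "demo-id"}))  # prints demo-id
```

In Colab you would pass `colab_getter=userdata.get`. Elsewhere, leave it as `None` and export the variables in your shell.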

In [None]:
import os
from google.colab import userdata

# Securely access the API keys
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
GOOGLE_CSE_ID = userdata.get('GOOGLE_CSE_ID')

# Expose the API key as an environment variable for LangChain's Gemini integration
os.environ['GOOGLE_API_KEY'] = GOOGLE_API_KEY

## Step 3: Define the Agent's Tools

An agent's power comes from its tools. We will create three tools:
1.  **Web Search Tool:** To find relevant articles and sources.
2.  **Web Scraper Tool:** To extract the actual content from the URLs found by the search tool.
3.  **Yahoo Finance Tool:** To fetch key financial metrics for public companies.
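
The scraper has to decide how to parse each response before extracting any text. The routing rule it relies on can be sketched as a small pure function. `classify_content` is a hypothetical name used here for illustration; the notebook's tool performs the same checks inline.

```python
def classify_content(content_type: str, url: str) -> str:
    """Return 'pdf', 'html', or 'unsupported' for a fetched resource."""
    content_type = (content_type or "").lower()
    # Some servers mislabel PDFs, so the URL suffix is checked as well
    if "application/pdf" in content_type or url.lower().endswith(".pdf"):
        return "pdf"
    if "text/html" in content_type:
        return "html"
    return "unsupported"

print(classify_content("text/html; charset=utf-8", "https://example.com/news"))         # html
print(classify_content("application/octet-stream", "https://example.com/report.PDF"))  # pdf
print(classify_content("image/png", "https://example.com/logo.png"))                   # unsupported
```

Keeping the decision separate from the network call makes it easy to test without fetching anything.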

In [None]:
import requests
from bs4 import BeautifulSoup
import yfinance as yf
from langchain_core.tools import tool
from langchain_community.utilities import GoogleSearchAPIWrapper
import fitz  # PyMuPDF
import io

# Tool 1: Google Search Tool
search = GoogleSearchAPIWrapper(google_cse_id=GOOGLE_CSE_ID, google_api_key=GOOGLE_API_KEY)

# Tool 2: Web scraper with PDF handling
@tool
def scrape_website(url: str) -> str:
    """Scrapes text from HTML websites and extracts text from PDF files."""
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status() # Raise an exception for bad status codes

        content_type = response.headers.get('content-type', '')

        if 'application/pdf' in content_type or url.lower().endswith('.pdf'):
            # It's a PDF, use PyMuPDF to extract text
            with fitz.open(stream=io.BytesIO(response.content), filetype='pdf') as doc:
                text = "".join(page.get_text() for page in doc)
            return text[:8000] # Return a larger chunk for detailed PDFs
        elif 'text/html' in content_type:
            # It's HTML, use BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')
            text = ' '.join(p.get_text() for p in soup.find_all('p'))
            if len(text) < 200:
                return f"Error: HTML content from {url} is too short."
            return text[:4000]
        else:
            return f"Error: Unsupported content type '{content_type}' at {url}"

    except requests.RequestException as e:
        return f"Error: Could not access the URL. {e}"
    except Exception as e:
        return f"Error: An unexpected error occurred while processing {url}. {e}"

# Tool 3: Yahoo Finance lookup, including trailing EPS
@tool
def get_stock_info(ticker: str) -> str:
    """Fetches key financial information for a given stock ticker using Yahoo Finance."""
    try:
        stock = yf.Ticker(ticker)
        info = stock.info
        market_cap = info.get('marketCap', 'N/A')
        trailing_pe = info.get('trailingPE', 'N/A')
        forward_pe = info.get('forwardPE', 'N/A')
        eps = info.get('trailingEps', 'N/A')
        long_business_summary = info.get('longBusinessSummary', 'N/A')
        return f"### KEY FINANCIAL DATA ###\nMarket Cap: {market_cap}\nTrailing P/E: {trailing_pe}\nForward P/E: {forward_pe}\nTrailing EPS: {eps}\nBusiness Summary: {long_business_summary}\n### END FINANCIAL DATA ###"
    except Exception as e:
        return f"Error fetching stock info for {ticker}: {e}"

## Step 4: Define the Research Agent

The `ResearchAgent` class encapsulates the entire research process, with separate methods for gathering context and for each type of analysis. It caches the gathered context in memory, keyed by company name and ticker, so the second report reuses the first report's searches instead of calling the APIs again.
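
The caching idea can be shown in isolation. The sketch below is a minimal illustration, assuming a dictionary keyed by company name and ticker. `ContextCache` and `fake_fetch` are hypothetical names that stand in for the agent's context-gathering step.

```python
class ContextCache:
    """Caches the result of an expensive fetch, keyed by (entity_name, ticker)."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._store = {}

    def get(self, entity_name, ticker):
        key = (entity_name, ticker)
        if key not in self._store:
            # Only the first request for a key triggers the fetch
            self._store[key] = self._fetch(entity_name, ticker)
        return self._store[key]

    def clear(self):
        self._store.clear()

calls = []

def fake_fetch(entity_name, ticker):
    """Stand-in for the search and scraping work; records each call."""
    calls.append((entity_name, ticker))
    return f"context for {entity_name} ({ticker})"

cache = ContextCache(fake_fetch)
cache.get("NVIDIA", "NVDA")
cache.get("NVIDIA", "NVDA")
print(len(calls))  # 1: the second call was served from the cache
```

The cache lives only as long as the Python object, so restarting the runtime or calling `clear()` forces fresh data.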

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI
import re

class ResearchAgent:
    def __init__(self, llm, embeddings_model):
        self.llm = llm
        self.embeddings_model = embeddings_model
        self.text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
        self.search_wrapper = GoogleSearchAPIWrapper(google_cse_id=GOOGLE_CSE_ID, google_api_key=GOOGLE_API_KEY)
        self.cache = {}

    def clear_cache(self):
        """Clears the agent's cache."""
        self.cache = {}
        print("Cache cleared.")

    # Financial data is kept separate from the unstructured text so it can be injected into the prompt verbatim.
    def _get_context(self, entity_name, ticker):
        """Gathers context and returns financial data and unstructured text separately."""
        context_cache_key = f"context_{entity_name}_{ticker}"
        if context_cache_key in self.cache:
            print("Returning cached context.")
            return self.cache[context_cache_key]

        # Get financial data
        financial_data = "No financial data available."
        if ticker:
            print(f"--- Getting Financial Data for {ticker} ---")
            financial_data_result = get_stock_info.run(ticker)
            if not financial_data_result.startswith("Error"):
                financial_data = financial_data_result
                print("Successfully collected financial data.")

        # Gather unstructured text
        unstructured_text_list = []
        print("--- Tier 1: Official News & Analysis ---")
        headline_query = f'"{entity_name}" recent news'
        headline_results = self.search_wrapper.results(headline_query, num_results=3)
        if headline_results:
            unstructured_text_list.extend([f"Headline: {r.get('title', '')}\nSnippet: {r.get('snippet', '')}" for r in headline_results])
            print(f"Collected {len(headline_results)} headlines and snippets.")
        
        print("--- Tier 2: Critical News & Sentiment ---")
        critical_query = f'"{entity_name}" issues OR concerns OR investigation OR recall OR safety OR "short interest"'
        critical_results = self.search_wrapper.results(critical_query, num_results=3)
        if critical_results:
            unstructured_text_list.extend([f"Critical Headline: {r.get('title', '')}\nSnippet: {r.get('snippet', '')}" for r in critical_results])
            print(f"Collected {len(critical_results)} critical headlines and snippets.")

        print("--- Tier 3: Deep Dive ---")
        deep_dive_query = f'"{entity_name}" market analysis OR in-depth report filetype:pdf OR site:globenewswire.com OR site:prnewswire.com'
        deep_dive_results = self.search_wrapper.results(deep_dive_query, num_results=2)
        if deep_dive_results:
            urls = [result['link'] for result in deep_dive_results if 'link' in result]
            for url in urls:
                print(f"Scraping {url}...")
                content = scrape_website.run(url)
                if content and not content.startswith("Error"):
                    unstructured_text_list.append(content)
                    print(f"Successfully scraped content from {url}")
                else:
                    print(content)
        
        unstructured_corpus = "\n\n---\n\n".join(unstructured_text_list)
        self.cache[context_cache_key] = (financial_data, unstructured_corpus)
        return financial_data, unstructured_corpus

    # Only the unstructured text is indexed for retrieval; financial data is supplied when the chain is invoked.
    def _create_rag_chain(self, system_prompt, unstructured_corpus):
        """Helper to create a RAG chain with a specific prompt."""
        docs = self.text_splitter.split_text(unstructured_corpus)
        vector_store = FAISS.from_texts(texts=docs, embedding=self.embeddings_model)
        retriever = vector_store.as_retriever()
        # The system prompt supplies the {financial_data} and {context} placeholders
        prompt = ChatPromptTemplate.from_messages([
            ("system", system_prompt),
            ("human", "{input}"),
        ])
        question_answer_chain = create_stuff_documents_chain(self.llm, prompt)
        return create_retrieval_chain(retriever, question_answer_chain)

    def generate_market_outlook(self, entity_name, ticker):
        print("\nGenerating Market Investor Outlook...")
        financial_data, unstructured_corpus = self._get_context(entity_name, ticker)
        if not unstructured_corpus:
            return "Could not gather context for outlook generation."

        # {financial_data} is filled at invoke time; {context} receives the retrieved chunks
        system_prompt = (
            "You are a 'Market Investor' analyst. Here is the key financial data for the company:\n{financial_data}\n\n"
            "Now, using the retrieved context below (which includes both positive and critical news), generate a report. "
            "The report MUST be structured with the following sections:\n"
            "1. **Market Sentiment:** Synthesize the official news and the critical news to determine the overall market sentiment. Is it bullish, bearish, or mixed? Why?\n"
            "2. **Valuation Analysis:** Is the stock considered expensive or cheap? You MUST reference the 'Trailing P/E' and 'Trailing EPS' from the financial data. If P/E is not applicable because EPS is negative, state this clearly and explain what a negative EPS implies for valuation.\n"
            "3. **Relative Performance (Implied):** Based on the context, how does this company's performance and outlook seem to compare to its peers or the broader market?\n\n"
            "Retrieved Context:\n{context}\n\n"
            "DO NOT give financial advice. This is an objective summary of the data provided."
        )
        rag_chain = self._create_rag_chain(system_prompt, unstructured_corpus)
        # financial_data fills the {financial_data} placeholder in the system prompt
        response = rag_chain.invoke({"input": f"Market outlook for {entity_name}", "financial_data": financial_data})
        return response['answer']

    def generate_value_analysis(self, entity_name, ticker):
        print("\nGenerating Value Investor Analysis...")
        financial_data, unstructured_corpus = self._get_context(entity_name, ticker)
        if not unstructured_corpus:
            return "Could not gather context for value analysis."

        system_prompt = (
            "You are a 'Value Investor' analyst. Here is the key financial data for the company:\n{financial_data}\n\n"
            "Now, using the retrieved context below (which includes both positive and critical news), generate a detailed business brief. "
            "The report MUST be structured with the following sections:\n"
            "1. **Valuation Summary:** Start by stating if the company appears 'Overvalued', 'Undervalued', or 'Fairly Valued'. Justify your conclusion briefly by referencing the P/E or EPS from the financial data.\n"
            "2. **SWOT Analysis:** A detailed, bulleted list of the company's Strengths, Weaknesses, Opportunities, and Threats. You MUST incorporate information from the 'Critical News' headlines in the Weaknesses and Threats sections.\n"
            "3. **Competitive Moat:** Based on the SWOT analysis, describe the company's long-term competitive advantages. Is its moat wide, narrow, or degrading? You MUST consider the threats and weaknesses when assessing the durability of the moat.\n\n"
            "Retrieved Context:\n{context}"
        )
        rag_chain = self._create_rag_chain(system_prompt, unstructured_corpus)
        response = rag_chain.invoke({"input": f"Value analysis for {entity_name}", "financial_data": financial_data})
        return response['answer']

# Initialize models and agent
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0.2)
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
research_agent = ResearchAgent(llm=llm, embeddings_model=embeddings)

## Step 5: Run the Three-Step Analysis

Now we run the three-step workflow. First, we display the raw financial data. Then we generate the Market Investor outlook, and finally the Value Investor business brief.

In [None]:
import html
from IPython.display import display, HTML, clear_output

company_name = input("Enter the company name (e.g., NVIDIA): ").strip()
stock_ticker = input("Enter the stock ticker (e.g., NVDA): ").strip()

clear_output(wait=True) # Clears the input prompts for a cleaner display

# To get fresh data and not use the cache, you can uncomment the next line:
# research_agent.clear_cache()

# Shared styling for the output boxes
BOX_STYLE = "border: 1px solid #444; border-radius: 8px; padding: 20px; white-space: pre-wrap; line-height: 1.6; background-color: #2c2c2e; color: #f0f0f0;"
REPORT_STYLE = BOX_STYLE + " max-height: 500px; overflow-y: auto; font-family: \"SF Pro Text\", \"Inter\", sans-serif;"

def show_box(text, style):
    """Displays text in a styled box; escaping keeps characters like '<' or '&' from breaking the HTML."""
    display(HTML(f"<div style='{style}'>{html.escape(str(text))}</div>"))

# --- Step 1: Display Raw Financial Data ---
print(f"--- 1. KEY FINANCIAL DATA for {stock_ticker.upper()} ---")
financial_data_raw = get_stock_info.run(stock_ticker) if stock_ticker else "No ticker provided."
show_box(financial_data_raw, BOX_STYLE + " font-family: monospace;")

# --- Step 2: Generate Market Investor Outlook ---
print(f"\n--- 2. AI-GENERATED MARKET INVESTOR OUTLOOK for {company_name} ---")
market_outlook = research_agent.generate_market_outlook(company_name, stock_ticker)
show_box(market_outlook, REPORT_STYLE)

# --- Step 3: Generate Value Investor Analysis ---
print(f"\n--- 3. AI-GENERATED VALUE INVESTOR ANALYSIS for {company_name} ---")
value_analysis = research_agent.generate_value_analysis(company_name, stock_ticker)
show_box(value_analysis, REPORT_STYLE)