# AI Agent for Market Research & Competitive Analysis

This notebook implements an AI agent designed to automate market research. The agent can take a high-level query about a company or technology, search the web for relevant articles, analyze the content, and generate a concise business brief.

**Core Components:**
1.  **Agent Framework:** LangChain
2.  **LLM (for generation):** Google's `gemini-2.5-flash` (chosen for its better rate limits)
3.  **Embedding Model:** `all-MiniLM-L6-v2` (Open-source from Hugging Face)
4.  **Vector Store:** FAISS (In-memory, local, and free)
5.  **Tools:**
    * Custom Google Search Tool
    * Web Scraper Tool (with PDF support)
    * Yahoo Finance Tool

## Step 1: Install Dependencies

First, we install all the required Python libraries. The cell checks whether `langchain` is already importable and skips the installation if it is, saving time on subsequent runs.

In [None]:
import importlib.util  # the `util` submodule must be imported explicitly

# Check whether a key library is installed. If not, run the installation.
if importlib.util.find_spec("langchain") is None:
    print("Installing dependencies...")
    # Use your GitHub username in the URL below
    !pip install -q -r https://raw.githubusercontent.com/eriktaylor/ai-agent-moat/main/requirements.txt
else:
    print("Dependencies are already installed.")
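
The check above looks at a single library. As a minimal sketch, the same `find_spec` test can be applied to several import names at once. `missing_modules` is a hypothetical helper, and the names in `required` are assumptions about what `requirements.txt` installs, so adjust them to match your environment.

```python
import importlib.util

def missing_modules(module_names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Hypothetical list of import names; these are assumptions, not read from requirements.txt.
required = ["langchain", "faiss", "yfinance", "fitz"]
print("Missing:", missing_modules(required))
```

An empty list means every listed module is importable, so the installation step can be skipped.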

## Step 2: Securely Set Up API Keys

We need API keys for Google services. We'll use Colab's built-in **Secrets Manager** to store them, which keeps the keys out of the notebook's cells so they are not exposed when you share it.

**Instructions:**
1.  Click the **key icon (🔑)** in the left sidebar of Colab.
2.  Click **"Add a new secret"**.
3.  Create a secret with the name `GOOGLE_API_KEY` and paste your Google AI Studio API key as the value.
4.  Create another secret named `GOOGLE_CSE_ID` with your Custom Search Engine ID.

You can get a Google API Key from [Google AI Studio](https://aistudio.google.com/app/apikey) and set up a Custom Search Engine [here](https://programmablesearchengine.google.com/controlpanel/all) to get a Search Engine ID.

In [None]:
import os
from google.colab import userdata

# Securely access the API keys
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
GOOGLE_CSE_ID = userdata.get('GOOGLE_CSE_ID')

# Set environment variables for LangChain
os.environ['GOOGLE_API_KEY'] = GOOGLE_API_KEY
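
If a secret is missing, the failure can surface later as a confusing error inside a tool. The sketch below shows one way to fail early with a clear message. `load_secret` is a hypothetical helper, not part of Colab or LangChain. It accepts any lookup function (for example `userdata.get`) and falls back to environment variables so the same code can run outside Colab.

```python
import os

def load_secret(name, getter=None):
    """Return a secret from `getter` or, failing that, from the environment.

    Raises ValueError with an actionable message if the secret is missing or empty.
    """
    value = None
    if getter is not None:
        try:
            value = getter(name)
        except Exception:
            value = None  # lookup failed; fall through to the environment
    if not value:
        value = os.environ.get(name)
    if not value:
        raise ValueError(f"Secret '{name}' is not set. Add it in the Colab Secrets panel.")
    return value

# Example with an environment variable standing in for a real secret.
os.environ["DEMO_SECRET"] = "abc123"
print(load_secret("DEMO_SECRET"))
```

In Colab, the call would look like `load_secret('GOOGLE_API_KEY', getter=userdata.get)`.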

## Step 3: Define the Agent's Tools

An agent's power comes from its tools. We will create three tools:
1.  **Web Search Tool:** To find relevant articles and sources.
2.  **Web Scraper Tool:** To extract the actual content from the URLs found by the search tool.
3.  **Yahoo Finance Tool:** To fetch key financial metrics for public companies.

In [None]:
import requests
from bs4 import BeautifulSoup
import yfinance as yf
from langchain_core.tools import tool  # canonical location of the @tool decorator
from langchain_community.utilities import GoogleSearchAPIWrapper
import fitz  # PyMuPDF
import io

# Tool 1: Google Search Tool
search = GoogleSearchAPIWrapper(google_cse_id=GOOGLE_CSE_ID, google_api_key=GOOGLE_API_KEY)

# Tool 2: Web Scraper Tool (HTML and PDF)
@tool
def scrape_website(url: str) -> str:
    """Scrapes text from HTML websites and extracts text from PDF files."""
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status() # Raise an exception for bad status codes

        content_type = response.headers.get('content-type', '')

        if 'application/pdf' in content_type or url.lower().endswith('.pdf'):
            # It's a PDF, use PyMuPDF to extract text
            with fitz.open(stream=io.BytesIO(response.content), filetype='pdf') as doc:
                text = "".join(page.get_text() for page in doc)
            return text[:8000] # Return a larger chunk for detailed PDFs
        elif 'text/html' in content_type:
            # It's HTML, use BeautifulSoup
            soup = BeautifulSoup(response.content, 'html.parser')
            text = ' '.join(p.get_text() for p in soup.find_all('p'))
            if len(text) < 200:
                return f"Error: HTML content from {url} is too short."
            return text[:4000]
        else:
            return f"Error: Unsupported content type '{content_type}' at {url}"

    except requests.RequestException as e:
        return f"Error: Could not access the URL. {e}"
    except Exception as e:
        return f"Error: An unexpected error occurred while processing {url}. {e}"

# Tool 3: Yahoo Finance Tool
@tool
def get_stock_info(ticker: str) -> str:
    """Fetches key financial information for a given stock ticker using Yahoo Finance."""
    try:
        stock = yf.Ticker(ticker)
        info = stock.info
        market_cap = info.get('marketCap', 'N/A')
        trailing_pe = info.get('trailingPE', 'N/A')
        forward_pe = info.get('forwardPE', 'N/A')
        long_business_summary = info.get('longBusinessSummary', 'N/A')
        return f"### KEY FINANCIAL DATA ###\nMarket Cap: {market_cap}\nTrailing P/E: {trailing_pe}\nForward P/E: {forward_pe}\nBusiness Summary: {long_business_summary}\n### END FINANCIAL DATA ###"
    except Exception as e:
        return f"Error fetching stock info for {ticker}: {e}"
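
The scraper's branching on content type can be checked without network access. The sketch below isolates that decision in a pure function. `classify_response` is a hypothetical helper, and unlike the tool it lowercases the header before matching, which is an assumption about how servers may capitalize it.

```python
def classify_response(content_type: str, url: str) -> str:
    """Mirror the scraper's branching: return 'pdf', 'html', or 'unsupported'."""
    content_type = (content_type or "").lower()
    if "application/pdf" in content_type or url.lower().endswith(".pdf"):
        return "pdf"
    if "text/html" in content_type:
        return "html"
    return "unsupported"

# Hypothetical URLs used purely for illustration.
print(classify_response("application/pdf", "https://example.com/report"))
print(classify_response("text/html; charset=utf-8", "https://example.com/news"))
print(classify_response("application/octet-stream", "https://example.com/q3.PDF"))
print(classify_response("application/json", "https://example.com/api"))
```

The suffix check only matches when the URL ends in `.pdf`, so a link such as `report.pdf?download=1` is recognized as a PDF only if the server also sends the `application/pdf` header.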

## Step 4: Define the Research Agent

The `ResearchAgent` class encapsulates the entire research process, with distinct methods for different types of analysis. It also caches the gathered context per query, so repeated analyses of the same query skip the search and scraping steps and avoid hitting API rate limits.
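
The caching idea can be sketched in isolation. `ContextCache` and `slow_fetch` are hypothetical names used only for illustration. The key format `context_{query}` follows the agent's own convention, and a plain dictionary stands in for the agent's cache.

```python
class ContextCache:
    """Memoize an expensive per-query fetch, keyed the same way as the agent's cache."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable that does the slow work (search, scraping)
        self._store = {}

    def get(self, query):
        key = f"context_{query}"
        if key not in self._store:
            self._store[key] = self._fetch(query)
        return self._store[key]

    def clear(self):
        self._store = {}


calls = []

def slow_fetch(query):
    calls.append(query)          # record each real fetch
    return f"context for {query}"

cache = ContextCache(slow_fetch)
cache.get("NVIDIA (NVDA)")
cache.get("NVIDIA (NVDA)")       # served from the cache, no second fetch
print(len(calls))                # 1
```

Because the key is the full query string, two differently worded queries about the same company are cached separately.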

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI
import re

class ResearchAgent:
    def __init__(self, llm, embeddings_model):
        self.llm = llm
        self.embeddings_model = embeddings_model
        self.text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
        self.search_wrapper = GoogleSearchAPIWrapper(google_cse_id=GOOGLE_CSE_ID, google_api_key=GOOGLE_API_KEY)
        self.cache = {}

    def clear_cache(self):
        """Clears the agent's cache."""
        print("Cache cleared.")
        self.cache = {}

    def _get_context(self, query):
        """Helper function to gather all context data."""
        context_cache_key = f"context_{query}"
        if context_cache_key in self.cache:
            print("Returning cached context.")
            return self.cache[context_cache_key]

        ticker_match = re.search(r'\((.*?)\)', query)
        ticker = ticker_match.group(1) if ticker_match else ""
        entity = query.split('for ')[-1].split('(')[0].strip()
        
        all_text_content = []
        if ticker:
            print(f"--- Getting Financial Data for {ticker} ---")
            financial_data = get_stock_info.run(ticker)
            if not financial_data.startswith("Error"):
                all_text_content.insert(0, financial_data)
                print("Successfully collected and prioritized financial data.")

        print("--- Tier 1: Headline Scan ---")
        headline_query = f'"{entity}" recent news'
        headline_results = self.search_wrapper.results(headline_query, num_results=5)
        if headline_results:
            all_text_content.extend([f"Headline: {r.get('title', '')}\nSnippet: {r.get('snippet', '')}" for r in headline_results])
            print(f"Collected {len(headline_results)} headlines and snippets.")

        print("--- Tier 2: Deep Dive ---")
        deep_dive_query = f'"{entity}" market analysis OR in-depth report filetype:pdf OR site:globenewswire.com OR site:prnewswire.com'
        deep_dive_results = self.search_wrapper.results(deep_dive_query, num_results=3)
        if deep_dive_results:
            urls = [result['link'] for result in deep_dive_results if 'link' in result]
            for url in urls:
                print(f"Scraping {url}...")
                content = scrape_website.run(url)
                if content and not content.startswith("Error"):
                    all_text_content.append(content)
                    print(f"Successfully scraped content from {url}")
                else:
                    print(content)
        
        full_text = "\n\n---\n\n".join(all_text_content)
        self.cache[context_cache_key] = full_text
        return full_text

    def _create_rag_chain(self, system_prompt, text_corpus):
        """Helper to create a RAG chain with a specific prompt."""
        docs = self.text_splitter.split_text(text_corpus)
        vector_store = FAISS.from_texts(texts=docs, embedding=self.embeddings_model)
        retriever = vector_store.as_retriever()
        prompt = ChatPromptTemplate.from_messages([
            ("system", system_prompt),
            ("human", "{input}"),
        ])
        question_answer_chain = create_stuff_documents_chain(self.llm, prompt)
        return create_retrieval_chain(retriever, question_answer_chain)

    def generate_outlook(self, query):
        print("\nGenerating Investment Outlook...")
        text_corpus = self._get_context(query)
        if not text_corpus:
            return "Could not gather context for outlook generation."

        system_prompt = (
            "You are a neutral financial analyst. Based ONLY on the provided text (which includes financial data and news), summarize the investment outlook. "
            "Structure your response into two sections: 'Bullish Case (Potential Positives)' and 'Bearish Case (Potential Risks)'. "
            "Reference specific data points like P/E ratios or news headlines from the context to support your points. "
            "DO NOT give financial advice or make a final 'buy' or 'sell' recommendation. This is an objective summary of the data provided."
            "\n\n{context}"
        )
        rag_chain = self._create_rag_chain(system_prompt, text_corpus)
        response = rag_chain.invoke({"input": query})
        return response['answer']

    def generate_brief(self, query):
        print("\nGenerating Comprehensive Business Brief...")
        text_corpus = self._get_context(query)
        if not text_corpus:
            return "Could not gather context for brief generation."

        system_prompt = (
            "You are an expert market research analyst. Use the provided context to generate a detailed business brief. "
            "The report MUST be structured with the following sections:\n"
            "1. **Recent Developments:** A summary of the most important recent news from the context.\n"
            "2. **SWOT Analysis:** A bulleted list of the company's Strengths, Weaknesses, Opportunities, and Threats based on the context.\n"
            "3. **Competitive Moat:** Briefly comment on the company's long-term competitive advantages based on the summary and SWOT analysis."
            "\n\n{context}"
        )
        rag_chain = self._create_rag_chain(system_prompt, text_corpus)
        response = rag_chain.invoke({"input": query})
        return response['answer']

# Initialize models and agent
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0.2)
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
research_agent = ResearchAgent(llm=llm, embeddings_model=embeddings)
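
The query parsing inside `_get_context` can be tried on its own. The sketch below copies the two expressions the agent uses for the ticker and the entity into a standalone function. `parse_query` is a hypothetical name, not part of the notebook.

```python
import re

def parse_query(query: str):
    """Extract (entity, ticker) using the same expressions as ResearchAgent._get_context."""
    ticker_match = re.search(r'\((.*?)\)', query)
    ticker = ticker_match.group(1) if ticker_match else ""
    # Text after the last "for " and before the first "(" is treated as the entity.
    entity = query.split('for ')[-1].split('(')[0].strip()
    return entity, ticker

print(parse_query("Generate a market analysis for NVIDIA (NVDA). Identify key growth drivers."))
# ('NVIDIA', 'NVDA')
print(parse_query("Research for Apple (AAPL) for investors"))
# ('investors', 'AAPL')
```

The second call shows a limitation of this heuristic: when "for " appears more than once, the entity is taken from the last occurrence, so queries work best in the form "... for Company (TICKER) ...".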

## Step 5: Run the Three-Step Analysis

Now we execute the structured workflow in three steps: first we display the raw financial data, then we generate the investment outlook, and finally the comprehensive business brief.

In [None]:
from IPython.display import display, HTML
import re

user_query = "Generate a market analysis for NVIDIA (NVDA). Identify key growth drivers and summarize recent news."

# To get fresh data and not use the cache, you can uncomment the next line:
# research_agent.clear_cache()

# --- Step 1: Display Raw Financial Data ---
print("--- 1. KEY FINANCIAL DATA ---")
ticker_match = re.search(r'\((.*?)\)', user_query)
ticker = ticker_match.group(1) if ticker_match else ""
financial_data_raw = get_stock_info.run(ticker) if ticker else "No ticker found."
display(HTML(f"<pre style='white-space: pre-wrap; font-family: monospace; background-color: #f0f0f0; padding: 10px; border-radius: 5px;'>{financial_data_raw}</pre>"))

# --- Step 2: Generate Investment Outlook ---
print("\n--- 2. AI-GENERATED INVESTMENT OUTLOOK ---")
outlook = research_agent.generate_outlook(user_query)
display(HTML(f"<div style='border: 1px solid #444; border-radius: 8px; padding: 20px; max-height: 500px; overflow-y: auto; white-space: pre-wrap; font-family: \"SF Pro Text\", \"Inter\", sans-serif; line-height: 1.6; background-color: #2c2c2e; color: #f0f0f0;'>{outlook}</div>"))

# --- Step 3: Generate Business Brief & SWOT Analysis ---
print("\n--- 3. COMPREHENSIVE BUSINESS BRIEF ---")
brief = research_agent.generate_brief(user_query)
display(HTML(f"<div style='border: 1px solid #444; border-radius: 8px; padding: 20px; max-height: 500px; overflow-y: auto; white-space: pre-wrap; font-family: \"SF Pro Text\", \"Inter\", sans-serif; line-height: 1.6; background-color: #2c2c2e; color: #f0f0f0;'>{brief}</div>"))
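
The cells above interpolate scraped and model-generated text directly into HTML. If that text contains characters such as `<` or `&`, the browser may interpret them as markup. A minimal sketch of one mitigation, using the standard library's `html.escape`, follows. `render_block` is a hypothetical helper, and its styling is reduced for brevity.

```python
import html

def render_block(text: str) -> str:
    """Wrap text in a <pre> element after escaping HTML-significant characters."""
    safe = html.escape(text)
    return f"<pre style='white-space: pre-wrap;'>{safe}</pre>"

print(render_block("a < b & c"))
# <pre style='white-space: pre-wrap;'>a &lt; b &amp; c</pre>
```

In the notebook, the result would be passed to `display(HTML(...))` in place of the raw f-string.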