# AI Agent for Market Research & Competitive Analysis

This notebook implements an AI agent designed to automate market research. The agent can take a high-level query about a company or technology, search the web for relevant articles, analyze the content, and generate a concise business brief.

**Project Goal:** To build a portfolio project demonstrating the practical application of LLMs, Retrieval-Augmented Generation (RAG), and AI agent tool use for a real-world business problem.

**Core Components:**
1.  **Agent Framework:** LangChain
2.  **LLM (for generation):** Google's `gemini-1.5-flash-latest` (accessible via a free API key)
3.  **Embedding Model:** `all-MiniLM-L6-v2` (Open-source from Hugging Face)
4.  **Vector Store:** FAISS (In-memory, local, and free)
5.  **Tools:**
    * Custom Google Search Tool
    * Web Scraper Tool
    * Yahoo Finance Tool

## Step 1: Install Dependencies

First, we install the required Python libraries. The cell checks whether `langchain` is already importable and skips the installation if it is, which saves time on subsequent runs.

In [None]:
import importlib.util  # find_spec lives in the importlib.util submodule, which must be imported explicitly

# Check if a key library is installed. If not, run the installation.
if not importlib.util.find_spec("langchain"):
  print("Installing dependencies...")
  # Use your GitHub username here
  !pip install -q -r https://raw.githubusercontent.com/eriktaylor/ai-agent-moat/main/requirements.txt
else:
  print("Dependencies are already installed.")
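
Because the check looks only for `langchain`, a partially installed environment would still skip the install step. A minimal sketch of a broader check follows. The helper name `missing_modules` and the module list are illustrative assumptions, not taken from the project's `requirements.txt`.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Hypothetical list of import names; adjust it to match requirements.txt.
required = ["langchain", "faiss", "yfinance", "bs4"]
missing = missing_modules(required)
print("Missing modules:", missing if missing else "none")
```

If `missing` is non-empty, the `pip install` line can be run; otherwise it can be skipped.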

## Step 2: Securely Set Up API Keys

We need API keys for Google services. We'll use Colab's built-in **Secrets Manager** to handle them, so the keys never appear in the notebook's code or saved output.

**Instructions:**
1.  Click the **key icon (🔑)** in the left sidebar of Colab.
2.  Click **"Add a new secret"**.
3.  Create a secret with the name `GOOGLE_API_KEY` and paste your Google AI Studio API key as the value.
4.  Create another secret named `GOOGLE_CSE_ID` with your Custom Search Engine ID.

You can get a Google API Key from [Google AI Studio](https://aistudio.google.com/app/apikey) and set up a Custom Search Engine [here](https://programmablesearchengine.google.com/controlpanel/all) to get a Search Engine ID.

In [None]:
import os
from google.colab import userdata

# Securely access the API keys
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
GOOGLE_CSE_ID = userdata.get('GOOGLE_CSE_ID')

# Set environment variables for LangChain
os.environ['GOOGLE_API_KEY'] = GOOGLE_API_KEY
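
If a secret is missing or empty, it helps to fail early with a clear message rather than later inside a tool call. The sketch below shows one way to do that. `load_secrets` is a hypothetical helper; it accepts any getter function so it can be tried outside Colab, and in the notebook you would pass `userdata.get`.

```python
def load_secrets(getter, names):
    """Fetch each named secret via `getter`; raise one clear error listing any that are missing."""
    secrets, missing = {}, []
    for name in names:
        try:
            value = getter(name)
        except Exception:  # the getter may raise instead of returning None
            value = None
        if value:
            secrets[name] = value
        else:
            missing.append(name)
    if missing:
        raise RuntimeError(f"Missing secrets: {', '.join(missing)}")
    return secrets

# Stand-in store for demonstration; in Colab, pass `userdata.get` instead of `demo_store.get`.
demo_store = {"GOOGLE_API_KEY": "dummy-key", "GOOGLE_CSE_ID": "dummy-cse-id"}
secrets = load_secrets(demo_store.get, ["GOOGLE_API_KEY", "GOOGLE_CSE_ID"])
print(sorted(secrets))
```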

## Step 3: Define the Agent's Tools

An agent's power comes from its tools. We will create three tools:
1.  **Web Search Tool:** To find relevant articles and sources.
2.  **Web Scraper Tool:** To extract the actual content from the URLs found by the search tool.
3.  **Yahoo Finance Tool:** To fetch key financial metrics for public companies.

In [None]:
import requests
from bs4 import BeautifulSoup
import yfinance as yf
from langchain_core.tools import tool
from langchain_community.utilities import GoogleSearchAPIWrapper

# Tool 1: Google Search Tool
# Standalone wrapper for ad-hoc searches; RagAgent creates its own instance in Step 4.
search = GoogleSearchAPIWrapper(google_cse_id=GOOGLE_CSE_ID, google_api_key=GOOGLE_API_KEY)

# Tool 2: Web Scraper Tool
@tool
def scrape_website(url: str) -> str:
    """Scrapes the text content of a given website URL."""
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            text = ' '.join(p.get_text() for p in soup.find_all('p'))
            if len(text) < 200: # Simple check for meaningful content
                return f"Error: Content from {url} is too short or likely requires JavaScript."
            return text[:4000]
        return f"Error: Received status code {response.status_code}"
    except requests.RequestException as e:
        return f"Error: Could not access the URL. {e}"

# Tool 3: Yahoo Finance Tool
@tool
def get_stock_info(ticker: str) -> str:
    """Fetches key financial information for a given stock ticker using Yahoo Finance."""
    try:
        stock = yf.Ticker(ticker)
        info = stock.info
        market_cap = info.get('marketCap', 'N/A')
        trailing_pe = info.get('trailingPE', 'N/A')
        forward_pe = info.get('forwardPE', 'N/A')
        long_business_summary = info.get('longBusinessSummary', 'N/A')
        return f"FINANCIAL DATA:\nMarket Cap: {market_cap}\nTrailing P/E: {trailing_pe}\nForward P/E: {forward_pe}\nBusiness Summary: {long_business_summary}"
    except Exception as e:
        return f"Error fetching stock info for {ticker}: {e}"
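
The scraper's filtering rule (join the `<p>` text, reject pages with fewer than 200 characters, truncate to 4000) can be tried offline. The sketch below is an approximation under stated assumptions: it swaps BeautifulSoup for the standard library's `html.parser` so it runs without network access or third-party packages, and `ParagraphExtractor` and `extract_text` are hypothetical names.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> elements."""
    def __init__(self):
        super().__init__()
        self._in_p = False      # True while the parser is inside a <p> element
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data

def extract_text(html_source, min_chars=200, max_chars=4000):
    """Join paragraph text; return None when the page has too little of it."""
    parser = ParagraphExtractor()
    parser.feed(html_source)
    text = " ".join(parser.paragraphs)
    return text[:max_chars] if len(text) >= min_chars else None

# Invented sample page: navigation text is ignored, paragraph text is kept.
sample = "<html><body><nav>Menu</nav><p>" + "Revenue grew. " * 20 + "</p><p>Short tail.</p></body></html>"
print(extract_text(sample)[:40])
print(extract_text("<p>Enable JavaScript to view this page.</p>"))
```

The second call returns `None`, which corresponds to the "too short or likely requires JavaScript" branch in `scrape_website`.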

## Step 4: Set Up the RAG Pipeline (In-Memory)

This is the core of our project. We will create a `RagAgent` class that encapsulates the logic for Retrieval-Augmented Generation.
It uses a **Hybrid Search Strategy** to gather both recent headlines and detailed documents, and it calls the financial data tool explicitly.
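
Retrieval in a RAG pipeline comes down to ranking text chunks by similarity to the query and passing the best ones to the LLM as context. The toy sketch below shows that idea with word-count vectors and cosine similarity, using only the standard library. It is a stand-in for illustration: the notebook itself uses `all-MiniLM-L6-v2` embeddings and a FAISS index, and the names `vectorize`, `cosine`, and `top_k` are hypothetical.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Turn text into a bag-of-words count vector (a crude substitute for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    q = vectorize(query)
    return sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)[:k]

# Invented example sentences, not real news.
chunks = [
    "Data center revenue rose on strong demand for AI accelerators.",
    "The company announced a new office lease in Austin.",
    "Gross margin improved as data center sales grew.",
]
for chunk in top_k("What is driving data center revenue?", chunks):
    print(chunk)
```

Word counts only match exact tokens, whereas a sentence embedding model can also match paraphrases, which is why the notebook uses one.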

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI
import re

class RagAgent:
    def __init__(self, llm, embeddings_model):
        self.llm = llm
        self.embeddings_model = embeddings_model
        self.text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
        self.retriever = None
        self.retrieval_chain = None
        self.search_wrapper = GoogleSearchAPIWrapper(google_cse_id=GOOGLE_CSE_ID, google_api_key=GOOGLE_API_KEY)

    def _create_rag_pipeline(self, text_corpus):
        """Creates a RAG pipeline from a given text corpus."""
        print("\nStep 1: Splitting combined text...")
        docs = self.text_splitter.split_text(text_corpus)
        
        print("Step 2: Creating FAISS vector store...")
        vector_store = FAISS.from_texts(texts=docs, embedding=self.embeddings_model)
        self.retriever = vector_store.as_retriever()

        print("Step 3: Creating retrieval chain with sophisticated prompt...")
        # Structured prompt: fixes the report sections the LLM must produce.
        system_prompt = (
            "You are an expert financial analyst writing a concise investment brief. "
            "Use the provided context, which includes recent news headlines, full articles, and key financial metrics, to generate your report. "
            "The report MUST be structured with the following sections:\n"
            "1. **Recent Developments:** A summary of the most important recent news from the context.\n"
            "2. **SWOT Analysis:** A bulleted list of the company's Strengths, Weaknesses, Opportunities, and Threats based on the context.\n"
            "3. **Valuation & Moat:** An analysis of the company's current valuation, referencing specific metrics like the P/E ratio if available in the context. Briefly comment on the company's competitive moat (its long-term competitive advantages)."
            "\n\n{context}"
        )
        prompt = ChatPromptTemplate.from_messages([
            ("system", system_prompt),
            ("human", "{input}"),
        ])
        
        question_answer_chain = create_stuff_documents_chain(self.llm, prompt)
        self.retrieval_chain = create_retrieval_chain(self.retriever, question_answer_chain)
        print("RAG pipeline is ready.")

    def run(self, query):
        print(f"Executing query: {query}")
        # Extract the ticker symbol from the query using regex
        ticker_match = re.search(r'\((.*?)\)', query)
        ticker = ticker_match.group(1) if ticker_match else ""
        entity = query.split('for ')[-1].split('(')[0].strip()
        
        all_text_content = []

        # Fetch financial data directly when the query includes a ticker.
        if ticker:
            print(f"\n--- Getting Financial Data for {ticker} ---")
            financial_data = get_stock_info.run(ticker)
            if not financial_data.startswith("Error"):
                all_text_content.append(financial_data)
                print("Successfully collected financial data.")

        # Tier 1: "Headline Scan" for recent news from major outlets
        print("\n--- Tier 1: Headline Scan ---")
        headline_query = f'"{entity}" recent news'
        print(f"Searching for headlines with: {headline_query}")
        headline_results = self.search_wrapper.results(headline_query, num_results=5)
        if headline_results:
            for result in headline_results:
                all_text_content.append(f"Headline: {result.get('title', '')}\nSnippet: {result.get('snippet', '')}")
            print(f"Collected {len(headline_results)} headlines and snippets.")

        # Tier 2: "Deep Dive" for detailed, scrape-friendly content
        print("\n--- Tier 2: Deep Dive ---")
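        # Note: scrape_website reads only HTML <p> tags, so PDF hits from filetype:pdf will likely be rejected by its length check.
        deep_dive_query = f'"{entity}" market analysis OR in-depth report filetype:pdf OR site:globenewswire.com OR site:prnewswire.com'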
        print(f"Searching for documents with: {deep_dive_query}")
        deep_dive_results = self.search_wrapper.results(deep_dive_query, num_results=5)
        if deep_dive_results:
            urls = [result['link'] for result in deep_dive_results if 'link' in result]
            print(f"Found {len(urls)} documents for deep dive.")
            for url in urls[:3]: # Scrape top 3
                print(f"Scraping {url}...")
                content = scrape_website.run(url)
                if content and not content.startswith("Error"):
                    all_text_content.append(content)
                else:
                    print(content) # Print error if scraping fails

        if not all_text_content:
            return "Could not retrieve any content from the web. Please try another query."
        
        full_text = "\n\n---\n\n".join(all_text_content)
        self._create_rag_pipeline(full_text)
        
        print("\nSynthesizing final answer from combined context...")
        response = self.retrieval_chain.invoke({"input": query})
        
        return response['answer']

# Initialize models
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash-latest", temperature=0.2) # Lower temperature for more factual output
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Create the agent
market_research_agent = RagAgent(llm=llm, embeddings_model=embeddings)
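
The `chunk_size=1000` and `chunk_overlap=100` settings control how the corpus is cut before embedding. The sketch below uses a plain sliding window to show what overlap means. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line boundaries rather than at fixed offsets, and `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        # Skip a final window that would contain only already-covered overlap.
        if start > 0 and start + chunk_overlap >= len(text):
            break
        chunks.append(text[start:start + chunk_size])
    return chunks

corpus = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(corpus)
print([len(c) for c in chunks])             # [1000, 1000, 700]
print(chunks[0][-100:] == chunks[1][:100])  # True
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of embedding some text twice.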

## Step 5: Run the Agent and Generate a Business Brief

Now it's time to test our agent. Let's give it a complex query and see what kind of analysis it can generate. The output will be displayed in a formatted, scrollable box for readability.
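
The query wording matters because `RagAgent.run` pulls the ticker from the first pair of parentheses and takes the company name from whatever follows the last `for ` and precedes the first `(`. The standalone sketch below reproduces that parsing so a query can be checked before spending API calls. `parse_query` is a hypothetical wrapper, not a function defined in the notebook.

```python
import re

def parse_query(query):
    """Mirror the ticker and entity extraction used in RagAgent.run."""
    ticker_match = re.search(r"\((.*?)\)", query)
    ticker = ticker_match.group(1) if ticker_match else ""
    entity = query.split("for ")[-1].split("(")[0].strip()
    return entity, ticker

print(parse_query("Generate a market analysis for NVIDIA (NVDA). Identify key growth drivers."))
# ('NVIDIA', 'NVDA')
print(parse_query("Summarize recent news about NVIDIA"))
# ('Summarize recent news about NVIDIA', '')
```

The second call shows the failure mode: without `for ` and a parenthesized ticker, the whole query becomes the search entity and the financial lookup is skipped.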

In [None]:
from html import escape
from IPython.display import display, HTML

user_query = "Generate a market analysis for NVIDIA (NVDA). Identify key growth drivers and summarize recent news."

final_brief = market_research_agent.run(user_query)

print("\n--- FINAL BUSINESS BRIEF ---\n")

# Display the output in a formatted, scrollable box.
# escape() keeps any '<' or '&' in the model's text from being parsed as HTML.
formatted_brief = f"""
<div style="border: 1px solid #e0e0e0; border-radius: 8px; padding: 20px; max-height: 500px; overflow-y: auto; white-space: pre-wrap; font-family: 'SF Pro Text', 'Inter', sans-serif; line-height: 1.6; background-color: #f9fafb;">
{escape(final_brief)}
</div>
"""
display(HTML(formatted_brief))