# Financial Market Intelligence RAG System
## CS6120 Final Project - Colab Version

**Team Members:** Soonbee Hwang & Xinyuan Fan (Amber)

This notebook runs the complete RAG system in Google Colab.


## Step 1: Install Dependencies


In [None]:
# Install all required packages
!pip install -q torch transformers sentence-transformers faiss-cpu numpy pandas beautifulsoup4 lxml tiktoken streamlit pyngrok

# Install llama-cpp-python (CPU version for Colab)
!pip install -q llama-cpp-python

# Install Kaggle API
!pip install -q kaggle

print("✅ All dependencies installed!")
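A quick sanity check after installation can catch a failed build early (for example, `llama-cpp-python` compiles native code and can fail silently under `-q`). The sketch below only checks whether each module can be located; it does not import them. The helper name `find_missing` is illustrative and not part of the project.

```python
import importlib.util

# Import names differ from pip names for some packages
# (faiss-cpu -> faiss, beautifulsoup4 -> bs4, llama-cpp-python -> llama_cpp).
REQUIRED_MODULES = [
    "torch", "transformers", "sentence_transformers", "faiss", "numpy",
    "pandas", "bs4", "lxml", "tiktoken", "streamlit", "pyngrok",
    "llama_cpp", "kaggle",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```

If anything is listed as missing, re-running the corresponding `pip install` line without `-q` shows the full error output.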


## Step 2: Download Dataset

**Option A: Using Kaggle API (Recommended)**

1. Go to https://www.kaggle.com/account
2. Download your `kaggle.json` API credentials
3. Upload it in the cell below


In [None]:
# Upload kaggle.json file
from google.colab import files
import os

uploaded = files.upload()

# Move kaggle.json to the correct location
for fn in uploaded.keys():
    if fn == 'kaggle.json':
        os.makedirs('/root/.kaggle', exist_ok=True)
        !mv kaggle.json /root/.kaggle/
        !chmod 600 /root/.kaggle/kaggle.json
        print("✅ Kaggle credentials configured!")
        break
else:
    print("⚠️ Please upload kaggle.json file")
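Before calling the Kaggle CLI, it can help to confirm that the credentials file is in place and well formed. The following is a minimal sketch, assuming the standard `kaggle.json` layout with `username` and `key` fields; the function name `check_kaggle_credentials` is hypothetical.

```python
import json
from pathlib import Path


def check_kaggle_credentials(path="/root/.kaggle/kaggle.json"):
    """Return a list of problems with the credentials file (empty if it looks fine)."""
    p = Path(path)
    if not p.is_file():
        return [f"{p} does not exist"]
    try:
        creds = json.loads(p.read_text())
    except json.JSONDecodeError:
        return [f"{p} is not valid JSON"]
    if not isinstance(creds, dict):
        return [f"{p} does not contain a JSON object"]
    problems = [f"missing field: {k}" for k in ("username", "key") if not creds.get(k)]
    # Any group/other permission bit means the file is not restricted to its owner.
    if p.stat().st_mode & 0o077:
        problems.append("file is accessible to other users; run chmod 600")
    return problems


print(check_kaggle_credentials() or "Credentials file looks fine")
```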


In [None]:
# Download dataset
import os

os.makedirs('data/raw', exist_ok=True)

!cd data/raw && kaggle datasets download -d aaron7sun/stocknews
!cd data/raw && unzip -o -q stocknews.zip  # -o overwrites existing files so the cell can be re-run

print("✅ Dataset downloaded!")
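Since the index-building step later refers to a specific text column, it may be worth listing the extracted CSV files and their headers first. This is a small sketch using only the standard library; `summarize_csvs` is an illustrative helper, and whether the project's `preprocess_data` renames columns is not known from this notebook.

```python
import csv
from pathlib import Path


def summarize_csvs(directory="data/raw"):
    """Map each CSV file name in `directory` to its header row."""
    summary = {}
    for path in sorted(Path(directory).glob("*.csv")):
        with open(path, newline="", encoding="utf-8", errors="replace") as f:
            # An empty file yields an empty header list instead of raising.
            summary[path.name] = next(csv.reader(f), [])
    return summary


for name, columns in summarize_csvs().items():
    print(f"{name}: {columns}")
```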


## Step 3: Download LLM Model

Download the Mistral 7B Instruct v0.1 GGUF model (Q4_K_M quantization, ~4.4 GB).


In [None]:
import os
os.makedirs('models', exist_ok=True)

# Download Mistral 7B model (-c resumes a partial download and skips a completed one)
!cd models && wget -q -c --show-progress https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf

print("✅ Model downloaded!")
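Because `wget -q` suppresses most errors, an interrupted or failed download can leave a truncated file or an HTML error page behind. GGUF files begin with the four ASCII bytes `GGUF`, so a cheap check is possible. The sketch below assumes the file name from the download URL above; `looks_like_gguf` is a hypothetical helper, and passing this check does not prove the file is complete.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def looks_like_gguf(path):
    """Return True if `path` exists and starts with the GGUF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with open(p, "rb") as f:
        return f.read(4) == GGUF_MAGIC


model_path = Path("models/mistral-7b-instruct-v0.1.Q4_K_M.gguf")
if looks_like_gguf(model_path):
    print(f"Model header OK, size: {model_path.stat().st_size / 1e9:.2f} GB")
else:
    print("Model file is missing or is not a GGUF file")
```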


## Step 4: Upload Project Code

Upload the project source code files, or clone from GitHub:


In [None]:
# Clone from GitHub (alternatively, upload the src/, ui/, and scripts/ folders manually)
!git clone https://github.com/amberfxy/financial-market-intelligence-rag.git
!cp -r financial-market-intelligence-rag/src .
!cp -r financial-market-intelligence-rag/ui .
!cp -r financial-market-intelligence-rag/scripts .

print("✅ Code downloaded!")


## Step 5: Build FAISS Index


In [None]:
import os
import sys
sys.path.insert(0, '.')

import pandas as pd
import logging
from src.data.loader import load_kaggle_dataset, preprocess_data
from src.chunking.chunker import chunk_dataframe
from src.embeddings.embedder import BGEEmbedder
from src.vectorstore.faiss_store import FAISSStore

logging.basicConfig(level=logging.INFO)

# Load and preprocess data
print("Loading dataset...")
df = load_kaggle_dataset('data/raw')
df = preprocess_data(df)

# Chunk documents
print("Chunking documents...")
chunks = chunk_dataframe(df, text_column='News Headline', max_tokens=250)

# Generate embeddings
print("Generating embeddings...")
embedder = BGEEmbedder()
chunk_texts = [chunk["text"] for chunk in chunks]
embeddings = embedder.embed_texts(chunk_texts, batch_size=32)

# Build FAISS index
print("Building FAISS index...")
os.makedirs('vectorstore', exist_ok=True)
vectorstore = FAISSStore(dimension=embeddings.shape[1])
vectorstore.add_chunks(embeddings, chunks)
vectorstore.save('vectorstore/faiss.index', 'vectorstore/chunks.pkl')

print(f"✅ Index built successfully! Total chunks: {len(chunks)}")
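To make the retrieval step concrete: once chunks are embedded, answering a query amounts to ranking the stored vectors by similarity to the query vector. The sketch below illustrates this with brute-force cosine similarity in plain Python. It is a stand-in for illustration only; the project's `FAISSStore` search API is not shown in this notebook, and `normalize` and `top_k` are hypothetical names.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, vectors, k=3):
    """Return (index, cosine similarity) pairs for the k most similar vectors."""
    q = normalize(query)
    # The dot product of unit vectors equals their cosine similarity.
    scores = [sum(a * b for a, b in zip(q, normalize(v))) for v in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
for index, score in top_k([1.0, 0.1, 0.0], toy_vectors, k=2):
    print(f"chunk {index}: similarity {score:.3f}")
```

The returned indices would be used to look up the matching chunk texts, which are then passed to the LLM as context.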


## Step 6: Launch the Streamlit App


In [None]:
# Setup ngrok for public access (optional)
from pyngrok import ngrok

# Get your ngrok authtoken from https://dashboard.ngrok.com/get-started/your-authtoken
# Uncomment and set your token:
# ngrok.set_auth_token("YOUR_NGROK_TOKEN")

# Start ngrok tunnel (if using ngrok)
# public_url = ngrok.connect(8501)
# print(f"✅ Streamlit app will be available at: {public_url}")

# Alternative: Use Colab's built-in port forwarding
print("✅ Use Colab's 'Open in new tab' option to access the app")
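If ngrok is not used, Colab can expose a kernel port through its own proxy. The sketch below asks the Colab front end for the proxied URL via `google.colab.output.eval_js` and falls back gracefully outside Colab. Whether Streamlit's websocket connection works reliably through this proxy has not been verified here; `colab_proxy_url` is an illustrative helper.

```python
def colab_proxy_url(port=8501):
    """Return Colab's proxied URL for `port`, or None when not running in Colab."""
    try:
        from google.colab.output import eval_js
    except ImportError:
        return None
    return eval_js(f"google.colab.kernel.proxyPort({port})")


url = colab_proxy_url(8501)
print(url or "Not running in Colab; open http://localhost:8501 instead")
```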


In [None]:
# Run Streamlit app
!streamlit run ui/app.py --server.port=8501 --server.address=0.0.0.0
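The `!streamlit run` command blocks the cell until the app stops, so no later cell can run while the app is up. One alternative is to start the process in the background and write its output to a log file. This is a minimal sketch; `launch_in_background` and the log file name are assumptions, not part of the project.

```python
import subprocess
import sys


def launch_in_background(command, log_path="streamlit.log"):
    """Start `command` without blocking; stdout and stderr are written to `log_path`."""
    with open(log_path, "w") as log_file:
        # The child keeps its own handle to the log after the parent closes this one.
        return subprocess.Popen(command, stdout=log_file, stderr=subprocess.STDOUT)


# In the notebook this could replace the `!streamlit run` line:
# proc = launch_in_background(
#     [sys.executable, "-m", "streamlit", "run", "ui/app.py",
#      "--server.port=8501", "--server.address=0.0.0.0"])
# Call proc.terminate() to stop the app.
```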


## Important Notes

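1. **Runtime**: Use a GPU runtime (Runtime → Change runtime type → GPU) to speed up embedding generation. The `llama-cpp-python` package installed in Step 1 is the CPU version, so LLM inference will not benefit from the GPU.
2. **Persistence**: Files on the Colab VM (dataset, model, index) are lost when the session ends. Consider copying them to Google Drive.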
3. **Model Size**: The Mistral 7B Q4_K_M model file is ~4.4 GB, so the download may take several minutes.
4. **Index Building**: Building the FAISS index may take 10-30 minutes depending on data size.
5. **Session Timeout**: Colab sessions time out after a period of inactivity. Keep the tab active.
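For the persistence note above, one option is to copy the expensive artifacts (the FAISS index and the model) to a mounted Drive folder so they can be restored in a later session. The sketch below is one way to do this under stated assumptions: `backup_artifacts` is a hypothetical helper and the Drive folder name `rag_backup` is arbitrary.

```python
import shutil
from pathlib import Path


def backup_artifacts(sources, destination):
    """Copy each existing file or directory in `sources` into `destination`.

    Returns the names that were copied; paths that do not exist are skipped.
    """
    dest = Path(destination)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, sources):
        if src.is_dir():
            shutil.copytree(src, dest / src.name, dirs_exist_ok=True)
        elif src.is_file():
            shutil.copy2(src, dest / src.name)
        else:
            continue  # not created yet, nothing to back up
        copied.append(src.name)
    return copied


# In Colab, mount Drive first, then copy (folder name is an assumption):
# from google.colab import drive
# drive.mount('/content/drive')
# backup_artifacts(['vectorstore', 'models'], '/content/drive/MyDrive/rag_backup')
```

Restoring in a new session is the same call with the source and destination swapped.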
