# Building MCP Servers for Compliance Documentation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ethanolivertroy/pcidss-mcp-training/blob/main/build-mcp-server.ipynb)

**A hands-on tutorial for GRC professionals**

---

## What You'll Learn

This notebook teaches you how to build a Model Context Protocol (MCP) server for compliance documentation. We use PCI-DSS as the example, but these patterns work for **any** compliance framework:

- NIST 800-53
- ISO 27001
- SOC 2
- FedRAMP
- HIPAA

### What is an MCP Server?

MCP servers provide **tools** that AI assistants like Claude can call to:
- Search structured compliance data
- Retrieve specific requirements by ID
- Map controls between frameworks
- Generate compliance checklists

### This Notebook is Fully Self-Contained

Everything runs right here in **Python** using [FastMCP](https://github.com/jlowin/fastmcp), a widely used Python framework for MCP development. It works in:
- Google Colab
- Local Jupyter
- VS Code with Colab extension

---

## Design Principles for MCP Servers

Key insights from [production MCP deployments](https://newsletter.pragmaticengineer.com/p/mcp-deepdive):

### Design for Agents, Not Humans
MCP servers are called by AI assistants, not humans directly:
- Return structured JSON, not formatted text
- Include metadata that helps agents understand context
- Provide clear error messages with actionable suggestions
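
As a rough illustration of the last point, the sketch below builds a structured "not found" response that suggests similar IDs. The function name `not_found_response`, the error code string, and the sample IDs are assumptions made for this example; they are not part of MCP or FastMCP.

```python
import difflib
from typing import Any, Dict, List


def not_found_response(req_id: str, known_ids: List[str]) -> Dict[str, Any]:
    """Build a machine-readable error that an agent can act on."""
    # Standard-library fuzzy matching: up to 3 IDs that look like the one requested
    suggestions = difflib.get_close_matches(req_id, known_ids, n=3, cutoff=0.6)
    return {
        "error": {
            "code": "REQUIREMENT_NOT_FOUND",  # hypothetical code, chosen for this sketch
            "message": f"No requirement with ID '{req_id}'.",
            "suggestions": suggestions,
        }
    }


# Made-up ID list standing in for a real index
known = ["1.1", "1.1.1", "1.2", "1.2.1", "12.1"]
response = not_found_response("1.2.9", known)
print(response)
```

An agent that receives this can retry with one of the suggested IDs instead of having to parse a free-text message.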

### Start Internal, Then Public
Most successful MCPs start as internal tools:
- Validate use cases in controlled environments first
- Security is easier when you control both client and server
- Public servers face client compatibility challenges

### Keep It Focused
- Expose only necessary capabilities (least privilege)
- One MCP per domain (compliance, ticketing, docs)
- Avoid feature creep - simple tools are more reliable

## Part 1: Data Source - PCI-DSS Prioritized Approach Tool

### Why Excel Over PDF?

Many compliance documents are available as PDFs, but PDFs are difficult to parse. The same data is often available in Excel format, which is:

- **Already structured** - No complex extraction needed
- **Reliable** - Consistent column layouts
- **Fast** - Parsing takes a fraction of a second instead of a slow PDF extraction pass
- **Accurate** - No OCR errors or layout issues

For PCI-DSS, the **Prioritized Approach Tool** is an official Excel file from the PCI Security Standards Council.

### Download the Excel File

1. Download from PCI SSC:
   
   **[Prioritized-Approach-Tool-For-PCI-DSS-v4_0_1.xlsx](https://docs-prv.pcisecuritystandards.org/PCI%20DSS/Supporting%20Document/Prioritized-Approach-Tool-For-PCI-DSS-v4_0_1.xlsx)**

2. Accept the license agreement

3. Upload when prompted (Colab) or place in your project directory (local)

### Legal Notes

The Excel file is licensed by PCI SSC for **internal use only**:
- Use for your organization's compliance work
- Do not redistribute converted data publicly
- Each user should download and accept the license directly

## Part 2: Parsing Excel Data

The Prioritized Approach Tool contains a sheet called "Prioritized Approach Milestones" with:

| Column A | Column B |
|----------|----------|
| Requirement text | Milestone priority (1-6) |

Requirements follow a hierarchical numbering:
- **Level 1**: "Requirement 1: Install and Maintain Network Security Controls"
- **Level 2**: "1.1 Processes and mechanisms..."
- **Level 3**: "1.1.1 All security policies..."
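
Because these IDs are dotted numbers, they have to be compared part by part rather than as plain strings (otherwise "1.10" sorts before "1.2"). The following sketch shows one way to derive a sort key and a parent ID; the helper names `id_key` and `parent_of` are made up for this illustration.

```python
from typing import Optional, Tuple


def id_key(req_id: str) -> Tuple[int, ...]:
    """Turn '1.10' into (1, 10) so that IDs compare numerically."""
    return tuple(int(part) for part in req_id.split("."))


def parent_of(req_id: str) -> Optional[str]:
    """Drop the last component: '1.2.3' -> '1.2'. Top-level IDs have no parent."""
    parts = req_id.split(".")
    return ".".join(parts[:-1]) if len(parts) > 1 else None


ids = ["1.10", "1.2", "1.2.1", "10", "2"]
string_sorted = sorted(ids)               # lexicographic: "1.10" lands before "1.2"
numeric_sorted = sorted(ids, key=id_key)  # hierarchical order
print(string_sorted)
print(numeric_sorted)
print(parent_of("1.2.1"), parent_of("10"))
```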

In [None]:
# Install dependencies
!pip install -q openpyxl
print("Dependencies installed.")

In [None]:
from pathlib import Path
import re
from typing import Dict, List, Any, Optional, Tuple

# Detect environment
try:
    import google.colab
    IN_COLAB = True
    print("Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("Running locally")

if IN_COLAB:
    from google.colab import files
    print("\nPlease upload the Prioritized Approach Tool Excel file...")
    print("Download from: https://docs-prv.pcisecuritystandards.org/PCI%20DSS/Supporting%20Document/Prioritized-Approach-Tool-For-PCI-DSS-v4_0_1.xlsx")
    uploaded = files.upload()
    excel_path = Path(list(uploaded.keys())[0])
    print(f"\nUploaded: {excel_path.name}")
else:
    excel_path = Path("Prioritized-Approach-Tool-For-PCI-DSS-v4_0_1.xlsx")
    if not excel_path.exists():
        print("Excel file not found. Please download from:")
        print("https://docs-prv.pcisecuritystandards.org/PCI%20DSS/Supporting%20Document/Prioritized-Approach-Tool-For-PCI-DSS-v4_0_1.xlsx")
    else:
        print(f"Found: {excel_path.name}")

In [None]:
def parse_requirement_cell(text: str) -> Optional[Dict[str, Any]]:
    """Parse requirement ID and title from Excel cell."""
    if not text:
        return None
    
    text = str(text).strip()
    
    # Pattern: "1.1.1 Title text"
    match = re.match(r'^(\d+(?:\.\d+)*)\s+(.+)', text)
    if match:
        req_id = match.group(1)
        title = match.group(2).strip()
        level = len(req_id.split('.'))
        return {'id': req_id, 'title': title, 'level': level}
    
    # Pattern: "Requirement 1: Title"
    section_match = re.match(r'^Requirement\s+(\d+):\s*(.+)', text, re.IGNORECASE)
    if section_match:
        return {
            'id': section_match.group(1),
            'title': section_match.group(2).strip(),
            'level': 1
        }
    
    return None

# Test
test_cases = [
    "Requirement 1: Install and Maintain Network Security Controls",
    "1.1 Processes and mechanisms for network security controls",
    "1.2.3 Specific requirement text"
]
print("Testing parser:")
for text in test_cases:
    result = parse_requirement_cell(text)
    if result:
        print(f"  {result['id']}: Level {result['level']}")

In [None]:
from openpyxl import load_workbook
import warnings

# Suppress harmless openpyxl warning about Data Validation
# (Excel has dropdown menus that openpyxl ignores - doesn't affect our data)
warnings.filterwarnings('ignore', category=UserWarning, module='openpyxl')

def extract_requirements(excel_path: Path) -> List[Dict[str, Any]]:
    """Extract requirements from Excel file."""
    workbook = load_workbook(excel_path, data_only=True)
    
    sheet_name = "Prioritized Approach Milestones"
    if sheet_name not in workbook.sheetnames:
        print(f"Available sheets: {workbook.sheetnames}")
        return []
    
    worksheet = workbook[sheet_name]
    requirements = []
    
    for row in worksheet.iter_rows(min_row=3, values_only=True):
        if not row or not row[0]:
            continue
        
        parsed = parse_requirement_cell(str(row[0]))
        if not parsed:
            continue
        
        parts = parsed['id'].split('.')
        parent_id = '.'.join(parts[:-1]) if len(parts) > 1 else None
        
        requirement = {
            'id': parsed['id'],
            'level': parsed['level'],
            'parentId': parent_id,
            'title': parsed['title'],
            'statement': parsed['title'],
        }
        
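        # Column B holds the milestone (1-6); tolerate short rows and non-numeric cells
        milestone = row[1] if len(row) > 1 else None
        if milestone not in (None, ''):
            try:
                requirement['milestone'] = int(milestone)
            except (TypeError, ValueError):
                pass  # e.g. a stray note in the milestone column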
        
        requirements.append(requirement)
    
    requirements.sort(key=lambda r: [int(x) for x in r['id'].split('.')])
    return requirements

# Extract requirements
if excel_path.exists():
    requirements = extract_requirements(excel_path)
    print(f"Extracted {len(requirements)} requirements")
    
    levels = {}
    for req in requirements:
        levels[req['level']] = levels.get(req['level'], 0) + 1
    
    # Show level distribution with explanation
    print("\nLevel distribution:")
    level_names = {
        1: "Principal requirements (Requirement 1-12)",
        2: "Sub-requirements (1.1, 1.2, etc.)",
        3: "Detailed requirements (1.1.1, 1.2.3, etc.)",
        4: "Specific controls (1.2.3.1, etc.)",
        5: "Deep nesting (rare)"
    }
    for level in sorted(levels.keys()):
        desc = level_names.get(level, "")
        print(f"  Level {level}: {levels[level]:>3}  {desc}")
    
    print("\nCompare these counts against the PCI-DSS v4.0.1 document to confirm no rows were skipped.")
else:
    requirements = []

## Part 3: Building a Fast Index

When an AI asks "What is requirement 1.2.3?", we need O(1) lookup, not O(n) linear search.

In [None]:
class RequirementIndex:
    """Index with O(1) lookup by ID. Note: get_children scans all entries, so it is O(n)."""
    
    def __init__(self):
        self.requirements: Dict[str, Dict] = {}
        self.by_level: Dict[int, List[str]] = {}
        
    def add(self, req: Dict):
        self.requirements[req['id']] = req
        level = req['level']
        if level not in self.by_level:
            self.by_level[level] = []
        self.by_level[level].append(req['id'])
    
    def get(self, req_id: str) -> Optional[Dict]:
        return self.requirements.get(req_id)
    
    def get_children(self, parent_id: str) -> List[Dict]:
        return sorted(
            [r for r in self.requirements.values() if r.get('parentId') == parent_id],
            key=lambda r: [int(x) for x in r['id'].split('.')]
        )

# Build index
if requirements:
    index = RequirementIndex()
    for req in requirements:
        index.add(req)
    print(f"Indexed {len(index.requirements)} requirements")
    
    # Test
    req = index.get("1.2")
    if req:
        print(f"\nTest lookup 1.2: {req['title'][:50]}...")
        children = index.get_children("1.2")
        print(f"Children of 1.2: {len(children)}")
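
The `get_children` method above filters every requirement on each call. If child lookups become frequent, one option is to precompute a parent-to-children map once. This is a sketch only: `build_children_map` is a hypothetical helper, and the sample records are made up to mirror the `id` / `parentId` fields used in this notebook.

```python
from collections import defaultdict
from typing import Dict, List


def build_children_map(reqs: List[Dict]) -> Dict[str, List[str]]:
    """Map each parent ID to its child IDs, sorted in hierarchical order."""
    children: Dict[str, List[str]] = defaultdict(list)
    for req in reqs:
        parent_id = req.get("parentId")
        if parent_id is not None:
            children[parent_id].append(req["id"])
    for child_ids in children.values():
        # Numeric sort so that "1.10" comes after "1.2"
        child_ids.sort(key=lambda i: [int(x) for x in i.split(".")])
    return dict(children)


# Made-up records in the same shape as the extracted requirements
sample = [
    {"id": "1", "parentId": None},
    {"id": "1.10", "parentId": "1"},
    {"id": "1.2", "parentId": "1"},
    {"id": "1.2.1", "parentId": "1.2"},
]
children_map = build_children_map(sample)
print(children_map.get("1", []))
```

The trade-off is a little extra memory and a rebuild whenever the data changes, which is usually acceptable for a read-only compliance dataset.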

## Part 4: MCP Tool Patterns (Python)

Every MCP tool follows: **Schema -> Execute -> Return**

In [None]:
from dataclasses import dataclass

@dataclass
class GetRequirementInput:
    id: str
    include_children: bool = False

def get_requirement_tool(params: GetRequirementInput, idx: RequirementIndex):
    """MCP tool pattern: schema (typed input) -> execute -> return."""
    # `params` avoids shadowing Python's built-in input()
    req = idx.get(params.id)
    if not req:
        raise ValueError(f"Requirement not found: {params.id}")
    
    result = {'requirement': req}
    if params.include_children:
        result['children'] = idx.get_children(params.id)
    return result

# Test
if requirements:
    result = get_requirement_tool(GetRequirementInput(id="1", include_children=True), index)
    print(f"Requirement 1: {result['requirement']['title'][:40]}...")
    print(f"Children: {len(result['children'])}")
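
The dataclass above describes the shape of the input but does not check the ID format. A schema step could also reject malformed IDs before any lookup happens. The sketch below shows one way to do that with `__post_init__`; the class name `ValidatedRequirementInput` and the pattern `ID_PATTERN` are assumptions for illustration and apply to PCI-DSS style dotted IDs only.

```python
import re
from dataclasses import dataclass

# Dotted numeric IDs such as "1", "1.2" or "1.2.3" (assumed format for this sketch)
ID_PATTERN = re.compile(r"\d+(?:\.\d+)*")


@dataclass
class ValidatedRequirementInput:
    id: str
    include_children: bool = False

    def __post_init__(self) -> None:
        # Normalize whitespace, then reject anything that is not a dotted number
        self.id = self.id.strip()
        if not ID_PATTERN.fullmatch(self.id):
            raise ValueError(
                f"Invalid requirement ID '{self.id}'. Expected a format like '1.2.3'."
            )


ok = ValidatedRequirementInput(id=" 1.2.3 ")
print(ok.id)

try:
    ValidatedRequirementInput(id="AC-1")
except ValueError as exc:
    print(f"Rejected: {exc}")
```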

## Part 5: Full-Text Search

TF-IDF scoring for "find requirements about encryption" queries.
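
For reference, the scoring used by the `search` method in the next cell can be written as a formula. Here $N$ is the number of indexed requirements, $\mathrm{tf}(w, d)$ is how many times word $w$ occurs in requirement $d$, and $\mathrm{df}(w)$ is the number of requirements containing $w$ (this notation is chosen for the note and does not appear in the code):

$$
\mathrm{score}(q, d) = \sum_{w \in q} \mathrm{tf}(w, d) \cdot \ln\frac{N}{\mathrm{df}(w)}
$$

For example, with $N = 100$, a query word that appears in 10 requirements and occurs twice in $d$ contributes $2 \cdot \ln(100/10) \approx 4.61$. A word that appears in every requirement contributes $\ln(1) = 0$, so very common words do not influence the ranking.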

In [None]:
from collections import Counter
import math

class SearchIndex:
    """Simple TF-IDF search."""
    
    def __init__(self, reqs: List[Dict]):
        self.requirements = {r['id']: r for r in reqs}
        self.inverted_index: Dict[str, List] = {}
        self.doc_count = len(reqs)
        
        for req in reqs:
            text = f"{req['id']} {req['title']}".lower()
            word_counts = Counter(text.split())
            for word, count in word_counts.items():
                if word not in self.inverted_index:
                    self.inverted_index[word] = []
                self.inverted_index[word].append((req['id'], count))
    
    def search(self, query: str, limit: int = 5):
        scores = {}
        for word in query.lower().split():
            if word not in self.inverted_index:
                continue
            idf = math.log(self.doc_count / len(self.inverted_index[word]))
            for doc_id, tf in self.inverted_index[word]:
                scores[doc_id] = scores.get(doc_id, 0) + tf * idf
        
        return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:limit]

# Test
if requirements:
    search_index = SearchIndex(requirements)
    results = search_index.search("network security firewall")
    print("Search: 'network security firewall'")
    for req_id, score in results:
        print(f"  [{req_id}] {search_index.requirements[req_id]['title'][:50]}...")
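
The index above tokenizes with `str.split()`, so punctuation stays attached to words ("controls," and "controls" become different terms). A possible refinement, sketched here with a hypothetical `tokenize` helper, is a regex tokenizer that strips punctuation while keeping dotted requirement IDs intact. The sample sentence is made up for the example.

```python
import re
from typing import List

# Match dotted IDs like "1.2.3" first, then plain alphanumeric words
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)+|[a-z0-9]+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into words and requirement IDs."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = tokenize("1.2.3 Network security controls (NSCs), are configured.")
print(tokens)
```

To use it, the same function would have to be applied both when building the index and when splitting the query, so that both sides produce matching terms.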

## Part 6: Save Data as JSON

Export for use in the MCP server.

In [None]:
import json
from datetime import datetime, timezone

if requirements:
    output = {
        'info': {
            'version': '4.0.1',
            # Timezone-aware UTC timestamp, so the value is unambiguous across machines
            'convertedAt': datetime.now(timezone.utc).isoformat(),
        },
        'requirements': requirements,
    }
    
    output_file = Path('requirements.json')
    with open(output_file, 'w') as f:
        json.dump(output, f, indent=2)
    
    print(f"Saved: {output_file} ({output_file.stat().st_size / 1024:.1f} KB)")
    print(f"Requirements: {len(requirements)}")
    
    if IN_COLAB:
        print("\n(File will be used by the FastMCP server in Part 8)")

## Part 7: Adapting for Other Frameworks

The same patterns work for any compliance framework:

### NIST 800-53
```python
# Pattern: AC-1, AC-1(1)
match = re.match(r'^([A-Z]{2}-\d+(?:\(\d+\))?)\s+(.+)$', text)
```

### ISO 27001
```python
# Pattern: A.5.1.1
match = re.match(r'^(A\.\d+(?:\.\d+)*)\s+(.+)$', text)
```

### Key Steps
1. Find structured data (Excel/CSV > PDF)
2. Write regex for your ID format
3. Build the same index structure
4. Adapt tool descriptions
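
The steps above can be sketched as a small parser that switches regex by framework. The dictionary keys, the function names `parse_control` and `nist_parent`, and the sample control titles are assumptions made for this illustration; real source files may need different patterns.

```python
import re
from typing import Dict, Optional

# One ID pattern per framework, reusing the regexes shown in this notebook
ID_PATTERNS = {
    "pci-dss": re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$"),
    "nist-800-53": re.compile(r"^([A-Z]{2}-\d+(?:\(\d+\))?)\s+(.+)$"),
    "iso-27001": re.compile(r"^(A\.\d+(?:\.\d+)*)\s+(.+)$"),
}


def parse_control(text: str, framework: str) -> Optional[Dict[str, str]]:
    """Split a cell like 'AC-2(1) Some title' into an ID and a title."""
    match = ID_PATTERNS[framework].match(text.strip())
    if not match:
        return None
    return {"id": match.group(1), "title": match.group(2).strip()}


def nist_parent(control_id: str) -> Optional[str]:
    """Treat 'AC-2(1)' as a child of 'AC-2'; base controls get no parent here."""
    base, paren, _ = control_id.partition("(")
    return base if paren else None


# Illustrative inputs, not copied from any official catalog file
nist = parse_control("AC-2(1) Example enhancement title", "nist-800-53")
iso = parse_control("A.5.1.1 Example control title", "iso-27001")
print(nist, nist_parent(nist["id"]))
print(iso)
```

Only the ID pattern and the parent rule change between frameworks; the index and tool code can stay the same.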

---

## Part 8: Building the MCP Server with FastMCP

Now let's build a **real, working MCP server** using [FastMCP](https://github.com/jlowin/fastmcp), a widely used Python framework for MCP development.

### Why FastMCP?
- Pure Python - no TypeScript/Node.js needed
- Decorator-based API - clean and intuitive
- Production-ready - used by many organizations
- Great documentation and community support

In [None]:
# Install FastMCP
!pip install -q fastmcp
print("FastMCP installed!")

In [None]:
%%writefile compliance_server.py
"""
PCI-DSS Compliance MCP Server
Built with FastMCP - the leading Python framework for MCP development
"""
from fastmcp import FastMCP
import json
from pathlib import Path

# Initialize server
mcp = FastMCP("pci-dss-compliance")

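# Load requirements at startup. Resolve the path relative to this file so the
# server also works when a client launches it from a different working directory.
DATA_FILE = Path(__file__).resolve().parent / "requirements.json"
data = json.loads(DATA_FILE.read_text(encoding="utf-8"))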
REQUIREMENTS = {r["id"]: r for r in data["requirements"]}

@mcp.tool()
def get_requirement(id: str, include_children: bool = False) -> dict:
    """
    Get a specific PCI-DSS requirement by ID.

    Args:
        id: Requirement ID (e.g., "1.2.3")
        include_children: Include child requirements in response
    """
    req = REQUIREMENTS.get(id)
    if not req:
        return {"error": f"Requirement {id} not found"}

    result = {"requirement": req}
    if include_children:
        result["children"] = [
            r for r in REQUIREMENTS.values()
            if r.get("parentId") == id
        ]
    return result

@mcp.tool()
def search_requirements(query: str, limit: int = 10) -> dict:
    """
    Search PCI-DSS requirements by keyword.

    Args:
        query: Search terms (e.g., "encryption", "firewall")
        limit: Maximum number of results to return
    """
    query_lower = query.lower()
    matches = [
        {"id": r["id"], "title": r["title"]}
        for r in REQUIREMENTS.values()
        if query_lower in r["title"].lower()
    ]
    return {
        "query": query,
        "count": len(matches),
        "results": matches[:limit]
    }

@mcp.tool()
def list_requirements(level: int | None = None, limit: int = 20) -> dict:
    """
    List PCI-DSS requirements, optionally filtered by hierarchy level.

    Args:
        level: Filter by level (1=principal, 2=sub, 3=detailed, 4=specific)
        limit: Maximum number of results to return
    """
    if level is not None:
        reqs = [r for r in REQUIREMENTS.values() if r["level"] == level]
    else:
        reqs = list(REQUIREMENTS.values())

    return {
        "level": level or "all",
        "count": len(reqs),
        "requirements": [
            {"id": r["id"], "title": r["title"]}
            for r in reqs[:limit]
        ]
    }

if __name__ == "__main__":
    mcp.run()

In [None]:
# Test the MCP server logic directly
# Note: FastMCP decorators wrap functions for the MCP protocol,
# so we test the logic by re-implementing it here with our loaded data.

import json

# Load the JSON data (same as the server does)
with open("requirements.json") as f:
    data = json.load(f)
REQUIREMENTS = {r["id"]: r for r in data["requirements"]}

print("=== Testing MCP Server Logic ===\n")

# Test get_requirement logic
print("--- get_requirement('1', include_children=True) ---")
req = REQUIREMENTS.get("1")
result = {"requirement": req}
result["children"] = [r for r in REQUIREMENTS.values() if r.get("parentId") == "1"]
print(f"Requirement: {result['requirement']['title']}")
print(f"Children: {len(result['children'])} sub-requirements\n")

# Test search_requirements logic
print("--- search_requirements('network security') ---")
query_lower = "network security".lower()
matches = [
    {"id": r["id"], "title": r["title"]}
    for r in REQUIREMENTS.values()
    if query_lower in r["title"].lower()
]
print(f"Found {len(matches)} matches:")
for r in matches[:3]:
    print(f"  [{r['id']}] {r['title'][:50]}...")
print()

# Test list_requirements logic
print("--- list_requirements(level=1) ---")
level_1 = [r for r in REQUIREMENTS.values() if r["level"] == 1]
print(f"Level 1 requirements: {len(level_1)}")
for r in level_1[:5]:
    print(f"  [{r['id']}] {r['title'][:50]}...")

print("\n=== Server Logic Tests Complete ===")
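
Re-implementing the logic in a test cell means the test and the server can drift apart. One alternative, sketched below under the assumption that you are free to restructure `compliance_server.py`, is to keep the logic in plain functions that take the requirements dictionary as an argument, and make each decorated tool a thin wrapper. The function names and sample records here are hypothetical.

```python
from typing import Any, Dict

Requirements = Dict[str, Dict[str, Any]]


def get_requirement_logic(reqs: Requirements, req_id: str,
                          include_children: bool = False) -> Dict[str, Any]:
    """Plain function: no MCP decorator, so it can be called directly in tests."""
    req = reqs.get(req_id)
    if not req:
        return {"error": f"Requirement {req_id} not found"}
    result: Dict[str, Any] = {"requirement": req}
    if include_children:
        result["children"] = [r for r in reqs.values() if r.get("parentId") == req_id]
    return result


def search_logic(reqs: Requirements, query: str, limit: int = 10) -> Dict[str, Any]:
    """Case-insensitive substring search over titles."""
    needle = query.lower()
    matches = [{"id": r["id"], "title": r["title"]}
               for r in reqs.values() if needle in r["title"].lower()]
    return {"query": query, "count": len(matches), "results": matches[:limit]}


# In the server file, the tool would then only delegate, for example:
#   @mcp.tool()
#   def get_requirement(id: str, include_children: bool = False) -> dict:
#       return get_requirement_logic(REQUIREMENTS, id, include_children)

# Made-up records for a quick check
SAMPLE: Requirements = {
    "1": {"id": "1", "level": 1, "parentId": None, "title": "Example network section"},
    "1.1": {"id": "1.1", "level": 2, "parentId": "1", "title": "Example network process"},
}
found = get_requirement_logic(SAMPLE, "1", include_children=True)
missing = get_requirement_logic(SAMPLE, "9.9")
hits = search_logic(SAMPLE, "NETWORK", limit=1)
print(found["children"], missing, hits["count"])
```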

---

## Part 9: Security Considerations

Compliance data is sensitive. Follow these guidelines when deploying MCP servers:

### Deploy Inside Your Firewall
- Run MCP servers on internal networks only
- Never expose compliance data to public endpoints
- Use VPN or internal-only access for remote teams

### Least Privilege Access
- Only expose read operations (no writes unless required)
- Don't include admin or configuration tools
- Limit to the minimum data required for the use case

### Data Handling
- Never log full requirement content in production
- Be careful with PII in audit evidence
- Follow your organization's data classification policies
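
One way to follow the logging guideline is to record only metadata about each tool call (which tool, which arguments, whether it succeeded) and leave out the returned content. The sketch below uses the standard `logging` module; the logger name and the `audit_record` / `log_tool_call` helpers are assumptions for illustration, not part of FastMCP.

```python
import json
import logging
from typing import Any, Dict

logger = logging.getLogger("compliance_mcp")  # hypothetical logger name


def audit_record(tool: str, arguments: Dict[str, Any],
                 result: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize a tool call without copying requirement text into the log."""
    return {
        "tool": tool,
        "arguments": arguments,
        "ok": "error" not in result,
        "result_count": result.get("count"),  # None for tools that return no count
    }


def log_tool_call(tool: str, arguments: Dict[str, Any],
                  result: Dict[str, Any]) -> Dict[str, Any]:
    record = audit_record(tool, arguments, result)
    logger.info(json.dumps(record))
    return record


# Made-up result in the shape returned by search_requirements
example_result = {
    "query": "encryption",
    "count": 1,
    "results": [{"id": "3.5", "title": "Sensitive example title"}],
}
record = log_tool_call("search_requirements", {"query": "encryption"}, example_result)
print(record)
```

Whether arguments such as search queries are safe to log depends on your data classification policy; drop them from the record if they are not.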

### Security is Evolving

> "Security's still the Achilles heel of MCPs and LLMs"
> — [The Pragmatic Engineer](https://newsletter.pragmaticengineer.com/p/mcp-deepdive)

Best practices are still emerging. Stay updated:
- [MCP Security Guidelines](https://modelcontextprotocol.io/docs/concepts/security)
- Your organization's AI/ML security policies

## Part 10: Testing & Debugging

### MCP Inspector

The official debugging tool for MCP servers:

```bash
npx @modelcontextprotocol/inspector python compliance_server.py
```

This opens a web UI where you can:
- See all available tools
- Test tool calls interactively
- View request/response payloads

### Claude Desktop Integration

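To use your server with Claude Desktop, add the following to `claude_desktop_config.json`. On macOS the file is at `~/Library/Application Support/Claude/claude_desktop_config.json`; on Windows it is at `%APPDATA%\Claude\claude_desktop_config.json`: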

```json
{
  "mcpServers": {
    "pci-dss": {
      "command": "python",
      "args": ["/path/to/compliance_server.py"]
    }
  }
}
```

Then restart Claude Desktop. Your tools will appear in Claude's tool list.

### Testing Checklist

Before deploying:
- [ ] Server starts without errors
- [ ] All tools return valid JSON
- [ ] Error cases return helpful messages
- [ ] Performance is acceptable (<1s for lookups)
- [ ] Works in MCP Inspector
- [ ] Works in Claude Desktop
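
Two of the checklist items (valid JSON and lookup time) can be checked with a short script. This is a minimal sketch: `check_tool` is a hypothetical helper, and `fake_lookup` stands in for a real tool function so that the example runs on its own.

```python
import json
import time
from typing import Any, Callable, Dict


def check_tool(fn: Callable[..., Any], *args: Any,
               max_seconds: float = 1.0, **kwargs: Any) -> Dict[str, Any]:
    """Call a tool function, time it, and confirm the result serializes to JSON."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    try:
        json.dumps(result)
        serializable = True
    except (TypeError, ValueError):
        serializable = False
    return {
        "serializable": serializable,
        "fast_enough": elapsed < max_seconds,
        "seconds": elapsed,
    }


def fake_lookup(req_id: str) -> Dict[str, Any]:
    """Stand-in for a real tool; returns a made-up record."""
    return {"requirement": {"id": req_id, "title": "Example title"}}


report = check_tool(fake_lookup, "1.2.3")
bad_report = check_tool(lambda: {"value": {1, 2, 3}})  # sets are not JSON-serializable
print(report["serializable"], report["fast_enough"], bad_report["serializable"])
```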

---

## Summary

You've learned to build MCP servers for compliance documentation using Python:

1. **Data Extraction** - Excel parsing with openpyxl (more reliable than PDFs)
2. **Indexing** - O(1) dictionary lookups for fast retrieval
3. **MCP Tool Patterns** - Schema → Execute → Return
4. **Full-Text Search** - TF-IDF scoring for keyword queries
5. **FastMCP Server** - A working Python MCP server with three tools
6. **Security** - Internal deployment, least privilege access
7. **Testing** - MCP Inspector and Claude Desktop integration

### What's Next?

Integrate your compliance MCP with other systems:
- **Jira** - Link requirements to implementation tickets
- **GitHub** - Connect to evidence repositories
- **Confluence** - Pull policy documentation
- **Datadog/Sentry** - Correlate with monitoring alerts

### Resources

- [FastMCP Documentation](https://github.com/jlowin/fastmcp)
- [MCP Specification](https://modelcontextprotocol.io/specification)
- [MCP Inspector](https://github.com/modelcontextprotocol/inspector)
- [The Pragmatic Engineer - MCP Deep Dive](https://newsletter.pragmaticengineer.com/p/mcp-deepdive)
- [PCI-DSS v4.0.1 Documentation](https://www.pcisecuritystandards.org/document_library/)