Merged
40 changes: 40 additions & 0 deletions DESIGN.md
@@ -124,6 +124,46 @@ ack-mcp-server uses a layered architecture and follows these design principles:

Each Handler implements the MCP tools and resources for a specific domain:

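## Authentication strategy / cluster kubeconfig certificate management

The permissions required by ack-mcp-server tools fall into three groups:
- Kubernetes cluster RBAC permissions, exercised through the cluster certificate
- Alibaba Cloud service permissions, exercised through Alibaba Cloud OpenAPI and authorized by the Alibaba Cloud RAM authentication system
- Access to observability data, such as Prometheus metrics and log-system data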

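### Kubernetes cluster access policy

Cluster access is controlled by the following ack-mcp-server parameters:
```shell
KUBECONFIG_MODE = ACK_PUBLIC (default; fetch a public-network kubeconfig through ACK OpenAPI) / ACK_PRIVATE (fetch an intranet kubeconfig through ACK OpenAPI) / LOCAL (use a local kubeconfig)

KUBECONFIG_PATH = xxx (optional; only takes effect when KUBECONFIG_MODE = LOCAL; path to the local kubeconfig file)
```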

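Note: to test locally with a public-network kubeconfig, first [enable public access to the kubeconfig on the target ACK cluster](https://help.aliyun.com/zh/ack/ack-managed-and-ack-dedicated/user-guide/control-public-access-to-the-api-server-of-a-cluster).

By default, a public-network kubeconfig is fetched through Alibaba Cloud OpenAPI with a TTL of 1 hour.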
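
The way these two parameters resolve to concrete values can be sketched as below. This is a minimal illustration that assumes, as in this change's `main_server.py`, that a command-line value wins over the environment variable, which in turn wins over the defaults `ACK_PUBLIC` and `~/.kube/config`. The helper name `resolve_kubeconfig_settings` and the explicit mode validation are assumptions made for the example, not part of ack-mcp-server; the mode list includes `INCLUSTER` because the `--kubeconfig-mode` argument in this change accepts it.

```python
import os
from typing import Mapping, Optional, Tuple

# Modes accepted by the --kubeconfig-mode argument in this change.
VALID_MODES = ("ACK_PUBLIC", "ACK_PRIVATE", "INCLUSTER", "LOCAL")


def resolve_kubeconfig_settings(
    cli_mode: Optional[str] = None,
    cli_path: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, str]:
    """Hypothetical helper: CLI value first, then env var, then default."""
    env = os.environ if env is None else env
    mode = cli_mode or env.get("KUBECONFIG_MODE", "ACK_PUBLIC")
    path = cli_path or env.get("KUBECONFIG_PATH", "~/.kube/config")
    # Assumption for this sketch: reject unknown modes early.
    if mode not in VALID_MODES:
        raise ValueError(f"Unsupported KUBECONFIG_MODE: {mode}")
    return mode, path
```

Passing `env` explicitly keeps the sketch easy to exercise without touching the real process environment; the path is only meaningful when the resolved mode is `LOCAL`.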

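For production, first establish intranet connectivity to the cluster network, then set KUBECONFIG_MODE = ACK_PRIVATE so that an intranet kubeconfig is fetched through Alibaba Cloud OpenAPI. This avoids exposing the kubeconfig over the public network.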
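
Each ACK mode only works when the matching API server endpoint is enabled on the cluster. A rough sketch of that check follows, assuming a `master_url` mapping with the keys `api_server_endpoint` and `intranet_api_server_endpoint` (the keys read in this change's `kubectl_handler.py`). The function name `require_endpoint` and the sample values in the usage note are hypothetical.

```python
from typing import Mapping


def require_endpoint(cluster_id: str, master_url: Mapping[str, str], kubeconfig_mode: str) -> str:
    """Return the endpoint the mode depends on, or raise if it is not enabled."""
    if kubeconfig_mode == "ACK_PRIVATE":
        key, kind = "intranet_api_server_endpoint", "intranet"
    elif kubeconfig_mode == "ACK_PUBLIC":
        key, kind = "api_server_endpoint", "public"
    else:
        # LOCAL and INCLUSTER do not fetch a kubeconfig from ACK.
        raise ValueError(f"Mode {kubeconfig_mode} does not use an ACK endpoint")
    endpoint = master_url.get(key)
    if not endpoint:
        raise ValueError(
            f"Cluster {cluster_id} does not have {kind} endpoint access enabled"
        )
    return endpoint
```

For example, calling `require_endpoint("c-demo", {"api_server_endpoint": "https://203.0.113.10:6443"}, "ACK_PRIVATE")` raises `ValueError`, because only the public endpoint is present in that mapping.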

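### Access to Alibaba Cloud services

Access is authorized through the [Alibaba Cloud RAM authentication system](https://help.aliyun.com/zh/sdk/developer-reference/v2-manage-python-access-credentials).

For production, grant permissions through a RAM sub-account with a scoped authorization policy, following the least-privilege security best practice.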

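### Access to observability data

ack-mcp-server first queries the Alibaba Cloud Prometheus service associated with the ACK cluster. If no such service exists, it looks up the address of the observability data from env parameters.
The configured address can be any [Prometheus Read HTTP API](https://prometheus.io/docs/prometheus/latest/querying/api/).

When the cluster has no corresponding Alibaba Cloud Prometheus instance, ack-mcp-server looks for {prometheus_http_api_url} in the following priority order and uses it to access observability data.
```shell
# env parameters, highest priority first:
PROMETHEUS_HTTP_API_{cluster_id}={prometheus_http_api_url}
PROMETHEUS_HTTP_API={prometheus_http_api_url}
```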
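
The env lookup order above can be sketched in a few lines. This is an illustration only: it assumes the cluster ID is substituted verbatim into the variable name (the source does not say whether any case normalization applies), and the function name `resolve_prometheus_url` is hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_prometheus_url(cluster_id: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first configured Prometheus HTTP API URL, or None."""
    env = os.environ if env is None else env
    # The cluster-specific variable takes priority over the global one.
    for name in (f"PROMETHEUS_HTTP_API_{cluster_id}", "PROMETHEUS_HTTP_API"):
        url = env.get(name)
        if url:
            return url
    return None
```

A return value of `None` means neither variable is set, in which case the caller would have no fallback endpoint for that cluster.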

## Package naming and version management

### Project naming
8 changes: 4 additions & 4 deletions README.md
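@@ -131,7 +131,7 @@ https://github.com/user-attachments/assets/9e48cac3-0af1-424c-9f16-3862d047cc68
### 💻 2.2 (Optional) Create an ACK cluster

- An ACK cluster already created in your Alibaba Cloud account
- A kubeconfig with public access enabled on the ACK cluster, or a kubeconfig reachable from the local network of ack-mcp-server (placed in .kube/config)
- Once the cluster network is reachable, configure the matching Kubernetes cluster access credentials; see the [configuration options](./DESIGN.md#kubernetes集群访问策略). In production, establish intranet connectivity to the cluster network and set KUBECONFIG_MODE = ACK_PRIVATE to access the cluster over the intranet.

### 📍 2.3 Deploy and run ack-mcp-server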

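@@ -196,8 +196,8 @@ make build-binary
- Python 3.12+
- An Alibaba Cloud account with an AccessKey and AccessSecretKey, plus the required permission set
- An ACK cluster already created in your Alibaba Cloud account
- A kubeconfig with public access enabled on the ACK cluster, or a kubeconfig reachable from the local network of ack-mcp-server (placed in .kube/config)

- A kubeconfig through which the ACK cluster is reachable from the local network of ack-mcp-server; see the [configuration options](./DESIGN.md#kubernetes集群访问策略).
- Note: in production, establish intranet connectivity to the cluster network and set KUBECONFIG_MODE = ACK_PRIVATE to access the cluster over the intranet. To test locally with a public-network kubeconfig, first [enable public access to the kubeconfig on the target ACK cluster](https://help.aliyun.com/zh/ack/ack-managed-and-ack-dedicated/user-guide/control-public-access-to-the-api-server-of-a-cluster).

### 📋 3.2 Set up the development environment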

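@@ -374,7 +374,7 @@ cd benchmarks
## 7. FAQ

- **AK not configured**: check the ACCESS_KEY_ID/ACCESS_KEY_SECRET environment variables
- **Public kubeconfig not enabled on the ACK cluster**: ack-mcp-server cannot run the kubectl tool; the ACK cluster needs a kubeconfig with public access enabled, or a kubeconfig reachable from the local network of ack-mcp-server (placed in .kube/config)
- **ACK cluster network unreachable**: when ack-mcp-server uses KUBECONFIG_MODE = ACK_PUBLIC to reach the cluster over the public network, public access to the kubeconfig must be enabled on the ACK cluster. In production, establish intranet connectivity to the cluster network and use ACK_PRIVATE for intranet access, in line with production security best practice.

## 8. Security
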
111 changes: 97 additions & 14 deletions src/kubectl_handler.py
@@ -9,7 +9,6 @@
from ack_cluster_handler import parse_master_url
from models import KubectlOutput


class KubectlContextManager(TTLCache):
    """Kubeconfig file manager backed by a TTL + LRU cache."""

@@ -23,6 +22,7 @@ def __init__(self, ttl_minutes: int = 60):
        super().__init__(maxsize=50, ttl=ttl_minutes * 60)  # TTL in seconds; note that no 5-minute early-expiry margin is applied here

        self._cs_client = None  # CS client instance
        self.do_not_cleanup_file = None  # Path of the local kubeconfig file, which must not be cleaned up

        # Store kubeconfig files under the .kube directory
        self._kube_dir = os.path.expanduser("~/.kube")
@@ -78,12 +78,14 @@ def cleanup_all_mcp_files(self):
        except Exception:
            pass

    def _get_or_create_kubeconfig_file(self, cluster_id: str) -> str:
    def _get_or_create_kubeconfig_file(self, cluster_id: str, kubeconfig_mode: str, kubeconfig_path: str) -> str:
        """Get or create the kubeconfig file for a cluster.

        Args:
            cluster_id: Cluster ID

            kubeconfig_mode: How to obtain the kubeconfig: "ACK_PUBLIC", "ACK_PRIVATE", "INCLUSTER", or "LOCAL"
            kubeconfig_path: Path to the local kubeconfig file (only used in LOCAL mode)

        Returns:
            Path to the kubeconfig file
        """
@@ -92,8 +94,31 @@ def _get_or_create_kubeconfig_file(self, cluster_id: str) -> str:
            logger.debug(f"Found cached kubeconfig for cluster {cluster_id}")
            return self[cluster_id]

        if kubeconfig_mode == "INCLUSTER":
            # Use the in-cluster configuration
            logger.debug(f"Using in-cluster kubeconfig for cluster {cluster_id}")
            kubeconfig_path = self._construct_incluster_kubeconfig()
            self[cluster_id] = kubeconfig_path
            return kubeconfig_path

        if kubeconfig_mode == "LOCAL":
            # Use a local kubeconfig file
            # Check that the path is not empty
            if not kubeconfig_path:
                raise ValueError("Local kubeconfig path is not set")
            kubeconfig_path = os.path.abspath(os.path.expanduser(kubeconfig_path))
            if not os.path.exists(kubeconfig_path):
                raise ValueError(f"File {kubeconfig_path} does not exist")
            self.do_not_cleanup_file = kubeconfig_path
            logger.debug(f"Using local kubeconfig for cluster {cluster_id} from {kubeconfig_path}")
            self[cluster_id] = kubeconfig_path
            return kubeconfig_path

        # Fetch the kubeconfig from ACK
        private_ip_address = kubeconfig_mode == "ACK_PRIVATE"

        # Create a new kubeconfig file
        kubeconfig_content = self._get_kubeconfig_from_ack(cluster_id, int(self.ttl / 60))  # convert to minutes
        kubeconfig_content = self._get_kubeconfig_from_ack(cluster_id, private_ip_address, int(self.ttl / 60))  # convert to minutes
        if not kubeconfig_content:
            raise ValueError(f"Failed to get kubeconfig for cluster {cluster_id}")

@@ -115,6 +140,9 @@ def popitem(self):
        key, path = super().popitem()
        # Delete the kubeconfig file
        if path and os.path.exists(path):
            if self.do_not_cleanup_file and os.path.samefile(path, self.do_not_cleanup_file):
                logger.debug(f"Skipped removal of protected kubeconfig file: {path}")
                return key, path  # keep the (key, value) contract of popitem()
            try:
                os.remove(path)
                logger.debug(f"Removed cached kubeconfig file: {path}")
@@ -128,6 +156,9 @@ def cleanup(self):
        removed_count = 0
        for key, path in list(self.items()):
            if path and os.path.exists(path):
                # Skip the protected local kubeconfig (do_not_cleanup_file); remove everything else
                if self.do_not_cleanup_file and os.path.samefile(path, self.do_not_cleanup_file):
                    continue
                try:
                    os.remove(path)
                    removed_count += 1
@@ -150,11 +181,12 @@ def _get_cs_client(self):
            raise ValueError("CS client not set")
        return self._cs_client

    def _get_kubeconfig_from_ack(self, cluster_id: str, ttl_minutes: int = 60) -> Optional[str]:
    def _get_kubeconfig_from_ack(self, cluster_id: str, private_ip_address: bool = False, ttl_minutes: int = 60) -> Optional[str]:
        """Fetch the kubeconfig through the ACK API.

        Args:
            cluster_id: Cluster ID
            private_ip_address: Whether to fetch the intranet connection config
            ttl_minutes: Kubeconfig validity in minutes (default: 60)
        """
        try:
@@ -172,13 +204,22 @@ def _get_kubeconfig_from_ack(self, cluster_id: str, ttl_minutes: int = 60) -> Op
            # Check whether the cluster has a public API server endpoint
            master_url_str = getattr(cluster_info, 'master_url', '')
            master_url = parse_master_url(master_url_str)
            if not master_url["api_server_endpoint"]:
                raise ValueError(f"Cluster {cluster_id} does not have public endpoint access, "
                                 f"Please enable public endpoint access setting first.")
            if private_ip_address:
                if not master_url["intranet_api_server_endpoint"]:
                    raise ValueError(
                        f"Cluster {cluster_id} does not have intranet endpoint access, "
                        f"Please enable intranet endpoint access setting first."
                    )
            else:
                if not master_url["api_server_endpoint"]:
                    raise ValueError(
                        f"Cluster {cluster_id} does not have public endpoint access, "
                        f"Please enable public endpoint access setting first."
                    )

            # Call the DescribeClusterUserKubeconfig API
            request = cs_models.DescribeClusterUserKubeconfigRequest(
                private_ip_address=False,  # fetch the public-network connection config
                private_ip_address=private_ip_address,
                temporary_duration_minutes=ttl_minutes,  # use the TTL passed in
            )

@@ -195,16 +236,52 @@ def _get_kubeconfig_from_ack(self, cluster_id: str, ttl_minutes: int = 60) -> Op
            logger.error(f"Failed to fetch kubeconfig for cluster {cluster_id}: {e}")
            raise e

    def get_kubeconfig_path(self, cluster_id: str) -> str:
    def _construct_incluster_kubeconfig(self) -> str:
        """Write the in-cluster kubeconfig file and return its path.

        Returns:
            Path to the kubeconfig file
        """
        tokenFile = "/var/run/secrets/kubernetes.io/serviceaccount/token"
        rootCAFile = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
        host, port = os.getenv("KUBERNETES_SERVICE_HOST"), os.getenv("KUBERNETES_SERVICE_PORT")
        if not host or not port:
            raise ValueError("unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined")

        kubeconfig_path = os.path.join(self._kube_dir, "config.incluster")
        with open(kubeconfig_path, 'w') as f:
            f.write(f"""apiVersion: v1
clusters:
- cluster:
    certificate-authority: {rootCAFile}
    server: https://{host}:{port}
  name: in-cluster
contexts:
- context:
    cluster: in-cluster
    user: in-cluster
  name: in-cluster
current-context: in-cluster
kind: Config
users:
- name: in-cluster
  user:
    tokenFile: {tokenFile}
""")
        return kubeconfig_path

    def get_kubeconfig_path(self, cluster_id: str, kubeconfig_mode: str, kubeconfig_path: str) -> str:
        """Get the path of the kubeconfig file for a cluster.

        Args:
            cluster_id: Cluster ID

            kubeconfig_mode: How to obtain the kubeconfig: "ACK_PUBLIC", "ACK_PRIVATE", "INCLUSTER", or "LOCAL"
            kubeconfig_path: Path to the local kubeconfig file (only used in LOCAL mode)

        Returns:
            Path to the kubeconfig file
        """
        return self._get_or_create_kubeconfig_file(cluster_id)
        return self._get_or_create_kubeconfig_file(cluster_id, kubeconfig_mode, kubeconfig_path)


# Global context manager instance
@@ -224,7 +301,13 @@ def get_context_manager(ttl_minutes: int = 60) -> KubectlContextManager:


class KubectlHandler:
    """Handler for running kubectl commands via a FastMCP tool."""
    """
    Handler for running kubectl commands via a FastMCP tool.

    Design:
        kubeconfig management policy: https://github.com/aliyun/alibabacloud-ack-mcp-server/issues/1

    """

    def __init__(self, server: FastMCP, settings: Optional[Dict[str, Any]] = None):
        """Initialize the kubectl handler.
@@ -538,7 +621,7 @@ async def ack_kubectl(

                # Get the kubeconfig file path
                context_manager = get_context_manager()
                kubeconfig_path = context_manager.get_kubeconfig_path(cluster_id)
                kubeconfig_path = context_manager.get_kubeconfig_path(cluster_id, self.settings.get("kubeconfig_mode"), self.settings.get("kubeconfig_path"))

                # Check whether this is a streaming command
                is_streaming, stream_type = self.is_streaming_command(command)
15 changes: 15 additions & 0 deletions src/main_server.py
@@ -222,6 +222,17 @@ def main():
        type=str,
        help="AlibabaCloud Access Key Secret (default: from env ACCESS_KEY_SECRET)"
    )
    parser.add_argument(
        "--kubeconfig-mode",
        type=str,
        choices=["ACK_PUBLIC", "ACK_PRIVATE", "INCLUSTER", "LOCAL"],
        help="Mode to obtain kubeconfig for ACK clusters (default: from env KUBECONFIG_MODE)"
    )
    parser.add_argument(
        "--kubeconfig-path",
        type=str,
        help="Path to local kubeconfig file when KUBECONFIG_MODE is LOCAL (default: from env KUBECONFIG_PATH)"
    )
    parser.add_argument(
        "--audit-config",
        "-c",
@@ -273,6 +284,10 @@ def main():
        # Compatibility settings
        "access_secret_key": args.access_key_secret or os.getenv("ACCESS_KEY_SECRET"),  # kept for the old field name
        "original_settings": Configs(vars(args)),

        # ACK kubectl settings
        "kubeconfig_mode": args.kubeconfig_mode or os.getenv("KUBECONFIG_MODE", "ACK_PUBLIC"),
        "kubeconfig_path": args.kubeconfig_path or os.getenv("KUBECONFIG_PATH", "~/.kube/config"),
    }

    # Validate the required configuration