Questions about the duration of function calls limited by the nginx gateway #5429
Comments
@style95 Hi, could you give me some suggestions?
I suppose that is not related to the duration limit of nginx. Regarding the contribution, you may find this useful.
Short-running functions don't produce this error; once my functions run for a minute and a half or more, the problem occurs, so I'm very confused.
This is my action code:

import pandas as pd
from torchvision import datasets, transforms
import numpy as np
import os
from minio import Minio
from minio.error import S3Error
import io

def main(args):
    # Transforms
    transform_train = transforms.Compose([transforms.RandomHorizontalFlip(p=0.5),
                                          transforms.ToTensor(),
                                          transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])
    transform_test = transforms.Compose([transforms.ToTensor(),
                                         transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])
    # Dataset path
    data_path = "/dataset/CIFAR10"
    # Connect to the MinIO server
    client = Minio(
        "192.168.1.23:9000",
        access_key="CuZRDbhLVIIjfDp4M4p8",
        secret_key="tFDb4aoeSqlCB0bH6BoYfNvIWXqSCJQCZ6UD0Hd9",
        secure=False
    )
    # Create the bucket
    bucket_name = "cifar"
    try:
        if not client.bucket_exists(bucket_name):
            client.make_bucket(bucket_name)
            print(f"Bucket {bucket_name} created successfully")
        else:
            print(f"Bucket {bucket_name} already exists")
    except S3Error as exc:
        print(f"Failed to create bucket {bucket_name}: {exc}")
    # Load the training and test sets
    train_dataset = datasets.CIFAR10(root=data_path, train=True, transform=transform_train, download=True)
    test_dataset = datasets.CIFAR10(root=data_path, train=False, transform=transform_test, download=True)
    # Save the training set to a CSV file
    train_data = []
    for i, (image, label) in enumerate(train_dataset):
        image_data = np.array(image).tolist()  # convert to a list
        train_data.append([image_data, label])
    train_df = pd.DataFrame(train_data, columns=['image', 'label'])
    train_csv_data = train_df.to_csv(index=False)
    # Save the data to MinIO storage
    try:
        data_bytes = train_csv_data.encode()  # convert the string to bytes
        data_stream = io.BytesIO(data_bytes)  # create a readable byte stream
        client.put_object(bucket_name, "train_dataset.csv", data_stream, len(data_bytes))
        print(f"Training data successfully saved to MinIO, bucket: {bucket_name}, object: train_dataset.csv")
    except S3Error as exc:
        print(f"Failed to save training data to MinIO: {exc}")
    # Save the test set to a CSV file
    test_data = []
    for i, (image, label) in enumerate(test_dataset):
        image_data = np.array(image).tolist()  # convert to a list
        test_data.append([image_data, label])
    test_df = pd.DataFrame(test_data, columns=['image', 'label'])
    test_csv_data = test_df.to_csv(index=False)
    # Save the data to MinIO storage
    try:
        data_bytes = test_csv_data.encode()  # convert the string to bytes
        data_stream = io.BytesIO(data_bytes)  # create a readable byte stream
        client.put_object(bucket_name, "test_dataset.csv", data_stream, len(data_bytes))
        print(f"Test data successfully saved to MinIO, bucket: {bucket_name}, object: test_dataset.csv")
    except S3Error as exc:
        print(f"Failed to save test data to MinIO: {exc}")
    result = {'res': '666'}
    return result
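As a rough back-of-the-envelope check (illustrative arithmetic, not a measurement from this issue), the CSV this action builds in memory is on the order of gigabytes:

    # Rough size estimate of the in-memory CSV, assuming ~20 characters
    # per normalized float rendered as text.
    n_images = 50_000               # CIFAR-10 training set
    values_per_image = 3 * 32 * 32  # channels x height x width
    chars_per_value = 20
    csv_bytes = n_images * values_per_image * chars_per_value
    print(f"~{csv_bytes / 1e9:.1f} GB of CSV text")  # roughly 3 GB

And the intermediate nested Python lists of floats cost several times more than the CSV text itself.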
And we could also discuss some other related issues: #5427
I figured out that this error might stem from long-running functions: my data-processing function behaves fine when processing a tenth of the CIFAR10 dataset, but terminates after a few minutes when processing the full CIFAR10 dataset. Therefore, I suspect OpenWhisk has a bug here. On the surface, the behavior is that when a function executes for more than 1 minute, the client is cut off and the function continues executing in the background; but what actually happens is that an error is reported after running for several minutes. I think this is a question worth looking into. @style95

No-error action code:

import pandas as pd
from torchvision import datasets, transforms
import numpy as np
from minio import Minio
from minio.error import S3Error
import io

def main(args):
    # Transforms
    transform_train = transforms.Compose([transforms.RandomHorizontalFlip(p=0.5),
                                          transforms.ToTensor(),
                                          transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])
    transform_test = transforms.Compose([transforms.ToTensor(),
                                         transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])
    # Dataset path
    data_path = "/dataset/CIFAR10"
    # Connect to the MinIO server
    client = Minio(
        "192.168.1.23:9000",
        access_key="CuZRDbhLVIIjfDp4M4p8",
        secret_key="tFDb4aoeSqlCB0bH6BoYfNvIWXqSCJQCZ6UD0Hd9",
        secure=False
    )
    # Create the bucket
    bucket_name = "cifar"
    try:
        if not client.bucket_exists(bucket_name):
            client.make_bucket(bucket_name)
            print(f"Bucket {bucket_name} created successfully")
        else:
            print(f"Bucket {bucket_name} already exists")
    except S3Error as exc:
        print(f"Failed to create bucket {bucket_name}: {exc}")
    # Load the training and test sets
    train_dataset = datasets.CIFAR10(root=data_path, train=True, transform=transform_train, download=False)
    test_dataset = datasets.CIFAR10(root=data_path, train=False, transform=transform_test, download=False)
    # Save the training set to a CSV file
    train_data = []
    for i, (image, label) in enumerate(train_dataset):
        image_data = np.array(image).tolist()  # convert to a list
        train_data.append([image_data, label])
        if i >= len(train_dataset) * 0.1:  # limit the number of training samples
            break
    train_df = pd.DataFrame(train_data, columns=['image', 'label'])
    train_csv_data = train_df.to_csv(index=False)
    # Save the data to MinIO storage
    try:
        data_bytes = train_csv_data.encode()  # convert the string to bytes
        data_stream = io.BytesIO(data_bytes)  # create a readable byte stream
        client.put_object(bucket_name, "train_dataset.csv", data_stream, len(data_bytes))
        print(f"Training data successfully saved to MinIO, bucket: {bucket_name}, object: train_dataset.csv")
    except S3Error as exc:
        print(f"Failed to save training data to MinIO: {exc}")
    # Save the test set to a CSV file
    test_data = []
    for i, (image, label) in enumerate(test_dataset):
        image_data = np.array(image).tolist()  # convert to a list
        test_data.append([image_data, label])
        if i >= len(test_dataset) * 0.1:  # limit the number of test samples
            break
    test_df = pd.DataFrame(test_data, columns=['image', 'label'])
    test_csv_data = test_df.to_csv(index=False)
    # Save the data to MinIO storage
    try:
        data_bytes = test_csv_data.encode()  # convert the string to bytes
        data_stream = io.BytesIO(data_bytes)  # create a readable byte stream
        client.put_object(bucket_name, "test_dataset.csv", data_stream, len(data_bytes))
        print(f"Test data successfully saved to MinIO, bucket: {bucket_name}, object: test_dataset.csv")
    except S3Error as exc:
        print(f"Failed to save test data to MinIO: {exc}")
    result = {"res": "666"}
    return result

And the error action code:

import pandas as pd
from torchvision import datasets, transforms
import numpy as np
from minio import Minio
from minio.error import S3Error
import io

def main(args):
    # Transforms
    transform_train = transforms.Compose([transforms.RandomHorizontalFlip(p=0.5),
                                          transforms.ToTensor(),
                                          transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])
    transform_test = transforms.Compose([transforms.ToTensor(),
                                         transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])
    # Dataset path
    data_path = "/dataset/CIFAR10"
    # Connect to the MinIO server
    client = Minio(
        "192.168.1.23:9000",
        access_key="CuZRDbhLVIIjfDp4M4p8",
        secret_key="tFDb4aoeSqlCB0bH6BoYfNvIWXqSCJQCZ6UD0Hd9",
        secure=False
    )
    # Create the bucket
    bucket_name = "cifar"
    try:
        if not client.bucket_exists(bucket_name):
            client.make_bucket(bucket_name)
            print(f"Bucket {bucket_name} created successfully")
        else:
            print(f"Bucket {bucket_name} already exists")
    except S3Error as exc:
        print(f"Failed to create bucket {bucket_name}: {exc}")
    # Load the training and test sets
    train_dataset = datasets.CIFAR10(root=data_path, train=True, transform=transform_train, download=False)
    test_dataset = datasets.CIFAR10(root=data_path, train=False, transform=transform_test, download=False)
    # Save the training set to a CSV file
    train_data = []
    for i, (image, label) in enumerate(train_dataset):
        image_data = np.array(image).tolist()  # convert to a list
        train_data.append([image_data, label])
    train_df = pd.DataFrame(train_data, columns=['image', 'label'])
    train_csv_data = train_df.to_csv(index=False)
    # Save the data to MinIO storage
    try:
        data_bytes = train_csv_data.encode()  # convert the string to bytes
        data_stream = io.BytesIO(data_bytes)  # create a readable byte stream
        client.put_object(bucket_name, "train_dataset.csv", data_stream, len(data_bytes))
        print(f"Training data successfully saved to MinIO, bucket: {bucket_name}, object: train_dataset.csv")
    except S3Error as exc:
        print(f"Failed to save training data to MinIO: {exc}")
    # Save the test set to a CSV file
    test_data = []
    for i, (image, label) in enumerate(test_dataset):
        image_data = np.array(image).tolist()  # convert to a list
        test_data.append([image_data, label])
    test_df = pd.DataFrame(test_data, columns=['image', 'label'])
    test_csv_data = test_df.to_csv(index=False)
    # Save the data to MinIO storage
    try:
        data_bytes = test_csv_data.encode()  # convert the string to bytes
        data_stream = io.BytesIO(data_bytes)  # create a readable byte stream
        client.put_object(bucket_name, "test_dataset.csv", data_stream, len(data_bytes))
        print(f"Test data successfully saved to MinIO, bucket: {bucket_name}, object: test_dataset.csv")
    except S3Error as exc:
        print(f"Failed to save test data to MinIO: {exc}")
    result = {"res": "666"}
    return result

This code differs from the no-error version above only in that it does not reduce the dataset.
We have a 1-minute duration limit for blocking invocations at the controllers. So if you get a developer error, I suspect there is an issue in your action. Just to be sure, what is the timeout value of your action?
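One way to sidestep both the gateway timeout and the controller's blocking limit is to invoke the action non-blocking and poll for the result. A minimal sketch against the OpenWhisk REST API — the host, action name, and credentials below are placeholders for this deployment:

    import time
    import requests

    APIHOST = "https://192.168.1.22:31001"
    AUTH = ("<auth-uuid>", "<auth-key>")  # from `wsk property get --auth`, split on ':'
    ACTION = "resnet18-CIF-1"

    # blocking=false returns an activation id immediately, so no component
    # has to hold the HTTP connection open for the whole run.
    r = requests.post(
        f"{APIHOST}/api/v1/namespaces/_/actions/{ACTION}",
        params={"blocking": "false"},
        json={},
        auth=AUTH,
        verify=False,  # default deployments use a self-signed certificate
    )
    activation_id = r.json()["activationId"]

    # Poll the activation record; it returns 404 until the run completes.
    while True:
        res = requests.get(
            f"{APIHOST}/api/v1/namespaces/_/activations/{activation_id}",
            auth=AUTH, verify=False,
        )
        if res.status_code == 200:
            print(res.json()["response"])
            break
        time.sleep(5)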
So I think I should consider changing the controller source code to remove the one-minute blocking limit.
I think that is not related to your action failure.
Here is our Invoker log:
Yes, as the log says, the running result is "not ok".
However, I'm having a hard time finding the logs for containers, since containers are constantly created and destroyed. So I opened two console windows, one to call the function and the other to watch the container log in real time. The result was very unfortunate: the container log printed the same thing as the wsk activation command. So, how should we access the container logs correctly and see what we want? I temporarily suspect that the container may not have enough memory; I have configured the container memory to be 8GB. Maybe I should give it more memory.

XXX_THE_END_OF_A_WHISK_ACTIVATION_XXX
XXX_THE_END_OF_A_WHISK_ACTIVATION_XXX
(base) root@work1:~# kubectl logs -n openwhisk wskowdev-invoker-00-5-prewarm-pythontorch
XXX_THE_END_OF_A_WHISK_ACTIVATION_XXX
XXX_THE_END_OF_A_WHISK_ACTIVATION_XXX

I will continue to follow up with some tests; thank you for your attention.
Are you able to access the underlying node? Then you would be able to collect container logs. IIRC the flag(
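If the cluster API is reachable, another option (a sketch using the official kubernetes Python client, not something prescribed in this thread) is to stream the pod's log so the output survives the container's short life:

    # pip install kubernetes
    from kubernetes import client, config, watch

    config.load_kube_config()  # or load_incluster_config() when run in-cluster
    v1 = client.CoreV1Api()

    # Follow the action pod's log instead of taking a one-shot snapshot;
    # the pod name is taken from the kubectl session above.
    w = watch.Watch()
    for line in w.stream(v1.read_namespaced_pod_log,
                         name="wskowdev-invoker-00-5-prewarm-pythontorch",
                         namespace="openwhisk"):
        print(line)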
Well noted. I managed to solve this issue by increasing the memory of a single function container to 16GB, so it was indeed caused by an OOM problem. Having said that, I still think this kind of issue is difficult to troubleshoot in OpenWhisk today. So I want to make some changes to the source code: one is to remove the 1-minute blocking restriction in the controller, and the other is to connect Kubernetes OOM events to the OpenWhisk messaging system, because it is currently difficult to diagnose OOM problems in OpenWhisk.
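For what it's worth, the peak memory could also be reduced inside the action by streaming rows to a temp file instead of materializing the whole dataset as nested lists, then a DataFrame, then one giant CSV string. A sketch reusing the names from the code above (export_split is a hypothetical helper, not part of the original action):

    import csv
    import os
    import tempfile
    import numpy as np

    def export_split(dataset, client, bucket_name, object_name):
        # Write one row at a time, so peak memory is a single sample
        # rather than the entire dataset.
        with tempfile.NamedTemporaryFile("w", newline="", suffix=".csv",
                                         delete=False) as f:
            writer = csv.writer(f)
            writer.writerow(["image", "label"])
            for image, label in dataset:
                writer.writerow([np.array(image).tolist(), label])
            path = f.name
        # fput_object streams the file from disk, avoiding another full
        # in-memory copy of the CSV.
        client.fput_object(bucket_name, object_name, path)
        os.remove(path)

Usage would be, e.g., export_split(train_dataset, client, "cifar", "train_dataset.csv").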
In my opinion, compared to a few years ago, workloads have shifted toward ML services, so the 1-minute time limit should be configurable.
The 1-minute limit was introduced to protect controllers. Regarding the OOM issue, I believe you can easily catch it with Kubernetes events.
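For reference, an OOM-killed container is visible through the Kubernetes API without any OpenWhisk changes — a sketch with the kubernetes Python client, assuming the "openwhisk" namespace from the session above:

    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # An OOM-killed container reports reason "OOMKilled" in its last
    # terminated state.
    for pod in v1.list_namespaced_pod("openwhisk").items:
        for cs in pod.status.container_statuses or []:
            term = cs.last_state.terminated
            if term and term.reason == "OOMKilled":
                print(pod.metadata.name, "OOM-killed at", term.finished_at)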
Great, I think I should get more familiar with K8s events and deploy a Prometheus + Grafana stack.
How can we overcome the call-duration limit of the nginx gateway? Or, which piece of source code imposes the limitation shown below? Maybe I need to modify the source code and republish it.
https://192.168.1.22:31001/api/23bc46b1-71f6-4ed5-8c54-816aa4f8c502/resnet18-CIF-1
<title>504 Gateway Time-out</title>

(wsk) root@work1:~/good/day0708_wsk_mL/func/resnet18/CIFAR10/dataset# curl -X POST http://192.168.1.22:31005/api/23bc46b1-71f6-4ed5-8c54-816aa4f8c502/resnet18-CIF-1
504 Gateway Time-out
nginx/1.21.1
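For context: nginx's proxy_read_timeout defaults to 60 seconds, which lines up with the one-minute cutoff seen here — though, as noted in the comments above, the controller enforces its own 1-minute blocking limit, so raising the gateway timeout alone is not enough. A hypothetical fragment for the gateway's nginx.conf (the file location and surrounding configuration depend on the deployment):

    # hypothetical nginx.conf fragment; both directives default to 60s if unset
    location /api {
        proxy_read_timeout 600s;
        proxy_send_timeout 600s;
    }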
Meanwhile, how can I contribute code to OpenWhisk?