Training problem when using HMS #4

Open
TargosLi opened this issue Sep 28, 2021 · 1 comment

@TargosLi

I am using TensorFlow 2.4.1.

Training config:
RESULT_DIR: "result_wholeslide_1x"
MODEL_PATH: "${RESULT_DIR}/model.h5"
LOAD_MODEL_BEFORE_TRAIN: False
CONFIG_RECORD_PATH: "${RESULT_DIR}/config.yaml"

USE_MIXED_PRECISION: True
USE_HMS: True
USE_MIL: False

TRAIN_CSV_PATH: "/home/de1119151/PycharmProjects/whole-slide-cnn-main/slide_data_targos/Train_SKIN_TCGA.csv"
VAL_CSV_PATH: "/home/de1119151/PycharmProjects/whole-slide-cnn-main/slide_data_targos/Val_SKIN_TCGA.csv"
TEST_CSV_PATH: "/home/de1119151/PycharmProjects/whole-slide-cnn-main/slide_data_targos/Test_SKIN_TCGA.csv"
SLIDE_DIR: "/mnt/data/RawImages/HE_SKIN_WSI_TCGA/"
SLIDE_FILE_EXTENSION: ".svs"
SLIDE_READER: "openslide"
RESIZE_RATIO: 0.05 # 1x magnification for 20x WSIs
INPUT_SIZE: [21500, 21500, 3]

MODEL: "fixup_resnet50"
NUM_CLASSES: 3
BATCH_SIZE: 1
EPOCHS: 200
NUM_UPDATES_PER_EPOCH: 100
INIT_LEARNING_RATE: 0.00002
POOL_USE: "gmp"
REDUCE_LR_FACTOR: 0.1
REDUCE_LR_PATIENCE: 24
TIME_RECORD_PATH: "${RESULT_DIR}/time_record.csv"
TEST_TIME_RECORD_PATH: "${RESULT_DIR}/test_time_record.csv"

MIL_PATCH_SIZE: NULL
MIL_INFER_BATCH_SIZE: NULL
MIL_USE_EM: False
MIL_K: NULL
MIL_SKIP_WHITE: NULL

TEST_RESULT_PATH: "${RESULT_DIR}/test_result.json"
ENABLE_VIZ: False
VIZ_SIZE: [2150, 2150]
VIZ_FOLDER: "${RESULT_DIR}/viz"

DEBUG_PATH: NULL
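
For context, a small check like the following (my own snippet, not part of the repo; it assumes openslide-python is installed and that SLIDE_DIR contains only .svs slides) can verify that INPUT_SIZE is large enough for the slides after RESIZE_RATIO downscaling:

```python
# Rough sanity check (not part of whole-slide-cnn): make sure every slide,
# once downscaled by RESIZE_RATIO, fits inside INPUT_SIZE.
import os
import openslide

SLIDE_DIR = "/mnt/data/RawImages/HE_SKIN_WSI_TCGA/"
RESIZE_RATIO = 0.05
INPUT_SIZE = (21500, 21500)  # (height, width) of the model input

for name in sorted(os.listdir(SLIDE_DIR)):
    if not name.endswith(".svs"):
        continue
    slide = openslide.OpenSlide(os.path.join(SLIDE_DIR, name))
    w, h = slide.dimensions  # level-0 (width, height) in pixels
    slide.close()
    resized_h, resized_w = int(h * RESIZE_RATIO), int(w * RESIZE_RATIO)
    ok = resized_h <= INPUT_SIZE[0] and resized_w <= INPUT_SIZE[1]
    print(f"{name}: resized to ({resized_h}, {resized_w}) -> "
          f"{'fits' if ok else 'exceeds INPUT_SIZE'}")
```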

I tried this config and got the following traceback:
Traceback (most recent call last):
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/train.py", line 128, in <module>
    model = build_model(
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/model.py", line 129, in build_model
    conv_block = get_conv_block(input_shape)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/model.py", line 85, in get_conv_block
    conv_block = model_fn(
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/model.py", line 26, in <lambda>
    "fixup_resnet50": lambda *args, **kwargs: ResNet50(
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/resnet.py", line 557, in ResNet50
    return ResNet(stack_fn, False, True, 'resnet50',
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/resnet.py", line 436, in ResNet
    x = _ZeroPadding2D(padding=((3, 3), (3, 3)), name='conv1_pad')(x)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/huge_layer_wrapper.py", line 206, in call
    res = super(HugeLayerWrapper, self).call(inputs, **kwargs)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/venv/lib/python3.8/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1012, in call
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/huge_layer_wrapper.py", line 267, in call
    output_tensor_list = self._do_padding(inputs, **kwargs)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/huge_layer_wrapper.py", line 517, in _do_padding
    self.layer.compute_output_shape(self._get_shape(inputs)),
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/venv/lib/python3.8/site-packages/tensorflow/python/keras/layers/convolutional.py", line 2868, in compute_output_shape
    if input_shape[1] is not None:
IndexError: list index out of range

Process finished with exit code 1
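
The last frame suggests that `ZeroPadding2D.compute_output_shape` receives a shape with fewer entries than a full (batch, height, width, channels) shape, so indexing `input_shape[1]` fails. A standalone sketch of that failure mode (my reconstruction for illustration, not code from the repo):

```python
import tensorflow as tf

pad = tf.keras.layers.ZeroPadding2D(padding=((3, 3), (3, 3)))

# With a full NHWC shape the call works:
print(pad.compute_output_shape((1, 21500, 21500, 3)))  # (1, 21506, 21506, 3)

# With a truncated shape it raises the same error as above, because the Keras
# implementation indexes input_shape[1] and input_shape[2] directly:
print(pad.compute_output_shape((1,)))  # IndexError: list index out of range
```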

@TargosLi (Author)

/home/de1119151/PycharmProjects/whole-slide-cnn-main/venv/bin/python /home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/train.py
2021-09-30 08:33:59.142871: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
No protocol specified

Config

{
  "RESULT_DIR": "result_wholeslide_1x",
  "MODEL_PATH": "result_wholeslide_1x/model.h5",
  "LOAD_MODEL_BEFORE_TRAIN": false,
  "CONFIG_RECORD_PATH": "result_wholeslide_1x/config.yaml",
  "USE_MIXED_PRECISION": true,
  "USE_HMS": true,
  "USE_MIL": false,
  "TRAIN_CSV_PATH": "/home/de1119151/PycharmProjects/whole-slide-cnn-main/slide_data_targos/Train_SKIN_TCGA.csv",
  "VAL_CSV_PATH": "/home/de1119151/PycharmProjects/whole-slide-cnn-main/slide_data_targos/Val_SKIN_TCGA.csv",
  "TEST_CSV_PATH": "/home/de1119151/PycharmProjects/whole-slide-cnn-main/slide_data_targos/Test_SKIN_TCGA.csv",
  "SLIDE_DIR": "/mnt/data/RawImages/HE_SKIN_WSI_TCGA/",
  "SLIDE_FILE_EXTENSION": ".svs",
  "SLIDE_READER": "openslide",
  "RESIZE_RATIO": 0.05,
  "INPUT_SIZE": [
    21500,
    21500,
    3
  ],
  "MODEL": "fixup_resnet50",
  "NUM_CLASSES": 3,
  "BATCH_SIZE": 1,
  "EPOCHS": 200,
  "NUM_UPDATES_PER_EPOCH": 100,
  "INIT_LEARNING_RATE": 2e-05,
  "POOL_USE": "gmp",
  "REDUCE_LR_FACTOR": 0.1,
  "REDUCE_LR_PATIENCE": 24,
  "TIME_RECORD_PATH": "result_wholeslide_1x/time_record.csv",
  "TEST_TIME_RECORD_PATH": "result_wholeslide_1x/test_time_record.csv",
  "TEST_RESULT_PATH": "result_wholeslide_1x/test_result.json",
  "ENABLE_VIZ": false,
  "VIZ_SIZE": [
    2150,
    2150
  ],
  "VIZ_FOLDER": "result_wholeslide_1x/viz",
  "DEBUG_PATH": null
}
2021-09-30 08:34:00.865431: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-30 08:34:00.867343: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-30 08:34:00.868142: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-09-30 08:34:00.890023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-09-30 08:34:00.890048: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-09-30 08:34:00.892048: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-09-30 08:34:00.892087: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-09-30 08:34:00.892943: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-09-30 08:34:00.893123: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-09-30 08:34:00.895095: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-09-30 08:34:00.895550: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-09-30 08:34:00.895660: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-09-30 08:34:00.898047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-09-30 08:34:00.898070: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-09-30 08:34:01.310261: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-09-30 08:34:01.310293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-09-30 08:34:01.310299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-09-30 08:34:01.313915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 128748 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6)
WARNING:tensorflow:From /home/de1119151/PycharmProjects/whole-slide-cnn-main/tensorflow_huge_model_support/tf_keras.py:29: The name tf.keras.backend.set_session is deprecated. Please use tf.compat.v1.keras.backend.set_session instead.

2021-09-30 08:34:01.324147: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-30 08:34:01.325364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-09-30 08:34:01.325427: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.326527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:21:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-09-30 08:34:01.326587: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.327667: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 2 with properties:
pciBusID: 0000:4b:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-09-30 08:34:01.327722: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.328798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 3 with properties:
pciBusID: 0000:4c:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-09-30 08:34:01.328815: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-09-30 08:34:01.328875: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-09-30 08:34:01.328889: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-09-30 08:34:01.328901: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-09-30 08:34:01.328913: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-09-30 08:34:01.328926: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-09-30 08:34:01.328938: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-09-30 08:34:01.328950: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-09-30 08:34:01.330106: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.331218: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.332327: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

Initializing datasets

Training dataset contains 660 slides.
Validation dataset contains 252 slides.

Initializing the model

2021-09-30 08:34:01.334535: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.335647: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.336760: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.337833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1, 2, 3
2021-09-30 08:34:01.337875: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-30 08:34:01.339004: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-30 08:34:01.339059: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.340133: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-30 08:34:01.340185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.341256: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-30 08:34:01.341309: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Traceback (most recent call last):
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/train.py", line 128, in <module>
    model = build_model(
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/model.py", line 129, in build_model
    conv_block = get_conv_block(input_shape)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/model.py", line 85, in get_conv_block
    conv_block = model_fn(
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/model.py", line 26, in <lambda>
    "fixup_resnet50": lambda *args, **kwargs: ResNet50(
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/resnet.py", line 557, in ResNet50
    return ResNet(stack_fn, False, True, 'resnet50',
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/resnet.py", line 436, in ResNet
    x = _ZeroPadding2D(padding=((3, 3), (3, 3)), name='conv1_pad')(x)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/huge_layer_wrapper.py", line 206, in call
    res = super(HugeLayerWrapper, self).call(inputs, **kwargs)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/venv/lib/python3.8/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1012, in call
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/huge_layer_wrapper.py", line 267, in call
    output_tensor_list = self._do_padding(inputs, **kwargs)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/huge_layer_wrapper.py", line 517, in _do_padding
    self.layer.compute_output_shape(self._get_shape(inputs)),
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/venv/lib/python3.8/site-packages/tensorflow/python/keras/layers/convolutional.py", line 2868, in compute_output_shape
    if input_shape[1] is not None:
IndexError: list index out of range

Process finished with exit code 1
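
To see exactly what shape the wrapper forwards in `self.layer.compute_output_shape(self._get_shape(inputs))`, a temporary shim like the following (a hypothetical debugging aid of mine, not part of the repo) could be placed near the top of train.py, before `build_model` is called:

```python
import tensorflow as tf

# Temporary debugging shim: log the shape every ZeroPadding2D receives in
# compute_output_shape, then delegate to the original implementation.
_orig_compute_output_shape = tf.keras.layers.ZeroPadding2D.compute_output_shape

def _logged_compute_output_shape(self, input_shape):
    print("ZeroPadding2D.compute_output_shape got:", input_shape)
    return _orig_compute_output_shape(self, input_shape)

tf.keras.layers.ZeroPadding2D.compute_output_shape = _logged_compute_output_shape
```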
