Training problem when using HMS #4

Open
TargosLi opened this issue Sep 28, 2021 · 1 comment

@TargosLi

I am using TensorFlow 2.4.1.

Training config:
RESULT_DIR: "result_wholeslide_1x"
MODEL_PATH: "${RESULT_DIR}/model.h5"
LOAD_MODEL_BEFORE_TRAIN: False
CONFIG_RECORD_PATH: "${RESULT_DIR}/config.yaml"

USE_MIXED_PRECISION: True
USE_HMS: True
USE_MIL: False

TRAIN_CSV_PATH: "/home/de1119151/PycharmProjects/whole-slide-cnn-main/slide_data_targos/Train_SKIN_TCGA.csv"
VAL_CSV_PATH: "/home/de1119151/PycharmProjects/whole-slide-cnn-main/slide_data_targos/Val_SKIN_TCGA.csv"
TEST_CSV_PATH: "/home/de1119151/PycharmProjects/whole-slide-cnn-main/slide_data_targos/Test_SKIN_TCGA.csv"
SLIDE_DIR: "/mnt/data/RawImages/HE_SKIN_WSI_TCGA/"
SLIDE_FILE_EXTENSION: ".svs"
SLIDE_READER: "openslide"
RESIZE_RATIO: 0.05 # 1x magnification for 20x WSIs
INPUT_SIZE: [21500, 21500, 3]

MODEL: "fixup_resnet50"
NUM_CLASSES: 3
BATCH_SIZE: 1
EPOCHS: 200
NUM_UPDATES_PER_EPOCH: 100
INIT_LEARNING_RATE: 0.00002
POOL_USE: "gmp"
REDUCE_LR_FACTOR: 0.1
REDUCE_LR_PATIENCE: 24
TIME_RECORD_PATH: "${RESULT_DIR}/time_record.csv"
TEST_TIME_RECORD_PATH: "${RESULT_DIR}/test_time_record.csv"

MIL_PATCH_SIZE: NULL
MIL_INFER_BATCH_SIZE: NULL
MIL_USE_EM: False
MIL_K: NULL
MIL_SKIP_WHITE: NULL

TEST_RESULT_PATH: "${RESULT_DIR}/test_result.json"
ENABLE_VIZ: False
VIZ_SIZE: [2150, 2150]
VIZ_FOLDER: "${RESULT_DIR}/viz"

DEBUG_PATH: NULL
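
For context, a small check like the following (my own snippet, not part of the repo; it assumes openslide-python is installed and that SLIDE_DIR contains only .svs slides) can verify that INPUT_SIZE is large enough for the slides after RESIZE_RATIO downscaling:

```python
# Rough sanity check (not part of whole-slide-cnn): make sure every slide,
# once downscaled by RESIZE_RATIO, fits inside INPUT_SIZE.
import os
import openslide

SLIDE_DIR = "/mnt/data/RawImages/HE_SKIN_WSI_TCGA/"
RESIZE_RATIO = 0.05
INPUT_SIZE = (21500, 21500)  # (height, width) of the model input

for name in sorted(os.listdir(SLIDE_DIR)):
    if not name.endswith(".svs"):
        continue
    slide = openslide.OpenSlide(os.path.join(SLIDE_DIR, name))
    w, h = slide.dimensions  # level-0 (width, height) in pixels
    slide.close()
    resized_h, resized_w = int(h * RESIZE_RATIO), int(w * RESIZE_RATIO)
    ok = resized_h <= INPUT_SIZE[0] and resized_w <= INPUT_SIZE[1]
    print(f"{name}: resized to ({resized_h}, {resized_w}) -> "
          f"{'fits' if ok else 'exceeds INPUT_SIZE'}")
```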

I tried this config and got the following traceback:
Traceback (most recent call last):
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/train.py", line 128, in <module>
    model = build_model(
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/model.py", line 129, in build_model
    conv_block = get_conv_block(input_shape)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/model.py", line 85, in get_conv_block
    conv_block = model_fn(
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/model.py", line 26, in <lambda>
    "fixup_resnet50": lambda *args, **kwargs: ResNet50(
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/resnet.py", line 557, in ResNet50
    return ResNet(stack_fn, False, True, 'resnet50',
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/resnet.py", line 436, in ResNet
    x = _ZeroPadding2D(padding=((3, 3), (3, 3)), name='conv1_pad')(x)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/huge_layer_wrapper.py", line 206, in call
    res = super(HugeLayerWrapper, self).call(inputs, **kwargs)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/venv/lib/python3.8/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1012, in call
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/huge_layer_wrapper.py", line 267, in call
    output_tensor_list = self._do_padding(inputs, **kwargs)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/huge_layer_wrapper.py", line 517, in _do_padding
    self.layer.compute_output_shape(self._get_shape(inputs)),
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/venv/lib/python3.8/site-packages/tensorflow/python/keras/layers/convolutional.py", line 2868, in compute_output_shape
    if input_shape[1] is not None:
IndexError: list index out of range

Process finished with exit code 1
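
The last frame suggests that `ZeroPadding2D.compute_output_shape` receives a shape with fewer entries than a full (batch, height, width, channels) shape, so indexing `input_shape[1]` fails. A standalone sketch of that failure mode (my reconstruction for illustration, not code from the repo):

```python
import tensorflow as tf

pad = tf.keras.layers.ZeroPadding2D(padding=((3, 3), (3, 3)))

# With a full NHWC shape the call works:
print(pad.compute_output_shape((1, 21500, 21500, 3)))  # (1, 21506, 21506, 3)

# With a truncated shape it raises the same error as above, because the Keras
# implementation indexes input_shape[1] and input_shape[2] directly:
print(pad.compute_output_shape((1,)))  # IndexError: list index out of range
```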

@TargosLi (Author)

/home/de1119151/PycharmProjects/whole-slide-cnn-main/venv/bin/python /home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/train.py
2021-09-30 08:33:59.142871: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
No protocol specified

Config

{
  "RESULT_DIR": "result_wholeslide_1x",
  "MODEL_PATH": "result_wholeslide_1x/model.h5",
  "LOAD_MODEL_BEFORE_TRAIN": false,
  "CONFIG_RECORD_PATH": "result_wholeslide_1x/config.yaml",
  "USE_MIXED_PRECISION": true,
  "USE_HMS": true,
  "USE_MIL": false,
  "TRAIN_CSV_PATH": "/home/de1119151/PycharmProjects/whole-slide-cnn-main/slide_data_targos/Train_SKIN_TCGA.csv",
  "VAL_CSV_PATH": "/home/de1119151/PycharmProjects/whole-slide-cnn-main/slide_data_targos/Val_SKIN_TCGA.csv",
  "TEST_CSV_PATH": "/home/de1119151/PycharmProjects/whole-slide-cnn-main/slide_data_targos/Test_SKIN_TCGA.csv",
  "SLIDE_DIR": "/mnt/data/RawImages/HE_SKIN_WSI_TCGA/",
  "SLIDE_FILE_EXTENSION": ".svs",
  "SLIDE_READER": "openslide",
  "RESIZE_RATIO": 0.05,
  "INPUT_SIZE": [
    21500,
    21500,
    3
  ],
  "MODEL": "fixup_resnet50",
  "NUM_CLASSES": 3,
  "BATCH_SIZE": 1,
  "EPOCHS": 200,
  "NUM_UPDATES_PER_EPOCH": 100,
  "INIT_LEARNING_RATE": 2e-05,
  "POOL_USE": "gmp",
  "REDUCE_LR_FACTOR": 0.1,
  "REDUCE_LR_PATIENCE": 24,
  "TIME_RECORD_PATH": "result_wholeslide_1x/time_record.csv",
  "TEST_TIME_RECORD_PATH": "result_wholeslide_1x/test_time_record.csv",
  "TEST_RESULT_PATH": "result_wholeslide_1x/test_result.json",
  "ENABLE_VIZ": false,
  "VIZ_SIZE": [
    2150,
    2150
  ],
  "VIZ_FOLDER": "result_wholeslide_1x/viz",
  "DEBUG_PATH": null
}
2021-09-30 08:34:00.865431: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-30 08:34:00.867343: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-30 08:34:00.868142: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-09-30 08:34:00.890023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-09-30 08:34:00.890048: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-09-30 08:34:00.892048: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-09-30 08:34:00.892087: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-09-30 08:34:00.892943: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-09-30 08:34:00.893123: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-09-30 08:34:00.895095: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-09-30 08:34:00.895550: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-09-30 08:34:00.895660: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-09-30 08:34:00.898047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-09-30 08:34:00.898070: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-09-30 08:34:01.310261: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-09-30 08:34:01.310293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-09-30 08:34:01.310299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
2021-09-30 08:34:01.313915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 128748 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6)
WARNING:tensorflow:From /home/de1119151/PycharmProjects/whole-slide-cnn-main/tensorflow_huge_model_support/tf_keras.py:29: The name tf.keras.backend.set_session is deprecated. Please use tf.compat.v1.keras.backend.set_session instead.

2021-09-30 08:34:01.324147: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-30 08:34:01.325364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-09-30 08:34:01.325427: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.326527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:21:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-09-30 08:34:01.326587: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.327667: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 2 with properties:
pciBusID: 0000:4b:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-09-30 08:34:01.327722: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.328798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 3 with properties:
pciBusID: 0000:4c:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-09-30 08:34:01.328815: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-09-30 08:34:01.328875: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-09-30 08:34:01.328889: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-09-30 08:34:01.328901: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-09-30 08:34:01.328913: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-09-30 08:34:01.328926: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-09-30 08:34:01.328938: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-09-30 08:34:01.328950: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-09-30 08:34:01.330106: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.331218: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.332327: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

Initializing datasets

Training dataset contains 660 slides.
Validation dataset contains 252 slides.

Initializing the model

2021-09-30 08:34:01.334535: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.335647: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.336760: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.337833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1, 2, 3
2021-09-30 08:34:01.337875: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-30 08:34:01.339004: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-30 08:34:01.339059: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.340133: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-30 08:34:01.340185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-30 08:34:01.341256: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-09-30 08:34:01.341309: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Traceback (most recent call last):
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/train.py", line 128, in <module>
    model = build_model(
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/model.py", line 129, in build_model
    conv_block = get_conv_block(input_shape)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/model.py", line 85, in get_conv_block
    conv_block = model_fn(
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/model.py", line 26, in <lambda>
    "fixup_resnet50": lambda *args, **kwargs: ResNet50(
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/resnet.py", line 557, in ResNet50
    return ResNet(stack_fn, False, True, 'resnet50',
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/resnet.py", line 436, in ResNet
    x = _ZeroPadding2D(padding=((3, 3), (3, 3)), name='conv1_pad')(x)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/huge_layer_wrapper.py", line 206, in call
    res = super(HugeLayerWrapper, self).call(inputs, **kwargs)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/venv/lib/python3.8/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1012, in call
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/huge_layer_wrapper.py", line 267, in call
    output_tensor_list = self._do_padding(inputs, **kwargs)
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/whole_slide_cnn/huge_layer_wrapper.py", line 517, in _do_padding
    self.layer.compute_output_shape(self._get_shape(inputs)),
  File "/home/de1119151/PycharmProjects/whole-slide-cnn-main/venv/lib/python3.8/site-packages/tensorflow/python/keras/layers/convolutional.py", line 2868, in compute_output_shape
    if input_shape[1] is not None:
IndexError: list index out of range

Process finished with exit code 1
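
To see exactly what shape the wrapper forwards in `self.layer.compute_output_shape(self._get_shape(inputs))`, a temporary shim like the following (a hypothetical debugging aid of mine, not part of the repo) could be placed near the top of train.py, before `build_model` is called:

```python
import tensorflow as tf

# Temporary debugging shim: log the shape every ZeroPadding2D receives in
# compute_output_shape, then delegate to the original implementation.
_orig_compute_output_shape = tf.keras.layers.ZeroPadding2D.compute_output_shape

def _logged_compute_output_shape(self, input_shape):
    print("ZeroPadding2D.compute_output_shape got:", input_shape)
    return _orig_compute_output_shape(self, input_shape)

tf.keras.layers.ZeroPadding2D.compute_output_shape = _logged_compute_output_shape
```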
