Nothing in Tensorboard after Eval steps #122

Closed
samuelhkahn opened this issue Apr 3, 2018 · 10 comments

Comments

samuelhkahn commented Apr 3, 2018

I recently upgraded to the most recent release of the Python SDK via a pip upgrade. I followed a couple of other GitHub threads, which said that this was (temporarily) solved in the most recent version of sagemaker-python-sdk. However, I am not seeing anything update in my local TensorBoard instance. Am I missing something?
This is the model and the helper functions I am deploying with SageMaker:

import pandas as pd
import numpy as np
import os
import json
import pickle
import sys
import traceback
import tensorflow as tf
from tensorflow.python.estimator.export.export import build_raw_serving_input_receiver_fn

from tensorflow.python.keras._impl.keras.layers import Dense
from tensorflow.python.keras._impl.keras.layers import Dropout
from tensorflow.python.keras._impl.keras.layers import LSTM
from tensorflow.python.keras._impl.keras.layers.embeddings import Embedding
from tensorflow.python.keras._impl.keras.optimizers import Adam
from tensorflow.python.keras._impl.keras.callbacks import ModelCheckpoint
from tensorflow.python.keras._impl.keras.callbacks import CSVLogger
from tensorflow.python.keras._impl.keras.callbacks import EarlyStopping
from tensorflow.python.keras._impl.keras.callbacks import LambdaCallback
from tensorflow.python.keras._impl.keras import metrics
from tensorflow.python.keras._impl.keras.models import Model
from tensorflow.python.keras._impl.keras import layers
from tensorflow.python.keras._impl.keras import Input

NUM_CLASSES = 2
NUM_DATA_BATCHES = 5
NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 10000 * NUM_DATA_BATCHES
BATCH_SIZE = 256
INPUT_TENSOR_NAME_1 = 'text1' # needs to match the name of the corresponding Keras Input layer
INPUT_TENSOR_NAME_2 = 'text2' # needs to match the name of the corresponding Keras Input layer
INPUT_TENSOR_NAME_3 = 'title1' # needs to match the name of the corresponding Keras Input layer
INPUT_TENSOR_NAME_4 = 'title2' # needs to match the name of the corresponding Keras Input layer



def keras_model_fn(hyperparameters):
    """keras_model_fn receives hyperparameters from the training job and returns a compiled Keras model.
    The model will be transformed into a TensorFlow Estimator before training and saved as a TensorFlow Serving
    SavedModel at the end of training.

    Args:
        hyperparameters: The hyperparameters passed to the SageMaker TrainingJob that runs your TensorFlow training
                         script.
    Returns: A compiled Keras model
    """

    text_input_1 = Input(shape=(None,), dtype='int32', name='text1')
    embedded_text_1 = layers.Embedding(50000,300)(text_input_1)
    embed_drop_1=Dropout(.5)(embedded_text_1)

    text_input_2 = Input(shape=(None,), dtype='int32', name='text2')
    embedded_text_2 = layers.Embedding(50000,300)(text_input_2)
    embed_drop_2=Dropout(.5)(embedded_text_2)


    shared_lstm_text = LSTM(256)
    left_output_text = shared_lstm_text(embed_drop_1)
    right_output_text = shared_lstm_text(embed_drop_2)

    title_input_1 = Input(shape=(None,), dtype='int32', name='title1')
    embedded_title_1 = layers.Embedding(50000,300)(title_input_1)
    embed_drop_3=Dropout(.5)(embedded_title_1)

    title_input_2 = Input(shape=(None,), dtype='int32', name='title2')
    embedded_title_2 = layers.Embedding(50000,300)(title_input_2)
    embed_drop_4=Dropout(.5)(embedded_title_2)

    shared_lstm_title = LSTM(128)
    left_output_title = shared_lstm_title(embed_drop_3)
    right_output_title = shared_lstm_title(embed_drop_4)
    # Calculates the distance as defined by the MaLSTM model
    # malstm_distance = Merge(mode=lambda x: exponent_neg_manhattan_distance(x[0], x[1]), output_shape=lambda x: (x[0][0], 1))([left_output, right_output])
    merged = layers.concatenate([left_output_text, right_output_text,left_output_title, right_output_title], axis=-1)
    drop_1 = Dropout(.3)(merged)
    dense_1 = layers.Dense(256, activation='sigmoid')(drop_1)
    drop_2 = Dropout(.3)(dense_1)

    dense_2 = layers.Dense(128, activation='sigmoid')(drop_2)


    predictions = layers.Dense(1, activation='sigmoid')(dense_2)

    # Pack it all up into a model
    shared_layer_model = Model([text_input_1, text_input_2,title_input_1,title_input_2], [predictions])
    shared_layer_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return shared_layer_model


def train_input_fn(training_dir , hyperparameters = None):

    return _input_fn(training_dir,"train")

def eval_input_fn(training_dir , hyperparameters = None):

    return _input_fn(training_dir,"dev")

def serving_input_fn(hyperparameters = None):

    text_ph_1 = tf.placeholder(tf.int32, shape=[None,501])
    text_ph_2 = tf.placeholder(tf.int32, shape=[None,501])
    title_ph_1 = tf.placeholder(tf.int32, shape=[None,51])
    title_ph_2 = tf.placeholder(tf.int32, shape=[None,51])

    #label is not required since serving is only used for inference
    feature_placeholders = {"text1":text_ph_1,"text2":text_ph_2,"title1":title_ph_1,"title2":title_ph_2}
    return build_raw_serving_input_receiver_fn(feature_placeholders)()

def _input_fn(training_dir,mode):
    train_text_1=np.load(training_dir+"/"+mode+"_text_1.npy")
    train_text_2=np.load(training_dir+"/"+mode+"_text_2.npy")
    train_title_1=np.load(training_dir+"/"+mode+"_title_1.npy")
    train_title_2=np.load(training_dir+"/"+mode+"_title_2.npy")

    y=np.load(training_dir+"/targets_"+mode+".npy")
    y=y.reshape((y.shape[0],1)).astype(np.float32)
    # y=tf.cast(y, tf.float32)



    x={INPUT_TENSOR_NAME_1: train_text_1, 
       INPUT_TENSOR_NAME_2: train_text_2,
       INPUT_TENSOR_NAME_3: train_title_1, 
       INPUT_TENSOR_NAME_4: train_title_2}
    dataset=tf.estimator.inputs.numpy_input_fn(x=x,y=y,batch_size=BATCH_SIZE,num_epochs=10,shuffle=False)()


    return dataset

Here is the training script I use to actually launch this model:

import sagemaker
from sagemaker.tensorflow import TensorFlow

TRAING_DATA_BUCKET = "s3://some_bucket"

def main():
    estimator = TensorFlow(
        entry_point='model.py',
        role="some_role",
        training_steps=100000,
        evaluation_steps=100,
        train_instance_count=1,
        train_instance_type='ml.p2.xlarge',
        base_job_name='model')

    estimator.fit(TRAING_DATA_BUCKET,run_tensorboard_locally=True)


if __name__ == "__main__": main()
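
A minimal sketch for narrowing down where the problem sits, assuming TensorBoard can only plot something once event files (names containing `events.out.tfevents`) exist under the estimator's `model_dir`. In the logs for this job, the run config reports a `model_dir` ending in `/checkpoints` and `_save_summary_steps: 100`. The helper names `find_event_files` and `list_keys`, and the bucket and prefix in the usage comment, are hypothetical placeholders and not part of the SageMaker SDK. The S3 listing uses the standard boto3 `list_objects_v2` paginator and needs AWS credentials.

```python
def find_event_files(keys):
    """Return the object keys whose file name looks like a TensorBoard event file."""
    return [k for k in keys if "events.out.tfevents" in k.rsplit("/", 1)[-1]]


def list_keys(bucket, prefix):
    """List every object key under s3://bucket/prefix (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so find_event_files can be used without boto3

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Usage (hypothetical bucket and job prefix; take the real ones from the
# '_model_dir' entry in the training logs):
#
#   events = find_event_files(list_keys("some_bucket", "some-job-name/checkpoints"))
#   print("\n".join(events) if events else "no event files written yet")
```

If this finds no event files, the training container has not written any summaries yet. If it does find them, the gap is more likely on the local side, between S3 and the local TensorBoard log directory.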

And here is a sample of the logs:

2018-04-02 22:48:34,865 INFO - root - running container entrypoint
2018-04-02 22:48:34,866 INFO - root - starting train task
2018-04-02 22:48:34,884 INFO - container_support.training - Training starting
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
2018-04-02 22:48:37,298 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTP connection (1): 169.254.170.2
2018-04-02 22:48:37,572 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): sagemaker-us-west-2-304913402249.s3.amazonaws.com
2018-04-02 22:48:37,623 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): sagemaker-us-west-2-304913402249.s3.amazonaws.com
2018-04-02 22:48:37,640 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): sagemaker-us-west-2-304913402249.s3.us-west-2.amazonaws.com
2018-04-02 22:48:37,694 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): sagemaker-us-west-2-304913402249.s3.us-west-2.amazonaws.com
2018-04-02 22:48:37,800 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): s3.amazonaws.com
2018-04-02 22:48:38,133 INFO - tf_container - ----------------------TF_CONFIG--------------------------
2018-04-02 22:48:38,134 INFO - tf_container - {"environment": "cloud", "cluster": {"master": ["algo-1:2222"]}, "task": {"index": 0, "type": "master"}}
2018-04-02 22:48:38,134 INFO - tf_container - ---------------------------------------------------------
2018-04-02 22:48:38,134 INFO - tf_container - creating RunConfig:
2018-04-02 22:48:38,134 INFO - tf_container - {'save_checkpoints_secs': 300}
2018-04-02 22:48:38,134 INFO - tensorflow - TF_CONFIG environment variable: {u'environment': u'cloud', u'cluster': {u'master': [u'algo-1:2222']}, u'task': {u'index': 0, u'type': u'master'}}
2018-04-02 22:48:38,134 INFO - tf_container - invoking keras_model_fn
2018-04-02 22:48:39,319 INFO - tensorflow - Using the Keras model from memory.
2018-04-02 22:48:39.476033: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-02 22:48:39.476412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8755
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-04-02 22:48:39.476442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-02 22:48:40.256068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10764 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2018-04-02 22:48:40.671208: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x5ab2560
2018-04-02 22:48:41,776 INFO - tensorflow - Using config: {'_save_checkpoints_secs': 300, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': u'master', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fca8794a650>, '_model_dir': u's3://sagemaker-us-west-2-304913402249/shared-model-lstm-2018-04-02-22-39-04-213/checkpoints', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}
2018-04-02 22:48:41.783420: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/config and using profilePrefix = 1
2018-04-02 22:48:41.783448: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/credentials and using profilePrefix = 0
2018-04-02 22:48:41.783460: I tensorflow/core/platform/s3/aws_logging.cc:54] Setting provider to read credentials from /root//.aws/credentials for credentials file and /root//.aws/config for the config file , for use with profile default
2018-04-02 22:48:41.783476: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating HttpClient with max connections2 and scheme http
2018-04-02 22:48:41.783501: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 2
2018-04-02 22:48:41.783525: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating TaskRole with default ECSCredentialsClient and refresh rate 900000
2018-04-02 22:48:41.783572: I tensorflow/core/platform/s3/aws_logging.cc:54] Unable to open config file /root//.aws/credentials for reading.
2018-04-02 22:48:41.783586: I tensorflow/core/platform/s3/aws_logging.cc:54] Failed to reload configuration.
2018-04-02 22:48:41.783601: I tensorflow/core/platform/s3/aws_logging.cc:54] Unable to open config file /root//.aws/config for reading.
2018-04-02 22:48:41.783617: I tensorflow/core/platform/s3/aws_logging.cc:54] Failed to reload configuration.
2018-04-02 22:48:41.783630: I tensorflow/core/platform/s3/aws_logging.cc:54] Credentials have expired or will expire, attempting to repull from ECS IAM Service.
2018-04-02 22:48:41.783741: I tensorflow/core/platform/s3/aws_logging.cc:54] Pool grown by 2
2018-04-02 22:48:41.783761: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:48:41.787251: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 25
2018-04-02 22:48:41.790005: I tensorflow/core/platform/s3/aws_logging.cc:54] Pool grown by 2
2018-04-02 22:48:41.790033: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:48:41.962637: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-04-02 22:48:41.962686: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-04-02 22:48:41.963541: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:48:43.164915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-02 22:48:43.165103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 298 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2018-04-02 22:48:48.326432: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:48:48.387841: I tensorflow/core/platform/s3/aws_logging.cc:54] Deleting file: /tmp/s3_filesystem_XXXXXX20180402T2248481522709328325
2018-04-02 22:48:49.098319: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:08.979787: I tensorflow/core/platform/s3/aws_logging.cc:54] Deleting file: /tmp/s3_filesystem_XXXXXX20180402T2248481522709328387
2018-04-02 22:49:09.067202: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:09.081492: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:31.687063: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:31.696369: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:31.718624: I tensorflow/core/platform/s3/aws_logging.cc:54] Deleting file: /tmp/s3_filesystem_XXXXXX20180402T2249311522709371695
2018-04-02 22:49:31.718807: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:31.731217: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:31.758332: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:32.657755: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:32.674760: I tensorflow/core/platform/s3/aws_logging.cc:54] Deleting file: /tmp/s3_filesystem_XXXXXX20180402T2249321522709372657
2018-04-02 22:49:32.675032: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:32.691849: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:32.738527: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.297412: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.339622: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-04-02 22:49:33.339665: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-04-02 22:49:33.339807: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.428415: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.593379: I tensorflow/core/platform/s3/aws_logging.cc:54] Deleting file: /tmp/s3_filesystem_XXXXXX20180402T2249331522709373426
2018-04-02 22:49:33.593994: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.604710: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.688775: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33,848 INFO - tensorflow - Skip starting Tensorflow server as there is only one node in the cluster.
2018-04-02 22:49:33.849136: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.855879: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.867678: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.877568: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.887121: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.898457: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.917874: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.925198: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.936749: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.943854: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.954551: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.971271: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.980231: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.988384: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:33.994245: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:34.005499: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:34.016162: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:34.042879: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:34.057164: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:34.068841: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:34.082194: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:34.090577: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:49:34.100416: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:50:56,086 INFO - tensorflow - Calling model_fn.
2018-04-02 22:51:00,260 INFO - tensorflow - Done calling model_fn.
2018-04-02 22:51:00,262 INFO - tensorflow - Create CheckpointSaverHook.
2018-04-02 22:51:00.262548: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:00.325489: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-04-02 22:51:00.325524: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-04-02 22:51:00.325674: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:00.415882: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:00.425139: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-04-02 22:51:00.425182: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-04-02 22:51:00.425382: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:00.881982: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:00.894606: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:00.904546: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:00.913076: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:00.927390: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:00.980548: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:01.190686: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:01.199297: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:01,888 INFO - tensorflow - Graph was finalized.
2018-04-02 22:51:01.889441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-02 22:51:01.889626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 294 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2018-04-02 22:51:01.890096: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:01.897931: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:01.908246: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:01.916873: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:01.927502: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:01,939 INFO - tensorflow - Restoring parameters from s3://sagemaker-us-west-2-304913402249/shared-model-lstm-2018-04-02-22-39-04-213/checkpoints/keras_model.ckpt
2018-04-02 22:51:02.287560: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:02.305949: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:02.316070: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:02.326439: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:02.337997: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:02.348196: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:02.356024: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:02.365497: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:02.375296: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:02.387042: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:02.396985: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:02.553669: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:03.341350: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:04.435127: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:05.190987: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:05.887292: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:05.925875: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:05.953532: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:05.976745: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:06.089836: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:06.948680: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:07.925257: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:07.949810: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:08.862382: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:09.552191: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:10.286866: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:10.985758: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:11.751739: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:11.810978: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:11.832691: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:11.861106: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:11.882758: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:12.659197: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:12.704166: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:12.813763: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:12.835065: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:12.923866: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:12.943118: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:12.981208: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:13.000233: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:13.019809: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:13.041899: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:13.053608: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:13,170 INFO - tensorflow - Running local_init_op.
2018-04-02 22:51:13,209 INFO - tensorflow - Done running local_init_op.
2018-04-02 22:51:13.882071: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:13.891074: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-04-02 22:51:13.891112: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-04-02 22:51:13.891286: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:15.615216: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:15.741352: I tensorflow/core/platform/s3/aws_logging.cc:54] Deleting file: /tmp/s3_filesystem_XXXXXX20180402T2251151522709475613
2018-04-02 22:51:15.742244: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:15.757312: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:15.880544: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:20,436 INFO - tensorflow - Saving checkpoints for 1 into s3://sagemaker-us-west-2-304913402249/shared-model-lstm-2018-04-02-22-39-04-213/checkpoints/model.ckpt.
2018-04-02 22:51:20.540136: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:20.550186: I tensorflow/core/platform/s3/aws_logging.cc:54] Deleting file: /tmp/s3_filesystem_XXXXXX20180402T2251201522709480539
2018-04-02 22:51:21.201413: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:37.626173: I tensorflow/core/platform/s3/aws_logging.cc:54] Deleting file: /tmp/s3_filesystem_XXXXXX20180402T2251201522709480550
2018-04-02 22:51:39.231071: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:39.248048: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:52.381842: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:52.394329: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:52.413002: I tensorflow/core/platform/s3/aws_logging.cc:54] Deleting file: /tmp/s3_filesystem_XXXXXX20180402T2251521522709512394
2018-04-02 22:51:52.413186: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:52.426175: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:52.461108: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:52.471439: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:52.482154: I tensorflow/core/platform/s3/aws_logging.cc:54] Deleting file: /tmp/s3_filesystem_XXXXXX20180402T2251521522709512471
2018-04-02 22:51:52.482316: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:52.490492: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:52.501403: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:52.510951: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:52.519910: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:52.529277: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:51:52.541365: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:09.972384: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:09.981957: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:09.998237: I tensorflow/core/platform/s3/aws_logging.cc:54] Deleting file: /tmp/s3_filesystem_XXXXXX20180402T2252091522709529981
2018-04-02 22:52:09.998410: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:10.007228: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:10.019049: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:10.030835: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:10.080188: I tensorflow/core/platform/s3/aws_logging.cc:54] Deleting file: /tmp/s3_filesystem_XXXXXX20180402T2252101522709530030
2018-04-02 22:52:10.080439: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:10.096768: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:10.139914: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:10.481185: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:10.512270: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-04-02 22:52:10.512315: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-04-02 22:52:10.512477: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:10.609451: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:10.707867: I tensorflow/core/platform/s3/aws_logging.cc:54] Deleting file: /tmp/s3_filesystem_XXXXXX20180402T2252101522709530607
2018-04-02 22:52:10.708356: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:10.722452: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:10.829838: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:10.919452: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:10.927894: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:10.939060: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:11.347466: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:11.370474: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:11.382340: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:11.401401: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:11.410047: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:11.420895: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:11.428769: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:23,944 INFO - tensorflow - Calling model_fn.
2018-04-02 22:52:28,194 INFO - tensorflow - Done calling model_fn.
2018-04-02 22:52:28,216 INFO - tensorflow - Starting evaluation at 2018-04-02-22:52:28
2018-04-02 22:52:28,412 INFO - tensorflow - Graph was finalized.
2018-04-02 22:52:28.413193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-02 22:52:28.413394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 294 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2018-04-02 22:52:28,413 INFO - tensorflow - Restoring parameters from s3://sagemaker-us-west-2-304913402249/shared-model-lstm-2018-04-02-22-39-04-213/checkpoints/model.ckpt-1
2018-04-02 22:52:28.741275: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:28.819166: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:28.826148: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:28.835709: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:28.843095: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:28.853672: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:28.862041: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:28.874661: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:28.885648: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:28.895464: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:28.905889: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:28.960158: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:29.660736: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:30.502709: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:31.190924: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:32.085296: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:32.106829: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:32.128063: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:32.149246: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:32.169160: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:32.858521: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:33.546670: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:33.566797: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:34.256998: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:34.964872: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:35.654776: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:36.344339: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:37.043329: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:37.066176: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:37.089093: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:37.109659: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:37.133638: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:37.821845: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:37.839246: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:37.859660: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:37.880892: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-04-02 22:52:38,038 INFO - tensorflow - Running local_init_op.
2018-04-02 22:52:38,075 INFO - tensorflow - Done running local_init_op.
2018-04-02 22:52:45,575 INFO - tensorflow - Evaluation [10/100]
2018-04-02 22:52:52,843 INFO - tensorflow - Evaluation [20/100]
2018-04-02 22:53:00,118 INFO - tensorflow - Evaluation [30/100]
2018-04-02 22:53:07,396 INFO - tensorflow - Evaluation [40/100]
2018-04-02 22:53:14,678 INFO - tensorflow - Evaluation [50/100]
2018-04-02 22:53:21,963 INFO - tensorflow - Evaluation [60/100]
2018-04-02 22:53:29,247 INFO - tensorflow - Evaluation [70/100]
2018-04-02 22:53:36,525 INFO - tensorflow - Evaluation [80/100]
2018-04-02 22:53:43,806 INFO - tensorflow - Evaluation [90/100]
2018-04-02 22:53:51,092 INFO - tensorflow - Evaluation [100/100]
2018-04-02 22:53:51,298 INFO - tensorflow - Finished evaluation at 2018-04-02-22:53:51
2018-04-02 22:53:51,299 INFO - tensorflow - Saving dict for global step 1: accuracy = 0.49929687, global_step = 1, loss = 0.703844

Any tips or clues would be greatly appreciated.

samuelhkahn changed the title from "Unintended results from updating Sagemaker Python SDK in Keras/TF" to "Nothing in Tensorboard after Eval steps" Apr 3, 2018
@owen-t
Contributor

owen-t commented Apr 3, 2018

Hi,

The evaluation steps may have finished before you were able to see any updates in TensorBoard. Have you tried running with more evaluation steps? Do you see updates during training?

@winstonaws
Contributor

Hi,

We pushed a new image which uses a different parameter to control the frequency of evaluations. You can specify it as follows:

hyperparameters={'throttle_secs': 30}

Here, throttle_secs is the minimum amount of time, in seconds, that must elapse between evaluations. The default is 600, so evaluation results update at most once every 10 minutes.

We'll update our example notebooks to document this.
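
To make the effect of throttle_secs concrete, here is a minimal sketch in plain Python. It is a simplified model and not the SDK's or TensorFlow's actual code: it assumes an evaluation can only be triggered when a checkpoint is written, and that it is skipped if fewer than throttle_secs seconds have passed since the previous evaluation. The function name evaluation_times and the 60-second checkpoint schedule are hypothetical, chosen only for illustration.

```python
def evaluation_times(checkpoint_times, throttle_secs):
    """Return the checkpoint times at which an evaluation would run.

    Simplified model (an assumption, not TensorFlow's implementation):
    an evaluation runs at a checkpoint only if at least `throttle_secs`
    seconds have passed since the previous evaluation.
    """
    evaluated = []
    last_eval = None
    for t in checkpoint_times:
        if last_eval is None or t - last_eval >= throttle_secs:
            evaluated.append(t)
            last_eval = t
    return evaluated


# Hypothetical job that writes a checkpoint every 60 seconds for 30 minutes.
checkpoints = list(range(0, 1801, 60))

default_runs = evaluation_times(checkpoints, throttle_secs=600)
frequent_runs = evaluation_times(checkpoints, throttle_secs=30)

print("evaluations with throttle_secs=600:", len(default_runs))
print("evaluations with throttle_secs=30:", len(frequent_runs))
```

Under these assumptions, the default setting produces only a handful of evaluation points over half an hour, which would explain why TensorBoard appears empty for a long time, while a small throttle_secs lets every checkpoint produce an evaluation.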

@samuelhkahn
Author

samuelhkahn commented Apr 5, 2018

Thanks for the response @winstonaws! Appreciate you guys being so active in helping out the new SageMaker community!

@chang2394

@winstonaws: I tried setting throttle_secs through hyperparameters as described above, but the value is not reflected when the job runs.
Please let me know if I am doing something wrong.

hyperparams = {
    'learning_rate': 0.001,
    'dropout_rate' : 0.2,
    'save_checkpoints_steps' : 100,
    'save_checkpoints_secs': None,
    'keep_checkpoint_max': None,
    'min_eval_frequency': 100,
    'throttle_secs': 10,
    'eval_throttle_secs': 10
}

estimator = TensorFlow(
    entry_point='xxx.py',
    source_dir='xxx',
    role=role,
    output_path=model_artifacts_location,
    code_location=custom_code_upload_location,
    hyperparameters=hyperparams, 
    train_instance_count=1,
    train_instance_type=train_instance,
    training_steps=10000,
    evaluation_steps=100,
    base_job_name=job_name)

@samuelhkahn
Author

I am not seeing TensorBoard work with the suggested changes either. I thought they might have fixed it, but it doesn't look like it.

@chang2394

chang2394 commented Apr 24, 2018

@samuelhkahn: Were you able to find a workaround for this?

@winstonaws: Please suggest what should be done to change the evaluation throttle duration.

@winstonaws
Contributor

@samuelhkahn @chang2394 What version of the Python SDK are you using? Unfortunately, the fixes don't go out automatically to the notebook instances at the moment. The fix needed is in this change: #105. Can you try updating to the latest version and rerunning? You can do this by running the following in your notebook:

!pip install --upgrade sagemaker

If it's still not working correctly, what behavior are you seeing? Does it differ from your experience when trying out https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_resnet_cifar10_with_tensorboard/tensorflow_resnet_cifar10_with_tensorboard.ipynb ? When I run that example I can see the TensorBoard UI as soon as I call fit, and every time the training job evaluates (which happens at the throttle_secs frequency), I see the scalars update.

@chang2394
Can you confirm you are running TensorFlow 1.6? You can do this by viewing the image used by the training job in the SageMaker console, or you can set the version explicitly using the framework_version constructor argument.
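
As a quick way to answer the SDK version question from inside a notebook, something like the following sketch could be used. It assumes Python 3.8 or later (for importlib.metadata), and the helper name installed_version is hypothetical. Note that it only reports what is installed in the notebook environment; the TensorFlow version used by the training job is determined by the training image, as described above.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Reports the SDK version in the current (notebook) environment only.
print("sagemaker:", installed_version("sagemaker"))
```

If this prints None or an old version after running the pip upgrade, the notebook kernel may need to be restarted before the new version is picked up.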

@chang2394

@winstonaws: I am able to see data in the TensorBoard UI, though I'm not sure what caused the issue. Right now, I am using sagemaker 1.2.4 and tensorflow 1.6.

@winstonaws
Contributor

@chang2394 Great! Are you still having any other problems with tensorboard?

@chang2394

chang2394 commented Jun 1, 2018

@winstonaws No, it seems to be working fine as of now. Thanks a lot :)

apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this issue Nov 15, 2018
Fixed the roles from all the notebooks