# "Hallelujah Effect" Analysis

This notebook models the "Hallelujah Effect" using all basic features available in the dataset, restricted to subjects who listened to the song and had an EDA quality >80%.

In [None]:
%bash
pip uninstall -y google-cloud-dataflow
pip install --upgrade --force tensorflow_transform==0.6.0 apache-beam[gcp]

<b>Restart the kernel</b> after running pip install (click the <b>Reset</b> button in Datalab).

In [1]:
%bash
pip freeze | grep -e 'flow\|beam'

apache-airflow==1.9.0
apache-beam==2.5.0
tensorflow==1.8.0
tensorflow-transform==0.6.0




In [2]:
import tensorflow as tf
import tensorflow_transform as tft
import shutil
print(tf.__version__)

1.8.0




In [3]:
# Set bucket, project, and region
BUCKET = 'eim-muse'
PROJECT = 'eim-muse'
REGION = 'us-central1'

In [4]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

In [5]:
%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


## Retrieve and Subset Datasource

Retrieve the data from BigQuery, but defer filtering and other processing to Beam. The BigQuery table has already been pre-processed with Dataprep.

In [6]:
import google.datalab.bigquery as bq
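def create_query(phase, EVERY_N):
  """
  Build the BigQuery query for one data split.

  phase: 1=train 2=valid
  EVERY_N: if None, use the hash-based train/valid split; otherwise keep
    only rows where MOD(FARM_FINGERPRINT(id), EVERY_N) equals phase
  """
  base_query = """
SELECT *
FROM
  `eim-muse.hallelujah_effect.full_hallelujah_trials_cleaned`
  """

  if EVERY_N is None:
    if phase < 2:
      # Training
      query = "{0} WHERE MOD(FARM_FINGERPRINT(id), 10) < 7".format(base_query)
    else:
      # Validation
      query = "{0} WHERE MOD(FARM_FINGERPRINT(id), 10) >= 7".format(base_query)
  else:
    # Subsample: keep the rows whose hash bucket matches the phase number
    query = "{0} WHERE MOD(FARM_FINGERPRINT(id), {1}) = {2}".format(base_query, EVERY_N, phase)

  return query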

query = create_query(1, None)
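
How the split behaves depends on how `MOD` treats negative fingerprints. `FARM_FINGERPRINT` returns a signed 64-bit integer, and BigQuery's `MOD` returns a result with the sign of the dividend, so negative fingerprints give remainders between -9 and 0, all of which satisfy `< 7`. The sketch below imitates this in plain Python. `signed_hash64` is a hypothetical stand-in built on SHA-256 (it is not FarmHash), and the subject IDs are made up, so only the approximate proportions are meaningful.

```python
import hashlib
import struct

def signed_hash64(key):
    """Hypothetical stand-in for a signed 64-bit fingerprint (NOT FarmHash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return struct.unpack("<q", digest[:8])[0]

def truncated_mod(x, n):
    """Remainder that takes the sign of the dividend, as BigQuery's MOD does."""
    r = abs(x) % n
    return -r if x < 0 else r

# Made-up IDs, used only to estimate the proportions
ids = ["subject-{}".format(i) for i in range(10000)]
remainders = [truncated_mod(signed_hash64(i), 10) for i in ids]

# Share of rows that pass `MOD(hash, 10) < 7` with signed remainders
frac_signed = sum(r < 7 for r in remainders) / float(len(ids))
# Share of rows that pass `ABS(MOD(hash, 10)) < 7`
frac_abs = sum(abs(r) < 7 for r in remainders) / float(len(ids))

print("signed remainder: {:.3f} in training".format(frac_signed))
print("absolute remainder: {:.3f} in training".format(frac_abs))
```

Under these assumptions roughly 85% of rows land in training rather than 70%, which is in line with the 303/61 split reported in this notebook. Wrapping the remainder in `ABS(...)` would bring the split closer to 70/30, but it would also change the split sizes and results reported here.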

In [7]:
# `query` was built with phase=1, so this DataFrame holds the training split despite its name
df_valid = bq.Query(query).execute().result().to_dataframe()
df_valid.head()
df_valid.describe()

| | age | concentration | musical_expertise | artistic | fault | imagination | lazy | nervous | outgoing | reserved | ... | music_pref_none | music_pref_hiphop | music_pref_dance | music_pref_world | music_pref_rock | music_pref_pop | music_pref_classical | music_pref_jazz | music_pref_folk | music_pref_traditional_irish |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | ... | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 24.726073 | 3.9935 | 2.547224 | 2.344099 | 3.175612 | 3.852643 | 3.690403 | 3.629726 | 3.213016 | 3.145503 | ... | 0.006601 | 0.138614 | 0.188119 | 0.132013 | 0.432343 | 0.673267 | 0.306931 | 0.171617 | 0.089109 | 0.059406 |
| std | 13.931034 | 0.795258 | 1.009324 | 0.988361 | 0.891464 | 0.82162 | 0.905895 | 0.875898 | 0.961299 | 0.891699 | ... | 0.08111 | 0.346115 | 0.391454 | 0.339065 | 0.496221 | 0.469794 | 0.461983 | 0.377671 | 0.285372 | 0.236774 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 16.0 | 3.991266 | 2.0 | 2.0 | 3.0 | 3.824561 | 3.659389 | 3.596491 | 3.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 21.0 | 3.991266 | 2.52988 | 2.353712 | 3.144737 | 3.824561 | 3.659389 | 3.596491 | 3.22807 | 3.117904 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 31.0 | 4.0 | 3.0 | 2.353712 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| max | 121.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [8]:
train_query = create_query(1, None)
train_n = len(list(bq.Query(train_query).execute().result()))

eval_query = create_query(2, None)
eval_n = len(list(bq.Query(eval_query).execute().result()))

os.environ['TRAIN_N'] = str(train_n)
os.environ['EVAL_N'] = str(eval_n)

print('{} training examples / {} evaluation examples'.format(train_n, eval_n))

303 training examples / 61 evaluation examples


In [9]:
df_valid.columns

Index([u'id', u'age', u'concentration', u'hearing_impairments',
       u'musical_expertise', u'nationality', u'artistic', u'fault',
       u'imagination', u'lazy', u'nervous', u'outgoing', u'reserved',
       u'stress', u'thorough', u'trusting', u'activity', u'engagement',
       u'familiarity', u'like_dislike', u'positivity', u'tension', u'sex',
       u'hallelujah_reaction', u'location', u'language', u'music_pref_none',
       u'music_pref_hiphop', u'music_pref_dance', u'music_pref_world',
       u'music_pref_rock', u'music_pref_pop', u'music_pref_classical',
       u'music_pref_jazz', u'music_pref_folk',
       u'music_pref_traditional_irish'],
      dtype='object')

## Create ML dataset using tf.transform and Dataflow

Let's use Cloud Dataflow to read in the BigQuery data and write it out as gzipped TFRecord files. Along the way, tf.transform scales the numeric features to the range [0, 1] and casts the boolean features to integers. Using tf.transform also saves the transformation metadata, which ensures that the same transformations are carried out during prediction.
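
The reason for saving the transformation can be illustrated with a minimal sketch of the analyze-then-transform pattern in plain Python. This is not the tf.transform API: `analyze_min_max` and `apply_scale_to_0_1` are hypothetical helpers, and the ages are made-up values chosen to span the 1 to 121 range seen in the training data. The point is that the minimum and maximum are computed once from the training data and then reused unchanged on any other data.

```python
def analyze_min_max(train_values):
    """Analysis pass: compute the statistics from the training data only."""
    return min(train_values), max(train_values)

def apply_scale_to_0_1(values, lo, hi):
    """Transform pass: rescale values using the stored training statistics."""
    if hi == lo:
        # This sketch does not define a result for constant columns
        raise ValueError("min-max scaling is undefined for a constant column")
    return [(v - lo) / (hi - lo) for v in values]

# Made-up ages for illustration
train_ages = [1.0, 21.0, 121.0]
eval_ages = [16.0, 31.0]

# Analyze once on the training data ...
lo, hi = analyze_min_max(train_ages)

# ... then apply the same statistics to both splits
train_scaled = apply_scale_to_0_1(train_ages, lo, hi)
eval_scaled = apply_scale_to_0_1(eval_ages, lo, hi)

print(train_scaled)
print(eval_scaled)  # [0.125, 0.25]
```

Recomputing the minimum and maximum on the evaluation data would map 16 and 31 to 0 and 1 instead, so the model would see inputs on a different scale from the one it was trained on.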

In [10]:
%writefile requirements.txt
tensorflow-transform==0.6.0

Overwriting requirements.txt


In [21]:
import datetime
import tensorflow as tf
import apache_beam as beam
import tensorflow_transform as tft
from tensorflow_transform.beam import impl as beam_impl

def is_valid(inputs):
    """Placeholder row filter: no checks are implemented yet, so every row is kept."""
    return True

float_features = [
    'activity',
    'age',
    'artistic',
    'concentration',
    'engagement',
    'familiarity',
    'fault',
    'imagination',
    'lazy',
    'like_dislike',
    'musical_expertise',
    'nervous',
    'outgoing',
    'positivity',
    'reserved',
    'stress',
    'tension',
    'thorough',
    'trusting'
]

boolean_features = [
    'hallelujah_reaction',
    'hearing_impairments',
    'music_pref_classical',
    'music_pref_dance',
    'music_pref_folk',
    'music_pref_hiphop',
    'music_pref_jazz',
    'music_pref_none',
    'music_pref_pop',
    'music_pref_rock',
    'music_pref_traditional_irish',
    'music_pref_world'
]

categorical_features = [
    'language',
    'location',
    'nationality',
    'sex'
]

def preprocess_tft(inputs):
    import datetime
    result = {}
    
    for feature in float_features:
        result[feature] = tft.scale_to_0_1(inputs[feature])
    
    for feature in boolean_features:
        result[feature] = tf.cast(inputs[feature], tf.int64)
    
    for feature in categorical_features:
        result[feature] = tf.identity(inputs[feature])
    
    return result

def preprocess(in_test_mode, EVERY_N=None):
  import os
  import os.path
  import tempfile
  from apache_beam.io import tfrecordio
  from tensorflow_transform.coders import example_proto_coder
  from tensorflow_transform.tf_metadata import dataset_metadata
  from tensorflow_transform.tf_metadata import dataset_schema
  from tensorflow_transform.beam import tft_beam_io
  from tensorflow_transform.beam.tft_beam_io import transform_fn_io

  job_name = 'hallelujah-effect-features' + '-' + datetime.datetime.now().strftime('%y%m%d-%H%M%S')    
  if in_test_mode:
    import shutil
    print('Launching local job ... hang on')
    OUTPUT_DIR = './preproc_tft'
    shutil.rmtree(OUTPUT_DIR, ignore_errors=True)
    
  else:
    print('Launching Dataflow job {} ... hang on'.format(job_name))
    OUTPUT_DIR = 'gs://{0}/analysis/hallelujah-effect/preproc_tft/'.format(BUCKET)
    import subprocess
    subprocess.call('gsutil rm -r {}'.format(OUTPUT_DIR).split())
  
  # Configure Beam pipeline options
  options = {
    'staging_location': os.path.join(OUTPUT_DIR, 'tmp', 'staging'),
    'temp_location': os.path.join(OUTPUT_DIR, 'tmp'),
    'job_name': job_name,
    'project': PROJECT,
    'max_num_workers': 24,
    'teardown_policy': 'TEARDOWN_ALWAYS',
    'no_save_main_session': True,
    'requirements_file': 'requirements.txt'
  }
  opts = beam.pipeline.PipelineOptions(flags=[], **options)
  if in_test_mode:
    RUNNER = 'DirectRunner'
  else:
    RUNNER = 'DataflowRunner'

  # Setup metadata
  raw_data_schema = {
    colname : dataset_schema.ColumnSchema(tf.string, [], dataset_schema.FixedColumnRepresentation())
                   for colname in categorical_features
  }
  raw_data_schema.update({
      colname : dataset_schema.ColumnSchema(tf.float32, [], dataset_schema.FixedColumnRepresentation())
                   for colname in float_features
    })
  raw_data_schema.update({
      colname : dataset_schema.ColumnSchema(tf.int64, [], dataset_schema.FixedColumnRepresentation())
                   for colname in boolean_features
    })
  raw_data_metadata = dataset_metadata.DatasetMetadata(dataset_schema.Schema(raw_data_schema))
  
  # run Beam  
  with beam.Pipeline(RUNNER, options=opts) as p:
    with beam_impl.Context(temp_dir=os.path.join(OUTPUT_DIR, 'tmp')):
      
      # Write the raw data metadata to disk
      # Without the overloaded operators: p.apply(tft_beam_io.WriteMetadata(os.path.join(OUTPUT_DIR, 'metadata/rawdata_metadata'), raw_data_metadata)
      _ = (raw_data_metadata
        | 'WriteInputMetadata' >> tft_beam_io.WriteMetadata(
            os.path.join(OUTPUT_DIR, 'metadata/rawdata_metadata'),
            pipeline=p))
           
      # Analyze and transform training data
      this_query = create_query(1, EVERY_N)
      
      # Read in training data from BigQuery table
      raw_data = (p
        # Get raw training data from BigQuery
        | 'train_read' >> beam.io.Read(beam.io.BigQuerySource(query=this_query, use_standard_sql=True))
        # Use our is_valid function to only retain valid examples from training data
        | 'train_filter' >> beam.Filter(is_valid))

      # Package raw training data and its metadata into a 'dataset'
      raw_dataset = (raw_data, raw_data_metadata)
      
      # Using the preprocessing function `preprocess_tft`, preprocess the training data
      # and produce a transformed training dataset and a function to transform other data later
      transformed_dataset, transform_fn = (
          raw_dataset | beam_impl.AnalyzeAndTransformDataset(preprocess_tft))
      
      # Break out the transformed training data and its metadata
      transformed_data, transformed_metadata = transformed_dataset
      
      # Write the transformed training data to files
      _ = transformed_data | 'WriteTrainData' >> tfrecordio.WriteToTFRecord(
          os.path.join(OUTPUT_DIR, 'train'),
          file_name_suffix='.gz',
          coder=example_proto_coder.ExampleProtoCoder(
              transformed_metadata.schema))
      
      # Read in test data from BigQuery table and filter as we did with training data
      raw_test_data = (p 
        | 'eval_read' >> beam.io.Read(beam.io.BigQuerySource(query=create_query(2, EVERY_N), use_standard_sql=True))
        | 'eval_filter' >> beam.Filter(is_valid))
      
      # Package test data and metadata into a dataset
      raw_test_dataset = (raw_test_data, raw_data_metadata)
      
      # Using the same transformation function that was calculated above, transform the test dataset
      transformed_test_dataset = (
          (raw_test_dataset, transform_fn) | beam_impl.TransformDataset())
      
      # Write the transformed test data to files
      transformed_test_data, _ = transformed_test_dataset
      _ = transformed_test_data | 'WriteTestData' >> tfrecordio.WriteToTFRecord(
          os.path.join(OUTPUT_DIR, 'eval'),
          file_name_suffix='.gz',
          coder=example_proto_coder.ExampleProtoCoder(
              transformed_metadata.schema))
      
      # Write the transformation function to a file, as well
      _ = (transform_fn
           | 'WriteTransformFn' >>
           transform_fn_io.WriteTransformFn(os.path.join(OUTPUT_DIR, 'metadata')))

# Preprocess the training/test data
preprocess(in_test_mode=True, EVERY_N=None)

Launching local job ... hang on
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: ./preproc_tft/tmp/tftransform_tmp/33b3c2175e674342acec8a75679a7e15/saved_model.pb
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: ./preproc_tft/tmp/tftransform_tmp/742f34884e704beaa5527063033984b5/saved_model.pb


INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: ./preproc_tft/tmp/tftransform_tmp/a37f0c49ef7e46c99f70b715d910eab3/saved_model.pb
INFO:tensorflow:Saver not created because there are no variables in the graph to restore


In [11]:
%bash
ls -ls preproc_tft
ls -ls preproc_tft/metadata
# gsutil ls -l gs://${BUCKET}/analysis/hallelujah-effect/preproc_tft/
# gsutil ls -l gs://${BUCKET}/analysis/hallelujah-effect/preproc_tft/metadata

total 24
 4 -rw-r--r-- 1 root root  3059 Jul  9 02:13 eval-00000-of-00001.gz
 4 drwxr-xr-x 5 root root  4096 Jul  9 02:13 metadata
 4 drwxr-xr-x 3 root root  4096 Jul  9 02:13 tmp
12 -rw-r--r-- 1 root root 11622 Jul  9 02:13 train-00000-of-00001.gz
total 12
4 drwxr-xr-x 3 root root 4096 Jul  9 02:13 rawdata_metadata
4 drwxr-xr-x 3 root root 4096 Jul  9 02:13 transformed_metadata
4 drwxr-xr-x 3 root root 4096 Jul  9 02:13 transform_fn


## Train on Preprocessed Data

### Local Manual Training

In [12]:
MODEL_NAME = 'basic_features'
os.environ['MODEL_NAME'] = MODEL_NAME

In [25]:
%%bash

OUTPUT_DIR=${PWD}/../models/${MODEL_NAME}
rm -rf ${OUTPUT_DIR}
export PYTHONPATH=${PYTHONPATH}:${PWD}/trainer
python -m trainer.task \
   --train_data_paths="${PWD}/preproc_tft/train*" \
   --eval_data_paths="${PWD}/preproc_tft/eval*" \
   --train_steps=25000 \
   --train_batch_size=${TRAIN_N} \
   --eval_steps=1 \
   --output_dir=${OUTPUT_DIR} \
   --job-dir=/tmp \
   --metadata_path="${PWD}/preproc_tft/metadata"

INFO:tensorflow:Using config: {'_save_checkpoints_secs': 90, '_session_config': None, '_keep_checkpoint_max': 10, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f353f970e50>, '_evaluation_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_master': '', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_model_dir': '/content/datalab/notebooks/eim-analysis/Hallelujah_Effect/basic_features/../models/basic_features', '_global_id_in_cluster': 0, '_save_summary_steps': 100}
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after 5 secs (eval_spec.throttle_secs) or training is finished.
Instructions for upda

Results for 25000 batches of 303 cases (25000 training epochs):

- Accuracy = 0.60655737
- Accuracy baseline = 0.6721312
- AUC = 0.5054878
- AUC-PR = 0.30782944
- Average loss = 1.9519312
- F1 score = 0.142857143
- Label/mean = 0.32786885
- Loss = 119.0678
- Precision = 0.25
- Prediction/mean = 0.14571334
- Recall = 0.1
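
A few identities link these figures and allow a quick sanity check. The sketch below assumes the standard definitions: F1 as the harmonic mean of precision and recall, and the accuracy baseline as the majority-class rate. It also assumes that Loss is the sum of the per-example losses over the 61 evaluation examples, evaluated as a single batch; that last point is an inference from the numbers rather than something the training code in this notebook confirms.

```python
# Figures copied from the results list above
eval_n = 61
precision = 0.25
recall = 0.1
label_mean = 0.32786885
average_loss = 1.9519312

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Always predicting the majority class gives the baseline accuracy
accuracy_baseline = max(label_mean, 1 - label_mean)

# Assumption: Loss is the per-example average multiplied by the example count
total_loss = average_loss * eval_n

# Number of positive labels implied by the label mean
positives = int(round(label_mean * eval_n))

print("F1 = {:.6f}".format(f1))
print("Accuracy baseline = {:.6f}".format(accuracy_baseline))
print("Loss = {:.4f}".format(total_loss))
print("Positive labels = {}".format(positives))
```

The recomputed values agree with the reported F1 score, accuracy baseline, and Loss, and the label mean corresponds to 20 positive cases out of 61.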

### Local ML Engine Training

In [31]:
%%bash

OUTPUT_DIR=${PWD}/../models/${MODEL_NAME}
rm -rf $OUTPUT_DIR
gcloud ml-engine local train \
   --module-name=trainer.task \
   --package-path=${PWD}/trainer/trainer \
   --job-dir=$OUTPUT_DIR \
   -- \
   --train_data_paths="${PWD}/preproc_tft/train*" \
   --eval_data_paths="${PWD}/preproc_tft/eval*" \
   --train_steps=25000 \
   --train_batch_size=${TRAIN_N} \
   --eval_steps=1 \
   --output_dir=$OUTPUT_DIR \
   --metadata_path="${PWD}/preproc_tft/metadata/"

INFO:tensorflow:TF_CONFIG environment variable: {u'environment': u'cloud', u'cluster': {}, u'job': {u'args': [u'--train_data_paths=/content/datalab/notebooks/eim-analysis/Hallelujah_Effect/basic_features/preproc_tft/train*', u'--eval_data_paths=/content/datalab/notebooks/eim-analysis/Hallelujah_Effect/basic_features/preproc_tft/eval*', u'--train_steps=25000', u'--train_batch_size=303', u'--eval_steps=1', u'--output_dir=/content/datalab/notebooks/eim-analysis/Hallelujah_Effect/basic_features/../models/basic_features', u'--metadata_path=/content/datalab/notebooks/eim-analysis/Hallelujah_Effect/basic_features/preproc_tft/metadata/', u'--job-dir', u'/content/datalab/notebooks/eim-analysis/Hallelujah_Effect/basic_features/../models/basic_features'], u'job_name': u'trainer.task'}, u'task': {}}
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 90, '_session_config': None, '_keep_checkpoint_max': 10, '_task_type': 'worker'

Results for 25000 batches of 303 cases (25000 training epochs):

- Accuracy = 0.6229508
- Accuracy baseline = 0.6721312
- AUC = 0.49146342
- AUC-PR = 0.3212541
- Average loss = 1.8062559
- F1 score = 0.14814815
- Label/mean = 0.32786885
- Loss = 110.18161
- Precision = 0.2857143
- Prediction/mean = 0.14424987
- Recall = 0.1

### Cloud ML Engine Training

In [13]:
%%bash

OUTPUT_DIR=gs://${BUCKET}/analysis/hallelujah-effect/models/${MODEL_NAME}
JOBNAME=hallelujah_effect$(date -u +%y%m%d_%H%M%S)
echo ${OUTPUT_DIR} ${REGION} ${JOBNAME}
gsutil -m rm -rf ${OUTPUT_DIR}
gcloud ml-engine jobs submit training ${JOBNAME} \
   --region=${REGION} \
   --package-path=${PWD}/trainer/trainer \
   --module-name=trainer.task \
   --job-dir=${OUTPUT_DIR} \
   --scale-tier=STANDARD_1 \
   --runtime-version=1.8 \
   -- \
   --train_data_paths="gs://${BUCKET}/analysis/hallelujah-effect/preproc_tft/train*" \
   --eval_data_paths="gs://${BUCKET}/analysis/hallelujah-effect/preproc_tft/eval*" \
   --output_dir=${OUTPUT_DIR} \
   --train_steps=2500000 \
   --train_batch_size=${TRAIN_N} \
   --eval_steps=1 \
   --metadata_path=gs://${BUCKET}/analysis/hallelujah-effect/preproc_tft/metadata/

gs://eim-muse/analysis/hallelujah-effect/models/basic_features us-central1 hallelujah_effect180710_142657
jobId: hallelujah_effect180710_142657
state: QUEUED


Removing gs://eim-muse/analysis/hallelujah-effect/models/basic_features/checkpoint#1531108730921226...
Removing gs://eim-muse/analysis/hallelujah-effect/models/basic_features/#1531108727262402...
Removing gs://eim-muse/analysis/hallelujah-effect/models/basic_features/eval/#1531105474116837...
Removing gs://eim-muse/analysis/hallelujah-effect/models/basic_features/eval/events.out.tfevents.1531105474.cmle-training-master-8479dedb7c-0-9mhzl#1531108735858257...
Removing gs://eim-muse/analysis/hallelujah-effect/models/basic_features/events.out.tfevents.1531105431.cmle-training-master-8479dedb7c-0-9mhzl#1531108746468766...
Removing gs://eim-muse/analysis/hallelujah-effect/models/basic_features/export/exporter/#1531105478632676...
Removing gs://eim-muse/analysis/hallelujah-effect/models/basic_features/export/#1531105478278103...
Removing gs://eim-muse/analysis/hallelujah-effect/models/basic_features/export/exporter/1531105476/#1531105484593576...
Removing gs://eim-muse/analysis/hallelujah-eff

Results for 2500000 batches of 303 cases (2500000 training epochs):

- Accuracy = 0.590164
- Accuracy baseline = 0.672131
- AUC = 0.470732
- AUC-PR = 0.355228
- Average loss = 4.82224
- Label/mean = 0.327869
- Loss = 294.156
- Prediction/mean = 0.214272
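
Because the evaluation set holds only 61 examples, each accuracy figure corresponds to a small whole number of correct predictions. The sketch below converts the three reported accuracies and the baseline into counts. The run labels are just the section names used in this notebook, and the rounding step assumes each accuracy is a fraction of exactly 61 examples.

```python
eval_n = 61

# Accuracies reported for the three training runs in this notebook
accuracies = [
    ("Local manual", 0.60655737),
    ("Local ML Engine", 0.6229508),
    ("Cloud ML Engine", 0.590164),
]
baseline_accuracy = 0.672131

def to_count(rate, n):
    """Convert a rate into the nearest whole number of examples."""
    return int(round(rate * n))

baseline_correct = to_count(baseline_accuracy, eval_n)
correct_counts = {}
for name, accuracy in accuracies:
    correct_counts[name] = to_count(accuracy, eval_n)
    print("{}: {} of {} correct".format(name, correct_counts[name], eval_n))

print("Majority-class baseline: {} of {} correct".format(baseline_correct, eval_n))
print("One example changes accuracy by {:.4f}".format(1.0 / eval_n))
```

Read this way, the three runs differ from each other by one or two evaluation examples, and all of them fall short of the majority-class baseline.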

## View Results in TensorBoard

In [34]:
from google.datalab.ml import TensorBoard
TensorBoard().start('gs://eim-muse/analysis/hallelujah-effect/models')

4784



In [35]:
TensorBoard.stop(4784)

