<a href="https://colab.research.google.com/github/cedoard/fine_tuned_bert/blob/master/notebooks/AlBERTo_End_to_End_(Fine_tuning_%2B_Predicting)_with_Cloud_TPU_Sentence_Classification_Tasks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Original code licensed by:
# Copyright 2018 The TensorFlow Hub Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

# AlBERTo End to End (Fine-tuning + Predicting) with Cloud TPU

## Overview

**BERT**, or **B**idirectional **E**ncoder **R**epresentations from **T**ransformers, is a method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks. The academic paper can be found here: https://arxiv.org/abs/1810.04805.

In particular, we use this notebook to fine-tune **AlBERTo**, the first Italian language understanding model for Twitter language.

**Note:** You will need a GCP (Google Cloud Platform) account and a GCS (Google Cloud Storage) bucket for this Colab to run.

Please follow the [Google Cloud TPU quickstart](https://cloud.google.com/tpu/docs/quickstart) to create a GCP account and a GCS bucket.

You can learn more about Cloud TPU at https://cloud.google.com/tpu/docs.

## Instructions

<h3><a href="https://cloud.google.com/tpu/"><img valign="middle" src="https://raw.githubusercontent.com/GoogleCloudPlatform/tensorflow-without-a-phd/master/tensorflow-rl-pong/images/tpu-hexagon.png" width="50"></a>  &nbsp;&nbsp;Train on TPU</h3>

   1. Create a Cloud Storage bucket for your TensorBoard logs at http://console.cloud.google.com/storage and fill in the BUCKET parameter in the "Parameters" section below.
 
   1. On the main menu, click Runtime and select **Change runtime type**. Set "TPU" as the hardware accelerator.

### Install/Import required modules

In [None]:
!pip install tensorflow==1.14.0

!pip install gcsfs 
!pip install fsspec 

!pip install ekphrasis
#!pip install pandas
#!pip install numpy

Collecting tensorflow==1.14.0
  Downloading https://files.pythonhosted.org/packages/de/f0/96fb2e0412ae9692dbf400e5b04432885f677ad6241c088ccc5fe7724d69/tensorflow-1.14.0-cp36-cp36m-manylinux1_x86_64.whl (109.2MB)
     |████████████████████████████████| 109.2MB 86kB/s 
Collecting keras-applications>=1.0.6
  Downloading https://files.pythonhosted.org/packages/71/e3/19762fdfc62877ae9102edf6342d71b28fbfd9dea3d2f96a882ce099b03f/Keras_Applications-1.0.8-py3-none-any.whl (50kB)
     |████████████████████████████████| 51kB 6.5MB/s 
Collecting tensorflow-estimator<1.15.0rc0,>=1.14.0rc0
  Downloading https://files.pythonhosted.org/packages/3c/d5/21860a5b11caf0678fbc8319341b0ae21a07156911132e0e71bffed0510d/tensorflow_estimator-1.14.0-py2.py3-none-any.whl (488kB)
     |████████████████████████████████| 491kB 43.2MB/s 
Collecting tensorboard<1.15.0,>=1.14.0
  Downloading https://files.pythonhosted.org/packages/91/2d/2ed263449a078cd9c8a9ba50ebd50123adf1f8cfbea1492f908

In [None]:
import datetime
import json
import os
import pprint
import random
import string
import sys
import tensorflow as tf

#PREPARE TRAINING SENTENCES
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons
import pandas as pd
import numpy as np
import re




### Set up your TPU environment


Google Cloud Shell commands (see: [Google Cloud TPU quickstart](https://cloud.google.com/tpu/docs/quickstart))

`export PROJECT_ID=reberting`

`gcloud config set project $PROJECT_ID`

`gsutil mb -p ${PROJECT_ID} -c standard -l us-central1 -b on gs://bucket-rebert`

```
ctpu up --project=${PROJECT_ID} \
 --zone=us-central1-b \
 --tf-version=1.14 \
 --name=tpu-alberto
```






In this section, you perform the following tasks:

*   Set up a Colab TPU running environment
*   Verify that you are connected to a TPU device
*   Upload your credentials to TPU to access your GCS bucket.

In [None]:
assert 'COLAB_TPU_ADDR' in os.environ, 'ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!'
TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']
print('TPU address is', TPU_ADDRESS)

from google.colab import auth
auth.authenticate_user()

with tf.Session(TPU_ADDRESS) as session:
  print('TPU devices:')
  pprint.pprint(session.list_devices())

  # Upload credentials to TPU.
  with open('/content/adc.json', 'r') as f:
    auth_info = json.load(f)
  tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
  # Now credentials are set for all future sessions on this TPU.

TPU address is grpc://10.124.11.2:8470
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

TPU devices:
[_DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:CPU:0, CPU, -1, 11138662374282293171),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 8589934592, 7478276998692968534),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 17842271090012816987),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 10636817016454091163),
 _DeviceAttributes(/job:tpu_worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 14874749825002370954),
 _DeviceAttributes(/job:tpu_worker/rep

### Prepare and import BERT modules

With your environment configured, you can now prepare and import the BERT modules. The following step clones the source code from GitHub and imports the modules from it.


In [None]:
!test -d bert_repo || git clone https://github.com/google-research/bert bert_repo
if 'bert_repo' not in sys.path:
  sys.path += ['bert_repo']

# import python modules defined by BERT
from run_classifier import *
import modeling
import optimization
import tokenization

Cloning into 'bert_repo'...
remote: Enumerating objects: 340, done.
remote: Total 340 (delta 0), reused 0 (delta 0), pack-reused 340
Receiving objects: 100% (340/340), 315.49 KiB | 4.15 MiB/s, done.
Resolving deltas: 100% (185/185), done.



### Define Path

In [None]:
from google.colab import drive
drive.mount('/content/drive')

DATA_PATH = "/content/drive/My Drive/Colab Notebooks/data/"
MODEL_PATH = "/content/drive/My Drive/Colab Notebooks/model/"

Mounted at /content/drive


In [None]:
TASK = 'SENTIPOLC_TASK3' #@param {type:"string"}
BUCKET = 'bucket-rebert' #@param {type:"string"}
INIT_MODEL = 'alberto_model.ckpt'

assert BUCKET, 'Must specify an existing GCS bucket name'
BUCKET_DIR = 'gs://{}'.format(BUCKET)

OUTPUT_DIR = 'gs://{}/{}/models/'.format(BUCKET, TASK)
tf.gfile.MakeDirs(OUTPUT_DIR)
print('***** Model output directory: {} *****'.format(OUTPUT_DIR))

#CONFIGURE AlBERTo MODEL
BERT_CONFIG_FILE = 'gs://{}/config.json'.format(BUCKET) #@param {type:"string"}
VOCAB_FILE = 'gs://{}/vocab.txt'.format(BUCKET) #@param {type:"string"}
#VOCAB_FILE = os.path.join(DATA_PATH,"vocab.txt")

tf.gfile.MakeDirs('gs://{}/alberto_model'.format(BUCKET))
INIT_CHECKPOINT = 'gs://{}/alberto_model/{}'.format(BUCKET,INIT_MODEL) #@param {type:"string"}
#INIT_CHECKPOINT = os.path.join(MODEL_PATH,'SENTIPOLC_TASK2_NEG_N','model.ckpt-44')


***** Model output directory: gs://bucket-rebert/SENTIPOLC_TASK3/models/ *****


### Initialize BERT hyperparams and initialize TPU config.

In [None]:
#SET THE PARAMETERS
TRAIN_BATCH_SIZE = 512
PREDICT_BATCH_SIZE = 512
EVAL_BATCH_SIZE = 512
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 10.0
MAX_SEQ_LENGTH = 128
# Warmup is a period at the start of training during which the learning rate
# is small and gradually increases; it usually helps training.
WARMUP_PROPORTION = 0.1
# Model configs
SAVE_CHECKPOINTS_STEPS = 1000
SAVE_SUMMARY_STEPS = 500

# Setup TPU related config
tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)
NUM_TPU_CORES = 8
ITERATIONS_PER_LOOP = 1000

def get_run_config(output_dir):
  return tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=output_dir,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=ITERATIONS_PER_LOOP,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))
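
The relationship between the global batch size and the TPU cores configured above can be sketched as follows. This is a minimal illustration, assuming the global batch is split evenly across `NUM_TPU_CORES` shards; `per_core_batch_size` is a hypothetical helper, not part of TensorFlow or BERT.

```python
def per_core_batch_size(global_batch_size, num_cores):
    """Return how many examples each core would see per step.

    Hypothetical helper: assumes the global batch is divided evenly
    across the cores, so the batch size must be a multiple of num_cores.
    """
    if global_batch_size % num_cores != 0:
        raise ValueError(
            "batch size %d is not divisible by %d cores"
            % (global_batch_size, num_cores))
    return global_batch_size // num_cores


# With the values used in this notebook (512 examples, 8 cores):
print(per_core_batch_size(512, 8))  # 64
```

Checking divisibility up front is cheaper than discovering a shape mismatch after the TPU has been initialized.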


In [None]:
#Labels used for annotating sentences
label_list = [0, 1]

#Initialize BERT tokenizer
tokenizer = tokenization.FullTokenizer(VOCAB_FILE, do_lower_case=True)
tokenizer.tokenize("dovevo arrivare in università luiss e si è spento perché è entrato nella zona in cui non può più circolare ma va va")




['dovevo',
 'arrivare',
 'in',
 'universita',
 'luiss',
 'e',
 'si',
 'e',
 'spento',
 'perche',
 'e',
 'entrato',
 'nella',
 'zona',
 'in',
 'cui',
 'non',
 'puo',
 'piu',
 'circolare',
 'ma',
 'va',
 'va']
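
The tokens above come back without accents (`università` becomes `universita`). A likely explanation is that, with `do_lower_case=True`, the tokenizer lowercases the text and strips combining accent marks. The sketch below approximates that normalization with the standard library only; `strip_accents` is an illustrative stand-alone function, not the BERT implementation itself.

```python
import unicodedata


def strip_accents(text):
    """Lowercase text and drop combining accent marks.

    Illustrative approximation: decompose each character (NFD) and
    discard the combining marks (Unicode category "Mn").
    """
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents("università"))  # universita
print(strip_accents("Perché è"))    # perche e
```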

### Prepare the training data

In [None]:
text_processor = TextPreProcessor (
    # terms that will be normalized
    normalize=[ 'url' , 'email', 'user', 'percent', 'money', 'phone', 'time', 'date', 'number'] ,
    # terms that will be annotated
    annotate={"hashtag"} ,
    fix_html=True ,  # fix HTML tokens

    unpack_hashtags=True ,  # perform word segmentation on hashtags

    # select a tokenizer. You can use SocialTokenizer, or pass your own
    # the tokenizer, should take as input a string and return a list of tokens
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
    dicts = [ emoticons ]
)

def ekphrasis_preprocess(text_processor, s):
  s = s.lower()
  s = str(" ".join(text_processor.pre_process_doc(s)))
  s = re.sub(r"[^a-zA-ZÀ-ú</>!?♥♡\s\U00010000-\U0010ffff]", ' ', s)
  s = re.sub(r"\s+", ' ', s)
  s = re.sub(r'(\w)\1{2,}',r'\1\1', s)
  s = re.sub ( r'^\s' , '' , s )
  s = re.sub ( r'\s$' , '' , s )
  return s

Word statistics files not found!
Downloading... done!
Unpacking... done!
Reading english - 1grams ...
generating cache file for faster loading...
reading ngrams /root/.ekphrasis/stats/english/counts_1grams.txt
Reading english - 2grams ...
generating cache file for faster loading...
reading ngrams /root/.ekphrasis/stats/english/counts_2grams.txt
Reading english - 1grams ...
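
To see what the regular-expression steps in `ekphrasis_preprocess` do on their own, the sketch below reproduces only those steps with the standard library, leaving out the ekphrasis normalization. `regex_cleanup` is a hypothetical name used for illustration.

```python
import re


def regex_cleanup(s):
    """Apply only the regex steps of the preprocessing (no ekphrasis)."""
    s = s.lower()
    # Replace every character outside the allowed set with a space.
    s = re.sub(r"[^a-zA-ZÀ-ú</>!?♥♡\s\U00010000-\U0010ffff]", " ", s)
    # Collapse runs of whitespace into a single space.
    s = re.sub(r"\s+", " ", s)
    # Shorten characters repeated three or more times to two.
    s = re.sub(r"(\w)\1{2,}", r"\1\1", s)
    # Drop leading and trailing whitespace.
    return s.strip()


print(regex_cleanup("Ciaoooo!!!   bello 123"))  # ciaoo!!! bello
print(regex_cleanup("costa <number> euro"))     # costa <number> euro
```

Digits are removed because they are outside the allowed character set, while placeholder tokens such as `<number>` survive because `<` and `>` are allowed.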


In [None]:
#LOAD TRAINING AND TEST DATA
training_data = pd.read_excel(os.path.join(BUCKET_DIR,'rev_df_final.xlsx'))
training_data = training_data.loc[~training_data.sentiment.isin(['NEUTRAL'])]
training_data = training_data.dropna().reset_index(drop=True)

sentences = training_data.iloc[:,0]
labels_str = training_data.iloc[:,-1]
print(len(sentences))
print(labels_str.nunique())

training_data.head()

7306
2


Unnamed: 0,comment,date,keywords,rating,username,sentiment
0,Rasentiamo il ridicolo. Il servizio e l'idea s...,2020-11-05 15:34:53,helbiz,2,Roberto Spinelli,NEG
1,Trovo ottima questa iniziativa . Purtroppo le ...,2020-11-03 11:25:06,helbiz,4,graziano Qutro,POS
2,"Monopattini regolarmente parcheggiati, ma che ...",2020-10-28 17:07:03,helbiz,1,Fabio C,NEG
3,Dopo aver messo due (letteralmente due) bici a...,2020-10-21 20:50:21,helbiz,1,Antonio Casto,NEG
4,Dopo i primi mesi in cui il servizio era relat...,2020-11-01 12:13:48,helbiz,1,Franco Papalia,NEG


In [None]:
#PREPROCESS TRAINING AND TEST DATA
def func(row):
    # Map the sentiment string to its integer class id.
    if row == 'POS':
        return 1
    elif row =='NEG':
        return 0

labels = [func(x) for x in labels_str]
print(labels[:20])

# Pair each label with its preprocessed sentence.
sentences_filtered = [[label, ekphrasis_preprocess(text_processor, s)]
                      for label, s in zip(labels, sentences)]

# Shuffle in place, then use the first 80% for training and the rest for testing.
np.random.shuffle(sentences_filtered)
split = int(len(sentences_filtered)*0.8)
sentences_filtered_train, sentences_filtered_test = sentences_filtered[:split], sentences_filtered[split:]

print(len(sentences_filtered),len(sentences_filtered_train),len(sentences_filtered_test))

[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1]
7306 5844 1462
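
The shuffle above is not seeded, so the train/test split changes on every run. If a reproducible split is wanted, one option is sketched below with the standard library; `train_test_split` and its `seed` parameter are hypothetical names introduced for illustration, not something the notebook defines.

```python
import random


def train_test_split(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of items with a fixed seed and split it in two."""
    shuffled = list(items)
    # A dedicated Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    split = int(len(shuffled) * train_fraction)
    return shuffled[:split], shuffled[split:]


train, test = train_test_split(range(10))
print(len(train), len(test))  # 8 2
```

Calling the function again with the same seed returns the same split, which makes results comparable between runs.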


In [None]:
'''
We need to transform our data into a format BERT understands. This involves two steps. First, we create InputExample objects using the constructor provided in the BERT library. Second, we convert them to features with convert_examples_to_features.

    text_a is the text we want to classify, which in this case is the preprocessed review text.
    text_b is used if we're training a model to understand the relationship between sentences (e.g. is text_b a translation of text_a? Is text_b an answer to the question asked by text_a?). This doesn't apply to our task, so we can leave text_b blank.
    label is the label for our example, i.e. 0 (NEG) or 1 (POS). Test examples get a placeholder label of 0.

'''

f = lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this example
                           text_a = x[1], 
                           text_b = None, 
                           label = int(x[0])
                           )

f2 = lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this example
                            text_a = x[1], 
                            text_b = None, 
                            label = 0
                            )

train_examples = map(f,sentences_filtered_train)
train_examples = list(train_examples)
train_examples = np.array(train_examples)

test_examples = map(f2,sentences_filtered_test)
test_examples = list(test_examples)
test_examples = np.array(test_examples)

print(test_examples.shape)
print(train_examples.shape)

(1462,)
(5844,)


In [None]:
#Test data just created
for r in test_examples[:10]:
  print(r.text_a)

totalmente sconsigliata ! ció che mi é successo stamattina ha del comico dovevo prendere il treno in centrale alle <number> e <number> noleggio il motorino alle <number> e <number> ma l app non funziona rimane bloccata alla schermata di noleggio dopo <number> minuti fermo la dove gia immaginavo di perdere il treno a causa di mimoto passa il tram lo vedo e corro a prenderlo dopo <number> minuti che sono sul tram nimoto da sola inizia il noleggio ed apre il bauletto del motorino ho dovuto pagare <number> euro più penali ridicoli
ottimo servizio a milano e ottima la possibilità di usarlo in due con due caschi ! codice dxsjm per <money> gratis
great ! the best would be if you integrate a navigator in the map
bella l app bello il servizio che funziona ed è molto comodo a roma se volete <number> minuti gratis questo è il codice hcgsx
un app fantastica e anche l idea lo è altrettanto ! grazie a lime ci si può muovere più velocemente tra le strade di torino e altre città senza troppo ingombro 

In [None]:
'''
The "convert_examples_to_features" method creates the features given as input to the BERT network:
  - it returns an array of "InputFeatures" objects
  - "InputFeatures" has the following attributes:
          - input_ids
          - input_mask
          - segment_ids
          - label_id

'''

train_features = convert_examples_to_features(
      train_examples, label_list, MAX_SEQ_LENGTH, tokenizer)


INFO:tensorflow:Writing example 0 of 5844
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: None
INFO:tensorflow:tokens: [CLS] rk ##ur a codice sconto molto comodo al centro di roma da evitare i s pietri ##ni privilegi ##are strade asfaltate [SEP]
INFO:tensorflow:input_ids: 2 51760 7015 14 2242 2062 156 3727 55 631 12 65 45 2140 31 164 44508 909 7476 4300 2334 83887 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

In [None]:
print(train_examples[0].text_a)
print(train_features[0].input_ids)
print(train_features[0].input_mask)
print(train_features[0].segment_ids)
print(train_features[0].label_id)

rkur a codice sconto molto comodo al centro di roma da evitare i s pietrini privilegiare strade asfaltate
[2, 51760, 7015, 14, 2242, 2062, 156, 3727, 55, 631, 12, 65, 45, 2140, 31, 164, 44508, 909, 7476, 4300, 2334, 83887, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
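
The shape of the features printed above can be reproduced with a small stand-alone sketch. It assumes a toy vocabulary (the ids for `[CLS]`, `[SEP]`, and padding follow what the output above shows, the word ids are made up) and is not the BERT library's own `convert_examples_to_features`; `build_features` is a hypothetical helper.

```python
def build_features(tokens, vocab, max_seq_length):
    """Build padded input_ids, input_mask and segment_ids for one sentence."""
    # Reserve two positions for the [CLS] and [SEP] markers.
    tokens = ["[CLS]"] + tokens[: max_seq_length - 2] + ["[SEP]"]
    input_ids = [vocab[token] for token in tokens]
    # The mask is 1 for real tokens and 0 for padding.
    input_mask = [1] * len(input_ids)
    num_padding = max_seq_length - len(input_ids)
    input_ids += [0] * num_padding
    input_mask += [0] * num_padding
    # Single-sentence classification uses segment 0 everywhere.
    segment_ids = [0] * max_seq_length
    return input_ids, input_mask, segment_ids


# Toy vocabulary: the word ids here are invented for the example.
toy_vocab = {"[PAD]": 0, "[CLS]": 2, "[SEP]": 3, "ottimo": 10, "servizio": 11}
ids, mask, segments = build_features(["ottimo", "servizio"], toy_vocab, 8)
print(ids)       # [2, 10, 11, 3, 0, 0, 0, 0]
print(mask)      # [1, 1, 1, 1, 0, 0, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```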

### Fine-tune pretrained BERT Model

This section demonstrates fine-tuning from the pre-trained AlBERTo checkpoint and running predictions.


In [None]:
BERT_CONFIG = modeling.BertConfig.from_json_file(BERT_CONFIG_FILE)

#initialize parameters
num_train_steps = int(len(train_examples) / TRAIN_BATCH_SIZE * NUM_TRAIN_EPOCHS)+1
# Warm up over a fraction of the total training steps (not of the epochs).
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)
print(num_train_steps)
print(num_warmup_steps)

115
11
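
How the warmup interacts with the learning rate can be sketched in plain Python. The sketch assumes a linear ramp from zero during warmup followed by a linear decay to zero at the last step, which is how the schedule in the BERT repository's `optimization.py` appears to behave; `learning_rate_at` is a hypothetical helper, and the step counts in the usage lines assume warmup is taken as a fraction of the total training steps.

```python
def learning_rate_at(step, init_lr, num_train_steps, num_warmup_steps):
    """Approximate learning rate at a given global step.

    Assumed schedule: linear warmup, then linear decay to zero.
    """
    if num_warmup_steps and step < num_warmup_steps:
        # Ramp up proportionally to the fraction of warmup completed.
        return init_lr * step / num_warmup_steps
    # Linear decay: full rate at step 0, zero at num_train_steps.
    progress = min(step, num_train_steps) / num_train_steps
    return init_lr * max(0.0, 1.0 - progress)


total_steps = 115
warmup_steps = int(total_steps * 0.1)
for step in (0, 5, warmup_steps, 60, total_steps):
    print(step, learning_rate_at(step, 2e-5, total_steps, warmup_steps))
```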


In [None]:
model_fn = model_fn_builder(
  bert_config=BERT_CONFIG,
  num_labels=len(label_list),
  init_checkpoint=INIT_CHECKPOINT,
  learning_rate=LEARNING_RATE,
  num_train_steps=num_train_steps,
  num_warmup_steps=num_warmup_steps,
  use_tpu=True,
  use_one_hot_embeddings=True
)

estimator = tf.contrib.tpu.TPUEstimator(
  use_tpu=True,
  model_fn=model_fn,
  config=get_run_config(OUTPUT_DIR),
  train_batch_size=TRAIN_BATCH_SIZE,
  eval_batch_size=EVAL_BATCH_SIZE,
  predict_batch_size=PREDICT_BATCH_SIZE,
)


INFO:tensorflow:Using config: {'_model_dir': 'gs://bucket-rebert/SENTIPOLC_TASK3/models/', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.96.217.26:8470"
    }
  }
}
isolate_session_state: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f478e0245f8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.96.217.26:8470', '_evaluation_master': 'grpc://10.96.217.26:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas':

In [None]:
print("***** Running training *****")
print("  Num examples = %d" % len(train_examples))
print("  Num labels = %d" % len(label_list))
print("  Batch size = %d" % TRAIN_BATCH_SIZE)
print("  Num steps = %d" % num_train_steps)

train_input_fn = input_fn_builder(
    features=train_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=True,
    drop_remainder=True)

print('***** Started training at {} *****'.format(datetime.datetime.now()))
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
print('***** Finished training at {} *****'.format(datetime.datetime.now()))


***** Running training *****
  Num examples = 5844
  Num labels = 2
  Batch size = 512
  Num steps = 115
***** Started training at 2020-12-20 22:32:29.806059 *****
INFO:tensorflow:Skipping training since max_steps has already saved.
INFO:tensorflow:training_loop marked as finished
***** Finished training at 2020-12-20 22:32:30.550843 *****


### Save Model

In [None]:
#SAVE MODEL TO PB FORMAT

EXPORT_PATH_MODEL = os.path.join(BUCKET_DIR,'model_alberto_addestrato')
#TODO: try removing None
def serving_input_fn():
    label_ids = tf.placeholder(tf.int32, [None], name='label_ids')
    input_ids = tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH], name='input_ids')
    input_mask = tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH], name='input_mask')
    segment_ids = tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH], name='segment_ids')
    input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({
        'label_ids': label_ids,
        'input_ids': input_ids,
        'input_mask': input_mask,
        'segment_ids': segment_ids,
    })()
    return input_fn

estimator._export_to_tpu = False
estimator.export_saved_model(EXPORT_PATH_MODEL, serving_input_receiver_fn=serving_input_fn)

In [None]:
!gsutil cp -r \
  gs://bucket-rebert/model_alberto_addestrato/1608503656/ \
  "/content/drive/My Drive/Colab Notebooks/model"


In [None]:
!saved_model_cli show --dir 'gs://bucket-rebert/model_alberto_addestrato/1608504002' --all


MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['input_ids'] tensor_info:
        dtype: DT_INT32
        shape: (-1, -1)
        name: input_ids_1:0
    inputs['input_mask'] tensor_info:
        dtype: DT_INT32
        shape: (-1, -1)
     

## Make predictions from loaded model

### Load Model

In [None]:
#LOAD MODEL
#LOAD_PATH = os.path.join(MODEL_PATH,'model_alberto_addestrato.h5','1608370941')
#LOAD_PATH_PB = os.path.join(MODEL_PATH,'model_alberto_addestrato.h5','1608370941','saved_model.pb')
LOAD_PATH_GCP = os.path.join(BUCKET_DIR,'model_alberto_addestrato','1608503656')
LOAD_PATH_GCP_PB = os.path.join(BUCKET_DIR,'model_alberto_addestrato.h5','1608503656','saved_model.pb')
print(LOAD_PATH_GCP)

gs://bucket-rebert/model_alberto_addestrato/1608503656


In [None]:
from tensorflow.contrib import predictor

predict_fn = predictor.from_saved_model(LOAD_PATH_GCP)

Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.loader.load or tf.compat.v1.saved_model.load. There will be a new function for importing SavedModels in Tensorflow 2.0.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from gs://bucket-rebert/model_alberto_addestrato/1608503656/variables/variables


### Make Predictions

In [None]:
def convert_single_string_to_input_dict(tokenizer, example_string_prep, max_seq_length=MAX_SEQ_LENGTH):
  # Tokenize and truncate so that [CLS] and [SEP] still fit.
  token_a = tokenizer.tokenize(example_string_prep)[:max_seq_length - 2]

  tokens = ["[CLS]"] + token_a + ["[SEP]"]
  segment_ids = [0] * len(tokens)

  input_ids = tokenizer.convert_tokens_to_ids(tokens)
  input_mask = [1] * len(input_ids)

  # Zero-pad every sequence up to max_seq_length.
  while len(input_ids) < max_seq_length:
    input_ids.append(0)
    input_mask.append(0)
    segment_ids.append(0)

  label_id = [0]
  # The second, all-zero row pads the batch to two examples.
  padding = [0] * max_seq_length
  print(len(input_ids),len(input_mask),len(segment_ids),len(label_id))
  return {"input_ids":[input_ids,padding], "input_mask":[input_mask,padding], "segment_ids":[segment_ids,padding], "label_ids":label_id}

def predict(tokenizer, predict_fn, input_str, max_seq_length):
    # CONVERT DATA TO FEATURES
    example_prep = ekphrasis_preprocess(text_processor, input_str)
    example_features = convert_single_string_to_input_dict(tokenizer=tokenizer,
                                                       example_string_prep=example_prep,
                                                       max_seq_length=max_seq_length)

    # Keep only the first row; the second row is the padding example.
    prediction = predict_fn(example_features)['probabilities'][0]
    prediction_dict = {'POS': round(float(prediction[1]),4), 'NEG': round(float(prediction[0]),4)}
    print(f"prediction: {prediction_dict}")
    return prediction

In [None]:
# MODEL PREDICTIONS
example_sent_neg = "brutto e cattivo, sono veramente triste mi vorrei uccidere la mia vita non ha senso è terribile male male"
example_sent_pos = "sono euforico, mi piace così tanto che sono felice solo di poter essere vivo e poter prendere il monopattino per raggiungere l'apice della mia felicità"

print(predict(tokenizer, predict_fn, example_sent_neg, MAX_SEQ_LENGTH))
print(predict(tokenizer, predict_fn, example_sent_pos, MAX_SEQ_LENGTH))
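
The probabilities returned by the predictor can be turned into a label with a small helper. The sketch below assumes the class order used earlier in the notebook (index 0 is `NEG`, index 1 is `POS`); `label_from_probabilities` is a hypothetical function, not part of BERT or TensorFlow.

```python
def label_from_probabilities(probabilities, labels=("NEG", "POS")):
    """Return the most likely label and its rounded probability."""
    # Index of the largest probability (an argmax without NumPy).
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best], round(float(probabilities[best]), 4)


print(label_from_probabilities([0.1, 0.9]))    # ('POS', 0.9)
print(label_from_probabilities([0.75, 0.25]))  # ('NEG', 0.75)
```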

In [None]:
#PREDICTION TEST 1
# Token ids of one example sentence, including [CLS] (2) and [SEP] (3).
token_ids = [2, 337, 7855, 13, 32584, 49470, 29, 16, 232, 492, 122, 93, 811, 452, 12, 38, 204, 16, 56481, 30, 53, 1754, 14, 931, 60, 786, 3516, 815, 53, 3153, 12, 24973, 3]
n_pad = MAX_SEQ_LENGTH - len(token_ids)
input_dict = {
    "input_ids": token_ids + [0] * n_pad,
    "input_mask": [1] * len(token_ids) + [0] * n_pad,
    "segment_ids": [0] * MAX_SEQ_LENGTH,
    "label_ids": [0],
}

print(input_dict)

!saved_model_cli run \
    --dir 'gs://bucket-rebert/model_alberto_addestrato/1608503656' \
    --tag_set serve \
    --signature_def predict \
    --input_exprs '"instances":[{"examples":{"input_ids":[2, 337, 7855, 13, 32584, 49470, 29, 16, 232, 492, 122, 93, 811, 452, 12, 38, 204, 16, 56481, 30, 53, 1754, 14, 931, 60, 786, 3516, 815, 53, 3153, 12, 24973, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],"input_mask": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],"segment_ids": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],"label_ids": [0]}}]'

input=[{"input_ids":[2, 337, 7855, 13, 32584, 49470, 29, 16, 232, 492, 122, 93, 811, 452, 12, 38, 204, 16, 56481, 30, 53, 1754, 14, 931, 60, 786, 3516, 815, 53, 3153, 12, 24973, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],"input_mask": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],"segment_ids": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [None]:
#PREDICTION TEST 2
example_input = "This is the input string"
example1 = ekphrasis_preprocess(text_processor, example_input)
example2 = InputExample(guid=None,text_a = example1,text_b = None,label = 0)
example3 = convert_single_example(0,example2, label_list, MAX_SEQ_LENGTH, tokenizer)
print(example3)


INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: None
INFO:tensorflow:tokens: [CLS] this is the input string [SEP]
INFO:tensorflow:input_ids: 2 1869 721 291 43049 42225 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorf

In [None]:
#SAVE RESULTS TO THE BUCKET AND PRINT THEM
output_eval_file = os.path.join(OUTPUT_DIR, "alberto_sentipolc16_task3_results.tsv")
with tf.gfile.GFile(output_eval_file, "w") as writer:
  print("***** Results *****")
  for example, prediction, id in zip(sentences_test, predictions, test_ids):
    print('\t prediction:%s \t id:%s \t text_a: %s' % ( np.argmax(prediction['probabilities']),str(id),str(example) ) )
    writer.write("%s \t %s\n" % (str(id), np.argmax(prediction['probabilities'])) )