# **USE CASE 3.** Sentiment analysis in TFF

## Required libraries and configuration


Import the required libraries.

In [1]:
import os

# Reduce TensorFlow's C++ log verbosity (hide INFO and WARNING messages).
# This must be set before TensorFlow is imported to take effect.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import collections
import random
import re

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_federated as tff
import tensorflow_datasets as tfds
import tensorflow_hub as hub

from tensorflow_federated.python.learning.algorithms import build_unweighted_fed_avg, build_fed_eval
from tensorflow.keras import models, layers, losses, metrics, optimizers



Define the simulation parameters: the number of clients in the federated scenario, the number of federated rounds, the number of local epochs each client runs before communicating with the server, and the batch size used during training.

In [2]:
# Some parameters
NUM_CLIENTS = 10 # Number of clients in the federated scenario
NUM_ROUNDS = 10 # Number of learning rounds in the federated computation
NUM_EPOCHS = 5 # Number of epochs that the local dataset is seen each round
BATCH_SIZE = 20 # Batch size for training phase
SHUFFLE_BUFFER = 1000 # For dataset shuffling

# Define the seed for random numbers
seed = 10
np.random.seed(seed)
tf.random.set_seed(seed)
tf.keras.utils.set_random_seed(seed)

## Methods for text processing

The following method cleans each tweet by removing URLs, since they do not provide any valuable sentiment information. Other characters that carry little information are also removed, such as punctuation marks, extra whitespace and numbers. Finally, so that all text is treated equally, every character is converted to lowercase.

In [3]:
def text_processing(tweet):
    if isinstance(tweet, bytes):
        tweet = tweet.decode('utf-8')
        
    # remove https links
    clean_tweet = re.sub(r'http\S+', '', tweet)
    
    # remove punctuation marks
    punctuation = '!"#$%&()*+-/:;<=>?@[\\]^_`{|}~'
    clean_tweet = ''.join(ch for ch in clean_tweet if ch not in set(punctuation))
    
    # replace digits with spaces (raw string avoids an invalid-escape warning)
    clean_tweet = re.sub(r'\d', ' ', clean_tweet)
    
    # remove whitespaces
    clean_tweet = ' '.join(clean_tweet.split())
    
    # convert text to lowercase
    clean_tweet = clean_tweet.lower()
    
    return clean_tweet
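
As a quick check of what this cleaning does, the sketch below restates the same steps in a compact, self-contained form. The name `clean_tweet_sketch` and the sample tweets are illustrative and not part of the notebook. Note that the punctuation string used here does not contain commas, periods or apostrophes, so those characters survive the cleaning.

```python
import re

# Same characters as the notebook's `punctuation` string.
PUNCTUATION = set('!"#$%&()*+-/:;<=>?@[\\]^_`{|}~')

def clean_tweet_sketch(tweet):
    """Compact restatement of the cleaning steps (illustrative name)."""
    tweet = re.sub(r'http\S+', '', tweet)                         # drop URLs
    tweet = ''.join(ch for ch in tweet if ch not in PUNCTUATION)  # drop listed punctuation
    tweet = re.sub(r'\d', ' ', tweet)                             # digits -> spaces
    return ' '.join(tweet.split()).lower()                        # squeeze spaces, lowercase

print(clean_tweet_sketch("Loving the new phone!!! 100% worth it :) http://example.com #happy"))
# -> loving the new phone worth it happy
print(clean_tweet_sketch("Can't wait, really."))
# -> can't wait, really.
```

Building the set once, outside the per-character loop, avoids recreating it for every character of every tweet.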

## Loading and preparing the input data

The Sentiment140 dataset is not available in TFF, so it needs to be loaded from another source, such as the tfds (TensorFlow Datasets) library. It is then adapted to the TFF format so that it can be used to train a model with TFF.

Note: We download the full Sentiment140 dataset but, to keep the running time of the experiments reasonable, we use only a portion of it. In this notebook we select just the first 1% of the training split and the first 10% of the test split.

In [4]:
sent140 = tfds.load('sentiment140', split=['train[:1%]', 'test[:10%]'])
sent140_train, sent140_test = sent140[0], sent140[1]

In [5]:
# Print size of the training set, i.e., number of instances
len(sent140_train)

16000

Select the text and polarity columns from the original data, and transform them into a dataframe so they can be used later in TFF.

At this point, we do not select the 'user' column: we will create random IID partitions, so it is not necessary. To experiment with a non-IID partition instead, keep the 'user' column and use it as the client identifier in place of the random client ID assigned below.

In [6]:
# Transform the data to a dataframe; .copy() gives an independent dataframe,
# so the column assignments below do not trigger a SettingWithCopyWarning
sent140_train_df = tfds.as_dataframe(sent140_train)[['text', 'polarity']].copy()

# Preprocess and clean text with previously defined method
sent140_train_df['text'] = sent140_train_df['text'].apply(text_processing)


In [7]:
# Create a random list of ids. Each instance is given a random id, indicating the client it will be assigned to
ids_train = [i for i in range(NUM_CLIENTS) for _ in range(len(sent140_train)//NUM_CLIENTS)]
random.Random(seed).shuffle(ids_train)

# Add the id assignment to the dataframe
sent140_train_df['user'] = ids_train
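
The list comprehension above produces exactly `(len(sent140_train) // NUM_CLIENTS) * NUM_CLIENTS` ids, so it matches the dataframe length only when the number of instances is a multiple of `NUM_CLIENTS` (as is the case for 16000 instances and 10 clients). A minimal sketch of a variant that also covers the remainder is shown below; the helper name `assign_clients_iid` is hypothetical and not part of the notebook.

```python
import random
from collections import Counter

def assign_clients_iid(num_examples, num_clients, seed):
    """Return one client id per example, balanced up to a remainder."""
    # Round-robin ids keep the list the same length as the data even when
    # num_examples is not a multiple of num_clients.
    ids = [i % num_clients for i in range(num_examples)]
    # Shuffling with a seeded generator makes the assignment random but reproducible.
    random.Random(seed).shuffle(ids)
    return ids

ids = assign_clients_iid(num_examples=16000, num_clients=10, seed=10)
counts = Counter(ids)
print(len(ids), min(counts.values()), max(counts.values()))
```

With 16000 examples and 10 clients every client receives 1600 examples; with a non-divisible size, client sizes differ by at most one.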



In [8]:
# Do the same with the test data
sent140_test_df = tfds.as_dataframe(sent140_test)[['text', 'polarity']].copy()
sent140_test_df['text'] = sent140_test_df['text'].apply(text_processing)
ids_test = [i for i in range(NUM_CLIENTS) for _ in range(len(sent140_test)//NUM_CLIENTS)]
random.Random(seed+1).shuffle(ids_test)
sent140_test_df['user'] = ids_test


For the sake of simplicity, in this notebook we will be dealing with a binary problem. For that purpose, we remove the neutral tweets, so the classifier's aim is to differentiate between positive and negative tweets. 

In [9]:
# Convert into binary problem by deleting neutral opinions
def delete_neutral_ops(df):
    # Remove those tweets whose polarity is 2, i.e., neutral.
    # .copy() makes the result an independent dataframe, so the assignment below is safe.
    df = df.loc[df['polarity'] != 2].copy()
    
    # For ease of representation, replace class 4 (i.e., positive sentiment) by 1.
    df['polarity'] = df['polarity'].replace(4, 1)
    
    return df

# Transform both training and testing dataframes to a binary problem
sent140_train_df = delete_neutral_ops(sent140_train_df)
sent140_test_df = delete_neutral_ops(sent140_test_df)



Once the client IDs are assigned, we need a function that builds a TF Dataset for each client from its client_id.

In [10]:
# This method receives a client_id, and returns the training tf.data.Dataset for that client
def create_tf_dataset_for_client_fn_train(client_id):
    client_data = sent140_train_df[sent140_train_df['user'] == client_id].drop(columns='user')
    dataset = tf.data.Dataset.from_tensor_slices(client_data.to_dict('list'))
    dataset = dataset.shuffle(SHUFFLE_BUFFER).batch(BATCH_SIZE).repeat(NUM_EPOCHS)
    return dataset

# This method receives a client_id, and returns the testing tf.data.Dataset for that client
def create_tf_dataset_for_client_fn_test(client_id):
    client_data = sent140_test_df[sent140_test_df['user'] == client_id].drop(columns='user')
    dataset = tf.data.Dataset.from_tensor_slices(client_data.to_dict('list'))
    dataset = dataset.shuffle(SHUFFLE_BUFFER).batch(1).repeat(NUM_EPOCHS)
    return dataset
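
Because the pipeline applies `shuffle`, then `batch`, then `repeat`, each local epoch yields the client's data in batches (the last batch may be smaller), and the whole sequence is replayed `NUM_EPOCHS` times. The arithmetic below is a small sketch of how many batches a client processes per round; the figure of 1600 examples per client is an assumption (16000 training tweets split evenly over 10 clients, with none dropped as neutral), and `batches_per_round` is an illustrative helper.

```python
import math

def batches_per_round(num_examples, batch_size, num_epochs):
    """Number of batches one client processes in a single federated round."""
    # batch() without drop_remainder yields ceil(n / batch_size) batches per epoch;
    # repeat(num_epochs) replays that sequence num_epochs times.
    return math.ceil(num_examples / batch_size) * num_epochs

print(batches_per_round(num_examples=1600, batch_size=20, num_epochs=5))  # -> 400
print(batches_per_round(num_examples=1601, batch_size=20, num_epochs=5))  # -> 405
```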

Create the train and the test datasets using the function created above.

In [11]:
sent140_train = tff.simulation.datasets.ClientData.from_clients_and_tf_fn(
    client_ids=list(range(0,NUM_CLIENTS)),
    serializable_dataset_fn=create_tf_dataset_for_client_fn_train
)

sent140_test = tff.simulation.datasets.ClientData.from_clients_and_tf_fn(
    client_ids=list(range(0,NUM_CLIENTS)),
    serializable_dataset_fn=create_tf_dataset_for_client_fn_test
)

Create and prepare the federated dataset.
 * The elements are distributed to the clients by id.
 * Each dataset element is converted into an OrderedDict, where the tweets (*text*) are stored under the key ``x`` and the polarities (*polarity*) under the key ``y``.

In [12]:
def preprocess(dataset):
    def batch_format_fn(element):
        return collections.OrderedDict(
            x=element['text'],
            y=element['polarity']
        )

    return dataset.map(batch_format_fn)

# Create a list of datasets (one for each client) from the complete dataset and the number of clients
def make_federated_data(client_data, n_clients):    
    return [
        preprocess(client_data.create_tf_dataset_for_client(x)) # Call previous preprocess method
        for x in client_data.client_ids[0:n_clients]
    ]

# Create the final federated train and testing data
train_data = make_federated_data(sent140_train, NUM_CLIENTS)
test_data = make_federated_data(sent140_test, NUM_CLIENTS)

## Create a Deep Learning model

In this case we use a model composed of a pre-trained text embedding from TensorFlow Hub followed by dense layers. The pre-trained embedding is not updated in this example; however, the ``trainable`` parameter can be set to ``True`` so that its weights are also fine-tuned during the collaborative training.

Note that any network architecture supported by keras can be used.

In [13]:
def create_keras_model():
    # Load the pre-trained text embedding from TensorFlow Hub
    hub_url = "https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2"
    hub_layer = hub.KerasLayer(hub_url, input_shape=[], dtype=tf.string, trainable=False)
    
    # Set model layers; the final Dense(1) has no activation, so it outputs logits
    model = tf.keras.Sequential()
    model.add(hub_layer)
    model.add(tf.keras.layers.Dense(16, activation='relu'))
    model.add(tf.keras.layers.Dense(1))
    
    return model

In [14]:
def model_fn():
    # We _must_ create a new model here, and _not_ capture it from an external
    # scope. TFF will call this within different graph contexts.
    keras_model = create_keras_model()
    
    input_spec = train_data[0].element_spec
    
    return tff.learning.models.from_keras_model(
        keras_model,
        input_spec=input_spec,
        loss=losses.BinaryCrossentropy(from_logits=True),
        metrics=[metrics.BinaryAccuracy(threshold=0.0, name='accuracy')]
    )
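
The last `Dense(1)` layer has no activation, so the model outputs raw logits rather than probabilities. This is why the loss uses `from_logits=True` and the accuracy metric uses `threshold=0.0`: a logit above 0 corresponds to a sigmoid probability above 0.5. The plain-Python sketch below illustrates that equivalence together with a standard numerically stable form of binary cross-entropy on logits; the helper names are illustrative, and the formula is not claimed to be Keras's exact internal implementation.

```python
import math

def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logit(logit, label):
    """Binary cross-entropy computed directly from a logit (stable form)."""
    # Equivalent to -y*log(sigmoid(z)) - (1-y)*log(1 - sigmoid(z)).
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

for logit in (-2.0, -0.1, 0.3, 4.0):
    prob = sigmoid(logit)
    # Thresholding the logit at 0 gives the same decision as thresholding
    # the probability at 0.5.
    assert (logit > 0.0) == (prob > 0.5)
    print(f"logit={logit:+.1f}  prob={prob:.3f}  predicted_class={int(logit > 0.0)}")
```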

## Training in the federated scenario

Train with the unweighted FedAvg algorithm.
We define the model to use, as well as the optimizers for the clients and the server (both use Adam, but with different learning rates).

In [15]:
training_process = build_unweighted_fed_avg(
    model_fn,
    client_optimizer_fn=lambda: optimizers.Adam(learning_rate=0.001),
    server_optimizer_fn=lambda: optimizers.Adam(learning_rate=0.01)
)
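
In the unweighted variant, every client's update contributes equally to the average, regardless of how many examples that client holds. The sketch below contrasts this with an example-weighted average on toy vectors. It is a conceptual illustration only, not TFF's implementation; the helper `average_updates` and the toy numbers are assumptions made for the example.

```python
def average_updates(updates, weights=None):
    """Average per-client update vectors; uniform weights if none are given."""
    if weights is None:
        weights = [1.0] * len(updates)
    total = sum(weights)
    dim = len(updates[0])
    return [
        sum(w * u[k] for w, u in zip(weights, updates)) / total
        for k in range(dim)
    ]

# Toy model updates from three clients and their (hypothetical) dataset sizes.
client_updates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
num_examples = [10, 10, 80]

unweighted = average_updates(client_updates)                       # each client counts once
weighted = average_updates(client_updates, weights=num_examples)   # large clients dominate

print("unweighted:", unweighted)
print("weighted:  ", weighted)
```

With the IID partition used in this notebook the clients hold similar amounts of data, so the two averages would be expected to be close; the difference matters more for unbalanced, non-IID partitions.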


Initialize the training process and run it for NUM_ROUNDS rounds of federated learning.

In [16]:
train_state = training_process.initialize()

for round_num in range(1, NUM_ROUNDS+1):
    # Train next round (send model to clients, local training, and server model averaging)
    result = training_process.next(train_state, train_data)
        
    # Current state of the model
    train_state = result.state
        
    # Get and print metrics, such as the loss and accuracy (averaged across all clients)
    train_metrics = result.metrics['client_work']['train']
        
    print('Round {:2d},  \t Loss={:.4f}, \t Accuracy={:.4f}'.format(round_num, train_metrics['loss'], 
                                                                    train_metrics['accuracy']))



Round  1,  	 Loss=0.6273, 	 Accuracy=0.6729
Round  2,  	 Loss=0.6160, 	 Accuracy=0.6873
Round  3,  	 Loss=0.6064, 	 Accuracy=0.6979
Round  4,  	 Loss=0.5981, 	 Accuracy=0.7054
Round  5,  	 Loss=0.5907, 	 Accuracy=0.7103
Round  6,  	 Loss=0.5840, 	 Accuracy=0.7136
Round  7,  	 Loss=0.5779, 	 Accuracy=0.7172
Round  8,  	 Loss=0.5723, 	 Accuracy=0.7199
Round  9,  	 Loss=0.5673, 	 Accuracy=0.7214
Round 10,  	 Loss=0.5628, 	 Accuracy=0.7226


## Evaluation with test data

Prepare the evaluation process so that the trained model can be evaluated on unseen test data.

In [17]:
# Indicate that the model architecture is the one proposed before
evaluation_process = build_fed_eval(model_fn)

# Initialize the process and set the weights to those previously trained (getting from the training
# state and setting to the evaluation one).
evaluation_state = evaluation_process.initialize()
model_weights = training_process.get_model_weights(train_state)
evaluation_state = evaluation_process.set_model_weights(evaluation_state, model_weights)


Evaluate the model with the test data and print the desired evaluation metrics.

In [18]:
# Pass test data to the model in each client
evaluation_output = evaluation_process.next(evaluation_state, test_data)

# Get and print metrics
eval_metrics = evaluation_output.metrics['client_work']['eval']['current_round_metrics']
print('Test data, \t Loss={:.4f}, \t Accuracy={:.4f}'.format(eval_metrics['loss'], 
                                                             eval_metrics['accuracy']))


Test data, 	 Loss=0.4858, 	 Accuracy=0.8684
