## Word Prediction in a Mobile Device Federation

- [x] Get the data
- [x] Describe the data
- [x] Preprocessing
- [x] Data exploration
- [x] Create tensors
> NOTE: From this point on, the data follows the same format as the `tff.simulation` module
- [x] Understand the federated data
- [x] Load the federated data
> NOTE: From here on, the steps run on a federation client
- [x] Preprocess the data
- [ ] Load the pre-trained model on each client
- [ ] Compile the model and test on the preprocessed data
- [ ] Fine-tune the model with Federated Learning

In [18]:
# data analysis and data wrangling
import pandas as pd
import numpy as np

# plotting
from wordcloud import WordCloud

from plotly.offline import plot, iplot
import plotly.graph_objs as go # create graphics
import plotly.offline as py
import plotly

import cufflinks as cf # connect plotly and pandas
import plotly.io as pio

# preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.decomposition import NMF

# import pyLDAvis
import pyLDAvis.sklearn

# ml and dl
import tensorflow_federated as tff
import tensorflow as tf

# natural language processing
from nltk.corpus import stopwords
from nltk import word_tokenize
import nltk

# other
from importlib import reload

import platform
import warnings
import pathlib
import pprint
import string
import json
import time
import re
import os

In [19]:
import nest_asyncio
nest_asyncio.apply()

### Prepare Work Directory 

In [20]:
def prepare_work_directory(end_directory: str = 'notebooks') -> str:
    """Move up to the project root when the notebook runs from `end_directory`."""
    # Notebooks do not define __file__, so rely on the current working directory
    if os.getcwd().endswith(end_directory):
        os.chdir('..')

    return f'Current working directory: {os.getcwd()}'

In [21]:
prepare_work_directory(end_directory='notebooks')

'Current working directory: /home/campos/projects/ml/federated-Learning-for-text-generation'

# **Federated Learning**

## **Simulation of Federation**

To simulate the federation of clients, the [tff.simulation](https://www.tensorflow.org/federated/api_docs/python/tff/simulation) module is used.

## **Federation Design**
- 715 clients (users), where each example corresponds to a contiguous set of lines spoken by one character in a given play.
- For each client, the data is split into training lines (the first 80% of the lines for the role) and test lines (the last 20%, rounded up to at least one line).
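
The per-client split described above can be sketched in plain Python. This is a minimal illustration, not the LEAF or TFF implementation: the helper name `split_client_lines` is hypothetical, and it assumes the test share is the floor of 20% of the lines, raised to one line when that would be zero.

```python
def split_client_lines(lines, test_percent=20):
    """Split one client's lines into (train, test), keeping their order.

    Assumption: the test set is the last `test_percent` percent of the lines,
    with a minimum of one line.
    """
    n_test = max(1, len(lines) * test_percent // 100)
    n_train = len(lines) - n_test
    return lines[:n_train], lines[n_train:]


train_lines, test_lines = split_client_lines([f"line {i}" for i in range(10)])
print(len(train_lines), len(test_lines))
```

Under this assumed rule, a client with a single line would end up with one test line and no training lines.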

<!-- where each example corresponds to a contiguous set of lines spoken by the character in a given play. -->

In [22]:
train_data, test_data = tff.simulation.datasets.shakespeare.load_data()

In [23]:
len(train_data.client_ids)

715

In [24]:
len(test_data.client_ids)

715

### Understanding `tff.simulation.datasets`
- [Dataset documentation](https://www.tensorflow.org/federated/api_docs/python/tff/simulation/datasets/shakespeare/load_data)
- The data and its distribution across clients come from [LEAF: A Benchmark for Federated Settings](https://leaf.cmu.edu/).
- LEAF takes the text from [Project Gutenberg](http://www.gutenberg.org/files/100/old/1994-01-100.zip).

### Distribute Data Between Users
The `tff.simulation.datasets` package provides a variety of datasets that are divided into "clients", where each client corresponds to a dataset on a specific device that can participate in federated learning.

### Split Plays by Client

In [25]:
!python src/preprocess_shakespeare.py data/raw/100.txt data/federation

Splitting .txt data between users
Discarded 5730 lines


In [38]:
with open('data/federation/users_and_plays.json', mode='r') as users_and_plays:
    json_users_and_plays = json.load(users_and_plays)
# pprint.pprint(json_users_and_plays)

#### Client ID
The client keys consist of the name of the play joined with the name of the character, so for example `MUCH_ADO_ABOUT_NOTHING_OTHELLO` corresponds to the lines for the character **Othello** in the play **Much Ado About Nothing**. 
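
As a rough illustration of how such a key could be assembled, the sketch below joins a play title and a character name. The helper `make_client_id` is hypothetical and is not taken from LEAF or TFF; it assumes keys are upper-case and that any run of non-alphanumeric characters becomes a single underscore.

```python
import re


def make_client_id(play, character):
    """Join play and character into an upper-case, underscore-separated key."""
    raw = f"{play}_{character}".upper()
    # Collapse every run of non-alphanumeric characters into one underscore
    return re.sub(r"[^A-Z0-9]+", "_", raw).strip("_")


print(make_client_id("The Tragedy of King Lear", "King"))
```

The result for this input matches the key `THE_TRAGEDY_OF_KING_LEAR_KING` used later in the notebook.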

<!-- The client IDs consist of the name of the play joined with the name of the character, so for example `MUCH_ADO_ABOUT_NOTHING_OTHELLO` corresponds to the lines of the character Othello in the play "Much Ado About Nothing". -->

### Data to Tensors
The datasets provided by `shakespeare.load_data()` consist of a sequence of string `Tensors`, one for each line spoken by a particular character in a Shakespeare play.

In [27]:
type(train_data)

tensorflow_federated.python.simulation.datasets.client_data.PreprocessClientData

#### Sample

Here the play is "The Tragedy of King Lear" and the character is "King".

In [28]:
raw_example_dataset = train_data.create_tf_dataset_for_client('THE_TRAGEDY_OF_KING_LEAR_KING')
print(type(raw_example_dataset))

for x in raw_example_dataset.take(10):
    print(x['snippets'])

<class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
tf.Tensor(b'', shape=(), dtype=string)
tf.Tensor(b'What?', shape=(), dtype=string)
tf.Tensor(b'Peace!', shape=(), dtype=string)
tf.Tensor(b'[Reads]', shape=(), dtype=string)
tf.Tensor(b'Hence, sirs, away.', shape=(), dtype=string)
tf.Tensor(b'I was, fair madam.', shape=(), dtype=string)
tf.Tensor(b'That can never be.', shape=(), dtype=string)
tf.Tensor(b'Upon mine honour, no.', shape=(), dtype=string)
tf.Tensor(b"'that shallow vassal,'", shape=(), dtype=string)
tf.Tensor(b'How fares your Majesty?', shape=(), dtype=string)


---

## **Prepare Dataset for Training RNN**

#### Generate the vocab lookup tables

In [29]:
# A fixed vocabulary of ASCII chars that occur in the works of Shakespeare and Dickens:
vocab = list('dhlptx@DHLPTX $(,048cgkoswCGKOSW[_#\'/37;?bfjnrvzBFJNRVZ"&*.26:\naeimquyAEIMQUY]!%)-159\r')

# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
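
To make the two mappings concrete, here is a small round-trip sketch. The helpers `encode` and `decode` are illustrative names that do not appear in the notebook, and a plain list stands in for the `idx2char` NumPy array so the example has no dependencies.

```python
vocab = list('dhlptx@DHLPTX $(,048cgkoswCGKOSW[_#\'/37;?bfjnrvzBFJNRVZ"&*.26:\naeimquyAEIMQUY]!%)-159\r')
char2idx = {u: i for i, u in enumerate(vocab)}


def encode(text):
    """Map each character to its vocabulary index (raises KeyError if unknown)."""
    return [char2idx[ch] for ch in text]


def decode(ids):
    """Map indices back to characters and join them into a string."""
    return ''.join(vocab[i] for i in ids)


ids = encode('Peace!')
print(ids)
print(decode(ids))
```

Decoding the encoded ids returns the original string, which is the property the model relies on when generated indices are turned back into text.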

In [30]:
# Input pre-processing parameters
SEQ_LENGTH = 100
BATCH_SIZE = 8
BUFFER_SIZE = 100  # For dataset shuffling

Construct a (precomputed) lookup table that maps string chars to indexes, using the vocab defined above:
<!-- 
- a lookup table is a hash table in which the key directly addresses the value.
- lookup tables are precomputed
- this saves processing time -->

In [31]:
values = tf.constant(list(range(len(vocab))), dtype=tf.int64)

table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(keys=vocab, values=values),
    default_value=0)

type(table)

tensorflow.python.ops.lookup_ops.StaticHashTable
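
For intuition, the table behaves much like a Python dictionary queried with a default. The sketch below is a plain-Python analogue, not the TensorFlow API; the three-character `toy_vocab` and the `lookup` helper are made up for illustration. It shows one consequence of `default_value=0`: a character outside the vocabulary receives the same index as the first vocabulary entry.

```python
toy_vocab = ['a', 'b', 'c']
toy_table = {ch: i for i, ch in enumerate(toy_vocab)}


def lookup(chars, default_value=0):
    """Return the index of each character, or `default_value` if it is unknown."""
    return [toy_table.get(ch, default_value) for ch in chars]


print(lookup('cab'))  # every character is in the toy vocabulary
print(lookup('a?'))   # '?' is unknown and falls back to the default
```

With the notebook's vocabulary, index 0 is `'d'`, so an out-of-vocabulary character would be indistinguishable from `'d'` after lookup.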

In [32]:
# source: https://www.tensorflow.org/federated/tutorials/federated_learning_for_text_generation

def to_ids(x):
    s = tf.reshape(x['snippets'], shape=[1])
    chars = tf.strings.bytes_split(s).values
    return table.lookup(chars)
  
def split_input_target(chunk):
    input_text = tf.map_fn(lambda x: x[:-1], chunk)
    target_text = tf.map_fn(lambda x: x[1:], chunk)
    return (input_text, target_text)

def preprocess(dataset):
    return (
      # Map ASCII chars to int64 indexes using the vocab
      dataset.map(to_ids)
      # Split into individual chars
      .unbatch()
      # Form example sequences of SEQ_LENGTH +1
      .batch(SEQ_LENGTH + 1, drop_remainder=True)
      # Shuffle and form minibatches
      .shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
      # And finally split into (input, target) tuples,
      # each of length SEQ_LENGTH.
      .map(split_input_target)
    )

Now we can preprocess `raw_example_dataset` and check the element types:

In [33]:
example_dataset = preprocess(raw_example_dataset)
print(example_dataset.element_spec)

(TensorSpec(shape=(8, 100), dtype=tf.int64, name=None), TensorSpec(shape=(8, 100), dtype=tf.int64, name=None))
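
The `(8, 100)` shapes follow from the two `batch` calls with `drop_remainder=True`. As a rough plain-Python sketch of the same windowing (the helpers `make_sequences` and `count_batches` are illustrative names, not part of `tf.data`, and shuffling is omitted):

```python
def make_sequences(ids, seq_length):
    """Cut `ids` into windows of seq_length + 1 and split each into (input, target)."""
    window = seq_length + 1
    n_full = len(ids) // window  # an incomplete trailing window is dropped
    chunks = [ids[i * window:(i + 1) * window] for i in range(n_full)]
    # Each target element is the id that follows the matching input element
    return [(chunk[:-1], chunk[1:]) for chunk in chunks]


def count_batches(n_chars, seq_length=100, batch_size=8):
    """Number of full minibatches a client with `n_chars` characters yields."""
    return (n_chars // (seq_length + 1)) // batch_size


pairs = make_sequences(list(range(10)), seq_length=3)
print(pairs)
print(count_batches(808), count_batches(807))
```

If this sketch mirrors the pipeline faithfully, a client needs at least 808 characters (8 sequences of 101) to produce a single batch with the settings above, and smaller clients yield an empty dataset.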


In [34]:
type(example_dataset)

tensorflow.python.data.ops.dataset_ops.MapDataset


---