# Application of Federated Learning

In this notebook, I build an NLP solution with federated learning across three simulated locations.

#### Cloning the GitHub repo to Drive

In [1]:
! git clone https://github.com/dssaenzml/leaf.git

Cloning into 'leaf'...
remote: Enumerating objects: 752, done.
remote: Total 752 (delta 0), reused 0 (delta 0), pack-reused 752
Receiving objects: 100% (752/752), 6.78 MiB | 2.53 MiB/s, done.
Resolving deltas: 100% (350/350), done.


#### Installing requirements

In [1]:
import os
os.chdir('leaf')
! ls

data					 LICENSE.md	    README.md
docs					 models		    requirements.txt
leaf_based_federated_learning_NLP.ipynb  paper_experiments


In [4]:
%%capture

! pip3 install -r requirements.txt

## Twitter Sentiment Analysis Experiments

In this experiment, we reproduce the statistical analysis experiment conducted in the LEAF paper. Specifically, we use the LEAF framework to investigate how varying the minimum number of training samples per user affects model accuracy when training with the `FedAvg` algorithm.

For this example, we shall use the Sentiment140 dataset (containing 1.6 million tweets), and we shall train a 2-layer LSTM model with cross-entropy loss and pre-trained GloVe embeddings.
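To make the aggregation step concrete, here is a minimal sketch of the weighted averaging at the heart of `FedAvg`: each client's weights are averaged in proportion to its number of training samples. This is an illustration only, not LEAF's implementation; the function name `fedavg` and the flat-list representation of model weights are assumptions made for brevity.

```python
def fedavg(client_updates):
    """Average client weights, weighted by each client's sample count.

    client_updates: list of (num_samples, weights) pairs, where weights is a
    flat list of floats (a simplification; real models hold one array per layer).
    """
    total_samples = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    return [
        sum(n * weights[i] for n, weights in client_updates) / total_samples
        for i in range(dim)
    ]


# Toy example: the client with 30 samples pulls the average towards its weights.
updates = [
    (10, [1.0, 1.0]),
    (30, [5.0, 3.0]),
]
global_weights = fedavg(updates)
```

Because the average is weighted by sample count, users with very few tweets contribute little to the global model, which is one reason the minimum-samples setting matters in this experiment.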

### Experiment Setup and Execution

#### Pre-requisites

Since this experiment requires pre-trained word embeddings, we recommend running the `models/sent140/get_embs.sh` file, which fetches 300-dimensional pretrained GloVe vectors.

After extraction, this data is stored in `models/sent140/embs.json`.
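Before training, it can help to confirm that the embeddings file loads as expected. The sketch below assumes the JSON holds a vocabulary list and a parallel list of vectors under the keys `vocab` and `emba`; these key names are assumptions and should be checked against the actual file. The helper `load_embeddings` is hypothetical, and the demo runs on a tiny stand-in file rather than the real `embs.json`.

```python
import json
import os
import tempfile


def load_embeddings(path, vocab_key="vocab", vectors_key="emba"):
    """Return a word -> vector dict from an embeddings JSON file.

    The key names are assumptions about the file layout, not confirmed here.
    """
    with open(path, "r") as f:
        payload = json.load(f)
    vocab = payload[vocab_key]
    vectors = payload[vectors_key]
    if len(vocab) != len(vectors):
        raise ValueError("vocabulary and vector lists differ in length")
    return dict(zip(vocab, vectors))


# Demo on a small stand-in file with 2-dimensional vectors.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "embs.json")
    with open(demo_path, "w") as f:
        json.dump({"vocab": ["good", "bad"], "emba": [[0.1, 0.2], [0.3, 0.4]]}, f)
    word_to_vec = load_embeddings(demo_path)
```

For the real file, each vector should have 300 entries, matching the GloVe dimensionality mentioned above.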

In [5]:
os.chdir('models/sent140')

! sh ./get_embs.sh

./get_embs.sh: 3: cd: can't cd to sent140


#### Dataset fetching and pre-processing

LEAF provides scripts for fetching datasets and converting them to JSON format for easy use. These scripts can also subsample the dataset and split it into training and testing sets.

For our experiment, as a first step, we shall use 50% of the dataset (`--sf 0.5`) in an 80-20 train/test split (`--tf 0.8`), and we shall discard all users with fewer than 3 tweets (`-k 3`). The command below shows how this can be accomplished (the `--spltseed` flag makes the generated split reproducible).

After running this script, the `data/sent140/data` directory should contain `train/` and `test/` directories.

In [6]:
os.chdir('../../')
os.chdir('data/sent140')

! sh ./preprocess.sh --sf 0.5 -t sample -s niid --tf 0.8 -k 3 --spltseed 1549775860

------------------------------
calculating JSON file checksums
checksums written to meta/dir-checksum.md5
Data for one of the specified preprocessing tasks has already been
generated. If you would like to re-generate data for this directory,
please delete the existing one. Otherwise, please remove the
respective tag(s) from the preprocessing command.
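The effect of the `-k` flag can be illustrated on a toy dataset. The sketch below assumes the LEAF JSON layout with `users`, `num_samples`, and `user_data` keys; the helper `drop_small_users` is hypothetical and only mimics the minimum-samples filter, it is not the code that `preprocess.sh` runs.

```python
def drop_small_users(data, k):
    """Keep only users with at least k samples (mimics the -k filter)."""
    kept = [
        (user, n)
        for user, n in zip(data["users"], data["num_samples"])
        if n >= k
    ]
    return {
        "users": [user for user, _ in kept],
        "num_samples": [n for _, n in kept],
        "user_data": {user: data["user_data"][user] for user, _ in kept},
    }


# Toy dataset: "bob" has only one tweet and is dropped when k = 3.
toy = {
    "users": ["alice", "bob", "carol"],
    "num_samples": [3, 1, 4],
    "user_data": {
        "alice": {"x": ["a", "b", "c"], "y": [1, 0, 1]},
        "bob": {"x": ["d"], "y": [0]},
        "carol": {"x": ["e", "f", "g", "h"], "y": [1, 1, 0, 0]},
    },
}
filtered = drop_small_users(toy, k=3)
```

Raising `k` shrinks the user pool but leaves each remaining user with more local data.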


#### Model Execution

Now that we have our data, we can execute our model! For this experiment, the model file is stored at `models/sent140/stacked_lstm.py`. To train this model using `FedAvg` with 2 clients per round for 10 rounds, we execute the following command:

In [33]:
os.chdir('../../')
os.chdir('models')

! python3 ./main.py -dataset sent140 -model stacked_lstm -lr 0.0003 --clients-per-round 2 --num-rounds 10

2021-04-14 12:17:26.913984: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
############################## sent140.stacked_lstm ##############################
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
2021-04-14 12:17:53.779619: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-14 12:17:53.780487: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-04-14 12:17:53.809943: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2021-04-14 12:17:53.810006: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: MBZU-AL2-WS059
2021-04-14 12:17:53.810020: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: MBZU-AL2-WS
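The `--clients-per-round` and `--num-rounds` settings define how many users take part in training and for how long. As a rough sketch (not LEAF's own selection code), the schedule of participating clients can be pictured as a seeded random sample per round; the function `sample_clients` and the client IDs are hypothetical.

```python
import random


def sample_clients(client_ids, clients_per_round, num_rounds, seed=0):
    """Pick a random subset of clients for each round, without replacement."""
    rng = random.Random(seed)  # seeded so the schedule is reproducible
    k = min(clients_per_round, len(client_ids))
    return [rng.sample(client_ids, k) for _ in range(num_rounds)]


# Mirror the run above: 2 clients per round for 10 rounds.
schedule = sample_clients(["u1", "u2", "u3", "u4", "u5"], 2, 10)
```

With only 2 clients per round, many users may never be selected in a 10-round run, so results from such a short run should be read as a smoke test rather than a converged model.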

#### Metrics Collection

Executing the above command writes statistical and system metrics to `leaf/models/metrics/stat_metrics.csv` and `leaf/models/metrics/sys_metrics.csv`, respectively. Because these files are overwritten on every run, we highly recommend copying them to a different location after each run.

To experiment with a different minimum-sample setting, re-run the preprocessing script with a different value for the `-k` flag. The plots shown below can be generated using the `plots.py` file in the repo root.
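Since the metrics files are overwritten on each run, one option is to copy them into a per-run folder before changing `-k`. The helper below is a hypothetical sketch: `archive_metrics`, the archive location, and the run name `k3` are illustrative choices, and the demo uses placeholder files in a temporary directory rather than real metrics.

```python
import os
import shutil
import tempfile


def archive_metrics(metrics_dir, archive_root, run_name):
    """Copy the metrics CSVs into archive_root/run_name; return copied names."""
    dest = os.path.join(archive_root, run_name)
    os.makedirs(dest, exist_ok=True)
    copied = []
    for name in ("stat_metrics.csv", "sys_metrics.csv"):
        src = os.path.join(metrics_dir, name)
        if os.path.isfile(src):  # skip files that a run did not produce
            shutil.copy2(src, os.path.join(dest, name))
            copied.append(name)
    return copied


# Demo with placeholder files standing in for a finished run.
with tempfile.TemporaryDirectory() as tmp:
    metrics_dir = os.path.join(tmp, "metrics")
    os.makedirs(metrics_dir)
    for name in ("stat_metrics.csv", "sys_metrics.csv"):
        with open(os.path.join(metrics_dir, name), "w") as f:
            f.write("placeholder\n")
    copied = archive_metrics(metrics_dir, os.path.join(tmp, "archive"), "k3")
```

Naming each folder after the `-k` value keeps the runs easy to compare later.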

#### Results and Analysis

Upon performing this experiment, we see that, while median performance degrades only slightly with data-deficient users (i.e., k = 3), the 25th percentile (bottom of box) degrades dramatically.
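The median and 25th percentile referred to here can be computed from per-user accuracies. The sketch below assumes each row of the statistical metrics has an `accuracy` field; that column name is an assumption and should be checked against the generated CSV. The values in the demo are made up for illustration and are not results from this experiment.

```python
import statistics


def summarize_accuracy(rows, accuracy_column="accuracy"):
    """Return the 25th percentile and median of per-user accuracy."""
    values = [float(row[accuracy_column]) for row in rows]
    # "inclusive" interpolates between data points, like a box plot's quartiles.
    q1, median, _ = statistics.quantiles(values, n=4, method="inclusive")
    return {"p25": q1, "median": median}


# Made-up rows shaped like csv.DictReader output (values are strings).
rows = [
    {"accuracy": "0.5"},
    {"accuracy": "0.6"},
    {"accuracy": "0.7"},
    {"accuracy": "0.8"},
    {"accuracy": "0.9"},
]
summary = summarize_accuracy(rows)
```

Comparing `p25` across runs with different `-k` values shows how the worst-served users fare, which the median alone can hide.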

