<a href="https://colab.research.google.com/github/dssaenzml/federated_learning_nlp/blob/main/leaf_based_federated_learning_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Application of Federated Learning

I am going to build an NLP solution with federated learning, simulated across three different locations.

#### Cloning GitHub repo to Drive

In [1]:
! git clone https://github.com/dssaenzml/leaf.git

Cloning into 'leaf'...
remote: Enumerating objects: 752, done.
remote: Total 752 (delta 0), reused 0 (delta 0), pack-reused 752
Receiving objects: 100% (752/752), 6.78 MiB | 13.57 MiB/s, done.
Resolving deltas: 100% (350/350), done.


#### Installing requirements

In [2]:
import os
os.chdir('leaf')
! ls

data					 LICENSE.md	    README.md
docs					 models		    requirements.txt
leaf_based_federated_learning_NLP.ipynb  paper_experiments


In [3]:
%%capture

! pip3 install -r requirements.txt

## Twitter Sentiment Analysis Experiments

In this experiment, we reproduce the statistical analysis experiment conducted in the LEAF paper. Specifically, we investigate how varying the minimum number of training samples per user affects model accuracy when training with the `FedAvg` algorithm in the LEAF framework.

For this example, we use the Sentiment140 dataset (containing 1.6 million tweets) and train a 2-layer LSTM model with cross-entropy loss on top of pre-trained GloVe embeddings.
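
To make the aggregation step concrete, here is a minimal sketch of the weighted averaging at the heart of `FedAvg`. It is not LEAF's implementation: the function name `fedavg`, the flat-list representation of a model, and the toy numbers are all assumptions chosen for illustration.

```python
def fedavg(client_params, client_sizes):
    """Average client parameters, weighting each client by its sample count.

    client_params: one flat list of floats per client (hypothetical model format)
    client_sizes:  number of training samples each client used this round
    """
    total = sum(client_sizes)
    num_params = len(client_params[0])
    return [
        sum(n * params[i] for params, n in zip(client_params, client_sizes)) / total
        for i in range(num_params)
    ]


# Two hypothetical clients, each holding a 2-parameter model.
client_a = [1.0, 1.0]   # trained on 10 samples
client_b = [3.0, 5.0]   # trained on 30 samples
new_global = fedavg([client_a, client_b], client_sizes=[10, 30])
print(new_global)
```

Because the second client contributes three times as many samples, the averaged model sits closer to its parameters than to the first client's.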

### Experiment Setup and Execution

#### Pre-requisites

Since this experiment requires pre-trained word embeddings, we recommend running the `models/sent140/get_embs.sh` script, which fetches 300-dimensional pre-trained GloVe vectors.

After extraction, this data is stored in `models/sent140/embs.json`.
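
Since training depends on this file, it can be worth checking that it exists before launching a run. The sketch below only tests for a non-empty file at the path stated above; the helper names `embeddings_path` and `embeddings_ready` are hypothetical and not part of LEAF.

```python
import os


def embeddings_path(models_dir):
    """Build the expected location of embs.json under a given models directory."""
    return os.path.join(models_dir, "sent140", "embs.json")


def embeddings_ready(models_dir):
    """Return True if embs.json exists under models_dir and is not empty."""
    path = embeddings_path(models_dir)
    return os.path.isfile(path) and os.path.getsize(path) > 0


# Assumes the current working directory is the repository root.
print(embeddings_ready("models"))
```

A `False` result here suggests the embedding script did not finish, or wrote its output somewhere other than `models/sent140/`.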

In [4]:
os.chdir('models/sent140')

! sh ./get_embs.sh

./get_embs.sh: 3: cd: can't cd to sent140
--2021-04-11 06:30:16--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-04-11 06:30:16--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-04-11 06:30:17--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [applicati

#### Dataset fetching and pre-processing

LEAF provides scripts that fetch the data and convert it into JSON format for easy use. These scripts can also subsample the dataset and split it into training and testing sets.

For our experiment, as a first step, we shall use 50% of the dataset in an 80-20 train/test split, and we shall discard all users with fewer than 3 tweets (`-k 3`). The following command shows how this can be accomplished (the `--spltseed` flag makes the generated split reproducible).

After running this script, the `data/sent140/data` directory should contain `train/` and `test/` directories.
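
The filtering and splitting described above can be pictured with a small sketch. This is not the code that `preprocess.sh` runs: the function `filter_and_split`, the dictionary layout, and the toy users are assumptions for illustration, and the sketch assumes each remaining user's samples are divided between the two sets.

```python
import random


def filter_and_split(user_samples, min_samples, train_frac, seed):
    """Drop users with too few samples, then split each user's samples.

    user_samples: dict mapping a user id to a list of samples (hypothetical layout)
    """
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = {}, {}
    for user, samples in user_samples.items():
        if len(samples) < min_samples:
            continue  # this user does not meet the minimum sample count
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = max(1, int(train_frac * len(shuffled)))
        train[user] = shuffled[:cut]
        test[user] = shuffled[cut:]
    return train, test


# Toy data: one user below the threshold, one above it.
toy = {"alice": ["t1", "t2"], "bob": ["t%d" % i for i in range(10)]}
train, test = filter_and_split(toy, min_samples=3, train_frac=0.8, seed=1549775860)
print(len(train["bob"]), len(test["bob"]))
```

Raising `min_samples` removes more low-activity users, which is the quantity this experiment varies.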

In [None]:
os.chdir('../../')
os.chdir('data/sent140')

! sh ./preprocess.sh --sf 0.5 -t sample -s niid --tf 0.8 -k 3 --spltseed 1549775860

------------------------------
retrieving raw data
--2021-03-30 10:17:20--  http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip [following]
--2021-03-30 10:17:20--  https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 81363704 (78M) [application/zip]
Saving to: ‘trainingandtestdata.zip’


2021-03-30 10:17:23 (36.9 MB/s) - ‘trainingandtestdata.zip’ saved [81363704/81363704]

Archive:  trainingandtestdata.zip
  inflating: testdata.manual.2009.06.14.csv  
  inflating: training.1600000.processed.noemoticon.csv  
finished retrieving raw data
----------------------------

#### Model Execution

Now that we have our data, we can execute our model! For this experiment, the model file is stored at `models/sent140/stacked_lstm.py`. To train this model using `FedAvg` with 2 clients per round for 10 rounds, we execute the following command:

In [None]:
os.chdir('../../')
os.chdir('models')

! python3 ./main.py -dataset sent140 -model stacked_lstm -lr 0.0003 --clients-per-round 2 --num-rounds 10

############################## sent140.stacked_lstm ##############################
Traceback (most recent call last):
  File "./main.py", line 186, in <module>
    main()
  File "./main.py", line 58, in main
    client_model = ClientModel(args.seed, *model_params)
  File "/content/leaf/models/sent140/stacked_lstm.py", line 21, in __init__
    _, self.indd, vocab = get_word_emb_arr(VOCAB_DIR)
  File "/content/leaf/models/utils/language_utils.py", line 119, in get_word_emb_arr
    with open(path, 'r') as inf:
FileNotFoundError: [Errno 2] No such file or directory: 'sent140/embs.json'


#### Quickstart script

This script executes the instructions provided above for min-sample counts of 3, 10, 30 and 100, reproducibly generating the data partitions and results observed by the authors during analysis.

In [None]:
! sh paper_experiments/sent140.sh paper_experiments

leaf/paper_experiments/sent140.sh: 6: leaf/paper_experiments/sent140.sh: Syntax error: "(" unexpected
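
The sweep over min-sample counts can also be sketched directly in Python. The helper below only builds the preprocessing command for each count, reusing the flags from the preprocessing step earlier in this notebook; it does not run anything. The name `build_preprocess_commands` is hypothetical, and reusing the same split seed for every count is an assumption rather than something confirmed from the quickstart script.

```python
def build_preprocess_commands(min_sample_counts, seed):
    """Return one preprocess.sh argument list per minimum sample count."""
    commands = []
    for k in min_sample_counts:
        commands.append([
            "sh", "./preprocess.sh",
            "--sf", "0.5", "-t", "sample", "-s", "niid",
            "--tf", "0.8", "-k", str(k), "--spltseed", str(seed),
        ])
    return commands


commands = build_preprocess_commands([3, 10, 30, 100], seed=1549775860)
for cmd in commands:
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run` from inside `data/sent140`, followed by a training run on the resulting partition.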
