# Scikit-Learn SVM with NVFLARE


## Prepare data

In this section, we download the data, split it across clients, and save the results to the local disk.

### Download data

In [1]:
from utils.prepare_data import download_data

The `download_data` function downloads one of two datasets from Scikit-learn: Iris or Cancer.
* The file is saved to the output directory.
* The file is in CSV format (comma-separated).
* The file has no header row.
* The default dataset is Iris.
* The file name is the dataset name (for example, `iris.csv`).


In [2]:
output_dir = "/tmp/nvflare/sklearn/data"
download_data(output_dir)

Verify that the file was downloaded:


In [3]:
!ls {output_dir}

iris.csv
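
Listing the directory only confirms that the file exists. The following is a minimal sketch for checking its contents, assuming the layout that the split output later in this notebook shows (no header, label in the first column, features after it). `peek_csv` is a hypothetical helper and is not part of `utils`.

```python
import csv


def peek_csv(path, n=3):
    """Return the first `n` rows as (label, features) tuples."""
    samples = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(samples) == n:
                break
            # float() fails loudly if the file unexpectedly contains a header row
            values = [float(v) for v in row]
            samples.append((values[0], values[1:]))
    return samples
```

Calling `peek_csv("/tmp/nvflare/sklearn/data/iris.csv")` after the download should return three tuples, each with one label and four feature values.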


### Split data
* **Split Method**


Split the data into separate datasets, one for each client.
Several split methods are available so that the algorithms can be tested in different scenarios. Here we pick the uniform split from the following:
* Uniform
* Linear
* Square
* Exponential
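
The four methods differ only in how the relative site sizes grow with the site number. The sketch below is an illustration, not the code in `utils.prepare_data_split`: the function name `split_sizes` and the weight formulas (1, i, i squared, and e to the power i for site i) are assumptions chosen to match the method names.

```python
import math


def split_sizes(total, site_num, method="uniform"):
    """Return a list of per-site row counts that sums to `total`.

    Hypothetical helper: the weight formulas are assumptions, not the
    NVFLARE implementation.
    """
    weight_fns = {
        "uniform": lambda i: 1.0,
        "linear": lambda i: float(i),
        "square": lambda i: float(i * i),
        "exponential": lambda i: math.exp(i),
    }
    if method not in weight_fns:
        raise ValueError(f"unknown split method: {method}")
    weights = [weight_fns[method](i) for i in range(1, site_num + 1)]
    scale = total / sum(weights)
    sizes = [int(w * scale) for w in weights]
    # Rounding down can leave a few rows unassigned; give them to the last site.
    sizes[-1] += total - sum(sizes)
    return sizes


print(split_sizes(105, 2, "uniform"))  # [52, 53]
print(split_sizes(105, 2, "linear"))   # [35, 70]
```

With 105 training rows and 2 sites, the uniform result of 52 and 53 rows matches the site ranges that `split_data` prints for this notebook (45 to 97 and 97 to 150).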



* **Data store method**

There are two approaches to storing the split data:
* STORE_DATA: 

as in a real application, we split the whole dataset into separate per-site datasets (here, one file per site), and each client reads only its own site's data

* STORE_INDEX: 

simulate the split by assigning a data index range to each site, without splitting the original file. The data loader reads from the original file but loads only the rows within the site's index range.
  For example, the index assignment for the data split is captured in a JSON file:
 ``` 
  {
     "data_path": "/tmp/nvflare/sklearn/data/iris.csv",
     "data_index": {
         "site-1": {"start": 100, "end": 300},
         "site-2": {"start": 301, "end": 600}
     }
  }
 ```
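
A loader for the STORE_INDEX approach could look roughly like the sketch below. `load_site_rows` is a hypothetical name, not the loader shipped with this example. The sketch assumes the JSON layout shown above and a half-open range `[start, end)`, as in the assignment that `split_data` prints later in this notebook. If the actual loader treats `end` as inclusive, the stop condition would be `i > end` instead.

```python
import csv
import json


def load_site_rows(index_file, site_name):
    """Read only the rows assigned to `site_name` from the shared CSV file.

    Assumes a half-open row range [start, end) and a header-less CSV.
    """
    with open(index_file) as f:
        spec = json.load(f)
    bounds = spec["data_index"][site_name]
    start, end = bounds["start"], bounds["end"]

    rows = []
    with open(spec["data_path"], newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            if i >= end:
                break  # past this site's range, stop reading
            if i >= start:
                rows.append([float(v) for v in row])
    return rows
```

Because every site reads from the same file, this approach avoids duplicating data on disk, at the cost of each loader skipping over the rows that precede its range.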

Here we choose the STORE_DATA approach.

In [1]:
from utils.prepare_data_split import split_data, SplitMethod, StoreMethod

In [2]:
input_path = "/tmp/nvflare/sklearn/data/iris.csv"
output_dir = "/tmp/nvflare/sklearn/data"
site_num = 2
valid_frac = 0.3
split_method: SplitMethod = SplitMethod.UNIFORM
store_method: StoreMethod = StoreMethod.STORE_DATA

In [3]:

split_data(input_path, output_dir, site_num, valid_frac, split_method=split_method, store_method=store_method)

assign_data_index_to_sites
valid size type <class 'int'>
site_sizes type <class 'list'>
start type= <class 'int'>
end type= <class 'int'>
start type= <class 'int'>
end type= <class 'int'>
{'valid': {'start': 0, 'end': 45}, 'site-1': {'start': 45, 'end': 97}, 'site-2': {'start': 97, 'end': 150}}
output_file= /tmp/nvflare/sklearn/data/data_valid.csv
['0.0,5.1,3.5,1.4,0.2\n', '0.0,4.9,3.0,1.4,0.2\n', '0.0,4.7,3.2,1.3,0.2\n', '0.0,4.6,3.1,1.5,0.2\n', '0.0,5.0,3.6,1.4,0.2\n', '0.0,5.4,3.9,1.7,0.4\n', '0.0,4.6,3.4,1.4,0.3\n', '0.0,5.0,3.4,1.5,0.2\n', '0.0,4.4,2.9,1.4,0.2\n', '0.0,4.9,3.1,1.5,0.1\n', '0.0,5.4,3.7,1.5,0.2\n', '0.0,4.8,3.4,1.6,0.2\n', '0.0,4.8,3.0,1.4,0.1\n', '0.0,4.3,3.0,1.1,0.1\n', '0.0,5.8,4.0,1.2,0.2\n', '0.0,5.7,4.4,1.5,0.4\n', '0.0,5.4,3.9,1.3,0.4\n', '0.0,5.1,3.5,1.4,0.3\n', '0.0,5.7,3.8,1.7,0.3\n', '0.0,5.1,3.8,1.5,0.3\n', '0.0,5.4,3.4,1.7,0.2\n', '0.0,5.1,3.7,1.5,0.4\n', '0.0,4.6,3.6,1.0,0.2\n', '0.0,5.1,3.3,1.7,0.5\n', '0.0,4.8,3.4,1.9,0.2\n', '0.0,5.0,3.0,1.6,0.2\n',

In [4]:
!ls -l {output_dir}

total 32
-rw-r--r--  1 chesterc  wheel  1040 Dec 16 09:32 data_site-1.csv
-rw-r--r--  1 chesterc  wheel  1060 Dec 16 09:32 data_site-2.csv
-rw-r--r--  1 chesterc  wheel     0 Dec 16 09:26 data_site-3.csv
-rw-r--r--  1 chesterc  wheel   900 Dec 16 09:32 data_valid.csv
-rw-r--r--  1 chesterc  wheel  3000 Dec 16 09:19 iris.csv
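
The file sizes hint at the split, but counting rows is a more direct check. The sketch below uses a hypothetical helper, `count_rows`, and assumes the `data_*.csv` naming seen in the listing above.

```python
import glob
import os


def count_rows(data_dir, pattern="data_*.csv"):
    """Map each matching file name to its number of non-empty lines."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(data_dir, pattern))):
        with open(path) as f:
            counts[os.path.basename(path)] = sum(1 for line in f if line.strip())
    return counts
```

For this run, `count_rows(output_dir)` should agree with the printed index assignment: 45 validation rows, 52 rows for site-1, and 53 rows for site-2, which together cover the 150 rows of `iris.csv`. A zero-byte file such as `data_site-3.csv` shows up with a count of 0; its earlier timestamp suggests it may be left over from a previous run.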
