# Artificien Sample Model Upload
Here, we demonstrate how to create a machine learning model and upload it so that it may be trained on client devices using federated learning.
We create a simple linear regression model below, defined in PyTorch syntax, to predict a single variable 'Y' using another three input variables, which are stored in the vector 'X'. We then upload the model to a backend service called a 'PyGrid' node. Models sent to this PyGrid node will be downloaded by client devices (iPhones, Androids, etc.) and trained on local data. 

Each time a device or set of devices trains your model, the model stored on the PyGrid node is updated to reflect the newly improved model. In our case, we train a linear regression model (a very simple machine learning model) that takes in `age`, `bodyMassIndex` and `sex` to predict a user's `stepCount`. Once we upload your model, users of the Artificien Health app can download and train it on their **real live Apple Health data**, where your model will learn how to use the three input variables (`age`, `bodyMassIndex` and `sex`) to predict the output variable (`stepCount`).

As soon as you upload the below tutorial model, you'll find it on [artificien.com](https://artificien.com) on your 'models' page. There, you'll be able to check the model progress and download the newly trained model once training is complete. 

## Cycles, Accuracy, and Model Progress

The model uploaded in this tutorial, for the sake of this demonstration, is configured to train on at minimum 1 device per "training cycle" and at maximum 5 devices per training cycle. If at least 1 but fewer than 5 devices have trained your model by the end of 1 hour, the cycle will be marked complete. Likewise, *as soon as* 5 devices have trained your model, the cycle is complete. In total, this model will train for 5 cycles.
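The cycle rule above can be sketched as a small function. This is only an illustration of the rule as described in this tutorial, not Artificien's or PyGrid's actual implementation; the function name, its parameters, and the assumption that the minimum is inclusive are all hypothetical.

```python
def cycle_is_complete(num_trained, elapsed_seconds,
                      min_workers=1, max_workers=5, timeout_seconds=3600):
    """Return True when a training cycle should be closed (hypothetical rule)."""
    # Close immediately once the maximum number of devices has reported.
    if num_trained >= max_workers:
        return True
    # Otherwise, close at the timeout, but only if the minimum has been met.
    return elapsed_seconds >= timeout_seconds and num_trained >= min_workers


print(cycle_is_complete(5, 120))    # maximum reached before the hour is up
print(cycle_is_complete(3, 3600))   # hour is up and enough devices trained
print(cycle_is_complete(0, 3600))   # hour is up but no device trained
```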

As your model trains, devices training the model will send 'updates' to our backend servers indicating how the model should be improved. Once a cycle completes, we average all of the 'updates' sent by the devices that trained your model and create a new-and-improved model, which incorporates what your model learned while training on the devices. Then, when another cycle commences, devices will download this new-and-improved model and send us information on how to improve *that* model even further. This process, where we iteratively update the model at the end of each cycle, continues until we reach `num_cycles`. As your model trains, you'll see its average loss (in this case, the number of steps by which it misses the mark when trying to predict someone's step count) update, and you'll see its progress % go up over time.
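The averaging step described above can be sketched in plain PyTorch. This is a minimal sketch, assuming each device's 'update' is the difference between the parameters it downloaded and the parameters it ended up with after local training; the helper names `average_updates` and `apply_update` are hypothetical and are not part of `artificienlib`.

```python
import torch as th


def average_updates(updates):
    """Average per-device updates (dicts of name -> tensor) with equal weight."""
    names = updates[0].keys()
    return {n: th.stack([u[n] for u in updates]).mean(dim=0) for n in names}


def apply_update(params, avg_update):
    """Subtract the averaged update from the current parameters."""
    return {n: params[n] - avg_update[n] for n in params}


# Current model on the server: 3 input weights and 1 bias (made-up values).
params = {"weight": th.tensor([[1.0, 2.0, 3.0]]), "bias": th.tensor([0.5])}

# Updates reported by two devices during one cycle (made-up values).
updates = [
    {"weight": th.tensor([[0.2, 0.0, -0.2]]), "bias": th.tensor([0.1])},
    {"weight": th.tensor([[0.4, 0.2, 0.0]]), "bias": th.tensor([0.3])},
]

avg_update = average_updates(updates)
new_params = apply_update(params, avg_update)
print(new_params)
```

Devices in the next cycle would then start from `new_params` rather than `params`.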

For the machine learning engineers out there - think of `num_cycles` as loosely analogous to `num_epochs`, and min/max `num_devices` as loosely analogous to `batch_size`. In standard ML, model parameters are updated after training on a batch of `batch_size` samples. In federated ML, model parameters are updated at the end of a cycle, after training on `num_devices` devices.

Note that, for a real machine learning model, you'd certainly want to train on a lot more than 1-5 devices per cycle, and you might want to have more than 5 cycles.

## Please Read: 
- You cannot save changes made to this tutorial - this file is shared by all Artificien users and cannot be altered. To create your own notebooks (which can be edited and saved), create a new notebook in your root ('/') directory, or make a copy of this notebook and place it in your root directory.

- To build models for deployment to actual devices running Artificien partner apps, ensure that your model can train on and make predictions using the provided [sample dataset](../sample_data) corresponding to the type of data you'd like to build on top of. For this tutorial, we ensure that the model indeed works on the health app data before sending it.

In [1]:
from artificienlib import syftfunctions as sf
import pandas as pd
import numpy as np
import torch as th
import syft as sy
import os

from torch import nn
from sklearn.model_selection import train_test_split
from syft.federated.fl_client import FLClient
from syft.federated.fl_job import FLJob
from syft.grid.clients.model_centric_fl_client import ModelCentricFLClient

Setting up Sandbox...
Done!


### Explore the Data
For every dataset on [artificien.com](https://artificien.com), we provide a sample dataset to show you what the data on user devices will actually look like. Here, we explore the Artificien Health dataset. Before we deploy our ML model for training, we'll first make sure it works correctly on the sample dataset. Note that the sample dataset is entirely made-up data - the real data **only** ever exists on user devices, and it **never** goes anywhere else. We keep the user's privacy first and foremost.

In [2]:
health_data = pd.read_csv('../sample_data/Artificien-Health.csv')
health_data.head()

Unnamed: 0,Age,Sex,Body Mass Index,Weekly Step Count
0,41,1,43,67173
1,47,1,2,113562
2,100,1,23,105109
3,42,1,8,175606
4,89,0,7,220145


### Features and Labels
We want to build a model that uses Age, Sex, and Body Mass Index in order to predict an individual's step count. For this reason, we split up the training features and labels as follows.

In [3]:
feature_names = ['Age', 'Sex', 'Body Mass Index']
# features = ['Body Mass Index']
label_names = ['Weekly Step Count']
features = health_data[feature_names]
labels = health_data[label_names]

In [4]:
# Show Features
features.head()

Unnamed: 0,Age,Sex,Body Mass Index
0,41,1,43
1,47,1,2
2,100,1,23
3,42,1,8
4,89,0,7


In [5]:
# Show Labels (only one in this case)
labels.head()

Unnamed: 0,Weekly Step Count
0,67173
1,113562
2,105109
3,175606
4,220145


### Train Test Split
In order to test our model on the sample data, we'll perform a train test split. We'll save 80% of the sample data for training (`train_features`, `train_labels`) and 20% of the sample data for testing (`test_features`, `test_labels`). We'll also define a standard Torch DataLoader to load in the data. Note that the exact training process we've shown here will *also* occur on the user devices.

As explained above, `batch_size` and `num_devices` are analogous in some ways, so we'll set our `batch_size` equal to 5 - the maximum number of devices per cycle we aim to train on.

In [6]:
def get_data_loader(features, labels, batch_size):
    features = th.tensor(features.values.astype(np.float32)) 
    labels = th.tensor(labels.values.astype(np.float32))
    tensor = th.utils.data.TensorDataset(features, labels)
    loader = th.utils.data.DataLoader(dataset=tensor, batch_size=batch_size, shuffle=True)
    return loader

In [7]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.2) # 20% of data retained for testing

In [8]:
batch_size = 5
train_dataloader = get_data_loader(train_features, train_labels, batch_size)
test_dataloader = get_data_loader(test_features, test_labels, batch_size)

### Build the Model
Here, we define our model using a standard PyTorch model definition. This example is just a simple 3-variable linear regression model. Note that the LinearRegression model does not need the input data or the labels to be normalized. Other ML models, however, do. This normalization process for other machine learning models will be described in another tutorial.

In [9]:
class LinearRegression(th.nn.Module):

    def __init__(self, input_size, output_size):
        super(LinearRegression, self).__init__()
        self.linear = th.nn.Linear(input_size, output_size)

    def forward(self, x):
        y_pred = self.linear(x)
        return y_pred
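
To make the model concrete: a `th.nn.Linear(3, 1)` layer computes `y_pred = x @ W.T + b`, one weight per input feature plus a bias. The short check below is a sketch with a made-up input row; it only confirms that the layer's output matches that formula.

```python
import torch as th

linear = th.nn.Linear(3, 1)            # 3 input features -> 1 prediction
x = th.tensor([[41.0, 1.0, 43.0]])     # one made-up row: Age, Sex, Body Mass Index

layer_output = linear(x)
manual_output = x @ linear.weight.T + linear.bias  # the same computation by hand

print(layer_output.shape)              # one prediction per input row
print(th.allclose(layer_output, manual_output))
```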

### Test the model on the sample data
To train the model, we choose a standard mean absolute error (MAE) loss, which means that our loss is in the same units as our labels. In this case, this means that our loss indicates the number of steps by which our model 'missed the mark'. We set our learning rate to 0.01, a standard rate in most ML workflows.
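
As a small worked example of why the MAE loss is in the same units as the labels, the sketch below applies `nn.L1Loss` to two made-up predictions. The label values are taken from the first two rows of the sample data; the predictions are invented for illustration.

```python
import torch as th
from torch import nn

loss_fn = nn.L1Loss()  # mean absolute error, averaged over the batch

pred = th.tensor([[60000.0], [120000.0]])   # made-up predicted weekly step counts
y = th.tensor([[67173.0], [113562.0]])      # actual weekly step counts

# |60000 - 67173| = 7173 and |120000 - 113562| = 6438, so the mean is 6805.5 steps
loss = loss_fn(pred, y)
print(loss.item())
```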

We set our `num_epochs` to 5, since, as explained above, the number of cycles is analogous to the number of epochs.

In [10]:
# Get our model
model = LinearRegression(input_size=3, output_size=1)
print(model)

LinearRegression(
  (linear): Linear(in_features=3, out_features=1, bias=True)
)


In [11]:
device = "cpu"
lr = 0.01 # learning rate
loss_fn = nn.L1Loss()
optimizer = th.optim.SGD(model.parameters(), lr=lr)

In [12]:
def train(dataloader, model, loss_fn, optimizer):
    """ A standard PyTorch training function """
    model.train()  # test() switches the model to eval mode, so switch back before training
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        # y = nn.functional.normalize(y) # model outputs normalized predictions
        loss = loss_fn(pred, y)

        # Clear gradient buffers so that gradients from the previous batch do not accumulate
        optimizer.zero_grad()
        # Compute gradients w.r.t. the parameters
        loss.backward()
        # Update the parameters using the gradients
        optimizer.step()

In [13]:
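def test(dataloader, model):
    size = len(dataloader.dataset)
    model.eval()
    test_loss = 0
    with th.no_grad():
        for X, y in dataloader:
            # y = nn.functional.normalize(y)
            X, y = X.to(device), y.to(device)
            pred = model(X)
            # loss_fn (nn.L1Loss) returns the mean absolute error of this batch
            test_loss += loss_fn(pred, y).item()
    # NOTE: this divides the sum of per-batch means by the number of samples,
    # so the reported value is roughly batch_size times smaller than the
    # per-sample mean absolute error. Dividing by len(dataloader) (the number
    # of batches) would report the per-sample error instead.
    test_loss /= size
    print(f"Test Error - Avg loss: {test_loss:>8f} \n")

    return test_loss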

In [14]:
# Train it
epochs = 5
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train(train_dataloader, model, loss_fn, optimizer)
    test_loss = test(test_dataloader, model)
print("Done!")

Epoch 1
-------------------------------
Test Error - Avg loss: 23192.588203 

Epoch 2
-------------------------------
Test Error - Avg loss: 22586.672793 

Epoch 3
-------------------------------
Test Error - Avg loss: 22017.616406 

Epoch 4
-------------------------------
Test Error - Avg loss: 21480.276680 

Epoch 5
-------------------------------
Test Error - Avg loss: 20973.369883 

Done!


### Done Training!
As you can see, the reported loss fell as the model trained. In a real world scenario, you'd likely want a much larger batch size (`num_devices`) and epoch count (`num_cycles`) in order for your model to improve substantially. Nevertheless, the reported loss reaches around 20K, which suggests that the model has begun to learn how the input data predicts the output label (step count). Below, we print that loss alongside a 'baseline' metric: the standard deviation of the test labels, which approximates the typical error you would make if you were to simply guess that all individuals had the *average* step count. Note that `test` divides the summed per-batch losses by the number of samples, so the reported loss is scaled down by roughly the batch size and is not directly comparable to the baseline.

In [15]:
print('Our test loss:', int(test_loss))
print('Test loss obtained by guessing:', int(test_labels.std().iloc[0]))  # .std() returns a one-element Series

Our test loss: 20973
Test loss obtained by guessing: 75347
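
The baseline above uses the standard deviation of the labels, which is close to, but not the same as, the mean absolute error of always guessing the average. The sketch below uses four made-up label values to show the difference; it assumes the sample standard deviation (`ddof=1`), which is what pandas' `.std()` computes by default.

```python
import numpy as np

labels = np.array([10.0, 20.0, 30.0, 40.0])   # made-up step counts

mean_guess = labels.mean()                            # always guess the average
mae_of_mean_guess = np.abs(labels - mean_guess).mean()
sample_std = labels.std(ddof=1)                       # pandas' default for .std()

print('MAE when guessing the mean:', mae_of_mean_guess)
print('Standard deviation:', round(sample_std, 2))
```

The standard deviation is never smaller than the mean absolute deviation, so it is a slightly pessimistic stand-in for the 'guess the average' error.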


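If we train our model with a larger batch size and more epochs (more devices and more cycles), the reported loss drops further. Because `test` divides by the number of samples rather than the number of batches, part of this drop comes from the larger batch size itself rather than from a better model: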

In [16]:
%%capture

batch_size = 32
epochs = 100
train_dataloader = get_data_loader(train_features, train_labels, batch_size)
test_dataloader = get_data_loader(test_features, test_labels, batch_size)

for t in range(epochs):
    train(train_dataloader, model, loss_fn, optimizer)
    test_loss = test(test_dataloader, model)

In [17]:
print('Our test loss:', int(test_loss))
print('Test loss obtained by guessing:', int(test_labels.std().iloc[0]))  # .std() returns a one-element Series

Our test loss: 3563
Test loss obtained by guessing: 75347


# Federated learning
Now that we've built our model and validated that it works on the sample data, here comes the easy part. We simply upload our model to Artificien's backend servers to make it available for download and training by client devices. Let's walk through the steps.

#### Check available datasets
Check the datasets that you've purchased access to. Today, we'll use the Artificien Health dataset, which you'll have access to by default. Enter your Artificien password below so Artificien can validate your credentials.

In [18]:
password = "artificien1"

In [19]:
sf.get_my_purchased_datasets(password)

{'datasets': ['Artificien-Health']}


#### Name your model
Next, we name the model. Feel free to name it anything you like, but note that once you upload a model with a given name and version, you cannot use that name and version again. For every new model you upload, you create a unique (name, version) pair. We do this to keep track of which model is which on [artificien.com](https://artificien.com).

In [20]:
name = "tutorial-model"
version = "1.0"
dataset = "Artificien-Health"

#### Defining a Training Plan
First, we choose dummy `X` and `y` tensors representative of our input and output data, respectively. Then, we select a learning rate and batch size. Since each user has only a single entry for their BMI, step count, etc., the batch size should be set to 1 (each user only has one sample of data). We set the learning rate to 0.01, just as we did in our sample dataset trial run. For the loss function, note that the cell below passes `sf.mse_with_logits`, which by its name is a mean-squared-error loss rather than the mean absolute error used in the trial run. We pass all these parameters into Artificien's `def_training_plan` function to obtain a federated learning training plan.

In [21]:
lr = 0.01
batch_size = 1 # each user only has a single entry of data
X = th.randn(1, 3) # dimensions of the input data (3 features)
y = nn.functional.one_hot(th.tensor([1])) # dimensions of the labels (a single number - step count)
model_params, training_plan = sf.def_training_plan(model, X, y, batch_size, lr, {"loss": sf.mse_with_logits})
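
To give a feel for what a training plan amounts to on a single device, here is a minimal sketch of one local training step in plain PyTorch with a batch size of 1. It is not the internals of `sf.def_training_plan`, which this tutorial does not show; the input values are made up, `nn.MSELoss` is used as a stand-in suggested by the name `sf.mse_with_logits`, and the idea that the device reports a parameter difference is an assumption.

```python
import torch as th
from torch import nn

th.manual_seed(0)
local_model = nn.Linear(3, 1)               # same shape as the tutorial model
x = th.tensor([[0.41, 1.0, 0.43]])          # one made-up (scaled) user sample
y_true = th.tensor([[0.67]])                # that user's made-up (scaled) label
lr = 0.01
loss_fn = nn.MSELoss()

# Keep a copy of the downloaded parameters so the change can be reported.
before = [p.detach().clone() for p in local_model.parameters()]

loss_before = loss_fn(local_model(x), y_true)
local_model.zero_grad()
loss_before.backward()
with th.no_grad():
    for p in local_model.parameters():
        p -= lr * p.grad                     # one plain SGD step

# Assumed form of the 'update' sent back: old parameters minus new parameters.
diff = [b - p.detach() for b, p in zip(before, local_model.parameters())]
loss_after = loss_fn(local_model(x), y_true)
print(loss_before.item(), loss_after.item())
```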

#### Defining an Averaging Plan
Next, we define our averaging plan - the way our model combines the results from multiple edge devices. Here, we just use the default averaging plan, which weights the model updates from every device equally - that is, device 1 has just as much input as device 2 when improving the model.

In [22]:
avg_plan = sf.def_avg_plan(model_params)

#### Send Model
Next, we send the model to Artificien's backend services to be hosted and downloaded by client devices. Here, we are using the features we defined above - `age`, `bodyMassIndex` and `sex` - to predict the label `stepCount`, so we make sure to let Artificien's servers know that this is the case. If we instead wanted to predict, say, a person's age using only their `sex` and `stepCount`, we would be free to do that as well.

Note that we set `min_workers` to 1, `max_workers` to 5, and `num_cycles` to 5.

If you happen to be the first person to run this tutorial in a while, you will need to wait a few minutes for the infrastructure to host your model to be created. Otherwise, you'll see that your model is uploaded and made available for training almost instantly.

In [23]:
feature_names

['Age', 'Sex', 'Body Mass Index']

In [24]:
label_names

['Weekly Step Count']

In [25]:
sf.send_model(
    
    # Model Information
    name=name, 
    dataset_id=dataset, 
    version=version, 
    
    # Determine what training should look like
    min_workers=1,
    max_workers=5,
    num_cycles=5,


    # Set the training and average plans
    batch_size=batch_size,  # 1, as defined for the training plan
    learning_rate=lr,       # 0.01, matching the training plan
    model_params=model_params,
    features=feature_names, 
    labels=label_names,
    avg_plan=avg_plan,
    training_plan=training_plan,
    
    # Authenticate
    password=password
)

Host response: {'type': 'model-centric/host-training', 'data': {'status': 'success'}}


### Done!
If we get the response `Host response: {'type': 'model-centric/host-training', 'data': {'status': 'success'}}` back, then our model was successfully uploaded to our backend services and is now available to be downloaded and trained by client devices. If you navigate to [your models](https://artificien.com/models), you can monitor your model's progress and loss, and download it once training is done.

## Learn More
To learn more about how to use the artificien library for your own unique models, head to our documentation page at [artificien.com/data_scientist_documentation](https://artificien.com/data_scientist_documentation).