![image.png](attachment:335d4016-1813-4afe-bdf2-f0f9bc31c729.png)

# **Brief setup**

**Rachel** is a **Data Scientist** and researcher working on a project that uses Machine Learning to study breast cancer data. To do so, Rachel would like to use the (non-public) “Breast Cancer Biomarker” dataset that has been made available on the Cancer Research Centre Datasite.

**Owen** is a **laboratory data manager** in the Cancer Biomarker Research group. Owen is responsible for organising and curating the database of clinical data collected from anonymised patient samples. Due to legal and regulatory constraints, this dataset cannot be made publicly available, nor can any copy of it leave the premises of the research centre. Nonetheless, Owen is very keen on allowing researchers to use the “Breast Cancer Biomarker” dataset in their projects. So Owen sets up a **PySyft Datasite** hosting the dataset. As Data Owner, Owen will be responsible for:

- uploading the data;

- managing credentials and user profiles;

- reviewing any project proposal submitted by external data scientists.

In [1]:
import syft as sy

## Launch a local development Datasite

### **What is a Datasite?**

A Datasite is a platform for accessing data without downloading it directly: data scientists send questions to the server and get answers back, while the data itself stays where it is. It works much like a website, but for data: instead of serving files like .html or .css, it serves datasets, each containing one or more assets.

![image.png](attachment:519b7283-9e19-4cba-953a-6f79544c7b3b.png)

In [None]:
data_site = sy.orchestra.launch(name="cancer-research-centre", reset=True)  # good for local development
client = data_site.login(email="info@openmined.org", password="changethis")

## Download the example dataset

The idea is to simulate a version of Owen’s “Breast Cancer Biomarker” dataset using the Breast Cancer Wisconsin (Diagnostic) dataset.

In [None]:
# Install ucimlrepo on the fly if it is missing (a shortcut instead of preparing the environment beforehand)

try:
    from ucimlrepo import fetch_ucirepo
except ImportError:
    !pip install ucimlrepo
    from ucimlrepo import fetch_ucirepo

  
# fetch dataset 
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17) 
  
# data (as pandas dataframes) 
X = breast_cancer_wisconsin_diagnostic.data.features 
y = breast_cancer_wisconsin_diagnostic.data.targets

# metadata 
metadata = breast_cancer_wisconsin_diagnostic.metadata
# variable information 
variables = breast_cancer_wisconsin_diagnostic.variables

### This dataset contains 569 samples, each described by 30 clinical features (i.e. X).

In [5]:
X.head(n=5)  # n specifies how many rows we want in the preview

X.shape 

(569, 30)

### How does PySyft let data scientists work with data **without downloading** or seeing **any copy of the data** itself?

PySyft solves this problem by hosting two **kinds of data**:

First, it hosts the *real data*; second, it hosts *mock data*, that is, a fake version of the real data that data scientists can download and see.
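
To make the real/mock split concrete, here is a toy sketch in plain Python. It is not PySyft code and does not reflect how PySyft enforces access: `ToyAsset`, its `read` method, the role names, and the values are all hypothetical, chosen only to show that two versions of the same asset can sit side by side.

```python
class ToyAsset:
    """Hypothetical stand-in for a hosted asset; NOT part of PySyft."""

    def __init__(self, name, data, mock):
        self.name = name
        self._data = data  # real values: meant to stay on the server
        self.mock = mock   # fake values: safe to hand out

    def read(self, role):
        """Return the real data to the owner and the mock to everyone else."""
        return self._data if role == "owner" else self.mock


# Made-up numbers, used only for illustration
asset = ToyAsset("radius", data=[1.0, 2.0, 3.0], mock=[3.4, 4.1, 5.7])

scientist_view = asset.read(role="data_scientist")
owner_view = asset.read(role="owner")
print(scientist_view)
print(owner_view)
```

The data scientist only ever receives the mock list, which has the same length and type as the real one and is therefore enough to write and debug code against.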

## - Create mock data

In [7]:
# Before uploading the dataset to the Datasite, Owen needs to create a mock version of the data by adding noise.

import numpy as np

# fix the seed for reproducibility
SEED = 12345
np.random.seed(SEED)

# features: shift each column by its mean and add uniform noise in [0, 1)
X_mock = X.apply(lambda s: s + np.mean(s) + np.random.uniform(size=len(s)))
# targets: shuffle the labels so they no longer line up with the rows
y_mock = y.sample(frac=1, random_state=SEED).reset_index(drop=True)
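
A quick sanity check is to confirm that a mock keeps the structure of the real data (same columns, shape and dtypes) while its values differ. The sketch below does this on a tiny made-up DataFrame rather than on the actual dataset; `compare_mock` is a hypothetical helper, not a PySyft or pandas function.

```python
import numpy as np
import pandas as pd


def compare_mock(real: pd.DataFrame, mock: pd.DataFrame) -> dict:
    """Hypothetical helper: compare the structure and values of real vs mock."""
    return {
        "same_columns": list(real.columns) == list(mock.columns),
        "same_shape": real.shape == mock.shape,
        "same_dtypes": bool((real.dtypes == mock.dtypes).all()),
        "values_differ": not real.equals(mock),
    }


rng = np.random.default_rng(12345)

# Made-up values; the column names are placeholders
toy_real = pd.DataFrame({"feature_a": [1.0, 2.0, 3.0], "feature_b": [10.0, 20.0, 30.0]})

# Same recipe as the cell above: column mean plus uniform noise in [0, 1)
toy_mock = toy_real.apply(lambda s: s + s.mean() + rng.uniform(size=len(s)))

report = compare_mock(toy_real, toy_mock)
print(report)
```

If every entry in `report` is `True`, code written against the mock should also run against the real data without any change.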

### Is this mock data good?

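Perhaps not: adding the column mean plus noise in [0, 1) mostly shifts each feature, so the mock values still follow the real ones closely. An approach inspired by Differential Privacy might be better, for example drawing fresh Laplace noise for every column:

`X_mock = X.apply(lambda s: s + np.random.laplace(loc=0, scale=1/epsilon, size=len(s)))`

Here `epsilon` is a privacy parameter to be chosen by the Data Owner (the smaller it is, the larger the noise), and `scale=1/epsilon` assumes a sensitivity of 1.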
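
For illustration, the Laplace idea can be sketched on a single toy column. This is a sketch under stated assumptions, not a vetted privacy mechanism: `laplace_mock`, the bounds `lower` and `upper`, and the value of `epsilon` are all hypothetical. Values are clipped to `[lower, upper]` so that the noise scale `(upper - lower) / epsilon` is well defined.

```python
import numpy as np
import pandas as pd


def laplace_mock(series, epsilon, lower, upper, rng):
    """Hypothetical helper: clip a column to [lower, upper], then add Laplace noise."""
    clipped = series.clip(lower=lower, upper=upper)
    sensitivity = upper - lower  # assumed: the most a single value can change
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=len(clipped))
    return clipped + noise


rng = np.random.default_rng(12345)

# Made-up values and bounds, used only for illustration
toy = pd.Series([12.0, 17.5, 20.1, 9.8], name="toy_feature")
toy_mock = laplace_mock(toy, epsilon=1.0, lower=5.0, upper=30.0, rng=rng)
print(toy_mock)
```

Applying this to all 30 features would spend privacy budget on each column, so a real deployment would also need to account for the total budget.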

Ultimately, it is the responsibility of the Data Owner to decide which solution best fits the intended use of the data.

## - Create Assets
Now that we have both real and mock data, we are ready to create the corresponding assets in PySyft, each identified by its unique `name` within the Datasite.

In [8]:
features_asset = sy.Asset(
    name="Breast Cancer Data: Features",
    data = X,      # real data
    mock = X_mock  # mock data
)

targets_asset = sy.Asset(
    name="Breast Cancer Data: Targets",
    data = y,      # real data
    mock = y_mock  # mock data
)

In [None]:
# Inspecting the Asset objects created -> real data

features_asset.data.head(n=3)


In [None]:
# Inspecting the Asset objects created -> mock data

features_asset.mock.head(n=3)

In [None]:
# First 10 values of the variable radius1 in the mock data

features_asset.mock.radius1.head(n=10)

In [None]:
# First 10 values of the variable radius1 in the real data

features_asset.data.radius1.head(n=10)

Now that the assets are created, we can proceed to create a dataset.

### **Problem!** 

If we uploaded these assets as-is to the Datasite with no additional information, how could an external data scientist ever find the data and know how to use it?

For that reason, each dataset in PySyft is identified by its unique name and carries additional metadata (e.g. `description`, `citation`, `contributors`) that further describes the core data included in its assets.

Let's now collect our metadata, and then use it to create our Dataset object:

## - Create Dataset

In [21]:
# Metadata
description = f'{metadata["abstract"]}\n{metadata["additional_info"]["summary"]}'

paper = metadata["intro_paper"]

# Read the citation fields, falling back to a placeholder when a key is missing
authors = paper.get("authors", "Unknown authors")
title = paper.get("title", "No title available")
published_in = paper.get("published_in", "No journal available")
year = paper.get("year", "Unknown year")

citation = f'{authors} - {title}, {published_in}, {year}'

summary = "The Breast Cancer Wisconsin dataset can be used to predict whether the cancer is benign or malignant."

# Dataset creation
breast_cancer_dataset = sy.Dataset(
    name="Breast Cancer Biomarker",
    description=description,
    summary=summary,
    citation=citation,
    url=metadata["dataset_doi"],
)
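
The `.get` fallbacks used in the cell above mean that a missing key degrades the citation instead of raising a `KeyError`. A minimal sketch of that behaviour follows. It uses made-up `paper` dictionaries, not the actual `metadata["intro_paper"]` content, and `build_citation` is a hypothetical wrapper around the same four lookups.

```python
def build_citation(paper: dict) -> str:
    """Hypothetical wrapper around the same .get lookups used in the cell above."""
    authors = paper.get("authors", "Unknown authors")
    title = paper.get("title", "No title available")
    published_in = paper.get("published_in", "No journal available")
    year = paper.get("year", "Unknown year")
    return f"{authors} - {title}, {published_in}, {year}"


# Made-up entries, used only to exercise the fallbacks
complete = {"authors": "A. Author", "title": "A Title", "published_in": "A Journal", "year": 2000}
partial = {"authors": "A. Author", "title": "A Title"}

full_citation = build_citation(complete)
partial_citation = build_citation(partial)
print(full_citation)
print(partial_citation)
```

Note that `.get` only falls back when the key is absent: a key that is present with the value `None` is returned as `None`.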


### And now, one adds the assets to the dataset!

In [None]:
breast_cancer_dataset.add_asset(features_asset)

breast_cancer_dataset.add_asset(targets_asset)

In [None]:
# Checking the metadata -> the rendered summary looks quite nice!

breast_cancer_dataset

In [None]:
client.upload_dataset(dataset=breast_cancer_dataset)

In [None]:
#To see the datasets uploaded by a client on this server, use command `[your_client].datasets`

client.datasets

Again, OpenMined did quite a good job on the presentation of the results! Now the Datasite can be shut down.

In [26]:
data_site.land()