![Image description](https://om-assets-h3f2axhubqdkbreg.z01.azurefd.net/docs-assets/gs3-project-proposal.png)


# Part 3: Propose the Research Study

### This is the first part of the workflow for **doing remote data science with PySyft on a private dataset**. We now switch point of view to see how Rachel, as a data scientist, interacts with a Datasite, and how PySyft guarantees that non-public information is never seen or accessed by external users.

### Focus here: **preparing the research project**. This is the most thorough part, where the data scientist is involved all the way!

### Same old same old -> Launch a local development Datasite, log in, and check that the dataset is uploaded


In [None]:
import syft as sy

data_site = sy.orchestra.launch(name="cancer-research-centre")

client = data_site.login(email="rachel@datascience.inst", password="syftrocks")  # NEW credentials created in the previous notebook

client.datasets  # check that the "Breast Cancer Biomarker" dataset, uploaded in the first notebook of this series, is in the Datasite

### Accessing the dataset "Breast Cancer Biomarker" either by name or id & checking structure

In [None]:
bc_dataset = client.datasets["Breast Cancer Biomarker"]
bc_dataset

Access its internal assets either by index or by their unique names. In our example, we create one pointer to the features asset and one to the targets asset.

In [None]:
features, targets = bc_dataset.assets  

# checking mock data values

features.mock.head(n=3)  # pandas.DataFrame

targets.mock.head(n=3)

And, as expected, Rachel cannot access the raw data!

In [None]:
features.data
targets.data

This clear distinction between the main components of an asset has the following advantages:

  1.  **mock data is open-access** and poses no risk to the data owner, since sharing it exposes no non-public information;

  2.  it creates a staging environment for the data scientist to simulate their intended study in a realistic way;

  3.  it **reduces liability** for the data scientist, who is no longer responsible for safely storing non-public data;

  4.  it enables the **data owner to control** how non-public assets can be used by data scientists for their study.
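The mock/private split can be illustrated with a small plain-Python sketch. `ToyAsset` and `PrivateDataAccessError` are hypothetical names invented for this illustration; they are not PySyft classes, and the real access control happens on the Datasite, not in client-side code like this.

```python
from dataclasses import dataclass, field


class PrivateDataAccessError(PermissionError):
    """Raised when a non-owner tries to read the private side of an asset."""


@dataclass
class ToyAsset:
    # Hypothetical stand-in for a Datasite asset; not a PySyft class.
    name: str
    mock: list                       # synthetic rows, safe to share with anyone
    _data: list = field(repr=False)  # real rows, kept out of the repr

    def data(self, role: str) -> list:
        # Only the data owner may read the real rows.
        if role != "data_owner":
            raise PrivateDataAccessError(
                f"{role} cannot read private data of {self.name!r}"
            )
        return self._data


asset = ToyAsset(name="features", mock=[[1.0, 2.0]], _data=[[17.3, 21.4]])

# The data scientist can always prototype against the mock ...
print(asset.mock)

# ... but a request for the real rows is refused.
try:
    asset.data(role="data_scientist")
except PrivateDataAccessError as err:
    print("denied:", err)
```

The point of the sketch is only that the two sides of an asset have different access rules: the mock side is freely readable, while the real side is gated by who is asking.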

### Prepare code using public mock data

Rachel decides to study the breast cancer data by running a simple supervised machine learning experiment using `scikit-learn`.
The dataset is already in the expected format, so we are good to go!

In short, these are the steps of the machine learning experiment that Rachel has in mind:

 1. use the **train_test_split** function to generate training and testing partitions;

 2. apply **StandardScaler** to normalise features;

 3. train a **LogisticRegression** model (since we are interested in predicting a binary variable **y** using numerical predictors **X**);

 4. calculate **accuracy_score** on training, and testing data.


In [7]:
def ml_experiment_on_breast_cancer_data(features_data, labels, seed: int = 12345) -> tuple[float, float]:
    # include the necessary imports in the main body of the function
    # to prepare for what PySyft would expect for submitted code.

    
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    
    X, y = features_data, labels.values.ravel()
    # 1. Data Partition
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=seed, stratify=y)
    # 2. Data normalisation
    scaler = StandardScaler()
    scaler.fit(X_train)  # statistics come from the training partition only
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    # 3. Model training
    model = LogisticRegression().fit(X_train, y_train)
    # 4. Metrics Calculation
    acc_train = accuracy_score(y_train, model.predict(X_train))
    acc_test = accuracy_score(y_test, model.predict(X_test))
    
    return acc_train, acc_test

In [None]:
#Checking if it works!

X, y = features.mock, targets.mock

ml_experiment_on_breast_cancer_data(features_data=X, labels=y)

Good to go!
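One detail of the function above is worth spelling out: the scaler is fitted on the training partition only, and the same statistics are then applied to the test partition. The sketch below reproduces that idea for a single feature using only the standard library; the numbers are made up for illustration. It uses the population standard deviation, which is what `StandardScaler` computes.

```python
from statistics import mean, pstdev

# Made-up values for one feature, already split into two partitions.
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 11.0]

# "Fit": estimate mean and (population) standard deviation on train only.
mu, sigma = mean(train), pstdev(train)

# "Transform": apply the same statistics to both partitions.
train_scaled = [(x - mu) / sigma for x in train]
test_scaled = [(x - mu) / sigma for x in test]

print(train_scaled)
print(test_scaled)
```

After scaling, the training values have zero mean and unit variance, while the test values are simply expressed on the same scale without having influenced it.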

### We have experimented with modelling on mock data -> what about real data?

Now, we are interested in testing that function on real data, using PySyft. In particular, we need to transform our (local) Python function into a remote code request: a function that PySyft can process and execute remotely on the Datasite, where the real data are stored.

To do so, we only need to wrap our Python function with a special decorator: `syft_function_single_use`! It sets two policies on the submitted code:

- input_policy: the submitted code will only run on the selected input assets;
- output_policy: an upper limit on the number of times the code is allowed to run (a single run, as the decorator's name suggests).
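To build intuition for these two policies, here is a minimal plain-Python sketch of a decorator that enforces both. `toy_single_use` is a hypothetical name written for this illustration; it is not PySyft's implementation, which enforces its policies on the Datasite.

```python
import functools


def toy_single_use(**bound_inputs):
    """Hypothetical decorator factory mimicking the two policies; not PySyft code."""

    def decorator(func):
        runs_left = 1  # output policy: allow exactly one execution

        @functools.wraps(func)
        def wrapper(**kwargs):
            nonlocal runs_left
            # Input policy: each argument must be exactly the object selected up front.
            if kwargs.keys() != bound_inputs.keys() or any(
                kwargs[k] is not bound_inputs[k] for k in kwargs
            ):
                raise PermissionError("inputs do not match the selected assets")
            if runs_left == 0:
                raise PermissionError("execution limit reached")
            runs_left -= 1
            return func(**kwargs)

        return wrapper

    return decorator


def count_rows(features_data, labels):
    return len(features_data)


features, targets = [1, 2, 3], [0, 1, 0]
guarded = toy_single_use(features_data=features, labels=targets)(count_rows)

first_result = guarded(features_data=features, labels=targets)
print(first_result)
```

A second call to `guarded`, or a call with any other input objects, raises `PermissionError` in this sketch.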


In [None]:
remote_user_code = sy.syft_function_single_use(features_data=features, labels=targets)(ml_experiment_on_breast_cancer_data)

### Can we submit our `remote_user_code` request now? -> Not so fast!

Submitting the code request on its own would make things extremely difficult for Owen, as he would have no clue whatsoever about the intent of the code, or about the study Rachel wants to conduct!

To overcome these issues, PySyft lets us create and submit a **research project**! In essence, a Project (i.e. `syft.Project`) is composed of one (or more) **code request(s)**, and includes a (short) description to communicate the intent of the study to the data owner.

![Image description](https://om-assets-h3f2axhubqdkbreg.z01.azurefd.net/docs-assets/gs3-project-overview.png)
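To make that structure concrete, here is a minimal plain-Python sketch of a project as a description plus a list of code requests. `ToyProject`, `ToyCodeRequest`, and `Status` are hypothetical names used for illustration only; they are not PySyft classes, and the real `syft.Project` is richer than this.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"


@dataclass
class ToyCodeRequest:
    function_name: str
    status: Status = Status.PENDING  # every new request awaits the owner's review


@dataclass
class ToyProject:
    name: str
    description: str  # tells the data owner what the study intends to do
    requests: list[ToyCodeRequest] = field(default_factory=list)

    def create_code_request(self, function_name: str) -> ToyCodeRequest:
        request = ToyCodeRequest(function_name)
        self.requests.append(request)
        return request


project = ToyProject(
    name="Breast Cancer ML Project",
    description="Train a logistic regression and report accuracy scores.",
)
request = project.create_code_request("ml_experiment_on_breast_cancer_data")

print(project.name)
print([(r.function_name, r.status.name) for r in project.requests])
```

Keeping the description next to the requests is what lets a reviewer judge each piece of code against the stated intent of the study.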


In [None]:
description = """
    The purpose of this study will be to run a machine learning
    experimental pipeline on breast cancer data. 
    As a first attempt, the pipeline includes a normalisation step for 
    the features using a StandardScaler. 
    The selected ML model is Logistic regression, with the intent
    to gather the accuracy scores on both training, and testing 
    data partitions, randomly generated.
"""

# Create a project

research_project = client.create_project(
    name="Breast Cancer ML Project",
    description=description,
    user_email_address="rachel@datascience.inst"
)

In [None]:
# check which projects are in the Datasite

client.projects

Now, we can submit our `remote_user_code` request!!

In [None]:
code_request = research_project.create_code_request(remote_user_code, client)

#checking if it worked

code_request

We can now check that the code request has reached the project by accessing `client.code`:

In [None]:
client.code

We do indeed have a code request, in `PENDING` status. Similarly, we can review our existing requests, by accessing `client.requests`:

In [None]:
client.requests

### Before moving on

Let’s say Rachel is very impatient, and tries to force the execution of a request that has not yet been reviewed or approved. Does PySyft allow this?

In [None]:
client.code.ml_experiment_on_breast_cancer_data(features_data=features, labels=targets)

**No!!**

Same idea as before: Rachel cannot force the execution of an unapproved request!
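The behaviour can be pictured as an approval gate sitting between submission and execution. The sketch below is a plain-Python illustration under that assumption; `ToyDatasite` and its methods are hypothetical and do not reflect how PySyft implements the check.

```python
class ToyDatasite:
    """Hypothetical server-side registry; illustrates the approval gate only."""

    def __init__(self):
        self._code = {}         # function name -> callable
        self._approved = set()  # names the data owner has approved

    def submit(self, func):
        # Submitted code is stored, but stays pending until approved.
        self._code[func.__name__] = func

    def approve(self, name):
        # In the real workflow only the data owner could do this.
        self._approved.add(name)

    def run(self, name, *args):
        if name not in self._approved:
            raise PermissionError(f"request for {name!r} is still pending")
        return self._code[name](*args)


def mean_value(values):
    return sum(values) / len(values)


site = ToyDatasite()
site.submit(mean_value)

# An impatient call before approval is refused.
try:
    site.run("mean_value", [2, 4, 6])
except PermissionError as err:
    blocked_message = str(err)
    print("blocked:", blocked_message)

# Once approved, the same call goes through.
site.approve("mean_value")
approved_result = site.run("mean_value", [2, 4, 6])
print(approved_result)
```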