# Part 2: Inspecting Datasets as an External Data Scientist

## Objective
This tutorial covers the following:
- Logging into a domain as a data scientist
- Accessing datasets on a domain
- Explaining a dataset, assets and access control
- Explaining the asset duality: difference between asset.mock, asset.data and asset.pointer
- Working with assets as a data scientist

## Context

In a privacy-preserving remote data science environment, there are two key participants: the Data Owner and the Data Scientist. The Data Owner hosts their private datasets securely, using PySyft, on a server (or domain) that they own and manage. The Data Scientist, typically an external party, is granted controlled access to this domain. Their access is limited to a mock or synthetic representation of the original data, which they can use to prepare their studies.

The PySyft server operates at different levels. At Level 0 and Level 1, the mock and private datasets are stored on separate machines. As we move to Level 2 and beyond, both the mock and private datasets are hosted on the same machine. Despite these variations in server configuration, the Data Scientist's experience remains consistent. Regardless of the server level, they will only have access to the mock version of the data and never to the original private dataset.

In this tutorial, we will guide you through the process of logging in as an authorized user, specifically a data scientist, to an already established domain. We will explore the available datasets and clarify the distinctions between various attributes of these hosted datasets.

### 1. Required Setup for the Tutorial from Data Owner's Side

Before we get to the crux of the tutorial, we will first set up the environment for a Data Scientist.

##### Note: The following steps are executed on the Data Owner's side and are included **for demonstration purposes only** in this tutorial.

#### 1.1. Preparing a test domain

In [1]:
import pandas as pd
import syft as sy



In [2]:
# launching a test node
node = sy.orchestra.launch(name="test_domain", port=8080, dev_mode=False, reset=True)

# logging in with default credentials (only for example)
domain = sy.login(email="info@openmined.org", password="changethis", port=8080)

Starting test_domain server on 0.0.0.0:8080




Waiting for server to start.. Done.
Logged into <test_domain: High side Domain> as <info@openmined.org>


#### 1.2. Gather data and upload to test domain

For this demonstration, we are going to use [The Age Dataset 2023 from Kaggle](https://www.kaggle.com/datasets/lasaljaywardena/age-dataset-2023), which was introduced in the [tutorial on generating mock data](https://github.com/OpenMined/Tutorials/blob/zarreen/docs/notebooks/getting_started/data_owner/0.8.2/0/en/part-3-level-0-creating-mock-data.ipynb). 

In [None]:
# !pip install gdown

In [3]:
import gdown

url = "https://drive.google.com/u/1/uc?id=1TGxZ8wVAR0beTcKkKw5pQm2rbjEx-VBX&export=download"
gdown.download(url=url, output="ages_dataset.csv", quiet=True)

age_df = pd.read_csv("ages_dataset.csv")
age_df = age_df.dropna(how="any")
print(age_df.shape)
age_df.head()

(44211, 13)


Unnamed: 0,Id,Name,Short description,Gender,Country,Occupation,Birth year,Death year,Manner of death,Age of death,Associated Countries,Associated Country Coordinates (Lat/Lon),Associated Country Life Expectancy
0,Q23,George Washington,1st president of the United States (1732–1799),Male,United States of America; Kingdom of Great Bri...,Politician,1732,1799.0,natural causes,67.0,"['United Kingdom', 'United States']","[(55.378051, -3.435973), (37.09024, -95.712891)]","[81.3, 78.5]"
1,Q42,Douglas Adams,English writer and humorist,Male,United Kingdom,Artist,1952,2001.0,natural causes,49.0,['United Kingdom'],"[(55.378051, -3.435973)]",[81.3]
2,Q91,Abraham Lincoln,16th president of the United States (1809-1865),Male,United States of America,Politician,1809,1865.0,homicide,56.0,['United States'],"[(37.09024, -95.712891)]",[78.5]
5,Q260,Jean-François Champollion,French classical scholar,Male,Kingdom of France; First French Empire,Egyptologist,1790,1832.0,natural causes,42.0,['France'],"[(46.227638, 2.213749)]",[82.5]
7,Q296,Claude Monet,French impressionist painter (1840-1926),Male,France,Artist,1840,1926.0,natural causes,86.0,['France'],"[(46.227638, 2.213749)]",[82.5]


We have also generated a mock version of this original `age_df` using the steps demonstrated in the [tutorial on generating mock data](https://github.com/OpenMined/Tutorials/blob/zarreen/docs/notebooks/getting_started/data_owner/0.8.2/0/en/part-3-level-0-creating-mock-data.ipynb). 

In [4]:
url = "https://drive.google.com/u/1/uc?id=1maJrS8JJgThQ_Wt4YtHLEO2RM4SocWdw&export=download"
gdown.download(url=url, output="ages_mock_dataset.csv", quiet=True)

age_mock_df = pd.read_csv("ages_mock_dataset.csv")
age_mock_df = age_mock_df.dropna(how="any")
print(age_mock_df.shape)
age_mock_df.head()

(44211, 13)


Unnamed: 0,Id,Gender,Age of death,Associated Countries,Associated Country Life Expectancy,Manner of death,Name,Short description,Occupation,Death year,Birth year,Country,Associated Country Coordinates (Lat/Lon)
0,Q19723,Gender 1,53.0,['United States'],[78.5],homicide,Norma Fisher,Magazine truth stop whose group through despite.,Corporate treasurer,1989.0,1936,Not Available,Not Available
1,Q20057,Gender 1,51.0,['United Kingdom'],[81.3],natural causes,Brandon Lloyd,Total financial role together range line beyon...,Chief Financial Officer,2018.0,1967,Not Available,Not Available
2,Q8791,Gender 1,84.0,['Sweden'],[82.5],natural causes,Michelle Glover,Partner stock four. Region as true develop sou...,Speech and language therapist,2000.0,1916,Not Available,Not Available
3,Q30567,Gender 1,64.0,['Belgium'],[81.6],natural causes,Willie Golden,Feeling fact by four. Data son natural explain...,Financial controller,1989.0,1925,Not Available,Not Available
4,Q14013,Gender 1,88.0,['United Kingdom'],[81.3],suicide,Roberto Johnson,Attorney quickly candidate change although bag...,"Sound technician, broadcasting/film/video",2016.0,1928,Not Available,Not Available


Now that we have both the real and mock data, we will upload them to the test domain, following the steps described in the [tutorial on uploading a dataset to a Level 0 domain](https://github.com/OpenMined/Tutorials/blob/carmen-part4/docs/notebooks/getting_started/data_owner/0.8.2/0/en/part3_level0_uploading_dataset.ipynb).

ℹ️ **When uploading real and mock data to a Level 0 domain, the dimensions of both datasets must match. If they don't, we need to take a subsample of the real data that matches the dimensions and column names of its mock counterpart. In this example, the shapes already match.**
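
One way to take such a subsample is sketched below. This is a minimal illustration, not part of PySyft: `align_to_mock` is a hypothetical helper name, and the two toy frames stand in for `age_df` and `age_mock_df`. It assumes every mock column also exists in the real data.

```python
import pandas as pd


def align_to_mock(real_df: pd.DataFrame, mock_df: pd.DataFrame) -> pd.DataFrame:
    """Restrict real_df to the mock's columns (in the mock's order) and row count."""
    missing = [col for col in mock_df.columns if col not in real_df.columns]
    if missing:
        raise ValueError(f"real data is missing columns: {missing}")
    if len(real_df) < len(mock_df):
        raise ValueError("real data has fewer rows than the mock")
    # Keep the first len(mock_df) rows and reorder the columns to match the mock.
    return real_df.iloc[: len(mock_df)][list(mock_df.columns)]


# Toy frames standing in for age_df and age_mock_df.
real = pd.DataFrame({"Id": ["Q23", "Q42", "Q91"], "Birth year": [1732, 1952, 1809]})
mock = pd.DataFrame({"Birth year": [1936, 1967], "Id": ["Q19723", "Q20057"]})

aligned = align_to_mock(real, mock)
print(aligned.shape)  # (2, 2)
```

Taking the first rows keeps the sketch simple; a random subsample would work equally well as long as the resulting shape and column names match the mock.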

In [29]:
# assert age_df.shape == age_mock_df.shape

In [27]:
# mock_columns = list(age_mock_df.columns)
# age_df = age_df[:age_mock_df.shape[0]][mock_columns]
# print(age_df.shape)

In [6]:
description = '''### About the dataset
This extensive dataset provides a rich collection of demographic and life events records for individuals across multiple countries. It covers a wide range of indicators and attributes related to personal information, birth and death events, gender, occupation, and associated countries. The dataset offers valuable insights into population dynamics and various aspects of human life, enabling comprehensive analyses and cross-country comparisons. The dataset is the largest one on notable deceased people and includes individuals from a variety of social groups, including but not limited to 107k females, 90k researchers, and 124 non-binary individuals, spread across more than 300 contemporary or historical regions.

### Key features
1. **Id**: Unique identifier for each individual.
2. **Name**: Name of the person.
3. **Short description**: Brief description or summary of the individual.
4. **Gender**: Gender/s of the individual.
5. **Country**: Countries/Kingdoms of residence and/or origin.
6. **Occupation**: Occupation or profession of the individual.
7. **Birth year**: Year of birth for the individual.
8. **Death year**: Year of death for the individual.
9. **Manner of death**: Details about the circumstances or manner of death.
10. **Age of death**: Age at the time of death for the individual.
11. **Associated Countries**: Modern Day Countries associated with the individual.
12. **Associated Country Coordinates (Lat/Lon)**: Modern Day Latitude and longitude coordinates of the associated countries.
13. **Associated Country Life Expectancy**: Life expectancy of the associated countries.

### Use cases
- Analyze demographic trends and birth rates in different countries.
- Investigate factors affecting life expectancy and mortality rates.
- Study the relationship between gender and occupation across regions.
- Explore correlations between age of death and associated country attributes.
- Examine patterns of migration and associated countries' life expectancy.

### Citation
Annamoradnejad, Issa; Annamoradnejad, Rahimberdi (2022), “Age dataset: A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people”, In Workshop Proceedings of the 16th International AAAI Conference on Web and Social Media (ICWSM), doi: 10.36190/2022.82
'''

# Creating a dataset with one asset

level_0_dataset = sy.Dataset(
    name="Age Dataset",
    description=description,
    asset_list=[
        sy.Asset(
            name="Age Data 2023",
            data=age_df,
            mock=age_mock_df,
        )
    ],
)

In [7]:
# Uploading the dataset
domain.upload_dataset(level_0_dataset)



  0%|                                                                              | 0/1 [00:00<?, ?it/s]

Uploading: Age Data 2023


100%|██████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.48it/s]


#### 1.3. Create an account for data scientist

In [8]:
# Register a new user as a GUEST
response = domain.register(email="holmes@bakerstreet.com", password="SKY5cC2zQPRP", name="Holmes")
response

Confirm Password: ········


### 2. Log into Domain as a Data Scientist

In order to log into a domain hosted by a Data Owner, a Data Scientist needs the following information.

- The `name` or `url` of the domain and the `port` on which it is launched
- The credentials associated with the account created for the data scientist, such as `email` and `password`

In [9]:
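# connect to the already running test domain (reset=False keeps the uploaded dataset)
node = sy.orchestra.launch(name="test_domain", port=8080, dev_mode=False, reset=False)
# log in with the data scientist's credentials
guest_client = node.client.login(email="holmes@bakerstreet.com", password="SKY5cC2zQPRP", name="Holmes")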

Starting test_domain server on 0.0.0.0:8080
 Done.




Logged into <test_domain: High side Domain> as <holmes@bakerstreet.com>


### 3. Accessing datasets

After logging into the domain, a Data Scientist can explore the datasets and associated attributes made available to them. In PySyft, datasets are Syft objects of type `sy.Dataset`. A dataset is a wrapper that holds attributes such as a `name`, a `description`, and, most importantly, `assets`. Assets (`sy.Asset`) are Syft objects that hold the actual data uploaded by the data owner. Assets hold both the real and the mock data; however, data scientists can only access the mock data.

In [10]:
# Access a list of all available datasets in the domain
guest_client.datasets

In [11]:
# Try to access a specific dataset by id (in this run, the call raises the error shown below)
guest_client.datasets.get_by_id("e70c8661cd4c448a8a9234b814040dea")

---------------------------------------------------------------------------
SyftAttributeError
---------------------------------------------------------------------------
Exception: 'APIModule' api.dataset object has no submodule or method 'get_by_id', you may not have permission to access the module you are trying to access


In [12]:
# Access one of the datasets in the list by index and store it in a variable for later use

dataset = guest_client.datasets[0]
dataset

In [13]:
# Access specific properties of a dataset
dataset.description

### About the dataset
This extensive dataset provides a rich collection of demographic and life events records for individuals across multiple countries. It covers a wide range of indicators and attributes related to personal information, birth and death events, gender, occupation, and associated countries. The dataset offers valuable insights into population dynamics and various aspects of human life, enabling comprehensive analyses and cross-country comparisons. The dataset is the largest one on notable deceased people and includes individuals from a variety of social groups, including but not limited to 107k females, 90k researchers, and 124 non-binary individuals, spread across more than 300 contemporary or historical regions.

### Key features
1. **Id**: Unique identifier for each individual.
2. **Name**: Name of the person.
3. **Short description**: Brief description or summary of the individual.
4. **Gender**: Gender/s of the individual.
5. **Country**: Countries/Kingdoms of residence and/or origin.
6. **Occupation**: Occupation or profession of the individual.
7. **Birth year**: Year of birth for the individual.
8. **Death year**: Year of death for the individual.
9. **Manner of death**: Details about the circumstances or manner of death.
10. **Age of death**: Age at the time of death for the individual.
11. **Associated Countries**: Modern Day Countries associated with the individual.
12. **Associated Country Coordinates (Lat/Lon)**: Modern Day Latitude and longitude coordinates of the associated countries.
13. **Associated Country Life Expectancy**: Life expectancy of the associated countries.

### Use cases
- Analyze demographic trends and birth rates in different countries.
- Investigate factors affecting life expectancy and mortality rates.
- Study the relationship between gender and occupation across regions.
- Explore correlations between age of death and associated country attributes.
- Examine patterns of migration and associated countries' life expectancy.

### Citation
Annamoradnejad, Issa; Annamoradnejad, Rahimberdi (2022), “Age dataset: A structured general-purpose dataset on life, work, and death of 1.22 million distinguished people”, In Workshop Proceedings of the 16th International AAAI Conference on Web and Social Media (ICWSM), doi: 10.36190/2022.82


You can also access other properties by calling `dataset.[property]`, for example:
- `dataset.name`
- `dataset.assets`

### 4. Accessing assets

Assets are the most important property of a dataset. Without assets, a dataset holds only metadata; the actual data comes only as assets. Therefore, it is crucial to understand what assets are and how to navigate them properly.

#### Difference between datasets and assets

The PySyft library gives data owners the flexibility to group assets, or related data, into a single dataset.

In other words, a dataset (`sy.Dataset`) is a collection of assets, and each asset is a single object which holds the data. One dataset can hold zero, one, or more assets; an asset, however, must always belong to a dataset.

For example, a data owner can provide the same data in multiple formats within a single uploaded dataset, using one asset for each format. Similarly, one can split the data into training / validation / testing parts and use a separate asset for each part.
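
The relationship between a dataset and its assets can be pictured with a small toy model. The classes below are hypothetical stand-ins written for this illustration only; they are not PySyft's `sy.Dataset` or `sy.Asset`, and the row counts are made up.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAsset:
    """Stand-in for an asset: a named piece of data."""
    name: str
    rows: int


@dataclass
class ToyDataset:
    """Stand-in for a dataset: metadata plus a list of assets."""
    name: str
    description: str = ""
    assets: list[ToyAsset] = field(default_factory=list)

    def asset(self, name: str) -> ToyAsset:
        """Look up an asset by name."""
        for candidate in self.assets:
            if candidate.name == name:
                return candidate
        raise KeyError(name)


# One dataset, three assets marking a training / validation / testing split.
splits = ToyDataset(
    name="Age Dataset",
    assets=[ToyAsset("train", 800), ToyAsset("validation", 100), ToyAsset("test", 100)],
)
# A dataset with zero assets holds only metadata.
metadata_only = ToyDataset(name="Empty Dataset")

print([a.name for a in splits.assets])
print(splits.asset("test").rows)
```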

In [14]:
# Access the list of assets attached to a dataset. In this dataset, there is just one asset
dataset.assets

To retrieve a specific asset, you can use the name of the asset. As seen in the example below, the asset shows **only** the mock data, not the real one.

In [15]:
asset = dataset.assets['Age Data 2023']
asset

Id,Gender,Age of death,Associated Countries,Associated Country Life Expectancy,Manner of death,Name,Short description,Occupation,Death year,Birth year,Country,Associated Country Coordinates (Lat/Lon)
Loading... (need help?),,,,,,,,,,,,


#### Understanding Asset Duality

Each asset is a two-sided object that secures the private data. An asset has the following properties, which differ in accessibility and usability.

- `asset` or `asset.pointer` points to the real, private data. It can be passed into computations, but cannot be accessed directly by anyone except the Data Owner
- `asset.mock` is a Pandas DataFrame containing the fake or mock counterpart of the data, which the data scientist can safely use for running studies and testing code
- `asset.data` is the actual private data, which is inaccessible to the Data Scientist
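
To make the duality concrete, here is a purely conceptual sketch. `ToyTwoSidedAsset`, its `data_for` method, and the role strings are assumptions made for this illustration; PySyft enforces access on the server and does not work this way internally. The values are illustrative.

```python
from typing import Any, Optional


class ToyTwoSidedAsset:
    """Conceptual stand-in for an asset: a private side and a mock side."""

    def __init__(self, name: str, private: Any, mock: Any) -> None:
        self.name = name
        self._private = private  # real data, stays on the data owner's side
        self.mock = mock         # fake data, safe to hand to a data scientist

    def data_for(self, role: str) -> Optional[Any]:
        """Return the private data to the data owner only; everyone else gets None."""
        return self._private if role == "data_owner" else None


toy_asset = ToyTwoSidedAsset(
    name="Age Data 2023",
    private=[67.0, 49.0, 56.0],  # real ages of death (illustrative)
    mock=[53.0, 51.0, 84.0],     # mock ages of death (illustrative)
)

print(toy_asset.mock)
print(toy_asset.data_for("data_scientist"))
print(toy_asset.data_for("data_owner"))
```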

### 5. Working with Assets 

As a data scientist, you can work with the data or asset in two ways (both before submitting the project and the code request). Submitting projects and writing code requests will be covered in future tutorials.

**Approach 1: Fully local**

- this approach is useful when you experiment with the data **before** making a code request
- in this approach, you should use `dataset.assets[index].mock`
- one benefit of using this version of the mock is that the code written using this data will use the underlying NumPy and Pandas objects, and not the actual Syft objects. Thus you can freely experiment with it in your code.

**Approach 2: Low-side domain**

- this approach is recommended when you run the code on the low-side domain by writing a code request
- in this approach, you should use `dataset.assets[0]` instead of the mock data
- this version of the data is what you reference when submitting your code request, and as such it uses the Syft library and Syft objects
- in theory, a function tested with this data should behave the same as the submitted function, so that it doesn't break when the data owner runs it on the real data on the high-side domain
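
As a sketch of Approach 1, the function below is written and checked against plain Pandas only. `mean_age_by_manner` is a hypothetical analysis chosen for illustration, and the small frame stands in for `dataset.assets[0].mock`; the column names and values are taken from the mock data shown earlier in this tutorial.

```python
import pandas as pd


def mean_age_by_manner(df: pd.DataFrame) -> pd.Series:
    """Average 'Age of death' for each 'Manner of death'."""
    return df.groupby("Manner of death")["Age of death"].mean()


# Stand-in for dataset.assets[0].mock, using the first five mock rows.
mock_sample = pd.DataFrame(
    {
        "Manner of death": ["homicide", "natural causes", "natural causes", "natural causes", "suicide"],
        "Age of death": [53.0, 51.0, 84.0, 64.0, 88.0],
    }
)

result = mean_age_by_manner(mock_sample)
print(result)
```

Because the function only touches standard Pandas objects, it can be iterated on freely before it is turned into a code request.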


Now that we understand the properties of an asset, let's see how to get the data (i.e. mock data) from an asset to be used for running studies.

In [16]:
# Access data (mock data) from an asset (Approach 1)

mock_df = dataset.assets[0].mock
mock_df

Unnamed: 0,Id,Gender,Age of death,Associated Countries,Associated Country Life Expectancy,Manner of death,Name,Short description,Occupation,Death year,Birth year,Country,Associated Country Coordinates (Lat/Lon)
0,Q19723,Gender 1,53.0,['United States'],[78.5],homicide,Norma Fisher,Magazine truth stop whose group through despite.,Corporate treasurer,1989.0,1936,Not Available,Not Available
1,Q20057,Gender 1,51.0,['United Kingdom'],[81.3],natural causes,Brandon Lloyd,Total financial role together range line beyon...,Chief Financial Officer,2018.0,1967,Not Available,Not Available
2,Q8791,Gender 1,84.0,['Sweden'],[82.5],natural causes,Michelle Glover,Partner stock four. Region as true develop sou...,Speech and language therapist,2000.0,1916,Not Available,Not Available
3,Q30567,Gender 1,64.0,['Belgium'],[81.6],natural causes,Willie Golden,Feeling fact by four. Data son natural explain...,Financial controller,1989.0,1925,Not Available,Not Available
4,Q14013,Gender 1,88.0,['United Kingdom'],[81.3],suicide,Roberto Johnson,Attorney quickly candidate change although bag...,"Sound technician, broadcasting/film/video",2016.0,1928,Not Available,Not Available
...,...,...,...,...,...,...,...,...,...,...,...,...,...
44206,Q21223,Gender 1,87.0,['United States'],[78.5],natural causes,Steven Hill,Occur site mean. None imagine social collectio...,Television/film/video producer,2014.0,1927,Not Available,Not Available
44207,Q18681,Gender 1,75.0,['Austria'],[81.6],natural causes,Laura Smith,Five help event as sort. Class training possib...,Race relations officer,2018.0,1943,Not Available,Not Available
44208,Q34424,Gender 1,56.0,['France'],[82.5],natural causes,Diana Jacobs,Middle style capital describe increase. Fly si...,Civil Service fast streamer,2009.0,1953,Not Available,Not Available
44209,Q33102,Gender 1,75.0,['France'],[82.5],natural causes,Larry Foster,Watch size character piece speak moment outsid...,Speech and language therapist,1982.0,1907,Not Available,Not Available


As you can see, `mock_df` is a standard Pandas DataFrame. You can work with it as with any other DataFrame. For example, you can access the shape, print rows, compute summary statistics on the columns, split the dataset, etc.
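
Splitting, for instance, needs nothing beyond Pandas. The sketch below uses a small stand-in frame rather than the full `mock_df`, and the 80/20 ratio and seed are arbitrary choices for illustration.

```python
import pandas as pd

# Stand-in for mock_df; any DataFrame works the same way.
toy_df = pd.DataFrame(
    {
        "Age of death": [53.0, 51.0, 84.0, 64.0, 88.0],
        "Birth year": [1936, 1967, 1916, 1925, 1928],
    }
)

# Reproducible 80/20 split: sample the training rows, keep the rest for testing.
train_df = toy_df.sample(frac=0.8, random_state=0)
test_df = toy_df.drop(train_df.index)

print(train_df.shape, test_df.shape)  # (4, 2) (1, 2)
```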

In [17]:
print(mock_df.shape)
mock_df.head()

(44211, 13)


Unnamed: 0,Id,Gender,Age of death,Associated Countries,Associated Country Life Expectancy,Manner of death,Name,Short description,Occupation,Death year,Birth year,Country,Associated Country Coordinates (Lat/Lon)
0,Q19723,Gender 1,53.0,['United States'],[78.5],homicide,Norma Fisher,Magazine truth stop whose group through despite.,Corporate treasurer,1989.0,1936,Not Available,Not Available
1,Q20057,Gender 1,51.0,['United Kingdom'],[81.3],natural causes,Brandon Lloyd,Total financial role together range line beyon...,Chief Financial Officer,2018.0,1967,Not Available,Not Available
2,Q8791,Gender 1,84.0,['Sweden'],[82.5],natural causes,Michelle Glover,Partner stock four. Region as true develop sou...,Speech and language therapist,2000.0,1916,Not Available,Not Available
3,Q30567,Gender 1,64.0,['Belgium'],[81.6],natural causes,Willie Golden,Feeling fact by four. Data son natural explain...,Financial controller,1989.0,1925,Not Available,Not Available
4,Q14013,Gender 1,88.0,['United Kingdom'],[81.3],suicide,Roberto Johnson,Attorney quickly candidate change although bag...,"Sound technician, broadcasting/film/video",2016.0,1928,Not Available,Not Available


In [18]:
mock_df.describe()

Unnamed: 0,Age of death,Death year,Birth year
count,44211.0,44211.0,44211.0
mean,60.740969,1996.392482,1935.651512
std,20.890624,15.518847,26.057708
min,8.0,1970.0,1862.0
25%,44.0,1983.0,1917.0
50%,61.0,1996.0,1935.0
75%,77.0,2010.0,1954.0
max,125.0,2023.0,2012.0


In [19]:
# Access asset or asset.pointer from an asset (Approach 2).

# As explained above, it includes the asset metadata and mock data.
# It also points to the real data attached to it, which cannot be viewed directly.

asset = dataset.assets[0]
asset

Id,Gender,Age of death,Associated Countries,Associated Country Life Expectancy,Manner of death,Name,Short description,Occupation,Death year,Birth year,Country,Associated Country Coordinates (Lat/Lon)
Loading... (need help?),,,,,,,,,,,,


Finally, you can try accessing the `asset.data` property, which holds the actual private data. As seen in the example, it returns `None` (a `NoneType` object), since data scientists do not have access to it.

In [20]:
real_df = dataset.assets[0].data
real_df

In [21]:
print(type(real_df))

<class 'NoneType'>
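
Since `asset.data` comes back as `None` for a data scientist, exploratory code can check for that explicitly instead of failing later on. A minimal sketch, where `data_or_mock` is a hypothetical helper and the `SimpleNamespace` objects are stand-ins for assets as seen from each side:

```python
from types import SimpleNamespace


def data_or_mock(asset):
    """Return (values, source): the private data if visible, else the mock."""
    data = getattr(asset, "data", None)
    if data is None:
        return asset.mock, "mock"
    return data, "data"


# Stand-ins: what a data scientist sees vs. what a data owner sees.
scientist_view = SimpleNamespace(data=None, mock=[53.0, 51.0, 84.0])
owner_view = SimpleNamespace(data=[67.0, 49.0, 56.0], mock=[53.0, 51.0, 84.0])

print(data_or_mock(scientist_view))
print(data_or_mock(owner_view))
```

Returning the source label alongside the values makes it obvious in logs or notebook output whether a result was computed on mock or real data.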


### Summary

- A data scientist can access all the datasets made available to them by logging into an authorized domain
- A dataset is a special Syft object which holds metadata such as the name and description, and the data itself as assets. One dataset can have multiple assets.
- Assets always come accompanied by a dataset.
- Assets are two-sided Syft objects: one side holds the real private data, and the other side holds its mock counterpart
- Data scientists can only access the mock data from assets, and can never access the real data

In the next section, we will learn how data scientists can prepare their studies using assets and mock data and submit a code request. 