#  OSFUpload

The goals of this notebook are to:
- Explain how compSPI datasets are stored in OSF.io
- Illustrate retrieval of datasets
- Demonstrate the use of OSFUpload 

Prerequisites:
- OSF.io account and personal access token
- (For upload) Contributor role on the OSF.io project page


## OSF.io

*OSF.io* is a free, open platform maintained by the Center for Open Science that allows researchers and groups to share data and publish findings. The datasets generated and used by *compSPI* are publicly hosted on the "[compSPI datasets](https://osf.io/tj8ya/)" page on OSF.


![image.png](attachment:0d15fc81-208c-4b0e-a42d-71af1ade2ada.png)

Project pages on *OSF.io* are structured like a file system. A project page and its components (also referred to as *nodes*) can be thought of as folders that can contain other folders (i.e., components) or files.
The *CryoEM Datasets* project page is organised as shown above, where:

- "Data" houses all compSPI structural data
- Components labelled as various "Structure IDs" house all data pertaining to a certain structure
- Components under a "Structure ID" correspond to various datasets of the structure. The data/meta-data files can be found at this level.




## Downloading from OSF

### Website

The datasets are publicly available and can be accessed directly by navigating to the [compSPI datasets](https://osf.io/24htr/) data page. The various levels of the project page can be navigated through the *OSF.io* user interface as shown below.

![image.png](attachment:5093dc53-ae6a-4dea-8c09-cece1df5a39c.png)

The dataset components are labelled with tags that provide a high-level overview of their characteristics. The tags can be seen by navigating to one of the dataset components; they are listed on the right of the page.

The OSF search feature can be used to quickly filter and retrieve datasets. Search queries are constructed using the [Lucene Search Syntax](https://extensions.xwiki.org/xwiki/bin/view/Extension/Search+Application+Query+Syntax#HSearchingforfieldsinXWikiobjects).

For compSPI datasets, all datasets corresponding to a certain structure can be retrieved using the syntax <code>title:structure_name\*</code>. For example, <code>title:4v6x\*</code> would retrieve all nodes with "4v6x" in their title.

Further, datasets can be narrowed down by tags. compSPI dataset tags follow the convention *tag_name:tag_value*, for example <code>noise_type:gaussian</code> or <code>pixel_size_um:1</code>. To use tags in a search, use the syntax <code>tags:("tag_name:tag_value")</code>.

For example, the following query retrieves all datasets for the 4v6x structure whose rotation_distribution tag is uniform_on_sphere: <code>title:4v6x* AND tags:("rotation_distribution:uniform_on_sphere")</code>

![image.png](attachment:20f689ed-c799-4b9e-9a28-b30c7036ae4c.png)
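The query syntax described above can also be assembled programmatically. The following is a minimal sketch, not part of compSPI or ioSPI: the helper name `build_osf_query` is hypothetical, and joining several tag clauses with `AND` is an assumption based on standard Lucene syntax (the text above only shows a single tag clause).

```python
def build_osf_query(structure_name, tags=None):
    """Build a Lucene-style query string for the OSF search box.

    `tags` is an optional dict mapping tag_name -> tag_value.
    Hypothetical helper for illustration only.
    """
    # Title clause: wildcard match on the structure name.
    parts = [f"title:{structure_name}*"]
    # One tags:("name:value") clause per requested tag.
    for tag_name, tag_value in (tags or {}).items():
        parts.append(f'tags:("{tag_name}:{tag_value}")')
    # Assumption: clauses are combined with AND.
    return " AND ".join(parts)


print(build_osf_query("4v6x", {"rotation_distribution": "uniform_on_sphere"}))
# title:4v6x* AND tags:("rotation_distribution:uniform_on_sphere")
```

The resulting string can be pasted into the OSF search box as-is.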

### API

Nodes and files hosted on *OSF.io* can be accessed using the [OSF API](https://developer.osf.io/). An OSF account and token (generation demonstrated in API docs) are required to use the API. 
Note that each node is identified by a unique ID. The ID for the Data node on the compSPI project page is *24htr*.

The OSF API is well documented. Some examples in the context of the compSPI datasets are given below. For an in-depth walkthrough of the API, the API documentation is a great resource.

In [5]:
import requests
token = "API_TOKEN_HERE" 
    

In [6]:
# get all children under a node (data in this instance)
def get_child_nodes_under_data():

    request_url = "https://api.osf.io/v2/nodes/24htr/children"
    request_headers = {"Authorization": f"Bearer {token}"}
    
    
    response = requests.get(request_url, headers=request_headers)
    response.raise_for_status()
    
    return response.json()["data"]
    
response_data = get_child_nodes_under_data()

# print first child's title and id

print(response_data[0]["attributes"]["title"])
print(response_data[0]["id"])

4v6x
grzkb
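List endpoints such as `children` return results one page at a time, so a single request may not contain every child node. The sketch below follows the `links.next` URL until it is empty; it assumes the response layout described in the OSF API documentation (`data` plus a `links` object with a `next` entry). The names `collect_all_pages` and `titles_to_guids` are hypothetical, and the pages used here are fake stand-ins so the example runs without a network connection or token.

```python
def collect_all_pages(first_url, fetch_json):
    """Follow `links.next` and return the items from every page.

    `fetch_json` is any callable mapping a URL to a decoded JSON payload.
    """
    items = []
    url = first_url
    while url:
        payload = fetch_json(url)
        items.extend(payload["data"])
        # `next` is assumed to be None (or absent) on the last page.
        url = (payload.get("links") or {}).get("next")
    return items


def titles_to_guids(nodes):
    """Map each node title to its GUID."""
    return {node["attributes"]["title"]: node["id"] for node in nodes}


# Fake two-page response; "abcde" and its title are made up for illustration.
fake_pages = {
    "page1": {
        "data": [{"id": "grzkb", "attributes": {"title": "4v6x"}}],
        "links": {"next": "page2"},
    },
    "page2": {
        "data": [{"id": "abcde", "attributes": {"title": "hypothetical_structure"}}],
        "links": {"next": None},
    },
}

nodes = collect_all_pages("page1", fake_pages.__getitem__)
print(titles_to_guids(nodes))
```

Against the real API, `fetch_json` could be `lambda url: requests.get(url, headers=request_headers).json()`, starting from the `children` URL used above.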


In [7]:
# Retrieve a node's GUID given its title
def get_guid(title):
    request_url = f"https://api.osf.io/v2/nodes/?filter[title]={title}"
    request_headers = {"Authorization": f"Bearer {token}"}
    
    
    response = requests.get(request_url, headers=request_headers)
    response.raise_for_status()
    
    node_id = response.json()["data"][0]["id"]
    
    return node_id
  
    
    

get_guid("download_demo_compspi")

'w9f6z'
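`get_guid` above raises an `IndexError` when no node matches, and it places the title directly into the URL. A slightly more defensive variant is sketched below using only the standard library; the helper names `build_title_filter_url` and `first_guid_or_none` are hypothetical and the payloads are fake.

```python
from urllib.parse import urlencode


def build_title_filter_url(title):
    """Return the nodes URL with the title filter percent-encoded."""
    query = urlencode({"filter[title]": title})
    return f"https://api.osf.io/v2/nodes/?{query}"


def first_guid_or_none(payload):
    """Return the GUID of the first node in a response payload, or None."""
    data = payload.get("data") or []
    return data[0]["id"] if data else None


print(build_title_filter_url("download_demo_compspi"))
print(first_guid_or_none({"data": [{"id": "w9f6z"}]}))  # w9f6z
print(first_guid_or_none({"data": []}))  # None
```

With `requests`, passing `params={"filter[title]": title}` to `requests.get` performs the same encoding.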

In [9]:
# Download the first file stored in a node
def download_file(guid):
    file_request_url =  f"https://api.osf.io/v2/nodes/{guid}/files/osfstorage"
    request_headers = {"Authorization": f"Bearer {token}"}
    
    response = requests.get(file_request_url, headers=request_headers)
    response.raise_for_status()
    
    download_link = response.json()["data"][0]["links"]["download"]
    return requests.get(download_link).text

download_file("w9f6z")

'Demo for osf download'
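`download_file` above always takes the first entry in the listing. To pick a specific file instead, the listing can be searched by name. The sketch below assumes each entry exposes `attributes.name` alongside the `links.download` field already used above; `find_download_link` is a hypothetical helper, and the payload and URL are fake.

```python
def find_download_link(files_payload, filename):
    """Return the download URL of the entry named `filename`, or None."""
    for entry in files_payload.get("data", []):
        if entry["attributes"]["name"] == filename:
            return entry["links"]["download"]
    return None


# Fake listing for illustration; the URL is a placeholder.
fake_listing = {
    "data": [
        {"attributes": {"name": "demo.txt"},
         "links": {"download": "https://example.org/demo.txt"}},
        {"attributes": {"name": "meta.txt"},
         "links": {"download": "https://example.org/meta.txt"}},
    ]
}

print(find_download_link(fake_listing, "meta.txt"))
# https://example.org/meta.txt
```

For binary files, `requests.get(link).content` returns raw bytes, which is usually more appropriate than `.text`.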

## Uploading to OSF

The *datasets* module in *ioSPI* contains utility functions that can be used to upload datasets (comprising structure and metadata files).

In [11]:
from ioSPI.datasets import OSFUpload

To upload datasets through the API, you must be a contributor to the *compSPI datasets* page. You can become one by [requesting access](https://help.osf.io/hc/en-us/articles/360019737394-Request-Access-to-a-Public-Project). Once your request has been approved, you can use your personal access token with <code>OSFUpload</code>.

In [12]:
token = "API_TOKEN_HERE" 


The OSFUpload class is instantiated with an access token and an optional parent node argument. The parent node argument defaults to *24htr*, which is the Data node. Recall that the parent node is where all the structure nodes (which in turn house the dataset nodes) are housed. For the purposes of this notebook, we use an internal node as the parent node instead.



In [13]:
internal_node = "9jwpu"
osf = OSFUpload(token,internal_node)

The <code>read_existing_structure_labels()</code> method returns the titles and GUIDs of all nodes under the parent node in a dictionary.

In [14]:
osf.read_existing_structure_labels()

{'notebook_demo': '3n8dv',
 'test_xjcdl': 'm73ac',
 'test_viHVr': 'j2d9q',
 'test_hzQqZ': 'b37tv',
 'test_ZEbyu': '7ef2v',
 'test_XgNfh': '4k75c',
 'test_LXnPb': 'ytcwv',
 'test_yXXoP': 'njbfy'}

To get the GUID of a particular structure under the parent node, the <code>read_structure_guid()</code> method can be used. If no node with the given title is found, the method returns <code>None</code>.

In [15]:
returned_guid = osf.read_structure_guid('Non-Existent_Label') 
print(returned_guid)


None


In [16]:
structure_guid = osf.read_structure_guid('notebook_demo') 
print(structure_guid)

3n8dv


The <code>write_child_node()</code> method can be used to create new nodes. The method requires:
- Parent GUID
- Node title
- Tags (optional)

If the node is successfully created, the method returns the GUID of the newly created node.

In [18]:
node_title = "osf_write_demo"
tags = ["tag_a:value","tag_b:value","tag_c:value"]

dataset_guid = osf.write_child_node(structure_guid,node_title,tags)
print(dataset_guid)

sgpwd
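Since compSPI tags follow the *tag_name:tag_value* convention, the tag list can be generated from a dictionary of dataset metadata rather than typed by hand. A minimal sketch (the helper name `format_tags` is hypothetical and not part of ioSPI):

```python
def format_tags(metadata):
    """Turn a {tag_name: tag_value} dict into 'tag_name:tag_value' strings."""
    return [f"{name}:{value}" for name, value in metadata.items()]


example_tags = format_tags({"noise_type": "gaussian", "pixel_size_um": 1})
print(example_tags)
# ['noise_type:gaussian', 'pixel_size_um:1']
```

The resulting list has the same shape as the `tags` argument passed to `write_child_node()` above.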


Finally, <code>write_files()</code> is used to upload files to a given node. It accepts the GUID of the node where the files will be stored and the paths of the files to be uploaded.

In [20]:
def make_file(filename):
    with open(filename, 'w') as f:
        f.write('Sample file containing particle map or meta-data!')
    return filename

upload_file_paths = [
    make_file('sample_data.txt'),
    make_file('sample_data_2.txt')
]

In [21]:
success = osf.write_files(dataset_guid,upload_file_paths)

print(success)

Uploaded sample_data.txt 
Uploaded sample_data_2.txt 
True


To see the uploaded files, we can navigate to the internal node on the OSF page.

![image.png](attachment:067d9721-85dc-4c6d-ba99-5d6d116896a8.png)

### Clean Up

In [None]:
def cleanup(node_guid, test_node_label):
    """Recursively delete nodes and subcomponents."""
    base_api_url = "https://api.osf.io/v2/nodes/"
    base_node_url = f"{base_api_url}{node_guid}/"
    request_headers = {"Authorization": f"Bearer {token}"}
    
    
    response = requests.get(f"{base_node_url}children/", headers=request_headers)
    response.raise_for_status()

    for node_child in response.json()["data"]:
        cleanup(node_child["id"], node_child["attributes"]["title"])

    response = requests.delete(base_node_url, headers=request_headers)

    if not response.ok:
        print(
            f"Failure: {test_node_label} could not be"
            f" deleted due to error code {response.status_code}."
        )
        print(response.json()["errors"][0]["detail"])
    else:
        print(f"Success: {test_node_label} deleted")

In [23]:
# Deletes files and nodes created in this notebook

import os
import requests


for path in upload_file_paths:
    if os.path.exists(path):
        os.remove(path)
    
cleanup(dataset_guid,node_title)

Success: no_noise deleted
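Because `cleanup` deletes nodes irreversibly, it can help to preview the order in which nodes would be removed before running it. The sketch below mirrors the children-first recursion of `cleanup` without issuing any requests; `deletion_order` is a hypothetical helper and the tree is a fake stand-in for the API responses.

```python
def deletion_order(node_guid, get_children):
    """Return GUIDs in the order a children-first cleanup would delete them.

    `get_children` is any callable mapping a GUID to a list of child GUIDs.
    """
    order = []
    for child_guid in get_children(node_guid):
        order.extend(deletion_order(child_guid, get_children))
    # The node itself is deleted only after all of its descendants.
    order.append(node_guid)
    return order


# Fake hierarchy for illustration: one structure with two datasets.
fake_tree = {"structure": ["dataset_a", "dataset_b"], "dataset_a": [], "dataset_b": []}
print(deletion_order("structure", lambda guid: fake_tree.get(guid, [])))
# ['dataset_a', 'dataset_b', 'structure']
```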
