# Module 0: Downloading Profiles from Figshare

In this module, we walk step by step through downloading the cell-health datasets from Figshare.


This notebook is split into three parts: 
- Imports: the required imports
- Setup: explains our method and the variables required to run it
- Execution: downloads all files in parallel

## Imports

Below are the imports required to run this notebook.

In [1]:
from typing import Union
from pathlib import Path
import multiprocessing as mp
import requests

## Setup

In this section, we explain the variables required to run our parallelized download of the cell-health datasets.

First, we create a function that handles downloading the SQLite files from Figshare.

In [2]:
def download_sqlite_file(filename: Union[str, Path], url: str):
    """Downloads single-cell Cell Painting profiles from the Figshare data
    repository

    Parameters
    ----------
    filename : Union[str, Path]
        Path where the downloaded profile is stored
    url : str
        URL that downloads a specific profile.

    Returns
    -------
    None
        The downloaded profile is written to `filename`
    """

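    print("Now downloading... {}".format(filename))
    # stream the response body in chunks instead of loading it all at once
    with requests.get(url, stream=True) as sql_request:
        # raise an HTTPError on 4xx/5xx responses rather than saving an error page
        sql_request.raise_for_status()
        with open(filename, "wb") as sql_fh:
            # 786432000 bytes = 750 MiB per chunk
            for chunk in sql_request.iter_content(chunk_size=786432000):
                # skip empty chunks
                if chunk:
                    sql_fh.write(chunk)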

Next, we create a dictionary that maps the name of each SQLite file to its Figshare unique identifier.

Once we have the dictionary, we also create the directory where all downloaded SQLite files are placed.

In [3]:
file_info = {
    # "SQ00014610": "18028784",
    # "SQ00014611": "18508583",
    # "SQ00014612": "18505937",
    "SQ00014613": "18506036",
    "SQ00014614": "18031619",
    "SQ00014615": "18506108",
    "SQ00014616": "18506912",
    "SQ00014617": "18508316",
    "SQ00014618": "18508421",
}

# set up the directory where downloaded files are placed
download_dir_obj = Path("../data")
download_dir_obj.mkdir(exist_ok=True)

Since we use the `multiprocessing` module to spawn multiple download calls to Figshare, we need to build a nested list of parameters for our `download_sqlite_file()` function.

- `func_params_list` contains, for each file, the filename and the download URL, which are the required positional parameters of `download_sqlite_file()`.

The parameters must be in a nested list because we call the `.starmap()` method of a `multiprocessing` pool.

This is explained in the `Execution` section of this notebook.

**NOTE**: Files that already exist in the data folder are skipped.
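
The skip behaviour and the shape of the nested list can be checked without touching the network. The sketch below wraps the same logic in a helper, `build_download_params()`, which is a hypothetical name introduced only for this illustration. The plate names and IDs are placeholders, and a temporary directory stands in for the data folder.

```python
import tempfile
from pathlib import Path


def build_download_params(file_info, download_dir):
    """Return [filename, url] pairs for plates that are not yet on disk."""
    params = []
    for plate, figshare_id in file_info.items():
        filename = Path(download_dir) / f"{plate}.sqlite"
        if filename.is_file():
            # already downloaded, so no function call is scheduled for it
            continue
        url = f"https://nih.figshare.com/ndownloader/files/{figshare_id}"
        params.append([filename, url])
    return params


# placeholder plate names and IDs, for illustration only
example_info = {"plate_a": "111", "plate_b": "222"}

with tempfile.TemporaryDirectory() as tmp_dir:
    # pretend plate_a was downloaded in an earlier run
    (Path(tmp_dir) / "plate_a.sqlite").touch()
    example_params = build_download_params(example_info, tmp_dir)

# only plate_b remains, as a single [filename, url] pair
print(example_params)
```

Each inner list holds exactly the two positional arguments that `download_sqlite_file()` expects, in the same order.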

In [4]:
# collect all function inputs in a list
func_params_list = []
for plate in file_info:
    figshare_id = file_info[plate]
    filename = download_dir_obj / f"{plate}.sqlite"
    if filename.is_file():
        continue
    url = f"https://nih.figshare.com/ndownloader/files/{figshare_id}"
    func_params_list.append([filename, url])

## Execution

Now that we have our list of parameters, we can set up the parallelized download using the `.starmap()` method.

First, let's explain why the nested parameter list in `func_params_list` matters for the `.starmap()` call.

For a function to be parallelized, each function call needs its own set of parameters.

`.starmap()` takes each inner list, unpacks its elements, and passes them to the function as positional arguments.

The `star` in `.starmap()` refers to Python's ability to unpack positional arguments from a sequence when calling a function, using the `*` operator (the same symbol used for `*args` in function definitions).

Code example:

```python
def sum_all(num1, num2, num3, num4):
    return num1 + num2 + num3 + num4

param_list = [1, 2, 3, 4]

# using the star to unpack all parameters in a list
sum_all(*param_list)

# the code above is equivalent to:
sum_all(1, 2, 3, 4)
```

Now that we know what the `star` in `starmap` does, let's talk about the `map` part.

`map` means that every list of parameters is mapped to your function, which is called once per list.

Therefore, `.starmap()` maps each list of parameters to an independent function call and unpacks that list using `*`.
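
As a small illustration of this behaviour, the sketch below compares `.starmap()` with an explicit loop that unpacks each inner list using `*`. It uses `multiprocessing.pool.ThreadPool` as a stand-in for `mp.Pool` (an assumption made so the snippet runs in any environment; both expose the same `.starmap()` method), and `add_three()` is a toy function introduced only for this example.

```python
from multiprocessing.pool import ThreadPool


def add_three(num1, num2, num3):
    return num1 + num2 + num3


nested_params = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]

# what starmap does conceptually: one call per inner list, unpacked with *
expected = [add_three(*params) for params in nested_params]

# ThreadPool is a stand-in for mp.Pool in this sketch
with ThreadPool(processes=3) as pool:
    results = pool.starmap(add_three, nested_params)

# results come back in the same order as the input lists
print(results)  # [6, 60, 600]
```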

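Each independent function call is handed to one of the pool's worker processes, which the operating system can schedule on separate CPU cores.

For this example, `file_info` lists 9 plates, 3 of which are commented out, so up to 6 function calls are created, one per file still to download, and they are distributed across the pool's worker processes.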
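
The number of worker processes does not have to equal the number of files. One possible policy, sketched below, caps the pool size at the number of available cores. `choose_n_jobs()` is a hypothetical helper and not part of this notebook; since downloads are mostly I/O-bound, using more workers than cores can also be reasonable.

```python
import os


def choose_n_jobs(n_tasks, n_cores=None):
    """Pick a worker count: at least 1, at most one per task or per core."""
    if n_cores is None:
        # os.cpu_count() can return None, so fall back to 1
        n_cores = os.cpu_count() or 1
    return max(1, min(n_tasks, n_cores))


print(choose_n_jobs(9, n_cores=4))  # 4
print(choose_n_jobs(2, n_cores=8))  # 2
print(choose_n_jobs(0, n_cores=8))  # 1
```

The lower bound of 1 matters because a pool cannot be created with zero worker processes, which could otherwise happen when every file has already been downloaded.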

A more technical explanation of `.starmap()` can be found in the [documentation](https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool).


In [None]:
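# start one worker process per plate listed in file_info
n_jobs = len(file_info)
with mp.Pool(processes=n_jobs) as pool:
    # blocks until every download has finished
    pool.starmap(download_sqlite_file, func_params_list)
    # stop accepting new tasks and wait for the workers to exit
    pool.close()
    pool.join()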

In [6]:
from pathlib import Path

In [11]:
str(Path("../0.download-profiles-from-figshare/data/") / "*.sqlite")

'../0.download-profiles-from-figshare/data/*.sqlite'