# Downloading MitoCheck data

In this notebook, we fetch MitoCheck image-based profiles from this [repository](https://zenodo.org/records/7967386). Because the dataset is large, the download may take slightly over an hour. The download is a single zip archive, which we extract. We then convert the image-based profiles from compressed CSV (`.csv.gz`) to Parquet for faster loading.

In [1]:
import os
import pathlib
import zipfile

import pandas as pd
import requests

## Parameters and File Paths

In [2]:
url = "https://zenodo.org/records/7967386/files/3.normalize_data__normalized_data.zip?download=1"
chunk_size = 8192

# setting data directory path
data_dir = pathlib.Path("../../data").resolve(strict=True)

# create raw sub directory
raw_data_dir = (data_dir / "raw").resolve()
raw_data_dir.mkdir(exist_ok=True)

# output file name
data_out_path = (raw_data_dir / "3.normalize_data__normalized_data.zip").resolve()

## Downloading the dataset from the Zenodo repository

In [3]:
# downloading training data using requests
with requests.get(url, stream=True) as r:
    # raise an error if the request failed
    r.raise_for_status()

    # creating a file to write the downloaded contents in chunks
    with open(data_out_path, mode="wb") as out_file:
        for chunk in r.iter_content(chunk_size=chunk_size):
            out_file.write(chunk)
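
A download of this size can be truncated or corrupted without raising an error, so it may be worth comparing the file's checksum with the one shown on the Zenodo record page (Zenodo typically lists an MD5 checksum per file). The sketch below is not part of the original notebook: `md5_of_file` is a hypothetical helper, and the expected checksum is not given here, so it would have to be copied from the record page by hand.

```python
import hashlib


def md5_of_file(path, chunk_size=8192):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, mode="rb") as in_file:
        # iter() with a sentinel stops at the first empty read (end of file)
        for chunk in iter(lambda: in_file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `md5_of_file(data_out_path)` could be compared against the published value before unzipping. Reading in chunks keeps memory use flat, in the same way as the chunked download above.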

In [4]:
# unzip the downloaded archive
with zipfile.ZipFile(data_out_path, mode="r") as zip_file_contents:
    zip_file_contents.extractall(raw_data_dir / "mitocheck_data")
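
Before extracting, it can help to confirm that the download is a readable zip archive and to see what it contains. This is a minimal sketch using only the standard library; `list_zip_members` is a hypothetical helper, not part of the original notebook.

```python
import zipfile


def list_zip_members(zip_path):
    """Return the member names of a zip archive after a basic integrity check."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")
    with zipfile.ZipFile(zip_path, mode="r") as archive:
        # testzip() returns the name of the first corrupt member, or None
        bad_member = archive.testzip()
        if bad_member is not None:
            raise ValueError(f"corrupt member in archive: {bad_member}")
        return archive.namelist()
```

Note that `testzip()` reads every member to verify its CRC, so on a large archive this check takes noticeable time.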

## Converting to Parquet files

In [5]:
# collect the compressed CSV profiles from the extracted archive
mitocheck_data_paths = list(
    (raw_data_dir / "mitocheck_data" / "normalized_data").glob("*.csv.gz")
)
for path in mitocheck_data_paths:
    name = path.name.split(".")[0]
    parquet_path = path.parent.resolve() / f"{name}.parquet"

    # load in csv file
    df = pd.read_csv(path)
    # converting into parquet file
    df.to_parquet(parquet_path, index=False)

    # remove csv file
    os.remove(path)
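
The loop above derives the output name with `path.name.split(".")[0]`, which keeps only the text before the first dot. That works when the file stem contains no dots, but a stem with a dot in it would be truncated. A hedged alternative, assuming every input ends in `.csv.gz`, strips only that suffix. `parquet_path_for` is a hypothetical helper, not part of the original notebook.

```python
import pathlib


def parquet_path_for(csv_path, suffix=".csv.gz"):
    """Map 'name.csv.gz' to 'name.parquet' in the same directory."""
    csv_path = pathlib.Path(csv_path)
    if not csv_path.name.endswith(suffix):
        raise ValueError(f"expected a {suffix} file, got {csv_path.name}")
    # slice off the compound suffix so dots inside the stem are preserved
    stem = csv_path.name[: -len(suffix)]
    return csv_path.parent / f"{stem}.parquet"
```

Inside the loop, `parquet_path = parquet_path_for(path)` would then replace the two lines that build `name` and `parquet_path`.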