# Synthetic Data

This notebook shows a basic use case of `stelaro` that manipulates synthetic data.

In [22]:
import os
from stelaro.data import ncbi

DATA_DIRECTORY = "../data/"  # You can modify this directory.


def mkdir(path: str) -> None:
    """Create a directory (and any missing parents) if it does not exist."""
    os.makedirs(path, exist_ok=True)


mkdir(DATA_DIRECTORY)

## Fetch Reference Genomes

You can download reference genome summaries and use these summaries to create
your own datasets:

In [23]:
ASSEMBLY_DIRECTORY = DATA_DIRECTORY + "ncbi_genome_summaries/"
mkdir(ASSEMBLY_DIRECTORY)
ncbi.install_summaries(ASSEMBLY_DIRECTORY)

In [24]:
ncbi.summarize_assemblies(ASSEMBLY_DIRECTORY)

archaea: 2 316 genomes
bacteria: 388 560 genomes
fungi: 644 genomes
invertebrate: 434 genomes
plant: 186 genomes
protozoa: 121 genomes
vertebrate mammalian: 239 genomes
vertebrate other: 432 genomes
viral: 14 997 genomes

Total: 407 929


## Sample Genomes

You can create an index of genomes that will be used to download reference
genomes. Given that the NCBI contains 407 929 reference genomes as of November
2024, you may want to sample a subset of genomes to create a more manageable
genome database during tests.
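
To get a feel for how small a sampled dataset can be, the sketch below estimates the expected number of genomes per category for a given sampling fraction. It assumes each genome is kept with the same probability, which may not match how `ncbi.sample_genomes` actually works; the helper `expected_sample_sizes` is hypothetical and not part of `stelaro`. The counts are copied from the summary printed above.

```python
# Genome counts per category, copied from the summary output above.
GENOME_COUNTS = {
    "archaea": 2316,
    "bacteria": 388560,
    "fungi": 644,
    "invertebrate": 434,
    "plant": 186,
    "protozoa": 121,
    "vertebrate mammalian": 239,
    "vertebrate other": 432,
    "viral": 14997,
}


def expected_sample_sizes(counts: dict[str, int], fraction: float) -> dict[str, float]:
    """Expected number of sampled genomes per category.

    Assumption: every genome is kept independently with probability
    `fraction`. This is an illustration, not the `stelaro` implementation.
    """
    return {category: n * fraction for category, n in counts.items()}


sizes = expected_sample_sizes(GENOME_COUNTS, fraction=0.005)
for category, size in sizes.items():
    print(f"{category}: ~{size:.1f} genomes")
print(f"Total: ~{sum(sizes.values()):.0f} genomes")
```

With a fraction of 0.005, small categories such as protozoa are expected to contribute fewer than one genome under this assumption, so the real sample may differ depending on how the library rounds or stratifies.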

In [25]:
DATASET_DIRECTORY = DATA_DIRECTORY + "genome_small_dataset/"
mkdir(DATASET_DIRECTORY)
INDEX_FILE = DATASET_DIRECTORY + "index.tsv"
ncbi.sample_genomes(ASSEMBLY_DIRECTORY, INDEX_FILE, fraction=0.005)

Let's print the first lines of the index file:

In [26]:
with open(INDEX_FILE, "r") as f:
    count = 0
    print("First lines contained in the index file:\n\n```")
    for line in f:
        if count < 5:
            print(line.rstrip("\n"))  # Strip only the trailing newline.
        count += 1
    print(f"```\n\nTotal number of lines: {count}.")

First lines contained in the index file:

```
ID	URL	category
GCF_035413665.1	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/035/413/665/GCF_035413665.1_ASM3541366v1	archaea
GCF_014647095.1	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/014/647/095/GCF_014647095.1_ASM1464709v1	archaea
GCF_020844815.1	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/844/815/GCF_020844815.1_ASM2084481v1	archaea
GCF_902774605.1	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/902/774/605/GCF_902774605.1_Rumen_uncultured_genome_RUG11987	archaea
```

Total number of lines: 2069.
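
Because the index is a tab-separated file with an `ID`, `URL`, and `category` column (as the header above shows), it can be parsed with Python's standard `csv` module. The sketch below counts entries per category; the helper `count_categories` is hypothetical and not part of `stelaro`, and the sample rows are copied from the output above so that the block runs without the index file.

```python
import csv
from collections import Counter
from typing import Iterable

# Header and two rows copied from the index file output above.
SAMPLE_INDEX = [
    "ID\tURL\tcategory",
    "GCF_035413665.1\thttps://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/035/413/665/GCF_035413665.1_ASM3541366v1\tarchaea",
    "GCF_014647095.1\thttps://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/014/647/095/GCF_014647095.1_ASM1464709v1\tarchaea",
]


def count_categories(lines: Iterable[str]) -> Counter:
    """Count index entries per category.

    `lines` is any iterable of tab-separated lines whose first line is the
    `ID`, `URL`, `category` header (an open file works too).
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row["category"] for row in reader)


counts = count_categories(SAMPLE_INDEX)
for category, n in counts.items():
    print(f"{category}: {n}")
```

To run it on the real index, pass an open file instead of the sample, for example `with open(INDEX_FILE, newline="") as f: count_categories(f)`. The header line is consumed by `csv.DictReader`, so the counts cover genome entries only.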
