# Synthetic Data

This notebook shows a basic use case of `stelaro` that manipulates synthetic data.

In [2]:
import os
from stelaro.data import ncbi

DATA_DIRECTORY = "../data/"  # You can modify this directory.


def mkdir(path: str) -> None:
    """Create a directory if it does not exist."""
    os.makedirs(path, exist_ok=True)


mkdir(DATA_DIRECTORY)

## Fetch Reference Genomes

You can download reference genome summaries and use these summaries to create
your own datasets:

In [3]:
ASSEMBLY_DIRECTORY = DATA_DIRECTORY + "ncbi_genome_summaries/"
mkdir(ASSEMBLY_DIRECTORY)
ncbi.install_summaries(ASSEMBLY_DIRECTORY)

In [4]:
ncbi.summarize_assemblies(ASSEMBLY_DIRECTORY)

archaea: 2 316 genomes
bacteria: 388 560 genomes
fungi: 644 genomes
invertebrate: 434 genomes
plant: 186 genomes
protozoa: 121 genomes
vertebrate mammalian: 239 genomes
vertebrate other: 432 genomes
viral: 14 997 genomes

Total: 407 929


## Sample Genomes

You can create an index of genomes that will be used to download reference
genomes. Given that the NCBI contains 407,929 reference genomes as of November
2024, you may want to sample a subset of genomes to create a more manageable
genome database during tests.

In [5]:
DATASET_DIRECTORY = DATA_DIRECTORY + "genome_small_dataset/"
mkdir(DATASET_DIRECTORY)
INDEX_FILE = DATASET_DIRECTORY + "index.tsv"
ncbi.sample_genomes(ASSEMBLY_DIRECTORY, INDEX_FILE, fraction=0.005)
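
As a rough sanity check, the arithmetic behind a sampling fraction of 0.005 can be sketched as follows. The sketch assumes that every genome is kept independently with probability `fraction`; this is an assumption made for illustration, and the actual sampling scheme of `ncbi.sample_genomes` may differ. The counts are copied from the summary printed earlier, and `expected_sample_sizes` is a hypothetical helper that is not part of `stelaro`.

```python
# Genome counts per category, copied from the summary printed earlier.
COUNTS = {
    "archaea": 2316,
    "bacteria": 388560,
    "fungi": 644,
    "invertebrate": 434,
    "plant": 186,
    "protozoa": 121,
    "vertebrate mammalian": 239,
    "vertebrate other": 432,
    "viral": 14997,
}


def expected_sample_sizes(counts: dict[str, int], fraction: float) -> dict[str, float]:
    """Expected number of sampled genomes per category.

    Hypothetical helper; assumes independent sampling with probability `fraction`.
    """
    return {category: n * fraction for category, n in counts.items()}


expected = expected_sample_sizes(COUNTS, 0.005)
for category, value in expected.items():
    print(f"{category}: about {value:.1f} genomes")
print(f"Total: about {sum(expected.values()):.0f} genomes")
```

Under this assumption, small categories such as plant or protozoa are expected to contribute less than one genome each, so they may be absent from a sample this small.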

Let's visualize the index file:

In [6]:
with open(INDEX_FILE, "r") as f:
    count = 0
    print("First 5 lines contained in the index file:\n\n```")
    for line in f:
        if count < 5:
            print(line.rstrip("\n"))
        count += 1
    print(f"```\n\nTotal number of lines: {count}.")

First 5 lines contained in the index file:

```
ID	URL	category
GCF_001639295.1.fna	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/639/295/GCF_001639295.1_ASM163929v1/GCF_001639295.1_ASM163929v1_genomic.fna.gz	archaea
GCF_902384065.1.fna	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/902/384/065/GCF_902384065.1_UHGG_MGYG-HGUT-02162/GCF_902384065.1_UHGG_MGYG-HGUT-02162_genomic.fna.gz	archaea
GCF_002214525.1.fna	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/214/525/GCF_002214525.1_ASM221452v1/GCF_002214525.1_ASM221452v1_genomic.fna.gz	archaea
GCF_003711245.1.fna	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/711/245/GCF_003711245.1_ASM371124v1/GCF_003711245.1_ASM371124v1_genomic.fna.gz	archaea
```

Total number of lines: 1964.
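
Note that the line count above includes the header line, so the index lists one fewer genome than the number of lines. Because the index is a tab-separated file with the columns `ID`, `URL`, and `category`, it can also be read with Python's standard `csv` module. The following is a minimal sketch that counts genomes per category; `count_categories` is a hypothetical helper that is not part of `stelaro`, and the URLs in the in-memory sample are placeholders.

```python
import csv
import io
from collections import Counter


def count_categories(handle) -> Counter:
    """Count index rows per category (hypothetical helper, not part of stelaro)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["category"] for row in reader)


# Tiny in-memory stand-in for the index file; the URLs are placeholders.
sample = io.StringIO(
    "ID\tURL\tcategory\n"
    "GCF_001639295.1.fna\thttps://example.org/a.fna.gz\tarchaea\n"
    "GCF_902384065.1.fna\thttps://example.org/b.fna.gz\tarchaea\n"
)
print(count_categories(sample))

# With the real index file, the same helper could be applied as:
# with open(INDEX_FILE, newline="") as f:
#     print(count_categories(f))
```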


## Install Genomes

The following cell downloads and installs the genomes listed in the index file created at the previous step.

In [7]:
ncbi.install_genomes(INDEX_FILE, DATASET_DIRECTORY)

The genomes listed in the index file should now be installed in `DATASET_DIRECTORY`. Let's
examine the first file that was downloaded:

In [8]:
with open(INDEX_FILE, "r") as f:
    next(f)  # Skip the header line.
    filename = f.readline().split("\t")[0]

with open(os.path.join(DATASET_DIRECTORY, filename)) as f:
    count = 0
    print("First lines contained in the genome file:\n\n```")
    for line in f:
        if count < 5:
            print(line.rstrip("\n"))
        count += 1
    print(f"```\n\nTotal number of lines: {count}.")

First lines contained in the genome file:

```
>NZ_LWMV01000001.1 Methanobrevibacter curvatus strain DSM 11111 MBCUR_contig000001, whole genome shotgun sequence
ATCAGTAGAGTGTGCAGAGGTATATAGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTATATGTGTATGGTTTTATTCAAGCTTTTCAATAAATTAACAGCAGAATAAGCCGCTAAAACACTTGT
TTTTGGATTTATGTTGGATGGAACATTTTCAGTTTTACTAGTAAAACTTCCAAATTCTCCTTTTACATGGACTTCATGAA
TATTTCTATTTATTTCTGGATCTATAATGATTTTTACATTAATATCCATATTAGAAGCAATACTTAGTGCAGCTGCAACA
```

Total number of lines: 30529.
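
The genome file is in FASTA format: each record starts with a header line beginning with `>`, followed by one or more sequence lines. A line count therefore says little about the number of contigs or bases. The following is a minimal sketch of how one could summarize such a file with plain Python; `summarize_fasta` is a hypothetical helper that is not part of `stelaro`, and the in-memory sample is made up for illustration.

```python
import io


def summarize_fasta(handle) -> tuple[int, int]:
    """Return (number of records, total sequence length) of a FASTA stream.

    Hypothetical helper for illustration; not part of stelaro.
    """
    n_records = 0
    n_bases = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # Ignore blank lines.
        if line.startswith(">"):
            n_records += 1  # A header line opens a new record.
        else:
            n_bases += len(line)  # Any other line holds sequence data.
    return n_records, n_bases


# Made-up sample with two records.
sample = io.StringIO(">contig_1 example\nATCAGT\nAGAG\n>contig_2 example\nTTTT\n")
print(summarize_fasta(sample))

# With an installed genome, the same helper could be applied as:
# with open(os.path.join(DATASET_DIRECTORY, filename)) as f:
#     print(summarize_fasta(f))
```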
