# GFF2Parquet Testing Notebook
This notebook demonstrates the various features of the `gff2parquet` CLI tool.




## Setup and Data Paths

In [6]:
from pathlib import Path
import subprocess
import sys
import os
import polars as pl
import pyarrow.parquet as pq


# Define data paths
DATA_DIR = Path("./")
GFF_DIR = DATA_DIR / "downloaded_gff"
FASTA_DIR = DATA_DIR / "downloaded_fasta"
OUTPUT_DIR = DATA_DIR / "test_outputs"

# Create output directory
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)

# print(f"current directory: {os.getcwd()}")
# Helper function to run CLI commands
def run_gff2parquet(args):
    """Run gff2parquet command and print output."""
    cmd = ["gff2parquet"] + args
    print(f"Running: {' '.join(cmd)}")
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.stdout:
        print(result.stdout)
    if result.stderr:
        print(result.stderr, file=sys.stderr)
    if result.returncode != 0:
        print(f"Command failed with return code {result.returncode}", file=sys.stderr)
    return result
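
The helper above only reports a failure and echoes the command with a plain space join. A stricter variant is sketched below, assuming you would rather have the notebook stop at the first failing command. `run_checked` is a hypothetical name and not part of `gff2parquet`; the demo call uses the Python interpreter so the sketch runs even without the CLI installed.

```python
import shlex
import subprocess
import sys


def run_checked(cmd):
    """Run a command, echo it in copy-pasteable form, and raise on failure."""
    # shlex.join quotes arguments such as glob patterns, so the echoed line
    # can be pasted into a shell without the shell expanding them first.
    print(f"Running: {shlex.join(cmd)}")
    # check=True raises subprocess.CalledProcessError on a non-zero exit code.
    return subprocess.run(cmd, capture_output=True, text=True, check=True)


# Demonstrated with the Python interpreter instead of gff2parquet.
result = run_checked([sys.executable, "-c", "print('ok')"])
print(result.stdout.strip())
```

Because `subprocess.run` is called without a shell, patterns such as `downloaded_gff/*.gff` reach the tool unexpanded, which is consistent with the `Found N file(s) matching pattern` messages in the outputs below.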

## 1. Print Command - Inspect GFF Files
Let's start by examining the first GFF file to understand its structure.


In [12]:
# Print first 10 rows of a GFF file
run_gff2parquet([
    "print",
    str(GFF_DIR / "groupI_GCA_000859985.2.gff"),
    "--head", "10",
    "--format", "table"
])

Running: gff2parquet print downloaded_gff/groupI_GCA_000859985.2.gff --head 10 --format table
| seqid      | source  | type            | start | end    | score | strand | phase | attributes                                                                                                 | source_file                               |
| ---        | ---     | ---             | ---   | ---    | ---   | ---    | ---   | ---                                                                                                        | ---                                       |
| str        | str     | str             | u32   | u32    | f32   | str    | u32   | list[struct[2]]                                                                                            | str                                       |
|------------|---------|-----------------|-------|--------|-------|--------|-------|------------------------------------------------------------------------------------------------------------

Found 1 file(s) matching pattern 'downloaded_gff/groupI_GCA_000859985.2.gff'
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff




### Show statistics about feature types
The `--stats` flag prints summary statistics in addition to the table.

In [15]:
run_gff2parquet([
    "print",
    str(GFF_DIR / "groupI_GCA_000859985.2.gff"),
    "--stats",
])

Running: gff2parquet print downloaded_gff/groupI_GCA_000859985.2.gff --stats
| seqid      | source  | type                  | start  | end    | score | strand | phase | attributes                                                                                                                       | source_file                               |
| ---        | ---     | ---                   | ---    | ---    | ---   | ---    | ---   | ---                                                                                                                              | ---                                       |
| str        | str     | str                   | u32    | u32    | f32   | str    | u32   | list[struct[2]]                                                                                                                  | str                                       |
|------------|---------|-----------------------|--------|--------|-------|--------|-------|-------------------------------

Found 1 file(s) matching pattern 'downloaded_gff/groupI_GCA_000859985.2.gff'
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff

--- Statistics ---
Total rows: 316
Total columns: 10

Feature types:
| type                  | count |
| ---                   | ---   |
| str                   | u32   |
|-----------------------|-------|
| CDS                   | 82    |
| gene                  | 79    |
| polyA_signal_sequence | 53    |
| exon                  | 27    |
| repeat_region         | 26    |
| TATA_box              | 19    |
| mRNA                  | 17    |
| inverted_repeat       | 5     |
| stem_loop             | 3     |
| ncRNA                 | 2     |
| sequence_feature      | 2     |
| region                | 1     |




### Filter and display only CDS features
As a bonus, this uses the `csv` output format (the output is actually tab-separated).

In [17]:
run_gff2parquet([
    "print",
    str(GFF_DIR / "groupI_GCA_000859985.2.gff"),
    "--type", "CDS",
    "--head", "5",
    "--columns", "seqid,type,start,end,strand",
    "--format", "csv"
])

Running: gff2parquet print downloaded_gff/groupI_GCA_000859985.2.gff --type CDS --head 5 --columns seqid,type,start,end,strand --format csv
seqid	type	start	end	strand
JN555585.1	CDS	513	1259	+
JN555585.1	CDS	2262	2318	+
JN555585.1	CDS	3084	3750	+
JN555585.1	CDS	3887	5490	+
JN555585.1	CDS	9338	10012	+



Found 1 file(s) matching pattern 'downloaded_gff/groupI_GCA_000859985.2.gff'
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff



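
Since the `csv` format is tab-separated, downstream code has to set the delimiter explicitly. A minimal sketch using only the standard library, with the five rows copied from the output above; the length formula assumes the usual GFF convention of 1-based, inclusive coordinates.

```python
import csv
import io

# First five CDS rows copied from the output above (tab-separated).
tsv_text = (
    "seqid\ttype\tstart\tend\tstrand\n"
    "JN555585.1\tCDS\t513\t1259\t+\n"
    "JN555585.1\tCDS\t2262\t2318\t+\n"
    "JN555585.1\tCDS\t3084\t3750\t+\n"
    "JN555585.1\tCDS\t3887\t5490\t+\n"
    "JN555585.1\tCDS\t9338\t10012\t+\n"
)

# delimiter="\t" is required because the columns are separated by tabs.
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
rows = [
    {**row, "start": int(row["start"]), "end": int(row["end"])}
    for row in reader
]

# With 1-based inclusive coordinates, length = end - start + 1.
lengths = [row["end"] - row["start"] + 1 for row in rows]
print(lengths)  # [747, 57, 667, 1604, 675]
```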


## 2. Convert Command - GFF to Parquet
Convert a single GFF file to Parquet.

In [18]:
run_gff2parquet([
    "convert",
    str(GFF_DIR / "groupI_GCA_000859985.2.gff"),
    "-o", str(OUTPUT_DIR / "groupI.parquet")
])

Running: gff2parquet convert downloaded_gff/groupI_GCA_000859985.2.gff -o test_outputs/groupI.parquet


Found 1 file(s) matching pattern 'downloaded_gff/groupI_GCA_000859985.2.gff'
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff
Writing Parquet to test_outputs/groupI.parquet...
Done!




### Convert with column normalization
"Normalization" here means coercing column and attribute names to fit the specification, in a very opinionated way.
Each alias listed in the table below is renamed to the field name in its column header (e.g. `begin` -> `start`).

| Start | End | Sequence ID | Score | Source | Type |
|-------|-----|-------------|-------|--------|------|
| begin | to | qseqid | bitscore | tool | feature |
| from | seq_to | sequence_ID | bit_score | method | annotation |
| seq_from | query_end | contig_id | bits | db | category |
| query_start | qend | contig | evalue | database |  |
| qstart |  | query | e_value |  |  |
|  |  | id |  |  |  |
|  |  | name |  |  |  |

I use this when stacking multiple annotation results (e.g. from hmmsearch or BLAST) that come in different tabular formats. It ain't perfect, I know. If you want to do something more sophisticated, you can use the `normalize_column_names` function in the `gff2parquet/cli.py` script. Personally, I think dumping everything into the attributes column is even more of a mess, but that's just me.
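
The alias table can be read as a plain lookup from alias to canonical field name. The sketch below is an illustration of that idea and not the tool's actual `normalize_column_names` implementation; the helper name `normalize_columns`, the case-insensitive matching, and passing unknown names through unchanged are all assumptions.

```python
# Canonical GFF field name -> aliases, transcribed from the table above.
ALIASES = {
    "start": ["begin", "from", "seq_from", "query_start", "qstart"],
    "end": ["to", "seq_to", "query_end", "qend"],
    "seqid": ["qseqid", "sequence_id", "contig_id", "contig", "query", "id", "name"],
    "score": ["bitscore", "bit_score", "bits", "evalue", "e_value"],
    "source": ["tool", "method", "db", "database"],
    "type": ["feature", "annotation", "category"],
}

# Invert the mapping once so each alias points at its canonical name.
LOOKUP = {
    alias: canonical
    for canonical, aliases in ALIASES.items()
    for alias in aliases
}


def normalize_columns(columns):
    """Rename known aliases; leave unknown column names untouched."""
    # Assumption: matching ignores case, so "sequence_ID" maps to "seqid".
    return [LOOKUP.get(col.lower(), col) for col in columns]


print(normalize_columns(["qseqid", "begin", "to", "bitscore", "extra"]))
```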

In [19]:
run_gff2parquet([
    "convert",
    str(GFF_DIR / "groupII_GCA_031099375.1.gff"),
    "--normalize",
    "-o", str(OUTPUT_DIR / "groupII_normalized.parquet")
])
print(f"headers from reading the output file: {pq.read_metadata(OUTPUT_DIR / 'groupII_normalized.parquet')}")

Running: gff2parquet convert downloaded_gff/groupII_GCA_031099375.1.gff --normalize -o test_outputs/groupII_normalized.parquet
headers from reading the output file: <pyarrow._parquet.FileMetaData object at 0x7f4140053e50>
  created_by: Polars
  num_columns: 11
  num_rows: 7
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 2050


Found 1 file(s) matching pattern 'downloaded_gff/groupII_GCA_031099375.1.gff'
Scanning: downloaded_gff/groupII_GCA_031099375.1.gff
Normalizing column names...
Writing Parquet to test_outputs/groupII_normalized.parquet...
Done!




### Convert with coordinate shifting (0-based to 1-based)


In [20]:
run_gff2parquet([
    "convert",
    str(GFF_DIR / "groupIII_GCA_000880735.1.gff"),
    "--shift-start", "1",
    "-o", str(OUTPUT_DIR / "groupIII_shifted.parquet")
])

Running: gff2parquet convert downloaded_gff/groupIII_GCA_000880735.1.gff --shift-start 1 -o test_outputs/groupIII_shifted.parquet


Found 1 file(s) matching pattern 'downloaded_gff/groupIII_GCA_000880735.1.gff'
Scanning: downloaded_gff/groupIII_GCA_000880735.1.gff
Shifting coordinates (start: 1, end: 0)...
Writing Parquet to test_outputs/groupIII_shifted.parquet...
Done!



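
The log line `Shifting coordinates (start: 1, end: 0)` shows that only the start moves. That fits input in 0-based, half-open coordinates (as in BED), which is an assumption about the input here: the exclusive end of a half-open interval already equals the inclusive end in 1-based coordinates. A small sketch of the arithmetic, with `shift_coordinates` as a hypothetical helper:

```python
def shift_coordinates(start, end, shift_start=0, shift_end=0):
    """Add a fixed offset to each coordinate and return the new pair."""
    return start + shift_start, end + shift_end


# 0-based half-open [99, 200) and 1-based inclusive [100, 200]
# describe the same 101 bases.
zero_based = (99, 200)
one_based = shift_coordinates(*zero_based, shift_start=1)
print(one_based)  # (100, 200)
```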


## 3. Merge Command - Combine Multiple GFF Files

Merge all GFF files from a directory into a single file.


In [21]:
run_gff2parquet([
    "merge",
    str(GFF_DIR / "*.gff"),
    "-o", str(OUTPUT_DIR / "all_merged.parquet")
])
print(f"total number of records from reading the output file: {pq.read_metadata(OUTPUT_DIR / 'all_merged.parquet').num_rows}")

Running: gff2parquet merge downloaded_gff/*.gff -o test_outputs/all_merged.parquet
total number of records from reading the output file: 444


Merging 1 input pattern(s)...
Found 8 file(s) matching pattern 'downloaded_gff/*.gff'
Scanning: downloaded_gff/groupIII_GCA_000880735.1.gff
Scanning: downloaded_gff/groupII_GCA_031099375.1.gff
Scanning: downloaded_gff/groupIV_GCA_031102545.1.gff
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff
Scanning: downloaded_gff/groupVII_GCA_031171435.1.gff
Scanning: downloaded_gff/groupVI_GCA_000864765.1.gff
Scanning: downloaded_gff/groupV_GCA_053294245.1.gff
Scanning: downloaded_gff/groupcirular_rna_GCA_050924405.1.gff
Writing Parquet to test_outputs/all_merged.parquet...
Done!



### Merge with normalization and output as csv

In [22]:
run_gff2parquet([
    "merge",
    str(GFF_DIR / "group*.gff"),
    "--normalize",
    "-f", "csv",  # write CSV instead of Parquet
    "-o", str(OUTPUT_DIR / "merged_normalized.csv")
])

Running: gff2parquet merge downloaded_gff/group*.gff --normalize -f csv -o test_outputs/merged_normalized.csv


Merging 1 input pattern(s)...
Found 8 file(s) matching pattern 'downloaded_gff/group*.gff'
Scanning: downloaded_gff/groupIII_GCA_000880735.1.gff
Scanning: downloaded_gff/groupII_GCA_031099375.1.gff
Scanning: downloaded_gff/groupIV_GCA_031102545.1.gff
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff
Scanning: downloaded_gff/groupVII_GCA_031171435.1.gff
Scanning: downloaded_gff/groupVI_GCA_000864765.1.gff
Scanning: downloaded_gff/groupV_GCA_053294245.1.gff
Scanning: downloaded_gff/groupcirular_rna_GCA_050924405.1.gff
Normalizing column names...
Writing CSV...
Done!





## 4. Filter Command - Extract Specific Features




### Filter CDS features only

In [29]:
run_gff2parquet([
    "filter",
    str(GFF_DIR / "groupI_GCA_000859985.2.gff"),
    "--type", "CDS",
    "-o", str(OUTPUT_DIR / "cds_only.parquet")
])

Running: gff2parquet filter downloaded_gff/groupI_GCA_000859985.2.gff --type CDS -o test_outputs/cds_only.parquet


Found 1 file(s) matching pattern 'downloaded_gff/groupI_GCA_000859985.2.gff'
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff
Applying filters...
Writing Parquet to test_outputs/cds_only.parquet...
Done!




### Filter by minimum length

In [30]:
run_gff2parquet([
    "filter",
    str(GFF_DIR / "groupI_GCA_000859985.2.gff"),
    "--type", "CDS",
    "--min-length", "500",
    "-o", str(OUTPUT_DIR / "long_cds.csv"),
    "-f", "csv"
])

Running: gff2parquet filter downloaded_gff/groupI_GCA_000859985.2.gff --type CDS --min-length 500 -o test_outputs/long_cds.csv -f csv


Found 1 file(s) matching pattern 'downloaded_gff/groupI_GCA_000859985.2.gff'
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff
Applying filters...
Writing CSV...
Done!




### Filter by strand and length range

In [31]:
run_gff2parquet([
    "filter",
    str(GFF_DIR / "groupI_GCA_000859985.2.gff"),
    "--type", "gene",
    "--strand", "+",
    "--min-length", "300",
    "--max-length", "3000",
    "-o", str(OUTPUT_DIR / "filtered_genes.parquet")
])


Running: gff2parquet filter downloaded_gff/groupI_GCA_000859985.2.gff --type gene --strand + --min-length 300 --max-length 3000 -o test_outputs/filtered_genes.parquet


Found 1 file(s) matching pattern 'downloaded_gff/groupI_GCA_000859985.2.gff'
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff
Applying filters...
Writing Parquet to test_outputs/filtered_genes.parquet...
Done!



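
The filter options used in this section combine as a logical AND. The sketch below shows one way such a predicate could work; it is not taken from the tool's source. `keep_feature` is a hypothetical helper, the records are illustrative rather than read from the file, and both the length formula (`end - start + 1`) and the inclusive bounds are assumptions.

```python
def keep_feature(feature, feature_type=None, strand=None,
                 min_length=None, max_length=None):
    """Return True if the feature passes every filter that was supplied."""
    # Assumption: length is computed from 1-based inclusive coordinates.
    length = feature["end"] - feature["start"] + 1
    if feature_type is not None and feature["type"] != feature_type:
        return False
    if strand is not None and feature["strand"] != strand:
        return False
    # Assumption: both length bounds are inclusive.
    if min_length is not None and length < min_length:
        return False
    if max_length is not None and length > max_length:
        return False
    return True


# Illustrative records only.
features = [
    {"type": "gene", "start": 100, "end": 1099, "strand": "+"},  # length 1000
    {"type": "gene", "start": 100, "end": 299, "strand": "+"},   # length 200
    {"type": "gene", "start": 100, "end": 1099, "strand": "-"},  # wrong strand
    {"type": "CDS", "start": 100, "end": 1099, "strand": "+"},   # wrong type
]
kept = [
    f for f in features
    if keep_feature(f, feature_type="gene", strand="+",
                    min_length=300, max_length=3000)
]
print(len(kept))  # 1
```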


## 5. Split Command - Separate by Column Values

Split GFF data into separate files based on column values.


### Split by feature type

In [32]:
run_gff2parquet([
    "split",
    str(GFF_DIR / "groupI_GCA_000859985.2.gff"),
    "--column", "type",
    "--output-dir", str(OUTPUT_DIR / "split_by_type"),
    "-f", "parquet"
])


Running: gff2parquet split downloaded_gff/groupI_GCA_000859985.2.gff --column type --output-dir test_outputs/split_by_type -f parquet


Found 1 file(s) matching pattern 'downloaded_gff/groupI_GCA_000859985.2.gff'
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff
Splitting into 12 files by 'type'
Wrote 19 rows to test_outputs/split_by_type/type_TATA_box.parquet
Wrote 17 rows to test_outputs/split_by_type/type_mRNA.parquet
Wrote 53 rows to test_outputs/split_by_type/type_polyA_signal_sequence.parquet
Wrote 27 rows to test_outputs/split_by_type/type_exon.parquet
Wrote 82 rows to test_outputs/split_by_type/type_CDS.parquet
Wrote 1 rows to test_outputs/split_by_type/type_region.parquet
Wrote 2 rows to test_outputs/split_by_type/type_ncRNA.parquet
Wrote 26 rows to test_outputs/split_by_type/type_repeat_region.parquet
Wrote 5 rows to test_outputs/split_by_type/type_inverted_repeat.parquet
Wrote 3 rows to test_outputs/split_by_type/type_stem_loop.parquet
Wrote 79 rows to test_outputs/split_by_type/type_gene.parquet
Wrote 2 rows to test_outputs/split_by_type/type_sequence_feature.parquet
Done!



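
The output file names above follow the pattern `<column>_<value>.<extension>`. Conceptually, splitting is a group-by on the chosen column. A minimal sketch of that grouping, with the hypothetical helper `split_by_column` and made-up rows; it only builds the groups in memory and does not write files.

```python
from collections import defaultdict


def split_by_column(rows, column, extension="parquet"):
    """Group rows by the value of one column, keyed by an output file name."""
    groups = defaultdict(list)
    for row in rows:
        # File naming mirrors the log above, e.g. type_CDS.parquet.
        groups[f"{column}_{row[column]}.{extension}"].append(row)
    return dict(groups)


# Illustrative rows only.
rows = [
    {"seqid": "JN555585.1", "type": "CDS"},
    {"seqid": "JN555585.1", "type": "gene"},
    {"seqid": "JN555585.1", "type": "CDS"},
]
by_type = split_by_column(rows, "type")
print({name: len(group) for name, group in by_type.items()})
```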

### Split by sequence ID (chromosome/contig)

In [33]:
run_gff2parquet([
    "split",
    str(GFF_DIR / "groupI_GCA_000859985.2.gff"),
    "--column", "seqid",
    "--output-dir", str(OUTPUT_DIR / "split_by_seqid"),
    "-f", "gff"
])

Running: gff2parquet split downloaded_gff/groupI_GCA_000859985.2.gff --column seqid --output-dir test_outputs/split_by_seqid -f gff


Found 1 file(s) matching pattern 'downloaded_gff/groupI_GCA_000859985.2.gff'
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff
Splitting into 1 files by 'seqid'
Wrote 316 rows to test_outputs/split_by_seqid/seqid_JN555585.1.gff3
Done!





## 6. Extract Command - Get Sequences from FASTA

### Extract CDS sequences as nucleotides


In [34]:
run_gff2parquet([
    "extract",
    str(GFF_DIR / "groupI_GCA_000859985.2.gff"),
    str(FASTA_DIR / "groupI_GCA_000859985.2.fna"),
    "--type", "CDS",
    "-o", str(OUTPUT_DIR / "cds_sequences.fasta")
])

Running: gff2parquet extract downloaded_gff/groupI_GCA_000859985.2.gff downloaded_fasta/groupI_GCA_000859985.2.fna --type CDS -o test_outputs/cds_sequences.fasta


Loading GFF from: downloaded_gff/groupI_GCA_000859985.2.gff
Found 1 file(s) matching pattern 'downloaded_gff/groupI_GCA_000859985.2.gff'
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff
Applying filters...
Found 1 FASTA file(s)
Loading FASTA sequences...
  Reading: downloaded_fasta/groupI_GCA_000859985.2.fna

0rows [00:00, ?rows/s]
1rows [00:00, 189.43rows/s]
Extracting 82 features...
Extracted 82 sequences
Done!



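
Extraction boils down to slicing the reference sequence with the feature's coordinates. The sketch below assumes 1-based inclusive coordinates and assumes that minus-strand features are reverse-complemented; whether `gff2parquet` does exactly this is not shown in the output above. `extract_feature` is a hypothetical helper.

```python
# Translation table for complementing nucleotides (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_feature(sequence, start, end, strand="+"):
    """Slice a 1-based inclusive interval; reverse-complement on '-'."""
    # 1-based inclusive [start, end] maps to the Python slice [start-1:end].
    subsequence = sequence[start - 1:end]
    if strand == "-":
        subsequence = subsequence.translate(COMPLEMENT)[::-1]
    return subsequence


genome = "ATGCCGTAA"
print(extract_feature(genome, 4, 6))        # CCG
print(extract_feature(genome, 4, 6, "-"))   # CGG
```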

### Extract and translate CDS to proteins

In [36]:
run_gff2parquet([
    "extract",
    str(GFF_DIR / "groupI_GCA_000859985.2.gff"),
    str(FASTA_DIR / "groupI_GCA_000859985.2.fna"),
    "--type", "CDS",
    "--outfmt", "amino",
    "--genetic-code", "11",
    "-o", str(OUTPUT_DIR / "cds_proteins.fasta")
])

Running: gff2parquet extract downloaded_gff/groupI_GCA_000859985.2.gff downloaded_fasta/groupI_GCA_000859985.2.fna --type CDS --outfmt amino --genetic-code 11 -o test_outputs/cds_proteins.fasta


Loading GFF from: downloaded_gff/groupI_GCA_000859985.2.gff
Found 1 file(s) matching pattern 'downloaded_gff/groupI_GCA_000859985.2.gff'
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff
Applying filters...
Found 1 FASTA file(s)
Loading FASTA sequences...
  Reading: downloaded_fasta/groupI_GCA_000859985.2.fna

0rows [00:00, ?rows/s]
1rows [00:00, 65.08rows/s]
Extracting 82 features...
Extracted 82 sequences
Done!



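
For reference, translation itself is a codon-by-codon lookup. NCBI translation table 11 assigns the same amino acids to the 64 codons as the standard table and differs only in the start codons it permits, which this sketch ignores. The `translate` helper is hypothetical, and using `X` for codons with ambiguous bases and dropping a trailing partial codon are assumptions, not documented behaviour of the tool.

```python
from itertools import product

# Codons enumerated in NCBI order (T, C, A, G; first base varies slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate a nucleotide string codon by codon ('*' marks a stop)."""
    sequence = nucleotides.upper()
    # Assumption: a trailing partial codon is dropped.
    usable = len(sequence) - len(sequence) % 3
    return "".join(
        # Assumption: codons containing non-ACGT characters become 'X'.
        CODON_TABLE.get(sequence[i:i + 3], "X")
        for i in range(0, usable, 3)
    )


print(translate("ATGGCCTAA"))  # MA*
```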

### Extract long CDS and translate

In [37]:
run_gff2parquet([
    "extract",
    str(GFF_DIR / "groupI_GCA_000859985.2.gff"),
    str(FASTA_DIR / "groupI_GCA_000859985.2.fna"),
    "--type", "CDS",
    "--min-length", "500",
    "--outfmt", "amino",
    "-o", str(OUTPUT_DIR / "long_proteins.fasta")
])

Running: gff2parquet extract downloaded_gff/groupI_GCA_000859985.2.gff downloaded_fasta/groupI_GCA_000859985.2.fna --type CDS --min-length 500 --outfmt amino -o test_outputs/long_proteins.fasta


Loading GFF from: downloaded_gff/groupI_GCA_000859985.2.gff
Found 1 file(s) matching pattern 'downloaded_gff/groupI_GCA_000859985.2.gff'
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff
Applying filters...
Found 1 FASTA file(s)
Loading FASTA sequences...
  Reading: downloaded_fasta/groupI_GCA_000859985.2.fna

0rows [00:00, ?rows/s]
1rows [00:00, 221.42rows/s]
Extracting 71 features...
Extracted 71 sequences
Done!




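### Extract from multiple genomes

`groupI*.gff` matches four GFF files (groups I through IV), but only two FASTA files are passed here, which is presumably why only 88 of the 101 CDS features are extracted.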


In [38]:
run_gff2parquet([
    "extract",
    str(GFF_DIR / "groupI*.gff"),
    str(FASTA_DIR / "groupI_GCA_000859985.2.fna"),
    str(FASTA_DIR / "groupII_GCA_031099375.1.fna"),
    "--type", "CDS",
    "--outfmt", "amino",
    "-f", "parquet",
    "-o", str(OUTPUT_DIR / "multi_genome_proteins.parquet")
])


Running: gff2parquet extract downloaded_gff/groupI*.gff downloaded_fasta/groupI_GCA_000859985.2.fna downloaded_fasta/groupII_GCA_031099375.1.fna --type CDS --outfmt amino -f parquet -o test_outputs/multi_genome_proteins.parquet


Loading GFF from: downloaded_gff/groupI*.gff
Found 4 file(s) matching pattern 'downloaded_gff/groupI*.gff'
Scanning: downloaded_gff/groupIII_GCA_000880735.1.gff
Scanning: downloaded_gff/groupII_GCA_031099375.1.gff
Scanning: downloaded_gff/groupIV_GCA_031102545.1.gff
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff
Applying filters...
Found 2 FASTA file(s)
Loading FASTA sequences...
  Reading: downloaded_fasta/groupI_GCA_000859985.2.fna

0rows [00:00, ?rows/s]
1rows [00:00, 217.38rows/s]
  Reading: downloaded_fasta/groupII_GCA_031099375.1.fna

0rows [00:00, ?rows/s]
1rows [00:00, 583.68rows/s]
Extracting 101 features...
Extracted 88 sequences
Done!




### Extract from multiple genomes using glob pattern

In [39]:
run_gff2parquet([
    "extract",
    str(GFF_DIR / "groupI*.gff"),
    str(FASTA_DIR / "group*.fna"),  # Glob pattern
    "--type", "CDS",
    "-o", str(OUTPUT_DIR / "all_cds.fasta")
])

# Or multiple patterns:
run_gff2parquet([
    "extract",
    str(GFF_DIR / "*.gff"),
    str(FASTA_DIR / "groupI*.fna"),
    str(FASTA_DIR / "groupII*.fna"),
    "--type", "CDS",
    "--outfmt", "amino",
    "-o", str(OUTPUT_DIR / "selected_proteins.fasta")
])

Running: gff2parquet extract downloaded_gff/groupI*.gff downloaded_fasta/group*.fna --type CDS -o test_outputs/all_cds.fasta


Loading GFF from: downloaded_gff/groupI*.gff
Found 4 file(s) matching pattern 'downloaded_gff/groupI*.gff'
Scanning: downloaded_gff/groupIII_GCA_000880735.1.gff
Scanning: downloaded_gff/groupII_GCA_031099375.1.gff
Scanning: downloaded_gff/groupIV_GCA_031102545.1.gff
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff
Applying filters...
Found 8 FASTA file(s)
Loading FASTA sequences...
  Reading: downloaded_fasta/groupIII_GCA_000880735.1.fna

0rows [00:00, ?rows/s]
11rows [00:00, 2295.39rows/s]
  Reading: downloaded_fasta/groupII_GCA_031099375.1.fna

0rows [00:00, ?rows/s]
1rows [00:00, 536.91rows/s]
  Reading: downloaded_fasta/groupIV_GCA_031102545.1.fna

0rows [00:00, ?rows/s]
1rows [00:00, 937.69rows/s]
  Reading: downloaded_fasta/groupI_GCA_000859985.2.fna

0rows [00:00, ?rows/s]
1rows [00:00, 304.09rows/s]
  Reading: downloaded_fasta/groupVII_GCA_031171435.1.fna

0rows [00:00, ?rows/s]
1rows [00:00, 963.54rows/s]
  Reading: downloaded_fasta/groupVI_GCA_000864765.1.fna

0rows [00:00

Running: gff2parquet extract downloaded_gff/*.gff downloaded_fasta/groupI*.fna downloaded_fasta/groupII*.fna --type CDS --outfmt amino -o test_outputs/selected_proteins.fasta


Loading GFF from: downloaded_gff/*.gff
Found 8 file(s) matching pattern 'downloaded_gff/*.gff'
Scanning: downloaded_gff/groupIII_GCA_000880735.1.gff
Scanning: downloaded_gff/groupII_GCA_031099375.1.gff
Scanning: downloaded_gff/groupIV_GCA_031102545.1.gff
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff
Scanning: downloaded_gff/groupVII_GCA_031171435.1.gff
Scanning: downloaded_gff/groupVI_GCA_000864765.1.gff
Scanning: downloaded_gff/groupV_GCA_053294245.1.gff
Scanning: downloaded_gff/groupcirular_rna_GCA_050924405.1.gff
Applying filters...
Found 6 FASTA file(s)
Loading FASTA sequences...
  Reading: downloaded_fasta/groupIII_GCA_000880735.1.fna

0rows [00:00, ?rows/s]
11rows [00:00, 2615.79rows/s]
  Reading: downloaded_fasta/groupII_GCA_031099375.1.fna

0rows [00:00, ?rows/s]
1rows [00:00, 870.01rows/s]
  Reading: downloaded_fasta/groupIV_GCA_031102545.1.fna

0rows [00:00, ?rows/s]
1rows [00:00, 956.95rows/s]
  Reading: downloaded_fasta/groupI_GCA_000859985.2.fna

0rows [00:00, ?rows/



## 7. Complex Workflows - Combining Commands


### Workflow 1: Merge → Filter → Extract

In [40]:
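# Step 1: Merge annotations. Note that `[I-IV]` is a glob character class that
# matches a single `I` or `V`, so this pattern matches groups I through VII (7 files).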
run_gff2parquet([
    "merge",
    str(GFF_DIR / "group[I-IV]*.gff"),
    "-o", str(OUTPUT_DIR / "workflow1_merged.parquet")
])

# Step 2: Filter for long CDS
run_gff2parquet([
    "filter",
    str(OUTPUT_DIR / "workflow1_merged.parquet"),
    "--type", "CDS",
    "--min-length", "600",
    "-o", str(OUTPUT_DIR / "workflow1_filtered.gff"),
    "-f", "gff"
])

# Step 3: Extract and translate
run_gff2parquet([
    "extract",
    str(OUTPUT_DIR / "workflow1_filtered.gff"),
    str(FASTA_DIR / "groupI_GCA_000859985.2.fna"),
    str(FASTA_DIR / "groupII_GCA_031099375.1.fna"),
    str(FASTA_DIR / "groupIII_GCA_000880735.1.fna"),
    str(FASTA_DIR / "groupIV_GCA_031102545.1.fna"),
    "--outfmt", "amino",
    "-o", str(OUTPUT_DIR / "workflow1_proteins.fasta")
])


Running: gff2parquet merge downloaded_gff/group[I-IV]*.gff -o test_outputs/workflow1_merged.parquet


Merging 1 input pattern(s)...
Found 7 file(s) matching pattern 'downloaded_gff/group[I-IV]*.gff'
Scanning: downloaded_gff/groupIII_GCA_000880735.1.gff
Scanning: downloaded_gff/groupII_GCA_031099375.1.gff
Scanning: downloaded_gff/groupIV_GCA_031102545.1.gff
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff
Scanning: downloaded_gff/groupVII_GCA_031171435.1.gff
Scanning: downloaded_gff/groupVI_GCA_000864765.1.gff
Scanning: downloaded_gff/groupV_GCA_053294245.1.gff
Writing Parquet to test_outputs/workflow1_merged.parquet...
Done!



Running: gff2parquet filter test_outputs/workflow1_merged.parquet --type CDS --min-length 600 -o test_outputs/workflow1_filtered.gff -f gff


Found 1 file(s) matching pattern 'test_outputs/workflow1_merged.parquet'
Scanning: test_outputs/workflow1_merged.parquet
Applying filters...
Writing GFF3 to test_outputs/workflow1_filtered.gff...
Done!



Running: gff2parquet extract test_outputs/workflow1_filtered.gff downloaded_fasta/groupI_GCA_000859985.2.fna downloaded_fasta/groupII_GCA_031099375.1.fna downloaded_fasta/groupIII_GCA_000880735.1.fna downloaded_fasta/groupIV_GCA_031102545.1.fna --outfmt amino -o test_outputs/workflow1_proteins.fasta


Loading GFF from: test_outputs/workflow1_filtered.gff
Found 1 file(s) matching pattern 'test_outputs/workflow1_filtered.gff'
Scanning: test_outputs/workflow1_filtered.gff
Found 4 FASTA file(s)
Loading FASTA sequences...
  Reading: downloaded_fasta/groupI_GCA_000859985.2.fna

0rows [00:00, ?rows/s]
1rows [00:00, 178.90rows/s]
  Reading: downloaded_fasta/groupII_GCA_031099375.1.fna

0rows [00:00, ?rows/s]
1rows [00:00, 413.88rows/s]
  Reading: downloaded_fasta/groupIII_GCA_000880735.1.fna

0rows [00:00, ?rows/s]
11rows [00:00, 6403.52rows/s]
  Reading: downloaded_fasta/groupIV_GCA_031102545.1.fna

0rows [00:00, ?rows/s]
1rows [00:00, 885.81rows/s]
Extracting 96 features...
Extracted 82 sequences
Done!



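
The merge step reports 7 matched files rather than 4 because `[I-IV]` is a glob character class, not a Roman-numeral range: it matches exactly one character, either `I` or `V`. The sketch below reproduces this with Python's `fnmatch` module, assuming that `gff2parquet` expands patterns with the same shell-style rules (the "Found 7 file(s)" line above is consistent with that assumption). The `wanted` set is just one illustrative way to restrict the selection to groups I to IV.

```python
from fnmatch import fnmatchcase

# File names taken from the "Scanning:" lines in this notebook's output.
gff_files = [
    "groupIII_GCA_000880735.1.gff",
    "groupII_GCA_031099375.1.gff",
    "groupIV_GCA_031102545.1.gff",
    "groupI_GCA_000859985.2.gff",
    "groupVII_GCA_031171435.1.gff",
    "groupVI_GCA_000864765.1.gff",
    "groupV_GCA_053294245.1.gff",
    "groupcirular_rna_GCA_050924405.1.gff",
]

# "[I-IV]" is the range I..I plus the literal V, i.e. one character: I or V.
matched = [f for f in gff_files if fnmatchcase(f, "group[I-IV]*.gff")]

# One way to keep only groups I to IV: compare the prefix before the first "_".
wanted = {"groupI", "groupII", "groupIII", "groupIV"}
groups_1_to_4 = [f for f in gff_files if f.split("_", 1)[0] in wanted]

print(len(matched), len(groups_1_to_4))
```

Because only the four FASTA files for groups I to IV are passed to `extract`, features from groups V to VII have no sequence to draw from. This may explain why 96 features were selected but only 82 sequences were extracted.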

### Workflow 2: Filter by type → Split by seqid

In [41]:
# Filter for genes
run_gff2parquet([
    "filter",
    str(GFF_DIR / "groupI_GCA_000859985.2.gff"),
    "--type", "gene",
    "-o", str(OUTPUT_DIR / "workflow2_genes.parquet")
])

# Split the gene records by seqid (one output file per sequence ID)
run_gff2parquet([
    "split",
    str(OUTPUT_DIR / "workflow2_genes.parquet"),
    "--column", "seqid",
    "--output-dir", str(OUTPUT_DIR / "workflow2_by_chromosome"),
    "-f", "gff"
])


Running: gff2parquet filter downloaded_gff/groupI_GCA_000859985.2.gff --type gene -o test_outputs/workflow2_genes.parquet


Found 1 file(s) matching pattern 'downloaded_gff/groupI_GCA_000859985.2.gff'
Scanning: downloaded_gff/groupI_GCA_000859985.2.gff
Applying filters...
Writing Parquet to test_outputs/workflow2_genes.parquet...
Done!



Running: gff2parquet split test_outputs/workflow2_genes.parquet --column seqid --output-dir test_outputs/workflow2_by_chromosome -f gff


Found 1 file(s) matching pattern 'test_outputs/workflow2_genes.parquet'
Scanning: test_outputs/workflow2_genes.parquet
Splitting into 1 files by 'seqid'
Wrote 79 rows to test_outputs/workflow2_by_chromosome/seqid_JN555585.1.gff3
Done!



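
To make the split step concrete, here is a minimal pure-Python sketch that groups GFF3 lines by their first column (`seqid`). It is not `gff2parquet`'s implementation; the helper name `split_by_seqid` and the sample records are invented for illustration.

```python
from collections import defaultdict


def split_by_seqid(gff_lines):
    """Group GFF3 feature lines by seqid (column 1), skipping comments and blanks."""
    groups = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        seqid = line.split("\t", 1)[0]
        groups[seqid].append(line.rstrip("\n"))
    return dict(groups)


# Hypothetical records; only the seqid JN555585.1 appears in the output above.
sample = [
    "##gff-version 3",
    "JN555585.1\tGenbank\tgene\t100\t900\t.\t+\t.\tID=gene-a",
    "JN555585.1\tGenbank\tgene\t1200\t2000\t.\t-\t.\tID=gene-b",
    "seq2\tGenbank\tgene\t5\t300\t.\t+\t.\tID=gene-c",
]

groups = split_by_seqid(sample)
for seqid, rows in groups.items():
    print(f"{seqid}: {len(rows)} row(s)")
```

In the run above, all 79 gene rows share the seqid `JN555585.1`, so the split produces a single file.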

## 8. Verification - Check Outputs

Verify that output files were created successfully.


In [None]:
print("Output files created:")
for root, dirs, files in os.walk(OUTPUT_DIR):
    # Nesting depth relative to OUTPUT_DIR sets the indentation
    level = root.replace(str(OUTPUT_DIR), '').count(os.sep)
    indent = ' ' * 2 * level
    print(f'{indent}{os.path.basename(root)}/')
    subindent = ' ' * 2 * (level + 1)
    for file in files:
        # File size in bytes, printed with thousands separators
        size = os.path.getsize(os.path.join(root, file))
        print(f'{subindent}{file} ({size:,} bytes)')


Output files created:
test_outputs/
  all_cds.fasta (157,604 bytes)
  all_merged.parquet (17,531 bytes)
  cds_only.parquet (7,119 bytes)
  cds_proteins.fasta (44,014 bytes)
  cds_sequences.fasta (126,208 bytes)
  filtered_genes.parquet (5,161 bytes)
  groupI.parquet (12,825 bytes)
  groupIII_shifted.parquet (5,468 bytes)
  groupII_normalized.parquet (4,509 bytes)
  long_cds.csv (17,342 bytes)
  long_proteins.fasta (42,551 bytes)
  merged_normalized.csv (89,091 bytes)
  multi_genome_proteins.parquet (711,170 bytes)
  selected_proteins.fasta (63,492 bytes)
  workflow1_filtered.gff (18,341 bytes)
  workflow1_merged.parquet (17,277 bytes)
  workflow1_proteins.fasta (52,149 bytes)
  workflow2_genes.parquet (5,544 bytes)
  split_by_seqid/
    seqid_JN555585.1.gff3 (48,783 bytes)
  split_by_type/
    type_CDS.parquet (7,119 bytes)
    type_TATA_box.parquet (4,593 bytes)
    type_exon.parquet (4,862 bytes)
    type_gene.parquet (5,544 bytes)
    type_inverted_repeat.parquet (4,381 bytes)
    t
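
A listing shows that files exist, but not that they contain data. The sketch below is one possible stricter check: it walks a directory with `pathlib` and reports any zero-byte files. The helper name `find_empty_files` is hypothetical, and the demo uses a temporary directory rather than `OUTPUT_DIR` so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def find_empty_files(directory):
    """Return sorted relative paths of zero-byte files under `directory`."""
    base = Path(directory)
    return sorted(
        str(p.relative_to(base))
        for p in base.rglob("*")
        if p.is_file() and p.stat().st_size == 0
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ok.fasta").write_text(">seq1\nMKV\n")  # non-empty file
    (root / "empty.parquet").touch()                # zero-byte file
    empty = find_empty_files(root)
    print(empty)
```

In this notebook, `find_empty_files(OUTPUT_DIR)` should return an empty list if every command wrote data.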

## Summary

This notebook demonstrated:
- **print**: Inspecting GFF data and statistics
- **convert**: Converting GFF to Parquet/CSV with normalization and coordinate shifting
- **merge**: Combining multiple GFF files
- **filter**: Extracting features by type, length, strand, etc.
- **split**: Separating data into multiple files by column values
- **extract**: Extracting and optionally translating sequences from FASTA
- **workflows**: Chaining commands, such as merge → filter → extract and filter → split
