
GGCAT API crash when building many graphs in memory from the same instance #40

Closed · opened Feb 1, 2024 · 2 comments

tmaklin (Contributor) commented Feb 1, 2024

A link to the files and code that reproduce the crash on my system (Fedora 39, Linux 6.6.13-200.fc39.x86_64) is at the end.

Description

The GGCAT API seems to have a bug where using a single instance initialized with prefer_memory: true to build many (> 100) graphs eventually causes a panic with the error message:

thread 'main' panicked at /home/temaklin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parallel-processor-0.1.13/src/memory_fs/file/internal.rs:248:26:
called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I looked into this further by making the panicking function (create_writing_underlying_file in parallel-processor) print the file it is attempting to access. The panic seems to happen because the instance enters a state where it thinks it has run out of memory after building some of the graphs: the first graphs are built normally in memory (no temporary files are created), but after a while the builds switch to running 100% on disk. This eventually crashes with:

	create_writing_underlying_file: tmp/build_graph_95c7a77f-d9b1-4028-989f-f5676fdf4417/result.997
	create_writing_underlying_file: tmp/build_graph_95c7a77f-d9b1-4028-989f-f5676fdf4417/result.998
	create_writing_underlying_file: tmp/build_graph_95c7a77f-d9b1-4028-989f-f5676fdf4417/result.999
	create_writing_underlying_file: tmp/build_graph_02d33ca3-550c-42c5-b3f2-993a06afe332/maximal-links.207
thread 'main' panicked at /home/temaklin/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parallel-processor-0.1.13/src/memory_fs/file/internal.rs:249:26:
called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

because the file tmp/build_graph_02d33ca3-550c-42c5-b3f2-993a06afe332/maximal-links.207 doesn't exist in the temporary directory.

I also tried calling run_assembler directly, but it results in the same crash, so the API layer itself doesn't seem to be the issue.

Code

use std::collections::HashMap;
use std::path::PathBuf;

fn build_pangenome_graph(input_seq_names: &[String], prefix: &str, instance: &ggcat_api::GGCATInstance) {
    println!("Building graph {} from {} sequences:", prefix, input_seq_names.len());
    input_seq_names.iter().for_each(|x| println!("\t{}", x));

    let graph_file = PathBuf::from(prefix);
    let ggcat_inputs: Vec<ggcat_api::GeneralSequenceBlockData> = input_seq_names
        .iter()
        .map(|x| ggcat_api::GeneralSequenceBlockData::FASTA((PathBuf::from(x), None)))
        .collect();

    instance.build_graph(
        ggcat_inputs,
        graph_file,
        Some(input_seq_names),
        51,
        4,
        false,
        None,
        false, // No colors
        1,
        ggcat_api::ExtraElaboration::GreedyMatchtigs,
    );
}

fn main() {
    // Read in the inputs: a two-column TSV mapping cluster names to sequence files
    let f = std::fs::File::open("clusters_morethanone.tsv").unwrap();
    let mut reader = csv::ReaderBuilder::new()
        .delimiter(b'\t')
        .has_headers(false)
        .from_reader(f);
    let mut seqs_to_clusters: HashMap<String, Vec<String>> = HashMap::new();
    for line in reader.records() {
        let record = line.unwrap();
        let key = record[0].to_string();
        let val = record[1].to_string();
        seqs_to_clusters.entry(key).or_default().push(val);
    }

    let config = ggcat_api::GGCATConfig {
        temp_dir: Some(PathBuf::from("tmp")),
        memory: 2.0,
        prefer_memory: true,
        total_threads_count: 4,
        intermediate_compression_level: None,
        stats_file: None,
    };

    let instance = ggcat_api::GGCATInstance::create(config);

    // Build 170 graphs with > 1 genomes each
    seqs_to_clusters
        .iter()
        .for_each(|(prefix, seqs)| build_pangenome_graph(seqs, prefix, &instance));
}

Reproducing

Download the files from https://drive.google.com/file/d/11wj5h6D40zgQcncmCbNRhBT73HeFiAec/view?usp=sharing and run using cargo build --release && target/release/ggcat-tmpfiles-crash.

Guilucand (Collaborator) commented

Hi! I fixed the problem: it was caused by some files remaining in the memory cache after a build task completed, which were later offloaded to disk into a directory that had already been deleted. The files associated with a task are now removed every time it finishes.

tmaklin (Contributor, Author) commented Mar 11, 2024

thanks!!
