agent cannot read a specific dataset #4588

@kunwp1

Dataset (too large to upload to GitHub):
https://texera.eye.som.uci.edu/dashboard/hub/dataset/result/detail/6

Model: Claude-Haiku-4-5

Issue: Given the following prompt, the agent could not create a workflow that reads the dataset, and the run ended with a litellm.RateLimitError: AnthropicException.

The user expects the agent to read the files with the pandas.read_table() function, but the agent keeps complaining that the file paths given in the prompt are not valid file system paths.
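
For reference, here is a minimal sketch of the read the user has in mind. It assumes the dataset path has already been resolved to a local path or a binary file-like object; how Texera resolves a dataset path is not shown here and is not something this sketch claims to know. The helper name `read_gzipped_table` and the sample bytes are hypothetical, for illustration only.

```python
import gzip
import io

import pandas as pd


def read_gzipped_table(source, **kwargs):
    """Read a gzip-compressed, tab-delimited table with pandas.read_table.

    `source` may be a local path or a binary file-like object. Compression is
    passed explicitly because a file-like object has no ".gz" suffix from
    which pandas could infer it.
    """
    return pd.read_table(source, compression="gzip", **kwargs)


# Tiny in-memory stand-in for a .txt.gz file (made-up values, not the real dataset).
raw = gzip.compress(b"barcode\tnCount_RNA\nAAAC-1\t1200\nAAAG-1\t950\n")
df = read_gzipped_table(io.BytesIO(raw))
print(df.shape)
```

Because pandas.read_table() accepts file-like objects as well as paths, the read itself may not need a real file system path, provided the workflow can hand over the file contents as a byte stream.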

# Dataset 

1. TexeraChatbot_testdata_DDX41.txt.gz
Path: /lijin.bcm@gmail.com/texerachatbot-testdata-ddx41/v4/TexeraChatbot_testdata_DDX41.txt.gz

This TexeraChatbot_testdata_DDX41.txt.gz file includes a cell-by-gene raw count matrix, comprising 15,307 single cells (in columns) and 33,696 features (gene symbols, in rows). The first row contains cell barcodes, and the first column contains gene symbols.

2. TexeraChatbot_testdata_DDX41_obs.txt.gz
Path: /lijin.bcm@gmail.com/texerachatbot-testdata-ddx41/v4/TexeraChatbot_testdata_DDX41_obs.txt.gz

This TexeraChatbot_testdata_DDX41_obs.txt.gz file includes cell-level metadata for cell barcodes. The column “barcode” is the unique identifier for each cell. Other columns are described below:

- nCount_RNA: total UMI counts per cell
- nFeature_RNA: total number of detected features per cell
- percent.mt: percentage of mitochondrial reads per cell
- pANN: proportion of artificial nearest neighbors calculated by DoubletFinder
- nuclear_fraction: nuclear fraction score, capturing the proportion of reads derived from intronic regions; calculated using the DropletQC R package
- sampleid: 2 unique sample IDs, i.e., DDX41 for DDX41 cKO mouse and WT for wild-type mouse. The genotype for the conditional knockout mouse is Ddx41 fl/fl; ChxCre, and the genotype for the wild-type mouse is Ddx41fl/fl.
- majorclass: 12 annotated major cell classes, including AC, BC, Cone, HC, MG, Microglia, RGC, Rod, Endothelial, Pericyte, RPE, and Astrocyte
- celltype: high-resolution cell type annotation

In summary, the dataset comprises 15,307 single cells derived from 2 unique sample IDs, annotated into 12 major cell classes.

3. TexeraChatbot_testdata_DDX41_var.txt.gz
Path: /lijin.bcm@gmail.com/texerachatbot-testdata-ddx41/v4/TexeraChatbot_testdata_DDX41_var.txt.gz

This TexeraChatbot_testdata_DDX41_var.txt.gz file includes the gene features for the single-cell dataset. The “symbol” column contains the gene symbols for the 33,696 features, including both protein-coding and non-coding genes. Gene identifiers are gene symbols, and the RNA genome build used is the mouse reference (GRCm39).
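
The following is not part of the prompt. It is a rough sketch of how a successful read could be verified against the description above, assuming the three files have already been loaded into DataFrames (the count matrix with gene symbols as the index and cell barcodes as columns). The function name `check_consistency` and the miniature tables are made up for illustration.

```python
import pandas as pd


def check_consistency(counts, obs, var):
    """Compare the three tables against each other.

    counts: genes x cells matrix (index = gene symbols, columns = barcodes)
    obs:    cell-level metadata with "barcode", "sampleid", "majorclass"
    var:    gene-level table with a "symbol" column
    """
    return {
        "cells_match": set(counts.columns) == set(obs["barcode"]),
        "genes_match": set(counts.index) == set(var["symbol"]),
        "n_samples": int(obs["sampleid"].nunique()),
        "n_major_classes": int(obs["majorclass"].nunique()),
    }


# Made-up miniature tables standing in for the real files.
counts = pd.DataFrame(
    {"CELL_A": [3, 10, 0], "CELL_B": [0, 7, 2]},
    index=["Ddx41", "Rho", "Opn1sw"],
)
obs = pd.DataFrame(
    {
        "barcode": ["CELL_A", "CELL_B"],
        "sampleid": ["DDX41", "WT"],
        "majorclass": ["Rod", "Cone"],
    }
)
var = pd.DataFrame({"symbol": ["Ddx41", "Rho", "Opn1sw"]})

report = check_consistency(counts, obs, var)
print(report)
```

On the real data, going by the description, one would expect a count matrix of 33,696 rows by 15,307 columns, 2 sample IDs, and 12 major classes. That matrix has roughly 516 million entries, so a dense read may need several gigabytes of memory.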
