Support for Nested Schema of Lists and Structs #38843

Open
vizkidd opened this issue Nov 22, 2023 · 2 comments
Labels
Component: C++ Type: usage Issue is a user question

Comments

vizkidd commented Nov 22, 2023

Hi! I am trying to store NCBI_BLAST data with Arrow. In my use case the data is grouped and tabular, and every row in a group repeats the same values in the ID columns (name, ID, taxon, etc.). If Apache Arrow supported nested lists and structs, I could create a schema that eliminates this duplication. Pardon me if the current design is by choice.

Component(s)

Archery, C++

Member

mapleFU commented Nov 22, 2023

Arrow seems to have struct and list data types already; see: https://arrow.apache.org/docs/python/api/datatypes.html

Author

vizkidd commented Nov 23, 2023

Thank you. How can I make a List of Structs?

My current schema looks like this:

  seq_info_type = arrow::struct_({
      arrow::field("num_alignments", arrow::int8()),
      arrow::field("seqids", arrow::struct_({arrow::field("qseqid", arrow::utf8()),
                                             arrow::field("sseqid", arrow::utf8())})),
      arrow::field("seqs", arrow::struct_({arrow::field("qseq", arrow::large_utf8()),
                                           arrow::field("sseq", arrow::large_utf8())})),
      arrow::field("strands", arrow::utf8()),
      arrow::field("lengths", arrow::struct_({arrow::field("qlen", arrow::int8()),
                                              arrow::field("slen", arrow::int8())})),
  });

  hsp_type = arrow::struct_({arrow::field("pident", arrow::float64()),
                             /*fill with data*/
                             arrow::field("comp_adj_method", arrow::float64())});

  alignment_scores_type = arrow::list({hsp_type});

  blast_schema = arrow::schema({arrow::field("seq_info", seq_info_type),
                                arrow::field("hsps", hsp_type)});

I want to use alignment_scores_type in blast_schema instead of the struct hsp_type, so that I can store some meta-info in seq_info_type and keep the data in a variable-length list of structs. This was disallowed in 13.0.0 (I have yet to check 14.0.1).

3 participants