Using io.ascii as parser for files (ECSV) #13095

Open
agy-why opened this issue Apr 11, 2022 · 2 comments

agy-why commented Apr 11, 2022

Description

My current use case is the following: I have (large) .ecsv files that I'd like to bulk-load into a Postgres database
using the Postgres COPY ... FROM <file> command. However, this requires knowing where the actual CSV data starts.

Since the ECSV format is natively supported by astropy, my first thought was to use the library to parse the file and access the header and the data.

This was not as straightforward as I thought, since the UniversalReadWriteMethods only provide an API built around the astropy Table, which I don't need.

I was able to work around this as follows:

from astropy.io.ascii import Ecsv

reader = Ecsv()
with open("file.ecsv", "r") as fd:
    lines = fd.readlines()  # store all lines in a list (loads the whole file into memory)
    reader.header.get_cols(lines)  # set the table and column metadata from the header
    reader.data.data_lines = reader.data.process_lines(lines)  # extract and store the lines representing the data

I can now access the header and the data like this:

reader.data.data_lines
reader.header.cols

A clean API to access these objects, without having to deal with an astropy Table, would be useful: it would let astropy serve as a plain parser for .ecsv files.

Nevertheless, I still face an issue. The files I deal with are quite large (several GB) and must be ingested as a stream, along the lines of:
https://stackoverflow.com/a/51882751

csv_file_name = '/home/user/some_file.csv'
sql = "COPY table_name FROM STDIN DELIMITER '|' CSV HEADER"
with open(csv_file_name, "r") as fd:  # close the file once COPY has finished
    cursor.copy_expert(sql, fd)  # cursor is an existing psycopg2 cursor

Therefore I am actually wondering if it would be possible to implement the following API:

csv_filename = '/home/user/some_file.csv'
sql = "COPY table_name FROM STDIN DELIMITER '|' CSV HEADER"

reader = Ecsv()
with open(csv_filename, "r") as fd:

    # scan until end of header (seek points to beginning of data)
    reader.header.get_cols(fd) # --> this raises an issue

    # Ingest the data without storing entire file in memory
    cursor.copy_expert(sql, fd)

The first use case (accessing the header and the data) would then work like this:

reader = Ecsv()
with open("file.ecsv", "r") as fd:
    reader.header.get_cols(fd) # --> this raises an issue
    reader.data.data_lines = reader.data.process_lines(fd)

The error raised is the following: astropy.io.ascii.core.InconsistentTableError: column names from ECSV header [<colnames>] do not match names from header line of CSV data [<first-data-row>]

It is not clear to me why this error is raised. The file pointer seems to be one line ahead.
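
For the streaming COPY use case, one possible workaround that does not go through get_cols at all is to skip the leading comment lines by hand and pass the rest of the stream on. The following is only a minimal sketch under stated assumptions: seek_to_csv_start is a hypothetical helper (not part of astropy), the stream is assumed to be seekable, and every ECSV header line is assumed to start with "#". It uses readline() rather than a for loop because tell() is not available while a text file is being iterated with next().

```python
import io


def seek_to_csv_start(fd, comment="#"):
    """Position a seekable text stream at the first non-comment line.

    Hypothetical helper, not an astropy function. On return, the stream
    points at the CSV column-name line that follows the ECSV header.
    """
    while True:
        pos = fd.tell()       # remember where the current line starts
        line = fd.readline()  # readline() keeps tell()/seek() usable
        if not line:          # reached end of file without finding CSV data
            raise ValueError("no CSV data found after the ECSV header")
        stripped = line.strip()
        if stripped and not stripped.startswith(comment):
            fd.seek(pos)      # rewind to the start of the CSV header line
            return pos


# Small in-memory stand-in for an .ecsv file
sample = io.StringIO(
    "# %ECSV 1.0\n"
    "# ---\n"
    "# datatype:\n"
    "# - {name: a, datatype: int64}\n"
    "# - {name: b, datatype: int64}\n"
    "a b\n"
    "1 2\n"
    "3 4\n"
)
seek_to_csv_start(sample)
remaining = sample.read()  # everything from the CSV header line onwards

# Assumed usage with psycopg2 (cursor and sql as in the snippets above;
# the DELIMITER in sql has to match the delimiter declared in the file):
# with open("file.ecsv", "r") as fd:
#     seek_to_csv_start(fd)
#     cursor.copy_expert(sql, fd)
```

Since the stream is left at the column-name line, the HEADER option of COPY would skip that line and only the data rows would be ingested. This does not give access to the parsed column metadata, so it only covers the ingestion half of the request.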


@github-actions

Welcome to Astropy 👋 and thank you for your first issue!

A project member will respond to you as soon as possible; in the meantime, please double-check the guidelines for submitting issues and make sure you've provided the requested details.

GitHub issues in the Astropy repository are used to track bug reports and feature requests; If your issue poses a question about how to use Astropy, please instead raise your question in the Astropy Discourse user forum and close this issue.

If you feel that this issue has not been responded to in a timely manner, please leave a comment mentioning our software support engineer @embray, or send a message directly to the development mailing list. If the issue is urgent or sensitive in nature (e.g., a security vulnerability) please send an e-mail directly to the private e-mail feedback@astropy.org.

agy-why commented Apr 11, 2022

This will also be an issue for me, since the files do not fit in memory:
https://github.com/astropy/astropy/blob/main/astropy/io/ascii/ecsv.py#L120

This is where the error is raised:
https://github.com/astropy/astropy/blob/main/astropy/io/ascii/ecsv.py#L158

        # Read the first non-commented line of table and split to get the CSV
        # header column names.  This is essentially what the Basic reader does.
        header_line = next(super().process_lines(raw_lines))
        header_names = next(self.splitter([header_line])) 

        # Check for consistency of the ECSV vs. CSV header column names
        if header_names != self.names:
            raise core.InconsistentTableError('column names from ECSV header {} do not '
                                              'match names from header line of CSV data {}'
                                              .format(self.names, header_names))

It is not clear to me why header_line needs a next() here, and this seems to be the reason the error is raised: with streamed data (raw_lines is an open file object rather than a list), it picks up the first data line instead of the header line.
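
The difference between a list and a file object can be reproduced without astropy. The sketch below uses two simplified stand-ins, header_lines and first_data_line (hypothetical names, not astropy functions), for the two passes that get_cols appears to make over its input. It assumes that the first pass only stops after it has already read the first non-comment line, which is my reading of the code and not something confirmed here.

```python
import io


def header_lines(lines, comment="#"):
    """Stand-in for the header pass: yield comment lines with the marker
    stripped, and stop at the first non-blank, non-comment line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(comment):
            yield line[len(comment):].strip()
        else:
            return  # this line has already been pulled from `lines`


def first_data_line(lines, comment="#"):
    """Stand-in for the second pass: return the first non-comment line."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith(comment):
            return line
    return None


text = "# %ECSV 1.0\n# ---\na b\n1 2\n"

# Case 1: a list can be iterated again, so the second pass starts from the top.
as_list = text.splitlines()
list(header_lines(as_list))
from_list = first_data_line(as_list)      # "a b", the CSV header line

# Case 2: a file object is a one-shot iterator; the first pass consumed "a b".
as_stream = io.StringIO(text)
list(header_lines(as_stream))
from_stream = first_data_line(as_stream)  # "1 2", the first data row
```

If that reading is right, the next() itself is not the problem: the second pass simply resumes where the first one stopped, which would explain why the check sees the first data row instead of the column names when a file object is passed in.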

pllim added the io.ascii label Apr 14, 2022