Using io.ascii as parser for files (ECSV) #13095

agy-why · 2022-04-11T11:16:13Z

Description

My current use case is the following: I have (large) .ecsv files and I'd like to ingest them into a Postgres database
in bulk using the Postgres COPY ... FROM <file> command. However, this requires knowing where the actual CSV data starts.

Since the ECSV format is natively supported by astropy my first thought was to use the library to parse the file and access the header and the data.

This was not as straightforward as I thought, since the unified read/write methods only provide an API built around astropy.table.Table, which I don't need.

I could solve my issue doing the following:

from astropy.io.ascii import Ecsv

reader = Ecsv()
with open("file.ecsv", "r") as fd:
    lines = fd.readlines() # store all lines into a list
    reader.header.get_cols(lines) # set the table and columns metadata from the header
    reader.data.data_lines = reader.data.process_lines(lines) # extract and store the lines representing the data

I can now access the header and the data like this:

reader.data.data_lines
reader.header.cols

A clean API to access these objects, without having to deal with an astropy Table, might be useful: it would allow using astropy as a plain parser for .ecsv files.

Nevertheless, I still face an issue. The files I deal with are quite large (several GB) and must be ingested as a stream, similar to:
https://stackoverflow.com/a/51882751

csv_file_name = '/home/user/some_file.csv'
sql = "COPY table_name FROM STDIN DELIMITER '|' CSV HEADER"
cursor.copy_expert(sql, open(csv_file_name, "r"))

Therefore I am actually wondering if it would be possible to implement the following API:

csv_filename = '/home/user/some_file.csv'
sql = "COPY table_name FROM STDIN DELIMITER '|' CSV HEADER"

reader = Ecsv()
with open(csv_filename, "r") as fd:

    # scan until end of header (seek points to beginning of data)
    reader.header.get_cols(fd) # --> this raises an issue

    # Ingest the data without storing entire file in memory
    cursor.copy_expert(sql, fd)

The earlier example would then work in the same way:

reader = Ecsv()
with open("file.ecsv", "r") as fd:
    reader.header.get_cols(fd) # --> this raises an issue
    reader.data.data_lines = reader.data.process_lines(fd)

The error raised is the following: astropy.io.ascii.core.InconsistentTableError: column names from ECSV header [<colnames>] do not match names from header line of CSV data [<first-data-row>]

It is not clear to me why this error is raised. The file pointer seems to be one line ahead.
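
For illustration, here is a minimal sketch of the streaming idea that avoids the reader internals entirely. It assumes that every ECSV header line starts with "#" and that the first other non-blank line is the CSV line with the column names. The helper name seek_to_csv_section is hypothetical (it is not an astropy function), and the sample content is made up.

```python
import io


def seek_to_csv_section(fd, comment="#"):
    """Position fd at the first non-blank line that is not a comment.

    Hypothetical helper, not part of astropy. Uses readline() rather than
    iteration so that tell()/seek() remain usable on text-mode files.
    Returns the stream offset of that line (or of EOF if none is found).
    """
    while True:
        pos = fd.tell()
        line = fd.readline()
        if not line:  # reached end of file without finding CSV content
            return pos
        stripped = line.strip()
        if stripped and not stripped.startswith(comment):
            fd.seek(pos)  # rewind to the start of the CSV header line
            return pos


# Made-up ECSV-like content standing in for a real file.
sample = io.StringIO(
    "# %ECSV 1.0\n"
    "# ---\n"
    "# datatype:\n"
    "# - {name: a, datatype: int64}\n"
    "# - {name: b, datatype: int64}\n"
    "a b\n"
    "1 2\n"
    "3 4\n"
)

seek_to_csv_section(sample)
remaining = sample.read()  # everything from the column-name line onwards
```

With a real file, the stream could be handed to cursor.copy_expert(sql, fd) right after the call instead of being read. Since the stream is then positioned at the column-name line, the HEADER option of COPY would skip that line; the DELIMITER in the SQL statement would have to match the delimiter actually used in the file.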

Additional context


github-actions · 2022-04-11T11:16:56Z

Welcome to Astropy 👋 and thank you for your first issue!

A project member will respond to you as soon as possible; in the meantime, please double-check the guidelines for submitting issues and make sure you've provided the requested details.

GitHub issues in the Astropy repository are used to track bug reports and feature requests; if your issue poses a question about how to use Astropy, please instead raise your question in the Astropy Discourse user forum and close this issue.

If you feel that this issue has not been responded to in a timely manner, please leave a comment mentioning our software support engineer @embray, or send a message directly to the development mailing list. If the issue is urgent or sensitive in nature (e.g., a security vulnerability) please send an e-mail directly to the private e-mail feedback@astropy.org.

agy-why · 2022-04-11T11:27:56Z

This will also be an issue for me, since the files do not fit in memory:
https://github.com/astropy/astropy/blob/main/astropy/io/ascii/ecsv.py#L120

This is where the error is raised:
https://github.com/astropy/astropy/blob/main/astropy/io/ascii/ecsv.py#L158

        # Read the first non-commented line of table and split to get the CSV
        # header column names.  This is essentially what the Basic reader does.
        header_line = next(super().process_lines(raw_lines))
        header_names = next(self.splitter([header_line])) 

        # Check for consistency of the ECSV vs. CSV header column names
        if header_names != self.names:
            raise core.InconsistentTableError('column names from ECSV header {} do not '
                                              'match names from header line of CSV data {}'
                                              .format(self.names, header_names))

It is not clear to me why header_line needs a next() here, and this seems to be the reason the error is raised: with streamed data (raw_lines is an open file object rather than a list), it picks up the first data line instead of the CSV header line.
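
The difference between a list and a stream can be reproduced without astropy. The sketch below is a simplified model, not the actual astropy code: it assumes the header scan iterates over the lines and stops at the first non-comment line. Both function names are hypothetical. A list can be iterated again from the start, whereas a file-like object is a one-pass iterator, so the line on which the scan stopped is already consumed.

```python
import io


def comment_lines(lines, comment="#"):
    """Yield leading comment lines; stop at the first non-comment line.

    Simplified stand-in for a header scan (assumption, not astropy's code).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(comment):
            yield line
        else:
            return  # this line has already been pulled from the iterator


def first_non_comment(lines, comment="#"):
    """Return the first non-blank, non-comment line, or None."""
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith(comment):
            return stripped
    return None


text = "# %ECSV 1.0\n# ---\na b\n1 2\n"

# Case 1: a list. The second pass starts again from the first element.
as_list = text.splitlines()
list(comment_lines(as_list))
from_list = first_non_comment(as_list)

# Case 2: a stream. The first pass consumed "a b" when it stopped,
# so the second pass starts at the first data row.
stream = io.StringIO(text)
list(comment_lines(stream))
from_stream = first_non_comment(stream)

print(from_list)    # a b
print(from_stream)  # 1 2
```

Under that assumption, the stream case ends up comparing the ECSV column names with the first data row, which would match the error message quoted above.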

agy-why added the Feature Request label Apr 11, 2022

pllim added the io.ascii label Apr 14, 2022
