
load_data, empty line #12

Open
castelao opened this issue Apr 16, 2014 · 4 comments
@castelao
Owner

load_data was not able to handle an empty line in the raw_data.

A temporary solution is to remove the empty line with re.sub().

@OlyDLG

OlyDLG commented Mar 20, 2016

Hi, Gui. The tests passed so I believe I'm ready to go! I was thinking I'd tackle a bug before jumping into enhancements, but this is the only one I see as still pending; did you want to retain ownership? If it's OK for me to work on, do you have (more extensive) notes somewhere on what you intended to do (e.g., in the source itself perhaps)? Did you implement a test for the "temp. solution," and if so, should the "refactor" pass the same test, perhaps enhanced, or replaced entirely? Any other input/advice? DLG

@castelao castelao removed their assignment Mar 20, 2016
@castelao
Owner Author

Thanks @OlyDLG, that would be great!

I ran a quick test: if you include an empty line anywhere, parsing fails because the regexp no longer matches. I've seen some old CTD files with this kind of issue, so it is a real problem.

Do you see any reason why we would need to retain a blank line after parsing an input file? I don't. Remember that headers and notes come prefixed with # or ## or *. If a blank line is truly meaningless, one solution could be to apply something like:

raw_text = re.sub(r'(\r\n){2,}', '\r\n', raw_text)
raw_text = re.sub(r'\n{2,}', '\n', raw_text)

on the first lines of the CNV object, when we load the raw_text.

What's your opinion on that?
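For reference, a runnable sketch of that substitution (the sample raw_text below is made up, not from a real .cnv file):

```python
import re

# Made-up .cnv-style text with stray blank lines between rows
raw_text = "* Sea-Bird header\r\n\r\n# name 0 = t090C\r\n1.23 2.34\r\n\r\n\r\n4.56 5.67\r\n"

# Collapse runs of line breaks into a single one, i.e. drop the
# empty lines, for both newline conventions
raw_text = re.sub(r'(\r\n){2,}', '\r\n', raw_text)
raw_text = re.sub(r'\n{2,}', '\n', raw_text)

print(raw_text)
```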

@OlyDLG

OlyDLG commented Mar 21, 2016

The only reason I can see for keeping track of blank lines is that they might be symptomatic of a "deeper" problem with the data in that file, and where they are precisely might be useful in determining if that's the case and what that problem might be. Accordingly, what I would suggest, to increase the robustness of the resulting data structure, would be to add an attribute in which we store some indication of where blank lines occur, e.g., either a line number, or the line immediately following a blank line, or some such. That way, we're not losing any potential information, but the information is segregated enough that it won't interfere with anything else we're doing. (And, thinking long-term, if/when we get to implementing automated QA of files, we'll already have this as a possible indicator of problems in the file).
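That proposal could look something like this (a minimal sketch; parse_lines and blank_lines are hypothetical names, not part of the current code):

```python
def parse_lines(raw_text):
    """Split raw_text, keeping non-empty lines and recording where the
    blank ones were (1-based line numbers in the original file)."""
    blank_lines = []
    kept = []
    for n, line in enumerate(raw_text.splitlines(), start=1):
        if line.strip() == '':
            blank_lines.append(n)   # remember the position for later QA
        else:
            kept.append(line)
    return kept, blank_lines

kept, blanks = parse_lines("* header\n1.0 2.0\n\n\n3.0 4.0\n")
print(blanks)  # [3, 4]
```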

Another option: when I was parsing CTD files from moored sensors for the WA State Dept. of Ecology, we had occasions where data rows would get cropped for various reasons--these definitely caused problems for my parser! So, if you don't already have such, eventually you'll probably need to support handling such "short" rows, and from that perspective, a blank line can be seen as simply a cropped row of length zero. In other words, perhaps we shouldn't have a separate "solution" for blank rows, but one that simply sees them as one instance of this more general problem.

Of course, either of the things I'm proposing would really be an enhancement, not merely a bug fix, so we could implement a more robust bug fix for the time being (esp. if my ideas are for much further down the road). That said, in general, I'm opposed to simply dismissing anything that may turn out to be informative, so I would urge that whatever we do in the short-term, it not be simply ignoring blank lines.

@castelao
Owner Author

@OlyDLG those are two good points.

One initial solution could be to collect all blank and/or cropped lines into a dictionary, using the line position in the original file as the key, and store all of this in a new attribute of the CNV object.
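Something along these lines, treating a blank line as a cropped row of length zero (split_data, the expected_fields heuristic, and the sample input are all hypothetical, just to illustrate the idea):

```python
def split_data(raw_text, expected_fields=2):
    """Separate clean lines from anomalous ones (blank or cropped),
    keyed by their 1-based position in the original file."""
    clean = []
    anomalous = {}
    for n, line in enumerate(raw_text.splitlines(), start=1):
        if line.startswith(('#', '*')):
            clean.append(line)       # header / notes line
        elif len(line.split()) < expected_fields:
            anomalous[n] = line      # blank or cropped data row
        else:
            clean.append(line)
    return clean, anomalous

clean, bad = split_data("* header\n1.0 2.0\n\n3.0\n4.0 5.0\n")
print(bad)  # {3: '', 4: '3.0'}
```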

And you're right: the data is currently loaded all at once, but it would be better to load it line by line. I'll drop the idea of simply removing the blank lines and open two new issues to keep those ideas on the radar.

About the automatic QC, you might enjoy checking out CoTeDe. It's already in production, and you can choose which QC rules you want to apply. There is even a command-line tool (ctdqc) to read a .cnv file and write the QC'ed data as a netCDF file.
