
load_data, empty line #12

Open
castelao opened this issue Apr 16, 2014 · 4 comments
@castelao
Owner

load_data was not able to handle an empty line in the raw_data.

A temporary solution is to remove the empty line with re.sub().

@OlyDLG

OlyDLG commented Mar 20, 2016

Hi, Gui. The tests passed so I believe I'm ready to go! I was thinking I'd tackle a bug before jumping into enhancements, but this is the only one I see as still pending; did you want to retain ownership? If it's OK for me to work on, do you have (more extensive) notes somewhere on what you intended to do (e.g., in the source itself perhaps)? Did you implement a test for the "temp. solution," and if so, should the "refactor" pass the same test, perhaps enhanced, or replaced entirely? Any other input/advice? DLG

@castelao castelao removed their assignment Mar 20, 2016
@castelao
Owner Author

Thanks @OlyDLG, that would be great!

I ran a quick test: if you include an empty line anywhere, parsing fails because the regexp no longer matches. I've seen some old CTD files with this kind of issue, so it is a real problem.

Do you see any reason why we would need to retain a blank line after parsing an input file? I don't. Remember that headers and notes come prefixed with # or ## or *. If a blank line is truly meaningless, one solution could be to apply something like:

raw_text = re.sub(r'(\r\n){2,}', '\r\n', raw_text)
raw_text = re.sub(r'\n{2,}', '\n', raw_text)

on the first lines of the CNV object, when we load the raw_text.

What's your opinion on that?
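For reference, a runnable sketch of that substitution (the sample raw_text below is made up, not from a real .cnv file):

```python
import re

# Made-up .cnv-style text with stray blank lines between rows
raw_text = "* Sea-Bird header\r\n\r\n# name 0 = t090C\r\n1.23 2.34\r\n\r\n\r\n4.56 5.67\r\n"

# Collapse runs of line breaks into a single one, i.e. drop the
# empty lines, for both newline conventions
raw_text = re.sub(r'(\r\n){2,}', '\r\n', raw_text)
raw_text = re.sub(r'\n{2,}', '\n', raw_text)

print(raw_text)
```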

@OlyDLG

OlyDLG commented Mar 21, 2016

The only reason I can see for keeping track of blank lines is that they might be symptomatic of a "deeper" problem with the data in that file, and where they are precisely might be useful in determining if that's the case and what that problem might be. Accordingly, what I would suggest, to increase the robustness of the resulting data structure, would be to add an attribute in which we store some indication of where blank lines occur, e.g., either a line number, or the line immediately following a blank line, or some such. That way, we're not losing any potential information, but the information is segregated enough that it won't interfere with anything else we're doing. (And, thinking long-term, if/when we get to implementing automated QA of files, we'll already have this as a possible indicator of problems in the file).
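That proposal could look something like this (a minimal sketch; parse_lines and blank_lines are hypothetical names, not part of the current code):

```python
def parse_lines(raw_text):
    """Split raw_text, keeping non-empty lines and recording where the
    blank ones were (1-based line numbers in the original file)."""
    blank_lines = []
    kept = []
    for n, line in enumerate(raw_text.splitlines(), start=1):
        if line.strip() == '':
            blank_lines.append(n)   # remember the position for later QA
        else:
            kept.append(line)
    return kept, blank_lines

kept, blanks = parse_lines("* header\n1.0 2.0\n\n\n3.0 4.0\n")
print(blanks)  # [3, 4]
```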

Another option: when I was parsing CTD files from moored sensors for the WA State Dept. of Ecology, we had occasions where data rows would get cropped for various reasons--these definitely caused problems for my parser! So, if you don't already have such, eventually you'll probably need to support handling such "short" rows, and from that perspective, a blank line can be seen as simply a cropped row of length zero. In other words, perhaps we shouldn't have a separate "solution" for blank rows, but one that simply sees them as one instance of this more general problem.

Of course, either of the things I'm proposing would really be an enhancement, not merely a bug fix, so we could implement a more robust bug fix for the time being (esp. if my ideas are for much further down the road). That said, in general, I'm opposed to simply dismissing anything that may turn out to be informative, so I would urge that whatever we do in the short-term, it not be simply ignoring blank lines.

@castelao
Owner Author

@OlyDLG those are two good points.

One initial solution could be to collect all blank and/or cropped lines into a dictionary, using the line position in the original file as the key, and store all of this in a new attribute of the CNV object.
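Something along these lines, treating a blank line as a cropped row of length zero (split_data, the expected_fields heuristic, and the sample input are all hypothetical, just to illustrate the idea):

```python
def split_data(raw_text, expected_fields=2):
    """Separate clean lines from anomalous ones (blank or cropped),
    keyed by their 1-based position in the original file."""
    clean = []
    anomalous = {}
    for n, line in enumerate(raw_text.splitlines(), start=1):
        if line.startswith(('#', '*')):
            clean.append(line)       # header / notes line
        elif len(line.split()) < expected_fields:
            anomalous[n] = line      # blank or cropped data row
        else:
            clean.append(line)
    return clean, anomalous

clean, bad = split_data("* header\n1.0 2.0\n\n3.0\n4.0 5.0\n")
print(bad)  # {3: '', 4: '3.0'}
```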

And you're right: the data is currently loaded all at once, but it would be better to load it line by line. I'll drop the idea of simply removing the blank lines and open two new issues to keep those ideas on the radar.

About the automatic QC, you might enjoy checking out CoTeDe. It's already in production, and you can choose which QC rules you want to apply. There is even a command-line tool (ctdqc) to read a .cnv file and write the QC'ed data as a netCDF file.
