Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Grep Flag! #82

Open
davidmiller opened this issue Nov 15, 2013 · 15 comments
Open

New Grep Flag! #82

davidmiller opened this issue Nov 15, 2013 · 15 comments

Comments

@davidmiller
Copy link
Contributor

Render header row regardless of match status.

@andylolz
Copy link
Collaborator

Aha… I bet you are talking about this example.

That data is fiddly, because the header row appears after 6 rows of cruft. So in that example, I’m calling csv -H to say that the first row (some cruft) shouldn’t be treated as a header row (and therefore should still be processed by grep), but I have no way of saying that the 7th row should be treated as the header row (and ignored by grep). So I do a weird hack to also grep for the header row.

Maybe we need a flag on csv to specify which row the header row is? Or is that too specific to this example?

@andylolz
Copy link
Collaborator

The epic hack would be:

/ csv -H / delete 0:5 / csv / csv / grep -i London / cut 0 / html?url=…

i.e. parse csv without a header row, strip out the first 6 rows, render to csv, then parse again but this time stating there is a header row. Then do some operations and then render.

This would very nearly work at the moment (just need some better logic for figuring out when csv means 'parse' and when it means 'render').

behold!

@davidmiller
Copy link
Contributor Author

It is awesome that piping via csv three times works :)
I can't help but think (somewhat hand-wavily admittedly) that there Must Be A Better Way.

I think flagging the header row index on the original CSV parse is very plausibly useful...

Part of me wants this to Just Do What I Mean (Leave headers in place, skip the blank rows, apply grep to the data from 7: )

/ csv -H 6 / delete 0:5 / grep -i London / cut 0 / html?url=…

Thoughts?

@andylolz
Copy link
Collaborator

But if you didn’t have that delete there, what would happen – how would it render? We can’t render it with the thead halfway through the tbody

It might be a useful thing (your example is very appealing) but it’s just not clear how to execute it cleanly.

UPDATE: Ok, actually… I think you might have nailed it, @davidmiller. Let me just have a go at implementing what you have here.

UPDATE x2: the intuitive thing would be to specify the header row in the way you suggest, @davidmiller, and for it to then be pulled out and put to the top on render. The problem is that isn’t a streaming operation.

@rufuspollock
Copy link
Member

Would this be helped if row indexes reflected current index not original index ...

@andylolz
Copy link
Collaborator

I see what you mean.

At the moment, if a csv is imported with csv without --no-header-row, we assume the first row is the header row. That means all the operations will just pass the first row straight through without touching it. So from that point, there’s no way to delete the first row – and the current and original indexes for the header row will be the same.

@rufuspollock
Copy link
Member

I think the whole header row stuff is causing us lots of problems :-) Perhaps worth standing back and just writing out the few specific things we want to enable. e.g.

  • I want to be able to delete a bunch of rows and have remains data used in standard way (so first row becomes headings in html)
  • I want header row recognized so that it could be used to set column names and be able to use that in transforms (e.g. delete column named X) - note we don't support this yet!
  • ...

If nothing we want to support requires the --header-row / --no-header-row stuff then we could just drop ...

@andylolz
Copy link
Collaborator

Sounds like a good suggestion.

The way --no-header-row makes sense to me now, but that doesn’t mean it will make sense to Irene the journalist (#20). Genuine, natural use cases for operators and options sound like good criteria for deciding what makes the cut.

@andylolz
Copy link
Collaborator

I think these operations are feasibly useful.

  • Defaulting to assume “(i) we have a header row, and (ii) it’s the first row” seems like it will cover the majority of cases, and seems very intuitive.
  • Being able to override the two parts of that seems useful and pretty intuitive as well – it should be a simple concept to explain in docs.

I think --no-header-row probably should be removed from renderers completely. It currently does nothing in the csv renderer, but it is present in the html renderer. I just had to read the docs to remind myself how that works, which is not a good sign (since I only added the flipping thing a few days ago.)

@andylolz
Copy link
Collaborator

How about this option:

csv [-H] [-i COUNT]

  -i COUNT, --ignore-lines COUNT
      When parsing, ignore the first COUNT lines of input data.

This example would then become:

/ csv -i 6 / grep -i London / cut 0 / html?url=…

it would be a streaming operation, and would play nicely with --no-header-row.

@davidmiller
Copy link
Contributor Author

/ csv -i 6 / grep -i London / cut 0 / html?url=…

+1

@andylolz
Copy link
Collaborator

kk awesome :)

@rufuspollock
Copy link
Member

@andylolz sounds good. Could we spell out really clearly 2-3 pipeline examples that we need to support so we can check a proposed solution again those. I have stuff like

  • /csv/html/ with output having a nicely formatted header
  • /csv/delete 3/html with output having nicely formatted header
  • /csv/grep xyx/html (but i wanted the first row to come through grep!)

@andylolz
Copy link
Collaborator

Could we spell out really clearly 2-3 pipeline examples that we need to support

Sure! @rgrp do you have some examples of some really broken csv? I’m not sure where to look!

@andylolz
Copy link
Collaborator

The example on the homepage becomes:

/csv -i 6/head -n 50/cut 0/grep -i London/html?url=https://raw.github.com/okfn/datapipes/master/test/data/gla.csv

…and would now have a header row.

/csv/grep xyx/html (but i wanted the first row to come through grep!)

I am assuming people wouldn’t want to grep in their header row but might want to grep in their first row (is that a fair assumption?) So in that case, this example becomes:

/csv --no-header-row/grep xyx/html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants