Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column-wise maths (sum-product?) #68

Open
Llammissar opened this issue Apr 27, 2017 · 3 comments
Open

Column-wise maths (sum-product?) #68

Llammissar opened this issue Apr 27, 2017 · 3 comments

Comments

@Llammissar
Copy link

Hey, been a while! Just hit another thing that'd be useful for my work, so I figured I'd put it here.

Consider the following file sample.tsv:

script	weight	actions
foo.js	100	6
bar.js	600	6
baz.js	1200	5
qux.js	500	2

If we want to know the total number of actions for each script, each second column value needs to be multiplied by the third for each line. Right now, I can get this far:

tsv-select --fields 2,3 sample.tsv | \
    keep-header -- /bin/sh -c \
        '(sed "s/\t/ \* /" | xargs printf "%d\n" $(bc $1))'

Which will give me:

weight
600
3600
6000
1000

Hm. Not quite. Better still, the rows match, but we can't join this with the current tools. We can sort of fake it with some awful circumlocutions involving number-lines:

tsv-select --fields 2,3 sample.tsv | \
    keep-header -- /bin/sh -c \
        '(sed "s/\t/ \* /" | xargs printf "%d\n" $(bc $1))' | \
    number-lines --H | \
    tsv-join -H  -k 1 -a 2,3,4 -f <(number-lines -H sample.tsv) | \
    tsv-select -H -f 3,4,5,2

gives me:

script	weight	actions	weight
foo.js	100	6	600
bar.js	600	6	3600
baz.js	1200	5	6000
qux.js	500	2	1000

Well, it's better. The new column is named wrong, though. Dang.

More broadly, that's only multiplication. What I'd like to see is something like... this isn't really tsv-summarize, which is columnar reductions. Maybe it's an advanced selection?
tsv-select -H -f1,2,3 --multiply 2:3:total_actions
Or maybe it's a new tool; tsv-eval or something. Hm. :/

@jondegenhardt
Copy link
Contributor

The specific example you listed can be handled pretty easily using awk. More generally though, none of the tsv-utilities tools provides a way to generate new values with mathematical expressions, or in the case of tsv-filter, the ability to filter rows based on a mathematical formula. The general reason I haven't introduced such facilities is that I haven't found a way to do this without essentially recreating significant parts of awk. awk is a great tool, there's no reason to rebuild it. I have a couple ideas on this front, but nothing I'm likely to get to soon.

@Llammissar
Copy link
Author

Embarrassingly, I've barely used awk for anything, and simply didn't think to reach for it. I spent a few minutes playing and the result was substantially better, so I can see your point. Didn't really help the column naming, but that's a smaller issue.

In the long term, I think there are things that would be nice to have in this general category, but agree that it's not an especially high priority.

@jondegenhardt
Copy link
Contributor

jondegenhardt commented May 9, 2017

One of the reasons I wrote these tools is that awk, powerful as it is, becomes error prone to write as soon as things get at all complicated. And, it is possible to get a large portion of the power by providing a set of much simpler and less error prone primitives. An important part of the intent of these tools.

But, for general mathematical manipulation, it seems hard to create a language that is significantly simpler than awk, or at least that's my feeling for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants