
Implement more efficient uniq aggregators #51

Closed · angelhof opened this issue Nov 23, 2020 · 6 comments

@angelhof (Member)

At the moment the aggregator for uniq is uniq itself, and there is no implemented aggregator for uniq -c.

We could implement very efficient aggregators for both of them.

@angelhof angelhof changed the title Implement more efficient uni aggregators Implement more efficient uniq aggregators Nov 23, 2020
@nvasilakis nvasilakis self-assigned this Nov 23, 2020
@nvasilakis (Collaborator)

I am not sure we can gain much by re-implementing a custom aggregator for uniq. The partial inputs to the aggregator are already unique, so the only savings come from the comparisons at the n input boundaries, which I don't see yielding serious improvements over simply re-running uniq.
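To make the boundary argument concrete, here is a minimal sketch (not PaSh's actual code) of what such a custom uniq aggregator would amount to: since each partial is already free of adjacent duplicates, merging two partials needs only a single comparison at their boundary.

```python
def merge_uniq(left, right):
    """Merge two lists of lines that have each already been deduplicated
    by `uniq` (no adjacent duplicates within either list)."""
    # The only possible adjacent duplicate is at the boundary between
    # the two partials, so one comparison suffices.
    if left and right and left[-1] == right[0]:
        return left + right[1:]  # drop the duplicated boundary line
    return left + right

print(merge_uniq(["a", "b"], ["b", "c"]))  # → ['a', 'b', 'c']
```

This saves one full pass over the data compared to re-running uniq on the concatenation, but as noted above, the per-boundary work is tiny either way.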

@nvasilakis (Collaborator)

There is a first version of a custom aggregator for uniq -c. (Not ready for prime time yet, but getting there.)
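For uniq -c the aggregator has more to do than for plain uniq, because boundary lines must have their counts summed rather than just being deduplicated. A hypothetical sketch of the merge step (not the version referenced above) could look like:

```python
def parse_uniq_c(line):
    """Parse one `uniq -c` output line, e.g. '   3 foo' -> (3, 'foo')."""
    count, _, text = line.lstrip().partition(" ")
    return int(count), text

def merge_uniq_c(left, right):
    """Merge two `uniq -c` partial outputs, given as (count, text) pairs.
    If the same text appears on both sides of the boundary, sum its counts."""
    if left and right and left[-1][1] == right[0][1]:
        merged = (left[-1][0] + right[0][0], left[-1][1])
        return left[:-1] + [merged] + right[1:]
    return left + right

pairs = merge_uniq_c([(2, "a"), (1, "b")], [(3, "b"), (1, "c")])
print(pairs)  # → [(2, 'a'), (4, 'b'), (1, 'c')]
```

Unlike the plain uniq case, this cannot be replaced by re-running uniq -c on the concatenated partials, since counts (not raw lines) are being combined.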

@angelhof (Member, Author)

> I am not sure we can gain much by re-implementing a custom aggregator for uniq. The partial inputs to the aggregator are already unique, so the only savings come from the comparisons at the n input boundaries, which I don't see yielding serious improvements over simply re-running uniq.

I was thinking of adding auxiliary map outputs (the end line) so that we wouldn't even have to traverse the whole file: the aggregator could just copy each file without ever touching its contents. However, I think you might be right; this doesn't sound like a big performance win. Let's make it low priority.

@nvasilakis (Collaborator)

nvasilakis commented Dec 10, 2020 via email

@angelhof (Member, Author)

This can also be applied to

tr -s ' ' '\n'

or to other similar commands that remove adjacent duplicates.

@angelhof (Member, Author)

Closing, because this issue now concerns the annotation library rather than PaSh itself.
