
Implement more efficient uniq aggregators #51

Closed · angelhof opened this issue Nov 23, 2020 · 6 comments

@angelhof (Member)

At the moment the aggregator for uniq is uniq itself, and there is no implemented aggregator for uniq -c.

We could implement very efficient aggregators for both of them.

@angelhof angelhof changed the title Implement more efficient uni aggregators Implement more efficient uniq aggregators Nov 23, 2020
@nvasilakis nvasilakis self-assigned this Nov 23, 2020
@nvasilakis (Collaborator)

I am not sure we can gain much by re-implementing a custom aggregator for uniq. The partial inputs to the aggregator are already unique, so the only savings come from the comparisons at the n input boundaries, which I don't see yielding serious improvements over simply re-running uniq.
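To make the boundary argument concrete, here is a minimal sketch (not PaSh's actual code) of what such a custom uniq aggregator would amount to: since each partial is already free of adjacent duplicates, merging two partials needs only a single comparison at their boundary.

```python
def merge_uniq(left, right):
    """Merge two lists of lines that have each already been deduplicated
    by `uniq` (no adjacent duplicates within either list)."""
    # The only possible adjacent duplicate is at the boundary between
    # the two partials, so one comparison suffices.
    if left and right and left[-1] == right[0]:
        return left + right[1:]  # drop the duplicated boundary line
    return left + right

print(merge_uniq(["a", "b"], ["b", "c"]))  # → ['a', 'b', 'c']
```

This saves one full pass over the data compared to re-running uniq on the concatenation, but as noted above, the per-boundary work is tiny either way.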

@nvasilakis (Collaborator)

There is a first version of a custom aggregator for uniq -c. (Not ready for prime time yet, but getting there.)
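For uniq -c the aggregator has more to do than for plain uniq, because boundary lines must have their counts summed rather than just being deduplicated. A hypothetical sketch of the merge step (not the version referenced above) could look like:

```python
def parse_uniq_c(line):
    """Parse one `uniq -c` output line, e.g. '   3 foo' -> (3, 'foo')."""
    count, _, text = line.lstrip().partition(" ")
    return int(count), text

def merge_uniq_c(left, right):
    """Merge two `uniq -c` partial outputs, given as (count, text) pairs.
    If the same text appears on both sides of the boundary, sum its counts."""
    if left and right and left[-1][1] == right[0][1]:
        merged = (left[-1][0] + right[0][0], left[-1][1])
        return left[:-1] + [merged] + right[1:]
    return left + right

pairs = merge_uniq_c([(2, "a"), (1, "b")], [(3, "b"), (1, "c")])
print(pairs)  # → [(2, 'a'), (4, 'b'), (1, 'c')]
```

Unlike the plain uniq case, this cannot be replaced by re-running uniq -c on the concatenated partials, since counts (not raw lines) are being combined.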

@angelhof (Member, Author)

> I am not sure we can gain much by re-implementing a custom aggregator for uniq. The partial inputs to the aggregator are already unique, so the only savings come from the comparisons at the n input boundaries, which I don't see yielding serious improvements over simply re-running uniq.

I was thinking of adding auxiliary map outputs (the end line) so that we wouldn't even have to traverse the whole file: the aggregator could just copy each file without ever touching its contents. However, I think you might be right; this doesn't sound like a big performance win. Let's make it low priority.

@nvasilakis (Collaborator)

nvasilakis commented Dec 10, 2020 via email

@angelhof (Member, Author)

This can also be applied to

tr -s ' ' '\n'

or to other similar commands that remove adjacent duplicates.

@angelhof (Member, Author)

Closing, because this issue now concerns the annotation library rather than PaSh itself.
