Missing uniq command #29

Open
i4ki opened this issue Jan 3, 2017 · 18 comments

@i4ki
Collaborator

i4ki commented Jan 3, 2017

We need a uniq command, but how should it behave?

@katcipis @geyslan @cadicallegari

Unlike Plan 9 and GNU uniq, it should deduplicate across the entire input buffer, not only across adjacent input lines. @geyslan's current implementations already work this way (#27, #28).

Below are some test cases I expect to work:

$ cat > file.txt
1
1
1
2

1
3
3
1
# by default, print every string once (the output has unique entries).
# empty lines are ignored
$ cat file.txt | uniq
1
2
3
# -dup  print only lines that are duplicated in the input
$ cat file.txt | uniq -dup
1
3
# -empty print empty lines also
$ cat file.txt | uniq -empty
1
2

3
$ 

A third option to show line numbers could be added if it does not complicate the tool.

@geyslan The current implementation does not honor these cases. I know it's not like 'gnu uniq', but what do you think? Does it make sense?

@geyslan
Member

geyslan commented Jan 3, 2017

@tiago4orion @katcipis @cadicallegari

I have never really used GNU uniq as it is. When I wanted to find unique or duplicate lines, I used awk or grep.

From the GNU uniq manual:

Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first,
or use 'sort -u' without 'uniq'. Also, comparisons honor the rules specified by 'LC_COLLATE'.

Adjacent lines? I think that's misguided, though I may be the one who is wrong here. :-) For that to actually work you have to run sort first, and sort already has an option for unique. LOL.

From the GNU sort manual:

-u, --unique
with -c, check for strict ordering; without -c, output only the first of an equal run

Note that in either case, whether we compare all occurrences or only adjacent lines, the time complexity will be O(n) (assuming a hash map lookup for the former). Let me know what you think.

@geyslan geyslan mentioned this issue Jan 3, 2017
@i4ki
Collaborator Author

i4ki commented Jan 3, 2017

OK, but I did not use GNU uniq behavior in my examples. Try file.txt with our uniq implementation.

@geyslan
Member

geyslan commented Jan 3, 2017

Ok, using gnu uniq:

$ cat file.txt | uniq
1
2

1
3
1

It seems to me like a getridofadjacents, not a unique output.

However, its duplicate option seems more logical.

cat file.txt | uniq -d
1
3

Only the -d option, though, since -D messes things up.

cat file.txt | uniq -D
1
1
1
3
3

@geyslan
Member

geyslan commented Jan 3, 2017

sort -u has similar behavior to gnu uniq

cat file.txt | sort -u

1
2
3

In that case sort prints all lines, removing duplicates and then sorting, or vice versa.

@geyslan
Member

geyslan commented Jan 4, 2017

@tiago4orion

@geyslan The current implementation does not honor these cases. I know it's not like 'gnu uniq', but what do you think? Does it make sense?

I recompiled the branch geyslan/uniq2 and it actually doesn't honor those cases. It must have been a bad merge. #28 is broken, #27 seems to be OK. I'm sorry about that. Please consider #27 for testing for now. Can we discard #28 and discuss the better implementation? I'll open a new PR soon.

See below.

@geyslan
Member

geyslan commented Jan 4, 2017

@tiago4orion I've now figured out (I was sleepy yesterday) that you're expecting a showmeonlyoneoccurrence behavior, regardless of how many times the line occurs, whether once or more. That is similar to sort -u.

@katcipis @cadicallegari

What I think is: if the -dup option inverts the logic and shows only duplicates, why shouldn't the default behavior show only lines that are actually unique? That enables a single-occurrence usage, like a showmeonlyuniquelines. Of course, we can add an option to output what you expect, or make that the default and add what I'm doing as an option instead.

So, from my point of view, #27 is actually broken and #28 is OK. I'm sorry for the mess. I suggest we discuss everything here before moving on to code. I'm postponing any fixes to the PRs.

@i4ki
Collaborator Author

i4ki commented Jan 11, 2017

@geyslan Sorry for the late reply.

Yes, I've used sort -u several times instead of uniq because the output is much more sane (to me), but sometimes I do not want the output actually sorted.

Can you describe with examples how you would like the tool to behave? Show input and output, like I did in the issue description.

@geyslan
Member

geyslan commented Jan 11, 2017

@tiago4orion OK, it's better to give examples so we can get it right. But first I want to paste a few definitions to make my point.

unique from Oxford Dictionary:

1 Being the only one of its kind; unlike anything else:
‘the situation was unique in British politics’
‘original and unique designs’

unique from Online Etymology Dictionary

c. 1600, "single, solitary," from Middle French unique (16c.), from Latin unicus "only, single, sole, alone of its kind," from unus "one" (see one). Meaning "forming the only one of its kind" is attested from 1610s; erroneous sense of "remarkable, uncommon" is attested from mid-19c. Related: Uniquely; uniqueness.

So we can accept that unique means something that is the only one (alone) in a set, right?

# Same input as above
$ cat > file.txt
1
1
1
2

1
3
3
1
# by default, print only unique/single/alone entries.
# empty lines are ignored
$ cat file.txt | uniq
2
# -dup  print only lines that are duplicated in the input
$ cat file.txt | uniq -dup
1
3
# -every print one representation of each distinct string in the input.
$ cat file.txt | uniq -every
1
2
3
# -empty print empty lines also
$ cat file.txt | uniq -empty
2

$ 

It's also possible to make the -every behavior the default and turn my default example into a -single option. Anyway, I hope you get my point.

Cheers.

@geyslan
Member

geyslan commented Jan 11, 2017

Examples of real use cases that came to mind.

uniq (default case) could be used to retrieve lines that have no duplicates, highlighting singular entries.

uniq -every could be used when a file wrongly contains duplicate lines (not necessarily adjacent; this usage isn't covered by Enzo's uniq), stripping out those duplicates.

uniq -dup could be used to identify wasted space or misleading repetitions.

@i4ki
Collaborator Author

i4ki commented Jan 11, 2017

@tiago4orion OK, it's better to give examples so we can get it right. But first I want to paste a few definitions to make my point.

OK, I get your point, but we use a single word to describe a set of features related to that word; it doesn't need to be so strict.

I liked your examples except the first. I cannot think of one use case for that. Can you provide a real world example?

My point is that features that don't have real-world usage (right now) should be dropped from the first design and implementation. What do you think? I'm not against them being developed in the future if needed, but I think that adding complexity with no advantage at the beginning won't help.

@geyslan
Member

geyslan commented Jan 11, 2017

I liked your examples except the first. I cannot think of one use case for that. Can you provide a real world example?

Here, here and here.

gnu uniq -u already does that.

-u, --unique
only print unique lines

People need that usage. I needed it too, but I can't remember now what for.

@i4ki
Collaborator Author

i4ki commented Jan 11, 2017

Ok, no problem.

Then can we start the implementation? Or are we missing some detail?

@geyslan
Member

geyslan commented Jan 11, 2017

Nice,

Before going back to the implementation, I would like to hear your thoughts on this understanding.

#28 (comment) and #28 (comment)

Last suggestion:

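// Line holds one distinct input line and the line numbers where it occurs.
type Line struct {
	Text    *string
	Numbers []int
}
...
inputMap := make(map[string]*Line) // pointers, so entries can be updated in place
linesOrdered := []*Line{}          // distinct lines in first-seen order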

@geyslan
Member

geyslan commented Jan 11, 2017

You may ask why that Numbers []int, right? Well, it lets us supply line numbers when the user asks for them through the -num option, which I forgot to mention above. That way the user can trace the output lines back to the input.

$ cat file.txt | uniq -num
[4] 2

$ cat file.txt | uniq -dup -num
[1 2 3 6 9] 1
[7 8] 3

@geyslan
Member

geyslan commented Jan 21, 2017

@tiago4orion @katcipis

Hello guys,

I made the changes we discussed and implemented uniq_test.go as well, though there's no commit yet. Right now I have doubts about how the -empty option should behave. By default, all options disregard empty lines, but -empty prints them in the order they were scanned, regardless of their occurrence count and of the other options' idiosyncrasies. E.g.:

Input

λ> cat input
hello
world

hello

世界
世界
世
1
3
4
日本語
4
1

Unique lines and all empty lines

λ> cat input | ./uniq -empty
world


世
3
日本語

Duplicate lines and all empty lines

λ> cat input | ./uniq -dup -empty
hello


世界
1
4

Every line representation and all empty lines

λ> cat input | ./uniq -every -empty
hello
world


世界
世
1
3
4
日本語

With -num

λ> cat input | ./uniq -dup -empty -num
1,4: hello
3,5: 
3,5: 
6,7: 世界
9,14: 1
11,13: 4

So, do you think -empty is doing the right thing by printing all occurrences, or should it behave like the other options, which print only one representation?

@i4ki
Collaborator Author

i4ki commented Feb 7, 2017

I think it should behave like any other character: one representation.

@geyslan
Member

geyslan commented Feb 7, 2017

@tiago4orion, tks. I'll change it soon. 👍

@geyslan
Member

geyslan commented Mar 18, 2017

@tiago4orion Done!

@geyslan geyslan removed their assignment Apr 19, 2024