Missing uniq command #29

i4ki · 2017-01-03T21:28:47Z

We need a uniq command, but how should it behave?

Unlike Plan 9 and GNU uniq, it should deduplicate across the entire input buffer, not only across adjacent input lines. @geyslan's current implementations already work this way (#27, #28).

Below are some test cases I expect to work:

$ cat > file.txt
1
1
1
2

1
3
3
1
# by default, print every string once (the output has unique entries).
# empty lines are ignored
$ cat file.txt | uniq
1
2
3
# -dup  print only lines that are duplicated in the input
$ cat file.txt | uniq -dup
1
3
# -empty print empty lines also
$ cat file.txt | uniq -empty
1
2

3
$

A third option to show line numbers could be added if it does not complicate the tool.

@geyslan The current implementation does not honor these cases. I know it's not like GNU uniq, but what do you think? Does it make sense?

geyslan · 2017-01-03T23:16:16Z

@tiago4orion @katcipis @cadicallegari

I have never really used GNU uniq as it is. When I wanted to find unique or duplicate lines, I used awk or grep.

From the GNU uniq manual:

Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first,
or use 'sort -u' without 'uniq'. Also, comparisons honor the rules specified by 'LC_COLLATE'.

Adjacent lines? I think that's misguided, though I may be the one who is wrong here. :-) For that to actually work you have to run sort first, and sort already has an option for unique output. LOL.

From the GNU sort manual:

-u, --unique
with -c, check for strict ordering; without -c, output only the first of an equal run

Note that in either case, whether comparing all occurrences or only adjacent lines, the time complexity will be O(n) (assuming a hash map for the all-occurrences case). Let me know what you think.

i4ki · 2017-01-03T23:25:16Z

Ok, but I did not use GNU uniq behavior in my examples. Try file.txt with our uniq implementation.

geyslan · 2017-01-03T23:35:37Z

Ok, using GNU uniq:

$ cat file.txt | uniq
1
2

1
3
1

To me that looks like "get rid of adjacent duplicates", not unique output.

However, its duplicate option seems more logical.

$ cat file.txt | uniq -d
1
3

Only the -d option, though, since -D messes things up.

$ cat file.txt | uniq -D
1
1
1
3
3

geyslan · 2017-01-03T23:58:35Z

sort -u behaves similarly to GNU uniq:

$ cat file.txt | sort -u

1
2
3

In that case sort prints every line once: it removes the duplicates and sorts, or vice versa.

geyslan · 2017-01-04T00:09:54Z

@tiago4orion

> @geyslan The current implementation does not honor these cases. I know it's not like GNU uniq, but what do you think? Does it make sense?

I recompiled the geyslan/uniq2 branch and it indeed doesn't honor those cases. It must have been a bad merge. #28 is broken; #27 seems to be ok. I'm sorry about that. Please consider #27 for testing for now. Can we discard #28 and discuss the better implementation? I'll open a new PR soon.

See below.

geyslan · 2017-01-04T11:15:03Z

@tiago4orion I've now figured out (I was sleepy yesterday) that you expect a "show me only one occurrence" behavior regardless of how many times the line occurs, whether once or more. That is similar to sort -u.

@katcipis @cadicallegari

What I think is: if the -dup option inverts the logic and shows only duplicates, why shouldn't the default behavior show only the lines that are actually unique? That enables a single-occurrence usage, like "show me only unique lines". Of course, we can add an option that outputs what you expect, or make that the default and offer what I'm doing as the option instead.

So, from my point of view, #27 is actually broken and #28 is ok. I'm sorry for the mess. I suggest we discuss everything here before moving on to code. I'm postponing any PR fixes.

i4ki · 2017-01-11T01:22:58Z

@geyslan Sorry for the late reply.

Yes, I've used sort -u several times instead of uniq because its output is much saner (to me), but sometimes I do not want the output sorted.

Can you describe with examples how you would like the tool to behave, showing input and output as I did in the issue description?

geyslan · 2017-01-11T11:36:59Z

@tiago4orion ok, it's better to give examples so we can get it right. But first I want to paste a few definitions to make my point.

unique from Oxford Dictionary:

1 Being the only one of its kind; unlike anything else:
‘the situation was unique in British politics’
‘original and unique designs’

unique from Online Etymology Dictionary

c. 1600, "single, solitary," from Middle French unique (16c.), from Latin unicus "only, single, sole, alone of its kind," from unus "one" (see one). Meaning "forming the only one of its kind" is attested from 1610s; erroneous sense of "remarkable, uncommon" is attested from mid-19c. Related: Uniquely; uniqueness.

So we can accept that unique means something that is the only one (alone) in a set. Right?

# Same input as above
$ cat > file.txt
1
1
1
2

1
3
3
1
# by default, print only unique/single/alone entries.
# empty lines are ignored
$ cat file.txt | uniq
2
# -dup  print only lines that are duplicated in the input
$ cat file.txt | uniq -dup
1
3
# -every print each distinct string once, taken from the whole input set.
$ cat file.txt | uniq -every
1
2
3
# -empty print empty lines also
$ cat file.txt | uniq -empty
2

$

It's possible to make the -every behavior the default and turn my default example into a -single option. Anyway, I hope you get my point.

Cheers.

geyslan · 2017-01-11T12:17:44Z

Examples of real cases that came to mind:

uniq (default case) could be used to retrieve lines that have no duplicates, highlighting single occurrences.

uniq -every could be used when a file wrongly contains duplicate lines (not necessarily adjacent; this usage isn't covered by Enzo's uniq), stripping out those duplicates.

uniq -dup could be used to identify wasted space or misleading repetitions.

i4ki · 2017-01-11T12:19:55Z

> @tiago4orion ok, it's better to give examples so we can get it right. But first I want to paste a few definitions to make my point.

Ok, I get your point, but we use a single word to describe a set of features related to that word; it doesn't need to be so strict.

I liked your examples except the first. I cannot think of one use case for that. Can you provide a real-world example?

My point is that features without real-world usage (right now) should be dropped from the first design and implementation. What do you think? I'm not against developing them in the future if needed, but I think adding complexity with no advantage at the beginning won't help.

geyslan · 2017-01-11T12:30:51Z

> I liked your examples except the first. I cannot think of one use case for that. Can you provide a real-world example?

Here, here and here.

GNU uniq -u already does that.

-u, --unique
only print unique lines

People need that usage. I needed it too, but I can't remember now what for.

i4ki · 2017-01-11T12:37:02Z

Ok, no problem.

Then can we start the implementation? Or are we missing some detail?

geyslan · 2017-01-11T12:43:28Z

Nice,

Before going back to the implementation, I would like to hear from you about this understanding:

#28 (comment) and #28 (comment)

Last suggestion:

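// Line records one distinct input line and each line number where it occurs.
type Line struct {
	Text    string
	Numbers []int
}
...
// inputMap finds a Line by its text; linesOrdered keeps first-seen order.
// Both hold *Line so that they share the same entries.
inputMap := make(map[string]*Line)
linesOrdered := []*Line{}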

geyslan · 2017-01-11T13:12:17Z

You may ask why the Numbers []int field, right? Well, it lets us supply the line numbers when the user asks for them through a -num option, which I forgot to mention above. That way the user will be able to track the output lines.

$ cat file.txt | uniq -num
[4] 2

$ cat file.txt | uniq -dup -num
[1 2 3 6 9] 1
[7 8] 3

geyslan · 2017-01-21T12:46:10Z

@tiago4orion @katcipis

Hello guys,

I made the changes we discussed and implemented uniq_test.go as well, though there's no commit yet. Right now, I have doubts about how the -empty option should behave. By default, all options disregard empty lines, but -empty prints them in the same order they were scanned regardless of ~~their occurrence count~~ the other options' idiosyncrasies. E.g.:

Input

λ> cat input
hello
world

hello

世界
世界
世
1
3
4
日本語
4
1

Unique lines and all empty lines

λ> cat input | ./uniq -empty
world


世
3
日本語

Duplicate lines and all empty lines

λ> cat input | ./uniq -dup -empty
hello


世界
1
4

Every line representation and all empty lines

λ> cat input | ./uniq -every -empty
hello
world


世界
世
1
3
4
日本語

With -num

λ> cat input | ./uniq -dup -empty -num
1,4: hello
3,5: 
3,5: 
6,7: 世界
9,14: 1
11,13: 4

So, do you think -empty is doing the right thing by printing all occurrences, or should it behave like the other options, which print only one representation?

i4ki · 2017-02-07T00:30:30Z

I think it should behave like any other character: one representation.

geyslan · 2017-02-07T15:47:18Z

@tiago4orion, thanks. I'll change it soon. 👍

geyslan · 2017-03-18T22:48:06Z

@tiago4orion Done!

i4ki added enhancement help wanted labels Jan 3, 2017

i4ki assigned geyslan and katcipis Jan 3, 2017

geyslan mentioned this issue Jan 3, 2017

add uniq tool #28

Merged

geyslan mentioned this issue Jan 4, 2017

add uniq tool (space optimized) #27

Closed

geyslan removed their assignment Apr 19, 2024
