add uniq tool #28

geyslan · 2017-01-03T12:37:38Z

This PR is for comparison with #27

Signed-off-by: Geyslan G. Bem geyslan@gmail.com

i4ki · 2017-01-03T14:06:09Z

@geyslan I thought we had a O(n) solution in both complexity and space once in our last discussion. No?

Depending on sort.Ints annoys me. The stable sort solution is guaranteed to be logarithmic but ... very generic and could change in the future.
https://golang.org/src/sort/sort.go?s=7612:7630#L487

The O(n) solution was broken in any way? Or it does not support some of the required features?

geyslan · 2017-01-03T19:36:32Z

@tiago4orion the first solution (the simple one) that we have discussed is not broken. This PR is just for comparison with the last #27, I really do not consider this as a useful merge too. It's here just for analysis. I should have mentioned it in the #27 instead.

The one that you did refer was the first. Its time complexity is O(n) but its space depends on the input size O(2 * len(n)). I know that 2 * in the big-O should be disregarded, but the length of all inputs being duplicated seemed to me a waste of space.

That duplicate space is due to the repetition of the string in the linesMap and linesOrdered

LinesMap: = make(map[string]int)
LinesOrdered: = []string{}

To avoid that waste (e.g. a huge input stream) in the #27 I used three maps, but see that in only one inputMap of these We have the real string. The other two lineNumsMap lineMap works as pointers to the main map. I realized that we do not need an extra ordered list if we have a map with key int (line) and value pointing to the actual string.

InputMap: = make(map[string] *string)
LineNumsMap: = make(map[*string] []int)
LineMap: = make(map[int] *string)

You can tell me if their names (maps) reflect what they represent, I do not know.

Using that approach we can fill only one map with unique strings. I still have to think how to assure the allocation of the maps in the heap.

Things need to be changed, of course, let me know what you think.

first:

package main

import (
	"bufio"
	"io"
	"os"
)

type options struct {
	printDuplicates bool
	printEmptyLines bool
	stringDelimiter byte
}

func scanLines(input *os.File, opts options) (map[string]int, []string) {
	linesMap := make(map[string]int)
	linesOrdered := []string{}
	reader := bufio.NewReader(input)

	for {
		line, err := reader.ReadBytes(opts.stringDelimiter)
		if err == io.EOF {
			break
		}
		lineStr := string(line)
		linesOrdered = append(linesOrdered, lineStr)
		linesMap[lineStr]++
	}

	return linesMap, linesOrdered
}

func writeStdout(line string) {
	_, err := os.Stdout.WriteString(line)
	if err != nil {
		// what to do?
		os.Exit(1)
	}
}

func printLines(linesMap map[string]int, linesOrdered []string, opts options) {
	for _, line := range linesOrdered {
		if linesMap[line] == 0 {
			continue
		}
		if opts.printEmptyLines && line == string(opts.stringDelimiter) {
			writeStdout(line)
			continue
		}
		if opts.printDuplicates {
			if linesMap[line] < 2 {
				goto discardAndContinue
			}
		} else if linesMap[line] != 1 {
			goto discardAndContinue
		}
		writeStdout(line)
	discardAndContinue:
		linesMap[line] = 0
	}
}

func parseArgs(args []string) options {
	var opts options
	opts.stringDelimiter = byte('\n')

	for _, opt := range args[1:] {
		switch opt {
		case "-dup":
			opts.printDuplicates = true
		case "-empty":
			opts.printEmptyLines = true
		case "-nul":
			opts.stringDelimiter = 0x00
		default:
			writeStdout("Usage:\n")
			writeStdout("uniq [-dup | -empty | -nul]\n")
			os.Exit(1)
		}
	}
	return opts
}

func main() {
	opts := parseArgs(os.Args)
	linesMap, linesArr := scanLines(os.Stdin, opts)
	printLines(linesMap, linesArr, opts)
}

i4ki

The output I'm getting is not expected. Maybe we're misinterpreting uniq behaviour. I'll open an issue describing the use cases I expect to work and we can discuss there before review your current implementation.

i4ki · 2017-01-03T20:53:20Z

cmd/uniq/uniq.go

+
+	i := 1
+	for {
+		line, err := reader.ReadBytes(byte('\n'))


in addition to io.EOF, other kind of errors must abort.

Ok. Should be panic() or just os.Exit(1)?

yeah, print error and exit.

i4ki · 2017-01-03T21:11:38Z

@geyslan Regarding the O(n) implementation, I think we can change it to use string pointers in the linesOrdered array also, to avoid waste the double of memory. Using three maps instead of ordered slice requires sorting and complicates the logic.

I propose a third PR with the O(n) implementation and we can evaluate run time and memory usage.

Take a look in the rules of the Complexity section of Rob Pike C style:
http://doc.cat-v.org/bell_labs/pikestyle

geyslan · 2017-01-03T23:16:48Z

@tiago4orion did you notice that using 3 maps we don't need to sort anything? The catch is to iterate from 1 until the line count and test if this number exists in the lineMap, using i as the key.

Take a look:
06cc033#diff-d2c7a5b53893853781f228b0716c857fR68

Concerning your suggestion to use a slice with pointer to string in the map, I think it can be done, but using a structure to count occurrences.

type struct Line {
	Text *string
	Count int
}
...
inputMap := make(map[string]Line)
linesOrdered := []*Line

The inputMap is necessary only when scanning which lets the results ordered in the linesOrdered for later use.

The behavior (like GNU uniq) is not expected indeed. It was intentional. Take a look at #29 (comment).

About the new PR, NP. Just waiting more reviews/discussions.

geyslan · 2017-01-03T23:23:46Z

For keep track of the line numbers we could use:

type struct Line {
	Text *string
	Numbers []int
}

geyslan · 2017-01-04T11:27:50Z

Please, let's discuss this on #29 before any changes.

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>

- add option '-every' which prints every specific line only one time. - add usage() function. - parseArgs() now returns error. - dismember scanLines() and printLine() logics into addLine() and printLine(). - scanLine() now returns a slice of pointers to Line, enabling an output ordered by inclusion. Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>

This makes tests on the two main functions: scanLines() and uniq(). Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>

geyslan · 2017-01-23T12:32:22Z

Hello folks. 😃

I did a rebase and push some changes. If it's ok there's still a conundrum to solve #29 (comment) before merging.

i4ki · 2017-01-23T12:38:53Z

@geyslan Awesome =)

I'm very busy right now, sorry! But I'll try to take a look soon (maybe tomorrow).

i4ki · 2017-03-22T01:24:24Z

I have a real use case for uniq tomorrow. I'll give feedback soon.

:)

i4ki

Sorry for the veryy late review...
I started this review several times before but never had time to finish it properly.

I've used your/our uniq in my work to get unique entries of CSV files and it worked like a charm. The csvs had 1.2mm and 2.1mm lines =)

Added some comments to make it more testable but it's already very good.

i4ki · 2017-04-22T02:44:03Z

cmd/uniq/uniq.go

+
+	lineNum := 0
+	add := false
+	empAdded := false


what's the meaning of empAdded?

oh, got it. It means emptyAdded?

Yes, I have to change it for better reading.

i4ki · 2017-04-22T02:47:05Z

cmd/uniq/uniq.go

+
+func scanLines(input io.Reader, opts options) []*Line {
+	reader := bufio.NewReader(input)
+	linesPtrMap := make(map[string]*Line)


Line already has string pointers and Line.num is a slice (a ptr), then making a map[string]*Line is too much indirection with no space advantage. Using map[string]Line will simplify the code and allocate only bytes to store the two pointers (text and nums).

@tiago4orion I believe that we have performance and space improvement using map[string]*Line once that LinesOrdered which is a []*Line (returned by scanLines()), contains only the pointers of the lines scanned into the map instead of a pure copy of a Line object. If we attempt to reduce the indirection used in the map, we'll ~~probably~~ certainly have to use more space to build the LinesOrdered slice.

See perf results on my machine of uniq as it is now (unfortunately for testing not using pointers I should have to change much of the logic, so I didn't do it).

$ cat ~/projs/rsrch/anas.platyrhynchos/data | wc -l 18532826 $ taskset 2 perf stat -e task-clock,cache-references,cache-misses,cycles,instructions,branches,branch-misses \ ./uniq_pointer -num -dup < ~/projs/rsrch/anas.platyrhynchos/data ... Performance counter stats for './uniq_pointer -num -dup': 64623,899938 task-clock:u (msec) # 0,996 CPUs utilized 6.161.710.231 cache-references:u # 95,347 M/sec (83,33%) 3.942.617.559 cache-misses:u # 63,986 % of all cache refs (83,32%) 153.774.203.335 cycles:u # 2,380 GHz (66,68%) 208.763.595.755 instructions:u # 1,36 insn per cycle (83,34%) 41.554.644.963 branches:u # 643,023 M/sec (83,33%) 162.322.952 branch-misses:u # 0,39% of all branches (83,34%) 64,867745418 seconds time elapsed

Besides of the certain increase of space usage these are my concerns about not using pointers:

6.161.710.231 **cache-references:u** # 95,347 M/sec (83,33%) 3.942.617.559 **cache-misses:u** # 63,986 % of all cache refs (83,32%)

Using pointers we already have a lot of cache-misses. That misses will not increase using larger buffers instead of pointers? We need tests.

i4ki · 2017-04-22T02:50:27Z

cmd/uniq/uniq.go

+			linesPtrMap[lineStr] = linep
+		}
+		linep.nums = append(linep.nums, lineNum)
+		add, empAdded = addLine(linep, opts, empAdded)


addLine is misleading because it doesn't add anything. It's a predicate, better to rename to shouldAdd or testAdd.

Indeed. Tks.

i4ki · 2017-04-22T02:52:47Z

cmd/uniq/uniq.go

+
+func uniq(linesp []*Line, opts options) {
+	for _, linep := range linesp {
+		if !printLine(linep, opts) {


same here, the name implies the function actually prints but not. Better to rename to shouldPrintLine or only shouldPrint?

Sure, changing it to shouldPrint.

i4ki · 2017-04-22T02:54:28Z

cmd/uniq/uniq.go

+			break
+		}
+		if err != nil {
+			fmt.Println(err)


the main logic is here, if you exit in case of errors it's not possible to test properly. Calling exit here should exit the test program also.

Indeed. Tks again. Changes:

In uniq.go

-func scanLines(input io.Reader, opts options) []*Line { +func scanLines(input io.Reader, opts options) ([]*Line, error) { ... - fmt.Println(err) - os.Exit(1) + return nil, err ... - return linesOrdered + return linesOrdered, nil

In uniq_test.go

- scannedLines := scanLines(bytesReader, tScanLines.opts) + scannedLines, err := scanLines(bytesReader, tScanLines.opts) + if err != nil { + t.Fatalf("test index %d: %v", i, err) + } ... - err := cmpLines(scannedLines, tScanLines.expected) + err = cmpLines(scannedLines, tScanLines.expected)

i4ki · 2017-04-22T03:00:59Z

cmd/uniq/uniq_test.go

+`)
+
+func newLine(s string, ln []int) *Line {
+	newStr := new(string)


string is immutable and new is deprecated.
This function could be created something like:

func newLine(s string, ln []int) Line { line := s+"\n" return Line{ text: &line, nums: ln, } }

In Go every variable in analysed if it scapes the function and if so it's allocated in the heap and managed by GC. You don't need to call new anymore (since go1.3 or earlier).

i4ki · 2017-04-22T03:05:03Z

cmd/uniq/uniq_test.go

+}
+
+func cmpLines(a, b []*Line) error {
+	if la, lb := len(a), len(b); la != lb {


In Go len doesn't actually count but only accesses an internal field (len(s) == s.sz) and len call is always inlined.
Then,

if len(a) != len(b) { return fmt.Errorf("%T sizes diverge: %v != %v", a, len(a), len(b))

isn't slow.
*only a hint, a minor comment.

i4ki · 2017-04-22T03:05:43Z

cmd/uniq/uniq_test.go

+		return fmt.Errorf("%T sizes diverge: %v != %v", a, la, lb)
+	}
+	for i, _ := range a {
+		linea := *a[i]


As I'd said before, you don't need *Line here.

i4ki · 2017-04-22T03:06:57Z

cmd/uniq/uniq_test.go

+func TestScanLines(t *testing.T) {
+	for i, tScanLines := range ttScanLines {
+		bytesReader := bytes.NewReader(input)
+		scannedLines := scanLines(bytesReader, tScanLines.opts)


As I'd said before, in case of error the entire test you exit.. Better to make this function return an error also.

Already changed as mentioned above.

i4ki · 2017-04-22T03:13:10Z

cmd/uniq/uniq_test.go

+		return "", err
+	}
+
+	savedStdout := os.Stdout


that's not idiomatic Go. The correct way would be simple define uniq something like:

func uniq(stdout io.Writer, linesp []Line, opts options) error { /* code here */ }

And the use like this in main:

linesp, err := scanLines(os.Stdin, opts) if err != nil { fmt.Fprintf(os.Stderr, err.Error()) os.Exit(1) } err = uniq(os.Stdout, linesp, opts) if err != nil { fmt.Fprintf(os.Stderr, err.Error()) os.Exit(1) }

This way in your tests you can mock the stdout with:

var buf bytes.Buffer err := unit(&buf, linesp, opts) if err != nil { t.Fatal(err) } out := string(buf.Bytes())

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>

- remove deprecated new - calling len() without fear of it not being inlined - remove * dereference once range sends a copy of the object Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>

i4ki requested review from katcipis and i4ki January 3, 2017 14:06

i4ki requested changes Jan 3, 2017

View reviewed changes

i4ki mentioned this pull request Jan 3, 2017

Missing uniq command #29

Open

geyslan mentioned this pull request Jan 3, 2017

add uniq tool (space optimized) #27

Closed

geyslan added 5 commits January 21, 2017 21:47

uniq: print unique lines fed by stdin

79853be

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>

uniq: add uniq test unity

1beebe7

This makes tests on the two main functions: scanLines() and uniq(). Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>

uniq: update README.md

2739fdf

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>

add .gitignore

22e718e

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>

-empty now print once

d2895ff

i4ki requested changes Apr 22, 2017

View reviewed changes

geyslan added 3 commits April 22, 2017 12:00

uniq: improving the meaning consequently the understanding

0f68407

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>

uniq: scanLines() now returns error

0c806cf

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>

uniq: making code idiomatic

e781107

- remove deprecated new - calling len() without fear of it not being inlined - remove * dereference once range sends a copy of the object Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>

i4ki merged commit cf1a1c7 into c0defellas:master Oct 20, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add uniq tool #28

add uniq tool #28

geyslan commented Jan 3, 2017

i4ki commented Jan 3, 2017

geyslan commented Jan 3, 2017 •

edited

Loading

i4ki left a comment

i4ki Jan 3, 2017

geyslan Jan 3, 2017

i4ki Jan 3, 2017

i4ki commented Jan 3, 2017

geyslan commented Jan 3, 2017 •

edited

Loading

geyslan commented Jan 3, 2017 •

edited

Loading

geyslan commented Jan 4, 2017

geyslan commented Jan 23, 2017

i4ki commented Jan 23, 2017

i4ki commented Mar 22, 2017

i4ki left a comment

i4ki Apr 22, 2017

i4ki Apr 22, 2017

geyslan Apr 22, 2017

i4ki Apr 22, 2017

geyslan Apr 22, 2017 •

edited

Loading

i4ki Apr 22, 2017

geyslan Apr 22, 2017

i4ki Apr 22, 2017

geyslan Apr 22, 2017

i4ki Apr 22, 2017

geyslan Apr 22, 2017 •

edited

Loading

i4ki Apr 22, 2017

i4ki Apr 22, 2017

i4ki Apr 22, 2017

i4ki Apr 22, 2017

geyslan Apr 22, 2017

i4ki Apr 22, 2017

add uniq tool #28

add uniq tool #28

Conversation

geyslan commented Jan 3, 2017

i4ki commented Jan 3, 2017

geyslan commented Jan 3, 2017 • edited Loading

i4ki left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

i4ki commented Jan 3, 2017

geyslan commented Jan 3, 2017 • edited Loading

geyslan commented Jan 3, 2017 • edited Loading

geyslan commented Jan 4, 2017

geyslan commented Jan 23, 2017

i4ki commented Jan 23, 2017

i4ki commented Mar 22, 2017

i4ki left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

geyslan Apr 22, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

geyslan Apr 22, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

geyslan commented Jan 3, 2017 •

edited

Loading

geyslan commented Jan 3, 2017 •

edited

Loading

geyslan commented Jan 3, 2017 •

edited

Loading

geyslan Apr 22, 2017 •

edited

Loading

geyslan Apr 22, 2017 •

edited

Loading