Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add uniq tool #28

Merged
merged 9 commits into from
Oct 20, 2019
Merged

add uniq tool #28

merged 9 commits into from
Oct 20, 2019

Conversation

geyslan
Copy link
Member

@geyslan geyslan commented Jan 3, 2017

This PR is for comparison with #27

Signed-off-by: Geyslan G. Bem geyslan@gmail.com

@i4ki
Copy link
Collaborator

i4ki commented Jan 3, 2017

@geyslan I thought we had a O(n) solution in both complexity and space once in our last discussion. No?

Depending on sort.Ints annoys me. The stable sort solution is guaranteed to be logarithmic but ... very generic and could change in the future.
https://golang.org/src/sort/sort.go?s=7612:7630#L487

The O(n) solution was broken in any way? Or it does not support some of the required features?

@i4ki i4ki requested review from katcipis and i4ki January 3, 2017 14:06
@geyslan
Copy link
Member Author

geyslan commented Jan 3, 2017

@tiago4orion the first solution (the simple one) that we have discussed is not broken. This PR is just for comparison with the last #27, I really do not consider this as a useful merge too. It's here just for analysis. I should have mentioned it in the #27 instead.

The one that you did refer was the first. Its time complexity is O(n) but its space depends on the input size O(2 * len(n)). I know that 2 * in the big-O should be disregarded, but the length of all inputs being duplicated seemed to me a waste of space.

That duplicate space is due to the repetition of the string in the linesMap and linesOrdered

LinesMap: = make(map[string]int)
LinesOrdered: = []string{}

To avoid that waste (e.g. a huge input stream) in the #27 I used three maps, but see that in only one inputMap of these We have the real string. The other two lineNumsMap lineMap works as pointers to the main map. I realized that we do not need an extra ordered list if we have a map with key int (line) and value pointing to the actual string.

InputMap: = make(map[string] *string)
LineNumsMap: = make(map[*string] []int)
LineMap: = make(map[int] *string)

You can tell me if their names (maps) reflect what they represent, I do not know.

Using that approach we can fill only one map with unique strings. I still have to think how to assure the allocation of the maps in the heap.

Things need to be changed, of course, let me know what you think.

first:

package main

import (
	"bufio"
	"io"
	"os"
)

type options struct {
	printDuplicates bool
	printEmptyLines bool
	stringDelimiter byte
}

func scanLines(input *os.File, opts options) (map[string]int, []string) {
	linesMap := make(map[string]int)
	linesOrdered := []string{}
	reader := bufio.NewReader(input)

	for {
		line, err := reader.ReadBytes(opts.stringDelimiter)
		if err == io.EOF {
			break
		}
		lineStr := string(line)
		linesOrdered = append(linesOrdered, lineStr)
		linesMap[lineStr]++
	}

	return linesMap, linesOrdered
}

func writeStdout(line string) {
	_, err := os.Stdout.WriteString(line)
	if err != nil {
		// what to do?
		os.Exit(1)
	}
}

func printLines(linesMap map[string]int, linesOrdered []string, opts options) {
	for _, line := range linesOrdered {
		if linesMap[line] == 0 {
			continue
		}
		if opts.printEmptyLines && line == string(opts.stringDelimiter) {
			writeStdout(line)
			continue
		}
		if opts.printDuplicates {
			if linesMap[line] < 2 {
				goto discardAndContinue
			}
		} else if linesMap[line] != 1 {
			goto discardAndContinue
		}
		writeStdout(line)
	discardAndContinue:
		linesMap[line] = 0
	}
}

func parseArgs(args []string) options {
	var opts options
	opts.stringDelimiter = byte('\n')

	for _, opt := range args[1:] {
		switch opt {
		case "-dup":
			opts.printDuplicates = true
		case "-empty":
			opts.printEmptyLines = true
		case "-nul":
			opts.stringDelimiter = 0x00
		default:
			writeStdout("Usage:\n")
			writeStdout("uniq [-dup | -empty | -nul]\n")
			os.Exit(1)
		}
	}
	return opts
}

func main() {
	opts := parseArgs(os.Args)
	linesMap, linesArr := scanLines(os.Stdin, opts)
	printLines(linesMap, linesArr, opts)
}

Copy link
Collaborator

@i4ki i4ki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output I'm getting is not expected. Maybe we're misinterpreting uniq behaviour. I'll open an issue describing the use cases I expect to work and we can discuss there before review your current implementation.

cmd/uniq/uniq.go Outdated

i := 1
for {
line, err := reader.ReadBytes(byte('\n'))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in addition to io.EOF, other kind of errors must abort.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Should be panic() or just os.Exit(1)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, print error and exit.

@i4ki
Copy link
Collaborator

i4ki commented Jan 3, 2017

@geyslan Regarding the O(n) implementation, I think we can change it to use string pointers in the linesOrdered array also, to avoid waste the double of memory. Using three maps instead of ordered slice requires sorting and complicates the logic.

I propose a third PR with the O(n) implementation and we can evaluate run time and memory usage.

Take a look in the rules of the Complexity section of Rob Pike C style:
http://doc.cat-v.org/bell_labs/pikestyle

@i4ki i4ki mentioned this pull request Jan 3, 2017
@geyslan
Copy link
Member Author

geyslan commented Jan 3, 2017

@tiago4orion did you notice that using 3 maps we don't need to sort anything? The catch is to iterate from 1 until the line count and test if this number exists in the lineMap, using i as the key.

Take a look:
06cc033#diff-d2c7a5b53893853781f228b0716c857fR68

Concerning your suggestion to use a slice with pointer to string in the map, I think it can be done, but using a structure to count occurrences.

type struct Line {
	Text *string
	Count int
}
...
inputMap := make(map[string]Line)
linesOrdered := []*Line

The inputMap is necessary only when scanning which lets the results ordered in the linesOrdered for later use.

The behavior (like GNU uniq) is not expected indeed. It was intentional. Take a look at #29 (comment).

About the new PR, NP. Just waiting more reviews/discussions.

@geyslan
Copy link
Member Author

geyslan commented Jan 3, 2017

For keep track of the line numbers we could use:

type struct Line {
	Text *string
	Numbers []int
}

@geyslan
Copy link
Member Author

geyslan commented Jan 4, 2017

Please, let's discuss this on #29 before any changes.

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
- add option '-every' which prints every specific line
  only one time.
- add usage() function.
- parseArgs() now returns error.
- dismember scanLines() and printLine() logics into
  addLine() and printLine().
- scanLine() now returns a slice of pointers to Line,
  enabling an output ordered by inclusion.

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
This makes tests on the two main functions: scanLines() and uniq().

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
@geyslan
Copy link
Member Author

geyslan commented Jan 23, 2017

Hello folks. 😃

I did a rebase and push some changes. If it's ok there's still a conundrum to solve #29 (comment) before merging.

@i4ki
Copy link
Collaborator

i4ki commented Jan 23, 2017

@geyslan Awesome =)

I'm very busy right now, sorry! But I'll try to take a look soon (maybe tomorrow).

@i4ki
Copy link
Collaborator

i4ki commented Mar 22, 2017

I have a real use case for uniq tomorrow. I'll give feedback soon.

:)

Copy link
Collaborator

@i4ki i4ki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the veryy late review...
I started this review several times before but never had time to finish it properly.

I've used your/our uniq in my work to get unique entries of CSV files and it worked like a charm. The csvs had 1.2mm and 2.1mm lines =)

Added some comments to make it more testable but it's already very good.

cmd/uniq/uniq.go Outdated

lineNum := 0
add := false
empAdded := false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the meaning of empAdded?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, got it. It means emptyAdded?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have to change it for better reading.

cmd/uniq/uniq.go Outdated

func scanLines(input io.Reader, opts options) []*Line {
reader := bufio.NewReader(input)
linesPtrMap := make(map[string]*Line)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line already has string pointers and Line.num is a slice (a ptr), then making a map[string]*Line is too much indirection with no space advantage. Using map[string]Line will simplify the code and allocate only bytes to store the two pointers (text and nums).

Copy link
Member Author

@geyslan geyslan Apr 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tiago4orion I believe that we have performance and space improvement using map[string]*Line once that LinesOrdered which is a []*Line (returned by scanLines()), contains only the pointers of the lines scanned into the map instead of a pure copy of a Line object. If we attempt to reduce the indirection used in the map, we'll probably certainly have to use more space to build the LinesOrdered slice.

See perf results on my machine of uniq as it is now (unfortunately for testing not using pointers I should have to change much of the logic, so I didn't do it).

$ cat ~/projs/rsrch/anas.platyrhynchos/data | wc -l
18532826

$ taskset 2 perf stat -e task-clock,cache-references,cache-misses,cycles,instructions,branches,branch-misses \
./uniq_pointer -num -dup < ~/projs/rsrch/anas.platyrhynchos/data
...

 Performance counter stats for './uniq_pointer -num -dup':

      64623,899938      task-clock:u (msec)       #    0,996 CPUs utilized
     6.161.710.231      cache-references:u        #   95,347 M/sec                    (83,33%)
     3.942.617.559      cache-misses:u            #   63,986 % of all cache refs      (83,32%)
   153.774.203.335      cycles:u                  #    2,380 GHz                      (66,68%)
   208.763.595.755      instructions:u            #    1,36  insn per cycle           (83,34%)
    41.554.644.963      branches:u                #  643,023 M/sec                    (83,33%)
       162.322.952      branch-misses:u           #    0,39% of all branches          (83,34%)

      64,867745418 seconds time elapsed

Besides of the certain increase of space usage these are my concerns about not using pointers:

 6.161.710.231      **cache-references:u**        #   95,347 M/sec                    (83,33%)
 3.942.617.559      **cache-misses:u**            #   63,986 % of all cache refs      (83,32%)

Using pointers we already have a lot of cache-misses. That misses will not increase using larger buffers instead of pointers? We need tests.

cmd/uniq/uniq.go Outdated
linesPtrMap[lineStr] = linep
}
linep.nums = append(linep.nums, lineNum)
add, empAdded = addLine(linep, opts, empAdded)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addLine is misleading because it doesn't add anything. It's a predicate, better to rename to shouldAdd or testAdd.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Tks.

cmd/uniq/uniq.go Outdated

func uniq(linesp []*Line, opts options) {
for _, linep := range linesp {
if !printLine(linep, opts) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, the name implies the function actually prints but not. Better to rename to shouldPrintLine or only shouldPrint?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, changing it to shouldPrint.

cmd/uniq/uniq.go Outdated
break
}
if err != nil {
fmt.Println(err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the main logic is here, if you exit in case of errors it's not possible to test properly. Calling exit here should exit the test program also.

Copy link
Member Author

@geyslan geyslan Apr 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Tks again. Changes:

In uniq.go

-func scanLines(input io.Reader, opts options) []*Line {
+func scanLines(input io.Reader, opts options) ([]*Line, error) {
...
-                       fmt.Println(err)
-                       os.Exit(1)
+                       return nil, err
...
-       return linesOrdered
+       return linesOrdered, nil

In uniq_test.go

-               scannedLines := scanLines(bytesReader, tScanLines.opts)
+               scannedLines, err := scanLines(bytesReader, tScanLines.opts)
+               if err != nil {
+                       t.Fatalf("test index %d: %v", i, err)
+               }
...
-               err := cmpLines(scannedLines, tScanLines.expected)
+               err = cmpLines(scannedLines, tScanLines.expected)

`)

func newLine(s string, ln []int) *Line {
newStr := new(string)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string is immutable and new is deprecated.
This function could be created something like:

func newLine(s string, ln []int) Line {
        line := s+"\n"
        return Line{
                text: &line,
                nums: ln,
        }
}

In Go every variable in analysed if it scapes the function and if so it's allocated in the heap and managed by GC. You don't need to call new anymore (since go1.3 or earlier).

}

func cmpLines(a, b []*Line) error {
if la, lb := len(a), len(b); la != lb {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Go len doesn't actually count but only accesses an internal field (len(s) == s.sz) and len call is always inlined.
Then,

if len(a) != len(b) {
         return fmt.Errorf("%T sizes diverge: %v != %v", a, len(a), len(b))

isn't slow.
*only a hint, a minor comment.

return fmt.Errorf("%T sizes diverge: %v != %v", a, la, lb)
}
for i, _ := range a {
linea := *a[i]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I'd said before, you don't need *Line here.

func TestScanLines(t *testing.T) {
for i, tScanLines := range ttScanLines {
bytesReader := bytes.NewReader(input)
scannedLines := scanLines(bytesReader, tScanLines.opts)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I'd said before, in case of error the entire test you exit.. Better to make this function return an error also.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already changed as mentioned above.

return "", err
}

savedStdout := os.Stdout
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's not idiomatic Go. The correct way would be simple define uniq something like:

func uniq(stdout io.Writer, linesp []Line, opts options) error { /* code here */ }

And the use like this in main:

linesp, err := scanLines(os.Stdin, opts)
if err != nil {
        fmt.Fprintf(os.Stderr, err.Error())
        os.Exit(1)
}
 err = uniq(os.Stdout, linesp, opts)
if err != nil {
        fmt.Fprintf(os.Stderr, err.Error())
        os.Exit(1)
} 

This way in your tests you can mock the stdout with:

var buf bytes.Buffer
err := unit(&buf, linesp, opts)
if err != nil {
        t.Fatal(err)
}
out := string(buf.Bytes())

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
 - remove deprecated new
 - calling len() without fear of it not being inlined
 - remove * dereference once range sends a copy of the object

Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
@i4ki i4ki merged commit cf1a1c7 into c0defellas:master Oct 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants