-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add uniq tool #28
add uniq tool #28
Conversation
@geyslan I thought we had a O(n) solution in both complexity and space once in our last discussion. No? Depending on sort.Ints annoys me. The stable sort solution is guaranteed to be logarithmic but ... very generic and could change in the future. The O(n) solution was broken in any way? Or it does not support some of the required features? |
@tiago4orion the first solution (the simple one) that we have discussed is not broken. This PR is just for comparison with the last #27, I really do not consider this as a useful merge too. It's here just for analysis. I should have mentioned it in the #27 instead. The one that you did refer was the first. Its time complexity is O(n) but its space depends on the input size O(2 * len(n)). I know that 2 * in the big-O should be disregarded, but the length of all inputs being duplicated seemed to me a waste of space. That duplicate space is due to the repetition of the string in the LinesMap: = make(map[string]int)
LinesOrdered: = []string{} To avoid that waste (e.g. a huge input stream) in the #27 I used three maps, but see that in only one InputMap: = make(map[string] *string)
LineNumsMap: = make(map[*string] []int)
LineMap: = make(map[int] *string) You can tell me if their names (maps) reflect what they represent, I do not know. Using that approach we can fill only one map with unique strings. I still have to think how to assure the allocation of the maps in the heap. Things need to be changed, of course, let me know what you think. package main
import (
"bufio"
"io"
"os"
)
type options struct {
printDuplicates bool
printEmptyLines bool
stringDelimiter byte
}
func scanLines(input *os.File, opts options) (map[string]int, []string) {
linesMap := make(map[string]int)
linesOrdered := []string{}
reader := bufio.NewReader(input)
for {
line, err := reader.ReadBytes(opts.stringDelimiter)
if err == io.EOF {
break
}
lineStr := string(line)
linesOrdered = append(linesOrdered, lineStr)
linesMap[lineStr]++
}
return linesMap, linesOrdered
}
func writeStdout(line string) {
_, err := os.Stdout.WriteString(line)
if err != nil {
// what to do?
os.Exit(1)
}
}
func printLines(linesMap map[string]int, linesOrdered []string, opts options) {
for _, line := range linesOrdered {
if linesMap[line] == 0 {
continue
}
if opts.printEmptyLines && line == string(opts.stringDelimiter) {
writeStdout(line)
continue
}
if opts.printDuplicates {
if linesMap[line] < 2 {
goto discardAndContinue
}
} else if linesMap[line] != 1 {
goto discardAndContinue
}
writeStdout(line)
discardAndContinue:
linesMap[line] = 0
}
}
func parseArgs(args []string) options {
var opts options
opts.stringDelimiter = byte('\n')
for _, opt := range args[1:] {
switch opt {
case "-dup":
opts.printDuplicates = true
case "-empty":
opts.printEmptyLines = true
case "-nul":
opts.stringDelimiter = 0x00
default:
writeStdout("Usage:\n")
writeStdout("uniq [-dup | -empty | -nul]\n")
os.Exit(1)
}
}
return opts
}
func main() {
opts := parseArgs(os.Args)
linesMap, linesArr := scanLines(os.Stdin, opts)
printLines(linesMap, linesArr, opts)
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The output I'm getting is not expected. Maybe we're misinterpreting uniq behaviour. I'll open an issue describing the use cases I expect to work and we can discuss there before review your current implementation.
cmd/uniq/uniq.go
Outdated
|
||
i := 1 | ||
for { | ||
line, err := reader.ReadBytes(byte('\n')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in addition to io.EOF, other kind of errors must abort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Should be panic() or just os.Exit(1)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, print error and exit.
@geyslan Regarding the O(n) implementation, I think we can change it to use string pointers in the linesOrdered array also, to avoid waste the double of memory. Using three maps instead of ordered slice requires sorting and complicates the logic. I propose a third PR with the O(n) implementation and we can evaluate run time and memory usage. Take a look in the rules of the Complexity section of Rob Pike C style: |
@tiago4orion did you notice that using 3 maps we don't need to sort anything? The catch is to iterate from 1 until the line count and test if this number exists in the Take a look: Concerning your suggestion to use a slice with pointer to string in the map, I think it can be done, but using a structure to count occurrences. type struct Line {
Text *string
Count int
}
...
inputMap := make(map[string]Line)
linesOrdered := []*Line The The behavior (like GNU uniq) is not expected indeed. It was intentional. Take a look at #29 (comment). About the new PR, NP. Just waiting more reviews/discussions. |
For keep track of the line numbers we could use: type struct Line {
Text *string
Numbers []int
} |
Please, let's discuss this on #29 before any changes. |
Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
- add option '-every' which prints every specific line only one time. - add usage() function. - parseArgs() now returns error. - dismember scanLines() and printLine() logics into addLine() and printLine(). - scanLine() now returns a slice of pointers to Line, enabling an output ordered by inclusion. Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
This makes tests on the two main functions: scanLines() and uniq(). Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Hello folks. 😃 I did a rebase and push some changes. If it's ok there's still a conundrum to solve #29 (comment) before merging. |
@geyslan Awesome =) I'm very busy right now, sorry! But I'll try to take a look soon (maybe tomorrow). |
I have a real use case for uniq tomorrow. I'll give feedback soon. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the veryy late review...
I started this review several times before but never had time to finish it properly.
I've used your/our uniq in my work to get unique entries of CSV files and it worked like a charm. The csvs had 1.2mm and 2.1mm lines =)
Added some comments to make it more testable but it's already very good.
cmd/uniq/uniq.go
Outdated
|
||
lineNum := 0 | ||
add := false | ||
empAdded := false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the meaning of empAdded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, got it. It means emptyAdded
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have to change it for better reading.
cmd/uniq/uniq.go
Outdated
|
||
func scanLines(input io.Reader, opts options) []*Line { | ||
reader := bufio.NewReader(input) | ||
linesPtrMap := make(map[string]*Line) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line
already has string pointers and Line.num is a slice (a ptr), then making a map[string]*Line is too much indirection with no space advantage. Using map[string]Line
will simplify the code and allocate only bytes to store the two pointers (text and nums).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tiago4orion I believe that we have performance and space improvement using map[string]*Line
once that LinesOrdered which is a []*Line
(returned by scanLines()
), contains only the pointers of the lines scanned into the map instead of a pure copy of a Line
object. If we attempt to reduce the indirection used in the map, we'll probably certainly have to use more space to build the LinesOrdered slice.
See perf
results on my machine of uniq as it is now (unfortunately for testing not using pointers I should have to change much of the logic, so I didn't do it).
$ cat ~/projs/rsrch/anas.platyrhynchos/data | wc -l
18532826
$ taskset 2 perf stat -e task-clock,cache-references,cache-misses,cycles,instructions,branches,branch-misses \
./uniq_pointer -num -dup < ~/projs/rsrch/anas.platyrhynchos/data
...
Performance counter stats for './uniq_pointer -num -dup':
64623,899938 task-clock:u (msec) # 0,996 CPUs utilized
6.161.710.231 cache-references:u # 95,347 M/sec (83,33%)
3.942.617.559 cache-misses:u # 63,986 % of all cache refs (83,32%)
153.774.203.335 cycles:u # 2,380 GHz (66,68%)
208.763.595.755 instructions:u # 1,36 insn per cycle (83,34%)
41.554.644.963 branches:u # 643,023 M/sec (83,33%)
162.322.952 branch-misses:u # 0,39% of all branches (83,34%)
64,867745418 seconds time elapsed
Besides of the certain increase of space usage these are my concerns about not using pointers:
6.161.710.231 **cache-references:u** # 95,347 M/sec (83,33%)
3.942.617.559 **cache-misses:u** # 63,986 % of all cache refs (83,32%)
Using pointers we already have a lot of cache-misses. That misses will not increase using larger buffers instead of pointers? We need tests.
cmd/uniq/uniq.go
Outdated
linesPtrMap[lineStr] = linep | ||
} | ||
linep.nums = append(linep.nums, lineNum) | ||
add, empAdded = addLine(linep, opts, empAdded) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addLine
is misleading because it doesn't add anything. It's a predicate, better to rename to shouldAdd
or testAdd
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. Tks.
cmd/uniq/uniq.go
Outdated
|
||
func uniq(linesp []*Line, opts options) { | ||
for _, linep := range linesp { | ||
if !printLine(linep, opts) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, the name implies the function actually prints but not. Better to rename to shouldPrintLine
or only shouldPrint
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, changing it to shouldPrint
.
cmd/uniq/uniq.go
Outdated
break | ||
} | ||
if err != nil { | ||
fmt.Println(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the main logic is here, if you exit
in case of errors it's not possible to test properly. Calling exit here should exit the test program also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. Tks again. Changes:
In uniq.go
-func scanLines(input io.Reader, opts options) []*Line {
+func scanLines(input io.Reader, opts options) ([]*Line, error) {
...
- fmt.Println(err)
- os.Exit(1)
+ return nil, err
...
- return linesOrdered
+ return linesOrdered, nil
In uniq_test.go
- scannedLines := scanLines(bytesReader, tScanLines.opts)
+ scannedLines, err := scanLines(bytesReader, tScanLines.opts)
+ if err != nil {
+ t.Fatalf("test index %d: %v", i, err)
+ }
...
- err := cmpLines(scannedLines, tScanLines.expected)
+ err = cmpLines(scannedLines, tScanLines.expected)
cmd/uniq/uniq_test.go
Outdated
`) | ||
|
||
func newLine(s string, ln []int) *Line { | ||
newStr := new(string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string is immutable and new
is deprecated.
This function could be created something like:
func newLine(s string, ln []int) Line {
line := s+"\n"
return Line{
text: &line,
nums: ln,
}
}
In Go every variable in analysed if it scapes the function and if so it's allocated in the heap and managed by GC. You don't need to call new
anymore (since go1.3 or earlier).
cmd/uniq/uniq_test.go
Outdated
} | ||
|
||
func cmpLines(a, b []*Line) error { | ||
if la, lb := len(a), len(b); la != lb { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Go len
doesn't actually count but only accesses an internal field (len(s) == s.sz
) and len call is always inlined.
Then,
if len(a) != len(b) {
return fmt.Errorf("%T sizes diverge: %v != %v", a, len(a), len(b))
isn't slow.
*only a hint, a minor comment.
cmd/uniq/uniq_test.go
Outdated
return fmt.Errorf("%T sizes diverge: %v != %v", a, la, lb) | ||
} | ||
for i, _ := range a { | ||
linea := *a[i] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I'd said before, you don't need *Line here.
cmd/uniq/uniq_test.go
Outdated
func TestScanLines(t *testing.T) { | ||
for i, tScanLines := range ttScanLines { | ||
bytesReader := bytes.NewReader(input) | ||
scannedLines := scanLines(bytesReader, tScanLines.opts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I'd said before, in case of error the entire test you exit.. Better to make this function return an error also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already changed as mentioned above.
return "", err | ||
} | ||
|
||
savedStdout := os.Stdout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's not idiomatic Go. The correct way would be simple define uniq
something like:
func uniq(stdout io.Writer, linesp []Line, opts options) error { /* code here */ }
And the use like this in main:
linesp, err := scanLines(os.Stdin, opts)
if err != nil {
fmt.Fprintf(os.Stderr, err.Error())
os.Exit(1)
}
err = uniq(os.Stdout, linesp, opts)
if err != nil {
fmt.Fprintf(os.Stderr, err.Error())
os.Exit(1)
}
This way in your tests you can mock the stdout with:
var buf bytes.Buffer
err := unit(&buf, linesp, opts)
if err != nil {
t.Fatal(err)
}
out := string(buf.Bytes())
Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
- remove deprecated new - calling len() without fear of it not being inlined - remove * dereference once range sends a copy of the object Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
This PR is for comparison with #27
Signed-off-by: Geyslan G. Bem geyslan@gmail.com