New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Processing a log twice will count entries twice using on-disk store (dumb protection) #334
Comments
Thanks for pointing this out. Do you actually need to parse the same log twice? To read persisted data only (without parsing new data) you could do:
|
@allinurl - can you confirm if sets are against each As a related side note what do you think of digesting the
As a comparative counter for each logged request. The only draw back with this approach is that it does intensify the Alternatively a better approach may be to do this on each if [[ $(sha256sum F1.log) == $(sha256sum F2.log) ]] ; then echo "IGNORING: F1 & F2 since both are the same" ; else echo "PARSING: F1 or F2" ; fi ; I think the later may be an easy & light-enough winner for us to include? |
@aphorise - that's correct, input files go against the data set on disk. I think I favor the second approach, where it's done on each input-file, however, in the case where input files differ, then I'm not sure how would you get the offset line (last parsed line)? A hash-table lookup, like you said, it would pretty intense to compute the |
@allinurl - It would be great if we can strive to digest whole blocks / chunks of the file. Where file sizes / input differ - then we can take the smaller set of all and compare against the same
Of course the files would require some reasonable minimal and all of these approach would only be useful for exact chunks / files that are the same in both. As you also mentioned - one can also take the first & last lines as extra comparisons too - though this would perhaps add too many levels of assumption since for example the first & last line of a file could be the only entries that are a replica from other sources. For these sort of approach a modular smaller chunk comparison would be ideal. Taking smallest file - split by All this is a partial solution - though it would minimally reduce exact duplicates that are similarly in the exact same order. |
Just run into this problem, by accident, ran a test script twice and the data doubled. Acknowledgement of issue. Might be an idea to document in the on-disk storage section. As for finding a way to avoid duplicating data on-disk using log stamp will not work with piped data. And idea to hash/checksum the complete log line might be over kill, how about only hashing/checksuming part of the data? Which part, I am not sure, date/time, IP, URL ... |
My use case for wanting this is that I want to run GoAccess against my latest logs multiple times, for updates. If runs are not idempotent, and they count visitors twice, it would be very inconvenient. |
Thinking about this some more, I think that one simple rule will take care of 99% of the problems here. Basically, the rule is: "Make a note of the latest log line timestamp in the stored dataset and don't import anything before that from the log files in this run." So, if I ran goaccess on my logfile that ended at 15:02:18 today, and then ran it again a bit later, it would start up, see that the last piece of data in my db is 15:02:18, and any line earlier than or exactly at 15:02:18 in the logfile specified in the commandline would be ignored. This would prevent double-counting unless the log files specified in this session contained duplicates (e.g. if someone specified a logfile and a copy of the same logfile), which strikes me as much rarer than the case where we just want to parse a logfile that was changed. This simple change would make goaccess much, much more convenient for me. As it stands, I can't really use it as my sole analytics software because I can't be sure it won't double-count if I run it on an old logfile. I'm really bad with C but this is so essential to me that I'd be willing to try writing a PR, if someone could point me in the right direction. |
By the way, if you're worried about logs being processed out of order (i.e. someone wanting to process the most recent log first), there are two ways that can be addressed:
This (together with This is the use case I think most people want from goaccess, as it's how Google Analytics and other software works. I'm also very excited about being able to finally ditch GA for a non-intrusive (and more accurate) solution. |
Thanks for your input. Here are my thoughts on this: 1- Add the ability to process multiple logs directly from goaccess. e.g., 2- To process log files (no piped data) incrementally, we keep the inodes of all the files processed (assuming files will stay on the same partition) along with the last line parsed of each file and the timestamp of the last line parsed. e.g., If the inode does not match the current file, parse all lines. If the current file matches the inode, we then read the remaining lines and update the count of lines parsed and the timestamp. Also, as you suggested and as an extra precaution, have a flag that if passed, it won't parse log lines with a timestamp ≤ than the one stored. 3- Piped data is a bigger problem. It will have to be based on the flag mentioned above and the timestamp of the last line read. For instance, it will parse and discard all incoming entries until it finds a timestamp > than the one stored (should it be inclusive?). The issue here is that you could have multiple consecutive lines with the same timestamp (even at the second level), so a few may end up as duplicates. However, I assume that in most cases, for incremental log parsing, people would be parsing the logs as they are without pre-processing them and thus #1 would come in handy. Unless I'm missing something, I think this would be a reasonable implementation that should address most of the problems of duplicate data for incremental parsing. Thoughts? |
I completely second the above. I like the inode optimization, although I think it's important that the timestamp-aware exclusion be added, as sometimes logrotate might copy files and the inode will change, so duplicates should be avoided in that case. I also think the timestamp comparison needs to be exclusive of the last time, as I would rather miss a few lines than double-count a few, but I don't feel very strongly about this. Keep in mind that, if you accept multiple files in the command line, people will expect that they will all be live-updated. This might not be an issue, I'm just pointing it out. Overall, this feature sounds great to me. I want to write a post about the inaccuracy of analytics and how GoAccess/Piwik can solve many of the problems at once, but I'd like GoAccess to be a bit more usable for the "set it and forget it" crowd before I recommend it. This feature would get me all the way there, and I could then show people how to set up GoAccess so they always get live and historical data at once. Thank you! |
@skorokithakis Looks like unless I'll look into this and post back. Thanks again and stay tuned! |
Is there any update on this? I'd really like to write that post on using GoAccess for analytics, but without this feature the UX is too hard... |
@skorokithakis From my previous comment, the first part (1) is done, and will be deployed soon in the upcoming version. 2 and 3 will probably come in the release after. I want to leave v1.1 as stable as possible before moving into 1.2. Stay tuned :) |
Ah, alright, thank you! |
I'm interested on this too :) Thanks |
I just installed 1.2-1 from the deb.goaccess.io repo and am hitting this bug. skorokithakis' usecase in the Jan 9 comment is my use case:
What's the current workaround for this? Hack up a rotate & parsing schedule? |
@chennin I'm working on this on a local branch while keeping up with some bug reports. However, a workaround can be easily scripted in bash, something like this should help for now:
then you can simply run it as:
The script will basically start reading from the first line of your log until the end of it, then it will save the last recorded line into a |
Perhaps goaccess could be made smart enough to understand that all the files named However, the whole point is to feed goaccess old data on its first run. Maybe this can be done externally with a script like #926 Related #925 |
@freemp This is awesome, thanks for sharing it! |
…ring data. See #334 for double counting protecting on persisted data.
This has been added to v1.4 (implemented as stated here) - finally. This release is around the corner, so stay tuned next week. Thanks again for all the feedback. |
When analyzing log files with on-disk B+Tree feature enabled (--keep-db-files --load-from-disk at each run), parsing a log file twice will process and count each entry twice. Would be a good thing to avoid such feature.
Possible way to do this :
_ remember wich log entry has been analyzed (would take lots of space on disk)
_ remember date from last log entry (log files will have to be processed in chronological order)
_ something else I didn't think of ?
The text was updated successfully, but these errors were encountered: