Processing a log twice will count entries twice using on-disk store (dumb protection) #334

Open
billietl opened this Issue Nov 19, 2015 · 30 comments

Comments

@billietl

billietl commented Nov 19, 2015

When analyzing log files with the on-disk B+Tree feature enabled (--keep-db-files --load-from-disk at each run), parsing a log file twice will process and count each entry twice. It would be good to guard against this.

Possible ways to do this:
- remember which log entries have been analyzed (would take lots of space on disk)
- remember the date of the last log entry (log files would have to be processed in chronological order)
- something else I didn't think of?

@allinurl

Owner

allinurl commented Nov 19, 2015

Thanks for pointing this out. Do you actually need to parse the same log twice? To read persisted data only (without parsing new data) you could do:

goaccess --load-from-disk --keep-db-files
@aphorise


aphorise commented Nov 19, 2015

@allinurl - can you confirm whether counts are kept against each input file? That is to say, are counted sets created anew / separately based on the series of input files?

As a related side note, what do you think of digesting the URI with a hash-table lookup, similar to the proposition currently under development for the UAs? So basically something such as:

echo -n 'http://uri_request...' | sha256sum --tag
# or perhaps slightly shorter
echo -n 'http://uri_request...' | sha1sum --tag

As a comparative counter for each logged request. The only drawback with this approach is that it intensifies the input process, as SHA methods are arguably not light and would require an execution for each entry read; it must also be mentioned that this does a sort of compaction / compression of strings for the in-memory store, irrespective of the original string that was read.

Alternatively, a better approach may be to do this on each input file, which would reasonably suffice for files of any length / size (up to the ~2^61-byte input limit and beyond), where the probability of a collision is superbly low and may be ignored, e.g. (hashing stdin so the filename doesn't skew the comparison):

if [[ $(sha256sum < F1.log) == $(sha256sum < F2.log) ]] ; then echo "IGNORING: F1 & F2 since both are the same" ; else echo "PARSING: F1 or F2" ; fi

@allinurl


Owner

allinurl commented Nov 19, 2015

@aphorise - that's correct, input files go against the data set on disk.

I think I favor the second approach, where it's done on each input file. However, in the case where input files differ, I'm not sure how you would get the offset line (last parsed line)?

A hash-table lookup, as you said, would be pretty intense since it computes a SHA per request. Maybe, as @billietl mentioned, just keep track of the last line parsed (assuming the log is in chronological order)?

@aphorise


aphorise commented Nov 19, 2015

@allinurl - It would be great if we could strive to digest whole blocks / chunks of the file. Where file sizes / input differ, we could take the smaller set and compare it against the same SHA output on the larger sets. So something like:

if [[ $(tail -c 1000 F1.log | sha256sum) == $(tail -c 1000 F2.log | sha256sum) ]] ; then echo "IGNORING: F1 & F2 since both are the same" ; else echo "PARSING: F1 or F2" ; fi
# comparing the last 1000 bytes ('tail -c 1000')

Of course the files would need some reasonable minimum size, and all of these approaches are only useful for chunks / files that are exactly the same in both. As you also mentioned, one could take the first & last lines as extra comparisons too, though this would perhaps add too many levels of assumption, since, for example, the first & last lines of a file could be the only entries replicated from other sources. For this sort of approach, a modular smaller-chunk comparison would be ideal.

Take the smallest file, split it into X chunks (the largest possible value) of Y bytes each; for each of the X chunks compute a SHA block; then for each Y bytes, in the same order, in the next file do the same comparison to determine whether any chunks can be ignored.

All this is a partial solution, though it would minimally reduce exact duplicates that appear in exactly the same order.
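A minimal sketch of that chunked comparison, with hypothetical file names and a tiny 4-byte chunk size purely for illustration (a real implementation would use far larger chunks):

```shell
#!/bin/sh
# Hypothetical sketch: checksum fixed-size chunks of two files and
# count how many chunks match, so identical regions could be skipped.
# The 4-byte chunk size is only for illustration.
dir=$(mktemp -d)
printf 'aaaabbbbcccc' > "$dir/F1.log"
printf 'aaaaXXXXcccc' > "$dir/F2.log"

matches=0
for off in 0 1 2; do
  h1=$(dd if="$dir/F1.log" bs=4 skip="$off" count=1 2>/dev/null | sha256sum)
  h2=$(dd if="$dir/F2.log" bs=4 skip="$off" count=1 2>/dev/null | sha256sum)
  [ "$h1" = "$h2" ] && matches=$((matches + 1))
done
echo "$matches"   # chunks 0 and 2 match, chunk 1 differs
rm -rf "$dir"
```

Here the first and third chunks match, so only the middle chunk would need parsing.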

@allinurl allinurl changed the title from [new feature] Processing a log twice will count entries twice (dumb protection) to Processing a log twice will count entries twice using on-disk store (dumb protection) Nov 24, 2015

@LeeNX


LeeNX commented Nov 9, 2016

Just ran into this problem by accident: I ran a test script twice and the data doubled. Consider this an acknowledgement of the issue. It might be an idea to document this in the on-disk storage section.

As for finding a way to avoid duplicating data on disk: using the log timestamp will not work with piped data, and the idea of hashing/checksumming the complete log line might be overkill. How about hashing/checksumming only part of the data? Which part, I am not sure: date/time, IP, URL ...
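A rough sketch of that partial-line idea, assuming the combined log format (the awk field numbers below, mapping $1, $4, $7 to client IP, timestamp, and request path, are an assumption about that format):

```shell
#!/bin/sh
# Hypothetical sketch: checksum only selected fields of a log line
# (IP, timestamp, request path) rather than the whole line.
# Field positions assume the combined log format.
line='203.0.113.7 - - [09/Nov/2016:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512'

key=$(printf '%s\n' "$line" | awk '{print $1, $4, $7}')
hash=$(printf '%s' "$key" | sha1sum | awk '{print $1}')

echo "$key"    # 203.0.113.7 [09/Nov/2016:10:00:00 /index.html
echo "$hash"   # 40-hex-character digest used as the dedup key
```

The digest is shorter and fixed-width, but as noted above it still costs a hash computation per line read.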

@skorokithakis


skorokithakis commented Jan 8, 2017

My use case for wanting this is that I want to run GoAccess against my latest logs multiple times, for updates. If runs are not idempotent and count visitors twice, it would be very inconvenient.

@skorokithakis


skorokithakis commented Jan 8, 2017

Thinking about this some more, I think that one simple rule will take care of 99% of the problems here. Basically, the rule is:

"Make a note of the latest log line timestamp in the stored dataset and don't import anything before that from the log files in this run."

So, if I ran goaccess on my logfile that ended at 15:02:18 today and then ran it again a bit later, it would start up, see that the last piece of data in my db is from 15:02:18, and ignore any line earlier than or exactly at 15:02:18 in the logfile specified on the command line.

This would prevent double-counting unless the log files specified in this session contained duplicates (e.g. if someone specified a logfile and a copy of the same logfile), which strikes me as much rarer than the case where we just want to parse a logfile that was changed.

This simple change would make goaccess much, much more convenient for me. As it stands, I can't really use it as my sole analytics software because I can't be sure it won't double-count if I run it on an old logfile.

I'm really bad with C but this is so essential to me that I'd be willing to try writing a PR, if someone could point me in the right direction.
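A toy sketch of that rule in shell, assuming (purely for illustration) lines that begin with a lexicographically sortable timestamp; real access-log dates would need reformatting before comparing:

```shell
#!/bin/sh
# Hypothetical sketch: drop every line whose timestamp is <= the last
# timestamp already stored in the dataset. Assumes each line starts
# with a sortable "YYYY-MM-DD HH:MM:SS" prefix (19 characters).
last='2017-01-08 15:02:18'   # would come from the on-disk store

filtered=$(printf '%s\n' \
  '2017-01-08 15:02:17 GET /a' \
  '2017-01-08 15:02:18 GET /b' \
  '2017-01-08 15:02:19 GET /c' |
  awk -v last="$last" 'substr($0, 1, 19) > last')

echo "$filtered"   # only the 15:02:19 line survives
```

Only lines strictly newer than the stored timestamp pass through, which makes repeated runs over the same file idempotent (at the cost of dropping same-second duplicates, as discussed below).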

@skorokithakis


skorokithakis commented Jan 9, 2017

By the way, if you're worried about logs being processed out of order (i.e. someone wanting to process the most recent log first), there are two ways to address that:

  1. Goaccess can refuse to process items between its bounds. For example, if you have log lines from Jan 1 to Jan 30, you can refuse to process log items between those dates again. However, this might be an issue with people processing the Jan log, then the May log, then the Feb log, etc.
  2. A more flexible solution would be to just make this an option. Passing --idempotent or --update-only, goaccess will refuse to process log lines between dates it has already seen.

This (together with --process-and-exit) will allow people to set goaccess to update its logs during the day, from the same log, without worrying about duplicate items. We can simply add a cron job a few times a day that will read the latest logfile, and, whenever we want to see statistics, we can open goaccess at any point (possibly parsing the latest logfile again) to see completely up-to-date statistics, going back months or years, without needing to keep or reprocess the logs, or worry about getting duplicate data.

This is the use case I think most people want from goaccess, as it's how Google Analytics and other software works. I'm also very excited about being able to finally ditch GA for a non-intrusive (and more accurate) solution.

@allinurl


Owner

allinurl commented Jan 12, 2017

Thanks for your input. Here are my thoughts on this:

1- Add the ability to process multiple logs directly from goaccess. e.g., goaccess -f access.log -f access.log.1. This will give greater flexibility and avoid having to use zcat/cat for multiple files.

2- To process log files (no piped data) incrementally, we keep the inodes of all the files processed (assuming files will stay on the same partition) along with the last line parsed of each file and the timestamp of the last line parsed. e.g., inode:29627417|line:20012|ts:20171231235059

If the inode does not match the current file, parse all lines. If the current file matches the inode, we then read the remaining lines and update the count of lines parsed and the timestamp. Also, as you suggested and as an extra precaution, have a flag that, if passed, won't parse log lines with a timestamp ≤ the one stored.

3- Piped data is a bigger problem. It will have to be based on the flag mentioned above and the timestamp of the last line read. For instance, it will parse and discard all incoming entries until it finds a timestamp > the one stored (should it be inclusive?).

The issue here is that you could have multiple consecutive lines with the same timestamp (even at the second level), so a few may end up as duplicates. However, I assume that in most cases, for incremental log parsing, people would be parsing the logs as they are without pre-processing them and thus #1 would come in handy.

Unless I'm missing something, I think this would be a reasonable implementation that should address most of the problems of duplicate data for incremental parsing.

Thoughts?
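A sketch of the bookkeeping in item 2, using a throwaway file and GNU `stat -c %i`; the state string just mirrors the `inode:...|line:...|ts:...` example above, everything else is hypothetical:

```shell
#!/bin/sh
# Hypothetical sketch: persist "inode:N|line:N|ts:N" per log file, then
# on the next run parse everything (unknown inode) or only the unseen
# lines (same inode). stat -c %i is GNU coreutils.
dir=$(mktemp -d)
log="$dir/access.log"
state="$dir/state"

printf 'line1\nline2\n' > "$log"                 # first run saw 2 lines
inode=$(stat -c %i "$log")
printf 'inode:%s|line:%s|ts:%s\n' "$inode" 2 20171231235059 > "$state"

saved_inode=$(sed 's/inode:\([0-9]*\)|.*/\1/' "$state")
saved_line=$(sed 's/.*|line:\([0-9]*\)|.*/\1/' "$state")

printf 'line3\n' >> "$log"                       # log grew between runs
if [ "$inode" = "$saved_inode" ]; then
  new=$(tail -n "+$((saved_line + 1))" "$log")   # only the unseen lines
else
  new=$(cat "$log")                              # unknown file: parse all
fi
echo "$new"   # only line3 is new
rm -rf "$dir"
```

On the second run only the third line is read, which is exactly the incremental behavior proposed above.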

@skorokithakis


skorokithakis commented Jan 13, 2017

I completely second the above. I like the inode optimization, although I think it's important that the timestamp-aware exclusion be added, as sometimes logrotate might copy files and the inode will change, so duplicates should be avoided in that case.

I also think the timestamp comparison needs to be exclusive of the last time, as I would rather miss a few lines than double-count a few, but I don't feel very strongly about this.

Keep in mind that, if you accept multiple files in the command line, people will expect that they will all be live-updated. This might not be an issue, I'm just pointing it out.

Overall, this feature sounds great to me. I want to write a post about the inaccuracy of analytics and how GoAccess/Piwik can solve many of the problems at once, but I'd like GoAccess to be a bit more usable for the "set it and forget it" crowd before I recommend it. This feature would get me all the way there, and I could then show people how to set up GoAccess so they always get live and historical data at once.

Thank you!

@allinurl


Owner

allinurl commented Jan 13, 2017

@skorokithakis Looks like unless copytruncate is used, logrotate just renames the file. If it only moves the file within the same file system boundaries, that's purely a metadata change, so it should preserve the inode number, and things should work as expected :)

I'll look into this and post back. Thanks again and stay tuned!

@skorokithakis


skorokithakis commented Feb 18, 2017

Is there any update on this? I'd really like to write that post on using GoAccess for analytics, but without this feature the UX is too hard...

@allinurl


Owner

allinurl commented Feb 20, 2017

@skorokithakis From my previous comment, the first part (1) is done, and will be deployed soon in the upcoming version. 2 and 3 will probably come in the release after. I want to leave v1.1 as stable as possible before moving into 1.2. Stay tuned :)

@skorokithakis


skorokithakis commented Feb 20, 2017

Ah, alright, thank you!

@SeLLeRoNe


SeLLeRoNe commented Mar 1, 2017

I'm interested in this too :)
I'm just replying to receive updates.

Thanks

@chennin


chennin commented Jun 10, 2017

I just installed 1.2-1 from the deb.goaccess.io repo and am hitting this bug. skorokithakis' use case in the Jan 9 comment is my use case:

  • Parse existing logs once (some compressed) to HTML
  • Update the HTML every X hours
  • Not live updates

What's the current workaround for this? Hack up a rotate & parsing schedule?

@allinurl


Owner

allinurl commented Jun 10, 2017

@chennin I'm working on this on a local branch while keeping up with some bug reports. However, a workaround can be easily scripted in bash, something like this should help for now:

#!/bin/bash

# change the following two...
LOG=/var/log/nginx/access.log
LASTREAD=/home/user/.goaccess.last

OFFSET=$(sed -n '$=' "$LOG")
START=1
if [[ -s "$LASTREAD" ]]; then
  START=$(<"$LASTREAD")
  START=$((START + 1))
fi

sed -n "${START},${OFFSET}p" < "$LOG"
echo "$OFFSET" > "$LASTREAD"

then you can simply run it as:

./myscript.sh | goaccess --load-from-disk --keep-db-files --log-format=COMBINED

The script basically reads from the first unread line of your log to the end of it, then saves the number of the last line read into a .goaccess.last file, which is used as the starting point for the next run. Feel free to modify it to suit your env :)

@FedericoCeratto


FedericoCeratto commented Oct 29, 2017

Perhaps goaccess could be made smart enough to understand that all the files named foo.log[.<number>[.gz]] belong to the same source (instead of passing -f multiple times) and to scan older files only if needed, based on their mtime.

However, the whole point is to feed goaccess old data on its first run. Maybe this can be done externally with a script like #926

Related #925
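A small sketch of reading such a rotated set as one stream, oldest first, decompressing `.gz` members as needed (the file names are hypothetical):

```shell
#!/bin/sh
# Hypothetical sketch: concatenate foo.log.1.gz (older) and foo.log
# (current) into one chronological stream, decompressing where needed.
dir=$(mktemp -d)
printf 'old entry\n' | gzip > "$dir/foo.log.1.gz"
printf 'new entry\n' > "$dir/foo.log"

merged=$(for f in "$dir/foo.log.1.gz" "$dir/foo.log"; do
  case "$f" in
    *.gz) gzip -dc "$f" ;;
    *)    cat "$f" ;;
  esac
done)

echo "$merged"   # old entry, then new entry
rm -rf "$dir"
```

The merged stream could then be piped into goaccess, which is essentially what the zcat/cat workarounds in this thread do by hand.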

@shaun-ba


shaun-ba commented Nov 7, 2017

@allinurl This would be great. Essentially my scenario is that I want to append to the report over and over. As I understand it, to do that I need to persist logs to disk every time, then load them back on each subsequent parse. Is this really the best way?

My logfiles are 3.2GB each, for example, but the report on 50GB of logs is only 1MB. I simply want an "all-time" report that I can append to; is that what this issue will address?

@allinurl


Owner

allinurl commented Nov 8, 2017

@shaun-ba Currently that's the only way, but I definitely see the advantage of appending data to the HTML file. I plan to address this in #631. The only downside is that it wouldn't work for the terminal output nor the CSV output (may be hard to implement).

@shaun-ba


shaun-ba commented Nov 9, 2017

@allinurl I've not seen how much data is stored to disk; would it be similar to the logfile size? If so, that's not very practical, at least for me :)

@allinurl


Owner

allinurl commented Nov 9, 2017

@shaun-ba It won't be similar. Give it a shot using a smaller dataset to begin with.

@Geobert


Geobert commented Mar 8, 2018

@allinurl How does the myscript workaround behave when the log gets rotated?

@Geobert


Geobert commented Mar 9, 2018

To answer my own question: the script does not handle that case. I added rm ~/.goaccess.last to the logrotate postrotate configuration for nginx.

@allinurl


Owner

allinurl commented Mar 9, 2018

@Geobert Thanks for posting that. I was looking for a viable solution to this, but your answer makes total sense and is much simpler :)

@Geobert


Geobert commented Mar 9, 2018

You're welcome, happy to help :D A pity I dislike C so much; if it were in Rust, I would contribute much more :D

@Glandos


Glandos commented Apr 28, 2018

@Geobert Except that when GoAccess 0.6 was released (7 October 2013), Rust was only at version 0.8, and at that time external libraries were practically nonexistent.

@Geobert


Geobert commented Apr 28, 2018

True, fair enough :)

@freemp


freemp commented May 9, 2018

Just wanted to share my solution working for OpenBSD's httpd on a system running goaccess as a daemon by means of this rc.d script:

#!/bin/ksh

logfile="/var/www/logs/access.log"

daemon="/usr/local/bin/goaccess"

. /etc/rc.d/rc.subr

rc_reload=NO

rc_stop() {
	pkill -INT -T "${daemon_rtable}" -xf "${pexp}"
}

rc_post() {
	if [ -f $logfile ]; then
		gzip -c $logfile >> $logfile.0.gz
		cp /dev/null $logfile
		pkill -USR1 -u root -U root -x httpd
	fi
}

rc_cmd $1

When the goaccess daemon needs to shut down, it receives a SIGINT so it terminates gracefully. Then the log file is appended (gzipped) to the last archived log and cleared, and the http daemon is notified (SIGUSR1) to reopen the log file.

@allinurl


Owner

allinurl commented May 9, 2018

@freemp This is awesome, thanks for sharing it!
