Proposals to improve PyFunceble #41

Closed
maravento opened this issue Jul 15, 2019 · 5 comments
maravento commented Jul 15, 2019

Problems:

  1. The installation method (described here) is confusing and does not match the manual: some commands require privileges and others do not, installation steps are mixed with execution steps, the env setup is unclear, etc.
  2. The minimum hardware and OS requirements are not documented
  3. It has no debug mode or logs, so there is no information when an error occurs (for example, it sometimes freezes and the cause cannot be determined)
  4. There are inconsistencies between what the manual says and the creator's suggestions in issues
  5. There is no technical performance data and no warning about the program's resource consumption or how to control it. I have consulted other projects that use this program and they do not provide this data either
  6. It becomes unstable and collapses or freezes when large lists (more than 3 M entries) are used

Possible bugs:

  1. Freezing: The program freezes on Ubuntu 18.04.x x64 with large lists (more than 3 M entries), and the only way to unlock it is with Ctrl+C. It happens with both small and large lists. The cause is unknown because the program has no debug mode or logs
  2. Wrong instructions: According to the instructions, after Ctrl+C is pressed to interrupt the program, the program must be run again with the --clean flag. This is very bad because all work is lost
  3. Warnings: The --clean flag should warn about what it does, to avoid partial or total loss of work
  4. Auto-continue failure: The auto-continue system is failing: when the program is interrupted or frozen, it does not resume where it left off, and as a result it generates duplicates in the output.
  5. Inconsistencies in the output: Processing a list generates 3 files in the hosts folder (ACTIVE / hosts, INACTIVE / hosts, INVALID / hosts). If, once the source list is finished, we take for example the INACTIVE / hosts file and reprocess it, the output should in theory be the same. It is not: part of this inactive list can come back as active. So the result is not reliable.
  6. Run modes and log file: The program needs execution modes (debugging, safe, normal, minimal, etc.) so that it does not compromise the stability of the system and so that problems can be understood more thoroughly. The program also needs a log file to facilitate auditing and diagnosing problems.
  7. Virtual Env: The suggested virtual environment (python3-virtualenv) does not work as it should
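
For items 4 and 5 above, a minimal sketch of how entries that were already tested could be skipped on restart, so that an interrupted run does not write duplicates. This is not PyFunceble's auto-continue implementation. The function names are hypothetical, and the sketch assumes plain text files with one entry per line.

```python
from pathlib import Path


def read_entries(path):
    """Return the non-empty, non-comment lines of a list file, in order."""
    p = Path(path)
    if not p.exists():
        return []
    lines = (line.strip() for line in p.read_text().splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def pending_entries(source, done_files):
    """Entries of `source` found in none of `done_files`, without duplicates."""
    seen = set()
    for done in done_files:
        seen.update(read_entries(done))
    pending = []
    for entry in read_entries(source):
        if entry not in seen:
            seen.add(entry)  # also drops duplicates inside the source itself
            pending.append(entry)
    return pending
```

Running the same comparison between the INACTIVE output of two consecutive runs would also show exactly which entries changed status, which is the inconsistency described in item 5.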

Hardware Test:
I have run different tests in physical environments with Ubuntu 18.04.3 x64 and large lists (more than 3 M entries). These are the results:

PC1: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz, RAM 32028 MiB
a. PyFunceble -m -p 200 -f file = system collapses
b. PyFunceble -m -p 150 -f file = freezes after a while running
c. PyFunceble -m -p 100 -f file = freezes after a while running
d. PyFunceble -m -p 50 -f file = test aborted, see 'CPU usage'
e. PyFunceble -f file = stable but slower than a bash script

PC2: Intel(R) Xeon(TM) CPU E5-2603 v4 @ 1.70 GHz, RAM 15903 MiB
a. PyFunceble -m -p 200 -f file = system collapses
b. PyFunceble -m -p 150 -f file = freezes after a while running
c. PyFunceble -m -p 100 -f file = freezes after a while running
d. PyFunceble -m -p 50 -f file = freezes after a while running
e. PyFunceble -f file = stable but slower than a bash script

CPU usage:
In all tests the program reaches 100% CPU usage with large lists (more than 3 M entries).
Screenshot 2019-08-07 11-48-21

Speed test: PyFunceble vs bash
Bash:
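#!/bin/bash
# Print the HTTP status code of a HEAD request for every line of source.txt.
while IFS= read -r LINE; do
    # --head sends a HEAD request; --write-out prints only the status code.
    curl -o /dev/null --silent --head --write-out '%{http_code}' "$LINE"
    echo " $LINE"
done < source.txt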
PyFunceble:
PyFunceble -f source.txt
Results after +1 hour:
PyFunceble: 1364 processed lines (in hosts/ACTIVE hosts/INACTIVE hosts/INVALID)
Bash: 2930 processed lines
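
A rough extrapolation of the figures above, assuming both runs lasted exactly one hour (the post says "+1 hour"), that throughput stays constant, and that the list holds exactly 3 M lines. The helper names are made up for this sketch.

```python
def lines_per_hour(processed, hours):
    """Average throughput over the measured period."""
    return processed / hours


def days_to_finish(total_lines, rate_per_hour):
    """Days needed for `total_lines` at a constant hourly rate."""
    return total_lines / rate_per_hour / 24


pyfunceble_rate = lines_per_hour(1364, 1.0)  # from the PyFunceble result
bash_rate = lines_per_hour(2930, 1.0)        # from the bash result

print(f"bash / PyFunceble ratio: {bash_rate / pyfunceble_rate:.2f}")
print(f"3 M lines with PyFunceble: {days_to_finish(3_000_000, pyfunceble_rate):.0f} days")
print(f"3 M lines with bash: {days_to_finish(3_000_000, bash_rate):.0f} days")
```

Under those assumptions the bash loop is about 2.15 times faster, and a 3 M line list would take roughly 92 days with single-process PyFunceble against roughly 43 days with the bash loop.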

Conclusion:
This application is only faster than a simple bash script when run with the "-m -p" flags, but then it becomes unstable and freezes or collapses the system. I suggest improving it in this regard so that it is usable. Regards

@maravento maravento changed the title Proposal to improve the HowTO Proposal to improve installation method Jul 16, 2019
@maravento maravento changed the title Proposal to improve installation method Proposals to improve PyFunceble Jul 22, 2019
@mitchellkrogza
Contributor

A containerised version, i.e. a Docker image or a VM, is a good idea and something we have discussed before. It would make it easy for someone to run it with set limits in a preconfigured environment. But those who know their specs and limitations can, as it is now, determine the limits through a few simple tests on smaller lists. For example, I know that on my Intel i7 3960x, 200 processes is my own max before it starts to drag the machine down.

I also run this on Ubuntu 18.04.2 x64 daily, with tests running 4-8 hours and no freezing, and all my tests use multiprocessing. Same on Ubuntu 16.04.2 and the latest Arch Linux. As for debug logs: YES, oh YES, we do indeed need them, and I know they will be coming soon.

@mitchellkrogza
Contributor

Here's a test of mine from last night using only 100 processes; it finished in just under 9 hours without freezing. This test ran on my Ubuntu 18.04.2 server, which keeps serving Nginx sites while the test is running, with MySQL running for the sites and PyFunceble using the same MySQL database.

Screenshot_20190811_074942

@mitchellkrogza
Contributor

mitchellkrogza commented Aug 13, 2019

Most of my posts are from my smartphone, as that is often the only time I have. Commenting on GitHub from a smartphone is not user friendly by any means.

First:
My specs for the above tests are a 12-core KVM (virtual processor / Proxmox)
16 GB memory allocated to this server VM
The physical processor is a Xeon E5-1650, which is split across 6 VMs (the attached screenshots are from the same test running again this morning, including CPU and memory while a burst of multiprocessing is active)

Second:
Are you using the default JSON database, or have you tried mySQL / mariaDB as has been suggested? With such a massive list of domains, JSON is bound to be causing issues. As you will notice, my test list above is 181,000+ strong and grows weekly, so splitting at 100K is not an option for me right now. Thus far it gets processed without freezing; all that differs is the length of time the tests take.

Third:
Have you tried running any of your tests without multiprocessing?
That's essentially the same approach as your bash script, only without multiprocessing, and it could probably still be a bit faster than the bash method. Before we had the multiprocessing option in PyFunceble we all ran it this way, one test at a time, and had it running in Travis-CI Docker containers across 50+ repos daily for almost 3 years. Some of our tests on big lists would take weeks to complete.

> it seems very good program

It is indeed, and you should not give up faith. It may not be as perfect as you want, but we have massive projects that have been running and relying on it for 3 years, day in and day out. There are always improvements and fixes when time allows @funilrys, but in its current state we run it on so many different environments and distributions and cannot replicate the freezing, and believe me, I have tried.

> To create a flag to control the hardware resources assigned when running the program (CPU Core/RAM/bandwidth)

I doubt this would ever be practical (I may be wrong though): I think it would be impossible to know what else is running on someone's machine besides PyFunceble. Such a switch might say "OK, let's allocate X processes because the CPU is X and the memory is X", but then 20 minutes into the test something else gets launched by the system or the user and the situation changes.

The key here is finding the sweet spot: how many processes to allocate before things go wrong. For safety's sake you could use even 25 processes, which is still way faster than any bash method or running one at a time. Even 10 processes is faster than 1 🤔, even 5 is faster than 1. It's just too tempting to push many, many processes in order to get such massive tests finished.
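
A minimal sketch of capping concurrency up front, assuming a simple rule of a few workers per CPU core. The `per_core=4` default is arbitrary and not a PyFunceble recommendation, all names are hypothetical, and the pool uses threads from the standard library rather than PyFunceble's own multiprocessing. `fake_check` stands in for a real availability test and makes no network calls.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def capped_workers(requested, per_core=4, cpu_count=None):
    """Clamp a requested worker count to per_core * CPU cores (at least 1)."""
    cores = cpu_count or os.cpu_count() or 1
    return max(1, min(requested, per_core * cores))


def run_checks(entries, check, workers):
    """Apply `check` to every entry with at most `workers` running at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(check, entries))  # results keep the input order


def fake_check(entry):
    # Placeholder for a real availability test; no network access here.
    status = "ACTIVE" if entry.endswith(".example") else "INVALID"
    return (entry, status)


if __name__ == "__main__":
    workers = capped_workers(200, per_core=4, cpu_count=4)
    print(workers, run_checks(["a.example", "b.invalid"], fake_check, workers))
```

A static cap like this cannot react to other load appearing mid-run, which is the objection raised above, so it only sets a starting point for finding the sweet spot by hand.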

I have automated PyFunceble tests on the same server above which run every hour 24/7 from Cron but are only allocated 50 processes so as to make sure, like this morning, they don't bring the server down while my current manual test of the bigger lists is in progress.

Let me correct myself a bit here: I HAVE indeed been able to freeze PyFunceble, and that was when I gave it 250 processes on my local machine. The max I can ever run on my local machine is 200 processes, but even then I am limited in what else I can do while that is running. By contrast, I can run 50 processes day in and day out while I have 5 browsers open, my email, and anything else I am working on, without really noticing it is running in the background.

It truly is about finding a sweet spot. With your VERY large list it is also a matter of, right now, splitting the load by splitting your lists into smaller chunks, for safety's sake, so as not to lose data. My suggestion would be to simply pick 25-50 processes, let it run, and also use the mySQL/mariaDB database option.
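
Splitting a large list into chunks could be sketched as follows. The function name and the chunk file naming are hypothetical; the 100,000 line default follows the "splitting at 100K" figure mentioned earlier in this thread.

```python
from pathlib import Path


def split_list(source, out_dir, chunk_size=100_000):
    """Write `source` into numbered files of at most `chunk_size` lines each."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths, chunk = [], []

    def flush():
        path = out / f"chunk_{len(paths):04d}.txt"
        path.write_text("\n".join(chunk) + "\n")
        paths.append(path)
        chunk.clear()

    with open(source) as handle:  # streamed, so the whole list is never in memory
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            chunk.append(line)
            if len(chunk) == chunk_size:
                flush()
    if chunk:
        flush()  # last, possibly shorter, chunk
    return paths
```

Each resulting file can then be tested on its own with `PyFunceble -f`, so an interrupted run costs at most one chunk of work.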

I have been discussing improvements to the database structure for mySQL/MariaDB with @funilrys, which I know are coming soon. They will dramatically improve the situation where you have to kill PyFunceble during a multiprocess test: it will be able to carry on where it left off and also NOT lose the data in the output folder. With this change the output folder files are only created at the very end of testing, by pulling the data from mySQL/MariaDB and generating the files from the database. I doubt this kind of change would ever work with the current default JSON database structure, which is one reason why SQL was introduced: we are all running into very large lists.
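
The idea of keeping results in a database and only generating the output files at the end could look roughly like this. It is a sketch only: it uses the standard library's `sqlite3` in place of mySQL/MariaDB so that it runs without a server, and the table, column and function names are hypothetical, not PyFunceble's schema.

```python
import sqlite3
from pathlib import Path


def open_db(path=":memory:"):
    """Open the results database and create the table if needed."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tested ("
        "subject TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    return db


def record(db, subject, status):
    """Store one result; re-testing a subject overwrites instead of duplicating."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO tested (subject, status) VALUES (?, ?)",
            (subject, status),
        )


def already_tested(db, subject):
    """True if a result for `subject` is stored, so a restart can skip it."""
    row = db.execute("SELECT 1 FROM tested WHERE subject = ?", (subject,)).fetchone()
    return row is not None


def export(db, out_dir):
    """Write one file per status; meant to run only once testing is finished."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    rows = db.execute("SELECT status, subject FROM tested ORDER BY status, subject")
    grouped = {}
    for status, subject in rows:
        grouped.setdefault(status, []).append(subject)
    for status, subjects in grouped.items():
        (out / f"{status}.txt").write_text("\n".join(subjects) + "\n")
    return sorted(grouped)
```

Because every result is committed as soon as it is known and the subject is the primary key, killing the run loses at most the tests in flight, and the exported files cannot contain duplicates.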

2019-08-13_09-47
2019-08-13_09-46
2019-08-13_09-43_1
2019-08-13_09-43

@mitchellkrogza
Contributor

> it is better to remove it from the program and set default to mySQL / mariaDB

There are too many existing users who run it to test smaller lists, where JSON is still OK, so it will remain the default.

I know a debug log is coming in the short term, but to be honest you will never succeed with very big lists like yours, or even mine, without using MySQL. This is why I switched the moment the option was available, as lists are growing rapidly.

@mitchellkrogza
Contributor

@maravento See bash script I added to #39

funilrys pushed a commit that referenced this issue Feb 21, 2021
Missing default date in sql:whois