Inconsistency in manual/docs and help message #38

maravento · 2019-07-11T16:22:43Z

I have been told that this is a "magic" tool. And I congratulate you for that, however i have read the instructions several times:
https://pyfunceble.readthedocs.io/en/latest/what-can-we-do.html
And I still have no idea how to verify a list of urls, nor the format that this list should have.
You could review the manual and make it more friendly, with examples. Thank you

mitchellkrogza · 2019-07-11T16:29:50Z

Lists should be plain text, one domain per line.
Just pass the file with -f mydomainslist.txt to PyFunceble.

Simple example: PyFunceble -ex --dns 1.1.1.1 1.0.0.1 --plain -f ${input}

mitchellkrogza · 2019-07-11T16:37:16Z

Here's just one of my many repos using it. This script uses all the TravisCI functionality, which you won't need in your local environment: https://github.com/mitchellkrogza/Badd-Boyz-Hosts/blob/master/dev-tools/DataTesting.sh

maravento · 2019-07-11T16:39:42Z

What format should the list have, and the parameter to "write" in an output file?

mitchellkrogza · 2019-07-11T16:44:06Z

domain1.com
domains2.com
www.whatever.com

A simple list with just one entry per line.

The output folder is created in whatever folder you run PyFunceble from.

See output folder here that gets created by PyFunceble

https://github.com/mitchellkrogza/Phishing.Database/tree/master/phishing-domains

You can specify the location of the output by using export

maravento · 2019-07-11T16:47:18Z

Look at the image. That's how they are. However it says that google.com INVALID (PyFunceble -uf test)

mitchellkrogza · 2019-07-11T16:51:00Z

It must be just the -f parameter, not -uf.
First try this simple test: PyFunceble -d google.com

Then try PyFunceble -f domains.txt

mitchellkrogza · 2019-07-11T16:54:53Z

Not sure if you are seeing the correct version of the docs, but it does state the format.

mitchellkrogza · 2019-07-11T17:01:11Z

The ones I use in all my projects

ACTIVE = my list of active domains
INACTIVE = my list of dead domains
INVALID = the domain syntax is somehow invalid

mitchellkrogza · 2019-07-11T17:01:53Z

So INACTIVE is what you want for your dead domains lists

mitchellkrogza · 2019-07-11T17:08:23Z

Also, if you want to force re-testing of everything every time you run PyFunceble, add the -db flag to stop it from relying on its own database. But once you learn how smart the database is, you should just leave it to do its own thing.

mitchellkrogza · 2019-07-11T17:10:38Z

See my usage of ACTIVE INACTIVE and INVALID here https://github.com/mitchellkrogza/Phishing.Database

mitchellkrogza · 2019-07-11T17:18:17Z

I use the INVALID lists every now and again to clean up my input lists of any formatting errors. You will see on Phishing Database that the INVALID numbers hardly feature anymore for me, but they were crucial in the beginning to get all the cleaning functions of my input sources correct.

maravento · 2019-07-11T18:43:17Z

Not sure if you are seeing the correct version of the docs, but it does state the format.

The confusion was in the description of the help file:

-f FILE, --file FILE
    Read the given file and test all domains inside it. If
    a URL is given we download and test the content of the
    given URL.

-uf URL_FILE, --url-file URL_FILE
    Read and test the list of URL of the given file. If a
    URL is given we download and test the list (of URL) of
    the given URL content.

But it is clear now: it is "-f" only.

funilrys · 2019-07-11T21:07:43Z

Hi @maravento, and thanks for your feedback.

Sorry that it wasn't clear. I'll do my best to improve the documentation for the future.

To recapitulate.

Test of file with URLs

If you want to test a list of URLs in this format:

https://example.org/
http://example.org
https://example.org/hello_world

You can pass the file path with -uf.

Test of a file in plain text or host file format

If you want to test a list of domains or IPs in plain text or hosts file format, like this:

127.0.0.1 example.org
beispiel.org
0.0.0.0 example.com

You can pass the file path with -f.
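
To make the two accepted line shapes concrete, here is a minimal Python sketch that pulls the subject to test out of plain and hosts-format lines. It is only an illustration: `extract_subject` is a hypothetical helper, not PyFunceble's parser, and the handling of `#` comments is an assumption.

```python
def extract_subject(line):
    """Return the domain or IP to test from one input line, or None."""
    # Assumption: anything after "#" is a comment and can be dropped.
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    parts = line.split()
    # Hosts format is "<ip> <domain>"; plain format is "<domain-or-ip>".
    return parts[1] if len(parts) >= 2 else parts[0]


sample = """127.0.0.1 example.org
beispiel.org
0.0.0.0 example.com
"""

subjects = [s for s in map(extract_subject, sample.splitlines()) if s]
print(subjects)
```

Both shapes can be mixed in the same file, as in the sample above.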

Confusions (to fix in docs)

Sorry for the confusion I created.

Indeed both -uf and -f can take a raw URL.

For example, let's say I want to test this file, I can give it to -f and PyFunceble will download and test its content. That's what I tried to explain in the doc.

What is the difference between ACTIVE vs VALID and INACTIVE vs INVALID?

Documentation: https://pyfunceble.readthedocs.io/en/latest/columns/status.html#status

Because there are many possibilities, I described the output structure of this project in one file called dir_structure_production.json, which is later downloaded and stored in your local filesystem as dir_structure.json.

What I do is generate the output directory before even starting the test.
So, to explain:

ACTIVE refers to the output of an availability test.
VALID refers to the output of a syntax test.
INVALID refers to the output of an availability test or syntax test.
INACTIVE refers to the output of an availability test.

Difference between availability and syntax test

Availability test

The availability test consists of finding the availability of a domain, IP or URL.

Domain and IP

The availability of a domain or IP is determined from the results of WHOIS records, NSLOOKUP and the HTTP status code.

URL

The availability of a URL is determined from the HTTP status code. cf: documentation

Syntax test

The syntax test is just a syntax test.

As you understand Python, you can review our syntax test/check logic here.

Auto continue

Your question:

Suppose I start processing a list of 5 million URLs. And there is a power cut or internet drops. Which parameter allows start / continue from the last point (where did the cut occur)?

Documentation: https://pyfunceble.readthedocs.io/en/latest/components/auto-continue.html

As the auto continue system is activated by default (unless you disable it in your personal .PyFunceble.yaml), you have nothing to do. The system will auto continue by itself.

How does it work?

Documentation: https://pyfunceble.readthedocs.io/en/latest/components/auto-continue.html#how-does-it-work

In other words, everything happens in output/continue.json or, if you use the MariaDB/MySQL database type, in the continue table.

The idea is to log everything which has been tested and, on the next run (after the power cut in your example), remove the tested elements from the original list to test.

In Python, we do the equivalent of the following on a bigger scale.

to_test = [1,2,3,4,5]
already_tested = [2,3,4]

to_test = list(set(to_test) - set(already_tested))
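
As a slightly fuller illustration of the same idea, here is a minimal sketch of a resume journal. Everything in it is hypothetical (the `tested.json` file name, the flat list format and the helper names); it does not reproduce the real layout of continue.json. Unlike the set difference above, it keeps the original order of the list.

```python
import json
import tempfile
from pathlib import Path


def remaining(to_test, already_tested):
    """Return the untested items, keeping the order of the input list."""
    done = set(already_tested)
    return [item for item in to_test if item not in done]


def load_tested(journal):
    """Read the tested items; an absent journal means nothing was tested."""
    return json.loads(journal.read_text()) if journal.exists() else []


def mark_tested(journal, item):
    """Append one tested item to the journal and write it back."""
    tested = load_tested(journal)
    tested.append(item)
    journal.write_text(json.dumps(tested))


with tempfile.TemporaryDirectory() as workdir:
    journal = Path(workdir) / "tested.json"  # hypothetical file name
    subjects = ["a.example", "b.example", "c.example"]
    # First run: two subjects are tested, then the run is interrupted.
    for subject in subjects[:2]:
        mark_tested(journal, subject)
    # Second run: only the untested subject is left.
    resumed = remaining(subjects, load_tested(journal))

print(resumed)
print(remaining([1, 2, 3, 4, 5], [2, 3, 4]))
```

Rewriting the whole journal after every item is fine for a toy example but would be slow for millions of entries.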

Thanks again for your feedback. I hope that I clarified things here. If not, please let me know.

Cheers,
Nissar

maravento · 2019-07-11T21:26:34Z

perfect. well explained. Thanks a lot.

mitchellkrogza · 2019-07-12T06:39:17Z

@maravento awesome. Now let me make that even better for you, as I currently process 60000 domains in 4 hours.

Welcome to the absolutely brilliant Multiprocessing of PyFunceble.

Add the flags -m -p 100 to your PyFunceble command line and see the magic. The maximum number of processes we have found workable is about 200-250, so experiment with what works for you.

maravento · 2019-07-12T13:25:56Z

I'm running the command like this:
PyFunceble -qf list
To increase the processing rate, is your proposal to run it like this?
PyFunceble -q -f -m -p 100 list

-p PROCESSES, --processes PROCESSES
    Set the number of simultaneous processes to use while using multiple processes. Configured value: 25
-m, --multiprocess
    Switch the value of the usage of multiple process. Configured value: False

Why is there no value for the flag "-m"?

What is the maximum level of processing allowed and what is the consumption of resources per process?

P.S.: I am using an HP ProLiant M110 G9 test server running 24/7, with 8 GB RAM free and 10 Mb bandwidth.

mitchellkrogza · 2019-07-12T15:05:35Z

No, use PyFunceble -q -m -p 100 -f list. This will run 100 processes at the same time.

maravento · 2019-07-12T15:22:44Z

Then, according to my resources described above, how should I run the command for maximum performance and speed?

mitchellkrogza · 2019-07-12T15:40:43Z

Try 100 processes. If it's too much, drop it to 50; if it's too little, raise it, up to a maximum of 250. -m is just the switch that turns multiprocessing on; you then specify how many processes with -p xx. With that CPU you should comfortably get away with running 150 processes ... Just try the exact command line I gave and let us know.
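
The flag order matters here: -f must be followed directly by the list file, so the switches go before it. Here is a small Python sketch that only assembles the command line for a few process counts, using the flags quoted in this thread (-q, -m, -p, -f). The helper name `build_command` is hypothetical, and nothing is executed.

```python
import shlex


def build_command(list_file, processes=None, quiet=True):
    """Assemble a PyFunceble command line using the flags quoted above."""
    cmd = ["PyFunceble"]
    if quiet:
        cmd.append("-q")
    if processes is not None:
        # -m switches multiprocessing on, -p sets the number of processes.
        cmd += ["-m", "-p", str(processes)]
    # -f comes last so that the list file directly follows it.
    cmd += ["-f", list_file]
    return cmd


for count in (50, 100, 150):
    print(shlex.join(build_command("list", processes=count)))
```

To actually run one of these, the returned list could be handed to `subprocess.run`.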

mitchellkrogza · 2019-07-15T16:13:57Z

@maravento try pip3 uninstall --user PyFunceble let's see if that helps

maravento · 2019-07-15T16:17:09Z

@maravento try pip3 uninstall --user PyFunceble let's see if that helps

Usage:
pip uninstall [options] <package> ...
pip uninstall [options] -r <requirements file> ...
no such option: --user

mitchellkrogza · 2019-07-15T16:21:22Z

Did you install it with pip or pip3 🤔

mitchellkrogza · 2019-07-15T16:26:01Z

My bad, sorry: uninstall indeed has no --user option. I'm helping you from my phone as best as I can. It should be just pip uninstall package or pip3 uninstall package 🤔 @funilrys will have to assist further. For now, why not just leave it as is, fire up Conda and run it there? It won't matter if you have it installed on your system, as you will be running a new instance from inside the Conda environment.

mitchellkrogza · 2019-07-15T16:28:06Z

Just going back a few posts from earlier: are you doing all this in a VM on VirtualBox, or did you want a guide to creating a foolproof VM environment for running PyFunceble?

maravento · 2019-07-15T16:30:37Z

Just going back a few posts from earlier: are you doing all this in a VM on VirtualBox, or did you want a guide to creating a foolproof VM environment for running PyFunceble?

On a dedicated physical server (description is HERE)

mitchellkrogza · 2019-07-15T16:34:50Z

Ok, got that. I was just referencing your request to do it in a VM ... I could build one tomorrow which will work and may benefit others too. Still, I cannot explain why you are experiencing freezing on your hardware. We run PyFunceble in Docker containers with Multiprocessing and don't get freezes or anything. @funilrys will have to assist you to trace that.

mitchellkrogza · 2019-07-15T16:38:48Z

Please bear in mind I'm a user just like you. I'm not the author, but I have been using this extensively since Nissar started building it from some of my crazy ideas.

mitchellkrogza · 2019-07-15T16:46:10Z

Please bear in mind I'm a user just like you. I'm not the author, but I have been using this extensively since Nissar started building it from some of my crazy ideas.

I am clear that your role is a contribution. It's in README. And thank you very much for your help. But I think it's time for the creator to intervene in this thread, because the HowTo document is quite confusing.

I think you need to add the -nl parameter to your existing command line

mitchellkrogza · 2019-07-15T16:47:15Z

No logs (nl) defaults to false but adding -nl toggles it to true

mitchellkrogza · 2019-07-15T16:51:43Z

Please bear in mind I'm a user just like you. I'm not the author, but I have been using this extensively since Nissar started building it from some of my crazy ideas.

I am clear that your role is a contribution. It's in README. And thank you very much for your help. But I think it's time for the creator to intervene in this thread, because the HowTo document is quite confusing.

Pleasure, and don't stress, we will get you up and running for sure. @funilrys works Mon-Fri and his time is limited, so I help where I can. He will respond once he's online, which he has not been all day, so I know he's hammering away at some code somewhere.

funilrys · 2019-07-15T18:59:16Z

Hello there,

Sorry for being so silent. I'm here between work, the next version of this tool, a huge private project and family :)

So let's go!

Multiprocessing

Why there is no value for the flag "-m"?

The -m flag is the one that activates the Multiprocessing subsystem. It's just a switch.

What is the maximum level of processing allowed and what is the consumption of resources per process?

I can't really answer that as there are too many variables. But generally in modern x64 machines, 100-150 is sufficient if you have other business running.

These are some of the obvious variables that directly come to mind:

Internet speed/bandwidth
Memory usage/impact
Drive sanity as we do a lot of I/O
DNS Server
Whois Server
...

It really depends on the machine most of the time.

Reduce memory impact (and freezes ?)

For your big amount of data (I didn't think you would test 5 million entries), I recommend setting up a MySQL/MariaDB database to handle the big amount of data that has to be reread/reconstructed on each loop.

It's actually way better as we don't have to keep the following dataset/subsystem in memory:

Auto continue
InactiveDB
Mining
WhoisDB

The (short) documentation about the database can be found here: https://pyfunceble.readthedocs.io/en/latest/components/databases.html

I should mention that more deeply in the documentation. Thanks for mentioning.

Please read more about it in the documentation:

Multiprocessing component: https://pyfunceble.readthedocs.io/en/latest/components/multiprocessing.html
Database Types: https://pyfunceble.readthedocs.io/en/latest/components/databases.html#databases-types

Freeze

PyFunceble freezes a lot and I have to stop (ctrl + c) and restart. It does not matter if I use -p xxx or not (and it's not the hardware)

I'm not aware of any freeze. But I hope that using the MariaDB/MySQL database type can solve that.

Uninstallation

The HowTo does not show how to uninstall PyFunceble:

Well, it depends on how you install it but I never thought it was necessary. Will be added to the documentation.

Arch Linux

Arch users can simply do

$ yourFavoriteAurHelper -Rns pyfunceble

PyPi

A package installed from PyPi can be uninstalled as follows

$ pip3 uninstall pyfunceble

I don't understand why you get the following.

configparser.DuplicateOptionError: While reading from '<???>' [line 3]: option 'pyfunceble' in section 'console_scripts' already exists

It might be because of your version of pip/pip3. Here is mine under my virtualenv, but it's actually the same outside the env under Arch:

$ pip --version
pip 19.1.1 from /home/funilrys/repositories/GitHub/source/PyFunceble/venv/lib/python3.7/site-packages/pip (python 3.7)

Can you try to pip3 install pip --upgrade and try to uninstall it again? It might be a pip issue not a PyFunceble issue at all as it's working on my side ...

Otherwise, you can delete the paths printed by the following commands.

$ pip show pyfunceble | grep Location
$ which pyfunceble
$ which PyFunceble

Virtualenv/Conda

You can start from the beginning by setting up a virtualenv.

Advantages

You don't need to rely on the system version of pip or even python

(Mini)Conda

@mitchellkrogza already explained it there and I have nothing to add except Mitch @mitchellkrogza please make a PR from it !! 😸

Advantages of conda

Conda lets you install and use a Python version of your choice and work from there, while virtualenv will only use the one installed by the system.

Virtualenv

Here is my routine when I'm at work using Debian 9 (from memory, as I'm out of the office).

$ apt-get install python3-virtualenv
# Create the virtualenv in the venv directory
$ virtualenv -p python3 venv
# Activate the environment
$ . venv/bin/activate
$ pip3 --version
# Update pip
$ pip3 install pip --upgrade # Will be installed inside the venv directory.
# Install what we need
$ pip3 install pyfunceble # Will be installed inside the venv directory.
# Play with PyFunceble and others
$ pip3 --version
$ PyFunceble --version
$ PyFunceble -d microsoft_google.com
# When done and you want to go back to your system,
# deactivate the virtual env.
$ deactivate
# Now you are back in your system.
# Check whether PyFunceble is also installed system-wide.
$ pip show pyfunceble | grep Location

Logs

Does this program generate logs to verify possible cause?

Not currently, but I have a private branch with work on it. It was never my priority, but it will be for 2.5+.

The only logs generated are the ones we produce after each test, so you can keep track of what the output was for each domain, for example.

Warnings

`--clean`

If you ever kill PyFunceble with ctrl+c, be sure to run PyFunceble --clean first before you once again run your normal full command line.

@mitchellkrogza can do that because he mostly uses the MySQL/MariaDB database type.

MariaDB/MySQL over 2 servers

As you previously stated:

so as not to recharge the CPU I divided the list to run it on 2 servers

If you use the MariaDB/MySQL database type, be sure to have 2 different filenames. That way PyFunceble can handle data from both.

Side note for me (todolist)

Reduce confusion around -uf, -f, -m and others.
Add more warning about the multiprocessing usage and big inputs.
Add uninstallation method.
Add installation method with conda and virtualenv.
Create a docker image?

Thanks again for your feedback. I hope that I clarified things here. If not, please let me know.

Cheers,
Nissar

mitchellkrogza · 2019-07-15T19:21:55Z

@maravento I highly recommend the MariaDB solution. If you're not ok with it right now, you could just split your large file into parts of maybe 500000 lines each with split -l 500000 filename and test each one separately. That is not ideal, so SQL is the way; the MariaDB or MySQL setup is rather simple to get up and running.
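
For anyone who prefers to do the split from Python, here is a rough equivalent of split -l. It is only a sketch: the `part_000.txt` style names are a hypothetical choice (split itself produces names like xaa, xab), and the helper names are made up for this example.

```python
from itertools import islice


def chunks(lines, size):
    """Yield consecutive lists of at most `size` lines."""
    iterator = iter(lines)
    while True:
        part = list(islice(iterator, size))
        if not part:
            return
        yield part


def split_file(path, size=500_000, prefix="part_"):
    """Write each chunk of `path` to its own file and return the file names."""
    written = []
    with open(path, encoding="utf-8") as source:
        for index, part in enumerate(chunks(source, size)):
            name = f"{prefix}{index:03d}.txt"  # hypothetical naming scheme
            with open(name, "w", encoding="utf-8") as target:
                target.writelines(part)
            written.append(name)
    return written


print(list(chunks(["a", "b", "c", "d", "e"], 2)))
```

Each resulting part can then be given to PyFunceble with -f, one run per part.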

maravento · 2019-07-22T23:36:11Z

@mitchellkrogza Hi. A query: for example, my file has 5 M lines, host-active has 1.5 M and host-inactive has 1.3 M (host-invalid has few, so it doesn't matter for the example).
Does this mean that the program has processed 2.8 M lines, or is this figure not real because the output has duplicates? (Duplicates were removed from the input file before running the program.) THX

mitchellkrogza · 2019-07-23T14:42:08Z

@maravento it's hard to say why you got such results. I am currently testing your entire list in 5 parts of 1M each, all at the same time, using (Mini)Conda environments running in parallel. Each environment / instance of PyFunceble uses multiprocessing with 50 processes, and all of them use the mariadb database system.

I estimate it will be finished by tomorrow morning and then I can push my results to my fork of your repo.

The only way I can tell is to compare my results with yours.

mitchellkrogza · 2019-07-23T16:09:06Z

Data is definitely real and there will be no duplicates. Go and look yourself at the contents of output/domains/ACTIVE/list

mitchellkrogza · 2019-07-23T16:10:33Z

You can look at any of the files while they are being created or just tail them and you will see

maravento · 2019-08-02T22:47:10Z

@mitchellkrogza Hi. The same problem. At this time the program has processed the following data:
Original List: 5.9 M
ACTIVE/hosts = 3.6 M
INACTIVE/host = 3 M
INVALID/host = 42,000
Total = + 6.6 M
... And it's not over (still running)
I have detected duplicate lines in ACTIVE/hosts. The original file has no duplicate lines. I think the "auto continue system" is not working as it should (it may not work when the program is interrupted with ctrl + c and restarted).
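
A quick check of the rounded figures above, assuming each input line produces at most one output line: the outputs already exceed the input, so roughly 742,000 lines or more must be repeats.

```python
# Rounded figures as reported above.
original = 5_900_000
active, inactive, invalid = 3_600_000, 3_000_000, 42_000

written = active + inactive + invalid
excess = written - original

print(f"written so far: {written:,}")
print(f"lines beyond the input size: {excess:,}")
```

Since the run was still in progress, the real number of repeated lines could be higher.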
Then I did what @funilrys recommended above:

Warnings
--clean
If you ever kill PyFunceble with ctrl+c, be sure to run PyFunceble --clean first before you once again run your normal full command line.
@mitchellkrogza can do that because he mostly uses the MySQL/MariaDB database type.

And I lost all the work, and it started from the beginning again

mitchellkrogza · 2019-08-03T15:29:12Z

--clean will clean your output folders. Be careful using it; I should have been more clear on that. I can't explain the duplications. I've never seen any dupes before, but I will have to check some of my big lists to see if ACTIVE has any dupes.

For now you can just run a final sort on the active and inactive files when the test is finished to remove any dupes until @funilrys can look into what might cause that.

Just run sort -u list.txt -o list.txt on each of the output files and then do a recount to see your totals
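
If you would rather do the cleanup and the recount in one step from Python, here is a rough sketch. `unique_sorted` is a hypothetical helper. Unlike sort -u, it also strips surrounding whitespace and drops blank lines, which is an assumption about what is wanted here.

```python
def unique_sorted(lines):
    """Return the distinct non-empty lines in sorted order."""
    cleaned = {line.strip() for line in lines}
    cleaned.discard("")  # ignore blank lines
    return sorted(cleaned)


raw = ["b.example\n", "a.example\n", "b.example\n", "\n"]
unique = unique_sorted(raw)

print(unique)
print(f"{len(raw)} lines in, {len(unique)} unique entries out")
```

For a real output file, the lines can be read with `open(path)` and passed to the function as they are.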

maravento · 2019-08-03T15:58:34Z

That's why I reopened the ticket. I just lost 3 weeks of work by following the instructions of @funilrys.
In short, the program freezes and I have to stop it with ctrl + c. But @funilrys says: "if you ever kill PyFunceble with ctrl + c be sure to run PyFunceble --clean first before you once again run your normal full command line". Then this causes the job to be lost.
Conclusion: This program is very unstable and the instructions of the HowTo and @funilrys are imprecise. So, unfortunately, I have to temporarily remove it from my blackweb project until these bugs are fixed.

I have summarized the proposals for improvements and bug fixes in issue 41

funilrys · 2019-08-04T07:30:39Z

@maravento If you have a problem with the output and multiprocessing then use the API and manage your file and your multiprocessing yourself.

I do it for @Ultimate-Hosts-Blacklist. You can do it and it is as simple as the following. Again, it's documented.

from PyFunceble import test as PyFunceble

print(PyFunceble("google.com", complete=True))
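
Managing the parallelism yourself could look roughly like the sketch below. It is an illustration only: `run_checks` is a hypothetical helper, threads are used purely for simplicity, and `fake_check` is a stand-in so the example runs without PyFunceble installed. In real use you would pass a callable wrapping the call shown above. Whether that call is safe to run from several threads is not something this thread confirms, so test it on a small list first.

```python
from concurrent.futures import ThreadPoolExecutor


def run_checks(subjects, check, workers=10):
    """Apply `check` to every subject in parallel and map subject -> result."""
    subjects = list(subjects)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return dict(zip(subjects, pool.map(check, subjects)))


def fake_check(subject):
    """Stand-in for the real test call; returns a dummy marker."""
    return f"checked:{subject}"


results = run_checks(["google.com", "example.org"], fake_check, workers=2)
print(results)
```

Because the results come back as a plain dictionary, you decide when and how to write the output files.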

I currently have no time to go deep into reproducing what you do (@mitchellkrogza might help with that), but my plan includes full database (so MariaDB/MySQL) processing, so that files are generated only when the test is really done.
But please be patient. I have a life, family, work, study and other things that have to come before this whole issue in my workflow.

What database type do you use? If it's JSON, then no, and that's expected: it's one of the reasons I introduced the database types. It's not in the documentation yet, but I talked about it in the Reduce memory impact (and freezes ?) section ...

The auto continue is guaranteed - if you use the multiprocessing option - only if you use the MySQL/MariaDB database types. That's what @mitchellkrogza implicitly said and that's what I confirmed:

@mitchellkrogza can do that because he mostly uses the MySQL/MariaDB database type.

I agree a lot about the state of the documentation, and improving it is in my workflow. But for the rest, you're using PyFunceble in a way we never used it before. Indeed, I tested it with 1.2 million records but never with so many records. That's why we need to go further into the database types implementation, because JSON is not good for multiprocessing and memory.

Cheers,
Nissar

P.S.: Please keep this open, it does not make sense to close it if the documentation and things you mentioned here are not fixed/handled.

maravento · 2019-08-05T02:42:50Z

It is not necessary to keep it open. I think everything is clear. I summarized my experiences and proposals for improvement in issue 41.
You have family and other priorities, and so do I, so consider the proposal, and the corrections will be welcome whenever you can make them. In general, the program is good; you just have to fix some things.
regards

funilrys self-assigned this Jul 11, 2019

funilrys added bug enhancement good first issue labels Jul 11, 2019

funilrys changed the title ~~scattered manual~~ Inconsistency in manual/docs and help message Jul 11, 2019

funilrys added this to the 2.3.0 milestone Jul 11, 2019

funilrys added this to To do in 2.x.x Jul 11, 2019

funilrys removed this from To do in 2.x.x Jul 11, 2019

maravento closed this as completed Jul 11, 2019

maravento reopened this Jul 12, 2019

maravento closed this as completed Jul 15, 2019

maravento mentioned this issue Jul 15, 2019

Proposals to improve PyFunceble #41

Closed

maravento reopened this Aug 3, 2019

maravento closed this as completed Aug 3, 2019

funilrys reopened this Aug 4, 2019

funilrys assigned mitchellkrogza Aug 4, 2019

maravento closed this as completed Aug 5, 2019

maravento mentioned this issue Aug 7, 2019

[GUIDE] Running PyFunceble in Conda Virtual Environments #39

Closed
