Inconsistency in manual/docs and help message #38

Closed
maravento opened this issue Jul 11, 2019 · 58 comments

@maravento

maravento commented Jul 11, 2019

I have been told that this is a "magic" tool, and I congratulate you for that. However, I have read the instructions several times:
https://pyfunceble.readthedocs.io/en/latest/what-can-we-do.html
and I still have no idea how to verify a list of URLs, nor what format this list should have.
Could you review the manual and make it friendlier, with examples? Thank you.

@mitchellkrogza
Contributor

Lists should be plain-text domains, one per line.
Just pass -f mydomainslist.txt to PyFunceble.

Simple example: PyFunceble -ex --dns 1.1.1.1 1.0.0.1 --plain -f ${input}

@mitchellkrogza
Contributor

Here's just one of my many repos using it. This script uses all the TravisCI functionality, which you won't need in your local environment: https://github.com/mitchellkrogza/Badd-Boyz-Hosts/blob/master/dev-tools/DataTesting.sh

@maravento
Author

maravento commented Jul 11, 2019

What format should the list have, and which parameter writes the results to an output file?
[Screenshot: 2019-07-11 11-37-36]

@mitchellkrogza
Contributor

domain1.com
domains2.com
www.whatever.com

A simple list with just one entry per line.

The output folder is created in whatever folder you run PyFunceble from.

See the output folder that PyFunceble creates here:

https://github.com/mitchellkrogza/Phishing.Database/tree/master/phishing-domains

You can specify the location of the output by using export

@maravento
Author

maravento commented Jul 11, 2019

Look at the image. That's how they are. However, it reports google.com as INVALID (PyFunceble -uf test)
[Screenshot: 2019-07-11 11-48-51]

@mitchellkrogza
Contributor

It must be the -f parameter, not -uf.
Try this simple test: PyFunceble -d google.com

Then try: PyFunceble -f domains.txt

@mitchellkrogza
Contributor

Not sure if you are seeing the correct version of the docs, but it does state the format
Screenshot_20190711_185339

@mitchellkrogza
Contributor

The ones I use in all my projects

ACTIVE = my list of active domains
INACTIVE = my list of dead domains
INVALID = the domain syntax is somehow invalid

@mitchellkrogza
Contributor

So INACTIVE is what you want for your dead domains lists

@mitchellkrogza
Contributor

Also, if you want to force re-testing of everything every time you run PyFunceble, add the -db flag, which stops it from relying on its own database. But once you learn how smart the database is, you should just leave it to do its own thing.

@mitchellkrogza
Contributor

See my usage of ACTIVE INACTIVE and INVALID here https://github.com/mitchellkrogza/Phishing.Database

@mitchellkrogza
Contributor

I use the INVALID lists every now and again to clean up formatting errors in my input lists. On Phishing Database you will see that INVALID entries hardly feature anymore for me, but they were crucial in the beginning to get the cleaning functions for my input sources right.

@maravento
Author

Not sure if you are seeing the docs correct version but it does state the format
Screenshot_20190711_185339

The confusion came from the description in the help message:

-f FILE, --file FILE Read the given file and test all domains inside it. If
a URL is given we download and test the content of the
given URL.

-uf URL_FILE, --url-file URL_FILE
Read and test the list of URL of the given file. If a
URL is given we download and test the list (of URL) of
the given URL content.

But it is clear now: it is "-f" only.

@funilrys
Owner

Hi @maravento, and thanks for your feedback.

Sorry that it wasn't clear. I'll do my best to improve the documentation for the future.

To recapitulate.


Test of file with URLs

If you want to test a list of URLs, i.e. a file in this format:

https://example.org/
http://example.org
https://example.org/hello_world

You can pass the file path with -uf.


Test of a file in plain text or host file format

If you want to test a list of domains or IPs in plain text or hosts format, i.e. a file in this format:

127.0.0.1 example.org
beispiel.org
0.0.0.0 example.com

You can pass the file path with -f.
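
To make the two input formats concrete, here is a minimal Python sketch that classifies each line of a list and suggests which flag fits. It is only an illustration of the formats described above, not PyFunceble's own parser; `classify_line` and `suggest_flag` are hypothetical names, and treating `#` as a comment marker is an assumption.

```python
from urllib.parse import urlparse


def classify_line(line):
    """Return (kind, subject) for one input line, or None for blanks/comments.

    kind is "url" for lines that suit -uf and "domain" for lines that suit -f.
    Illustrative only: this is not how PyFunceble itself parses its input.
    """
    text = line.split("#", 1)[0].strip()  # assumption: '#' starts a comment
    if not text:
        return None
    if urlparse(text).scheme in ("http", "https"):
        return ("url", text)
    # Hosts format is "<ip> <hostname>", plain format is "<hostname>":
    # in both cases the subject to test is the last field.
    return ("domain", text.split()[-1])


def suggest_flag(lines):
    """Suggest "-uf" or "-f" for a list, or None if it is empty or mixed."""
    kinds = {item[0] for item in map(classify_line, lines) if item}
    if kinds == {"url"}:
        return "-uf"
    if kinds == {"domain"}:
        return "-f"
    return None


print(suggest_flag(["127.0.0.1 example.org", "beispiel.org"]))
```

A list that mixes full URLs and bare domains returns None here, which is a hint to split it into two files before testing.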


Confusions (to fix in docs)

Sorry for the confusion I created.

Indeed, both -uf and -f can take a raw URL.

For example, let's say I want to test this file: I can give its URL to -f and PyFunceble will download and test its content. That's what I tried to explain in the docs.


What is the difference between ACTIVE vs VALID and INACTIVE vs INVALID?

Documentation: https://pyfunceble.readthedocs.io/en/latest/columns/status.html#status

Because there are many possibilities, I described the structure of this project in one file called dir_structure_production.json, which is later downloaded and stored on your local filesystem as dir_structure.json.

I generate the output directory before even starting the test.
So, to explain:

  • ACTIVE refers to the output of an availability test.
  • VALID refers to the output of a syntax test.
  • INVALID refers to the output of an availability test or a syntax test.
  • INACTIVE refers to the output of an availability test.
Difference between availability and syntax test

Availability test

The availability test consists of finding the availability of a domain, IP or URL.

Domain and IP

The availability of a domain or IP is determined from WHOIS records, NSLOOKUP and the HTTP status code.

URL

The availability of a URL is determined from the HTTP status code (cf. documentation).

Syntax test

The syntax test is just a syntax test.

As you understand Python, you can review our syntax test/check logic here.


Auto continue

Your question:

Suppose I start processing a list of 5 million URLs. And there is a power cut or internet drops. Which parameter allows start / continue from the last point (where did the cut occur)?

Documentation: https://pyfunceble.readthedocs.io/en/latest/components/auto-continue.html

As the auto continue system is activated by default (unless you disable it in your personal .PyFunceble.yaml), you have nothing to do. The system will continue by itself.

How does it work?

Documentation: https://pyfunceble.readthedocs.io/en/latest/components/auto-continue.html#how-does-it-work

In other words, everything happens in output/continue.json or, if you use the MariaDB/MySQL database type, in the continue table.

The idea is to log everything that has been tested and, on the next run (after the power cut in your example), remove the already tested elements from the original list to test.

In Python terms, we do the equivalent of the following on a bigger scale.

to_test = [1,2,3,4,5]
already_tested = [2,3,4]

to_test = list(set(to_test) - set(already_tested))
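
To show the same idea end to end, here is a small sketch that persists the already tested subjects to a JSON file and skips them on the next run. It is not PyFunceble's implementation: `run_with_resume`, the `check` callback and the flat JSON list used as the state file are all assumptions made for illustration. Unlike the set difference above, it keeps the input order.

```python
import json
from pathlib import Path


def run_with_resume(subjects, state_file, check):
    """Test each subject once, skipping those already recorded in state_file."""
    path = Path(state_file)
    done = set(json.loads(path.read_text())) if path.exists() else set()
    results = {}
    for subject in subjects:  # iterate the list itself to keep its order
        if subject in done:
            continue
        results[subject] = check(subject)
        done.add(subject)
        # Persist after every subject so an interruption loses at most one.
        path.write_text(json.dumps(sorted(done)))
    return results
```

Rewriting the whole state file after every subject is fine for a short list but grows expensive as the list grows, which may be part of why a database is suggested for very large inputs later in this thread.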

Thanks again for your feedback. I hope that I clarified things here. If not, please let me know.

Cheers,
Nissar

@funilrys funilrys changed the title scattered manual Inconsistency in manual/docs and help message Jul 11, 2019
@funilrys funilrys added this to the 2.3.0 milestone Jul 11, 2019
@funilrys funilrys added this to To do in 2.x.x Jul 11, 2019
@funilrys funilrys removed this from To do in 2.x.x Jul 11, 2019
@maravento
Author

Perfect. Well explained. Thanks a lot.

@mitchellkrogza
Contributor

@maravento awesome. Now let me make that even better for you, as I currently process 60000 domains in 4 hours.

Welcome to the absolutely brilliant multiprocessing of PyFunceble.

Add the flags -m -p 100 to your PyFunceble command line and see the magic. The maximum number of processes we have found workable is about 200-250, so experiment with what works for you.

@maravento
Author

maravento commented Jul 12, 2019

I'm running the command like this:
PyFunceble -qf list
To increase the processing rate, is your proposal to run it like this?
PyFunceble -q -f -m -p 100 list

-p PROCESSES, --processes PROCESSES
Set the number of simultaneous processes to use while using multiple processes. configured value: 25
-m, --multiprocess
Switch the value of the usage of multiple process. Configured value: False

Why is there no value for the "-m" flag?

What is the maximum number of processes allowed, and what is the resource consumption per process?

P.S.: I am using an HP ProLiant M110 G9 test server running 24/7, with 8 GB of free RAM and 10 Mb of bandwidth.

@maravento maravento reopened this Jul 12, 2019
@mitchellkrogza
Contributor

No: PyFunceble -q -m -p 100 -f list. This will run 100 processes at the same time.

@maravento
Author

Then, according to my resources described above, how should I run the command for maximum performance and speed?

@mitchellkrogza
Contributor

Try 100 processes. If that's too much, drop it to 50; if it's too little, raise it up to a maximum of 250. -m is just the switch that turns multiprocessing on; you then specify how many processes with -p xx. With that CPU you should comfortably get away with running 150 processes. Just try the exact command line I gave and let us know.

@mitchellkrogza
Contributor

@maravento try pip3 uninstall --user PyFunceble let's see if that helps

@maravento
Author

maravento commented Jul 15, 2019

@maravento try pip3 uninstall --user PyFunceble let's see if that helps

Usage:
pip uninstall [options] < package > ...
pip uninstall [options] -r < requirements file > ...
no such option: --user

@mitchellkrogza
Contributor

Did you install it with pip or pip3 🤔

@mitchellkrogza
Contributor

My bad, sorry: uninstall indeed has no --user option. I'm helping you from my phone as best I can. It should be just pip uninstall package or pip3 uninstall package 🤔 @funilrys will have to assist further. For now, why not just leave it as is, fire up Conda and run it there? It won't matter that you have it installed on your system, as you will be running a new instance from inside the Conda environment.

@mitchellkrogza
Contributor

Just going back a few posts from earlier, are you doing all this in a VM on Virtual box or did you want a guide to creating a fool proof VM environment for running PyFunceble ?

@maravento
Author

Just going back a few posts from earlier, are you doing all this in a VM on Virtual box or did you want a guide to creating a fool proof VM environment for running PyFunceble ?

On a dedicated physical server (description is HERE)

@mitchellkrogza
Contributor

OK, got that. I was just referencing your request to do it in a VM. I could build one tomorrow that will work and may benefit others too. Still, I cannot explain why you are experiencing freezes on your hardware. We run PyFunceble in Docker containers with multiprocessing and don't get freezes or anything. @funilrys will have to assist you in tracing that.

@mitchellkrogza
Contributor

Please bear in mind I'm a user just like you. I'm not the author, but I have been using this extensively since Nissar started building it from some of my crazy ideas.

@mitchellkrogza
Contributor

Please bear in mind I'm a user just like you. I'm not the author, but I have been using this extensively since Nissar started building it from some of my crazy ideas.

I am clear that your role is a contribution. It's in README. And thank you very much for your help. But I think it's time for the creator to intervene in this thread, because the HowTo document is quite confusing.

I think you need to add the -nl parameter to your existing command line

@mitchellkrogza
Contributor

No logs (nl) defaults to false but adding -nl toggles it to true

@mitchellkrogza
Contributor

Please bear in mind I'm a user just like you. I'm not the author, but I have been using this extensively since Nissar started building it from some of my crazy ideas.

I am clear that your role is a contribution. It's in README. And thank you very much for your help. But I think it's time for the creator to intervene in this thread, because the HowTo document is quite confusing.

My pleasure, and don't stress: we will get you up and running for sure. @funilrys works Mon-Fri and his time is limited, so I help where I can. He will respond once he's online, which he has not been all day, so I know he's hammering away at some code somewhere.

@funilrys
Owner

Hello there,

Sorry for being so silent. I'm juggling work, the next version of this tool, a huge private project and family :)

So let's go!

Multiprocessing

Why there is no value for the flag "-m"?

The -m flag is the one that activates the Multiprocessing subsystem. It's just a switch.

What is the maximum level of processing allowed and what is the consumption of resources per process?

I can't really answer that, as there are too many variables. But generally, on modern x64 machines, 100-150 processes is sufficient if you have other workloads running.

These are some of the obvious variables that come directly to mind:

  • Internet speed/bandwidth
  • Memory usage/impact
  • Drive health, as we do a lot of I/O
  • DNS Server
  • Whois Server
  • ...

It really depends on the machine most of the time.

Reduce memory impact (and freezes ?)

For your large amount of data (I didn't think you would test 5 million entries), I recommend setting up a MySQL/MariaDB database to handle the large amount of data that has to be reread/reconstructed on each loop.

It's actually much better, as we don't have to keep the following datasets/subsystems in memory:

  • Auto continue
  • InactiveDB
  • Mining
  • WhoisDB

The (short) documentation about the database can be found here: https://pyfunceble.readthedocs.io/en/latest/components/databases.html

I should cover that in more depth in the documentation. Thanks for mentioning it.

Please read more about it in the documentation:

Freeze

PyFunceble freezes a lot and I have to stop (ctrl + c) and restart. It does not matter if I use -p xxx or not (and it's not the hardware)

I'm not aware of any freeze. But I hope that using the MariaDB/MySQL database type can solve that.

Uninstallation

The HowTo does not show how to uninstall PyFunceble:

Well, it depends on how you installed it, but I never thought it was necessary. It will be added to the documentation.

Arch Linux

Arch users can simply run

$ yourFavoriteAurHelper -Rns pyfunceble

PyPi

A package installed from PyPI can be uninstalled as follows

$ pip3 uninstall pyfunceble

I don't understand why you get the following.

configparser.DuplicateOptionError: While reading from '<???>' [line 3]: option 'pyfunceble' in section 'console_scripts' already exists

It might be because of your version of pip/pip3. Here is mine under my virtualenv, but it's actually the same outside the env under Arch:

$ pip --version
pip 19.1.1 from /home/funilrys/repositories/GitHub/source/PyFunceble/venv/lib/python3.7/site-packages/pip (python 3.7)

Can you try pip3 install pip --upgrade and then try to uninstall it again? It might be a pip issue and not a PyFunceble issue at all, as it's working on my side ...

Otherwise, you can delete the paths printed by the following commands.

$ pip show pyfunceble | grep Location
$ which pyfunceble
$ which PyFunceble
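
If you prefer to gather the same information from Python before deleting anything, a rough equivalent of the three commands above could look like this. It relies only on the standard library (Python 3.8+); `locate_install` is a hypothetical helper name, and it reports paths without removing them.

```python
import shutil
from importlib import metadata


def locate_install(package="PyFunceble", commands=("pyfunceble", "PyFunceble")):
    """Return (package_dir, script_paths) for an installed package.

    package_dir is None when the package is not installed for this interpreter.
    Nothing is deleted: the caller decides what to do with the paths.
    """
    try:
        package_dir = str(metadata.distribution(package).locate_file(""))
    except metadata.PackageNotFoundError:
        package_dir = None
    # shutil.which mirrors the shell's `which` lookup along PATH.
    scripts = [path for path in map(shutil.which, commands) if path]
    return package_dir, scripts


print(locate_install())
```

Run it with the same interpreter (system Python, virtualenv or Conda) that pip installed the package into, otherwise the package will be reported as missing.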

Virtualenv/Conda

You can start from the beginning by setting up a virtualenv.

Advantages

You don't need to rely on the system version of pip or even python

(Mini)Conda

@mitchellkrogza already explained it there and I have nothing to add except Mitch @mitchellkrogza please make a PR from it !! 😸

Advantages of conda

Conda lets you install and use a Python version of your choice and work from there, while virtualenv will only use the one installed on the system.

Virtualenv

Here is my routine when I'm at work using Debian 9 (from memory, as I'm out of the office).

$ apt-get install python3-virtualenv
# Create the virtualenv inside the venv directory
$ virtualenv -p python3 venv
# Activate the environment
$ . venv/bin/activate
$ pip3 --version
# Update pip
$ pip3 install pip --upgrade # Will be installed inside the venv directory.
# Install what we need
$ pip3 install pyfunceble # Will be installed inside the venv directory.
# Play with PyFunceble and others
$ pip3 --version
$ PyFunceble --version
$ PyFunceble -d microsoft_google.com
# When done and you want to go back to your system,
# deactivate the virtual env.
$ deactivate
# Now you are back in your system.
# Check whether PyFunceble is installed systemwide.
$ pip show pyfunceble | grep Location

Logs

Does this program generate logs to verify possible cause?

Not at the moment, but I have a private branch with work on it. It was never my priority, but it will be for 2.5+.

The only logs generated are the ones we produce after each test, so you can keep track of, for example, what the output was for each domain.

Warnings

--clean

If you ever kill PyFunceble with ctrl+c, be sure to run PyFunceble --clean first before you run your normal full command line again.

@mitchellkrogza can do that because he mostly uses the MySQL/MariaDB database type.

MariaDB/MySQL over 2 servers

As you previously stated:

so as not to overload the CPU I divided the list to run it on 2 servers

If you use the MariaDB/MySQL database type, be sure to use 2 different filenames. That way PyFunceble can handle data from both.

Side note for me (todolist)

  • Reduce confusion around -uf, -f, -m and others.
  • Add more warning about the multiprocessing usage and big inputs.
  • Add uninstallation method.
  • Add installation method with conda and virtualenv.
  • Create a docker image?

Thanks again for your feedback. I hope that I clarified things here. If not, please let me know.

Cheers,
Nissar

@mitchellkrogza
Contributor

@maravento I highly recommend the MariaDB solution. If you're not OK with it right now, you could just split your large file into parts of maybe 500000 lines each with split -l 500000 filename and test each one separately. That's not ideal, so SQL is the way to go; the MariaDB or MySQL setup is rather simple to get up and running.
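
For anyone without the split utility at hand, the same chunking can be sketched in a few lines of Python. `split_file` and the `.partNNN` suffix are hypothetical choices for this illustration, not something PyFunceble provides or expects.

```python
from itertools import islice
from pathlib import Path


def split_file(source, lines_per_part=500_000):
    """Write source.part000, source.part001, ... and return their paths."""
    parts = []
    with open(source, encoding="utf-8") as handle:
        while True:
            # Read at most lines_per_part lines without loading the whole file.
            chunk = list(islice(handle, lines_per_part))
            if not chunk:
                break
            part = Path(f"{source}.part{len(parts):03d}")
            part.write_text("".join(chunk), encoding="utf-8")
            parts.append(part)
    return parts
```

Each returned path can then be passed to its own PyFunceble -f run.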

@maravento
Author

maravento commented Jul 22, 2019

@mitchellkrogza Hi. A question: for example, my file has 5 M lines, host-active has 1.5 M and host-inactive has 1.3 M (host-invalid has few, so it doesn't matter for the example).
Does this mean that the program has processed 2.8 M lines, or is this figure unreliable because the output has duplicates? (The input file was deduplicated before running the program.) Thanks.

@mitchellkrogza
Contributor

@maravento it's hard to say why you got such results. I am currently testing your entire list in 5 parts of 1M each, all at the same time, using (Mini)Conda environments running in parallel. Each environment / instance of PyFunceble uses multiprocessing with 50 processes, and all of them use the MariaDB database system.

I estimate it will be finished by tomorrow morning, and then I can push my results to my fork of your repo.

The only way I can tell is to see what my results show versus yours.

@mitchellkrogza
Contributor

The data is definitely real and there will be no duplicates. Go and look for yourself at the contents of output/domains/ACTIVE/list

@mitchellkrogza
Contributor

You can look at any of the files while they are being created or just tail them and you will see

@maravento
Author

maravento commented Aug 2, 2019

@mitchellkrogza Hi. The same problem. So far the program has processed the following data:
Original list: 5.9 M
ACTIVE/hosts = 3.6 M
INACTIVE/host = 3 M
INVALID/host = 42,000
Total = 6.6 M+
... and it's not over (still running).
I have detected duplicate lines in ACTIVE/hosts. The original file has no duplicate lines. I think the "auto continue system" is not working as it should (it may not work when the program is interrupted with ctrl + c and restarted).
Then I did what @funilrys recommended above:

Warnings
--clean
if you ever kill PyFunceble with ctrl+c be sure to run PyFunceble --clean first before you once again run your normal full command line.
@mitchellkrogza can do that because he mostly uses the MySQL/MariaDB database type.

And I lost all the work, and it started from the beginning again

@maravento maravento reopened this Aug 3, 2019
@mitchellkrogza
Contributor

--clean will clean your output folders. Be careful using it; I should have been clearer about that. I can't explain the duplicates. I've never seen any dupes before, but I will check some of my big lists to see if ACTIVE has any.

For now, you can just run a final sort on the active and inactive files when the test is finished to remove any dupes, until @funilrys can look into what might cause them.

Just run sort -u list.txt -o list.txt on each of the output files and then do a recount to see your totals.
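
If you also want to see which entries were duplicated, and how often, before throwing them away, a small Python sketch along these lines can help. `dedupe_report` is a hypothetical helper; it is only roughly equivalent to sort -u, since it strips surrounding whitespace, drops blank lines and sorts by code point rather than by locale.

```python
from collections import Counter


def dedupe_report(lines):
    """Return (unique_sorted_entries, duplicates) for an iterable of lines.

    duplicates maps each repeated entry to the number of times it occurred.
    """
    counts = Counter(line.strip() for line in lines if line.strip())
    duplicates = {entry: n for entry, n in counts.items() if n > 1}
    return sorted(counts), duplicates


unique, duplicates = dedupe_report(["b.com\n", "a.com\n", "b.com\n"])
print(len(unique), duplicates)
```

Passing an open file object works too, for example dedupe_report(open("list.txt")), since files iterate line by line.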

@maravento
Author

maravento commented Aug 3, 2019

That's why I reopened the ticket. I just lost 3 weeks of work by following the instructions of @funilrys.
In short, the program freezes and I have to stop it with ctrl + c. But @funilrys says: "if you ever kill PyFunceble with ctrl + c be sure to run PyFunceble --clean first before you once again run your normal full command line". This then causes the work to be lost.
Conclusion: this program is very unstable, and the instructions in the HowTo and from @funilrys are imprecise. So, unfortunately, I have to temporarily remove it from my blackweb project until these bugs are fixed.

I have summarized the proposals for improvements and bug fixes in issue 41.

@funilrys funilrys reopened this Aug 4, 2019
@funilrys
Owner

funilrys commented Aug 4, 2019

@maravento If you have a problem with the output and multiprocessing, then use the API and manage your files and your multiprocessing yourself.

I do it for @Ultimate-Hosts-Blacklist. You can do it too, and it is as simple as the following. Again, it's documented.

from PyFunceble import test as PyFunceble

print(PyFunceble("google.com", complete=True))
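
Managing the parallelism yourself could look roughly like the following sketch. It swaps in a thread pool from the standard library rather than PyFunceble's own multiprocessing, and `check` is a hypothetical stand-in with an arbitrary rule so that the example runs without network access. In real use you would replace its body with a call to the test function imported above; the shape of that function's return value is not shown in this thread, so the sketch treats results as opaque.

```python
from concurrent.futures import ThreadPoolExecutor


def check(subject):
    """Hypothetical stand-in for a real availability test (arbitrary rule)."""
    return "INVALID" if "_" in subject or "." not in subject else "ACTIVE"


def check_all(subjects, workers=8):
    """Run check() on every subject concurrently; results keep input order."""
    subjects = list(subjects)  # allow generators, since we iterate twice
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the order of its input.
        return dict(zip(subjects, pool.map(check, subjects)))


print(check_all(["google.com", "microsoft_google.com"]))
```

Because each subject appears once as a dictionary key, writing the results out from the parent after the pool finishes also avoids duplicate lines in the output.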

I actually have no time to go deep into reproducing what you do (@mitchellkrogza might help with that), but my plan includes full database (so MariaDB/MySQL) processing, so that files are generated only when the run is really done.
But please be patient. I have a life, family, work, studies and other things that have to come before this whole issue in my workflow.

What database type do you use? If it's JSON, then no, and that's normal; it's one of the reasons I introduced the database types. It's not in the documentation yet, but I talked about it in the Reduce memory impact (and freezes ?) section ...

Auto continue is guaranteed - if you use the multiprocessing option - only if you use the MySQL/MariaDB database types. That's what @mitchellkrogza implicitly said and that's what I confirmed:

@mitchellkrogza can do that because he mostly uses the MySQL/MariaDB database type.


I agree with you about the state of the documentation, and fixing it is in my workflow. But for the rest, you're using PyFunceble in a way we never used it before. Indeed, I tested it with 1.2 million records, but never with so many. That's why we need to go further with the database types implementation, because JSON is not good for multiprocessing and memory.

Cheers,
Nissar

P.S.: Please keep this open, it does not make sense to close it if the documentation and things you mentioned here are not fixed/handled.

@maravento
Author

It is not necessary to keep it open. I think everything is clear, and I summarized my experiences and proposals for improvement in issue 41.
You have family and other priorities, and so do I, so consider the proposal; whenever you can make the corrections, they will be welcome. In general, the program is good; you just have to fix some things.
Regards
