Filtration support #62

Merged
merged 21 commits into from Jun 6, 2021

@ghost ghost commented May 31, 2021

Add filtration support(#29)

  • You can check available tags by using pywhat --tags or
from pywhat import *
print(pywhat_tags)
  • Filtration CLI:
    pywhat --rarity min:max --include_tags tag1,tag2 --exclude_tags tag1,tag2 TEXT
    'min' and 'max' can be omitted.
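
As a rough illustration of the `min:max` syntax, here is a minimal sketch of how such a value could be split when either side is omitted. The helper name `parse_rarity` is hypothetical and is not taken from this PR; the actual CLI parsing may differ.

```python
from typing import Optional, Tuple


def parse_rarity(value: str) -> Tuple[Optional[float], Optional[float]]:
    """Split a 'min:max' string into two optional floats.

    Either side may be empty, so ':0.8', '0.2:' and ':' are all accepted.
    Hypothetical helper, not the function used by the pywhat CLI.
    """
    if value.count(":") != 1:
        raise ValueError(f"expected 'min:max', got {value!r}")
    low, high = value.split(":")
    minimum = float(low) if low else None
    maximum = float(high) if high else None
    if minimum is not None and maximum is not None and minimum > maximum:
        raise ValueError("min must not exceed max")
    return minimum, maximum


print(parse_rarity("0.2:0.8"))  # both bounds given
print(parse_rarity(":0.8"))     # only max given
print(parse_rarity("0.2:"))     # only min given
```

A missing bound comes back as `None`, which leaves it to the caller to decide what "no limit" means.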

ToDo:

  • Finish work on distributions
  • Update CLI
  • Move API to __init__.py for easy access
  • TESTS
  • Clean up the mess

@bee-san
Owner

bee-san commented May 31, 2021

Ahh!!! This is gonna be annoying, but I am actually working on a filter system myself :-) See my branch here: https://github.com/bee-san/pyWhat/tree/bee-filter

However!! I really like some of the things here. The arguments, listing tags, and docs are great!!

Here's how I wanted to do it:
Problem: We have a program which requires 2 filters, this is a real-world example. Ciphey would like to have a filter for checking whether the item is possible plaintext (bitcoin address, email, website) but also a filter for checking for encoded text, like JWT tokens or Morse code etc etc.

So, we have 1 application which requires 2 filters:

  1. Plaintext identification
  2. Stuff that needs to be decoded, but is the answer

By implementing filtering at the identifier level, every time Ciphey runs it would have to re-create the filters through the API. If it runs 10 times a second, it rebuilds the filters 10 times a second.

You can get around this by filtering at the object attributes level, so you make the object once and the filters are there -- but this prevents you from changing the filters or adding new ones.

You can of course cheat and manually make the object, but that's cheating and I'd prefer a high level API for users.

I propose adding distributions. A distribution is a regex list which has been filtered (see my branch https://github.com/bee-san/pyWhat/tree/bee-filter/pywhat/filtration_distribution ).

Then, the regex_identifier (or identifier) takes this distribution object and uses the distribution.get_regexes() to automatically get the already tagged data.

This means we can:

  1. Have as many filtration systems as we want without re-making them every time.
  2. Be nimble enough that we can have more than 1 filtration system.

I also envisioned adding magic methods to distributions like __add__, so you can add 2 distributions together or take away one distribution from another.
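
To make the idea concrete, here is a minimal sketch of a distribution that filters a regex list once and supports `+` and `-`. The `REGEXES` data, the `_regexes` constructor argument and the exact operator semantics (union and difference by regex name) are assumptions made for illustration; they are not the code in either branch.

```python
# Hypothetical stand-in for the project's regex database.
REGEXES = [
    {"Name": "Email", "Tags": ["Identifiers", "Credentials"], "Rarity": 0.5},
    {"Name": "IPv4", "Tags": ["Networking"], "Rarity": 0.7},
    {"Name": "Bitcoin Address", "Tags": ["Finance"], "Rarity": 0.9},
]


class Distribution:
    """A regex list that is filtered once, at construction time."""

    def __init__(self, filters=None, _regexes=None):
        if _regexes is not None:
            # Internal path used by the operators below.
            self._regexes = list(_regexes)
            return
        filters = filters or {}
        tags = set(filters.get("Tags", []))
        min_rarity = filters.get("Min_Rarity", 0.0)
        self._regexes = [
            r for r in REGEXES
            if (not tags or tags & set(r["Tags"])) and r["Rarity"] >= min_rarity
        ]

    def get_regexes(self):
        return list(self._regexes)

    def __add__(self, other):
        # Union, keeping first-seen order and dropping duplicates by name.
        seen, merged = set(), []
        for r in self._regexes + other._regexes:
            if r["Name"] not in seen:
                seen.add(r["Name"])
                merged.append(r)
        return Distribution(_regexes=merged)

    def __sub__(self, other):
        # Difference: drop every regex that also appears in `other`.
        names = {r["Name"] for r in other._regexes}
        kept = [r for r in self._regexes if r["Name"] not in names]
        return Distribution(_regexes=kept)


networking = Distribution({"Tags": ["Networking"], "Min_Rarity": 0.6})
finance = Distribution({"Tags": ["Finance"]})
combined = networking + finance
print([r["Name"] for r in combined.get_regexes()])
print([r["Name"] for r in (combined - finance).get_regexes()])
```

Because the filtering happens in `__init__`, each distribution can be built once and reused across many identification calls.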

So going back to the Ciphey example, I imagine something like this:

  1. Plaintext identification
filters = {"Tags": ["Networking", "Credentials"], "Min_Rarity": 0.6}
distribution = Distribution(filters)

And then adding decoding stuff:

filters1 = {"Tags": ["Needs Decoding", "IPv6"]} # This tag doesn't exist yet, we also want short names for each regex as a tag in the future
distribution2 = Distribution(filters1)

In total:

filters = {"Tags": ["Networking", "Credentials"], "Min_Rarity": 0.6}
distribution = Distribution(filters)

filters1 = {"Tags": ["Needs Decoding", "IPv6"]} # This tag doesn't exist yet, we also want short names for each regex as a tag in the future
distribution2 = Distribution(filters1)

identifier("text here", distribution)

So we pass the identifier the distribution object.

If no distribution object is passed, the identifier should fall back to an unfiltered distribution (everything, no filters) that is created once at the start of the program.

Hope that makes sense!! Sorry I haven't had much time to do this, I've been busy 😅

If you fancy picking up my branch you can always merge what I have with yours? 🥺🙏

Thank you so much for contributing!!!

@ghost
Author

ghost commented May 31, 2021

Sounds interesting!

@bee-san
Owner

bee-san commented May 31, 2021

> Sounds interesting!

Are you interested in picking this up or should I finish it? 😄

@ghost
Author

ghost commented May 31, 2021

> > Sounds interesting!
>
> Are you interested in picking this up or should I finish it? 😄

I will finish it. However, need some time😀

@ghost
Author

ghost commented May 31, 2021

I am thinking about making API look like this:

...
distribution1 = Distribution(filter1)
distribution2 = Distribution(filter2)
id = identifier.Identifier(distribution1)
id.identify(text)
id.distribution = distribution2
id.identify(text)

What are your thoughts, @bee-san?

@bee-san
Owner

bee-san commented May 31, 2021

> I am thinking about making API look like this:
>
> ...
> distribution1 = Distribution(filter1)
> distribution2 = Distribution(filter2)
> id = identifier.Identifier(distribution1)
> id.identify(text)
> id.distribution = distribution2
> id.identify(text)
>
> What are your thoughts, @bee-san?

I can see where you're going, but also I think repeatedly switching the variables over is a bad idea? In an ideal world the API would look like:

distribution1 = Distribution(filter1)
distribution2 = Distribution(filter2)
id = identifier.Identifier() # No distribution

id.identify(text, distribution1)
id.identify(text, distribution2)
id.identify(text) # Uses no distribution, which means it uses "everything". 

You can achieve that last one by creating one distribution for the whole program that contains everything, and using it as the default in the function, like:

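class Identifier:
    def __init__(self):
        self.default_distribution = Distribution()  # unfiltered, built once

    def identify(self, text, distribution=None):
        # A default argument cannot reference self, so fall back here
        if distribution is None:
            distribution = self.default_distribution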

That way it either uses:

  1. The distribution passed to it
  2. The default one

I just dislike the idea of repeatedly changing a variable in an object, I'd much rather it be more functional like 😄

Thanks so much for the ❓ !

@bee-san
Owner

bee-san commented May 31, 2021

@piatrashkakanstantinass

> Move API to __init__.py for easy access

Can you explain how this works? I'm not familiar with this pattern 😄

@ghost
Author

ghost commented May 31, 2021

> @piatrashkakanstantinass
>
> > Move API to __init__.py for easy access
>
> Can you explain how this works? I'm not familiar with this pattern 😄

After applying the changes, API will be like this:

import pywhat
id = pywhat.Identifier()
dist = pywhat.Distribution(some_filter)
id.identify(some_text, dist)

Importing pywhat and getting all the functionality is more straightforward than importing pywhat.identifier
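
For readers unfamiliar with the pattern, the sketch below shows it in isolation: a package's `__init__.py` imports names from its submodules, so they become attributes of the package itself. The package name `demo_pkg` and its contents are made up for this demonstration and are built in a temporary directory; this is not pywhat's actual `__init__.py`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package "demo_pkg" with one submodule.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
(pkg / "identifier.py").write_text(
    "class Identifier:\n"
    "    def identify(self, text):\n"
    "        return {'Text': text}\n"
)

# __init__.py re-exports the class, so callers only need the top-level import.
(pkg / "__init__.py").write_text(
    "from demo_pkg.identifier import Identifier\n"
    "__all__ = ['Identifier']\n"
)

# Make the temporary directory importable, then import the package.
sys.path.insert(0, str(root))
demo_pkg = importlib.import_module("demo_pkg")

# The class is reachable from the package, not only from demo_pkg.identifier.
ident = demo_pkg.Identifier()
result = ident.identify("hello")
print(result)
```

Listing the re-exported names in `__all__` also controls what `from demo_pkg import *` brings into scope.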

@ghost
Author

ghost commented May 31, 2021

@bee-san, I guess supporting distribution both as an attribute of Identifier and as an optional parameter to identify will make everyone happy?🤔

@bee-san
Owner

bee-san commented Jun 1, 2021

> @bee-san, I guess supporting distribution both as an attribute of Identifier and as an optional parameter to identify will make everyone happy?🤔

Sure! That makes sense :) I can see why having multiple-identifiers might be handy if the program has hundreds of filters 😄 But also I see why having one identifier is cool if it only has one or two :)

@ghost
Author

ghost commented Jun 1, 2021

Now I am really confused on what went wrong.

@ghost
Author

ghost commented Jun 1, 2021

Soo, now it is possible to import API by using from pywhat import *

Distributions

from pywhat import *
dist = Distribution({"Tags": ["Identifiers"], "ExcludeTags": ["Credentials"], "MinRarity": 0.2, "MaxRarity": 0.8})
id = Identifier()
res = id.identify(DATA, distribution=dist)

dist2 = dist | Distribution({"Tags": ["Finance", "Media"]}) # not supported yet :(
id.distribution = dist2
id.identify(DATA)

Tests are urgently important and it would be great if someone could help me with that 😏
I am going to add dunder methods to Distributions for set-like behaviour. Probably implementing some helper class for work with filters_dict's is something I should do, too. Also Distribution class must strictly check filters_dict's since it is really simple to forget brackets (like this: {"Tags": "some_tag"}).
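
One way the strict check could look is sketched below. The function name `validate_filters` is hypothetical, the key names are taken from the example above, and the error types are an assumption rather than the behaviour of the merged code.

```python
ALLOWED_KEYS = {"Tags", "ExcludeTags", "MinRarity", "MaxRarity"}


def validate_filters(filters: dict) -> dict:
    """Return a checked copy of a filters dict, or raise on malformed input.

    Hypothetical helper illustrating the kind of check described above.
    """
    unknown = set(filters) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown filter keys: {sorted(unknown)}")

    for key in ("Tags", "ExcludeTags"):
        value = filters.get(key, [])
        # A bare string is iterable too, so require a real collection.
        if not isinstance(value, (list, tuple, set)) or not all(
            isinstance(tag, str) for tag in value
        ):
            raise TypeError(f"{key} must be a list of strings, got {value!r}")

    low = filters.get("MinRarity")
    high = filters.get("MaxRarity")
    if low is not None and high is not None and low > high:
        raise ValueError("MinRarity must not exceed MaxRarity")

    return dict(filters)


print(validate_filters({"Tags": ["Identifiers"], "MinRarity": 0.2}))

try:
    validate_filters({"Tags": "some_tag"})  # brackets forgotten
except TypeError as exc:
    print("rejected:", exc)
```

Rejecting the bare string early avoids silently treating `"some_tag"` as a sequence of one-character tags.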

P.S.: I really dislike how the CLI option parsing looks. I should move the parsing to another function or check whether click has some more advanced solutions to offer.

@bee-san
Owner

bee-san commented Jun 2, 2021

Sure! I'll perhaps work on some tests later, I have been very sick this week but I am feeling better 😄

@bee-san
Owner

bee-san commented Jun 2, 2021

@piatrashkakanstantinass are you in the discord? http://discord.skerritt.blog :)

@ghost
Author

ghost commented Jun 2, 2021

> @piatrashkakanstantinass are you in the discord? http://discord.skerritt.blog :)

Sure! r49behind#6377 :)

@ghost
Author

ghost commented Jun 4, 2021

So, I have written some tests for distributions. Tests for identifiers still need to be written. Also, I decided not to create a Filter helper class since it does not offer any significant functionality.

@bee-san
Owner

bee-san commented Jun 4, 2021

hii!!! nice!!! do we not already have identifier tests? Or do you mean identifiers with Filters:tm: tests? :)

@ghost
Author

ghost commented Jun 4, 2021

> hii!!! nice!!! do we not already have identifier tests? Or do you mean identifiers with Filters™️ tests? :)

Yes, we do, but I want to test the actual behaviour of API with Filters™️XD

@ghost
Author

ghost commented Jun 4, 2021

@bee-san Well, it is done, I guess...

@bee-san
Owner

bee-san commented Jun 4, 2021

Let me review 😄 !!!!!

Owner

@bee-san bee-san left a comment

Looks great!!!! I'll write some docs based on your tests 😄 Thanks so much! Just a few comments and questions :-)

pywhat/__init__.py
pywhat/distribution.py
pywhat/distribution.py (outdated)
pywhat/distribution.py
pywhat/distribution.py
pywhat/helper.py (outdated)
pywhat/what.py
tests/test_regex_identifier.py (outdated)

@bee-san
Owner

bee-san commented Jun 6, 2021

Small update! I checked out this branch, will test manually and update docs then accept :)

Owner

@bee-san bee-san left a comment

Some bugs with the CLI tool itself, but the API strangely works fine! Will write docs for the API :)

tests/test_click.py
tests/test_click.py
tests/test_identifier.py
pywhat/what.py (outdated)

@ghost
Author

ghost commented Jun 6, 2021

Okay, going to work on that😀

@ghost
Author

ghost commented Jun 6, 2021

Aww, I forgot to return distribution in parse_options() 😅

@bee-san
Owner

bee-san commented Jun 6, 2021

> Aww, I forgot to return distribution in parse_options() 😅

Ahh always the way! The great news is that after this (and after I write some docs 😉 ) we can release!!!

@ghost ghost requested a review from bee-san June 6, 2021 10:45
@bee-san
Owner

bee-san commented Jun 6, 2021

On it! ✍🏻

@bee-san
Owner

bee-san commented Jun 6, 2021

It works!!!
[image]

@bee-san
Owner

bee-san commented Jun 6, 2021

@bee-san bee-san merged commit 93b3bfa into bee-san:main Jun 6, 2021
@ghost ghost deleted the filtration branch June 7, 2021 12:17