More approachable names #19
Comments
I'm on board with metrics instead of quantifiers, but I'm personally more convinced that "pruning" fits what's going on in that submodule. Here's a snippet from the wiki:
And from Google's auto definition:
And ditto on the "mouthful" sentiment lol. |
What about renaming |
I think |
Builtin ones can be used as well, but my current concern is that the pruners aren't having a uniform type signature. For example Update: Also, what's |
I took basket out, forgot I left it there :P
A node's parent or a node shouldn't be much of a problem since they're both of type I agree that the second item yielded should stick to a single type instead of |
With that being said, I think we may be approaching a zone of overly generic code. I think that domain-specific functions should remain domain-specific. If the user wants to create a function that traverses the tree, they should be required to write their own traversal function. For example, some traversals are more efficient if they share state, e.g.:

    # SELECTOR and func are placeholders for the caller's XPath and per-node function
    def get_pairs(etree):
        cache = {}  # shared state: each id is computed only once
        for node in etree.xpath(SELECTOR):
            node_id = node.get('id')
            if node_id not in cache:
                cache[node_id] = func(node)
            yield cache[node_id]
|
I'm under the impression that tree traversal is kind of a must in this domain. Couldn't we apply a similar mechanism, like an internal |
About the overly generic code: it seems like we won't be hitting that zone too soon, and no, we shouldn't provide caching or memoization. That is to be provided by the users themselves. The decorators we provide merely "enforce" a common pattern. |
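To make the "decorators enforce a common pattern, caching stays with the user" idea concrete, here is a minimal sketch. The names `pruner`, `text_length`, and `paragraphs` are hypothetical and are not taken from the library; the standard-library `xml.etree.ElementTree` stands in for whatever tree type the project actually uses.

```python
import functools
import xml.etree.ElementTree as ET


def pruner(selector):
    """Hypothetical decorator: wrap a per-node function so that it yields
    uniform (node, value) pairs for every element matching `selector`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(root):
            for node in root.iterfind(selector):
                yield node, func(node)
        return wrapper
    return decorator


# Caching is the user's choice, not the decorator's: opt in with lru_cache.
@functools.lru_cache(maxsize=None)
def text_length(node):
    return len("".join(node.itertext()))


@pruner(".//p")
def paragraphs(node):
    return text_length(node)


root = ET.fromstring(
    "<html><body><p>hello</p><div><p>hi there</p></div></body></html>"
)
pairs = list(paragraphs(root))
lengths = [value for _, value in pairs]
```

The decorator only fixes the shape of what is yielded, so every pruner built this way has the same signature, while any memoization remains an explicit decision in user code.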
I should note that part of the motivation to "pruning", "maximizing", etc. is because those are basically the steps laid out in most A.I. resources that talk about searching & predicting. |
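For readers who have not met that vocabulary, the steps could look roughly like the sketch below: prune the tree to candidates, score them with a metric, then pick the maximum. All function names here are illustrative assumptions, not the library's API, and the metric (paragraph text length per parent) is just one plausible choice.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def prune(root):
    """Pruning: keep only (parent, child) pairs where the child is a
    <p> element that carries some text."""
    for parent in root.iter():
        for child in parent:
            if child.tag == "p" and (child.text or "").strip():
                yield parent, child


def measure(pairs):
    """Metric: total stripped paragraph text length, summed per parent."""
    scores = Counter()
    for parent, child in pairs:
        scores[parent] += len(child.text.strip())
    return scores


def maximize(scores, top=1):
    """Maximizing: return the `top` highest-scoring (node, score) pairs."""
    return scores.most_common(top)


doc = ET.fromstring(
    "<html><body>"
    "<div id='nav'><p>Home</p></div>"
    "<div id='main'><p>First paragraph.</p><p>Second paragraph.</p></div>"
    "</body></html>"
)
best_node, best_score = maximize(measure(prune(doc)))[0]
```

With this input, the `main` div wins because its paragraphs hold far more text than the navigation block.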
In that case, 👍 I misunderstood. |
The question I think we both have, though, is: are these patterns going to be helpful? I mean, to me, they make sense, but I understand that most users are going to need a brief explanation of usage. |
I think I'll work on re-architecting the package: We will have a
|
Sounds good! I would be in favor of dropping article and tabular from the module and refactoring their methods into an appropriately named module? |
What do you mean by that? |
In this branch, |
I think we should promote the individual methods used by the strategy, as opposed to the module itself. |
I think I'll revert the library to the state where it was before we started adding the Let's focus on building a nice-to-use API and worry about extensibility when the problem comes to us. |
Ehh, but wouldn't we be trying to refactor |
IMHO, that's a fool's errand. |
It's sort of already been requested by Beluki:
As for making this API useful, I'm wondering where our efforts should be concentrated. I'm thinking we can keep these changes active, as they do, IMO, implement a good base for further abstraction of the general HTML extraction process flow (this is coming from personal experience). Reverting to a state with the plumper versions of |
What about moving the |
Unfortunately, I'm opting for dropping The reason for my position is that if we're trying to create a library that's supposed to be "purely functional" and supposed to work with HT/XML documents, the There are functions that are similar but not 100% the same, and those are the ones that I think should be refactored and generalized in order to move towards a functional lib. A flat structure without the html folder would be a step in the right direction, I think. |
👍 will do. |
:) While you're doing that, I'm going to try to see if I can create a -to-json formatter. |
I'm really sorry about the commotion I caused. I was just really frustrated and confused just now, and I should've just merged and refactored from there. But I've come to realise that what we are building is not merely an extraction tool but a "framework" (if you will) to help users develop their own extraction algorithms. |
Not a problem, dude! I hope I didn't come off too firm on my ideas, but you're right, we can say that we are actually making an extraction framework as opposed to just a single extraction tool. I mean, I think we'll do the bare minimum for users to start using extractors from the get-go, like supplying default "main article" and "tabular data" extraction algos, but what could make this unique from the rest of what's out there is the underlying predictive and traversal components.
I think @msiemens has raised an important issue about the terminology used in our project. We should always strive to make it more "intuitive", that is, that developers can just start delving into the source code without having to look up the technical terminology used. Some suggestions:
As for maximisers I don't know of a better name, but it sounds a mouthful ;)
cc @libextract/owners @libextract/contrib