More approachable names #19

Closed
eugene-eeo opened this issue Apr 12, 2015 · 26 comments
@eugene-eeo
Contributor

I think @msiemens has raised an important issue about the terminology used in our project. We should always strive to make it more "intuitive", so that developers can start delving into the source code without having to look up the technical terminology first. Some suggestions:

  • quantifiers -> metrics
  • pruners -> selectors/mappers

As for maximisers I don't know of a better name, but it sounds like a mouthful ;)

cc @libextract/owners @libextract/contrib

@rodricios
Contributor

I'm on board with metrics instead of quantifiers, but I'm personally more convinced that "pruning" fits what's going on in that submodule. Here's a snippet from the wiki:

A top down pruning will traverse nodes and trim subtrees starting at the root [..]

And from google's auto definition:

Pruning is a horticultural and silvicultural practice involving the selective removal of parts of a plant, such as branches, buds, or roots

And ditto on the "mouthful" sentiment lol.

@eugene-eeo
Contributor Author

What about renaming @pruner to @prunes or @selects?

@rodricios
Contributor

I think @selects is appropriate in that case. We provide the user with the selects decorator for custom pruners, or they can also use our builtin pruners?
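For the sake of discussion, here is a minimal sketch of what such a @selects decorator could look like, assuming selectors are generators over an lxml-style etree. The xpath argument, the selector attribute, and text_length are all illustrative names, not the actual libextract API:

```python
import functools

def selects(xpath):
    """Mark a function as a selector applied to each node matching *xpath*.

    (Hypothetical sketch: real pruners may carry more state.)
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(etree):
            # Delegate node matching to the etree's own xpath method,
            # then apply the wrapped per-node function.
            for node in etree.xpath(xpath):
                yield func(node)
        wrapper.selector = True  # flag so the pipeline can recognize it
        return wrapper
    return decorator

@selects('//p')
def text_length(node):
    # Yields (node, length-of-text) pairs for every <p> element.
    return node, len(node.text_content())
```

A user-defined pruner would then be just a decorated per-node function, with the traversal handled uniformly by the decorator.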

@eugene-eeo
Contributor Author

Builtin ones can be used as well, but my current concern is that the pruners don't have a uniform type signature. For example, prune_by_node_text_length returns the node's parent and the node's text length. So it should be per-strategy, i.e. `libextract.html.article.get_parent_textlen_pairs`.

Update: Also, what's basket()?

@rodricios
Contributor

I took basket out, forgot I left it there :P

prune_by_node_text_length returns the node's parent and the node text length

A node's parent or a node shouldn't be much of a problem since they're both of type lxml.html.HtmlElement

I agree that the second item yielded should stick to a single type instead of numerical or Counter
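One way to stick to a single type would be a small normalizing helper that collapses whatever a pruner measures into a plain float, so every pruner can yield (node, float) pairs. This is a sketch; as_score is an illustrative name, not part of libextract:

```python
from collections import Counter

def as_score(measure):
    """Collapse a numeric or Counter measure into one float.

    (Hypothetical helper: normalizes the second item yielded by a
    pruner so downstream code always sees the same type.)
    """
    if isinstance(measure, Counter):
        # A Counter of, say, tag frequencies reduces to its total count.
        return float(sum(measure.values()))
    return float(measure)
```

With this, a strategy could wrap each pruner's output as (node, as_score(measure)) and keep a uniform signature across the submodule.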

@eugene-eeo
Contributor Author

With that being said, I think we may be approaching a zone of overly generic code. I think that domain-specific functions should remain domain-specific. If the user wants to create a function that traverses the tree, they should be required to write their own traversal function. For example, some traversals are more efficient if they share state, e.g.:

def get_pairs(etree):
    # SELECTOR and func are placeholders for a strategy's
    # XPath expression and its per-node metric.
    cache = {}
    for node in etree.xpath(SELECTOR):
        id = node.get('id')
        if id not in cache:
            cache[id] = func(node)
        yield cache[id]

@rodricios
Contributor

If the user wants to create a function that traverses the tree [..]

I'm under the impression that tree traversal is kind of a must in this domain. Couldn't we apply a similar mechanism, like an internal cache variable, if the user would like to share state?

@eugene-eeo
Contributor Author

About the overly generic code: it seems like we won't be hitting that zone too soon, and no, we shouldn't provide caching or memoization. That is to be provided by the users themselves. The decorators we provide merely "enforce" a common pattern.

@rodricios
Contributor

I should note that part of the motivation for "pruning", "maximizing", etc. is that those are basically the steps laid out in most A.I. resources that talk about searching & predicting.

@eugene-eeo
Contributor Author

In that case, 👍. I misunderstood.

@rodricios
Contributor

The question I think we both have, though, is: are these patterns going to be helpful? I mean, to me, they make sense, but I understand that most users are going to need a brief explanation of their usage.

@eugene-eeo
Contributor Author

I think I'll work on a re-architecting of the package:

We will have a protocol module which defines wrappers for common use-cases of parsing, pruning, prediction, and post-processing. It will include decorators like @select and defaults like parse_html and perhaps enforce the four stages (Parsing, Pruning, Prediction, Post-processing).

libextract/
  __init__.py
  coretools.py
  protocol.py
  metrics.py
  article.py
  tabular.py
  formatters.py
  strategies.py
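The four stages named in the proposal could be wired together as a simple function pipeline. The stage names (Parsing, Pruning, Prediction, Post-processing) come from the proposal above; the pipeline function itself and its signature are only a sketch, not the planned protocol module:

```python
def pipeline(parse, prune, predict, postprocess):
    """Compose the four proposed stages into one extractor.

    Each argument is a callable for one stage; the result is a
    function from a raw document to the post-processed output.
    (Hypothetical sketch of the protocol, not libextract's API.)
    """
    def extract(document):
        etree = parse(document)          # Parsing: raw input -> tree
        candidates = prune(etree)        # Pruning: tree -> candidates
        best = predict(candidates)       # Prediction: pick the winner
        return postprocess(best)         # Post-processing: final shape
    return extract
```

Defaults like parse_html would then just be the stock callable for the first slot, with users free to swap any stage.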

@rodricios
Contributor

Sounds good! I would be in favor of dropping article and tabular from the module and refactoring their methods into an appropriately named module?

@eugene-eeo
Contributor Author

What do you mean by that?

@rodricios
Contributor

In this branch, article.py only has one function, get_text, and tabular.py has two that are used by tabular's strategy: sort_best_pair and filter_tags.

@rodricios
Contributor

I think we should promote the individual methods used by the strategy, as opposed to the module itself.

@eugene-eeo
Contributor Author

I think I'll revert the library to the state where it was before we started adding the @prunes decorator. I am a firm believer in simplicity being the key to composability, and I think that the library was in the best state then. The YAGNI methodology should be adopted here, especially for our library. Look at this stackoverflow question.

Let's focus on building a nice-to-use API and worry about extensibility when the problem comes to us.

@rodricios
Contributor

Ehh, but wouldn't we be trying to refactor get_pairs again? Or is that a fool's errand?

@eugene-eeo
Contributor Author

IMHO, that's a fool's errand.

@rodricios
Contributor

[..] worry about extensibility when the problem comes to us.

It's sort of already been requested by Beluki:

My rationale for this is that the important thing is to make the pipeline easily extensible/configurable and documenting how users can extend it. Closures allow that.

As for making this API useful, I'm wondering where our efforts should be concentrated. I'm thinking we can keep these changes active, as they do, imo, implement a good base for further abstraction of the general HTML extraction process flow (this is coming from personal experience). Reverting to a state with the plumper versions of html.article and html.tabular is basically going back to a version that's closer to what eatiht was and currently is.
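The closure-based extensibility Beluki describes could look like the following sketch: a pipeline step is produced by a factory that closes over its options, so users configure behaviour without touching the step's internals. All names here (text_length_filter, step) are illustrative:

```python
def text_length_filter(minimum):
    """Build a pipeline step that drops (node, length) pairs
    below *minimum*.

    (Hypothetical sketch of closure-based configuration; the
    returned step closes over the *minimum* option.)
    """
    def step(pairs):
        for node, length in pairs:
            if length >= minimum:
                yield node, length
    return step

# Two differently configured instances of the same step:
short_ok = text_length_filter(minimum=0)
long_only = text_length_filter(minimum=100)
```

Because each configured step is just a callable, it slots directly into a pipeline, which is what makes closures attractive for extensibility here.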

@eugene-eeo
Contributor Author

What about moving the @prunes decorator to coretools but keep the functions in html.article and html.tabular? It offers a nice compromise between code-reuse and locality.

@rodricios
Contributor

Unfortunately, I'm opting for dropping html.article and html.tabular because imo, with those two submodules, we're simply reimplementing eatiht.

The reason for my position is that if we're trying to create a library that's supposed to be "purely functional" and supposed to work with HT/XML documents, the html name, as well as the article and tabular distinction, don't really allow for that.

There are functions that are similar but not 100% the same; those are the ones I think should be refactored and generalized in order to move towards a functional lib.

A flat structure without the html folder would be a step in the right direction I think.

@eugene-eeo
Contributor Author

👍 will do.

eugene-eeo pushed a commit that referenced this issue Apr 12, 2015
@rodricios
Contributor

:) While you're doing that, I'm going to try to see if I can create a -to-json formatter.

@eugene-eeo
Contributor Author

I'm really sorry about the whole commotion caused. I was just really frustrated and confused, and I should've just merged and refactored from there. But I've come to realise that what we are building is not just a mere extraction tool but a "framework" (if you will) to help users develop their own extraction algorithms.

@rodricios
Contributor

Not a problem dude! I hope I didn't come off too firm on my ideas, but you're right, we can say that we are actually making an extraction framework as opposed to just a single extraction tool.

I mean, I think we'll do the bare minimum for users to start using extractors at the get-go, like supplying them default "main article" and "tabular data" extraction algos, but what could make this unique from the rest of what's out there is the underlying predictive and traversal components.
