Skip to content
This repository has been archived by the owner on Mar 1, 2018. It is now read-only.

Migrated the CPF validation library #95

Merged
merged 8 commits into from
Nov 29, 2017
Merged

Migrated the CPF validation library #95

merged 8 commits into from
Nov 29, 2017

Conversation

luizberti
Copy link
Contributor

The current CPF validation library has a shoddy construction and does not recognise repeated digit CPF numbers as invalid (e.g. 999.999.999-99). Migrated to a better one, and improved readability.

Local tests are passing.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.003%) to 98.042% when pulling a30feb8 on luizberti:cpf-library-migration into c599dc7 on datasciencebr:master.

@luizberti
Copy link
Contributor Author

Hmm, coveralls seems to think that test coverage has decreased but I think it just got confused over the fact that I deleted more code than I added, so it seems to believe it's testing less things than before.

What should I do about this?

@cuducos
Copy link
Collaborator

cuducos commented Oct 30, 2017

Hmm, coveralls seems to think that test coverage has decreased

Coverage decreased (-0.003%) — I wouldn't bother about it ; )

Copy link
Collaborator

@cuducos cuducos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we're editing classifiers I'd follow up on @lipemorais's #50 and rewrite the code focusing on readability. Beyond calling X dataframe (for example) I think a structure like that would enhance the quality of the code:

    def predict(self, X):
        return np.r_[X.apply(self._is_invalid, axis=1)]

    def _is_invalid(row):
        doc_number = str(row['recipient_id'])
        doc_type = str(row['document_type'])

        is_cpf = cpf.validate(doc.zfill(11))
        is_cnpj = cnpj.validate(doc.zfill(14))
        proper_type = doc_type in ('bill_of_sale', 'simple_receipt', 'unknown')
        
        return proper_type and not (is_cpf or is_cnpj)

What do you think?


I haven't tested it, just drafted as an example to make my point ; )

requirements.txt Outdated
@@ -1,5 +1,5 @@
geopy>=1.11.0
pycpfcnpj==1.0.2
brutils==1.0.1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May we keep requirements sorted?

cpf_or_cnpj = lambda doc: cpf.validate(str(doc).zfill(11)) or cnpj.validate(str(doc).zfill(14))
is_invalid = lambda row: row['document_type'] in ('bill_of_sale', 'simple_receipt', 'unknown') and \
(not cpf_or_cnpj(row['recipient_id']))
return np.r_[X.apply(is_invalid, axis=1)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we're editing classifiers I'd follow up on @lipemorais's #50 and rewrite the code focusing on readability. Beyond calling X dataframe (for example) I think a structure like that would enhance the quality of the code:

    def predict(self, X):
        return np.r_[X.apply(self._is_invalid, axis=1)]

    def _is_invalid(row):
        doc_number = str(row['recipient_id'])
        doc_type = str(row['document_type'])

        is_cpf = cpf.validate(doc.zfill(11))
        is_cnpj = cnpj.validate(doc.zfill(14))
        proper_type = doc_type in ('bill_of_sale', 'simple_receipt', 'unknown')
        
        return proper_type and not (is_cpf or is_cnpj)

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't particularly enjoy the idea of creating a method (as it was before) for something that is inherently a one-off function, use cases like these are why lambdas were added to the language. Pulling it out of that function seems like it would only serve to pollute the class space, especially given that it's not being reused anywhere. Also being only a few lines long, I'm not sure if this is a worthy abstraction.

As for breaking up the computed sub-values (rather than sub-expressions) into separate variables, I tend to favor that style of programming more in languages that support immutable assignments than I do in Python. In my opinion, having the sub-expressions of a boolean equation as functions makes it easier to test (given that they can be called and evaluated separately), and also easier to read (given that you have to always explicitly pass context; i.e. is_cpf(something) rather than is_cpf # what is?).

I will make some small modifications to this code that make it less verbose, and more or less follow on your remarks regarding readability, but I still tend to favor decomposition through lambdas rather than intermediate values.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'm sorry but I totally disagree : )

I disagree with your argument on the use of lambda for that case in particular. But beyond that I disagree with your understanding of readability too. Let's start with readability and your own example:

(given that you have to always explicitly pass context; i.e. is_cpf(something) rather than is_cpf # what is?)

Surely is_cpf(something) is quite readable. In plain English it would read is CPF something? which is almost how you would ask that in English (i.e. Is something a CPF?). However, let's go for real code:

cpf_or_cnpj = lambda doc: cpf.validate(str(doc).zfill(11)) or cnpj.validate(str(doc).zfill(14))

This line doesn't even fit in this text area in my browser. Using the style I suggested it'd give you something like:

is_cpf = cpf.validate(str(doc).zfill(11))
is_cnpj = cnpj.validate(str(doc).zfill(14))
return is_cpf or is_cnpj

Ok, now it's clear not only because the code fits in the screen but because it's clear it's a three-step logic we're talking about (validate the first value, validate the second one, and then return).

To be clear I'd surely go for return is_cpf(cpf) or is_cnpj(cnpj) if we had this is_cpf and is_cnpj API to handle the data types at hand. Unfortunately that's not the case (but that would give a nice PR).

However the most important point here is what you call an one-off function. The example we're discussing actually does at least eight operations

I'm considering this single line cpf_or_cnpj = lambda doc: cpf.validate(str(doc).zfill(11)) or cnpj.validate(str(doc).zfill(14)). What you call an one-of function actually:

  1. Converts the CPF to string
  2. Formats the CPF enforcing 11 digits
  3. Validates the result of the now guarantee 11 digits CPF
  4. If it's valid it is read to return True
  5. Else it converts the CNPJ to string
  6. Formats the CNPJ enforcing 14 digits
  7. Validates the now 14 digits guarantee CNPJ
  8. Returns whatever the CNPJ validation was

As Raymond Hettinger puts the problem with the 80 chars limit proposed by PEP8 is not the physical measure but the suggestion that you might be putting too much logic in a tiny space. Your code gets dense (and might even be PEP8 compliant), but it might not end up intelligible — and readability counts.

It seams to me that this one-off function (as you called) is exactly that: at least eight logical operations bundled in a single (and long) line. And that is why I started saying I'm sorry, but I totally disagree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you misconstrued a lot of the points I made. When I say a one-off function, I don't mean a simple function, but one whose purpose and existence is constrained to a single application, which in this case is being passed to apply; It's not reused anywhere, and thus has no purpose being exported into the class space, but beyond that it has no purpose in being a method at all: It makes no use of self whatsoever.

That's the bigger point I was trying to make and that's the sentiment I tried to express when I mentioned lambdas, it goes more along the lines of anonymous functions in general than Python's lambda construct itself, to which I'll concede that it's not the nicest implementation of anonymous functions out there. I think most of our divergence stems from me favouring a more functional style of programming and wanting to decompose the boolean equation into independently evaluable expressions, while you would prefer it to be more imperative and decompose the evaluation of the expression itself. Calling that a fundamental disagreement with my understanding of readability was a bit of a long stretch in my opinion 😉

As for the rest I'd like to point to this:

To be clear I'd surely go for return is_cpf(cpf) or is_cnpj(cnpj) if we had this is_cpf and is_cnpj API to handle the data types at hand. Unfortunately that's not the case (but that would give a nice PR).

The optimal solution to which you pointed out is almost exactly the line you were referring to through most of your comment. That is what cpf.validate and cnpj.validate do. The only difference from that line and mine is that I defensively formatted the strings out of distrust for the way data may be structured in the dataset.

Hettinger's main point when writing that was neither about the line length ifself, and not about the amount of logical operations in a single line either (which you seemed to heavily focus on), but about the cognitive strain necessary to understand it. I would guess that is also the main driving factor behind line 2 of The Zen of Python, explicit context passing helps when you're a dynamically typed language, and that is why I favoured decomposing subexpressions. The final expression looked like good_doctype(row['document_type']) and (not cpf_or_cnpj(row['recipient_id'])) which is about as concise as you can get for something like this, imperative decomposition doesn't look as intuitive in my opinion.


In the end those are all just opinions and I would rather not quarrel over that and talk solutions instead: What would you say to using a local scoped block function instead? Seems like a nice middle ground between not having to export the one-off function (as I call it), and being able to decompose the evaluation without long lines and weird line breaks. Would look somewhat like this:

def predict(self, dataframe):
    def is_invalid(row):
        is_cpf = cpf.validate(str(doc).zfill(11))
        is_cnpj = cnpj.validate(str(doc).zfill(14))
        ...
    return np.r_[dataframe.apply(is_invalid, axis=1)]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @luizberti — you're totally right. I misunderstood your point. If it worth it anything it was not on purpose thought. I was really trying to foster readability and a code easy for beginners to grasp, as you beautifully said to reduce the cognitive strain necessary to understand it.

Now I clearly get your. I do appreciate your comments, I've learned new stuff from them, and do appreciate your comprehension in not making a quarrel out of it : )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries 😄
I suppose I have a green light for that last solution so I'll be commiting it within the next five minutes for revision.

Cheers!

return (row['document_type'] in document_types) \
& (not cpfcnpj.validate(str(row['recipient_id'])))
def predict(self, dataframe):
def is_invalid(row):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have a method inside another method here? Couldn't we have a auxiliary method in this case?

Copy link
Contributor Author

@luizberti luizberti Nov 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a function inside a method because in my opinion it doesn't make sense having it in the class space: It's not being reused anywhere, and it makes no use of self. An auxiliary method is something which abstracts stateful logic within the object itself, this is not a method at all.

It's just a dummy function whose only purpose is being passed to apply, thats why I argue it should be kept in the most local scope possible, that reinforces the idea that it's not a core part of the class but rather something that is disposable.

Me and @cuducos had a thread that sorta touched on this subject, you can read it here.

@cuducos cuducos merged commit d73a9bb into okfn-brasil:master Nov 29, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants