Migrated the CPF validation library #95

luizberti · 2017-10-29T00:16:39Z

The current CPF validation library has a shoddy construction and does not recognise repeated digit CPF numbers as invalid (e.g. 999.999.999-99). Migrated to a better one, and improved readability.

Local tests are passing.

coveralls · 2017-10-29T00:28:28Z

Coverage decreased (-0.003%) to 98.042% when pulling a30feb8 on luizberti:cpf-library-migration into c599dc7 on datasciencebr:master.

luizberti · 2017-10-29T19:50:47Z

Hmm, coveralls seems to think that test coverage has decreased but I think it just got confused over the fact that I deleted more code than I added, so it seems to believe it's testing less things than before.

What should I do about this?

cuducos · 2017-10-30T07:53:53Z

Hmm, coveralls seems to think that test coverage has decreased

Coverage decreased (-0.003%) — I wouldn't bother about it ; )

cuducos

Once we're editing classifiers I'd follow up on @lipemorais's #50 and rewrite the code focusing on readability. Beyond calling X dataframe (for example) I think a structure like that would enhance the quality of the code:

    def predict(self, X):
        return np.r_[X.apply(self._is_invalid, axis=1)]

    def _is_invalid(row):
        doc_number = str(row['recipient_id'])
        doc_type = str(row['document_type'])

        is_cpf = cpf.validate(doc.zfill(11))
        is_cnpj = cnpj.validate(doc.zfill(14))
        proper_type = doc_type in ('bill_of_sale', 'simple_receipt', 'unknown')
        
        return proper_type and not (is_cpf or is_cnpj)

What do you think?

I haven't tested it, just drafted as an example to make my point ; )

cuducos · 2017-10-30T08:06:09Z

requirements.txt

@@ -1,5 +1,5 @@
 geopy>=1.11.0
-pycpfcnpj==1.0.2
+brutils==1.0.1


May we keep requirements sorted?

cuducos · 2017-10-30T08:18:52Z

rosie/core/classifiers/invalid_cnpj_cpf_classifier.py

+        cpf_or_cnpj = lambda doc: cpf.validate(str(doc).zfill(11)) or cnpj.validate(str(doc).zfill(14))
+        is_invalid = lambda row: row['document_type'] in ('bill_of_sale', 'simple_receipt', 'unknown') and \
+                                 (not cpf_or_cnpj(row['recipient_id']))
+        return np.r_[X.apply(is_invalid, axis=1)]


Once we're editing classifiers I'd follow up on @lipemorais's #50 and rewrite the code focusing on readability. Beyond calling X dataframe (for example) I think a structure like that would enhance the quality of the code:

def predict(self, X): return np.r_[X.apply(self._is_invalid, axis=1)] def _is_invalid(row): doc_number = str(row['recipient_id']) doc_type = str(row['document_type']) is_cpf = cpf.validate(doc.zfill(11)) is_cnpj = cnpj.validate(doc.zfill(14)) proper_type = doc_type in ('bill_of_sale', 'simple_receipt', 'unknown') return proper_type and not (is_cpf or is_cnpj)

What do you think?

I don't particularly enjoy the idea of creating a method (as it was before) for something that is inherently a one-off function, use cases like these are why lambdas were added to the language. Pulling it out of that function seems like it would only serve to pollute the class space, especially given that it's not being reused anywhere. Also being only a few lines long, I'm not sure if this is a worthy abstraction.

As for breaking up the computed sub-values (rather than sub-expressions) into separate variables, I tend to favor that style of programming more in languages that support immutable assignments than I do in Python. In my opinion, having the sub-expressions of a boolean equation as functions makes it easier to test (given that they can be called and evaluated separately), and also easier to read (given that you have to always explicitly pass context; i.e. is_cpf(something) rather than is_cpf # what is?).

I will make some small modifications to this code that make it less verbose, and more or less follow on your remarks regarding readability, but I still tend to favor decomposition through lambdas rather than intermediate values.

Ok. I'm sorry but I totally disagree : )

I disagree with your argument on the use of lambda for that case in particular. But beyond that I disagree with your understanding of readability too. Let's start with readability and your own example:

(given that you have to always explicitly pass context; i.e. is_cpf(something) rather than is_cpf # what is?)

Surely is_cpf(something) is quite readable. In plain English it would read is CPF something? which is almost how you would ask that in English (i.e. Is something a CPF?). However, let's go for real code:

cpf_or_cnpj = lambda doc: cpf.validate(str(doc).zfill(11)) or cnpj.validate(str(doc).zfill(14))

This line doesn't even fit in this text area in my browser. Using the style I suggested it'd give you something like:

is_cpf = cpf.validate(str(doc).zfill(11)) is_cnpj = cnpj.validate(str(doc).zfill(14)) return is_cpf or is_cnpj

Ok, now it's clear not only because the code fits in the screen but because it's clear it's a three-step logic we're talking about (validate the first value, validate the second one, and then return).

To be clear I'd surely go for return is_cpf(cpf) or is_cnpj(cnpj) if we had this is_cpf and is_cnpj API to handle the data types at hand. Unfortunately that's not the case (but that would give a nice PR).

However the most important point here is what you call an one-off function. The example we're discussing actually does at least eight operations

I'm considering this single line cpf_or_cnpj = lambda doc: cpf.validate(str(doc).zfill(11)) or cnpj.validate(str(doc).zfill(14)). What you call an one-of function actually:

Converts the CPF to string

Formats the CPF enforcing 11 digits

Validates the result of the now guarantee 11 digits CPF

If it's valid it is read to return True

Else it converts the CNPJ to string

Formats the CNPJ enforcing 14 digits

Validates the now 14 digits guarantee CNPJ

Returns whatever the CNPJ validation was

As Raymond Hettinger puts the problem with the 80 chars limit proposed by PEP8 is not the physical measure but the suggestion that you might be putting too much logic in a tiny space. Your code gets dense (and might even be PEP8 compliant), but it might not end up intelligible — and readability counts.

It seams to me that this one-off function (as you called) is exactly that: at least eight logical operations bundled in a single (and long) line. And that is why I started saying I'm sorry, but I totally disagree.

I think you misconstrued a lot of the points I made. When I say a one-off function, I don't mean a simple function, but one whose purpose and existence is constrained to a single application, which in this case is being passed to apply; It's not reused anywhere, and thus has no purpose being exported into the class space, but beyond that it has no purpose in being a method at all: It makes no use of self whatsoever.

That's the bigger point I was trying to make and that's the sentiment I tried to express when I mentioned lambdas, it goes more along the lines of anonymous functions in general than Python's lambda construct itself, to which I'll concede that it's not the nicest implementation of anonymous functions out there. I think most of our divergence stems from me favouring a more functional style of programming and wanting to decompose the boolean equation into independently evaluable expressions, while you would prefer it to be more imperative and decompose the evaluation of the expression itself. Calling that a fundamental disagreement with my understanding of readability was a bit of a long stretch in my opinion 😉

As for the rest I'd like to point to this:

To be clear I'd surely go for return is_cpf(cpf) or is_cnpj(cnpj) if we had this is_cpf and is_cnpj API to handle the data types at hand. Unfortunately that's not the case (but that would give a nice PR).

The optimal solution to which you pointed out is almost exactly the line you were referring to through most of your comment. That is what cpf.validate and cnpj.validate do. The only difference from that line and mine is that I defensively formatted the strings out of distrust for the way data may be structured in the dataset.

Hettinger's main point when writing that was neither about the line length ifself, and not about the amount of logical operations in a single line either (which you seemed to heavily focus on), but about the cognitive strain necessary to understand it. I would guess that is also the main driving factor behind line 2 of The Zen of Python, explicit context passing helps when you're a dynamically typed language, and that is why I favoured decomposing subexpressions. The final expression looked like good_doctype(row['document_type']) and (not cpf_or_cnpj(row['recipient_id'])) which is about as concise as you can get for something like this, imperative decomposition doesn't look as intuitive in my opinion.

In the end those are all just opinions and I would rather not quarrel over that and talk solutions instead: What would you say to using a local scoped block function instead? Seems like a nice middle ground between not having to export the one-off function (as I call it), and being able to decompose the evaluation without long lines and weird line breaks. Would look somewhat like this:

def predict(self, dataframe): def is_invalid(row): is_cpf = cpf.validate(str(doc).zfill(11)) is_cnpj = cnpj.validate(str(doc).zfill(14)) ... return np.r_[dataframe.apply(is_invalid, axis=1)]

Hi @luizberti — you're totally right. I misunderstood your point. If it worth it anything it was not on purpose thought. I was really trying to foster readability and a code easy for beginners to grasp, as you beautifully said to reduce the cognitive strain necessary to understand it.

Now I clearly get your. I do appreciate your comments, I've learned new stuff from them, and do appreciate your comprehension in not making a quarrel out of it : )

No worries 😄
I suppose I have a green light for that last solution so I'll be commiting it within the next five minutes for revision.

Cheers!

lipemorais · 2017-11-06T00:42:04Z

rosie/core/classifiers/invalid_cnpj_cpf_classifier.py

-        return (row['document_type'] in document_types) \
-            & (not cpfcnpj.validate(str(row['recipient_id'])))
+    def predict(self, dataframe):
+        def is_invalid(row):


Why do we have a method inside another method here? Couldn't we have a auxiliary method in this case?

We have a function inside a method because in my opinion it doesn't make sense having it in the class space: It's not being reused anywhere, and it makes no use of self. An auxiliary method is something which abstracts stateful logic within the object itself, this is not a method at all.

It's just a dummy function whose only purpose is being passed to apply, thats why I argue it should be kept in the most local scope possible, that reinforces the idea that it's not a core part of the class but rather something that is disposable.

Me and @cuducos had a thread that sorta touched on this subject, you can read it here.

migrated the cpf validation library

a30feb8

cuducos suggested changes Oct 30, 2017

View reviewed changes

luizberti added 5 commits November 3, 2017 19:14

sorted requirements and more modular validation

02a3645

Merge branch 'master' into cpf-library-migration

bebbcaf

Merge branch 'master' into cpf-library-migration

5cbf759

reworked boolean equation into a closured function

76edfa8

Merge branch 'master' into cpf-library-migration

2d8cc41

lipemorais reviewed Nov 6, 2017

View reviewed changes

cuducos approved these changes Nov 27, 2017

View reviewed changes

cuducos added 2 commits November 27, 2017 16:37

Merge branch 'master' into cpf-library-migration

48d94c7

Merge branch 'master' into cpf-library-migration

d73a9bb

cuducos merged commit d73a9bb into okfn-brasil:master Nov 29, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Migrated the CPF validation library #95

Migrated the CPF validation library #95

luizberti commented Oct 29, 2017

coveralls commented Oct 29, 2017

luizberti commented Oct 29, 2017

cuducos commented Oct 30, 2017 •

edited

Loading

cuducos left a comment

cuducos Oct 30, 2017

cuducos Oct 30, 2017

luizberti Nov 3, 2017

cuducos Nov 3, 2017

luizberti Nov 4, 2017

cuducos Nov 4, 2017

luizberti Nov 5, 2017

lipemorais Nov 6, 2017

luizberti Nov 6, 2017 •

edited

Loading

Migrated the CPF validation library #95

Migrated the CPF validation library #95

Conversation

luizberti commented Oct 29, 2017

coveralls commented Oct 29, 2017

luizberti commented Oct 29, 2017

cuducos commented Oct 30, 2017 • edited Loading

cuducos left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

luizberti Nov 6, 2017 • edited Loading

Choose a reason for hiding this comment

cuducos commented Oct 30, 2017 •

edited

Loading

luizberti Nov 6, 2017 •

edited

Loading