Refactor election expenses classifier #89

lipemorais · 2017-10-19T17:26:39Z

fixes #50

This is a WIP; I'm opening it now to get feedback early.

What is the purpose of this Pull Request?
The purpose of this PR is to make the classifier election_expenses_classifier.py, including its tests, easier to understand.

What was done to achieve this purpose?
I renamed some variables to be more meaningful and did some refactoring.

How to test if it really works?
Just run the Rosie tests and check that everything keeps working. A refactor should not change any behaviour or feature. python rosie.py test

Who can help reviewing it?
@anaschwendler @cuducos @jtemporal @Irio

TODO

Refactoring classifier source code
Give a more meaningful name for the subject under test
Rename tests to be more meaningful
Meaningful constant for 409-0 - CANDIDATO A CARGO POLITICO ELETIVO
Avoid reading from a file where it isn't necessary, to improve test performance

cuducos · 2017-10-19T17:31:49Z

rosie/chamber_of_deputies/classifiers/election_expenses_classifier.py

@@ -14,11 +14,11 @@ class ElectionExpensesClassifier(TransformerMixin):
        Brazilian Federal Revenue category of companies, preceded by its code.
    """

-    def fit(self, X):
+    def fit(self, dataset):


Maybe this is just me being annoying, but I think that conceptually this refactor is also the place for this kind of semantic change. That said: pandas jargon uses data frames instead of datasets. What do you think about dataframe instead of dataset?

I took the name from the test: self.dataset = pd.read_csv('rosie/chamber_of_deputies/tests/fixtures/election_expenses_classifier.csv', dtype={'name': np.str, 'legal_entity': np.str})

So I believe that it should change too. Is that right?

Just a newbie question: what is the difference between a dataframe and a dataset?

What is the difference between a dataframe and a dataset?

A dataset is simply a collection of sets (groups) of information. A data frame is more specific: it's a two-dimensional structure holding data ; )
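
A small sketch of that distinction, assuming pandas is installed. The column names mirror the dtype mapping used in the test fixture; the second row's name is an illustrative value, not taken from the fixture:

```python
import pandas as pd

# A "dataset" in the loose sense: any collection of records.
records = [
    {'name': 'POSTO ROTA 116 DERIVADOS DE PETROLEO LTDA',
     'legal_entity': '401-4 - EMPRESA INDIVIDUAL IMOBILIARIA'},
    {'name': 'EXAMPLE CANDIDATE',  # illustrative value
     'legal_entity': '409-0 - CANDIDATO A CARGO POLITICO ELETIVO'},
]

# A data frame is the specific two-dimensional (rows x columns) pandas structure.
dataframe = pd.DataFrame(records)

print(dataframe.ndim)   # number of dimensions
print(dataframe.shape)  # (rows, columns)
```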

cuducos · 2017-10-19T17:34:52Z

rosie/chamber_of_deputies/tests/test_election_expenses_classifier.py

@@ -11,16 +11,16 @@ class TestElectionExpensesClassifier(TestCase):
    def setUp(self):
        self.dataset = pd.read_csv('rosie/chamber_of_deputies/tests/fixtures/election_expenses_classifier.csv',
                                   dtype={'name': np.str, 'legal_entity': np.str})
-        self.subject = ElectionExpensesClassifier()
+        self.election_expenser_classifier = ElectionExpensesClassifier()


Just wondering: as each test file refers to only one classifier, I wouldn't mind shortening it to self.classifier…

I'm aiming to help people who have no context about a classifier understand what this refers to.

How does self.classifier help with that? Or is this tip addressing something else that I didn't get here?

I'm aiming to help people who have no context about a classifier

Hopefully readers know which file they are reading… or maybe I'm just too optimistic.

But nothing against self.election_expenser_classifier per se, except that I was trying for something a bit shorter (a minor enhancement in code readability) and as meaningful as before (assuming the file name gives the reader enough context about which classifier we're talking about). But that was just a minor suggestion… feel free to throw it away ; )

On that point I'm with @cuducos: I prefer something meaningful but simple at the same time :)

lipemorais · 2017-10-19T18:53:24Z

rosie/chamber_of_deputies/classifiers/election_expenses_classifier.py

-    def predict(self, X):
-        return X['legal_entity'] == '409-0 - CANDIDATO A CARGO POLITICO ELETIVO'
+    def predict(self, dataset):
+        return dataset['legal_entity'] == '409-0 - CANDIDATO A CARGO POLITICO ELETIVO'


Hey @cuducos @anaschwendler! Any tips on how I could give a more meaningful name to 409-0 - CANDIDATO A CARGO POLITICO ELETIVO?

I'd make it a class constant, e.g.:

    class ElectionExpensesClassifier:
        ELECTION_LEGAL_ENTITY = '409 - …'

What does this 409-0 mean? I would like to give it a name so that the assertion and the test name could be improved. It looks like a kind of code, but I'm not sure what I could call it.

The same for CANDIDATO A CARGO POLITICO ELETIVO.

The answer to these questions is here: https://concla.ibge.gov.br/estrutura/natjur-estrutura/natureza-juridica-2014/409-0-candidato-a-cargo-politico-eletivo.html

I think going for a constant is best, because we don't know whether we may use another code from that dataset

The answer to these questions is here: https://concla.ibge.gov.br/estrutura/natjur-estrutura/natureza-juridica-2014/409-0-candidato-a-cargo-politico-eletivo.html

I'm sorry, but this doesn't give me enough context to name 409-0 and CANDIDATO A CARGO POLITICO ELETIVO. I don't even know whether these are really separate things that could receive different names. Maybe that is why it looks strange to me. I just saw that the series returned has two boolean values, which makes me think there's one for each piece of the constant.

I think going for a constant is best, because we don't know whether we may use another code from that dataset

I agree. I introduced it here: acbb6c2#diff-69f39fa56efc440a1649e8ba9ff1f1cbR24 as suggested.
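
On the two boolean values mentioned in this thread: the whole string (code and description together) is compared as a single value, and pandas compares it element-wise, so the result has one boolean per row of the data frame rather than one per piece of the constant. A minimal sketch, assuming pandas and a two-row frame (the rows are illustrative and are not the actual fixture contents):

```python
import pandas as pd

ELECTION_LEGAL_ENTITY = '409-0 - CANDIDATO A CARGO POLITICO ELETIVO'

# Illustrative two-row frame; the real fixture may hold different rows.
dataframe = pd.DataFrame({
    'legal_entity': [
        ELECTION_LEGAL_ENTITY,
        '401-4 - EMPRESA INDIVIDUAL IMOBILIARIA',
    ],
})

# Comparing a column with a string is element-wise: one boolean per row.
is_election_expense = dataframe['legal_entity'] == ELECTION_LEGAL_ENTITY

print(is_election_expense.tolist())
```

With two rows the series has two values, which would explain why the tests index the prediction with [0] and [1].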

lipemorais · 2017-10-19T18:56:19Z

rosie/chamber_of_deputies/tests/test_election_expenses_classifier.py

@@ -11,16 +11,16 @@ class TestElectionExpensesClassifier(TestCase):
    def setUp(self):
        self.dataset = pd.read_csv('rosie/chamber_of_deputies/tests/fixtures/election_expenses_classifier.csv',
                                   dtype={'name': np.str, 'legal_entity': np.str})
-        self.subject = ElectionExpensesClassifier()
+        self.election_expenser_classifier = ElectionExpensesClassifier()


I'm aiming to help people who have no context about a classifier understand what this refers to.

How does self.classifier help with that? Or is this tip addressing something else that I didn't get here?

lipemorais · 2017-10-19T21:18:07Z

rosie/chamber_of_deputies/tests/test_election_expenses_classifier.py


    def test_is_not_election_company(self):
-        self.assertEqual(self.subject.predict(self.dataset)[1], False)
+        self.assertEqual(self.election_expenser_classifier.predict(self.dataset)[1], False)

    def test_fit(self):


Why is the classifier itself being returned here by fit?

I believe that fit in this case is not supposed to change anything in the data frame (but it is kept due to the scikit-learn architecture). Can you shed some light on this for us, @Irio?
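
For context, scikit-learn estimators conventionally return self from fit so that calls can be chained. The sketch below illustrates that convention with a simplified stand-in class (the name ElectionExpensesClassifierSketch is hypothetical; the real classifier also inherits from TransformerMixin, which is left out here to keep the example dependent on pandas only):

```python
import pandas as pd


class ElectionExpensesClassifierSketch:
    """Simplified, hypothetical stand-in for the classifier under review."""

    def fit(self, dataframe):
        # A rule-based classifier has nothing to learn from the data;
        # returning self follows the scikit-learn convention.
        return self

    def predict(self, dataframe):
        return dataframe['legal_entity'] == '409-0 - CANDIDATO A CARGO POLITICO ELETIVO'


dataframe = pd.DataFrame(
    {'legal_entity': ['409-0 - CANDIDATO A CARGO POLITICO ELETIVO']}
)
classifier = ElectionExpensesClassifierSketch()

# Because fit returns the classifier itself, fit and predict can be chained.
prediction = classifier.fit(dataframe).predict(dataframe)
print(prediction.tolist())
```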

coveralls · 2017-10-26T12:55:54Z

Changes Unknown when pulling bbb0975 on lipemorais:refactor-election-expenses-classifier into datasciencebr:master.

coveralls · 2017-10-27T02:42:46Z

Coverage increased (+0.008%) to 98.053% when pulling beceb14 on lipemorais:refactor-election-expenses-classifier into c599dc7 on datasciencebr:master.

coveralls · 2017-10-27T02:44:23Z

Coverage increased (+0.008%) to 98.053% when pulling beceb14 on lipemorais:refactor-election-expenses-classifier into c599dc7 on datasciencebr:master.

coveralls · 2017-10-27T03:12:04Z

Coverage increased (+0.01%) to 98.058% when pulling dc7b91a on lipemorais:refactor-election-expenses-classifier into c599dc7 on datasciencebr:master.

anaschwendler · 2017-10-27T13:08:56Z

Hey @lipemorais, you added the Dockerfile to this PR again; I think we should keep that change in the other PR.

lipemorais · 2017-10-27T16:09:08Z

Hey @lipemorais, you added the Dockerfile to this PR again; I think we should keep that change in the other PR.

Thank you @anaschwendler

coveralls · 2017-10-27T16:13:01Z

Coverage increased (+0.01%) to 98.058% when pulling 250022a on lipemorais:refactor-election-expenses-classifier into c599dc7 on datasciencebr:master.

coveralls · 2017-10-31T02:09:25Z

Coverage increased (+0.01%) to 98.058% when pulling 6025ef5 on lipemorais:refactor-election-expenses-classifier into 49c069d on datasciencebr:master.

lipemorais · 2017-10-31T02:13:23Z

rosie/chamber_of_deputies/tests/test_election_expenses_classifier.py

-    def test_is_election_company(self):
-        self.assertEqual(self.subject.predict(self.dataset)[0], True)
+    def test_legal_entity_is_a_election_company(self):
+        self.dataframe = pd.read_csv('rosie/chamber_of_deputies/tests/fixtures/election_expenses_classifier.csv',


Hey @anaschwendler @cuducos

How could I create a data frame here from a string or a data structure? I'd like to avoid having tests access the file system, and to be more specific by using only what is necessary for this test.
Something like: pd.read('just the information necessary for this test')

A dataframe is a structure of rows and columns. So something like pd.DataFrame(data=[['my text']]) might work (the outer list holds the rows; the inner list is one row with a single column, I guess).

Nice! With your tip and some searching I was able to do what I was planning. :)
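
A minimal sketch of building the data frame in memory, assuming pandas. The helper name create_dataframe and the first column name (congressperson_name) are assumptions for illustration; only name and legal_entity appear in the fixture's dtype mapping, and the helper that ended up in the PR may look different:

```python
import pandas as pd

# Hypothetical column names: 'name' and 'legal_entity' come from the fixture's
# dtype mapping, 'congressperson_name' is an assumption for the first value.
COLUMNS = ['congressperson_name', 'name', 'legal_entity']


def create_dataframe(rows):
    """Build a data frame from a list of rows without touching the file system."""
    return pd.DataFrame(rows, columns=COLUMNS)


dataframe = create_dataframe([
    ['PAULO ROGERIO ROSSETO DE MELO',
     'POSTO ROTA 116 DERIVADOS DE PETROLEO LTDA',
     '401-4 - EMPRESA INDIVIDUAL IMOBILIARIA'],
])

print(dataframe.shape)
print(dataframe['legal_entity'][0])
```

Each test can then declare only the rows it needs, which keeps the scenario visible next to the assertion.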

lipemorais · 2017-11-02T02:58:06Z

Hey @anaschwendler @cuducos, could you take another look and check whether this can be merged?

cuducos · 2017-11-03T18:29:52Z

rosie/chamber_of_deputies/classifiers/election_expenses_classifier.py

-    def predict(self, X):
-        return X['legal_entity'] == '409-0 - CANDIDATO A CARGO POLITICO ELETIVO'
+    def predict(self, dataframe):
+        ELECTION_LEGAL_ENTITY = '409-0 - CANDIDATO A CARGO POLITICO ELETIVO'


PEP 8 says that constants are usually defined at the module level (not the method level). Would you mind changing this minor detail?
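
A sketch of what the module-level version could look like, assuming pandas. The class name ElectionExpensesClassifierSketch is a hypothetical stand-in and omits the TransformerMixin base class and the fit and transform methods:

```python
import pandas as pd

# Module-level constant, written in capitals as PEP 8 recommends.
ELECTION_LEGAL_ENTITY = '409-0 - CANDIDATO A CARGO POLITICO ELETIVO'


class ElectionExpensesClassifierSketch:
    """Hypothetical stand-in showing only the predict method."""

    def predict(self, dataframe):
        # The method body stays free of literals and reads as a rule.
        return dataframe['legal_entity'] == ELECTION_LEGAL_ENTITY


dataframe = pd.DataFrame({
    'legal_entity': [
        '401-4 - EMPRESA INDIVIDUAL IMOBILIARIA',
        ELECTION_LEGAL_ENTITY,
    ],
})
result = ElectionExpensesClassifierSketch().predict(dataframe)
print(result.tolist())
```

Defining the constant once at module level also lets the tests import it instead of repeating the string.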

cuducos · 2017-11-03T18:32:24Z

rosie/chamber_of_deputies/tests/test_election_expenses_classifier.py

-    def test_tranform(self):
-        self.assertEqual(self.subject.transform(), self.subject)
+    def test_legal_entity_is_not_election_company(self):
+        self.dataframe = self._create_dataframe([['PAULO ROGERIO ROSSETO DE MELO', 'POSTO ROTA 116 DERIVADOS DE PETROLEO LTDA', '401-4 - EMPRESA INDIVIDUAL IMOBILIARIA']])


What about making this test a bit more readable?

    def test_legal_entity_is_not_election_company(self):
        data = [[
            'PAULO ROGERIO ROSSETO DE MELO',
            'POSTO ROTA 116 DERIVADOS DE PETROLEO LTDA',
            '401-4 - EMPRESA INDIVIDUAL IMOBILIARIA'
        ]]
        self.dataframe = self._create_dataframe(data)
        prediction_result = self.election_expenser_classifier.predict(self.dataframe)
        self.assertEqual(prediction_result[0], False)

@cuducos, I fixed the point mentioned above. :)

cuducos reviewed Oct 19, 2017

lipemorais mentioned this pull request Oct 19, 2017

Simplify classifiers code #50

Open

lipemorais commented Oct 19, 2017

lipemorais force-pushed the refactor-election-expenses-classifier branch from 3d797ab to beceb14 Compare October 27, 2017 02:37

Felipe Morais and others added 9 commits October 27, 2017 14:03

refactor X to a more meaningful name in election expenses classifier

e47610b

Remove a Rails accent of use subject in favor of Zen of Python: explicit is better than implicit and readbility counts

993bab4

changes from dataset to dataframe metions in election expenses classifier

ee324e9

Gives a more meaningful constant name for a election legal entity used on election expenses classifier

e572e9b

Makes the assertion more clear about what is been asserted

afce56e

fixes a typo on transform for election expense classifier test

8fa947e

Gives a more meaningful name for transform test on election expenses classifier

4ea4974

Gives a more meaningful name for fit test on election expenses classifier

d906fe7

refactor test to just read from fixture file when necessary to improve test performance

250022a

lipemorais force-pushed the refactor-election-expenses-classifier branch from dc7b91a to 250022a Compare October 27, 2017 16:03

Merge branch 'master' into refactor-election-expenses-classifier

6025ef5

lipemorais commented Oct 31, 2017

lipemorais added 2 commits November 2, 2017 00:55

Create dataframe focused on each test scenario

d63ba5d

DRY dataframe creation

e9c3ac8

anaschwendler requested a review from cuducos November 3, 2017 17:04

Merge branch 'master' into refactor-election-expenses-classifier

1fc676f

cuducos reviewed Nov 3, 2017

Felipe Morais added 2 commits November 3, 2017 16:58

Moves ELECTION_LEGAL_ENTITY constant to a module level

a3bd233

Make test data easier to read on test_election_expenses_classifier

8b14e51

cuducos merged commit 99b4c32 into okfn-brasil:master Nov 4, 2017
