Skip to content
This repository has been archived by the owner on Mar 1, 2018. It is now read-only.

Refactor election expenses classifier #89

Merged

Conversation

lipemorais
Copy link
Contributor

@lipemorais lipemorais commented Oct 19, 2017

fixes #50

This a WIP I just opening it to have feedbacks earlier.

What is the purpose of this Pull Request?
The purpose of this PR is make the classifier election_expenses_classifier.py easier to understand, including the tests.

What was done to achieve this purpose?
I renamed some variable to be more meaningful and some refactoring.

How to test if it really works?
Just run the Rosie test and see if everything keep working. A refactor should not change any behaviour or feature. python rose.py test

Who can help reviewing it?
@anaschwendler @cuducos @jtemporal @Irio

TODO

  • Refactoring classifier source code
  • Give a more meaningful name for the subject under test
  • Refactor tests name to be more meaningful
  • Meaningful constant for 409-0 - CANDIDATO A CARGO POLITICO ELETIVO
  • Avoid read from a file where it's not necessary to improve test performance

@@ -14,11 +14,11 @@ class ElectionExpensesClassifier(TransformerMixin):
Brazilian Federal Revenue category of companies, preceded by its code.
"""

def fit(self, X):
def fit(self, dataset):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is just be being annoying, but I think that conceptually this refactor is here also for these kind of semantic changes too. That said: Pandas jargon uses data frames instead of datasets. What do you think about dataframe instead of dataset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got this tip from the test: self.dataset = pd.read_csv('rosie/chamber_of_deputies/tests/fixtures/election_expenses_classifier.csv', dtype={'name': np.str, 'legal_entity': np.str})

So I believe that it should change too. Is it right?

Just a newbie question. What is the difference between a dataframe and a dataset?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between a dataframe and a dataset?

Dataset is simply a collection os sets (groups) of information. A data frame is more specific, it's a 2 dimensional structure holding data ; )

@@ -11,16 +11,16 @@ class TestElectionExpensesClassifier(TestCase):
def setUp(self):
self.dataset = pd.read_csv('rosie/chamber_of_deputies/tests/fixtures/election_expenses_classifier.csv',
dtype={'name': np.str, 'legal_entity': np.str})
self.subject = ElectionExpensesClassifier()
self.election_expenser_classifier = ElectionExpensesClassifier()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering: as each test file refers to only one classifier I wouldn't bother shortening it to self.classifier

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I aiming to achieve people that have no context of a classifier, to understand what is this referencing to.

How do you self.classifier helping with it? Or is this tip addressing something else that I didn't get here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I aiming to achieve people that have no context of a classifier

Hopefully readers know which file they are reading… or maybe I'm just too optimistic.

But nothing against self.election_expenser_classifier per se except that I was trying something a bit shorter (minor enhancement in code readability) and as meaningful as before (guessing the file name would be a context to inform the reader about which classifier we're talking about). But that was just a minor suggestion… feel free to throw it away ; )

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that way, I'm with @cuducos, I prefer something meaningful but simple at the same way :)

def predict(self, X):
return X['legal_entity'] == '409-0 - CANDIDATO A CARGO POLITICO ELETIVO'
def predict(self, dataset):
return dataset['legal_entity'] == '409-0 - CANDIDATO A CARGO POLITICO ELETIVO'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @cuducos @anaschwendler! Any tip how could I give a more meaningful name for 409-0 - CANDIDATO A CARGO POLITICO ELETIVO?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make of it a class constant, e.g.:

class ElectionExpensesClassifier:

    ELECTION_LEGAL_ENTITY = '409 - …'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What mean this 409-0 I would like to give it a name the assertion a the test name could be improved. Looks a kind of code but I'm not sure how could I call it.

The same for CANDIDATO A CARGO POLITICO ELETIVO.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The answer for that questions is here: https://concla.ibge.gov.br/estrutura/natjur-estrutura/natureza-juridica-2014/409-0-candidato-a-cargo-politico-eletivo.html

I think that going for constant is the best, because we don't know if we may use another code from that dataset

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The answer for that questions is here: https://concla.ibge.gov.br/estrutura/natjur-estrutura/natureza-juridica-2014/409-0-candidato-a-cargo-politico-eletivo.html

I'm sorry but this does not enough context to give name for 409-0 and CANDIDATO A CARGO POLITICO ELETIVO. I even know if this is really splitted stuffs that could receive different names? May that is the reason why it's strange to me. I just saw that the serie returned has two boolean values what make thing that it's one for each pieces of the constant.

I think that going for constant is the best, because we don't know if we may use another code from that dataset

I agree. I introduced it here: acbb6c2#diff-69f39fa56efc440a1649e8ba9ff1f1cbR24 as suggested.

@@ -11,16 +11,16 @@ class TestElectionExpensesClassifier(TestCase):
def setUp(self):
self.dataset = pd.read_csv('rosie/chamber_of_deputies/tests/fixtures/election_expenses_classifier.csv',
dtype={'name': np.str, 'legal_entity': np.str})
self.subject = ElectionExpensesClassifier()
self.election_expenser_classifier = ElectionExpensesClassifier()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I aiming to achieve people that have no context of a classifier, to understand what is this referencing to.

How do you self.classifier helping with it? Or is this tip addressing something else that I didn't get here?


def test_is_not_election_company(self):
self.assertEqual(self.subject.predict(self.dataset)[1], False)
self.assertEqual(self.election_expenser_classifier.predict(self.dataset)[1], False)

def test_fit(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the classifier itself been returned here on fit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that the fit is this case is not supposed to change anything in the data frame (but is kept due to scikitlearn architecture). Can you shed some light on us @Irio?

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling bbb0975 on lipemorais:refactor-election-expenses-classifier into ** on datasciencebr:master**.

@lipemorais lipemorais force-pushed the refactor-election-expenses-classifier branch from 3d797ab to beceb14 Compare October 27, 2017 02:37
@coveralls
Copy link

Coverage Status

Coverage increased (+0.008%) to 98.053% when pulling beceb14 on lipemorais:refactor-election-expenses-classifier into c599dc7 on datasciencebr:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.008%) to 98.053% when pulling beceb14 on lipemorais:refactor-election-expenses-classifier into c599dc7 on datasciencebr:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) to 98.058% when pulling dc7b91a on lipemorais:refactor-election-expenses-classifier into c599dc7 on datasciencebr:master.

@anaschwendler
Copy link
Collaborator

Hey @lipemorais you added the Dockerfile to the PR again, I think that we must keep that to that PR.

@lipemorais lipemorais force-pushed the refactor-election-expenses-classifier branch from dc7b91a to 250022a Compare October 27, 2017 16:03
@lipemorais
Copy link
Contributor Author

lipemorais commented Oct 27, 2017

Hey @lipemorais you added the Dockerfile to the PR again, I think that we must keep that to that PR.

Thank you @anaschwendler

@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) to 98.058% when pulling 250022a on lipemorais:refactor-election-expenses-classifier into c599dc7 on datasciencebr:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.01%) to 98.058% when pulling 6025ef5 on lipemorais:refactor-election-expenses-classifier into 49c069d on datasciencebr:master.

def test_is_election_company(self):
self.assertEqual(self.subject.predict(self.dataset)[0], True)
def test_legal_entity_is_a_election_company(self):
self.dataframe = pd.read_csv('rosie/chamber_of_deputies/tests/fixtures/election_expenses_classifier.csv',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @anaschwendler @cuducos

How could I create a data frame here from a string or a data structure? Just to avoid have tests accessing file system and be able to be more specific with just what is necessary for this test.
Something like: pd.read('just the information necessary for this test)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A dataframe is a structure of rows and columns. So something like pd.DataFrame(data=[['my text']]) might work (the first list is the row, the inside list is the single column I guess).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice with your tip and some search I was able to do what I was planing. :)

@lipemorais
Copy link
Contributor Author

Hey @anaschwendler @cuducos Could take a new look and check if is able or not to be merged?

def predict(self, X):
return X['legal_entity'] == '409-0 - CANDIDATO A CARGO POLITICO ELETIVO'
def predict(self, dataframe):
ELECTION_LEGAL_ENTITY = '409-0 - CANDIDATO A CARGO POLITICO ELETIVO'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 says that constants usually should be defined in the module level (not method level), would you mind changing this minor detail?

def test_tranform(self):
self.assertEqual(self.subject.transform(), self.subject)
def test_legal_entity_is_not_election_company(self):
self.dataframe = self._create_dataframe([['PAULO ROGERIO ROSSETO DE MELO', 'POSTO ROTA 116 DERIVADOS DE PETROLEO LTDA', '401-4 - EMPRESA INDIVIDUAL IMOBILIARIA']])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about making this test a bit more readable?

    def test_legal_entity_is_not_election_company(self):
        data = [[
            'PAULO ROGERIO ROSSETO DE MELO',
            'POSTO ROTA 116 DERIVADOS DE PETROLEO LTDA',
            '401-4 - EMPRESA INDIVIDUAL IMOBILIARIA'
         ]]
        self.dataframe = self._create_dataframe(data)
        prediction_result = self.election_expenser_classifier.predict(self.dataframe)
        self.assertEqual(prediction_result[0], False)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cuducos, I fixed the point mentioned above. :)

@cuducos cuducos merged commit 99b4c32 into okfn-brasil:master Nov 4, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simplify classifiers code
4 participants