Introducing Federal Senate script to Rosie! #51
Conversation
This PR is a work in progress, but it has started today. The things that need to be done are:

- [ ] Create a sample random dataset to use to create tests.
- [ ] Finish the `adapter.py` script, writing the tests first (that is the reason why I stopped the script)
- [ ] Learn how to run the tests only for the federal senate
- [ ] Check if everything is working

Any opinion is important, so feel free to share it :)
By running `python rosie.py run federal_senate` we can start finding suspicious reimbursements! 🎉
`rosie.py` (outdated):

```diff
@@ -17,7 +17,7 @@ def help():
 def run():
-    import rosie, rosie.chamber_of_deputies
+    import rosie, rosie.chamber_of_deputies, rosie.federal_senate
```
I might have missed it before but… multiple `import`s is not recommended ; )
So, do you see a way to fix it?
@jtemporal do you know a better way to fix it?
```python
import rosie
import rosie.chamber_of_deputies
import rosie.federal_senate
```
```python
if self.settings.UNIQUE_IDS:
    self.suspicions = self.dataset[self.settings.UNIQUE_IDS].copy()
else:
    self.suspicions = self.dataset.copy()
```
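A toy run of this branch logic, with a hypothetical `Settings` class and invented column names (none of these stand-ins come from the actual codebase), might look like:

```python
import pandas as pd

# Hypothetical stand-ins for Rosie's settings and dataset (illustration only)
class Settings:
    UNIQUE_IDS = ['applicant_id', 'document_id']  # assumed column names

settings = Settings()
dataset = pd.DataFrame({
    'applicant_id': [1, 2],
    'document_id': [10, 20],
    'net_value': [100.0, 250.0],
})

# Same branch as above: keep only the ID columns when UNIQUE_IDS is set,
# otherwise carry every column of the dataset into the suspicions frame.
if settings.UNIQUE_IDS:
    suspicions = dataset[settings.UNIQUE_IDS].copy()
else:
    suspicions = dataset.copy()

print(list(suspicions.columns))  # → ['applicant_id', 'document_id']
```

This is what the discussion below is about: without `UNIQUE_IDS`, the `else` branch fires and the suspicions file inherits all columns.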
Is there a chance we end up here without `UNIQUE_IDS`? I thought it was supposed to be a kind of standard/requirement…
I really don't know what to do about it. I want to change it, but I don't know what is necessary for `chamber_of_deputies`.
We will need to figure out a way to uniquely identify each Federal Senate reimbursement, otherwise the suspicions file will have all the columns that can be found in the original dataset... that's what we use the `UNIQUE_IDS` for, so if we are comfortable with all the columns in the suspicions file there's no reason to set a `UNIQUE_IDS`.
On the matter of creating a unique ID for each reimbursement, I tried combining `date`, `cnpj_cpf` and `document_id` and yet wasn't able to create a string that was unique ¯\_(ツ)_/¯
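A quick way to check whether a column combination is unique can be sketched with pandas; the sample rows below are invented to mirror the problem described (two reimbursements sharing all three values):

```python
import pandas as pd

# Invented sample rows: the first two reimbursements share the same
# date, cnpj_cpf and document_id, so the combined key is not unique.
df = pd.DataFrame({
    'date': ['2017-03-01', '2017-03-01', '2017-03-02'],
    'cnpj_cpf': ['00000000000191', '00000000000191', '11111111000100'],
    'document_id': ['42', '42', '43'],
})

# Build a candidate key by joining the three columns…
key = df['date'] + '|' + df['cnpj_cpf'] + '|' + df['document_id']

# …and check for collisions.
print(key.duplicated().any())  # → True: the combination is not unique
```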
`document_id` will never make a great combination for `UNIQUE_IDS` because there are some receipts that don't have one, and some receipts have the `sem fatura` (no invoice) problem :/
That's why I thought combining those 3 columns might help... but it wasn't enough. I believe we need a brainstorm to figure this one out... So far I'm good with having all the columns in the suspicions file :)
Do we need consistency in these unique identifiers? I mean, considering Rosie runs today and tomorrow: is it really required that document X has exactly the same ID today as it would have tomorrow? If not, we can bring the pandas index (created by default) into the dataset.
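If cross-run consistency really isn't needed, promoting the default pandas `RangeIndex` to a column is a one-liner; a minimal sketch (the column name `id` is my choice, not from the codebase):

```python
import pandas as pd

# Any dataset gets a default RangeIndex (0, 1, 2, …) when loaded.
df = pd.DataFrame({'net_value': [100.0, 250.0, 80.0]})

# Promote that index to a regular 'id' column.
df = df.reset_index().rename(columns={'index': 'id'})

print(df['id'].is_unique)  # → True
```

The trade-off is exactly the one raised above: these IDs are unique within a run, but change if rows are dropped or reordered before the next run.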
`rosie/federal_senate/adapter.py` (outdated):

```python
def prepare_cpnj_cpf(self):
    self._dataset = self._dataset[self._dataset['cnpj_cpf'].notnull()]
    self._dataset['document_type'] = 'simple_receipt'
```
Is there a rationale for that? Can you mention it in a comment?
Yes, it is a standardization for the core module :) To run `invalid_cpnj_cpf` the dataset must have this field. I can comment it in the code!
Minor refactor on the Federal Senate Adapter:

- Moved column creation to a method of its own
- Created a condition so that the big file is only generated if it doesn't exist, facilitating tests

Tests assume that all steps worked successfully and check that the final file is as it should be:

- columns renamed after the `COLUMNS` variable
- `document_type` column created and filled with `simple_receipt`
Names now reflect what the method and test really do.
@jtemporal there is only one thing missing: @cuducos asked if we can comment in the code why we create another column 👏
Force-pushed from a369611 to cb62ce8
We have a problem: we need Rosie to update the data from the Federal Senate every time she runs, just like she does with the Chamber of Deputies data. Right now that doesn't happen. Also, she is expecting that we already have the data in place. We need help to mock the update in tests. On hold until we finish this; it will be resolved soon. cc @cuducos
Running the update method by default, but mocking it in tests: that sounds like a really good approach IMHO.
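A sketch of that approach with `unittest.mock`; the `Dataset` class and its `update`/`run` methods below are placeholders standing in for the real Rosie API, not its actual names:

```python
from unittest import mock

# Placeholder for the real class whose update() would hit the network.
class Dataset:
    def update(self):
        raise RuntimeError('would download Federal Senate data')

    def run(self):
        self.update()          # refresh the data by default on every run
        return 'suspicions'    # then go look for suspicious reimbursements

# In tests, patch update() so no download happens but run() still works.
with mock.patch.object(Dataset, 'update') as fake_update:
    result = Dataset().run()

fake_update.assert_called_once()
print(result)  # → suspicions
```

This keeps the default behavior (always update) while the test suite stays fast and offline.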
This idea sounds really good to me; that could be a way. If it looks good to @jtemporal we can work on it tomorrow, or I can try it later.
That's the plan.
A better approach to the required missing information on `document_type` in the Federal Senate dataset
(I never do this alone, so I stopped putting my name in the beginning)