New dataset with companies information #19
@@ -0,0 +1,78 @@
from itertools import chain
itertools.chain
💕
I added some inline comments. Nothing really important; my intent was to make the code easier to understand and more optimized (where needed). Most of the comments are more stylistic than technical, and whether or not you accept the suggestions, I feel this is good to merge already. I'll leave it up to you to merge, because I'm not sure about the 3rd item on the TODO list (Upload file to Amazon S3), so feel free to go through the comments, calling me names for being such a dick, uploading the files and finally merging ; )
Rereading, I found another issue, maybe more important than all the other ones. Defining file paths like We might also need to review our code base, because this might have passed unnoticed in other PRs…
c86f24e to 33ef88e
Squashing a few commits and merging this branch now. The work isn't completely finished, since there are 3,391 CNPJs still to be fetched (they keep returning HTTP status 504). There are 57,644 geolocated companies out of a total of 60K. Some seem to be incorrectly located by Google Maps, but I'm happy with this first version for now. https://nbviewer.jupyter.org/gist/Irio/22026574ebc597bdedfbdb13989ffe98
To get the dataset, already available on Amazon S3, run the following:
$ python src/fetch_datasets.py
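For the CNPJs that keep coming back with 504, one option is to retry them in a later pass with exponential backoff. The snippet below is only a rough sketch of that idea, not the code in this branch: `fetch` is a hypothetical callable returning `(status_code, payload)`, and `fetch_with_retries` and `fetch_remaining` are illustrative names.

```python
import time


def fetch_with_retries(fetch, cnpj, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(cnpj) until it stops returning 504 or attempts run out.

    `fetch` is a hypothetical callable returning (status_code, payload).
    Returns the payload on HTTP 200, or None if no attempt succeeded.
    """
    for attempt in range(max_attempts):
        status, payload = fetch(cnpj)
        if status == 200:
            return payload
        if status != 504:
            # Assumption: only gateway timeouts are worth retrying.
            return None
        if attempt < max_attempts - 1:
            # Wait 1x, 2x, 4x, ... the base delay between attempts.
            sleep(base_delay * 2 ** attempt)
    return None


def fetch_remaining(fetch, cnpjs, **kwargs):
    """Return (fetched, still_missing) for an iterable of CNPJs."""
    fetched, still_missing = {}, []
    for cnpj in cnpjs:
        payload = fetch_with_retries(fetch, cnpj, **kwargs)
        if payload is None:
            still_missing.append(cnpj)
        else:
            fetched[cnpj] = payload
    return fetched, still_missing
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting; the `still_missing` list can be saved and fed back into a later run.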
8691b8a to 5438e3e
5438e3e to 75ea2b1
…o-1.10.2 Update django to 1.10.2
Adds invalid CNPJ CPF Classifier
Solves our need (#14) for the addresses and coordinates of all companies listed in the expenses. The initial version is the result of a collaboration between me, @filipelinhares and @andrepinho.
Here's a short summary of what the current dataset has (not sure if this notebook fits well in the repository, probably not): https://gist.github.com/Irio/002a1d3842466aacd0027c6f382a2d9b
Once this pull request gets approved, I will upload it to Amazon S3.
The current version of the dataset contains a list of 59,985 companies; counting just the Brazilian ones, 3,453 are missing due to timeouts returned by the API used.
Missing
src/fetch_datasets.py