New dataset with companies information #19

Merged
merged 8 commits into from Sep 3, 2016

Conversation

Irio
Collaborator

@Irio Irio commented Aug 21, 2016

This addresses our need (#14) for the addresses and coordinates of all companies listed in the expenses. The initial version is the result of a collaboration between me, @filipelinhares and @andrepinho.

Here's a short summary of what the current dataset has (not sure if this notebook fits well in the repository, probably not): https://gist.github.com/Irio/002a1d3842466aacd0027c6f382a2d9b
Once this pull request gets approved, I will upload it to Amazon S3.

The current version of the dataset contains a list of 59,985 companies; counting just the Brazilian ones, 3,453 are missing because of timeouts returned by the API used.

Missing

  • Get coordinates of addresses
  • Translate dataset
  • Upload file to Amazon S3 and add its URL to src/fetch_datasets.py

@Irio Irio self-assigned this Aug 21, 2016
@@ -0,0 +1,78 @@
from itertools import chain
Collaborator

itertools.chain 💕
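
The diff above only shows the import, so how the script actually uses it is not visible here. As a minimal sketch of what `itertools.chain` offers, assuming made-up batches of CNPJ-like strings (the values and variable names are hypothetical):

```python
from itertools import chain

# Hypothetical batches of CNPJ-like strings, for illustration only.
batches = [
    ['00000000000191', '11111111000111'],
    ['22222222000122'],
]

# chain.from_iterable lazily flattens one level of nesting,
# so no intermediate concatenated lists are built along the way.
flat = list(chain.from_iterable(batches))

# chain(*iterables) does the same when the iterables are passed as arguments.
also_flat = list(chain(batches[0], batches[1]))
```

Both calls yield the items in their original order, one batch after another.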

@cuducos
Collaborator

cuducos commented Sep 1, 2016

I added some inline comments. Nothing really important, but my intent was to make the code easier to understand and more optimized (when needed). Most of the comments are more stylistic than technical, and whether or not you accept the suggestions, I feel this is good to merge already.

I'll leave it up to you to merge, because I'm not sure about the 3rd item from the TODO list (Upload file to Amazon S3). So feel free to go through the comments, call me names for being such a dick, upload the files and finally merge ; )

@cuducos
Collaborator

cuducos commented Sep 1, 2016

Rereading, I found another issue, maybe more important than all the others.

Defining file paths like 'data/cnpj_info.xz' might raise an error on non-Unix platforms (aka Windows). The proper way to do it is to use os.path.join('data', 'cnpj_info.xz') so Python uses the proper separator (/ or \) according to the OS.

We might also need to review our code base, because this might have gone unnoticed in other PRs…
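
The path suggestion above can be sketched as follows. The constant name `DATASET_PATH` is hypothetical; `ntpath` and `posixpath` are the standard-library modules behind `os.path` on Windows and POSIX systems, used here only to show what each platform would produce:

```python
import ntpath
import os.path
import posixpath

# Portable: os.path.join picks the separator of the running OS.
DATASET_PATH = os.path.join('data', 'cnpj_info.xz')

# What the same call yields under each platform's path rules.
windows_style = ntpath.join('data', 'cnpj_info.xz')    # backslash separator
posix_style = posixpath.join('data', 'cnpj_info.xz')   # forward slash separator
```

Building the path once in a named constant also makes it easier to find and update if the data directory ever moves.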

@Irio
Collaborator Author

Irio commented Sep 3, 2016

Squashing a few commits and merging this branch now. The work isn't completely finished, since there are 3,391 CNPJs remaining to be fetched (they keep returning 504 HTTP status). There are 57,644 geolocated companies, from a total of 60K. Some seem to be incorrectly located by Google Maps, but I'm happy with this first version for now.

https://nbviewer.jupyter.org/gist/Irio/22026574ebc597bdedfbdb13989ffe98
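
For the CNPJs that keep returning 504, one common approach is to retry with an increasing delay. This is not taken from the merged script; it is only a sketch, assuming a hypothetical `fetch` callable that takes a CNPJ and returns a `(status, payload)` tuple:

```python
import time


def fetch_with_retries(fetch, cnpj, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(cnpj) and retry on HTTP 504 with exponential backoff.

    `fetch` is a hypothetical callable returning (status, payload);
    it stands in for whatever HTTP client the real script uses.
    """
    for attempt in range(attempts):
        status, payload = fetch(cnpj)
        if status != 504:
            return status, payload
        # Wait 1x, 2x, 4x... the base delay before the next attempt,
        # but do not sleep after the final one.
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    # Every attempt timed out: hand back the last 504 so the caller can log it.
    return status, payload
```

Passing `sleep` as a parameter keeps the function easy to test without actually waiting.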

To get the dataset, already available on Amazon S3, run the following:

$ python src/fetch_datasets.py

@Irio Irio merged commit 1011f8f into master Sep 3, 2016
@Irio Irio deleted the im-fetch-cnpj-info branch September 3, 2016 14:34
@Irio Irio mentioned this pull request Sep 10, 2016
Irio pushed a commit that referenced this pull request Feb 27, 2018
cuducos added a commit that referenced this pull request Feb 28, 2018

2 participants