New dataset with companies information #19

Irio · 2016-08-21T14:55:51Z

This fulfills our need (#14) for the addresses and coordinates of all companies listed in the expenses. The initial version is the result of a collaboration between me, @filipelinhares and @andrepinho.

Here's a short summary of what the current dataset has (not sure if this notebook fits well in the repository, probably not): https://gist.github.com/Irio/002a1d3842466aacd0027c6f382a2d9b
Once this pull request gets approved, I will upload it to Amazon S3.

The current version of the dataset contains a list of 59,985 companies; counting just the Brazilian ones, 3,453 are missing because of timeouts returned by the API used.

Missing

Get coordinates of addresses
Translate dataset
Upload file to Amazon S3 and add its URL to src/fetch_datasets.py

cuducos · 2016-09-01T13:26:39Z

src/clean_cnpj_info_dataset.py

@@ -0,0 +1,78 @@
+from itertools import chain


itertools.chain 💕

cuducos · 2016-09-01T14:15:19Z

I added some inline comments. Nothing really important; my intent was to make the code easier to understand and, where needed, more optimized. Most of the comments are more stylistic than technical, and whether or not you accept the suggestions, I feel this is good to merge already.

I'll leave it up to you to merge, because I'm not sure about the 3rd item from the TODO list (Upload file to Amazon S3), so feel free to go through the comments, call me names for being such a dick, upload the files and finally merge ; )

cuducos · 2016-09-01T14:34:31Z

Rereading I found another issue — maybe more important than all the other ones.

Defining file paths like 'data/cnpj_info.xz' might raise an error in non-Unix platforms (aka Windows). The proper way to do it is to use os.path.join('data', 'cnpj_info.xz') so Python uses the proper slash (/ or \) according to the OS.

We might also need to review our code base, because this might have gone unnoticed in other PRs…
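A minimal sketch of the suggestion above, assuming the directory and file names from the example path (`data` and `cnpj_info.xz`); the `dataset_path` helper is hypothetical and not part of this pull request:

```python
import os

# Names taken from the example path in the comment above.
DATA_DIR = 'data'
DATASET_NAME = 'cnpj_info.xz'


def dataset_path(directory=DATA_DIR, filename=DATASET_NAME):
    """Join the parts with the path separator of the current platform."""
    return os.path.join(directory, filename)


print(dataset_path())
```

The printed path uses `/` on Unix-like systems and `\` on Windows, so the same script builds a native path on both.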

Irio · 2016-09-03T14:21:17Z

Squashing a few commits and merging this branch now. The work isn't completely finished, since there are 3,391 CNPJs still to be fetched (they keep returning HTTP status 504). There are 57,644 geolocated companies, out of a total of 60K. Some seem to be incorrectly located by Google Maps, but I'm happy with this first version for now.

https://nbviewer.jupyter.org/gist/Irio/22026574ebc597bdedfbdb13989ffe98
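One possible way to retry the CNPJs that keep returning 504 is sketched below. This is not the code in this pull request: `fetch_with_retries` and the `fetch` callable (assumed to return a `(status_code, payload)` pair) are hypothetical, and the attempt count and delays are arbitrary.

```python
import time

GATEWAY_TIMEOUT = 504


def fetch_with_retries(fetch, cnpj, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry fetch(cnpj) while it returns 504, waiting longer each time.

    fetch is a hypothetical callable returning (status_code, payload).
    Returns the payload on a 200 response, or None if every attempt
    returned 504. Any other status raises RuntimeError.
    """
    for attempt in range(attempts):
        status, payload = fetch(cnpj)
        if status == 200:
            return payload
        if status != GATEWAY_TIMEOUT:
            raise RuntimeError('unexpected status {} for {}'.format(status, cnpj))
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return None
```

CNPJs for which this returns `None` could be collected and retried in a later run, rather than blocking the whole fetch.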

To get the dataset, already available on Amazon S3, run the following:

$ python src/fetch_datasets.py

Irio added the work in progress label Aug 21, 2016

Irio self-assigned this Aug 21, 2016

Irio mentioned this pull request Sep 2, 2016

Simple web service to return everything we know about a given reimbursement #34

Closed

Irio force-pushed the im-fetch-cnpj-info branch from c86f24e to 33ef88e September 3, 2016 04:02

Irio removed the work in progress label Sep 3, 2016

Irio force-pushed the im-fetch-cnpj-info branch from 8691b8a to 5438e3e September 3, 2016 14:30

Irio added 8 commits September 3, 2016 11:32

Fetch CNPJ's info using ReceitaWS API

84745f0

Minimaly clean cnpj info dataset

369421f

Save geocoding locations in a tmp folder

65217d8

Use os.path.join to generate appropriate file paths

6b46f66

Prevent data files to be backed up if already in S3

a3d1848

Dasherize data file names

517ae83

Add link to companies.xz to fetch_datasets.py

ca1e5a4

Remove unused imports

75ea2b1

Irio force-pushed the im-fetch-cnpj-info branch from 5438e3e to 75ea2b1 September 3, 2016 14:34

Irio merged commit 1011f8f into master Sep 3, 2016

Irio deleted the im-fetch-cnpj-info branch September 3, 2016 14:34

Irio mentioned this pull request Sep 10, 2016

Find expenses in brothels #39

Closed

Irio pushed a commit that referenced this pull request Feb 27, 2018

Merge pull request #19 from datasciencebr/pyup-update-django-1.10.1-t…

a805500

…o-1.10.2 Update django to 1.10.2

cuducos added a commit that referenced this pull request Feb 28, 2018

Merge pull request #19 from datasciencebr/jtemporal-invalid-cnpj-cpf

df3b969

Adds invalid CNPJ CPF Classifier
