
co-demos/OCF_datasets


OPEN CORPORATE FACTS / scripts & data


This repo is meant to gather scripts and copies of the OCF datasets.

The Python scripts and Jupyter notebooks in this repo query the API Platform instance set up by Leonard, retrieve the JSON responses, and save them to .json files, so that the datasets can be rebuilt and reused for other purposes.

The scripts store the JSON and CSV files in the corresponding folders under /datasets.
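
The request-and-save loop described above could be sketched as follows. This is a minimal illustration, not the notebooks' actual code: the `page` query parameter, the `page_<n>.json` file naming, and the function names `fetch_page` and `save_pages` are assumptions, and the base URL has to be replaced with the real API endpoint.

```python
import json
from pathlib import Path
from urllib.request import urlopen


def fetch_page(base_url, page):
    """Request one page of results and return the decoded JSON."""
    # Assumption: the API paginates with a `page` query parameter.
    with urlopen(f"{base_url}?page={page}") as response:
        return json.load(response)


def save_pages(base_url, pages, out_dir, fetch=fetch_page):
    """Fetch each page and write it to <out_dir>/page_<n>.json."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for page in pages:
        data = fetch(base_url, page)
        path = out_dir / f"page_{page}.json"
        path.write_text(
            json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        written.append(path)
    return written
```

The `fetch` argument is injectable so the saving logic can be tried with a stub function before pointing it at the live API.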

The datasets are split into chunks, each corresponding to n scraped pages:

  • the "corporations" data in ./datasets/corporations/
  • the "account results" data in ./datasets/account_results/
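
One way to picture the chunking is to compute the page range covered by each chunk. The sketch below is only an illustration: the helper names and the `pages_<start>-<end>.json` file naming are hypothetical and may not match the files actually stored in the folders above.

```python
def page_chunks(first_page, last_page, n):
    """Yield inclusive (start, end) page ranges of at most n pages each."""
    if n < 1:
        raise ValueError("n must be at least 1")
    for start in range(first_page, last_page + 1, n):
        yield start, min(start + n - 1, last_page)


def chunk_filename(dataset, start, end):
    """Build a hypothetical chunk path inside ./datasets/<dataset>/."""
    return f"datasets/{dataset}/pages_{start}-{end}.json"


# Example: 120 pages scraped in chunks of 50 pages.
for start, end in page_chunks(1, 120, 50):
    print(chunk_filename("corporations", start, end))
```

The last chunk is simply shorter when the number of pages is not a multiple of n.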

Note :

For compatibility purposes, the .csv files use a pipe (|) as separator.
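
Reading or writing such files only requires passing the separator explicitly. The following is a small sketch with the standard `csv` module, assuming the separator is the `|` character shown in the note; the column names and the sample row are made up for the example.

```python
import csv
import io

SEPARATOR = "|"  # assumption: the character shown in the note above


def write_rows(rows, fieldnames, stream):
    """Write dict rows to stream with a header line, using SEPARATOR."""
    writer = csv.DictWriter(
        stream, fieldnames=fieldnames, delimiter=SEPARATOR, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


def read_rows(stream):
    """Read SEPARATOR-delimited rows back as a list of dicts."""
    return list(csv.DictReader(stream, delimiter=SEPARATOR))


# Hypothetical columns, for illustration only.
buffer = io.StringIO()
write_rows([{"name": "ACME, Inc.", "city": "Paris"}], ["name", "city"], buffer)
buffer.seek(0)
rows = read_rows(buffer)
```

With this separator, values that contain commas do not need quoting.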


install jupyter notebooks

You can install all the notebooks' dependencies by following these steps:

# install python virtual env
pip install virtualenv
virtualenv venv

# activate virtual env
source venv/bin/activate

# install dependencies in venv
pip install python-dotenv
# or
pip install -r requirements.txt

# install a kernel of a jupyter notebook in venv
ipython kernel install --user --name=venv

Some notebooks need to load variables stored in a hidden, ignored .env file. Create this file, which holds those secret variables, by copying the example.env file and renaming the copy to .env

Or create it from the command line:

touch .env
# or
nano .env
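
Inside a notebook, the variables can then be loaded with python-dotenv (installed in the steps above). This is a minimal sketch rather than the notebooks' actual code: `missing_keys` and `load_settings` are hypothetical helpers, and the variable names in the usage line are placeholders, since the real ones are listed in example.env.

```python
import os


def missing_keys(required, env):
    """Return the required variable names that are absent or empty in env."""
    return [name for name in required if not env.get(name)]


def load_settings(required, env_path=".env"):
    """Load env_path into os.environ and fail early if a variable is missing."""
    from dotenv import load_dotenv  # provided by the python-dotenv package

    load_dotenv(env_path)  # by default, existing variables are not overridden
    missing = missing_keys(required, os.environ)
    if missing:
        raise RuntimeError(f"Missing variables in {env_path}: {', '.join(missing)}")
    return {name: os.environ[name] for name in required}


# Usage with placeholder names:
# settings = load_settings(["API_URL", "API_TOKEN"])
```

Failing early gives a clearer error than a request that breaks later because a secret was never set.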

To run the notebooks, open this folder in a terminal and run:

jupyter notebook

(this requires Jupyter to be installed)


notebooks

We used several notebooks to retrieve data from APIs, copy it to local files, or send it on to other services:

  • get data from Leonard's API Platform : load OCF data to file-TEST 3.ipynb
  • split big files into several files : split_big_json_files.ipynb
  • load files into Pandas and explore them / export them as CSV / insert documents into MongoDB databases : JSON files to API or DB.ipynb
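
As an illustration of the splitting step, a large JSON file could be cut into smaller ones along these lines. This is a sketch under assumptions, not the content of split_big_json_files.ipynb: it supposes the big file holds a single JSON array, and the `_partNNN` suffix and function name are hypothetical.

```python
import json
from pathlib import Path


def split_json_file(src, out_dir, chunk_size):
    """Split a JSON array stored in src into files of at most chunk_size records."""
    src = Path(src)
    records = json.loads(src.read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected the file to contain a JSON array")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, start in enumerate(range(0, len(records), chunk_size), start=1):
        path = out_dir / f"{src.stem}_part{index:03d}.json"
        chunk = records[start:start + chunk_size]
        path.write_text(json.dumps(chunk, ensure_ascii=False), encoding="utf-8")
        written.append(path)
    return written
```

Note that this reads the whole file into memory, which is fine for moderately large files but would need a streaming parser for very large ones.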

OCF sources


inspirations / resources


TEAM
