Ulysse

Vision

The aim of this project is to ease and speed up the kick-off of the #OpenSirene hackathon, which starts on November 15th.

With this small Python script you can load a subset of the huge CSV file provided during the hackathon and work only on a given domain. Additionally, you can export part of the loaded data as JSON or CSV.

Technically, the CSV is loaded into Redis and then served with Sanic. You should not need to be a Python developer to use the project. Please, please, please report any issue you encounter so we can improve the project.

Check the detailed use cases for a better idea of what can be achieved, then follow the installation instructions if it fits your needs.

Ethics

This project operates under the W3C's Code of Ethics and Professional Conduct:

W3C is a growing and global community where participants choose to work together, and in that process experience differences in language, location, nationality, and experience. In such a diverse environment, misunderstandings and disagreements happen, which in most cases can be resolved informally. In rare cases, however, behavior can intimidate, harass, or otherwise disrupt one or more people in the community, which W3C will not tolerate.

A Code of Ethics and Professional Conduct is useful to define accepted and acceptable behaviors and to promote high standards of professional practice. It also provides a benchmark for self evaluation and acts as a vehicle for better identity of the organization.

We hope that our community group acts according to these guidelines, and that participants hold each other to these high standards. If you have any questions or are worried that the code isn't being followed, please contact the owner of the repository.

Language

The development language is English. All comments and documentation should be written in English, so that we don't end up with “franglais” methods, and so we can share our learnings with developers around the world.

However, the domain language is French. We consider each tax, collecting body and French regulation a domain-specific term. In the same fashion, well-known abbreviations of these domain-specific terms are accepted.

Installation

Clone the current repository, then either create a dedicated virtualenv or, if you are not a Python developer, just type this command:

$ pip install -r requirements.txt

In both cases, you have to run the script with Python 3.5+.

Note that you might need to install pip first.
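
If you go for the virtualenv route, a typical setup looks like this (the environment name is only an example):

$ python3 -m venv ulysse-env
$ source ulysse-env/bin/activate
$ pip install -r requirements.txt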

If you plan to make HTTP requests to retrieve filtered data, we strongly recommend using httpie instead of curl; it is installed as part of the requirements.txt file.

You will need a running Redis server too. The installation and launch depends on your OS. For instance for macOS:

$ brew install redis
$ redis-server
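
On Debian/Ubuntu-based Linux, something along these lines should work:

$ sudo apt-get install redis-server
$ redis-server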

Data

We assume that you have access to the source files for data:

  • sirc-266_266_13705_201606_L_P_20161010_121909418.csv.bz2 is the unique stock file with 12 million lines
  • MisesajourQuotidiennes/sirc-266_266_13706_2016183_E_Q_20161020_131153997.csv is one of the 42 daily update files with about 10000 lines each

During the hackathon, you will also have access to 2 databases pre-loaded with that script:

  • dump-500000+500000-all.rdb.zip contains 500000 records from the stock (i.e. 1/24th) and 500000 update records with ALL columns loaded
  • dump-3000000+500000-default.rdb.zip contains 3000000 records from the stock (i.e. one quarter) and 500000 update records with DEFAULT columns loaded.

See the use cases below to check whether one of these databases suits your needs or whether you have to load one yourself. If you plan to use these databases, the easiest way is to rename the file to dump.rdb and then run the redis-server command from that directory. Once it is done, use the serve command documented below.
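
For instance, with the first pre-loaded database (the name of the file inside the archive may differ; adapt it as needed):

$ unzip dump-500000+500000-all.rdb.zip
$ mv dump-500000+500000-all.rdb dump.rdb
$ redis-server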

Use cases

First of all, you can have access to the help of the module at any given time:

$ python -m ulysse --help

Limiting the huge Sirene CSV file

Before starting, you need to define the number of lines you want to load into the local database. The default is 1000 so you can try things out quickly; the whole stock file is about 12 million lines. It takes about 9 minutes to load 500000 lines with the default keys and 1 hour for 3000000 records (one quarter of the total).

Next, choose which columns you want to work on; loading all columns is probably irrelevant for the scope of a hackathon. Focus on a given domain and iterate quickly. If you are missing a column, it should not take too long to load a new database.

Once that is done, it is time to call the script with these parameters (add --columns SIREN NIC etc. if you do not want the default ones, which are SIREN NIC L1_NORMALISEE TEFET DEFET DEPCOMEN APEN700 CATEGORIE DCREN DATEMAJ):

$ python -m ulysse load_stock --filename path/to/sirc.csv --lines 20000
INFO:ulysse.database:👌 Connected to Redis
INFO:ulysse.loaders:👉 Loading 20000 lines from path/to/sirc.csv
INFO:ulysse.loaders:💧 Already 10000 lines loaded
INFO:ulysse.loaders:💧 Already 20000 lines loaded
INFO:ulysse.loaders:🌊 20000 lines loaded with success

The beautifully emoji-ed log will hopefully help you to understand what is happening. Do not forget to launch your Redis server first!

Note: there is a way to load all columns at once (see section below).

Playing with data (optional/advanced)

At that point, you should have a loaded Redis database.

If you are familiar with Python and/or Redis, you can start querying that subset. For instance:

$ python
>>> from ulysse.database import db
INFO:ulysse.database:👌 Connected to Redis
>>> from ulysse.database import retrieve_sirets
INFO:ulysse.database:👌 Connected to Redis
>>> sirets = retrieve_sirets('NIC', '00056', limit=3)
>>> print(sirets)
['00542002100056', '00664182300056', '00735020000056']
>>> from ulysse.database import retrieve_siret
>>> retrieve_siret(sirets[0])
{'20110719': '{"DATEMAJ": "20110719", "L1_NORMALISEE": "ETABLISSEMENTS LUCIEN BIQUEZ", "APEN700": "4669B", "DEPCOMEN": "80001", "SIREN": "005420021", "DCREN": "195401", "NIC": "00056", "DEFET": "2009", "TEFET": "11", "CATEGORIE": "PME"}'}
>>> from ulysse.database import decode_siret
>>> decode_siret(retrieve_siret(sirets[0]), ['SIREN', 'L1_NORMALISEE'])
{'20110719': {'L1_NORMALISEE': 'ETABLISSEMENTS LUCIEN BIQUEZ', 'SIREN': '005420021'}}

The low-level API gives you the most modular and customizable way to retrieve data, but it can be a bit tedious to do by hand. If you are totally lost, the next section will hopefully help you!

Quick and dirty export in CSV or JSON

You can serve your data through HTTP for easier access.

You have to launch the local server:

$ python -m ulysse serve --columns SIREN NIC L1_NORMALISEE
INFO:ulysse.database:👌 Connected to Redis
INFO:sanic.log:Goin' Fast @ http://0.0.0.0:8000

Now you can issue HTTP requests from another terminal to retrieve data as CSV:

$ http :8000/NIC/00056 limit==2 format==csv columns==SIREN,NIC,L1_NORMALISEE
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 108
Content-Type: text/plain; charset=utf-8
Keep-Alive: timeout=60

SIREN;NIC;L1_NORMALISEE
005420021;00056;ETABLISSEMENTS LUCIEN BIQUEZ
006641823;00056;MONSIEUR PHILIPPE PLOGE

And/or JSON:

$ http :8000/NIC/00056 limit==3 format==json columns==SIREN,L1_NORMALISEE
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 232
Content-Type: application/json; charset=utf-8
Keep-Alive: timeout=60

[
    {
        "20110719": {
            "L1_NORMALISEE": "ETABLISSEMENTS LUCIEN BIQUEZ",
            "SIREN": "005420021"
        }
    },
    {
        "20150902": {
            "L1_NORMALISEE": "MONSIEUR PHILIPPE PLOGE",
            "SIREN": "006641823"
        }
    },
    {
        "20120120": {
            "L1_NORMALISEE": "ENTREPRISE MINETTO",
            "SIREN": "007350200"
        }
    }
]

You can play with the GET parameters (limit, format and columns) to retrieve the relevant data for your use case.
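
For instance, a query against the APEN700 column could look like this (the value is only illustrative, and APEN700 must be among the columns you loaded):

$ http :8000/APEN700/4669B limit==5 format==csv columns==SIREN,NIC,L1_NORMALISEE,APEN700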

Dealing with history (optional)

You can load updates from the generated daily files (here again, you can pass a --columns parameter to customize the extra columns loaded; defaults are VMAJ DATEMAJ EVE DATEVE):

$ python -m ulysse load_updates --folder path/to/MisesajourQuotidiennes/
INFO:ulysse.database:👌 Connected to Redis
INFO:ulysse.loaders:👉 Loading data from path/to/MisesajourQuotidiennes/sirc-..._124141890.csv
INFO:ulysse.loaders:💧 Already 3000 lines loaded
INFO:ulysse.loaders:💧 Already 6000 lines loaded
INFO:ulysse.loaders:💧 Already 9000 lines loaded
INFO:ulysse.loaders:💧 Already 12000 lines loaded
INFO:ulysse.loaders:🐣 Creations: 4678 — 👥 Modifications: 2759 — 💀 Deletions: 3357 — 🤑 Commercial: 4 — 💸 Non commercial: 8
[…]
INFO:ulysse.loaders:🌊 475065 items loaded with success

The full load takes about 12 minutes to complete with default columns. Once it completes, you will get more information when you perform a query against the server (note the offset parameter, useful for pagination):

$ http :8000/NIC/00056 limit==2 offset==44 format==json columns==SIREN,L1_NORMALISEE,DATEMAJ,EVE
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 781
Content-Type: application/json; charset=utf-8
Keep-Alive: timeout=60

[
    {
        "20140607": {
            "DATEMAJ": "20140607",
            "L1_NORMALISEE": "SERVICE INSTALLATION DEPANNAGE ELECTRO",
            "SIREN": "070801659"
        },
        "20160725": {
            "DATEMAJ": "20160725",
            "EVE": "MS",
            "L1_NORMALISEE": "SERVICE INSTALLATION DEPANNAGE ELECTRO",
            "SIREN": "070801659"
        },
        "20160726": {
            "DATEMAJ": "20160726",
            "EVE": "MS",
            "L1_NORMALISEE": "SERVICE INSTALLATION DEPANNAGE ELECTRO",
            "SIREN": "070801659"
        },
        "20160817": {
            "DATEMAJ": "20160817",
            "EVE": "MS",
            "L1_NORMALISEE": "SERVICE INSTALLATION DEPANNAGE ELECTRO",
            "SIREN": "070801659"
        },
        "20160818": {
            "DATEMAJ": "20160818",
            "EVE": "MS",
            "L1_NORMALISEE": "SID ELECTRONIQUE",
            "SIREN": "070801659"
        },
        "20160826": {
            "DATEMAJ": "20160826",
            "EVE": "SS",
            "L1_NORMALISEE": "SID ELECTRONIQUE",
            "SIREN": "070801659"
        }
    },
    {
        "19981126": {
            "DATEMAJ": "19981126",
            "L1_NORMALISEE": "BETON CONTROLE COTE D AZUR",
            "SIREN": "071503569"
        }
    }
]

Here the company with SIREN "070801659" issued an MS (headquarters change) as of 2016-07-26 and 2016-08-17 (?!) and then an SS (headquarters closure) as of 2016-08-26.

Another example:

$ http :8000/SIREN/024049124 limit==25 format==csv columns==SIREN,NIC,L1_NORMALISEE,DATEMAJ,EVE
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 1035
Content-Type: text/plain; charset=utf-8
Keep-Alive: timeout=60

SIREN;NIC;L1_NORMALISEE;DATEMAJ;EVE
024049124;00027;DOUKA BE;20131125;
024049124;00035;BOURBON DISTRIBUTION MAYOTTE;20131125;
024049124;00043;ENTREPOT BAZAR;20131128;
024049124;00050;SNIE;20120922;
024049124;00068;BOURBON DISTRIBUTION MAYOTTE;20120922;
024049124;00076;SNIE COMBANI;20131125;
024049124;00084;DOUKA BE;20131125;
024049124;00092;BOURBON DISTRIBUTION MAYOTTE;20131125;
024049124;00100;BOURBON DISTRIBUTION MAYOTTE;20131125;
024049124;00118;SNIE;20131104;
024049124;00126;SNIE;20131125;
024049124;00134;UTV;20131107;
024049124;00142;JUMBO SCORE;20141028;
024049124;00159;BOURBON DISTRIBUTION MAYOTTE;20150320;
024049124;00167;DOUKA BE;20151009;
024049124;00175;DOUKA BE;20151009;
024049124;00183;DOUKA BE;20151106;
024049124;00191;DOUKA BE;20160120;
024049124;00209;DOUKA BE;20160216;
024049124;00217;BOURBON DISTRIBUTION MAYOTTE;20160318;
024049124;00225;BOURBON DISTRIBUTION MAYOTTE;20160318;
024049124;00233;BOURBON DISTRIBUTION MAYOTTE;20160318;
024049124;00241;DOUKA BE;20160701;CE
024049124;00258;DOUKA BE;20160701;CE

You can see that as of 2016-07-01, the company created new establishments with NICs "00241" and "00258" (CE in the EVE, i.e. événement, column).

Load all columns from source files

Warning: this way of using the lib is only recommended if you have a lot of RAM, given that Redis stores everything in memory. 16 GB+ is recommended, otherwise your machine will start to swap and performance will drop significantly.

To load all columns, use the --all option when you launch the load_* commands:

$ python -m ulysse load_stock --filename path/to/sirc.csv --lines 50000 --all
INFO:ulysse.database:👌 Connected to Redis
INFO:ulysse.loaders:👉 Loading 50000 lines from path/to/sirc.csv
INFO:ulysse.loaders:💧 Already 10000 lines loaded
INFO:ulysse.loaders:💧 Already 20000 lines loaded
INFO:ulysse.loaders:💧 Already 30000 lines loaded
INFO:ulysse.loaders:💧 Already 40000 lines loaded
INFO:ulysse.loaders:💧 Already 50000 lines loaded
INFO:ulysse.loaders:🌊 50000 lines loaded with success

Be aware that it obviously takes much more time than storing only the columns you want to work on. For instance, it takes about an hour and a half for 500000 lines.

Even with the --all option activated, note that the --excluded option still applies (defaults are NORMALISEE DECLAREE ENSEIGNE LIBNATETAB LIBAPET LIBTEFET NOMEN_LONG LIBNJ LIBAPEN LIBTEFEN), so columns with these suffixes will not be indexed. Modify that option too if you want to index these plaintext columns.
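
For instance, assuming --excluded takes a space-separated list like --columns does and that the list you pass replaces the defaults (not verified), keeping NOMEN_LONG indexed could look like this:

$ python -m ulysse load_stock --filename path/to/sirc.csv --lines 50000 --all --excluded NORMALISEE DECLAREE ENSEIGNE LIBNATETAB LIBAPET LIBTEFET LIBNJ LIBAPEN LIBTEFEN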

The --all option is also effective when loading updates:

$ python -m ulysse load_updates --folder path/to/MisesajourQuotidiennes/ --all
INFO:ulysse.database:👌 Connected to Redis
INFO:ulysse.loaders:👉 Loading data from path/to/MisesajourQuotidiennes/sirc-266_266_13706_2016183_E_Q_20161020_131153997.csv
INFO:ulysse.loaders:💧 Already 3000 lines loaded
INFO:ulysse.loaders:💧 Already 6000 lines loaded
INFO:ulysse.loaders:💧 Already 9000 lines loaded
INFO:ulysse.loaders:💧 Already 12000 lines loaded
INFO:ulysse.loaders:🐣 Creations: 4678 — 👥 Modifications: 2759 — 💀 Deletions: 3357 — 🤑 Commercial: 4 — 💸 Non commercial: 8
INFO:ulysse.loaders:👉 Loading data from path/to/MisesajourQuotidiennes/sirc-266_266_13706_2016186_E_Q_20161020_131100370.csv
INFO:ulysse.loaders:💧 Already 3000 lines loaded
[…]
INFO:ulysse.loaders:🌊 475065 items loaded with success

In that configuration, the load of all updates takes about an hour and a half. Moreover, Redis uses about 8 GB of memory and requires at least 12 GB to avoid swapping during the initial load of the data. The Redis dump.rdb database is about 3.5 GB and takes 7 minutes to load when you launch the server with about 1 million records (500000 from the stock + ~500000 from updates) containing all available columns.

Display diffs for a given SIRET

An API endpoint allows you to display differences within the lifetime of a given SIRET. Let's say you loaded the update files with all columns. If you retrieve information for a given SIREN you end up with two entries:

$ http :8000/SIREN/055801013
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 3827
Content-Type: application/json; charset=utf-8
Keep-Alive: timeout=60

[
    {
        "20160700": {
            "ACTISURF": "",
            "ACTIVNAT": "NR",
            "ADR_MAIL": "",
            "AMINTREN": "201209",
            "AMINTRET": "201209",
            "APEN700": "1071C",
            "APET700": "1071C",
            "APRM": "",
            "ARRONET": "3",
            "AUXILT": "0",
            "CATEGORIE": "PME",
            "CEDEX": "",
            "CIVILITE": "",
            "CODPOS": "13001",
            "COMET": "201",
            "CTONET": "12",
            "DAPEN": "2008",
            "DAPET": "2008",
            "DATEESS": "",
            "DATEMAJ": "20160700",
            "DCREN": "195501",
            "DCRET": "195501",
            "DDEBACT": "",
            "DEFEN": "2014",
            "DEFET": "2014",
            "DEPCOMEN": "13201",
            "DEPET": "13",
            "DIFFCOM": "O",
            "DU": "00",
            "EFENCENT": "10",
            "EFETCENT": "10",
            "ENSEIGNE": "",
            "EPCI": "241300391",
            "ESAANN": "",
            "ESAAPEN": "",
            "ESASEC1N": "",
            "ESASEC2N": "",
            "ESASEC3N": "",
            "ESASEC4N": "",
            "ESS": "",
            "INDREP": "",
            "IND_PUBLIPO": "1",
            "L1_DECLAREE": "BOULANGERIE-PATISSERIE NOAILLES",
            "L1_NORMALISEE": "BOULANGERIE PATISSERIE NOAILLES",
            "L2_DECLAREE": "",
            "L2_NORMALISEE": "",
            "L3_DECLAREE": "",
            "L3_NORMALISEE": "",
            "L4_DECLAREE": "20 RUE DES FEUILLANTS",
            "L4_NORMALISEE": "20 RUE DES FEUILLANTS",
            "L5_DECLAREE": "",
            "L5_NORMALISEE": "",
            "L6_DECLAREE": "13001 MARSEILLE 1",
            "L6_NORMALISEE": "13001 MARSEILLE",
            "L7_DECLAREE": "",
            "L7_NORMALISEE": "FRANCE",
            "LIBAPEN": "Boulangerie et boulangerie-pâtisserie",
            "LIBAPET": "Boulangerie et boulangerie-pâtisserie",
            "LIBCOM": "MARSEILLE 1",
            "LIBNATETAB": "",
            "LIBNJ": "Société à responsabilité limitée (sans autre indication)",
            "LIBREG": "Provence-Alpes-Côte d'Azur",
            "LIBTEFEN": "10 à 19 salariés",
            "LIBTEFET": "10 à 19 salariés",
            "LIBVOIE": "DES FEUILLANTS",
            "LIEUACT": "NR",
            "MODEN": "S",
            "MODET": "S",
            "MONOACT": "1",
            "NATETAB": "",
            "NIC": "00016",
            "NICSIEGE": "00016",
            "NJ": "5499",
            "NOM": "",
            "NOMEN_LONG": "BOULANGERIE-PATISSERIE NOAILLES",
            "NUMVOIE": "20",
            "ORIGINE": "3",
            "PRENOM": "",
            "PRODEN": "O",
            "PRODET": "O",
            "PRODPART": "",
            "RNA": "",
            "RPEN": "93",
            "RPET": "93",
            "SAISONAT": "P",
            "SIEGE": "1",
            "SIGLE": "",
            "SIREN": "055801013",
            "TCA": "",
            "TCD": "73",
            "TEFEN": "11",
            "TEFET": "11",
            "TU": "7",
            "TYPVOIE": "RUE",
            "UU": "59",
            "VMAJ": "I",
            "VMAJ1": "",
            "VMAJ2": "",
            "VMAJ3": "",
            "ZEMET": "9310"
        },
        "20160701": {
            "ACTISURF": "",
            "ACTIVNAT": "NR",
            "ADR_MAIL": "",
            "AMINTREN": "201209",
            "AMINTRET": "201209",
            "APEN700": "6820B",
            "APET700": "6820B",
            "APRM": "",
            "ARRONET": "3",
            "AUXILT": "0",
            "CATEGORIE": "PME",
            "CEDEX": "",
            "CIVILITE": "",
            "CODPOS": "13001",
            "COMET": "201",
            "CTONET": "12",
            "DAPEN": "2015",
            "DAPET": "2015",
            "DATEESS": "",
            "DATEMAJ": "20160701",
            "DCREN": "195501",
            "DCRET": "195501",
            "DDEBACT": "",
            "DEFEN": "2014",
            "DEFET": "2014",
            "DEPCOMEN": "13201",
            "DEPET": "13",
            "DIFFCOM": "O",
            "DU": "00",
            "EFENCENT": "10",
            "EFETCENT": "10",
            "ENSEIGNE": "",
            "EPCI": "241300391",
            "ESAANN": "",
            "ESAAPEN": "",
            "ESASEC1N": "",
            "ESASEC2N": "",
            "ESASEC3N": "",
            "ESASEC4N": "",
            "ESS": "",
            "INDREP": "",
            "IND_PUBLIPO": "1",
            "L1_DECLAREE": "BOULANGERIE-PATISSERIE NOAILLES",
            "L1_NORMALISEE": "BOULANGERIE-PATISSERIE NOAILLES",
            "L2_DECLAREE": "",
            "L2_NORMALISEE": "",
            "L3_DECLAREE": "",
            "L3_NORMALISEE": "",
            "L4_DECLAREE": "20 RUE DES FEUILLANTS",
            "L4_NORMALISEE": "20 RUE DES FEUILLANTS",
            "L5_DECLAREE": "",
            "L5_NORMALISEE": "",
            "L6_DECLAREE": "13001 MARSEILLE 1",
            "L6_NORMALISEE": "13001 MARSEILLE 1",
            "L7_DECLAREE": "",
            "L7_NORMALISEE": "",
            "LIBAPEN": "Location de terrains et d'autres biens immobiliers",
            "LIBAPET": "Location de terrains et d'autres biens immobiliers",
            "LIBCOM": "MARSEILLE 1",
            "LIBNATETAB": "",
            "LIBNJ": "Société à responsabilité limitée (sans autre indication)",
            "LIBREG": "Provence-Alpes-Côte d'Azur",
            "LIBTEFEN": "10 à 19 salariés",
            "LIBTEFET": "10 à 19 salariés",
            "LIBVOIE": "DES FEUILLANTS",
            "LIEUACT": "NR",
            "MODEN": "S",
            "MODET": "S",
            "MONOACT": "1",
            "NATETAB": "",
            "NIC": "00016",
            "NICSIEGE": "00016",
            "NJ": "5499",
            "NOM": "",
            "NOMEN_LONG": "BOULANGERIE-PATISSERIE NOAILLES",
            "NUMVOIE": "20",
            "ORIGINE": "3",
            "PRENOM": "",
            "PRODEN": "O",
            "PRODET": "O",
            "PRODPART": " ",
            "RNA": "",
            "RPEN": "93",
            "RPET": "93",
            "SAISONAT": "P",
            "SIEGE": "1",
            "SIGLE": "",
            "SIREN": "055801013",
            "TCA": "",
            "TCD": "73",
            "TEFEN": "11",
            "TEFET": "11",
            "TU": "7",
            "TYPVOIE": "RUE",
            "UU": "59",
            "VMAJ": "F",
            "VMAJ1": "0",
            "VMAJ2": "1",
            "VMAJ3": "1",
            "ZEMET": "9310"
        }
    }
]

With that view it's hard to see which keys have changed, which is why we also provide a diff of that view. The SIRET is the concatenation of the SIREN and NIC keys; start-date and end-date come from the keys of the previous result:

$ http :8000/diff/05580101300016 start-date==20160700 end-date==20160701
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 619
Content-Type: application/json; charset=utf-8
Keep-Alive: timeout=60

{
    "APEN700": [
        "1071C",
        "6820B"
    ],
    "APET700": [
        "1071C",
        "6820B"
    ],
    "DAPEN": [
        "2008",
        "2015"
    ],
    "DAPET": [
        "2008",
        "2015"
    ],
    "DATEMAJ": [
        "20160700",
        "20160701"
    ],
    "L1_NORMALISEE": [
        "BOULANGERIE PATISSERIE NOAILLES",
        "BOULANGERIE-PATISSERIE NOAILLES"
    ],
    "L6_NORMALISEE": [
        "13001 MARSEILLE",
        "13001 MARSEILLE 1"
    ],
    "L7_NORMALISEE": [
        "FRANCE",
        ""
    ],
    "LIBAPEN": [
        "Boulangerie et boulangerie-pâtisserie",
        "Location de terrains et d'autres biens immobiliers"
    ],
    "LIBAPET": [
        "Boulangerie et boulangerie-pâtisserie",
        "Location de terrains et d'autres biens immobiliers"
    ],
    "PRODPART": [
        "",
        " "
    ],
    "VMAJ": [
        "I",
        "F"
    ],
    "VMAJ1": [
        "",
        "0"
    ],
    "VMAJ2": [
        "",
        "1"
    ],
    "VMAJ3": [
        "",
        "1"
    ]
}

Here you have a better view of what changed between these two dates according to the update files provided.

What's next?

If your use case is not covered, please add an issue about it. Your needs will drive our future developments on the subject, so your feedback is incredibly valuable to us! 👍

Tools

The following tools do not require any installation step; a classic Python 3 setup should be enough to launch the scripts.

Flux2Stock

The aim of this script (available at the root of the repository) is to create a new stock file from a previous stock file and the incremental daily files published since then.

You can use it that way:

$ python flux2stock.py stock-t.zip stock-t+2.csv flux-t+1.zip flux-t+2.zip

Here stock-t.zip is the initial stock, stock-t+2.csv is the name of the newly generated stock and flux-t+1.zip flux-t+2.zip [...] are daily updates since the initial stock creation.

The generation of a new stock takes approximately 15 minutes on a recent computer. The RAM consumption should stay low.
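
For the curious, the general idea can be sketched in a few lines of Python. This is a simplified illustration, not the actual flux2stock.py implementation: it ignores deletions, assumes semicolon-separated and latin-1 encoded files, and keeps everything in memory (unlike the real script):

import csv
import io
import zipfile

def read_rows(zip_path):
    """Yield CSV rows as dicts from the single CSV inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        name = archive.namelist()[0]
        with archive.open(name) as handle:
            text = io.TextIOWrapper(handle, encoding='latin-1')
            yield from csv.DictReader(text, delimiter=';')

def merge(stock_zip, output_csv, *flux_zips):
    """Apply daily update files on top of the stock, keyed by SIRET (SIREN + NIC)."""
    rows = {row['SIREN'] + row['NIC']: row for row in read_rows(stock_zip)}
    for flux_zip in flux_zips:
        for row in read_rows(flux_zip):
            rows[row['SIREN'] + row['NIC']] = row  # the most recent update wins
    fieldnames = next(iter(rows.values())).keys()
    with open(output_csv, 'w', newline='', encoding='latin-1') as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter=';',
                                extrasaction='ignore')
        writer.writeheader()
        writer.writerows(rows.values())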

Stock2Reduce

The aim of this script (available at the root of the repository) is to create a new, filtered stock file based on the name of a column and its value.

You can use it that way:

$ python stock2reduce.py stock.zip stock-paca.csv RPET=93

Here stock.zip is the initial stock, stock-paca.csv is the name of the newly generated stock, RPET is the name of the column and 93 is the value of that column you want to filter on.

The generation of a new stock takes approximately 7 minutes on a recent computer. The RAM consumption should stay low.

Note that you can combine many filters at once:

$ python stock2reduce.py stock.zip stock-arles.csv DEPET=13 COMET=004

Here we match both the department code and the town code, which together form a valid and unique INSEE code, to retrieve only the stock related to Arles.
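
Again for illustration only (not the actual stock2reduce.py code, and assuming semicolon-separated, latin-1 encoded files), the filtering boils down to keeping only the rows whose columns match every requested value:

import csv
import io
import sys
import zipfile

def reduce_stock(stock_zip, output_csv, filters):
    """Keep only the rows matching every COLUMN=VALUE filter, streaming line by line."""
    with zipfile.ZipFile(stock_zip) as archive:
        name = archive.namelist()[0]
        with archive.open(name) as handle:
            reader = csv.DictReader(io.TextIOWrapper(handle, encoding='latin-1'),
                                    delimiter=';')
            with open(output_csv, 'w', newline='', encoding='latin-1') as out:
                writer = csv.DictWriter(out, fieldnames=reader.fieldnames, delimiter=';')
                writer.writeheader()
                for row in reader:
                    if all(row.get(column) == value for column, value in filters.items()):
                        writer.writerow(row)

if __name__ == '__main__':
    # e.g. python reduce_sketch.py stock.zip stock-arles.csv DEPET=13 COMET=004
    criteria = dict(argument.split('=', 1) for argument in sys.argv[3:])
    reduce_stock(sys.argv[1], sys.argv[2], criteria)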

Contributing

We’re really happy to accept contributions from the community, that’s the main reason why we open-sourced it! There are many ways to contribute, even if you’re not a technical person.

We’re using the famous simplified GitHub workflow to accept modifications (even internally); basically, you’ll have to:

  • create an issue related to the problem you want to fix (good for traceability and cross-reference)
  • fork the repository
  • create a branch (optionally with the reference to the issue in the name)
  • hack hack hack
  • commit incrementally with readable and detailed commit messages
  • submit a pull-request against the master branch of this repository

We’ll take care of tagging your issue with the appropriate labels and will answer the problem you encounter within a week (hopefully less!).

If you’re not familiar with open-source workflows or our set of technologies, do not hesitate to ask for help! We can mentor you or propose good first bugs (as labeled in our issues).

Submitting bugs

You can report issues directly on GitHub; that would be a really useful contribution given that we lack user testing on the project. Please document the steps to reproduce your problem in as much detail as possible.

Adding documentation

We’re doing our best to document each usage of the project but you can improve it or add your own sections.

Hacking

Commit messages should be formatted using AngularJS conventions (one-liners are OK for now but body and footer may be required as the project matures).

Comments follow Google's style guide.

License

We’re using the MIT license.

Changelog

See the dedicated file.

TODO

  • document the low-level API?
  • use pipelines from Redis to speed up loads
  • use file streaming for CSV output (and iterators for the server - Falcon?)
  • move from Redis to PostgreSQL given the size of the whole database
  • move from Sanic to Falcon/Flask for adoption/maturity?

Readme initiated with OpenSourceTemplate.
