- Link to the challenge: https://statistics-awards.eu/competitions/4
- Link to the public repository: https://github.com/antoine-palazz/deduplication
- Link to the Datalab that was used for the coding part of the challenge: https://datalab.sspcloud.fr/. For more information about the Onyxia project: https://github.com/InseeFrLab/onyxia-web.
- For more information, you can contact Antoine Palazzolo, who represents team Nins.
This is a Kedro project, which was generated using Kedro 0.18.6.
Take a look at the Kedro documentation to get started.
There are two ways to provide the data:
- Manually load the initial dataset wi_dataset.csv into the folder data/01_raw/ (and any past approaches or submissions into data/09_past_approaches/).
- In the file setup.sh, change the s3 path to point to your dataset wi_dataset.csv and to any past approaches to the problem, then uncomment the import.
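Whichever option is used, it can help to confirm that the input file is in place before launching the pipeline. The following is a minimal sketch, not part of the repository: the function names `missing_inputs` and `has_past_approaches` are hypothetical, and only the two paths come from the instructions above.

```python
from pathlib import Path

# Paths taken from the instructions above, relative to the project root.
RAW_DATASET = Path("data/01_raw/wi_dataset.csv")
PAST_APPROACHES_DIR = Path("data/09_past_approaches")


def missing_inputs(project_root="."):
    """Return the required input paths that are absent under project_root."""
    root = Path(project_root)
    missing = []
    if not (root / RAW_DATASET).is_file():
        missing.append(RAW_DATASET)
    return missing


def has_past_approaches(project_root="."):
    """Report whether the optional past-approaches folder holds any file."""
    folder = Path(project_root) / PAST_APPROACHES_DIR
    return folder.is_dir() and any(p.is_file() for p in folder.iterdir())
```

Calling `missing_inputs()` from the project root returns an empty list once the dataset has been copied or imported. The past-approaches folder is treated as optional, in line with the wording above.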
To install the dependencies (and possibly import your data), run:
./setup.sh
To visualize the various components of the code interactively, run:
kedro viz
You can run your Kedro project with:
kedro run
To run only the part that has been selected for the final submission, add --tags=final_models. To better visualize the parts of the pipelines that have been filtered, you can apply the same filters on the visual representation generated by kedro viz.
If you have several CPUs at your disposal and want to speed up execution, run the following commands:
kedro run --tags=final_models_parallel_part --runner=ParallelRunner
kedro run --tags=final_models_sequential_part --runner=SequentialRunner
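The two commands can also be chained from Python so that the second stage only starts if the first one succeeds. This is a rough sketch under assumptions: the helper `run_stages` is hypothetical, and the idea that the sequential part depends on the outputs of the parallel part is inferred from the ordering above rather than stated by it. Only the command lines themselves are taken from the text.

```python
import subprocess

# The two commands quoted above, in the order they are listed.
STAGES = [
    ["kedro", "run", "--tags=final_models_parallel_part", "--runner=ParallelRunner"],
    ["kedro", "run", "--tags=final_models_sequential_part", "--runner=SequentialRunner"],
]


def run_stages(stages=STAGES, run=subprocess.run):
    """Run each stage in order and stop at the first failure.

    `run` is injectable so the function can be exercised without Kedro
    installed. Returns the list of commands that completed successfully.
    """
    completed = []
    for command in stages:
        result = run(command)
        if result.returncode != 0:
            raise RuntimeError(f"stage failed: {' '.join(command)}")
        completed.append(command)
    return completed
```

Calling `run_stages()` from the project root, in an environment where Kedro is installed, launches both stages in turn.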
The final output will be stored in data/07_model_output/best_duplicates.csv, and a description of the output will be available in data/08_reporting/best_duplicates_description.csv.
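For a quick look at the result without opening a notebook, the output file can be summarized with the standard library. This sketch assumes a comma-separated, UTF-8 file whose first row is a header; the column names are not documented here, so the code does not rely on any. The function name `summarize_csv` is hypothetical.

```python
import csv
from pathlib import Path

# Output location given above, relative to the project root.
BEST_DUPLICATES = Path("data/07_model_output/best_duplicates.csv")


def summarize_csv(path=BEST_DUPLICATES):
    """Return (header, row_count) for a CSV file without assuming its columns."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no rows
        row_count = sum(1 for _ in reader)
    return header, row_count
```

The same function can be pointed at data/08_reporting/best_duplicates_description.csv by passing that path explicitly.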
To further analyze and possibly improve the output by using past approaches, you can then use the notebook stored in notebooks/use_past_approaches.ipynb.
To generate or update the dependency requirements for your project:
kedro build-reqs
This will pip-compile the contents of src/requirements.txt into a new file src/requirements.lock. You can see the output of the resolution by opening src/requirements.lock.
After this, if you'd like to update your project requirements, please update src/requirements.txt and re-run kedro build-reqs.
Further information about project dependencies, building project documentation, and packaging your project is available in the Kedro documentation.