# Data classification

Full repository for the data classification project. It contains both source code and data; the data is stored on Google Cloud and managed by [DVC](https://dvc.org).
[[TOC]]
## Repository structure

```
├── data/               <- all project data
│   ├── raw/            <- raw documents obtained from crawlers
│   │   ├── de/
│   │   ├── fr/
│   │   └── ...
│   └── interim/        <- intermediate data (e.g. processed docs), split by language
│       ├── de/
│       ├── fr/
│       └── ...
├── data_test/          <- small dataset for debugging
│   └── <same structure as `data/`>
├── go_crawler/         <- all crawler-related code (crawlers + config)
│   ├── de-crawler/     <- German config.toml + URL database
│   ├── fr-crawler/     <- French config.toml + URL database
│   └── ...
├── logs/               <- all logs from stages
├── models/             <- dumped model files
│   ├── de/
│   ├── fr/
│   └── ...
├── reports/            <- all metrics, plot data, etc.
│   ├── de/
│   ├── fr/
│   └── ...
├── src/                <- Python sources, mainly pipeline files
├── dvc.lock            <- lock file for data versioning; DO NOT edit manually
├── dvc.yaml            <- description of all ML stages in the project pipeline
├── params.yaml         <- single place for all parameters, tunings and configuration
└── requirements.txt    <- package list
```
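Each stage in `dvc.yaml` declares its command, dependencies, parameters, and outputs. A hypothetical sketch of what one per-language stage could look like (the stage name follows the `train_<lang>` convention mentioned below, but the script name and exact paths are assumptions, not the actual pipeline):

```yaml
stages:
  train_de:
    cmd: python src/train.py --lang de    # hypothetical entry point
    deps:
      - data/raw/de
      - src/train.py
    params:
      - base.data_dir
    outs:
      - models/de
    metrics:
      - reports/de/metrics.json:
          cache: false
```

DVC re-runs a stage whenever any of its `deps` or `params` change, and tracks everything under `outs` in remote storage.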
## Workflow

| N | Server | Local machine |
|---|--------|---------------|
| 1 | Crawl data into the `data/raw` folder | - |
| 2 | Unfreeze the `make_small_dataset` stage and run `dvc repro make_small_dataset -f`, then add it to DVC (`dvc add data_test/`), commit, and push: `git push && dvc push` | - |
| 3 | Push the rest of the data: `dvc push` | Check out the `main` branch and run `dvc fetch make_small_dataset && dvc checkout make_small_dataset`. You will have all the data from step #2 (the small dataset). Set `data_dir` in `params.yaml` to `data_test`! Conduct experiments, then commit them. |
| 4 | - | Open an MR to `main` and wait for the review and approval process to finish. DO NOT merge yet: you need to revert `data_dir` to `data` first. |
| 5 | - | Revert `data_dir` back to `data`, then merge the MR. |
| 6 | Fetch the merged `main` branch (`git checkout main && git pull`). Run the pipeline with `dvc repro`, then push the results (`git add`, `git commit`, `dvc push`) | Continue development in a separate branch |
For a quick setup, read the "Local development" section below.
## Adding a new language

Do the following steps on the production server!

- Create a new branch `lang/<lang>`.
- Create a folder `<lang>-crawler` (e.g. `es-crawler`) and put `config.toml` from `de-crawler` there.
- Set the correct data paths in your new `config.toml`.
- Copy the URL database file to your `<lang>-crawler` directory and set the correct db-file path in `config.toml`. Do not forget to reset all `is_crawled` flags:

  ```sql
  UPDATE companies SET is_common_crawled=0 WHERE 1=1;
  UPDATE companies SET is_google_crawled=0 WHERE 1=1;
  UPDATE companies SET is_colly_crawled=0 WHERE 1=1;
  ```

- Build the crawler, copy the binary to the `<lang>-crawler/` dir, and run it:

  ```shell
  go build
  cp dataclassification-crawler ../<lang>-crawler/
  cd ../<lang>-crawler/
  ./dataclassification-crawler
  ```

- Wait for the crawler to finish (this can take up to 24 hours).
- Once the crawler is done, add the files to DVC (this will set up data tracking):

  ```shell
  dvc add data/raw/<lang>
  # DVC will suggest these commands - run them as it asks
  git add data/raw/<lang>/.gitignore
  git add data/raw/<lang>/.<lang>.dvc
  ```

- Add a pipeline to `dvc.yaml`: simply copy-paste one of the existing ones (e.g. `de`) and correct the language in the configuration.
- Run the pipeline for your language with `dvc repro train_<lang>`. You can rerun all languages with `dvc repro`.
- Add all files that DVC asks for to git, commit, and push with `git push`.
- Push the data (not the sources) with `dvc push`.
Good job! Model files are waiting for you in the `models/` folder. They are tracked across git branches, reproducible, and visualized. All metrics and plots are uploaded here.
## Local development

- Create a new branch from `main`.
- Set up the packages:

  ```shell
  python -m venv venv
  # on Linux and macOS
  source venv/bin/activate
  # on Windows
  venv/Scripts/activate  # TODO check it
  # all systems
  pip install -r requirements.txt
  ```

- Fetch the small data (the full dataset is too huge):

  ```shell
  dvc fetch make_small_dataset
  dvc checkout make_small_dataset
  ```

- You should now have the `data_test` directory filled with data. This directory is your primary data source!
- Make sure that `base.data_dir` in `params.yaml` is set to `data_test`.
- You can now use `dvc repro` as usual. Make changes, commit experiments, and have fun.
- Open an MR to the `main` branch.
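Pipeline code typically reads this configuration straight from `params.yaml`. A minimal sketch of the pattern, assuming PyYAML (a dependency of DVC itself); only the `base.data_dir` key comes from this README, the inlined stand-in file and variable names are illustrative:

```python
import yaml  # PyYAML, pulled in by DVC

# Inlined stand-in for the real params.yaml; only base.data_dir is documented here.
PARAMS = """\
base:
  data_dir: data_test   # switch back to "data" for full runs on the server
"""

params = yaml.safe_load(PARAMS)
data_dir = params["base"]["data_dir"]
print(data_dir)  # data_test
```

Because every stage lists `base.data_dir` (or similar keys) under `params` in `dvc.yaml`, flipping this one value is enough to retarget the whole pipeline between the debug and full datasets.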
## Notes

DVC tracks md5 hashes to determine whether the data has changed. If so, DVC re-runs all dependent stages in the pipeline, saves the newly produced md5 hashes, and commits them to git.

- When checking out a branch, don't forget to run `dvc fetch && dvc checkout` (docs).
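The hash comparison above can be illustrated in a few lines of Python (a conceptual sketch of the idea, not DVC's actual implementation):

```python
import hashlib

def md5_of(data: bytes) -> str:
    """Hex md5 digest of raw bytes - the kind of per-file hash stored in dvc.lock."""
    return hashlib.md5(data).hexdigest()

# A stage must re-run when a dependency's current hash differs from the locked one.
locked = md5_of(b"old file contents")
current = md5_of(b"new file contents")
print(locked != current)  # True -> the data changed, so dependent stages re-run
```

This is why `dvc.lock` must never be edited by hand: it is the saved record of which hashes produced which outputs.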