You can find the final dataset here.
Set up the Python environment:
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
Download the latest multistream version of the Azerbaijani Wikipedia dump from Wikimedia Dumps:
# Download
wget https://dumps.wikimedia.org/azwiki/latest/azwiki-latest-pages-articles-multistream.xml.bz2
# Extract the xml file
bzip2 -dk azwiki-latest-pages-articles-multistream.xml.bz2
We used the latest dump as of 15.12.2023.
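The multistream archive can also be read directly from Python without keeping the decompressed XML around. The snippet below is only a sanity-check sketch, not part of this repository; it streams the .bz2 file downloaded above and counts the page elements it contains.
import bz2
import xml.etree.ElementTree as ET

DUMP = "azwiki-latest-pages-articles-multistream.xml.bz2"

# Stream-decompress the dump and count <page> elements without loading it into memory.
count = 0
with bz2.open(DUMP, "rb") as f:
    for _, elem in ET.iterparse(f, events=("end",)):
        # MediaWiki export tags carry an XML namespace, so match on the local name.
        if elem.tag.rsplit("}", 1)[-1] == "page":
            count += 1
            elem.clear()  # drop the finished page subtree to keep memory low

print(f"Pages in dump: {count}")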
Run the dewiki.py script to extract articles from the XML file; they will be stored in the raw folder. After this, run the process.py script to clean the contents and filter out some files. Results will be saved in the processed folder.
# Extract the articles from the xml file
python scripts/dewiki.py
# Process
python scripts/process.py
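The exact rules live in scripts/process.py; purely as an illustration of the kind of cleaning involved (the real script may differ), the step amounts to reading each extracted article from raw, stripping leftover wiki markup, skipping redirects and very short pages, and writing the result to processed. The length threshold, the regular expressions, and the assumption that articles are stored as .txt files are all illustrative, not the repository's actual values.
import re
from pathlib import Path

RAW_DIR = Path("raw")        # output of dewiki.py
OUT_DIR = Path("processed")  # cleaned articles are written here
MIN_CHARS = 200              # illustrative length threshold

OUT_DIR.mkdir(exist_ok=True)

def clean(text: str) -> str:
    # Drop leftover {{templates}}, unwrap [[link|label]] markup, collapse spaces.
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    return re.sub(r"[ \t]+", " ", text).strip()

for path in RAW_DIR.glob("*.txt"):
    text = clean(path.read_text(encoding="utf-8"))
    # Skip redirect stubs and pages that end up too short after cleaning.
    if text.lower().startswith("#redirect") or len(text) < MIN_CHARS:
        continue
    (OUT_DIR / path.name).write_text(text, encoding="utf-8")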
Create a config.py file at the root of your project folder and define the following variables:
hf_repo = "username/dataset_name"
hf_token = "your_hf_access_token"
After this, simply run the hf.py file to push the dataset to HuggingFace.
python scripts/hf.py
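For reference, a push script in this spirit can be as small as the sketch below. This is an assumption about what scripts/hf.py does rather than its actual contents: it loads the processed text files with the datasets library (assuming .txt files under processed) and pushes them to the repo defined in config.py.
from datasets import load_dataset

import config  # provides hf_repo and hf_token as described above

# Load every processed article as a single text dataset.
dataset = load_dataset("text", data_files={"train": "processed/*.txt"})

# Push to the Hugging Face Hub using the access token from config.py.
dataset.push_to_hub(config.hf_repo, token=config.hf_token)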
The original version of the process.py script was developed during the AzCorpus project. The dewiki.py script has been adapted from this repository.
If there is anything to update regarding the dataset, open a PR here on GitHub; I will update the HuggingFace repo myself.