doozan/spanish_data

This data is built from Wiktionary and Tatoeba datasets using my Wiktionary Parser and Spanish Tools.

This data is used to build the free, open-source Spanish-to-English dictionary available in StarDict and Aard2/slob formats in the Release section. It's also used to build my 6001 Spanish Vocab Anki deck, and is provided here in the hope that others may find additional uses for it.

Interesting files:

  • es-en.data - Spanish to English Wiktionary data formatted for use with enwiktionary_wordlist
  • frequency.csv - a list of the most frequently used Spanish lemmas with part of speech, with word forms combined into their lemma
  • sentences.tsv - English/Spanish sentence pairs from tatoeba.org with users' self-reported proficiency, part-of-speech tags, and lemmas
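
The column layout of these files is not documented here, so it can help to inspect a file before writing code against it. The following is a minimal sketch, not part of the repository: `tsv_field_counts` is a hypothetical helper name, and it assumes only that the file is tab-separated. It reports how many lines have each field count, which makes malformed rows easy to spot.

```shell
# Hypothetical helper (not part of spanish_data): for a tab-separated file,
# print "<field count> <number of lines>" pairs, sorted by field count.
tsv_field_counts() {
    awk -F'\t' '{ counts[NF]++ } END { for (n in counts) print n, counts[n] }' "$1" | sort -n
}

# Usage, assuming sentences.tsv is in the current directory:
if [ -f sentences.tsv ]; then
    head -n 3 sentences.tsv
    tsv_field_counts sentences.tsv
fi
```

If every line reports the same field count, the file can likely be read with a plain tab split; more than one count suggests some rows need special handling.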

Credits:

  • es-en.data (CC-BY-SA Attribution: wiktionary.org)
  • frequency.csv (CC-BY-SA 3.0 github.com/hermitdave/FrequencyWords)
  • sentences.tsv (CC-BY 2.0 FR Attribution: tatoeba.org)
  • Tatoeba user CK for the list of reviewed English sentences
  • Tatoeba user arh for the list of reviewed Spanish sentences
  • FreeLing for the part-of-speech tagging

Building the datafiles

Install required tools

sudo apt install curl bzip2 gawk pv unzip zip pkg-config dictzip make
pip3 install ijson pywikibot mwparserfromhell pyglossary PyICU Levenshtein
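
Before starting the build, it may be worth confirming that the command-line tools are actually on the PATH. This is a sketch under the assumption that each apt package above installs a command of the same name; `check_tools` is a hypothetical helper, not something the Makefile provides.

```shell
# Hypothetical helper: report any named command that is not on the PATH.
# Returns 0 if all are found, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage with the tools from the apt line above:
check_tools curl bzip2 gawk pv unzip zip pkg-config dictzip make \
    || echo "install the missing tools before running make"
```

This only checks the apt-installed commands; the Python packages from the pip3 line would need a separate check.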

Install FreeLing on Debian (for other distros, check the FreeLing instructions)

wget https://github.com/TALP-UPC/FreeLing/releases/download/4.2/freeling-4.2-buster-amd64.deb
sudo apt install ./freeling-4.2-buster-amd64.deb libboost-chrono1.67.0 libboost-date-time1.67.0

Download and run the Makefile

curl -L https://github.com/doozan/spanish_data/raw/master/Makefile -o Makefile
make