BootCaT

A Simple tool to Bootstrap Corpora And Terms from the Web

BootCaT automates the process of finding reference texts on the web and collating them in a single corpus.

The pipeline allows varying levels of control. In the first step, users provide a list of single- or multi-word terms to be used as seeds for text collection. These are then combined into “tuples” of varying length and sent as queries to a search engine, which returns a list of potentially relevant URLs. At this point the user has the option of inspecting the URLs and trimming them; the actual web pages are then retrieved, converted to plain text and saved in plain text and XML format. The corpus can thus be interrogated using most concordancers.
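The seed-to-tuple step described above can be illustrated with a minimal sketch. This is not BootCaT's actual code: the function names (`build_tuples`, `to_query`), the default tuple length and tuple count, and the choice to quote multi-word terms are all assumptions made for illustration.

```python
import itertools
import random


def build_tuples(seeds, tuple_length=3, max_tuples=10, rng=None):
    """Return up to max_tuples distinct combinations of seed terms.

    Hypothetical helper: the defaults are illustrative, not BootCaT's own.
    """
    rng = rng or random.Random()
    # Enumerate every unordered combination of the requested length,
    # then shuffle so the selection is random but free of duplicates.
    combos = list(itertools.combinations(seeds, tuple_length))
    rng.shuffle(combos)
    return combos[:max_tuples]


def to_query(terms):
    """Join one tuple into a query string (assumed format).

    Multi-word terms are wrapped in double quotes so that a search
    engine would treat them as phrases.
    """
    return " ".join(f'"{t}"' if " " in t else t for t in terms)


if __name__ == "__main__":
    seeds = ["corpus", "concordancer", "parallel text", "terminology", "translation"]
    for terms in build_tuples(seeds, tuple_length=3, max_tuples=4):
        print(to_query(terms))
```

Enumerating all combinations is only practical for short seed lists; with many seeds, sampling tuples one at a time and discarding repeats would be a reasonable alternative.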

Using BootCaT, one can build a relatively large quick-and-dirty corpus (typically about 80 texts, with default parameters and no manual quality checks) in less than half an hour. This flexibility makes BootCaT a very useful tool for translators and translation students; it has been used in translation and terminology classrooms to build DIY corpora of varying size and specialization.

Binaries

You can download binaries packaged for Mac, Windows and Linux from the official web site:

https://bootcat.dipintra.it/

