diff --git a/Makefile.am b/Makefile.am index b1c1f0b9..e130e661 100644 --- a/Makefile.am +++ b/Makefile.am @@ -1,3 +1,3 @@ ACLOCAL_AMFLAGS = -I m4 -EXTRA_DIST = m4 reconf configure COPYING.LESSER +EXTRA_DIST = m4 reconf configure COPYING.LESSER README.md SUBDIRS = src utils examples m4 doc diff --git a/README b/README deleted file mode 100644 index 0e70b421..00000000 --- a/README +++ /dev/null @@ -1,357 +0,0 @@ -Thot: a Software Toolkit for Statistical Machine Translation -============================================================ - -Introduction ------------- -Thot is an open source software toolkit for statistical machine -translation (SMT). Originally, Thot incorporated tools to train -phrase-based models. The new version of Thot now includes a -state-of-the-art phrase-based translation decoder as well as tools to -estimate all of the models involved in the translation process. In -addition to this, Thot is also able to incrementally update its models -in real time after presenting an individual sentence pair using online -learning techniques. - -Thot is being developed by [Daniel Ortiz-Martínez] [1]. Daniel is -currently a researcher on natural language processing at [Webinterpret] -[2]. Formerly, he was a member of the [PRHLT research group] [3] as well -as assistant professor at the [Technical University of Valencia] [4]. - -News ----- -A new version of the toolkit has been released with several improvements -and new features: - -- Incorporation of a whole set of pre/post-processing tools - -- Portability increased (Thot has been successfully compiled in many - different platforms, including Mac OS X, FreeBSD, OpenBSD, NetBSD, - etc.) - -- Improved checking of runtime errors in all of the tools involved in - the translation pipeline - -- Early detection of bugs and portability problems using built-in checks - -- Improvements in tools to carry out translation experiments and - incorporation of new ones - -- Translation can now be executed in parallel using clusters or - multi-processor systems by means of the thot_decoder tool - -- The Thot manual has been extended and revised - -- Multiple minor improvements and bug fixes - - -Features --------- -The toolkit includes the following features: - -- Phrase-based statistical machine translation decoder. -- Computer-aided translation (post-editing and interactive machine translation). -- Incremental estimation of all of the models involved in the translation process. -- Robust generation of alignments at phrase-level. -- Client-server implementation of the translation functionality. -- Single word alignment model estimation using the incremental EM algorithm. -- Scalable and parallel model estimation algorithms using Map-Reduce. -- Compiles on Unix-like and Windows (using Cygwin) systems. -- Integration with the [CasMaCat Workbench] [5] developed in the EU FP7 [CasMaCat project] [6]. -- ... - - -Distribution Details --------------------- -Thot has been coded using C, C++, Python and shell-scripting. Thot is -known to compile on Unix-like and Windows (using Cygwin) systems. See -the "Documentation and Support" section of these instructions if you -experience problems during compilation. - -It is released under the [GNU Lesser General Public License (LGPL)] [7]. - - -Installation ------------- - -#### Basic Installation Procedure - -To install Thot, first you need to install the autotools (autoconf, -autoconf-archive, automake and libtool packages in Ubuntu). If you are -planning to use Thot on a Windows platform, you also need to install the -[Cygwin] [8] environment. Alternatively, Thot can also be installed on -Mac OS X systems using [MacPorts] [9]. - -On the other hand, some of the code used for corpus pre/post-processing -(see more on this in the Thot manual) is based on the [Natural Language -Toolkit (NLTK)] [10] library. Those users interested in using the -pre/post-processing functionality incorporated in Thot will need to -install that library as well. - -Once the autotools are available (as well as other required software -such as Cygwin, MacPorts or the NLTK library), you can proceed with the -installation of Thot by following the next sequence of steps: - - 1. Obtain the package using git: - - $ git clone https://github.com/daormar/thot.git - - Or [download it in a zip file] [11] - - 2. `cd` to the directory containing the package's source code and type - `./reconf`. - - 3. Type `./configure` to configure the package. - - 4. Type `make` to compile the package. - - 5. Type `make install` to install the programs and any data files and - documentation. - - 6. You can remove the program binaries and object files from the source - code directory by typing `make clean`. - -By default the files are installed under the /usr/local/ directory (or -similar, depending of the OS you use); however, since Step 5 requires -root privileges, another directory can be specified during Step 3 by -typing: - - $ configure --prefix= - -For example, if "user1" wants to install the Thot package in the -directory /home/user1/thot, the sequence of commands to execute should be -the following: - - $ ./reconf - $ configure --prefix=/home/user1/thot - $ make - $ make install - -The installation directory can be the same directory where the Thot -package was decompressed. - -See "INSTALL" file for more information. - -IMPORTANT NOTE: if Thot is being installed in a PBS cluster (a cluster -providing `qsub` and other related tools), it is important that the -`configure` script is executed in the main cluster node, so as to -properly detect the cluster configuration (do not execute it in an -interactive session). - -#### Alternative Installation Options -The Thot configure script can be used to modify the toolkit -behavior. Here is a list of current installation options: - -- **--enable-ibm2-alig**: Thot currently uses HMM-based alignment models -to obtain the word alignment matrices required for phrase model -estimation. One alternative installation option allows to replace -HMM-based alignment models by IBM 2 alignment models. IBM 2 alignment -models can be estimated very efficiently without significantly affecting -translation quality. - -- **--with-casmacat**: this options enables the configuration -required for the CasMaCat Workbench, see more information below. - -#### Installation Including the CasMaCat Workbench -Thot can be combined with the CasMacat Workbench which has been developed -in the project of the same name. [See this webpage] [5] to get the specific -installation instructions. - -#### Checking Package Installation -Once the package has been installed, it is possible to perform basic -checkings in an automatic manner so as to detect portability errors. For -this purpose, the following command can be used: - - $ make installcheck - -The tests performed by the previous command involve the execution of the -main tasks present in a typical SMT pipeline, including training and -tuning of model parameters as well as generating translations using the -estimated models (see more on this in the Thot manual). The command -internally uses the toy corpus provided with the Thot package to carry -out the checkings. - - -Relation with Existing Software -------------------------------- -Due to the strong focus of Thot on online and incremental learning, it -includes its own programs to carry out language and translation model -estimation. Specifically, Thot includes tools to work with n-gram -language models based on incrementally updateable sufficient -statistics. On the other hand, Thot also includes a set of tools and a -whole software library to estimate IBM 1, IBM 2 and HMM-based word -alignment models. The estimation process can be carried out using batch -and incremental EM algorithms. This functionality is not based on the -standard GIZA++ software for word alignment model generation. - -Additionally, Thot does not use any code from other existing translation -tools. In this regard, Thot tries to offer its own view of the process -of statistical machine translation, with a strong focus on online -learning and also incorporating interactive machine translation -functionality. Another interesting feature of the toolkit is its stable -and robust translation server. - -Current Status --------------- -The Thot toolkit is under development. Original public versions of Thot -date back to 2005 [Ortiz-Martínez et al., 2005] and were hosted in -SourceForge. These original versions were strongly focused on the -estimation of phrase-based models. By contrast, current version offers -several new features that had not been previously incorporated. - -A set of specific tools to ease the process of making SMT experiments -has been created. Basic usage instructions have been recently added to -the Thot manual. - -On the other hand, there are some toolkit extensions that will be -incorporated in the next months: - -- Improved management of concurrency in the Thot translation server - (concurrent translation processes are currently handled with mutual - exclusion). - -- Virtualized language models (i.e. accessing language model parameters - from disk). - -- Interpolation of language and translation models. - -Finally, here is a list of known issues with the Thot toolkit that are -currently being addressed: - -- Phrase model training is based on HMM-based word alignment models - estimated by means of incremental EM. The current implementation is - slow and currently constitutes a bottleneck when training phrase - models from large corpora. One already implemented solution is to - carry out the estimation in multiple processors. Another solution is - to replace HMM-based models by IBM 2 Models, which can be estimated - very efficiently. However, we are also investigating alternative - optimization techniques to efficiently execute the estimation process - of HMM-based models in a single processor. - -- Log-linear model weight adjustment is carried out by means of the - downhill simplex algorithm, which is very slow. Downhill simplex will - be replaced by a more efficient technique. - -- Non-monotonic translation is not yet sufficiently tested, specially - with complex corpora such as Europarl. - - -Support -------- -Project documentation is being developed. Such documentation include: - -- These instructions. -- The [Thot manual] [12] ("thot_manual.pdf" under the "doc" directory). - -If you need additional help, you can: - -- use the [github issue tracker] [13]. -- [send an e-mail to the author] [14]. -- join the [CasMaCat support group] [15]. - -Additional information about the theoretical foundations of Thot can be -found in [Ortiz-Martínez, 2011]. One interesting feature of Thot, -incremental (or online) estimation of statistical models, is also -described in [Ortiz-Martínez et al., 2010]. Finally, phrase-level -alignment generation functionality implemented in Thot was proposed in -[Ortiz-Martínez et al., 2008]. - - -Citation --------- -You are welcome to use the code under the terms of the license for -research or commercial purposes, however please acknowledge its use with -a citation: - -Daniel Ortiz-Martínez, Francisco Casacuberta. -*"The New Thot Toolkit for Fully Automatic and Interactive Statistical Machine Translation"*. -In Proc. of the European Association for Computational Linguistics (EACL): System Demonstrations, -Gothenburg, Sweden, April 2014. pp. 45-48. - -Here is a BiBTeX entry: - -
-@InProceedings{Ortiz2014,
-  author    = {Daniel Ortiz-Mart\'{\i}nez and Francisco Casacuberta},
-  title     = {The New Thot Toolkit for Fully Automatic and Interactive Statistical Machine Translation},
-  booktitle = {Proc. of the European Association for Computational Linguistics (EACL): System Demonstrations},
-  year      = {2014},
-  month     = {April},
-  address   = {Gothenburg, Sweden},
-  pages     = "45--48",
-}
-
- - -Literature ----------- -Daniel Ortiz-Martínez, -"*Online Learning for Statistical Machine Translation*". -In Computational Linguistics, 2016. Vol. 42 (1), pp. 121-161. [Download] [16] - -Daniel Ortiz-Martínez, -"*Advances in Fully-Automatic and Interactive Phrase-Based Statistical Machine Translation*". -PhD Thesis. Universidad Politécnica de Valencia. 2011. -Advisors: Ismael García Varea and Francisco Casacuberta. [Download] [17] - -Daniel Ortiz-Martínez, Ismael García-Varea, Francisco Casacuberta. -"*Online Learning for Interactive Statistical Machine Translation*". -In Proc. of the North American Chapter of the Association for Computational Linguistics - -Human Language Technologies (NAACL-HLT), pp. 546-554, Los Angeles, US, 2010. [Download] [18] - -Daniel Ortiz-Martínez, Ismael García-Varea, Francisco Casacuberta. -"*Phrase-level alignment generation using a smoothed loglinear phrase-based statistical alignment model*". -In Proc. of the European Association for Machine Translation (EAMT), pp. 160-169, Hamburg, Germany, 2008. *Best paper award*. [Download] [19] - -Daniel Ortiz-Martínez, Ismael García-Varea, Francisco Casacuberta. -"*Thot: a toolkit to train phrase-based models for statistical machine translation*". -In Proc. of the Machine Translation Summit (MT-Summit), -Phuket, Thailand, September 2005. [Download] [20] - - -Why Thot? ---------- - -The name Thot was chosen because it represents the [Egyptian God of -knowledge] [21] (he is also referred to as Thoth or Tot), continuing the -tradition of using Egyptian names for machine translation tools -initiated by other developers in the field. Interestingly, Seshat, who -is the consort of Thot and the [Goddess of writing] [22], is the name of -[a very exciting machine learning tool] [23] for mathematical expression -recognition developed by my former colleague at PRHLT, [Francisco -Álvaro] [24]. - - -Sponsors --------- -Thot has been supported by the European Union under the [CasMaCat -research project] [6]. Thot has also received support from the Spanish -Government in a number of research projects, such as the [MIPRCV -project] [25] that belongs to the [CONSOLIDER programme] [26]. - - -[1]: http://daormar.github.io/ -[2]: http://webinterpret.com/ -[3]: https://www.prhlt.upv.es/ -[4]: http://www.upv.es/ -[5]: http://www.casmacat.eu/index.php?n=Workbench.Workbench -[6]: http://www.casmacat.eu/ -[7]: http://www.gnu.org/copyleft/lgpl.html -[8]: https://www.cygwin.com/ -[9]: https://www.macports.org/ -[10]: http://www.nltk.org/ -[11]: https://github.com/daormar/thot/archive/master.zip -[12]: http://daormar.github.io/thot/docsupport/thot_manual.pdf -[13]: https://github.com/daormar/thot/issues -[14]: mailto:dortiz@prhlt.upv.es -[15]: http://groups.google.com/group/casmacat-support/boxsubscribe -[16]: http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00244 -[17]: https://www.prhlt.upv.es/aigaion2/attachments/dortiz_thesis_2011.pdf-d12d165f9a2b01b0697000ed7c08c4bc.pdf -[18]: http://aclweb.org/anthology-new/N/N10/N10-1079.pdf -[19]: http://mt-archive.info/EAMT-2008-Ortiz-Martinez.pdf -[20]: http://www.mt-archive.info/MTS-2005-Ortiz-Martinez.pdf -[21]: https://en.wikipedia.org/wiki/Thoth -[22]: https://en.wikipedia.org/wiki/Seshat -[23]: https://github.com/falvaro/seshat -[24]: http://falvaro.github.io/ -[25]: http://miprcv.iti.upv.es/ -[26]: http://www.ingenio2010.es/ diff --git a/README.md b/README.md index 429df2af..86abfa08 100644 --- a/README.md +++ b/README.md @@ -1 +1 @@ -Please refer to [http://daormar.github.io/thot/](http://daormar.github.io/thot/) to see project information. +Please refer to [http://daormar.github.io/thot/](http://daormar.github.io/thot/) to see project information, including installation instructions.