This repository contains Alignerzilla - a movie subtitle aligner - and some additional tools that have been developed to use with it.
This subtitle aligner was developed to obtain a Master Degree in Information Systems and Computer Engineering. In this repository it is also available the dissertation that describes the work done, explaining all the decisions.
If you intend to use anything from this repository, please make a reference to this document. BibTex:
@MastersThesis{LuisRosado_5_2016,
author = "Lu{\'\i}s Carlos Cachapela Rosado",
title = "Cinema at the service of Natural Language Processing",
school = "Instituto Superior T{\'e}cnico, Universidade de Lisboa",
year = 2016,
month = may
}
This repository contains three different directories:
pre-processing-scripts/
is used to remove noise and normalize the subtitle files, so that the aligner can use them.alignerzilla/
contains the actual subtitle aligner. Alignerzilla expects the input data to be pre-processed by the aforementioned scripts.dataset/
contains the subtitles (in raw state) and reference alignments we have used to evaluate the aligner.
The pre-processing-scripts require an environment with Bash and Perl.
- First, navigate to
pre-processing-scripts/
directory; - Then, you should place the raw subtitle files that you want to pre-process inside the folder
input/
; - Finally, execute the following command:
$ bash normalizeAll.sh
or$ ./normalizeAll.sh
.
If everything runs as expected, you should now have all the pre-processed subtitle files inside output/
folder.
Note: the runtime of the pre-processing-scripts depends on the used machine and how many subtitles you want to normalize. It might take a while for many subtitles.
Alignerzilla requires an environment with Apache Ant and Java 7. Also, make sure the subtitles you provide as input are already pre-processed by pre-processing-scripts.
- First, make sure you've followed the instructions in 1);
- Then, navigate to
alignerzilla/
directory; - Now, you should place the pre-processed subtitle files inside the folder
data/
;- The subtitles placed in
data/
directory should all be named asMovieName-LANGUAGE.srt
.- For example:
lotr-EN.srt
orTheDarkKnight-PT.srt
.
- For example:
- As the goal is to align subtitle pairs, every movie must have two different subtitle files, with different language tags.
- For example:
lotr-EN.srt
andlotr-PT.srt
. - If you want to use same language subtitles
lotr-EN1.srt
andlotr-EN2.srt
can be used.
- For example:
- The subtitles placed in
- Then, make sure
list.txt
contains a list with the filenames of the subtitles you want to align.- One subtitle file per line.
- If you want to aligner every subtitle in
data/
you can simply use the following command:$ ls data/ > list.txt
- Finally, execute one of the following commands:
$ bash run-alignerzilla.sh
or$ ./run-alignerzilla.sh
.
If everything runs as expected, you should now have the aligned subtitles inside results/
folder.
You can configure additional parameters editing alignerzilla.config
.
Note: the runtime of Alignerzilla depends on the used machine and how many subtitles you want to align. It might take a while for many subtitles.