Цитатник Леся Подерв'янського
The site is implemented as a simple Flask-based web app. Core
components are the actual app code (
v2/app/) and an engine that is
responsible for matching (
data is stored in an SQLite DB and is never modified.
Previous (v1) implementation is a PHP script in the
Since the corpus is relatively small and we require fuzzy matching, searching is done by matching the query against all phrases (properly cleaned up and tokenized) and selecting those with the highest match score.
At the bottom level, we use a concept of a word difference score, which measures the difference between two words. We define it as follows:
where L is Levenshtein edit distance and l is length of each word. The reason for scaling by the longer word length is to reflect the fact that for example, an edit distance of 2 between two 4-letter words is not the same as an edit distance of 2 in 15-letter ones.
Defined this way, the difference score is a real number between 0 and 1, where it is 0 for two identical words (or, strictly speaking, an insignificant difference between two infinite-length words).
With this score defined, we calculate the phrase similarity score as follows:
where Nq is number of unmatched words in the query and Np is number of unmatched words in candidate quote. The rationale for such definition is given below:
The more unmatched words in the query the less good the match is, since the user obvious considred those words important.
On the other hand, a candidate quote in most cases has many more words than a query given (since the query will only have a handful of most important words). Therefore we don't want the number of unmatched words in the candidate quote to affect the score in a major way. We use it scaled down by 1000 to break a tie when two quotes have a good match to query words. In this case a shorter quote would score slightly higher.
The similarity score defined this way is a real number from 0 to 1 which is 1 for nearly exact match between two phrases.
The algorithm is tuned using two parameters (both defined in
WORD_SIMILARITY_THRESHOLD: cut-off value for difference score that determines whether two words are similar or different.
PHRASE_SIMILARITY_THRESHOLD: we only report matches that have a similarity score higher than this value.
As a source data we used the text of plays from http://doslidy.org.ua . Most of the texts are HTML files (with little document structure), so the parser does its best trying to make sense from it. A couple of playes were available eslewhere as text documents, so the parser is capable of handling them as well.
Although we tried to avoid modifying the source document, in several cases it was unavoidable, so for now one is advised to use the documents stored within this project instead of re-downloading the files.
- Web hosting with Python/WSGI support
- Korn shell (required for database creation)
- Build the database
cd tools ./build_db
This step builds the database in data/habib.db.
- Deploy the database file.
- Deploy the app in a manner prescribed by your provider.
v2/app/app.cfgand modify the latter file to accommodate your setup.