Цитатник Леся Подерв'янського
HTML Python PHP CSS
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
data
tools
v1
v2
.gitignore
README.md

README.md

Цитатник Леся Подерв'янського

Architecture

The site is implemented as a simple Flask-based web app. Core components are the actual app code (v2/app/) and an engine that is responsible for matching (v2/app/FuzzyMatchingEngine.py). Run-time data is stored in an SQLite DB and is never modified.

Previous (v1) implementation is a PHP script in the v1 directory.

Search implementation

Since the corpus is relatively small and we require fuzzy matching, searching is done by matching the query against all phrases (properly cleaned up and tokenized) and selecting those with the highest match score.

At the bottom level, we use a concept of a word difference score, which measures the difference between two words. We define it as follows:

equation

where L is Levenshtein edit distance and l is length of each word. The reason for scaling by the longer word length is to reflect the fact that for example, an edit distance of 2 between two 4-letter words is not the same as an edit distance of 2 in 15-letter ones.

Defined this way, the difference score is a real number between 0 and 1, where it is 0 for two identical words (or, strictly speaking, an insignificant difference between two infinite-length words).

With this score defined, we calculate the phrase similarity score as follows:

equation

where Nq is number of unmatched words in the query and Np is number of unmatched words in candidate quote. The rationale for such definition is given below:

  • The more unmatched words in the query the less good the match is, since the user obvious considred those words important.

  • On the other hand, a candidate quote in most cases has many more words than a query given (since the query will only have a handful of most important words). Therefore we don't want the number of unmatched words in the candidate quote to affect the score in a major way. We use it scaled down by 1000 to break a tie when two quotes have a good match to query words. In this case a shorter quote would score slightly higher.

The similarity score defined this way is a real number from 0 to 1 which is 1 for nearly exact match between two phrases.

The algorithm is tuned using two parameters (both defined in FuzzyMatchingEngine class):

  • WORD_SIMILARITY_THRESHOLD: cut-off value for difference score that determines whether two words are similar or different.

  • PHRASE_SIMILARITY_THRESHOLD: we only report matches that have a similarity score higher than this value.

Source data

As a source data we used the text of plays from http://doslidy.org.ua . Most of the texts are HTML files (with little document structure), so the parser does its best trying to make sense from it. A couple of playes were available eslewhere as text documents, so the parser is capable of handling them as well.

Although we tried to avoid modifying the source document, in several cases it was unavoidable, so for now one is advised to use the documents stored within this project instead of re-downloading the files.

Deployment

Pre-requisites

  • Web hosting with Python/WSGI support
  • Python
  • Korn shell (required for database creation)
  • SQLite3
  • Flask

Deployment procedure

  • Build the database
cd tools
./build_db

This step builds the database in data/habib.db.

  • Deploy the database file.
  • Deploy the app in a manner prescribed by your provider.
  • Copy v2/app/app-template.cfg to v2/app/app.cfg and modify the latter file to accommodate your setup.