KPRA is an experimental service that condenses the mass of text in film reviews so that users can grasp a film's essence more quickly and answer two questions:
- Why should I watch the film?
- Why should I not watch the film?
- Find the location (in HTML) of the numbers that indicate how many positive/negative/neutral reviews there are, and extract them.
- Depending on the above numbers, calculate the maximum number of reviews to display per page (in order to minimize the number of pages to download and parse).
- Write a script to download all the reviews (and, perhaps, store them locally on a temporary basis).
- Reduce all words to their base form (lemmatize them) in order to compute tf–idf. For example, via Yandex's mystem.
- Compute the tf–idf statistic on the obtained data. Presumably the best approach is to treat the primary data as follows: each review is a document, and each set of reviews sharing a mood is a collection, or corpus. Note, however, that since it is important to know a word's weight for a particular mood, it probably also makes sense to treat the whole set of reviews (regardless of mood) as a single corpus.
- Identify collocations using the t-test, chi-square, MLE, MI/PMI, etc. As above, it may be necessary to work on the base word forms. After obtaining the various metrics, select the most appropriate collocations (based on some factors?).
- Develop a simple GUI for a web service.
- KPRA should function as a separate web service that lets users quickly look up information about a film.
- KPRA should retain already collected information in some form so that later requests are processed faster.
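
The tf–idf step listed above could be sketched roughly as follows. This is only an illustration, not KPRA's actual code: it assumes each review has already been lemmatized into a list of tokens, uses the plain `tf * log(N / df)` variant of the statistic (other weightings exist), and the function name `tf_idf` and the sample reviews are made up for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    `documents` is a list of token lists; tokens are assumed to be
    lemmatized already (for example by mystem).
    """
    n_docs = len(documents)

    # Document frequency: in how many reviews each term occurs.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        # Term frequency (relative) times inverse document frequency.
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already lemmatized reviews (made-up data).
reviews = [
    ["great", "plot", "great", "cast"],
    ["boring", "plot"],
    ["great", "music"],
]
scores = tf_idf(reviews)
```

To follow the two views described in the list, the same function could be called once per mood (passing only the positive, negative, or neutral reviews) and once on all reviews together; a term that occurs in every review of the chosen corpus gets a weight of zero.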
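
Of the collocation measures named in the list, PMI is probably the simplest to prototype. The sketch below is an assumption-laden example rather than a chosen design: it scores only adjacent word pairs, estimates probabilities by relative frequency, and drops pairs seen fewer than `min_count` times, since PMI is known to overrate rare pairs. The name `pmi_bigrams` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information.

    PMI(x, y) = log2(P(x, y) / (P(x) * P(y))), with probabilities
    estimated by relative frequencies.
    Returns (pair, score) tuples sorted by descending score.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:  # rare pairs give unreliable PMI
            continue
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical lemmatized text (made-up data).
tokens = "special effect be great but special effect be not enough".split()
collocations = pmi_bigrams(tokens)
```

If the tokens of several reviews are simply concatenated, the last word of one review and the first word of the next form a spurious pair; counting bigrams per review and summing the counters would avoid that.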