## Text Retrieval Problem

### TR vs. Database Retrieval

+ Information
  + Unstructured/free text vs. structured data
  + Ambiguous text vs. well-defined semantics

+ Query
  + Ambiguous vs. well-defined semantics
  + Incomplete vs. complete specification

+ Answers
  + Relevant documents vs. matched records

The key difference is that _TR is an empirically defined problem_, and we therefore have to rely on **empirical evaluation** involving users.


**Vocabulary:** $V=\{w_1, w_2, \ldots, w_N\}$ of the language

**Query:** $q = q_1, \ldots, q_m$, where $q_i \in V$

**Document:** $d_i = d_{i1}, \ldots, d_{im_i}$, where $d_{ij} \in V$

**Collection:** $C=\{d_1, \ldots, d_M\}$

**Set of relevant documents:** $R(q) \subseteq C$
  + Generally unknown and user-dependent
  + The query is a _hint_ about which documents are in `R(q)`

**Task:** compute `R'(q)`, an approximation of `R(q)`
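
The notation above maps directly onto simple data structures. The following is a minimal sketch, assuming a three-document toy collection and a hand-labelled relevant set. The document names, the text, and the choice to build `vocabulary` from the collection itself (rather than from a whole language) are all illustrative assumptions.

```python
# Toy collection C: each document d_i is a sequence of words d_i1, ..., d_im_i.
collection = {
    "d1": "text retrieval ranks documents by relevance".split(),
    "d2": "database systems match structured records".split(),
    "d3": "ranking text documents for a user query".split(),
}

# Vocabulary V: here simply every distinct word in the collection (an assumption;
# in the definition above V is the vocabulary of the language).
vocabulary = set()
for words in collection.values():
    vocabulary.update(words)

# Query q = q_1, ..., q_m with every q_i drawn from V.
query = ["text", "ranking"]
query_in_vocabulary = all(word in vocabulary for word in query)

# R(q) is user-dependent and normally unknown; it is hand-labelled here
# purely for illustration.
relevant = {"d1", "d3"}
relevant_is_subset = relevant <= set(collection)
```

In a real system `relevant` is exactly what is not available, which is why the task is to compute an approximation `R'(q)` from the query alone.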

### Two methods to find R'(q)

+ Document Selection

  + $R'(q)=\{d \in C \mid f(d,q)=1\}$, where $f(d,q) \in \{0,1\}$ is an [indicator function](https://en.wikipedia.org/wiki/Indicator_function) or [binary classifier](https://en.wikipedia.org/wiki/Binary_classification)
  + The system has to decide whether each document is relevant or not (_absolute relevance_)
  
+ Document Ranking

  + $R'(q)=\{d \in C \mid f(d,q) > \theta\}$, where $f(d,q) \in \mathbb{R}$ is a relevance function and $\theta$ is a threshold determined by the user
  + The system only needs to decide whether one document is more likely to be relevant than another (_relative relevance_)
  
![](https://storage.googleapis.com/personal-notes/docvranking.png)
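
To make the contrast concrete, here is a small sketch of both strategies on a toy collection. The scoring rule (total count of query terms in the document) and the selection rule (all query terms must occur) are hypothetical stand-ins chosen for illustration; they are not proposed as good choices of $f(d,q)$.

```python
from collections import Counter

collection = {
    "d1": "text retrieval ranks text documents".split(),
    "d2": "database retrieval returns matched records".split(),
    "d3": "cooking recipes for pasta".split(),
}
query = ["text", "retrieval"]


def score(doc_words, query_words):
    """Hypothetical relevance function f(d, q): total count of query terms in d."""
    counts = Counter(doc_words)
    return sum(counts[w] for w in query_words)


def select(collection, query_words):
    """Document selection: f(d, q) in {0, 1}; keep d only if every query term occurs."""
    return {name for name, words in collection.items()
            if all(w in words for w in query_words)}


def rank(collection, query_words):
    """Document ranking: sort documents by real-valued f(d, q), highest first."""
    scored = [(name, score(words, query_words)) for name, words in collection.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


selected = select(collection, query)   # {"d1"}
ranking = rank(collection, query)      # [("d1", 3), ("d2", 1), ("d3", 0)]
theta = 0  # user-chosen cutoff: R'(q) = {d : f(d, q) > theta}
above_threshold = [name for name, s in ranking if s > theta]  # ["d1", "d2"]
```

Selection makes a hard yes/no decision and drops `d2` entirely, whereas ranking keeps it in a lower position and leaves the cutoff to the user.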

### Problems in document selection

+ Classifier is unlikely to be accurate
  + **Over-constrained** query $\rightarrow$ no relevant documents
  + **Under-constrained** query $\rightarrow$ overly noisy
  + It is hard to find the right balance between the two

+ Even if the classifier is accurate, not all relevant documents are equally relevant, so **prioritization** is needed
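
The over- and under-constrained cases can be illustrated with Boolean matching. This is a sketch under assumed data: the four documents and the query terms are made up, and `boolean_and` / `boolean_or` are hypothetical helpers standing in for a strict and a loose selection classifier.

```python
collection = {
    "d1": "cheap flights to tokyo in spring".split(),
    "d2": "tokyo travel guide and hotel tips".split(),
    "d3": "cheap hotel deals in paris".split(),
    "d4": "spring gardening tips".split(),
}


def boolean_and(collection, terms):
    """Select documents containing every term (tends to over-constrain)."""
    return {name for name, words in collection.items()
            if all(t in words for t in terms)}


def boolean_or(collection, terms):
    """Select documents containing any term (tends to under-constrain)."""
    return {name for name, words in collection.items()
            if any(t in words for t in terms)}


terms = ["cheap", "tokyo", "hotel", "spring"]
strict = boolean_and(collection, terms)  # empty: no document has all four terms
loose = boolean_or(collection, terms)    # all four documents, including d4
```

The strict query returns nothing even though `d1` and `d2` look useful, while the loose query also returns the gardening document `d4` and gives no indication of which results matter most.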

Informally, this is why ranking is generally preferred. There is also a theoretical justification:

_Probability Ranking Principle_: Returning a ranked list of documents in descending order of the probability that each document is relevant to the query is the optimal strategy under two assumptions: [1] the utility of a document (to a user) is **independent** of the utility of any other document, and [2] the user browses the results **sequentially**.
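
The principle can be checked by brute force on a tiny example. The sketch below assumes four documents with made-up relevance probabilities and measures quality as the expected number of relevant documents in the top $k$, which is one simple reading of "optimal" under the two assumptions. The names `prob_relevant` and `expected_relevant_in_top_k` are illustrative.

```python
from itertools import permutations

# Hypothetical estimates of P(relevant | d, q) for four documents.
prob_relevant = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 0.7}


def expected_relevant_in_top_k(order, k):
    """Expected number of relevant documents among the first k results.

    Relies on assumption [1]: each document contributes independently.
    """
    return sum(prob_relevant[name] for name in order[:k])


# Ranking by descending probability of relevance.
prp_order = tuple(sorted(prob_relevant, key=prob_relevant.get, reverse=True))

# Assumption [2] (sequential browsing) means a user may stop after any k,
# so compare the PRP order against every other order at every cutoff.
# The small tolerance guards against floating-point summation order.
prp_is_optimal = all(
    expected_relevant_in_top_k(prp_order, k)
    >= expected_relevant_in_top_k(order, k) - 1e-12
    for order in permutations(prob_relevant)
    for k in range(1, len(prob_relevant) + 1)
)
```

For this data no other ordering beats the descending-probability order at any cutoff. If document utilities were not independent (for example, near-duplicate documents), this simple sum would no longer describe what the user gains.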

The main challenge in TR is therefore to design an effective ranking function:

$f(d,q)=$ 🤔