In [4]:
import pandas as pd
import numpy as np

# Information Retrieval - Elliot Linsey QMUL 2022

## What is IR?

This is a subject focusing on extracting information and knowledge from data using search queries. 

![knowledge.JPG](attachment:knowledge.JPG)

In this case, the data is 'unstructured' and therefore needs different methods for information to be extracted from this raw data. From here, the information is then processed to produce knowledge. 

IR is the ability to sift through this unstructured data to find relevant documents related to the user's search query. 

Unstructured data can be of many different types, such as: 

* Text (Documents)
* XML and structured documents
* Images
* Audio (sound effects, songs, etc.)
* Video
* Source code
* Applications/Web services

### Key Terms:

* term frequency (TF)
* document frequency
* inverse document frequency (IDF)
* vector-space model (VSM)
* probabilistic model
* BM25 (Best-Match Version 25)
* DFR (Divergence from Randomness)
* page rank
* stemming
* precision, recall

### Notation
* D: set of documents
* Q: set of queries
* d $\to$ q: d implies q as in classical logic
* d $\cap$ q: the intersection of d and q (elements in d **and** q)
* |d|: the length of the set d, number of elements contained within d
* d $\cup$ q: the union of d and q (elements in d **or** q)
* a $\vee$ b: a or b
* a $\wedge$ b: a and b
* $\neg$ b: not b

### Databases vs IR

We have worked extensively with databases, primarily using modules such as pandas within python. However, the industry standard is SQL which works in a similar manner. Extracting information from a database requires exact queries to be used as the data is stored in a defined structure. Due to this, we also obtain an exact result with no vague or unrelated data being returned. 

Within IR, there is no predefined structure to the data we are querying. Depending on the methods and algorithms we utilise, this means that we could get different results for the same query; some may be relevant and some may not. These queries are usually informal in contrast to DB searches and are often expressed in natural language. An important part of IR is therefore Natural Language Processing (NLP), which allows the computer to compare queries to the data that it has indexed.

The most common usage of IR is in search engines. Google, Bing, Yahoo etc. 

![DB%20vs%20IR.JPG](attachment:DB%20vs%20IR.JPG)

### Information Need

This is the information that you are trying to extract from the data. For example, if you are trying to find a specific type of dog to adopt: 

1. Must be a Labrador
2. Must be Female
3. Must be Golden

The information (document) result should include breed, sex, colour, age, location, cost, health issues etc

The *Query* is the formal representation of this information need. 

### Types of Information Need

**Retrospective ("searching the past")**

Known as "Ad-hoc Querying", this is the instance of posing information queries against a static collection of documents. In this way, the documents are not evolving or expanding and are able to be stored offline. 

**Topical Search:**
* Identify positive results occurring from Napoleon's rule in the 1800s. 
* Compile a list of famous musicians, what instrument they play and number of record sales. 

**Open-ended Exploration:**
* Who has the best guitar tone?
* What types of materials are available for kitchen counters?

**Known-item Search:**
* Find Amazon's home page.
* What is Elliot Linsey's QMUL ID number?

**Question Answering:**
![question%20answering.JPG](attachment:question%20answering.JPG)

**Prospective Searching**

This is based on more dynamic data which is being created or classed in real time. 

**Filtering:**
* Creating a spam filter that classifies incoming mail as spam or not spam. Binary.

**Multi-class labelling or Classification:**
* Filtering news stories that are posted into bins depending on what type of story they investigate. i.e. if you were only interested in crime news, you could create a filter that will evaluate news stories as they are posted, if they are not crime stories then they are not shown but if they are, you receive a notification etc. 

### Stemming and Lemmatization 

Stemming is the process of removing prefixes and suffixes from words. e.g developing => develop.

Lemmatization is the process of transforming words to their base meaning. e.g am, is, are => be. 

Both methods will usually increase recall but decrease precision.

### Evaluation Methods

Our good friends Precision and Recall are used to evaluate the results of an IR system. In this case, they mean the same as before but are related to *relevance* to our search query. 

Recall: The ability to find all documents relevant to our query (miss as few relevant documents as possible). If you prioritise recall then there is a chance that non-relevant documents may be returned in the search for all relevant documents. In essence, it's the proportion of relevant documents retrieved.

Precision: Retrieve the most relevant documents to the query. In this case we may not collect all the relevant documents, but the ones we do collect are more likely to be relevant. Proportion of the retrieved documents that are actually relevant. 

Recall $\approx$ Completeness

Precision $\approx$ Correctness

There are also a couple of terms related to relevance: 

**Topical Relevance:** Is it about the right thing?

**Situational Relevance:** Is it useful? 

![retrieval%20metrics.JPG](attachment:retrieval%20metrics.JPG)

Precision = Number of relevant documents retrieved/number of documents retrieved

Recall = Number of relevant documents retrieved/number of relevant documents
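As a minimal sketch of the two formulas above, treating the retrieved and relevant documents as sets of ids. The function name `precision_recall` and the document ids are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents retrieved, 5 relevant overall, 3 in common
p, r = precision_recall({'d1', 'd2', 'd3', 'd4'}, {'d1', 'd3', 'd4', 'd7', 'd9'})
print('precision = ' + str(p))
print('recall = ' + str(r))
```

Here precision is 3/4 and recall is 3/5, since 3 of the 4 retrieved documents are relevant and 3 of the 5 relevant documents were found.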

Some tasks prefer either precision or recall. For example, web search may prefer precision where they want relevant documents ranked highly as users may only select the first few results. Other tasks may prefer recall such as in medicine where you are looking for papers about a certain disease. You want an exhaustive list of all documents that may be relevant to this specific topic.   

The above measures do not take into account the ranking of the documents. This is important as an IR system should aim to retrieve relevant documents before non-relevant ones.  

Within the table below, let's assume there are 10 relevant documents but only 5 have been retrieved (in red). The calculations are below. 

![p+r%20table.JPG](attachment:p+r%20table.JPG)

This can also be interpreted as at 20% recall, we have 66.66% precision. At 30% recall we have 50% precision. As one increases the other usually decreases. 
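The recall/precision pairs can be sketched in code by walking down a ranked list and recording both values each time a relevant document is found. The ranking used here (relevant documents at ranks 1, 3, 6, 10 and 15 out of 10 relevant documents in total) is an assumption chosen to be consistent with the figures quoted above, not data read from the table.

```python
def precision_recall_points(ranking, n_relevant):
    """ranking: list of booleans (True = relevant), best-ranked document first.

    Returns a list of (recall, precision) pairs, one per relevant document found.
    """
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / n_relevant, hits / rank))
    return points

# Assumed ranking: relevant documents at ranks 1, 3, 6, 10 and 15
relevant_ranks = {1, 3, 6, 10, 15}
ranking = [rank in relevant_ranks for rank in range(1, 16)]

pr_points = precision_recall_points(ranking, n_relevant=10)
for recall, precision in pr_points:
    print('recall = ' + str(round(recall, 2)) + ', precision = ' + str(round(precision, 4)))
```

Only 5 of the 10 relevant documents are retrieved, so recall never rises above 50% for this ranking.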

Recall is difficult if not impossible to get correct as we may not know all relevant documents. 

We usually use more than 1 query to evaluate precision and recall then take the average: 

![average%20precision.JPG](attachment:average%20precision.JPG)

We can then create an interpolated precision-recall curve: 

![interpolated%20precision%20recall%20curve.JPG](attachment:interpolated%20precision%20recall%20curve.JPG)

### Evaluation Method

There are only 3 steps: 

* Run the query against the system to obtain a ranked list of documents
* Use the ranking and relevance judgements to produce precision-recall pairs
* Average precision-recall pairs to obtain an overall measure of effectiveness


### R-Precision

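We may also use this single value summary: R-precision is the precision after the top R documents have been retrieved, where R is the number of relevant documents for the query. A closely related measure is the precision at a fixed document cut-off level. For example, the wording may be "P@10" which is the precision at 10 documents.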

![R-precision.JPG](attachment:R-precision.JPG)

### F-Measure

The harmonic mean of recall and precision. The P(j) and r(j) mean precision and recall at *j*-th document in the ranking.

![F1%20score.JPG](attachment:F1%20score.JPG)

F = 0 means no relevant documents were retrieved

F = 1 means all ranked documents were relevant

A high F is achieved only when both precision and recall are high. 
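A small sketch of the harmonic mean shows why both values need to be high. The function name `f_measure` and the example values are made up for illustration.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; taken as 0 when either is 0."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / (1 / precision + 1 / recall)

print('balanced (0.5, 0.5): ' + str(round(f_measure(0.5, 0.5), 4)))
print('unbalanced (0.9, 0.1): ' + str(round(f_measure(0.9, 0.1), 4)))
```

The arithmetic mean of both pairs is 0.5, but the harmonic mean of the unbalanced pair is pulled down towards the weaker value.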

We can plot the F scores for the achieved precision and recall scores. The breakeven point is where the curve crosses the y=x line (precision equals recall). The better the performance, the further to the top-right the curve lies and therefore the higher the breakeven point.

![breakeven%20point.JPG](attachment:breakeven%20point.JPG)

### E measure

This allows the user to specify whether they are more interested in precision or recall

![E%20measure.JPG](attachment:E%20measure.JPG)

If b = 1 it acts as the complement of the F measure (1 - F measure)

If b > 1 it is more interested in recall

If b < 1 it is more interested in precision
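The effect of b can be sketched as follows, assuming the figure uses the usual form $E = 1 - \frac{1 + b^2}{b^2/r + 1/P}$. The function name `e_measure` and the precision/recall values are made up for illustration.

```python
def e_measure(precision, recall, b=1.0):
    """E measure, assuming E = 1 - (1 + b^2) / (b^2 / recall + 1 / precision)."""
    if precision == 0 or recall == 0:
        return 1.0  # worst possible value
    return 1 - (1 + b**2) / (b**2 / recall + 1 / precision)

# Hypothetical system with high precision but low recall
precision, recall = 0.8, 0.4
for b in (0.5, 1.0, 2.0):
    print('b = ' + str(b) + ': E = ' + str(round(e_measure(precision, recall, b), 4)))
```

E is an error measure, so lower is better. Because this system is weak on recall, it is penalised most when b > 1 and least when b < 1.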

### ROC plots - See statistics 

### Relevance

Complex concept that has no truly definable answer. 

Relies on humans to make judgements about documents and relevance to queries, however no gold standard exists as humans can be idiosyncratic and variable. Disagreements can occur on whether a document is relevant. 

The success of an IR system relies on how effective it is at satisfying the information need of idiosyncratic humans.

Relevance is an independent concept, one document's relevance does not influence another document. 

Treated as an absolute, objective decision. Either the document is relevant or it is not. 

A number of assumptions are made to simplify the problem:
* eg. topical vs user relevance
* eg. binary vs multi-valued relevance
* eg. assume all index terms within the document are independent (unigram model)


## Information Retrieval Cycle:

![IR%20Cycle.JPG](attachment:IR%20Cycle.JPG)

We can see two loops from both the Selection and Result boxes, depending on the effectiveness of our search algorithm we may need to reformulate our query as well as provide relevance feedback to further train our model. 

## Search Process: 

![Search%20process.JPG](attachment:Search%20process.JPG)

This includes the document collection and indexing process. We can see that the comparison of the search query and index leads to the creation of a ranked list and resulting documents. 

## IR Black Box

![IR%20black%20box.JPG](attachment:IR%20black%20box.JPG)

Here we can see the main processes of the IR process. On the left, the search query is inputted by the user in natural language format. The difficulty of IR is processing data from our documents collection into a representable format that can then be made into an index that is compared with our search query in the comparison function. 

## IR Concept

![IR%20concept.JPG](attachment:IR%20concept.JPG)

This is very similar to the IR Black Box above, just take note of the Relevance Feedback loop on the right from Retrieved Documents to Information Need.

## Language Issues:

Within the English language there are many different issues that cause natural language processing difficulty. 

![lang%20difficulty.JPG](attachment:lang%20difficulty.JPG)

## Boolean Model

Simple model based on set theory and boolean algebra

A query is defined by boolean expressions such as *and*, *or*, *not*. 

Requires precise semantics and formal entry, term queries are either present or absent {0,1}. 

![boolean%20model.JPG](attachment:boolean%20model.JPG)

## CHECK OUT THE DNF MODEL - CONFUSING

![boolean%20dnf.JPG](attachment:boolean%20dnf.JPG)

Create a dictionary of terms present within the documents (vocabulary or lexicon, *collection language model?*). Then create a *posting* list that records which documents contain which terms. 

The posting may also contain information such as term frequency or the position of the term within the document. 

![boolean%20dict.JPG](attachment:boolean%20dict.JPG)

Example: 

Query = (sports $\wedge$ game) $\vee$ (score $\wedge \neg$ win)

Translates into:  sports *and* game *or* score *and not* win.

Using the representations below, we see that documents {d1,d2,d3} fit these criteria. However, due to the binary nature of boolean retrieval there is no ranking associated with the documents. 

In [2]:
words_b = pd.DataFrame(
    [[1,1,1,1],
    [1,1,0,0],
    [1,1,1,0],
    [1,0,0,0]],
    columns = ['d1','d2','d3','d4'],
    index = ['sports', 'game', 'score', 'win']
)
words_b

Unnamed: 0,d1,d2,d3,d4
sports,1,1,1,1
game,1,1,0,0
score,1,1,1,0
win,1,0,0,0
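The query can be checked directly against this matrix by treating each row as a boolean vector over the documents. This is a minimal sketch; the name `incidence` is just for illustration and holds the same values as `words_b`.

```python
import pandas as pd

# Same term-document incidence matrix as words_b, stored as booleans
incidence = pd.DataFrame(
    [[1, 1, 1, 1],
     [1, 1, 0, 0],
     [1, 1, 1, 0],
     [1, 0, 0, 0]],
    columns=['d1', 'd2', 'd3', 'd4'],
    index=['sports', 'game', 'score', 'win'],
).astype(bool)

# (sports AND game) OR (score AND NOT win), evaluated element-wise per document
matches = (incidence.loc['sports'] & incidence.loc['game']) | \
          (incidence.loc['score'] & ~incidence.loc['win'])

matching_docs = list(matches[matches].index)
print(matching_docs)
```

The first clause matches d1 and d2, the second matches d2 and d3, so the disjunction returns d1, d2 and d3.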


The `words_b` table above stores this binary information as a term-document incidence matrix. With large documents and collections this matrix is very sparse: the majority of the entries will be zeroes, so storing it in full wastes a lot of space. 

To solve this, we use an inverted index which stores just the document locations of the index terms. 

Here's the query again: 

Query = (sports $\wedge$ game) $\vee$ (score $\wedge \neg$ win)

In [3]:
words_b2 = pd.DataFrame(
    [['d1,d2,d3,d4'],
    ['d1,d2'],
    ['d1,d2,d3'],
    ['d1']],
    columns = ['Documents'],
    index = ['sports', 'game', 'score', 'win']
)
words_b2

Unnamed: 0,Documents
sports,"d1,d2,d3,d4"
game,"d1,d2"
score,"d1,d2,d3"
win,d1
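With an inverted index, boolean operators map onto set operations over the posting lists. A minimal sketch, using a plain dictionary of Python sets rather than a DataFrame of strings (the name `inverted_index` is just for illustration):

```python
# Posting lists: term -> set of documents that contain the term
inverted_index = {
    'sports': {'d1', 'd2', 'd3', 'd4'},
    'game': {'d1', 'd2'},
    'score': {'d1', 'd2', 'd3'},
    'win': {'d1'},
}

# AND -> intersection (&), OR -> union (|), AND NOT -> set difference (-)
result = (inverted_index['sports'] & inverted_index['game']) | \
         (inverted_index['score'] - inverted_index['win'])
print(sorted(result))
```

Only the posting lists of the four query terms are touched, which is why documents that contain none of the terms never need to be examined.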


Advantages: 
* Simple queries to process and understand
* Easily explained results
* Efficient processing since most documents can be removed from the search

Disadvantages: 
* No notion of partial matching so no term weighting or ability to rank based on *similarity* to query. It either matches or does not.
* Much more like a data retrieval model
* Information need has to be translated into boolean form which is awkward
* Often too simplistic, resulting in either too many or too few results

Still dominant within commercial document database systems.

## Vector Space Model (VSM)

This introduces the ability to rank documents according to *cosine similarity*. This is the cosine of the angle between the document vector and the query vector: the smaller the angle, the closer the cosine is to 1 and the more similar they are to one another. The binary weights used in the Boolean model are too limiting, whereas non-binary weights (using tf-idf etc) allow for partial matching. 

There is an overall set of *n* terms {$t_1, t_2 ... t_n$}

Each document is represented as a vector {$d_1, d_2...d_n$}. $d_1$ is the weight of term $t_1$ in document *d*. 

The query is *also* represented as a vector {$q_1, q_2...q_n$}. $q_1$ is the weight of term $t_1$ in query *q*. The weight is 1 if $t_i \in q$; else 0.  

**All these vectors are of length *n*.** We are essentially computing the weights of each term for both the document and query and then finding the similarity between them. 



Each term acts as an axis in a multi-dimensional space. The more terms you have, the more dimensions you have. To represent it on a 2D scale we have a diagram with only 2 terms. 

![VSM%20example.JPG](attachment:VSM%20example.JPG)

The formula for the RSV (Retrieval Status Value) is as follows:

![VSM%20RSV.JPG](attachment:VSM%20RSV.JPG)

The part circled in red can technically be removed as it is the same for all documents and should not have an effect on the ranking. If discarded, it is equivalent to the projection of the query on the document vector.  

On the top of the RSV, you are essentially calculating the inner product of d and q. Because q will usually contain fewer terms than the document and so is mostly zeroes, the result is basically governed by the terms that are present in q. 

However, on the bottom you are multiplying the *lengths* of both vectors (magnitudes) which could be denoted as $||d||\cdot ||q||$ 

Overall, on the top you are limited by the tfidf weights within the query vector (as any weight within the document vector that is not present in the query vector will be multiplied by 0). On the bottom the full vector lengths and weights are taken into account. 

We use tfidf to get the weights for each term in the document: 

(Whilst the equation says $w_{t,q}$ I think it should be $w_{t,d}$ in this case.)

![tfidf.JPG](attachment:tfidf.JPG)

So within our document vector, we apply this equation to *every term that has a frequency $> 0$*; else 0. The 1+log(tf) can also be replaced by just standard tf if we want (I think). 

Here's an example of the full calculation of the formula: 

![VSM%20example2.JPG](attachment:VSM%20example2.JPG)

Within this we only have 3 terms, n = 3. 

We have calculated the weights of our 3 terms within the 2 documents, as well as the query. 

As we can see, the query does not contain one of the terms and so its weight is set to 0. 

Then we just follow the RSV equation. 



In [4]:
R1 = ((0.5*1)+(0.8*1))/np.sqrt((0.5**2 + 0.8**2 + 0.3**2)*(1**2 + 1**2))
R2 = ((0.9*1)+(0.4*1))/np.sqrt((0.9**2 + 0.4**2 + 0.2**2)*(1**2 + 1**2))

print('R1 = ' + str(R1))
print('R2 = ' + str(R2))

R1 = 0.9285714285714285
R2 = 0.9146768081493795


In [8]:
(0.9**2 + 0.4**2 + 0.2**2)*(1**2 + 1**2)

2.02

There are a couple of ways of doing this automatically in python.

In [5]:
from sklearn.metrics.pairwise import cosine_similarity
print('R1: \n' + str(cosine_similarity([[0.5,0.8,0.3],[1,1,0]])))
print('R2: \n' + str(cosine_similarity([[0.9,0.4,0.2],[1,1,0]])))

R1: 
[[1.         0.92857143]
 [0.92857143 1.        ]]
R2: 
[[1.         0.91467681]
 [0.91467681 1.        ]]


Remember that SciPy's `spatial.distance.cosine` returns the cosine *distance* (dissimilarity), which is 1 - cosine similarity, so to recover the similarity we have to compute 1 - distance. As we can see, the actual distance between our two vectors is quite small. 

In [6]:
from scipy import spatial
similarity = 1-spatial.distance.cosine([0.5,0.8,0.3],[1,1,0])
dissimilarity = spatial.distance.cosine([0.5,0.8,0.3],[1,1,0])
print('similarity = '+ str(similarity))
print('dissimilarity = '+ str(dissimilarity))

similarity = 0.9285714285714286
dissimilarity = 0.0714285714285714


## Generalised Vector Space Model (GVSM)

This is an extension of VSM in that it utilises minterms. These are binary indicators of term relevance patterns within the documents. Each represents a specific co-occurrence of terms within our document corpus.

We still use cosine similarity, we are simply creating extended vectors for both the query and documents. To this end, we still have a set of terms $[t_1, t_2, ... t_N]$ as well as the weights associated for each term for each document $w_{i,k}$ is the weight associated for term $t_i$ in document $d_k$

![gvsm.JPG](attachment:gvsm.JPG)

The amount of minterms possible is $2^N$, however we see that not every pattern is demonstrated within our document corpus. 

$m_1$ is the pattern represented by documents $d_1, d_2, d_8$. $m_2$ is represented by documents $d_3, d_9$ and so on for the other 'ms'

From here, we can create a weights vector for every term:

![gvsm%20terms.JPG](attachment:gvsm%20terms.JPG)

What the above means is that we essentially go through every minterm pattern and sum the values for term 1 that appear in each document. So $c_{1,1}$ is using pattern $m_1$ and summing the $t_1$ values for each document that abides by the $m_1$ representation, so $d_1,d_2,d_8$

What we can see is that for every $m$ pattern that does not have a $t_1$, the respective weight is 0. 

We then normalise using the $C$ value that is calculated by simply summing up the squares of our $c_{i,m}$ weights and taking the square root.

The final vector for $t_1$ is [8,3,0,0,0,1]/8.6 = [0.93,0.35,0,0,0,0.116]

We then use these vectors to create our document and query representation using the respective weights. 

Document $d_1$ contains $t_1$ twice and $t_2$ once:

$\vec{d_1} = 2 \times \vec{t_1} + \vec{t_2}$

$ = 2 \times [0.93,0.35,0,0,0,0.116] + [0.64,0.43,0,0.64,0,0.21]$

$ = [2.5,1.128,0,0.64,0,0.442]$
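The same calculation can be sketched with NumPy. The unnormalised weights for $t_1$ are the ones quoted above; the vector for $t_2$ is taken as given (already normalised), since its raw weights are only shown in the figure.

```python
import numpy as np

# Unnormalised minterm weights c_{1,m} for term t1, as quoted above
c_t1 = np.array([8, 3, 0, 0, 0, 1], dtype=float)

# Normalise by C = sqrt(sum of squared weights), roughly 8.6
t1 = c_t1 / np.linalg.norm(c_t1)

# Normalised vector for t2, copied from the worked example (assumed given)
t2 = np.array([0.64, 0.43, 0, 0.64, 0, 0.21])

# d1 contains t1 twice and t2 once
d1 = 2 * t1 + t2

print('t1 = ' + str(np.round(t1, 3)))
print('d1 = ' + str(np.round(d1, 3)))
```

The result agrees with the hand calculation up to rounding.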

We can then use the same method to generate a vector for a given query, then calculate the cosine similarity between both.

## Representing a Document

A simple but effective method is to use the 'bag of words' representation. We have already covered this in Data Mining but the simple explanation is to remove all stop words and perform stemming etc so that the document is transformed into a count of its key words. We disregard all order and structure and then we can compute the cosine similarity of these different documents with each other to find how similar (or dissimilar) they are to one another.

In [7]:
words = pd.DataFrame(
    [[3,0,5,0,2,6,0,2,0,2],
    [0,7,0,2,1,0,0,3,0,0],
    [0,1,0,0,1,2,2,0,3,0],
    [1,1,1,1,1,1,1,1,1,1],
    [1,1,1,1,1,1,1,1,1,1]],
    columns=['team','coach','play','ball','score','game','win','lost','timeout','season'],
    index=['Document 1','Document 2', 'Document 3','Document 4', 'Document 5']
)

words

Unnamed: 0,team,coach,play,ball,score,game,win,lost,timeout,season
Document 1,3,0,5,0,2,6,0,2,0,2
Document 2,0,7,0,2,1,0,0,3,0,0
Document 3,0,1,0,0,1,2,2,0,3,0
Document 4,1,1,1,1,1,1,1,1,1,1
Document 5,1,1,1,1,1,1,1,1,1,1


Here's the formula for cosine similarity:

![similarity2.JPG](attachment:similarity2.JPG)

Dissimilarity is simply $1-s$

In [8]:
def cosine_sim(dataset,doc1,doc2):
    doc1 = dataset.loc[doc1].to_numpy()
    doc2 = dataset.loc[doc2].to_numpy()
    return (np.dot(doc1,doc2))/(np.linalg.norm(doc1)*np.linalg.norm(doc2))

print('Similarity = ' + str(cosine_sim(words,'Document 1', 'Document 2')))
print('Dissimilarity = ' + str(1-cosine_sim(words,'Document 1', 'Document 2')))
#cosine_sim(words,'Document 4', 'Document 5')

Similarity = 0.11130451615062428
Dissimilarity = 0.8886954838493757


### TF-IDF

We want to use term frequency when computing query-document match scores, however we **do not** want to use raw term frequency (simply the counts of the term). 

This is because a document with 10 occurrences of the term is more relevant than a document with only 1 occurrence of the term. **But it is not 10 times more relevant**. 

Relevance does not increase proportionally with term frequency. 

Due to this, we use Log-frequency weighting: 

![log-frequency%20weighting.JPG](attachment:log-frequency%20weighting.JPG)

Remember that the weighting is 0 if the raw term frequency is also 0. Here are some examples:

In [9]:
print('1: ' + str(1+ np.log10(1)))
print('2: ' + str(1+ np.log10(2)))
print('10: ' + str(1+ np.log10(10)))
print('1000: ' + str(1+ np.log10(1000)))


1: 1.0
2: 1.3010299956639813
10: 2.0
1000: 4.0


TF on its own is not very good, this is because all terms have equal importance, larger documents contain more terms and thus have higher scores. It also ignores term order. 

If a word appears in every document, it probably isn't very important. 

Due to this, a rare term should be considered more important than frequent terms and should be weighted appropriately. This is where IDF comes in. 

N is the total number of documents and $df_t$ is the number of documents that the term is found in. We use log10 to dampen the effect of IDF in general. 

![IDF.JPG](attachment:IDF.JPG)

IDF does not have an effect on ranking for one-term queries, like 'iPhone'. For the query 'capricious person', IDF weighting makes occurrences of 'capricious' count for much more in the final document ranking than occurrences of person. 

However, still have to calculate the IDF for each individual term. 

In [10]:
words

Unnamed: 0,team,coach,play,ball,score,game,win,lost,timeout,season
Document 1,3,0,5,0,2,6,0,2,0,2
Document 2,0,7,0,2,1,0,0,3,0,0
Document 3,0,1,0,0,1,2,2,0,3,0
Document 4,1,1,1,1,1,1,1,1,1,1
Document 5,1,1,1,1,1,1,1,1,1,1


Example:

In [11]:
print('IDF(team) = ' + str(np.log10(5/3)))

IDF(team) = 0.2218487496163564


As we can see, the rarest terms have the highest IDF score, whilst the most common such as 'score' which is contained in every document has a score of 0. 

In [12]:
# Document frequency: number of documents that contain each term at least once
doc_freq = (words >= 1).sum(axis=0)
# IDF = log10(N / df_t), where N = len(words) is the number of documents
idf = np.log10(len(words) / doc_freq).to_frame('idf')
idf

Unnamed: 0,idf
team,0.221849
coach,0.09691
play,0.221849
ball,0.221849
score,0.0
game,0.09691
win,0.221849
lost,0.09691
timeout,0.221849
season,0.221849


TFIDF is created by multiplying these two values together. 

![tfidf.JPG](attachment:tfidf.JPG)

When used within the vector model, this equation can be used as the weights for each term within the document vector. However, it should only be applied to values of term frequency greater than 0. If tf is 0, then the respective weight is 0. 

If the query is "play ball" we can calculate the tfidf values for both documents D1 and D2:

D1:
play = (1 + log10(5)) * np.log10(5/3) = 0.38

ball = 0

D2: 
play = 0

ball = (1 + log10(2)) * np.log10(5/3) = 0.29

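Score(D1, q) = ((0.38 * 1) + (0 * 1))/sqrt((0.38^2 + 0^2)(1^2 + 1^2)) = 0.7071

In [13]:
(1 + np.log10(2)) * np.log10(5/3)

0.28863187775142785

In [14]:
# The whole dot product in the numerator must be parenthesised before dividing by the norms
round(((0.38 * 1) + (0 * 1))/((0.38**2 + 0**2)*(1**2 + 1**2))**0.5, 4)

0.7071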
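The tf-idf weights for every term in every document can be computed in one go. This is a sketch under the definitions used above (log-frequency weighting multiplied by log10 IDF); the names `log_tf`, `idf_t` and `tfidf` are just for illustration, and the counts are the same as in `words`.

```python
import numpy as np
import pandas as pd

words = pd.DataFrame(
    [[3, 0, 5, 0, 2, 6, 0, 2, 0, 2],
     [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
     [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
    columns=['team', 'coach', 'play', 'ball', 'score', 'game', 'win', 'lost', 'timeout', 'season'],
    index=['Document 1', 'Document 2', 'Document 3', 'Document 4', 'Document 5'],
)

# Log-frequency weighting: 1 + log10(tf) where tf > 0, otherwise 0.
# Zero counts are masked to NaN first so that log10(0) is never evaluated.
log_tf = (1 + np.log10(words.where(words > 0))).fillna(0)

# IDF: log10(N / df_t), with df_t the number of documents containing the term
idf_t = np.log10(len(words) / (words > 0).sum(axis=0))

# tf-idf weight per (document, term); the Series aligns on the column names
tfidf = log_tf * idf_t
print(tfidf.round(2))
```

The entries for 'play' in Document 1 and 'ball' in Document 2 match the hand-calculated 0.38 and 0.29, and the whole 'score' column is 0 because that term appears in every document.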

## Probabilistic Model

Also known as the Query-Likelihood model. This utilises language models to assign a probability to a particular sequence of words. In doing this, we have to create a probability distribution of individual words within our document as well as our collection of documents. This is again the bag of words model as seen above. 

A sequence of words can be assigned a probability of occurring by multiplying their individual probabilities together - this is known as the Unigram model. Here, each word's probability is estimated per document, as its count divided by the document's total word count. For example, the probability of the words "team play" occurring in Document 1 below is 3/20 x 5/20.

Written in notation, it is p(team) x p(play). 

In [15]:
words2 = words.copy()
words2['total per document'] = words2.iloc[::].sum(axis=1)
words2.loc['total per word']= words2.sum()
words2

Unnamed: 0,team,coach,play,ball,score,game,win,lost,timeout,season,total per document
Document 1,3,0,5,0,2,6,0,2,0,2,20
Document 2,0,7,0,2,1,0,0,3,0,0,13
Document 3,0,1,0,0,1,2,2,0,3,0,9
Document 4,1,1,1,1,1,1,1,1,1,1,10
Document 5,1,1,1,1,1,1,1,1,1,1,10
total per word,5,10,7,4,6,10,4,7,5,4,62


For this method of language modeling, there are two primary steps: 

1. Estimating the probability of a word occurring by using the frequency of the word through the documents (create a language model). 
2. Using the language model to assign a probability to a span of text (multiplying the individual probabilities of each word in the language model that appears in the text).

### Topics

We can think of a topic as having a high probability of using certain words and a low probability of using others within a defined language model. In the example below, each different colour represents a different word.

![topics.JPG](attachment:topics.JPG)

Each document has its own language model, this can be denoted as $\theta D$.

The probability of a given term within a document uses this language model: $P(t|D) = P(t|\theta D)$. An example using document 1 above could be: $P(team|D_1) = 3/20$ which means what is the probability of team occurring given a language model of $D_1$?

For information retrieval, we are ranking each document by how likely it is to have produced our query. Our method of doing this is by generating each document's language model and finding the probability of the terms of our query within each document. The ranking is then ordered in descending order, the higher the probability the more likely the document is to have generated our query. 

To estimate a document's language model:
* Split the document into terms
* Count the number of times each term occurs (term frequency)
* Count the total number of terms within the document ($N_D$)
* Assign term *t* the probability $P(t|D)$ = term frequency/$N_D$

Using a generic example of a query for just the word 'team'. We find that $P(team|D_1) = 3/20$ and $P(team|D_4) = 1/10$. Therefore, document 1 is more likely to have generated this query. 

For a query with more than one word, we simply multiply all the probabilities together and then rank each document:

Score(Q,D) = $\Pi^n_{i=1}P(q_i|D)$

As we are multiplying probabilities, the longer the query the lower the probability result. This does not matter as we are evaluating every document *for the same query*. 
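The ranking procedure can be sketched as below, using the same counts as `words` and the query "team play". The function name `query_likelihood` is made up for illustration.

```python
import pandas as pd

words = pd.DataFrame(
    [[3, 0, 5, 0, 2, 6, 0, 2, 0, 2],
     [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
     [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
     [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
    columns=['team', 'coach', 'play', 'ball', 'score', 'game', 'win', 'lost', 'timeout', 'season'],
    index=['Document 1', 'Document 2', 'Document 3', 'Document 4', 'Document 5'],
)

def query_likelihood(counts, query_terms):
    """Score each document by the product of P(q_i | D) over the query terms."""
    # Document language models: term frequency / document length N_D
    language_models = counts.div(counts.sum(axis=1), axis=0)
    # Multiply the query term probabilities per document, then rank
    scores = language_models[query_terms].prod(axis=1)
    return scores.sort_values(ascending=False)

scores = query_likelihood(words, ['team', 'play'])
print(scores)
```

Document 1 is ranked first with 3/20 x 5/20 = 0.0375. Documents 2 and 3 contain neither query term, so their score is exactly 0.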

### Smoothing

An issue with this method is that if a term does not appear in the document but does appear in the query, then the resulting probability is 0. For example, 0.3 x 0.1 x 0 = 0.

Probabilities estimated from a small sample tend to over-estimate the occurrence of observed outcomes and under-estimate the probability of unobserved outcomes. The goal of smoothing is to reduce the probability of observed outcomes and increase the probability of unobserved outcomes. 

A simple method of smoothing is known as 'Add one' or 'discounting' smoothing. If we take document 1 again, we see that it has a number of terms with 0 occurrences, however we know that these terms exist within the entire *collection* of documents. In order to give a probability value to one of these unobserved terms, we can add an (imaginary) +1 to **every** term within document 1. This both slightly reduces the probability of the observed terms and increases the probability of the unobserved terms, whilst also removing the multiplication by zero issue.   

In [16]:
words2

Unnamed: 0,team,coach,play,ball,score,game,win,lost,timeout,season,total per document
Document 1,3,0,5,0,2,6,0,2,0,2,20
Document 2,0,7,0,2,1,0,0,3,0,0,13
Document 3,0,1,0,0,1,2,2,0,3,0,9
Document 4,1,1,1,1,1,1,1,1,1,1,10
Document 5,1,1,1,1,1,1,1,1,1,1,10
total per word,5,10,7,4,6,10,4,7,5,4,62


After add-one smoothing, for Document 1 the P(team|D) probability of 3/20 becomes 4/30 whilst P(coach|D) goes from 0/20 to 1/30. 

In [17]:
words3 = words.copy()
words3 = words3+1
words3['total per document'] = words3.iloc[::].sum(axis=1)
words3.loc['total per word']= words3.sum()
words3

Unnamed: 0,team,coach,play,ball,score,game,win,lost,timeout,season,total per document
Document 1,4,1,6,1,3,7,1,3,1,3,30
Document 2,1,8,1,3,2,1,1,4,1,1,23
Document 3,1,2,1,1,2,3,3,1,4,1,19
Document 4,2,2,2,2,2,2,2,2,2,2,20
Document 5,2,2,2,2,2,2,2,2,2,2,20
total per word,10,15,12,9,11,15,9,12,10,9,112


Here's an example using a bag of balls: 

![add%20one.JPG](attachment:add%20one.JPG)

This method is a simple way to add smoothing, however there is a more effective approach.

### Linear Interpolation

This involves using the language model of the **entire collection** of documents, a set alpha value, as well as each document's individual language model. 

$P(t|D) = \alpha P(t|\theta D) + (1-\alpha)P(t|\theta C)$

Every one of these values is between 0 and 1, so the resulting P(t|D) is also between 0 and 1. 

![linear%20interpolation.JPG](attachment:linear%20interpolation.JPG)

The method of generating the document rankings is the same, just multiplying all the probabilities together. Only this time we use the linear interpolated probabilities. 

Score(Q,D) = $\Pi^n_{i=1} (\alpha P(q_i|\theta D) + (1-\alpha)P(q_i|\theta C))$

Linear interpolation helps us avoid 0 probabilities: if a term is not within a document but is in the collection, the document part of the sum is 0 but the collection part, $(1-\alpha)P(t|\theta C)$, is still greater than 0. 

A term is descriptive of a document if it appears many times within that document. This is not true if the term appears frequently in the document but also frequently in the **collection**. 

Linear interpolation adds an IDF (inverse document frequency) like feature in that terms that are less frequent in the collection are more important to a document's score, this is not present without smoothing. 

Here's an example with linear interpolation:

Within the collection, both terms have different frequencies.

![LI%20example%201.JPG](attachment:LI%20example%201.JPG)

Ignoring the smoothing for now and just using the document language models. If we calculated both p(apple|D) and p(ipad|D) together for both documents (0.04 x 0.06 and 0.06 x 0.04), we would get the same result of 0.0024 and both documents would be ranked the same. 

By taking into account the collection language model and using linear interpolation, we find that both documents are now ranked differently due to the differing importance levels of the terms **overall**. 

The score for document one is calculated by (with an alpha of 0.5): 

$(0.5 \times 0.04 + 0.5 \times 0.0002) \cdot (0.5 \times 0.06 + 0.5 \times 0.0001) = 0.000604005$

The score for document two is calculated by:

$(0.5 \times 0.06 + 0.5 \times 0.0002) \cdot (0.5 \times 0.04 + 0.5 \times 0.0001) = 0.000603505$

![LI%20example%202.JPG](attachment:LI%20example%202.JPG)

Overall, we can see that even though both documents get the same score if we don't use smoothing and just multiply the document probabilities, by taking into account the collection language model we can place greater importance on terms that appear **less** in the collection. 
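
The worked example above can be reproduced with a short sketch. The function name `interpolated_score` and the dictionary layout are my own choices for illustration; the probabilities are the ones used in the calculation above (apple is the term with collection probability 0.0002).

```python
from math import prod

def interpolated_score(query, doc_lm, coll_lm, alpha=0.5):
    """Multiply alpha*P(t|doc) + (1-alpha)*P(t|collection) over the query terms."""
    return prod(
        alpha * doc_lm.get(term, 0.0) + (1 - alpha) * coll_lm[term]
        for term in query
    )

# Probabilities from the worked example above
collection_lm = {"apple": 0.0002, "ipad": 0.0001}
doc1_lm = {"apple": 0.04, "ipad": 0.06}
doc2_lm = {"apple": 0.06, "ipad": 0.04}

query = ["apple", "ipad"]
score_doc1 = interpolated_score(query, doc1_lm, collection_lm)
score_doc2 = interpolated_score(query, doc2_lm, collection_lm)
print(score_doc1, score_doc2)
```

Document one comes out slightly ahead because it has the higher probability for ipad, the term that is rarer in the collection.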

Slightly confused about this. If both apple and ipad had the same term frequency of 0.04. The calculation would be:

(0.5 x 0.04 + 0.5 x 0.0002) x (0.5 x 0.04 + 0.5 x 0.0001) = 0.0201 x 0.02005. Therefore, doesn't apple have the greater impact even though it appears more times in the collection?

Longer documents give more reliable estimates and so need less smoothing, which in the formula above means a larger $\alpha$ (more weight on the document model). 

## Dirichlet Smoothing

This form of smoothing takes document length into account, in that a term that doesn't appear in a long document should have a lower smoothed probability of occurring, i.e. the longer the document, the lower the probability. 

Here's the formula: 
![dirichlet%20smoothing.JPG](attachment:dirichlet%20smoothing.JPG)

|d| means the length of document D or total number of terms, |c| means the total number of terms within the entire collection.

$\mu$ is a constant, and for a fixed $\mu$ longer documents will get less smoothing. For longer queries we need a higher $\mu$.

$\mu \in [0, +\infty)$

From what I understand, for $\mu$ a good out of the box value is the average document length, but it should be tuned with more experimentation. In normal practice, $\mu$ is fixed for a collection of documents and does not vary based on document length. 

In [18]:
words2

Unnamed: 0,team,coach,play,ball,score,game,win,lost,timeout,season,total per document
Document 1,3,0,5,0,2,6,0,2,0,2,20
Document 2,0,7,0,2,1,0,0,3,0,0,13
Document 3,0,1,0,0,1,2,2,0,3,0,9
Document 4,1,1,1,1,1,1,1,1,1,1,10
Document 5,1,1,1,1,1,1,1,1,1,1,10
total per word,5,10,7,4,6,10,4,7,5,4,62


Let's calculate the score for P(team|Doc3) with a $\mu$ of 12.4 (the average document length in our collection).  

$\frac{0 + 12.4\times 5/62}{12.4 + 9} = 0.04673$

Now p(game|Doc1) with $\mu$ of 12.4.

$\frac{6 + 12.4\times 10/62}{12.4 + 20} = 0.24691$

With a query that contains multiple terms, you simply multiply the probabilities for the total score. 

In [19]:
(0 + (12.4*5/62))/(12.4+9)
#(6 + (12.4*10/62))/(12.4+20)

0.04672897196261683
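
The two calculations above can be wrapped in a small function. This is a minimal sketch: the name `dirichlet_prob` and its arguments are hypothetical, and the formula is the one implied by the worked numbers, (term count + $\mu$ x collection probability) / (document length + $\mu$).

```python
def dirichlet_prob(tf, doc_len, coll_tf, coll_len, mu):
    """Dirichlet-smoothed P(t|D) from raw counts."""
    p_coll = coll_tf / coll_len          # P(t|C) from collection counts
    return (tf + mu * p_coll) / (doc_len + mu)

MU = 12.4        # average document length in the table above
COLL_LEN = 62    # total number of terms in the collection

p_team_doc3 = dirichlet_prob(tf=0, doc_len=9, coll_tf=5, coll_len=COLL_LEN, mu=MU)
p_game_doc1 = dirichlet_prob(tf=6, doc_len=20, coll_tf=10, coll_len=COLL_LEN, mu=MU)
p_team_doc1 = dirichlet_prob(tf=3, doc_len=20, coll_tf=5, coll_len=COLL_LEN, mu=MU)

# Score for the two-term query "team game" against Document 1:
# multiply the smoothed probabilities of the query terms
score_doc1 = p_team_doc1 * p_game_doc1
print(p_team_doc3, p_game_doc1, score_doc1)
```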

Dirichlet and linear interpolated smoothing work the same way. The difference is that Dirichlet smoothing uses a lambda that is parameterized on document length. We can calculate the equivalent linear interpolated lambda value using $\mu$ from Dirichlet smoothing with this formula: 

![LI%20to%20DIR.JPG](attachment:LI%20to%20DIR.JPG)

If we use the $\mu$ value of 12.4 as above, for Document 3 (|D| = 9) this converts to a lambda value of 0.5794, and for Document 1 (|D| = 20) to 0.3827.

In [20]:
a = 1-9/(9+12.4)
a

0.5794392523364486

In [21]:
b = 1-20/(20+12.4)
b

0.382716049382716

Now if we use this lambda value in linear interpolated smoothing for P(team|Doc3), we get the exact same result as we did for Dirichlet smoothing above for the same query. 

Note that this uses a version of linear interpolation with the weights swapped around, so that $\alpha$ multiplies the collection model and $1-\alpha$ multiplies the document model: 

Score(Q,D) = $\Pi^n_{i=1} ((1-\alpha) P(q_i|\theta D) + \alpha P(q_i|\theta C))$


In [22]:
(1-a)*0+(a)*(5/62)

0.04672897196261682

If $\mu$ is set to be the document length |D|, the equivalent lambda is 0.5. As $\mu$ is fixed, this means that the lambda varies per document. If |D| is less than $\mu$ then the document is smoothed with an equivalent lambda greater than 0.5. If |D| is greater than $\mu$ then the document is smoothed with an equivalent lambda of less than 0.5. 
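
To check the equivalence for every document in the table rather than just one, here is a minimal sketch. The helper names are my own, and lambda here is the weight on the collection model, matching the swapped form of linear interpolation used in the cell above.

```python
from math import isclose

def equivalent_lambda(doc_len, mu):
    """Collection-model weight that makes linear interpolation match Dirichlet."""
    return mu / (doc_len + mu)   # same as 1 - doc_len / (doc_len + mu)

def interpolated(tf, doc_len, p_coll, lam):
    return (1 - lam) * tf / doc_len + lam * p_coll

def dirichlet(tf, doc_len, p_coll, mu):
    return (tf + mu * p_coll) / (doc_len + mu)

MU = 12.4
doc_lengths = {"Document 1": 20, "Document 2": 13, "Document 3": 9,
               "Document 4": 10, "Document 5": 10}

lambdas = {name: equivalent_lambda(n, MU) for name, n in doc_lengths.items()}
for name, lam in lambdas.items():
    print(name, round(lam, 4))

# P(game|Doc1): both smoothing methods agree once lambda is converted
p_li = interpolated(6, 20, 10 / 62, lambdas["Document 1"])
p_dir = dirichlet(6, 20, 10 / 62, MU)
print(isclose(p_li, p_dir))
```

Documents shorter than $\mu$ end up with a lambda above 0.5 and longer ones below 0.5, as described above.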

Usually, the more a document is smoothed, the lower the score. Therefore, Dirichlet appears to favour longer documents due to the lower smoothing that happens to them due to the fixed $\mu$ value. 

## Absolute Discounting

This method involves subtracting a constant $\delta$ (discounting) from all terms with non-zero counts within the document. 

max($f_{d,t} - \delta$, 0) means that if a term frequency is greater than 0, you discount it by subtracting the constant $\delta$ from it. If the term frequency is already 0, then you just select 0. $|D|_u$ is the number of unique words in the document and 0 < $\delta$ < 1.

![absolute%20discounting.JPG](attachment:absolute%20discounting.JPG)

An example with P(team|Doc3) and a $\delta$ of 0.5: 

(0 + 0.5 x 5 x 5/62)/9 = 0.022

Now P(game|Doc1) with the same $\delta$:

(6-0.5 + 0.5 x 6 x 10/62)/20 = 0.299

In [23]:
(0 + (0.5*5*5/62))/9
(6-0.5 + 0.5*6*10/62)/20 

0.2991935483870968
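
The same two calculations as a function, so that both results are visible. This is a sketch with hypothetical names; the formula is the one implied by the worked numbers above, (max(tf - $\delta$, 0) + $\delta$ x unique terms x collection probability) / document length.

```python
def absolute_discount_prob(tf, doc_len, n_unique, p_coll, delta):
    """Absolute-discounting estimate of P(t|D)."""
    discounted = max(tf - delta, 0)            # never goes below zero
    redistributed = delta * n_unique * p_coll  # mass moved to the collection model
    return (discounted + redistributed) / doc_len

DELTA = 0.5

# Document 3 has 9 terms, 5 of them unique; 'team' does not occur in it
p_team_doc3 = absolute_discount_prob(tf=0, doc_len=9, n_unique=5, p_coll=5 / 62, delta=DELTA)

# Document 1 has 20 terms, 6 of them unique; 'game' occurs 6 times
p_game_doc1 = absolute_discount_prob(tf=6, doc_len=20, n_unique=6, p_coll=10 / 62, delta=DELTA)

print(p_team_doc3, p_game_doc1)
```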

## Two Stage Smoothing

This method first smooths the document language model using the Dirichlet prior, then mixes the result with a 'query background' model using linear interpolation smoothing. In the absence of information about this 'query background', we use P(t|C) as an approximation. 

The full smoothing function: 

![two%20stage%20smoothing.JPG](attachment:two%20stage%20smoothing.JPG)

Let's do this with P(team|Doc3), a $\lambda$ of 0.2 and $\mu$ of 12.4:

(1-0.2) x ((0 + (12.4 x 5/62))/(9 + 12.4)) + (0.2 x 5/62) = 0.054

Now for P(game|Doc1) with same parameters:

(1-0.2) x ((6 + (12.4 x 10/62))/(20 + 12.4)) + (0.2 x 10/62) = 0.23

In [24]:
(1-0.2)*((0 + (12.4*5/62))/(9 + 12.4)) + (0.2*5/62)
(1-0.2)*((6 + (12.4*10/62))/(20 + 12.4)) + (0.2*10/62)

0.2297889287136599
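
A minimal sketch of the two stages as one function, following the structure of the calculations above. The name `two_stage_prob` is hypothetical, and P(t|C) stands in for the query background model as described earlier.

```python
def two_stage_prob(tf, doc_len, p_coll, mu, lam):
    """Stage 1: Dirichlet smoothing. Stage 2: interpolate with P(t|C)."""
    dirichlet = (tf + mu * p_coll) / (doc_len + mu)
    return (1 - lam) * dirichlet + lam * p_coll

MU, LAM = 12.4, 0.2

p_team_doc3 = two_stage_prob(tf=0, doc_len=9, p_coll=5 / 62, mu=MU, lam=LAM)
p_game_doc1 = two_stage_prob(tf=6, doc_len=20, p_coll=10 / 62, mu=MU, lam=LAM)
print(p_team_doc3, p_game_doc1)
```

Setting `lam` to 0 gives back plain Dirichlet smoothing, which is a quick way to sanity check the function against the earlier results.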

In [25]:
((6 + (12.4*10/62))/(20 + 12.4))

0.2469135802469136

In [26]:
words2

Unnamed: 0,team,coach,play,ball,score,game,win,lost,timeout,season,total per document
Document 1,3,0,5,0,2,6,0,2,0,2,20
Document 2,0,7,0,2,1,0,0,3,0,0,13
Document 3,0,1,0,0,1,2,2,0,3,0,9
Document 4,1,1,1,1,1,1,1,1,1,1,10
Document 5,1,1,1,1,1,1,1,1,1,1,10
total per word,5,10,7,4,6,10,4,7,5,4,62


(0.8 * 0/9) + (0.2 * 0/62)

In [27]:
(0.5 * 0/9) + (0.5 * 5/62)

0.04032258064516129

In [28]:
(0 + (12.4*5/62))/(12.4+9)

0.04672897196261683

## BIRM Model (Binary Independence Retrieval Model) 

Given a user query 'q' and document 'd', estimate the probability that the user will find 'd' relevant. 

Binary: All weights for the terms are binary, either 0 or 1

Independence: index terms are independent and don't influence each other's probabilities

Is based on Bayes' theorem 

The main theorem: 

![BIRM%20theorem.JPG](attachment:BIRM%20theorem.JPG)

$c_i$ are weights associated with terms $t_i$ 

For $c_i > 0$, term $t_i$ occurring in document is a good indication of relevance 

For $c_i < 0$, term $t_i$ occurring in document is a good indication of non-relevance 

For $c_i = 0$, term $t_i$ occurring in document doesn't mean anything

C is a constant for all documents given the same query

$x_i$ is the binary form for whether a term appears in a document

if R(d) > C then retrieve d; otherwise do not retrieve d **or** simply rank by R(d) value and **ignore C**. 

![BIRM%20params.JPG](attachment:BIRM%20params.JPG)

These data can be extracted after a relevance feedback process: user pointing out relevant documents

$a_i = P(x_i=1|r)$ which is the probability that a term $t_i$ appears in a *relevant* document

$1 - a_i = P(x_i=0|r)$ which is the probability that a term $t_i$ does not appear in a *relevant* document

$b_i = P(x_i=1|\neg r)$ which is the probability that a term $t_i$ appears in a *non-relevant* document

$1-b_i = P(x_i=0|\neg r)$ which is the probability that a term $t_i$ does not appear in a *non-relevant* document

$a_i = \frac{r_i}{R}$

$b_i = \frac{n_i - r_i}{N - R}$

![BIRM%20estimating%20ci-2.JPG](attachment:BIRM%20estimating%20ci-2.JPG)

$c_i$ can also be calculated using the above table values. 0.5 is added to keep the $c_i$ value from being infinite when $r_i$ and R are small

![BIRM%20estimating%20ci2.JPG](attachment:BIRM%20estimating%20ci2.JPG)

$c_i$ is also referred to as the term weight in BIRM, or the Robertson-Sparck Jones (RSJ) weight, and written *w*

When no sample is available, the number of relevant documents (R) is not known
* set $a_i$ = 0.5 and $b_i$ = $n_i/N$
* leads to $c_i = \log\frac{N-n_i}{n_i}$
* which can be viewed as a probabilistic *idf* 
* R(d) thus with idf weights produces initial ranking

From this, relevance feedback is applied and R, $r_i$ can be defined which improves ranking.
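
The no-sample case above can be checked numerically. This is a sketch: `birm_weight` and `initial_weight` are hypothetical names, and the values N = 20 and $n_i$ = 5 are made up purely for illustration.

```python
from math import log, isclose

def birm_weight(a, b):
    """Term weight c_i = ln( a(1-b) / (b(1-a)) )."""
    return log(a * (1 - b) / (b * (1 - a)))

def initial_weight(n_i, N):
    """Weight when nothing is known about relevance: a_i = 0.5, b_i = n_i / N."""
    return birm_weight(0.5, n_i / N)

N, n_i = 20, 5   # illustrative values only
w = initial_weight(n_i, N)
print(w, log((N - n_i) / n_i))
print(isclose(w, log((N - n_i) / n_i)))
```

With $a_i$ = 0.5 the $a_i$ terms cancel, leaving only $(1-b_i)/b_i = (N-n_i)/n_i$ inside the log, which is why the weight behaves like an idf.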


### Example

![BIRM%20example.JPG](attachment:BIRM%20example.JPG)

N = number of documents

R = number of relevant documents

$r_1$ = number of relevant documents that contain $x_1$ 

$r_2$ = number of relevant documents that contain $x_2$ 

$n_1$ = number of documents that contain $x_1$ 

$n_2$ = number of documents that contain $x_2$

$a_1 = r_1/R = 8/12; a_2 = r_2/R = 7/12$

$b_1 = (n_1 - r_1)/(N-R) = (11-8)/(20-12) = 3/8; b_2 = 4/8$

With these values we can calculate the term weights for each independent $x_i$. 

**use ln (natural logarithm) for logs**

$c_1 = \log\frac{a_1(1-b_1)}{b_1(1-a_1)}=\log(10/3) = 1.20$

$c_2 = \log\frac{a_2(1-b_2)}{b_2(1-a_2)}= \log\frac{7/12\times(1-4/8)}{4/8\times(1-7/12)} = 0.34$


Our retrieval function: 

R(D) = $1.20x_1 + 0.34x_2 + C$ 

In [29]:
import numpy as np
np.log((7/12*(1-4/8))/(4/8*(1-7/12)))

0.336472236621213

By applying this formula to every document (and ignoring C), we get a ranking: 

![BIRM%20ranking.JPG](attachment:BIRM%20ranking.JPG)
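
The whole example can be run end to end with a short sketch. The counts come from the values used above; $n_2$ = 11 is not stated directly and is inferred here from $b_2$ = 4/8. The variable names are my own, and the four binary patterns are simply every possible combination of the two terms, not the actual documents in the figure.

```python
from math import log

N, R = 20, 12          # documents in total, relevant documents
r = [8, 7]             # relevant documents containing x_1, x_2
n = [11, 11]           # documents containing x_1, x_2 (n_2 inferred from b_2 = 4/8)

a = [r_i / R for r_i in r]
b = [(n_i - r_i) / (N - R) for n_i, r_i in zip(n, r)]
c = [log(a_i * (1 - b_i) / (b_i * (1 - a_i))) for a_i, b_i in zip(a, b)]

def retrieval_value(x):
    """R(d) without the constant C: sum of c_i over the terms present in d."""
    return sum(c_i * x_i for c_i, x_i in zip(c, x))

patterns = [(1, 1), (1, 0), (0, 1), (0, 0)]
ranking = sorted(patterns, key=retrieval_value, reverse=True)
for x in ranking:
    print(x, round(retrieval_value(x), 2))
```

Because C is the same for every document, leaving it out does not change the order.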

Overall, this is a probabilistic model used to model 'uncertainty' in the retrieval process

Clear independence assumptions to do with index terms

Term weight ($c_i$) without relevance information is IDF

Relevance feedback can help improve ranking by giving better probability estimates of term weights

No use of term frequencies (binary) or document lengths. 

## BM25 (Okapi)

This builds on the probabilistic model and incorporates parameters such as term frequency and document length. 

![bm25.JPG](attachment:bm25.JPG)

(There is a more involved model with k1, k2 and k3 but this is a simplified version)

Typically, k1 is set around [1.2,2] and b around 0.75

In [30]:
words

Unnamed: 0,team,coach,play,ball,score,game,win,lost,timeout,season
Document 1,3,0,5,0,2,6,0,2,0,2
Document 2,0,7,0,2,1,0,0,3,0,0
Document 3,0,1,0,0,1,2,2,0,3,0
Document 4,1,1,1,1,1,1,1,1,1,1
Document 5,1,1,1,1,1,1,1,1,1,1


We can calculate the BM25 values without any query, then when we have a query we simply **sum** all the values up. 

If we calculate the value for 'team' within Document 1: 

N = 5

k1 = 1.2

b = 0.8

$tf_i = 3$

$df_i = 3$

dl = 20

avgdl = 12.4

$\log_{10}(5/3)\cdot \frac{(1.2 + 1)\cdot 3}{1.2\cdot ((1-0.8) + 0.8\cdot (20/12.4))+3}$

In [31]:
np.log10(5/3)*((2.2*3)/(1.2*(0.2+0.8*(20/12.4))+3))

0.30578182546150984

We can automate this for a bag of words model. First, however, we need to generate the idfs for all of the terms.

In [32]:
dfs = (words > 0).sum(axis=0) # document frequency: number of documents containing each term
idfs = np.log10(5/dfs) #N is 5, number of documents
idfs

team       0.221849
coach      0.096910
play       0.221849
ball       0.221849
score      0.000000
game       0.096910
win        0.221849
lost       0.096910
timeout    0.221849
season     0.221849
dtype: float64

In [33]:
dls = words.sum(axis=1) # vector
dls = np.array(dls)
avgdl = np.mean(words.sum(axis=1))
k_1 = 1.2
b = 0.8
N=5

numerator = np.array((k_1 + 1) * words)
denominator = np.array(k_1 *((1 - b) + b * (dls / avgdl))).reshape(N,1) + np.array(words)

BM25_tf = numerator / denominator

idfs = np.array(idfs)

BM25_score = BM25_tf * idfs

BM25_score

array([[0.30578183, 0.        , 0.3594869 , 0.        , 0.        ,
        0.1642461 , 0.        , 0.11255557, 0.        , 0.25766493],
       [0.        , 0.18097653, 0.        , 0.30067736, 0.        ,
        0.        , 0.        , 0.15062131, 0.        , 0.        ],
       [0.        , 0.11008099, 0.        , 0.        , 0.        ,
        0.1451947 , 0.33238323, 0.        , 0.37192932, 0.        ],
       [0.24231398, 0.10584982, 0.24231398, 0.24231398, 0.        ,
        0.10584982, 0.24231398, 0.10584982, 0.24231398, 0.24231398],
       [0.24231398, 0.10584982, 0.24231398, 0.24231398, 0.        ,
        0.10584982, 0.24231398, 0.10584982, 0.24231398, 0.24231398]])

In [36]:
vocabulary = list(words.columns)
bm25_idf = pd.DataFrame(BM25_score, columns=vocabulary, index=words.index)
bm25_idf

Unnamed: 0,team,coach,play,ball,score,game,win,lost,timeout,season
Document 1,0.305782,0.0,0.359487,0.0,0.0,0.164246,0.0,0.112556,0.0,0.257665
Document 2,0.0,0.180977,0.0,0.300677,0.0,0.0,0.0,0.150621,0.0,0.0
Document 3,0.0,0.110081,0.0,0.0,0.0,0.145195,0.332383,0.0,0.371929,0.0
Document 4,0.242314,0.10585,0.242314,0.242314,0.0,0.10585,0.242314,0.10585,0.242314,0.242314
Document 5,0.242314,0.10585,0.242314,0.242314,0.0,0.10585,0.242314,0.10585,0.242314,0.242314


Now, if our query was ['coach game lost'], we would just sum up the BM25 scores for those terms from our table and rank the documents from highest to lowest

In [37]:
query = ['coach', 'game', 'lost']
bm25_idf[query].sum(axis=1).sort_values(ascending=False)

Document 2    0.331598
Document 4    0.317549
Document 5    0.317549
Document 1    0.276802
Document 3    0.255276
dtype: float64
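
For comparison, here is the same ranking without pandas or numpy. This is only a sketch: the helper names `bm25_weight` and `bm25_score` are hypothetical, and it keeps the choices made in the cells above (log base 10 for the idf, k1 = 1.2, b = 0.8).

```python
from math import log10

vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]
counts = {
    "Document 1": [3, 0, 5, 0, 2, 6, 0, 2, 0, 2],
    "Document 2": [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    "Document 3": [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
    "Document 4": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "Document 5": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
}

K1, B = 1.2, 0.8
N = len(counts)
avgdl = sum(sum(row) for row in counts.values()) / N

# Document frequency: number of documents that contain each term
df = {term: sum(1 for row in counts.values() if row[i] > 0)
      for i, term in enumerate(vocabulary)}

def bm25_weight(tf, dl, term):
    """BM25 weight of one term in one document."""
    idf = log10(N / df[term])
    length_norm = K1 * ((1 - B) + B * dl / avgdl)
    return idf * (K1 + 1) * tf / (length_norm + tf)

def bm25_score(query, doc):
    """Sum the BM25 weights of the query terms for one document."""
    row = counts[doc]
    dl = sum(row)
    return sum(bm25_weight(row[vocabulary.index(t)], dl, t) for t in query)

query = ["coach", "game", "lost"]
ranking = sorted(((doc, bm25_score(query, doc)) for doc in counts),
                 key=lambda pair: pair[1], reverse=True)
for doc, score in ranking:
    print(doc, round(score, 6))
```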