# Information Retrieval - Elliot Linsey QMUL 2022

## What is IR?

This is a subject focusing on extracting information and knowledge from data using search queries. 

![knowledge.JPG](attachment:knowledge.JPG)

In this case, the data is 'unstructured', so different methods are needed to extract information from the raw data. That information is then processed to produce knowledge. 

IR is the ability to sift through this unstructured data to find relevant documents related to the user's search query. 

Unstructured data can be of many different types, such as: 

* Text (Documents)
* XML and structured documents
* Images
* Audio (sound effects, songs, etc.)
* Video
* Source code
* Applications/Web services

### Key Terms:

* term frequency (TF)
* document frequency
* inverse document frequency (IDF)
* vector-space model (VSM)
* probabilistic model
* BM25 (Best-Match Version 25)
* DFR (Divergence from Randomness)
* PageRank
* stemming
* precision, recall

### Databases vs IR

We have worked extensively with databases, primarily using modules such as pandas within Python. However, the industry standard is SQL, which works in a similar manner. Extracting information from a database requires exact queries because the data is stored in a defined structure. As a result, we also obtain an exact result, with no vague or unrelated data returned. 

Within IR, there is no predefined structure to the data we are querying. Depending on the methods and algorithms we use, the same query can produce different results, some of which may be relevant and some not. These queries are usually informal in contrast to DB searches and are often expressed in natural language. Natural Language Processing (NLP) is therefore an important part of IR, as it allows the computer to compare queries against the data it has indexed.  

The most common usage of IR is in search engines such as Google, Bing and Yahoo. 

![DB%20vs%20IR.JPG](attachment:DB%20vs%20IR.JPG)
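
The contrast can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the documents, the `exact_match` and `ranked_search` helpers, and the term-overlap score are all invented for this example and are not how a real search engine scores results.

```python
# Toy collection: hypothetical documents invented for illustration.
documents = {
    "d1": "golden labrador puppy available for adoption",
    "d2": "how to train a labrador retriever",
    "d3": "adoption advice for rescue cats",
}

def exact_match(query, docs):
    """Database-style lookup: return only documents equal to the query string."""
    return [doc_id for doc_id, text in docs.items() if text == query]

def ranked_search(query, docs):
    """IR-style lookup: score each document by how many query terms it shares."""
    query_terms = set(query.lower().split())
    scores = {}
    for doc_id, text in docs.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap > 0:
            scores[doc_id] = overlap
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

query = "labrador adoption"
exact_results = exact_match(query, documents)
ranked_results = ranked_search(query, documents)
print(exact_results)
print(ranked_results)
```

The exact lookup finds nothing because no document equals the query, while the ranked search returns every document sharing at least one term, with the best match first and partially relevant documents after it.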

### Information Need

This is the information that you are trying to extract from the data. For example, if you are trying to find a specific type of dog to adopt: 

1. Must be a Labrador
2. Must be Female
3. Must be Golden

The information (document) result should include breed, sex, colour, age, location, cost, health issues, etc.

The *Query* is the formal representation of this information need. 
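
As a rough sketch of how the dog example might become a query, the snippet below writes the need as three terms and keeps only documents containing all of them (a Boolean AND). The listings and the `tokenize` and `boolean_and` helpers are hypothetical, made up for this illustration.

```python
# Hypothetical adoption listings, invented for illustration.
listings = {
    "dog1": "female golden labrador, 2 years old, London",
    "dog2": "male golden labrador, 4 years old, Leeds",
    "dog3": "female black labrador, 1 year old, Bristol",
}

def tokenize(text):
    """Lowercase the text and split it into a set of terms, dropping commas."""
    return set(text.lower().replace(",", " ").split())

def boolean_and(query, docs):
    """Return ids of documents that contain every term of the query."""
    query_terms = tokenize(query)
    return [doc_id for doc_id, text in docs.items() if query_terms <= tokenize(text)]

# The information need (Labrador, female, golden) written as a query.
query = "labrador female golden"
matches = boolean_and(query, listings)
print(matches)
```

Terms are compared as whole words, so the listing for the male dog is not matched by the term "female".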

### Types of Information Need

**Retrospective ("searching the past")**

Known as "Ad-hoc Querying", this means posing queries against a static collection of documents. The documents are not evolving or expanding, so they can be stored offline. 

**Topical Search:**
* Identify positive results occurring from Napoleon's rule in the 1800s. 
* Compile a list of famous musicians, what instrument they play and number of record sales. 

**Open-ended Exploration:**
* Who has the best guitar tone?
* What types of materials are available for kitchen counters?

**Known-item Search:**
* Find Amazon's home page.
* What is Elliot Linsey's QMUL ID number?

**Question Answering:**
![question%20answering.JPG](attachment:question%20answering.JPG)

**Prospective Searching**

This is based on more dynamic data which is being created or classified in real time. 

**Filtering:**
* Creating a spam filter that classifies incoming mail as spam or not spam. This is a binary decision.

**Multi-class labelling or Classification:**
* Sorting news stories into bins as they are posted, depending on the type of story. For example, if you were only interested in crime news, you could create a filter that evaluates each story as it is posted: stories that are not about crime are hidden, while crime stories trigger a notification. 
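
The news example could be sketched as follows. This is a toy version under strong assumptions: the keyword lists in `CATEGORY_KEYWORDS` and the `classify` and `filter_stream` helpers are hypothetical, and a real filter would normally use a classifier trained on labelled stories rather than hand-picked keywords.

```python
# Hypothetical keyword lists per category, chosen by hand for illustration.
CATEGORY_KEYWORDS = {
    "crime": {"robbery", "arrested", "police", "burglary"},
    "sport": {"match", "goal", "league", "tournament"},
}

def classify(story):
    """Assign the category whose keywords overlap most with the story, or None."""
    words = set(story.lower().split())
    best_category, best_overlap = None, 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best_category, best_overlap = category, overlap
    return best_category

def filter_stream(stories, wanted):
    """Yield only the incoming stories classified as the wanted category."""
    for story in stories:
        if classify(story) == wanted:
            yield story

incoming = [
    "police arrested two suspects after a robbery",
    "late goal decides the league match",
    "new bakery opens in town",
]
crime_stories = list(filter_stream(incoming, "crime"))
print(crime_stories)
```

Because `filter_stream` is a generator, each story is evaluated as it arrives, which matches the prospective setting where the collection is never complete.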

### Evaluation Methods

Our good friends Precision and Recall are used to evaluate the results of an IR system. They mean the same as before, but here they are defined in terms of *relevance* to our search query. 

Recall: The fraction of all relevant documents that the system retrieves, i.e. the ability to find every document relevant to our query. If you prioritise recall, there is a chance that non-relevant documents will also be returned in the search for all relevant ones. 

Precision: The fraction of retrieved documents that are relevant, i.e. the ability to retrieve as few non-relevant documents as possible. If you prioritise precision, we may not collect all the relevant documents, but the ones we do collect are more likely to be relevant. 

Recall $\approx$ Completeness

Precision $\approx$ Correctness
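
Both measures can be computed from two sets: the documents the system retrieved and the documents that are actually relevant. Under the usual set-based definitions, precision is the number of relevant retrieved documents divided by the number retrieved, and recall is the same count divided by the number of relevant documents. The sketch below assumes those definitions; the document ids and the `precision_recall` helper are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query using set-based definitions."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    # Guard against division by zero when either set is empty.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result list and relevance judgements for a single query.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67). Retrieving more documents could raise recall but would usually lower precision.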