<a href="https://colab.research.google.com/github/adel-nouar/ML_with_Rune/blob/main/13%20-%20Lesson%20-%20Information%20Retrieval%20(IR).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Information Retrieval (IR)
### Goal of lesson
- Learn what Information Retrieval is
- Topic modeling of documents
- How to use Term Frequency and understand the limitations
- Implement Term Frequency by Inverse Document Frequency (TF-IDF)

### What is Information Retrieval (IR)
- The task of finding relevant documents in response to a user query
- Web search engines are the most visible IR applications ([wiki](https://en.wikipedia.org/wiki/Information_retrieval))

### Topic Modeling
- Models for discovering the topics in a set of documents
    - e.g., it provides us with methods to organize, understand and summarize large collections of textual information.
- Topic modeling can be described as a method for finding a group of words that best represents the information.

## Approach 1: Term Frequency

### Term Frequency
- The number of times a term occurs in a document is called its term frequency ([wiki](https://en.wikipedia.org/wiki/Tf–idf#Term_frequency))

$\text{tf}(t, d) = f_{t, d}$: The number of times term $t$ occurs in document $d$.

- There are other ways to define term frequency (see [wiki](https://en.wikipedia.org/wiki/Tf–idf#Term_frequency_2))
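
As a minimal sketch of the raw-count definition above, $f_{t,d}$ can be computed with `collections.Counter`. The sample sentence and the helper name `term_frequency` are assumptions made for illustration (they are not part of the lesson's data), and a plain `split()` stands in for a real tokenizer.

```python
from collections import Counter

def term_frequency(tokens):
    """Return a dict mapping each term to its raw count in the token list."""
    return dict(Counter(tokens))

# Illustrative sentence (not from the Holmes corpus), lowercased and split on whitespace
tokens = "the dog chased the cat".lower().split()
tf = term_frequency(tokens)
print(tf["the"], tf["dog"])
```

Here `the` is counted twice and `dog` once, which is exactly the raw-count form of $\text{tf}(t, d)$.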

> #### Programming Notes:
> - Libraries used
>     - [**nltk**](https://www.nltk.org) - Natural Language Toolkit
>     - [**os**](https://docs.python.org/3/library/os.html) - Miscellaneous operating system interfaces
>     - [**math**](https://docs.python.org/3/library/math.html) - Mathematical functions
> - Functionality and concepts used
>     - **List/Dict Comprehension** to convert data ([Lecture on **List Comprehension**](https://youtu.be/vCYEvtfXdig))
>     - [**sorted**](https://docs.python.org/3/howto/sorting.html) to sort items by a key
>     - [**lambda**](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions) for short anonymous functions

In [4]:
import os
import nltk
import math
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [6]:
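corpus = {}

# Build a term-frequency table for every file in the corpus directory
for filename in os.listdir('files/holmes/'):
  with open(f'files/holmes/{filename}') as f:
    # Tokenize, lowercase, and keep alphabetic tokens only
    content = [word.lower() for word in nltk.word_tokenize(f.read()) if word.isalpha()]

    # Count how often each distinct word occurs in this document
    freq = {word: content.count(word) for word in set(content)}

    corpus[filename] = freq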

In [8]:
for filename in corpus:
  corpus[filename] = sorted(corpus[filename].items(), key=lambda x: x[1], reverse=True)

In [10]:
for filename in corpus:
  print(filename)
  for word, score in corpus[filename][:5]:
    print(f' {word}: {score}')

clerk.txt
 the: 312
 i: 202
 a: 184
 and: 180
 of: 174
problem.txt
 the: 427
 i: 227
 to: 209
 of: 191
 and: 187
coronet.txt
 the: 466
 i: 347
 to: 270
 and: 238
 a: 213
boscombe.txt
 the: 529
 and: 279
 i: 272
 to: 251
 of: 244
treaty.txt
 the: 688
 i: 343
 of: 319
 and: 318
 to: 316
league.txt
 the: 460
 and: 271
 i: 261
 a: 239
 of: 224
carbuncle.txt
 the: 463
 of: 233
 a: 208
 and: 199
 to: 188
ritual.txt
 the: 482
 of: 255
 and: 216
 to: 200
 i: 190
speckled.txt
 the: 600
 and: 281
 of: 276
 a: 252
 i: 232
bachelor.txt
 the: 401
 i: 236
 and: 234
 to: 233
 a: 210
interpreter.txt
 the: 353
 and: 188
 a: 186
 to: 178
 of: 149
engineer.txt
 the: 431
 i: 295
 and: 250
 a: 233
 to: 215
copper.txt
 the: 485
 i: 321
 and: 275
 to: 256
 a: 237
blaze.txt
 the: 641
 of: 242
 a: 242
 and: 242
 to: 238
squires.txt
 the: 508
 of: 206
 and: 169
 to: 168
 a: 152
twisted.txt
 the: 493
 a: 275
 and: 270
 i: 238
 of: 234
crooked.txt
 the: 438
 and: 204
 of: 199
 i: 182
 a: 175
gloria_scott.txt
 the

### Problem: Stop or Function Words
- Words that have little meaning on their own ([wiki](https://en.wikipedia.org/wiki/Stop_word))
- Examples: am, by, do, is, which, ...
- Student exercise: Remove function words and see result (HINT: nltk has a list of stopwords)
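
One possible starting point for the exercise is sketched below. The stop-word set is a small hand-picked assumption used only to keep the example self-contained; NLTK's fuller list is available through `nltk.corpus.stopwords.words('english')` after `nltk.download('stopwords')`. The counts for `holmes` and `clerk` are made up for illustration.

```python
# Small hand-picked stop-word set for illustration only (an assumption);
# NLTK's English list could be swapped in for real use.
STOP_WORDS = {"the", "i", "a", "and", "of", "to", "is", "am", "by", "do", "which"}

def remove_stop_words(freq, stop_words=STOP_WORDS):
    """Return a copy of a term-frequency dict without the stop words."""
    return {word: count for word, count in freq.items() if word not in stop_words}

# Hypothetical counts, chosen to show the effect of filtering
freq = {"the": 312, "i": 202, "holmes": 40, "clerk": 25}
filtered = remove_stop_words(freq)

# Same ranking step as in the term-frequency cells: highest count first
top = sorted(filtered.items(), key=lambda x: x[1], reverse=True)
print(top)
```

After filtering, only content-bearing words remain at the top of the ranking.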

## Approach 2: TF-IDF
- TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. ([wiki](https://en.wikipedia.org/wiki/Tf–idf))

### Inverse Document Frequency
- Measure of how common or rare a word is across documents

$\text{idf}(t, D) = \log{\frac{N}{|\{d\in D : t\in d\}|}} = \log{\frac{\text{Total Documents}}{\text{Number of Documents Containing "term"}}}$
- $D$: All documents in the corpus
- $N$: total number of documents in the corpus $N = |D|$
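
The formula above can be sketched as a small function. This assumes the natural logarithm (the formula does not fix a base) and documents given as lists of tokens; the function name `idf` and the two toy documents are illustrative assumptions.

```python
import math

def idf(term, documents):
    """log(N / df), where df is the number of documents containing the term."""
    n_docs = len(documents)
    df = sum(term in doc for doc in documents)
    if df == 0:
        # The formula is undefined for terms that occur in no document
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(n_docs / df)

# Two toy documents, already tokenized
docs = [
    ["this", "is", "the", "sample"],
    ["this", "is", "another", "sample"],
]
print(idf("sample", docs))   # term occurs in every document
print(idf("another", docs))  # term occurs in one of two documents
```

A term that appears in every document gets an IDF of zero, while rarer terms get larger values.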

### TF-IDF
- Ranks how important each word is in a document by multiplying Term Frequency (TF) by Inverse Document Frequency (IDF)

$\text{tf-idf}(t, d) = \text{tf}(t, d)\cdot \text{idf}(t, D)$

### Example

- Document 1: *This is the sample of the day*
- Document 2: *This is another sample of the day*
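
Applying the definitions by hand to these two documents gives the following, where $N = 2$, the natural logarithm is assumed, and $d_1$, $d_2$ are shorthand (introduced here) for Document 1 and Document 2:

$\text{idf}(\text{the}, D) = \log{\frac{2}{2}} = 0$, so $\text{tf-idf}(\text{the}, d_1) = 2 \cdot 0 = 0$

$\text{idf}(\text{another}, D) = \log{\frac{2}{1}} \approx 0.693$, so $\text{tf-idf}(\text{another}, d_2) = 1 \cdot 0.693 \approx 0.693$

Although *the* occurs twice in Document 1, it appears in both documents and therefore scores zero, while *another* is the only word that distinguishes Document 2.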

In [11]:
doc1 = "This is the sample of the day".split()
doc2 = "This is another sample of the day".split()

In [12]:
corpus = [doc1, doc2]
corpus

[['This', 'is', 'the', 'sample', 'of', 'the', 'day'],
 ['This', 'is', 'another', 'sample', 'of', 'the', 'day']]

In [16]:
tf1 = {word: doc1.count(word) for word in set(doc1)}
tf2 = {word: doc2.count(word) for word in set(doc2)}

In [17]:
tf1

{'This': 1, 'day': 1, 'is': 1, 'of': 1, 'sample': 1, 'the': 2}

In [18]:
tf2

{'This': 1, 'another': 1, 'day': 1, 'is': 1, 'of': 1, 'sample': 1, 'the': 1}

In [22]:
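term = 'sample'
# Raw ratio N / (number of documents containing the term);
# the logarithm from the idf formula is not applied here
ids = 2/sum(term in doc for doc in corpus)

tf1.get(term, 0)*ids, tf2.get(term, 0)*ids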

(1.0, 1.0)
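
For comparison, here is a sketch that applies the logarithm from the IDF definition and scores every term in a document. It reuses the two example documents; the function name `tf_idf` and the choice of natural logarithm are assumptions, not part of the original cells.

```python
import math

doc1 = "This is the sample of the day".split()
doc2 = "This is another sample of the day".split()
corpus = [doc1, doc2]

def tf_idf(doc, corpus):
    """Score every term in doc as tf * log(N / df), highest score first."""
    scores = {}
    for term in set(doc):
        tf = doc.count(term)                 # raw count in this document
        df = sum(term in d for d in corpus)  # documents containing the term
        scores[term] = tf * math.log(len(corpus) / df)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Highest-scoring term of the second document
print(tf_idf(doc2, corpus)[0])
```

With the logarithm in place, terms shared by both documents (including `sample`) score zero, and `another` is the only term with a positive score.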