# Project: Information Retrieval (IR)
- Calculate the TF-IDF of the corpus from 'files/holmes'

### Step 1: Import libraries

In [9]:
import os
import math
from nltk import word_tokenize

### Step 2: Read the corpus
- Read all the Sherlock Holmes texts in files/holmes/
- Create a dictionary (dict) called **corpus**
- Use os.listdir(...) ([docs](https://docs.python.org/3/library/os.html)) to iterate over all the filenames in 'files/holmes'
- For each filename, open the file, read its content, and store it in **corpus[filename]**

In [4]:
corpus = {}

for filename in os.listdir('files/holmes'):
    with open(f'files/holmes/{filename}') as f:
        corpus[filename] = f.read()
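
As a variation on the cell above, here is a minimal sketch that reads a directory with `pathlib`, skips sub-directories, and passes an explicit encoding. The helper name `read_corpus`, the demo file names, and the choice of UTF-8 are assumptions made for illustration; the actual encoding of the files in 'files/holmes' is not stated here.

```python
import tempfile
from pathlib import Path


def read_corpus(directory):
    """Return {filename: text} for every regular file in `directory`."""
    corpus = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():  # skip sub-directories
            # encoding="utf-8" is an assumption about the files
            corpus[path.name] = path.read_text(encoding="utf-8")
    return corpus


# Demo on a throwaway directory so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Holmes smiled.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Watson nodded.", encoding="utf-8")
    demo_corpus = read_corpus(tmp)

print(sorted(demo_corpus))
```

Calling `read_corpus('files/holmes')` would then play the same role as the loop above, provided the files really are UTF-8.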

### Step 3: Tokenize the content
- Iterate over **filename** in **corpus**
- For each filename, replace **corpus[filename]** with the list of lowercased tokens returned by word_tokenize(...) on the file's content, keeping only the tokens that are purely alphabetic.
    - HINT: Use list comprehension
    - HINT: Use **.isalpha()**

In [7]:
for filename in corpus:
    corpus[filename] = [word.lower() for word in word_tokenize(corpus[filename]) if word.isalpha()]
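
To see what the **.isalpha()** filter does, the sketch below applies the same list comprehension to a hand-written token list. The list `sample_tokens` is an assumption made for illustration and is not actual output of `word_tokenize`.

```python
# Hand-written tokens (illustrative only, not real word_tokenize output)
sample_tokens = ["Mr.", "Holmes", "'s", "pipe", ",", "221B", "!"]

# Keep tokens made only of letters, then lowercase them
filtered = [word.lower() for word in sample_tokens if word.isalpha()]

print(filtered)
```

Tokens that contain punctuation or digits, such as `"Mr."` and `"221B"`, are dropped because `str.isalpha()` returns `False` for them.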

### Step 4: Get all words
- Create a set **words**
    - HINT: **words = set()**
- For each **filename** in **corpus** update the set **words** with the content
    - HINT: apply **update(...)**

In [14]:
words = set()

for filename in corpus:
    words.update(corpus[filename])

### Step 5: Calculate term frequency (TF)
- Create an empty dictionary (dict) called **tf**
- Iterate over **filename** in **corpus**
- For each filename, set **tf[filename]** to a dict that maps each word to the number of times it occurs in that file.
    - HINT: Use dict comprehension with **word** in **words**

In [18]:
tf = {}

for filename in corpus:
    tf[filename] = {word: corpus[filename].count(word) for word in words}
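
Calling `list.count` once per word rescans the whole token list each time, which can be slow on a large vocabulary. A possible alternative, sketched here on a toy corpus, counts each document once with `collections.Counter`. The names `toy_corpus`, `toy_words`, and `toy_tf` are assumptions made for illustration.

```python
from collections import Counter

# Toy stand-in for the tokenized corpus (illustrative data)
toy_corpus = {
    "a.txt": ["holmes", "watson", "holmes"],
    "b.txt": ["watson", "pipe"],
}

# Vocabulary across all documents
toy_words = set()
for tokens in toy_corpus.values():
    toy_words.update(tokens)

toy_tf = {}
for filename, tokens in toy_corpus.items():
    counts = Counter(tokens)  # single pass over the document
    # A Counter returns 0 for words that do not occur
    toy_tf[filename] = {word: counts[word] for word in toy_words}

print(toy_tf["a.txt"]["holmes"], toy_tf["b.txt"]["holmes"])
```

The resulting dictionaries have the same shape as **tf** above: every word in the vocabulary has an entry, with 0 for words absent from a document.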

### Step 6: Calculate the inverse document frequency (IDF)
- Create an empty dictionary called **idf**
- Iterate over **word** in **words**
- For each **word** count the number of documents in the corpus that contain it
    - HINT: **freq = sum(word in corpus[filename] for filename in corpus)**
- Set **idf[word]** to the logarithm of the number of documents divided by the calculated frequency.

In [15]:
idf = {}

for word in words:
    freq = sum(word in corpus[filename] for filename in corpus)
    idf[word] = math.log(len(corpus) / freq)
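
The sketch below works through the same IDF calculation on a toy corpus. It converts each document to a set first, so the `word in ...` test does not scan a list; that is an optional tweak, not part of the cell above. The names `toy_corpus`, `doc_sets`, and `toy_idf` are assumptions made for illustration.

```python
import math

# Toy stand-in for the tokenized corpus (illustrative data)
toy_corpus = {
    "a.txt": ["holmes", "watson", "holmes"],
    "b.txt": ["watson", "pipe"],
}

# One set per document makes membership tests cheap
doc_sets = {filename: set(tokens) for filename, tokens in toy_corpus.items()}
toy_words = set().union(*doc_sets.values())

toy_idf = {}
for word in toy_words:
    # Number of documents that contain the word
    freq = sum(word in tokens for tokens in doc_sets.values())
    toy_idf[word] = math.log(len(toy_corpus) / freq)

for word in sorted(toy_idf):
    print(word, round(toy_idf[word], 4))
```

A word that appears in every document, like `"watson"` here, gets an IDF of `log(1) = 0`, so it contributes nothing to the final score. Because **words** is built from the corpus itself, `freq` is always at least 1 and the division is safe.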

### Step 7: Calculate the Term Frequency-Inverse Document Frequency (TF-IDF)
- Create a dictionary tfidf
- Iterate over **filename** in **corpus**
- For each **filename** calculate the TF-IDF of every word and store the results as a list of pairs **(word, tf-idf)**
    - HINT: Use list comprehension **[(word, tf[filename][word] * idf[word]) for word in words]**

In [19]:
tfidf = {}

for filename in corpus:
    tfidf[filename] = [(word, tf[filename][word] * idf[word]) for word in words]
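
Written as a formula, the score computed in Steps 5 to 7 is the following. The symbols are notation chosen here for illustration: $\mathrm{tf}(t, d)$ is the raw count of term $t$ in document $d$ (**tf[filename][word]**), $N$ is the number of documents (**len(corpus)**), and $\mathrm{df}(t)$ is the number of documents containing $t$ (**freq**). The logarithm is the natural logarithm, since `math.log` is called with one argument.

$$
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \ln\!\left(\frac{N}{\mathrm{df}(t)}\right)
$$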

### Step 8: Sort the values
- Iterate over **filename** in **corpus**
- For each **filename** sort the pairs in **tfidf[filename]** by their second item (the score) in descending order
    - HINT: Use **sorted** ([docs](https://docs.python.org/3/howto/sorting.html)) with **key=lambda x: x[1]** and **reverse=True**

In [21]:
for filename in corpus:
    tfidf[filename] = sorted(tfidf[filename], key=lambda x: x[1], reverse=True)
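
If only the top few terms are needed, a full sort is not strictly required. The sketch below compares `sorted` with `heapq.nlargest` on a small hand-made list of pairs; the data and the names `scores`, `ranked`, and `top2` are assumptions made for illustration.

```python
import heapq

# Hand-made (word, score) pairs (illustrative data)
scores = [("holmes", 1.5), ("pipe", 4.0), ("watson", 0.0), ("violin", 2.5)]

# Full sort by score, highest first, as in the cell above
ranked = sorted(scores, key=lambda x: x[1], reverse=True)

# Only the two highest-scoring pairs, without keeping a fully sorted list
top2 = heapq.nlargest(2, scores, key=lambda x: x[1])

print(ranked)
print(top2)
```

For the small per-document lists in this project either approach works; `nlargest` mainly helps when the vocabulary is large and only a handful of terms are printed.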

### Step 9: Print the top five words
- Iterate over **filename** in **corpus**
- For each **filename** print the filename, then iterate over the first five elements of **tfidf[filename]** and print each **term** and **score**

In [24]:
for filename in corpus:
    print(filename)
    for term, score in tfidf[filename][:5]:
        print(f"    {term}: {score:.4f}")

speckled.txt
    roylott: 60.8904
    stoner: 57.8459
    ventilator: 42.6233
    stepfather: 36.5343
    stoke: 33.4897
face.txt
    cottage: 56.4330
    munro: 18.8110
    jack: 18.2405
    grant: 16.4596
    norbury: 15.2226
twisted.txt
    clair: 82.2021
    neville: 57.8459
    lascar: 36.5343
    opium: 25.8651
    swandam: 24.3562
squires.txt
    cunningham: 94.3802
    alec: 57.8459
    acton: 45.6678
    william: 31.5063
    colonel: 31.3191
coronet.txt
    coronet: 82.2021
    arthur: 44.6761
    gems: 39.5788
    holder: 29.8481
    snow: 23.5138
carbuncle.txt
    goose: 61.1358
    geese: 51.7569
    horner: 39.5788
    ryder: 36.5343
    peterson: 33.4897
treaty.txt
    phelps: 118.7364
    joseph: 70.0240
    harrison: 60.8904
    woking: 42.6233
    holdhurst: 42.6233
bachelor.txt
    simon: 121.7809
    doran: 36.5343
    lestrade: 32.9193
    wedding: 30.5679
    lord: 29.6554
patient.txt
    blessington: 79.1576
    trevelyan: 48.7124
    brook: 24.3562
    consultati