# TF-IDF

TF-IDF (term frequency-inverse document frequency) is a method for extracting numerical features from text for use in machine learning models.

## TF - term frequency

Term frequency measures how often a word occurs in a text, relative to the length of that text. It is calculated with the formula:

$$tf(t,d)=\frac{n_t}{\sum_i n_i}$$

Where:

- $t$ - a word (term);
- $d$ - a text (document);
- $n_t$ - number of occurrences of word $t$ in text $d$;
- $\sum_i n_i$ - total number of words in text $d$, that is, the sum of the occurrence counts of all distinct words $i$.
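
As a minimal sketch of the formula, assuming that words are obtained by splitting the text on whitespace (the helper name `term_frequency` is hypothetical and introduced only for illustration):

```python
from collections import Counter


def term_frequency(term: str, document: str) -> float:
    """Return tf(t, d) = n_t / sum_i n_i for a whitespace-tokenized text."""
    words = document.split()   # assumption: whitespace tokenization
    if not words:              # an empty text has no words to count
        return 0.0
    # Counter returns 0 for words that never occur in the text.
    return Counter(words)[term] / len(words)


# "credit" is 2 of the 6 words, so its term frequency is one third.
print(term_frequency("credit", "give credit where credit is due"))
```

Returning `0.0` for an empty text is a choice made for this sketch; the formula itself is undefined in that case.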

The cell below computes the term frequencies of the words in several phrases and renders the result as an HTML table with two columns: `Original phrase` and `Terms frequency`. The second column lists every distinct word of the phrase together with its $tf$, in the form `<word> - <tf>`.

Let's work through the first phrase, "a penny saved is a penny earned", one step at a time:

- The phrase contains seven words in total: $\sum_i n_i = 7$;
- The word "a" occurs twice, so $n_{a} = 2 \Rightarrow tf(a, d)=\frac{2}{7} \approx 0.29$;
- The word "penny" occurs twice, so $n_{penny}=2 \Rightarrow tf(penny, d)= \frac{2}{7} \approx 0.29$;
- Every other word ("saved", "is", "earned") occurs once, so its $tf$ is $\frac{1}{7} \approx 0.14$.
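
The same arithmetic can be checked with exact fractions. This is only a sketch: it assumes whitespace tokenization, and the variable names are illustrative. It also shows a handy sanity check, namely that the term frequencies of all distinct words in a text sum to 1.

```python
from collections import Counter
from fractions import Fraction

phrase = "a penny saved is a penny earned"
words = phrase.split()  # assumption: whitespace tokenization

# Exact tf for every distinct word: n_t / (total number of words).
tf = {word: Fraction(count, len(words)) for word, count in Counter(words).items()}

for word, value in tf.items():
    # Print the exact fraction next to its two-decimal approximation.
    print(f"{word}: {value} ~ {float(value):.2f}")

# Sanity check: the frequencies of all distinct words add up to 1.
assert sum(tf.values()) == 1
```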

```python
from collections import Counter
from IPython.display import HTML

phrases = [
    'a penny saved is a penny earned',
    'the quick brown fox jumps over the lazy dog',
    'beauty is in the eye of the beholder',
    'early to bed and early to rise makes a man healthy wealthy and wise',
    'give credit where credit is due',
    "if at first you don't succeed try try again",
    'justice delayed is justice denied',
    'keep your friends close and your enemies closer',
    'no pain no gain',
    'quickly come quickly go',
    'united we stand divided we fall',
    'when in rome do as the romans do'
]

# Header row of the HTML table.
html_table = "<tr><th>Original phrase</th><th>Terms frequency</th></tr>"

for p in phrases:
    # Split on whitespace; every token counts as one word.
    words_in_phrase = p.split()
    words_count = len(words_in_phrase)

    # One "<word> - <tf>" entry per distinct word, in order of first appearance.
    counts_line = "<br>".join(
        f"{word} - {round(count / words_count, 2)}"
        for word, count in Counter(words_in_phrase).items()
    )
    html_table += f"<tr><td>{p}</td><td>{counts_line}</td></tr>"

# Display the table as the cell output.
HTML("<table>" + html_table + "</table>")
```

| Original phrase | Terms frequency |
|---|---|
| a penny saved is a penny earned | a - 0.29, penny - 0.29, saved - 0.14, is - 0.14, earned - 0.14 |
| the quick brown fox jumps over the lazy dog | the - 0.22, quick - 0.11, brown - 0.11, fox - 0.11, jumps - 0.11, over - 0.11, lazy - 0.11, dog - 0.11 |
| beauty is in the eye of the beholder | beauty - 0.12, is - 0.12, in - 0.12, the - 0.25, eye - 0.12, of - 0.12, beholder - 0.12 |
| early to bed and early to rise makes a man healthy wealthy and wise | early - 0.14, to - 0.14, bed - 0.07, and - 0.14, rise - 0.07, makes - 0.07, a - 0.07, man - 0.07, healthy - 0.07, wealthy - 0.07, wise - 0.07 |
| give credit where credit is due | give - 0.17, credit - 0.33, where - 0.17, is - 0.17, due - 0.17 |
| if at first you don't succeed try try again | if - 0.11, at - 0.11, first - 0.11, you - 0.11, don't - 0.11, succeed - 0.11, try - 0.22, again - 0.11 |
| justice delayed is justice denied | justice - 0.4, delayed - 0.2, is - 0.2, denied - 0.2 |
| keep your friends close and your enemies closer | keep - 0.12, your - 0.25, friends - 0.12, close - 0.12, and - 0.12, enemies - 0.12, closer - 0.12 |
| no pain no gain | no - 0.5, pain - 0.25, gain - 0.25 |
| quickly come quickly go | quickly - 0.5, come - 0.25, go - 0.25 |
