## TF-IDF: Term Frequency - Inverse Document Frequency
Learning source: https://www.youtube.com/watch?v=f0a1XXmaQp8&list=PL2O3HdJI4voHNEv59SdXKRQVRZAFmwN9E&index=12

- **TF-IDF** is a statistical measure of **how important a word is to a particular document** within a collection of documents (a corpus)
- At its core it is the product of two values: TF (term frequency) and IDF (inverse document frequency)
- TF is often defined as **term frequency adjusted for document length**: the number of times a term occurs in a document divided by the total number of words in that document. sklearn's `TfidfVectorizer` uses the raw count instead, which yields the same final weights here because each document vector is L2-normalized afterwards
- `TfidfVectorizer` also applies L2 normalization by default (`norm='l2'`), so every document vector has unit Euclidean length
- [sklearn documentation](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
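
The points above can be sketched by hand in plain Python. This is a minimal sketch, not sklearn's actual code. It assumes the default settings of `TfidfVectorizer` (raw term counts for TF, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization) and uses the same five sentences as the Dataset section. sklearn's English stop-word list is replaced by a hand-picked `STOP_WORDS` set that covers only the stop words occurring in this corpus. The helper names `tokenize` and `tfidf_weights` are made up for illustration.

```python
import math
from collections import Counter

# Assumption: a hand-picked stand-in for sklearn's English stop-word list,
# containing only the stop words that occur in this small corpus.
STOP_WORDS = {"the", "had", "a", "from", "of"}

corpus = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def tfidf_weights(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed IDF (assumed sklearn default, smooth_idf=True).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)  # TF as raw counts
        raw = {term: tf * idf[term] for term, tf in counts.items()}
        # L2 normalization: scale the document vector to unit length.
        # The "or 1.0" guards against a document with no remaining terms.
        length = math.sqrt(sum(value ** 2 for value in raw.values())) or 1.0
        weights.append({term: value / length for term, value in raw.items()})
    return weights


weights = tfidf_weights(corpus)
for term, weight in sorted(weights[1].items()):
    print(f"{term}: {weight:.6f}")
```

For the second sentence this gives about 0.5887 for cat, 0.3477 for mouse, and 0.7297 for saw, in line with the `TfidfVectorizer` output shown further down.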

### *Dataset*
---

In [1]:
corpus = [
    'the house had a tiny little mouse',
    'the cat saw the mouse',
    'the mouse ran away from the house',
    'the cat finally ate the mouse',
    'the end of the mouse story']

corpus

['the house had a tiny little mouse',
 'the cat saw the mouse',
 'the mouse ran away from the house',
 'the cat finally ate the mouse',
 'the end of the mouse story']

### *TF-IDF Weights with **TfidfVectorizer***
---

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
response = vectorizer.fit_transform(corpus)
print(response)

  (0, 7)	0.2808823162882302
  (0, 6)	0.5894630806320427
  (0, 11)	0.5894630806320427
  (0, 5)	0.47557510189256375
  (1, 9)	0.7297183669435993
  (1, 2)	0.5887321837696324
  (1, 7)	0.3477147117091919
  (2, 1)	0.5894630806320427
  (2, 8)	0.5894630806320427
  (2, 7)	0.2808823162882302
  (2, 5)	0.47557510189256375
  (3, 0)	0.5894630806320427
  (3, 4)	0.5894630806320427
  (3, 2)	0.47557510189256375
  (3, 7)	0.2808823162882302
  (4, 10)	0.6700917930430479
  (4, 3)	0.6700917930430479
  (4, 7)	0.3193023297639811


**How to read the output:**
- Numbers in the tuple
    - left: the index of the document in the corpus; 0 is the first sentence, 1 is the second sentence, and so on.
    - right: the index of the generated feature name (token); look it up with `get_feature_names_out()`, as in the code below
- The rightmost number: the computed TF-IDF weight of that term in that document

In [5]:
vectorizer.get_feature_names_out()

array(['ate', 'away', 'cat', 'end', 'finally', 'house', 'little', 'mouse',
       'ran', 'saw', 'story', 'tiny'], dtype=object)

index|feature
---|---
0|ate
1|away
2|cat
3|end
4|finally
5|house
6|little
7|mouse
8|ran
9|saw
10|story
11|tiny

Sentence at index 0: **'the house had a tiny little mouse'** <br>
Output of `response`:
- (0, 7) - the first document in the corpus contains the feature at index 7, **mouse**
- (0, 6) - the first document in the corpus contains the feature at index 6, **little**
- (0, 11) - the first document in the corpus contains the feature at index 11, **tiny**
- (0, 5) - the first document in the corpus contains the feature at index 5, **house**

The words *the*, *had*, and *a* do not appear because `stop_words='english'` removes them.
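
The same lookup can be done programmatically. The sketch below is plain Python and does not call sklearn: `feature_names` and `entries` are copied from the outputs shown above (weights shortened to six decimals), and `describe` is a hypothetical helper name.

```python
# Feature names copied from the get_feature_names_out() output above.
feature_names = ["ate", "away", "cat", "end", "finally", "house",
                 "little", "mouse", "ran", "saw", "story", "tiny"]

# (document index, feature index, weight) entries for document 0,
# copied from the printed sparse matrix and rounded to six decimals.
entries = [
    (0, 7, 0.280882),
    (0, 6, 0.589463),
    (0, 11, 0.589463),
    (0, 5, 0.475575),
]


def describe(entry):
    """Turn one (doc, feature, weight) triple into a readable line."""
    doc_idx, feature_idx, weight = entry
    return f"document {doc_idx}: {feature_names[feature_idx]} -> {weight:.6f}"


for entry in entries:
    print(describe(entry))
```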

**Calculation result**

In [6]:
response.todense()

matrix([[0.        , 0.        , 0.        , 0.        , 0.        ,
         0.4755751 , 0.58946308, 0.28088232, 0.        , 0.        ,
         0.        , 0.58946308],
        [0.        , 0.        , 0.58873218, 0.        , 0.        ,
         0.        , 0.        , 0.34771471, 0.        , 0.72971837,
         0.        , 0.        ],
        [0.        , 0.58946308, 0.        , 0.        , 0.        ,
         0.4755751 , 0.        , 0.28088232, 0.58946308, 0.        ,
         0.        , 0.        ],
        [0.58946308, 0.        , 0.4755751 , 0.        , 0.58946308,
         0.        , 0.        , 0.28088232, 0.        , 0.        ,
         0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.67009179, 0.        ,
         0.        , 0.        , 0.31930233, 0.        , 0.        ,
         0.67009179, 0.        ]])
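
As a quick sanity check on the L2 normalization, each row of this matrix should have a Euclidean length of 1. The sketch below uses only the standard library; the non-zero values are copied from the output above, and the helper name `l2_norm` is an assumption for illustration.

```python
import math

# Non-zero entries of each row, copied from the dense matrix above.
rows = [
    [0.4755751, 0.58946308, 0.28088232, 0.58946308],
    [0.58873218, 0.34771471, 0.72971837],
    [0.58946308, 0.4755751, 0.28088232, 0.58946308],
    [0.58946308, 0.4755751, 0.58946308, 0.28088232],
    [0.67009179, 0.31930233, 0.67009179],
]


def l2_norm(values):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in values))


for i, row in enumerate(rows, start=1):
    print(f"D{i}: {l2_norm(row):.4f}")
```

Because the printed values are rounded to eight decimals, the lengths are only approximately 1.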

Each row represents one sentence (document) in the corpus. For readability, the matrix can be converted to a DataFrame.

In [7]:
import pandas as pd

# Transpose so that rows are terms and columns are documents (D1..D5)
df = pd.DataFrame(response.todense().T,
                  index=vectorizer.get_feature_names_out(),
                  columns=[f'D{i+1}' for i in range(len(corpus))])

df

feature|D1|D2|D3|D4|D5
---|---|---|---|---|---
ate|0.0|0.0|0.0|0.589463|0.0
away|0.0|0.0|0.589463|0.0|0.0
cat|0.0|0.588732|0.0|0.475575|0.0
end|0.0|0.0|0.0|0.0|0.670092
finally|0.0|0.0|0.0|0.589463|0.0
house|0.475575|0.0|0.475575|0.0|0.0
little|0.589463|0.0|0.0|0.0|0.0
mouse|0.280882|0.347715|0.280882|0.280882|0.319302
ran|0.0|0.0|0.589463|0.0|0.0
saw|0.0|0.729718|0.0|0.0|0.0
story|0.0|0.0|0.0|0.0|0.670092
tiny|0.589463|0.0|0.0|0.0|0.0


**Example reading:** <br>
- The word **cat** appears in the **second document (D2)** and the **fourth document (D4)**, but its weight is higher in D2 than in D4. D2 keeps only three terms after stop-word removal while D4 keeps four, so L2 normalization leaves a larger share for each term in D2.

- The higher the weight of a word in a document, the better suited that word is as a keyword for that document.
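
Following that idea, keywords for a document can be picked by taking its highest-weighted terms. This is a minimal sketch in plain Python: the weights are copied from the table above for two of the documents, and `top_keywords` is a hypothetical helper, not part of sklearn or pandas.

```python
# Non-zero weights for D2 and D4, copied from the table above.
weights = {
    "D2": {"cat": 0.588732, "mouse": 0.347715, "saw": 0.729718},
    "D4": {"ate": 0.589463, "cat": 0.475575,
           "finally": 0.589463, "mouse": 0.280882},
}


def top_keywords(term_weights, k=2):
    """Return the k highest-weighted terms; ties are broken alphabetically."""
    ranked = sorted(term_weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


for doc, term_weights in weights.items():
    print(doc, top_keywords(term_weights))
```

The word mouse occurs in every document, so it ranks last in both D2 and D4.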