### Feature Extraction: Bag of Words & TF-IDF
Learn how text is converted into a numerical representation using Bag of Words (BoW) and TF-IDF (Term Frequency – Inverse Document Frequency).
These representations are the foundation for downstream text processing tasks such as classification or clustering.

## Import Libraries
*CountVectorizer* builds the Bag of Words model.

*TfidfVectorizer* builds the TF-IDF model.

In [1]:
# Main libraries for text processing
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

## Preparing the Dataset

In [2]:
# A small dataset for experimentation
docs = [
    "Narendra Modi visited the United States.",
    "Barack Obama was the President of the United States.",
    "Shinzo Abe met Narendra Modi in Japan.",
    "The United States and Japan are strong allies."
]

# Inspect the dataset contents
pd.DataFrame(docs, columns=["Document"])

Unnamed: 0,Document
0,Narendra Modi visited the United States.
1,Barack Obama was the President of the United S...
2,Shinzo Abe met Narendra Modi in Japan.
3,The United States and Japan are strong allies.


## Bag of Words (BoW)

In [3]:
# Build the BoW model
bow_vectorizer = CountVectorizer()
bow_matrix = bow_vectorizer.fit_transform(docs)

# Convert to a DataFrame for readability
bow_df = pd.DataFrame(bow_matrix.toarray(), columns=bow_vectorizer.get_feature_names_out())
bow_df

Unnamed: 0,abe,allies,and,are,barack,in,japan,met,modi,narendra,obama,of,president,shinzo,states,strong,the,united,visited,was
0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,1,1,1,0
1,0,0,0,0,1,0,0,0,0,0,1,1,1,0,1,0,2,1,0,1
2,1,0,0,0,0,1,1,1,1,1,0,0,0,1,0,0,0,0,0,0
3,0,1,1,1,0,0,1,0,0,0,0,0,0,0,1,1,1,1,0,0
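
To make the table above less of a black box, here is a minimal pure-Python sketch of the counting step. It assumes the default `CountVectorizer` behaviour (lowercasing, tokens of at least two word characters, alphabetically sorted vocabulary). The names `tokenize`, `bow_row`, and `manual_bow` are illustrative helpers, not part of scikit-learn.

```python
import re
from collections import Counter

docs = [
    "Narendra Modi visited the United States.",
    "Barack Obama was the President of the United States.",
    "Shinzo Abe met Narendra Modi in Japan.",
    "The United States and Japan are strong allies.",
]

# Same shape as CountVectorizer's default token pattern:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})


def bow_row(doc):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]


manual_bow = [bow_row(doc) for doc in docs]

for row in manual_bow:
    print(row)
```

Each printed row should line up with the corresponding row of `bow_df`; for example, the second document counts "the" twice.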


## TF-IDF (Weighted Features)
*   TF (Term Frequency): how often a word appears in a document
*   IDF (Inverse Document Frequency): how rare a word is across all documents
*   TF-IDF: lowers the weight of common words such as "the" so that they do not dominate the result
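
The three definitions above can be made concrete with a small hand computation. The sketch below assumes the default `TfidfVectorizer` settings: raw counts as TF, smoothed IDF computed as `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each document vector. The helper names `tokenize`, `idf`, and `tfidf` are illustrative only.

```python
import math
import re
from collections import Counter

docs = [
    "Narendra Modi visited the United States.",
    "Barack Obama was the President of the United States.",
    "Shinzo Abe met Narendra Modi in Japan.",
    "The United States and Japan are strong allies.",
]


def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(doc) for doc in docs]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf(tokens):
    """TF * IDF per term, then scale the vector to unit L2 length."""
    tf = Counter(tokens)
    raw = {term: count * idf(term) for term, count in tf.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


weights_doc0 = tfidf(tokenized[0])
for term, weight in sorted(weights_doc0.items()):
    print(f"{term:10s} {weight:.6f}")
```

Under these assumptions the values agree with the first row of the TF-IDF output in this notebook: "visited" occurs in only one document and gets the largest weight, while "the", "united", and "states" occur in three of the four documents and get the smallest.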


In [4]:
# Build the TF-IDF model
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(docs)

# Convert to a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df

Unnamed: 0,abe,allies,and,are,barack,in,japan,met,modi,narendra,obama,of,president,shinzo,states,strong,the,united,visited,was
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.423521,0.423521,0.0,0.0,0.0,0.0,0.342877,0.0,0.342877,0.342877,0.537183,0.0
1,0.0,0.0,0.0,0.0,0.366508,0.0,0.0,0.0,0.0,0.0,0.366508,0.366508,0.366508,0.0,0.233937,0.0,0.467874,0.233937,0.0,0.366508
2,0.412928,0.0,0.0,0.0,0.0,0.412928,0.325557,0.412928,0.325557,0.325557,0.0,0.0,0.0,0.412928,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.413668,0.413668,0.413668,0.0,0.0,0.32614,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.264039,0.413668,0.264039,0.264039,0.0,0.0
