## Assignment 1

1. Build a classification model using SVM for the voice data, voice.csv.

    Answer:


**Preparing Libraries and Data**

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

data = pd.read_csv('../Data/voice.csv')

**Splitting Features and Target**

In [3]:
X = data.drop(columns=['label'])
y = data['label']

# Convert the target labels from text ('male' and 'female') to binary numeric values (0 and 1).
y = y.map({'male': 0, 'female': 1})
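
As a small illustration of the mapping step, the sketch below applies the same dictionary to a made-up Series. The values, including the stray `'unknown'`, are hypothetical and not taken from voice.csv. It shows that `Series.map` turns any value missing from the dictionary into `NaN`, so a null check after mapping is a cheap safeguard.

```python
import pandas as pd

# Hypothetical labels for illustration only (not from voice.csv)
toy_labels = pd.Series(['male', 'female', 'unknown'])
toy_encoded = toy_labels.map({'male': 0, 'female': 1})

# Values absent from the dictionary become NaN
print(toy_encoded)
print('Unmapped labels:', toy_encoded.isna().sum())
```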

**Splitting Training and Test Data**

In [4]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=50)
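
To show what `test_size=0.2` and a fixed `random_state` do, here is a minimal sketch on a hypothetical 10-sample dataset (the `toy_` names and values are assumptions for illustration). It holds out 20% of the rows and repeats the split with the same seed to check that the held-out rows are identical.

```python
from sklearn.model_selection import train_test_split

# Hypothetical 10-sample dataset for illustration only
toy_X = list(range(10))
toy_y = [0, 1] * 5

split_a = train_test_split(toy_X, toy_y, test_size=0.2, random_state=50)
split_b = train_test_split(toy_X, toy_y, test_size=0.2, random_state=50)

toy_Xtrain, toy_Xtest, toy_ytrain, toy_ytest = split_a
print(len(toy_Xtrain), len(toy_Xtest))

# The same seed reproduces the same held-out rows
print(split_a[1] == split_b[1])
```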

**Standardizing the Features**

In [6]:
scaler = StandardScaler()
Xtrain_scaled = scaler.fit_transform(Xtrain)
Xtest_scaled = scaler.transform(Xtest)
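
The following sketch makes the standardization step explicit on a made-up array (the numbers are illustrative, not from voice.csv). It checks that `StandardScaler` matches a manual z-score computed with the training mean and population standard deviation, and that the test rows are scaled with the training statistics rather than their own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not taken from voice.csv)
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
toy_test = np.array([[2.0, 30.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

# Manual z-score using the training mean and population std (ddof=0)
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)
manual_train = (toy_train - train_mean) / train_std
manual_test = (toy_test - train_mean) / train_std

print(np.allclose(toy_train_scaled, manual_train))
print(np.allclose(toy_test_scaled, manual_test))
```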

**Building and Training the SVM Model**

In [7]:
svm_model = SVC()
svm_model.fit(Xtrain_scaled, ytrain)
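
`SVC()` called without arguments uses the scikit-learn defaults, which include an RBF kernel and `C=1.0`. As an alternative arrangement (not the one used in this notebook), the scaler and the classifier can be chained in a pipeline so the scaler is always fitted on the training rows only. The sketch below assumes synthetic, well-separated data from `make_blobs`; all `toy_` names and `svm_pipeline` are hypothetical.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data (an assumption for illustration; not voice.csv)
toy_X, toy_y = make_blobs(n_samples=200, centers=[[-5, -5], [5, 5]],
                          cluster_std=1.0, random_state=0)
toy_Xtrain, toy_Xtest, toy_ytrain, toy_ytest = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=50)

# The pipeline fits the scaler on the training rows, then trains the SVM
svm_pipeline = make_pipeline(StandardScaler(), SVC())
svm_pipeline.fit(toy_Xtrain, toy_ytrain)

print('Kernel:', svm_pipeline.named_steps['svc'].kernel)
print('Test accuracy:', svm_pipeline.score(toy_Xtest, toy_ytest))
```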

**Making Predictions**

In [8]:
y_pred = svm_model.predict(Xtest_scaled)

**Evaluating the Model**

In [22]:
accuracy = accuracy_score(ytest, y_pred)
conf_matrix = confusion_matrix(ytest, y_pred)
class_report = classification_report(ytest, y_pred)

print(f'Accuracy: {accuracy :.4f}\n')
print(f'Confusion Matrix: {conf_matrix}\n')
print(f'Classification Report: \n\n{class_report}')

Accuracy: 0.9795

Confusion Matrix: [[300  10]
 [  3 321]]

Classification Report: 

              precision    recall  f1-score   support

           0       0.99      0.97      0.98       310
           1       0.97      0.99      0.98       324

    accuracy                           0.98       634
   macro avg       0.98      0.98      0.98       634
weighted avg       0.98      0.98      0.98       634
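
The figures in the report can be recomputed by hand from the confusion matrix. The sketch below copies the matrix printed above (rows are true classes, columns are predicted classes, as scikit-learn lays it out) and derives accuracy, precision, recall and F1 with NumPy. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
cm_svm = np.array([[300, 10],
                   [3, 321]])

accuracy_from_cm = np.trace(cm_svm) / cm_svm.sum()          # correct / total
precision_per_class = np.diag(cm_svm) / cm_svm.sum(axis=0)  # per predicted class
recall_per_class = np.diag(cm_svm) / cm_svm.sum(axis=1)     # per true class
f1_per_class = (2 * precision_per_class * recall_per_class
                / (precision_per_class + recall_per_class))

print(f'Accuracy: {accuracy_from_cm:.4f}')
print('Precision:', precision_per_class.round(2))
print('Recall:', recall_per_class.round(2))
print('F1:', f1_per_class.round(2))
```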



---

2. Build a Multinomial Naive Bayes classification model with the following requirements:

    * Use the spam.csv data

    * Use CountVectorizer features with stop_words enabled

    * Evaluate the results

    Answer:

**Preparing Libraries and Data**

In [23]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

data = pd.read_csv('../Data/spam.csv', encoding='latin-1')  # The encoding must be specified because the file is not UTF-8

# Drop the last 3 columns (everything after v1 and v2) because they are empty
data = data.drop(columns=data.columns[2:])

data.head()

|   | v1 | v2 |
|---|----|----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


**Splitting Features and Target**

In [24]:
X = data['v2']
y = data['v1']

**Splitting Training and Test Data**

In [25]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=50)

**Converting Text Data to Numeric Features with CountVectorizer**

In [26]:
countvext = CountVectorizer(stop_words='english')
Xtrain_count = countvext.fit_transform(Xtrain)
Xtest_count = countvext.transform(Xtest)
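
To see what `CountVectorizer(stop_words='english')` produces, here is a minimal sketch on two made-up messages (hypothetical text, not from spam.csv). It shows that common English words such as "you" and "the" are dropped from the vocabulary and that each remaining word is represented by its raw count per message.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical messages for illustration only
toy_messages = ['you have won a prize prize',
                'are you coming to the meeting']

toy_vectorizer = CountVectorizer(stop_words='english')
toy_counts = toy_vectorizer.fit_transform(toy_messages).toarray()

# Vocabulary that survives stop-word removal
print(sorted(toy_vectorizer.vocabulary_))

# Raw count of "prize" in the first message
prize_col = toy_vectorizer.vocabulary_['prize']
print(toy_counts[0, prize_col])
```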

**Building and Training the Multinomial Naive Bayes Model**

In [27]:
mnModel = MultinomialNB()
mnModel.fit(Xtrain_count, ytrain)
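
`MultinomialNB()` with default settings applies Laplace smoothing (`alpha=1.0`) when it estimates per-class word probabilities. The sketch below uses a tiny hypothetical count matrix (two messages, two vocabulary words) to check the fitted probabilities against the smoothed formula computed by hand. The `toy_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts: rows are messages, columns are vocabulary words
toy_word_counts = np.array([[2, 0],
                            [0, 3]])
toy_targets = np.array(['ham', 'spam'])

toy_nb = MultinomialNB()  # default alpha=1.0 (Laplace smoothing)
toy_nb.fit(toy_word_counts, toy_targets)

# Smoothed estimate: (count + alpha) / (class total + alpha * n_features)
manual_probs = np.array([[(2 + 1) / (2 + 2), (0 + 1) / (2 + 2)],
                         [(0 + 1) / (3 + 2), (3 + 1) / (3 + 2)]])

print(np.exp(toy_nb.feature_log_prob_))
print(np.allclose(np.exp(toy_nb.feature_log_prob_), manual_probs))
```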

**Making Predictions**

In [28]:
y_pred = mnModel.predict(Xtest_count)

**Evaluating the Model**

In [30]:
accuracy_train = mnModel.score(Xtrain_count, ytrain)
accuracy_test = accuracy_score(ytest, y_pred)
accuracy = accuracy_score(ytest, y_pred)
conf_matrix = confusion_matrix(ytest, y_pred)
class_report = classification_report(ytest, y_pred)

print(f'Accuracy Train: {accuracy_train :.4f}\n')
print(f'Accuracy Test: {accuracy_test :.4f}\n')
print(f'Accuracy: {accuracy :.4f}\n')
print(f'Confusion Matrix: \n{conf_matrix}\n')
print(f'Classification Report: \n\n{class_report}')


Accuracy Train: 0.9946

Accuracy Test: 0.9830

Accuracy: 0.9830

Confusion Matrix: 
[[951   3]
 [ 16 145]]

Classification Report: 

              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       954
        spam       0.98      0.90      0.94       161

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115



---

3. Build a Multinomial Naive Bayes classification model with the following requirements:

    * Use the spam.csv data

    * Use TF-IDF features with stop_words enabled

    * Evaluate the results and compare them with the results of Task 2.

    * Conclude which feature works best for the spam.csv data.

    Answer:

**Preparing Libraries and Data**

In [34]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

data = pd.read_csv('../Data/spam.csv', encoding='latin-1')  # The encoding must be specified because the file is not UTF-8

# Drop the last 3 columns (everything after v1 and v2) because they are empty
data = data.drop(columns=data.columns[2:])

data.head()

|   | v1 | v2 |
|---|----|----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


**Splitting Features and Target**

In [35]:
X = data['v2']
y = data['v1']

**Splitting Training and Test Data**

In [36]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=50)

**Converting Text Data to Numeric Features with the TF-IDF Vectorizer**

In [37]:
tfidfvect = TfidfVectorizer(stop_words='english')
Xtrain_tfidf = tfidfvect.fit_transform(Xtrain)
Xtest_tfidf = tfidfvect.transform(Xtest)
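
The sketch below illustrates how TF-IDF weights differ from raw counts, using three made-up messages (hypothetical text, not from spam.csv). With the default settings of `TfidfVectorizer`, each row is L2-normalized, and a word that appears in every message ("claim") receives a lower weight than a word that appears in only one ("meeting") when both occur once in the same message.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical messages for illustration only
toy_messages = ['prize claim prize', 'claim meeting', 'claim lunch']

toy_tfidf_vectorizer = TfidfVectorizer(stop_words='english')
toy_tfidf = toy_tfidf_vectorizer.fit_transform(toy_messages).toarray()

claim_col = toy_tfidf_vectorizer.vocabulary_['claim']
meeting_col = toy_tfidf_vectorizer.vocabulary_['meeting']

# Each row has unit Euclidean length under the default norm='l2'
row_norms = np.linalg.norm(toy_tfidf, axis=1)
print(row_norms)

# In the second message the rarer word outweighs the ubiquitous one
print(toy_tfidf[1, meeting_col] > toy_tfidf[1, claim_col])
```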

**Building and Training the Multinomial Naive Bayes Model**

In [38]:
mnModel = MultinomialNB()
mnModel.fit(Xtrain_tfidf, ytrain)

**Making Predictions**

In [39]:
y_pred = mnModel.predict(Xtest_tfidf)

**Evaluating the Model**

In [40]:
accuracy_train = mnModel.score(Xtrain_tfidf, ytrain)
accuracy_test = accuracy_score(ytest, y_pred)
accuracy = accuracy_score(ytest, y_pred)
conf_matrix = confusion_matrix(ytest, y_pred)
class_report = classification_report(ytest, y_pred)

print(f'Accuracy Train: {accuracy_train :.4f}\n')
print(f'Accuracy Test: {accuracy_test :.4f}\n')
print(f'Accuracy: {accuracy :.4f}\n')
print(f'Confusion Matrix: \n{conf_matrix}\n')
print(f'Classification Report: \n\n{class_report}')


Accuracy Train: 0.9843

Accuracy Test: 0.9605

Accuracy: 0.9605

Confusion Matrix: 
[[954   0]
 [ 44 117]]

Classification Report: 

              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       954
        spam       1.00      0.73      0.84       161

    accuracy                           0.96      1115
   macro avg       0.98      0.86      0.91      1115
weighted avg       0.96      0.96      0.96      1115



---

### Conclusion

Based on the evaluation results, CountVectorizer performs better at detecting spam on the spam.csv dataset: its test accuracy is 98.30% versus 96.05% for TF-IDF, and its spam-class F1-score is 0.94 versus 0.84. The gap comes mainly from recall. The CountVectorizer model catches 145 of the 161 spam messages in the test set (recall 0.90), while the TF-IDF model catches only 117 (recall 0.73). This makes CountVectorizer the more suitable feature for this case.

TF-IDF trails on accuracy and recall, but it has strengths of its own. It flagged no ham message as spam (spam precision 1.00, against 3 false positives for CountVectorizer). It also weights each term by how frequent it is across all documents, which can help in cases where the difference between common words and distinctive words matters for the analysis.

Therefore, although CountVectorizer suits this case better, TF-IDF still has advantages of its own in more complex situations where a deeper understanding of the word distribution is needed.
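
The comparison can be reproduced directly from the two confusion matrices printed in Tasks 2 and 3. The sketch below is a minimal helper (the name `spam_metrics` is hypothetical) that treats spam as the positive class, assuming scikit-learn's layout in which the second row and column correspond to `spam`.

```python
import numpy as np

def spam_metrics(cm):
    """Accuracy plus precision, recall and F1 for the spam class (second row/column)."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / cm.sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Confusion matrices copied from the evaluation outputs of Tasks 2 and 3
cm_count = np.array([[951, 3], [16, 145]])
cm_tfidf = np.array([[954, 0], [44, 117]])

for name, cm in [('CountVectorizer', cm_count), ('TF-IDF', cm_tfidf)]:
    acc, prec, rec, f1 = spam_metrics(cm)
    print(f'{name}: accuracy={acc:.4f} precision={prec:.2f} '
          f'recall={rec:.2f} f1={f1:.2f}')
```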