# Assignments

## Task 1

Build a Multinomial Naive Bayes classification model with the following requirements:

1. Use the `spam.csv` data
2. Use `CountVectorizer` features with **stop_words** enabled
3. Evaluate the results

## Task 2

Build a Multinomial Naive Bayes classification model with the following requirements:

1. Use the `spam.csv` data
2. Use `TF-IDF` features with **stop_words** enabled
3. Evaluate the results and compare them with the results of Task 1.
4. Conclude which feature works best for the `spam.csv` data

### Task 1

Load the data

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('spam.csv', encoding='latin-1') 
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


Drop the unused columns

In [2]:
df = df.drop(columns=df.columns[2:])  # keep only v1 and v2
df.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Rename the columns v1 and v2

In [3]:
new_cols = {
    'v1': 'Labels',
    'v2': 'SMS'
}
df = df.rename(columns=new_cols)
df.head()

Unnamed: 0,Labels,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Encode the labels

In [4]:
new_labels = {
    'spam': 1,
    'ham': 0
}
df['Labels'] = df['Labels'].map(new_labels)
df.head()

Unnamed: 0,Labels,SMS
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
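
`Series.map` with a dictionary turns every value that is not a key into `NaN`, so a stray label would pass through silently. A minimal sketch of a check for this, assuming a made-up label series (the capitalized `'Spam'` is hypothetical and does not come from `spam.csv`):

```python
import pandas as pd

# Hypothetical labels; 'Spam' (capitalized) is not a key in the mapping
labels = pd.Series(['ham', 'spam', 'Spam'])
encoded = labels.map({'spam': 1, 'ham': 0})

# Values without a matching key become NaN, so count them before training
n_missing = int(encoded.isna().sum())
print(f'Unmapped labels: {n_missing}')
```

If the count is non-zero, the raw labels need cleaning before the model is fitted.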


Separate the SMS text from the labels

In [5]:
X = df['SMS'].values
y = df['Labels'].values

Feature extraction

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)

# Initialize CountVectorizer
bow = CountVectorizer(stop_words='english')

# Fit CountVectorizer on X_train and transform it
X_train = bow.fit_transform(X_train)

# Transform X_test
# Why only transform? For the same reason as in experiment 3:
# the vocabulary must be learned from X_train only, so that nothing about the test data leaks into the features
# and the test data stays unseen by the model
X_test = bow.transform(X_test)

print(len(bow.get_feature_names_out()))
print(f'Data dimensions: {X_train.shape}')
print(bow.get_feature_names_out())

7502
Data dimensions: (4457, 7502)
['00' '000' '000pes' ... '½ûªt' '½ûªve' 'ì¼1']


Training and evaluating the model with MultinomialNB

In [9]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Initialize MultinomialNB
mnb = MultinomialNB()

# Fit the model
mnb.fit(X_train, y_train)

# Predict on the training data
y_pred_train = mnb.predict(X_train)

# Evaluate accuracy on the training data
acc_train = accuracy_score(y_train, y_pred_train)

# Predict on the test data
y_pred_test = mnb.predict(X_test)

# Evaluate accuracy on the test data
acc_test = accuracy_score(y_test, y_pred_test)

# Print the evaluation results
print(f'Training accuracy: {acc_train}')
print(f'Test accuracy: {acc_test}')

Training accuracy: 0.9946152120260264
Test accuracy: 0.9856502242152466
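
Accuracy can hide errors on the smaller class when the classes are unevenly sized. As a hedged illustration, the sketch below uses hypothetical predictions (not the output of the model above) to show how precision, recall, and the confusion matrix expose a missed spam message that accuracy alone glosses over:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels: 1 = spam, 0 = ham; one of the two spam messages is missed
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)    # share of correct predictions
prec = precision_score(y_true, y_pred)  # of the predicted spam, how much is spam
rec = recall_score(y_true, y_pred)      # of the actual spam, how much was caught
cm = confusion_matrix(y_true, y_pred)   # rows = true class, columns = predicted class

print(f'accuracy={acc}, precision={prec}, recall={rec}')
print(cm)
```

Here accuracy is 0.9 even though only half of the spam is detected (recall 0.5).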


### Task 2

In [11]:
df2 = pd.read_csv('spam.csv', encoding='latin-1') 
df2.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [12]:
df2 = df2.drop(columns=df2.columns[2:])  # keep only v1 and v2
df2.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [13]:
new_cols = {
    'v1': 'Labels',
    'v2': 'SMS'
}
df2 = df2.rename(columns=new_cols)
df2.head()

Unnamed: 0,Labels,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [14]:
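# Use df2, the frame prepared for this task; its labels are still the strings 'ham'/'spam',
# which MultinomialNB accepts directly
X2 = df2['SMS'].values
y2 = df2['Labels'].values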

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Split the data into training and test sets
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.2, random_state=50)

# Initialize TfidfVectorizer
bow2 = TfidfVectorizer(stop_words='english')

# Fit TfidfVectorizer on X_train2 and transform it
X_train2 = bow2.fit_transform(X_train2)

# Transform X_test2
# Why only transform? For the same reason as in experiment 3:
# the vocabulary and IDF weights must be learned from X_train2 only, so that nothing about the test data leaks into the features
# and the test data stays unseen by the model
X_test2 = bow2.transform(X_test2)

print(len(bow2.get_feature_names_out()))
print(f'Data dimensions: {X_train2.shape}')
print(bow2.get_feature_names_out())

7502
Data dimensions: (4457, 7502)
['00' '000' '000pes' ... '½ûªt' '½ûªve' 'ì¼1']
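
The vocabulary is the same as with `CountVectorizer`; what changes is the value stored for each word. With scikit-learn's defaults (`smooth_idf=True`, `norm='l2'`), each count is multiplied by `idf(t) = ln((1 + n) / (1 + df(t))) + 1` and every row is scaled to unit length. A minimal sketch on a made-up corpus (not from `spam.csv`), where `prize` occurs in every document and therefore receives the lowest weight:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: 'prize' appears in all three documents
docs = ["winner prize", "lunch prize", "meeting prize"]

tfidf = TfidfVectorizer(stop_words='english')
weights = tfidf.fit_transform(docs)
vocab = tfidf.vocabulary_  # maps each word to its column index

# In the first document the rarer word outweighs the common one
w_winner = weights[0, vocab['winner']]
w_prize = weights[0, vocab['prize']]
row_norms = np.linalg.norm(weights.toarray(), axis=1)

print(w_winner, w_prize)
print(row_norms)
```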


In [18]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Initialize MultinomialNB
mnb = MultinomialNB()

# Fit the model
mnb.fit(X_train2, y_train2)

# Predict on the training data
y_pred_train2 = mnb.predict(X_train2)

# Evaluate accuracy on the training data
acc_train2 = accuracy_score(y_train2, y_pred_train2)

# Predict on the test data
y_pred_test2 = mnb.predict(X_test2)

# Evaluate accuracy on the test data
acc_test2 = accuracy_score(y_test2, y_pred_test2)

# Print the evaluation results
print(f'Training accuracy: {acc_train2}')
print(f'Test accuracy: {acc_test2}')

Training accuracy: 0.9845187345748261
Test accuracy: 0.9605381165919282


### Conclusion


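TF-IDF captures how important each word is, whereas `CountVectorizer` only counts occurrences. On this split of `spam.csv`, however, the `CountVectorizer` model reached the higher test accuracy (about 0.986 versus 0.961 for TF-IDF), so raw counts were the better feature in this case.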
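
One way to make the comparison repeatable is to run both vectorizers through the same split and the same classifier. The sketch below is an assumption rather than part of the original notebook: `compare_vectorizers` is a hypothetical helper, and the messages are made up to stand in for `spam.csv`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def compare_vectorizers(texts, labels, test_size=0.2, random_state=50):
    """Hypothetical helper: test accuracy of MultinomialNB for each vectorizer."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state)
    candidates = [
        ('count', CountVectorizer(stop_words='english')),
        ('tfidf', TfidfVectorizer(stop_words='english')),
    ]
    results = {}
    for name, vectorizer in candidates:
        # The pipeline fits the vectorizer on the training text only
        model = make_pipeline(vectorizer, MultinomialNB())
        model.fit(X_tr, y_tr)
        results[name] = accuracy_score(y_te, model.predict(X_te))
    return results


# Made-up messages: 1 = spam, 0 = ham
texts = ["win free prize", "free cash prize", "claim free cash", "win cash prize",
         "lunch meeting today", "project meeting moved", "dinner plans tonight",
         "school homework tonight"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

scores = compare_vectorizers(texts, labels, test_size=0.25)
print(scores)
```

With the real data, `texts` and `labels` would be the `SMS` and `Labels` columns. A single split can favor either feature by chance, so repeating the comparison over several values of `random_state` gives a steadier picture.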
