## **LAB ASSIGNMENT**

### **Faradisha Aldina Putri - 2141720159 - TI 3I**

#### **Tasks**
1. Create a classification model using SVM for the voice.csv data. 
2. Create a Multinomial Naive Bayes classification model with the following conditions:
    1. Use the spam.csv data.
    2. Utilize CountVectorizer with stop words enabled.
    3. Evaluate the results.
3. Create another Multinomial Naive Bayes classification model with the following conditions:
    1. Use the spam.csv data.
    2. Employ TF-IDF features with stop words enabled.
    3. Evaluate the results and compare them with the results from Task #2.
    4. Provide a conclusion on which feature extraction method is best for the spam.csv dataset.

---

**1. Create a classification model using SVM for the voice.csv data.**

> *Import Libraries*

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

> *Load & Read Data*

In [3]:
data = pd.read_csv('../dataset/voice.csv')

# Check the structure of the dataset
print(data.head())

# Separate features and target
X = data.iloc[:, :-1]  # Features
y = data.iloc[:, -1]   # Target

   meanfreq        sd    median       Q25       Q75       IQR       skew  \
0  0.059781  0.064241  0.032027  0.015071  0.090193  0.075122  12.863462   
1  0.066009  0.067310  0.040229  0.019414  0.092666  0.073252  22.423285   
2  0.077316  0.083829  0.036718  0.008701  0.131908  0.123207  30.757155   
3  0.151228  0.072111  0.158011  0.096582  0.207955  0.111374   1.232831   
4  0.135120  0.079146  0.124656  0.078720  0.206045  0.127325   1.101174   

          kurt    sp.ent       sfm  ...  centroid   meanfun    minfun  \
0   274.402906  0.893369  0.491918  ...  0.059781  0.084279  0.015702   
1   634.613855  0.892193  0.513724  ...  0.066009  0.107937  0.015826   
2  1024.927705  0.846389  0.478905  ...  0.077316  0.098706  0.015656   
3     4.177296  0.963322  0.727232  ...  0.151228  0.088965  0.017798   
4     4.333713  0.971955  0.783568  ...  0.135120  0.106398  0.016931   

     maxfun   meandom    mindom    maxdom   dfrange   modindx  label  
0  0.275862  0.007812  0.007812  

> *Encode the "label" column to numeric values (0: male, 1: female)*

In [4]:
data['label'] = data['label'].map({'male': 0, 'female': 1})
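
A detail about cell order: `y` was extracted from `data` before this mapping ran, so `y` still holds the strings `male` and `female`. That is consistent with the classification report further down, which lists the classes by name. The sketch below reproduces the effect on a tiny stand-in frame. The frame `demo` and its values are hypothetical and are not rows from voice.csv.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values, not rows from voice.csv)
demo = pd.DataFrame({"meanfreq": [0.06, 0.15, 0.13],
                     "label": ["male", "female", "male"]})

# Extract the target first, as the notebook does
y_before = demo.iloc[:, -1]

# Replace the column with its numeric encoding afterwards
demo["label"] = demo["label"].map({"male": 0, "female": 1})

print(y_before.tolist())       # the earlier extraction keeps the strings
print(demo["label"].tolist())  # only the frame's column is numeric now
```

To train on numeric labels instead, the mapping would have to run before `X` and `y` are separated.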

> *Standardize the features*

In [5]:
scaler = StandardScaler()
X = scaler.fit_transform(X)

> *Split the data into training and testing sets*

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

> *Create and Train an SVM Classifier*

In [7]:
svm_classifier = SVC(kernel='linear') 
svm_classifier.fit(X_train, y_train)
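
The cells above standardize the full dataset before splitting it, so the scaler's mean and variance also reflect the test rows. One possible alternative is sketched below: a pipeline that fits the scaler on the training split only. It assumes synthetic data from `make_classification` as a stand-in for voice.csv, and the names `X_demo`, `y_demo`, and `leak_free_svm` are illustrative. The accuracy it prints therefore says nothing about the voice data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for voice.csv: 20 numeric features, 2 classes (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=42)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits StandardScaler on the training rows only, then trains the SVM
leak_free_svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))
leak_free_svm.fit(X_tr, y_tr)

# score() applies the already-fitted scaler to the test rows before predicting
demo_accuracy = leak_free_svm.score(X_te, y_te)
print("Demo accuracy:", demo_accuracy)
```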

> *Make Predictions*

In [8]:
y_pred = svm_classifier.predict(X_test)

> *Evaluate the Model*

In [9]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

# Print classification report
print('Classification Report:')
print(classification_report(y_test, y_pred))

# Print confusion matrix
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.9763406940063092
Classification Report:
              precision    recall  f1-score   support

      female       0.96      0.99      0.98       297
        male       0.99      0.97      0.98       337

    accuracy                           0.98       634
   macro avg       0.98      0.98      0.98       634
weighted avg       0.98      0.98      0.98       634

Confusion Matrix:
[[293   4]
 [ 11 326]]


---

**2. Create a Multinomial Naive Bayes classification model with the following conditions:**
    
    1. Use the spam.csv data.
    2. Utilize CountVectorizer with stop words enabled.
    3. Evaluate the results.

> *Import Libraries*

In [11]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


> *Load & Read Data*

In [12]:
data = pd.read_csv("../dataset/spam.csv", encoding="latin-1")

> *Extract features (SMS text) and target labels (spam or ham)*

In [13]:
X = data["v2"]
y = data["v1"]

> *Drop & Rename Column*

In [19]:
df = data[['v1', 'v2']].copy()  # copy so later assignments do not write to a slice of data
df.columns = ['Labels', 'SMS']

> *Encode Labels to numerical values (ham: 0, spam: 1)*

In [16]:
df['Labels'] = df['Labels'].map({'spam': 1, 'ham': 0})


> *Split the data into training and testing sets (80% training, 20% testing)*

In [18]:
X = df['SMS']
y = df['Labels']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

> *Create a CountVectorizer with stop words enabled to convert text data into a numerical format.*

In [20]:
# Create a CountVectorizer with stop words enabled
vectorizer = CountVectorizer(stop_words="english")

# Fit and transform the training data
X_train_counts = vectorizer.fit_transform(X_train)

# Transform the test data
X_test_counts = vectorizer.transform(X_test)

> *Build a Multinomial Naive Bayes classifier and train it using the transformed training data*

In [21]:
# Create and train a Multinomial Naive Bayes classifier
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_counts, y_train)
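
For reference, the standard multinomial Naive Bayes model that `MultinomialNB` fits can be summarized as below. The notation is chosen for this note and does not come from the notebook: $x_w$ is the count of word $w$ in a message, $N_{cw}$ is the total count of $w$ over the training messages of class $c$, $N_c = \sum_w N_{cw}$, $|V|$ is the vocabulary size, and $\alpha$ is the smoothing parameter (`alpha`, which defaults to 1.0 in scikit-learn).

```latex
\hat{\theta}_{cw} = \frac{N_{cw} + \alpha}{N_c + \alpha\,|V|},
\qquad
\hat{y} = \arg\max_{c}\;\Bigl[\,\log P(c) + \sum_{w \in V} x_w \,\log \hat{\theta}_{cw}\Bigr]
```

With $\alpha > 0$, a word that never occurred in one class still receives a small non-zero probability, so a single unseen word cannot force a class probability to zero.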

> *Make predictions on the test data and evaluate the model's performance.*

In [23]:
# Predict on the test set
y_pred = naive_bayes.predict(X_test_counts)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

# Print classification report
print('Classification Report:')
print(classification_report(y_test, y_pred))

# Print confusion matrix
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.9838565022421525
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       965
           1       0.96      0.92      0.94       150

    accuracy                           0.98      1115
   macro avg       0.97      0.96      0.96      1115
weighted avg       0.98      0.98      0.98      1115

Confusion Matrix:
[[959   6]
 [ 12 138]]


---

**3. Create another Multinomial Naive Bayes classification model with the following conditions:**
    
    1. Use the spam.csv data.
    2. Employ TF-IDF features with stop words enabled.
    3. Evaluate the results and compare them with the results from Task #2.
    4. Provide a conclusion on which feature extraction method is best for the spam.csv dataset.

> *Import Libraries*

In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

> *Load and Read the Dataset*

In [45]:
df = pd.read_csv('../dataset/spam.csv', encoding='latin-1')

> *Drop & Rename Columns*

In [46]:
df = df[['v1', 'v2']].copy()  # copy to avoid a SettingWithCopyWarning when encoding labels
df.columns = ['Labels', 'SMS']

> *Encode Labels*

In [47]:
# Map the string labels to integers (ham: 0, spam: 1)
df['Labels'] = df['Labels'].map({'spam': 1, 'ham': 0})

> *Split the data into training and testing sets*

In [48]:
X = df['SMS']
y = df['Labels']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

> *Create a TfidfVectorizer with stop words enabled to convert text data into TF-IDF features.*

In [49]:
# Create a TfidfVectorizer with stop words enabled
tfidf_vectorizer = TfidfVectorizer(stop_words="english")

# Fit and transform the training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data
X_test_tfidf = tfidf_vectorizer.transform(X_test)


> *Build a Multinomial Naive Bayes classifier and train it using the transformed TF-IDF training data.*

In [50]:
# Create and train a Multinomial Naive Bayes classifier
nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_train_tfidf, y_train)

> *Make predictions on the test data and evaluate the TF-IDF model*

In [51]:
# Make predictions on the test data using TF-IDF features
y_pred_tfidf = nb_tfidf.predict(X_test_tfidf)

# Evaluate the TF-IDF model
accuracy_tfidf = accuracy_score(y_test, y_pred_tfidf)
print('Accuracy using TF-IDF:', accuracy_tfidf)
confusion_tfidf = confusion_matrix(y_test, y_pred_tfidf)
print('Confusion Matrix using TF-IDF:')
print(confusion_tfidf)
report_tfidf = classification_report(y_test, y_pred_tfidf)
print('Classification Report using TF-IDF:')
print(report_tfidf)

Accuracy using TF-IDF: 0.9668161434977578
Confusion Matrix using TF-IDF:
[[965   0]
 [ 37 113]]
Classification Report using TF-IDF:
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       965
           1       1.00      0.75      0.86       150

    accuracy                           0.97      1115
   macro avg       0.98      0.88      0.92      1115
weighted avg       0.97      0.97      0.96      1115
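
The two spam models can be compared directly from their confusion matrices. The sketch below copies the matrices from the outputs of Task #2 and Task #3 (rows are the true classes ham and spam, columns the predicted classes) and recomputes accuracy plus the spam-class precision, recall, and F1-score. The helper `spam_metrics` is an illustrative name, not part of scikit-learn.

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows: true ham, true spam)
cm_counts = np.array([[959, 6], [12, 138]])
cm_tfidf = np.array([[965, 0], [37, 113]])

def spam_metrics(cm):
    """Return accuracy plus precision, recall, and F1 for the spam class (label 1)."""
    tn, fp = cm[0]
    fn, tp = cm[1]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / cm.sum()
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

for name, cm in [("CountVectorizer", cm_counts), ("TF-IDF", cm_tfidf)]:
    rounded = {k: round(float(v), 4) for k, v in spam_metrics(cm).items()}
    print(name, rounded)
```

The recomputed values match the reports: TF-IDF never flags ham as spam (precision 1.00) but catches only 113 of 150 spam messages, while CountVectorizer catches 138.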



---

**Conclusion** :

Comparing the two runs on the same test split, CountVectorizer with stop words enabled performs better overall than TF-IDF with stop words for this dataset. It reaches higher accuracy (0.984 vs. 0.967) and, for the "spam" class, higher recall (0.92 vs. 0.75) and F1-score (0.94 vs. 0.86). TF-IDF has the higher spam precision (1.00 vs. 0.96): it flags no ham message as spam, but it misses 37 of the 150 spam messages, whereas CountVectorizer misses 12.

This result is specific to spam.csv and Multinomial Naive Bayes. Other datasets or models may rank the two feature extraction methods differently, and the preferred method also depends on which error matters more (missed spam or ham wrongly flagged), so it is worth testing both before settling on one.
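
One way to run such an experiment is to put each vectorizer in a pipeline with the classifier and score both with cross-validation. The sketch below is only an outline under stated assumptions: the eight messages and their labels are invented so that the block runs on its own, and `candidates` and `mean_recall` are illustrative names. For a real comparison, the toy lists would be replaced by `df['SMS']` and `df['Labels']` and `cv` would be raised.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (1 = spam, 0 = ham); not taken from spam.csv
texts = [
    "win a free prize now", "free cash prize claim today",
    "urgent claim your free ticket", "cash prize winner call now",
    "lunch at noon tomorrow", "see you at the meeting",
    "running late for dinner", "can you send the report",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

candidates = {
    "CountVectorizer": CountVectorizer(stop_words="english"),
    "TF-IDF": TfidfVectorizer(stop_words="english"),
}

mean_recall = {}
for name, vectorizer in candidates.items():
    # The pipeline refits the vectorizer inside every fold, on training texts only
    model = make_pipeline(vectorizer, MultinomialNB())
    scores = cross_val_score(model, texts, labels, cv=2, scoring="recall")
    mean_recall[name] = scores.mean()
    print(name, scores)
```

Scoring on spam recall is a choice made for this sketch. Precision or F1 can be selected through the same `scoring` argument if false alarms matter more.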