## Spam or Ham Classification

This is a basic project that classifies the text of an email as either spam or ham (non-spam).

In [3]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

## The Start
We import the dataset from a *.csv* file with **pandas** and split it into training and test sets with **sklearn**, holding out 30% of the rows for testing.
The dataset is available at [Kaggle - Spam Mails Dataset](https://www.kaggle.com/datasets/venky73/spam-mails-dataset).

In [4]:
dataset = pd.read_csv("dataset.csv")
x_train, x_test, y_train, y_test =  train_test_split(dataset["text"], dataset["label_num"], test_size=.3, random_state=1)

## Preparing the Data
Before training the model, we need to turn the raw text into numeric inputs for our classifier. To do this we vectorize the training and test data: the vectorizer builds a vocabulary of the words that appear in the training data and then counts how many times each word occurs in each email. We use the ***CountVectorizer*** class from the **sklearn** library. The vocabulary is fitted on the training data only, and the test data is transformed with that same vocabulary.


In [10]:
count_vectorizer = CountVectorizer()
train_data = count_vectorizer.fit_transform(x_train)
test_data = count_vectorizer.transform(x_test)
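
To make the vectorization step concrete, here is a minimal sketch on a toy corpus. The three sentences and the `toy_*` names are made up for illustration and do not come from the Kaggle dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the real dataset.
toy_emails = [
    "win a free prize now",
    "meeting at noon",
    "free free prize",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_emails)

# vocabulary_ maps each word to its column index; sort the words by that index.
vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocabulary)

# Dense view of the sparse matrix: one row per email, one column per word.
print(toy_counts.toarray())

# Words that were not seen during fit (here "lunch") are ignored by transform.
unseen_counts = toy_vectorizer.transform(["free lunch"]).toarray()
print(unseen_counts)
```

With the default settings the text is lowercased and single-character tokens such as "a" are dropped, so the vocabulary here has seven columns, and the third email has a count of 2 in the "free" column.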

## The Model
The model chosen is Naive Bayes, using the multinomial implementation (**MultinomialNB**) available in the sklearn library.

In [13]:
naive_bayes = MultinomialNB()
naive_bayes.fit(train_data, y_train)

MultinomialNB()
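
As an illustration of how a fitted model is used on new text, the sketch below trains the same kind of classifier on a tiny made-up corpus and classifies two unseen sentences. The texts and `toy_*` names are hypothetical, and the encoding 1 = spam, 0 = ham is assumed to match the dataset's `label_num` column.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data; assumed encoding: 1 = spam, 0 = ham.
toy_texts = [
    "win free prize money now",
    "free money offer click now",
    "project meeting schedule tomorrow",
    "please review the attached report",
]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# New emails must be transformed with the SAME fitted vectorizer.
new_emails = ["free prize money", "meeting report tomorrow"]
toy_predictions = toy_model.predict(toy_vectorizer.transform(new_emails))
print(toy_predictions)
```

The first sentence shares words only with the spam examples and the second only with the ham examples, so the predictions come out as spam and ham respectively.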

## Results
To evaluate our model, we use common classification metrics: accuracy, F1, precision, and recall, all computed on the held-out test set. More robust evaluation schemes exist, such as cross-validation, but for such a simple problem they were not considered necessary.
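
For reference, cross-validation could be sketched roughly as follows. This is not part of the original notebook: the toy corpus, the `toy_*` names, the choice of 3 folds, and the use of `make_pipeline` are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; assumed encoding: 1 = spam, 0 = ham.
toy_texts = [
    "win free prize money now",
    "free money offer click now",
    "claim your free prize today",
    "cheap offer click to win money",
    "free free money prize",
    "click now to claim your offer",
    "project meeting schedule tomorrow",
    "please review the attached report",
    "meeting notes and project report",
    "schedule for the review tomorrow",
    "attached is the project schedule",
    "please send the meeting notes",
]
toy_labels = [1] * 6 + [0] * 6

# Wrapping both steps in a pipeline refits the vocabulary inside each fold,
# so no information from the validation fold leaks into training.
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# With an integer cv and a classifier, the folds are stratified by label.
toy_scores = cross_val_score(toy_pipeline, toy_texts, toy_labels, cv=3, scoring="f1")
print(toy_scores.mean())
```

On the real data, the same call would take `dataset["text"]` and `dataset["label_num"]` in place of the toy lists.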

In [14]:
predictions = naive_bayes.predict(test_data)

In [16]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# sklearn metrics expect the true labels first, then the predictions.
# Passing them in the opposite order swaps precision and recall.
print("Accuracy:" + format(accuracy_score(y_test, predictions)))
print("F1:" + format(f1_score(y_test, predictions)))
print("Precision:" + format(precision_score(y_test, predictions)))
print("Recall:" + format(recall_score(y_test, predictions)))

Accuracy:0.9819587628865979
F1:0.96875
Precision:0.9752808988764045
Recall:0.9623059866962306
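
The metric functions in `sklearn.metrics` take the true labels as the first argument and the predictions as the second. Accuracy and F1 are unaffected by the order in a binary problem, but precision and recall trade places. The small check below uses made-up labels to show this.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 2 true positives, 1 false negative, 2 false positives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]

# Correct order: (y_true, y_pred).
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 3

# Swapped order: each function now returns the other metric's value.
swapped_precision = precision_score(y_pred, y_true)
swapped_recall = recall_score(y_pred, y_true)

print(precision, recall)
print(swapped_precision, swapped_recall)
```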


## Conclusion

The evaluation results are satisfactory considering the size of the dataset: the model reaches about 98% accuracy and an F1 score of about 0.97 on the test set.