In this project, we aimed to build a spam detection system using natural language processing techniques. We started by loading a dataset containing SMS messages labeled as spam or ham (not spam). The dataset was preprocessed to remove unnecessary characters, convert text to lowercase, tokenize the text, and remove stopwords.

Next, we applied TF-IDF vectorization to convert the tokenized text into numerical features. TF-IDF assigns weights to words based on their frequency in the document and across the dataset. This allowed us to represent the text data as a matrix of numerical values.

We then split the dataset into training and testing sets for model evaluation. We chose to use the Support Vector Machine (SVM) algorithm for classification. The SVM model was trained on the TF-IDF features of the training set.

After training the SVM model, we evaluated its performance using various evaluation metrics such as accuracy, precision, recall, F1-score, and ROC AUC. These metrics provide insights into how well the model performs in distinguishing between spam and ham messages.

The evaluation results showed that the SVM model achieved high accuracy (about 0.97) and high precision for the spam class (about 0.98), but a noticeably lower recall (about 0.77), giving an F1-score of about 0.86. The ROC AUC score of about 0.88 summarizes the model's ability to discriminate between the positive (spam) and negative (ham) classes.

To test the model, we provided an example text message, preprocessed it using the same steps as the training data, applied TF-IDF vectorization, and made a prediction using the SVM model. This allowed us to classify the example text as either spam or ham.

Overall, this project demonstrated the process of building a spam detection system using NLP techniques, including preprocessing, feature extraction using TF-IDF, model training with SVM, evaluation of model performance, and testing with example data.

https://www.kaggle.com/uciml/sms-spam-collection-dataset

The dataset used in the spam detection code is called the "SMS Spam Collection Dataset." It is a collection of SMS messages labeled as spam or ham (not spam). The dataset contains a total of 5,572 messages.




In [13]:
import pandas as pd

# Load the CSV file into a DataFrame using the 'latin-1' encoding
# (csv_path is assumed to hold the path to the downloaded dataset file)
data = pd.read_csv(csv_path, encoding='latin-1')

# Explore the loaded data
print(data.head())




     v1                                                 v2 Unnamed: 2  \
0   ham  Go until jurong point, crazy.. Available only ...        NaN   
1   ham                      Ok lar... Joking wif u oni...        NaN   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3   ham  U dun say so early hor... U c already then say...        NaN   
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  


In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [15]:
# Drop the columns with mostly null values
data = data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)

# Explore the loaded data
print(data.head())

     v1                                                 v2
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [16]:
data = data.rename(columns={"v1": "Label", "v2": "Text"})


In [18]:
data['Label'].unique()

array(['ham', 'spam'], dtype=object)

## Data preprocessing:
Preprocess the text data by converting it to lowercase, removing non-alphanumeric characters, tokenizing it, and removing stopwords.

In [21]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the stopword list and the punkt tokenizer data (needed once)
nltk.download('stopwords')
nltk.download('punkt')

# Remove unnecessary characters and convert to lowercase
data['Text'] = data['Text'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x.lower()))

# Tokenize the text
data['Tokenized_Text'] = data['Text'].apply(lambda x: word_tokenize(x))

# Remove stopwords
stop_words = set(stopwords.words('english'))
data['Tokenized_Text'] = data['Tokenized_Text'].apply(lambda x: [word for word in x if word not in stop_words])

# Lowercase normalization (redundant here: the text was already lowercased above)
data['Tokenized_Text'] = data['Tokenized_Text'].apply(lambda x: [word.lower() for word in x])


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
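
The order of the cleaning steps matters: punctuation is stripped before the stopword filter runs, so a contraction such as "don't" becomes "dont" and is only removed if the stopword list contains that apostrophe-free form. The sketch below illustrates this with the standard library only. It is an approximation of the cell above, not a reproduction: it uses `str.split` instead of `word_tokenize`, and `TINY_STOPWORDS` and `clean_and_filter` are hypothetical names introduced for the example.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption: the
# real list is much longer; this one only serves the illustration).
TINY_STOPWORDS = {"i", "you", "to", "a", "the", "don't", "have"}

def clean_and_filter(text, stop_words=TINY_STOPWORDS):
    """Lowercase, strip non-alphanumeric characters, split, drop stopwords."""
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", "", text.lower())
    tokens = cleaned.split()  # whitespace split instead of word_tokenize
    return [tok for tok in tokens if tok not in stop_words]

tokens = clean_and_filter("I don't have the FREE prize!!")
print(tokens)  # ['dont', 'free', 'prize']
```

If contractions should be filtered, one option is to remove stopwords before stripping punctuation, or to add the apostrophe-free forms to the stopword set.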


### Feature Extraction
We first convert the tokenized text back into strings by joining the tokens with a space separator. We then initialize a TF-IDF vectorizer and apply it to the processed text using the fit_transform() method, which returns the feature matrix X holding the TF-IDF value of each term in each document.

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert tokenized text to strings
data['processed_text'] = data['Tokenized_Text'].apply(lambda x: ' '.join(x))

# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Apply TF-IDF vectorization on the processed text
X = tfidf_vectorizer.fit_transform(data['processed_text'])

# Print the shape of the feature matrix
print("Shape of X:", X.shape)


Shape of X: (5572, 9314)


This means that we have 5572 documents (text messages) and 9314 unique features (words) represented by their TF-IDF values. Each element in the matrix corresponds to the TF-IDF value of a specific word in a specific document.
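
To make the row/column structure concrete, here is a small sketch on a toy corpus. The three messages and the names `toy_corpus`, `toy_vectorizer`, and `toy_X` are assumptions made for illustration; they stand in for `data['processed_text']` and the fitted vectorizer above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumption: stands in for data['processed_text'])
toy_corpus = [
    "free entry win prize",
    "ok see later",
    "win free cash prize now",
]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_corpus)

# Rows are documents, columns are vocabulary terms
n_docs, n_terms = toy_X.shape
print("Shape:", toy_X.shape)

# vocabulary_ maps each term to its column index
print("Column of 'prize':", toy_vectorizer.vocabulary_["prize"])

# Most entries are zero, which is why the matrix is stored in sparse form
density = toy_X.nnz / (n_docs * n_terms)
print("Fraction of non-zero entries:", round(density, 3))
```

The same attributes (`shape`, `nnz`, `vocabulary_`) can be inspected on the full matrix to see which column belongs to which word.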

### Splitting Text

In [25]:
from sklearn.model_selection import train_test_split

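# Split the data into features (X) and the target variable (y)
# Note: X here is the cleaned text column, not the TF-IDF matrix computed above;
# the split is repeated on the TF-IDF features in the modeling cell below
X = data['Text']
y = data['Label']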

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the shapes of the resulting sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (4457,)
X_test shape: (1115,)
y_train shape: (4457,)
y_test shape: (1115,)
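
Because spam is the minority class, a random split can shift the ham/spam ratio slightly between the training and test sets. An optional variation, not what the notebook does, is to pass `stratify=` to `train_test_split`. The sketch below uses made-up labels (`toy_texts`, `toy_labels`) purely to show the effect.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels with an 80/20 class imbalance (assumption: illustrative only)
toy_texts = [f"message {i}" for i in range(100)]
toy_labels = ["ham"] * 80 + ["spam"] * 20

# stratify= keeps the ham/spam ratio the same in both subsets
txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print("Train:", Counter(lab_train))
print("Test:", Counter(lab_test))
```

With stratification, both subsets keep the 80/20 proportion of the toy data, which makes per-class metrics such as recall less sensitive to the particular random seed.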


## SVM for modeling and evaluation
A Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression tasks. It is a popular choice for high-dimensional data such as text features. The main idea behind SVM is to find an optimal hyperplane that separates the different classes in the data space.

<img src = "https://www.researchgate.net/profile/Ismail-Calikusu/publication/338698374/figure/fig3/AS:849434183233538@1579532299528/Optimal-Hyperplane-and-Margin-of-SVM.png">

In binary classification, the SVM chooses the hyperplane that maximizes the margin, which is the distance between the hyperplane and the nearest data points of each class. These nearest data points are called support vectors.
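
For reference, the textbook hard-margin formulation of this idea can be written as follows, for training points $x_i$ with labels $y_i \in \{-1, +1\}$. The symbols $w$ and $b$ (normal vector and offset of the hyperplane) are notation introduced here, not taken from the notebook. Note also that `SVC()` with default settings uses a soft-margin variant with an RBF kernel, so this is the idealized linear case rather than the exact problem solved below.

$$
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 \quad \text{for all } i
$$

The margin width is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin. The support vectors are the points for which the constraint holds with equality.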

### TF-IDF vectorization
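Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that measures how important a term (word) is to a document within a collection or corpus: the weight grows with the term's frequency in the document and shrinks as the term appears in more documents.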

In the context of text classification, TF-IDF vectorization converts a collection of text documents into a matrix representation where each row represents a document and each column represents a unique term in the corpus. The values in the matrix correspond to the TF-IDF score of each term in each document.

By applying TF-IDF vectorization, we transform the textual data into numerical features that can be used as input for machine learning algorithms, such as SVM, to train and classify text data.
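
The weighting can be worked through by hand on a tiny example. The sketch below is a simplified, standard-library version that follows the formula scikit-learn documents for its default settings (smoothed IDF, then L2 normalization of each row). The toy documents and the helper `tfidf_row` are assumptions for illustration; tokenization details of `TfidfVectorizer` are ignored.

```python
import math
from collections import Counter

# Three already-tokenized toy documents (assumption: illustrative only)
docs = [
    ["free", "prize", "win"],
    ["free", "call", "now"],
    ["see", "you", "later"],
]

n_docs = len(docs)
# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf_row(doc):
    """TF-IDF weights for one tokenized document, L2-normalized."""
    tf = Counter(doc)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf_row(docs[0])
print(weights)
```

In the first document, "free" also occurs in another message, so it receives a lower weight than "prize" and "win", which occur in only one.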

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Apply TF-IDF vectorization on the processed text
tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(data['processed_text'])
y = data['Label']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the SVM classifier
svm = SVC()
svm.fit(X_train, y_train)

# Make predictions on the test set
y_pred = svm.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.9668161434977578
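
In the cell above, the TF-IDF vectorizer is fitted on all messages before the split, so the IDF weights are influenced by the test messages. A common alternative is to split the text first and let a pipeline fit the vectorizer on the training portion only. The sketch below shows that pattern on a handful of invented messages. It is an assumption-laden illustration (`toy_texts`, `toy_labels`, and `toy_model` are hypothetical), not a rerun of the experiment, and it says nothing about how the reported scores would change.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy messages (assumption: stand-ins for data['processed_text'] / data['Label'])
toy_texts = [
    "win free prize now", "free cash claim prize", "urgent win cash now",
    "claim free entry today", "see you later", "ok call me tonight",
    "lunch at noon tomorrow", "thanks see you soon", "meeting moved to monday",
    "free prize waiting claim now",
]
toy_labels = ["spam", "spam", "spam", "spam", "ham",
              "ham", "ham", "ham", "ham", "spam"]

# Split the raw text first ...
txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels, test_size=0.3, random_state=42, stratify=toy_labels
)

# ... then let the pipeline fit TF-IDF on the training messages only
toy_model = make_pipeline(TfidfVectorizer(), SVC())
toy_model.fit(txt_train, lab_train)

toy_pred = toy_model.predict(txt_test)
print(list(zip(txt_test, toy_pred)))
```

A pipeline also simplifies prediction on new messages, because the same fitted vectorizer is applied automatically before the classifier.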


In [30]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.preprocessing import LabelEncoder

# Make predictions on the test set
y_pred = svm.predict(X_test)

# Compute evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label='spam')
recall = recall_score(y_test, y_pred, pos_label='spam')
f1 = f1_score(y_test, y_pred, pos_label='spam')

# Convert labels to numeric format
label_encoder = LabelEncoder()
y_test_encoded = label_encoder.fit_transform(y_test)
y_pred_encoded = label_encoder.transform(y_pred)

# Compute ROC AUC score
roc_auc = roc_auc_score(y_test_encoded, y_pred_encoded)


# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("ROC AUC:", roc_auc)

# Compute the confusion matrix
confusion = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(confusion)

# Classification Report
report = classification_report(y_test, y_pred)
print("Classification Report:")
print(report)


Accuracy: 0.9668161434977578
Precision: 0.9829059829059829
Recall: 0.7666666666666667
F1-Score: 0.8614232209737828
ROC AUC: 0.8822970639032814
Confusion Matrix:
[[963   2]
 [ 35 115]]
Classification Report:
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       965
        spam       0.98      0.77      0.86       150

    accuracy                           0.97      1115
   macro avg       0.97      0.88      0.92      1115
weighted avg       0.97      0.97      0.97      1115



Accuracy: The accuracy of the model is 0.9668, which indicates that it correctly classifies 96.68% of the instances in the test set.

Precision: Precision is the fraction of messages predicted as spam that are actually spam. A precision score of 0.9829 means that when the model predicts a message as spam, it is correct 98.29% of the time.

Recall: Recall, also known as sensitivity or true positive rate, measures the model's ability to correctly detect spam messages. A recall score of 0.7667 means that the model identifies 76.67% of the actual spam messages.

F1-Score: The F1-score is the harmonic mean of precision and recall, so it is pulled toward the lower of the two. The F1-score of 0.8614 reflects the combination of very high precision and noticeably lower recall.

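ROC AUC: The ROC AUC (area under the Receiver Operating Characteristic curve) score measures the model's ability to discriminate between classes. Here it was computed from the predicted labels rather than from continuous decision scores, so the value of 0.8823 is the average of the spam recall (0.7667) and the ham recall (0.9979), which is the balanced accuracy. It is lower than the plain accuracy mainly because of the missed spam messages.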

Confusion Matrix: The confusion matrix shows the number of true positive (spam correctly classified), true negative (ham correctly classified), false positive (ham misclassified as spam), and false negative (spam misclassified as ham) predictions. In this case, the model correctly classified 963 ham messages and 115 spam messages, but misclassified 35 spam messages as ham and 2 ham messages as spam.

Classification Report: The classification report provides a summary of precision, recall, and F1-score for both classes (ham and spam), along with the support (number of instances) for each class. It gives a comprehensive view of the model's performance on a class-by-class basis.
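
The headline metrics can be recomputed directly from the four cells of the confusion matrix, which is a useful sanity check. The sketch below uses only the counts reported above, with spam treated as the positive class; the variable names are chosen for this example.

```python
# Confusion matrix reported above: rows are true labels, columns are
# predictions, in the order [ham, spam]
tn, fp = 963, 2     # true ham: predicted ham, predicted spam
fn, tp = 35, 115    # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # 0.9668
print(f"Precision: {precision:.4f}")  # 0.9829
print(f"Recall:    {recall:.4f}")     # 0.7667
print(f"F1-score:  {f1:.4f}")         # 0.8614
```

The 35 false negatives are what drive recall down, while the 2 false positives barely affect precision.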

Overall, the SVM model shows promising performance with high accuracy and precision. However, it has a relatively lower recall, indicating that there is room for improvement in detecting more spam messages correctly.
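
One caveat on the ROC AUC figure: `roc_auc_score` was called with hard 0/1 predictions, in which case it reduces to the balanced accuracy at a single threshold. To measure ranking quality across all thresholds, it is usually given continuous scores instead. The sketch below demonstrates the difference on invented numbers; `y_true` and `scores` are assumptions standing in for the encoded test labels and the classifier's decision scores.

```python
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Toy ground truth (1 = spam) and continuous scores (assumption: stand-ins
# for y_test_encoded and svm.decision_function(X_test))
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [-2.0, -1.5, -1.0, -0.2, -0.5, 0.3, 1.0, 2.0]

# Hard labels obtained by thresholding the scores at zero
y_hard = [1 if s > 0 else 0 for s in scores]

auc_from_labels = roc_auc_score(y_true, y_hard)
auc_from_scores = roc_auc_score(y_true, scores)

print("AUC from hard labels:", auc_from_labels)   # 0.875
print("Balanced accuracy:   ", balanced_accuracy_score(y_true, y_hard))  # 0.875
print("AUC from scores:     ", auc_from_scores)   # 0.9375
```

In the notebook, the analogous call would presumably be `roc_auc_score(y_test_encoded, svm.decision_function(X_test))`, since for a binary `SVC` positive decision values correspond to the second class in `svm.classes_` (here "spam"). The resulting value is not shown here because it was not computed in the original run.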

### Test the SVM model with an example

In [32]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Function to preprocess the text
def preprocess_text(text):
    # Remove unnecessary characters and convert to lowercase
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]

    # Join the tokens back into a string
    processed_text = ' '.join(tokens)

    return processed_text

# Example text message
example_text = "Congratulations! You have won a free vacation. Reply now to claim your prize."

# Preprocess the example text
example_text = preprocess_text(example_text)

# Apply TF-IDF vectorization on the example text
example_tfidf = tfidf_vectorizer.transform([example_text])

# Make a prediction using the SVM model
prediction = svm.predict(example_tfidf)

# Print the predicted label
print("Predicted Label:", prediction[0])


Predicted Label: spam
