##  **Spam Detection System: Leveraging NLP for Enhanced Messaging Experience**

As a passionate individual with a strong interest in data science and Natural Language Processing (NLP), I embarked on an exciting project to develop a robust spam detection system. This project showcases my ability to harness cutting-edge NLP techniques and machine learning algorithms to address real-world challenges.

## Project Overview:

The aim of this project was to design an efficient spam detection system for mobile messaging. I leveraged the SMS Spam Collection Dataset from Kaggle and employed Python along with popular libraries like NLTK and scikit-learn. The ultimate goal was to create a model capable of accurately classifying incoming messages as either spam or ham (non-spam) in real-time.

## Key Highlights:

1. Data Preprocessing: I meticulously cleaned and processed the raw text data by converting it to lowercase, tokenizing, removing stopwords, and employing stemming. This critical step ensured that the model could effectively analyze and understand the messages.

2. Feature Extraction: Employing the powerful TF-IDF vectorization technique, I transformed the preprocessed text messages into numerical features, making them suitable for machine learning algorithms.

3. Machine Learning Model: I utilized the Naive Bayes classifier for its efficiency in handling text data. The model was trained on the processed data to predict whether a given message is spam or ham.

4. Model Evaluation: The trained model reached 96% accuracy on the held-out test set of 1,115 messages. I also analyzed the confusion matrix and classification report, which show 100% precision but only 72% recall for the spam class.

## Conclusion:

This project serves as a testament to my ability as an individual data scientist, showcasing my proficiency in NLP techniques, data preprocessing, and machine learning model development. By building this spam detection system, I have demonstrated my capability to create impactful solutions that can significantly improve customer experience and satisfaction by reducing the impact of spam messages.

I am excited to present this project as a representation of my skills and passion for data science. Leveraging NLP for spam detection, I aim to contribute my expertise to various industries, solving complex challenges and delivering valuable insights that drive positive outcomes.



## Contents:

1. Business problem.
2. Loading libraries.
3. Loading the Dataset.
4. Data Preprocessing and Feature Engineering.
   - 4.1. Remove unnecessary columns.
   - 4.2. Convert labels to binary (0 for 'ham', 1 for 'spam').
5. Text Preprocessing.
6. Split the Data into Training and Testing Sets.
7. Feature Extraction.
8. Train the Model (Naive Bayes Classifier).
9. Make Predictions.
10. Evaluate the Model and analyze the results.
11. Conclusions.

## 1- Business problem
The company, a mobile service provider, has noticed a significant increase in customer complaints related to spam messages. These spam messages not only annoy customers but also harm the overall customer experience and satisfaction.

The business problem is to develop an efficient spam detection system that can accurately identify spam messages and prevent them from reaching customers' inboxes. The company aims to reduce the number of spam messages received by customers, which will lead to improved customer satisfaction and loyalty.

To address this problem, the data scientist's task is to build a robust machine learning model using Natural Language Processing (NLP) techniques to classify incoming messages as either "spam" or "ham" (non-spam). The model should be able to process and analyze large volumes of messages in real-time to effectively filter out spam messages before they reach customers' devices.

The solution should integrate seamlessly with the existing messaging infrastructure of the mobile service provider, ensuring minimal disruption to the customers' messaging experience. By implementing an accurate and efficient spam detection system, the company aims to enhance customer satisfaction, reduce customer complaints, and strengthen its position in the highly competitive mobile service market.

The project uses the SMS Spam Collection Dataset, which is available at https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset.

## 2- Loading libraries

In [62]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [63]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 3- Loading the Dataset

In [64]:
data = pd.read_csv('spam.csv', encoding='ISO-8859-1')


## 4- Data Preprocessing
### 4.1- Remove unnecessary columns

In [65]:
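data = data[['v1', 'v2']].copy()  # keep only the label and message columns; copy so later column assignments do not raise SettingWithCopyWarning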
data.columns = ['label', 'text']

### 4.2- Convert labels to binary (0 for 'ham', 1 for 'spam')

In [66]:
data['label'] = data['label'].map({'ham': 0, 'spam': 1})


## 5- Text Preprocessing

In [67]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_text(text):
    words = nltk.word_tokenize(text.lower())  # Convert to lowercase and tokenize
    words = [word for word in words if word.isalpha()]  # Remove non-alphabetic characters
    words = [word for word in words if word not in stop_words]  # Remove stopwords
    words = [stemmer.stem(word) for word in words]  # Stemming
    return ' '.join(words)




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!




In [69]:
data['text'] = data['text'].apply(preprocess_text)
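
To make the effect of these steps concrete, here is a minimal, dependency-free sketch of the same idea. It is an approximation, not the notebook's `preprocess_text`: it uses a regular expression instead of `nltk.word_tokenize`, a tiny hand-picked stopword set (`TOY_STOPWORDS`, a hypothetical name) instead of NLTK's English list, and it skips stemming. The example message is made up. It illustrates one consequence of the `isalpha()` filter: tokens made of digits, such as prize amounts or short codes, are dropped entirely.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"a", "to", "the", "you", "have", "now", "or"}


def simple_preprocess(text):
    """Lowercase, tokenize, then keep only alphabetic, non-stopword tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # crude stand-in for word_tokenize
    tokens = [t for t in tokens if t.isalpha()]      # drops '1000' and '87121'
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    return " ".join(tokens)


message = "You have WON a 1000 prize! Text WIN to 87121 now"
print(simple_preprocess(message))  # won prize text win
```

Whether discarding numeric tokens helps or hurts is an empirical question; amounts and short codes may carry signal for spam, so this could be one place to experiment when trying to improve recall.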


## 6- Split the Data into Training and Testing Sets

In [70]:
X = data['text']
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
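
The split above is purely random. Because spam is the minority class, a stratified split is a common alternative: it samples each class at the same rate so the spam share is the same in both sets. With scikit-learn this would amount to passing `stratify=y` to `train_test_split`, which the notebook does not do. The sketch below shows the underlying idea in plain Python; `stratified_split` and the toy labels are hypothetical, for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with every class sampled at the same rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical): 90 ham (0) and 10 spam (1).
toy_labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(toy_labels)
spam_in_test = sum(toy_labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), spam_in_test)  # 80 20 2
```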


## 7- Feature Extraction

In [71]:
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
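
To show what the vectorizer computes, the following is a small pure-Python sketch of TF-IDF under the settings that `TfidfVectorizer` uses by default: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It is a simplification: it splits on whitespace (scikit-learn's default tokenizer also drops single-character tokens) and ignores `max_features`. The function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard empty documents
        rows.append([v / norm for v in raw])
    return vocab, rows


# Toy, already-preprocessed messages (hypothetical).
toy_docs = ["free prize call", "call later", "free free entri"]
vocab, rows = tfidf(toy_docs)
print(vocab)
for doc, row in zip(toy_docs, rows):
    print(doc, [round(v, 2) for v in row])
```

In the first toy message, `prize` appears in only one document and therefore receives a higher weight than `free` or `call`, which each appear in two.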

## 8- Train the Model (Naive Bayes Classifier)

In [72]:
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(X_train_tfidf, y_train)
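
For intuition about what the classifier learns, here is a minimal pure-Python sketch of multinomial Naive Bayes with Laplace smoothing (`alpha=1.0`, which is also the default in `MultinomialNB`). It differs from the notebook in one respect: it works on raw word counts, whereas the notebook feeds TF-IDF weights into the model. The helper names `train_nb` and `predict_nb` and the toy data are hypothetical.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Fit class log-priors and smoothed per-word log-likelihoods."""
    classes = sorted(set(labels))
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for d in class_docs for tok in d.split())
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {
            t: math.log((counts[t] + alpha) / total) for t in vocab
        }
    return log_prior, log_likelihood


def predict_nb(model, doc):
    """Return the class with the highest log-posterior; unseen words are skipped."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in doc.split() if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Toy training data (hypothetical): 1 = spam, 0 = ham.
toy_docs = ["free prize call", "win free cash", "see you later", "call me later"]
toy_labels = [1, 1, 0, 0]
model = train_nb(toy_docs, toy_labels)
print(predict_nb(model, "free cash prize"), predict_nb(model, "see you"))  # 1 0
```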

## 9- Make Predictions

In [73]:
y_pred = naive_bayes_classifier.predict(X_test_tfidf)

## 10- Evaluate the Model

In [74]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

confusion_mat = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(confusion_mat)

classification_rep = classification_report(y_test, y_pred)
print("Classification Report:")
print(classification_rep)

Accuracy: 0.96
Confusion Matrix:
[[965   0]
 [ 42 108]]
Classification Report:
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       965
           1       1.00      0.72      0.84       150

    accuracy                           0.96      1115
   macro avg       0.98      0.86      0.91      1115
weighted avg       0.96      0.96      0.96      1115



## Analysis of the evaluation results:

1. The model achieved an accuracy of 96% (1,073 of 1,115 test messages classified correctly). Because 965 of the 1,115 test messages (about 87%) are ham, accuracy alone overstates performance on the spam class.
2. The confusion matrix shows that all 965 ham messages were correctly classified as ham (True Negatives, taking spam as the positive class), with no False Positives. Of the 150 spam messages, 108 were correctly classified as spam (True Positives), while 42 were incorrectly classified as ham (False Negatives).
3. The model's precision for spam (class 1) is 100%: every message it flagged as spam in the test set was in fact spam. Its recall for spam is 72% (108 of 150), so about 28% of spam messages were missed.
4. The weighted average F1-score of 0.96 combines precision and recall but is dominated by the much larger ham class; the F1-score for the spam class alone is 0.84.
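
The figures in this analysis can be re-derived from the confusion matrix alone. The short check below takes the four counts printed above, using scikit-learn's convention that rows are true classes and columns are predicted classes, and recomputes the spam-class metrics. It also adds a majority-class baseline, the accuracy of always predicting ham, which is not in the original output.

```python
# Counts from the confusion matrix above: [[TN, FP], [FN, TP]] with spam as positive.
tn, fp = 965, 0
fn, tp = 42, 108

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)             # of messages flagged as spam, share that are spam
recall = tp / (tp + fn)                # of actual spam messages, share that were flagged
f1 = 2 * precision * recall / (precision + recall)
majority_baseline = (tn + fp) / total  # accuracy of always predicting ham

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.96 precision=1.00 recall=0.72 f1=0.84
print(f"majority-class baseline accuracy={majority_baseline:.2f}")
# majority-class baseline accuracy=0.87
```

Seen against the 0.87 baseline, the 0.96 accuracy is a real but smaller gain than the headline number suggests, which is why spam recall is the more informative metric here.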

## 11- Conclusions:

1. The spam detection model demonstrates high accuracy (96%) and perfect precision on the test set, while identifying 72% of the spam messages.
2. The model's recall for spam can be further improved to capture more spam messages and reduce the number of false negatives.
3. Continuous monitoring and updates to the model are essential to adapt to new spam patterns and maintain its effectiveness over time.
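
One possible way to raise spam recall, offered here as an untested suggestion rather than something the notebook does, is to flag a message as spam at a lower probability threshold than the default. With scikit-learn the probabilities would come from `naive_bayes_classifier.predict_proba`. The sketch below uses made-up probabilities and labels to show the trade-off; `precision_recall_at` is a hypothetical helper.

```python
def precision_recall_at(threshold, spam_probs, labels):
    """Spam-class precision and recall when flagging probabilities >= threshold."""
    preds = [1 if p >= threshold else 0 for p in spam_probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical spam probabilities and true labels, for illustration only.
toy_probs = [0.95, 0.80, 0.45, 0.35, 0.30, 0.10, 0.05, 0.02]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

for threshold in (0.5, 0.33, 0.2):
    p, r = precision_recall_at(threshold, toy_probs, toy_labels)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, lowering the threshold first recovers the missed spam at no cost and then starts flagging ham, so the threshold should be chosen on a validation set rather than on the test set.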


Overall, the spam detection system appears promising and can be integrated into the mobile service provider's messaging infrastructure to enhance customer experience by reducing the impact of spam messages and improving overall customer satisfaction. However, ongoing optimization and evaluation of the model are necessary to ensure its efficiency in combating new and evolving spam tactics.