# MACHINE LEARNING INTERNSHIP @ CODE SOFT

## TASK 4 : Spam SMS Detection 

### The dataset is available on Kaggle:
### https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

### Context:

- Classifying emails into distinct labels can have a great impact on customer support. By using machine learning to label emails, the system can set up queues that each contain emails of a specific category. This enables support personnel to handle requests more quickly and easily by selecting a queue that matches their expertise.

### Objectives:

- Build an AI model that can classify SMS messages as spam or legitimate. 
- Use techniques like TF-IDF or word embeddings with classifiers like Naive Bayes, Logistic Regression, or Support Vector Machines to identify spam messages

### We will follow these steps to create the model for spam SMS detection:
- Step 1: Data Collection
- Step 2: Data Preprocessing
- Step 3: Feature Engineering (TF-IDF)
- Step 4: Model Selection and Training (Logistic Regression)
- Step 5: Model Evaluation
- Step 6: Testing on a random message

## Step 1: Data Collection

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('spam.csv')

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 606-607: invalid continuation byte

This error occurred because the read_csv() function in pandas assumes UTF-8 encoding by default, but CSV files sometimes contain bytes that are not valid UTF-8. When pandas encounters such bytes, it raises a UnicodeDecodeError because it cannot decode them using the specified encoding.
We can pass a different encoding, such as ISO-8859-1, to read the file.

In [3]:
df = pd.read_csv('spam.csv', encoding='ISO-8859-1')
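
The decoding failure can be reproduced without the CSV file. The sketch below uses a made-up byte string (an assumption for illustration, not a line from the dataset) containing the byte `0xA3`, which is the pound sign in ISO-8859-1 but is not valid UTF-8 on its own.

```python
# Hypothetical sample bytes: 0xA3 is the pound sign in ISO-8859-1,
# but a lone 0xA3 byte is not a valid UTF-8 sequence.
raw = b"Win \xa3100 cash now"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print("UTF-8 decoding failed:", err.reason)

# ISO-8859-1 maps every possible byte to a character, so decoding always succeeds.
text = raw.decode("ISO-8859-1")
print(text)
```

Because ISO-8859-1 never fails to decode, it can silently produce wrong characters if the file was actually written in some other encoding, so it is worth spot-checking a few rows after loading.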

In [4]:
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


## Step 2: Data Preprocessing

Our dataset has some extra columns (Unnamed: 2, Unnamed: 3, Unnamed: 4) that are empty (NaN). We will select only the necessary columns, i.e. v1 (the label) and v2 (the message text), to clean the dataset.

In [5]:
df_cleaned = df[['v1', 'v2']]

In [6]:
df_cleaned.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Step 3: Feature Engineering

We will be using the TF-IDF vectorizer from the scikit-learn library to convert the text data into numerical features.
TF-IDF quantifies the importance of a word in a document relative to a collection of documents. It takes into account the frequency of a word in the document (Term Frequency) and the inverse document frequency (IDF), which measures how rare a word is across documents.
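
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three made-up messages. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which to my understanding matches the defaults documented for scikit-learn's `TfidfVectorizer`. The whitespace tokenization is a simplification, so the numbers will not match the library exactly on real text.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the dataset
docs = [
    "free prize claim now",
    "see you at the meeting",
    "claim your free prize",
]

def tfidf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Raw term count times smoothed inverse document frequency
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        }
        # L2-normalize so every document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

rows = tfidf(docs)
for term, weight in sorted(rows[0].items()):
    print(f"{term}: {weight:.3f}")
```

In the first message, "now" appears in only one document, so it receives a higher weight than "free", "prize", and "claim", which each appear in two.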

In [7]:
#importing TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

In [8]:
# Initialize
tfidf_vectorizer = TfidfVectorizer()
# Fit and transform text-data
X = tfidf_vectorizer.fit_transform(df_cleaned['v2'])

In [9]:
# Display feature matrix shape
print("Shape of feature matrix:", X.shape)

Shape of feature matrix: (5572, 8672)


## Step 4: Model Selection and Training
This step involves choosing a machine learning model (classifier) and training it on the feature matrix (X) along with the corresponding labels (y).

I chose Logistic Regression for this task.
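
As a rough illustration of what the classifier does at prediction time, the sketch below combines TF-IDF feature values with per-word weights and passes the sum through a sigmoid to obtain a spam probability. The weights, bias, and feature values are invented for the example. A trained model learns its own values from the data.

```python
import math

def sigmoid(z):
    """Map any real number to the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned parameters: positive weights push toward "spam"
weights = {"free": 2.1, "prize": 1.7, "meeting": -1.9}
bias = -0.5

def spam_probability(features):
    """features: {term: tf-idf value} for a single message."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return sigmoid(z)

p_spam = spam_probability({"free": 0.6, "prize": 0.8})
p_ham_like = spam_probability({"meeting": 1.0})

# With a 0.5 threshold, the first message is labelled spam and the second ham
print(round(p_spam, 3), round(p_ham_like, 3))
```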

In [10]:
#importing various requirements for logistic regression and its evaluation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [11]:
# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, df_cleaned['v1'], test_size=0.2, random_state=42)

In [12]:
# Initialize 
model = LogisticRegression()
# Training the model on the training data
model.fit(X_train, y_train)

## Step 5: Model Evaluation

In [13]:
# Predicting on the test data
y_pred = model.predict(X_test)

In [14]:
# Calculating its accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)

Accuracy:  0.9623318385650225


In [15]:
# Importing the classification report and confusion matrix helpers
from sklearn.metrics import classification_report, confusion_matrix

In [16]:
# Printing classification report as well as confusion matrix
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       965
        spam       1.00      0.72      0.84       150

    accuracy                           0.96      1115
   macro avg       0.98      0.86      0.91      1115
weighted avg       0.96      0.96      0.96      1115

Confusion Matrix:
[[965   0]
 [ 42 108]]


Classification Report:
- Precision: The proportion of messages predicted as a given class that actually belong to it. For the "ham" class the precision is 0.96, indicating that 96% of the messages predicted as "ham" are indeed "ham". For the "spam" class the precision is 1.00, meaning all messages predicted as "spam" are actually "spam".
- Recall (sensitivity): The proportion of actual members of a class that the model correctly identifies. For the "ham" class the recall is 1.00, indicating that all actual "ham" messages are correctly identified. For the "spam" class the recall is 0.72, meaning that only 72% of the actual "spam" messages are caught.
- F1-score: The harmonic mean of precision and recall, which balances the two. For the "ham" class the F1-score is 0.98, and for the "spam" class it is 0.84.
- Support: The number of actual occurrences of each class in the test set (965 "ham" and 150 "spam").
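
The per-class figures in the report can be re-derived from the counts in the confusion matrix output. This is a small sketch with a hypothetical helper name (`precision_recall_f1`). The counts come from the matrix printed above, taking each class in turn as the "positive" one.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# "spam" as the positive class: 108 caught, 0 ham flagged as spam, 42 spam missed
spam = precision_recall_f1(tp=108, fp=0, fn=42)
# "ham" as the positive class: 965 correct, 42 spam predicted as ham, 0 ham missed
ham = precision_recall_f1(tp=965, fp=42, fn=0)

print([round(v, 2) for v in spam])  # [1.0, 0.72, 0.84]
print([round(v, 2) for v in ham])   # [0.96, 1.0, 0.98]
```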

Confusion Matrix:
The confusion matrix shows the counts of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions. Rows are the actual classes and columns are the predicted classes, in the order "ham", "spam". Treating "spam" as the positive class:
- TP: 108 spam messages were correctly classified as "spam".
- TN: 965 ham messages were correctly classified as "ham".
- FP: 0 ham messages were incorrectly classified as "spam".
- FN: 42 spam messages were incorrectly classified as "ham".
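
The four counts can also be tallied by hand from the actual and predicted labels. The sketch below treats "spam" as the positive class and uses short made-up label lists. The function name `confusion_counts` is hypothetical, not part of scikit-learn.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Tally TP, TN, FP, FN for a binary problem with the given positive label."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts

# Hypothetical labels for six messages
y_true = ["ham", "ham", "spam", "spam", "ham", "spam"]
y_pred = ["ham", "ham", "spam", "ham", "ham", "spam"]

counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 2, 'TN': 3, 'FP': 0, 'FN': 1}
```

With labels sorted as "ham", "spam", the matrix printed by `confusion_matrix` is laid out as `[[TN, FP], [FN, TP]]` under this convention.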

## Step 6: Testing on a random message

In [17]:
# Creating a function to test the model on a random message
def predict_spam_or_ham(message):
    message_cleaned = [message]
    # Transform using TF-IDF vectorization
    message_vectorized = tfidf_vectorizer.transform(message_cleaned)
    # Predict using model
    prediction = model.predict(message_vectorized)
    # Return the predicted class
    return prediction[0]

In [18]:
sample_message = "Hey there! Just wanted to remind you about our meeting tomorrow at 10 AM. See you then!"
# Predicting using function
predicted_class = predict_spam_or_ham(sample_message)
# Printing the predicted class
print("Predicted Class:", predicted_class)

Predicted Class: ham


In [19]:
sample_message2 = "Congratulations! You've been selected as the winner of our special prize. Click the link to claim your reward now!"
predicted_class = predict_spam_or_ham(sample_message2)
print("Predicted Class:", predicted_class)

Predicted Class: spam
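
The helper above returns only the label. Given the 0.72 recall on spam, it can be useful to also see how confident the model is. The following is a sketch, not the notebook's method: it bundles the vectorizer and classifier into a scikit-learn pipeline and reads `predict_proba`. The tiny training set and the function name `predict_with_confidence` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data, far too small for a real model
texts = [
    "Win a free prize now",
    "Claim your cash reward today",
    "Free entry to win cash",
    "Are we still meeting tomorrow",
    "See you at lunch",
    "Call me when you get home",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Vectorizer and classifier are fitted together, so raw text goes straight in
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

def predict_with_confidence(message):
    """Return the predicted label and the probability assigned to it."""
    probabilities = pipeline.predict_proba([message])[0]
    best = probabilities.argmax()
    return pipeline.classes_[best], float(probabilities[best])

label, confidence = predict_with_confidence("Claim your free prize now")
print(label, round(confidence, 3))
```

A pipeline also avoids having to remember to apply the same fitted vectorizer before every call to `predict`.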
