# PROBLEM STATEMENT

### SMS Classifier: Develop a text classification model that classifies SMS messages as spam or non-spam using data science techniques in Python

Import Libraries:
Start by importing the necessary libraries.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


Load the Data:
Read the dataset into a pandas DataFrame and preview the first rows.

In [2]:
# Assuming you've downloaded the dataset as 'spam.csv'
df = pd.read_csv('spam.csv', encoding='latin-1')

# Display the first few rows of the dataset
print(df.head())


     v1                                                 v2 Unnamed: 2  \
0   ham  Go until jurong point, crazy.. Available only ...        NaN   
1   ham                      Ok lar... Joking wif u oni...        NaN   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3   ham  U dun say so early hor... U c already then say...        NaN   
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  


Data Preprocessing:
Clean and preprocess the data: keep only the label and message columns (the three Unnamed columns are NaN in the preview above), give them descriptive names, and convert the labels to numeric form.

In [3]:
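# 'v1' holds the labels and 'v2' holds the SMS messages; drop the other columns.
# .copy() makes an independent DataFrame, which avoids pandas' SettingWithCopyWarning below.
df = df[['v1', 'v2']].copy()
df.columns = ['label', 'text']
# Encode the labels numerically: ham -> 0, spam -> 1
df['label'] = df['label'].map({'ham': 0, 'spam': 1})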


Split the Data:
Split the dataset into training and testing sets, holding out 20% of the messages for testing. Fixing random_state makes the split reproducible.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)
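The split above is purely random, so the share of spam in each subset can drift slightly from the share in the full dataset. As an optional variation that the notebook does not use, the following minimal sketch shows the stratify argument of train_test_split on toy labels. The names toy_texts and toy_labels and the 80/20 class mix are invented for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 80 ham (0) and 20 spam (1) messages
toy_labels = [0] * 80 + [1] * 20
toy_texts = [f"message {i}" for i in range(100)]

# stratify=toy_labels keeps the 80/20 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print("train:", Counter(y_tr))
print("test: ", Counter(y_te))
```

Applying the same argument to the real data would change which messages land in the test set, so the metrics reported later would no longer match exactly.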


Feature Extraction:
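Convert the text into numerical features with a bag-of-words count vectorizer. The vocabulary is learned from the training set only (fit_transform) and then reused, unchanged, to transform the test set (transform).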

In [5]:
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)


Build and Train the Model:
Train a classification model on the training set. Here we use multinomial Naive Bayes, which works directly on word-count features.

In [6]:
model = MultinomialNB()
model.fit(X_train_vec, y_train)


MultinomialNB()

Evaluate the Model:
Evaluate the model on the test set: report overall accuracy, per-class precision and recall, and the confusion matrix.

In [7]:
y_pred = model.predict(X_test_vec)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))


Accuracy: 0.9838565022421525

Classification Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99       965
           1       0.99      0.89      0.94       150

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115


Confusion Matrix:
 [[963   2]
 [ 16 134]]
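The figures in the classification report can be re-derived by hand from the confusion matrix above. The sketch below does that arithmetic in plain Python, following scikit-learn's convention that rows are true labels and columns are predicted labels. The variable names are chosen here for illustration.

```python
# Confusion matrix reported above (label 0 = ham, label 1 = spam):
# rows are true labels, columns are predicted labels
tn, fp = 963, 2    # true ham:  predicted ham, predicted spam
fn, tp = 16, 134   # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
spam_precision = tp / (tp + fp)  # of messages flagged as spam, how many were spam
spam_recall = tp / (tp + fn)     # of actual spam messages, how many were caught
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"accuracy:       {accuracy:.4f}")
print(f"spam precision: {spam_precision:.4f}")
print(f"spam recall:    {spam_recall:.4f}")
print(f"spam F1:        {spam_f1:.4f}")
```

Read this way, the model wrongly flags only 2 of 965 ham messages but misses 16 of 150 spam messages, which is why spam recall (0.89) is the weakest number in the report.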


Prediction:
Use the trained model to classify new SMS messages. New messages must be transformed with the same fitted vectorizer before being passed to the model.

In [8]:
new_sms = ["Free gift! Click now!", "Hello, how are you?"]
new_sms_vec = vectorizer.transform(new_sms)
predictions = model.predict(new_sms_vec)

for sms, prediction in zip(new_sms, predictions):
    print(f"{sms} - {'Spam' if prediction == 1 else 'Non-Spam'}")


Free gift! Click now! - Spam
Hello, how are you? - Non-Spam
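One way to avoid forgetting the vectorization step at prediction time is to bundle the vectorizer and the classifier into a single scikit-learn pipeline. The notebook does not do this. The following is a minimal sketch trained on a tiny toy corpus invented for illustration, so its probabilities say nothing about the real model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration (0 = ham, 1 = spam)
toy_texts = [
    "win a free prize now",
    "free entry claim your gift",
    "are we still meeting for lunch",
    "see you at home tonight",
]
toy_labels = [1, 1, 0, 0]

# The pipeline vectorizes raw text and then classifies it in a single call
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_texts, toy_labels)

new_sms = ["Free gift! Click now!", "Hello, how are you?"]
# Columns of predict_proba follow the sorted class labels: [P(ham), P(spam)]
probabilities = toy_pipeline.predict_proba(new_sms)

for sms, (p_ham, p_spam) in zip(new_sms, probabilities):
    print(f"{sms} - P(spam) = {p_spam:.2f}")
```

Using predict_proba instead of predict also exposes how confident the model is, which can help when choosing a stricter or looser spam threshold.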
