
#Explanation:

We import the pandas library using import pandas as pd.

We use pd.read_csv() to read the CSV file containing the dataset. The encoding='latin-1' argument is needed because the file contains special characters that pandas' default UTF-8 decoding cannot read.

We select only the relevant columns ('v1' for labels, 'v2' for the message text) using data[['v1', 'v2']].

In [1]:
import pandas as pd

# Load the dataset
data = pd.read_csv('/kaggle/input/sms-spam-collection-dataset/spam.csv', encoding='latin-1')
data = data[['v1', 'v2']]  # Selecting only the relevant columns
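
To see why the encoding argument matters, here is a minimal sketch that uses only the standard library. The byte string is a made-up sample (it is not taken from the dataset) containing the byte 0xA3, which is the pound sign in Latin-1 but is not a valid sequence on its own in UTF-8.

```python
# Hypothetical sample bytes; 0xA3 encodes the pound sign in Latin-1.
raw = b"spam,Win \xa3100 now"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # A lone 0xA3 byte is not a valid UTF-8 sequence.
    utf8_ok = False

# Latin-1 maps every possible byte to a character, so decoding cannot fail.
text = raw.decode("latin-1")

print(utf8_ok)  # False
print(text)     # spam,Win £100 now
```

Because Latin-1 never fails, it can also silently produce wrong characters when the file was written in a different encoding, so it is worth inspecting a few decoded rows.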

In [2]:
data  # Display the DataFrame

,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


#Data Preprocessing

In this step, we perform data preprocessing tasks, which include converting labels to binary values and removing duplicates from the dataset.

#Explanation:

We use data['v1'].apply(lambda x: 1 if x == 'spam' else 0) to convert the labels. 'ham' is mapped to 0, and 'spam' is mapped to 1 in the 'v1' column.

We then remove duplicate rows from the dataset using data = data.drop_duplicates().

The resulting DataFrame is displayed to show the cleaned dataset.

In [3]:
# Convert 'ham' to 0 and 'spam' to 1 directly in the 'v1' column
data['v1'] = data['v1'].apply(lambda x: 1 if x == 'spam' else 0)

# removing duplicates
data = data.drop_duplicates()
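
As an illustration of the two preprocessing steps, the sketch below applies them to a small made-up DataFrame (the rows are not from the dataset). It uses Series.map with a dictionary as one possible alternative to apply with a lambda; this is an assumption about style, not the notebook's own method.

```python
import pandas as pd

# Made-up sample rows for illustration; the last row duplicates the first.
sample = pd.DataFrame({
    "v1": ["ham", "spam", "ham"],
    "v2": ["ok lar", "free entry", "ok lar"],
})

# Series.map with a dict converts the labels; a label missing from the
# dict would become NaN instead of being silently mapped to 0.
sample["v1"] = sample["v1"].map({"ham": 0, "spam": 1})

# Drop exact duplicate rows and renumber the index.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(len(sample), len(deduped))  # 3 2
print(deduped["v1"].tolist())     # [0, 1]
```

Note that drop_duplicates() keeps the original index labels unless the index is reset, which is why gaps can appear in the index of the cleaned DataFrame.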

In [4]:
data

,v1,v2
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...
5568,0,Will Ì_ b going to esplanade fr home?
5569,0,"Pity, * was in mood for that. So...any other s..."
5570,0,The guy did some bitching but I acted like i'd...


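Assigning to a column of a DataFrame that was created by selecting columns from another DataFrame can trigger pandas' "SettingWithCopyWarning". To suppress it, we disable the chained-assignment check. The setting only affects assignments made after it is applied, so it must run before the cells that modify the data.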

In [5]:
import pandas as pd
pd.options.mode.chained_assignment = None  # Disable the warning
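
Instead of silencing the warning globally, one possible alternative is to take an explicit copy when selecting the columns. The sketch below uses a made-up DataFrame in place of the loaded CSV; the column name "extra" is hypothetical.

```python
import pandas as pd

# Made-up frame standing in for the loaded CSV (with one extra column).
raw = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["ok lar", "free entry"],
    "extra": [None, None],
})

# .copy() makes the selection an independent DataFrame, so later
# column assignments do not trigger SettingWithCopyWarning.
subset = raw[["v1", "v2"]].copy()
subset["v1"] = (subset["v1"] == "spam").astype(int)

print(subset["v1"].tolist())  # [0, 1]
print(raw["v1"].tolist())     # ['ham', 'spam'] (the original is untouched)
```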

#Text Cleaning:

Text cleaning involves removing any unnecessary characters, symbols, or noise from the text data. This might include punctuation, special characters, and numbers.

#Explanation:

We import the regular expression (re) module using import re.

The function clean_text() takes a string text as input and uses the regular expression [^a-zA-Z] to replace every character that is not an ASCII letter (digits, punctuation, accented characters) with a space. Replacing rather than deleting keeps neighbouring words from running together; for example, "don't" becomes "don t".

The cleaned text is then returned.

We apply this function to the 'v2' column of the DataFrame using data['v2'].apply(clean_text). This cleans the text of each message.

In [6]:
import re

def clean_text(text):
    # Replace every non-letter character with a space
    cleaned_text = re.sub(r'[^a-zA-Z]', ' ', text)
    return cleaned_text

data['v2'] = data['v2'].apply(clean_text)
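
The effect of the pattern is easiest to see on a single string. The sketch below reuses the same regular expression on the start of one message from the dataset; squeeze_spaces is a hypothetical helper that is not part of the notebook and only shows how the leftover runs of spaces could be collapsed.

```python
import re

def clean_text(text):
    # Same pattern as in the notebook: any non-letter becomes a space.
    return re.sub(r'[^a-zA-Z]', ' ', text)

def squeeze_spaces(text):
    # Hypothetical helper (not in the original notebook): collapse runs of
    # whitespace left behind by the substitution and trim the ends.
    return ' '.join(text.split())

sample = "Free entry in 2 a wkly comp!"
cleaned = clean_text(sample)

print(repr(cleaned))                  # 'Free entry in   a wkly comp '
print(repr(squeeze_spaces(cleaned)))  # 'Free entry in a wkly comp'
```

The extra spaces are harmless for the later steps, because tokenization splits on whitespace anyway.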


#Lowercasing:

Converting all text to lowercase ensures that the model doesn't treat "Hello" and "hello" as different words.

#Explanation:

We use the str.lower() method to convert all text in the 'v2' column to lowercase. This helps standardize the text data and ensure that the model is not case-sensitive.

In [7]:
data['v2'] = data['v2'].str.lower()

#Tokenization:

Tokenization involves splitting the text into individual words or tokens. The NLTK library can be used for this.

#Explanation:

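In this code cell, we use nltk.download('punkt') to download the Punkt tokenizer models from the Natural Language Toolkit (NLTK). Punkt is a pre-trained sentence tokenizer that word_tokenize relies on to split text into sentences before splitting each sentence into words, so it must be available before the next step. Recent NLTK releases look for a resource named 'punkt_tab' instead, so you may need to download that one if word_tokenize reports a missing resource.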

In [8]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#Explanation:

We import the word_tokenize function from the NLTK library.

The word_tokenize function takes a string as input and returns a list of tokens (words).

We apply this function to the 'v2' column of the DataFrame, converting each message into a list of tokens. Working with tokens lets the following steps operate on individual words.

In [9]:
from nltk.tokenize import word_tokenize

data['v2'] = data['v2'].apply(word_tokenize)
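
Because the text at this point contains only lowercase letters and spaces, plain whitespace splitting gives a rough, dependency-free picture of what tokenization produces. The function simple_tokenize below is a hypothetical stand-in for illustration, not a replacement for word_tokenize, which may additionally split some contractions such as "cannot" into two tokens.

```python
def simple_tokenize(text):
    # Hypothetical helper: split on any run of whitespace and
    # ignore leading or trailing spaces.
    return text.split()

tokens = simple_tokenize("ok lar    joking wif u oni   ")
print(tokens)  # ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
```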


#Stemming:

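Stemming reduces words to their stems by stripping common suffixes. The stems are not always dictionary words (for example, "crazy" becomes "crazi"), but mapping related word forms to one token reduces the dimensionality of the feature space.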

#Explanation:

We import the PorterStemmer class from the NLTK library.

We initialize an instance of the PorterStemmer as stemmer.

We define a function stem_words(words) that takes a list of words and applies stemming to each word using the stemmer.stem() method.

We apply this function to the 'v2' column of the DataFrame, replacing every token with its stem. Because inflected forms such as "going" and "go" collapse into a single token, the vocabulary shrinks, which can help improve the model's performance.

In [10]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

def stem_words(words):
    return [stemmer.stem(word) for word in words]

data['v2'] = data['v2'].apply(stem_words)
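
To check what the stemmer does to individual words, it can be called on a short token list. The tokens below come from the first message in the dataset after cleaning and lowercasing, and the stems match the output shown in the next cell. PorterStemmer needs no downloaded resources, so this sketch runs with NLTK alone.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_words(words):
    # Apply the Porter stemmer to every token in the list.
    return [stemmer.stem(word) for word in words]

print(stem_words(["crazy", "available", "only"]))  # ['crazi', 'avail', 'onli']
```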


In [11]:
data

,v1,v2
0,0,"[go, until, jurong, point, crazi, avail, onli,..."
1,0,"[ok, lar, joke, wif, u, oni]"
2,1,"[free, entri, in, a, wkli, comp, to, win, fa, ..."
3,0,"[u, dun, say, so, earli, hor, u, c, alreadi, t..."
4,0,"[nah, i, don, t, think, he, goe, to, usf, he, ..."
...,...,...
5567,1,"[thi, is, the, nd, time, we, have, tri, contac..."
5568,0,"[will, b, go, to, esplanad, fr, home]"
5569,0,"[piti, wa, in, mood, for, that, so, ani, other..."
5570,0,"[the, guy, did, some, bitch, but, i, act, like..."


#Feature Extraction

We convert the tokenized words back to text and apply count vectorization to transform the text data into numerical form.

#Explanation:

We import the CountVectorizer class from the scikit-learn library.

We convert the tokenized words back to text using data['v2'].apply(lambda x: ' '.join(x)). This step is necessary for the Count Vectorizer to work correctly.

We initialize the Count Vectorizer with a maximum of 5000 features using CountVectorizer(max_features=5000). You can adjust this parameter based on your specific needs and computational resources.

We apply the Count Vectorizer to the 'v2' column of the DataFrame, transforming the text data into a numerical format suitable for machine learning models.

If needed, we convert the result to a dense array using features = features.toarray(). CountVectorizer returns a SciPy sparse matrix, and scikit-learn's MultinomialNB accepts sparse input directly, so this conversion is optional here and increases memory use.

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

# Convert tokenized words back to text
data['v2'] = data['v2'].apply(lambda x: ' '.join(x))

# Initialize the Count Vectorizer
count_vectorizer = CountVectorizer(max_features=5000)  # You can adjust max_features as needed

# Apply the vectorizer to the 'v2' column
features = count_vectorizer.fit_transform(data['v2'])

# Convert the result to a dense array (if needed)
features = features.toarray()


#Train-Test Split:

Split your data into training and testing sets. This allows you to evaluate the performance of your model on data it hasn't seen before.

#Explanation:

We import the train_test_split function from scikit-learn, which allows us to split the dataset into training and testing sets.

We use train_test_split to split the features (features) and labels (data['v1']) into training and testing sets. The parameter test_size=0.2 indicates that 20% of the data will be used for testing, while 80% will be used for training.

The random_state=42 ensures that the split is reproducible. The same random state will produce the same split each time the code is run.

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, data['v1'], test_size=0.2, random_state=42)
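
The sketch below shows the split on made-up data with 15 "ham" and 5 "spam" labels. It also passes stratify=y, which the cell above does not use; this is an optional refinement that keeps the class ratio the same in both parts, which can matter when one class is rare.

```python
from sklearn.model_selection import train_test_split

# Made-up data: twenty feature rows with imbalanced labels.
X = [[i] for i in range(20)]
y = [0] * 15 + [1] * 5

# stratify=y is an assumption added for this sketch; it preserves the
# 3:1 ham-to-spam ratio in both the training and the test part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))  # 16 4
print(sum(y_train), sum(y_test))  # 4 1
```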


#Model Selection:

We choose the Multinomial Naive Bayes classifier for its effectiveness in text classification tasks, particularly with word-count features like the ones extracted above.

#Explanation:

We import the MultinomialNB class from scikit-learn, which represents the Multinomial Naive Bayes classifier.

We initialize an instance of the Multinomial Naive Bayes classifier as clf. This classifier is a commonly used algorithm for text classification tasks.

In [14]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()


#Model Training:

Train your chosen model on the training data.

#Explanation:

We use the fit method of the classifier (clf) to train it on the training data. The training data consists of the features (X_train) and their corresponding labels (y_train). This step allows the classifier to learn patterns in the data and make predictions on new, unseen examples.

In [15]:
clf.fit(X_train, y_train)
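
To make the training step concrete, here is a minimal sketch on four made-up messages. It wraps CountVectorizer and MultinomialNB in a scikit-learn pipeline, which is an assumption of this sketch and not the notebook's setup; one benefit is that the vocabulary is learned from the training texts only, whereas the notebook fits the vectorizer on the full dataset before splitting.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training messages: 1 = spam, 0 = ham.
train_texts = [
    "free prize win",
    "win cash free",
    "see you at home",
    "are you at home",
]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and then fits the classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

pred = model.predict(["free cash prize", "you at home"])
print(pred.tolist())  # [1, 0]
```

The first test message contains only words seen in spam and the second only words seen in ham, so the classifier assigns them to those classes.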


#Model Evaluation:

We evaluate the performance of the Multinomial Naive Bayes classifier on the test set using metrics such as accuracy, precision, recall, and F1-score.

#Explanation:

We import the classification_report function from scikit-learn, which generates a detailed classification report.

We use the trained classifier (clf) to make predictions on the test data (X_test).

The classification_report function takes the true labels (y_test) and the predicted labels (y_pred) as input, and calculates various metrics including precision, recall, F1-score, and support for each class.

The resulting report is printed, providing a comprehensive assessment of the classifier's performance on the test set.

In [16]:
from sklearn.metrics import classification_report

y_pred = clf.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)


              precision    recall  f1-score   support

           0       0.99      0.98      0.99       889
           1       0.90      0.94      0.92       145

    accuracy                           0.98      1034
   macro avg       0.94      0.96      0.95      1034
weighted avg       0.98      0.98      0.98      1034
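
The per-class numbers in the report follow from counts of true positives, false positives, and false negatives. The sketch below computes them by hand for made-up labels (not the notebook's predictions), treating spam (1) as the positive class; precision_recall_f1 is a hypothetical helper written for this illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    # Count the three outcomes that involve the positive class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Precision: share of predicted positives that are correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 3 true positives, 2 false positives, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.6 0.75 0.67
```

Read this way, the report above says that 90% of the messages flagged as spam were actually spam (precision) and that 94% of the spam messages were caught (recall).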



We also perform k-fold cross-validation to assess how stable the performance of the Multinomial Naive Bayes classifier is across different splits of the data.

#Explanation:

We import the cross_val_score function from scikit-learn, which performs k-fold cross-validation.

We initialize a new instance of the Multinomial Naive Bayes classifier (clf) for cross-validation.

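Assuming 'features' (X) and labels (y) are available, we use cross_val_score to perform 5-fold cross-validation (cv=5). For a classifier and an integer cv, scikit-learn uses stratified folds, and the default score is accuracy. You can adjust the value of cv based on your specific requirements.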

The cross-validation scores are printed, showing the performance of the classifier in each fold, as well as the mean cross-validation score. This provides a more robust evaluation of the model's performance.

In [17]:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Assuming 'features' and 'data['v1']' are your features and labels
X = features
y = data['v1']

# Initialize a Naive Bayes classifier
clf = MultinomialNB()

# Perform 5-fold cross-validation (you can adjust 'cv' as needed)
cv_scores = cross_val_score(clf, X, y, cv=5)

# Print the cross-validation scores
print(f'Cross-Validation Scores: {cv_scores}')
print(f'Mean CV Score: {cv_scores.mean()}')


Cross-Validation Scores: [0.97969052 0.97582205 0.97582205 0.9787234  0.9767667 ]
Mean CV Score: 0.9773649452028887
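
Since spam is the minority class, accuracy alone can look high even when spam detection is weak. As a possible variation, the sketch below cross-validates a pipeline on made-up messages and scores it with F1 for the spam class. The pipeline, the two-fold setup, and the choice of scoring="f1" are assumptions of this sketch, not the notebook's configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up messages: 1 = spam, 0 = ham.
texts = [
    "free prize win", "win cash free", "free cash now", "claim prize now",
    "see you at home", "are you at home", "call me later", "see you later",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Inside each fold the vectorizer is fitted on the training part only.
model = make_pipeline(CountVectorizer(), MultinomialNB())
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

scores = cross_val_score(model, texts, labels, cv=cv, scoring="f1")
print(f"F1 per fold: {scores}")
print(f"Mean F1: {scores.mean()}")
```

With so few examples the scores themselves are not meaningful; the point is the structure, which carries over to the full dataset by passing the message column and the label column instead of the sample lists.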
