
# **Text Classification using Bag of Words (BoW) in Google Colab**

### **Introduction**

Text classification is a fundamental task in Natural Language Processing (NLP) that involves assigning predefined categories to text data. One of the simplest and most popular methods for representing text data is the **Bag of Words (BoW)** model. In this lab, we will explore how to build a text classification model using the BoW approach on a public dataset.

We'll be using the **SMS Spam Collection Dataset**, which contains a collection of SMS messages labeled as "ham" (not spam) or "spam." Our goal is to build a machine learning model that can automatically classify SMS messages as spam or ham based on their content.

### **Objectives**

By the end of this lab, you will have learned how to:
1. Preprocess text data by removing noise such as stopwords and punctuation.
2. Convert text into numerical features using the **Bag of Words** representation.
3. Train a machine learning model on the text data using **Logistic Regression**.
4. Evaluate the performance of the model using accuracy and a classification report (precision, recall, and F1-score).

### **Key Concepts**

- **Bag of Words (BoW)**: A representation of text data where each unique word in the corpus is treated as a feature, and the value of each feature is the number of times that word occurs in a given message. Word order is discarded.
- **Text Preprocessing**: The process of cleaning and preparing text data by removing unwanted elements such as punctuation and stopwords and by converting text to lowercase.
- **Logistic Regression**: A linear model commonly used for binary classification tasks, suitable for text classification due to its simplicity and effectiveness.
- **SMS Spam Collection Dataset**: A dataset containing labeled SMS messages, publicly available from the UCI Machine Learning Repository.

### **Tools and Libraries**

We will be using the following tools and libraries:
- **Google Colab**: A cloud-based platform for writing and running Python code.
- **Pandas**: For loading and manipulating tabular data.
- **NLTK**: For text preprocessing, including stopword removal.
- **Scikit-learn**: For feature extraction, model training, and evaluation.

### **Dataset Overview**

The dataset consists of two columns:
- `label`: The category of the SMS message (either 'ham' or 'spam').
- `message`: The actual content of the SMS.

### **Workflow**

1. **Loading and preprocessing the dataset**: We will start by loading the dataset into a pandas DataFrame and performing text preprocessing to clean the data.
2. **Feature extraction using Bag of Words**: We'll convert the cleaned text data into numerical features using the BoW technique.
3. **Model training**: We'll use Logistic Regression to train the model on the training dataset.
4. **Evaluation**: Finally, we will evaluate the model on a test set and analyze the results using accuracy and classification metrics.


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import nltk
from nltk.corpus import stopwords
import string

# Download stopwords for preprocessing
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip


--2024-10-09 23:47:01--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘smsspamcollection.zip.1’

smsspamcollection.z     [ <=>                ] 198.65K  1023KB/s    in 0.2s    

2024-10-09 23:47:02 (1023 KB/s) - ‘smsspamcollection.zip.1’ saved [203415]

Archive:  smsspamcollection.zip
replace SMSSpamCollection? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [None]:
# Load the dataset
df = pd.read_csv('SMSSpamCollection', sep='\t', names=['label', 'message'])
df.head()


| | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [None]:
df

| | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [None]:
df.iloc[4]['message']

"Nah I don't think he goes to usf, he lives around here though"

### **Text Preprocessing**

Before we can apply machine learning algorithms to our dataset, we need to clean the text data. Raw text often contains noise, such as punctuation, stopwords, and case differences, which can affect model performance. In this step, we'll define a function to:

- Remove punctuation.
- Convert all text to lowercase.
- Remove common English stopwords that do not contribute meaningful information.

We will apply this preprocessing function to the SMS messages to prepare them for feature extraction.

In [None]:
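# Build the stopword set once; set lookups are faster than scanning a list per word
STOP_WORDS = set(stopwords.words('english'))

# Define a function to clean the text
def text_preprocessing(text):
    # Remove punctuation
    text = ''.join(char for char in text if char not in string.punctuation)

    # Convert to lowercase
    text = text.lower()

    # Remove stopwords. Punctuation is already stripped at this point, so
    # contractions such as "don't" have become "dont" and are not filtered out.
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)

    return text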

# Apply preprocessing to the message column
df['cleaned_message'] = df['message'].apply(text_preprocessing)
df.head()


| | label | message | cleaned_message |
|---|-------|---------|-----------------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | go jurong point crazy available bugis n great ... |
| 1 | ham | Ok lar... Joking wif u oni... | ok lar joking wif u oni |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | free entry 2 wkly comp win fa cup final tkts 2... |
| 3 | ham | U dun say so early hor... U c already then say... | u dun say early hor u c already say |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | nah dont think goes usf lives around though |
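
Row 4 above keeps the token `dont`, which hints that the order of the cleaning steps matters. The sketch below compares the two possible orders on that message. It uses a tiny hand-picked stopword set (`DEMO_STOP_WORDS`, an assumption made for this illustration) in place of NLTK's full English list, and the helper names are hypothetical.

```python
import string

# Tiny hand-picked stopword set for illustration only;
# the lab itself uses NLTK's full English stopword list.
DEMO_STOP_WORDS = {"i", "he", "to", "don't"}

def strip_punctuation(text):
    """Drop every character listed in string.punctuation."""
    return ''.join(ch for ch in text if ch not in string.punctuation)

def drop_stopwords(text, stop_words):
    """Keep only whitespace-separated words that are not stopwords."""
    return ' '.join(w for w in text.split() if w not in stop_words)

msg = "Nah I don't think he goes to usf"

# Order used in this lab: strip punctuation first, then filter stopwords
punct_first = drop_stopwords(strip_punctuation(msg).lower(), DEMO_STOP_WORDS)

# Alternative order: filter stopwords first, then strip punctuation
stop_first = strip_punctuation(drop_stopwords(msg.lower(), DEMO_STOP_WORDS))

print(punct_first)  # nah dont think goes usf
print(stop_first)   # nah think goes usf
```

Either order is a defensible choice; the point is only that stopwords containing an apostrophe can match only while the apostrophe is still present.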


### **Convert Text Data to Bag of Words**

In this step, we will convert the preprocessed text data into numerical features using the **Bag of Words (BoW)** technique. For this, we will use `CountVectorizer` from the `scikit-learn` library.

- The `CountVectorizer` will tokenize the cleaned text data and convert each message into a vector of word counts.
- We will also convert the labels from categorical ('ham' or 'spam') into binary values (0 for 'ham' and 1 for 'spam') to prepare them for model training.


In [None]:
# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the cleaned message column
X = vectorizer.fit_transform(df['cleaned_message'])

# Convert labels to binary (0 for ham, 1 for spam) with a vectorized comparison
y = (df['label'] == 'spam').astype(int)


In [None]:
X

<5572x9437 sparse matrix of type '<class 'numpy.int64'>'
	with 47493 stored elements in Compressed Sparse Row format>


### **Train-Test Split**

To evaluate the performance of our machine learning model, we need to split the dataset into two parts:
- **Training set**: Used to train the model.
- **Test set**: Used to evaluate the model's performance on unseen data.

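We will use the `train_test_split` function from `scikit-learn` to split the data into training and test sets, with 80% of the data for training and 20% for testing. The `random_state` parameter ensures reproducibility of the results. Note that the vectorizer above was fit on the full dataset before this split, so the vocabulary also includes words that occur only in test messages; fitting it on the training split alone gives a stricter estimate of performance on unseen text.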
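
As a hedged alternative to the cell below, the split can be done on the raw text and the vectorizer wrapped in a scikit-learn pipeline, so that the vocabulary is learned from the training messages only. The sketch uses ten made-up messages rather than the real dataset, and the `stratify` argument (not used in this lab) is added to keep the spam/ham ratio similar in both parts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up toy messages (not from the dataset): 1 = spam, 0 = ham
texts = [
    "win free prize now", "free cash claim now", "urgent prize waiting call",
    "claim free entry today", "see you at lunch", "call me when home",
    "are we meeting today", "thanks for the lift", "running late sorry",
    "free tickets win now",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

# Split the raw text first; stratify keeps the class ratio in both parts
txt_train, txt_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Inside the pipeline, CountVectorizer is fit on the training text only
pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
pipe.fit(txt_train, lab_train)

train_vocab_size = len(pipe.named_steps["countvectorizer"].vocabulary_)
print("vocabulary size learned from training text:", train_vocab_size)
print("test accuracy on the toy data:", pipe.score(txt_test, lab_test))
```

On a toy set this small the accuracy figure means little; the sketch is only meant to show the order of operations.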


In [None]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shape of the resulting data
X_train.shape, X_test.shape


((4457, 9437), (1115, 9437))

### **Model Training: Logistic Regression**

In this step, we will initialize and train a **Logistic Regression** classifier. Logistic Regression is a widely used algorithm for binary classification tasks, such as distinguishing between "spam" and "ham" messages.

- The model is initialized with `max_iter=1000`, which raises the iteration cap above scikit-learn's default of 100 so the solver has more room to converge during training.
- We then fit the model to the training data (`X_train`, `y_train`) to learn the patterns and relationships in the text features.



In [None]:
# Initialize Logistic Regression classifier
lr = LogisticRegression(max_iter=1000)  # Increase max_iter if you encounter convergence issues

# Train the classifier
lr.fit(X_train, y_train)


### **Model Evaluation**

After training the Logistic Regression model, we will now evaluate its performance on the test set.

- **Predictions**: Using the trained model, we make predictions (`y_pred`) on the test set (`X_test`).
- **Accuracy**: We calculate the accuracy of the model, which is the percentage of correct predictions out of the total number of predictions.
- **Classification Report**: We generate a detailed classification report, which includes precision, recall, f1-score, and support for both classes (spam and ham).

The evaluation results will help us understand how well the model generalizes to unseen data.
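
The numbers in a classification report all derive from the four cells of a confusion matrix. The sketch below works through that arithmetic for the positive class on made-up labels (not the lab's predictions) and cross-checks the hand-computed values against scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 1 = spam, 0 = ham
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

demo_precision = tp / (tp + fp)  # of messages flagged as spam, how many were spam
demo_recall = tp / (tp + fn)     # of actual spam messages, how many were caught
demo_f1 = 2 * demo_precision * demo_recall / (demo_precision + demo_recall)
demo_accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)                                      # 5 1 1 3
print(demo_precision, demo_recall, demo_f1, demo_accuracy)  # 0.75 0.75 0.75 0.8
```

Here one ham message is wrongly flagged (a false positive) and one spam message slips through (a false negative), which is why precision and recall both come out at 3/4.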


In [None]:
# Make predictions on the test set
y_pred = lr.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

# Print accuracy and classification report
print(f'Accuracy: {accuracy * 100:.2f}%')
print('Classification Report:')
print(report)


Accuracy: 98.48%
Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       1.00      0.89      0.94       149

    accuracy                           0.98      1115
   macro avg       0.99      0.94      0.97      1115
weighted avg       0.99      0.98      0.98      1115



### **Model Performance**

The Logistic Regression model achieved the following performance on the test set:

- **Accuracy**: 98.48%

#### **Classification Report**:

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| Ham (0) | 0.98 | 1.00 | 0.99 | 966 |
| Spam (1) | 1.00 | 0.89 | 0.94 | 149 |
| Macro Avg | 0.99 | 0.94 | 0.97 | 1115 |
| Weighted Avg | 0.99 | 0.98 | 0.98 | 1115 |
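
To put these figures in context, the short calculation below uses only the support counts and rounded scores reported above. It estimates the accuracy of a trivial classifier that always predicts "ham" and converts the reported percentages back into approximate message counts. Because the report rounds to two decimals, the derived counts are approximations.

```python
# Support counts and scores copied from the classification report above
ham_support, spam_support = 966, 149
total = ham_support + spam_support

# A classifier that always predicts "ham" is right on every ham message
baseline_accuracy = ham_support / total
print(f"always-ham baseline accuracy: {baseline_accuracy:.1%}")

# Number of misclassified test messages implied by 98.48% accuracy
model_accuracy = 0.9848
n_errors = round((1 - model_accuracy) * total)
print("misclassified test messages:", n_errors)

# Spam recall is rounded to 0.89 in the report, so this count is approximate
spam_recall = 0.89
approx_missed_spam = round((1 - spam_recall) * spam_support)
print("spam messages missed (approx.):", approx_missed_spam)
```

The gap between the baseline and the model's accuracy is a fairer measure of what the model adds than the raw 98.48% figure.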

### **Conclusion**

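The Logistic Regression model reached 98.48% accuracy on the test set. Because the test set is imbalanced (966 ham vs. 149 spam), always predicting "ham" would already score about 86.6%, so the per-class metrics are more informative than accuracy alone. A precision of 1.00 for the "spam" class and a recall of 1.00 for the "ham" class both indicate that almost no legitimate messages were flagged as spam. The weaker point is spam recall: at 0.89, roughly one in nine spam messages in the test set was classified as ham, which pulls the spam F1-score down to 0.94. Overall, this simple BoW model is a strong baseline for SMS spam detection: it rarely blocks legitimate messages but lets some spam through.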
