In [None]:
%pip install nltk scikit-learn pandas numpy

### Importing the Required Libraries

In [3]:
import sys
import nltk
import sklearn
import pandas as pd
import numpy as np


### Code Explanation: Loading and Previewing SMS Spam Dataset

1. **Loading the Dataset**:
   - The code uses `pandas` to load a text file into a DataFrame. The function `pd.read_table()` reads the file `SMSSpamCollection` located in the `./classification_data/` directory.
   - The file is expected to be a tab-separated values (TSV) file, which means each column is separated by a tab character.

2. **No Header in the Dataset**:
   - The argument `header=None` specifies that the dataset does not have a predefined header (column names). Thus, pandas will automatically assign default integer column names (starting from 0).

3. **Encoding**:
   - The `encoding='utf-8'` parameter ensures that the text is read correctly in case of any special characters, preventing errors during the file loading process.

4. **Previewing the Data**:
   - `sms.head()` displays the first five rows of the DataFrame, providing a quick view of the loaded data. This is useful for confirming the structure and contents of the dataset.

5. **Usage**:
   - The dataset contains SMS text messages, each labeled `ham` (legitimate) or `spam`. Column 0 holds the label and column 1 holds the message text. The loaded DataFrame can now be used for further data analysis or machine learning tasks.
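
The loading step can be tried without the data file. The sketch below is a minimal illustration, assuming a made-up three-row sample (`sample_tsv`) in the same tab-separated layout. The `names=` argument and the column names `label` and `message` are illustrative choices, not something the notebook uses.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the SMSSpamCollection file
sample_tsv = (
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tWINNER!! Claim your prize now\n"
    "ham\tSee you at lunch\n"
)

# header=None: pandas labels the columns 0 and 1, as in the notebook
raw = pd.read_table(io.StringIO(sample_tsv), header=None)
print(raw.columns.tolist())

# Optional variant: names=[...] gives the columns readable labels instead
named = pd.read_table(io.StringIO(sample_tsv), header=None, names=["label", "message"])
print(named.head())
```

With named columns, later cells could refer to `named["label"]` rather than `sms[0]`, which may make the code easier to follow.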



In [4]:
sms = pd.read_table('./classification_data/SMSSpamCollection', header=None, encoding='utf-8')
sms.head()

      0                                                  1
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [5]:
sms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       5572 non-null   object
 1   1       5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB



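1. **Counting the Labels in the First Column**:
   - `sms[0]` accesses the first column of the DataFrame `sms`. Because the file has no header row, pandas assigned integer column labels starting from 0.
   - This column holds the class labels: `ham` for legitimate messages and `spam` for unwanted ones.
   - `value_counts()` tallies how often each label occurs. The classes are imbalanced: 4,825 ham messages against 747 spam, so spam makes up roughly 13% of the 5,572 rows.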


In [6]:
sms[0].value_counts()

0
ham     4825
spam     747
Name: count, dtype: int64
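
The class balance matters when reading accuracy figures later on. The following sketch rebuilds a label column with the same counts as the output above and derives the class proportions; the name `majority_baseline` is an illustrative choice. It suggests that a model which always answers `ham` would already reach about 87% accuracy.

```python
import pandas as pd

# Rebuild a label column with the same class counts as the output above
labels = pd.Series(["ham"] * 4825 + ["spam"] * 747)

# normalize=True returns proportions instead of raw counts
shares = labels.value_counts(normalize=True)
print(shares.round(3))

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = shares.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```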

### Code Explanation: Label Encoding for SMS Spam Classification

1. **Importing LabelEncoder**:
   - The `LabelEncoder` from `sklearn.preprocessing` is imported. This is used to transform categorical labels into numeric values (i.e., encoding string labels like "spam" or "ham" into integers).

2. **Fitting and Transforming Labels**:
   - `enc.fit_transform(sms[0])` applies the label encoder to the first column of the `sms` dataset, which contains the target labels. `LabelEncoder` sorts the class names and numbers them in that order, so `ham` becomes 0 and `spam` becomes 1.
   
3. **Printing Encoded and Original Labels**:
   - The first 10 encoded labels (`label[:10]`) and the original 10 labels (`sms[0][:10]`) are printed. This allows you to compare how the labels have been transformed from text to numbers.



In [7]:
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
label = enc.fit_transform(sms[0])
print(label[:10])
print(sms[0][:10])

[0 0 1 0 0 1 0 0 1 1]
0     ham
1     ham
2    spam
3     ham
4     ham
5    spam
6     ham
7     ham
8    spam
9    spam
Name: 0, dtype: object
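
To check the mapping rather than infer it from the printout, the fitted encoder can be inspected directly. This is a small sketch on a made-up four-label list; `classes_` and `inverse_transform` are standard `LabelEncoder` members.

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
encoded = enc.fit_transform(["ham", "ham", "spam", "ham"])

# classes_ lists the original labels; the position of each label is its integer code
print(list(enc.classes_))
print(encoded.tolist())

# inverse_transform maps integer codes back to the original strings
decoded = enc.inverse_transform([1, 0])
print(list(decoded))
```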


In [8]:
text = sms[1]
text[:10]

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
5    FreeMsg Hey there darling it's been 3 week's n...
6    Even my brother is not like to speak with me. ...
7    As per your request 'Melle Melle (Oru Minnamin...
8    WINNER!! As a valued network customer you have...
9    Had your mobile 11 months or more? U R entitle...
Name: 1, dtype: object

In [9]:
# Replace email addresses with 'emailaddress'
# Note: the ^...$ anchors mean the pattern must match the whole message
processed = text.str.replace(r'^.+@[^\.].*\.[a-z]{2,}$', 'emailaddress', regex=True)

# Replace URLs with 'webaddress' (anchored: must match the whole message)
processed = processed.str.replace(r'^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$', 'webaddress', regex=True)

# Replace money symbols with 'moneysymb' (£ can be typed with ALT key + 156)
processed = processed.str.replace(r'£|\$', 'moneysymb', regex=True)

# Replace 10-digit phone numbers (formats include parentheses, spaces, no spaces, dashes) with 'phonenumbr'
# (anchored: must match the whole message)
processed = processed.str.replace(r'^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$', 'phonenumbr', regex=True)

# Replace numbers with 'numbr'
processed = processed.str.replace(r'\d+(\.\d+)?', 'numbr', regex=True)

# Replace punctuation (anything that is not a word character or whitespace) with a space
processed = processed.str.replace(r'[^\w\d\s]', ' ', regex=True)

# Replace whitespace between terms with a single space
processed = processed.str.replace(r'\s+', ' ', regex=True)

# Remove leading and trailing whitespace
processed = processed.str.replace(r'^\s+|\s+?$', '', regex=True)
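
The URL pattern in the cell above starts with `^` and ends with `$`, so it only fires when the whole message is a URL. The sketch below illustrates the effect with the standard `re` module, which applies the same regular-expression semantics as `str.replace(..., regex=True)`. `UNANCHORED_URL` is a hypothetical alternative shown for comparison, not the notebook's pattern, and the sample messages are made up.

```python
import re

# The notebook's URL pattern, anchored to the start and end of the message
ANCHORED_URL = r'^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$'

# Hypothetical un-anchored variant (an assumption, not taken from the notebook)
UNANCHORED_URL = r'https?://\S+'

whole = "http://example.com/win"
embedded = "Claim now at http://example.com/win today"

# A message that is nothing but a URL is replaced
whole_result = re.sub(ANCHORED_URL, "webaddress", whole)

# A URL inside a longer message is left untouched by the anchored pattern
embedded_anchored = re.sub(ANCHORED_URL, "webaddress", embedded)

# The un-anchored variant replaces the URL wherever it appears
embedded_unanchored = re.sub(UNANCHORED_URL, "webaddress", embedded)

print(whole_result)
print(embedded_anchored)
print(embedded_unanchored)
```

Switching to un-anchored patterns would change the token counts reported later in the notebook, so treat this as a possible refinement rather than a drop-in replacement.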

In [10]:
# Convert the strings to lower case
processed = processed.str.lower()
processed

0       go until jurong point crazy available only in ...
1                                 ok lar joking wif u oni
2       free entry in numbr a wkly comp to win fa cup ...
3             u dun say so early hor u c already then say
4       nah i don t think he goes to usf he lives arou...
                              ...                        
5567    this is the numbrnd time we have tried numbr c...
5568                  will ü b going to esplanade fr home
5569    pity was in mood for that so any other suggest...
5570    the guy did some bitching but i acted like i d...
5571                            rofl its true to its name
Name: 1, Length: 5572, dtype: object

### Code Explanation: Text Preprocessing with Stopwords Removal and Stemming

1. **Loading Stopwords**:
   - The `stopwords` from `nltk.corpus` are loaded and stored in the `stop_words` variable. These are common words in English (e.g., "is", "the", "in") that are typically removed in text preprocessing to focus on more meaningful words.

2. **Stopwords Removal**:
   - The `processed` dataset is updated by applying a lambda function that removes stopwords. The function splits each text string into terms (words), filters out any term found in the `stop_words` set, and rejoins the remaining terms into a cleaned sentence.

3. **Porter Stemming**:
   - The `nltk.PorterStemmer()` is instantiated as `ps`. Stemming strips suffixes to reduce words to a common stem (e.g., "running" becomes "run"). The stem need not be a dictionary word, as `crazi` and `avail` in the output show.

4. **Applying Stemming**:
   - Another lambda function is applied to `processed`, where each word in the text is stemmed using `ps.stem()`, and then the terms are rejoined to form the processed sentence.

5. **Output**:
   - The fully processed `processed` data is displayed, showing the text after stopwords have been removed and terms have been stemmed.


In [11]:
from nltk.corpus import stopwords

# Requires the NLTK stop-word corpus; if it is missing, run nltk.download('stopwords') once
stop_words = set(stopwords.words('english'))

processed = processed.apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))

ps = nltk.PorterStemmer()

processed = processed.apply(lambda x: ' '.join(ps.stem(term) for term in x.split()))
processed

0       go jurong point crazi avail bugi n great world...
1                                   ok lar joke wif u oni
2       free entri numbr wkli comp win fa cup final tk...
3                     u dun say earli hor u c alreadi say
4                    nah think goe usf live around though
                              ...                        
5567    numbrnd time tri numbr contact u u moneysymbnu...
5568                              ü b go esplanad fr home
5569                                    piti mood suggest
5570    guy bitch act like interest buy someth els nex...
5571                                       rofl true name
Name: 1, Length: 5572, dtype: object
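
The stop-word filter can be looked at in isolation. The sketch below uses a tiny hypothetical stop list in place of NLTK's English list, and `remove_stop_words` is an illustrative helper name. Keeping the stop words in a `set` makes each membership test a constant-time lookup, which adds up over thousands of messages.

```python
# Hypothetical mini stop list; the notebook uses NLTK's full English list
stop_words = {"i", "he", "to", "the", "don", "t"}


def remove_stop_words(message, stop_words):
    """Drop every whitespace-separated term that appears in stop_words."""
    return " ".join(term for term in message.split() if term not in stop_words)


cleaned = remove_stop_words("nah i don t think he goes to usf", stop_words)
print(cleaned)
```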

In [12]:
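# word_tokenize needs NLTK's Punkt tokenizer data: run nltk.download('punkt') once
# (recent NLTK releases may ask for 'punkt_tab' instead)
from nltk.tokenize import word_tokenize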

all_words = []

for message in processed:
    words = word_tokenize(message)
    for w in words:
        all_words.append(w)
        
all_words = nltk.FreqDist(all_words)

# Print the result
print('Number of words: {}'.format(len(all_words)))
print('Most common words: {}'.format(all_words.most_common(15)))

Number of words: 6579
Most common words: [('numbr', 2648), ('u', 1207), ('call', 674), ('go', 456), ('get', 451), ('ur', 391), ('gt', 318), ('lt', 316), ('come', 304), ('moneysymbnumbr', 303), ('ok', 293), ('free', 284), ('day', 276), ('know', 275), ('love', 266)]


### Code Explanation: Tokenization and Word Frequency Distribution

1. **Tokenizing Words**:
   - The `word_tokenize` function from `nltk` is used to split each message in the `processed` dataset into individual words (tokens). The result is a list of words for each message.

2. **Building Word List**:
   - For each message, the words are added to a cumulative list called `all_words`, which stores every word found in the processed text data.

3. **Frequency Distribution**:
   - The `nltk.FreqDist` function is used to create a frequency distribution of the words stored in `all_words`. This helps in counting how often each word appears across the dataset.

4. **Result Output**:
   - The total number of unique words is printed using `len(all_words)`.
   - The `most_common(15)` method is called on `all_words` to print the 15 most frequently occurring words along with their counts.
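
`nltk.FreqDist` is a subclass of `collections.Counter`, so the counting step can be sketched with the standard library alone. The example below assumes three made-up, already-preprocessed messages and uses `str.split()` in place of `word_tokenize`, which is a simplification.

```python
from collections import Counter

# Hypothetical preprocessed messages
messages = ["free entri numbr", "call numbr free", "numbr ok lar"]

# Count every token across all messages
counts = Counter()
for message in messages:
    counts.update(message.split())

print("Number of words: {}".format(len(counts)))
print("Most common words: {}".format(counts.most_common(2)))
```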


In [14]:
word_features = [x[0] for x in all_words.most_common(1500)]
def find_features(message):
    words = word_tokenize(message)
    features = {}
    for word in word_features:
        features[word] = (word in words)

    return features

### Code Explanation: Feature Extraction Using Word Frequencies

1. **Creating Word Features**:
   - `word_features` is a list that stores the top 1,500 most common words from the frequency distribution `all_words`. These words will act as features for the model.

2. **`find_features` Function**:
   - The function `find_features` takes a message as input and tokenizes it into individual words using `word_tokenize`.
   - A dictionary `features` is created, where the keys are the words from `word_features`, and the values are `True` or `False` depending on whether the word appears in the message or not.

3. **Output**:
   - The function returns a dictionary where each key (word from `word_features`) maps to a boolean (`True` or `False`), indicating whether that word is present in the message. This dictionary can be used as feature input for machine learning models.
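
A compact variant of the same idea is sketched below. It is not the notebook's function: `find_features_sketch` and the four-word `vocabulary` are hypothetical, and `str.split()` stands in for `word_tokenize`. Converting the message's tokens to a `set` first avoids scanning the token list once per vocabulary word.

```python
def find_features_sketch(message, vocabulary):
    """Map each vocabulary word to True/False depending on its presence in message."""
    present = set(message.split())
    return {word: (word in present) for word in vocabulary}


# Hypothetical four-word vocabulary standing in for the top 1,500 words
vocabulary = ["numbr", "free", "call", "ok"]

features_example = find_features_sketch("free entri numbr", vocabulary)
print(features_example)
```

Every message yields a dictionary with the same keys, which is what lets the dictionaries be turned into fixed-width feature vectors later.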


In [15]:
features = find_features(processed[0])
for key, value in features.items():
    if value == True:
        print(key)

go
got
n
great
wat
e
world
point
avail
crazi
bugi
la
cine


### Code Explanation: Displaying Feature Words Present in a Message

1. **Finding Features in a Processed Message**:
   - The function `find_features(processed[0])` is called on the first message in the `processed` dataset. This identifies which of the top 1,500 most common words (`word_features`) are present in the message.

2. **Iterating Over the Features**:
   - A `for` loop is used to iterate through the dictionary `features`. Each key represents a word from `word_features`, and each value is a boolean (`True` or `False`) indicating the presence of the word in the message.

3. **Printing Present Words**:
   - The condition `if value == True:` checks if a word is present in the message. If the word is found, it is printed.
   - This loop outputs all words from the top 1,500 that are present in the message.

4. **Result**:
   - The printed words will be the words from the top 1,500 most frequent words that appear in the given message (`processed[0]`).


In [16]:
list(features.items())[:10]
messages = list(zip(processed, label))

np.random.seed(1)
np.random.shuffle(messages)

# Call find_features function for each SMS message
feature_set = [(find_features(text), label) for (text, label) in messages]

### Code Explanation: Preparing Feature Set for SMS Messages

1. **Listing Features**:
   - `list(features.items())[:10]` builds a list of the first 10 `(word, present)` pairs from the `features` dictionary computed for the first message. Because it is not the last expression in the cell, the notebook does not display the result; wrap it in `print()` to inspect it.

2. **Combining Processed Messages and Labels**:
   - The `messages` variable is created by zipping together `processed` (the preprocessed SMS messages) and `label` (the corresponding labels indicating if a message is spam or not). This results in a list of tuples, where each tuple contains a message and its label.

3. **Shuffling Messages**:
   - `np.random.seed(1)` sets the random seed for reproducibility, ensuring that the shuffle operation produces the same order each time it is run.
   - `np.random.shuffle(messages)` randomly shuffles the `messages` list to ensure that the training data is mixed, which is important for training machine learning models to avoid any order biases.

4. **Creating the Feature Set**:
   - A list comprehension is used to create `feature_set`, which contains tuples of features extracted from each message and their corresponding label. 
   - For each message (`text`), the `find_features(text)` function is called to generate a dictionary of features indicating the presence of the top words, and this is paired with the message's label.

5. **Result**:
   - The `feature_set` now contains a collection of features and labels, ready for use in training a machine learning model.
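
The reason for zipping before shuffling is that each message stays attached to its label. A minimal sketch on four made-up messages follows; `shuffled_pairs` is an illustrative helper, not part of the notebook.

```python
import numpy as np

# Hypothetical preprocessed messages with their labels (0 = ham, 1 = spam)
texts = ["free entri numbr", "ok lar", "call numbr", "rofl true name"]
labels = [1, 0, 1, 0]


def shuffled_pairs(texts, labels, seed):
    """Pair each text with its label, then shuffle the pairs reproducibly."""
    pairs = list(zip(texts, labels))
    np.random.seed(seed)
    np.random.shuffle(pairs)  # shuffles the list in place
    return pairs


first = shuffled_pairs(texts, labels, seed=1)
second = shuffled_pairs(texts, labels, seed=1)

# Same seed, same order; every text still carries its original label
print(first == second)
print(first)
```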


In [17]:
from sklearn.model_selection import train_test_split

training, test = train_test_split(feature_set, test_size=0.25, random_state=1)

### Splitting the Dataset into Training and Testing Sets

1. **Importing Required Function**:
   - The `train_test_split` function from the `sklearn.model_selection` module is imported. This function is commonly used to split datasets into training and testing subsets for machine learning.

2. **Splitting the Dataset**:
   - The line `training, test = train_test_split(feature_set, test_size=0.25, random_state=1)` performs the dataset split.
     - **`feature_set`**: This is the dataset containing tuples of features and their corresponding labels, which was prepared earlier.
     - **`test_size=0.25`**: This parameter specifies that 25% of the data should be reserved for testing, while the remaining 75% will be used for training the model.
     - **`random_state=1`**: Setting the random seed ensures that the split is reproducible, meaning that every time this code is run with the same seed, the split will be identical.

3. **Result**:
   - The result of this operation is two separate datasets: 
     - **`training`**: Contains 75% of the `feature_set` data for training the model.
     - **`test`**: Contains 25% of the `feature_set` data for evaluating the model's performance after training.
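
The split can be illustrated on a toy list shaped like `feature_set`. The data below is made up, and the `stratify` argument is shown only as an optional variant that the notebook does not use; it keeps the ham/spam ratio the same in both parts, which can help with imbalanced data.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for feature_set: 16 ham (0) and 4 spam (1) examples
toy_set = [({"free": False}, 0)] * 16 + [({"free": True}, 1)] * 4
toy_labels = [label for _, label in toy_set]

# Same call shape as the notebook: 75% training, 25% test
train, test = train_test_split(toy_set, test_size=0.25, random_state=1)
print(len(train), len(test))

# Variant not used in the notebook: preserve class ratios in both parts
s_train, s_test = train_test_split(
    toy_set, test_size=0.25, random_state=1, stratify=toy_labels
)
spam_in_test = sum(label for _, label in s_test)
print(spam_in_test)
```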


In [18]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

names = ['K Nearest Neighbors', 'Decision Tree', 'Random Forest', 'Logistic Regression', 'SGD Classifier',
         'Naive Bayes', 'Support Vector Classifier']

classifiers = [
    KNeighborsClassifier(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    LogisticRegression(),
    SGDClassifier(max_iter=100),
    MultinomialNB(),
    SVC(kernel='linear')
]

models = zip(names, classifiers)

for name, model in models:
    nltk_model = SklearnClassifier(model)
    nltk_model.train(training)
    accuracy = nltk.classify.accuracy(nltk_model, test)
    print("{} model Accuracy: {}".format(name, accuracy))

K Nearest Neighbors model Accuracy: 0.9454414931801867
Decision Tree model Accuracy: 0.95908111988514
Random Forest model Accuracy: 0.9813352476669059
Logistic Regression model Accuracy: 0.9834888729361091
SGD Classifier model Accuracy: 0.9806173725771715
Naive Bayes model Accuracy: 0.9856424982053122
Support Vector Classifier model Accuracy: 0.9820531227566404


### Code Explanation: Evaluating Multiple Classifiers

1. **Importing Necessary Libraries**:
   - The code imports various classifiers from `sklearn`, including:
     - `KNeighborsClassifier`
     - `DecisionTreeClassifier`
     - `RandomForestClassifier`
     - `LogisticRegression`
     - `SGDClassifier`
     - `MultinomialNB`
     - `SVC` (Support Vector Classifier)
   - It also imports metrics for model evaluation, such as `classification_report`, `accuracy_score`, and `confusion_matrix`.

2. **Setting Up Classifiers**:
   - A list named `names` contains the names of each classifier for easy reference.
   - Another list named `classifiers` holds instances of the corresponding classifier classes.

3. **Combining Names and Classifiers**:
   - The `zip` function is used to pair each classifier name with its respective classifier instance, resulting in an iterable of tuples called `models`.

4. **Training and Evaluating Classifiers**:
   - A loop iterates through each `name` and `model` in `models`:
     - For each classifier, it initializes an `nltk_model` using `SklearnClassifier`, which wraps the scikit-learn classifiers to integrate with NLTK.
     - The model is trained using the `train` method with the `training` dataset.
     - The accuracy of the model is calculated using `nltk.classify.accuracy`, which takes the trained model and the `test` dataset as inputs.
     - The accuracy for each model is printed in a formatted string indicating which model was evaluated.

5. **Output**:
   - The output displays the accuracy for each classifier, allowing for a comparison of their performance on the given dataset.
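
scikit-learn estimators expect numeric matrices, not dictionaries, so something has to convert the feature dictionaries before fitting. The sketch below does that conversion by hand with `DictVectorizer` and fits a `MultinomialNB`. That this mirrors what `SklearnClassifier` does internally is an assumption, not something shown in the notebook, and the four training examples are made up.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical (features, label) pairs in the same shape as the notebook's training list
toy_training = [
    ({"free": True, "call": True, "ok": False}, 1),
    ({"free": True, "call": False, "ok": False}, 1),
    ({"free": False, "call": False, "ok": True}, 0),
    ({"free": False, "call": True, "ok": True}, 0),
]
feature_dicts, y = zip(*toy_training)

# One column per word; True becomes 1.0 and False becomes 0.0
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(feature_dicts)
print(vectorizer.feature_names_, X.shape)

model = MultinomialNB().fit(X, y)

# Classify two unseen messages described by the same features
new_messages = [
    {"free": True, "call": False, "ok": False},
    {"free": False, "call": False, "ok": True},
]
predictions = model.predict(vectorizer.transform(new_messages))
print(predictions.tolist())
```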


In [22]:
from sklearn.ensemble import VotingClassifier

# VotingClassifier expects a list of (name, estimator) tuples, so materialize the zip as a list
models = list(zip(names, classifiers))

nltk_ensemble = SklearnClassifier(VotingClassifier(estimators=models, voting='hard', n_jobs=-1))
nltk_ensemble.train(training)
accuracy = nltk.classify.accuracy(nltk_ensemble, test)
print("Voting Classifier model Accuracy: {}".format(accuracy))

Voting Classifier model Accuracy: 0.9806173725771715


### Code Explanation: Implementing a Voting Classifier

1. **Importing VotingClassifier**:
   - The code imports the `VotingClassifier` from the `sklearn.ensemble` module, which combines multiple models into a single ensemble, with the aim of improving prediction accuracy over the individual models.

2. **Preparing the Models**:
   - The previously created lists `names` and `classifiers` are combined using the `zip` function to create a list of tuples called `models`, where each tuple contains a classifier name and its corresponding instance.

3. **Creating the Voting Classifier**:
   - An instance of `VotingClassifier` is created, where:
     - The `estimators` parameter is set to the `models` list, indicating the individual classifiers that will be part of the ensemble.
     - The `voting` parameter is set to `'hard'`, meaning that the classifier will predict the class label based on the majority vote among the individual classifiers.
     - The `n_jobs` parameter is set to `-1`, allowing the use of all available CPU cores for parallel processing during training.

4. **Training the Ensemble Model**:
   - The ensemble model is wrapped with `SklearnClassifier`, allowing it to integrate with NLTK's classification framework.
   - The `train` method is called with the `training` dataset to fit the ensemble model.

5. **Evaluating the Model**:
   - The accuracy of the ensemble model is computed using `nltk.classify.accuracy`, which takes the trained model and the `test` dataset as inputs.
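   - Finally, the accuracy of the Voting Classifier is printed. At 0.9806 it matches the SGD classifier and falls slightly below the best individual models (Naive Bayes at 0.9856, Logistic Regression at 0.9835), so on this split the ensemble does not improve on them.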
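
Hard voting itself is a small computation: each classifier casts one vote and the most frequent label wins. The following sketch shows the idea with the standard library; `hard_vote` and the vote lists are hypothetical and are not how scikit-learn implements it internally.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical votes from seven classifiers for two messages (0 = ham, 1 = spam)
votes_message_a = [1, 1, 0, 1, 1, 0, 1]
votes_message_b = [0, 0, 1, 0, 0, 0, 1]

result_a = hard_vote(votes_message_a)
result_b = hard_vote(votes_message_b)
print(result_a, result_b)
```

With seven voters and two classes a tie cannot occur, which is one practical reason to combine an odd number of classifiers.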


In [23]:
text_features, labels = zip(*test)
prediction = nltk_ensemble.classify_many(text_features)
print(classification_report(labels, prediction))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1199
           1       0.99      0.87      0.93       194

    accuracy                           0.98      1393
   macro avg       0.98      0.93      0.96      1393
weighted avg       0.98      0.98      0.98      1393



In [24]:
pd.DataFrame( confusion_matrix(labels, prediction),
             index=[['actual', 'actual'], ['ham', 'spam']],
             columns = [['predicted', 'predicted'], ['ham', 'spam']])

             predicted
                   ham  spam
actual ham        1197     2
       spam         25   169


### Code Explanation: Evaluating the Voting Classifier Model

1. **Extracting Test Features and Labels**:
   - The code uses `zip(*test)` to separate the `test` dataset into two tuples: `text_features` and `labels`. 
   - `text_features` contains the feature sets of the SMS messages, while `labels` contains the corresponding true labels (ham or spam).

2. **Making Predictions**:
   - The `classify_many` method of the `nltk_ensemble` model is called with `text_features` to generate predictions for the test dataset. 
   - The predicted labels are stored in the `prediction` variable.

3. **Generating a Classification Report**:
   - The `classification_report` function from `sklearn.metrics` is used to evaluate the performance of the model. 
   - It provides a detailed report including precision, recall, f1-score, and support for each class (ham and spam) based on the true labels (`labels`) and predicted labels (`prediction`).

4. **Creating a Confusion Matrix**:
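   - The `confusion_matrix` function tabulates true labels (rows) against predicted labels (columns). Here, 1,197 of the 1,199 ham messages and 169 of the 194 spam messages are classified correctly; 2 ham messages are flagged as spam (false positives) and 25 spam messages are missed (false negatives).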
   - A Pandas DataFrame is created to display the confusion matrix with appropriate row and column labels for better readability. The matrix includes 'actual' and 'predicted' labels, allowing for easy interpretation of the results.
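
The figures in the classification report can be recomputed from the four cells of the confusion matrix. The sketch below treats spam as the positive class and takes the counts from the output above; the variable names are illustrative.

```python
# Counts taken from the confusion matrix above (spam = positive class)
true_negatives = 1197   # ham predicted as ham
false_positives = 2     # ham predicted as spam
false_negatives = 25    # spam predicted as ham
true_positives = 169    # spam predicted as spam

total = true_negatives + false_positives + false_negatives + true_positives

# Precision: of the messages flagged as spam, how many really are spam
precision = true_positives / (true_positives + false_positives)

# Recall: of the real spam messages, how many were caught
recall = true_positives / (true_positives + false_negatives)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

accuracy = (true_positives + true_negatives) / total

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

The lower recall for spam (0.87) against near-perfect recall for ham reflects the 25 missed spam messages, which overall accuracy alone does not reveal.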
