**FYP - Lua Chong En [20417309]**

**SpamLyte: Lightweight Stacking Approach for Spam Detection**

**Jupyter Notebook Code - Google Colab**

*Pre-requisites*
- Have all dataset files (4,000 to 16,000 rows) - Use file path: n.xlsx, where n is the row count (e.g. 4000.xlsx)
- Have the feature-extracted datasets (TF-IDF) - Use file path: tfidf_file1000-n.xlsx, where n is the dataset size in thousands (e.g. tfidf_file1000-4.xlsx)
- Ensure all files are uploaded to Google Colab (under /content) before executing any cells
- Download all data from the Kaggle link below
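
Before running any cell, it can help to confirm that the expected files are actually present. The sketch below is a minimal check, assuming the files sit in Colab's `/content` directory (the path used by the cells in this notebook). The helper name `missing_files` and the example file list are illustrative and not part of the original notebook.

```python
from pathlib import Path

def missing_files(directory, filenames):
    """Return the names in `filenames` that do not exist inside `directory`."""
    base = Path(directory)
    return [name for name in filenames if not (base / name).is_file()]

# Hypothetical list: the raw 4,000-row dataset and its TF-IDF counterpart
expected = ["4000.xlsx", "tfidf_file1000-4.xlsx"]
missing = missing_files("/content", expected)
if missing:
    print("Upload these files before running the notebook:", missing)
else:
    print("All expected files found.")
```

Extend `expected` with whichever dataset sizes you plan to run.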

---



This notebook is split into 12 sections - use the table of contents to navigate to each section


1. Forcefully install nltk version 3.8.1 due to ongoing dependency problem
2. Data Pre-processing and Feature Extraction
3. Support Vector Machine (Classifier)
4. Random Forest (Classifier)
5. Multinomial Naive Bayes (Classifier)
6. K-Nearest-Neighbors (Classifier)
7. Logistic Regression (Meta-model)
8. Stacking (MNB, SVM, DT) + (LR) - Combination 1
9. Stacking (MNB, KNN, DT) + (LR) - Combination 2
10. Stacking (MNB, Perceptron, DT) + (LR) - Combination 3
11. Stacking (MNB, LightGBM, RF) + (LR) - Combination 4
12. Final Stacking Technique (MNB, KNN, RF) + (LR) - Combination 5


---

Datasets used in this research can be found here (Kaggle): https://kaggle.com/datasets/5cc5fc5934ccd6336b7ef938834d5038ad795ad66927783caceea6ad529cddad


--- Last modified on 30/04/2025 by Lua Chong En ---


# 1. Forcefully install nltk version 3.8.1 due to ongoing dependency problem

In [None]:
#!pip uninstall nltk -y

!pip install nltk==3.8.1

import nltk
print(nltk.__version__)

Collecting nltk==3.8.1
  Downloading nltk-3.8.1-py3-none-any.whl.metadata (2.8 kB)
Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 13.8 MB/s eta 0:00:00
Installing collected packages: nltk
  Attempting uninstall: nltk
    Found existing installation: nltk 3.9.1
    Uninstalling nltk-3.9.1:
      Successfully uninstalled nltk-3.9.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
textblob 0.19.0 requires nltk>=3.9, but you have nltk 3.8.1 which is incompatible.
Successfully installed nltk-3.8.1
3.8.1


# 2. Data Pre-processing and Feature Extraction

In [None]:
#DATA PRE-PROCESSING AND FEATURE EXTRACTION (TF-IDF)

# Import necessary libraries
import time
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize

# Download NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')

# Load dataset
df = pd.read_excel('/content/4000.xlsx', usecols=['text', 'target'])

# Preprocess text - lowercase and remove non-alphabetic characters
def preprocess_text(text):
    if isinstance(text, str):
        text = text.lower()
        text = re.sub(r'[^a-z\s]', '', text)
    else:
        text = ''
    return text

# Check if the word is valid using WordNet
def is_valid_word(word):
    return bool(wordnet.synsets(word))

# Build the lemmatizer and stop-word set once, rather than on every call
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Apply Tokenization, Stop Word Removal, WordNet validity check, Lemmatization
def tokenize_lemmatize_stopword_blacklist(text):
    if not isinstance(text, str):  # Non-string cells (e.g. NaN) become empty text
        return ''

    tokens = word_tokenize(text)

    # Keep only tokens that are not stop words and that WordNet recognises
    lemmatized_tokens = [
        lemmatizer.lemmatize(word)
        for word in tokens
        if word not in stop_words
        and is_valid_word(word)
    ]

    return ' '.join(lemmatized_tokens)

start_time = time.time()

df['text'] = df['text'].apply(preprocess_text)

df['text'] = df['text'].apply(tokenize_lemmatize_stopword_blacklist)

df['target'] = pd.to_numeric(df['target'], errors='coerce')

# Apply TF-IDF Vectorization without limiting features - Feature Extraction
full_vectorizer = TfidfVectorizer()
full_tfidf_matrix = full_vectorizer.fit_transform(df['text'])

# Find the total number of unique features (terms) in the data
total_detected_features = len(full_vectorizer.get_feature_names_out())
print(f"Total number of features detected in the data: {total_detected_features}")

# Apply TF-IDF Vectorization with 1000 features
vectorizer = TfidfVectorizer(max_features=1000)
tfidf_matrix = vectorizer.fit_transform(df['text'])
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Concatenate the original DataFrame with the TF-IDF DataFrame
final_df = pd.concat([df.reset_index(drop=True), tfidf_df.reset_index(drop=True)], axis=1)

# Save the final DataFrame to a new Excel file
final_df.to_excel('/content/tfidf_file1000.xlsx', index=False)

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")

# tfidf_df (built above) already holds the TF-IDF score of every retained term
# for every document, so it is reused here instead of being recomputed.

# Mean TF-IDF score of each term across all documents, highest first.
# Note: this ranks terms by their overall weight in the corpus, not by how well
# they separate spam from non-spam.
average_tfidf_scores = tfidf_df.mean().sort_values(ascending=False)

# Terms with the highest mean TF-IDF score
most_important_words = average_tfidf_scores.head(10)

# Terms with the lowest mean TF-IDF score (among the 1000 retained features)
least_important_words = average_tfidf_scores.tail(10)

# Display the results
print("Most important words for determining spam:")
print(most_important_words)
print("\nLeast important words for determining spam:")
print(least_important_words)
print("\n\nFinished")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Total number of features detected in the data: 16300
Elapsed time: 126.64 seconds
Most important words for determining spam:
received    0.121925
aug         0.109860
sep         0.098344
id          0.090963
wed         0.065102
oct         0.063014
mon         0.062805
fri         0.051524
postfix     0.039254
jalapeno    0.035262
dtype: float64

Least important words for determining spam:
generate     0.001210
targeted     0.001209
po           0.001194
held         0.001158
echo         0.001113
broadcast    0.001108
inquiry      0.001092
ordering     0.000866
judgment     0.000695
cm           0.000427
dtype: float64


Finished


# 3. Support Vector Machine

In [None]:
# SUPPORT VECTOR MACHINE

import pandas as pd
import time
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import psutil  # Import the psutil library for memory usage

start_time = time.time()

# Load feature extracted dataset - TF-IDF
df = pd.read_excel('/content/tfidf_file1000-10.xlsx')

# Separate features (X) and target labels (y)
X = df.drop(columns=['text', 'target'], errors='ignore')
y = df['target']

# Split the data into training and testing sets (70-30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the Support Vector Machine classifier
svm_model = SVC(kernel='linear')

# Train the model
svm_model.fit(X_train, y_train)

# Predict on the test data
y_pred = svm_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred, digits=3)
conf_matrix = confusion_matrix(y_test, y_pred)

ram_usage = psutil.virtual_memory().used / (1024 ** 3)  # System-wide RAM in use (GB), not only this notebook's process

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")

# Print evaluation metrics
print("--- Support Vector Machine Model ---")
print(f"Accuracy: {accuracy:.2%}")
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", conf_matrix)
print("\nSuccessfully detected spam (True Positives):", conf_matrix[1, 1])
print(f"\nRAM Usage: {ram_usage:.2f} GB")

Elapsed time: 211.43 seconds
--- Support Vector Machine Model ---
Accuracy: 95.27%

Classification Report:
               precision    recall  f1-score   support

           0      0.954     0.949     0.951      1463
           1      0.951     0.956     0.954      1537

    accuracy                          0.953      3000
   macro avg      0.953     0.953     0.953      3000
weighted avg      0.953     0.953     0.953      3000


Confusion Matrix:
 [[1388   75]
 [  67 1470]]

Successfully detected spam (True Positives): 1470

RAM Usage: 1.59 GB
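
The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy and the class 1 (spam) precision and recall from the SVM matrix printed above. The helper name `metrics_from_confusion` is illustrative, and the layout assumption (rows are true labels, columns are predicted labels) is the one `sklearn.metrics.confusion_matrix` uses.

```python
def metrics_from_confusion(conf_matrix):
    """Accuracy, plus precision and recall for class 1, from a 2x2 matrix.

    Layout: rows are true labels, columns are predicted labels,
    i.e. [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = conf_matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,   # correct predictions / all predictions
        "precision": tp / (tp + fp),     # of everything flagged as spam, how much was spam
        "recall": tp / (tp + fn),        # of all real spam, how much was flagged
    }

# Confusion matrix printed by the SVM cell
svm = metrics_from_confusion([[1388, 75], [67, 1470]])
print({name: round(value, 3) for name, value in svm.items()})
# prints {'accuracy': 0.953, 'precision': 0.951, 'recall': 0.956}
```

These match the class 1 row and the accuracy line of the report.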


# 4. Random Forest

In [None]:
#RANDOM FOREST

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import psutil  # For monitoring memory usage
import time

start_time = time.time()

# Load feature extracted dataset - TF-IDF
df = pd.read_excel('/content/tfidf_file1000-10.xlsx')

# Separate features (X) and target labels (y)
X = df.drop(columns=['text', 'target'], errors='ignore')
y = df['target']

# Split the data into training and testing sets (70-30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the Random Forest classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

# Train the model
rf_model.fit(X_train, y_train)

# Predict on the test data
y_pred = rf_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred, digits=3)
conf_matrix = confusion_matrix(y_test, y_pred)

ram_usage = psutil.virtual_memory().used / (1024 ** 3)  # RAM usage in GB

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")

# Print evaluation metrics
print("--- Random Forest Model ---")
print("Accuracy:", accuracy)
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", conf_matrix)
print("\nSuccessfully detected spam (True Positives):", conf_matrix[1, 1])
print(f"\nRAM Usage: {ram_usage:.2f} GB")

Elapsed time: 196.91 seconds
--- Random Forest Model ---
Accuracy: 0.9596666666666667

Classification Report:
               precision    recall  f1-score   support

           0      0.958     0.959     0.959      1463
           1      0.961     0.960     0.961      1537

    accuracy                          0.960      3000
   macro avg      0.960     0.960     0.960      3000
weighted avg      0.960     0.960     0.960      3000


Confusion Matrix:
 [[1403   60]
 [  61 1476]]

Successfully detected spam (True Positives): 1476

RAM Usage: 8.34 GB
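
Note that `psutil.virtual_memory().used` reports memory in use across the whole machine, so a reading such as 8.34 GB here versus 1.59 GB for the SVM cell may partly reflect other memory held in the Colab session rather than the model itself. A per-process measurement is one alternative. The sketch below is a minimal version using only the standard library; it assumes a Linux runtime (as on Colab), where `ru_maxrss` is reported in kilobytes. The name `measure` and the toy workload are illustrative.

```python
import resource
import time
from contextlib import contextmanager

@contextmanager
def measure(label, results):
    """Record wall-clock time and this process's peak resident memory."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        # ru_maxrss is the peak resident set size; Linux reports it in kilobytes
        peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        results[label] = {"seconds": elapsed, "peak_rss_gb": peak_kb / (1024 ** 2)}

results = {}
with measure("toy workload", results):
    squares = [i * i for i in range(100_000)]  # stand-in for model.fit(...)

print(results)
```

The peak value is a high-water mark for the lifetime of the process, so comparing models fairly would still require running each one in a fresh runtime.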


# 5. Multinomial Naive Bayes

In [None]:
# MULTINOMIAL NAIVE BAYES

import pandas as pd
import time
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import psutil  # Import the psutil library for memory usage

start_time = time.time()
# Load feature extracted dataset - TF-IDF
df = pd.read_excel('/content/tfidf_file1000-10.xlsx')

# Separate features (X) and target labels (y)
X = df.drop(columns=['text', 'target'], errors='ignore')
y = df['target']

# Split the data into training and testing sets (70-30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the Multinomial Naive Bayes classifier
nb_model = MultinomialNB()

# Train the model
nb_model.fit(X_train, y_train)

# Predict on the test data
y_pred = nb_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
# Set digits=3 to format the classification report to three decimal points
classification_rep = classification_report(y_test, y_pred, digits=3)
conf_matrix = confusion_matrix(y_test, y_pred)

ram_usage = psutil.virtual_memory().used / (1024 ** 3)  # RAM usage in GB

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")

# Print evaluation metrics
print("--- Multinomial Naive Bayes Model ---")
print("Accuracy:", accuracy)
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", conf_matrix)
print("\nSuccessfully detected spam (True Positives):", conf_matrix[1, 1])
print(f"\nRAM Usage: {ram_usage:.2f} GB")

Elapsed time: 181.96 seconds
--- Multinomial Naive Bayes Model ---
Accuracy: 0.9293333333333333

Classification Report:
               precision    recall  f1-score   support

           0      0.911     0.948     0.929      1463
           1      0.949     0.912     0.930      1537

    accuracy                          0.929      3000
   macro avg      0.930     0.930     0.929      3000
weighted avg      0.930     0.929     0.929      3000


Confusion Matrix:
 [[1387   76]
 [ 136 1401]]

Successfully detected spam (True Positives): 1401

RAM Usage: 1.76 GB


# 6. K-Nearest-Neighbors

In [None]:
#K-Nearest-Neighbor (KNN)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import psutil  # Import the psutil library for memory usage
import time

start_time = time.time()
# Load feature extracted dataset - TF-IDF
df = pd.read_excel('/content/tfidf_file1000-10.xlsx')

# Separate features (X) and target labels (y)
X = df.drop(columns=['text', 'target'], errors='ignore')
y = df['target']

# Split the data into training and testing sets (70-30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the K-Nearest Neighbors classifier
knn_model = KNeighborsClassifier(n_neighbors=5)

# Train the model
knn_model.fit(X_train, y_train)

# Predict on the test data
y_pred = knn_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred, digits=3)
conf_matrix = confusion_matrix(y_test, y_pred)

ram_usage = psutil.virtual_memory().used / (1024 ** 3)  # RAM usage in GB

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")

# Print evaluation metrics
print("--- K-Nearest Neighbors Model ---")
print("Accuracy:", accuracy)
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", conf_matrix)
print("\nSuccessfully detected spam (True Positives):", conf_matrix[1, 1])
print(f"\nRAM Usage: {ram_usage:.2f} GB")

Elapsed time: 186.17 seconds
--- K-Nearest Neighbors Model ---
Accuracy: 0.7713333333333333

Classification Report:
               precision    recall  f1-score   support

           0      0.936     0.570     0.709      1463
           1      0.702     0.963     0.812      1537

    accuracy                          0.771      3000
   macro avg      0.819     0.766     0.760      3000
weighted avg      0.816     0.771     0.761      3000


Confusion Matrix:
 [[ 834  629]
 [  57 1480]]

Successfully detected spam (True Positives): 1480

RAM Usage: 1.83 GB


# 7. Logistic Regression (Meta-model)

In [None]:
#LOGISTIC REGRESSION

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import psutil  # Import the psutil library for memory usage
import time

start_time = time.time()

# Load feature extracted dataset - TF-IDF
df = pd.read_excel('/content/tfidf_file1000-16.xlsx')

# Separate features (X) and target labels (y)
X = df.drop(columns=['text', 'target'], errors='ignore')
y = df['target']

# Split the data into training and testing sets (70-30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the Logistic Regression model
log_reg_model = LogisticRegression(random_state=42, max_iter=1000)

# Train the model
log_reg_model.fit(X_train, y_train)

# Predict on the test set
y_pred = log_reg_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred, digits=3)
conf_matrix = confusion_matrix(y_test, y_pred)

ram_usage = psutil.virtual_memory().used / (1024 ** 3)  # RAM usage in GB

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")

# Output results
print("--- Logistic Regression Model ---")
print(f"Accuracy: {accuracy:.2%}")
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", conf_matrix)
print("\nSuccessfully detected spam (True Positives):", conf_matrix[1, 1])
print(f"\nRAM Usage: {ram_usage:.2f} GB")

Elapsed time: 292.79 seconds
--- Logistic Regression Model ---
Accuracy: 94.56%

Classification Report:
               precision    recall  f1-score   support

           0      0.947     0.944     0.945      2388
           1      0.945     0.947     0.946      2412

    accuracy                          0.946      4800
   macro avg      0.946     0.946     0.946      4800
weighted avg      0.946     0.946     0.946      4800


Confusion Matrix:
 [[2254  134]
 [ 127 2285]]

Successfully detected spam (True Positives): 2285

RAM Usage: 2.31 GB


# 8. Stacking (MNB, SVM, DT) + (LR) - Combination 1

In [None]:
#STACKING WITH MNB SVM DT

import pandas as pd
import time
import psutil
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_selection import SelectKBest, chi2

start_time = time.time()

# Load feature extracted dataset - TF-IDF
df = pd.read_excel('/content/tfidf_file1000-4.xlsx', usecols=lambda column: column not in ['text'])

# Prepare features and labels
X = df.drop(columns=['target'], errors='ignore')
y = df['target']

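# Feature selection: keep the 500 features with the highest chi-squared score.
# Note: the selector is fitted on the full dataset before the train-test split,
# so the test labels influence which features are kept.
k_best_features = SelectKBest(chi2, k=500).fit_transform(X, y)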

# Train-Test split (70-30)
X_train, X_test, y_train, y_test = train_test_split(k_best_features, y, test_size=0.3, random_state=42)

# Define base models with fixed hyperparameters (only the meta-model is tuned below)
base_models = [
    ('mnb', MultinomialNB(alpha=0.01)),
    ('svc', SVC(kernel='linear', C=1.0, probability=True)),
    ('dt', DecisionTreeClassifier(max_depth=10, random_state=42))
]

# Define optimized meta-model (Logistic Regression with GridSearch)
param_grid = {
    'solver': ['liblinear', 'lbfgs'],
    'C': [0.1, 1.0, 10]
}

meta_model = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5)

# Initialize StackingClassifier with optimized models
stacking_model = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5, passthrough=True, n_jobs=-1)

# Train the model
stacking_model.fit(X_train, y_train)

# Predict on test set
y_pred = stacking_model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred, digits=3)
conf_matrix = confusion_matrix(y_test, y_pred)
successfully_detected_spam = conf_matrix[1, 1]  # True Positives (Spam detected correctly)

# Measure RAM Usage
ram_usage = psutil.virtual_memory().used / (1024 ** 3)

# Measure elapsed time
end_time = time.time()
elapsed_time = end_time - start_time

# Print results
print(f"Elapsed time: {elapsed_time:.2f} seconds")
print(f"RAM Usage: {ram_usage:.2f} GB")

print("\n--- Optimized Stacking Model ---")
print(f"Accuracy: {accuracy:.2%}")
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", conf_matrix)
print(f"\nSuccessfully detected spam (True Positives): {successfully_detected_spam}")

Elapsed time: 78.08 seconds
RAM Usage: 2.06 GB

--- Optimized Stacking Model ---
Accuracy: 98.58%

Classification Report:
               precision    recall  f1-score   support

           0      0.984     0.986     0.985       578
           1      0.987     0.986     0.986       622

    accuracy                          0.986      1200
   macro avg      0.986     0.986     0.986      1200
weighted avg      0.986     0.986     0.986      1200


Confusion Matrix:
 [[570   8]
 [  9 613]]

Successfully detected spam (True Positives): 613
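
To make the mechanics of stacking concrete, the sketch below reproduces by hand roughly what `StackingClassifier` does with `cv=5` and `passthrough=True`: each base model produces out-of-fold probabilities on the training set, those probabilities (plus the original features) become the meta-model's inputs, and the base models are then refitted on the full training set for prediction. It runs on synthetic data from `make_classification`, not on the notebook's dataset, and uses only two base models to stay short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the TF-IDF matrix (assumption: not the notebook's data)
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X = np.abs(X)  # MultinomialNB requires non-negative features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

base_models = [
    MultinomialNB(alpha=0.01),
    DecisionTreeClassifier(max_depth=10, random_state=42),
]

# Level-1 training features: out-of-fold class-1 probabilities, so the
# meta-model never sees a prediction made on a row the base model trained on
oof_preds = np.column_stack([
    cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for model in base_models
])

# passthrough=True: the original features are appended to the predictions
meta_train = np.hstack([oof_preds, X_train])
meta_model = LogisticRegression(solver="liblinear").fit(meta_train, y_train)

# At prediction time the base models are refitted on the whole training set
test_preds = np.column_stack([
    model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    for model in base_models
])
meta_test = np.hstack([test_preds, X_test])

accuracy = meta_model.score(meta_test, y_test)
print(f"Stacked accuracy on synthetic data: {accuracy:.3f}")
```

The accuracy printed here says nothing about the spam results; the point is only the shape of the data flow between the base models and the meta-model.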


# 9. Stacking (MNB, KNN, DT) + (LR) - Combination 2

In [None]:
#STACKING WITH MNB KNN DT

import pandas as pd
import time
import psutil
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_selection import SelectKBest, chi2

start_time = time.time()

# Load feature extracted dataset - TF-IDF
df = pd.read_excel('/content/tfidf_file1000-4.xlsx', usecols=lambda column: column not in ['text'])

# Prepare features and labels
X = df.drop(columns=['target'], errors='ignore')
y = df['target']

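# Feature selection: keep the 500 features with the highest chi-squared score.
# Note: the selector is fitted on the full dataset before the train-test split,
# so the test labels influence which features are kept.
k_best_features = SelectKBest(chi2, k=500).fit_transform(X, y)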

# Train-Test split (70-30)
X_train, X_test, y_train, y_test = train_test_split(k_best_features, y, test_size=0.3, random_state=42)

# Define base models with fixed hyperparameters
base_models = [
    ('mnb', MultinomialNB(alpha=0.01)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('dt', DecisionTreeClassifier(max_depth=10, random_state=42))
]

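# Define meta-model (Logistic Regression, no grid search in this combination)
meta_model = LogisticRegression(solver='liblinear')

# Initialize StackingClassifier (5-fold CV; passthrough=True also passes the selected features to the meta-model)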
stacking_model = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5, passthrough=True, n_jobs=-1)

# Train the model
stacking_model.fit(X_train, y_train)

# Predict on test set
y_pred = stacking_model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred, digits=3)
conf_matrix = confusion_matrix(y_test, y_pred)
successfully_detected_spam = conf_matrix[1, 1]

# Measure RAM Usage
ram_usage = psutil.virtual_memory().used / (1024 ** 3)

# Measure elapsed time
end_time = time.time()
elapsed_time = end_time - start_time

# Print results
print(f"Elapsed time: {elapsed_time:.2f} seconds")
print(f"RAM Usage: {ram_usage:.2f} GB")

print("\n--- Optimized Stacking Model ---")
print(f"Accuracy: {accuracy:.2%}")
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", conf_matrix)
print(f"\nSuccessfully detected spam (True Positives): {successfully_detected_spam}")

Elapsed time: 78.36 seconds
RAM Usage: 1.84 GB

--- Optimized Stacking Model ---
Accuracy: 98.92%

Classification Report:
               precision    recall  f1-score   support

           0      0.986     0.991     0.989       578
           1      0.992     0.987     0.990       622

    accuracy                          0.989      1200
   macro avg      0.989     0.989     0.989      1200
weighted avg      0.989     0.989     0.989      1200


Confusion Matrix:
 [[573   5]
 [  8 614]]

Successfully detected spam (True Positives): 614
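
In the stacking cells that use `SelectKBest`, the selector is fitted on the whole dataset before the split, so the held-out labels help decide which features survive. One common way to avoid this is to wrap the selector and the stacking model in a `Pipeline`, which fits the selector on the training data only. The sketch below illustrates the pattern on synthetic data with `k=20`; it is an alternative arrangement rather than the notebook's method, and it has not been run on the spam datasets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the TF-IDF matrix (assumption: not the notebook's data)
X, y = make_classification(n_samples=300, n_features=40, random_state=42)
X = np.abs(X)  # chi2 and MultinomialNB both require non-negative features

# Split first, so the test rows play no part in feature selection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

stack = StackingClassifier(
    estimators=[
        ("mnb", MultinomialNB(alpha=0.01)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("dt", DecisionTreeClassifier(max_depth=10, random_state=42)),
    ],
    final_estimator=LogisticRegression(solver="liblinear"),
    cv=5,
    passthrough=True,
)

# The selector is fitted inside the pipeline, on the training data only
model = Pipeline([("select", SelectKBest(chi2, k=20)), ("stack", stack)])
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
n_selected = int(model.named_steps["select"].get_support().sum())
print(f"Selected {n_selected} features, test accuracy {accuracy:.3f}")
```

With this arrangement the same fitted selector is applied to the test rows at prediction time, so no separate `fit_transform` call on the full data is needed.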


# 10. Stacking (MNB, Perceptron, DT) + (LR) - Combination 3

In [None]:
#STACKING WITH MNB Perceptron DT

import pandas as pd
import time
import psutil
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_selection import SelectKBest, chi2

start_time = time.time()

# Load feature extracted dataset - TF-IDF
df = pd.read_excel('/content/tfidf_file1000-4.xlsx', usecols=lambda column: column not in ['text'])

# Prepare features and labels
X = df.drop(columns=['target'], errors='ignore')
y = df['target']

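# Feature selection: keep the 500 features with the highest chi-squared score.
# Note: the selector is fitted on the full dataset before the train-test split,
# so the test labels influence which features are kept.
k_best_features = SelectKBest(chi2, k=500).fit_transform(X, y)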

# Train-Test split (70-30)
X_train, X_test, y_train, y_test = train_test_split(k_best_features, y, test_size=0.3, random_state=42)

# Define base models with fixed hyperparameters (only the meta-model is tuned below)
base_models = [
    ('mnb', MultinomialNB(alpha=0.01)),
    ('perceptron', Perceptron(max_iter=1000, tol=1e-3)),
    ('dt', DecisionTreeClassifier(max_depth=10, random_state=42))
]

# Define optimized meta-model (Logistic Regression with GridSearch)
param_grid = {
    'solver': ['liblinear', 'lbfgs'],
    'C': [0.1, 1.0, 10]
}

meta_model = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5)

# Initialize StackingClassifier with optimized models
stacking_model = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5, passthrough=True, n_jobs=-1)

# Train the model
stacking_model.fit(X_train, y_train)

# Predict on test set
y_pred = stacking_model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred, digits=3)
conf_matrix = confusion_matrix(y_test, y_pred)
successfully_detected_spam = conf_matrix[1, 1]

# Measure RAM Usage
ram_usage = psutil.virtual_memory().used / (1024 ** 3)

# Measure elapsed time
end_time = time.time()
elapsed_time = end_time - start_time

# Print results
print(f"Elapsed time: {elapsed_time:.2f} seconds")
print(f"RAM Usage: {ram_usage:.2f} GB")

print("\n--- Optimized Stacking Model ---")
print(f"Accuracy: {accuracy:.2%}")
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", conf_matrix)
print(f"\nSuccessfully detected spam (True Positives): {successfully_detected_spam}")

Elapsed time: 64.06 seconds
RAM Usage: 1.91 GB

--- Optimized Stacking Model ---
Accuracy: 98.08%

Classification Report:
               precision    recall  f1-score   support

           0      0.968     0.993     0.980       578
           1      0.993     0.969     0.981       622

    accuracy                          0.981      1200
   macro avg      0.981     0.981     0.981      1200
weighted avg      0.981     0.981     0.981      1200


Confusion Matrix:
 [[574   4]
 [ 19 603]]

Successfully detected spam (True Positives): 603


# 11. Stacking (MNB, LightGBM, RF) + (LR) - Combination 4

In [None]:
#STACKING WITH MNB LightGBM RF

import pandas as pd
import time
import psutil
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
import lightgbm as lgb
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

start_time = time.time()

# Load feature extracted dataset - TF-IDF
df = pd.read_excel('/content/tfidf_file1000-4.xlsx', usecols=lambda column: column not in ['text'])

# Prepare features and labels
X = df.drop(columns=['target'], errors='ignore')
y = df['target']

# Train-Test split (70-30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base models
base_models = [
    ('mnb', MultinomialNB(alpha=0.01)),
    ('lightgbm', lgb.LGBMClassifier()),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42))
]

# Define simple meta-model
meta_model = LogisticRegression(solver='liblinear')

# Initialize StackingClassifier without passthrough
stacking_model = StackingClassifier(base_models, meta_model, cv=3)

# Train the model
stacking_model.fit(X_train, y_train)

# Predict on test set
y_pred = stacking_model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred, digits=3)
conf_matrix = confusion_matrix(y_test, y_pred)
successfully_detected_spam = conf_matrix[1, 1]

# Measure RAM Usage
ram_usage = psutil.virtual_memory().used / (1024 ** 3)

# Measure elapsed time
end_time = time.time()
elapsed_time = end_time - start_time

# Print results
print(f"Elapsed time: {elapsed_time:.2f} seconds")
print(f"RAM Usage: {ram_usage:.2f} GB")

print("\n--- Simplified Stacking Model ---")
print(f"Accuracy: {accuracy:.2%}")
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", conf_matrix)
print(f"\nSuccessfully detected spam (True Positives): {successfully_detected_spam}")

[LightGBM] [Info] Number of positive: 1378, number of negative: 1422
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.029549 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 49560
[LightGBM] [Info] Number of data points in the train set: 2800, number of used features: 990
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.492143 -> initscore=-0.031431
[LightGBM] [Info] Start training from score -0.031431
[LightGBM] [Info] Number of positive: 918, number of negative: 948
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.025374 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 35252
[LightGBM] [Info] Number of data points in the train set: 1866, number of used features: 978
[LightGBM] [Info] [bin

# 12. Final Stacking Technique (MNB, KNN, RF) + (LR) - Combination 5

In [None]:
#STACKING TECHNIQUE

import pandas as pd
import time
import psutil
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

start_time = time.time()

# Load feature extracted dataset - TF-IDF
df = pd.read_excel('/content/tfidf_file1000-4.xlsx', usecols=lambda column: column not in ['text'])

# Prepare features and labels
X = df.drop(columns=['target'], errors='ignore')
y = df['target']

# Train-Test split (70-30)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base models
base_models = [
    ('mnb', MultinomialNB(alpha=0.01)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42))
]

# Define meta-model
meta_model = LogisticRegression(solver='liblinear', random_state=42)

# Initialize StackingClassifier
stacking_model = StackingClassifier(base_models, meta_model, cv=3, passthrough=True)

# Train the model
stacking_model.fit(X_train, y_train)

# Predict on test set
y_pred = stacking_model.predict(X_test)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred, digits=3)
conf_matrix = confusion_matrix(y_test, y_pred)
successfully_detected_spam = conf_matrix[1, 1]  # True Positives (Spam detected correctly)

# Measure RAM Usage
ram_usage = psutil.virtual_memory().used / (1024 ** 3)

# Measure elapsed time
end_time = time.time()
elapsed_time = end_time - start_time

# Print results
print(f"Elapsed time: {elapsed_time:.2f} seconds")
print(f"RAM Usage: {ram_usage:.2f} GB")
print("\n--- Simplified Stacking Model ---")
print(f"Accuracy: {accuracy:.2%}")
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", conf_matrix)
print(f"\nSuccessfully detected spam (True Positives): {successfully_detected_spam}")

Elapsed time: 66.31 seconds
RAM Usage: 2.03 GB

--- Simplified Stacking Model ---
Accuracy: 99.08%

Classification Report:
               precision    recall  f1-score   support

           0      0.990     0.991     0.990       578
           1      0.992     0.990     0.991       622

    accuracy                          0.991      1200
   macro avg      0.991     0.991     0.991      1200
weighted avg      0.991     0.991     0.991      1200


Confusion Matrix:
 [[573   5]
 [  6 616]]

Successfully detected spam (True Positives): 616
