# #########################################################
# Name: Deepti Lalwani
### Date: 11 December 2023
# Company: Oasis Infobyte
### Description: Python Code Showcase for EMAIL SPAM DETECTION (Data Science)

This Python script demonstrates a specific aspect of my work at Oasis Infobyte.
# #########################################################


In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report



In [2]:
df = pd.read_csv('/Users/deeptilalwani/Documents/Data Science/oasis infobyte internship/task 3/spam.csv', encoding='ISO-8859-1')

In [3]:
df.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |


In [4]:
X = df['v2']
y = df['v1'].map({'ham': 0, 'spam': 1})  # Map labels to numerical values

In [5]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Feature extraction using CountVectorizer

In [6]:
# Convert the text data into a matrix of token counts
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
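
To make the fit/transform split above concrete, here is a minimal sketch on a few hypothetical toy messages (they are not taken from spam.csv, and the `toy_*` names are illustrative only). It shows that the vocabulary is learned from the training texts and that words seen only at test time are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, not taken from spam.csv
toy_train = ["free entry win prize", "ok see you later", "win free prize now"]
toy_test = ["free lunch later"]

toy_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary from the training texts and counts tokens
toy_train_counts = toy_vectorizer.fit_transform(toy_train)
# transform reuses that vocabulary; "lunch" was never seen, so it is ignored
toy_test_counts = toy_vectorizer.transform(toy_test)

# Columns of the count matrix follow the alphabetically sorted vocabulary
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(toy_train_counts.toarray())
print(toy_test_counts.toarray())
```

Fitting only on the training texts keeps information from the test messages out of the features.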


### Train a Naive Bayes classifier

In [7]:
# Train a Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train_vec, y_train)


### Make predictions and evaluate the model

In [8]:
# Make predictions on the test set
y_pred = classifier.predict(X_test_vec)

In [9]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Display classification report
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.98
Classification Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99       965
           1       0.99      0.89      0.94       150

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115



#### Precision: Precision is the ratio of correctly predicted positive observations to all predicted positives. It is high for both classes (0.98 for non-spam, 0.99 for spam): when the model predicts that an email is spam, it is correct 99% of the time.

#### Recall: Recall is the ratio of correctly predicted positive observations to all observations that actually belong to the class. The recall for spam (class 1) is lower at 89%, meaning the model catches 89% of the actual spam emails and misses the remaining 11%.

#### F1-score: The F1-score is the harmonic mean of precision and recall, so it is high only when both are high. The F1-scores are 0.99 for non-spam and 0.94 for spam.

#### Support: Support is the number of actual occurrences of each class in the evaluated dataset, here the test set: 965 non-spam and 150 spam emails.

#### Macro Avg and Weighted Avg: These are averages calculated for precision, recall, and F1-score. Macro avg gives equal weight to all classes, while weighted avg considers the number of samples for each class.
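
The definitions above can be checked by hand. The sketch below uses ten hypothetical labels (not the notebook's test set) and plain Python to count true positives, false positives and false negatives for the spam class, then derives precision, recall and F1 from them.

```python
# Hypothetical labels for ten emails (1 = spam, 0 = ham); not the real test set
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham wrongly flagged as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that slipped through

precision = tp / (tp + fp)   # share of flagged emails that really are spam
recall = tp / (tp + fn)      # share of actual spam that was flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than a plain average would.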



In [10]:
# Recall for spam (0.89) could be higher.
# Next we tune hyperparameters and try a different algorithm to see if recall improves without sacrificing too much precision.

### Naive Bayes Hyperparameter Tuning:

In [11]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

In [12]:
# Naive Bayes Hyperparameter Tuning
param_grid_nb = {
    'alpha': [0.01, 0.1, 0.5, 1.0, 2.0],  # Smoothing parameter
    'fit_prior': [True, False]  # Whether to learn class prior probabilities
}

nb_classifier = MultinomialNB()

grid_search_nb = GridSearchCV(nb_classifier, param_grid_nb, cv=3, scoring='accuracy')
grid_search_nb.fit(X_train_vec, y_train)

best_params_nb = grid_search_nb.best_params_
best_nb_classifier = MultinomialNB(**best_params_nb)
best_nb_classifier.fit(X_train_vec, y_train)
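
The grid search above ranks candidates by accuracy. Since the stated goal is higher recall for spam, one option is to rank them by recall instead, using the built-in `scoring="recall"` scorer (which scores the positive class, label 1). The sketch below is an assumption-laden illustration on a hypothetical toy corpus; `toy_messages`, `toy_labels` and `recall_search` are made-up names, and results on spam.csv may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus (1 = spam, 0 = ham); not taken from spam.csv
toy_messages = [
    "win a free prize now", "free entry claim your prize",
    "urgent call now to win cash", "claim free cash today",
    "win cash prize call now", "free prize urgent claim",
    "see you at lunch", "are you coming home later",
    "ok call me later", "lunch at home today",
    "see you later ok", "coming to lunch now",
]
toy_labels = [1] * 6 + [0] * 6

toy_counts = CountVectorizer().fit_transform(toy_messages)

# Same parameter grid as above, but candidates are ranked by spam recall
recall_search = GridSearchCV(
    MultinomialNB(),
    {"alpha": [0.01, 0.1, 0.5, 1.0, 2.0], "fit_prior": [True, False]},
    cv=3,
    scoring="recall",
)
recall_search.fit(toy_counts, toy_labels)

print(recall_search.best_params_)
print(f"Mean cross-validated recall: {recall_search.best_score_:.2f}")
```

Optimising recall alone can lower precision, so it is worth checking the full classification report for the selected model afterwards.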


In [13]:
# Make predictions on the test set
nb_y_pred = best_nb_classifier.predict(X_test_vec)

In [14]:
# Evaluate the Naive Bayes model with best parameters
nb_accuracy = accuracy_score(y_test, nb_y_pred)
print(f"Naive Bayes Accuracy with Hyperparameter Tuning: {nb_accuracy:.2f}")

# Display Naive Bayes classification report
print("Naive Bayes Classification Report with Hyperparameter Tuning:\n", classification_report(y_test, nb_y_pred))


Naive Bayes Accuracy with Hyperparameter Tuning: 0.98
Naive Bayes Classification Report with Hyperparameter Tuning:
               precision    recall  f1-score   support

           0       0.99      1.00      0.99       965
           1       0.97      0.91      0.94       150

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.97      1115
weighted avg       0.98      0.98      0.98      1115



### SVM classifier:

In [15]:
pip install scikit-learn


Note: you may need to restart the kernel to use updated packages.


In [16]:
from sklearn.svm import SVC

# Train a Support Vector Machine (SVM) classifier
svm_classifier = SVC()
svm_classifier.fit(X_train_vec, y_train)


In [17]:
# Make predictions and evaluate the SVM model
svm_y_pred = svm_classifier.predict(X_test_vec)


In [18]:
# Evaluate the SVM model
svm_accuracy = accuracy_score(y_test, svm_y_pred)
print(f"SVM Accuracy: {svm_accuracy:.2f}")

SVM Accuracy: 0.98


In [19]:
# Display SVM classification report
print("SVM Classification Report:\n", classification_report(y_test, svm_y_pred))

SVM Classification Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99       965
           1       1.00      0.85      0.92       150

    accuracy                           0.98      1115
   macro avg       0.99      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115



#### Precision: The precision for spam (class 1) rounds to 1.00, meaning that on this test set virtually every email the model predicts as spam really is spam.

#### Recall: The recall for spam is 85%, indicating that the model correctly identifies 85% of the actual spam emails.

#### F1-score: The F1-score for spam is 0.92, which is a good balance between precision and recall.

#### Support: The support values represent the number of actual occurrences of each class in the test set.

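#### Macro Avg and Weighted Avg: These metrics average the per-class values. The macro-average recall (0.92) is lower than the weighted-average recall (0.98) because macro avg counts the weaker spam recall equally with non-spam recall, while weighted avg is dominated by the 965 non-spam emails.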
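
As a worked example of the two averages, the sketch below recomputes them for recall from the rounded per-class values in the SVM report (1.00 and 0.85) and the supports (965 and 150). The variable names are illustrative, and because the inputs are already rounded, the results only approximate what the report prints.

```python
# Per-class recall and support, copied from the SVM classification report
recall_by_class = {0: 1.00, 1: 0.85}
support_by_class = {0: 965, 1: 150}

# Macro average: every class counts equally
macro_recall = sum(recall_by_class.values()) / len(recall_by_class)

# Weighted average: every class counts in proportion to its support
total_support = sum(support_by_class.values())
weighted_recall = sum(
    recall_by_class[c] * support_by_class[c] for c in recall_by_class
) / total_support

print(f"macro recall:    {macro_recall:.3f}")
print(f"weighted recall: {weighted_recall:.3f}")
```

The macro value comes out near 0.925 and the weighted value near 0.98. The report shows 0.92 for macro recall, presumably because it averages the unrounded per-class values.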

In [20]:
# using TF-IDF Vectorizer:

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV

# Convert the text data into a matrix of TF-IDF features
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
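
To illustrate how TF-IDF differs from raw counts, here is a small sketch on three hypothetical documents (the `toy_*` names are illustrative). With the default settings, a word that appears in every document receives a lower weight than a word that appears in only one, and each row is scaled to unit length.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents; "call" appears in all of them, "free" in only one
toy_docs = ["free prize call", "call me later", "call home now"]

toy_tfidf = TfidfVectorizer()
toy_weights = toy_tfidf.fit_transform(toy_docs).toarray()
column = toy_tfidf.vocabulary_  # maps each term to its column index

first_doc = toy_weights[0]
call_weight = first_doc[column["call"]]
free_weight = first_doc[column["free"]]

# The common word is down-weighted relative to the rare one
print(f"call: {call_weight:.3f}, free: {free_weight:.3f}")
# Rows are L2-normalised by default, so the squared weights sum to 1
print(f"squared norm of first row: {(first_doc ** 2).sum():.3f}")
```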



### SVM Hyperparameter Tuning using Randomized Search

In [25]:
param_dist_svm = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
}


In [26]:
svm_classifier = SVC()

random_search_svm = RandomizedSearchCV(svm_classifier, param_distributions=param_dist_svm, n_iter=10, cv=3, scoring='accuracy', random_state=42)
random_search_svm.fit(X_train_tfidf, y_train)
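
In the search above, the TF-IDF vectorizer was fitted on the full training set before cross-validation, so each validation fold has already contributed to the vocabulary and IDF weights. One possible alternative, sketched here on a hypothetical toy corpus, wraps the vectorizer and the SVM in a scikit-learn `Pipeline` so the vectorizer is re-fitted inside every fold. The names `toy_messages`, `toy_labels`, `svm_pipeline` and `pipeline_search` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus (1 = spam, 0 = ham); not taken from spam.csv
toy_messages = [
    "win a free prize now", "free entry claim your prize",
    "urgent call now to win cash", "claim free cash today",
    "win cash prize call now", "free prize urgent claim",
    "see you at lunch", "are you coming home later",
    "ok call me later", "lunch at home today",
    "see you later ok", "coming to lunch now",
]
toy_labels = [1] * 6 + [0] * 6

svm_pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svc", SVC())])

# Pipeline parameters are addressed as <step name>__<parameter>
param_dist_pipeline = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__kernel": ["linear", "rbf", "poly"],
    "svc__gamma": ["scale", "auto"],
}

pipeline_search = RandomizedSearchCV(
    svm_pipeline,
    param_distributions=param_dist_pipeline,
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
# Raw text goes in; the vectorizer is fitted separately within each CV fold
pipeline_search.fit(toy_messages, toy_labels)

print(pipeline_search.best_params_)
toy_prediction = pipeline_search.predict(["free cash prize"])
print(toy_prediction)
```

With the default `refit=True`, the fitted search object can predict directly on raw text, so a separate retraining step is not needed.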


In [27]:
# Get the best parameters for SVM from randomized search
best_params_svm_random = random_search_svm.best_params_


In [28]:
# Train the SVM classifier with the best parameters from randomized search
best_svm_classifier_random = SVC(**best_params_svm_random)
best_svm_classifier_random.fit(X_train_tfidf, y_train)

In [29]:
# Make predictions on the test set
svm_y_pred_random = best_svm_classifier_random.predict(X_test_tfidf)

In [30]:
# Evaluate the SVM model with best parameters from randomized search
svm_accuracy_random = accuracy_score(y_test, svm_y_pred_random)
print(f"SVM Accuracy with Randomized Search: {svm_accuracy_random:.2f}")

# Display SVM classification report from randomized search
print("SVM Classification Report with Randomized Search:\n", classification_report(y_test, svm_y_pred_random))

SVM Accuracy with Randomized Search: 0.98
SVM Classification Report with Randomized Search:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99       965
           1       0.97      0.89      0.93       150

    accuracy                           0.98      1115
   macro avg       0.98      0.94      0.96      1115
weighted avg       0.98      0.98      0.98      1115



### Naive Bayes:
### Before Hyperparameter Tuning:
##### High accuracy (98%).
##### High precision and F1-score for both classes; recall for class 1 (spam) is lower at 0.89.

### After Hyperparameter Tuning:
##### Similar accuracy (98%).
##### Recall for class 1 (spam) rose slightly (0.89 to 0.91) while its precision fell (0.99 to 0.97); its F1-score stayed at 0.94.

### SVM:
### Before Hyperparameter Tuning:
##### High accuracy (98%).
##### High precision, recall, and F1-score for class 0 (non-spam).
##### Lower recall for class 1 (spam) at 0.85, with spam precision of 1.00.

### After Hyperparameter Tuning with TF-IDF:
##### Similar accuracy (98%).
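##### Recall for class 1 (spam) improved (0.85 to 0.89) while its precision fell (1.00 to 0.97). Because the features also changed from token counts to TF-IDF, the gain cannot be attributed to tuning alone.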


### Observations:
#### Both models perform well, with 98% accuracy in every configuration.
#### Hyperparameter tuning raised recall for spam (class 1) in both models, at the cost of slightly lower spam precision.