<a href="https://colab.research.google.com/github/armandossrecife/teste/blob/main/my_spam_detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# My E-mail spam detector

Here's a breakdown of the main steps to implement a spam email identifier using Python:

**1. Data Acquisition and Preprocessing:**

-   **Obtain Dataset:** You'll need a labeled dataset of emails categorized as spam or not spam. Public datasets like [UCI Spam Email Dataset](https://archive.ics.uci.edu/dataset/94/spambase) can be a good starting point.
-   **Import Libraries:** Use libraries like pandas for data manipulation, NumPy for numerical computations, and scikit-learn for machine learning tasks.
-   **Load and Clean Data:** Load the email data, handle missing values, and remove irrelevant information.
-   **Feature Engineering:** Extract features from the emails that might be informative for spam classification. This could involve:
    -   **Text Processing:** Convert text to lowercase, remove punctuation and stop words (common words that provide little meaning).
    -   **Bag-of-Words:** Create a feature vector representing word frequency in each email.
    -   **Additional Features:** Consider features like presence of URLs, uppercase characters, exclamation marks, etc.

**2. Model Training and Evaluation:**

-   **Split Data:** Divide your data into training and testing sets using `train_test_split` from scikit-learn. The training data will be used to train the model, and the testing data will be used to evaluate its performance.
-   **Choose a Model:** Select a suitable machine learning algorithm for classification. Common choices for spam filtering include Naive Bayes, Support Vector Machines (SVM), or Random Forest.
-   **Train the Model:** Train your chosen model on the training data using the `fit` method from scikit-learn.
-   **Evaluate Performance:** Use the testing data to assess the model's accuracy with metrics like precision, recall, and F1-score. These metrics tell you how well the model identifies spam and avoids classifying legitimate emails as spam.

**3. Prediction and Deployment (Optional):**

-   **Predict on New Emails:** Once satisfied with the model's performance, you can use the `predict` method to classify new unseen emails as spam or not spam.
-   **Deployment (Optional):** For real-world use, you can integrate your model into an email client or web application to automatically classify incoming emails.

**Note:** This is a simplified overview. Each step can involve further details and optimizations depending on your chosen libraries, model, and desired functionalities.

In [8]:
!wget https://raw.githubusercontent.com/armandossrecife/teste/main/spam_or_not_spam.csv.zip

--2024-07-06 22:13:51--  https://raw.githubusercontent.com/armandossrecife/teste/main/spam_or_not_spam.csv.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1212263 (1.2M) [application/zip]
Saving to: ‘spam_or_not_spam.csv.zip’


2024-07-06 22:13:51 (25.8 MB/s) - ‘spam_or_not_spam.csv.zip’ saved [1212263/1212263]



In [9]:
!unzip spam_or_not_spam.csv.zip

Archive:  spam_or_not_spam.csv.zip
  inflating: spam_or_not_spam.csv    


In [3]:
import pandas as pd

In [10]:
df_emails = pd.read_csv('spam_or_not_spam.csv')
df_emails

Unnamed: 0,email,label
0,date wed NUMBER aug NUMBER NUMBER NUMBER NUMB...,0
1,martin a posted tassos papadopoulos the greek ...,0
2,man threatens explosion in moscow thursday aug...,0
3,klez the virus that won t die already the most...,0
4,in adding cream to spaghetti carbonara which ...,0
...,...,...
2995,abc s good morning america ranks it the NUMBE...,1
2996,hyperlink hyperlink hyperlink let mortgage le...,1
2997,thank you for shopping with us gifts for all ...,1
2998,the famous ebay marketing e course learn to s...,1


In [12]:
df_emails.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   email   2999 non-null   object
 1   label   3000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 47.0+ KB


In [41]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split  # For splitting data
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [19]:
content = """a
an
the
of
and
to
in
I
on
my
is
for
at
but
be
with
that
as
you
do
it
this
from
they
are
have
or
which
one
all
would
there
their
what
so
up
out
if
about
who
get
which
go
me
when
make
can
like
time
no
just
him
know
take
people
into
year
your
good
some
could
them
see
other
than
then
now
only
come
its
over
think
also
back
after
use
two
how
our
work
first
well
way
even
new
want
because
any
these
give
day
most
us
found
make
world
over
must
more
use
go
up
out
if
about
who
get
which
go
me
when
make
can
like
time
no
just
him
know
take
people
into
year
your
good
some
could
them
see
other
than"""

with open('stopwords.txt', 'w') as f:
    f.write(content)

In [20]:
def preprocess_text(text):
  """
  This function preprocesses text data for spam classification.

  Args:
      text: A string containing the email text.

  Returns:
      A string containing the preprocessed text.
  """
  # Lowercase conversion
  text = text.lower()
  # Remove punctuation
  text = ''.join([char for char in text if char.isalnum() or char.isspace()])
  # Remove stop words (replace with your stopwords list)
  stopwords = set(open('stopwords.txt', 'r').read().split())  # Replace with your stopwords file path
  text = ' '.join([word for word in text.split() if word not in stopwords])
  return text

def create_bag_of_words(emails):
  """
  This function creates a bag-of-words representation of emails.

  Args:
      emails: A list of strings containing the email text.

  Returns:
      A sparse matrix representing the bag-of-words features and
      a vocabulary list containing the unique words found in the emails.
  """
  from sklearn.feature_extraction.text import CountVectorizer

  vectorizer = CountVectorizer()
  features = vectorizer.fit_transform(emails)
  vocabulary = vectorizer.get_feature_names_out()
  return features, vocabulary


In [22]:
df_emails['email'] = df_emails['email'].astype(str)

In [23]:
# Apply the preprocess_text function to the 'email' column using a lambda function
df_emails['preprocessed_text'] = df_emails['email'].apply(preprocess_text)

In [24]:
df_emails

Unnamed: 0,email,label,preprocessed_text
0,date wed NUMBER aug NUMBER NUMBER NUMBER NUMB...,0,date wed number aug number number number numbe...
1,martin a posted tassos papadopoulos the greek ...,0,martin posted tassos papadopoulos greek sculpt...
2,man threatens explosion in moscow thursday aug...,0,man threatens explosion moscow thursday august...
3,klez the virus that won t die already the most...,0,klez virus won t die already prolific virus ev...
4,in adding cream to spaghetti carbonara which ...,0,adding cream spaghetti carbonara has same effe...
...,...,...,...
2995,abc s good morning america ranks it the NUMBE...,1,abc s morning america ranks number christmas t...
2996,hyperlink hyperlink hyperlink let mortgage le...,1,hyperlink hyperlink hyperlink let mortgage len...
2997,thank you for shopping with us gifts for all ...,1,thank shopping gifts occasions free gift numbe...
2998,the famous ebay marketing e course learn to s...,1,famous ebay marketing e course learn sell comp...


In [42]:
# Assuming your dataframe is named df_emails and the 'email' column contains the text data

# 1. Preprocess the email text (optional, but recommended)
# You can use the preprocess_text function from previous examples

# df_emails['preprocessed_text'] = df_emails['email'].apply(preprocess_text)
emails = df_emails['preprocessed_text']  # Use the preprocessed text if applied

# 2. Extract features using CountVectorizer
# emails = df_emails['email']  # Use the original email text for this example

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(emails)
vocabulary = vectorizer.get_feature_names_out()

# Print results (example)
print("Shape of the feature matrix:", features.shape)
print("Sample of features:", features[0].toarray())
print(f"Len of vocabulary: {len(vocabulary)}")
print("Vocabulary (feature names):", vocabulary[:1000])  # Print the first 10 words

Shape of the feature matrix: (3000, 33712)
Sample of features: [[0 0 0 ... 0 0 0]]
Len of vocabulary: 33712
Vocabulary (feature names): ['aa'
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacuxrmplfnumberfhxl'
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacnumbermmvznumbercjnumberzld'
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacunumberwlcunumberwmnumberdlo'
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabcnumberfudnumberhgknumberxt'
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacuxrmplfnumberfhxlnumbermh'
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaafcvwyfk'
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacuxrmplfnumberfhxlnumbermhyv'
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacnlzcunnumberljnumberfetbkvts'
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaknumberesw'
 'aaaaaaaaaaaa

In [43]:
def split_data(X, y, test_size=0.2, random_state=42):
  """
  This function splits features (X) and labels (y) into training and testing sets.

  Args:
      X: A numpy array or pandas dataframe containing the features.
      y: A numpy array containing the labels.
      test_size: Float between 0.0 and 1.0 (default=0.2) representing the proportion of data
                 allocated to the testing set.
      random_state: Integer seed for random number generation (default=42) to ensure
                     reproducibility of the split.

  Returns:
      A tuple containing four numpy arrays: X_train, X_test, y_train, y_test representing
      the training and testing sets for features and labels.
  """
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
  return X_train, X_test, y_train, y_test


In [46]:
# Assuming your dataframe is named df_emails and has columns 'email' and 'label'

# 1. Feature engineering (optional, but recommended)
# You can use the previously defined functions for preprocessing and bag-of-words creation

# df_emails['preprocessed_text'] = df_emails['email'].apply(preprocess_text)
emails = df_emails['preprocessed_text']

# Alternatively, use the original email content for features
# emails = df_emails['email']

# 2. Create features (using CountVectorizer for this example)
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(emails)

# 3. Split features and labels using split_data function
X_train, X_test, y_train, y_test = split_data(features, df_emails['label'])

In [47]:
# Assuming you have your training features (X_train) and training labels (y_train)

model = SVC()  # Create an SVC instance

In [48]:
model.fit(X_train, y_train)  # Train the model on the training data


In [49]:
y_pred = model.predict(X_test)  # Predict labels for the testing data


In [50]:
accuracy = accuracy_score(y_test, y_pred)
print("Model accuracy:", accuracy)


Model accuracy: 0.9583333333333334


In [55]:
from sklearn.metrics import accuracy_score

# Assuming you have trained a model (model), testing features (X_test),
# testing labels (y_test), and obtained predictions (y_pred)

# 1. Evaluate Model Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# 2. Analyze Predictions (Example: Top 5 Misclassified)

# Identify indices of misclassified emails (handle potential errors)
try:
  misclassified_indices = (y_pred != y_test).values.nonzero()[0]
except (IndexError, KeyError):
  print("Warning: Potential issues with dataframe indexing. Check your code and data.")
  misclassified_indices = []  # Set to empty list to avoid further errors

# Print top 5 misclassified emails (replace 5 with desired number)
for i in range(min(len(misclassified_indices), 5)):
  index = misclassified_indices[i]

  # Check if index is valid before accessing data
  if 0 <= index < len(df_emails):  # Handle potential out-of-bounds indices
    email_text = df_emails.loc[index, 'email']  # Assuming you have email text stored
    predicted_label = y_pred[index]
    actual_label = y_test[index]
    print(f"Email {i+1}:")
    print(f"  - Predicted: {predicted_label} (Spam)") if predicted_label == 1 else print(f"  - Predicted: {predicted_label} (Not Spam)")
    print(f"  - Actual: {actual_label} (Spam)") if actual_label == 1 else print(f"  - Actual: {actual_label} (Not Spam)")
    print(f"  - Email Text:\n{email_text[:100]}...")  # Print first 100 characters of the email
  else:
    print(f"Warning: Skipping invalid index {index}")


Model Accuracy: 0.9583333333333334


KeyError: 18