In [None]:
### Checking Folder Existence
This code snippet checks whether a specified folder exists on the system. The path to the folder is defined, and the `os.path.exists()` function is used to verify its existence. If the folder is found, it prints "Folder found!"; otherwise, it prints "Folder not found. Please check the path."

In [None]:
import os

# Set the path to the new folder location
folder_path = r"OneDrive/Desktop/WVU/CYBR_520/hard_ham"

# Check if the path exists
if os.path.exists(folder_path):
    print("Folder found!")
else:
    print("Folder not found. Please check the path.")


In [None]:
### Process Email Files in a Folder and Save as a CSV
This code performs the following steps:
1. **Folder Path Setup**:
   - Sets the path to a folder containing individual email files.

2. **Reading Email Content**:
   - Iterates through all files in the specified folder using `os.listdir()`.
   - Reads the content of each file and appends it to a list named `emails`.

3. **Labeling Emails**:
   - Assigns the label "ham" to all emails (assuming the folder contains only ham emails).

4. **Creating a DataFrame**:
   - Combines the email content and labels into a pandas DataFrame with columns `text` (email content) and `label` (email category).

5. **Saving as CSV**:
   - Saves the DataFrame to a CSV file named `hard_ham_emails.csv`.
   - Prints a confirmation message with the total number of emails processed.
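
The steps above can be sketched as a small reusable function. This is a minimal illustration rather than the notebook's exact code: the helper name `load_emails` is hypothetical, and it additionally skips subfolders, which `os.listdir()` also returns. The demo writes two made-up files to a temporary folder so the sketch runs without the real dataset.

```python
import os
import tempfile

def load_emails(folder_path, label):
    """Read every regular file in folder_path and pair it with label (hypothetical helper)."""
    emails, labels = [], []
    for filename in sorted(os.listdir(folder_path)):
        file_path = os.path.join(folder_path, filename)
        if not os.path.isfile(file_path):
            continue  # os.listdir() also returns subfolders; skip them
        with open(file_path, "r", encoding="latin1") as file:
            emails.append(file.read())
        labels.append(label)
    return emails, labels

# Demo on a throwaway folder with made-up content
with tempfile.TemporaryDirectory() as demo_folder:
    for name, body in [("0001.txt", "Hello"), ("0002.txt", "Meeting at noon")]:
        with open(os.path.join(demo_folder, name), "w", encoding="latin1") as f:
            f.write(body)
    os.mkdir(os.path.join(demo_folder, "nested"))  # ignored by load_emails
    demo_emails, demo_labels = load_emails(demo_folder, "ham")

print(len(demo_emails), demo_labels)
```

The two returned lists can then be combined into a DataFrame with `text` and `label` columns, as described in step 4.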

In [None]:
import os
import pandas as pd

# Set the correct path to the folder containing the email files
folder_path = r"OneDrive/Desktop/WVU/CYBR_520/hard_ham"

emails = []
labels = []  # Assuming these are "ham" emails

# Loop through each file in the folder
for filename in os.listdir(folder_path):
    file_path = os.path.join(folder_path, filename)
    
    # Open each file and read its content
    with open(file_path, 'r', encoding='latin1') as file:
        content = file.read()
        emails.append(content)
        labels.append("ham")  # Label all these emails as "ham"

# Create a DataFrame from the lists
df = pd.DataFrame({
    'text': emails,
    'label': labels
})

# Save the DataFrame to a CSV file
output_csv = 'hard_ham_emails.csv'
df.to_csv(output_csv, index=False)
print(f"CSV file '{output_csv}' created successfully with {len(df)} emails.")


In [None]:
### Get Current Working Directory
This code snippet:
1. **Imports the `os` module**: A built-in Python module for interacting with the operating system.
2. **Prints the Current Working Directory (CWD)**:
   - Uses the `os.getcwd()` function to retrieve the path of the directory where the Python script or Jupyter Notebook is currently running.
   - This is useful for confirming the current directory before performing file operations.
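
Because the folder paths in this notebook are relative, they are resolved against the current working directory. The sketch below shows how to see the absolute location a relative path points to. The path is the one used elsewhere in this notebook, and it does not need to exist for the example to run.

```python
import os

# Relative path as used in this notebook (it does not have to exist here)
relative_path = os.path.join("OneDrive", "Desktop", "WVU", "CYBR_520", "hard_ham")

# os.path.abspath() joins the path onto the current working directory
resolved_path = os.path.abspath(relative_path)

print("Current working directory:", os.getcwd())
print("Relative path resolves to:", resolved_path)
```

If the resolved location is not the intended one, either change the working directory or switch to an absolute path.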

In [None]:
import os
print(os.getcwd())


In [None]:
### Save DataFrame to a CSV File
This code saves the processed DataFrame to a specified file path as a CSV. 

1. **Specify the Output Path**:
   - The variable `output_csv` stores the file path where the CSV will be saved.
   - Paths can be either absolute (full path from the root directory) or relative (path relative to the current working directory).

2. **Save the DataFrame**:
   - The `df.to_csv(output_csv, index=False)` function writes the DataFrame to the CSV file.
   - The `index=False` parameter ensures that the DataFrame index is not included in the CSV file.

### Notes:
- The first code snippet uses an **absolute path** to save the file.
- The second code snippet uses a **relative path** to save the file.
- Ensure the target directory exists and is writable to avoid errors.
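
One way to guard against a missing output folder is to create it before saving. The sketch below is an assumption about how this could be handled, not part of the original workflow: the helper name `prepare_output_path` and the `exports` subfolder are hypothetical, and a temporary directory stands in for the real location.

```python
import tempfile
from pathlib import Path

def prepare_output_path(path):
    """Create the parent folder if needed and return the absolute path (hypothetical helper)."""
    output_path = Path(path).expanduser().resolve()
    output_path.parent.mkdir(parents=True, exist_ok=True)
    return output_path

# Demo in a temporary directory; "exports" does not exist until the helper creates it
with tempfile.TemporaryDirectory() as base_dir:
    output_csv = prepare_output_path(Path(base_dir) / "exports" / "hard_ham_emails.csv")
    parent_created = output_csv.parent.is_dir()

print(output_csv.name, parent_created)
```

The returned path can be passed directly to `df.to_csv(output_csv, index=False)`.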

In [None]:
output_csv = r"C:\Users\cjvan\OneDrive\Desktop\WVU\CYBR_520\hard_ham_emails.csv"
df.to_csv(output_csv, index=False)

In [None]:
output_csv = r"OneDrive/Desktop/WVU/CYBR_520/hard_ham_emails.csv"
df.to_csv(output_csv, index=False)

In [None]:
### Verify Folder Path Existence
This code verifies whether a specific folder exists in the filesystem. 

1. **Folder Path Setup**:
   - The `folder_path` variable stores the path to the folder being checked.
   - In this example, the path points to a folder named `spam_2` in the `CYBR_520` directory.

2. **Check Folder Existence**:
   - The `os.path.exists()` function checks if the specified folder path exists.
   - If the folder exists, it prints "Folder found!".
   - If the folder does not exist, it prints "Folder not found. Please check the path."

### Usage:
- Use this snippet to ensure the folder path is correct before performing operations such as reading files or saving data.
- Adjust the `folder_path` variable to match the location of your desired folder.
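
Note that `os.path.exists()` also returns `True` when the path points to a file. The following sketch shows a stricter check with `os.path.isdir()`. The helper name `describe_path` is hypothetical, and the demo uses a temporary folder instead of the real `spam_2` directory.

```python
import os
import tempfile

def describe_path(path):
    """Classify a path as 'folder', 'not a folder', or 'missing' (hypothetical helper)."""
    if os.path.isdir(path):
        return "folder"
    if os.path.exists(path):
        return "not a folder"
    return "missing"

with tempfile.TemporaryDirectory() as demo_dir:
    demo_file = os.path.join(demo_dir, "note.txt")
    with open(demo_file, "w") as f:
        f.write("not a folder")
    results = [
        describe_path(demo_dir),                          # an existing folder
        describe_path(demo_file),                         # exists, but is a file
        describe_path(os.path.join(demo_dir, "spam_2")),  # does not exist
    ]

print(results)
```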

In [None]:
import os

# Set the path to the new folder location
folder_path = r"OneDrive/Desktop/WVU/CYBR_520/spam_2"

# Check if the path exists
if os.path.exists(folder_path):
    print("Folder found!")
else:
    print("Folder not found. Please check the path.")

In [None]:
### Process and Save Spam Emails to a CSV File
This code processes all email files in the `spam_2` folder and saves them into a labeled CSV file.

1. **Folder Path Setup**:
   - The `folder_path` variable specifies the directory containing the spam email files.

2. **Read Email Files**:
   - Iterates through each file in the folder using `os.listdir(folder_path)`.
   - Opens each file with `open()` and reads its content into a list called `emails`.
   - Each email is labeled as "spam" and appended to the `labels` list.

3. **Create a DataFrame**:
   - Combines the email content (`emails`) and labels (`labels`) into a pandas DataFrame with columns:
     - `text`: Contains the email content.
     - `label`: Contains the corresponding label ("spam").

4. **Save to CSV**:
   - Writes the DataFrame to a CSV file named `spam_2_emails.csv` using `df.to_csv()`.
   - Prints a confirmation message indicating the number of processed emails and the file's location.
   
### Notes:
- The `encoding='latin1'` parameter handles possible encoding issues in email files.
- The output CSV can be used for further analysis or as input to machine learning models.
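
To illustrate why `latin1` avoids decoding errors: every possible byte value maps to a character in Latin-1, so decoding cannot fail, whereas UTF-8 rejects invalid byte sequences. The bytes below are made up for the demonstration and are not taken from the dataset.

```python
# Made-up bytes: "Hi " followed by two bytes that are not valid UTF-8 together
raw_bytes = bytes([0x48, 0x69, 0x20, 0xE9, 0xFF])

# Latin-1 maps each byte to exactly one character, so this never raises
latin1_text = raw_bytes.decode("latin1")

# Strict UTF-8 decoding raises on the invalid sequence
try:
    raw_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# An alternative: decode as UTF-8 but substitute U+FFFD for undecodable bytes
utf8_lenient = raw_bytes.decode("utf-8", errors="replace")

print(repr(latin1_text), utf8_failed, repr(utf8_lenient))
```

The trade-off is that `latin1` never fails but can produce incorrect characters if a file was actually written in another encoding such as UTF-8.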

In [None]:
import os
import pandas as pd

# Set the correct path to the folder containing the email files
folder_path = r"OneDrive/Desktop/WVU/CYBR_520/spam_2"

emails = []
labels = []  # All files in this folder are labeled "spam"

# Loop through each file in the folder
for filename in os.listdir(folder_path):
    file_path = os.path.join(folder_path, filename)
    
    # Open each file and read its content
    with open(file_path, 'r', encoding='latin1') as file:
        content = file.read()
        emails.append(content)
        labels.append("spam")  # Label all these emails as "spam"

# Create a DataFrame from the lists
df = pd.DataFrame({
    'text': emails,
    'label': labels
})

# Save the DataFrame to a CSV file
output_csv = 'spam_2_emails.csv'
df.to_csv(output_csv, index=False)
print(f"CSV file '{output_csv}' created successfully with {len(df)} emails.")

In [None]:
output_csv = r"C:\Users\cjvan\OneDrive\Desktop\WVU\CYBR_520\spam_2_emails.csv"
df.to_csv(output_csv, index=False)

In [None]:
### Combine Ham and Spam Email Datasets into a Single CSV File
This code combines two separate email datasets (`ham` and `spam`) into a unified dataset and saves it as a new CSV file.

1. **Load Datasets**:
   - `hard_ham_emails.csv`: Contains emails labeled as "ham."
   - `spam_2_emails.csv`: Contains emails labeled as "spam."
   - Both files are loaded into pandas DataFrames using `pd.read_csv()`.

2. **Combine Datasets**:
   - Uses `pd.concat()` to merge the two DataFrames (`ham_df` and `spam_df`) into a single DataFrame, `combined_df`.
   - The `ignore_index=True` parameter resets the index in the combined dataset for consistency.

3. **Save Combined Dataset**:
   - Saves the merged DataFrame to a new CSV file named `BERT_Emails.csv` using `to_csv()`.
   - Prints confirmation messages indicating the total number of emails in the combined dataset and the name of the saved file.

### Notes:
- The combined dataset can be used as input for text classification models, such as a BERT-based model.
- Ensure the file paths for the input datasets are correct before running the code.
- This code is part of a preprocessing pipeline to prepare data for further machine learning tasks.
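
As a small illustration of the combine step, the sketch below uses tiny made-up DataFrames in place of the real CSV files. It also shows two optional checks that are not part of the original workflow: counting the labels after concatenation and shuffling the rows, since `pd.concat()` keeps all ham rows ahead of all spam rows.

```python
import pandas as pd

# Tiny stand-ins for hard_ham_emails.csv and spam_2_emails.csv (made-up rows)
ham_df = pd.DataFrame({"text": ["a", "b"], "label": ["ham", "ham"]})
spam_df = pd.DataFrame({"text": ["c", "d", "e"], "label": ["spam"] * 3})

# ignore_index=True renumbers the rows 0..n-1 in the combined frame
combined_df = pd.concat([ham_df, spam_df], ignore_index=True)

# Optional check: how many emails of each class ended up in the result
label_counts = combined_df["label"].value_counts().to_dict()

# Optional: shuffle the rows so the classes are interleaved
shuffled_df = combined_df.sample(frac=1, random_state=42).reset_index(drop=True)

print(label_counts, len(shuffled_df))
```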


In [None]:
import pandas as pd

# Load the ham and spam datasets
ham_df = pd.read_csv(r"hard_ham_emails.csv")
spam_df = pd.read_csv(r"spam_2_emails.csv")

# Combine the datasets
combined_df = pd.concat([ham_df, spam_df], ignore_index=True)
print("Dataset combined successfully with", len(combined_df), "total emails.")

# Save the combined DataFrame to a new CSV file
combined_csv_path = r"BERT_Emails.csv"
combined_df.to_csv(combined_csv_path, index=False)
print(f"Combined CSV file saved as '{combined_csv_path}'.")


In [None]:
### Load and Inspect the Combined Dataset
This code loads the combined dataset of ham and spam emails (`BERT_Emails.csv`) and inspects its structure.

1. **Load the Dataset**:
   - The file `BERT_Emails.csv` is read into a pandas DataFrame (`df`) using `pd.read_csv()`.

2. **Inspect the Dataset**:
   - `df.head()`: Displays the first five rows of the DataFrame to provide a preview of the data.
   - `df.columns`: Prints the column names of the DataFrame to confirm its structure.

### Notes:
- The dataset includes two columns:
  - `text`: Contains the email content.
  - `label`: Indicates the classification (`ham` or `spam`).
- This step ensures that the dataset is loaded correctly before performing further operations, such as preprocessing or modeling.

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv("BERT_Emails.csv")

# Inspect the dataset
print(df.head())
print(df.columns)

                                                text label
0  Return-Path: Fool@motleyfool.com\nDelivery-Dat...   ham
1  Return-Path: <malcolm-sweeps@mrichi.com>\nDeli...   ham
2  From nic@starflung.com  Mon Jun 24 17:06:54 20...   ham
3  Received: from bran.mc.mpls.visi.com (bran.mc....   ham
4  Return-Path: <iso17799@securityrisk.co.uk>\nRe...   ham
Index(['text', 'label'], dtype='object')


In [None]:
### Context: Running BERT and DNN Models on Email Dataset
Previously, we processed the `BERT_Emails.csv` dataset and successfully ran it through a BERT model to extract embeddings and a Deep Neural Network (DNN) for classification. However, the Jupyter Notebook crashed and did not save progress, requiring us to restart the process.

### Current Challenges:
1. **Resource Constraints**:
   - Running the full BERT model on the dataset consistently crashes Jupyter Lab due to insufficient memory or compute resources.
   - Attempts to use lighter transformer models, such as `google/electra-small-discriminator`, have also resulted in crashes.

2. **Current Objective**:
   - Reload the dataset.
   - Split the data into training and testing sets.
   - Attempt to reinitialize and run a lightweight transformer-based tokenizer and model for embedding generation.

### Actions Taken:
- Verified available system memory using `psutil`.
- Tried using a smaller transformer model (`google/electra-small-discriminator`) to mitigate resource issues, but it failed to initialize correctly in Jupyter Lab.

### Next Steps:
- Explore other compact transformer-based models (e.g., DistilBERT).
- Consider using cloud-based resources or more powerful hardware for model execution.
- Optimize preprocessing steps to reduce memory load before embedding generation.

### Key Notes:
- The code below attempts to re-run the process with a smaller model while ensuring the dataset is correctly split into training and testing sets for further processing.
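
One way to reduce peak memory during embedding generation, in line with the optimization listed under Next Steps, is to process the emails in small batches instead of all at once. The sketch below only illustrates the batching pattern. The names `embed_in_batches` and `fake_embed` are hypothetical, and `fake_embed` stands in for a function that would tokenize a batch and run it through the model under `torch.no_grad()`.

```python
def embed_in_batches(texts, embed_fn, batch_size=8):
    """Apply embed_fn to texts in chunks of batch_size and collect the results."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))  # only one batch is processed at a time
    return embeddings

# Stand-in for tokenizer + model: records batch sizes and
# returns a one-number "embedding" per text
seen_batch_sizes = []

def fake_embed(batch):
    seen_batch_sizes.append(len(batch))
    return [[float(len(text))] for text in batch]

demo_texts = [f"email {i}" for i in range(20)]
demo_embeddings = embed_in_batches(demo_texts, fake_embed, batch_size=8)

print(seen_batch_sizes, len(demo_embeddings))
```

Smaller batches lower the memory needed at any one moment at the cost of more model calls. Whether this is enough to avoid the crashes described above would need to be tested.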

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv("BERT_Emails.csv")  # Replace with your actual file path

# Split the data into training and testing sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

In [1]:
import psutil
print(f"Available Memory: {psutil.virtual_memory().available / (1024 * 1024):.2f} MB")

Available Memory: 2829.10 MB


In [1]:
from transformers import AutoTokenizer, AutoModel

In [2]:
tokenizer = AutoTokenizer.from_pretrained('google/electra-small-discriminator')

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [None]:
model = AutoModel.from_pretrained('google/electra-small-discriminator')

In [None]:
# Load pre-trained BERT tokenizer and model
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

In [None]:
### Reducing Email Datasets for Smaller Sample Sizes
To address resource constraints and improve the likelihood of successfully running BERT on the email dataset, we decided to reduce the datasets to smaller sample sizes.

1. **Ham Dataset**:
   - Loaded `hard_ham_emails.csv` containing ham emails.
   - Randomly sampled 35 emails from the dataset using `pandas.DataFrame.sample()`.
   - Saved the reduced dataset to a new CSV file named `Reduced_Ham.csv`.

2. **Spam Dataset**:
   - Loaded `spam_2_emails.csv` containing spam emails.
   - Randomly sampled 45 emails from the dataset.
   - Saved the reduced dataset to a new CSV file named `Reduced_Spam.csv`.

3. **Reason for Reduction**:
   - BERT models require significant computational resources to process large datasets.
   - Reducing the dataset size allows us to test the pipeline with smaller, manageable samples.

### Next Steps:
- Combine `Reduced_Ham.csv` and `Reduced_Spam.csv` into a single dataset for classification.
- Re-run the BERT model and DNN classifier with these smaller datasets to verify functionality.

### Notes:
- The `random_state=42` parameter ensures reproducibility of the sampling process.
- This approach provides a baseline for testing while avoiding crashes caused by insufficient memory.
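
The reproducibility provided by `random_state` can be checked on a small made-up DataFrame, as sketched below. The helper name `sample_subset` is hypothetical. It also caps `n` at the number of available rows, because `DataFrame.sample()` raises an error when asked for more rows than exist (without replacement).

```python
import pandas as pd

# Made-up stand-in for one of the email CSV files
emails_df = pd.DataFrame({
    "text": [f"email {i}" for i in range(10)],
    "label": ["ham"] * 10,
})

def sample_subset(df, n, seed=42):
    """Sample up to n rows reproducibly (hypothetical helper)."""
    return df.sample(n=min(n, len(df)), random_state=seed)

first_draw = sample_subset(emails_df, 4)
second_draw = sample_subset(emails_df, 4)   # same seed, same rows
capped_draw = sample_subset(emails_df, 35)  # only 10 rows are available

print(first_draw.index.tolist() == second_draw.index.tolist(), len(capped_draw))
```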

In [1]:
import pandas as pd

# Load the dataset
file_path = 'hard_ham_emails.csv'  # Update this with the correct file path
emails_df = pd.read_csv(file_path)

# Randomly sample 35 emails
subset_emails = emails_df.sample(n=35, random_state=42)  # random_state ensures reproducibility

# Save the subset to a new CSV file named "Reduced_Ham.csv"
subset_path = 'Reduced_Ham.csv'
subset_emails.to_csv(subset_path, index=False)

print(f"Subset of 35 emails saved to {subset_path}")


Subset of 35 emails saved to Reduced_Ham.csv


In [2]:
import pandas as pd

# Load the dataset
file_path = 'spam_2_emails.csv'  # Update this with the correct file path
emails_df = pd.read_csv(file_path)

# Randomly sample 45 emails
subset_emails = emails_df.sample(n=45, random_state=42)  # random_state ensures reproducibility

# Save the subset to a new CSV file named "Reduced_Spam.csv"
subset_path = 'Reduced_Spam.csv'
subset_emails.to_csv(subset_path, index=False)

print(f"Subset of {len(subset_emails)} emails saved to {subset_path}")


Subset of 45 emails saved to Reduced_Spam.csv


In [3]:
import pandas as pd

# Load the Reduced_Ham and Reduced_Spam datasets
ham_path = 'Reduced_Ham.csv'  # Update with actual file path if needed
spam_path = 'Reduced_Spam.csv'  # Update with actual file path if needed

reduced_ham = pd.read_csv(ham_path)
reduced_spam = pd.read_csv(spam_path)

# Combine the two datasets
reduced_bert = pd.concat([reduced_ham, reduced_spam], ignore_index=True)

# Save the combined dataset as Reduced_BERT.csv
output_path = 'Reduced_BERT.csv'
reduced_bert.to_csv(output_path, index=False)

print(f"Combined dataset saved to {output_path}")

Combined dataset saved to Reduced_BERT.csv


In [None]:
The following commands use the `!pip` syntax to install the Python libraries needed for this project. The `!` prefix runs a shell command directly from a Jupyter Notebook cell. These libraries include:

- `transformers`: Provides tools and models for natural language processing (NLP) tasks, such as BERT.
- `torch`: PyTorch, a deep learning library for building and training neural networks.
- `scikit-learn`: A machine learning library for model evaluation and feature engineering.
- `pandas`: Used for data manipulation and analysis.
- `numpy`: Provides support for numerical computations.
- `matplotlib`: A library for creating data visualizations.
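
After installation, a quick way to confirm that each library can be found by the current kernel is sketched below. This check is an addition, not part of the original notebook. Note that the import name can differ from the pip package name (`scikit-learn` is imported as `sklearn`).

```python
import importlib.util

# pip package name -> module name used in import statements
required = {
    "transformers": "transformers",
    "torch": "torch",
    "scikit-learn": "sklearn",
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
}

# find_spec() returns None when a top-level module cannot be located
missing = [
    package for package, module in required.items()
    if importlib.util.find_spec(module) is None
]

print("Missing packages:", missing or "none")
```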

In [4]:
!pip install transformers
!pip install torch
!pip install scikit-learn
!pip install pandas
!pip install numpy
!pip install matplotlib

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [None]:
***BERT Model Step 1***

### Note: This step runs without issues in Jupyter Lab.

### Preprocessing Data for BERT Classification
To ensure the BERT model processes the input text correctly, the dataset is preprocessed and tokenized. The steps and their purposes are:

1. **Import Required Libraries**:
   - `pandas` for dataset handling.
   - `train_test_split` from `sklearn` for dividing the dataset into training and testing sets.
   - `BertTokenizer` from `transformers` to tokenize text into BERT-compatible input.

2. **Load the Dataset**:
   - Loads the `Reduced_BERT.csv` file containing email texts and their labels.
   - The dataset includes:
     - `text`: Input email content.
     - `label`: Classification label (`ham` or `spam`).

3. **Map Labels to Integers**:
   - Maps `ham` to 0 and `spam` to 1 for binary classification.

4. **Split Data into Training and Testing Sets**:
   - Splits the dataset into 80% training and 20% testing subsets using `train_test_split`.

5. **Initialize BERT Tokenizer**:
   - Loads a pre-trained BERT tokenizer (`bert-base-uncased`) for converting text into tokens.

6. **Tokenize Data**:
   - Tokenizes both training and testing data using the tokenizer with these parameters:
     - `truncation=True`: Ensures text exceeding the maximum length is truncated.
     - `padding=True`: Pads sequences to a uniform length.
     - `max_length`: Maximum number of tokens kept per email. BERT allows up to 512; the code below uses 32 to see whether a shorter length prevents the kernel from crashing.
     - `return_tensors="pt"`: Returns the output as PyTorch tensors.
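
To make the effect of `truncation`, `padding`, and `max_length` concrete, the toy sketch below applies the same idea to plain lists of token IDs. It is a simplified illustration, not the tokenizer's implementation: the function `pad_and_truncate` and the token IDs are made up, and the real tokenizer also handles special tokens when truncating.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad all to the longest remaining one."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask

# Two made-up token ID sequences of different lengths
toy_sequences = [[101, 7, 8, 9, 102], [101, 5, 102]]
toy_ids, toy_mask = pad_and_truncate(toy_sequences, max_length=4)

print(toy_ids)
print(toy_mask)
```

With `max_length=32`, any email longer than 32 tokens loses everything after that point, which keeps memory use low but discards most of a typical email.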

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer

# Load the Reduced_BERT.csv file
data_path = 'Reduced_BERT.csv'
data = pd.read_csv(data_path)

# Extract text and labels
texts = data['text'].tolist()
labels = data['label'].map({'ham': 0, 'spam': 1}).tolist()  # Map labels to 0 and 1

# Split into train and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# Initialize BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the data
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=32, return_tensors="pt")
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=32, return_tensors="pt")

In [None]:
*** BERT Model Step 2 ***

### Note: This step currently crashes the kernel when run, although it did execute successfully once.

### Generating Embeddings Using Pre-Trained BERT
This step uses the pre-trained BERT model to generate embeddings (vector representations) for the training and test datasets. The steps and their purposes are:

1. **Import Required Libraries**:
   - `torch` for handling tensor operations.
   - `BertModel` from `transformers` to use the pre-trained BERT model.

2. **Load Pre-Trained BERT Model**:
   - Loads `bert-base-uncased`, a pre-trained BERT model available through Hugging Face.
   - This model generates contextualized embeddings for input text.

3. **Generate Embeddings for Training and Test Sets**:
   - The embeddings for both the training and testing datasets are generated by passing the tokenized data through the pre-trained BERT model.

4. **Explanation of the Code**:
   - `torch.no_grad()`: Disables gradient computation to save memory during inference.
   - `last_hidden_state`: Contains the embeddings for all tokens in the input text.
   - `.mean(dim=1)`: Performs mean pooling to aggregate token embeddings into a single vector for each input.
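
A plain `.mean(dim=1)` averages over every token position, including the padding positions added by `padding=True`. A common alternative, sketched below, weights the average by the attention mask so that padding is ignored. This is a different pooling variant from the one used in this notebook, the function name `masked_mean_pool` is hypothetical, and the tensors are tiny made-up values rather than BERT output.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over real tokens only (padding positions are ignored)."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden size
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts

# One sequence with two real tokens and one padding position
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)
plain_mean = hidden.mean(dim=1)

print(pooled)      # padding row excluded
print(plain_mean)  # padding row included
```

With the variables from this notebook, the inputs would be the model's `last_hidden_state` and `train_encodings["attention_mask"]` (or the test equivalents).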

In [None]:
import torch
from transformers import BertModel

# Load pre-trained BERT model
bert_model = BertModel.from_pretrained('bert-base-uncased')

# Generate embeddings for the training and test sets
with torch.no_grad():
    train_embeddings = bert_model(**train_encodings).last_hidden_state.mean(dim=1)  # Pooling
    test_embeddings = bert_model(**test_encodings).last_hidden_state.mean(dim=1)

In [None]:
import torch.nn as nn
import torch.optim as optim

# Define the DNN model
class DNNClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally,
        # so a Softmax layer here would apply the normalization twice.
        return self.fc2(x)

# Model parameters
input_dim = train_embeddings.shape[1]
hidden_dim = 128
output_dim = 2  # Binary classification: ham/spam

# Initialize model, loss, and optimizer
model = DNNClassifier(input_dim, hidden_dim, output_dim)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Convert labels to PyTorch tensors (the embeddings are already tensors)
train_labels = torch.tensor(train_labels)
test_labels = torch.tensor(test_labels)

# Train the DNN
epochs = 10
for epoch in range(epochs):
    model.train()
    optimizer.zero_grad()
    outputs = model(train_embeddings)
    loss = criterion(outputs, train_labels)
    loss.backward()
    optimizer.step()

    print(f"Epoch {epoch+1}/{epochs}, Loss: {loss.item()}")

In [None]:
from sklearn.metrics import classification_report

# Evaluate the model
model.eval()
with torch.no_grad():
    predictions = model(test_embeddings).argmax(dim=1)

# Print classification report
print(classification_report(test_labels, predictions))

In [1]:
!which python

/usr/bin/python
