## Setup
This section imports the required libraries, loads the dataset into a pandas DataFrame, and creates an authenticated OpenAI client.

In [5]:
import pandas as pd
import os
import json

from openai import OpenAI

# Load the CSV file
df = pd.read_csv('./dataset.csv')

# Limit the number of rows (Set to None to process all rows)
row_limit = 100  # Adjust this number based on cost considerations

client = OpenAI(
    api_key='<API_KEY>'
)

## Prompt

In [7]:
zero_shot = """
Analyze this email, and classify it as phishing or not phishing
Instructions: provide the analysis result in json format with the following parts:
is_deceptive: Boolean (True for phishing, False for not phishing)
explanation: Text explaining the reasoning behind the decision.
"""

## Analysis
In this part, each email in the dataset (up to `row_limit` rows) is sent to the model, which classifies it as phishing or not phishing.
The results are then saved to a new CSV file for further analysis.

In [6]:
# Create a new DataFrame to store the processed rows
processed_df = pd.DataFrame(columns=df.columns)

# Process each row in the DataFrame
for index, row in df.iterrows():
    if row_limit is not None and index >= row_limit:
        break

    email_text = row['text']    

    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        response_format={ "type": "json_object" },
        messages=[
            {"role": "system", "content": zero_shot},
            {"role": "user", "content": email_text}
        ]
    )
    response = completion.choices[0].message.content
    response_json = json.loads(response)

    new_row = pd.DataFrame({
        'text': [email_text],
        'is_deceptive': [response_json['is_deceptive']],
        'explanation': [response_json['explanation']],
    })
    processed_df = pd.concat([processed_df, new_row], ignore_index=True)



# Export DataFrame to a new CSV file
processed_df.to_csv('processed.csv', index=False)
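
The loop above indexes `response_json['is_deceptive']` directly, so a reply that omits the field raises a `KeyError` and stops the run. JSON mode constrains the reply to valid JSON but does not by itself enforce specific keys or types. Below is a minimal sketch of a more defensive parser. `parse_verdict` is a hypothetical helper, not part of the original notebook, and returning `None` for unusable replies is an assumption about how such rows should be handled.

```python
import json

def parse_verdict(raw):
    """Hypothetical helper: turn the model's JSON reply into (is_deceptive, explanation).

    Returns (None, reason) when the reply cannot be interpreted, so the caller
    can log or skip the row instead of crashing mid-run.
    """
    try:
        data = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return None, "reply was not valid JSON"
    if not isinstance(data, dict):
        return None, "reply was not a JSON object"

    value = data.get("is_deceptive")
    if isinstance(value, str):
        # Accept "true"/"false" strings, which a model may emit instead of booleans
        value = {"true": True, "false": False}.get(value.strip().lower())
    if not isinstance(value, bool):
        return None, "is_deceptive missing or not a boolean"

    return value, str(data.get("explanation", ""))

print(parse_verdict('{"is_deceptive": true, "explanation": "Spoofed sender"}'))
print(parse_verdict('{"is_deceptive": "False", "explanation": "Routine newsletter"}'))
print(parse_verdict('{"explanation": "no verdict given"}'))
```

Rows for which the first element is `None` could be written to a separate file and retried, rather than being counted as either class.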

## Statistical performance analysis
In this part, the performance of the model is analyzed using statistical measures such as accuracy, precision, recall, and F1 score.

In [7]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the datasets
processed_df = pd.read_csv('./processed.csv')
true_labels_df = pd.read_csv('./sample_labeled.csv')

# Extracting the predicted and true labels
y_pred = processed_df['is_deceptive'].values
y_true = true_labels_df['is_deceptive'].values

# Calculate the evaluation metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Print the scores
print(f"Accuracy: {accuracy*100:.2f}%")
print(f"Precision: {precision*100:.2f}%")
print(f"Recall: {recall*100:.2f}%")
print(f"F1 Score: {f1*100:.2f}%")


Accuracy: 92.00%
Precision: 89.80%
Recall: 93.62%
F1 Score: 91.67%


These metrics suggest that the phishing detection performed well: accuracy, precision, recall, and F1 are all close to or above 90%, indicating a good balance between catching phishing emails and limiting false positives.

1. Accuracy (92%)

    Implication: This high accuracy indicates that the system correctly identifies phishing and legitimate emails most of the time. However, accuracy alone can be misleading if the dataset is imbalanced (i.e., if there are significantly more examples of one class than the other).

2. Precision (89.8%)

    Implication: Precision measures the proportion of emails identified as phishing that were actually phishing. A precision of 89.8% means that when the system flags an email as phishing, there's a high chance it's correct. However, a lower precision would indicate that the system generates more false positives, potentially leading to legitimate emails being incorrectly flagged as phishing, which can erode trust in the system.

3. Recall (93.6%)

    Implication: Recall measures the ability of the system to find all the phishing emails. A recall of 93.6% means the system is very good at catching phishing attempts but still misses a few (a lower recall would mean more phishing emails slip through, posing a significant security risk).

4. F1 Score (91.7%)

    Implication: The F1 score is the harmonic mean of precision and recall, providing a single metric to assess the balance between them. A high F1 score indicates the system effectively balances catching as many phishing attempts as possible (recall) while minimizing the number of legitimate emails incorrectly flagged (precision).
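
The four definitions above can be checked by hand from confusion-matrix counts. The counts in the sketch below are hypothetical: they are not read from the notebook's data, but were chosen because, for 100 emails, they reproduce the percentages reported for this run.

```python
# Hypothetical confusion-matrix counts for 100 emails (an assumption, chosen
# to be consistent with the percentages reported above)
tp, fp, fn, tn = 44, 5, 3, 48

accuracy = (tp + tn) / (tp + fp + fn + tn)          # share of all emails classified correctly
precision = tp / (tp + fp)                          # share of flagged emails that were phishing
recall = tp / (tp + fn)                             # share of phishing emails that were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"Accuracy: {accuracy*100:.2f}%")
print(f"Precision: {precision*100:.2f}%")
print(f"Recall: {recall*100:.2f}%")
print(f"F1 Score: {f1*100:.2f}%")
```

Under these assumed counts, 5 legitimate emails would be flagged incorrectly and 3 phishing emails would be missed.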

### In the Context of Phishing Countermeasures:

Reducing False Negatives: A high recall indicates the system is effective at minimizing false negatives (missed phishing attempts), crucial for security. Even a few missed phishing emails can lead to successful attacks.

Minimizing False Positives: The precision rate shows the system's effectiveness in minimizing false positives, essential for user trust and reducing manual review workloads. False positives can lead to legitimate emails being quarantined, potentially interrupting business operations or causing important communications to be missed.

Overall Effectiveness: The balance between precision and recall, as reflected in the F1 score, suggests that the system is well-tuned for phishing detection, effectively identifying threats while minimizing disruptions to legitimate communication.

Improvement Areas: Even with high performance, there's always room for improvement. Analyzing the characteristics of false positives and false negatives can help refine the detection algorithms, potentially incorporating machine learning models that adapt to evolving phishing tactics.

User Education: Countermeasures should not solely rely on detection systems. User education on recognizing and reporting potential phishing attempts remains a critical component of a comprehensive security posture.
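
The error analysis suggested under Improvement Areas can start by lining up predictions with ground truth and pulling out the misclassified rows. The sketch below uses a small made-up DataFrame. The column names `is_deceptive_true` and `is_deceptive_pred` are hypothetical; in this notebook they would come from combining the processed results with `sample_labeled.csv`, assuming both files list the emails in the same order.

```python
import pandas as pd

# Made-up example rows; real data would come from the notebook's CSV files
results = pd.DataFrame({
    "text": [
        "Verify your account now",
        "Team lunch on Friday",
        "Invoice attached, open urgently",
        "Your package has shipped",
    ],
    "is_deceptive_true": [True, False, True, False],
    "is_deceptive_pred": [True, True, False, False],
})

# False positives: flagged as phishing but actually legitimate
false_positives = results[results["is_deceptive_pred"] & ~results["is_deceptive_true"]]

# False negatives: phishing emails the model let through
false_negatives = results[~results["is_deceptive_pred"] & results["is_deceptive_true"]]

print(false_positives["text"].tolist())
print(false_negatives["text"].tolist())
```

Reading the `explanation` column for these rows is one way to see which cues the model over- or under-weighted.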

# GPT-4 Analysis
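The same zero-shot analysis is repeated here with the gpt-4-turbo-preview model, and the results are evaluated with the same metrics.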

In [8]:
# Create a new DataFrame to store the processed rows
processed_df = pd.DataFrame(columns=df.columns)

# Process each row in the DataFrame
for index, row in df.iterrows():
    if row_limit is not None and index >= row_limit:
        break

    email_text = row['text']    

    completion = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        response_format={ "type": "json_object" },
        messages=[
            {"role": "system", "content": zero_shot},
            {"role": "user", "content": email_text}
        ]
    )
    response = completion.choices[0].message.content
    response_json = json.loads(response)

    new_row = pd.DataFrame({
        'text': [email_text],
        'is_deceptive': [response_json['is_deceptive']],
        'explanation': [response_json['explanation']],
    })
    processed_df = pd.concat([processed_df, new_row], ignore_index=True)


# Export DataFrame to a new CSV file
processed_df.to_csv('gpt4t-zs-processed.csv', index=False)
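
The GPT-3.5 and GPT-4 cells differ only in the model name and the output file, so the loop could be factored into one function. The sketch below is a possible refactor, not the notebook's code. `classify_emails` and the `classify` callable are hypothetical names; `classify` stands in for a wrapper around `client.chat.completions.create` that returns the parsed JSON dict. Rows are collected in a list and converted once at the end, which avoids calling `pd.concat` on every iteration.

```python
from itertools import islice

def classify_emails(texts, classify, row_limit=None):
    """Hypothetical helper: run `classify` on up to `row_limit` emails.

    `classify` is any callable mapping email text to a dict with
    'is_deceptive' and 'explanation' keys (e.g. a wrapper around the API call).
    """
    rows = []
    for email_text in islice(texts, row_limit):  # row_limit=None means no limit
        result = classify(email_text)
        rows.append({
            "text": email_text,
            "is_deceptive": result["is_deceptive"],
            "explanation": result["explanation"],
        })
    return rows  # pd.DataFrame(rows) would build the DataFrame in one step

# Stand-in classifier so the sketch runs without an API key (demo only)
def fake_classify(text):
    flagged = "password" in text.lower()
    return {"is_deceptive": flagged, "explanation": "keyword check (demo only)"}

emails = ["Reset your password here", "Meeting notes attached", "Lunch?"]
rows = classify_emails(emails, fake_classify, row_limit=2)
print(rows)
```

In the notebook, this would be called once per model with `df['text']` as `texts`, and the resulting rows written to a model-specific CSV file.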

In [9]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the datasets
processed_df = pd.read_csv('./gpt4t-zs-processed.csv')
true_labels_df = pd.read_csv('./sample_labeled.csv')

# Extracting the predicted and true labels
y_pred = processed_df['is_deceptive'].values
y_true = true_labels_df['is_deceptive'].values

# Calculate the evaluation metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Print the scores
print(f"Accuracy: {accuracy*100:.2f}%")
print(f"Precision: {precision*100:.2f}%")
print(f"Recall: {recall*100:.2f}%")
print(f"F1 Score: {f1*100:.2f}%")

Accuracy: 92.00%
Precision: 93.33%
Recall: 89.36%
F1 Score: 91.30%


In [2]:
import plotly.graph_objects as go

# Reported metrics (%) for GPT-3.5 Turbo and GPT-4 Turbo, copied from the outputs above
models = ['GPT-3.5 Turbo', 'GPT-4 Turbo']
accuracy = [92.00, 92.00]
precision = [89.80, 93.33]
recall = [93.62, 89.36]
f1_scores = [91.67, 91.30]  # renamed to avoid shadowing sklearn's f1_score

# Create a single figure for the grouped bar chart
fig = go.Figure()

# Add one bar trace per metric
fig.add_trace(go.Bar(x=models, y=accuracy, name='Accuracy'))
fig.add_trace(go.Bar(x=models, y=precision, name='Precision'))
fig.add_trace(go.Bar(x=models, y=recall, name='Recall'))
fig.add_trace(go.Bar(x=models, y=f1_scores, name='F1 Score'))

# Update layout
fig.update_layout(
    title='Comparison of GPT-3.5 Turbo and GPT-4 Turbo Performance',
    xaxis_title='Model',
    yaxis_title='Percentage',
    barmode='group'
)

# Show plot
fig.show()

In [3]:
import plotly.graph_objects as go

# Reported metrics (%) for GPT-3.5 Turbo and GPT-4 Turbo, copied from the outputs above
models = ['GPT-3.5 Turbo', 'GPT-4 Turbo']
accuracy = [92.00, 92.00]
precision = [89.80, 93.33]
recall = [93.62, 89.36]
f1_scores = [91.67, 91.30]  # renamed to avoid shadowing sklearn's f1_score

# Create a single figure for the grouped bar chart
fig = go.Figure()

# Add one bar trace per metric, each with its own color
fig.add_trace(go.Bar(x=models, y=accuracy, name='Accuracy', marker_color='indianred'))
fig.add_trace(go.Bar(x=models, y=precision, name='Precision', marker_color='lightsalmon'))
fig.add_trace(go.Bar(x=models, y=recall, name='Recall', marker_color='lightseagreen'))
fig.add_trace(go.Bar(x=models, y=f1_scores, name='F1 Score', marker_color='royalblue'))

# Update layout for clarity
fig.update_layout(
    title='Comparison of GPT-3.5 Turbo and GPT-4 Turbo Performance Metrics',
    xaxis=dict(
        title='Model',
        tickmode='linear',
    ),
    yaxis=dict(
        title='Percentage (%)',
        range=[0, 100]
    ),
    barmode='group',
    legend=dict(
        title='Metrics'
    )
)

# Show the figure
fig.show()

In [8]:
row_limit = 1  # Adjust this number based on cost considerations


# Create a new DataFrame to store the processed rows
processed_df = pd.DataFrame(columns=df.columns)

# Process each row in the DataFrame
for index, row in df.iterrows():
    if row_limit is not None and index >= row_limit:
        break

    email_text = row['text']    

    completion = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        response_format={ "type": "json_object" },
        messages=[
            {"role": "system", "content": zero_shot},
            {"role": "user", "content": email_text}
        ]
    )
    response = completion.choices[0].message.content
    response_json = json.loads(response)

    new_row = pd.DataFrame({
        'text': [email_text],
        'is_deceptive': [response_json['is_deceptive']],
        'explanation': [response_json['explanation']],
        'response': [response_json]
    })
    processed_df = pd.concat([processed_df, new_row], ignore_index=True)



# Export DataFrame to a CSV file
# (this overwrites any existing gpt4t-zs-processed.csv, including one written by an earlier run)
processed_df.to_csv('gpt4t-zs-processed.csv', index=False)