## Setup
This section imports the required libraries and loads the data into a pandas DataFrame.
It also creates the OpenAI client and supplies the API key used for authentication.

In [None]:
import json

import pandas as pd
from openai import OpenAI

# Load the CSV file
df = pd.read_csv('./dataset.csv')

# Limit the number of rows (set to None to process all rows)
row_limit = 100  # Adjust this number based on cost considerations

# Create the OpenAI client; replace the placeholder with your own key
client = OpenAI(
    api_key='<API_KEY>'
)
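
Hard-coding the key in the notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key is stored in an environment variable. The variable name `OPENAI_API_KEY` and the helper `load_api_key` are illustrative choices, not part of the original setup:

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Both this helper and the default variable name are assumptions made for
    this sketch; adapt them to however your environment stores secrets.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

The result could then be passed as `api_key=load_api_key()` when constructing the client.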

## Prompt

In [47]:
multi_shot = """
Analyze the provided email text to determine if it is a phishing attempt, using the detailed contextual framework of psychological traits and cognitive biases. For each email, decide whether it is a phishing email or not and explain your reasoning. Apply the following psychological traits and cognitive biases in your analysis, with their explanations and examples:\n\n1. A Sense of Urgency: Pressuring the recipient to make quick decisions, often leading to insufficient consideration of consequences. Example: An email claiming account suspension within 24 hours unless a link is clicked.\n\n2. Inducing Fear by Threatening: Invoking fear to coerce compliance, threatening negative outcomes. Example: Email from 'tax authority' threatening legal action for unpaid taxes.\n\n3. Enticement with Desire: Playing on desires with too-good-to-be-true offers. Example: Email congratulating on a lottery win, requesting personal info.\n\n4. Authority Bias: Trust in suggestions from authority figures. Example: Email from the CEO directing urgent fund transfer.\n\n5. Recency Effect: Prioritizing the most recently presented information. Example: Urging donations to a fraudulent charity after a disaster.\n\n6. Halo Effect: Influence of overall brand impression on character feelings. Example: Email mimicking a respected brand to steal credentials.\n\n7. Hyperbolic Discounting: Preferring immediate rewards over larger, delayed benefits. Example: Offering an immediate discount for quick action.\n\n8. Curiosity Effect: Leveraging curiosity to entice seeking more information. Example: Email with a vague subject line and a malicious attachment.\n\nAdditionally, consider other phishing indicators like poor grammar and unusual requests for personal information. Analyze the overall context, language subtlety, and presentation of the email. Instructions: provide the analysis result in json format with the following parts:
is_deceptive: Boolean (True for phishing, False for not phishing)
explanation: Text explaining the reasoning behind the decision.
"""

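The prompt asks for a JSON object with `is_deceptive` and `explanation`, but a model may not always follow the format exactly. Because the prompt writes the values Python-style (`True`/`False`), a reply could, for instance, carry the string "True" instead of a JSON boolean. A minimal sketch of a defensive parser follows; `parse_verdict` is a hypothetical helper, not part of the original notebook:

```python
import json


def parse_verdict(raw: str) -> dict:
    """Parse a model reply into {'is_deceptive': bool, 'explanation': str}."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"is_deceptive", "explanation"} - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    flag = data["is_deceptive"]
    if isinstance(flag, str):
        # Accept "True"/"false" style strings as well as JSON booleans
        lowered = flag.strip().lower()
        if lowered not in ("true", "false"):
            raise ValueError(f"unexpected is_deceptive value: {flag!r}")
        flag = lowered == "true"
    elif not isinstance(flag, bool):
        raise ValueError(f"unexpected is_deceptive value: {flag!r}")

    return {"is_deceptive": flag, "explanation": str(data["explanation"])}
```

Calling `parse_verdict(response)` in place of a bare `json.loads(response)` would surface malformed replies as a `ValueError` rather than as a `KeyError` or a silently wrong label later on.
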
## Analysis
This section sends each email in the dataset (up to `row_limit` rows) to `gpt-3.5-turbo-1106`, which classifies it as phishing or not and explains its reasoning.
The results are then saved to a new CSV file for further analysis.

In [48]:
# Collect one result per email, then build the DataFrame once at the end
# (avoids repeated pd.concat calls inside the loop)
results = []

# Take the first `row_limit` rows; this does not depend on the index labels
rows = df if row_limit is None else df.head(row_limit)

# Process each selected row
for _, row in rows.iterrows():
    email_text = row['text']

    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": multi_shot},
            {"role": "user", "content": email_text}
        ]
    )
    response = completion.choices[0].message.content
    response_json = json.loads(response)

    results.append({
        'text': email_text,
        'is_deceptive': response_json['is_deceptive'],
        'explanation': response_json['explanation'],
    })

processed_df = pd.DataFrame(results, columns=['text', 'is_deceptive', 'explanation'])

# Export DataFrame to a new CSV file
processed_df.to_csv('processed.csv', index=False)

## Statistical performance analysis
In this part, the performance of the model is analyzed using statistical measures such as accuracy, precision, recall, and F1 score.

In [50]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the datasets
processed_df = pd.read_csv('./processed.csv')
true_labels_df = pd.read_csv('./sample_labeled.csv')

# Make sure the datasets are aligned if necessary. This example assumes they're already aligned.

# Extracting the predicted and true labels
y_pred = processed_df['is_deceptive'].values
y_true = true_labels_df['is_deceptive'].values

# Calculate the evaluation metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Print the scores
print(f"Accuracy: {accuracy*100:.2f}%")
print(f"Precision: {precision*100:.2f}%")
print(f"Recall: {recall*100:.2f}%")
print(f"F1 Score: {f1*100:.2f}%")


Accuracy: 90.00%
Precision: 91.11%
Recall: 87.23%
F1 Score: 89.13%
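
These scores are only meaningful if row i of `processed.csv` and row i of `sample_labeled.csv` describe the same email. A minimal sketch of one way to enforce that, assuming both files share a `text` column with unique values (the source does not show the columns of `sample_labeled.csv`, so this is an assumption). `align_labels` is a hypothetical helper:

```python
import pandas as pd


def align_labels(pred_df: pd.DataFrame, true_df: pd.DataFrame) -> pd.DataFrame:
    """Join predictions to ground-truth labels on the email text."""
    merged = pred_df[["text", "is_deceptive"]].merge(
        true_df[["text", "is_deceptive"]],
        on="text",
        suffixes=("_pred", "_true"),
        validate="one_to_one",  # raises MergeError if a text is duplicated
    )
    if len(merged) != len(pred_df):
        raise ValueError("some predictions have no matching label")
    return merged


# Toy data standing in for the two CSV files
pred = pd.DataFrame({"text": ["a", "b"], "is_deceptive": [True, False]})
truth = pd.DataFrame({"text": ["b", "a", "c"], "is_deceptive": [False, False, True]})

aligned = align_labels(pred, truth)
y_pred = aligned["is_deceptive_pred"].values
y_true = aligned["is_deceptive_true"].values
```

Joining on the text rather than on row position also keeps the evaluation valid when the label file contains more rows than were processed.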


1. Accuracy (90.00%)

    Implication: An accuracy of 90% means the system classified 9 out of 10 emails correctly, counting both phishing and legitimate messages. That is a solid result, but accuracy alone can hide weaknesses when the dataset is imbalanced, so it should be read together with precision and recall.

2. Precision (91.11%)

    Implication: A precision of 91.11% means that 91.11% of the emails the system flagged as phishing were in fact phishing. High precision keeps the number of false positives low, so legitimate emails are less likely to be mistakenly flagged, which helps maintain user trust in the system.

3. Recall (87.23%)

    Implication: A recall of 87.23% means the system caught 87.23% of the actual phishing emails and missed the remaining 12.77%. Because recall is lower than precision, missed phishing emails (false negatives) outnumber false alarms (false positives). The missed emails reach end users and pose a security risk, so improving recall is critical.

4. F1 Score (89.13%)

    Implication: The F1 score provides a balanced view of precision and recall, and in this case, it indicates that the system has a good balance between the two but with room for improvement. An F1 score of 89.13% suggests that the system effectively identifies phishing attempts with a reasonable rate of false positives and negatives, but there might be opportunities to refine the system further to enhance its detection capabilities.
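
The four scores can be tied back to raw counts. Assuming all 100 rows allowed by `row_limit` were evaluated (an assumption, since the notebook does not print a confusion matrix), the counts below are one set consistent with the printed output. They are inferred, not reported by the source. The sketch recomputes each metric from them:

```python
# Inferred counts (assumption): not values printed by the notebook
tp, fp, fn, tn = 41, 4, 6, 49

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # share of flagged emails that really are phishing
recall = tp / (tp + fn)     # share of phishing emails that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy*100:.2f}%")
print(f"Precision: {precision*100:.2f}%")
print(f"Recall: {recall*100:.2f}%")
print(f"F1 Score: {f1*100:.2f}%")
```

Under these counts the 6 missed phishing emails outnumber the 4 false alarms, which is exactly what a recall below precision implies.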

### In the Context of Phishing Countermeasures:

Emphasis on Minimizing False Positives: The high precision rate is beneficial for maintaining user trust and operational efficiency by ensuring that legitimate communications are not unduly interrupted.

Need to Improve Detection: The slightly lower recall rate compared to precision highlights a need for improving the system's ability to catch all phishing attempts. Enhancing recall could involve refining detection algorithms or incorporating new data sources that better capture emerging phishing techniques.

Balanced System with Improvement Opportunities: The F1 score shows that the system maintains a balance between precision and recall. However, the slightly lower recall suggests a potential area for improvement to make the system even more effective at thwarting phishing attempts.

Continuous Improvement and Adaptation: Phishing tactics constantly evolve, so it's crucial for detection systems to adapt continually. This might include implementing machine learning models that can learn from new phishing patterns and incorporating user feedback and reporting mechanisms to identify missed phishing attempts.

User Education Remains Key: Despite technological advancements in detection, educating users on recognizing and reporting phishing remains a vital component of a comprehensive security strategy. A more informed user base can act as an additional layer of defense against phishing.

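## GPT-4 Analysis
This section repeats the same analysis and evaluation with `gpt-4-turbo-preview`, using the same prompt and the same labeled sample.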


In [51]:
# Collect one result per email, then build the DataFrame once at the end
# (avoids repeated pd.concat calls inside the loop)
results = []

# Take the first `row_limit` rows; this does not depend on the index labels
rows = df if row_limit is None else df.head(row_limit)

# Process each selected row
for _, row in rows.iterrows():
    email_text = row['text']

    completion = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": multi_shot},
            {"role": "user", "content": email_text}
        ]
    )
    response = completion.choices[0].message.content
    response_json = json.loads(response)

    results.append({
        'text': email_text,
        'is_deceptive': response_json['is_deceptive'],
        'explanation': response_json['explanation'],
    })

processed_df = pd.DataFrame(results, columns=['text', 'is_deceptive', 'explanation'])


# Export DataFrame to a new CSV file
processed_df.to_csv('gpt4t-ms-processed.csv', index=False)

In [53]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the datasets
processed_df = pd.read_csv('./gpt4t-ms-processed.csv')
true_labels_df = pd.read_csv('./sample_labeled.csv')

# Extracting the predicted and true labels
y_pred = processed_df['is_deceptive'].values
y_true = true_labels_df['is_deceptive'].values

# Calculate the evaluation metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Print the scores
print(f"Accuracy: {accuracy*100:.2f}%")
print(f"Precision: {precision*100:.2f}%")
print(f"Recall: {recall*100:.2f}%")
print(f"F1 Score: {f1*100:.2f}%")

Accuracy: 90.00%
Precision: 89.36%
Recall: 89.36%
F1 Score: 89.36%
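
Putting the two runs side by side makes the trade-off easier to see. The sketch below takes the scores printed by the two evaluation cells and reports the difference in percentage points; the dictionary names are illustrative:

```python
# Scores as printed by the two evaluation cells (in percent)
gpt35 = {"accuracy": 90.00, "precision": 91.11, "recall": 87.23, "f1": 89.13}
gpt4_turbo = {"accuracy": 90.00, "precision": 89.36, "recall": 89.36, "f1": 89.36}

# Positive values mean gpt-4-turbo-preview scored higher
delta = {name: round(gpt4_turbo[name] - gpt35[name], 2) for name in gpt35}

for name, diff in delta.items():
    print(f"{name:<10} {diff:+.2f} pp")
```

On this sample accuracy is unchanged, while `gpt-4-turbo-preview` gives up 1.75 points of precision in exchange for 2.13 points of recall, leaving F1 slightly higher.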
