## A tutorial on applying the one-shot fine-tuning technique in digital forensics

### Motivation
- The generated evidence graph (consisting of evidence entities and their relations) does not follow the STIX format.

### Solution: One-shot fine-tuning

- Provide one training example to LLMs
- LLMs often produce more accurate results by learning from the example
- Fine-tuning is a productive way to leverage machine learning

<img src="https://github.com/frankwxu/digital-forensics-lab/blob/main/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/notes/productive_way_llm.webp?raw=1" width="550">

### Implementation
- Add the one-shot example as the `context` for the answer (e.g., a conversation)

### Step 1: Download libraries and files for the lab
- Make sure you download the necessary libraries and files.
- If you use Google Colab, all downloaded and saved files are located in the `content` folder

In [None]:
# uncomment the commands to download libraries and files
#!pip install python-dotenv
#!pip install dspy-ai==2.4.17
#!pip install graphviz
#!wget https://raw.githubusercontent.com/frankwxu/digital-forensics-lab/main/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/conversation.txt

import dspy
import os
import openai
import json
from dotenv import load_dotenv
from IPython.display import display

In [None]:
from google.colab import files
uploaded = files.upload()

### Step 2: Configure DSPy with OpenAI
- You `MUST` have an OpenAI API key
- load the OpenAI API key from the `openai_api_key.txt` file
- or, hard-code your OpenAI API key

In [None]:
def set_dspy():
    # ============== set OpenAI environment =========
    # Path to your API key file
    key_file_path = "openai_api_key.txt"

    # Load the API key from the file
    with open(key_file_path, "r") as file:
        openai_api_key = file.read().strip()

    # Set the API key as an environment variable
    os.environ["OPENAI_API_KEY"] = openai_api_key
    openai.api_key = os.environ["OPENAI_API_KEY"]
    turbo = dspy.OpenAI(model="gpt-3.5-turbo", max_tokens=2000, temperature=0.5)
    dspy.settings.configure(lm=turbo)
    return turbo
    # ============== end of set OpenAI environment =========


def set_dspy_hardcode_openai_key():
    os.environ["OPENAI_API_KEY"] = "sk-proj-yourapikeyhere"
    openai.api_key = os.environ["OPENAI_API_KEY"]
    turbo = dspy.OpenAI(model="gpt-3.5-turbo", temperature=0, max_tokens=2000)
    dspy.settings.configure(lm=turbo)
    return turbo


# provide `openai_api_key.txt` with your OpenAI API key
turbo = set_dspy()
# optionally, hard-code your OpenAI API key in `set_dspy_hardcode_openai_key()` and call it instead
# turbo = set_dspy_hardcode_openai_key()
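
Hard-coding a key in a notebook makes it easy to leak by accident. As a minimal sketch (the helper name `resolve_openai_api_key` is hypothetical, not part of DSPy or the lab files), the key could be taken from the `OPENAI_API_KEY` environment variable when it is already set, falling back to `openai_api_key.txt` otherwise:

```python
import os
from pathlib import Path


def resolve_openai_api_key(key_file="openai_api_key.txt", env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, or from key_file as a fallback.

    Hypothetical helper for illustration; it only reads the key and does not
    configure DSPy or the openai package.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key

    path = Path(key_file)
    if not path.is_file():
        raise FileNotFoundError(
            f"Set {env_var} or create {key_file} containing your OpenAI API key."
        )

    key = path.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{key_file} is empty.")
    return key
```

The returned value could then be assigned to `os.environ["OPENAI_API_KEY"]` inside `set_dspy()` in place of the direct file read.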

### Step 3: Load the cyber incident report (e.g., conversation)

In [None]:
def load_text_file(file_path):
    """
    Load a text file and return its contents as a string.

    Parameters:
    file_path (str): The path to the text file.

    Returns:
    str: The contents of the text file.
    """
    try:
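        # The conversation contains non-ASCII punctuation, so read it as UTF-8 explicitly
        with open(file_path, "r", encoding="utf-8") as file: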
            contents = file.read()
        return contents
    except FileNotFoundError:
        return "File not found."
    except Exception as e:
        return f"An error occurred: {e}"


conversation = load_text_file("conversation.txt")
print(conversation)

Alice: Hey Bob, I just got a strange email from support@banksecure.com. It says I need to verify my account details urgently. The subject line was "Urgent: Verify Your Account Now". The email looks suspicious to me.

Bob: Hi Alice, that does sound fishy. Can you forward me the email? I’ll take a look at the headers to see where it came from.

Alice: Sure, forwarding it now.

Bob: Got it. Let’s see... The email came from IP address 192.168.10.45, but the domain banksecure.com is not their official domain. It's actually registered to someone in Russia.

Alice: That’s definitely not right. Should I be worried?

Bob: We should investigate further. Did you click on any links or download any attachments?

Alice: I did click on a link that took me to a page asking for my login credentials. I didn't enter anything though. The URL was http://banksecure-verification.com/login.

Bob: Good call on not entering your details. Let’s check the URL. This domain was just registered two days ago. It’s hi

### Step 4: Define a structure for one-shot examples understood by `DSPy`.

#### DSPy accepts training data in a certain format
- For instance, if you were working on a question-answering system:
```
example = dspy.Example(
    question="What is the capital of France?",
    answer="The capital of France is Paris."
).with_inputs("question")
```
- `.with_inputs("question")` tells DSPy that the `question` field is an input when using this example; the remaining fields (here, `answer`) are treated as labels

#### Key components in our one-shot example
- one-shot example: a similar conversation with the correct evidence and relations in STIX
- question: a conversation describing the cyber incident scenario
- answer: the enhanced evidence entities and relations in STIX, improved because of the one-shot learning

In [None]:
example = dspy.Example(
    question="""
      Taylor: Hey Alex, I think I might have clicked on a suspicious link in an email.
      Alex: Oh no, Taylor. Can you describe what happened?
      Taylor: I got an email from what looked like our HR department. It said there was an urgent update to our benefits package, and I needed to click a link to review the changes.
      Alex: Did the email address seem legitimate?
      Taylor: At first glance, yes, but now that I think about it, the domain was slightly different. It was hr-dept@ourcompany-security.com instead of @ourcompany.com.
      Alex: That sounds like phishing. What happened after you clicked the link?
      Taylor: It took me to a login page that looked just like our internal portal. I entered my username and password.
      Alex: Did you notice anything unusual after entering your credentials?
      Taylor: Not immediately, but a few minutes later, I got an alert that someone attempted to log into my account from a different location.
      Alex: Okay, this sounds serious. I need you to change your password immediately and enable two-factor authentication if you haven't already.
      Taylor: Done. What should we do next?
      Alex: I'll start by examining the email headers to trace the origin. Also, I need to check the link you clicked on to understand its structure and where it leads.
      Taylor: Alright, I’ll forward you the email.
      Alex: Thanks. I’ll also run a network scan to see if any other devices might have been compromised.
      Taylor: Should I inform the rest of the team?
      Alex: Yes, let them know about the phishing attempt and advise them to be cautious. I’ll send an official email with detailed instructions.
      Taylor: Got it. Thanks, Alex. Is there anything else I should do?
      Alex: Just keep an eye out for any unusual activities in your accounts. I’ll handle the technical investigation and follow up with you if I need more information.
      Taylor: Will do. Thanks again.
      Alex: No problem. Stay safe online.""",
    answer="""[
    {
      "type": "identity",
      "id": "identity--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f",
      "name": "OurCompany",
      "identity_class": "organization",
      "sectors": ["technology"],
      "contact_information": "info@ourcompany.com"
    },
    {
      "type": "email-addr",
      "id": "email-addr--0c0d2094-df97-45a7-9e9c-223569a9e798",
      "value": "hr-dept@ourcompany-security.com"
    },
    {
      "type": "email-message",
      "id": "email-message--c79b6bde-4f4c-4b38-a8c8-fb82921d6b97",
      "is_multipart": false,
      "subject": "Urgent Benefits Package Update",
      "from_ref": "email-addr--0c0d2094-df97-45a7-9e9c-223569a9e798",
      "body": "Please click the link to review the changes to your benefits package."
    },
    {
      "type": "url",
      "id": "url--4c3b-4c4b-bb6c-ded6b2a4a567",
      "value": "http://phishing-link.com/login"
    },
    {
      "type": "user-account",
      "id": "user-account--bd5631cf-2af6-4bba-bc92-37c60d020400",
      "user_id": "Taylor",
      "account_login": "taylor@ourcompany.com"
    },
    {
      "type": "observable",
      "id": "observable--001",
      "observable_type": "email",
      "observable_value": "hr-dept@ourcompany-security.com"
    },
    {
      "type": "observable",
      "id": "observable--002",
      "observable_type": "url",
      "observable_value": "http://phishing-link.com/login"
    },
    {
      "type": "indicator",
      "id": "indicator--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f",
      "name": "Phishing Email Indicator",
      "pattern": "[email-message:subject = 'Urgent Benefits Package Update']",
      "valid_from": "2024-07-17T00:00:00Z"
    },
    {
      "type": "incident",
      "id": "incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857",
      "name": "Phishing Attack on OurCompany",
      "description": "A phishing attack where a suspicious email was sent to an employee of OurCompany.",
      "first_seen": "2024-07-17T08:00:00Z",
      "last_seen": "2024-07-17T08:10:00Z",
      "status": "ongoing",
      "affected_assets": ["user-account--bd5631cf-2af6-4bba-bc92-37c60d020400"]
    },
    {
      "type": "relationship",
      "id": "relationship--3f1a8d8b-6a6e-4b5d-8e15-2d6d9a2b3f1d",
      "relationship_type": "indicates",
      "source_ref": "indicator--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f",
      "target_ref": "incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857"
    },
    {
      "type": "relationship",
      "id": "relationship--4b6e65f3-743d-40c2-9194-3b5e38b3efed",
      "relationship_type": "attributed-to",
      "source_ref": "incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857",
      "target_ref": "identity--1cba2e3c-4bdb-4d0b-a87b-2d504ad5923f"
    },
    {
      "type": "relationship",
      "id": "relationship--5c9b6eaf-27a6-4b2b-9b17-49e3b00f6051",
      "relationship_type": "uses",
      "source_ref": "incident--7a2b252e-c3e5-4bc2-bc6f-cb917ecf7857",
      "target_ref": "url--4c3b-4c4b-bb6c-ded6b2a4a567"
    }
]""",
).with_inputs("question")
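
Because the one-shot example is the pattern the LLM imitates, it is worth confirming that its `answer` parses as JSON and that every reference points to an object defined in the same list. The following is a minimal sketch; `find_dangling_refs` and `sample_answer` are hypothetical names introduced for illustration, and the check covers only string fields ending in `_ref`, not full STIX validation.

```python
import json


def find_dangling_refs(stix_json):
    """Return (object id, field, target) tuples whose target id is not defined.

    Only string-valued fields ending in "_ref" are checked; this is not a
    STIX schema validator.
    """
    objects = json.loads(stix_json)
    known_ids = {obj["id"] for obj in objects if "id" in obj}
    dangling = []
    for obj in objects:
        for field, value in obj.items():
            if field.endswith("_ref") and isinstance(value, str) and value not in known_ids:
                dangling.append((obj.get("id"), field, value))
    return dangling


# Hypothetical, shortened sample in the same shape as the one-shot answer
sample_answer = """[
  {"type": "email-addr", "id": "email-addr--a", "value": "hr-dept@ourcompany-security.com"},
  {"type": "email-message", "id": "email-message--b", "from_ref": "email-addr--a"},
  {"type": "relationship", "id": "relationship--c",
   "source_ref": "email-message--b", "target_ref": "url--missing"}
]"""

# Reports the relationship whose target_ref has no matching object
print(find_dangling_refs(sample_answer))
```

Calling `find_dangling_refs(example.answer)` on the one-shot example above should return an empty list when all of its references resolve.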

### Step 5: Load the one-shot example using a retriever

#### What is a retriever in `DSPy`?
- designed to fetch relevant information or documents from a larger corpus or database based on a given query
- often based on vector representations of text
- use cases: question-answering systems, chatbots, or any application
    - where relevant information needs to be fetched from a large dataset to inform further processing or responses

#### The retriever in this example
- enhances accuracy for forensic evidence analysis
    - identifies evidence entities and relationships that comply with STIX
- hard-coded to return a single example for one-shot learning
- can be improved to retrieve more, or more closely related, examples
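
One way the retriever could be extended to return related examples is to rank a small pool of examples by word overlap with the incoming conversation. The sketch below is an illustration under stated assumptions, not the lab's implementation: `select_examples`, the Jaccard token overlap, and the `(question, answer)` tuples are all hypothetical, and a vector-based retriever would likely rank better.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_examples(query, examples, k=1):
    """Return the k (question, answer) pairs whose question best overlaps the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        examples,
        key=lambda pair: jaccard(query_tokens, tokenize(pair[0])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical example pool; the answers are placeholders, not real STIX
pool = [
    ("I clicked a suspicious link in a phishing email from HR.", "<phishing STIX JSON>"),
    ("A USB drive with malware was plugged into the server.", "<malware STIX JSON>"),
]

best = select_examples("Alice got a strange phishing email with a link.", pool, k=1)
print(best[0][0])  # the phishing example shares the most words with the query
```

A function like this could be called from the retriever's `forward()` method in place of returning a fixed example.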

### Step 6: Implement `OneShotRetriever`
The main retrieval method. It returns a formatted string containing the predefined example, regardless of the input query.

- Parameters:
    - query: The input `query` (currently not used in the retrieval process).
- Returns: A formatted string containing:
    - The example scenario (from self.example.question)
    - The corresponding STIX JSON (from self.example.answer)

In [None]:
# Create a simple retriever that always returns the one-shot example
class OneShotRetriever(dspy.Retrieve):
    def __init__(self, example):
        super().__init__()
        self.example = example

    def forward(self, query):
        # Here we could use the query to determine if we should return the example
        # For demonstration, let's just print the query
        # print(f"Retrieval query: {query}")
        one_example = f"Example scenario: {self.example.question}\n Example generated STIX in JSON based on the scenario: {self.example.answer}\n"
        return one_example

### Step 7: Implement `STIXGeneratorSig`

In [None]:
class STIXGeneratorSig(dspy.Signature):
    """Describe a conversation in STIX (Structured Threat Information eXpression), a standardized language for representing cyber threat information."""

    # Make sure to define context here; otherwise, one-shot learning won't work
    context = dspy.InputField(desc="contains a scenario and the corresponding STIX in JSON")

    question: str = dspy.InputField(
        desc="a conversation describing a cyber incident between an IT Security Specialist and an employee."
    )

    answer: str = dspy.OutputField(
        desc="the formalized STIX in JSON representing cyber threat information based on the conversation, e.g., [{object 1}, {object 2}, ... {object n}]"
    )

### Step 7 (continued): Implement the `STIXGenCoT` module


#### `dspy.Module`
- implement your business logic
- can include multiple submodules
- syntactic similarity to PyTorch
    -  `__init__()`: declares the used submodules.
    - `forward()`: describes the control flow among the defined submodules.

![Diagram of forward modules](https://github.com/frankwxu/digital-forensics-lab/blob/main/AI4Forensics/CKIM2024/PhishingAttack/PhishingAttackScenarioDemo/04_forward_module.svg?raw=1)




In [None]:
class STIXGenCoT(dspy.Module):
    def __init__(self, example):
        super().__init__()
        self.retriever = OneShotRetriever(example)
        self.predictor = dspy.ChainOfThought(STIXGeneratorSig)

    def forward(self, question):
        context = self.retriever(question)
        results = self.predictor(context=context, question=question)

        # Inspect the history
        # last_interaction = turbo.inspect_history(n=1)
        # print("Last interaction:")
        # print(last_interaction)

        return results

### Step 8: Tell the LLM `HOW` to generate the answer in a function

- `HOW`: defined in `STIXGenCoT`
- save the output in JSON

In [None]:
def generate_answer(conversation, output_file):
    # Create an instance of your module with the one-shot example
    my_module = STIXGenCoT(example)

    # Use your module with a new input
    answer = my_module(question=conversation).answer

    print(answer)
    # Parse before opening the file so a malformed answer does not leave an empty file behind
    result = json.loads(answer)
    with open(output_file, "w") as json_file:
        json.dump(result, json_file, indent=4)
    print(f"The results have been saved to the file {output_file}")
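
`json.loads` raises an error if the model wraps its answer in a Markdown code fence or adds a sentence before the JSON. A minimal sketch of a more tolerant parse step follows; `extract_json_array` is a hypothetical helper, and it assumes the answer contains a single top-level JSON array.

```python
import json


def extract_json_array(text):
    """Parse the first top-level JSON array found in text.

    Tries the whole string first, then falls back to the span between the
    first "[" and the last "]". Raises ValueError if nothing parses.
    """
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        start, end = text.find("["), text.rfind("]")
        if start == -1 or end <= start:
            raise ValueError("No JSON array found in the model output.")
        try:
            parsed = json.loads(text[start : end + 1])
        except json.JSONDecodeError as exc:
            raise ValueError(f"Model output is not valid JSON: {exc}") from exc
    if not isinstance(parsed, list):
        raise ValueError("Expected a JSON array of STIX objects.")
    return parsed


# Hypothetical model output wrapped in a Markdown fence
fence = "`" * 3
raw = f'Here is the STIX:\n{fence}json\n[{{"type": "url", "id": "url--1", "value": "http://example.com"}}]\n{fence}'
print(extract_json_array(raw))
```

In `generate_answer`, the call to `json.loads(answer)` could be swapped for `extract_json_array(answer)` if fenced or prefixed outputs turn out to be common.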

### Step 9: Execute the function above with an input and output

In [None]:
output_file = "04_output.json"
generate_answer(
    conversation,
    output_file,
)
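
Once `04_output.json` exists, its relationships can be turned into an evidence graph. The sketch below converts a list of STIX objects into Graphviz DOT text using only the standard library; `stix_to_dot` and the sample objects are hypothetical, and it assumes relationships carry `source_ref`, `target_ref`, and `relationship_type` as in the one-shot example.

```python
import json


def stix_to_dot(objects):
    """Build DOT text: one node per non-relationship object, one edge per relationship."""
    lines = ["digraph evidence {"]
    for obj in objects:
        if obj.get("type") != "relationship" and "id" in obj:
            # Prefer a human-readable label, falling back to the object type
            label = obj.get("name") or obj.get("value") or obj["type"]
            # json.dumps is used here as a simple way to produce quoted strings
            lines.append(f"  {json.dumps(obj['id'])} [label={json.dumps(label)}];")
    for obj in objects:
        if obj.get("type") == "relationship":
            lines.append(
                f"  {json.dumps(obj['source_ref'])} -> {json.dumps(obj['target_ref'])} "
                f"[label={json.dumps(obj.get('relationship_type', ''))}];"
            )
    lines.append("}")
    return "\n".join(lines)


# Hypothetical sample; in the lab this list would come from json.load on the output file
sample = [
    {"type": "indicator", "id": "indicator--1", "name": "Phishing Email Indicator"},
    {"type": "incident", "id": "incident--1", "name": "Phishing Attack"},
    {"type": "relationship", "id": "relationship--1", "relationship_type": "indicates",
     "source_ref": "indicator--1", "target_ref": "incident--1"},
]
dot_text = stix_to_dot(sample)
print(dot_text)
```

The resulting text could be saved to a `.dot` file or passed to `graphviz.Source` for rendering in the notebook.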

### Step 10: Inspect the last prompt sent to the LLM

You want to check:
- Prompt Description Section: Description in the signature
- Format Section: `Follow the following format.`
- Context: Example scenario:

In [None]:
turbo.inspect_history(n=1)