<a href="https://colab.research.google.com/github/dorinhazan/FinalProject-DataScience/blob/main/observable_extraction_assistant_and_observable_rank_validator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
! unzip /content/validation.zip

Archive:  /content/validation.zip
   creating: validation/
  inflating: __MACOSX/._validation   
 extracting: validation/.Rhistory    
  inflating: __MACOSX/validation/._.Rhistory  
  inflating: validation/.DS_Store    
  inflating: __MACOSX/validation/._.DS_Store  
  inflating: validation/%5BAnalysis%5DAndariel_Group.md  
  inflating: __MACOSX/validation/._%5BAnalysis%5DAndariel_Group.md  
  inflating: validation/2020_10_19_unsealed_indictment_0.md  
  inflating: __MACOSX/validation/._2020_10_19_unsealed_indictment_0.md  


In [11]:
%env OPENAI_API_KEY=""

env: OPENAI_API_KEY=""


In [3]:
from openai import OpenAI
from google.colab import userdata
import os
import regex as re
import json
import time

api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key = api_key)
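
The cell above imports `userdata` but reads the key from the environment, and the `%env` line may store the surrounding quotes as part of the value. A minimal sketch of a more forgiving loader follows. The helper name `load_api_key` is hypothetical, and it assumes the key is saved as a Colab secret called `OPENAI_API_KEY` when one exists; outside Colab it falls back to the environment variable.

```python
import os

def load_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Hypothetical helper: read the key from Colab secrets, else from the environment."""
    try:
        # Assumption: running in Colab with a secret named `name`.
        from google.colab import userdata
        value = userdata.get(name)
    except Exception:
        # Not in Colab, or the secret is missing or inaccessible.
        value = os.getenv(name, "")
    # Drop whitespace and any literal quotes left over from `%env NAME="..."`.
    value = (value or "").strip().strip('"')
    if not value:
        raise RuntimeError(f"{name} is not set")
    return value
```

With this in place the client could be created as `OpenAI(api_key=load_api_key())`, which fails early with a clear message when the key is empty.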

## Removing images from the Markdown files

In [4]:
def clean_markdown(file_path):
    """
    Clean a single Markdown file in place by removing:
     1) Any image line ending in .jpg) or .jpeg),
        plus any blank lines after it,
        plus the very next non-empty “Figure…” line.
     2) Inline hyperlinks, standalone URLs, report headers/footers,
        and any remaining image references.
    """
    with open(file_path, "r", encoding="utf-8") as f:
        content = f.read()

    # 1) Remove image + blank lines + next non-empty Figure… line
    content = re.sub(
        r'(?im)^[ \t]*!\[[^\]]*\]\([^)\r\n]+?\.(?:jpe?g)\)[ \t]*\r?\n'      # image line
        r'(?:[ \t]*\r?\n)*'                                                  # any blank lines
        r'[ \t]*(?:\[Figure\b.*\]|Figure\b.*)\r?\n?',                        # next non-empty Figure line
        "",
        content
    )

    # 2) Other patterns to strip out. Images go first: the hyperlink pattern
    #    would otherwise consume "[alt](http...)" and leave a stray "!".
    patterns = {
        "images":            r'!\[[^\]]*\]\((?:[^()\\]|\\.)*?\)',           # any other image refs
        "inline_hyperlinks": r'\[.*?\]\(https?://\S+?\)',
        "standalone_urls":   r'https?://\S+',
        "headers_footers":   r'(Recommended Practice|Homeland Security|Page \d+)',
    }

    for pat in patterns.values():
        content = re.sub(pat, "", content, flags=re.IGNORECASE)

    with open(file_path, "w", encoding="utf-8") as f:
        f.write(content)


In [5]:
for file in os.listdir("/content/validation"):
  if file.endswith(".md"):
    clean_markdown(os.path.join("/content/validation", file))
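
To see what the first substitution in `clean_markdown` does, the sketch below applies the same image-plus-caption pattern to an in-memory string. The sample text and the helper name `strip_image_with_caption` are made up for illustration, and it uses the standard-library `re` module instead of `regex`, which should behave the same for this pattern.

```python
import re

# Same pattern as step 1 of clean_markdown: a .jpg/.jpeg image line,
# any blank lines after it, and the next "Figure ..." caption line.
IMAGE_WITH_CAPTION = re.compile(
    r'(?im)^[ \t]*!\[[^\]]*\]\([^)\r\n]+?\.(?:jpe?g)\)[ \t]*\r?\n'
    r'(?:[ \t]*\r?\n)*'
    r'[ \t]*(?:\[Figure\b.*\]|Figure\b.*)\r?\n?'
)

def strip_image_with_caption(text: str) -> str:
    """Hypothetical helper: remove image lines together with their captions."""
    return IMAGE_WITH_CAPTION.sub("", text)

# Illustrative input only; not taken from the validation reports.
sample = (
    "The dropper contacts 10.0.0.5 over HTTP.\n"
    "![](_page_3_Picture_1.jpeg)\n"
    "\n"
    "Figure 2: Infection chain\n"
    "The payload is then decrypted.\n"
)
cleaned = strip_image_with_caption(sample)
print(cleaned)
```

Note that this pattern only fires when a "Figure" line follows the image; images without a caption are left for the generic `images` pattern in step 2.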

## Corrected Code - gpt_for_reports.py


## New

In [6]:
obs_rank_validator_instructions = """

# Observable-rank Validator
You are a senior CTI analyst. Your job is to **audit one observable at a time** and judge whether its `observable_rank` value (“Actionable”, “Described”, or “Mentioned”) is correct, based on the observable’s name and the provided context.

You will receive:
1. The observable record (as JSON).
2. The surrounding sentence or paragraph from the original CTI snippet for minimal context (when available).

### 1.  Decision Rules

| Rank              | Decide this rank **when …**                                                                                                                      | Typical examples (not exhaustive)                                                                         |
|-------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|
| **Actionable**    | • The exact string can be matched **as‑is** (or after a trivial transform like Base64 decode).<br>• A detection rule (IDS/YARA/SIEM, PLC logic, etc.) could key on it with low false‑positives.<br>• Uniqueness is obvious (or highly likely) from its syntax or entropy. | – Full code snippets / shell commands / SQL queries<br>– Complete URLs, URIs, or file paths<br>– IPs, domains, GUIDs, hashes (MD5/SHA‑256/SHA‑1, Cert thumbprints), JWTs, beacon keys<br>– Hard‑coded strings, mutexes, registry keys, ICS/SCADA tags, Modbus function codes<br>– Crypto keys, X.509 DN strings, email addresses, phone numbers used for C2 |
| **Described**     | • The details are not sufficient to act upon **or** the observable is not searchable, for example, due to ambiguity or missing information.<br>• Additional enrichment (lookup table, parameter substitution, memory carving, etc.) is required **and** is non‑trivial. | – “An HTTP POST to `/api/checkin?id=<random>`” (placeholder)<br>– “A PowerShell script that loads shellcode” (no code shown)<br>– “A DLL named similar to system files” |
| **Mentioned**     | • It is only referenced conceptually or generically and **cannot** be turned into a concrete detection even after enrichment.                    | – “Uses AES‑128” (algorithm only)<br>– “Makes RPC calls” (no function names)<br>– “Phishing email” |

### 2.  Key Heuristics

1. **Code fence = Actionable**
   Any observable whose `observable_value` is wrapped in triple back‑ticks is presumed actionable *unless* it is clearly pseudocode.

2. **Complete identifier = Actionable**
   A string that *looks complete* for its type (e.g., 36‑char GUID, 64‑hex SHA‑256, “10.0.0.5”, “plc_tag=3001:1”) is actionable.

3. **Indicator Strings**
   Strings such as `"S^Anonymous?"`, mutex names, task names, beacon keys, campaign IDs, etc., are actionable if they appear exactly.

4. **Encoded Artifacts**
   If a value is Base64/hex/URL‑encoded **and its decoded form is also supplied**, both entries are independently actionable.

5. **When unsure**
   Err toward **higher precision**:
   ```
   Actionable  >  Described  >  Mentioned
   ```
   If the observable *could plausibly* drive automated detection, choose **Actionable**.

### 3.  Output Format

Respond only with one of the following, for each observable:

```text
| observable_id | `observable_value` → `observable_rank` | CORRECT, or INCORRECT — should be "<CorrectRank>" because <one‑sentence justification> | Rationale |
```

***(No other words.)***

---

### 4.  Quick Examples for Output Format

| Observable (value + rank) | Validator output | Rationale |
|---------------------------|------------------|-----------|
| ```send_403070(0x1Au,0…)``` → **Described** | `INCORRECT — should be "Actionable" because it is a full, unique code snippet that can be matched directly.` | Rule 1 |
| `aHR0cHM6Ly9jdC5lemlsLmNvbS9wYXlsb2Fk` (encoded) → **Actionable** | `CORRECT` | Encoded, unique, searchable |
| `“a DLL named netutils.dll”` → **Mentioned** | `INCORRECT — should be "Described" because it gives a partial but still ambiguous filename pattern.` | lacks uniqueness |
| `AES‑128` → **Mentioned** | `CORRECT` | Only an algorithm, non‑searchable |

---

Use these instructions as the **sole evaluation rubric**.
"""

In [7]:
obs_rank_validator_ID = "asst_ksYtNJ5GTGYvwNQS6liBckyN"
our_updated_assistant = client.beta.assistants.update(
  obs_rank_validator_ID,
  instructions=obs_rank_validator_instructions,
)
# Note: this call updates the assistant stored in OpenAI, so the next time you run the notebook from the beginning, it will use the updated assistant.

In [8]:
def observable_rank_validator(observables):

  """
  Validate the ranks assigned by the Observable Extractor Assistant.

  This function calls an OpenAI assistant named Observable-Rank Validator.
  It receives a list of observables (the JSON list produced by the
  Observable Extractor Assistant) and returns the validator's raw text reply.
  The reply contains one verdict per observable, in input order. Each verdict is either

    ```text
    CORRECT
    ```

  or

    ```text
    INCORRECT — should be "<CorrectRank>" because <one‑sentence justification>
    ```
  """
  payload = json.dumps(observables, ensure_ascii=False, indent=2)

  user_msg = (
      "Here is a JSON list of observables. "
      "Validate the `observable_rank` of each one.\n"
      "```json\n" + payload + "\n```"
  )

  thread = client.beta.threads.create(
      messages=[{"role": "user", "content": user_msg}],
  )

  run = client.beta.threads.runs.create_and_poll(
      thread_id    = thread.id,
      assistant_id = obs_rank_validator_ID
  )

  if run.status != "completed":
      raise RuntimeError(f"Validator run ended with status {run.status}")

  reply = client.beta.threads.messages.list(thread_id=thread.id).data[0]
  return reply.content[0].text.value
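
The validator is instructed to answer with pipe-delimited rows, so its reply can be turned into records for later analysis. The sketch below is one way to do that. The name `parse_validator_rows` is hypothetical, and it assumes the reply follows the row format from section 3 of the instructions and that no cell contains a literal `|` (code-snippet observables can break this assumption).

```python
import re
from typing import List

RANKS = ("Actionable", "Described", "Mentioned")

def parse_validator_rows(reply: str) -> List[dict]:
    """Hypothetical helper: parse '| id | value → rank | verdict | rationale |' rows."""
    rows = []
    for line in reply.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) < 3 or not cells[0].isdigit():
            continue  # header, separator, or malformed row
        verdict_cell = cells[2]
        is_correct = verdict_cell.upper().startswith("CORRECT")
        suggested = None
        if not is_correct:
            # Pull the proposed rank out of: should be "<CorrectRank>" because ...
            match = re.search(r'should be\s+"?(' + "|".join(RANKS) + r')"?', verdict_cell)
            suggested = match.group(1) if match else None
        rows.append({
            "observable_id": int(cells[0]),
            "claim": cells[1],
            "verdict": "CORRECT" if is_correct else "INCORRECT",
            "suggested_rank": suggested,
            "rationale": cells[3] if len(cells) > 3 else None,
        })
    return rows

# Illustrative reply only; real validator output may differ.
sample_reply = (
    '| 1 | `10.0.0.5` → `Actionable` | CORRECT | Complete IP address |\n'
    '| 2 | `AES-128` → `Described` | INCORRECT - should be "Mentioned" '
    'because it names only an algorithm. | Non-searchable |\n'
)
parsed_rows = parse_validator_rows(sample_reply)
print(parsed_rows)
```

Rows that do not parse are skipped silently here; logging them instead would make format drift in the assistant's replies easier to spot.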

In [10]:
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

base_path = "/content/validation"
obs_extractor_ID = "asst_LFsVfvrlMTkrFFlr3qeNssLV"


def find_md_files(base_path):
    """Recursively find all .md files in the directory."""
    md_files = []
    for root, _, files in os.walk(base_path):
        for file in files:
            if file.endswith(".md"):
                md_files.append(os.path.join(root, file))
    return md_files

def process_md_file(md_file_path):
    """Read a .md file and return its contents as a string."""
    # Use the same encoding that clean_markdown writes with.
    with open(md_file_path, 'r', encoding='utf-8') as file:
        return file.read()

def split_into_sections(md_content):
    """
    Split the markdown content into logical sections using markdown headers.

    This function uses a regular expression to find headers (lines starting with '#' characters)
    and splits the content accordingly. If the file doesn't contain any headers, it returns the entire
    content as one section.
    """
    pattern = re.compile(r'^(#{1,6}\s.*)', re.MULTILINE)
    matches = list(pattern.finditer(md_content))
    sections = []
    if not matches:
        sections.append(md_content)
        return sections

    if matches[0].start() > 0:
        sections.append(md_content[:matches[0].start()])

    for i, match in enumerate(matches):
        start = match.start()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(md_content)
        sections.append(md_content[start:end])
    return sections

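def send_to_openai(section):
  """Send one section to the extractor assistant and return its text reply."""
  message_thread = client.beta.threads.create(
    messages=[
      {"role": "user", "content": section}
    ],
  )
  run = client.beta.threads.runs.create_and_poll(
    thread_id=message_thread.id,
    assistant_id=obs_extractor_ID
  )
  # Fail loudly instead of returning an unbound `result` when the run does not complete.
  if run.status != 'completed':
    raise RuntimeError(f"Extractor run ended with status {run.status}")
  messages = client.beta.threads.messages.list(
    thread_id=message_thread.id,
  )
  for message in messages.data:
    if message.role == "assistant":
      result = message.content[0].text.value
      print(result)
      return result
  raise RuntimeError("Extractor run completed without an assistant message")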

def remove_duplicates(observables_list):
    """Remove duplicate dictionaries from a list of observables."""
    unique = []
    seen = set()
    for item in observables_list:
        # Convert dictionary to a JSON string for a consistent, hashable representation.
        item_str = json.dumps(item, sort_keys=True)
        if item_str not in seen:
            seen.add(item_str)
            unique.append(item)
    return unique


# Locate all markdown files in the base directory.
md_files = find_md_files(base_path)
print(f"Found {len(md_files)} markdown files.")

results = {}
ranking_list = []
total_start_time = time.time()

# Process each markdown file
for md_file in md_files:
    print(f"\nProcessing file: {md_file}")
    try:
        md_content = process_md_file(md_file)
        sections = split_into_sections(md_content)
        print(f"File split into {len(sections)} sections.")
        file_key = os.path.basename(md_file)
        results[file_key] = {"observables": []}
        obs_id = 1

        # Process each section separately
        for idx, section in enumerate(sections):
        # for idx, section in enumerate(sections[6:], start=6):

            section_prompt  = f"Here is the text snippet: \n'''\n{section}\n'''"
            print(f"Processing section {idx + 1} of {len(sections)}")
            print(f"\n--------------------------------Section {idx+1}-----------------------------------\n {section}")
            print(f"\n--------------------------------results-----------------------------------\n")
            try:
                raw_response = send_to_openai(section_prompt)

                if not raw_response.strip():
                    raise ValueError(f"Empty response for section {idx + 1} in file {md_file}")

                # Remove an optional Markdown fence and "json" tag.
                # str.strip("json") strips characters, not a prefix, so use a regex instead.
                cleaned_response = re.sub(r'^(?:```)?\s*(?:json\b)?\s*|\s*```$', '', raw_response.strip())
                parsed_response = json.loads(cleaned_response)

                # Original list of observables
                original_obs = parsed_response.get("observables", [])
                print(f"Original count of observables: {len(original_obs)}")


                # Update the response with filtered observables
                parsed_response["observables"] = [ob for ob in original_obs if ob.get("classification") != "Malware"]

                print(f'Count after removing "Malware" observables: {len(parsed_response["observables"])}')

                print(f'parsed_response["observables"]: {parsed_response["observables"]}')

                for ob in parsed_response.get("observables", []):
                    ob["observable_id"] = obs_id
                    obs_id += 1

                results[file_key]["observables"].extend(parsed_response.get("observables", []))

                if parsed_response["observables"]:
                    ranking = observable_rank_validator(parsed_response["observables"])
                    ranking_list.append(ranking)
                    print("Ranking:\n", ranking)
                else:
                    print("No observables extracted in this section.")

            except Exception as section_error:
                print(f"Error processing section {idx + 1} of {md_file}: {section_error}")
                continue

    except Exception as e:
        print(f"Error processing file {md_file}: {e}")
        results[os.path.basename(md_file)] = {"error": str(e)}

total_end_time = time.time()
total_elapsed = total_end_time - total_start_time
print(f"\nTotal processing time: {total_elapsed:.2f} seconds.")

output_file_path = "/content/results-reports-new-prompt.json"
with open(output_file_path, "w") as output_file:
    json.dump(results, output_file, indent=4)
print(f"Results saved to {output_file_path}.")

output_ranking_path = "/content/ranking_results.txt"
with open(output_ranking_path, "w", encoding="utf-8") as out_f:
    out_f.write("\n\n".join(ranking_list))
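
The loop above defines `remove_duplicates` but never calls it, and because `observable_id` is assigned before any comparison, two otherwise identical records would no longer compare equal. The sketch below shows one possible arrangement: parse the reply, drop exact duplicates while ignoring `observable_id`, then number what is left. The names `extract_json` and `dedupe_then_number` are hypothetical, and the sample reply is invented.

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Hypothetical helper: parse a reply that may be wrapped in a Markdown json fence."""
    text = raw.strip()
    match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, flags=re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)

def dedupe_then_number(observables, start_id=1):
    """Drop exact duplicates first, then assign sequential observable_id values."""
    seen, unique = set(), []
    for ob in observables:
        # Compare on content only, so previously assigned IDs do not hide duplicates.
        content = {k: v for k, v in ob.items() if k != "observable_id"}
        key = json.dumps(content, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(dict(ob))
    for offset, ob in enumerate(unique):
        ob["observable_id"] = start_id + offset
    return unique

# Illustrative reply only: the same observable extracted twice.
raw = (
    '```json\n'
    '{"observables": ['
    '{"observable_value": "10.0.0.5", "observable_rank": "Actionable"}, '
    '{"observable_value": "10.0.0.5", "observable_rank": "Actionable"}]}\n'
    '```'
)
numbered = dedupe_then_number(extract_json(raw)["observables"])
print(numbered)
```

Whether duplicates across sections should be merged is a design choice; keeping them per section preserves where each observable was seen, at the cost of repeated validator calls.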


Streaming output truncated to the last 5000 lines.
47. On December 2, 2017, after seventeen months of investigation and the issuance of an additional McLaren Report, the Schmid Commission-a separate Disciplinary Commission tasked with investigating the McLaren Reports' findings-reported that it concurred with McLaren's conclusions regarding "the existence of a systematic doping scheme in Russia," involving the 'systematic manipulation of the anti-doping rules and system in Russia."" The Schmid Commission further confirmed the involvement of a number of individuals within Russia's Ministry of Sport and its subordinate entities.

48. On December 5, 2017, in response to the Schmid Commission Report, the IOC, among other sanctions, suspended the Russian Olympic Committee and its president and prohibited Russian athletes from participating in the 2018 Winter Olympics under the Russian flag (instead allowing demonstrably clean athletes to compete as "Olympic Athletes from Russi

# Observable Extractor Assistant instructions:

In [1]:
our_new_observable_extractor_assistant_instructions = """
You are a helpful Cybersecurity assistant for identifying observables in Cyber Threat Intelligence text snippets.

---
### Task

1. You will receive a text snippet of a CTI report from a user.
2. Read the given snippet (plain text) carefully.
3. Extract *every observable* (artifact) mentioned — do *not* omit any.
4. For each observable, output a JSON object with the exact fields listed in the *Response format* section.

---

### Decision Rules

| Rank           | Decide “Actionable” **when …**                                                                                                                                      | Typical examples                                                                                                              |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| **Actionable** | • The exact string can be matched **as‑is** (or after a trivial transform like Base64 decode).<br>• A detection rule (IDS/YARA/SIEM, PLC logic, etc.) could key on it with low false‑positives.<br>• **Uniqueness is obvious** from its syntax or entropy (i.e. **not** a common filename, generic string, or truncated path). | – Full code snippets / shell commands / SQL queries<br>– Complete URLs, URIs, or file paths<br>– IPs, domains, GUIDs, hashes, registry keys, mutexes, etc. |
| **Described**  | • Details are not sufficient to act upon **or** the observable is not directly searchable; ambiguity or missing information.<br>• Additional enrichment (lookup tables, parameter substitution, memory carving, etc.) is required **and** is non‑trivial.<br>• **Generic filenames, ambiguous script names, truncated paths, or common class/object names** belong here rather than Actionable. | – “An HTTP POST to `/api/checkin?id=<random>`” (placeholder)<br>– “A PowerShell script that loads shellcode”<br>– “A DLL named like a system file”<br>– `nc.exe`, `main.exe`, `winrm.vbs` |
| **Mentioned**  | • Only generic or conceptual references; **cannot** be turned into a concrete detection even after enrichment.                                                          | – “Uses AES‑128”<br>– “Makes RPC calls”<br>– “Phishing email”                                                                   |

---
### Definitions

1. **Actionable Observable**: meets the **Actionable** rules above.
2. **Described Observable**: meets the **Described** rules above (and not **Actionable**).
3. **Mentioned Observable**: meets the **Mentioned** rules above.

4. *STIX Supported*
   • *Full*: the artifact’s type exists in STIX Cyber-Observable Objects.
   • *Partial*: supported only via x_ custom properties or the generic artifact object.
   • *No*: no direct STIX mapping.

5. *Proprietary Artifact*
   • Open/Standard Technology
   • Proprietary-Documented Technology
   • Proprietary-Undocumented Technology

---
### Figure Sections
Treat any “Figure X” block—and everything that follows it—as regular CTI text:
- **Include pipe-tables**, fenced code, inline snippets, and raw dumps (hex/ASCII).
- **Always parse** and if exists **extract** observables from Figures—do *not* skip them.
- **Apply the same decision rules** to Figure content as described above.

---
### Contextual Metadata Filtering
Scan every part of the document (front matter, headers/footers, first/last pages), but **only extract** observables that meet your technical criteria:
- **Judge relevance by context**: names, emails, dates, signatures, page numbers, titles, etc., are metadata—ignore them unless they themselves qualify as technical artifacts.
- If uncertain whether text is an observable, apply the normal rank rules.

---
### Handling Large Artifacts

- For code snippets, include the *full* code shown in the text, including triple backticks.
- If an artifact is encoded (e.g., Base64), include both the *encoded value* and its *decoded value* as separate observables, unless the snippet makes one of them irrelevant.
- Treat the most specific string that can stand alone in a detection rule as the observable; you may also mention in its notes the relevant less‑specific substrings if they appear separately in the text.

---
### Fields to produce for *every* observable

| Field                  | What to put                                                                                                   |
| ---------------------- | ------------------------------------------------------------------------------------------------------------- |
| observable_value     | Exact string (or faithful paraphrase). Escape any internal backticks.                                          |
| observable_rank      | Select one of "Mentioned", "Described", or "Actionable", assigned *strictly* according to the criteria in the *Definitions* section.                                                                  |
| data_source          | Specify the primary telemetry source (see cheat‑sheet below); do NOT use "None" unless the artifact is truly unobservable via telemetry.                                               |
| classification       | Short type label (e.g., "ICS Command", "URL", "Software/Tool")                  |
| STIX_supported       | "Full: <STIX_Object_Name>" | "Partial" | "No".                                                                     |
| proprietary_artifact | "Open/Standard Technology" | "Proprietary-Documented Technology" | "Proprietary-Undocumented Technology".        |
| parser               | List well‑known parser(s) for the data format (e.g., "Zeek", "Wireshark", "YARA"); use "N/A" or null only when no suitable parser exists.   |
| notes                | Any extra comments or context (Markdown allowed), or null if none.                                           |

---
### Common data_source cheat-sheet

Network traffic • Netflow • PCAP • DNS logs • Web proxy logs • Endpoint (EDR) logs • System logs (Windows Event, syslog) • ICS historian • PLC ladder logic • Firewall logs • Cloud API audit logs • Memory dump • None (if not observable via telemetry)

---
### Response format (return *only* this JSON)

json
{
  "observables": [
    {
      "observable_value": "<VAL>",
      "observable_rank": "Mentioned | Described | Actionable",
      "data_source": "<text>",
      "classification": "<text>",
      "STIX_supported": "Full: <STIX_Object_Name> | Partial | No",
      "proprietary_artifact": "Open/Standard Technology | Proprietary-Documented Technology | Proprietary-Undocumented Technology",
      "parser": "<text>" | null | "N/A",
      "notes": "<text>" | null
    }
    // … repeat for each unique observable
  ]
}

"""


In [13]:
obs_extractor_ID = "asst_LFsVfvrlMTkrFFlr3qeNssLV"
our_updated_assistant = client.beta.assistants.update(
  obs_extractor_ID, #this is the observable extractor assistant ID
  instructions=our_new_observable_extractor_assistant_instructions, # your new instructions
)
# Note: this call updates the assistant stored in OpenAI, so the next time you run the notebook from the beginning, it will use the updated assistant.
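
The field table in the extractor instructions fixes eight fields and a small set of allowed values, so each parsed record can be checked before it is stored or sent to the validator. The sketch below is one such check. The name `validate_observable` is hypothetical, and the rules are read off the instructions above; they are not an official schema.

```python
REQUIRED_FIELDS = (
    "observable_value", "observable_rank", "data_source", "classification",
    "STIX_supported", "proprietary_artifact", "parser", "notes",
)
ALLOWED_RANKS = {"Mentioned", "Described", "Actionable"}
ALLOWED_PROPRIETARY = {
    "Open/Standard Technology",
    "Proprietary-Documented Technology",
    "Proprietary-Undocumented Technology",
}

def validate_observable(ob: dict) -> list:
    """Hypothetical helper: return a list of problems found in one observable record."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in ob]
    rank = ob.get("observable_rank")
    if rank not in ALLOWED_RANKS:
        problems.append(f"unexpected observable_rank: {rank!r}")
    stix = ob.get("STIX_supported")
    stix_ok = isinstance(stix, str) and (stix in {"Partial", "No"} or stix.startswith("Full: "))
    if not stix_ok:
        problems.append(f"unexpected STIX_supported: {stix!r}")
    if ob.get("proprietary_artifact") not in ALLOWED_PROPRIETARY:
        problems.append(f"unexpected proprietary_artifact: {ob.get('proprietary_artifact')!r}")
    return problems

# Illustrative record only.
good_record = {
    "observable_value": "10.0.0.5",
    "observable_rank": "Actionable",
    "data_source": "Firewall logs",
    "classification": "IP Address",
    "STIX_supported": "Full: ipv4-addr",
    "proprietary_artifact": "Open/Standard Technology",
    "parser": "Zeek",
    "notes": None,
}
bad_record = dict(good_record, observable_rank="High")
print(validate_observable(good_record))
print(validate_observable(bad_record))
```

Records that fail the check could be logged and skipped, which keeps malformed model output from reaching the rank validator.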

***Clustering Observables***