Creating SoC Analyst Assistant-Incident RAG Investigation System (LLM + RAG-based security Incident Assistant)
Steps:
1. Data Loading and EDA 
2. Data Prep
3. Data Embedding
4. Model Build
5. LCEL Runnable Pipelines
6. Hybrid Retrieval( Faiss + BM25)
7. User Specific Memory(InMemoryChatMessageHistory)
8. EntityMemory extraction (query + retrieved context)
9. Structured Output
    {
        "summary" : "...",
        "recommende_actions" : [...],
        "confidence":0.xx,
        "threat_score":78,
        "related_incidents":[...],
        "entities":{...}
    }

In [1]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter 
from langchain_community.vectorstores import FAISS  
from langchain_core.prompts import ChatPromptTemplate  
from langchain_core.output_parsers import StrOutputParser 
from langchain_core.runnables import RunnablePassthrough  
from langchain_ollama import ChatOllama,OllamaEmbeddings  
from langchain_core.chat_history import InMemoryChatMessageHistory  
from langchain_core.runnables import RunnableWithMessageHistory  
from langchain_core.tools import tool  
from langchain_core.messages import AIMessage, HumanMessage  
import redis  
import json  
from langchain_community.retrievers import BM25Retriever
from langchain_core.embeddings import Embeddings


In [2]:
# Step 1 : Load the Documents
loader = TextLoader('security_incidents.txt', encoding = 'utf8')
documents =loader.load()
print(f"Loaded {len(documents)} documents.")
print(f"First document content: {documents[0].page_content[:500]}")

Loaded 1 documents.
First document content: Incident #001 | User=johns | Alert=Multiple failed SSH logins | SourceIP=10.1.1.9 | Host=SRV-LNX-01 | OS=Ubuntu 20 | MITRE=T1110 | Severity=High | Resolution=Blocked source IP; Reset password; Enabled MFA.
Incident #002 | User=markp | Alert=Suspicious PowerShell encoded command detected | Host=WKS-22 | OS=Windows 11 | MITRE=T1059 | Severity=High | Resolution=Terminated process; Disabled PowerShell v2; Quarantined artifacts.
Incident #003 | User=anitaa | Alert=Rapid file encryption detected (poss


In [3]:
# Step 2 : Split the Documents into Chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = text_splitter.split_documents(documents)
print(f"Split into {len(docs)} chunks.")
print(f"Created {len(docs)} chunks.")  # EDA: Check split (like len(X_train))
for i, chunk in enumerate(docs[:2]):
    print(f"Chunk {i+1} (len {len(chunk.page_content)} chars): {chunk.page_content[:100]}...")

Split into 20 chunks.
Created 20 chunks.
Chunk 1 (len 427 chars): Incident #001 | User=johns | Alert=Multiple failed SSH logins | SourceIP=10.1.1.9 | Host=SRV-LNX-01 ...
Chunk 2 (len 434 chars): Incident #003 | User=anitaa | Alert=Rapid file encryption detected (possible ransomware) | Host=FIN1...


In [4]:
# Step 3 : Create the Embeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text:latest")
vectorstore = FAISS.from_documents(docs, embeddings)
print("Created FAISS vector store with embeddings.", len(vectorstore.index_to_docstore_id))
retriever = vectorstore.as_retriever(k=3)


Created FAISS vector store with embeddings. 20


In [5]:
# Step 4 : Create the LLM
llm = ChatOllama(model="gemma:2b",temperature=0.7,max_tokens=1000)
prompt = ChatPromptTemplate.from_messages([
    "You are a Security Operations Center (SoC)Security Assistant helping with security incident analysis.\n"
    "User Query : {query}\n"
    "Context : {context}\n"
    "Based on the context, provide a detailed response to the user's query."
])
parser = StrOutputParser()
chain = ({"context" : retriever, "query": RunnablePassthrough()} | prompt | llm | parser)

query = "How should I respond to a phishing incident?"
response = chain.invoke(query)  
print("RAG Response:", response)  

RAG Response: ## Response to User Query

**Context:**

The context provides information about several phishing incidents, each with varying details and potential threats:

* **Attackers:** Users anitaa, rakesh, stevet, rohan, and johns.
* **Targets:** Servers named FIN12, SRV-DB01, SRV-LNX-12, SRV-APP2, WKS-72, and WKS-22.
* **Attack types:** Rapid file encryption, unusual DNS requests, SSH brute-force authentication, and large outbound data transfer.
* **Impact:** Critical vulnerabilities, high risk of data loss, potential for ransomware infection.

**Initial Actions:**

1. **Contain the incident:** Stop any active infections and disconnect compromised systems from the network.
2. **Investigate the source of each incident:** Analyze logs, network traffic, and system events to understand the attackers' tactics and motivations.
3. **Gather evidence:** Collect logs, screenshots, and any other relevant data for forensic analysis.
4. **Notify relevant stakeholders:** Inform affected users,

In [6]:
# Step 5 : Create the Memory Chain
store = {}  
def get_session_history(session_id: str):  
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]

memory_chain = RunnableWithMessageHistory(  
    chain,
    get_session_history,
    input_messages_key="query",  
    history_messages_key="context"  
)

config = {"configurable": {"session_id": "user1"}}  


In [35]:
# Step 6: Entity-Memory Extraction
def extract_entities_tool(query: str) -> str:
    entity_prompt = ChatPromptTemplate.from_messages([
        "Extract and list the key entities (like IP addresses, hosts, OS, Mitre, Severity) from the following security incident query:\n"
        "Query:{query}\n"
        "Context:{context}\n"
    ])
    entity_chain = ({"context": retriever, "query": RunnablePassthrough()} | entity_prompt | llm | parser)
    # invoke with the raw query (chain will handle retrieval for context)
    return entity_chain.invoke(query)

# Example usage (avoid reusing the name 'entities')
sample_query = "Investigate the phishing email received on host 192.168.10 with high severity."
extracted_entities = extract_entities_tool(sample_query)
print("Extracted Entities:", extracted_entities)


Extracted Entities: | Key Entity | Value |
|---|---|
| IP Address | 192.168.10 |
| Host | FIN12 |
| OS | Windows 10 |
| MITRE | T1486 |
| Severity | Critical |
| Source | security_incidents.txt |
| Incident ID | cad5e768-4183-41e4-923f-830fca700213 |
| Incident ID | 5314bffb-1811-40de-a6df-9f9def502969 |
| Incident ID | 033 |
| Source | security_incidents.txt |
| Incident ID | 034 |
| Source | security_incidents.txt |
| Incident ID | e1e02a2f-8954-4b81-a742-68d25c2e06d1 |
| Incident ID | 023 |
| Source | security_incidents.txt |
| Incident ID | 024 |
| Source | security_incidents.txt |
| Incident ID | 4d15fce5-3ab4-4398-ad10-edc3f7f4344e |
| Incident ID | 013 |
| Source | security_incidents.txt |
| Incident ID | 014 |
| Source | security_incidents.txt |


In [36]:
#  Step 7 : Json Structured Output
def extract_entities_tool(query: str) -> str:
    entity_prompt = ChatPromptTemplate.from_messages([
        "Based on the following security incident query and context, provide a structured JSON response with fields: 'summary', 'recommended_actions', 'confidence', 'threat_score','related_incidents', 'severity', 'affected_systems'.\n"
        "Query:{query}\n",
        "Context:{context}\n",
        "Respond in JSON format only."
    ])
    entity_chain = ({"context": retriever, "query": RunnablePassthrough()} | entity_prompt | llm | parser)
    # invoke with the raw query (chain will handle retrieval for context)
    return entity_chain.invoke(query)

# Example usage (avoid reusing the name 'entities')
sample_query = "Investigate the phishing email received on host 192.168.10 with high severity."
extracted_entities = extract_entities_tool(sample_query)
print("Extracted Entities:", extracted_entities)

Extracted Entities: {
  "summary": "Phishing email received on multiple systems",
  "recommended_actions": [
    "Implement a comprehensive phishing detection and response solution.",
    "Educate users about phishing attacks and how to avoid them.",
    "Conduct regular security awareness training.",
    "Review and update security policies and procedures.",
    "Implement a robust incident response plan.",
    "Perform regular security assessments and penetration testing exercises."
  ],
  "confidence": "High",
  "threat_score": 10,
  "related_incidents": [
    "Incident #003 | User=anitaa | Alert=Rapid file encryption detected (possible ransomware)",
    "Incident #004 | User=rakesh | Alert=Unusual outbound DNS requests to unknown domain",
    "Incident #033 | User=anitaa | Alert=Credential dumping via lsass.exe",
    "Incident #034 | User=rakesh | Alert=Outbound traffic to known malicious IP",
    "Incident #023 | User=harsha | Alert=Unauthorized file transfer via SCP",
    "Incide