# LLM with RCA after anomaly detection using RAG<br>
## Project Overview<br>
Author: Sedat Kaymaz<br>
This project aims to provide a root cause analysis (RCA) method that uses an LLM with dynamic RAG over a system log file <br>
as reference, after an anomaly detection ML method has been applied to a basic list of telecom metrics. <br>

In [3]:
# Run once only
#%pip install -r requirements.txt

In [4]:
import os
import pandas as pd
import faiss  # FAISS backend used by langchain_community.vectorstores.FAISS
from sklearn.ensemble import IsolationForest
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import CSVLoader, TextLoader
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain.text_splitter import CharacterTextSplitter


# Load environment variables
load_dotenv()

False

In [None]:
# LLM Configuration
from getpass import getpass


def get_llm(model):
    """
    Retrieves the Language Model (LLM) used for root cause analysis.

    The function first checks whether the OpenAI API key is set as an environment variable.
    If it is not, it prompts the user for the key (without echoing it) and stores it in the
    environment. It then returns a ChatOpenAI instance with temperature 0 for the requested model.

    Args:
        model (str): Name of the OpenAI chat model, e.g. "gpt-4".

    Returns:
        ChatOpenAI: The LLM instance used for root cause analysis.
    """
    # If key entry via the embedded prompt does not work, uncomment the line below and
    # replace 'put_your_key_here' with your actual key
    #os.environ["OPENAI_API_KEY"] = 'put_your_key_here'
    openai_api_key = os.getenv('OPENAI_API_KEY')
    # Do not print the key: it would be stored in the notebook output
    if not openai_api_key:
        openai_api_key = getpass("Please enter your OpenAI API key: ")
        os.environ["OPENAI_API_KEY"] = openai_api_key

    # You can modify this to use different models or local LLMs
    return ChatOpenAI(temperature=0, model_name=model)

print("Loading Language Model...")
model='gpt-4'
llm = get_llm(model)
print(f"Language Model {model} loaded.")

In [6]:

# RAG for reading and processing metrics file
def process_metrics_file(filename):
    """
    Process the metrics file and return a vector store.

    Args:
        filename (str): The name of the metrics file.

    Returns:
        vectorstore: A vector store containing the embeddings of the text documents.
    """
    loader = CSVLoader(f"data/{filename}")
    documents = loader.load()
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(documents)
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(texts, embeddings)
    return vectorstore

# Process metrics file
metrics_vectorstore = process_metrics_file("metrics.csv")

# Load metrics data for anomaly detection
df = pd.read_csv("data/metrics.csv")
df['time'] = pd.to_datetime(df['time'])
print(df.head(5))

                 time  call_attempt  call_success  call_failure  \
0 2024-09-04 00:00:00           114           110             0   
1 2024-09-04 00:01:00           113           110             0   
2 2024-09-04 00:02:00           114           111             0   
3 2024-09-04 00:03:00           113           111             1   
4 2024-09-04 00:04:00           112           111             1   

   total_registered_subs  call_success_rate  
0                   9031              96.40  
1                   9084              97.34  
2                   9089              97.36  
3                   9035              98.23  
4                   9092              99.10  
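CSVLoader turns each CSV row into its own document, so every one-minute sample becomes a short text of `column: value` lines, far below `chunk_size=1000`, and the splitter leaves it intact. The sketch below approximates that conversion with the standard library only. `rows_to_documents` is a hypothetical helper and plain dicts stand in for LangChain `Document` objects, so details such as the metadata keys are assumptions and may differ from the real loader.

```python
import csv
import io


def rows_to_documents(csv_text: str, source: str = "data/metrics.csv") -> list:
    """Approximate how one CSV row becomes one text document before embedding.

    Hypothetical helper: each row is rendered as "column: value" lines, and a dict
    with page_content/metadata stands in for a LangChain Document.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for i, row in enumerate(reader):
        content = "\n".join(f"{key.strip()}: {value.strip()}" for key, value in row.items())
        docs.append({"page_content": content, "metadata": {"source": source, "row": i}})
    return docs


# Two rows copied from the metrics preview above
sample = (
    "time,call_attempt,call_success,call_failure,total_registered_subs,call_success_rate\n"
    "2024-09-04 00:00:00,114,110,0,9031,96.40\n"
    "2024-09-04 00:01:00,113,110,0,9084,97.34\n"
)
docs = rows_to_documents(sample)
print(len(docs))
print(docs[0]["page_content"])
```

Because similarity search then works row by row, `k=2` in the RCA step retrieves at most two one-minute samples from the metrics store.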


In [7]:
# Anomaly detection using Isolation Forest
def detect_anomalies(df):
    """
    Detects anomalies in the given DataFrame using the Isolation Forest algorithm.

    Parameters:
    - df (pandas.DataFrame): The input DataFrame containing the data to be analyzed.

    Returns:
    - pandas.DataFrame: A subset of the input DataFrame containing only the rows that are classified as anomalies.
    """

    features = ['call_attempt', 'call_success', 'call_failure', 'total_registered_subs', 'call_success_rate']
    X = df[features]
    
    iso_forest = IsolationForest(contamination=0.005, random_state=42)
    anomalies = iso_forest.fit_predict(X)
    
    df['is_anomaly'] = anomalies
    return df[df['is_anomaly'] == -1]

# Detect anomalies
anomalies = detect_anomalies(df)
if anomalies.empty:
    print("No anomalies detected in the metrics.")
else:
    print(f"Anomalies found:\n{anomalies}\n")

Anomalies found:
                   time  call_attempt  call_success  call_failure  \
683 2024-09-04 11:23:00           114            27             0   
684 2024-09-04 11:24:00           113            32             0   
685 2024-09-04 11:25:00           112            40             0   
686 2024-09-04 11:26:00           114            70             2   

     total_registered_subs  call_success_rate  is_anomaly  
683                   9029             23.491          -1  
684                   9038             28.368          -1  
685                   9033             35.730          -1  
686                   9157             61.491          -1  
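`fit_predict` labels each row `1` (inlier) or `-1` (outlier). Isolation Forest scores a point by how quickly random splits isolate it: with `E[h(x)]` the average path length of `x` over the trees and `c(n)` the average path length for a sub-sample of size `n`, the original paper defines `s(x, n) = 2 ** (-E[h(x)] / c(n))` with `c(n) = 2*H(n-1) - 2*(n-1)/n`, where the harmonic number `H(i)` is approximated by `ln(i) + 0.5772...` (Euler's constant). Scores near 1 mark easily isolated points, and a point whose average path length equals `c(n)` scores exactly 0.5. scikit-learn's `score_samples` returns the negated score, and `contamination=0.005` places the threshold so that roughly 0.5% of the training rows are labelled `-1`. The sketch below only evaluates the formula; it does not build trees, and the helper names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329


def average_path_length(n: int) -> float:
    """c(n): average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length: float, n: int) -> float:
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); values close to 1 indicate anomalies."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


psi = 256  # sub-sample size per tree (scikit-learn's max_samples="auto" uses min(256, n_samples))
c = average_path_length(psi)
print(round(c, 2))
print(anomaly_score(c, psi))             # path length equal to c(n) -> 0.5
print(round(anomaly_score(2.0, psi), 3))  # isolated after about 2 splits -> close to 1
```

Rows such as 683 to 686, where `call_success` collapses while `call_attempt` stays normal, sit far from the bulk of the data and are isolated after few splits, which is why they end up in the 0.5% tail.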



In [8]:
# RAG for processing log file
def process_log_file(filename):
    """
    Process a log file and return a vector store.

    Args:
        filename (str): The name of the log file to process.

    Returns:
        vectorstore: A vector store containing embeddings of the log file texts.
    """
    loader = TextLoader(f"data/{filename}")
    documents = loader.load()
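    # Split on single newlines so chunks contain whole log lines; the default "\n\n"
    # separator can leave a log file without blank lines as one oversized chunk
    text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=0)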
    texts = text_splitter.split_documents(documents)
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(texts, embeddings)
    return vectorstore


# Process log file
print("Processing log file...")
logs_vectorstore = process_log_file("systemd.log")
print("Log file has been processed.")   

Processing log file...
Log file has been processed.
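`CharacterTextSplitter` splits on `"\n\n"` by default, so a log without blank lines can come back as a single chunk larger than `chunk_size` (LangChain logs a warning in that case), and `similarity_search(..., k=2)` would then return most or all of the file. Splitting on `"\n"` packs whole log lines into chunks of at most 1000 characters with no overlap. The sketch below imitates that line-packing behaviour with the standard library; `chunk_log_lines` is a hypothetical helper, not LangChain's implementation, and the sample lines are synthetic.

```python
def chunk_log_lines(text: str, chunk_size: int = 1000) -> list:
    """Greedily pack whole log lines into chunks of at most chunk_size characters.

    Simplified stand-in for a newline-separated splitter with zero overlap: lines are
    never cut, so a single line longer than chunk_size becomes its own chunk.
    """
    chunks = []
    current = []
    length = 0
    for line in text.splitlines():
        added = len(line) + (1 if current else 0)  # +1 for the joining newline
        if current and length + added > chunk_size:
            chunks.append("\n".join(current))
            current, length = [], 0
            added = len(line)
        current.append(line)
        length += added
    if current:
        chunks.append("\n".join(current))
    return chunks


# Synthetic log: 100 lines of 50 characters each, no blank lines
log_text = "\n".join(f"entry {i:03d} " + "x" * 40 for i in range(100))
chunks = chunk_log_lines(log_text)
print(len(chunks), max(len(c) for c in chunks))
```

Each chunk then holds a contiguous block of log lines, so a retrieved chunk keeps timestamps next to the messages around them.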


In [9]:

# Root cause analysis using LLM
def analyze_root_cause(llm, metrics_vectorstore, logs_vectorstore, anomalies):
    """
    Analyzes the root cause of anomalies in the metrics based on the system logs.

    Args:
        llm (LLM): The LLM object used for analysis.
        metrics_vectorstore (VectorStore): The vector store containing metrics information.
        logs_vectorstore (VectorStore): The vector store containing log information.
        anomalies (pandas.DataFrame): The anomalous rows returned by detect_anomalies().

    Returns:
        str: A detailed root cause analysis for the given anomalies.

    Example:
        anomalies = detect_anomalies(df)
        result = analyze_root_cause(llm, metrics_vectorstore, logs_vectorstore, anomalies)
        print(result)
    """
    
    prompt_template = """
    Analyze the following anomalies in the metrics and provide a root cause analysis based on the system logs:

    Anomalies:
    {anomalies}

    Relevant metrics information:
    {metrics_info}

    Relevant log information:
    {logs_info}

    Please provide a detailed root cause analysis for these anomalies.
    """

    prompt = PromptTemplate(
        input_variables=["anomalies", "metrics_info", "logs_info"],
        template=prompt_template
    )

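    def format_docs(docs):
        # Join the text of the retrieved Documents instead of passing their raw repr to the prompt
        return "\n\n".join(doc.page_content for doc in docs)

    # Render the anomalous rows as plain text; it serves as both prompt input and retrieval query
    query = anomalies.to_string() if isinstance(anomalies, pd.DataFrame) else str(anomalies)

    # Create a RunnableSequence
    chain = (
        {
            "anomalies": RunnablePassthrough(),
            "metrics_info": lambda x: format_docs(metrics_vectorstore.similarity_search(x, k=2)),
            "logs_info": lambda x: format_docs(logs_vectorstore.similarity_search(x, k=2))
        }
        | prompt
        | llm
        | StrOutputParser()
    )

    # Invoke the chain
    result = chain.invoke(query)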
    
    return result


# Perform root cause analysis
analysis = analyze_root_cause(llm, metrics_vectorstore, logs_vectorstore, anomalies)

print("Root Cause Analysis:")
print(analysis)


Root Cause Analysis:
Based on the provided metrics and system logs, the anomalies in the metrics seem to be related to the OpenStack services, specifically the Open vSwitch service. 

The metrics show a significant drop in call success rate starting from 2024-09-04 11:23:00. This coincides with the system logs which show that the Open vSwitch service encountered an error and crashed at 2024-09-04 11:22:13. The error message "assertion pad_size <= dp_packet_size(b) failed in dp_packet_set_l2_pad_size()" indicates a problem with packet padding size, which could potentially disrupt network traffic.

The Open vSwitch service is a key component of the OpenStack platform, providing network connectivity for virtual machines. If this service fails, it could disrupt the network traffic, leading to call failures. 

The system logs also show that the Open vSwitch service was restarted at 2024-09-04 11:22:15, but the call success rate did not recover immediately, possibly due to ongoing network di