## This notebook and script rely on the Ollama Python package and the open-source Llama 3 LLM.

To deploy this, run it on a persistent server with at least 32 GB of memory, and make sure that port 11434 (Ollama's default API port) is open.
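Before sending any prompts, it can help to confirm that the port is actually reachable from the machine running the notebook. The following is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper name, not part of the Ollama package, and a successful TCP connection only shows that something is listening, not that it is Ollama.

```python
import socket


def port_is_open(host: str, port: int = 11434, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False
```

For example, `port_is_open("127.0.0.1")` checks a local Ollama instance on the default port.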

Because the `ollama` Python package talks to the Ollama API, you may need to create a custom `Client` that points at the server where Ollama and the model are hosted, and then route the requests to Ollama below through that client, e.g.:

```python
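from ollama import Client

# Point the client at the server hosting Ollama (replace with your own host)
client = Client(host='http://38.242.230.184:11434')

prompt = "Hello"  # placeholder; in this notebook the prompt is built inside analyze_logs

response = client.chat(
    model="llama3",
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
)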
```

Documentation: https://github.com/ollama/ollama-python
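To avoid hard-coding the server address in the notebook, the host could be read from the environment. This is a rough sketch only: `resolve_ollama_host` and `DEFAULT_OLLAMA_HOST` are hypothetical names, and the choice of the `OLLAMA_HOST` variable is an assumption of this example.

```python
import os

# Assumed fallback: a local Ollama instance on the default port
DEFAULT_OLLAMA_HOST = "http://127.0.0.1:11434"


def resolve_ollama_host(env=None) -> str:
    """Return the Ollama base URL from the environment, or a local default."""
    env = os.environ if env is None else env
    host = env.get("OLLAMA_HOST", "").strip() or DEFAULT_OLLAMA_HOST
    # Accept bare "host:port" values by assuming plain HTTP
    if "://" not in host:
        host = f"http://{host}"
    return host.rstrip("/")
```

The result can then be passed to the client shown above, as in `Client(host=resolve_ollama_host())`.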

In [None]:
# Run in your terminal, locally or on your persistent server (if distributed), before starting.
# `ollama run` needs the Ollama application itself installed, not only the Python package.
pip install ollama
ollama run llama3

In [1]:
# Install the Ollama Python client into the notebook's environment
%pip install ollama

Note: you may need to restart the kernel to use updated packages.


In [50]:
import os
import glob
import gzip
import ollama

In [51]:
def analyze_logs(folder_path, filter_strings=None, output_folder=None):
    log_files = []
    # Match only names ending in .gz, since that is what the gzip check below tests for
    for ext in ["*.log", "*.gz"]:
        log_files.extend(glob.glob(os.path.join(folder_path, ext)))

    if not log_files:
        print("No log files found in the specified folder.")
        return

    print(f"Found {len(log_files)} log files in the specified folder. Log batches will be 500 lines long.")

    if filter_strings:
        filtered_log_files = {}
        for filter_string in filter_strings:
            filtered_log_files[filter_string] = []

        for log_file in log_files:
            for filter_string in filter_strings:
                if filter_string in log_file:
                    filtered_log_files[filter_string].append(log_file)
                    break

        print("\nLog files per filter:")
        for filter_string, matched_files in filtered_log_files.items():
            print(f"Filter: {filter_string}, Count: {len(matched_files)}")

        # Flatten the per-filter lists back into a single list of files to analyze
        log_files = [log_file for matched_files in filtered_log_files.values() for log_file in matched_files]

        if not log_files:
            print("No log files match the provided filter strings.")
            return

        print(f"\nTotal filtered log files: {len(log_files)}")

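    # Default to the current directory so a missing output_folder does not crash os.path.join
    output_folder = output_folder or "."
    os.makedirs(output_folder, exist_ok=True)
    output_file = os.path.join(output_folder, "log_analysis_output.txt")
    # Write as UTF-8: model responses can contain emoji and other non-ASCII characters
    with open(output_file, "w", encoding="utf-8") as file: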
        for log_file in log_files:
            print(f"\nAnalyzing log file: {log_file}")
            file.write(f"\nAnalyzing log file: {log_file}\n")
            
            if log_file.endswith(".gz"):
                # Decompress gzip file
                with gzip.open(log_file, "rt", encoding="utf-8", errors="ignore") as log_file_handle:
                    log_lines = log_file_handle.readlines()
            else:
                # Read regular log file
                with open(log_file, "r", encoding="utf-8", errors="ignore") as log_file_handle:
                    log_lines = log_file_handle.readlines()
            
            batch_size = 500
            batch_number = 1
            total_batches = (len(log_lines) + batch_size - 1) // batch_size

            for i in range(0, len(log_lines), batch_size):
                batch = log_lines[i:i+batch_size]
                log_content = "".join(batch)

                print(f"Processing batch {batch_number} of {total_batches}")
                file.write(f"Processing batch {batch_number} of {total_batches}\n")

                prompt = f"""
                You are an expert IT systems administrator and full stack developer.

                Please analyze the following log file batch for potential errors, problems, or unusual activity. 
                
                Explain your findings.

                MAKE SURE TO:

                - You MUST STATE the specific request in the log VERBATIM when referencing it. My grandmother's life depends on it being stated EXACTLY as it appears in the log file.

                This is an example of a specific request: 
                10.52.115.148 example.com - [19/Apr/2024:00:17:07 +0000] "GET /wp-cron.php?doing_wp_cron HTTP/1.0" 200 0 "-" "curl/7.68.0"

                - ONLY explain findings which are unusual, abnormal, could potentially cause either security, performance, or scaling issues, are unusually long or resource intensive, repeated an unusual number of times, or potentially malicious, or otherwise suspicious. 

                - Here are some examples of what you should be looking for:

                Excessive 404 Errors:
                Look for a high number of requests resulting in 404 (Not Found) errors.
                This could indicate attempts to access non-existent pages or files, which may be a sign of potential security probing or brute-force attacks.
                
                Unusual User Agents:
                Pay attention to the User-Agent header in the log entries.
                Look for suspicious or uncommon user agents that may not represent legitimate browsers or tools.
                Malicious bots or automated scripts often use fake or modified user agents.
                
                Repeated Failed Login Attempts:
                Check for a high volume of failed login attempts, especially from the same IP address or within a short timeframe.
                This could indicate brute-force attacks trying to guess user credentials.
                
                Excessive Requests to Specific Pages or Resources:
                Monitor for an unusually high number of requests to specific pages, posts, or resources.
                This could be a sign of a denial-of-service (DoS) attack or an attempt to overwhelm the server.

                Long-Running or Resource-Intensive Requests:
                Look for requests that take an exceptionally long time to complete or consume significant server resources.
                These requests may indicate performance issues, inefficient queries, or potential vulnerabilities.
                
                Suspicious Query Parameters:
                Pay attention to the query parameters in the requested URLs.
                Look for abnormally long or complex query strings, special characters, or attempts to inject malicious code (e.g., SQL injection, cross-site scripting).
                
                Unusual POST Requests or Request methods:
                Monitor for POST requests to unusual or sensitive endpoints, such as the login page or administrative areas.
                Look for large or suspicious payloads in the request bodies.
                Look for unusual or suspicious HTTP request methods, such as PUT, PATCH, or DELETE.
                
                Repeated Requests from the Same IP Address:
                Check for a high volume of requests originating from a single IP address within a short period.
                This could indicate an automated script, a scraper, or an attempt to overload the server.
                
                Requests to Sensitive WordPress Files:
                Look for attempts to access sensitive WordPress files directly, such as wp-config.php, xmlrpc.php, or repeated wp-login.php requests.
                These requests may suggest attempts to exploit known vulnerabilities or gain unauthorized access.

                Unusual Referrers:
                Analyze the Referer header in the log entries.
                Look for suspicious or unfamiliar referrers that may indicate spam or malicious links pointing to your site.

                Requests with Unusual HTTP Methods:
                Pay attention to the HTTP methods used in the requests (e.g., GET, POST, PUT, DELETE).
                Look for requests using uncommon or unexpected HTTP methods, which may indicate attempts to exploit vulnerabilities.

                Requests to Non-WordPress Directories or Files:
                Monitor for requests to directories or files that are not part of the standard WordPress installation.
                This could indicate attempts to access sensitive files, configuration files, or backups.

                - You MUST provide specific timestamps, user agents, requests, IP addresses, or other identifying information when referencing a specific request and explaining findings.
                - You DO NOT need to explain findings which are expected, legitimate, normal, or otherwise routine.
                - You DO NOT need to explain the structure of the log file, the log file itself, or the log file format.
                - You DO NOT need to provide a breakdown of each column.
                - You DO NOT need to provide recommendations.
                
                This is the content of the server log:
                
                {log_content}
                """

                response = ollama.chat(
                                messages=[
                                    {
                                        "role": "user",
                                        "content": prompt,
                                    }
                                ],
                                model="llama3",
                            )

                # Extract the response content from the dictionary
                response_content = response['message']['content']

                print(f"Findings for batch {batch_number}:")
                print(response_content)
                print("\n")

                file.write(f"Findings for batch {batch_number}:\n")
                file.write(response_content)
                file.write("\n\n")

                batch_number += 1

    print(f"Analysis complete. Output saved to: {output_file}")
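`analyze_logs` loads each file fully with `readlines()` before slicing it into batches. For very large logs, one alternative is to stream the batches instead. The sketch below assumes the same 500-line batch size and the same `.gz` detection as the function above; `iter_log_batches` is a hypothetical helper, not something the notebook defines.

```python
import gzip
from itertools import islice
from typing import Iterator, List


def iter_log_batches(path: str, batch_size: int = 500) -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` lines without loading the whole file."""
    # Pick the opener the same way analyze_logs does: by the .gz suffix
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="ignore") as handle:
        while True:
            # islice pulls at most batch_size lines from the open handle
            batch = list(islice(handle, batch_size))
            if not batch:
                break
            yield batch
```

The trade-off is that the total number of batches is not known up front, so a "batch N of M" progress message would need a separate line count or would have to drop the total.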

In [52]:
# Prompt the user for the folder path
folder_path = input("Enter the folder path containing the log files: ")

# Prompt the user for multiple filter strings
filter_strings = input("Enter filter strings separated by commas (optional): ").split(",")
filter_strings = [s.strip() for s in filter_strings if s.strip()]
if not filter_strings:
    filter_strings = None

# Prompt the user for the output folder
output_folder = input("Enter the folder path to save the output text file: ")

# Create the output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

In [53]:
# Call the function to analyze the log files
analyze_logs(folder_path, filter_strings, output_folder)

Found 96 log files in the specified folder. Log batches will be 500 lines long.

Log files per filter:
Filter: acrostudy, Count: 8

Total filtered log files: 8

Analyzing log file: /Users/robertli/Desktop/local-projects/access_log-classifier/pod-207352_logs/acrostudy.apachestyle.log
Processing batch 1 of 1
Findings for batch 1:
It looks like you're trying to analyze some log data. 😊

From what I can see, these logs appear to be HTTP requests made to the `/xmlrpc.php` endpoint on a web server. The requests are being made from various IP addresses and seem to be attempting to access the XML-RPC API (a remote procedure call protocol).

Here's a summary of what I've observed:

1. **Multiple IP addresses**: There are several unique IP addresses making requests, suggesting that this might be a distributed attack or an attempt to scan for vulnerabilities.
2. **XML-RPC protocol**: The requests are all targeting the `/xmlrpc.php` endpoint, which is unusual for legitimate traffic. This could ind

In [56]:
# Read the output text file
output_file = os.path.join(output_folder, "log_analysis_output.txt")
with open(output_file, "r", encoding="utf-8") as file:
    output_lines = file.readlines()

# Batch the output lines into 1000 lines per batch
batch_size = 1000
batch_summaries = []

for i in range(0, len(output_lines), batch_size):
    batch = output_lines[i:i+batch_size]
    batch_content = "".join(batch)

    # Generate a summary of each batch using ollama.chat
    prompt = f"""
        Please provide a summary of the entire following log analysis output.

        Provide 1 summary of all batches of log analysis, together.

        Here is the log analysis output:\n\n{batch_content}
        """

    response = ollama.chat(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model="llama3",
    )

    # Extract the response content from the dictionary
    batch_summary = response['message']['content']
    batch_summaries.append(batch_summary)
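When there are many batch summaries, joining all of them into one final prompt may exceed what the model can usefully read. One possible approach is to reduce them in rounds, summarizing a few at a time until a single summary remains. The sketch below is an illustration under that assumption, not the notebook's method; `summarize_in_rounds` and `group_size` are hypothetical, and the model call is passed in as a plain function so the logic can be tried without a running Ollama server.

```python
from typing import Callable, List


def summarize_in_rounds(
    texts: List[str],
    summarize: Callable[[str], str],
    group_size: int = 5,
) -> str:
    """Repeatedly summarize groups of texts until only one summary is left."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    if not texts:
        return ""
    current = list(texts)
    # Each round shrinks the list by roughly a factor of group_size
    while len(current) > 1:
        current = [
            summarize("\n\n".join(current[i:i + group_size]))
            for i in range(0, len(current), group_size)
        ]
    # A single input text is returned unchanged, without calling summarize
    return current[0]
```

With Ollama, `summarize` could wrap the same call used in this notebook, for example `lambda text: ollama.chat(model="llama3", messages=[{"role": "user", "content": text}])['message']['content']`.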


In [58]:
# Aggregate the batch summaries
aggregated_summary = "\n".join(batch_summaries)

# Generate a summary of the aggregated summaries using ollama.chat
prompt = f"Please provide a summary of the following aggregated log analysis summaries:\n\n{aggregated_summary}"

response = ollama.chat(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="llama3",
)

# Extract the response content from the dictionary
final_summary = response['message']['content']

print("Summary of the aggregated log analysis summaries:")
print(final_summary)

# Save the final summary to a text file in the output folder
final_summary_file = os.path.join(output_folder, "log_analysis_summary.txt")
with open(final_summary_file, "w", encoding="utf-8") as file:
    file.write(final_summary)

print(f"Final summary saved to: {final_summary_file}")

Summary of the aggregated log analysis summaries:
Here's a summary of the aggregated log analysis:

**Batch 1:**

* Multiple POST requests to `/xmlrpc.php` from different IP addresses, resulting in 403 or 301 responses.
* A GET request from an Android device accessing `/wp/v2/users` with a successful response (200).
* A 404 error reported for accessing the plugin Element.

**Batch 2:**

* Multiple requests from same IP addresses (38.242.230.184 and 40.74.255.112), potentially indicative of bot activity or script running.
* XMLRPC requests from various IP addresses, possibly related to automated testing or scraping activities.
* WordPress-specific requests, such as `wp-cron.php`, suggesting the site runs on WordPress.
* Several 403 errors reported, indicating some requests were denied or blocked by security measures.

**Possible Interpretations:**

* Automated attacks or scraping attempts from various IP addresses.
* Bots or scripts repeatedly testing or probing the site's defenses.
* A