<a href="https://www.kaggle.com/code/donroche/defenceagent?scriptVersionId=234664866" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# DefenceAgent for Analyzing Defense-Related Text or Files
## Description:
This notebook builds an AI agent (DefenceAgent) that analyses text input or uploaded files for defence insights. A natural extension of this project would be an AI Defence Agent that analyses military documents.

## Techniques Used: 
Structured Output (JSON mode), Few-shot Prompting, Retrieval Augmented Generation (RAG), AI agent. 

## Input: 
Text input or file (e.g., text file, CSV) containing defence-related data.

Note: Optimized for Kaggle with error handling and minimal dependencies.

### Project roadblock:
Although the code itself ran, I kept getting errors when loading data from Kaggle or uploading my own, so I used dummy data instead. In theory, the same approach can be applied in other Kaggle Notebooks.

## Importing libraries

In [1]:
# This Python environment is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import json
import os
import warnings
warnings.filterwarnings('ignore')

## Step 1: Read the input file

In [3]:
def read_file(file_path):
    """Read text from a file (text or CSV)."""
    try:
        if file_path.endswith('.txt'):
            with open(file_path, 'r', encoding='utf-8') as f:
                return f.read()
        elif file_path.endswith('.csv'):
            df = pd.read_csv(file_path)
            # Assume text is in a 'text' column; adjust as needed
            return ' '.join(df['text'].dropna().astype(str))
        else:
            return "Unsupported file format. Please provide a .txt or .csv file."
    except Exception as e:
        return f"Error reading file: {str(e)}"
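
`read_file` reports failures by returning a message string, which the caller then has to tell apart from real file content. A minimal alternative sketch raises exceptions instead. The helper name `read_file_strict` and its `text_column` parameter are hypothetical and not part of the notebook:

```python
import os
import tempfile

import pandas as pd


def read_file_strict(file_path, text_column="text"):
    """Hypothetical variant of read_file that raises instead of returning error strings."""
    ext = os.path.splitext(file_path)[1].lower()
    if ext == ".txt":
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()
    if ext == ".csv":
        df = pd.read_csv(file_path)
        if text_column not in df.columns:
            raise ValueError(f"CSV has no '{text_column}' column")
        # Join all non-missing rows of the text column into one string
        return " ".join(df[text_column].dropna().astype(str))
    raise ValueError(f"Unsupported file format: {ext or 'none'}")


# Usage on a throwaway CSV written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "reports.csv")
    pd.DataFrame({"text": ["Rocket attack.", None, "Routine patrol."]}).to_csv(csv_path, index=False)
    combined = read_file_strict(csv_path)

print(combined)
```

With this style the caller wraps the call in `try`/`except` rather than inspecting the returned text for error keywords.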

In [4]:
from io import StringIO

## Step 2: Load or Create Dummy Dataset
<p>Check for an uploaded dataset in Kaggle's input directory.</p>
<p>If the file is missing or has no text column, use dummy data instead.</p>

In [5]:
data_path = '/kaggle/input/Radar_Traffic_Counts.csv'  # Adjust to your dataset
try:
    df = pd.read_csv(data_path)
    if 'text' not in df.columns:
        raise ValueError("Dataset must have a 'text' column.")
except (FileNotFoundError, ValueError) as e:
    print(f"Dataset error: {str(e)}. Using dummy data for demonstration.")
    # Create dummy defense reports data
    np.random.seed(42)
    df = pd.DataFrame({
        'report_id': [f'REPORT_{i}' for i in range(500)],
        'text': [
            # Note: i % 3 is always 0, 1, or 2, so the final 'Potential threat...' fallback is never used
            'Suspicious activity detected near military base.' if i % 3 == 0 else
            'Routine patrol, no incidents reported.' if i % 3 == 1 else
            'Rocket attack.' if i % 3 == 2 else
            'Potential threat identified in communication intercept.' for i in range(500)
        ],
        'location': np.random.choice(['Base Alpha', 'City Beta', 'Unknown'], 500),
        'timestamp': pd.date_range('2025-01-01', periods=500, freq='H')
    })

# Display dataset info
print("Dataset Info:")
print(df.info())
print("\nSample Data:")
print(df.head())

Dataset error: [Errno 2] No such file or directory: '/kaggle/input/Radar_Traffic_Counts.csv'. Using dummy data for demonstration.
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   report_id  500 non-null    object        
 1   text       500 non-null    object        
 2   location   500 non-null    object        
 3   timestamp  500 non-null    datetime64[ns]
dtypes: datetime64[ns](1), object(3)
memory usage: 15.8+ KB
None

Sample Data:
  report_id                                              text    location  \
0  REPORT_0  Suspicious activity detected near military base.     Unknown   
1  REPORT_1            Routine patrol, no incidents reported.  Base Alpha   
2  REPORT_2                                    Rocket attack.     Unknown   
3  REPORT_3  Suspicious activity detected near military base.     Unknown   
4  REP

## Step 3: Train a text classifier 

In [6]:
try:
    # Label reports as 'threat' or 'non-threat' (keyword-based for demo)
    df['label'] = df['text'].str.contains('suspicious|threat', case=False, na=False).astype(int)  # 1 = threat, 0 = non-threat

    # Extract features using TF-IDF
    vectorizer = TfidfVectorizer(max_features=500)
    X = vectorizer.fit_transform(df['text'])
    y = df['label']

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train Logistic Regression classifier
    clf = LogisticRegression(random_state=42)
    clf.fit(X_train, y_train)

    # Evaluate
    y_pred = clf.predict(X_test)
    print("\nClassifier Performance:")
    print(classification_report(y_test, y_pred))
except Exception as e:
    print(f"Error training classifier: {str(e)}")
    # Fallback: fit a placeholder classifier (LogisticRegression needs at least two classes)
    clf = LogisticRegression()
    vectorizer = TfidfVectorizer(max_features=500)
    X = vectorizer.fit_transform(['dummy text', 'dummy threat'])
    clf.fit(X, [0, 1])



Classifier Performance:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        68
           1       1.00      1.00      1.00        32

    accuracy                           1.00       100
   macro avg       1.00      1.00      1.00       100
weighted avg       1.00      1.00      1.00       100
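
The perfect scores are expected rather than a sign of a strong model: the dummy data contains only three distinct sentences (the fourth is never generated because `i % 3` is always 0, 1, or 2), and the label is a keyword rule applied to that same text. A short sketch that checks this by rebuilding the dummy column independently; the name `df_check` is illustrative:

```python
import pandas as pd

# Rebuild the dummy 'text' column the same way the notebook effectively does
texts = [
    "Suspicious activity detected near military base." if i % 3 == 0 else
    "Routine patrol, no incidents reported." if i % 3 == 1 else
    "Rocket attack."
    for i in range(500)
]
df_check = pd.DataFrame({"text": texts})
df_check["label"] = (
    df_check["text"].str.contains("suspicious|threat", case=False, na=False).astype(int)
)

n_unique = df_check["text"].nunique()
label_counts = df_check["label"].value_counts().to_dict()
# Each distinct sentence maps to exactly one label, so the task is trivially separable
labels_per_text = int(df_check.groupby("text")["label"].nunique().max())

print(n_unique)         # 3
print(label_counts)     # 333 non-threat rows (label 0), 167 threat rows (label 1)
print(labels_per_text)  # 1
```

Because the same three sentences appear in both the training and test splits, the report measures memorisation of the keyword rule, not generalisation to unseen reports.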



## Step 4: Implement Retrieval Augmented Generation (RAG)
<p>Technique: Retrieval Augmented Generation (RAG).</p>
<p>RAG changes how a large language model is used so that it answers user queries with reference to a specified set of documents. In this notebook no LLM is called: TF-IDF cosine similarity retrieves the most relevant defence profiles from a small knowledge base, which keeps dependencies minimal and avoids PyTorch/FAISS-related errors.</p>


In [7]:
# Create a knowledge base with defense-related profiles
knowledge_base = [
    "Suspicious activity near military bases often indicates reconnaissance or sabotage.",
    "Routine patrols with no incidents suggest stable security conditions.",
    "Communication intercepts mentioning threats may involve coded language or planning.",
    "Possible insurgent activity.",
    "Potentially hostile situation with a foreign entity."
]

# Initialize TF-IDF vectorizer for knowledge base
kb_vectorizer = TfidfVectorizer()
kb_vectors = kb_vectorizer.fit_transform(knowledge_base)

def retrieve_relevant_info(query, k=2):
    """Retrieve top-k relevant profiles using TF-IDF cosine similarity."""
    try:
        query_vector = kb_vectorizer.transform([query])
        cosine_similarities = (kb_vectors @ query_vector.T).toarray().flatten()
        top_k_indices = np.argsort(cosine_similarities)[-k:][::-1]
        return [knowledge_base[idx] for idx in top_k_indices]
    except Exception as e:
        print(f"Error in retrieval: {str(e)}")
        return knowledge_base[:k]  # Fallback: Return first k items
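
`retrieve_relevant_info` uses a plain dot product as its similarity score. That equals cosine similarity here because `TfidfVectorizer` L2-normalises each row by default (`norm='l2'`). A minimal sketch that checks the equivalence against scikit-learn's `cosine_similarity`, using a three-entry subset of the knowledge base:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Suspicious activity near military bases often indicates reconnaissance or sabotage.",
    "Routine patrols with no incidents suggest stable security conditions.",
    "Possible insurgent activity.",
]
vec = TfidfVectorizer()  # default norm="l2", so every row has unit length
doc_vectors = vec.fit_transform(docs)
query_vector = vec.transform(["Suspicious activity detected near military base."])

# Dot product of unit-length vectors, as in retrieve_relevant_info
dot_scores = (doc_vectors @ query_vector.T).toarray().flatten()
# Reference cosine similarity from scikit-learn
cos_scores = cosine_similarity(doc_vectors, query_vector).flatten()

scores_match = bool(np.allclose(dot_scores, cos_scores))
best = docs[int(np.argmax(dot_scores))]
print(scores_match)
print(best)
```

If the vectorizer were created with `norm=None`, the dot product would no longer be a cosine and longer documents would tend to score higher.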


## Step 5: Define DefenceAgent Class
<p>Technique: Structured Output (JSON mode).</p>
<p>Technique: Few-shot Prompting. Few-shot prompting gives a model a small number of worked examples before asking it to generate a response. The examples guide the model toward the expected output for a specific task.</p>

In [8]:
class DefenceAgent:
    def __init__(self, classifier, vectorizer, kb_vectorizer, knowledge_base):
        self.classifier = classifier
        self.vectorizer = vectorizer
        self.kb_vectorizer = kb_vectorizer
        self.knowledge_base = knowledge_base

    def analyze_input(self, input_data, input_type='text', file_path=None):
        """Analyze text input or file and return structured output."""
        try:
            # Handle input type
            if input_type == 'text':
                text = input_data
                input_id = "TEXT_INPUT"
            elif input_type == 'file' and file_path:
                text = read_file(file_path)
                input_id = os.path.basename(file_path) if file_path else "UNKNOWN_FILE"
            else:
                return json.dumps({"error": "Invalid input type or missing file path"}, indent=2)

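            # Only file reads can return an error message; match the exact prefixes
            # so that ordinary text containing the word "Error" is still analysed
            if input_type == 'file' and text.startswith(("Error reading file:", "Unsupported file format")):
                return json.dumps({"error": text}, indent=2)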

            # Predict threat level
            text_vector = self.vectorizer.transform([text])
            prediction = self.classifier.predict(text_vector)[0]
            confidence = self.classifier.predict_proba(text_vector)[0][prediction]
            threat_label = 'threat' if prediction == 1 else 'non-threat'

            # --- Few-shot Prompting ---
            # Provide examples to guide the analysis
            few_shot_examples = """
            Example 1:
            Input: Suspicious activity detected near military base.
            Output: Threat detected, possible reconnaissance activity.

            Example 2:
            Input: Routine patrol, no incidents reported.
            Output: No threat detected, normal operations.

            Input: {text}
            Output:
            """
            # Note: the prompt is built for illustration only; no LLM is called in this notebook
            prompt = few_shot_examples.format(text=text[:100])  # Truncate for brevity

            # --- RAG: Retrieve relevant information ---
            retrieved_info = retrieve_relevant_info(text)

            # Generate analysis
            analysis = f"Classified as {threat_label}. {retrieved_info[0]}"

            # --- Structured Output (JSON mode) ---
            output = {
                "input_id": input_id,
                "threat_level": threat_label,
                "confidence": float(confidence),
                "analysis": analysis,
                "retrieved_context": retrieved_info
            }
            return json.dumps(output, indent=2)
        except Exception as e:
            return json.dumps({"error": f"Analysis failed: {str(e)}"}, indent=2)

# Initialize the DefenceAgent
agent = DefenceAgent(clf, vectorizer, kb_vectorizer, knowledge_base)
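
In `analyze_input` the few-shot prompt is assembled into `prompt` but never sent to a model; the analysis string comes from the classifier and the top retrieved profile. A minimal sketch of how such a prompt could be built from example pairs follows. The helper `build_few_shot_prompt` is hypothetical, and the model call is deliberately left out:

```python
def build_few_shot_prompt(examples, text, max_chars=100):
    """Hypothetical helper: format (input, output) example pairs, then the new input."""
    blocks = [
        f"Example {i}:\nInput: {inp}\nOutput: {out}"
        for i, (inp, out) in enumerate(examples, start=1)
    ]
    # Truncate the new input, mirroring the text[:100] cut in the notebook
    blocks.append(f"Input: {text[:max_chars]}\nOutput:")
    return "\n\n".join(blocks)


examples = [
    ("Suspicious activity detected near military base.",
     "Threat detected, possible reconnaissance activity."),
    ("Routine patrol, no incidents reported.",
     "No threat detected, normal operations."),
]
prompt = build_few_shot_prompt(examples, "Rocket attack.")
print(prompt)
```

Keeping the examples in a list makes it easy to add or swap demonstrations without editing a long template string.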

## Step 6: Test the DefenceAgent 
<p>The end result is an AI Defence Agent that analyses reports and returns a threat level with a confidence score.</p>

In [9]:
test_text = "Suspicious activity detected near military base."
print("\nText Input Analysis:")
print(agent.analyze_input(test_text, input_type='text'))

# Test with a file (create a dummy file for demo)
dummy_file_path = '/kaggle/working/test_report.txt'
try:
    with open(dummy_file_path, 'w') as f:
        f.write("Potential threat identified in communication intercept.")
    print("\nFile Input Analysis:")
    print(agent.analyze_input(None, input_type='file', file_path=dummy_file_path))
except Exception as e:
    print(f"Error testing file input: {str(e)}")


Text Input Analysis:
{
  "input_id": "TEXT_INPUT",
  "threat_level": "threat",
  "confidence": 0.964086462094688,
  "analysis": "Classified as threat. Suspicious activity near military bases often indicates reconnaissance or sabotage.",
  "retrieved_context": [
    "Suspicious activity near military bases often indicates reconnaissance or sabotage.",
    "Possible insurgent activity."
  ]
}

File Input Analysis:
{
  "input_id": "test_report.txt",
  "threat_level": "non-threat",
  "confidence": 0.8261336713975911,
  "analysis": "Classified as non-threat. Communication intercepts mentioning threats may involve coded language or planning.",
  "retrieved_context": [
    "Communication intercepts mentioning threats may involve coded language or planning.",
    "Potentially hostile situation with a foreign entity."
  ]
}
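
Because `analyze_input` always returns a JSON string, downstream code can parse it and branch on the `error` key. A short sketch using the text-input result printed above; the `summary` format is only an illustration:

```python
import json

# Output string copied from the text-input run above
raw = """{
  "input_id": "TEXT_INPUT",
  "threat_level": "threat",
  "confidence": 0.964086462094688,
  "analysis": "Classified as threat. Suspicious activity near military bases often indicates reconnaissance or sabotage.",
  "retrieved_context": [
    "Suspicious activity near military bases often indicates reconnaissance or sabotage.",
    "Possible insurgent activity."
  ]
}"""

result = json.loads(raw)
if "error" in result:
    # Failure responses carry a single "error" key
    summary = f"failed: {result['error']}"
else:
    summary = f"{result['input_id']}: {result['threat_level']} ({result['confidence']:.2f})"
print(summary)  # TEXT_INPUT: threat (0.96)
```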


## Step 7: Save the model and Knowledge Base

In [10]:
try:
    import joblib
    joblib.dump(clf, '/kaggle/working/defence_classifier.pkl')
    joblib.dump(vectorizer, '/kaggle/working/vectorizer.pkl')
    joblib.dump(kb_vectorizer, '/kaggle/working/kb_vectorizer.pkl')
    with open('/kaggle/working/knowledge_base.json', 'w') as f:
        json.dump(knowledge_base, f)
    print("\nModel and knowledge base saved successfully.")
except Exception as e:
    print(f"Error saving model: {str(e)}")


Model and knowledge base saved successfully.


## Conclusion

This Kaggle Notebook demonstrated several AI techniques (structured JSON output, few-shot prompting, and TF-IDF-based retrieval) that could be used to analyse military documents and that a data engineer could adapt to other settings.