<a href="https://colab.research.google.com/github/ashish78905/OPTICONNECT_CALLL_CENTER_ANALYSIS-ASSIGNMENT/blob/main/PROJECT_EXPLANATION.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


---

## **OpenUBA - Complete Project Explanation (3-Layer Architecture)**

---

# **LAYER 1: INPUT LAYER**
## (Types of Data Sources, Types of Data to be Streamed, Throughputs)

---

### **1.1 What is the Input Layer?**

The Input Layer is the **data collection foundation** of this User Behavior Analytics system. Think of it as the "eyes and ears" of the security platform: it listens to and captures everything happening in your organization's network and systems.

**Real-World Analogy:** Imagine a bank with security cameras at every entrance, ATM, and counter. The Input Layer is like all those cameras recording everything - who comes in, what they do, when they leave.

---

### **1.2 Types of Data Sources Supported**

The system supports **four major categories of data sources**:

#### **A. Local File System (Disk)**
This is the simplest source where log files are stored directly on the server's hard drive.

**Example Scenario:**
Your company's web proxy server writes a log file every day. Each line records:
- Which employee (identified by username) visited which website
- At what time
- How much data they downloaded

The system reads these files directly from the disk location, for instance from a folder like "proxy_logs/daily/".

**When to use:** Small to medium organizations, testing environments, or when data is already being collected by existing systems.

---

#### **B. Hadoop Distributed File System (HDFS)**
HDFS is used when you have **massive amounts of data** spread across multiple servers.

**Example Scenario:**
A global bank has 50,000 employees across 30 countries. Each day generates:
- 500 million proxy logs
- 200 million email metadata records
- 100 million authentication events

No single computer can store this. HDFS splits this data across hundreds of servers. The system can read from any of these distributed locations.

**Real-World Analogy:** Instead of one giant filing cabinet (which would be impossible), you have 100 normal filing cabinets in different rooms, but a smart index tells you exactly which cabinet holds which document.

**When to use:** Large enterprises with big data infrastructure, organizations processing terabytes daily.

---

#### **C. Elasticsearch (ES)**
Elasticsearch is a **search-optimized database** for real-time log analysis.

**Example Scenario:**
Your Security Operations Center needs to search "show me all failed login attempts for user john.doe in the last 24 hours" and get results in milliseconds. Elasticsearch makes this possible.

The system connects to Elasticsearch using:
- A host address (where the Elasticsearch server lives)
- A query (what data you want to retrieve)

**When to use:** Real-time security monitoring, when you need fast searches across billions of records, integration with existing ELK (Elasticsearch, Logstash, Kibana) stack.

---

#### **D. Apache Spark Integration (PySpark)**
Spark is used for **processing extremely large datasets in parallel**.

**Example Scenario:**
You need to analyze one year of historical data (2 petabytes) to find all users who accessed sensitive files outside business hours. A normal computer would take weeks. Spark splits this job across 500 computers and finishes in hours.

**When to use:** Historical analysis, machine learning training on massive datasets, batch processing jobs.

---

### **1.3 Types of Data Formats (What Gets Streamed)**

The system handles multiple data formats:

#### **A. CSV (Comma-Separated Values)**
The most common format. Each row is one event, columns are different attributes.

**Example of Proxy Log CSV:**
```
timestamp, username, website, action, bytes_downloaded
2024-01-15 09:30:00, alice.smith, gmail.com, allow, 15000
2024-01-15 09:31:00, bob.jones, hacking-tools.com, block, 0
2024-01-15 09:32:00, alice.smith, dropbox.com, allow, 500000
```

This tells us: Alice checked email, Bob tried to visit a suspicious site (blocked), and Alice then downloaded about 500 KB from Dropbox.
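A minimal sketch of reading the sample above with Python's standard `csv` module. It assumes the header row shown in the example; `skipinitialspace=True` is needed because the sample has a space after each comma. The variable names are illustrative and not part of OpenUBA.

```python
import csv
import io
from collections import defaultdict

# The sample proxy log from above, held in memory for illustration.
SAMPLE = """timestamp, username, website, action, bytes_downloaded
2024-01-15 09:30:00, alice.smith, gmail.com, allow, 15000
2024-01-15 09:31:00, bob.jones, hacking-tools.com, block, 0
2024-01-15 09:32:00, alice.smith, dropbox.com, allow, 500000
"""

# skipinitialspace drops the blank that follows each comma in this sample.
rows = list(csv.DictReader(io.StringIO(SAMPLE), skipinitialspace=True))

# Total bytes downloaded per user.
bytes_per_user = defaultdict(int)
for row in rows:
    bytes_per_user[row["username"]] += int(row["bytes_downloaded"])

# Users whose requests the proxy blocked.
blocked_users = [row["username"] for row in rows if row["action"] == "block"]
```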

---

#### **B. Parquet**
A **columnar storage format** - extremely efficient for analytics.

**Why it matters:**
If you have 100 columns but only need 3 columns for analysis, Parquet reads only those 3 columns. CSV would read all 100.

**Example:** Analyzing only usernames and timestamps from a billion-row dataset:
- CSV approach: Read entire file (500 GB)
- Parquet approach: Read only 2 columns (10 GB)

---

#### **C. JSON (JavaScript Object Notation)**
Flexible format for complex, nested data.

**Example of Authentication Event JSON:**
```json
{
    "event_type": "login",
    "user": "alice.smith",
    "timestamp": "2024-01-15T09:30:00Z",
    "details": {
        "ip_address": "192.168.1.100",
        "device": "Windows 10 Laptop",
        "location": {
            "country": "USA",
            "city": "New York"
        },
        "authentication_method": "password + MFA"
    },
    "success": true
}
```
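Nested events like this are often flattened before analysis so that each field becomes one column. The sketch below shows one way to do that with the standard library; the dotted key naming (`details.location.city`) is a convention chosen for this example, not something the source prescribes.

```python
import json

EVENT = """
{
    "event_type": "login",
    "user": "alice.smith",
    "timestamp": "2024-01-15T09:30:00Z",
    "details": {
        "ip_address": "192.168.1.100",
        "device": "Windows 10 Laptop",
        "location": {"country": "USA", "city": "New York"},
        "authentication_method": "password + MFA"
    },
    "success": true
}
"""


def flatten(obj, prefix=""):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


event = flatten(json.loads(EVENT))
```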

---

#### **D. Flat Files**
Simple text files where each line is a log entry, possibly with custom delimiters.

**Example (Space-Delimited Firewall Log):**
```
15Jan2024 09:30:00 ALLOW TCP 192.168.1.100 10.0.0.50 443
15Jan2024 09:30:01 DENY UDP 192.168.1.105 8.8.8.8 53
```
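A small parsing sketch for the firewall lines above. The source does not label the columns, so the field names used here (action, protocol, source IP, destination IP, destination port) are an assumed interpretation of the sample.

```python
from datetime import datetime
from typing import NamedTuple


class FirewallEvent(NamedTuple):
    # Field names are assumptions; the sample log has no header row.
    timestamp: datetime
    action: str
    protocol: str
    src_ip: str
    dst_ip: str
    dst_port: int


def parse_firewall_line(line: str) -> FirewallEvent:
    """Split one space-delimited line into typed fields."""
    date, time, action, protocol, src_ip, dst_ip, port = line.split()
    timestamp = datetime.strptime(f"{date} {time}", "%d%b%Y %H:%M:%S")
    return FirewallEvent(timestamp, action, protocol, src_ip, dst_ip, int(port))


event = parse_firewall_line("15Jan2024 09:30:01 DENY UDP 192.168.1.105 8.8.8.8 53")
```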

---

### **1.4 Specific Log Types Supported**

The system is designed to ingest security-relevant logs:

#### **A. Proxy/Web Gateway Logs**
Records all web browsing activity.

**Security Value:** Detect users visiting malicious websites, data exfiltration attempts, policy violations (social media during work hours).

**Example Detection:** "User downloaded 50GB of data to personal Dropbox in 2 hours" - possible data theft.

---

#### **B. DNS (Domain Name System) Logs**
Records all domain lookups (what websites computers are trying to reach).

**Security Value:** Detect malware communication. Malware often contacts "command and control" servers using strange domain names.

**Example Detection:** "Computer is looking up 'asd7f8a9sd7f.evil-domain.com' 1000 times per hour" - likely malware trying to phone home.

---

#### **C. SSH (Secure Shell) Logs**
Records remote login attempts to servers.

**Security Value:** Detect unauthorized access attempts, brute force attacks, lateral movement by attackers.

**Example Detection:** "Same user successfully logged into 50 different servers in 10 minutes" - either automation or compromised account moving laterally.

---

#### **D. DHCP (Dynamic Host Configuration Protocol) Logs**
Records network address assignments to devices.

**Security Value:** Asset tracking, detecting rogue devices on network.

**Example Detection:** "Unknown MAC address received IP address" - unauthorized device connected to network.

---

### **1.5 Throughput Capabilities**

Throughput refers to **how much data the system can handle**.

#### **Small Scale (Local Disk + Pandas)**
- **Capacity:** Thousands to millions of records per day
- **Processing Speed:** Single machine, sequential processing
- **Use Case:** Small company with 100 employees

#### **Medium Scale (Elasticsearch)**
- **Capacity:** Millions to billions of records per day
- **Processing Speed:** Near real-time (sub-second queries)
- **Use Case:** Medium enterprise with 5,000 employees

#### **Large Scale (HDFS + Spark)**
- **Capacity:** Billions to trillions of records per day
- **Processing Speed:** Distributed parallel processing across cluster
- **Use Case:** Global enterprise with 100,000+ employees or cloud service provider

---

### **1.6 Data Schema Configuration**

The system uses a **configuration scheme** that defines how to read data:

**Key Configuration Elements:**
1. **Source Group:** A collection of related data sources
2. **Folder Location:** Where the data lives
3. **Log Name:** Identifier for this log type
4. **Type:** Format (CSV, Parquet, JSON, Flat)
5. **Delimiter:** How fields are separated (comma, space, tab)
6. **Location Type:** Storage system (disk, HDFS, Elasticsearch)
7. **ID Feature:** The column that identifies users (username, employee_id)

**Example Configuration:**
```
Source Group: "Corporate Proxy Logs"
Folder: "/data/proxy/"
Type: CSV
Delimiter: Space
Location: Disk
ID Feature: "cs-username"
```

This tells the system: "Look in /data/proxy/, read CSV files separated by spaces, and the user identity is in the 'cs-username' column."
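The configuration above could be modeled roughly as follows. This is a sketch only: the `LogSource` class, its field names, and the `DELIMITERS` table are hypothetical and do not reflect the actual layout of the project's scheme file.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical mapping from the delimiter names in the config to characters.
DELIMITERS = {"comma": ",", "space": " ", "tab": "\t"}


@dataclass
class LogSource:
    source_group: str
    folder: str
    log_type: str
    delimiter: str
    location_type: str
    id_feature: str


def read_records(source: LogSource, handle):
    """Return a reader yielding one dict per line, split on the configured delimiter."""
    return csv.DictReader(handle, delimiter=DELIMITERS[source.delimiter.lower()])


proxy = LogSource(
    source_group="Corporate Proxy Logs",
    folder="/data/proxy/",
    log_type="CSV",
    delimiter="Space",
    location_type="Disk",
    id_feature="cs-username",
)

# An in-memory stand-in for a file under /data/proxy/ (made-up content).
sample = io.StringIO("date cs-username cs-host\n2024-01-15 alice.smith gmail.com\n")
users = [row[proxy.id_feature] for row in read_records(proxy, sample)]
```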

---

### **1.7 User and Entity Extraction**

As data streams in, the system automatically:

1. **Extracts all unique users** from the logs
2. **Creates user profiles** in the system
3. **Creates directories** for each user to store their behavioral data

**Example Process:**
Input: Proxy log with 1 million entries
Output:
- Found 523 unique usernames
- Created profile for each user
- Stored in user database
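The extraction steps above can be sketched as below. The function names and the one-directory-per-user layout are assumptions made for illustration; a temporary directory stands in for the real storage location.

```python
import tempfile
from pathlib import Path


def extract_users(records, id_feature):
    """Return the sorted set of non-empty user identifiers found in the records."""
    return sorted({r[id_feature] for r in records if r.get(id_feature)})


def create_user_dirs(users, root: Path):
    """Create one directory per user under root (no error if it already exists)."""
    for user in users:
        (root / user).mkdir(parents=True, exist_ok=True)


records = [
    {"cs-username": "alice.smith"},
    {"cs-username": "bob.jones"},
    {"cs-username": "alice.smith"},  # duplicates collapse into one profile
]
users = extract_users(records, "cs-username")

with tempfile.TemporaryDirectory() as tmp:
    create_user_dirs(users, Path(tmp))
    created = sorted(p.name for p in Path(tmp).iterdir())
```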

---

---

# **LAYER 2: PROCESSING LAYER**
## (Types of Processing, Quality of Processing, Throughputs)

---

### **2.1 What is the Processing Layer?**

The Processing Layer is the **brain** of the system. It takes raw data from the Input Layer and transforms it into security insights. This is where machine learning models analyze behavior, rules fire on suspicious patterns, and risk scores are calculated.

**Real-World Analogy:** If the Input Layer is the security cameras, the Processing Layer is the team of analysts watching the footage, recognizing suspicious behavior, and taking notes.

---

### **2.2 Types of Processing**

#### **A. Machine Learning Model Processing**

The system has a **Model Library** - a collection of pre-built and custom machine learning models that analyze user behavior.

**How Models Work:**

1. **Data Loading:** Model receives data (DataFrame) from Input Layer
2. **Feature Extraction:** Model identifies relevant patterns
3. **Analysis:** Model applies its algorithm
4. **Output:** Model produces risk scores or anomaly flags

**Types of Models Supported:**

##### **i. Simple Rule-Based Models (Regex)**
Pattern matching using regular expressions.

**Example:**
Rule: "Flag any URL containing 'torrent' or 'hack'"
```
User visits: download-torrents.com → FLAGGED
User visits: microsoft.com → OK
```
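As a rough illustration, the rule above maps to a single regular expression. The case-insensitive flag is an added assumption.

```python
import re

# Matches any URL containing "torrent" or "hack" (case-insensitive by assumption).
SUSPICIOUS_URL = re.compile(r"torrent|hack", re.IGNORECASE)


def is_flagged(url: str) -> bool:
    """Return True when the URL contains one of the flagged substrings."""
    return SUSPICIOUS_URL.search(url) is not None
```

Plain substring matching is cheap but coarse: it would also flag a benign host such as `hackathon.example.com`, so rules like this usually feed a score rather than a hard block.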

##### **ii. Statistical Models (Deviation-Based)**
Detect when behavior deviates from normal patterns.

**Example:**
Alice normally downloads 100 MB per day. Her average (mean) is 100 MB with standard deviation of 20 MB.

Today she downloaded 500 MB.

Calculation: Is 500 MB outside (mean + 2*standard_deviation)?
500 > (100 + 2*20) = 140 → YES, this is anomalous!

**Rule Definition:**
```
Condition: (current_value > (mean + std_range)) OR (current_value < (mean - std_range))
Score: +20 risk points
```
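The condition above can be written as a small function. Here `std_range` is taken to be `k` standard deviations with `k = 2`, matching the worked example; the function name and default are illustrative.

```python
RISK_POINTS = 20  # points added when the rule fires, as in the rule definition


def deviates(current: float, mean: float, std: float, k: float = 2.0) -> bool:
    """Return True when current lies outside mean +/- k standard deviations."""
    std_range = k * std
    return current > mean + std_range or current < mean - std_range


# Alice's example: 500 MB today against a 100 MB mean and 20 MB deviation.
score = RISK_POINTS if deviates(500, mean=100, std=20) else 0
```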

##### **iii. TensorFlow/Deep Learning Models**
Neural networks for complex pattern recognition.

**Example Use Case:**
Detecting insider threats by analyzing:
- Login times
- Files accessed
- Email patterns
- Web browsing
- Badge-in/badge-out times

A neural network learns what "normal" looks like for each user and identifies deviations that humans might miss.

**Model Format:** Protobuf (TensorFlow's serialization format)

##### **iv. Scikit-Learn Models**
Traditional machine learning algorithms.

**Example Use Case:**
Classification model trained on historical data:
- Input: User behavior features
- Output: "Normal" or "Suspicious"

Typical algorithms include Random Forest, SVM, and Logistic Regression.

**Model Format:** Pickle (Python's serialization format)

##### **v. PySpark Models**
Distributed machine learning for massive datasets.

**Example Use Case:**
Training a model on 5 years of historical data (500 billion records) to understand seasonal patterns in employee behavior.

---

#### **B. Rule Engine Processing**

The system has two types of rules:

##### **i. Single-Fire Rules**
Trigger immediately when a condition is met.

**Example:**
```
Rule: If username == 'admin' AND time BETWEEN 2AM-5AM
Action: Add 50 risk points
Reason: Admin account used during unusual hours
```

**Scenario:**
- 3:00 AM: admin account logs in
- Rule fires immediately
- Risk score increased
- Potential alert generated
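A sketch of the single-fire rule in this scenario. It treats "BETWEEN 2AM-5AM" as the half-open interval from 02:00 up to, but not including, 05:00, which is an interpretation rather than something the source specifies.

```python
from datetime import datetime


def admin_unusual_hours(username: str, when: datetime) -> int:
    """Return 50 risk points if 'admin' is active between 02:00 and 05:00, else 0."""
    if username == "admin" and 2 <= when.hour < 5:
        return 50
    return 0


points = admin_unusual_hours("admin", datetime(2024, 1, 15, 3, 0))
```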

##### **ii. Deviation Rules**
Trigger when behavior deviates from baseline.

**Example:**
```
Rule: If daily_downloads > (average_daily_downloads + 2*std_dev)
Action: Add 30 risk points
Reason: Unusual data download volume
```

**Scenario:**
- Alice's average: 50 files/day
- Standard deviation: 10 files
- Threshold: 50 + 2*10 = 70 files
- Today: Alice downloaded 150 files
- Rule fires: 150 > 70
- Risk score increased
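In practice the baseline has to be computed from history first. The sketch below derives the mean and population standard deviation from a made-up four-day history chosen so that they come out to 50 and 10 files, matching the scenario. Whether the system uses population or sample deviation is not stated in the source.

```python
from statistics import mean, pstdev


def deviation_rule(history, today, k=2.0, points=30):
    """Return (risk points, threshold) for today's value against the baseline."""
    threshold = mean(history) + k * pstdev(history)
    return (points if today > threshold else 0), threshold


# Hypothetical daily download counts: mean 50, population std dev 10.
history = [40, 60, 40, 60]
points, threshold = deviation_rule(history, today=150)
```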

---

#### **C. Risk Score Calculation**

Risk scores quantify the "suspiciousness" of a user.

**Risk Score Components:**
1. **Base Score:** Starting point (usually 0)
2. **Model Contributions:** Each model can add points
3. **Rule Contributions:** Each fired rule adds points
4. **Historical Factor:** Past behavior influences score

**Risk Levels:**
- **0-29:** LOW RISK - Normal behavior
- **30-69:** MEDIUM RISK - Needs monitoring
- **70-100:** HIGH RISK - Immediate attention required

**Example Calculation:**
```
User: Bob
Base Score: 0

+ Model "After Hours Login": +10 (logged in at 11 PM)
+ Model "Large Download": +25 (downloaded 2GB)
+ Rule "Accessed Sensitive File": +15
+ Historical Factor: +5 (had incident last month)

Total Risk Score: 55 (MEDIUM RISK)
```
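The calculation above, together with the risk levels, can be sketched as follows. Capping the total at 100 is an assumption inferred from the 0-100 scale; the source does not say how overflow is handled.

```python
def risk_level(score: int) -> str:
    """Map a 0-100 score to the levels listed above."""
    if score >= 70:
        return "HIGH"
    if score >= 30:
        return "MEDIUM"
    return "LOW"


def total_risk(contributions: dict, base: int = 0) -> int:
    """Sum all contributions onto the base score, clamped to 0-100 (assumed)."""
    return max(0, min(100, base + sum(contributions.values())))


bob = {
    "After Hours Login": 10,
    "Large Download": 25,
    "Accessed Sensitive File": 15,
    "Historical Factor": 5,
}
score = total_risk(bob)
level = risk_level(score)
```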

---

#### **D. Anomaly Detection Processing**

Anomalies are behaviors that don't fit the normal pattern.

**Types of Anomalies Detected:**

##### **i. Login Anomalies**
```
Normal: User logs in from New York at 9 AM
Anomaly: Same user logs in from Tokyo at 9:05 AM
Detection: Impossible travel - flagged!
```
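One common way to implement an impossible-travel check is to compute the great-circle distance between the two login locations and compare the implied speed with a ceiling. The sketch below uses the haversine formula; the 900 km/h ceiling and the city coordinates are assumptions chosen for illustration, not values from the source.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_SPEED_KMH = 900.0  # assumed ceiling, roughly airliner cruising speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def is_impossible_travel(loc_a, loc_b, minutes_apart, max_kmh=MAX_SPEED_KMH):
    """Flag two logins whose implied travel speed exceeds max_kmh."""
    distance = haversine_km(*loc_a, *loc_b)
    if minutes_apart <= 0:
        return distance > 0
    return distance / (minutes_apart / 60) > max_kmh


NEW_YORK = (40.7128, -74.0060)
TOKYO = (35.6762, 139.6503)
flagged = is_impossible_travel(NEW_YORK, TOKYO, minutes_apart=5)
```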

##### **ii. Data Access Anomalies**
```
Normal: User accesses 10 files per day
Anomaly: User accessed 500 files in 1 hour
Detection: Unusual volume - flagged!
```

##### **iii. Time-Based Anomalies**
```
Normal: User works 9 AM - 5 PM
Anomaly: User active at 3 AM
Detection: Unusual hours - flagged!
```

##### **iv. Sequence Anomalies**
```
Normal: Login → Email → Documents → Logout
Anomaly: Login → Database → Export → Logout
Detection: Unusual sequence - flagged!
```

---

### **2.3 Quality of Processing**

Quality is ensured through multiple mechanisms:

#### **A. Model Verification**

Before any model runs, the system verifies its integrity:

**i. Data Hash Verification**
Every model component has a hash (digital fingerprint). The system computes the hash and compares it to the expected value.

**Example:**
```
Expected Hash: abc123def456...
Computed Hash: abc123def456...
Match: YES - Model is authentic and unmodified
```

**Why This Matters:**
If an attacker modified a model file to ignore certain users, the hash would change, and the system would reject the model.
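A minimal sketch of this kind of check. SHA-256 is assumed here because the source does not name the hash algorithm; `hmac.compare_digest` is used for a constant-time comparison.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_component(data: bytes, expected_hash: str) -> bool:
    """Return True only when the computed hash matches the expected one."""
    return hmac.compare_digest(sha256_hex(data), expected_hash.lower())


# Hypothetical model source and a tampered copy of it.
original = b"def execute(records):\n    return {}\n"
expected = sha256_hex(original)
tampered = original + b"# ignore user eve\n"
```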

**ii. File Hash Verification**
After installation, file hashes are verified again.

**iii. Base64 Encoding Verification**
Model components are transmitted in Base64 encoding. The system verifies the encoded data before decoding.

---

#### **B. Model Component Limits**

To prevent malicious models:
- Maximum of 2 components per model allowed
- Components must be the expected file types
- Excessive components are rejected

---

#### **C. Session Logging**

Every model execution is logged:
```
Session Log:
- Model Name: "Insider Threat Detector"
- Timestamp: 2024-01-15 09:30:00
- User Analyzed: "alice.smith"
- Data Points: 50,000
- Result: Risk Score 45
```

This creates an audit trail for compliance and debugging.

---

#### **D. Safe Mode**

The system has a "safe mode" that:
- Automatically removes models that fail verification
- Prevents potentially malicious models from executing
- Logs all removal events

---

### **2.4 Model Library Architecture**

The Model Library is a **repository of ready-to-use models**.

**Key Concepts:**

#### **i. Model Groups**
Models are organized into groups based on their data source.

**Example:**
```
Group: "Proxy Log Analysis"
├── Model: "Malicious URL Detector"
├── Model: "Data Exfiltration Detector"
└── Model: "Policy Violation Checker"

Group: "Authentication Analysis"
├── Model: "Brute Force Detector"
├── Model: "Impossible Travel Detector"
└── Model: "Privilege Escalation Detector"
```

#### **ii. Model Components**
Each model consists of:
1. **__init__.py:** Initialization file
2. **MODEL.py:** Main execution logic with execute() function
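To make the shape of a model concrete, here is a hypothetical `MODEL.py`. Only the existence of an `execute()` function comes from the description above; its signature, the list-of-dicts input, and the scoring logic are assumptions for illustration.

```python
# MODEL.py (hypothetical example, not a model shipped with the project)
from collections import Counter

BLOCK_POINTS = 5  # assumed risk points per blocked request


def execute(records):
    """Score each user by the number of proxy requests that were blocked."""
    blocked = Counter(r["username"] for r in records if r["action"] == "block")
    return {user: count * BLOCK_POINTS for user, count in blocked.items()}


result = execute(
    [
        {"username": "bob.jones", "action": "block"},
        {"username": "alice.smith", "action": "allow"},
        {"username": "bob.jones", "action": "block"},
    ]
)
```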

#### **iii. Model Metadata**
Each model has associated information:
- **Name:** Unique identifier
- **Description:** What it does
- **MITRE Technique ID:** Maps to known attack patterns
- **Score:** How many risk points it can add
- **Enabled:** Whether it's active

---

#### **iv. Model Installation Process**

When installing a new model:

1. **Download:** Fetch model from library server
2. **Verify Encoding:** Check that the hashes of the Base64-encoded components match
3. **Create Directory:** Make folder for model
4. **Store Files:** Decode and save model files
5. **Verify Files:** Check installed file hashes
6. **Enable:** Mark model as ready to run
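The installation steps above could look roughly like this for a single component. The function name, the use of SHA-256, and the choice to hash the encoded text in step 2 are assumptions; the download step is replaced by an in-memory payload.

```python
import base64
import hashlib
import tempfile
from pathlib import Path


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def install_component(encoded: str, encoded_hash: str, file_hash: str, target: Path) -> bool:
    """Verify, decode, store, and re-verify one model file."""
    # Step 2: verify the Base64 payload before decoding it.
    if sha256_hex(encoded.encode("ascii")) != encoded_hash:
        return False
    data = base64.b64decode(encoded, validate=True)
    # Steps 3-4: create the model directory and store the decoded file.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    # Step 5: verify the installed file; remove it on mismatch.
    if sha256_hex(target.read_bytes()) != file_hash:
        target.unlink()
        return False
    return True  # Step 6: the caller may now mark the model as enabled.


# Stand-in for step 1 (download): a payload built locally.
source = b"def execute(records):\n    return {}\n"
encoded = base64.b64encode(source).decode("ascii")

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example_model" / "MODEL.py"
    installed = install_component(
        encoded, sha256_hex(encoded.encode("ascii")), sha256_hex(source), target
    )
    rejected = install_component(encoded, "0" * 64, sha256_hex(source), target)
```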

---

### **2.5 MITRE ATT&CK Integration**

The system maps behaviors to the **MITRE ATT&CK Framework** - a global knowledge base of adversary tactics and techniques.

**How It Works:**

Each model is tagged with MITRE technique IDs:
```
Model: "Unusual Process Execution"
MITRE Technique: T1059 (Command and Scripting Interpreter)
MITRE Tactic: Execution

Model: "Large Data Transfer"
MITRE Technique: T1041 (Exfiltration Over C2 Channel)
MITRE Tactic: Exfiltration
```

**Value:**
When an alert fires, security analysts immediately understand:
- What type of attack this might be
- What the attacker might do next
- How to investigate and respond

---

### **2.6 Processing Throughput**

**Data Loader Options and Their Approximate Throughput** (actual rates depend on hardware, file layout, and cluster size):

| Data Loader | Processing Speed | Best For |
|-------------|------------------|----------|
| Local Pandas CSV | ~10,000 records/second | Small files, testing |
| Local Pandas Parquet | ~50,000 records/second | Medium files |
| HDFS Pandas CSV | ~100,000 records/second | Large distributed files |
| HDFS Pandas Parquet | ~500,000 records/second | Large optimized files |
| HDFS Spark CSV | ~1,000,000 records/second | Massive parallel processing |
| HDFS Spark Parquet | ~5,000,000 records/second | Maximum performance |
| Elasticsearch | Real-time | Live monitoring |

---

---

# **LAYER 3: OUTPUT LAYER**
## (Types of Events/Alerts/Correlation, Throughputs, Further Actions Downstream)

---

### **3.1 What is the Output Layer?**

The Output Layer is the **action and communication center**. It takes the insights from the Processing Layer and transforms them into actionable security responses: alerts, cases, reports, and automated actions.

**Real-World Analogy:** If the Processing Layer is the analysts watching the cameras, the Output Layer is those analysts calling security guards, filing reports, locking doors, and notifying management.

---

### **3.2 Types of Events Generated**

#### **A. Anomaly Events**
Generated when the system detects unusual behavior.

**Anomaly Event Structure:**
```
Event Type: ANOMALY
Anomaly Type: "data_access_anomaly"
User: "alice.smith"
Score: 75
Description: "User accessed 500 files in 1 hour, baseline is 10 files/hour"
Detected At: "2024-01-15 14:30:00"
```

**Anomaly Categories:**
1. **Login Anomalies:** Unusual login patterns
2. **Access Anomalies:** Unusual data access
3. **Volume Anomalies:** Unusual data volumes
4. **Time Anomalies:** Activity at unusual times
5. **Location Anomalies:** Activity from unusual places

---

#### **B. Risk Events**
Generated when user risk scores change significantly.

**Risk Event Structure:**
```
Event Type: RISK_CHANGE
User: "bob.jones"
Previous Risk: 25 (LOW)
New Risk: 72 (HIGH)
Change Reason: "Multiple anomalies detected"
Timestamp: "2024-01-15 15:00:00"
```

---

### **3.3 Types of Alerts**

Alerts are **notifications requiring attention**.

#### **Alert Severity Levels:**

##### **i. CRITICAL Alerts**
Require immediate response.

**Example:**
```
Alert ID: ALT-001
Severity: CRITICAL
Type: "data_exfiltration"
User: "eve.hacker"
Description: "User transferred 50GB to personal cloud storage in 30 minutes"
Timestamp: "2024-01-15 02:30:00"
```

**Typical Response Time:** Minutes

##### **ii. HIGH Alerts**
Require prompt investigation.

**Example:**
```
Alert ID: ALT-002
Severity: HIGH
Type: "privilege_escalation"
User: "mallory.insider"
Description: "User granted themselves admin access to financial database"
Timestamp: "2024-01-15 11:00:00"
```

**Typical Response Time:** Hours

##### **iii. MEDIUM Alerts**
Require investigation within the business day.

**Example:**
```
Alert ID: ALT-003
Severity: MEDIUM
Type: "policy_violation"
User: "john.careless"
Description: "User accessed social media during prohibited hours"
Timestamp: "2024-01-15 09:30:00"
```

**Typical Response Time:** Same day

##### **iv. LOW Alerts**
Informational, review during routine checks.

**Example:**
```
Alert ID: ALT-004
Severity: LOW
Type: "unusual_activity"
User: "jane.newbie"
Description: "New user has different browsing pattern than peer group"
Timestamp: "2024-01-15 10:00:00"
```

**Typical Response Time:** Weekly review

---

#### **Alert Types:**

1. **data_exfiltration:** Potential data theft
2. **privilege_escalation:** Unauthorized access increase
3. **policy_violation:** Breaking company rules
4. **malware_communication:** Possible malware activity
5. **credential_abuse:** Stolen or shared credentials
6. **insider_threat:** Malicious insider activity
7. **account_compromise:** Account taken over
8. **lateral_movement:** Attacker moving through network

---

### **3.4 Correlation (Connecting Events)**

The system correlates multiple events to see the bigger picture.

#### **Example of Correlation:**

**Individual Events (seem minor alone):**
1. Event 1: "Bob failed VPN login 3 times" (LOW)
2. Event 2: "Bob successfully logged in from new IP" (LOW)
3. Event 3: "Bob accessed sensitive documents" (LOW)
4. Event 4: "Bob downloaded 500 files" (MEDIUM)
5. Event 5: "Bob sent large email attachment" (LOW)

**Correlated View (reveals attack pattern):**
```
CORRELATION: Potential Account Compromise + Data Theft
Timeline:
  09:00 - Multiple failed logins (password guessing)
  09:15 - Successful login from attacker IP
  09:20 - Reconnaissance of sensitive files
  09:30 - Mass download of documents
  09:45 - Exfiltration via email

Combined Severity: CRITICAL
Pattern Match: MITRE T1078 (Valid Accounts) → T1083 (File and Directory Discovery) → T1041 (Exfiltration)
```
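A simple way to sketch this kind of correlation is to group events by user and escalate when enough of them fall inside a short time window. The one-hour window, the four-event minimum, and the jump straight to CRITICAL are hypothetical parameters, not the system's actual correlation logic.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def correlate(events, window=timedelta(hours=1), min_events=4):
    """Return users whose events cluster within the window, with a timeline."""
    by_user = defaultdict(list)
    for event in events:
        by_user[event["user"]].append(event)

    findings = {}
    for user, items in by_user.items():
        items.sort(key=lambda e: e["time"])
        span = items[-1]["time"] - items[0]["time"]
        if len(items) >= min_events and span <= window:
            findings[user] = {
                "severity": "CRITICAL",
                "timeline": [e["name"] for e in items],
            }
    return findings


day = datetime(2024, 1, 15)
events = [
    {"user": "bob", "time": day.replace(hour=9, minute=0), "name": "failed VPN logins"},
    {"user": "bob", "time": day.replace(hour=9, minute=15), "name": "login from new IP"},
    {"user": "bob", "time": day.replace(hour=9, minute=20), "name": "sensitive documents"},
    {"user": "bob", "time": day.replace(hour=9, minute=30), "name": "mass download"},
    {"user": "bob", "time": day.replace(hour=9, minute=45), "name": "large attachment"},
    {"user": "alice", "time": day.replace(hour=10, minute=0), "name": "login from new IP"},
]
findings = correlate(events)
```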

---

### **3.5 Case Management**

When alerts require investigation, they become **Cases**.

#### **Case Structure:**
```
Case ID: CASE-2024-001
Title: "Suspected Data Exfiltration - Bob Jones"
Status: Open → In Progress → Resolved
Priority: Critical
Assigned To: "Security Analyst Sarah"
Created: "2024-01-15 15:00:00"
Related Alerts: [ALT-001, ALT-002, ALT-003]
Related Events: [EVT-100 through EVT-125]
```

#### **Case Workflow:**
1. **Open:** Alert triggered, case created
2. **Assigned:** Analyst takes ownership
3. **In Progress:** Investigation ongoing
4. **Pending:** Waiting for additional information
5. **Resolved:** Investigation complete
6. **Closed:** Case archived
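The workflow above behaves like a small state machine. In the sketch below the status names come from the list, while the allowed transitions (for example, Pending returning to In Progress) are an assumed reading of it.

```python
from enum import Enum


class CaseStatus(Enum):
    OPEN = "Open"
    ASSIGNED = "Assigned"
    IN_PROGRESS = "In Progress"
    PENDING = "Pending"
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Assumed transitions; the source lists the states but not the edges.
ALLOWED = {
    CaseStatus.OPEN: {CaseStatus.ASSIGNED},
    CaseStatus.ASSIGNED: {CaseStatus.IN_PROGRESS},
    CaseStatus.IN_PROGRESS: {CaseStatus.PENDING, CaseStatus.RESOLVED},
    CaseStatus.PENDING: {CaseStatus.IN_PROGRESS},
    CaseStatus.RESOLVED: {CaseStatus.CLOSED},
    CaseStatus.CLOSED: set(),
}


def advance(current: CaseStatus, new: CaseStatus) -> CaseStatus:
    """Return the new status, or raise ValueError for a disallowed transition."""
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot move case from {current.value} to {new.value}")
    return new


status = advance(CaseStatus.OPEN, CaseStatus.ASSIGNED)
```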

---

### **3.6 Display and Reporting**

The system provides multiple views of security status.

#### **A. Dashboard Display**
High-level overview:
```
System Display:
├── Total Monitored Users: 523
├── High Risk Users: 12
├── Medium Risk Users: 45
├── Low Risk Users: 466
├── Active Alerts: 28
├── Open Cases: 5
└── Models Running: 15
```

#### **B. User Risk Display**
Individual user view:
```
User: alice.smith
Risk Score: 45 (MEDIUM)
Risk Level: MEDIUM
Last Activity: "2024-01-15 14:30:00"
Anomalies Detected: 3
Alerts Count: 1
Recent Events:
  - Accessed unusual folder
  - Downloaded large file
  - Logged in after hours
```

#### **C. Statistics Display**
Aggregate metrics:
```
Risk Statistics:
├── Total Users: 523
├── High Risk: 12 (2.3%)
├── Medium Risk: 45 (8.6%)
├── Low Risk: 466 (89.1%)
├── Alerts Today: 15
├── Anomalies This Week: 87
├── Active Models: 15
└── Connected Data Sources: 8
```
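The percentages in this display follow directly from the user counts. A short sketch, with a hypothetical function name:

```python
def risk_breakdown(counts: dict) -> dict:
    """Return each level's share of all users as a percentage, to one decimal."""
    total = sum(counts.values())
    return {level: round(100 * n / total, 1) for level, n in counts.items()}


shares = risk_breakdown({"High": 12, "Medium": 45, "Low": 466})
```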

---

### **3.7 Further Actions Downstream**

This is where the system takes **automated response actions**.

#### **A. Block the Request**
Immediately stop a suspicious action.

**Example Scenario:**
```
Detection: User trying to upload 10GB to personal Dropbox
Risk Score: 95 (HIGH)
Automated Action: BLOCK
Result: File transfer blocked
Notification: Security team alerted
User Message: "Transfer blocked - contact IT"
```

**Use Cases:**
- Block data exfiltration attempts
- Block access to malicious websites
- Block unauthorized system access

---

#### **B. Delay the Request**
Slow down suspicious activity to allow investigation.

**Example Scenario:**
```
Detection: User requesting access to sensitive database
Risk Score: 65 (MEDIUM)
Automated Action: DELAY (24 hours)
Result: Access queued for approval
Notification: User's manager notified
User Message: "Access request pending approval"
```

**Use Cases:**
- Delay large file transfers
- Delay privilege escalation requests
- Delay access to sensitive systems

---

#### **C. Wait for Approval**
Require human authorization before proceeding.

**Example Scenario:**
```
Detection: Admin requesting to delete backup files
Risk Score: 50 (MEDIUM)
Automated Action: REQUIRE_APPROVAL
Approval Chain: Direct Manager → IT Security → CISO
Current Status: Pending Manager Approval
Timeout: 72 hours
```

**Use Cases:**
- Sensitive data access
- Bulk deletions
- Configuration changes
- Elevated privilege requests

---

#### **D. Multi-Factor Verification**
Require additional identity verification.

**Example Scenario:**
```
Detection: Login from new country
Risk Score: 55 (MEDIUM)
Automated Action: STEP_UP_AUTH
Required: SMS code + Security question
Result: User verified → Access granted
         Verification failed → Account locked
```

---

#### **E. Isolate Endpoint**
Quarantine a potentially compromised device.

**Example Scenario:**
```
Detection: Machine exhibiting malware-like behavior
Risk Score: 90 (HIGH)
Automated Action: ISOLATE
Result: Device network access blocked
        Device can only communicate with security tools
Notification: User notified, IT dispatched
```

---

#### **F. Disable Account**
Temporarily or permanently disable user access.

**Example Scenario:**
```
Detection: Confirmed compromised credentials
Risk Score: 100 (HIGH)
Automated Action: DISABLE_ACCOUNT
Result: All active sessions terminated
        Account locked
        Password reset required
Notification: User and security team notified
```

---

#### **G. Notify Stakeholders**
Alert relevant people about security events.

**Notification Chains:**

1. **Technical Alert:**
   - Security Operations Center (SOC)
   - IT Administrator
   - System Owner

2. **Management Alert:**
   - User's Direct Manager
   - Department Head
   - HR (if policy violation)

3. **Executive Alert:**
   - Chief Information Security Officer (CISO)
   - Chief Information Officer (CIO)
   - Legal (if regulatory implications)

---

#### **H. Create Audit Record**
Document everything for compliance and legal purposes.

**Audit Record:**
```
Record ID: AUD-2024-00001
Timestamp: "2024-01-15 15:30:00"
Event: "High risk behavior detected"
User: "bob.jones"
Action Taken: "Account disabled, case opened"
Authorized By: "Automated Policy + Manager Approval"
Evidence Preserved: Yes
Retention Period: 7 years
```

---

### **3.8 Output Throughput**

The system generates outputs at roughly the following rates; actual figures depend on deployment size and hardware:

| Output Type | Throughput | Latency |
|-------------|------------|---------|
| Anomaly Events | 1000+ per second | Real-time |
| Risk Calculations | 100+ users per second | Near real-time |
| Alerts | 100+ per second | < 1 second |
| Case Creation | 10+ per second | < 5 seconds |
| Blocking Actions | Immediate | Milliseconds |
| Notifications | 50+ per second | < 10 seconds |
| Audit Records | 1000+ per second | Real-time |

---

### **3.9 API Access**

All outputs are available through the REST API:

**Key Endpoints:**

1. **GET /alerts/** - Retrieve all security alerts
2. **GET /anomalies/** - Get detected anomalies
3. **GET /user/{name}/risk** - Get user risk score
4. **POST /analyze** - Trigger analysis job
5. **GET /cases/** - List investigation cases
6. **GET /stats/risk** - Get overall risk statistics
7. **GET /mitre/** - Get MITRE ATT&CK mappings
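A client could query these endpoints with nothing more than the standard library. In the sketch below, the base URL (localhost, port 5000) and the assumption that responses are JSON are not confirmed by the endpoint list; the helper names are hypothetical.

```python
import json
from urllib.parse import quote, urljoin
from urllib.request import urlopen

BASE_URL = "http://localhost:5000/"  # assumed address of a local server


def user_risk_url(name: str, base_url: str = BASE_URL) -> str:
    """Build the URL for GET /user/{name}/risk, escaping the user name."""
    return urljoin(base_url, f"user/{quote(name, safe='')}/risk")


def get_json(url: str, timeout: float = 5.0):
    """Fetch a URL and decode the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


url = user_risk_url("alice.smith")
```

With a server running, `get_json(url)` would return the decoded response; the structure of that response is not documented in this section.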

---

---

## **COMPLETE FLOW SUMMARY**

The following scenario walks through a complete example from start to finish:

### **Scenario: Detecting an Insider Threat**

**LAYER 1 - INPUT:**
1. Proxy logs stream in showing user "eve.suspicious" browsing activity
2. Data is in CSV format, stored on local disk
3. System reads 1 million records, identifies 500 unique users
4. User "eve.suspicious" is extracted and profile created

**LAYER 2 - PROCESSING:**
1. "Data Exfiltration Model" analyzes eve's browsing patterns
2. Detects: 50GB uploaded to personal cloud (baseline: 100MB/day)
3. "After Hours Model" detects: Activity at 2 AM (normal: 9AM-6PM)
4. "Sensitive Access Model" detects: HR database accessed
5. Rule fires: Downloads > (mean + 3*std_dev)
6. Risk score calculated: 10 + 30 + 25 + 20 = 85 (HIGH)
7. Model hashes verified - legitimate analysis
8. MITRE mapping: T1048 (Exfiltration Over Alternative Protocol)

**LAYER 3 - OUTPUT:**
1. **Anomaly Event** created: "Massive data upload detected"
2. **Alert** generated: CRITICAL severity
3. **Correlation** performed: Links to unusual login, HR access
4. **Automated Actions:**
   - BLOCK: Further uploads blocked
   - NOTIFY: Security team alerted
   - CASE: Investigation case created
5. **Audit Record** created for compliance
6. **Dashboard** updated with new high-risk user
7. **API** returns alert details to SIEM integration

**OUTCOME:** Potential data theft stopped, investigation begins, evidence preserved.

---

This three-layer architecture ensures that OpenUBA can:
- **Collect** data from any source at any scale
- **Process** it with sophisticated analytics while maintaining quality
- **Output** actionable intelligence with appropriate automated responses



# Folder Structure After Removing the Frontend Part of This Project



---

# **OpenUBA Project - Complete File & Folder Analysis**

---

## **ROOT LEVEL FILES**

| File | Role |
|------|------|
| **README.md** | Project documentation - explains what OpenUBA is, how to install, and basic usage |
| **requirements.txt** | Lists all Python packages needed (FastAPI, Pandas, TensorFlow, etc.) |
| **Makefile** | Automation commands - shortcuts for build, test, run operations |
| **DockerfileServer** | Docker container configuration - packages the app for deployment |
| **LICENSE** | Legal terms - defines how others can use this code |
| **.travis.yml** | CI/CD configuration - automated testing when code is pushed |
| **.gitignore** | Tells Git which files to ignore (cache files, logs, etc.) |
| **demo_test.py** | Test script to demonstrate all API endpoints working |

---

## **ROOT LEVEL FOLDERS**

| Folder | Role |
|--------|------|
| **core/** | **THE HEART** - All main application code lives here |
| **data/** | Empty placeholder for production data storage |
| **docs/** | Documentation files (installation guide) |
| **test_datasets/** | Sample log files for testing the system |
| **venv/** | Python virtual environment (isolated dependencies) |

---

## **CORE/ FOLDER - Main Application**

### **Entry Point & Server**

| File | Role | Flow Position |
|------|------|---------------|
| **core.py** | **MAIN SERVER** - FastAPI application, all REST API endpoints, starts the web server on port 5000 | START HERE → Everything connects to this |
| **__init__.py** | Makes core a Python package so files can import each other | Infrastructure |

---

### **INPUT LAYER Files (Data Collection)**

| File | Role | What It Does |
|------|------|--------------|
| **dataset.py** | **Data Reader** - Loads data from various sources | Reads CSV, Parquet, JSON from Disk/HDFS/Elasticsearch |
| **process.py** | **Data Ingestion Engine** - Orchestrates data loading | Reads scheme.json → loads each data source → extracts users |
| **database.py** | **Storage Abstraction** - Handles all file read/write | Connects to FileSystem, HDFS; reads/writes JSON files |
| **entity.py** | **Non-Human Actor Manager** - Tracks devices/servers | Creates Entity objects (servers, workstations, etc.) |
| **user.py** | **Human Actor Manager** - Tracks all users | Extracts usernames from logs, creates User profiles |

**Flow:** dataset.py → process.py → user.py / entity.py → database.py
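
The flow above can be sketched in a few lines of standard-library Python. This is only an illustration, not the project's actual code: the `SCHEME` layout, its key names (`sources`, `delimiter`, `user_column`, `columns`), the sample log text, and the three function names are all assumptions made up for this example.

```python
import csv
import io

# Hypothetical scheme entry; the real scheme.json layout may differ.
SCHEME = {
    "sources": [
        {
            "name": "proxy",
            "delimiter": " ",
            "user_column": "user",
            "columns": ["timestamp", "user", "url", "bytes"],
        },
    ]
}

# Stand-in for a log file on disk, so the sketch runs without test data.
SAMPLE_LOGS = {
    "proxy": (
        "1700000000 alice example.com 512\n"
        "1700000005 bob example.org 2048\n"
        "1700000009 alice example.net 128\n"
    ),
}


def load_source(source, raw_text):
    """Parse one delimited log into a list of row dicts (the dataset.py role)."""
    reader = csv.DictReader(
        io.StringIO(raw_text),
        fieldnames=source["columns"],
        delimiter=source["delimiter"],
    )
    return list(reader)


def extract_users(rows, user_column):
    """Collect the distinct usernames found in the rows (the user.py role)."""
    return {row[user_column] for row in rows}


def ingest(scheme, logs):
    """Walk every configured source and gather users (the process.py role)."""
    users = set()
    for source in scheme["sources"]:
        rows = load_source(source, logs[source["name"]])
        users.update(extract_users(rows, source["user_column"]))
    return sorted(users)


discovered_users = ingest(SCHEME, SAMPLE_LOGS)
```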

---

### **PROCESSING LAYER Files (Analysis)**

| File | Role | What It Does |
|------|------|--------------|
| **model.py** | **ML Model Engine** - Runs all machine learning models | Loads models, verifies integrity, executes analysis, returns risk scores |
| **risk.py** | **Risk Calculator** - Computes user risk scores | Takes model outputs → calculates 0-100 score → assigns LOW/MEDIUM/HIGH |
| **anomaly.py** | **Anomaly Detector** - Finds unusual behaviors | Detects login anomalies, data access anomalies, time anomalies |
| **riskmanager.py** | **Risk Orchestrator** - Coordinates risk calculations | Manages multiple risk jobs running in parallel |

**Flow:** Data → model.py → risk.py / anomaly.py → riskmanager.py
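
As a rough sketch of the risk step described in the table, the snippet below turns several model outputs into one 0-100 score and then assigns a level. The averaging rule, the assumption that each model returns a value between 0 and 1, and the cut-offs of 40 and 70 are illustrative choices for this example, not values taken from risk.py.

```python
def combine_scores(model_outputs):
    """Average per-model scores (each assumed to lie in [0, 1]) onto a 0-100 scale."""
    if not model_outputs:
        return 0.0
    mean = sum(model_outputs) / len(model_outputs)
    # Clamp so that out-of-range model outputs cannot leave the 0-100 scale.
    return max(0.0, min(100.0, mean * 100))


def risk_level(score, medium_at=40, high_at=70):
    """Map a 0-100 score to LOW/MEDIUM/HIGH using assumed cut-offs."""
    if score >= high_at:
        return "HIGH"
    if score >= medium_at:
        return "MEDIUM"
    return "LOW"


score = combine_scores([0.5, 1.0, 0.75])
level = risk_level(score)
```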

---

### **OUTPUT LAYER Files (Actions & Alerts)**

| File | Role | What It Does |
|------|------|--------------|
| **alert.py** | **Alert Generator** - Creates security notifications | Creates CRITICAL/HIGH/MEDIUM/LOW alerts with descriptions |
| **display.py** | **Dashboard Data** - Prepares data for viewing | Formats statistics, user lists, risk summaries for UI/API |
| **api.py** | **API Helper** - Internal API utilities | Routes requests, formats responses, connects to external services |

**Flow:** Analysis results → alert.py → display.py → api.py → core.py (API response)
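
The output flow can be pictured with the small sketch below: a risk score becomes an alert, and a list of alerts is formatted into a response body with the most severe alert first. The `Alert` fields, the severity thresholds, and the function names `create_alert` and `format_response` are assumptions for illustration only; the real alert.py and display.py may look different.

```python
from dataclasses import asdict, dataclass

# Severity names come from the table above, ordered from least to most severe.
SEVERITIES = ("LOW", "MEDIUM", "HIGH", "CRITICAL")


@dataclass
class Alert:
    user: str
    severity: str
    description: str


def create_alert(user, score):
    """Build an alert from a 0-100 risk score (thresholds are assumed values)."""
    if score >= 90:
        severity = "CRITICAL"
    elif score >= 70:
        severity = "HIGH"
    elif score >= 40:
        severity = "MEDIUM"
    else:
        severity = "LOW"
    description = f"Risk score {score} recorded for user {user}"
    return Alert(user=user, severity=severity, description=description)


def format_response(alerts):
    """Sort alerts by severity (highest first) and return a JSON-ready dict."""
    rank = {name: position for position, name in enumerate(SEVERITIES)}
    ordered = sorted(alerts, key=lambda alert: rank[alert.severity], reverse=True)
    return {"count": len(ordered), "alerts": [asdict(alert) for alert in ordered]}


response = format_response([create_alert("alice", 35), create_alert("bob", 95)])
```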

---

### **UTILITY Files (Supporting Functions)**

| File | Role | What It Does |
|------|------|--------------|
| **encode.py** | **Base64 Encoder/Decoder** | Encodes model files for transmission, decodes for installation |
| **hash.py** | **Hashing Utility** - SHA256 | Creates fingerprints of files to verify integrity (anti-tampering) |
| **utility.py** | **General Utilities** | Timestamps, helper functions used across the project |
| **test.py** | **Test Framework** | Unit tests for the core functionality |
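
To show how the encoding and hashing utilities can work together, here is a minimal sketch using Python's standard `base64` and `hashlib` modules. The function names `fingerprint`, `encode_model`, and `decode_model` are hypothetical and are not taken from encode.py or hash.py; the idea is simply that a model file is encoded for transmission and its SHA256 fingerprint is checked again before installation.

```python
import base64
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def encode_model(data: bytes) -> str:
    """Base64-encode raw model bytes so they can travel as text."""
    return base64.b64encode(data).decode("ascii")


def decode_model(encoded: str, expected_hash: str) -> bytes:
    """Decode a model and refuse it if its fingerprint does not match."""
    data = base64.b64decode(encoded)
    if fingerprint(data) != expected_hash:
        raise ValueError("model hash mismatch - file may have been tampered with")
    return data


payload = b"example model bytes"
digest = fingerprint(payload)
restored = decode_model(encode_model(payload), digest)
```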

---

### **TEST Files**

| File | Role |
|------|------|
| **dataset_test.py** | Tests data loading functionality |
| **encode_test.py** | Tests Base64 encoding/decoding |
| **hash_test.py** | Tests hashing functions |
| **process_test.py** | Tests data processing pipeline |
| **FILE_GUIDE.md** | Documentation explaining core files |

---

## **CORE/STORAGE/ - Configuration & Data**

| File | Role | Content |
|------|------|---------|
| **scheme.json** | **Data Source Configuration** | Defines what log files to read, their format, delimiter, location |
| **models.json** | **Model Configuration** | Lists all ML models, their settings, MITRE mappings, rules |
| **default_model_library.json** | **Backup Model Config** | Default configuration if models.json is corrupted |
| **model_sessions.json** | **Execution Log** | Records when each model ran and results |
| **users.json** | **User Database** | Stores all discovered users and their profiles |
| **settings.json** | **System Settings** | Application configuration options |
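
The fallback role of `default_model_library.json` can be sketched as follows. This is an assumption about how such a fallback might be written, not the project's actual loader: the function name `load_model_config` and the sample configuration content are made up for this example, and the demo runs in a temporary folder instead of the real storage directory.

```python
import json
import tempfile
from pathlib import Path


def load_model_config(primary: Path, fallback: Path) -> dict:
    """Return the parsed primary config, or the fallback if the primary is unusable."""
    try:
        return json.loads(primary.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        # Missing or corrupted primary file: use the default library instead.
        return json.loads(fallback.read_text(encoding="utf-8"))


# Demonstration with a deliberately corrupted models.json.
with tempfile.TemporaryDirectory() as tmp:
    storage = Path(tmp)
    (storage / "models.json").write_text("{ not valid json", encoding="utf-8")
    (storage / "default_model_library.json").write_text(
        json.dumps({"model_test": {"enabled": True}}), encoding="utf-8"
    )
    config = load_model_config(
        storage / "models.json", storage / "default_model_library.json"
    )
```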

### **CORE/STORAGE/MITRE/**
| File | Role |
|------|------|
| **matrix.json** | Complete MITRE ATT&CK framework data - maps attacks to techniques |

### **CORE/STORAGE/USERS/**
| Folder | Role |
|--------|------|
| Individual user folders | Each discovered user gets a folder to store their behavioral data |

### **CORE/STORAGE/SAVED_MODELS/**
| Content | Role |
|---------|------|
| Empty except for a README | Placeholder for saving trained model weights |

---

## **CORE/MODEL_LIBRARY/ - ML Models**

This folder contains **ready-to-use machine learning models**.

| Folder | Model Type | Purpose |
|--------|------------|---------|
| **model_test/** | Test Model | Basic model for testing that the framework works |
| **model_1/** | Standard Model | General behavior analysis |
| **model_simple_re/** | Regex Model | Pattern matching using regular expressions |
| **model_simple_re_pyspark/** | Spark Regex Model | Same as above but for distributed big data |
| **model_sk_pickle/** | Scikit-Learn Model | Traditional ML (Random Forest, SVM, etc.) |
| **model_tf_protobuf/** | TensorFlow Model | Deep learning neural networks |

**Each model folder contains:**
- `__init__.py` - Makes it a Python package
- `MODEL.py` - Contains the `execute()` function that runs the analysis
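
To give a feel for what a `MODEL.py` might contain, here is a minimal sketch of a regex-style model in the spirit of `model_simple_re/`. Only the function name `execute()` comes from the description above; its parameter, its return value, the row keys `user` and `url`, and the pattern itself are assumptions for illustration.

```python
import re

# Hypothetical pattern: flag requests for executable or archive downloads.
SUSPICIOUS_URL = re.compile(r"\.(exe|zip)$", re.IGNORECASE)


def execute(rows):
    """Count suspicious requests per user.

    rows: iterable of dicts with the assumed keys "user" and "url".
    Returns a dict mapping each flagged user to a hit count.
    """
    hits = {}
    for row in rows:
        if SUSPICIOUS_URL.search(row["url"]):
            hits[row["user"]] = hits.get(row["user"], 0) + 1
    return hits


result = execute(
    [
        {"user": "alice", "url": "example.com/tool.exe"},
        {"user": "bob", "url": "example.org/index.html"},
        {"user": "alice", "url": "example.net/archive.ZIP"},
    ]
)
```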

---

## **CORE/MODEL_MODULES/ - Data Loaders**

These modules **load data for models** from different sources.

| Folder | What It Loads |
|--------|---------------|
| **local_pandas/** | CSV/Parquet from local disk using Pandas |
| **es/** | Data from Elasticsearch database |
| **test_module/** | Test data loader for development |

**Each contains:**
- `__init__.py` - Package initializer
- Main `.py` file - Actual loading logic

---

## **TEST_DATASETS/ - Sample Data**

### **test_datasets/toy_1/**
Sample log files for testing:

| Folder | Log Type | What It Contains |
|--------|----------|------------------|
| **proxy/** | Web Proxy Logs | `bluecoat.log` - User web browsing activity (50MB+) |
| **dns/** | DNS Logs | Domain name lookups |
| **ssh/** | SSH Logs | Remote login attempts |
| **dhcp/** | DHCP Logs | Network address assignments |

---

## **COMPLETE DATA FLOW DIAGRAM**

```
┌─────────────────────────────────────────────────────────────────┐
│                        INPUT LAYER                               │
├─────────────────────────────────────────────────────────────────┤
│  test_datasets/       →    dataset.py    →    process.py        │
│  (proxy, dns, ssh)         (reads files)      (orchestrates)    │
│                                  ↓                               │
│  storage/scheme.json  →    database.py   →    user.py           │
│  (configuration)           (file I/O)         (extract users)   │
└─────────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────────┐
│                      PROCESSING LAYER                            │
├─────────────────────────────────────────────────────────────────┤
│  model_library/       →    model.py      →    risk.py           │
│  (ML models)               (run models)       (calculate score) │
│                                  ↓                               │
│  storage/models.json  →    anomaly.py    →    riskmanager.py    │
│  (model config)            (find unusual)     (coordinate)      │
│                                  ↓                               │
│  model_modules/       →    encode.py / hash.py                  │
│  (data loaders)            (verify integrity)                   │
└─────────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────────┐
│                       OUTPUT LAYER                               │
├─────────────────────────────────────────────────────────────────┤
│  alert.py             →    display.py    →    api.py            │
│  (create alerts)           (format data)      (API helpers)     │
│                                  ↓                               │
│                            core.py                               │
│                     (REST API endpoints)                         │
│                              ↓                                   │
│                    http://localhost:5000                         │
│                    /docs (Swagger UI)                            │
└─────────────────────────────────────────────────────────────────┘
```

---

## **SUMMARY TABLE - All Files by Function**

| Function | Files |
|----------|-------|
| **Server/API** | core.py, api.py |
| **Data Loading** | dataset.py, process.py, database.py |
| **User/Entity Tracking** | user.py, entity.py |
| **ML Models** | model.py + model_library/* |
| **Data Loaders for Models** | model_modules/* |
| **Risk Analysis** | risk.py, riskmanager.py |
| **Anomaly Detection** | anomaly.py |
| **Alerts** | alert.py |
| **Display/Output** | display.py |
| **Security Utilities** | encode.py, hash.py |
| **General Utilities** | utility.py |
| **Configuration** | storage/*.json |
| **Test Data** | test_datasets/toy_1/* |
| **Tests** | *_test.py, test.py |
| **Documentation** | README.md, docs/INSTALL.md, FILE_GUIDE.md |
| **DevOps** | Makefile, DockerfileServer, .travis.yml, requirements.txt |

---

## **EXECUTION ORDER (When Server Starts)**

1. **core.py** starts → Creates FastAPI server
2. Imports all modules (model, risk, alert, etc.)
3. Loads **models.json** → Model configuration
4. Loads **scheme.json** → Data source configuration
5. Server listens on **port 5000**
6. When a request arrives:
   - **dataset.py** reads data
   - **model.py** runs analysis
   - **risk.py** calculates scores
   - **alert.py** creates alerts
   - **core.py** returns API response