
## BIT 3474 Project Helper File for Data Loading

This file describes the NVD and CWE datasets and shows how to flatten them once you have downloaded the data.

## NVD Data Dictionary

### CVE DataFrame

This DataFrame contains the core vulnerability information from the National Vulnerability Database (NVD).

| Column Name | Data Type | Description | Example | Source Field |
|------------|-----------|-------------|---------|--------------|
| id | string | The unique identifier for the vulnerability | CVE-2024-0001 | CVE_data_meta.ID |
| assigner | string | The organization that assigned the CVE | cna@mitre.org | CVE_data_meta.ASSIGNER |
| published_date | datetime | The date when the vulnerability was first published | 2024-09-23T18:15Z | publishedDate |
| last_modified_date | datetime | The date when the vulnerability was last modified | 2024-09-27T14:08Z | lastModifiedDate |
| description | string | The English description of the vulnerability | A buffer overflow vulnerability in... | description.description_data[].value |
| cwe | string | Common Weakness Enumeration identifier | CWE-119 | problemtype.problemtype_data[].description[].value |
| references | string | Semicolon-separated list of reference URLs | https://example.com/vuln1; https://example.com/vuln2 | references.reference_data[].url |
| cvss3_vector | string | CVSS v3 vector string | CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H | impact.baseMetricV3.cvssV3.vectorString |
| cvss3_base_score | float | CVSS v3 base score (0.0-10.0) | 9.8 | impact.baseMetricV3.cvssV3.baseScore |
| cvss3_base_severity | string | CVSS v3 qualitative severity rating | CRITICAL | impact.baseMetricV3.cvssV3.baseSeverity |
| attack_vector | string | The method by which vulnerability exploitation is possible | NETWORK | impact.baseMetricV3.cvssV3.attackVector |
| attack_complexity | string | The conditions beyond the attacker's control that must exist to exploit the vulnerability | LOW | impact.baseMetricV3.cvssV3.attackComplexity |
| privileges_required | string | The level of privileges an attacker must possess before successfully exploiting the vulnerability | NONE | impact.baseMetricV3.cvssV3.privilegesRequired |
| user_interaction | string | Whether the vulnerability can be exploited without user interaction | NONE | impact.baseMetricV3.cvssV3.userInteraction |
| scope | string | Whether a vulnerability in one component impacts resources beyond its security scope | UNCHANGED | impact.baseMetricV3.cvssV3.scope |
| confidentiality_impact | string | The impact to the confidentiality of the information resources | HIGH | impact.baseMetricV3.cvssV3.confidentialityImpact |
| integrity_impact | string | The impact to the integrity of the information resources | HIGH | impact.baseMetricV3.cvssV3.integrityImpact |
| availability_impact | string | The impact to the availability of the information resources | HIGH | impact.baseMetricV3.cvssV3.availabilityImpact |
| year | int | The year the vulnerability was published (derived) | 2024 | Derived from published_date |

### CVSS Score Ranges
- 0.0: None
- 0.1-3.9: Low
- 4.0-6.9: Medium
- 7.0-8.9: High
- 9.0-10.0: Critical
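
The ranges above can be turned into a small lookup, which is handy for checking `cvss3_base_severity` against `cvss3_base_score`. This is a minimal sketch; the function name `cvss3_severity` is an illustrative choice and is not part of the course code.

```python
def cvss3_severity(score: float) -> str:
    """Map a CVSS v3 base score (0.0-10.0) to its qualitative rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS score out of range: {score}")
    if score == 0.0:
        return "NONE"
    if score < 4.0:
        return "LOW"
    if score < 7.0:
        return "MEDIUM"
    if score < 9.0:
        return "HIGH"
    return "CRITICAL"


print(cvss3_severity(9.8))  # CRITICAL
```

Rows with a missing score (`None`/`NaN`) raise `ValueError` here, so filter them out before applying the function to a DataFrame column.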

### CPE DataFrame

This DataFrame contains Common Platform Enumeration (CPE) data, representing affected products and versions for each vulnerability.

| Column Name | Data Type | Description | Example | Source Field |
|------------|-----------|-------------|---------|--------------|
| cve_id | string | Foreign key linking to CVE DataFrame | CVE-2024-0001 | Derived from CVE ID |
| cpe23Uri | string | The full CPE 2.3 URI | cpe:2.3:a:vendor:product:version:*:*:*:*:*:*:* | configurations.nodes[].cpe_match[].cpe23Uri |
| vulnerable | boolean | Whether this configuration is vulnerable | True | configurations.nodes[].cpe_match[].vulnerable |
| versionStartIncluding | string | The starting version of affected software (inclusive) | 1.2.3 | configurations.nodes[].cpe_match[].versionStartIncluding |
| versionEndIncluding | string | The ending version of affected software (inclusive) | 1.2.10 | configurations.nodes[].cpe_match[].versionEndIncluding |
| versionStartExcluding | string | The starting version of affected software (exclusive) | 1.2.2 | configurations.nodes[].cpe_match[].versionStartExcluding |
| versionEndExcluding | string | The ending version of affected software (exclusive) | 1.2.11 | configurations.nodes[].cpe_match[].versionEndExcluding |
| vendor | string | The vendor name (parsed from cpe23Uri) | microsoft | Parsed from cpe23Uri |
| product | string | The product name (parsed from cpe23Uri) | windows | Parsed from cpe23Uri |
| version | string | The version string (parsed from cpe23Uri) | 10 | Parsed from cpe23Uri |

### CPE 2.3 URI Format
```
cpe:2.3:part:vendor:product:version:update:edition:language:sw_edition:target_sw:target_hw:other
```
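
To make the field order concrete, the sketch below splits a CPE 2.3 string into all eleven named components. The helper name `parse_cpe23` and the sample URI are illustrative assumptions. The split skips colons escaped with a backslash, which a plain `str.split(':')` would not do.

```python
import re

# Field names in the order they appear after the "cpe:2.3:" prefix
CPE_FIELDS = [
    "part", "vendor", "product", "version", "update", "edition",
    "language", "sw_edition", "target_sw", "target_hw", "other",
]


def parse_cpe23(uri: str) -> dict:
    """Split a CPE 2.3 formatted string into a dict of named components."""
    # Split only on colons that are not preceded by a backslash
    parts = re.split(r"(?<!\\):", uri)
    if len(parts) != 13 or parts[0] != "cpe" or parts[1] != "2.3":
        raise ValueError(f"Not a well-formed CPE 2.3 string: {uri!r}")
    return dict(zip(CPE_FIELDS, parts[2:]))


# Illustrative URI built from the example values in the table above
parsed = parse_cpe23("cpe:2.3:o:microsoft:windows:10:*:*:*:*:*:*:*")
print(parsed["vendor"], parsed["product"], parsed["version"])
```

The `part` field is `a` for applications, `o` for operating systems, and `h` for hardware.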

### Common Values

#### Attack Vector (AV)
- NETWORK (N)
- ADJACENT_NETWORK (A)
- LOCAL (L)
- PHYSICAL (P)

#### Attack Complexity (AC)
- LOW (L)
- HIGH (H)

#### Privileges Required (PR)
- NONE (N)
- LOW (L)
- HIGH (H)

#### User Interaction (UI)
- NONE (N)
- REQUIRED (R)

#### Scope (S)
- UNCHANGED (U)
- CHANGED (C)

#### Impact Ratings
- NONE (N)
- LOW (L)
- HIGH (H)
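
The abbreviations in parentheses are the ones used in `cvss3_vector`. As a rough sketch, a vector string can be expanded back into the full values listed above. The names `METRIC_NAMES`, `VALUE_NAMES`, and `expand_cvss3_vector` are hypothetical and chosen to match the CVE DataFrame column names.

```python
# Abbreviation -> CVE DataFrame column name
METRIC_NAMES = {
    "AV": "attack_vector",
    "AC": "attack_complexity",
    "PR": "privileges_required",
    "UI": "user_interaction",
    "S": "scope",
    "C": "confidentiality_impact",
    "I": "integrity_impact",
    "A": "availability_impact",
}

# Abbreviation -> full value, per metric
IMPACT = {"N": "NONE", "L": "LOW", "H": "HIGH"}
VALUE_NAMES = {
    "AV": {"N": "NETWORK", "A": "ADJACENT_NETWORK", "L": "LOCAL", "P": "PHYSICAL"},
    "AC": {"L": "LOW", "H": "HIGH"},
    "PR": {"N": "NONE", "L": "LOW", "H": "HIGH"},
    "UI": {"N": "NONE", "R": "REQUIRED"},
    "S": {"U": "UNCHANGED", "C": "CHANGED"},
    "C": IMPACT,
    "I": IMPACT,
    "A": IMPACT,
}


def expand_cvss3_vector(vector: str) -> dict:
    """Expand a CVSS v3 vector string into full base-metric names and values."""
    prefix, *metrics = vector.split("/")
    if not prefix.startswith("CVSS:3"):
        raise ValueError(f"Not a CVSS v3 vector: {vector!r}")
    expanded = {}
    for metric in metrics:
        key, _, value = metric.partition(":")
        if key in METRIC_NAMES:  # ignore any non-base metrics
            expanded[METRIC_NAMES[key]] = VALUE_NAMES[key][value]
    return expanded


result = expand_cvss3_vector("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
print(result["attack_vector"])  # NETWORK
```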

### Relationships
- One CVE can have multiple CPE matches (one-to-many relationship)
- The `cve_id` in the CPE DataFrame references the `id` in the CVE DataFrame
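
One way to use this relationship is a left join from the CVE table to the CPE table. The sketch below uses two tiny hand-made DataFrames (`cve_sample` and `cpe_sample`, with made-up rows) in place of the real data so that it runs on its own.

```python
import pandas as pd

# Made-up sample rows standing in for the real CVE and CPE DataFrames
cve_sample = pd.DataFrame({
    "id": ["CVE-2024-0001", "CVE-2024-0002"],
    "cvss3_base_severity": ["CRITICAL", "LOW"],
})
cpe_sample = pd.DataFrame({
    "cve_id": ["CVE-2024-0001", "CVE-2024-0001"],
    "vendor": ["microsoft", "microsoft"],
    "product": ["windows", "office"],
})

# Left join keeps CVEs that have no CPE match; validate checks that
# the "id" key really is unique on the CVE side (one-to-many)
joined = cve_sample.merge(
    cpe_sample,
    how="left",
    left_on="id",
    right_on="cve_id",
    validate="one_to_many",
)

# Number of CPE matches per CVE (count() skips the missing cve_id values)
matches_per_cve = joined.groupby("id")["cve_id"].count()
print(matches_per_cve)
```

A left join is used here so that CVEs without any CPE match stay in the result with empty CPE columns. Use `how="inner"` if you only want CVEs that have at least one affected product.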

### Usage Notes
1. When filtering by version ranges, consider both Including and Excluding fields
2. The `vulnerable` field should be checked when determining if a configuration is affected
3. CPE matching should account for wildcards (*) in the CPE URI
4. Some CVEs may have no CPE matches
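5. The version range fields may be empty, for example when the CPE URI itself names a specific version or when the `version` component is a wildcard (`*`) covering all versions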
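
A version-range check along the lines of note 1 might look like the sketch below. It assumes purely numeric, dot-separated versions such as `1.2.10`; real NVD data also contains versions with letters or other separators, which this simple `version_key` helper (a hypothetical name) would not handle. Empty strings are treated as "no bound", matching how the loader fills missing fields.

```python
def version_key(version: str) -> tuple:
    """Turn '1.2.10' into (1, 2, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def in_version_range(
    version: str,
    start_including: str = "",
    end_including: str = "",
    start_excluding: str = "",
    end_excluding: str = "",
) -> bool:
    """Check a version against the four optional range bounds."""
    v = version_key(version)
    if start_including and v < version_key(start_including):
        return False
    if start_excluding and v <= version_key(start_excluding):
        return False
    if end_including and v > version_key(end_including):
        return False
    if end_excluding and v >= version_key(end_excluding):
        return False
    return True


print(in_version_range("1.2.5", start_including="1.2.3", end_including="1.2.10"))  # True
```

Comparing the raw strings would give the wrong answer here, because `"1.2.10" < "1.2.3"` is `True` in a character-by-character comparison.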


### Code to extract the data from the NVD file

In [59]:
import json
import pandas as pd


def process_nvd_json(file_path: str) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Process an NVD JSON file and convert it to two normalized DataFrames:
    1. Main CVE DataFrame
    2. CPE matches DataFrame with foreign key to CVE
    
    Args:
        file_path (str): Path to the NVD JSON file
        
    Returns:
        tuple[pd.DataFrame, pd.DataFrame]: Tuple containing (cve_df, cpe_df)
    """
    # Read the JSON file (JSON is UTF-8 encoded, so decode it as such)
    with open(file_path, 'r', encoding='utf-8') as f:
        nvd_data = json.load(f)
    
    # Lists to store processed items
    cve_items = []
    cpe_items = []
    
    for cve_item in nvd_data['CVE_Items']:
        cve_data = {}
        
        # Basic CVE information
        cve_id = cve_item['cve']['CVE_data_meta']['ID']
        cve_data['id'] = cve_id
        cve_data['assigner'] = cve_item['cve']['CVE_data_meta']['ASSIGNER']
        cve_data['published_date'] = cve_item['publishedDate']
        cve_data['last_modified_date'] = cve_item['lastModifiedDate']
        
        # Description
        descriptions = cve_item['cve']['description']['description_data']
        cve_data['description'] = next((desc['value'] for desc in descriptions if desc['lang'] == 'en'), '')
        
        # Problem type (CWE)
        try:
            problemtype_data = cve_item['cve']['problemtype']['problemtype_data']
            if problemtype_data and problemtype_data[0]['description']:
                cve_data['cwe'] = problemtype_data[0]['description'][0].get('value', '')
            else:
                cve_data['cwe'] = ''
        except (KeyError, IndexError):
            cve_data['cwe'] = ''
        
        # References
        try:
            references = cve_item['cve']['references']['reference_data']
            cve_data['references'] = '; '.join(ref['url'] for ref in references)
        except (KeyError, IndexError):
            cve_data['references'] = ''
        
        # CVSS v3 metrics
        try:
            cvss3 = cve_item['impact']['baseMetricV3']['cvssV3']
            cve_data['cvss3_vector'] = cvss3.get('vectorString', '')
            cve_data['cvss3_base_score'] = cvss3.get('baseScore', None)
            cve_data['cvss3_base_severity'] = cvss3.get('baseSeverity', '')
            cve_data['attack_vector'] = cvss3.get('attackVector', '')
            cve_data['attack_complexity'] = cvss3.get('attackComplexity', '')
            cve_data['privileges_required'] = cvss3.get('privilegesRequired', '')
            cve_data['user_interaction'] = cvss3.get('userInteraction', '')
            cve_data['scope'] = cvss3.get('scope', '')
            cve_data['confidentiality_impact'] = cvss3.get('confidentialityImpact', '')
            cve_data['integrity_impact'] = cvss3.get('integrityImpact', '')
            cve_data['availability_impact'] = cvss3.get('availabilityImpact', '')
        except (KeyError, TypeError):
            cve_data.update({
                'cvss3_vector': '',
                'cvss3_base_score': None,
                'cvss3_base_severity': '',
                'attack_vector': '',
                'attack_complexity': '',
                'privileges_required': '',
                'user_interaction': '',
                'scope': '',
                'confidentiality_impact': '',
                'integrity_impact': '',
                'availability_impact': ''
            })
        
        # Process CPE matches
        try:
            nodes = cve_item['configurations']['nodes']
            for node in nodes:
                if 'cpe_match' in node:
                    for cpe in node['cpe_match']:
                        cpe_info = {
                            'cve_id': cve_id,
                            'cpe23Uri': cpe.get('cpe23Uri', ''),
                            'vulnerable': cpe.get('vulnerable', False),
                            'versionStartIncluding': cpe.get('versionStartIncluding', ''),
                            'versionEndIncluding': cpe.get('versionEndIncluding', ''),
                            'versionStartExcluding': cpe.get('versionStartExcluding', ''),
                            'versionEndExcluding': cpe.get('versionEndExcluding', '')
                        }
                        
                        # Parse CPE URI into components
                        cpe_parts = cpe_info['cpe23Uri'].split(':')
                        # Index 5 (version) is read below, so at least 6 parts are needed
                        if len(cpe_parts) > 5:
                            cpe_info.update({
                                'vendor': cpe_parts[3],
                                'product': cpe_parts[4],
                                'version': cpe_parts[5]
                            })
                        
                        cpe_items.append(cpe_info)
        except (KeyError, TypeError):
            pass
        
        cve_items.append(cve_data)
    
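    # Create DataFrames; explicit CPE columns keep the sort below
    # from failing when no CPE matches were found
    cve_df = pd.DataFrame(cve_items)
    cpe_columns = [
        'cve_id', 'cpe23Uri', 'vulnerable',
        'versionStartIncluding', 'versionEndIncluding',
        'versionStartExcluding', 'versionEndExcluding',
        'vendor', 'product', 'version'
    ]
    cpe_df = pd.DataFrame(cpe_items, columns=cpe_columns)
    
    # Convert date columns to datetime
    date_columns = ['published_date', 'last_modified_date']
    for col in date_columns:
        cve_df[col] = pd.to_datetime(cve_df[col])
    
    # Derive the `year` column listed in the data dictionary
    cve_df['year'] = cve_df['published_date'].dt.year
    
    # Sort DataFrames
    cve_df = cve_df.sort_values('id')
    cpe_df = cpe_df.sort_values(['cve_id', 'cpe23Uri'])
    return cve_df, cpe_df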


# Replace with your file name and path
file_path = "data/nvdcve-2024.json"
        
try:
    # Process the NVD JSON file
    cve_df, cpe_df = process_nvd_json(file_path)
            
    # Optionally save to CSV
    cve_df.to_csv('processed_cve_data.csv', index=False)
    cpe_df.to_csv('processed_cpe_data.csv', index=False)
            
except FileNotFoundError:
    print(f"Error: File '{file_path}' not found.")
except json.JSONDecodeError:
    print("Error: Invalid JSON file format.")
except Exception as e:
    print(f"Error processing file: {str(e)}")

## CWE Data Dictionary 


| Field Category | Field Name | Type | Description | Example/Possible Values |
|---------------|------------|------|-------------|------------------------|
| **Identification** | ID | String | Unique weakness identifier | "1004" |
| | Name | String | Concise weakness title | "Sensitive Cookie Without 'HttpOnly' Flag" |
| **Structural Metadata** | Abstraction | String | Generalization level | "Base", "Class", "Variant" |
| | Structure | String | Complexity representation | "Simple", "Chain", "Composite" |
| | Status | String | Documentation state | "Draft", "Incomplete", "Stable" |
| **Description** | Description | String | Brief technical explanation | Technical vulnerability summary |
| | Extended Description | String | Comprehensive explanation | Detailed exploitation mechanism |
| **Contextual Fields** | Applicable Platforms | Array | Technology/language contexts | [{"Type": "Web", "Class": "JavaScript"}] |
| | Common Consequences | Array | Potential security impacts | [{"Scope": "Confidentiality", "Impact": "Data Exposure"}] |
| | Observed Examples | Array | Real-world vulnerability instances | [{"Reference": "CVE-2022-XXXX", "Description": "Specific vulnerability details"}] |
| **Metadata** | References | Array | External documentation sources | [{"Title": "OWASP Guide", "URL": "example.com"}] |
| | Content History | Array | Documentation modification tracking | [{"Type": "Modification", "Date": "2023-01-01"}] |

In [2]:
def comprehensive_field_extractor(data):
    """
    Extract all fields from CWE weakness data, including nested fields stored as separate columns.
    
    :param data: Raw CWE weakness data
    :return: Flattened dictionary of all extractable fields
    """
    flattened = {}
    
    # Core weakness fields
    core_fields = [
        'ID', 'Name', 'Abstraction', 'Structure', 
        'Status', 'Description', 'ExtendedDescription', 'LikelihoodOfExploit', 'BackgroundDetails'
    ]
    for field in core_fields:
        flattened[field] = data.get(field, 'N/A')
    
    # Complex nested fields
    # ApplicablePlatforms
    flattened['Platforms'] = ', '.join([
        f"{p.get('Type', '')}: {p.get('Class', '')}" 
        for p in data.get('ApplicablePlatforms', [])
    ])
    
    # CommonConsequences (split into separate columns)
    common_consequences = data.get('CommonConsequences', [])
    for i, consequence in enumerate(common_consequences):
        flattened[f'Consequence_{i+1}_Scope'] = consequence.get('Scope', 'N/A')
        flattened[f'Consequence_{i+1}_Impact'] = consequence.get('Impact', 'N/A')

    # ObservedExamples (split into separate columns)
    observed_examples = data.get('ObservedExamples', [])
    for i, example in enumerate(observed_examples):
        flattened[f'ObservedExample_{i+1}_Reference'] = example.get('Reference', 'N/A')
        flattened[f'ObservedExample_{i+1}_Description'] = example.get('Description', 'N/A')

    # References (split into separate columns)
    references = data.get('References', [])
    for i, reference in enumerate(references):
        flattened[f'Reference_{i+1}_Title'] = reference.get('Title', 'N/A')
        flattened[f'Reference_{i+1}_URL'] = reference.get('URL', 'N/A')  # Add more fields if needed
    
    return flattened
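
Because each weakness can have a different number of consequences, examples, and references, the numbered-column approach above produces a wide table with many empty cells. As an alternative sketch, `pd.json_normalize` can build a long-format table with one row per consequence. The `sample_weaknesses` records below are made up for illustration and only mimic the `Scope`/`Impact` shape the extractor expects.

```python
import pandas as pd

# Made-up records shaped like the input the extractor above expects
sample_weaknesses = [
    {
        "ID": "1004",
        "CommonConsequences": [
            {"Scope": "Confidentiality", "Impact": "Read Application Data"},
            {"Scope": "Integrity", "Impact": "Gain Privileges or Assume Identity"},
        ],
    },
    {
        "ID": "79",
        "CommonConsequences": [
            {"Scope": "Access Control", "Impact": "Bypass Protection Mechanism"},
        ],
    },
]

# One row per consequence, with the parent weakness ID carried along
consequences_df = pd.json_normalize(
    sample_weaknesses,
    record_path="CommonConsequences",
    meta=["ID"],
)
print(consequences_df)
```

The long format is usually easier to filter and group (for example, counting weaknesses per `Scope`). Note that `record_path` raises a `KeyError` for records that lack the `CommonConsequences` key, so such records need to be filtered out or given an empty list first.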

In [None]:

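# Set up the API call and store the result of response.json() in "data",
# then run the code below to flatten the data.

flattened_weaknesses = [
    comprehensive_field_extractor(weakness)
    for weakness in data.get("Weaknesses", [])
]

# Convert the flattened records to a DataFrame
# (uses pandas, imported as pd in the NVD cell above)
cwe_df = pd.DataFrame(flattened_weaknesses)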