### The prompt used to create the Alpaca dataset

**Format:**
- **Instruction**: "Analyze the following Java code snippet and identify any security vulnerability"
- **Input**: Code Snippet
- **Output**: 
  - **If vulnerable**: `CWE-ID, Exact Vulnerable Line, Description`
  - **If BENIGN**: `No vulnerability detected. The code is secure.`

The instruction never names a CWE ID, so the task is generic vulnerability detection: the model must identify the weakness on its own and must also learn to distinguish vulnerable code from secure code.
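To make the two output shapes concrete, here is a minimal sketch that builds one record of each kind. The helper name `to_alpaca_record` and both example rows (the Java snippets, the `CWE-89` label, and the description) are hypothetical and are not taken from the dataset. It also assumes the `CWE ID` column holds values such as `CWE-89`.

```python
# Sketch only: the two rows below are hypothetical, not taken from dataset.csv.
INSTRUCTION = (
    "Analyze the following Java code snippet and identify any security vulnerability"
)
BENIGN_OUTPUT = "No vulnerability detected. The code is secure."


def to_alpaca_record(cwe_id, code_snippet, vulnerable_line, description):
    """Build one instruction/input/output record in the format described above."""
    if str(cwe_id).upper() == "BENIGN":
        output = BENIGN_OUTPUT
    else:
        output = f"{cwe_id}, {vulnerable_line}, {description}"
    return {"instruction": INSTRUCTION, "input": code_snippet, "output": output}


# Hypothetical vulnerable row (SQL built by string concatenation).
vulnerable = to_alpaca_record(
    "CWE-89",
    'String q = "SELECT * FROM users WHERE id = " + id;\nstmt.executeQuery(q);',
    'String q = "SELECT * FROM users WHERE id = " + id;',
    "User input is concatenated into the SQL query.",
)

# Hypothetical benign row; the line and description fields are unused.
benign = to_alpaca_record(
    "BENIGN",
    'PreparedStatement ps = conn.prepareStatement("SELECT * FROM users WHERE id = ?");',
    "",
    "",
)

print(vulnerable["output"])
print(benign["output"])
```

Both records share the same instruction, so the only signal the model gets about the vulnerability class comes from the code itself. Because the fields are joined with commas and a line of code can itself contain commas, only the leading CWE ID can be recovered from the output reliably.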

# Convert the dataset to the Alpaca format (instruction, input, output)
import pandas as pd

# Read the CSV file
file_path = 'training dataset/dataset.csv'
df = pd.read_csv(file_path)

# Create the alpaca format
df_find = df[['CWE ID', 'Code Snippet', 'Exact Vulnerable Line', 'Description']].copy()
df_find['instruction'] = "Analyze the following Java code snippet and identify any security vulnerability"
df_find['input'] = df_find['Code Snippet']

# BENIGN rows get a fixed "secure" message; vulnerable rows get the three fields joined by commas
def create_output(row):
    # strip() tolerates stray whitespace around the label
    if str(row['CWE ID']).strip().upper() == 'BENIGN':
        return "No vulnerability detected. The code is secure."
    return f"{row['CWE ID']}, {row['Exact Vulnerable Line']}, {row['Description']}"

df_find['output'] = df_find.apply(create_output, axis=1)

# Select only the required columns in the correct order
df_alpaca = df_find[['instruction', 'input', 'output']]

# Save the new dataframe to a CSV file
output_file_path = 'training dataset/alpaca_format_dataset.csv'
df_alpaca.to_csv(output_file_path, index=False)

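# Print statistics. BENIGN rows are exactly the fixed "secure" message, so compare
# against it directly instead of searching the output text for substrings.
benign_mask = df_alpaca['output'] == "No vulnerability detected. The code is secure."
print(f"Total number of rows: {len(df_alpaca)}")
print(f"Vulnerable cases: {(~benign_mask).sum()}")
print(f"BENIGN cases: {benign_mask.sum()}")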

Total number of rows: 20000
Vulnerable cases: 10000
BENIGN cases: 10000
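
For context, one way a record in this format might be turned into training text is sketched below. It uses the widely used Stanford Alpaca prompt template (the variant with an input field). Whether the training step for this dataset uses that exact template is an assumption, and the `render_example` helper and the sample record are hypothetical.

```python
# Sketch only: ALPACA_TEMPLATE is the commonly used Stanford Alpaca prompt
# (with-input variant); its use for this dataset is an assumption.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def render_example(record):
    """Return (prompt, full_text) for one instruction/input/output record."""
    prompt = ALPACA_TEMPLATE.format(
        instruction=record["instruction"], input=record["input"]
    )
    # The target output is appended after the "### Response:" marker.
    return prompt, prompt + record["output"]


# Hypothetical record with the same three keys as a row of the Alpaca CSV.
record = {
    "instruction": "Analyze the following Java code snippet and identify any security vulnerability",
    "input": "public void run() { Runtime.getRuntime().exec(cmd); }",
    "output": "CWE-78, Runtime.getRuntime().exec(cmd);, Command built from untrusted input.",
}

prompt, full_text = render_example(record)
print(full_text)
```

`str.format` only interprets braces in the template itself, so the curly braces in the Java snippet pass through unchanged. At inference time only `prompt` would be given to the model, and the generated text would be compared against `output`.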
