# HW7: Data Visualization

**Objective:** Practice data visualization techniques using the merged NVD+KEV dataset.

---
**Instructions:**
1. Run each code cell in order.
2. Interpret each plot in the accompanying markdown cell.
3. Save and submit this notebook (`HW7_visualizations.ipynb`).


## 1. Data Loading and Preparation
- Load the merged NVD+KEV dataset.
- Convert `CVSS v3 Score` to numeric and `Published Date` to datetime.
- Drop rows with missing values in `CVSS v3 Score`, `Published Date`, or `is_exploited` before plotting.


In [None]:
import pandas as pd

# Load merged dataset (adjust path if needed)
df = pd.read_csv('/mnt/data/merged_full_nvd_kev_2023.csv', low_memory=False)

# Convert types; unparseable values become NaN/NaT instead of raising
df['CVSS v3 Score'] = pd.to_numeric(df['CVSS v3 Score'], errors='coerce')
df['Published Date'] = pd.to_datetime(df['Published Date'], errors='coerce')

# Drop rows with missing values; .copy() avoids SettingWithCopyWarning when columns are added later
df_clean = df.dropna(subset=['CVSS v3 Score', 'Published Date', 'is_exploited']).copy()

# Show head
df_clean.head()
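
Because `errors='coerce'` turns unparseable values into missing ones without raising, it can help to check how many values each column loses before `dropna` removes the rows. The following is a minimal sketch on a small synthetic frame (the values and the helper name `coercion_report` are made up for illustration and are not part of the assignment data):

```python
import pandas as pd

# Synthetic stand-in for the merged dataset; all values are invented for illustration.
raw = pd.DataFrame({
    'CVSS v3 Score': ['9.8', '7.5', 'N/A', None],
    'Published Date': ['2023-01-15', 'not a date', '2023-03-02', '2023-04-20'],
    'is_exploited': [True, False, False, None],
})

def coercion_report(frame: pd.DataFrame) -> pd.Series:
    """Count missing values per column after the same coercion used in the notebook."""
    converted = pd.DataFrame({
        'CVSS v3 Score': pd.to_numeric(frame['CVSS v3 Score'], errors='coerce'),
        'Published Date': pd.to_datetime(frame['Published Date'], errors='coerce'),
        'is_exploited': frame['is_exploited'],
    })
    return converted.isna().sum()

report = coercion_report(raw)
print(report)
```

Running the same helper on `df` would show which column accounts for most of the dropped rows.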

## 2. Visualization 1: Distribution of Vulnerability Severity
- Histogram of CVSS v3 Scores
- Axes: `CVSS v3 Score` (bins) vs. Frequency


In [None]:
import matplotlib.pyplot as plt

plt.figure()
plt.hist(df_clean['CVSS v3 Score'], bins=20)
plt.title('Distribution of CVSS v3 Scores (2023)')
plt.xlabel('CVSS v3 Score')
plt.ylabel('Frequency')
plt.show()

**Interpretation:**
The histogram shows the distribution of severity scores. Most vulnerabilities cluster around the mid-range, with fewer very low or very high scores. The distribution is slightly right-skewed, indicating a tail of high-severity issues.
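
A visual reading of skew can be checked numerically. The sketch below, run on invented scores rather than the real dataset, counts scores per CVSS v3 qualitative severity band (Low 0.1-3.9, Medium 4.0-6.9, High 7.0-8.9, Critical 9.0-10.0) and computes the sample skewness. A positive skewness means a longer tail toward high scores; a negative one means a longer tail toward low scores.

```python
import pandas as pd

# Invented scores for illustration; replace with df_clean['CVSS v3 Score'] in the notebook.
scores = pd.Series([2.1, 4.3, 5.5, 6.1, 7.5, 7.8, 8.8, 9.8, 9.8], name='CVSS v3 Score')

# Bin edges follow the CVSS v3 qualitative bands; intervals are right-inclusive,
# so a score of exactly 0.0 ("None") falls outside every band and becomes NaN.
bands = pd.cut(
    scores,
    bins=[0.0, 3.9, 6.9, 8.9, 10.0],
    labels=['Low', 'Medium', 'High', 'Critical'],
)

# Count per band, ordered from Low to Critical
band_counts = bands.value_counts().sort_index()

# Sample skewness: > 0 means a tail toward high scores, < 0 a tail toward low scores
skewness = scores.skew()

print(band_counts)
print(f'skewness = {skewness:.2f}')
```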

## 3. Visualization 2: Vulnerabilities Over Time
- Line plot of monthly counts of published vulnerabilities
- Axes: `Month` vs. `Number of Vulnerabilities Published`


In [None]:
df_clean['Month'] = df_clean['Published Date'].dt.to_period('M').dt.to_timestamp()
monthly_counts = df_clean.groupby('Month').size()

plt.figure()
plt.plot(monthly_counts.index, monthly_counts.values, marker='o')
plt.title('Monthly Vulnerabilities Published (2023)')
plt.xlabel('Month')
plt.ylabel('Number of Vulnerabilities')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

**Interpretation:**
The line plot reveals publication trends across the year. Vulnerabilities are not evenly distributed; certain months show peaks, possibly due to coordinated vendor disclosures or seasonal reporting.
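
To name the peak months rather than read them off the plot, one option is to flag months whose count exceeds the mean by more than one standard deviation. This is a rough sketch on invented dates; the one-standard-deviation threshold is an arbitrary assumption, not a rule from the assignment.

```python
import pandas as pd

# Invented publication dates for illustration; in the notebook use df_clean['Published Date'].
dates = pd.to_datetime(
    ['2023-01-05'] * 3 + ['2023-02-10'] * 2 + ['2023-03-15'] * 9 + ['2023-04-20'] * 2
)

# One row per vulnerability, summed into month-start buckets
monthly_counts = pd.Series(1, index=dates).resample('MS').sum()

# Assumed threshold: mean plus one sample standard deviation
threshold = monthly_counts.mean() + monthly_counts.std()
peak_months = monthly_counts[monthly_counts > threshold]

print(peak_months)
```

With only twelve monthly values in a year, the threshold is sensitive to a single large month, so the flagged months are best treated as candidates for closer inspection.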

## 4. Visualization 3: Severity of Exploited vs. Non-Exploited
- Box plot comparing CVSS v3 Scores for exploited vs. non-exploited vulnerabilities


In [None]:
exploited = df_clean[df_clean['is_exploited'] == True]['CVSS v3 Score']
not_exploited = df_clean[df_clean['is_exploited'] == False]['CVSS v3 Score']

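plt.figure()
plt.boxplot([exploited, not_exploited])
# Label the boxes via xticks; boxplot's `labels` keyword is deprecated in Matplotlib 3.9+
plt.xticks([1, 2], ['Exploited', 'Not Exploited'])
plt.title('CVSS v3 Score by Exploitation Status (2023)')
plt.ylabel('CVSS v3 Score')
plt.show()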

**Interpretation:**
The box plot shows that exploited vulnerabilities tend to have a higher median severity score and a wider spread than non-exploited ones, which is consistent with attackers preferentially targeting more severe issues.
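
The median and spread that a box plot draws can also be tabulated directly, which makes the comparison easier to cite. The sketch below uses invented scores and a hypothetical helper `iqr`; it computes the count, median, and interquartile range per exploitation status.

```python
import pandas as pd

# Invented sample for illustration; in the notebook use df_clean instead.
sample = pd.DataFrame({
    'CVSS v3 Score': [9.8, 8.8, 7.5, 9.1, 5.3, 6.1, 7.5, 4.3, 9.8, 3.1],
    'is_exploited': [True, True, True, True, False, False, False, False, False, False],
})

def iqr(values: pd.Series) -> float:
    """Interquartile range: the height of the box in a box plot."""
    return values.quantile(0.75) - values.quantile(0.25)

# Readable group labels instead of True/False
status = sample['is_exploited'].map({True: 'Exploited', False: 'Not Exploited'})

summary = sample.groupby(status)['CVSS v3 Score'].agg(
    count='count', median='median', iqr=iqr
)
print(summary)
```

The group sizes matter here: the exploited group is usually much smaller than the non-exploited one, so its median and spread are estimated from far fewer points.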