# CSCA 5622 Machine Learning I Final Project

**Goal**: Identify a supervised learning problem, then perform EDA and model analysis on it.

### Deliverables

1. Jupyter notebook covering the supervised learning problem description, EDA procedure, analysis (model building and training), results, and discussion/conclusion.
2. Video presentation or demo (5-15 min) in `.mp4` format.

   - What problem do you solve?
   - What ML approach do you use, or what methods does your app use?
   - Show the result or run an app demo.

3. [Public GitHub Repo](https://github.com/davidhwilliams/CSCA-5622-ML-I-Final)


# Project Focus

Predict the CVSS severity of a vulnerability from its `description`, `CPEs`, and `keywords`.

## Links and Data Feeds

- [NVD Vulnerabilities](https://nvd.nist.gov/vuln)
- [NVD CVE API Guide](https://nvd.nist.gov/developers/vulnerabilities)

## Steps

### 1. Data Collection

- **Download CVE Dataset** from the NVD (includes CVSS scores, descriptions, affected software/hardware, date of discovery).
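
As a rough sketch of the collection step, the snippet below flattens one record of an NVD CVE API 2.0 response into the fields this project needs. The endpoint URL and the JSON field names (`vulnerabilities`, `cve`, `descriptions`, `metrics`, `cvssMetricV31`, `configurations`) are assumptions based on the API guide linked above and should be checked against a live response. `fetch_page`, `flatten_cve`, and the sample record are hypothetical, made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for the NVD CVE API 2.0; confirm against the API guide.
NVD_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"


def fetch_page(start_index=0, results_per_page=2000):
    """Download one page of CVE records (hypothetical helper; makes a network call)."""
    query = urllib.parse.urlencode(
        {"startIndex": start_index, "resultsPerPage": results_per_page}
    )
    with urllib.request.urlopen(f"{NVD_URL}?{query}", timeout=60) as resp:
        return json.load(resp)


def flatten_cve(item):
    """Reduce one entry of the 'vulnerabilities' list to a flat dict."""
    cve = item.get("cve", {})
    # Keep only the English description, if there is one.
    description = next(
        (d["value"] for d in cve.get("descriptions", []) if d.get("lang") == "en"),
        None,
    )
    # Assumed layout: metrics -> cvssMetricV31 -> [{"cvssData": {...}}]
    metrics = cve.get("metrics", {}).get("cvssMetricV31", [])
    cvss = metrics[0].get("cvssData", {}) if metrics else {}
    # Collect CPE match strings from the configuration nodes.
    cpes = [
        match["criteria"]
        for config in cve.get("configurations", [])
        for node in config.get("nodes", [])
        for match in node.get("cpeMatch", [])
    ]
    return {
        "id": cve.get("id"),
        "published": cve.get("published"),
        "description": description,
        "score": cvss.get("baseScore"),
        "severity": cvss.get("baseSeverity"),
        "cpes": cpes,
    }


# Made-up record in the assumed response shape (not a real CVE).
sample_item = {
    "cve": {
        "id": "CVE-0000-0000",
        "published": "2024-01-01T00:00:00.000",
        "descriptions": [{"lang": "en", "value": "Toy buffer overflow."}],
        "metrics": {
            "cvssMetricV31": [{"cvssData": {"baseScore": 7.5, "baseSeverity": "HIGH"}}]
        },
        "configurations": [
            {"nodes": [{"cpeMatch": [{"criteria": "cpe:2.3:a:toy:app:1.0:*:*:*:*:*:*:*"}]}]}
        ],
    }
}

record = flatten_cve(sample_item)
print(record["id"], record["severity"], len(record["cpes"]))
```

Against the real API, the same function would be applied to each element of `fetch_page()["vulnerabilities"]`, paging with `startIndex` until all records are read.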

### 2. Data Preprocessing

- **Handle missing data**: Remove or impute missing values in descriptions, affected software, and CVSS scores.
- **Text preprocessing**: For vulnerability descriptions, tokenize, remove special characters and stop words, and apply lemmatization or stemming.
- **Keyword extraction**: Use TF-IDF or CountVectorizer (bag-of-words).
- **Categorical encoding** for affected software/hardware: Use one-hot or label encoding. If one-hot encoding produces too many columns, apply a dimensionality reduction technique such as PCA.
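
The text-cleaning step could look roughly like the sketch below. It uses only the standard library and a deliberately tiny stop-word set; `clean_description` and `STOP_WORDS` are hypothetical names, and lemmatization or stemming is left out here because it needs an NLP library.

```python
import re

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "and", "via", "is", "by", "on", "for"}


def clean_description(text):
    """Lowercase, strip special characters, tokenize, and drop stop words."""
    if not text:  # treat a missing description as an empty token list
        return []
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # special characters become spaces
    return [tok for tok in text.split() if tok not in STOP_WORDS]


tokens = clean_description(
    "Buffer overflow in the FTP server allows remote attackers to execute arbitrary code."
)
print(tokens)
```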

### 3. Exploratory Data Analysis (EDA)

- **Understand the target variable**: Plot the distribution of CVSS severity (Low, Medium, High, Critical). Check for class imbalance.
- **Analyze text data**:
  - Word cloud visualizations for different severity levels.
  - Bar plots for keyword frequency by severity category.
- **Analyze affected software/hardware**:
  - Histograms or bar charts for common affected software/hardware in each severity.
- **Correlation analysis**:
  - Correlation matrix to identify strong relationships between features and CVSS severity.
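
To check the target distribution before plotting, one option is to map base scores to severity labels and count them. The sketch below uses the CVSS v3.x qualitative rating bands (Low 0.1-3.9, Medium 4.0-6.9, High 7.0-8.9, Critical 9.0-10.0); the scores in `toy_scores` are made up, and both function names are hypothetical.

```python
from collections import Counter


def score_to_severity(score):
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if score == 0.0:
        return "NONE"
    if score < 4.0:
        return "LOW"
    if score < 7.0:
        return "MEDIUM"
    if score < 9.0:
        return "HIGH"
    return "CRITICAL"


def class_shares(scores):
    """Return the fraction of records that fall into each severity class."""
    counts = Counter(score_to_severity(s) for s in scores)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up scores, only to show the shape of the output.
toy_scores = [9.8, 7.5, 7.8, 5.3, 6.1, 8.8, 3.1, 9.1]
shares = class_shares(toy_scores)
print(shares)
```

A strongly uneven result here is the signal to plan for class weighting or resampling later on.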

### 4. Feature Engineering

- **Text-based features**:
  - Use TF-IDF or Word2Vec for numerical representations.
  - Create features based on keyword counts, description length, etc.
- **Categorical features**:
  - Count the number of affected products or types of products.
- **Date-based features** (if applicable): Create features like vulnerability age, seasonality, etc.
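
The hand-crafted features listed above can be sketched with plain Python. The keyword list, the feature names, and `handcrafted_features` itself are hypothetical choices for illustration; the example record is made up.

```python
from datetime import date

# Hypothetical keyword list; the real one would come out of the EDA step.
KEYWORDS = ("overflow", "injection", "remote", "privilege")


def handcrafted_features(description, cpes, published, today):
    """Build simple numeric features from one CVE record."""
    text = description.lower()
    features = {
        "desc_chars": len(description),        # description length in characters
        "desc_words": len(text.split()),       # description length in words
        "n_cpes": len(cpes),                   # number of affected products
        "age_days": (today - published).days,  # vulnerability age
    }
    for keyword in KEYWORDS:
        features[f"kw_{keyword}"] = text.count(keyword)  # keyword counts
    return features


example = handcrafted_features(
    description="SQL injection allows remote attackers to gain privilege.",
    cpes=["cpe:2.3:a:toy:app:1.0", "cpe:2.3:a:toy:app:1.1"],
    published=date(2024, 1, 1),
    today=date(2024, 1, 31),
)
print(example)
```

These columns could then sit next to the TF-IDF matrix, for example by stacking them horizontally before model training.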

### 5. Splitting the Data

- Split the data into **training** and **testing** sets (70/30 or 80/20 split).
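
Because severity classes are likely to be imbalanced, a stratified split is worth considering so that both sets keep the same class proportions. A minimal sketch with scikit-learn's `train_test_split`, on placeholder data:

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 20 toy descriptions, 5 per severity class.
labels = ["LOW", "MEDIUM", "HIGH", "CRITICAL"] * 5
texts = [f"toy description {i}" for i in range(len(labels))]

# 80/20 split; stratify keeps the class proportions equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

print(len(X_train), len(X_test))
print(sorted(y_test))
```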

### 6. Model Selection and Training

- Use classification models:
  - **Logistic Regression** (baseline model).
  - **Random Forest**.
  - **Gradient Boosting (XGBoost/LightGBM)**.
  - **Naive Bayes** for text-heavy features.
  - **Support Vector Machine (SVM)** with TF-IDF.
- Handle **class imbalance** using:
  - **Oversampling** (SMOTE) or **undersampling**.
  - Adjust class weights in models.
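
A baseline along these lines might chain TF-IDF with logistic regression and handle imbalance through class weights. This is a sketch on made-up toy descriptions, not the project's final model; SMOTE would need the separate imbalanced-learn package and is not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up, deliberately imbalanced toy data (6 HIGH vs 2 LOW).
train_texts = [
    "remote code execution via buffer overflow",
    "sql injection allows remote attackers to read data",
    "heap overflow leads to arbitrary code execution",
    "authentication bypass gives remote admin access",
    "command injection in web interface",
    "stack overflow allows remote code execution",
    "local user can view minor log information",
    "minor information disclosure to local users",
]
train_labels = ["HIGH"] * 6 + ["LOW"] * 2

# class_weight="balanced" reweights classes inversely to their frequency.
baseline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
baseline.fit(train_texts, train_labels)

predictions = baseline.predict(["remote code execution in web server"])
print(predictions)
```

The other candidate models can be swapped in as the last pipeline step, which keeps the vectorizer fitted on training data only.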

### 7. Model Evaluation

- Use metrics:
  - **Accuracy** (for balanced data).
  - **Precision, Recall, F1-score** (for imbalanced data).
  - **Confusion Matrix** to visualize performance.
- **Cross-validation**: Use k-fold cross-validation to estimate how well the model generalizes.
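
The metrics above can be computed with `sklearn.metrics`. The labels below are made up to keep the example small; rows of the confusion matrix are true classes and columns are predicted classes, in the order given by `labels`.

```python
from sklearn.metrics import classification_report, confusion_matrix

labels = ["LOW", "HIGH"]
# Made-up true and predicted labels for five records.
y_true = ["LOW", "LOW", "HIGH", "HIGH", "HIGH"]
y_pred = ["LOW", "HIGH", "HIGH", "HIGH", "LOW"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
report = classification_report(y_true, y_pred, labels=labels, zero_division=0)

print(cm)      # [[1 1]
               #  [1 2]]
print(report)  # per-class precision, recall, and F1
```

For k-fold estimates, `cross_val_score` from `sklearn.model_selection` with a scoring string such as `"f1_macro"` is one way to get a per-fold score for a pipeline.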

### 8. Model Tuning

- **Hyperparameter tuning** with:
  - **GridSearchCV** or **RandomizedSearchCV**.
  - Tune parameters like number of estimators, tree depth, learning rate.
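
A grid search over a text pipeline could be set up as in the sketch below. The toy data, step names (`tfidf`, `clf`), and grid values are illustrative assumptions; real grids would be wider and use more folds.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy data, 4 records per class so that 2-fold CV can stratify.
texts = [
    "remote code execution via buffer overflow",
    "sql injection allows remote attackers to read data",
    "heap overflow leads to arbitrary code execution",
    "command injection in web interface",
    "local user can view minor log information",
    "minor information disclosure to local users",
    "local denial of service with low impact",
    "verbose error message reveals minor details",
]
labels = ["HIGH"] * 4 + ["LOW"] * 4

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "clf__n_estimators": [25, 50],
    "clf__max_depth": [3, None],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)
print(search.best_params_)
```

`RandomizedSearchCV` takes the same pipeline but samples a fixed number of combinations, which helps when the grid gets large.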

### 9. Model Interpretation

- **Feature importance**: Visualize feature importance for tree-based models.
- **SHAP values or LIME** for interpreting complex models.
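
For tree-based models, impurity-based importances can be paired with the TF-IDF vocabulary to see which terms drive predictions. This sketch uses made-up toy data and assumes scikit-learn 1.0 or later for `get_feature_names_out`; SHAP and LIME live in separate packages and are not shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy data.
texts = [
    "remote code execution via buffer overflow",
    "sql injection allows remote attackers to read data",
    "heap overflow leads to arbitrary code execution",
    "command injection in web interface",
    "local user can view minor log information",
    "minor information disclosure to local users",
    "local denial of service with low impact",
    "verbose error message reveals minor details",
]
labels = ["HIGH"] * 4 + ["LOW"] * 4

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)

# Pair each vocabulary term with its importance and keep the top five.
names = vectorizer.get_feature_names_out()
importances = forest.feature_importances_
order = np.argsort(importances)[::-1]
top_terms = [(names[i], float(importances[i])) for i in order[:5]]

for term, weight in top_terms:
    print(f"{term}: {weight:.3f}")
```

The same `top_terms` list can feed a horizontal bar chart for the notebook.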

### 10. Model Deployment and Conclusions

- Discuss which features (descriptions, keywords, software) were most predictive.
- Identify limitations (e.g., class imbalance, noisy text data).
- Suggest improvements with more data or refined feature engineering.


```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from matplotlib.colors import Normalize
from sklearn.tree import DecisionTreeClassifier
from sklearn.base import clone
from sklearn import tree
from sklearn.model_selection import train_test_split
```