# 📘 Notebook: 03_modeling_and_evaluation.ipynb
_**Part of the Fragma IPython Notebook Project Series**_

*Focused on developing and evaluating machine learning models for fragment detection.*

---

## 🧭 Table of Contents

1. [📘 Overview & Navigation](#overview)
2. [🧠 Context & Purpose](#context)
3. [🧩 Main Components](#components)
4. [🧭 Notebook Structure](#notebooks)
5. [📦 Dependencies](#dependencies)
6. [🛠️ Config & Setup](#setup)
7. [📊 Model Pipeline Configuration](#pipeline)
8. [📈 Evaluation Strategy](#results)
9. [📚 Resources](#resources)
10. [👥 Contributors](#team)

> **Quick Links:** [🏠 Home](#overview) | [🔄 Status](#notebooks) | [📚 Docs](#resources)

---

## 🧪 Overview & Navigation

This notebook is the final stage of the fragment detection pipeline and covers model development and evaluation.
It implements **a machine learning pipeline that combines TF-IDF vectorization, Random Forest classification, and hyperparameter tuning with Particle Swarm Optimization (PSO).**

---

## 🧠 Context & Purpose

**🎯 Purpose:**  
Develop and evaluate a machine learning model that detects sentence fragments by combining text-based (TF-IDF) features with structural (linguistic) features.

**🎯 Objectives:**  
- Develop an effective text vectorization strategy
- Build a hybrid model combining text and structural features
- Tune the TF-IDF and Random Forest hyperparameters using PSO
- Evaluate and analyze model performance
- Visualize feature importance and results

**📘 Context:**  
This notebook builds on the preprocessed data from notebook 02. It uses both the cleaned text and the extracted linguistic features as inputs to the fragment detection model.

## 🧩 Main Components

### `TextVectorizer`
> TF-IDF based text vectorization with optimized parameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
# vectorizer.fit_transform(texts) returns a sparse matrix of TF-IDF features
```

### `ModelPipeline`
> Comprehensive pipeline combining text and structural features.

```python
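from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# text_col must be a single column name (a string, not a list) so that
# TfidfVectorizer receives 1-D text; structured_cols is a list of column names.
pipeline = Pipeline([
    ('features', ColumnTransformer([
        ('text', TfidfVectorizer(), text_col),
        ('struct', 'passthrough', structured_cols)
    ])),
    ('classifier', RandomForestClassifier())
])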
```
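
To show how the two feature branches fit together, here is a minimal end-to-end sketch on a toy DataFrame. The column names `Processed Text` and `is_fragment` follow the input schema described in this document; the two structural column names (`has_auxiliary`, `starts_with_conjunction`) and all data values are hypothetical stand-ins for the real linguistic markers.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy data; structural column names and values are made up for illustration.
df = pd.DataFrame({
    "Processed Text": [
        "because the rain stopped",
        "the rain stopped and we left",
        "running down the street",
        "she was running down the street",
        "although he tried",
        "he tried although it was hard",
    ],
    "has_auxiliary": [0, 0, 0, 1, 0, 1],
    "starts_with_conjunction": [1, 0, 0, 0, 1, 0],
    "is_fragment": [1, 0, 1, 0, 1, 0],
})

text_col = "Processed Text"  # a string, so TfidfVectorizer gets 1-D input
structured_cols = ["has_auxiliary", "starts_with_conjunction"]

pipeline = Pipeline([
    ("features", ColumnTransformer([
        ("text", TfidfVectorizer(), text_col),
        ("struct", "passthrough", structured_cols),
    ])),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])

X = df[[text_col] + structured_cols]
y = df["is_fragment"]
pipeline.fit(X, y)
predictions = pipeline.predict(X)  # one 0/1 label per row
```

The `ColumnTransformer` concatenates the TF-IDF columns and the passed-through structural columns into a single feature matrix, so the classifier sees both kinds of evidence at once.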

### `PSOOptimizer`
> Particle Swarm Optimization for hyperparameter tuning.

```python
best_params = optimize_hyperparameters(pipeline, param_space)
# Returns the best hyperparameters found, scored by cross-validation
```
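
The notebook's own `optimize_hyperparameters` is not reproduced here. As a sketch of the underlying idea only, the following is a small global-best PSO loop written in NumPy. The function `pso_minimize`, its coefficients (`w`, `c1`, `c2`), and the toy objective are all assumptions for illustration; in a tuning setting the objective would plausibly be something like one minus the mean cross-validated score.

```python
import numpy as np

def pso_minimize(objective, lower, upper, n_particles=30, iterations=50,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO over the box [lower, upper]; returns (position, cost)."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size

    pos = rng.uniform(lower, upper, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest_pos = pos.copy()                                # each particle's best position
    pbest_cost = np.array([objective(p) for p in pos])
    g = pbest_cost.argmin()
    gbest_pos, gbest_cost = pbest_pos[g].copy(), pbest_cost[g]  # swarm-wide best

    for _ in range(iterations):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Inertia + pull toward personal best + pull toward global best.
        vel = w * vel + c1 * r1 * (pbest_pos - pos) + c2 * r2 * (gbest_pos - pos)
        pos = np.clip(pos + vel, lower, upper)
        cost = np.array([objective(p) for p in pos])

        improved = cost < pbest_cost
        pbest_pos[improved] = pos[improved]
        pbest_cost[improved] = cost[improved]
        g = pbest_cost.argmin()
        if pbest_cost[g] < gbest_cost:
            gbest_pos, gbest_cost = pbest_pos[g].copy(), pbest_cost[g]
    return gbest_pos, gbest_cost

# Toy objective standing in for a cross-validation loss; minimum at (0.3, 0.3).
best_pos, best_cost = pso_minimize(lambda p: float(np.sum((p - 0.3) ** 2)),
                                   lower=[0, 0], upper=[1, 1])
```

Each objective call in a real run would train and cross-validate a model, so the particle count and iteration count directly set the compute budget.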

## 🔍 Model Features

### 1. Text Features (TF-IDF)
- **Word-Level Features**: Unigram and bigram frequencies
- **Vectorization Parameters**: `max_features`, `ngram_range`, `min_df`, and `max_df`, tuned during optimization
- **Preprocessing**: Uses the cleaned text produced by notebook 02

### 2. Structural Features
- **Linguistic Markers**: Auxiliary verbs, punctuation, conjunctions
- **Grammatical Features**: Past verbs, gerunds, adverbs
- **Syntactic Elements**: Sentence starters, capitalization

### 3. Model Architecture
- **Base Classifier**: Random Forest with optimized parameters
- **Feature Fusion**: Combined text and structural features
- **Hyperparameter Space**: Candidate values listed under Model Pipeline Configuration

### 4. Optimization
- **PSO Algorithm**: Particle swarm optimization for tuning
- **Cross-Validation**: 5-fold stratified cross-validation during optimization
- **Performance Metrics**: Accuracy, F1-score, precision, recall

### 5. Visualization
- **Feature Importance**: Random forest feature rankings
- **Performance Curves**: ROC, precision-recall curves
- **Error Analysis**: Confusion matrix and misclassification analysis

## 📦 Dependencies

```bash
pandas        # Data manipulation
numpy         # Numerical operations
scikit-learn  # Machine learning tools
pyswarms      # PSO implementation
matplotlib    # Visualization
seaborn       # Enhanced visualization
joblib        # Model persistence
```

## 🧭 Notebook Structure

| 🔢 Order | 📓 Notebook | 📝 Description |
|---------:|------------|----------------|
| 0 | [00-Fragma-Overview.ipynb](https://colab.research.google.com/drive/1oUmSqBuPqv2gObJjhezXaa6xBxe_Tl4g?usp=sharing) | Project overview and setup |
| 1 | [01-Fragment-DS-Generator.ipynb](https://colab.research.google.com/drive/1aAVCptdYyRHmytnY7O__anYKpZh5Hl-w?usp=sharing) | Dataset generation |
| 2 | [02-Data-Preprocessing.ipynb](https://colab.research.google.com/drive/1QbVTz71jGvVvr2rXwJk9RKGKS4r1ntCC?usp=sharing) | Text preprocessing |
| 3 | [03-Model-Development.ipynb](https://colab.research.google.com/drive/1CDwjXuqBj1LBdXXvvFymNh6etWNpUnth?usp=sharing) | Model training and evaluation (Current) |

> ⏮ **Previous:** [02-Data-Preprocessing.ipynb](https://colab.research.google.com/drive/1QbVTz71jGvVvr2rXwJk9RKGKS4r1ntCC?usp=sharing)

## 📥 Inputs & Outputs

**📥 Inputs:**
- `processed_dataset.csv`: Preprocessed dataset with text and linguistic features
  - Text column: 'Processed Text'
  - Target column: 'is_fragment'
  - Structural features: 17 binary linguistic markers

**📤 Outputs:**
- Trained model with:
  - Optimized TF-IDF vectorizer
  - Tuned Random Forest classifier
  - Feature importance rankings
  - Performance metrics and visualizations
  - Detailed error analysis
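
Since `joblib` is listed for model persistence, saving and restoring a fitted pipeline might look roughly like the sketch below. The file name `fragment_model.joblib`, the simplified text-only pipeline, and the toy data are assumptions for this example.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical).
texts = ["because the rain stopped", "the rain stopped and we left",
         "running down the street", "she was running down the street"]
labels = [1, 0, 1, 0]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
]).fit(texts, labels)

# Persist the whole pipeline so the vectorizer vocabulary travels with the model.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fragment_model.joblib")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

same_predictions = list(restored.predict(texts)) == list(model.predict(texts))
```

Saving the full pipeline rather than the classifier alone avoids a mismatch between the TF-IDF vocabulary used at training time and the one used at prediction time.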

## 📊 Model Pipeline Configuration

The model pipeline is configured through several key components:

### Text Vectorization Config
```python
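TFIDF_CONFIG = {
    "max_features": [1000, 2000, 3000],  # Vocabulary size cap (most frequent terms kept)
    "ngram_range": [(1, 1), (1, 2)],     # Unigrams only, or unigrams plus bigrams
    "min_df": [2, 5],                    # Minimum document count for a term (integers are absolute counts)
    "max_df": [0.9, 0.95]                # Maximum document proportion for a term (floats are fractions)
}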
```

### Random Forest Config
```python
RF_CONFIG = {
    "n_estimators": [100, 200, 300],     # Number of trees
    "max_depth": [10, 20, 30, None],     # Tree depth
    "min_samples_split": [2, 5, 10],     # Minimum samples for split
    "min_samples_leaf": [1, 2, 4]        # Minimum samples in leaf
}
```

### PSO Configuration
```python
PSO_CONFIG = {
    "n_particles": 30,  # Number of particles in the swarm
    "dimensions": 8,    # Parameters to optimize (4 TF-IDF + 4 Random Forest)
    "iterations": 50,   # Maximum iterations
    "cv_folds": 5       # Cross-validation folds
}
```
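
The candidate lists above define 3 × 2 × 2 × 2 = 24 TF-IDF settings and 3 × 4 × 3 × 3 = 108 Random Forest settings, or 2,592 combinations in total. PSO works on continuous positions, so each of the 8 dimensions has to be mapped back to one candidate value. The sketch below shows one possible mapping; the `decode` helper, the `[0, 1]` position range, and the `SEARCH_SPACE` ordering are assumptions, not the notebook's actual encoding. The parameter keys follow scikit-learn's `step__param` convention for the pipeline steps named `features`, `text`, and `classifier`.

```python
from math import prod

TFIDF_CONFIG = {
    "max_features": [1000, 2000, 3000],
    "ngram_range": [(1, 1), (1, 2)],
    "min_df": [2, 5],
    "max_df": [0.9, 0.95],
}
RF_CONFIG = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 20, 30, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Fixed ordering of the 8 search dimensions: 4 TF-IDF, then 4 Random Forest.
SEARCH_SPACE = (
    [("features__text__" + name, values) for name, values in TFIDF_CONFIG.items()]
    + [("classifier__" + name, values) for name, values in RF_CONFIG.items()]
)

def decode(position):
    """Map a particle position in [0, 1]^8 to one candidate value per parameter."""
    params = {}
    for x, (name, choices) in zip(position, SEARCH_SPACE):
        index = min(int(x * len(choices)), len(choices) - 1)  # keeps x == 1.0 in range
        params[name] = choices[index]
    return params

grid_size = prod(len(choices) for _, choices in SEARCH_SPACE)
print(grid_size)            # 2592
print(decode([0.5] * 8))    # one candidate configuration
```

A decoded dictionary can be passed to `pipeline.set_params(**params)` before cross-validating that candidate.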

## 📈 Evaluation Strategy

### Data Splitting
- Train set (70%): Model training and PSO optimization
- Validation set (15%): Parameter selection and early stopping
- Test set (15%): Final evaluation only
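
One way to obtain the 70/15/15 proportions above is two successive stratified calls to `train_test_split`. The placeholder arrays and the `random_state` value below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(-1, 1)   # placeholder features
y = np.array([0] * 120 + [1] * 80)  # placeholder labels, 40% positive

# Step 1: hold out 30% of the data, preserving the class ratio.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Step 2: split the hold-out set in half to get validation and test (15% each).
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 140 30 30
```

Stratifying both calls keeps the fragment ratio roughly equal across the three sets, which matters when the classes are imbalanced.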

### Performance Metrics
1. **Primary Metrics**
   - Accuracy: Overall correctness
   - F1-Score: Harmonic mean of precision and recall
   - ROC-AUC: Discrimination ability

2. **Secondary Metrics**
   - Precision: Positive predictive value
   - Recall: True positive rate
   - Confusion Matrix: Detailed error analysis

3. **Cross-Validation**
   - 5-fold stratified CV during optimization
   - Mean and standard deviation of metrics
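
The metrics listed above map directly onto scikit-learn functions. The sketch below computes them on a small hand-made set of predictions and then reports the mean and standard deviation from 5-fold stratified cross-validation on synthetic data. All data, the classifier settings, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import StratifiedKFold, cross_validate

# Hand-made predictions: 3 true positives, 1 false positive, 1 false negative.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1])
y_score = np.array([0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.3, 0.7])  # P(fragment)

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_score),  # uses scores, not hard labels
}
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

# 5-fold stratified cross-validation on synthetic data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(RandomForestClassifier(n_estimators=50, random_state=0),
                        X, y, cv=cv, scoring=["accuracy", "f1", "roc_auc"])
summary = {name: (scores[f"test_{name}"].mean(), scores[f"test_{name}"].std())
           for name in ["accuracy", "f1", "roc_auc"]}
```

Reporting the standard deviation alongside the mean shows how much each score varies from fold to fold.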

### Feature Analysis
- Random Forest feature importance rankings
- TF-IDF term importance analysis
- Feature correlation study

### Error Analysis
- Misclassification analysis by feature type
- Length-based error patterns
- Linguistic feature impact study
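
A length-based error breakdown can be sketched with pandas by binning sentences by token count and computing the error rate per bin. The example sentences, labels, predictions, and bin edges below are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical test-set results: text, true label, predicted label.
results = pd.DataFrame({
    "text": [
        "because the rain",
        "running fast",
        "the rain stopped and we left",
        "she was running down the street",
        "although he tried",
        "he went home",
        "we stayed inside because the rain never stopped",
        "walking through the park in the early morning",
    ],
    "y_true": [1, 1, 0, 0, 1, 0, 0, 1],
    "y_pred": [1, 0, 0, 0, 1, 1, 0, 0],
})

results["n_tokens"] = results["text"].str.split().str.len()
results["error"] = results["y_true"] != results["y_pred"]

# Bin by sentence length: 1-3, 4-6, and 7 or more tokens.
results["length_bin"] = pd.cut(results["n_tokens"], bins=[0, 3, 6, 100],
                               labels=["1-3", "4-6", "7+"])

# Error rate and sample count per length bin.
error_by_length = (results.groupby("length_bin", observed=False)["error"]
                   .agg(["mean", "count"]))
print(error_by_length)
```

Keeping the sample count next to the error rate helps avoid over-reading a high rate that comes from only a handful of sentences.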

## 👥 Contributors

| 👤 Name | 🧑‍💻 Role | 📬 GitHub | 🔗 LinkedIn |
|---------|----------|-----------|------------|
| Amr Muhamed | Maintainer | [alaamer12](https://github.com/alaamer12) | [alaamer12](https://linkedin.com/in/alaamer12) |

© 2025 Amr Muhamed. All Rights Reserved.

*Last updated: May 13, 2025*