# 🧪 Fragma: Overview & Navigation
*Your guide to understanding and navigating this fragment detection project.*

---


## 🧭 Table of Contents

- 📘 [Project Introduction](#project-introduction)
- 🗂️ [Project Structure](#project-structure)
- 📓 [Notebook Structure](#notebook-structure)
- 🧰 [Dependencies & Setup](#dependencies--setup)
- 🚀 [Getting Started](#getting-started)
- 📊 [Project Status](#project-status)
- 📚 [Resources](#resources)
- 👥 [Contributors](#contributors)
- 📝 [Documentation](#documentation)


## 🧠 Project Introduction

**🔍 Overview:**  
Fragma is a model that detects and classifies sentence fragments so that autocomplete systems can provide more contextually relevant suggestions.

**🎯 Goals:**  
- Create a robust fragment detection model for autocomplete optimization
- Build a comprehensive dataset of labeled sentence fragments
- Develop intelligent text preprocessing and feature extraction pipelines
- Enable real-time fragment detection in user input

**📘 Context:**  
Modern autocomplete systems often struggle with partial or incomplete sentences. Fragma addresses this by learning to identify and classify text fragments, enabling more accurate and context-aware suggestions.

**🗓 Metadata:**  
- **Created:** 2025-05-12  
- **Last Updated:** 2025-05-12  
- **Status:** Model Development Phase
- **Version:** 1.0.0  

## 🗂️ Project Structure

### Core Components:

1. **Fragment Detector Dataset Creator** (`fd_dataset_creator_script.py`)
   - Processes raw conversational data
   - Applies intelligent splitting rules
   - Balances dataset for model training

2. **Text Preprocessing Pipeline** (`preprocessor.py`)
   - Unicode normalization and cleaning
   - HTML entity and whitespace handling
   - Emoji/emoticon removal
   - Advanced tokenization

3. **Linguistic Features** (`fd_linguistic_features.py`)
   - Word lists and patterns
   - Feature descriptions
   - Regex patterns for analysis

4. **Dataset Expander** (`fd_ds_expander.py`)
   - Feature extraction
   - Dataset augmentation
   - Linguistic analysis
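
The regex-based feature idea behind the linguistic features and dataset expander components can be pictured with a small sketch. The pattern names, the `extract_features` helper, and the suffix heuristics below are assumptions for illustration only; the actual word lists and patterns live in `fd_linguistic_features.py` and may differ.

```python
import re
from typing import Dict

# Hypothetical suffix heuristics (not the project's real patterns).
# They are deliberately naive: "family" counts as an adverb, "bed" as past tense.
FEATURE_PATTERNS = {
    "adverb_ly": re.compile(r"\b\w+ly\b", re.IGNORECASE),
    "past_tense_ed": re.compile(r"\b\w+ed\b", re.IGNORECASE),
    "gerund_ing": re.compile(r"\b\w+ing\b", re.IGNORECASE),
}


def extract_features(text: str) -> Dict[str, int]:
    """Count pattern matches and add two simple surface features."""
    features = {
        name: len(pattern.findall(text))
        for name, pattern in FEATURE_PATTERNS.items()
    }
    # Whitespace-separated word count.
    features["word_count"] = len(text.split())
    # 1 if the text ends with sentence-final punctuation, else 0.
    features["ends_with_punctuation"] = int(
        text.rstrip().endswith((".", "!", "?"))
    )
    return features


if __name__ == "__main__":
    print(extract_features("because I was running late"))
```

Counts like these could be attached to each row as extra columns before model training; suffix matching is only a rough proxy for part-of-speech information.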

## 📓 Notebook Structure

Below is the execution order and description of each notebook in this project:

| 🔢 Order | 📓 Notebook | 📝 Description |
|----------|------------|----------------|
| 0 | [00-Fragma-Overview.ipynb](https://colab.research.google.com/drive/1Mrnk4I4nD-aty1lEdS08wyGA1ywgRHV5?usp=sharing) | Project overview and setup |
| 1 | [01-Data-Loading.ipynb](https://colab.research.google.com/drive/1NZeZYBdgr6QsVLg8je0F2D0DHuYuuxsd?usp=drive_link) | Dataset loading and initial exploration |
| 2 | [02-Fragment-Detection.ipynb](https://colab.research.google.com/drive/1PL9wJr-zn8dFTuU5y8HuMzY4fTedZKtL?usp=sharing) | Fragment detection and preprocessing |
| 3 | [03-Random-Forest-Regressor.ipynb](https://colab.research.google.com/drive/196Spb8P56B8fwFdnkp8Kmiq9uitzLrH1?usp=sharing) | Random Forest model training and evaluation |
| 4 | [04-LSTM-Model.ipynb](https://colab.research.google.com/drive/1Fg7Tw1Xj3O9NuwCdn_oztEG6Q4r7RWo4?usp=sharing) | Deep learning model using LSTM architecture |
| 5 | [05-GUI.ipynb](https://colab.research.google.com/drive/1eoX1dSyDxFyK7GxWFUrwyxls0PIbFNdk?usp=sharing) | Graphical interface to test both ML and DL models |

> 🌐 **Note:** All notebooks are accessible via Google Colab links above.


## 📦 Dependencies & Setup

**💡 Required Packages:**
```text
pandas         # Data manipulation
tqdm           # Progress bars
nltk           # NLP tools
ftfy           # Unicode fixing
emoji          # Emoji handling
textblob       # Text processing
contractions   # Contraction expansion
colab_print    # Pretty printing
```

**🔧 Installation:**
```bash
pip install pandas tqdm nltk ftfy emoji textblob contractions
```

**📎 NLTK Data:**
```python
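import nltk
nltk.download('punkt')      # Sentence/word tokenizer models
nltk.download('stopwords')  # Stop word lists
# Newer NLTK releases may also need: nltk.download('punkt_tab')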
```


### 🔧 Custom Utilities

This project uses the custom `colab_print` utility for consistent output formatting:

**Installation:**
```python
!pip install colab-print
```

**Key Functions:**

| Function     | Purpose                                          |
|--------------|--------------------------------------------------|
| `header()`   | Print section headers with consistent formatting |
| `title()`    | Highlight subsection titles                      |
| `table()`    | Pretty-print tabular data                        |
| `info()`     | Display informational messages                   |
| `success()`  | Show success messages                            |
| `warning()`  | Display warning messages                         |
| `error()`    | Show error messages                              |


> 📘 **Note:** All notebooks use these formatting functions for consistent output.

## 🚀 Getting Started

### Dataset Creation

```python
from fd_dataset_creator_script import process_dataset

# Process and create dataset
process_dataset(
    input_file='raw_data.csv',
    output_file='processed_data.csv',
    balance_strategy='expand'  # or 'reduce'
)
```
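
To make the two `balance_strategy` values more concrete, here is a minimal sketch of class balancing. It assumes that `'expand'` means oversampling smaller classes up to the largest one and `'reduce'` means undersampling larger classes down to the smallest one. The `balance_rows` helper and the `label` key are hypothetical, and `process_dataset` may balance the data differently (for example, by generating new fragments rather than duplicating rows).

```python
import random
from collections import defaultdict


def balance_rows(rows, label_key="label", strategy="expand", seed=0):
    """Return a class-balanced copy of rows (a list of dicts).

    Hypothetical helper, not part of fd_dataset_creator_script.
    """
    if strategy not in ("expand", "reduce"):
        raise ValueError("strategy must be 'expand' or 'reduce'")
    if not rows:
        return []

    rng = random.Random(seed)

    # Group rows by their class label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    sizes = [len(group) for group in by_label.values()]
    target = max(sizes) if strategy == "expand" else min(sizes)

    balanced = []
    for group in by_label.values():
        if len(group) >= target:
            # Undersample (or keep as-is when already at the target size).
            balanced.extend(rng.sample(group, target))
        else:
            # Oversample by drawing duplicates with replacement.
            balanced.extend(group)
            balanced.extend(rng.choices(group, k=target - len(group)))

    rng.shuffle(balanced)
    return balanced
```

Oversampling keeps every original row at the cost of duplicates, while undersampling avoids duplicates but discards data; which trade-off suits Fragma depends on the dataset size.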

### Text Preprocessing

```python
from preprocessor import preprocess_df
import pandas as pd

# Load your data
df = pd.read_csv('your_data.csv')

# Preprocess the text
processed_df, metrics_overall, metrics_instance = preprocess_df(df)
```
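
For intuition about what such a pipeline does to a single string, the sketch below walks through the steps listed for the preprocessing component (HTML entities, Unicode normalization, emoji removal, whitespace handling, tokenization) using only the standard library. The `clean_text` and `tokenize` helpers, the emoji ranges, and the token pattern are assumptions for illustration; the project lists `ftfy`, `emoji`, and `nltk` as dependencies, so `preprocessor.py` likely relies on those instead.

```python
import html
import re
import unicodedata

# Rough emoji code point ranges (an assumption, not exhaustive).
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF"
    "\U0001F1E6-\U0001F1FF\uFE0F\u200D]"
)
WHITESPACE_RE = re.compile(r"\s+")
# Words (optionally with an apostrophe part) or single punctuation marks.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def clean_text(text: str) -> str:
    """Apply the cleaning steps in a fixed order."""
    text = html.unescape(text)                  # "&amp;" -> "&"
    text = unicodedata.normalize("NFKC", text)  # unify compatibility forms
    text = EMOJI_RE.sub(" ", text)              # drop emoji characters
    return WHITESPACE_RE.sub(" ", text).strip() # collapse whitespace


def tokenize(text: str) -> list:
    """Split cleaned text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    cleaned = clean_text("Don&#39;t   stop\u00a0now \U0001F389")
    print(cleaned, tokenize(cleaned))
```

Unescaping HTML entities before normalization means characters hidden inside entities (such as `&nbsp;`) also get normalized and collapsed by the later steps.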

## 📊 Project Status

### ✅ Completed
| Phase | Component | Status |
|-------|-----------|---------|
| Data Preparation | Dataset Creation | ✓ Complete |
| Data Preparation | Dataset Expansion | ✓ Complete |
| Data Preparation | Preprocessing Pipeline | ✓ Complete |

### 🔄 In Progress
| Phase | Component | Status |
|-------|-----------|---------|
| Model Development | Feature Extraction | 🚧 In Progress |
| Model Development | Dataset Splitting | 📋 Planned |
| Model Development | Model Selection | 📋 Planned |

### ⏳ Upcoming
| Phase | Component | Status |
|-------|-----------|---------|
| Evaluation | Model Evaluation | ⏳ Planned |
| Evaluation | Error Analysis | ⏳ Planned |
| Deployment | Model Packaging | ⏳ Planned |

See [STEPS.md](STEPS.md) for detailed progress tracking.

## 📚 Resources

### Dataset

The project uses the [Netflix & Facebook Posts Dataset](https://www.kaggle.com/datasets/tomthescientist/netflix-facebook-posts-as-sentences-for-llm-input) from Kaggle:

- **Provider:** Tom the Scientist
- **Platform:** Kaggle
- **Content:** Collection of Netflix and Facebook posts
- **Purpose:** Training data for fragment detection

### External Links

- 📊 [Dataset Link](https://www.kaggle.com/datasets/tomthescientist/netflix-facebook-posts-as-sentences-for-llm-input)
- 📓 [Project Documentation](https://github.com/alaamer12/Fragma)
- 🧪 [Colab Notebooks (Google Drive folder)](https://drive.google.com/drive/folders/14elEhg-Kb9UXUtaUIvJrEbpFroEq_mUB?usp=sharing)

## 👥 Contributors

| 👤 Name | 🧑‍💻 Role | 📬 GitHub | 🔗 LinkedIn |
|---------|----------|-----------|------------|
| Amr Muhamed | Maintainer | [alaamer12](https://github.com/alaamer12) | [alaamer12](https://www.linkedin.com/in/amr-muhamed-0b0709265) |
| Muhamed Ibrahim | Data Engineer | [muhammad-senna](https://github.com/muhammad-senna) | [muhammad-senna](https://linkedin.com/in/muhammad-senna) |

© 2025 Amr Muhamed. All Rights Reserved.

*Last updated: May 13, 2025*

## 📝 Documentation

### Key Files:

- [**README.md**](README.md): Project overview and setup instructions
- [**FD.md**](FD.md): Detailed fragment detection documentation
- [**STEPS.md**](STEPS.md): Project roadmap and progress tracking

### 📋 Key Features

- Intelligent sentence splitting based on linguistic patterns
- Comprehensive text preprocessing pipeline
- Pattern-based feature detection (adverbs, past tense, gerunds)
- Dataset balancing strategies (reduction/expansion)
- Detailed preprocessing metrics and tracking
