# 📘 Notebook: 01_fragment_ds_generator.ipynb
_**Part of the Fragma IPython Notebook Project Series**_

*Focused on dataset generation and preparation for fragment detection.*

---

## 🧭 Table of Contents

1. [📘 Overview & Navigation](#overview)
2. [🧠 Context & Purpose](#context)
3. [🧩 Main Components](#components)
4. [🧭 Notebook Structure](#notebooks)
5. [📦 Dependencies](#dependencies)
6. [🛠️ Config & Setup](#setup)
7. [📥 Inputs & Outputs](#io)
8. [📚 Resources](#resources)
9. [👥 Contributors](#team)

> **Quick Links:** [🏠 Home](#overview) | [🔄 Status](#notebooks) | [📚 Docs](#resources)

---

## 🧪 Overview & Navigation

This notebook is part of a multi-step data pipeline project that creates and prepares datasets for fragment detection.
It focuses on: **Building a high-quality dataset for training fragment detection models using intelligent sentence splitting and linguistic analysis.**

---

## 🧠 Context & Purpose

**🔍 Purpose:**  
To create a balanced, well-structured dataset for training models to detect sentence fragments in autocomplete systems.

**🎯 Objectives:**  
- Generate a labeled dataset of sentence fragments and complete sentences
- Apply intelligent splitting rules based on linguistic patterns
- Balance dataset using expansion or reduction strategies
- Extract and analyze linguistic features for each fragment

**📘 Context:**  
This notebook is the first step in the fragment detection pipeline. It preprocesses raw text data from social media posts and applies linguistic rules to create training data for the fragment detection model.

---

## 🧩 Main Components

### `SmartExpander`
> Intelligent sentence splitting based on linguistic patterns and rules.
```python
expander = SmartExpander()
fragments = expander.split_sentence("The graphics are amazing but the story needs work.")
# Returns: ["The graphics are.", "amazing but the story needs work."]
# "amazing" is in COMMON_EXPRESSIONS, the highest-priority keyword list, so the split lands there.
```

### `FeatureExtractor`
> Extracts linguistic features from text fragments for analysis.
```python
extractor = FeatureExtractor()
features = extractor.extract_features("Clearly, this is interesting.") # Returns: {'has_auxiliary': True, 'has_comma': True, ...}
```

### `process_dataset`
> Core function for dataset creation and balancing.
```python
process_dataset(input_file='raw_data.csv',
               output_file='fragments.csv',
               balance_strategy='expand')
```

## 🧭 Notebook Series Structure {#notebooks}

| 🔢 Order | 📓 Notebook | 📝 Description |
|---------:|------------|----------------|
| 0 | [00-Fragma-Overview.ipynb](https://colab.research.google.com/drive/1oUmSqBuPqv2gObJjhezXaa6xBxe_Tl4g?usp=sharing) | Project overview and setup |
| 1 | [01-Fragment-DS-Generator.ipynb](https://colab.research.google.com/drive/1aAVCptdYyRHmytnY7O__anYKpZh5Hl-w?usp=sharing) | Dataset generation (Current) |
| 2 | [02-Fragment-Detection.ipynb](https://colab.research.google.com/drive/1QbVTz71jGvVvr2rXwJk9RKGKS4r1ntCC?usp=sharing) | Fragment detection and preprocessing |
| 3 | [03-Model-Development.ipynb](https://colab.research.google.com/drive/1CDwjXuqBj1LBdXXvvFymNh6etWNpUnth?usp=sharing) | Model training and evaluation |

> ⏮ **Previous:** [00-Fragma-Overview.ipynb](https://colab.research.google.com/drive/1oUmSqBuPqv2gObJjhezXaa6xBxe_Tl4g?usp=sharing)

> ⏭ **Next:** [02-Fragment-Detection.ipynb](https://colab.research.google.com/drive/1QbVTz71jGvVvr2rXwJk9RKGKS4r1ntCC?usp=sharing)

## 📦 Dependencies

```bash
pandas        # Data manipulation
tqdm          # Progress bars
nltk          # NLP tools
ftfy          # Unicode fixing
emoji         # Emoji handling
textblob      # Text processing
contractions  # Contraction expansion
kagglehub     # Dataset download
colab_print   # Pretty printing
```

### Installation and Import

In [None]:
!pip install kagglehub pandas numpy tqdm matplotlib seaborn nltk scikit-learn ftfy contractions emoji textblob
!pip install -U colab_print  # Always use the latest version


In [None]:
import argparse
import html
import os
import random
import re
import shutil
import subprocess
import unicodedata
from typing import Any, Dict, List, Literal, Optional, Tuple, Union

import emoji
import ftfy
import nltk
import pandas as pd
from colab_print import Printer, header, list_, table, dfd, info, error, success, warning, title
from contractions import fix
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import TextBlob
from tqdm.auto import tqdm

## 🛠️ Config & Setup
- Helper and utility functions
- Paths, constants, and related settings
```python
URL = "tomthescientist/netflix-facebook-posts-as-sentences-for-llm-input"
DSPATH = os.path.join(download_ds(URL), "netflix_fb_sentences.csv")
FDPATH = 'dataset_fragments.csv'
EXPANDPATH = 'expanded_dataset.csv'

```

In [None]:
def download_ds(url):
  import kagglehub

  # Download latest version
  path = kagglehub.dataset_download(url)

  print("Path to dataset files:", path)
  return path

def mount_drive():
  """Mount Google Drive at /content/drive (Colab only)."""
  from google.colab import drive
  drive.mount('/content/drive')

def move_to_nlp_drive(file, path):
  if not os.path.exists(path):
    raise ValueError("Path does not exist")

  shutil.move(file, path)

def count_fragments(df):
    count_t = df['is_fragment'].sum() # True Count
    count_f = len(df) - count_t # False Count
    return count_t, count_f

def get_df(path, **kwargs):
  return pd.read_csv(path, **kwargs)

def sync_colab_workspace(files: list, sync_strategy: Literal["clone", "upload", "copy"] = "copy", **kwargs):
    """Ensure all modules and files exist in Google Colab."""
    # Capture the argument first: the import below rebinds the name `files`
    # to the google.colab.files module.
    script_files = files
    from google.colab import files, drive

    if sync_strategy == "copy":
        drive_path = "/content/drive"
        if not os.path.exists(drive_path):
            drive.mount(drive_path)  # drive.mount() requires the mount point

        nlp_project_path = kwargs.get("nlp")

        for file in tqdm(script_files, leave=False):
            src = os.path.join(nlp_project_path, file)
            dst = os.path.join("/content", file)
            if os.path.exists(src):
                shutil.copy(src, dst)
            else:
                warning(f"Warning: {file} not found in Drive path.")

    elif sync_strategy == "upload":
        uploaded = files.upload()
        for name in uploaded.keys():
            success(f"Uploaded: {name}")

    elif sync_strategy == "clone":
        repo_url = kwargs.get("repo_url")
        project_name = kwargs.get("project_name")

        repo_dir = f"/content/{project_name}"
        if os.path.exists(repo_dir):
            shutil.rmtree(repo_dir)
        # Clone into repo_dir explicitly so the copy loop below finds the files
        subprocess.run(["git", "clone", repo_url, repo_dir], check=True)

        for file in script_files:
            src = os.path.join(repo_dir, file)
            dst = os.path.join("/content", file)
            if os.path.exists(src):
                shutil.copy(src, dst)
            else:
                warning(f"Warning: {file} not found in cloned repo.")

    else:
        raise ValueError("Invalid sync_strategy. Choose from 'copy', 'upload', or 'clone'.")

    def verify_files():
        print("\nVerifying required files in /content:")
        missing = []
        for file in script_files:
            if not os.path.exists(os.path.join("/content", file)):
                missing.append(file)
        if missing:
            print("Missing files:", missing)
        else:
            print("✅ All files are present.")

    verify_files()


In [None]:
URL = "tomthescientist/netflix-facebook-posts-as-sentences-for-llm-input"
PROJECT_URL = "https://github.com/alaamer12/Fragma"
PROJECT_NAME = "Fragma"
NLP_PROJECT_PATH = "/content/drive/MyDrive/Colab Notebooks/College/3rd Level 2nd Term/Natural Language Processing/Project"
FILENAME = "netflix_fb_sentences.csv"
DSPATH  = os.path.join(download_ds(URL), FILENAME)
FDPATH = 'dataset_fragments.csv'
EXPANDPATH = 'expanded_dataset.csv'
STRATEGY = 'expand'
BALANCE_PER = "95%"
TOLERANCE = "0.1%"
SCRIPT_FILES = [
        "fd_dataset_creator_script.py",
        "fd_ds_expander.py",
        "fd_linguistic_features.py",
        "preprocessor.py",
        "README.md",
        "FD.md",
    ]

DEFAULT_PREPROCESSING_CONFIG = {
    "fix_unicode": True,
    "remove_invalid": True,
    "normalize_whitespace": True,
    "remove_emojis": True,
    "lowercase": True,
    "normalize_words": True,
    "expand_contractions": True,
    "normalize_punctuation": True,
    "strip_platform_noise": True,
    "remove_stopwords": False,
    "tokenize": False,
    "keep_punctuation_tokens": False, # Only if tokenize is True
    "spell_correct": False     # Optional step (expensive)
}

# SYNC_STRATEGY = "copy"
# sync_colab_workspace(SCRIPT_FILES, SYNC_STRATEGY)  # Don't forget the kwargs each strategy needs: "copy" reads nlp; "clone" reads repo_url and project_name

Downloading from https://www.kaggle.com/api/v1/datasets/download/tomthescientist/netflix-facebook-posts-as-sentences-for-llm-input?dataset_version_number=3...
100%|██████████| 8.26M/8.26M [00:00<00:00, 75.2MB/s]
Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/tomthescientist/netflix-facebook-posts-as-sentences-for-llm-input/versions/3


## 📥 Inputs & Outputs

**📥 Inputs:**
- `netflix_fb_sentences.csv`: Raw social media posts dataset
  - Format: CSV with 'Sentence' column
  - Source: Kaggle Dataset

**📤 Outputs:**
- `dataset_fragments.csv`: Initial fragment dataset
  - Columns: 'Sentence Fragment', 'is_fragment'
- `expanded_dataset.csv`: Dataset with linguistic features
  - Additional columns for linguistic patterns
  - Feature indicators (has_auxiliary, has_comma, etc.)
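
The output schema above can be checked with a short sketch. It uses only the standard library so it runs outside Colab; the sample rows are invented for illustration, and `read_fragment_rows` is a hypothetical helper, not a function from this notebook. Only the column names come from the description above.

```python
import csv
import io

# Hypothetical stand-in for dataset_fragments.csv; the rows are invented.
SAMPLE_CSV = (
    "Sentence Fragment,is_fragment\n"
    "The show was great.,False\n"
    "After the finale,True\n"
)
EXPECTED_COLUMNS = ["Sentence Fragment", "is_fragment"]


def read_fragment_rows(handle):
    """Parse fragment rows and fail early if an expected column is missing."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    rows = []
    for row in reader:
        rows.append({
            "Sentence Fragment": row["Sentence Fragment"],
            # CSV stores booleans as text, so convert explicitly
            "is_fragment": row["is_fragment"].strip().lower() == "true",
        })
    return rows


sample_rows = read_fragment_rows(io.StringIO(SAMPLE_CSV))
print(sample_rows)
```

In the notebook itself, `get_df(FDPATH)` loads the same file into a DataFrame, where pandas parses the `is_fragment` column as booleans directly.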

## 🧬 Core Implementation
Below are the key classes and methods used in this notebook for fragment detection dataset generation:


### 🧪 Linguistics

In [None]:
# 1. Auxiliary Verbs (verbs indicating tense, state, or activity)
AUXILIARY_VERBS = [
    "is", "are", "was", "were", "be", "been", "am", "being",
    "has", "have", "had", "does", "do", "did",
    "can", "could", "may", "might", "must", "shall", "should", "will", "would",
    "ought", "need", "dare", "used", "going", "get", "gets", "got", "gotten",
    "keep", "keeps", "kept", "seem", "seems", "seemed", "become", "becomes", "became",
    "remain", "remains", "remained", "stay", "stays", "stayed", "appear", "appears", "appeared"
]

# 2. Common Expressions / Exclamations
COMMON_EXPRESSIONS = [
    "wow", "oops", "amazing", "fantastic", "unbelievable", "incredible",
    "interesting", "sadly", "fortunately", "surprisingly", "finally",
    "oh", "ah", "ugh", "yikes", "gosh", "goodness", "jeez", "alas", "phew",
    "hmm", "huh", "meh", "yay", "hooray", "bingo", "bravo", "congrats", "darn",
    "excellent", "great", "wonderful", "terrific", "brilliant", "splendid", "marvelous",
    "unfortunately", "regrettably", "honestly", "frankly", "truthfully", "admittedly",
    "basically", "essentially", "literally", "virtually", "practically", "technically",
    "ironically", "curiously", "strangely", "oddly", "weirdly", "bizarrely"
]

# 3. Conjunctions (links between ideas)
CONJUNCTIONS = [
    "and", "but", "or", "so", "because", "although", "however",
    "yet", "still", "therefore", "meanwhile", "whereas",
    "since", "unless", "until", "while", "though", "if", "whether",
    "as", "when", "where", "whenever", "wherever", "once", "before",
    "after", "than", "that", "which", "who", "whom", "whose", "what",
    "whatever", "whichever", "whoever", "whomever", "nor", "for", "plus",
    "furthermore", "moreover", "additionally", "consequently", "hence",
    "thus", "accordingly", "otherwise", "nevertheless", "nonetheless",
    "instead", "alternatively", "conversely", "similarly", "likewise"
]

# 4. Temporal Words (expressions indicating time changes)
TEMPORAL_WORDS = [
    "then", "after", "before", "later", "suddenly", "soon",
    "eventually", "earlier", "now", "today", "tonight", "tomorrow",
    "yesterday", "morning", "afternoon", "evening", "night", "midnight",
    "dawn", "dusk", "weekly", "monthly", "yearly", "daily", "hourly",
    "instantly", "immediately", "promptly", "currently", "presently",
    "previously", "formerly", "lately", "recently", "nowadays", "momentarily",
    "temporarily", "briefly", "shortly", "occasionally", "frequently", "regularly",
    "periodically", "constantly", "continuously", "perpetually", "eternally",
    "forever", "always", "never", "ever", "seldom", "rarely", "sometimes",
    "often", "usually", "generally", "typically", "historically", "traditionally",
    "initially", "ultimately", "finally", "lastly", "meanwhile", "simultaneously",
    "concurrently", "subsequently", "consequently", "accordingly", "henceforth"
]

# 5. Adverbs of Opinion / Degree - Common explicit ones
ADVERBS_OPINION = [
    "clearly", "obviously", "probably", "certainly", "absolutely",
    "seriously", "undoubtedly", "definitely", "arguably", "presumably",
    "apparently", "evidently", "seemingly", "supposedly", "allegedly",
    "conceivably", "possibly", "perhaps", "maybe", "likely", "unlikely",
    "surely", "truly", "really", "actually", "honestly", "frankly",
    "admittedly", "unfortunately", "fortunately", "surprisingly", "amazingly",
    "astonishingly", "shockingly", "disappointingly", "regrettably", "sadly",
    "happily", "gladly", "hopefully", "mercifully", "thankfully", "luckily",
    "incredibly", "remarkably", "notably", "significantly", "substantially",
    "considerably", "essentially", "fundamentally", "basically", "primarily",
    "mainly", "largely", "mostly", "generally", "typically", "usually",
    "normally", "commonly", "frequently", "occasionally", "rarely", "seldom"
]

# 6. Sentence Starters / Fillers
SENTENCE_STARTERS = [
    "well", "so", "anyway", "besides", "actually", "by the way",
    "in fact", "as a matter of fact", "to be honest", "honestly",
    "frankly", "to tell the truth", "truthfully", "admittedly",
    "obviously", "clearly", "evidently", "apparently", "seemingly",
    "interestingly", "surprisingly", "remarkably", "notably", "significantly",
    "importantly", "essentially", "basically", "fundamentally", "generally",
    "typically", "usually", "normally", "commonly", "frequently", "occasionally",
    "first", "firstly", "second", "secondly", "third", "thirdly", "finally", "lastly",
    "meanwhile", "subsequently", "consequently", "therefore", "thus", "hence",
    "accordingly", "as a result", "for this reason", "due to this", "because of this",
    "nevertheless", "nonetheless", "however", "on the other hand", "conversely",
    "in contrast", "alternatively", "instead", "rather", "in addition", "furthermore",
    "moreover", "similarly", "likewise", "in the same way", "for example", "for instance",
    "specifically", "in particular", "namely", "to illustrate", "such as", "including",
    "in conclusion", "to conclude", "to summarize", "in summary", "overall", "ultimately",
    "in the end", "eventually", "after all", "all in all", "on the whole", "by and large"
]

# Regex patterns for linguistic features
ADVERB_PATTERNS = [
    r'\b\w+ly\b',         # Words ending with 'ly' (carefully, quickly, etc.)
    r'\b\w+ily\b',        # Words ending with 'ily' (happily, easily, etc.)
    r'\b\w+ally\b',       # Words ending with 'ally' (basically, naturally, etc.)
    r'\b\w+ically\b',     # Words ending with 'ically' (specifically, dramatically, etc.)
]

# Past tense verb patterns
PAST_VERB_PATTERNS = [
    r'\b\w{3,}ed\b',      # Regular past tense (played, walked, etc.)
    r'\b\w{3,}en\b',      # Past participles (taken, broken, etc.)
    r'\b\w+ied\b',        # Words ending with 'ied' (cried, tried, etc.)
    r'\b\w+ought\b',      # Irregular past tense (thought, bought, etc.)
    r'\b\w+aught\b',      # Irregular past tense (caught, taught, etc.)
    r'\bwent\b',          # Irregular past tense of 'go'
    r'\bsaw\b',           # Irregular past tense of 'see'
    r'\bcame\b',          # Irregular past tense of 'come'
    r'\btook\b',          # Irregular past tense of 'take'
    r'\bgave\b',          # Irregular past tense of 'give'
    r'\bmade\b',          # Irregular past tense of 'make'
    r'\bsaid\b',          # Irregular past tense of 'say'
    r'\bfelt\b',          # Irregular past tense of 'feel'
    r'\bheld\b',          # Irregular past tense of 'hold'
    r'\bfound\b',         # Irregular past tense of 'find'
    r'\bknew\b',          # Irregular past tense of 'know'
    r'\bgot\b',           # Irregular past tense of 'get'
    r'\bput\b',           # Irregular past tense of 'put'
    r'\bset\b',           # Irregular past tense of 'set'
    r'\bran\b',           # Irregular past tense of 'run'
    r'\bwrote\b',         # Irregular past tense of 'write'
]

# Gerund patterns
GERUND_PATTERNS = [
    r'\b\w{3,}ing\b',     # Standard gerunds (playing, walking, etc.)
    r'\b\w+ying\b',       # Words ending with 'ying' (trying, crying, etc.)
    r'\b\w+ling\b',       # Words ending with 'ling' (handling, cycling, etc.)
    r'\b\w+ting\b',       # Words ending with 'ting' (sitting, getting, etc.)
    r'\b\w+ping\b',       # Words ending with 'ping' (shopping, clapping, etc.)
    r'\b\w+ning\b',       # Words ending with 'ning' (running, planning, etc.)
    r'\b\w+ming\b',       # Words ending with 'ming' (swimming, coming, etc.)
    r'\b\w+ding\b',       # Words ending with 'ding' (reading, building, etc.)
    r'\b\w+cing\b',       # Words ending with 'cing' (dancing, racing, etc.)
    r'\b\w+king\b',       # Words ending with 'king' (talking, making, etc.)
    r'\b\w+ging\b',       # Words ending with 'ging' (hanging, bringing, etc.)
]

# Common adverbs (mostly degree words) excluded from adverb pattern matching, whether or not they end in 'ly'
NON_SPLITTING_ADVERBS = [
    "very", "quite", "rather", "too", "so", "just", "only",
    "almost", "nearly", "really", "pretty", "even"
]

# Feature descriptions for documentation and display
FEATURE_DESCRIPTIONS = {
    "has_auxiliary": "Contains auxiliary verbs (is, are, was, were, etc.)",
    "has_fullstop": "Contains a period (.)",
    "has_question_mark": "Contains a question mark (?)",
    "has_exclamation_mark": "Contains an exclamation mark (!)",
    "has_comma": "Contains a comma (,)",
    "has_semicolon": "Contains a semicolon (;)",
    "has_colon": "Contains a colon (:)",
    "has_quotation": "Contains quotation marks (' or \")",
    "has_expression": "Contains common expressions or exclamations",
    "has_conjunction": "Contains conjunctions (and, but, or, etc.)",
    "has_temporal": "Contains temporal words (then, after, before, etc.)",
    "has_opinion_adverb": "Contains adverbs of opinion (clearly, obviously, etc.)",
    "has_adverb": "Contains adverbs (words ending with -ly, -ily, etc.)",
    "has_starter": "Contains sentence starters or fillers (well, so, etc.)",
    "has_past_verb": "Contains past tense verbs (words ending with -ed or -en)",
    "has_gerund": "Contains gerunds/present participles (words ending with -ing)",
    "word_count": "Number of words in the sentence",
    "char_count": "Number of characters in the sentence",
    "starts_capitalized": "Sentence starts with a capital letter",
    "all_caps_word": "Contains a word in all capital letters"
}
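
To make the feature descriptions concrete, here is a minimal sketch that computes a handful of the surface-level features for one sentence. `surface_features` is a hypothetical name and is not the notebook's `FeatureExtractor`; the two-letter minimum for all-caps words is also an assumption made for this example.

```python
import re


def surface_features(text: str) -> dict:
    """Compute a few of the surface-level features listed in FEATURE_DESCRIPTIONS."""
    words = text.split()
    return {
        "has_fullstop": "." in text,
        "has_question_mark": "?" in text,
        "has_exclamation_mark": "!" in text,
        "has_comma": "," in text,
        # Same idea as the first gerund pattern: three or more characters plus "ing"
        "has_gerund": any(
            re.fullmatch(r"\w{3,}ing", w.strip(".,!?;:").lower()) for w in words
        ),
        "word_count": len(words),
        "char_count": len(text),
        "starts_capitalized": text[:1].isupper(),
        # Assumption: an all-caps word needs at least two letters, so "I" does not count
        "all_caps_word": any(len(w) > 1 and w.isalpha() and w.isupper() for w in words),
    }


features = surface_features("Clearly, this is interesting.")
print(features)
```

Note that `has_gerund` comes out `True` here because "interesting" ends in "-ing", even though it is an adjective in this sentence. Suffix-based patterns are cheap but over-match, which is worth keeping in mind when reading the feature columns.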

### 📊 Fragment Detection Dataset Creation

#### Overview

The Fragment Detection Dataset Creator is a dedicated class that prepares data for fragment detection by:
- Removing noise words and standardizing text
- Handling each dataset instance consistently
- Providing two balancing strategies

#### Balance Strategies

**1. Reduction Strategy** 🔻
- Reduces majority class instances (`is_fragment = False`)
- Makes dataset balanced by removing excess complete sentences
- No new data creation, only removal of existing data
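
As a rough illustration of the reduction strategy, the sketch below downsamples the larger class on a toy list of rows. It is plain Python rather than the notebook's `balance_dataset`, and `reduce_majority`, the fixed seed, and the toy rows are all assumptions made for the example.

```python
import random


def reduce_majority(rows, seed=0):
    """Downsample the larger class so both classes end up the same size."""
    fragments = [r for r in rows if r["is_fragment"]]
    complete = [r for r in rows if not r["is_fragment"]]
    target = min(len(fragments), len(complete))
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = rng.sample(fragments, target) + rng.sample(complete, target)
    rng.shuffle(balanced)
    return balanced


# Toy data: 2 fragments and 5 complete sentences (invented for illustration)
toy_rows = (
    [{"Sentence Fragment": f"fragment {i}", "is_fragment": True} for i in range(2)]
    + [{"Sentence Fragment": f"Sentence {i}.", "is_fragment": False} for i in range(5)]
)
balanced = reduce_majority(toy_rows)
print(len(balanced), sum(r["is_fragment"] for r in balanced))
```

The trade-off is visible even at this scale: three of the five complete sentences are discarded, which is why reduction suits datasets that are already large.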

**2. Expansion Strategy** 🔼
- Increases minority class instances (`is_fragment = True`)
- Creates new fragments from complete sentences
- Uses intelligent linguistic analysis for natural splits

#### Expansion Process

For each row where `is_fragment = False`:
1. Analyze sentence structure
2. Find natural linguistic breakpoints
3. Split sentence into two fragments
4. Create new instances (rows) for each fragment
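
The four steps above can be sketched as a small loop. This is a simplified stand-in, not the notebook's implementation: `split_in_half` replaces the linguistic breakpoint search with a middle-word cut, `expand_minority` is a hypothetical name, and keeping the original complete sentences alongside the new fragments is an assumption.

```python
def split_in_half(sentence: str) -> list:
    """Placeholder splitter: cut at the middle word (not the notebook's rules)."""
    words = sentence.split()
    mid = len(words) // 2
    return [" ".join(words[:mid]), " ".join(words[mid:])]


def expand_minority(rows, splitter=split_in_half, needed=None):
    """Append fragment rows derived from complete sentences."""
    complete = [r for r in rows if not r["is_fragment"]]
    fragment_count = len(rows) - len(complete)
    if needed is None:
        # By default, add fragments until both classes are the same size
        needed = max(len(complete) - fragment_count, 0)
    new_rows = []
    for row in complete:
        if len(new_rows) >= needed:
            break
        for piece in splitter(row["Sentence Fragment"]):
            if piece and len(new_rows) < needed:
                new_rows.append({"Sentence Fragment": piece, "is_fragment": True})
    return rows + new_rows


# Toy data: 3 complete sentences and 1 fragment (invented for illustration)
toy_rows = [
    {"Sentence Fragment": "The show was really good.", "is_fragment": False},
    {"Sentence Fragment": "I watched it twice last week.", "is_fragment": False},
    {"Sentence Fragment": "Nobody expected that ending.", "is_fragment": False},
    {"Sentence Fragment": "After the finale", "is_fragment": True},
]
expanded = expand_minority(toy_rows)
print(len(expanded), sum(r["is_fragment"] for r in expanded))
```

Passing a different `splitter` callable is how a rule-based splitter would slot into the same loop.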

##### `SmartExpander` is a helper class for creating the fragment detection dataset
- It is a sub-strategy of expansion, encapsulated in its own class because of its complexity.
- `SmartExpander` follows a fixed pattern when expanding the dataset, which limits how far expansion can go: after several iterations, each pass grows the dataset by less than 1%.
- This is a consequence of how the pattern works: the same sentence is always split at the same point, so repeated passes tend to reproduce fragments that already exist.
- **Suggestion**: Adding randomness to the splitting steps could give better results, but it won't fix the problem entirely.
- This class can produce duplicate rows, so always **clean** your data afterwards.
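
A minimal sketch of the cleaning step mentioned above, in plain Python. `drop_duplicate_fragments` is a hypothetical helper, and treating fragments that differ only in case or spacing as duplicates is an assumption; the notebook's own cleaning may compare rows differently.

```python
def drop_duplicate_fragments(rows):
    """Keep the first occurrence of each fragment, ignoring case and extra spaces."""
    seen = set()
    unique = []
    for row in rows:
        # Normalize before comparing so trivial variants collapse to one key
        key = " ".join(row["Sentence Fragment"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Toy data with one near-duplicate (invented for illustration)
toy_rows = [
    {"Sentence Fragment": "After the finale.", "is_fragment": True},
    {"Sentence Fragment": "after  the finale.", "is_fragment": True},
    {"Sentence Fragment": "The show was great.", "is_fragment": False},
]
deduped = drop_duplicate_fragments(toy_rows)
print(len(deduped))
```

On a DataFrame, `df.drop_duplicates(subset=["Sentence Fragment"])` gives the exact-match version of the same step.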

In [None]:
class SmartExpander:
    def __init__(self):
        # Use imported word lists from fd_linguistic_features
        self.auxiliary_verbs = AUXILIARY_VERBS
        self.common_expressions = COMMON_EXPRESSIONS
        self.conjunctions = CONJUNCTIONS
        self.temporal_words = TEMPORAL_WORDS
        self.adverbs_opinion = ADVERBS_OPINION
        self.sentence_starters = SENTENCE_STARTERS
        self.adverb_patterns = ADVERB_PATTERNS
        self.non_splitting_adverbs = NON_SPLITTING_ADVERBS
        self.past_verb_patterns = PAST_VERB_PATTERNS
        self.gerund_patterns = GERUND_PATTERNS

        # Keyword categories in priority order
        self.keyword_categories = [
            self.common_expressions,
            self.sentence_starters,
            self.conjunctions,
            self.auxiliary_verbs,
            self.temporal_words,
            self.adverbs_opinion
        ]

    def is_adverb_by_pattern(self, word: str) -> bool:
        """
        Check if a word matches adverb patterns (typically ending with 'ly').
        Excludes words that are in the non-splitting adverbs list.

        Args:
            word: The word to check

        Returns:
            True if the word matches adverb patterns, False otherwise
        """
        # Clean the word from punctuation
        clean_word = re.sub(r'[^\w\s]', '', word.lower())

        # Skip checking if it's in our non-splitting adverbs list
        if clean_word in self.non_splitting_adverbs:
            return False

        # Check against all adverb patterns
        for pattern in self.adverb_patterns:
            if re.match(pattern, clean_word):
                return True

        return False

    def is_past_verb(self, word: str) -> bool:
        """
        Check if a word matches past verb patterns (ending with 'ed' or 'en').

        Args:
            word: The word to check

        Returns:
            True if the word is likely a past tense verb, False otherwise
        """
        # Clean the word from punctuation
        clean_word = re.sub(r'[^\w\s]', '', word.lower())

        # Check against all past verb patterns
        for pattern in self.past_verb_patterns:
            if re.match(pattern, clean_word):
                return True

        return False

    def is_gerund(self, word: str) -> bool:
        """
        Check if a word matches gerund patterns (ending with 'ing').

        Args:
            word: The word to check

        Returns:
            True if the word is likely a gerund, False otherwise
        """
        # Clean the word from punctuation
        clean_word = re.sub(r'[^\w\s]', '', word.lower())

        # Check against all gerund patterns
        for pattern in self.gerund_patterns:
            if re.match(pattern, clean_word):
                return True

        return False

    def find_split_point(self, sentence: str) -> Optional[Tuple[int, str]]:
        """
        Find the optimal split point in a sentence based on keyword priority and regex patterns.

        Args:
            sentence: The sentence to analyze for split points

        Returns:
            A tuple (position, keyword) if a split point is found, None otherwise
        """
        # Convert to lowercase for case-insensitive matching
        words = sentence.lower().split()
        original_words = sentence.split()

        # Search for keywords in priority order
        for keyword_list in self.keyword_categories:
            for i, word in enumerate(words):
                # Clean the word from punctuation for matching
                clean_word = re.sub(r'[^\w\s]', '', word)

                if clean_word in keyword_list:
                    # Return the position and the original keyword
                    return i, original_words[i]

        # If no keyword match, check for special patterns in this order:
        # 1. First check for auxiliary verbs followed by gerunds (e.g., "is playing")
        for i, word in enumerate(words[:-1]):  # Skip the last word
            clean_word = re.sub(r'[^\w\s]', '', word)
            next_word = re.sub(r'[^\w\s]', '', words[i+1])

            if clean_word in self.auxiliary_verbs and self.is_gerund(next_word):
                # Found auxiliary + gerund pattern (e.g., "is playing")
                # Split at the gerund to keep the auxiliary with the first part
                return i+1, original_words[i+1]

        # 2. Check for gerunds
        for i, word in enumerate(words):
            if self.is_gerund(word):
                return i, original_words[i]

        # 3. Check for past verbs
        for i, word in enumerate(words):
            if self.is_past_verb(word):
                return i, original_words[i]

        # 4. Finally check for adverbs
        for i, word in enumerate(words):
            if self.is_adverb_by_pattern(word):
                return i, original_words[i]

        # No keyword or pattern found
        return None

    def split_sentence(self, sentence: str) -> List[str]:
        """
        Split a sentence intelligently into two fragments.

        Args:
            sentence: The sentence to split

        Returns:
            A list of two sentence fragments
        """
        # Find a split point based on keywords or regex patterns
        split_info = self.find_split_point(sentence)

        if split_info:
            # Split at the keyword or pattern match
            idx, keyword = split_info
            words = sentence.split()

            # Create two fragments: before and after the keyword (including it)
            fragment1 = ' '.join(words[:idx]).strip()
            fragment2 = ' '.join(words[idx:]).strip()

            # Handle very short fragments (minimum 2 words where possible)
            if len(fragment1.split()) < 2 and len(words) > 4:
                # If fragment1 is too short, move one more word to it
                fragment1 = ' '.join(words[:idx+1]).strip()
                fragment2 = ' '.join(words[idx+1:]).strip()

            # If fragment2 would be too short, keep it with fragment1
            if len(fragment2.split()) < 2 and len(words) > 4:
                fragment1 = sentence
                fragment2 = ""
        else:
            # Fallback: split at the middle word
            words = sentence.split()
            middle_idx = len(words) // 2

            fragment1 = ' '.join(words[:middle_idx]).strip()
            fragment2 = ' '.join(words[middle_idx:]).strip()

        # Clean up fragments: remove leading/trailing punctuation
        fragments = []
        for fragment in [fragment1, fragment2]:
            if fragment:  # Only process non-empty fragments
                # Remove leading punctuation except opening quotes
                fragment = re.sub(r'^[^\w"\']+', '', fragment)

                # A fragment may open a quote without closing it; the closing
                # quote stays in the other fragment, so no repair is attempted.

                # Ensure sentences end with proper punctuation
                if not re.search(r'[.!?]$', fragment) and fragment:
                    fragment = fragment.rstrip(',;:-') + '.'

                fragments.append(fragment)

        return [f for f in fragments if f]  # Return only non-empty fragments


def preprocess_sentence(sentence: str) -> str:
    """
    Remove the starting '<name> commented' and ending 'on <date> on <platform>.' parts,
    and trim leading/trailing whitespaces.

    Args:
        sentence: The original sentence to process

    Returns:
        The cleaned sentence with patterns removed
    """
    # Remove the starting part '<name> commented'
    sentence = re.sub(r'^.*? commented\s*', '', sentence)

    # Remove the ending part 'on <date> on <platform>.'
    sentence = re.sub(r'\s*on \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} on .*?\.?$', '', sentence)

    # If the sentence was in quotes, remove only the outer quotes if they remain
    if sentence.startswith('"') and sentence.endswith('"'):
        sentence = sentence[1:-1]

    # Trim leading/trailing whitespaces
    return sentence.strip()


def extract_fragments(sentence: str) -> List[Dict[str, Union[str, bool]]]:
    """
    Process the sentence and extract fragments according to the rules.

    Rules:
    - If the sentence contains a comma, split at each comma and mark each part as a fragment
    - If no comma, keep up to the first special character (., !, ?) and mark as not a fragment

    Args:
        sentence: The preprocessed sentence

    Returns:
        List of dictionaries with keys 'Sentence Fragment' and 'is_fragment'
    """
    fragments = []

    # Check if the sentence contains a comma
    if ',' in sentence:
        # Split at each comma
        parts = sentence.split(',')

        # Create a new row for each part and set is_fragment = True
        for part in parts:
            part = part.strip()
            if part:  # Only add if part is not empty
                fragments.append({'Sentence Fragment': part, 'is_fragment': True})
    else:
        # Find the first occurrence of special characters
        match = re.search(r'[.!?]', sentence)

        if match:
            # Get the position of the first special character
            pos = match.start()

            # Keep everything before and including the special character
            fragment = sentence[:pos+1].strip()
        else:
            # If no special character exists, keep the whole text
            fragment = sentence.strip()

        # Add to fragments list if not empty
        if fragment:
            fragments.append({'Sentence Fragment': fragment, 'is_fragment': False})

    return fragments


def balance_dataset(df: pd.DataFrame, strategy: str = "reduce",
                   keep_balancing: Optional[str] = None, tolerance: Optional[str] = None) -> pd.DataFrame:
    """
    Balance the dataset based on the 'is_fragment' column using the specified strategy.
    Optionally maintain a target balance percentage even after duplicate removal.

    Args:
        df: DataFrame containing the dataset to balance
        strategy: Strategy to use for balancing ('reduce' or 'expand')
        keep_balancing: Target balance percentage as string (e.g., "95%"). If provided,
                       iteratively balance and remove duplicates until target is reached.
        tolerance: Minimum balance improvement between iterations, as a string (e.g., "1%").
                  If an iteration improves the balance by less than this value, stop iterating.

    Returns:
        Balanced DataFrame
    """
    # Extract target percentage if keep_balancing is provided
    target_percentage = None
    if keep_balancing:
        # Extract the percentage value (remove % symbol and convert to float)
        try:
            target_percentage = float(keep_balancing.strip('%'))
            if target_percentage <= 0 or target_percentage > 100:
                warning(f"Invalid target percentage {target_percentage}. Using 95% as default.")
                target_percentage = 95.0
        except ValueError:
            warning(f"Could not parse '{keep_balancing}' as percentage. Using 95% as default.")
            target_percentage = 95.0

        info(f"Targeting {target_percentage}% balance after duplicate removal.")

    # Extract tolerance percentage if provided
    tolerance_percentage = None
    if tolerance:
        try:
            tolerance_percentage = float(tolerance.strip('%'))
            if tolerance_percentage <= 0 or tolerance_percentage > 10:  # Cap at 10% for safety
                warning(f"Invalid tolerance {tolerance_percentage}. Using 1% as default.")
                tolerance_percentage = 1.0
        except ValueError:
            warning(f"Could not parse '{tolerance}' as percentage. Using 1% as default.")
            tolerance_percentage = 1.0

        info(f"Will stop when balance improvement is less than {tolerance_percentage}%.")

    # Function to calculate current balance percentage
    def calculate_balance_percentage(dataframe):
        true_count = dataframe[dataframe['is_fragment'] == True].shape[0]
        false_count = dataframe[dataframe['is_fragment'] == False].shape[0]

        if true_count == 0 or false_count == 0:
            return 0.0

        return (min(true_count, false_count) / max(true_count, false_count)) * 100

    # Function to display class distribution
    def print_distribution(dataframe, label=""):
        true_count = dataframe[dataframe['is_fragment'] == True].shape[0]
        false_count = dataframe[dataframe['is_fragment'] == False].shape[0]
        balance_pct = calculate_balance_percentage(dataframe)

        info(f"{label}: {true_count} fragments, {false_count} non-fragments " +
              f"(balance: {balance_pct:.2f}%)")

    # Initial distribution
    print_distribution(df, "Before balancing")

    # If target_percentage is None, just balance once without iterative process
    if target_percentage is None:
        # Perform simple balancing as before
        balanced_df = _balance_dataset_once(df, strategy)
        print_distribution(balanced_df, "After balancing")
        return balanced_df

    # Iterative balancing to maintain target percentage after deduplication
    current_df = df.copy()
    iteration = 1
    max_iterations = 10  # Prevent infinite loops
    previous_balance = calculate_balance_percentage(current_df)

    while iteration <= max_iterations:
        header(f"\nIteration {iteration}:")

        # Step 1: Balance the dataset
        current_df = _balance_dataset_once(current_df, strategy)
        print_distribution(current_df, "After balancing")

        # Step 2: Remove duplicates
        original_count = len(current_df)
        current_df = current_df.drop_duplicates(subset=['Sentence Fragment']).reset_index(drop=True)
        removed_count = original_count - len(current_df)

        info(f"Removed {removed_count} duplicates")
        print_distribution(current_df, "After removing duplicates")

        # Step 3: Check if we've reached the target balance
        current_balance = calculate_balance_percentage(current_df)
        if current_balance >= target_percentage:
            success(f"Target balance of {target_percentage}% achieved!")
            break

        # Check if the improvement is less than the tolerance
        if tolerance_percentage is not None:
            balance_improvement = current_balance - previous_balance
            info(f"Balance improvement: {balance_improvement:.2f}%")

            if balance_improvement < tolerance_percentage and balance_improvement >= 0:
                warning(f"Balance improvement ({balance_improvement:.2f}%) is less than tolerance ({tolerance_percentage}%).")
                info("Stopping iterations: the balance is no longer improving meaningfully.")
                break

            # Update the previous balance for the next iteration
            previous_balance = current_balance

        # If we didn't reach the target but we're at the last iteration
        if iteration == max_iterations:
            warning(f"Could not achieve target balance of {target_percentage}% " +
                  f"after {max_iterations} iterations. Current balance: {current_balance:.2f}%")

        iteration += 1

    return current_df


def _balance_dataset_once(df: pd.DataFrame, strategy: str = "reduce") -> pd.DataFrame:
    """
    Helper function that performs one round of dataset balancing.

    Args:
        df: DataFrame containing the dataset to balance
        strategy: Strategy to use for balancing ('reduce' or 'expand')

    Returns:
        Balanced DataFrame
    """
    # Count the instances of each class
    true_count = df[df['is_fragment'] == True].shape[0]
    false_count = df[df['is_fragment'] == False].shape[0]

    # If already balanced, return the original DataFrame
    if true_count == false_count:
        return df

    # Determine which class is the majority
    if strategy == "reduce":
        # Reduction strategy: reduce the number of majority class instances
        if true_count > false_count:
            # More fragments than non-fragments
            # Randomly drop rows where is_fragment = True
            drop_count = true_count - false_count
            drop_indices = df[df['is_fragment'] == True].sample(drop_count).index
            balanced_df = df.drop(drop_indices)
        else:
            # More non-fragments than fragments
            # Randomly drop rows where is_fragment = False
            drop_count = false_count - true_count
            drop_indices = df[df['is_fragment'] == False].sample(drop_count).index
            balanced_df = df.drop(drop_indices)

    elif strategy == "expand":
        # Expansion strategy: increase the number of minority class instances
        balanced_df = df.copy()

        if true_count < false_count:
            # More non-fragments than fragments, need to create more fragments
            # Target number of new fragments to create
            target_new = false_count - true_count
            new_fragments = []

            # Create a SmartExpander instance for intelligent splitting
            expander = SmartExpander()

            # Get non-fragment sentences to split
            non_fragments = df[df['is_fragment'] == False]['Sentence Fragment'].tolist()
            random.shuffle(non_fragments)  # Shuffle to randomize selection

            # Add progress bar for fragment expansion
            info("Creating new fragments to balance dataset...")
            for sentence in tqdm(non_fragments, total=None, desc="Expanding fragments", unit="sent", leave=True):
                if len(new_fragments) >= target_new:
                    break

                # Only process sentences with at least 4 words
                if len(sentence.split()) >= 4:
                    # Use the SmartExpander to split the sentence
                    fragments = expander.split_sentence(sentence)

                    # Add fragments to the new list
                    for fragment in fragments:
                        if fragment.strip():  # Only add if not empty
                            new_fragments.append({
                                'Sentence Fragment': fragment.strip(),
                                'is_fragment': True
                            })

                            # If we've reached our target, stop
                            if len(new_fragments) >= target_new:
                                break

            # Add new fragments to the dataset
            if new_fragments:
                new_df = pd.DataFrame(new_fragments)
                balanced_df = pd.concat([balanced_df, new_df], ignore_index=True)
        else:
            # More fragments than non-fragments, need to create more non-fragments
            # This case is more complex as creating valid non-fragments from fragments
            # is challenging. For now, we'll use the reduction strategy as a fallback.
            warning("Expansion from fragments to non-fragments is not supported. Falling back to reduction strategy.")
            drop_count = true_count - false_count
            drop_indices = df[df['is_fragment'] == True].sample(drop_count).index
            balanced_df = df.drop(drop_indices)
    else:
        raise ValueError(f"Unknown balancing strategy: {strategy}. Use 'reduce' or 'expand'.")

    return balanced_df


def process_dataset(input_file: str, output_file: str, balance_strategy: Optional[str] = None,
                   keep_balancing: Optional[str] = None, tolerance: Optional[str] = None) -> None:
    """
    Process the dataset file, apply transformations, and save the results.

    Args:
        input_file: Path to the input CSV file
        output_file: Path to save the output CSV file
        balance_strategy: Strategy to use for balancing the dataset ('reduce', 'expand', or None for no balancing)
        keep_balancing: Target balance percentage after duplicate removal (e.g., "95%")
        tolerance: Minimum balance improvement between iterations (e.g., "1%"). If an iteration
                  improves the balance by less than this value, stop iterating.
    """
    try:
        # Read the dataset
        info(f"Reading input file: {input_file}")
        df = pd.read_csv(input_file)

        if 'Sentence' not in df.columns:
            error("Input file must contain a 'Sentence' column.")
            return

        # Process each sentence and collect results
        all_fragments = []

        info(f"Processing {len(df)} sentences...")
        for _, row in tqdm(df.iterrows(), total=len(df), desc="Processing sentences"):
            # Get the current sentence
            sentence = row['Sentence']

            # Preprocess the sentence
            processed_sentence = preprocess_sentence(sentence)

            # Extract fragments
            fragments = extract_fragments(processed_sentence)

            # Add each fragment to the results
            all_fragments.extend(fragments)

        # Create a new DataFrame with the results
        info("Creating output DataFrame...")
        result_df = pd.DataFrame(all_fragments)

        # Balance the dataset if a strategy is specified
        if balance_strategy:
            info(f"Balancing dataset using '{balance_strategy}' strategy...")
            result_df = balance_dataset(result_df, balance_strategy, keep_balancing, tolerance)

        # Save to CSV
        info(f"Saving results to: {output_file}")
        result_df.to_csv(output_file, index=False)

        success(f"Processing complete. Output saved to {output_file}")
        info(f"Processed {len(df)} original sentences into {len(result_df)} fragments.")

    except Exception as e:
        error(f"An error occurred: {str(e)}")
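
Why `balance_dataset` iterates can be seen on a toy example: removing duplicate texts after balancing can shrink one class again and lower the balance percentage. The sketch below is a simplified, pandas-free illustration of that effect. The `Row` alias, the helper names, and the sample rows are assumptions made for this example and are not part of the notebook's code.

```python
from typing import List, Tuple

# (text, is_fragment): a simplified stand-in for one DataFrame row (assumption)
Row = Tuple[str, bool]


def balance_percentage(rows: List[Row]) -> float:
    """Return min(class count) / max(class count) * 100, or 0.0 if a class is empty."""
    true_count = sum(1 for _, is_fragment in rows if is_fragment)
    false_count = len(rows) - true_count
    if true_count == 0 or false_count == 0:
        return 0.0
    return min(true_count, false_count) / max(true_count, false_count) * 100


def drop_duplicate_texts(rows: List[Row]) -> List[Row]:
    """Keep the first row for each distinct text, preserving order."""
    seen = set()
    unique = []
    for text, is_fragment in rows:
        if text not in seen:
            seen.add(text)
            unique.append((text, is_fragment))
    return unique


# Hypothetical data: 4 complete sentences and 4 fragments, two of which share a text
rows = [
    ("I like it.", False),
    ("It works well.", False),
    ("The story is good.", False),
    ("We watched it twice.", False),
    ("But the story", True),
    ("But the story", True),
    ("and then", True),
    ("after the show", True),
]

before = balance_percentage(rows)                        # 4 fragments vs 4 non-fragments
after = balance_percentage(drop_duplicate_texts(rows))   # 3 fragments vs 4 non-fragments
print(f"Balance before deduplication: {before:.1f}%")
print(f"Balance after deduplication:  {after:.1f}%")
```

In this toy case the balance drops from 100% to 75% after deduplication, which is the situation the `keep_balancing` target and the repeated balance-then-deduplicate loop are meant to handle.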


#### Create Fragment Detection dataset

In [None]:
def create_fragement_ds():
    """Build the fragment detection dataset using the notebook-level settings (paths and balancing options)."""
    input_file = DSPATH
    output_file = FDPATH
    balance_strategy = STRATEGY
    balance_per = BALANCE_PER
    tolerance = TOLERANCE

    process_dataset(input_file, output_file, balance_strategy, balance_per, tolerance)


if __name__ == "__main__":
    create_fragement_ds()

Processing sentences:   0%|          | 0/120000 [00:00<?, ?it/s]

Expanding fragments:   0%|          | 0/105876 [00:00<?, ?sent/s]

Expanding fragments:   0%|          | 0/25951 [00:00<?, ?sent/s]

### 🔍 Dataset Feature Expansion

#### Purpose

The Feature Expander enriches the dataset by:
- Adding linguistic feature columns
- Detecting special word patterns
- Enhancing ML model interpretability

#### Feature Types

Detects and flags presence of:
- Auxiliary verbs
- Adverbs
- Conjunctions
- Temporal expressions
- Other linguistic markers
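
As a rough illustration of how such flags can be computed, the sketch below checks a sentence against small word lists and one regular expression. The `DEMO_*` lists, the `-ly` adverb heuristic, and the `flag_features` helper are hypothetical stand-ins chosen for this example; the notebook imports its own, larger lists and patterns.

```python
import re
from typing import Dict

# Hypothetical, abbreviated word lists used only for this illustration
DEMO_AUXILIARY_VERBS = {"is", "are", "was", "were", "have", "has", "will", "can"}
DEMO_CONJUNCTIONS = {"and", "but", "or", "because", "although"}
DEMO_TEMPORAL_WORDS = {"today", "yesterday", "now", "later", "soon"}
# Crude heuristic (assumption): treat any word ending in "ly" as an adverb
DEMO_ADVERB_PATTERN = re.compile(r"\b\w+ly\b")


def flag_features(text: str) -> Dict[str, bool]:
    """Return boolean flags describing which marker types appear in the text."""
    lowered = text.lower()
    # Strip punctuation from each token so that "yesterday." matches "yesterday"
    words = [re.sub(r"[^\w\s]", "", token) for token in lowered.split()]
    return {
        "has_auxiliary": any(word in DEMO_AUXILIARY_VERBS for word in words),
        "has_conjunction": any(word in DEMO_CONJUNCTIONS for word in words),
        "has_temporal": any(word in DEMO_TEMPORAL_WORDS for word in words),
        "has_adverb": bool(DEMO_ADVERB_PATTERN.search(lowered)),
    }


flags = flag_features("But the story was really slow yesterday.")
print(flags)
```

Each flag becomes one boolean column, which keeps the features easy to inspect in feature-importance analyses.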

#### Usage Guidelines

**✅ Recommended for:**
- Traditional ML models
- Feature importance analysis
- Model interpretability studies

**⚠️ Considerations for Deep Learning:**
- Hand-crafted flags often add little for DL models, which can learn such patterns directly from text
- May introduce unnecessary complexity
- Consider raw text for transformer models

> 🔍 **Note:** Ensure data is clean and deduplicated before feature expansion
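
One simple way to check for leftover duplicates before expanding features is to count repeated texts. The sketch below is a minimal, standard-library illustration; the `find_duplicates` helper and the sample fragments are assumptions for this example, not part of the notebook.

```python
from collections import Counter
from typing import Dict, List


def find_duplicates(fragments: List[str]) -> Dict[str, int]:
    """Return each text that occurs more than once (after trimming whitespace) with its count."""
    counts = Counter(fragment.strip() for fragment in fragments)
    return {text: count for text, count in counts.items() if count > 1}


# Hypothetical fragments; the third differs from the first only by trailing whitespace
fragments = ["But the story", "and then", "But the story ", "I like it."]
duplicates = find_duplicates(fragments)
print(duplicates)
```

An empty result suggests the fragment column is already deduplicated; otherwise the listed texts can be reviewed or dropped first.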

#### Core Implementation of the Feature Extractor

In [None]:
class FeatureExtractor:
    """
    Extracts linguistic features from sentences based on predefined patterns and word lists.
    Used to analyze text for specific grammatical and structural elements.
    """
    def __init__(self):
        # Use imported word lists from fd_linguistic_features
        self.auxiliary_verbs = AUXILIARY_VERBS
        self.common_expressions = COMMON_EXPRESSIONS
        self.conjunctions = CONJUNCTIONS
        self.temporal_words = TEMPORAL_WORDS
        self.adverbs_opinion = ADVERBS_OPINION
        self.sentence_starters = SENTENCE_STARTERS
        self.adverb_patterns = ADVERB_PATTERNS
        self.past_verb_patterns = PAST_VERB_PATTERNS
        self.gerund_patterns = GERUND_PATTERNS

    def extract_features(self, sentence: str) -> Dict[str, bool]:
        """
        Extracts linguistic features from a given sentence.

        Args:
            sentence: The sentence to analyze

        Returns:
            Dictionary with feature names as keys and boolean values
        """
        # Convert to lowercase for case-insensitive matching, but keep original for punctuation
        original = sentence
        sentence = sentence.lower()

        # Split into words and remove punctuation for word-level features
        words = [re.sub(r'[^\w\s]', '', word) for word in sentence.split()]
        words = [word for word in words if word]  # Remove empty strings

        # Initialize features dictionary
        features = {}

        # Check for punctuation marks
        features["has_fullstop"] = "." in original
        features["has_question_mark"] = "?" in original
        features["has_exclamation_mark"] = "!" in original
        features["has_comma"] = "," in original
        features["has_semicolon"] = ";" in original
        features["has_colon"] = ":" in original
        features["has_quotation"] = '"' in original or "'" in original

        # Check for word-level features
        features["has_auxiliary"] = any(word in self.auxiliary_verbs for word in words)
        features["has_expression"] = any(word in self.common_expressions for word in words)
        features["has_conjunction"] = any(word in self.conjunctions for word in words)
        features["has_temporal"] = any(word in self.temporal_words for word in words)
        features["has_opinion_adverb"] = any(word in self.adverbs_opinion for word in words)
        # Note: substring match anywhere in the lowercased sentence, not only at its start
        features["has_starter"] = any(starter in sentence for starter in self.sentence_starters)

        # Check for pattern-based features
        features["has_adverb"] = any(re.search(pattern, sentence) for pattern in self.adverb_patterns)
        features["has_past_verb"] = any(re.search(pattern, sentence) for pattern in self.past_verb_patterns)
        features["has_gerund"] = any(re.search(pattern, sentence) for pattern in self.gerund_patterns)

        # Capitalization feature
        features["starts_capitalized"] = bool(original and original[0].isupper())

        return features


def preprocess_sentence(sentence: str) -> str:
    """
    Remove the starting '<name> commented' and ending 'on <date> <time> on <platform>.' parts,
    and trim leading/trailing whitespace.

    Args:
        sentence: The original sentence to process

    Returns:
        The cleaned sentence with patterns removed
    """
    # Remove the starting part '<name> commented'
    sentence = re.sub(r'^.*? commented\s*', '', sentence)

    # Remove the ending part 'on <YYYY-MM-DD> <HH:MM:SS> on <platform>.'
    sentence = re.sub(r'\s*on \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} on .*?\.?$', '', sentence)

    # If the sentence was in quotes, remove only the outer quotes if they remain
    if sentence.startswith('"') and sentence.endswith('"'):
        sentence = sentence[1:-1]

    # Trim leading/trailing whitespace
    return sentence.strip()


def expand_dataset(input_file: str, output_file: str) -> None:
    """
    Process the dataset, add linguistic feature columns, and save the expanded dataset.

    Args:
        input_file: Path to the input CSV file
        output_file: Path to save the output CSV file
    """
    try:
        # Read the dataset
        info(f"Reading input file: {input_file}")
        df = pd.read_csv(input_file)

        if 'Sentence Fragment' not in df.columns:
            error("Input file must contain a 'Sentence Fragment' column.")
            return

        # Create a feature extractor
        extractor = FeatureExtractor()

        # Process each sentence and add features
        info(f"Extracting features from {len(df)} sentences...")

        # Initialize new columns with empty lists
        feature_columns = {
            "has_auxiliary": [],
            "has_fullstop": [],
            "has_question_mark": [],
            "has_exclamation_mark": [],
            "has_comma": [],
            "has_semicolon": [],
            "has_colon": [],
            "has_quotation": [],
            "has_expression": [],
            "has_conjunction": [],
            "has_temporal": [],
            "has_opinion_adverb": [],
            "has_adverb": [],
            "has_starter": [],
            "has_past_verb": [],
            "has_gerund": [],
            "starts_capitalized": []
        }

        # Process sentences with a progress bar
        for _, row in tqdm(df.iterrows(), total=len(df), desc="Processing sentences"):
            # Get and preprocess the sentence
            sentence = row['Sentence Fragment']
            processed_sentence = preprocess_sentence(sentence)

            # Extract features
            features = extractor.extract_features(processed_sentence)

            # Add each feature to its corresponding column
            for feature, value in features.items():
                feature_columns[feature].append(value)

        # Add feature columns to the DataFrame
        for feature, values in feature_columns.items():
            df[feature] = values

        # Use imported feature descriptions
        descriptions = FEATURE_DESCRIPTIONS

        # Attach column descriptions as DataFrame-level metadata.
        # Note: attrs live only in memory and are not written to the CSV.
        df.attrs['descriptions'] = {
            col: desc for col, desc in descriptions.items() if col in df.columns
        }

        # Save to CSV
        info(f"Saving expanded dataset to: {output_file}")
        df.to_csv(output_file, index=False)

        # Print column descriptions for reference
        header("\nColumn Descriptions:")
        desc_list = []
        for col, desc in descriptions.items():
            if col in df.columns:
                desc_list.append(f"{col}: {desc}")
        list_(desc_list)

        success(f"Processing complete. Expanded dataset saved to {output_file}")

    except Exception as e:
        error(f"An error occurred: {str(e)}")
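
To make the effect of the two regular expressions in `preprocess_sentence` concrete, the sketch below applies the same patterns to one sample line. The sample text and the `strip_comment_wrapper` name are hypothetical; the real dataset rows may be formatted differently.

```python
import re


def strip_comment_wrapper(sentence: str) -> str:
    """Apply the same cleaning steps as preprocess_sentence to a single string."""
    # Drop everything up to and including the first " commented"
    sentence = re.sub(r'^.*? commented\s*', '', sentence)
    # Drop a trailing "on <YYYY-MM-DD> <HH:MM:SS> on <platform>." suffix
    sentence = re.sub(r'\s*on \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} on .*?\.?$', '', sentence)
    # Remove one pair of surrounding double quotes, if present
    if sentence.startswith('"') and sentence.endswith('"'):
        sentence = sentence[1:-1]
    return sentence.strip()


# Hypothetical raw row in the "<name> commented ... on <date> <time> on <platform>." shape
raw = 'Alice commented "Great show, loved it!" on 2024-01-15 10:30:00 on Facebook.'
cleaned = strip_comment_wrapper(raw)
print(cleaned)

# Text without the wrapper passes through unchanged
plain = strip_comment_wrapper("Just a plain sentence.")
print(plain)
```

Because text without the wrapper is returned unchanged, applying the same cleaning to already-extracted fragments is normally harmless, although a fragment that itself contains " commented" would lose the text before it.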

#### Expand Dataset

In [None]:
def expand_ds():
    """Add linguistic feature columns to the fragment dataset using the notebook-level paths."""
    input_file = FDPATH
    output_file = EXPANDPATH

    expand_dataset(input_file, output_file)


if __name__ == "__main__":
    expand_ds()

Processing sentences:   0%|          | 0/45341 [00:00<?, ?it/s]

## 📚 Resources

### Dataset
- [Netflix & Facebook Posts Dataset](https://www.kaggle.com/datasets/tomthescientist/netflix-facebook-posts-as-sentences-for-llm-input)


## 🤝 Contributors

| 👤 Name | 🧑‍💻 Role | 📬 GitHub | 🔗 LinkedIn |
|---------|----------|-----------|------------|
| Amr Muhamed | Maintainer | [alaamer12](https://github.com/alaamer12) | [alaamer12](https://linkedin.com/in/alaamer12) |
| Muhamed Ibrahim | Data Engineer | [muhammad-senna](https://github.com/muhammad-senna) | [muhammad-senna](https://linkedin.com/in/muhammad-senna) |

© 2025 Amr Muhamed. All Rights Reserved.

*Last updated: May 12, 2025*