**Regular Expressions in NLP Preprocessing**

In [6]:
import re

# Sample text from the chosen paragraph
text = """This chapter outlines the methodology used in the development of the system. It includes details on the adopted methodology, data collection, preprocessing techniques, CNN architecture, and evaluation metrics. The Rapid Application Development (RAD) methodology was employed to ensure iterative progress and adaptability throughout the project lifecycle.

The system development process followed the Rapid Application Development (RAD) methodology, focusing on iterative development and rapid prototyping. The primary steps included:

1. Dataset Preparation: The Potato Leaf Images Dataset was downloaded from Kaggle and stored in Google Drive for easy access.
2. Data Preprocessing: Techniques such as auto-orientation, resizing, and data augmentation were applied to enhance dataset quality.
3. Model Training and Validation: The CNN model was trained and validated using Google Colab, leveraging the augmented dataset.
4. Model Testing: The trained model was evaluated on a separate test set to ensure robustness and reliability.

The PlantVillage dataset was utilized as the primary data source for this study. Developed by Penn State University and EPFL, the dataset contains 38 classes of diseased and healthy leaves from 14 plants. For this research, the focus was on potato crops, with the following distribution:

- Early Blight: 1,000 images
- Late Blight: 1,000 images
- Healthy: 152 images

To address overfitting and improve the diversity of the training set, various augmentation techniques were applied using the Keras ImageDataGenerator. These included:

- Rotation: Rotating images at specific angles to account for variations in how objects might appear in real-world scenarios.
- Shifting: Randomly moving images within a defined range, horizontally or vertically.
- Shear Transformation: Stretching one axis of the image to simulate distortions.
- Zoom: Magnifying parts of the image to focus on finer details.
- Flipping: Horizontally flipping images to ensure the model becomes invariant to directional changes.
- Brightness Adjustment: Modifying brightness levels to simulate varying lighting conditions.

Evaluation metrics play a crucial role in assessing model performance. The following metrics were employed:

1. Classification Accuracy: Calculated as the ratio of correctly predicted samples to the total number of predictions.
2. Precision: Measures the consistency of the model in identifying true positives, particularly important in imbalanced datasets.
3. Recall: Indicates the proportion of actual positives correctly identified by the model.
4. F1 Score: Combines precision and recall to provide a balanced evaluation.

The proposed CNN model comprises three convolutional layers, each followed by Rectified Linear Unit (ReLU) activation and max-pooling layers. Key components include:

- Flatten Layer: Converts the convolved matrix into a one-dimensional array.
- Fully Connected Layers: Three dense layers with ReLU activation, followed by an output layer with Softmax activation.

The CNN configuration includes:

- Convolution Layers: 3 layers with 3 × 3 filters each.
- Max-Pooling Layers: 3 layers with 2 × 2 pool sizes each.
- Fully Connected Layers: 512, 256, and 128 neurons, respectively.
- Output Layer Activation: Softmax.
- Batch Size: 32.
- Epochs: 50.
- Optimizer: Adam.
- Loss Function: Categorical Cross-Entropy.

Through these carefully selected methodologies and configurations, the model ensures robustness, accuracy, and practical usability in identifying potato leaf diseases. This approach, combined with iterative development, aligns with the project’s goal of enhancing agricultural productivity through advanced machine learning techniques."""


**Task 1: Extracting capitalized words**

In [7]:
capitalized_words = re.findall(r'\b[A-Z][a-z]*\b', text)
print("Capitalized Words:", capitalized_words)

# Count the total number of capitalized words
capitalized_word_count = len(capitalized_words)
print("Total Number of Capitalized Words:", capitalized_word_count)


Capitalized Words: ['This', 'It', 'The', 'Rapid', 'Application', 'Development', 'The', 'Rapid', 'Application', 'Development', 'The', 'Dataset', 'Preparation', 'The', 'Potato', 'Leaf', 'Images', 'Dataset', 'Kaggle', 'Google', 'Drive', 'Data', 'Preprocessing', 'Techniques', 'Model', 'Training', 'Validation', 'The', 'Google', 'Colab', 'Model', 'Testing', 'The', 'The', 'Developed', 'Penn', 'State', 'University', 'For', 'Early', 'Blight', 'Late', 'Blight', 'Healthy', 'To', 'Keras', 'These', 'Rotation', 'Rotating', 'Shifting', 'Randomly', 'Shear', 'Transformation', 'Stretching', 'Zoom', 'Magnifying', 'Flipping', 'Horizontally', 'Brightness', 'Adjustment', 'Modifying', 'Evaluation', 'The', 'Classification', 'Accuracy', 'Calculated', 'Precision', 'Measures', 'Recall', 'Indicates', 'Score', 'Combines', 'The', 'Rectified', 'Linear', 'Unit', 'Key', 'Flatten', 'Layer', 'Converts', 'Fully', 'Connected', 'Layers', 'Three', 'Softmax', 'The', 'Convolution', 'Layers', 'Max', 'Pooling', 'Layers', 'Fully

**Explanation**

    Regex Pattern: \b[A-Z][a-z]*\b
        \b: Matches word boundaries to ensure complete words are captured.
        [A-Z]: Matches the first letter as uppercase.
        [a-z]*: Matches zero or more lowercase letters following the uppercase letter.
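        Combined, this identifies words made of one uppercase letter followed only by lowercase letters. All-caps acronyms such as CNN and mixed-case names such as ReLU or PlantVillage are not matched.
**Purpose:**
    This operation picks up capitalized words: proper nouns, title-cased terms, and the first words of sentences. These are a useful starting point in NLP tasks such as Named Entity Recognition (NER) or identifying significant terms in a document.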
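
To make the pattern's reach concrete, here is a minimal sketch on a short string. The `sample` sentence is an illustrative assumption, not part of the notebook's text.

```python
import re

# Hypothetical sample sentence, used only for illustration.
sample = "The CNN model uses ReLU and was trained on Google Colab."

# Same pattern as above: one uppercase letter, then zero or more lowercase letters.
matches = re.findall(r'\b[A-Z][a-z]*\b', sample)
print(matches)  # ['The', 'Google', 'Colab']
```

Neither CNN nor ReLU appears in the result, because the trailing `\b` cannot be satisfied in the middle of a run of letters.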

**Task 2: Extracting sentences containing the word "data"**

In [11]:
sentences_with_data = re.findall(r'.*?\bdata\b.*?[.?!]', text, flags=re.IGNORECASE)
print("2. Sentences containing 'data':", sentences_with_data)

# Count the total number of sentences containing "data"
sentence_count = len(sentences_with_data)
print("Total Number of Sentences containing 'data':", sentence_count)



2. Sentences containing 'data': ['This chapter outlines the methodology used in the development of the system. It includes details on the adopted methodology, data collection, preprocessing techniques, CNN architecture, and evaluation metrics.', '2. Data Preprocessing: Techniques such as auto-orientation, resizing, and data augmentation were applied to enhance dataset quality.', 'The PlantVillage dataset was utilized as the primary data source for this study.']
Total Number of Sentences containing 'data': 3


**Explanation**

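    Regex Pattern: .*?\bdata\b.*?[.?!]
        .*?: Matches as few characters as possible before the keyword. The dot does not match newlines, so a match never starts on an earlier line.
        \bdata\b: Matches the exact word "data" (case-insensitive because of flags=re.IGNORECASE). "dataset" is not matched.
        .*?[.?!]: Matches everything after "data" up to the next sentence-ending punctuation (., ?, or !).
        Limitation: the leading .*? can run across a sentence-ending period on the same line, so a match may include an earlier sentence. The first result above contains two sentences for this reason.
**Purpose:**
    Extracting sentences with specific keywords is useful in text mining, summarization, and finding context for terms in NLP tasks.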
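
One way to keep each result to a single sentence is to split the text into sentences first and then filter. The sketch below assumes that sentences end in `.`, `?`, or `!` followed by whitespace. The `sample` string is illustrative.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = ("The system is described first. It covers data collection. "
          "Data quality matters! Is the dataset large?")

# Split after sentence-ending punctuation that is followed by whitespace.
sentences = re.split(r'(?<=[.?!])\s+', sample)

# Keep only the sentences that contain the whole word "data".
keyword = re.compile(r'\bdata\b', flags=re.IGNORECASE)
sentences_with_data = [s for s in sentences if keyword.search(s)]
print(sentences_with_data)
# ['It covers data collection.', 'Data quality matters!']
```

This simple splitter would also break after list markers such as "2.", so numbered lists may need to be handled separately.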

**Task 3: Extracting Numeric Values**

In [13]:
# Extract all numeric values (including integers and decimals)
numeric_values = re.findall(r'\b\d+(?:\.\d+)?\b', text)
print("Numeric Values:", numeric_values)

# Counting the total number of numeric values
numeric_count = len(numeric_values)
print("Total Number of Numeric Values:", numeric_count)



Numeric Values: ['1', '2', '3', '4', '38', '14', '1', '000', '1', '000', '152', '1', '2', '3', '4', '3', '3', '3', '3', '2', '2', '512', '256', '128', '32', '50']
Total Number of Numeric Values: 26


**Explanation**

**Regex Pattern:** \b\d+(?:\.\d+)?\b

\b: Matches word boundaries, so digits embedded in a word (such as the 1 in "F1") are skipped.
\d+: Matches one or more digits.
(?:\.\d+)?: Optionally matches a decimal point followed by one or more digits.
Combined, this identifies integers and decimals. Thousands separators are not handled: "1,000" is returned as '1' and '000', so the two occurrences of 1,000 contribute four entries to the count of 26.

**Purpose:**

This operation extracts numeric data, useful for analyzing statistical and measurement-related information in text.
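
If comma-grouped numbers should be kept whole, one option is to try a thousands-separator alternative before the plain pattern. This is a sketch under the assumption that commas inside numbers always group exactly three digits. The `sample` string and variable names are illustrative.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = ("Early Blight: 1,000 images, Healthy: 152 images, "
          "learning rate 0.001, layers 512, 256, and 128.")

# First alternative: 1-3 digits followed by one or more ",ddd" groups.
# Second alternative: the plain integer/decimal pattern used above.
number_pattern = r'\b\d{1,3}(?:,\d{3})+(?:\.\d+)?\b|\b\d+(?:\.\d+)?\b'

tokens = re.findall(number_pattern, sample)
values = [float(t.replace(',', '')) for t in tokens]
print(tokens)  # ['1,000', '152', '0.001', '512', '256', '128']
```

A list written without spaces, such as "512,256", would be read as one number by this pattern, so it suits prose better than tightly packed data.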

**Task 4: Replace Repeated Spaces or Empty Lines with a Single Space**

In [19]:
# Normalizing spacing by removing repeated spaces and empty lines
text_normalized_spacing = re.sub(r'\s+', ' ', text)
print("Text After Normalizing Spacing:\n", text_normalized_spacing[:500])  # Displaying only the first 500 characters for brevity


Text After Normalizing Spacing:
 This chapter outlines the methodology used in the development of the system. It includes details on the adopted methodology, data collection, preprocessing techniques, CNN architecture, and evaluation metrics. The Rapid Application Development (RAD) methodology was employed to ensure iterative progress and adaptability throughout the project lifecycle. The system development process followed the Rapid Application Development (RAD) methodology, focusing on iterative development and rapid prototyp


**Explanation:**

Regex Pattern: \s+
Matches one or more whitespace characters, including spaces, tabs, and newlines.
Each run is replaced with a single space. This removes repeated spaces and empty lines, but it also replaces single line breaks, so paragraphs and list items end up on one line.

**Purpose**: 
Streamline text for NLP pipelines, ensuring consistency in spacing.
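
When paragraph breaks need to survive normalization, the substitution can be done in two steps. The sketch below is one possible variant, not the notebook's method. The `sample` string is illustrative.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = ("First  paragraph,\twith   uneven spacing.\n\n\n"
          "Second paragraph\nwrapped over two lines.")

# Step 1: collapse runs of spaces and tabs, leaving newlines alone.
collapsed = re.sub(r'[ \t]+', ' ', sample)

# Step 2: collapse runs of two or more newlines into one blank line.
normalized = re.sub(r'\n{2,}', '\n\n', collapsed)
print(normalized)
```

Single line breaks are left untouched here. Whether they should also become spaces depends on the downstream tokenizer.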

**Task 5: Counting the Occurrences of the Word “model”**

In [21]:
model_count = len(re.findall(r'\bmodel\b', text, flags=re.IGNORECASE))
print("Occurrences of 'model':", model_count)


Occurrences of 'model': 10


**Explanation:**

**Regex Pattern:** \bmodel\b

\b: Ensures "model" is matched as a complete word.
IGNORECASE: Enables case-insensitive matching.

**Purpose:** Determine the frequency of key terms to gauge their significance in the text.
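
The effect of the word boundaries is easiest to see by comparing counts on a short string. This is a minimal sketch with an illustrative `sample`.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = "The model was trained. Models vary; remodeling a model is costly."

# Whole-word match, as in the cell above.
whole_word = len(re.findall(r'\bmodel\b', sample, flags=re.IGNORECASE))

# Without boundaries, "Models" and "remodeling" are counted too.
substring = len(re.findall(r'model', sample, flags=re.IGNORECASE))

# Allowing an optional plural "s" while still excluding "remodeling".
with_plural = len(re.findall(r'\bmodels?\b', sample, flags=re.IGNORECASE))

print(whole_word, substring, with_plural)  # 2 4 3
```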

**Task 6: Replacing All Numerical Values with a Placeholder**

In [22]:
placeholder_text = re.sub(r'\b\d+(?:\.\d+)?\b', '<NUM>', text)
print("Text with Numerical Values Replaced:", placeholder_text)


Text with Numerical Values Replaced: This chapter outlines the methodology used in the development of the system. It includes details on the adopted methodology, data collection, preprocessing techniques, CNN architecture, and evaluation metrics. The Rapid Application Development (RAD) methodology was employed to ensure iterative progress and adaptability throughout the project lifecycle.

The system development process followed the Rapid Application Development (RAD) methodology, focusing on iterative development and rapid prototyping. The primary steps included:

<NUM>. Dataset Preparation: The Potato Leaf Images Dataset was downloaded from Kaggle and stored in Google Drive for easy access.
<NUM>. Data Preprocessing: Techniques such as auto-orientation, resizing, and data augmentation were applied to enhance dataset quality.
<NUM>. Model Training and Validation: The CNN model was trained and validated using Google Colab, leveraging the augmented dataset.
<NUM>. Model Testing: The tra

**Explanation:**

**Regex Pattern:** \b\d+(?:\.\d+)?\b

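Matches integers and decimals and replaces each match with <NUM>. Because commas are not part of the pattern, a value such as "1,000" becomes "<NUM>,<NUM>" rather than a single placeholder.

**Purpose:** Mask numerical values for anonymization or normalization tasks in text processing.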
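
To mask a comma-grouped number with a single placeholder, a thousands-separator alternative can be tried first. This is a hedged sketch that assumes commas inside numbers always group three digits. The `sample` string is illustrative.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = "Early Blight: 1,000 images across 3 folders at 0.5 zoom."

# Comma-grouped numbers first, then plain integers and decimals.
number_pattern = r'\b\d{1,3}(?:,\d{3})+(?:\.\d+)?\b|\b\d+(?:\.\d+)?\b'

masked = re.sub(number_pattern, '<NUM>', sample)
print(masked)
# Early Blight: <NUM> images across <NUM> folders at <NUM> zoom.
```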



**Task 7: Extracting Acronyms**

In [24]:
acronyms = re.findall(r'\b[A-Z]{2,}\b', text)
print("Acronyms:", acronyms)

# Count the total number of acronyms
acronym_count = len(acronyms)
print("Total Number of Acronyms:", acronym_count)


Acronyms: ['CNN', 'RAD', 'RAD', 'CNN', 'EPFL', 'CNN', 'CNN']
Total Number of Acronyms: 7


**Explanation:**

**Regex Pattern:** \b[A-Z]{2,}\b

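Matches whole words made of two or more consecutive uppercase letters. Mixed-case abbreviations such as ReLU are not matched, because the word boundaries require the entire word to be uppercase.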

**Purpose:**
Identify abbreviations or acronyms, which are essential for understanding domain-specific terms.
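
Since the same acronym usually appears several times, a frequency table can be more informative than the raw list. Below is a small sketch using `collections.Counter` with an illustrative `sample`.

```python
import re
from collections import Counter

# Hypothetical sample text, used only for illustration.
sample = ("The CNN was built with the RAD methodology. "
          "RAD suits a CNN with ReLU layers.")

# Same pattern as above; ReLU is skipped because it is not all uppercase.
acronyms = re.findall(r'\b[A-Z]{2,}\b', sample)
acronym_counts = Counter(acronyms)
print(acronym_counts)
```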

**Task 8: Finding Words With Hyphens**

In [25]:
hyphenated_words = re.findall(r'\b\w+-\w+\b', text)
print("Hyphenated Words:", hyphenated_words)

# Count the total number of hyphenated words
hyphenated_word_count = len(hyphenated_words)
print("Total Number of Hyphenated Words:", hyphenated_word_count)


Hyphenated Words: ['auto-orientation', 'real-world', 'max-pooling', 'one-dimensional', 'Max-Pooling', 'Cross-Entropy']
Total Number of Hyphenated Words: 6


**Explanation:**

**Regex Pattern:** \b\w+-\w+\b

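Matches two runs of word characters joined by a single hyphen, bounded by word boundaries. A term with several hyphens is split into pairs: "state-of-the-art" would be returned as "state-of" and "the-art".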

**Purpose:** Extract hyphenated terms that often represent compound words or specialized terminology.
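
To keep multi-hyphen terms together, the hyphenated part can be repeated with a non-capturing group. This sketch compares both patterns on an illustrative `sample`.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = "A state-of-the-art model used max-pooling and auto-orientation."

# Pattern from the cell above: exactly one hyphen per match.
single_hyphen = re.findall(r'\b\w+-\w+\b', sample)

# Variant: one or more "-word" parts after the first word.
multi_hyphen = re.findall(r'\b\w+(?:-\w+)+\b', sample)

print(single_hyphen)  # ['state-of', 'the-art', 'max-pooling', 'auto-orientation']
print(multi_hyphen)   # ['state-of-the-art', 'max-pooling', 'auto-orientation']
```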

**Task 9: Extracting Nouns Using Heuristics**

In [28]:
noun_candidates = re.findall(r'\b[A-Z][a-z]+\b', text)
print("Noun Candidates:", noun_candidates)

# Count the total number of noun candidates
noun_count = len(noun_candidates)
print("Total Number of Noun Candidates:", noun_count)


Noun Candidates: ['This', 'It', 'The', 'Rapid', 'Application', 'Development', 'The', 'Rapid', 'Application', 'Development', 'The', 'Dataset', 'Preparation', 'The', 'Potato', 'Leaf', 'Images', 'Dataset', 'Kaggle', 'Google', 'Drive', 'Data', 'Preprocessing', 'Techniques', 'Model', 'Training', 'Validation', 'The', 'Google', 'Colab', 'Model', 'Testing', 'The', 'The', 'Developed', 'Penn', 'State', 'University', 'For', 'Early', 'Blight', 'Late', 'Blight', 'Healthy', 'To', 'Keras', 'These', 'Rotation', 'Rotating', 'Shifting', 'Randomly', 'Shear', 'Transformation', 'Stretching', 'Zoom', 'Magnifying', 'Flipping', 'Horizontally', 'Brightness', 'Adjustment', 'Modifying', 'Evaluation', 'The', 'Classification', 'Accuracy', 'Calculated', 'Precision', 'Measures', 'Recall', 'Indicates', 'Score', 'Combines', 'The', 'Rectified', 'Linear', 'Unit', 'Key', 'Flatten', 'Layer', 'Converts', 'Fully', 'Connected', 'Layers', 'Three', 'Softmax', 'The', 'Convolution', 'Layers', 'Max', 'Pooling', 'Layers', 'Fully',

**Explanation:**

**Regex Pattern:** \b[A-Z][a-z]+\b

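Matches words that start with a capital letter followed by one or more lowercase letters, which excludes acronyms and single-letter words. Capitalization is only a rough signal: the results also include sentence-initial words such as This, It, and The, which are not nouns, and lowercase common nouns are missed.

**Purpose:**

Produce a rough list of candidate nouns for downstream tasks; reliable noun extraction requires a part-of-speech tagger.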
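
A simple refinement of the heuristic is to discard capitalized function words that appear only because they start a sentence. The word list below is a small hypothetical example, not a complete stop list, and the `sample` string is illustrative.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = ("The model was trained in Google Colab. It uses Keras. "
          "These results support Penn State.")

# Hypothetical, deliberately short list of capitalized function words.
FUNCTION_WORDS = {'The', 'It', 'This', 'These', 'For', 'To'}

candidates = re.findall(r'\b[A-Z][a-z]+\b', sample)
filtered = [w for w in candidates if w not in FUNCTION_WORDS]
print(filtered)  # ['Google', 'Colab', 'Keras', 'Penn', 'State']
```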

**Task 10: Highlighting All Proper Nouns in the Text**

In [29]:
highlighted_text = re.sub(r'\b[A-Z][a-z]+\b', r'[\g<0>]', text)
print("Text with Highlighted Proper Nouns:", highlighted_text[:500])  # Showing a snippet for brevity


Text with Highlighted Proper Nouns: [This] chapter outlines the methodology used in the development of the system. [It] includes details on the adopted methodology, data collection, preprocessing techniques, CNN architecture, and evaluation metrics. [The] [Rapid] [Application] [Development] (RAD) methodology was employed to ensure iterative progress and adaptability throughout the project lifecycle.

[The] system development process followed the [Rapid] [Application] [Development] (RAD) methodology, focusing on iterative developme


**Explanation:**

**Regex Pattern:** \b[A-Z][a-z]+\b

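Wraps each match in square brackets using \g<0>, which refers to the whole match. As the output shows, the matches include sentence-initial words such as [This] and [The], not only proper nouns.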

**Purpose:**
Emphasize proper nouns visually for quick identification or readability.

**Task 11: Extract Sentences Containing Numbers**

In [30]:
sentences_with_numbers = re.findall(r'[^.!?]*\d+[^.!?]*[.!?]', text)
print("Sentences Containing Numbers:", sentences_with_numbers)

# Count the total number of sentences containing numbers
num_sentences_with_numbers = len(sentences_with_numbers)
print("Total Number of Sentences Containing Numbers:", num_sentences_with_numbers)


Sentences Containing Numbers: [' The primary steps included:\n\n1.', '\n2.', '\n3.', '\n4.', ' Developed by Penn State University and EPFL, the dataset contains 38 classes of diseased and healthy leaves from 14 plants.', ' For this research, the focus was on potato crops, with the following distribution:\n\n- Early Blight: 1,000 images\n- Late Blight: 1,000 images\n- Healthy: 152 images\n\nTo address overfitting and improve the diversity of the training set, various augmentation techniques were applied using the Keras ImageDataGenerator.', ' The following metrics were employed:\n\n1.', '\n2.', '\n3.', '\n4.', ' F1 Score: Combines precision and recall to provide a balanced evaluation.', '\n\nThe CNN configuration includes:\n\n- Convolution Layers: 3 layers with 3 × 3 filters each.', '\n- Max-Pooling Layers: 3 layers with 2 × 2 pool sizes each.', '\n- Fully Connected Layers: 512, 256, and 128 neurons, respectively.', '\n- Batch Size: 32.', '\n- Epochs: 50.']
Total Number of Sentences Con

**Explanation:**

**Regex Pattern:** [^.!?]*\d+[^.!?]*[.!?]

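Matches a run of non-terminator characters that contains at least one digit and ends at the next ., !, or ?. Because [^.!?] also matches newlines, and the period in a list marker such as "1." counts as a terminator, the results include fragments like '\n2.' and multi-line spans rather than clean sentences. "F1 Score" is matched as well, since the 1 in F1 is a digit.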

**Purpose:**

Extract numeric-related context for analysis, like statistics or measurements.
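
Cleaner results are possible if list markers are stripped and the text is processed line by line before looking for digits. The sketch below is one such approach, assuming list markers are digits followed by a period, or a dash or bullet character. The `sample` string is illustrative.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = ("The steps were:\n"
          "1. Dataset Preparation: Images were downloaded.\n"
          "2. Training: The model ran for 50 epochs.\n"
          "- Batch Size: 32.")

# Remove leading list markers ("1.", "-", "•") at the start of each line.
stripped = re.sub(r'^[ \t]*(?:\d+\.|[-•])[ \t]+', '', sample, flags=re.MULTILINE)

# Split each line into sentences and keep those that contain a digit.
sentences_with_numbers = []
for line in stripped.splitlines():
    for sentence in re.split(r'(?<=[.!?])\s+', line):
        if re.search(r'\d', sentence):
            sentences_with_numbers.append(sentence)

print(sentences_with_numbers)
# ['Training: The model ran for 50 epochs.', 'Batch Size: 32.']
```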

**Task 12: Extract All Bullet Points**

In [32]:
bullet_points = re.findall(r'^\s*[\d•-]\s+.*', text, flags=re.MULTILINE)
print("Bullet Points:", bullet_points)

# Count the total number of bullet points
num_bullet_points = len(bullet_points)
print("Total Number of Bullet Points:", num_bullet_points)


Bullet Points: ['\n- Early Blight: 1,000 images', '- Late Blight: 1,000 images', '- Healthy: 152 images', '\n- Rotation: Rotating images at specific angles to account for variations in how objects might appear in real-world scenarios.', '- Shifting: Randomly moving images within a defined range, horizontally or vertically.', '- Shear Transformation: Stretching one axis of the image to simulate distortions.', '- Zoom: Magnifying parts of the image to focus on finer details.', '- Flipping: Horizontally flipping images to ensure the model becomes invariant to directional changes.', '- Brightness Adjustment: Modifying brightness levels to simulate varying lighting conditions.', '\n- Flatten Layer: Converts the convolved matrix into a one-dimensional array.', '- Fully Connected Layers: Three dense layers with ReLU activation, followed by an output layer with Softmax activation.', '\n- Convolution Layers: 3 layers with 3 × 3 filters each.', '- Max-Pooling Layers: 3 layers with 2 × 2 pool siz

**Explanation:**

**Regex Pattern:** ^\s*[\d•-]\s+.*

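Matches lines that start with a single digit, a bullet character (•), or a dash, followed by at least one whitespace character. Numbered items such as "1." are not matched because the period after the digit is not whitespace, so the output contains only the dash bullets. The leading \s* can also absorb a preceding blank line, which is why some results begin with '\n'.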

**Purpose:**

Extract hierarchical points for structured document parsing.
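
A variant that also captures numbered items can allow a period or closing parenthesis after the digits. It uses `[ \t]` instead of `\s` so that matches stay on one line. This is a sketch with an illustrative `sample`, not the pattern used in the cell above.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = ("Steps:\n\n"
          "1. Dataset Preparation\n"
          "2. Data Preprocessing\n\n"
          "Classes:\n"
          "- Early Blight: 1,000 images\n"
          "- Healthy: 152 images\n"
          "38 classes in total")

# A list marker is either digits followed by "." or ")", or a dash/bullet.
list_item = r'^[ \t]*(?:\d+[.)]|[-•])[ \t]+.*$'

items = re.findall(list_item, sample, flags=re.MULTILINE)
print(items)
```

The final line, "38 classes in total", is correctly left out because its digits are not followed by a list-marker character.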

**Task 13: Extract the First Word of Each Sentence**

In [34]:
first_words = [match.group(0) for match in re.finditer(r'\b\w+\b', text)]
print("First Words of Sentences:", first_words[:100])  # Showing the first 100 entries for brevity

# Count the total number of unique first words
unique_first_words = set(first_words)
print("Total Number of Unique First Words:", len(unique_first_words))


First Words of Sentences: ['This', 'chapter', 'outlines', 'the', 'methodology', 'used', 'in', 'the', 'development', 'of', 'the', 'system', 'It', 'includes', 'details', 'on', 'the', 'adopted', 'methodology', 'data', 'collection', 'preprocessing', 'techniques', 'CNN', 'architecture', 'and', 'evaluation', 'metrics', 'The', 'Rapid', 'Application', 'Development', 'RAD', 'methodology', 'was', 'employed', 'to', 'ensure', 'iterative', 'progress', 'and', 'adaptability', 'throughout', 'the', 'project', 'lifecycle', 'The', 'system', 'development', 'process', 'followed', 'the', 'Rapid', 'Application', 'Development', 'RAD', 'methodology', 'focusing', 'on', 'iterative', 'development', 'and', 'rapid', 'prototyping', 'The', 'primary', 'steps', 'included', '1', 'Dataset', 'Preparation', 'The', 'Potato', 'Leaf', 'Images', 'Dataset', 'was', 'downloaded', 'from', 'Kaggle', 'and', 'stored', 'in', 'Google', 'Drive', 'for', 'easy', 'access', '2', 'Data', 'Preprocessing', 'Techniques', 'such', 'as', 'auto', '

**Explanation:**

**Regex Pattern:** \b\w+\b

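As written, this pattern matches every word in the text, not only the first word of each sentence, which is why the output above lists all words in order and the set counts all unique words. Restricting the result to sentence-initial words requires anchoring on sentence boundaries, for example by splitting the text into sentences first and taking the first \w+ match from each.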

**Purpose:**

Analyze sentence structure or text summarization.
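
A minimal sketch of extracting only sentence-initial words follows, assuming sentences end in `.`, `!`, or `?` followed by whitespace. The `sample` string is illustrative.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = ("This chapter outlines the method. It includes details! "
          "The model was trained. Was it accurate?")

# Split into sentences, then take the first word of each one.
sentences = re.split(r'(?<=[.!?])\s+', sample.strip())

first_words = []
for sentence in sentences:
    match = re.match(r'\w+', sentence)  # re.match anchors at the sentence start
    if match:
        first_words.append(match.group(0))

unique_first_words = set(first_words)
print(first_words)  # ['This', 'It', 'The', 'Was']
```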



**Task 14: Identify Sentences Ending With Specific Words**

In [35]:
specific_endings = re.findall(r'[^.!?]*\b(?:datasets|technologies)\b[.!?]', text, flags=re.IGNORECASE)
print("Sentences Ending with Specific Words:", specific_endings)

# Count the number of such sentences
num_specific_endings = len(specific_endings)
print("Total Sentences Ending with Specific Words:", num_specific_endings)


Sentences Ending with Specific Words: [' Precision: Measures the consistency of the model in identifying true positives, particularly important in imbalanced datasets.']
Total Sentences Ending with Specific Words: 1


**Explanation:**

**Regex Pattern:** [^.!?]*\b(?:datasets|technologies)\b[.!?]

Matches sentences whose last word before the terminating ., !, or ? is one of the listed keywords (datasets or technologies), case-insensitively.

**Purpose:**
    
Pinpoint focus sentences relevant to given keywords.
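
When the keyword list changes often, the alternation can be built from a Python list, with `re.escape` guarding against special characters in the keywords. This is a sketch. The `keywords` list mirrors the pattern above, and the `sample` string is illustrative.

```python
import re

# Keywords taken from the pattern above; the list form is an assumption.
keywords = ["datasets", "technologies"]

# Hypothetical sample text, used only for illustration.
sample = ("Precision matters in imbalanced datasets. Datasets can be large. "
          "We compared several technologies!")

# Build "datasets|technologies" safely, then reuse the same sentence pattern.
alternation = "|".join(re.escape(k) for k in keywords)
pattern = re.compile(rf'[^.!?]*\b(?:{alternation})\b[.!?]', flags=re.IGNORECASE)

matches = [m.strip() for m in pattern.findall(sample)]
print(matches)
# ['Precision matters in imbalanced datasets.', 'We compared several technologies!']
```

The middle sentence is skipped because "Datasets" is not its final word.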

**Task 15: Extract Parenthetical Statements**

In [36]:
parenthetical_statements = re.findall(r'\([^)]*\)', text)
print("Parenthetical Statements:", parenthetical_statements)

# Count the total number of parenthetical statements
num_parenthetical_statements = len(parenthetical_statements)
print("Total Number of Parenthetical Statements:", num_parenthetical_statements)


Parenthetical Statements: ['(RAD)', '(RAD)', '(ReLU)']
Total Number of Parenthetical Statements: 3


**Explanation:**

**Regex Pattern:** \([^)]*\)

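Matches an opening parenthesis, any characters other than ")", and the closing parenthesis, so each result includes the parentheses themselves. Nested parentheses are not handled.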

**Purpose:**

Extract supplementary or explanatory information for further analysis.
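
If only the content inside the parentheses is needed, a capturing group returns it without the delimiters. The sketch below also separates short abbreviation-like entries from longer remarks using a simple "no spaces" heuristic, which is an assumption. The `sample` string is illustrative.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = ("The Rapid Application Development (RAD) methodology uses "
          "Rectified Linear Unit (ReLU) activation (see above).")

# The capturing group returns only the text between the parentheses.
inner = re.findall(r'\(([^)]*)\)', sample)

# Heuristic: entries without spaces are treated as abbreviations.
abbreviations = [s for s in inner if ' ' not in s]

print(inner)          # ['RAD', 'ReLU', 'see above']
print(abbreviations)  # ['RAD', 'ReLU']
```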