<a href="https://colab.research.google.com/github/farrelrassya/GettingStartedwithNLP/blob/main/02.First_NLP_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Your first NLP example

In this chapter, you will learn how to implement your own NLP application from scratch. In doing so, you will also learn how to structure a typical NLP pipeline and how to apply a simple machine learning algorithm to solve your task. The particular application you will implement is spam filtering. Chapter 1 introduced it as one of the classic tasks at the intersection of NLP and machine learning.

### Introducing NLP in practice: Spam filtering
In this book, you use spam filtering as your first practical NLP application because it exemplifies a widespread family of tasks: text classification. Text classification comprises several applications that we discuss in this book, including user profiling.

We apply classification in our everyday lives pretty regularly: classifying things simply implies that we try to put them into clearly defined groups, classes, or categories. In fact, we tend to classify all sorts of things all the time. Here are some examples:
1. Based on our level of engagement and interest in a movie, we may classify it as interesting or boring.
2. Based on temperature, we classify water as cold or hot.
3. Based on the amount of sunshine, humidity, wind strength, and air temperature, we classify the weather as good or bad.
4. Based on the number of wheels, we classify vehicles into unicycles, bicycles, tricycles, quadricycles, cars, and so on.
5. Based on the presence of an engine, we may classify two-wheeled vehicles into bicycles and motorcycles.

## Understanding the task
Consider the following scenario: you have a collection of spam and normal emails from the past. You are tasked with building a spam filter, which for any future incoming email can predict whether this email is spam or not. Consider these questions:
1. How can you use the provided data?
2. What characteristics of the emails might be particularly useful, and how will you extract them?
3. What will be the sequence of steps in this application?

In this section, we will discuss this scenario and look into the implementation steps. In total, the pipeline for this task will consist of five steps, visualized as a flow chart in figure 2.4. Now let’s look into each of these steps in more detail.

<img src="https://raw.githubusercontent.com/farrelrassya/GettingStartedwithNLP/main/02.%20Chapter%2002/Figure%202.4.png" style="width:700px;">


### Step 1: Define the data and classes
First, you need to ask yourself what format the email messages are delivered in for this task. For instance, in a real-life situation, you might need to extract the messages from the mail agent application. However, for simplicity, let’s assume that someone has extracted the emails for you and stored them in text format. The normal emails are stored in a separate folder (let’s call it Ham), and spam emails are stored in a Spam folder.

<div style="background-color: #E7F3FE; border-left: 6px solid #2196F3; padding: 16px; margin: 16px 0;">
  <h3 style="margin-top: 0; color: #2196F3;">Note</h3>
  <p style="margin: 0;">
    If you are wondering why “normal” emails are sometimes referred to as “ham” within the context of spam detection, please review the historical background of the term “spam” for further clarification. For example, you may consult the definition provided by <a href="https://www.merriam-webster.com/dictionary/spam" target="_blank" style="color: #2196F3;">Merriam-Webster's Dictionary</a>.
  </p>
</div>


In cases where past spam and ham emails have already been identified (for instance, by extracting emails from the INBOX and SPAM folders), manual labeling is unnecessary. However, you must still instruct the machine-learning algorithm by clearly specifying which folder corresponds to ham and which to spam. This initial step—defining class labels and determining the number of classes—is essential for any spam-detection or text-classification pipeline, as it lays the groundwork for subsequent stages such as data preprocessing, feature extraction, and model training and testing (see Figure 2.5).


<img src="https://raw.githubusercontent.com/farrelrassya/GettingStartedwithNLP/main/02.%20Chapter%2002/Figure%202.5.png" style="width:700px;">
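One way to make the folder-to-class mapping explicit is a small dictionary. The sketch below is only an illustration: the folder names `Ham` and `Spam` come from the text, while `FOLDER_TO_LABEL`, the in-memory `emails_by_folder` sample, and the `label_emails` helper are hypothetical stand-ins for emails that would normally be read from disk.

```python
# Hypothetical mapping from folder name to class label.
FOLDER_TO_LABEL = {"Ham": "ham", "Spam": "spam"}

# Hypothetical stand-in for emails already read from each folder.
emails_by_folder = {
    "Ham": ["See the minutes from the last meeting attached"],
    "Spam": ["Participate in our new lottery!", "Try out this new medicine"],
}

def label_emails(emails_by_folder, folder_to_label):
    """Pair every email with the class label of the folder it came from."""
    labeled = []
    for folder, emails in emails_by_folder.items():
        label = folder_to_label[folder]
        labeled.extend((email, label) for email in emails)
    return labeled

labeled_emails = label_emails(emails_by_folder, FOLDER_TO_LABEL)
classes = sorted({label for _, label in labeled_emails})

print(f"{len(classes)} classes: {classes}")
print(f"{len(labeled_emails)} labeled emails")
```

Keeping the mapping in one place means the number of classes and their names are defined once and reused by every later step.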


### Step 2: Split the text into words
Next, you need to define the features so that the machine knows what type of information, or what properties of the emails, to pay attention to. Before you can do that, there is one more step to perform. As we’ve just discussed in the previous exercise, email content provides significant information as to whether an email is ham or spam.

One approach to extract email content is to treat the entire email as a single textual feature, such as using the full text of meeting minutes for ham emails or the complete body of a spam email. Although this method allows the algorithm to identify emails based on exact phrase matches, it is unlikely to encounter identical texts repeatedly, and even a slight variation in characters could alter the feature significantly. Consequently, a more effective strategy is to use smaller text segments, like individual words, as features. These words not only convey spam-related information (e.g., "lottery" as a potential indicator of spam) but are also more likely to appear across multiple emails, enhancing the algorithm's robustness.
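The difference between the two strategies can be seen on a toy pair of messages. In this minimal sketch, both message strings are made up for illustration, and words are obtained with a plain whitespace split.

```python
# Two hypothetical spam messages that differ in a single word.
seen_spam = "Collect your lottery winnings today"
new_email = "Collect your lottery winnings now"

# Whole-text feature: only an exact match counts.
exact_match = seen_spam == new_email

# Word features: compare the sets of whitespace-separated words.
seen_words = set(seen_spam.split())
new_words = set(new_email.split())
shared_words = seen_words & new_words

print(exact_match)           # False
print(sorted(shared_words))  # ['Collect', 'lottery', 'winnings', 'your']
```

The whole-text comparison fails because of one changed word, while the word-level comparison still finds four shared features, including "lottery".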

In [None]:
text = "Define which data represents each class for the machine learning algorithm"
words = text.split()  # By default, split() uses any whitespace as the delimiter

for word in words:
    print(word)

Define
which
data
represents
each
class
for
the
machine
learning
algorithm


<img src="https://raw.githubusercontent.com/farrelrassya/GettingStartedwithNLP/main/02.%20Chapter%2002/Figure%202.6.png" style="width:700px;">

### Text Tokenization and Punctuation Handling

### The Problem

When text is split only by whitespace, punctuation marks remain attached to words, resulting in tokens like "algorithm." that differ from their punctuation-free counterparts like "algorithm".

### Solution Approaches

To address this issue, modify the splitting strategy so that punctuation marks are separated from words. This can be achieved through two main methods:

1. **Using Python's Regular Expressions module**
   - Provides powerful pattern matching capabilities
   - Can separate punctuation with specific regex patterns

2. **Implementing a simple iterative algorithm**
   - Examine each character sequentially
   - If character is whitespace:
     - Add current word (if non-empty) to token list
     - Reset current word
   - If character is punctuation:
     - Add current word (if non-empty) to token list and reset it
     - Add the punctuation mark as a separate token

### Importance

This improved tokenization method ensures more accurate text processing, which is crucial for:
- More precise text analysis
- Better performance in downstream tasks (e.g., spam detection)
- Consistent token identification regardless of punctuation

### Character-by-Character Tokenization Algorithm

### Algorithm Description
The algorithm processes text one character at a time:

- When encountering **whitespace** with a non-empty current word:
  - Add the current word to the token list
  - Reset the current word buffer

- When encountering **punctuation**:
  - If current word is empty:
    - Add just the punctuation mark as a token
  - If current word is non-empty:
    - Add the current word as a token
    - Add the punctuation mark as a separate token
    - Reset the current word buffer

### Limitations
This approach successfully generates a list of words and punctuation tokens, but has important limitations:

- May incorrectly split abbreviations and special cases:
  - "i.e." → ["i", ".", "e", "."]
  - "U.S.A." → ["U", ".", "S", ".", "A", "."]

### Recommendation
For handling special cases and exceptions more effectively, consider:
- Advanced tokenization methods using regular expressions
- NLP toolkits with specialized tokenization capabilities
- Custom rules for common abbreviations and domain-specific terms

In [None]:
text = 'Define which data represents "ham" class and which data represents "spam" class for the machine learning algorithm.'
delimiters = ['"', '.']  # List of punctuation marks to be treated as separate tokens

words = []       # List to store the resulting tokens
current_word = ""  # Variable to build up characters of the current word

for char in text:
    if char == " ":
        # When a whitespace is encountered, add the current word (if not empty) to the tokens list
        if current_word != "":
            words.append(current_word)
            current_word = ""
    elif char in delimiters:
        # If a punctuation mark is encountered and a word is being built, append both the word and punctuation
        if current_word != "":
            words.append(current_word)
            words.append(char)
            current_word = ""
        else:
            # If no word is being built, just add the punctuation as a separate token
            words.append(char)
    else:
        # For any other character, add it to the current word
        current_word += char

# After the loop, add any remaining word to the tokens list
if current_word != "":
    words.append(current_word)

print(words)

['Define', 'which', 'data', 'represents', '"', 'ham', '"', 'class', 'and', 'which', 'data', 'represents', '"', 'spam', '"', 'class', 'for', 'the', 'machine', 'learning', 'algorithm', '.']


<div style="background-color: #E7F3FE; border-left: 6px solid #2196F3; padding: 16px; margin: 16px 0;">
  <h3 style="margin-top: 0; color: #2196F3;">Tokenization</h3>
  <p style="margin: 0;">
    Tokenization is the process of identifying or extracting individual words or tokens from a continuous stream of text. As the first step in text preprocessing, it plays a critical role in natural language processing (NLP). Although whitespace and punctuation marks typically act as effective delimiters, there are notable exceptions, such as abbreviations like “U.S.A.”, which require more refined handling. Tokenizers are specialized NLP tools that often utilize carefully constructed regular expressions or advanced machine learning algorithms to manage these complexities efficiently.
  </p>
</div>
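As a rough illustration of the regular-expression approach, the sketch below uses Python's `re` module. The pattern and the `regex_tokenize` name are assumptions made for this example, not the tokenizer of any particular toolkit; the pattern keeps dotted abbreviations such as "U.S.A." together and splits off all other punctuation.

```python
import re

# Assumed pattern; the alternatives are tried from left to right:
#   (?:[A-Za-z]\.){2,}  dotted abbreviations such as "U.S.A." or "i.e."
#   \w+                 runs of letters, digits, and underscores
#   [^\w\s]             any single remaining punctuation mark
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]")

def regex_tokenize(text):
    """Split text into word, abbreviation, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(regex_tokenize('The "U.S.A." example, i.e. an abbreviation.'))
# ['The', '"', 'U.S.A.', '"', 'example', ',', 'i.e.', 'an', 'abbreviation', '.']
```

Even this pattern is only a heuristic: when a sentence ends with an abbreviation, the final period is absorbed into the abbreviation token, which is one reason dedicated tokenizers are usually preferred.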

Let’s now define step 2 of your algorithm as follows: apply tokenization to split the running text into words, which are going to serve as features (figure 2.7).


<img src="https://raw.githubusercontent.com/farrelrassya/GettingStartedwithNLP/main/02.%20Chapter%2002/Figure%202.7.png" style="width:700px;">


### Step 3: Extract and normalize the features
Now we look closely at the extracted words and see whether they are all equally good to be used as features, that is, whether they are equally indicative of spam-related content. Suppose two emails use a different format: one says

 ``` Collect your lottery winnings ```

 while another one says

 ```Collect Your Lottery Winnings ```  

The algorithm that splits these messages into words will end up with different word lists because, for instance, lottery ≠ Lottery, even though the meaning is the same. To get rid of formatting differences such as uppercase versus lowercase, you can convert all the extracted words to lowercase using Python's built-in string functionality. Therefore, step 3 in your algorithm should be defined as follows: extract and normalize the features; for example, by converting all words to lowercase (figure 2.8).
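The effect of lowercasing can be checked directly on the two example messages. This minimal sketch uses a whitespace split for brevity and Python's built-in `str.lower()`; the variable names are chosen for illustration.

```python
first = "Collect your lottery winnings"
second = "Collect Your Lottery Winnings"

# Without normalization, the two word lists differ.
print(first.split() == second.split())  # False

# Lowercasing every word makes the two lists identical.
first_normalized = [word.lower() for word in first.split()]
second_normalized = [word.lower() for word in second.split()]

print(first_normalized == second_normalized)  # True
print(first_normalized)  # ['collect', 'your', 'lottery', 'winnings']
```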

<img src="https://raw.githubusercontent.com/farrelrassya/GettingStartedwithNLP/main/02.%20Chapter%2002/Figure%202.8.png" style="width:700px;">

### Step 4: Train a classifier
At this point, you will end up with two sets of data: one linked to the spam class and another one linked to the ham class. Each dataset is preprocessed in the same way in steps 2 and 3, and the features are extracted. Next, you need to let the machine use this data to build the connection between the set of features (properties) that describe each type of email (spam or ham) and the labels attached to each type. In step 4, a machine-learning algorithm tries to build a statistical model, a function, that helps it distinguish between the two classes. This is what happens during the learning (training) phase. Figure 2.9 is a refresher visualizing the training and test processes.

<img src="https://raw.githubusercontent.com/farrelrassya/GettingStartedwithNLP/main/02.%20Chapter%2002/Figure%202.9.png" style="width:700px;">


So, step 4 of the algorithm should be defined as follows: define a machine-learning model and train it on the data with the features predefined in the previous steps (figure 2.10).

<img src="https://raw.githubusercontent.com/farrelrassya/GettingStartedwithNLP/main/02.%20Chapter%2002/Figure%202.10.png" style="width:700px;">
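To make the idea of a statistical model that links features to labels more concrete, here is a deliberately simplified Naive Bayes-style sketch in plain Python. It is an illustration under stated assumptions, not the implementation used later in this chapter: the four training emails are made up, the helper names `train_word_model` and `predict` are hypothetical, only words present in an email contribute to the score, and add-one smoothing is assumed so that unseen word and class combinations never yield a zero probability.

```python
import math
from collections import Counter, defaultdict

def train_word_model(labeled_emails):
    """Count, per class, how many emails contain each lowercased word."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in labeled_emails:
        class_counts[label] += 1
        for word in set(text.lower().split()):
            word_counts[label][word] += 1
    return class_counts, word_counts

def predict(text, class_counts, word_counts):
    """Return the class with the highest smoothed log-probability score."""
    total_emails = sum(class_counts.values())
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)

    scores = {}
    for label, n_emails in class_counts.items():
        # Start from the class prior: the share of training emails in this class.
        score = math.log(n_emails / total_emails)
        for word in set(text.lower().split()):
            if word in vocabulary:  # words never seen in training are skipped
                # Add-one smoothed estimate of P(word present | class).
                score += math.log((word_counts[label][word] + 1) / (n_emails + 2))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training data.
training_data = [
    ("Participate in our new lottery", "spam"),
    ("Collect your lottery winnings", "spam"),
    ("See the minutes from the last meeting", "ham"),
    ("Agenda for the next meeting attached", "ham"),
]
class_counts, word_counts = train_word_model(training_data)

print(predict("lottery winnings", class_counts, word_counts))  # spam
print(predict("meeting agenda", class_counts, word_counts))    # ham
```

Training here amounts to counting how often each word occurs with each label; prediction combines those counts into a score per class and picks the larger one.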


### Training and Testing Machine Learning Models for Email Classification

Once you've built a machine learning model that maps features (like word occurrences) to labels (spam or ham), you need to verify how well it actually performs. This validation process is crucial for ensuring your model will work effectively on new, unseen emails.

### Understanding the Need for Separate Testing

During training, your algorithm learns which features correlate with each class:
- Words like "lottery" might strongly indicate spam
- Words like "meeting" might strongly indicate legitimate (ham) emails

However, simply checking performance on the same data used for training creates a misleading evaluation. The model already "knows" these answers because it was trained on them. It would be like giving students the same questions for homework and final exam - it doesn't truly test their ability to apply knowledge to new situations.

### The Train-Test Split Methodology

To properly evaluate model performance, follow these steps:

1. **Shuffle your data** to ensure random distribution and avoid any systematic bias
   - This prevents situations where all emails of one type might end up in either training or testing

2. **Split the dataset** into two separate portions:
   - **Training set** (typically 80% of data): Used to train the model and allow it to learn patterns
   - **Test set** (typically 20% of data): Reserved exclusively for evaluation

3. **Train your classifier** using only the training set
   - The model learns to associate features with classes based on this data

4. **Evaluate performance** using only the test set
   - This provides a realistic estimate of how your model will perform on new, unseen emails

<img src="https://raw.githubusercontent.com/farrelrassya/GettingStartedwithNLP/main/02.%20Chapter%2002/Figure%202.11.png" style="width:700px;">
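The shuffle-and-split procedure above can be sketched with the standard library alone. The `shuffle_and_split` helper, its default values, and the toy dataset are assumptions made for this example; the fixed seed only serves to make the shuffle reproducible.

```python
import random

def shuffle_and_split(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of items and split it into training and test portions."""
    shuffled = list(items)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    train_size = int(len(shuffled) * train_fraction)
    return shuffled[:train_size], shuffled[train_size:]

# Hypothetical toy dataset: 3 spam and 7 ham emails, initially grouped by class.
dataset = [(f"spam email {i}", "spam") for i in range(3)]
dataset += [(f"ham email {i}", "ham") for i in range(7)]

toy_train, toy_test = shuffle_and_split(dataset)
print(len(toy_train), len(toy_test))  # 8 2
```

Because the two slices never overlap, no email used for training can leak into the evaluation.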


### The Importance of Dataset Separation

The training and test sets must remain completely separate throughout the entire process. The test set should never be used during the training phase - it must be treated as "unseen data" that the model encounters only during final evaluation.

This separation ensures:
- A fair assessment of model performance
- Detection of overfitting (when a model performs well on training data but poorly on new data)
- Realistic expectations for real-world performance

By following this methodology, you can trust that your spam classifier's performance metrics are reliable indicators of how it will perform when deployed to filter actual incoming emails.

### Step 5: Evaluate the classifier
Step 5 involves evaluating the classifier's performance on the test data. This is done by determining the proportion of emails that the classifier labels correctly, that is, assigning the spam label to spam emails and the ham label to non-spam emails. This measure is known as accuracy and is calculated using the following equation:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

where:

1. TP (True Positives): Correctly identified spam emails.
2. TN (True Negatives): Correctly identified ham emails.
3. FP (False Positives): Ham emails incorrectly labeled as spam.
4. FN (False Negatives): Spam emails incorrectly labeled as ham.

While accuracy provides an overall measure of the classifier's performance, it is essential to also consider the distribution of classes (spam vs. ham) and the individual performance on each class. This ensures that the evaluation captures the strengths and weaknesses of the classifier comprehensively.
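The accuracy formula, and the reason class distribution matters, can be illustrated with a few lines of Python. All counts below are hypothetical and chosen only to keep the arithmetic simple: a test set of 100 emails with 30 spam and 70 ham.

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of all emails that received the correct label."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical results on a test set of 30 spam and 70 ham emails.
tp, tn, fp, fn = 24, 63, 7, 6

print(accuracy(tp, tn, fp, fn))  # 0.87
print(tp / (tp + fn))            # share of spam labeled correctly: 0.8
print(tn / (tn + fp))            # share of ham labeled correctly: 0.9

# A trivial "classifier" that labels every email as ham on the same test set:
# it catches no spam (tp = 0, fn = 30) but gets all ham right (tn = 70, fp = 0).
print(accuracy(0, 70, 0, 30))    # 0.7
```

The trivial classifier reaches 70% accuracy without detecting a single spam email, which is why accuracy should be read alongside the class distribution and the per-class results.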

## Implementing your own spam filter
Now let’s implement each of the five steps. It’s time to open Jupyter and create a new notebook to start coding your own spam filter.

In [50]:
# Download the archive from GitHub (use the raw link)
!wget -O enron1.tar.gz https://raw.githubusercontent.com/farrelrassya/GettingStartedwithNLP/main/enron1.tar.gz

# Extract the tar.gz archive
!tar -xzf enron1.tar.gz

# List the extracted contents (optional)
!ls -l

--2025-03-16 09:01:49--  https://raw.githubusercontent.com/farrelrassya/GettingStartedwithNLP/main/enron1.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1802573 (1.7M) [application/octet-stream]
Saving to: ‘enron1.tar.gz’


2025-03-16 09:01:49 (24.5 MB/s) - ‘enron1.tar.gz’ saved [1802573/1802573]

total 1776
-rw-r--r-- 1 root root    1535 Mar 16 08:45 dataset.zip
drwx------ 4 1006  513    4096 May 15  2006 enron1
-rw-r--r-- 1 root root 1802573 Mar 16 09:01 enron1.tar.gz
drwxr-xr-x 1 root root    4096 Mar 13 13:31 sample_data


In [51]:
import os
import codecs

def read_in(folder):
    """
    Reads all non-hidden files in the specified folder and returns a list of their contents.

    Args:
        folder (str): Path to the folder containing files.

    Returns:
        list: A list where each element is the content of a file.
    """
    files = os.listdir(folder)
    contents = []
    for file_name in files:
        # Skip hidden files (starting with a dot)
        if not file_name.startswith('.'):
            file_path = os.path.join(folder, file_name)
            with codecs.open(file_path, "r", encoding="ISO-8859-1", errors="ignore") as f:
                contents.append(f.read())
    return contents

In [52]:
spam_list = read_in("enron1/spam/")
ham_list = read_in("enron1/ham/")

# Print the number of files in the spam and ham folders
print(len(spam_list))
print(len(ham_list))

# Print the contents of the first file in each folder
print(spam_list[0])
print(ham_list[0])

1500
3672
Subject: hello paliourg , remember me katie we met online .
hi paliourg ,
it ' s katie remember me ? we met online last week . anyways i just signed up to the largest adult dating site ever !
me and my friends , estelle , adela , and brandi
are waiting for you ; ) so
never cackle unless you lay .
the bigger they are the harder they fall . . silence is less injurious than a bad reply . . nothing is ill said if it is not ill taken . . a fool in a gown is none the wiser . .
oaks may fall when reeds take the storm .
too many clicks spoil the browse .
no more ?
. better safe than sorry . . true beauty lies within . . a bad excuse is better then none . .
always you are to be rich next year . . anger and hate hinder good counsel . . an elephant never forgets . .
fore - warned is fore - armed . . don ' t cross the bridge till you come to it . . variety is the spice of life . .

Subject: november prelim wellhead production - estimate
daren ,
fyi .
bob
- - - - - - - - 

In [55]:
import random

all_emails = [(email_content, "spam") for email_content in spam_list]
all_emails += [(email_content, "ham") for email_content in ham_list]
random.seed(42)
random.shuffle(all_emails)
print(f"Dataset size = {len(all_emails)} emails")

Dataset size = 5172 emails


In [56]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')

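def get_features(text):
    """Map each lowercased token in the text to True (word-presence features)."""
    features = {}
    # Lowercase before tokenizing so that "Lottery" and "lottery" become the same feature
    for word in word_tokenize(text.lower()):
        features[word] = True
    return features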

all_features = [(get_features(email), label) for (email, label) in all_emails]

print(get_features("Participate In Our New Lottery NOW!"))
print(len(all_features))
print(len(all_features[0][0]))
print(len(all_features[99][0]))

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


{'participate': True, 'in': True, 'our': True, 'new': True, 'lottery': True, 'now': True, '!': True}
5172
499
41


In [57]:
from nltk import NaiveBayesClassifier, classify

def train(features, proportion):
    """
    Splits the given features into training and testing sets and trains a Naive Bayes classifier.

    This function divides the feature set into a training set and a test set based on the specified
    proportion (e.g., 0.8 for 80% training data). It then trains the Naive Bayes classifier on the training set.

    Args:
        features (list): A list of feature tuples with labels.
        proportion (float): The fraction of the data to use for training (e.g., 0.8 for 80%).

    Returns:
        tuple: A tuple containing the training set, test set, and the trained classifier.
    """
    train_size = int(len(features) * proportion)
    # Initialize the training and test sets
    train_set, test_set = features[:train_size], features[train_size:]
    print(f"Training set size = {len(train_set)} emails")
    print(f"Test set size = {len(test_set)} emails")

    # Train the Naive Bayes classifier using the training set
    classifier = NaiveBayesClassifier.train(train_set)
    return train_set, test_set, classifier

# Example usage:
train_set, test_set, classifier = train(all_features, 0.8)

Training set size = 4137 emails
Test set size = 1035 emails


In [58]:
def evaluate(train_set, test_set, classifier):
    """
    Evaluates the performance of the given classifier by printing the accuracy on both
    the training set and the test set, and displaying the most informative features.

    Args:
        train_set (list): The training dataset with features and labels.
        test_set (list): The testing dataset with features and labels.
        classifier (nltk.NaiveBayesClassifier): The trained Naive Bayes classifier.
    """
    # Check accuracy on the training and test sets
    print(f"Accuracy on the training set = {classify.accuracy(classifier, train_set)}")
    print(f"Accuracy on the test set = {classify.accuracy(classifier, test_set)}")

    # Display the top 50 most informative features
    classifier.show_most_informative_features(50)

# Call the evaluate function to check classifier performance
evaluate(train_set, test_set, classifier)

Accuracy on the training set = 0.957457094512932
Accuracy on the test set = 0.9584541062801932
Most Informative Features
               forwarded = True              ham : spam   =    197.1 : 1.0
                     hou = True              ham : spam   =    186.4 : 1.0
                    2004 = True             spam : ham    =    163.0 : 1.0
            prescription = True             spam : ham    =    129.3 : 1.0
                    pain = True             spam : ham    =     94.0 : 1.0
                    2005 = True             spam : ham    =     90.8 : 1.0
                    spam = True             spam : ham    =     89.1 : 1.0
                     ect = True              ham : spam   =     82.9 : 1.0
                  farmer = True              ham : spam   =     81.4 : 1.0
                  differ = True             spam : ham    =     77.9 : 1.0
                   super = True             spam : ham    =     77.9 : 1.0
                featured = True             spam : ham

In [59]:
from nltk.text import Text
from nltk.tokenize import word_tokenize

def concordance(data_list, search_word):
    """
    Displays the concordance of the specified search word in each email from the data list.

    This function tokenizes each email's content to lowercase and creates an NLTK Text object.
    If the search word is found in the token list, it displays its context using NLTK's concordance method.

    Args:
        data_list (list): List of email contents as strings.
        search_word (str): The word to search for in the emails.
    """
    for email in data_list:
        # Tokenize email text and convert to lowercase
        word_list = word_tokenize(email.lower())
        text_list = Text(word_list)
        if search_word in word_list:
            text_list.concordance(search_word)

print("STOCKS in HAM:")
concordance(ham_list, "stocks")

print("\n\nSTOCKS in SPAM:")
concordance(spam_list, "stocks")

STOCKS in HAM:
Displaying 1 of 1 matches:
ur member directory . * follow your stocks and news headlines , exchange files
Displaying 1 of 1 matches:
ur member directory . * follow your stocks and news headlines , exchange files
Displaying 1 of 1 matches:
ur member directory . * follow your stocks and news headlines , exchange files
Displaying 1 of 1 matches:
ad my portfolio is diversified into stocks that have lost even more money than


STOCKS in SPAM:
Displaying 1 of 1 matches:
ecializing in undervalued small cap stocks for immediate breakout erhc and exx
Displaying 1 of 1 matches:
in apple investments , inc profiled stocks . in order to be in full compliance
Displaying 2 of 2 matches:
his email pertaining to investing , stocks , securities must be understood as 
ntative before deciding to trade in stocks featured within this email . none o
Displaying 3 of 3 matches:
n how many times have you seen good stocks but you couldn ' t get your hands o
his email pertaining to investing , stoc

In [60]:
# Example lists of spam and ham emails
test_spam_list = ["Participate in our new lottery!", "Try out this new medicine"]
test_ham_list = [
    "See the minutes from the last meeting attached",
    "Investors are coming to our office on Monday"
]

# Build (email_content, label) tuples for each email
test_emails = [(email_content, "spam") for email_content in test_spam_list]
test_emails += [(email_content, "ham") for email_content in test_ham_list]

# Extract features for each email using the get_features function
new_test_set = [(get_features(email), label) for (email, label) in test_emails]

# Evaluate the classifier with the existing training set and the new test set
evaluate(train_set, new_test_set, classifier)

Accuracy on the training set = 0.957457094512932
Accuracy on the test set = 1.0
Most Informative Features
               forwarded = True              ham : spam   =    197.1 : 1.0
                     hou = True              ham : spam   =    186.4 : 1.0
                    2004 = True             spam : ham    =    163.0 : 1.0
            prescription = True             spam : ham    =    129.3 : 1.0
                    pain = True             spam : ham    =     94.0 : 1.0
                    2005 = True             spam : ham    =     90.8 : 1.0
                    spam = True             spam : ham    =     89.1 : 1.0
                     ect = True              ham : spam   =     82.9 : 1.0
                  farmer = True              ham : spam   =     81.4 : 1.0
                  differ = True             spam : ham    =     77.9 : 1.0
                   super = True             spam : ham    =     77.9 : 1.0
                featured = True             spam : ham    =     74.7 

In [61]:
for email in test_spam_list:
    print(email)
    print(classifier.classify(get_features(email)))
for email in test_ham_list:
    print(email)
    print(classifier.classify(get_features(email)))

Participate in our new lottery!
spam
Try out this new medicine
spam
See the minutes from the last meeting attached
ham
Investors are coming to our office on Monday
ham


In [62]:
while True:
    email = input("Type in your email here (or press 'Enter'): ")
    # An empty input ends the loop
    if not email:
        break
    prediction = classifier.classify(get_features(email))
    print(f"This email is likely {prediction}\n")

Type in your email here (or press 'Enter'): Buy new meds
This email is likely spam

Type in your email here (or press 'Enter'): Let's Schedule a meeting for tommorow
This email is likely ham

Type in your email here (or press 'Enter'): stock options fasts
This email is likely spam

Type in your email here (or press 'Enter'): investor are coming
This email is likely ham

Type in your email here (or press 'Enter'): 0
This email is likely ham

Type in your email here (or press 'Enter'): 
