<a href="https://colab.research.google.com/github/farrelrassya/GettingStartedwithNLP/blob/main/01.Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction to NLP

Natural language processing (or NLP) is a field that addresses various ways in which computers can deal with natural (that is, human) language. Regardless of your occupation or background, there is a good chance you have heard about NLP before, especially in recent years with the media covering the impressive capabilities of intelligent machines that can understand and produce natural language. This is what has brought NLP into the spotlight, and what might have attracted you to this book. You might be a programmer who wants to learn new skills, a machine learning or data science practitioner who realizes there is a lot of potential in processing natural language, or you might be generally interested in how language works and how to process it automatically. Either way, welcome to NLP! This book aims to help you get started with it.

# NLP Timeline: A Journey Through Approaches  

**Figure 1.1** illustrates the evolution of Natural Language Processing (NLP) techniques over time. Here’s a breakdown of the key phases:  

<img src="https://raw.githubusercontent.com/farrelrassya/GettingStartedwithNLP/main/01.%20Chapter%2001/Figure%201.1.png" alt="Figure 1.1" width="700">

---

## 1. **Rule-Based Approaches (1980s)**  
- **Early Days of NLP**:  
  Systems relied on hand-written rules (e.g., grammar checks, keyword matching).  
  Example: Basic spellcheckers or syntax analyzers.  
- **Limitation**:  
  Rule-based systems were rigid and struggled with ambiguity or new language patterns.  

## 2. **Statistical & Machine Learning Approaches (1990s–2000s)**  
- **The Web Era**:  
  The **World Wide Web** (1990s) exploded with text data, enabling statistical models.  
  Example: Spam filters, early search engines.  
- **Key Idea**:  
  Let data drive decisions using probabilities and algorithms (e.g., SVMs, decision trees).  

## 3. **Deep Learning Approaches (2010s–Present)**  
- **Hardware Revolution**:  
  Advances in **computer hardware** (GPUs, TPUs) made training neural networks feasible.  
  Example: Transformers, BERT, ChatGPT.  
- **Why It’s Powerful**:  
  Models learn complex patterns directly from data, excelling at tasks like translation and context understanding.  

---

### What Fuelled the Shifts?  
- **1990s**: The Web provided vast data for statistical learning.  
- **2010s**: Hardware advancements unlocked deep learning’s potential.  

### Modern NLP Today:  
Modern NLP combines **all three approaches**:  
- **Rules** for simple tasks (e.g., regex).  
- **ML** for structured decisions (e.g., classification).  
- **Deep Learning** for cutting-edge AI (e.g., chatbots, summarization).  



# Information Retrieval: How Machines Find What We Need

Information search systems solve two critical challenges that would be overwhelming for humans:

1. Finding relevant information among vast collections of documents (like searching through an entire filesystem or the internet)
2. Identifying the *most* relevant documents from the potentially relevant ones

## The Basic Concept of Information Filtering

When searching manually through a large collection (like thousands of meeting notes), we would typically use a simple filtering approach:

- Look for documents containing specific keywords (e.g., "meeting" and "management")
- Exclude all documents that don't contain these terms

This filtering process creates a subset of potentially relevant documents from the larger collection, essentially dividing documents into two categories:
- Documents that match our search criteria
- Documents that don't match our search criteria

This "simple filtering" is the fundamental starting point for more sophisticated information retrieval systems.
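
The filtering step can be sketched in a few lines of Python. This is a minimal illustration rather than a real search system: the file names and note texts are made up for the example, and words are matched by a naive lowercase split on whitespace.

```python
# Hypothetical mini-collection of meeting notes (names and texts are invented).
documents = {
    "notes_01.txt": "Agenda for the management meeting on Monday",
    "notes_02.txt": "Lunch menu for the office party",
    "notes_03.txt": "Meeting room booking instructions",
}
keywords = ["meeting", "management"]

def matches(text, keywords):
    """Return True only if every keyword occurs as a word in the text."""
    words = text.lower().split()
    return all(keyword in words for keyword in keywords)

# Split the collection into the two categories described above.
matching = [name for name, text in documents.items() if matches(text, keywords)]
non_matching = [name for name in documents if name not in matching]

print(matching)      # ['notes_01.txt']
print(non_matching)  # ['notes_02.txt', 'notes_03.txt']
```

Only the first note contains both keywords, so it is the only one that survives the filter. The third note mentions "meeting" but not "management" and is excluded.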

## Why This Matters

Modern search systems save us enormous amounts of time by automating this process. They can rapidly scan billions of documents and apply complex relevance algorithms that go far beyond simple keyword matching.

Without these systems, we would be overwhelmed by information and unable to efficiently find what we need - truly like searching for "a needle in a haystack."


<img src="https://raw.githubusercontent.com/farrelrassya/GettingStartedwithNLP/main/01.%20Chapter%2001/Figure%201.4.png" alt="Figure 1.4" width="700">


# Numerical Representation in Information Retrieval

Information retrieval systems use numerical representations to process human language since machines excel at handling numbers rather than natural language.

## The Challenge of Machine Understanding

While machines are becoming better at processing human language, they don't "understand" language the way humans do. When we search for words like "meeting" and "management" in documents, we:
- Have mental representations of these words
- Know how they're spelled and how they sound
- Can naturally recognize them in text

Machines lack these inherent representations, so they need a different approach.

## Vector Representations: The Machine's Language

To enable machines to process language, we translate words into numerical representations. The most common representation in natural language processing is a **vector**.

Vector representations are versatile and can represent:
- Individual characters
- Words
- Entire documents

## How Vector Representation Works

A vector in this context is similar to:
- Vectors from high school mathematics
- Arrays in programming languages

For a simple query like "management meeting":
1. Each word gets its own dimension in the vector (or cell in an array)
2. The "management" dimension stores information related to "management"
3. The "meeting" dimension stores information related to "meeting"

This creates a structured numerical representation that machines can process effectively.

<img src="https://raw.githubusercontent.com/farrelrassya/GettingStartedwithNLP/main/01.%20Chapter%2001/Figure%201.5.png" alt="Figure 1.5" width="700">


This numerical transformation allows machines to perform complex operations on text data, enabling the search capabilities we rely on daily.

# Vector Representation in Information Retrieval with Code Implementation

## Quantifying Word Importance

When representing documents and queries as vectors, we need to determine how to quantify each word's contribution. For a simple query like "management meeting", both terms contribute equally to the information need, which we can represent by their frequency count:

- The query "management meeting" contains each word exactly once
- This gives us a vector representation of [1, 1]
- Graphically, this creates a point with coordinates (1,1) in a 2D space
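
As a small sketch of this idea, the query vector can be built by counting how often each word of a fixed vocabulary appears in the query. The variable names `vocabulary` and `query_vector` are chosen for this illustration only.

```python
# One dimension per word; the order of this list fixes the order of the vector.
vocabulary = ["management", "meeting"]

query = "management meeting"
tokens = query.split()

# Count the occurrences of each vocabulary word in the query.
query_vector = [tokens.count(word) for word in vocabulary]

print(query_vector)  # [1, 1]
```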

## Representing Documents

Documents are represented using the same approach - by counting word occurrences:

<img src="https://raw.githubusercontent.com/farrelrassya/GettingStartedwithNLP/main/01.%20Chapter%2001/Figure%201.7.png" alt="Figure 1.7" width="700">



### Document Examples
- **Doc1**: Contains 3 occurrences of "management" and 5 occurrences of "meeting"
  - Vector representation: [3, 5]
  - This creates a point with coordinates (3,5) in our vector space

- **Doc2**: Contains 4 occurrences of "management" and 1 occurrence of "meeting"
  - Vector representation: [4, 1]
  - This creates a point with coordinates (4,1) in our vector space

## The Process Explained

1. We start with an empty vector (array) containing zeros for each dimension (word)
2. We read through the document, treating text between whitespaces as individual words
3. When we encounter "management", we increment the count in the first position (index 0)
4. When we encounter "meeting", we increment the count in the second position (index 1)
5. The final vector contains the total count of each word in the document

## Implementation Notes

- The code provided is designed to work in Jupyter Notebooks
- This same approach can be applied to any document by changing the input text
- Python lists are zero-indexed, which is why the "management" count is stored at position 0
- The same code could be used to create a vector for Doc2, which would result in [4, 1]

This numerical representation transforms text into a mathematical form that computers can process efficiently, enabling information retrieval systems to perform operations like measuring document similarity and relevance.

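```python
# Doc1 as a toy text: "..." stands in for all the other words in the document
doc1 = "meeting ... management ... meeting ... management ... meeting "
doc1 += "... management ... meeting ... meeting"

# One dimension per word: index 0 counts "management", index 1 counts "meeting"
vector = [0, 0]

for word in doc1.split(" "):
    if word == "management":
        vector[0] = vector[0] + 1
    if word == "meeting":
        vector[1] = vector[1] + 1

print(vector)
```

```
[3, 5]
```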
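
The same counting logic can be wrapped in a function so that it applies to any text. The sketch below is one possible way to do this. The function name `count_vector` is an assumption of this example, and the Doc2 text is a hypothetical stand-in constructed to contain "management" four times and "meeting" once, since the original only gives Doc2's counts.

```python
def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in a whitespace-split text."""
    index = {word: position for position, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for word in text.split():
        if word in index:
            vector[index[word]] += 1
    return vector

vocabulary = ["management", "meeting"]

doc1_text = ("meeting ... management ... meeting ... management ... meeting "
             "... management ... meeting ... meeting")
# Hypothetical text for Doc2, built to match the counts given above.
doc2_text = "management ... management ... meeting ... management ... management"

print(count_vector(doc1_text, vocabulary))  # [3, 5]
print(count_vector(doc2_text, vocabulary))  # [4, 1]
```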


# Finding Relevant Documents Through Vector Distance

## Moving Beyond Basic Vector Representation

While our simple vector representation is a powerful starting point, real-world applications require more sophisticated approaches:

1. We need to accommodate any query, not just predefined words like "management" or "meeting"
2. We need proper word detection in text (not just simple string matching)
3. We need to automatically identify vector dimensions rather than hardcoding them
4. We need scalability beyond fixed-size arrays

These improvements will be addressed in detail in later chapters. For now, let's focus on how vector representations help us determine document relevance.
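
To give a rough idea of what the third and fourth points could look like, here is a minimal sketch that derives the dimensions from the texts themselves instead of hardcoding them. It is not the approach of the later chapters: the regular-expression tokenizer, the function names, and the two example sentences are all assumptions made for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and keep only alphabetic runs as words (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(texts):
    """Collect every distinct word across all texts, in sorted order."""
    return sorted({token for text in texts for token in tokenize(text)})

def vectorize(text, vocabulary):
    """Count occurrences of each vocabulary word in the text."""
    index = {word: position for position, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in tokenize(text):
        if token in index:
            vector[index[token]] += 1
    return vector

# Invented example texts.
texts = ["Management meeting on Monday.", "Meeting notes: the meeting was short."]
vocabulary = build_vocabulary(texts)
vectors = [vectorize(text, vocabulary) for text in texts]

print(vocabulary)
for vector in vectors:
    print(vector)
```

The vocabulary grows with the collection, so the vector size is no longer fixed in advance.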

## From Representation to Relevance

The fundamental question: Once we have our documents and query represented as vectors, how do we determine which document is most relevant to the query?

This is where the geometric interpretation of vectors becomes incredibly useful. In our vector space:
- Each document and query is a point with specific coordinates
- Similar documents (with similar word frequencies) will be positioned close together
- The most relevant document to a query should be the one closest to it in this vector space

<img src="https://raw.githubusercontent.com/farrelrassya/GettingStartedwithNLP/main/01.%20Chapter%2001/Figure%201.8.png" alt="Figure 1.8" width="700">


## Measuring Distance in Vector Space

To find the most relevant document, we need a way to measure the distance between points in our vector space. This is where the Euclidean distance formula comes in.

The Euclidean distance between two points is calculated using the Pythagorean theorem:

$$
ED(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}
$$

Where:
- p and q are the two points (vectors)
- p₁, p₂, etc. are the coordinates of point p
- q₁, q₂, etc. are the coordinates of point q

## Calculating Document Relevance

Let's apply this to our example:
- Query = [1, 1] (coordinates in the "management" and "meeting" dimensions)
- Doc1 = [3, 5]
- Doc2 = [4, 1]

<img src="https://raw.githubusercontent.com/farrelrassya/GettingStartedwithNLP/main/01.%20Chapter%2001/Figure%201.9.png" alt="Figure 1.9" width="700">


For the distance between Query and Doc1:
- Difference in "management" dimension: 3 - 1 = 2
- Difference in "meeting" dimension: 5 - 1 = 4
- ED(Query, Doc1) = √(2² + 4²) = √(4 + 16) = √20 ≈ 4.47

For the distance between Query and Doc2:
- Difference in "management" dimension: 4 - 1 = 3
- Difference in "meeting" dimension: 1 - 1 = 0
- ED(Query, Doc2) = √(3² + 0²) = √9 = 3

Since Doc2 has a smaller Euclidean distance to the Query (3 < 4.47), we consider Doc2 more relevant to the Query than Doc1.

## Extending to Higher Dimensions

The beauty of this approach is that it scales to any number of dimensions. For a vector space with n dimensions (representing n different words), the formula remains the same:

$$
ED(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2}
$$

This allows information retrieval systems to represent documents with thousands or millions of unique words, and still calculate relevance using the same fundamental principles.

By converting text into numerical vector representations and using distance measurements, computers can effectively determine document relevance without truly "understanding" language in the human sense.

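```python
import math

query = [1, 1]
doc1 = [3, 5]
sq_length = 0

# Sum the squared differences along each dimension
for index in range(0, len(query)):
    sq_length += math.pow((doc1[index] - query[index]), 2)

# The square root of that sum is the Euclidean distance
print(math.sqrt(sq_length))
```

```
4.47213595499958
```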
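
The calculation above covers only Doc1. As a sketch, the same formula can be written as a function that works for any number of dimensions and then applied to both documents. The function name `euclidean_distance` is chosen for this example.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two vectors of equal dimensionality."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((q_i - p_i) ** 2 for p_i, q_i in zip(p, q)))

query = [1, 1]
doc1 = [3, 5]
doc2 = [4, 1]

distance_doc1 = euclidean_distance(query, doc1)
distance_doc2 = euclidean_distance(query, doc2)

print(round(distance_doc1, 2))  # 4.47
print(distance_doc2)            # 3.0
```

This reproduces the hand calculation: Doc2 is at distance 3 and Doc1 at about 4.47 from the query.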


Our Euclidean distance estimation tells us that Doc2 is closer in space to the query than Doc1, so it is more similar, right? Well, there’s one more point that we are missing at the moment. Note that if we typed in "management" and "meeting" multiple times in our query, the content and information need would not change, but the vector itself would. In particular, the length of the vector will be different, but the angle between the first version of the vector and the second one won’t change, as you can see in Figure 1.10.

<img src="https://raw.githubusercontent.com/farrelrassya/GettingStartedwithNLP/main/01.%20Chapter%2001/Figure%201.10.png" width="700">

### Cosine Similarity for Better Document Comparison

### The Problem with Euclidean Distance

While Euclidean distance provides a useful way to measure the distance between document vectors, it has a significant limitation: it's affected by document length.

Longer documents naturally have higher word count values in their vectors, without necessarily being more relevant. For example:
- A longer document may contain more occurrences of words simply because it contains more words overall
- Repeating the same query multiple times would create a vector with larger values, but the information content remains the same

This creates a bias where longer documents might appear less relevant simply due to their length, not their actual content similarity.

### Angle vs. Distance: A Better Approach

Rather than measuring the absolute distance between vectors, a more effective approach is to measure the angle between length-normalized vectors:

- When two documents contain similar proportions of words (regardless of document length), the angle between their vectors will be small
- When documents contain different word distributions, the angle will be large
- The angle between identical content (even if repeated) will be zero

This makes angle measurement a much more stable and meaningful way to compare documents than raw distance.
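
A short sketch of length normalization makes this concrete. Dividing a vector by its own length yields a vector of length 1 that points in the same direction, so the query [1, 1] and the same query typed twice, [2, 2], end up with the same normalized form. The function name `normalize` is an assumption of this example.

```python
import math

def normalize(vector):
    """Scale a vector to length 1 while keeping its direction."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in vector]

query = [1, 1]
repeated_query = [2, 2]  # the same query typed twice

print(normalize(query))
print(normalize(repeated_query))

# Both normalized vectors coincide (up to floating-point rounding).
same_direction = all(
    math.isclose(a, b) for a, b in zip(normalize(query), normalize(repeated_query))
)
print(same_direction)  # True
```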

### Cosine Similarity: Measuring Vector Angles

<img src="https://raw.githubusercontent.com/farrelrassya/GettingStartedwithNLP/main/01.%20Chapter%2001/Figure%201.11.png" width="700">

The cosine similarity metric calculates the cosine of the angle between two vectors, providing a measure of their directional similarity regardless of their magnitudes:

- Cosine of 0° = 1 (maximum similarity: vectors point in exactly the same direction)
- Cosine of 90° = 0 (no similarity: vectors are perpendicular)
- Cosine of 180° = -1 (maximum dissimilarity: vectors point in opposite directions)

<div style="background-color: #E7F3FE; border-left: 6px solid #2196F3; padding: 16px; margin: 16px 0;">
  <h3 style="margin-top: 0; color: #2196F3;">Cosine Similarity</h3>
  <p style="margin: 0;">
    Cosine similarity estimates the similarity between two nonzero vectors in space (or two texts
    represented by such vectors) on the basis of the angle between these vectors. For example, the cosine
    of 0° equals 1, indicating maximum similarity, while the cosine of 180° equals –1, the lowest value.
    Unlike Euclidean distance, this measure is not affected by vector length.
  </p>
</div>


### Key Properties of Cosine Similarity:

1. **Range**: Values range from -1 (opposite directions) to 1 (same direction)
2. **Length Independence**: Not affected by vector magnitude (document length)
3. **Proportionality**: Measures whether vectors have similar proportions of values across dimensions
4. **Intuitive**: Higher values indicate greater similarity


### Calculating Cosine Similarity Between Document Vectors

### Understanding Vector Relationships

When comparing documents as vectors, understanding their directional relationship provides insights about their content similarity:

- **Orthogonal vectors (90° angle, cosine = 0)**: Represent completely different content with no overlap. For example, if vector1 represents a query with only the word "management" and vector2 represents a query with only the word "meeting", they share no common terms.

- **Opposing vectors (180° angle, cosine = -1)**: In word occurrence-based vectors, this cannot occur, because word counts are never negative. The cosine similarity for such document vectors therefore always lies between 0 and 1.
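
The orthogonal case can be checked with a few lines of code. This sketch uses the standard dot-product formula for cosine similarity, which the next section walks through step by step. The function name `cosine_similarity` and the two one-word query vectors are assumptions of this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

only_management = [1, 0]  # query containing only "management"
only_meeting = [0, 1]     # query containing only "meeting"

print(cosine_similarity(only_management, only_meeting))  # 0.0
```

Because every coordinate is a count of zero or more, the dot product in the numerator can never be negative, which is why the result never drops below 0 for such vectors.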

### Calculating Cosine Similarity

Cosine similarity is calculated using the dot product of two vectors divided by the product of their lengths:

```
cosine(vec1, vec2) = dot_product(vec1, vec2) / (length(vec1) * length(vec2))
```

This is derived from the Euclidean definition of dot product:

```
dot_product(vec1, vec2) = length(vec1) * length(vec2) * cosine(vec1, vec2)
```
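
In the same notation as the Euclidean distance formula above, and for vectors $p$ and $q$ with $n$ dimensions, the definition can be written out as follows (here $\theta$ denotes the angle between the two vectors):

$$
\cos(\theta) = \frac{p \cdot q}{\lVert p \rVert \, \lVert q \rVert} = \frac{\sum_{i=1}^{n} p_i q_i}{\sqrt{\sum_{i=1}^{n} p_i^2} \, \sqrt{\sum_{i=1}^{n} q_i^2}}
$$

The numerator is the dot product, and each square root in the denominator is the length of one of the vectors.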

### The Dot Product

The dot product is simply the sum of the coordinate products of two vectors along each dimension:

```
dot_product(query, Doc1) = 1*3 + 1*5 = 8
dot_product(query, Doc2) = 1*4 + 1*1 = 5
dot_product(Doc1, Doc2) = 3*4 + 5*1 = 17
```

### Vector Length

The length (magnitude) of a vector is calculated using the Euclidean distance formula from the origin (0,0):

```
length(query) = √((1-0)² + (1-0)²) ≈ 1.41
length(Doc1) = √((3-0)² + (5-0)²) ≈ 5.83
length(Doc2) = √((4-0)² + (1-0)²) ≈ 4.12
```

### Final Cosine Similarity Calculations

Using these components, we can calculate the cosine similarity:

```
cos(query, Doc1) = dot_product(query, Doc1) / (length(query) * length(Doc1))
                  = 8 / (1.41 * 5.83) ≈ 0.97

cos(query, Doc2) = dot_product(query, Doc2) / (length(query) * length(Doc2))
                  = 5 / (1.41 * 4.12) ≈ 0.86
```

```python
import math

query = [1, 1]
doc1 = [3, 5]

def length(vector):
    """Return the Euclidean length (magnitude) of a vector."""
    sq_length = 0
    for index in range(0, len(vector)):
        sq_length += math.pow(vector[index], 2)
    return math.sqrt(sq_length)

def dot_product(vector1, vector2):
    """Return the sum of the coordinate products along each dimension."""
    if len(vector1) != len(vector2):
        # Raise instead of returning a string, which would break the division below
        raise ValueError("Unmatching dimensionality")
    dot_prod = 0
    for index in range(0, len(vector1)):
        dot_prod += vector1[index] * vector2[index]
    return dot_prod

cosine = dot_product(query, doc1) / (length(query) * length(doc1))
print(cosine)  # approximately 0.97
```

```
0.9701425001453319
```


### Interpreting the Results

Our calculations show that:
- Doc1 has a cosine similarity of 0.97 with the query
- Doc2 has a cosine similarity of 0.86 with the query

Despite Doc2 being closer in Euclidean distance (as we calculated earlier), Doc1 is actually more similar to the query when we measure by content distribution. This is because the content in both the query and Doc1 is more evenly balanced between "management" and "meeting," while Doc2 is heavily weighted toward "management" with only one mention of "meeting."

This demonstrates why cosine similarity is often a better measure for document relevance than Euclidean distance - it captures the proportional distribution of content rather than being influenced by absolute term frequencies or document length.
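
To see the disagreement between the two measures side by side, the sketch below ranks the chapter's two documents under each of them. The helper names are chosen for this example, and plain Python lists are used so that it runs without extra libraries.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((q_i - p_i) ** 2 for p_i, q_i in zip(p, q)))

def cosine_similarity(p, q):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(p_i * q_i for p_i, q_i in zip(p, q))
    length_p = math.sqrt(sum(p_i ** 2 for p_i in p))
    length_q = math.sqrt(sum(q_i ** 2 for q_i in q))
    return dot / (length_p * length_q)

query = [1, 1]
documents = {"Doc1": [3, 5], "Doc2": [4, 1]}

# Smallest distance wins for Euclidean; largest value wins for cosine.
closest = min(documents, key=lambda name: euclidean_distance(query, documents[name]))
most_similar = max(documents, key=lambda name: cosine_similarity(query, documents[name]))

print(closest)       # Doc2
print(most_similar)  # Doc1
```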

The following cell checks the effect of scaling on both measures: Doc3 and Doc4 simply double the counts of Doc1 and Doc2.

```python
import numpy as np

# Each position corresponds to counts of ["management", "meeting"].
query = np.array([2, 2])   # management=2, meeting=2
doc1 = np.array([1, 1])    # management=1, meeting=1
doc2 = np.array([3, 2])    # management=3, meeting=2
doc3 = 2 * doc1            # Doc3 is Doc1 scaled (double the term frequencies)
doc4 = 2 * doc2            # Doc4 is Doc2 scaled in the same way

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    eps = 1e-10  # small constant that guards against division by zero
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

documents = {
    "Doc1": doc1,
    "Doc2": doc2,
    "Doc3 (2x Doc1)": doc3,
    "Doc4 (2x Doc2)": doc4
}

print("Euclidean Distances from Query:")
for name, vec in documents.items():
    print(f"{name}: {euclidean_distance(query, vec):.3f}")

print("\nCosine Similarities with Query:")
for name, vec in documents.items():
    print(f"{name}: {cosine_similarity(query, vec):.3f}")
```

```
Euclidean Distances from Query:
Doc1: 1.414
Doc2: 1.000
Doc3 (2x Doc1): 0.000
Doc4 (2x Doc2): 4.472

Cosine Similarities with Query:
Doc1: 1.000
Doc2: 0.981
Doc3 (2x Doc1): 1.000
Doc4 (2x Doc2): 0.981
```
