# Natural Language Processing

Linear algebra is a powerful tool in Natural Language Processing (NLP) for handling and analyzing text data. Let's break down how it's used at various stages of NLP, starting from the basics and building on one running example. We'll go through tokenization, embedding, word/character statistics, document-term matrices, and document similarity.

## Tokenization

Consider a simple sentence "Hello World! This is an example."

Tokenization is the process of breaking down text into smaller units, such as words or phrases. It's the first step in preparing text for NLP tasks.

At this stage, linear algebra isn't directly applied. However, the outcome of tokenization sets the stage for using linear algebra in later steps.

In [16]:
# Sample sentences
documents = ["Hello world!", "This is an example."]

# Tokenization
tokenized_docs = [doc.lower().replace("!", "").replace(".", "").split() for doc in documents]
print("Tokenized Documents:", tokenized_docs)


Tokenized Documents: [['hello', 'world'], ['this', 'is', 'an', 'example']]
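The chained `replace` calls above only strip `!` and `.`, so any other punctuation would stay attached to a token. A minimal sketch of a more general approach, assuming that a token is any run of lowercase letters, digits, or apostrophes (the `tokenize` helper and its pattern are illustrative choices, not a standard):

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

sample_docs = ["Hello world!", "This is an example."]
sample_tokens = [tokenize(doc) for doc in sample_docs]
print(sample_tokens)
# [['hello', 'world'], ['this', 'is', 'an', 'example']]
```

For these two sentences the result matches the output above; the difference only shows up on text with commas, question marks, or other punctuation.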


## Embedding

After we tokenize the previous sentences, we end up with the tokens "hello", "world", "this", "is", "an", "example".

Embedding converts tokens into numerical vectors that represent the tokens in a high-dimensional space. Words with similar meanings are closer in this space.

Each word is represented as a vector in, say, a 100-dimensional space. For example, "hello" could be represented as the vector `[0.5, -0.2, ..., 0.1]`.

If "Hello" and "world" are similar in some context, their vectors will be closer in the embedding space. Linear algebra operations like cosine similarity can measure this closeness.

For simplicity, we'll simulate embeddings by assigning random vectors to each unique word. In practice, you'd use pre-trained models like `Word2Vec`, `GloVe`, or `fastText`.

In [17]:
import numpy as np

# Unique words
unique_words = set(word for doc in tokenized_docs for word in doc)

# Simulated embeddings
embeddings = {word: np.random.rand(5) for word in unique_words}  # 5-dimensional vectors
print("Word Embeddings:", embeddings)


Word Embeddings: {'this': array([0.91214246, 0.36497166, 0.08030509, 0.19464836, 0.05500404]), 'hello': array([0.62905354, 0.80303779, 0.33973286, 0.22803549, 0.00468848]), 'world': array([0.77174385, 0.26235544, 0.35802619, 0.33824357, 0.9782507 ]), 'an': array([0.95591613, 0.57198752, 0.91885446, 0.38596546, 0.15889245]), 'is': array([0.18077022, 0.00288256, 0.17246388, 0.17740929, 0.63035225]), 'example': array([0.0714305 , 0.72558867, 0.63004908, 0.10963651, 0.85759371])}
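To make the idea of "closeness" concrete, here is a small sketch that measures cosine similarity between two word vectors. It uses its own seeded random vectors (`toy_embeddings`, a hypothetical stand-in) so that it runs on its own; because the vectors are random, the resulting number carries no linguistic meaning and only illustrates the computation.

```python
import numpy as np

def word_similarity(vec1, vec2):
    """Cosine of the angle between two non-zero vectors."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Seeded generator so the toy vectors are the same on every run
rng = np.random.default_rng(seed=0)
toy_embeddings = {word: rng.random(5) for word in ["hello", "world", "example"]}

sim_hello_world = word_similarity(toy_embeddings["hello"], toy_embeddings["world"])
sim_hello_hello = word_similarity(toy_embeddings["hello"], toy_embeddings["hello"])

print("hello vs world:", sim_hello_world)
print("hello vs hello:", sim_hello_hello)  # a vector compared with itself gives 1 (up to rounding)
```

With pre-trained embeddings, the same function would return higher values for words used in similar contexts.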


## Word/Character Statistics

Continuing with our tokens, let's analyse the word "example".

This step involves calculating statistics such as word frequency and character count. It's useful for understanding the composition of texts.

If we encode each character as a vector (for example, a one-hot vector over the 26 letters, with a 1 in position 1 for "a", position 2 for "b", and so on up to position 26 for "z"), "example" can be represented as a matrix where each row corresponds to a character vector. Summing these rows gives the count of each letter, which is a simple statistical representation of the word.

In [18]:
word_lengths = {word: len(word) for word in unique_words}
print("Word Lengths:", word_lengths)

Word Lengths: {'this': 4, 'hello': 5, 'world': 5, 'an': 2, 'is': 2, 'example': 7}
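The character-matrix idea described above can be sketched as follows. This assumes a one-hot encoding over the lowercase letters a to z, which is one possible choice of character vector; the `one_hot_matrix` helper is hypothetical and would raise a `ValueError` on any character outside that alphabet.

```python
import string
import numpy as np

alphabet = string.ascii_lowercase  # 'a' through 'z'

def one_hot_matrix(word):
    """Return a (len(word), 26) matrix with one one-hot row per character."""
    matrix = np.zeros((len(word), len(alphabet)), dtype=int)
    for row, char in enumerate(word):
        matrix[row, alphabet.index(char)] = 1
    return matrix

char_matrix = one_hot_matrix("example")
char_counts = char_matrix.sum(axis=0)  # summing the rows gives per-letter counts

print(char_matrix.shape)
# (7, 26)
print({c: int(n) for c, n in zip(alphabet, char_counts) if n > 0})
# {'a': 1, 'e': 2, 'l': 1, 'm': 1, 'p': 1, 'x': 1}
```

The sum of all entries in `char_counts` equals the word length, which ties this representation back to the word-length statistic computed above.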


## Document-Term Matrix (DTM)

Consider two sentences: "Hello World!" and "This is an example."

A DTM is a matrix that describes the frequency of terms (words) that occur in a collection of documents. It's crucial for many tasks, including information retrieval and topic modeling.

In terms of linear algebra:
- each row represents a document.
- each column represents a unique term across all documents.
- entries in the matrix are the frequencies of the terms in the documents.

For our example, the DTM might look like this if we only consider unique words:


||hello|world|this|is|an|example|
|---|---|---|---|---|---|---|
|Doc1|1|1|0|0|0|0|
|Doc2|0|0|1|1|1|1|

With this matrix, we can perform operations like calculating the cosine similarity between documents, which tells us how similar two documents are based on their word usage. This involves linear algebra operations like dot products and norms.

In [19]:
# Initialize DTM
dtm = np.zeros((len(tokenized_docs), len(unique_words)))

# Mapping of words to columns
word_to_index = {word: i for i, word in enumerate(unique_words)}

# Fill DTM
for doc_idx, doc in enumerate(tokenized_docs):
    for word in doc:
        word_idx = word_to_index[word]
        dtm[doc_idx, word_idx] += 1

print("Document-Term Matrix:\n")
print(unique_words)
print(dtm)


Document-Term Matrix:

{'this', 'hello', 'world', 'an', 'is', 'example'}
[[0. 1. 1. 0. 0. 0.]
 [1. 0. 0. 1. 1. 1.]]
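Because `unique_words` is a Python set, its iteration order is not guaranteed, so the column order of the matrix above can differ between runs. A minimal sketch of one way to get a reproducible layout, assuming alphabetical column order is acceptable (the names `docs_tokens`, `vocabulary`, `column_of`, and `dtm_sorted` are introduced here for illustration):

```python
import numpy as np

docs_tokens = [["hello", "world"], ["this", "is", "an", "example"]]

# Sort the vocabulary so every run assigns the same column to each word
vocabulary = sorted({word for doc in docs_tokens for word in doc})
column_of = {word: col for col, word in enumerate(vocabulary)}

dtm_sorted = np.zeros((len(docs_tokens), len(vocabulary)), dtype=int)
for row, doc in enumerate(docs_tokens):
    for word in doc:
        dtm_sorted[row, column_of[word]] += 1

print(vocabulary)
# ['an', 'example', 'hello', 'is', 'this', 'world']
print(dtm_sorted)
# [[0 0 1 0 0 1]
#  [1 1 0 1 1 0]]
```

The counts are the same as before; only the column order changes, and cosine similarity between rows is unaffected by reordering columns.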


## Document Similarity

To demonstrate document similarity, we'll use the Document-Term Matrix (DTM) created in the previous steps. Document similarity can be calculated in various ways, but one common method is using the cosine similarity measure. 

Cosine similarity calculates the cosine of the angle between two non-zero vectors of an inner product space, which in this context are the vectors representing our documents in the DTM. This measure helps us understand how similar two documents are in terms of their word usage.

First, let's calculate the cosine similarity between two documents. We'll use the `numpy` library for this, as it provides efficient linear algebra operations.

### Calculating Cosine Similarity

The cosine similarity between two vectors $A$ and $B$ is given by the formula:

$\text{cosine similarity} = \frac{A \cdot B}{\left\| A\right\| \left\| B\right\|}$

where:
- $A \cdot B$ is the dot product of vectors $A$ and $B$,
- $\left\| A\right\|$ and $\left\| B\right\|$ are the Euclidean norms (or magnitudes) of vectors $A$ and $B$ respectively. The norm of a vector is its distance from the origin.

For example, consider `A=[1, 2, 3]` and `B=[4, 5, 6]`:
- $A \cdot B = (1 \times 4) + (2 \times 5) + (3 \times 6) = 4 + 10 + 18 = 32$
- $\left\| A\right\| = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14} \approx 3.74$
- $\left\| B\right\| = \sqrt{4^2 + 5^2 + 6^2} = \sqrt{77} \approx 8.77$

Hence:

$\text{cosine similarity} = \frac{32}{\sqrt{14} \times \sqrt{77}} \approx \frac{32}{32.83} \approx 0.97$

This means the two vectors point in almost the same direction.
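The worked example can be checked numerically with a few lines of `numpy`. This is only a verification sketch; the variable names are chosen here for illustration.

```python
import numpy as np

A = np.array([1, 2, 3])
B = np.array([4, 5, 6])

dot_ab = np.dot(A, B)        # 1*4 + 2*5 + 3*6
norm_a = np.linalg.norm(A)   # sqrt(14)
norm_b = np.linalg.norm(B)   # sqrt(77)
cos_ab = dot_ab / (norm_a * norm_b)

print(dot_ab)                    # 32
print(round(float(norm_a), 2))   # 3.74
print(round(float(norm_b), 2))   # 8.77
print(round(float(cos_ab), 2))   # 0.97
```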

Let's apply it to our documents now.


In [20]:
import numpy as np

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    similarity = dot_product / (norm_vec1 * norm_vec2)
    return similarity

doc1 = dtm[0]  # Vector representing the first document
doc2 = dtm[1]  # Vector representing the second document

# Calculate similarity
similarity = cosine_similarity(doc1, doc2)
print(f"Cosine similarity between Document 1 and Document 2: {similarity}")


Cosine similarity between Document 1 and Document 2: 0.0


This means that `Document 1` and `Document 2` have zero similarity: they share no words, so their vectors are orthogonal.

If you have multiple documents, you can create a similarity matrix to compare all documents with each other.


In [21]:
num_docs = dtm.shape[0]
similarity_matrix = np.zeros((num_docs, num_docs))

for i in range(num_docs):
    for j in range(num_docs):
        if i != j:
            similarity_matrix[i, j] = cosine_similarity(dtm[i], dtm[j])
        else:
            similarity_matrix[i, j] = 1  # Document is perfectly similar to itself

print("Similarity Matrix:\n", similarity_matrix)


Similarity Matrix:
 [[1. 0.]
 [0. 1.]]


This matrix provides a comprehensive view of how each document relates to every other document in your corpus, based on their content. High values (close to 1) indicate strong similarity, while low values (close to 0) indicate low similarity. This approach is very useful for clustering documents, recommending content, or detecting duplicate documents.
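For larger corpora, the nested loop can be replaced by a single matrix product. The sketch below normalizes each row of the DTM to unit length and multiplies the result by its own transpose, so entry `(i, j)` is the cosine similarity of documents `i` and `j`. It uses a hard-coded copy of the example matrix (`example_dtm`) to stay self-contained, and it assumes no document is empty, since an all-zero row would cause a division by zero.

```python
import numpy as np

# Same counts as the example DTM above (2 documents, 6 terms)
example_dtm = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 1],
], dtype=float)

# Divide each row by its Euclidean norm so every document vector has length 1
row_norms = np.linalg.norm(example_dtm, axis=1, keepdims=True)
normalized = example_dtm / row_norms

# Dot products of unit vectors are cosine similarities
vectorized_similarity = normalized @ normalized.T
print(np.round(vectorized_similarity, 2))
```

The diagonal comes out as 1 (up to floating-point rounding) without a special case, because each unit vector has a dot product of 1 with itself.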

## Conclusion

Linear algebra in NLP allows us to convert textual information into numerical form, enabling the application of mathematical models and algorithms to analyze and understand language. Each step, from tokenization to embedding and constructing DTMs, builds upon the previous to enrich the analysis capabilities in NLP tasks.