In [1]:
import numpy as np

# Our toy corpus
docs = [
    "ball game player score team",
    "cook recipe taste food eat",
    "code computer software program data",
    "team player win ball coach",
    "eat delicious cook meal taste"
]

#### Step 1: Construct a Document-Term Matrix

In [2]:
# Extract unique terms (i.e. our vocabulary)
vocab = sorted(set(word for doc in docs for word in doc.split()))
print('Vocabulary:', vocab)

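# Build the DTM: one row per document, one column per vocabulary term
def build_dtm(docs, vocab):
    index = {word: j for j, word in enumerate(vocab)}  # term -> column lookup
    dtm = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for word in doc.split():
            dtm[i, index[word]] += 1  # count each occurrence of the term
    return dtm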

A = build_dtm(docs, vocab)
print('\nDocument-Term Matrix:')
print(A)

Vocabulary: ['ball', 'coach', 'code', 'computer', 'cook', 'data', 'delicious', 'eat', 'food', 'game', 'meal', 'player', 'program', 'recipe', 'score', 'software', 'taste', 'team', 'win']

Document-Term Matrix:
[[1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0.]
 [0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0.]
 [1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0.]]



#### Step 2: Singular Value Decomposition
LSA decomposes the DTM as A = U Σ Vᵀ, where A is an m × n matrix (m documents, n terms) and the three factors are:  
U (m × m): each row represents a document, and each column (a left singular vector) represents a topic/concept found in the corpus.  
Σ (m × n): a rectangular diagonal matrix (the singular value matrix) whose diagonal entries are the singular values of A, sorted in descending order (σ1 ≥ σ2 ≥ ... ≥ σm here, since m < n). Each value describes the strength of the corresponding concept.  
Vᵀ (n × n): each column represents a term (remember, this is a transpose) and each row represents a topic/concept found in the corpus. The rows of Vᵀ (equivalently, the columns of V) are the right singular vectors.  
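
To make the shapes concrete, here is a minimal sketch on a hypothetical 2 × 3 matrix (the name `A_demo` and its values are made up for illustration, not part of our corpus). `np.linalg.svd` returns the singular values as a 1-D array, so the rectangular Σ has to be rebuilt before the product U Σ Vᵀ can be checked.

```python
import numpy as np

# Hypothetical 2x3 matrix standing in for a tiny DTM (2 documents, 3 terms)
A_demo = np.array([[1., 1., 0.],
                   [0., 1., 1.]])

U_demo, s_demo, Vt_demo = np.linalg.svd(A_demo, full_matrices=True)

# Rebuild the (m, n) rectangular diagonal matrix from the 1-D singular values
m, n = A_demo.shape
Sigma_demo = np.zeros((m, n))
Sigma_demo[:len(s_demo), :len(s_demo)] = np.diag(s_demo)

print(U_demo.shape, Sigma_demo.shape, Vt_demo.shape)  # (2, 2) (2, 3) (3, 3)

# The three factors multiply back to the original matrix
print(np.allclose(A_demo, U_demo @ Sigma_demo @ Vt_demo))
```

With `full_matrices=True`, only the first min(m, n) rows of Vᵀ are paired with a singular value; the remaining rows are multiplied by the zero columns of Σ and do not affect the product.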

In [3]:
U, sigma, Vt = np.linalg.svd(A, full_matrices=True)

print("Shape of U (documents × topics):", U.shape)
print("Shape of Σ (singular values):", sigma.shape)
print("Shape of Vᵀ (topics × terms):", Vt.shape)

print("\n=== SINGULAR VALUES (Σ) ===")
print(sigma)
print("\nExplained variance ratio:")
variance_ratio = (sigma**2) / np.sum(sigma**2)
for i, (s, v) in enumerate(zip(sigma, variance_ratio)):
    print(f"  Component {i+1}: σ={s:.4f}, variance={v:.2%}")


Shape of U (documents × topics): (5, 5)
Shape of Σ (singular values): (5,)
Shape of Vᵀ (topics × terms): (19, 19)

=== SINGULAR VALUES (Σ) ===
[2.82842712 2.82842712 2.23606798 1.41421356 1.41421356]

Explained variance ratio:
  Component 1: σ=2.8284, variance=32.00%
  Component 2: σ=2.8284, variance=32.00%
  Component 3: σ=2.2361, variance=20.00%
  Component 4: σ=1.4142, variance=8.00%
  Component 5: σ=1.4142, variance=8.00%


The singular values in Σ live in a 'latent' space that sits between the documents and the vocabulary. Each singular value (σ1, σ2, σ3, ..., σm) corresponds to one concept (in our case, a topic), and its square, as a share of the sum of all squared singular values, indicates how much of the data that concept accounts for. Because A is not mean-centred, this is strictly a share of the total squared entries of A rather than a statistical variance. We can see that 2 main concepts dominate the documents (32% each), and that a 3rd concept (20%) is more prominent than the remaining 2 (8% each).
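
The ratio printed above relies on the fact that the squared singular values sum to the squared Frobenius norm of the matrix (the sum of its squared entries). For our DTM that total is 25, since it holds 25 ones, which is why σ² = 8 shows up as 32%. A small sketch of that identity, using a hypothetical matrix `X` rather than our corpus:

```python
import numpy as np

# Hypothetical small count matrix (3 documents, 4 terms)
X = np.array([[1., 1., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 1., 1.]])

s_x = np.linalg.svd(X, compute_uv=False)   # singular values only

total = np.sum(X**2)                       # squared Frobenius norm of X
share = s_x**2 / np.sum(s_x**2)            # same formula as variance_ratio

print(np.isclose(np.sum(s_x**2), total))   # squared singular values sum to the total
print(np.isclose(share.sum(), 1.0))        # the shares sum to 1
```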

#### Step 3: Reduce to k Latent Topics
This step is optional, but it is usually done for a large corpus, where we typically care only about the top-k topics; keeping fewer dimensions also reduces computational effort downstream. Here we set k=3 because we think there are 3 main concepts in the corpus. This follows the 'elbow' heuristic: choose k at the point where the singular values drop sharply, which here is after the 3rd component (its share is 20%, against 8% for the 4th).

In [4]:
k = 3
U_k = U[:, :k]       # document-topic matrix (m × k)
sigma_k = sigma[:k]  # top k singular values
Vt_k = Vt[:k, :]     # topic-term matrix (k × n)
print(f"Reduced to {k} latent dimensions")
print(f"Singular values retained: {sigma_k}")

Reduced to 3 latent dimensions
Singular values retained: [2.82842712 2.82842712 2.23606798]
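
Truncating to k components amounts to replacing the matrix with a rank-k approximation. As a rough sketch of what is lost, the Frobenius-norm error of that approximation equals the square root of the sum of the discarded squared singular values. The matrix `M` below is randomly generated and purely hypothetical; the `_r` names are chosen so they do not overwrite the notebook's variables.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical count matrix: 6 documents, 8 terms
M = rng.integers(0, 3, size=(6, 8)).astype(float)

U_r, s_r, Vt_r = np.linalg.svd(M, full_matrices=False)

k_r = 3
# Rank-k reconstruction from the top k singular triplets
M_k = (U_r[:, :k_r] * s_r[:k_r]) @ Vt_r[:k_r, :]

err = np.linalg.norm(M - M_k, ord='fro')      # what the truncation throws away
expected = np.sqrt(np.sum(s_r[k_r:]**2))      # discarded singular values

print(np.isclose(err, expected))
```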


#### Step 4: Interpret the Discovered Topics

In [5]:
def display_topics(Vt_k, vocab, n_top_words=5):
    print('\n' + '='*50)
    print('DISCOVERED LATENT TOPICS')
    print('='*50)

    for topic_idx, topic in enumerate(Vt_k):
        # get indices of top terms (by absolute value)
        top_indices = np.argsort(np.abs(topic))[::-1][:n_top_words]
        top_terms = [(vocab[i], topic[i]) for i in top_indices]

        print(f'\nTopic {topic_idx + 1}:')
        print(f' Top Terms: {[t[0] for t in top_terms]}')
        print(f" Weights:   {[f'{t[1]:.3f}' for t in top_terms]}")

display_topics(Vt_k, vocab)


DISCOVERED LATENT TOPICS

Topic 1:
 Top Terms: ['taste', 'eat', 'cook', 'delicious', 'meal']
 Weights:   ['0.500', '0.500', '0.500', '0.250', '0.250']

Topic 2:
 Top Terms: ['ball', 'player', 'team', 'coach', 'win']
 Weights:   ['0.500', '0.500', '0.500', '0.250', '0.250']

Topic 3:
 Top Terms: ['software', 'code', 'computer', 'program', 'data']
 Weights:   ['0.447', '0.447', '0.447', '0.447', '0.447']


#### Step 5: Associate Topics with Documents

In [6]:
# Scale U by singular values for document-topic strengths
doc_topics = U_k * sigma_k

print("\n" + "="*50)
print("DOCUMENT-TOPIC ASSOCIATIONS")
print("="*50)

for i, scores in enumerate(doc_topics):
    dominant = np.argmax(np.abs(scores)) + 1
    print(f"\nDoc {i+1} : {docs[i]}")
    print(f"  Topic scores: T1={scores[0]:.3f}, T2={scores[1]:.3f}, T3={scores[2]:.3f}")
    print(f"  → Dominant: Topic {dominant}")


DOCUMENT-TOPIC ASSOCIATIONS

Doc 1 : ball game player score team
  Topic scores: T1=0.000, T2=2.000, T3=0.000
  → Dominant: Topic 2

Doc 2 : cook recipe taste food eat
  Topic scores: T1=2.000, T2=0.000, T3=0.000
  → Dominant: Topic 1

Doc 3 : code computer software program data
  Topic scores: T1=0.000, T2=0.000, T3=2.236
  → Dominant: Topic 3

Doc 4 : team player win ball coach
  Topic scores: T1=0.000, T2=2.000, T3=0.000
  → Dominant: Topic 2

Doc 5 : eat delicious cook meal taste
  Topic scores: T1=2.000, T2=0.000, T3=0.000
  → Dominant: Topic 1
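
Scaling U_k by the singular values is equivalent to projecting each document's term counts onto the topic axes, because A V = U Σ. The sketch below checks this on a hypothetical matrix `D` (not our corpus) and, as an assumption about how one might extend the notebook, scores an unseen document with the same projection.

```python
import numpy as np

# Hypothetical DTM: 4 documents over a 5-term vocabulary
D = np.array([[2., 1., 0., 0., 0.],
              [1., 2., 0., 0., 0.],
              [0., 0., 1., 2., 1.],
              [0., 0., 2., 1., 1.]])

U_d, s_d, Vt_d = np.linalg.svd(D, full_matrices=False)
k_d = 2

scaled = U_d[:, :k_d] * s_d[:k_d]    # same construction as doc_topics
projected = D @ Vt_d[:k_d, :].T      # term counts projected onto topic axes
print(np.allclose(scaled, projected))

# A new, unseen document (hypothetical counts) can be scored the same way
new_doc = np.array([1., 1., 0., 0., 0.])
new_scores = new_doc @ Vt_d[:k_d, :].T
print(new_scores.shape)  # (2,)
```

The signs of the scores depend on the sign convention the SVD routine happens to return, which is why the notebook picks the dominant topic by absolute value.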


The last step is to give each topic a name, which LSA cannot do by itself: it only returns weighted lists of terms. In our toy example, it isn't difficult to label Topic 1 as 'Food', Topic 2 as 'Sports' and Topic 3 as 'Tech'. For a large dataset, the labelling could be delegated to an LLM. That said, an LLM is very likely to outperform this bag-of-words approach at the modelling task to begin with, something we'll explore in a later notebook.