# DIGI405 Texts, Discourses and Data

## Word and Document Representations Examples

We looked at token frequency as a representation of words in a corpus in week 1. This week we extend this to look at further concepts that build on frequency to create word and document-level representations of texts, including:

- Document-term matrices, using word counts
- Document-term matrices, using term frequency - inverse document frequency weighting
- Word embeddings

We will use a corpus of (probably generated) customer support tickets as an example. This corpus offers a situation where we have many texts (~8000) with often repeating or similar content. We want to find tickets that are similar, but also to see what is distinctive about them, and we'll explore some different ways to represent them to do this.

In [4]:
# If needed...
#!pip install textblob
#!python -m textblob.download_corpora

In [5]:
from collections import Counter, defaultdict

import pandas as pd
from nltk.corpus import stopwords
from tabulate import tabulate
from textblob import TextBlob as tb

In [6]:
tickets_df = pd.read_csv('tickets.csv')

In [7]:
tickets_corpus = tickets_df['Ticket Description'].to_list()

In [8]:
for t in tickets_corpus[:5]:
    print(t)
    print('\n')

I'm having an issue with the {product_purchased}. Please assist.

Your billing zip code is: 71701.

We appreciate that you have requested a website address.

Please double check your email address. I've tried troubleshooting steps mentioned in the user manual, but the issue persists.


I'm having an issue with the {product_purchased}. Please assist.

If you need to change an existing product.

I'm having an issue with the {product_purchased}. Please assist.

If The issue I'm facing is intermittent. Sometimes it works fine, but other times it acts up unexpectedly.


I'm facing a problem with my {product_purchased}. The {product_purchased} is not turning on. It was working fine until yesterday, but now it doesn't respond.

1.8.3 I really I'm using the original charger that came with my {product_purchased}, but it's not charging properly.


I'm having an issue with the {product_purchased}. Please assist.

If you have a problem you're interested in and I'd love to see this happen, please c

### Document-term matrix

We will use the scikit-learn library to count tokens and produce a matrix, or table, summarising our documents. Each document will be converted to a vector, or list of numbers. Each position in the vector corresponds to one word in the vocabulary of our corpus. The columns of the resulting matrix represent words, and the rows represent documents. If a word is present in a document, the cell holds its frequency count in that document. If a word is not present in a document, the count is zero.

The documentation for the CountVectorizer is here:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html 
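
To make the idea concrete, here is a minimal pure-Python sketch of a document-term matrix built by hand. The three toy sentences and the helper name `tokenize` are made up for illustration; this is not how scikit-learn implements `CountVectorizer`, only a small-scale picture of the same table of counts.

```python
import re
from collections import Counter

# Toy corpus, invented for illustration (not taken from tickets.csv)
toy_docs = [
    "The charger is not charging",
    "The screen is not turning on",
    "Please assist with the charger",
]

def tokenize(text):
    # lowercase, then keep runs of letters (similar in spirit to r'[a-zA-Z]+')
    return re.findall(r"[a-zA-Z]+", text.lower())

# one Counter of token counts per document
counts = [Counter(tokenize(doc)) for doc in toy_docs]

# the vocabulary is every distinct token, sorted so columns have a fixed order
vocab = sorted(set().union(*counts))

# rows are documents, columns are vocabulary words, cells are counts
toy_dtm = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in toy_dtm:
    print(row)
```

Each row has one entry per vocabulary word, so most cells are zero even in this tiny example. With thousands of tickets the matrix becomes very sparse.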

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

In [10]:
vectorizer = CountVectorizer(token_pattern=r'[a-zA-Z]+')

In [11]:
tickets_dtm = vectorizer.fit_transform(tickets_corpus)

In [12]:
# print the features, i.e. the 'bag of words' vocabulary extracted from the documents
vectorizer.get_feature_names_out()

array(['a', 'aaron', 'ab', ..., 'zombiecobra', 'zombii', 'zoom'],
      dtype=object)

In [13]:
# print the dtm
# rows are documents and columns are words
# each cell is a token count

tickets_dtm.toarray()

array([[1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [2, 0, 0, ..., 0, 0, 0],
       [2, 0, 0, ..., 0, 0, 0],
       [2, 0, 0, ..., 0, 0, 0]])

In [14]:
# we can also check the shape attribute

tickets_dtm.shape

(8469, 6232)

In [15]:
tickets_dtm_df = pd.DataFrame(tickets_dtm.toarray(),
                      columns=vectorizer.get_feature_names_out())

In [16]:
# print the dtm
tickets_dtm_df

Unnamed: 0,a,aaron,ab,ability,able,about,aboutumes,above,abroad,absmith,...,zerohits,zg,zilp,zinski,zip,zoe,zoltan,zombiecobra,zombii,zoom
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8464,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8465,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8466,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8467,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
# get an overview of the word 'problem'

tickets_dtm_df['problem'].value_counts()
# tickets_dtm_df['problem'].describe()

problem
0    6358
1    1948
2     153
3       9
4       1
Name: count, dtype: int64

'Problem' appears once in 1948 tickets, twice in 153 tickets, three times in 9 tickets, and four times in 1 ticket. 
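
The same output can be summarised into a document frequency and a total count. The sketch below simply copies the numbers from the `value_counts()` output above into a dictionary (the variable names are illustrative) and does the arithmetic.

```python
# Counts copied from the value_counts() output above:
# key = times 'problem' occurs in a ticket, value = number of such tickets
problem_counts = {0: 6358, 1: 1948, 2: 153, 3: 9, 4: 1}

# total number of tickets
n_tickets = sum(problem_counts.values())

# document frequency: tickets that contain 'problem' at least once
doc_frequency = sum(n for times, n in problem_counts.items() if times > 0)

# total number of occurrences across the corpus
total_occurrences = sum(times * n for times, n in problem_counts.items())

print(n_tickets)                            # 8469
print(doc_frequency)                        # 2111
print(total_occurrences)                    # 2285
print(round(doc_frequency / n_tickets, 3))  # 0.249
```

So roughly a quarter of the tickets mention 'problem' at least once, which matters for the document frequency idea introduced next.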

### Term frequency - Inverse Document Frequency

Term frequency - inverse document frequency (tf-idf) is a feature scoring measure used to give more weight to words that are distinctive of their document. It multiplies the term frequency (tf) of a word in a document by the inverse document frequency (idf), a log-scaled inverse of the proportion of documents in the corpus that contain the word. In this way, it boosts words that appear more frequently in a document than elsewhere in the corpus, and downweights words that appear in lots of documents.
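
One way to write this down is shown below. The symbols are notation introduced here for illustration: $N$ is the number of documents, $f_{t,d}$ the count of term $t$ in document $d$, $|d|$ the number of tokens in $d$, and $\mathrm{df}(t)$ the number of documents containing $t$. This is the add-one variant used in the illustration cell that follows. Libraries differ in the exact smoothing and normalisation, so treat it as one formulation rather than the definition.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{1+\mathrm{df}(t)}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```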

#### Understanding Tf-idf in more detail

In [18]:
# This code cell is adapted from Steven Loria's TF-IDF example using the Textblob package
# Just for illustration purposes - we will use scikit-learn to calculate this for us

import math
from textblob import TextBlob as tb

def raw_tf(word, doc):
    return doc.words.count(word)

def rel_tf(word, doc):
    return doc.words.count(word) / len(doc.words)

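def doc_frequency(word, corpus):
    # note: if corpus is a list of plain strings (as tickets_corpus is),
    # `word in doc` is a case-sensitive substring test, not a whole-word match
    return sum(1 for doc in corpus if word in doc)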

def idf(word, corpus):
    # this takes the natural logarithm
    return math.log(len(corpus) / (1 + doc_frequency(word, corpus)))

def tfidf(word, doc, corpus):
    return rel_tf(word, doc) * idf(word, corpus)

In [19]:
import pandas as pd

# Make a textblob document with one text
# Calculate tf-idf using the whole tickets_corpus dataset

example_1 = tb(tickets_corpus[1])

results = []

for word in example_1.words:
  raw_freq = raw_tf(word, example_1)
  rel_freq = rel_tf(word, example_1)
  doc_freq = doc_frequency(word, tickets_corpus)
  inverse_doc_freq = idf(word, tickets_corpus)
  tf_idf = tfidf(word, example_1, tickets_corpus)

  example_1_scores = {'word': '{}'.format(word),
            'raw_frequency': '{}'.format(raw_freq),
            'rel_frequency': '{}'.format(rel_freq),
            'document_frequency': '{}'.format(doc_freq),
            'IDF': '{}'.format(inverse_doc_freq),
            'TF-IDF': '{}'.format(tf_idf)}
  results.append(example_1_scores)

example_1_df = pd.DataFrame(results)

# here we have to ensure the TF-IDF scores are treated as numbers, not text
example_1_df['TF-IDF'] = pd.to_numeric(example_1_df["TF-IDF"])

# build a view with one row per unique word
unique = example_1_df[['word', 'raw_frequency', 'rel_frequency', 'document_frequency', 'IDF', 'TF-IDF']].drop_duplicates()

# sort on Tf-idf values
unique.sort_values(by='TF-IDF', ascending=False).head(20)

Unnamed: 0,word,raw_frequency,rel_frequency,document_frequency,IDF,TF-IDF
16,existing,1,0.0212765957446808,18,6.099728737755159,0.129781
10,If,2,0.0425531914893617,1132,2.0115434558935927,0.085598
29,The,3,0.0638297872340425,2281,1.311360186499578,0.083704
46,unexpectedly,1,0.0212765957446808,499,2.8295596184994074,0.060203
36,Sometimes,1,0.0212765957446808,500,2.8275616158367343,0.060161
44,acts,1,0.0212765957446808,501,2.82556759722987,0.060118
14,change,1,0.0212765957446808,651,2.564123154994946,0.054556
35,intermittent,1,0.0212765957446808,674,2.529455026049069,0.053818
33,facing,1,0.0212765957446808,819,2.3348633766633,0.049678
39,fine,1,0.0212765957446808,881,2.261975660914808,0.048127
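
As a check on how these columns fit together, the top row can be recomputed by hand. The sketch below uses only numbers read off the output above; the document length of 47 tokens is inferred from the relative frequency 0.0212765957 = 1/47.

```python
import math

# Values read off the output above
n_docs = 8469    # number of tickets in the corpus
doc_length = 47  # tokens in ticket 1 (inferred: rel_frequency = 1/47)
raw_count = 1    # 'existing' occurs once in ticket 1
doc_freq = 18    # tickets found to contain 'existing'

rel_tf = raw_count / doc_length          # relative term frequency
idf = math.log(n_docs / (1 + doc_freq))  # natural log, add-one in the denominator
tf_idf = rel_tf * idf

print(round(rel_tf, 4))   # 0.0213
print(round(idf, 4))      # 6.0997
print(round(tf_idf, 6))   # 0.129781
```

'existing' scores highest because it is found in only 18 tickets, so its idf is large even though it occurs just once in this ticket.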


#### Document-term matrix with Tf-idf

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [21]:
# We can apply various pre-processing steps to our vectorizer
# Here we add a minimum doc frequency, similar to what we've done using AntConc
# We also lowercase the text and keep only alphabetic tokens

vectorizer2 = TfidfVectorizer(min_df=5, lowercase=True, token_pattern=r'[a-zA-Z]+')
tfidf_dtm = vectorizer2.fit_transform(tickets_corpus)
tfidf_dtm_df = pd.DataFrame(tfidf_dtm.toarray(),
                      columns=vectorizer2.get_feature_names_out())

# Show us the first 5 rows (the default for head())
# Notice the tf-idf scores are floats, not counts
tfidf_dtm_df.head()

Unnamed: 0,a,ability,able,about,above,absolutely,accept,accepted,access,accessible,...,year,years,yes,yesterday,yet,york,you,your,yours,z
0,0.071015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.084612,0.250138,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.100733,0.0,0.0,0.0
2,0.074789,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.227836,0.0,0.0,0.0,0.0,0.0,0.0
3,0.072596,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.172992,0.0,0.0,0.0
4,0.059045,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.070351,0.0,0.0,0.0
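
These scores will not match the hand-calculated ones from the TextBlob illustration, because `TfidfVectorizer` with its default settings uses raw counts for tf, a smoothed idf of ln((1 + N) / (1 + df)) + 1, and then rescales each row to unit Euclidean (L2) length. The sketch below imitates those defaults in plain Python on a made-up toy corpus. The function names are illustrative, and this is an approximation of the idea rather than scikit-learn's actual code.

```python
import math
from collections import Counter

# Toy, pre-tokenised corpus invented for illustration
toy_docs = [
    "screen not charging".split(),
    "screen not turning on".split(),
    "please assist".split(),
]
n_docs = len(toy_docs)
vocab = sorted({w for doc in toy_docs for w in doc})

# number of documents each word appears in
doc_freq = {w: sum(1 for doc in toy_docs if w in doc) for w in vocab}

def smoothed_idf(word):
    # smoothed idf: ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf_row(doc):
    counts = Counter(doc)
    weights = [counts[w] * smoothed_idf(w) for w in vocab]
    # L2-normalise so every document vector has length 1
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights]

rows = [tfidf_row(doc) for doc in toy_docs]
for row in rows:
    print([round(x, 3) for x in row])
```

Because every row ends up with length 1, long and short tickets are put on the same scale before they are compared.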


### Measuring Similarity

How can we measure similarity from the document-term matrix?

In [22]:
# a code snippet to help make output later in this section readable!
import textwrap

wrapper = textwrap.TextWrapper(width=35,
    initial_indent=" " * 4,
    subsequent_indent=" " * 4,
    break_long_words=False,
    break_on_hyphens=False)

The first step is to compute a similarity matrix.

In [23]:
from sklearn.metrics.pairwise import cosine_similarity

# compute a similarity matrix
# notice it is symmetric about the diagonal, and every diagonal entry is 1
cosine_sim = cosine_similarity(tfidf_dtm, tfidf_dtm)
print(cosine_sim)

[[1.         0.14950106 0.08505134 ... 0.11785847 0.16831609 0.15551782]
 [0.14950106 1.         0.22158801 ... 0.21599035 0.22122579 0.14021276]
 [0.08505134 0.22158801 1.         ... 0.25501771 0.18535725 0.22331872]
 ...
 [0.11785847 0.21599035 0.25501771 ... 1.         0.22742156 0.25251166]
 [0.16831609 0.22122579 0.18535725 ... 0.22742156 1.         0.19917944]
 [0.15551782 0.14021276 0.22331872 ... 0.25251166 0.19917944 1.        ]]


In [24]:
tfidf_dtm.shape

(8469, 1631)

This similarity matrix compares the TF-IDF vector for each document with that of every other document, computing the cosine similarity in each case. The TF-IDF matrix has 8469 rows (documents) and 1631 columns (words), so the outcome is a similarity matrix of 8469 rows x 8469 columns (documents x documents) that summarises the similarity between documents for the whole corpus.
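
For intuition about what is being computed for each pair, here is a minimal pure-Python sketch of cosine similarity on three made-up vectors. The helper name `cosine` and the toy numbers are illustrative only; in practice the scikit-learn function used above does this far more efficiently.

```python
import math

def cosine(u, v):
    # dot product divided by the product of the two vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

toy_vectors = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],  # same direction as the first, twice as long
    [0.0, 0.0, 3.0],  # shares no non-zero dimension with the others
]

# n documents give an n x n matrix of pairwise similarities
sim = [[cosine(u, v) for v in toy_vectors] for u in toy_vectors]
for row in sim:
    print([round(x, 3) for x in row])
```

The first two vectors score (approximately) 1 because cosine similarity ignores length and compares direction only, while vectors with no shared non-zero dimension score 0.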

The similarity matrix can be used to find which documents are most similar.

In [25]:
# specify a target document and find similar documents to it.
target_doc = 1

sim_scores = list(enumerate(cosine_sim[target_doc]))
sorted_sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
sorted_sim_scores[:10]

[(1, 1.0000000000000002),
 (668, 0.8079363756453974),
 (1902, 0.780961719140004),
 (539, 0.776450455943226),
 (1106, 0.776450455943226),
 (1520, 0.776450455943226),
 (1557, 0.776450455943226),
 (1735, 0.776450455943226),
 (2007, 0.776450455943226),
 (3389, 0.776450455943226)]
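
The top hit is always the target document itself, and several tickets tie on exactly the same score. A small helper can drop the self-match before ranking. The sketch below is one possible approach on a made-up similarity row; `most_similar` is a hypothetical helper name, not part of any library.

```python
def most_similar(sim_row, target, top_n=3):
    # pair each document index with its score, skipping the target itself
    scored = [(i, s) for i, s in enumerate(sim_row) if i != target]
    # sorted() is stable, so tied scores stay in document order
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# made-up similarities between document 1 and five documents
toy_row = [0.15, 1.0, 0.22, 0.81, 0.78]

print(most_similar(toy_row, target=1))
# [(3, 0.81), (4, 0.78), (2, 0.22)]
```

With the real data, the same idea would be applied to `cosine_sim[target_doc]`.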

In [26]:
print('Documents similar to doc {}:'.format(target_doc))
print('The first result is the target document itself!')
print('\n')

for doc in sorted_sim_scores[:10]:
    print(doc[0])
    print(wrapper.fill(tickets_corpus[doc[0]][:500]))
    print('\n')
    print('-----')

Documents similar to doc 1:
The first result is the target document itself!


1
    I'm having an issue with the
    {product_purchased}. Please
    assist.  If you need to change
    an existing product.  I'm
    having an issue with the
    {product_purchased}. Please
    assist.  If The issue I'm
    facing is intermittent.
    Sometimes it works fine, but
    other times it acts up
    unexpectedly.


-----
668
    I'm having an issue with the
    {product_purchased}. Please
    assist.   I'm having an issue
    with the {product_purchased}.
    Please assist. The issue I'm
    facing is intermittent.
    Sometimes it works fine, but
    other times it acts up
    unexpectedly.


-----
1902
    I'm having an issue with the
    {product_purchased}. Please
    assist.  SMS, please assist!
    The issue I'm facing is
    intermittent. Sometimes it
    works fine, but other times it
    acts up unexpectedly.


-----
539
    I'm having an issue with the
    {product_purchased}. Please
 

### Word embeddings

In [27]:
import spacy
nlp = spacy.load("en_core_web_sm")



In [28]:
# Load one ticket into a spaCy doc
# We use spaCy tokenisation here, so not entirely comparable to our scikit-learn examples.

ticket_doc = nlp(tickets_corpus[1])

In [29]:
ticket_doc.text

"I'm having an issue with the {product_purchased}. Please assist.\n\nIf you need to change an existing product.\n\nI'm having an issue with the {product_purchased}. Please assist.\n\nIf The issue I'm facing is intermittent. Sometimes it works fine, but other times it acts up unexpectedly."

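spaCy pipelines can provide a vector representation (i.e. embedding) for each token. Note that the small pipeline loaded here (`en_core_web_sm`) does not ship with static word vectors: the 96-dimension vectors below are context-sensitive tensors produced by the pipeline, so similarity scores are rougher than those from the larger packages (such as `en_core_web_md` or `en_core_web_lg`), which include pretrained word vectors.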

We can get the vector for a word as follows.

In [30]:
# get word vector for the word 'unexpectedly' - a 96 dimension vector

unexpectedly = nlp(u'unexpectedly')
print(unexpectedly.vector)

[-1.48992846e-02 -9.72773433e-01 -2.12078834e+00 -8.65957975e-01
 -1.57547161e-01 -2.54883051e-01 -1.03888154e-01 -8.85680988e-02
  4.54792380e-02  5.71994603e-01 -5.91094255e-01 -3.52282226e-01
  2.37327814e-03  6.41643584e-01  6.39178514e-01 -2.07689896e-01
  1.52641118e-01 -9.49111938e-01 -1.89393759e-01  3.51712942e-01
 -4.07214046e-01  1.78185749e+00  2.42372781e-01 -6.29985511e-01
  2.44833484e-01  7.00281262e-01  2.67038047e-01  1.12765729e-01
 -1.57756436e+00  1.83215261e+00 -3.29652548e-01 -1.11963183e-01
  3.93959373e-01 -1.51478052e-01  9.16151330e-02 -1.95592964e+00
 -1.06865358e+00 -1.44240618e+00  1.61721492e+00 -1.21393025e-01
 -4.23775911e-01  3.44340205e-01 -4.69387382e-01  5.55107176e-01
 -1.74799800e-01 -3.40565562e-01 -8.32144499e-01  1.60883307e-01
  8.35322857e-01 -2.69881666e-01  1.14254797e+00  1.45602918e+00
 -6.38496816e-01  2.39781594e+00 -2.48123765e-01 -2.99766660e-01
  1.85273933e+00 -5.02061665e-01 -2.79424369e-01  1.46277592e-01
 -9.11092877e-01 -8.58279

We can compare the word vector for 'unexpectedly' with some other words that might be related.

In [31]:
intermittent = nlp(u'intermittent')
surprisingly = nlp(u'surprisingly')

In [32]:
unexpectedly.similarity(intermittent)

0.34489964073762713

In [33]:
unexpectedly.similarity(surprisingly)

0.8906676551229783

The cosine similarity shows that 'surprisingly' is much more similar to 'unexpectedly' than 'intermittent' is.

#### Document embeddings

We can measure similarity with document-level vectors, which spaCy provides through Doc objects. The values in a Doc vector are an average of the vectors for each token in that document.
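
The averaging step can be pictured with a tiny pure-Python sketch. The three-dimensional token vectors here are invented for illustration (real spaCy vectors have many more dimensions), and `mean_vector` is a hypothetical helper name.

```python
def mean_vector(vectors):
    # average each dimension across all token vectors
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# made-up token vectors for a two-token "document"
toy_token_vectors = {
    "not": [1.0, 0.0, 2.0],
    "charging": [3.0, 4.0, 0.0],
}

doc_vec = mean_vector(list(toy_token_vectors.values()))
print(doc_vec)  # [2.0, 2.0, 1.0]
```

One consequence of averaging is that word order is lost: two documents containing the same tokens in a different order get the same averaged vector under this scheme.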

In [34]:
ticket_doc.vector

array([-0.345924  , -0.38296276, -0.11615409, -0.24399236, -0.04707781,
        0.1848155 ,  0.35787773, -0.03967064,  0.26513293,  0.03438464,
        0.3388386 ,  0.05483507, -0.21400733,  0.19317377, -0.12437943,
       -0.08537383, -0.0163213 , -0.04032171,  0.09686897,  0.01143202,
       -0.13232793,  0.20896903,  0.08137819, -0.26028928,  0.27910882,
        0.21630871,  0.22284544,  0.2842081 ,  0.0149836 ,  0.22117846,
        0.17188542, -0.01495515,  0.5591556 , -0.35058782,  0.08542412,
        0.29157367,  0.07595362, -0.20705226, -0.07837567,  0.19844861,
        0.02289027,  0.29045033, -0.02778572, -0.15959685,  0.16310509,
       -0.3789265 , -0.14380701,  0.36538976,  0.24797802, -0.16165775,
       -0.08779511,  0.08664092, -0.15748355, -0.13478872,  0.21531922,
       -0.08226354, -0.05558764, -0.05496515,  0.04246097, -0.04665216,
        0.23162736, -0.07781044,  0.11183633,  0.3285311 ,  0.21637182,
       -0.0811092 , -0.08850811, -0.3651212 ,  0.08333947,  0.17

Let's compare some tickets.

In [35]:
ticket2 = nlp(tickets_corpus[2]) # another ticket
ticket668 = nlp(tickets_corpus[668]) # we saw this one was similar earlier

In [36]:
print(ticket2.text)

I'm facing a problem with my {product_purchased}. The {product_purchased} is not turning on. It was working fine until yesterday, but now it doesn't respond.

1.8.3 I really I'm using the original charger that came with my {product_purchased}, but it's not charging properly.


In [37]:
ticket_doc.similarity(ticket2)

0.7271517871097166

In [38]:
print(ticket668.text)

I'm having an issue with the {product_purchased}. Please assist.


I'm having an issue with the {product_purchased}. Please assist. The issue I'm facing is intermittent. Sometimes it works fine, but other times it acts up unexpectedly.


In [39]:
ticket_doc.similarity(ticket668)

0.9594867859745191