<a href="https://colab.research.google.com/github/changsin/AI/blob/main/08.9.k_means_text_similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# K Means Clustering for Text Similarity

K Means clustering is an unsupervised machine learning algorithm. Given the number of clusters k, it assigns each item to the nearest of k cluster centers and repeatedly moves each center to the mean of the items assigned to it.
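The grouping procedure can be sketched in a few lines. This is a minimal illustration of Lloyd's algorithm, the standard iteration behind K Means, and not code from this notebook. The function name `k_means`, the choice of the first k points as initial centers, and the toy 2-D points are assumptions made for the example.

```python
import math

def k_means(points, k, iterations=10):
    """Minimal Lloyd's algorithm sketch: returns (centers, labels)."""
    # Assumption: use the first k points as initial centers (deterministic).
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

# Hypothetical toy data: two well-separated groups in 2-D.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = k_means(points, k=2)
print(labels)
```

For text, the points would be the word-count vectors built in the sections below, and the Euclidean distance could be swapped for a similarity-based one.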

## Bag of words
One area of application is text similarity. For instance, given a query string, you may want to find the most "relevant" document, and one way to quantify relevance is text similarity. To compute it, each document is turned into a vector of word frequencies, with one entry per word in a shared vocabulary. This representation is called a "bag of words".



For instance, if we have these words in the vocabulary,

['hello', 'is', 'one', 'test', 'this', 'world']

the sentence: "This is one test" is turned into:

[0 1 1 1 1 0]

if we mark each word in the vocabulary that appears in the sentence as 1.

In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

test_world = ["this is one test hello world"]

test_vectorizer = CountVectorizer()
test_vectorizer.fit(test_world)
test = test_vectorizer.transform(["this is one test"])
hello = test_vectorizer.transform(["hello world"])

print(test.toarray(), hello.toarray())
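# get_feature_names() was removed in newer scikit-learn releases; use get_feature_names_out()
print(test_vectorizer.get_feature_names_out().tolist())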

np.dot(test.toarray().reshape(-1), hello.toarray().reshape(-1))

[[0 1 1 1 1 0]] [[1 0 0 0 0 1]]
['hello', 'is', 'one', 'test', 'this', 'world']


0

## Cosine similarity
The textbook introduces the "normalized scalar product" to measure how close two texts are. The formula is:

$$ \cos\theta = \frac{x \cdot y}{|x||y|} $$

This is called the cosine similarity because it is the cosine of the angle between the two vectors. The larger the value, the more similar the texts. If two texts share no words, their vectors are orthogonal and the result is zero.
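As a minimal sketch of the cosine formula, the function below computes it for plain Python lists. The name `cosine_similarity` and the third sentence "hello test" are assumptions added for illustration; the first two vectors are the ones from the bag-of-words example above.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Vectors over the vocabulary ['hello', 'is', 'one', 'test', 'this', 'world']
this_is_one_test = [0, 1, 1, 1, 1, 0]
hello_world = [1, 0, 0, 0, 0, 1]
hello_test = [1, 0, 0, 1, 0, 0]  # hypothetical third sentence: "hello test"

print(cosine_similarity(this_is_one_test, hello_world))  # no shared words
print(cosine_similarity(this_is_one_test, hello_test))   # one shared word
```

The first pair shares no words and gives zero; the second shares "test" and gives a positive value below 1.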


By swapping the numerator and the denominator of the cosine similarity, we obtain a distance measure:

$$ d(x, y) = \frac{|x||y|}{x \cdot y} $$

In this case, the smaller the value, the more similar the texts. This is the formula we can use to solve Exercise 8.21.
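One practical detail of the inverted formula is that the scalar product is zero when two texts share no words, so the ratio is undefined. The sketch below treats that case as infinite distance, which is an assumption of this example and not something the exercise specifies; the name `inverse_cosine_distance` is also hypothetical.

```python
import math

def inverse_cosine_distance(x, y):
    """|x||y| / (x . y); returns infinity when the vectors share no words."""
    dot = sum(a * b for a, b in zip(x, y))
    if dot == 0:
        return math.inf  # orthogonal vectors: the ratio is undefined
    norm_product = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return norm_product / dot

print(inverse_cosine_distance([0, 1, 1, 1, 1, 0], [1, 0, 0, 1, 0, 0]))  # one shared word
print(inverse_cosine_distance([0, 1, 1, 1, 1, 0], [1, 0, 0, 0, 0, 1]))  # no shared words
```

Note that two identical vectors give 1 rather than 0, so this is a ranking score more than a distance in the strict sense.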

## Exercise 8.21

Determine the distances $d_s$ (scalar product) of the following texts to each other.
- x1: We will introduce the application of naive Bayes to text analysis on a short example text by Alan Turing from [Tur50].
- x2: We may hope that machines will eventually compete with men in all purely intellectual fields. But which are the best ones to start with?
- x3: Again I do not know what the right answer is, but I think both approaches should be tried.


In [None]:
x1 = ["We will introduce the application of naive Bayes to text analysis on a short example text by Alan Turing from [Tur50]."]
x2 = ["We may hope that machines will eventually compete with men in all purely intellectual fields. But which are the best ones to start with?"]
x3 = ["Again I do not know what the right answer is, but I think both approaches should be tried."]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(x1 + x2 + x3)
t1 = vectorizer.transform(x1)
t2 = vectorizer.transform(x2)
t3 = vectorizer.transform(x3)

In [None]:
a1 = t1.toarray().reshape(-1)
a2 = t2.toarray().reshape(-1)
a3 = t3.toarray().reshape(-1)

In [None]:
x1x2 = np.dot(a1, a2)
x2x3 = np.dot(a2, a3)
x1x3 = np.dot(a1, a3)
print("x1x2", x1x2)
print("x2x3", x2x3)
print("x1x3", x1x3)

x1x2 4
x2x3 2
x1x3 1


In [None]:
# np.dot(a, a) is the squared norm |a|^2; the square root is taken in the next cells.
x1_norm = np.dot(a1, a1)
x2_norm = np.dot(a2, a2)
x3_norm = np.dot(a3, a3)
print("|x1|^2", x1_norm)
print("|x2|^2", x2_norm)
print("|x3|^2", x3_norm)

|x1|^2 22
|x2|^2 26
|x3|^2 16


In [None]:
# Inverted form |x||y| / (x . y): a distance, so smaller means more similar.
x1x2_dist = np.sqrt(x1_norm*x2_norm)/x1x2
x2x3_dist = np.sqrt(x2_norm*x3_norm)/x2x3
x1x3_dist = np.sqrt(x1_norm*x3_norm)/x1x3
print("x1,x2 distance", x1x2_dist)
print("x2,x3 distance", x2x3_dist)
print("x1,x3 distance", x1x3_dist)

x1,x2 distance 5.979130371550699
x2,x3 distance 10.198039027185569
x1,x3 distance 18.76166303929372


In [None]:
x1x2_sim = x1x2/np.sqrt(x1_norm*x2_norm)
x2x3_sim = x2x3/np.sqrt(x2_norm*x3_norm)
x1x3_sim = x1x3/np.sqrt(x1_norm*x3_norm)
print("x1,x2 similarity", x1x2_sim)
print("x2,x3 similarity", x2x3_sim)
print("x1,x3 similarity", x1x3_sim)

x1,x2 similarity 0.16724840200141816
x2,x3 similarity 0.09805806756909202
x1,x3 similarity 0.053300179088902604
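
As a check by hand, the similarity of x1 and x2 follows from the scalar product 4 and the values 22 and 26 printed earlier, which are the squared norms of the two vectors:

$$ \cos\theta_{12} = \frac{x_1 \cdot x_2}{|x_1||x_2|} = \frac{4}{\sqrt{22 \cdot 26}} = \frac{4}{\sqrt{572}} \approx 0.167 $$

The inverted form is simply the reciprocal, $\sqrt{572}/4 \approx 5.98$, so both measures rank the three pairs in the same order: x1 and x2 are closest, x1 and x3 are farthest apart.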


# Bag of Words from scratch
Here is code that implements the bag of words (BOW) process (taken from https://gist.github.com/edubey/c52a3b34541456a76a2c1f81eebb5f67).

In [None]:
import re

import numpy

def word_extraction(sentence):
    # Replace every non-word character with a space, then split on whitespace.
    # Case is preserved and no stop words are removed.
    return re.sub(r"[^\w]", " ", sentence).split()

def tokenize(sentences):
    # Build the sorted vocabulary of distinct words across all sentences.
    words = []
    for sentence in sentences:
        words.extend(word_extraction(sentence))
    return sorted(set(words))

def generate_bow(allsentences):
    vectors = []
    vocab = tokenize(allsentences)
    print("Word List for Document \n{0} \n".format(vocab))

    for sentence in allsentences:
        bag_vector = numpy.zeros(len(vocab))
        # Count how often each vocabulary word occurs in the sentence.
        for w in word_extraction(sentence):
            bag_vector[vocab.index(w)] += 1

        print("{0} \n{1}\n".format(sentence, bag_vector))
        vectors.append(bag_vector)
    return vocab, vectors


In [None]:
vocab, bow = generate_bow(x1 + x2 + x3)

Word List for Document 
['Again', 'Alan', 'Bayes', 'But', 'I', 'Tur50', 'Turing', 'We', 'a', 'all', 'analysis', 'answer', 'application', 'approaches', 'are', 'be', 'best', 'both', 'but', 'by', 'compete', 'do', 'eventually', 'example', 'fields', 'from', 'hope', 'in', 'intellectual', 'introduce', 'is', 'know', 'machines', 'may', 'men', 'naive', 'not', 'of', 'on', 'ones', 'purely', 'right', 'short', 'should', 'start', 'text', 'that', 'the', 'think', 'to', 'tried', 'what', 'which', 'will', 'with'] 

We will introduce the application of naive Bayes to text analysis on a short example text by Alan Turing from [Tur50]. 
[0. 1. 1. 0. 0. 1. 1. 1. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1.
 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 2. 0. 1.
 0. 1. 0. 0. 0. 1. 0.]

We may hope that machines will eventually compete with men in all purely intellectual fields. But which are the best ones to start with? 
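[0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 1. 0.
 1. 0. 1. 1. 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 1. 1.
 0. 1. 0. 0. 1. 1. 2.]

Again I do not know what the right answer is, but I think both approaches should be tried. 
[1. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 1.
 1. 0. 1. 1. 0. 0. 0.]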

In [None]:
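# v1: x1 vectorized against its own vocabulary only (19 distinct words)
v1 = CountVectorizer().fit_transform(x1)
v1.toarray()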

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1]])

In [None]:
bow

[array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0., 1., 0., 1., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 0., 2., 0., 1., 0., 1., 0.,
        0., 0., 1., 0.]),
 array([0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 0., 1.,
        0., 0., 0., 1., 0., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 1., 1.,
        1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 1., 1., 0., 1., 0.,
        0., 1., 1., 2.]),
 array([1., 0., 0., 0., 2., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0.,
        1., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 1., 1., 0., 1.,
        1., 0., 0., 0.])]

In [None]:
bow[0]

array([0., 1., 1., 0., 0., 1., 1., 1., 1., 0., 1., 0., 1., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
       0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 0., 2., 0., 1., 0., 1., 0.,
       0., 0., 1., 0.])

In [None]:
bow[1]

array([0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 0., 1.,
       0., 0., 0., 1., 0., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 1., 1.,
       1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 1., 1., 0., 1., 0.,
       0., 1., 1., 2.])

In [None]:
np.dot(bow[0], bow[1])

4.0

In [None]:
np.dot(bow[1], bow[2])

1.0

In [None]:
np.dot(bow[0], bow[2])

1.0
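
The scalar product of x2 and x3 is 1 here but 2 in the `CountVectorizer` cells. The likely cause is tokenization: `CountVectorizer` lowercases text by default, so "But" in x2 matches "but" in x3, whereas `word_extraction` keeps case. The sketch below reproduces both values with a case switch; the helper names `count_words` and `scalar_product` are hypothetical and not part of the notebook.

```python
import re
from collections import Counter

x2 = ("We may hope that machines will eventually compete with men in all "
      "purely intellectual fields. But which are the best ones to start with?")
x3 = ("Again I do not know what the right answer is, but I think both "
      "approaches should be tried.")

def count_words(sentence, lowercase):
    # Same tokenization as word_extraction, with optional lowercasing.
    words = re.sub(r"[^\w]", " ", sentence).split()
    if lowercase:
        words = [w.lower() for w in words]
    return Counter(words)

def scalar_product(sentence_a, sentence_b, lowercase):
    a = count_words(sentence_a, lowercase)
    b = count_words(sentence_b, lowercase)
    # Only words present in both sentences contribute to the product.
    return sum(a[w] * b[w] for w in a if w in b)

print(scalar_product(x2, x3, lowercase=False))  # case-sensitive: only "the" matches
print(scalar_product(x2, x3, lowercase=True))   # case-insensitive: "the" and "but" match
```

By default `CountVectorizer` also drops single-character tokens such as "a" and "I", which is why its vocabulary is smaller than the 55 words listed above.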

In [None]:
np.matmul(bow[0], bow[1])

4.0

In [None]:
bow[0].shape


(55,)

In [None]:
ids = bow[1] > 0

In [None]:
np.where(ids)

(array([ 3,  7,  9, 14, 16, 20, 22, 24, 26, 27, 28, 32, 33, 34, 39, 40, 44,
        46, 47, 49, 52, 53, 54]),)
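
The indices returned by `np.where` can be mapped back to words through the vocabulary list. The snippet below is a small self-contained sketch using the six-word vocabulary from the first example and a made-up count vector; in the notebook the same pattern would apply to `vocab` and `bow[1]`.

```python
import numpy as np

# Hypothetical small example; in the notebook these would be `vocab` and `bow[1]`.
vocab = ['hello', 'is', 'one', 'test', 'this', 'world']
vector = np.array([0., 1., 1., 2., 1., 0.])

present = np.where(vector > 0)[0]    # indices of the non-zero counts
words = [vocab[i] for i in present]  # look each index up in the vocabulary
print(words)
```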