# Natural Language Processing
## 5️⃣ Document Similarity

### Document Similarity

Documents are made up of **words**, their most basic units. To calculate the **similarity between documents**, we will use the **cosine similarity** of **document vectors** built from those words.
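
As a minimal sketch of the idea, the snippet below computes cosine similarity for a few hand-written count vectors using only the standard library. The function name `cosine_similarity` and the toy vectors are assumptions made for illustration. Returning 0.0 for an all-zero vector is a convention chosen here, not something the text prescribes.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


# Hypothetical word-count vectors over a four-word vocabulary.
vec_a = [2, 1, 0, 1]
vec_b = [4, 2, 0, 2]  # same word proportions as vec_a, twice as long
vec_c = [0, 0, 3, 0]  # shares no words with vec_a

sim_same = cosine_similarity(vec_a, vec_b)  # close to 1
sim_orth = cosine_similarity(vec_a, vec_c)  # 0: no words in common
print(sim_same, sim_orth)
```

Because cosine similarity depends only on the direction of the vectors, a document and a twice-as-long document with the same word proportions come out as (nearly) identical.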

### Bag of Words

**Bag of Words** is one way to create a **document vector**: each component of the vector is the **frequency of a word in the document**.

The **dimension** of a Bag of Words document vector equals **the number of distinct words in the whole collection of documents** (the vocabulary), not only the words in one document.

A Bag of Words document vector has one drawback: it **cannot preserve the meaning of a compound word**, because it treats each part of the compound as an independent word (e.g. 'log off' -> 'log' and 'off').
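
The following hand-rolled sketch shows how such vectors can be built, assuming plain whitespace tokenization. The two toy sentences and the helper `bag_of_words` are hypothetical and are not taken from `test.txt`. It is not how `CountVectorizer` is implemented, only an illustration of the same idea.

```python
from collections import Counter

# Toy corpus (hypothetical examples).
documents = [
    "please log off before you leave",
    "the log fell off the truck",
]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocab = sorted({word for doc in documents for word in doc.split()})


def bag_of_words(doc):
    """Count vector with one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


vectors = [bag_of_words(doc) for doc in documents]

print(vocab)
for vec in vectors:
    print(vec)
```

Every vector has one entry per vocabulary word, and both documents get the same counts for "log" and "off" even though only the first one uses the compound "log off".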

#### Bag of N-grams

An **N-gram** is a sequence of N consecutive words. Analyzing text in units of N-grams preserves the meaning of compound words.

- N = 1 (unigram) : ["mild", "spring", "weather", "is", "expected", "to", "continue", ...]
- N = 2 (bigram) : ["mild spring", "spring weather", "weather is", "is expected", ...]
- N = 3 (trigram) : ["mild spring weather", "spring weather is", "weather is expected", ...]

A **Bag of N-grams** builds the document vector from the frequencies of N-grams, optionally combining several values of N (for example, unigrams and bigrams together).
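
A small sketch of N-gram extraction, assuming whitespace tokenization, is shown below. The helpers `ngrams` and `bag_of_ngrams` and their parameters `n_min` and `n_max` are hypothetical names introduced for illustration. They loosely mirror the `ngram_range=(1, 2)` option used with `TfidfVectorizer` later in this section.

```python
from collections import Counter


def ngrams(tokens, n):
    """Every run of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, n_min=1, n_max=2):
    """Count all N-grams with n_min <= N <= n_max in a whitespace-tokenized text."""
    tokens = text.split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        counts.update(ngrams(tokens, n))
    return counts


tokens = "please log off before you leave".split()
bigrams = ngrams(tokens, 2)
print(bigrams)

counts = bag_of_ngrams("please log off before you leave")
print(counts["log off"], counts["log"])
```

With unigrams and bigrams combined, "log off" gets its own count alongside "log" and "off", so the compound is no longer lost.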

In [2]:
import re
from sklearn.feature_extraction.text import CountVectorizer

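# Keep only lowercase ASCII letters and spaces.
# Note: the text is not lowercased here, so uppercase letters are deleted, not converted.
regex = re.compile('[^a-z ]')

with open("test.txt", 'r') as f:
    documents = []
    for line in f:
        documents.append(regex.sub("", line.rstrip()))
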
# Create the Bag of Words document vectors with a CountVectorizer and store them in X.
cv = CountVectorizer()
X = cv.fit_transform(documents)

# Print X's dimension
dim = X.shape
print(dim)

# Print the words represented by the first 10 columns of X.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
words_feature = cv.get_feature_names_out()[:10].tolist()
print(words_feature)

# Find the index of the column that represents "comedy".
idx = cv.vocabulary_["comedy"]
print(idx)

# Store the first document's Bag of Words vector in vec1, then print it.
vec1 = X[0]
print(vec1)

(454, 12640)
['aal', 'aba', 'abandon', 'abandoned', 'abbot', 'abducted', 'abets', 'abilities', 'ability', 'abilitytalent']
2129
  (0, 9686)	4
  (0, 5525)	4
  (0, 6010)	4
  (0, 1761)	1
  (0, 2129)	1
  (0, 9081)	1
  (0, 948)	2
  (0, 11184)	8
  (0, 9829)	1
  (0, 11321)	1
  (0, 870)	2
  (0, 10446)	1
  (0, 8064)	1
  (0, 8847)	1
  (0, 16)	1
  (0, 9910)	2
  (0, 6444)	1
  (0, 10842)	1
  (0, 3245)	2
  (0, 12582)	1
  (0, 5689)	2
  (0, 11070)	1
  (0, 8837)	1
  (0, 6343)	1
  (0, 6826)	2
  :	:
  (0, 9400)	1
  (0, 11527)	1
  (0, 1635)	1
  (0, 3144)	1
  (0, 5633)	1
  (0, 9213)	1
  (0, 2012)	1
  (0, 6479)	1
  (0, 5126)	1
  (0, 9794)	1
  (0, 7803)	1
  (0, 12607)	1
  (0, 3420)	1
  (0, 3978)	1
  (0, 6750)	1
  (0, 196)	1
  (0, 7185)	1
  (0, 257)	1
  (0, 11237)	1
  (0, 4134)	1
  (0, 4212)	1
  (0, 4996)	1
  (0, 8519)	1
  (0, 6046)	1
  (0, 6028)	1


### TF-IDF

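**TF-IDF** (term frequency-inverse document frequency) weights a word by how often it occurs in a document, discounted by how many documents in the collection contain it. A word that is frequent in one document but rare elsewhere is treated as more important to that document.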

$$\text{TF-IDF}(word_1, doc_1) = \frac{\text{number of occurrences of } word_1 \text{ in } doc_1}{\text{total number of words in } doc_1} \times \log\left(\frac{\text{total number of docs in data}}{\text{number of docs in data that contain } word_1}\right)$$

The **TF-IDF-based bag of words document vector** lowers the importance of words that appear frequently in all documents and increases the importance of words that occur frequently only in specific documents.
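
To make the formula concrete, here is a minimal sketch that applies it directly to a toy corpus. The documents and the helper `tf_idf` are hypothetical, and the natural logarithm is an assumption, since the formula does not fix a base. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each vector by default, so its values will differ from the ones computed here.

```python
import math
from collections import Counter

# Toy corpus (hypothetical examples, not taken from test.txt).
documents = [
    "the comedy was funny",
    "the drama was slow",
    "the comedy had a funny ending",
]
tokenized = [doc.split() for doc in documents]


def tf_idf(word, doc_tokens, all_docs):
    """TF-IDF of `word` in one document, following the formula above."""
    tf = Counter(doc_tokens)[word] / len(doc_tokens)
    docs_with_word = sum(1 for tokens in all_docs if word in tokens)
    if docs_with_word == 0:
        return 0.0  # word never occurs in the corpus
    idf = math.log(len(all_docs) / docs_with_word)  # natural log (assumed)
    return tf * idf


score_the = tf_idf("the", tokenized[0], tokenized)        # appears in every doc
score_comedy = tf_idf("comedy", tokenized[0], tokenized)  # appears in 2 of 3 docs
score_drama = tf_idf("drama", tokenized[1], tokenized)    # appears in 1 of 3 docs
print(score_the, score_comedy, score_drama)
```

A word that occurs in every document ("the") scores exactly zero, while a word confined to a single document ("drama") scores highest among words with the same term frequency.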

In [3]:
import re
from sklearn.feature_extraction.text import TfidfVectorizer

regex = re.compile('[^a-z ]')

with open("test.txt", 'r') as f:
    documents = []
    for line in f:
        lowered_sent = line.rstrip().lower()
        filtered_sent = regex.sub('', lowered_sent)
        documents.append(filtered_sent)

# Create the TF-IDF Bag of Words document vectors with a TfidfVectorizer and store them in X.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(documents)

# Print X's dimension
dim1 = X.shape
print(dim1)

# Store the first document's TF-IDF Bag of Words vector in vec1, then print it.
vec1 = X[0]
print(vec1)

# Create TF-IDF Bag of N-grams document vectors with a TfidfVectorizer.
unibigram_tfidf = TfidfVectorizer(ngram_range=(1, 2))  # uses both unigrams and bigrams
unibigram_X = unibigram_tfidf.fit_transform(documents)

# Print unibigram_X's dimension.
dim2 = unibigram_X.shape
print(dim2)


(454, 12136)
  (0, 5679)	0.058640619958889736
  (0, 8003)	0.10821800789540346
  (0, 11827)	0.03351360629176965
  (0, 3976)	0.10821800789540346
  (0, 3885)	0.056559253324120214
  (0, 10825)	0.040897765011371726
  (0, 228)	0.07670128660443204
  (0, 173)	0.08289290212342751
  (0, 6546)	0.043212451333837304
  (0, 3731)	0.06944792122696908
  (0, 11793)	0.08712444266241767
  (0, 12103)	0.05368632319734369
  (0, 7470)	0.025687260438575044
  (0, 9189)	0.10821800789540346
  (0, 4994)	0.04450120519256354
  (0, 5326)	0.05241495461089306
  (0, 5538)	0.09654704887371249
  (0, 6240)	0.05908965016142883
  (0, 1896)	0.06002531501567428
  (0, 8681)	0.10139093452026096
  (0, 5344)	0.08289290212342751
  (0, 3162)	0.05211156559734475
  (0, 1438)	0.09278983927035103
  (0, 11136)	0.07804901647687901
  (0, 8858)	0.09654704887371249
  :	:
  (0, 8333)	0.09654704887371249
  (0, 10655)	0.10821800789540346
  (0, 5394)	0.036855273761029705
  (0, 12074)	0.045486142952180335
  (0, 7080)	0.06950749415600704
  (0, 106