<a href="https://colab.research.google.com/github/Yenacho/09/blob/main/BoW.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Bag of Words

Bag-of-Words is a family of text representations, where text vectors are built by observing and counting the words that appear in a text.

We study 2 types of BoW vectors:
* **Raw Count**: count the number of occurrences of each word in a text
* **TF-IDF**: adjust the raw count to favor words that appear often in a few documents, as opposed to those that appear often in all documents

## Definitions

**Document**, **Corpus**, **Use case** and **Query**:
* **Document**: the smallest unit of text in your use case
* **Corpus**: your collection of documents
* **Use case**: the typical question you are trying to answer
* **Query**: the text you will use to search your corpus

A few examples of use cases:
* Use case 1: "*Which academic papers are about black holes?*"
   * Corpus: academic papers uploaded to ArXiv
   * Document: 1 paper
   * Query: "black hole"
* Use case 2: "*Where does Victor Hugo mention Notre-Dame?*"
   * Corpus: entire works from Victor Hugo
   * Document: 1 paragraph
   * Query: "notre dame"
* Use case 3: "*What can I cook with pasta and garlic?*"
   * Corpus: all recipes in multiple cook books
   * Document: 1 recipe
   * Query: "pasta garlic"

**Tokenizer**

A tokenizer is a program that takes in a text and splits it into smaller units. A book can be split into chapters, chapters into paragraphs, paragraphs into sentences, and sentences into words. These are all examples of tokenization.

Once a text is tokenized into sentences, you can tokenize sentences into words.


**Sentence**

In natural language, a text is made of multiple sentences, separated by punctuation marks such as `.`. Splitting a text into sentences is nonetheless a challenge, because some `.` do not end a sentence: they can mark abbreviations, for example.
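
To make the difficulty concrete, here is a minimal sketch of a naive sentence splitter that cuts after every `.`, `!` or `?` followed by whitespace. The sample text is made up for illustration and is not taken from the corpus.

```python
import re

# Made-up sample text with abbreviations (an assumption for illustration).
text = "Dr. Watson met Mr. Holmes at Baker St. in London. They talked."

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive_sentences = re.split(r'(?<=[.!?])\s+', text)

for sentence in naive_sentences:
    print(repr(sentence))
```

The text contains two sentences, but the naive rule also cuts after `Dr.`, `Mr.` and `St.`. Dedicated sentence tokenizers exist to handle such cases.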

**Word**:

Any text is made of words. Sometimes they are nicely separated by spaces or punctuation marks. But, as with sentences, splitting is not always straightforward: some words include punctuation marks, like `U.S.A.` or `to court-martial`.
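
As a rough illustration, the sketch below applies two simplistic word-splitting rules to a made-up sentence. Neither rule is what NLTK or SpaCy do; they only show where naive approaches go wrong.

```python
import re

text = "The U.S.A. may court-martial him."

# Splitting on whitespace keeps punctuation glued to the words ("him.").
whitespace_tokens = text.split()

# Keeping only runs of word characters breaks words that contain punctuation.
word_char_tokens = re.findall(r'\w+', text)

print(whitespace_tokens)
print(word_char_tokens)
```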


**Vocabulary**:

The list of unique words used in the corpus.
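
A minimal sketch of how a vocabulary can be built by hand, assuming a toy corpus and a very simple word splitter (both are illustrative choices, not the tokenizers used later in this notebook):

```python
import re

corpus = ["the cat sat", "The dog sat down"]  # toy corpus (assumption)

# Lowercase each document, split it into words, and keep each word once.
vocabulary = sorted({word for doc in corpus for word in re.findall(r'\w+', doc.lower())})

# Each word gets a fixed index, which becomes its dimension in a BoW vector.
word_to_index = {word: k for k, word in enumerate(vocabulary)}
print(word_to_index)
```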



In [1]:
import numpy as np
import math
import pandas as pd

## Download Corpus

We will use some short extracts from a Sherlock Holmes story, "A Scandal in Bohemia", by Sir Arthur Conan Doyle.

We will start with the first paragraph of the book.

* **Corpus**: All sentences in "Scandal in Bohemia"
* **Document**: 1 sentence of the book

In [2]:
import requests

r = requests.get('https://sherlock-holm.es/stories/plain-text/scan.txt')

assert r.status_code == 200

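# Save a local copy of the story
with open('scandal_in_bohemia.txt', 'w', encoding='utf-8') as out:
    out.write(r.content.decode('utf-8'))

# Read it back, keeping only the non-empty lines
with open('scandal_in_bohemia.txt', encoding='utf-8') as f:
    lines = [txt for txt in f if len(txt.strip()) > 0]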

print(lines[:20])

['                              A SCANDAL IN BOHEMIA\n', '                               Arthur Conan Doyle\n', '                                Table of contents\n', '                                     Chapter 1\n', '                                     Chapter 2\n', '                                     Chapter 3\n', '          CHAPTER I\n', '     To Sherlock Holmes she is always the woman. I have seldom heard him\n', '     mention her under any other name. In his eyes she eclipses and\n', '     predominates the whole of her sex. It was not that he felt any\n', '     emotion akin to love for Irene Adler. All emotions, and that one\n', '     particularly, were abhorrent to his cold, precise but admirably\n', '     balanced mind. He was, I take it, the most perfect reasoning and\n', '     observing machine that the world has seen, but as a lover he would\n', '     have placed himself in a false position. He never spoke of the softer\n', '     passions, save with a gibe and a sneer. T

In [3]:
# First Paragraph
par = ' '.join([x.strip() for x in lines[7:25]])

import textwrap
print(textwrap.fill(par, width=80))

To Sherlock Holmes she is always the woman. I have seldom heard him mention her
under any other name. In his eyes she eclipses and predominates the whole of her
sex. It was not that he felt any emotion akin to love for Irene Adler. All
emotions, and that one particularly, were abhorrent to his cold, precise but
admirably balanced mind. He was, I take it, the most perfect reasoning and
observing machine that the world has seen, but as a lover he would have placed
himself in a false position. He never spoke of the softer passions, save with a
gibe and a sneer. They were admirable things for the observer--excellent for
drawing the veil from men's motives and actions. But for the trained reasoner to
admit such intrusions into his own delicate and finely adjusted temperament was
to introduce a distracting factor which might throw a doubt upon all his mental
results. Grit in a sensitive instrument, or a crack in one of his own high-power
lenses, would not be more disturbing than a strong emo

## NLTK

NLTK is a Python library for text analytics.

See [Link](https://www.nltk.org).

In [4]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

The **sentence tokenizer** splits a text into sentences.

In [5]:
from nltk.tokenize import sent_tokenize
nltk_sentences = sent_tokenize(par)
nltk_sentences

['To Sherlock Holmes she is always the woman.',
 'I have seldom heard him mention her under any other name.',
 'In his eyes she eclipses and predominates the whole of her sex.',
 'It was not that he felt any emotion akin to love for Irene Adler.',
 'All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.',
 'He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position.',
 'He never spoke of the softer passions, save with a gibe and a sneer.',
 "They were admirable things for the observer--excellent for drawing the veil from men's motives and actions.",
 'But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results.',
 'Grit in a sensitive instrument, or a crack in one of his own high-power lenses, would no

The **word tokenizer** splits a text into words.

In [6]:
from nltk.tokenize import word_tokenize
nltk_tokens = word_tokenize(nltk_sentences[0])
nltk_tokens

['To', 'Sherlock', 'Holmes', 'she', 'is', 'always', 'the', 'woman', '.']

## SpaCy

SpaCy is another Python library for text analytics.

See [Link](https://spacy.io).

In [7]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [8]:
doc = nlp(par)

It also has a **sentence tokenizer**.

In [9]:
spacy_sentences = list(doc.sents)
spacy_sentences

[To Sherlock Holmes she is always the woman.,
 I have seldom heard him mention her under any other name.,
 In his eyes she eclipses and predominates the whole of her sex.,
 It was not that he felt any emotion akin to love for Irene Adler.,
 All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.,
 He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position.,
 He never spoke of the softer passions, save with a gibe and a sneer.,
 They were admirable things for the observer--excellent for drawing the veil from men's motives and actions.,
 But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results.,
 Grit in a sensitive instrument, or a crack in one of his own high-power lenses, would not be more disturbin

And a **word tokenizer**

In [10]:
spacy_tokens = [x for x in spacy_sentences[0]]
spacy_tokens

[To, Sherlock, Holmes, she, is, always, the, woman, .]

**Warning**: NLTK and SpaCy might produce different results: they can break sentences at different places, break words at different places, etc.

In [11]:
s = nltk_sentences[0]

## SKLEARN Generalities

Classes like `CountVectorizer` or `TfidfVectorizer` work in the following way:
* Instantiate an object with specific parameters (`v = CountVectorizer(...)`)
* Fit this object to your corpus, i.e. learn the vocabulary (method `v.fit(...)`)
* Transform any piece of text you have into a vector (method `v.transform(...)`)
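
The three steps can be sketched on a toy corpus (the corpus and the variable names are assumptions for illustration). The learned vocabulary is read here from the fitted `vocabulary_` attribute, which maps each word to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat", "the dog sat down"]  # toy corpus (assumption)

v = CountVectorizer()   # 1. instantiate with default parameters
v.fit(corpus)           # 2. learn the vocabulary from the corpus

# Words ordered by their column index in the vectors
vocabulary = sorted(v.vocabulary_, key=v.vocabulary_.get)

# 3. transform any text; words outside the vocabulary ("and") are ignored
vector = v.transform(["the cat and the dog"]).toarray()

print(vocabulary)
print(vector)
```

Note that `transform` never extends the vocabulary: only words seen during `fit` get a dimension.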



In [12]:
def show_vocabulary(vectorizer):
    words = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

    print(f'Vocabulary size: {len(words)} words')

    # we can print ~10 words per line
    for l in np.array_split(words, math.ceil(len(words) / 10)):
        print(''.join([f'{x:<15}' for x in l]))

In [13]:
from termcolor import colored

def show_bow(vectorizer, bow):
    words = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

    # we can print ~8 words + coefs per line
    for l in np.array_split(list(zip(words, bow)), math.ceil(len(words) / 8)):
        print(' | '.join([colored(f'{w:<15}:{n:>2}', 'grey') if int(n) == 0 else colored(f'{w:<15}:{n:>2}', on_color='on_yellow', attrs=['bold']) for w, n in l ]))

def show_bow_float(vectorizer, bow):
    words = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

    # we can print ~6 words + coefs per line
    for l in np.array_split(list(zip(words, bow)), math.ceil(len(words) / 6)):
        print(' | '.join([colored(f'{w:<15}:{float(n):>0.2f}', 'grey') if float(n) == 0 else colored(f'{w:<15}:{float(n):>0.2f}', on_color='on_yellow', attrs=['bold']) for w, n in l ]))


# Raw Count

* We take a text, any text, and represent it as a vector
* Each text is represented by a vector with **N** dimensions, where **N** is the size of the vocabulary
* Each dimension represents **1 word** of the vocabulary
* The coefficient in dimension **k** is the number of times the word at index **k** in the vocabulary appears in the represented text
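
Before turning to scikit-learn, here is a minimal pure-Python sketch of that definition. The vocabulary, the sample sentence and the helper name `raw_count_vector` are hypothetical, and the word splitting is deliberately simplistic.

```python
import re
from collections import Counter

# Hypothetical vocabulary, already sorted; the index of a word is its dimension.
vocabulary = ['always', 'holmes', 'is', 'she', 'sherlock', 'the', 'to', 'woman']

def raw_count_vector(text, vocabulary):
    """Count how many times each vocabulary word appears in `text`."""
    counts = Counter(re.findall(r'\w+', text.lower()))
    # Words of the text that are not in the vocabulary are simply dropped.
    return [counts[word] for word in vocabulary]

# Made-up sentence for illustration
vector = raw_count_vector("The woman! Holmes always called her the woman.", vocabulary)
print(vector)
```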

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

## First Example - Reduced Vocabulary

We illustrate with a small corpus so we have a reduced vocabulary.

* **Corpus**: The first paragraph of the book
* **Document**: 1 sentence

In [15]:
count_small = CountVectorizer(lowercase=False)
count_small.fit(nltk_sentences)
show_vocabulary(count_small)

Vocabulary size: 134 words
Adler          All            And            But            Grit           He             Holmes         In             Irene          It             
Sherlock       They           To             abhorrent      actions        adjusted       admirable      admirably      admit          akin           
all            always         and            any            as             balanced       be             but            cold           crack          
delicate       distracting    disturbing     doubt          drawing        dubious        eclipses       emotion        emotions       excellent      
eyes           factor         false          felt           finely         for            from           gibe           has            have           
he             heard          her            high           him            himself        his            in             instrument     into           
introduce      intrusions     is             it             late   



The option `lowercase` controls one behavior of the raw count: whether `And` is considered a different word from `and`.

* `lowercase=False` gives 134 unique words in the vocabulary
* `lowercase=True` gives 127 unique words
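
The same effect can be reproduced on a two-document toy corpus (an assumption for illustration), where `And` and `and` either count as two words or one:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["And then he left", "and now she left"]  # toy corpus (assumption)

cased = CountVectorizer(lowercase=False).fit(corpus)
folded = CountVectorizer(lowercase=True).fit(corpus)

print(sorted(cased.vocabulary_))    # 'And' and 'and' are separate entries
print(sorted(folded.vocabulary_))   # both are merged into 'and'
```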

In [16]:
count_small = CountVectorizer(lowercase=True)
count_small.fit(nltk_sentences)
show_vocabulary(count_small)

Vocabulary size: 127 words
abhorrent      actions        adjusted       adler          admirable      admirably      admit          akin           all            always         
and            any            as             balanced       be             but            cold           crack          delicate       distracting    
disturbing     doubt          drawing        dubious        eclipses       emotion        emotions       excellent      eyes           factor         
false          felt           finely         for            from           gibe           grit           has            have           he             
heard          her            high           him            himself        his            holmes         in             instrument     into           
introduce      intrusions     irene          is             it             late           lenses         love           lover          machine        
memory         men            mental         mention        might  



In [17]:
s = nltk_sentences[0]

print(f'Text: "{s}"')
bow = count_small.transform([s])
print(f'BoW Shape: {bow.shape}')
bow = bow.toarray()   # From sparse matrix to dense matrix (Careful with MEMORY)
print(f'BoW Vector: {bow}')

Text: "To Sherlock Holmes she is always the woman."
BoW Shape: (1, 127)
BoW Vector: [[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0
  1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0]]


In [18]:
show_bow(count_small, bow[0])

abhorrent      : 0 | actions        : 0 | adjusted       : 0 | adler          : 0 | admirable      : 0 | admirably      : 0 | admit          : 0 | akin           : 0
all            : 0 | always         : 1 | and            : 0 | any            : 0 | as             : 0 | balanced       : 0 | be             : 0 | but            : 0
cold           : 0 | crack          : 0 | delicate       : 0 | distracting    : 0 | disturbing     : 0 | doubt          : 0 | drawing        : 0 | dubious        : 0
eclipses       : 0 | emotion        : 0 | emotions       : 0 | excellent      : 0 | eyes           : 0 | factor         : 0 | false          : 0 | felt           : 0
finely         : 0 | for      



## Second Example - Larger Corpus

* **Corpus**: entire book
* **Document**: 1 sentence

In [19]:
book = ' '.join([x.strip() for x in lines])
sentences = sent_tokenize(book)

In [20]:
count = CountVectorizer(lowercase=True)
count.fit(sentences)
show_vocabulary(count)

Vocabulary size: 1948 words
15             1858           1888           abandoned      abhorrent      able           about          above          absolute       absolutely     
absorb         accent         accomplice     accomplished   account        accustomed     acquaintance   across         action         actions        
active         activity       actor          actress        acute          added          additional     address        addressing     adjusted       
adler          admirable      admirably      admit          adopted        advantage      advantages     adventuress    advise         affairs        
affect         after          afternoon      afterwards     again          against        agent          agitation      ago            ah             
air            aisle          akin           alarm          all            almost         alone          aloud          already        also           
altar          alternating    always         am             amazem



In [21]:
s = sentences[10]

print(f'Text: "{s}"')
bow = count.transform([s])
print(f'BoW Shape: {bow.shape}')
bow = bow.toarray()   # From sparse matrix to dense matrix (Careful with MEMORY)
print(f'BoW Vector: {bow}')

Text: "And yet there was but one woman to him, and that woman was the late Irene Adler, of dubious and questionable memory."
BoW Shape: (1, 1948)
BoW Vector: [[0 0 0 ... 0 0 0]]


In [22]:
show_bow(count, bow[0])



15             : 0 | 1858           : 0 | 1888           : 0 | abandoned      : 0 | abhorrent      : 0 | able           : 0 | about          : 0 | above          : 0
absolute       : 0 | absolutely     : 0 | absorb         : 0 | accent         : 0 | accomplice     : 0 | accomplished   : 0 | account        : 0 | accustomed     : 0
acquaintance   : 0 | across         : 0 | action         : 0 | actions        : 0 | active         : 0 | activity       : 0 | actor          : 0 | actress        : 0
acute          : 0 | added          : 0 | additional     : 0 | address        : 0 | addressing     : 0 | adjusted       : 0 | adler          : 1 | admirable      : 0
admirably      : 0 | admit    

## Real-Life Corpus

Books are very clean texts. Real-life corpora that include user-generated material sit at the opposite end of the spectrum: they contain typos, strange usernames, and artefacts of all kinds.

The "20 newsgroups" dataset is a classical NLP dataset. Newsgroups are the ancestors of Reddit: people could post messages and reply in a thread.

* **Corpus**: newsgroup messages
* **Document**: full text of 1 message

In [23]:
from sklearn.datasets import fetch_20newsgroups

In [24]:
newsgroups = fetch_20newsgroups()

In [25]:
print(f'Number of documents: {len(newsgroups.data)}')
print(f'Sample document:\n{newsgroups.data[0]}')

Number of documents: 11314
Sample document:
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







* Vocabulary is much larger (130107 unique words)
* Lots of "garbage" in vocabulary ("mbocjlo3", "mc2i", "mc68882rc25")

In [26]:
count = CountVectorizer()
count.fit(newsgroups.data)
show_vocabulary(count)



Streaming output truncated to the last 5000 lines.
mbit           mbits          mbiz5l         mbk            mbkolodn       mbl            mbl2           mblawson       mblock         mbm4           
mbmj           mbmupg9xkuex   mbn            mbnut9og       mbo            mbocjlo3       mbond          mbongo         mbp            mbps           
mbq            mbq78acw       mbqet          mbqyhb_        mbr            mbrader        mbrian         mbrxkykwsc     mbs            mbs0t          
mbs0tbs3       mbs0tq         mbs0tq6        mbs110         mbs3           mbti           mbtj           mbts           mbu            mbuc           
mbud_eyj       mbuf           mbui           mbunix         mbuntan        mbv            mbv5s          mbw            mbw1           mbw1e          
mbw76o         mbwfl_         mbx            mbxlt          mbxltbxn       mbxltq         mbxltq6        mbxn           mbxo           mbxom          
mbxom4         mby            mby8v         

In [27]:
print(f'Size of vocabulary: {len(count.get_feature_names_out())}')

Size of vocabulary: 130107
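
These figures explain the "Careful with MEMORY" comments next to `toarray()`. As a back-of-the-envelope sketch, the cell below estimates the size of the whole corpus as a dense matrix, assuming 8 bytes per coefficient (`CountVectorizer` produces `int64` counts by default):

```python
n_documents = 11314       # number of newsgroup messages printed earlier
vocabulary_size = 130107  # vocabulary size printed above
bytes_per_value = 8       # int64 counts (CountVectorizer default dtype)

dense_bytes = n_documents * vocabulary_size * bytes_per_value
print(f'Dense matrix: {dense_bytes / 1e9:.1f} GB')
```

The result is on the order of 12 GB, almost all of it zeros, which is why the vectorizers return sparse matrices and why only single documents are converted with `toarray()` in this notebook.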




# TF-IDF

The motivation for TF-IDF is that cosine similarity with raw count coefficients puts too much emphasis on the number of occurrences of a word within a document.

Repeating a word will artificially increase the cosine similarity with any text containing this word.

Consider which word would be more important:
1. One that is repeated a lot and equally present in each document
1. One that appears a lot, but only in a few documents

TF-IDF computes coefficients:
* Low values for common words (i.e. present in the document, but quite common over the corpus)
* High values for uncommon words (i.e. present in the document, but not common over the corpus)

We consider one specific document, and one specific word.

* **TF = Term Frequency**: the number of times the word appears in the document
* **DF = Document Frequency**: the number of documents in the corpus in which the word appears
* **IDF = Inverse Document Frequency**: the inverse of the Document Frequency

Logarithms are introduced to reflect that seeing a word 100 times more often does not deliver 100 times the information.

Given a word **w**, a document **d** in a corpus of **D** documents:

$\textrm{TF-IDF}(w, d) = \textrm{TF}(w, d) \times \textrm{IDF}(w)$

$
\begin{align}
\textrm{IDF}(w) = \log \left( \frac{1 + D}{1 + \textrm{DF}(w)} \right) + 1
\end{align}
$

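This is the default scikit-learn formula, where log is the natural logarithm (see [Link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer)). By default, `TfidfVectorizer` also rescales each document vector to unit length (L2 normalization), so the coefficients displayed in this notebook lie between 0 and 1.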
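
To get a feel for the IDF formula, the sketch below evaluates it for a hypothetical corpus of 1000 documents. The function name `idf` and the numbers are assumptions for illustration.

```python
import math

def idf(n_documents, document_frequency):
    """Smoothed IDF: log((1 + D) / (1 + DF)) + 1, with the natural logarithm."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1

D = 1000  # hypothetical corpus size
for df in (1, 10, 100, 1000):
    print(f'DF = {df:>4}  ->  IDF = {idf(D, df):.3f}')
```

A word present in every document gets the minimum IDF of 1, and the IDF grows slowly as the word becomes rarer.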


Bag of Words vectors with TF-IDF coefficients (often called TF-IDF vectors):
* **N** dimensions, where **N** is the size of the vocabulary
* Coefficient at dimension **k** is the coefficient for the word at index **k** in the vocabulary
* Coefficients are TF-IDF coefficients, instead of raw count
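
As a sanity check, the sketch below rebuilds TF-IDF vectors by hand on a toy corpus (an assumption for illustration) and compares them with `TfidfVectorizer` using its default settings. It includes one step on top of the TF-IDF product: scikit-learn rescales each document vector to unit L2 norm by default.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat", "the dog sat down", "the bird flew"]  # toy corpus (assumption)

counts = CountVectorizer().fit_transform(corpus).toarray()  # TF: raw counts, shape (D, N)
n_documents = counts.shape[0]
document_frequency = (counts > 0).sum(axis=0)               # DF: documents containing each word
idf = np.log((1 + n_documents) / (1 + document_frequency)) + 1

manual = counts * idf                                            # TF * IDF
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalize each row

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
```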

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer

### Example

We continue with the Sherlock Holmes book "A Scandal in Bohemia".

* **Corpus**: full text of the book
* **Document**: 1 sentence

In [29]:
tfidf = TfidfVectorizer()
tfidf.fit(sentences)
show_vocabulary(tfidf)

Vocabulary size: 1948 words
15             1858           1888           abandoned      abhorrent      able           about          above          absolute       absolutely     
absorb         accent         accomplice     accomplished   account        accustomed     acquaintance   across         action         actions        
active         activity       actor          actress        acute          added          additional     address        addressing     adjusted       
adler          admirable      admirably      admit          adopted        advantage      advantages     adventuress    advise         affairs        
affect         after          afternoon      afterwards     again          against        agent          agitation      ago            ah             
air            aisle          akin           alarm          all            almost         alone          aloud          already        also           
altar          alternating    always         am             amazem



In [30]:
s = sentences[10]

print(f'Text: "{s}"')
bow = tfidf.transform([s])
print(f'BoW Shape: {bow.shape}')
bow = bow.toarray()   # From sparse matrix to dense matrix (Careful with MEMORY)
print(f'BoW Vector: {bow}')

Text: "And yet there was but one woman to him, and that woman was the late Irene Adler, of dubious and questionable memory."
BoW Shape: (1, 1948)
BoW Vector: [[0. 0. 0. ... 0. 0. 0.]]


In [31]:
show_bow_float(tfidf, bow[0])

15             :0.00 | 1858           :0.00 | 1888           :0.00 | abandoned      :0.00 | abhorrent      :0.00 | able           :0.00
about          :0.00 | above          :0.00 | absolute       :0.00 | absolutely     :0.00 | absorb         :0.00 | accent         :0.00
accomplice     :0.00 | accomplished   :0.00 | account        :0.00 | accustomed     :0.00 | acquaintance   :0.00 | across         :0.00
action         :0.00 | actions        :0.00 | active         :0.00 | activity       :0.00 | actor          :0.00 | actress        :0.00
acute          :0.00 | added          :0.00 | additional     :0.00 | address        :0.00 | addressing     :0.00 | adjusted       :0.00
adler          :0.22 | admirable



Display the IDF of some words. 

* High IDF = word that appears in few documents
* Low IDF = word that appears in most documents

In [32]:
words = list(tfidf.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
word = input('Word: ').lower()

if word in words:
    k = words.index(word)
    print(f'IDF({words[k]}) = {tfidf.idf_[k]}')
else:
    print('Not in vocabulary')



Word: visitor
IDF(visitor) = 5.721470641684252


#### More than one TF-IDF

There is a family of TF-IDF formulas. 

Another variant uses a **sublinear TF**, which replaces the raw count with:

$
\begin{align}
\textrm{TF}(w, d) = 1 + \log \left( \textrm{raw count}(w, d) \right)
\end{align}
$
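
A quick numeric sketch of this dampening, using a hypothetical helper `sublinear_tf` (a word that does not appear keeps a TF of 0):

```python
import math

def sublinear_tf(raw_count):
    """1 + log(raw count) for words that appear, 0 otherwise."""
    return 1 + math.log(raw_count) if raw_count > 0 else 0.0

for raw_count in (1, 2, 10, 100):
    print(f'raw count = {raw_count:>3}  ->  sublinear TF = {sublinear_tf(raw_count):.2f}')
```

A word repeated 100 times gets a TF below 6 instead of 100.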


In [33]:
tfidf_sublinear = TfidfVectorizer(sublinear_tf=True)
tfidf_sublinear.fit(sentences)

TfidfVectorizer(sublinear_tf=True)

In [34]:
s = sentences[10]

print(f'Text: "{s}"')
bow_sl = tfidf_sublinear.transform([s])
print(f'BoW Shape: {bow_sl.shape}')
bow_sl = bow_sl.toarray()   # From sparse matrix to dense matrix (Careful with MEMORY)
print(f'BoW Vector: {bow_sl}')

Text: "And yet there was but one woman to him, and that woman was the late Irene Adler, of dubious and questionable memory."
BoW Shape: (1, 1948)
BoW Vector: [[0. 0. 0. ... 0. 0. 0.]]


In [35]:
show_bow_float(tfidf_sublinear, bow_sl[0])

15             :0.00 | 1858           :0.00 | 1888           :0.00 | abandoned      :0.00 | abhorrent      :0.00 | able           :0.00
about          :0.00 | above          :0.00 | absolute       :0.00 | absolutely     :0.00 | absorb         :0.00 | accent         :0.00
accomplice     :0.00 | accomplished   :0.00 | account        :0.00 | accustomed     :0.00 | acquaintance   :0.00 | across         :0.00
action         :0.00 | actions        :0.00 | active         :0.00 | activity       :0.00 | actor          :0.00 | actress        :0.00
acute          :0.00 | added          :0.00 | additional     :0.00 | address        :0.00 | addressing     :0.00 | adjusted       :0.00
adler          :0.23 | admirable



In [36]:
word = 'yet'

index = words.index(word)

bow = tfidf.transform([s]).toarray()

print(f'Word: "{word}"')
print(f'TF-IDF with Natural TF   = {bow[0][index]:0.4f}')
print(f'TF-IDF with Sublinear TF = {bow_sl[0][index]:0.4f}')

Word: "yet"
TF-IDF with Natural TF   = 0.2194
TF-IDF with Sublinear TF = 0.2334


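Repeating a word in a text increases the TF-IDF coefficient of this word in the text representation. With the natural TF the repeated word ends up dominating the vector, while the sublinear TF dampens the effect.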

In [37]:
word = 'yet'
s = sentences[10]
s = s + ' ' + ' '.join(100 * [word])  # separate the repeated words from the end of the sentence

bow = tfidf.transform([s]).toarray()
bow_sl = tfidf_sublinear.transform([s]).toarray()

index = words.index(word)
print(f'Word: "{word}"')
print(f'TF-IDF with Natural TF   = {bow[0][index]:0.4f}')
print(f'TF-IDF with Sublinear TF = {bow_sl[0][index]:0.4f}')

Word: "yet"
TF-IDF with Natural TF   = 0.9990
TF-IDF with Sublinear TF = 0.8031


# Search Engine

With these vectors, we can build a search engine.

* **Query**: Let the user enter a text query
* Search the corpus for the documents that are **similar** to the query
* **Similarity**: we use the **cosine similarity** of the BoW vectors of two texts to evaluate their similarity.
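
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on the direction of the vectors and not on their magnitude. A minimal pure-Python sketch, with hypothetical raw-count vectors over a 3-word vocabulary:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

query = [1, 1, 0]   # hypothetical query vector
doc_a = [2, 2, 0]   # same words as the query, each counted twice
doc_b = [0, 0, 3]   # no word in common with the query

print(f'query vs doc_a: {cosine(query, doc_a):.2f}')
print(f'query vs doc_b: {cosine(query, doc_b):.2f}')
```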


In [38]:
corpus_bow = count.transform(newsgroups.data)

In [39]:
query = input("Type your query: ")
query_bow = count.transform([query])

Type your query: query


In [40]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(corpus_bow, query_bow)
print(f'Similarity Matrix Shape: {similarity_matrix.shape}')

Similarity Matrix Shape: (11314, 1)


The similarity matrix has **D** rows (the number of documents in the corpus) and 1 column.

Coefficient at row **k** is the cosine similarity between the document at index **k** in the corpus and the query.


In [41]:
similarities = pd.Series(similarity_matrix[:, 0])
similarities.head(10)

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
5    0.0
6    0.0
7    0.0
8    0.0
9    0.0
dtype: float64

In [42]:
top_10 = similarities.sort_values(ascending=False)[:10]
top_10

9826     0.081111
6560     0.067806
2342     0.065938
2826     0.062500
3131     0.062257
10100    0.060523
7969     0.056277
4418     0.055641
6105     0.054800
824      0.054393
dtype: float64

In [43]:
print('Best document:')
print(newsgroups.data[top_10.index[0]])

Best document:
From: yuan@wiliki.eng.hawaii.edu (Maw Ying Yuan)
Subject: Win3.1 Config.Sys query
Organization: University of Hawaii, College of Engineering
Lines: 11

Hi there,

With a 16Megs of RAM, is there a need to run/load Smartdrv for
Windows 3.1?  If yes, can I run/load Ramdrive without Smartdrv?
If I need both Ramdrive & Smartdrv, is the following Config.Sys
settings OK:  ...SMARTDRV.SYS 2048 2048
              ...RAMDRIVE.SYS 2048 /E

Thanks in advance for e-mail reply.

yuan@wiliki.eng.hawaii.edu
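
The search steps can be wrapped into a small helper. This is only a sketch under stated assumptions: the function name `search`, its parameters and the toy documents are hypothetical, and it uses a `TfidfVectorizer` where the cells above use raw counts, so that words common across the corpus weigh less in the ranking.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def search(query, documents, vectorizer, top_k=3):
    """Return (index, similarity) pairs for the `top_k` documents closest to `query`."""
    corpus_bow = vectorizer.transform(documents)
    query_bow = vectorizer.transform([query])
    similarities = cosine_similarity(corpus_bow, query_bow)[:, 0]
    best = np.argsort(similarities)[::-1][:top_k]   # highest similarity first
    return [(int(k), float(similarities[k])) for k in best]

# Toy corpus (assumption), echoing the use cases from the Definitions section
documents = [
    "pasta with garlic and olive oil",
    "black holes bend light",
    "notre dame stands in paris",
]
vectorizer = TfidfVectorizer().fit(documents)

results = search("pasta garlic", documents, vectorizer)
for index, similarity in results:
    print(f'{similarity:.3f}  {documents[index]}')
```

On a large corpus, the corpus vectors would be computed once and reused across queries instead of being recomputed inside the helper.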

