<h1>CS4619: Artificial Intelligence II</h1>
<h1>Bag of Words</h1>
<h2>
    Derek Bridge<br>
    School of Computer Science and Information Technology<br>
    University College Cork
</h2>

<h1>Initialization</h1>
$\newcommand{\Set}[1]{\{#1\}}$ 
$\newcommand{\Tuple}[1]{\langle#1\rangle}$ 
$\newcommand{\v}[1]{\pmb{#1}}$ 
$\newcommand{\cv}[1]{\begin{bmatrix}#1\end{bmatrix}}$ 
$\newcommand{\rv}[1]{[#1]}$ 
$\DeclareMathOperator{\argmax}{arg\,max}$ 
$\DeclareMathOperator{\argmin}{arg\,min}$ 
$\DeclareMathOperator{\dist}{dist}$
$\DeclareMathOperator{\abs}{abs}$

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

from sklearn.metrics import accuracy_score

from sklearn.pipeline import Pipeline

from sklearn.preprocessing import LabelEncoder

from sklearn.linear_model import LogisticRegression

from sklearn.decomposition import TruncatedSVD 

from tensorflow.keras import Model
from tensorflow.keras import Input
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout

from tensorflow.keras.optimizers import RMSprop

from tensorflow import convert_to_tensor, string

In [3]:
import os
if 'google.colab' in str(get_ipython()):
  from google.colab import drive
  drive.mount('/content/drive')
  base_dir = "./drive/My Drive/Colab Notebooks/" # You may need to change this, depending on where your notebooks are on Google Drive
else:
  base_dir = "." 

<h1>Natural Language Processing</h1>
<ul>
    <li>Languages:
        <ul>
            <li><b>Natural languages</b> are languages, such as English, which arise through some sort of 'cultural evolution'.</li>
            <li><b>Formal languages</b> are ones that are designed by humans, e.g. programming languages such as Python.</li>
        </ul>
    </li>
    <li>Formal languages have rules (syntax rules and perhaps a formal semantics). Natural languages, by contrast, follow certain cognitive principles. Linguists might attempt to formalize rules for a natural language, but users of the language are not constrained to follow the rules.</li>
    <li>Natural Language Processing (NLP) is a phrase that covers the work we do in AI on Natural Language Understanding (NLU) and Natural Language Generation (NLG).</li>
    <li>History:
        <ul>
            <li>Early work on NLP was quite ad-hoc.</li>
            <li>Then, in the 1980s, we equipped our NLP systems with the kinds of rules written by linguists. The coverage of the systems improved enormously, but they were very brittle: they failed whenever they were faced with the kind of English that we use every day.</li>
            <li>From the late 1980s, from large datasets, we used Machine Learning (and, later, Deep Learning) to train systems to make predictions:
                <ul>
                    <li>E.g. spam classifiers, sentiment analysers, topic classifiers, next-word predictors (autocompletion), machine translation, text summarization, &hellip;</li>
                    <li>These systems are useful. You use them all the time! Their performance may be impressive (and becoming ever more so).</li>
                    <li>But, like all Machine Learning, they work by finding regularities in the training data. They are a long way from <i>understanding</i> language.</li>
                </ul>
            </li>
        </ul>
    </li>
</ul>

<h1>Free-Form Text</h1>
<ul>
    <li>We've looked at AI systems that can handle structured data (i.e. numeric-valued and non-numeric-valued features) and AI systems that can handle images.
    </li>
    <li>Suppose instead the objects in your dataset are <b>documents</b>, rather than houses, students, irises or photos.
        <ul>
            <li>E.g. web pages, tweets, blog posts, emails, posts to Internet forums and chatrooms, &hellip;</li>
            <li>They might have a little structure to them (headings and so on), but they are primarily
                <b>free-form text</b>, written in a natural language, such as English.
            </li>
        </ul>
    </li>
    <li>Many AI algorithms can only handle vectors of numbers. So one way to apply AI techniques to 
        a dataset of documents is to convert the raw text in the documents into vectors of numbers.
    </li>
    <li><strong>Note that in this lecture, each document becomes one vector.</strong> In the next lecture, each word becomes one vector and hence a document becomes a list of vectors.
    </li>
    <!--
    <li>Our treatment of this will be brief and high-level, since many of you studied
        <i>CS4611 Information Retrieval</i>, where this is covered in depth.
    </li>
    -->
    <li><!-- Furthermore, we'll --> We'll illustrate this lecture mostly using scikit-learn 
        although its facilities for handling text are quite limited. 
        If you really want to do AI with text, consider a more powerful library such as <i>NLTK</i>
        (<a href="http://www.nltk.org/">http://www.nltk.org/</a>) or the <i>Stanford Natural Language
        Processing Toolkit</i> 
        (<a href="https://nlp.stanford.edu/software/">https://nlp.stanford.edu/software/</a>).
     </li>
</ul>

<h1>Sets</h1>
<h2>Background Maths</h2>
<ul>
    <li>A <b>set</b> is a collection of objects with two properties:
        <ul>
            <li>Order is not important,<br />
                e.g. $\Set{a, c, f} = \Set{c, f, a}$
            </li>
            <li>Duplicates are not allowed,<br />
                e.g. $\Set{a, c, a, f} = \Set{a, c, f}$
            </li>
        </ul>
    </li>
    <li>The set of all possible objects $U$ is called the <b>universal set</b>, e.g. $U = \Set{a, b, c, d, e, f, g}$.</li>
</ul>
<h2>Background Computer Science</h2>
<ul>
    <li>There are many data structures we can use to store sets, e.g. linked lists, binary search trees, &hellip;</li>
    <li>But we can describe a set by a binary-valued vector.
        <ul>
            <li>E.g. if $U = \Set{a, b, c, d, e, f, g}$, then we can represent the set $\Set{a, c, f}$ as follows:
                <table>
                    <tr>
                        <td>$a$</td><td>$b$</td><td>$c$</td><td>$d$</td><td>$e$</td><td>$f$</td><td>$g$</td>
                    </tr>
                    <tr>
                        <td>$1$</td><td>$0$</td><td>$1$</td><td>$0$</td><td>$0$</td><td>$1$</td><td>$0$</td>
                    </tr>
                </table>
            </li>
        </ul>
        Then, we can store the set using the same data structure that we use for storing vectors, e.g. numpy arrays. 
    </li>
    <li>In fact, if $U$ is large but the sets we store tend to be much smaller, then our vectors will be
        <b>sparse</b> (mostly zero).
    </li>
    <li>It may be more efficient to use a data structure that only stores the non-zero elements. SciPy has several data structures for this in <code>scipy.sparse</code> (e.g. <code>csr_matrix</code>).
    </li>
</ul>
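<p>
    The binary-vector idea can be sketched as follows. This is a minimal illustration, not part of the original notebook: the names <code>universe</code>, <code>position</code> and <code>set_to_vector</code> are made up for the example, and it assumes that numpy and SciPy are installed.
</p>

```python
import numpy as np
from scipy.sparse import csr_matrix

# The universal set, in a fixed order so that each element has a position
universe = ["a", "b", "c", "d", "e", "f", "g"]
position = {element: i for i, element in enumerate(universe)}

def set_to_vector(elements):
    """Return a binary vector with a 1 for each member of the set."""
    vector = np.zeros(len(universe), dtype=int)
    for element in elements:
        vector[position[element]] = 1
    return vector

dense = set_to_vector({"a", "c", "f"})
sparse = csr_matrix(dense)  # a 1 x 7 matrix that stores only the non-zero entries

print(dense)       # [1 0 1 0 0 1 0]
print(sparse.nnz)  # 3 stored elements, rather than 7
```

<p>
    With only seven possible elements the saving is trivial, but the same representation scales to universal sets with tens of thousands of elements.
</p>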

<h1>Bags</h1>
<h2>Background Maths</h2>
<ul>
    <li>A <b>bag</b> is a collection of objects with one property:
        <ul>
            <li>Order is not important,<br />
                e.g. $\Set{a, c, f} = \Set{c, f, a}$
            </li>
            <li>This time, duplicates are allowed,<br />
                e.g. $\Set{a, c, f} \neq \Set{a, c, a, f}$
            </li>
        </ul>
    </li>
</ul>
<h2>Background Computer Science</h2>
<ul>
    <li>We can describe a bag by a numeric-valued vector, where the numbers are the frequencies with which the elements occur.
        <ul>
            <li>E.g. if $U = \Set{a, b, c, d, e, f, g}$, then we can represent the bag $\Set{a, c, a,  f}$ as follows:
                <table>
                    <tr>
                        <td>$a$</td><td>$b$</td><td>$c$</td><td>$d$</td><td>$e$</td><td>$f$</td><td>$g$</td>
                    </tr>
                    <tr>
                        <td>$2$</td><td>$0$</td><td>$1$</td><td>$0$</td><td>$0$</td><td>$1$</td><td>$0$</td>
                    </tr>
                </table>
            </li>
        </ul>
    </li>
    <li>
        We can store these as numpy arrays or, if they are sparse, we can use SciPy's sparse data structures.
    </li>
</ul>
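<p>
    As a rough sketch, Python's <code>collections.Counter</code> behaves like a bag, and we can turn one into a vector of frequencies. The helper name <code>bag_to_vector</code> is hypothetical, chosen just for this example.
</p>

```python
from collections import Counter

# The universal set, in a fixed order
universe = ["a", "b", "c", "d", "e", "f", "g"]

def bag_to_vector(elements):
    """Return a vector of frequencies, one per element of the universal set."""
    counts = Counter(elements)
    return [counts[element] for element in universe]

bag_vector = bag_to_vector(["a", "c", "a", "f"])
print(bag_vector)  # [2, 0, 1, 0, 0, 1, 0]

# Order does not matter, but duplicates do
same_bag = Counter(["a", "c", "a", "f"]) == Counter(["f", "a", "c", "a"])
different_bag = Counter(["a", "c", "a", "f"]) == Counter(["a", "c", "f"])
print(same_bag)       # True
print(different_bag)  # False
```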

<h1>Bag-of-Words</h1>
<ul>
    <li>We will represent each document by a bag-of-words.</li>
    <li>This will lose lots of information. Start thinking about what we will lose!</li>
</ul>

<h2>Running example</h2>
<p>
    Suppose our dataset contains just these three documents:
</p>
<table style="border-collapse:collapse;">
    <tr>
        <th style="border: 1px solid black;">Tweet 0</th>
        <th style="border: 1px solid black;">Tweet 1</th>
        <th style="border: 1px solid black;">Tweet 2</th></tr>
    <tr>
        <td style="border: 1px solid black;">
            No one is born hating another person because of the color of his skin or his background 
            or his religion.
        </td>
        <td style="border: 1px solid black;">
            People must learn to hate, and if they can learn to hate, they can be taught to love.</td>
        <td style="border: 1px solid black;">
            For love comes more naturally to the human heart than its opposite.</td>
     </tr>
     <caption style="caption-side: bottom; text-align: center">
         Three tweets from Barack Obama, quoting Nelson Mandela
     </caption>
</table>

<h2>Tokenization</h2>
<ul>
    <li>First, we must <b>tokenize</b> each document. This means splitting it into <b>tokens</b>.</li> <!--<b>terms</b>.</li>-->
    <li>In our simple treatment, the tokens <!--terms--> are just the words, ignoring punctuation and making everything
        lowercase.
        <ul>
            <li>e.g. if we tokenize "People must learn to hate, and if they can learn to hate, they can be taught to love.", we get a sequence of 18 tokens: "people must learn to hate and if they can learn to hate they can be taught to love"</li>
        </ul>
    </li>
   <li>In reality, tokenization is surprisingly complicated,
       <ul>
           <li>e.g. should we keep the punctuation as separate tokens?</li>
           <li>e.g. should we treat "People" and "people" as different tokens?</li>
           <li>e.g. is "don't" one token <!--term--> or two or three?</li>
           <li>e.g. maybe pairs of consecutive words (so-called 'bigrams') could also be treated as if they were single tokens <!--terms--> ("no one", "one is",
               "is born")</li>
           <li>and so on.</li>
        </ul>
    </li>
</ul>
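<p>
    The simple treatment described above (lowercase everything, ignore punctuation) can be sketched with a regular expression. This is only one possible tokenizer, and the function names <code>tokenize</code> and <code>bigrams</code> are made up for the illustration. The last line shows how even this simple scheme has to take a position on "don't".
</p>

```python
import re

def tokenize(text):
    """Lowercase the text and return its runs of letters, digits and underscores."""
    return re.findall(r"\w+", text.lower())

def bigrams(tokens):
    """Return each pair of consecutive tokens, joined by a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

tweet = "People must learn to hate, and if they can learn to hate, they can be taught to love."
tokens = tokenize(tweet)

print(len(tokens))                          # 18
print(bigrams(tokenize("No one is born")))  # ['no one', 'one is', 'is born']
print(tokenize("don't"))                    # ['don', 't']
```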

<h2>Stop-words</h2>
<ul>
    <li>Optionally, discard <b>stop-words</b>.</li>
    <li>Stop-words are common words such as "a", "the", "in", "on", "is", "are", &hellip;</li>
    <li>Sometimes discarding them helps, or does no harm, e.g. spam detection.</li>
    <li>Other times, you lose too much, e.g. web search engines ("To be, or not to be").</li>
</ul>

<h2>Running example</h2>
<ul>
    <li>After tokenization and discarding stop-words:
<table>
    <tr>
        <th style="border: 1px solid black;">Tweet 0</th>
        <th style="border: 1px solid black;">Tweet 1</th>
        <th style="border: 1px solid black;">Tweet 2</th>
    </tr>
    <tr>
        <td style="border: 1px solid black;">
            born hating person color skin background religion
        </td>
        <td style="border: 1px solid black;">
            people learn hate learn hate taught love</td>
        <td style="border: 1px solid black;">
            love comes naturally human heart opposite</td>
     </tr>
</table>
    </li>
</ul>
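<p>
    Here is a sketch of how tokenization followed by stop-word removal could produce the table above. The stop-list is an assumption: it is hand-picked to be just big enough for these three tweets, whereas real stop-lists are much longer. The name <code>content_tokens</code> is also made up for the example.
</p>

```python
import re

# A small hand-picked stop-list (an assumption for this example only)
STOP_WORDS = {"no", "one", "is", "another", "because", "of", "the", "his", "or",
              "must", "to", "and", "if", "they", "can", "be",
              "for", "more", "than", "its"}

def content_tokens(text):
    """Tokenize (lowercase, ignore punctuation) and then drop the stop-words."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

tweets = [
    "No one is born hating another person because of the color of his skin or his background or his religion.",
    "People must learn to hate, and if they can learn to hate, they can be taught to love.",
    "For love comes more naturally to the human heart than its opposite."
]

filtered = [" ".join(content_tokens(tweet)) for tweet in tweets]
for line in filtered:
    print(line)
```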

<h2>Stemming or lemmatization</h2>
<ul>
    <li>Optionally, apply <b>stemming</b> or <b>lemmatization</b> to the tokens. <!--terms.--></li>
    <li>E.g. "hating" is replaced by "hate", "comes" is replaced by "come"</li>
</ul>

<h2>Running example</h2>
<ul>
    <li>What would the tweets look like after stemming?</li>
    <li>What would they look like after lemmatization?</li>
</ul>

<h2>Count vectorization</h2>
<ul>
    <li>Each document becomes a vector, each token <!--term--> becomes a feature, feature-values are
        <em>frequencies</em> (how many times that token <!--term--> appears in that document).
    </li>
</ul>
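<p>
    For the three tweets of the running example (after tokenization and stop-word removal), count vectorization can be sketched in plain Python like this. It is an illustration of the idea rather than how any particular library implements it; <code>count_vector</code> is a hypothetical helper.
</p>

```python
from collections import Counter

# The running example, after tokenization and discarding stop-words
docs = [
    "born hating person color skin background religion".split(),
    "people learn hate learn hate taught love".split(),
    "love comes naturally human heart opposite".split(),
]

# One feature per distinct token, in alphabetical order
vocabulary = sorted({token for doc in docs for token in doc})

def count_vector(tokens):
    """Return how many times each vocabulary token appears in the document."""
    counts = Counter(tokens)
    return [counts[token] for token in vocabulary]

count_vectors = [count_vector(doc) for doc in docs]

print(len(vocabulary))   # 17
print(count_vectors[1])  # [0, 0, 0, 0, 2, 0, 0, 0, 2, 1, 0, 0, 1, 0, 0, 0, 1]
```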

<h2>Running example</h2>
<table>
    <tr style="border: 1px solid black;">
        <th></th>
        <th>background</th>
        <th>born</th>
        <th>color</th>
        <th>comes</th>
        <th>hate</th>
        <th>hating</th>
        <th>heart</th>
        <th>human</th>
        <th>learn</th>
        <th>love</th>
        <th>naturally</th>
        <th>opposite</th>
        <th>people</th>
        <th>person</th>
        <th>religion</th>
        <th>skin</th>
        <th>taught</th>
    </tr>
    <tr style="border: 1px solid black;">
        <th>Tweet 0:</th>
        <td>1</td>
        <td>1</td>
        <td>1</td>
        <td>0</td>
        <td>0</td>
        <td>1</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>1</td>
        <td>1</td>
        <td>1</td>
        <td>0</td>
    </tr>
    <tr style="border: 1px solid black;">
        <th>Tweet 1:</th>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>2</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>2</td>
        <td>1</td>
        <td>0</td>
        <td>0</td>
        <td>1</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>1</td>
    </tr>
    <tr style="border: 1px solid black;">
        <th>Tweet 2:</th>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>1</td>
        <td>0</td>
        <td>0</td>
        <td>1</td>
        <td>1</td>
        <td>0</td>
        <td>1</td>
        <td>1</td>
        <td>1</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
    </tr>
</table>

<h2>TF-IDF vectorization</h2>
<ul>
    <li>Optionally, replace the frequencies by <b>tf-idf</b> scores.</li>
    <li>tf-idf is a kind of standardization, but suitable for sparse data.</li>
    <li>tf-idf scores penalise words that recur across multiple documents,
        <ul>
            <li>e.g. in emails, words such as "hi", "best", "regards", &hellip;</li>
        </ul>
    </li>
    <li>You can look up the formula, if you are interested. <!--For the formulae, see, e.g., <i>CS4611</i>.-->
        <ul>
            <li>Variants of the formula might: scale frequencies to avoid biases towards long documents 
                (not scikit-learn);
                logarithmically scale frequencies (not default in scikit-learn);
                add 1 to part of the formula to avoid division-by-zero (default in scikit-learn);
                normalize the results (e.g. by default, scikit-learn divides by the $l_2$ norm)
            </li>
        </ul>
    </li>
</ul>
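<p>
    As a worked example, the sketch below computes tf-idf scores for the running example's count vectors. It assumes the variant that, as far as I understand it, scikit-learn uses by default: raw frequencies, a smoothed idf of $\ln\frac{1 + n}{1 + \mathit{df}} + 1$ (where $n$ is the number of documents and $\mathit{df}$ is the number of documents that contain the token), and division by the $l_2$ norm. Other variants will give different numbers.
</p>

```python
import math

vocabulary = ["background", "born", "color", "comes", "hate", "hating", "heart",
              "human", "learn", "love", "naturally", "opposite", "people",
              "person", "religion", "skin", "taught"]

# Count vectors for the three tweets of the running example
counts = [
    [1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 2, 0, 0, 0, 2, 1, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0],
]

n_docs = len(counts)

# Document frequency: in how many documents does each token appear?
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]

# Smoothed inverse document frequency (assumed variant, see above)
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    """Weight each frequency by its idf, then divide by the l2 norm."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in weighted))
    return [value / norm for value in weighted]

tfidf = [tfidf_row(row) for row in counts]

# "love" appears in two tweets, so it scores lower than "people", which appears in one
for token in ["hate", "love", "people"]:
    print(token, round(tfidf[1][vocabulary.index(token)], 2))
```

<p>
    Under these assumptions, the scores for Tweet 1 come out as 0.61 for "hate", 0.23 for "love" and 0.31 for "people".
</p>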

<h2>Running example</h2>
<table>
    <tr style="border: 1px solid black;">
        <th></th>
        <th>background</th>
        <th>born</th>
        <th>color</th>
        <th>comes</th>
        <th>hate</th>
        <th>hating</th>
        <th>heart</th>
        <th>human</th>
        <th>learn</th>
        <th>love</th>
        <th>naturally</th>
        <th>opposite</th>
        <th>people</th>
        <th>person</th>
        <th>religion</th>
        <th>skin</th>
        <th>taught</th>
    </tr>
    <tr style="border: 1px solid black;">
        <th>Tweet 0:</th>
        <td>0.38</td>
        <td>0.38</td>
        <td>0.38</td>
        <td>0</td>
        <td>0</td>
        <td>0.38</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0.38</td>
        <td>0.38</td>
        <td>0.38</td>
        <td>0</td>
    </tr>
    <tr style="border: 1px solid black;">
        <th>Tweet 1:</th>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0.61</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0.61</td>
        <td>0.23</td>
        <td>0</td>
        <td>0</td>
        <td>0.31</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0.31</td>
    </tr>
    <tr style="border: 1px solid black;">
        <th>Tweet 2:</th>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0.42</td>
        <td>0</td>
        <td>0</td>
        <td>0.42</td>
        <td>0.42</td>
        <td>0</td>
        <td>0.32</td>
        <td>0.42</td>
        <td>0.42</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
    </tr>
</table>

<h2>The dimension of these vectors</h2>
<ul>
    <li>Sparsity:
        <ul>
            <li>Here we have $n = 17$ features (columns). How many will there be in general?</li>
            <li>Most of the feature-values are zero, hence the matrix is sparse. Why?</li>
        </ul>
    </li>
    <li>We have the curse of dimensionality again.
        <ul>
            <li>Reduce the number of features by:
                <ul>
                    <li>discarding tokens that appear in too few documents (<code>min_df</code> in scikit-learn);
                    </li>
                    <li>discarding tokens that appear in too many documents (<code>max_df</code>);</li>
                    <li>keeping only the most frequent tokens (<code>max_features</code>).</li>
                </ul>
            </li>
            <li>Use dimensionality reduction:
                <ul>
                    <li>e.g. truncated singular value decomposition (SVD) is suitable for bag-of-words, rather than PCA, because it does not centre the data and so can work directly on sparse matrices.</li>
                </ul>
            </li>
        </ul>
    </li>
</ul>

<h2>Observation about bag-of-words representations</h2>
<ul>
    <li>This representation is good for many applications in AI but it does have drawbacks too:
        <ul>
            <li>It loses all the information that English conveys through the order of words in sentences,
                <ul>
                    <li>e.g. "People learn to hate" and "People hate to learn" have very different meanings but
                        end up with the same bag-of-words representation.
                    </li>
                </ul>
            </li>
            <li>It loses the information that English conveys using its stop-words, most notably negation,
                <ul>
                    <li>e.g. "They hate religion" and "I do not hate religion" will have the same bag-of-words
                        representation.
                    </li>
                </ul>
            </li>
        </ul>
    </li>
    <li>This may not matter for some applications (e.g. spam detection) but will matter for
        others (e.g. machine translation), for which you need a different representation.
    </li>
    <li>What other weaknesses does it have?</li>
</ul>
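<p>
    Both drawbacks are easy to demonstrate. In this sketch, <code>bag_of_words</code> is a hypothetical helper, and the stop-list in the second example is an assumption, chosen just to cover the two sentences.
</p>

```python
import re
from collections import Counter

def bag_of_words(text, stop_words=frozenset()):
    """Tokenize, drop any stop-words, and count what is left."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)

# Word order is lost
same_without_order = bag_of_words("People learn to hate") == bag_of_words("People hate to learn")
print(same_without_order)  # True

# Negation is lost when stop-words are discarded
stop_words = {"they", "i", "do", "not"}
same_without_negation = (bag_of_words("They hate religion", stop_words)
                         == bag_of_words("I do not hate religion", stop_words))
print(same_without_negation)  # True
```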

<h1>Bag-of-Words in scikit-learn</h1>

In [4]:
tweets = [
    "No one is born hating another person because of the color of his skin or his background or his religion.",
    "People must learn to hate, and if they can learn to hate, they can be taught to love.",
    "For love comes more naturally to the human heart than its opposite."
]

In [5]:
# Create the vectorizer
vectorizer = CountVectorizer(stop_words='english')

# Run the vectorizer
vectorizer.fit(tweets)
X = vectorizer.transform(tweets)

<ul>
    <li>In the example above, we used a <code>CountVectorizer</code>.</li>
    <li>It does tokenization:
        <ul>
            <li>By default, it converts to lowercase, it treats punctuation as spaces, and it treats two or more
                consecutive alphanumeric characters as a word. Each word becomes a token <!--term--> (feature).
            </li>
        </ul>
    </li>
    <li>The example discards stop-words.</li>
</ul>

In [6]:
# FYI here's the list of stop-words that scikit-learn uses.

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

ENGLISH_STOP_WORDS

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

<ul>
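    <li>The <code>CountVectorizer</code> can also discard words that appear in too many documents (<code>max_df</code>) or in too few (<code>min_df</code>). With the default settings (<code>max_df=1.0</code>, <code>min_df=1</code>), no words are discarded on this basis.</li>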
    <li>It does not do stemming or lemmatization. scikit-learn doesn't have a stemmer, but does make it easy to call one, if you get one from another library, 
                e.g. NLTK.
     </li>
</ul>

In [7]:
# FYI, let's see the tokens (features) that it ends up with
vectorizer.get_feature_names_out()

array(['background', 'born', 'color', 'comes', 'hate', 'hating', 'heart',
       'human', 'learn', 'love', 'naturally', 'opposite', 'people',
       'person', 'religion', 'skin', 'taught'], dtype=object)

<ul>
    <li>Finally, the <code>CountVectorizer</code> vectorizes, producing sparse matrices of word frequencies. 
        (There is an option to produce a binary representation, instead of frequencies.)
    </li>
    <li>We know we want this to be stored in an efficient sparse matrix, and scikit-learn takes care of this
        'behind the scenes'. (Do not vectorize and then convert back to Pandas DataFrames because, by default,
        DataFrames are not sparse data structures.)
    </li>
</ul>

In [8]:
# We can look at the sparse array. The first number identifies the tweet (0, 1 or 2), 
# the second is which feature, and the last is the frequency
print(X)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 18 stored elements and shape (3, 17)>
  Coords	Values
  (0, 0)	1
  (0, 1)	1
  (0, 2)	1
  (0, 5)	1
  (0, 13)	1
  (0, 14)	1
  (0, 15)	1
  (1, 4)	2
  (1, 8)	2
  (1, 9)	1
  (1, 12)	1
  (1, 16)	1
  (2, 3)	1
  (2, 6)	1
  (2, 7)	1
  (2, 9)	1
  (2, 10)	1
  (2, 11)	1


In [9]:
# Vectorize a new document
new_document = "Unsurprisingly, people hate to learn that their religion loves to hate."

new_document_as_vector = vectorizer.transform([new_document])

In [10]:
# Notice how it ignores words that weren't in the original tweets, such as "unsurprisingly" and "loves"

print(new_document_as_vector)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 4 stored elements and shape (1, 17)>
  Coords	Values
  (0, 4)	2
  (0, 8)	1
  (0, 12)	1
  (0, 14)	1


<ul>
    <li>In the example below, we use a <code>TfidfVectorizer</code> instead.</li>
    <li>(By default, it normalizes the values using the $l_2$ norm.)</li> <!--, see CS46111).</li>-->
</ul>

In [11]:
# Create the vectorizer
vectorizer = TfidfVectorizer(stop_words='english')

# Run the vectorizer
vectorizer.fit(tweets)
X = vectorizer.transform(tweets)

In [12]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 18 stored elements and shape (3, 17)>
  Coords	Values
  (0, 0)	0.3779644730092272
  (0, 1)	0.3779644730092272
  (0, 2)	0.3779644730092272
  (0, 5)	0.3779644730092272
  (0, 13)	0.3779644730092272
  (0, 14)	0.3779644730092272
  (0, 15)	0.3779644730092272
  (1, 4)	0.6149219764307087
  (1, 8)	0.6149219764307087
  (1, 9)	0.2338320064840948
  (1, 12)	0.30746098821535434
  (1, 16)	0.30746098821535434
  (2, 3)	0.42339448341195934
  (2, 6)	0.42339448341195934
  (2, 7)	0.42339448341195934
  (2, 9)	0.3220024178194947
  (2, 10)	0.42339448341195934
  (2, 11)	0.42339448341195934


In [13]:
# Vectorize a new document
new_document = "Unsurprisingly, people hate to learn that their religion loves to hate."

new_document_as_vector = vectorizer.transform([new_document])

In [14]:
# Notice how it ignores words that weren't in the original tweets, such as "unsurprisingly" and "loves"

print(new_document_as_vector)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 4 stored elements and shape (1, 17)>
  Coords	Values
  (0, 4)	0.7559289460184544
  (0, 8)	0.3779644730092272
  (0, 12)	0.3779644730092272
  (0, 14)	0.3779644730092272


<h1>Similarity &amp; Distance for Bag-of-Words</h1>
<ul>
    <!--<li>For details and formulae, see CS4611.</li>-->
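    <li>Euclidean distance is usually not suitable: on raw frequencies, it is strongly affected by document length, so a long document and a short document on the same topic can end up far apart.</li>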
    <li>Very common is <b>cosine similarity</b>, which, for vectors with no negative values (such as these), gives values in $[0, 1]$, where 1 means 'identical' (up to scaling).</li>
    <li>To get <b>cosine distance</b>, we can subtract from 1, so now 1 means 'completely different'.</li>
    <li>The exact formulae differ depending on what is assumed about normalization.
        <ul>
            <li>If we assume the vectors have been normalized, then there is a simpler formula (basically, just the dot product of the vectors).</li>
            <li>If not, then the formula is more complicated (divide by the product of the $l_2$-norms).</li>
        </ul>
    </li>
</ul>
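<p>
    The two formulae can be sketched in plain Python as follows. The vectors <code>u</code> and <code>v</code> are made-up examples, and the helper names are hypothetical. The point is that the general formula and the dot product of the normalized vectors agree.
</p>

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_norm(u):
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    """General formula: divide the dot product by the product of the l2 norms."""
    return dot(u, v) / (l2_norm(u) * l2_norm(v))

def normalize(u):
    """Scale a vector so that its l2 norm is 1."""
    norm = l2_norm(u)
    return [a / norm for a in u]

u = [1, 0, 1]
v = [1, 1, 0]

general = cosine_similarity(u, v)
shortcut = dot(normalize(u), normalize(v))  # valid because both are normalized
cosine_distance = 1 - general

print(round(general, 6), round(shortcut, 6))  # 0.5 0.5
print(round(cosine_distance, 6))              # 0.5
```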

<h2>Similarity &amp; distance for bag-of-words representation in scikit-learn</h2>
<ul>
    <li>The code below assumes that the vectors have already been normalized, e.g. produced
        by <code>TfidfVectorizer</code>.
    </li>
</ul>

In [15]:
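def cosine(x, xprime):
    # Returns the cosine distance (1 - cosine similarity) as a 1x1 array
    # Assumes x and xprime are already normalized, so the similarity is just their dot product
    # Converts from sparse matrices because np.dot does not work on them
    return 1 - x.toarray().dot(xprime.toarray().T)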

In [16]:
# So which of Barack Obama's tweets is most similar to our new document?
tweets[np.argmin([cosine(new_document_as_vector, x) for x in X])]

'People must learn to hate, and if they can learn to hate, they can be taught to love.'

<h1>Case Study: A Classifier</h1>

<ul>
    <li>Stanford University researchers have taken 50,000 movie reviews from <a href="https://www.imdb.com/">IMDB</a>,
        labelled them as either positive or negative and <a href="http://ai.stanford.edu/~amaas/data/sentiment/">made them available</a>.
    </li>
    <li>I've taken the first 5,000 of them.</li>
</ul>

In [17]:
df = pd.read_csv("../datasets/dataset_5000_reviews.csv")

In [18]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [19]:
# We'll just use holdout to keep things fast
dev_df, test_df = train_test_split(df, train_size=0.8, stratify=df["sentiment"], random_state=2)

ss = ShuffleSplit(n_splits=1, train_size=0.75, random_state=2)

In [20]:
# Extract the review text but leave it as a pandas Series
dev_X = dev_df["review"]
test_X = test_df["review"]

# Target values, encoded and converted to a 1D numpy array
label_encoder = LabelEncoder()
label_encoder.fit(df["sentiment"])
dev_y = label_encoder.transform(dev_df["sentiment"])
test_y = label_encoder.transform(test_df["sentiment"])

<h2>scikit-learn</h2>

In [21]:
# We can just create a preprocessor with a CountVectorizer in a pipeline with logistic regression.
logistic = Pipeline([
    ("vectorizer", CountVectorizer(stop_words='english', max_features=20000)),
    ("predictor", LogisticRegression(max_iter=300))])

# Get the validation accuracy
np.mean(cross_val_score(logistic, dev_X, dev_y, scoring="accuracy", cv=ss))

0.832

In [22]:
# As well as limiting the vocabulary in the vectorizer, we can use TruncatedSVD for dimensionality reduction.
logistic = Pipeline([
    ("vectorizer", CountVectorizer(stop_words='english', max_features=20000)),
    ("svd", TruncatedSVD(n_components=100)),
    ("predictor", LogisticRegression(max_iter=300))])

# Get the validation accuracy
np.mean(cross_val_score(logistic, dev_X, dev_y, scoring="accuracy", cv=ss))

0.782

In [23]:
# Let's try some grid search
logistic = Pipeline([
    ("vectorizer", CountVectorizer(max_features=20000)),
    ("svd", TruncatedSVD()),
    ("predictor", LogisticRegression(max_iter=300))])

# Create a dictionary of hyperparameters for the pipeline (vectorizer and SVD steps)
logistic_param_grid = {"vectorizer__stop_words": [None, "english"],
                       "vectorizer__ngram_range": [(1, 1), (1, 2)],
                       "svd__n_components": [100, 200, 300]}

# Create the grid search object which will find the best hyperparameter values based on validation error
logistic_gs = GridSearchCV(logistic, logistic_param_grid, scoring="accuracy", cv=ss, refit=True)

# Run grid search by calling fit
logistic_gs.fit(dev_X, dev_y)

# Let's see how well we did
logistic_gs.best_params_, logistic_gs.best_score_

({'svd__n_components': 300,
  'vectorizer__ngram_range': (1, 1),
  'vectorizer__stop_words': 'english'},
 0.824)

In [24]:
# And let's try it with a TF-IDF vectorizer
tfidf_logistic = Pipeline([
    ("vectorizer", TfidfVectorizer(max_features=20000)),
    ("svd", TruncatedSVD()),
    ("predictor", LogisticRegression(max_iter=300))])

# Create a dictionary of hyperparameters for the pipeline (vectorizer and SVD steps)
tfidf_logistic_param_grid = {"vectorizer__stop_words": [None, "english"],
                             "vectorizer__ngram_range": [(1, 1), (1, 2)],
                             "svd__n_components": [100, 200, 300]}

# Create the grid search object which will find the best hyperparameter values based on validation error
tfidf_logistic_gs = GridSearchCV(tfidf_logistic, tfidf_logistic_param_grid, scoring="accuracy", cv=ss, refit=True)

# Run grid search by calling fit
tfidf_logistic_gs.fit(dev_X, dev_y)

# Let's see how well we did
tfidf_logistic_gs.best_params_, tfidf_logistic_gs.best_score_

({'svd__n_components': 200,
  'vectorizer__ngram_range': (1, 2),
  'vectorizer__stop_words': None},
 0.84)

In [25]:
# Now we re-train the winner on train+validation and test on the test set
tfidf_logistic.set_params(**tfidf_logistic_gs.best_params_) 
tfidf_logistic.fit(dev_X, dev_y)
accuracy_score(test_y, tfidf_logistic.predict(test_X))

0.858

<h2>Keras</h2>

<ul>
    <li>Keras has a <code>TextVectorization</code> layer, which we can use for count vectorization (<code>output_mode="count"</code>) or TF-IDF vectorization (<code>output_mode="tf_idf"</code>).</li>
    <li>It will do the tokenization, but it does not come with a method for stop-word removal (although you could write one).</li> 
</ul>

In [26]:
# Create the count vectorization layer, and call adapt on the text-only dataset to create the vocabulary.
vectorization_layer = TextVectorization(output_mode="count", max_tokens=20000)
vectorization_layer.adapt(convert_to_tensor(dev_df["review"]))

# Create and compile the model
inputs = Input(shape=(1,), dtype=string, name="review")
x = vectorization_layer(inputs)
x = Dense(64, activation="relu")(x)
x = Dense(32, activation="relu")(x)
outputs = Dense(1, activation="sigmoid")(x)
count_model = Model(inputs, outputs)

count_model.compile(optimizer=RMSprop(learning_rate=0.0001), loss="binary_crossentropy", metrics=["accuracy"])

In [27]:
count_model.fit(dev_X, dev_y, epochs=10, batch_size=32, verbose=0)

<keras.src.callbacks.history.History at 0x30999d2b0>

In [28]:
test_loss, test_acc = count_model.evaluate(test_X, test_y, verbose=0)
test_acc

0.8709999918937683

In [29]:
# Create the TF-IDF vectorization layer, and call adapt on the text-only dataset to create the vocabulary.
vectorization_layer = TextVectorization(output_mode="tf_idf", max_tokens=20000)
vectorization_layer.adapt(convert_to_tensor(dev_df["review"]))

# Create and compile the model
inputs = Input(shape=(1,), dtype=string, name="review")
x = vectorization_layer(inputs)
x = Dense(64, activation="relu")(x)
x = Dense(32, activation="relu")(x)
outputs = Dense(1, activation="sigmoid")(x)
tfidf_model = Model(inputs, outputs)

tfidf_model.compile(optimizer=RMSprop(learning_rate=0.0001), loss="binary_crossentropy", metrics=["accuracy"])

In [30]:
tfidf_model.fit(dev_X, dev_y, epochs=10, batch_size=32, verbose=0)

<keras.src.callbacks.history.History at 0x308653710>

In [31]:
test_loss, test_acc = tfidf_model.evaluate(test_X, test_y, verbose=0)
test_acc

0.8640000224113464

<ul>
    <li>Great performances from my neural networks here, but we should probably plot a learning curve and
        make sure that we are not over-fitting.
    </li>
</ul>