<h1>CS4618: Artificial Intelligence I</h1>
<h1>Non-Numeric Features</h1>
<h2>
    Derek Bridge<br>
    School of Computer Science and Information Technology<br>
    University College Cork
</h2>

<h1>Initialization</h1>
$\newcommand{\Set}[1]{\{#1\}}$ 
$\newcommand{\Tuple}[1]{\langle#1\rangle}$ 
$\newcommand{\v}[1]{\pmb{#1}}$ 
$\newcommand{\cv}[1]{\begin{bmatrix}#1\end{bmatrix}}$ 
$\newcommand{\rv}[1]{[#1]}$ 
$\DeclareMathOperator{\argmax}{arg\,max}$ 
$\DeclareMathOperator{\argmin}{arg\,min}$ 
$\DeclareMathOperator{\dist}{dist}$
$\DeclareMathOperator{\abs}{abs}$

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Class, for use in pipelines, to select certain columns from a DataFrame and convert to a numpy array
# From A. Geron: Hands-On Machine Learning with Scikit-Learn & TensorFlow, O'Reilly, 2017
# Modified by Derek Bridge to allow for casting in the same ways as pandas.DataFrame.astype
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names, dtype=None):
        self.attribute_names = attribute_names
        self.dtype = dtype
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_selected = X[self.attribute_names]
        if self.dtype:
            return X_selected.astype(self.dtype).values
        return X_selected.values

# Class, for use in pipelines, to binarize nominal-valued features (while avoiding the dummy variable trap)
# By Derek Bridge, 2017
class FeatureBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, features_values):
        self.features_values = features_values
        self.num_features = len(features_values)
        self.labelencodings = [LabelEncoder().fit(feature_values) for feature_values in features_values]
        self.onehotencoder = OneHotEncoder(sparse=False,
            n_values=[len(feature_values) for feature_values in features_values])
        self.last_indexes = np.cumsum([len(feature_values) - 1 for feature_values in self.features_values])
    def fit(self, X, y=None):
        for i in range(0, self.num_features):
            X[:, i] = self.labelencodings[i].transform(X[:, i])
        return self.onehotencoder.fit(X)
    def transform(self, X, y=None):
        for i in range(0, self.num_features):
            X[:, i] = self.labelencodings[i].transform(X[:, i])
        onehotencoded = self.onehotencoder.transform(X)
        return np.delete(onehotencoded, self.last_indexes, axis=1)
    def fit_transform(self, X, y=None):
        onehotencoded = self.fit(X).transform(X)
        return np.delete(onehotencoded, self.last_indexes, axis=1)
    def get_params(self, deep=True):
        return {"features_values" : self.features_values}
    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

<h1>Data types</h1>
<ul>
    <li>Structured data:
        <ul>
            <li>
                <b>Numeric-valued</b>: either real- or integer-valued, such as floor area or number of bedrooms
            </li>
            <li>
                <b>Nominal-valued</b>: where there is a finite set of possible values. Often these values are strings
                <ul>
                    <li>For example, dwelling type ($\mathit{type}$) is a nominal-valued feature whose values are "Apartment", 
                        "Detached", "Semi-detached" or "Terraced". 
                    </li>
                    <li>The special case here is, of course, a binary-valued feature, where
                        there are just two values. For example, the type of development ($\mathit{devment}$) is a nominal-valued feature 
                        whose values are "New" or "SecondHand"
                    </li>
                    <li>Another special case is where there is a finite set of possible values but there is some
                        ordering on the values, e.g. the spiciness of a curry can be "Mild", "Medium", "Hot",
                        "Very Hot" and "Suicidal"
                    </li>
                </ul>
            </li>
            <li>
                <b>Set-valued</b>: where the value of a feature is a set, but the members of the set are constrained to
                a finite set of nominals. For example, the genre of a movie might be a set-valued feature, e.g. the 
                value of the genre feature for <i>The Blues Brothers</i> is $\Set{\mathit{musical}, \mathit{comedy},
                \mathit{action}}$.
            </li>
            <li>&hellip;</li>
        </ul>
    </li>
    <li>Unstructured: 
        <ul>
            <li>
                free-form text
            </li>
            <li>
                media such as images, audio and video
            </li>
        </ul>
    </li>
</ul>

<h1>Data Types in the Cork Property Dataset</h1>
<table>
    <tr>
        <td>$\mathit{flarea}$</td><td>numeric</td><td>the floor area in square metres</td>
    </tr>
    <tr>
        <td>$\mathit{type}$</td><td>nominal</td><td>dwelling type: Apartment, Detached, Semi-detached,
            Terraced</td>
    </tr>
    <tr>
        <td>$\mathit{bdrms}$</td><td>numeric</td><td>the number of bedrooms</td>
    </tr>
    <tr>
        <td>$\mathit{bthrms}$</td><td>numeric</td><td>the number of bathrooms</td>
    </tr>
    <tr>
        <td>$\mathit{floors}$</td><td>numeric</td><td>the number of floors</td>
    </tr>
    <tr>
        <td>$\mathit{devment}$</td><td>nominal</td><td>the type of development: New or SecondHand</td>
    </tr>
    <tr>
        <td>$\mathit{ber}$</td><td>nominal</td>
        <td>building energy rating: A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, E1, E2, F, G</td>
    </tr>
    <tr>
        <td>$\mathit{location}$</td><td>nominal</td><td>the area of Cork, e.g. Douglas, Glanmire, Wilton,...</td>
    </tr>
</table>

In [5]:
# Use pandas to read the CSV file into a DataFrame
df = pd.read_csv("datasets/dataset_corkA.csv")

In [6]:
# The datatypes
df.dtypes

flarea      float64
type         object
bdrms         int64
bthrms        int64
floors        int64
devment      object
ber          object
location     object
price         int64
dtype: object

In [8]:
# Summary statistics
df.describe(include="all")

Unnamed: 0,flarea,type,bdrms,bthrms,floors,devment,ber,location,price
count,207.0,207,207.0,207.0,207.0,207,207,207,207.0
unique,,4,,,,2,12,36,
top,,Semi-detached,,,,SecondHand,G,CityCentre,
freq,,65,,,,204,25,40,
mean,128.094686,,3.434783,2.10628,1.826087,,,,274.724638
std,73.970582,,1.23239,1.185802,0.379954,,,,171.756507
min,41.8,,1.0,1.0,1.0,,,,55.0
25%,82.65,,3.0,1.0,2.0,,,,165.0
50%,106.0,,3.0,2.0,2.0,,,,225.0
75%,153.65,,4.0,3.0,2.0,,,,327.5


In [7]:
# A few of the examples
df.head(3)

Unnamed: 0,flarea,type,bdrms,bthrms,floors,devment,ber,location,price
0,497.0,Detached,4,5,2,SecondHand,B2,Carrigrohane,975
1,83.6,Detached,3,1,1,SecondHand,D2,Glanmire,195
2,97.5,Semi-detached,3,2,2,SecondHand,D1,Glanmire,225


<h1>Handling Nominal-Valued Features</h1>
<ul>
    <li>Most AI algorithms work only with numeric-valued features (although there are exceptions)</li>
    <li>So, we will look at how to convert nominal-valued features to numeric-valued ones</li>
</ul>

<h2>Binary-valued features</h2>
<ul>
    <li>The simplest case, obviously, is a binary-valued feature</li>
    <li>We encode one value as 0 and the other as 1, e.g. "SecondHand" is 0 and "New" is 1</li>
</ul>

<h2>Unordered nominal values</h2>
<ul>
    <li>Suppose there are more than two values, e.g. Apartment, Detached, Semi-detached or Terraced.</li>
    <li>The obvious thing to do is to assign an integer to each nominal value, e.g. 0 = Apartment, 1 = Detached, etc.</li>
    <li>But often this is not the best encoding
        <ul>
            <li>Algorithms may assume that the values themselves are meaningful, when they're actually arbitrary
                <ul>
                    <li>E.g. an algorithm might assume that Apartments (0) are more similar to Detached houses (1)
                        than they are to Terraced houses (3)
                    </li>
                </ul>
            </li>
        </ul>
    </li>
    <li>Instead, we use <b>one-hot encoding</b></li>
</ul>

<h3>One-Hot Encoding</h3>
<ul>
    <li>If the original nominal-valued feature has $p$ values, then we use $p$ binary-valued features: 
        <ul>
            <li>In each example, exactly one of them is set to 1 and the rest are zero</li>
        </ul>
    </li>
    <li>For example, there are four types of dwelling, so we have four binary-valued features:
        <ul>
            <li>The first is set to 1 if and only if the type of dwelling is Apartment</li>
            <li>The second is set to 1 if and only if the house is Detached</li>
            <li>And so on</li>
        </ul>
        So a detached house will have $\rv{0, 1, 0, 0}$ as its values
    </li>
    <li>Some questions:
        <ul>
            <li>
                One-hot encoding replaces one nominal-valued feature that has $p$ values by $p$ binary-valued ones &mdash; in
                general, one feature per nominal value. (E.g. $\mathit{type}$ has four values, so we get four binary features.)
                What is the minimum number of binary-valued features we could use? 
            </li>
            <li>
                Why don't we use the minimum?
            </li>
            <li>
                Although we get $p$ binary features, we only need $p - 1$. How come? (Advanced note: Look up the
                <i>dummy variable trap</i> to see why this might even be preferable)
            </li>
            <li>
                How might one encode a set-valued feature (such as the movie genre example
                above)?
            </li>
        </ul>
    </li>
    <li>
        In practice, it is not uncommon to be given a dataset where a nominal-valued feature has already been 
        encoded numerically, one integer per value. You might be fooled into thinking that the feature is
        numeric-valued and overlook the need to use one-hot encoding on it. Watch out for this!
    </li>
</ul>
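
<p>
    The one-hot encoding described above can be sketched in a few lines of plain Python. This is only an 
    illustration: the helper <code>one_hot</code> and its <code>drop_last</code> option are hypothetical names, 
    not part of scikit-learn, and the order of the dwelling types is assumed to be the alphabetical one used above.
</p>

```python
# Hypothetical helper (not a scikit-learn function): one-hot encode a single nominal value
def one_hot(value, legal_values, drop_last=False):
    """Return a list of 0/1 flags, one per legal value (or one fewer if drop_last is True)."""
    if value not in legal_values:
        raise ValueError(f"Unknown value: {value!r}")
    encoding = [1 if value == legal else 0 for legal in legal_values]
    # With drop_last, the final flag is omitted: all zeros then stands for the last legal value
    return encoding[:-1] if drop_last else encoding

dwelling_types = ["Apartment", "Detached", "Semi-detached", "Terraced"]

print(one_hot("Detached", dwelling_types))                  # [0, 1, 0, 0]
print(one_hot("Detached", dwelling_types, drop_last=True))  # [0, 1, 0]
print(one_hot("Terraced", dwelling_types, drop_last=True))  # [0, 0, 0]
```

<p>
    With <code>drop_last=True</code> we get $p - 1$ binary features, which is the idea behind avoiding the
    dummy variable trap.
</p>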

<h2>Ordered nominal values</h2>
<ul>
    <li>Consider the case now of a feature whose values are nominal but where there <em>is</em> an ordering
        <ul>
            <li>E.g. the $\mathit{ber}$ feature in the housing dataset is like this</li>
            <li>In this case, G &lt; F &lt; E2 &lt; E1 &lt; D2 &lt; D1 &lt; &hellip; &lt; A1</li>
        </ul>
    </li>
    <li>Some people would use the phrase 'ordinal-valued' to refer to nominal values that have an ordering
    </li>
    <li>
        You might be tempted to use a straightforward numeric encoding
        <ul>
            <li>E.g. 0 = G, 1 = F, 2 = E2, 3 = E1, and so on</li>
            <li>This encoding preserves the ordering, e.g. that E2 &lt; E1 because 2 &lt; 3</li>
            <li>But again this is probably not the best encoding
                <ul>
                    <li>The original feature had an ordering on its values but no notion of distance</li>
                    <li>E.g. G &lt; F but you cannot say by <em>how much</em> G is less than F</li>
                    <li>In the new feature, we have introduced a notion of distance: G is worse than F by 1, and it is 
                        2 worse than E2
                    </li>
                    <li>So this encoding has <em>added</em> 'information' that was not present in the original</li>
                </ul>
            </li>
        </ul>
    </li>
    <li>
        So what should we do?
        <ul>
            <li>
                We could use one-hot encoding: fifteen binary-valued features. But what are the weaknesses of this?
            </li>
            <li>
                Another option is to use binary-valued features that represent inequalities
                <ul>
                    <li>E.g. one feature is set to 1 if you have a BER of at least G</li>
                    <li>Another is additionally set to 1 if you have attained at least F</li>
                    <li>And so on</li>
                </ul>
                &mdash; still fifteen binary-valued features, but no longer
                mutually exclusive
                <ul>
                    <li>E.g. a BER of E2 is converted to the following fifteen binary-valued
                        features: $\rv{1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}$
                    </li>
                    <li>E.g. a BER of E1 is converted to
                        $\rv{1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}$
                    </li>
                </ul>
            </li>
        </ul>
        But, since scikit-learn doesn't offer this somewhat sophisticated encoding, and assuming we don't write our own, we will have
        to use one-hot encoding
    </li>
    <li>Again watch out for cases where some well-intentioned person has already encoded this kind of feature
        but using a naive numeric encoding
    </li>
</ul>
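
<p>
    If we did want to write the inequality-based encoding ourselves, it might look something like the sketch
    below. The names <code>BER_ORDER</code> and <code>inequality_encode</code> are made up for this illustration;
    the ordering is the fifteen BER values from the table earlier, listed from worst to best.
</p>

```python
# BER values from worst (G) to best (A1), as listed for the Cork property dataset
BER_ORDER = ["G", "F", "E2", "E1", "D2", "D1", "C3", "C2", "C1",
             "B3", "B2", "B1", "A3", "A2", "A1"]

# Hypothetical helper: one binary feature per inequality "value is at least ordered_values[i]"
def inequality_encode(value, ordered_values):
    rank = ordered_values.index(value)  # raises ValueError for an unknown value
    return [1 if i <= rank else 0 for i in range(len(ordered_values))]

print(inequality_encode("E2", BER_ORDER))
# [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(inequality_encode("E1", BER_ORDER))
# [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

<p>
    Note that the first feature ("at least G") is 1 for every house, so in practice it carries no information
    and could be dropped.
</p>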

<h2>The curse of dimensionality, again</h2>
<ul>
    <li>One-hot encoding increases the number of features, sometimes quite a lot</li>
    <li>We may need to use dimensionality reduction (although most people don't bother!)
        <ul>
            <li>Don't use PCA, which is for numeric-valued features</li>
            <li>Use, e.g., Multiple Correspondence Analysis (MCA)</li>
        </ul>
    </li>
</ul>

<h1>Handling Nominal Values in scikit-learn</h1>
<ul>
    <li>We will add extra steps into our pipeline to convert nominal-valued features into numeric ones
        <ul>
            <li>scikit-learn has some classes for doing this but they do not play nicely with pipelines, so 
                we will use my <code>FeatureBinarizer</code> (given
                earlier) instead
            </li>
            <li>(Advanced: <code>FeatureBinarizer</code> avoids the dummy variable trap and uses just $p-1$ binary
                features)
            </li>
        </ul>
    </li>
    <li>But, we now need two pipelines:
        <ul>
            <li>One takes all the numeric-valued features and, e.g., scales them</li>
            <li>The other takes the nominal-valued features and their legal values and binarizes them
            </li>
        </ul>
        You then join the pipelines using <code>FeatureUnion</code>
    </li>
</ul>

In [8]:
# The features we want to select
numeric_features = ["flarea", "bdrms", "bthrms", "floors"]
nominal_features = ["type", "devment", "ber", "location"]

# Create the pipelines
numeric_pipeline = Pipeline([
        ("selector", DataFrameSelector(numeric_features)),
        ("scaler", StandardScaler())
    ])

nominal_pipeline = Pipeline([
        ("selector", DataFrameSelector(nominal_features)), 
        ("binarizer", FeatureBinarizer([df[feature].unique() for feature in nominal_features]))])

pipeline = Pipeline([("union", FeatureUnion([("numeric_pipeline", numeric_pipeline), 
                                             ("nominal_pipeline", nominal_pipeline)]))])

In [9]:
# Run the pipeline
pipeline.fit(df)
X = pipeline.transform(df)

In [10]:
# Let's take a look at a few rows in X - to show you that we now have a 2D numpy array
print(X[:3])

[[ 4.99927973  0.45974713  2.4462228   0.45883147  0.          1.          0.
   1.          0.          1.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          1.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.        ]
 [-0.6029769  -0.35365164 -0.93520037 -2.17944947  0.          1.          0.
   1.          0.          0.          0.          0.          0.          0.
   0.          1.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
 

In [11]:
# So which house is most similar to yours, now that we are using all the features?

def euc(x, xprime):
    return np.sqrt(np.sum((x - xprime)**2))

# Don't try to understand or copy this code - it's a hack that you won't need
your_house_df = pd.DataFrame([{"flarea":114.0, "type":"Semi-detached", "bdrms":3, "bthrms":2, "floors":2,  
                               "devment":"SecondHand", "ber":"B2", "location":"Glasheen"}])
your_house_scaled = pipeline.transform(your_house_df)[0]

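# .ix has been removed from recent versions of pandas; argmin gives a row position, so we use .iloc
df.iloc[np.argmin([euc(your_house_scaled, x) for x in X])]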

flarea              134.7
type        Semi-detached
bdrms                   3
bthrms                  2
floors                  2
devment        SecondHand
ber                    D1
location         Glasheen
price                 245
Name: 127, dtype: object

<ul>
    <li>Actually, there is a question of whether Euclidean distance is the best distance measure to use on nominal-valued
        features and on mixtures of numeric-valued features and nominal-valued features
    </li>
    <li>But, in this introductory module, we will use it!</li>
</ul>

<h1>Free-Form Text</h1>
<ul>
    <li>Suppose the objects in your dataset are <b>documents</b>, rather than houses
        <ul>
            <li>E.g. web pages, tweets, blog posts, emails, posts to Internet forums and chatrooms, &hellip;</li>
            <li>They might have a little structure to them (headings and so on), but they are primarily
                <b>free-form text</b>
            </li>
        </ul>
    </li>
    <li>Many AI algorithms can only handle vectors of numbers. So one way to apply AI techniques to 
        a dataset of documents is to convert the raw text in the documents into vectors of numbers
    </li>
    <li>Our treatment of this will be brief and high-level, since many of you are studying
        <i>CS4611 Information Retrieval</i>, where this is covered in depth
    </li>
    <li>Furthermore, we'll use scikit-learn although its facilities for handling text are quite limited. 
        If you really want to do AI with text, consider a more powerful library such as <i>NLTK</i>
        (<a href="http://www.nltk.org/">http://www.nltk.org/</a>) or the <i>Stanford Natural Language
        Processing Toolkit</i> 
        (<a href="https://nlp.stanford.edu/software/">https://nlp.stanford.edu/software/</a>) 
     </li>
</ul>

<h2>Running Example</h2>
<p>
    Suppose our dataset contains just these three documents:
</p>
<table>
    <tr><th>Tweet 0</th><th>Tweet 1</th><th>Tweet 2</th></tr>
    <tr>
        <td>No one is born hating another person because of the color of his skin or his background 
            or his religion.
        </td>
        <td>People must learn to hate, and if they can learn to hate, they can be taught to love.</td>
        <td>For love comes more naturally to the human heart than its opposite.</td>
     </tr>
     <caption style="caption-side: bottom; text-align: center">
         Three tweets from Barack Obama, quoting Nelson Mandela
     </caption>
</table>

<h2>Bag-of-words representation</h2>
<ul>
    <li><b>Tokenize</b> each document
        <ul>
            <li>In our simple treatment, the tokens are just the words, ignoring punctuation and making everything
                lowercase
            </li>
            <li>In reality, this is surprisingly complicated: e.g. is "don't" one token or two? Should
                pairs of consecutive words (so-called 'bigrams') also be tokens ("no one", "one is",
                "is born")? And so on
            </li>
        </ul>
    </li>
    <li>Optionally, discard <b>stop-words</b>: common words such as "a", "the", "in", "on", "is", "are",&hellip;
        <ul>
            <li>Sometimes discarding them helps, or does no harm, e.g. spam detection</li>
            <li>Other times, you lose too much, e.g. web search engines ("To be, or not to be")</li>
        </ul>
    </li>
</ul>

In [12]:
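# The stop-word list lives in sklearn.feature_extraction.text (the old stop_words module has been removed)
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print(ENGLISH_STOP_WORDS)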

frozenset({'yet', 'though', 'on', 'until', 'somewhere', 'out', 'whereby', 'forty', 'hasnt', 'hereupon', 'latter', 'yours', 'is', 'whither', 'might', 'serious', 'nine', 'whether', 'are', 'five', 'such', 'what', 'am', 'therein', 'thereupon', 'while', 'somehow', 'couldnt', 'where', 'thereby', 'never', 'ours', 'you', 'further', 'within', 'twenty', 'last', 'some', 'empty', 'everywhere', 'something', 'thick', 'onto', 'those', 'about', 'nor', 'wherein', 'there', 'whatever', 'be', 'become', 'which', 'himself', 'nowhere', 'yourself', 'via', 'amoungst', 'ever', 'often', 'his', 'against', 'always', 'will', 'everyone', 'namely', 'three', 'again', 'anyhow', 'whenever', 'from', 'per', 'rather', 'co', 'many', 'had', 'third', 'since', 'anyway', 'this', 'whereafter', 'either', 'and', 'without', 'by', 'even', 'eleven', 'it', 'mine', 'up', 'moreover', 'noone', 'perhaps', 'became', 'fifty', 'any', 'please', 'too', 'could', 'detail', 'thin', 'fire', 'un', 'all', 'show', 'hereby', 'indeed', 'can', 'find', '

<ul>
    <li>Optionally, apply <b>stemming</b> or <b>lemmatization</b> to the words
        <ul>
            <li>E.g. "hating" is replaced by "hate", "comes" is replaced by "come"</li>
            <li>scikit-learn doesn't have a stemmer, but does make it easy to call one, if you get one from another library, 
                e.g. NLTK
            </li>
        </ul>
    </li>
    <li><b>Count Vectorize</b>: each document becomes a vector, each token becomes a feature, feature-values are
        <em>frequencies</em> (how many times that token appears in that document)<br />
        (In <i>CS4611</i>, features are probably referred to as 'terms')
    </li>
    <li>Optionally, <b>TF-IDF Vectorize</b>: replace the frequencies by <b>tf-idf</b> scores
        <ul>
            <li>tf-idf scores penalise words that recur across multiple documents</li>
            <li>E.g. in emails, words such as "hi", "best", "regards", &hellip;</li>
            <li>For the formulae, see <i>CS4611</i>
                <ul>
                    <li>variants might: scale frequencies to avoid biases towards long documents (not scikit-learn);
                        logarithmically scale frequencies (not default in scikit-learn);
                        add 1 to part of the formula to avoid division-by-zero (default in scikit-learn);
                        normalize the results (e.g. by default, scikit-learn divides by the L2-norm)
                    </li>
                </ul>
            </li>
        </ul>
    </li>
</ul>

<h2>Running Example</h2>
<ul>
    <li>After discarding stop-words:
<table>
    <tr><th>Tweet 0</th><th>Tweet 1</th><th>Tweet 2</th></tr>
    <tr>
        <td>born hating person color skin background religion
        </td>
        <td>people learn hate learn hate taught love</td>
        <td>love comes naturally human heart opposite</td>
     </tr>
</table>
    </li>
    <li>After count vectorization:
<table>
    <tr>
        <th></th>
        <th>background</th>
        <th>born</th>
        <th>color</th>
        <th>comes</th>
        <th>hate</th>
        <th>hating</th>
        <th>heart</th>
        <th>human</th>
        <th>learn</th>
        <th>love</th>
        <th>naturally</th>
        <th>opposite</th>
        <th>people</th>
        <th>person</th>
        <th>religion</th>
        <th>skin</th>
        <th>taught</th>
    </tr>
    <tr>
        <th>Tweet 0:</th>
        <td>1</td>
        <td>1</td>
        <td>1</td>
        <td>0</td>
        <td>0</td>
        <td>1</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>1</td>
        <td>1</td>
        <td>1</td>
        <td>0</td>
    </tr>
    <tr>
        <th>Tweet 1:</th>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>2</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>2</td>
        <td>1</td>
        <td>0</td>
        <td>0</td>
        <td>1</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>1</td>
    </tr>
    <tr>
        <th>Tweet 2:</th>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>1</td>
        <td>0</td>
        <td>0</td>
        <td>1</td>
        <td>1</td>
        <td>0</td>
        <td>1</td>
        <td>1</td>
        <td>1</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
    </tr>
</table>
    </li>
    <li>After tf-idf vectorization:
<table>
    <tr>
        <th></th>
        <th>background</th>
        <th>born</th>
        <th>color</th>
        <th>comes</th>
        <th>hate</th>
        <th>hating</th>
        <th>heart</th>
        <th>human</th>
        <th>learn</th>
        <th>love</th>
        <th>naturally</th>
        <th>opposite</th>
        <th>people</th>
        <th>person</th>
        <th>religion</th>
        <th>skin</th>
        <th>taught</th>
    </tr>
        <tr>
        <th>Tweet 0:</th>
        <td>0.38</td>
        <td>0.38</td>
        <td>0.38</td>
        <td>0</td>
        <td>0</td>
        <td>0.38</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0.38</td>
        <td>0.38</td>
        <td>0.38</td>
        <td>0</td>
    </tr>
    <tr>
        <th>Tweet 1:</th>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0.61</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0.61</td>
        <td>0.23</td>
        <td>0</td>
        <td>0</td>
        <td>0.31</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0.31</td>
    </tr>
    <tr>
        <th>Tweet 2:</th>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0.42</td>
        <td>0</td>
        <td>0</td>
        <td>0.42</td>
        <td>0.42</td>
        <td>0</td>
        <td>0.32</td>
        <td>0.42</td>
        <td>0.42</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
        <td>0</td>
    </tr>
</table>
    </li>
</ul>
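
<p>
    To see where the tf-idf numbers in the last table come from, here is a rough by-hand calculation for 
    Tweet 1. It assumes the variant described earlier as scikit-learn's default: idf is 
    $\ln\frac{1 + m}{1 + \mathit{df}} + 1$, where $m$ is the number of documents and $\mathit{df}$ is the number
    of documents that contain the token, and the resulting vector is divided by its L2 norm. The variable and 
    function names are just for this sketch.
</p>

```python
import math

# Token counts for Tweet 1 after stop-word removal (from the count vectorization table)
counts = {"hate": 2, "learn": 2, "love": 1, "people": 1, "taught": 1}

# In how many of the three tweets does each of these tokens appear?
doc_freq = {"hate": 1, "learn": 1, "love": 2, "people": 1, "taught": 1}
n_docs = 3

def smoothed_idf(df, n):
    # Add 1 to numerator and denominator (avoids division by zero), then add 1 to the logarithm
    return math.log((1 + n) / (1 + df)) + 1

# Frequency multiplied by idf, before normalization
raw = {token: tf * smoothed_idf(doc_freq[token], n_docs) for token, tf in counts.items()}

# Divide by the L2 norm so that the document vector has length 1
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
tfidf = {token: value / norm for token, value in raw.items()}

for token, value in sorted(tfidf.items()):
    print(f"{token:8s}{value:.2f}")
```

<p>
    This prints 0.61 for "hate" and "learn", 0.23 for "love" and 0.31 for "people" and "taught", matching the
    Tweet 1 row of the table. "love" gets the lowest score because it is the only token that recurs in another tweet.
</p>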

<h2>The dimension of these vectors</h2>
<ul>
    <li>Sparsity:
        <ul>
            <li>Here we had $n = 17$ features (columns). How many will there be in general?</li>
            <li>Most of the feature-values are zero. Why?</li>
            <li>We say that the matrix is <b>sparse</b></li>
            <li>It would be wasteful to store it using very long arrays. We need a data structure that only 
                stores the non-zero elements: <b>sparse matrices</b><br />
                (Don't worry: scikit-learn takes care of this 'behind the scenes')
            </li>
        </ul>
    </li>
    <li>The curse of dimensionality, yet again:
        <ul>
            <li>Reduce the number of features by
                <ul>
                    <li>discarding tokens that appear in too few documents (<code>min_df</code> in scikit-learn)
                    </li>
                    <li>discarding tokens that appear in too many documents (<code>max_df</code>)</li>
                    <li>keeping only the most frequent tokens (<code>max_features</code>)</li>
                </ul>
            </li>
            <li>Use dimensionality reduction:
                <ul>
                    <li>E.g. singular value decomposition (SVD) is suitable for bag-of-words, rather than PCA</li>
                </ul>
            </li>
        </ul>
    </li>
</ul>
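
<p>
    As a rough illustration of discarding tokens by document frequency, the sketch below works on the three
    tweets from the running example. It is not scikit-learn's implementation: <code>prune_vocabulary</code>,
    <code>min_docs</code> and <code>max_docs</code> are hypothetical names that play the role of integer-valued
    <code>min_df</code> and <code>max_df</code>.
</p>

```python
from collections import Counter

# The three tweets after tokenization and stop-word removal (from the running example)
docs = [
    "born hating person color skin background religion".split(),
    "people learn hate learn hate taught love".split(),
    "love comes naturally human heart opposite".split(),
]

# Document frequency: in how many documents does each token appear? (Repeats count once)
doc_freq = Counter(token for doc in docs for token in set(doc))

# Hypothetical helper: keep tokens that appear in at least min_docs and at most max_docs documents
def prune_vocabulary(doc_freq, min_docs=1, max_docs=None):
    return sorted(token for token, df in doc_freq.items()
                  if df >= min_docs and (max_docs is None or df <= max_docs))

print(len(prune_vocabulary(doc_freq)))              # 17
print(prune_vocabulary(doc_freq, min_docs=2))       # ['love']
print(len(prune_vocabulary(doc_freq, max_docs=1)))  # 16
```

<p>
    On such a tiny dataset the thresholds are very blunt; on a realistic collection of documents they are
    usually chosen by experiment.
</p>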

<h2>Observation about bag-of-words representations</h2>
<ul>
    <li>This representation is good for many applications in AI but it does have drawbacks too:
        <ul>
            <li>It loses all the information that English conveys through the order of words in sentences
                <ul>
                    <li>E.g. "People learn to hate" and "People hate to learn" have very different meanings but
                        end up with the same bag-of-words representation
                    </li>
                </ul>
            </li>
            <li>It loses the information that English conveys using its stop-words, most notably negation
                <ul>
                    <li>E.g. "They hate religion" and "I do not hate religion" will have the same bag-of-words
                        representation
                    </li>
                </ul>
            </li>
        </ul>
    </li>
    <li>This may not matter for some applications (e.g. spam detection) but will matter for
        others (e.g. machine translation), for which you need a different representation
    </li>
    <li>What other weaknesses does it have?</li>
</ul>
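
<p>
    The loss of word order is easy to demonstrate. The sketch below uses a deliberately simplified tokenizer
    (lowercase, words of two or more alphanumeric characters, no stop-word removal); the function name
    <code>bag_of_words</code> is made up for this illustration.
</p>

```python
import re
from collections import Counter

# Hypothetical helper: a simplified tokenizer followed by counting
def bag_of_words(text):
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

bag1 = bag_of_words("People learn to hate")
bag2 = bag_of_words("People hate to learn")

print(bag1 == bag2)  # True: the two sentences are indistinguishable
```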

<h1>Bag-of-words representation in scikit-learn</h1>

In [9]:
tweets = [
    "No one is born hating another person because of the color of his skin or his background or his religion.",
    "People must learn to hate, and if they can learn to hate, they can be taught to love.",
    "For love comes more naturally to the human heart than its opposite."
]

<ul>
    <li>In the example below, we put a <code>CountVectorizer</code> into a pipeline</li>
    <li>It does tokenization
        <ul>
            <li>By default, it converts to lowercase, it treats punctuation as spaces, and it treats two or more
                consecutive alphanumeric characters as a word. Each word becomes a token (feature)
            </li>
        </ul>
    </li>
    <li>The example below discards stop-words using the list we saw earlier</li>
    <li>By default, it keeps every remaining word, however many documents it appears in; the <code>max_df</code>
        and <code>min_df</code> options discard words that appear in too many or too few documents
    </li>
    <li>It does not do stemming or lemmatization but there are ways of incorporating a stemmer from, e.g., NLTK</li>
    <li>Finally, it vectorizes, producing sparse matrices of word frequencies. 
        (There is an option to produce a binary representation, instead of frequencies)
    </li>
</ul>

In [10]:
# Create the pipeline
text_pipeline = Pipeline([
        ("vectorizer", CountVectorizer(stop_words='english'))
    ])

# Run the pipeline
text_pipeline.fit(tweets)
X = text_pipeline.transform(tweets)

In [11]:
# Let's see the features
text_pipeline.named_steps["vectorizer"].get_feature_names()

['background',
 'born',
 'color',
 'comes',
 'hate',
 'hating',
 'heart',
 'human',
 'learn',
 'love',
 'naturally',
 'opposite',
 'people',
 'person',
 'religion',
 'skin',
 'taught']

In [12]:
# We can look at the sparse array. The first number identifies the tweet (0, 1 or 2), the second is which feature, and
# the last is the frequency
print(X)

  (0, 0)	1
  (0, 1)	1
  (0, 2)	1
  (0, 5)	1
  (0, 13)	1
  (0, 14)	1
  (0, 15)	1
  (1, 4)	2
  (1, 8)	2
  (1, 9)	1
  (1, 12)	1
  (1, 16)	1
  (2, 3)	1
  (2, 6)	1
  (2, 7)	1
  (2, 9)	1
  (2, 10)	1
  (2, 11)	1


In [13]:
# Vectorize a new document
new_document = "Unsurprisingly, people hate to learn that their religion loves to hate."

new_document_as_vector = text_pipeline.transform([new_document])

In [14]:
# Notice how it ignores words that weren't in the original tweets, such as "unsurprisingly" and "loves"

print(new_document_as_vector)

  (0, 4)	2
  (0, 8)	1
  (0, 12)	1
  (0, 14)	1


<ul>
    <li>In the example below, we put a <code>TfidfVectorizer</code> into a pipeline instead</li>
    <li>By default, it normalizes the values using the L2 norm (see CS4611)</li>
</ul>

In [15]:
# Create the pipeline
text_pipeline = Pipeline([
        ("vectorizer", TfidfVectorizer(stop_words='english'))
    ])

# Run the pipeline
text_pipeline.fit(tweets)
X = text_pipeline.transform(tweets)

In [16]:
print(X)

  (0, 15)	0.377964473009
  (0, 14)	0.377964473009
  (0, 13)	0.377964473009
  (0, 5)	0.377964473009
  (0, 2)	0.377964473009
  (0, 1)	0.377964473009
  (0, 0)	0.377964473009
  (1, 16)	0.307460988215
  (1, 12)	0.307460988215
  (1, 9)	0.233832006484
  (1, 8)	0.614921976431
  (1, 4)	0.614921976431
  (2, 11)	0.423394483412
  (2, 10)	0.423394483412
  (2, 9)	0.322002417819
  (2, 7)	0.423394483412
  (2, 6)	0.423394483412
  (2, 3)	0.423394483412


In [17]:
# Vectorize a new document
new_document = "Unsurprisingly, people hate to learn that their religion loves to hate."

new_document_as_vector = text_pipeline.transform([new_document])

In [18]:
# Notice how it ignores words that weren't in the original tweets, such as "unsurprisingly" and "loves"

print(new_document_as_vector)

  (0, 14)	0.377964473009
  (0, 12)	0.377964473009
  (0, 8)	0.377964473009
  (0, 4)	0.755928946018


<h2>Similarity &amp; distance for bag-of-words representation</h2>
<ul>
    <li>For details and formulae, see CS4611</li>
    <li>Euclidean distance is not suitable</li>
    <li>Very common is <b>cosine similarity</b>, which, for vectors of non-negative values such as these, gives
        values in $[0, 1]$, where 1 means 'identical'
    </li>
    <li>To get <b>cosine distance</b>, we can subtract from 1, so now 1 means 'completely different'</li>
    <li>The exact formulae differ depending on what is assumed about normalization
        <ul>
            <li>If we assume the vectors have been normalized, then the formula is simpler: it is just the dot
                product of the two vectors
            </li>
            <li>If not, then the formula is more complicated</li>
        </ul>
    </li>
</ul>
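
<p>
    For vectors that have <em>not</em> been normalized, cosine similarity is the dot product divided by the
    product of the two vectors' lengths. A minimal sketch in plain Python, using the count vectors from the
    running example (the function names, and the choice to return 0 for an all-zero vector, are assumptions
    made for this illustration):
</p>

```python
import math

def cosine_similarity(x, xprime):
    dot = sum(a * b for a, b in zip(x, xprime))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_xprime = math.sqrt(sum(b * b for b in xprime))
    if norm_x == 0 or norm_xprime == 0:
        return 0.0  # convention chosen here: an all-zero vector is similar to nothing
    return dot / (norm_x * norm_xprime)

def cosine_distance(x, xprime):
    return 1 - cosine_similarity(x, xprime)

# Count vectors for the three tweets (rows of the count vectorization table)
tweet0 = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
tweet1 = [0, 0, 0, 0, 2, 0, 0, 0, 2, 1, 0, 0, 1, 0, 0, 0, 1]
tweet2 = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0]

print(cosine_distance(tweet0, tweet1))  # 1.0: no tokens in common
print(cosine_similarity(tweet1, tweet2))  # small but non-zero: they share only "love"
```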

<h2>Similarity &amp; distance for bag-of-words representation in scikit-learn</h2>
<ul>
    <li>The code below assumes that the vectors have already been normalized, e.g. produced
        by <code>TfidfVectorizer</code>
    </li>
</ul>

In [23]:
def cosine(x, xprime):
    # Assumes x and  xprime are already normalized
    # Converts from sparse matrices because np.dot does not work on them
    return 1 - x.toarray().dot(xprime.toarray().T)

In [24]:
# So which of Barack Obama's tweets is most similar to our new document?
tweets[np.argmin([cosine(new_document_as_vector, x) for x in X])]

'People must learn to hate, and if they can learn to hate, they can be taught to love.'