<h1>CS4618: Artificial Intelligence I</h1>
<h1>Datasets</h1>
<h2>
    Derek Bridge<br>
    School of Computer Science and Information Technology<br>
    University College Cork
</h2>

<h1>Initialization</h1>
$\newcommand{\Set}[1]{\{#1\}}$ 
$\newcommand{\Tuple}[1]{\langle#1\rangle}$ 
$\newcommand{\v}[1]{\pmb{#1}}$ 
$\newcommand{\cv}[1]{\begin{bmatrix}#1\end{bmatrix}}$ 
$\newcommand{\rv}[1]{[#1]}$ 
$\DeclareMathOperator{\argmax}{arg\,max}$ 
$\DeclareMathOperator{\argmin}{arg\,min}$ 
$\DeclareMathOperator{\dist}{dist}$
$\DeclareMathOperator{\abs}{abs}$

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

# Class, for use in pipelines, to select certain columns from a DataFrame and convert to a numpy array
# From A. Geron: Hands-On Machine Learning with Scikit-Learn & TensorFlow, O'Reilly, 2017
# Modified by Derek Bridge to allow for casting in the same ways as pandas.DataFrame.astype
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names, dtype=None):
        self.attribute_names = attribute_names
        self.dtype = dtype
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_selected = X[self.attribute_names]
        if self.dtype:
            return X_selected.astype(self.dtype).values
        return X_selected.values

<h1>Features</h1>
<ul>
    <li>Suppose we want to store data about objects, such as houses</li>
    <li><b>Features</b> describe the houses, e.g.
        <ul>
            <li>$\mathit{flarea}$: the total floor area (in square metres)</li>
            <li>$\mathit{bdrms}$: the number of bedrooms</li>
            <li> $\mathit{bthrms}$: the number of bathrooms</li>
        </ul>
    </li>
    <li>A particular house has <b>values</b> for the features
        <ul>
            <li>e.g. your house: $\mathit{flarea} = 114, \mathit{bdrms} = 3, \mathit{bthrms} = 2$</li>
        </ul>
    </li>
    <li>Then we can represent a house using a vector
        <ul>
            <li>e.g. your house: $\cv{114\\3\\2}$</li>
        </ul>
    </li>
    <li>We will always use $n$ to refer to the number of features, e.g. above $n = 3$</li>
</ul>

<h1>Examples</h1> 
<ul>
    <li>Suppose we collect a <b>dataset</b> containing data about lots of houses, e.g.:
        $$\cv{114\\3\\2} \,\, \cv{92.9\\3\\2} \,\,\cv{171.9\\4\\3} \,\, \cv{79\\3\\1}$$
    </li>
    <li>Each member of this dataset is called an <b>example</b>, and we will use $m$ to refer to the number of examples, e.g.
        above $m = 4$
    </li>
</ul>

<h1>Dataset notation</h1>
<ul>
    <li>We will use a <em>superscript</em> to index the examples
        <ul>
            <li>
                $\v{x}^{(i)}$ will be the $i$th example
            </li>
            <li>
                The first example in the dataset is $\v{x}^{(1)}$, the second is $\v{x}^{(2)}$, $\ldots$, 
                the last is $\v{x}^{(m)}$ (Note, we index from 1)
            </li>
            <li>
                We're writing the superscript in parentheses to make it clear that we are using it for indexing.
                It is not 'raising to a power'. If we want to raise to a power, we will drop the parentheses.
            </li>
        </ul>
    </li>
    <li>We will use a <em>subscript</em> to index the features (again starting from 1)</li>
    <li>Class exercise. Using the dataset on the previous slide
        <ul>
            <li>what is $\v{x}_2^{(1)}$?</li>
            <li>what is $\v{x}_1^{(2)}$?</li>
        </ul>
    </li>
</ul>

<h2>Dataset as a matrix</h2>
<ul>
    <li>We can represent a dataset $\Set{\v{x}^{(1)}, \v{x}^{(2)}, \ldots, \v{x}^{(m)}}$ as an $m \times n$
        matrix $\v{X}$ as follows:
        $$\v{X} = \begin{bmatrix}
              \v{x}_1^{(1)} & \v{x}_2^{(1)} & \ldots & \v{x}_n^{(1)} \\
              \v{x}_1^{(2)} & \v{x}_2^{(2)} & \ldots & \v{x}_n^{(2)} \\
              \vdots        & \vdots        & \vdots & \vdots \\
              \v{x}_1^{(m)} & \v{x}_2^{(m)} & \ldots & \v{x}_n^{(m)} \\
              \end{bmatrix}
        $$
    </li>
    <li>Note how each example becomes a <em>row</em> in $\v{X}$</li>
    <li>You can think of row $i$ as the transpose of $\v{x}^{(i)}$</li>
    <li>For the example dataset, we get
        $$\v{X} = 
            \begin{bmatrix}
                114 & 3 & 2 \\
                92.9 & 3 & 2 \\
                171.9 & 4 & 3 \\
                79 & 3 & 1
            \end{bmatrix}
        $$
    </li>
</ul>
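
<p>
    As a rough sketch of how this notation maps onto a numpy array: the matrix below is the example dataset, and
    <code>feature_value</code> is a hypothetical helper (not part of numpy) that translates the 1-based indexes 
    of the notation into numpy's 0-based indexes.
</p>

```python
import numpy as np

# The four example houses, one per row: [flarea, bdrms, bthrms]
X = np.array([
    [114.0, 3, 2],
    [92.9, 3, 2],
    [171.9, 4, 3],
    [79.0, 3, 1],
])

# m examples (rows) and n features (columns)
m, n = X.shape

def feature_value(X, i, j):
    """Return feature j of example i, both counted from 1 as in the notation."""
    return X[i - 1, j - 1]

print(m, n)
print(X[3])                    # the fourth example, as a row of X
print(feature_value(X, 4, 3))  # number of bathrooms in the fourth house
```

<p>
    The only thing to watch is the off-by-one: the notation counts from 1, numpy counts from 0.
</p>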

<h1>Cork Property Prices Dataset</h1>
<ul>
    <li>At the beginning of November 2014, I scraped a dataset of property prices for Cork city from www.daft.ie</li>
    <li>They are in a CSV file. Each line in the file is an example, representing one house</li>
    <li>Hence, each line of the file contains the feature-values for the floor area, number of bedrooms, number of
        bathrooms, and several other features that we will ignore for now
    </li>
    <li>We will use the pandas library
        <ul>
            <li>to read the dataset from the CSV file into what pandas calls a DataFrame</li>
            <li>to explore the dataset: looking at values, computing summary statistics, plotting graphs&hellip;</li>
        </ul>
    </li>
    <li>But then we will use the scikit-learn library
        <ul>
            <li>we will create 'pipelines' to transform the data</li>
            <li>typically the first step in every pipeline will convert the pandas DataFrame to a numpy 2D array</li>
            <li>typically the next step in the pipeline will prepare the data (e.g. scale it)</li>
            <li>typically the last step in the pipeline will do something interesting: clustering, regression, 
                classification,&hellip;
            </li>
        </ul>
    </li>
</ul>

<h1>Using pandas to Read and Explore the Data</h1>

In [4]:
# Use pandas to read the CSV file into a DataFrame
df = pd.read_csv("datasets/dataset_corkA.csv")

In [5]:
# The dimensions
df.shape

(207, 9)

In [6]:
# The features
df.columns

Index(['flarea', 'type', 'bdrms', 'bthrms', 'floors', 'devment', 'ber',
       'location', 'price'],
      dtype='object')

In [7]:
# The datatypes
df.dtypes

flarea      float64
type         object
bdrms         int64
bthrms        int64
floors        int64
devment      object
ber          object
location     object
price         int64
dtype: object

In [8]:
# Summary statistics
df.describe(include="all")

Unnamed: 0,flarea,type,bdrms,bthrms,floors,devment,ber,location,price
count,207.0,207,207.0,207.0,207.0,207,207,207,207.0
unique,,4,,,,2,12,36,
top,,Semi-detached,,,,SecondHand,G,CityCentre,
freq,,65,,,,204,25,40,
mean,128.094686,,3.434783,2.10628,1.826087,,,,274.724638
std,73.970582,,1.23239,1.185802,0.379954,,,,171.756507
min,41.8,,1.0,1.0,1.0,,,,55.0
25%,82.65,,3.0,1.0,2.0,,,,165.0
50%,106.0,,3.0,2.0,2.0,,,,225.0
75%,153.65,,4.0,3.0,2.0,,,,327.5


In [9]:
# A few of the examples
df.head(3)

Unnamed: 0,flarea,type,bdrms,bthrms,floors,devment,ber,location,price
0,497.0,Detached,4,5,2,SecondHand,B2,Carrigrohane,975
1,83.6,Detached,3,1,1,SecondHand,D2,Glanmire,195
2,97.5,Semi-detached,3,2,2,SecondHand,D1,Glanmire,225


<h1>Using a scikit-learn Pipeline</h1>
<ul>
    <li>This pipeline will contain only one step: a class for selecting certain features (columns) from a pandas DataFrame, 
        and converting to a numpy array (which is what scikit-learn uses)
    </li>
    <li>Normally, a pipeline will contain more than one step (see later examples)</li>
</ul>

In [10]:
# The features we want to select
features = ["flarea", "bdrms", "bthrms"]

# Create the pipeline
pipeline = Pipeline([
        ("selector", DataFrameSelector(features))
    ])

In [11]:
# Run the pipeline
pipeline.fit(df)
X = pipeline.transform(df)

In [12]:
# Let's take a look at a few rows in X - to show you that we now have a 2D numpy array
X[:3]

array([[ 497. ,    4. ,    5. ],
       [  83.6,    3. ,    1. ],
       [  97.5,    3. ,    2. ]])
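
<p>
    To see what the selector step amounts to, here is a small sketch that does the same selection and conversion
    without a pipeline. The DataFrame <code>df_small</code> is an assumption for illustration: it holds a few columns 
    of the three rows shown by <code>df.head(3)</code>, typed in by hand rather than read from the CSV file.
</p>

```python
import pandas as pd

# A few columns of the first three examples, typed in by hand
df_small = pd.DataFrame({
    "flarea": [497.0, 83.6, 97.5],
    "type": ["Detached", "Detached", "Semi-detached"],
    "bdrms": [4, 3, 3],
    "bthrms": [5, 1, 2],
})

features = ["flarea", "bdrms", "bthrms"]

# Select the columns we want, then take the underlying numpy array
X_small = df_small[features].values

print(type(X_small).__name__, X_small.shape)
print(X_small)
```

<p>
    Because the selected columns mix floats and integers, the resulting array holds floats throughout, which is why 
    the bedroom and bathroom counts print with a decimal point.
</p>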

<h1>Similarity &amp; Distance</h1>
<ul>
    <li>In AI, we often want to know how <em>similar</em> one object is to another
        <ul>
            <li>E.g. how similar is my house to yours</li>
            <li>E.g. which house in our dataset is most similar to yours</li>
        </ul>
    </li>
    <li>In fact, here we are instead going to measure how <em>different</em> they are using a <b>distance function</b>
        <ul>
            <li>(N.B. This is not about geographical distance)</li>
        </ul>
    </li>
    <li>Let $\v{x}$ be one vector of feature values and $\v{x}'$ be another</li>
    <li>Simplest is to measure their <b>Euclidean distance</b>:
        $$d(\v{x}, \v{x}') = \sqrt{(\v{x}_1 - \v{x}_1')^2 + (\v{x}_2 - \v{x}_2')^2 + \ldots + (\v{x}_n - \v{x}_n')^2}$$
        or, more concisely:
        $$d(\v{x}, \v{x}') = \sqrt{\sum_{j=1}^n(\v{x}_j - \v{x}_j')^2}$$
    </li>
    <li>Euclidean distance has a minimum value of 0 (meaning identical) but no maximum value (depends on your data)</li>
    <li>Class exercise. What is the Euclidean distance between $\v{x} = \cv{100\\1\\4}$ and $\v{x}' = \cv{100\\5\\1}$?</li>
</ul>
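
<p>
    The summation in the formula can be written out directly as a loop over the $n$ features. This is only a sketch
    to make the formula concrete; <code>euc_loop</code> and the two house vectors are illustrative names and values.
</p>

```python
import math

def euc_loop(x, xprime):
    """Euclidean distance written as an explicit sum over the features."""
    total = 0.0
    for x_j, xprime_j in zip(x, xprime):
        total += (x_j - xprime_j) ** 2
    return math.sqrt(total)

# Two illustrative houses: [flarea, bdrms, bthrms]
a = [114.0, 3, 2]
b = [107.0, 3, 1]

# The squared differences are 49, 0 and 1, so this is the square root of 50
print(euc_loop(a, b))
```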

<h1>Euclidean Distance in numpy</h1>
<ul>
    <li>It has a nice vectorized implementation (no loop!) using numpy:</li>
</ul>

In [13]:
def euc(x, xprime):
    return np.sqrt(np.sum((x - xprime)**2))

In [14]:
# Example
your_house = np.array([114.0, 3, 2])
my_house = np.array([107.0, 3, 1])

euc(your_house, my_house)

7.0710678118654755

<ul>
    <li>We can compute the distance between your house and all the houses in X</li>
    <li>(We have to write a loop here, because our <code>euc</code> function compares only one pair of vectors at a time)</li>
</ul>

In [15]:
dists = [euc(your_house, x) for x in X]

In [16]:
# Just to show you, here are the first 3 distances
dists[:3]

[383.01305460780316, 30.4164429215515, 16.5]

In [17]:
# Even better, we can, with one line of code, find the distance to the most similar house
np.min([euc(your_house, x) for x in X])

1.5620499351813331

In [18]:
# Even better again, we can find which house is the most similar
np.argmin([euc(your_house, x) for x in X])

25

In [19]:
# Best of all, we can display the most similar house
# (iloc selects a row by position; the older df.ix has been removed from pandas)
df.iloc[np.argmin([euc(your_house, x) for x in X])]

flarea              115.2
type        Semi-detached
bdrms                   4
bthrms                  2
floors                  2
devment        SecondHand
ber                    D2
location          Douglas
price                 385
Name: 25, dtype: object
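
<p>
    A possible alternative to the list comprehension is to let numpy broadcasting subtract one house from every row
    of the matrix at once. The sketch below uses the four-house example dataset from earlier rather than the Cork 
    data, so the numbers differ from the outputs above.
</p>

```python
import numpy as np

# The four example houses: [flarea, bdrms, bthrms]
X = np.array([
    [114.0, 3, 2],
    [92.9, 3, 2],
    [171.9, 4, 3],
    [79.0, 3, 1],
])
my_house = np.array([107.0, 3, 1])

# Broadcasting: my_house is subtracted from each row of X.
# Summing over axis=1 adds up the squared differences within each row.
dists = np.sqrt(np.sum((X - my_house) ** 2, axis=1))

# Position of the row with the smallest distance
nearest = np.argmin(dists)

print(dists)
print(nearest)
```

<p>
    The result has one distance per example, so <code>np.min</code> and <code>np.argmin</code> can be applied to it
    exactly as before.
</p>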

<h1>Problems with Euclidean distance</h1>
<ul>
    <li>There are at least two problems with Euclidean distance (and many other distance measures too):
        <ul>
            <li>Features with different scales</li>
            <li>The curse of dimensionality (next lecture)</li>
        </ul>
    </li>
</ul>

<h1>Scaling Numeric Values</h1>
<ul>
    <li>Different numeric-valued features often have very different ranges
        <ul>
            <li>E.g. the values for floor area are going to range from a few tens to a few hundreds of square metres</li>
            <li>But the number of bedrooms and bathrooms is going to range from 0 to a dozen or so at most</li>
        </ul>
    </li>
    <li>
        When computing the Euclidean distance, features with large ranges will dominate the distance calculations, 
        thus giving features with small ranges negligible influence.
    </li>
    <li>
        E.g., consider your house $\v{x} = \cv{114\\3\\2}$ and two others, $\v{y} = \cv{119\\3\\2}$ and
        $\v{z} = \cv{114\\7\\2}$. 
        <ul>
            <li><em>Intuitively</em>, which house is more similar to yours, $\v{y}$ or $\v{z}$?</li>
            <li>Now compute the Euclidean distances</li>
            <li>According to these distances, which house is more similar to yours?</li>
        </ul>
    </li>
    <li>
        The solution is to <b>scale</b> (or 'normalize') the values so that they have similar ranges
    </li>
    <li>We'll discuss two ways to do this:
        <ul>
            <li>Min-max scaling</li>
            <li>Standardization</li>
        </ul>
    </li>
</ul>

<h1>Min-Max Scaling</h1>
<ul>
    <li>Suppose we want to scale feature $j$</li>
    <li>Let $max_j$ be the maximum possible value for this feature, which
        can be supplied by your domain expert
    </li>
    <li>A quick-and-dirty way to scale the values to $[0,1]$ is to divide each value $\v{x}_j$ by $max_j$:
        $$\v{x}_j \gets \frac{\v{x}_j}{max_j}$$
        <ul>
            <li>E.g. suppose no house will be above 500 square metres</li>
            <li>So you divide values by 500</li>
        </ul>
    </li>
    <li>Suppose your domain expert also supplies a minimum possible value $min_j$</li>
    <li>Then a slightly improved way to scale to $[0, 1]$ is to subtract the minimum value and divide by the range:
        $$\v{x}_j \gets \frac{\v{x}_j - min_j}{max_j - min_j}$$
        <ul>
            <li>Suppose the smallest houses are 40 square metres and the largest are 500 square metres</li>
            <li>So we subtract 40 and divide by $500 - 40$</li>
        </ul>
        This is called <b>min-max scaling</b>
    </li>
</ul>
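
<p>
    Here is a minimal sketch of min-max scaling with expert-supplied limits, applied to the three houses 
    $\v{x}$, $\v{y}$ and $\v{z}$ from the discussion of scaling. The floor area limits (40 and 500) come from the
    example above; the limits of 0 and 12 for bedrooms and bathrooms are assumptions made up for this illustration.
</p>

```python
import numpy as np

def min_max_scale(x, mins, maxs):
    """Subtract the minimum and divide by the range, feature by feature."""
    return (x - mins) / (maxs - mins)

def euc(x, xprime):
    return np.sqrt(np.sum((x - xprime) ** 2))

# Limits per feature: [flarea, bdrms, bthrms]
# (the bdrms and bthrms limits are assumed, not taken from the dataset)
mins = np.array([40.0, 0.0, 0.0])
maxs = np.array([500.0, 12.0, 12.0])

x = np.array([114.0, 3, 2])   # your house
y = np.array([119.0, 3, 2])   # 5 square metres larger
z = np.array([114.0, 7, 2])   # 4 more bedrooms

# Distances on the raw values
print(euc(x, y), euc(x, z))

# Distances after scaling every feature to [0, 1]
x_s, y_s, z_s = (min_max_scale(v, mins, maxs) for v in (x, y, z))
print(euc(x_s, y_s), euc(x_s, z_s))
```

<p>
    On the raw values, $\v{z}$ comes out closer to your house than $\v{y}$; after scaling, the order is reversed.
</p>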


<h1>Min-Max Scaling in scikit-learn</h1>
<ul>
    <li>scikit-learn provides a class called <code>MinMaxScaler</code>, which does something similar:
        <ul>
            <li>
                Above, we said we should use the smallest <em>possible</em> value and the largest <em>possible</em>
                value &mdash; presumably we got them from our domain expert
            </li>
            <li>
                In scikit-learn, the min and max are computed from the data: the smallest and largest <em>actual</em> values
                in the dataset
            </li>
            <li>
                <b>Question:</b> What might potentially go wrong by using scikit-learn's approach?
            </li>
        </ul>
    </li>
    <li>We can include the scaler as a step in our pipeline</li>
</ul>

In [20]:
# The features we want to select
features = ["flarea", "bdrms", "bthrms"]

# Create the pipeline
pipeline = Pipeline([
        ("selector", DataFrameSelector(features)),
        ("scaler", MinMaxScaler())
    ])

In [21]:
# Run the pipeline
pipeline.fit(df)
X = pipeline.transform(df)

In [22]:
# Let's take a look at a few rows in X
X[:3]

array([[ 1.        ,  0.33333333,  0.44444444],
       [ 0.09182777,  0.22222222,  0.        ],
       [ 0.1223638 ,  0.22222222,  0.11111111]])

In [23]:
# Let's scale your house too
# Don't try to understand or copy this code - it's a hack that you won't need
your_house_df = pd.DataFrame([{"flarea":114.0, "bdrms":3, "bthrms":2}])
your_house_scaled = pipeline.transform(your_house_df)[0]
your_house_scaled

array([ 0.1586116 ,  0.22222222,  0.11111111])

In [24]:
# To see what effect this has had, let's see which house is most similar to yours
np.argmin([euc(your_house_scaled, x) for x in X])

23

In [25]:
# Let's look at its features
df.iloc[np.argmin([euc(your_house_scaled, x) for x in X])]

flarea              112.4
type        Semi-detached
bdrms                   3
bthrms                  2
floors                  2
devment        SecondHand
ber                    C2
location        Blackrock
price                 225
Name: 23, dtype: object
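
<p>
    One thing to keep in mind when the min and max are computed from the data: a house seen later may fall outside
    the range of the houses used for fitting. The sketch below imitates this in plain numpy on the four-house example
    dataset; it is not scikit-learn's code, and <code>new_house</code> is an invented example.
</p>

```python
import numpy as np

# The four example houses: [flarea, bdrms, bthrms]
X = np.array([
    [114.0, 3, 2],
    [92.9, 3, 2],
    [171.9, 4, 3],
    [79.0, 3, 1],
])

# "Fit": take the smallest and largest actual values of each feature
mins = X.min(axis=0)
maxs = X.max(axis=0)

# "Transform": the fitted data ends up within [0, 1]
X_scaled = (X - mins) / (maxs - mins)
print(X_scaled.min(), X_scaled.max())

# A new house that is bigger than anything in the fitted data
new_house = np.array([200.0, 5, 1])
new_scaled = (new_house - mins) / (maxs - mins)
print(new_scaled)
```

<p>
    The scaled floor area and bedroom count of the new house are both greater than 1, so the promise of values 
    in $[0, 1]$ holds only for data within the fitted range.
</p>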

<h1>Standardization</h1>
<ul>
    <li>In some cases, you don't want feature values to have the same range but to have the same mean
        and even the same variance
    </li>
    <li>
        One idea is <b>mean centering</b>, where you subtract the mean value of the feature
        <ul>
            <li>If you do this to all values, some of the new values will be positive and some will be negative and 
                their mean will be approximately zero
            </li>
        </ul>
    </li>
    <li>But better still is <b>standardization</b>, in which you subtract the mean and divide by the standard
        deviation:
        $$\v{x}_j \gets \frac{\v{x}_j - \mu_j}{\sigma_j}$$
        where $\mu_j$ is the mean of the values for feature $j$ and $\sigma_j$ is their standard deviation
    </li>
    <li>
        If you use this, then the mean will be approximately zero and the standard deviation will be 1 
    </li>
</ul>
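
<p>
    The formula can be checked with a few lines of numpy. This is a sketch on the four-house example dataset, using 
    means and standard deviations computed from those four rows (numpy's <code>std</code> divides by the number of 
    values by default).
</p>

```python
import numpy as np

# The four example houses: [flarea, bdrms, bthrms]
X = np.array([
    [114.0, 3, 2],
    [92.9, 3, 2],
    [171.9, 4, 3],
    [79.0, 3, 1],
])

# Mean and standard deviation of each feature (column)
mu = X.mean(axis=0)
sigma = X.std(axis=0)

# Standardize: subtract the mean, divide by the standard deviation
X_std = (X - mu) / sigma

print(X_std.mean(axis=0))  # approximately zero for every feature
print(X_std.std(axis=0))   # 1 for every feature, up to rounding
```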

<h1>Standardization in scikit-learn</h1>
<ul>
    <li>scikit-learn provides a class called <code>StandardScaler</code>
    </li>
    <li>It uses means and standard deviations that it calculates from your dataset (statisticians would say that it should
        use the population mean and standard deviation, but these are generally not known)
    </li>
    <li>We can include the scaler as a step in our pipeline</li>
</ul>

In [26]:
# The features we want to select
features = ["flarea", "bdrms", "bthrms"]

# Create the pipeline
pipeline = Pipeline([
        ("selector", DataFrameSelector(features)),
        ("scaler", StandardScaler())
    ])

In [27]:
# Run the pipeline
pipeline.fit(df)
X = pipeline.transform(df)

In [28]:
# Let's take a look at a few rows in X
X[:3]

array([[ 4.99927973,  0.45974713,  2.4462228 ],
       [-0.6029769 , -0.35365164, -0.93520037],
       [-0.41460881, -0.35365164, -0.08984458]])

In [29]:
# Let's scale your house too
# Don't try to understand or copy this code - it's a hack that you won't need
your_house_df = pd.DataFrame([{"flarea":114.0, "bdrms":3, "bthrms":2}])
your_house_scaled = pipeline.transform(your_house_df)[0]
your_house_scaled

array([-0.19100641, -0.35365164, -0.08984458])

In [30]:
# To see what effect this has had, let's see which house is most similar to yours
np.argmin([euc(your_house_scaled, x) for x in X])

23

<p>
    (Here, it's the same as when we used min-max scaling; it won't always be so)
</p>