## Chapter 8 Principles of Feature Engineering and Selection

# 8.4  Histogram features for real data types

In this Section we briefly overview methods of knowledge-driven feature design for naturally high dimensional text, image, and audio data types, all of which are based on the same core concept for representing data: the histogram. A histogram is a simple way of summarizing the contents of an array of numbers as a vector showing how many times each number appears in the array. Although these data types differ substantially in nature, we will see how the notion of a histogram-based feature makes sense in each context. While histogram features are not guaranteed to produce perfect separation, their simplicity and all-around solid performance make them quite popular in practice.

Lastly note that the discussion in this Section is only aimed at giving the reader a high level, intuitive understanding of how common knowledge-driven feature design methods work. The interested reader is encouraged to consult specialized texts (referenced throughout this Section) on each subject for further study. 

## 8.4.1 Histogram features for text data

Many popular uses of classification, including spam detection and sentiment analysis, are based on text data (e.g., online articles, emails, social-media updates, etc.). However, with text data the initial input (i.e., the document itself) requires a significant amount of preprocessing and transformation prior to further feature design and classification. The most basic yet widely used feature of a document for regression/classification tasks is called a Bag of Words (BoW) histogram or feature vector. Here we introduce the BoW histogram and discuss its strengths, weaknesses, and common extensions.

A BoW feature vector of a document is a simple histogram count of the different words it contains with respect to a single corpus or collection of documents (each count of an individual word is a feature, and taken together gives a feature vector), minus those nondistinctive words that do not characterize the document. To illustrate this idea
let us build a BoW representation for the following corpus of two documents each containing a single sentence. 

\begin{equation}
\begin{array}{c}
1)\,\,\mbox{dogs are the best} \\
\,\,2)\,\,\mbox{cats are the worst}
\end{array}\label{eq:text-features-documents-example-1}
\end{equation}

To make the BoW representation of these documents we begin by parsing them, creating representative vectors (histograms) $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ which contain the number of times each word appears in each document. For the two documents above these vectors take the form

\begin{equation}
\begin{array}{cc}
\mathbf{x}_{1}=\frac{1}{\sqrt{2}}\left[\begin{array}{c}
1\\
0\\
1\\
0
\end{array}\right]\left(\begin{array}{c}
\mbox{best}\\
\mbox{cat}\\
\mbox{dog}\\
\mbox{worst}
\end{array}\right) & \,\,\mathbf{x}_{2}=\frac{1}{\sqrt{2}}\left[\begin{array}{c}
0\\
1\\
0\\
1
\end{array}\right]\left(\begin{array}{c}
\mbox{best}\\
\mbox{cat}\\
\mbox{dog}\\
\mbox{worst}
\end{array}\right).\end{array}\label{eq:BoW-vector-representation}
\end{equation}

Notice that uninformative words such as 'are' and 'the', typically referred to as *stop words*, are not included in the representation. Further notice that we count the singular 'dog' and 'cat' in place of their plurals, which appeared in the actual documents in (\ref{eq:text-features-documents-example-1}). This preprocessing step is commonly called *stemming*, where related words with a common stem or root are reduced to and then represented by their common root. For instance, the words 'learn', 'learning', 'learned', and 'learner' are all represented by and counted as 'learn' in the final BoW feature vector. Additionally, each BoW vector is normalized to have unit length.
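The construction just described can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: the stop-word list and the `crude_stem` helper (which only strips a trailing 's') are assumptions made for this example, and real systems would use a curated stop-word list and a proper stemming algorithm.

```python
import math
from collections import Counter

# Assumed, illustrative stop-word list (not a standard one).
STOP_WORDS = {"are", "the"}

def crude_stem(word):
    """Toy stemmer (hypothetical): drop one trailing 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def tokenize(document):
    """Lowercase, split on whitespace, drop stop words, then stem."""
    words = document.lower().split()
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

def bow_vectors(corpus):
    """Return the sorted vocabulary and one unit-length BoW vector per document."""
    tokenized = [tokenize(doc) for doc in corpus]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[w] for w in vocabulary]           # histogram of word counts
        length = math.sqrt(sum(c * c for c in raw))     # Euclidean length
        vectors.append([c / length if length > 0 else 0.0 for c in raw])
    return vocabulary, vectors

corpus = ["dogs are the best", "cats are the worst"]
vocabulary, (x1, x2) = bow_vectors(corpus)
print(vocabulary)  # ['best', 'cat', 'dog', 'worst']
print(x1)          # nonzero entries at 'best' and 'dog'
print(x2)          # nonzero entries at 'cat' and 'worst'
```

Sorting the vocabulary alphabetically reproduces the ordering (best, cat, dog, worst) used in the vectors above, with each nonzero entry equal to $1/\sqrt{2}$.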

Given that a BoW vector contains only non-negative entries and has unit length, the correlation (inner product) between two BoW vectors $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ always satisfies $0\leq\mathbf{x}_{1}^{T}\mathbf{x}_{2}^{\,}\leq1$. When the correlation is zero (i.e., the vectors are perpendicular), as with the two vectors in (\ref{eq:BoW-vector-representation}), the two vectors are considered maximally different and will therefore (hopefully) belong to different classes. In the instances shown in (\ref{eq:BoW-vector-representation}) the fact that $\mathbf{x}_{1}^{T}\mathbf{x}_{2}^{\,}=0$ makes sense: the two documents are completely different, containing entirely different words and polar opposite sentiment. On the other hand, the higher the correlation between two vectors the more similar the documents are purported to be, with highly correlated documents (hopefully) belonging to the same class. For example, the BoW vector of the document "I love dogs" would have positive correlation with $\mathbf{x}_{1}$, the BoW vector in (\ref{eq:BoW-vector-representation}) of the document about dogs.
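As a worked instance of this last remark, suppose (purely as an assumption for illustration) that 'I' is treated as a stop word and that the vocabulary is extended with the word 'love', so that entries are ordered as (best, cat, dog, love, worst). Writing $\mathbf{x}_{3}$ for the BoW vector of "I love dogs", a symbol introduced only for this example, the single shared word 'dog' gives a correlation of one half:

\begin{equation}
\mathbf{x}_{1}^{T}\mathbf{x}_{3}^{\,}=\frac{1}{\sqrt{2}}\left[\begin{array}{ccccc}
1 & 0 & 1 & 0 & 0\end{array}\right]\frac{1}{\sqrt{2}}\left[\begin{array}{c}
0\\
0\\
1\\
1\\
0
\end{array}\right]=\frac{1}{2}.
\end{equation}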

However, because the BoW vector is such a simple representation of a document, completely ignoring word order, punctuation, etc., it can only provide a gross summary of a document's contents and is thus not always distinguishing. For example, the two documents "dogs are better than cats" and "cats are better than dogs" would be considered the same document using the BoW representation, even though they imply completely opposite relations. Nonetheless, the gross summary provided by BoW can be distinctive enough for many applications. Additionally, while more complex representations of documents (capturing word order, parts of speech, etc.) may be employed, they can often be unwieldy (see e.g., \cite{manning1999foundations}).
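The loss of word order can be checked directly. The short sketch below is an illustration under stated assumptions: the stop-word set and the fixed three-word vocabulary are chosen by hand for this example, and stemming is skipped for brevity.

```python
import math
from collections import Counter

STOP_WORDS = {"are", "than"}  # assumed, illustrative list

def bow(document, vocabulary):
    """Unit-length word-count vector of a document over a fixed vocabulary."""
    counts = Counter(w for w in document.lower().split() if w not in STOP_WORDS)
    raw = [counts[w] for w in vocabulary]
    length = math.sqrt(sum(c * c for c in raw))
    return [c / length if length > 0 else 0.0 for c in raw]

def correlation(u, v):
    """Inner product of two BoW vectors."""
    return sum(a * b for a, b in zip(u, v))

vocabulary = ["better", "cats", "dogs"]  # hand-picked for this example
a = bow("dogs are better than cats", vocabulary)
b = bow("cats are better than dogs", vocabulary)

print(a == b)             # True: the two documents are indistinguishable
print(correlation(a, b))  # approximately 1, the largest possible value
```

Because both sentences contain exactly the same words, their histograms coincide and no classifier built on these features alone can tell them apart.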

#### <span style="color:#a50e3e;">Example 1: </span>  Sentiment analysis

Determining the aggregated feelings of a large base of customers, using text-based content like product reviews, tweets, and comments, is commonly referred to as *sentiment analysis* (as first discussed in Example \ref{example:sentiment-analysis-1}). Classification models are often used to perform sentiment analysis, learning to distinguish consumer data expressing positive feelings from data expressing negative ones.

For example, Figure \ref{fig:team america} shows BoW vector representations for two brief reviews of a controversial comedy movie, one with a positive opinion and the other with a negative one. The BoW vectors are rotated sideways in this Figure so that the horizontal axis contains the common words between the two sentences (after stop word removal
and stemming), and the vertical axis represents the count for each word (before normalization). The polar opposite sentiment of these two reviews is perfectly represented in their BoW representations, which as one can see are orthogonal (i.e., they have zero correlation).

<figure>
  <img src= "../../mlrefined_images/superlearn_images/Fig_4_23_new.png" width="80%" height="60%" alt=""/>
  <figcaption>   
<strong>Figure 1:</strong> <em> BoW representation of two movie review excerpts, with words (after the removal of stop words and stemming) shared between the two reviews listed along the horizontal axis. The vastly different opinion of each review is reflected very well by the BoW histograms, which have zero correlation.
</em>  </figcaption> 
</figure>

#### <span style="color:#a50e3e;">Example 2: </span>  Spam detection

Spam detection is a standard text-based two class classification problem. Implemented in most email systems, spam detection automatically separates unwanted messages (e.g., advertisements), referred to as spam, from the emails users want to see. Once trained, a spam detector can remove unwanted messages without user input, greatly improving a user's email experience. In many spam detectors the BoW feature vectors are formed with respect to a specific list of spam words (or phrases), including 'free', 'guarantee', 'bargain', 'act now', 'all natural', etc., that are frequently seen in spam emails. Additionally, features like the frequency of certain characters such as '!' and '*' are appended to the BoW feature, as are other spam-targeted features like the total number of capital letters in the email and the length of the longest uninterrupted sequence of capital letters, as these features can further distinguish the two classes.
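A rough sketch of how such a feature vector might be assembled is given below. The phrase list reuses the examples named above, but the function name `spam_features`, the ordering of the features, and the use of simple substring counting are all assumptions made for illustration; they are not the construction used for the dataset discussed next.

```python
import re

# Illustrative lists only; a real detector would use a much longer vocabulary.
SPAM_PHRASES = ["free", "guarantee", "bargain", "act now", "all natural"]
SPECIAL_CHARS = ["!", "*"]

def spam_features(email):
    """Concatenate phrase counts, character frequencies, and capital-letter statistics."""
    text = email.lower()
    # BoW-style counts over the spam phrase list (naive substring matching).
    phrase_counts = [text.count(phrase) for phrase in SPAM_PHRASES]
    # Frequency of each special character relative to the email length.
    n_chars = max(len(email), 1)
    char_freqs = [email.count(ch) / n_chars for ch in SPECIAL_CHARS]
    # Capital-letter features: total count and longest uninterrupted run.
    capital_runs = re.findall(r"[A-Z]+", email)
    total_capitals = sum(len(run) for run in capital_runs)
    longest_run = max((len(run) for run in capital_runs), default=0)
    return phrase_counts + char_freqs + [total_capitals, longest_run]

email = "ACT NOW! Get a FREE sample, satisfaction guarantee!"
features = spam_features(email)
print(features)
```

For this message the phrase counts are `[1, 1, 0, 1, 0]`, there are 11 capital letters in total, and the longest run of capitals ('FREE') has length 4.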

In Figure \ref{fig:spam-results} we show classification results on a spam email dataset consisting of BoW, character frequencies, and other spam-focused features (including those mentioned previously) taken from $1813$ spam and $2788$ real email messages for a total
of $P=4601$ datapoints (this data is taken from \cite{Lichman:2013}).  Employing the softmax cost to learn the separator, the Figure shows the number of misclassifications per iteration of Newton's method (using the counting cost in (\ref{eq:counting cost}) at each iteration).  More specifically these classification results are shown for the same
dataset using only BoW features (in black), BoW and character frequencies (in green), and the BoW/character frequencies as well as spam-targeted features (in magenta) (see exercise \ref{exercise-perform-spam-detection} for further details). Unsurprisingly the addition of character frequencies improves the classification, with the best performance occurring when the spam-focused features are used as well.
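To make the experimental procedure concrete, the following sketch runs Newton's method on the softmax cost and records the counting cost (number of misclassifications) at each iteration. It is a toy illustration only: the one-feature dataset, the small regularizer `eps` added to the Hessian diagonal, and all function names are assumptions introduced here, and nothing below reproduces the spam experiment itself.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def misclassifications(w, xs, ys):
    """Counting cost: number of points whose predicted sign differs from the label."""
    count = 0
    for x, y in zip(xs, ys):
        score = w[0] + w[1] * x
        predicted = 1 if score > 0 else -1 if score < 0 else 0
        count += predicted != y
    return count

def newton_softmax(xs, ys, iterations=5, eps=1e-7):
    """Newton's method on g(w) = (1/P) sum_p log(1 + exp(-y_p (w0 + w1 x_p)))."""
    P = len(xs)
    w = [0.0, 0.0]
    history = [misclassifications(w, xs, ys)]
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            s = sigmoid(-y * (w[0] + w[1] * x))
            g0 += -y * s / P              # gradient entries
            g1 += -y * s * x / P
            c = s * (1.0 - s) / P         # Hessian weight for this point
            h00 += c
            h01 += c * x
            h11 += c * x * x
        h00 += eps                        # assumed small regularizer, keeps the
        h11 += eps                        # Hessian invertible on separable data
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 Newton system H d = g in closed form, then step w <- w - d.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        w = [w[0] - d0, w[1] - d1]
        history.append(misclassifications(w, xs, ys))
    return w, history

# Toy, linearly separable one-feature dataset (hypothetical).
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-1, -1, 1, 1]
w, history = newton_softmax(xs, ys)
print(history)  # misclassifications before training and after each Newton step
```

Plotting `history` against the iteration number gives the kind of curve shown in the Figure, one curve per choice of feature set.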

<figure>
  <img src= "../../mlrefined_images/superlearn_images/spam_features.png" width="80%" height="60%" alt=""/>
  <figcaption>   
<strong>Figure 2:</strong> <em> Results of applying the softmax cost (using Newton's method) to distinguish spam from real email using BoW and additional features. The number of misclassifications per iteration of Newton's method is shown in the case of BoW features (in black), BoW and character frequencies (in green), and BoW, character frequencies, as well as spam-focused features (in magenta). In each case adding more distinguishing features (on top of the BoW vector) improves classification. Data in this figure is taken from \cite{Lichman:2013}.
</em>  </figcaption> 
</figure>

## 8.4.2 Histogram features for image data 

To perform classification tasks on image data, like object detection (see Example \ref{Example-object-detection}), the raw input features are pixel values of an image itself. The pixel values of an $8$-bit grayscale image are each just a single integer in the range of $0$ (black) to $255$ (white), as illustrated in Figure \ref{fig:img-is-pixels}.  In other words, a grayscale image is just a matrix of integers ranging from $0$ to $255$. A color image is then just a set of three such grayscale matrices: one for each of the red, blue, and green channels.

<figure>
  <img src= "../../mlrefined_images/superlearn_images/close_up.png" width="80%" height="60%" alt=""/>
  <figcaption>   
<strong>Figure 3:</strong> <em> An $8$-bit grayscale image consists of pixels, each taking a value between $0$ (black) and $255$ (white). To visualize individual pixels, a small $8\times8$ block from the original image is blown up on the right.
</em>  </figcaption> 
</figure>

Pixel values themselves are typically not discriminative enough to be useful for classification tasks. We illustrate why this is the case using a simple example in Figure \ref{fig:edged_shapes}. Consider the three simple images of shapes shown in the left column of this Figure. The first two are similar triangles while the third shape is a square, and we would like an ideal set of features to reflect the similarity of the first two images as well as their distinctness from the last image. However, due to the difference in their relative size, position in the image, and the contrast of the image itself (the image with the smaller triangle is darker toned overall), if we were to use raw pixel values to compare the images (by taking the difference between each image pair) we would find that the square and the larger triangle in the top image are more similar than the two triangles themselves. That is, denoting by $\mathbf{X}_{i}$ the $i^{\textrm{th}}$ image, we would find that $\left\Vert \mathbf{X}_{1}-\mathbf{X}_{3}\right\Vert _{F}<\left\Vert \mathbf{X}_{1}-\mathbf{X}_{2}\right\Vert _{F}$. This is because the pixel values of the first and third image, due to their identical contrast and location of the triangle/square, are indeed more similar than those of the two triangle images.

<figure>
  <img src= "../../mlrefined_images/superlearn_images/edged_shapes.png" width="80%" height="60%" alt=""/>
  <figcaption>   
<strong>Figure 4:</strong> <em> (left column) Three images of simple shapes. While the triangles in the top two images are visually
similar, this similarity is not reflected by comparing their raw pixel values. (middle column) Edge detected versions of the original images,
here using $8$ edge orientations, retain the distinguishing structural content while significantly reducing the amount of information in
each image. (right column) By taking normalized histograms of the edge content we have a feature representation that captures the similarity
of the two triangles quite well while distinguishing both from the square.
</em>  </figcaption> 
</figure>

In the middle and right columns of Figure \ref{fig:edged_shapes} we illustrate a two step procedure that generates the sort of discriminating feature transformation we are after. In the first part we shift perspective from the pixels themselves to the edge content at each pixel. As first detailed in Example \ref{Example-Visual-feature-design-intro}, by
taking edges instead of pixel values we significantly reduce the amount of information we must deal with in an image without destroying its identifying structures. In the middle column of the Figure we show corresponding edge detected images, in particular highlighting $8$ equally (angularly) spaced edge orientations, starting from $0$ degrees
(horizontal edges) with $7$ additional orientations at increments of $22.5$ degrees, including $45$ degrees (capturing the diagonal edges of the triangles) and $90$ degrees (vertical edges). Clearly the edges retain distinguishing characteristics from each original image, while significantly reducing the amount of total information in each case. 

We then make a normalized histogram of each image's edge content. That is, we make a vector consisting of the amount of each edge orientation found in the image and normalize the resulting vector to have unit length. This is completely analogous to the BoW feature representation described for text data previously, with the counting of edge orientations being the analog of counting "words" in the case of text data. Here we also have a normalized histogram which represents an image grossly while ignoring the location and ordering of its information. However, as shown in the right column of the Figure, unlike raw pixel values these histogram feature vectors capture characteristic information about each image, with the top two triangle images having very similar histograms and both differing significantly from that of the third image of the square.
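The two-step procedure above (edge orientations, then a normalized histogram) can be sketched as follows. This is a simplified illustration, assuming central-difference gradients, $8$ orientation bins centered at multiples of $22.5$ degrees, and votes weighted by gradient magnitude; the edge detector used to produce the Figure may differ in its details.

```python
import math

def edge_histogram(image, n_bins=8):
    """Unit-length histogram of edge orientations for a grayscale image (list of rows)."""
    rows, cols = len(image), len(image[0])
    bin_width = 180.0 / n_bins              # 22.5 degrees when n_bins = 8
    hist = [0.0] * n_bins
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central-difference approximation of the image gradient.
            gx = image[r][c + 1] - image[r][c - 1]
            gy = image[r + 1][c] - image[r - 1][c]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue                    # flat region, no edge here
            # An edge runs perpendicular to the gradient; fold into [0, 180).
            gradient_angle = math.degrees(math.atan2(gy, gx))
            edge_angle = (gradient_angle + 90.0) % 180.0
            # Vote for the nearest orientation bin, weighted by edge strength.
            hist[int(round(edge_angle / bin_width)) % n_bins] += magnitude
    length = math.sqrt(sum(h * h for h in hist))
    return [h / length for h in hist] if length > 0 else hist

# Toy image: dark top half, bright bottom half, i.e. a single horizontal edge.
horizontal = [[0.0] * 6 for _ in range(3)] + [[1.0] * 6 for _ in range(3)]
# Its transpose has a single vertical edge.
vertical = [list(col) for col in zip(*horizontal)]

print(edge_histogram(horizontal))  # all weight in bin 0 (0 degrees)
print(edge_histogram(vertical))    # all weight in bin 4 (90 degrees)
```

Note that the histogram depends only on which orientations occur and how strongly, not on where in the image they occur, which is exactly the property that makes the two triangles look alike under this representation.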

Generalizations of the previously described edge histogram concept are widely used as feature transformations for visual object detection.  As detailed in Example \ref{Example-object-detection}, the task of object detection is a popular classification problem where objects of interest (e.g., faces) are located in an example image. While the
basic principles which led to the consideration of an edge histogram still hold, example images for such a task are significantly more complicated than the simple geometric shapes shown in Figure \ref{fig:edged_shapes}.  In particular, preserving local information at smaller scales of an image is considerably more important. Thus a natural way to extend
the edge histogram feature is to compute it not over the entire image, but by breaking the image into relatively small patches and computing an edge histogram of each patch, then concatenating the results. In Figure \ref{fig:edge-histogram-figure} we show a diagram of a common variation of this technique often used in practice, where neighboring histograms are normalized jointly in larger blocks (for further details see e.g., \cite{prince2012computer,dalal2005histograms}). Interestingly, this sort of feature transformation can in fact be written out algebraically as a set of quadratic transformations of the input image \cite{bristow2014linear}.
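A bare-bones version of this patch-and-block scheme might look as follows. It is a sketch under several assumptions: $3\times3$ pixel patches, blocks of $3\times3$ patches that slide one patch at a time, central-difference gradients, and hard assignment of each pixel to one of $8$ orientation bins. It is not the HoG implementation of any particular library, which would add refinements such as interpolated voting.

```python
import math

def orientation_bins(image, n_bins=8):
    """Per-pixel orientation bin and gradient magnitude (border pixels get magnitude 0)."""
    rows, cols = len(image), len(image[0])
    bins = [[0] * cols for _ in range(rows)]
    mags = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = image[r][c + 1] - image[r][c - 1]
            gy = image[r + 1][c] - image[r - 1][c]
            mags[r][c] = math.hypot(gx, gy)
            angle = (math.degrees(math.atan2(gy, gx)) + 90.0) % 180.0
            bins[r][c] = int(round(angle / (180.0 / n_bins))) % n_bins
    return bins, mags

def patch_histograms(image, patch=3, n_bins=8):
    """Unnormalized edge histogram for each non-overlapping patch."""
    bins, mags = orientation_bins(image, n_bins)
    n_pr, n_pc = len(image) // patch, len(image[0]) // patch
    hists = [[[0.0] * n_bins for _ in range(n_pc)] for _ in range(n_pr)]
    for r in range(n_pr * patch):
        for c in range(n_pc * patch):
            hists[r // patch][c // patch][bins[r][c]] += mags[r][c]
    return hists

def block_features(image, patch=3, block=3, n_bins=8):
    """Concatenate patch histograms within each block, normalize jointly, then stack blocks."""
    hists = patch_histograms(image, patch, n_bins)
    n_pr, n_pc = len(hists), len(hists[0])
    features = []
    for br in range(n_pr - block + 1):          # slide the block window one patch at a time
        for bc in range(n_pc - block + 1):
            vec = [h for r in range(br, br + block)
                     for c in range(bc, bc + block)
                     for h in hists[r][c]]
            length = math.sqrt(sum(v * v for v in vec))
            features.extend(v / length if length > 0 else 0.0 for v in vec)
    return features

# Toy 12x12 image with a horizontal edge through the middle.
image = [[0.0] * 12 for _ in range(6)] + [[1.0] * 12 for _ in range(6)]
features = block_features(image)
print(len(features))  # 4 blocks x 9 patches x 8 bins = 288
```

Unlike a single whole-image histogram, the resulting vector records roughly where each orientation occurs, at the cost of a much longer feature vector.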

<figure>
  <img src= "../../mlrefined_images/superlearn_images/HOG.png" width="80%" height="60%" alt=""/>
  <figcaption>   
<strong>Figure 5:</strong> 
<em> 
A pictorial representation
of the sort of generalized edge histogram feature transformation commonly
used for object detection. An input image is broken down into small
(here $9\times9$) blocks, and an edge histogram is computed on each
of the smaller non-overlapping (here $3\times3$) patches that make
up the block. The resulting histograms are then concatenated and normalized
jointly, producing a feature vector for the entire block. Concatenating
such block features by scanning the block window over the entire image
gives the final feature vector. 
</em>  
</figcaption> 
</figure>

To give a sense of just how much histogram-based features improve our ability to detect visual objects, we now show the results of a simple experiment on a large face detection dataset. This data consists of $3,000$ cropped $28\times28$ (or dimension $N=784$) images of faces (taken from \cite{angelova2005pruning}) and $7,000$ equal sized non-face images (taken from various images not containing faces), a sample of which is shown in Figure \ref{fig:face-noface-examples}.

<figure>
  <img src= "../../mlrefined_images/superlearn_images/face_detection_data.png" width="80%" height="60%" alt=""/>
  <figcaption>   
<strong>Figure 6:</strong> 
<em> 
Example images taken
from a large face detection dataset of (left panel) $3,000$ facial
and (right panel) $7,000$ non-facial images (see text for further
details). The facial images shown in this Figure are taken from \cite{angelova2005pruning}.
</em>  
</figcaption> 
</figure>

We then compare the classification accuracy of the softmax classifier on this large training set of data using **a)** raw pixels and **b)** a popular histogram-based feature known as the Histogram of Oriented Gradients (HoG) \cite{dalal2005histograms}. HoG features were extracted using the VLFeat software library \cite{vedaldi2010vlfeat}, providing a corresponding feature vector for each image in the dataset (of length $N=496$). In Figure \ref{fig:face-pixels-vs-HoG-results} we show the resulting number of misclassifications per iteration of Newton's method applied to the raw pixel (black) and HoG feature (magenta) versions of the data. While the raw images are not linearly separable, with over $300$ misclassifications upon convergence of Newton's method, the HoG feature version of the data is perfectly separable by a hyperplane and presents zero misclassifications upon convergence.

<figure>
  <img src= "../../mlrefined_images/superlearn_images/raw_vs_hog.png" width="80%" height="60%" alt=""/>
  <figcaption>   
<strong>Figure 7:</strong> 
<em> 
An experiment comparing the classification efficacy of raw pixel versus histogram-based features for a large training set of face detection data (see text for further details). Employing the softmax classifier, the number of misclassifications per iteration of Newton's method is shown for both raw pixel data (in black) and histogram-based features (in magenta). While the raw data itself has overlapping classes, with a large number of misclassifications upon convergence of Newton's method, the histogram-based feature representation of the data is perfectly linearly separable with zero misclassifications upon convergence.
</em>  
</figcaption> 
</figure>

## 8.4.3 Histogram features for audio data 

As with images, raw audio signals are not discriminative enough to be used for audio-based classification tasks (e.g., speech recognition), and once again properly designed histogram-based features are used. In the case of an audio signal it is the histogram of its frequencies, otherwise known as its spectrum, that provides a robust summary of its contents. As illustrated pictorially in Figure \ref{fig:speech-histogram}, the spectrum of an audio signal counts up (in histogram fashion) the strength of each level of its frequency or oscillation. This is done by decomposing the audio signal over a basis of sine waves of ever increasing frequency, with the weight on each sinusoid representing the amount of that frequency in the original signal. Each oscillation level is analogous to an edge direction in the case of an image, or an individual word in the case of a BoW text feature.
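The decomposition described above is, in its discrete form, the discrete Fourier transform. The sketch below computes a magnitude spectrum directly from the definition; it is meant only as an illustration (the synthetic two-tone signal and the function name `magnitude_spectrum` are assumptions), and in practice a fast Fourier transform routine would be used instead of this quadratic-time loop.

```python
import cmath
import math

def magnitude_spectrum(signal):
    """Strength of each frequency 0, 1, ..., N/2 (in cycles per signal length)."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        # Correlate the signal with a complex sinusoid of frequency k.
        coeff = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(coeff) / n)
    return spectrum

# Synthetic signal: a strong tone at frequency 4 plus a weaker tone at frequency 10.
n = 64
signal = [1.0 * math.sin(2 * math.pi * 4 * t / n)
          + 0.5 * math.sin(2 * math.pi * 10 * t / n) for t in range(n)]

spectrum = magnitude_spectrum(signal)
for k, strength in enumerate(spectrum):
    if strength > 1e-6:
        print(k, round(strength, 3))
```

Only two bins are nonzero, at frequencies 4 and 10, with the first twice as strong as the second, mirroring the amplitudes of the two sinusoids that make up the signal.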

<figure>
  <img src= "../../mlrefined_images/superlearn_images/spectrum.png" width="80%" height="60%" alt=""/>
  <figcaption>   
<strong>Figure 8:</strong> 
<em> 
A pictorial representation of an audio signal and its representation as a frequency histogram or spectrum. (left panel) A figurative audio signal can be decomposed as a linear combination of simple sinusoids with varying frequencies (or oscillations). (right panel) The frequency histogram then contains the strength of each sinusoid in the representation of the audio signal. </em>  
</figcaption> 
</figure>

In Example \ref{example-features-for-object-detection} we discussed how edge histograms computed on overlapping blocks of an image provide a useful feature transformation for object detection since they preserve characteristic local information. Likewise, computing frequency histograms over overlapping windows of an audio signal (forming a 'spectrogram' as illustrated pictorially in Figure \ref{fig:spectrogram}) produces a feature vector that preserves important local information as well, and is a common feature transformation used for speech recognition. Further processing of the windowed histograms, in order to e.g., emphasize the frequencies of sound best recognized by the human ear, is also commonly performed in practical implementations of this sort of feature transformation \cite{huang2001spoken,rabiner1993fundamentals}.
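A windowed version of the frequency histogram can be sketched in the same spirit. The window length of $32$ samples, the hop of $16$ samples (so that consecutive windows overlap by half), and the synthetic signal whose pitch changes midway are all assumptions chosen for illustration; practical systems typically also taper each window and apply the further processing mentioned above.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Magnitude of each DFT coefficient of a frame, for frequencies 0..N/2."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) / n
            for k in range(n // 2 + 1)]

def spectrogram(signal, window=32, hop=16):
    """Frequency histograms of overlapping windows, one list per window."""
    frames = []
    for start in range(0, len(signal) - window + 1, hop):
        frames.append(magnitude_spectrum(signal[start:start + window]))
    return frames

# Synthetic signal: a low tone for the first 64 samples, then a higher tone.
window = 32
signal = ([math.sin(2 * math.pi * 4 * t / window) for t in range(64)]
          + [math.sin(2 * math.pi * 8 * t / window) for t in range(64)])

frames = spectrogram(signal, window=window, hop=16)
# Index of the strongest frequency in each window.
dominant = [max(range(len(f)), key=f.__getitem__) for f in frames]
print(len(frames), dominant)
```

The dominant frequency moves from bin 4 in the early windows to bin 8 in the late ones, a change over time that a single spectrum of the whole signal could not localize.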

<figure>
  <img src= "../../mlrefined_images/superlearn_images/mfcc.png" width="80%" height="60%" alt=""/>
  <figcaption>   
<strong>Figure 9:</strong> 
<em> 
A pictorial representation of histogram-based features for audio data. The original speech signal (shown on the left) is broken up into small (overlapping) windows whose frequency histograms are computed and stacked vertically to produce a 'spectrogram' (shown on the right). Classification tasks like speech recognition are then performed using this feature representation, or a further refinement of it (see text for further details).
</em>  
</figcaption> 
</figure>