\chapter{Introduction}
\begin{quote}
Because it's there.
\end{quote}
\hfill --- George Mallory, on why he wanted to climb Mount Everest.\\
\vspace{0.5in}
{\em Why human pose estimation? }
The idea of an intelligent robot performing a variety of tasks, extraordinary
and mundane, up to and exceeding human performance, has captured the hearts and
minds of people since at least the European Renaissance. A key feature of much
of this romantic vision is that robots can interact with {\em us}---working
with, around and for humans. An understanding of human pose is a crucial
component to making this compelling dream become a reality.
In the more practical and not-too-distant future, understanding human pose from
images has enormous potential to help in many computer vision tasks: semantic
indexing of images and video~\citep{posesearch}, action
recognition~\citep{pose-action11}, human-object interaction~\citep{bangpeng12},
and scene understanding~\citep{gupta11}, to name a few.
The problem of human pose understanding is also interesting in its own right.
It defines part of the boundary of what can and cannot be accomplished by
artificially intelligent systems. Infants and even other species can
understand human pose---why can't a computer?
Pose estimation subsumes one of the holy grails of computer vision: general
object recognition. It serves as a useful vehicle to demonstrate computer vision
techniques that can be used in other subfields. Humans can be considered a
collection of related objects (body parts), or a single, highly deformable
object. The parts themselves are some of the most difficult to detect in the
literature. Typical objects that researchers work on recognizing---faces,
bicycles or even potted plants \citep{voc09}---have distinguishing features,
reliable patterns and limited intra-class variability. A body part such as a
lower arm, on the other hand, is far more generic. It has a generic shape---
at best it can be described as a projection of a cylinder or frustum---and is
subject to much higher intra-class variability due to clothing, articulated
pose, body type, and severe foreshortening. Features developed must be
invariant to pose, lighting, texture and color and still discriminate parts
from clutter, or efficient search procedures over these variations need to be
developed. These types of techniques are valuable for computer vision in
general.
Human pose estimation is also one of the most computationally demanding
problems in computer vision, as the set of possible outputs is combinatorial in
the number of parts. It can be posed as a graph assignment or graphical model
inference problem with an enormous set of possible labels (for each part,
determine which pixel it is associated with). This makes it an interesting
testbed for advancements in graphical model and matching algorithms for and
beyond computer vision---in language understanding, computational biology,
statistics and physics.
Finally, the problem of pose estimation is timely. In the computer vision
community, statistical machine learning tools and supervised datasets give us
principled protocols to learn effective recognition models which didn't exist a
decade ago. Robots, cameras, and automated systems are more and more pervasive
in everyday life. The demand for reliable pose estimation is already strong in
the entertainment and defense industries.
For all these reasons---practical applications, the dream of artificial
intelligence, the general applicability to vision and machine learning, the
convergence of technology to make it all possible---human pose estimation is an
excellent problem upon which to focus.
\section{Problem Statement}
Here we formalize our problem definition in terms of input, output and
computational requirements as follows:
\begin{problem}[2D human upper-body pose estimation]
\label{prob:pose}
\hspace*{\fill}
\begin{description}
\item[\textbf{Input:}] A single RGB image or RGB video sequence containing the
rough location and scale of a person in every frame, with no additional
information.
\item[\textbf{Output:}] Line segments describing the major anatomical parts
\{left and right upper arms, left and right lower arms, torso, head\} in pixel
coordinates.
\item[\textbf{Requirements:}] Computation time and space polynomial in the
number of input pixels and number of output parts.
\end{description}
\end{problem}
\begin{figure}[tb]
\begin{center}
\includegraphics[width=0.95\textwidth]{figs/problem-statement.pdf}
\caption[Statement of problem.]{An example illustrating the pose estimation
problem, formalized in~\probref{pose}.}
\label{fig:pose-problem}
\end{center}
\end{figure}
Importantly, we concern ourselves only with 2D (two dimensional) pixel array
input. This makes the task much more challenging than when using additional
sensors, such as in Microsoft's Kinect capture system~\citep{kinect} where
depth information and hence reliable knowledge of the background can be used.
However, our limited-sensor problem also means it can be applied in more
general settings: we can apply such pose estimation methods outdoors and on the
wealth of archival images and footage already stored on personal computers,
libraries, and photo and video sharing web sites.
Furthermore, we do not assume any additional information, such as knowledge of
the foreground, background, clothing, lighting, indoor versus outdoor,
etcetera. All these factors work to confound estimation by introducing
appearance artifacts. We refer to our general setting as pose estimation {\em
in the wild}, to stress the fact that the datasets we consider are from
unconstrained foreground and background settings (or nearly unconstrained,
when dealing with TV shows).
Also of note, we only consider the upper body, although all methods and models
discussed in this work can be extended to full body processing (\ie including
hips and upper and lower legs). In fact, most of the models and tools
developed in this work can be applied to other articulated objects, and in
general, to other domains that involve estimating the instantiation of
interacting parts (\eg, handwriting recognition, or gene sequencing). We focus on upper
body human pose in this work because (1) most interesting pose variation occurs
in the upper body, (2) there is a vast amount of data of people's upper bodies
from TV shows, movies and images where lower halves are not visible, and (3)
there is little extra knowledge to be learned about pose estimation by
including the lower body parts, while doing so increases the computation time
of all models at least linearly.
Finally, we restrict ourselves to polynomial running time. The space of all
possible poses is exponential in the number of parts. The ideal approach, if
computation were not an issue, would be to enumerate all possible poses and
score them all using any scoring function of arbitrary complexity. However,
this is simply not feasible, and we are forced to make conditional independence
assumptions between certain parts to achieve tractability. In practice, we
wish to estimate pose on the order of a few minutes or seconds per frame.
\section{Intrinsic difficulties}
Human pose estimation in the wild is an extremely challenging problem. It
shares all of the difficulties of object detection, such as confounding
background clutter, lighting, viewpoint, and scale. In addition, there are
significant difficulties unique to human poses. We are forced to reason over
an enormous number of plausible poses for each image, making this a very
computationally demanding problem. In this section we go over the intrinsic
difficulties of this problem, both from perceptual and computational
standpoints.
\subsection{Perceptual issues}\label{sec:perceptual}
\begin{figure}[tb]
\begin{center}
\includegraphics[width=1.05\textwidth]{figs/perceptual-issues.pdf}
\caption[Perceptual difficulties in pose estimation]{Some of the perceptual
challenges in human pose estimation. Large variations in lighting, pose,
viewpoint, foreshortening, relative scale and clutter all work to confound pose
estimation. See~\secref{perceptual}.}
\label{fig:perceptual-issues}
\end{center}
\end{figure}
One of the primary difficulties in human pose estimation is that the appearance of
pose is largely unconstrained, making it highly variable with multiple
appearance modes. The following issues are illustrated
in~\figref{perceptual-issues}.
\mypar{Lighting:} Images of pose can be taken indoors or outdoors, making not
only the mean intensity of the image variable (signal bias), but also contrast
(signal gain). This issue is well studied in computer vision, and to an extent,
features have been developed to be invariant to lighting, \eg, HoG~\citep{hog},
but these require harsh quantization and local normalization of edge energy
information.
\mypar{Viewpoint and pose: } Humans can look very different depending on where
they are with respect to the imaging plane. The global ``twist'' (rotation
about the length-of-body axis) which determines the degree of frontal versus
profile stance of the person can be somewhat mitigated by coarse person
detectors~\citep{andriluka2010}. However, the body can also go through radical
appearance changes due to the articulation of the limbs, forcing practical
systems to decompose the modeling into the most basic, articulation-invariant
components as atomic units: limbs and joints.
\mypar{Relative scale:} We assume our input is a detected person at a rough
global scale. However, we still have a large variation in the scale of parts
in two different ways: In any particular person, the ratio of limb lengths may
not be consistent; \eg a baby's proportions are very different from an adult's.
Across people, there are also large differences in the geometries of parts,
based on gender, body type (fat, skinny, muscular), and age. These further
contribute to the variability in appearance.
\mypar{2D projection:} The fact that we are working with images that are
projections of the real world leads to further difficulties. Foreshortening
makes estimating the length of the limb in 2D coordinates even more difficult,
and changes the appearance. Self-occlusions and foreground occlusions make a
part invisible and are very hard to determine without further scene or depth
knowledge. Finally, it is inherently ambiguous to map from 2D pose to 3D real
world coordinates (even up to an unknown global scale factor), discussed
further in~\secref{limitations}. This makes it difficult to model priors on
arm length, as we are forced to measure and reason about lengths on the 2D
pixel grid.
\mypar{Clothing:} Clothing contributes a near-infinite space of foreground
variation. Not only is clothing responsible for foreground clutter, it also
can be considered an occluder which hides parts (\eg baggy clothing, skirts,
ponchos) and can break assumptions about left-right appearance symmetry (\eg an
asymmetric shirt).
\mypar{Background clutter:} Background clutter accounts for roughly half the
errors in pose estimation performance. Often it is extremely difficult to
separate edges in the background from lower arms. Lower arms appear as little
more than a pair of roughly parallel lines in an image, as do many man-made
structures and natural objects in backgrounds: walls, tables, chairs, posts,
trees, etcetera.
\begin{figure}[tb]
\begin{center}
\includegraphics[width=1.05\textwidth]{figs/dataset-multimodal.pdf}
\caption[Variations in appearance]{Some illustrations of variation in
appearance in the PASCAL Stickmen dataset. (a) An average of the dataset in
grayscale. (b) Average of Sobel edges over dataset. (c) Polar histogram of
the inner angle made between upper and lower arm, with examples for
$0^\circ,45^\circ,90^\circ,135^\circ,180^\circ,225^\circ,270^\circ$ and
$315^\circ$. (d) A random sampling of 100 left elbows from the Buffy Stickmen
pose dataset, removing color and intensity bias, to illustrate the huge variety
of appearance due to clutter, motion blur, clothing, body type, and pose.
\label{fig:dataset-multimodal}}
\end{center}
\end{figure}
\subsection{Computational issues}
To operationalize \probref{pose}, let $x$ be the input image pixels, and $y$ be
a representation of the output predicted pose. Then a general solution
to~\probref{pose} would take the form of a {\em scoring function} $s(x,y)$
which evaluates the quality of any estimated pose $y$ in the image $x$. We can
define the ``best'' pose as the highest scoring: $y^\star = \argmax_y s(x,y)$
(or $y^\star = \arg\sup_y s(x,y)$ if $y$ is infinite dimensional, \ie
continuous). Then~\probref{pose} is satisfied if this determination of the
maximizer can be done in polynomial time. There are two sources of intrinsic
computational complexity within this framework.
\mypar{Complexity of the input:} For the reasons outlined
in~\secref{perceptual}, there are an astronomical number of different inputs
$x$ that can map to the same true pose $y$---the same layout of body parts can
look very different from image to image. The problem is inherently multimodal,
in the sense that radically different appearances are equally valid input
representations of any particular pose. For a few different illustrations of
the variability of a dataset, see~\figref{dataset-multimodal}.
To deal with this complex problem, we are forced to either design features that
are invariant to the multimodality (\eg, a generic patch-based arm detector
based on coarse edges, or geometric features based on relative part coordinate
systems), or to partition the space and model multiple modes separately. In
the case of the latter approach, we are faced with other difficult decisions
regarding model complexity: how to define modes and a notion of locality, and
what the right trade-off is between the richness of the model and the error in
fitting the model at different modes with a finite amount of training data.
\mypar{Complexity of the output:}
The enormous combinatorial space of possible output poses is a second source of
computational complexity. A typical discretization of the state space of human
poses is an $80 \times 80$ spatial grid of part locations at $24$ possible
angles~\citep{felz05}, resulting in an output space that is roughly $150000$
possibilities for each part, and thus $150000^6 \approx 10^{31}$ for the joint
output space of all $6$ upper body parts~\footnote{In the case of continuous
spaces, the output space is infinite (infinitely precise), and to maintain
tractability the form of $s(x,y)$ is typically analytical with a closed-form
maximum, mean and/or mode, or approximate sampling techniques are used for
inference. See \secref{rel}.}.
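
As a quick check of this arithmetic, using the exact grid size rather than the
rounded figure:
\[
80 \times 80 \times 24 = 153600 \approx 1.5 \times 10^{5},
\qquad
153600^{6} \approx 1.3 \times 10^{31},
\]
so the joint space of the six parts has on the order of $10^{31}$
configurations.
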
Enumerating all possibilities for a joint configuration of all parts is clearly
not feasible. At the other extreme, we could ignore part interactions, and
estimate the pose of each part separately---a task which instead has $6 \times
150000$ possibilities, which is computationally very cheap with modern
computers. However, individual part detection is extremely difficult for body
parts~\citep{andriluka09} due to the wide range of appearances and lack of
discriminating features.
Between the two extremes of (1) estimating parts in isolation and (2)
enumerating all possible joint pose configurations, there lies a family of
models $s(x,y)$ that consider {\em some} part interactions, but not all. The
simplest of these is a {\em first-order}, or {\em pairwise model}, which looks
at pairs of part interactions at a time, and the graph of part interactions
forms a tree structure. This compromise between a full model of every part and
a decoupled model of independent parts will be the basic model building block
throughout this work.
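
To make this concrete, one way to write such a tree-structured pairwise score
is sketched below. The notation is an assumption of this sketch rather than
the definition used later in this work: $y_i$ denotes the state (location and
angle) of part $i$, $V$ is the set of parts, and $E$ is the set of edges of a
tree over $V$.
\[
s(x,y) \;=\; \sum_{i \in V} s_i(x, y_i) \;+\; \sum_{(i,j) \in E} s_{ij}(x, y_i, y_j).
\]
Dropping the second sum recovers independent part detection, while adding
edges beyond a tree captures more interactions at the price of harder
inference.
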
In such a pairwise model, the basic bottleneck operation is to evaluate the
quality of a pair of parts at a time. The model combines all such pairwise
scores together to determine the optimal global pose. This scoring requires
considering $150000^2 \approx 2 \times 10^{10}$ (over 20 billion) possibilities
for a pair of parts in
our example $80 \times 80 \times 24$ state space, which is large but just small
enough for modern machines to handle with some additional model restrictions,
detailed in~\secref{ps}.
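
The reason a tree of pairwise terms stays tractable can be seen from the
standard max-sum dynamic program, sketched here under assumed notation: $s_i$
and $s_{ij}$ are the per-part and per-pair scores, $\mathrm{ch}(j)$ is the set
of children of part $j$ in the tree, and $k$ is the number of states per part.
Each part $j$ sends its parent $i$ the message
\[
m_{j \to i}(y_i) \;=\; \max_{y_j} \Big[\, s_j(x, y_j) + s_{ij}(x, y_i, y_j)
  + \sum_{c \in \mathrm{ch}(j)} m_{c \to j}(y_j) \Big],
\]
and the best pose score is
$\max_{y_r} \big[ s_r(x, y_r) + \sum_{c \in \mathrm{ch}(r)} m_{c \to r}(y_r) \big]$
at the root $r$. Each message costs $O(k^2)$ evaluations, so a tree over $n$
parts costs $O(n k^2)$ in total. For $n = 6$ and $k \approx 150000$ this gives
$5 \times 150000^2 \approx 1.1 \times 10^{11}$ pairwise evaluations, compared
with $150000^6$ for exhaustive enumeration.
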
\section{Contributions of this thesis}
\begin{figure}[t!]
\begin{center}
\includegraphics[width=0.99\textwidth]{figs/puzzle.pdf}
\caption[Puzzle analogy of pose estimation.]{Puzzle analogy of pose estimation.
Classical approaches only use individual part detectors and geometric
plausibility to determine the pose of a person. This is analogous to
attempting to put together a puzzle without looking at appearance of the pieces
---only the plausibility of them fitting together. On the other hand, models
with data-dependent interactions are analogous to using the appearance of the
puzzle piece faces as well as their fit when constructing the puzzle. Even for
humans, it is easier to spot the correct pair of upper/lower arms when they are
examined jointly.}
\label{fig:puzzle}
\end{center}
\end{figure}
Due to the computational issues discussed above, previous work in pose
estimation has resorted to a model of pose that considers, at most, pairwise
interactions between parts, in a specially restricted form: the network of
part-pair interactions is described by a tree structured graph, and the
interactions are described by simple kinematic consistency. This is known as
the basic {\em pictorial structure} (PS) model, also referred to as a {\em
spring model}---see \secref{ps}, and \figref{spring-model-intro}.
In PS models, each part has an individual score for placement at any location
in the image, which must be balanced with part-pair penalties for deforming
from the default model positions (``pose prior''). For example, the
deformation penalty between an upper arm and lower arm expresses the fact that
they should roughly agree on where the elbow is. The deformation penalties can
be thought of as springs at rest at default positions and stretched by moving
the parts to new positions. The parts themselves are attracted individually to
likely spots in the image. Thus a balance is sought between individual part
beliefs and respecting the prior notion of what a pose should look like.
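
As an illustration of the spring analogy, a common form for the deformation
penalty is a quadratic in the displacement between connected parts. This
particular form is an assumption of this sketch; see \secref{ps} for the model
itself. Here $V$ is the set of parts, $E$ the set of tree edges, $d(y_i, y_j)$
the displacement between parts $i$ and $j$, $\mu_{ij}$ its rest value, and
$w_{ij} \ge 0$ a spring stiffness.
\[
s(x,y) \;=\; \sum_{i \in V} s_i(x, y_i)
  \;-\; \sum_{(i,j) \in E} w_{ij}\, \big\| d(y_i, y_j) - \mu_{ij} \big\|^2 .
\]
In this form the image $x$ enters only through the per-part terms.
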
\begin{figure}[tb]
\begin{center}
\includegraphics[width=0.99\textwidth]{figs/spring-model.pdf}
\caption[Spring model of pose.]{Spring model of pose. At left, the original
spring pictorial structure model that appeared in \citet{fischler1973ps}. At
right, the standard PS model for 2D human pose. The states are shown as unit
vectors indicating the position of joints and their direction. The mean
displacements between joints are shown as solid black circles, connected by
solid black lines to show the kinematic tree structure. The displacements from
mean positions are shown as stretched springs. This figure is repeated
for convenience in \secref{ps}.}
\label{fig:spring-model-intro}
\end{center}
\end{figure}
An important property of PS models is that the pairwise spring-stretch
terms are {\em blind to the image content}. The problem with this is that
individual part detector scores are extremely weak (\secref{perceptual}): they
must work in isolation and generalize to limbs in all settings of backgrounds,
foregrounds, articulation, and environment.
As a result, the above model is effectively trying to piece together parts of a
person, when the parts themselves are extremely ambiguous. As an extreme
analogy of this, it is similar to attempting to put together a jigsaw puzzle in
which the pieces themselves are very generic. One has many plausible
candidates for where each puzzle piece should go individually (like individual
part scores), but to determine whether two pieces are adjacent, one can only
see how well they fit together, and {\em not } the image content on their
faces; see \figreff{puzzle}{top row}.
\subsection{Image-dependent interactions in a tree structured
model,~\secref{CPS}}
\label{sec:contrib1}
As an improvement over the basic PS model, we wish to actually {\em exploit }
image content when modeling pairwise interactions. In the puzzle analogy, this
would allow us to fit pieces together based on their color similarity and
continuous contours across the connection boundary; see \figreff{puzzle}{bottom
row}. We exploit these same cues for determining whether limbs go together in
an image, as well as additional cues such as region support and multimodal
descriptions of geometry.
Unfortunately, this turns out to be computationally infeasible using standard
tools and techniques. In light of this, we propose a {\em cascade} of models
to focus computation on pose possibilities that are more promising. There are
many pose possibilities that are easy to reject as incorrect with a simple
model (like a basic PS model, or even simpler), and we are then able to freely
apply a richer model on the possibilities that remain. This is illustrated
in~\figref{cps-overview}.
To employ a cascade approach, we develop and analyze {\em structured prediction
cascades}, and apply them to the problem of 2D pose estimation. Importantly, we
provide a novel training objective for the cascade so that parameters of the
models are learned specifically to filter out a significant proportion of
possibilities at every cascade level.
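
One way to picture the filtering step is through max-marginals and a
data-dependent threshold, in the style of structured prediction
cascades~\citep{cascades}. The notation is an assumption of this sketch:
$m_i(x, y_i)$ is the score of the best full pose that places part $i$ in state
$y_i$, $s^\star(x) = \max_y s(x,y)$, $\bar{m}(x)$ is the mean of all
max-marginals, and $\alpha \in [0,1]$ controls how aggressively to prune.
\[
m_i(x, y_i) \;=\; \max_{y' \,:\, y'_i = y_i} s(x, y'),
\qquad
t(x) \;=\; \alpha\, s^\star(x) + (1 - \alpha)\, \bar{m}(x).
\]
A state $y_i$ is discarded when $m_i(x, y_i) < t(x)$. Since
$\bar{m}(x) \le s^\star(x)$, the threshold never exceeds $s^\star(x)$, so the
states of the highest-scoring pose under the current model always survive to
the next level.
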
\subsection{Image-dependent interactions in a general
graph,~\secref{stretchable}}
The cascade approach we proposed in the previous section works for any
tree-structured model. However, in any tree model, we fail to capture
important interactions between parts, both {\em within} a single frame (\eg,
color constancy between left/right symmetric parts to model clothing) and {\em
across} frames when dealing with tracking multiple parts and their interactions
over time in video. Determining the best possible answer $\argmax_y s(x,y)$
over a general cyclic graph of part relationships is known to be
NP-hard~\citep{koller-book}, with exact inference exponential in the number of frames of video.
We provide an approximate approach which decomposes a cyclic model of pose
into a collection of subgraph trees, whose union of edges covers all the
relationships we care to model. This allows us to exploit all the interesting
interaction terms in the original model with efficient inference in each
subtree, thanks to the structure and the use of our cascade approach.
Then, we can exploit all cues used in the tree-based model, and in addition,
cues based on color symmetry across the body, and temporal appearance and
location persistence information. We propose and investigate empirically
different methods of reaching a consensus between the subtrees.
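
The decomposition can be sketched as follows, under assumed notation: the full
score is split into $T$ tree-structured terms $s^{t}$, each containing a
subset of the edges of the cyclic model.
\[
s(x,y) \;=\; \sum_{t=1}^{T} s^{t}(x,y),
\qquad
\max_y \sum_{t=1}^{T} s^{t}(x,y) \;\le\; \sum_{t=1}^{T} \max_{y^{t}} s^{t}(x, y^{t}).
\]
Each maximization on the right is exact and efficient because $s^{t}$ is a
tree. If the per-tree maximizers $y^{t}$ all agree, the bound is tight and
their common value maximizes the full score; when they disagree, some
consensus rule is needed.
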
We evaluate our approach on a new video dataset, the first of its kind in
tracking human pose in the wild without any assumed extra knowledge. We show
our proposed model and approximation scheme are beneficial, beating the
state-of-the-art in pose estimation systems.
\subsection{Multimodal interactions,~\secref{llps}}
\label{sec:contrib3}
The aforementioned models focus attention on increasing the quality of features
and increasing the number of modeled part interactions. The hope is that these
more expressive models do a better job at capturing the inherently multimodal
appearance space of poses, by better separating the true pose configurations
from false alarms. This somewhat addresses the issue of non-linearity in
lower-dimensional feature spaces, \eg, using only edge information.
Complementary to the models described in the previous sections, we propose to
capture this nonlinearity directly with an explicitly multimodal model. Now
the goal is to determine not only the best pose layout, but also which {\em
mode} the pose belongs to.
Each mode need only model a portion of the pose space. Instead of fitting the
parameters of one monolithic model to cover all possible modes, here we learn
separate parameters for each mode, allowing us to learn more precise
descriptions of appearance and geometry, for, \eg, an arms crossed mode, an
arms raised mode, etcetera.
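
In symbols, and with notation assumed for this sketch only, let
$z \in \{1, \dots, K\}$ index the modes and let $s_z$ be the scoring function
with the parameters of mode $z$. Prediction then becomes a joint search over
mode and pose:
\[
(z^\star, y^\star) \;=\; \argmax_{z,\, y} \; s_z(x, y).
\]
For a fixed $z$ the inner maximization over $y$ is an ordinary single-mode
pose estimation problem, so a direct search over modes costs at most $K$ times
as much as a single model.
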
\subsection{Technical summary of models}
\input{contributions-technical}
\subsection{Summary of contributions}
In summary, the work detailed in this thesis contributes the following to the
fields of machine learning and computer vision, and especially their
intersection for human pose estimation:
\begin{itemize}
\item New models of human pose that capture image-dependent interactions.
\item Computational innovations that enable learning and inference in these
models, which are \naively intractable: structured cascades and tree ensemble
methods.
\item A variety of new features and feature types not typically applied to pose
estimation. Some of these are bottom-up type features complementary to the
traditional edge-based cues.
\item State-of-the-art results on the public Buffy and PASCAL Stickmen single
frame datasets, and our introduced MoviePose single frame and VideoPose video
sequence datasets.
\end{itemize}
\subsection{Published work supporting this thesis}
The Cascaded Pictorial Structure model (\secref{CPS}) first appeared in
\citet{sapp2010cascades} and introduced the concept of a coarse-to-fine
cascade that allows one to use arbitrary features efficiently. This extended
the original Structured Prediction Cascades approach from \citet{cascades},
with a journal version under review---\citet{cascades-jmlr}.
Using the cascade approach for an ensemble of models was developed in
\citet{weisssapp10}. This idea was then extended beyond cascade filtering for
prediction in the Ensembles of Stretchable Models framework
(\secref{stretchable}) for pose estimation in video in \citet{sapp2011}. The
complementary non-parametric approach of Local Linear Pictorial Structures
(\secref{llps}) is currently to be submitted---\citet{sapp-llps}.