Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 17 changed files with 759 additions and 79 deletions.
Original file line number | Original file line | Diff line number | Diff line change |
---|---|---|---|
@@ -1 +1,24 @@ | |||
Heyyyy abstract... | Human pose estimation from monocular images is one of the most challenging and | ||
computationally demanding problems in computer vision. Standard models such as | |||
Pictorial Structures consider interactions between kinematically connected | |||
joints or limbs, leading to inference cost that is quadratic in the number of | |||
pixels. As a result, researchers and practitioners have restricted themselves | |||
to simple models which only measure the quality of limb-pair possibilities by | |||
their 2D geometric plausibility. | |||
|
|||
In this talk, we propose novel methods which allow for efficient inference in | |||
richer models with data-dependent interactions. First, we introduce structured | |||
prediction cascades, a structured analog of binary cascaded classifiers, which | |||
learn to focus computational effort where it is needed, filtering out many | |||
states cheaply while ensuring the correct output is unfiltered. Second, we | |||
propose a way to decompose models of human pose with cyclic dependencies into a | |||
collection of tree models, and provide novel methods to impose model agreement. | |||
|
|||
These techniques allow for sparse and efficient inference on the order of | |||
minutes per image or video clip. As a result, we can afford to model pairwise | |||
interaction potentials much more richly with data-dependent features such as | |||
contour continuity, segmentation alignment, color consistency, optical flow and | |||
more. We show empirically that these richer models are worthwhile, obtaining | |||
significantly more accurate pose estimation on popular datasets. | |||
|
|||
|
Binary file not shown.
Original file line number | Original file line | Diff line number | Diff line change |
---|---|---|---|
@@ -0,0 +1 @@ | |||
\section{Future work} |
Original file line number | Original file line | Diff line number | Diff line change |
---|---|---|---|
@@ -0,0 +1,33 @@ | |||
\begin{algorithm} | |||
\caption[max-sum inference]{Max-sum message passing to solve | |||
$\argmax_{y \in \cY} h(x,y) = \argmax_{y \in \cY} \left[ \sum_{i \in \cV} \phi_i + \sum_{ij \in \cE} \phi_{ij} \right]$} | |||
\label{alg:max-inference} | |||
\begin{algorithmic} | |||
|
|||
\REQUIRE $ $ \\ | |||
Factors $\{\phi_i\}, \{\phi_{ij}\}$\\ | |||
Tree graph $G$ with (arbitrary) root node index $r$ and a topological ordering $\pi$ in which every node appears after all of its children, so that $\pi_n = r$. | |||
|
|||
\ENSURE $y^\star = \argmax_{y} h(x,y)$ | |||
\FOR{$i = \pi_1, \pi_2, \ldots, \pi_n $ } | |||
\STATE | |||
$m_i = \phi_i + \sum_{j \in \text{kids}(i)} m_{j \rightarrow i}$ | |||
\IF{$i == r$} \STATE \textbf{break} \ENDIF | |||
\STATE | |||
$p = \text{parent}(i) $ | |||
\STATE | |||
$m_{i \rightarrow p} = \max_{y_i} \left[ \phi_{ip} + m_i \right]$ | |||
\STATE | |||
$a_i = \argmax_{y_i} \left[ \phi_{ip} + m_i \right]$ | |||
\ENDFOR | |||
|
|||
\STATE | |||
$y^\star_r = \argmax_{y_r \in \{1, \ldots, k\}} m_r$ | |||
\FOR{$i = \pi_{n-1},\pi_{n-2},\ldots,\pi_1$} | |||
\STATE | |||
$y^\star_i = a_i\left[y^\star_{\text{parent}(i)}\right]$ | |||
\ENDFOR | |||
|
|||
\end{algorithmic} | |||
\end{algorithm} | |||
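The two passes of the algorithm above can be sketched in Python with NumPy. This is a minimal illustration, not the thesis implementation: the function name `max_sum_tree` and the argument layout (`unary`, `pairwise`, `parent`, `order`) are assumptions made for this example. Each edge table `pairwise[i]` is assumed to be indexed as `[y_i, y_parent]`, and `order` is assumed to list children before parents with the root last.

```python
import numpy as np

def max_sum_tree(unary, pairwise, parent, order):
    """Max-sum decoding on a tree (illustrative sketch, hypothetical API).

    unary[i]    : array of shape (k_i,), the node scores phi_i
    pairwise[i] : array of shape (k_i, k_p), the edge scores phi_{ip}
                  for the edge between node i and p = parent[i]
    parent[i]   : parent index of node i (unused for the root)
    order       : nodes listed children-before-parents, order[-1] is the root
    """
    root = order[-1]
    # m[i] starts as phi_i and accumulates incoming child messages.
    m = [np.asarray(u, dtype=float).copy() for u in unary]
    backptr = {}

    # Upward pass: every node sends one message to its parent.
    for i in order:
        if i == root:
            break
        p = parent[i]
        scores = pairwise[i] + m[i][:, None]    # shape (k_i, k_p)
        m[p] += scores.max(axis=0)              # message m_{i -> p}
        backptr[i] = scores.argmax(axis=0)      # a_i, indexed by y_p

    # Downward pass: decode the root, then follow the back-pointers.
    y = [0] * len(unary)
    y[root] = int(np.argmax(m[root]))
    for i in reversed(order[:-1]):
        y[i] = int(backptr[i][y[parent[i]]])
    return y, float(m[root].max())
```

Because `order` visits children first, `m[i]` already contains all child messages by the time node `i` sends its own message, which mirrors the update for $m_i$ in the pseudocode.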
|
Original file line number | Original file line | Diff line number | Diff line change |
---|---|---|---|
@@ -1,45 +1,28 @@ | |||
Please refer to~\citet{sapp2010cascades}. | \chapter{Introduction} | ||
|
|
||
\chapter{Human pose estimation} | ``Geman quote'' | ||
|
|
||
\chapter{Structured prediction} | why i love what i do: | ||
|
One of the most compelling problems of computer vision is general object | ||
\chapter{Pictorial structures: Pose estimation meets structured prediction} | recognition. The ability for computers or robots to do this is blah | ||
We first summarize the basic pictorial structure model and then |
|
||
describe the inference and learning in the cascaded pictorial structures. | why pose? | ||
%\subsection{Basic PS Model} | it's super hard: Human pose estimation inherits all the difficulties of object | ||
Classical pictorial structures are a class of graphical models where the nodes of the graph represent object parts, and edges between parts encode pairwise geometric relationships. For modeling human pose, the standard PS model decomposes as a tree structure into unary potentials (also referred to as appearance terms) and pairwise terms between pairs of physically connected parts. Figure~\ref{fig:ps} shows a PS model for six upper-body parts, with lower arms connected to upper arms, and upper arms and head connected to torso. In previous work~\cite{devacrf,felz05,ferrari08,posesearch,andriluka09}, the pairwise terms do not depend on data and are hence referred to as a spatial or structural prior. | recognition | |
%\begin{figure}[] | it shows off computation | ||
%\begin{center} |
|
||
%\centerline{\includegraphics[width=0.75\columnwidth]{data/model_parameters2.pdf}} | philosophical question - humans can do it, babies, dogs can do it - why can't a | |
%\caption{Basic upper-body model with part state $l$ and part support rectangle of size $(w,h)$.} | computer | ||
%\label{fig:ps} |
|
||
%\end{center} | \section{Problem Statement} | ||
%% \vskip -0.5in |
|
||
%\end{figure} | defn: 2D means two dimensional | ||
The state of part $i$, denoted as $y_i \in \mathcal{Y}_i$, encodes the joint | \subsection{Related problems} | ||
location of the part in image coordinates and the direction of the limb as a |
|
||
unit vector: $y_i = [y_{ix} \; y_{iy} \; y_{iu} \; y_{iv}]^T$. The state of the | \section{Inherent difficulties} | ||
model is the collection of states of $M$ parts: $p(ys = ys) = p(y_1 = y_1, | \subsection{perceptual} | ||
\ldots, y_M = y_M)$. The size of the state space for each part, | \subsection{computational} | ||
$|\mathcal{Y}_i|$, is the number of possible locations in the image times the |
|
||
number of pre-defined discretized angles. For example, standard PS | \section{PREVIEW OF MY WORK} | ||
implementations typically model the state space of each part in a roughly $100 |
|
||
\times 100$ grid for $y_{ix} \times y_{iy}$, with 24 different possible values |
|
||
of angles, yielding $|\mathcal{Y}_i| = 100 \times 100 \times 24 = 240{,}000$. The | |||
standard PS formulation (see~\cite{felz05}) is usually written in a | |||
log-quadratic form: | |||
\begin{align} | |||
p( ys | x) &\propto \prod_{ij} \exp\left(-\frac{1}{2}\left\|\Sigma_{ij}^{-1/2}(T_{ij}(y_i) - y_j - \mu_{ij})\right\|_2^2\right) \times \prod_{i=1}^M \exp\left(\mu_i^T\phi_i(y_i,x)\right) | |||
\label{eqn:standard_ps} | |||
\end{align} | |||
The parameters of the model are $\mu_i,\mu_{ij}$ and $\Sigma_{ij}$, and $\phi_i(y_i,x)$ are features of the (image) data $x$ at location/angle $y_i$. The affine mapping $T_{ij}$ transforms the part coordinates into a relative reference frame. The PS model can be interpreted as a set of springs at rest in default positions $\mu_{ij}$, and stretched according to tightness $\Sigma^{-1}_{ij}$ and displacement $\phi_{ij}(ys) = T_{ij}(y_i) - y_j$. The unary terms pull the springs toward locations $y_i$ with higher scores $\mu_i^T\phi_i(y_i,x)$ which are more likely to be a location for part $i$. | |||
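Taking logarithms of Equation~\ref{eqn:standard_ps} makes the additive structure of the score explicit. The symbols $\theta_i$ and $\theta_{ij}$ below are notation introduced only for this sketch and do not appear in the original text, and $\Sigma_{ij}$ is assumed to be symmetric positive definite:

```latex
\begin{align*}
\log p(ys | x) &= \sum_{ij} \theta_{ij}(y_i, y_j) + \sum_{i=1}^M \theta_i(y_i, x) + \text{const}, \\
\theta_{ij}(y_i, y_j) &= -\tfrac{1}{2} \left(T_{ij}(y_i) - y_j - \mu_{ij}\right)^T \Sigma_{ij}^{-1} \left(T_{ij}(y_i) - y_j - \mu_{ij}\right), \\
\theta_i(y_i, x) &= \mu_i^T \phi_i(y_i, x).
\end{align*}
```

In this form the model is a sum of unary and pairwise scores over a tree, where the constant collects the normalizer, which does not depend on the part states.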
|
|||
This form of the pairwise potentials allows inference to be performed faster than $O(|\mathcal{Y}_i|^2)$: MAP estimates $\argmax_{ys} p(ys | x)$ can be computed efficiently using a generalized distance transform for max-product message passing in $O(|\mathcal{Y}_i|)$ time. Marginals of the distribution, $p(y_i | x)$, can be computed efficiently using FFT convolution for sum-product message passing in $O(|\mathcal{Y}_i| \log |\mathcal{Y}_i|)$~\cite{felz05}. | |||
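To illustrate why a quadratic pairwise cost admits linear-time message computation, here is a minimal one-dimensional sketch of a generalized distance transform based on the lower envelope of parabolas. It is written in the equivalent min-cost form, $d(p) = \min_q \left[(p - q)^2 + f(q)\right]$, assumes unit variance, a single spatial dimension, and finite costs, and the function name `distance_transform_1d` is chosen for this example only. The full model applies the same idea per dimension after the transform $T_{ij}$.

```python
import math

def distance_transform_1d(f):
    """Return d with d[p] = min over q of (p - q)**2 + f[q], in O(n) time.

    Illustrative sketch: f is a list of finite costs sampled on the
    integer grid 0..n-1.
    """
    n = len(f)
    if n == 0:
        return []
    v = [0] * n            # grid locations of parabolas in the lower envelope
    z = [0.0] * (n + 1)    # boundaries between consecutive envelope parabolas
    k = 0                  # index of the rightmost parabola in the envelope
    z[0], z[1] = -math.inf, math.inf

    # Build the lower envelope by adding one parabola per grid location.
    for q in range(1, n):
        while True:
            p = v[k]
            # Intersection of the parabola rooted at q with the one at p.
            s = ((f[q] + q * q) - (f[p] + p * p)) / (2 * q - 2 * p)
            if s > z[k]:
                break
            k -= 1         # parabola at v[k] is hidden, so drop it
        k += 1
        v[k] = q
        z[k] = s
        z[k + 1] = math.inf

    # Read the envelope back at every grid location.
    d = [0.0] * n
    k = 0
    for p in range(n):
        while z[k + 1] < p:
            k += 1
        q = v[k]
        d[p] = (p - q) ** 2 + f[q]
    return d
```

Each grid location is pushed onto and popped from the envelope at most once, which is where the linear running time comes from. A brute-force evaluation of the same minimum costs $O(n^2)$.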
|
|||
While fast to compute and intuitive from a spring-model perspective, this model has two significant limitations. First, the pairwise costs are unimodal Gaussians, which cannot capture the true multimodal interactions between pairs of body parts. Second, the pairwise terms are only a function of the geometry of the state configuration, and are oblivious to image cues such as appearance similarity or contour continuity of a pair of parts. | |||
|
|||
\section{Inference tricks (DT, conv)} | |||
\section{Issues} | |||
|
|||
\chapter{Thesis contributions} |
Original file line number | Original file line | Diff line number | Diff line change |
---|---|---|---|
@@ -1,3 +1,2 @@ | |||
wa | wa | ||
!pdflatex thesis.tex | !pdflatex thesis.tex && bibtex thesis && bibtex thesis && pdflatex thesis | ||
" # && bibtex thesis && bibtex thesis && pdflatex thesis |