-

bensapp · Apr 12, 2012 · 0f898dd
1 parent e7eb1e0
commit 0f898dd
Showing 17 changed files with 759 additions and 79 deletions.
diff --git a/CPS.tex b/CPS.tex
@@ -1,4 +1,4 @@
-\chapter{Cascaded Pictorial Structures}
+\chapter{Cascaded Pictorial Structures}\label{sec:CPS}
 
 Pictorial structure models~\cite{fischler1973ps} are a popular method for human body pose estimation~\cite{felz05,fergus2005sparse,devacrf,ferrari08,andriluka09}.
 The model is a Conditional Random Field over pose variables that characterizes 
@@ -105,9 +105,9 @@ \subsection*{Structured Prediction Cascades} \label{cascades}
 \begin{figure}[t]
 \begin{center}
 \includegraphics[width=0.75\textwidth]{figs/empty.jpg}
-\caption{Upper right: Detector-based pruning by thresholding (for the lower 
+\caption[SHORT TITLE]{Upper right: Detector-based pruning by thresholding (for 
-right arm) yields many hypotheses far way from the true one. Lower row: The 
+the lower right arm) yields many hypotheses far away from the true one. Lower 
-CPS, however, exploits global information to perform better pruning.}
+row: The CPS, however, exploits global information to perform better pruning.}
 \label{fig:cascade_pruning}
 \end{center}
 \end{figure}

diff --git a/PennDiss.sty b/PennDiss.sty
@@ -214,10 +214,9 @@ Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy\\
 \thispagestyle{empty}%
 \null\vfill
 \begin{center}
-\Large \MYTITLE \\
+\Large \mytitle \\
-\Large COPYRIGHT\\
+\Large \copyright~Copyright by \@author\\
-\@copyrightyear\\
+\@copyrightyear
-\@author\\
 \end{center}
 \vfill\newpage}
 

diff --git a/abstract.tex b/abstract.tex
@@ -1 +1,24 @@
-Heyyyy abstract...
+Human pose estimation from monocular images is one of the most challenging and 
+computationally demanding problems in computer vision. Standard models such as 
+Pictorial Structures consider interactions between kinematically connected 
+joints or limbs, leading to inference cost that is quadratic in the number of 
+pixels. As a result, researchers and practitioners have restricted themselves 
+to simple models which only measure the quality of limb-pair possibilities by 
+their 2D geometric plausibility.
+
+In this thesis, we propose novel methods which allow for efficient inference in 
+richer models with data-dependent interactions. First, we introduce structured 
+prediction cascades, a structured analog of binary cascaded classifiers, which 
+learn to focus computational effort where it is needed, filtering out many 
+states cheaply while ensuring the correct output is unfiltered. Second, we 
+propose a way to decompose models of human pose with cyclic dependencies into a 
+collection of tree models, and provide novel methods to impose model agreement.
+
+These techniques allow for sparse and efficient inference on the order of 
+minutes per image or video clip. As a result, we can afford to model pairwise 
+interaction potentials much more richly with data-dependent features such as 
+contour continuity, segmentation alignment, color consistency, optical flow and 
+more. We show empirically that these richer models are worthwhile, obtaining 
+significantly more accurate pose estimation on popular datasets.
+
+
diff --git a/commands.tex b/commands.tex
@@ -8,7 +8,7 @@
 \newcommand{\LossA}{\mathcal{L}_{\psi}}
 \newcommand{\LossMAX}{\mathcal{L}^{max}_{\psi}}
 \newcommand{\X}{\mathcal{X}}
-\newcommand{\E}{\mathbf{E}}
+\newcommand{\E}{\mathbb{E}}
 \newcommand{\bw}{\mathbf{w}}
 \newcommand{\bft}{\mathbf{f}}
 \newcommand{\bx}{\mathbf{x}}
@@ -20,6 +20,7 @@
 
 \newcommand{\Ind}{\mathbf{1}}
 \newcommand{\argmax}{\mathop{\arg\max}}
+\newcommand{\argmin}{\mathop{\arg\min}}
 
 \newcommand{\Vones}[1]{\ensuremath{\mathbf{1}_{#1}}}
 \newcommand{\eqdef}{\stackrel{\rm def}{=}}
@@ -35,20 +36,37 @@
 \newcommand{\w}{\mathbf{w}}
 \newcommand{\f}{\mathbf{f}}
 
+\newcommand{\naive}{naive\xspace}
 \newcommand{\CPS}{CPS\xspace}
 \newcommand{\LLPS}{LLPS\xspace}
 \newcommand{\LLPSlong}{Local Linear Pictorial Structures\xspace}
 
 % some common mathcals
+\newcommand{\cH}{\mathcal{H}}
+\newcommand{\cC}{\mathcal{C}}
+\newcommand{\cD}{\mathcal{D}}
 \newcommand{\cL}{\mathcal{L}}
+\newcommand{\cX}{\mathcal{X}}
 \newcommand{\cY}{\mathcal{Y}}
 \newcommand{\cR}{\mathcal{R}}
-\newcommand{\cE}{\mathcal{R}}
+\newcommand{\cE}{\mathcal{E}}
-\newcommand{\cV}{\mathcal{R}}
+\newcommand{\cV}{\mathcal{V}}
 
 \newcommand{\reals}{\mathbb{R}}
+\newcommand{\defn}{\triangleq}
 
 \newcommand{\tree}{\Upsilon}
+\newcommand{\attrib}[1]{ \nopagebreak{\raggedleft\footnotesize #1\par}}
+\newcommand{\myquotation}[2]{{\em #1}\\\attrib{#2}}
+
+\newcommand{\secref}[1]{\hyperref[sec:#1]{\textsection\ref{sec:#1}}}
+\newcommand{\equref}[1]{\hyperref[eq:#1]{Equation~\ref{eq:#1}}}
+\newcommand{\algref}[1]{\hyperref[alg:#1]{Algorithm~\ref{alg:#1}}}
+\newcommand{\thmref}[1]{\hyperref[thm:#1]{Theorem~\ref{thm:#1}}}
+\newcommand{\lemref}[1]{\hyperref[lem:#1]{Lemma~\ref{lem:#1}}}
+\newcommand{\tabref}[1]{\hyperref[tab:#1]{Table~\ref{tab:#1}}}
+\newcommand{\figref}[1]{\hyperref[fig:#1]{Figure~\ref{fig:#1}}}
+
 
 \newcommand{\score}[1]{\theta(x,#1)}         % score function
 \newcommand{\scoremax}[0]{\theta^\star(x)}   % argmax score
@@ -70,7 +88,7 @@
 %\renewcommand{\includegraphics}[2]{}
 
 %% usual commands
-\newcommand{\todo}[1]{\textcolor{red}{TODO: #1}}
+\newcommand{\todo}[1]{\textcolor{red}{\\{\bf TODO:} #1 \\}}
 %\newcommand{\todo}[1]{{\bf{TODO: #1}}}
 %\newcommand{\todo}[1]{}
 
@@ -135,6 +153,8 @@
 %\makeatother
 
 
+\renewcommand{\algorithmicrequire}{\textbf{Input:}} 
+\renewcommand{\algorithmicensure}{\textbf{Output:}}
 
 %% specific commands
 \newcommand{\trans}[1]{{#1}^{\ensuremath{\mathsf{T}}}}           % transpose
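
The cross-reference macros added in this hunk prepend a fixed prefix (`sec:`, `eq:`, `alg:`, `thm:`, `lem:`, `tab:`, `fig:`) to their argument, so they only resolve against labels that carry that prefix. A minimal usage sketch follows, assuming `hyperref` and `xspace` are loaded in the preamble. The labels are ones that appear elsewhere in this commit; the sentence itself is made up for illustration.

```latex
% Labels defined in this commit:
%   sec:CPS              (CPS.tex)
%   fig:cascade_pruning  (CPS.tex)
%   alg:max-inference    (inference-alg.tex)
% The prose below is illustrative only.
As described in \secref{CPS}, the cascade prunes states using global
information (\figref{cascade_pruning}); inference in a tree model
can follow \algref{max-inference}.

% \secref{CPS} expands to
%   \hyperref[sec:CPS]{\textsection\ref{sec:CPS}}
```

Note that features.tex in this same commit uses `\label{features}` without the `sec:` prefix, so `\secref{features}` would look up `sec:features` and would not find that label as written.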

diff --git a/ensembles.tex b/ensembles.tex
@@ -1,4 +1,4 @@
-\chapter{Ensembles}
+\chapter{Ensembles} \label{sec:stretchable}
 
 
 \begin{figure}[t!]

diff --git a/features.tex b/features.tex
@@ -1,3 +1,9 @@
+\chapter{Features}\label{features}
+
+\myquotation{Do not call me a computer vision engineer \ldots  I am a perceptual 
+scientist!}{Yiannis Alimonous}
+
+
 The introduced \CPS model allows us to capture appearance, geometry and shape information of parts and pairs of parts in the final level of the cascade---much richer than the standard geometric deformation costs and texture filters of previous PS models~\cite{felz05,devacrf,ferrari08,andriluka09}.  
 %Table~\ref{feat_table} lists all features that we use and will describe in this section.  
 Each part is modeled as a rectangle anchored at the part joint with the major axis defined as the line segment between the joints (see Figure~\ref{fig:ps}).  For training and evaluation, our datasets have been annotated only with this part axis.

diff --git a/figs/empty.jpg b/figs/empty.jpg
diff --git a/figs/empty.jpg0 b/figs/empty.jpg0
diff --git a/future.tex b/future.tex
@@ -0,0 +1 @@
+\section{Future work}
diff --git a/inference-alg.tex b/inference-alg.tex
@@ -0,0 +1,33 @@
+\begin{algorithm} 
+\caption[max-sum inference]{Max-sum message passing to solve 
+$\argmax_{y \in \cY} h(x,y) = \argmax_{y \in \cY} \sum_{i \in \cV} \phi_i + \sum_{ij \in \cE} \phi_{ij}$} 
+\label{alg:max-inference} 
+\begin{algorithmic} 
+
+\REQUIRE $ $ \\ 
+Factors $\{\phi_i\}, \{\phi_{ij}\}$\\
+Tree graph $G$ with (arbitrary) root node index $r$ and topological ordering $\pi$, where $\pi_n = r$.
+
+\ENSURE $y^\star = \argmax_{y} h(x,y)$ 
+\FOR{$i = \pi_1, \pi_2, \ldots, \pi_n $ }
+\STATE
+$m_i = \phi_i + \sum_{j \in \text{kids}(i)} m_{j \rightarrow i}$
+\IF{$i == r$} \STATE \textbf{break} \ENDIF
+\STATE
+$p = \text{parent}(i) $
+\STATE
+$m_{i \rightarrow p} = \max_{y_i} \phi_{ip} + m_i$
+\STATE
+$a_i = \argmax_{y_i} \phi_{ip} + m_i$
+\ENDFOR
+
+\STATE
+$y^\star_r = \argmax_{[1 \ldots k]} m_r$
+\FOR{$i = \pi_{n-1},\pi_{n-2},\ldots,\pi_1$}
+\STATE 
+$y^\star_i = a_i\left[y^\star_{\text{parent}(i)}\right]$
+\ENDFOR
+
+\end{algorithmic} 
+\end{algorithm}
+
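
The max-sum procedure in inference-alg.tex can be sketched in plain Python as below. This is an illustration under stated assumptions, not code from the thesis: the function name `max_sum_tree`, the argument layout, and the toy numbers are all hypothetical. Pairwise tables are assumed to be indexed as `pairwise[(i, p)][y_i][y_p]`, and `order` is assumed to list children before parents with the root last, matching the ordering $\pi$ in the algorithm.

```python
def max_sum_tree(unary, pairwise, parent, order):
    """Max-sum message passing on a tree (illustrative sketch).

    unary[i][y_i]            -- unary score phi_i
    pairwise[(i, p)][y_i][y_p] -- pairwise score phi_ip, p = parent of i
    parent[i]                -- parent index, None for the root
    order                    -- children before parents, root last
    """
    n = len(unary)
    m = [list(u) for u in unary]   # m_i: phi_i plus incoming child messages
    backptr = {}                   # a_i: best y_i for each parent state
    for i in order:
        p = parent[i]
        if p is None:              # reached the root: forward pass is done
            break
        k_i, k_p = len(unary[i]), len(unary[p])
        backptr[i] = [0] * k_p
        for yp in range(k_p):
            scores = [pairwise[(i, p)][yi][yp] + m[i][yi] for yi in range(k_i)]
            best = max(range(k_i), key=scores.__getitem__)
            backptr[i][yp] = best
            m[p][yp] += scores[best]   # add message m_{i -> p} into m_p
    root = order[-1]
    y = [0] * n
    y[root] = max(range(len(m[root])), key=m[root].__getitem__)
    # Backward pass: parents are decoded before their children.
    for i in reversed(order[:-1]):
        y[i] = backptr[i][y[parent[i]]]
    return y, m[root][y[root]]


# Toy star-shaped tree: node 0 is the root, nodes 1 and 2 are its children.
unary = [[0.0, 1.0], [2.0, 0.0], [0.0, 0.5]]
agree = [[1.0, 0.0], [0.0, 1.0]]   # rewards child and parent sharing a state
pairwise = {(1, 0): agree, (2, 0): agree}
parent = [None, 0, 0]
y_star, score = max_sum_tree(unary, pairwise, parent, order=[1, 2, 0])
print(y_star, score)
```

Accumulating each child message into `m[p]` as soon as it is computed is equivalent to summing over `kids(i)` when node `i` is visited, because every child precedes its parent in `order`.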
diff --git a/intro.tex b/intro.tex
@@ -1,45 +1,28 @@
-Please refer to~\citet{sapp2010cascades}.
+\chapter{Introduction}
-
+
-\chapter{Human pose estimation}
+``Geman quote''
-
+
-\chapter{Structured prediction}
+why i love what i do:
-
+One of the most compelling problems of computer vision is general object 
-\chapter{Pictorial structures: Pose estimation meets structured prediction}
+recognition.  The ability for computers or robots to do this is blah
-We first summarize the basic pictorial structure model and then
+
-describe the inference and learning in the cascaded pictorial structures.
+why pose?
-%\subsection{Basic PS Model}
+it's super hard: Human pose estimation inherits all the difficulties of object 
-Classical pictorial structures are a class of graphical models where the nodes of the graph represents object parts, and edges between parts encode pairwise geometric relationships.  For modeling human pose, the standard PS model decomposes as a tree structure into unary potentials (also referred to as appearance terms) and pairwise terms between pairs of physically connected parts.  Figure~\ref{fig:ps} shows a PS model for 6 upper body parts, with lower arms connected to upper arms, and upper arms and head connected to torso.  In previous work~\cite{devacrf,felz05,ferrari08,posesearch,andriluka09}, the pairwise terms do not depend on data and are hence referred to as a spatial or structural prior.
+recognition
-%\begin{figure}[]
+it shows off computation
-%\begin{center}
+
-%\centerline{\includegraphics[width=0.75\columnwidth]{data/model_parameters2.pdf}}
+philosophical question - humans can do it, babies, dogs can do it - why can't a 
-%\caption{Basic upper-body model with part state $l$ and part support rectangle of size $(w,h)$.}
+computer
-%\label{fig:ps}
+
-%\end{center}
+\section{Problem Statement}
-%% \vskip -0.5in
+
-%\end{figure}
+defn: 2D means two dimensional
-The state of part $i$, denoted as $y_i \in \mathcal{Y}_i$, encodes the joint 
+\subsection{Related problems}
-location of the part in image coordinates and the direction of the limb as a 
+
-unit vector: $y_i = [y_{ix} \; y_{iy} \; y_{iu} \; y_{iv}]^T$. The state of the 
+\section{Inherent difficulties}
-model is the collection of states of $M$ parts: $p(ys = ys) = p(y_1 = y_1, 
+\subsection{perceptual}
-\ldots, y_M = y_M)$.  The size of the state space for each part, 
+\subsection{computational}
-$|\mathcal{Y}_i|$, the number of possible locations in the image times the 
+
-number of pre-defined discretized angles. For example, standard PS 
+\section{PREVIEW OF MY WORK}
-implementations typically model the state space of each part in a roughly $100 
+
-\times 100$ grid for $y_{ix} \times y_{iy}$, with 24 different possible values 
+
-of angles, yielding $|\mathcal{Y}_i| = 100 \times 100 \times 24 = 240,000$. The 
-standard PS formulation (see~\cite{felz05}) is usually written in a 
-log-quadratic form:
-\begin{align}
-p( ys | x) &\propto \prod_{ij} \exp(-\frac{1}{2}||\Sigma_{ij}^{-1/2}(T_{ij}(y_i) - y_j - \mu_{ij})||_2^2)  \times \prod_{i=1}^M \exp(\mu_i^T\phi_i(y_i,x))
-\label{eqn:standard_ps}
-\end{align}
-The parameters of the model are $\mu_i,\mu_{ij}$ and $\Sigma_{ij}$, and $\phi_i(y_i,x)$ are features of the (image) data $x$ at location/angle $y_i$.  The affine mapping $T_{ij}$ transforms the part coordinates into a relative reference frame.  The PS model can be interpreted as a set of springs at rest in default positions $\mu_{ij}$, and stretched according to tightness $\Sigma^{-1}_{ij}$ and displacement $\phi_{ij}(ys) = T_{ij}(y_i) - y_j$.  The unary terms pull the springs toward locations $y_i$ with higher scores $\mu_i^T\phi_i(y_i,x)$ which are more likely to be a location for part $i$.
-
-This form of the pairwise potentials allows inference to be performed faster than $O(|\mathcal{Y}_i|^2)$:  MAP estimates $\argmax_{ys} p(ys | x)$ can be computed efficiently using a generalized distance transform for max-product message passing in $O(|\mathcal{Y}_i|)$ time.  Marginals of the distribution, $p(y_i | x)$, can be computed efficiently using FFT convolution for sum-product message passing in $O(|\mathcal{Y}_i| \log |\mathcal{Y}_i|)$~\cite{felz05}.
-
-While fast to compute and intuitive from a spring-model perspective, this model has two significant limitations.  One, the pairwise costs are unimodal Gaussians, which cannot capture the true multimodal interactions between pairs of body parts.  Two, the pairwise terms are only a function of the geometry of the state configuration, and are oblivious to the image cues, for example, appearance similarity or contour continuity of the a pair of parts.
-
-\section{Inference tricks (DT, conv)}
-\section{Issues}
-
-\chapter{Thesis contributions}
diff --git a/make.vim b/make.vim
@@ -1,3 +1,2 @@
 wa
-!pdflatex thesis.tex 
+!pdflatex thesis.tex && bibtex thesis && pdflatex thesis.tex && pdflatex thesis.tex
-" # && bibtex thesis && bibtex thesis && pdflatex thesis