Segmentation using CRFs
References
[1] Multi-Class Segmentation with Relative Location Prior. Gould et al 2008.
[2] TextonBoost: Joint Appearance, Shape and Context Modelling for Multi-Class Object Recognition and Segmentation. Shotton et al 2006.
[3] Graph Cuts in Vision and Graphics: Theories and Applications. Boykov and Veksler 2005.
[4] Contour and Texture Analysis for Image Segmentation. Malik et al 2001.
[5] Histograms of Oriented Gradients for Human Detection. Dalal & Triggs 2005.
[6] Matching words and pictures. Barnard et al 2003.
[7] Contour and Texture Analysis for Image Segmentation. Malik et al 2001.
[8] Object categorization by learned universal visual dictionary. Winn, Criminisi & Minka 2005.
[9] Representing and recognizing the visual appearance of materials using three-dimensional textons. Leung & Malik 2001.
[10] A Statistical approach to texture classification from single images. Varma & Zisserman 2005.
Background
Much work has been done on multi-class image segmentation and labelling (see [1], [2]).
Conditional Random Fields (CRFs) are among the most prominent models in the domain.
Seek to apply modern approaches to segment and label images, with applications to GIAT data.
Approach
Incrementally build towards repeating implementation and results of [1] and [2].
1. Binary segmentation (background/foreground) using color histograms and min-cut algorithm to find MAP label assignment.
2. Build multi-class, pixel classifier for unary potentials within CRF model from MSRC dataset.
a) Implement following features:
Colour histograms
Histogram of oriented gradient (HOG)
Local binary pattern
Texture based feature (textons)
b) Replicate TextonBoost [2] features.
c) Replicate Relative Location Prior [1] features.
Feature generation
Approach is to build a python module of functions, where each function generates a vector of result values.
Where possible, seek to use existing python tools. Scipy (and Numpy) and the scikit-image library will be used. Where OpenCV python is used, beware the BGR: OpenCV loads colour images in BGR channel order, not RGB.
Colour histograms features
Implemented 1D colour histogram for RGB image. Key assumption is that input image is 8-bit, either greyscale or RGB.
Implemented a 3D colour histogram to capture colour correlations in the 256 x 256 x 256 colour space. That is, generate a count of values for colours (0, 0, 0) ... (0, 0, 255) ... (255, 255, 255). Used numpy.histogramdd to generate the multi-dimensional histograms.
In both cases, the number of bins per channel can be set to one of { 2, 4, 8, 16, 32, 64, 128, 256 } using the numberBins parameter.
Note: Need some kind of regularisation on the RGB histogram, either some constant or even smoothing operator over image in the learning phase.
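A minimal sketch of the 3D histogram with numpy.histogramdd follows. The function name and the "smoothing" argument are illustrative assumptions (an additive constant is only one of the regularisation options mentioned in the note above); numberBins follows the parameter name used in these notes.

```python
import numpy as np

def colourHistogram3D(rgbImage, numberBins=8, smoothing=0.0):
    """Joint RGB histogram of an 8-bit image, normalised to sum to 1."""
    # One row per pixel, one column per colour channel.
    pixels = rgbImage.reshape(-1, 3).astype(np.float64)
    # The same number of bins per channel, covering the 8-bit range 0..255.
    hist, _ = np.histogramdd(pixels, bins=numberBins, range=[(0, 256)] * 3)
    # Hypothetical regularisation: add a constant to every bin so none is empty.
    hist += smoothing
    return hist / hist.sum()

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
hist = colourHistogram3D(image, numberBins=8, smoothing=1.0)
```

With numberBins=8 the result is an 8 x 8 x 8 array; flattening it gives a 512-element feature vector.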
Histogram of oriented gradients (HOG) features
HOG was defined by Dalal and Triggs in the 2005 CVPR paper "Histograms of Oriented Gradients for Human Detection" [5]. In the paper, they detail the definition of HOG features and experiments regarding parameter setting. In particular, the paper states:
(1) Gamma/Colour Normalization
"We evaluated several input pixel representations including grayscale, RGB and LAB colour spaces optionally with power law (gamma) equalization."
"We do use colour information when available. RGB and LAB colour spaces give comparable results, but restricting to grayscale reduces performance by 1.5% at 10^-4 FPPW. Square root gamma compression of each colour channel improves performance at low FPPW (by 1% at 10^-4 FPPW) but log compression is too strong and worsens it by 2% at 10^-4 FPPW."
Note: Power law gamma equalisation is defined as newImage = originalImage ^ (gamma) [see http://www.sci.utah.edu/~cscheid/spr05/imageprocessing/project1/]
The scikit-image project implementation [see https://github.com/scikit-image/scikit-image/blob/master/skimage/feature/_hog.py] uses an optional parameter to flag square root equalisation or none.
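As a quick illustration of the power law above, here is a small sketch. Scaling the 8-bit values to [0, 1] before applying the exponent is my assumption, and the function name is hypothetical; gamma = 0.5 gives the square-root compression discussed in the paper.

```python
import numpy as np

def gammaEqualise(image, gamma=0.5):
    """Power-law equalisation: newImage = originalImage ^ gamma."""
    # Assumption: map 8-bit values to [0, 1] first so the output stays in [0, 1].
    scaled = image.astype(np.float64) / 255.0
    return scaled ** gamma

image = np.array([[0, 64], [128, 255]], dtype=np.uint8)
equalised = gammaEqualise(image, gamma=0.5)
```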
(2) Gradient
"Masks tested included various 1-D point derivatives (uncentred [−1, 1], centred [−1, 0, 1] and cubic-corrected [1, −8, 0, 8, −1]) as well as 3×3 Sobel masks and 2x2 diagonal ones. Simple 1-D [−1, 0, 1] masks at σ=0 work best."
"Using uncentred [−1, 1] derivative masks also decreases performance (by 1.5% at 10−4 FPPW),"
The scikit-image HOG implementation uses the numpy.diff function for gradient calculation [see http://docs.scipy.org/doc/numpy/reference/generated/numpy.diff.html]:
"The first order difference is given by out[n] = a[n+1] - a[n] along the given axis"
This corresponds to the uncentred [-1, 1] mask. The numpy documentation examples show the result being one element shorter than the input along the differenced axis (no gradient value is generated for the final edge element).
However, the numpy.gradient function [see http://docs.scipy.org/doc/numpy/reference/generated/numpy.gradient.html#numpy.gradient] uses central differences in the interior and first differences at the boundaries. Here, the example shows that central difference = 0.5 * [-1,0,1] mask.
One idea is to allow user defined function to be used to create gradients from an image.
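The difference between the two numpy functions, and the idea of a pluggable gradient function, can be sketched as below. The wrapper name imageGradients and its gradientFunction argument are hypothetical, not part of scikit-image.

```python
import numpy as np

row = np.array([1.0, 2.0, 4.0, 7.0, 11.0])

# Uncentred [-1, 1] mask: the output is one element shorter than the input.
uncentred = np.diff(row)
# Centred 0.5 * [-1, 0, 1] mask in the interior, one-sided differences at the ends.
centred = np.gradient(row)

def imageGradients(greyImage, gradientFunction=np.gradient):
    """Return (gx, gy) using any function that yields per-axis gradients."""
    # For a 2D array, np.gradient returns the axis-0 (y) result first, then axis-1 (x).
    gy, gx = gradientFunction(greyImage.astype(np.float64))
    return gx, gy

greyImage = np.tile(row, (3, 1))   # three identical rows
gx, gy = imageGradients(greyImage)
```

Here uncentred has 4 elements and centred has 5, which is the size difference noted above.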
(3) Gradient in colour images
"For colour images, we calculate separate gradients for each colour channel, and take the one with the largest norm as the pixel’s gradient vector."
The scikit-image library HOG implementation does NOT handle colour images [see https://github.com/scikit-image/scikit-image/blob/master/skimage/feature/_hog.py].
To proceed with the scikit-image HOG implementation, we have a few choices:
* Convert colour to greyscale, and accept performance hit (easy)
* Try to reconstruct a greyscale image by applying an inverse gradient operation to the maxG_x and maxG_y matrices (non-trivial)
Alternatively, we could refactor scikit-image._HOG to operate on user-specified gradient matrix for the image (e.g. result of a yet-to-be-written "maxGradientsFromColors(rgbImage)" function).
Let's go with the greyscale option for now.
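Although the greyscale route is taken for now, the yet-to-be-written maxGradientsFromColors(rgbImage) function could look something like the sketch below. It follows the Dalal and Triggs rule quoted above (per pixel, keep the gradient of the channel with the largest norm); the use of numpy.gradient (centred differences) is my choice here, not something fixed by these notes.

```python
import numpy as np

def maxGradientsFromColors(rgbImage):
    """Per pixel, return the (gx, gy) of the colour channel with the largest gradient norm."""
    image = rgbImage.astype(np.float64)
    # Gradients along rows (y, axis 0) and columns (x, axis 1), per channel.
    gy, gx = np.gradient(image, axis=(0, 1))
    magnitude = np.hypot(gx, gy)              # shape (height, width, 3)
    strongest = magnitude.argmax(axis=2)      # winning channel index per pixel
    rows, cols = np.indices(strongest.shape)
    return gx[rows, cols, strongest], gy[rows, cols, strongest]

# Red channel ramps steeply in x, green ramps gently in y, blue is flat.
rgbImage = np.zeros((4, 5, 3), dtype=np.uint8)
rgbImage[..., 0] = np.arange(5) * 10
rgbImage[..., 1] = np.arange(4)[:, None] * 2
maxGx, maxGy = maxGradientsFromColors(rgbImage)
```

In this toy image the red channel wins everywhere, so maxGx is the red x-gradient and maxGy is zero.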
(4) AOB?
In summary, the HOG implementation is a function which wraps the scikit-image HOG implementation in logic which handles colour images by converting to 0-255 greyscale by taking the mean of colour values and normalising. Key assumption is that the input images are 8-bit RGB images.
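The greyscale conversion step of that wrapper can be sketched as follows. The function name is hypothetical, and rounding the channel mean back to 8-bit is my reading of "normalising". The result would then be handed to the scikit-image HOG function (skimage.feature.hog), which is not called here.

```python
import numpy as np

def rgbToGreyscale(rgbImage):
    """Mean of the colour channels of an 8-bit RGB image, returned as 8-bit greyscale."""
    if rgbImage.ndim == 2:
        return rgbImage                      # already greyscale, nothing to do
    grey = rgbImage.astype(np.float64).mean(axis=2)
    # Assumption: round to the nearest integer to stay in the 0-255 range.
    return np.rint(grey).astype(np.uint8)

rgbImage = np.array([[[30, 60, 90], [255, 255, 255]],
                     [[0, 0, 1], [10, 20, 30]]], dtype=np.uint8)
greyImage = rgbToGreyscale(rgbImage)
```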
Texture features
Both Shotton [2] and Gould [1] use texture-based features. Shotton's TextonBoost paper [2] defines shape-texture potentials and refers to [6]. Gould's relative location paper [1] defines a set of appearance features based on [7], which act as unary potentials in the CRF model.
Shotton [2] states:
"Textons. Efficiency demands compact representations for the range of different appearances of an object. For this we utilize textons [9 - Representing and Recognising the visual appearance of materials using 3D textons, Malik & Leung 2001] which have proven effective in categorizing materials [10 - A statistical approach to texture classification from single images, Varma & Zisserman 2005] as well as generic object classes [8 - Categorization by learned universal visual dictionary, Winn, Criminisi & Minka, 2005]. A dictionary of textons is learned by convolving a 17-dimensional filter-bank (footnote) with all the training images and running K-means clustering (using Mahalanobis distance) on the filter responses. Finally, each pixel in each image is assigned to the nearest cluster centre, thus providing the texton map.
Footnote: the filterbank used here is identical to that in [8 - Winn, Criminisi, Minka 2005], consisting of scaled Gaussians, x and y derivatives of Gaussians, and Laplacians of Gaussians. The Gaussians are applied to all three colour channels, while the remaining filters only to the luminance. The perceptually uniform CIELab color space is used."
Review of [8] to identify definition and implementation details of textons:
"Textons and texton histograms. Each training image is convolved with a filter-bank to generate a set of filter responses [9, 16]. These filter responses are aggregated over all the images in the entire training set (independently from class labels) and clustered using a K-means approach. Mahalanobis distance between features is used during clustering."
"Filter-banks. In this paper we have tested a number of different filter-banks made of combinations of Gaussians, first and second order derivatives of Gaussians and Gabor kernels. Many filter-banks produced comparable results with the best one made of 3 Gaussians, 4 Laplacian of Gaussians (LoG) and 4 first order derivatives of Gaussians. The three Gaussian kernels (with σ = 1, 2, 4) are applied to each CIE L,a,b channel [7], thus producing 9 filter responses. The four LoGs (with σ = 1, 2, 4, 8) were applied to the L channel only, thus producing 4 filter responses. The four derivatives of Gaussians were divided into the two x− and y−aligned sets, each with two different values of σ (σ = 2, 4). Derivatives of Gaussians were also applied to the L channel only, thus producing 4 final filter responses. Therefore, each pixel in each image has associated a 17−dimensional feature vector. Note that first order derivatives of Gaussian kernels are not rotational invariant."
Therefore the filterbank-channel application grid looks like:
Name      Defn            CIE channel
                          L     a     b
G1        N(0, 1)         yes   yes   yes
G2        N(0, 2)         yes   yes   yes
G3        N(0, 4)         yes   yes   yes
LoG1      lap(N(0, 1))    yes   no    no
LoG2      lap(N(0, 2))    yes   no    no
LoG3      lap(N(0, 4))    yes   no    no
LoG4      lap(N(0, 8))    yes   no    no
Div1xG1   dx(N(0, 2))     yes   no    no
Div1xG2   dx(N(0, 4))     yes   no    no
Div1yG1   dy(N(0, 2))     yes   no    no
Div1yG2   dy(N(0, 4))     yes   no    no
Note use of CIE colour model [see http://en.wikipedia.org/wiki/CIE_1931_color_space] and varied application of filterbank to L, a, b channels. The following stackoverflow post gives useful info [http://stackoverflow.com/questions/13405956/convert-an-image-rgb-lab-with-python] and mentions scikit-image conversion. I use the scikit-image function skimage.color.rgb2lab() to convert images to CIELab.
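The grid above can be written down as a small kernel bank, sketched here with numpy only. The function names, the 13x13 default window and the zero-mean adjustment of the LoG kernels are my assumptions; I read the second argument of N(0, s) as the Gaussian sigma, matching the sigma values quoted from [8].

```python
import numpy as np

def gaussianKernel(sigma, size=13):
    """Normalised 2D Gaussian on a size x size window, plus its coordinate grids."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    g = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return g / g.sum(), x, y

def buildFilterBank(size=13):
    """List of (name, kernel, channels the kernel is applied to)."""
    bank = []
    for sigma in (1, 2, 4):                       # Gaussians on L, a and b
        g, _, _ = gaussianKernel(sigma, size)
        bank.append(("G sigma=%d" % sigma, g, ("L", "a", "b")))
    for sigma in (1, 2, 4, 8):                    # Laplacian of Gaussian on L only
        g, x, y = gaussianKernel(sigma, size)
        log = g * (x ** 2 + y ** 2 - 2.0 * sigma ** 2) / sigma ** 4
        bank.append(("LoG sigma=%d" % sigma, log - log.mean(), ("L",)))
    for sigma in (2, 4):                          # x-derivative of Gaussian on L only
        g, x, _ = gaussianKernel(sigma, size)
        bank.append(("dx sigma=%d" % sigma, -x * g / sigma ** 2, ("L",)))
    for sigma in (2, 4):                          # y-derivative of Gaussian on L only
        g, _, y = gaussianKernel(sigma, size)
        bank.append(("dy sigma=%d" % sigma, -y * g / sigma ** 2, ("L",)))
    return bank

bank = buildFilterBank()
numResponses = sum(len(channels) for _, _, channels in bank)
```

The bank holds 11 kernels and yields 17 responses per pixel (9 + 4 + 4). Each kernel could then be applied to the relevant CIELab channels with a 2D convolution such as scipy.ndimage.convolve.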
Filter window size.
I haven't found a specification for the size of the filter (window size) in [8], [9] or [10]. I did find one paper that states a filter window size of (13x13) - see [http://academia.edu/1146226/Texture_Classification_Using_Three_Circular_Filters].
Also, there is a StackOverflow question regarding window size for a Gaussian filter - see [http://stackoverflow.com/questions/16165666/how-to-determine-the-window-size-of-a-gaussian-filter]. This recommends a window size of 3*Gaussian's sigma, which seems to be accepted wisdom.
As a first approximation, we will adopt (13x13) for the size of the filters in the filter bank.
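Once filter responses exist, the texton dictionary and texton map described in the quotes from [2] and [8] could be sketched as below. This is a plain numpy K-means, not the authors' code: Mahalanobis distance is obtained by whitening the responses with the global covariance and then using Euclidean distance. Random data stands in for real 17-dimensional filter responses, and all names are hypothetical.

```python
import numpy as np

def whiten(responses):
    """Transform so Euclidean distance equals Mahalanobis distance (global covariance)."""
    centred = responses - responses.mean(axis=0)
    cov = np.cov(centred, rowvar=False) + 1e-9 * np.eye(responses.shape[1])
    # inv(cov) = F F^T, so ||(a - b) F||^2 is the squared Mahalanobis distance.
    factor = np.linalg.cholesky(np.linalg.inv(cov))
    return centred @ factor

def nearestCentre(points, centres):
    """Index of the closest centre for every point."""
    d2 = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(points, k, iterations=20, seed=0):
    """Basic K-means; returns the k cluster centres (the texton dictionary)."""
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iterations):
        labels = nearestCentre(points, centres)
        for j in range(k):
            members = points[labels == j]
            if len(members):                 # an empty cluster keeps its old centre
                centres[j] = members.mean(axis=0)
    return centres

height, width, numFilters, numTextons = 20, 20, 17, 5
rng = np.random.default_rng(1)
responses = rng.normal(size=(height * width, numFilters))   # stand-in for filter responses
whitened = whiten(responses)
centres = kmeans(whitened, numTextons)
# Texton map: each pixel is assigned the index of its nearest cluster centre.
textonMap = nearestCentre(whitened, centres).reshape(height, width)
```

In a real run the dictionary would be learned from responses pooled over all training images, and each image would then be mapped with the same whitening and centres.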
PossumStats
Some general stats for the MSRC dataset.
Images by class-pair. Seek to count the number of images containing pixels of type class1 and class2, where these pixels are neighbours.
First, what constitutes a neighbour?
Clearly, a 4-neighbourhood match.
However, if two pixels are diagonal neighbours, should we count this? In this case, we would have an image region like:
... classA   0        ...        ... 0        classA   ...
... 0        classB   ...   or   ... classB   0        ...
Where classes "touch" by a diagonal, I don't think that the classes are proper neighbours. We need an efficient way to test if an image contains a pixel of classA and a pixel of classB that are 4-neighbours.
It is possible to construct binary arrays which represent pixels of a given class, giving us binary "class channel" versions of the source image (pixels having 0 or 1 value).
The difference between two such class channels will be a new image, with pixel values { -1, 0, 1 }. The test we wish to conduct is to see if there are two pixels that are 4-neighbours such that the pixel values are -1 and +1. If we take the gradient (the difference between adjacent pixels) of the difference image, it will flag any of the following pixel configurations:
... -1 ...      ... +1 ...      ...  ...  ...      ...  ...  ...
... +1 ...  or  ... -1 ...  or  ... +1 -1 ...  or  ... -1 +1 ...
In the first two cases, the absolute value of the y-derivative will be 2, similarly for the x-derivative in the last two cases. Therefore, to test if there is at least one 4-neighbour match between classA and classB pixels in an image, we can:
1) Create class channel binary images
2) Take the difference of the binary images
3) Compute the x and y gradients of the image
4) Take absolute value of x and y gradient
5) Check the existence of a gradient value of 2 in either x or y derivative
Note this wouldn't work for 8 neighbours, likely need to work with orientation and magnitude of gradient to check diagonal neighbour condition.
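The five steps above can be sketched as follows. The function name is hypothetical. I use numpy.diff (differences between adjacent pixels) rather than numpy.gradient, since a centred difference skips over the pixel in between and would not give the value 2 for directly adjacent -1 and +1 pixels.

```python
import numpy as np

def classesAre4Neighbours(labels, classA, classB):
    """True if some classA pixel has a classB pixel as a 4-neighbour."""
    # 1) Binary class channels (signed ints so the subtraction can go negative).
    channelA = (labels == classA).astype(np.int8)
    channelB = (labels == classB).astype(np.int8)
    # 2) Difference image with values in {-1, 0, +1}.
    difference = channelA - channelB
    # 3) and 4) Absolute differences between vertically / horizontally adjacent pixels.
    dy = np.abs(np.diff(difference, axis=0))
    dx = np.abs(np.diff(difference, axis=1))
    # 5) A value of 2 can only come from -1 next to +1.
    return bool((dy == 2).any() or (dx == 2).any())

touching = np.array([[1, 2], [0, 0]])
diagonal = np.array([[1, 0], [0, 2]])
separated = np.array([[1, 0, 2]])
results = [classesAre4Neighbours(a, 1, 2) for a in (touching, diagonal, separated)]
```

Only the first label image has classes 1 and 2 side by side; the diagonal and separated cases are rejected.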
I looked through the MSRC dataset by hand to see if the "void" class was common. In many of the images, object segments are separated by a void area - need a better way to detect object segment neighbours, ignoring void when sensible.
Classification
Pixel-based logistic regression. Take each pixel in each image as a data point, and build a logistic regression classifier to be used as a unary potential within the CRF.
Ignore "void" pixels
Shotton et al state that "void pixels are ignored for training and testing". How best to "ignore" void pixels?
If pixels are simply "dropped" from an input image before feature generation, the image has missing pixels, which makes generating features from neighbourhoods (LBP or filter responses) impossible without some kind of substitution regime.
It is possible to generate features for all pixels, then discard pixels with void ground truth before appending into the "image result" feature vector. Since our classifier is learning a pixel model, there is no need for the size of the feature vector to be (numPixels x numFeatures) for every image, since input data is at the pixel level.
Take image, generate features and then convert to (numPixelsNotVoid x numFeatures) array. Convert labels to array of matching length (numPixelsNotVoid).
For all images in training/validation datasets:
    create image features
    reshape to (numPixels x numFeatures) np array per image
    reshape labels to (numPixels) np array
    stack features into combined (totalNumPixels x numFeatures) feature dataset
    stack labels into combined label set
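The loop above, with void pixels discarded after feature generation, can be sketched as below. The function name and the choice of 0 as the void label are assumptions; the real label encoding of the dataset may differ.

```python
import numpy as np

VOID_LABEL = 0   # assumption: integer used for "void" in the label images

def pixelDataset(featureImages, labelImages, voidLabel=VOID_LABEL):
    """Stack per-pixel features and labels over all images, skipping void pixels."""
    allFeatures, allLabels = [], []
    for features, labels in zip(featureImages, labelImages):
        numFeatures = features.shape[2]
        flatFeatures = features.reshape(-1, numFeatures)   # (numPixels x numFeatures)
        flatLabels = labels.reshape(-1)                    # (numPixels)
        keep = flatLabels != voidLabel                     # drop void only after features exist
        allFeatures.append(flatFeatures[keep])
        allLabels.append(flatLabels[keep])
    return np.vstack(allFeatures), np.concatenate(allLabels)

# Two toy "images" of different sizes, each with 4 features per pixel.
features1 = np.arange(2 * 2 * 4, dtype=np.float64).reshape(2, 2, 4)
labels1 = np.array([[0, 1], [2, 0]])
features2 = np.ones((2, 3, 4))
labels2 = np.array([[1, 1, 0], [3, 0, 2]])
X, y = pixelDataset([features1, features2], [labels1, labels2])
```

The resulting X and y have one row per non-void pixel and could be passed to a classifier such as scikit-learn's LogisticRegression via fit(X, y).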
Q: How best to incorporate image-wide features (like colour histograms) into pixel-level models? Just append image features to all pixel feature datasets?
Getting to grips with scikit-learn LogisticRegression model - should be fun.
TODO list
* Implement an "ObjectSegmentNeighbours" function
* Implement colour histograms in HSV and/or HS colour space.
* ColourHOG implementation based on scikit-image HOG