This document outlines my 4th place solution for the Kaggle Yelp Restaurant Photo Classification competition.
We fine-tune the pre-trained Inception V3 network provided by mxnet.
We modify the symbol file by renaming the fully-connected layer, which prevents mxnet from trying to initialize that layer with pre-trained weights. We also reduce the number of output classes from 1000 to 9 (one per business label) and change the output to LogisticRegressionOutput (independent sigmoids), which is appropriate for multi-label learning because the output probabilities are not constrained to sum to 1.
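To see why independent sigmoids suit the multi-label setting, compare them with a softmax over the same scores (a toy numpy sketch, not part of the actual pipeline; the logit values are made up):

```python
import numpy as np

# Toy logits for the 9 business labels of a single image.
logits = np.array([2.0, -1.0, 0.5, 3.0, -2.0, 1.0, 0.0, -0.5, 2.5])

# Softmax: the probabilities compete and must sum to 1, so it only
# suits single-label problems.
softmax = np.exp(logits) / np.exp(logits).sum()

# Independent sigmoids (what LogisticRegressionOutput applies): each
# label gets its own probability, so several can be "on" at once.
sigmoid = 1.0 / (1.0 + np.exp(-logits))
```

Here the softmax probabilities sum to 1 by construction, while five of the nine sigmoid outputs exceed 0.5 simultaneously.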
Using this modified architecture, we initialize the model with pre-trained weights, except for the last layer which was renamed and modified.
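mxnet matches pre-trained weights to layers by name, so renaming the fully-connected layer is enough to exclude it from initialization. A minimal sketch of that filtering logic using plain dicts (the layer and parameter names here are illustrative, not the actual Inception V3 names):

```python
# Hypothetical pre-trained parameter dict, keyed by parameter name.
pretrained = {
    'conv_1_weight': '...',
    'conv_1_bias': '...',
    'fc1_weight': '...',   # original 1000-way classifier
    'fc1_bias': '...',
}

# Parameters the modified symbol expects; the classifier was renamed
# (e.g. to fc_yelp), so its weights get fresh initialization instead.
new_args = {'conv_1_weight', 'conv_1_bias', 'fc_yelp_weight', 'fc_yelp_bias'}

# Keep only the pre-trained weights whose names the new symbol still uses.
init_params = {k: v for k, v in pretrained.items() if k in new_args}
```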
We train the network for 5 epochs over the training set, applying random cropping, mirroring, and scaling to the images. Training ran on an AWS Spot instance (g2.2xlarge) with cuDNN.
The 1024 features from the global_pool layer (the penultimate layer) are then extracted. Each business's feature vector is the average of its images' feature vectors.
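The per-business averaging can be sketched with numpy (the array names and 4-d toy vectors are illustrative; the real features are 1024-d):

```python
import numpy as np

# Each row of image_feats is one image's global_pool feature vector,
# and photo_biz maps each image to its business id.
image_feats = np.array([[1.0, 2.0, 0.0, 4.0],
                        [3.0, 0.0, 2.0, 0.0],
                        [5.0, 5.0, 5.0, 5.0]])
photo_biz = np.array([101, 101, 202])

# Average the feature vectors of all images belonging to each business.
biz_feats = {b: image_feats[photo_biz == b].mean(axis=0)
             for b in np.unique(photo_biz)}
```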
These features are fed into classical machine learning (ML) models: support vector classification, logistic regression, and random forest. We use a one-vs-rest approach for the multi-label problem, implemented with scikit-learn.
The class label probabilities from the 3 ML models are averaged, and a label is assigned when its averaged probability exceeds a threshold of 0.46 (determined by cross-validation).
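The ensembling step amounts to averaging the three probability matrices and thresholding at 0.46 (a numpy sketch; the probability values are made up and only 3 of the 9 label columns are shown):

```python
import numpy as np

# Per-model probability matrices: rows are businesses, columns are labels.
p_svc = np.array([[0.9, 0.2, 0.5]])
p_lr  = np.array([[0.8, 0.3, 0.4]])
p_rf  = np.array([[0.7, 0.6, 0.3]])

avg = (p_svc + p_lr + p_rf) / 3.0

# A label is assigned whenever its averaged probability exceeds 0.46,
# the threshold chosen from local cross-validation.
labels = avg > 0.46
```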
Finally, a majority vote is taken over the predictions generated from features extracted at epochs 3 through 5.
Fine-tuning of the network resulted in an improvement of 0.04 in local CV tests. On the public leaderboard, this resulted in a score around 0.81, which was good for top 10 at the time.
Interestingly, adding a random forest classifier to the ML ensemble gave about a 0.01 improvement. This is probably because the non-linear nature of random forest picked up relations the linear classifiers could not, and its predictions were generally less correlated with the others. Random forest was more optimistic in assigning labels while the linear classifiers were more conservative, so averaging provided a best of both. Ensembling via majority vote across different epochs added a further 0.01 improvement.
On the last day, I found that decreasing the regularization parameter in the SVC model helped increase the score. One of the submissions was 6th on the public leaderboard but would have been first on the private leaderboard. Most likely, though, it would have been a lucky submission if chosen. If I had more time, I would have ensembled together more of these modified SVC models to try to get a more consistent model when tested against the private leaderboard.
This section outlines the scripts I used.
I used mxnet's ImageRecordIter, which has quite good performance. The record file generator first requires a .lst file, which is what create_img_list.py produces. After creating the .lst file, generate the record file with mxnet's im2rec tool, using resize=299 and label_width=9.
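For reference, each line of a .lst file is tab-separated: an integer index, the label columns (9 of them here, matching label_width=9), and the image path. A sketch of formatting one line (the labels and path are illustrative):

```python
# One image whose business has labels 1, 4, and 8 active.
index = 0
labels = [0, 1, 0, 0, 1, 0, 0, 0, 1]   # 9 binary business labels
path = 'train_photos/12345.jpg'

# Tab-separated: index, then the 9 labels, then the image path.
line = '\t'.join([str(index)] + [str(l) for l in labels] + [path])
```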
The script train_inception.py fine-tunes the Inception V3 model pre-trained by mxnet.
The script get_image_features.py extracts the image features from the global_pool layer (the penultimate layer) of the fine-tuned model.
The business features are the average of the image features. The scripts get_biz_features*.py extract these features.
Using the business features and labels, train three ML models (support vector classification, logistic regression, and random forest) with the one-vs-rest approach for multi-label classification. I used scikit-learn for these models in train_ml.py.
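A minimal sketch of the one-vs-rest setup with scikit-learn (toy data with 2 features and 3 labels stands in for the real 1024-d features and 9 labels; the other two ensemble members plug in the same way with SVC or RandomForestClassifier as the base estimator):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Toy business features (rows) and multi-label targets: one binary
# column per label.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]] * 5)
Y = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]] * 5)

# One binary classifier is fit per label column.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
probs = clf.predict_proba(X)   # shape: (n_businesses, n_labels)
```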
Using the models, generate predictions in test_ml.py. The predictions use the average probabilities from the 3 ML models and a threshold of 0.46 for assigning labels. The threshold was chosen from local CV runs.
After generating predictions from features extracted at different epochs, the script merge_submissions.py applies a majority-vote classifier. I found the ensembled models' scores were generally more reliable, in the sense that public and private leaderboard scores were closer.
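The majority vote over the per-epoch label predictions can be sketched as follows (the binary prediction matrices are illustrative):

```python
import numpy as np

# Binary label predictions (businesses x labels) produced from features
# extracted at epochs 3, 4, and 5.
epoch3 = np.array([[1, 0, 1]])
epoch4 = np.array([[1, 1, 0]])
epoch5 = np.array([[1, 0, 0]])

# A label is kept if at least 2 of the 3 epoch models predict it.
votes = epoch3 + epoch4 + epoch5
final = votes >= 2
```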
I'd like to thank Kaggle for hosting the competition and Yelp for providing the challenge and an interesting data set. I also want to highly recommend mxnet: great performance and ease of use.