# Starting conditions

We are two freshly recruited data scientists who know how to code and have a fair understanding of statistics and of the constraints of real-life data projects. Both of us have an affinity for Unix environments, experience with the Python programming language, and a geek's habit of using shell tools to automate tasks. We also both hold a PhD in a quantitative discipline. During the PhD, we each carried out our own research project with autonomy and perseverance in order to obtain original results. Along the way we acquired transversal skills in written and oral communication, as well as in project and time management. What we knew when we started... (add points about what we learned in terms of project management during the PhD?)

On the definition of data scientist. "Data scientist" can have a rather broad definition. In our view, it is someone who can handle a data-oriented project in its entirety. The role sits at the overlap of three distinct domains: the skills of a statistician who knows how to model and summarize datasets; the skills of a computer scientist who can design and use algorithms to efficiently store, process, and visualize this data; and the scientific outlook needed to ask the right questions and to put their answers in context. The most common tasks of a data scientist include project conception, tidying and exploring raw data, developing meaningful analyses, extracting knowledge and information, and communicating an interesting and informative story. (This includes but is not limited to: project conception, data exploration/preparation, developing meaningful analyses, extracting knowledge and not only information...)

We took a Kaggle competition as a trial project to gain experience with real-world data issues. The objective of this competition is to contribute to fisheries monitoring by finding the best algorithm for classifying pictures taken from fishing boats into seven species. For more details about the rules, please refer to the [Kaggle website](https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring).

As a consequence of the competition NDA, we cannot share the pictures. The training dataset is divided into categories: several annotated species plus two extra classes, no fish and other (whales, ...). The pictures are wide-field shots from several boats and encompass variations such as day/night conditions, multiple fish per picture, large areas of boat structure, ...

In the following sections you will find a summary of our discovery of image analysis and classification, told as closely as possible to how we lived the experience.

# Stage 0 : How do you even analyze a single image ?

It is one thing to use software such as Photoshop (Adobe) or GIMP, quite another to think about images as matrices and about how to extract meaningful features from them. A few libraries exist for working with images; given the constraint that we essentially work in Python, two major ones attracted our interest: [scikit-image](http://scikit-image.org/) and [OpenCV](http://opencv.org/). We started by testing, from a naive perspective, quite a few things one can do with an image: color encodings and their advantages, thresholding, segmentation, transformations, detection of interest points (SIFT, SURF, ORB, Hessian of Gaussian, Laplacian of Gaussian, ...), and probably a few forgotten ones.

In our hands, even if scikit-image was perhaps a bit more intuitive thanks to its very Pythonic approach, OpenCV was much more stable and faster on operations such as color segmentation. There is only one counterintuitive hiccup: by default, OpenCV stores images in BGR channel order rather than RGB, apparently for [historical reasons](https://www.learnopencv.com/why-does-opencv-use-bgr-color-format/).
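
To make the BGR hiccup concrete, here is a minimal sketch that reorders the channels with plain NumPy slicing. The helper name `bgr_to_rgb` and the tiny two-pixel image are illustrative assumptions, not part of our original notebook; with OpenCV at hand, `cv2.cvtColor(img, cv2.COLOR_BGR2RGB)` should give the same result.

```python
import numpy as np

def bgr_to_rgb(image):
    """Return a copy of an H x W x 3 image with the channel order reversed."""
    # Reversing the last axis swaps the first and third channels (B <-> R)
    return image[..., ::-1].copy()

# Hypothetical 1 x 2 image in BGR order: a pure-blue pixel and a pure-red pixel
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)
rgb = bgr_to_rgb(bgr)
print(rgb[0, 0], rgb[0, 1])
```

Forgetting this conversion typically shows up when an image loaded with OpenCV is displayed with Matplotlib: reds and blues appear swapped.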

One line of work we tried was to take a few representative images and test how we could increase the signal-to-noise ratio. Thresholding with Otsu's method, for example, was interesting, but the fish often ended up in the background region (black). One helpful step was color segmentation, which globally removes noise and generates cartoon-like images. We first tried the methods implemented in scikit-image but found that, for some images, segmentation failed because of implementation specifics. Once we realised that color segmentation could basically be done with k-means clustering, as in [this OpenCV tutorial](http://docs.opencv.org/3.1.0/d1/d5c/tutorial_py_kmeans_opencv.html), we added this step to our later trials to simplify the images:

In [None]:
# source : http://docs.opencv.org/3.1.0/d1/d5c/tutorial_py_kmeans_opencv.html

import cv2
import numpy as np

im = 'path/to/image/name.jpg'
img = cv2.imread(im)
# Color segmentation starts here
Z = img.reshape((-1, 3))
# convert to np.float32
Z = np.float32(Z)
# stopping criteria: at most 10 iterations, or an epsilon of 1.0
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
# number of colors wanted in the final picture
K = 16
# run k-means with 10 random initializations and keep the best clustering
ret, label, center = cv2.kmeans(Z, K, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)
# convert the cluster centers back to uint8 and rebuild the image,
# replacing each pixel by the color of its cluster center
center = np.uint8(center)
res = center[label.flatten()]
res2 = res.reshape(img.shape)

cv2.imshow('res2',res2)
cv2.waitKey(0)
cv2.destroyAllWindows()
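
The Otsu thresholding mentioned earlier picks a single global threshold that maximizes the variance between the two resulting classes of pixels. This explains why a fish that is darker than the deck can fall on the "background" side. The sketch below is a NumPy-only illustration of the idea under that definition, not the OpenCV implementation; the function name `otsu_threshold` and the synthetic two-tone image are assumptions made for the example.

```python
import numpy as np

def otsu_threshold(gray):
    """Return a threshold t for a uint8 image: pixels <= t form one class, pixels > t the other."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)
    omega = np.cumsum(prob)           # weight of the "dark" class for each candidate t
    mu = np.cumsum(prob * levels)     # cumulative mean up to each candidate t
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    # Between-class variance, left at 0 where one of the classes would be empty
    sigma_b = np.zeros(256)
    valid = denom > 1e-12
    sigma_b[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))

# Hypothetical 8 x 8 image: dark left half (50), bright right half (200)
gray = np.full((8, 8), 50, dtype=np.uint8)
gray[:, 4:] = 200
t = otsu_threshold(gray)
foreground = gray > t
print(t, int(foreground.sum()))
```

Because the threshold is global, anything darker than `t` is sent to black, whether it is a shadow, the sea, or the fish itself.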

We next tried to generate a mask in order to remove background from the image, such as large boat elements that are rather square and homogeneous in color.

In [None]:
import cv2
import numpy as np
from matplotlib import pyplot as plt

im = 'path/to/image/name.jpg'
img = cv2.imread(im)
blur = cv2.GaussianBlur(img, (5, 5), 0)
Z = blur.reshape((-1,3))
# convert to np.float32
Z = np.float32(Z)
# stopping criteria: at most 10 iterations, or an epsilon of 1.0
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
# number of colors wanted in the segmented picture
K = 16
ret, label, center = cv2.kmeans(Z, K, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)
# convert the cluster centers back to uint8 and rebuild the (blurred) image
center = np.uint8(center)
res = center[label.flatten()]
res2 = res.reshape(blur.shape)
# Convert to grayscale and apply Otsu thresholding
gray = cv2.cvtColor(res2, cv2.COLOR_BGR2GRAY)
ret, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_OTSU)

# Noise removal by contour detection of large elements
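# findContours returns 3 values in OpenCV 3.x and 2 in OpenCV 4.x; [-2:] works with both
contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)[-2:]
mask = np.zeros(thresh.shape, np.uint8)
# the np.bool alias was removed in recent NumPy releases; the builtin bool works everywhere
mask2 = np.zeros(thresh.shape, bool)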
# Remove large elements, typically boat structures 
for c in contours:
    # if the contour is not sufficiently large, ignore it
    if cv2.contourArea(c) < 7000:
        continue
    cv2.drawContours(mask, [c], -1, (255, 255, 255), -1)
mask2[mask < 250] = True
masked = thresh * mask2
masked = cv2.cvtColor(masked, cv2.COLOR_GRAY2BGR)

# Perform keypoint detection on masked image
orb = cv2.ORB_create(nfeatures=3000)
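# bitwise_and keeps the pixels where the mask is 255; a plain uint8 product would overflow
kp, descs = orb.detectAndCompute(cv2.bitwise_and(res2, masked), None)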
blobs_img = cv2.drawKeypoints(img, kp, None, color=(0,255,0), flags=0)

# Plot shape of the mask and the detected keypoints
fig, ax = plt.subplots(1, 2, sharex=False, sharey=False)
# plt.subplots(1, 2) returns a 1-D array of two axes, indexed from 0
ax[0].set_aspect(aspect='auto', adjustable='box')
ax[0].set_title('Threshold+Mask')
ax[0].axis('off')
ax[0].imshow(masked, cmap=plt.cm.gray)

ax[1].set_aspect(aspect='auto', adjustable='box')
ax[1].set_title('ORB')
ax[1].axis('off')
# Matplotlib expects RGB, whereas OpenCV images are BGR
ax[1].imshow(cv2.cvtColor(blobs_img, cv2.COLOR_BGR2RGB))
plt.show()


# Stage 1 : Follow tutorials for machine learning

What we did, having two minds on one problem:
- one of us tried to optimize feature extraction combined with classic machine learning approaches (SVM, xgboost, ...), while the other went for deep learning.
  - Advantages and drawbacks (performance vs. computation time)

- How did we do on our scores? Whoa, great results... Wait, on the public leaderboard our results really suck... Why is that?
  - Identification of the boat issue


# Stage 2 : Understand deeply your data

- Day / night pictures
- Varying quality of pics
- Separation of train and validation sets using boat IDs
- Getting a fair score
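
Separating the sets by boat can be sketched as follows. This is a minimal illustration, assuming each picture already carries a boat identifier (how these identifiers are obtained is not covered here); the function name `split_by_boat`, the `val_fraction` parameter, and the boat labels are all hypothetical.

```python
import random
from collections import defaultdict

def split_by_boat(boat_ids, val_fraction=0.25, seed=0):
    """Return (train_idx, val_idx) such that no boat appears in both sets."""
    by_boat = defaultdict(list)
    for idx, boat in enumerate(boat_ids):
        by_boat[boat].append(idx)
    # Shuffle the boats (not the pictures) so that whole boats are held out
    boats = sorted(by_boat)
    random.Random(seed).shuffle(boats)
    n_val = max(1, round(len(boats) * val_fraction))
    val_idx = sorted(i for b in boats[:n_val] for i in by_boat[b])
    train_idx = sorted(i for b in boats[n_val:] for i in by_boat[b])
    return train_idx, val_idx

# Hypothetical boat label for each of 8 pictures
boat_ids = ["boat_a", "boat_a", "boat_b", "boat_c",
            "boat_c", "boat_d", "boat_b", "boat_d"]
train_idx, val_idx = split_by_boat(boat_ids)
print(train_idx, val_idx)
```

With such a split, the validation score measures how well a model generalizes to boats it has never seen, which is closer to what a leaderboard made of unseen boats would reward. scikit-learn offers ready-made equivalents such as `GroupKFold` and `GroupShuffleSplit`.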

# Stage 3 : The limits of simplicity

Our solution as a balance between complexity and time spent on the project

## Stage 3.1 : Preprocessing improvements

- Crop and rotate images (CNN using regression)
- Histogram equalization to improve contrast
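
As an illustration of the histogram equalization step, here is a NumPy-only sketch of the classic global method on a grayscale image. It is not necessarily the exact variant we used; the function name `equalize_histogram` and the synthetic low-contrast image are assumptions. OpenCV provides `cv2.equalizeHist` for the global method and `cv2.createCLAHE` for a local, contrast-limited variant.

```python
import numpy as np

def equalize_histogram(gray):
    """Spread the intensities of a uint8 grayscale image over the full 0-255 range."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = np.cumsum(hist)
    cdf_min = cdf[cdf > 0][0]        # CDF value at the darkest intensity present
    n = gray.size
    if n == cdf_min:                 # constant image: nothing to stretch
        return gray.copy()
    # Map each intensity through the normalized cumulative distribution
    lut = np.round((cdf - cdf_min) / (n - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]

# Hypothetical low-contrast image: intensities confined to 100..131
low_contrast = np.arange(100, 132, dtype=np.uint8).reshape(4, 8)
equalized = equalize_histogram(low_contrast)
print(equalized.min(), equalized.max())
```

For color pictures, a common choice is to equalize only a luminance channel after converting to a color space that has one, so that hues are left untouched.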

# Stage 4 : What it is likely to take to win

Amount of test iterations...
Going beyond : unsupervised feature extraction, Faster R-CNN, ensembling, ...
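
Among these ideas, ensembling is the simplest to sketch: average the class probabilities predicted by several models. The snippet below is a minimal illustration under that assumption, not something we ran for the competition; `average_ensemble` and the two small probability tables are hypothetical.

```python
import numpy as np

def average_ensemble(prob_list, weights=None):
    """Average arrays of shape (n_samples, n_classes) and renormalize each row."""
    stacked = np.stack(prob_list)    # shape: (n_models, n_samples, n_classes)
    avg = np.average(stacked, axis=0, weights=weights)
    return avg / avg.sum(axis=1, keepdims=True)

# Hypothetical predictions of two models for 2 pictures and 3 classes
model_a = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
model_b = np.array([[0.5, 0.4, 0.1], [0.3, 0.3, 0.4]])
ensemble = average_ensemble([model_a, model_b])
print(ensemble)
```

The optional `weights` argument takes one weight per model, for instance to favor the model with the better validation score.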


# Conclusion

Redefinition of data scientist.
A working data scientist has to balance things and put them into perspective depending on the number of projects he/she works on. 
We did not have the time to go into fully custom algorithms for a winning solution (or a very good one). Within the allocated time frame we did learn a lot (collaborative approach, IT skills, image analysis skills, ...). We did not provide the best approach.
Final words: the algorithms are not the only ones that learn. We do too, and as long as we keep learning we are on the right track. Maybe next time?