# Image Classifier

**By:** Daksh Gupta

**Project Description:**

***Libraries Required:*** OpenCV 3.1.0, NumPy, scikit-learn

This is an image classifier that labels images as Spam or Notspam. Here, Spam images are images that contain a large amount of text, memes, cartoons, etc., while Notspam images are generally photographs of humans, landscapes, animals, etc.

I manually collected and labeled the dataset, which has a total of 3,426 images. The images are my own photographs and images I consider SPAM from my WhatsApp images folder.

The classifier uses the following features to describe the images:

- No. of faces in the image
- No. of distinct colors in the image
- % area of image covered by text
- % area of image covered by top color 1
- % area of image covered by top color 2
- % area of image covered by top color 3
- % area of image covered by top color 4
- % area of image covered by top color 5
- % area of image covered by top color 6
- % area of image covered by top color 7
- % area of image covered by top color 8
- % area of image covered by top color 9
- % area of image covered by top color 10

From basic experimentation, I found that Spam images generally have very few unique colors, and the ones they do have each occupy a large share of the image pixels. Notspam images, on the other hand, have a large number of unique colors.

To get a proper number of unique colors, I have considered only the colors/shades that occupy more than 0.5% of the image.
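
The effect of the 0.5% cut-off can be illustrated on a small synthetic image. The sketch below is a vectorised variant of the same idea, not the notebook's own function; the helper name `count_dominant_shades` and the synthetic image are assumptions made for illustration.

```python
import numpy as np

def count_dominant_shades(img_gray, min_fraction=0.005):
    """Return (number of shades above the cut-off, their area fractions, largest first)."""
    _, counts = np.unique(img_gray, return_counts=True)
    fractions = counts / img_gray.size
    dominant = np.sort(fractions[fractions > min_fraction])[::-1]
    return len(dominant), dominant

# Synthetic 100x100 "screenshot": a white page with a black band of "text",
# plus 20 stray pixels that each have their own shade of gray.
img = np.full((100, 100), 255, dtype=np.uint8)
img[40:60, :] = 0                    # 2000 black pixels = 20% of the image
img[0, :20] = np.arange(100, 120)    # 20 shades, 1 pixel each = 0.01% each

n_all_shades = np.unique(img).size
n_shades, areas = count_dominant_shades(img)
print(n_all_shades, n_shades, areas)
```

The image contains 22 distinct gray values, but only white and black pass the cut-off, which is the behavior the feature is meant to capture.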

I will be selecting the best out of SVC, LogisticRegression, DecisionTreeClassifier, MLPClassifier and BernoulliNB to train and classify this data.

***To Run:***

Place the images in a folder called 'data' in the same folder as the Notebook. The structure of the folder should be like "*data\\notspam (1).jpeg; data\\notspam (2).jpeg; .....; data\\spam (1).jpg ...*"
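
Labels are taken from the file name prefix, so the naming matters. The snippet below is a small sketch of that rule; the helper name `label_from_filename` is hypothetical, and forward slashes are used only so the example behaves the same on every platform.

```python
import os

def label_from_filename(path):
    """Return 1 for spam and 0 for notspam, based on the file name prefix."""
    name = os.path.basename(path)
    return 1 if name.startswith('spam') else 0

examples = ['data/notspam (1).jpeg', 'data/spam (1).jpg', 'data/Spam (2).jpg']
for p in examples:
    print(p, '->', label_from_filename(p))
```

The prefix check is case-sensitive, so a file called `Spam (2).jpg` would be treated as Notspam; keeping the prefixes in lowercase avoids that.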

The Notebook ships with 'data.pckl' and 'labels.pckl' files, which contain the data matrix and the target variable numpy array respectively. The code below prefers loading the data from the .pckl files over re-processing the images on every run.
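
The load-or-compute pattern can be written as a small helper. This is a minimal sketch under the assumption that the cached object is picklable; `load_or_build` and `expensive_build` are hypothetical names, and a temporary directory stands in for the notebook folder.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Load a pickled object from `path`, or build it once and cache it there."""
    if os.path.exists(path):
        with open(path, 'rb') as fh:
            return pickle.load(fh)
    obj = build()
    with open(path, 'wb') as fh:
        pickle.dump(obj, fh)
    return obj

calls = []

def expensive_build():
    # Stand-in for the slow image-processing step.
    calls.append(1)
    return {'features': [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, 'demo.pckl')
    first = load_or_build(cache, expensive_build)    # builds and writes the cache
    second = load_or_build(cache, expensive_build)   # reads the cache

print(first == second, len(calls))
```

Deleting the cache file is enough to force the features to be recomputed on the next run.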

The model comparison results are stored in 'modelparams.pckl'.

In [1]:
import numpy as np
import cv2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale, MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import BernoulliNB
import pickle
import os

In [2]:
'''
This function estimates the percentage of the image covered with text. It doesn't use techniques like OCR (optical character recognition);
rather, it uses edge detection and thresholding to determine the location of text.

Because of this, it almost never returns 0%, since nearly all images have edges.
'''
def detectTextArea(img):
    img_gray = img
    if len(img.shape) > 2:
        img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img_sobel = cv2.Sobel(img_gray, cv2.CV_8U, 1, 0, ksize=3, scale=1, delta=0, borderType=cv2.BORDER_DEFAULT)
    _, img_threshold = cv2.threshold(img_sobel, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    element = cv2.getStructuringElement(cv2.MORPH_RECT, (17, 3))
    img_threshold = cv2.morphologyEx(img_threshold, cv2.MORPH_CLOSE, element)
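    # RETR_EXTERNAL (0) keeps only the outer contours; CHAIN_APPROX_NONE (1) keeps every contour point.
    # Indexing with [-2] works whether findContours returns 3 values (OpenCV 3.x) or 2 (OpenCV 4.x).
    contours = cv2.findContours(img_threshold, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)[-2]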
    boundRect = []
    area = 0
    for contour in contours:
        # print(contour.size)
        if contour.size > 100:
            contours_poly = cv2.approxPolyDP(contour, 3, True)
            rect = cv2.boundingRect(contours_poly)
#             print(rect)
            area += rect[2]*rect[3]
            if rect[2] > rect[3]:
                boundRect.append(rect)
                
    return area/img_gray.size, boundRect

In [3]:
'''
This function detects and returns the number of faces in the provided image. Detection works best on frontal faces with an
almost neutral expression; faces wearing sunglasses (shades) are not detected.
'''

face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
# face_cascade = cv2.CascadeClassifier('haarcascade_profileface.xml')


def faceCounter(img, cascade=face_cascade):
    img_gray = img
    if len(img.shape) > 2:
        img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img_eq = cv2.equalizeHist(img_gray)
#     key = 0
#     while key != 27:
#         cv2.imshow('disp',img_eq)
#         key = cv2.waitKey()
    faces = cascade.detectMultiScale(img_eq,minSize=(50, 50),flags=1,scaleFactor=1.2)
    return len(faces)

In [4]:
'''
This function returns the number of unique shades of gray present in the image. It filters out shades that occupy less than
0.5% of the image.

The reason for doing this is to get only the visually impactful colors instead of all of them. For example, a screenshot of a
notepad document with text contains all 256 shades, but the visually impactful ones are just two, i.e., black and white.
'''
def getUniqueColors(img):
    img_gray = img
    if len(img.shape) > 2:
        img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    total_pixels = img_gray.size
    x, counts = np.unique(img_gray, return_counts=True)
    colors = []
    area = []
#     print(x.size)
    for i in range(x.size):
        if float(counts[i])/total_pixels > 0.005:
#             print(x[i])
            colors.append(x[i])
#             print()
            area.append(float(counts[i])/total_pixels)
    area.sort(reverse = True)
#     print(colors, area)
    return len(colors), area

In [5]:
'''
This code snippet forms the data and label matrices
'''

data = np.ones((2,2),np.float32)
labels = np.ones((1),np.int32)
# Rebuild the features unless both cache files exist (a lone file cannot be loaded as a pair).
if not (os.path.exists('data.pckl') and os.path.exists('labels.pckl')):
    for dirname, dirnames, filenames in os.walk('data'):
        total_ex = len(filenames)
        data = np.zeros((total_ex,13),np.float32)
        labels = np.zeros((total_ex),np.int32)
        for i, file in enumerate(filenames):
            if file.startswith('spam'):
                labels[i] = 1
            else:
                labels[i] = 0
            print(os.path.join(dirname,file))
            img = cv2.imread(os.path.join(dirname,file),0)
            h,w = img.shape[:2]
            ratio = w/float(h)
            img = cv2.resize(img, (int(ratio*720),720))
            faces = faceCounter(img)
            colors, area = getUniqueColors(img)
            txtarea, _ = detectTextArea(img)
            data[i,0] = faces
            data[i,1] = colors
            data[i,2] = txtarea
            for j in range(10):
                if j < colors:
                    data[i,j+3] = area[j]
    
    with open('data.pckl','wb') as f:
        pickle.dump(data, f)
    with open('labels.pckl','wb') as f:
        pickle.dump(labels, f)
else:
    with open('data.pckl','rb') as f:
        data = pickle.load(f)
    with open('labels.pckl','rb') as f:
        labels = pickle.load(f)

In [6]:
print('Number of instances: {d[0]}\nNumber of features: {d[1]}'.format(d=data.shape))

Number of instances: 3426
Number of features: 13


In [7]:
'''
This snippet scales the input data between the range 0 to 1.

'''

scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data)
# print(np.min(data_scaled,axis = 0))
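
Min-max scaling maps each feature column to the range 0 to 1 using (x - min) / (max - min). The sketch below spells out that formula with NumPy; it is an illustration rather than scikit-learn's implementation, and the function name `min_max_scale` and the toy values are assumptions.

```python
import numpy as np

def min_max_scale(X):
    """Rescale every column of X to [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=np.float64)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # avoid division by zero for constant columns
    return (X - col_min) / col_range

# Toy rows with made-up values: [faces, distinct shades, text-area fraction]
X = np.array([[0, 2, 0.10],
              [3, 40, 0.50],
              [1, 21, 0.30]])
X_scaled = min_max_scale(X)
print(X_scaled)
```

Scaling matters here because the face count and the color count are on very different scales from the area fractions, which already lie between 0 and 1.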

### Performance

I am using F1 Score as my primary performance measure.
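
The F1 score is the harmonic mean of precision and recall, so it is only high when both are high. As a quick check, the sketch below recomputes F1 from the precision and recall reported for the selected model in the evaluation section; the helper name `f1_from_precision_recall` is hypothetical.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as printed for the selected model on the test split.
f1 = f1_from_precision_recall(0.8875, 0.7977)
print('{:.4f}'.format(f1))   # 0.8402
```

Because the inputs are rounded to four decimals, small differences in the last digit are possible when repeating this for other results.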

#### Baselines

In [8]:
'''
Random Prediction
'''
y = np.random.randint(0,high=2,size=labels.shape)
# print(y)
f1 = f1_score(labels,y)
pre = precision_score(labels,y)
rec = recall_score(labels, y)
acc = accuracy_score(labels, y)
print('Random Class Prediction\n\nF1 Score: {:.4f},\nPrecision: {:.4f},\nRecall: {:.4f},\nAccuracy: {:.4f}'.format(f1,pre, rec, acc))

Random Class Prediction

F1 Score: 0.5209,
Precision: 0.5424,
Recall: 0.5011,
Accuracy: 0.4947


In [9]:
y = None
count = np.where(labels==0)
count = count[0].shape[0]
count1 = labels.shape[0]-count
cls = ''
if count > count1:
    cls='Not_spam'
    y = np.zeros(labels.shape)
else:
    cls='Spam'
    y = np.ones(labels.shape)
f1 = f1_score(labels,y)
pre = precision_score(labels,y)
rec = recall_score(labels, y)
acc = accuracy_score(labels, y)
print('Majority Class Prediction ({})\n\nF1 Score: {:.4f},\nPrecision: {:.4f},\nRecall: {:.4f},\nAccuracy: {:.4f}'.format(cls,f1,pre, rec, acc))

Majority Class Prediction (Spam)

F1 Score: 0.7081,
Precision: 0.5482,
Recall: 1.0000,
Accuracy: 0.5482


Below I am going to perform a coarse model selection by comparing 5 models with some basic tweaks.

In [10]:
'''
In this code I am trying 30,200 different combinations of the random state used for
splitting data_scaled into train and test data, the classifier, and the penalty factor for the
L2 regularization of the Logistic Regression, Support Vector Machine and MLP classifiers.

I cannot provide any prior belief for my classes because I have none.

The code will create a pickle dump named 'modelparams.pckl' saving the F1 scores,
parameters and abbreviations for classifiers for all 30,200 combinations.

I am using random_state seeds from 0 to 99 and C values from 0.01 to 1.

'''

models = []
clfnames = ['LR','NB','DT','SVC','MLP']
classifiers = [LogisticRegression(), BernoulliNB(), DecisionTreeClassifier(), SVC(), MLPClassifier(learning_rate='adaptive',max_iter=500)]
# np.float_ was removed in NumPy 2.0; the built-in float gives the same 64-bit dtype.
Cs = np.linspace(0.01,1,100,endpoint=True,dtype = float).tolist()
# print(len(Cs))
clf = None
count = 0
if not os.path.exists('modelparams.pckl'):
    n = 100
    for i in range(n):
        tr_X, te_X, tr_y, te_y = train_test_split(data_scaled, labels, train_size = 0.67,random_state = i)
        for clf, name in zip(classifiers, clfnames):
            if name in ['LR','SVC','MLP']:
                x = clf
                for c in Cs:
                    if name == 'MLP':
                        x.set_params(alpha=c)
                    else:
                        x.set_params(C = c)
                    score = cross_val_score(x, tr_X, tr_y, cv=10, scoring='f1')
                    models.append((np.mean(score),i,name,c))
                    count += 1
                    # Total runs: n splits x (3 tuned classifiers x 100 penalties + 2 untuned classifiers)
                    print('\rCompleted: {:.2f}%'.format((count/((n*2.)+(n*3.*100)))*100),end=' ')
            else:
                score = cross_val_score(clf, tr_X, tr_y, cv=10, scoring='f1')
                models.append((np.mean(score),i,name,0))
                count += 1
                print('\rCompleted: {:.2f}%'.format((count/((n*2.)+(n*3.*100)))*100),end=' ')
    print('\n')
    pickle.dump(models,open('modelparams.pckl','wb'))
else:
    print('Loading Data ...')
    models = pickle.load(open('modelparams.pckl','rb'))
    print('Done')

Loading Data ...
Done
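
The figure of 30,200 combinations follows from the loop structure: three classifiers are evaluated once per penalty value, two are evaluated once per split, and all of this is repeated for every split seed. A short arithmetic sketch (variable names are illustrative):

```python
n_splits = 100                  # random_state seeds 0..99 for train_test_split
n_penalties = 100               # values of C / alpha from 0.01 to 1
tuned = ['LR', 'SVC', 'MLP']    # evaluated once per penalty value
untuned = ['NB', 'DT']          # evaluated once per split
cv_folds = 10

per_split = len(tuned) * n_penalties + len(untuned)
total = n_splits * per_split
total_fits = total * cv_folds
print(per_split, total, total_fits)   # 302 30200 302000
```

With 10-fold cross-validation each combination trains ten models, which is why caching the results in 'modelparams.pckl' is worthwhile.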


In [11]:
'''
Printing the results of comparisons
'''
for _ in models:
    print(_)

(0.8324975246953038, 0, 'LR', 0.01)
(0.83495089425525248, 0, 'LR', 0.02)
(0.83454257748840432, 0, 'LR', 0.03)
(0.83475352127572378, 0, 'LR', 0.04)
(0.83446357956881556, 0, 'LR', 0.05)
(0.83516465899037584, 0, 'LR', 0.060000000000000005)
(0.83549666430246072, 0, 'LR', 0.06999999999999999)
(0.83513659883601221, 0, 'LR', 0.08)
(0.83513659883601221, 0, 'LR', 0.09)
(0.83415441158053993, 0, 'LR', 0.09999999999999999)
(0.83366992510302218, 0, 'LR', 0.11)
(0.83366992510302218, 0, 'LR', 0.12)
(0.83366992510302218, 0, 'LR', 0.13)
(0.83433659176968877, 0, 'LR', 0.14)
(0.83433659176968877, 0, 'LR', 0.15000000000000002)
(0.83385306925674352, 0, 'LR', 0.16)
(0.83349664081520858, 0, 'LR', 0.17)
(0.83300818313705527, 0, 'LR', 0.18000000000000002)
(0.83300818313705527, 0, 'LR', 0.19)
(0.83265539172043412, 0, 'LR', 0.2)
(0.83265539172043412, 0, 'LR', 0.21000000000000002)
(0.83300148026686238, 0, 'LR', 0.22)
(0.83265915661541001, 0, 'LR', 0.23)
(0.83265915661541001, 0, 'LR', 0.24000000000000002)
(0.83300

In [12]:
models.sort(key= lambda x: x[0], reverse = True)
m,rs,name,c = models[0]
print('Best model: {},\n"C"/"alpha" parameter value: {},\nF1 score: {:.4f},\nRandom state for train_test_split: {}'.format(name,c,m,rs))

Best model: MLP,
"C"/"alpha" parameter value: 0.51,
F1 score: 0.8518,
Random state for train_test_split: 81
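
Sorting the full list only reveals the single winner. A per-classifier summary can show how close the other models came. The sketch below assumes the same tuple layout as `models`, (mean F1, split seed, classifier, penalty), and uses only a handful of rows: the 'LR' rows are copied from the printed results above, and the 'MLP' row uses the rounded values reported for the best model.

```python
# Illustrative subset of (mean F1, split seed, classifier, penalty) tuples.
results = [
    (0.8324975246953038, 0, 'LR', 0.01),
    (0.83549666430246072, 0, 'LR', 0.06999999999999999),
    (0.83265539172043412, 0, 'LR', 0.2),
    (0.8518, 81, 'MLP', 0.51),
]

# Keep the highest-scoring entry for each classifier name.
best_per_clf = {}
for score, seed, name, penalty in results:
    if name not in best_per_clf or score > best_per_clf[name][0]:
        best_per_clf[name] = (score, seed, penalty)

for name, (score, seed, penalty) in sorted(best_per_clf.items()):
    print('{:>4}: F1={:.4f} (seed={}, penalty={})'.format(name, score, seed, penalty))
```

Running the same loop over the full `models` list would give one line per classifier.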


Since MLP was the best of the 5 classifiers in the test above, I will now perform fine model selection by further tweaking the MLP classifier, trying various combinations of the following parameters:
- hidden_layer_sizes
- activation
- solver
- random_state
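
The size of this search can be checked by enumerating the grid with `itertools.product`. The values below are the ones used in the code cell that follows; the enumeration itself is just a sketch and does not train anything.

```python
from itertools import product

activations = ['identity', 'logistic', 'tanh', 'relu']
solvers = ['lbfgs', 'sgd', 'adam']
hidden_layer_sizes = [(100,), (50, 50), (33, 33, 33), (25, 25, 25, 25)]
random_states = range(50)

# Same nesting order as the loops: random state, activation, solver, hidden layers.
grid = list(product(random_states, activations, solvers, hidden_layer_sizes))
print(len(grid))   # 2400
print(grid[0])     # (0, 'identity', 'lbfgs', (100,))
```

Each hidden-layer option spreads roughly 100 neurons over one to four layers, so the comparison is about depth rather than overall size.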

In [14]:
'''
This code tries to find the best MLPClassifier for the split training data at random state 'rs' and alpha 'c'

I am trying all possible combinations of the values of the parameters 'activation', 'solver' and custom values for the parameters
'random_state' and 'hidden_layer_sizes'

I am trying random states 0-49 and 4 hidden-layer configurations.

The best settings should have the highest mean cross validation score on the training data with 10 folds.

The result of this code is stored on the disk within the file 'MLPparams.pckl'

'''
act = ['identity','logistic','tanh','relu']
solve = ['lbfgs','sgd','adam']
hide = [(100,),(50,50,),(33,33,33,),(25,25,25,25,)]
states = 50
MLPresults = []
tr_X, te_X, tr_y, te_y  = train_test_split(data_scaled, labels, train_size = 0.67, random_state = rs)
clf = MLPClassifier(alpha = c, learning_rate='adaptive',max_iter=500, shuffle=False)
if not os.path.exists('MLPparams.pckl'):
    count = 0
    for i in range(states):
        for a in act:
            for s in solve:
                for h in hide:
                    clf.set_params(random_state=i, solver=s, activation=a, hidden_layer_sizes=h)
                    score = cross_val_score(clf, tr_X, tr_y, cv=10, scoring='f1')
                    score = np.mean(score)
                    count += 1.
                    if score >= m:
                        MLPresults.append((score,i,a,s,h))
                    print('\rCompleted: {:.2f}%'.format((count/(50.*4*3*4))*100), end=' ')
    print()
    pickle.dump(MLPresults,open('MLPparams.pckl','wb'))
else:
    print('Loading Data ...')
    MLPresults = pickle.load(open('MLPparams.pckl','rb'))
    print('Done')

Loading Data ...
Done


In [15]:
'''
Printing results of comparisons
'''
for _ in MLPresults:
    print(_)

(0.85249231288239768, 0, 'identity', 'lbfgs', (33, 33, 33))
(0.85284155216317237, 0, 'identity', 'lbfgs', (25, 25, 25, 25))
(0.85250567386256171, 0, 'logistic', 'lbfgs', (100,))
(0.85383295261294234, 0, 'logistic', 'lbfgs', (50, 50))
(0.85457330607285475, 0, 'logistic', 'lbfgs', (33, 33, 33))
(0.86302463359867398, 0, 'tanh', 'lbfgs', (100,))
(0.85816580506722073, 0, 'tanh', 'lbfgs', (50, 50))
(0.8591287713072765, 0, 'tanh', 'lbfgs', (33, 33, 33))
(0.85830713054191388, 0, 'tanh', 'lbfgs', (25, 25, 25, 25))
(0.85284496919809294, 0, 'tanh', 'adam', (33, 33, 33))
(0.86584585323458041, 0, 'relu', 'lbfgs', (100,))
(0.85690209804941075, 0, 'relu', 'lbfgs', (50, 50))
(0.86472462409022488, 0, 'relu', 'lbfgs', (33, 33, 33))
(0.85530933491631822, 0, 'relu', 'lbfgs', (25, 25, 25, 25))
(0.85213994973203167, 1, 'identity', 'lbfgs', (100,))
(0.85213767922565631, 1, 'identity', 'lbfgs', (33, 33, 33))
(0.85187058906602231, 1, 'identity', 'lbfgs', (25, 25, 25, 25))
(0.85191283239310622, 1, 'identity', '

In [16]:
# MLPp = pickle.load(open('MLPparams.pckl','rb'))
# print(len(MLPresults))
if len(MLPresults) > 0:
    MLPresults.sort(key=lambda x: x[0],reverse=True)
    score, i, a, s, h = MLPresults[0]
else:
    # set to default values; no configuration reached the previous score
    score = m
    i = None
    a = 'relu'
    s = 'adam'
    h = (100,)
    
# print(score, i, a, s, h)
print('Best values of parameters:\n')
print('random_state = {},\nactivation = {},\nsolver = {},\nhidden_layer_sizes = {}\n'.format(i,a,s,h))
print('Previous F1 Score: {:.4f}, Improved F1 Score: {:.4f}'.format(m,score))

Best values of parameters:

random_state = 29,
activation = relu,
solver = lbfgs,
hidden_layer_sizes = (33, 33, 33)

Previous F1 Score: 0.8518, Improved F1 Score: 0.8712


Here, the above value of hidden_layer_sizes means that the network has 3 hidden layers with 33 neurons in each layer.
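
To make the architecture concrete, the sketch below lists the weight-matrix shapes of a fully connected network with the 13 input features and hidden layers of (33, 33, 33). It assumes a single output unit for the two-class problem; the helper name `mlp_layer_shapes` is hypothetical and the code only does arithmetic, it does not build a model.

```python
def mlp_layer_shapes(n_features, hidden_layer_sizes, n_outputs=1):
    """Return the (inputs, outputs) shape of each weight matrix in a dense MLP."""
    sizes = [n_features, *hidden_layer_sizes, n_outputs]
    return list(zip(sizes[:-1], sizes[1:]))

shapes = mlp_layer_shapes(13, (33, 33, 33))
# Each layer has rows * cols weights plus one bias per output unit.
n_params = sum(rows * cols + cols for rows, cols in shapes)
print(shapes)     # [(13, 33), (33, 33), (33, 33), (33, 1)]
print(n_params)   # 2740
```

Under these assumptions the network has a few thousand trainable parameters, which is small relative to the 3,426 images.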

### Evaluation Results for selected model

In [17]:
clf = MLPClassifier( hidden_layer_sizes=h, alpha = c, learning_rate='adaptive', max_iter=500, shuffle=False, random_state=i, activation=a, solver=s)
clf.fit(tr_X, tr_y)
y = clf.predict(te_X)
f1 = f1_score(te_y,y)
pre = precision_score(te_y,y)
rec = recall_score(te_y, y)
acc = accuracy_score(te_y, y)
print('Actual Prediction\n\nF1 Score: {:.4f},\nPrecision: {:.4f},\nRecall: {:.4f},\nAccuracy: {:.4f}'.format(f1,pre, rec, acc))

Actual Prediction

F1 Score: 0.8402,
Precision: 0.8875,
Recall: 0.7977,
Accuracy: 0.8355
