# Aesthetic Classification

In this notebook we use several helper functions to build a model and obtain results from image descriptors.
It serves as an example for creating scripts that automatically generate the results for our paper.

## A bit of set up

We need numpy and pandas for data handling, pickle and gzip to read the extracted features, the folder that holds our own functions, and several models from scikit-learn.

In [2]:
# set up Python environment: numpy for numerical routines
import numpy as np
import pandas as pd

# to store the results
from six.moves import cPickle as pickle
import gzip

# our code (utilsData needs a view)
import sys
sys.path.append('../pycode/')
import utilsData
from preprocess import utilities

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

In [3]:
from sklearn.metrics import roc_auc_score, accuracy_score
import full_models

In [1]:
import sqlite3

## AVA dataset
We start with the AVA data. First, an info package must be loaded. It contains information about votes, style features, labels and IDs. Then, using the ARFF file and the readARFF function, we extract the features together with their IDs. Finally, both sources are combined.

In [4]:
features_file = '../features/AVA/GHIST.arff'
#features_file = '../features/features_fc6.pklz'
output_file = '../prueba.pklz'
selected_model = 'NBG'
decaf_discrete = 'False'

In [14]:
if features_file[-4:] == 'pklz':
    features = pickle.load(open(features_file,'rb',2))
else:
    features = utilsData.readARFF(features_file)
    
features['id'] = features['id'].astype(int)
# for testing in notebooks
#features = features.iloc[:,-101:]

# take the feature names and drop the ID column
features_names = np.array(features.columns)
index = np.argwhere(features_names=='id')
features_names = np.delete(features_names, index)

# this line normalizes the decaf features
if (decaf_discrete == 'True'):
    features[features_names],_ = utilities.reference_forward_implementation(np.array(features[features_names]),5,2,1.5,0.75)

data2 = pickle.load(gzip.open('../packages/AVA_info.pklz','rb',2))
data2=data2.merge(features, on='id', how='right', copy=False)
num_images = data2.shape[0]  # count rows of the merged frame built in this cell

# to free space
del features

In [5]:
dbfile = '../packages/sqlite_aux.db'
cxn = sqlite3.connect(dbfile)
c = cxn.cursor()

In [6]:
data = pickle.load(gzip.open('../packages/AVA_info.pklz','rb',2))
data.to_sql(name='data', con = cxn, if_exists='replace')

del data

In [8]:
if features_file[-4:] == 'pklz':
    features = pickle.load(open(features_file,'rb',2))
else:
    features = utilsData.readARFF(features_file)
    
features['id'] = features['id'].astype(int)

# take the feature names and drop the ID column
features_names = np.array(features.columns)
index = np.argwhere(features_names=='id')
features_names = np.delete(features_names, index)

# this line normalizes the decaf features
if (decaf_discrete == 'True'):
    features[features_names],_ = utilities.reference_forward_implementation(np.array(features[features_names]),5,2,1.5,0.75)

features.to_sql(name='features', con = cxn, if_exists='replace')

del features

In [9]:
strSQL = 'SELECT * FROM data d INNER JOIN features f ON d.id = f.id;'
data = pd.read_sql(strSQL, cxn)

num_images = data.shape[0]
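The join above can be tried on a small scale without touching the real packages. The following is a minimal sketch, assuming an in-memory SQLite database and two toy tables whose values are invented for illustration. It also names the selected columns explicitly, since `SELECT *` on a join returns the `id` column from both tables.

```python
import sqlite3

import pandas as pd

# Toy stand-ins for the AVA info table and the feature table (hypothetical values).
info = pd.DataFrame({"id": [1, 2, 3], "Class": [0, 1, 1]})
feats = pd.DataFrame({"id": [2, 3, 4], "var1": [0.1, 0.2, 0.3]})

# An in-memory database keeps the example self-contained.
demo_cxn = sqlite3.connect(":memory:")
info.to_sql(name="data", con=demo_cxn, if_exists="replace", index=False)
feats.to_sql(name="features", con=demo_cxn, if_exists="replace", index=False)

# Naming the columns explicitly avoids the duplicated `id` column
# that `SELECT *` would return for a join.
query = (
    "SELECT d.id, d.Class, f.var1 "
    "FROM data d INNER JOIN features f ON d.id = f.id "
    "ORDER BY d.id;"
)
joined = pd.read_sql(query, demo_cxn)
demo_cxn.close()

print(joined)
```

Only the images present in both tables survive the inner join, so the row count of the result can be smaller than either input.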

In [10]:
data

Unnamed: 0,index,line,id,vote1,vote2,vote3,vote4,vote5,vote6,vote7,...,var247,var248,var249,var250,var251,var252,var253,var254,var255,var256
0,0,1,953619,0,1,5,17,38,36,15,...,0.000129,0.000157,0.000103,0.000129,0.000129,0.000225,0.000209,0.000159,0.000267,0.000230
1,1,2,953958,10,7,15,26,26,21,10,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,2,3,954184,0,0,4,8,41,56,10,...,0.000142,0.000133,0.000094,0.000115,0.000136,0.000210,0.000488,0.001900,0.007170,0.042446
3,3,4,954113,0,1,4,6,48,37,23,...,0.000101,0.000146,0.000112,0.000110,0.000137,0.000099,0.000121,0.000212,0.000873,0.002445
4,4,5,953980,0,3,6,15,57,39,6,...,0.000266,0.000103,0.000052,0.000033,0.000033,0.000035,0.000014,0.000027,0.000022,0.000008
5,5,6,954175,0,0,5,13,40,53,14,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
6,6,7,953349,1,1,1,7,27,46,28,...,0.015285,0.032400,0.035461,0.025639,0.017521,0.015727,0.036219,0.013295,0.011787,0.005186
7,7,8,953645,0,0,0,8,33,51,27,...,0.000802,0.000515,0.000450,0.000526,0.000625,0.000601,0.000385,0.000258,0.000103,0.000007
8,8,9,953897,0,0,0,5,19,46,29,...,0.000068,0.000053,0.000060,0.000040,0.000018,0.000020,0.000020,0.000046,0.000126,0.000031
9,9,10,953841,0,0,3,8,37,44,22,...,0.000003,0.000000,0.000003,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


In [11]:
# take an explicit copy so that assigning the 'Class' column does not raise SettingWithCopyWarning
data_aux = data[np.append(features_names,['Class'])].copy()
data_aux['Class'] = pd.Categorical(data_aux['Class'],range(0,len(data_aux['Class'].unique())))
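The categorical conversion in the cell above declares the categories as `0..n-1`, where `n` is the number of distinct labels. A small sketch with invented labels shows what the resulting codes look like, and what happens if the labels do not start at zero (an assumption worth checking on real data).

```python
import pandas as pd

labels = pd.Series([0, 1, 1, 0, 2])  # hypothetical class labels

# Categories 0..n-1, built the same way as in the notebook cell; here n = 3.
as_cat = pd.Series(pd.Categorical(labels, categories=range(labels.nunique())))
codes = as_cat.cat.codes.tolist()
print(codes)

# A label outside the declared categories becomes NaN and gets code -1.
shifted = pd.Series(pd.Categorical([1, 2, 2], categories=range(2)))
shifted_codes = shifted.cat.codes.tolist()
print(shifted_codes)
```

If any code comes out as -1, the labels should be remapped before training, because the metrics later in the notebook read `cat.codes` directly.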


In [None]:
np.random.seed(1000)
num_folds = 5
folds = np.random.choice(range(0,num_images),replace=False,size=(num_folds,int(num_images/num_folds)))
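The fold matrix above has shape `(num_folds, num_images // num_folds)`, so up to `num_folds - 1` images are left out when `num_images` is not a multiple of `num_folds`. A possible alternative, sketched here with a hypothetical helper `make_folds`, shuffles all indices and splits them into nearly equal parts so that every image lands in exactly one fold.

```python
import numpy as np

def make_folds(num_images, num_folds, seed=1000):
    """Split a shuffled range of image indices into nearly equal folds."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(num_images)
    # array_split tolerates sizes that are not a multiple of num_folds,
    # so no index is dropped.
    return np.array_split(shuffled, num_folds)

folds_demo = make_folds(num_images=23, num_folds=5)

for i, test_idx in enumerate(folds_demo):
    # Training indices are all folds except the i-th one.
    train_idx = np.concatenate([f for j, f in enumerate(folds_demo) if j != i])
    print(i, len(train_idx), len(test_idx))
```

Because the folds can differ in length, they come back as a list of arrays rather than a 2-D array, and the training indices are gathered with `np.concatenate` instead of `np.delete(...).reshape(-1)`.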

In [None]:
results = {}
results['balanced']=0
results['AUC']=0
results['accuracy']=0

for i in range(0, num_folds):
    
    train_indices = np.delete(folds,i,axis=0).reshape(-1)
    train_indices = train_indices[utilities.balance_class(data_aux['Class'].cat.codes[train_indices])]
    
    test_indices = folds[i]
    
    if selected_model == 'NB':
        predictions = full_models.fullNB(data_aux, train_indices, test_indices)
        
    elif selected_model == 'AODE':
        predictions = full_models.fullAODE(data_aux, train_indices, test_indices)
    
    elif selected_model == 'NBG':
        predictions = full_models.fullNBG(data_aux, train_indices, test_indices, features_names, 'Class')
    
    elif selected_model == 'SVM':
        predictions = full_models.fullSVM(data_aux, train_indices, test_indices, features_names, 'Class')
        
    elif selected_model == 'ELM':
        predictions = full_models.fullELM(data_aux, train_indices, test_indices, features_names, 'Class')
        
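    elif selected_model == 'GBoost':
        predictions = full_models.fullGBoost(data_aux, train_indices, test_indices, features_names, 'Class')

    else:
        # fail early instead of reusing stale predictions from a previous fold
        raise ValueError('Unknown model: ' + selected_model)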
    
    results['balanced'] += utilsData.balanced_accuracy(data_aux['Class'].cat.codes[test_indices], predictions)
    results['AUC'] += roc_auc_score(data_aux['Class'].cat.codes[test_indices], predictions)
    results['accuracy'] += accuracy_score(data_aux['Class'].cat.codes[test_indices], (predictions >= 0.5).astype(int))
    
results['balanced'] /= num_folds
results['AUC'] /= num_folds
results['accuracy'] /= num_folds
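The loop accumulates three metrics per fold and then averages them. `utilsData.balanced_accuracy` is project code that is not shown in this notebook; the sketch below assumes the common definition, the mean of the per-class recalls, and uses a hypothetical helper `balanced_accuracy_demo` together with invented labels and scores. Scores are thresholded at 0.5, as in the accuracy line of the loop.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def balanced_accuracy_demo(y_true, y_pred):
    """Mean of the per-class recalls (an assumed definition, see the text above)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Hypothetical labels and classifier scores for one fold.
y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.6, 0.2, 0.8, 0.3])
y_pred = (scores >= 0.5).astype(int)

bal = balanced_accuracy_demo(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, scores)
print(bal, acc, auc)
```

On an imbalanced test fold, plain accuracy is dominated by the majority class, while the balanced variant weights both classes equally. That is presumably why both are reported.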

In [None]:
# the context manager closes (and flushes) the gzip file once the dump is done
with gzip.open(output_file, 'wb') as f:
    pickle.dump(results, f, 2)
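To check that a stored results file can be read back, a round trip such as the following may help. It is a sketch with placeholder values and a temporary path, not the real output file of this notebook.

```python
import gzip
import os
import pickle
import tempfile

# Placeholder metrics, not real results.
results_demo = {"balanced": 0.5, "AUC": 0.5, "accuracy": 0.5}

demo_path = os.path.join(tempfile.mkdtemp(), "demo.pklz")

# Write the dictionary as a gzip-compressed pickle (protocol 2, as in the notebook).
with gzip.open(demo_path, "wb") as fh:
    pickle.dump(results_demo, fh, protocol=2)

# Read it back and compare with the original.
with gzip.open(demo_path, "rb") as fh:
    loaded = pickle.load(fh)

print(loaded == results_demo)
```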