# Aesthetic Classification

In this notebook we use several helper functions to build a model and obtain results from image descriptors.
It serves as an example for writing scripts that automatically generate the results for our paper.

## A bit of set up

We need numpy and pandas for data handling, pickle and gzip to read the extracted features and store the results, our own folder of helper code, and several models from scikit-learn.

In [1]:
# set up Python environment: numpy for numerical routines
import numpy as np
import pandas as pd

# for reading the extracted features and storing the results
import pickle
import gzip

# default models from scikit
from sklearn import svm
from sklearn.naive_bayes import GaussianNB

# our code (utilsData needs a view)
import sys
sys.path.append('../pycode/')
import utilsData

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

In [2]:
from preprocess.mdl import MDL_method
from preprocess.unsupervised import Unsupervised_method
from models.nb import Naive_Bayes

# GaussianNB is already imported in the first cell
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score, accuracy_score

## AVA dataset
We start with the AVA data. First, an info package must be loaded; it contains the votes, style features, labels and IDs. Then the readARFF function reads the ARFF file and extracts the features together with their IDs. Finally, both sources are merged on the ID.

In [3]:
features = utilsData.readARFF('../features/AVA/CHIST.arff')
output_file = '../results/NB_CHIST.pklz'
selected_model = 'NB'

In [4]:
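# open the info package in a context manager so the file handle is closed after loading
with gzip.open('../packages/AVA_info.pklz', 'rb') as info_file:
    data = pickle.load(info_file)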

# we take the names of the features and delete the ID
features_names = np.array(features.columns)
index = np.argwhere(features_names=='id')
features_names = np.delete(features_names, index)

data=pd.merge(data, features, on='id', how='right')
num_images = data.shape[0]

# to free space
del features
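
The merge in the cell above can be illustrated on toy data. This is only a sketch: the two small frames and their values are hypothetical stand-ins for the AVA info package and the ARFF features. It shows that `how='right'` keeps one row per feature row, so an image with features but no entry in the info package ends up with a missing `Class`.

```python
import pandas as pd

# Hypothetical stand-ins for the info package and the extracted features.
info = pd.DataFrame({'id': [1, 2, 3], 'Class': ['good', 'bad', 'good']})
feats = pd.DataFrame({'id': [2, 3, 4], 'var1': [0.1, 0.2, 0.3]})

# how='right' keeps every row of `feats`, matched or not.
merged = pd.merge(info, feats, on='id', how='right')

# Feature rows without a matching label carry NaN in 'Class'.
n_missing = int(merged['Class'].isna().sum())

print(merged)
print('rows without a label:', n_missing)
```

Checking `n_missing` on the real data before training may be worthwhile, since unlabeled rows would otherwise flow into the class encoding.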

In [5]:
# copy the selected columns so that assigning 'Class' does not write to a slice of `data`
data_aux = data[np.append(features_names, ['Class'])].copy()
data_aux['Class'] = pd.Categorical(data_aux['Class'], data_aux['Class'].unique())
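
The class encoding deserves a closer look, because the evaluation loop further down treats code 1 as the positive class. The following sketch, using hypothetical labels, shows that passing `unique()` as the category list assigns integer codes in order of first appearance, so which label becomes code 1 depends on the row order of the data.

```python
import pandas as pd

# Hypothetical labels; the real ones come from the 'Class' column.
labels = pd.Series(['bad', 'good', 'good', 'bad'])

# Categories are taken in order of first appearance, as in the cell above.
cat = pd.Series(pd.Categorical(labels, categories=labels.unique()))

codes = cat.cat.codes                          # integer code per row
mapping = dict(enumerate(cat.cat.categories))  # code -> label

print(list(codes))
print(mapping)
```

If a fixed encoding is needed, passing an explicit category list instead of `unique()` would make the mapping independent of row order.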


In [6]:
np.random.seed(1000)
num_folds = 5
# disjoint folds of equal size; the num_images % num_folds leftover images are not used
folds = np.random.choice(num_images, replace=False, size=(num_folds, num_images // num_folds))
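
To make the fold construction concrete, here is a small sketch with a hypothetical image count. Sampling without replacement into a `(num_folds, fold_size)` array yields disjoint folds of equal size; any remainder of the integer division is simply left out of the cross-validation.

```python
import numpy as np

num_images = 23   # hypothetical; the notebook uses data.shape[0]
num_folds = 5
fold_size = num_images // num_folds

np.random.seed(1000)
folds = np.random.choice(num_images, replace=False, size=(num_folds, fold_size))

# Each index appears at most once across all folds.
all_unique = np.unique(folds).size == folds.size

# Images that belong to no fold.
unused = num_images - folds.size

# Split for the first iteration: fold 0 is the test set, the rest is training.
test_indices = folds[0]
train_indices = np.delete(folds, 0, axis=0).reshape(-1)

print(folds.shape, all_unique, unused)
```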

In [10]:
results = {}
results['balanced']=0
results['AUC']=0
results['accuracy']=0

for i in range(0, num_folds):
    
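    # train on all folds except fold i, with classes balanced by utilsData.balance_class
    train_indices = np.delete(folds,i,axis=0).reshape(-1)
    train_indices = train_indices[utilsData.balance_class(data_aux['Class'].cat.codes[train_indices])]
    
    # fold i is held out for testing
    test_indices = folds[i]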
    
    if selected_model == 'NB':
    
        discretization = MDL_method()
        #discretization.frequency = True

        discretization.train(data_aux.loc[train_indices])
        data_fold = discretization.process(data_aux)
    
        model = Naive_Bayes()
        model.fit(data_fold.loc[train_indices])
    
        predictions =  model.predict_probs(data_fold.loc[test_indices])[1]
    
    elif selected_model == 'NBG':
        data_fold = data_aux.copy()
    
        model = GaussianNB()
        model.fit(data_fold.loc[train_indices,features_names],data_fold['Class'].cat.codes[train_indices])
        
        predictions =  model.predict_proba(data_fold.loc[test_indices,features_names])[:,1]
    
    elif selected_model == 'SVM':
        data_fold = data_aux.copy()
    
        model = LinearSVC()
        model.fit(data_fold.loc[train_indices,features_names],data_fold['Class'].cat.codes[train_indices])
        
        predictions =  model.predict(data_fold.loc[test_indices,features_names])
    
    results['balanced'] += utilsData.balanced_accuracy(data_fold['Class'].cat.codes[test_indices], predictions)
    results['AUC'] += roc_auc_score(data_fold['Class'].cat.codes[test_indices], predictions)
    results['accuracy'] += accuracy_score(data_fold['Class'].cat.codes[test_indices], (predictions >= 0.5).astype(int))
    
results['balanced'] /= num_folds
results['AUC'] /= num_folds
results['accuracy'] /= num_folds



var1
{'var1': array([            -inf,   7.81250000e-07,   2.44671500e-04,
         1.48244000e-03,   7.48682000e-03,   4.97043000e-02,
         9.81180500e-01,              inf])}
var2
{'var1': array([            -inf,   7.81250000e-07,   2.44671500e-04,
         1.48244000e-03,   7.48682000e-03,   4.97043000e-02,
         9.81180500e-01,              inf]), 'var2': array([            -inf,   1.75212000e-06,   1.27097000e-04,
         1.77058000e-03,   4.23098000e-03,   1.85491000e-01,
                    inf])}
var3
{'var1': array([            -inf,   7.81250000e-07,   2.44671500e-04,
         1.48244000e-03,   7.48682000e-03,   4.97043000e-02,
         9.81180500e-01,              inf]), 'var2': array([            -inf,   1.75212000e-06,   1.27097000e-04,
         1.77058000e-03,   4.23098000e-03,   1.85491000e-01,
                    inf]), 'var3': array([            -inf,   7.81250000e-07,   3.29429000e-05,
         3.10569000e-04,   9.39520500e-04,              inf])}
var4
{'var4

MemoryError: 
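
In the loop above, `accuracy_score` receives predictions thresholded at 0.5, while `utilsData.balanced_accuracy` receives the raw `predictions`. The helper's implementation is not shown here, so the following is only a sketch of one common definition (the mean of the per-class recalls) with the thresholding made explicit. The function name `balanced_accuracy_sketch` and the toy values are hypothetical and not part of `utilsData`.

```python
import numpy as np

def balanced_accuracy_sketch(y_true, y_score, threshold=0.5):
    """Mean per-class recall after thresholding scores (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    # Recall of class c: fraction of true-c samples that were predicted as c.
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy example with an imbalanced test set.
y_true = [0, 0, 0, 1]
y_score = [0.1, 0.2, 0.7, 0.9]

balanced = balanced_accuracy_sketch(y_true, y_score)
plain = float(np.mean((np.asarray(y_score) >= 0.5).astype(int) == np.asarray(y_true)))

print(balanced, plain)
```

On imbalanced data the two numbers differ, which is the reason for reporting both.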

In [None]:
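# write inside a context manager so the gzip stream is flushed and closed
with gzip.open(output_file, 'wb') as results_file:
    pickle.dump(results, results_file, 2)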

In [None]:
results
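
For reference, a stored results file can be read back in the same way it was written. The sketch below does a round trip through a temporary file; the path and the metric values are placeholders, not results from this notebook.

```python
import gzip
import os
import pickle
import tempfile

# Placeholder values only; they are not results from the experiments.
results = {'balanced': 0.5, 'AUC': 0.5, 'accuracy': 0.5}

# Hypothetical location; the notebook writes to `output_file` instead.
path = os.path.join(tempfile.mkdtemp(), 'example.pklz')

# Write the dictionary as a gzip-compressed pickle (protocol 2, as above).
with gzip.open(path, 'wb') as handle:
    pickle.dump(results, handle, protocol=2)

# Read it back and confirm nothing was lost.
with gzip.open(path, 'rb') as handle:
    restored = pickle.load(handle)

print(restored)
```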