#### Features:
1. Fraction of clauses that are unit clauses. <!-- exactly one literal -->
2. Fraction of clauses that are Horn clauses. <!-- at most one non-negated literal -->
3. Fraction of clauses that are ground Clauses. <!-- ? -->
4. Fraction of clauses that are demodulators. <!-- equality used as rule to rewrite newly inferred clause -->
5. Fraction of clauses that are rewrite rules (oriented demodulators). <!-- ? -->
6. Fraction of clauses that are purely positive.
7. Fraction of clauses that are purely negative.
8. Fraction of clauses that are mixed positive and negative.
9. Maximum clause length. <!-- number of literals -->
10. Average clause length.
11. Maximum clause depth. <!-- see below -->
12. Average clause depth.
13. Maximum clause weight. <!-- defined by prover; probably its symbol count, excluding commas, parentheses, negation symbols, and disjunction symbols -->
14. Average clause weight.

<!-- 
Depth of Term, Atom, Literal, Clause
* depth of variable, constant, or propositional atom: 0;
* depth of term or atom with arguments: one more than the maximum argument depth;
* depth of literal: depth of its atom (negation signs don't count);
* depth of clause: maximum of depths of literals;
* For example, p(x) | -p(f(x)) has depth 2.
-->

In [8]:
import pandas as pd

df = pd.read_csv("data/all-data-raw.csv", header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0.83307,0.99682,0.83307,0.76789,0,0.76948,0.069952,0.16057,6,1.2734,...,0.73684,0.00188,0.73872,0.073308,0.18797,-100.0,-100.0,-100.0,-100.0,-100.0
1,0.83307,0.99682,0.83307,0.76948,0,0.77107,0.068363,0.16057,6,1.2734,...,0.74248,0.00188,0.74436,0.067669,0.18797,0.08,0.08,0.2,0.08,0.08
2,0.83307,0.99682,0.83307,0.76789,0,0.76948,0.069952,0.16057,6,1.2734,...,0.7406,0.00188,0.74248,0.069549,0.18797,-100.0,-100.0,-100.0,-100.0,-100.0
3,0.83307,0.99682,0.83307,0.76789,0,0.76948,0.069952,0.16057,6,1.2734,...,0.72932,0.00188,0.7312,0.080827,0.18797,-100.0,-100.0,-100.0,-100.0,-100.0
4,0.83307,0.99682,0.83307,0.76789,0,0.76948,0.069952,0.16057,6,1.2734,...,0.7312,0.00188,0.73308,0.078947,0.18797,-100.0,-100.0,-100.0,-100.0,-100.0


In [9]:
h_times = df.iloc[:, -5:]
print(h_times.head())

import numpy as np
def best_heuristic(row):
    n_heuristics = 5
    h_times = row[-n_heuristics:].reset_index(drop=True)
    h_times.replace({-100.0 : np.nan}, inplace=True)
    idx, min_time = h_times.idxmin(), h_times.min()
    if np.isnan(min_time):
       return 0
    else:
       return idx+1

df['heuristic'] = df.apply(best_heuristic, axis=1)
df.drop([53, 54, 55, 56, 57], axis=1, inplace=True)
df.head()

       53      54     55      56      57
0 -100.00 -100.00 -100.0 -100.00 -100.00
1    0.08    0.08    0.2    0.08    0.08
2 -100.00 -100.00 -100.0 -100.00 -100.00
3 -100.00 -100.00 -100.0 -100.00 -100.00
4 -100.00 -100.00 -100.0 -100.00 -100.00


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,heuristic
0,0.83307,0.99682,0.83307,0.76789,0,0.76948,0.069952,0.16057,6,1.2734,...,0.020202,0.80639,0.99624,0.80263,0.73684,0.00188,0.73872,0.073308,0.18797,0
1,0.83307,0.99682,0.83307,0.76948,0,0.77107,0.068363,0.16057,6,1.2734,...,0.020202,0.80639,0.99624,0.80263,0.74248,0.00188,0.74436,0.067669,0.18797,1
2,0.83307,0.99682,0.83307,0.76789,0,0.76948,0.069952,0.16057,6,1.2734,...,0.020202,0.80639,0.99624,0.80263,0.7406,0.00188,0.74248,0.069549,0.18797,0
3,0.83307,0.99682,0.83307,0.76789,0,0.76948,0.069952,0.16057,6,1.2734,...,0.020202,0.80639,0.99624,0.80263,0.72932,0.00188,0.7312,0.080827,0.18797,0
4,0.83307,0.99682,0.83307,0.76789,0,0.76948,0.069952,0.16057,6,1.2734,...,0.020202,0.80639,0.99624,0.80263,0.7312,0.00188,0.73308,0.078947,0.18797,0


In [10]:
df['heuristic'].value_counts()

0    2554
1    1089
3     748
5     624
4     617
2     486
Name: heuristic, dtype: int64

In [11]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

X, y = df.drop(['heuristic'], axis=1).astype('float64'), df['heuristic']
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=44)

pipe = Pipeline([
    ('scaling', StandardScaler()),
    ('classifier', KNeighborsClassifier())
])

params = {
    'classifier__n_neighbors': range(1,15),
    'classifier__weights': ['uniform', 'distance']
}

kfold = StratifiedKFold(10, shuffle=True, random_state=42)
knn_grid = GridSearchCV(pipe, params, scoring='accuracy', cv=kfold, verbose=1, n_jobs=-1)
knn_grid.fit(X, y)

print(knn_grid.best_params_)

val_col_space = 20
print("{:{}}: {:.4f}".format("10-fold CV", val_col_space, knn_grid.best_score_))

Fitting 10 folds for each of 28 candidates, totalling 280 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    7.5s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   49.1s


{'classifier__n_neighbors': 12, 'classifier__weights': 'distance'}
10-fold CV          : 0.6036


[Parallel(n_jobs=-1)]: Done 280 out of 280 | elapsed:  1.3min finished


In [6]:
print(knn_grid.best_params_)
val_col_space = 20
print("{:{}}: {:.4f}".format("10-fold CV", val_col_space, knn_grid.best_score_))

{'classifier__n_neighbors': 12, 'classifier__weights': 'distance'}
10-fold CV          : 0.6036


In [7]:
knn_grid.cv_results_



{'mean_fit_time': array([0.1100785 , 0.03610344, 0.03349228, 0.03313601, 0.03274267,
        0.03311157, 0.03241754, 0.03190088, 0.03241916, 0.03449843,
        0.03517005, 0.03594844, 0.03356011, 0.03292785, 0.03286555,
        0.03289592, 0.0325963 , 0.03850722, 0.03231304, 0.03217304,
        0.03228445, 0.03587463, 0.03352423, 0.0330765 , 0.044292  ,
        0.03924787, 0.03848658, 0.04546318]),
 'std_fit_time': array([0.09040997, 0.00889914, 0.00270994, 0.00304311, 0.00239372,
        0.00203112, 0.00105066, 0.0016699 , 0.00127894, 0.00422062,
        0.00164387, 0.00651597, 0.0019609 , 0.00150714, 0.00197402,
        0.00046716, 0.001533  , 0.00865069, 0.00134937, 0.00154251,
        0.00224909, 0.01138787, 0.00272046, 0.0023571 , 0.01361918,
        0.00846211, 0.01015233, 0.01232112]),
 'mean_score_time': array([0.08051262, 0.06128943, 0.08200164, 0.08054371, 0.09369106,
        0.08880317, 0.09941392, 0.09473569, 0.10661414, 0.11513705,
        0.13452382, 0.11915903, 0.114005

In [8]:
results_knn = pd.DataFrame(knn_grid.cv_results_)



In [9]:
results_knn.head()
results_knn.drop(['params'], axis= 1)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_classifier__n_neighbors,param_classifier__weights,split0_test_score,split1_test_score,split2_test_score,split3_test_score,...,split2_train_score,split3_train_score,split4_train_score,split5_train_score,split6_train_score,split7_train_score,split8_train_score,split9_train_score,mean_train_score,std_train_score
0,0.110079,0.09041,0.080513,0.016833,1,uniform,0.587948,0.592834,0.587948,0.591205,...,0.978016,0.976381,0.973665,0.9753,0.977302,0.974582,0.975676,0.971869,0.97541,0.001669
1,0.036103,0.008899,0.061289,0.004037,1,distance,0.587948,0.592834,0.587948,0.591205,...,0.978016,0.976381,0.973665,0.9753,0.977302,0.974582,0.975676,0.971869,0.97541,0.001669
2,0.033492,0.00271,0.082002,0.004112,2,uniform,0.584691,0.565147,0.576547,0.592834,...,0.790153,0.791061,0.791137,0.788231,0.789541,0.789942,0.792521,0.790563,0.790436,0.001153
3,0.033136,0.003043,0.080544,0.007485,2,distance,0.589577,0.59772,0.587948,0.584691,...,0.977108,0.977108,0.974392,0.9753,0.976938,0.974946,0.974769,0.975862,0.975791,0.000936
4,0.032743,0.002394,0.093691,0.009028,3,uniform,0.568404,0.552117,0.578176,0.561889,...,0.75436,0.756177,0.758264,0.751181,0.75177,0.753631,0.755491,0.750998,0.753969,0.002185
5,0.033112,0.002031,0.088803,0.004387,3,distance,0.596091,0.59772,0.594463,0.605863,...,0.978743,0.978379,0.976934,0.977661,0.978028,0.976398,0.976765,0.975681,0.977244,0.000899
6,0.032418,0.001051,0.099414,0.005928,4,uniform,0.558632,0.568404,0.560261,0.565147,...,0.713118,0.71548,0.721032,0.713767,0.714545,0.715142,0.718279,0.711615,0.715666,0.002834
7,0.031901,0.00167,0.094736,0.003703,4,distance,0.59772,0.600977,0.586319,0.596091,...,0.978924,0.978379,0.977116,0.977842,0.978573,0.976761,0.976584,0.976225,0.977498,0.000859
8,0.032419,0.001279,0.106614,0.006878,5,uniform,0.561889,0.561889,0.547231,0.547231,...,0.690225,0.68968,0.697058,0.690883,0.693118,0.689179,0.693229,0.69147,0.69182,0.002299
9,0.034498,0.004221,0.115137,0.01497,5,distance,0.59772,0.600977,0.591205,0.591205,...,0.979106,0.978198,0.976934,0.977842,0.978573,0.976761,0.976584,0.976951,0.977589,0.000796


In [63]:
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([
    ('scaling', StandardScaler()),
    ('classifier', DecisionTreeClassifier())
])

params = {
    'classifier__criterion': ['gini', 'entropy']
}

kfold = StratifiedKFold(10, shuffle=True, random_state=42)
dt_grid = GridSearchCV(pipe, params, scoring='accuracy', cv=kfold, verbose=1, n_jobs=-1)
dt_grid.fit(X, y)

print(dt_grid.best_params_)

val_col_space = 20
print("{:{}}: {:.4f}".format("10-fold CV", val_col_space, dt_grid.best_score_))

Fitting 10 folds for each of 2 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    2.6s finished


{'classifier__criterion': 'gini'}
10-fold CV          : 0.5628


In [11]:
results_dt = pd.DataFrame(dt_grid.cv_results_)
results_dt.head()
results_dt.drop(['params'], axis= 1)



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_classifier__criterion,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,...,split2_train_score,split3_train_score,split4_train_score,split5_train_score,split6_train_score,split7_train_score,split8_train_score,split9_train_score,mean_train_score,std_train_score
0,0.214126,0.021468,0.001572,0.000744,gini,0.589577,0.555375,0.57329,0.552117,0.54085,...,0.979106,0.978743,0.977116,0.977842,0.978754,0.976943,0.976765,0.977132,0.977734,0.000804
1,0.436205,0.041683,0.001169,0.000277,entropy,0.558632,0.535831,0.530945,0.576547,0.517974,...,0.979106,0.978743,0.977116,0.977842,0.978754,0.976943,0.976765,0.977132,0.977734,0.000804


In [12]:
import numpy as np

corr = df.corr
print(df.drop("heuristic", axis=1).apply(lambda x: x.corr(df.heuristic)))

0    -0.051288
1    -0.067283
2     0.012292
3    -0.056761
4          NaN
5    -0.020025
6     0.006412
7     0.015327
8     0.042063
9     0.030478
10   -0.202562
11   -0.188363
12   -0.155318
13   -0.052812
14    0.040710
15   -0.002808
16   -0.012917
17   -0.045495
18   -0.191722
19   -0.177150
20   -0.069303
21   -0.047792
22   -0.053727
23   -0.010610
24   -0.032321
25   -0.091450
26   -0.087245
27   -0.104309
28   -0.057786
29   -0.082848
30   -0.084612
31    0.002486
32    0.040710
33   -0.030349
34         NaN
35    0.028871
36    0.070051
37    0.041048
38   -0.062701
39   -0.085992
40   -0.057328
41    0.019764
42   -0.062534
43   -0.039263
44   -0.033728
45   -0.058661
46    0.017486
47    0.018575
48   -0.064887
49   -0.055400
50   -0.046700
51    0.048953
52    0.031527
dtype: float64


Function definition based on reference: https://pyswarms.readthedocs.io/en/latest/examples/feature_subset_selection.html#using-binary-pso

In [66]:
def f_per_particle(m, alpha):
    if np.count_nonzero(m) == 0:
        X_subset = X
    else:
        X_subset = X[m]
    P = X_subset.apply(lambda x: x.corr(df.heuristic)).sum()
    return (alpha * (1.0 - P)
        + (1.0 - alpha) * (1 - (X_subset.shape[1] / len(X.columns))))

In [67]:
def f(x, alpha = 0.8):
    n_particles = x.shape[0]
    j = [f_per_particle(x[i], alpha) for i in range(n_particles)]
    return np.array(j)

In [68]:
import pyswarms as ps
from pyswarms.discrete import BinaryPSO

In [69]:
options = {'c1': 0.5, 'c2': 0.5, 'w':0.9, 'k': 30, 'p':2}
dimensions = 52 # dimensions should be the number of features

In [70]:
optimizer = ps.discrete.BinaryPSO(n_particles=30, dimensions=dimensions, options=options)

In [71]:
cost, pos = optimizer.optimize(f, print_step=100, iters=1000, verbose=2)

INFO:pyswarms.discrete.binary:Arguments Passed to Objective Function: {}
INFO:pyswarms.discrete.binary:Iteration 1/1000, cost: 3.1932794070817994
INFO:pyswarms.discrete.binary:Iteration 101/1000, cost: 3.0525275363493383
INFO:pyswarms.discrete.binary:Iteration 201/1000, cost: 3.0269362871252543
INFO:pyswarms.discrete.binary:Iteration 301/1000, cost: 3.0269362871252543
INFO:pyswarms.discrete.binary:Iteration 401/1000, cost: 3.0269362871252543
INFO:pyswarms.discrete.binary:Iteration 501/1000, cost: 3.0269362871252543
INFO:pyswarms.discrete.binary:Iteration 601/1000, cost: 3.014140662513213
INFO:pyswarms.discrete.binary:Iteration 701/1000, cost: 3.014140662513213
INFO:pyswarms.discrete.binary:Iteration 801/1000, cost: 3.014140662513213
INFO:pyswarms.discrete.binary:Iteration 901/1000, cost: 3.014140662513213
Optimization finished!
Final cost: 2.9885
Best value: [ 0.000000 0.000000 0.000000 ...]



In [72]:
print(pos)

[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [73]:
pipe = Pipeline([
    ('scaling', StandardScaler()),
    ('classifier', DecisionTreeClassifier())
])

params = {
    'classifier__criterion': ['gini', 'entropy']
}

kfold = StratifiedKFold(10, shuffle=True, random_state=42)
dt_grid = GridSearchCV(pipe, params, scoring='accuracy', cv=kfold, verbose=1, n_jobs=-1)
dt_grid.fit(X[pos], y)

print(dt_grid.best_params_)

val_col_space = 20
print("{:{}}: {:.4f}".format("10-fold CV", val_col_space, dt_grid.best_score_))

Fitting 10 folds for each of 2 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


{'classifier__criterion': 'gini'}
10-fold CV          : 0.5155


[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    2.2s finished


In [74]:
pipe = Pipeline([
    ('scaling', StandardScaler()),
    ('classifier', KNeighborsClassifier())
])

params = {
    'classifier__n_neighbors': range(1,15),
    'classifier__weights': ['uniform', 'distance']
}

kfold = StratifiedKFold(10, shuffle=True, random_state=42)
knn_grid = GridSearchCV(pipe, params, scoring='accuracy', cv=kfold, verbose=1, n_jobs=-1)
knn_grid.fit(X[pos], y)

print(knn_grid.best_params_)

val_col_space = 20
print("{:{}}: {:.4f}".format("10-fold CV", val_col_space, knn_grid.best_score_))

Fitting 10 folds for each of 28 candidates, totalling 280 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  76 tasks      | elapsed:    2.9s


{'classifier__n_neighbors': 14, 'classifier__weights': 'distance'}
10-fold CV          : 0.5159


[Parallel(n_jobs=-1)]: Done 280 out of 280 | elapsed:   10.7s finished


In [75]:
from sklearn.decomposition import PCA

In [77]:
pipe = Pipeline([
    ('scaling', StandardScaler()),
    ('pca', PCA(0.96)),
    ('classifier', KNeighborsClassifier())
])

params = {
    'classifier__n_neighbors': range(1,15),
    'classifier__weights': ['uniform', 'distance']
}

kfold = StratifiedKFold(10, shuffle=True, random_state=42)
knn_grid = GridSearchCV(pipe, params, scoring='accuracy', cv=kfold, verbose=1, n_jobs=-1)
knn_grid.fit(X[pos], y)

print(knn_grid.best_params_)

val_col_space = 20
print("{:{}}: {:.4f}".format("10-fold CV", val_col_space, knn_grid.best_score_))

Fitting 10 folds for each of 28 candidates, totalling 280 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  86 tasks      | elapsed:    1.6s


{'classifier__n_neighbors': 14, 'classifier__weights': 'distance'}
10-fold CV          : 0.5147


[Parallel(n_jobs=-1)]: Done 280 out of 280 | elapsed:    4.8s finished


In [78]:
pipe = Pipeline([
    ('scaling', StandardScaler()),
    ('pca', PCA(0.96)),
    ('classifier', DecisionTreeClassifier())
])

params = {
    'classifier__criterion': ['gini', 'entropy']
}

kfold = StratifiedKFold(10, shuffle=True, random_state=42)
dt_grid = GridSearchCV(pipe, params, scoring='accuracy', cv=kfold, verbose=1, n_jobs=-1)
dt_grid.fit(X[pos], y)

print(dt_grid.best_params_)

val_col_space = 20
print("{:{}}: {:.4f}".format("10-fold CV", val_col_space, dt_grid.best_score_))

Fitting 10 folds for each of 2 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


{'classifier__criterion': 'gini'}
10-fold CV          : 0.5126


[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    0.4s finished


In [79]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [81]:
pipe = Pipeline([
    ('scaling', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis()),
    ('classifier', KNeighborsClassifier())
])

params = {
    'classifier__n_neighbors': range(1,15),
    'classifier__weights': ['uniform', 'distance']
}

kfold = StratifiedKFold(10, shuffle=True, random_state=42)
knn_grid = GridSearchCV(pipe, params, scoring='accuracy', cv=kfold, verbose=1, n_jobs=-1)
knn_grid.fit(X[pos], y)

print(knn_grid.best_params_)

val_col_space = 20
print("{:{}}: {:.4f}".format("10-fold CV", val_col_space, knn_grid.best_score_))

Fitting 10 folds for each of 28 candidates, totalling 280 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  76 tasks      | elapsed:    1.8s


{'classifier__n_neighbors': 13, 'classifier__weights': 'distance'}
10-fold CV          : 0.5092


[Parallel(n_jobs=-1)]: Done 280 out of 280 | elapsed:    6.6s finished


In [82]:
pipe = Pipeline([
    ('scaling', StandardScaler()),
    ('lda', LinearDiscriminantAnalysis()),
    ('classifier', DecisionTreeClassifier())
])

params = {
    'classifier__criterion': ['gini', 'entropy']
}

kfold = StratifiedKFold(10, shuffle=True, random_state=42)
dt_grid = GridSearchCV(pipe, params, scoring='accuracy', cv=kfold, verbose=1, n_jobs=-1)
dt_grid.fit(X[pos], y)

print(dt_grid.best_params_)

val_col_space = 20
print("{:{}}: {:.4f}".format("10-fold CV", val_col_space, dt_grid.best_score_))

Fitting 10 folds for each of 2 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


{'classifier__criterion': 'gini'}
10-fold CV          : 0.5116


[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    0.5s finished
