## Before running

A virtual environment can be created using 
- 'pipenv install'
- 'pipenv shell'

This will allow us to all use the same packages and versions. They are listed in the Pipfile

In [1]:
from refactoring import *

## Inputs

Dictionaries are taken as input from a parameter file, they contain the parameters for each soap descriptor

In [2]:
descDict1 = {'lower': 1, 'upper': 50, 'centres': '{8, 7, 6, 1, 16, 17, 9}',
             'neighbours': '{8, 7, 6, 1, 16, 17, 9}', 'mu': 0, 
             'mu_hat': 0, 'nu': 2, 'nu_hat': 0, 'mutation_chance': 0.50, 
             'min_cutoff': 1, 'max_cutoff': 50, 'min_sigma': 0.1, 
             'max_sigma': 0.9}

descDict2 = {'lower': 51, 'upper': 100, 'centres': '{8, 7, 6, 1, 16, 17, 9}',
             'neighbours': '{8, 7, 6, 1, 16, 17, 9}', 'mu': 0, 
             'mu_hat': 0, 'nu': 2, 'nu_hat': 0, 'mutation_chance': 0.50,
             'min_cutoff': 51, 'max_cutoff': 100, 'min_sigma': 1.1, 
             'max_sigma': 1.9}

Other parameters are also taken as input. These are automatically checked that the parameters are viable

In [3]:
num_gens = 100
best_sample, lucky_few, population_size, number_of_children = 4, 2, 12, 4
early_stop = 2
early_number = 3 
min_generations = 5

## GeneParameter

GeneParameter class is created from each descriptor dictionary. 

In [4]:
params1 = GeneParameters(**descDict1)
params2 = GeneParameters(**descDict2)

In [5]:
params1

GeneParameters(lower=1, upper=50, centres='{8, 7, 6, 1, 16, 17, 9}', neighbours='{8, 7, 6, 1, 16, 17, 9}', mu=0, mu_hat=0, nu=2, nu_hat=0, mutation_chance=0.5, min_cutoff=1, max_cutoff=50, min_sigma=0.1, max_sigma=0.9)

## GeneSet

We can use these classes to create a specific set of parameters that are consistant with these values. This returns a randomly generated GeneSet class

In [6]:
example_gene_set = params1.make_gene_set()
example_gene_set

GeneSet(39, 49, 6, 0.14)

We can get the parameters used to create the GeneSet class

In [7]:
example_gene_set.gene_parameters

GeneParameters(lower=1, upper=50, centres='{8, 7, 6, 1, 16, 17, 9}', neighbours='{8, 7, 6, 1, 16, 17, 9}', mu=0, mu_hat=0, nu=2, nu_hat=0, mutation_chance=0.5, min_cutoff=1, max_cutoff=50, min_sigma=0.1, max_sigma=0.9)

We can get a descriptor string to be used as an input for getting SOAPs

In [8]:
example_gene_set.get_soap_string()

'soap average cutoff=39 l_max=49 n_max=6 atom_sigma=0.14 n_Z=7 Z={8, 7, 6, 1, 16, 17, 9} n_species=7 species_Z={8, 7, 6, 1, 16, 17, 9} mu=0 mu_hat=0 nu=2 nu_hat=0'

We can also mutate the gene using the mutation chance in the GeneParameters class

In [9]:
print(f"Before mutation {example_gene_set}")
example_gene_set.mutate_gene()
print(f"After mutation {example_gene_set}")

Before mutation [39, 49, 6, 0.14]
After mutation [46, 49, 19, 0.14]


## Individual

An Individual is made up of a list of GeneSet classes.

In [10]:
example_gene_set_two = params2.make_gene_set()
gene_set_list = [example_gene_set, example_gene_set_two]
example_individual = Individual(gene_set_list)
example_individual

Individual(['GeneSet(46, 49, 19, 0.14)', 'GeneSet(96, 92, 53, 1.36)'])

Getting the score for an indivudual

In [11]:
example_individual.get_score()
example_individual.score

142

Breeding two individuals to create a child. Mutation is automatically performed during this

In [12]:
example_individual_two = Individual(gene_set_list)
print(f"Breeding {example_individual} with {example_individual_two}")
child = breed_individuals(example_individual, example_individual_two)
print(f"Created child {child}")

Breeding Individual(['[46, 49, 19, 0.14]', '[96, 92, 53, 1.36]']) with Individual(['[46, 49, 19, 0.14]', '[96, 92, 53, 1.36]'])
Created child Individual(['[46, 19, 42, 0.14]', '[96, 92, 71, 1.36]'])


## Population

A Population is a collection of Individual classes. This can be created using a list of GeneParameter classes

In [13]:
gene_parameters = [params1, params2]
pop = Population(best_sample, lucky_few, population_size, 
                 number_of_children, gene_parameters, 
                 maximise_scores = True)
pop

Population(4, 2, 12, 4, [GeneParameters(lower=1, upper=50, centres='{8, 7, 6, 1, 16, 17, 9}', neighbours='{8, 7, 6, 1, 16, 17, 9}', mu=0, mu_hat=0, nu=2, nu_hat=0, mutation_chance=0.5, min_cutoff=1, max_cutoff=50, min_sigma=0.1, max_sigma=0.9), GeneParameters(lower=51, upper=100, centres='{8, 7, 6, 1, 16, 17, 9}', neighbours='{8, 7, 6, 1, 16, 17, 9}', mu=0, mu_hat=0, nu=2, nu_hat=0, mutation_chance=0.5, min_cutoff=51, max_cutoff=100, min_sigma=1.1, max_sigma=1.9)], True)

To initialise the population

In [14]:
pop.initialise_population()

Initial population of size 12 generated


If you want a way of neatly seeing what individuals are in the population

In [15]:
pop.print_population()

Individual(['[41, 20, 25, 0.79]', '[59, 52, 77, 1.64]']) has a score of: 100
Individual(['[3, 5, 47, 0.12]', '[62, 89, 57, 1.44]']) has a score of: 65
Individual(['[10, 37, 19, 0.16]', '[64, 64, 81, 1.53]']) has a score of: 74
Individual(['[35, 2, 14, 0.54]', '[80, 56, 96, 1.72]']) has a score of: 115
Individual(['[36, 31, 18, 0.69]', '[57, 99, 66, 1.3]']) has a score of: 93
Individual(['[48, 42, 21, 0.24]', '[97, 86, 93, 1.63]']) has a score of: 145
Individual(['[44, 3, 13, 0.41]', '[77, 82, 96, 1.65]']) has a score of: 121
Individual(['[40, 13, 3, 0.82]', '[61, 89, 93, 1.79]']) has a score of: 101
Individual(['[7, 42, 37, 0.11]', '[81, 97, 76, 1.52]']) has a score of: 88
Individual(['[11, 14, 25, 0.7]', '[98, 79, 94, 1.83]']) has a score of: 109
Individual(['[19, 17, 17, 0.62]', '[98, 90, 76, 1.16]']) has a score of: 117
Individual(['[40, 17, 5, 0.59]', '[82, 54, 68, 1.13]']) has a score of: 122


The next generation can then be generated 

In [16]:
pop.next_generation()
pop.print_population()

Individual(['[19, 40, 13, 0.49]', '[88, 91, 85, 1.65]']) has a score of: 107
Individual(['[35, 17, 32, 0.41]', '[98, 84, 78, 1.66]']) has a score of: 133
Individual(['[48, 17, 21, 0.78]', '[60, 82, 68, 1.63]']) has a score of: 108
Individual(['[42, 15, 41, 0.41]', '[77, 90, 62, 1.65]']) has a score of: 119
Individual(['[40, 41, 27, 0.38]', '[82, 54, 97, 1.13]']) has a score of: 122
Individual(['[40, 42, 21, 0.59]', '[59, 86, 51, 1.8]']) has a score of: 99
Individual(['[18, 37, 18, 0.69]', '[84, 54, 66, 1.3]']) has a score of: 102
Individual(['[11, 31, 25, 0.64]', '[99, 99, 66, 1.15]']) has a score of: 110
Individual(['[31, 17, 44, 0.41]', '[87, 82, 76, 1.76]']) has a score of: 118
Individual(['[11, 31, 25, 0.7]', '[98, 63, 54, 1.3]']) has a score of: 109
Individual(['[36, 40, 48, 0.7]', '[98, 87, 66, 1.82]']) has a score of: 134
Individual(['[48, 13, 5, 0.13]', '[82, 86, 68, 1.63]']) has a score of: 130


So to run the full GA 

In [17]:
for _ in range(num_gens):
    pop.next_generation()
pop.print_population()

Individual(['[4, 48, 28, 0.26]', '[92, 85, 66, 1.45]']) has a score of: 96
Individual(['[23, 29, 43, 0.72]', '[78, 92, 68, 1.42]']) has a score of: 101
Individual(['[46, 33, 49, 0.26]', '[92, 79, 67, 1.45]']) has a score of: 138
Individual(['[46, 33, 29, 0.26]', '[93, 65, 83, 1.45]']) has a score of: 139
Individual(['[23, 33, 16, 0.33]', '[55, 98, 54, 1.13]']) has a score of: 78
Individual(['[16, 17, 23, 0.44]', '[65, 84, 70, 1.63]']) has a score of: 81
Individual(['[2, 17, 28, 0.44]', '[76, 54, 54, 1.7]']) has a score of: 78
Individual(['[4, 31, 31, 0.72]', '[58, 99, 61, 1.3]']) has a score of: 62
Individual(['[41, 33, 17, 0.7]', '[73, 97, 66, 1.7]']) has a score of: 114
Individual(['[3, 33, 25, 0.86]', '[78, 51, 91, 1.3]']) has a score of: 81
Individual(['[46, 33, 28, 0.71]', '[92, 65, 89, 1.71]']) has a score of: 138
Individual(['[41, 48, 42, 0.43]', '[78, 63, 54, 1.37]']) has a score of: 119


## BestHistory

BestHistory is a class to store the history and check convergence criteria. So the entire GA can be run, printed, and saved using the following code snippet:

In [18]:
hist = BestHistory(early_stop, early_number, min_generations)
pop = Population(best_sample, lucky_few, population_size, 
                 number_of_children, gene_parameters, 
                 maximise_scores = True)

pop.initialise_population()    
for gen in range(num_gens):
    if hist.converged:
        break
    print(f"Generation {gen}")
    pop.next_generation()
    hist.append(pop)
    print("-------")

Initial population of size 12 generated
Generation 0
Best Individual Individual(['[33, 48, 20, 0.54]', '[97, 91, 84, 1.17]']) with a score of 130 added to history
-------
Generation 1
Best Individual Individual(['[39, 40, 19, 0.64]', '[95, 61, 99, 1.51]']) with a score of 134 added to history
-------
Generation 2
Best Individual Individual(['[41, 26, 2, 0.83]', '[97, 95, 99, 1.14]']) with a score of 138 added to history
-------
Generation 3
Best Individual Individual(['[44, 49, 4, 0.27]', '[97, 61, 90, 1.63]']) with a score of 141 added to history
-------
Generation 4
Best Individual Individual(['[44, 30, 5, 0.5]', '[95, 64, 81, 1.19]']) with a score of 139 added to history
-------
Generation 5
Best Individual Individual(['[48, 43, 6, 0.5]', '[95, 93, 76, 1.31]']) with a score of 143 added to history
-------
Generation 6
Best Individual Individual(['[47, 30, 6, 0.42]', '[95, 93, 76, 1.35]']) with a score of 142 added to history
SOAP_GAS has converged
-------


There now exists the entire history of the best Individuals throughout each generation that can be saved and easily accessed. 

In [19]:
vars(hist)

{'history': [Individual(['GeneSet(33, 48, 20, 0.54)', 'GeneSet(97, 91, 84, 1.17)']),
  Individual(['GeneSet(39, 40, 19, 0.64)', 'GeneSet(95, 61, 99, 1.51)']),
  Individual(['GeneSet(41, 26, 2, 0.83)', 'GeneSet(97, 95, 99, 1.14)']),
  Individual(['GeneSet(44, 49, 4, 0.27)', 'GeneSet(97, 61, 90, 1.63)']),
  Individual(['GeneSet(44, 30, 5, 0.5)', 'GeneSet(95, 64, 81, 1.19)']),
  Individual(['GeneSet(48, 43, 6, 0.5)', 'GeneSet(95, 93, 76, 1.31)']),
  Individual(['GeneSet(47, 30, 6, 0.42)', 'GeneSet(95, 93, 76, 1.35)'])],
 'converged': True,
 'early_stop': 2,
 'early_number': 3,
 'min_generations': 5}

In [20]:
example_gene_set.cutoff = 5
example_gene_set.l_max = 3
example_gene_set.n_max = 6

In [21]:
test_ind = Individual([example_gene_set])
conf_s, data = get_conf()

conf_s file exists


In [22]:
test_ind.comp_soaps(conf_s, data)

Getting soap for Atoms(symbols='COC6NCOC4ClC7O2C2O2H18', pbc=False)
Getting soap for Atoms(symbols='C2ONC6OH9', pbc=False)
Getting soap for Atoms(symbols='C3SO2C6FCONC7NCF3OH14', pbc=False)
Getting soap for Atoms(symbols='C6NC6NC3OC7NOH25', pbc=False)
Getting soap for Atoms(symbols='C25H34O6', pbc=False)
Getting soap for Atoms(symbols='C3SCONC5O2H15', pbc=False)
Getting soap for Atoms(symbols='COC6OC2NC3OC12NOH26', pbc=False)
Getting soap for Atoms(symbols='C10N2C6SO2NCF3H14', pbc=False)
Getting soap for Atoms(symbols='C9ONCOCCl2ONO2H12', pbc=False)
Getting soap for Atoms(symbols='C2NC2NC22H28', pbc=False)
Getting soap for Atoms(symbols='C14ClOC6NCH26', pbc=False)
Getting soap for Atoms(symbols='C19ClNC2NCH17', pbc=False)
Getting soap for Atoms(symbols='C12OC10NOH27', pbc=False)
Getting soap for Atoms(symbols='C2OCSCONC3NCONFH10', pbc=False)
Getting soap for Atoms(symbols='C10OC8OH24', pbc=False)
Getting soap for Atoms(symbols='C9ONC6FC9FO2H21', pbc=False)
Getting soap for Atoms(symbol

In [23]:
test_ind.soaps.shape

(123, 3613)

In [68]:
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras import layers, optimizers, Model, backend
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.callbacks import EarlyStopping
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

In [29]:
def cv_split(individual, splits, repeats, random_state):
    """
    Returns split indices for train and test sets
    """
    cv = RepeatedKFold(n_splits = splits, n_repeats = repeats, random_state = random_state)
    for train_index, test_index in cv.split(soaps):
        print(train_index, test_index) 

In [31]:
def scaleData(train, test):
    scaler = MinMaxScaler()
    train_scaled = scaler.fit_transform(train)
    test_scaled = scaler.transform(test)

    return train_scaled, test_scaled, scaler

In [46]:
def buildModel(X):
    backend.clear_session()
    input_layer = Input(X.shape[1])
    hidden_layer = input_layer
    for layer in [30,30,30]:
        hidden_layer = Dense(layer, activation='relu')(hidden_layer)

    output_layer = Dense(units=1, activation='linear')(hidden_layer)
    model = Model(input_layer, output_layer)
    model.compile(loss='mean_squared_error', optimizer=optimizers.Adam(learning_rate=0.01), metrics=['mean_squared_error'])

    return model

In [70]:

def get_scores(Individual, train_index, test_index, scaling = None, **kwargs):
    estimator = buildModel(Individual.soaps)
    X_train, X_test, X_scaler = scaleData(Individual.soaps[train_index], Individual.soaps[test_index])
    y_train, y_test, y_scaler = scaleData(Individual.targets[train_index].reshape(-1,1), Individual.targets[test_index].reshape(-1,1))
    scores = scorerNN(estimator, X_train, X_test, y_train, y_test, y_scaler)
    print(scores[0])
    return scores

In [71]:
def scorerNN(estimator, X_train, X_test, y_train, y_test, y_scaler):
    """ Scoring function for use with NN regressor. Added by Matt. """

    callback = EarlyStopping(monitor='val_loss', patience=50)
    estimator.fit(X_train, y_train, callbacks=[callback], validation_split=0.1, epochs=200, verbose=False)
    y_test_pred, y_train_pred = estimator.predict(X_test, verbose=False), estimator.predict(X_train, verbose=False)
    y_test_pred, y_train_pred = y_scaler.inverse_transform(y_test_pred), y_scaler.inverse_transform(y_train_pred)
    y_test = np.ravel(y_test)
    y_train = np.ravel(y_train)
    testCorr = pearsonr(y_test, y_test_pred)[0]
    trainCorr = pearsonr(y_train, y_train_pred)[0]
    testMSE = mean_squared_error(y_test, y_test_pred)
    trainMSE =  mean_squared_error(y_train, y_train_pred)
    return (2 * (trainMSE * (1-trainCorr)) + (testMSE * (1-testCorr))), X_train, X_test, y_train, y_test, y_test_pred, y_train_pred

In [72]:
cv = RepeatedKFold(n_splits = 5, n_repeats = 2, random_state = 6)
for train_index, test_index in cv.split(test_ind.soaps):
    get_scores(test_ind, train_index, test_index)

[99645.56217025577]
[69876.66126100467]
[124741.35231907856]
[68401.65889012409]
[54596.799431133564]
[92641.5624158965]
[116602.60204956552]
[79126.32387857576]




[nan]
[110951.30853281723]


In [41]:
type(t.dtype)

numpy.dtype[float64]

In [44]:
def build_model(individual):
    if individual.targets.dtype == float:
        target_type = 'regression'
    elif individual.targets.dtype == (int or str):
        target_type = 'classification'
    estimator = 3
    return estimator, target_type

In [45]:
build_model(test_ind)

(3, 'reg')