## Before running

A virtual environment can be created by running:
- `pipenv install`
- `pipenv shell`

This ensures that everyone uses the same packages and versions, which are listed in the Pipfile.

In [1]:
from refactoring import *

## Inputs

Dictionaries are read from a parameter file. Each one contains the parameters for a single SOAP descriptor.

In [2]:
descDict1 = {'lower': 1, 'upper': 50, 'centres': '{8, 7, 6, 1, 16, 17, 9}',
             'neighbours': '{8, 7, 6, 1, 16, 17, 9}', 'mu': 0, 
             'mu_hat': 0, 'nu': 2, 'nu_hat': 0, 'mutation_chance': 0.50, 
             'min_cutoff': 1, 'max_cutoff': 50, 'min_sigma': 0.1, 
             'max_sigma': 0.9,
             'message_steps': 8}

descDict2 = {'lower': 51, 'upper': 100, 'centres': '{8, 7, 6, 1, 16, 17, 9}',
             'neighbours': '{8, 7, 6, 1, 16, 17, 9}', 'mu': 0, 
             'mu_hat': 0, 'nu': 2, 'nu_hat': 0, 'mutation_chance': 0.50,
             'min_cutoff': 51, 'max_cutoff': 100, 'min_sigma': 1.1, 
             'max_sigma': 1.9,
             'message_steps': 8}
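
The layout of the parameter file is not shown in this notebook. As a minimal sketch, assuming a hypothetical JSON file holding one object per descriptor, the dictionaries could be loaded as follows. The file layout and the name `load_descriptor_dicts` are illustrative assumptions, not part of `refactoring`.

```python
import json

# Hypothetical parameter file contents; the real file format may differ.
PARAMETER_TEXT = """
[
  {"lower": 1, "upper": 50, "mutation_chance": 0.5,
   "min_cutoff": 1, "max_cutoff": 50, "min_sigma": 0.1, "max_sigma": 0.9},
  {"lower": 51, "upper": 100, "mutation_chance": 0.5,
   "min_cutoff": 51, "max_cutoff": 100, "min_sigma": 1.1, "max_sigma": 1.9}
]
"""

def load_descriptor_dicts(text):
    """Parse JSON text into a list of descriptor dictionaries."""
    dicts = json.loads(text)
    if not all(isinstance(d, dict) for d in dicts):
        raise ValueError("each descriptor entry must be a JSON object")
    return dicts

desc_dicts = load_descriptor_dicts(PARAMETER_TEXT)
print(len(desc_dicts), desc_dicts[0]["upper"])
```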

Other parameters are also taken as input and are automatically checked for viability.

In [3]:
num_gens = 100
best_sample, lucky_few, population_size, number_of_children = 4, 2, 12, 4
early_stop = 2
early_number = 3 
min_generations = 5
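
The checks themselves are not shown here. One plausible consistency rule, stated as an assumption rather than the package's actual check, is that the selected parents must be able to refill the population: `(best_sample + lucky_few) / 2 * number_of_children == population_size`. The values above satisfy it, since (4 + 2) / 2 * 4 = 12. The function name below is hypothetical.

```python
def check_population_parameters(best_sample, lucky_few,
                                population_size, number_of_children):
    """Hypothetical viability check for the GA population settings."""
    parents = best_sample + lucky_few
    if parents % 2 != 0:
        raise ValueError("best_sample + lucky_few must be even to form pairs")
    if (parents // 2) * number_of_children != population_size:
        raise ValueError("parent pairs times children must equal population_size")
    return True

print(check_population_parameters(4, 2, 12, 4))
```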

## GeneParameters

A GeneParameters object is created from each descriptor dictionary.

In [4]:
params1 = GeneParameters(**descDict1)
params2 = GeneParameters(**descDict2)

In [5]:
params1

GeneParameters(lower=1, upper=50, centres='{8, 7, 6, 1, 16, 17, 9}', neighbours='{8, 7, 6, 1, 16, 17, 9}', mu=0, mu_hat=0, nu=2, nu_hat=0, mutation_chance=0.5, min_cutoff=1, max_cutoff=50, min_sigma=0.1, max_sigma=0.9, message_steps=8)
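
The repr above is consistent with a dataclass-style container, although the class definition is not shown. A minimal sketch with a reduced, hypothetical set of fields illustrates how `**` unpacking maps dictionary keys onto keyword arguments. `SketchParameters` and its validation rules are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class SketchParameters:
    """Illustrative stand-in for GeneParameters with fewer fields."""
    lower: int
    upper: int
    mutation_chance: float

    def __post_init__(self):
        # Reject bounds and probabilities that cannot be sampled from
        if self.lower > self.upper:
            raise ValueError("lower must not exceed upper")
        if not 0.0 <= self.mutation_chance <= 1.0:
            raise ValueError("mutation_chance must lie in [0, 1]")

sketch_dict = {'lower': 1, 'upper': 50, 'mutation_chance': 0.50}
sketch_params = SketchParameters(**sketch_dict)
print(sketch_params)
```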

## GeneSet

We can use these objects to create a specific set of parameters that is consistent with these bounds. `make_gene_set` returns a randomly generated GeneSet.

In [6]:
example_gene_set = params1.make_gene_set()
example_gene_set

GeneSet(18, 23, 29, 0.51)
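
How `make_gene_set` draws its values is not shown. The sketch below assumes that the four values are the cutoff, `l_max`, `n_max` and `atom_sigma` (the order that appears in the SOAP string printed further down), that `lower`/`upper` bound `l_max` and `n_max`, and that each value is drawn uniformly. All of these are assumptions.

```python
import random

def make_gene_values(lower, upper, min_cutoff, max_cutoff,
                     min_sigma, max_sigma, rng=random):
    """Draw (cutoff, l_max, n_max, sigma) uniformly within the given bounds."""
    cutoff = rng.randint(min_cutoff, max_cutoff)   # inclusive on both ends
    l_max = rng.randint(lower, upper)
    n_max = rng.randint(lower, upper)
    sigma = round(rng.uniform(min_sigma, max_sigma), 2)
    return cutoff, l_max, n_max, sigma

# A seeded generator keeps the draw reproducible
values = make_gene_values(1, 50, 1, 50, 0.1, 0.9, rng=random.Random(0))
print(values)
```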

We can get the parameters used to create the GeneSet class

In [7]:
example_gene_set.gene_parameters

GeneParameters(lower=1, upper=50, centres='{8, 7, 6, 1, 16, 17, 9}', neighbours='{8, 7, 6, 1, 16, 17, 9}', mu=0, mu_hat=0, nu=2, nu_hat=0, mutation_chance=0.5, min_cutoff=1, max_cutoff=50, min_sigma=0.1, max_sigma=0.9, message_steps=8)

We can get a descriptor string to be used as an input for getting SOAPs

In [8]:
example_gene_set.get_soap_string()

'soap cutoff=18 l_max=23 n_max=29 atom_sigma=0.51 n_Z=7 Z={8, 7, 6, 1, 16, 17, 9} n_species=7 species_Z={8, 7, 6, 1, 16, 17, 9} mu=0 mu_hat=0 nu=2 nu_hat=0'
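
The string layout can be read off the output above. A sketch of how such a string might be assembled, assuming `n_Z` and `n_species` are simply the number of entries in the centre and neighbour sets (the function name is hypothetical):

```python
def build_soap_string(cutoff, l_max, n_max, sigma, centres, neighbours,
                      mu=0, mu_hat=0, nu=2, nu_hat=0):
    """Assemble a SOAP descriptor string; centres/neighbours are '{...}' strings."""
    n_centres = len(centres.strip('{}').split(','))
    n_neighbours = len(neighbours.strip('{}').split(','))
    return (f"soap cutoff={cutoff} l_max={l_max} n_max={n_max} "
            f"atom_sigma={sigma} n_Z={n_centres} Z={centres} "
            f"n_species={n_neighbours} species_Z={neighbours} "
            f"mu={mu} mu_hat={mu_hat} nu={nu} nu_hat={nu_hat}")

species = '{8, 7, 6, 1, 16, 17, 9}'
soap_string = build_soap_string(18, 23, 29, 0.51, species, species)
print(soap_string)
```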

We can also mutate the GeneSet in place, using the mutation chance stored in its GeneParameters

In [9]:
print(f"Before mutation {example_gene_set}")
example_gene_set.mutate_gene()
print(f"After mutation {example_gene_set}")

Before mutation [18, 23, 29, 0.51]
After mutation [27, 23, 29, 0.74]
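
The mutation rule is not spelled out. In the output above two of the four genes changed, which is consistent with each gene being redrawn independently with probability `mutation_chance`. The sketch below implements that assumed rule; the function name and the bounds list are illustrative.

```python
import random

def mutate_values(values, bounds, mutation_chance, rng=random):
    """Redraw each gene within its bounds with probability mutation_chance."""
    mutated = []
    for value, (low, high) in zip(values, bounds):
        if rng.random() < mutation_chance:
            if isinstance(value, int):
                value = rng.randint(low, high)
            else:
                value = round(rng.uniform(low, high), 2)
        mutated.append(value)
    return mutated

genes = [18, 23, 29, 0.51]
gene_bounds = [(1, 50), (1, 50), (1, 50), (0.1, 0.9)]
print(mutate_values(genes, gene_bounds, 0.5, rng=random.Random(1)))
```

Unlike `mutate_gene`, this sketch returns a new list rather than modifying the gene set in place.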


## Individual

An Individual is made up of a list of GeneSet objects.

In [10]:
example_gene_set_two = params2.make_gene_set()
gene_set_list = [example_gene_set, example_gene_set_two]
example_individual = Individual(gene_set_list)
example_individual

Individual(['GeneSet(27, 23, 29, 0.74)', 'GeneSet(68, 95, 67, 1.57)'])

The score for an individual is computed with `get_score` and stored in `score`

In [11]:
example_individual.get_score()
example_individual.score

95

Two individuals can be bred to create a child. Mutation is performed automatically during breeding

In [12]:
example_individual_two = Individual(gene_set_list)
print(f"Breeding {example_individual} with {example_individual_two}")
child = breed_individuals(example_individual, example_individual_two)
print(f"Created child {child}")

Breeding Individual(['[27, 23, 29, 0.74]', '[68, 95, 67, 1.57]']) with Individual(['[27, 23, 29, 0.74]', '[68, 95, 67, 1.57]'])
Created child Individual(['[19, 28, 25, 0.74]', '[72, 74, 67, 1.57]'])
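
The crossover used by `breed_individuals` is not shown. A common choice, used here purely as an assumption, is uniform crossover, where each gene of the child is copied from one of the two parents with equal probability. Mutation is left out of this sketch, and the second parent's values are made up.

```python
import random

def crossover(parent_a, parent_b, rng=random):
    """Uniform crossover: each gene comes from either parent with equal chance."""
    return [rng.choice(pair) for pair in zip(parent_a, parent_b)]

parent_a = [27, 23, 29, 0.74]
parent_b = [40, 47, 43, 0.48]   # illustrative second parent
child_genes = crossover(parent_a, parent_b, rng=random.Random(2))
print(child_genes)
```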


## Population

A Population is a collection of Individual objects. It is created from a list of GeneParameters objects

In [13]:
gene_parameters = [params1, params2]
pop = Population(best_sample, lucky_few, population_size, 
                 number_of_children, gene_parameters, 
                 maximise_scores = True)
pop

Population(4, 2, 12, 4, [GeneParameters(lower=1, upper=50, centres='{8, 7, 6, 1, 16, 17, 9}', neighbours='{8, 7, 6, 1, 16, 17, 9}', mu=0, mu_hat=0, nu=2, nu_hat=0, mutation_chance=0.5, min_cutoff=1, max_cutoff=50, min_sigma=0.1, max_sigma=0.9, message_steps=8), GeneParameters(lower=51, upper=100, centres='{8, 7, 6, 1, 16, 17, 9}', neighbours='{8, 7, 6, 1, 16, 17, 9}', mu=0, mu_hat=0, nu=2, nu_hat=0, mutation_chance=0.5, min_cutoff=51, max_cutoff=100, min_sigma=1.1, max_sigma=1.9, message_steps=8)], True)

To initialise the population

In [14]:
pop.initialise_population()

Initial population of size 12 generated


To see neatly which individuals are in the population

In [15]:
pop.print_population()

Individual(['[40, 47, 43, 0.48]', '[57, 90, 74, 1.71]']) has a score of: 97
Individual(['[15, 38, 22, 0.47]', '[65, 94, 62, 1.16]']) has a score of: 80
Individual(['[1, 33, 33, 0.41]', '[93, 56, 97, 1.65]']) has a score of: 94
Individual(['[44, 19, 5, 0.67]', '[74, 55, 87, 1.82]']) has a score of: 118
Individual(['[42, 4, 17, 0.32]', '[63, 56, 56, 1.77]']) has a score of: 105
Individual(['[9, 27, 23, 0.39]', '[64, 70, 83, 1.37]']) has a score of: 73
Individual(['[33, 42, 22, 0.63]', '[99, 94, 61, 1.53]']) has a score of: 132
Individual(['[32, 45, 29, 0.45]', '[94, 73, 81, 1.73]']) has a score of: 126
Individual(['[32, 23, 38, 0.11]', '[84, 61, 84, 1.6]']) has a score of: 116
Individual(['[15, 19, 25, 0.4]', '[82, 62, 96, 1.88]']) has a score of: 97
Individual(['[42, 21, 26, 0.5]', '[59, 57, 78, 1.5]']) has a score of: 101
Individual(['[35, 38, 36, 0.11]', '[99, 63, 85, 1.65]']) has a score of: 134


The next generation can then be generated 

In [16]:
pop.next_generation()
pop.print_population()

Individual(['[9, 47, 23, 0.53]', '[76, 70, 89, 1.71]']) has a score of: 85
Individual(['[34, 47, 23, 0.39]', '[57, 90, 67, 1.71]']) has a score of: 91
Individual(['[32, 45, 20, 0.45]', '[87, 73, 81, 1.82]']) has a score of: 119
Individual(['[23, 47, 38, 0.89]', '[57, 70, 71, 1.37]']) has a score of: 80
Individual(['[32, 45, 23, 0.2]', '[79, 73, 87, 1.72]']) has a score of: 111
Individual(['[27, 42, 36, 0.41]', '[72, 94, 99, 1.35]']) has a score of: 99
Individual(['[30, 48, 13, 0.17]', '[77, 88, 61, 1.53]']) has a score of: 107
Individual(['[35, 38, 22, 0.11]', '[99, 63, 91, 1.53]']) has a score of: 134
Individual(['[40, 18, 25, 0.39]', '[81, 64, 57, 1.52]']) has a score of: 121
Individual(['[24, 36, 29, 0.67]', '[88, 89, 87, 1.73]']) has a score of: 112
Individual(['[32, 47, 5, 0.77]', '[94, 53, 81, 1.82]']) has a score of: 126
Individual(['[23, 42, 22, 0.77]', '[99, 56, 60, 1.53]']) has a score of: 122
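
What `next_generation` does internally is not shown. The parameter names suggest a familiar scheme: keep the `best_sample` top scorers, add `lucky_few` randomly chosen others, pair them up and breed `number_of_children` per pair. Under that assumption, 6 parents form 3 pairs and 3 * 4 = 12 children restore the population size. The selection step could be sketched like this (the function name is hypothetical):

```python
import random

def select_parents(scored, best_sample, lucky_few, maximise=True, rng=random):
    """Keep the best_sample top scorers plus lucky_few random others."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=maximise)
    best = ranked[:best_sample]
    lucky = rng.sample(ranked[best_sample:], lucky_few)
    parents = best + lucky
    rng.shuffle(parents)            # randomise who is paired with whom
    return parents

scores = [97, 80, 94, 118, 105, 73, 132, 126, 116, 97, 101, 134]
scored = list(enumerate(scores))    # (individual index, score)
parents = select_parents(scored, 4, 2, rng=random.Random(3))
print(sorted(score for _, score in parents))
```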


To run the full GA for `num_gens` generations

In [17]:
for _ in range(num_gens):
    pop.next_generation()
pop.print_population()

Individual(['[49, 18, 10, 0.21]', '[72, 70, 78, 1.13]']) has a score of: 121
Individual(['[44, 17, 34, 0.39]', '[94, 85, 75, 1.89]']) has a score of: 138
Individual(['[44, 49, 10, 0.48]', '[51, 72, 95, 1.68]']) has a score of: 95
Individual(['[27, 6, 15, 0.88]', '[67, 92, 81, 1.67]']) has a score of: 94
Individual(['[32, 46, 48, 0.52]', '[71, 74, 64, 1.82]']) has a score of: 103
Individual(['[45, 6, 23, 0.64]', '[61, 74, 64, 1.83]']) has a score of: 106
Individual(['[1, 13, 27, 0.34]', '[80, 85, 64, 1.59]']) has a score of: 81
Individual(['[47, 49, 28, 0.17]', '[94, 85, 70, 1.59]']) has a score of: 141
Individual(['[40, 6, 37, 0.8]', '[64, 53, 88, 1.37]']) has a score of: 104
Individual(['[39, 24, 38, 0.36]', '[92, 74, 81, 1.83]']) has a score of: 131
Individual(['[22, 46, 33, 0.34]', '[66, 53, 92, 1.87]']) has a score of: 88
Individual(['[39, 19, 19, 0.13]', '[79, 76, 75, 1.87]']) has a score of: 118


## BestHistory

BestHistory is a class that stores the best individual of each generation and checks the convergence criteria. The entire GA can be run, with the best individual of each generation printed and stored, using the following code snippet:

In [18]:
hist = BestHistory(early_stop, early_number, min_generations)
pop = Population(best_sample, lucky_few, population_size, 
                 number_of_children, gene_parameters, 
                 maximise_scores = True)

pop.initialise_population()    
for gen in range(num_gens):
    if hist.converged:
        break
    print(f"Generation {gen}")
    pop.next_generation()
    hist.append(pop)
    print("-------")

Initial population of size 12 generated
Generation 0
Best Individual Individual(['[30, 29, 32, 0.13]', '[98, 61, 82, 1.57]']) with a score of 128 added to history
-------
Generation 1
Best Individual Individual(['[35, 49, 5, 0.12]', '[98, 61, 82, 1.77]']) with a score of 133 added to history
-------
Generation 2
Best Individual Individual(['[35, 17, 23, 0.9]', '[98, 93, 82, 1.77]']) with a score of 133 added to history
-------
Generation 3
Best Individual Individual(['[48, 17, 11, 0.68]', '[98, 93, 82, 1.63]']) with a score of 146 added to history
-------
Generation 4
Best Individual Individual(['[48, 14, 27, 0.47]', '[99, 61, 99, 1.22]']) with a score of 147 added to history
-------
Generation 5
Best Individual Individual(['[48, 42, 27, 0.51]', '[99, 93, 99, 1.63]']) with a score of 147 added to history
SOAP_GAS has converged
-------


The history now holds the best Individual from each generation, which can be saved and easily accessed.

In [19]:
vars(hist)

{'history': [Individual(['GeneSet(30, 29, 32, 0.13)', 'GeneSet(98, 61, 82, 1.57)']),
  Individual(['GeneSet(35, 49, 5, 0.12)', 'GeneSet(98, 61, 82, 1.77)']),
  Individual(['GeneSet(35, 17, 23, 0.9)', 'GeneSet(98, 93, 82, 1.77)']),
  Individual(['GeneSet(48, 17, 11, 0.68)', 'GeneSet(98, 93, 82, 1.63)']),
  Individual(['GeneSet(48, 14, 27, 0.47)', 'GeneSet(99, 61, 99, 1.22)']),
  Individual(['GeneSet(48, 42, 27, 0.51)', 'GeneSet(99, 93, 99, 1.63)'])],
 'converged': True,
 'early_stop': 2,
 'early_number': 3,
 'min_generations': 5}
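
The convergence rule inside BestHistory is not shown. One rule that reproduces the log above, offered only as an assumption, is: once at least `min_generations` entries exist, stop when the last `early_number` best scores differ by no more than `early_stop`. The function name is hypothetical.

```python
def has_converged(scores, early_stop, early_number, min_generations):
    """Hypothetical rule: the last early_number best scores span at most early_stop."""
    if len(scores) < max(min_generations, early_number):
        return False
    recent = scores[-early_number:]
    return max(recent) - min(recent) <= early_stop

# Best scores taken from the log above
best_scores = [128, 133, 133, 146, 147, 147]
for n in range(1, len(best_scores) + 1):
    print(n, has_converged(best_scores[:n], 2, 3, 5))
```

With these scores the check first succeeds on the sixth entry (146, 147, 147), matching the point at which the run above reports convergence.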

In [20]:
example_gene_set.cutoff = 5
example_gene_set.l_max = 3
example_gene_set.n_max = 6

In [21]:
test_ind = Individual([example_gene_set])
conf_s, data = get_conf()

conf_s file exists


In [22]:
conf_s

[Atoms(symbols='COC6NCOC4ClC7O2C2O2H18', pbc=False),
 Atoms(symbols='C2ONC6OH9', pbc=False),
 Atoms(symbols='C3SO2C6FCONC7NCF3OH14', pbc=False),
 Atoms(symbols='C6NC6NC3OC7NOH25', pbc=False),
 Atoms(symbols='C25H34O6', pbc=False),
 Atoms(symbols='C3SCONC5O2H15', pbc=False),
 Atoms(symbols='COC6OC2NC3OC12NOH26', pbc=False),
 Atoms(symbols='C10N2C6SO2NCF3H14', pbc=False),
 Atoms(symbols='C9ONCOCCl2ONO2H12', pbc=False),
 Atoms(symbols='C2NC2NC22H28', pbc=False),
 Atoms(symbols='C14ClOC6NCH26', pbc=False),
 Atoms(symbols='C19ClNC2NCH17', pbc=False),
 Atoms(symbols='C12OC10NOH27', pbc=False),
 Atoms(symbols='C2OCSCONC3NCONFH10', pbc=False),
 Atoms(symbols='C10OC8OH24', pbc=False),
 Atoms(symbols='C9ONC6FC9FO2H21', pbc=False),
 Atoms(symbols='C2OCOC2NC9Cl2CO2C3H19', pbc=False),
 Atoms(symbols='C14FCO2H13', pbc=False),
 Atoms(symbols='C7O2C3O2NC7NC2ClH17', pbc=False),
 Atoms(symbols='C23H28ClN3O5S', pbc=False),
 Atoms(symbols='CNC6SO2NSO2NClH8', pbc=False),
 Atoms(symbols='C5OC14OCO2COH30', p

In [23]:
test_ind.comp_soaps(conf_s, data)

Mismatch in MP and SOAP matrix shapes for:  Atoms(symbols='C3NCNC2NCNCNOCPO3H14', pbc=False)
SOAP shape of: (32, 3613)
MP matrix shape of: (33, 33)
The SOAP string is: soap cutoff=5 l_max=3 n_max=6 atom_sigma=0.74 n_Z=7 Z={8, 7, 6, 1, 16, 17, 9} n_species=7 species_Z={8, 7, 6, 1, 16, 17, 9} mu=0 mu_hat=0 nu=2 nu_hat=0


In [24]:
test_ind.soaps.shape

(40, 3613)
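
The 3613 columns can be accounted for if the descriptor is a SOAP power spectrum over all species and radial channels plus one extra constant component. That reading is an assumption and is not stated in this notebook. With 7 species and `n_max=6` there are 42 channels, giving 42 * 43 / 2 = 903 unique channel pairs, times `l_max + 1 = 4` angular terms, plus 1.

```python
def soap_power_spectrum_length(n_species, n_max, l_max, extra=1):
    """Length of a SOAP power spectrum vector under the stated assumption."""
    channels = n_species * n_max                 # radial functions per species
    pairs = channels * (channels + 1) // 2       # unique channel pairs
    return pairs * (l_max + 1) + extra

print(soap_power_spectrum_length(7, 6, 3))
```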

In [25]:
import os
# Set before TensorFlow is imported so that the log level takes effect
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import mean_squared_error
from scipy.stats import pearsonr
from tensorflow.keras import layers, optimizers, Model, backend
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.callbacks import EarlyStopping

In [26]:
def cv_split(individual, splits, repeats, random_state):
    """
    Returns a list of (train_index, test_index) pairs for repeated k-fold CV
    """
    cv = RepeatedKFold(n_splits = splits, n_repeats = repeats, random_state = random_state)
    # Split on the rows of the individual's SOAP matrix
    return list(cv.split(individual.soaps))

In [27]:
def scaleData(train, test):
    scaler = MinMaxScaler()
    train_scaled = scaler.fit_transform(train)
    test_scaled = scaler.transform(test)

    return train_scaled, test_scaled, scaler
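
The scaler is fitted on the training split only, so scaled test values can fall outside [0, 1]. A dependency-free sketch of the same min-max arithmetic, with made-up numbers and hypothetical helper names, makes this visible. It is an illustration, not a replacement for `MinMaxScaler`.

```python
def min_max_fit(rows):
    """Return per-column minima and ranges computed from the training rows."""
    columns = list(zip(*rows))
    mins = [min(column) for column in columns]
    # A constant column has zero range; use 1.0 to avoid division by zero
    ranges = [(max(column) - min(column)) or 1.0 for column in columns]
    return mins, ranges

def min_max_apply(rows, mins, ranges):
    """Scale rows with statistics obtained from min_max_fit."""
    return [[(v - m) / r for v, m, r in zip(row, mins, ranges)] for row in rows]

train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
test = [[12.0, 15.0]]
mins, ranges = min_max_fit(train)
train_scaled = min_max_apply(train, mins, ranges)
test_scaled = min_max_apply(test, mins, ranges)
print(train_scaled)
print(test_scaled)   # the first test value lies above the training maximum
```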

In [28]:
def buildModel(X):
    backend.clear_session()
    input_layer = Input(shape=(X.shape[1],))  # one input per SOAP feature
    hidden_layer = input_layer
    for layer in [30,30,30]:
        hidden_layer = Dense(layer, activation='relu')(hidden_layer)

    output_layer = Dense(units=1, activation='linear')(hidden_layer)
    model = Model(input_layer, output_layer)
    model.compile(loss='mean_squared_error', optimizer=optimizers.Adam(learning_rate=0.01), metrics=['mean_squared_error'])

    return model

In [29]:

def get_scores(individual, train_index, test_index, scaling = None, **kwargs):
    # Build a fresh network sized to the SOAP feature dimension
    estimator = buildModel(individual.soaps)
    # Scale features and targets with statistics from the training split only
    X_train, X_test, X_scaler = scaleData(individual.soaps[train_index], individual.soaps[test_index])
    y_train, y_test, y_scaler = scaleData(individual.targets[train_index].reshape(-1,1), individual.targets[test_index].reshape(-1,1))
    scores = scorerNN(estimator, X_train, X_test, y_train, y_test, y_scaler)
    print(scores[0])
    return scores

In [34]:
def scorerNN(estimator, X_train, X_test, y_train, y_test, y_scaler):
    """ Scoring function for use with NN regressor. Added by Matt. """

    callback = EarlyStopping(monitor='val_loss', patience=50)
    estimator.fit(X_train, y_train, callbacks=[callback], validation_split=0.1, epochs=200, verbose=False)
    # Predict on both splits and map the predictions back to the original target units
    y_test_pred, y_train_pred = estimator.predict(X_test, verbose=False), estimator.predict(X_train, verbose=False)
    y_test_pred, y_train_pred = np.ravel(y_scaler.inverse_transform(y_test_pred)), np.ravel(y_scaler.inverse_transform(y_train_pred))
    # NOTE: y_test and y_train are still in scaled units here, so the MSE terms compare
    # scaled targets with unscaled predictions; inverse-transform the targets as well
    # if errors in the original units are intended
    y_test = np.ravel(y_test)
    y_train = np.ravel(y_train)
    testCorr = pearsonr(y_test, y_test_pred)[0]
    trainCorr = pearsonr(y_train, y_train_pred)[0]
    testMSE = mean_squared_error(y_test, y_test_pred)
    trainMSE =  mean_squared_error(y_train, y_train_pred)
    # Each MSE is weighted by (1 - Pearson r); the training term counts twice
    score = 2 * (trainMSE * (1-trainCorr)) + (testMSE * (1-testCorr))
    return score, X_train, X_test, y_train, y_test, y_test_pred, y_train_pred
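
The first element returned by `scorerNN` combines error and correlation: each MSE is multiplied by (1 - r), where r is the Pearson correlation, and the training term is counted twice. A dependency-free sketch of that arithmetic follows. The helper names and the numbers are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def mse(x, y):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def combined_score(y_train, y_train_pred, y_test, y_test_pred):
    """2 * trainMSE * (1 - trainCorr) + testMSE * (1 - testCorr)."""
    train_term = mse(y_train, y_train_pred) * (1 - pearson(y_train, y_train_pred))
    test_term = mse(y_test, y_test_pred) * (1 - pearson(y_test, y_test_pred))
    return 2 * train_term + test_term

y_true = [1.0, 2.0, 3.0, 4.0]
y_close = [1.1, 1.9, 3.2, 3.8]
y_far = [2.0, 1.0, 4.0, 2.5]
print(combined_score(y_true, y_close, y_true, y_close))
print(combined_score(y_true, y_far, y_true, y_far))
```

A perfect prediction gives a score of zero, and the score grows as the error rises or the correlation falls.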

In [35]:
cv = RepeatedKFold(n_splits = 5, n_repeats = 2, random_state = 6)
for train_index, test_index in cv.split(test_ind.soaps):
    get_scores(test_ind, train_index, test_index)

214403.34070760768
190655.05887630995
111676.89074412467
198327.0521714487
124938.06866380203
80274.06370105426
268812.504723915
464457.1047346276
237376.088376651
85861.69157290537


In [41]:
# type(t.dtype)

In [38]:
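def build_model(individual):
    # Choose the task type from the dtype of the targets
    kind = individual.targets.dtype.kind
    if kind == 'f':            # floating-point targets
        target_type = 'regression'
    elif kind in 'iuUS':       # integer or string targets
        target_type = 'classification'
    else:
        raise ValueError(f"Unsupported target dtype: {individual.targets.dtype}")
    estimator = 3  # placeholder value
    return estimator, target_type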

In [39]:
build_model(test_ind)

(3, 'regression')