## Self developed LGP System for RegularExpression Generation for Entity Extraction ##
### Developed by: Jacob Caurdy for CSE 891 Intro to Genetic Programming ###
Notes on RegExProgram Class which is adapated from the TreeGP implementation described in [this paper](https://arts.units.it/retrieve/handle/11368/2758954/57751/2014-Computer-AutomaticSynthesisRegexExamples%20%282%29.pdf)
1. Removed concatenation operator since, all it would do is add another register, since we are managing the # of registers as an hyperparameter and all registers/genes are concatenated at compile time
2. Need to experiment with '|' or operator

Data for this program was created through [this paper](https://aclanthology.org/W09-3302/) and downloaded from https://github.com/juand-r/entity-recognition-datasets/tree/master/data/wikigold


In [2]:
import re
import copy
import time
import random
import numpy as np
import string
from typing import List
from nltk import edit_distance

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [3]:
from utils import generate_word_dataset, generate_sentence_dataset, generate_sentence_dataset_v2

** A few different size datasets **
For word datasts: Each target is a string (at most a single word) and the label is either the word or an empty string
For sentence datasets: Each target is a sentence string and each label is a named entity (can be null) that exists in the sentence


In [4]:
SINGLE_TARGETS, SINGLE_LABELS = generate_word_dataset(['I-PER', 'I-ORG', 'I-LOC', 'I-MISC'], 1)
non_null_labels = [label for label in SINGLE_LABELS if label != '']
WORD_TARGET_FIVE, WORD_LABEL_FIVE = generate_word_dataset(['I-PER', 'I-ORG', 'I-LOC', 'I-MISC'], 5)
non_null_labels_five = [label for label in WORD_LABEL_FIVE if label != '']

In [24]:
print('Num of single word targets in one doc:', len(SINGLE_TARGETS))
print('Proportion of non-null labels:', len(non_null_labels)/len(SINGLE_LABELS))
print('Num of single word targets in five docs:', len(WORD_TARGET_FIVE))
print('Proportion of non-null labels:', len(non_null_labels_five)/len(WORD_LABEL_FIVE))

Num of single word targets in one doc: 516
Proportion of non-null labels: 0.2616279069767442
Num of single word targets in five docs: 3358
Proportion of non-null labels: 0.18254913639070874


In [8]:
TARGETS, NAMED_ENTITIES = generate_sentence_dataset(['I-PER', 'I-ORG', 'I-LOC', 'I-MISC'], 1)
TARGET_NAMES_3DOCS, PERSON_ENTITIES =  generate_sentence_dataset(['I-PER'], 3)
TARGETS_10DOCS, LABELS_10DOCS = generate_sentence_dataset(['I-PER'], 10)
LOCATION_TARGETS_10, LOCATION_LABELS_10 = generate_sentence_dataset(['I-LOC'], 10)
ALLTARGET_10, AllLabels_10 = generate_sentence_dataset(['I-PER', 'I-ORG', 'I-LOC', 'I-MISC'], 10)

In [26]:
avg_length_sml = sum([len(s) for s in TARGET_NAMES_3DOCS])/len(TARGET_NAMES_3DOCS)
avg_length_med = sum([len(s) for s in TARGETS_10DOCS])/len(TARGETS_10DOCS)
print('Number of examples small named dataset', len(NAMED_ENTITIES), 'Average sentence length (in chars):', int(avg_length_sml))
print('Number of examples medium named dataset', len(LABELS_10DOCS), 'Average sentence length (in chars):', int(avg_length_med))

Number of examples small named dataset 69 Average sentence length (in chars): 394
Number of examples medium named dataset 115 Average sentence length (in chars): 275


In [5]:
class RegExProgram:
    """
    Represents an individual RegEx Expression in our population. Each individual consists of sequential genes which represent a part of the total RegEx expression
    Genes are NOT restricted to be same length, however they all start at initGeneLength at birth
    """
    __slots__ = ['operandSet', 'operatorSet', 'genes', 'geneLengths', 'initGeneLength', 'operatorRate', 'fitness', 'targets', 'labels']

    def __init__(self, parentProgram=None, operandSet=None, operatorSet=None, numGenes=1,
                 initGeneLength=5, operatorRate=0.5, targets=List[str], labels=List[str]):
        if parentProgram:
            self.operandSet = parentProgram.operandSet
            self.operatorSet = parentProgram.operatorSet
            self.genes = copy.deepcopy(parentProgram.genes)
            self.initGeneLength = parentProgram.initGeneLength
            self.operatorRate = parentProgram.operatorRate
            self.targets = parentProgram.targets
            self.labels = parentProgram.labels
        else:
            if operandSet is None:
                # 'c1...c2' represent a list of possible operands to pick one from and not a range which is denoted by 'c1-c2'
                # represent these as single item since it creates equal probability for selection of each 'category' of operands
                self.operandSet = {'a...z','A...Z','0...9', '!...?', 'a-z', 'A-Z', '0-9', '\w', '\d', '.', ' '}
            if operatorSet is None:
                # c denotes an operand
                self.operatorSet = {'c*+', 'c++', 'c?+', '(c)', '[c]', '[^c]', '|', }
            self.genes = ['' for _ in range(numGenes)]
            self.initGeneLength = initGeneLength
            self.operatorRate = operatorRate
            self._generate_regex()
            self.targets = targets
            self.labels = labels

        self.fitness = float('inf')
        self.run_fitness()

    def __str__(self):
        return "".join(self.genes)

    def __len__(self):
        return len(str(self))

    def __lt__(self, other):
        return str(self) < str(other)

    def __gt__(self, other):
        return str(self) > str(other)

    def __eq__(self, other):
        return str(self) == str(other)

    def _get_operand(self) -> str:
        """
        return a random operand from the set
        """
        opchoice = random.choice(tuple(self.operandSet))
        if opchoice == 'a...z':
            return random.choice(string.ascii_lowercase)
        elif opchoice == 'A...Z':
            return random.choice(string.ascii_uppercase)
        elif opchoice == '0...9':
            return random.choice(string.digits)
        elif opchoice == '!...?':
            return random.choice(string.punctuation)
        return opchoice

    def generate_gene(self, geneIndex) -> str:
        """
        Generate a gene, can be used for initializing or mutation
        """
        gene = ''
        while len(gene) < self.initGeneLength:
            getOperand = random.random() >= self.operatorRate
            if getOperand or len(gene) >= self.initGeneLength or gene == '':
                operand = self._get_operand()
                gene += operand
            else:
                operator = random.choice(tuple(self.operatorSet))
                if operator == '|':
                    self.genes.insert(geneIndex+1, '|')
                else:
                    gene = operator.replace('c', gene)


        self.genes[geneIndex] = gene

    def _generate_regex(self):
        """
        randomly initialize all genes in the program
        """
        for i, _ in enumerate(self.genes):
            self.generate_gene(i)

    def mutate(self):
        """
        In place mutation
        Mutate a random gene either by random initialization or deletion
        """
        # either change a gene, or delete one
        geneIndex = random.choice(range(len(self.genes)))
        if len(self.genes) < 2 or random.random() <= 0.80:
            self.generate_gene(geneIndex)
        else:
            self.genes.pop(geneIndex)
        self.run_fitness()

    def crossover(self, other):
        """
        In place crossover
        swap two genes in the same random spot between two programs
        """
        len1 = len(self.genes)
        len2 = len(other.genes)
        shortlen = len1 if len1 <= len2 else len2
        geneindex = random.choice(range(shortlen))
        self.genes[geneindex], other.genes[geneindex] = other.genes[geneindex], self.genes[geneindex]
        self.run_fitness()
        other.run_fitness()

    def run_fitness(self, verbose=False) -> float:
        """
        sets fitness of the program, a float
        The score is a weighted linear combination of the fitness functions:
            1. Levenschtein distance
            2. The length of the regular expression
        """
        pattern = self.isValid()
        score = 0
        if not pattern:
            self.fitness = float('inf')
        else:
            if verbose:
                print('Genome:', str(self))
            for i, target in enumerate(self.targets):
                result = pattern.search(target)
                result = result.group() if result else ''
                score += edit_distance(result, self.labels[i])
                if verbose:
                    print('Label:', self.labels[i], 'Result: ', result)
            self.fitness = score #+ len(self)

    def run(self,pattern:re.Pattern,  target:str) -> List[str]:
        """
        Return matching strings from target, if not Valid return None
        """
        pass

    def isValid(self) -> bool:
        """
        Return re pattern object from compiling, if compile error return None
        """
        try:
            pattern = re.compile(str(self))
        except re.error:
            return None
        return pattern

In [9]:
class RegExPopulation:
    """
    Represents a population of regex programs attempting to evolve to a task
    """
    def __init__(self, populationSize:int=500, generationLimit:int=50, targets:List[str]=TARGET_NAMES_3DOCS, labels:List[str]=PERSON_ENTITIES,
                 numGeneMean=2, numGeneVar=2/3, geneLengthMean=4, geneLengthVar=1, tourneySize=7, randomProp=0.1, mutantProp=0.1):
        self.populationSize = populationSize
        self.generationLimit = generationLimit
        self.targets = targets
        self.labels = labels
        self.numGeneMean = numGeneMean
        self.numGeneVar = numGeneVar
        self.geneLengthMean = geneLengthMean
        self.geneLengthVar = geneLengthVar
        self.tourneySize = tourneySize
        self.population = self.gen_population(populationSize)
        # crossover Proportion determined by 1 - randomProp - mutantProp
        self.randomProp = randomProp # proportion of intermediate generation that is randomly generated
        self.mutantProp = mutantProp # proportion of intermediate generation that is mutated from current generation (w/ tourney selection for parent)

    def __str__(self):
        return str([str(idv) for idv in self.population])

    def __getitem__(self, indices):
        return self.population[indices]

    def gen_population(self, popSize:int=1):
        pop = []
        geneLengths = np.rint(np.random.normal(loc=self.geneLengthMean, scale=self.geneLengthVar, size=popSize))
        numGenes = np.rint(np.random.normal(loc=self.numGeneMean, scale=self.numGeneVar, size=popSize))
        for i in range(popSize):
            geneLength = int(geneLengths[i]) if geneLengths[i] > 0 else 1
            numGene = int(numGenes[i]) if numGenes[i] > 0 else 1
            pop.append(RegExProgram(numGenes=geneLength, initGeneLength=numGene, targets=self.targets, labels=self.labels))

        return pop

    def tourney_selection(self, individuals):
        best_fitness = float('inf')
        best_idv = None
        for idv in individuals:
            if idv.fitness < best_fitness:
                best_fitness = idv.fitness
                best_idv = idv
        return best_idv if best_idv else individuals[0]

    def evolve(self, verbose=False):
        prevtopfive_fitness = float('inf')
        count = 0
        for gen in range(self.generationLimit):
            time1 = time.perf_counter()
            # generate an intermediate population (kids) composed of 10% random, 10% mutations after tourney selection, 80% crossover after tourney selection (for both)
            # take top 100 individual from current and intermediate population
            kids = self.gen_population(int(self.populationSize * self.randomProp))
            # add mutators
            while len(kids) < self.populationSize * (self.mutantProp + self.randomProp):
                sample = random.sample(self.population, self.tourneySize)
                kid = RegExProgram(self.tourney_selection(sample))
                kid.mutate()
                kids.append(kid)

            timeMutate = time.perf_counter()

            # add crossovers
            while len(kids) < self.populationSize * (1 - self.mutantProp - self.randomProp):
                sample1 = random.sample(self.population, self.tourneySize)
                sample2 = random.sample(self.population, self.tourneySize)
                kid1 = RegExProgram(parentProgram=self.tourney_selection(sample1))
                kid2 = RegExProgram(parentProgram=self.tourney_selection(sample2))
                kid1.crossover(kid2)
                kids.append(kid1)
                kids.append(kid2)

            #timekids = time.perf_counter()

            # get top individuals from current and intermediate for next generation
            fitness_individual_list = []
            for idv in self.population:
                fitness_individual_list.append((idv.fitness, idv))

            for idv in kids:
                fitness_individual_list.append((idv.fitness, idv))

            fitness_individual_list.sort()
            self.population = []
            while len(self.population) < self.populationSize:
                self.population.append(fitness_individual_list[len(self.population)][1])

            best_score = fitness_individual_list[0][0]
            time2 = time.perf_counter()

            # print('Gen:', gen)
            # print('\t Time to generate mutants:', timeMutate - time1)
            # print('\t Time to generate crossovers:', timekids - timeMutate)
            # print('\t Time to generate kids:', timekids - time1)
            # print('\t Time to get next population', time2 - timekids)
            best_five_idvs = ['{}, {}'.format(fitness, str(idv)) for fitness, idv in fitness_individual_list[:5]]
            print("Generation {:d}, Fitness: {:.2f}, Best Genome: {}, Time Elapsed {:.2f}".format(gen, best_score, self.population[0], time2-time1))
            if verbose:
                print('\tBest five: ', best_five_idvs)
            curtopfive_fitness = sum([fitness for fitness, _ in fitness_individual_list[:5]]) / 5
            if curtopfive_fitness >= prevtopfive_fitness:
                count += 1
            else:
                count = 0
            prevtopfive_fitness = curtopfive_fitness
            if best_score == 0 or count == 5:
                break

Trying to optimize operandRate to minimize number of invalid RE's generated. At first, I realized is the initial amount of invalid RE's will approach operatorRate due to empty operands getting inserted into initialized Genes, i.e. ('[]','()',etc.). However, we need enough operators to induce diversity. I changed the code so that the initialed genes always start with an operand, so we cannot have an invalid RE as a result of an empty operator. This reduces our invalid number significantly, around 15% on average. Im going to stick with 50% operand rate.

In [7]:
p = RegExProgram(numGenes=1000, initGeneLength=4, operatorRate=0.50, targets=TARGET_NAMES_3DOCS, labels=PERSON_ENTITIES)
count = 0
for gene in p.genes:
    try:
        re.compile(gene)
    except:
        print(gene, end=' ')
        count += 1
print('Number of invalid REs generated:', count )

YP2?+ 0-9*+ .*+F [ ]*+  ?+a-z \dF?+ ( ++) [9]++ M)ka-z .*+  V*+) \w++ V*+*+  *+\w '*+. q\d?+ r?+0-9 [<]++ I*+A-Z a?+A-Z  *+= A-Z++ R?+*+  *+7 \d*+ \d?+ J?+++ \d*+ m*+= R++0 (a-z (6++) g\w++ [3]++ a-z++ a-z?+ [z]*+ 8\d?+ a-z++ 0-9*+  *+  [8]*+ .A ++ \d*+ n*+% A-Z*+ O\d*+  7++ A-Z*+ #?+.  ++++ A-Z++ a-z?+ \we*+ \d++ b*+V \ya-z 0-9++ \d ++ \w?+  K.?+ a\d?+ `*+++ .?+3 83++ &.++ \dh++ \d++ A-Z*+ ( ?+)  .({ a-z*+ k++\d 0-9?+ o*+a-z H\w++ A-Z*+ \w++ \wY++ ,++*+ \w++ i++++ ?\d] 4++. A-Z?+ 3.*+ .?+h \w++ 5*+\d  ?+. A-Z*+ \w*+ A-Z++ P++A-Z *++w +.0-9 Fx)A-Z \w?+ (8)*+ 2*+S *0-9 a-z?+ \d*+ [C]*+ 3?+a-z \d++ \d*+ .?+  (.)++ a-z++ \d++ .\w?+ I*+. \d?+ A-Z?+ f++. \d*+ \d++ #7++ a-z++ n*+0 \dM?+ \d*+  ?+< (}[) A-Z*+ : *+ J++L \w*+ a-z?+ 0-9*+ 2?+  2++?+ a-z++  *+  a-z?+ ?0-9  ++. u*+, a-z*+ \w*+ 0-9*+ G++. .X6?+ 0-9?+ [N]?+ 0-9*+  ?+\d \w++ 9++++  ?+} \w ?+ )\d++  ++\d (F?+) \dG*+ a-z[ ( ++) 2++, 'IH?+ \w_?+ 0-9++ \w(*+  ++  ?0-9 g++E R.?+ \d++ .0G?+ a-z?+ A-Z*+ [.]*+ a-z++ )?+. .\w++  *+\d y++?+ (C)

**Testing out inheritance from parent**

In [8]:
p = RegExProgram(numGenes=2, initGeneLength=4, targets=TARGET_NAMES_3DOCS, labels=PERSON_ENTITIES)
p2 = RegExProgram(numGenes=3, initGeneLength=2, targets=TARGET_NAMES_3DOCS, labels=PERSON_ENTITIES)
print(p.genes, p.fitness, p2.genes, p2.fitness)
k = RegExProgram(p)
print('After birth:', k.genes, k.fitness)
k.mutate()
print('After mutation', k.genes, k.fitness)
k.crossover(p2)
print('After crossover:', k.genes, k.fitness)

['a-z.', '0-9\\d'] 94 ['\\d', '.?+', 'a-z'] inf
After birth: ['a-z.', '0-9\\d'] 94
After mutation ['a-z.', '{c\\d'] 94
After crossover: ['a-z.', '.?+'] inf


**Simple 500 member population evolved on medium dataset for I-PER labels**

In [9]:
population = RegExPopulation(populationSize=500, targets=TARGETS_10DOCS, labels=LABELS_10DOCS)
population.evolve()

Generation 0, Fitness: 1117.00, Best Genome: .. , Time Elapsed 1.77
Generation 1, Fitness: 1012.00, Best Genome: [^2]\w...a, Time Elapsed 2.88
Generation 2, Fitness: 1012.00, Best Genome: [^2]\w...a, Time Elapsed 3.51
Generation 3, Fitness: 1012.00, Best Genome: .\w...a, Time Elapsed 6.47
Generation 4, Fitness: 979.00, Best Genome: [^ ]\w.. , Time Elapsed 7.49
Generation 5, Fitness: 953.00, Best Genome: \w\w\w[\w]., Time Elapsed 8.69
Generation 6, Fitness: 942.00, Best Genome: \w\w\w..., Time Elapsed 7.68
Generation 7, Fitness: 933.00, Best Genome: [^2]\w\w.. \w, Time Elapsed 7.67
Generation 8, Fitness: 933.00, Best Genome: [^2]\w\w.. \w, Time Elapsed 8.33
Generation 9, Fitness: 933.00, Best Genome: [^2]\w\w.. \w, Time Elapsed 9.80
Generation 10, Fitness: 933.00, Best Genome: [^2]\w\w.. \w, Time Elapsed 9.91
Generation 11, Fitness: 933.00, Best Genome: [^2]\w\w.. \w, Time Elapsed 10.53
Generation 12, Fitness: 928.00, Best Genome: [^2]\w\w.. ., Time Elapsed 10.37
Generation 13, Fitness:

## **Lets see how it did on matching the labels on sentence datasets for Named Labels, Location Labels and All labels** ##

In [10]:
population.population[0].run_fitness(True)

Label:  Result:  tenth a
Label: Kojima Minoru Result:  ember K
Label:  Result:  139th w
Label: Frederick H. Collier Result:  erick H
Label: Lt. Gen. Jubal Early Result:  efeat L
Label: Philip Sheridan Result:  under P
Label: Sheridan Result:  orted S
Label: William T. Sherman Result:  pport W
Label:  Result:   1896 A
Label: John Greiner Result:   John G
Label: James W. Hoyt Result:  James W
Label: Edward Farr Result:  dward F
Label: William McLaughlin Result:  roner W
Label: George F. Hauser. Houser Result:  rator G
Label: Hauser Result:  efore H
Label: Hauser Result:  auser w
Label: Edward Farr Result:  dward F
Label: Edward D Result:  ineer E
Label: Farr Result:   Farr o
Label: George F. Hauser Result:  eorge F
Label: John Greiner Result:  ineer J
Label: Edward Farr Result:  dward F
Label: George F. Hauser Result:  eorge F
Label:  Result:   2007 B
Label: Gregg Brandon Result:  Gregg B
Label: Kory Lichtensteiger Result:  enior K
Label: Erique Dozier Result:  niors E
Label: Corey Partr

**Some attempt at Hyperparameter Tuning**
Gene Numbers and Gene Length

In [11]:
geneNumRange = [1, 2, 3, 4, 5]
geneLengthRange = [2, 4, 6, 8, 10]
best_members = []
for gene in geneNumRange:
    for length in geneLengthRange:
        print('-'*25, '\nNum Genes:', gene, 'Gene Length:', length, '\n', '-'*25)
        population = RegExPopulation(populationSize=500, targets=TARGETS_10DOCS, labels=LABELS_10DOCS)
        population.evolve()
        best_members.append(population[0])

------------------------- 
Num Genes: 1 Gene Length: 2 
 -------------------------
Generation 0, Fitness: 1133.00, Best Genome: \w ., Time Elapsed 1.03
Generation 1, Fitness: 1045.00, Best Genome: \w.. , Time Elapsed 1.86
Generation 2, Fitness: 990.00, Best Genome: \w+a.\w, Time Elapsed 3.65
Generation 3, Fitness: 983.00, Best Genome: \w..\w , Time Elapsed 6.44
Generation 4, Fitness: 937.00, Best Genome: \w+a.., Time Elapsed 7.64
Generation 5, Fitness: 937.00, Best Genome: \w+a.., Time Elapsed 9.73
Generation 6, Fitness: 937.00, Best Genome: \w+a.., Time Elapsed 7.43
Generation 7, Fitness: 937.00, Best Genome: \w+a.., Time Elapsed 7.95
Generation 8, Fitness: 937.00, Best Genome: \w+a.., Time Elapsed 9.22
Generation 9, Fitness: 937.00, Best Genome: \w+a.., Time Elapsed 8.37
Generation 10, Fitness: 937.00, Best Genome: \w+a.., Time Elapsed 7.37
------------------------- 
Num Genes: 1 Gene Length: 4 
 -------------------------
Generation 0, Fitness: 1096.00, Best Genome: \w\w\w, Time Elap

KeyboardInterrupt: 

**Look at all the best members of each population**

In [12]:
[(str(i), i.fitness) for i in best_members]

[('\\w+a..', 937),
 ('\\w*\\w.', 1000),
 ('....[^r] \\w', 924),
 ('\\w.\\w[^..] ', 982),
 ('\\w+.\\w+', 812),
 ('\\w+.. ', 954),
 ('..\\w. ...', 898),
 ('\\w \\w*\\w.\\w', 791),
 ('\\w+.. ', 954),
 ('[^ ]\\w\\w\\w\\w(.)', 914),
 ('\\w\\w\\w[\\w]....', 839),
 ('\\w.\\w. ..', 911),
 ('..\\w.. .\\w', 903),
 ('\\w\\w\\w\\w.\\w[^(]', 879),
 ('.. .[^rA]\\w', 1004),
 ('\\w*\\w.', 1000),
 ('\\w\\w[\\w]\\w..\\w', 888),
 ('..\\w.\\w .', 924),
 ('\\w.\\w\\w .', 951),
 ('\\w+.\\w[^.] ', 923)]

In [13]:
fitList = [i.fitness for i in best_members]
minFitIdx =  fitList.index(min(fitList))
best_members[minFitIdx].run_fitness(verbose=True)

Label:  Result:  0 is t
Label: Kojima Minoru Result:  r Kojima M
Label:  Result:  e 139th w
Label: Frederick H. Collier Result:  r was t
Label: Lt. Gen. Jubal Early Result:  l Early
Label: Philip Sheridan Result:  r Philip S
Label: Sheridan Result:  d Sheridan i
Label: William T. Sherman Result:  t William T
Label:  Result:  e 1896 A
Label: John Greiner Result:  n Greiner
Label: James W. Hoyt Result:  y of t
Label: Edward Farr Result:  f Edward F
Label: William McLaughlin Result:  r William M
Label: George F. Hauser. Houser Result:  r George F
Label: Hauser Result:  e Hauser c
Label: Hauser Result:  r was p
Label: Edward Farr Result:  d Farr
Label: Edward D Result:  r Edward D
Label: Farr Result:  r of t
Label: George F. Hauser Result:  n George F
Label: John Greiner Result:  r John G
Label: Edward Farr Result:  f Edward F
Label: George F. Hauser Result:  y have u
Label:  Result:  e 2007 B
Label: Gregg Brandon Result:  y Gregg B
Label: Kory Lichtensteiger Result:  r Kory L
Label: Eriqu

In [None]:
geneNumRange = [1, 2, 3, 4, 5]
geneLengthRange = [2, 4, 6, 8, 10]
best_members = []
for geneMean in geneNumRange:
    for length in geneLengthRange:
        print('-'*25, '\nMean Num Genes:', geneMean, 'Mean Gene Length:', length, '\n', '-'*25)
        population = RegExPopulation(populationSize=500, targets=LOCATION_TARGETS_10, labels=LOCATION_LABELS_10, numGeneMean=geneMean, geneLengthMean=length)
        population.evolve()
        best_members.append(population[0])

In [None]:
[(str(i), i.fitness) for i in best_members]

In [None]:
fitList = [i.fitness for i in best_members]
minFitIdx =  fitList.index(min(fitList))
best_members[minFitIdx].run_fitness(verbose=True)

In [None]:
geneNumRange = [1, 2, 3, 4, 5]
geneLengthRange = [2, 4, 6, 8, 10]
best_members = []
for geneMean in geneNumRange:
    for length in geneLengthRange:
        print('-'*25, '\nMean Num Genes:', geneMean, 'Mean Gene Length:', length, '\n', '-'*25)
        population = RegExPopulation(populationSize=500, targets=ALLTARGET_10, labels=AllLabels_10, numGeneMean=geneMean, geneLengthMean=length)
        population.evolve()
        best_members.append(population[0])

### Dicussion - Sentence Dataset Results ###
Well that didnt go so well, it more or less learned to match the first few words of the sentence since thats where the NER always was due to how I fabricated the dataset. Lets fix that but first try some learning on single word targets. We'll also change the proportion of intermediate generation to try to reduce quick convergence and introduce more diversity. <br> **NOTE:** For word datasets, I only evolve on datasets including all labels since the positive class is highly imbalanced if we only include partial labels.

In [20]:
best_members = []
for i in range(10):
    population = RegExPopulation(populationSize=500, targets=WORD_TARGET_FIVE, labels=WORD_LABEL_FIVE, mutantProp=0.25, randomProp=0.25)
    population.evolve()
    best_members.append(population[0])
# Generation 49, Fitness: 1912.00, Best Genome: [A-Z]\w\w[^(?+]|[^A-Z]\d\d[a-z], Time Elapsed 5.96

Generation 0, Fitness: 3288.00, Best Genome:    \d\d, Time Elapsed 4.49
Generation 1, Fitness: 3247.00, Best Genome: M, Time Elapsed 4.52
Generation 2, Fitness: 3247.00, Best Genome: M, Time Elapsed 3.45
Generation 3, Fitness: 3225.00, Best Genome: A\w\w, Time Elapsed 3.62
Generation 4, Fitness: 3206.00, Best Genome: M., Time Elapsed 5.90
Generation 5, Fitness: 3206.00, Best Genome: M., Time Elapsed 5.04
Generation 6, Fitness: 3206.00, Best Genome: M., Time Elapsed 5.31
Generation 7, Fitness: 3206.00, Best Genome: M., Time Elapsed 4.81
Generation 8, Fitness: 3206.00, Best Genome: M., Time Elapsed 6.77
Generation 9, Fitness: 3206.00, Best Genome: M., Time Elapsed 6.02
Generation 10, Fitness: 3206.00, Best Genome: M., Time Elapsed 5.70
Generation 11, Fitness: 3206.00, Best Genome: M., Time Elapsed 5.14
Generation 12, Fitness: 3206.00, Best Genome: M., Time Elapsed 5.58
Generation 13, Fitness: 3206.00, Best Genome: M., Time Elapsed 5.60
Generation 14, Fitness: 3206.00, Best Genome: M., Ti

In [21]:
best_members_normal = []
for i in range(10):
    population = RegExPopulation(populationSize=500, targets=WORD_TARGET_FIVE, labels=WORD_LABEL_FIVE)
    population.evolve()
    best_members_normal.append(population[0])

Generation 0, Fitness: 3234.00, Best Genome: P(.), Time Elapsed 8.78
Generation 1, Fitness: 3176.00, Best Genome: R\w\w\w, Time Elapsed 8.66
Generation 2, Fitness: 3168.00, Best Genome: R\w.\w\w, Time Elapsed 8.57
Generation 3, Fitness: 3168.00, Best Genome: R\w.\w\w, Time Elapsed 8.80
Generation 4, Fitness: 3087.00, Best Genome: R\w\w+, Time Elapsed 10.05
Generation 5, Fitness: 3087.00, Best Genome: R\w.+, Time Elapsed 11.22
Generation 6, Fitness: 3087.00, Best Genome: R\w+, Time Elapsed 8.81
Generation 7, Fitness: 3087.00, Best Genome: R[^ ]\w+, Time Elapsed 9.41
Generation 8, Fitness: 3087.00, Best Genome: R[^ ]\w+, Time Elapsed 10.10
Generation 9, Fitness: 3087.00, Best Genome: R.+, Time Elapsed 9.95
Generation 10, Fitness: 3087.00, Best Genome: R.+, Time Elapsed 10.00
Generation 11, Fitness: 3087.00, Best Genome: R.+, Time Elapsed 11.50
Generation 0, Fitness: 3284.00, Best Genome: (9)., Time Elapsed 11.74
Generation 1, Fitness: 3254.00, Best Genome: (J)[^9], Time Elapsed 11.02
Gen

In [29]:
print([(str(member), member.fitness) for member in best_members])
print('Average Fitness, altered proportions', sum([member.fitness for member in best_members])/len(best_members))
print('Average Fitness, standard proportions', sum([member.fitness for member in best_members_normal])/len(best_members_normal))

[('M.', 3206), ('B\\w*.', 3052), ('[*+ A-Z[3][^.]\\w.', 1912), ('[A-Z]\\w\\w[^(?+]|[^A-Z]\\d\\d[a-z]', 1912), ('S....', 3123), ('[M].*', 3084), ('C.*', 3025), (' ?S\\w.\\w', 3156), ('B.+|S', 3021), ('[A-Z][^(7)][^A]', 2274)]
Average Fitness, altered proportions 2776.5
Average Fitness, standard proportions 3034.0


In [30]:
best_members[3].run_fitness(verbose=True)

Genome: [A-Z]\w\w[^(?+]|[^A-Z]\d\d[a-z]
Label: 010 Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label: Japanese Result:  Japa
Label:  Result:  Punk
Label:  Result:  Tech
Label:  Result:  
Label: The Result:  
Label: Mad Result:  
Label: Capsule Result:  Caps
Label: Markets Result:  Mark
Label:  Result:  This
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label: Osc-Dis Result:  Osc-
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label: Introduction Result:  Intr
Label: 010 Result:  
Label:  Result:  
Label: Come Result:  Come
Label:  Result:  Foun
Label:  Result:  
Label: Kojima Result:  Koji
Label: Minoru Result:  Mino
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label: Good Result:  Good
Label: Day Result:  
Label:  Result:  
Label: Wardanceis Resu

**Hyperparameter Tuning on Word Dataset using modified intermediate proportions, since those gave better results on average and the best idv.**

**RESULTS:** Wow! We got the best individual out of this so far we've seen on the word dataset with the genome: [^a-z][^9++]*\w\w[^0-9]| <br>This arose from using meanNumGene=5, meanGeneLength=4
<br>
Lets look at the fitness of this individual in detail


In [11]:
program = RegExProgram(targets=WORD_TARGET_FIVE, labels=WORD_LABEL_FIVE)
program.genes = ['[^a-z][^9++]*\w\w[^0-9]|']
program.run_fitness(verbose=True)

Genome: [^a-z][^9++]*\w\w[^0-9]|
Label: 010 Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label: Japanese Result:  Japanese
Label:  Result:  Punk
Label:  Result:  Techno
Label:  Result:  
Label: The Result:  
Label: Mad Result:  
Label: Capsule Result:  Capsule
Label: Markets Result:  Markets
Label:  Result:  This
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label: Osc-Dis Result:  Osc-Dis
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label: Introduction Result:  Introduction
Label: 010 Result:  
Label:  Result:  
Label: Come Result:  Come
Label:  Result:  Founding
Label:  Result:  
Label: Kojima Result:  Kojima
Label: Minoru Result:  Minoru
Label:  Result:  
Label:  Result:  
Label:  Result:  
Label: Good Result:  Good
Label: Day Result:  
Label:  Result: 

Next is just a hyperparemter search of sentence dataset using altered proportions. It produces similar results to standard proportion hyperparameter searches w/ best indvidual around 3900 fitness

In [None]:
geneNumRange = [1, 2, 3, 4, 5]
geneLengthRange = [2, 4, 6, 8, 10]
best_members_10 = []
for geneMean in geneNumRange:
    for length in geneLengthRange:
        print('-'*25, '\nMean Num Genes:', geneMean, 'Mean Gene Length:', length, '\n', '-'*25)
        population = RegExPopulation(populationSize=500, targets=ALLTARGET_10, labels=AllLabels_10,
                                     geneLengthMean=length, numGeneMean=geneMean, mutantProp=0.25, randomProp=0.25)
        population.evolve()
        best_members_10.append(population[0])

------------------------- 
Mean Num Genes: 1 Mean Gene Length: 2 
 -------------------------
Generation 0, Fitness: 5925.00, Best Genome: .a, Time Elapsed 2.22
Generation 1, Fitness: 5710.00, Best Genome: [^.]\w \w, Time Elapsed 2.35
Generation 2, Fitness: 5686.00, Best Genome: \w\w \w, Time Elapsed 2.56
Generation 3, Fitness: 5686.00, Best Genome: \w\w \w, Time Elapsed 2.59
Generation 4, Fitness: 5686.00, Best Genome: \w\w \w, Time Elapsed 2.91
Generation 5, Fitness: 5425.00, Best Genome: \w+, Time Elapsed 3.04
Generation 6, Fitness: 5425.00, Best Genome: \w+, Time Elapsed 3.58
Generation 7, Fitness: 5425.00, Best Genome: \w+, Time Elapsed 3.38
Generation 8, Fitness: 5425.00, Best Genome: \w+, Time Elapsed 3.50
Generation 9, Fitness: 5425.00, Best Genome: \w+, Time Elapsed 3.41
Generation 10, Fitness: 5425.00, Best Genome: \w+, Time Elapsed 4.06
Generation 11, Fitness: 5425.00, Best Genome: \w+, Time Elapsed 3.71
Generation 12, Fitness: 5425.00, Best Genome: \w+, Time Elapsed 3.83
Gen

### Sentence Datasets, but with varying or maximal context ###
Now I created a new sentence generation dataset to always include maximal context up until the previous NER in the document. There is also a parameter for including a random amount of context up until the previous NER.
**NOTE**: The average rand context sentence length usually lies between 77-80 characters

In [9]:
AllTargets_10_MaxContext, AllLabels_10_MaxContext = generate_sentence_dataset_v2(['I-PER', 'I-ORG', 'I-LOC', 'I-MISC'], 10)
AllTargets_10_RandContext, AllLabels_10_RandContext = generate_sentence_dataset_v2(['I-PER', 'I-ORG', 'I-LOC', 'I-MISC'], 10, random_context=True)

In [19]:
avg_sentence_length_max = sum([len(sentence) for sentence in AllTargets_10_MaxContext]) / len(AllTargets_10_MaxContext)
avg_sentence_length_rand = sum([len(sentence) for sentence in AllTargets_10_RandContext]) / len(AllTargets_10_RandContext)
print('Num of examples in max context dataset', len(AllTargets_10_MaxContext), 'Avg. target length', avg_sentence_length_max)
print('Num of examples in rand context dataset', len(AllLabels_10_RandContext), 'Avg. target length', avg_sentence_length_rand)

Num of examples in max context dataset 557 Avg. target length 98.66068222621185
Num of examples in rand context dataset 557 Avg. target length 79.81687612208259


### Hyperparamter Search Max/Rand Context Sentence Datasets ###
Now we hyperparameter search on both datasets (I got rid of a few gene lengths that never evolved much from previous iterations)

In [29]:
geneNumRange = [1, 2, 3, 4, 5]
geneLengthRange = [2, 4, 6]
best_members_10 = []
for geneMean in geneNumRange:
    for length in geneLengthRange:
        print('-'*25, '\nMean Num Genes:', geneMean, 'Mean Gene Length:', length, '\n', '-'*25)
        population = RegExPopulation(populationSize=1000, targets=AllTargets_10_MaxContext, labels=AllLabels_10_MaxContext,
                                     geneLengthMean=length, numGeneMean=geneMean, mutantProp=0.25, randomProp=0.25)
        population.evolve()
        best_members_10.append(population[0])

------------------------- 
Mean Num Genes: 1 Mean Gene Length: 2 
 -------------------------
Generation 0, Fitness: 5861.00, Best Genome: \w\w , Time Elapsed 10.86
Generation 1, Fitness: 5742.00, Best Genome: .\w\w , Time Elapsed 9.53
Generation 2, Fitness: 5742.00, Best Genome: .\w\w , Time Elapsed 12.91
Generation 3, Fitness: 5742.00, Best Genome: .\w\w , Time Elapsed 13.36
Generation 4, Fitness: 5742.00, Best Genome: .\w\w , Time Elapsed 15.14
Generation 5, Fitness: 5742.00, Best Genome: .\w\w , Time Elapsed 14.83
Generation 6, Fitness: 5742.00, Best Genome: .\w\w , Time Elapsed 14.61
Generation 7, Fitness: 5742.00, Best Genome: .\w\w , Time Elapsed 15.86
Generation 8, Fitness: 5725.00, Best Genome: \w\w\w , Time Elapsed 11.02
Generation 9, Fitness: 5725.00, Best Genome: \w\w\w , Time Elapsed 12.01
Generation 10, Fitness: 5725.00, Best Genome: \w\w\w , Time Elapsed 11.80
Generation 11, Fitness: 5599.00, Best Genome: \w\w\w\w, Time Elapsed 14.12
Generation 12, Fitness: 5599.00, Best 

In [None]:
geneNumRange = [1, 2, 3, 4, 5]
geneLengthRange = [2, 4, 6]
best_members_10 = []
for geneMean in geneNumRange:
    for length in geneLengthRange:
        print('-'*25, '\nMean Num Genes:', geneMean, 'Mean Gene Length:', length, '\n', '-'*25)
        population = RegExPopulation(populationSize=1000, targets=AllTargets_10_RandContext, labels=AllLabels_10_RandContext,
                                     geneLengthMean=length, numGeneMean=geneMean, mutantProp=0.25, randomProp=0.25)
        population.evolve()
        best_members_10.append(population[0])

------------------------- 
Mean Num Genes: 1 Mean Gene Length: 2 
 -------------------------
Generation 0, Fitness: 5847.00, Best Genome: e\w\w, Time Elapsed 3.89
Generation 1, Fitness: 5847.00, Best Genome: e\w\w, Time Elapsed 5.24
Generation 2, Fitness: 5847.00, Best Genome: e\w\w, Time Elapsed 7.80
Generation 3, Fitness: 5847.00, Best Genome: e\w\w, Time Elapsed 6.84
Generation 4, Fitness: 5847.00, Best Genome: e\w\w, Time Elapsed 6.62
Generation 5, Fitness: 5814.00, Best Genome: \w\w , Time Elapsed 7.72
Generation 6, Fitness: 5715.00, Best Genome: \w\w.., Time Elapsed 13.41
Generation 7, Fitness: 5715.00, Best Genome: \w\w.., Time Elapsed 8.32
Generation 8, Fitness: 5715.00, Best Genome: \w\w.., Time Elapsed 11.42
Generation 9, Fitness: 5715.00, Best Genome: \w\w.., Time Elapsed 8.90
Generation 10, Fitness: 5715.00, Best Genome: \w\w.., Time Elapsed 10.90
Generation 11, Fitness: 5715.00, Best Genome: \w\w.., Time Elapsed 7.71
Generation 12, Fitness: 5715.00, Best Genome: \w\w.., Ti

  pattern = re.compile(str(self))


Generation 6, Fitness: 5682.00, Best Genome: [^a-z ][^3'*+]['.\w], Time Elapsed 4.00
Generation 7, Fitness: 5682.00, Best Genome: [^a-z ][^3'*+]['.\w], Time Elapsed 4.39
Generation 8, Fitness: 5618.00, Best Genome: [^0][\da-z[a] [^%][^ yc], Time Elapsed 4.65
Generation 9, Fitness: 5345.00, Best Genome: [A-Z]\w[^? A-Z], Time Elapsed 5.50
Generation 10, Fitness: 5345.00, Best Genome: [A-Z]\w[^? A-Z], Time Elapsed 5.54
Generation 11, Fitness: 5308.00, Best Genome: [A-Z][^\6][^.], Time Elapsed 5.55
Generation 12, Fitness: 5308.00, Best Genome: [A-Z][^\6][^.], Time Elapsed 5.76
Generation 13, Fitness: 5308.00, Best Genome: [A-Z][^\6][^.], Time Elapsed 6.00
Generation 14, Fitness: 5308.00, Best Genome: [A-Z][^\6][^.], Time Elapsed 6.47
Generation 15, Fitness: 5308.00, Best Genome: [A-Z][^\6][^.], Time Elapsed 6.74
Generation 16, Fitness: 5308.00, Best Genome: [A-Z][^\6][^.], Time Elapsed 6.57
Generation 17, Fitness: 5308.00, Best Genome: [A-Z][^\6][^.], Time Elapsed 7.02
Generation 18, Fitne

In [10]:
geneNumRange = [1, 2, 3, 4, 5]
geneLengthRange = [2, 4, 6]
best_members_10max = []
for geneMean in geneNumRange:
    for length in geneLengthRange:
        print('-'*25, '\nMean Num Genes:', geneMean, 'Mean Gene Length:', length, '\n', '-'*25)
        population = RegExPopulation(populationSize=1000, targets=AllTargets_10_MaxContext, labels=AllLabels_10_MaxContext,
                                     geneLengthMean=length, numGeneMean=geneMean, mutantProp=0.33, randomProp=0.33)
        population.evolve()
        best_members_10max.append(population[0])

------------------------- 
Mean Num Genes: 1 Mean Gene Length: 2 
 -------------------------
Generation 0, Fitness: 5838.00, Best Genome: .\wa, Time Elapsed 13.93
Generation 1, Fitness: 5838.00, Best Genome: .\wa, Time Elapsed 17.88
Generation 2, Fitness: 5838.00, Best Genome: .\wa, Time Elapsed 16.53
Generation 3, Fitness: 5838.00, Best Genome: .\wa, Time Elapsed 16.66
Generation 4, Fitness: 5838.00, Best Genome: .\wa, Time Elapsed 17.95
Generation 5, Fitness: 5780.00, Best Genome: .\w+, Time Elapsed 17.14
Generation 6, Fitness: 5771.00, Best Genome:  \w+, Time Elapsed 18.38
Generation 7, Fitness: 5706.00, Best Genome: a\w+, Time Elapsed 18.99
Generation 8, Fitness: 5706.00, Best Genome: a\w+, Time Elapsed 20.74
Generation 9, Fitness: 5706.00, Best Genome: a\w+, Time Elapsed 21.05
Generation 10, Fitness: 5706.00, Best Genome: a\w+, Time Elapsed 21.83
Generation 11, Fitness: 5706.00, Best Genome: a\w+, Time Elapsed 26.36
Generation 12, Fitness: 5706.00, Best Genome: a\w+, Time Elapsed 

  pattern = re.compile(str(self))


Generation 31, Fitness: 5053.00, Best Genome: [A-Z](.)[\d\w[a-z][\w], Time Elapsed 22.00
Generation 32, Fitness: 5053.00, Best Genome: [A-Z](.)[\d\w5++\dGA-Z[\w][\w], Time Elapsed 21.21
Generation 33, Fitness: 5053.00, Best Genome: [A-Z](.)[A-Z\w[^'3][\w], Time Elapsed 19.50
Generation 34, Fitness: 5053.00, Best Genome: [A-Z](.)[A-Z\w[^'3][\w], Time Elapsed 22.22
Generation 35, Fitness: 5045.00, Best Genome: [A-Z](.)[hA-Z5++\dGA-Z[\w][^U. ], Time Elapsed 22.14
Generation 36, Fitness: 4713.00, Best Genome: [A-Z](.)[\d\w5++\d[^&.][\w][\w], Time Elapsed 20.55
Generation 37, Fitness: 4523.00, Best Genome: [A-Z].\w\w[a-z][\w], Time Elapsed 21.17
Generation 38, Fitness: 4523.00, Best Genome: [A-Z].\w\w[a-z][\w], Time Elapsed 21.34
Generation 39, Fitness: 4523.00, Best Genome: [A-Z].\w\w[a-z][\w], Time Elapsed 21.35
Generation 40, Fitness: 4523.00, Best Genome: [A-Z].\w\w[a-z][\w], Time Elapsed 20.20
Generation 41, Fitness: 4523.00, Best Genome: [A-Z].\w\w[a-z][\w], Time Elapsed 19.30
Generat

  pattern = re.compile(str(self))


Generation 5, Fitness: 5574.00, Best Genome: \w\w[^2][^A-Z][^.\w]|S0-9, Time Elapsed 5.82
Generation 6, Fitness: 5574.00, Best Genome: \w\w[^2][^A-Z][^.\w]|S0-9, Time Elapsed 7.70
Generation 7, Fitness: 5574.00, Best Genome: \w\w[^2][^A-Z][^.\w]|S0-9, Time Elapsed 10.63
Generation 8, Fitness: 5574.00, Best Genome: \w\w[^2][^A-Z][^.\w]|S0-9, Time Elapsed 10.48
Generation 9, Fitness: 5569.00, Best Genome: \w\w[^2][^1 ][^.\w]|S0-9, Time Elapsed 11.95
Generation 10, Fitness: 5464.00, Best Genome: [^.](\w)\w\w[^2][^A-Z][^.\w]|a-za-z, Time Elapsed 14.26
Generation 11, Fitness: 5464.00, Best Genome: [^.](\w)\w\w[^2][^A-Z][^.\w]|a-za-z, Time Elapsed 14.61
Generation 12, Fitness: 5329.00, Best Genome: [^.][a-z]\w\w[^2][^A-Z]|0-9\d, Time Elapsed 18.47
Generation 13, Fitness: 5329.00, Best Genome: [^.][a-z]\w\w[^2][^A-Z]|0-9\d, Time Elapsed 19.84
Generation 14, Fitness: 5329.00, Best Genome: [^.][a-z]\w\w[^2][^A-Z]|0-9\d, Time Elapsed 16.73
Generation 15, Fitness: 5266.00, Best Genome: [^.][A-Z][

In [10]:
best_members_normal = []
for i in range(10):
    population = RegExPopulation(populationSize=500, targets=WORD_TARGET_FIVE, labels=WORD_LABEL_FIVE,
                                 numGeneMean=5, geneLengthMean=4, randomProp=.25, mutantProp=.25, generationLimit=100)
    population.evolve()
    best_members_normal.append(population[0])

Generation 0, Fitness: 2862.00, Best Genome: [^a-z][\da-z], Time Elapsed 2.94
Generation 1, Fitness: 2708.00, Best Genome: [A-ZY][\da-z], Time Elapsed 2.32
Generation 2, Fitness: 2636.00, Best Genome: [A-ZY][^0-9], Time Elapsed 3.21
Generation 3, Fitness: 2636.00, Best Genome: [A-ZY][^0-9], Time Elapsed 3.33
Generation 4, Fitness: 2636.00, Best Genome: [A-ZY][^0-9], Time Elapsed 3.27
Generation 5, Fitness: 2636.00, Best Genome: [A-ZY][^0-9], Time Elapsed 3.34
Generation 6, Fitness: 2636.00, Best Genome: [A-ZY][^0-9], Time Elapsed 3.38
Generation 7, Fitness: 2636.00, Best Genome: [A-ZY][^0-9], Time Elapsed 3.56
Generation 8, Fitness: 2636.00, Best Genome: [A-ZY][^0-9], Time Elapsed 3.33
Generation 9, Fitness: 2636.00, Best Genome: [A-ZY][^0-9], Time Elapsed 3.42
Generation 10, Fitness: 2385.00, Best Genome: [^a-z][^0-9][^A-Z], Time Elapsed 3.68
Generation 11, Fitness: 2367.00, Best Genome: [^a-z][a-zh][^A-Z], Time Elapsed 4.01
Generation 12, Fitness: 2367.00, Best Genome: [^a-z][a-zh][^

In [19]:
sorted_fitness_best_members = sorted([(member, member.fitness) for member in best_members_normal], key=lambda x: x[1])
print(sorted_fitness_best_members)

[(<__main__.RegExProgram object at 0x0000020CC057F510>, 1316), (<__main__.RegExProgram object at 0x0000020CC06025F0>, 1481), (<__main__.RegExProgram object at 0x0000020CC05B2BA0>, 1874), (<__main__.RegExProgram object at 0x0000020CC04909E0>, 1888), (<__main__.RegExProgram object at 0x0000020CC06314A0>, 1928), (<__main__.RegExProgram object at 0x0000020CC0490C10>, 1928), (<__main__.RegExProgram object at 0x0000020CC0589820>, 1932), (<__main__.RegExProgram object at 0x0000020CC17A0890>, 1936), (<__main__.RegExProgram object at 0x0000020CC17B52E0>, 1936), (<__main__.RegExProgram object at 0x0000020CC1821A50>, 2951)]


In [25]:
best_idv = sorted_fitness_best_members[0][0]
best_idv.run_fitness(verbose=True)
best_idv.fitness

Genome: [^a-z ](.+)\w
Label: Ross Result:  Ross
Label: G. Result:  
Label: Hoyt Result:  Hoyt
Label:  Result:  
Label:  Result:  June
Label:  Result:  1943
Label:  Result:  Brigadier
Label:  Result:  General
Label: Jesse Result:  Jesse
Label: Auton Result:  Auton
Label:  Result:  
Label:  Result:  September
Label:  Result:  1943
Label:  Result:  Colonel
Label: William Result:  William
Label: L Result:  
Label: Curry Result:  Curry
Label:  Result:  
Label:  Result:  July
Label:  Result:  1945
Label:  Result:  
Label:  Result:  
Label:  Result:  1st
Label:  Result:  Lieutenant
Label: John Result:  John
Label: J. Result:  
Label: Brody Result:  Brody
Label:  Result:  
Label:  Result:  April
Label:  Result:  1952
Label:  Result:  Colonel
Label: Meredith Result:  Meredith
Label: H. Result:  
Label: Shade Result:  Shade
Label:  Result:  
Label:  Result:  October
Label:  Result:  1952
Label:  Result:  Colonel
Label: Emmett Result:  Emmett
Label: S Result:  
Label: Davis Result:  Davis
Label: 

816

In [22]:
word_test_set_target, word_test_set_label = generate_word_dataset(['I-PER', 'I-ORG', 'I-LOC', 'I-MISC'], 10)
word_test_set_label = word_test_set_label[4000:]
word_test_set_target = word_test_set_target[4000:]
print(len(word_test_set_label))
word_test_set_target

1241


['Ross',
 'G.',
 'Hoyt',
 '4',
 'June',
 '1943',
 'Brigadier',
 'General',
 'Jesse',
 'Auton',
 '6',
 'September',
 '1943',
 'Colonel',
 'William',
 'L',
 'Curry',
 '29',
 'July',
 '1945',
 '--',
 'unkn',
 '1st',
 'Lieutenant',
 'John',
 'J.',
 'Brody',
 '24',
 'April',
 '1952',
 'Colonel',
 'Meredith',
 'H.',
 'Shade',
 '10',
 'October',
 '1952',
 'Colonel',
 'Emmett',
 'S',
 'Davis',
 '5',
 'September',
 '1953',
 '--',
 'unkn',
 'Captain',
 'Newell',
 'H.',
 'Beaty',
 '8',
 'April',
 '1957',
 'Colonel',
 'Clay',
 'Tice',
 'Jr.',
 '15',
 'May',
 '1957',
 'Brigadier',
 'General',
 'Andrew',
 'J.',
 'Evans',
 'Jr.',
 '1',
 'July',
 '1960',
 'Colonel',
 'Thomas',
 'L',
 'Hayes',
 '25',
 'September',
 '1963',
 '--',
 '1',
 'January',
 '1965',
 'Unknown',
 '1',
 'June',
 '1985',
 '--',
 '14',
 'July',
 '1985',
 'Major',
 'General',
 'John',
 'C.',
 'Scheidt',
 'Jr.',
 '15',
 'July',
 '1985',
 'Brigadier',
 'General',
 'Philip',
 'M.',
 'Drew',
 '12',
 'August',
 '1986',
 'Brigadier',
 'Gen

**Run Best Idv on Test Set, Wow it does better on test set! 1316 vs. 810**

In [24]:
best_idv.targets = word_test_set_target
best_idv.labels = word_test_set_label
best_idv.run_fitness()
best_idv.fitness

816