# Generating a population of datasets using a genetic algorithm
---

## Motivation
---
Preliminary tests on relatively small datasets have shown that deliberately choosing a fairer starting point for the algorithm (i.e. by assigning potential centroids to points in the data by solving a matching game) does not necessarily improve the performance of the overall $k$-modes algorithm.

The hope is to develop a set of datasets which are better clustered by our matching initialisation than by Cao's method (the leading competitor). We will develop this set as a population in a genetic algorithm.

## Method
---
#### 1. Begin with an initial set of randomly-generated categorical datasets
 - Each dataset is defined by its number of **rows, columns, and clusters**. We cluster each one by our method and by Cao's.

#### 2. Select two parents that are well-performing
 - Performance is given by our **objective function**
 - We stipulate some level of **difference** between these parent datasets to **maintain** some population **diversity**. By doing this, we avoid falling into a local minimum too soon.

#### 3. Create an offspring based on some crossover operation
 - This will likely combine blending the columns of the parent datasets with resizing the result so that it is a full dataset.

#### 4. Select some underperforming dataset and replace it with the offspring. Return to step 2 until some stopping criterion is met


At each generation, we 'roll a die' and mutate a randomly chosen dataset according to some pre-defined mutation rate. Mutating a member of the population forces some diversity even as we converge to a 'stable' population of datasets. This mutation operation will be similar to the crossover operation.
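
The four steps and the mutation rule above can be sketched on a toy problem. This is only an illustration under stated assumptions: individuals are plain `(rows, columns, clusters)` tuples rather than datasets, `toy_fitness` is a hypothetical stand-in for the real objective function, and the names `difference`, `select_parents`, `crossover`, `mutate` and `evolve` are invented for this sketch.

```python
import random


def toy_fitness(ind):
    """Hypothetical objective: higher is better, with its peak at (50, 10, 5)."""
    target = (50, 10, 5)
    return -sum(abs(a - b) for a, b in zip(ind, target))


def difference(a, b):
    """Number of parameters on which two individuals differ."""
    return sum(x != y for x, y in zip(a, b))


def select_parents(population, min_difference=1):
    """Step 2: take the fittest individual and the fittest sufficiently different partner."""
    ranked = sorted(population, key=toy_fitness, reverse=True)
    first = ranked[0]
    for candidate in ranked[1:]:
        if difference(first, candidate) >= min_difference:
            return first, candidate
    return first, ranked[1]  # fall back if the population has lost all diversity


def crossover(parent1, parent2, rng):
    """Step 3: take each parameter from one parent or the other at random."""
    return tuple(rng.choice(pair) for pair in zip(parent1, parent2))


def mutate(ind, rng):
    """Nudge one randomly chosen parameter by +/-1, keeping it positive."""
    i = rng.randrange(len(ind))
    new = list(ind)
    new[i] = max(1, new[i] + rng.choice((-1, 1)))
    return tuple(new)


def evolve(population_size=20, generations=200, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    # Step 1: a random initial population.
    population = [
        (rng.randint(1, 100), rng.randint(1, 20), rng.randint(1, 10))
        for _ in range(population_size)
    ]
    for _ in range(generations):
        parent1, parent2 = select_parents(population)
        child = crossover(parent1, parent2, rng)
        # Step 4: the offspring replaces the worst-performing individual.
        worst = min(range(len(population)), key=lambda i: toy_fitness(population[i]))
        population[worst] = child
        # 'Roll a die' once per generation and mutate one random individual.
        if rng.random() < mutation_rate:
            j = rng.randrange(len(population))
            population[j] = mutate(population[j], rng)
    return population


final_population = evolve()
best = max(final_population, key=toy_fitness)
print(best, toy_fitness(best))
```

The stopping criterion here is simply a fixed number of generations; any other criterion could be substituted in the loop condition.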

In [102]:
from kmodes.kmodes import KModes
from sklearn.datasets import make_blobs

import operator
import itertools

import pandas as pd
import numpy as np


In [73]:
def pointwise_dissim(x, y):
    return np.sum(x != y, axis=0)


def dissim(Y, x):
    return np.sum(Y != x, axis=1)
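
As a small illustration, the two helpers above appear to compute the simple-matching dissimilarity, i.e. the number of attributes on which categorical points disagree. The example data below is made up for this sketch, and the helper definitions are repeated only so that the block runs on its own.

```python
import numpy as np


def pointwise_dissim(x, y):
    # Count the attributes on which two points differ.
    return np.sum(x != y, axis=0)


def dissim(Y, x):
    # For each row of Y, count the attributes on which it differs from x.
    return np.sum(Y != x, axis=1)


# Made-up categorical points, for illustration only.
x = np.array(["red", "small", "round"])
y = np.array(["red", "large", "round"])
Y = np.array(
    [
        ["red", "small", "round"],
        ["blue", "large", "round"],
        ["blue", "large", "square"],
    ]
)

print(pointwise_dissim(x, y))  # 1
print(dissim(Y, x))            # [0 2 3]
```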


In [74]:
class DataSet(object):
    """ A dataset object, defined by its number of rows, columns and clusters, and a generator seed. """

    def __init__(self, n_rows, n_cols, n_clusters, seed):

        self.n_rows = n_rows
        self.n_cols = n_cols
        self.n_clusters = n_clusters
        self.seed = seed

        np.random.seed(self.seed)
        self.cluster_std = (0.5 - 0.01) * np.random.random() + 0.01

    def get_params(self):
        """ Return the parameters of our dataset as a tuple for easy access. """
        return self.n_rows, self.n_cols, self.n_clusters, self.seed

    def get_dataframe(self):
        """ Generate the dataset itself as a pandas.DataFrame object. """

        data, target = make_blobs(
            n_samples=self.n_rows,
            n_features=self.n_cols,
            centers=self.n_clusters,
            cluster_std=self.cluster_std,
            center_box=(0, 1),
            random_state=self.seed,
        )

        data = np.round(data, 0)
        dataframe = pd.DataFrame(
            {f"attr{col}": data[:, col] for col in range(data.shape[1])}
        )

        return dataframe

    def get_clustering(self, init):
        """ Cluster the dataset into `n_clusters` parts, initialised by method `init`. """

        km = KModes(n_clusters=self.n_clusters, init=init, n_init=10)
        km.fit_predict(self.get_dataframe())

        return km

    def fitness(self):
        """ Find the fitness of a dataset by clustering it by Cao's method, and our own;
        return Cao's cost minus ours, so that a higher fitness means our
        initialisation did better (a lower cost is a better clustering). """

        cao = self.get_clustering("cao")
        matching_best = self.get_clustering("matching_best")

        return cao.cost_ - matching_best.cost_


In [59]:
def generate_first_population(population_size):
    """ Given some population size (this also acts as a max seed),
    create an initial set of DataSet objects. """

    population = []
    seed = 0

    while seed < population_size:

        np.random.seed(seed)
        n_rows = np.random.randint(100, 10000)
        n_cols = np.random.randint(4, 500)
        n_clusters = np.random.randint(3, 20)

        dataset = DataSet(n_rows, n_cols, n_clusters, seed)
        population.append(dataset)
        seed += 1

    return population


In [60]:
def get_ordered_population(population):
    """ Order the current population by their fitness in descending order. """

    ordered_population = {}
    for individual in population:
        ordered_population[individual] = individual.fitness()

    return sorted(ordered_population.items(), key=operator.itemgetter(1), reverse=True)


In [106]:
def select_breeders(ordered_population, best_sample, lucky_sample):
    """ Given a population, select breeders for the next generation.
    
    Parameters
    ----------
    ordered_population : list
        A list of (individual, fitness) tuples, sorted by fitness in
        descending order, as returned by `get_ordered_population`.
    best_sample : int
        The number of best performing individuals to take to breed for
        the next generation.
    lucky_sample : int
        The number of individuals to be randomly selected to breed for
        the next generation.

    Returns
    -------
    breeders : list
        The individuals to breed for the next generation.
    """

    np.random.seed(0)
    breeders = []
    # `ordered_population` is already a list of (individual, fitness) pairs.
    ranked_individuals = list(ordered_population)
    for _ in range(best_sample):
        individual = ranked_individuals.pop(0)[0]
        breeders.append(individual)

    for _ in range(lucky_sample):
        # np.random.choice needs a 1-D array, so draw an index instead.
        lucky_index = np.random.randint(len(ranked_individuals))
        breeders.append(ranked_individuals[lucky_index][0])

    np.random.shuffle(breeders)
    return breeders


In [107]:
def create_child(parent1, parent2):
    """ Given two parents, randomly select one of the parameters of either
    parent or their average. Then return a new DataSet instance as 'their
    child' with these selected parameters. """

    # Stack both parents' parameters, then append their element-wise average.
    parents = np.array([parent1.get_params(), parent2.get_params()])
    params = np.vstack([parents, parents.sum(axis=0) // 2])

    # Cast to int: DataSet and numpy's seeding expect integer parameters.
    child_params = (int(np.random.choice(params[:, col])) for col in range(4))
    child = DataSet(*child_params)

    return child


In [108]:
def create_children(breeders, population_size, max_children):
    """ All breeders reproduce with one another, where each pair of parents
    produces a number of children equal to the product of their seeds modulo
    `max_children`, which is passed as a parameter here. """

    next_population = []
    for parent1, parent2 in itertools.combinations(breeders, r=2):
        number_children = parent1.seed * parent2.seed % max_children
        np.random.seed(number_children)
        for child in range(number_children):
            if len(next_population) < population_size:
                next_population.append(create_child(parent1, parent2))

    return next_population


In [117]:
def mutation(individual, population_size):
    """ Mutate an individual by changing its seed, i.e. the same structural
    parameters but with new values. """

    individual.seed = np.random.randint(population_size)
    return individual


In [141]:
def mutate_population(population, mutation_rate):
    """ Mutate individuals in the population at each generation according to
    a mutation_rate, i.e. the proportion of the population to be mutated. """

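    for individual in population:
        if np.random.random() < mutation_rate:
            # `mutation` changes the individual in place. The population size
            # is used as the upper bound on seeds, as in the first generation.
            mutation(individual, len(population))

    return population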


In [None]:
%%timeit

population_size = 10
max_children = population_size // 4
# Placeholder settings; the original cell did not specify these values.
best_sample, lucky_sample = 3, 1
mutation_rate = 0.1

population = generate_first_population(population_size)
step = 0
while step < 5:
    ordered_population = get_ordered_population(population)
    breeders = select_breeders(ordered_population, best_sample, lucky_sample)
    new_generation = create_children(breeders, population_size, max_children)
    population = mutate_population(new_generation, mutation_rate)
    step += 1

In [2]:
for _ in range(1):
    print(_)


0


In [8]:
a = [1, 2, 3, 4, 5]

del a[a.index(1)]

a


[2, 3, 4, 5]

In [10]:
from genetic_data.pdfs import Gamma, Poisson


In [11]:
g = Gamma(100)


In [15]:
g

<genetic_data.pdfs.Gamma at 0x11290f5c0>