# Genetic Algorithms


*   Carlos Cerro
*   Daniel Pinto


## 1. Objective of This Iteration
In this iteration we show how genetic algorithms can be used to automate machine learning workflows. We use the [MAGIC GAMMA Telescope Dataset](https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope) and apply a genetic algorithm that helps us select the features and the best model for the problem at hand. The notebook is organized as follows: first we give the context of the problem, then we develop the model, and finally we analyze the results.

## 2. Problem Context
Building a machine learning solution currently requires a great deal of preliminary work, including feature engineering and model selection and validation. In the end, the challenge is to choose the combination of techniques that minimizes prediction error. For this reason there is a lot of ongoing work on so-called "Auto-ML", which aims to reduce the complexity of applying machine learning algorithms.

In this particular case we focus on the dataset mentioned above. In brief, the data are generated to simulate the registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique. The goal is to classify a particle as gamma, which is the desired signal, or as hadron, which is the noise, based on the attributes provided in the data.

As can be seen, this is a classification problem. With the help of genetic algorithms we will choose the best-performing model and the treatment the data should receive in order to obtain more accurate results.
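To make the idea of a genetic algorithm concrete, here is a minimal sketch of the selection, crossover and mutation loop applied to a 0/1 feature mask. It is not how TPOT works internally (TPOT evolves whole pipelines with genetic programming); the function names, the toy fitness and the set of "useful" features are all hypothetical and chosen only for illustration.

```python
import random

def evolve_feature_mask(fitness, n_features, population_size=20, generations=15,
                        mutation_rate=0.1, seed=0):
    """Toy genetic algorithm: evolve a 0/1 mask saying which features to keep."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the better half of the population as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

# Hypothetical fitness: pretend features 0, 2 and 5 are informative and every
# extra feature costs a little. A real run would use a cross-validated score.
USEFUL = {0, 2, 5}

def toy_fitness(mask):
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in USEFUL)
    extras = sum(1 for i, bit in enumerate(mask) if bit and i not in USEFUL)
    return hits - 0.25 * extras

best_mask = evolve_feature_mask(toy_fitness, n_features=10)
print(best_mask, toy_fitness(best_mask))
```

Because the parents survive unchanged into the next generation, the best fitness in the population can never decrease from one generation to the next.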

![TPOT](https://drive.google.com/uc?id=1OR1YdAdpEYLxL_JDtnPOzkJmE6eYjWmW)

First, let's look at our data.

In [0]:
# Import the libraries
import numpy as np
import pandas as pd

In [0]:
datos_telescopio = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/magic/magic04.data', header = None)

In [0]:
datos_telescopio.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.7967 | 16.0021 | 2.6449 | 0.3918 | 0.1982 | 27.7004 | 22.011 | -8.2027 | 40.092 | 81.8828 | g |
| 1 | 31.6036 | 11.7235 | 2.5185 | 0.5303 | 0.3773 | 26.2722 | 23.8238 | -9.9574 | 6.3609 | 205.261 | g |
| 2 | 162.052 | 136.031 | 4.0612 | 0.0374 | 0.0187 | 116.741 | -64.858 | -45.216 | 76.96 | 256.788 | g |
| 3 | 23.8172 | 9.5728 | 2.3385 | 0.6147 | 0.3922 | 27.2107 | -6.4633 | -7.1513 | 10.449 | 116.737 | g |
| 4 | 75.1362 | 30.9205 | 3.1611 | 0.3168 | 0.1832 | -5.5277 | 28.5525 | 21.8393 | 4.648 | 356.462 | g |


In [0]:
datos_telescopio.describe().to_latex(float_format="%.2f", bold_rows = True, )


In [0]:
datos_telescopio.describe()

| | fLength | fWidth | fSize | fConc | fConcl | fAsym | fM3Long | fM3Trans | fAlpha | fDist |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 19020.0 | 19020.0 | 19020.0 | 19020.0 | 19020.0 | 19020.0 | 19020.0 | 19020.0 | 19020.0 | 19020.0 |
| mean | 53.250154 | 22.180966 | 2.825017 | 0.380327 | 0.214657 | -4.331745 | 10.545545 | 0.249726 | 27.645707 | 193.818026 |
| std | 42.364855 | 18.346056 | 0.472599 | 0.182813 | 0.110511 | 59.206062 | 51.000118 | 20.827439 | 26.103621 | 74.731787 |
| min | 4.2835 | 0.0 | 1.9413 | 0.0131 | 0.0003 | -457.9161 | -331.78 | -205.8947 | 0.0 | 1.2826 |
| 25% | 24.336 | 11.8638 | 2.4771 | 0.2358 | 0.128475 | -20.58655 | -12.842775 | -10.849375 | 5.547925 | 142.49225 |
| 50% | 37.1477 | 17.1399 | 2.7396 | 0.35415 | 0.1965 | 4.01305 | 15.3141 | 0.6662 | 17.6795 | 191.85145 |
| 75% | 70.122175 | 24.739475 | 3.1016 | 0.5037 | 0.285225 | 24.0637 | 35.8378 | 10.946425 | 45.88355 | 240.563825 |
| max | 334.177 | 256.382 | 5.3233 | 0.893 | 0.6752 | 575.2407 | 238.321 | 179.851 | 90.0 | 495.561 |


As we can see, the dataset has 11 columns and roughly 19 thousand observations (19,020): 10 continuous variables, summarized above, plus one categorical variable that describes the type of particle. Based on the documentation of the data source, we now assign a name to each variable.

In [0]:
datos_telescopio.columns = ['fLength', 'fWidth','fSize','fConc','fConcl',
                            'fAsym','fM3Long','fM3Trans','fAlpha',
                            'fDist','clase']
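As an alternative sketch, the names can be passed to `pd.read_csv` through its `names` argument so the DataFrame is labelled from the moment it is loaded. The example below reads two rows copied from the output above out of an in-memory string instead of the remote file, so it runs offline; `COLUMN_NAMES` and `sample` are illustrative names, and the column spelling follows the notebook's list.

```python
import io
import pandas as pd

# Same names as in the notebook cell above.
COLUMN_NAMES = ['fLength', 'fWidth', 'fSize', 'fConc', 'fConcl',
                'fAsym', 'fM3Long', 'fM3Trans', 'fAlpha',
                'fDist', 'clase']

# Two rows copied from the head() output, standing in for magic04.data.
sample = io.StringIO(
    "28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.011,-8.2027,40.092,81.8828,g\n"
    "31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.261,g\n"
)

# header=None says the file has no header row; names supplies the labels.
df = pd.read_csv(sample, header=None, names=COLUMN_NAMES)
print(df.dtypes)
```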

In [0]:
datos_telescopio.head()

| | fLength | fWidth | fSize | fConc | fConcl | fAsym | fM3Long | fM3Trans | fAlpha | fDist | clase |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28.7967 | 16.0021 | 2.6449 | 0.3918 | 0.1982 | 27.7004 | 22.011 | -8.2027 | 40.092 | 81.8828 | g |
| 1 | 31.6036 | 11.7235 | 2.5185 | 0.5303 | 0.3773 | 26.2722 | 23.8238 | -9.9574 | 6.3609 | 205.261 | g |
| 2 | 162.052 | 136.031 | 4.0612 | 0.0374 | 0.0187 | 116.741 | -64.858 | -45.216 | 76.96 | 256.788 | g |
| 3 | 23.8172 | 9.5728 | 2.3385 | 0.6147 | 0.3922 | 27.2107 | -6.4633 | -7.1513 | 10.449 | 116.737 | g |
| 4 | 75.1362 | 30.9205 | 3.1611 | 0.3168 | 0.1832 | -5.5277 | 28.5525 | 21.8393 | 4.648 | 356.462 | g |


A brief description of the data follows:


*   **fLength**: continuous; major axis of the ellipse.
*   **fWidth**: continuous; minor axis of the ellipse.
*   **fSize**: continuous; base-10 logarithm of the sum of the content of all pixels.
*   **fConc**: continuous; ratio of the sum of the two highest pixels over fSize.
*   **fConc1**: continuous; ratio of the highest pixel over fSize (stored as `fConcl` in the DataFrame above).
*   **fAsym**: continuous; distance from the highest pixel to the center, projected onto the major axis.
*   **fM3Long**: continuous; cube root of the third moment along the major axis.
*   **fM3Trans**: continuous; cube root of the third moment along the minor axis.
*   **fAlpha**: continuous; angle of the major axis with the vector to the origin.
*   **fDist**: continuous; distance from the origin to the center of the ellipse.
*   **clase**: g, h; gamma (signal) or hadron (noise). This is the target variable.



In [0]:
datos_telescopio['clase'].value_counts()

g    12332
h     6688
Name: clase, dtype: int64

We can see that most observations belong to the gamma class (12332), while the hadron class has 6688 observations.
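These counts also give a useful reference point: a model that always predicts the majority class would already be right roughly 65% of the time, so any accuracy reported later should be read against that baseline. A small worked calculation, using only the counts shown above (the variable names are illustrative):

```python
# Class counts taken from the value_counts() output above.
counts = {"g": 12332, "h": 6688}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

print(total, majority_label, round(baseline_accuracy, 4))
```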

## 3. Model Development

First we preprocess the data: we shuffle the rows randomly and then generate numeric codes for the `clase` variable.

In [0]:
# Random order
datos_randomizados = datos_telescopio.iloc[np.random.permutation(len(datos_telescopio))]
data = datos_randomizados.reset_index(drop=True)
data.head()

| | fLength | fWidth | fSize | fConc | fConcl | fAsym | fM3Long | fM3Trans | fAlpha | fDist | clase |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38.8604 | 19.207 | 2.8946 | 0.311 | 0.1931 | 10.4087 | 35.6672 | 14.7268 | 5.8146 | 141.765 | g |
| 1 | 23.251 | 17.6772 | 2.6048 | 0.4845 | 0.277 | 25.0677 | 21.5714 | -14.7494 | 25.3941 | 244.06 | h |
| 2 | 62.2097 | 25.172 | 3.3344 | 0.1686 | 0.0864 | 6.4592 | 55.0157 | 14.4867 | 6.1974 | 217.096 | g |
| 3 | 68.3318 | 15.8303 | 3.0165 | 0.2806 | 0.1621 | -56.7726 | -44.6009 | 12.5336 | 4.2399 | 65.3891 | h |
| 4 | 48.0649 | 6.4926 | 2.4433 | 0.5369 | 0.355 | -5.5375 | -11.5578 | 4.9494 | 12.5969 | 256.029 | g |


In [0]:
# Create numeric codes for the clase variable
data['clase'] = data['clase'].map({'g':0,'h':1})
data.head()

| | fLength | fWidth | fSize | fConc | fConcl | fAsym | fM3Long | fM3Trans | fAlpha | fDist | clase |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38.8604 | 19.207 | 2.8946 | 0.311 | 0.1931 | 10.4087 | 35.6672 | 14.7268 | 5.8146 | 141.765 | 0 |
| 1 | 23.251 | 17.6772 | 2.6048 | 0.4845 | 0.277 | 25.0677 | 21.5714 | -14.7494 | 25.3941 | 244.06 | 1 |
| 2 | 62.2097 | 25.172 | 3.3344 | 0.1686 | 0.0864 | 6.4592 | 55.0157 | 14.4867 | 6.1974 | 217.096 | 0 |
| 3 | 68.3318 | 15.8303 | 3.0165 | 0.2806 | 0.1621 | -56.7726 | -44.6009 | 12.5336 | 4.2399 | 65.3891 | 1 |
| 4 | 48.0649 | 6.4926 | 2.4433 | 0.5369 | 0.355 | -5.5375 | -11.5578 | 4.9494 | 12.5969 | 256.029 | 0 |
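The shuffle and the label encoding above can also be written so that they are repeatable and fail loudly on unexpected labels. The sketch below runs on a four-row toy frame rather than the telescope data; the seed value and the `toy` frame are assumptions made for illustration, since the notebook itself does not fix a seed.

```python
import pandas as pd

# Tiny stand-in for the telescope data (values are illustrative).
toy = pd.DataFrame({"fLength": [28.8, 31.6, 162.1, 23.8],
                    "clase": ["g", "h", "g", "h"]})

# sample(frac=1) returns all rows in random order; random_state (an assumed
# seed, not used in the original notebook) makes the order repeatable.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Same encoding as above: gamma -> 0, hadron -> 1.
codes = shuffled["clase"].map({"g": 0, "h": 1})

# map() yields NaN for labels missing from the dictionary, so check for that
# before converting to integers.
if codes.isna().any():
    raise ValueError("unexpected class label")
shuffled["clase"] = codes.astype(int)

print(shuffled)
```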


In [0]:
# Keep the target variable separately
clase = data['clase'].values

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
training_indices, validation_indices = train_test_split(data.index,
                                                        stratify = clase,
                                                        train_size=0.75, test_size=0.25)

In [0]:
training_indices.size, validation_indices.size

(14265, 4755)
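The `stratify` argument is what keeps the gamma/hadron ratio the same in both parts of the split. A quick way to see this, sketched here on synthetic labels that have the same class counts as the telescope data (the `random_state` value is an assumption added for repeatability):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the telescope data:
# 12332 zeros (gamma) and 6688 ones (hadron).
labels = np.array([0] * 12332 + [1] * 6688)
indices = np.arange(labels.size)

train_idx, valid_idx = train_test_split(
    indices, stratify=labels, train_size=0.75, test_size=0.25, random_state=0)

# Share of hadrons overall and in each part of the split.
overall_share = labels.mean()
train_share = labels[train_idx].mean()
valid_share = labels[valid_idx].mean()

print(train_idx.size, valid_idx.size)
print(round(overall_share, 4), round(train_share, 4), round(valid_share, 4))
```

Without `stratify`, the two shares would still be close for a dataset of this size, but they would vary from one random split to the next.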

To run the algorithm we use the tpot library and import TPOTClassifier. It has many parameters; we mention the most important ones:



1.   **generations**: Number of iterations to run the optimization process. The default is 100.
2.   **population_size**: Number of individuals to retain in the genetic programming population every generation. The default is 100.
3.   **offspring_size**: Number of offspring to produce in each genetic programming generation. By default it equals population_size (100).
4.   **mutation_rate**: Mutation rate for the genetic programming algorithm, between 0 and 1. It tells the algorithm how many pipelines to apply random changes to in every generation. The default is 0.9.
5.   **crossover_rate**: Crossover rate for the algorithm, between 0 and 1. The default is 0.1.
6.   **scoring**: Function used to evaluate the quality of a pipeline for the classification problem. The default is accuracy.
7.   **cv**: Cross-validation strategy used when evaluating pipelines. The default is 5 folds.
8.   **random_state**: Seed of the pseudo-random number generator used by TPOT.
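These parameters determine how much work a run involves. According to the TPOT documentation, a run evaluates `population_size + generations × offspring_size` pipelines in total, and each evaluation fits the pipeline once per cross-validation fold. The helper below is a hypothetical convenience function, not part of TPOT, that works this out for the settings used in this notebook:

```python
def total_pipeline_evaluations(generations, population_size=100, offspring_size=None):
    """Number of pipelines a TPOT run evaluates for the given settings."""
    if offspring_size is None:
        # TPOT's default: as many offspring as individuals in the population.
        offspring_size = population_size
    return population_size + generations * offspring_size

# generations=5 with the default population and offspring sizes.
evaluations = total_pipeline_evaluations(generations=5)

# Each pipeline is fitted once per fold under the default cv=5.
model_fits = evaluations * 5

print(evaluations, model_fits)
```

This is an upper-level estimate of the work: pipelines that exceed the per-pipeline time limit are cut short, and duplicates may not be re-evaluated.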



In [0]:
!pip install tpot

Collecting tpot
  Downloading https://files.pythonhosted.org/packages/ea/9f/813faf5ec7aa95f393a07603abd01fcb925b65ffe95441b25da029a69ff7/TPOT-0.11.1-py3-none-any.whl (75kB)
Collecting tqdm>=4.36.1
  Downloading https://files.pythonhosted.org/packages/47/55/fd9170ba08a1a64a18a7f8a18f088037316f2a41be04d2fe6ece5a653e8f/tqdm-4.43.0-py2.py3-none-any.whl (59kB)
Collecting update-checker>=0.16
  Downloading https://files.pythonhosted.org/packages/17/c9/ab11855af164d03be0ff4fddd4c46a5bd44799a9ecc1770e01a669c21168/update_checker-0.16-py2.py3-none-any.whl
Collecting stopit>=1.1.1
  Downloading https://files.pythonhosted.org/packages/35/58/e8bb0b0fb05baf07bbac1450c447d753da65f9701f551dca79823ce15d50/stopit-1.1.2.tar.gz
Collecting deap>=1.2
  Downloading https://files.pythonhosted.org/packages/0a/eb/2bd0a32e3ce757fb26264765abbaedd6d4d3640d90219a513aeabd08ee2b/de

In [0]:
from tpot import TPOTClassifier

In [0]:
tpot = TPOTClassifier(generations = 5, verbosity = 2)
tpot.fit(data.drop('clase',axis=1).loc[training_indices].values,
         data.loc[training_indices,'clase'].values)

[Progress bar: Optimization Progress, max=600]

Generation 1 - Current best internal CV score: 0.881037504381353
Generation 2 - Current best internal CV score: 0.8811076060287417
Generation 3 - Current best internal CV score: 0.8811076060287417
Generation 4 - Current best internal CV score: 0.8825096389765159
Generation 5 - Current best internal CV score: 0.8831405538030144

Best pipeline: GradientBoostingClassifier(MinMaxScaler(StandardScaler(input_matrix)), learning_rate=0.1, max_depth=7, max_features=1.0, min_samples_leaf=5, min_samples_split=2, n_estimators=100, subsample=0.8)


TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
               disable_update_check=False, early_stop=None, generations=5,
               max_eval_time_mins=5, max_time_mins=None, memory=None,
               mutation_rate=0.9, n_jobs=1, offspring_size=None,
               periodic_checkpoint_folder=None, population_size=100,
               random_state=None, scoring=None, subsample=1.0, template=None,
               use_dask=False, verbosity=2, warm_start=False)

The search ran for 5 generations. The log shows the best internal cross-validation accuracy after each one, with a slight improvement between the first generation (0.8810) and the last (0.8831). The best pipeline found is a GradientBoostingClassifier applied after standard and min-max scaling, and the genetic algorithm also reports the hyperparameter values to use.
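The "Best pipeline" line in the log can be read from the inside out and rebuilt by hand with scikit-learn. The sketch below does that with the reported hyperparameters and fits it on synthetic stand-in data generated with `make_classification`, purely to show that the pipeline runs; the data, the split and the variable names are assumptions, and the score it prints says nothing about the telescope data. TPOT can also write out equivalent code itself through its `export` method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Innermost step first: StandardScaler, then MinMaxScaler, then the classifier,
# with the hyperparameters reported in the TPOT log.
best_pipeline = make_pipeline(
    StandardScaler(),
    MinMaxScaler(),
    GradientBoostingClassifier(learning_rate=0.1, max_depth=7, max_features=1.0,
                               min_samples_leaf=5, min_samples_split=2,
                               n_estimators=100, subsample=0.8),
)

# Synthetic stand-in data (10 features, like the telescope set), only to show
# that the pipeline fits and predicts; it is not the MAGIC data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
best_pipeline.fit(X[:200], y[:200])
holdout_score = best_pipeline.score(X[200:], y[200:])
print(holdout_score)
```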

The run above took several hours. Some parameters can be adjusted to limit the run time; however, doing so means the algorithm explores fewer candidate pipelines, so the pipeline it returns may not be the best one.
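The two time limits are both expressed in minutes, which makes small values hard to read: `max_eval_time_mins=0.04` is only 2.4 seconds per pipeline. The helper below is a hypothetical convenience function (not part of TPOT) that takes the per-pipeline limit in seconds and builds the keyword arguments; `max_time_mins`, `max_eval_time_mins`, `population_size` and `verbosity` are the real TPOTClassifier parameter names.

```python
def time_limited_settings(max_minutes, seconds_per_pipeline, population_size):
    """Build keyword arguments for TPOTClassifier with a wall-clock budget."""
    return {
        "max_time_mins": max_minutes,                       # total search budget
        "max_eval_time_mins": seconds_per_pipeline / 60.0,  # per-pipeline budget
        "population_size": population_size,
        "verbosity": 2,
    }

settings = time_limited_settings(max_minutes=2, seconds_per_pipeline=2.4,
                                 population_size=15)
print(settings)

# Usage (requires the tpot package):
#   from tpot import TPOTClassifier
#   tpot = TPOTClassifier(**settings)
```

With a limit this tight, slower models are likely to be cut off before they finish, which biases the search toward models that train quickly.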



In [0]:
tpot = TPOTClassifier(verbosity=2, max_time_mins=2,max_eval_time_mins=0.04,
                      population_size = 15)
tpot.fit(data.drop('clase',axis=1).loc[training_indices].values,
         data.loc[training_indices,'clase'].values)

[Progress bar: Optimization Progress, max=15]

Generation 1 - Current best internal CV score: 0.8422712933753942
Generation 2 - Current best internal CV score: 0.8422712933753942
Generation 3 - Current best internal CV score: 0.8432527164388363
Generation 4 - Current best internal CV score: 0.8432527164388363
Generation 5 - Current best internal CV score: 0.8432527164388363

2.03 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: DecisionTreeClassifier(input_matrix, criterion=entropy, max_depth=9, min_samples_leaf=13, min_samples_split=9)


TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
               disable_update_check=False, early_stop=None, generations=100,
               max_eval_time_mins=0.04, max_time_mins=2, memory=None,
               mutation_rate=0.9, n_jobs=1, offspring_size=None,
               periodic_checkpoint_folder=None, population_size=15,
               random_state=None, scoring=None, subsample=1.0, template=None,
               use_dask=False, verbosity=2, warm_start=False)

In [0]:
tpot.score(data.drop('clase',axis=1).loc[validation_indices].values,
           data.loc[validation_indices, 'clase'].values)

0.8557308096740274

In this run, limiting the compute time of the genetic algorithm cost us accuracy: the best internal CV score dropped from about 0.883 in the 5-generation run to about 0.843, and the time-limited pipeline scores about 0.856 on the validation set. This is consistent with genetic algorithms being heuristics for hard optimization problems (often NP-hard ones) whose search space is too large to explore exhaustively. When we restrict the search budget, they may leave us at a local optimum rather than the global optimum.
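The effect of the search budget can be illustrated on a toy problem. The sketch below uses plain random search as a stand-in for the genetic search (it is deliberately not a genetic algorithm) on a made-up function with several local maxima; the function, the budgets and the seed are all assumptions chosen for illustration.

```python
import math
import random

def bumpy(x):
    """Multimodal toy objective with several local maxima on [0, 10]."""
    return math.sin(3 * x) + 0.3 * x

def random_search(budget, seed=0):
    """Return the best value found after `budget` random evaluations."""
    rng = random.Random(seed)
    return max(bumpy(rng.uniform(0, 10)) for _ in range(budget))

# Same seed, so the large run starts with the same candidates as the small
# one and then keeps going; it can only match or improve on the small run.
small_budget = random_search(budget=5)
large_budget = random_search(budget=500)
print(small_budget, large_budget)
```

A larger budget never makes the best-so-far worse, but it does not guarantee reaching the global maximum either, which mirrors the trade-off seen with TPOT above.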

Now let's try another dataset.

In [0]:
player = pd.read_parquet('/content/drive/My Drive/Maestría en ciencia de los datos y análitica/Aprendizaje Automatico/player_attributes.parquet')

In [0]:
# First we shuffle our data randomly
random_players = player.iloc[np.random.permutation(len(player))]
players = random_players.reset_index(drop= True)
players.head()

| | overall_rating | potential | crossing | finishing | heading_accuracy | volleys | curve | free_kick_accuracy | long_passing | ball_control | acceleration | sprint_speed | agility | reactions | balance | shot_power | jumping | stamina | strength | long_shots | aggression | positioning | vision | penalties | marking | high_attacking | low_attacking | medium_attacking | high | low | medium | overall_gk | preferred_foot_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.655738 | 0.672414 | 0.723404 | 0.697917 | 0.680412 | 0.641304 | 0.815217 | 0.697917 | 0.680851 | 0.73913 | 0.857143 | 0.858824 | 0.776471 | 0.696203 | 0.702381 | 0.768421 | 0.682927 | 0.732558 | 0.616279 | 0.8 | 0.582418 | 0.774194 | 0.71875 | 0.553191 | 0.322581 | 1 | 0 | 0 | 0 | 0 | 1 | 0.095426 | 1 |
| 1 | 0.459016 | 0.5 | 0.340426 | 0.239583 | 0.628866 | 0.173913 | 0.228261 | 0.177083 | 0.414894 | 0.478261 | 0.547619 | 0.552941 | 0.576471 | 0.455696 | 0.595238 | 0.389474 | 0.658537 | 0.604651 | 0.627907 | 0.242105 | 0.582418 | 0.548387 | 0.53125 | 0.521277 | 0.666667 | 0 | 0 | 1 | 0 | 0 | 1 | 0.245162 | 1 |
| 2 | 0.770492 | 0.758621 | 0.808511 | 0.84375 | 0.793814 | 0.815217 | 0.673913 | 0.666667 | 0.702128 | 0.793478 | 0.988095 | 0.976471 | 0.823529 | 0.822785 | 0.654762 | 0.789474 | 0.743902 | 0.837209 | 0.72093 | 0.747368 | 0.516484 | 0.860215 | 0.791667 | 0.765957 | 0.430108 | 1 | 0 | 0 | 0 | 0 | 1 | 0.090995 | 1 |
| 3 | 0.491803 | 0.413793 | 0.765957 | 0.677083 | 0.453608 | 0.782609 | 0.847826 | 0.833333 | 0.638298 | 0.728261 | 0.107143 | 0.176471 | 0.682353 | 0.544304 | 0.75 | 0.736842 | 0.52439 | 0.27907 | 0.546512 | 0.663158 | 0.681319 | 0.72043 | 0.708333 | 0.797872 | 0.301075 | 0 | 0 | 1 | 0 | 0 | 1 | 0.093842 | 1 |
| 4 | 0.52459 | 0.5 | 0.606383 | 0.572917 | 0.556701 | 0.456522 | 0.652174 | 0.645833 | 0.648936 | 0.673913 | 0.607143 | 0.576471 | 0.694118 | 0.607595 | 0.797619 | 0.694737 | 0.719512 | 0.627907 | 0.639535 | 0.694737 | 0.56044 | 0.655914 | 0.739583 | 0.606383 | 0.483871 | 0 | 0 | 1 | 0 | 0 | 1 | 0.271842 | 1 |


In [0]:
# Keep the target variable separately
foot = players['preferred_foot_bin'].values

In [0]:
training_indices, validation_indices = train_test_split(players.index,
                                                        stratify = foot,
                                                        train_size=0.75, test_size=0.25)

In [0]:
training_indices.size, validation_indices.size

(99687, 33229)

In [0]:
tpot = TPOTClassifier(verbosity=2, max_time_mins=10,max_eval_time_mins=0.04,
                      population_size = 20)
tpot.fit(players.drop('preferred_foot_bin',axis=1).loc[training_indices].values,
         players.loc[training_indices,'preferred_foot_bin'].values)

[Progress bar: Optimization Progress, max=20]

Generation 1 - Current best internal CV score: 0.7558157036071993
Generation 2 - Current best internal CV score: 0.7558157036071993

10.01 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: DecisionTreeClassifier(input_matrix, criterion=entropy, max_depth=1, min_samples_leaf=13, min_samples_split=2)


TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
               disable_update_check=False, early_stop=None, generations=100,
               max_eval_time_mins=0.04, max_time_mins=10, memory=None,
               mutation_rate=0.9, n_jobs=1, offspring_size=None,
               periodic_checkpoint_folder=None, population_size=20,
               random_state=None, scoring=None, subsample=1.0, template=None,
               use_dask=False, verbosity=2, warm_start=False)

In [0]:
tpot.score(players.drop('preferred_foot_bin',axis=1).loc[validation_indices].values,
           players.loc[validation_indices, 'preferred_foot_bin'].values)

0.7558157031508622
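The selected tree has `max_depth=1`, and its validation score is almost identical to its internal CV score, so it is worth checking whether 0.7558 is any better than always predicting the most frequent class. One way to do that is scikit-learn's `DummyClassifier`. The sketch below uses hypothetical labels with a 75/25 split, because the real class balance of `preferred_foot_bin` is not shown in the notebook; on the real data the same check would use the training and validation targets.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 75% ones, 25% zeros. The real class balance of
# preferred_foot_bin is not shown in the notebook.
y = np.array([1] * 750 + [0] * 250)
X = np.zeros((y.size, 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)
print(baseline_accuracy)
```

If the baseline on the real data turns out to be close to 0.7558, the time-limited search has effectively found nothing beyond the class prior.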

Letting the algorithm run for 5 generations:

In [0]:
tpot = TPOTClassifier(generations = 5, verbosity = 2)
tpot.fit(players.drop('preferred_foot_bin',axis=1).loc[training_indices].values,
         players.loc[training_indices,'preferred_foot_bin'].values)

[Progress bar: Optimization Progress, max=600]

Generation 1 - Current best internal CV score: 0.9459609164192717


In [0]:
tpot = TPOTClassifier(verbosity=2, max_time_mins=60,max_eval_time_mins=0.1,
                      population_size = 40)
tpot.fit(players.drop('preferred_foot_bin',axis=1).loc[training_indices].values,
         players.loc[training_indices,'preferred_foot_bin'].values)

[Progress bar: Optimization Progress, max=40]

Generation 1 - Current best internal CV score: 0.7627574326275929
Generation 2 - Current best internal CV score: 0.7627574326275929
Generation 3 - Current best internal CV score: 0.7627574326275929
Generation 4 - Current best internal CV score: 0.7627574326275929
Generation 5 - Current best internal CV score: 0.7647035221837544
Generation 6 - Current best internal CV score: 0.7647035221837544
Generation 7 - Current best internal CV score: 0.7647035221837544
Generation 8 - Current best internal CV score: 0.7818572449449087
Generation 9 - Current best internal CV score: 0.7818572449449087

61.20 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: DecisionTreeClassifier(input_matrix, criterion=entropy, max_depth=10, min_samples_leaf=16, min_samples_split=2)


TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
               disable_update_check=False, early_stop=None, generations=100,
               max_eval_time_mins=0.1, max_time_mins=60, memory=None,
               mutation_rate=0.9, n_jobs=1, offspring_size=None,
               periodic_checkpoint_folder=None, population_size=40,
               random_state=None, scoring=None, subsample=1.0, template=None,
               use_dask=False, verbosity=2, warm_start=False)