### Generalization: parity function problem

_You will generalize the previous study to functions of 3 to 10 input variables, for which you will
provide a training set for the parity problem: the function returns 1 when the number of
input variables taking the value 1 is even, and 0 otherwise._
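
For purely binary inputs, the training set described in the assignment can be enumerated exhaustively. The sketch below is an illustration only: `parity_label` and `build_parity_table` are hypothetical names that do not appear in the notebook, and it assumes each input variable takes a value in {0, 1}.

```python
from itertools import product


def parity_label(bits):
    """Return 1 when an even number of inputs equal 1, else 0."""
    return 1 if sum(1 for b in bits if b == 1) % 2 == 0 else 0


def build_parity_table(n_inputs):
    """Enumerate all 2**n_inputs binary inputs with their parity label."""
    return [(bits, parity_label(bits)) for bits in product((0, 1), repeat=n_inputs)]


# One exhaustive table per problem size requested by the assignment (3 to 10)
tables = {n: build_parity_table(n) for n in range(3, 11)}
for n, table in tables.items():
    print(n, len(table))
```

For 3 inputs the table has 8 rows, and exactly half of them are labelled 1.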

In [27]:
%run utils/preparation.ipynb

In [1]:
%run utils/exploration.ipynb

In [2]:
%run utils/MLP_utils.ipynb

#### Data preparation

We first load the dictionary that maps each attribute of the Breast Cancer dataset to its encoded values.

In [30]:
# Load encoded attribute values from the Breast Cancer dataset
fpath = "/tmp/DM2_attr_val_encoded.json"
attr_val_breast_dataset_encoded = load_json(fpath)

We then generate random records and label each one with the parity function: the class is 1 when an even number of attributes take the value 1, and 0 otherwise.

In [31]:
import random

import pandas as pd


def parity_fct(attr_values):
    """Return True when an even number of attribute values equal 1."""
    return (attr_values.count(1) % 2) == 0


def generate_record(attr_val_dict):
    """Draw one random value per attribute and append the parity class."""

    def add_record_class(input_record, class_fct=parity_fct):
        attr_values = list(input_record.values())
        output_label = class_fct(attr_values)
        output_record = dict(input_record)  # copy: leave the input untouched
        output_record["Class"] = 1 if output_label else 0
        return output_record

    # Drop the original label so that only input attributes are sampled
    attr_val_dict = {k: v for k, v in attr_val_dict.items() if k != "Class"}
    new_record = {k: random.choice(v) for k, v in attr_val_dict.items()}
    return add_record_class(new_record)


def generate_dataset(generate_record, n_records, attr_val_dict=None):
    """Build a DataFrame of n_records randomly generated records."""
    if attr_val_dict is None:
        # Default to the dictionary loaded in the previous cell
        attr_val_dict = attr_val_breast_dataset_encoded
    records = [generate_record(attr_val_dict) for _ in range(n_records)]
    return pd.DataFrame(records)
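
Because the encoded attributes are not binary, it can help to check how `parity_fct` treats a record: only values exactly equal to 1 are counted, and a record with no 1 at all counts as even. The snippet below is a small self-contained check that repeats the definition of `parity_fct`; the first two samples are records that appear in this section's outputs, and the third is made up for illustration.

```python
def parity_fct(attr_values):
    return (attr_values.count(1) % 2) == 0


samples = [
    [0, 0, 3, 1, 3, 1, 0, 2, 7],  # two values equal to 1: even
    [1, 0, 5, 1, 5, 1, 0, 2, 3],  # three values equal to 1: odd
    [0, 2, 3, 4, 5, 0, 2, 2, 7],  # no value equal to 1: zero is even
]
labels = [1 if parity_fct(s) else 0 for s in samples]
print(labels)
```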

In [32]:
dataset_parity_fct = generate_dataset(generate_record, 1000)
dataset_parity_fct.head()

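| | Class | age | breast | breast_quad | deg_malig | inv_nodes | irradiat | menopause | node_caps | tumor_size |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 3 | 1 | 3 | 1 | 0 | 2 | 7 |
| 1 | 1 | 1 | 0 | 4 | 1 | 5 | 0 | 1 | 2 | 1 |
| 2 | 0 | 1 | 0 | 5 | 1 | 5 | 1 | 0 | 2 | 3 |
| 3 | 1 | 2 | 1 | 1 | 0 | 5 | 2 | 2 | 0 | 6 |
| 4 | 1 | 5 | 1 | 4 | 1 | 4 | 0 | 2 | 0 | 2 |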


#### Running the MLP on the parity function dataset

In [33]:
# Input data, labels
X, y = get_nn_inputs(dataset_parity_fct)

print(X[:10])
print()
print(y[:10])

[(0, 0, 3, 1, 3, 1, 0, 2, 7), (1, 0, 4, 1, 5, 0, 1, 2, 1), (1, 0, 5, 1, 5, 1, 0, 2, 3), (2, 1, 1, 0, 5, 2, 2, 0, 6), (5, 1, 4, 1, 4, 0, 2, 0, 2), (4, 1, 5, 2, 6, 0, 0, 0, 9), (5, 1, 5, 1, 6, 1, 2, 1, 2), (1, 0, 6, 2, 1, 2, 2, 0, 10), (5, 0, 0, 1, 3, 0, 1, 0, 1), (5, 0, 6, 0, 0, 1, 1, 1, 7)]

[1, 1, 0, 1, 1, 0, 1, 1, 0, 0]
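
`get_nn_inputs` comes from one of the utility notebooks and is not shown here. Judging only from the printed output, it appears to return each record's attribute values as a tuple (columns in alphabetical order, `Class` excluded) and the labels as a list. A hypothetical stand-in that works on plain dictionaries rather than a DataFrame might look like this; `split_inputs_labels` is an invented name, not the notebook's helper.

```python
def split_inputs_labels(records, label_key="Class"):
    """Split records into feature tuples and labels (hypothetical helper)."""
    # Sort the feature names so every tuple uses the same column order
    feature_names = sorted(k for k in records[0] if k != label_key)
    X = [tuple(r[name] for name in feature_names) for r in records]
    y = [r[label_key] for r in records]
    return X, y


# Two records copied from the dataset preview in this section
records = [
    {"Class": 1, "age": 0, "breast": 0, "breast_quad": 3, "deg_malig": 1,
     "inv_nodes": 3, "irradiat": 1, "menopause": 0, "node_caps": 2, "tumor_size": 7},
    {"Class": 0, "age": 1, "breast": 0, "breast_quad": 5, "deg_malig": 1,
     "inv_nodes": 5, "irradiat": 1, "menopause": 0, "node_caps": 2, "tumor_size": 3},
]
X_demo, y_demo = split_inputs_labels(records)
print(X_demo)
print(y_demo)
```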


In [34]:
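from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

# SGD solver instead of the default 'adam'; max_iter=200 is the default value
clf = MLPClassifier(solver='sgd', max_iter=200)

# param_grid is not defined in this section; it is assumed to be provided
# by utils/MLP_utils.ipynb, which is run at the top of the section
random_search = RandomizedSearchCV(
    clf,
    param_distributions=param_grid,
    n_iter=30,  # number of parameter settings sampled at random
    n_jobs=4,  # 4 parallel jobs
    refit=True,  # refit the best estimator on the whole dataset
    cv=10,  # 10-fold cross-validation
    verbose=0,
    random_state=None  # no fixed seed: results vary between runs
)

random_search.fit(X, y)
print("best params:\n{}".format(random_search.best_params_))
# best_score_ is the mean cross-validated score of the best parameter setting
print("best score :\n{}".format(random_search.best_score_))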



best params:
{'learning_rate': 'adaptive', 'hidden_layer_sizes': (39, 45)}
best score :
0.519
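
To put the score of 0.519 in context, it can be compared with the accuracy obtained by always predicting the most frequent class. The snippet below is a sketch: `majority_baseline_accuracy` is a hypothetical helper that is not part of the notebook, and it is applied here only to the ten labels printed earlier, not to the full dataset.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


y_preview = [1, 1, 0, 1, 1, 0, 1, 1, 0, 0]  # first ten labels printed above
print(majority_baseline_accuracy(y_preview))
```

Applied to the full `y`, this gives a floor that a useful classifier should exceed. A cross-validated score close to that floor would suggest that the network has learned little beyond the class frequencies.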
