# Project 1

## Predict whether a mammogram mass is benign or malignant

We'll be using the "mammographic masses" public dataset.

The dataset contains 961 instances of masses detected in mammograms, with the following attributes:


   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)
   
BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute and so we will discard it. The age, shape, margin, and density attributes are the features that we will build our model with, and "severity" is the classification we will attempt to predict based on those attributes.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The "shape" for example is ordered increasingly from round to irregular.

A lot of unnecessary anguish and surgery arises from false positives in mammogram results.

Build a better way to interpret them through supervised machine learning.

## Your assignment

Apply Artificial Neural Network supervised machine learning techniques to this data set and validate it by applying K-Fold cross validation (K=10).

The data needs to be cleaned; many rows contain missing data, and there may be erroneous data identifiable as outliers as well.

Optimization techniques such as Genetic Algorithms provide a means of tuning "hyperparameters". Once you identify a promising approach, see if you can make it even better by tuning its hyperparameters.

The steps below outline the development of this project, with some guidance and hints. If you're up for a real challenge, try doing this project from scratch in a new, clean notebook!


## Let's begin: prepare your data

Start by importing the mammographic_masses.data.txt file into a Pandas dataframe (hint: use read_csv) and take a look at it.

In [1]:
import pandas as pd
data = pd.read_csv("mammographic_masses.data.txt")
data = data.drop(["BI-RADS"],axis=1)
data.head()
columns = data.columns

Make sure you use the optional parameters in read_csv to convert missing data (indicated by a ?) into NaN, and to add the appropriate column names (BI_RADS, age, shape, margin, density, and severity):

In [2]:
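import numpy
# Mark "?" placeholders as missing values (numpy.nan is the spelling supported by all NumPy versions).
data = data.replace("?",numpy.nan)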
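
The `read_csv` options mentioned in the text can do the same job at load time. The following is a minimal sketch, assuming the file is comma-separated with no header row; the three sample rows are hypothetical stand-ins for the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same comma-separated layout as the data file.
sample = io.StringIO("5,67,3,5,3,1\n4,43,1,1,?,1\n5,58,4,5,3,1\n")

# Column names as listed in the text; "?" marks a missing value.
names = ["BI_RADS", "age", "shape", "margin", "density", "severity"]
df = pd.read_csv(sample, na_values=["?"], names=names)

print(df.isna().sum())   # per-column count of missing entries
print(df.dtypes)         # columns holding NaN are parsed as float
```

Loading this way keeps every column numeric, so `describe()` would summarize all attributes rather than only the ones that happened to parse as numbers.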

Evaluate whether the data needs cleaning; your model is only as good as the data it's given. Hint: use describe() on the dataframe.

In [3]:
data.describe()

| | severity |
|---|---|
| count | 961.0 |
| mean | 0.463059 |
| std | 0.498893 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 1.0 |
| max | 1.0 |


There are quite a few missing values in the data set. Before we just drop every row that's missing data, let's make sure we don't bias our data in doing so. Is there any apparent pattern in which rows have missing fields? If there were, we'd have to go back and try to fill that data in.

In [4]:
data.isna().sum()

age          5
shape       31
margin      48
density     76
severity     0
dtype: int64
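
Counting missing values per column does not by itself show whether the incomplete rows differ from the complete ones. One possible check, sketched here on a hypothetical toy frame rather than the real data, is to compare summary statistics between the two groups.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; in the notebook this would be `data` before dropna().
toy = pd.DataFrame({
    "age":      [67, 43, 58, 28, 74, 65],
    "density":  [3, np.nan, 3, np.nan, 3, 2],
    "severity": [1, 1, 1, 0, 1, 0],
})

# Flag rows that have at least one missing field.
has_missing = toy.isna().any(axis=1)

# Compare mean age and malignancy rate between complete and incomplete rows.
summary = toy.groupby(has_missing)[["age", "severity"]].mean()
print(summary)
```

If the two rows of `summary` look similar on the real data, dropping incomplete rows is less likely to skew the model.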

If the missing data seems randomly distributed, go ahead and drop rows with missing data. Hint: use dropna().

In [5]:
data = data.dropna()

Next you'll need to convert the Pandas dataframes into numpy arrays that can be used by scikit-learn. Create an array that extracts only the feature data we want to work with (age, shape, margin, and density) and another array that contains the classes (severity). You'll also need an array of the feature name labels.

In [6]:
features = data.drop(["severity"],axis=1)
target = data['severity']
features = features.astype(float)
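
The cell above keeps the data as DataFrames. If plain numpy arrays and a list of feature names are wanted, as the text describes, a minimal sketch on a hypothetical two-row frame could look like this.

```python
import pandas as pd

# Hypothetical two-row frame with the same columns as the cleaned data.
toy = pd.DataFrame({"age": [67, 58], "shape": [3, 4], "margin": [5, 5],
                    "density": [3, 3], "severity": [1, 1]})

feature_names = ["age", "shape", "margin", "density"]
X = toy[feature_names].to_numpy(dtype=float)   # feature matrix, shape (n_rows, 4)
y = toy["severity"].to_numpy()                 # class labels, shape (n_rows,)
```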

Some of our models require the input data to be normalized, so go ahead and normalize the attribute data. Hint: use preprocessing.StandardScaler().

In [7]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(features)
featuresScaled = scaler.transform(features)
features = pd.DataFrame(featuresScaled,columns = data.columns[:-1],index=features.index)
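
To illustrate what the scaler does, the sketch below standardizes a small, made-up feature matrix and checks the result. `StandardScaler` subtracts each column's mean and divides by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: age in years and a 1-4 shape code.
X = np.array([[30.0, 1.0], [50.0, 2.0], [70.0, 4.0], [50.0, 1.0]])

scaled = StandardScaler().fit_transform(X)

# Each column now has mean ~0 and (population) standard deviation ~1.
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

Putting age and the small integer codes on the same scale keeps gradient descent from being dominated by the feature with the largest raw values.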

Split the data into training data and test data. The test data will be used for a final evaluation.

In [8]:
from sklearn.model_selection import train_test_split
 
featModeling, featEvaluation, targetModeling, targetEvaluation = train_test_split(features, target, test_size=0.05, random_state=24)

#dataModeling = featModeling.join(targetModeling)

from sklearn.model_selection import KFold
import numpy as np
from sklearn.metrics import classification_report,confusion_matrix

def fitnessFunction(featModeling,classifier):
    # Score a classifier with 10-fold cross validation on the modeling set.
    # Note: targetModeling is read from the notebook scope, and the same
    # classifier object keeps training across folds rather than being reset.
    kf = KFold(10)
    model_score = []
    for trainIndex, testIndex in kf.split(featModeling):
        final_preds  = []
        trainFeatures = featModeling.iloc[trainIndex]
        trainTarget = targetModeling.iloc[trainIndex]

        testFeatures = featModeling.iloc[testIndex]
        testTarget = targetModeling.iloc[testIndex]

        # Train on the nine training folds, then predict the held-out fold.
        input_func = tf.estimator.inputs.pandas_input_fn(x=trainFeatures,y=trainTarget,batch_size=20,shuffle=True)
        classifier.train(input_fn=input_func,steps=500)
        pred_fn = tf.estimator.inputs.pandas_input_fn(x=testFeatures,batch_size=len(testFeatures),shuffle=False)
        note_predictions = list(classifier.predict(input_fn=pred_fn))
        for pred in note_predictions:
            final_preds.append(pred['class_ids'][0])
        # Rows of the confusion matrix are true classes, columns are predictions.
        matrix = confusion_matrix(testTarget,final_preds)
        recall_rate = (matrix[1][1])/(matrix[1][1]+matrix[1][0])
        accuracy = (matrix[0][0]+matrix[1][1])/len(testTarget)
        # Weighted fitness: 80% accuracy, 20% recall on the malignant class.
        model_score.append((accuracy*0.8)+(recall_rate*0.2))

    model_score_mean = np.mean(model_score)
    return model_score_mean
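
One thing to watch in a cross-validation loop like the one above is whether the model is rebuilt for each fold, since a model that carries weights over has already seen later test folds. The sketch below shows the same 0.8/0.2 weighted score with a fresh model per fold. It uses scikit-learn's `MLPClassifier` and synthetic data as stand-ins (both are assumptions; the notebook itself uses `tf.estimator.DNNClassifier`), and `cv_score` and `make_model` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

def cv_score(X, y, make_model, n_splits=10):
    """Mean of 0.8*accuracy + 0.2*recall over the folds."""
    scores = []
    for train_idx, test_idx in KFold(n_splits).split(X):
        model = make_model()                      # fresh, untrained model per fold
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[test_idx])
        acc = accuracy_score(y[test_idx], preds)
        rec = recall_score(y[test_idx], preds, zero_division=0)
        scores.append(0.8 * acc + 0.2 * rec)
    return float(np.mean(scores))

# Synthetic stand-in data with four features, as in the notebook.
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
score = cv_score(
    X, y,
    lambda: MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0),
)
print(score)
```

Passing a factory (`make_model`) instead of a model instance is one simple way to guarantee that no fold sees weights trained on its own test rows.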

    

In [9]:
import tensorflow as tf
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 
tf.logging.set_verbosity(tf.logging.ERROR)

max_layers = 10
max_nodes = 7

age = tf.feature_column.numeric_column("age")
shape = tf.feature_column.numeric_column("shape")
margin = tf.feature_column.numeric_column("margin")
density = tf.feature_column.numeric_column("density")

feat_cols = [age,shape,margin,density]

new_population = []
pop_size=20

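# Each individual has four genes: number of hidden layers, log2 of the nodes per layer,
# negative log10 of the learning rate, and an activation flag (0 = relu6, 1 = sigmoid).
# randint excludes the upper bound, so layers fall in 1..9 and node exponents in 1..6.
layers = numpy.random.randint(low=1, high=max_layers, size=pop_size)
nodes = numpy.random.randint(low=1, high=max_nodes, size=pop_size)
learn_rate = numpy.random.randint(low=2, high=9, size=pop_size)
act_func = numpy.random.randint(low=0, high=2, size=pop_size)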

for i in range(pop_size):
    new_population.append([layers[i],nodes[i],learn_rate[i],act_func[i]])

new_population = np.array(new_population)
print(new_population)

[[7 4 6 1]
 [7 1 6 0]
 [8 1 5 0]
 [7 1 5 1]
 [8 5 4 1]
 [3 6 7 0]
 [9 4 5 0]
 [4 3 8 0]
 [4 1 3 1]
 [3 3 3 0]
 [9 3 8 0]
 [1 3 8 0]
 [8 1 3 1]
 [4 1 2 0]
 [3 3 5 0]
 [2 4 2 1]
 [8 2 5 0]
 [4 3 5 1]
 [4 6 3 1]
 [2 2 3 1]]


In [10]:
def pop_fitness(featModeling, new_population, pop_size):
    scores = []
    for i in range(pop_size):
        hiddenNodes = numpy.repeat([2**new_population[i][1]],new_population[i][0])

        learn_rate = 1/ (10** (new_population[i][2]))


        if(new_population[i][3]==0):
            actFunction=tf.nn.relu6
        else:
            actFunction=tf.nn.sigmoid


        print(new_population[i])
        classifier = tf.estimator.DNNClassifier(hidden_units=hiddenNodes, 
                                                n_classes=2,
                                                feature_columns=feat_cols, 
                                                activation_fn=actFunction, 
                                                optimizer=tf.train.GradientDescentOptimizer(learn_rate)
                                               )

        score = fitnessFunction(featModeling, classifier)
        scores.append(score)
    return scores
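
The gene-to-hyperparameter mapping used in `pop_fitness` can be isolated into a small helper, which makes it easier to read a printed chromosome. This is only a sketch: `decode` is a hypothetical name, and the activation is returned as a string rather than a TensorFlow function so the example runs without TensorFlow.

```python
def decode(chromosome):
    """Map [layers, node_exp, lr_exp, act_flag] to network settings."""
    layers, node_exp, lr_exp, act_flag = (int(g) for g in chromosome)
    hidden_units = [2 ** node_exp] * layers      # same width in every hidden layer
    learning_rate = 10.0 ** -lr_exp              # lr_exp=3 gives roughly 0.001
    activation = "relu6" if act_flag == 0 else "sigmoid"
    return hidden_units, learning_rate, activation

hidden_units, learning_rate, activation = decode([3, 3, 3, 0])
print(hidden_units, learning_rate, activation)
```

Read this way, the individual `[3 3 3 0]` describes three hidden layers of eight nodes each, trained with relu6.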

In [11]:
def select_mating_pool(population, scores, num_parents):
    parents = numpy.empty((num_parents, population.shape[1]))
    for parent_num in range(num_parents):
        max_fitness_idx = numpy.where(scores == numpy.max(scores))
        max_fitness_idx = max_fitness_idx[0][0]
        parents[parent_num, :] = population[max_fitness_idx, :]
        scores[max_fitness_idx] = 0
    return parents
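
The selection step above repeatedly takes the current maximum and overwrites it with 0, which also modifies the caller's `scores`. A possible alternative, sketched here with a hypothetical `top_k` helper and rounded illustrative scores, sorts the indices once and leaves the scores untouched. Note that ties may be ordered differently than in the loop above.

```python
import numpy as np

def top_k(population, scores, k):
    """Return the k highest-scoring rows without modifying `scores`."""
    order = np.argsort(scores)[::-1]     # indices from best to worst
    return population[order[:k]]

population = np.array([[7, 4, 6, 1], [3, 3, 3, 0], [2, 4, 2, 1]])
scores = [0.59, 0.79, 0.78]
parents = top_k(population, scores, 2)
print(parents)
```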

In [12]:
def crossover(parents, offspring_size):
    offspring = numpy.empty(offspring_size)
    # The point at which crossover takes place between two parents. Usually it is at the center.
    crossover_point = numpy.uint8(offspring_size[1]/2)

    for k in range(offspring_size[0]):
        # Index of the first parent to mate.
        parent1_idx = k%parents.shape[0]
        # Index of the second parent to mate.
        parent2_idx = (k+1)%parents.shape[0]
        # The new offspring will have its first half of its genes taken from the first parent.
        offspring[k, 0:crossover_point] = parents[parent1_idx, 0:crossover_point]
        # The new offspring will have its second half of its genes taken from the second parent.
        offspring[k, crossover_point:] = parents[parent2_idx, crossover_point:]
    return offspring
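
A small worked example may make the single-point crossover easier to follow. With four genes the crossover point is 2, so each child takes the layer and node genes from one parent and the learning-rate and activation genes from the next. The two parent rows here are illustrative.

```python
import numpy as np

parents = np.array([[3, 3, 3, 0],
                    [2, 4, 2, 1]])
point = parents.shape[1] // 2            # 4 genes -> split after gene 2

# Child k takes genes [0:point] from parent k and genes [point:] from parent k+1 (wrapping).
children = np.array([
    np.concatenate([parents[k % 2, :point], parents[(k + 1) % 2, point:]])
    for k in range(2)
])
print(children)
```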

In [13]:

def mutation(offspring_crossover):
    # Mutation changes a single gene in each offspring randomly.
    for idx in range(offspring_crossover.shape[0]):
        # Pick which of the four genes to mutate.
        gene_number = numpy.random.randint(0, 4)
        # Resample that gene within the same range used for the initial population.
        if (gene_number == 0):
            offspring_crossover[idx,0] = numpy.random.randint(1, max_layers)
        elif (gene_number == 1):
            offspring_crossover[idx,1] = numpy.random.randint(1, max_nodes)
        elif (gene_number == 2):
            offspring_crossover[idx,2] = numpy.random.randint(2, 9)
        elif (gene_number == 3):
            offspring_crossover[idx,3] = numpy.random.randint(0, 2)

    return offspring_crossover
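
The same mutation idea can be written in a table-driven form. This is a sketch under assumptions: `GENE_BOUNDS` and `mutate` are hypothetical names, the bounds restate the ranges used for the initial population, and a seeded `numpy.random.Generator` is used so runs are repeatable. Unlike the function above, it returns a copy instead of changing its input.

```python
import numpy as np

# Inclusive-low / exclusive-high bounds per gene, matching the ranges used above.
GENE_BOUNDS = [(1, 10), (1, 7), (2, 9), (0, 2)]

def mutate(offspring, rng):
    """Resample one randomly chosen gene in every row; returns a new array."""
    mutated = offspring.copy()
    for row in mutated:
        gene = rng.integers(0, len(GENE_BOUNDS))   # which gene to change
        low, high = GENE_BOUNDS[gene]
        row[gene] = rng.integers(low, high)        # new value within that gene's range
    return mutated

rng = np.random.default_rng(0)
before = np.array([[3, 3, 3, 0], [2, 4, 2, 1]])
after = mutate(before, rng)
print(after)
```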

In [14]:
genes_number = 4 # Number of genes
gen_numbers = 8  # Number of generations
num_parents = int(pop_size/2)
offspring_number = int(pop_size/2)

offspring_size = (offspring_number, genes_number)

for generation in range(gen_numbers):
    print(new_population)
    scores = pop_fitness(featModeling, new_population, pop_size)
    print('Scores:' )
    print(scores)
    parents = select_mating_pool(new_population, scores, num_parents)
    offspring = crossover(parents, offspring_size)
    offspring_mutation = mutation(offspring)
    print('Parents:')
    print(parents)
    print('Offspring:')
    print(offspring)
    
    new_population[0:parents.shape[0], :] = parents
    new_population[parents.shape[0]:, :] = offspring_mutation

    
# Printed once, after the last generation.
print("Finished!")

[[7 4 6 1]
 [7 1 6 0]
 [8 1 5 0]
 [7 1 5 1]
 [8 5 4 1]
 [3 6 7 0]
 [9 4 5 0]
 [4 3 8 0]
 [4 1 3 1]
 [3 3 3 0]
 [9 3 8 0]
 [1 3 8 0]
 [8 1 3 1]
 [4 1 2 0]
 [3 3 5 0]
 [2 4 2 1]
 [8 2 5 0]
 [4 3 5 1]
 [4 6 3 1]
 [2 2 3 1]]
[7 4 6 1]
[7 1 6 0]
[8 1 5 0]
[7 1 5 1]
[8 5 4 1]
[3 6 7 0]
[9 4 5 0]
[4 3 8 0]
[4 1 3 1]
[3 3 3 0]
[9 3 8 0]
[1 3 8 0]
[8 1 3 1]
[4 1 2 0]
[3 3 5 0]
[2 4 2 1]
[8 2 5 0]
[4 3 5 1]
[4 6 3 1]
[2 2 3 1]
Scores:
[0.5904836092177865, 0.4095163907822136, 0.4095163907822136, 0.5904836092177865, 0.4095163907822136, 0.46048996229187394, 0.6079462558950369, 0.5940342421291788, 0.4616239857189225, 0.7863017997637207, 0.629062564610339, 0.4962343498203386, 0.5904836092177865, 0.4095163907822136, 0.5830996130209194, 0.7754510666420151, 0.6129665937832764, 0.4095163907822136, 0.45508601103537805, 0.41099416528913146]
Before:
[[3. 3. 2. 1.]
 [2. 4. 8. 0.]
 [9. 3. 5. 0.]
 [8. 2. 5. 0.]
 [9. 4. 8. 0.]
 [4. 3. 6. 1.]
 [7. 4. 5. 1.]
 [7. 1. 3. 1.]
 [8. 1. 5. 0.]
 [3. 3. 3. 0.]]
Before:
[

[1 3 2 1]
[3 3 3 0]
[2 4 2 1]
[9 3 6 0]
[8 4 8 0]
[9 4 5 0]
[7 1 6 1]
[3 3 5 0]
[8 4 7 1]
[4 4 2 1]
[1 3 3 0]
[9 3 2 1]
[7 4 6 0]
[9 3 8 0]
[5 4 5 0]
[9 4 6 1]
[7 5 5 0]
[3 5 7 1]
[8 4 2 1]
Scores:
[0.8165713383332474, 0.8119483524123728, 0.7435258327665524, 0.7057115468583542, 0.4095163907822136, 0.5890324362772519, 0.418947928927859, 0.4095163907822136, 0.39275186076817487, 0.4095163907822136, 0.4793898085037326, 0.7422583286608978, 0.4414151249594288, 0.5644240267603763, 0.4095163907822136, 0.25558495615231636, 0.5904836092177865, 0.5530047893116993, 0.4095163907822136, 0.4421746186303149]
Before:
[[2. 4. 2. 1.]
 [1. 3. 3. 0.]
 [3. 3. 3. 0.]
 [1. 3. 2. 1.]
 [2. 4. 6. 1.]
 [9. 4. 8. 0.]
 [8. 4. 6. 0.]
 [7. 4. 5. 0.]
 [7. 5. 2. 1.]
 [4. 4. 2. 0.]]
Before:
[[2. 4. 2. 1.]
 [1. 3. 3. 0.]
 [3. 3. 3. 0.]
 [1. 3. 2. 1.]
 [2. 4. 6. 1.]
 [9. 4. 8. 0.]
 [8. 4. 6. 0.]
 [7. 4. 5. 0.]
 [7. 5. 2. 1.]
 [4. 4. 2. 0.]]
Before:
[[2. 4. 2. 1.]
 [9. 3. 3. 0.]
 [3. 3. 3. 0.]
 [1. 3. 2. 1.]
 [2. 4. 6. 1.]

[1 3 2 1]
[1 1 3 0]
[1 3 3 0]
[1 4 3 0]
[7 4 3 0]
[1 1 2 1]
[5 3 3 0]
[3 3 3 0]
[1 2 3 0]
[4 4 2 1]
[1 3 5 0]
[1 6 3 0]
[4 3 3 0]
[1 4 3 0]
[7 4 5 1]
[1 1 3 0]
[5 2 3 0]
[3 3 3 0]
[1 2 5 0]
Scores:
[0.8143535132302508, 0.7921315420451077, 0.7645539819938132, 0.8048581607867087, 0.7724788275624989, 0.7236818861344375, 0.8088936287844597, 0.7567980025431942, 0.7517715017898018, 0.7770389513537379, 0.4995975332684194, 0.29295637566868915, 0.8075368077415922, 0.6937700010881246, 0.763154377793609, 0.5904836092177865, 0.7559871725425273, 0.6135479432675204, 0.7886181695691096, 0.7209919811065644]
Before:
[[2. 4. 2. 1.]
 [1. 1. 3. 0.]
 [1. 6. 3. 0.]
 [1. 3. 2. 1.]
 [1. 3. 3. 0.]
 [3. 3. 3. 0.]
 [1. 2. 3. 0.]
 [1. 4. 3. 0.]
 [1. 1. 3. 0.]
 [1. 4. 2. 0.]]
Before:
[[4. 4. 2. 1.]
 [1. 1. 3. 0.]
 [1. 6. 3. 0.]
 [1. 3. 2. 1.]
 [1. 3. 3. 0.]
 [3. 3. 3. 0.]
 [1. 2. 3. 0.]
 [1. 4. 3. 0.]
 [1. 1. 3. 0.]
 [1. 4. 2. 0.]]
Before:
[[4. 4. 2. 1.]
 [1. 3. 3. 0.]
 [1. 6. 3. 0.]
 [1. 3. 2. 1.]
 [1. 3. 3. 0.]


In [19]:
classifier = tf.estimator.DNNClassifier(hidden_units=[8], 
                                                n_classes=2,
                                                feature_columns=feat_cols, 
                                                activation_fn=tf.nn.sigmoid, 
                                                optimizer=tf.train.GradientDescentOptimizer(1/(10**2))
                                               )


input_func = tf.estimator.inputs.pandas_input_fn(x=featModeling,y=targetModeling,batch_size=20,shuffle=True)
classifier.train(input_fn=input_func,steps=500)

pred_fn = tf.estimator.inputs.pandas_input_fn(x=featEvaluation,batch_size=len(featEvaluation),shuffle=False)
note_predictions = list(classifier.predict(input_fn=pred_fn))
final_preds  = []
for pred in note_predictions:
    final_preds.append(pred['class_ids'][0])
matrix = confusion_matrix(targetEvaluation,final_preds)
print(matrix)

[[18  6]
 [ 2 16]]
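
To read the confusion matrix above in terms of the fitness measure used during the search, the arithmetic can be worked through directly. scikit-learn's `confusion_matrix` puts true classes on the rows and predicted classes on the columns, so the 6 in the top row counts benign masses predicted as malignant (false positives) and the 2 in the bottom row counts malignant masses that were missed.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions.
matrix = np.array([[18, 6],
                   [2, 16]])
tn, fp, fn, tp = matrix.ravel()

accuracy = (tn + tp) / matrix.sum()      # 34 / 42
recall = tp / (tp + fn)                  # 16 / 18
score = 0.8 * accuracy + 0.2 * recall    # weighting used in fitnessFunction

print(round(accuracy, 3), round(recall, 3), round(score, 3))
```

With only 42 held-out rows (5% of the cleaned data), each misclassified case moves these figures by a couple of percentage points, so they are best read as a rough check.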
