# DM2: "Connectionism: backpropagation algorithm"

-----------------------------

_Eole Cervenka, Nov 13th 2017_

+ Python version: 3.6
+ libraries: sklearn, numpy, pandas
+ dependencies:

    + `Eole_Cervenka_DM2_preparation.ipynb`
    + `Eole_Cervenka_DM2_exploration.ipynb`
    + `Eole_Cervenka_DM2_MLP.ipynb`
        
+ Data:
    + `data.csv` (cf. Preparation section)
    
-----------------------------
    
## Part I - Breast cancer data

### Preparation


The input data in `breast-cancer.arff` is converted to a CSV file, `data.csv`, which starts as follows:

```
'age','menopause','tumor-size','inv-nodes','node-caps','deg-malig','breast','breast-quad','irradiat','Class'
'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
...
```
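
The conversion step itself is not shown in this report. A minimal sketch of how it could be done with the standard library follows. The function name `arff_to_csv` and the two-attribute sample are hypothetical, and the parser assumes a simple ARFF file (nominal attributes, no spaces inside attribute names, no sparse rows).

```python
import csv
import io


def arff_to_csv(arff_text):
    """Convert the text of a simple ARFF file to CSV text (hypothetical helper)."""
    header, rows, in_data = [], [], False
    for line in arff_text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank lines and ARFF comments
            continue
        lower = line.lower()
        if in_data:
            # Data rows are comma-separated, with single-quoted values
            rows.append(next(csv.reader([line], quotechar="'", skipinitialspace=True)))
        elif lower.startswith("@attribute"):
            # "@attribute 'name' {...}": keep only the unquoted name
            header.append(line.split(None, 2)[1].strip("'\""))
        elif lower.startswith("@data"):
            in_data = True
    out = io.StringIO()
    writer = csv.writer(out, quotechar="'", quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Made-up two-attribute excerpt, for illustration only
sample = """@relation breast-cancer
@attribute 'age' {'20-29','40-49'}
@attribute 'Class' {'no-recurrence-events','recurrence-events'}
@data
'40-49','recurrence-events'
"""
csv_text = arff_to_csv(sample)
print(csv_text)
```

Quoting every field with `'` keeps the output readable by `pd.read_csv(fpath, quotechar="'")`.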

## Preparation

### Helper functions

In [1]:
%run utils/helper_functions.ipynb

In [2]:
%run utils/preparation.ipynb

In [3]:
%run utils/exploration.ipynb

In [4]:
%run utils/MLP_utils.ipynb

### Data set

I use the `pandas` library to load and manipulate the dataset.

In [5]:
import pandas as pd

fpath = "data.csv"
df = pd.read_csv(fpath, quotechar="'")

df.rename( columns={
        'tumor-size': 'tumor_size',
        'inv-nodes': 'inv_nodes',
        'node-caps' : 'node_caps',
        'deg-malig' : 'deg_malig',
        'breast-quad' : 'breast_quad'
    }, inplace=True)

df.head()

| | age | menopause | tumor_size | inv_nodes | node_caps | deg_malig | breast | breast_quad | irradiat | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40-49 | premeno | 15-19 | 0-2 | yes | 3 | right | left_up | no | recurrence-events |
| 1 | 50-59 | ge40 | 15-19 | 0-2 | no | 1 | right | central | no | no-recurrence-events |
| 2 | 50-59 | ge40 | 35-39 | 0-2 | no | 2 | left | left_low | NaN | recurrence-events |
| 3 | 40-49 | premeno | 35-39 | 0-2 | yes | 3 | right | left_low | yes | no-recurrence-events |
| 4 | 40-49 | premeno | 30-34 | 3-5 | yes | 2 | left | right_up | no | recurrence-events |


### Exploration: attribute value-frequency histograms

In [49]:
attr_dict = attr_val_freq(df)  # {attribute: {value: frequency}}

for attr, value_freq in attr_dict.items():
    print("Value Frequency: {}".format(attr))

    # Values are sorted as strings, so '5-9' is listed after '45-49'
    for value in sorted(value_freq):
        freq = value_freq[value]
        bar = "." * int(freq / 3)  # one dot per 3 records
        padding = " " * (20 - len("{}".format(value))) + '|'
        print(value, padding, bar)
    print("\n\t***\n")

Value Frequency: age
20-29                | 
30-39                | ............
40-49                | ..............................
50-59                | ...............................
60-69                | ...................
70-79                | ..

	***

Value Frequency: menopause
ge40                 | ..........................................
lt40                 | ..
premeno              | ..................................................

	***

Value Frequency: tumor_size
0-4                  | ..
10-14                | .........
15-19                | ..........
20-24                | ................
25-29                | ..................
30-34                | ....................
35-39                | ......
40-44                | .......
45-49                | .
5-9                  | .
50-54                | ..

	***

Value Frequency: inv_nodes
0-2                  | ......................................................................
12-14                |
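
The helper `attr_val_freq` is loaded from the `utils` notebooks and is not shown here. A minimal sketch of what it might do, assuming it returns one `{value: count}` dict per column, is given below. The names `attr_val_freq_sketch` and `range_start` are hypothetical. `range_start` is an optional sort key that orders range labels such as `5-9` by their numeric lower bound instead of alphabetically.

```python
import pandas as pd


def attr_val_freq_sketch(df):
    """Hypothetical stand-in: map each column name to a {value: count} dict."""
    return {col: df[col].value_counts().to_dict() for col in df.columns}


def range_start(value):
    """Hypothetical sort key: 'lo-hi' labels by numeric lower bound, other labels after."""
    head = str(value).split("-")[0]
    return (0, int(head), "") if head.isdigit() else (1, 0, str(value))


# Two columns taken from the five preview rows shown earlier
demo = pd.DataFrame({
    "menopause": ["premeno", "ge40", "ge40", "premeno", "premeno"],
    "breast": ["right", "right", "left", "right", "left"],
})
freq = attr_val_freq_sketch(demo)

# Numeric ordering of range labels
ordered = sorted(["0-4", "10-14", "45-49", "5-9", "50-54"], key=range_start)
print(freq)
print(ordered)
```

Passing `key=range_start` to `sorted` in the histogram loop would list `5-9` right after `0-4`.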

### Data preparation

1. Handle missing attribute values
2. Encode categorical values

#### Missing values

Remove records whose `node_caps` or `breast_quad` value is `'?'`, or whose `irradiat` value is `'NaN'`.

In [8]:
df = remove_missing_values(df)
df.head()

| | age | menopause | tumor_size | inv_nodes | node_caps | deg_malig | breast | breast_quad | irradiat | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40-49 | premeno | 15-19 | 0-2 | yes | 3 | right | left_up | no | recurrence-events |
| 1 | 50-59 | ge40 | 15-19 | 0-2 | no | 1 | right | central | no | no-recurrence-events |
| 3 | 40-49 | premeno | 35-39 | 0-2 | yes | 3 | right | left_low | yes | no-recurrence-events |
| 4 | 40-49 | premeno | 30-34 | 3-5 | yes | 2 | left | right_up | no | recurrence-events |
| 5 | 50-59 | premeno | 25-29 | 3-5 | no | 2 | right | left_up | yes | no-recurrence-events |
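
The body of `remove_missing_values` is not shown in this report. The filtering rule described above could be sketched as follows, assuming that missing `irradiat` values are read by pandas as `NaN`. The name `remove_missing_values_sketch` and the three-row demo frame are hypothetical.

```python
import pandas as pd


def remove_missing_values_sketch(df):
    """Hypothetical stand-in: drop records with '?' or NaN in the affected columns."""
    keep = (
        df["node_caps"].ne("?")
        & df["breast_quad"].ne("?")
        & df["irradiat"].notna()
    )
    # Boolean indexing keeps the original row labels (no reindexing)
    return df[keep]


# Made-up frame: row 1 has a missing irradiat value, row 2 has '?' in node_caps
demo = pd.DataFrame({
    "node_caps": ["yes", "no", "?"],
    "breast_quad": ["left_up", "left_low", "central"],
    "irradiat": ["no", None, "no"],
})
cleaned = remove_missing_values_sketch(demo)
print(cleaned)
```

Because the index is not reset, dropped records leave gaps in the row labels, which matches the preview above where label 2 is absent.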


#### Categorical values encoding

In [9]:
df_encoded, label_encoder = encode_df(df)
df_encoded.head()

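| | age | menopause | tumor_size | inv_nodes | node_caps | deg_malig | breast | breast_quad | irradiat | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 2 | 2 | 0 | 2 | 2 | 1 | 3 | 0 | 1 |
| 1 | 3 | 0 | 2 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 3 | 2 | 2 | 6 | 0 | 2 | 2 | 1 | 2 | 1 | 0 |
| 4 | 2 | 2 | 5 | 4 | 2 | 1 | 0 | 5 | 0 | 1 |
| 5 | 3 | 2 | 4 | 4 | 1 | 1 | 1 | 3 | 1 | 0 |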
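
The encoded preview is consistent with each column's values being numbered in sorted order (for example, `ge40` maps to 0 and `premeno` to 2), which is also how scikit-learn's `LabelEncoder` numbers its classes. The actual `encode_df` is not shown, so the following is only a sketch of that scheme using pandas categories. The name `encode_df_sketch` is hypothetical, and it returns a plain dict of class lists rather than the `label_encoder` object used in this report.

```python
import pandas as pd


def encode_df_sketch(df):
    """Hypothetical stand-in: replace each value by its rank among the column's sorted values."""
    encoded = pd.DataFrame(index=df.index)
    classes = {}
    for col in df.columns:
        # Cast to str so numeric columns such as deg_malig are treated as labels too
        cat = df[col].astype(str).astype("category")
        encoded[col] = cat.cat.codes
        classes[col] = list(cat.cat.categories)  # position in this list == integer code
    return encoded, classes


# Three columns from the five cleaned preview rows (row labels 0, 1, 3, 4, 5)
demo = pd.DataFrame({
    "breast": ["right", "right", "right", "left", "right"],
    "irradiat": ["no", "no", "yes", "no", "yes"],
    "Class": ["recurrence-events", "no-recurrence-events", "no-recurrence-events",
              "recurrence-events", "no-recurrence-events"],
}, index=[0, 1, 3, 4, 5])
encoded, classes = encode_df_sketch(demo)
print(encoded)
print(classes)
```

Keeping the class lists makes it possible to map a code back to its original label, for example `classes["Class"][1]`.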


### Attribute overview

In [10]:
val_dict = attr_val_dict(df_encoded)

fpath = "/tmp/DM2_attr_val_encoded.json"
save_json(val_dict, fpath)

for k, v in val_dict.items():
    print(k, sorted(v))

age [0, 1, 2, 3, 4, 5]
menopause [0, 1, 2]
tumor_size [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
inv_nodes [0, 1, 2, 3, 4, 5, 6]
node_caps [0, 1, 2]
deg_malig [0, 1, 2]
breast [0, 1]
breast_quad [0, 1, 2, 3, 4, 5]
irradiat [0, 1]
Class [0, 1]
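
The helpers `attr_val_dict` and `save_json` are not shown in this report. One detail worth illustrating is that values taken directly from a pandas column are NumPy integers, which the `json` module refuses to serialize. The sketch below, with the hypothetical name `attr_val_dict_sketch`, converts them to plain Python integers first.

```python
import json

import pandas as pd


def attr_val_dict_sketch(df):
    """Hypothetical stand-in: map each column to the sorted list of its distinct values."""
    # .tolist() turns NumPy integers into plain Python ints, which json can serialize
    return {col: sorted(df[col].unique().tolist()) for col in df.columns}


# Two encoded columns from the preview rows shown earlier
demo = pd.DataFrame({"breast": [1, 1, 1, 0, 1], "Class": [1, 0, 0, 1, 0]})
vals = attr_val_dict_sketch(demo)
as_json = json.dumps(vals)
print(as_json)
```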


### X, y (input matrix, label vector)

In [11]:
# matrix input X and label vector y
X, y = get_nn_inputs(df_encoded)

# Preview
print("X  --sample records:")
for r in X[:10]:
    print(r)
print("...")
print("\ny --sample labels")
print(y[:10], '\n')

# Save to pkl
training_data = (X, y)
save_pickle(training_data, '/tmp/training.pkl')

X  --sample records:
(2, 2, 2, 0, 2, 2, 1, 3, 0)
(3, 0, 2, 0, 1, 0, 1, 1, 0)
(2, 2, 6, 0, 2, 2, 1, 2, 1)
(2, 2, 5, 4, 2, 1, 0, 5, 0)
(3, 2, 4, 4, 1, 1, 1, 3, 1)
(3, 0, 7, 0, 1, 2, 0, 3, 0)
(2, 2, 1, 0, 1, 1, 0, 3, 0)
(2, 2, 0, 0, 1, 1, 1, 4, 0)
(2, 0, 7, 2, 2, 1, 1, 3, 1)
(3, 2, 4, 0, 1, 1, 0, 2, 0)
...

y --sample labels
[1, 0, 0, 1, 0, 0, 0, 0, 0, 0] 

(X, y) saved to '/tmp/training.pkl'
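
Neither `get_nn_inputs` nor `save_pickle` is shown in this report. Judging from the preview, `X` is a list of feature tuples and `y` a list of labels. A minimal sketch under that assumption follows, with a pickle round trip to a temporary file. The name `get_nn_inputs_sketch` and its `label_col` parameter are hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd


def get_nn_inputs_sketch(df, label_col="Class"):
    """Hypothetical stand-in: split an encoded frame into feature tuples and labels."""
    features = df.drop(columns=[label_col])
    X = list(features.itertuples(index=False, name=None))  # one plain tuple per record
    y = df[label_col].tolist()
    return X, y


# First two encoded records from the preview above
columns = ["age", "menopause", "tumor_size", "inv_nodes", "node_caps",
           "deg_malig", "breast", "breast_quad", "irradiat", "Class"]
demo = pd.DataFrame([[2, 2, 2, 0, 2, 2, 1, 3, 0, 1],
                     [3, 0, 2, 0, 1, 0, 1, 1, 0, 0]], columns=columns)
X, y = get_nn_inputs_sketch(demo)

# Round trip through a pickle file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "training.pkl")
    with open(path, "wb") as fh:
        pickle.dump((X, y), fh)
    with open(path, "rb") as fh:
        X_loaded, y_loaded = pickle.load(fh)
```

Storing plain tuples and lists keeps the pickle loadable without pandas installed.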
