# In this notebook we apply some of the tools we have learned to a classification problem. We use the Mammographic Mass dataset, available at http://archive.ics.uci.edu/ml/index.php

1. Title: Mammographic Mass Data

2. Sources:

   (a) Original owners of database:
        Prof. Dr. Rüdiger Schulz-Wendtland
        Institute of Radiology, Gynaecological Radiology, University Erlangen-Nuremberg
        Universitätsstraße 21-23
        91054 Erlangen, Germany
        
   (b) Donor of database:
        Matthias Elter
        Fraunhofer Institute for Integrated Circuits (IIS)
        Image Processing and Medical Engineering Department (BMT) 
        Am Wolfsmantel 33
        91058 Erlangen, Germany
        matthias.elter@iis.fraunhofer.de
        (49) 9131-7767327 
        
   (c) Date received: October 2007
 
3. Past Usage:
    M. Elter, R. Schulz-Wendtland and T. Wittenberg (2007)
    The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize an intelligible decision process.
    Medical Physics 34(11), pp. 4164-4172

4. Relevant Information:
    Mammography is the most effective method for breast cancer screening
    available today. However, the low positive predictive value of breast
    biopsy resulting from mammogram interpretation leads to approximately
    70% unnecessary biopsies with benign outcomes. To reduce the high
    number of unnecessary breast biopsies, several computer-aided diagnosis
    (CAD) systems have been proposed in recent years. These systems
    help physicians in their decision to perform a breast biopsy on a suspicious
    lesion seen in a mammogram or to perform a short term follow-up
    examination instead.
    This data set can be used to predict the severity (benign or malignant)
    of a mammographic mass lesion from BI-RADS attributes and the patient's age.
    It contains a BI-RADS assessment, the patient's age and three BI-RADS attributes
    together with the ground truth (the severity field) for 516 benign and
    445 malignant masses that have been identified on full field digital mammograms
    collected at the Institute of Radiology of the
    University Erlangen-Nuremberg between 2003 and 2006.
    Each instance has an associated BI-RADS assessment ranging from 1 (definitely benign)
    to 5 (highly suggestive of malignancy) assigned in a double-review process by
    physicians. Assuming that all cases with BI-RADS assessments greater than or equal to
    a given value (varying from 1 to 5) are malignant and the other cases benign,
    sensitivities and associated specificities can be calculated. These can be an
    indication of how well a CAD system performs compared to the radiologists.

5. Number of Instances: 961

6. Number of Attributes: 6 (1 goal field, 1 non-predictive, 4 predictive attributes)

7. Attribute Information:
   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)

8. Missing Attribute Values: Yes
    - BI-RADS assessment:    2
    - Age:                   5
    - Shape:                31
    - Margin:               48
    - Density:              76
    - Severity:              0

9. Class Distribution: benign: 516; malignant: 445
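
The threshold rule described in item 4 can be sketched as follows: every case with a BI-RADS assessment at or above a chosen value is called malignant, and sensitivity and specificity are computed against the severity field. This is only an illustration. The arrays are made-up toy values (not rows from the dataset) and `sensitivity_specificity` is a hypothetical helper name.

```python
import numpy as np

def sensitivity_specificity(bi_rads, severity, threshold):
    """Call a case malignant when bi_rads >= threshold; return (sensitivity, specificity)."""
    bi_rads = np.asarray(bi_rads)
    severity = np.asarray(severity)
    predicted = bi_rads >= threshold
    tp = np.sum(predicted & (severity == 1))   # malignant, flagged
    fn = np.sum(~predicted & (severity == 1))  # malignant, missed
    tn = np.sum(~predicted & (severity == 0))  # benign, not flagged
    fp = np.sum(predicted & (severity == 0))   # benign, flagged
    return tp / (tp + fn), tn / (tn + fp)

# Toy values for illustration only (assumption: not taken from the dataset)
toy_bi_rads = np.array([2, 3, 4, 4, 5, 5, 3, 4])
toy_severity = np.array([0, 0, 0, 1, 1, 1, 0, 1])

for t in range(1, 6):
    sens, spec = sensitivity_specificity(toy_bi_rads, toy_severity, t)
    print(f"threshold {t}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Raising the threshold trades sensitivity for specificity, which is the curve a CAD system would be compared against.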

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# The UCI file has no header row; header=None keeps the first record as data
mam_data = pd.read_csv('../../datasets/mammographic_masses.data', header=None, na_values='?')

In [None]:
mam_data.info()

In [None]:
mam_data.describe()

In [None]:
mam_data.head(10)

In [None]:
mam_data.columns=['bi_rads', 'age','shape','margin','density','severity']

In [None]:
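# Per-class histograms of every column, using only the complete rows
complete = mam_data.dropna()
fig, axs = plt.subplots(ncols=2, nrows=3)
for ax, col in zip(axs.flatten(), complete.columns):
    ax.hist(complete.loc[complete['severity'] == 0, col], alpha=0.5, label='benign')
    ax.hist(complete.loc[complete['severity'] == 1, col], alpha=0.5, label='malignant')
    ax.set_title(col)
axs[0, 0].legend()
fig.tight_layout()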

### Plotting categorical and ordinal data with seaborn

In [None]:
import seaborn as sns
for var in ['bi_rads','shape','margin','density']:
    plt.figure()
    sns.countplot(x=var, hue='severity', data=mam_data)

In [None]:
sns.boxplot(x='severity', y='age', data=mam_data)

## Without NaNs

In [None]:
mam_data_clean = mam_data.dropna()

In [None]:
mam_data_clean.info()  # info() prints its own summary and returns None

As usual, we build a feature matrix and a target vector.

In [None]:
X=mam_data_clean.drop(columns=['severity']).values
y=mam_data_clean['severity'].values

We split the data into training and test sets with scikit-learn's `train_test_split` function.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

We try a decision tree and a random forest, which need little data preprocessing.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

clf=DecisionTreeClassifier()
clf.fit(X_train, y_train)
print("Decision tree accuracy:", clf.score(X_test, y_test))

clf=RandomForestClassifier()
clf.fit(X_train, y_train)
print("Random forest accuracy:", clf.score(X_test, y_test))

In [None]:
# Test accuracy as a function of tree depth
for depth in np.arange(1, 10):
    clf = DecisionTreeClassifier(max_depth=depth)
    clf.fit(X_train, y_train)
    plt.plot(depth, clf.score(X_test, y_test), '.r')
plt.ylabel('accuracy')
plt.xlabel('depth')

What happens if we do the same with Support Vector Machines, for example?

In [None]:
from sklearn.svm import SVC

clf=SVC()
clf.fit(X_train, y_train)
print("SVM accuracy:", clf.score(X_test, y_test))

The categorical variables have to be converted first (one-hot encoded).

In [None]:
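from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# categorical_features was removed from OneHotEncoder; ColumnTransformer selects the columns instead.
# Encoded columns come first and age is passed through as the last column; unseen categories encode as zeros.
oHe = ColumnTransformer([('onehot', OneHotEncoder(handle_unknown='ignore'), [0, 2, 3, 4])],
                        remainder='passthrough', sparse_threshold=0)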
X_train_transform = oHe.fit_transform(X_train)
X_test_transform = oHe.transform(X_test)
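
To make the encoding step concrete, here is a minimal sketch on a single made-up column of shape codes. Codes such as round=1 or irregular=4 are labels, not magnitudes, so each distinct code becomes its own 0/1 column. The values are toy data, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy column of shape codes (made-up values): 1=round, 2=oval, 4=irregular
shape_codes = np.array([[1], [2], [4], [2]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense for display
encoded = encoder.fit_transform(shape_codes).toarray()

print(encoder.categories_)  # the distinct codes found in the column
print(encoded)              # one row per sample, one column per distinct code
```

Each row contains a single 1 in the column of its category, so a distance-based model such as an SVM no longer treats code 4 as "four times" code 1.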

It also helps to put all variables on the same scale.

In [None]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train_transform =  ss.fit_transform(X_train_transform)
X_test_transform =  ss.transform(X_test_transform)

In [None]:
print(X_train_transform.mean(axis=0), X_train_transform.std(axis=0))
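
A minimal sketch of what the scaler does, assuming made-up ages rather than the real column: the mean and standard deviation are learned from the training array only and then reused, unchanged, on the test array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up ages, for illustration only (assumption: not the dataset's age column)
ages_train = rng.normal(loc=55, scale=15, size=(200, 1))
ages_test = rng.normal(loc=55, scale=15, size=(50, 1))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(ages_train)  # learns mean and std from the training data
test_scaled = scaler.transform(ages_test)        # reuses the training statistics

print(train_scaled.mean(), train_scaled.std())   # zero mean, unit std up to rounding
print(test_scaled.mean(), test_scaled.std())     # close to, but not exactly, 0 and 1
```

The test array is not expected to come out with exactly zero mean and unit variance, and that is fine: it has to be measured on the same scale the model was trained on.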

In [None]:
clf.fit(X_train_transform, y_train)
print("SVM accuracy:", clf.score(X_test_transform, y_test))

We can see how the classification accuracy varies as we change the classifier's parameters.

In [None]:
for C in 10**np.arange(-2, 2, 0.2):
    clf = SVC(C=C)
    clf.fit(X_train_transform, y_train)
    # C spans four orders of magnitude, so a logarithmic x-axis reads better
    plt.semilogx(C, clf.score(X_test_transform, y_test), '.r')
plt.ylabel('accuracy')
plt.xlabel('C')
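
Choosing `C` by looking at test accuracy tends to give an optimistic estimate. One common alternative, sketched here under assumptions, is to cross-validate on the training data with the scaler and the classifier wrapped in a pipeline, so the scaling is refit inside each fold. The data below is synthetic (from `make_classification`), standing in for the notebook's arrays.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the mammographic dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=0)

# make_pipeline names each step after its class in lowercase, hence "svc__C"
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": 10.0 ** np.arange(-2, 2, 0.5)}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

The held-out test set is then scored once, with the selected value of `C`.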

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Test accuracy as a function of the number of neighbors
for neigh in np.arange(1, 20, 1):
    clf = KNeighborsClassifier(n_neighbors=neigh)
    clf.fit(X_train_transform, y_train)
    plt.plot(neigh, clf.score(X_test_transform, y_test), '.r')
plt.ylabel('accuracy')
plt.xlabel('neighbors')

In [None]:
clf=RandomForestClassifier(max_depth=3, n_estimators=100)
clf.fit(X_train_transform, y_train)
clf.score(X_test_transform, y_test)

In [None]:
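# Note: this index list keeps all five columns; shorten it to actually drop features
# (and adjust the categorical column indices below to match)
X_train_sub = X_train[:, [0, 1, 2, 3, 4]]
X_test_sub = X_test[:, [0, 1, 2, 3, 4]]

from sklearn.compose import ColumnTransformer
# categorical_features was removed from OneHotEncoder; ColumnTransformer selects the columns instead
oHe = ColumnTransformer([('onehot', OneHotEncoder(handle_unknown='ignore'), [0, 2, 3, 4])],
                        remainder='passthrough', sparse_threshold=0)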
X_train_transform = oHe.fit_transform(X_train_sub)
X_test_transform = oHe.transform(X_test_sub)

X_train_transform =  ss.fit_transform(X_train_transform)
X_test_transform =  ss.transform(X_test_transform)

In [None]:
clf=SVC()
clf.fit(X_train_transform, y_train)
print("Accuracy after removing features:", clf.score(X_test_transform, y_test))

## What about the NaNs?

In [None]:
y=mam_data['severity'].values

In [None]:
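from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

In [None]:
# Numerical column: fill missing ages with the mean age
imp_1 = SimpleImputer(strategy='mean')
temp_1 = imp_1.fit_transform(mam_data['age'].values.reshape(-1,1))

In [None]:
# Categorical and ordinal columns: fill with the most frequent value
imp_2 = SimpleImputer(strategy='most_frequent')
temp_2 = imp_2.fit_transform(mam_data.loc[:,['bi_rads','shape','margin','density']].values)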
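
The two imputation strategies can be checked by hand on a tiny example. The sketch below uses `SimpleImputer` from `sklearn.impute` (the imputer class in current scikit-learn) on made-up values, not rows from the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up values with gaps, for illustration only
toy_age = np.array([[40.0], [np.nan], [60.0], [50.0]])
toy_shape = np.array([[1.0], [4.0], [np.nan], [4.0]])

# Mean for a numerical column, most frequent value for a categorical one
age_filled = SimpleImputer(strategy='mean').fit_transform(toy_age)
shape_filled = SimpleImputer(strategy='most_frequent').fit_transform(toy_shape)

print(age_filled.ravel())    # the gap is filled with the mean of 40, 60 and 50
print(shape_filled.ravel())  # the gap is filled with the most common code, 4
```

The mean would make little sense for a nominal code such as shape, which is why the categorical columns use `most_frequent`.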

In [None]:
X= np.concatenate((temp_1,temp_2),axis=1)

In [None]:
X.shape

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [None]:
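from sklearn.compose import ColumnTransformer
# Column 0 is now age; columns 1-4 (bi_rads, shape, margin, density) are one-hot encoded
oHe = ColumnTransformer([('onehot', OneHotEncoder(handle_unknown='ignore'), [1, 2, 3, 4])],
                        remainder='passthrough', sparse_threshold=0)
X_train_transform = oHe.fit_transform(X_train)
X_test_transform = oHe.transform(X_test)
ss = StandardScaler()
X_train_transform = ss.fit_transform(X_train_transform)
# Reuse the scaling learned on the training set; do not refit on the test set
X_test_transform = ss.transform(X_test_transform)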

In [None]:
# age is the last column: the non-encoded column is placed after the one-hot columns
plt.hist(X_train_transform[:, -1])
plt.xlabel("age")

In [None]:
for C in 10**np.arange(-2, 2, 0.2):
    clf = SVC(C=C)
    clf.fit(X_train_transform, y_train)
    # Logarithmic x-axis; each point is plotted once
    plt.semilogx(C, clf.score(X_test_transform, y_test), '.r')
plt.ylabel('accuracy')
plt.xlabel('C')