In [36]:
%run '0.0_init_configuration.ipynb'

<h2 id="Example">Cancer Data Example</h2>

The example is based on a dataset that is publicly available from the UCI Machine Learning Repository (Asuncion and Newman, 2007)[[http://mlearn.ics.uci.edu/MLRepository.html](http://mlearn.ics.uci.edu/MLRepository.html)]. The dataset consists of several hundred human cell sample records, each of which contains the values of a set of cell characteristics. The fields in each record are:

| Field name  | Description                 |
| ----------- | --------------------------- |
| ID          | Sample identifier           |
| Clump       | Clump thickness             |
| UnifSize    | Uniformity of cell size     |
| UnifShape   | Uniformity of cell shape    |
| MargAdh     | Marginal adhesion           |
| SingEpiSize | Single epithelial cell size |
| BareNuc     | Bare nuclei                 |
| BlandChrom  | Bland chromatin             |
| NormNucl    | Normal nucleoli             |
| Mit         | Mitoses                     |
| Class       | Benign or malignant         |

<br>
<br>

Let's load the dataset:

In [37]:
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/cell_samples.csv")

df.shape

(699, 11)

In [38]:
# BareNuc contains a non-numeric placeholder, "?"
df['BareNuc'].unique()
# Coerce non-numeric values to NaN; notnull() is then True only for the valid rows,
# and df[...] keeps just those rows
df = df[pd.to_numeric(df['BareNuc'], errors='coerce').notnull()]
df.head()

Unnamed: 0,ID,Clump,UnifSize,UnifShape,MargAdh,SingEpiSize,BareNuc,BlandChrom,NormNucl,Mit,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2
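
The filtering step above can be tried in isolation on a few values. This is a minimal sketch, assuming a hypothetical toy Series named `bare_nuc` rather than the real column; the final `astype(int)` cast is an optional extra that the notebook itself does not perform.

```python
import pandas as pd

# Toy stand-in for the BareNuc column: numeric strings plus one "?" placeholder.
bare_nuc = pd.Series(["1", "10", "?", "4"], name="BareNuc")

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
as_numbers = pd.to_numeric(bare_nuc, errors="coerce")

# notnull() gives a boolean mask that is True only for the parsable rows.
mask = as_numbers.notnull()

# Boolean indexing keeps the valid rows. The surviving values are still strings,
# so cast explicitly if an integer column is wanted (assumption: not done above).
cleaned = bare_nuc[mask].astype(int)

print(mask.tolist())     # [True, True, False, True]
print(cleaned.tolist())  # [1, 10, 4]
```

Note that filtering with the mask only drops rows; it does not change the column's dtype, which is why the cast is shown as a separate step.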


In [39]:
# collect the feature columns
X  = df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc',
          'BlandChrom', 'NormNucl', 'Mit']]

# define the target variable
y = df['Class']

In [40]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print("Train set: ", X_train.shape, y_train.shape )
print("Test set: ", X_test.shape, y_test.shape )

Train set:  (546, 9) (546,)
Test set:  (137, 9) (137,)


In [41]:
# List the model's parameters that GridSearchCV can search over
model = RandomForestClassifier()
model.get_params().keys()

dict_keys(['bootstrap', 'ccp_alpha', 'class_weight', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'max_samples', 'min_impurity_decrease', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'monotonic_cst', 'n_estimators', 'n_jobs', 'oob_score', 'random_state', 'verbose', 'warm_start'])

`GridSearchCV` performs an exhaustive search over the specified parameter values. Many of the random forest's parameters are the same as those of classification trees; let's try different values for `max_depth`, `max_features` and `n_estimators`.
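
To make "exhaustive" concrete, the sketch below enumerates every candidate in a grid using only the standard library. The grid `toy_grid` is a hypothetical, much smaller stand-in for the one used in this notebook, and the fold count assumes `GridSearchCV`'s default of 5-fold cross-validation.

```python
from itertools import product

# Hypothetical toy grid, much smaller than the one used in this notebook.
toy_grid = {
    "n_estimators": [1, 3, 5],
    "max_depth": [1, 3],
    "max_features": ["sqrt", "log2"],
}

# Every combination of one value per parameter is a candidate model.
names = list(toy_grid)
candidates = [dict(zip(names, values)) for values in product(*toy_grid.values())]

n_folds = 5  # assumption: GridSearchCV's default number of cross-validation folds

print(len(candidates))            # 12
print(len(candidates) * n_folds)  # 60 fits, plus one final refit of the best candidate
print(candidates[0])              # {'n_estimators': 1, 'max_depth': 1, 'max_features': 'sqrt'}
```

The number of candidates is the product of the list lengths, so each extra value in any list multiplies the total training time.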

In [42]:
param_grid = {'n_estimators' : [2 * n + 1 for n in range(20)],
              'max_depth': [2 * n + 1 for n in range(10)],
              # 'auto' (an alias of 'sqrt' for classifiers) was removed in scikit-learn 1.3
              'max_features' : ['sqrt', 'log2']}

# create the grid search object and fit it on the training data
search = GridSearchCV(estimator=model, param_grid = param_grid, scoring='accuracy')
search.fit(X_train, y_train)
# best mean cross-validated accuracy among the searched combinations (~98%)
search.best_score_

np.float64(0.9799165971643037)

In [43]:
# The best parameter values are:
search.best_params_

{'max_depth': 11, 'max_features': 'log2', 'n_estimators': 19}

In [44]:
def get_accuracy(X_train, X_test, y_train, y_test, model):
    # Accuracy of an already fitted model on the test and training sets
    return {'test Accuracy': metrics.accuracy_score(y_test, model.predict(X_test)),
            'train Accuracy': metrics.accuracy_score(y_train, model.predict(X_train))}

In [45]:
get_accuracy(X_train, X_test, y_train, y_test, search.best_estimator_)

{'test Accuracy': 0.9781021897810219, 'train Accuracy': 1.0}