# Machine Learning

# Iris Species

## Introduction

The objective is to develop a classification model using the Iris dataset and TensorFlow.

The data set has five columns:

- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)
- target

The Iris dataset contains three classes:

- Setosa (0)
- Versicolor (1)
- Virginica (2)

Each class has 50 records, with sepal and petal length and width measured in centimeters.

##### Source: Pierian Data

## Data Analysis

## Python, Pandas, Scikit-learn & TensorFlow

Import libraries.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import tensorflow as tf

Load the Iris dataset into a Pandas DataFrame.

In [2]:
df_iris = pd.read_csv("iris.csv")

Explore values and dimensions of the DataFrame.

In [3]:
df_iris.head()

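|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.0 |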


In [4]:
df_iris.tail()

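|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
| --- | --- | --- | --- | --- | --- |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2.0 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2.0 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2.0 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 2.0 |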


In [5]:
df_iris.shape

(150, 5)

In [6]:
df_iris.groupby("target").size()

target
0.0    50
1.0    50
2.0    50
dtype: int64

The column names must be adjusted, and the target column must be cast from float to int.

In [7]:
df_iris.columns

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'target'],
      dtype='object')

In [8]:
df_iris.columns = ['sepal_len_cm', 'sepal_w_cm', 'petal_len_cm', 'petal_w_cm', 'target']

In [9]:
df_iris.columns

Index(['sepal_len_cm', 'sepal_w_cm', 'petal_len_cm', 'petal_w_cm', 'target'], dtype='object')

In [10]:
df_iris["target"] = df_iris["target"].astype(int)

After these adjustments, the column names contain no spaces and the target column is of type int.

In [11]:
df_iris.head()

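|     | sepal_len_cm | sepal_w_cm | petal_len_cm | petal_w_cm | target |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |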


It is time to split the dataset into a training set and a test set.

In [12]:
X = df_iris[['sepal_len_cm', 'sepal_w_cm', 'petal_len_cm', 'petal_w_cm']]

In [13]:
y = df_iris["target"]

The dataset is divided into a training set (70%) and a test set (30%).

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
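
The split above can be illustrated with a minimal sketch that uses placeholder arrays in place of `iris.csv`. The `stratify` and `random_state` arguments are not part of the original cell; they are assumptions shown here as one way to keep the class proportions equal in both parts and to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the Iris features and labels
# (150 rows, 4 features, 3 balanced classes). The values are not real measurements.
X_demo = np.arange(150 * 4, dtype=float).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.3,     # 30% of 150 rows gives 45 test rows
    stratify=y_demo,   # assumption: keep one third of each class in both parts
    random_state=42,   # assumption: fixed seed so the split is repeatable
)

print(len(X_tr), len(X_te))                   # sizes of the two parts
print(np.bincount(y_tr), np.bincount(y_te))   # records per class in each part
```

Without `stratify`, the number of records per class in the test set varies from run to run, which is why the support values in a classification report need not be equal.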

## Train model

Define the feature columns using TensorFlow. Because every feature in the dataset is a number, each feature column must be numeric.

In [15]:
feature_columns = []

In [16]:
for col in X.columns:
    feature_columns.append(tf.feature_column.numeric_column(col))

In [17]:
feature_columns

[_NumericColumn(key='sepal_len_cm', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='sepal_w_cm', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='petal_len_cm', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='petal_w_cm', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]

Define an input function.

The input function and the estimator object both take parameters that must be tuned to obtain better results.

In [18]:
input_fn = tf.estimator.inputs.pandas_input_fn(x=X_train, y=y_train, batch_size=20, num_epochs=150, shuffle=True)

In this case n_classes is three because there are three species of flower. The hidden_units parameter defines the structure of the neural network by giving the number of units in each hidden layer, and it is one of the parameters that must be tuned to obtain better results.

In [19]:
estimator = tf.estimator.DNNClassifier(hidden_units=[10,10,15,10,10],n_classes=3,feature_columns=feature_columns)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/z6/vzs6bsnx4hn2qr27y9pzrrmw0000gn/T/tmp5f2lpiby', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x1a2dca8710>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
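
To get a feel for the size of the network defined by hidden_units, the sketch below counts its trainable parameters. It assumes plain fully connected layers with one bias per unit, four input features and three output classes; `count_dense_parameters` is a hypothetical helper written for this illustration, not part of TensorFlow.

```python
def count_dense_parameters(n_inputs, hidden_units, n_classes):
    """Count weights and biases of a fully connected network."""
    layer_sizes = [n_inputs] + list(hidden_units) + [n_classes]
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # One weight per input/output pair plus one bias per output unit.
        total += fan_in * fan_out + fan_out
    return total


n_params = count_dense_parameters(4, [10, 10, 15, 10, 10], 3)
print(n_params)  # 628
```

Under these assumptions the network has more parameters than the dataset has rows, which is one reason to check the model for overfitting.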


In [20]:
estimator.train(input_fn=input_fn,steps=30)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /var/folders/z6/vzs6bsnx4hn2qr27y9pzrrmw0000gn/T/tmp5f2lpiby/model.ckpt.
INFO:tensorflow:loss = 21.280293, step = 1
INFO:tensorflow:Saving checkpoints for 30 into /var/folders/z6/vzs6bsnx4hn2qr27y9pzrrmw0000gn/T/tmp5f2lpiby/model.ckpt.
INFO:tensorflow:Loss for final step: 5.1098313.


<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x1a2dca8358>
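
The interaction between batch_size, num_epochs and steps in the cells above can be checked with a little arithmetic. This is a rough sketch: it assumes 105 training rows (70% of 150) and that batches are drawn continuously from the repeated data, so the figures are approximate.

```python
import math

n_train = 105      # assumption: 70% of the 150 rows
batch_size = 20    # from the input function
num_epochs = 150   # from the input function
steps = 30         # from the call to estimator.train

# Records consumed before training stops at the step limit.
examples_seen = batch_size * steps
epochs_seen = examples_seen / n_train

# Steps that would be needed to consume all 150 epochs of data.
steps_for_all_epochs = math.ceil(n_train * num_epochs / batch_size)

print(examples_seen)            # 600
print(round(epochs_seen, 2))    # 5.71
print(steps_for_all_epochs)     # 788
```

With these values the step limit ends training after fewer than six passes over the training set, long before the 150 epochs offered by the input function are used up.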

## Test model

Define a second input function. Unlike the previous one, it receives only the test features, without labels, because its purpose is to obtain predictions from the trained model.

In [21]:
prediction_fn = tf.estimator.inputs.pandas_input_fn(x=X_test,batch_size=len(X_test),shuffle=False)

In [22]:
predictions = list(estimator.predict(input_fn=prediction_fn))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/z6/vzs6bsnx4hn2qr27y9pzrrmw0000gn/T/tmp5f2lpiby/model.ckpt-30
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


Review the predictions.

In [23]:
predictions

[{'logits': array([-2.2779343,  1.6463001,  1.2753551], dtype=float32),
  'probabilities': array([0.01155504, 0.5848504 , 0.40359464], dtype=float32),
  'class_ids': array([1]),
  'classes': array([b'1'], dtype=object)},
 {'logits': array([ 6.046981  ,  0.00824976, -3.2561164 ], dtype=float32),
  'probabilities': array([9.9753040e-01, 2.3786938e-03, 9.0916466e-05], dtype=float32),
  'class_ids': array([0]),
  'classes': array([b'0'], dtype=object)},
 {'logits': array([-3.163869 ,  1.8996477,  2.2327194], dtype=float32),
  'probabilities': array([0.00263298, 0.41639417, 0.58097285], dtype=float32),
  'class_ids': array([2]),
  'classes': array([b'2'], dtype=object)},
 {'logits': array([ 6.4244266e+00,  1.2074150e-03, -3.4619508e+00], dtype=float32),
  'probabilities': array([9.9832851e-01, 1.6207081e-03, 5.0777857e-05], dtype=float32),
  'class_ids': array([0]),
  'classes': array([b'0'], dtype=object)},
 {'logits': array([ 5.823807,  0.00987 , -3.134577], dtype=float32),
  'probabiliti
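
Each prediction above contains logits, probabilities and a class id. As a minimal sketch, assuming the probabilities are the softmax of the logits and the class id is the index of the largest probability, the first prediction can be reproduced with the standard library alone:

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-2.2779343, 1.6463001, 1.2753551]  # logits of the first prediction
probs = softmax(logits)
class_id = probs.index(max(probs))           # index of the most probable class

print(probs)
print(class_id)
```

The resulting values agree with the `probabilities` and `class_ids` entries of the first prediction to within rounding.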

## Results

The following results were obtained after tuning the parameters.

A confusion matrix and a classification report are used to evaluate the results.

In [24]:
predictions_single_vals = []

for prediction in predictions:
    predictions_single_vals.append(prediction["class_ids"][0])

In [25]:
print(confusion_matrix(y_test,predictions_single_vals))

[[13  0  0]
 [ 0 15  1]
 [ 0  1 15]]


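Every record of the first class (Setosa) was classified correctly. In the second and third classes, 15 of 16 records were classified correctly (about 94%): one Versicolor record was predicted as Virginica and one Virginica record was predicted as Versicolor.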
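
The per-class rates can be derived directly from the confusion matrix. The sketch below assumes the scikit-learn convention, in which rows are true classes and columns are predicted classes, and uses the matrix printed above.

```python
cm = [
    [13, 0, 0],
    [0, 15, 1],
    [0, 1, 15],
]  # rows = true class, columns = predicted class

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))

# Recall: share of each true class that was predicted correctly.
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]
accuracy = correct / total

print(recall)              # [1.0, 0.9375, 0.9375]
print(round(accuracy, 4))  # 0.9556
```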

In [26]:
print(classification_report(y_test,predictions_single_vals))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       0.94      0.94      0.94        16
           2       0.94      0.94      0.94        16

   micro avg       0.96      0.96      0.96        45
   macro avg       0.96      0.96      0.96        45
weighted avg       0.96      0.96      0.96        45



The classification report confirms the results of the confusion matrix. The averaged F1 score is 0.96, and because 43 of the 45 test records were classified correctly, the accuracy is also about 96%. The model was obtained after tuning the parameters on a single train/test split, so more tests should be performed to rule out overfitting.
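
F1 and accuracy are different metrics, even though they nearly coincide here. As a minimal sketch, assuming rows of the confusion matrix are true classes and columns are predicted classes, the per-class precision, recall and F1 values of the report can be recomputed as follows:

```python
cm = [
    [13, 0, 0],
    [0, 15, 1],
    [0, 1, 15],
]  # rows = true class, columns = predicted class

n = len(cm)
support = [sum(cm[i]) for i in range(n)]                          # true records per class
predicted = [sum(cm[r][c] for r in range(n)) for c in range(n)]   # predictions per class

precision = [cm[i][i] / predicted[i] for i in range(n)]
recall = [cm[i][i] / support[i] for i in range(n)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

macro_f1 = sum(f1) / n                                            # unweighted mean
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print([round(v, 2) for v in f1])   # [1.0, 0.94, 0.94]
print(round(macro_f1, 2))          # 0.96
print(round(weighted_f1, 2))       # 0.96
```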