## A "Hello World" Example of Machine Learning - Revisited

Loading the Iris dataset from scikit-learn. 

The first column represents sepal length, the second sepal width, the third petal length, and the fourth petal width of the flower samples. The classes (species) are already encoded as integer labels, where 0 = Iris-Setosa, 1 = Iris-Versicolor, and 2 = Iris-Virginica.

Here, we use only two features: the first and fourth columns (sepal length and petal width), matching the indices `[0, 3]` in the code below.

In [1]:
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
#iris.data

In [2]:
X = iris.data[:, [0, 3]]

In [3]:
y = iris.target

In [4]:
print('Class labels:', np.unique(y))

Class labels: [0 1 2]
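To double-check which measurements the column indices select, we can look them up in the dataset's `feature_names` attribute. This is a minimal sketch; the variable names `selected` and `label_names` are only illustrative.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Column indices used for X above: 0 and 3
selected = [iris.feature_names[i] for i in [0, 3]]
print(selected)

# Integer class labels map to species names via target_names
label_names = {i: str(name) for i, name in enumerate(iris.target_names)}
print(label_names)
```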


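Scikit-learn's linear classifiers, including the `Perceptron` used below, support multi-class classification via the One-versus-Rest (OvR) method: one binary classifier is trained per class, and the class with the highest score wins.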
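The OvR scheme can be made visible by inspecting a fitted model. The sketch below fits a `Perceptron` on the raw two-feature data purely for illustration (the names `X_demo`, `y_demo`, and `clf` are not part of the notebook) and checks that the prediction is the class with the largest per-class score.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Perceptron

iris = datasets.load_iris()
X_demo = iris.data[:, [0, 3]]
y_demo = iris.target

clf = Perceptron(max_iter=100, eta0=0.1, random_state=42)
clf.fit(X_demo, y_demo)

# One weight vector and one intercept per class: (n_classes, n_features)
print(clf.coef_.shape, clf.intercept_.shape)

# Each row of decision_function holds the three per-class scores;
# the predicted label is the class with the largest score.
scores = clf.decision_function(X_demo)
manual_pred = clf.classes_[np.argmax(scores, axis=1)]
print(np.array_equal(manual_pred, clf.predict(X_demo)))
```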

Splitting data into 70% training and 30% test data:

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

In [6]:
print('Labels counts in y:', np.bincount(y))
print('Labels counts in y_train:', np.bincount(y_train))
print('Labels counts in y_test:', np.bincount(y_test))

Labels counts in y: [50 50 50]
Labels counts in y_train: [35 35 35]
Labels counts in y_test: [15 15 15]
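To see what `stratify=y` contributes, the same split can be repeated without it. This is only a sketch with illustrative variable names; the unstratified counts depend on the random split and are not guaranteed to be balanced.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_demo, y_demo = iris.data[:, [0, 3]], iris.target

# Same split as above, with stratification
_, _, y_train_strat, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo)

# Same split, but without stratification
_, _, y_train_plain, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1)

print('Stratified test counts:  ', np.bincount(y_test_strat, minlength=3))
print('Unstratified test counts:', np.bincount(y_test_plain, minlength=3))
```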


### Standardizing the features:

In [7]:
from sklearn.preprocessing import StandardScaler

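sc = StandardScaler()  # rescale each feature to zero mean and unit standard deviation
sc.fit(X_train)  # estimate mean and standard deviation from the training data only
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)  # reuse the training statistics on the test set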
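As a sanity check on what the scaler does, the transformation can be reproduced by hand from the fitted `mean_` and `scale_` attributes. The sketch below uses its own variable names (`scaler`, `X_tr`, `X_te`) so that it does not overwrite the notebook's state. The test set is scaled with the training statistics, so its mean and standard deviation will generally be close to, but not exactly, 0 and 1.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_tr, X_te, _, _ = train_test_split(
    iris.data[:, [0, 3]], iris.target,
    test_size=0.3, random_state=1, stratify=iris.target)

scaler = StandardScaler().fit(X_tr)
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)

# transform() computes (x - mean_) / scale_ with the training statistics
manual = (X_tr - scaler.mean_) / scaler.scale_
print(np.allclose(manual, X_tr_std))

print('train mean/std:', X_tr_std.mean(axis=0), X_tr_std.std(axis=0))
print('test mean/std: ', X_te_std.mean(axis=0), X_te_std.std(axis=0))
```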

In [8]:
from sklearn.linear_model import Perceptron

ppn = Perceptron(max_iter=100, eta0=0.1, random_state=42)
ppn.fit(X_train_std, y_train)

Perceptron(eta0=0.1, max_iter=100, random_state=42)

### Test the model with the hold-out test set

In [9]:
y_pred = ppn.predict(X_test_std)
print('Misclassified samples: ' + str((y_test != y_pred).sum()))

Misclassified samples: 10


In [10]:
from sklearn.metrics import accuracy_score

print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))

Accuracy: 0.7777777777777778
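The two numbers above are linked: 10 errors out of 45 test samples gives 1 - 10/45 ≈ 0.778. A confusion matrix additionally shows which classes get confused. The sketch below refits the same two-feature model under its own variable names; the exact matrix may vary with the scikit-learn version, so no output is shown.

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data[:, [0, 3]], iris.target,
    test_size=0.3, random_state=1, stratify=iris.target)

scaler = StandardScaler().fit(X_tr)
clf = Perceptron(max_iter=100, eta0=0.1, random_state=42)
clf.fit(scaler.transform(X_tr), y_tr)
y_hat = clf.predict(scaler.transform(X_te))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat)
print(cm)

# Accuracy is the fraction of samples on the diagonal
acc_from_cm = cm.trace() / cm.sum()
print(acc_from_cm, accuracy_score(y_te, y_hat))
```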


In [11]:
X_new = [[1.1, 0.2],[0.4, 1.9], [1.4, 0.2]]
y_new = ppn.predict(X_new)
y_new

array([2, 2, 2])
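Note that the cell above passes raw measurements to a model that was trained on standardized features, which may be why all three samples end up in class 2. A minimal sketch of applying the fitted scaler to new samples first is shown below. It refits the two-feature model under its own names to stay self-contained, and it assumes the new values are given in the same order as the training columns (sepal length, petal width).

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_tr, _, y_tr, _ = train_test_split(
    iris.data[:, [0, 3]], iris.target,
    test_size=0.3, random_state=1, stratify=iris.target)

scaler = StandardScaler().fit(X_tr)
clf = Perceptron(max_iter=100, eta0=0.1, random_state=42)
clf.fit(scaler.transform(X_tr), y_tr)

# Same sample values as in the cell above
X_new = np.array([[1.1, 0.2], [0.4, 1.9], [1.4, 0.2]])

# Scale the new samples with the training statistics before predicting
X_new_std = scaler.transform(X_new)
y_new_scaled = clf.predict(X_new_std)
print(y_new_scaled)
```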

### Evaluate the model using cross validation

In [12]:
from sklearn.model_selection import cross_val_score
cross_val_score(ppn, X_train_std, y_train, cv=4, scoring="accuracy")

array([0.88888889, 0.76923077, 0.53846154, 0.80769231])

Two features, accuracy scores: array([0.88888889, 0.48148148, 0.7037037 , 0.95833333])
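In the cross-validation cell above, the scaler was fitted on the whole training set beforehand, so every validation fold contributed to the scaling statistics. One common alternative, sketched here, wraps the scaler and the classifier in a pipeline so that the scaler is refitted inside each fold. The variable names are illustrative, and the scores may differ from the ones shown above.

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_tr, _, y_tr, _ = train_test_split(
    iris.data[:, [0, 3]], iris.target,
    test_size=0.3, random_state=1, stratify=iris.target)

# The scaler is fitted on the training part of each fold only
pipe = make_pipeline(
    StandardScaler(),
    Perceptron(max_iter=100, eta0=0.1, random_state=42))

cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=4, scoring="accuracy")
print(cv_scores)
print('mean:', cv_scores.mean(), 'std:', cv_scores.std())
```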

### Exercise 1: Use all four features to train the model and use cross-validation to check whether the results are better. Briefly explain why.

Here we train on all four features, standardize them as before, and compare the number of misclassified test samples and the accuracy against the two-feature model.

In [13]:
iris = datasets.load_iris()
#iris.data
X = iris.data[:, [0, 1, 2, 3]]
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

sc = StandardScaler() #center the distribution around zero (mean), with a standard deviation of 1.
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

ppn = Perceptron(max_iter=100, eta0=0.1, random_state=42)
ppn.fit(X_train_std, y_train)

y_pred = ppn.predict(X_test_std)
print('Misclassified samples: ' + str((y_test != y_pred).sum()))

print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))

Misclassified samples: 3
Accuracy: 0.9333333333333333


Using four features, we have fewer misclassified samples (3 versus 10) and a higher accuracy (0.93 versus 0.78) than with only two.

Cross validating and checking the accuracy scores again.

In [14]:
cross_val_score(ppn, X_train_std, y_train, cv=4, scoring="accuracy")

array([0.92592593, 0.88461538, 0.84615385, 0.88461538])

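Once again, using four features gives higher accuracy than using two. Compared with the two-feature Perceptron scores from cell 12, the per-fold gains range from about 4 to 31 percentage points, and the mean accuracy rises from roughly 0.75 to 0.89. A likely explanation is that the added features, sepal width and petal length, carry class-discriminating information that sepal length and petal width alone lack, which makes the classes easier to separate with a linear model.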
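The comparison can be quantified directly from the fold scores reported above. The sketch below only copies the printed Perceptron results from cells 12 and 14 into arrays (the array names are illustrative) and summarizes them.

```python
import numpy as np

# Fold accuracies copied from the cross_val_score outputs above
two_features = np.array([0.88888889, 0.76923077, 0.53846154, 0.80769231])
four_features = np.array([0.92592593, 0.88461538, 0.84615385, 0.88461538])

mean_two = round(float(two_features.mean()), 3)
mean_four = round(float(four_features.mean()), 3)
gains = np.round(four_features - two_features, 3)

print('Mean, two features: ', mean_two)    # 0.751
print('Mean, four features:', mean_four)   # 0.885
print('Per-fold gain:', gains)
```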

### Exercise 2: Try the scikit-learn stochastic gradient descent model instead of the perceptron. Use all four features. Evaluate with cross-validation how the model performs in terms of accuracy with both two features and four features.

Testing two features with the stochastic gradient descent model.

In [15]:
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier(random_state=42)

iris = datasets.load_iris()
#iris.data
X = iris.data[:, [0, 3]]
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

sc = StandardScaler() #center the distribution around zero (mean), with a standard deviation of 1.
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

sgd.fit(X_train_std, y_train)

y_pred = sgd.predict(X_test_std)
print('Misclassified samples: ' + str((y_test != y_pred).sum()))

print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))

cross_val_score(sgd, X_train_std, y_train, cv=4, scoring="accuracy")

Misclassified samples: 1
Accuracy: 0.9777777777777777


array([0.92592593, 0.76923077, 0.96153846, 0.80769231])

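With only two features, the SGD model misclassifies a single test sample (97.8% accuracy), compared with 10 misclassified samples for the Perceptron on the same features. The cross-validation scores (mean of about 0.87) are comparable to those of the Perceptron with four features (mean of about 0.89).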
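Some background on why the two models can behave differently: with default settings, `SGDClassifier` optimizes a hinge loss with an L2 penalty (a linear SVM objective), whereas the scikit-learn documentation describes `Perceptron()` as equivalent to `SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant", penalty=None)`. The sketch below only inspects these settings and does not fit anything; note that the Perceptron in this notebook uses `eta0=0.1` instead of the default.

```python
from sklearn.linear_model import Perceptron, SGDClassifier

# Default SGDClassifier: hinge loss with an L2 penalty
sgd_default = SGDClassifier(random_state=42)
print(sgd_default.loss, sgd_default.penalty)

# Default Perceptron: no penalty and a constant learning rate of 1.0
ppn_default = Perceptron(random_state=42)
print(ppn_default.penalty, ppn_default.eta0)

# SGDClassifier configured as the documentation describes for Perceptron()
sgd_as_perceptron = SGDClassifier(
    loss="perceptron", eta0=1, learning_rate="constant",
    penalty=None, random_state=42)
print(sgd_as_perceptron.loss, sgd_as_perceptron.learning_rate)
```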

Testing four features with the stochastic gradient descent model.

In [16]:
sgd = SGDClassifier(random_state=42)

iris = datasets.load_iris()
#iris.data
X = iris.data[:, [0, 1, 2, 3]]
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

sc = StandardScaler() #center the distribution around zero (mean), with a standard deviation of 1.
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

sgd.fit(X_train_std, y_train)

y_pred = sgd.predict(X_test_std)
print('Misclassified samples: ' + str((y_test != y_pred).sum()))

print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))

cross_val_score(sgd, X_train_std, y_train, cv=4, scoring="accuracy")

Misclassified samples: 5
Accuracy: 0.8888888888888888


array([1.        , 0.96153846, 0.84615385, 0.88461538])

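Using four features, the test set shows more misclassified samples (5 versus 1) and a lower accuracy (88.9% versus 97.8%) than the two-feature SGD model. The cross-validation scores, however, are the highest so far: a mean of about 0.92, compared with about 0.89 for the four-feature Perceptron and about 0.87 for the two-feature SGD model. With only 45 test samples and 26 or 27 samples per fold, a difference of a few samples shifts these scores noticeably, so the ranking should be read with caution.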
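The four experiments above repeat the same steps, so they can be collected into a single loop. The sketch below is one possible arrangement, not the notebook's original code: the dictionaries `feature_sets` and `models` are illustrative, and the scaler sits inside a pipeline, so the mean accuracies may differ slightly from the ones reported above.

```python
from sklearn import datasets
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

feature_sets = {"two features": [0, 3], "four features": [0, 1, 2, 3]}
models = {
    "Perceptron": lambda: Perceptron(max_iter=100, eta0=0.1, random_state=42),
    "SGDClassifier": lambda: SGDClassifier(random_state=42),
}

results = {}
for feat_name, cols in feature_sets.items():
    # Same 70/30 stratified split as in the cells above
    X_tr, _, y_tr, _ = train_test_split(
        iris.data[:, cols], iris.target,
        test_size=0.3, random_state=1, stratify=iris.target)
    for model_name, make_model in models.items():
        # A fresh, unfitted model for every combination
        pipe = make_pipeline(StandardScaler(), make_model())
        scores = cross_val_score(pipe, X_tr, y_tr, cv=4, scoring="accuracy")
        results[(model_name, feat_name)] = float(scores.mean())

for key, mean_acc in results.items():
    print(key, round(mean_acc, 3))
```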