## A "Hello World" Example of Machine Learning - Revisit

Loading the Iris dataset from scikit-learn. 

The first column holds the sepal length, the second the sepal width, the third the petal length, and the fourth the petal width of the flower samples. The classes (species) are already encoded as integer labels: 0=Iris-Setosa, 1=Iris-Versicolor, 2=Iris-Virginica.

Here, we use only two features: the first and fourth columns (sepal length and petal width), as selected by iris.data[:, [0, 3]] in the code.

In [1]:
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
# iris.data

In [2]:
X = iris.data[:, [0, 3]]

In [3]:
y = iris.target

In [4]:
print('Class labels:', np.unique(y))

Class labels: [0 1 2]


Linear classifiers in scikit-learn, such as the Perceptron used here, support multi-class classification via the one-versus-rest (OvR) method, which trains one binary classifier per class.
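To make the OvR idea concrete, the following is a minimal sketch in plain Python rather than scikit-learn's actual implementation. It assumes one linear scorer per class and picks the class with the largest score; the function name `ovr_predict` and all weights and biases are hypothetical illustration values, not fitted parameters.

```python
# One-vs-rest (OvR) sketch: one binary linear scorer per class,
# and the predicted class is the one with the largest score.
# The weights and biases below are hypothetical, not fitted values.

def ovr_predict(x, weights, biases):
    """Return the index of the class whose linear score w.x + b is largest."""
    scores = [
        sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Three classes, two (standardized) features: one weight vector per class.
weights = [[-1.0, -2.0], [0.5, -0.2], [0.8, 2.0]]
biases = [-1.0, 0.0, -1.0]

pred_a = ovr_predict([-1.0, -1.2], weights, biases)
pred_b = ovr_predict([0.2, 0.1], weights, biases)
pred_c = ovr_predict([1.0, 1.5], weights, biases)
print(pred_a, pred_b, pred_c)  # prints: 0 1 2
```

A fitted multi-class linear model in scikit-learn exposes one row of coefficients per class, which is the same layout as `weights` here.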

Splitting data into 70% training and 30% test data:

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

In [6]:
print('Labels counts in y:', np.bincount(y))
print('Labels counts in y_train:', np.bincount(y_train))
print('Labels counts in y_test:', np.bincount(y_test))

Labels counts in y: [50 50 50]
Labels counts in y_train: [35 35 35]
Labels counts in y_test: [15 15 15]
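The label counts above follow from stratification: each class is split 70/30 on its own. The sketch below illustrates that idea in plain Python. It is an assumption-laden simplification (the helper `stratified_split` is hypothetical and its shuffling does not reproduce the exact samples that `train_test_split` selects), but the per-class counts come out the same.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Split sample indices so each class keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # test samples for this class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Same class layout as Iris: 50 samples for each of the labels 0, 1, 2.
labels = [0] * 50 + [1] * 50 + [2] * 50
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=1)

train_counts = [sum(labels[i] == c for i in train_idx) for c in (0, 1, 2)]
test_counts = [sum(labels[i] == c for i in test_idx) for c in (0, 1, 2)]
print(train_counts, test_counts)  # prints: [35, 35, 35] [15, 15, 15]
```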


### Standardizing the features:

In [7]:
from sklearn.preprocessing import StandardScaler

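sc = StandardScaler()  # centers each feature at zero mean and scales it to unit standard deviation
sc.fit(X_train)  # estimate the mean and standard deviation from the training data only
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)  # reuse the training statistics for the test data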
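As a rough illustration of what the scaler computes for a single feature, here is a standard-library sketch. It assumes the population standard deviation, which is what StandardScaler uses; the helper names `fit_scaler` and `transform` and the feature values are hypothetical.

```python
import statistics

def fit_scaler(column):
    """Return the mean and population standard deviation of one feature."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Standardize values using statistics estimated elsewhere."""
    return [(value - mean) / std for value in column]

# Hypothetical feature values (cm), not taken from the Iris data.
train_col = [1.4, 4.7, 5.1, 1.3, 4.5, 6.0]
test_col = [1.5, 4.9]

mean, std = fit_scaler(train_col)          # statistics come from training data only
train_std = transform(train_col, mean, std)
test_std = transform(test_col, mean, std)  # reuse the training mean and std

print(statistics.fmean(train_std), statistics.pstdev(train_std))
```

The standardized training column has a mean of approximately 0 and a standard deviation of approximately 1. The test column generally does not, because it is scaled with the training statistics.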

In [8]:
from sklearn.linear_model import Perceptron

ppn = Perceptron(max_iter=100, eta0=0.1, random_state=42)
ppn.fit(X_train_std, y_train)

Perceptron(eta0=0.1, max_iter=100, random_state=42)
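For intuition about what the learning rate and the epoch limit control, here is a sketch of the textbook perceptron learning rule for two classes. It is not scikit-learn's implementation, and the function names and toy data are hypothetical.

```python
def train_perceptron(samples, targets, eta=0.1, epochs=100):
    """Train a binary perceptron; targets must be +1 or -1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(samples, targets):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            prediction = 1 if activation >= 0.0 else -1
            if prediction != target:
                # Move the decision boundary toward the misclassified sample.
                weights = [w + eta * target * xi for w, xi in zip(weights, x)]
                bias += eta * target
                errors += 1
        if errors == 0:  # converged: every training sample is classified correctly
            break
    return weights, bias

def predict(x, weights, bias):
    """Classify one sample as +1 or -1 from the sign of the linear score."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias >= 0.0 else -1

# Hypothetical, linearly separable toy data (two standardized features).
samples = [[-1.2, -1.0], [-0.8, -1.3], [-1.0, -0.7],
           [1.1, 0.9], [0.9, 1.2], [1.3, 0.8]]
targets = [-1, -1, -1, 1, 1, 1]

weights, bias = train_perceptron(samples, targets, eta=0.1, epochs=100)
predictions = [predict(x, weights, bias) for x in samples]
print(predictions)  # prints: [-1, -1, -1, 1, 1, 1]
```

On separable data like this the loop stops early. On data that is not linearly separable it keeps updating until the epoch limit is reached, which is why a limit is set at all.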

### Test the model with the hold-out test set

In [9]:
y_pred = ppn.predict(X_test_std)
print('Misclassified samples: ' + str((y_test != y_pred).sum()))

Misclassified samples: 10


In [10]:
from sklearn.metrics import accuracy_score

print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))

Accuracy: 0.7777777777777778
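The accuracy is simply the fraction of test samples classified correctly, so it can be recovered from the misclassification count. A small sketch, where `accuracy_from_errors` is a hypothetical helper and not part of scikit-learn:

```python
def accuracy_from_errors(n_misclassified, n_samples):
    """Fraction of samples that were classified correctly."""
    return (n_samples - n_misclassified) / n_samples

n_test = 45  # 15 test samples per class, three classes
acc = accuracy_from_errors(10, n_test)
print(acc)
```

With only 45 test samples, each additional misclassified sample changes the accuracy by 1/45, roughly 0.022.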


In [11]:
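# Note: the model was trained on standardized features, so new samples should
# normally be passed through sc.transform() first; here X_new is used raw.
X_new = [[1.1, 0.2], [0.4, 1.9], [1.4, 0.2]]
y_new = ppn.predict(X_new)
y_new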

array([2, 2, 2])

### Evaluate the model using cross validation

In [12]:
from sklearn.model_selection import cross_val_score
cross_val_score(ppn, X_train_std, y_train, cv=4, scoring="accuracy")

array([0.88888889, 0.76923077, 0.53846154, 0.80769231])

Two features, accuracy scores: array([0.88888889, 0.48148148, 0.7037037 , 0.95833333])
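The following sketch shows how k-fold cross-validation partitions the training set. It is a simplification: it uses plain contiguous folds, whereas scikit-learn uses stratified folds when a classifier is given an integer `cv`. The helper `kfold_indices` is hypothetical.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for plain k-fold CV."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # first folds absorb the remainder
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

# 105 training samples split into 4 folds, as in the cross_val_score call.
folds = list(kfold_indices(105, 4))
fold_sizes = [len(val) for _, val in folds]
print(fold_sizes)  # prints: [27, 26, 26, 26]
```

Each sample is used for validation exactly once and for training in the other three folds. With validation folds of only 26 or 27 samples, one misclassification moves a fold's accuracy by almost 0.04, which helps explain the spread between folds.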

### Exercise 1: Use all four features to train the model and use cross-validation to check whether the results are better. Briefly explain why.

Create a new dataset using all four features, then split and standardize the data using scikit-learn.

In [13]:
X = iris.data[:, [0, 1, 2, 3]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

Create a perceptron with at most 100 epochs, a learning rate of 0.1, and a random state of 42 to get reproducible results. Then fit the model on the data from the cell above and print the number of misclassified samples and the accuracy score.

In [14]:
ppn = Perceptron(max_iter=100, eta0=0.1, random_state=42)
ppn.fit(X_train_std, y_train)
y_pred = ppn.predict(X_test_std)
print('Misclassified samples: ' + str((y_test != y_pred).sum()))
print('Accuracy: ' + str(accuracy_score(y_test, y_pred)*100))

Misclassified samples: 3
Accuracy: 93.33333333333333


Perform 4-fold cross-validation and print the accuracy scores.

In [15]:
cross_val_score(ppn, X_train_std, y_train, cv=4, scoring="accuracy")

array([0.92592593, 0.88461538, 0.84615385, 0.88461538])

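The results are better: the hold-out accuracy rises from about 0.78 to about 0.93, and the mean cross-validation accuracy rises from about 0.75 to about 0.89. With four features the perceptron has more information for separating the classes. This does not mean that adding features will help in every scenario.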
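To compare the two perceptron runs with a single number per run, the fold accuracies can be averaged. The sketch below copies the values from the two cross_val_score outputs; the variable names are illustrative only.

```python
from statistics import fmean

# Fold accuracies copied from the two perceptron cross_val_score outputs.
two_features = [0.88888889, 0.76923077, 0.53846154, 0.80769231]
four_features = [0.92592593, 0.88461538, 0.84615385, 0.88461538]

mean_two = fmean(two_features)
mean_four = fmean(four_features)
spread_two = max(two_features) - min(two_features)
spread_four = max(four_features) - min(four_features)

print(f"two features:  mean={mean_two:.3f}, range={spread_two:.3f}")
print(f"four features: mean={mean_four:.3f}, range={spread_four:.3f}")
```

The four-feature model has both the higher mean (about 0.885 versus 0.751) and the smaller spread between folds (about 0.080 versus 0.350).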

### Exercise 2: Try the scikit-learn stochastic gradient descent model instead of the perceptron. Use cross-validation to evaluate the model's accuracy with both two features and four features.

Import scikit-learn's SGDClassifier and create an object with a random state of 42.

In [16]:
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier(random_state=42)
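With default settings, SGDClassifier minimizes the hinge loss (a linear SVM), while the perceptron corresponds to the perceptron loss. The sketch below compares the two losses for a single sample. The function names and scores are hypothetical, and the binary case with targets of +1 or -1 is assumed.

```python
def perceptron_loss(target, score):
    """Zero when the sign is right; grows linearly with the error otherwise."""
    return max(0.0, -target * score)

def hinge_loss(target, score):
    """Also penalizes correct predictions whose margin is smaller than 1."""
    return max(0.0, 1.0 - target * score)

# target is +1 or -1; score is the raw linear output w.x + b (hypothetical values).
cases = [(1, 2.0), (1, 0.5), (1, -0.5)]
losses = [(perceptron_loss(t, s), hinge_loss(t, s)) for t, s in cases]
for (target, score), (p_loss, h_loss) in zip(cases, losses):
    print(f"target={target} score={score}: perceptron={p_loss}, hinge={h_loss}")
```

The middle case is the interesting one: the sample is classified correctly, so the perceptron loss is zero, but the hinge loss is still 0.5 because the sample lies within the margin.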

Using the data from the previous exercise (all four features), fit the SGD classifier and print the number of misclassified samples and the accuracy score.

In [17]:
sgd.fit(X_train_std, y_train)
y_pred_sgd = sgd.predict(X_test_std)
print('Misclassified samples: ' + str((y_test != y_pred_sgd).sum()))
print('Accuracy: ' + str(accuracy_score(y_test, y_pred_sgd)))

Misclassified samples: 5
Accuracy: 0.8888888888888888


Perform 4-fold cross-validation and print the accuracy scores.

In [18]:
cross_val_score(sgd, X_train_std, y_train, cv=4, scoring="accuracy")

array([1.        , 0.96153846, 0.84615385, 0.88461538])

Create a new dataset using two features, then split and standardize the data using scikit-learn. Also create an SGD classifier object with a random state of 42.

In [19]:
X = iris.data[:, [0, 3]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

sgd = SGDClassifier(random_state=42)

Fit the SGD classifier and print the number of misclassified samples and the accuracy score.

In [20]:
sgd.fit(X_train_std, y_train)
y_pred_sgd = sgd.predict(X_test_std)
print('Misclassified samples: ' + str((y_test != y_pred_sgd).sum()))
print('Accuracy: ' + str(accuracy_score(y_test, y_pred_sgd)))

Misclassified samples: 1
Accuracy: 0.9777777777777777


Perform 4-fold cross-validation on the SGD classifier and print the accuracy of each fold.

In [21]:
cross_val_score(sgd, X_train_std, y_train, cv=4, scoring="accuracy")

array([0.92592593, 0.76923077, 0.96153846, 0.80769231])

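In cross-validation, the SGD classifier did better with four features (mean accuracy about 0.92) than with two (about 0.87). This is not surprising, because the model has more information to help it classify the targets correctly. It is not always the case, though: on the hold-out test set the two-feature model scored higher (0.978 versus 0.889), a difference of only four samples out of 45.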
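Because the hold-out and cross-validation results point in different directions, it helps to put them side by side. The sketch below copies the numbers from the SGDClassifier outputs; the dictionary layout is just an illustrative choice.

```python
from statistics import fmean

# Numbers copied from the SGDClassifier outputs in this exercise.
results = {
    "four features": {"holdout": 0.8888888888888888,
                      "cv": [1.0, 0.96153846, 0.84615385, 0.88461538]},
    "two features": {"holdout": 0.9777777777777777,
                     "cv": [0.92592593, 0.76923077, 0.96153846, 0.80769231]},
}

cv_means = {name: fmean(r["cv"]) for name, r in results.items()}
for name, r in results.items():
    print(f"{name}: hold-out={r['holdout']:.3f}, mean CV={cv_means[name]:.3f}")
```

The mean cross-validation accuracy averages over four validation folds, so it is usually a steadier basis for comparison than a single 45-sample hold-out score.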