Building and analyzing the performance of multiclass classifiers

Multiclass classifiers
In this assessment, you will load a dataset, train two models to perform multiclass classification, and compare their results. The dataset is the digits dataset available from scikit-learn's datasets module. It contains 1,797 samples of handwritten digits, each described by 64 features, and your goal is to classify each sample correctly as one of the digits 0 to 9.

Load the data
Import the load_digits() function from the sklearn.datasets module.
Invoke load_digits() with the return_X_y parameter set to True. Store the returned dataset in variables X and y.

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns


In [7]:
digits = load_digits()
print(digits.data.shape)


(1797, 64)


In [8]:
X, y = load_digits(return_X_y=True)
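
To make the shape printed above more concrete, here is a minimal sketch showing that each 64-feature row is a flattened 8x8 grayscale image. The variable name first_image is only illustrative and is not part of the assessment.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

# Each row of X holds the 64 pixel intensities of one 8x8 image
first_image = X[0].reshape(8, 8)

print('Shape of one reshaped sample:', first_image.shape)
print('Matches digits.images[0]:', np.array_equal(first_image, digits.images[0]))
print('Pixel value range: {:.0f} to {:.0f}'.format(X.min(), X.max()))
print('Label of the first sample:', y[0])
```

Because the features are raw pixel intensities on a shared scale, the models below can be fitted without any feature engineering, although scaling may still help the logistic regression solver converge faster.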

Exploratory data analysis
Perform a few exploratory steps, including the following:

Display the number of rows of data returned.
Display the number of features in the dataset.
Use NumPy's bincount() to display how many samples belong to each class. Is this a balanced dataset?

In [9]:
import numpy as np

print('The number of rows in the dataset is {:d}'.format(X.shape[0]))
print('The number of features in the dataset is {:d}'.format(X.shape[1]))
np.bincount(y)

The number of rows in the dataset is 1797
The number of features in the dataset is 64


array([178, 182, 177, 183, 181, 182, 181, 179, 174, 180])
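
One way to answer the balance question is to turn the counts into proportions and compare the largest class with the smallest. The sketch below does this; the name imbalance_ratio is an illustrative choice, not a scikit-learn or NumPy term.

```python
import numpy as np
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

counts = np.bincount(y)               # samples per digit class
proportions = counts / counts.sum()   # share of the dataset per class

for digit, (n, p) in enumerate(zip(counts, proportions)):
    print('digit {}: {} samples ({:.1%})'.format(digit, n, p))

# Ratio of the largest class to the smallest; values near 1 indicate balance
imbalance_ratio = counts.max() / counts.min()
print('Largest/smallest class ratio: {:.3f}'.format(imbalance_ratio))
```

Every class holds roughly a tenth of the samples, so the dataset can be treated as balanced and plain accuracy is a reasonable metric for the steps that follow.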

Prepare training and testing data
Use train_test_split() to split the dataset into a training set and a test set. Set the proportion of test data to 20%. Set a random state value so that the results will be repeatable.

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
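
As an optional variation, the split can be stratified so that each digit appears in the test set in about the same proportion as in the full dataset. The stratify=y argument is an addition that the assessment does not ask for; the sketch below only illustrates its effect and checks the resulting sizes.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# stratify=y is an assumption added here; the cell above splits without it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40, stratify=y
)

print('Training set shape:', X_train.shape)
print('Test set shape:', X_test.shape)
print('Test samples per class:', np.bincount(y_test))
```

Since the digits dataset is already close to balanced, a plain random split usually gives similar class proportions, so stratification matters more for skewed datasets.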

Cross-validation with logistic regression
In this step, you will create a LogisticRegression classifier and use five-fold cross-validation on the training set to estimate how well the model generalizes.

Import a LogisticRegression classifier from scikit-learn.
Instantiate a LogisticRegression classifier with the lbfgs solver and the one-vs-rest (ovr) multiclass strategy. You may have to raise the maximum number of iterations to 1,000 so that the solver converges.
Perform cross-validation on the model.
Print the cross-validation scores and the mean of the cross-validation scores.

In [11]:
# Imports here

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

lr_clf = LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter=1000)
lr_cv_scores = cross_val_score(lr_clf, X_train, y_train, cv=5)

print('Accuracy scores for the 5 folds: ', lr_cv_scores)
print('Mean cross-validation score: {:.3f}'.format(np.mean(lr_cv_scores)))

Accuracy scores for the 5 folds:  [0.94791667 0.94791667 0.95470383 0.94425087 0.95470383]
Mean cross-validation score: 0.950
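
Recent scikit-learn releases deprecate the multi_class argument of LogisticRegression (from version 1.5, to the best of my knowledge), so the cell above may emit a warning or fail on newer installations. A minimal sketch of the same one-vs-rest strategy using the OneVsRestClassifier wrapper follows; the names ovr_clf and ovr_cv_scores are illustrative, and the scores may differ slightly from those printed above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40
)

# Wrap a binary logistic regression so that one classifier is fitted per digit
ovr_clf = OneVsRestClassifier(LogisticRegression(solver='lbfgs', max_iter=1000))
ovr_cv_scores = cross_val_score(ovr_clf, X_train, y_train, cv=5)

print('Accuracy scores for the 5 folds: ', ovr_cv_scores)
print('Mean cross-validation score: {:.3f}'.format(np.mean(ovr_cv_scores)))
```

The wrapper makes the strategy explicit: ten binary models are trained, one per digit, and the class whose model reports the highest score is predicted.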


Cross-validation with random forest
Perform the same steps as above, but this time with a RandomForestClassifier.

In [12]:
# Imports here

from sklearn.ensemble import RandomForestClassifier

# No random_state is set here, so the scores vary slightly from run to run
rf_clf = RandomForestClassifier(n_estimators=24)
rf_cv_scores = cross_val_score(rf_clf, X_train, y_train, cv=5)

print('Accuracy scores for the 5 folds: ', rf_cv_scores)
print('Mean cross-validation score: {:.3f}'.format(np.mean(rf_cv_scores)))


Accuracy scores for the 5 folds:  [0.96527778 0.96180556 0.97212544 0.96864111 0.96167247]
Mean cross-validation score: 0.966
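
To set the two models side by side, the fold scores can be summarized by their mean and spread and compared fold by fold. The sketch below hard-codes the scores copied from the two outputs above rather than recomputing them, so it reflects that one run only; fold_diff is an illustrative name.

```python
import numpy as np

# Fold scores copied by hand from the two cross-validation outputs above
lr_cv_scores = np.array([0.94791667, 0.94791667, 0.95470383, 0.94425087, 0.95470383])
rf_cv_scores = np.array([0.96527778, 0.96180556, 0.97212544, 0.96864111, 0.96167247])

for name, scores in [('Logistic regression', lr_cv_scores),
                     ('Random forest', rf_cv_scores)]:
    print('{}: mean = {:.3f}, std = {:.3f}'.format(name, scores.mean(), scores.std()))

# With an integer cv and no shuffling, both models are scored on the same folds,
# so the scores can be compared fold by fold
fold_diff = rf_cv_scores - lr_cv_scores
print('Per-fold difference (RF - LR):', np.round(fold_diff, 3))
print('Folds where random forest scored higher: {:d} of {:d}'.format(
    int((fold_diff > 0).sum()), len(fold_diff)))
```

In this run the random forest scores higher on every fold, by about 1.6 percentage points on average. With only five folds and an unseeded forest, that gap is suggestive rather than conclusive.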
