# Multiclass classifiers


In this assessment, you will load a dataset and train two models to perform multiclass classification. Then you will compare the results of the models. The dataset is the digits dataset available from scikit-learn's `datasets` module. This dataset contains 1,797 samples of handwritten digits, and your goal is to classify each sample as one of the digits 0 to 9.

## Load the data

1. Import the `load_digits()` function from the `sklearn.datasets` module.
2. Invoke `load_digits()` with the `return_X_y` parameter set to `True`. Store the returned dataset in variables `X` and `y`.

In [3]:
from sklearn.datasets import load_digits
X, y = load_digits(return_X_y=True)
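As a quick sanity check on what was loaded, the sketch below inspects the shapes and views one row as an image. It relies on the fact that each sample in the digits dataset is a flattened 8x8 grayscale image; the name `first_image` is only illustrative.

```python
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# X holds one flattened 8x8 image per row; y holds the digit labels.
print('X shape:', X.shape)
print('y shape:', y.shape)

# Reshape the first row back into its 8x8 pixel grid (illustrative name).
first_image = X[0].reshape(8, 8)
print('Label of first sample:', y[0])
print(first_image.astype(int))
```

Pixel intensities in this dataset range from 0 to 16, which is worth knowing before deciding whether to scale the features.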

## Exploratory data analysis
Perform a few exploratory steps, including the following:

1. Display the number of rows of data returned.
2. Display the number of features in the dataset.
3. Use NumPy's `bincount()` to display how many samples belong to each class. Is this a balanced dataset?

In [9]:
import numpy as np

# X is a NumPy array, so its shape is (n_samples, n_features)
print('The number of rows in the dataset is {:d}'.format(X.shape[0]))
print('The number of features in the dataset is {:d}'.format(X.shape[1]))
np.bincount(y)

The number of rows in the dataset is 1797
The number of features in the dataset is 64


array([178, 182, 177, 183, 181, 182, 181, 179, 174, 180], dtype=int64)

Based on the counts above, the dataset is roughly balanced: every class has between 174 and 183 samples.
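One way to put a number on "balanced" is to look at class proportions and the ratio between the largest and smallest class. This is a minimal sketch; the names `proportions` and `imbalance_ratio` are illustrative, and the ratio is just one informal measure of balance.

```python
import numpy as np
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# Count samples per class, then convert counts to fractions of the dataset.
counts = np.bincount(y)
proportions = counts / counts.sum()

# Ratio of the largest class to the smallest; 1.0 would be perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print('Class proportions:', np.round(proportions, 3))
print('Largest/smallest class ratio: {:.3f}'.format(imbalance_ratio))
```

With the counts shown above, the ratio is 183/174, or about 1.05, so plain accuracy is a reasonable metric for this dataset.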

## Prepare training and testing data
1. Use `train_test_split()` to split the dataset into a training set and a test set. Set the proportion of test data to 20%. Set a random state value so that the results will be repeatable.

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
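To confirm that the split behaves as expected, you can print the sizes of both sets and the class counts in the test set. The sketch below also passes `stratify=y`, which is an addition not used in the cell above: it keeps class proportions the same in both sets, but it produces a different split, so the scores later in this notebook would not be reproduced exactly.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# stratify=y is an assumption added here; the original cell does not use it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print('Training samples:', X_train.shape[0])
print('Test samples:', X_test.shape[0])
print('Test class counts:', np.bincount(y_test))
```

A 20% test split of 1,797 samples gives 1,437 training samples and 360 test samples.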

## Cross-validation with logistic regression
In this step, you will create a `LogisticRegression` classifier and use five-fold cross-validation to train the model.

1. Import a `LogisticRegression` classifier from scikit-learn.
2. Instantiate a `LogisticRegression` classifier with the `lbfgs` solver and `ovr` multiclass strategy. You may have to set the maximum number of iterations to 1,000.
3. Perform cross-validation on the model.
4. Print the cross-validation scores and the mean of the cross-validation scores.

In [12]:
# Imports here
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


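# multi_class is left at its default. Note: in scikit-learn 0.22 and later the
# default is 'auto', which selects multinomial (not ovr) for lbfgs on multiclass data.
lr_clf = LogisticRegression(solver='lbfgs', random_state=64, max_iter=1000)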

lr_cv_scores = cross_val_score(lr_clf, X_train, y_train, cv=5)

print('Accuracy scores for the 5 folds: ', lr_cv_scores)
print('Mean cross validation score: {:.3f}'.format(np.mean(lr_cv_scores)))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Accuracy scores for the 5 folds:  [0.96875    0.96527778 0.95121951 0.96864111 0.93031359]
Mean cross validation score: 0.957
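The convergence warning suggests either raising `max_iter` or scaling the data. A minimal sketch of the scaling option follows, assuming a `StandardScaler` placed in a pipeline so that it is refit on the training portion of each fold; the names `scaled_lr_clf` and `scaled_cv_scores` are illustrative, and whether the warning disappears can depend on your scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline standardizes features inside each CV fold, so no statistics
# from a validation fold leak into the scaler.
scaled_lr_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='lbfgs', max_iter=1000),
)

scaled_cv_scores = cross_val_score(scaled_lr_clf, X_train, y_train, cv=5)

print('Accuracy scores for the 5 folds: ', scaled_cv_scores)
print('Mean cross validation score: {:.3f}'.format(np.mean(scaled_cv_scores)))
```

Scaling inside the pipeline, rather than once on all of `X_train`, keeps the cross-validation estimate honest.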




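Step 2 asks for the `ovr` strategy. Recent scikit-learn releases do not default to one-vs-rest for the `lbfgs` solver on multiclass data, so one way to request it explicitly is to wrap the estimator in `OneVsRestClassifier`. This is a sketch of that option rather than the method used in the cell above; the names `ovr_clf` and `ovr_cv_scores` are illustrative, and the fits may still emit convergence warnings on unscaled data.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One binary logistic regression is trained per digit class (10 in total).
ovr_clf = OneVsRestClassifier(
    LogisticRegression(solver='lbfgs', max_iter=1000)
)

ovr_cv_scores = cross_val_score(ovr_clf, X_train, y_train, cv=5)

print('Accuracy scores for the 5 folds: ', ovr_cv_scores)
print('Mean cross validation score: {:.3f}'.format(np.mean(ovr_cv_scores)))

# Fit once on the full training set to inspect the per-class estimators.
ovr_clf.fit(X_train, y_train)
print('Number of binary classifiers:', len(ovr_clf.estimators_))
```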
## Cross-validation with random forest
Perform the same steps as above, but this time with a `RandomForestClassifier`.

In [13]:
# Imports here
from sklearn.ensemble import RandomForestClassifier


# No random_state is set here, so the scores vary slightly from run to run.
rf_clf = RandomForestClassifier(n_estimators=24)

rf_cv_scores = cross_val_score(rf_clf, X_train, y_train, cv=5)

print('Accuracy scores for the 5 folds: ', rf_cv_scores)
print('Mean cross validation score: {:.3f}'.format(np.mean(rf_cv_scores)))

Accuracy scores for the 5 folds:  [0.97222222 0.94444444 0.94425087 0.97560976 0.94425087]
Mean cross validation score: 0.956
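To compare the two models, you can look at the spread of the fold scores alongside the means, and then check each model once on the held-out test set. The sketch below assumes the same settings as the cells above, except that the random forest is given `random_state=0` (an addition, so the run is repeatable); the helper name `summarize` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    'Logistic regression': LogisticRegression(
        solver='lbfgs', random_state=64, max_iter=1000
    ),
    # random_state=0 is an assumption added for repeatability.
    'Random forest': RandomForestClassifier(n_estimators=24, random_state=0),
}


def summarize(name, model):
    """Print CV mean/std and test accuracy; return the test accuracy."""
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    model.fit(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print('{}: CV mean {:.3f} (std {:.3f}), test accuracy {:.3f}'.format(
        name, cv_scores.mean(), cv_scores.std(), test_acc))
    return test_acc


test_accuracies = {name: summarize(name, model) for name, model in models.items()}
```

With cross-validation means as close as 0.957 and 0.956, the difference is well within the fold-to-fold variation, so neither model is clearly better on this evidence alone.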
