# Classification using logistic regression

Logistic regression is one of the simplest models for classification. In this section, we'll use it to identify the species of a flower from its characteristics.

## Loading the dataset
The *Iris* dataset contains 150 entries, 50 from each of three species, each described by four features:

In [1]:
import pandas as pd

iris = pd.read_csv('datasets/iris.csv')
print(iris.info())
iris.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
Sepal_length    150 non-null float64
Sepal_width     150 non-null float64
Petal_length    150 non-null float64
Petal_width     150 non-null float64
Class           150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
None


|   | Sepal_length | Sepal_width | Petal_length | Petal_width | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [2]:
iris.Class.unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [3]:
class_mapping = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
iris.Class = iris.Class.map(class_mapping)
iris.Class.unique()

array([0, 1, 2])

## Training the classifier
We'll apply the same pattern as before to train a classifier using scikit-learn.

In [4]:
from sklearn.linear_model import LogisticRegression

X = iris.drop('Class', axis=1).values
y = iris.Class.values

lr = LogisticRegression().fit(X, y)

# predict() expects a 2D array of shape (n_samples, n_features)
entry_64 = X[64, :].reshape(1, -1)
print('Predicted class: ', lr.predict(entry_64))
print('Actual class: ', y[64])

Predicted class:  [1]
Actual class:  1
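Since the classifier now returns integer codes, it can be handy to translate a prediction back into a species name. The following is a minimal sketch that reuses the `class_mapping` dictionary defined earlier; `inverse_mapping` and `decode` are hypothetical names introduced here for illustration only:

```python
# Same mapping that was used to encode the Class column.
class_mapping = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}

# Invert it so that integer codes map back to species names.
inverse_mapping = {code: name for name, code in class_mapping.items()}


def decode(predictions):
    """Translate an iterable of integer class codes into species names."""
    return [inverse_mapping[int(code)] for code in predictions]


# The classifier predicted [1] for entry 64.
print(decode([1]))  # ['Iris-versicolor']
```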




## Evaluating the model
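Scikit-learn's `LogisticRegression` also provides a *score()* method, which returns the mean accuracy (the fraction of correctly classified examples) on a given dataset. Here we score the model on the same data it was trained on, so the result is likely an optimistic estimate: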

In [5]:
lr.score(X, y)

0.95999999999999996
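For a classifier, the value returned by *score()* is the fraction of examples whose predicted label equals the true label. The sketch below checks this by hand. It makes two assumptions that are not part of the original notebook: it loads scikit-learn's bundled copy of Iris through `load_iris` instead of `datasets/iris.csv`, and it raises `max_iter` so that the default solver converges without a warning. The names `X_demo`, `y_demo`, `clf` and `manual_accuracy` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: the bundled Iris data stands in for datasets/iris.csv.
X_demo, y_demo = load_iris(return_X_y=True)

# Assumption: max_iter is raised only to avoid a convergence warning.
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Mean accuracy by hand: fraction of predictions that match the true label.
manual_accuracy = np.mean(clf.predict(X_demo) == y_demo)

# Both numbers should agree.
print(manual_accuracy, clf.score(X_demo, y_demo))
```

The exact value may differ from the output shown above, since it depends on the scikit-learn version and its default solver settings.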

For classification problems, however, it's usually helpful to also generate a *confusion matrix* to check whether any class has a high rate of misclassification:

In [6]:
from sklearn.metrics import confusion_matrix

y_pred = lr.predict(X)
confusion_matrix(y, y_pred, labels=[0, 1, 2])

array([[50,  0,  0],
       [ 0, 45,  5],
       [ 0,  1, 49]])

Each row of the confusion matrix corresponds to an actual class and each column to a predicted class. The matrix above shows that only six examples were misclassified: five Iris-versicolor examples were classified as Iris-virginica, and one Iris-virginica example was classified as Iris-versicolor. All 50 Iris-setosa examples were classified correctly.
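The same matrix can also be summarized numerically. As a rough sketch using plain Python lists, the snippet below recomputes the overall accuracy from the diagonal and derives, for each class, the recall (the share of actual examples of that class that were found) and the precision (the share of predictions for that class that were right). The matrix values are copied from the output above; the variable names are illustrative.

```python
# Confusion matrix copied from the output above:
# rows are actual classes, columns are predicted classes.
cm = [[50, 0, 0],
      [0, 45, 5],
      [0, 1, 49]]
names = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal entries
print('Misclassified:', total - correct)  # 6
print('Accuracy:', correct / total)       # 0.96

for i, name in enumerate(names):
    actual = sum(cm[i])                    # examples that truly belong to class i
    predicted = sum(row[i] for row in cm)  # examples predicted as class i
    recall = cm[i][i] / actual
    precision = cm[i][i] / predicted
    print(f'{name}: recall={recall:.3f}, precision={precision:.3f}')
```

The accuracy obtained from the diagonal matches the value returned by *score()*, which is expected because both are computed from the same predictions.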