In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn import datasets
from sklearn.model_selection import train_test_split

# Section 1: The Logistic Classification Function
Logistic regression, despite its name, is a classification model: it passes a linear combination of continuous feature values through the logistic function to obtain class probabilities, and then assigns each sample to a discrete category.
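To make the idea concrete, here is a minimal sketch of the logistic function turning continuous scores into class labels. The `scores` array and the 0.5 threshold are illustrative assumptions, not values taken from the iris data.

```python
import numpy as np

def logistic(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w . x + b) for five samples
scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probabilities = logistic(scores)

# Threshold the probabilities at 0.5 to obtain class labels 0 or 1
labels = (probabilities >= 0.5).astype(int)

print(probabilities.round(3))
print(labels)
```

Negative scores map to probabilities below 0.5 and positive scores to probabilities above it, which is what lets a continuous output act as a classifier.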

In [8]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# load_iris() reads a copy of the dataset that ships with scikit-learn (no download needed)
iris_data = load_iris()

# show descriptive information on the dataset
print(iris_data.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [3]:
import pandas as pd

# Features go in X (one column per measurement); class labels go in y
X = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)
y = pd.DataFrame(data=iris_data.target, columns=["iris_type"])
X.head()

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2


### Each feature in our dataset is a continuous measurement in centimeters, so logistic regression can use it directly without any encoding.

In [4]:
y.head()

   iris_type
0          0
1          0
2          0
3          0
4          0


### We'll be using a function called train_test_split() that randomly partitions our data into a training set and a test set. By default it holds out 25% of the rows for testing. This will make more sense as we put it into practice.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Flatten the single-column y DataFrames into 1-D arrays, the shape scikit-learn expects for fitting and scoring.
y_train, y_test = np.ravel(y_train), np.ravel(y_test)
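As a quick check on what the split produced, the sketch below repeats it and prints the resulting shapes and per-class counts. It uses `iris_data.target` directly as a 1-D array, which is a simplification of the DataFrame approach above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_data = load_iris()
X = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
y = iris_data.target  # already 1-D, so np.ravel is not needed in this sketch

# Same call as above: default test_size (25%) and a fixed seed for repeatability
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape, X_test.shape)   # rows and columns in each partition
print(np.bincount(y_train))          # samples per class in the training set
print(np.bincount(y_test))           # samples per class in the test set
```

Because the split is random, the classes are not guaranteed to stay perfectly balanced in each partition. Passing `stratify=y` to `train_test_split()` is one way to preserve the class proportions if that matters.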

### By comparing the predicted y-values with our true test y-values, we can ascertain our model's accuracy.

One way to do this is by manually iterating across our predicted y-values (y_pred) and true test y-values (y_test) and checking which values are equivalent.

However, we can get the same result more simply by calling .score() on our fitted machine learning model.
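The manual comparison described above can be sketched as follows. The block fits its own model so that it runs on its own; the variable name `model` and the raised `max_iter=1000` (chosen here only to avoid a possible convergence warning) are assumptions of this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_iter=1000 is an assumption of this sketch, not a setting from the notebook
model = LogisticRegression(random_state=0, solver="lbfgs", max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Manual approach: walk both arrays and count the matching labels
matches = 0
for predicted, actual in zip(y_pred, y_test):
    if predicted == actual:
        matches += 1
manual_accuracy = matches / len(y_test)

# Vectorized equivalent, and the built-in shortcut
vectorized_accuracy = np.mean(y_pred == y_test)
score_accuracy = model.score(X_test, y_test)

print(manual_accuracy, vectorized_accuracy, score_accuracy)
```

All three numbers should agree, since .score() on a classifier reports the fraction of correctly predicted labels.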

Keep in mind that two important arguments we pass when creating our model are solver and multi_class.

The solver argument is "lbfgs", short for limited-memory BFGS, a quasi-Newton optimization algorithm that is widely used in machine learning.

The multi_class argument is "multinomial", which tells the model to fit a single softmax model across all classes rather than a separate one-vs-rest binary model per class. This suits targets with more than two classes, such as the three iris species. Newer scikit-learn releases (1.5 and later) deprecate the multi_class parameter and use the multinomial formulation by default for multiclass problems.
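To illustrate what the multinomial setting means in practice, the sketch below applies a softmax to the model's raw per-class scores and compares the result with `predict_proba()`. It leaves `multi_class` at its default, on the assumption that the installed scikit-learn version picks the multinomial formulation for `lbfgs` on a three-class problem; `max_iter=1000` is likewise an assumption.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# multi_class is omitted (assumed to default to multinomial for lbfgs)
model = LogisticRegression(random_state=0, solver="lbfgs", max_iter=1000)
model.fit(X_train, y_train)

# Raw linear scores, one column per class
scores = model.decision_function(X_test)

# Softmax: exponentiate (after shifting for numerical stability) and normalize each row
shifted = scores - scores.max(axis=1, keepdims=True)
manual_proba = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(np.allclose(manual_proba, model.predict_proba(X_test)))
```

Under a one-vs-rest setup the probabilities would instead come from three separate sigmoids that are normalized afterwards, so this comparison would generally not hold.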

In [9]:
logreg = LogisticRegression(random_state=0, solver="lbfgs", multi_class="multinomial")
logreg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

### Here, we've just instantiated our machine learning model (our logistic regression model) by assigning a new, unfitted model to a variable.

Then, we fitted the model to our training dataset. Here, it learns the approximate relationship between the X and y datasets.

Through learning the relationship between the X-y training data, we're hoping that the model can approximately determine the y-value given a new X-value.
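What the model "learns" during fitting is a set of weights. As a minimal sketch, the block below fits a model and inspects the fitted attributes; the name `model` and `max_iter=1000` are assumptions made so the example runs cleanly on its own.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(random_state=0, solver="lbfgs", max_iter=1000)
model.fit(X_train, y_train)

# One row of weights per class, one column per feature
print(model.coef_.shape)
# One intercept (bias) term per class
print(model.intercept_.shape)
# The class labels the model saw during fitting
print(model.classes_)
```

For a new sample, the model combines these weights with the four measurements to produce one score per class, and the highest-scoring class becomes the prediction.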

In [11]:
y_pred = logreg.predict(X_test)
y_pred

array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1,
       0, 0, 2, 0, 0, 1, 1, 0, 2, 1, 0, 2, 2, 1, 0, 2])

### By printing the values in y_pred, we see which target label our model assigns to each corresponding row of X_test.

A quick sanity check that our data is what we think it is: call .shape on both y_pred and y_test.

If both arrays have the same shape, we know the model produced exactly one prediction per test sample. This confirms only that the two arrays line up; it says nothing yet about whether the predictions are correct.
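The shape check can be sketched as below. To underline that matching shapes do not imply good predictions, the example also builds a deliberately naive array, `all_zero` (a hypothetical name), that has the right shape but predicts class 0 for every sample.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_iter=1000 is an assumption of this sketch
model = LogisticRegression(random_state=0, solver="lbfgs", max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Both arrays should hold one entry per test sample
print(y_pred.shape, y_test.shape)

# A naive "prediction" with the same shape: always class 0
all_zero = np.zeros_like(y_test)
print(all_zero.shape == y_test.shape)

# Same shape, very different accuracy
print(np.mean(y_pred == y_test), np.mean(all_zero == y_test))
```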

We can also use the .predict_proba() method to get the predicted probability of each class for every sample, rather than only the assigned class. Each row sums to 1, and the columns follow the order of the labels in logreg.classes_.

In [12]:
logreg.predict_proba(X_test)

array([[1.17924703e-04, 5.61479667e-02, 9.43734109e-01],
       [1.26288661e-02, 9.60454922e-01, 2.69162124e-02],
       [9.84397680e-01, 1.56022816e-02, 3.85650267e-08],
       [1.25180832e-06, 2.31530394e-02, 9.76845709e-01],
       [9.70234755e-01, 2.97650820e-02, 1.62609745e-07],
       [2.01669798e-06, 5.94453033e-03, 9.94053453e-01],
       [9.81899481e-01, 1.81004487e-02, 7.04478339e-08],
       [2.84241427e-03, 7.47089885e-01, 2.50067701e-01],
       [1.50915665e-03, 7.38524267e-01, 2.59966577e-01],
       [2.05287874e-02, 9.35891198e-01, 4.35800150e-02],
       [9.22436289e-05, 1.59475749e-01, 8.40432007e-01],
       [6.98627957e-03, 8.09989247e-01, 1.83024474e-01],
       [4.08220400e-03, 7.93602802e-01, 2.02314994e-01],
       [3.05681845e-03, 7.60910824e-01, 2.36032358e-01],
       [3.87699846e-03, 7.10277106e-01, 2.85845895e-01],
       [9.82815573e-01, 1.71843701e-02, 5.65491187e-08],
       [6.72901329e-03, 7.56465383e-01, 2.36805604e-01],
       [1.14291723e-02, 8.45111
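Two properties of this probability array are worth verifying: every row sums to 1, and the column with the largest probability is the class that `predict()` returns. The sketch below checks both, again using a standalone `model` with an assumed `max_iter=1000`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(random_state=0, solver="lbfgs", max_iter=1000)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)

# Each row is a probability distribution over the three classes
row_sums = proba.sum(axis=1)
print(np.allclose(row_sums, 1.0))

# The most probable column, mapped back to a class label
most_likely = model.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(most_likely, model.predict(X_test)))

# The winning probability shows how confident each prediction is
confidence = proba.max(axis=1)
print(confidence.min().round(3), confidence.max().round(3))
```

Rows where the winning probability is well below 1 are the samples the model is least sure about, which is usually where any misclassifications occur.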

### The next step is to simply determine the model's accuracy.

We can do this by calling .score() on our model and sending it our test data.

One common mistake is to pass y_pred rather than X_test as the first argument to the scoring method. The .score() method makes its own predictions from X_test internally and then compares them against y_test.

In this case, since the dataset has three equally sized classes, assigning labels uniformly at random would be correct about 33.33% of the time. That is the baseline our model needs to beat.

In [16]:
logreg.score(X_test, y_test) * 100

97.36842105263158
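To put this score next to a baseline, one option is scikit-learn's `DummyClassifier`, which ignores the features entirely. The sketch below uses the "most_frequent" strategy; that choice, the variable names, and `max_iter=1000` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: always predict the most common class in the training set
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

# The actual model, scored on the same test set
model = LogisticRegression(random_state=0, solver="lbfgs", max_iter=1000)
model.fit(X_train, y_train)
model_accuracy = model.score(X_test, y_test)

print(f"baseline: {baseline_accuracy * 100:.2f}%")
print(f"logistic regression: {model_accuracy * 100:.2f}%")
```

A large gap between the two numbers indicates the model is picking up real structure in the measurements rather than exploiting the class distribution.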