# ClassGraphic binary classification demo

In [None]:
from classgraphic.essential import *

In [None]:
# settings
random_state = 42

Let's create a demo dataset for binary classification, where the outcome of an experiment is true (1) or false (0). We will generate 500 observations, each with 5 features (measurements). For this demo, we deliberately bias the classes toward the true (1) class by requesting class weights of 0.23 and 0.77.

In [None]:
X, y = make_classification(n_samples=500, n_features=5, weights=[0.23, 0.77], random_state=random_state)
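As a quick numeric check before any plotting, the class proportions can be counted directly. This is a minimal sketch that imports `make_classification` and NumPy explicitly instead of relying on the star import. Note that `make_classification` flips a small fraction of labels at random by default (`flip_y=0.01`), so the observed share is close to, but not necessarily exactly, 0.77.

```python
import numpy as np
from sklearn.datasets import make_classification

random_state = 42

# Same call as in the notebook cell above.
X, y = make_classification(
    n_samples=500, n_features=5, weights=[0.23, 0.77], random_state=random_state
)

# Count observations per class and convert the counts to proportions.
classes, counts = np.unique(y, return_counts=True)
proportions = counts / counts.sum()

for label, count, share in zip(classes, counts, proportions):
    print(f"class {label}: {count} observations ({share:.1%})")
```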

What does the data look like? `describe` will give us statistics, and `missing` will show us if there are any missing values.

In [None]:
describe(X)

In [None]:
missing(X)

Now on the `target` side of things, let's see how many true and false values we have in the dataset.

In [None]:
class_imbalance(y)

We deliberately set the positive (true) class to a little over 3/4 of the samples (a weight of 0.77). Normally we do not know this in advance, so the pie chart helps us visualize it.

The next step is to split our data into a train set and a test set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=random_state)

We have a pretty large `test_size` (half), but still, let's look at how the class counts compare between train and test.

In [None]:
class_imbalance(y_train, y_test, condition="train,test", sort=True, reference=False)

That's the default display. For comparing two sets, a bar chart might be easier to read. The bars can be displayed as "stack" (the default), "overlay", "group" or "relative", depending on your preference.

In [None]:
class_imbalance(y_train, y_test, condition="train,test", sort=True, always_bar=True)

In [None]:
class_imbalance(y_train, y_test, condition="train,test", sort=True, always_bar=True, barmode="overlay")

Any way we look at this, the difference for our True (1) class between train and test is not exaggerated: 16 observations. A similar difference of 16 observations for our False (0) class, however, is a much bigger proportion of that class. It would be preferable in this scenario to stratify our split:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.5, random_state=random_state)

In [None]:
class_imbalance(y_train, y_test, condition="train,test", sort=True, always_bar=True, barmode="overlay")

Train and test now match. There is still a large imbalance between our True and False observations, however. We will keep that in mind and move forward.
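To confirm numerically what the bar chart shows, here is a minimal sketch comparing per-class counts with and without `stratify`. It regenerates the same data with scikit-learn imported explicitly. The helper name `class_counts` is hypothetical and not part of ClassGraphic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

random_state = 42
X, y = make_classification(
    n_samples=500, n_features=5, weights=[0.23, 0.77], random_state=random_state
)


def class_counts(labels):
    """Return [count of class 0, count of class 1] (hypothetical helper)."""
    return np.bincount(labels, minlength=2)


# Plain random split: class counts may drift between train and test.
_, _, y_tr_plain, y_te_plain = train_test_split(
    X, y, test_size=0.5, random_state=random_state
)

# Stratified split: each class is divided in proportion to the split sizes.
_, _, y_tr_strat, y_te_strat = train_test_split(
    X, y, stratify=y, test_size=0.5, random_state=random_state
)

print("plain      train/test:", class_counts(y_tr_plain), class_counts(y_te_plain))
print("stratified train/test:", class_counts(y_tr_strat), class_counts(y_te_strat))
```

With an even 50/50 split, the stratified counts for each class should differ by at most one observation between train and test.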

Let's now train the model.

In [None]:
lr_model = LogisticRegression(max_iter=200, random_state=random_state)
lr_model.fit(X_train, y_train)

Let's now get direct predictions and prediction probabilities from our test set.

In [None]:
y_probs = lr_model.predict_proba(X_test)
y_pred = lr_model.predict(X_test)
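The two outputs are related: `predict_proba` returns one column per class, ordered as in `classes_`, and for a binary logistic regression `predict` returns the class with the higher probability. The following is a minimal, self-contained sketch on the same synthetic data, with the scikit-learn imports made explicit and a local variable named `model` used for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

random_state = 42
X, y = make_classification(
    n_samples=500, n_features=5, weights=[0.23, 0.77], random_state=random_state
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.5, random_state=random_state
)

model = LogisticRegression(max_iter=200, random_state=random_state)
model.fit(X_train, y_train)

y_probs = model.predict_proba(X_test)  # shape (n_samples, 2): P(class 0), P(class 1)
y_pred = model.predict(X_test)         # shape (n_samples,): hard 0/1 labels

# Each row of probabilities sums to 1.
rows_sum_to_one = np.allclose(y_probs.sum(axis=1), 1.0)

# Picking the most probable class reproduces predict().
argmax_pred = model.classes_[y_probs.argmax(axis=1)]
same_as_predict = np.array_equal(argmax_pred, y_pred)

print("rows sum to one:", rows_sum_to_one)
print("argmax matches predict:", same_as_predict)
```

This is why the second column, `y_probs[:, 1]`, can serve as the score for the positive class.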

It is important to visualize the class errors. This can be done with bar charts, with a confusion matrix, or both. When using two visualizations in a single cell, the first one has to end with `.show()`, otherwise it will not be displayed in the notebook result cell.

In [None]:
class_error(lr_model, y_test, y_pred).show()
confusion_matrix_table(lr_model, y_test, y_pred)

We can also get more metrics:

In [None]:
classification_table(lr_model, y_test, y_pred)

Overall, not bad, but the predictions for our Negative (False) class are not as good as those for our Positive (True) class. Would it improve if we instructed LogisticRegression to balance the class weights?

In [None]:
lr_model = LogisticRegression(max_iter=200, random_state=random_state, class_weight="balanced")
lr_model.fit(X_train, y_train)
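For reference, scikit-learn documents the `"balanced"` heuristic as `n_samples / (n_classes * np.bincount(y))`, so the minority class receives a proportionally larger weight. Here is a minimal sketch using a toy label vector with a 23/77 split. The values are illustrative, chosen to mirror the requested weights, and are not the notebook's actual counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 23 negatives and 77 positives (illustrative only).
y_toy = np.array([0] * 23 + [1] * 77)
classes = np.unique(y_toy)

# Weights as computed by scikit-learn for class_weight="balanced".
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same weights from the documented formula.
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

for label, weight in zip(classes, weights):
    print(f"class {label}: weight {weight:.3f}")
```

Errors on the rarer class 0 are therefore penalized more heavily during fitting, which tends to shift predictions toward that class.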

In [None]:
y_probs = lr_model.predict_proba(X_test)
y_pred = lr_model.predict(X_test)

In [None]:
class_error(lr_model, y_test, y_pred).show()
confusion_matrix_table(lr_model, y_test, y_pred).show()
classification_table(lr_model, y_test, y_pred)

We are doing better on the recall for our negative class, but the precision went down.
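This trade-off follows from how the two metrics are defined per class: recall divides the correct predictions for a class by the number of true members of that class, while precision divides them by the number of predictions made for that class. A minimal sketch with made-up labels follows. It is illustrative only and does not reproduce the notebook's results, and `per_class_metrics` is a hypothetical helper.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision and recall for one class, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # times the class was predicted
    actual = sum(1 for t in y_true if t == label)     # true members of the class
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
cautious = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]  # rarely predicts class 0
eager = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]     # predicts class 0 more often

for name, preds in (("cautious", cautious), ("eager", eager)):
    precision, recall = per_class_metrics(y_true, preds, label=0)
    print(f"{name}: class 0 precision={precision:.2f}, recall={recall:.2f}")
```

Predicting the negative class more often finds more of the true negatives (higher recall) but also mislabels more positives as negative (lower precision).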

Let's look at the precision-recall and ROC plots for this "balanced" model.

In [None]:
precision_recall(lr_model, y_test, y_pred).show()
roc(lr_model, y_test, y_probs[:, 1])
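ROC curves are generally built by sweeping a decision threshold over the predicted scores of the positive class and recording the resulting rates. As a rough illustration of what a single ROC point is, the sketch below computes the false-positive and true-positive rates at a few thresholds. The scores are made up for illustration, and `roc_point` is a hypothetical helper, not part of ClassGraphic.

```python
def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives


# Made-up labels and positive-class scores (illustrative only).
y_true = [0, 0, 1, 0, 1, 1, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.65, 0.8, 0.9, 0.95]

for threshold in (0.0, 0.5, 1.01):
    fpr, tpr = roc_point(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```

A threshold of 0 labels everything positive (top-right corner of the ROC plot), while a threshold above every score labels everything negative (bottom-left corner).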

Finally, let's look at the feature importance and the probability histogram.

In [None]:
feature_importance(lr_model, y).show()
prediction_histogram(lr_model, y_test, y_pred, nbins=25)