# Model an Ordinal Logistic Regression in Python

This notebook performs ordinal logistic regression on sample data.  We use a new dataset for this analysis, one that suits the technique because its outcome variable is a ranked (ordered) category.

In prior analyses, we've used `pandas`, `numpy`, and a variety of functions from `scikit-learn`.  Now we'll add the `statsmodels` library and replace sklearn's `LogisticRegression()` class with `OrderedModel()`.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
from statsmodels.miscmodels.ordinal_model import OrderedModel

This dataset is in Stata format.  Stata is a paid data analysis product that we don't have available here, but pandas can read its `.dta` files directly with `read_stata()`.

In [None]:
df = pd.read_stata("../data/ologit.dta")
df.head()

One nice thing about loading data in Stata format is that we can capture that `apply` is an ordered categorical variable without needing to specify it ourselves.

In [None]:
df.dtypes

In [None]:
df['apply'].dtype
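
To see what an ordered categorical looks like without the Stata file, here is a minimal sketch that builds one by hand.  The three labels are assumed for illustration; check `df['apply'].cat.categories` for the values actually stored in the dataset.

```python
import pandas as pd

# Hypothetical labels, listed from lowest to highest rank.
levels = ["unlikely", "somewhat likely", "very likely"]

s = pd.Series(
    pd.Categorical(
        ["very likely", "unlikely", "somewhat likely", "unlikely"],
        categories=levels,
        ordered=True,
    )
)

print(s.dtype.ordered)        # whether pandas treats the categories as ranked
print(list(s.cat.codes))      # integer position of each label within `levels`
print(list(s > "unlikely"))   # ordered categoricals support rank comparisons
```

The `ordered=True` flag is what pandas sets for us when the Stata file carries labeled, ranked values.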

Similar to other analyses, we'll split our data into X and Y subsets and then into training and test splits.

In [None]:
y = df['apply']
x = df.drop(['apply'], axis=1)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, stratify=y)

## Training a Model

The workflow for statsmodels algorithms is a bit different from scikit-learn.  Here, we pass in the training dataset when we instantiate the ordered model, and then we specify the optimization method to use when fitting the model to that training data.

In [None]:
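# distr='logit' fits an ordinal logistic regression; 'probit' would fit an ordered probit model instead
clf = OrderedModel(y_train, x_train, distr='logit')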
model = clf.fit(method='bfgs')
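
For intuition about what an ordered model represents, the sketch below computes category probabilities by hand: a linear score is compared against increasing cut points, and each category's probability is the difference between adjacent cumulative probabilities.  The coefficients, cut points, and the helper `ordered_probs` are made up for illustration and are not taken from the fitted model; the logistic CDF is used as the link.

```python
import numpy as np

def ordered_probs(X, beta, cutpoints):
    """Category probabilities for an ordered logit model (illustrative helper)."""
    score = X @ beta  # linear predictor, shape (n,)
    # Cumulative probabilities: P(y <= k) = logistic(c_k - score)
    cum = 1.0 / (1.0 + np.exp(-(cutpoints[None, :] - score[:, None])))
    # Pad with 0 on the left and 1 on the right, then difference to get P(y = k)
    n = len(score)
    cum = np.hstack([np.zeros((n, 1)), cum, np.ones((n, 1))])
    return np.diff(cum, axis=1)

# Hypothetical values: two predictors, three categories (so two cut points).
X = np.array([[0.0, 3.0], [1.0, 3.5], [1.0, 4.0]])
beta = np.array([1.0, 0.6])
cutpoints = np.array([2.2, 4.3])

probs = ordered_probs(X, beta, cutpoints)
print(probs.round(3))
print(probs.sum(axis=1))  # each row sums to 1
```

As the linear score rises, probability mass shifts toward the higher-ranked categories, which is the behavior the fitted coefficients describe.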

Statsmodels also gives us back a nicely formatted table with results.

In [None]:
model.summary()

## Evaluating the Model

Now that we've trained our model, we can generate predictions.  Predictions work a bit differently here because we need to pass in the fitted parameters: they live on the results object (`model`), not on the `OrderedModel` instance (`clf`).

In [None]:
y_pred = clf.predict(model.params, x_test)

Also, predictions come back as probabilities, one column per category, so we can use `argmax(1)` to get the position of the largest probability in each row.

In [None]:
y_pred[0:5]

In [None]:
y_pred.argmax(1)
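
As a small illustration of what `argmax(1)` does, the sketch below applies it to a hand-written probability matrix.  The numbers are hypothetical and not model output.

```python
import numpy as np

# Hypothetical predicted probabilities: one row per observation,
# one column per category (in rank order).
probs = np.array([
    [0.70, 0.25, 0.05],
    [0.30, 0.45, 0.25],
    [0.10, 0.30, 0.60],
])

pred_codes = probs.argmax(axis=1)  # equivalent to probs.argmax(1)
print(pred_codes)
```

Each entry is the column index of the most probable category for that row, which lines up with the integer codes of the ordered categorical.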

To compare these results to our initial test values, we can transform the ordered categories into underlying codes.

In [None]:
y_test.values.codes
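
Going the other direction is also possible: predicted codes can be mapped back to labels.  This is a minimal sketch with assumed labels; in the notebook the categories would come from `y_test.cat.categories`.

```python
import numpy as np
import pandas as pd

# Hypothetical labels in rank order.
levels = ["unlikely", "somewhat likely", "very likely"]
pred_codes = np.array([0, 2, 1, 0])  # e.g. the result of argmax(1)

# Rebuild an ordered categorical from the integer codes.
pred_labels = pd.Categorical.from_codes(pred_codes, categories=levels, ordered=True)
print(list(pred_labels))
```

Comparing codes is simpler for scoring, while labels are easier to read when inspecting individual predictions.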

Now we have enough that we can build a confusion matrix and see how we did.

In [None]:
confusion_matrix(y_test.values.codes, y_pred.argmax(1))
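
To help read the matrix, here is a NumPy-only sketch that builds the same layout by hand: rows are actual classes and columns are predicted classes.  The arrays are hypothetical examples, not results from this dataset.

```python
import numpy as np

# Hypothetical actual and predicted category codes.
actual = np.array([0, 0, 1, 2, 1, 0])
predicted = np.array([0, 1, 1, 1, 1, 0])

n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=int)
# Add 1 to cell (actual, predicted) for every observation.
np.add.at(cm, (actual, predicted), 1)

print(cm)
print(np.trace(cm) / cm.sum())  # accuracy: the diagonal holds correct predictions
```

Because the categories are ranked, cells close to the diagonal represent near misses, while cells far from it represent larger errors.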

In [None]:
print(classification_report(y_test.values.codes, y_pred.argmax(1)))