Let's take another look at the concept of "regression to mediocrity" as described in Nina Zumel's great article [*Why Do We Plot Predictions on the x-axis?*](http://www.win-vector.com/blog/2019/09/why-do-we-plot-predictions-on-the-x-axis/).

This time let's consider the issue from the point of view of multinomial classification (a concept discussed [here](https://github.com/WinVector/pyvtreat/blob/master/Examples/Multinomial/MultinomialExample.md)).

First we load our packages and generate some synthetic data.

In [1]:
import numpy
import numpy.random
import pandas
import sklearn.linear_model
import sklearn.metrics

In [2]:
numpy.random.seed(34524)

N = 100

df = pandas.DataFrame({
    'x1': numpy.random.normal(size=N),
    'x2': numpy.random.normal(size=N),
    })
noise = 0.25*numpy.random.normal(size=N)
y = df.x1 + df.x2 + noise
df['y'] = numpy.where(
    y < -2, 
    'short opportunity', 
    numpy.where(
        y > 2, 
        'long opportunity', 
        'indeterminate'))

df.head()

| | x1 | x2 | y |
|---|---|---|---|
| 0 | 0.389409 | -1.117456 | indeterminate |
| 1 | -0.354096 | -0.76258 | indeterminate |
| 2 | -0.057603 | 1.278706 | indeterminate |
| 3 | -0.400339 | 2.040186 | indeterminate |
| 4 | -0.125245 | 1.051682 | indeterminate |


In [3]:
df['y'].value_counts()

indeterminate        84
short opportunity    10
long opportunity      6
Name: y, dtype: int64

Please pretend this data is a record of stock market trading situations where we have determined (by peeking into the future, something quite easy to do with historic data) whether there is a large opportunity to make money buying a security (called `long opportunity`) or a large opportunity to make money selling a security (called `short opportunity`).

Let's build a model using the two observable explanatory variables `x1` and `x2`.  These are measurements that are available at the time of the proposed trade and that we hope correlate with or "predict" the future trading result.  For our model we will use a simple multinomial logistic regression.

In [4]:
model_vars = ['x1', 'x2']

fitter = sklearn.linear_model.LogisticRegression(
    solver = 'saga',
    penalty = 'l2',
    C = 1,
    max_iter = 1000,
    multi_class = 'multinomial')
fitter.fit(df[model_vars], df['y'])


LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)

We can then examine the model predictions on the training data itself (a *much* lower standard than evaluating the model on held-out data!).

In [5]:
# convenience functions for predicting and adding predictions to original data frame

def add_predictions(d_prepared, model_vars, fitter):
    pred = fitter.predict_proba(d_prepared[model_vars])
    classes = fitter.classes_
    # start as a float column, since probabilities are assigned into it below
    d_prepared['prob_on_predicted_class'] = 0.0
    d_prepared['prediction'] = None
    for i in range(len(classes)):
        cl = classes[i]
        d_prepared[cl] = pred[:, i]
        improved = d_prepared[cl] > d_prepared['prob_on_predicted_class']
        d_prepared.loc[improved, 'prediction'] = cl
        d_prepared.loc[improved, 'prob_on_predicted_class'] = d_prepared.loc[improved, cl]
    return d_prepared

def add_value_by_column(d_prepared, name_column, new_column):
    vals = d_prepared[name_column].unique()
    d_prepared[new_column] = None
    for v in vals:
        matches = d_prepared[name_column]==v
        d_prepared.loc[matches, new_column] = d_prepared.loc[matches, v]
    return d_prepared
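
The two helpers above amount to "pick the highest-probability class per row" and "look up the probability assigned to the true class". A minimal vectorized sketch of the same two steps follows. The probability values and the `y` labels here are toy numbers assumed for illustration, not output of the fitted model. The class order mimics the sorted order in which scikit-learn stores `classes_`.

```python
import numpy

# Toy inputs (assumed for illustration, not taken from the fitted model).
# The classes are listed in sorted order, as in scikit-learn's classes_.
classes = numpy.array(['indeterminate', 'long opportunity', 'short opportunity'])
probs = numpy.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
])
y = numpy.array(['indeterminate', 'short opportunity', 'short opportunity'])
rows = numpy.arange(len(probs))

# Counterpart of add_predictions: the class with the largest probability per row.
best = probs.argmax(axis=1)
prediction = classes[best]
prob_on_predicted_class = probs[rows, best]

# Counterpart of add_value_by_column: the probability given to the true class.
true_index = numpy.searchsorted(classes, y)  # valid because classes is sorted
prob_on_correct_class = probs[rows, true_index]

print(prediction)
print(prob_on_predicted_class)
print(prob_on_correct_class)
```

Like the loop in `add_predictions`, `argmax` keeps the first class in case of an exact tie.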

In [6]:
# df['prediction'] = fitter.predict(df[model_vars])
df = add_predictions(df, model_vars, fitter)
df = add_value_by_column(df, 'y', 'prob_on_correct_class')

In [7]:
result_columns = ['y', 'prob_on_predicted_class', 'prediction', 
                  'indeterminate', 'long opportunity', 
                  'short opportunity', 'prob_on_correct_class']
df[result_columns].head()

| | y | prob_on_predicted_class | prediction | indeterminate | long opportunity | short opportunity | prob_on_correct_class |
|---|---|---|---|---|---|---|---|
| 0 | indeterminate | 0.961999 | indeterminate | 0.961999 | 0.006164 | 0.031837 | 0.961999 |
| 1 | indeterminate | 0.919933 | indeterminate | 0.919933 | 0.003162 | 0.076905 | 0.919933 |
| 2 | indeterminate | 0.892118 | indeterminate | 0.892118 | 0.106681 | 0.001201 | 0.892118 |
| 3 | indeterminate | 0.818074 | indeterminate | 0.818074 | 0.181323 | 0.000603 | 0.818074 |
| 4 | indeterminate | 0.927103 | indeterminate | 0.927103 | 0.070774 | 0.002123 | 0.927103 |


A common way to examine the relation of the model predictions to outcomes is a graphical table called a *confusion matrix*.  The [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) states:

> By definition a confusion matrix `C` is such that `C[i,j]` is equal to the number of observations known to be in group `i` but predicted to be in group `j`.

and

> Wikipedia and other references may use a different convention for axes.

This means that in the scikit-learn convention the column index is determined by the prediction.  Visually, the horizontal position of a cell in the scikit-learn confusion matrix is determined by the prediction, because matrices have the odd convention that the first index is the row, which specifies the vertical position.

Frankly, we think scikit-learn has made the right rendering choice: consistency and legibility over convention. As Nina Zumel [demonstrated](http://www.win-vector.com/blog/2019/09/why-do-we-plot-predictions-on-the-x-axis/), there are good reasons to have predictions on the x-axis for plots, and the same holds for diagrams or matrices.
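
The row/column convention quoted above can be checked on a deliberately lopsided toy example. The labels `'a'` and `'b'` and the four observations are made up for this check. Three items are truly `'a'`, but only one is predicted as `'a'`, so the row totals and column totals differ and the orientation is unambiguous.

```python
import sklearn.metrics

# Made-up labels: three actual 'a' values, but only one predicted 'a'.
y_true = ['a', 'a', 'a', 'b']
y_pred = ['a', 'b', 'b', 'b']

cm = sklearn.metrics.confusion_matrix(
    y_true=y_true,
    y_pred=y_pred,
    labels=['a', 'b'])

# Rows are indexed by the actual value, columns by the prediction.
row_totals = cm.sum(axis=1)  # counts of actual 'a', actual 'b'
col_totals = cm.sum(axis=0)  # counts of predicted 'a', predicted 'b'

print(cm)
print(row_totals, col_totals)
```

Here `cm[0, 1]` counts the two observations that are actually `'a'` but predicted `'b'`, so moving left or right in the matrix changes the prediction and moving up or down changes the actual value.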

So let's look at this confusion matrix.

In [8]:
sklearn.metrics.confusion_matrix(
    y_true=df.y, 
    y_pred=df.prediction, 
    labels=['short opportunity', 'indeterminate', 'long opportunity'])

array([[ 2,  8,  0],
       [ 1, 83,  0],
       [ 0,  5,  1]])

Our claim is: the prediction is controlling left/right in this matrix and the actual value to be predicted is determining up/down.

We can confirm this by noting that there are 16 actual values of `y` that are not `indeterminate` and only 4 values of `prediction` that are not `indeterminate`.  The row totals of the confusion matrix match the `y` totals and the column totals match the `prediction` totals, so the matrix is oriented as described.

In [9]:
sum(df['y']!='indeterminate')

16

In [10]:
sum(df['prediction']!='indeterminate')

4
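
The same check can be made directly on the confusion matrix printed earlier. This sketch re-enters that matrix by hand (rather than recomputing it) and sums its rows and columns. The variable names are ours.

```python
import numpy

# The confusion matrix shown above, copied by hand.
labels = ['short opportunity', 'indeterminate', 'long opportunity']
cm = numpy.array([
    [2, 8, 0],
    [1, 83, 0],
    [0, 5, 1],
])

# Row totals count actual values; column totals count predictions.
actual_totals = dict(zip(labels, cm.sum(axis=1).tolist()))
predicted_totals = dict(zip(labels, cm.sum(axis=0).tolist()))

# Totals outside the 'indeterminate' class.
n_actual_extreme = sum(v for k, v in actual_totals.items() if k != 'indeterminate')
n_predicted_extreme = sum(v for k, v in predicted_totals.items() if k != 'indeterminate')

print(actual_totals)
print(predicted_totals)
print(n_actual_extreme, n_predicted_extreme)
```

The row totals reproduce the `value_counts()` output for `y` (10, 84, and 6), and the non-`indeterminate` totals are 16 for the rows and 4 for the columns.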

And now, if we are careful, we see the effect.

Notice again that there are 16 actual values of `y` that are not `indeterminate` and only 4 values of `prediction` that are not `indeterminate`.  The model identifies only one fourth as many possible extreme situations or good trades as there actually are, *even* on its own training data.  This is not a pathology such as structure in the residuals, but instead (simple and expected) [regression to mediocrity](https://en.wikipedia.org/wiki/Regression_toward_the_mean), as [Nina Zumel already described](http://www.win-vector.com/blog/2019/09/why-do-we-plot-predictions-on-the-x-axis/) so well.  In the real data all outcome probabilities are zero or one; in the predictions things are a bit more blurry. This example is less clear than Nina Zumel's, but it is driven by related processes.
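
A simpler numeric illustration of the same mechanism (not a reproduction of the classification model above) is sketched here. We assume an outcome equal to a signal plus independent noise, and we use the signal itself as the prediction, which is the best one can do without seeing the noise. The sample size, seed, and noise scale of 1.0 are choices made for this sketch. Even this ideal prediction is less spread out than the outcome, so it crosses the extreme cutoffs less often.

```python
import numpy

rng = numpy.random.default_rng(2019)  # arbitrary seed
n = 100_000

signal = rng.normal(size=n) + rng.normal(size=n)  # plays the role of x1 + x2
outcome = signal + rng.normal(size=n)             # noise scale 1.0 is an assumption
prediction = signal                               # ideal prediction given the signal

# Count how often each quantity falls outside the [-2, 2] band.
n_extreme_outcomes = int(numpy.sum(numpy.abs(outcome) > 2))
n_extreme_predictions = int(numpy.sum(numpy.abs(prediction) > 2))

print(n_extreme_outcomes, n_extreme_predictions)
print(outcome.var(), prediction.var())
```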

Of course one can try to adjust the per-class thresholds to find more potential trading opportunities, but the additional opportunities found are likely to be of lower quality than the ones already identified.
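
The trade-off can be sketched with a small simulation. As a stand-in for the model's class probability we assume a plain score equal to the noiseless sum of two normal variables, and we flag a `long opportunity` whenever that score exceeds a threshold. The thresholds 2.0 and 1.5, the sample size, and the helper name `summarize` are choices made for this sketch, not part of the analysis above.

```python
import numpy

rng = numpy.random.default_rng(34524)
n = 100_000

score = rng.normal(size=n) + rng.normal(size=n)  # stand-in for a model score
y = score + 0.25 * rng.normal(size=n)            # same noise scale as the data above
is_long = y > 2                                  # actual long opportunities

def summarize(threshold):
    """Return (number flagged, fraction of flagged rows that are actual longs)."""
    flagged = score > threshold
    n_flagged = int(flagged.sum())
    precision = float((flagged & is_long).sum()) / n_flagged
    return n_flagged, precision

strict = summarize(2.0)  # flag only when the score clears the true cutoff
loose = summarize(1.5)   # lower threshold: more flags, more false alarms

print(strict)
print(loose)
```

Lowering the threshold flags more situations, but a smaller fraction of them turn out to be actual long opportunities.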
