In this notebook, I evaluate the predictions of the neural network model on the test set of stories using basic statistical analysis. 

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import pickle
import statsmodels.api as sm
import numpy as np

%matplotlib notebook

In [2]:
# load pickled lists, closing each file once it has been read
with open('files/predictions.pkl', 'rb') as f:
    predicted = pickle.load(f)
with open('files/actuals.pkl', 'rb') as f:
    actual = pickle.load(f)

The datasets are easily contrasted by examining their histograms.

In [3]:
# convert to numpy arrays 
X = np.array(predicted)
Y = np.array(actual)

In [4]:
fig, axes = plt.subplots(1, 2, sharey=True, tight_layout=True)

axes[0].hist(X, label='predicted rating')
axes[1].hist(Y, label='actual rating', color='r')

axes[0].set_ylabel('num stories')

fig.legend(loc=(0.6,0.8))
fig.text(0.5, 0, 'ratings')
fig.text(0.45, 0.98, 'Rating Histograms', fontsize=10)


[Figure: Rating Histograms. Left panel: predicted ratings; right panel: actual ratings (red); shared y-axis: num stories; x-axis: ratings.]

The distribution of the test set ratings is left-skewed, while that of the model predictions is more symmetric.
The most frequent predicted rating is just under 7.5, while that of the actual ratings is around 8. 
At the extremes, the model predicts almost no ratings lower than 6 or greater than 8, while the actual ratings cover a much larger range. Further differences are observed in the measures of center and variation below.
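The skew described above can also be put into a single number. The following is a minimal sketch of the moment-based skewness (the mean of the cubed z-scores); the helper name `sample_skewness` and the toy arrays are illustrative assumptions, not part of the original analysis. Negative values indicate a longer tail toward low ratings.

```python
import numpy as np

def sample_skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores."""
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std()  # population std (ddof=0), NumPy's default
    return float(np.mean(z ** 3))

# Hypothetical ratings, NOT the notebook's data.
toy_left_tail = np.array([3.0, 5.5, 7.0, 7.5, 7.8, 8.0, 8.0, 8.2, 8.5])
toy_symmetric = np.array([6.0, 6.5, 7.0, 7.5, 8.0])

print('skewness (long left tail): %.3f' % sample_skewness(toy_left_tail))
print('skewness (symmetric): %.3f' % sample_skewness(toy_symmetric))
```

Applied to the notebook's arrays, this would be called as `sample_skewness(Y)` and `sample_skewness(X)`.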

In [5]:
print('mean of predictions: %.3f' % X.mean())
print('mean of actuals: %.3f' % Y.mean())

mean of predictions: 7.336
mean of actuals: 7.597


In [6]:
print('median of predictions: %.3f' % np.median(X))
print('median of actuals: %.3f' % np.median(Y))

median of predictions: 7.367
median of actuals: 7.780


In [7]:
print('stdev of predictions: %.3f' % X.std())
print('stdev of actuals: %.3f' % Y.std())

stdev of predictions: 0.297
stdev of actuals: 1.081


In [8]:
print('range of predictions: %.3f' % (X.max()-X.min()))
print('range of actuals: %.3f' % (Y.max()-Y.min()))

range of predictions: 2.060
range of actuals: 6.010


The mean and median of the test set ratings are noticeably larger than those of the model predictions (7.597 vs. 7.336 and 7.780 vs. 7.367). More starkly, the standard deviation of the model predictions is about 3.6 times smaller than that of the test set ratings (0.297 vs. 1.081), which explains the narrow spike around the mean in the prediction histogram. The range of predictions is about one third as wide as that of the test ratings (2.060 vs. 6.010).

The model clearly struggles to capture the variance present in the test set ratings. While many of the model predictions fall within a standard deviation of the test ratings, the model fails to replicate the wider range of outcomes in the test set. 
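The comparison of spread can be collected into one helper. This is a rough sketch: the function name `dispersion_summary` and the toy arrays are hypothetical, and "within a standard deviation" is read here as |prediction - actual| <= std of the actual ratings, which is an assumption about the intended meaning. The first two print statements only redo the arithmetic on the figures reported above.

```python
import numpy as np

def dispersion_summary(pred, act):
    """Compare the spread of predictions with the spread of actual ratings."""
    pred = np.asarray(pred, dtype=float)
    act = np.asarray(act, dtype=float)
    return {
        'std_ratio': act.std() / pred.std(),
        'range_ratio': (act.max() - act.min()) / (pred.max() - pred.min()),
        # assumed reading: prediction lies within one std (of actuals) of its target
        'share_within_one_std': float(np.mean(np.abs(pred - act) <= act.std())),
    }

# Ratios recomputed from the figures reported in the cells above.
print('std ratio: %.2f' % (1.081 / 0.297))
print('range ratio: %.2f' % (6.010 / 2.060))

# Hypothetical ratings, NOT the notebook's data.
toy_pred = np.array([7.2, 7.3, 7.4, 7.5])
toy_act = np.array([5.0, 7.5, 8.0, 9.0])
summary = dispersion_summary(toy_pred, toy_act)
print(summary)
```

With the notebook's arrays, the call would be `dispersion_summary(X, Y)`, made before the standardization step further down.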

While flawed, the model may still have some predictive capacity. It's worth checking some basic 2D statistics to see if the model results could have occurred by chance. 

In [9]:
plt.plot(predicted, actual, 'b.')

plt.axhline(y=Y.mean(), c='k', label='mean')
plt.axvline(x=X.mean(), c='k')

plt.xlabel('predicted rating')
plt.ylabel('actual rating')
plt.title('Comparing Predicted and Actual Ratings')
plt.legend()

[Figure: Comparing Predicted and Actual Ratings. Scatterplot of actual rating (y-axis) against predicted rating (x-axis), with black lines marking the mean of each.]

We see from the scatterplot of actual vs. predicted ratings that the model predictions are closest to the test set ratings around the means of each dataset. The plot clearly illustrates the model's inaccurate predictions at extreme ratings. In particular, the model struggles most with test set ratings that are low, generating increasingly divergent predictions. 
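One way to check the claim that errors grow for low test set ratings is to average the absolute error within bins of the actual rating. The sketch below is an illustration only: `mean_abs_error_by_bin`, the bin edges, and the toy arrays are all hypothetical choices rather than anything taken from the notebook.

```python
import numpy as np

def mean_abs_error_by_bin(pred, act, edges):
    """Mean absolute error of predictions, grouped by bins of the actual rating."""
    pred = np.asarray(pred, dtype=float)
    act = np.asarray(act, dtype=float)
    # np.digitize returns i such that edges[i-1] <= value < edges[i]
    idx = np.digitize(act, edges)
    result = {}
    for i in range(1, len(edges)):
        mask = idx == i
        if mask.any():  # skip empty bins
            err = np.abs(pred[mask] - act[mask])
            result[(edges[i - 1], edges[i])] = float(err.mean())
    return result

# Hypothetical ratings and bin edges, NOT the notebook's data.
toy_pred = [7.0, 7.2, 7.4, 7.5, 7.6]
toy_act = [3.0, 5.0, 7.5, 7.8, 9.0]
toy_edges = [2.0, 6.0, 8.0, 10.0]

by_bin = mean_abs_error_by_bin(toy_pred, toy_act, toy_edges)
for (low, high), mae in by_bin.items():
    print('actual in [%.1f, %.1f): mean abs error %.2f' % (low, high, mae))
```

If the lowest bins show the largest mean absolute error on the real data, that would support the reading of the scatterplot given above.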

A model with 100% accurate predictions would produce a scatterplot of actual vs. predicted ratings in which every point lies exactly on the line y = x. The scatterplot for this model is shown below with that line overlaid.

In [10]:
x = np.linspace(6.0, 8.5, 100)
y = x
plt.plot(predicted, actual, 'b.')
plt.plot(x,y,'r', label='y=x')
plt.xlabel('predicted rating')
plt.ylabel('actual rating')
plt.title('Comparing Predicted and Actual Ratings')
plt.legend(loc=(0.8,0.5))

[Figure: Comparing Predicted and Actual Ratings. The same scatterplot of actual rating against predicted rating, with the line y=x drawn in red.]

The red line is the identity line. Ideally, the plot of actual vs. predicted ratings would be this line. This graph corroborates the conclusions drawn from the 1D data. In general, the model tends to underestimate the rating, as seen with the dense clustering of points above the identity line. 
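The tendency to underestimate can be quantified directly from the errors. From the means reported earlier, the average signed error is 7.336 - 7.597 = -0.261. The sketch below computes that figure along with a few standard error measures; `bias_summary` and the toy arrays are hypothetical names and values used only for illustration.

```python
import numpy as np

def bias_summary(pred, act):
    """Signed and unsigned error measures for predictions against actual ratings."""
    pred = np.asarray(pred, dtype=float)
    act = np.asarray(act, dtype=float)
    err = pred - act
    return {
        'mean_error': float(err.mean()),             # negative: underestimates on average
        'mae': float(np.abs(err).mean()),            # mean absolute error
        'rmse': float(np.sqrt((err ** 2).mean())),   # root mean squared error
        # share of points that would sit above the identity line
        'share_underestimated': float(np.mean(pred < act)),
    }

# Hypothetical ratings, NOT the notebook's data.
toy_pred = np.array([7.0, 7.5, 7.2, 7.4])
toy_act = np.array([8.0, 8.0, 6.0, 7.9])
toy_bias = bias_summary(toy_pred, toy_act)
print(toy_bias)
```

On the notebook's data, this would be called on the unstandardized arrays, since standardization removes the difference in means.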

But did the model results occur by chance? Regressing the actual ratings onto the predicted ratings can help answer that question.

In [11]:
# standardize both variables (zero mean, unit variance) so that a fit without an intercept is appropriate
X = (X - X.mean())/X.std()
Y = (Y - Y.mean())/Y.std()

In [12]:
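# sm.OLS fits no intercept unless a constant column is added; that is fine here
# because X and Y were both centered in the previous cell
mod = sm.OLS(Y, X).fit()
mod.summary()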

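==========================================================================
Dep. Variable:                 y   R-squared (uncentered):          0.076
Model:                       OLS   Adj. R-squared (uncentered):     0.074
Method:            Least Squares   F-statistic:                     56.95
Date:           Tue, 03 Nov 2020   Prob (F-statistic):            1.4e-13
Time:                   16:37:16   Log-Likelihood:                -963.01
No. Observations:            698   AIC:                            1928.0
Df Residuals:                697   BIC:                            1933.0
Df Model:                      1
Covariance Type:       nonrobust
==========================================================================
          coef   std err         t     P>|t|    [0.025    0.975]
--------------------------------------------------------------------------
x1      0.2748     0.036     7.546     0.000     0.203     0.346
==========================================================================
Omnibus:                  115.98   Durbin-Watson:                   2.129
Prob(Omnibus):               0.0   Jarque-Bera (JB):              184.553
Skew:                     -1.069   Prob(JB):                     8.41e-41
Kurtosis:                  4.331   Cond. No.                          1.0
==========================================================================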


We see that the predicted ratings explain about 7.6% of the variance in the actual ratings (R-squared = 0.076). Additionally, the very low p-value (1.4e-13 for the F-statistic) indicates that the relation is unlikely to have occurred by chance. The Pearson correlation coefficient between the two sets of ratings is calculated below.
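The claim that the relation is unlikely to be due to chance can also be cross-checked without leaning on the OLS assumptions. The following is a sketch of a permutation test, a different technique from the regression used in this notebook: it shuffles one variable many times and counts how often the shuffled correlation is at least as strong as the observed one. The function name, the number of permutations, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def permutation_test_correlation(x, y, n_permutations=1000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(x, y)[0, 1]
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(y)  # breaks any real pairing between x and y
        if abs(np.corrcoef(x, shuffled)[0, 1]) >= abs(observed):
            hits += 1
    # add-one correction so the estimated p-value is never exactly zero
    p_value = (hits + 1) / (n_permutations + 1)
    return float(observed), p_value

# Synthetic stand-in data, NOT the notebook's ratings.
rng = np.random.default_rng(42)
toy_pred = rng.normal(size=200)
toy_actual = 0.5 * toy_pred + rng.normal(size=200)

r_obs, p_perm = permutation_test_correlation(toy_pred, toy_actual)
print('observed correlation: %.3f' % r_obs)
print('permutation p-value: %.4f' % p_perm)
```

A small permutation p-value on the real arrays would agree with the regression result; the smallest value this sketch can report is 1 / (n_permutations + 1).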

In [13]:
print('rating correlation: %.3f' % np.corrcoef(X,Y)[1,0])

rating correlation: 0.275
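The correlation matches the regression coefficient (0.2748) for a reason: when both variables are standardized, the least-squares slope through the origin equals the Pearson correlation, and the uncentered R-squared equals its square (0.2748 squared is about 0.0755, in line with the reported 0.076). The sketch below checks this identity with NumPy alone; the helper names and toy arrays are hypothetical.

```python
import numpy as np

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / v.std()

def slope_through_origin(x, y):
    """Least-squares slope of y on x with no intercept: sum(x*y) / sum(x*x)."""
    return float(np.dot(x, y) / np.dot(x, x))

# Hypothetical ratings, NOT the notebook's data.
toy_pred = np.array([7.0, 7.2, 7.4, 7.5, 7.6, 7.1])
toy_act = np.array([6.0, 8.0, 7.5, 8.5, 7.9, 5.0])

zx = standardize(toy_pred)
zy = standardize(toy_act)

slope = slope_through_origin(zx, zy)
r = float(np.corrcoef(toy_pred, toy_act)[0, 1])
# uncentered R-squared: 1 - residual sum of squares / sum of squared responses
r2_uncentered = float(1 - np.sum((zy - slope * zx) ** 2) / np.sum(zy ** 2))

print('slope: %.4f, correlation: %.4f' % (slope, r))
print('uncentered R-squared: %.4f, correlation squared: %.4f' % (r2_uncentered, r ** 2))
```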


Conclusions:

The model results are unlikely to have occurred by chance, but they are only weakly correlated with the actual ratings (r = 0.275, R-squared = 0.076). Although the model has some predictive ability, it is generally inaccurate: it underestimates ratings on average and captures little of their spread. In order to accurately predict story ratings, a more complicated model will need to be considered.