# Module 6. Classification part 2
# Assessing performance
## Lecture objectives

1. Understand different ways to assess the predictive accuracy of a model

In the last lecture, we learned how to estimate a random forest model. But we skimmed over how to assess its predictive performance.

First, let's recreate the same model we estimated before. Because we are using the `random_state` parameter, we should get exactly the same results as before.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# create the dataframe, including dummies
joinedDf = pd.read_pickle('../scratch/joined_permits.pandas')
dummies1 = pd.get_dummies(joinedDf.UseType, prefix='usetype_')  # creates a dataframe of dummies
dummies2 = pd.get_dummies(joinedDf.UseDescription, prefix='usedesc_')
joinedDf = joinedDf.join(dummies1).join(dummies2) 
xvars = (dummies1.columns.tolist() + dummies2.columns.tolist() + 
            ['YearBuilt1', 'Units1', 'Bedrooms1', 'Bathrooms1', 'SQFTmain1', 'Roll_LandValue', 
             'Roll_ImpValue', 'Roll_LandBaseYear', 'Roll_ImpBaseYear', 'CENTER_LAT', 'CENTER_LON' ])
yvar = 'hasADU'
df_to_fit = joinedDf[xvars+[yvar]].dropna()

# test-train split
X_train, X_test, y_train, y_test = train_test_split(
    df_to_fit[xvars], df_to_fit[yvar], test_size = 0.25, random_state = 1)

# estimate and fit the model
rf = RandomForestClassifier(n_estimators = 100, random_state = 1, n_jobs=-1) # n_jobs=-1 uses all your computer's cores
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

Let's also re-estimate the simplified version of the model. That's because the real value of these performance assessments lies in comparing models: which one predicts best?

In [None]:
xvars2 = ['SQFTmain1', 'Roll_LandValue', 'Roll_ImpValue']

# create a dataframe with no NaNs
df_to_fit2 = joinedDf[xvars2+[yvar]].dropna()

X_train2, X_test2, y_train2, y_test2 = train_test_split(
    df_to_fit2[xvars2], df_to_fit2[yvar], test_size = 0.25, random_state = 1)

rf2 = RandomForestClassifier(n_estimators = 50, random_state = 1)
rf2.fit(X_train2, y_train2)
y_pred2 = rf2.predict(X_test2)

Last time we predicted that 0.1% of the parcels would have an ADU (0.06% in the stripped-down, simplified model). Let's confirm we get the same overall prediction.

In [None]:
print('Predicted fraction True: {:.4f}. Actual fraction True: {:.4f}'.format(
    y_pred.mean(), y_test.mean()))

print('\nSimplified model:\nPredicted fraction True: {:.4f}. Actual fraction True: {:.4f}'.format(
    y_pred2.mean(), y_test2.mean()))

Let's look more formally at the predictive accuracy.

`scikit-learn` has several useful functions here:
* The [*confusion matrix*](https://en.wikipedia.org/wiki/Confusion_matrix) gives the number of observations that fall into each combination of predicted and actual category (e.g., actually True and predicted to be True: a "true positive").
* The [*accuracy score*](https://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score) gives the fraction of accurate predictions. A high score is no guarantee of a good model. Suppose you want to predict whether a person will be struck by lightning tomorrow. Then just predict "No" for everyone. Accurate, but not useful!
* The [*classification report*](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-report) reports key metrics for each category (in this case, `True` and `False`). *Precision* is the fraction of predicted positives that are true positives, TP / (TP + FP): basically, how many of our positive predictions are correct. *Recall* is the fraction of actual positives that are true positives, TP / (TP + FN): basically, what proportion of the positives we correctly predicted.
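
To make these definitions concrete, here is a minimal sketch that computes accuracy, precision and recall by hand. The counts are hypothetical (they are not taken from the ADU data), and `summarize` is a helper invented for this illustration, not a `scikit-learn` function. For a binary outcome, `confusion_matrix(y_test, y_pred).ravel()` returns the counts in the same `tn, fp, fn, tp` order used here.

```python
# Hypothetical counts for a rare outcome (not taken from the ADU data)
tn, fp, fn, tp = 9900, 20, 60, 20

def summarize(tn, fp, fn, tp):
    """Return accuracy, precision and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when there are no predicted or actual positives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

accuracy, precision, recall = summarize(tn, fp, fn, tp)
print(f'Accuracy: {accuracy:.3f}  Precision: {precision:.3f}  Recall: {recall:.3f}')

# A "model" that always predicts False: every actual positive becomes a false negative
naive_accuracy, naive_precision, naive_recall = summarize(
    tn=tn + fp, fp=0, fn=fn + tp, tp=0)
print(f'Always-False accuracy: {naive_accuracy:.3f}  Recall: {naive_recall:.3f}')
```

With these made-up counts, the always-False "model" is exactly as accurate as the real one, but its recall is zero. That is the lightning problem in miniature.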

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, ConfusionMatrixDisplay

print(confusion_matrix(y_test, y_pred))
confusion_matrix?

In [None]:
print('Accuracy score: {:.4f}'.format(accuracy_score(y_test, y_pred)))

In [None]:
print(classification_report(y_test, y_pred))

We can also plot the confusion matrix. Note the colorbar shows the number of observations.

In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)

How does this compare to our simplified model?

In [None]:
ConfusionMatrixDisplay.from_predictions(y_test2, y_pred2)

So our more complex model does a bit better. In particular, it does a much better job of getting the true positives.

It's still not great. But this is not surprising: ADU construction will depend on many factors that are outside our dataset: lot shape, the ability of the owners to afford an ADU, and their personal circumstances (e.g. having an elderly family member). Bringing in census data and other predictors might help.

But even if the predictions are not always accurate, the predicted *probabilities* can be informative. We could even map them to indicate where ADUs are likely to be built. 

We can access the probabilities from the `rf` object. Note it gives us a matrix: the first column is the probability of `False`, and the second is the probability of `True`.

In [None]:
rf.predict_proba(X_test)
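
One use of these probabilities is to choose your own cutoff for calling a parcel a likely ADU site. The sketch below uses a small hypothetical array in the same two-column layout, rather than real model output, and the 0.1 cutoff is an arbitrary value chosen for illustration. For a binary forest, `predict()` picks the class with the highest average probability, which behaves roughly like a 0.5 cutoff on the second column.

```python
import numpy as np

# Hypothetical output in the same layout as rf.predict_proba:
# first column is P(False), second column is P(True)
proba = np.array([[0.95, 0.05],
                  [0.70, 0.30],
                  [0.45, 0.55],
                  [0.88, 0.12]])

p_true = proba[:, 1]           # second column: probability of True

default_pred = p_true > 0.5    # roughly what predict() does for a binary model
lenient_pred = p_true > 0.1    # assumed, illustrative cutoff

print(default_pred)
print(lenient_pred)
```

Lowering the cutoff flags more parcels as positive, which will generally raise recall at the cost of precision.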

Also note that we want to have the predicted probabilities of *all* the parcels - the test and the training datasets.

So we can concatenate the two dataframes.

In [None]:
rf.predict_proba(pd.concat([X_test,X_train]))

It's easy to turn an array like this into a pandas DataFrame.

In [None]:
# create a dataframe
predictions = pd.DataFrame(rf.predict_proba(pd.concat([X_test,X_train])), 
                           columns = ['pred_noADU', 'pred_ADU'])
predictions.head()

Let's join our original test and training data back to these predictions.

One wrinkle: we just have an integer index on `predictions`. Does this match `X_test`? 

In [None]:
X_test.head()

No, that has the original `APN` index. Fortunately, `reset_index()` will change the index to a sequential integer index.

In [None]:
# reset_index() moves APN into a column and gives a 0, 1, 2, ... index that matches predictions
predictions = predictions.join(pd.concat([X_test, X_train]).reset_index())
predictions.head()
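
The index mismatch is easy to see in a toy example. This is a sketch with made-up parcel IDs and values, not the real data: `join` matches rows on index labels, so a dataframe indexed by APN will not line up with one indexed 0, 1, 2 until the index is reset.

```python
import pandas as pd

# Toy stand-in for X_test/X_train: indexed by a made-up parcel ID
features = pd.DataFrame({'SQFTmain1': [1200, 1800, 950]},
                        index=pd.Index([5001, 5002, 5003], name='APN'))

# Toy stand-in for the predictions dataframe: default 0, 1, 2 integer index
preds = pd.DataFrame({'pred_ADU': [0.02, 0.15, 0.07]})

# Joining directly finds no matching index labels, so the new column is all NaN
misaligned = preds.join(features)
print(misaligned)

# After reset_index(), both sides share 0, 1, 2 and rows line up by position
aligned = preds.join(features.reset_index())
print(aligned)
```

Note that this positional match is only valid because the predictions were generated from the rows in exactly the same order as the dataframe being joined.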

Now, let's map these predictions.

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Convert your dataframe, <strong>predictions</strong> to a GeoDataFrame that includes a geometry column.
</div>

In [None]:
import geopandas as gpd
predictions = gpd.GeoDataFrame(predictions, 
                    geometry = gpd.points_from_xy(
                        predictions.CENTER_LON, 
                        predictions.CENTER_LAT, crs='EPSG:4326'))

In [None]:
import contextily as ctx
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10,10))
predictions.to_crs('EPSG:3857').plot('pred_ADU', markersize=0.001, ax=ax)
ctx.add_basemap(ax=ax)
ax.set_xticks([])
ax.set_yticks([])

Actually, this map says more about where parcels overlap than about the presence of ADUs. There are a few strategies that we might use to avoid this:
* Don't plot parcels with less than a certain probability of having an ADU
* Use a different colormap (the `cmap` argument)
* Use a smaller marker size to reduce overlap (the `markersize` argument)
* Scale the colormap so that it tops out at a lower value (the `vmax` argument)

Let's try the first and fourth approaches.

In [None]:
fig, ax = plt.subplots(figsize=(5,5))
predictions[predictions.pred_ADU>0.1].to_crs('EPSG:3857').plot('pred_ADU', cmap='Purples', markersize=0.001, 
                                     vmax=0.2, ax=ax, legend=True)
ctx.add_basemap(ax=ax, alpha = 0.3)
ax.set_xticks([])
ax.set_yticks([])

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> Experiment with different values for <strong>vmax</strong>. What does this parameter do?
</div>

<div class="alert alert-block alert-info">
    <strong>Exercise:</strong> What other colormaps can you use? Explore the matplotlib documentation.
</div>

Hmm. Maybe changing the marker size is a better way to go than changing the color.

We can do this by scaling the column that we plot.

In [None]:
import contextily as ctx
import matplotlib.pyplot as plt

predictions['pred_ADU_scaled'] = predictions.pred_ADU*0.01
fig, ax = plt.subplots(figsize=(5,5))
predictions[predictions.pred_ADU>0.1].to_crs('EPSG:3857').plot(cmap='Purples', 
                                     markersize='pred_ADU_scaled', 
                                     ax=ax, legend=True)
ctx.add_basemap(ax=ax, alpha=0.4)
ax.set_xticks([])
ax.set_yticks([])

It's still not clear how to distinguish areas with lots of parcels from areas with few parcels but a high propensity to have an ADU. Any ideas? 

We'll leave the mapping here, but my inclination would be to aggregate to a grid or to census block groups, in order to make the propensity to have an ADU a bit clearer.
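
As a rough sketch of the grid idea, the example below bins a few hypothetical parcels into square cells and averages the predicted probability in each cell. The coordinates, probabilities and the 0.1-degree cell size are all made up for illustration. Cells defined in degrees are not equal-area, so a projected CRS would be a better basis for a real analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical parcels: longitude, latitude and predicted ADU probability
toy = pd.DataFrame({
    'CENTER_LON': [-118.41, -118.42, -118.48, -118.21, -118.22],
    'CENTER_LAT': [34.01, 34.02, 34.03, 34.11, 34.12],
    'pred_ADU':   [0.10, 0.30, 0.20, 0.02, 0.04],
})

cell_size = 0.1  # grid cell width in degrees; an arbitrary choice for illustration

# Assign each parcel to a cell by flooring its coordinates to the cell size
toy['cell_x'] = np.floor(toy.CENTER_LON / cell_size).astype(int)
toy['cell_y'] = np.floor(toy.CENTER_LAT / cell_size).astype(int)

# Mean probability per cell, plus the parcel count so sparse cells can be flagged
grid = (toy.groupby(['cell_x', 'cell_y'])
           .agg(mean_pred_ADU=('pred_ADU', 'mean'),
                n_parcels=('pred_ADU', 'size'))
           .reset_index())
print(grid)
```

Because each cell reports an average rather than a pile of overlapping points, a cell with many parcels and a cell with few parcels can be compared directly, and the parcel count is there if you want to mask out cells with too little data.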

<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
  <li>Confusion matrices are an excellent way to assess predictive performance, particularly for binary outcomes.</li>
  <li>Mapping and other descriptive plots are also good ways to explore the outputs.</li>
</ul>
</div>