# Classification

So far you've predicted numeric targets. This type of modeling is called **regression**, hence the "Regressor" part of `RandomForestRegressor`.

Another common problem you'll see is making a choice between mutually exclusive outcomes. For example, spam detection is predicting whether an email is "spam" or "not spam" based on the email's content. This type of modeling is called **classification**.

There are two types of classification:
- **binary** (choosing between two classes)
- **multiclass** (choosing between more than two classes)

In general there are different approaches to the two types of classification, but most multiclass models will also work for binary problems.

It's straightforward to build classification models using what you already know about scikit-learn. Instead of `RandomForestRegressor`, you will use `RandomForestClassifier`. 

As an example of classification with `RandomForestClassifier`, we'll use a dataset of phone features to predict a phone's price range. The targets in the data have values:

 * 0 (low cost)
 * 1 (medium cost)
 * 2 (high cost)
 * 3 (very high cost)
 
The features include:

* battery_power: Total energy the battery can store at one time, measured in mAh
* blue: Has Bluetooth or not
* clock_speed: Speed at which the microprocessor executes instructions
* dual_sim: Has dual SIM support or not
* fc: Front camera megapixels
* four_g: Has 4G or not
* ....

Here is a quick overview of the data:

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as metrics

data = pd.read_csv('../input/mobile-price-classification/train.csv')
data.head()

Unnamed: 0,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,n_cores,...,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi,price_range
0,842,0,2.2,0,1,0,7,0.6,188,2,...,20,756,2549,9,7,19,0,0,1,1
1,1021,1,0.5,1,0,1,53,0.7,136,3,...,905,1988,2631,17,3,7,1,1,0,2
2,563,1,0.5,1,2,1,41,0.9,145,5,...,1263,1716,2603,11,2,9,1,1,0,2
3,615,1,2.5,0,0,0,10,0.8,131,6,...,1216,1786,2769,16,8,11,1,0,0,2
4,1821,1,1.2,0,13,1,44,0.6,141,2,...,1208,1212,1411,8,2,15,1,1,0,1


In [2]:
data.columns

Index(['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g',
       'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height',
       'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi', 'price_range'],
      dtype='object')

We set up the features and target as before, then split them into training and validation sets using `train_test_split`. This part looks like what you've already seen.

In [3]:
# Set variables for the targets and features
y = data['price_range']
X = data.drop('price_range', axis=1)

# Split the data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=7)

Creating and fitting the model is similar to what you've done before, except you'll use `RandomForestClassifier` instead of `RandomForestRegressor`.

In [4]:
# Create the classifier and fit it to our training data
model = RandomForestClassifier(random_state=7, n_estimators=100)
model.fit(train_X, train_y)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=7, verbose=0,
                       warm_start=False)

The simplest metric for classification models is the **accuracy**, the fraction of predictions that are correct. Scikit-learn provides `metrics.accuracy_score` to calculate this.

In [5]:
# Predict classes given the validation features
pred_y = model.predict(val_X)

# Calculate the accuracy as our performance metric
accuracy = metrics.accuracy_score(val_y, pred_y)
print("Accuracy: ", accuracy)

Accuracy:  0.864
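
To make the definition concrete, here is a minimal sketch that computes accuracy by hand. The `true_labels` and `predicted_labels` lists are made up for illustration and are not taken from the phone dataset. Given the same inputs, `metrics.accuracy_score` should return the same value.

```python
# Made-up labels for illustration only (not from the phone dataset)
true_labels = [0, 1, 2, 3, 1, 2, 0, 3]
predicted_labels = [0, 1, 1, 3, 1, 2, 0, 2]

# Count the predictions that match the true label
n_correct = sum(t == p for t, p in zip(true_labels, predicted_labels))

# Accuracy is the fraction of predictions that are correct
manual_accuracy = n_correct / len(true_labels)
print("Manual accuracy: ", manual_accuracy)
```

Six of the eight made-up predictions match, so the fraction is 6/8.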


## Confusion Matrix

Our model did pretty well, correctly predicting around 86% of the price ranges in the validation data. Accuracy alone doesn't show where the model is failing, though. For that it's often useful to look at a **confusion matrix**, which counts how the model classified the inputs from each true class.

In [6]:
# Calculate the confusion matrix itself
confusion = metrics.confusion_matrix(val_y, pred_y)
print(f"Confusion matrix:\n{confusion}")


# Normalize each row by its true label count to get rates
print("\nNormalized confusion matrix:")
for row in confusion:
    print(row / row.sum())

Confusion matrix:
[[130   6   0   0]
 [  4  91  15   0]
 [  0  21  98  12]
 [  0   0  10 113]]

Normalized confusion matrix:
[0.95588235 0.04411765 0.         0.        ]
[0.03636364 0.82727273 0.13636364 0.        ]
[0.         0.16030534 0.7480916  0.09160305]
[0.         0.         0.08130081 0.91869919]


The matrix is easier to read as a figure:

<img src="https://i.imgur.com/idD0k8y.png" alt="example confusion matrix" width=400px>

The rows of the confusion matrix are the true class and the columns are the predicted class. The diagonal tells us how many of each class the model predicted correctly. The off-diagonals show where the model is making wrong predictions, where it is "confused."

For example, looking at the first column and second row, we classified four phones that were actually medium cost (true label: 1) as low cost (predicted label: 0). For classes 0 and 3, the low cost and very high cost phones, our model works really well, correctly classifying more than 90% of each. However, our model is weaker for medium and high cost phones. Note that incorrect predictions occur only between adjacent classes. The model never confuses low cost and very high cost phones.
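
These observations can be checked directly from the counts. The sketch below copies the confusion matrix printed earlier into a plain Python list (the variable names are illustrative) and recovers the overall accuracy, the per-class rates, and the adjacent-class pattern.

```python
# Confusion matrix copied from the printed output
# (rows: true class, columns: predicted class)
confusion_counts = [
    [130, 6, 0, 0],
    [4, 91, 15, 0],
    [0, 21, 98, 12],
    [0, 0, 10, 113],
]
n_classes = len(confusion_counts)

# Correct predictions sit on the diagonal
n_correct = sum(confusion_counts[i][i] for i in range(n_classes))
n_total = sum(sum(row) for row in confusion_counts)
print("Accuracy from matrix:", n_correct / n_total)

# Per-class rate of correct predictions: diagonal entry divided by the row total
class_rates = [row[i] / sum(row) for i, row in enumerate(confusion_counts)]
for label, rate in enumerate(class_rates):
    print(f"Class {label}: {rate:.3f}")

# Errors between non-adjacent classes (more than one price range apart)
far_errors = sum(
    confusion_counts[i][j]
    for i in range(n_classes)
    for j in range(n_classes)
    if abs(i - j) > 1
)
print("Errors between non-adjacent classes:", far_errors)
```

The diagonal sums to 432 of the 500 validation phones, which matches the accuracy reported earlier.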

## Class Probabilities

Many classification models, including random forests, actually calculate a *probability distribution* over the classes. Calling `model.predict` simply returns the class with the highest probability. That choice might not be ideal if different decisions have different effects on your metrics or downstream measures. To get the probabilities themselves, use the `.predict_proba` method.

In [7]:
probs = model.predict_proba(val_X)
print(probs)

[[0.02 0.09 0.44 0.45]
 [0.02 0.06 0.22 0.7 ]
 [0.   0.17 0.61 0.22]
 ...
 [0.05 0.17 0.42 0.36]
 [0.45 0.34 0.13 0.08]
 [0.25 0.53 0.18 0.04]]
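
As a rough illustration of how these rows relate to `model.predict`, the sketch below picks the highest-probability column for the first three rows of the output. It assumes the columns are ordered as classes 0 to 3, which is how scikit-learn orders them for this dataset (the column order follows `model.classes_`). The name `probs_sample` is illustrative.

```python
# The first three probability rows copied from the printed output
# (one row per phone, one column per class 0-3)
probs_sample = [
    [0.02, 0.09, 0.44, 0.45],
    [0.02, 0.06, 0.22, 0.70],
    [0.00, 0.17, 0.61, 0.22],
]

# For each row, take the index of the largest probability
best_classes = [
    max(range(len(row)), key=lambda k: row[k])
    for row in probs_sample
]
print(best_classes)
```

Note how close the first row is: classes 2 and 3 differ by only 0.01, which a single predicted label hides.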


This shows the probability the model assigns to each class. Often in business problems, decisions you make lead to different monetary returns. The expected return for a decision based on your classifier is the probability times the monetary return of that decision.

Consider probabilities `[0.05 0.17 0.42 0.36]`. Assume the third option would result in \\$100 of profit while the fourth option would return \\$150 in profit. Then the expected monetary values are $0.42 \times \$100 = \$42$ and $0.36 \times \$150 = \$54$. Even though the third option has the highest probability, on average it would be better from a business perspective to choose the fourth option.
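
The same arithmetic can be sketched in a few lines of Python. Only the \\$100 and \\$150 profits come from the example above. The profits for the first two options are placeholders set to 0 purely for illustration, and the variable names are hypothetical.

```python
# Probabilities from the example (classes 0-3)
probs_row = [0.05, 0.17, 0.42, 0.36]

# Profit per option: the last two values come from the example;
# the first two are placeholder assumptions
profits = [0, 0, 100, 150]

# Expected value of each option = probability * profit
expected_values = [p * v for p, v in zip(probs_row, profits)]

# Compare the most probable option with the most profitable one on average
most_probable = max(range(len(probs_row)), key=lambda k: probs_row[k])
best_expected = max(range(len(expected_values)), key=lambda k: expected_values[k])

print("Most probable option (0-indexed):", most_probable)
print("Best option by expected profit (0-indexed):", best_expected)
```

With these numbers the most probable option is index 2 (the third option), while the best expected profit comes from index 3 (the fourth option).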

# Your Turn 
Try **[some classification](#$NEXT_NOTEBOOK_URL$)** yourself. It's not complicated given what you already know, and it will dramatically expand what types of use cases you can tackle.