<a href="https://colab.research.google.com/github/YahyaMansoor/Data-Analytics-exercises/blob/main/ConfusionMatrix_v0_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  Hunting Exoplanets In Space - Model Evaluation

---


In the previous class, we deployed the `RandomForestClassifier` prediction model, which classified the stars in the test dataset as `1` or `2`.

The model turned out to be 99% accurate. However, the test dataset contains `565` stars labelled `1` and only `5` stars labelled `2`, and the Random Forest Classifier model classified every star as `1`. Ideally, it should classify a few stars as `2` because the ultimate goal of the Kepler Space Telescope was to find exoplanets.

Hence, we need to ensure that the model makes accurate predictions even when the classes in the dataset are unevenly distributed. For this purpose, we need to evaluate the model that we deployed.

Generally, a classification model (in this case, a Random Forest classifier) is evaluated using a **confusion matrix**.

In this class, we will learn how to evaluate the performance of a classification-based machine learning model using a confusion matrix.


Let's run all the code cells that we have already covered in the previous classes and begin this class from the **Activity 1: The Confusion Matrix** section. Run the code cells up to the first activity yourself as well.


---

#### Loading The Datasets
Create a Pandas DataFrame every time you start the Jupyter notebook.

Dataset links (don't click on them):

1. Train dataset

  https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTrain.csv

2. Test dataset

  https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTest.csv

In [None]:
# Load the datasets.
import pandas as pd
exo_train_df = pd.read_csv('https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTrain.csv')
exo_test_df = pd.read_csv('https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTest.csv')
exo_train_df

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,93.85,83.81,20.10,-26.98,-39.56,-124.71,-135.18,-96.27,-79.89,...,-78.07,-102.15,-102.15,25.13,48.57,92.54,39.32,61.42,5.08,-39.54
1,2,-38.88,-33.83,-58.54,-40.09,-79.31,-72.81,-86.55,-85.33,-83.97,...,-3.28,-32.21,-32.21,-24.89,-4.86,0.76,-11.70,6.46,16.00,19.93
2,2,532.64,535.92,513.73,496.92,456.45,466.00,464.50,486.39,436.56,...,-71.69,13.31,13.31,-29.89,-20.88,5.06,-11.80,-28.91,-70.02,-96.67
3,2,326.52,347.39,302.35,298.13,317.74,312.70,322.33,311.31,312.42,...,5.71,-3.73,-3.73,30.05,20.03,-12.67,-8.77,-17.31,-17.35,13.98
4,2,-1107.21,-1112.59,-1118.95,-1095.10,-1057.55,-1034.48,-998.34,-1022.71,-989.57,...,-594.37,-401.66,-401.66,-357.24,-443.76,-438.54,-399.71,-384.65,-411.79,-510.54
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5082,1,-91.91,-92.97,-78.76,-97.33,-68.00,-68.24,-75.48,-49.25,-30.92,...,139.95,147.26,156.95,155.64,156.36,151.75,-24.45,-17.00,3.23,19.28
5083,1,989.75,891.01,908.53,851.83,755.11,615.78,595.77,458.87,492.84,...,-26.50,-4.84,-76.30,-37.84,-153.83,-136.16,38.03,100.28,-45.64,35.58
5084,1,273.39,278.00,261.73,236.99,280.73,264.90,252.92,254.88,237.60,...,-26.82,-53.89,-48.71,30.99,15.96,-3.47,65.73,88.42,79.07,79.43
5085,1,3.82,2.09,-3.29,-2.88,1.66,-0.75,3.85,-0.03,3.28,...,10.86,-3.23,-5.10,-4.61,-9.82,-1.50,-4.65,-14.55,-6.41,-2.55


Check the number of rows and columns in the DataFrames.

In [None]:
# Number of rows and columns in the DataFrames.
print(exo_train_df.shape)
exo_test_df.shape

(5087, 3198)


(570, 3198)

---

#### The `value_counts()` Function

To compute how many times a value occurs in a series, use the `value_counts()` function.

In [None]:
# The number of times a value occurs in a Pandas series.
exo_test_df['LABEL'].value_counts()

1    565
2      5
Name: LABEL, dtype: int64

There are `565` stars which are classified as `1` and `5` stars classified as `2`. This means that only `5` stars have a planet.
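The counts above also show how imbalanced the test set is. Here is a minimal sketch in plain Python; the list `labels` is an illustrative stand-in that mirrors the 565/5 split rather than the real `LABEL` column.

```python
from collections import Counter

# Stand-in for exo_test_df['LABEL']: 565 stars labelled 1 and 5 labelled 2.
labels = [1] * 565 + [2] * 5

counts = Counter(labels)      # the same tallies that value_counts() reports
total = sum(counts.values())

# Print each class with its share of the whole test set.
for label, count in sorted(counts.items()):
    print(f"class {label}: {count} stars ({count / total:.2%})")
```

Fewer than 1 in 100 stars in the test set belongs to class `2`, which is why a model can score well overall while missing every planet.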


---

#### Importing The `RandomForestClassifier` Class
We need to import a class called `RandomForestClassifier` from the `sklearn.ensemble` module. `sklearn`, or **scikit-learn**, is a library containing many machine learning modules. It lets you apply most common machine learning algorithms directly, without implementing the underlying math yourself, much like a plug-and-play device.

In [None]:
# Import the 'RandomForestClassifier' class from the 'sklearn.ensemble' module.
from sklearn.ensemble import RandomForestClassifier

---

#### The Target & Feature Variables Separation

The `RandomForestClassifier` class has a method called `fit()` which takes two inputs. The first input is the collection of feature variables. The second input is the target variable. Hence, we need to extract the target variable and the feature variables separately from the training dataset.

Let's store the feature variables in the `x_train` variable and the target variable in the `y_train`.

In [None]:
# Extract feature variables from the training dataset.
x_train = exo_train_df.iloc[:, 1:]
x_train.head()

Unnamed: 0,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,93.85,83.81,20.1,-26.98,-39.56,-124.71,-135.18,-96.27,-79.89,-160.17,...,-78.07,-102.15,-102.15,25.13,48.57,92.54,39.32,61.42,5.08,-39.54
1,-38.88,-33.83,-58.54,-40.09,-79.31,-72.81,-86.55,-85.33,-83.97,-73.38,...,-3.28,-32.21,-32.21,-24.89,-4.86,0.76,-11.7,6.46,16.0,19.93
2,532.64,535.92,513.73,496.92,456.45,466.0,464.5,486.39,436.56,484.39,...,-71.69,13.31,13.31,-29.89,-20.88,5.06,-11.8,-28.91,-70.02,-96.67
3,326.52,347.39,302.35,298.13,317.74,312.7,322.33,311.31,312.42,323.33,...,5.71,-3.73,-3.73,30.05,20.03,-12.67,-8.77,-17.31,-17.35,13.98
4,-1107.21,-1112.59,-1118.95,-1095.1,-1057.55,-1034.48,-998.34,-1022.71,-989.57,-970.88,...,-594.37,-401.66,-401.66,-357.24,-443.76,-438.54,-399.71,-384.65,-411.79,-510.54


In [None]:
# Retrieve only the first column, i.e., the 'LABEL' column.
y_train = exo_train_df.iloc[:, 0] 
y_train.head()

0    2
1    2
2    2
3    2
4    2
Name: LABEL, dtype: int64

---

#### Fitting The Model
Let's train the model using the `fit()` function.

In [None]:
# Train the 'RandomForestClassifier' model using the 'fit()' function.
rf_clf = RandomForestClassifier(n_jobs=-1, n_estimators=50)
rf_clf.fit(x_train, y_train)

rf_clf.score(x_train, y_train)

1.0

As you can see, we have built the Random Forest Classifier model with 50 decision trees. The accuracy score of the model on the training data is 100%.



---

#### Target And Feature Variables From Test Dataset
Now we need to make predictions on the test dataset. So, we first need to extract the feature variables from the test dataset using the `iloc[]` indexer.

In [None]:
# Extract the feature variables from the test dataset.
x_test = exo_test_df.iloc[:, 1:]
x_test.head()

Unnamed: 0,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,119.88,100.21,86.46,48.68,46.12,39.39,18.57,6.98,6.63,-21.97,...,14.52,19.29,14.44,-1.62,13.33,45.5,31.93,35.78,269.43,57.72
1,5736.59,5699.98,5717.16,5692.73,5663.83,5631.16,5626.39,5569.47,5550.44,5458.8,...,-581.91,-984.09,-1230.89,-1600.45,-1824.53,-2061.17,-2265.98,-2366.19,-2294.86,-2034.72
2,844.48,817.49,770.07,675.01,605.52,499.45,440.77,362.95,207.27,150.46,...,17.82,-51.66,-48.29,-59.99,-82.1,-174.54,-95.23,-162.68,-36.79,30.63
3,-826.0,-827.31,-846.12,-836.03,-745.5,-784.69,-791.22,-746.5,-709.53,-679.56,...,122.34,93.03,93.03,68.81,9.81,20.75,20.25,-120.81,-257.56,-215.41
4,-39.57,-15.88,-9.16,-6.37,-16.13,-24.05,-0.9,-45.2,-5.04,14.62,...,-37.87,-61.85,-27.15,-21.18,-33.76,-85.34,-81.46,-61.98,-69.34,-17.84


Let's also extract the target variable from the test dataset so that we can compare the actual target values with the predicted values later.

In [None]:
# Extract the target variable from the test dataset.
y_test = exo_test_df.iloc[:, 0]
y_test.head()

0    2
1    2
2    2
3    2
4    2
Name: LABEL, dtype: int64

---

#### The `predict()` Function
Now, let's make predictions on the test dataset by calling the `predict()` function with the feature variables of the test dataset as an input.

In [None]:
# Make predictions using the 'predict()' function.
y_predicted = rf_clf.predict(x_test)
y_predicted

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

The actual target values are stored in a Pandas series. So, for the sake of consistency, let's convert the NumPy array of the predicted values into a Pandas series.

In [None]:
# Convert the NumPy array of predicted values into a Pandas series.
y_predicted = pd.Series(y_predicted)
y_predicted.head()

0    1
1    1
2    1
3    1
4    1
dtype: int64

Now, let's count the number of stars classified as `1` and `2`.

In [None]:
# Using the 'value_counts()' function, count the number of times 1 and 2 occur in the predicted values.
y_predicted.value_counts()

1    570
dtype: int64
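To see how predicting only class `1` still produces a high accuracy, the comparison can be done by hand. This is a sketch with stand-in lists (`actual` and `predicted` are illustrative names that mirror the counts above, not the real `y_test` and `y_predicted` objects).

```python
# Stand-ins for y_test and y_predicted: the test set has 5 class-2 stars and
# 565 class-1 stars, and the model predicted class 1 for all 570 of them.
actual = [2] * 5 + [1] * 565
predicted = [1] * 570

# Accuracy is the fraction of positions where the prediction equals the actual label.
matches = sum(a == p for a, p in zip(actual, predicted))
accuracy = matches / len(actual)

# How many of the actual class-2 stars did the model find?
planets_found = sum(a == 2 and p == 2 for a, p in zip(actual, predicted))

print(f"accuracy = {accuracy:.4f}, planets found = {planets_found} of 5")
```

The accuracy is high only because class `1` dominates the test set; the count of planets found is what the rest of this class focuses on.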

---

#### Activity 1: The Confusion Matrix
Let's first create a confusion matrix and then try to understand it.

To create a confusion matrix, first import the `confusion_matrix` function from the `sklearn.metrics` module. This module contains functions that compute the metrics used to evaluate a machine learning model. In addition to `confusion_matrix`, let's also import the `classification_report` function. We will use both later to evaluate our model.

In [None]:
#  Import the 'confusion_matrix' and 'classification_report' functions from the 'sklearn.metrics' module.
from sklearn.metrics import confusion_matrix, classification_report

Now, create the confusion matrix using the `confusion_matrix()` function. It requires two inputs. The first input is actual target values (`y_test`) and the second input is predicted target values (`y_predicted`).

In [None]:
# Create a confusion matrix using the 'y_test' and 'y_predicted' values.
confusion_matrix(y_test, y_predicted)

array([[565,   0],
       [  5,   0]])

Now that we have got our confusion matrix, let's try to understand this concept.

**Confusion Matrix:**
---

It is a way of evaluating the performance of your machine learning algorithm.

**For Example:**

Suppose that you attempted an online exam with 100 questions, and you already know that you gave 75 correct answers and 25 incorrect answers.

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C17/WHJ-BOY-TYPING-APT-C17.gif" height=400/>


However, the exam software did not assess the answers correctly: it marked many correct answers as incorrect and some incorrect answers as correct. Let us evaluate the performance of this software using a confusion matrix.

- There are two possible classes:
  1. Class `correct`.
  2. Class `incorrect`.

We need to find out how many correct answers were accurately assessed or predicted by the software. In technical terms, the desired outcome is called a **positive outcome**.

Thus,
- positive outcome $\Rightarrow$ `correct` answer.
- negative outcome $\Rightarrow$ `incorrect` answer.

Now, consider the following table and the `2 × 2` matrix known as a **confusion matrix**. The table shows the actual and predicted values for the first 4 questions.

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C17/table2.PNG"/>



Now let us have a look at each cell of the confusion matrix.

1. The value in the first row and first column counts the `incorrect` answers which were <b><font color=green>accurately</font></b> assessed or predicted as `incorrect` by the software.
Such values are called **True Negative (TN)**.


<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C17/True_negative.png"/>

2. The value in the second row and second column counts the `correct` answers which were <b><font color=green>accurately</font></b> assessed or predicted as `correct` by the software.
Such values are called **True Positive (TP)**.

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C17/true_positives.png"/>

3. The value in the second row and first column counts the `correct` answers which were <b><font color=red>inaccurately</font></b> assessed or predicted as `incorrect` by the software.
Such values are called **False Negative (FN)**.


<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C17/fn.png"/>

4. The value in the first row and second column counts the `incorrect` answers which were <b><font color=red>inaccurately</font></b> assessed or predicted as `correct` by the software.
Such values are called **False Positive (FP)**.


<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C17/FP.png"/>


The resultant confusion matrix obtained after evaluating the values for all 100 questions is as follows:

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C17/whj-negative-positive-apt-c17-01.png" height=300/>


- Values that are accurately predicted or assessed by the model are labelled **True (T)**. Thus, the number of answers which were <b><font color=green>accurately</font></b> predicted by the software = `85`


<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C17/whj-85-ans-apt-c17.gif" height=450/>

- Values that are inaccurately predicted or assessed by the model are labelled **False (F)**. Thus, the number of answers which were <b><font color=red>inaccurately</font></b> predicted by the software = `15`


<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C17/whj-15-answers-apt-c17.gif" height=450/>

Thus, the confusion matrix compares the actual values with the predicted values, which makes it very useful for evaluating the performance of your machine learning model.
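The four cells can also be counted directly from two lists of labels. The sketch below uses hypothetical answer sheets built to match this example: the TP, FP and FN counts (65, 5, 10) are the ones used in the precision and recall calculations for the exam software, and TN = 20 is inferred from the total of 100 questions. The helper `count_cells` is an illustrative name, not a library function.

```python
# Hypothetical answer sheets matching the totals of the exam example:
# 75 correct answers (65 marked correct, 10 marked incorrect) and
# 25 incorrect answers (20 marked incorrect, 5 marked correct).
actual = ["correct"] * 75 + ["incorrect"] * 25
predicted = (["correct"] * 65 + ["incorrect"] * 10
             + ["incorrect"] * 20 + ["correct"] * 5)

def count_cells(actual, predicted, positive):
    """Return (TN, FP, FN, TP) counts, treating `positive` as the positive class."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1   # positive, predicted positive
        elif a == positive:
            fn += 1   # positive, predicted negative
        elif p == positive:
            fp += 1   # negative, predicted positive
        else:
            tn += 1   # negative, predicted negative
    return tn, fp, fn, tp

tn, fp, fn, tp = count_cells(actual, predicted, positive="correct")
print("TN, FP, FN, TP =", tn, fp, fn, tp)
print("accurately assessed:", tp + tn)
print("inaccurately assessed:", fp + fn)
```

The "True" cells add up to the accurately assessed answers and the "False" cells add up to the inaccurately assessed ones.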

---

Let us apply the concept of a confusion matrix to our dataset.

There are two possible classes:
1. The class `1` values are stars **NOT** having a planet.
2. The class `2` values are stars having a planet.

So, after you deploy the classification model, there are 4 possible outcomes. They are:

1. Class `1` values predicted as class `1`. 

2. Class `1` values predicted as class `2`.

3. Class `2` values predicted as class `1`.

4. Class `2` values predicted as class `2`.

These 4 possibilities can be reported in a confusion matrix. 

||Predicted Class `1` (`y_predicted`)|Predicted Class `2` (`y_predicted`)|
|-|-|-|
|Actual Class `1` (`y_test`)|||
|Actual Class `2` (`y_test`)|||

where

- `y_test` contains the actual class `1` and class `2` values

- `y_predicted` contains the predicted class `1` and class `2` values

In this table,

- the values **predicted as class `1` and actually belonging to class `1`** are reported in the first row and first column.

||Predicted Class `1` (`y_predicted`)|Predicted Class `2` (`y_predicted`)|
|-|-|-|
|Actual Class `1` (`y_test`)|565||
|Actual Class `2` (`y_test`)|||

- the values **predicted as class `1` but actually belonging to class `2`** are reported in the second row and first column. 

||Predicted Class `1` (`y_predicted`)|Predicted Class `2` (`y_predicted`)|
|-|-|-|
|Actual Class `1` (`y_test`)|565||
|Actual Class `2` (`y_test`)|5||

- the values **predicted as class `2` and actually belonging to class `2`** are reported in the second row and second column.

||Predicted Class `1` (`y_predicted`)|Predicted Class `2` (`y_predicted`)|
|-|-|-|
|Actual Class `1` (`y_test`)|565||
|Actual Class `2` (`y_test`)|5|0|

- the values **predicted as class `2` but actually belonging to class `1`** are reported in the first row and second column.

||Predicted Class `1` (`y_predicted`)|Predicted Class `2` (`y_predicted`)|
|-|-|-|
|Actual Class `1` (`y_test`)|565|0|
|Actual Class `2` (`y_test`)|5|0|

In this case, the class `1` values refer to the stars not having a planet whereas class `2` values refer to the stars having a planet. 

**Positive Outcome**

Detecting a star having a planet is the desired outcome (positive outcome). 
Thus, 
- positive outcome $\Rightarrow$ class `2`.

- negative outcome $\Rightarrow$ class `1`.

So, here the positive outcome is predicting that a star has a planet, i.e., predicting class `2`. Likewise, predicting that a star does not have a planet, i.e., predicting class `1`, is the *negative outcome*.

Observe the output of `confusion_matrix(y_test, y_predicted)` function.

```
array([[565,   0],
       [  5,   0]])
```




- `565` values are **True Negative (TN)** values because they actually belong to class `1` and are **correctly** predicted as class `1` values.

||Predicted Class `1` (`y_predicted`)|Predicted Class `2` (`y_predicted`)|
|-|-|-|
|Actual Class `1` (`y_test`)|565 (TN)||
|Actual Class `2` (`y_test`)|||


- `5` values are **False Negative (FN)** values because they are **falsely** predicted as class `1` values. They should have been predicted as class `2` values.

||Predicted Class `1` (`y_predicted`)|Predicted Class `2` (`y_predicted`)|
|-|-|-|
|Actual Class `1` (`y_test`)|565 (TN)||
|Actual Class `2` (`y_test`)|5 (FN)||


- There are `0` **True Positive (TP)** values because none of the class `2` values are **correctly** predicted as class `2` values.

||Predicted Class `1` (`y_predicted`)|Predicted Class `2` (`y_predicted`)|
|-|-|-|
|Actual Class `1` (`y_test`)|565 (TN)||
|Actual Class `2` (`y_test`)|5 (FN)|0 (TP)|


- There are `0` **False Positive (FP)** values because none of the class `1` values are **falsely** predicted as class `2` values.

||Predicted Class `1` (`y_predicted`)|Predicted Class `2` (`y_predicted`)|
|-|-|-|
|Actual Class `1` (`y_test`)|565 (TN)|0 (FP)|
|Actual Class `2` (`y_test`)|5 (FN)|0 (TP)|
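The labelled table above can be unpacked into named counts. This sketch copies the matrix output as a plain nested list (an assumption made so the snippet runs without the dataset); it relies on the layout described above, where rows are actual classes and columns are predicted classes, both in the order `1`, `2`.

```python
# Output of confusion_matrix(y_test, y_predicted), copied as a nested list.
# Rows are actual classes (1, then 2); columns are predicted classes (1, then 2).
matrix = [[565, 0],
          [5, 0]]

# With class 2 as the positive class, reading row by row gives TN, FP, FN, TP.
(tn, fp), (fn, tp) = matrix

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# Sanity checks: each row sums to the number of actual stars in that class.
assert tn + fp == 565   # actual class 1
assert fn + tp == 5     # actual class 2
```

Which cell counts as "positive" depends on which class you treat as the desired outcome; here it is class `2`.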

---

#### Activity 2: Precision And Recall

A good prediction model produces a large number of true positive (TP) values and as few false positive (FP) and false negative (FN) values as possible.

**Precision:**

Based on the TP and FP values, we define a parameter called **precision**. 
It measures what fraction of the positive predictions made by the model are actually correct.

It is the ratio of the TP values to the sum of TP and FP values, i.e.,


$$\text{precision} = \frac{\text{TP}}{\text{TP + FP}}$$

Let us calculate the precision for the exam software:

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C17/whj-negative-positive-apt-c17-01.png" height=300/>

For the above matrix,
$$\text{precision} = \frac{\text{TP}}{\text{TP + FP}}=\frac{\text{65}}{\text{65 + 5}} \approx \text{0.929}$$

Ideally, the precision should be 1 for a good classifier model. In this case, it is about 0.929, which is quite good.

Similarly, let's calculate the precision for our Random Forest classifier model.
Currently, the model has given `0` TP values and `0` FP values. Therefore, the precision value is undefined because

$$\text{precision} = \frac{0}{0 + 0} = \text{undefined}$$

*In mathematics, division by 0 is undefined.*
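In code, the undefined case has to be handled explicitly, otherwise Python raises a `ZeroDivisionError`. A minimal sketch follows; the function name `precision` and the choice to return `None` for the undefined case are illustrative assumptions, not how any particular library behaves.

```python
def precision(tp, fp):
    """Return TP / (TP + FP), or None when no positive predictions were made."""
    if tp + fp == 0:
        return None  # 0 / 0: precision is undefined
    return tp / (tp + fp)

print(precision(65, 5))   # exam software
print(precision(0, 0))    # Random Forest model on the test set
```

Returning `None` keeps "undefined" visibly different from a genuine precision of `0`.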

**Recall:**

Based on the TP and FN values, we define another parameter called **recall**. It measures what fraction of the actual positive values the model manages to find.
It is the ratio of the TP values to the sum of TP and FN values, i.e.,

$$\text{recall} = \frac{\text{TP}}{\text{TP + FN}}$$

Let us calculate the recall for the exam software model:

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C17/whj-negative-positive-apt-c17-01.png" height=300/>

For the above matrix,
$$\text{recall} = \frac{\text{TP}}{\text{TP + FN}}=\frac{\text{65}}{\text{65 + 10}} \approx \text{0.867}$$

Ideally, the recall should be 1 for a good classifier model. In this case, it is about 0.867, which is pretty good.

Similarly, let's calculate the recall for our Random Forest classifier model.

Currently, the model gives `0` TP and `5` FN values. Hence, the recall value is 0 because

$$\text{recall} = \frac{0}{0+5} = \text{0}$$

Imagine that the prediction model labels every star as `2`, i.e., it predicts that every star has a planet. Then, the number of TP values will be the maximum, i.e., `5`, but the number of FP values will also be the maximum, i.e., `565`. In such a case, the precision value would be

$$\text{precision} = \frac{5}{5+565} = \frac{5}{570} \approx 0.009$$

which is very low.

Also, the model will give `0` FN values. Then, the recall value would be

$$\text{recall} = \frac{5}{5 + 0} = 1$$


So, even though the recall value would be equal to 1, the precision value would be close to 0. Hence, this would be a bad prediction model.


Evidently, there is a trade-off. Pushing the recall value up by predicting the positive class more freely tends to lower the precision value, and vice versa. Hence, we need to find an optimum point where both the precision and the recall values are acceptable.
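The two extreme models discussed above, one that labels every star `1` and one that labels every star `2`, can be compared side by side. This is a sketch under the test-set counts used in this class; `precision_recall` is a hypothetical helper that returns `None` for an undefined ratio.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); a ratio with a zero denominator is None."""
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall

# Test set: 5 stars with a planet (class 2), 565 without (class 1).
all_class_1 = precision_recall(tp=0, fp=0, fn=5)     # every star labelled 1
all_class_2 = precision_recall(tp=5, fp=565, fn=0)   # every star labelled 2

print("always predict 1:", all_class_1)
print("always predict 2:", all_class_2)
```

Neither extreme is useful: the first never finds a planet, and the second finds all of them only by flagging every star.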

---

#### Activity 3: The `f1-score`

To find an optimum point where both the precision and recall values are high, we calculate another parameter called the **f1-score**. It is the harmonic mean of the precision and recall values, i.e.,



$$\text{f1-score} = 2 \left( \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \right)$$

Let us calculate the f1-score for the exam software model:

<img src="https://curriculum.whitehatjr.com/APT+Asset/APT+C17/whj-negative-positive-apt-c17-01.png" height=300/>

For the above matrix,
$$\text{f1-score} = 2 \left( \frac{\text{0.929} \times \text{0.867}}{\text{0.929} + \text{0.867}} \right) \approx \text{0.897}$$

The f1-score will be high only when both precision and recall are high. In this case, it is about 0.897, which is a good f1-score.
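The calculation can be checked in a few lines. The sketch below uses the exam software counts (TP = 65, FP = 5, FN = 10) and compares the formula with `statistics.harmonic_mean` from the Python standard library; the lopsided pair at the end is a made-up illustration.

```python
from statistics import harmonic_mean

# Exam software counts.
tp, fp, fn = 65, 5, 10

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")

# The f1-score is the harmonic mean of precision and recall.
assert abs(f1 - harmonic_mean([precision, recall])) < 1e-12

# A lopsided pair: the harmonic mean stays near the smaller value,
# while the arithmetic mean would be about 0.5.
print(f"{harmonic_mean([1.0, 0.01]):.3f}")
```

This is why the harmonic mean is used: a model cannot get a high f1-score by excelling at only one of the two metrics.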

Similarly, let's calculate the f1-score for our Random Forest classifier model.

Based on the current predictions, the f1-score value is undefined because the precision value is undefined, even though the recall value is 0.

$$\text{f1-score} = 2 \left( \frac{\text{undefined} \times 0}{\text{undefined} + 0} \right) =  \text{undefined}$$

You can also get these values by calling a function called `classification_report()`. It takes two inputs: the actual target values and the predicted target values, i.e., `y_test` and `y_predicted`.

**Note:** You may get the following warning message after executing the code in the code cell below.

```
Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
```

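You can ignore the warning here. It appears because the model made no class `2` predictions, so the precision for that class is 0/0 and is reported as `0.0`.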

In [None]:
#  Print the 'precision', 'recall' and 'f1-score' values using the 'classification_report()' function.
print(classification_report(y_test, y_predicted))

              precision    recall  f1-score   support

           1       0.99      1.00      1.00       565
           2       0.00      0.00      0.00         5

    accuracy                           0.99       570
   macro avg       0.50      0.50      0.50       570
weighted avg       0.98      0.99      0.99       570





As you can see, the `precision` and `f1-score` values for class `2` are reported as `0.00` because they are actually undefined. The `recall` value for class `2` is genuinely `0.00` because none of the `5` stars having a planet were detected.

Ideally, the above values for class `2` should also be close to `1.00`. Only then can we say that our prediction model is satisfactory. This shows that accuracy alone cannot tell whether a prediction model is making useful predictions, especially when the classes are imbalanced.
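The `macro avg` and `weighted avg` rows of the report can be reproduced by hand, which shows why they differ so much. This is a sketch using the per-class values for this model, with the undefined class `2` precision replaced by `0.0` as in the report; `macro_avg` and `weighted_avg` are illustrative helper names.

```python
# Per-class values for the all-class-1 predictions on the test set.
# Precision for class 2 is undefined (0 / 0); the report substitutes 0.0.
support = {1: 565, 2: 5}
precision = {1: 565 / 570, 2: 0.0}
recall = {1: 565 / 565, 2: 0 / 5}

total = sum(support.values())

def macro_avg(metric):
    """Unweighted mean over classes: every class counts equally."""
    return sum(metric.values()) / len(metric)

def weighted_avg(metric):
    """Mean over classes weighted by support (number of actual samples)."""
    return sum(metric[c] * support[c] for c in metric) / total

print(f"macro avg:    precision={macro_avg(precision):.2f}, "
      f"recall={macro_avg(recall):.2f}")
print(f"weighted avg: precision={weighted_avg(precision):.2f}, "
      f"recall={weighted_avg(recall):.2f}")
```

The weighted average is dominated by the 565 class `1` stars, so it hides the failure on class `2`; the macro average treats both classes equally and drops to about 0.5.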

In the next class, we will try to improve the model so that we get the desired precision, recall and f1-score values for class `2`.



---


#### Activity 1: Google Coding Challenge

This coding question is taken from the Google Coding Challenge.

**Problem**

There are $N$ houses for sale. The $i^{\text{th}}$ house costs $A_i$ dollars to buy. You have a budget of $B$ dollars to spend. What is the maximum number of houses you can buy?

**Input**

The first line of the input gives the number of test cases, $T$. $T$ test cases follow. Each test case begins with a single line containing the two integers $N$ and $B$. The second line contains $N$ integers. The $i^{\text{th}}$ integer is $A_i$, the cost of the $i^{\text{th}}$ house.

**Output**

For each test case, output one line containing `Case #x: y`, where `x` is the test case number (starting from 1) and `y` is the maximum number of houses you can buy.

**Limits**

$1 ≤ T ≤ 100$

$1 ≤ B ≤ 10^5$

$1 ≤ A_i ≤ 1000$, for all $i$.


**Sample** 

**Input** 

```
3
4 100
20 90 40 90
4 50
30 30 10 10
3 300
999 999 999
```

**Output**

```
Case #1: 2
Case #2: 3
Case #3: 0
```

In Sample `Case #1`, you have a budget of `100` dollars. You can buy the first and third houses for `20 + 40 = 60` dollars.

In Sample `Case #2`, you have a budget of `50` dollars. You can buy the first, third and fourth houses for `30 + 10 + 10 = 50` dollars.

In Sample `Case #3`, you have a budget of `300` dollars. You cannot buy any houses. So the answer is `0`.

In [None]:
# Solution:
def max_affordable_houses(my_budget, house_prices):
    """Return the maximum number of houses that can be bought within the budget."""
    total_spent = 0
    count = 0
    # Greedy strategy: buy the cheapest houses first.
    for price in sorted(house_prices):
        if total_spent + price > my_budget:
            break  # Every remaining house costs at least as much.
        total_spent += price
        count += 1
    return count

# Read all the test cases from the standard input.
test_cases = []
num_test_cases = int(input())
for _ in range(num_test_cases):
    num_houses, my_budget = map(int, input().split())
    house_prices = [int(price) for price in input().split()]
    test_cases.append((my_budget, house_prices))

# Print the answer for each test case in the 'Case #x: y' format.
for case_number, (my_budget, house_prices) in enumerate(test_cases, start=1):
    print(f"Case #{case_number}: {max_affordable_houses(my_budget, house_prices)}")

---
