# **F1 Score**

### What is F1 Score?
Before defining the F1 score, let's briefly introduce the terms it builds on.
These terms are easiest to understand through a confusion matrix, so we first look at the four classification outcomes that make up that matrix.
<br><br>

#### **Classification Metrics**

When performing classification predictions, there are four types of outcomes that can occur.

* **True positives** are when you predict an observation belongs to a class and it actually does belong to that class.

* **True negatives** are when you predict an observation does not belong to a class and it actually does not belong to that class.

* **False positives** occur when you predict an observation belongs to a class when in reality it does not.

* **False negatives** occur when you predict an observation does not belong to a class when in fact it does.

These four outcomes are often plotted on a confusion matrix.  
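The four outcomes can also be counted directly from a list of actual and predicted labels. The snippet below is a minimal sketch: the label lists and the helper name `outcome` are made up for illustration, and `1` is assumed to be the positive class.

```python
from collections import Counter


def outcome(actual, predicted, positive=1):
    """Name the outcome of a single prediction for the chosen positive class."""
    if predicted == positive:
        return "TP" if actual == positive else "FP"
    return "FN" if actual == positive else "TN"


# Hypothetical labels, chosen only for illustration (1 = positive class).
actual_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 1, 0, 1, 0]

counts = Counter(outcome(a, p) for a, p in zip(actual_labels, predicted_labels))

# Print the counts in confusion-matrix layout:
# rows are the actual classes, columns are the predicted classes.
print("actual positive:", counts["TP"], "predicted positive,", counts["FN"], "predicted negative")
print("actual negative:", counts["FP"], "predicted positive,", counts["TN"], "predicted negative")
```

Every prediction falls into exactly one of the four cells, so the four counts always add up to the number of observations.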

#### **Important Terms**

Now that we have an idea of the classification outcomes that make up the confusion matrix, let's use the matrix to understand some terms.

![matrix](https://miro.medium.com/max/1050/1*OhEnS-T54Cz0YSTl_c3Dwg.jpeg)
* **Precision**: Let's understand precision with the help of its formula.
![pr](https://miro.medium.com/max/711/1*HGd3_eAJ3-PlDQvn-xDRdg.png)  

  Did you notice the denominator? It is the total number of predicted positives, so the formula can be rewritten as:
![nf](https://miro.medium.com/max/666/1*C3ctNdO0mde9fa1PFsCVqA.png)

  From this we can infer that precision tells us how precise our model is: out of all the observations predicted as positive, how many are actually positive.<br>

  Precision is a good measure to use when the cost of a False Positive is high. Take email spam detection, for instance. There, a false positive means that an email that is non-spam (actual negative) has been identified as spam (predicted positive). The email user might lose important emails if the precision of the spam detection model is not high.  

  
* **Recall**: Let's understand recall with the same logic.
![r](https://miro.medium.com/max/627/1*dXkDleGhA-jjZmZ1BlYKXg.png)
![r1](https://miro.medium.com/max/1050/1*BBhWQC-m0CLN4sVJ0h5fJQ.jpeg)
Recall measures how many of the Actual Positives our model captures by labeling them as Positive (True Positive). By the same reasoning, recall is the metric to use for selecting the best model when there is a high cost associated with a False Negative.<br>
Take fraud detection or sick patient detection, for instance. If a fraudulent transaction (Actual Positive) is predicted as non-fraudulent (Predicted Negative), the consequences can be very bad for the bank.<br>
Similarly, in sick patient detection, if a sick patient (Actual Positive) is tested and predicted as not sick (Predicted Negative), the cost of that False Negative will be extremely high if the sickness is contagious.<br>

  The following graphic does a phenomenal job visualizing the difference between precision and recall.
![rp](https://www.jeremyjordan.me/content/images/2017/07/Precisionrecall.svg.png)

* **Accuracy**: One of the more obvious metrics, accuracy measures the share of all cases that were identified correctly, and it is most useful when all the classes are equally important. In other words, accuracy tells us how often we can expect our machine learning model to predict an outcome correctly out of the total number of predictions it made. For example, assume you test your model on a dataset of 100 records and it predicts 90 of those instances correctly. The accuracy would then be 90/100 = 90%. That looks like a good result, but accuracy alone does not tell us what kinds of errors the model makes.  

![Aimage](https://miro.medium.com/max/1400/1*sVuthxNoz09nzzJTDN1rww.png)  

Accuracy is a useful metric only when the classes in your classification problem are roughly equally distributed. If you observe many more data points of one class than of another, accuracy is no longer a useful metric. Let's understand this with an example.

#### **Imbalanced data example**

Imagine you are working on the sales data of a website. You know that 99% of website visitors don’t buy and that only 1% of visitors buy something. You are building a classification model to predict which website visitors are buyers and which are just lookers.
Now imagine a model that doesn’t work very well. It predicts that 100% of your visitors are just lookers and that 0% of your visitors are buyers. It is clearly a very wrong and useless model.  
What would happen if we used the accuracy formula on this model? The model has predicted only 1% wrongly: all the buyers have been misclassified as lookers. The percentage of correct predictions is therefore 99%. The problem here is that an accuracy of 99% sounds like a great result, whereas the model performs very poorly. In conclusion: accuracy is not a good metric to use when you have class imbalance.
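The numbers in this example are easy to reproduce. The sketch below assumes a hypothetical sample of 1,000 visitors, 10 of whom are buyers (1%), and a model that labels everyone a looker. It prints the accuracy next to the recall for the buyer class.

```python
# Hypothetical data matching the example: 1,000 visitors, 1% of them buyers.
actual = [1] * 10 + [0] * 990   # 1 = buyer, 0 = looker
predicted = [0] * 1000          # the model calls every visitor a looker

# Accuracy: share of predictions that match the actual label.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Recall for the buyer class: share of actual buyers the model found.
true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = true_positives / sum(actual)

print(f"Accuracy: {accuracy:.2%}")   # 99.00%
print(f"Recall:   {recall:.2%}")     # 0.00%
```

The accuracy is high only because lookers dominate the data; the model does not find a single buyer.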

**And Finally**,

* **F1 Score**: The F-score, also called the F1-score, is a measure of a model’s accuracy on a dataset. It is used to evaluate binary classification systems, which classify examples into ‘positive’ or ‘negative’. It is a way of combining the precision and recall of the model, and it is defined as the harmonic mean of the model’s precision and recall.
<br>

  The F-score is commonly used for evaluating information retrieval systems such as search engines, and also for many kinds of machine learning models, in particular in natural language processing.  
It expresses the model's score as a function of its precision and recall. The F-score gives equal weight to Precision and Recall when measuring performance, which makes it an alternative to Accuracy (it doesn’t require us to know the total number of observations). It’s often used as a single value that provides high-level information about the model’s output quality. This makes it useful in scenarios where optimizing only precision or only recall would cause the other, and with it the overall model performance, to suffer.


### **F-Score Formula**

The standard F1-score is the harmonic mean of precision and recall. A perfect model has an F-score of 1.
![FScore](https://miro.medium.com/max/423/1*T6kVUKxG_Z4V5Fm1UXhEIw.png)
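In case the image does not load, the same formula can be written out as follows. The last form, in terms of raw counts, follows from substituting Precision = TP / (TP + FP) and Recall = TP / (TP + FN), where TP, FP and FN are the true positive, false positive and false negative counts introduced earlier.

$$
F_1 = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}}
    = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
$$

As a quick check with illustrative counts TP = 3, FP = 2 and FN = 0: precision is 3/5 = 0.6, recall is 3/3 = 1.0, and both forms give F1 = 6/8 = 0.75. Note that true negatives do not appear in the count form at all.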

### **Why F-Score?**

Why use the F-Score when we already have Accuracy? Accuracy is easy to explain and works well on balanced datasets. You should use it if both classes (positive and negative) are equally important. F1, on the other hand, is better suited to imbalanced classification, because the Harmonic Mean penalizes extreme values. If False Negatives and False Positives are equally important, F1 is the proper metric to use.
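To see how the harmonic mean penalizes extreme values, it helps to compare it with the ordinary arithmetic mean. The sketch below uses made-up precision and recall pairs, and the function names are chosen here for illustration. The zero check is an added assumption so that the function does not divide by zero when both inputs are 0.

```python
def arithmetic_mean(precision, recall):
    return (precision + recall) / 2


def harmonic_mean(precision, recall):
    # Same expression as the F1 formula; returns 0.0 when both inputs are 0
    # so that the division is never attempted with a zero denominator.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs, from balanced to very lopsided.
pairs = [(0.5, 0.5), (0.9, 0.1), (1.0, 0.01)]

for precision, recall in pairs:
    print(
        f"precision={precision:.2f} recall={recall:.2f} "
        f"arithmetic={arithmetic_mean(precision, recall):.3f} "
        f"harmonic(F1)={harmonic_mean(precision, recall):.3f}"
    )
```

When precision and recall are equal, both means agree. As the two values drift apart, the arithmetic mean stays around 0.5 while the harmonic mean drops towards the smaller of the two, which is why a model cannot get a high F1 by doing well on only one of them.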



### **Implementation**

Let's implement the F-Score to get better insight.



```python
# FORMULA
# F1 = 2 * (precision * recall) / (precision + recall)
```

> Here `y` holds the actual labels and `y_pred` holds the labels predicted by the model.

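```python
y = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1]


class Metrics:
    def __init__(self, y_true, y_pred):
        # Store the labels on the instance instead of relying on globals.
        self.y_true = y_true
        self.y_pred = y_pred
        self.tp = self.tn = self.fp = self.fn = 0
        self.precision = 0
        self.recall = 0

    def confusion_matrix(self):
        # Reset the counts so calling this method twice does not double count.
        self.tp = self.tn = self.fp = self.fn = 0
        for actual, predicted in zip(self.y_true, self.y_pred):
            if actual == 1 and predicted == 1:
                self.tp += 1
            elif actual == 0 and predicted == 0:
                self.tn += 1
            elif actual == 0 and predicted == 1:
                self.fp += 1
            elif actual == 1 and predicted == 0:
                self.fn += 1
        return self.tp, self.tn, self.fp, self.fn

    def precision_recall(self):
        # Call confusion_matrix() first so that the counts are populated.
        self.precision = self.tp / (self.tp + self.fp)
        self.recall = self.tp / (self.tp + self.fn)
        print('Precision : ', self.precision, '\nRecall : ', self.recall)

    def f1_score(self):
        # Call precision_recall() first so that both values are set.
        f1 = 2 * (self.precision * self.recall) / (self.precision + self.recall)
        print('F1 Score : ', f1)


model = Metrics(y, y_pred)
print(model.confusion_matrix())
model.precision_recall()
model.f1_score()
```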

```text
(3, 1, 2, 0)
Precision :  0.6 
Recall :  1.0
F1 Score :  0.7499999999999999
```


> As the output above shows, the numbers in the tuple are in the order (true positive, true negative, false positive, false negative).  
With the help of these values we have calculated precision, recall and finally the F1 score.

```python
from sklearn.metrics import f1_score
f1_score(y, y_pred)
```

```text
0.7499999999999999
```

> An F1 score of 0.75 is the harmonic mean of the precision (0.6) and the recall (1.0) computed above. It is not the share of correct predictions: the accuracy for this example is 4/6 ≈ 0.67.

## **References**

* https://towardsdatascience.com/the-f1-score-bec2bbc38aa6

* https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9

* https://en.wikipedia.org/wiki/F-score