# Classification

Classification models predict discrete outcomes `(y)` from independent variables `(x)`.
The dependent variable is always a **class** or a **category**.

For instance:

## Fruit classification

| Fruit              | Color (x1) | Weight (g) (x2) | Taste (x3) | Edible (y) | Predicted Edible (ŷ) |
|--------------------|------------|-----------------|------------|------------|----------------------|
| Apple              | Red        | 150             | Sweet      | Yes        | Yes                  |
| Banana             | Yellow     | 120             | Sweet      | Yes        | Yes                  |
| Lemon              | Green      | 80              | Sour       | Yes        | No                   |
| Orange             | Orange     | 130             | Sweet      | Yes        | Yes                  |
| Grape              | Purple     | 5               | Sweet      | Yes        | Yes                  |
| Tomato             | Red        | 100             | Umami      | Yes        | No                   |
| Belladonna         | Black      | 5               | Sweet      | No         | No                   |
| Amanita Mushroom   | White      | 30              | Bitter     | No         | Yes                  |
| Kiwi               | Brown      | 75              | Sweet      | Yes        | Yes                  |
| Papaya             | Orange     | 500             | Sweet      | Yes        | Yes                  |
| Pear               | Green      | 180             | Sweet      | Yes        | No                   |
| Strychnine Nut     | Brown      | 10              | Bitter     | No         | No                   |
| Blackberry         | Black      | 5               | Sweet      | Yes        | Yes                  |

## Classification metrics

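There are many ways to evaluate a classification algorithm. The metrics below are computed from the fruit table above, treating `Edible = Yes` as the positive class.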

### Accuracy

Accuracy is the fraction of all predictions that the model gets right. In the fruit table, 9 of the 13 predictions match the actual label.
<center>

`number of correct predictions / total number of observations`

`9 / 13 ≈ 0.69`
</center>
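The accuracy calculation can be sketched in plain Python. The lists `y_true` and `y_pred` and the helper `accuracy` are illustrative names, not part of the original notes; they encode the fruit table row by row, with 1 for edible and 0 for not edible.

```python
# Labels from the fruit table, in row order (1 = edible, 0 = not edible).
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1]  # Edible (y)
y_pred = [1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1]  # Predicted Edible (ŷ)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


print(round(accuracy(y_true, y_pred), 2))  # 0.69
```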


### Precision

Precision measures how trustworthy the model's positive predictions are. It is defined as the ratio of true positive predictions to the total number of positive predictions (true positives plus false positives).

#### Interpretation

- **True Positives (TP)**: The number of correctly predicted positive instances.
- **False Positives (FP)**: The number of instances incorrectly predicted as positive.

High precision indicates a low number of false positives: when the model predicts the positive class, it is usually right. In the fruit table, the Amanita mushroom is the only false positive.

<center>

`true positives / (true positives + false positives)`

`7 / (7 + 1) = 0.875`
</center>
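As a rough sketch, precision can be computed by counting true and false positives directly. The helper `precision` and the label lists are hypothetical names introduced here; the data is the fruit table with 1 for edible. Returning 0.0 when the model makes no positive predictions is an assumed convention, not something the notes specify.

```python
# Labels from the fruit table, in row order (1 = edible, 0 = not edible).
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1]
y_pred = [1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1]


def precision(y_true, y_pred):
    """TP / (TP + FP), with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Assumed convention: no positive predictions at all gives 0.0.
    return tp / (tp + fp) if (tp + fp) else 0.0


print(precision(y_true, y_pred))  # 0.875
```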


### Recall

Recall measures how many of the actual positive instances the model manages to find. It is defined as the ratio of true positive predictions to the total number of actual positive instances (true positives plus false negatives).

#### Interpretation

- **True Positives (TP)**: The number of correctly predicted positive instances.
- **False Negatives (FN)**: The number of positive instances incorrectly predicted as negative.

High recall indicates a low number of false negatives, which means the model is good at identifying positive instances.

<center>

`true positives / (true positives + false negatives)`

`7 / (7 + 3) = 0.7`
</center>
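Recall can be sketched the same way, this time counting the positives the model missed. The helper `recall` and the label lists are illustrative names; the data is again the fruit table with 1 for edible. The three false negatives are Lemon, Tomato, and Pear.

```python
# Labels from the fruit table, in row order (1 = edible, 0 = not edible).
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1]
y_pred = [1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1]


def recall(y_true, y_pred):
    """TP / (TP + FN), with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: no actual positives at all gives 0.0.
    return tp / (tp + fn) if (tp + fn) else 0.0


print(recall(y_true, y_pred))  # 0.7
```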

## Mistakes 

### Underfitting

Underfitting occurs when a model is too simple to capture the complexity of the data, resulting in poor performance on both training and test data.

### Overfitting

Overfitting happens when a model is overly complex and fits too closely to the details and noise in the training data, leading to excellent performance on training data but poor performance on test data.
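Both failure modes can be illustrated with a tiny hand-written k-nearest-neighbors classifier. This is a toy sketch, not a benchmark: the data, the deliberately flipped labels, and the helpers `knn_predict` and `accuracy` are all made up for illustration. With `k=1` the model memorizes the noisy training labels (overfitting); with `k` equal to the training set size it always predicts the majority class (underfitting).

```python
# Toy 1-D data: the true rule is "label 1 when x >= 5.5".
# Two training labels (x=3 and x=8) are deliberately flipped to act as noise.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 0), (6, 1),
         (7, 1), (8, 0), (9, 1), (10, 1), (11, 1)]
test = [(1.4, 0), (2.6, 0), (3.4, 0), (4.6, 0),
        (6.4, 1), (7.6, 1), (8.4, 1), (9.6, 1)]


def knn_predict(x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes_for_one = sum(label for _, label in nearest)
    return 1 if 2 * votes_for_one > k else 0


def accuracy(data, k):
    """Fraction of points in data that k-NN labels correctly."""
    return sum(knn_predict(x, k) == label for x, label in data) / len(data)


# Columns: k, training accuracy, test accuracy
for k in (1, 3, len(train)):
    print(k, round(accuracy(train, k), 2), round(accuracy(test, k), 2))
# 1 1.0 0.5    -> overfits: perfect on training data, poor on test data
# 3 0.82 1.0   -> smooths over the two noisy labels
# 11 0.55 0.5  -> underfits: always predicts the majority class
```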

## Train and test split

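```python
# Import the modules
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# churn_df is assumed to be an already-loaded pandas DataFrame with a "churn" column
X = churn_df.drop("churn", axis=1).values
y = churn_df["churn"].values

# Split into training (80%) and test (20%) sets; stratify=y keeps the class ratio in both
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Create a k-nearest neighbors classifier that votes among the 5 closest points
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy on the held-out test set
print(knn.score(X_test, y_test))
```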