# Classification 

- Classification problems are an important category of problems where the outcome variable takes discrete values.
- The primary objective is to predict the probability of an observation belonging to a class, known as the class probability.

## Overview of classification
- Classification problems with binary outcomes are called binary classification.
- Classification problems with multiple outcomes are called multinomial classification.
- Techniques used for solving classification problems:
    1. Logistic regression
    2. Classification trees
    3. Discriminant analysis
    4. Neural networks
    5. Support vector machines (SVM)

## Diabetes classification: A simple example
- Suppose we have the following patient data, which consists of a single feature (blood glucose level), and a class label 0 for non-diabetic and 1 for diabetic.
- Put the blood glucose level on the x-axis and the class label on the y-axis.
- When we fit a logistic model to these points and plot it, we get a sigmoid (S-shaped) curve that gives the predicted probability of class 1 at each glucose level.
- Here the line y = 0.5 is the decision threshold. It splits the sample space into two classes: points where the curve is at or above 0.5 are classified as diabetic, and points below it as non-diabetic.
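
The thresholding idea above can be sketched in a few lines of Python. The intercept and slope here are hypothetical values chosen only so that the curve crosses 0.5 at a glucose level of 120; they are not fitted to any real patient data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for illustration only (NOT fitted to real data).
INTERCEPT = -12.0
SLOPE = 0.1  # change in log-odds per unit of blood glucose

def predict_proba(glucose):
    """Predicted probability of class 1 (diabetic) for a glucose level."""
    return sigmoid(INTERCEPT + SLOPE * glucose)

def predict_class(glucose, threshold=0.5):
    """Apply the y = 0.5 decision threshold to the predicted probability."""
    return 1 if predict_proba(glucose) >= threshold else 0

for glucose in (80, 110, 130, 160):
    print(glucose, round(predict_proba(glucose), 3), predict_class(glucose))
```

Moving the threshold away from 0.5 trades false positives against false negatives, which is why the threshold is kept as a parameter in this sketch.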

## Diabetes classification
- Dataset: diabetes dataset
- The diabetes dataset used in this exercise is based on data originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases.

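- NOTE: in a notebook, you can download a dataset from another location on the internet using the `wget` command:
```python
import pandas as pd

# Download the training dataset (the leading "!" runs a shell command in a Jupyter/Colab notebook):
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/diabetes.csv
# Load the CSV into a DataFrame and preview the first rows
diabetes = pd.read_csv('diabetes.csv')
diabetes.head()
```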

- Identification of X and y: all attributes are features except "Diabetic".
- Here, "Diabetic" is the label.

## Creating features X and label y
```python
# Separate features and labels
features = ['Pregnancies', 'PlasmaGlucose', 'DiastolicBloodPressure', 'TricepsThickness',
            'SerumInsulin', 'BMI', 'DiabetesPedigree', 'Age']
label = 'Diabetic'
X, y = diabetes[features].values, diabetes[label].values
```

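- Split the dataset into training and testing sets.
```python
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
```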

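- Building the model
    - Regularization rate: controls how strongly large coefficients are penalized to reduce overfitting. It is a different hyperparameter from the learning rate, which is the step size of an iterative optimizer.
    - If the learning rate is too high, training can overshoot and we might not get an accurate result.
    - If the learning rate is too low, the learning process can take a long time.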
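
As a rough sketch of the model-building step, the block below trains a logistic regression classifier with scikit-learn. Everything specific in it is an assumption: the data is a synthetic stand-in generated by `make_classification` (so the block runs without `diabetes.csv`), and the value `reg = 0.01` and the `liblinear` solver are illustrative choices, not settings taken from the slides. In scikit-learn, `C` is the inverse of the regularization strength, hence `C=1 / reg`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 8 diabetes features and the binary label (assumption).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Assumed regularization rate; a larger value means a stronger penalty (smaller C).
reg = 0.01
model = LogisticRegression(C=1 / reg, solver="liblinear").fit(X_train, y_train)

# Predict class labels for the held-out rows and report the fraction predicted correctly.
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print("Test accuracy:", round(accuracy, 3))
```

With the real dataset, `X` and `y` would instead come from the feature and label columns of the DataFrame.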

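- Correct predictions
    - TN = True Negative (actual class 0, predicted class 0)
    - TP = True Positive (actual class 1, predicted class 1)
- Incorrect predictions
    - FN = False Negative (actual class 1, predicted class 0)
    - FP = False Positive (actual class 0, predicted class 1)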

- Recall measures how many of the actual positives the model found (the ones it missed are false negatives).
    - Recall: TP / (TP + FN) = of all the cases that are positive, how many did the model identify?
    - Recall is the **quantity** measure of true positives.
- Precision measures how many of the predicted positives are actually positive (the wrong ones are false positives).
    - Precision: TP / (TP + FP) = of all cases that the model predicted to be positive, how many actually are positive?
    - Precision is the **quality** measure of true positives.
- F1-Score: combines recall and precision as their harmonic mean: F1 = 2 * (Precision * Recall) / (Precision + Recall).
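
The formulas above can be checked by hand with a small sketch. The confusion-matrix counts below are hypothetical numbers picked for easy arithmetic, not results from the diabetes model.

```python
def precision(tp, fp):
    """Of all predicted positives, the fraction that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of all actual positives, the fraction the model identified."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical confusion-matrix counts (illustration only).
tp, fp, fn, tn = 30, 10, 20, 40

p = precision(tp, fp)   # 30 / 40
r = recall(tp, fn)      # 30 / 50
f1 = f1_score(p, r)

print("precision:", round(p, 3))
print("recall:", round(r, 3))
print("f1:", round(f1, 3))
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, so a model cannot score well by maximizing only one of them.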

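- Note that precision and accuracy are two different things.
- Precision (dart analogy) = the darts land close together at some common spot, which may or may not be near the bullseye.
- Accuracy (dart analogy) = the darts land close to the bullseye on average, but there might be some variance.
- For a classifier, Accuracy = (TP + TN) / (TP + TN + FP + FN), i.e. the fraction of all predictions that are correct, while precision looks only at the predicted positives.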

## Evaluate the trained model:
- Calculating the classification report to determine precision, recall, F1-score, etc.
```python
# Refer to the slides for the code.
# ...
```

- Calculating the overall precision and recall.

```python
# Refer to the slides for the code.
# ...
```
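
Since the slide code is not reproduced in these notes, here is a minimal sketch of how such an evaluation might look with `sklearn.metrics`. It is not the code from the slides. The `y_true` and `y_pred` lists are small hand-made examples (hypothetical); with a trained model they would be the test labels and the model's predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions (illustration only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)

# Per-class precision, recall and F1-score in one table.
print(classification_report(y_true, y_pred))

# Overall metrics for the positive class (label 1).
accuracy = accuracy_score(y_true, y_pred)
overall_precision = precision_score(y_true, y_pred)
overall_recall = recall_score(y_true, y_pred)
overall_f1 = f1_score(y_true, y_pred)
print("Accuracy:", accuracy)
print("Precision:", overall_precision)
print("Recall:", overall_recall)
print("F1:", overall_f1)
```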

- Therefore, there are 4 important metrics to analyze in classification task:
    1. Accuracy
    2. Precision
    3. Recall
    4. F1-Score

# Decision Trees

## Why decision tree?
- A decision tree can be used to classify data that is not linearly separable, i.e. data whose classes cannot be divided by a single straight line (or hyperplane).
- For linear separability: https://en.wikipedia.org/wiki/Linear_separability

## Classification using decision tree:
- Task 1: Selecting informative attributes.
- Task 2: Visualizing the segmentation.
- Task 3: Trees as a set of rules.

## Selecting Informative Attributes:
- Instructor displays a photo of stick figures of different body shapes (squares and circles) with yes or no above them.
- Attributes: 
    - head-shape: square, circular
    - body-shape: rectangular, oval
    - body-color: gray, white
- Target variable:
    - write-off: Yes, No

## Entropy
- Entropy measures how mixed (impure) a set of class labels is. For a binary classification problem it depends on the class probability as shown in the figure.
- Instructor draws a graph shaped like an upside-down parabola, such that y = 0 when x = 0 and x = 1, and y = 1 when x = 0.5.
- In this graph, the y-axis is E(S) and the x-axis is the probability p, from 0 to 1.
- E(S) = 1 at p = 0.5 (evenly mixed), and E(S) = 0 at p = 0 and p = 1 (pure).
- For binary classification: E(S) = -P(Y)*log2(P(Y)) - P(N)*log2(P(N))  [for c = 2 classes]
- For example: S : {4P, 2F}
    - E(S) = -(4/6)*log2(4/6) - (2/6)*log2(2/6) = 0.918
    - You can also use an online entropy calculator or a scientific calculator to do this calculation.
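
As an alternative to a calculator, the entropy formula can be written as a short Python function. This is a minimal sketch: the function name `entropy` and the convention of passing class counts as a list are choices made here for illustration.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a set described by its class counts."""
    total = sum(counts)
    result = 0.0
    for count in counts:
        if count > 0:  # a class with zero members contributes nothing
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(entropy([4, 2]), 3))  # S = {4P, 2F}
print(entropy([3, 3]))            # evenly mixed set
print(entropy([6, 0]))            # pure set
```

The three calls correspond to the worked example and to the two extremes of the curve: an evenly mixed set and a pure set.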

## Information Gain
- For more info see: https://en.wikipedia.org/wiki/Information_gain_(decision_tree)
- Look at the section "Another Take on Information Gain, with Example" for an easier-to-understand explanation.

## Example:
- A researcher is trying to identify the root node for a decision tree classifier that classifies students based on their scores: greater than or equal to 50% is a pass, and less than 50% is a fail. The data is given in the table below.
- Refer to the slides for the data table.


- Questions:
1. Find the entropy of the target (pass/fail) column.
    - Solution: S : {4P, 2F}
    - E(S) = -(4/6)*log2(4/6) - (2/6)*log2(2/6) = 0.918
    - You can also use an online entropy calculator or a scientific calculator to do this calculation.

2. Calculate the information gain (IG) for the parameter 'Attendance %'.
    - Solution:
    - Instructor draws a table for calculating information gain (LT70 = less than 70%, MT70 = more than 70%):
    - Attendance: LT70 LT70 LT70 LT70 MT70 MT70
    - Target C  : F    F    P    P    P    P
    - For the parent (root) node, S = {4P, 2F}. Splitting on attendance gives child 1 (LT70) with S1 = {2P, 2F} and child 2 (MT70) with S2 = {2P, 0F}.
    - E(S) is the entropy of the parent, E(S1) is the entropy of child 1, E(S2) is the entropy of child 2.
    - Therefore, the formula for Information Gain for parameter 'Attendance %' (A) is given by:
    - IG(A, E(S)) = E(S) - (Sv1/S)*E(S1) - (Sv2/S)*E(S2)
        - Here, S = no. of samples in the parent
        - Sv1 = no. of samples in child 1
        - Sv2 = no. of samples in child 2.
    - So, here, S = 6, Sv1 = 4, Sv2 = 2.
    - E(S) = -(4/6)*log2(4/6) - (2/6)*log2(2/6) = 0.918
    - E(S1) = -(2/4)*log2(2/4) - (2/4)*log2(2/4) = 1
    - E(S2) = -(2/2)*log2(2/2) - 0 = 0
    - Therefore, IG(A, E(S)) = E(S) - (Sv1/S)*E(S1) - (Sv2/S)*E(S2) = 0.918 - (4/6)*1 - (2/6)*0 = 0.25
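
The calculation above can be reproduced with a short sketch. The helper names `entropy` and `information_gain` are hypothetical; the attendance and target values are the six rows from the instructor's table.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(attribute_values, labels):
    """Parent entropy minus the size-weighted entropy of each child subset."""
    total = len(labels)
    weighted_child_entropy = 0.0
    for value in set(attribute_values):
        child = [lab for attr, lab in zip(attribute_values, labels) if attr == value]
        weighted_child_entropy += (len(child) / total) * entropy(child)
    return entropy(labels) - weighted_child_entropy

attendance = ["LT70", "LT70", "LT70", "LT70", "MT70", "MT70"]
target = ["F", "F", "P", "P", "P", "P"]

ig_attendance = information_gain(attendance, target)
print(round(ig_attendance, 2))
```

The unrounded value is about 0.2516, which matches the hand calculation of 0.25 to two decimal places.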


2a. Question: Calculate the IG value of "% of marks" as the selected attribute for the given problem.

- % M = 30 45 50 70 75 85
- T C = F  F  P  P  P  P
- S : {4P, 2F}
- S1 (marks < 50%): {2F, 0P}
- S2 (marks >= 50%): {4P, 0F}
- IG: 0.918 (both children are pure, so E(S1) = E(S2) = 0 and IG = E(S))
- Verify this as homework.
 
3. Which of the following can be accepted as the root node?
- % of marks
- % of attendance
- No. of assignments completed
- No. of hours studied.
    - Solution:
    - Instructor says that students must calculate the IG value for each attribute and select the one with the highest IG value.
    - Do this as homework.
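
The selection procedure can be sketched as "compute IG for every candidate attribute and take the maximum". Only the two attributes whose values appear in these notes are included; the columns for assignments completed and hours studied are on the slides, so they are left out here. The function names and the 50% split on marks are assumptions made for this illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(labels)
    children = {}
    for value, lab in zip(attribute_values, labels):
        children.setdefault(value, []).append(lab)
    weighted = sum(len(ch) / total * entropy(ch) for ch in children.values())
    return entropy(labels) - weighted

target = ["F", "F", "P", "P", "P", "P"]
marks = [30, 45, 50, 70, 75, 85]

# Candidate attributes as categorical splits (marks split at the 50% pass mark).
candidates = {
    "% of attendance": ["LT70", "LT70", "LT70", "LT70", "MT70", "MT70"],
    "% of marks": ["GE50" if m >= 50 else "LT50" for m in marks],
}

gains = {name: information_gain(values, target) for name, values in candidates.items()}
root = max(gains, key=gains.get)  # attribute with the highest IG

for name, gain in gains.items():
    print(name, round(gain, 3))
print("Root node:", root)
```

To finish the homework, the remaining two columns from the slides would be added to `candidates` in the same way.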



## Visualizing Segmentations: An Example.
- Instructor shows a decision tree of Balance vs age and probabilities of write-off or no write-off.

## Trees as a set of rules
- IF (Balance < 50K) AND (Age < 50) THEN Class = Write-off
- IF (Balance < 50K) AND (Age >= 50) THEN Class = No Write-off
- IF (Balance >= 50K) AND (Age < 45) THEN Class = Write-off
- IF (Balance >= 50K) AND (Age >= 45) THEN Class = No Write-off

- For the previous problem, assume that % of marks is selected as the root node based on its IG value, and write down the tree as a set of rules.
- Tree as a set of rules:
- IF (Marks >= 50%) THEN Class = Pass
- IF (Marks < 50%) THEN Class = Fail
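
Each path from the root to a leaf becomes one IF/THEN rule, so a tree translates directly into nested conditionals. The sketch below encodes both rule sets; the function names are hypothetical, and "50K" is assumed to mean 50,000 in whatever currency unit the slides use.

```python
def write_off_rule(balance, age):
    """Balance/age tree: the first split is on balance, the second on age."""
    if balance < 50_000:
        return "Write-off" if age < 50 else "No Write-off"
    return "Write-off" if age < 45 else "No Write-off"

def pass_fail_rule(marks_percent):
    """Single-split tree with % of marks as the root node."""
    return "Pass" if marks_percent >= 50 else "Fail"

print(write_off_rule(balance=30_000, age=40))
print(write_off_rule(balance=30_000, age=55))
print(write_off_rule(balance=80_000, age=40))
print(write_off_rule(balance=80_000, age=47))
print([pass_fail_rule(m) for m in (30, 45, 50, 70, 75, 85)])
```

The four `write_off_rule` calls exercise the four rules in order, and the last line applies the pass/fail rule to the six marks from the example.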

# Naive Bayes classifier

- Sigmoid function:
- P(Y) = 1/(1+e^(-x))
    - x = independent variable


- Different types of probabilities:
    - Marginal: P(A)
    - Joint: P(A ⋂ B), the probability that A and B both occur
    - Conditional: P(A|B)

## Formula for conditional probability:
- P(A|B) = P(A ⋂ B) / P(B), defined when P(B) > 0 - [A & B are dependent in nature; if they are independent, P(A|B) = P(A)]

## Bayes' Theorem
- P(A|B) = P(A and B) / P(B) - [1]
- P(B|A) = P(A and B) / P(A)
- P(A and B) = P(B|A) * P(A) - [substitute in eq. 1]
- P(A|B) = ( P(B|A) * P(A) ) / P(B)
- Bayes' Theorem: P(A|B) = ( P(B|A) * P(A) ) / P(B)
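
A small numeric sketch may help make the theorem concrete. All probabilities below are hypothetical values chosen for easy arithmetic: A stands for "patient is diabetic" and B for "test result is positive". P(B) is obtained with the law of total probability, which is an extra step not derived in these notes.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical probabilities (illustration only).
p_a = 0.3              # P(A): prior probability of being diabetic
p_b_given_a = 0.8      # P(B|A): positive test given diabetic
p_b_given_not_a = 0.1  # P(B|not A): positive test given non-diabetic

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

posterior = bayes_posterior(p_b_given_a, p_a, p_b)
print("P(B) =", round(p_b, 3))
print("P(A|B) =", round(posterior, 3))
```

The posterior P(A|B) is larger than the prior P(A) here because a positive test is much more likely for a diabetic patient than for a non-diabetic one under these assumed numbers.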