### 1.1. Rule Based Classification

A rule-based classifier uses a set of IF-THEN rules for classification. A rule can be expressed in the following form: IF condition THEN conclusion.

Consider a rule R1: IF age = youth AND student = yes THEN buy_computer = yes.

The IF part of the rule is called the rule antecedent (or condition), and the THEN part is called the rule consequent. The antecedent consists of one or more attribute tests combined with logical operators (AND, OR). The consequent contains the class prediction.
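As a minimal sketch, rule R1 can be written as a function that returns the predicted class when the antecedent holds and `None` when the rule does not fire. The function name `rule_r1` and the dictionary layout of a record are assumptions made for illustration.

```python
def rule_r1(record):
    """R1: IF age = youth AND student = yes THEN buy_computer = yes."""
    # Antecedent: two attribute tests joined by AND.
    if record.get("age") == "youth" and record.get("student") == "yes":
        return "yes"   # consequent: predicted value of buy_computer
    return None        # antecedent not satisfied, so the rule does not fire


# Hypothetical customer record.
customer = {"age": "youth", "student": "yes"}
print(rule_r1(customer))  # yes
```

Returning `None` instead of a class keeps "the rule does not apply" distinct from "the rule predicts no".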

#### 1.1.1. One Rule Classification

We can generate a large number of candidate rules from the data, measure the error rate of each, and keep the set of rules with the lowest error rates.

Here are a few rules.

    R1: (Give Birth = no) → Birds
    R2: (Can Fly = no) → Reptiles
    R3: (Give Birth = no) ∧ (Can Fly = yes) → Birds
    R4: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
    R5: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
    R6: (Give Birth = no) ∧ (Live in Water = yes) → Fishes

Observe the above set of rules. Rules R1 and R2 each test a single attribute, while R3 through R6 each test two.

Rule R1 has an error rate of 33% because it would classify a python as a bird. Rule R2 has an error rate of 25% because it would misclassify mammals and amphibians as reptiles. Rule R5 has an error rate of 50% because it would classify amphibians as reptiles. We can fix this by adding a condition to R5:
    
    R5: (Give Birth = no) ∧ (Can Fly = no) ∧ (Live in Water = no) → Reptiles

This will give an error rate of 0% with this sample.
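The effect of the extra condition can be sketched in code. The two-animal sample below is hypothetical (it is not the dataset behind the percentages quoted above), and the error rate is assumed to mean the fraction of covered instances that the rule labels incorrectly.

```python
def covers(conditions, record):
    """True if the record satisfies every attribute test in the rule."""
    return all(record.get(attr) == value for attr, value in conditions.items())


def error_rate(conditions, predicted_class, records):
    """Fraction of covered records whose true class differs from the prediction."""
    covered = [r for r in records if covers(conditions, r)]
    if not covered:
        return None  # the rule covers nothing, so no error rate is defined
    wrong = sum(1 for r in covered if r["class"] != predicted_class)
    return wrong / len(covered)


# Hypothetical sample, invented for illustration only.
sample = [
    {"name": "python", "give_birth": "no", "can_fly": "no",
     "live_in_water": "no", "class": "Reptiles"},
    {"name": "frog", "give_birth": "no", "can_fly": "no",
     "live_in_water": "sometimes", "class": "Amphibians"},
]

r5 = {"give_birth": "no", "can_fly": "no"}
r5_refined = {"give_birth": "no", "can_fly": "no", "live_in_water": "no"}

print(error_rate(r5, "Reptiles", sample))          # 0.5
print(error_rate(r5_refined, "Reptiles", sample))  # 0.0
```

The refined rule no longer covers the frog, so the only instance it still covers is classified correctly.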

Rules R3, R4, and R6, by contrast, have an error rate of 0%. Because they take multiple features into account, they can produce more accurate results. There is a drawback to using many features in one rule, however: the rule becomes more specific, so an incoming data point may not be covered by any rule, which makes the classifier unstable on unseen data.

For example, consider the leopard shark.

Even though the leopard shark is a fish, it gives birth to live young, so R6 does not apply to it. None of the multi-feature rules R3 to R6 covers it, and therefore it cannot be classified.
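A short sketch of this coverage gap follows. It uses the zero-error rules (R3, R4, R6 and the refined R5); the snake_case attribute names and the attribute values given to the leopard shark are assumptions, since the text does not list them.

```python
# (name, conditions, predicted class) for the zero-error rules in the text.
RULES = [
    ("R3", {"give_birth": "no", "can_fly": "yes"}, "Birds"),
    ("R4", {"give_birth": "yes", "blood_type": "warm"}, "Mammals"),
    ("R5", {"give_birth": "no", "can_fly": "no", "live_in_water": "no"}, "Reptiles"),
    ("R6", {"give_birth": "no", "live_in_water": "yes"}, "Fishes"),
]


def classify(record, rules=RULES):
    """Return the class of the first rule that fires, or None if no rule covers the record."""
    for _name, conditions, label in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return label
    return None


# Assumed attribute values for a leopard shark.
leopard_shark = {"give_birth": "yes", "can_fly": "no",
                 "live_in_water": "yes", "blood_type": "cold"}

print(classify(leopard_shark))  # None
```

Every rule fails on at least one test (`give_birth` for R3, R5 and R6, `blood_type` for R4), so the instance is left unclassified.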

#### 1.1.2. Rule Extraction
Extracting rules from a decision tree is straightforward. Each leaf node holds a class label, which becomes the consequent of a rule. Each path from the root to a leaf forms one rule: the attribute tests along the path make up its antecedent (joined with AND), and the class stored in the leaf is its consequent.
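The path-to-rule idea can be sketched with a small recursive walk. The nested-dictionary tree format and the tree itself are hypothetical, built on the buy_computer example from earlier.

```python
def extract_rules(node, path=()):
    """Walk every root-to-leaf path and return (conditions, class) pairs."""
    if not isinstance(node, dict):           # leaf: holds the class label
        return [(list(path), node)]
    rules = []
    attribute = node["attribute"]
    for value, child in node["branches"].items():
        # Each branch adds one attribute test to the antecedent.
        rules.extend(extract_rules(child, path + ((attribute, value),)))
    return rules


# Hypothetical decision tree for illustration.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {"attribute": "student", "branches": {"yes": "yes", "no": "no"}},
        "senior": "no",
    },
}

for conditions, label in extract_rules(tree):
    antecedent = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    print(f"IF {antecedent} THEN buy_computer = {label}")
# IF age = youth AND student = yes THEN buy_computer = yes
# IF age = youth AND student = no THEN buy_computer = no
# IF age = senior THEN buy_computer = no
```

The tree has three leaves, so the walk yields exactly three rules.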


#### 1.1.3. Strategies for Learning Rules
##### General-to-Specific 
Start with an empty rule. Add constraints to eliminate negative examples. Stop when only positive examples are covered.
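One way to sketch this strategy is a greedy loop that repeatedly adds the attribute test with the highest accuracy on the currently covered examples. The scoring heuristic, the function names, and the toy data are all assumptions; real rule learners typically use more refined measures.

```python
def covers(rule, record):
    return all(record[attr] == value for attr, value in rule.items())


def learn_general_to_specific(records, target):
    """Greedily add attribute tests until no negative example is covered."""
    rule = {}                 # empty rule: covers every record
    covered = list(records)
    while any(r["class"] != target for r in covered):
        # Candidate tests come from positive examples that are still covered.
        candidates = {(a, r[a]) for r in covered if r["class"] == target
                      for a in r if a != "class" and a not in rule}
        best, best_score = None, -1.0
        for attr, value in sorted(candidates):
            subset = [r for r in covered if r[attr] == value]
            score = sum(r["class"] == target for r in subset) / len(subset)
            if score > best_score:
                best, best_score = (attr, value), score
        if best is None:
            break             # no test left to add
        rule[best[0]] = best[1]
        covered = [r for r in covered if covers(rule, r)]
    return rule


# Hypothetical toy data.
data = [
    {"give_birth": "no", "can_fly": "yes", "class": "Birds"},
    {"give_birth": "no", "can_fly": "yes", "class": "Birds"},
    {"give_birth": "no", "can_fly": "no", "class": "Reptiles"},
    {"give_birth": "yes", "can_fly": "yes", "class": "Mammals"},
    {"give_birth": "yes", "can_fly": "no", "class": "Mammals"},
]

print(learn_general_to_specific(data, "Birds"))
# {'can_fly': 'yes', 'give_birth': 'no'}
```

On this data the loop needs both tests before every covered example is a bird.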

##### Specific-to-General 
Start with a rule that identifies a single random instance. Remove constraints to cover more positive examples. Stop when further generalization starts covering negatives.
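A rough sketch of this direction is given below. It takes the seed instance as an argument instead of drawing it at random, and it tries dropping each constraint once in a fixed order; both are simplifications chosen to keep the example deterministic. The names and toy data are hypothetical.

```python
def covers(rule, record):
    return all(record[attr] == value for attr, value in rule.items())


def learn_specific_to_general(seed, records, target):
    """Drop constraints from the seed's full description while no negative is covered."""
    # Most specific rule: one test per attribute of the seed instance.
    rule = {attr: value for attr, value in seed.items() if attr != "class"}
    negatives = [r for r in records if r["class"] != target]
    for attr in sorted(rule):
        candidate = {a: v for a, v in rule.items() if a != attr}
        if not any(covers(candidate, n) for n in negatives):
            rule = candidate   # generalization is safe, keep it
    return rule


# Hypothetical toy data.
data = [
    {"give_birth": "no", "can_fly": "yes", "live_in_water": "no", "class": "Birds"},
    {"give_birth": "no", "can_fly": "no", "live_in_water": "no", "class": "Reptiles"},
    {"give_birth": "yes", "can_fly": "yes", "live_in_water": "no", "class": "Mammals"},
]

print(learn_specific_to_general(data[0], data, "Birds"))
# {'give_birth': 'no', 'can_fly': 'yes'}
```

Only the `live_in_water` test can be removed here; dropping either of the other two would let the rule cover a negative example.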

#### 1.1.4. Rule Pruning
Rule quality is first assessed on the original training data. A rule may perform well on the training data but less well on subsequent data, which is why rule pruning is required. A rule is pruned by removing a conjunct (attribute test) from its antecedent. We keep the pruned version of R if it has higher quality when assessed on an independent set of tuples (a pruning or validation set).

FOIL offers one of the simplest effective methods for rule pruning. For a given rule R,

$$\text{FOIL\_Prune}(R) = \frac{pos - neg}{pos + neg}$$

where pos and neg are the numbers of positive and negative tuples covered by R, respectively. Positive tuples are those that R predicts correctly, and negative tuples are those it predicts incorrectly. This value increases with the accuracy of R on the pruning set. Hence, if the FOIL_Prune value is higher for the pruned version of R, we prune R.
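The pruning decision can be sketched as a direct computation. The function name `foil_prune` and the counts below are hypothetical; the formula is the one given above, with the numerator and denominator each grouped before dividing.

```python
def foil_prune(pos, neg):
    """FOIL_Prune = (pos - neg) / (pos + neg), measured on the pruning set."""
    if pos + neg == 0:
        raise ValueError("rule covers no tuples in the pruning set")
    return (pos - neg) / (pos + neg)


# Hypothetical counts on a pruning set.
original = foil_prune(pos=8, neg=4)    # (8 - 4) / 12
pruned = foil_prune(pos=12, neg=3)     # (12 - 3) / 15

keep_pruned = pruned > original
print(round(original, 3), round(pruned, 3), keep_pruned)  # 0.333 0.6 True
```

Here the pruned version scores higher, so it would replace the original rule.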

#### 1.1.5. Pros and Cons

The main advantage is ease of interpretation, as long as there aren’t too many rules: a human can understand how the model makes predictions and whether they make sense. For a specific instance, it is possible to verify that the process worked correctly and to see which factors drove the prediction.

The main disadvantage is that rule-based methods are usually not the best performers in terms of prediction quality; other methods (forests, SVMs, deep nets) tend to do better. Rule-based methods are also best suited to data with categorical features. Finally, covering all the possibilities is very difficult, especially when the data has a large number of features.

#### Questionnaire

**1. Make a rule based classification for the following data**

[Solutions](https://github.com/ebi-byte/kt/blob/master/supervised_ML/RBC%20Solution.ipynb)