# 1) Machine Learning Library: scikit-learn

Scikit-learn is an open source machine learning library for the Python programming language. It is mainly written in python with the exception of few core algorithms which are written in Cython to achieve performances. It makes heavy use of the other python packages: NumPy, SciPy, and matplotlib.

[Scikit-learn GitHub Repo](https://github.com/scikit-learn/scikit-learn)

Features of Scikit-learn:

* Clean and Streamlined APIs explained with complete documentation and examples.
* Simple and efficient tools for data mining and data analysis. Once you understand the basic use of Scikit-Learn for one type of model, switching to a new model or algorithm is very straightforward.
* Easily availability and reusable in various contexts. Clean integration with other python APIs.
* Open source and commercially usable - BSD license.
* Provide a platform to design similar projects as Scikit-learn via [GitHub community](https://github.com/scikit-learn-contrib/scikit-learn-contrib)

**Why Scikit-learn is appropriate for this task ?**

* Very Efficient for supervised learning task like classification.
* Simple and handy implementation of many classification based algorithms.
* Implementation can easily be integrated with other python packages for better representation of the output. Viz. Pandas, Mathplotlib, NumPy
* Lucidly written estimator APIs with same import/instantiate/fit/predict pattern holds. Makes it easy to learn and build estimations. 

# 2) Data Preparation Step

* Reading data set (autoimmune.txt) using Python package: Pandas
* Using read_csv function. Columns are separated by tabs and rows are separated by newlines.

In [None]:
import pandas as pd

ds = pd.read_csv('data/autoimmune.txt', delimiter="\t",header=None)

* Apply transpose function to map attributes as columns and patient as rows

In [None]:
ds=ds.transpose()

* Assingning names to the columns as per the data set
* Re-numbering data set index from 1 (using numpy)

In [None]:
import numpy as np

ds.columns=['Age','Blood_Pressure','BMI','Plasma_level','Autoimmune_Disease','Adverse_events','Drug_in_serum','Liver_function','Activity_test','Secondary_test']
ds.index = np.arange(1, len(ds) + 1)

* Mapping data set into dependent attributes and class label attribute 

In [4]:
X = ds.drop('Autoimmune_Disease',axis=1)
y = ds['Autoimmune_Disease']

# 3) Classification Alogrithm 
### 1: C4.5 / ID3
A decision tree is a tree where each node specifies a criteria test of some hypothesis, and each branch descending from the node corresponds to one of the possible values resulting from that criteria. Each branch represents a decision and each leaf represents an outcome(categorical or continues value). It belongs to the family of supervised learning algorithms. It can be used for solving regression and classification problems. 

The main objective of decision tree algorithm is to build a hypothesis from the given training data and use these decision rules to predict class label in other cases. For predicting a class label, one start from the root of the tree. Then comparing the values of the root attribute with record’s attribute. On the basis of comparison, follow the branch corresponding to that value and jump to the next node. Continuing on the same approach, one recursively compare record’s attribute values with other internal nodes of the tree until a leaf node with the class label is found.

**Note: The central idea of ID3 algorithm is selecting which attribute to test at each node in the tree:** 

* There are different selection measure to identify the attribute which can be considered as the root note at each level.
    * Information gain - for categorical attributes
        * In order to define information gain precisely, we begin by defining a entropy - a measure of the uncertainty of a random variable.
    * Gini index - for continous attributes
    
    
* ID3 Search technique:
    * Greedy search
    * Guided by information gain
    
#### Features
* Simple to understand and makes some good interpretation.
* Handles irrelevant attributes (gain=0) and missing data.
* Faster execution time.

#### Disadvantages
* Divide data only along the axis.
* May not find the best tree owing to greedy nature.
* High probability of overfitting 

#### Major Problem
* Overfitting: It occurs when decision trees become more specific to training data and less bias towards testing data. As a result accuracy of prediction goes down. It generally happens when it builds many branches due to outliers and irregularities in data.

Two approaches to avoid overfitting:

1) Pre-Pruning
  * It stops the tree construction when the best measure of a threshold value is reached.
  
2) Post-Pruning (Sub Tree replacement pruning WF 6.1)
  * Post pruning is done once the tree starts suffering from overfitting. Cross-validation check is used to measure the performance of a pruning. If there is a decrease in a accuracy during such a validation check, the node is converted into a leaf.
        
### Applying ID3 on given Data Set
Estimating performace using 10-fold cross-validation technique.

* creating instance of Decision Tree classifier algorithm
    * Criterion: function to measure the quality of a split. (gini / entropy)
    * max_depth (pre-prune): The max_depth parameter denotes maximum depth of the tree

In [5]:
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier(criterion='entropy',random_state = 100,
 max_depth=3) #creating instance of ID3 Classifier with parameters

* Performing 10-fold cross-validation to estimate the likely future performance of classification model

In [6]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

#applying k-fold for 10 splits, shuffle=false
k_fold_classifn_model = KFold(n_splits=10, shuffle=False, random_state=None)

id3_accuracy_score=[] # list to store accuracy for all 10 combination of data sets.
for train_index, test_index in k_fold_classifn_model.split(X):
    X_train=X.iloc[train_index] #using python iloc to fetch tuples for given indexes.
    y_train=y.iloc[train_index]
    X_test=X.iloc[test_index]
    y_true=y.iloc[test_index]
    
    dtree.fit(X_train,y_train) # fit model to data set
    y_pred = dtree.predict(X_test) # predict for outcomes
    
    id3_accuracy_score.append(accuracy_score(y_true, y_pred)) #computing subset accuracy

* Visualizing Decision tree (using python visualization package: pydotplus)

<img src="data/scr3.png" style="width: 600px;float:left"/>

### 2: Naive Bayes Classification
It is a supervised learning classification algorithm based on Bayes's Theorem. It assumes that the presence of a particular attribute in a class is independent of any other feature of the class. It makes use of prior conditional probability and Bayes's rule to predict the outcome. Naive Bayes model is very effective for the large-scale data model.

**The Bayes Classifier**:
* Conceptually, we know how frequently some particular attribute X is observed, given a known outcome. Working on the same observation we can compute the reverse, to compute the chance of that outcome happening, given X.
* The class with the highest probability is the outcome of prediction.

<img src="data/scr2.png" style="width: 300px;float:left"/>

    P(Outcome given that we know some attributes X1,X2..) = P(X1,X2... given that we know the Outcome) times Prob(Outcome), scaled by the P(X1,X2...)

**Features**
* It pre-computes probabilities and likelihood of training samples. Hence, classification task becomes easy and efficient.
* It perform well in case of categorical data compared to numerical ones.
* Popular model for real time and multi class prediction tasks.
* Test Classification is the area where Naive Bayes is mostly used.
        
### Applying Naive Bayes Classifier on given Data Set
Estimating performace using 10-fold cross-validation technique.

* creating instance of Naive Bayes classifier for a Gaussian distribution

In [7]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB() #creating instance of Naive Bayes Classifier with default parameters

* Performing 10-fold cross-validation to estimate the likely future performance of classification model

In [8]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

#applying k-fold for 10 splits, shuffle=false
k_fold_classifn_model = KFold(n_splits=10, shuffle=False, random_state=None)

nb_accuracy_score=[] # list to store accuracy for all 10 combination of data sets.
for train_index, test_index in k_fold_classifn_model.split(X):
    X_train=X.iloc[train_index] #using python iloc to fetch tuples for given indexes.
    y_train=y.iloc[train_index]
    X_test=X.iloc[test_index]
    y_true=y.iloc[test_index]
    
    nb.fit(X_train,y_train) # fit model to data set
    y_pred = dtree.predict(X_test) # predict for outcomes
    
    nb_accuracy_score.append(accuracy_score(y_true, y_pred)) #computing subset accuracy

# 4) Conclusion

* In the report, we used 10-fold cross-validation to split the data into train and test dataset. We estimate the likely future performance of each classification model by calculating accuracy for each fold.
* To model decision tree classifier we used the information gain. We used a Gaussian instance for Naive Bayes classifier.

In [9]:
#Estimating the accuracy of machine learning model by averaging the 
#accuracies derived in all the k(=10) cases of cross validation.

print ("Accuracy of a ID3 model is "+str(sum(id3_accuracy_score)/len(id3_accuracy_score)))

print ("Accuracy of a Naive Bayes model is "+str(sum(nb_accuracy_score)/len(nb_accuracy_score)))

Accuracy of a ID3 model is 0.7716927453769559
Accuracy of a Naive Bayes model is 0.8060455192034139


<img src="data/scr6.png" style="width: 450px;float:left" /> <img src="data/scr7.png" style="width: 450px;float:left" />

**Verdict**

* On comparing the accuracy of ID3 and Naive Bayes classfiers, Naive Bayes emerges as the most efficient classification model for the given data set.

* There is not much difference in the accuracy between the two classifiers. Decision trees are flexible and easy to understand. They just need data in the tabular structure without any pre-processing design changes. But they tend to overfit the data and requires post-pruning. Whereas Bayes can perform quite well. It uses prior probabilities to compute class variable which makes classification task very straightforward. It doesn't overfit nearly as much so there is no need to prune or process the network.

# 4) References
* Tools for machine learning algorithms. Scikit-learn packages and dependencies. http://scikit-learn.org Accessed 7 Oct, 2018 
* Pydotplus library. https://pydotplus.readthedocs.io/ Accessed 7 Oct, 2018
* Pandas Library. http://pandas.pydata.org/pandas-docs/stable/ Accessed 7 Oct, 2018
* Scikit-learn setup. https://github.com/scikit-learn/scikit-learn Accessed 6 Oct, 2018
* Scikit-learn information. https://en.wikipedia.org/wiki/Scikit-learn Accessed 7 Oct, 2018
* Decision Tree: https://en.wikipedia.org/wiki/Decision_tree Accessed 7 Oct, 2018
* Chapter 3 and Chapter 6: Mitchell, Tom, 1997. Machine Learning. International ed. McGraw-Hill.
* Tom M. Mitchell, Machine Learning 10-701.http://www.cs.cmu.edu/~tom/10701_sp11/slides/NBayes-1-20-2011-ann.pdf Accessed 8 Oct, 2018