# AI Project 4
## Arman Rostami

In this project, several machine learning algorithms, including Decision Trees and Random Forests, are studied and implemented. The goal is to predict whether a patient has heart disease, given a set of features.

## Initializations

### imports

In [1]:
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.metrics import accuracy_score

### Useful Functions

The following function prints the accuracy of the given predictions against the correct labels.

In [2]:
def print_accuracy(data_labels, predict_labels):
    accuracy = accuracy_score(data_labels, predict_labels)
    print("Accuracy: %s" % accuracy)

### Reading data

In [3]:
data = pd.read_csv("data.csv")
data

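| Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 | 0 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 | 0 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 | 0 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 | 0 |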


### Splitting into train and test data

The data is split into training and test sets. Each row is assigned to the training set with probability 0.8, so roughly 80% of the rows are used for training and the rest for testing.

In [4]:
train_bools = np.random.rand(len(data)) < 0.8
train_data = data[train_bools]
test_data = data[~train_bools]
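
The split above changes on every run because no seed is fixed. For a repeatable split, one option is sklearn's `train_test_split` with a fixed `random_state`. The sketch below uses a small synthetic frame (`toy`, with invented values) as a stand-in for `data.csv`; the column names and the seed are illustrative assumptions, not part of the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart-disease table (hypothetical values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(30, 80, size=100),
    "chol": rng.integers(120, 400, size=100),
    "target": rng.integers(0, 2, size=100),
})

# test_size=0.2 holds out 20% of the rows; the fixed seed makes the
# split repeatable across runs.
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

print(len(toy_train), len(toy_test))
```

Unlike the random mask, this approach fixes the size of the test set rather than leaving it to chance.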

## 1) Decision Tree

Decision Trees are one of the simplest and yet most successful forms of machine learning algorithms. A decision tree represents a function that takes as input a vector of attribute values and
returns a “decision”—a single output value. The input and output values can be discrete or
continuous.

A decision tree reaches its decision by performing a sequence of tests. Each internal
node in the tree corresponds to a test of the value of one of the input attributes, $A_i$ , and
the branches from the node are labeled with the possible values of the attribute, $A_i = v_{ik}$ .
Each leaf node in the tree specifies a value to be returned by the function.
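
To make the idea of "a sequence of tests" concrete, here is a minimal hand-written sketch of a tree stored as nested dictionaries. The attributes, thresholds, and leaf labels are invented for illustration and are not learned from the project's data; the binary threshold tests are an assumption chosen to resemble how numeric attributes are usually split.

```python
# Internal nodes test one attribute against a threshold; leaves hold the decision.
# The structure below is hypothetical and only illustrates the traversal.
toy_tree = {
    "attribute": "ca",
    "threshold": 0.5,
    "left": {"leaf": 1},                      # ca <= 0.5
    "right": {                                # ca > 0.5
        "attribute": "thalach",
        "threshold": 150,
        "left": {"leaf": 0},                  # thalach <= 150
        "right": {"leaf": 1},                 # thalach > 150
    },
}


def classify(node, example):
    """Follow one branch per test until a leaf is reached."""
    while "leaf" not in node:
        goes_left = example[node["attribute"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["leaf"]


print(classify(toy_tree, {"ca": 0, "thalach": 120}))
print(classify(toy_tree, {"ca": 2, "thalach": 120}))
```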

The following sections show an implementation of a Decision Tree using Python's sklearn library.

### Implementation

#### Selecting Features

All columns except the target are used as features to build the decision tree. The list of features is shown below.

In [5]:
features = list(train_data.columns[:-1])
print("Features:", features, sep = "\n")

Features:
['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']


#### Using sklearn's DecisionTreeClassifier

sklearn's DecisionTreeClassifier is used to build a decision tree from the features and targets of the training data.

In [6]:
sample_features = train_data[features]
sample_targets = train_data['target']
decision_tree = tree.DecisionTreeClassifier()
decision_tree = decision_tree.fit(sample_features, sample_targets)

#### Results

In [7]:
predicts = decision_tree.predict(test_data[features])

In [8]:
print_accuracy(test_data['target'], predicts)

Accuracy: 0.7627118644067796


## 2) Random Forest

### Bootstrapping

Bootstrapping is a powerful statistical method for estimating a quantity from a data sample. It can be used to estimate summary statistics such as the mean or standard deviation. New samples are created by randomly drawing, with replacement, from the known observations; because of the replacement, each new sample is unlikely to be identical to the original one. The quantity of interest is computed on every new sample, and the results are then combined (e.g. by averaging them). Combining estimates from many resamples reduces variance, since the result is less sensitive to small noise in the data.
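
As a small illustration of resampling with replacement, the sketch below estimates the standard error of a sample mean. The observations are synthetic, and the number of resamples (1000) is an arbitrary choice made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observations; any 1-D numeric sample would do.
observations = rng.normal(loc=50.0, scale=10.0, size=200)

n_resamples = 1000
boot_means = np.empty(n_resamples)
for b in range(n_resamples):
    # Draw a resample of the same size as the data, with replacement.
    resample = rng.choice(observations, size=len(observations), replace=True)
    boot_means[b] = resample.mean()

# The spread of the resampled means estimates the standard error of the mean.
print("sample mean:", observations.mean())
print("bootstrap standard error:", boot_means.std(ddof=1))
```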

### Bagging

Bootstrap Aggregation, or bagging, is a simple and very powerful ensemble method based on bootstrapping. It is used to improve the stability and accuracy of machine learning algorithms in statistical classification and regression. Bagging reduces variance and helps to avoid overfitting.

### Overfitting

Overfitting is a modeling error that occurs when a model fits the training data so closely that it becomes inaccurate when predicting outcomes for unseen data. Overfitting is common in Decision Trees, since an unrestricted Decision Tree keeps splitting until it fits all samples in the training set perfectly. Methods such as pruning and random forests can be used to overcome overfitting in decision trees.
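
The gap between training and test accuracy can be observed directly. The following sketch, on synthetic data with some label noise (not the project's dataset), compares an unrestricted tree with one limited by `max_depth`; the depth of 3 and the other parameter values are arbitrary assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y); values are illustrative only.
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# An unrestricted tree grows until every training sample is fitted.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Limiting the depth is one simple way to restrain the tree.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("unrestricted", full_tree), ("max_depth=3", shallow_tree)]:
    print(name,
          "train:", model.score(X_train, y_train),
          "test:", model.score(X_test, y_test))
```

The unrestricted tree fits the training set perfectly; whether the shallow tree does better on the test set depends on the data, which is why both scores are printed.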

### Random Forest

Random Forest is an ensemble method built on Decision Trees. As in bagging, several samples are generated from the training data and a Decision Tree is trained on each of them, with one difference: only a subset of the features is made available for splitting nodes in each Decision Tree. These features are selected randomly, and the number of features to select is a parameter of the algorithm.
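
For comparison, sklearn ships a ready-made `RandomForestClassifier`. The sketch below runs it on synthetic data with five trees and five candidate features, mirroring the counts used later in this project. Note one difference: sklearn draws a fresh random feature subset at every split, whereas the manual implementation in this project draws one subset per tree. The data and seed are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with 13 features, the same count as the project's feature list.
X, y = make_classification(n_samples=300, n_features=13, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# n_estimators: number of trees; max_features: features considered per split;
# bootstrap=True: each tree is trained on a sample drawn with replacement.
forest = RandomForestClassifier(n_estimators=5, max_features=5,
                                bootstrap=True, random_state=0)
forest.fit(X_train, y_train)

print("Accuracy:", forest.score(X_test, y_test))
```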

### Implementation

The following sections show the steps to implement the Random Forest algorithm.

#### 2-1) Generate New Samples

Five new samples of size 150 are drawn from the training data with replacement.

In [9]:
NEW_SAMPLES_COUNT = 5
NEW_SAMPLE_SIZE = 150

In [10]:
def generate_new_sample(data, sample_size):
    # Draw row positions with replacement, so a row may appear more than once.
    chosen_indices = np.random.choice(len(data), sample_size, replace=True)
    return data.iloc[chosen_indices]

In [11]:
new_samples = list()
for i in range(NEW_SAMPLES_COUNT):
    new_sample = generate_new_sample(train_data, NEW_SAMPLE_SIZE)
    new_samples.append(new_sample)

#### 2-2) Bagging

Using the samples from the previous section, the bagging method is implemented in the following code.

Creating a Decision Tree for each new sample set:

In [12]:
decision_trees = [tree.DecisionTreeClassifier() for i in range(NEW_SAMPLES_COUNT)]    

Training each Decision Tree on its corresponding sample set:

In [13]:
for i, sample in enumerate(new_samples):
    sample_features = sample[features]
    sample_targets = sample['target']
    decision_trees[i].fit(sample_features, sample_targets)

Gathering predictions of each Decision Tree:

In [14]:
bagging_predicts = list()
for decision_tree in decision_trees:
    predicts = decision_tree.predict(test_data[features])
    bagging_predicts.append(predicts)

A voter assigns a label to each test sample by selecting the label predicted by the majority of the Decision Trees:

In [15]:
final_predicts = list()
for i in range(len(test_data)):
    target_count = 0
    
    for predicts in bagging_predicts:
        target_count += predicts[i]
    
    target = int(target_count >= (len(bagging_predicts) / 2))
    final_predicts.append(target)
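
The same majority vote can also be written without explicit loops. The sketch below uses a small hand-made array of 0/1 predictions (`toy_predicts`, hypothetical values, one row per tree) instead of the real tree outputs.

```python
import numpy as np

# Hypothetical 0/1 predictions from 5 trees for 4 test rows (one row per tree).
toy_predicts = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
])

# Count the votes for label 1 per test row, then take the majority.
votes_for_one = toy_predicts.sum(axis=0)
majority = (votes_for_one >= toy_predicts.shape[0] / 2).astype(int)

print(majority)
```

With an odd number of trees no tie is possible; with an even number, the `>=` comparison resolves ties in favour of label 1, as in the loop version.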

Results:

In [16]:
print_accuracy(test_data['target'], final_predicts)

Accuracy: 0.8983050847457628


#### 2-3) Examine Features

Each feature is examined by removing it, training a Decision Tree classifier on the remaining features, and measuring the accuracy on the test data, to find out which removal hurts accuracy the least. The results depend on the data and on the random split. **In this example**, removing the age feature gives the highest accuracy (about 0.88), while removing ca gives the lowest (about 0.66).

In [17]:
for deleted_feature in features:
    available_features = [feature for feature in features if feature != deleted_feature]
    sample_features = train_data[available_features]
    sample_targets = train_data['target']
    
    decision_tree = tree.DecisionTreeClassifier()
    decision_tree = decision_tree.fit(sample_features, sample_targets)
    
    predicts = decision_tree.predict(test_data[available_features])
    print('Deleted Feature: %s' % deleted_feature)
    print_accuracy(test_data['target'], predicts)
    print('***')

Deleted Feature: age
Accuracy: 0.8813559322033898
***
Deleted Feature: sex
Accuracy: 0.8135593220338984
***
Deleted Feature: cp
Accuracy: 0.8305084745762712
***
Deleted Feature: trestbps
Accuracy: 0.7627118644067796
***
Deleted Feature: chol
Accuracy: 0.7627118644067796
***
Deleted Feature: fbs
Accuracy: 0.8305084745762712
***
Deleted Feature: restecg
Accuracy: 0.7796610169491526
***
Deleted Feature: thalach
Accuracy: 0.7796610169491526
***
Deleted Feature: exang
Accuracy: 0.7796610169491526
***
Deleted Feature: oldpeak
Accuracy: 0.7627118644067796
***
Deleted Feature: slope
Accuracy: 0.7966101694915254
***
Deleted Feature: ca
Accuracy: 0.6610169491525424
***
Deleted Feature: thal
Accuracy: 0.711864406779661
***
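
Since a single train/test split gives noisy accuracies, one way to make this comparison more stable is to average over several folds. The sketch below repeats the drop-one-feature idea with 5-fold cross-validation on synthetic data; the feature names (`f0` to `f5`), the fold count, and the data itself are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical feature names f0..f5.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
toy_features = [f"f{i}" for i in range(X.shape[1])]
toy = pd.DataFrame(X, columns=toy_features)

scores = {}
for dropped in toy_features:
    kept = [name for name in toy_features if name != dropped]
    model = DecisionTreeClassifier(random_state=0)
    # Mean accuracy over 5 folds when the feature `dropped` is left out.
    scores[dropped] = cross_val_score(model, toy[kept], y, cv=5).mean()

# Lowest score first: these are the features whose removal hurts the most.
for dropped, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"without {dropped}: mean CV accuracy {score:.3f}")
```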


#### 2-4) Decision Tree by Choosing Random Features

Five features are chosen at random and given to a Decision Tree.

In [18]:
RANDOM_FEATURES_COUNT = 5

available_features = np.random.choice(features, RANDOM_FEATURES_COUNT, replace=False)
sample_features = train_data[available_features]
sample_targets = train_data['target']

decision_tree = tree.DecisionTreeClassifier()
decision_tree = decision_tree.fit(sample_features, sample_targets)

predicts = decision_tree.predict(test_data[available_features])

print("Chosen Features:")
print(available_features)
print_accuracy(test_data['target'], predicts)

Chosen Features:
['age' 'slope' 'sex' 'thal' 'restecg']
Accuracy: 0.6610169491525424


#### 2-5) Random Forest Implementation

Using the previous sections, the Random Forest method is implemented.

Creating a Decision Tree for each new sample set:

In [19]:
decision_trees = [tree.DecisionTreeClassifier() for i in range(NEW_SAMPLES_COUNT)]

Creating a random feature set for each Decision Tree classifier:

In [20]:
random_forest_features = [np.random.choice(features, RANDOM_FEATURES_COUNT, replace=False) for i in range(NEW_SAMPLES_COUNT)]

Training each Decision Tree on its corresponding sample set and random features:

In [21]:
for i, sample in enumerate(new_samples):
    available_features = random_forest_features[i]
    sample_features = sample[available_features]
    sample_targets = sample['target']
    decision_trees[i].fit(sample_features, sample_targets)

Gathering predictions of each Decision Tree:

In [22]:
bagging_predicts = list()
for decision_tree, tree_features in zip(decision_trees, random_forest_features):
    predicts = decision_tree.predict(test_data[tree_features])
    bagging_predicts.append(predicts)

A voter assigns a label to each test sample by selecting the label predicted by the majority of the Decision Trees:

In [23]:
final_predicts = list()
for i in range(len(test_data)):
    target_count = 0
    
    for predicts in bagging_predicts:
        target_count += predicts[i]
    
    target = int(target_count >= (len(bagging_predicts) / 2))
    final_predicts.append(target)

Results:

In [24]:
print_accuracy(test_data['target'], final_predicts)

Accuracy: 0.847457627118644


## Conclusion

Based on the results of the previous sections, bagging reduces variance and overfitting: because each tree is trained on a sample drawn with replacement, every tree ignores some of the training samples, which helps the ensemble achieve better accuracy on a new test set. Random Forest further reduces overfitting by giving each tree a random subset of the features, so that the predictions of the individual trees are less correlated, which can lead to better accuracy. In this run, the single Decision Tree reached an accuracy of about 0.76, bagging about 0.90, and Random Forest about 0.85. Note also that the dataset is not perfect: there are cases in which a plain decision tree gives higher accuracy, and the Random Forest method may not always work well since it relies on randomly chosen features.