# CS 559 HW 4

## Question 1 [ 40 Points ]

**Support Vector Machines (SVMs)**

[25 points ] Download this dataset, split it into an 80% training set and a 20% test set, and implement the support vector machine algorithm from scratch using NumPy and pandas.

[10 points ] Report the accuracies for the train and test sets. Comment on whether your model has overfit.

[5 points] Compare your model's performance with that of the scikit-learn model. Comment on the difference in accuracy.


In [2]:
class SVM():
    def __init__(self):
        #Fill it in
        ...

    def fit(self):
        #Fill it in
        ...
  
    def predict(self):
        #Fill it in
        ...


In [None]:
model = SVM()
model.fit('Dataset')

Training accuracy : 

Test accuracy : 
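
The skeleton above is left unfilled, so here is a minimal sketch of one common from-scratch approach: a soft-margin linear SVM trained by subgradient descent on the hinge loss. The class name `LinearSVM`, its hyperparameters, and the toy two-blob data are all assumptions made for illustration; the assignment's dataset is not available here, and `fit` takes arrays rather than a dataset name.

```python
import numpy as np

class LinearSVM:
    """Soft-margin linear SVM trained by subgradient descent on the hinge loss."""

    def __init__(self, learning_rate=0.01, reg_strength=0.01, n_epochs=200, seed=0):
        self.learning_rate = learning_rate
        self.reg_strength = reg_strength
        self.n_epochs = n_epochs
        self.seed = seed
        self.w = None
        self.b = 0.0

    def fit(self, X, y):
        # y is expected to hold labels in {-1, +1}
        rng = np.random.default_rng(self.seed)
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for _ in range(self.n_epochs):
            for i in rng.permutation(n_samples):
                margin = y[i] * (X[i] @ self.w + self.b)
                # The regularizer always contributes to the subgradient
                grad_w = 2 * self.reg_strength * self.w
                if margin < 1:
                    # Inside the margin or misclassified: hinge term is active
                    grad_w = grad_w - y[i] * X[i]
                    self.b += self.learning_rate * y[i]
                self.w -= self.learning_rate * grad_w
        return self

    def predict(self, X):
        return np.where(X @ self.w + self.b >= 0, 1, -1)

# Toy stand-in data: two well-separated Gaussian blobs (illustrative only)
rng = np.random.default_rng(42)
X_toy = np.vstack([
    rng.normal(loc=2.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=-2.0, scale=0.5, size=(50, 2)),
])
y_toy = np.array([1] * 50 + [-1] * 50)

# 80% / 20% train-test split via a random permutation of row indices
order = rng.permutation(len(y_toy))
split_at = int(0.8 * len(y_toy))
train_idx, test_idx = order[:split_at], order[split_at:]

svm = LinearSVM().fit(X_toy[train_idx], y_toy[train_idx])
svm_train_acc = np.mean(svm.predict(X_toy[train_idx]) == y_toy[train_idx])
svm_test_acc = np.mean(svm.predict(X_toy[test_idx]) == y_toy[test_idx])
print(f"Training accuracy : {svm_train_acc:.4f}")
print(f"Test accuracy : {svm_test_acc:.4f}")
```

A large gap between the two printed accuracies would be the usual sign of overfitting; on real data the labels would first need to be mapped to -1 and +1.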

## Question 2 [ 40 Points ]

**Decision Trees**

a. [5 points] Complete the `test_split` function.

In [130]:
from sklearn.datasets import load_iris
import numpy as np

In [131]:
# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
	left, right = list(), list()
	for row in dataset:
		# rows with row[index] >= value go to the left group, all others to the right
		if row[index] >= value:
			left.append(row)
		else:  # row[index] < value
			right.append(row)
	return left, right

b. [5 points] Complete the `gini_index` function.

In [132]:
# Calculate the Gini index for a split dataset
def gini_index(groups, classes):
	"""
	Using the calculation approach described here:
	https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity
	"""
	# count all samples at split point
	n_instances = sum(len(group) for group in groups)

	# sum weighted Gini index for each group
	gini = 0.0
	for group in groups:
		size = len(group)
		if size == 0:  # protect against ZeroDivisionError
			continue
		labels = [row[-1] for row in group]
		# Gini impurity: 1 minus the sum of squared class proportions
		sum_sq = sum((labels.count(c) / size) ** 2 for c in classes)
		gini += (size / n_instances) * (1.0 - sum_sq)
	return gini
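
For reference, the weighted Gini index of a split, as defined on the Wikipedia page linked in the docstring, can be written as follows, where $n_g$ is the size of group $g$, $n$ is the total number of rows, and $p_{g,k}$ is the proportion of class $k$ within group $g$:

$$G_{\text{split}} = \sum_{g \in \{\text{left},\,\text{right}\}} \frac{n_g}{n}\left(1 - \sum_{k} p_{g,k}^{2}\right)$$

As a worked example, assume a split that puts 50 rows of a single class in one group and 100 rows, evenly divided between two other classes, in the other group:

$$\frac{50}{150}\left(1 - 1^{2}\right) + \frac{100}{150}\left(1 - 0.5^{2} - 0.5^{2}\right) = 0 + \frac{2}{3}\cdot\frac{1}{2} = \frac{1}{3}$$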


c. [5 points] Complete the `get_split` function.

In [133]:
def get_split(dataset):
	"""Find the best split point by iterating over every feature / value
	pair and keeping the one with the lowest weighted Gini index."""
	unique_labels = list(set(row[-1] for row in dataset))
	b_index, b_value, b_score, b_groups = None, None, float('inf'), None

	for feature_index in range(len(dataset[0])-1):
		for row in dataset:
			row_val = row[feature_index]
			groups = test_split(feature_index, row_val, dataset)
			gini = gini_index(groups, unique_labels)
			if gini < b_score:
				b_index, b_value, b_score, b_groups = (
					feature_index, row_val, gini, groups
				)

	return {'index': b_index, 'value': b_value, 'groups': b_groups}

In [134]:
# Create a terminal node value
def to_terminal(group):
	outcomes = [row[-1] for row in group]
	return max(set(outcomes), key=outcomes.count)

d. [15 points] Complete the `split` function.

In [147]:
# Create child splits for a node or make terminal
#Hint : Just call the to_terminal and get_split functions defined above. 

def split(node, max_depth, min_size, depth):
	left, right = node['groups']
	del(node['groups'])
 
	# check for a no split
	if not left or not right:
		# one side is empty, so label both children from all rows that reached this node
		node['left'] = node['right'] = to_terminal(left + right)
		return
	# check for max depth
	if depth >= max_depth:
		node['left'], node['right'] = (
			to_terminal(left), 
			to_terminal(right)
		)
		return

	# process left child
	if len(left) <= min_size:
		node['left'] = to_terminal(left)
	else:
		node['left'] = get_split(left)
		split(node['left'], max_depth, min_size, depth+1)
  
	# process right child
	if len(right) <= min_size:
		node['right'] = to_terminal(right)
	else:
		node['right'] = get_split(right)
		split(node['right'], max_depth, min_size, depth+1)

In [148]:
# Build a decision tree
def build_tree(train, max_depth, min_size):
	root = get_split(train)
	split(root, max_depth, min_size, 1)
	return root

e. [10 points] Print the tree. 

In [149]:
# Print a decision tree
def print_tree(node, depth=0):
	if isinstance(node, dict):
		print('%s[X%d >= %.3f]' % ((depth*' ', (node['index']+1), node['value'])))
		print_tree(node['left'], depth+1)
		print_tree(node['right'], depth+1)
	else:
		print('%s[%s]' % ((depth*' ', node)))

In [150]:
iris = load_iris()

X = np.array(iris.data)
y = np.array(iris.target).reshape(-1,1)

data = np.append(X,y,axis=1)

In [155]:
tree = build_tree(data, 1, 1)
print_tree(tree)

[X3 >= 3.000]
 [1.0]
 [0.0]
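
The notebook stops at printing the tree. As a sketch of how such a tree could be used for prediction, the hypothetical `predict` helper below walks the nested dictionaries down to a leaf. It assumes the same convention as `test_split`, where rows with `row[index] >= value` go to the left child, and it uses a hand-written stand-in for the depth-1 tree printed above rather than the output of `build_tree`.

```python
def predict(node, row):
    """Follow the splits down to a leaf and return its label."""
    # Assumption: mirrors test_split, where rows with
    # row[index] >= value are sent to the 'left' child.
    if row[node['index']] >= node['value']:
        child = node['left']
    else:
        child = node['right']
    # Internal nodes are dicts; leaves are plain class labels
    if isinstance(child, dict):
        return predict(child, row)
    return child

# Hand-written stand-in for the depth-1 tree (feature index 2 is X3)
stump = {'index': 2, 'value': 3.0, 'left': 1.0, 'right': 0.0}

pred_small = predict(stump, [5.1, 3.5, 1.4, 0.2])  # X3 = 1.4, below the threshold
pred_large = predict(stump, [7.0, 3.2, 4.7, 1.4])  # X3 = 4.7, at or above the threshold
print(pred_small, pred_large)
```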


## Question 3 [ 20 Points ]

**Random Forests and Boosting**


### Loading Data

In [13]:
import pandas as pd

In [14]:
col_names = [
    "buying", "maint", "doors",
    "persons", "lug_boot", "safety"
]

In [16]:
car_df = pd.read_csv(
    './car/car.data',
    sep=",", usecols=list(range(0, 6)), names=col_names,
)

In [17]:
car_df.head()

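|   | buying | maint | doors | persons | lug_boot | safety |
|---|--------|-------|-------|---------|----------|--------|
| 0 | vhigh | vhigh | 2 | 2 | small | low |
| 1 | vhigh | vhigh | 2 | 2 | small | med |
| 2 | vhigh | vhigh | 2 | 2 | small | high |
| 3 | vhigh | vhigh | 2 | 2 | med | low |
| 4 | vhigh | vhigh | 2 | 2 | med | med |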


In [18]:
features = col_names[:-1]
target = [col_names[-1]]

### Preprocessing & Splitting Data 

Because all of the features are *technically* categorical, I'll one-hot encode them:

In [20]:
from sklearn import metrics, model_selection
from sklearn.preprocessing import OneHotEncoder

In [21]:
encoded_data = dict()
enc = OneHotEncoder(handle_unknown='ignore')
for feat in features:
    input_feat = car_df[feat].values.reshape(-1, 1)
    encoded_feat = enc.fit_transform(input_feat).toarray()
    encoded_data[feat] = encoded_feat

X = np.column_stack(list(
    encoded_data.values()
))
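
As an aside, `OneHotEncoder` can also encode several columns in a single call, which would avoid the per-feature loop. The sketch below shows this on a toy DataFrame; `toy_df` and its values are illustrative stand-ins, not the car data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for car_df; column names and values are illustrative only
toy_df = pd.DataFrame({
    "buying": ["vhigh", "low", "vhigh"],
    "lug_boot": ["small", "med", "big"],
})

toy_enc = OneHotEncoder(handle_unknown="ignore")
# fit_transform returns a sparse matrix by default; toarray() makes it dense
X_toy_encoded = toy_enc.fit_transform(toy_df).toarray()

# One output column per (feature, category) pair
n_encoded_columns = sum(len(cats) for cats in toy_enc.categories_)
print(X_toy_encoded.shape, n_encoded_columns)
```

Each row of the result contains exactly one 1 per original feature, so the row sums equal the number of encoded features.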

For the target column, I'll use plain integer (label) encoding:

In [30]:
classes = car_df["safety"].unique().tolist()
convert_to_int = lambda label: classes.index(label)

In [36]:
y = car_df["safety"].transform(convert_to_int)

In [38]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state=42)

### Training and Comparing Ensembles

First model - Random Forest coming right up:

In [48]:
from sklearn.ensemble import RandomForestClassifier

In [56]:
rf_clf = RandomForestClassifier(
    n_estimators=100, max_depth=1, 
    random_state=0, oob_score=True).fit(X_train, y_train)

y_pred = rf_clf.predict(X_test)
test_accuracy = metrics.accuracy_score(y_test, y_pred)
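# Caution: `oob_score` (no trailing underscore) is the boolean flag passed to the
# constructor, so this line prints True * 100 = 100. The fitted out-of-bag
# accuracy is stored in `rf_clf.oob_score_`.
print(f"Random Forest Train Accuracy: {round(rf_clf.oob_score * 100, 4)}%")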
print(f"Random Forest Test Accuracy: {round(test_accuracy * 100, 4)}%")


Random Forest Train Accuracy: 100%
Random Forest Test Accuracy: 28.3237%
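
In scikit-learn, `oob_score` is the constructor flag, while the fitted out-of-bag estimate is exposed as `oob_score_` (with a trailing underscore). The sketch below illustrates the difference and shows one way to measure training accuracy directly. It uses the iris data purely as a stand-in, and all `_demo` names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data; not the car dataset used in this question
X_demo, y_demo = load_iris(return_X_y=True)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

rf_demo = RandomForestClassifier(
    n_estimators=100, random_state=0, oob_score=True).fit(X_demo_train, y_demo_train)

flag_demo = rf_demo.oob_score           # constructor argument: the boolean True
oob_accuracy_demo = rf_demo.oob_score_  # fitted out-of-bag estimate, between 0 and 1
train_accuracy_demo = accuracy_score(y_demo_train, rf_demo.predict(X_demo_train))
test_accuracy_demo = accuracy_score(y_demo_test, rf_demo.predict(X_demo_test))

print(f"oob_score flag: {flag_demo}")
print(f"Out-of-bag accuracy: {round(oob_accuracy_demo * 100, 4)}%")
print(f"Train accuracy: {round(train_accuracy_demo * 100, 4)}%")
print(f"Test accuracy: {round(test_accuracy_demo * 100, 4)}%")
```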


Second model: Gradient Boosting, anyone?

In [39]:
from sklearn.ensemble import GradientBoostingClassifier

In [63]:
grad_boost_clf = GradientBoostingClassifier(
    n_estimators=100, learning_rate=1.0,
    max_depth=1, random_state=0).fit(X_train, y_train)

# get preds so we can eval accuracy
y_pred = grad_boost_clf.predict(X_train)
train_accuracy = metrics.accuracy_score(y_train, y_pred)
print(f"Gradient Boosted Classifier Train Accuracy: {round(train_accuracy * 100, 4)}%")

y_pred = grad_boost_clf.predict(X_test)
test_accuracy = metrics.accuracy_score(y_test, y_pred)
print(f"Gradient Boosted Classifier Test Accuracy: {round(test_accuracy * 100, 4)}%")



Gradient Boosted Classifier Train Accuracy: 36.5412%
Gradient Boosted Classifier Test Accuracy: 20.5202%


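As shown above, the Random Forest classifier outperforms the Gradient Boosted classifier by about 7.8 percentage points on the test set (28.32% vs. 20.52%), with both given the same five one-hot encoded features and the same values for `n_estimators` and `max_depth`. Both test accuracies are nonetheless below the roughly 33% that guessing would achieve if the three `safety` levels are about equally frequent, so neither model appears to have learned a useful signal from these features.

The Gradient Boosted classifier's train accuracy (~36.54%) is higher than its test accuracy (~20.52%), which suggests it has started to overfit. The 100% reported for the Random Forest is not a measured training accuracy: the print statement uses `rf_clf.oob_score`, the boolean flag passed to the constructor (so `True * 100` gives 100), rather than the fitted `rf_clf.oob_score_` attribute or the accuracy of `rf_clf.predict(X_train)`. No conclusion about overfitting in the Random Forest can be drawn from that figure.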
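
One way to put accuracies like these in context is to compare them against a trivial baseline that ignores the features. The sketch below uses scikit-learn's `DummyClassifier` on synthetic data with three equally frequent classes; the data and the `_base` names are assumptions for illustration, not the car dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in: three equally frequent classes, features carry no signal
rng_base = np.random.default_rng(0)
y_base = np.repeat([0, 1, 2], 100)
X_base = rng_base.normal(size=(300, 5))

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_base, y_base)
baseline_accuracy = baseline.score(X_base, y_base)
print(f"Baseline accuracy: {round(baseline_accuracy * 100, 4)}%")
```

A model that scores at or below such a baseline on held-out data has not learned anything usable from its inputs.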