# Exercise Sheet 1: Python Basics

This first exercise sheet tests basic features of the Python programming language in the context of a simple prediction task. We consider the problem of predicting the health risk of subjects from their personal data and habits. For this task, we first use a decision tree

![](tree.png)

adapted from the webpage http://www.refactorthis.net/post/2013/04/10/Machine-Learning-tutorial-How-to-create-a-decision-tree-in-RapidMiner-using-the-Titanic-passenger-data-set.aspx. For this exercise sheet, you must use pure Python only and must not import any module, including numpy. Exercise sheet 2 will revisit the nearest neighbor part of this sheet with numpy.

## Classifying a single instance (15 P)

* Create a function that takes as input a tuple containing values for attributes (smoker,age,diet), and computes the output of the decision tree.
* Test your function on the tuple `('yes',31,'good')`.

In [1]:
### Replace by your own code
def funck(a):
    if a[0] == 'yes':
        if a[1] < 29.5:
            return 'less'
        else:
            return 'more'
    else:
        if a[2] == 'good':
            return 'less'
        else:
            return 'more'
###
funck(('yes',31,'good'))

'more'

## Reading a dataset from a text file (10 P)

The file `health-test.txt` contains several fictitious records of personal data and habits.

* Read the file automatically using the methods introduced during the lecture.
* Represent the dataset as a list of tuples.

In [2]:
### Replace by your own code
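# read the whole file; the with-block closes it afterwards
with open('health-test.txt', 'r') as f:
    fileprintout = f.read()
###
# split into lines and drop empty ones (e.g. after a trailing newline)
nonewlines = [line for line in fileprintout.split('\n') if line.strip()]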

### create an empty list for tuples
tuple_list = []

### for each entry create a tuple
### and add it to the list
for a in nonewlines:
    b = a.split(',')
    tup = (b[0], int(b[1]), b[2])  
    tuple_list.append(tup)
    
print(tuple_list)

[('yes', 21, 'poor'), ('no', 50, 'good'), ('no', 23, 'good'), ('yes', 45, 'poor'), ('yes', 51, 'good'), ('no', 60, 'good'), ('no', 15, 'poor'), ('no', 18, 'good')]
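
The parsing step can also be tried without the file. The sketch below assumes, as the cell above does, that each line holds three comma-separated fields (smoker, age, diet); the helper name `parse_record` and the in-memory string `sample` are illustrative and not part of the sheet.

```python
def parse_record(line):
    # Split one "smoker,age,diet" line and convert the age to an integer.
    smoker, age, diet = line.strip().split(',')
    return (smoker, int(age), diet)

# Two lines in the assumed file format, held in memory instead of on disk.
sample = "yes,21,poor\nno,50,good\n"

# Skip empty lines so a trailing newline does not produce a broken record.
records = [parse_record(l) for l in sample.split('\n') if l.strip()]
print(records)  # [('yes', 21, 'poor'), ('no', 50, 'good')]
```

Unpacking into three names also makes a malformed line fail loudly with a `ValueError` instead of silently producing a short tuple.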


## Applying the decision tree to the dataset (15 P)

* Apply the decision tree to all points in the dataset, and compute the percentage of them that are classified as "more risk".

In [3]:
### Replace by your own code
count = 0

for e in tuple_list:
    if funck(e) == 'more':
        count += 1

print(float(count)/len(tuple_list))
    
###

0.375
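
The task asks for a percentage, while the cell above prints a fraction. A minimal sketch of the conversion follows; it uses a copy of the test records printed earlier and a decision function with the same rules as `funck`. The names `records`, `tree`, `n_more` and `percent_more` are illustrative.

```python
# Test records as printed by the file-reading cell above.
records = [('yes', 21, 'poor'), ('no', 50, 'good'), ('no', 23, 'good'),
           ('yes', 45, 'poor'), ('yes', 51, 'good'), ('no', 60, 'good'),
           ('no', 15, 'poor'), ('no', 18, 'good')]

def tree(a):
    # Same rules as funck: smokers are split on age, non-smokers on diet.
    if a[0] == 'yes':
        return 'less' if a[1] < 29.5 else 'more'
    return 'less' if a[2] == 'good' else 'more'

# Count the records classified as "more risk" and scale to a percentage.
n_more = sum(1 for r in records if tree(r) == 'more')
percent_more = 100.0 * n_more / len(records)
print(percent_more)  # 37.5
```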


## Learning from examples (10 P)

Suppose that instead of relying on a fixed decision tree, we would like to use a data-driven approach where data points are classified based on a set of training observations manually labeled by experts. Such a labeled dataset is available in the file `health-train.txt`. The first three columns have the same meaning as in `health-test.txt`, and the last column contains the labels.

* Write a procedure that reads this file and converts it into a list of pairs. The first element of each pair is a triplet of attributes, and the second element is the label.

In [4]:
### Replace by your own code
with open('health-train.txt', 'r') as fobject:
    nnlns = fobject.read().split('\n')

# drop empty lines (e.g. after a trailing newline)
nnlns = [line for line in nnlns if line.strip()]

final_list = []

for e in nnlns:
    b = e.split(',')
    tup = (b[0], int(b[1]), b[2])
    tuptup = (tup, b[3])
    final_list.append(tuptup)

print(final_list)

    

###

[(('yes', 54, 'good'), 'less'), (('no', 55, 'good'), 'less'), (('no', 26, 'good'), 'less'), (('yes', 40, 'good'), 'more'), (('yes', 25, 'poor'), 'less'), (('no', 13, 'poor'), 'more'), (('no', 15, 'good'), 'less'), (('no', 50, 'poor'), 'more'), (('yes', 33, 'good'), 'more'), (('no', 35, 'good'), 'less'), (('no', 41, 'good'), 'less'), (('yes', 30, 'poor'), 'more'), (('no', 39, 'poor'), 'more'), (('no', 20, 'good'), 'less'), (('yes', 18, 'poor'), 'less'), (('yes', 55, 'good'), 'more')]


## Nearest neighbor classifier (25 P)

We consider the nearest neighbor algorithm that classifies test points following the label of the nearest neighbor in the training data. For this, we need to define a distance function between data points. We define it to be

`d(a,b) = (a[0]!=b[0])+((a[1]-b[1])/50.0)**2+(a[2]!=b[2])`

where `a` and `b` are two tuples corresponding to the attributes of two data points.
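
To see how the three terms combine, here is a small sketch of this distance as a standalone function. The name `distance` and the two comparison points are chosen for illustration only. Note that dividing the age difference by 50 means a 50-year gap weighs as much as one categorical mismatch.

```python
def distance(a, b):
    # In Python, True and False behave as 1 and 0 in arithmetic,
    # so each mismatching categorical attribute adds exactly 1.
    return (a[0] != b[0]) + ((a[1] - b[1]) / 50.0) ** 2 + (a[2] != b[2])

x = ('yes', 31, 'good')

# Same smoker and diet values, ages 2 years apart: only the age term
# contributes, roughly (2/50)**2 = 0.0016.
d_age = distance(x, ('yes', 33, 'good'))

# Same age and diet, different smoker value: exactly one mismatch.
d_smoker = distance(x, ('no', 31, 'good'))

print(d_age, d_smoker)
```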

* Write a function that retrieves for a test point the nearest neighbor in the training set, and classifies the test point accordingly.
* Test your function on the tuple `('yes',31,'good')`

In [5]:
### Replace by your own code


def retrieve_nearest_neighbor(a):
    # set the initial minimum to infinity
    current_min = float("inf")
    closest_neighbor = None
    closest_neighbor_label = None
    
    for e in final_list:
        b = e[0]
        label = e[1]
        # calculate distance between test point and current point
        d = (a[0]!=b[0])+((int(a[1])-int(b[1]))/50.0)**2+(a[2]!=b[2])
        
        # if the distance is less than the current minimum, update the closest neighbor
        if d < current_min:
            current_min = d
            closest_neighbor = b
            closest_neighbor_label = label
            
    return (a, closest_neighbor_label)
            
        
retrieve_nearest_neighbor(('yes',31,'good'))
    
###

(('yes', 31, 'good'), 'more')

* Apply both the decision tree and the nearest neighbor classifier to the test set, find the data point(s) on which the two classifiers disagree, and compute the fraction of test points for which this happens.

In [6]:
### Replace by your own code

# a list for holding points of disagreement
disagree_points = []

test_points = tuple_list
train_points = final_list

# iterate over the test points and apply both decision tree and nearest neighbor
for e in test_points:
    decision_tree_result = funck(e)
    #print(funck(e))
    nearest_neighbor_result = retrieve_nearest_neighbor(e)
    #print(nearest_neighbor_result[1])
    #print("-----------------------------")
    
    if decision_tree_result != nearest_neighbor_result[1]:
        disagree_points.append(e)
        
print(disagree_points, (float(len(disagree_points))/len(test_points)))
        

###

[('yes', 51, 'good')] 0.125


One drawback of the plain nearest neighbor classifier is that every point to be predicted must be compared with all data points in the training set. This can be slow for datasets of thousands of points or more. Alternatively, some classifiers first train a model and then use it to classify the data.

## Nearest mean classifier (25 P)

We consider one such trainable model, which operates in two steps:

(1) Compute the average point for each class, (2) classify new points to be of the class whose average point is nearest to the point to predict.

For this classifier, we convert the attributes smoker and diet to real values (for smoker: yes=1.0 and no=0.0, and for diet: good=0.0 and poor=1.0), and use the modified distance function:

`d(a,b) = (a[0]-b[0])**2+((a[1]-b[1])/50.0)**2+(a[2]-b[2])**2`
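
Before wrapping the logic in a class, the two steps and the modified distance can be sketched with plain functions. This is only one possible arrangement: the names `encode`, `class_means`, `dist`, `train`, `means` and `prediction` are illustrative, `train` is a copy of the training pairs printed earlier, and the means are kept unrounded here.

```python
# Training pairs as printed by the cell that reads health-train.txt.
train = [(('yes', 54, 'good'), 'less'), (('no', 55, 'good'), 'less'),
         (('no', 26, 'good'), 'less'), (('yes', 40, 'good'), 'more'),
         (('yes', 25, 'poor'), 'less'), (('no', 13, 'poor'), 'more'),
         (('no', 15, 'good'), 'less'), (('no', 50, 'poor'), 'more'),
         (('yes', 33, 'good'), 'more'), (('no', 35, 'good'), 'less'),
         (('no', 41, 'good'), 'less'), (('yes', 30, 'poor'), 'more'),
         (('no', 39, 'poor'), 'more'), (('no', 20, 'good'), 'less'),
         (('yes', 18, 'poor'), 'less'), (('yes', 55, 'good'), 'more')]

def encode(x):
    # smoker: yes=1.0, no=0.0; diet: good=0.0, poor=1.0
    return (1.0 if x[0] == 'yes' else 0.0, float(x[1]),
            0.0 if x[2] == 'good' else 1.0)

def class_means(pairs):
    # Step (1): sum the encoded points per class, then divide by the count.
    sums, counts = {}, {}
    for x, label in pairs:
        v = encode(x)
        s = sums.get(label, (0.0, 0.0, 0.0))
        sums[label] = (s[0] + v[0], s[1] + v[1], s[2] + v[2])
        counts[label] = counts.get(label, 0) + 1
    return {c: tuple(t / counts[c] for t in sums[c]) for c in sums}

def dist(a, b):
    # Modified distance on real-valued attributes.
    return (a[0] - b[0]) ** 2 + ((a[1] - b[1]) / 50.0) ** 2 + (a[2] - b[2]) ** 2

means = class_means(train)

# Step (2): pick the class whose mean is nearest to the encoded point.
x = encode(('yes', 51, 'good'))
prediction = min(means, key=lambda c: dist(x, means[c]))
print(prediction)  # more
```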

We adopt an object-oriented approach for building this classifier.

* Implement the methods `train` and `predict` of the class `NearestMeanClassifier`.

In [7]:
class NearestMeanClassifier:
    
   
    
    # Training method that takes as input a dataset
    # and produces two internal vectors corresponding
    # to the mean of each class.
    def train(self,dataset):
        # train the data
        
        # create dictionary of trained model
        trained_model = {'less': (0.0, 0.0, 0.0), 'more': (0.0, 0.0, 0.0)}
        
        # vars for holding class attribute sums
        less_smoker_sum = 0.0
        less_age_sum = 0.0
        less_diet_sum = 0.0
        
        more_smoker_sum = 0.0
        more_age_sum = 0.0
        more_diet_sum = 0.0
        
        less = 0
        more = 0
        
        # iterate over training data and compute mean attributes for classes
        for e in dataset:
            if e[1] == 'less':
                less_smoker_sum += smoker_real(e[0][0])
                less_age_sum += int(e[0][1])
                less_diet_sum += diet_real(e[0][2])
                less += 1
            else:
                more_smoker_sum += smoker_real(e[0][0])
                more_age_sum += int(e[0][1])
                more_diet_sum += diet_real(e[0][2])
                more += 1
                
        #print(less_smoker_sum, less_age_sum, less_diet_sum)
        #print(more_smoker_sum, more_age_sum, more_diet_sum)
            
        # calculate the attribute means
        
        less_smoker_mean = less_smoker_sum/less
        less_age_mean = less_age_sum/less
        less_diet_mean = less_diet_sum/less
        
        more_smoker_mean = more_smoker_sum/more
        more_age_mean = more_age_sum/more
        more_diet_mean = more_diet_sum/more
        
        trained_model['less'] = (round(less_smoker_mean,2), round(less_age_mean,2), round(less_diet_mean,2))
        trained_model['more'] = (round(more_smoker_mean,2), round(more_age_mean,2), round(more_diet_mean,2))
        
        self.model = trained_model
        print(trained_model)
        ###
    
    # Prediction method that takes as input a new data
    # point and predicts it to belong to the class with
    # nearest mean.
    def predict(self,x):
        
        # convert values of the data point to reals
        realdp = (smoker_real(x[0]), x[1], diet_real(x[2]))
        #print(realdp)
        
        less_mean = self.model['less']
        more_mean = self.model['more']
        
        #calculates distance between data point and class means
        d_to_less = (realdp[0]-less_mean[0])**2+((realdp[1]-less_mean[1])/50.0)**2+(realdp[2]-less_mean[2])**2
        d_to_more = (realdp[0]-more_mean[0])**2+((realdp[1]-more_mean[1])/50.0)**2+(realdp[2]-more_mean[2])**2
        
        #print(d_to_less)
        #print(d_to_more)
        
        if d_to_less < d_to_more:
            return 'less'
        else:
            return 'more'
        ###
        
def smoker_real(smoker):
    if(smoker == 'yes'):
        return 1.0
    else:
        return 0.0
        
        
def diet_real(diet):
    if(diet == 'good'):
        return 0.0
    else:
        return 1.0

* Build an object of class `NearestMeanClassifier`, train it on the training data, and print the mean vector for each class.

In [8]:
### Replace by your own code
nmc = NearestMeanClassifier()
nmc.train(train_points)
###

{'less': (0.33, 32.11, 0.22), 'more': (0.57, 37.14, 0.57)}


* Predict the test data using the nearest mean classifier and print all test examples for which all three classifiers (decision tree, nearest neighbor and nearest mean) agree.

In [9]:
### Replace by your own code

# list for holding agreed data points and labels
agreed_list = []


for e in test_points:
    nmc_result = nmc.predict(e)
    decision_tree_result = funck(e)
    nearest_neighbor_result = retrieve_nearest_neighbor(e)
    
    if nmc_result == decision_tree_result == nearest_neighbor_result[1]:
        agreed_list.append((e, nmc_result))
        
print(agreed_list)
    
    
    
###

[(('no', 50, 'good'), 'less'), (('no', 23, 'good'), 'less'), (('yes', 45, 'poor'), 'more'), (('no', 60, 'good'), 'less'), (('no', 15, 'poor'), 'more'), (('no', 18, 'good'), 'less')]
