[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abdn-cs3033-ai/practicals/blob/main/week09/tutorial8-learning.ipynb)

# CS3033: Artificial Intelligence

## Tutorial 08: Supervised Learning

#### Prof. Felipe Meneguzzi

Adapted from code in the [AIMA-Python](https://github.com/aimacode/aima-python) public repository.

To run this tutorial, you need to download the auxiliary files from GitHub into your notebook, which we do with Jupyter's shell commands (if you downloaded the entire repo, the cell below is not necessary).

In [None]:
try:
  import google.colab
  print("We are in Google colab, we need to clone the repo")
  !git clone https://github.com/abdn-cs3033-ai/practicals.git
  %cd practicals/week09
  %pip install -r requirements.txt
except ImportError:  # google.colab is only importable inside Colab
  print("Not in colab")

## Datasets

The key ingredient for machine learning algorithms is data. Thus, we start by reviewing a `DataSet` class that gives our learning algorithms uniform access to the data. 
Many of the datasets we will work with are .csv files (although other formats are supported too). There is a collection of sample datasets ready to use [on aima-data](https://github.com/aimacode/aima-data), and we use a subset that we placed on [our branch](./aima-data). In such files, each line corresponds to one item/measurement. Each individual value in a line represents a *feature*, and usually one value denotes the *class* of the item. We encode data using the [DataSet](learning.py#L8) class (which we import in the next Python cell).

### Class Attributes

* **examples**: Holds the items of the dataset. Each item is a list of values.

* **attrs**: The indexes of the features (by default in the range of [0,f), where *f* is the number of features). For example, `item[i]` returns the feature at index *i* of *item*.

* **attr_names**: An optional list with attribute names. For example, `item[s]`, where *s* is a feature name, returns the feature of name *s* in *item*.

* **target**: The attribute a learning algorithm will try to predict. By default the last attribute.

* **inputs**: This is the list of attributes without the target.

* **values**: A list of lists which holds the set of possible values for the corresponding attribute/feature. If initially `None`, it gets computed (by the function `set_problem`) from the examples.

* **distance**: The distance function used in the learner to calculate the distance between two items. By default `mean_boolean_error`.

* **name**: Name of the dataset.

* **source**: The source of the dataset (url or other). Not used in the code.

* **exclude**: A list of indexes to exclude from `inputs`. The list can include either attribute indexes (`attrs`) or names (`attr_names`).
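To make these fields concrete, the sketch below derives them by hand from a tiny list of examples. It is only an illustration: the data is made up, and the plain variables stand in for the attributes of a `DataSet` object rather than reproducing the implementation in `learning.py`.

```python
# Hypothetical toy data: three features followed by the class label.
examples = [
    ['Yes', 'Some', 10, 'Wait'],
    ['No', 'Full', 30, 'Leave'],
    ['Yes', 'Full', 60, 'Leave'],
]

attrs = list(range(len(examples[0])))   # one index per column
attr_names = ['Hungry', 'Patrons', 'Estimate', 'Decision']  # made-up names
target = attrs[-1]                      # last column by default
exclude = [2]                           # keep the 'Estimate' column out of the inputs
inputs = [a for a in attrs if a != target and a not in exclude]

# Observed values for every column, in column order.
values = [sorted(set(column), key=str) for column in zip(*examples)]

print(attrs)           # [0, 1, 2, 3]
print(inputs)          # [0, 1]
print(values[target])  # ['Leave', 'Wait']
```

Note how `values[target]` lists the classes, which is the same combination the tutorial uses later on the restaurant data.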

<!-- - ```d.examples``` — A list of examples. Each one is a list of attribute values.
- ```d.attrs```    — A list of integers to index into an example, so ```example[attr]``` gives a value. Normally the same as ```range(len(d.examples[0]))```.
- ```d.attr_names``` — Optional list of mnemonic names for corresponding ```attrs```.
- ```d.target``` — The attribute that a learning algorithm will try to predict. By default the final attribute.
- ```d.inputs``` — The list of ```attrs``` without the target.
- ```d.values``` — A list of lists: each sublist is the set of possible values for the corresponding attribute. If initially ```None```, it is computed from the known examples by ```self.set_problem```. If not ```None```, an erroneous value raises ```ValueError```.
- ```d.distance``` A function from a pair of examples to a non-negative number. Should be symmetric, etc. Defaults to ```mean_boolean_error``` since that can handle any field types.
- ```d.name``` — Name of the data set (for output display only).
- ```d.source``` — URL or other source where the data came from.
- ```d.exclude``` — A list of attribute indexes to exclude from d.inputs. Elements of this list can either be integers (```attrs```) or ```attr_names```. -->

<!-- Normally, you call the constructor and you are done. Then you just access fields like `d.examples`, `d.target` and `d.inputs`. -->

### Class Helper Functions

These functions help modify a `DataSet` object to your needs.

* **sanitize**: Takes as input an example and returns it with non-input (target) attributes replaced by `None`. Useful for testing. Keep in mind that the example given is not itself sanitized, but instead a sanitized copy is returned.

* **classes_to_numbers**: Maps the class names of a dataset to numbers. If the class names are not given, they are computed from the dataset values. Useful for classifiers that return a numerical value instead of a string.

* **remove_examples**: Removes examples containing a given value. Useful for removing examples with missing values, or for removing classes (needed for binary classifiers).
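As a rough illustration of the first two helpers, the sketch below reimplements their behaviour on plain lists. The function bodies are assumptions written for this example (in particular, numbering the classes in sorted order and returning copies instead of modifying the examples in place), not the code from `learning.py`.

```python
# Hypothetical example row: two input features followed by the class label.
example = ['Yes', 'Full', 'Leave']
inputs = [0, 1]   # indexes of the input attributes
target = 2        # index of the class label

def sanitize(example, inputs):
    """Return a copy of example with every non-input attribute set to None."""
    return [value if i in inputs else None for i, value in enumerate(example)]

def classes_to_numbers(examples, target):
    """Return copies of the examples with each class label replaced by its
    position in the sorted list of labels (an assumed numbering)."""
    classes = sorted({e[target] for e in examples})
    return [e[:target] + [classes.index(e[target])] + e[target + 1:] for e in examples]

clean = sanitize(example, inputs)
print(clean)     # ['Yes', 'Full', None]
print(example)   # ['Yes', 'Full', 'Leave'], the original is unchanged

numbered = classes_to_numbers([['Yes', 'Full', 'Leave'], ['No', 'Some', 'Wait']], target)
print(numbered)  # [['Yes', 'Full', 0], ['No', 'Some', 1]]
```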

With that infrastructure in place, we now instantiate the restaurant dataset from the book, which we also used in the lecture. 

In [None]:
import copy
from collections import defaultdict
from statistics import stdev
from utils4e import argmax_random_tie, normalize, remove_all
import numpy as np
from tqdm import tqdm

from notebook import psource, pseudocode
from learning import DataSet, parse_csv

def RestaurantDataSet(examples=None):
    """
    [Figure 18.3]
    Build a DataSet of Restaurant waiting examples.
    """
    return DataSet(name='restaurant', target='Wait', examples=examples,
                   attr_names='Alternate Bar Fri/Sat Hungry Patrons Price Raining Reservation Type WaitEstimate Wait')

### Importing a Dataset

#### Importing from aima-data

Datasets uploaded on aima-data can be imported with the following line:

In [None]:
## Let's check that this dataset worked

restaurant1 = DataSet(name='restaurant')

We can also use the class we defined above, and to check that we imported the correct dataset, we can do the following:

In [None]:
restaurant = RestaurantDataSet()

## Check the first example in the dataset

print(restaurant.examples[0])
print(restaurant.inputs)

This prints the first line of the csv file and the list of input attribute indexes.

When importing a dataset, we can specify to exclude an attribute (for example, at index 1) by setting the parameter `exclude` to the attribute index or name.

In [None]:
# Load the full dataset
housing = DataSet(name='housing', target='Price', examples=None,
                   attr_names='Size Bedrooms Price')

print(housing.inputs)

# Load the dataset again, this time excluding the attribute at index 1 (Bedrooms)

housing = DataSet(name='housing', target='Price', examples=None,
                   attr_names='Size Bedrooms Price', exclude=[1])
print(housing.inputs)

### Attributes

Here we showcase the attributes.

First we will print the first three items/examples in the dataset.

In [None]:
print(restaurant.examples[:3])

Then we will print `attrs`, `attr_names`, `target` and `inputs`. Notice how `attrs` holds values in [0,10], but since the last attribute (index 10) is the target, `inputs` holds values in [0,9].

In [None]:
print("attrs:", restaurant.attrs)
print("attr_names of restaurant1 (by default same as attrs):", restaurant1.attr_names)
print("attr_names of restaurant:", restaurant.attr_names)
print("target:", restaurant.target)
print("inputs:", restaurant.inputs)

Now we will print all the possible values for the first feature/attribute.

In [None]:
print(restaurant.values[0])

Finally, we will print the dataset's name and source. Keep in mind that we have not set a source for the dataset, so in this case it is empty.

In [None]:
print("name:", restaurant.name)
print("source:", restaurant.source)

A useful combination of the above is `dataset.values[dataset.target]` which returns the possible values of the target. For classification problems, this will return all the possible classes. Let's try it:

In [None]:
print(restaurant.values[restaurant.target])

### Helper Functions

We will now take a look at the auxiliary functions found in the class.

First we will take a look at the `sanitize` function, which sets the non-input values of the given example to `None`.

In this case we want to hide the class of the first example, so we will sanitize it.

Note that the function doesn't actually change the given example; it returns a sanitized *copy* of it.

In [None]:
print("Sanitized:",restaurant.sanitize(restaurant.examples[0]))
print("Original:",restaurant.examples[0])

We also have `classes_to_numbers`. For a lot of the classifiers (like Neural Networks and Logistic Regression/Classification), classes should have numerical values. With this function we map string class names to numbers.

In [None]:
print("Class of first example:",restaurant1.examples[0][restaurant1.target])
restaurant1.classes_to_numbers()
print("Class of first example:",restaurant1.examples[0][restaurant1.target])

As you can see "Yes" was mapped to 1.

Finally, we take a look at `find_means_and_deviations`. It finds the means and standard deviations of the features for each class.

In [None]:
## TODO Debug this

# means, deviations = restaurant1.find_means_and_deviations()

# print("Yes target feature means:", means["Yes"])
# print("No mean for first feature:", means["No"][0])

# print("Yes target feature deviations:", deviations["Yes"])
# print("No deviation for second feature:",deviations["No"][1])

## Decision Trees

A decision tree is a flowchart-like structure that classifies an input through a sequence of tests. Each non-leaf node tests one attribute of the input, and the value of that attribute selects the branch leading to a child node. Each leaf node carries a class label, which is assigned to every input that reaches it. The paths from the root to the leaves therefore represent classification rules.

We now develop our algorithm for learning decision trees. For this we need two data structures, which we define below: the decision points (forks) of the tree and its leaf nodes. 

### Implementation
The nodes of the tree constructed by our learning algorithm are stored using either `DecisionFork` or `DecisionLeaf` based on whether they are a parent node or a leaf node respectively.

`DecisionFork` holds the attribute tested at that node and a dict of branches. The branches store the child nodes, one for each of the attribute's values. Calling an object of this class with an example as the argument looks up the example's value for the tested attribute and passes the example on to the matching child node, so the call returns the final classification. If the value has no matching branch, the example is passed to `default_child` instead.

The leaf node stores the class label in `result`. Every example's classification path ends at a `DecisionLeaf`, whose `result` attribute determines its class.

In [None]:
class DecisionFork:
    """
    A fork of a decision tree holds an attribute to test, and a dict
    of branches, one for each of the attribute's values.
    """

    def __init__(self, attr, attr_name=None, default_child=None, branches=None):
        """Initialize by saying what attribute this node tests."""
        self.attr = attr
        self.attr_name = attr_name or attr
        self.default_child = default_child
        self.branches = branches or {}

    def __call__(self, example):
        """Given an example, classify it using the attribute and the branches."""
        attr_val = example[self.attr]
        if attr_val in self.branches:
            return self.branches[attr_val](example)
        else:
            # return default class when attribute is unknown
            return self.default_child(example)

    def add(self, val, subtree):
        """Add a branch. If self.attr = val, go to the given subtree."""
        self.branches[val] = subtree

    def display(self, indent=0):
        name = self.attr_name
        print('Test', name)
        for (val, subtree) in self.branches.items():
            print(' ' * 4 * indent, name, '=', val, '==>', end=' ')
            subtree.display(indent + 1)

    def __repr__(self):
        return 'DecisionFork({0!r}, {1!r}, {2!r})'.format(self.attr, self.attr_name, self.branches)


class DecisionLeaf:
    """A leaf of a decision tree holds just a result."""

    def __init__(self, result):
        self.result = result

    def __call__(self, example):
        return self.result

    def display(self, indent=0):  # indent is accepted because DecisionFork.display passes it
        print('RESULT =', self.result)

    def __repr__(self):
        return repr(self.result)

### Decision Tree Learning
Decision tree learning is the construction of a decision tree from class-labeled training data. The data is expected to be a tuple in which each record of the tuple is an attribute used for classification. The decision tree is built top-down, by choosing a variable at each step that best splits the set of items. There are different metrics for measuring the "best split". These generally measure the homogeneity of the target variable within the subsets.

#### Information Gain
Information gain is based on the concept of entropy from information theory. Entropy is defined as:

$$H(p) = -\sum{p_i \log_2{p_i}}$$

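Here $p_i$ is the proportion of examples that belong to class $i$. Information gain is the difference between the entropy of the parent and the weighted sum of the entropies of the children, where each child is weighted by the fraction of the parent's examples it receives. The feature used for splitting is the one that provides the most information gain.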
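The definition can be checked numerically with a small sketch. The function names and the class counts below are illustrative assumptions: a parent with 6 positive and 6 negative examples is split in two different ways, once into fairly pure subsets and once into subsets that are as mixed as the parent.

```python
import math

def entropy(counts):
    """Entropy in bits of the distribution given by a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = sum(parent_counts)
    remainder = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - remainder

parent = [6, 6]                              # [positive, negative]
split_a = [[0, 2], [4, 0], [2, 4]]           # two pure subsets, one mixed
split_b = [[1, 1], [1, 1], [2, 2], [2, 2]]   # every subset is as mixed as the parent

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(round(gain_a, 3))  # 0.541
print(round(gain_b, 3))  # 0.0
```

The first split removes about half a bit of uncertainty, while the second removes none, so a learner that maximises information gain would prefer the first attribute.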

#### Pseudocode

You can view the pseudocode by running the cell below:

In [None]:
pseudocode("Decision-Tree-Learning")

### Your Implementation
Using the classes above, your next task is to implement the `DecisionTreeLearner`.

Our `DecisionTreeLearner` uses information gain as the metric for selecting which attribute to split on. The `decision_tree_learning` method builds the tree top-down and recursively. At each step it should handle one of four cases:

- If there are no examples left at the current step, return the plurality class of the examples received in the parent step (the previous level of recursion). The `plurality_value` method computes this for you.
- If all remaining examples belong to the same class, return a `DecisionLeaf` labelled with that class.
- If there are no attributes left to test, return the plurality class of the remaining examples.
- Otherwise, choose the attribute with the highest information gain (implemented for you in the `choose_attribute` method) and return a `DecisionFork` that splits on this attribute. Each branch is built by a recursive call to `decision_tree_learning` on the matching subset of examples.


In [None]:
class DecisionTreeLearner:
    """[Figure 18.5]"""

    def __init__(self, dataset: DataSet, size=None):
        self.dataset = dataset
        self.tree = self.decision_tree_learning(dataset.examples, dataset.inputs)

    def decision_tree_learning(self, examples, attrs, parent_examples=()):
        tree = None
        #### Your Code Here ####
        
        






        
        #########################
        return tree

    def plurality_value(self, examples):
        """
        Return the most popular target value for this set of examples.
        (If target is binary, this is the majority; otherwise plurality).
        """
        popular = argmax_random_tie(self.dataset.values[self.dataset.target],
                                    key=lambda v: self.count(self.dataset.target, v, examples))
        return DecisionLeaf(popular)

    def count(self, attr, val, examples):
        """Count the number of examples that have example[attr] = val."""
        return sum(e[attr] == val for e in examples)

    def all_same_class(self, examples):
        """Are all these examples in the same target class?"""
        class0 = examples[0][self.dataset.target]
        return all(e[self.dataset.target] == class0 for e in examples)

    def choose_attribute(self, attrs, examples):
        """Choose the attribute with the highest information gain."""
        return argmax_random_tie(attrs, key=lambda a: self.information_gain(a, examples))

    def information_gain(self, attr, examples):
        """Return the expected reduction in entropy from splitting by attr."""

        def I(examples):
            return information_content([self.count(self.dataset.target, v, examples)
                                        for v in self.dataset.values[self.dataset.target]])

        n = len(examples)
        remainder = sum((len(examples_i) / n) * I(examples_i)
                        for (v, examples_i) in self.split_by(attr, examples))
        return I(examples) - remainder

    def split_by(self, attr, examples):
        """Return a list of (val, examples) pairs for each val of attr."""
        return [(v, [e for e in examples if e[attr] == v]) for v in self.dataset.values[attr]]

    def predict(self, x):
        return self.tree(x)


def information_content(values):
    """Number of bits to represent the probability distribution in values."""
    probabilities = normalize(remove_all(0, values))
    return sum(-p * np.log2(p) for p in probabilities)

### Testing our classifier

We can now use our implementation to train a model using the restaurant dataset and classify a few examples.

In [None]:
## From AIMA
from learning import cross_validation, err_ratio

dt_learner = DecisionTreeLearner(restaurant)
print("Classification for the first example: "+restaurant.examples[0][restaurant.target])
print("Prediction for the first example: "+dt_learner.predict(restaurant.examples[0]))

In [None]:
dt_learner = DecisionTreeLearner(restaurant1)

mean_error = err_ratio(dt_learner, restaurant1)
print('Trained Dataset Mean error %.2f'%mean_error)

mean_error = cross_validation(DecisionTreeLearner, restaurant1)
print('Cross-validation mean error %.2f'%mean_error)

## Linear Regression

Now that we have looked at how to learn decision trees for classification, we switch to regression algorithms. Regression algorithms predict a continuous-valued output from the values of the input features. In this practical, we focus on linear regression learners. 
A linear learner is a model that assumes a linear relationship between the input variables $x$ and the single output variable $y$; that is, $y$ can be computed as a linear combination of the input variables $x$. This makes it a simple model, since it is fully described by a linear equation.  

The linear equation assigns one scalar factor, called a coefficient or weight, to each input value or column. One additional coefficient, often called the intercept or bias, adds a further degree of freedom.   
For example: $y = ax_{1} + bx_{2} + c$. 

More generally, for multivariate linear regression, we call the current model hypothesis $h_{\mathbf{w}}(\vec{x})$, where $\mathbf{w} = \{w_{0}, w_{1}, \dots w_{n}\}$ are the weights, and $\vec{x} = \{ x_{1}, \dots, x_{n}\}$ is a feature vector. Thus:

$$h_{\mathbf{w}}(\vec{x}) = w_{0} + w_{1}x_{1} + \dots + w_{n}x_{n}$$

Modern hardware acceleration allows us to perform matrix multiplication very fast, so that efficient prediction in a linear model is made using a vector of weights, which we multiply by a vector with the input features, plus a dummy feature $x_{0} = 1$ to match the intercept term:

$$h_{\mathbf{w}}(\vec{x}) = \mathbf{w} \cdot \vec{x} = \mathbf{w}^{\top} \vec{x} = \sum_{i}w_{i}x_{i}$$
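A short NumPy sketch of this prediction step follows. The weights and examples are made-up values chosen so the arithmetic is easy to check by hand, and the variable names are only for illustration.

```python
import numpy as np

# Hypothetical weights for h(x) = 1 + 2*x1 + 3*x2; w[0] is the intercept.
w = np.array([1.0, 2.0, 3.0])

# Three hypothetical examples with two features each.
X = np.array([[0.0, 0.0],
              [1.0, 2.0],
              [4.0, 0.5]])

# Prepend the dummy feature x0 = 1 to every example.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])

single = np.dot(w, X1[1])  # one example: 1 + 2*1 + 3*2 = 9.0
batch = X1 @ w             # all examples at once: 1.0, 9.0 and 10.5
print(single)
print(batch)
```

Stacking the examples into a matrix turns the per-example dot products into a single matrix-vector product, which is the form that benefits from hardware acceleration.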

The learning algorithm seeks a vector of weights $\mathbf{w}^{*}$ that minimizes the squared error loss over the examples:

$$\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \sum_{j} \left( y_{j} - \mathbf{w} \cdot \vec{x}_{j} \right)^{2}$$

Gradient descent finds these weights iteratively. The algorithm first assigns random weights to the input variables and then, in every epoch, updates each weight in proportion to the prediction errors, using the following update rule, where $\alpha$ is the learning rate:

$$w_{i} \gets w_{i} + \alpha\sum_{j}\left( y_{j} - h_{\mathbf{w}}(\vec{x}_{j}) \right) \times x_{j,i}$$

Finally, the prediction is made with the updated weights.  
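The update rule can also be written in vectorised form. The sketch below is a separate illustration, not the loop-based learner you are asked to complete: it assumes noise-free made-up data, starts from zero weights instead of random ones, and uses a learning rate picked by hand for this data.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

X1 = np.column_stack([np.ones_like(x), x])  # dummy feature x0 = 1 in column 0
w = np.zeros(2)                             # zero start (assumption, not random weights)
alpha = 0.05                                # learning rate (assumed value)

for _ in range(2000):
    errors = y - X1 @ w            # y_j - h_w(x_j) for every example j
    w = w + alpha * X1.T @ errors  # w_i <- w_i + alpha * sum_j errors_j * x_{j,i}

print(np.round(w, 3))  # close to the true intercept 1 and slope 2

prediction = np.dot(w, [1.0, 4.0])  # predict y for x = 4, dummy feature first
print(round(float(prediction), 3))  # close to 9
```

If the learning rate is too large for the scale of the data, the weights diverge instead of converging, so it is worth watching the reported error while training.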

In [None]:
from utils4e import random_weights

class LinearRegressionLearner:
    """
    [Section 18.6.3]
    Multivariate Linear Regression.
    """

    def __init__(self, dataset: DataSet, learning_rate=0.01, epochs=100):
        
        idx_i = dataset.inputs
        idx_t = dataset.target
        examples = dataset.examples
        num_examples = len(examples)

        # X transpose
        X_col = list(zip(*dataset.examples))  # vertical columns of X
        X_col = [X_col[i] for i in idx_i]

        # add dummy
        ones = [1 for _ in range(len(examples))]
        X_col = [ones] + X_col

        # initialize random weights
        num_weights = len(idx_i) + 1
        w = random_weights(min_value=-0.5, max_value=0.5, num_weights=num_weights)

        err = [0]

        iter = tqdm(range(epochs),postfix="Error: %f"%(np.mean(err)))
        for epoch in iter:
            err = []
            # pass over all examples
            for example in examples:
                x = [1] + [example[i] for i in idx_i] ## We add 1 to the example array because we assume the first weight is the bias coefficient.
                # Compute the prediction with the current weights
                #### Your code here (1 line) ####
                y = None
                ##################################
                t = example[idx_t]
                err.append(t - y)

            # update weights
            for i in range(len(w)):
                # Compute the overall loss and update each parameter
                #### Your code here (2 lines) ####
                loss = None
                w[i] = None
                ##################################
            iter.set_postfix_str("Error: %f"%(np.mean(err)))
        self.w = w
        # print('Finished training, final error is %f'%np.mean(err))

    def predict(self, example):
        x = [1] + example
        return np.dot(self.w, x)



## Dataset

We need a numeric dataset for linear regression. The dataset below is one such dataset, which we plot for you to inspect your model later on.

In [None]:
# produce vector inline graphics
%matplotlib inline
# from IPython.display import set_matplotlib_formats
# set_matplotlib_formats('pdf', 'svg')
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

from matplotlib import pyplot
pyplot.rcParams['text.usetex'] = False


simple_data = DataSet(name='data1', target='Y', examples=None,
                   attr_names='X Y')

fig = pyplot.figure()  # open a new figure
data = np.array(simple_data.examples)

pyplot.plot(data[:,0], data[:,-1], 'ro', ms=10, mec='k')
pyplot.ylabel('Y')
pyplot.xlabel('X')

## Training the model

We can now use the linear regression learner we implemented to fit a model that predicts points in this space. For your convenience, we also plot the line your model induces in this space. 

In [None]:

lr_learner = LinearRegressionLearner(simple_data)

# print(lr_learner.w)

fig = pyplot.figure()  # open a new figure
data = np.array(simple_data.examples)

pyplot.plot(data[:,0], data[:,-1], 'ro', ms=10, mec='k')
pyplot.ylabel('Y')
pyplot.xlabel('X')
x_s = np.arange(0,25)
y_s = [lr_learner.predict([x]) for x in x_s]
pyplot.plot(x_s,y_s , color="green")