# Programming Machine Learning Lab
# Exercise 10

**General Instructions:**

1. You need to submit the PDF as well as the filled notebook file.
1. Name your submissions by prefixing your matriculation number to the filename. For example, if your matriculation number is 12345, rename the files to **"12345_Exercise_10.xxx"**.
1. Complete all your tasks and then do a clean run before generating the final PDF. (_Clear All Outputs_ and _Run All_ commands in Jupyter notebook)

**Exercise-specific instructions:**

1. You are allowed to use only NumPy and Pandas (unless stated otherwise). You can use any library for visualizations.
1. In case you require a GPU for Part 2, try Google Colab or Kaggle and create a separate PDF report for that part.

### Part 1

**Decision Trees**

In this part, you need to implement a decision tree for classification. 

- Implement an object class **"Decision_Tree"** with learn and predict methods. The class should work with multiple **quality criteria** (Accuracy, Information Gain, Misclassification Rate (MCR)).
- Implement appropriate stopping criteria, e.g. maximum depth, gain too small, or minimum number of samples for splitting. You can have one or more stopping criteria.
- Download and read the Nursery dataset. Link: https://archive.ics.uci.edu/ml/datasets/Nursery
- Once the data is loaded, split it 70-20-10 into train/validation/test sets. *(You can use sklearn for splitting the dataset)*
- Train your **"Decision_Tree"** with different hyperparameters
    - Perform either grid or random search. *(You can use sklearn for hyperparameter search)*
    - Hyperparameters can include max-depth, minimum gain for splitting, and minimum number of samples for splitting. The quality criterion must be one of the hyperparameters.
    - Compare the results on validation data. 
    - Report the test results for the best model. 
    - Print the best tree using a breadth-first tree traversal (only up to a depth of 4).
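
The quality criteria listed above can share one gain computation if each is expressed as an impurity function. The following is a minimal sketch under that assumption; the names `misclassification_rate`, `split_gain` and `IMPURITY` are illustrative and not part of the assignment. Because accuracy equals 1 minus MCR, maximizing the accuracy improvement of a split selects the same split as maximizing the MCR reduction.

```python
import numpy as np

def misclassification_rate(y):
    """Fraction of labels in y that differ from the majority label."""
    _, counts = np.unique(y, return_counts=True)
    return 1.0 - counts.max() / counts.sum()

def entropy(y):
    """Shannon entropy (in bits) of the label distribution in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def split_gain(parent, left, right, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return impurity(parent) - (w_left * impurity(left) + w_right * impurity(right))

# Hypothetical lookup from a criterion name to an impurity function
IMPURITY = {"information-gain": entropy, "mcr": misclassification_rate}

parent = np.array(["a", "a", "b", "b"])
left, right = parent[:2], parent[2:]          # a perfect split
print(split_gain(parent, left, right, IMPURITY["mcr"]))
print(split_gain(parent, left, right, IMPURITY["information-gain"]))
```

With a lookup like this, the criterion name can be treated as just another hyperparameter during the search.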

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from ucimlrepo import fetch_ucirepo
from sklearn.tree import DecisionTreeClassifier

In [5]:
class Node():
    def __init__(self, feature=None, threshold=None, left=None, right=None, gain=None, value=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.gain = gain
        self.value = value
        

class DecisionTree():
    def __init__(self, min_samples=None, max_depth=None, minimunm_gain=None, criterion="information-gain"):
        self.min_samples = min_samples
        self.max_depth = max_depth
        self.minimunm_gain = minimunm_gain
        self.criterion = criterion
        
    def split_data(self, dataset, feature, threshold):
        # Create empty arrays to store the left and right datasets
        left_dataset = []
        right_dataset = []
        
        # Loop over each row in the dataset and split based on the given feature and threshold
        for row in dataset:
            if row[feature] <= threshold:
                left_dataset.append(row)
            else:
                right_dataset.append(row)

        # Convert the left and right datasets to numpy arrays and return
        left_dataset = np.array(left_dataset)
        right_dataset = np.array(right_dataset)
        return left_dataset, right_dataset

    def entropy(self, y):
        entropy = 0

        # Find the unique label values in y and loop over each value
        labels = np.unique(y)
        for label in labels:
            # Find the examples in y that have the current label
            label_examples = y[y == label]
            # Calculate the ratio of the current label in y
            pl = len(label_examples) / len(y)
            # Calculate the entropy using the current label and ratio
            entropy += -pl * np.log2(pl)

        # Return the final entropy value
        return entropy

    def information_gain(self, parent, left, right):
        # set initial information gain to 0
        information_gain = 0
        # compute entropy for parent
        parent_entropy = self.entropy(parent)
        # calculate weight for left and right nodes
        weight_left = len(left) / len(parent)
        weight_right= len(right) / len(parent)
        # compute entropy for left and right nodes
        entropy_left, entropy_right = self.entropy(left), self.entropy(right)
        # calculate weighted entropy 
        weighted_entropy = weight_left * entropy_left + weight_right * entropy_right
        # calculate information gain 
        information_gain = parent_entropy - weighted_entropy
        return information_gain

    def best_split(self, dataset, num_samples, num_features):
        # dictionary to store the best split values
        best_split = {'gain': -1, 'feature': None, 'threshold': None}
        # loop over all the features
        for feature_index in range(num_features):
            #get the feature at the current feature_index
            feature_values = dataset[:, feature_index]
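            # get unique values of that feature (np.unique returns them sorted)
            uniques = np.unique(feature_values)
            # candidate thresholds: midpoints between consecutive unique values
            # (this requires numerically encoded features)
            thresholds = []
            for i in range(1, len(uniques)):
                thresholds.append((uniques[i] + uniques[i-1]) / 2)
            thresholds = np.array(thresholds)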
            # loop over all values of the feature
            for threshold in thresholds:
                # get left and right datasets
                left_dataset, right_dataset = self.split_data(dataset, feature_index, threshold)
                # check if either datasets is empty
                if len(left_dataset) and len(right_dataset):
                    # get y values of the parent and left, right nodes
                    y, left_y, right_y = dataset[:, -1], left_dataset[:, -1], right_dataset[:, -1]
                    information_gain = -1
                    # compute information gain based on the y values
                    if self.criterion == "information-gain":
                        information_gain = self.information_gain(y, left_y, right_y)
                    # update the best split if conditions are met
                    if information_gain > best_split["gain"]:
                        best_split["feature"] = feature_index
                        best_split["threshold"] = threshold
                        best_split["left_dataset"] = left_dataset
                        best_split["right_dataset"] = right_dataset
                        best_split["gain"] = information_gain
        return best_split

    def calculate_leaf_value(self, y):
        y = list(y)
        #get the highest present class in the array
        most_occuring_value = max(y, key=y.count)
        return most_occuring_value
    
    def build_tree(self, dataset, current_depth=0):
        # split the dataset into X, y values
        X, y = dataset[:, :-1], dataset[:, -1]
        n_samples, n_features = X.shape
        # keeps spliting until stopping conditions are met
        if self.min_samples:
            if n_samples < self.min_samples: 
                leaf_value = self.calculate_leaf_value(y)
                return Node(value=leaf_value)
        if self.max_depth:
            if current_depth >= self.max_depth:
                leaf_value = self.calculate_leaf_value(y)
                return Node(value=leaf_value)
        # Get the best split
        best_split = self.best_split(dataset, n_samples, n_features)
        if self.minimunm_gain:
            if best_split["gain"] < self.minimunm_gain:
                leaf_value = self.calculate_leaf_value(y)
                return Node(value=leaf_value)
        elif best_split["gain"] <= 0:
            leaf_value = self.calculate_leaf_value(y)
            return Node(value=leaf_value)
        # continue splitting the left and the right child. Increment current depth
        left_node = self.build_tree(best_split["left_dataset"], current_depth + 1)
        right_node = self.build_tree(best_split["right_dataset"], current_depth + 1)
        # return decision node
        return Node(best_split["feature"], best_split["threshold"],
                    left_node, right_node, best_split["gain"])

    def fit(self, X, y):
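        # Stack features and labels into one array; the reshape lets a 1-D y work too
        dataset = np.concatenate((np.asarray(X), np.asarray(y).reshape(-1, 1)), axis=1)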
        self.root = self.build_tree(dataset)

    def predict(self, X):
        # Create an empty list to store the predictions
        predictions = []
        # For each instance in X, make a prediction by traversing the tree
        for x in X:
            prediction = self.make_prediction(x, self.root)
            # Append the prediction to the list of predictions
            predictions.append(prediction)
        # Convert the list to a numpy array and return it
        return np.array(predictions)
    
    def make_prediction(self, x, node):
        # if the node has a value, i.e. it's a leaf node, return its value
        if node.value is not None:
            return node.value
        else:
            # if it's not a leaf node, read its split feature and traverse the tree accordingly
            feature = x[node.feature]
            if feature <= node.threshold:
                return self.make_prediction(x, node.left)
            else:
                return self.make_prediction(x, node.right)
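
The task list asks for a breadth-first printout of the best tree down to depth 4. One way to do this is with a queue of `(node, depth)` pairs, sketched below. The `Node` class here is a minimal stand-in that mirrors the fields of the class above so the example runs on its own, and `bfs_lines` is an illustrative helper name.

```python
from collections import deque

class Node:
    """Minimal stand-in mirroring the fields of the Node class above."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.value = left, right, value

def bfs_lines(root, max_depth=4):
    """Return one description per node, level by level, down to max_depth."""
    lines = []
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        if node is None or depth > max_depth:
            continue
        if node.value is not None:
            lines.append(f"depth {depth}: leaf -> {node.value}")
        else:
            lines.append(f"depth {depth}: feature[{node.feature}] <= {node.threshold}")
            # children are visited after all nodes of the current level
            queue.append((node.left, depth + 1))
            queue.append((node.right, depth + 1))
    return lines

# Small hand-built tree for illustration only
tree = Node(feature=0, threshold=1.5,
            left=Node(value="recommend"),
            right=Node(feature=2, threshold=0.5,
                       left=Node(value="priority"), right=Node(value="not_recom")))
print("\n".join(bfs_lines(tree)))
```

For a fitted model, the same helper could be called on `model.root`.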

In [10]:
# fetch dataset
nursery = fetch_ucirepo(id=76)

# data (as pandas dataframes)
X = nursery.data.features
y = nursery.data.targets

# Split the data into 70-20-10 split for train/validation/test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
# Divide the remaining 30% in a 2:1 ratio so validation is 20% and test is 10% of the full data
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=1/3, random_state=42)

# model = DecisionTree(min_samples=2)
# model.fit(X_train, y_train)


| | parents | has_nurs | form | children | housing | finance | social | health |
|---|---|---|---|---|---|---|---|---|
| 6463 | pretentious | improper | completed | more | critical | convenient | nonprob | priority |
| 7624 | pretentious | critical | foster | 2 | convenient | inconv | nonprob | priority |
| 8430 | pretentious | very_crit | foster | 1 | convenient | convenient | problematic | recommended |
| 9920 | great_pret | less_proper | completed | more | critical | convenient | nonprob | not_recom |
| 6597 | pretentious | improper | incomplete | 3 | convenient | inconv | nonprob | recommended |
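
All Nursery attributes are categorical strings, while the `DecisionTree` class above splits on numeric thresholds (`row[feature] <= threshold`). The features therefore need a numeric encoding before `fit` can run. The sketch below shows one possible approach, an ordinal encoding built with `np.unique`; the helper name `ordinal_encode` is an assumption, and the integer order it assigns is alphabetical rather than semantically meaningful.

```python
import numpy as np

def ordinal_encode(columns):
    """Map each categorical column to integer codes.

    `columns` is a 2-D array of strings (rows = samples). Codes follow the
    sorted order of each column's unique values, which is arbitrary with
    respect to the attribute's meaning.
    """
    columns = np.asarray(columns, dtype=str)
    encoded = np.empty(columns.shape, dtype=float)
    categories = []
    for j in range(columns.shape[1]):
        # cats: sorted unique values, codes: index of each entry in cats
        cats, codes = np.unique(columns[:, j], return_inverse=True)
        encoded[:, j] = codes
        categories.append(cats)
    return encoded, categories

# Three rows taken from the first three columns of the sample shown above
rows = np.array([
    ["pretentious", "improper", "completed"],
    ["pretentious", "critical", "foster"],
    ["great_pret", "less_proper", "completed"],
])
X_num, cats = ordinal_encode(rows)
print(X_num)
```

Since `fit` stacks `X` and `y` into a single array, the target column would likely need the same treatment so the combined array stays numeric. The stored `categories` allow codes to be mapped back to the original strings.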


### Part 2

**NLP - Word2Vec Model**

In this part, we will learn how neural networks work on language data. We will implement a simple *continuous bag-of-words (CBOW)* model. (You can read more at https://arxiv.org/abs/1301.3781v3)

**You can use any deep learning library (for example, pytorch or tensorflow) for this part.**

- Create a custom dataloader for the continuous bag-of-words (CBOW) model. It takes the whole text as input and generates samples to train the model. CBOW usually takes 'n' words before and after a target word and tries to predict the target word. We will use n=2, so the output of the dataloader is (B,4) and (B,1) for X and y respectively (where B is the batch size).

- Create a neural network with the following specifications:
    - Embedding layer of size 16
    - 2 x Linear layer of size 32
    - ReLU Activation for hidden layer
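
The sample generation described for the dataloader can be sketched without any deep learning library. The example below assumes a naive lowercase whitespace tokenizer and uses illustrative helper names (`build_cbow_samples`, `iterate_batches`). In PyTorch or TensorFlow the same logic would typically live inside a dataset class.

```python
import numpy as np

def build_cbow_samples(text, n=2):
    """Turn raw text into (context, target) index arrays for CBOW."""
    tokens = text.lower().split()              # naive tokenizer (assumption)
    vocab = sorted(set(tokens))
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    ids = [word_to_idx[w] for w in tokens]
    X, y = [], []
    # only positions with n words on both sides yield a sample
    for i in range(n, len(ids) - n):
        X.append(ids[i - n:i] + ids[i + 1:i + n + 1])   # 2n context words
        y.append([ids[i]])                              # the target word
    return np.array(X), np.array(y), word_to_idx

def iterate_batches(X, y, batch_size, seed=0):
    """Yield shuffled mini-batches of shape (B, 2n) and (B, 1)."""
    order = np.random.default_rng(seed).permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

X, y, word_to_idx = build_cbow_samples("the quick brown fox jumps over the lazy dog")
print(X.shape, y.shape)
```

The last batch can be smaller than `batch_size` when the number of samples is not a multiple of it.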

Train the model for 50 epochs with cross-entropy loss and visualize the final embeddings to understand similarity between the learned word embeddings.
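
To make the tensor shapes and the loss concrete, here is a NumPy sketch of a single forward pass with untrained, randomly initialised weights. It reads the specification in one possible way, which is an assumption: the four context embeddings are averaged, the first linear layer has 32 hidden units with ReLU, and the second linear layer projects to the vocabulary. The sketch does not train anything; the actual training loop would use the automatic differentiation of the chosen deep learning library.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, hidden = 8, 16, 32    # vocab_size is a toy value

# Randomly initialised parameters (untrained)
E = rng.normal(0, 0.1, (vocab_size, emb_dim))     # embedding table
W1 = rng.normal(0, 0.1, (emb_dim, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.1, (hidden, vocab_size))
b2 = np.zeros(vocab_size)

def forward(context):
    """context: (B, 4) word indices -> logits of shape (B, vocab_size)."""
    avg = E[context].mean(axis=1)           # (B, 16) mean of context embeddings
    h = np.maximum(0.0, avg @ W1 + b1)      # (B, 32) hidden layer with ReLU
    return h @ W2 + b2                      # (B, vocab_size) one score per word

def cross_entropy(logits, target):
    """Mean negative log-likelihood; target has shape (B, 1)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(target)), target[:, 0]].mean()

context = np.array([[7, 6, 2, 3], [6, 0, 3, 5]])   # toy batch, B = 2
target = np.array([[0], [2]])
logits = forward(context)
loss = cross_entropy(logits, target)
print(logits.shape, loss)
```

Before training, the loss should sit close to `log(vocab_size)`, the value for a uniform prediction. After training, the rows of `E` are the word embeddings; they can be compared with cosine similarity or projected to two dimensions for plotting.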

In [1]:
### Write your code here