# Coursework 1 - Decision Trees Learning

 Enter your candidate number here: 101746

In this coursework you will explore the classification and regression capabilities of one of the most used machine learning techniques, namely **Decision Trees**. 

Decision tree learning is one of the most widely used and practical methods for inductive inference. Decision trees are also useful in the sense that the acquired knowledge can be easily interpreted by a human being, allowing us to understand it and perhaps learn from it.

## Decision trees
Decision trees are a high-level representation of a sequence of *yes/no* questions about the available evidence, which together lead to a conclusion about an event.

For instance, suppose that you want to decide whether or not you should play tennis on a given summer day. Based on your past experiences and how the weather was on each of those days, you collected the following data:

In [4]:
import pandas as pd
names = ['Outlook','Temperature','Humidity','Wind','Good?']
tennis_train = pd.read_csv(r"C:\Users\joshi\Downloads\ca1\ca1\tennis-train.txt",
                           sep=' ',  # In the file, attributes are separated by single spaces
                           names=names)

In [5]:
tennis_train.head(10)

Unnamed: 0,Outlook,Temperature,Humidity,Wind,Good?
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes
5,Rain,Cool,Normal,Strong,No
6,Overcast,Cool,Normal,Strong,Yes
7,Sunny,Mild,High,Weak,No
8,Sunny,Cool,Normal,Weak,Yes
9,Rain,Mild,Normal,Weak,Yes
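
To make the idea of a sequence of *yes/no* questions concrete, the ten rows displayed above happen to be consistent with a small hand-written tree. The function below is only an illustrative sketch: `play_tennis` is a hypothetical name, the rules were read off the displayed rows by hand, and it is not necessarily the tree that scikit-learn would learn from the full training file.

```python
def play_tennis(outlook, humidity, wind):
    """Hand-written decision tree for the displayed tennis rows (illustrative only)."""
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        # On sunny days the decision depends on humidity alone
        return "Yes" if humidity == "Normal" else "No"
    # Remaining case: outlook == "Rain", where the decision depends on wind
    return "Yes" if wind == "Weak" else "No"


# The ten displayed rows as (Outlook, Humidity, Wind, Good?); Temperature is not needed
rows = [
    ("Sunny", "High", "Weak", "No"),
    ("Sunny", "High", "Strong", "No"),
    ("Overcast", "High", "Weak", "Yes"),
    ("Rain", "High", "Weak", "Yes"),
    ("Rain", "Normal", "Weak", "Yes"),
    ("Rain", "Normal", "Strong", "No"),
    ("Overcast", "Normal", "Strong", "Yes"),
    ("Sunny", "High", "Weak", "No"),
    ("Sunny", "Normal", "Weak", "Yes"),
    ("Rain", "Normal", "Weak", "Yes"),
]

matches = sum(play_tennis(o, h, w) == good for o, h, w, good in rows)
print(matches, "of", len(rows), "rows reproduced")  # prints: 10 of 10 rows reproduced
```

Each path from the first question to a returned answer corresponds to one root-to-leaf path in the tree, which is what makes the model easy to read.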


In [6]:
attrs = names[:-1]
label = names[-1:]
X_train = tennis_train[attrs]
y_train = tennis_train[label]

In [7]:
X_train.head()

Unnamed: 0,Outlook,Temperature,Humidity,Wind
0,Sunny,Hot,High,Weak
1,Sunny,Hot,High,Strong
2,Overcast,Hot,High,Weak
3,Rain,Mild,High,Weak
4,Rain,Cool,Normal,Weak


In [8]:
y_train.head()

Unnamed: 0,Good?
0,No
1,No
2,Yes
3,Yes
4,Yes


## Pre-processing
Before you train the model, you need to convert the *nominal* variables to *numerical* ones, because the scikit-learn tree estimators only accept numeric feature values.
For this, you can use the `LabelEncoder` class from the `sklearn.preprocessing` module.
For instance:

In [31]:
from sklearn import preprocessing

In [32]:
fruits = ['apple','orange','apple','lemon','orange','banana']
le = preprocessing.LabelEncoder()
le.fit_transform(fruits)

array([0, 3, 0, 2, 3, 1], dtype=int64)
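
The integer codes above follow the alphabetical order of the categories. The sketch below shows one way to inspect that mapping and to decode the integers again. It also shows `OrdinalEncoder`, which scikit-learn provides for feature columns (its documentation describes `LabelEncoder` as intended for target values), fitted once on training rows and reused on test rows so that both share the same codes. The `train_X` and `test_X` rows are made-up toy values, not the coursework files.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

fruits = ['apple', 'orange', 'apple', 'lemon', 'orange', 'banana']
le = LabelEncoder()
codes = le.fit_transform(fruits)

# classes_ holds the sorted categories; a category's position is its integer code
mapping = {str(name): int(code) for code, name in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original strings
decoded = le.inverse_transform(codes)

# Toy feature rows (assumed for illustration): fit on training data only,
# then apply the same fitted encoder to the test data
train_X = [['Sunny', 'Weak'], ['Rain', 'Strong'], ['Overcast', 'Weak']]
test_X = [['Rain', 'Weak']]
oe = OrdinalEncoder()
train_codes = oe.fit_transform(train_X)
test_codes = oe.transform(test_X)
print(test_codes)
```

Fitting a separate encoder on the test data would only give matching codes if the test data contained exactly the same categories, so reusing the fitted encoder is the safer habit.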

You will use the `DecisionTreeClassifier` and the `DecisionTreeRegressor` classes from the `sklearn.tree` module to complete the following tasks:

## The assignment
 For each of the following experiments, provide the code and generated outputs.
  1. For the `Tennis` dataset, provide the accuracy on the **training** and **test** sets;
  2. For the `Iris` dataset, provide the accuracy on the **training** and **test** sets;
  3. Compare the accuracies of each experiment above when you change the criterion from 'gini' to 'entropy'.
  4. From the `Iris` dataset, generate *noisy* versions of the original dataset by randomly changing the correct class label to a wrong one for 0%-30% of the **training** instances (in increments of 2%), and plot the accuracies obtained on the (uncorrupted) **test** data for each noise level. The x-axis should be the noise level (0-30%) and the y-axis the accuracy.
  5. How does increasing the parameter `min_samples_leaf` (e.g., from 1 to 2, 3, ...) affect the accuracy on the test set in the noisy Iris experiment? You should generate plots with multiple lines, each of them corresponding to one value of `min_samples_leaf`.
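
As a rough sketch of what tasks 1 to 3 ask for, the loop below reports both the training and the test accuracy for each criterion. It assumes the copy of Iris bundled with scikit-learn (`load_iris`) and an arbitrary 70/30 split from `train_test_split`, not the coursework's own train and test files, so the numbers it prints are not the coursework answers.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled Iris data and a fixed 70/30 stratified split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

results = {}
for criterion in ("gini", "entropy"):
    model = DecisionTreeClassifier(criterion=criterion, random_state=0)
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the data it is given
    results[criterion] = (model.score(X_train, y_train),
                          model.score(X_test, y_test))

for criterion, (train_acc, test_acc) in results.items():
    print(f"{criterion}: train={train_acc:.3f} test={test_acc:.3f}")
```

Reporting the two accuracies side by side makes overfitting visible: an unpruned tree typically fits the training data almost perfectly, so the test accuracy is the more informative number.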
 

In [208]:
from sklearn import preprocessing
from sklearn import tree
import numpy as np
import pandas as pd


def importcsv(path_csv, names):
    """
    Simple csv import as pandas dataframe
    """
    return pd.read_csv(path_csv, sep=' ', names=names)


# Locating Resources
tennis_names = ['Outlook','Temperature','Humidity','Wind','Good?']
tennis_train = importcsv(r'C:\Users\joshi\Downloads\ca1\ca1\tennis-train.txt', tennis_names)
tennis_test = importcsv(r'C:\Users\joshi\Downloads\ca1\ca1\tennis-test.txt', tennis_names)


iris_class_names = ['Iris-versicolor', 'Iris-virginica', 'Iris-setosa']
iris_names = ["sepal-length", "sepal-width", "petal-length", "petal-width", "Class?"]
iris_train = importcsv(r'C:\Users\joshi\Downloads\ca1\ca1\iris-train.txt', iris_names)
# The test set must come from its own file; the name iris-test.txt is assumed by analogy with tennis-test.txt
iris_test = importcsv(r'C:\Users\joshi\Downloads\ca1\ca1\iris-test.txt', iris_names)



def encodelabels(dataframeData, names, inputencoding):
    """
    Encode the given dataframe into numerically labelled data
    Params: 
    dataframeData - Raw dataframe from csv
    names - A list of column names; the last one is the target column
    inputencoding - A boolean: True encodes the input columns and the target, False encodes only the target
    
    Returns:
    A dataframe of input columns (numerically encoded when inputencoding is True) and an encoded target vector
    """
    target = names[-1]
    attrs = names[:-1]
    dataArray = dataframeData.drop(columns=target)
    targetArray = dataframeData[target]
    le = preprocessing.LabelEncoder()
    targetArray_n = le.fit_transform(targetArray)
    if inputencoding:
        # Append an encoded copy of each input column with the suffix '_n'
        for name in attrs:
            name_n = name + '_n'
            dataArray[name_n] = le.fit_transform(dataArray[name])
    # The last len(attrs) columns are the encoded copies if they were added, otherwise the original inputs
    return dataArray[dataArray.columns[-len(attrs):]], targetArray_n


def getScore(dataframeTrain, dataframeTest, criterion, inputencoding, names):
    """
    Create a decision tree classifier, train it on the encoded training data, then score it on the encoded test data
    Params:
    dataframeTrain - Training dataframe
    dataframeTest - Testing dataframe
    criterion - Split quality measure for the classifier (either 'gini' or 'entropy')
    inputencoding - A boolean: True encodes the input columns and the target, False encodes only the target
    names - A list of column names, with the target column last
    
    Returns: The accuracy of the trained model on the test data

    Note: train and test are encoded independently, so the integer codes are only guaranteed to agree when both contain the same set of categories
    """
    model = tree.DecisionTreeClassifier(criterion=criterion)
    inputTrainData, trainerTargets = encodelabels(dataframeTrain, names, inputencoding)
    inputTestData, testerTargets = encodelabels(dataframeTest, names, inputencoding)
    model.fit(inputTrainData, trainerTargets)
    return model.score(inputTestData, testerTargets)
  
    
def Q1():
    tennisResultGini = getScore(tennis_train, tennis_test, "gini", True, tennis_names)
    tennisResultEntropy = getScore(tennis_train, tennis_test, "entropy", True, tennis_names)
    print("Tennis DataSet Score with Gini setting:" + str(tennisResultGini) + " with Entropy Setting:" + str(tennisResultEntropy))


def Q2():
    irisResultGini = getScore(iris_train, iris_test, "gini", False, iris_names)
    irisResultEntropy = getScore(iris_train, iris_test, "entropy", False, iris_names)
    print("Iris DataSet Score with Gini setting:" + str(irisResultGini) + " with Entropy Setting:" + str(irisResultEntropy))

Q1()
Q2()



Tennis DataSet Score with Gini setting:0.75 with Entropy Setting:0.75
Iris DataSet Score with Gini setting:1.0 with Entropy Setting:1.0
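
For the noisy-label experiment in tasks 4 and 5, one possible approach is sketched below under several assumptions: the labels are already integer-encoded, the bundled `load_iris` data with an arbitrary 70/30 split stands in for the coursework files, and `flip_labels` is a hypothetical helper name. The helper picks the requested fraction of training rows without replacement and moves each picked label to a different class, so exactly that many labels end up wrong.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def flip_labels(y, fraction, n_classes, rng):
    """Return a copy of y in which `fraction` of the labels are changed to a wrong class."""
    y_noisy = np.array(y, copy=True)
    n_flip = int(round(fraction * len(y_noisy)))
    # Sample positions without replacement so that exactly n_flip labels change
    idx = rng.choice(len(y_noisy), size=n_flip, replace=False)
    # An offset in [1, n_classes - 1] taken modulo n_classes never returns the original class
    offsets = rng.integers(1, n_classes, size=n_flip)
    y_noisy[idx] = (y_noisy[idx] + offsets) % n_classes
    return y_noisy


# Assumption: bundled Iris data and a fixed stratified split, not the coursework files
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

rng = np.random.default_rng(0)
noise_levels = [p / 100 for p in range(0, 31, 2)]  # 0%, 2%, ..., 30%
leaf_sizes = [1, 2, 3, 5]                          # illustrative min_samples_leaf values

accuracy = {}
for leaf in leaf_sizes:
    accuracy[leaf] = []
    for fraction in noise_levels:
        y_noisy = flip_labels(y_train, fraction, n_classes=3, rng=rng)
        model = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
        model.fit(X_train, y_noisy)
        # Accuracy is always measured on the uncorrupted test labels
        accuracy[leaf].append(model.score(X_test, y_test))
```

Each list in `accuracy` can then be drawn as one line, for example with matplotlib's `plt.plot(noise_levels, accuracy[leaf], label=f"min_samples_leaf={leaf}")`. A single random draw per noise level gives a jagged curve; averaging over several seeds would smooth it.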


In [232]:
import random as rand


iris_class_names = ['Iris-versicolor', 'Iris-virginica', 'Iris-setosa']


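def generateNoise(dataFrame, noise_perc):
    """
    Draw rows at random and pair each one with a randomly chosen wrong class name
    Params:
    dataFrame - Iris dataframe whose fifth column holds the class name
    noise_perc - Number of rows to draw (a row count, not a percentage); rows are drawn with replacement
    Returns: A list of [row, wrong_class_name] pairs; the dataframe itself is not modified
    """
    selected = rand.choices(dataFrame.values, k=noise_perc)
    selectedoutput = []
    for entry in selected:
        # Candidate labels are all class names except the row's true class
        poss_class_names = iris_class_names.copy()
        poss_class_names.remove(entry[4])
        chosen_class_name = rand.choice(poss_class_names)
        selectedoutput.append([entry, chosen_class_name])
    return selectedoutput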
          
generateNoise(iris_train, 100)
                                
    

[[array([6.3, 3.3, 4.7, 1.6, 'Iris-versicolor'], dtype=object),
  'Iris-virginica'],
 [array([6.4, 2.9, 4.3, 1.3, 'Iris-versicolor'], dtype=object),
  'Iris-virginica'],
 [array([7.7, 3.8, 6.7, 2.2, 'Iris-virginica'], dtype=object),
  'Iris-versicolor'],
 [array([5.9, 3.2, 4.8, 1.8, 'Iris-versicolor'], dtype=object), 'Iris-setosa'],
 [array([7.3, 2.9, 6.3, 1.8, 'Iris-virginica'], dtype=object),
  'Iris-versicolor'],
 [array([4.9, 3.0, 1.4, 0.2, 'Iris-setosa'], dtype=object), 'Iris-versicolor'],
 [array([5.1, 3.5, 1.4, 0.2, 'Iris-setosa'], dtype=object), 'Iris-versicolor'],
 [array([7.1, 3.0, 5.9, 2.1, 'Iris-virginica'], dtype=object),
  'Iris-versicolor'],
 [array([6.8, 2.8, 4.8, 1.4, 'Iris-versicolor'], dtype=object), 'Iris-setosa'],
 [array([5.2, 3.5, 1.5, 0.2, 'Iris-setosa'], dtype=object), 'Iris-versicolor'],
 [array([5.1, 3.4, 1.5, 0.2, 'Iris-setosa'], dtype=object), 'Iris-virginica'],
 [array([7.9, 3.8, 6.4, 2.0, 'Iris-virginica'], dtype=object), 'Iris-setosa'],
 [array([4.9, 2.5