# CT5170 - Principles of Machine Learning
### Assignment 1
##### Daniel Verdejo - 22240224
___

1: _Have a look at the given data, understand the problem based on the dependent variable and select a machine learning category that can solve the task/problem.  Briefly explain why do you think it is the correct ML category for this problem?_

Answer:

The data provided is a subset of the Iris dataset, which typically includes 3 species. We have both a training and a test dataset for our neural network to consume.

To solve the problem we will use supervised learning, specifically classification. Supervised learning fits because every one of the 80 samples in the training dataset comes with a known target label, and classification fits because that label is a category (a species) rather than a continuous value.

Classification lets us assign each sample the model is given to one of the known categories, based on the attributes of that sample.

We first train the model with a dataset of known data, consisting of 2 classes with 40 samples per class (as seen below), each sample containing 4 attributes and the target label. The attributes are the sepal width and length (in cm), and the petal width and length (in cm). 

Once trained we will then need to feed it new data (data it has not yet seen) to measure its performance.
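
As a minimal sketch of what "measuring performance on unseen data" can mean, the snippet below computes accuracy, the fraction of samples whose predicted label matches the true label. The `accuracy` helper and the four example labels are hypothetical and only for illustration; they are not taken from the plant dataset or from any trained model.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels for four unseen samples (made up for illustration).
true_labels = ["setosa", "setosa", "virginica", "virginica"]
predicted_labels = ["setosa", "virginica", "virginica", "virginica"]
print(accuracy(true_labels, predicted_labels))  # 3 of 4 correct -> 0.75
```

Computing the same metric on both the training set and the test set gives a rough indication of whether the model generalises or has only memorised the training samples.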
___

Let's begin...

In [19]:
import csv
import numpy as np
import keras as kr

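# Read each CSV and skip the header row; `with` closes the file once it has been read
with open('./plant-data/plant-train.csv', newline='') as train_file:
    training_samples = list(csv.reader(train_file))[1:]
with open('./plant-data/plant-test.csv', newline='') as test_file:
    testing_samples = list(csv.reader(test_file))[1:]

# the attributes will be our inputs - sepal width, sepal length, petal width, petal length
# (builtin float is used because np.float was deprecated in NumPy 1.20 and later removed)
attributes = np.array(training_samples)[:,:4].astype(float)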

# The classes will be our outputs
training_classes = np.array(training_samples)[:,4]
testing_classes = np.array(testing_samples)[:,4]

uniq_training_classes, output_indices, training_counts = np.unique(training_classes, return_counts=True, return_inverse=True)
print('training data: ', np.asarray((uniq_training_classes, training_counts)).T)

uniq_test_classes, test_counts = np.unique(testing_classes, return_counts=True)
print('\ntest data: ', np.asarray((uniq_test_classes, test_counts)).T)


training data:  [['setosa' '40']
 ['virginica' '40']]

test data:  [['setosa' '10']
 ['virginica' '10']]




2: _Explore and report the data and its distribution among training and testing data. Can we call it imbalanced dataset, explain your answer (yes/no) briefly?_

Answer:

No, we cannot say the provided dataset is imbalanced. As we can see above, both the training and testing datasets show a 50:50 ratio for each class. The dataset is perfectly balanced with no specific class showing a majority or minority in either the training or testing datasets.
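
The balance check can also be expressed numerically. The sketch below rebuilds label lists from the counts printed above (40/40 for training, 10/10 for testing) and computes each class's share and a simple imbalance ratio. The helper names `class_proportions` and `imbalance_ratio` are hypothetical, chosen for this illustration.

```python
from collections import Counter


def class_proportions(labels):
    """Return each class's share of the samples as a fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def imbalance_ratio(labels):
    """Largest class count divided by smallest; 1.0 means perfectly balanced."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


# Label lists rebuilt from the counts reported above.
training_labels = ["setosa"] * 40 + ["virginica"] * 40
testing_labels = ["setosa"] * 10 + ["virginica"] * 10

print(class_proportions(training_labels))
print(imbalance_ratio(training_labels), imbalance_ratio(testing_labels))
```

A ratio of 1.0 for both sets corresponds to the 50:50 split described in the answer; a ratio well above 1.0 would be a sign of imbalance.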
___

3: _Research and write down about open-source machine learning package that are freely available, and select one that you think will be good and easy for this task. Your report should include a short overview of the main features of the package you have chosen._ 

Answer:

Some of the most popular machine learning libraries for Python are as follows:

**TensorFlow** - Developed by Google and written in Python and C++, it is a free and open-source library for artificial intelligence and machine learning applications. It offers both a high-level and a low-level API for creating machine learning models, which can be run on many different hosts including cloud, desktop, mobile and edge devices. It offers tools to validate and transform large datasets, and tools to discover and remove bias in data to improve model outcomes. TensorFlow offers in-depth API documentation and learning materials for users of all experience levels.

**Keras** - A machine learning API written in Python, which runs on top of the TensorFlow platform. It offers an easy-to-use high-level API which abstracts some of the more complex functionality of TensorFlow. Keras offers a quick and easy way to get up and running for smaller projects.

**PyTorch** - Originally developed by Meta AI (a branch of Meta, formerly Facebook) and now governed under the Linux Foundation, it was first released in September 2016. The library is written in Python, CUDA and C++, and offers a machine learning framework based on the Torch library. It can be used to develop machine learning applications and also offers capabilities for creating REST API endpoints to ease application integration. Its API is generally lower level than that of Keras (for example, training loops are usually written by hand), so it can be more difficult to use.


The machine learning library that I will be using here is [Keras](https://keras.io/). Its ease of use makes carrying out tasks like the one we are tackling today quick and easy. 

Using the Keras library over something like the TensorFlow or PyTorch libraries will reduce the amount of effort on my part to build the neural network, and should also be fairly easy to read and understand for you, the reader. As this dataset is quite small and the task relatively trivial, using one of the other options, while they may operate faster, would add an unnecessary level of complexity: we should not need to dig into the lower-level APIs or spend a lot of time debugging our neural network.

4: _In order to use the dataset (Plant-dataset) supplied below, you might need to do some work to prepare it for input into the ML package, depending on the ML category requirements. Document any data preparation (e.g. normalisation) steps in your report._



Answer:

Some preparation of the dataset is required. We first extracted our attributes above by taking the first four columns of each training row and casting them to floats, which amounts to `attributes = np.array(training_samples)[:,:4].astype(float)`. These are our inputs.

Next we extracted our classes: `training_classes = np.array(training_samples)[:,4]`. These will be our outputs.
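
The attribute array above is used as extracted. If the inputs were to be normalised, as the question suggests, one common option is min-max scaling, which maps each column to the range 0 to 1. The sketch below is one possible way to do this, assuming the scaling statistics are learned from the training set only and then reused for the test set. The helper names and the sample measurements are hypothetical and are not rows from the plant dataset.

```python
import numpy as np


def fit_min_max(train_features):
    """Record the per-column minimum and range of the training features."""
    col_min = train_features.min(axis=0)
    col_range = train_features.max(axis=0) - col_min
    # Guard against constant columns, which would otherwise divide by zero.
    col_range[col_range == 0] = 1.0
    return col_min, col_range


def apply_min_max(features, col_min, col_range):
    """Scale features using statistics learned from the training set only."""
    return (features - col_min) / col_range


# Made-up measurements in cm (not rows from the plant dataset).
train = np.array([[5.1, 3.5, 1.4, 0.2],
                  [6.3, 3.3, 6.0, 2.5],
                  [4.9, 3.0, 1.4, 0.2]])
test = np.array([[5.8, 2.7, 5.1, 1.9]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
print(train_scaled.min(axis=0), train_scaled.max(axis=0))
```

Because the test set is scaled with the training statistics, a test value can fall slightly outside the 0 to 1 range; that is expected and avoids leaking information from the test set into the preparation step.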

Above we used `uniq_training_classes, output_indices, training_counts = np.unique(training_classes, return_counts=True, return_inverse=True)` on the training dataset, and a similar call on the test dataset, to view how the data is distributed among the 2 classes in both datasets. Because we passed `return_inverse=True`, we also got, for each training sample, the index of its class in the array of unique classes. This array consists of 0s and 1s and gives us a numeric output label for each sample, as we can see below.



In [20]:
print(output_indices)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1]
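
A softmax output layer with one node per class is typically trained against either integer class indices like the ones above or one-hot vectors derived from them. The following is a small sketch of how `return_inverse=True` produces those indices and how they could be turned into one-hot rows; the short label array is made up and stands in for `training_classes`.

```python
import numpy as np

# A short, made-up label column standing in for `training_classes`.
labels = np.array(["setosa", "virginica", "setosa", "virginica", "virginica"])

# `return_inverse=True` gives, for every sample, the index of its class in
# the sorted array of unique class names.
class_names, class_indices = np.unique(labels, return_inverse=True)
print(class_names)    # sorted unique names
print(class_indices)  # one integer per sample

# One-hot encode: row i has a 1 in the column of sample i's class.
one_hot = np.eye(len(class_names))[class_indices]
print(one_hot)

# The original strings can be recovered by indexing back into class_names.
recovered = class_names[class_indices]
```

Keeping `class_names` around is useful later, since it maps a predicted index back to a readable species name.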


5: _From the ML package, select two different algorithms from the category you selected and apply to the dataset. In your report, include a clear description of both algorithms. Ensure that you acknowledge all of your sources of information.
Report the results with and without normalisation of the data._

In [None]:
# Create a sequential model
model = kr.models.Sequential()

# add a hidden layer with 16 nodes that takes one input per attribute (4)
model.add(kr.layers.Dense(16, input_shape=(4,)))

# apply an activation function to the layer
model.add(kr.layers.Activation('sigmoid'))

# add our output layer with one node for each class
model.add(kr.layers.Dense(2))

# apply an activation function to the output layer
model.add(kr.layers.Activation('softmax'))
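
To make concrete what the layers above compute, here is a NumPy sketch of the forward pass of a 4 -> 16 (sigmoid) -> 2 (softmax) network. It is not Keras code and does not train anything: the weights are random, and the function and variable names are hypothetical, chosen only for this illustration.

```python
import numpy as np


def sigmoid(z):
    """Squash each value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    """Turn each row of scores into probabilities that sum to 1."""
    # Subtract the row-wise max for numerical stability.
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass for a 4 -> 16 (sigmoid) -> 2 (softmax) network."""
    hidden = sigmoid(x @ w_hidden + b_hidden)
    return softmax(hidden @ w_out + b_out)


# Random weights purely for illustration; a trained model would learn these.
rng = np.random.default_rng(0)
w_hidden, b_hidden = rng.normal(size=(4, 16)), np.zeros(16)
w_out, b_out = rng.normal(size=(16, 2)), np.zeros(2)

x = np.array([[5.1, 3.5, 1.4, 0.2]])  # one made-up sample, 4 attributes
probabilities = forward(x, w_hidden, b_hidden, w_out, b_out)
print(probabilities.shape, probabilities.sum(axis=1))
```

With these shapes the network has 4×16 + 16 = 80 parameters in the hidden layer and 16×2 + 2 = 34 in the output layer, 114 in total. Each output row holds one probability per class, and the larger of the two would be taken as the predicted class.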


6: _Train and test your chosen algorithms using the training set provided in plant-train.csv. You should then test your trained models using the test set provided in plant-test.csv. Report on the results with appropriate performance metric e.g. accuracy that you consider best for each model on the training set and on the test set. Also include details of the classification models constructed – these may include graphics if appropriate._


7: _Discuss in your report whether the two models give very similar or significantly different results, and why._