# 🌱 Pivotal Future - recruitment task

This assignment tests both your Python and ML skills, and your ability to think critically about a problem. A set of tasks is given below. Please summarise your findings in a short report (see the instructions in the last section), and attach any code you use in your analysis.

Do not spend more than 8 hours on the task. Read all the sections carefully, and then complete as many sections as you wish.

You can either run the code locally on your machine, or use https://colab.google/ if you need GPU resources.

### 🌼 Background
At Pivotal, we leverage AI for biodiversity measurement and monitoring. *Plants* are fundamental to ecosystems, serving as primary producers that support all life forms. Monitoring plant species and their distributions provides critical insights into ecosystem health, climate change effects, and biodiversity patterns. Accurate identification of plant species is essential for effective conservation efforts. However, plant identification presents challenges due to high variability in morphology, seasonal changes, and environmental influences. Leaves, flowers, bark, and growth patterns can vary significantly within species, making classification complex. 

You can read more about plants on the [iNaturalist page](https://www.inaturalist.org/taxa/47126-Plantae). You can explore identification techniques under the Observations section.

### 🖊️ The task
In this task, you will train and test a single-label classifier to distinguish between different species of plants, utilizing machine learning techniques to analyze and interpret the data.

The task is divided into the following parts:
1. Load and explore the dataset
2. Train the classification model
3. Evaluate the model
4. Bonus part (not mandatory)
5. Questions for report

We recommend reading every section first, and only then starting to code.

## 1. Data exploration

Download the dataset that was sent to you. It is divided into 10 folders, each containing images of the plant species indicated by the folder name. These will be your target classes.

In this section of the task, you have to: 
* inspect the dataset and make exploratory plots. Feel free to use the libraries that you prefer. 
* format the data so that you can use it for model training (e.g. folder structure, image sizes, etc.)
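
As a possible starting point for the inspection step, the sketch below counts the image files in each class folder, which is often the first thing to plot. It assumes one sub-folder per species, as described above; the function name `count_images_per_class` and the set of file extensions are illustrative assumptions, not part of the task.

```python
from collections import Counter
from pathlib import Path

# Assumed set of image extensions; adjust it to match the files you received.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images_per_class(dataset_dir):
    """Return a Counter mapping each class folder name to its number of image files."""
    counts = Counter()
    for class_dir in sorted(Path(dataset_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # skip stray files next to the class folders
        counts[class_dir.name] = sum(
            1
            for path in class_dir.iterdir()
            if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

Calling `count_images_per_class("path/to/dataset")` (a placeholder path) gives per-class counts that can be passed straight to a bar plot to check how balanced the classes are.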




In [None]:
# code here

## 2. Train model

In this section you have to write the code to train a categorical classifier. The target classes are the species of the plants. You are free to use any library you prefer (e.g. TensorFlow, PyTorch, ...).

We suggest using a simple solution, such as a small convolutional neural network (CNN) or a pretrained model fine-tuned with transfer learning.

In [None]:
# code here

## 3. Inference and metrics

Visualize the training and validation loss and accuracy to analyze the performance. Make some plots to show the performance of the model (e.g. confusion matrix, ROC curve, ...).
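
To make the confusion matrix concrete, here is a minimal pure-Python sketch that builds one from integer labels and derives per-class recall from it. The function names and the toy labels are assumptions for illustration only; in practice a library routine such as scikit-learn's `confusion_matrix` would do the same job.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Build a matrix whose rows are true classes and columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class predicted correctly (None if the class is absent)."""
    recalls = []
    for class_index, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[class_index] / total if total else None)
    return recalls


# Toy labels for three hypothetical classes, not results from the real dataset.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(y_true, y_pred, num_classes=3)
recalls = per_class_recall(matrix)
```

Off-diagonal entries show which species are confused with which, and per-class recall highlights classes that a single overall accuracy figure can hide.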


In [None]:
# code here

## 4. Bonus part (not mandatory)

In this section, you can be creative and add extra parts that are not requested in the task. Possible additions could be (but are not limited to): 
* testing the effect of data augmentations
* comparing the performance of different model architectures
* using the features extracted by the CNN model to cluster predictions with k-means
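
For the clustering idea, the sketch below shows plain k-means (Lloyd's algorithm) on a handful of toy 2-D vectors standing in for CNN features. The function names, the toy values, and the random-sample initialisation are assumptions chosen to keep the example self-contained; with real embeddings, a library implementation such as scikit-learn's `KMeans` is the more usual choice.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm: returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Toy 2-D "features" standing in for CNN embeddings (hypothetical values).
features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centroids, assignments = kmeans(features, k=2)
```

Comparing the resulting cluster indices with the true species labels is one way to see whether the learned features group images by species or by something else, such as background or lighting.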

In [47]:
# code here

## 5. Write a report

Please write a short and concise report (6 pages at most), or make a few slides to present the results. Cover the following points:
* Dataset description, findings and plots from the data exploration part
* Model selected
* Metrics of the model and error analysis
* Bonus parts (if applicable)

In your final report, please make sure to address the following questions:

1. What patterns did you find during the data exploration phase?
2. How can we address the class imbalance during training?
3. What data augmentation techniques should be used for this particular dataset?
4. What metrics are good indicators that our model is performing well for this specific task?
5. Did you find any patterns in the errors made by the model during the validation? 
6. In the wild, it is rare that we are able to neatly separate individual plant species into single images. How would you change your model into a multi-label classifier that can predict multiple species per image, and what kind of data would you need?
7. There are around 380,000 species of plant. Other than with a CNN, how else could you use deep learning to identify them, and what kind of data would you need?

When you are done with the code and the report, please send the updated notebook and the PDF/slides to the indicated email address.

🍀 Good luck!