# IMPERIAL COLLEGE LONDON
# MSc ASSESSMENT 2021/22

## Environmental Data Module Week 1: Data Science and Machine Learning

***For internal students of Imperial College London***<br>
***Taken by students of MSc Environmental Data Science and Machine Learning***

13:00 – 16:00 London Time, Friday 19 November 2021


## ⚠️ Disclaimer ⚠️

### Assessment type
The assessments are run as open-book assessments, and as such we have worked hard to create a coursework that assesses synthesis of knowledge rather than factual recall. Be aware that access to the internet,
notes or other sources of factual information in the time provided may not be too helpful and may well
limit your time to successfully synthesise the answers required. 

### Plagiarism
The use of the work of another student,
past or present, constitutes plagiarism. Giving your work to another student to use may also constitute
an offence. Collusion is a form of plagiarism and will be treated in a similar manner. This is an individual
assessment and thus should be completed solely by you. The College will investigate all instances where
an assessment offence is reported or suspected, using plagiarism software, vivas and other tools, and apply
appropriate penalties to students. 

## Submission of your assessment
You have ***until 4 pm London time on Friday, November 19th***, to complete this assessment. Code pushed after the deadline or sent via email/Teams will not be taken into consideration unless there are very good mitigating circumstances. We recommend pushing intermediate solutions of your code often.

To submit, follow this two-step process:

1. First, **run the code in the very last cell of this notebook**. This will do three things: 
    * It will ensure that you have the correct variable names declared in the notebook. If you don't have the right variable names, the code will return a `NameError` and this will inform you of what variable is missing. Look in the notebook to see where you were supposed to declare it.
    * If all your variables are declared, the code should output a list of tests on the **type** of the variables you declared. If all the types are correct, you will see green `PASSED` tests. Keep in mind that this only indicates that you have the correct type, **NOT that your solution is correct**. It is still possible to have a correct solution and fail the **type check**, for instance if your solution is very different from mine and you end up with a different class from the one I expect. So check your solution and make sure it is OK, but don't panic if you cannot see a problem.
    * The code will save your solution in the `answers` folder so it can easily be extracted for marking. The message will tell you whether or not the variables were correctly saved (even if they were not of the right type).
2. Second, if you have all PASSED tests or if you are confident that your solution is correct, **YOU NEED TO PUSH YOUR ASSESSMENT TO GitHub Classroom!** Code not pushed to GitHub Classroom before the deadline will not be taken into consideration. Note that if you make changes to your code, you need to do step (1) again (run the code at the end of this notebook) and then step (2) again (push to GitHub).

You can push to GitHub as often as you want during the assessment (only the last push before the deadline counts), so don't leave it to the last minute.

## Assessed Coursework Structure and marking criteria
This assessment is marked out of a **maximum of 100 marks**. The total number of achievable points per question is clearly indicated. You **need to answer all questions** to achieve the maximum mark.

### Assessment Criteria
In all assessments, we will analyse performance against performance
on the rest of the course and against data from previous years and use an evidence-based approach to
maintain a fair and robust assessment. As with all assessments, the best strategy is to read the question
carefully and answer as fully as possible, taking account of the time and number of marks available.
The following will be used to attribute marks:
- **Clean code**: Follows general clean code practices, e.g. code is well organised and easy to read and understand, avoids repetition of code by creating functions where needed (DRY principle), code can be executed top to bottom in the notebook, variable names are logical. For each question below, **20% of the mark will be awarded for how clean/well organised your code is.**
- **Correct solution**: A solution that gives a correct answer
- **Complete solution**: Ability to show that you have understood the general principles behind the question by fully testing your answer, for instance by plotting relevant data distributions or using other tools to gain insight from your data. Be careful not to overdo it, though: remember, clean code!



## 🍀 GOOD LUCK!

# INITIAL SETUP

🚨 Please enter your **CID** (`int`) and **GitHub Username** (`string`) in the cell below, and **run the cell (execute the code)**. Double check that it is correct! 🚨

In [None]:
CID = 
GitHubUsername = 

---------------------------

# 🚢 PART A: Data preparation - core data from ODP [50 marks]

Execute the cell below to load the data into a variable named `core_df`. Questions 1 to 5 are based on this data. Unless otherwise stated, always save your dataframe as `core_df` after making a transformation to the data.

In [None]:
import pandas as pd

core_df = pd.read_csv('data/core_data_labelled.csv')
core_df


## The Data
You should already be familiar with the type of data here because it is ODP core data:

- **'Leg'**: The 'Leg' number, which is the old ODP (the predecessor of IODP) terminology for an 'Expedition'
- **'Site'**: The ODP Site number, where multiple wells can be drilled
- **'H'**: The 'Hole' name (a sequential letter starting with A, B, C, ...), effectively a well. So a full well name would be 'leg-siteHole', such as 'ODP 194-1192A' 
- **'Cor'**: The core number. Each core is drilled for about 9.8 meters down from the surface of the sediments. Thus, core 1 is the shallowest core, and core numbers increase downhole.
- **'T'**: The tool used to cut the core, i.e. the cutting shoe. This can be one of 6 types: H, X, R, Z, M and W. H and X are the most common.
- **'Sc'**: The section of the core the sample comes from. Each core is divided into up to 7 sections and one core catcher ('CC').
- **'Top(cm)'**: The distance in cm from the top of the section where the sample was collected.
- **'Depth (mbsf)'**: The depth of the sample in the well measured from the seafloor (mbsf = meters below seafloor).
- **'Corr. Counts'**: The 'Corrected counts' for the core natural gamma ray. In effect, natural gamma ray data.
- **'Density (g/cc)'**: An estimate of the bulk density of the rock and sediments from automated core measurements.
- **'L*'**: The luminosity channel, part of the Lab* color space of the sediment color measured by the automated tracks. 
- **'a*'**: The 'a' color axis, part of the Lab* color space of the sediment color measured by the automated tracks.
- **'b*'**: The 'b' color axis, part of the Lab* color space of the sediment color measured by the automated tracks.
- **'CaCO3 (wt %)'**: percent carbonate present in the sample. 
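
As an illustration of the 'leg-siteHole' naming convention described above, here is a minimal sketch that builds full well names from the `Leg`, `Site` and `H` columns. The toy dataframe is made up for the example (only 'ODP 194-1192A' appears in the description above), and the `well_names` variable is hypothetical: it is not one of the variables required by this assessment.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT taken from core_data_labelled.csv.
toy = pd.DataFrame({
    "Leg": [194, 194],
    "Site": [1192, 1193],
    "H": ["A", "B"],
})

# Full well name = 'ODP ' + leg + '-' + site + hole letter.
well_names = (
    "ODP " + toy["Leg"].astype(str) + "-" + toy["Site"].astype(str) + toy["H"]
)

print(well_names.tolist())
```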

## Answer the following questions using this data:

# Question 1 [10 marks]
**a)** What is the total number of missing values in `core_df`? Save your answer into a variable named `cdf_nb_missing` below:

In [None]:
#Your Answer:
cdf_nb_missing = 

**b)** What is the average of the standard deviations of all of the numerical features in `core_df`? Save your answer into a variable named `cdf_std_mean` below:

In [None]:
cdf_std_mean = 

**c)** What are the maximum and minimum values of `CaCO3 (wt %)` in `core_df`? Save your answers into variables named `cdf_min_carb` and `cdf_max_carb` below:

In [None]:
cdf_min_carb = 
cdf_max_carb = 

# Question 2 [10 marks]
Select all numerical columns from `core_df` and save them into an array variable called `num_columns`. Use a single imputer to replace all missing values from `num_columns` directly in `core_df`, and choose either the `mean` or the `median` as your strategy. Your goal is to have values as close as possible to the most frequent value (but not exactly the most frequent: **do not** use the `most_frequent` strategy). In other words, choose the most appropriate strategy between `mean` and `median` for the data that you have.
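
To make the mechanics concrete, the following is a minimal sketch of single-imputer imputation on a made-up dataframe. The names `toy_df` and `toy_num_columns` and all values are assumptions for illustration, and the `median` strategy is picked arbitrarily for the toy data: which strategy suits `core_df` is for you to decide from its distributions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data: 'x' is strongly skewed by one large value, 'y' is roughly symmetric.
toy_df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 2.0, 50.0],
    "y": [10.0, np.nan, 12.0, 11.0, 13.0],
    "label": ["a", "b", "a", "b", "a"],
})

# Names of the numerical columns only, as a NumPy array.
toy_num_columns = toy_df.select_dtypes(include="number").columns.to_numpy()

# Compare the candidate fill values before imputing (NaN is skipped by pandas).
print(toy_df[toy_num_columns].mean())
print(toy_df[toy_num_columns].median())

# One imputer fitted on all numerical columns at once, written back in place.
imputer = SimpleImputer(strategy="median")
toy_df[toy_num_columns] = imputer.fit_transform(toy_df[toy_num_columns])

print(toy_df)
```

In the skewed toy column `x`, the mean is pulled towards the single large value while the median stays near the bulk of the data, which is the kind of difference the question asks you to weigh.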

In [None]:
# Your code here

num_columns = 

In [None]:
core_df = 

# Question 3 [10 marks]
Impute any missing categorical values with a reasonable strategy.

In [None]:
core_df = 

# Question 4 [10 marks]
Are there any obvious numerical outliers (i.e. data that must clearly be measurement errors) in `core_df`? Give your answer below as a string (either `'yes'` or `'no'`) in the variable named `outliers_present`. If there are any, then **drop the rows** containing the outliers directly in `core_df`.

In [None]:
outliers_present = 

In [None]:
core_df = 

# Question 5 [10 marks]
**a)** First run the code below to create a copy of your dataframe called `encoded_df`. For **Question 5**, you will use this new dataframe instead of the original `core_df`.

In [None]:
# Run this code:

encoded_df = core_df.copy()

**b)** Now look carefully at each one of the features in `core_df`. Do the following:
* Scale numerical features that need scaling using a single scaler (decide which scaler is most appropriate - **don't use a different scaler for each numerical feature**) 
* Encode all categorical features. 

Replace the original values directly in `encoded_df` by their encoded/scaled values.

In [None]:
# Your Code here
encoded_df = 

---------------------------------------------------------------------------------------------

# 🐝 PART B: Training and testing algorithms - Swarm Behaviour

Execute the cell below to load the labelled data into a variable named `data`. **Question 6** is based on this data.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv('data/swarm_training.csv')
unknown_behaviour =  pd.read_csv('data/unknown_swarm_behaviour.csv')
data

## Dataset Description

The dataset contains 292 `features` (different measurements of behaviour) and 1 label (`Swarm_Behaviour`).

`Swarm_Behaviour` can take the values 0 (*not swarming*) or 1 (*swarming*).

The following has been done on this dataset already:
* There is no missing data
* Numerical values were scaled using a `RobustScaler()` 
* Categorical data are encoded

Hence, the data is ready for machine learning applications though it is always possible for you to do more on the data if you want. 🧽

# Question 6 [50 marks]

You are a data scientist working for a non-profit looking at endangered eco-systems. You are given the dataset above, and are asked to train a `LogisticRegression` model to predict `Swarm_Behaviour` in new, unseen samples. In insects and birds, 'swarming' refers to the tendency to form large groups when flying together.

Your bosses also give you the following directives:
* The model needs to be a `LogisticRegression` model - Upper management will discard any other approach 
* You need to train your model to achieve the highest possible **precision** using the labelled dataset (`data`) above 
* However, management requires your algorithm to have a **recall** of ***at least 70%***
* Once you are happy with your final trained algorithm, you are asked to save it into a variable named `final_model` 
* You are then asked to make predictions about the swarming behaviour of unclassified species given in the `unknown_behaviour` dataframe above, and to save your predictions in a variable named `predictions`
* Your algorithm will be evaluated based on its performance on the unseen dataset (i.e. your `predictions`)
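
The trade-off between the precision and recall directives can be sketched as follows. This is one possible illustration rather than the expected solution: it uses synthetic data from `make_classification` (not the swarm dataset), and the names `min_recall`, `threshold` and `val_preds` are hypothetical. The idea shown is to pick, among the decision thresholds that keep recall at or above the required level on a validation split, the one with the highest precision.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labelled dataset (NOT the swarm data).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_scores = model.predict_proba(X_val)[:, 1]  # probability of class 1

precision, recall, thresholds = precision_recall_curve(y_val, val_scores)
# Drop the final (precision=1, recall=0) point, which has no matching threshold.
precision, recall = precision[:-1], recall[:-1]

# Among thresholds meeting the recall floor, keep the one with highest precision.
min_recall = 0.70
eligible = recall >= min_recall
best = np.argmax(np.where(eligible, precision, -np.inf))
threshold = thresholds[best]

# Apply the chosen threshold instead of the default 0.5 cut-off.
val_preds = (val_scores >= threshold).astype(int)
print(f"threshold={threshold:.3f}")
print(f"precision={precision_score(y_val, val_preds):.3f}")
print(f"recall={recall_score(y_val, val_preds):.3f}")
```

Because the threshold is selected on the same validation split it is scored on, these numbers are optimistic; a separate held-out set or cross-validation would give a fairer estimate of performance on unseen data.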

GOOD LUCK! 🧧

In [None]:
# Your code

final_model = 

In [None]:
predictions = 

# 🚨 CHECK AND SAVE YOUR ANSWERS BEFORE PUSHING 🚨
### Run the code cell below

Running the cell below will save your answer(s) in the `answers` folder. This is an important part of the correction process as we will use this as a first-pass assessment of your work. It will also give you an indication of whether you have saved all of the important variables in the notebook correctly, and whether some are missing.

**If you have a variable missing**, the code below will output a clear, red error with a `NameError`, something like this:
`name 'final_model' is not defined`

If all of the variables are defined, the code will give you a detailed report on the expected types of each variable, and whether you need to check them or not. The variables with correct types will be saved on the disk.
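
To illustrate the difference between a missing variable and a variable of an unexpected type, here is a small standalone sketch. It is not the implementation of `CheckAnswers`: the helper `describe_variable`, the status strings and the expected type for `cdf_nb_missing` are assumptions made for the example.

```python
def describe_variable(namespace, name, expected_type):
    """Return a short status string for one variable (hypothetical helper)."""
    if name not in namespace:
        # Mirrors the wording of Python's NameError message.
        return f"name '{name}' is not defined"
    value = namespace[name]
    status = "PASSED" if isinstance(value, expected_type) else "CHECK TYPE"
    return f"{name}: {status} (got {type(value).__name__})"


# Toy namespace: one answer with the assumed type, one with the wrong type,
# and 'final_model' deliberately left undefined.
toy_namespace = {"cdf_nb_missing": 12, "outliers_present": True}

report = [
    describe_variable(toy_namespace, "cdf_nb_missing", int),
    describe_variable(toy_namespace, "outliers_present", str),
    describe_variable(toy_namespace, "final_model", object),
]
print("\n".join(report))
```

A type check of this kind only confirms that a value has the expected class; it says nothing about whether the value itself is right.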

You can run this code as often as you want, and you should run it one last time before you push your code to GitHub Classroom.

In [None]:
from answers.check_answers import CheckAnswers

answer = CheckAnswers(
    cid=CID,
    username=GitHubUsername,
    cdf_nb_missing=cdf_nb_missing,
    cdf_std_mean=cdf_std_mean,
    cdf_min_carb=cdf_min_carb,
    cdf_max_carb=cdf_max_carb,
    num_columns=num_columns,
    outliers_present=outliers_present,
    core_df=core_df,
    encoded_df=encoded_df,
    final_model=final_model,
    predictions=predictions,
).checkAll()

print(answer)