## Assignment 4
***
*General hints:* <br>
* You may use another notebook to test different approaches and ideas. When complete and mature, turn your code snippets into the requested functions in this notebook for submission. 
* Make sure the function implementations are generic and can be applied to any dataset (not just the one provided).
* Add explanatory code comments in the code cells. Make sure that these comments improve our understanding of your implementation decisions.

-----
* Create a variable holding your student id, as shown below. 
* Simply replace the example (`01234567`) with your actual student id having a total of 8 digits. 
* Maintain the variable as a string, do NOT change its type in this notebook!
* *Note: If your student id has 7 digits, add a leading 0. The final student id MUST have 8 digits!*

In [None]:
mn = '12318768'

In [None]:
import pytest
import pandas as pd 
import numpy as np

## 0. Import

Implement a function `tidy` which imports the data set assigned and provided to you as a CSV file into a `pandas` dataframe. Access the data set and establish whether it is tidy. If not, clean the data set before continuing with Step 1. Mind all rules of tidying data sets in this step. Make sure you comply with the following statements:
* If there is an index column (row numbers) in your tidied dataset, keep it.
* The following columns, once identified, correspond to variables 1:1 (no need for transformations):
  * `full_name`
  * `automotive`
  * `color`
  * `job`
  * `address`
  * `coordinates`
  * `km_per_litre`
* The tidied dataset should have a total of 9 columns (not including the index), the first column should be `full_name` and the last one `km_per_litre`.
* Mind the intended content of each attribute (e.g. `full_name` should contain the full name of a person, no need to change that)
* If tidy or done, have the function `tidy` return the ready data set as a dataframe.

Note that `tidy` must take a single parameter holding your student id (`mn`), which forms one part of the basename of the CSV file according to the Code of Conduct (CoC), i.e., the CoC file name without the file extension. Rename the data file so that it matches this requirement and the CoC, and make sure you submit your final ZIP following the CoC requirements. In particular, put your data file in a folder called `data/` when submitting.

In [None]:
def tidy(x):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert type(tidy(mn)) == pd.core.frame.DataFrame, "T0.1"
assert len((tidy(mn)).columns) == 9, "T0.2"
assert list((tidy(mn)).columns)[0] == "full_name", "T0.3"
assert list((tidy(mn)).columns)[len((tidy(mn)).columns)-1] == "km_per_litre", "T0.4"

In [None]:
# Edit this cell or remove it, and you shall perish, meow! 😼⚡️

-------
## 1. Missing values

### 1.1 Code part
Implement a function called `missing_values` which takes a dataframe as input and checks whether there are any missing values in the dataset. Record the row positions (*not* the row labels!) of the observations containing missing values as a list of numbers, and make sure that the function returns the recorded list, sorted in ascending order. If there are no missing values, `missing_values` should return an empty list.

**NOTE:** You must find out how missing values are encoded in your dataset and which missing values occur in it. This ***requires manual inspection*** using Python helpers. For instance, missing values could be encoded as `"nan"` or `"(+/-)inf"`, but other values, empty fields, or fields containing only white space are also conceivable encodings. Do *not* rely on built-in Python or pandas functions alone!
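
As an illustration of such a manual inspection, here is a minimal sketch on made-up data. The helper name `suspicious_cells`, the toy frame, and the token list are assumptions chosen for the example; the encodings in your own file may well differ, so treat this as a starting point for exploration rather than a complete check.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the tokens below are illustrative, not the ones in your file.
toy = pd.DataFrame({
    "color": ["red", "nan", "  ", "blue", None],
    "km_per_litre": [12.5, np.inf, 9.1, np.nan, 14.0],
})

def suspicious_cells(df, tokens=("nan", "inf", "-inf", "+inf")):
    """Return a boolean mask marking cells that look like encoded missing values."""
    # Compare on stripped, lower-cased text so that " NaN " or "   " are caught too.
    as_text = df.astype(str).apply(lambda col: col.str.strip().str.lower())
    return df.isna() | as_text.isin(list(tokens)) | as_text.eq("")

mask = suspicious_cells(toy)
print(mask.sum())                 # number of suspicious cells per column
print(toy[mask.any(axis=1)])      # rows worth looking at by eye
```

Listing the distinct values of a column (for example with `value_counts(dropna=False)`) is another way to spot tokens that such a fixed list would miss.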

Important: Mind the difference between row positions and row labels. `.index` of a dataframe returns row labels. `.iloc` takes row positions.
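
The difference between labels and positions can be seen in a small sketch. The toy frame and its index values are made up for illustration; the point is only that the two lists differ as soon as the index is not `0, 1, 2, ...`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose row labels differ from the row positions.
toy = pd.DataFrame({"job": ["nurse", None, "pilot", None]}, index=[10, 20, 30, 40])

flagged = toy["job"].isna().to_numpy()   # boolean array, one entry per row

labels = toy.index[flagged].tolist()          # row labels, as used by .loc
positions = np.flatnonzero(flagged).tolist()  # row positions, as used by .iloc

print(labels)     # [20, 40]
print(positions)  # [1, 3]
```

Calling `.tolist()` on the NumPy result yields plain Python `int` values, which matters for `isinstance(i, int)` checks such as the one in T1.2.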

In [None]:
def missing_values(x):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert type(missing_values(tidy(mn))) == list, "T1.1"
assert all(isinstance(i, int) for i in missing_values(tidy(mn))), "T1.2"

In [None]:
# Edit this cell or remove it, and you shall perish, meow! 😼⚡️

### 1.2. Analytical part

* Does the dataset contain missing values?
* Explain your manual-inspection procedure and the Python helpers used!
* If no, explain how you proved that this is actually the case. 
* If yes, describe the discovered missing values. What could be an explanation for their missingness?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!


YOUR ANSWER HERE

------
## 2. Handling missing values
### 2.1. Code part
Implement a (simple) function called `handling_missing_values` that handles missing values using an adequate single-imputation technique (or one of the alternatives to single imputation) of your choice per type of missing values. Make use of the techniques learned in Unit 4. The function should take a dataframe as input and return the updated dataframe. Mind the following:
- The objective is to apply single imputation on these synthetic data. Do not make up a background story (at this point)!
- Do NOT simply drop the missing values. This is not an option.
- The imputation technique must be adequate for a given variable type (quantitative, qualitative).
- To establish whether a variable is quantitative or qualitative, it is *not* sufficient to inspect data types only!
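
To make the idea of type-adequate single imputation concrete, here is a minimal sketch. It assumes that median imputation (quantitative) and mode imputation (qualitative) are among the techniques you may use, that all missing-value encodings have already been converted to `NaN`, and that you classified the columns yourself. The helper `impute_simple` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values already encoded as NaN.
toy = pd.DataFrame({
    "color": ["red", "blue", np.nan, "red"],
    "km_per_litre": [10.0, np.nan, 14.0, 30.0],
})

def impute_simple(df, quantitative, qualitative):
    """Fill quantitative columns with the median and qualitative ones with the mode."""
    out = df.copy()  # leave the caller's dataframe untouched
    for col in quantitative:
        out[col] = out[col].fillna(out[col].median())
    for col in qualitative:
        # mode() may return several values; take the first one as a simple tie-break.
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

filled = impute_simple(toy, quantitative=["km_per_litre"], qualitative=["color"])
print(filled)
```

The column lists are passed explicitly because a numeric dtype alone does not make a variable quantitative (think of numeric codes), which is the point of the last bullet above.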

In [None]:
def handling_missing_values(x):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
assert len(missing_values(handling_missing_values(tidy(mn)))) == 0, "T2.1"
assert handling_missing_values(tidy(mn)).shape == tidy(mn).shape, "T2.2"

### 2.2. Analytical part
Discuss the implications. Answer the following:

- How would you qualify the data-generating processes leading to different types of missing values, provided that the data was not synthetic?
- What are the benefits and disadvantages of the chosen single-imputation technique?
- How would you apply a multiple-imputation technique to one type of missing values, if applicable at all?
- We asked you to test for/treat as missing values by checking certain field values, as well as empty fields or fields containing the numeric value 0. What are potential problems of these heuristics?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!

YOUR ANSWER HERE

-----
## 3. Detecting duplicate entries
Implement a function called `duplicates` that takes as input a (tidy) dataframe `x` and a list of column labels (`VARIABLES`). Assume that `duplicates` receives a dataframe as returned from your Step 0 implementation of `tidy`. It then checks whether there are any duplicates in the dataset. Record the row positions of the second and any later observations that are duplicates, and have `duplicates` return the list of row positions, sorted in ascending order. An empty list indicates the absence of duplicated observations. As the tests below require, if no valid column labels are passed (e.g., `None` or the placeholder `[list]`), `duplicates` returns the string `"Name variables defining potential duplicates!"`.

Important:
* The first observation that belongs to the detected duplicates is *not* considered a duplicate!
* Mind the difference between row positions and row labels. `.index` of a dataframe returns row labels. `.iloc` takes row positions.
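
The "first occurrence is not a duplicate" rule corresponds to `keep="first"` in pandas' `DataFrame.duplicated`. The following sketch shows this on made-up data; the helper name `later_duplicates` is hypothetical, and the sketch deliberately leaves out the handling of invalid column lists.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with non-numeric row labels.
toy = pd.DataFrame(
    {"full_name": ["Ann Lee", "Bo Chan", "Ann Lee", "Ann Lee"],
     "job": ["nurse", "pilot", "nurse", "chef"]},
    index=["a", "b", "c", "d"],
)

def later_duplicates(df, subset):
    """Row positions of the second and later occurrences within `subset`."""
    flagged = df.duplicated(subset=subset, keep="first")  # first occurrence stays False
    return np.flatnonzero(flagged.to_numpy()).tolist()    # positions, already ascending

by_name = later_duplicates(toy, ["full_name"])
by_name_and_job = later_duplicates(toy, ["full_name", "job"])
print(by_name)          # [2, 3]
print(by_name_and_job)  # [2]
```

The two results differ, which shows why the choice of variables in `VARIABLES` determines what counts as a duplicate.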

In [None]:
VARIABLES = [list]; # Change value assignment!

def duplicates(x, vars):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
df = tidy(mn);
assert len(VARIABLES) > 0 and all([v in df.columns.tolist() for v in VARIABLES]), "T3.1"
assert duplicates(df, [list]) == "Name variables defining potential duplicates!", "T3.2"
assert duplicates(df, None) == "Name variables defining potential duplicates!", "T3.3"
assert type(duplicates(df, vars = df.columns.tolist())) == list, "T3.4"
assert all(isinstance(i, int) for i in duplicates(df, df.columns.tolist())), "T3.5"

In [None]:
# Edit this cell or remove it, and you shall perish, meow! 😼⚡️


-----
## 4. Detecting outliers
### 4.1. Code part
Implement a function called `detecting_outliers` to detect outliers in one selected quantitative variable. Pick a suitable variable from the tidied dataset based on your characterisation and apply one suitable outlier-detection technique as covered in Unit 4. Justify your choice of this technique in the analytical part. Again, the function is assumed to receive a tidied data set from Step 0. The function returns the row positions (*not* row labels!) of the rows containing outliers on the selected variable, sorted in ascending order.
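
As one possible illustration, the sketch below applies the interquartile-range rule (Tukey's fences with a factor of 1.5). Whether this technique suits your variable, and whether it is among those covered in Unit 4, is an assumption you need to check yourself. The helper name `iqr_outlier_positions`, its parameters, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one obviously extreme value.
toy = pd.DataFrame({"km_per_litre": [11.0, 12.0, 12.5, 13.0, 13.5, 14.0, 60.0]})

def iqr_outlier_positions(df, column, k=1.5):
    """Row positions of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = df[column]
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    # Comparisons with NaN evaluate to False, so missing values are never flagged.
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return np.flatnonzero(outside.to_numpy()).tolist()

found = iqr_outlier_positions(toy, "km_per_litre")
print(found)  # [6]
```

Here Q1 is 12.25 and Q3 is 13.75, so the fences lie at 10.0 and 16.0 and only the value 60.0 falls outside them.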

In [None]:
def detecting_outliers(x):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
df = tidy(mn);
assert type(detecting_outliers(df)) == list, "T4.1"
assert all(isinstance(i, int) for i in detecting_outliers(df)), "T4.2"
assert len(detecting_outliers(df)) > 0 and len(detecting_outliers(df)) < .05*df.shape[0], "T4.3"


### 4.2. Analytical part
Discuss the implications. 

- What is the chosen outlier-detection technique? Explain it using your own words in 3-4 sentences.
- Describe the outliers detected: How many? How do they relate to the typical, non-outlier values in the remaining dataset?
- What could be one reason these outliers appear in the dataset? How would you treat them further?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!

YOUR ANSWER HERE