In [2]:
from nose.tools import assert_equal 
import pandas as pd 

## 0. Import

Implement a function `tidy` which imports the data set assigned and provided to you as a CSV file into a `pandas` dataframe. Access the data set and establish whether your data set is tidy. If not, clean the data set before continuing with Step 1. Mind all rules of tidying data sets in this step. Make sure you comply with the following statements:
* If there is an index column (row numbers) in your tidied dataset, keep it.
* The following columns, once identified, correspond to variables 1:1 (no need for transformations):
  * `full_name`
  * `automotive`
  * `color`
  * `job`
  * `address`
  * `coordinates`
  * `km_per_litre`
* The tidied dataset should have a total of 9 columns (not including the index), the first column should be `full_name`.
* Mind the intended content of each attribute (e.g. full_name should contain the full name of a person, no need to change that)
* If tidy or done, have the function `tidy` return the ready data set as a dataframe.

Note that `tidy` must take a single parameter that holds the basename of the CSV file (i.e., the name without file extension). Do NOT change the name of the file, do not overwrite the original data file, and make sure you submit your final ZIP following the Code of Conduct (CoC) requirements. Especially, make sure you put your data file in a folder called `data/` when submitting.

In [3]:
def tidy(x):
    df = pd.read_csv(f"./data/{x}.csv", header=None)    
    if df.shape[0] < df.shape[1]:  # more columns than rows: the data is laid out horizontally
        df = df.transpose()        # swap rows and columns
        df.columns = df.iloc[0]    # use the first row as the header
        df = df[1:]                # drop the first row, which is now the header
    
    # Split 'date_time/full_company_name' into 'date_time' and 'company_name'
    df[['date_time', 'company_name']] = df['date_time/full_company_name'].str.extract(r'([0-9\- :\.]+)([A-Za-z\s,&]+)$')
    
    df = df.drop(columns=['date_time/full_company_name'])
    
    required_columns = ['full_name', 'automotive', 'color', 'job', 'address', 'coordinates', 
        'date_time', 'company_name', 'km_per_litre']
    df = df[required_columns]

    return df

#tidy(mn)
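
The split of the combined column can be tried out in isolation. The following is a minimal sketch that applies the same pattern as `tidy` to two invented sample values; the actual format of the assigned CSV file may differ, so the sample strings are assumptions.

```python
import pandas as pd

# Hypothetical sample values; the real file's format may differ.
combined = pd.Series([
    "2023-05-17 08:15:00.123456Miller, Stone & Sons",
    "2021-11-02 23:59:59.000001Acme Group",
])

# Same pattern as in tidy(): group 1 collects timestamp characters,
# group 2 collects the trailing company name.
pattern = r'([0-9\- :\.]+)([A-Za-z\s,&]+)$'
parts = combined.str.extract(pattern)
parts.columns = ['date_time', 'company_name']
print(parts)
```

Note that digits and hyphens belong to the first group's character class, so a company name containing either would likely be split at a different position than intended.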

In [4]:
assert_equal(type(tidy(mn)), pd.core.frame.DataFrame)
assert_equal(len((tidy(mn)).columns), 9)
assert_equal(list((tidy(mn)).columns)[0], "full_name")
assert_equal(list((tidy(mn)).columns)[len((tidy(mn)).columns)-1], "km_per_litre")

-------
## 1. Missing values

### 1.1 Code part
Implement a function called `missing_values` which takes as an input a dataframe and checks if there are any missing values in the dataset. Record the row ids of the observations containing missing values as a list of numbers and make sure that the function returns the recorded list in the end. If there are no missing values, `missing_values` should return an empty list.

**NOTE:** You shall find out how missing values are encoded in your dataset and which missing values occur in it; you will probably ***need manual inspection***. For instance, they could be encoded as `"nan"` or `"(+/-)inf"`, but other values, empty fields, or fields containing only white spaces are also conceivable encodings of missing values in your dataset. (We are aware that this generic test might be overshooting in practice ;-))

In [6]:
import numpy as np
def missing_values(x):
    df_str = x.astype(str)     # cast to str so every cell can be compared against the string encodings below
    
    condition = df_str.isin(["nan", "NaN", "", " ", "-", "inf", "-inf", "+inf"])   # encodings treated as missing
    
    missing_values = x.index[condition.any(axis=1)].tolist()
    #display(x.loc[missing_values])
    return missing_values

#missing_values(tidy(mn))

In [7]:
assert_equal(type(missing_values(tidy(mn))), list)
assert_equal(all(isinstance(i, int) for i in missing_values(tidy(mn))), True)

### 1.2. Analytical part

* Does the dataset contain missing values?
* If no, explain how you proved that this is actually the case.
* If yes, describe the discovered missing values. What could be an explanation for their missingness?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!


1. Yes, there are missing values in the dataset.
2. They are encoded as `NaN`, `-inf`, `+inf`, `-`, and `" "` (a field containing only white space).

NaN: the typical encoding of missing data in Python. It means that the value was not provided or recorded in the first place.

+inf/-inf: the value was entered incorrectly.

Missing values occur in the following columns: full_name, automotive, job, address, coordinates, date_time, company_name.
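
To see which encodings occur in which column, the check can be run per column. The sketch below uses an invented toy frame and an assumed token list; it strips white space first so that fields with several blanks are caught as well. It is not the function submitted above, only an illustration of the inspection step.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the assignment, values are invented.
toy = pd.DataFrame({
    "full_name": ["Ann Lee", "   ", "Bo Chen", "-"],
    "km_per_litre": ["21", "-inf", "nan", "30"],
})

# Assumed list of missing-value encodings after stripping and lower-casing.
missing_tokens = ["nan", "", "-", "inf", "-inf", "+inf"]

# Normalise every cell to a stripped, lower-case string before comparing.
as_text = toy.astype(str).apply(lambda col: col.str.strip().str.lower())
mask = as_text.isin(missing_tokens)

missing_per_column = mask.sum()                          # count per column
rows_with_missing = toy.index[mask.any(axis=1)].tolist() # affected row ids
print(missing_per_column)
print(rows_with_missing)
```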

------
## 2. Handling missing values
### 2.1. Code part
Apply a (simple) function called *handling_missing_values* for handling missing values using an adequate single-imputation technique  of your choice per type of missing values. Make use of the techniques learned in Unit 4. The function should take as an input a dataframe and return the updated dataframe. Mind the following:
- The objective is to apply single imputation on these synthetic data. Do not make up a background story (at this point)!
- Do NOT simply drop the missing values. This is not an option.
- The imputation technique must be adequate for a given variable type (quantitative, qualitative). 

In [9]:
def handling_missing_values(x):
    x = x.replace(['+inf', '-inf', '-', ' ', ''], np.nan)           # replace missing-value encodings with NaN (returns a copy)
    
    if 'date_time' in x.columns:
        original_index = x.index
        x['date_time'] = pd.to_datetime(x['date_time'])             # convert strings into datetime
        x = x.sort_values(by='date_time')                           # sort chronologically so that LOCF can be applied
        x['date_time'] = x['date_time'].ffill()                     # LOCF (Last Observation Carried Forward) to fill missing dates
        x = x.loc[original_index]                                   # restore the original row order
    for column in x.columns:                                        # remaining gaps: mode (most frequent value)
        if not x[column].mode().empty:
            mode_value = x[column].mode()[0]                        # take the first mode value
        else:
            mode_value = 'Unknown'
        x[column] = x[column].fillna(mode_value)
    return x

#handling_missing_values(tidy(mn))

In [10]:
assert_equal(len(missing_values(handling_missing_values(tidy(mn)))), 0)
assert_equal(handling_missing_values(tidy(mn)).shape, tidy(mn).shape)

### 2.2. Analytical part
Discuss the implications. Answer the following:

- How would you qualify the data-generating processes leading to different types of missing values, provided that the data was not synthetic?
- What are the benefits and disadvantages of the chosen single-imputation technique?
- How would you apply a multiple-imputation technique to one type of missing values, if applicable at all?
- We asked you to test for/treat as missing values by checking certain field values, as well as empty fields or fields containing the numeric value 0... what are potential problems of this heuristic?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!

Missing data occur in the following columns: full_name, automotive, job, address, coordinates, date_time, company_name.
My guess is that the mechanism is most likely MCAR: the "process" that produced the missing data is independent of the (other) variables in the dataset.
##### Processes that could lead to missing data in our table:
1. The equipment needed for measurement suffered a malfunction or service outage.
2. Data collection was terminated unexpectedly.

Generally: the probability of a value being missing is unrelated to the value itself or to any other variable in the dataset. Therefore, the data are said to be missing completely at random (MCAR).

##### Single imputation (a.k.a. single substitution):
I used mode imputation for qualitative variables: the most common value (mode) is substituted for the missing values of a given variable.
Pros: allows for complete-case analysis; typically does not add bias under MCAR (which I believe is the case here).
Cons: adds bias under MAR and MNAR; the replacement values are all identical, so the variances end up smaller than in the real data.

Most of the missing values are categorical, but some are timestamps in date_time. There I decided to use LOCF (Last Observation Carried Forward) to fill the missing dates, as we did in the tutorial. The problem was that the data were not structured as a time series, so I first sorted the data in chronological order, then filled the missing values, and finally restored the initial order. LOCF may also be seen as a special case of hot-deck imputation.
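
The sort, fill, restore sequence described above can be sketched on a toy frame. The dates are invented for illustration. One thing the sketch makes visible: `sort_values` places missing timestamps last by default, so under this approach a gap receives the latest observed timestamp.

```python
import pandas as pd

# Hypothetical toy data with one missing timestamp at row 1.
toy = pd.DataFrame({
    "date_time": pd.to_datetime(
        ["2021-03-01", None, "2021-01-01", "2021-02-01"]
    )
})

original_index = toy.index                       # remember the row order
filled = toy.sort_values(by="date_time")         # chronological, NaT sorted last
filled["date_time"] = filled["date_time"].ffill()  # LOCF
filled = filled.loc[original_index]              # restore the row order
print(filled)
```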

##### Random Hot-Deck Without Predictors:

In my opinion we do not have any valid predictors for full_name or automotive (in theory the job column could be used, but not in this case, as it seems to be a unique identifier), so random hot-deck imputation without predictors could be used, for example, for the full_name column. First I would create a donor table with all non-missing values from the full_name column, and then I would draw values from it at random to fill the missing ones.
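
A minimal sketch of this donor-table idea, assuming a hypothetical helper `random_hot_deck` and invented names; it is an illustration of the technique, not part of the submitted functions.

```python
import numpy as np
import pandas as pd

def random_hot_deck(series, seed=0):
    """Fill missing entries with values drawn at random from the observed ones.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    donors = series.dropna().to_numpy()          # donor table: all observed values
    filled = series.copy()
    n_missing = int(filled.isna().sum())
    if n_missing and len(donors):
        # Draw with replacement, one donor value per missing entry.
        filled[filled.isna()] = rng.choice(donors, size=n_missing, replace=True)
    return filled

# Invented example values.
names = pd.Series(["Ann Lee", None, "Bo Chen", None, "Cy Diaz"], name="full_name")
imputed = random_hot_deck(names)
print(imputed)
```

Repeating the draw with different seeds yields several completed copies of the column, which is the starting point of a multiple-imputation workflow.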

##### If we handled values '0' as NaN:

In this table specifically, this would have affected only the km_per_litre column. But generally:
 - 0 could have been a valid value
 - if it were the only heuristic, I would have overlooked other non-standard missing values such as '+/-inf' and ' '
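
Both problems can be made concrete on a few invented values. In the sketch below (the values and the token list are assumptions), the zero-only rule flags a possibly valid measurement and misses the two non-standard encodings.

```python
import pandas as pd

# Hypothetical values for illustration.
km = pd.Series(["0", "25", "-inf", " ", "31"], name="km_per_litre")

# Heuristic 1: treat only the value 0 as missing.
zero_only = km.isin(["0"])

# Heuristic 2: treat the non-standard encodings as missing instead.
broader = km.str.strip().isin(["", "-", "nan", "inf", "-inf", "+inf"])

print(pd.DataFrame({"value": km, "zero_only": zero_only, "broader": broader}))
```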


-----
## 3. Detecting duplicate entries
Implement a function called `duplicates` that takes as an input a (tidy) dataframe `x`. Assume that `duplicates` receives a dataframe as returned from your Step 0 implementation of `tidy`. It then checks whether there are any duplicates in the dataset. Record the row ids of the observations being duplicates and have `duplicates` return the list in the end. An empty list indicates the absence of duplicated observations.

In [11]:
def duplicates(x):
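    x = x.astype(str)                                                   # compare all cells as strings
    is_duplicate = x[['full_name', 'company_name']].duplicated()        # True for every repeat after the first occurrence
    duplicate_ids = x[is_duplicate].index.tolist()
    return duplicate_ids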

In [12]:
assert_equal(type(duplicates(tidy(mn))), list)
assert_equal(all(isinstance(i, int) for i in duplicates(tidy(mn))), True)

-----
## 4. Detecting outliers
### 4.1. Code part
Implement a function called `detecting_outliers` to detect outliers in one selected quantitative variable. Pick a suitable variable from the tidied dataset based on your characterisation and apply one suitable outlier-detection technique as covered in Unit 4. Justify your choice of this technique in the analytical part. Again, the function is assumed to receive a tidied data set from Step 0. The function returns the row ids of the rows containing outliers on the selected variable.

In [14]:
def detecting_outliers(x, distance = 0.5):
    var = 'km_per_litre'
    values = pd.to_numeric(x[var], errors='coerce')   # numeric copy; the input frame stays untouched
    q25 = values.quantile(0.25)
    q75 = values.quantile(0.75)
    #print(f"Q1 (25th percentile): {q25}, Q3 (75th percentile): {q75}")
    IQ = q75 - q25                                    # interquartile range
    minval = q25 - distance*IQ                        # lower fence
    maxval = q75 + distance*IQ                        # upper fence
    #print(f"IQR: {IQ}, Min value: {minval}, Max value: {maxval}")
    outliers = x[(values < minval) | (values > maxval)]
    index = outliers.index.tolist()
    #print(len(index))
    #display(x.loc[index])
    return index

#detecting_outliers(tidy(mn))

In [15]:
from nose.tools import assert_equal
assert_equal(type(detecting_outliers(tidy(mn))), list)
assert_equal(all(isinstance(i, int) for i in detecting_outliers(tidy(mn))), True)
assert_equal(len(detecting_outliers(tidy(mn))) > 0 and len(detecting_outliers(tidy(mn))) < .05*tidy(mn).shape[0], True)


### 4.2. Analytical part
Discuss the implications. 

- What is the chosen outlier-detection technique? Explain it using your own words in 3-4 sentences.
- Describe the outliers detected: How many? How do they relate to the typical, non-outlier values in the remaining dataset?
- What could be one reason these outliers appear in the dataset? How would you treat them further?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!

Q1 (25th percentile): 19.0, Q3 (75th percentile): 36.0
IQR: 17.0, Min value: 10.5, Max value: 44.5

##### The interquartile range (IQR)
I started by computing the 25th percentile (Q1 = 19.0) and the 75th percentile (Q3 = 36.0) of the data; the IQR (= 17.0) is the difference between Q3 and Q1. Any data point that falls below Q1 - distance*IQR or above Q3 + distance*IQR is considered an outlier. The conventional choice is distance = 1.5; I used distance = 0.5, which gives the bounds 10.5 and 44.5 listed above. This method is effective for highlighting data that lie far outside the central half of the distribution.
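
The fence arithmetic can be checked with the reported quartiles. The helper name `iqr_fences` is hypothetical; the quartile values are the ones printed by `detecting_outliers`, and the second call uses the conventional multiplier of 1.5 for comparison.

```python
def iqr_fences(q1, q3, distance):
    """Return the (lower, upper) fences for a given IQR multiplier."""
    iqr = q3 - q1
    return q1 - distance * iqr, q3 + distance * iqr

# Quartiles reported above: Q1 = 19.0, Q3 = 36.0, so IQR = 17.0.
low, high = iqr_fences(19.0, 36.0, 0.5)
print(low, high)                        # 10.5 44.5

# Conventional multiplier for comparison.
print(iqr_fences(19.0, 36.0, 1.5))      # (-6.5, 61.5)
```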

##### Outliers Detected
From the output of the function, the range for non-outliers was calculated to be between 10.5 and 44.5. I used a distance of 0.5 and ended up with a list containing the indexes of 44 outliers, whose values are either 0 or exceed 60, with a maximum of 82. These outliers are not typical for the dataset, as most values are concentrated between 20 and 40.
##### Reasons for Outliers
I would say that the outliers in my dataset are caused by:
- Natural variation: sometimes outliers represent real but rare events or exceptional cases (e.g., a highly efficient vehicle).

