*General hints:* <br>
* You may use another notebook to test different approaches and ideas. When complete and mature, turn your code snippets into the requested functions in this notebook for submission. 
* Make sure the function implementations are generic and can be applied to any dataset (not just the one provided).
* Add explanatory code comments in the code cells. Make sure that these comments improve our understanding of your implementation decisions.

-----
* Create a variable holding your student id, as shown below. 
* Simply replace the example (`01234567`) with your actual student id having a total of 8 digits. 
* Maintain the variable as a string, do NOT change its type in this notebook!
* *Note: If your student id has 7 digits, add a leading 0. The final student id MUST have 8 digits!*

In [1]:
mn = '11915039'

## 0. Import

Implement a function `tidy` which imports the data set assigned and provided to you as a CSV file into a `pandas` dataframe. Access the data set and establish whether your data set is tidy. If not, clean the data set before continuing with Step 1. Mind all rules of tidying data sets in this step. Make sure you comply with the following statements:
* If there is an index column (row numbers) in your tidied dataset, keep it.
* The following columns, once identified, correspond to variables 1:1 (no need for transformations):
  * `full_name`
  * `automotive`
  * `color`
  * `job`
  * `address`
  * `coordinates`
* The tidied dataset should have a total of 8 columns (not including the index), the first column should be `full_name`.
* Mind the intended content of each attribute (e.g. full_name should contain the full name of a person, no need to change that)
* If tidy or done, have the function `tidy` return the ready data set as a dataframe.

Note that `tidy` must take a single parameter that holds the basename of the CSV file (i.e., the name without file extension). Do NOT change the name of the file, do not overwrite the original data file, and make sure you submit your final ZIP following the [Code of Conduct](https://datascience.ai.wu.ac.at/ws21/dataprocessing1/code_of_conduct.html) requirements. Especially, make sure you put your data file in a folder called `data/` when submitting.

In [2]:
import numpy as np
import pandas
from dateutil.parser import parse


def tidy(x):
    csv = "./data/" + x + ".csv"
    df = pandas.read_csv(csv)

    if len(df.columns) > len(df.index):
        # more columns than rows: the variables are stored in rows, so transpose
        df = df.T

    if pandas.Series(["full_name","automotive","color","job","address","coordinates"]).isin(df.iloc[0]).all():
        # the first row holds the variable names: promote it to the header
        df.columns = df.iloc[0]
        df = df.drop(df.index[0])

    # the last header combines two variable names separated by '/': split it
    lastColumn = str(df.columns[-1]).split('/')
    colDate = lastColumn[0]
    colCompany = lastColumn[1]
    df = df.rename(columns={df.columns[-1]: colDate})
    # new, initially empty column for the company part of each combined value
    df[colCompany] = None

    for index, row in df.iterrows():
        value = row[colDate]
        try:
            # first 19 characters: the length of a 'YYYY-MM-DD HH:MM:SS' timestamp
            date = parse(value[0:19], fuzzy_with_tokens=True)
        except ValueError:
            # no parsable date: the company name starts at the first alphabetic character
            indexAlpha = 0
            for s in value:
                if s.isalpha():
                    break
                indexAlpha += 1

            # write through df.at, because assigning to `row` only changes a copy
            df.at[index, colCompany] = value[indexAlpha:]
            df.at[index, colDate] = np.nan
            continue

        df.at[index, colCompany] = value[19:]
        df.at[index, colDate] = str(date[0])

    return df


In [3]:
from nose.tools import assert_equal
import pandas
assert_equal(type(tidy(mn)), pandas.core.frame.DataFrame)
assert_equal(len((tidy(mn)).columns), 8)
assert_equal(list((tidy(mn)).columns)[0], "full_name")


-------
## 1. Missing values

### 1.1 Code part
Implement a function called `missing_values` which takes a dataframe as input and checks whether there are any missing values in the dataset. Record the row ids of the observations containing missing values as a list of numbers and make sure that the function returns the recorded list in the end. If there are no missing values, `missing_values` should return an empty list.

In [4]:
def missing_values(x):
    # keep only the rows that contain at least one missing value
    df = x[x.isna().any(axis=1)]
    
    return df.index.tolist()

In [5]:
from nose.tools import assert_equal
assert_equal(type(missing_values(tidy(mn))), list)


### 1.2. Analytical part

* Does the dataset contain missing values?
* If no, explain how you proved that this is actually the case.
* If yes, describe the discovered missing values. What could be an explanation for their missingness?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!


Yes, the dataset contains missing values. They occur seemingly at random in every column of the dataset, so I would describe them as MCAR (missing completely at random): whether a value is missing appears to be unrelated to the value itself and to any other variable in the dataset.
One explanation could be that the data-gathering process itself was not very sophisticated. In the original version of the data, two columns had been merged into one, so I would argue that the likelihood of missing values within the observations is high as well.

Another point worth mentioning is that the company name cannot be recovered with 100% accuracy, because the term "None" could actually be part of a company name.
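
The claim that missing values are spread over all columns can be checked by counting them per column. The following is a minimal sketch on a small made-up dataframe; the `toy` data and the helper name `missing_per_column` are illustrative assumptions and not part of the assigned dataset.

```python
import numpy as np
import pandas as pd


def missing_per_column(df):
    """Return the number and the share of missing values for each column."""
    counts = df.isna().sum()
    return pd.DataFrame({"n_missing": counts, "share": counts / len(df)})


# hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "full_name": ["Ann Lee", None, "Bo Chen", "Cy Diaz"],
    "color": ["red", "blue", np.nan, "red"],
    "job": ["nurse", "pilot", "chef", "clerk"],
})

summary = missing_per_column(toy)
print(summary)
```

Roughly equal shares across columns are compatible with MCAR but do not prove it; a column with a clearly higher share would rather point to a column-specific cause.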

------
## 2. Handling missing values
### 2.1. Code part
Apply a (simple) function called *handling_missing_values* for handling missing values using an adequate single-imputation technique of your choice per type of missing values. Make use of the techniques learned in Unit 4. The function should take as an input a dataframe and return the updated dataframe. Mind the following:
- The objective is to apply single imputation on these synthetic data. Do not make up a background story (at this point)!
- Do NOT simply drop the missing values. This is not an option.
- The imputation technique must be adequate for a given variable type (quantitative, qualitative).

In [6]:
def handling_missing_values(x):
    import numpy as np
    from dateutil.parser import parse

    # work on a copy so that the caller's dataframe stays unchanged
    x = x.copy()

    # quantitative variable 'date_time': collect all parsable timestamps
    dates = []
    for value in x['date_time'].dropna():
        try:
            dates.append(parse(str(value), fuzzy_with_tokens=True)[0])
        except ValueError:
            continue

    if dates:
        # mean timestamp: average the epoch seconds and convert back to a datetime
        mean = (np.array(dates, dtype='datetime64[s]')
                .view('i8')
                .mean()
                .astype('datetime64[s]'))
        # store the mean as a string, like the other values in this column
        meanDate = str(parse(str(mean)))
        x['date_time'] = x['date_time'].fillna(meanDate)

    for column in x.columns:
        if column == 'date_time':
            continue
        # qualitative variable: fill with the most frequent observed value (mode)
        counts = x[column].value_counts()
        if not counts.empty:
            x[column] = x[column].fillna(counts.idxmax())

    return x

In [7]:
from nose.tools import assert_equal
assert_equal(len(missing_values(handling_missing_values(tidy(mn)))), 0)
assert_equal(handling_missing_values(tidy(mn)).shape, tidy(mn).shape)

### 2.2. Analytical part
Discuss the implications. Answer the following:
- How would you qualify the data-generating processes leading to different types of missing values, provided that the data was not synthetic?
- What are the benefits and disadvantages of the chosen single-imputation technique?
- How would you apply a multiple-imputation technique to one type of missing values, if applicable at all?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!

- Qualify the data-generating process:
    - I had missing values in every column. One type of missing value was in the `date_time` column, a quantitative variable, so I calculated the mean and inserted it into the empty cells. In my opinion this is not the best way to handle it, because the mean does not fit datetimes well: they cannot be added up in a meaningful way.

    - The second type of missing value occurred in all other columns. These variables are qualitative, so I determined the most frequent value of each column and filled the missing cells with it. The column `coordinates` contains two decimal values from which a mean could be calculated, but according to the lecture it is not necessary to convert the coordinates to numeric values and perform special calculations on them.

- Benefits and disadvantages: the technique makes a complete-case analysis of the sample possible, but the imputed values are mostly identical (especially for qualitative data), which affects the variance of the data and typically reduces it.

- Multiple-imputation technique:
    - I would probably go with a hot-deck approach to generate values for the missing qualitative data, for instance the coordinates. I would take the qualitative variable `address` and perhaps also the qualitative variable `automotive` into consideration to form a contingency table, and use it to arrive at a useful value for filling the missing cells.
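
A hot-deck draw can be turned into a multiple-imputation scheme by repeating it several times with different random seeds, which yields several completed datasets instead of one. The sketch below illustrates this idea under stated assumptions: the function name `hot_deck_impute`, the `toy` data, the choice of `automotive` as the donor class, and `m = 5` repetitions are all hypothetical, and the pooling of results across the completed datasets is not shown.

```python
import numpy as np
import pandas as pd


def hot_deck_impute(df, target, by, rng):
    """Fill missing `target` values with a random donor value from the same `by` class."""
    out = df.copy()
    observed = out[out[target].notna()]
    for idx in out.index[out[target].isna()]:
        # donors: observed rows that share the recipient's class
        donors = observed.loc[observed[by] == out.at[idx, by], target]
        if donors.empty:
            # no donor in the same class: fall back to all observed values
            donors = observed[target]
        out.at[idx, target] = donors.iloc[rng.integers(len(donors))]
    return out


# hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "automotive": ["A", "A", "B", "B", "A", "B"],
    "color": ["red", None, "blue", "green", "red", None],
})

m = 5  # number of imputed datasets (assumed)
completed = [
    hot_deck_impute(toy, "color", "automotive", np.random.default_rng(seed))
    for seed in range(m)
]
```

Each element of `completed` is one plausible completed dataset. An analysis would be run on each of them and the results combined, so that the variation between the imputations reflects the uncertainty about the missing values.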

-----
## 3. Duplicate entries
Implement a function called `duplicates` that takes as an input a (tidy) dataframe `x`. Assume that `duplicates` receives a dataframe as returned from your Step 0 implementation of `tidy`. It then checks whether there are any duplicates in the dataset. Record the row ids of the observations being duplicates and have `duplicates` return the list in the end. An empty list indicates the absence of duplicated observations.

In [8]:
def duplicates(x):
    # rows that fully repeat an earlier row (the first occurrence is not flagged)
    df = x[x.duplicated()]
    
    return df.index.tolist()

In [9]:
from nose.tools import assert_equal
assert_equal(type(duplicates(tidy(mn))), list)


-----
## 4. Handling duplicate entries
### 4.1. Code part
Implement a function called `handling_duplicate_entries` for handling duplicate entries. Again, the function is assumed to receive a tidied data set as obtained from Step 0. It deduplicates the tidy data set. The function then returns the dataframe without duplicates.

In [10]:
def handling_duplicate_entries(x):
    return x.drop(duplicates(x))


In [11]:
from nose.tools import assert_equal
assert_equal(len(duplicates(handling_duplicate_entries(tidy(mn)))), 0)

### 4.2. Analytical part
Discuss the implications. 

- What are the benefits and disadvantages of the chosen duplicate-handling technique?
- What are alternative definitions of (intra-source) duplicates for the given dataset?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!

- I chose to look for full duplicates, meaning that I searched for identical observations in the dataset.
    - The benefit is that the observations remaining in the dataset are unique.
    
    - A possible drawback of this approach is that every single value of an observation has to be compared with the values of another observation, which takes considerably more time than, for example, looking for duplicates only in certain columns.
    
- Alternative definitions could be:
    - regarding the `full_name` column, names that differ only in upper- and lower-case letters
    - regarding the `automotive` column, values that differ only in whether the "-" separator is present
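
These alternative definitions can be sketched by normalizing the two columns before comparing rows. The example below is a minimal illustration on made-up data; the helper name `normalized_duplicates`, the `toy` values, and the decision to compare only these two columns are assumptions, not part of the submitted solution.

```python
import pandas as pd


def normalized_duplicates(df, name_col="full_name", plate_col="automotive"):
    """Return the row ids that repeat an earlier row after normalization."""
    key = pd.DataFrame({
        # ignore case and surrounding whitespace in names
        "name": df[name_col].str.lower().str.strip(),
        # ignore the "-" separator
        "plate": df[plate_col].str.replace("-", "", regex=False),
    })
    return df.index[key.duplicated().to_numpy()].tolist()


# hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "full_name": ["Ann Lee", "ann lee", "Bo Chen"],
    "automotive": ["AB-123", "AB123", "CD-456"],
})

near_duplicates = normalized_duplicates(toy)
exact_duplicates = toy[toy.duplicated()].index.tolist()
print(near_duplicates, exact_duplicates)
```

In this toy data the second row is flagged only by the normalized comparison, while the exact comparison finds no duplicates.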