## Assignment 4
***
*General hints:* <br>
* You may use another notebook to test different approaches and ideas. When complete and mature, turn your code snippets into the requested functions in this notebook for submission. 
* Make sure the function implementations are generic and can be applied to any dataset (not just the one provided).
* Add explanatory code comments in the code cells. Make sure that these comments improve our understanding of your implementation decisions.

-----
* Create a variable holding your student id, as shown below. 
* Simply replace the example (`01234567`) with your actual student id having a total of 8 digits. 
* Maintain the variable as a string, do NOT change its type in this notebook!
* *Note: If your student id has 7 digits, add a leading 0. The final student id MUST have 8 digits!*
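
A minimal sketch of why the id has to stay a string: converting it to a number silently drops a leading zero. The id below is a made-up example, and `raw_id`, `padded_id` and `as_int` are illustrative names only.

```python
# Hypothetical 7-digit id, used for illustration only
raw_id = "1234567"

# Pad with leading zeros up to 8 characters; the value stays a string
padded_id = raw_id.zfill(8)

# Converting to int would lose the leading zero again
as_int = int(padded_id)

print(padded_id)       # 01234567
print(len(padded_id))  # 8
print(as_int)          # 1234567
```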

In [1]:
mn = '12318768'

In [2]:
import pytest
import pandas as pd 
import numpy as np

## 0. Import

Implement a function `tidy` which imports the data set assigned and provided to you as a CSV file into a `pandas` dataframe. Access the data set and establish whether your data set is tidy. If not, clean the data set before continuing with Step 1. Mind all rules of tidying data sets in this step. Make sure you comply with the following statements:
* If there is an index column (row numbers) in your tidied dataset, keep it.
* The following columns, once identified, correspond to variables 1:1 (no need for transformations):
  * `full_name`
  * `automotive`
  * `color`
  * `job`
  * `address`
  * `coordinates`
  * `km_per_litre`
* The tidied dataset should have a total of 9 columns (not including the index), the first column should be `full_name` and the last one `km_per_litre`.
* Mind the intended content of each attribute (e.g. `full_name` should contain the full name of a person, no need to change that)
* If tidy or done, have the function `tidy` return the ready data set as a dataframe.

Note that `tidy` must take a single parameter that holds your student id (`mn`) as one part of the basename (according to the CoC) of the CSV file (i.e., the CoC file name without file extension). Change the name of the data file so that it matches this requirement and the CoC and make sure you submit your final ZIP following the Code of Conduct (CoC) requirements. Especially, make sure you put your data file in a folder called `data/` when submitting.
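
As a rough illustration of one way a data set can be untidy, the sketch below loads a CSV that is stored "sideways" (one variable per line) and transposes it. The CSV content is invented and is not the assignment data; whether a given file needs this step has to be checked by inspection.

```python
import io
import pandas as pd

# Toy CSV stored "sideways": each line is one variable, each column one observation.
# The content is invented for illustration only.
raw = io.StringIO(
    "full_name,Ada Example,Bob Sample\n"
    "color,red,blue\n"
    "km_per_litre,12.5,30.1\n"
)

wide = pd.read_csv(raw, header=None)  # this layout has no header row
tidy_df = wide.set_index(0).T         # variable names become column labels
tidy_df.columns.name = None           # drop the leftover axis name "0"

print(tidy_df.columns.tolist())  # ['full_name', 'color', 'km_per_litre']
print(tidy_df.shape)             # (2, 3)
```

Note that all values are still strings after the transpose, so numeric columns need an explicit conversion later.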

In [3]:
def tidy(x): #this function will tidy the data
    file_path = f"data/{x}.csv" #first we fetch the file path
    my_data_frame = pd.read_csv(file_path, header=None) #then we read the file

    my_data_frame_indexed = my_data_frame.set_index(0) #we set the first column as the index
    df = my_data_frame_indexed.T # and after looking at the data it seems to be transposed, so we transpose it again 

    first_col_name = df.columns[0] #after transposing, a leftover column without a name may come first
    if pd.isna(first_col_name): #if the first column label is missing (NaN), we drop that column
        df = df.drop(columns=[first_col_name])

    combined_column = df["date_time/full_company_name"] #now lets start splitting the name and date column

    date_time_part = combined_column.str[:26] # after looking at the data we can see that the date and time are the first 26 characters, so we can split it after 26 characters
    company_name_part = combined_column.str[26:] #and the rest is the company name

    df["date_time"] = date_time_part.str.strip() #now we can create the new columns, that are stripped
    df["company_name"] = company_name_part.str.strip() #and once again

    df = df.drop(columns=["date_time/full_company_name"]) #and now we remove the original column

    desired_column_order = [ #lets reorder our headers, so they match the assignment
        "full_name",
        "automotive",
        "color",
        "job",
        "address",
        "coordinates",
        "date_time",      # New column
        "company_name",   # New column
        "km_per_litre"    # and lets Ensure this is last
    ]
    df = df[desired_column_order] # and now we reorder the columns

    # and finally we return the dataframe
    return df

In [4]:
assert type(tidy(mn)) == pd.core.frame.DataFrame, "T0.1"
assert len((tidy(mn)).columns) == 9, "T0.2"
assert list((tidy(mn)).columns)[0] == "full_name", "T0.3"
assert list((tidy(mn)).columns)[len((tidy(mn)).columns)-1] == "km_per_litre", "T0.4"

In [5]:
# Edit this cell or remove it, and you shall perish, meow! 😼⚡️

-------
## 1. Missing values

### 1.1 Code part
Implement a function called `missing_values` which takes as an input a dataframe and checks whether there are any missing values in the dataset. Record the row positions (*not* the row labels!) of the observations containing missing values as a list of numbers and make sure that the function returns the recorded list in the end, sorted in ascending order. If there are no missing values, `missing_values` should return an empty list.

**NOTE:** You need to find out how missing values are encoded in your dataset and which missing values occur in it. This ***requires manual inspection*** with the help of Python helpers. For instance, missing values could be encoded as `"nan"` or `"(+/-)inf"`, but other values, empty fields, or fields containing only white space are also conceivable encodings of missing values in your dataset. Do *not* rely on built-in Python or pandas functions alone!

Important: Mind the difference between row positions and row labels. `.index` of a dataframe returns row labels. `.iloc` takes row positions.
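
The difference between row labels and row positions can be sketched on a toy frame with non-default labels. The data and the marker list are assumptions made for this example, not a statement about the assigned dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with non-default row labels (invented data)
toy = pd.DataFrame(
    {"color": ["red", "  ", "blue", "NaN"], "km": ["10", "12", "-inf", "9"]},
    index=[10, 20, 30, 40],
)

# Markers treated as missing here are an assumption; adapt them to the dataset at hand
markers = ["", "nan", "inf", "+inf", "-inf"]

# Normalise every cell to stripped, lower-case text and compare against the markers
normalised = toy.astype(str).apply(lambda col: col.str.strip().str.lower())
row_has_missing = (normalised.isin(markers) | toy.isna()).any(axis=1).to_numpy()

labels = toy.index[row_has_missing].tolist()         # row labels
positions = np.flatnonzero(row_has_missing).tolist() # row positions

print(labels)     # [20, 30, 40]
print(positions)  # [1, 2, 3]
```

Only `positions` can be passed to `.iloc`; `labels` would have to go through `.loc`.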

In [6]:
import pandas as pd # Make sure pandas is imported

def missing_values(x): #now lets find some missing values
    missing_row_indices = [] #here we store the indices of the rows with missing values

    missing_value_strings = ["n/a", "na", "null", "none", "nan", "inf", "-inf", "+inf", ""] #first we define the strings that we want to check for. Lets keep this generalized, so it also works for other dfs

    print(f"Checking for standard missing values (NaN, None) and specific strings: {missing_value_strings}") #lets get a little help to answer the questions

    # and now we start with a stroll through the dataframe
    for i in range(len(x)):
        row = x.iloc[i] #now we for each row
        # Use .items() to get both column name and value, helpful for reporting
        for col_name, item in row.items(): # we check each item

            if pd.isna(item): #and we start with the standard pandas check for NaN/None
                print(f"Info: Row position {i}, Column '{col_name}': Found standard missing value (pd.isna=True). Original value: {item}") #then we report where we found a missing value
                missing_row_indices.append(i) #if we find a missing value we add the index to the list
                break # and then we Move to next row once a missing value is found
            # now we check for any leftovers
            try:
                # Convert to string, strip whitespace, convert to lowercase
                item_str_lower = str(item).strip().lower()
                if item_str_lower in missing_value_strings:
                    print(f"Info: Row position {i}, Column '{col_name}': Found custom missing value string. Original value: '{item}', Matched string: '{item_str_lower}'") #here we print another missing item
                    missing_row_indices.append(i) #if we find a missing value we also add the index to the list
                    break # and lets move on again
            except Exception:
                # str() could fail on an unusual object; in that case we skip the cell instead of aborting the scan
                pass

    final_sorted_list = sorted(list(set(missing_row_indices))) #now we sort the list and remove potential duplicates if break was removed

    if not final_sorted_list:
        print("Result: No missing values detected.") # a bit more output to answer the questions
    else:
        print(f"Result: Found {len(final_sorted_list)} rows with missing values.")

    # and finally we return the list
    return final_sorted_list


In [7]:
assert type(missing_values(tidy(mn))) == list, "T1.1"
assert all(isinstance(i, int) for i in missing_values(tidy(mn))), "T1.2"

Checking for standard missing values (NaN, None) and specific strings: ['n/a', 'na', 'null', 'none', 'nan', 'inf', '-inf', '+inf', '']
Info: Row position 2, Column 'company_name': Found custom missing value string. Original value: 'NaN', Matched string: 'nan'
Info: Row position 57, Column 'automotive': Found custom missing value string. Original value: '-inf', Matched string: '-inf'
Info: Row position 131, Column 'full_name': Found standard missing value (pd.isna=True). Original value: nan
Info: Row position 159, Column 'color': Found standard missing value (pd.isna=True). Original value: nan
Info: Row position 165, Column 'automotive': Found standard missing value (pd.isna=True). Original value: nan
Info: Row position 184, Column 'coordinates': Found custom missing value string. Original value: '+inf', Matched string: '+inf'
Info: Row position 201, Column 'color': Found standard missing value (pd.isna=True). Original value: nan
Info: Row position 211, Column 'coordinates': Found custo

In [8]:
# Edit this cell or remove it, and you shall perish, meow! 😼⚡️

### 1.2. Analytical part

* Does the dataset contain missing values?
* Explain your manual-inspection procedure and the Python helpers used!
* If no, explain how you proved that this is actually the case. 
* If yes, describe the discovered missing values. What could be an explanation for their missingness?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!


My dataset contains missing values, as shown by the function output and by manual inspection. I started the manual inspection by trying to open the file with Numbers, which reported an error and said the CSV was broken. I then opened it in a text editor and used Ctrl+F to find several placeholder values such as "nan". Code-wise, my Python function uses `.iloc` for positional indexing, `pd.isna()` for NaN/None detection, and the string methods `.lower()` and `.strip()` to standardize text before checking it against missing-value indicators like 'n/a'. It returns a sorted list of the affected row positions.

The standard NaN values are most likely fields for which no information was recorded. String "NaN" entries were probably typed into the system literally, and empty strings might be fields left blank when the data was collected. The +/-inf values might stem from miscalculations or other errors in numerical data.
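
The text-editor search described above can be approximated in code by tallying the raw spellings of suspicious cells. This is only a sketch: `marker_spellings` is a hypothetical helper, and the toy frame and marker list are invented for the example.

```python
import pandas as pd

def marker_spellings(df, markers):
    """Tally the raw spellings of cells that look like missing values."""
    cells = pd.Series(df.to_numpy(dtype=object).ravel())
    normalised = cells.astype(str).str.strip().str.lower()
    hits = cells[normalised.isin(markers) | cells.isna()]
    # repr() keeps 'NaN', '' and a real missing value distinguishable
    return hits.map(repr).value_counts()

toy = pd.DataFrame({"job": ["nurse", None, "NaN"], "color": ["red", "", "blue"]})
spellings = marker_spellings(toy, ["", "nan", "none", "inf", "+inf", "-inf"])
print(spellings)
```

Looking at the distinct spellings helps decide which markers the detection function actually has to cover.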

------
## 2. Handling missing values
### 2.1. Code part
Apply a (simple) function called *handling_missing_values* for handling missing values using an adequate single-imputation technique (or, one of the alternatives to single imputation) of your choice per type of missing values. Make use of the techniques learned in Unit 4. The function should take as an input a dataframe and return the updated dataframe. Mind the following:
- The objective is to apply single imputation on these synthetic data. Do not make up a background story (at this point)!
- Do NOT simply drop the missing values. This is not an option.
- The imputation technique must be adequate for a given variable type (quantitative, qualitative).
- To establish whether a variable is quantitative or qualitative, it is *not* sufficient to only inspect on data types!
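
Why the data type alone is not enough can be sketched as follows: every column below has a text dtype, yet only one is a quantity. `numeric_share` is a hypothetical helper and the columns (including `zip_code`) are invented for illustration.

```python
import pandas as pd

def numeric_share(series):
    """Fraction of non-missing cells that can be parsed as numbers."""
    non_missing = series.dropna()
    if non_missing.empty:
        return 0.0
    parsed = pd.to_numeric(non_missing, errors="coerce")
    return float(parsed.notna().mean())

toy = pd.DataFrame({
    "km_per_litre": ["12.5", "30.1", "n/a", "18"],  # numbers stored as text
    "zip_code": ["1010", "1020", "1030", "1040"],   # digits, but really a label
    "color": ["red", "blue", "red", "green"],
})

for col in toy.columns:
    print(col, toy[col].dtype, numeric_share(toy[col]))
```

A high share only says that the values parse as numbers. Whether they are a measurement (like `km_per_litre`) or a label (like the invented `zip_code`) still has to be judged from the meaning of the variable.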

In [9]:
def handling_missing_values(x):
    df = x.copy() # We'll work on a copy, in case we still need the original

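    # same marker list as in missing_values (redefined here because the list there is local to that function)
    missing_value_strings = ["n/a", "na", "null", "none", "nan", "inf", "-inf", "+inf", ""]

    # compare stripped, lower-cased cell text against the markers, so variants like "NaN" or " inf " are caught too
    normalised = df.astype(str).apply(lambda col: col.str.strip().str.lower())
    df = df.mask(normalised.isin(missing_value_strings)) # matching cells become NaN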

    problematic_mode_strings = {s.lower() for s in missing_value_strings if isinstance(s, str)} # and now we convert all strings to lower case (to  help with the mode later on)

    for col in df.columns: #now we go through each column and fill in values
        if col == "km_per_litre": #for the km_per_litre column we use the mean
            df[col] = pd.to_numeric(df[col], errors='coerce') # first we ensure it's numeric
            mean_val = df[col].mean() # then we calculate the mean
            if pd.notna(mean_val): # Only fill if mean could be calculated
                df[col] = df[col].fillna(mean_val)

        elif col == "coordinates": #now we look at the coordinates column
            # For 'coordinates', use a specific placeholder, 0,0
            df[col] = df[col].fillna("(0,0)")

        else:
            # For 'date_time' and all other columns, try filling with the mode (most frequent value).
            # If mode is problematic or doesn't exist, use a default.
            default_fill = "1970-01-01 00:00:00" if col == "date_time" else "Unknown" # 1970-01-01 is the Unix epoch; any timestamp would do, as long as it is recognizable as a default value
            fill_value = default_fill # Start with the default

            mode_result = df[col].mode() # now we calculate the mode
            if not mode_result.empty: # if the mode is not empty
                potential_mode = mode_result[0] # we take the first value
                # Check if the mode itself is NaN or looks like one of our missing markers
                is_problematic = pd.isna(potential_mode) or \
                                 (isinstance(potential_mode, str) and \
                                  potential_mode.strip().lower() in problematic_mode_strings) # and now we check if the mode is problematic (and we get to use some pretty booleans)

                if not is_problematic:
                    fill_value = potential_mode # Use the mode if it's valid

            # Fill any remaining NaNs in the column with the defined value above ("Unknown" or "1970-01-01 00:00:00")
            df[col] = df[col].fillna(fill_value)

    return df # and now we return the dataframe

In [10]:
assert len(missing_values(handling_missing_values(tidy(mn)))) == 0, "T2.1"
assert handling_missing_values(tidy(mn)).shape == tidy(mn).shape, "T2.2"

Checking for standard missing values (NaN, None) and specific strings: ['n/a', 'na', 'null', 'none', 'nan', 'inf', '-inf', '+inf', '']
Result: No missing values detected.


### 2.2. Analytical part
Discuss the implications. Answer the following:

- How would you qualify the data-generating processes leading to different types of missing values, provided that the data was not synthetic?
- What are the benefits and disadvantages of the chosen single-imputation technique?
- How would you apply a multiple-imputation technique to one type of missing values, if applicable at all?
- We asked you to test for/treat as missing values by checking certain field values, as well as empty fields or fields containing the numeric value 0... what are potential problems of these heuristics?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!

1. Missing Data
    - MCAR (Missing Completely At Random): Missingness is random & unrelated to any values (e.g., random data corruption).
    - MAR (Missing At Random): Missingness depends only on other observed variables (e.g., older cars less likely to report fuel efficiency, but not based on the efficiency itself).
    - MNAR (Missing Not At Random): Missingness depends on the missing value itself (e.g., people with very low fuel efficiency avoid reporting it). Determining the true type requires domain knowledge.

2. Single-Imputation pros and cons
    - Pro: Simple, fast, creates a complete dataset for algorithms.
    - Con: Underestimates uncertainty (false precision), distorts relationships (correlations, variance), can bias results.
    - Mean Imputation (km_per_litre): Pro: Preserves mean. Con: Reduces variance, sensitive to outliers, ignores other variables.
    - Mode Imputation (Others): Pro: Simple for categorical data. Con: Distorts distribution (artificial spike), ignores other variables.

3. Use of multiple-imputation 
    MI addresses the uncertainty ignored by single imputation.
    3.1. Model: Build a model predicting the missing variable (e.g., km_per_litre) using other variables.
    3.2. Impute Multiple Times: Generate several plausible values for each missing entry based on the model and its uncertainty, creating multiple complete datasets.
    3.3. Analyze: Perform the analysis on each dataset separately.
    3.4. Pool: Combine the results using specific rules (Rubin's Rules) to get a final estimate and correct uncertainty.

4. Issues with testing for missing values
    - False Positives: Treating valid data as missing.
        - 0 is often a real measurement, not missing.
        - "" (empty string) can be intentionally blank, not unknown.
        - Specific strings ("NA") might be valid categories in some contexts.
    - False Negatives: Failing to detect actual missing data.
        - The list of missing strings ("n/a", etc.) might be incomplete (e.g., it would miss "?" or -999).
    - Context Matters: The meaning of 0, "", or specific strings depends heavily on the variable. Universal rules without context are risky.
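
The four multiple-imputation steps from point 3 can be sketched with plain NumPy. This is a simplified illustration on synthetic data, not the assignment's dataset: it only adds residual noise to the predictions, whereas a proper MI procedure would also draw the model parameters for each imputation. All variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic toy data (invented): y depends linearly on x, about 20% of y is missing
n = 200
x = rng.normal(50, 10, n)
y = 5 + 0.4 * x + rng.normal(0, 2, n)
missing = rng.random(n) < 0.2
y_obs = np.where(missing, np.nan, y)

# Step 1 (model): fit a simple linear regression on the observed rows
X_obs = np.column_stack([np.ones((~missing).sum()), x[~missing]])
beta, *_ = np.linalg.lstsq(X_obs, y_obs[~missing], rcond=None)
sigma = (y_obs[~missing] - X_obs @ beta).std(ddof=2)

# Steps 2 and 3 (impute, analyse): m completed datasets, one mean estimate each
m = 20
estimates, variances = [], []
for _ in range(m):
    y_imp = y_obs.copy()
    pred = beta[0] + beta[1] * x[missing]
    y_imp[missing] = pred + rng.normal(0, sigma, missing.sum())
    estimates.append(y_imp.mean())
    variances.append(y_imp.var(ddof=1) / n)

# Step 4 (pool): Rubin's rules combine within- and between-imputation variance
q_bar = np.mean(estimates)
within = np.mean(variances)
between = np.var(estimates, ddof=1)
total_var = within + (1 + 1 / m) * between

print(f"pooled mean: {q_bar:.2f}, pooled standard error: {np.sqrt(total_var):.3f}")
```

The pooled variance is never smaller than the average within-imputation variance, which is how MI reflects the extra uncertainty that single imputation ignores.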

-----
## 3. Detecting duplicate entries
Implement a function called `duplicates` that takes as an input a (tidy) dataframe `x` and a list of column labels (`VARIABLES`). Assume that `duplicates` receives a dataframe as returned from your Step 0 implementation of `tidy`. It then checks whether there are any duplicates in the dataset. Record the row positions of the second and any later observations being duplicates and have `duplicates` return the list of row positions, sorted in ascending order, in the end. An empty list indicates the absence of duplicated observations.

Important:
* The first observation that belongs to the detected duplicates is *not* considered a duplicate!
* Mind the difference between row positions and row labels. `.index` of a dataframe returns row labels. `.iloc` takes row positions.
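
Both points can be checked on a small invented frame: with `keep="first"` the first occurrence is not flagged, and the mask can be turned into positions independently of the row labels.

```python
import numpy as np
import pandas as pd

# Toy frame with non-default row labels (invented data)
toy = pd.DataFrame(
    {"full_name": ["Ada", "Bob", "Ada", "Ada"], "color": ["red", "blue", "red", "red"]},
    index=[101, 102, 103, 104],
)

# keep="first" marks only the second and later occurrences as duplicates
mask = toy.duplicated(subset=["full_name", "color"], keep="first").to_numpy()

labels = toy.index[mask].tolist()         # row labels
positions = np.flatnonzero(mask).tolist() # row positions

print(labels)     # [103, 104]
print(positions)  # [2, 3]
```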

In [11]:
df = tidy(mn) #lets put our tidy function to work
VARIABLES = df.columns.tolist(); #and lets get the columns and put them in a list

def duplicates(data, vars):
    if not isinstance(data, pd.DataFrame): #first we check if the input is a dataframe
        raise TypeError("Input must be a pandas DataFrame")
    
    if not vars or not isinstance(vars, list) or not all(v in data.columns.tolist() for v in vars): # lets 1. check if vars list is empty 2. if it even is a list 3. if all the variables are in the dataframe (essentially, is the list correct, given the headers of our df)
        return "Name variables defining potential duplicates!"
    
    mask_duplicates = data.duplicated(subset=vars, keep='first') #now we create a mask for the duplicates

    duplicate_pos = np.flatnonzero(mask_duplicates.to_numpy()).tolist() #then we turn the mask into row positions (this also works if row labels are not unique)

    return sorted(duplicate_pos) #and now we return the sorted list




In [12]:
df = tidy(mn);
assert len(VARIABLES) > 0 and all([v in df.columns.tolist() for v in VARIABLES]), "T3.1"
assert duplicates(df, [list]) == "Name variables defining potential duplicates!", "T3.2"
assert duplicates(df, None) == "Name variables defining potential duplicates!", "T3.3"
assert type(duplicates(df, vars = df.columns.tolist())) == list, "T3.4"
assert all(isinstance(i, int) for i in duplicates(df, df.columns.tolist())), "T3.5"

In [13]:
# Edit this cell or remove it, and you shall perish, meow! 😼⚡️


-----
## 4. Detecting outliers
### 4.1. Code part
Implement a function called `detecting_outliers` to detect outliers in one selected quantitative variable. Pick a suitable variable from the tidied dataset based on your characterisation and apply one suitable outlier-detection technique as covered in Unit 4. Justify your choice of this technique in the analytical part. Again, the function is assumed to receive a tidied data set from Step 0. The function returns the row positions (*not* row labels!) of the rows containing outliers on the selected variable, sorted in ascending order.

In [14]:
def detecting_outliers(df_input): #lets find some gas guzzlers!
    col_name = "km_per_litre" #first we define the column we want to analyze
    std_dev_multiplier = 2.0 # and then the sd multiplier

    df = df_input.copy() # we will use a copy to leave the original untouched

    if col_name not in df.columns: # first we check if the given col even exists
        print(f"Error: Required column '{col_name}' not found in DataFrame.")
        return []

    original_non_null_count = df[col_name].notna().sum() #first we store the original count of non-missing entries
    df[col_name] = pd.to_numeric(df[col_name], errors='coerce').replace([np.inf, -np.inf], np.nan) # then we convert the column to numeric; unparseable and infinite entries become NaN
    numeric_non_null_count = df[col_name].notna().sum() # and we store the count of non-missing entries after the conversion

    if original_non_null_count != numeric_non_null_count: # and then a little check to see if the coercion caused changes
         print(f"Note: Converted column to numeric. Valid entries changed from {original_non_null_count} to {numeric_non_null_count}.")

    # now that we know we only have numeric values left, we drop the Nans
    numeric_data = df[col_name].dropna()
    num_valid_points = len(numeric_data)
    print(f"Found {num_valid_points} valid numeric data points for analysis.")

    if num_valid_points < 2: # now lets see if we have enough data to calculate the standard deviation
        print(f"Not enough data ({num_valid_points}) to calculate standard deviation. Cannot detect outliers.")
        return [] # Return empty list if not enough data

    mean_val = numeric_data.mean() #and now we can finally get to the stats part. lets calc the mean
    std_dev_val = numeric_data.std() #and the std dev

    # --> Print the calculated statistics <--
    print(f"Calculated Mean for '{col_name}': {mean_val:.2f}")
    print(f"Calculated Standard Deviation for '{col_name}': {std_dev_val:.2f}")

    if std_dev_val == 0 or pd.isna(std_dev_val): #now we check if the std dev is 0 or NaN
        print("Standard deviation is zero or NaN. Outliers cannot be determined by this method.")
        return [] # Return empty list

    lower_bound = mean_val - std_dev_multiplier * std_dev_val #now we get to the bounds part. first the lower bound
    upper_bound = mean_val + std_dev_multiplier * std_dev_val #and then the upper bound

    print(f"Calculated Lower Bound ({std_dev_multiplier} std dev): {lower_bound:.2f}") #and lets print them
    print(f"Calculated Upper Bound ({std_dev_multiplier} std dev): {upper_bound:.2f}")

    is_outlier = ((df[col_name] < lower_bound) | (df[col_name] > upper_bound)) & df[col_name].notna() # now we get the outliers with a mask

    outlier_positions = np.where(is_outlier)[0].tolist() # and use the mask to the get positions of the outliers
    num_outliers = len(outlier_positions) #and then we get the number of outliers

    print(f"Number of outliers detected: {num_outliers}") #and lets print the number of outliers

    return sorted(outlier_positions)

In [15]:
df = tidy(mn);
assert type(detecting_outliers(df)) == list, "T4.1"
assert all(isinstance(i, int) for i in detecting_outliers(df)), "T4.2"
assert len(detecting_outliers(df)) > 0 and len(detecting_outliers(df)) < .05*df.shape[0]


Found 1671 valid numeric data points for analysis.
Calculated Mean for 'km_per_litre': 27.68
Calculated Standard Deviation for 'km_per_litre': 11.40
Calculated Lower Bound (2.0 std dev): 4.87
Calculated Upper Bound (2.0 std dev): 50.49
Number of outliers detected: 57

### 4.2. Analytical part
Discuss the implications. 

- What is the chosen outlier-detection technique? Explain it using your own words in 3-4 sentences.
- Describe the outliers detected: How many? How do they relate to the typical, non-outlier values in the remaining dataset?
- What could be one reason these outliers appear in the dataset? How would you treat them further?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!

- The chosen outlier-detection technique is the 2 x Standard Deviation Method. This method operates by first calculating the average (mean) and the standard deviation of the selected variable (km_per_litre). It then establishes a range considered "typical" by going a specific number of standard deviations (in this case 2) below and above the mean. Any data point that falls outside this calculated range is classified as an outlier, indicating it's statistically far from the central tendency of the data.
- The analysis detected 57 outliers in the km_per_litre variable. These outliers represent vehicles with fuel efficiency values that fall outside the calculated "typical" range. Specifically, they are vehicles with a km_per_litre value either less than 4.87 (exceptionally low fuel efficiency) or greater than 50.49 (exceptionally high fuel efficiency). These outlier values contrast significantly with the bulk of the dataset, where typical, non-outlier vehicles have fuel efficiencies falling between 4.87 and 50.49 km/litre, centered around the calculated mean of approximately 27.68 km/litre.
- One possible reason for these outliers could be the presence of genuine extreme vehicle types within the dataset. For example, very low km_per_litre values (< 4.87) might correspond to heavy-duty vehicles, large trucks, or high-performance sports cars not representative of typical passenger cars. Conversely, very high values (> 50.49) could represent highly fuel-efficient vehicles like hybrids, potentially electric vehicles if measured on a comparable scale, motorcycles included erroneously, or data entry errors.
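
The 2 x standard deviation rule described above can be sketched on a few invented values. `two_sd_positions` is a hypothetical helper, not the function submitted in this notebook. The second call shows a known weakness of the rule: the mean and the standard deviation are themselves pulled by extreme values, so one very large value can hide a smaller one.

```python
import numpy as np
import pandas as pd

def two_sd_positions(series, k=2.0):
    """Row positions of values further than k standard deviations from the mean."""
    mean, sd = series.mean(), series.std()
    outside = (series < mean - k * sd) | (series > mean + k * sd)
    return np.flatnonzero(outside.to_numpy()).tolist()

# Invented values: 40 and 90 both look unusual next to the rest
values = pd.Series([25, 27, 26, 28, 24, 26, 27, 25, 40, 90], dtype=float)

print(two_sd_positions(values))           # [9]  (90 inflates the spread, so 40 is not flagged)
print(two_sd_positions(values.iloc[:9]))  # [8]  (without 90, the value 40 is flagged)
```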
