## Assignment 4
***
*General hints:* <br>
* You may use another notebook to test different approaches and ideas. When complete and mature, turn your code snippets into the requested functions in this notebook for submission. 
* Make sure the function implementations are generic and can be applied to any dataset (not just the one provided).
* Add explanatory code comments in the code cells. Make sure that these comments improve our understanding of your implementation decisions.

-----
* Create a variable holding your student id, as shown below. 
* Simply replace the example (`01234567`) with your actual student id having a total of 8 digits. 
* Maintain the variable as a string, do NOT change its type in this notebook!
* *Note: If your student id has 7 digits, add a leading 0. The final student id MUST have 8 digits!*

In [1]:
mn = '12318768'

In [2]:
import pytest
import pandas as pd 
import numpy as np

## 0. Import

Implement a function `tidy` which imports the data set assigned and provided to you as a CSV file into a `pandas` dataframe. Access the data set and establish whether it is tidy. If not, clean the data set before continuing with Step 1. Mind all rules of tidying data sets in this step. Make sure you comply with the following statements:
* If there is an index column (row numbers) in your tidied dataset, keep it.
* The following columns, once identified, correspond to variables 1:1 (no need for transformations):
  * `full_name`
  * `automotive`
  * `color`
  * `job`
  * `address`
  * `coordinates`
  * `km_per_litre`
* The tidied dataset should have a total of 9 columns (not including the index), the first column should be `full_name` and the last one `km_per_litre`.
* Mind the intended content of each attribute (e.g. `full_name` should contain the full name of a person, no need to change that)
* If tidy or done, have the function `tidy` return the ready data set as a dataframe.

Note that `tidy` must take a single parameter that holds your student id (`mn`) as one part of the basename (according to the CoC) of the CSV file (i.e., the CoC file name without file extension). Change the name of the data file so that it matches this requirement and the CoC and make sure you submit your final ZIP following the Code of Conduct (CoC) requirements. Especially, make sure you put your data file in a folder called `data/` when submitting.
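
As a rough illustration of the tidying steps described above, the following sketch works on a made-up miniature file in which each row holds one variable and each column one observation. The layout, the sample values, and the names `raw`, `wide`, and `tidy_toy` are assumptions for illustration only; your own file may need different steps.

```python
import io

import pandas as pd

# Hypothetical miniature of a "wide" file: one variable per row, one observation per column.
raw = io.StringIO(
    "full_name,Ada Lovelace,Alan Turing\n"
    "date_time/full_company_name,2021-03-04 10:15:30.123456 Acme Ltd,2020-01-02 08:00:00.000000 Foo Inc\n"
    "km_per_litre,12.5,9.8\n"
)

# Read without a header, use the first column as variable names, then transpose
# so that every variable becomes a column and every observation a row.
wide = pd.read_csv(raw, header=None)
tidy_toy = wide.set_index(0).T.reset_index(drop=True)
tidy_toy.columns.name = None

# Split the combined column: the first 26 characters hold the timestamp,
# the remainder holds the company name.
combined = tidy_toy.pop("date_time/full_company_name")
tidy_toy["date_time"] = pd.to_datetime(combined.str[:26], errors="coerce")
tidy_toy["full_company_name"] = combined.str[26:].str.strip()

# Values arrive as strings after the transpose; convert the measurement to numbers.
tidy_toy["km_per_litre"] = pd.to_numeric(tidy_toy["km_per_litre"], errors="coerce")

print(tidy_toy.dtypes)
```

Each column is assigned as a whole so that it takes the dtype of the converted values.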

In [18]:
import pytest
import pandas as pd
import numpy as np
# import re # Not needed
# import os # Not needed

# Make sure 'mn' is defined before calling tidy
mn = '12318768'

def tidy(x):
    """
    Imports and tidies the dataset from a CSV file using only pandas and numpy.

    Args:
        x (str): The student ID, used to construct the filename (e.g., 'data/12318768.csv').

    Returns:
        pandas.DataFrame: The tidied dataframe, or None if the file cannot be read.
    """
    # --- Step 1: Define potential file paths using string formatting ---
    filepath_data = f'data/{x}.csv'
    filepath_current = f'{x}.csv'
    df = None # Initialize df to None

    # --- Step 2: Try reading the file from specified paths ---
    try:
        df = pd.read_csv(filepath_data, header=None)
    except FileNotFoundError:
        try:
            df = pd.read_csv(filepath_current, header=None)
        except FileNotFoundError:
            print(f"Error: Data file not found at {filepath_data} or {filepath_current}")
            return None
        except Exception as e:
             print(f"Error reading {filepath_current}: {e}")
             return None
    except Exception as e:
         print(f"Error reading {filepath_data}: {e}")
         return None

    # --- Step 3: Proceed with tidying ONLY if df was loaded successfully ---
    if df is None:
        print("Error: DataFrame was not loaded.")
        return None

    # Set the first column (variable names) as the index
    try:
        df = df.set_index(0)
    except KeyError:
        print("Error: Cannot set index. Column 0 might be missing or dataframe structure is unexpected.")
        return None

    # Transpose the dataframe
    df = df.T

    # Clean index and column names
    df.index.name = None
    df.columns.name = None

    # Define and replace non-standard missing values
    na_list = ['NaN', 'nan', 'NA', 'N/A', 'n/a', '--', '-inf', 'inf', 'None', '']
    df = df.replace(na_list, np.nan)
    df = df.replace(r'^\s*$', np.nan, regex=True) # Handle whitespace-only strings

    # Split the combined 'date_time/full_company_name' column using string slicing
    combined_col = 'date_time/full_company_name'
    datetime_len = 26 # Length of 'YYYY-MM-DD HH:MM:SS.ffffff'

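    if combined_col in df.columns:
        # Work on a string version; missing entries become 'nan' and end up missing in both new columns
        combined = df[combined_col].astype(str)
        # Plain column assignment replaces the column, so the datetime dtype is kept
        # (in recent pandas versions, df.loc[:, col] = ... writes into the existing object column instead)
        df['date_time'] = pd.to_datetime(combined.str[:datetime_len], errors='coerce')
        company = combined.str[datetime_len:].str.strip()
        # An empty remainder means there is no company name
        df['full_company_name'] = company.replace('', np.nan)
        df = df.drop(columns=[combined_col])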
    else:
        if 'date_time' not in df.columns: df['date_time'] = pd.NaT
        if 'full_company_name' not in df.columns: df['full_company_name'] = np.nan
        print(f"Warning: Column '{combined_col}' not found for splitting.")

    # Convert 'km_per_litre' to numeric
    if 'km_per_litre' in df.columns:
        # Assign the whole column so it actually becomes numeric instead of staying object dtype
        df['km_per_litre'] = pd.to_numeric(df['km_per_litre'], errors='coerce')
    else:
        df['km_per_litre'] = np.nan
        print("Warning: Column 'km_per_litre' not found.")

    # Ensure all required columns exist and set the final order
    final_cols_order = ['full_name', 'automotive', 'color', 'job', 'address', 'coordinates', 'date_time', 'full_company_name', 'km_per_litre']
    for col in final_cols_order:
        if col not in df.columns:
            print(f"Warning: Expected column '{col}' missing. Adding as NaN/NaT.")
            if col == 'date_time':
                df[col] = pd.NaT
            else:
                df[col] = np.nan

    # Reorder columns
    try:
        df = df[final_cols_order]
    except KeyError as e:
        print(f"Error reordering columns. Missing expected columns: {e}")
        available_cols = [col for col in final_cols_order if col in df.columns]
        df = df[available_cols]

    # Reset index to standard integer index
    df = df.reset_index(drop=True)

    return df

# --- Call tidy function ---
tidied_df = tidy(mn)

# --- Export the tidied DataFrame to CSV ---
if tidied_df is not None:
    output_filename = f"tidied_data_{mn}.csv"
    try:
        # index=False prevents pandas from writing the DataFrame index as a column
        tidied_df.to_csv(output_filename, index=False)
        print(f"Successfully exported tidied data to '{output_filename}'")
    except Exception as e:
        print(f"Error exporting DataFrame to CSV: {e}")

    # --- Run Assertions (as before) ---
    assert type(tidied_df) == pd.core.frame.DataFrame, "T0.1 Failed: Result is not a DataFrame"
    print("T0.1 Passed: Result is a DataFrame.")
    assert len(tidied_df.columns) == 9, f"T0.2 Failed: Expected 9 columns, got {len(tidied_df.columns)}"
    print("T0.2 Passed: Correct number of columns.")
    assert list(tidied_df.columns)[0] == "full_name", f"T0.3 Failed: First column is {list(tidied_df.columns)[0]}, expected 'full_name'"
    print("T0.3 Passed: First column is 'full_name'.")
    assert list(tidied_df.columns)[-1] == "km_per_litre", f"T0.4 Failed: Last column is {list(tidied_df.columns)[-1]}, expected 'km_per_litre'"
    print("T0.4 Passed: Last column is 'km_per_litre'.")
else:
    print("Tidying failed, cannot run assertions or export.")

Successfully exported tidied data to 'tidied_data_12318768.csv'
T0.1 Passed: Result is a DataFrame.
T0.2 Passed: Correct number of columns.
T0.3 Passed: First column is 'full_name'.
T0.4 Passed: Last column is 'km_per_litre'.


In [17]:
assert type(tidy(mn)) == pd.core.frame.DataFrame, "T0.1"
assert len((tidy(mn)).columns) == 9, "T0.2"
assert list((tidy(mn)).columns)[0] == "full_name", "T0.3"
assert list((tidy(mn)).columns)[len((tidy(mn)).columns)-1] == "km_per_litre", "T0.4"

In [None]:
# Edit this cell or remove it, and you shall perish, meow! 😼⚡️

-------
## 1. Missing values

### 1.1 Code part
Implement a function called `missing_values` which takes a dataframe as input and checks whether there are any missing values in the dataset. Record the row positions (*not* the row labels!) of the observations containing missing values as a list of numbers, and make sure the function returns this list sorted in ascending order. If there are no missing values, `missing_values` should return an empty list.

**NOTE:** You need to find out how missing values are encoded in your dataset and which ones actually occur. This ***requires manual inspection*** with the help of Python helpers. For instance, missing values could be encoded as `"nan"` or `"(+/-)inf"`, but other values, empty fields, or fields containing only white space are also conceivable. Do *not* rely on built-in Python or pandas functions alone!
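
One possible way to support such a manual inspection is sketched below on a made-up frame. The marker list `candidates` and the helper `suspicious_mask` are hypothetical names introduced here, not part of the assignment; the list should be extended after looking at the value counts of your own data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "color": ["red", " ", "nan", "blue", None],
    "km_per_litre": ["12.5", "-inf", "9.8", "", "11.0"],
})

# Hypothetical candidate markers (compared after stripping and lower-casing).
candidates = {"", "nan", "na", "n/a", "none", "null", "inf", "-inf", "+inf"}


def suspicious_mask(frame):
    """Return a boolean frame that is True where a cell looks like a missing value."""
    def looks_missing(value):
        if pd.isna(value):
            return True
        if isinstance(value, float) and np.isinf(value):
            return True
        return isinstance(value, str) and value.strip().lower() in candidates

    return frame.apply(lambda col: col.map(looks_missing))


# Value counts (including NaN) are a simple helper for spotting odd encodings by eye.
print(toy["color"].value_counts(dropna=False))

mask = suspicious_mask(toy)
rows = np.flatnonzero(mask.any(axis=1).to_numpy()).tolist()
print(rows)
```

Only the first row of the toy frame is free of suspicious cells, so every other row position is reported.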

Important: Mind the difference between row positions and row labels. `.index` of a dataframe returns row labels. `.iloc` takes row positions.
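
The difference between labels and positions can be seen on a small made-up frame whose index does not start at 0. This is only a sketch; the frame and its index values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Index labels 10, 20, 30, 40 deliberately differ from the positions 0, 1, 2, 3.
toy = pd.DataFrame({"value": [1.0, np.nan, 3.0, np.nan]}, index=[10, 20, 30, 40])
mask = toy["value"].isna().to_numpy()

labels = toy.index[mask].tolist()          # row labels: [20, 40]
positions = np.flatnonzero(mask).tolist()  # row positions: [1, 3]

# Positions are what .iloc expects; labels are what .loc expects.
print(toy.iloc[positions])
print(toy.loc[labels])
```

Both selections return the same two rows, but only `positions` remains valid if the index is later changed.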

In [19]:
def missing_values(x):
    """
    Finds the row positions containing missing values (NaN or NaT) in a dataframe.

    Args:
        x (pandas.DataFrame): The input dataframe (assumed to be tidied).

    Returns:
        list: A sorted list of integer row positions containing missing values.
              Returns an empty list if no missing values are found or input is invalid.
    """
    # Check if the input is a DataFrame
    if not isinstance(x, pd.DataFrame):
        print("Error: Input must be a pandas DataFrame.")
        return []

    # Identify rows containing any NaN/NaT values across columns
    # .isnull() detects both np.nan and pd.NaT
    rows_with_nan_mask = x.isnull().any(axis=1)

    # Convert the boolean mask to 0-based row positions (not index labels),
    # so the result is correct for any index, not only a freshly reset RangeIndex.
    # np.flatnonzero returns positions in ascending order; .tolist() yields plain Python ints.
    nan_positions = np.flatnonzero(rows_with_nan_mask.to_numpy()).tolist()

    return nan_positions

# --- Example Call and Assertion ---
# Ensure the necessary imports (pandas, numpy) and the tidy function are defined
# and the tidied_df is created.

# Create tidied_df first
tidied_df = tidy(mn)

if tidied_df is not None:
    mv_indices = missing_values(tidied_df)
    print(f"Row positions with missing values: {mv_indices[:20]}... (showing first 20)") # Show only first few for brevity
    print(f"Total number of rows with missing values: {len(mv_indices)}")

    # Run Assertions
    assert type(mv_indices) == list, "T1.1 Failed: Result is not a list"
    print("T1.1 Passed: Result is a list.")
    assert all(isinstance(i, int) for i in mv_indices), "T1.2 Failed: List does not contain only integers"
    print("T1.2 Passed: List contains only integers.")
    # Additional check for sorting
    assert all(mv_indices[i] <= mv_indices[i+1] for i in range(len(mv_indices)-1)), "T1.3 Failed: List is not sorted"
    print("T1.3 Passed: List is sorted.")
else:
    print("Cannot run missing_values tests because tidying failed.")

Row positions with missing values: [57, 110, 131, 159, 165, 201, 286, 312, 328, 368, 404, 453, 543, 571, 594, 607, 648, 653, 673, 698]... (showing first 20)
Total number of rows with missing values: 45
T1.1 Passed: Result is a list.
T1.2 Passed: List contains only integers.
T1.3 Passed: List is sorted.


In [20]:
assert type(missing_values(tidy(mn))) == list, "T1.1"
assert all(isinstance(i, int) for i in missing_values(tidy(mn))), "T1.2"

In [None]:
# Edit this cell or remove it, and you shall perish, meow! 😼⚡️

### 1.2. Analytical part

* Does the dataset contain missing values?
* Explain your manual-inspection procedure and the Python helpers used!
* If no, explain how you proved that this is actually the case. 
* If yes, describe the discovered missing values. What could be an explanation for their missingness?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!


YOUR ANSWER HERE

------
## 2. Handling missing values
### 2.1. Code part
Implement a (simple) function called `handling_missing_values` for handling missing values using an adequate single-imputation technique (or one of the alternatives to single imputation) of your choice per type of missing values. Make use of the techniques learned in Unit 4. The function should take a dataframe as input and return the updated dataframe. Mind the following:
- The objective is to apply single imputation on these synthetic data. Do not make up a background story (at this point)!
- Do NOT simply drop the missing values. This is not an option.
- The imputation technique must be adequate for a given variable type (quantitative, qualitative).
- To establish whether a variable is quantitative or qualitative, it is *not* sufficient to only inspect the data types!
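
As a minimal sketch of type-appropriate single imputation, the made-up frame below fills a quantitative column with its median and a qualitative column with its mode. These are just two of several possible techniques, and the frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "color": ["red", "blue", None, "red"],          # qualitative
    "km_per_litre": [10.0, np.nan, 14.0, 12.0],     # quantitative
})

imputed = toy.copy()

# Quantitative variable: the median of the observed values (10, 12, 14) is 12.
imputed["km_per_litre"] = imputed["km_per_litre"].fillna(imputed["km_per_litre"].median())

# Qualitative variable: the most frequent observed category is "red".
imputed["color"] = imputed["color"].fillna(imputed["color"].mode().iloc[0])

print(imputed)
```

The results are assigned back to the columns rather than filled in place, so the same code works regardless of pandas' copy-on-write settings.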

In [None]:
def handling_missing_values(x):
    """
    Handles missing values in the dataframe using single imputation.
    Uses median for quantitative columns ('km_per_litre') and mode for qualitative columns.

    Args:
        x (pandas.DataFrame): The input dataframe (assumed tidied), potentially containing missing values.

    Returns:
        pandas.DataFrame: The dataframe with missing values imputed.
    """
    if not isinstance(x, pd.DataFrame):
        print("Input is not a pandas DataFrame.")
        return x # Return original input
    if x.empty:
       return x # Return empty DataFrame if input is empty

    df = x.copy() # Work on a copy

    # 1. Ensure missing values identified in Step 1 are represented as NaN
    #    (Re-apply detection logic here to be self-contained, though it assumes Step 1 identified them correctly)
    na_strings = [
        'nan', 'NaN', 'NA', 'N/A', '#N/A', 'null', 'Null', '', 'none', 'None',
        'missing', 'Missing', ' ', '?', '-', '--',
        'inf', '-inf', 'Infinity', '-Infinity'
    ]
    for col in df.columns:
        if df[col].dtype == 'object':
            # Strip surrounding whitespace from string cells only; other objects stay untouched
            is_string = df[col].apply(lambda item: isinstance(item, str))
            if is_string.any():
                df.loc[is_string, col] = df.loc[is_string, col].str.strip()
            # Assign the result back instead of calling replace(..., inplace=True) on df[col],
            # which may act on a temporary object and leave df unchanged
            df[col] = df[col].replace(na_strings, np.nan)
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].replace([np.inf, -np.inf], np.nan)
        # Add specific placeholder replacements if identified in Step 1 (e.g., 0 for km_per_litre)
        # if col == 'km_per_litre' and pd.api.types.is_numeric_dtype(df[col]):
        #     df[col] = df[col].replace(0, np.nan) # Be cautious if 0 is valid

    # 2. Identify quantitative vs. qualitative columns
    # 'km_per_litre' is the only quantitative variable in the tidied dataset.
    # All remaining columns (names, categories, addresses, coordinate strings, timestamps,
    # company names) are treated as qualitative and imputed with the mode.
    quantitative_cols = [col for col in ['km_per_litre'] if col in df.columns]
    qualitative_cols = [col for col in df.columns if col not in quantitative_cols]

    # 3. Impute missing values
    for col in df.columns:
        if df[col].isna().any(): # Only impute columns with missing values
            if col in quantitative_cols:
                # Impute quantitative with median (robust to outliers)
                # Ensure column is numeric for median calculation
                numeric_col = pd.to_numeric(df[col], errors='coerce')
                if numeric_col.isna().all(): # Handle case where column becomes all NaN
                    median_val = 0 # Arbitrary fallback: no valid value exists to compute a median
                    print(f"Warning: Column '{col}' is all NaN after coercion. Filling with {median_val}.")
                else:
                    median_val = numeric_col.median()

                # Assign the filled column back (no inplace call on a column selection);
                # this also guarantees a numeric dtype
                df[col] = numeric_col.fillna(median_val)
                print(f"Imputed {col} (quantitative) with median: {median_val}")

            elif col in qualitative_cols:
                # Impute qualitative with mode
                # mode() may return several values; use the first one
                mode_val = df[col].mode()
                if not mode_val.empty:
                    fill_value = mode_val.iloc[0]
                    df[col] = df[col].fillna(fill_value)
                    print(f"Imputed {col} (qualitative) with mode: {fill_value}")
                else:
                    # Column is entirely NaN, so there is no mode; fall back to a placeholder
                    fill_value = "Unknown"
                    df[col] = df[col].fillna(fill_value)
                    print(f"Imputed {col} (qualitative) with placeholder: {fill_value} (mode empty)")
            else:
                # This case should not happen if the lists cover all columns
                print(f"Warning: Column '{col}' was not classified for imputation.")

    return df

In [None]:
assert len(missing_values(handling_missing_values(tidy(mn)))) == 0, "T2.1"
assert handling_missing_values(tidy(mn)).shape == tidy(mn).shape, "T2.2"

### 2.2. Analytical part
Discuss the implications. Answer the following:

- How would you qualify the data-generating processes leading to different types of missing values, provided that the data was not synthetic?
- What are the benefits and disadvantages of the chosen single-imputation technique?
- How would you apply a multiple-imputation technique to one type of missing values, if applicable at all?
- We asked you to detect and treat missing values by checking for certain field values, empty fields, or fields containing the numeric value 0. What are potential problems with these heuristics?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!

YOUR ANSWER HERE

-----
## 3. Detecting duplicate entries
Implement a function called `duplicates` that takes as input a (tidy) dataframe `x` and a list of column labels (`VARIABLES`). Assume that `duplicates` receives a dataframe as returned from your Step 0 implementation of `tidy`. It then checks whether there are any duplicates in the dataset. Record the row positions of the second and any later observations that are duplicates, and have `duplicates` return the list of row positions, sorted in ascending order. An empty list indicates the absence of duplicated observations.

Important:
* The first observation that belongs to the detected duplicates is *not* considered a duplicate!
* Mind the difference between row positions and row labels. `.index` of a dataframe returns row labels. `.iloc` takes row positions.
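
The two points above can be illustrated with `DataFrame.duplicated` on a made-up frame. This is a sketch only; the frame, its string index, and the chosen column subsets are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "full_name": ["Ann", "Bob", "Ann", "Ann"],
        "color": ["red", "blue", "red", "green"],
    },
    index=["a", "b", "c", "d"],  # labels on purpose different from positions
)

# keep="first" leaves the first occurrence unmarked and flags only later repeats.
mask = toy.duplicated(subset=["full_name", "color"], keep="first")
positions = np.flatnonzero(mask.to_numpy()).tolist()  # [2]

# The chosen subset matters: by name alone, rows 2 and 3 both repeat row 0.
by_name = np.flatnonzero(toy.duplicated(subset=["full_name"]).to_numpy()).tolist()  # [2, 3]

print(positions, by_name)
```

Row 0 is never reported because it is the first occurrence of its combination of values.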

In [None]:
VARIABLES = [list]; # Change value assignment!

def duplicates(x, vars):
    """
    Identifies duplicate rows based on a subset of columns and returns their positional indices.

    Args:
        x (pandas.DataFrame): The input dataframe.
        vars (list): A list of column names to consider for identifying duplicates.

    Returns:
        list or str: A sorted list of integer row positions of duplicate entries
                     (excluding the first occurrence). Returns an empty list if no duplicates.
                     Returns specific error strings for invalid input `vars`.
    """
    if not isinstance(x, pd.DataFrame):
      print("Input x is not a pandas DataFrame.")
      return [] # Or raise error

    # Input validation for 'vars'
    # Check if vars is None, or the specific placeholder [list], or not a list, or an empty list
    if vars is None or vars == [list] or not isinstance(vars, list) or not vars:
        return "Name variables defining potential duplicates!" # Match T3.2, T3.3

    # Check that every column name in vars actually exists in the dataframe x
    missing_vars = [v for v in vars if v not in x.columns]
    if missing_vars:
        # Unknown columns cannot define duplicates; report them and return an empty list.
        # (The message string above is reserved for None, empty, and the [list] placeholder.)
        print(f"Error: Columns not found in DataFrame: {missing_vars}")
        return []

    # Use pandas duplicated() method
    # keep='first' marks all duplicates *except for the first occurrence* as True.
    # This matches the requirement: "The first observation ... is *not* considered a duplicate!"
    try:
      duplicate_mask = x.duplicated(subset=vars, keep='first')
    except Exception as e:
        print(f"Error during duplicate detection (check column types/content): {e}")
        return []


    # Get the row positions (integer locations) where the mask is True
    duplicate_positions = np.where(duplicate_mask)[0]

    # Sort the positions in ascending order
    sorted_positions = sorted(duplicate_positions.tolist())

    return sorted_positions


In [None]:
df = tidy(mn);
assert len(VARIABLES) > 0 and all([v in df.columns.tolist() for v in VARIABLES]), "T3.1"
assert duplicates(df, [list]) == "Name variables defining potential duplicates!", "T3.2"
assert duplicates(df, None) == "Name variables defining potential duplicates!", "T3.3"
assert type(duplicates(df, vars = df.columns.tolist())) == list, "T3.4"
assert all(isinstance(i, int) for i in duplicates(df, df.columns.tolist())), "T3.5"

In [None]:
# Edit this cell or remove it, and you shall perish, meow! 😼⚡️


-----
## 4. Detecting outliers
### 4.1. Code part
Implement a function called `detecting_outliers` to detect outliers in one selected quantitative variable. Pick a suitable variable from the tidied dataset based on your characterisation and apply one suitable outlier-detection technique as covered in Unit 4. Justify your choice of this technique in the analytical part. Again, the function is assumed to receive a tidied data set from Step 0. The function returns the row positions (*not* row labels!) of the rows containing outliers on the selected variable, sorted in ascending order.
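
As one example of such a technique, the sketch below applies the 1.5 × IQR rule to a made-up series: values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are flagged. The numbers are invented for illustration, and other techniques may suit your variable better.

```python
import numpy as np
import pandas as pd

values = pd.Series([9.5, 10.0, 10.5, 11.0, 11.5, 12.0, 40.0])

# Quartiles and interquartile range (pandas interpolates linearly by default).
q1 = values.quantile(0.25)   # 10.25
q3 = values.quantile(0.75)   # 11.75
iqr = q3 - q1                # 1.5

# Fences of the 1.5 * IQR rule.
lower = q1 - 1.5 * iqr       # 8.0
upper = q3 + 1.5 * iqr       # 14.0

# Row positions of values outside the fences, in ascending order.
mask = (values < lower) | (values > upper)
positions = np.flatnonzero(mask.to_numpy()).tolist()  # [6]

print(lower, upper, positions)
```

Only the value 40.0 at position 6 lies outside the fences; missing values would never be flagged because comparisons with NaN evaluate to False.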

In [None]:
def detecting_outliers(x):
    """
    Detects outliers in the 'km_per_litre' column using the IQR method.

    Args:
        x (pandas.DataFrame): The input dataframe (assumed to be tidied, but could be imputed).

    Returns:
        list: A sorted list of integer row positions identified as outliers
              in the 'km_per_litre' column. Returns empty list if column missing, empty, or has no outliers.
    """
    if not isinstance(x, pd.DataFrame):
        print("Input is not a pandas DataFrame.")
        return []
    if x.empty:
        print("Input DataFrame is empty.")
        return []

    # --- Variable Selection ---
    # We select 'km_per_litre' as it's the primary quantitative variable suitable for outlier detection.
    variable_to_check = 'km_per_litre'

    if variable_to_check not in x.columns:
        print(f"Error: Column '{variable_to_check}' not found in the DataFrame.")
        return []

    # --- Outlier Detection Technique: IQR Method ---
    # Chosen because it's robust to the underlying distribution shape and extreme values.

    # Extract the column, ensuring it's numeric and handle potential non-numeric entries/NaNs introduced before/during tidying
    # Use pd.to_numeric, coercing errors will turn non-numeric values into NaN
    data_column = pd.to_numeric(x[variable_to_check], errors='coerce')

    # Check if the column is entirely NaN after coercion
    if data_column.isna().all():
        print(f"Warning: Column '{variable_to_check}' contains no valid numeric data after coercion.")
        return []

    # Calculate Q1 (25th percentile) and Q3 (75th percentile) on non-NaN values
    Q1 = data_column.quantile(0.25)
    Q3 = data_column.quantile(0.75)

    # Calculate the Interquartile Range (IQR)
    IQR = Q3 - Q1

    # Handle edge case where IQR is zero (e.g., column has constant value after NaN removal)
    if IQR == 0:
        # In this case, any value different from Q1 (which equals Q3) could be considered an outlier,
        # but the standard 1.5*IQR rule becomes useless.
        # A common approach is to return no outliers, or flag values unequal to the constant.
        # Let's return no outliers for simplicity, assuming constant value is not an outlier situation here.
        print(f"Warning: IQR for '{variable_to_check}' is zero. No outliers detected by standard IQR rule.")
        return []


    # Define the outlier boundaries using the standard 1.5*IQR rule
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Identify outliers using the original numeric-coerced column `data_column`
    # This automatically handles NaNs (they won't satisfy the condition).
    outlier_mask = (data_column < lower_bound) | (data_column > upper_bound)

    # Get the row positions (integer locations) where the mask is True.
    # np.flatnonzero returns them in ascending order (empty if nothing is flagged);
    # .tolist() converts them to plain Python ints.
    sorted_positions = np.flatnonzero(outlier_mask.to_numpy()).tolist()

    # print(f"Detected {len(sorted_positions)} outliers in '{variable_to_check}' using IQR ({IQR:.2f}): < {lower_bound:.2f} or > {upper_bound:.2f}") # Debug print

    return sorted_positions

# Assertions for Step 4.1
# Outlier detection is run on the tidied (not imputed) data, since the function
# is assumed to receive a tidied data set from Step 0.
df_for_outliers = tidy(mn)

outlier_list = detecting_outliers(df_for_outliers)
assert type(outlier_list) == list, "T4.1: Function should return a list."
# Check elements only if list is not empty
if outlier_list:
    assert all(isinstance(i, int) for i in outlier_list), "T4.2: List should contain only integers."
else:
    pass # Pass if list is empty (no outliers or error)


In [None]:
df = tidy(mn);
assert type(detecting_outliers(df)) == list, "T4.1"
assert all(isinstance(i, int) for i in detecting_outliers(df)), "T4.2"
assert len(detecting_outliers(df)) > 0 and len(detecting_outliers(df)) < .05*df.shape[0]


### 4.2. Analytical part
Discuss the implications. 

- What is the chosen outlier-detection technique? Explain it using your own words in 3-4 sentences.
- Describe the outliers detected: How many? How do they relate to the typical, non-outlier values in the remaining dataset?
- What could be one reason these outliers appear in the dataset? How would you treat them further?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!

YOUR ANSWER HERE