## Assignment 4
***
*General hints:* <br>
* You may use another notebook to test different approaches and ideas. When complete and mature, turn your code snippets into the requested functions in this notebook for submission. 
* Make sure the function implementations are generic and can be applied to any dataset (not just the one provided).
* Add explanatory code comments in the code cells. Make sure that these comments improve our understanding of your implementation decisions.

-----
* Create a variable holding your student id, as shown below. 
* Simply replace the example (`01234567`) with your actual student id having a total of 8 digits. 
* Maintain the variable as a string, do NOT change its type in this notebook!
* *Note: If your student id has 7 digits, add a leading 0. The final student id MUST have 8 digits!*
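
The 8-digit rule can be checked with a small helper. This is only a sketch: the name `normalize_student_id` is hypothetical and not part of the assignment template.

```python
def normalize_student_id(student_id: str) -> str:
    """Left-pad a numeric student id with zeros to 8 digits (illustrative helper)."""
    if not isinstance(student_id, str) or not student_id.isdigit():
        raise ValueError("student id must be a string of digits")
    if len(student_id) > 8:
        raise ValueError("student id must not exceed 8 digits")
    # str.zfill pads on the left with '0' and leaves 8-digit ids unchanged.
    return student_id.zfill(8)


print(normalize_student_id("1234567"))
```

The value stays a string throughout, which matches the requirement not to change the variable's type.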

In [33]:
mn = '12318768'

In [26]:
import pytest
import pandas as pd 
import numpy as np

## 0. Import

Implement a function `tidy` which imports the data set assigned and provided to you as a CSV file into a `pandas` dataframe. Access the data set and establish whether your data set is tidy. If not, clean the data set before continuing with Step 1. Mind all rules of tidying data sets in this step. Make sure you comply with the following statements:
* If there is an index column (row numbers) in your tidied dataset, keep it.
* The following columns, once identified, correspond to variables 1:1 (no need for transformations):
  * `full_name`
  * `automotive`
  * `color`
  * `job`
  * `address`
  * `coordinates`
  * `km_per_litre`
* The tidied dataset should have a total of 9 columns (not including the index), the first column should be `full_name` and the last one `km_per_litre`.
* Mind the intended content of each attribute (e.g. `full_name` should contain the full name of a person, no need to change that)
* If tidy or done, have the function `tidy` return the ready data set as a dataframe.
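
Two common tidiness violations are variables stored as rows and two variables sharing one column. The sketch below shows both repairs on made-up toy data; the column name `date_time/company` and the fixed 19-character timestamp are assumptions for this example and need not match the assigned data set.

```python
import io

import pandas as pd

# Toy data (made up): each variable is stored as a row instead of a column.
raw = io.StringIO(
    "full_name,Ann Lee,Bob Ray\n"
    "date_time/company,2020-01-01 10:00:00Acme Ltd,2021-05-05 12:30:00Beta Inc\n"
    "km_per_litre,12.5,14.0\n"
)

wide = pd.read_csv(raw, header=None)

# Use the first column as variable names, then transpose so variables become columns.
toy = wide.set_index(0).T.reset_index(drop=True)
toy.columns.name = None

# Split the column holding two variables (19-character timestamp + company name).
parts = toy["date_time/company"].str.extract(r"^(.{19})(.*)$")
toy["date_time"] = pd.to_datetime(parts[0])
toy["company"] = parts[1].str.strip()
toy = toy.drop(columns="date_time/company")

# After the transpose every column is text, so numeric variables need converting.
toy["km_per_litre"] = pd.to_numeric(toy["km_per_litre"])
toy = toy[["full_name", "date_time", "company", "km_per_litre"]]
print(toy)
```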

Note that `tidy` must take a single parameter that holds your student id (`mn`) as one part of the basename (according to the CoC) of the CSV file (i.e., the CoC file name without file extension). Change the name of the data file so that it matches this requirement and the CoC and make sure you submit your final ZIP following the Code of Conduct (CoC) requirements. Especially, make sure you put your data file in a folder called `data/` when submitting.
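
As a rough illustration of the file-location requirement, the path can be built from the student id as follows. The helper `data_path` is hypothetical, and the exact basename convention is whatever the CoC prescribes; here the basename is assumed to be the student id alone.

```python
from pathlib import Path


def data_path(student_id: str, data_dir: str = "data") -> Path:
    """Build the expected CSV path, e.g. data/01234567.csv (assumed naming)."""
    return Path(data_dir) / f"{student_id}.csv"


path = data_path("01234567")
print(path.as_posix())
print(path.exists())  # whether the file is where the loader will look for it
```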

In [55]:
import pandas as pd
import numpy as np # Import numpy for np.nan

def tidy(x):
    """
    Loads, transposes, and tidies the dataset from data/{x}.csv.

    Ensures the final DataFrame has 9 columns with specific names and order,
    and appropriate data types.
    """
    # Step 1: Load and transpose
    try:
        df = pd.read_csv(f"data/{x}.csv", header=None)
    except FileNotFoundError:
        raise FileNotFoundError(f"Error: data/{x}.csv not found. Make sure it's in a 'data' subfolder.")

    df = df.set_index(0).T

    # Step 2: Drop junk column if its *name* is NaN (often from transpose)
    # This check seems specific to how the data might be structured before transpose
    if pd.isna(df.columns[0]):
        df = df.drop(columns=df.columns[0])

    # Step 3: Split the combined datetime + company column
    # Assuming the first 26 chars are date_time, rest is company
    # Regex captures group 1 (26 chars) and group 2 (the rest)
    split = df["date_time/full_company_name"].str.extract(r"^(.{26})(.*)$") # Use (.*) for robustness
    df["date_time"] = split[0].str.strip()
    df["company_name"] = split[1].str.strip()

    # ✅ Drop only the original combined column
    df = df.drop(columns=["date_time/full_company_name"])

    # --- Adaptations for Tidiness (within 9 columns) ---

    # Step 3.1: Convert 'date_time' column to datetime objects
    df["date_time"] = pd.to_datetime(df["date_time"], errors='coerce') # Coerce errors to NaT

    # Step 3.2: Ensure 'km_per_litre' is numeric
    df["km_per_litre"] = pd.to_numeric(df["km_per_litre"], errors='coerce') # Coerce errors to NaN

    # Step 3.3: Handle potential missing values in 'company_name'
    # Replace empty strings potentially created by strip() with NaN, then fill all NaNs
    df["company_name"] = df["company_name"].replace('', np.nan).fillna("Unknown")

    # Step 4: Arrange the final columns in the specified order
    # These are the required 9 columns.
    final_cols = [
        "full_name", "automotive", "color", "job", "address",
        "coordinates", "date_time", "company_name", "km_per_litre"
    ]

    # Ensure all required columns exist before selecting
    missing_cols = [col for col in final_cols if col not in df.columns]
    if missing_cols:
        raise ValueError(f"Missing required columns after processing: {missing_cols}")

    df = df[final_cols]

    # Step 5: Validation (as required by the task/professor)
    if df.shape[1] != 9:
        # This check might be redundant if the previous check passes, but good practice
        raise ValueError(f"DataFrame must have exactly 9 columns, but found {df.shape[1]}.")
    if df.columns[0] != "full_name" or df.columns[-1] != "km_per_litre":
        raise ValueError("First column must be 'full_name' and last must be 'km_per_litre'.")
    # Optional: Add a check for the data types changed
    if not pd.api.types.is_datetime64_any_dtype(df['date_time']):
         print("Warning: 'date_time' column failed conversion to datetime.")
    if not pd.api.types.is_numeric_dtype(df['km_per_litre']):
         print("Warning: 'km_per_litre' column failed conversion to numeric.")


    # After the transpose, the row labels are 1, 2, 3, ... (inherited from the CSV's
    # column numbers). They are kept as the index, as per the instructions.
    return df

df = tidy(mn)

print("✅ Type check:", isinstance(df, pd.DataFrame))
print("✅ Column count:", len(df.columns))
print("✅ First column:", df.columns[0])
print("✅ Last column:", df.columns[-1])
print(df.head())

✅ Type check: True
✅ Column count: 9
✅ First column: full_name
✅ Last column: km_per_litre
0         full_name automotive        color  \
1   Jennifer Harmon    183-JKM      OldLace   
2  Timothy Martinez   1R BA640    DarkGreen   
3   Cynthia Raymond     899LJK      DimGray   
4       Kelly Logan   AM 77616  LightYellow   
5      Julie Carson   JH0 8918    SteelBlue   

0                                               job         address  \
1                     Telecommunications researcher     Port Robert   
2  Administrator, charities/voluntary organisations   South Kenneth   
3                               Solicitor, Scotland   Thomasborough   
4                                 Financial planner  East Christina   
5                                Veterinary surgeon  South Johnport   

0                                      coordinates                  date_time  \
1  (Decimal('-58.781963'), Decimal('-130.215939')) 2004-02-02 21:27:12.346526   
2  (Decimal('-76.941311'), Decimal('-

In [37]:
assert type(tidy(mn)) == pd.core.frame.DataFrame, "T0.1"
assert len((tidy(mn)).columns) == 9, "T0.2"
assert list((tidy(mn)).columns)[0] == "full_name", "T0.3"
assert list((tidy(mn)).columns)[len((tidy(mn)).columns)-1] == "km_per_litre", "T0.4"

In [18]:
# Edit this cell or remove it, and you shall perish, meow! 😼⚡️

-------
## 1. Missing values

### 1.1 Code part
Implement a function called `missing_values` which takes a dataframe as input and checks whether there are any missing values in the dataset. Record the row positions (*not* the row labels!) of the observations containing missing values as a list of numbers, and make sure that the function returns this list, sorted in ascending order. If there are no missing values, `missing_values` should return an empty list.

**NOTE:** You must find out how missing values are encoded in your dataset and which missing values occur in it. This ***requires manual inspection*** with the help of Python helpers. For instance, missing values could be encoded as `"nan"` or `"(+/-)inf"`, but other values, empty fields, or fields containing only white space are also conceivable encodings. Do *not* rely on built-in Python or pandas functions alone!
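
One way to support the manual inspection is to count, per column, the cells that look like a missing-value encoding. This is a minimal sketch on made-up data; the token set `CANDIDATES` and the helper names are assumptions and should be extended after looking at the real values.

```python
import pandas as pd

# Toy data (made up) with several plausible encodings of "missing".
toy = pd.DataFrame({
    "job": ["Nurse", " ", "nan", "Pilot", None],
    "km_per_litre": ["12.5", "-inf", "14.0", "", "9.1"],
})

# Candidate tokens are an assumption; extend them after inspecting the real data.
CANDIDATES = {"", "nan", "none", "null", "n/a", "na", "inf", "+inf", "-inf"}


def looks_missing(value) -> bool:
    """True for NaN/None and for strings that match a candidate token."""
    if pd.isna(value):
        return True
    return str(value).strip().lower() in CANDIDATES


def suspicious_counts(frame: pd.DataFrame) -> pd.Series:
    """Count suspicious cells per column."""
    mask = frame.apply(lambda column: column.map(looks_missing))
    return mask.sum()


counts = suspicious_counts(toy)
print(counts)

# repr() makes whitespace-only and empty strings visible during inspection.
print(sorted(toy["job"].map(repr).unique()))
```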

Important: Mind the difference between row positions and row labels. `.index` of a dataframe returns row labels. `.iloc` takes row positions.
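
The difference shows up as soon as the index is not `0, 1, 2, ...`. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy frame whose row labels differ from the row positions.
toy = pd.DataFrame({"value": [1.0, np.nan, 3.0, np.nan]}, index=[10, 20, 30, 40])

# True for every row that contains at least one NaN.
has_missing = toy.isna().any(axis=1)

labels = toy.index[has_missing].tolist()                      # row labels
positions = np.flatnonzero(has_missing.to_numpy()).tolist()   # row positions

print(labels)     # [20, 40]
print(positions)  # [1, 3]
```

`toy.iloc[positions]` selects the same rows as `toy.loc[labels]`, but only the positions are valid input for `.iloc`.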

In [58]:
# Import pandas library, which is needed to work with DataFrames.
# We often give it a shorter name 'pd' to type less.
import pandas as pd
# Import numpy library, often used with pandas, especially for 'np.nan'.
# We give it the short name 'np'.
import numpy as np

# --- Task 1: Find rows with missing values ---

# Define the function called 'missing_values'.
# It takes one argument, which we expect to be a pandas DataFrame.
# Let's call the input 'dataframe_to_check'.
def missing_values(dataframe_to_check):

    # Create an empty list. We will store the row numbers (positions)
    # that have missing values in this list.
    list_of_missing_row_positions = []

    # We need to look at each row in the DataFrame, one by one.
    # 'len(dataframe_to_check)' tells us how many rows there are.
    # 'range(number)' creates a sequence of numbers from 0 up to (but not including) the number.
    # So, 'row_index' will be 0, 1, 2, 3, ... for each row position.
    for row_index in range(len(dataframe_to_check)):

        # Get the actual data for the row at the current position 'row_index'.
        # '.iloc[row_index]' gets the row based on its position (like the 0th row, 1st row, etc.),
        # NOT based on its label (which might be something different).
        current_row_data = dataframe_to_check.iloc[row_index]

        # Now, we need to look at each value *within* this 'current_row_data'.
        for value in current_row_data:

            # --- Check if this 'value' is considered missing ---
            # We need to check for several types of missing values.

            # Check 1: Is it pandas' standard Not a Number (NaN) or None?
            # 'pd.isna()' is a reliable way to check for these.
            is_standard_missing = pd.isna(value)

            # Check 2: Is it an empty string or just spaces?
            # First, convert the value to a string using 'str()'. This is important
            # because things like numbers or boolean values (True/False) can't be stripped directly.
            # Then, '.strip()' removes any leading or trailing whitespace (spaces, tabs, newlines).
            # If the result is an empty string "", it means the original was empty or just whitespace.
            is_empty_or_whitespace = (str(value).strip() == "")

            # Check 3: Is it one of the specific strings we consider missing?
            # Again, convert to string, remove whitespace with '.strip()'.
            # Also, convert to lowercase using '.lower()' so we catch "N/A", "na", "Null", "Inf", etc.
            # Then check if this cleaned-up string is in our list of special missing words.
            value_as_string_cleaned = str(value).strip().lower()
            is_special_missing_string = value_as_string_cleaned in [
                "n/a", "na", "null", "none", "nan", "inf", "-inf", "+inf"
            ]

            # --- Combine the checks ---
            # If *any* of the above checks are True, then we consider this value missing.
            if is_standard_missing or is_empty_or_whitespace or is_special_missing_string:

                # If we found a missing value, we record the row position 'row_index'.
                list_of_missing_row_positions.append(row_index)

                # IMPORTANT: Since we found *one* missing value in this row,
                # we don't need to check the rest of the values in this *same* row.
                # The task only asks for the positions of rows that contain *any* missing values.
                # So, we 'break' out of the inner loop (the one checking values in the current row)
                # and move on to the next row.
                break

    # Thanks to the 'break' above, each row position is appended at most once,
    # and rows are visited in ascending order, so the list is already sorted.
    # 'sorted()' is kept as an explicit guarantee of the ascending order
    # that the task requires.
    sorted_list = sorted(list_of_missing_row_positions)

    # Return the final sorted list of row positions with missing values.
    # If no missing values were found, this will be an empty list [].
    return sorted_list

# --- Example Usage (using placeholder data similar to your setup) ---

# To make this runnable, let's create a sample DataFrame 'df'.
# In your original code, 'df' comes from 'tidy(mn)'. We'll simulate 'df'.
data = {
    'colA': [1, 2, 3, 4, 5, 6, 7],
    'colB': ['apple', 'banana', np.nan, 'orange', ' ', 'grape', ' N/A '],
    'colC': [10.1, 20.2, 30.3, 40.4, 50.5, None, 70.7],
    'colD': [True, False, True, True, False, True, 'inf']
}
df = pd.DataFrame(data)

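# Let's assume 'tidy(mn)' just gives us this dataframe 'df' for now.
# NOTE: the placeholder below rebinds the name 'tidy', so it shadows the real
# 'tidy' from Step 0 for the rest of the session. Re-run the Step 0 cell
# before running the graded asserts.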
def tidy(some_input_like_mn):
    # This is just a placeholder based on your code usage
    print("--- (Running placeholder tidy function) ---")
    # For this example, it just returns the dataframe we already created
    return df

# Prepare the dataframe using the (placeholder) tidy function
df_cleaned = tidy('mn_placeholder') # We pass a placeholder, tidy() uses the global 'df'

# Now, call the beginner-style missing_values function
missing_row_indices = missing_values(df_cleaned)

# Print the results in the same format as your example
print("🧼 Cleaned dataframe shape:", df_cleaned.shape)
print("❗ Missing row positions:", missing_row_indices)
print("📊 Total rows with missing values:", len(missing_row_indices))
# df = tidy(mn) # This line would reload the original data in your script
print("Columns in df:", df_cleaned.columns.tolist())

--- (Running placeholder tidy function) ---
🧼 Cleaned dataframe shape: (7, 4)
❗ Missing row positions: [2, 4, 5, 6]
📊 Total rows with missing values: 4
Columns in df: ['colA', 'colB', 'colC', 'colD']


In [59]:
assert type(missing_values(tidy(mn))) == list, "T1.1"
assert all(isinstance(i, int) for i in missing_values(tidy(mn))), "T1.2"

--- (Running placeholder tidy function) ---
--- (Running placeholder tidy function) ---


In [57]:
# Edit this cell or remove it, and you shall perish, meow! 😼⚡️

### 1.2. Analytical part

* Does the dataset contain missing values?
* Explain your manual-inspection procedure and the Python helpers used!
* If no, explain how you proved that this is actually the case. 
* If yes, describe the discovered missing values. What could be an explanation for their missingness?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!


YOUR ANSWER HERE

------
## 2. Handling missing values
### 2.1. Code part
Apply a (simple) function called *handling_missing_values* for handling missing values using an adequate single-imputation technique (or, one of the alternatives to single imputation) of your choice per type of missing values. Make use of the techniques learned in Unit 4. The function should take as an input a dataframe and return the updated dataframe. Mind the following:
- The objective is to apply single imputation on these synthetic data. Do not make up a background story (at this point)!
- Do NOT simply drop the missing values. This is not an option.
- The imputation technique must be adequate for a given variable type (quantitative, qualitative).
- To establish whether a variable is quantitative or qualitative, it is *not* sufficient to inspect the data types alone!
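
The last point can be illustrated with a numeric-looking label. In the sketch below (made-up data; the column `postcode` and the two classification lists are assumptions), both columns have a float dtype, yet only one is quantitative, so they get different imputation techniques.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "km_per_litre": [10.0, 12.0, np.nan, 14.0],    # quantitative
    "postcode": [1010.0, 1010.0, 1020.0, np.nan],  # numeric dtype, but a qualitative label
})

# Assumption: each variable is classified after inspecting its meaning, not its dtype.
QUANTITATIVE = ["km_per_litre"]
QUALITATIVE = ["postcode"]

imputed = toy.copy()
for col in QUANTITATIVE:
    # Mean imputation for quantitative variables.
    imputed[col] = imputed[col].fillna(imputed[col].mean())
for col in QUALITATIVE:
    # Mode imputation for qualitative variables; an "average postcode" would be meaningless.
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```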

In [61]:
# Import pandas library, needed for DataFrames
import pandas as pd
# Import numpy library, needed for np.nan (Not a Number)
import numpy as np

# --- Task 2: Handle Missing Values ---

# Define the function 'handling_missing_values'.
# It takes one argument, which should be a pandas DataFrame.
# Let's call the input 'input_dataframe'.
def handling_missing_values(input_dataframe):

    # --- Preparation ---

    # It's good practice to work on a copy of the DataFrame,
    # so we don't accidentally change the original DataFrame outside of this function.
    # '.copy()' creates an independent copy.
    dataframe_copy = input_dataframe.copy()
    print("--- Created a copy of the DataFrame to work on. ---")

    # --- Step 1: Standardize Different Missing Value Representations ---
    # The data might use strings like "N/A", "null", or just empty spaces ""
    # to represent missing data. We want to convert all of these into
    # pandas' standard missing value: np.nan (Not a Number).
    # This makes it easier to use built-in functions like .fillna() later.

    # We'll create a small helper function that checks *one* value at a time.
    def convert_to_nan_if_missing_string(one_value):
        # First, check if the value we received is actually a string.
        if isinstance(one_value, str):
            # If it is a string, let's clean it up:
            # '.strip()' removes whitespace (spaces, tabs) from the beginning and end.
            # '.lower()' converts the string to all lowercase.
            # This helps us catch variations like " N/A " or "Null".
            cleaned_value = one_value.strip().lower()

            # Now, check if this cleaned string is one of the ones we consider missing.
            list_of_missing_strings = ["n/a", "na", "null", "none", "nan", "inf", "-inf", "+inf", ""]
            if cleaned_value in list_of_missing_strings:
                # If it is a missing string, we return np.nan
                return np.nan
        # If the value was not a string, or if it was a string but not in our missing list,
        # we just return the original value unchanged.
        return one_value

    # Now, we apply this helper function to *every single cell* in our DataFrame copy.
    # 'DataFrame.applymap()' is deprecated in recent pandas versions, so we go column by
    # column instead: '.apply()' hands over each column, and 'Series.map()' calls the helper per cell.
    print("--- Standardizing text representations of missing values (like 'N/A', '', 'null') to NaN ---")
    dataframe_copy = dataframe_copy.apply(lambda column: column.map(convert_to_nan_if_missing_string))

    # --- Step 2: Impute Missing Values Column by Column ---
    # Now that missing values are consistently represented as np.nan,
    # we can loop through each column and fill the NaNs using a suitable method.

    print("--- Starting imputation process for each column ---")
    # 'dataframe_copy.columns' gives us a list of all column names.
    for column_name in dataframe_copy.columns:

        # Check if the current column has *any* missing values (NaN)
        # '.isnull()' creates a True/False mask (True where NaN)
        # '.any()' checks if there is at least one True in the mask
        if dataframe_copy[column_name].isnull().any():
            print(f"\nFound missing values in column: '{column_name}'")

            # --- Imputation Strategy based on Column Name ---
            # We need different strategies for different types of data.
            # The task implies we know which columns are quantitative vs. qualitative.
            # Here, we'll decide based on the column name, as in the original code.

            # Strategy for 'km_per_litre' (Quantitative Data)
            if column_name == "km_per_litre":
                print(f"Handling '{column_name}' (Quantitative - using Mean Imputation)")
                # This column should contain numbers.

                # First, make sure the column is actually numeric.
                # 'pd.to_numeric' tries to convert values to numbers.
                # 'errors='coerce'' means: if a value cannot be converted (maybe it's still text),
                # replace it with NaN. This is important before calculating the mean.
                dataframe_copy[column_name] = pd.to_numeric(dataframe_copy[column_name], errors='coerce')

                # Calculate the mean (average) of the column.
                # '.mean()' automatically ignores NaN values when calculating.
                mean_value = dataframe_copy[column_name].mean()
                print(f"  > Calculated mean: {mean_value}")

                # Fill any NaN values in this column with the calculated mean.
                # '.fillna()' is the pandas method to replace NaN values.
                dataframe_copy[column_name] = dataframe_copy[column_name].fillna(mean_value)
                print(f"  > Filled NaN values in '{column_name}' with the mean.")

            # Strategy for 'coordinates' (Specific Placeholder)
            elif column_name == "coordinates":
                print(f"Handling '{column_name}' (Categorical/Special - using Placeholder Imputation)")
                # For this column, we'll just use a specific string "(0,0)" for missing values.
                placeholder = "(0,0)"
                dataframe_copy[column_name] = dataframe_copy[column_name].fillna(placeholder)
                print(f"  > Filled NaN values in '{column_name}' with placeholder: '{placeholder}'")

            # Strategy for 'date_time' (Mode Imputation)
            elif column_name == "date_time":
                print(f"Handling '{column_name}' (Date/Time or Categorical - using Mode Imputation)")
                # For dates or categories, the mode (most frequent value) is often a good choice.

                # Calculate the mode(s). '.mode()' returns a Series because there might be ties.
                modes = dataframe_copy[column_name].mode()

                # Check if the mode Series is empty. This happens if *all* values in the column were NaN.
                if not modes.empty:
                    # If modes exist, take the first one (index 0).
                    mode_value = modes[0]
                    print(f"  > Calculated mode: {mode_value}")
                else:
                    # If no mode exists, use a default fallback value.
                    mode_value = "1970-01-01 00:00:00" # A standard default timestamp
                    print(f"  > No mode found (all values might be NaN). Using default: {mode_value}")

                # Fill NaN values with the determined mode value.
                dataframe_copy[column_name] = dataframe_copy[column_name].fillna(mode_value)
                print(f"  > Filled NaN values in '{column_name}' with the mode.")

            # Strategy for ALL OTHER columns (Assumed Qualitative - Mode Imputation)
            else:
                print(f"Handling '{column_name}' (Assumed Qualitative - using Mode Imputation)")
                # For any other column not specifically handled above, we'll assume it's
                # qualitative/categorical and use mode imputation.

                # Calculate the mode(s).
                modes = dataframe_copy[column_name].mode()

                # Check if modes exist.
                if not modes.empty:
                    # Take the first mode if it exists.
                    mode_value = modes[0]
                    print(f"  > Calculated mode: {mode_value}")
                else:
                    # If no mode exists, use "Unknown" as a fallback.
                    mode_value = "Unknown"
                    print(f"  > No mode found. Using default: '{mode_value}'")

                # Fill NaN values with the determined mode value.
                dataframe_copy[column_name] = dataframe_copy[column_name].fillna(mode_value)
                print(f"  > Filled NaN values in '{column_name}' with the mode.")

        else:
            # If the column had no missing values to begin with
            print(f"\nColumn '{column_name}' has no missing values. Skipping imputation.")

    # --- Step 3: Return the Updated DataFrame ---
    print("\n--- Finished handling missing values. ---")
    # Return the DataFrame copy which now has missing values filled.
    return dataframe_copy

# --- Example Usage (using placeholder data) ---

# Create a sample DataFrame similar to what the function might receive
data = {
    'car_model': ['A', 'B', 'C', 'A', None, 'B', 'D'],
    'km_per_litre': [15.1, 12.5, 'nan', 15.5, 11.0, '', 13.0], # Mixed types, missing strings
    'coordinates': ['(1,1)', '(2,2)', '(3,3)', np.nan, '(5,5)', '(6,6)', ' N/A '], # NaN and missing string
    'date_time': ['2023-01-01 10:00', '2023-01-01 10:00', '2023-01-02 11:00', 'null', '2023-01-03 12:00', pd.NaT, '2023-01-01 10:00'] # Missing string and NaT
}
original_df = pd.DataFrame(data)

print("--- Original DataFrame: ---")
print(original_df)
print("\nOriginal DataFrame Info:")
original_df.info()
print("\nOriginal DataFrame Missing Values (before standardization):")
print(original_df.isnull().sum()) # Note: isnull() won't catch 'nan', '', 'null' etc. yet

# Call the beginner-style handling function
df_imputed = handling_missing_values(original_df)

print("\n--- Imputed DataFrame: ---")
print(df_imputed)
print("\nImputed DataFrame Info:")
df_imputed.info() # Notice km_per_litre might be float now
print("\nImputed DataFrame Missing Values (should be all zeros):")
print(df_imputed.isnull().sum())

--- Original DataFrame: ---
  car_model km_per_litre coordinates         date_time
0         A         15.1       (1,1)  2023-01-01 10:00
1         B         12.5       (2,2)  2023-01-01 10:00
2         C          nan       (3,3)  2023-01-02 11:00
3         A         15.5         NaN              null
4      None         11.0       (5,5)  2023-01-03 12:00
5         B                    (6,6)               NaT
6         D         13.0        N/A   2023-01-01 10:00

Original DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   car_model     6 non-null      object
 1   km_per_litre  7 non-null      object
 2   coordinates   6 non-null      object
 3   date_time     6 non-null      object
dtypes: object(4)
memory usage: 356.0+ bytes

Original DataFrame Missing Values (before standardization):
car_model       1
km_per_litre    0
coordinates 



In [62]:
assert len(missing_values(handling_missing_values(tidy(mn)))) == 0, "T2.1"
assert handling_missing_values(tidy(mn)).shape == tidy(mn).shape, "T2.2"

--- (Running placeholder tidy function) ---
--- Created a copy of the DataFrame to work on. ---
--- Standardizing text representations of missing values (like 'N/A', '', 'null') to NaN ---
--- Starting imputation process for each column ---

Column 'colA' has no missing values. Skipping imputation.

Found missing values in column: 'colB'
Handling 'colB' (Assumed Qualitative - using Mode Imputation)
  > Calculated mode: apple
  > Filled NaN values in 'colB' with the mode.

Found missing values in column: 'colC'
Handling 'colC' (Assumed Qualitative - using Mode Imputation)
  > Calculated mode: 10.1
  > Filled NaN values in 'colC' with the mode.

Found missing values in column: 'colD'
Handling 'colD' (Assumed Qualitative - using Mode Imputation)
  > Calculated mode: True
  > Filled NaN values in 'colD' with the mode.

--- Finished handling missing values. ---
--- (Running placeholder tidy function) ---
--- Created a copy of the DataFrame to work on. ---
--- Standardizing text representati



### 2.2. Analytical part
Discuss the implications. Answer the following:

- How would you qualify the data-generating processes leading to different types of missing values, provided that the data was not synthetic?
- What are the benefits and disadvantages of the chosen single-imputation technique?
- How would you apply a multiple-imputation technique to one type of missing values, if applicable at all?
- We asked you to detect and treat missing values by checking for certain field values, as well as empty fields or fields containing the numeric value 0. What are potential problems of this heuristic?
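
For orientation on the multiple-imputation question, here is a deliberately simplified sketch on made-up numbers. It fills the gaps `m` times with random draws from the observed values (hot-deck style) and pools the resulting estimates. A full multiple-imputation procedure would draw from a fitted model and also combine the within-imputation variance (Rubin's rules); neither is shown here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up quantitative variable with two missing entries.
values = pd.Series([10.0, 12.0, np.nan, 14.0, np.nan, 11.0])
observed = values.dropna().to_numpy()
n_missing = int(values.isna().sum())

m = 5  # number of imputed data sets
estimates = []
for _ in range(m):
    completed = values.copy()
    # Each round fills the gaps with a fresh random draw from the observed values.
    completed.loc[completed.isna()] = rng.choice(observed, size=n_missing, replace=True)
    estimates.append(float(completed.mean()))

pooled_mean = float(np.mean(estimates))         # pooled point estimate
between_var = float(np.var(estimates, ddof=1))  # spread caused by the imputation itself

print(pooled_mean, between_var)
```

Unlike single imputation, the spread across the `m` estimates makes the uncertainty introduced by the missing values visible.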

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!

YOUR ANSWER HERE

-----
## 3. Detecting duplicate entries
Implement a function called `duplicates` that takes as input a (tidy) dataframe `x` and a list of column labels (`VARIABLES`). Assume that `duplicates` receives a dataframe as returned from your Step 0 implementation of `tidy`. It then checks whether there are any duplicates in the dataset. Record the row positions of the second and any later observations that are duplicates, and have `duplicates` return the list of row positions, sorted in ascending order. An empty list indicates the absence of duplicated observations.

Important:
* The first observation that belongs to the detected duplicates is *not* considered a duplicate!
* Mind the difference between row positions and row labels. `.index` of a dataframe returns row labels. `.iloc` takes row positions.
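
Both points can be seen in a small sketch on made-up data with non-numeric row labels: `keep="first"` leaves the first occurrence unmarked, and the boolean mask is converted to positions rather than labels.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"full_name": ["Ann", "Bob", "Ann", "Ann"], "color": ["Red", "Blue", "Red", "Red"]},
    index=["a", "b", "c", "d"],
)

# True only for the second and later occurrences; the first "Ann/Red" row stays False.
is_later_duplicate = toy.duplicated(subset=["full_name", "color"], keep="first")

positions = np.flatnonzero(is_later_duplicate.to_numpy()).tolist()
labels = toy.index[is_later_duplicate].tolist()

print(positions)  # [2, 3]
print(labels)     # ['c', 'd']
```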

In [79]:
import pandas as pd

# Assume 'tidy' function is defined elsewhere and works
# Assume 'mn' is your raw data source
# Assume df = tidy(mn) has been run or you have a tidy DataFrame called 'df'

# Assume VARIABLES = df.columns.tolist() or similar

# --- Beginner-Style Function to Find Duplicate Entries (Prints Removed) ---

def duplicates(x, vars):
    """
    Checks a DataFrame for duplicate rows based on certain columns.

    Args:
        x (pd.DataFrame): The tidy DataFrame to check.
        vars (list): A list of column names to use for identifying duplicates.

    Returns:
        list: A sorted list of row *positions* (0, 1, 2...) of the duplicate rows.
              Returns an empty list if no duplicates are found.
        str: An error message if the 'vars' input is invalid.
    """

    # --- Input Checks (Beginner Style) ---

    if vars is None:
        # print("Error: You provided 'None' instead of a list of column names.") # PRINT REMOVED
        return "Name variables defining potential duplicates!"

    if type(vars) != list:
        # print("Error: The 'vars' input must be a list of column names.") # PRINT REMOVED
        return "Name variables defining potential duplicates!"

    if not vars:
        # print("Error: You provided an empty list for 'vars'.") # PRINT REMOVED
        return "Name variables defining potential duplicates!"

    actual_columns_in_x = x.columns.tolist()
    for column_name_or_item in vars:
        if type(column_name_or_item) != str:
            # print("Error: The 'vars' list contains an item that is not a string: ", column_name_or_item) # PRINT REMOVED
            return "Name variables defining potential duplicates!"

        if column_name_or_item not in actual_columns_in_x:
            # print("Error: The column name '" + column_name_or_item + "' is not in the dataframe.") # PRINT REMOVED
            return "Name variables defining potential duplicates!"

    # --- Finding Duplicates (Beginner Style) ---
    # If we passed all the checks above, 'vars' is a valid list of column names

    # Boolean mask: True for the second and any later occurrence of a row (the first one is kept).
    is_duplicate_row = x.duplicated(subset=vars, keep='first')

    # enumerate() yields row *positions* (0, 1, 2, ...) regardless of the row labels.
    # Unlike x.index.get_loc(label), this also works when labels are not unique,
    # and it always produces plain Python ints in ascending order.
    duplicate_row_positions = [
        position for position, flag in enumerate(is_duplicate_row) if flag
    ]
    return duplicate_row_positions

# --- Example Usage & Asserts (Should now run silently if passing) ---

# Re-run setup just in case
# df = tidy(mn) # Make sure df is correctly defined
# VARIABLES = df.columns.tolist() # Make sure VARIABLES is correctly defined

# Placeholder for tidy if needed for standalone testing
# def tidy(data):
#     print("--- (Running placeholder tidy function) ---")
#     # Example: create a dummy dataframe
#     return pd.DataFrame({
#         'colA': [1, 2, 1, 3, 2],
#         'colB': ['a', 'b', 'a', 'c', 'b'],
#         'colC': [True, False, True, True, False]
#     })
# df = tidy(None) # Use placeholder
# VARIABLES = df.columns.tolist()

# Asserts
assert len(VARIABLES) > 0 and all([isinstance(v, str) and v in df.columns.tolist() for v in VARIABLES]), "T3.1 Problem: VARIABLES setup issue"
assert duplicates(df, [list]) == "Name variables defining potential duplicates!", "T3.2 Failed: Handling non-string in vars list"
assert duplicates(df, None) == "Name variables defining potential duplicates!", "T3.3 Failed: Handling None input"
assert type(duplicates(df, vars = df.columns.tolist())) == list, "T3.4 Failed: Did not return a list for valid input"
result_list = duplicates(df, df.columns.tolist())
assert isinstance(result_list, list) and all(isinstance(i, int) for i in result_list), "T3.5 Failed: Return list does not contain all integers"

# If the script reaches here without an AssertionError, all tests passed.
print("✅ All beginner duplicates function asserts passed silently!")

# Optional: You can still call the function outside asserts to see prints/results
# print("\n--- Manual Test ---")
# duplicate_rows = duplicates(df, VARIABLES)
# print("Duplicate rows found:", duplicate_rows)
# duplicate_rows_invalid = duplicates(df, ['colA', 'non_existent_col'])
# print("Result for invalid column:", duplicate_rows_invalid) # This will now just print the return string

✅ All beginner duplicates function asserts passed silently!
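
Row labels and row positions only coincide for a default `RangeIndex`. The following minimal sketch uses a made-up toy frame (the column names `colA`/`colB` and the index labels are illustrative assumptions, not the assignment data) to show one way of turning a duplicate mask into 0-based positions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a non-default index in which label 30 repeats.
toy = pd.DataFrame(
    {"colA": [1, 2, 1, 3, 2], "colB": ["a", "b", "a", "c", "b"]},
    index=[10, 20, 30, 30, 50],
)

# True for every row that repeats an earlier row on both columns.
mask = toy.duplicated(subset=["colA", "colB"], keep="first")

# Row labels of the duplicates (what the index says).
dup_labels = toy.index[mask.to_numpy()].tolist()

# Row positions of the duplicates (0-based, independent of the index).
dup_positions = np.flatnonzero(mask.to_numpy()).tolist()

print(dup_labels)     # [30, 50]
print(dup_positions)  # [2, 4]
```

Because label `30` occurs twice in this toy index, a label-based lookup cannot identify a single row for it, whereas the mask-based positions stay unambiguous.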


In [80]:
df = tidy(mn);
assert len(VARIABLES) > 0 and all([v in df.columns.tolist() for v in VARIABLES]), "T3.1"
assert duplicates(df, [list]) == "Name variables defining potential duplicates!", "T3.2"
assert duplicates(df, None) == "Name variables defining potential duplicates!", "T3.3"
assert type(duplicates(df, vars = df.columns.tolist())) == list, "T3.4"
assert all(isinstance(i, int) for i in duplicates(df, df.columns.tolist())), "T3.5"

--- (Running placeholder tidy function) ---


In [53]:
# Edit this cell or remove it, and you shall perish, meow! 😼⚡️


-----
## 4. Detecting outliers
### 4.1. Code part
Implement a function called `detecting_outliers` to detect outliers in one selected quantitative variable. Pick a suitable variable from the tidied dataset based on your characterisation and apply one suitable outlier-detection technique as covered in Unit 4. Justify your choice of this technique in the analytical part. Again, the function is assumed to receive a tidied data set from Step 0. The function returns the row positions (*not* row labels!) of the rows containing outliers on the selected variable, sorted in ascending order.
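
As a rough illustration of what such a function computes, here is a minimal sketch of one candidate technique, the 1.5 × IQR rule, on made-up numbers. The values and the letter index are assumptions for illustration only, and whether this rule is the best fit for your variable is for the analytical part to argue.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; the letters stand in for a non-default index.
values = pd.Series(
    [10.0, 12.0, 11.0, 13.0, 12.0, 40.0, 11.0, 12.0],
    index=list("abcdefgh"),
)

# Quartiles and interquartile range.
q1 = values.quantile(0.25)   # 11.0
q3 = values.quantile(0.75)   # 12.25
iqr = q3 - q1                # 1.25

# Fences: anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR] is flagged.
lower = q1 - 1.5 * iqr       # 9.125
upper = q3 + 1.5 * iqr       # 14.125
is_outlier = (values < lower) | (values > upper)

# 0-based row positions (what the task asks for) versus row labels.
positions = np.flatnonzero(is_outlier.to_numpy()).tolist()
labels = values.index[is_outlier.to_numpy()].tolist()

print(positions)  # [5]
print(labels)     # ['f']
```

Only the value `40.0` falls outside the fences; its position is `5`, while its label is `'f'`.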

In [88]:
# --- Task 4.1: Detecting Outliers ---
import pandas as pd
import numpy as np

# Assume the 'tidy' function defined in cell 55 is available and correct.

def detecting_outliers(input_df):
    """
    Detects outliers in the 'km_per_litre' column using the IQR method.

    Args:
        input_df (pd.DataFrame): The tidied DataFrame (expected to have 'km_per_litre').

    Returns:
        list: A sorted list of row *positions* (0-based integers) containing outliers.
              Returns an empty list if no outliers are found or the column is unsuitable.
    """
    # --- Constants and Input Validation ---
    COLUMN_TO_CHECK = "km_per_litre" # The quantitative variable chosen

    # Work on a copy to avoid modifying the original DataFrame unexpectedly
    df = input_df.copy()

    # Check if the chosen column exists in the dataframe
    if COLUMN_TO_CHECK not in df.columns:
        print(f"❌ Error: Column '{COLUMN_TO_CHECK}' not found in the DataFrame!")
        print(f"Available columns are: {df.columns.tolist()}")
        # Return an empty list as we cannot proceed
        return []

    print(f"🔎 Checking for outliers in column: '{COLUMN_TO_CHECK}'")

    # --- Data Preparation ---
    # Ensure the column is numeric. 'coerce' turns non-numeric values into NaN.
    # It's crucial that this happens *before* calculating quantiles.
    numeric_series = pd.to_numeric(df[COLUMN_TO_CHECK], errors='coerce')

    # Check if the column became all NaNs after coercion (e.g., if it was all text)
    if numeric_series.isnull().all():
        print(f"⚠️ Warning: Column '{COLUMN_TO_CHECK}' contains no valid numeric data after coercion. Cannot detect outliers.")
        return []

    # --- IQR Calculation ---
    # Calculate Q1 (25th percentile), Q3 (75th percentile)
    # pandas' quantile method automatically handles NaNs present in the numeric_series
    Q1 = numeric_series.quantile(0.25)
    Q3 = numeric_series.quantile(0.75)
    IQR = Q3 - Q1

    # Handle the edge case where IQR is 0 (e.g., many data points have the same value)
    if IQR == 0:
        print(f"⚠️ Warning: IQR for '{COLUMN_TO_CHECK}' is zero. Outlier detection using 1.5*IQR rule might not be suitable.")
        # Outliers would only be values strictly outside Q1/Q3 if IQR is 0.
        # We can proceed, but the bounds will just be Q1 and Q3.
        # Alternatively, one might return [] or use a different method here.

    # Define the outlier boundaries
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    print(f"   - Q1: {Q1:.2f}, Q3: {Q3:.2f}, IQR: {IQR:.2f}")
    print(f"   - Lower Bound: {lower_bound:.2f}, Upper Bound: {upper_bound:.2f}")

    # --- Identify Outliers ---
    # A value is an outlier if it lies below the lower bound OR above the upper bound.
    # Comparisons against NaN already evaluate to False; the explicit notna()
    # only makes it obvious that missing values are never flagged.
    outlier_mask = ((numeric_series < lower_bound) | (numeric_series > upper_bound)) & numeric_series.notna()

    # --- Get Row Positions ---
    # np.where() on the boolean mask returns the 0-based positions where the
    # mask is True, regardless of the DataFrame's index labels. It returns a
    # tuple with one array per dimension, so we take the first element.
    outlier_positions = np.where(outlier_mask)[0].tolist()

    print(f"📊 Found {len(outlier_positions)} outlier row positions.")

    # --- Return Sorted Positions ---
    # np.where returns sorted positions, so no extra sorting needed.
    return outlier_positions

# --- Execution and Testing ---
# IMPORTANT: Ensure the CORRECT 'tidy' function (from cell 55) is run before this cell!
# Re-run cell 55 if needed.
try:
    print("--- Running Main Tidy Function ---")
    # Explicitly call the main tidy function again here to ensure we have the right df
    df_tidied = tidy(mn)
    print("--- Tidy Function Completed ---")
    print("Columns in tidied DataFrame:", df_tidied.columns.tolist()) # Verify columns

    # Check if 'km_per_litre' is actually present now
    if 'km_per_litre' in df_tidied.columns:
        print(f"Column 'km_per_litre' dtype: {df_tidied['km_per_litre'].dtype}")
        print("Sample data from 'km_per_litre':\n", df_tidied['km_per_litre'].head())

        # Call the outlier detection function with the correctly tidied DataFrame
        print("\n--- Running Outlier Detection ---")
        outliers = detecting_outliers(df_tidied)
        print("--- Outlier Detection Completed ---")

        print("\n--- Results ---")
        print("Detected outlier row positions:", outliers)

        # --- Assertions for Validation ---
        print("\n--- Running Assertions ---")
        assert isinstance(outliers, list), "T4.1 Failed: Result is not a list."
        print("✅ T4.1 Passed: Result type is list.")

        assert all(isinstance(i, int) for i in outliers), "T4.2 Failed: Not all elements in the list are integers."
        print("✅ T4.2 Passed: All elements are integers.")

        # Check the outlier-count bounds with more context
        num_outliers = len(outliers)
        data_size = df_tidied.shape[0]
        min_expected = 1 # At least one outlier is required
        max_allowed_fraction = 0.05
        max_expected = max_allowed_fraction * data_size # Exclusive upper limit (may be fractional)

        print(f"Checking assertion: {min_expected} <= num_outliers ({num_outliers}) < {max_expected:.1f} ({max_allowed_fraction*100:.0f}% of {data_size})")

        assert num_outliers >= min_expected, f"T4.3 Failed: Expected at least {min_expected} outlier, found {num_outliers}. The data might not have outliers by the IQR rule."
        print(f"✅ T4.3 Passed: Found {num_outliers} outliers (>= {min_expected}).")

        assert num_outliers < max_expected, f"T4.4 Failed: Found {num_outliers} outliers, which is >= {max_allowed_fraction*100:.0f}% of the data ({data_size} rows). The count must stay below {max_expected:.1f}."
        print(f"✅ T4.4 Passed: Number of outliers ({num_outliers}) is below {max_allowed_fraction*100:.0f}% of the data size (limit {max_expected:.1f}).")

    else:
        print("\n❌ FATAL: 'km_per_litre' column is missing even after running tidy(). Check the tidy() function implementation in cell 55.")

except FileNotFoundError as e:
    print(f"\n❌ Error during tidy setup: {e}")
    print(f"   Please ensure the data file for student id '{mn}' exists in the 'data/' folder.")
except Exception as e:
    print(f"\n❌ An unexpected error occurred during execution: {e}")
    import traceback
    traceback.print_exc()

--- Running Main Tidy Function ---
--- (Running placeholder tidy function) ---
--- Tidy Function Completed ---
Columns in tidied DataFrame: ['colA', 'colB', 'colC', 'colD']

❌ FATAL: 'km_per_litre' column is missing even after running tidy(). Check the tidy() function implementation in cell 55.


In [83]:
df = tidy(mn);
assert type(detecting_outliers(df)) == list, "T4.1"
assert all(isinstance(i, int) for i in detecting_outliers(df)), "T4.2"
assert len(detecting_outliers(df)) > 0 and len(detecting_outliers(df)) < .05*df.shape[0]


--- (Running placeholder tidy function) ---
🔎 Checking for outliers in column: 'YOUR_COLUMN_NAME_HERE'
❌ Error: Column 'YOUR_COLUMN_NAME_HERE' not found in the DataFrame!
Available columns are: ['colA', 'colB', 'colC', 'colD']
🔎 Checking for outliers in column: 'YOUR_COLUMN_NAME_HERE'
❌ Error: Column 'YOUR_COLUMN_NAME_HERE' not found in the DataFrame!
Available columns are: ['colA', 'colB', 'colC', 'colD']
🔎 Checking for outliers in column: 'YOUR_COLUMN_NAME_HERE'
❌ Error: Column 'YOUR_COLUMN_NAME_HERE' not found in the DataFrame!
Available columns are: ['colA', 'colB', 'colC', 'colD']


AssertionError: 

### 4.2. Analytical part
Discuss the implications. 

- What is the chosen outlier-detection technique? Explain it using your own words in 3-4 sentences.
- Describe the outliers detected: How many? How do they relate to the typical, non-outlier values in the remaining dataset?
- What could be one reason these outliers appear in the dataset? How would you treat them further?
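
To ground the second question in numbers, the flagged rows can be compared with the remaining rows. The sketch below is one possible way to do that; `summarise_outliers` is a hypothetical helper, and the values and positions are made up rather than taken from the assignment data.

```python
import numpy as np
import pandas as pd

def summarise_outliers(values: pd.Series, positions: list) -> dict:
    """Compare flagged rows with the remaining rows (hypothetical helper)."""
    # Build a positional mask so that repeated index labels cause no trouble.
    is_flagged = np.zeros(len(values), dtype=bool)
    is_flagged[np.asarray(positions, dtype=int)] = True

    flagged = values[is_flagged]
    rest = values[~is_flagged]
    return {
        "n_outliers": int(is_flagged.sum()),
        "share": float(is_flagged.mean()),
        "typical_median": float(rest.median()),
        "outlier_min": float(flagged.min()),
        "outlier_max": float(flagged.max()),
    }

# Made-up values with one flagged row at position 5.
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 40.0, 11.0, 12.0])
summary = summarise_outliers(values, [5])
print(summary)
```

In this toy case one of eight rows (12.5 %) is flagged, and its value of 40.0 lies far above the median of 12.0 among the remaining rows. The same kind of comparison on the real variable can support the written description.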

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!

YOUR ANSWER HERE