
# Individual Homework: Data Preparation and Exploratory Analysis

**Due:** 21 Sep 2025, 11:59 pm (SGT)  
**Submission:** Single `your_name_assignment1.ipynb` file uploaded to Canvas


## Student details
Fill this in before you start.

- Student ID:

  - Example: e1234567



## 0. Set-up

In [None]:

# Import packages you need. Keep to the standard stack where possible.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# If you need scikit-learn later, you can import it when required.
# from sklearn.preprocessing import StandardScaler, MinMaxScaler
# from sklearn.impute import SimpleImputer
# from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

# Display options for readability
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 120)

# Set a seed for any random processes used later
SEED = 42
np.random.seed(SEED)
print('Set-up complete.')


## 1. Load the dataset

In [None]:
# The dataset tcx3213_assignment1_dataset.csv has been provided for you.
# You must use it for this homework assignment.

csv_path = 'tcx3213_assignment1_dataset.csv'  # This is the provided dataset file

# Load the dataset
# HINT: Use the appropriate pandas function to read the CSV file into a dataframe
# HINT: After loading, display the first few rows to check if the dataset loaded correctly

In [None]:
# Quick structure check
# HINT: Write code to display a concise summary of the dataframe, including column names, data types, and non-null counts
# HINT: Write code to display summary statistics for both numeric and non-numeric columns in a table format
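
The load-and-inspect steps hinted above could look roughly like the sketch below. To stay self-contained it reads a small in-memory CSV rather than the provided file; the column names (`age`, `income`, `city`) and values are hypothetical and will not match the real dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for the provided CSV so the sketch runs on its own.
# In the notebook you would pass csv_path to pd.read_csv instead.
sample_csv = io.StringIO(
    "age,income,city\n"
    "25,3200,Singapore\n"
    "31,,Jakarta\n"
    "47,5100,Singapore\n"
)

df = pd.read_csv(sample_csv)

# Preview the first rows to confirm the file parsed as expected
print(df.head())

# Concise summary: column names, dtypes, and non-null counts
df.info()

# Summary statistics for numeric and non-numeric columns in one table
summary = df.describe(include="all")
print(summary)
```

With `include="all"`, statistics that do not apply to a column (for example the mean of a text column) appear as `NaN` in the table.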



### 1.1 Basic dataset notes
Briefly describe:
- What each column represents (in your own words)
- The size of the dataset (rows and columns)
- Any obvious data quality issues you notice at first glance


## 2. Exploratory data analysis (EDA)

In [None]:
# Identify numeric and categorical columns
# HINT: Write code to select columns with numeric data types and store them in a list
# HINT: Write code to select columns with non-numeric (categorical) data types and store them in a list

# Print your lists to confirm
# HINT: Write code to print the names of numeric columns
# HINT: Write code to print the names of categorical columns
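
One possible way to split the columns by type is `select_dtypes`, sketched below on a hypothetical three-column dataframe (the column names are assumptions, not those of the provided dataset).

```python
import pandas as pd

# Hypothetical data; replace with the dataframe loaded from csv_path
df = pd.DataFrame({
    "age": [25, 31, 47],
    "income": [3200.0, None, 5100.0],
    "city": ["Singapore", "Jakarta", "Singapore"],
})

# Columns with numeric dtypes (integers and floats)
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric is treated as categorical in this sketch
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)
```

Note that `exclude="number"` also captures boolean and datetime columns, so check the resulting list before treating every entry as a category.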


In [None]:
# Plot distributions for numeric columns
# HINT: Loop through each numeric column in your list
# HINT: For each column, create a new figure
# HINT: Plot a histogram for the column with a suitable number of bins
# HINT: Add a title showing the column name
# HINT: Label the x-axis and y-axis clearly
# HINT: Display the plot
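
The plotting loop described in the hints might be sketched as follows. The data is randomly generated and the column names are hypothetical; the bin count of 20 is an arbitrary starting point, not a recommended value.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a script; not needed in Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical numeric data: one roughly symmetric column, one right-skewed
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 200),
    "income": rng.lognormal(8, 0.5, 200),
})
numeric_cols = ["age", "income"]

figures = {}  # keep a handle to each figure for later inspection
for col in numeric_cols:
    fig, ax = plt.subplots(figsize=(6, 4))           # new figure per column
    ax.hist(df[col].dropna(), bins=20, edgecolor="black")
    ax.set_title(f"Distribution of {col}")
    ax.set_xlabel(col)
    ax.set_ylabel("Frequency")
    figures[col] = fig
    plt.show()  # renders inline in Jupyter; shows nothing under the Agg backend
```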


In [None]:
# Basic bivariate exploration (example: correlation heatmap for numeric columns)
# HINT: Compute the correlation matrix for all numeric columns in the dataframe
# HINT: Create a new figure for the heatmap
# HINT: Use a plotting function to display the correlation matrix as a heatmap
# HINT: Add a colour bar to show correlation strength
# HINT: Set the title of the plot
# HINT: Add x-axis and y-axis tick labels using column names, rotated for readability
# HINT: Adjust the layout so labels and the heatmap fit nicely
# HINT: Display the final heatmap
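
A minimal sketch of the heatmap steps above, using plain matplotlib (`imshow`) on hypothetical generated data. The colour map and figure size are illustrative choices only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a script; not needed in Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: x_noisy is built from x, noise is independent
rng = np.random.default_rng(42)
x = rng.normal(size=100)
df = pd.DataFrame({
    "x": x,
    "x_noisy": x + rng.normal(scale=0.1, size=100),
    "noise": rng.normal(size=100),
})

# Pearson correlation matrix for the numeric columns
corr = df.corr(numeric_only=True)

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
fig.colorbar(im, ax=ax, label="Pearson correlation")
ax.set_title("Correlation heatmap")

# Tick labels from column names, rotated on the x-axis for readability
ticks = range(len(corr.columns))
ax.set_xticks(ticks)
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(ticks)
ax.set_yticklabels(corr.columns)

fig.tight_layout()
plt.show()  # renders inline in Jupyter; shows nothing under the Agg backend
```

Fixing `vmin=-1` and `vmax=1` keeps the colour scale comparable across different datasets.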



### 2.1 Notes from EDA
Summarise key observations:
- Which variables are skewed or have outliers
- Obvious relationships or lack of relationships
- Potential target variable if you choose to build a tiny baseline model later (optional)


## 3. Missing data analysis and handling

In [None]:
# Check overall missingness
# HINT: Calculate the number of missing values in each column
# HINT: Sort the missing value counts from highest to lowest
# HINT: Compute the percentage of missing values for each column
# HINT: Combine both counts and percentages into a single summary table for easy viewing
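
The missingness audit could be assembled along these lines. The dataframe below is hypothetical and only serves to make the sketch runnable.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in three of the four columns
df = pd.DataFrame({
    "age": [25, np.nan, 47, 52],
    "income": [3200.0, np.nan, np.nan, 4100.0],
    "city": ["Singapore", "Jakarta", None, "Singapore"],
    "id": [1, 2, 3, 4],
})

# Missing values per column, highest first
missing_count = df.isna().sum().sort_values(ascending=False)

# Same figures expressed as a percentage of all rows
missing_pct = (missing_count / len(df) * 100).round(1)

# One table with both views side by side
missing_summary = pd.DataFrame({
    "missing_count": missing_count,
    "missing_pct": missing_pct,
})
print(missing_summary)
```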



### 3.1 Plan your approach
For each feature with missing values, write your plan:
- Impute with mean or median for numeric with mild skew
- Impute with mode for categorical
- Consider more suitable methods if needed

Explain your choice briefly.


In [None]:
# Example skeleton for applying simple imputations

# HINT: Decide on a strategy for imputing missing numeric values (e.g., mean or median)
# HINT: Decide on a strategy for imputing missing categorical values (e.g., most frequent)

# HINT: Loop through each numeric column
# HINT: If the column has missing values, replace them using the chosen numeric strategy

# HINT: Loop through each categorical column
# HINT: If the column has missing values, replace them using the chosen categorical strategy

# HINT: You can use pandas fillna() or sklearn SimpleImputer for this step
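
As one possible reading of the skeleton above, the sketch below uses `fillna()` with the median for numeric columns and the most frequent value for categorical columns. The data and column names are hypothetical, and median and mode are just one choice of strategy; your own plan may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({
    "income": [3200.0, np.nan, 5100.0, 4100.0],
    "city": ["Singapore", "Jakarta", None, "Singapore"],
})
numeric_cols = ["income"]
categorical_cols = ["city"]

# Work on a copy so the original dataframe stays available for comparison
df_imputed = df.copy()

# Numeric strategy (assumed here): median
for col in numeric_cols:
    if df_imputed[col].isna().any():
        df_imputed[col] = df_imputed[col].fillna(df_imputed[col].median())

# Categorical strategy (assumed here): most frequent value
for col in categorical_cols:
    if df_imputed[col].isna().any():
        most_frequent = df_imputed[col].mode(dropna=True).iloc[0]
        df_imputed[col] = df_imputed[col].fillna(most_frequent)

print(df_imputed)
```

When several values tie for most frequent, `mode()` returns them in sorted order, so `.iloc[0]` picks the first one alphabetically; mention this in your notes if it applies.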


## 4. Encoding categorical variables

In [None]:
# Identify ordinal columns that have a natural order
# HINT: Look at your categorical columns and decide which ones have an inherent order (e.g., low < medium < high)
# HINT: Create a mapping dictionary to assign numeric values based on the order you identified
# HINT: Apply the mapping to transform the ordinal column into numeric form

# One-hot encoding for nominal columns
# HINT: For categorical columns without natural order, use one-hot encoding to create dummy variables
# HINT: Decide whether to drop the first category to avoid multicollinearity
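
The two encoding steps might look like the sketch below. The columns `satisfaction` and `city` and the order low < medium < high are hypothetical examples, not columns confirmed to exist in the provided dataset.

```python
import pandas as pd

# Hypothetical data: one ordinal column and one nominal column
df = pd.DataFrame({
    "satisfaction": ["low", "high", "medium", "low"],
    "city": ["Singapore", "Jakarta", "Manila", "Singapore"],
})

# Ordinal encoding: map each level to an integer that respects its order
satisfaction_order = {"low": 0, "medium": 1, "high": 2}
df["satisfaction_encoded"] = df["satisfaction"].map(satisfaction_order)

# One-hot encoding for the nominal column; drop_first=True removes one
# dummy column per feature to avoid perfectly collinear columns
df_encoded = pd.get_dummies(df, columns=["city"], drop_first=True, dtype=int)
print(df_encoded)
```

Any value missing from the mapping dictionary becomes `NaN` after `map()`, so it is worth checking for new missing values once the ordinal step has run.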



Explain your encoding choices:
- Which columns were treated as ordinal and why
- Which columns were one-hot encoded and why
- Any categories you combined or cleaned


## 5. Scaling, normalisation, and transformation

In [None]:
# Inspect skewness
# HINT: Calculate the skewness for each numeric column in the dataframe
# HINT: Sort the skewness values in descending order to see which columns are most skewed
# HINT: Decide which columns might need transformations based on their skewness values

# Apply transformations where helpful (example: log1p for positive skew)
# HINT: Loop through each numeric column
# HINT: For columns with strong positive skew, apply a suitable transformation (e.g., log1p) to reduce skewness
# HINT: Create new transformed columns rather than overwriting original columns
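
A rough sketch of the skewness check and transformation follows. The data is generated, and the cut-off of 1.0 for "strong" skew is an assumption for illustration; pick and justify your own threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical data: age is roughly symmetric, income is right-skewed
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 500),
    "income": rng.lognormal(8, 1.0, 500),
})
numeric_cols = ["age", "income"]

# Skewness per column, most positively skewed first
skewness = df[numeric_cols].skew().sort_values(ascending=False)
print(skewness)

SKEW_THRESHOLD = 1.0  # assumed cut-off for "strong" positive skew

for col in numeric_cols:
    # This sketch applies log1p only to non-negative columns
    if skewness[col] > SKEW_THRESHOLD and (df[col] >= 0).all():
        # Store the result in a new column so the original is preserved
        df[f"{col}_log1p"] = np.log1p(df[col])

print(df.filter(like="_log1p").skew())
```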


In [None]:
# Scaling example placeholders
# HINT: Decide whether to use standardisation (StandardScaler) or normalisation (MinMaxScaler) based on your analysis needs
# HINT: Create a copy of your dataframe before applying scaling
# HINT: Apply the scaler to the numeric columns only
# HINT: Check the first few rows of the scaled dataframe to confirm changes
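
The scaling placeholders could be filled in as sketched here, using `StandardScaler` from scikit-learn on a hypothetical dataframe. `MinMaxScaler` can be swapped in with the same `fit_transform` call if normalisation to a fixed range suits your analysis better.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data with no missing values (impute before scaling)
df = pd.DataFrame({
    "age": [25.0, 31.0, 47.0, 52.0],
    "income": [3200.0, 4100.0, 5100.0, 4100.0],
    "city": ["Singapore", "Jakarta", "Manila", "Singapore"],
})
numeric_cols = ["age", "income"]

# Copy first so the unscaled values remain available
df_scaled = df.copy()

# Standardise the numeric columns only: zero mean, unit variance
scaler = StandardScaler()
df_scaled[numeric_cols] = scaler.fit_transform(df_scaled[numeric_cols])

print(df_scaled.head())
```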



Briefly justify your choices:
- Which columns you transformed and why
- Which scaling method you used and why


## 6. Feature selection (basic)

In [None]:
# Filter-based selection examples

# Option A: remove highly correlated features
# HINT: Calculate the absolute correlation matrix for numeric columns
# HINT: Extract the upper triangle of the correlation matrix to avoid duplicates
# HINT: Identify columns with correlation above a chosen threshold (e.g., 0.9)
# HINT: Drop the highly correlated columns from the dataframe
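
Option A might be implemented as in the sketch below. The data is generated so that `x_copy` is nearly identical to `x`; the column names are hypothetical and the 0.9 threshold simply follows the example in the hint.

```python
import numpy as np
import pandas as pd

# Hypothetical data: x_copy duplicates x almost exactly, noise is independent
rng = np.random.default_rng(42)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_copy": x + rng.normal(scale=0.01, size=200),
    "noise": rng.normal(size=200),
})

# Absolute correlation matrix for the numeric columns
corr_abs = df.corr(numeric_only=True).abs()

# Keep only the strict upper triangle (k=1 excludes the diagonal),
# so each pair of columns is considered once
upper_mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
upper = corr_abs.where(upper_mask)

THRESHOLD = 0.9
# A column is dropped if it correlates strongly with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > THRESHOLD).any()]

df_reduced = df.drop(columns=to_drop)
print("Dropped:", to_drop)
```

Because only the upper triangle is inspected, the later column of each highly correlated pair is the one removed; reorder the columns first if you would rather keep a specific one.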

# Option B: mutual information (choose appropriate function for your target type)
# HINT: Choose the correct mutual information function based on the target type
#       - mutual_info_regression for continuous targets
#       - mutual_info_classif for categorical targets
# HINT: Apply mutual information to all features except the target column
# HINT: Rank features by their MI scores and decide which ones to keep
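
For Option B, a minimal sketch with a continuous target is shown below, so it uses `mutual_info_regression`. The features, the target, and their relationship are all generated for illustration; for a categorical target, `mutual_info_classif` takes the same arguments.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Hypothetical data: the target depends on "informative" but not on "noise"
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
df["target"] = 2 * df["informative"] + rng.normal(scale=0.1, size=n)

# Separate the features from the target so the target cannot leak into X
X = df.drop(columns=["target"])
y = df["target"]

# Estimate MI between each feature and the target
mi = mutual_info_regression(X, y, random_state=42)

# Rank features from most to least informative
mi_scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(mi_scores)
```

The estimator expects numeric features without missing values, so run it after imputation and encoding.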



Write a short note describing the selected features and why they are suitable.


## 7. Conclusions

Summarise what you did and learned. State any assumptions or limitations.


Write your conclusions here.

## Marking rubrics

### Summary rubric (100 marks)

| Section                        | Criteria                                                                                          | Marks |
|--------------------------------|---------------------------------------------------------------------------------------------------|-------|
| 0. Set-up                       | Clean imports, seed set, readable options                                                        | 5     |
| 1. Load and describe             | Correct load, preview, clear description of columns, accurate description of data types          | 10    |
| 2. EDA                           | Correct identification of numeric and categorical columns, sensible plots, correct observations about skewness and relationships | 20    |
| 3. Missing data                  | Accurate audit, reasonable plan, correct implementation, justification                           | 20    |
| 4. Encoding                      | Correct identification of ordinal vs nominal, appropriate encodings, explanation                 | 15    |
| 5. Scaling and transformation    | Correct skew checks, reasonable use of transformations, correct scaling choice with justification | 15    |
| 6. Feature selection             | Correct method, coherent threshold or MI use, clear explanation                                  | 10    |
| 7. Conclusions                   | Clear summary and limitations                                                                    | 5     |




### Detailed rubrics

| Section                     | Excellent                                                                                                         | Satisfactory                                    | Needs work                             | Common pitfalls                                                        |
|-----------------------------|-------------------------------------------------------------------------------------------------------------------|------------------------------------------------|----------------------------------------|------------------------------------------------------------------------|
| 1. Load and describe         | Loads correctly, concise but informative column descriptions, correct dtype commentary                           | Loads correctly with brief notes                | Loads with errors or missing notes      | Forgetting to set the correct file name, no info() or describe()         |
| 2. EDA                       | All relevant plots, clear notes on skew and outliers, simple correlation heatmap                                  | Some plots, some comments                        | Minimal plots, little commentary        | Plotting unreadable figures, no labels or titles                         |
| 3. Missing data              | Complete audit table, choices justified by type and distribution, implementation correct                         | Partial audit, basic fill rules used             | Incomplete audit, unjustified methods   | Imputing strings with numeric strategies, leaving NaN without explanation|
| 4. Encoding                  | Correct ordinal maps, appropriate one-hot, clear rationale                                                        | Mostly correct encodings                         | Incorrect or unexplained encodings       | Treating nominal as ordinal, exploding cardinality without discussion    |
| 5. Scaling and transformation| Sound skew checks, appropriate transforms, correct scaler choice and reasoning                                   | Some transforms or scaling with basic reasoning | Misapplied transforms or no reasoning    | Applying log to zeros or negatives, scaling before fixing missing values |
| 6. Feature selection         | Method well explained, threshold or mutual information (MI) sensibly chosen, clean reduced set                    | Some selection with light explanation            | No clear method or arbitrary drops        | Introducing target leakage, dropping features at random                  |
| 7. Conclusions               | Clear synthesis, limitations acknowledged                                                                        | Basic summary                                   | Vague or missing                        | No link between steps and conclusions                                    |
