# CS5228-KDDM, 2025/26-2, Coursework 1

### Introduction
Following the data cleaning phase, the dataset (result1-1.csv) is now assumed to be free of missing values and dirty records. However, to apply mathematical models and distance-based machine learning algorithms, the data must be transformed into a suitable numerical format. Q1-2 focuses on these essential data transformation steps.

The primary objectives of this section are twofold:

* Categorical Encoding: Many machine learning algorithms cannot process text data directly. We will convert three categorical attributes, namely Workclass (Column B), Education (Column D), and Sex (Column J), into numerical representations by assigning each distinct value a unique integer label (pseudo encoding). The intermediate result will be saved as result1-2.csv.


* Data Normalization: As observed in Task 1.1, the Capital Gain (Column K) and Capital Loss (Column L) attributes exhibit extreme sparsity and large numerical ranges. To prevent these large values from dominating distance calculations, we will apply a Z-transform (Standardization) to scale these features so they have a mean of 0 and a standard deviation of 1.

The final, fully preprocessed dataset will be saved as result1-3.csv, ready for subsequent clustering analysis.
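
As a rough illustration of the two transformations described above, the following sketch applies them to small made-up lists using only the standard library. The helper names `label_encode` and `z_transform` and the toy values are assumptions for illustration; they are not taken from result1-1.csv. The encoding assigns integers in sorted order of the distinct values, and the Z-transform uses the population standard deviation.

```python
from statistics import fmean, pstdev


def label_encode(values):
    """Map each distinct value to an integer, in sorted order of the values."""
    classes = sorted(set(values))
    index = {label: code for code, label in enumerate(classes)}
    return [index[v] for v in values], index


def z_transform(values):
    """Scale values to mean 0 and population standard deviation 1."""
    mean = fmean(values)
    std = pstdev(values)  # population std (divides by n)
    return [(v - mean) / std for v in values]


# Hypothetical toy data: a categorical column and a sparse numeric column
sex_codes, sex_mapping = label_encode(["Male", "Female", "Female", "Male"])
gain_z = z_transform([0, 0, 0, 0, 10000])

print(sex_mapping)  # {'Female': 0, 'Male': 1}
print(sex_codes)    # [1, 0, 0, 1]
print(gain_z)       # [-0.5, -0.5, -0.5, -0.5, 2.0]
```

In the toy numeric column the single large value ends up only two standard deviations above the mean, which is the effect the normalization step relies on to keep such values from dominating distance calculations.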

#### Student Name: MA YUCHEN
#### Student Number: A0327384X

### CW1, Part 1: Data Preprocessing using Python (2+2=4 marks)

### CW1-1-2: Data Transformation (2 marks) 
#### Datasets: result1-1.csv 
This dataset is assumed to be clean and without any missing or dirty values. 

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

In [2]:
def encode_categorical_columns(df):
    # B: workclass
    # D: education
    # J: sex
    target_cols = ['workclass', 'education', 'sex']

    for col in target_cols:
        # Fit a fresh encoder per column so each column keeps its own mapping
        df[col] = LabelEncoder().fit_transform(df[col])


    # Show the first 15 records
    print("\nThe first 15 records:")
    print(df.head(15))
    
    # Save Result
    output_file = 'result1-2.csv'
    df.to_csv(output_file, index=False)
    print(f"Saved the data to {output_file}")
    
    return df
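
The integer codes in the encoded columns are easier to interpret when the label-to-code mapping is available. A minimal sketch of how that mapping could be recovered from a fitted `LabelEncoder` is shown below. The category strings are illustrative assumptions, not values read from result1-1.csv. `LabelEncoder` stores the distinct labels in sorted order in `classes_`, and the position of a label in that array is its code.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of a categorical column
values = ["Private", "State-gov", "Private", "Self-emp-inc"]

le = LabelEncoder()
codes = le.fit_transform(values)

# classes_ is sorted, so the index of each label is its integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}

print(mapping)         # {'Private': 0, 'Self-emp-inc': 1, 'State-gov': 2}
print(codes.tolist())  # [0, 2, 0, 1]
```

Because the codes follow alphabetical order, they carry no meaningful ranking; distance-based methods will still treat them as ordered numbers.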

In [3]:
def normalize_capital_columns(df):

    # K: capital-gain
    # L: capital-loss
    norm_cols = ['capital-gain', 'capital-loss']
    
    # 1. Print mean and population std (ddof=0) before normalization
    print("Before Normalization:")
    for col in norm_cols:
        print(f"  {col}: Mean = {df[col].mean():.4f}, Std = {df[col].std(ddof=0):.4f}")
        
    # 2. Z-transform: z = (x - mean) / std
    for col in norm_cols:
        mean_val = df[col].mean()
        std_val = df[col].std(ddof=0) 
        df[col] = (df[col] - mean_val) / std_val
    
    # 3. Print mean and population std (ddof=0) after normalization
    # Expect Mean ≈ 0 and Std ≈ 1 when measured with the same ddof as the transform
    print("\nAfter Normalization:")
    for col in norm_cols:
        print(f"  {col}: Mean = {df[col].mean():.4f}, Std = {df[col].std(ddof=0):.4f}")
        
    # 4. Show the last 15 records
    print("\nThe last 15 records:")
    print(df.tail(15))
    
    # Save Result
    output_file = 'result1-3.csv'
    df.to_csv(output_file, index=False)
    print(f"Saved the data to {output_file}")
    
    return df
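
One detail of the Z-transform above is the choice of `ddof`. pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`), while the transform divides by the population standard deviation (`ddof=0`). The sketch below, on a made-up five-value series, shows how the two measures differ after scaling; the data are an assumption for illustration only.

```python
import pandas as pd

# Hypothetical sparse column: mostly zeros with one large value
s = pd.Series([0.0, 0.0, 0.0, 0.0, 10000.0])

# Z-transform with the population standard deviation, as in the function above
z = (s - s.mean()) / s.std(ddof=0)

pop_std = z.std(ddof=0)   # same ddof as the transform
sample_std = z.std()      # pandas default, ddof=1

print(f"population std: {pop_std:.4f}")  # 1.0000
print(f"sample std:     {sample_std:.4f}")  # 1.1180
```

The sample figure equals sqrt(n / (n - 1)), so on a dataset with many rows the gap is far smaller and may not be visible at four decimal places. Checking the result with the same `ddof` used in the transform avoids the discrepancy altogether.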

In [4]:
if __name__ == "__main__":
    df = pd.read_csv('result1-1.csv')
    df_encoded = encode_categorical_columns(df)
    df_normalized = normalize_capital_columns(df_encoded)


The first 15 records:
    age  workclass  fnlwgt  education  education-num         marital-status  \
0    39          5   77516          9             13          Never-married   
1    50          4   83311          9             13     Married-civ-spouse   
2    38          2  215646         11              9               Divorced   
3    53          2  234721          1              7     Married-civ-spouse   
4    28          2  338409          9             13     Married-civ-spouse   
5    37          2  284582         12             14     Married-civ-spouse   
6    49          2  160187          6              5  Married-spouse-absent   
7    52          4  209642         11              9     Married-civ-spouse   
8    31          2   45781         12             14          Never-married   
9    42          2  159449          9             13     Married-civ-spouse   
10   37          2  280464         15             10     Married-civ-spouse   
11   30          5  141297   