<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Data Normalization Techniques**


Estimated time needed: **30** minutes


In this lab, you will focus on data normalization. This includes identifying compensation-related columns, applying normalization techniques, and visualizing the data distributions.


## Objectives


In this lab, you will perform the following:


- Identify duplicate rows and remove them.

- Check and handle missing values in key columns.

- Identify and normalize compensation-related columns.

- Visualize the effect of normalization techniques on data distributions.


-----


## Hands-on Lab


### Step 1: Install and Import Libraries


In [None]:
!pip install pandas

In [None]:
!pip install matplotlib

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

### Step 2: Load the Dataset into a DataFrame


We use the <code>pandas.read_csv()</code> function to read CSV files. It accepts a URL as well as a local path, so in this version of the lab, which runs on JupyterLite, the dataset is loaded directly from its hosted location.


The code below reads the dataset into a DataFrame and prints the first few rows:


In [None]:
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

df = pd.read_csv(file_path)

# Display the first few rows to check if data is loaded correctly
print(df.head())



### Section 1: Handling Duplicates
##### Task 1: Identify and remove duplicate rows.


In [None]:
## Write your code here

# Task 1: Identify and remove duplicate rows

# Count duplicates
num_duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")

# Remove duplicates
df = df.drop_duplicates()

print(f"Dataset shape after removing duplicates: {df.shape}")
print(f"Remaining duplicates: {df.duplicated().sum()}")

### Section 2: Handling Missing Values
##### Task 2: Identify missing values in `CodingActivities`.


In [None]:
## Write your code here

# Task 2: Identify missing values in CodingActivities

if 'CodingActivities' in df.columns:
    missing_count = df['CodingActivities'].isnull().sum()
    print(f"Missing values in CodingActivities: {missing_count}")
    print(f"Percentage missing: {(missing_count / len(df) * 100):.2f}%")
else:
    print("CodingActivities column not found")

##### Task 3: Impute missing values in CodingActivities with forward-fill.
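
As a minimal sketch of how forward-fill behaves, the example below uses a small made-up Series (the values are illustrative and not taken from the survey data). Each missing entry takes the most recent non-missing value above it, so a missing value in the first row has nothing to copy from and stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical answers; not taken from the survey dataset
activities = pd.Series([np.nan, "Hobby", np.nan, np.nan, "Open source", np.nan])

# Propagate the last valid observation forward
filled = activities.ffill()

print(filled.tolist())
print(f"Missing after forward-fill: {filled.isnull().sum()}")
```

Because forward-fill depends on row order, it copies the previous respondent's answer in a survey where rows are independent. Treat the result as a simple placeholder imputation rather than an estimate of the true value.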


In [None]:
## Write your code here

# Task 3: Impute missing values in CodingActivities with forward-fill

if 'CodingActivities' in df.columns:
    original_missing = df['CodingActivities'].isnull().sum()
    
    # Forward fill (Series.ffill replaces the deprecated fillna(method='ffill'))
    df['CodingActivities'] = df['CodingActivities'].ffill()
    
    # Check remaining missing
    after_missing = df['CodingActivities'].isnull().sum()
    
    print(f"Missing before forward-fill: {original_missing}")
    print(f"Missing after forward-fill: {after_missing}")
    print(f"Values imputed: {original_missing - after_missing}")
else:
    print("CodingActivities column not found")

**Note**:  Before normalizing ConvertedCompYearly, ensure that any missing values (NaN) in this column are handled appropriately. You can choose to either drop the rows containing NaN or replace the missing values with a suitable statistic (e.g., median or mean).
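
The two options in the note can be sketched as follows. This is a minimal illustration on a made-up Series (the numbers are assumptions, not survey values) that contrasts dropping missing entries with replacing them by the median.

```python
import numpy as np
import pandas as pd

# Hypothetical compensation values; not taken from the survey dataset
comp = pd.Series([40000.0, np.nan, 60000.0, 80000.0, np.nan],
                 name="ConvertedCompYearly")

# Option 1: drop the missing entries
comp_dropped = comp.dropna()

# Option 2: replace missing entries with the median of the observed values
median_val = comp.median()  # NaN values are skipped by default
comp_filled = comp.fillna(median_val)

print(f"Rows after dropping: {len(comp_dropped)}")
print(f"Median used for imputation: {median_val}")
print(f"Missing after imputation: {comp_filled.isnull().sum()}")
```

Dropping keeps only observed values but shrinks the dataset. Median imputation keeps every row but concentrates values at one point, which reduces the spread that Z-score normalization later relies on.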


### Section 3: Normalizing Compensation Data
##### Task 4: Identify compensation-related columns, such as ConvertedCompYearly.
Normalization is commonly applied to compensation data to bring values within a comparable range. Here, you’ll identify the columns that contain compensation information, such as ConvertedCompYearly. The ConvertedCompYearly column will be used in the subsequent tasks for normalization.


In [None]:
## Write your code here

# Task 4: Identify compensation-related columns

# Find compensation columns
comp_columns = [col for col in df.columns if 'comp' in col.lower() or 'salary' in col.lower()]

print("Compensation-related columns:")
for col in comp_columns:
    print(f"  - {col}")

# Focus on ConvertedCompYearly
if 'ConvertedCompYearly' in df.columns:
    print(f"\nConvertedCompYearly statistics:")
    print(df['ConvertedCompYearly'].describe())
    
    missing = df['ConvertedCompYearly'].isnull().sum()
    print(f"\nMissing values: {missing} ({(missing/len(df)*100):.2f}%)")

##### Task 5: Normalize ConvertedCompYearly using Min-Max Scaling.
Min-Max Scaling brings all values in a column to a 0-1 range, making it useful for comparing data across different scales. Here, you will apply Min-Max normalization to the ConvertedCompYearly column, creating a new column ConvertedCompYearly_MinMax with normalized values.
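
As a minimal sketch of the formula `(x - min) / (max - min)`, the example below scales a small made-up Series (the values are illustrative assumptions). It also guards against a constant column, where `max - min` would be zero and the division would produce NaN.

```python
import pandas as pd

# Hypothetical compensation values; not taken from the survey dataset
comp = pd.Series([20000.0, 50000.0, 80000.0, 120000.0])

min_val = comp.min()
max_val = comp.max()
value_range = max_val - min_val

# Guard against a constant column, where max - min would be zero
if value_range == 0:
    comp_minmax = pd.Series(0.0, index=comp.index)
else:
    comp_minmax = (comp - min_val) / value_range

print(comp_minmax.tolist())
```

The smallest value maps to 0 and the largest to 1. Because both bounds come from the extremes, a single very large outlier compresses all other values toward 0.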


In [None]:
## Write your code here

# Task 5: Normalize ConvertedCompYearly using Min-Max Scaling

if 'ConvertedCompYearly' in df.columns:
    # Drop NaN values for normalization; .copy() avoids SettingWithCopyWarning when adding a column
    df_clean = df.dropna(subset=['ConvertedCompYearly']).copy()
    
    # Min-Max Scaling: (x - min) / (max - min)
    min_val = df_clean['ConvertedCompYearly'].min()
    max_val = df_clean['ConvertedCompYearly'].max()
    
    df_clean['ConvertedCompYearly_MinMax'] = (df_clean['ConvertedCompYearly'] - min_val) / (max_val - min_val)
    
    print("Min-Max Normalization applied:")
    print(f"Original range: [{min_val}, {max_val}]")
    # Report the range actually observed instead of a hard-coded [0, 1]
    print(f"Normalized range: [{df_clean['ConvertedCompYearly_MinMax'].min()}, {df_clean['ConvertedCompYearly_MinMax'].max()}]")
    print(f"\nSample normalized values:")
    print(df_clean[['ConvertedCompYearly', 'ConvertedCompYearly_MinMax']].head())
    
    # Update the main dataframe
    df = df_clean
else:
    print("ConvertedCompYearly column not found")

##### Task 6: Apply Z-score Normalization to `ConvertedCompYearly`.

Z-score normalization standardizes values by subtracting the mean and dividing by the standard deviation, so the result has a mean of 0 and a standard deviation of 1. It rescales the data without changing the shape of its distribution, and it is easiest to interpret when the data are roughly Gaussian (normal). Here, you’ll calculate Z-scores for the ConvertedCompYearly column, saving the results in a new column ConvertedCompYearly_Zscore.
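
As a minimal sketch of the formula `(x - mean) / std`, the example below standardizes a small made-up Series (the values are illustrative assumptions) and checks the mean and standard deviation of the result.

```python
import pandas as pd

# Hypothetical compensation values; not taken from the survey dataset
comp = pd.Series([20000.0, 50000.0, 80000.0, 120000.0])

mean_val = comp.mean()
std_val = comp.std()  # pandas uses the sample standard deviation (ddof=1) by default

comp_z = (comp - mean_val) / std_val

print(f"Mean of Z-scores: {comp_z.mean():.6f}")
print(f"Std of Z-scores:  {comp_z.std():.6f}")
```

Note that `Series.std()` defaults to `ddof=1`. Passing `ddof=0` gives the population standard deviation instead, which changes the Z-scores slightly on small samples.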


In [None]:
## Write your code here

# Task 6: Apply Z-score Normalization to ConvertedCompYearly

if 'ConvertedCompYearly' in df.columns:
    # Z-score: (x - mean) / std
    mean_val = df['ConvertedCompYearly'].mean()
    std_val = df['ConvertedCompYearly'].std()
    
    df['ConvertedCompYearly_Zscore'] = (df['ConvertedCompYearly'] - mean_val) / std_val
    
    print("Z-score Normalization applied:")
    print(f"Mean: {mean_val:.2f}")
    print(f"Standard Deviation: {std_val:.2f}")
    print(f"\nZ-score distribution:")
    print(df['ConvertedCompYearly_Zscore'].describe())
    print(f"\nSample Z-scores:")
    print(df[['ConvertedCompYearly', 'ConvertedCompYearly_Zscore']].head())
else:
    print("ConvertedCompYearly column not found")

### Section 4: Visualization of Normalized Data
##### Task 7: Visualize the distribution of `ConvertedCompYearly`, `ConvertedCompYearly_MinMax`, and `ConvertedCompYearly_Zscore`

Visualization helps you understand how normalization changes the data distribution. In this task, create histograms for the original ConvertedCompYearly, as well as its normalized versions (ConvertedCompYearly_MinMax and ConvertedCompYearly_Zscore). This will help you compare how each normalization technique affects the data range and distribution.
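
Before plotting, it can help to know what to expect. Both techniques are linear rescalings, so they change the range on the x-axis but not the shape of the histogram. The sketch below illustrates this on a made-up, right-skewed Series (the values are assumptions, not survey data) by comparing the range and skewness of each version.

```python
import pandas as pd

# Hypothetical right-skewed compensation values; not taken from the survey dataset
comp = pd.Series([30000.0, 35000.0, 40000.0, 45000.0, 60000.0, 250000.0])

comp_minmax = (comp - comp.min()) / (comp.max() - comp.min())
comp_z = (comp - comp.mean()) / comp.std()

# Compare range and skewness across the three versions
summary = pd.DataFrame(
    {
        "min": [comp.min(), comp_minmax.min(), comp_z.min()],
        "max": [comp.max(), comp_minmax.max(), comp_z.max()],
        "skew": [comp.skew(), comp_minmax.skew(), comp_z.skew()],
    },
    index=["original", "min-max", "z-score"],
)
print(summary)
```

The `min` and `max` columns differ between rows, while the `skew` column is the same for all three up to floating-point error. The three histograms should therefore look alike apart from their axis scales.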


In [None]:
## Write your code here

# Task 7: Visualize distributions

if all(col in df.columns for col in ['ConvertedCompYearly', 'ConvertedCompYearly_MinMax', 'ConvertedCompYearly_Zscore']):
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    
    # Original distribution
    axes[0].hist(df['ConvertedCompYearly'], bins=50, edgecolor='black')
    axes[0].set_title('Original ConvertedCompYearly Distribution')
    axes[0].set_xlabel('Compensation')
    axes[0].set_ylabel('Frequency')
    
    # Min-Max normalized distribution
    axes[1].hist(df['ConvertedCompYearly_MinMax'], bins=50, edgecolor='black', color='green')
    axes[1].set_title('Min-Max Normalized Distribution')
    axes[1].set_xlabel('Normalized Compensation (0-1)')
    axes[1].set_ylabel('Frequency')
    
    # Z-score normalized distribution
    axes[2].hist(df['ConvertedCompYearly_Zscore'], bins=50, edgecolor='black', color='orange')
    axes[2].set_title('Z-score Normalized Distribution')
    axes[2].set_xlabel('Z-score')
    axes[2].set_ylabel('Frequency')
    
    plt.tight_layout()
    plt.show()
    
    print("Visualization complete! Compare the three distributions above.")
else:
    print("Required columns not found for visualization")

### Summary


In this lab, you practiced essential data cleaning and normalization techniques, including:

- Identifying and handling duplicate rows.

- Checking for and imputing missing values.

- Applying Min-Max scaling and Z-score normalization to compensation data.

- Visualizing the impact of normalization on data distribution.


Copyright © IBM Corporation. All rights reserved.
