<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Finding Missing Values**


Estimated time needed: **30** minutes


Data wrangling is the process of cleaning, transforming, and organizing data so that it is suitable for analysis. Finding and handling missing values is a crucial step in this process, because undetected gaps can distort counts, averages, and charts. In this lab, you will focus exclusively on identifying and imputing missing values in the dataset.


## Objectives


After completing this lab, you will be able to:


- Identify missing values in the dataset.

- Quantify missing values for specific columns.

- Impute missing values using various strategies.


## Hands-on Lab


##### Setup: Install Required Libraries


In [None]:
!pip install pandas
!pip install matplotlib
!pip install seaborn

##### Import Necessary Modules:


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Tasks


<h2>1. Load the Dataset</h2>
<p>
We use the <code>pandas.read_csv()</code> function to read CSV files. Because this version of the lab runs on JupyterLite, the dataset is not available locally and must be loaded from a remote URL using the code provided below.
</p>


The code below reads the dataset from its URL into a DataFrame:



In [None]:
# Define the URL of the dataset
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

# Load the dataset into a DataFrame
df = pd.read_csv(file_path)

# Display the first few rows to ensure it loaded correctly
print(df.head())
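
Which cells count as missing is decided while the file is parsed, so it can help to see how `pandas.read_csv()` treats different placeholders. The following is a minimal sketch on a made-up in-memory CSV (the rows and the `"unknown"` placeholder are assumptions for illustration, not taken from the survey data):

```python
import io
import pandas as pd

# Hypothetical CSV text: an empty field, the string "NA", and a custom placeholder
csv_text = """Respondent,Employment,Age
1,Employed,34
2,,29
3,NA,
4,unknown,41
"""

# Default parsing: empty fields and the string "NA" are read as NaN
toy = pd.read_csv(io.StringIO(csv_text))
default_missing = toy.isnull().sum()

# na_values adds extra strings that should also be treated as missing
toy_custom = pd.read_csv(io.StringIO(csv_text), na_values=["unknown"])
custom_missing = toy_custom.isnull().sum()

print(default_missing)
print(custom_missing)
```

If the real dataset encodes missing answers with a custom string, passing it through `na_values` makes the later `isnull()` counts include those cells.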


### 2. Explore the Dataset
##### Task 1: Display basic information and summary statistics of the dataset.


In [None]:
## Write your code here

# Task 1: Display basic information and summary statistics

print("Dataset Information:")
print("=" * 60)
df.info()

print("\n" + "=" * 60)
print("Summary Statistics:")
print("=" * 60)
print(df.describe())

print("\n" + "=" * 60)
print(f"Dataset Shape: {df.shape}")
print(f"Number of Columns: {len(df.columns)}")
print(f"Number of Rows: {len(df)}")
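
By default `describe()` summarizes only numeric columns, and its `count` row counts non-missing entries. As a small sketch on a made-up DataFrame (the column values are assumptions for illustration), `include="all"` brings text columns into the summary and the `count` row already hints at how many values are missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Age": [34, np.nan, 29, 41],
    "Employment": ["Employed", None, "Student", "Employed"],
})

# include="all" adds non-numeric columns to the summary
summary = toy.describe(include="all")

# The "count" row holds the number of non-missing entries per column
non_missing = summary.loc["count"]
implied_missing = len(toy) - non_missing

print(summary)
print(implied_missing)
```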

### 3. Finding Missing Values
##### Task 2: Identify missing values for all columns.


In [None]:
## Write your code here

# Task 2: Identify missing values for all columns

print("Missing Values Analysis:")
print("=" * 60)

# Count missing values per column
missing_values = df.isnull().sum()

# Calculate percentage of missing values, reusing the counts above
missing_percent = (missing_values / len(df)) * 100

# Create a summary DataFrame
missing_summary = pd.DataFrame({
    'Column': missing_values.index,
    'Missing_Count': missing_values.values,
    'Missing_Percentage': missing_percent.values
})

# Filter to show only columns with missing values
missing_summary = missing_summary[missing_summary['Missing_Count'] > 0]
missing_summary = missing_summary.sort_values('Missing_Count', ascending=False)

print(f"\nColumns with missing values: {len(missing_summary)}")
print("\nTop 10 columns with most missing values:")
print(missing_summary.head(10).to_string(index=False))
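
To make the counting logic easier to check by hand, here is a minimal sketch of the same idea on a made-up three-column DataFrame (all values are assumptions for illustration). Because `isnull()` returns booleans, `sum()` gives the count and `mean()` gives the fraction of missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Age has 2 gaps, Employment has 1, Country has none
toy = pd.DataFrame({
    "Age": [34, np.nan, 29, np.nan],
    "Employment": ["Employed", None, "Student", "Employed"],
    "Country": ["US", "IN", "DE", "BR"],
})

missing_count = toy.isnull().sum()        # True counts as 1
missing_pct = toy.isnull().mean() * 100   # mean of booleans = fraction missing

toy_summary = pd.DataFrame({
    "Missing_Count": missing_count,
    "Missing_Percentage": missing_pct,
})

# Keep only columns that actually have gaps
toy_summary = toy_summary[toy_summary["Missing_Count"] > 0]
print(toy_summary)
```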

##### Task 3: Visualize missing values using a heatmap (using the seaborn library).



In [None]:
## Write your code here

# Task 3: Visualize missing values using a heatmap

# Select columns with the most missing values for visualization
columns_to_plot = missing_summary.head(20)['Column'].tolist()

if columns_to_plot:
    # Create a boolean DataFrame where True = missing
    missing_data = df[columns_to_plot].isnull()
    
    # Create the figure only when there is something to plot,
    # so the else branch does not leave an empty figure behind
    plt.figure(figsize=(12, 8))
    
    # Create heatmap
    sns.heatmap(missing_data, cbar=True, cmap='viridis', yticklabels=False)
    plt.title('Missing Values Heatmap (Top 20 Columns with Most Missing Data)')
    plt.xlabel('Columns')
    plt.ylabel('Rows')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
else:
    print("No missing values to visualize!")
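
A heatmap shows where gaps cluster, but it can be hard to read exact numbers from it. As a possible numeric complement, the sketch below counts missing values per row on a made-up DataFrame (all values are assumptions for illustration), using the same kind of boolean mask that the heatmap draws:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the last row is missing in every column
toy = pd.DataFrame({
    "Age": [34, np.nan, 29, np.nan],
    "Employment": ["Employed", None, "Student", None],
    "Country": ["US", "IN", "DE", None],
})

# Boolean mask: True marks a missing cell
mask = toy.isnull()

# Summing across columns (axis=1) gives the number of gaps in each row
missing_per_row = mask.sum(axis=1)

# Rows with no gaps at all
fully_populated = (missing_per_row == 0).sum()

print(missing_per_row)
print(f"Rows without any missing value: {fully_populated}")
```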

##### Task 4: Count the number of rows with missing values in a specific column (e.g., `Employment`).


In [None]:
## Write your code here

# Task 4: Count the number of missing rows for Employment column

if 'Employment' in df.columns:
    employment_missing = df['Employment'].isnull().sum()
    employment_missing_percent = (employment_missing / len(df)) * 100
    
    print("Employment Column Missing Values:")
    print(f"  Missing Count: {employment_missing}")
    print(f"  Missing Percentage: {employment_missing_percent:.2f}%")
    print(f"  Total Rows: {len(df)}")
    print(f"  Non-Missing Rows: {len(df) - employment_missing}")
else:
    print("Employment column not found in dataset")
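
For a single column there are a few equivalent ways to reach the same numbers. The sketch below uses a made-up Series (the values are assumptions for illustration) to show that `isnull().sum()` and `count()` split the rows between them, and that `value_counts(dropna=False)` lists the gaps alongside the real categories:

```python
import pandas as pd

# Hypothetical toy column with two missing answers
employment = pd.Series(
    ["Employed", None, "Student", "Employed", None],
    name="Employment",
)

missing = employment.isnull().sum()   # rows where the value is missing
present = employment.count()          # rows where the value is present

# dropna=False keeps the missing entries as their own category
with_nan = employment.value_counts(dropna=False)

print(f"Missing: {missing}, present: {present}")
print(with_nan)
```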

### 4. Imputing Missing Values
##### Task 5: Identify the most frequent (majority) value in a specific column (e.g., `Employment`).


In [None]:
## Write your code here

# Task 5: Identify the most frequent value in Employment column

if 'Employment' in df.columns:
    # Get value counts
    employment_counts = df['Employment'].value_counts()
    
    print("Employment Value Counts:")
    print(employment_counts)
    
    # Get the most frequent value
    most_frequent_employment = df['Employment'].mode()[0]
    
    print(f"\nMost Frequent Employment Value: '{most_frequent_employment}'")
    print(f"Frequency: {employment_counts.iloc[0]}")
    print(f"Percentage of non-missing values: {(employment_counts.iloc[0] / employment_counts.sum() * 100):.2f}%")
else:
    print("Employment column not found in dataset")
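
Note that `mode()` returns a Series rather than a single value: it ignores missing entries, lists every tied value in sorted order, and is empty when the column has no values at all. The sketch below illustrates this on made-up data; `most_frequent_or_default` is a hypothetical helper introduced here for illustration, not part of pandas or of this lab:

```python
import pandas as pd

# Hypothetical toy column where two categories are tied
tied = pd.Series(["Student", "Employed", "Student", "Employed", None])
tied_modes = tied.mode()      # contains both tied values, sorted
first_mode = tied_modes[0]    # [0] simply picks the first of them

# A column with no values at all has no mode
all_missing = pd.Series([None, None], dtype="object")
no_mode = all_missing.mode()  # empty Series, so [0] would raise an error


def most_frequent_or_default(series, default="Unknown"):
    """Hypothetical helper: return the first mode, or a default if there is none."""
    modes = series.mode()
    return modes.iloc[0] if not modes.empty else default


print(first_mode)
print(most_frequent_or_default(all_missing))
```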

##### Task 6: Impute missing values in the `Employment` column with the most frequent value.



In [None]:
## Write your code here

# Task 6: Impute missing values in Employment column

if 'Employment' in df.columns:
    # Store original missing count
    original_missing = df['Employment'].isnull().sum()
    
    print("Before Imputation:")
    print(f"  Missing values: {original_missing}")
    
    # Impute with most frequent value
    most_frequent = df['Employment'].mode()[0]
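    # Assign the result back instead of calling fillna(inplace=True) on a
    # column selection, which recent pandas versions warn about and which
    # may leave df unchanged
    df['Employment'] = df['Employment'].fillna(most_frequent)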
    
    # Verify imputation
    after_missing = df['Employment'].isnull().sum()
    
    print("\nAfter Imputation:")
    print(f"  Missing values: {after_missing}")
    print(f"  Values imputed: {original_missing - after_missing}")
    print(f"  Imputation value used: '{most_frequent}'")
    
    print("\nImputation successful!")
else:
    print("Employment column not found in dataset")
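
The most frequent value is only one possible imputation strategy. As a rough sketch on a made-up DataFrame (the values and the `"Unknown"` label are assumptions for illustration), the same `fillna()` call can also fill a categorical column with a constant or a numeric column with its median:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in each column
toy = pd.DataFrame({
    "Employment": ["Employed", None, "Student", "Employed"],
    "Age": [34.0, np.nan, 29.0, 41.0],
})

# Strategy 1: most frequent value (what Task 6 does)
mode_filled = toy["Employment"].fillna(toy["Employment"].mode()[0])

# Strategy 2: an explicit placeholder that keeps the gap visible as a category
constant_filled = toy["Employment"].fillna("Unknown")

# Strategy 3: median for a numeric column
median_filled = toy["Age"].fillna(toy["Age"].median())

print(mode_filled.tolist())
print(constant_filled.tolist())
print(median_filled.tolist())
```

Each call returns a new Series, so the original `toy` columns stay unchanged until the result is assigned back.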

### 5. Visualizing Imputed Data
##### Task 7: Visualize the distribution of a column after imputation (e.g., `Employment`).


In [None]:
## Write your code here

# Task 7: Visualize the distribution of Employment after imputation

if 'Employment' in df.columns:
    # Create a figure with two subplots
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # Bar chart
    employment_counts = df['Employment'].value_counts()
    axes[0].bar(range(len(employment_counts)), employment_counts.values)
    axes[0].set_xticks(range(len(employment_counts)))
    axes[0].set_xticklabels(employment_counts.index, rotation=45, ha='right')
    axes[0].set_title('Employment Distribution (After Imputation) - Bar Chart')
    axes[0].set_xlabel('Employment Status')
    axes[0].set_ylabel('Count')
    
    # Pie chart
    axes[1].pie(employment_counts.values, labels=employment_counts.index, autopct='%1.1f%%')
    axes[1].set_title('Employment Distribution (After Imputation) - Pie Chart')
    
    plt.tight_layout()
    plt.show()
    
    # Print statistics
    print("\nEmployment Distribution Statistics:")
    print(employment_counts)
    print(f"\nTotal respondents: {employment_counts.sum()}")
else:
    print("Employment column not found in dataset")
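
When reading the charts above, keep in mind that filling every gap with the most frequent value increases that value's share. The following sketch on a made-up column (the values are assumptions for illustration) compares the proportions before and after imputation:

```python
import pandas as pd

# Hypothetical toy column: 2 of 5 answers are missing
employment = pd.Series(["Employed", "Employed", "Student", None, None])

# Shares among the answered rows only (missing values are ignored)
before = employment.value_counts(normalize=True)

# Impute with the most frequent value, then recompute the shares
filled = employment.fillna(employment.mode()[0])
after = filled.value_counts(normalize=True)

print(before.round(3))
print(after.round(3))
```

In this toy case the share of `Employed` rises from two thirds to four fifths, which is why a before-and-after comparison can be a useful check when many values were imputed.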

### Summary


In this lab, you:
- Loaded the dataset into a pandas DataFrame.
- Identified missing values across all columns.
- Quantified missing values in specific columns.
- Imputed missing values in a categorical column using the most frequent value.
- Visualized the imputed data for better understanding.
  


Copyright © IBM Corporation. All rights reserved.
