<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Impute Missing Values**


Estimated time needed: **30** minutes


In this lab, you will practice essential data wrangling techniques using the Stack Overflow survey dataset. The primary focus is on handling missing data and ensuring data quality. You will:

- **Load the Data:** Import the dataset into a DataFrame using the pandas library.

- **Clean the Data:** Identify and remove duplicate entries to maintain data integrity.

- **Handle Missing Values:** Detect missing values, impute them with appropriate strategies, and verify the imputation to create a complete and reliable dataset for analysis.

This lab equips you with the skills to effectively preprocess and clean real-world datasets, a crucial step in any data analysis project.


## Objectives


In this lab, you will perform the following:


-   Identify missing values in the dataset.

-   Apply techniques to impute missing values in the dataset.
  
-   Use suitable techniques to normalize data in the dataset.


-----


#### Install the required library


In [None]:
!pip install pandas

### Step 1: Import Required Libraries


In [None]:
import pandas as pd

### Step 2: Load the Dataset Into a DataFrame


#### **Read Data**
<p>
The code below reads the dataset from a URL directly into a pandas DataFrame:
</p>


In [None]:
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"
df = pd.read_csv(file_path)

# Display the first few rows to ensure it loaded correctly
print(df.head())
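
Beyond `head()`, it can help to confirm the shape and column types right after loading. The sketch below is only an illustration: it parses a hypothetical three-row CSV held in a string (not the real survey file) so that it runs without network access. The column names mirror ones used later in this lab; the values are made up.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for the survey file (values are invented)
csv_text = """ResponseId,RemoteWork,ConvertedCompYearly
1,Remote,85000
2,,
3,In-person,62000
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = sample.shape
print(f"Rows: {n_rows}, Columns: {n_cols}")

# Empty fields are parsed as NaN, which turns the compensation column into floats
print(sample.dtypes)
```

The same checks (`df.shape`, `df.dtypes`) can be run on the real `df` once it has loaded.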

### Step 3: Finding and Removing Duplicates
##### Task 1: Identify duplicate rows in the dataset.


In [None]:
## Write your code here

# Task 1: Identify duplicate rows

# Count duplicate rows
num_duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")

# Display rows that belong to a duplicate group (keep=False marks every copy)
if num_duplicates > 0:
    print("\nFirst 5 rows involved in duplication:")
    print(df[df.duplicated(keep=False)].head())
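
The count and the displayed rows above use different settings of the `keep` parameter, so the two numbers need not match. A minimal sketch on a made-up DataFrame (the values are hypothetical, not taken from the survey) shows the difference:

```python
import pandas as pd

# Toy data: ResponseId 2 appears twice, ResponseId 3 appears three times
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3, 3, 3],
    "RemoteWork": ["Remote", "Hybrid", "Hybrid",
                   "In-person", "In-person", "In-person"],
})

# Default keep="first": only the repeats after the first occurrence are flagged
repeats_only = toy.duplicated().sum()

# keep=False: every row that has at least one identical copy is flagged
all_involved = toy.duplicated(keep=False).sum()

print(f"Repeats beyond the first occurrence: {repeats_only}")
print(f"Rows involved in any duplication:    {all_involved}")
```

The default count is the number of rows that `drop_duplicates()` would remove, while `keep=False` is more useful for inspecting the duplicate groups side by side.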

##### Task 2: Remove the duplicate rows from the DataFrame.



In [None]:
## Write your code here

# Task 2: Remove duplicate rows

# Remove duplicates
df = df.drop_duplicates()

print(f"Dataset shape after removing duplicates: {df.shape}")
print(f"Remaining duplicates: {df.duplicated().sum()}")
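
`drop_duplicates()` with no arguments removes only rows that match in every column. If duplicates should instead be judged by an identifier, the `subset` parameter can be used. The sketch below assumes an identifier column named `ResponseId` and uses invented `Country` and `YearsCode` values purely for illustration.

```python
import pandas as pd

# Toy data: ResponseId 2 appears twice, but the two rows differ in YearsCode
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3],
    "Country": ["India", "Brazil", "Brazil", "Canada"],
    "YearsCode": [5, 10, 12, 3],
})

# No row is identical in every column, so nothing is removed here
full_dedup = toy.drop_duplicates()

# Comparing on ResponseId only keeps the first row for each identifier
by_id = (
    toy.drop_duplicates(subset=["ResponseId"], keep="first")
       .reset_index(drop=True)
)

print(f"Rows after full-row dedup: {len(full_dedup)}")
print(f"Rows after dedup on ResponseId: {len(by_id)}")
```

Whether full-row or key-based deduplication is appropriate depends on the dataset; this lab uses the full-row version.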

### Step 4: Finding Missing Values
##### Task 3: Find the missing values for all columns.


In [None]:
## Write your code here

# Task 3: Find missing values for all columns

missing_values = df.isnull().sum()
print("Missing values per column:")
print(missing_values[missing_values > 0].sort_values(ascending=False).head(10))

print(f"\nTotal missing values: {missing_values.sum()}")
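
Raw counts are easier to interpret next to the share of rows they represent. As a sketch, the mean of the boolean mask from `isna()` gives the fraction missing per column. The DataFrame here is a made-up example, not the survey data.

```python
import numpy as np
import pandas as pd

# Toy data with different amounts of missingness per column
toy = pd.DataFrame({
    "RemoteWork": ["Remote", None, "Hybrid", None],
    "ConvertedCompYearly": [85000.0, np.nan, np.nan, np.nan],
    "ResponseId": [1, 2, 3, 4],
})

# isna() yields True/False; sum() counts the Trues, mean() gives their fraction
summary = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(2),
}).sort_values("missing_pct", ascending=False)

print(summary)
```

Applied to the real `df`, the same pattern ranks columns by how much imputation they would need.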

##### Task 4: Find out how many rows have a missing value in the RemoteWork column.


In [None]:
## Write your code here

# Task 4: Find missing rows in RemoteWork column

if 'RemoteWork' in df.columns:
    remote_missing = df['RemoteWork'].isnull().sum()
    print(f"Missing values in RemoteWork: {remote_missing}")
    print(f"Percentage missing: {(remote_missing / len(df) * 100):.2f}%")
else:
    print("RemoteWork column not found")

### Step 5: Imputing Missing Values
##### Task 5: Find the value counts for the column RemoteWork.


In [None]:
## Write your code here

# Task 5: Find value counts for RemoteWork

if 'RemoteWork' in df.columns:
    print("RemoteWork value counts:")
    print(df['RemoteWork'].value_counts())
else:
    print("RemoteWork column not found")
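
By default `value_counts()` leaves missing values out of the tally. The following sketch, on a made-up Series, shows two optional parameters that may be useful before imputing: `dropna=False` to include the missing entries, and `normalize=True` to get proportions of the non-missing values.

```python
import pandas as pd

# Toy Series: three "Hybrid", one "Remote", two missing
toy = pd.Series(["Remote", "Hybrid", None, "Hybrid", None, "Hybrid"],
                name="RemoteWork")

# Include missing values as their own category
counts = toy.value_counts(dropna=False)

# Proportions among the non-missing values only
shares = toy.value_counts(normalize=True)

print(counts)
print(shares)
```

Seeing the missing entries next to the real categories makes it easier to judge how much the majority category will grow after imputation.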

##### Task 6: Identify the most frequent (majority) value in the RemoteWork column.



In [None]:
## Write your code here

# Task 6: Identify most frequent value in RemoteWork

if 'RemoteWork' in df.columns:
    most_frequent = df['RemoteWork'].mode()[0]
    print(f"Most frequent value in RemoteWork: '{most_frequent}'")
    
    # Show frequency
    value_counts = df['RemoteWork'].value_counts()
    print(f"Frequency: {value_counts.iloc[0]}")
    print(f"Percentage: {(value_counts.iloc[0] / value_counts.sum() * 100):.2f}%")
else:
    print("RemoteWork column not found")

##### Task 7: Impute (replace) all missing values in the RemoteWork column with the majority value.



In [None]:
## Write your code here

# Task 7: Impute missing values in RemoteWork

if 'RemoteWork' in df.columns:
    # Store original missing count
    original_missing = df['RemoteWork'].isnull().sum()
    
    # Get most frequent value
    most_frequent = df['RemoteWork'].mode()[0]
    
    # Impute missing values
    df['RemoteWork'] = df['RemoteWork'].fillna(most_frequent)
    
    # Verify
    after_missing = df['RemoteWork'].isnull().sum()
    print(f"Missing values before imputation: {original_missing}")
    print(f"Missing values after imputation: {after_missing}")
    print(f"Values imputed: {original_missing - after_missing}")
    print(f"Imputation value used: '{most_frequent}'")
else:
    print("RemoteWork column not found")
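
The same mode-based imputation can be wrapped in a small helper so it is reusable for other categorical columns. This is a sketch under assumptions: the helper name `impute_with_mode` is hypothetical, and the data is a made-up Series rather than the survey column. It also guards against a column where every value is missing, in which case `mode()` returns an empty result and indexing it would fail.

```python
import pandas as pd

def impute_with_mode(series: pd.Series) -> pd.Series:
    """Return a copy with missing values replaced by the most frequent value."""
    modes = series.mode(dropna=True)
    if modes.empty:
        # Every value is missing, so there is nothing to impute with
        return series.copy()
    # If several values tie, mode() returns them sorted; take the first
    return series.fillna(modes.iloc[0])

toy = pd.Series(["Remote", "Hybrid", None, "Hybrid", None], name="RemoteWork")
filled = impute_with_mode(toy)

print(filled.tolist())
print(f"Missing after imputation: {filled.isna().sum()}")
```

Note that mode imputation enlarges the majority category, so comparing value counts before and after is a reasonable sanity check.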

##### Task 8: Check for any compensation-related columns and describe their distribution.



In [None]:
## Write your code here

# Task 8: Check for compensation columns and describe distribution

# Look for compensation-related columns
comp_columns = [col for col in df.columns if 'comp' in col.lower() or 'salary' in col.lower()]

print("Compensation-related columns found:")
print(comp_columns)

# Describe ConvertedCompYearly if it exists
if 'ConvertedCompYearly' in df.columns:
    print("\nConvertedCompYearly distribution:")
    print(df['ConvertedCompYearly'].describe())
    
    print(f"\nMissing values: {df['ConvertedCompYearly'].isnull().sum()}")
    print(f"Percentage missing: {(df['ConvertedCompYearly'].isnull().sum() / len(df) * 100):.2f}%")
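
Task 8 only asks for a description of the distribution. If a numeric column such as `ConvertedCompYearly` were to be imputed later, one common option for skewed data is the median, because a few very large values pull the mean upward. The sketch below uses invented numbers to illustrate that choice; it is not part of the lab's required solution.

```python
import numpy as np
import pandas as pd

# Toy compensation values with one extreme outlier and two missing entries
comp = pd.Series(
    [40000.0, 55000.0, np.nan, 60000.0, 1_000_000.0, np.nan],
    name="ConvertedCompYearly",
)

# mean() and median() both skip NaN by default
mean_value = comp.mean()
median_value = comp.median()

# Fill the gaps with the median, which is less affected by the outlier
comp_filled = comp.fillna(median_value)

print(f"Mean:   {mean_value:,.0f}")
print(f"Median: {median_value:,.0f}")
print(f"Missing after imputation: {comp_filled.isna().sum()}")
```

Comparing `describe()` output before and after such an imputation shows how much the chosen fill value shifts the summary statistics.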

### Summary 


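**In this lab, you focused on imputing missing values in the dataset.**

- Use the <code>pandas.read_csv()</code> function to load a dataset from a CSV file or URL into a DataFrame.

- If the dataset cannot be read from its URL, download it and specify the correct local file path instead.

- Use <code>duplicated()</code> and <code>drop_duplicates()</code> to identify and remove duplicate rows.

- Use <code>isnull().sum()</code> to count the missing values in each column.

- Use <code>mode()</code> and <code>fillna()</code> to impute missing values in a categorical column such as <code>RemoteWork</code> with its most frequent value.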



<!--
## Change Log
|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-11-05|1.3|Madhusudhan Moole|Updated lab|
|2024-10-29|1.2|Madhusudhan Moole|Updated lab|
|2024-09-27|1.1|Madhusudhan Moole|Updated lab|
|2024-09-26|1.0|Raghul Ramesh|Created lab|
-->


Copyright © IBM Corporation. All rights reserved.
