<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Removing Duplicates**


Estimated time needed: **30** minutes


## Introduction


In this lab, you will focus on data wrangling, an important step in preparing data for analysis. Data wrangling involves cleaning and organizing data so that it is suitable for analysis. One key task in this process is removing duplicate entries: repeated records that can distort analysis and lead to inaccurate conclusions.
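To make the effect of duplicates concrete, here is a minimal sketch on a made-up table. The column names `RespondentId` and `Salary` and all values are hypothetical and are not taken from the survey dataset; the point is only that repeated rows pull a summary statistic toward the repeated value.

```python
import pandas as pd

# Hypothetical toy data: respondent 3 was recorded three times by mistake.
toy = pd.DataFrame({
    "RespondentId": [1, 2, 3, 3, 3],
    "Salary": [50000, 60000, 100000, 100000, 100000],
})

# The mean computed on the raw table counts respondent 3 three times.
mean_with_duplicates = toy["Salary"].mean()

# Dropping fully identical rows leaves one row per respondent.
mean_without_duplicates = toy.drop_duplicates()["Salary"].mean()

print(f"Mean with duplicates:    {mean_with_duplicates:,.0f}")
print(f"Mean without duplicates: {mean_without_duplicates:,.0f}")
```

In this toy case the duplicated high earner raises the mean from 70,000 to 82,000.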


## Objectives


In this lab you will perform the following:


1. Identify duplicate rows in the dataset.
2. Use suitable techniques to remove duplicate rows and verify the removal.
3. Summarize how to handle missing values appropriately.
4. Use ConvertedCompYearly to normalize compensation data.
   


### Install the Required Libraries


In [None]:
!pip install pandas

### Step 1: Import Required Libraries


In [None]:
import pandas as pd

### Step 2: Load the Dataset into a DataFrame



Load the dataset using pd.read_csv().


In [None]:
# Define the URL of the dataset
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

# Load the dataset into a DataFrame
df = pd.read_csv(file_path)

# Display the first few rows to ensure it loaded correctly
print(df.head())


**Note: If you are working in a local Jupyter environment, you can pass the URL directly to the <code>pandas.read_csv()</code> function as shown below:**



#df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")


### Step 3: Identifying Duplicate Rows


**Task 1: Identify Duplicate Rows**
  1. Count the number of duplicate rows in the dataset.
  2. Display the first few duplicate rows to understand their structure.


In [None]:
## Write your code here

# Task 1: Identify Duplicate Rows

# Count the number of duplicate rows
num_duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")

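# Display the first few duplicate rows
# keep=False marks every copy of a duplicated row, including its first occurrence,
# so this selection can contain more rows than num_duplicates counted above
print("\nFirst 5 duplicate rows:")
duplicate_rows = df[df.duplicated(keep=False)]
print(duplicate_rows.head())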

# Show the shape before removal
print(f"\nDataset shape before removing duplicates: {df.shape}")
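The difference between counting "extra" copies and selecting every member of a duplicate group can be seen on a small example. This is a sketch on a hypothetical three-column-free toy table (the `Country` and `Age` values are invented), using only the `keep` parameter of `DataFrame.duplicated()`.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 1 and 3 are identical.
toy = pd.DataFrame({
    "Country": ["US", "US", "IN", "US"],
    "Age": [30, 30, 25, 30],
})

# Default keep="first": the first occurrence is NOT flagged, only later copies are.
extra_copies = toy.duplicated().sum()

# keep=False: every row that belongs to a duplicate group is flagged.
all_copies = toy.duplicated(keep=False).sum()

print(f"Rows flagged with keep='first': {extra_copies}")
print(f"Rows flagged with keep=False:   {all_copies}")
```

The first number is how many rows `drop_duplicates()` would remove; the second is how many rows are involved in duplication at all.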

### Step 4: Removing Duplicate Rows


**Task 2: Remove Duplicates**
   1. Remove duplicate rows from the dataset using the drop_duplicates() function.
   2. Verify the removal by counting the number of duplicate rows after removal.


In [None]:
## Write your code here

# Task 2: Remove Duplicates

# Remove duplicate rows using drop_duplicates()
df_cleaned = df.drop_duplicates()

print(f"Dataset shape after removing duplicates: {df_cleaned.shape}")
print(f"Number of rows removed: {df.shape[0] - df_cleaned.shape[0]}")

# Verify the removal by counting duplicates again
remaining_duplicates = df_cleaned.duplicated().sum()
print(f"\nRemaining duplicate rows: {remaining_duplicates}")

# Update df to use the cleaned version
df = df_cleaned
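`drop_duplicates()` with no arguments only removes rows that match in every column. If a dataset has an identifier column, duplicates can also be defined on that column alone through the `subset` parameter. The sketch below uses a hypothetical toy table; the column names `ResponseId` and `SubmittedAt` are assumptions for illustration and may not match the survey file.

```python
import pandas as pd

# Hypothetical toy data: ResponseId 2 was submitted twice with different timestamps.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3],
    "Country": ["US", "IN", "IN", "DE"],
    "SubmittedAt": ["09:00", "09:05", "09:07", "09:10"],
})

# Full-row comparison finds nothing to drop because SubmittedAt differs.
full_row_dedup = toy.drop_duplicates()

# Comparing on ResponseId only keeps the first row seen for each id.
by_id_dedup = (
    toy.drop_duplicates(subset=["ResponseId"], keep="first")
       .reset_index(drop=True)
)

print(f"Rows after full-row dedup: {len(full_row_dedup)}")
print(f"Rows after dedup by id:    {len(by_id_dedup)}")
```

Which definition is appropriate depends on the data; the lab itself uses the full-row comparison.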

### Step 5: Handling Missing Values


**Task 3: Identify and Handle Missing Values**
   1. Identify missing values for all columns in the dataset.
   2. Choose a column with significant missing values (e.g., EdLevel) and impute with the most frequent value.


In [None]:
## Write your code here

# Task 3: Identify and Handle Missing Values

# Identify missing values for all columns
print("Missing values per column:")
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0].sort_values(ascending=False))

# Calculate percentage of missing values
print("\nPercentage of missing values:")
missing_percent = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
print(missing_percent[missing_percent > 0].head(10))

# Choose EdLevel column for imputation
if 'EdLevel' in df.columns:
    print(f"\nMissing values in EdLevel: {df['EdLevel'].isnull().sum()}")
    
    # Find the most frequent value
    most_frequent_edlevel = df['EdLevel'].mode()[0]
    print(f"Most frequent value in EdLevel: {most_frequent_edlevel}")
    
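    # Impute missing values with the most frequent value
    # (assign the result back; calling fillna(inplace=True) on a selected column is
    # deprecated chained assignment and has no effect under pandas copy-on-write)
    df['EdLevel'] = df['EdLevel'].fillna(most_frequent_edlevel)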
    
    # Verify imputation
    print(f"Missing values in EdLevel after imputation: {df['EdLevel'].isnull().sum()}")
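Two edge cases are worth keeping in mind when imputing with the mode: `Series.mode()` can return more than one value when there is a tie, and it returns an empty Series when a column is entirely missing. The sketch below wraps the idea in a hypothetical helper, `impute_with_mode`, which is not part of pandas or of this lab, and runs it on invented data.

```python
import numpy as np
import pandas as pd


def impute_with_mode(series: pd.Series) -> pd.Series:
    """Hypothetical helper: fill missing values with the first mode, if any."""
    modes = series.mode()  # skips NaN by default; may hold several values on a tie
    if modes.empty:
        # Column is entirely missing: there is nothing sensible to impute with.
        return series.copy()
    # On a tie, taking the first returned value is an arbitrary choice.
    return series.fillna(modes.iloc[0])


# Hypothetical toy columns.
tied = pd.Series(["Master", "Bachelor", np.nan, "Master", "Bachelor"])
all_missing = pd.Series([np.nan, np.nan])

tied_filled = impute_with_mode(tied)
all_missing_filled = impute_with_mode(all_missing)

print(tied_filled.tolist())
print(all_missing_filled.isnull().sum())
```

The helper returns a new Series, so the result is assigned back to the column rather than modified in place.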

### Step 6: Normalizing Compensation Data


**Task 4: Normalize Compensation Data Using ConvertedCompYearly**
   1. Use the ConvertedCompYearly column for compensation analysis as the normalized annual compensation is already provided.
   2. Check for missing values in ConvertedCompYearly and handle them if necessary.


In [None]:
## Write your code here

# Task 4: Normalize Compensation Data Using ConvertedCompYearly

# Check if ConvertedCompYearly column exists
if 'ConvertedCompYearly' in df.columns:
    print("ConvertedCompYearly column statistics:")
    print(df['ConvertedCompYearly'].describe())
    
    # Check for missing values
    missing_comp = df['ConvertedCompYearly'].isnull().sum()
    print(f"\nMissing values in ConvertedCompYearly: {missing_comp}")
    print(f"Percentage of missing values: {(missing_comp / len(df) * 100):.2f}%")
    
    # Handle missing values - we can use median imputation for compensation
    if missing_comp > 0:
        median_comp = df['ConvertedCompYearly'].median()
        print(f"\nMedian compensation: ${median_comp:,.2f}")
        
        # Assign the result back instead of using inplace=True on a selected column
        df['ConvertedCompYearly'] = df['ConvertedCompYearly'].fillna(median_comp)
        print(f"Missing values after imputation: {df['ConvertedCompYearly'].isnull().sum()}")
    
    # Display sample of normalized compensation data
    print("\nSample of ConvertedCompYearly (normalized annual compensation):")
    print(df[['ConvertedCompYearly']].head(10))
else:
    print("ConvertedCompYearly column not found in dataset")
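The cell above fills missing compensation with the median rather than the mean. A small sketch on invented numbers shows why that choice is often made for salary-like data: a single extreme value moves the mean a long way but leaves the median untouched. The values below are hypothetical and are not drawn from the survey.

```python
import numpy as np
import pandas as pd

# Hypothetical compensation values with one extreme entry and one missing entry.
comp = pd.Series(
    [40000, 50000, 60000, 70000, 5000000, np.nan],
    name="ConvertedCompYearly",
)

mean_comp = comp.mean()      # NaN is skipped; the outlier dominates the result
median_comp = comp.median()  # NaN is skipped; the middle value is unaffected

# Fill the missing entry with the median and assign the result to a new Series.
comp_filled = comp.fillna(median_comp)

print(f"Mean:   {mean_comp:,.0f}")
print(f"Median: {median_comp:,.0f}")
print(f"Imputed value: {comp_filled.iloc[-1]:,.0f}")
```

Note that any imputation changes the distribution of the column, so downstream statistics on `ConvertedCompYearly` should be read with that in mind.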

### Step 7: Summary and Next Steps


**In this lab, you focused on identifying and removing duplicate rows.**

- You handled missing values by imputing the most frequent value in a chosen column.

- You used ConvertedCompYearly for compensation normalization and handled missing values.

- For further analysis, consider exploring other columns or visualizing the cleaned dataset.


In [None]:
## Write your code here

# Summary of data wrangling steps performed

print("=" * 60)
print("DATA WRANGLING SUMMARY")
print("=" * 60)

print("\n1. Duplicate Removal:")
print("   - Identified and removed duplicate rows")
print(f"   - Final dataset shape: {df.shape}")

print("\n2. Missing Value Handling:")
print("   - Imputed EdLevel with most frequent value")
print("   - Imputed ConvertedCompYearly with median value")

print("\n3. Data Normalization:")
print("   - Used ConvertedCompYearly for standardized compensation analysis")

print("\n4. Final Data Quality:")
total_missing = df.isnull().sum().sum()
print(f"   - Total missing values remaining: {total_missing}")
print("   - Dataset is ready for analysis!")

print("=" * 60)
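The summary above prints fixed statements. As a possible complement, the checks can be computed from the DataFrame itself so the report reflects the actual state of the data. The sketch below defines a hypothetical helper, `summarize_quality`, and runs it on an invented toy table rather than the survey data.

```python
import numpy as np
import pandas as pd


def summarize_quality(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: basic data-quality counts for a DataFrame."""
    return {
        "rows": len(frame),
        "columns": frame.shape[1],
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_cells": int(frame.isnull().sum().sum()),
    }


# Hypothetical toy data: rows 0 and 1 are identical, row 2 is entirely missing.
toy = pd.DataFrame({
    "EdLevel": ["Bachelor", "Bachelor", None, "Master"],
    "ConvertedCompYearly": [50000.0, 50000.0, np.nan, 70000.0],
})

report = summarize_quality(toy)
for key, value in report.items():
    print(f"{key}: {value}")
```

Calling the helper before and after each wrangling step gives a quick way to confirm that duplicates and missing values actually decreased.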

<!--
## Change Log

|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-11-05|1.2|Madhusudhan Moole|Updated lab|
|2024-09-24|1.1|Madhusudhan Moole|Updated lab|
|2024-09-23|1.0|Raghul Ramesh|Created lab|

-->


Copyright © IBM Corporation. All rights reserved.
