<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Finding Duplicates Lab**


Estimated time needed: **30** minutes


## Introduction


Data wrangling is a critical step in preparing datasets for analysis, and handling duplicates plays a key role in ensuring data accuracy. In this lab, you will focus on identifying and removing duplicate entries from your dataset. 


## Objectives


In this lab, you will perform the following:


1. Identify duplicate rows in the dataset and analyze their characteristics.
2. Visualize the distribution of duplicates based on key attributes.
3. Remove duplicate values strategically based on specific criteria.
4. Outline the process of verifying and documenting duplicate removal.


## Hands-on Lab


Install the required libraries


In [None]:
!pip install pandas
!pip install matplotlib

Import the pandas module


In [None]:
import pandas as pd


Import matplotlib


In [None]:
import matplotlib.pyplot as plt


## **Load the dataset into a dataframe**


<h2>Read Data</h2>
<p>
We use the <code>pandas.read_csv()</code> function to read CSV files. Because this version of the lab runs on JupyterLite, load the dataset into the environment with the code provided below.
</p>


In [None]:
# Load the dataset directly from the URL
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/VYPrOu0Vs3I0hKLLjiPGrA/survey-data-with-duplicate.csv"
df = pd.read_csv(file_path)

# Display the first few rows
print(df.head())

Load the data into a pandas dataframe:



Note: If you are working in a local Jupyter environment, you can pass the URL directly to the pandas.read_csv() function, as shown below:



In [None]:
# df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")


## Identify and Analyze Duplicates


### Task 1: Identify Duplicate Rows
1. Count the number of duplicate rows in the dataset.
2. Display the first few duplicate rows to understand their structure.


In [None]:
## Write your code here

# Task 1: Identify Duplicate Rows

# Count the number of duplicate rows
num_duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")

# Display the first few duplicate rows
duplicates = df[df.duplicated(keep=False)]
print(f"\nTotal rows including all duplicates: {len(duplicates)}")
print("\nFirst 5 duplicate rows:")
print(duplicates.head())
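
The two counts printed in Task 1 differ because of the `keep` argument of `duplicated()`. The sketch below illustrates this on a small toy dataframe; the values are made up for illustration and are not taken from the survey data.

```python
import pandas as pd

# Toy data (hypothetical values): rows 0, 2 and 4 are identical; rows 1 and 3 are unique
toy = pd.DataFrame({
    "ResponseId": [1, 2, 1, 3, 1],
    "Country": ["India", "Germany", "India", "Brazil", "India"],
})

# keep='first' (the default): flags every repeat except the first occurrence
flag_first = toy.duplicated()
# keep='last': flags every repeat except the last occurrence
flag_last = toy.duplicated(keep="last")
# keep=False: flags every member of a duplicated group
flag_all = toy.duplicated(keep=False)

print("Extra copies (keep='first'):", flag_first.sum())
print("Rows in duplicated groups (keep=False):", flag_all.sum())
print(toy[flag_all])
```

The default count answers "how many rows would be dropped", while `keep=False` answers "how many rows are involved in duplication", which is usually the more useful view when inspecting the rows themselves.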

### Task 2: Analyze Characteristics of Duplicates
1. Identify duplicate rows based on a subset of columns, such as MainBranch, Employment, and RemoteWork.
2. Analyze the characteristics of these rows and determine which other columns frequently hold identical values across them.
   


In [None]:
## Write your code here

# Task 2: Analyze Characteristics of Duplicates

# Identify duplicates based on selected columns
subset_cols = ['MainBranch', 'Employment', 'RemoteWork']
duplicates_subset = df[df.duplicated(subset=subset_cols, keep=False)]

print(f"Duplicate rows based on {subset_cols}: {len(duplicates_subset)}")

# Analyze which columns hold identical values within each group of duplicates
# (a group is the set of rows sharing the same values in subset_cols)
print("\nShare of duplicate groups in which each column holds a single value:")
within_group_unique = duplicates_subset.groupby(subset_cols, dropna=False).nunique(dropna=False)
identical_share = (within_group_unique <= 1).mean().sort_values(ascending=False)
print(identical_share.head(10))

# Show value counts for the subset columns
print("\nValue distribution in duplicate rows:")
for col in subset_cols:
    print(f"\n{col}:")
    print(duplicates_subset[col].value_counts().head())
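
To see which combinations of the subset columns repeat most often, one option is to count rows per combination with `groupby(...).size()`. This is a minimal sketch on toy data; the column names follow the lab, but the values are invented for illustration.

```python
import pandas as pd

# Toy data (hypothetical values); column names follow the lab's dataset
toy = pd.DataFrame({
    "MainBranch": ["Dev", "Dev", "Student", "Dev", "Student"],
    "Employment": ["Full-time", "Full-time", "Part-time", "Full-time", "Full-time"],
    "RemoteWork": ["Remote", "Remote", "Hybrid", "Remote", "Hybrid"],
    "Country": ["India", "India", "Brazil", "Germany", "Brazil"],
})
subset_cols = ["MainBranch", "Employment", "RemoteWork"]

# Count how many rows share each combination of the subset columns
group_sizes = (
    toy.groupby(subset_cols, dropna=False)
       .size()
       .sort_values(ascending=False)
)

# Keep only combinations that occur more than once
repeated = group_sizes[group_sizes > 1]
print(repeated)
```

Large groups here are expected: three categorical columns have few possible combinations, so matching on them alone does not imply that two rows describe the same respondent.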

### Task 3: Visualize Duplicates Distribution
1. Create visualizations to show the distribution of duplicates across different categories.
2. Use bar charts or pie charts to represent the distribution of duplicates by Country and Employment.


In [None]:
## Write your code here

# Task 3: Visualize Duplicates Distribution

# Create a figure with subplots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Bar chart for duplicates by Country (top 10)
if 'Country' in duplicates.columns:
    country_counts = duplicates['Country'].value_counts().head(10)
    axes[0, 0].bar(range(len(country_counts)), country_counts.values)
    axes[0, 0].set_xticks(range(len(country_counts)))
    axes[0, 0].set_xticklabels(country_counts.index, rotation=45, ha='right')
    axes[0, 0].set_title('Top 10 Countries in Duplicate Rows')
    axes[0, 0].set_ylabel('Count')

# 2. Pie chart for duplicates by Employment
if 'Employment' in duplicates.columns:
    employment_counts = duplicates['Employment'].value_counts()
    axes[0, 1].pie(employment_counts.values, labels=employment_counts.index, autopct='%1.1f%%')
    axes[0, 1].set_title('Employment Distribution in Duplicates')

# 3. Bar chart for RemoteWork
if 'RemoteWork' in duplicates.columns:
    remote_counts = duplicates['RemoteWork'].value_counts()
    axes[1, 0].bar(remote_counts.index, remote_counts.values)
    axes[1, 0].set_title('Remote Work Distribution in Duplicates')
    axes[1, 0].set_xlabel('Remote Work Status')
    axes[1, 0].set_ylabel('Count')
    axes[1, 0].tick_params(axis='x', rotation=45)

# 4. Bar chart for MainBranch
if 'MainBranch' in duplicates.columns:
    branch_counts = duplicates['MainBranch'].value_counts()
    axes[1, 1].bar(range(len(branch_counts)), branch_counts.values)
    axes[1, 1].set_xticks(range(len(branch_counts)))
    axes[1, 1].set_xticklabels(branch_counts.index, rotation=45, ha='right')
    axes[1, 1].set_title('Main Branch Distribution in Duplicates')
    axes[1, 1].set_ylabel('Count')

plt.tight_layout()
plt.show()

print(f"\nVisualization complete: {len(duplicates)} rows from duplicated groups plotted")
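
Raw counts tend to be dominated by categories that have many rows overall. As a complement to the charts, the sketch below computes the share of rows in each category that belong to a duplicated group. It uses toy data with invented values, and the variable names are illustrative.

```python
import pandas as pd

# Toy data (hypothetical values): ResponseId 1 and ResponseId 4 each appear twice
toy = pd.DataFrame({
    "ResponseId": [1, 1, 2, 3, 4, 4],
    "Country": ["India", "India", "India", "Germany", "Brazil", "Brazil"],
})

# Flag every row that belongs to a duplicated group
is_dup = toy.duplicated(keep=False)

# The mean of a boolean flag per category is the fraction of that
# category's rows that are duplicated
dup_rate = is_dup.groupby(toy["Country"]).mean().sort_values(ascending=False)
print(dup_rate)
```

The resulting Series can be passed to a bar chart in the same way as the `value_counts()` results used in the task.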

### Task 4: Strategic Removal of Duplicates
1. Decide which columns are critical for defining uniqueness in the dataset.
2. Remove duplicates based on a subset of columns if complete row duplication is not a good criterion.


In [None]:
## Write your code here

# Task 4: Strategic Removal of Duplicates

# Define critical columns for uniqueness
# ResponseId should be unique for each survey response
critical_columns = ['ResponseId']

print("Original dataset shape:", df.shape)

# Remove duplicates based on critical columns
df_cleaned = df.drop_duplicates(subset=critical_columns, keep='first')

print("Dataset shape after removing duplicates:", df_cleaned.shape)
print(f"Removed {df.shape[0] - df_cleaned.shape[0]} duplicate rows")

# Verify no duplicates remain
remaining_duplicates = df_cleaned.duplicated(subset=critical_columns).sum()
print(f"\nRemaining duplicates based on {critical_columns}: {remaining_duplicates}")
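
`keep='first'` retains whichever copy happens to appear first. When copies that share a key differ in how complete they are, one possible strategy is to sort so the most complete row comes first and then drop duplicates. The sketch below assumes that "most complete" means "fewest missing values"; the toy data and the temporary `_filled` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): ResponseId 1 and 3 each appear twice,
# with differing amounts of missing data
toy = pd.DataFrame({
    "ResponseId": [1, 1, 2, 3, 3],
    "Country": [np.nan, "India", "Germany", "Brazil", "Brazil"],
    "Employment": [np.nan, "Full-time", "Part-time", np.nan, "Full-time"],
})

# Count non-missing fields per row, then sort so that within each
# ResponseId the most complete row comes first
completeness = toy.notna().sum(axis=1)
ordered = toy.assign(_filled=completeness).sort_values(
    ["ResponseId", "_filled"], ascending=[True, False]
)

# keep='first' now retains the most complete row for each ResponseId
deduped = (
    ordered.drop_duplicates(subset=["ResponseId"], keep="first")
           .drop(columns="_filled")
)
print(deduped)
```

If the copies are exact duplicates across every column, this ordering makes no difference and plain `drop_duplicates()` is sufficient.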

## Verify and Document Duplicate Removal Process


### Task 5: Documentation
1. Document the process of identifying and removing duplicates.


2. Explain the reasoning behind selecting specific columns for identifying and removing duplicates.
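
One way to support the documentation step is to record the key figures of each removal in a small summary. The sketch below uses a hypothetical helper, `summarize_dedup`, which is not part of pandas, together with toy data.

```python
import pandas as pd

def summarize_dedup(before, after, key_columns, keep):
    """Return a dict describing one duplicate-removal step (hypothetical helper)."""
    return {
        "key_columns": list(key_columns),
        "keep": keep,
        "rows_before": len(before),
        "rows_after": len(after),
        "rows_removed": len(before) - len(after),
        "remaining_duplicates": int(after.duplicated(subset=key_columns).sum()),
    }

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "ResponseId": [1, 1, 2, 3, 3],
    "Country": ["India", "India", "Germany", "Brazil", "Brazil"],
})
cleaned = toy.drop_duplicates(subset=["ResponseId"], keep="first")

report = summarize_dedup(toy, cleaned, ["ResponseId"], keep="first")
for field, value in report.items():
    print(f"{field}: {value}")
```

A summary like this can be pasted into a Markdown cell alongside a sentence explaining why the chosen columns define uniqueness.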


### Summary and Next Steps
**In this lab, you focused on identifying and analyzing duplicate rows within the dataset.**

- You employed various techniques to explore the nature of duplicates and applied strategic methods for their removal.
- For additional analysis, consider investigating the impact of duplicates on specific analyses and how their removal affects the results.
- This version of the lab is more focused on duplicate analysis and handling, providing a structured approach to deal with duplicates in a dataset effectively.
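
As a starting point for investigating the impact of duplicates, the sketch below compares the share of each category before and after removal. The toy data is invented for illustration; with the lab's dataframes, `df` and `df_cleaned` would take the place of `toy` and `cleaned`.

```python
import pandas as pd

# Toy data (hypothetical values): ResponseId 1 appears three times
toy = pd.DataFrame({
    "ResponseId": [1, 1, 1, 2, 3],
    "RemoteWork": ["Remote", "Remote", "Remote", "Hybrid", "In-person"],
})
cleaned = toy.drop_duplicates(subset=["ResponseId"], keep="first")

# Category shares with and without duplicates, aligned on the category label
comparison = pd.DataFrame({
    "with_duplicates": toy["RemoteWork"].value_counts(normalize=True),
    "deduplicated": cleaned["RemoteWork"].value_counts(normalize=True),
})
comparison["difference"] = comparison["deduplicated"] - comparison["with_duplicates"]
print(comparison.round(3))
```

In this toy case the repeated response inflates the "Remote" share, which shows how duplicates can bias a simple frequency analysis.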


<!--
## Change Log
|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-11-05|1.3|Madhusudhan Moole|Updated lab|
|2024-10-28|1.2|Madhusudhan Moole|Updated lab|
|2024-09-24|1.1|Madhusudhan Moole|Updated lab|
|2024-09-23|1.0|Raghul Ramesh|Created lab|
-->


Copyright © IBM Corporation. All rights reserved.
