<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Removing Duplicates**


Estimated time needed: **30** minutes


## Introduction


In this lab, you will focus on data wrangling, an important step in preparing data for analysis. Data wrangling involves cleaning and organizing data so that it is suitable for analysis. One key task in this process is removing duplicate entries: repeated records that can distort analysis and lead to inaccurate conclusions.
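To make the effect of duplicates concrete, here is a minimal sketch on hypothetical toy data (the `responses` DataFrame and its `Salary` column are made up for illustration and are not part of the survey dataset). It shows how a single repeated record shifts a simple average.

```python
import pandas as pd

# Hypothetical toy data: the response with ResponseId 2 was recorded twice.
responses = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3],
    "Salary": [50_000, 120_000, 120_000, 40_000],
})

# The repeated row is counted twice and pulls the average upward.
mean_with_duplicates = responses["Salary"].mean()

# Dropping exact duplicate rows restores one record per respondent.
mean_without_duplicates = responses.drop_duplicates()["Salary"].mean()

print(f"Mean with duplicates:    {mean_with_duplicates:,.2f}")
print(f"Mean without duplicates: {mean_without_duplicates:,.2f}")
```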


## Objectives


In this lab, you will perform the following tasks:


1. Identify duplicate rows in the dataset.
2. Use suitable techniques to remove duplicate rows and verify the removal.
3. Summarize how to handle missing values appropriately.
4. Use ConvertedCompYearly to normalize compensation data.
   


### Install the Required Libraries


In [3]:
!pip install pandas

Collecting pandas
  Downloading pandas-2.3.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (91 kB)
Collecting numpy>=1.26.0 (from pandas)
  Downloading numpy-2.3.1-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (62 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2025.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.3.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.0 MB)
Downloading numpy-2.3.1-cp312-cp312-manylinux_2_28_x86_64.whl (16.6 MB)
Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
Installing collected packages: tzdata, numpy, pandas
Successfully installed numpy-2.3.1 pandas-2.3.1 tzdata-2025.2


### Step 1: Import Required Libraries


In [4]:
import pandas as pd

### Step 2: Load the Dataset into a DataFrame



Load the dataset using <code>pd.read_csv()</code>.


In [5]:
# Define the URL of the dataset
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

# Load the dataset into a DataFrame
df = pd.read_csv(file_path)

# Display the first few rows to ensure it loaded correctly
print(df.head())


   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 

**Note: If you are working in a local Jupyter environment, you can pass the URL directly to the <code>pandas.read_csv()</code> function, as shown below:**



#df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")


### Step 3: Identifying Duplicate Rows


**Task 1: Identify Duplicate Rows**
  1. Count the number of duplicate rows in the dataset.
  2. Display the first few duplicate rows to understand their structure.


In [6]:
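# Count rows that exactly repeat an earlier row
num_duplicate_rows = df.duplicated().sum()
print("Number of duplicate rows:", num_duplicate_rows)
# Show the boolean duplicate mask for the first five rows (True marks a duplicated row)
print(df.duplicated(keep=False).head(5))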

Number of duplicate rows: 0
0    False
1    False
2    False
3    False
4    False
dtype: bool
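The cell above prints the boolean mask rather than the duplicated rows themselves. As a hedged sketch of how the rows could be displayed, the example below filters a hypothetical toy DataFrame (`toy` and its values are assumptions for illustration) with that mask. It also uses the `subset` parameter of `duplicated()` to compare only selected columns.

```python
import pandas as pd

# Hypothetical toy data standing in for the survey DataFrame.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3, 3],
    "Employment": ["Full-time", "Student", "Student", "Full-time", "Part-time"],
})

# keep=False marks every member of a duplicate group, not only the repeats.
all_duplicates = toy[toy.duplicated(keep=False)]

# subset= restricts the comparison to the listed columns.
same_id = toy[toy.duplicated(subset=["ResponseId"], keep=False)]

print("Fully identical rows:")
print(all_duplicates)
print("\nRows sharing a ResponseId:")
print(same_id)
```

On the survey data, where the duplicate count is 0, the same filter would simply return an empty DataFrame.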


### Step 4: Removing Duplicate Rows


**Task 2: Remove Duplicates**
   1. Remove duplicate rows from the dataset using the drop_duplicates() function.
   2. Verify the removal by counting the number of duplicate rows after removal.


In [None]:
## Write your code here
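One possible way to approach Task 2 is sketched below on hypothetical toy data (the `toy` DataFrame is an assumption, not the survey dataset). It removes exact duplicates with `drop_duplicates()` and then verifies the result by recounting with `duplicated()`. This is an illustration, not the lab's reference solution.

```python
import pandas as pd

# Hypothetical toy data with one exact duplicate row.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3],
    "Age": ["18-24 years old", "35-44 years old",
            "35-44 years old", "45-54 years old"],
})

rows_before = len(toy)

# drop_duplicates() keeps the first occurrence of each row by default.
# reset_index(drop=True) renumbers the rows after the removal.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

# Verify: no row should still be flagged as a duplicate.
remaining = deduplicated.duplicated().sum()

print(f"Rows before: {rows_before}, rows after: {len(deduplicated)}")
print(f"Duplicate rows remaining: {remaining}")
```

The same two calls can be applied to `df`. Since Task 1 reported 0 duplicates for this dataset, the row count would be expected to stay unchanged.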

### Step 5: Handling Missing Values


**Task 3: Identify and Handle Missing Values**
   1. Identify missing values for all columns in the dataset.
   2. Choose a column with significant missing values (e.g., EdLevel) and impute with the most frequent value.


In [7]:
#Get missing values count and percentage
missing_data = pd.DataFrame({
    'Missing Values':df.isnull().sum(),
    'Percentage (%)': round(df.isnull().mean()*100,2)
})
#Filter to only columns with missing values
missing_data = missing_data[missing_data['Missing Values']>0]

#print missing values count and percentage
print('Missing Values Report:')
print(missing_data.sort_values('Percentage (%)',ascending=False))

Missing Values Report:
                            Missing Values  Percentage (%)
AINextMuch less integrated           64289           98.25
AINextLess integrated                63082           96.40
AINextNo change                      52939           80.90
AINextMuch more integrated           51999           79.46
EmbeddedAdmired                      48704           74.43
...                                    ...             ...
YearsCode                             5568            8.51
NEWSOSites                            5151            7.87
LearnCode                             4949            7.56
EdLevel                               4653            7.11
AISelect                              4530            6.92

[109 rows x 2 columns]


In [10]:
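# 1. Identify the column with the most missing values
# Note: df is reassigned in step 3, so re-running this cell selects the next most-incomplete column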
missing_percent = df.isnull().mean().sort_values(ascending=False)
selected_column = missing_percent.index[0]  # Get column name with highest % missing

print(f"Column with most missing values: '{selected_column}' ({missing_percent.iloc[0]*100:.2f}% missing)")

# 2. Calculate mode (most frequent value)
mode_value = df[selected_column].mode().iloc[0]

# 3. Impute missing values; assign() returns a new DataFrame instead of modifying df in place
df = df.assign(**{selected_column: df[selected_column].fillna(mode_value)})

# 4. Verify imputation
print(f"\nAfter imputation with mode='{mode_value}':")
print(f"Missing values in '{selected_column}': {df[selected_column].isnull().sum()}")
print(f"New value counts:\n{df[selected_column].value_counts(dropna=False).head()}")

Column with most missing values: 'AINextNo change' (80.90% missing)

After imputation with mode='Writing code':
Missing values in 'AINextNo change': 0
New value counts:
AINextNo change
Writing code                            54791
Search for answers                       1096
Generating content or synthetic data      774
Debugging and getting help                762
Learning about a codebase                 587
Name: count, dtype: int64
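The cell above picks the column automatically. The task also suggests choosing a column by name, such as EdLevel. The sketch below shows that variant on hypothetical toy data (the education labels are invented for illustration). Mode imputation tends to be easier to justify when only a small share of values is missing, because filling most of a column with one value flattens its distribution.

```python
import pandas as pd

# Hypothetical toy data: two of five EdLevel answers are missing.
toy = pd.DataFrame({
    "EdLevel": ["Bachelor's degree", None, "Master's degree",
                "Bachelor's degree", None],
})

missing_before = toy["EdLevel"].isnull().sum()

# mode() ignores missing values by default and returns a Series,
# so take its first entry as the most frequent value.
most_frequent = toy["EdLevel"].mode().iloc[0]

# Fill the gaps with the most frequent value.
toy["EdLevel"] = toy["EdLevel"].fillna(most_frequent)

print(f"Missing before: {missing_before}")
print(f"Imputed with: {most_frequent}")
print(f"Missing after: {toy['EdLevel'].isnull().sum()}")
```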


### Step 6: Normalizing Compensation Data


**Task 4: Normalize Compensation Data Using ConvertedCompYearly**
   1. Use the ConvertedCompYearly column for compensation analysis as the normalized annual compensation is already provided.
   2. Check for missing values in ConvertedCompYearly and handle them if necessary.


In [14]:
#check missing values
missing_count = df['ConvertedCompYearly'].isnull().sum()
missing_percent_year = missing_count / len(df) * 100

print(f"Missing values in ConvertedCompYearly:{missing_count}({missing_percent_year:.2f}%)")

#handling missing values based on percentage
if missing_percent_year > 0:
    if missing_percent_year < 5: #small amount of missing data
        #impute with median
        median_comp = df['ConvertedCompYearly'].median()
        df_clean = df.copy()
        df_clean['ConvertedCompYearly'] = df_clean['ConvertedCompYearly'].fillna(median_comp)
        print(f"Imputed {missing_count} missing values with median: {median_comp:.2f}")
    else:
        print("Warning: Significant missing compensation data detected")
        print("Consider:")
        print("- Investigating why data is missing")
        print("- Using predictive imputation based on other columns")
        print("- Creating a separate missing indicator column")
        df_clean = df.copy()
else:
    df_clean = df.copy()
    print ("No missing values found in ConvertedCompYearly")

#Basic compensation analysis
print("\nCompensation Analysis:")
print(f"Median annual compensation: ${df_clean['ConvertedCompYearly'].median():,.2f}")
print(f"Mean annual compensation: ${df_clean['ConvertedCompYearly'].mean():,.2f}")
print("\nTop 5 highest compensation values:")
print(df_clean['ConvertedCompYearly'].sort_values(ascending=False).head())


Missing values in ConvertedCompYearly:42002(64.19%)
Consider:
- Investigating why data is missing
- Using predictive imputation based on other columns
- Creating a separate missing indicator column

Compensation Analysis:
Median annual compensation: $65,000.00
Mean annual compensation: $86,155.29

Top 5 highest compensation values:
15837    16256603.0
12723    13818022.0
28379     9000000.0
17593     6340564.0
17672     4936778.0
Name: ConvertedCompYearly, dtype: float64
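With about 64% of ConvertedCompYearly missing, the median and mean above describe only the respondents who reported compensation, because pandas skips missing values in `median()` and `mean()` by default. The sketch below illustrates one of the suggested options, a separate missing-indicator column, on hypothetical toy data. The column name `CompMissing` is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: half of the compensation values are missing.
toy = pd.DataFrame({
    "ConvertedCompYearly": [65_000.0, np.nan, 120_000.0, np.nan],
})

# Flag missing values instead of overwriting them.
toy["CompMissing"] = toy["ConvertedCompYearly"].isnull()

# Summaries then refer explicitly to the reported values only.
reported = toy.loc[~toy["CompMissing"], "ConvertedCompYearly"]
median_reported = reported.median()
share_missing = toy["CompMissing"].mean() * 100

print(f"Share missing: {share_missing:.2f}%")
print(f"Median of reported values: {median_reported:,.2f}")
```

Keeping the indicator makes it possible to check later whether respondents who skipped the question differ from those who answered.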


### Step 7: Summary and Next Steps

**In this lab, you focused on identifying and removing duplicate rows.**

- You handled missing values by imputing the most frequent value in a chosen column.

- You used ConvertedCompYearly for compensation normalization and handled missing values.

- For further analysis, consider exploring other columns or visualizing the cleaned dataset.



<!--
## Change Log

|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-11-05|1.2|Madhusudhan Moole|Updated lab|
|2024-09-24|1.1|Madhusudhan Moole|Updated lab|
|2024-09-23|1.0|Raghul Ramesh|Created lab|

-->


Copyright © IBM Corporation. All rights reserved.
