<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Capstone Part 5:** Removing Duplicates


#### Student Author: Abigail Hedden

## Objectives


* Identify duplicate rows in the dataset
* Use suitable techniques to remove duplicate rows and verify the removal
* Summarize how to handle missing values appropriately
* Use ConvertedCompYearly to normalize compensation data
  

## Set-up

In [1]:
# import required packages
import pandas as pd

## Load the Dataset into a DataFrame


In [8]:
# The dataset loaded by the provided code contains no duplicates, so a seemingly
# older version that does contain duplicates is loaded instead, to practice on.
# Provided code:
# df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv")
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/VYPrOu0Vs3I0hKLLjiPGrA/survey-data-with-duplicate.csv")
df.head()

| |ResponseId|MainBranch|Age|Employment|RemoteWork|Check|CodingActivities|EdLevel|LearnCode|LearnCodeOnline|...|JobSatPoints_6|JobSatPoints_7|JobSatPoints_8|JobSatPoints_9|JobSatPoints_10|JobSatPoints_11|SurveyLength|SurveyEase|ConvertedCompYearly|JobSat|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|1|I am a developer by profession|Under 18 years old|Employed, full-time|Remote|Apples|Hobby|Primary/elementary school|Books / Physical media|NaN|...|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|NaN|
|1|2|I am a developer by profession|35-44 years old|Employed, full-time|Remote|Apples|Hobby;Contribute to open-source projects;Other...|Bachelor’s degree (B.A., B.S., B.Eng., etc.)|Books / Physical media;Colleague;On the job tr...|Technical documentation;Blogs;Books;Written Tu...|...|0.0|0.0|0.0|0.0|0.0|0.0|NaN|NaN|NaN|NaN|
|2|3|I am a developer by profession|45-54 years old|Employed, full-time|Remote|Apples|Hobby;Contribute to open-source projects;Other...|Master’s degree (M.A., M.S., M.Eng., MBA, etc.)|Books / Physical media;Colleague;On the job tr...|Technical documentation;Blogs;Books;Written Tu...|...|NaN|NaN|NaN|NaN|NaN|NaN|Appropriate in length|Easy|NaN|NaN|
|3|4|I am learning to code|18-24 years old|Student, full-time|NaN|Apples|NaN|Some college/university study without earning ...|Other online resources (e.g., videos, blogs, f...|Stack Overflow;How-to videos;Interactive tutorial|...|NaN|NaN|NaN|NaN|NaN|NaN|Too long|Easy|NaN|NaN|
|4|5|I am a developer by profession|18-24 years old|Student, full-time|NaN|Apples|NaN|Secondary school (e.g. American high school, G...|Other online resources (e.g., videos, blogs, f...|Technical documentation;Blogs;Written Tutorial...|...|NaN|NaN|NaN|NaN|NaN|NaN|Too short|Easy|NaN|NaN|




## Identify and handle duplicates

### Identify duplicate rows

In [9]:
# count the number of duplicate rows (all columns)
num_duplicates = df.duplicated().sum()
print(f'Total number of duplicate rows: {num_duplicates}')

# display the first few rows that belong to a duplicate group to understand structure
# (keep=False flags every copy, including the first occurrence)
duplicates = df[df.duplicated(keep=False)]
print(duplicates.head())

Total number of duplicate rows: 20
   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
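The count printed above uses the default `keep='first'`, while the displayed rows use `keep=False`, so the two views flag different sets of rows. The sketch below illustrates the difference on a small made-up frame; the data and the name `toy` are hypothetical and are not taken from the survey. It also shows the `subset` argument, assuming `ResponseId` is meant to identify one respondent.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 4 are identical; rows 1 and 3 are unique.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 1, 3, 1],
    "Age": ["18-24", "25-34", "18-24", "35-44", "18-24"],
})

# Default keep="first": only the repeats after the first occurrence are flagged.
repeats = int(toy.duplicated().sum())

# keep=False: every member of a duplicate group is flagged, including the first.
all_members = int(toy.duplicated(keep=False).sum())

# subset=[...]: rows are compared on the listed columns only.
id_repeats = int(toy.duplicated(subset=["ResponseId"]).sum())

print(repeats, all_members, id_repeats)
```

With `keep=False` the flagged count is larger than the number of rows that `drop_duplicates()` would remove, because the first occurrence of each group is kept.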

### Remove duplicate rows

In [23]:
# drop duplicates
df_no_duplicates = df.drop_duplicates()

# verify removal by counting duplicates after removal
num_duplicates_after = df_no_duplicates.duplicated().sum()
print(f'Number of duplicate rows after removal: {num_duplicates_after}')

Number of duplicate rows after removal: 0
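A second way to verify the removal is to compare row counts: the number of rows dropped should equal the number of rows flagged by `duplicated()`. This is a minimal sketch on made-up data; `toy`, `deduped`, and the values are illustrative assumptions, not part of the lab.

```python
import pandas as pd

# Hypothetical toy data with three redundant rows.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3, 3, 3],
    "Check": ["Apples"] * 6,
})

n_before = len(toy)
n_flagged = int(toy.duplicated().sum())

# reset_index(drop=True) renumbers the rows so the index has no gaps.
deduped = toy.drop_duplicates().reset_index(drop=True)

# The rows removed should match the rows flagged as duplicates.
rows_removed = n_before - len(deduped)
print(f"{n_before} rows -> {len(deduped)} rows ({rows_removed} removed)")
```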


### Handling missing values


In [24]:
# identify the missing values for each column
missing_values = df_no_duplicates.isnull().sum()
print('Missing values per column:\n', missing_values)
print('')

# choose a specific column 'EdLevel', and fill missing values with the most frequent value for that column (mode)
most_frequent_edlevel = df_no_duplicates['EdLevel'].mode()[0]
print('EdLevel mode =', most_frequent_edlevel)
df_no_duplicates['EdLevel'] = df_no_duplicates['EdLevel'].fillna(most_frequent_edlevel)
print('')

# verify no missing values in 'EdLevel'
print(f"Missing values in `EdLevel` after imputation: {df_no_duplicates['EdLevel'].isnull().sum()}")
print('')

Missing values per column:
 ResponseId                 0
MainBranch                 0
Age                        0
Employment                 0
RemoteWork             10631
                       ...  
JobSatPoints_11        35992
SurveyLength            9255
SurveyEase              9199
ConvertedCompYearly    42002
JobSat                 36311
Length: 114, dtype: int64

EdLevel mode = Bachelor’s degree (B.A., B.S., B.Eng., etc.)

Missing values in `EdLevel` after imputation: 0



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_no_duplicates['EdLevel'] = df_no_duplicates['EdLevel'].fillna(most_frequent_edlevel)
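The message printed above is pandas' `SettingWithCopyWarning`. It most likely appears because pandas treats the frame returned by `drop_duplicates()` as derived from `df`. The imputation still takes effect, as the output confirms. One common way to avoid the warning is to take an explicit copy before assigning to columns. The sketch below uses made-up data; `toy` and `deduped` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data: rows 1 and 2 are duplicates and both lack EdLevel.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3],
    "EdLevel": ["Bachelor", None, None, "Bachelor"],
})

# .copy() makes the deduplicated frame independent of the original,
# so later column assignments are unambiguous.
deduped = toy.drop_duplicates().copy()

# mode() ignores missing values by default; [0] takes the first mode.
mode_value = deduped["EdLevel"].mode()[0]
deduped["EdLevel"] = deduped["EdLevel"].fillna(mode_value)

missing_after = int(deduped["EdLevel"].isnull().sum())
print(f"Missing after imputation: {missing_after}")
```

The original `toy` frame is left unchanged, which keeps the raw data available for comparison.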


### Normalizing compensation data
1. Use the ConvertedCompYearly column for compensation analysis as the normalized annual compensation is already provided.
2. Check for missing values in ConvertedCompYearly and handle them if necessary.

In [25]:
# use the ConvertedCompYearly column for compensation analysis, as the normalized annual compensation is already provided.

# check for missing values in ConvertedCompYearly
missing_comp = df_no_duplicates['ConvertedCompYearly'].isnull().sum()
print(f"Missing values in 'ConvertedCompYearly': {missing_comp}")
print('')

# Approach 1:  drop missing values
df_cleaned_drop = df_no_duplicates.dropna(subset=['ConvertedCompYearly'])
print(f"After dropping missing values, dataset shape: {df_cleaned_drop.shape}")
print("This removes a lot of data, so let's also try filling missing values with the mode instead.\n")

# Approach 2: fill missing values with the mode of ConvertedCompYearly instead of dropping
mode_comp = df_no_duplicates['ConvertedCompYearly'].mode()[0]
df_no_duplicates['ConvertedCompYearly'] = df_no_duplicates['ConvertedCompYearly'].fillna(mode_comp)

# verify that no missing values remain in the 'ConvertedCompYearly' column
missing_comp_after = df_no_duplicates['ConvertedCompYearly'].isnull().sum()
print(f"'ConvertedCompYearly' mode value used for imputation: {mode_comp}")
print(f"Missing values after imputation: {missing_comp_after}")

df_cleaned_fill = df_no_duplicates
print(f"After filling missing values, dataset shape: {df_cleaned_fill.shape}\n")

Missing values in 'ConvertedCompYearly': 42002

After dropping missing values, dataset shape: (23435, 114)
This removes a lot of data, so let's also try filling missing values with the mode instead.

'ConvertedCompYearly' mode value used for imputation: 64444.0
Missing values after imputation: 0
After filling missing values, dataset shape: (65437, 114)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_no_duplicates['ConvertedCompYearly'] = df_no_duplicates['ConvertedCompYearly'].fillna(mode_comp)
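In the run above, 42002 of 65437 rows (roughly 64%) had no `ConvertedCompYearly` value, so filling them with one number makes that number the dominant value in the column. As an alternative to the mode used above, the median is a common choice for a continuous, skewed quantity such as pay. The sketch below swaps in the median and records which rows were imputed. The data and the column names `CompWasMissing` and `CompMedianFilled` are hypothetical.

```python
import pandas as pd

# Hypothetical toy compensation values with two missing entries.
toy = pd.DataFrame({
    "ConvertedCompYearly": [40000.0, 64444.0, 64444.0, 90000.0,
                            120000.0, 250000.0, None, None],
})

# Record which rows were missing before any value is filled in.
toy["CompWasMissing"] = toy["ConvertedCompYearly"].isna()

# median() skips missing values by default.
median_comp = toy["ConvertedCompYearly"].median()
mode_comp = toy["ConvertedCompYearly"].mode()[0]

# Fill into a new column so the original values stay available.
toy["CompMedianFilled"] = toy["ConvertedCompYearly"].fillna(median_comp)

print(f"median = {median_comp}, mode = {mode_comp}")
```

Keeping the indicator column makes it possible to exclude imputed rows from later compensation analysis.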


<!--
## Change Log

|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-11-05|1.2|Madhusudhan Moole|Updated lab|
|2024-09-24|1.1|Madhusudhan Moole|Updated lab|
|2024-09-23|1.0|Raghul Ramesh|Created lab|

-->


## <h3 align="center"> © IBM Corporation. All rights reserved. </h3>
