[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%202%20Notebooks/GDAN%205400%20-%20Week%202%20Notebooks%20%28VIII%29%20-%20Identifying%20Duplicate%20Records.ipynb)

This notebook provides recipes for identifying duplicate records in PANDAS.

In [None]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

# Load Packages and Set Working Directory
Import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks.

In [None]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_info_columns', 500)

# Read in Data

In [None]:
import pandas as pd
import requests

# NOTE: replace `https://github.com/` with `https://raw.githubusercontent.com`
# https://github.com/gdsaxton/GDAN5400/blob/main/Coding%20Assignment%201/final_insurance_fraud.xlsx
url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/main/Coding%20Assignment%201/final_insurance_fraud.xlsx'

# Download the file and stop early if the request failed
response = requests.get(url)
response.raise_for_status()
with open('final_insurance_fraud.xlsx', 'wb') as f:
    f.write(response.content)

# Load the Excel file
df = pd.read_excel('final_insurance_fraud.xlsx', engine='openpyxl')

df.head()

# Identifying Duplicate Records in PANDAS

**[ChatGPT prompt]** `How can I identify duplicate records in PANDAS?`

In PANDAS, you can identify duplicates in a DataFrame or Series using the `duplicated()` method. Below are some of the key techniques you should know. 

### Using `duplicated()`
The `duplicated()` method marks duplicates as True and unique values as False. Note that the `duplicated()` method only *flags* duplicates but does not remove them.

By default, `duplicated()` flags as False the first occurrence of a duplicate and marks subsequent ones as True. You can change this behavior using the `keep` parameter.
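As a minimal sketch of how the `keep` parameter changes what gets flagged, the example below uses a small made-up table (the `toy` DataFrame and its values are hypothetical and not part of the insurance dataset).

```python
import pandas as pd

# Hypothetical toy data, not taken from the insurance dataset
toy = pd.DataFrame({"Surname": ["Lee", "Kim", "Lee", "Lee", "Cho"]})

# keep="first" (the default): the first occurrence is False, later repeats are True
first = toy.duplicated(keep="first")

# keep="last": the last occurrence is False, earlier repeats are True
last = toy.duplicated(keep="last")

# keep=False: every row that belongs to a duplicate group is True
every = toy.duplicated(keep=False)

# Show the three flags side by side for comparison
print(pd.DataFrame({"Surname": toy["Surname"], "first": first, "last": last, "all": every}))
```

Only the three `Lee` rows are ever flagged; the `keep` setting decides which of them, if any, is treated as the "original".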

### Check for duplicates across all columns
Called with no arguments, `duplicated()` compares entire rows. In this dataset, no fully duplicated rows are identified.

In [None]:
duplicates = df.duplicated()
duplicates.value_counts()

Note that `duplicates` is not a DataFrame; it is a boolean Series with one value per row of `df`.
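Because the result is a boolean Series, it can serve as a row mask. The sketch below illustrates this on a made-up table; the `toy` DataFrame, its `City` column, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from the insurance dataset
toy = pd.DataFrame({
    "Surname": ["Lee", "Kim", "Lee"],
    "City": ["A", "B", "A"],
})

# One True/False value per row; the index matches the DataFrame's index
mask = toy.duplicated()
print(type(mask).__name__, mask.dtype)

# Use the mask to select the repeated rows, or invert it with ~ to exclude them
repeats = toy[mask]
no_repeats = toy[~mask]
print(repeats)
print(no_repeats)
```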

In [None]:
print(type(df))
print(type(duplicates))

### Check for Duplicates Across Specific Columns
To find duplicates based on specific column(s):

In [None]:
# Identify duplicates based on the 'Surname' column
duplicates = df.duplicated(subset=['Surname'])
print(duplicates.value_counts(), '\n')
duplicates[25:30]

### Count duplicates

In [None]:
num_duplicates = df.duplicated(subset=['Surname']).sum()
print(f"Number of duplicates: {num_duplicates}")

### Flag All Duplicates
Calling `duplicated(keep=False)` flags every occurrence of a duplicated row in a DataFrame or Series, including the first one.
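The choice of `keep` also changes what a count of flagged rows means. The following sketch, on hypothetical toy data, contrasts three numbers that are easy to confuse: the extra copies, the rows that belong to any duplicate group, and the total row count.

```python
import pandas as pd

# Hypothetical toy data, not taken from the insurance dataset
toy = pd.DataFrame({"Surname": ["Lee", "Kim", "Lee", "Lee", "Cho", "Kim"]})

# Default keep="first": counts only the repeats after each first occurrence
extra_copies = toy.duplicated(subset=["Surname"]).sum()

# keep=False: counts every row whose surname appears more than once
rows_in_groups = toy.duplicated(subset=["Surname"], keep=False).sum()

# len() of the flag Series is just the number of rows, not a duplicate count
total_rows = len(toy.duplicated(subset=["Surname"], keep=False))

print(extra_copies, rows_in_groups, total_rows)
```

Here `Lee` appears three times and `Kim` twice, so there are 3 extra copies, 5 rows in duplicate groups, and 6 rows overall.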

In [None]:
df.duplicated?

In [None]:
# Mark all occurrences of duplicates
all_duplicates = df.duplicated(subset=['Surname'], keep=False)
print('Number of rows flagged as duplicates:', all_duplicates.sum(), '\n')
print(all_duplicates.value_counts())

### Add a column to indicate duplicates
Adding a column with `duplicated()` is useful for:
- Analyzing Duplicates: You can inspect which rows are flagged as duplicates.
- Conditional Operations: Filter, modify, or drop rows based on their duplicate status.
- Debugging Data Quality: Quickly identify and investigate duplicate rows in your dataset.

With this approach, we are still working with a PANDAS *DataFrame* rather than a separate Series.
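The uses listed above can be sketched on a small made-up table. The `toy` DataFrame and its `Claim` column are hypothetical; `drop_duplicates()` is shown only as one possible follow-up step, not as part of this notebook's workflow.

```python
import pandas as pd

# Hypothetical toy data, not taken from the insurance dataset
toy = pd.DataFrame({
    "Surname": ["Lee", "Kim", "Lee", "Cho"],
    "Claim": [100, 250, 100, 75],
})

# Store the flag alongside the data so it travels with each row
toy["Is_Duplicate"] = toy.duplicated(subset=["Surname"], keep=False)

# Analyzing duplicates: inspect only the flagged rows
flagged = toy[toy["Is_Duplicate"]].sort_values("Surname")

# Conditional operations: keep only rows whose surname is not repeated
unique_only = toy[~toy["Is_Duplicate"]]

# Alternative: keep one row per surname (the first occurrence)
deduped = toy.drop_duplicates(subset=["Surname"], keep="first")

print(flagged)
print(unique_only)
print(deduped)
```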

In [None]:
df[:1]

In [None]:
# Add a column to indicate duplicates
df['Is_Duplicate'] = df.duplicated(subset=['Surname'], keep=False)
df['Is_Duplicate'].value_counts()

In this example we are also *selecting* all records flagged as duplicates, sorting the result by `Surname`, and displaying the first 6 records.

In [None]:
df[df['Is_Duplicate'] == True].sort_values('Surname')[:6]

In [None]:
# Flag rows that share the same combination of `Surname` and `Street Address`
# Note that we are overwriting our column `Is_Duplicate`
df['Is_Duplicate'] = df.duplicated(subset=['Surname', 'Street Address'], keep=False)
df[df['Is_Duplicate'] == True]
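To illustrate why adding a second column to `subset` narrows the result, here is a minimal sketch on hypothetical toy data (the names and addresses are made up). A row is flagged only when every column listed in `subset` matches another row.

```python
import pandas as pd

# Hypothetical toy data, not taken from the insurance dataset
toy = pd.DataFrame({
    "Surname": ["Lee", "Lee", "Lee", "Kim"],
    "Street Address": ["1 Main St", "9 Oak Ave", "1 Main St", "1 Main St"],
})

# Matching on surname alone flags all three Lee rows
by_surname = toy.duplicated(subset=["Surname"], keep=False)

# Matching on surname AND address flags only the two Lee rows at 1 Main St
by_both = toy.duplicated(subset=["Surname", "Street Address"], keep=False)

print(pd.DataFrame({"by_surname": by_surname, "by_both": by_both}))
```

The `Kim` row shares an address with two other rows but is not flagged in either case, because its surname is unique.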