[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%204%20Notebooks/GDAN%205400%20-%20Week%204%20Notebooks%20%28V%29%20-%20Dropping%20Duplicate%20Observations.ipynb)

This notebook provides recipes for removing duplicate observations from a PANDAS dataframe.

In [None]:
%%time
import datetime
print("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

# Load Packages and Set Working Directory
Import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library</a>, or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks.

In [None]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_info_columns', 500)

# Read in Data

In [None]:
import pandas as pd
import requests

# NOTE: replace `https://github.com/` with `https://raw.githubusercontent.com`
# https://github.com/gdsaxton/GDAN5400/blob/main/Coding%20Assignment%201/final_insurance_fraud.xlsx
url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/main/Coding%20Assignment%201/final_insurance_fraud.xlsx'

# Download the file and stop early if the request failed
response = requests.get(url)
response.raise_for_status()
with open('final_insurance_fraud.xlsx', 'wb') as f:
    f.write(response.content)

# Load the Excel file
df = pd.read_excel('final_insurance_fraud.xlsx', engine='openpyxl')

df.head()

In [None]:
#APPLY DATA CLEANING OPERATIONS FROM CODING ASSIGNMENT 1
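# .copy() makes the filtered frame independent, avoiding SettingWithCopyWarning on the assignments below
df = df[df['Policy Number'].notnull()].copy()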
df['Estimated cost to repair'] = df['Estimated cost to repair'].fillna(0)
df['Estimated cost to replace'] = df['Estimated cost to replace'].fillna(0)

# Identifying Duplicate Observations
In Coding Assignment 1, we **_identified_** duplicate observations using `duplicated()`.
  - It is good practice to first identify duplicates using `duplicated()` and then decide how to handle them.
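
To make the behavior of `duplicated()` concrete, here is a minimal sketch on a small made-up dataframe (the `toy` data and its column values are hypothetical, not taken from the insurance file). It shows how the `keep` argument changes which rows are flagged.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 share the same address.
toy = pd.DataFrame({
    'Street Address': ['Main St', 'Oak Ave', 'Main St'],
    'City': ['Toronto', 'Toronto', 'Toronto'],
    'Claim': [100, 200, 300],
})

# Default keep='first': only the later repeat is flagged.
mask_first = toy.duplicated(subset=['Street Address', 'City'])

# keep=False: every row belonging to a duplicated group is flagged.
mask_all = toy.duplicated(subset=['Street Address', 'City'], keep=False)

print(mask_first.tolist())  # [False, False, True]
print(mask_all.tolist())    # [True, False, True]
```

Using `keep=False` when inspecting is handy because it shows all members of each duplicated group side by side before you decide which one to keep.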

In [None]:
duplicates = df[df.duplicated(subset=['House/Apartment Number', 'Street Address', 'City', 'Zip Code'], keep=False)]
print(f"Number of duplicate claims: {len(duplicates)}") 
duplicates

# Dropping Duplicate Observations

**[ChatGPT prompt]** `How can I delete duplicate observations in PANDAS?`

# Removing Duplicate Observations in Pandas

The `drop_duplicates()` method in pandas is the primary way to remove duplicate observations. Here’s a detailed guide:

---

### Remove All Duplicate Rows
```python
df = df.drop_duplicates()
```

This removes any rows that are identical across all columns, keeping the first occurrence.
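
As a quick illustration, the sketch below applies `drop_duplicates()` to a small made-up dataframe (the `toy` data is hypothetical). Note that the surviving rows keep their original index labels; `reset_index(drop=True)` is one way to renumber them if that matters for your workflow.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical across all columns.
toy = pd.DataFrame({
    'A': [1, 1, 1],
    'B': ['x', 'x', 'y'],
})

# Row 1 is dropped because it repeats row 0 exactly.
deduped = toy.drop_duplicates()
print(deduped.index.tolist())  # [0, 2]

# Optional: renumber the rows 0..n-1 after dropping.
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```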


### Specify Columns to Check for Duplicates
```python
df = df.drop_duplicates(subset=['Column1', 'Column2'])
```
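This removes any row whose combination of `Column1` and `Column2` values repeats that of an earlier row, keeping the first occurrence. The other columns are ignored when checking for duplicates.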
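
A small sketch of the `subset` behavior, using made-up data (the `toy` dataframe and the `Other` column are hypothetical). Rows are compared only on the listed columns, so two rows can count as duplicates even when their remaining columns differ.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 share ('a', 1) but differ in 'Other'.
toy = pd.DataFrame({
    'Column1': ['a', 'a', 'a', 'b'],
    'Column2': [1, 1, 2, 1],
    'Other': [10, 20, 30, 40],
})

# Only Column1 and Column2 are compared; row 1 is dropped as a repeat of row 0.
result = toy.drop_duplicates(subset=['Column1', 'Column2'])
print(result['Other'].tolist())  # [10, 30, 40]
```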


### Keep the First Occurrence (Default)
```python
df = df.drop_duplicates(keep='first')
```

### Keep the Last Occurrence
```python
df = df.drop_duplicates(keep='last')
```

### Remove All Occurrences of Duplicates
```python
df = df.drop_duplicates(keep=False)
```
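
The three `keep` options are easiest to compare side by side. The sketch below runs each one on the same made-up dataframe (the `toy` data and its `key`/`value` columns are hypothetical).

```python
import pandas as pd

# Hypothetical toy data: key 'a' appears twice (rows 0 and 2).
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'c'], 'value': [1, 2, 3, 4]})

first = toy.drop_duplicates(subset=['key'], keep='first')  # keeps row 0, drops row 2
last = toy.drop_duplicates(subset=['key'], keep='last')    # keeps row 2, drops row 0
neither = toy.drop_duplicates(subset=['key'], keep=False)  # drops both rows 0 and 2

print(first['value'].tolist())    # [1, 2, 4]
print(last['value'].tolist())     # [2, 3, 4]
print(neither['value'].tolist())  # [2, 4]
```

With `keep=False`, every member of a duplicated group is removed, which is useful when you cannot tell which copy is the trustworthy one.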

### Create a New DataFrame Without Duplicates
```python
new_df = df.drop_duplicates()
```

### Modify the Original DataFrame
```python
df.drop_duplicates(inplace=True)
```
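
One caveat worth illustrating: with `inplace=True`, the method modifies the dataframe and returns `None`, so assigning the result back to `df` would overwrite your data with `None`. A minimal sketch on made-up data (the `toy` dataframe is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one repeated key.
toy = pd.DataFrame({'key': ['a', 'a', 'b']})

# inplace=True changes `toy` itself and returns None.
returned = toy.drop_duplicates(inplace=True)

print(returned)  # None
print(len(toy))  # 2
```

For that reason, pick one style per statement: either `df = df.drop_duplicates()` or `df.drop_duplicates(inplace=True)`, never both together.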

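<br>Now let's run this on our dataframe. Note that we are dropping duplicates based on only two columns, `House/Apartment Number` and `Street Address`. This is a narrower key than the four columns used in the identification step above, so at least as many rows will count as duplicates.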

In [None]:
print(len(df))
df = df.drop_duplicates(subset=['House/Apartment Number', 'Street Address'], keep='first')
print(len(df))

<br>We can now re-run the duplicates check and see whether any remain in the dataframe.

In [None]:
duplicates = df[df.duplicated(subset=['House/Apartment Number', 'Street Address'], keep=False)]
print(f"Number of duplicate claims: {len(duplicates)}") 
duplicates
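
If you would rather have the notebook fail loudly than inspect the output by eye, the same check can be written as an assertion. This is a minimal sketch on made-up data (the `toy` dataframe and the `key_cols` name are hypothetical); the same pattern should apply to `df` after the drop step.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 share the same house number and street.
toy = pd.DataFrame({
    'House/Apartment Number': [1, 1, 2],
    'Street Address': ['Main St', 'Main St', 'Main St'],
})
key_cols = ['House/Apartment Number', 'Street Address']

toy = toy.drop_duplicates(subset=key_cols, keep='first')

# Count rows still flagged as duplicates on the key columns.
remaining = int(toy.duplicated(subset=key_cols, keep=False).sum())
assert remaining == 0, f"{remaining} duplicate rows remain"
print(len(toy))  # 2
```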