[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN%205400/blob/main/Coding%20Assignment%202/GDAN5400%20-%20Coding%20Assignment%202.ipynb)

# Coding Assignment #2

In this second assignment, you will continue working with the roof insurance claim dataset from Coding Assignment #1. You will:
1. Use built-in and custom functions
2. Convert data types
3. Generate a binary variable
4. Drop unneeded columns
5. Drop a duplicate observation
6. Run frequencies
7. Aggregate and group the data

## CASE INTRODUCTION

Casey Lee, an insurance claims processor, was reviewing claims received from a recent storm before finalizing authorization for roof replacements. She pulled up and reread the U.S. National Weather Service announcement:

&nbsp;&nbsp; TORNADO WARNING  
&nbsp;&nbsp; NATIONAL WEATHER SERVICE CHICAGO/ROMEOVILLE   
&nbsp;&nbsp;1215 AM CDT THU SEP 12 20XX  

&nbsp;&nbsp;THE NATIONAL WEATHER SERVICE IN CHICAGO HAS ISSUED A   
&nbsp;&nbsp;*TORNADO WARNING FOR...    
&nbsp;&nbsp;CENTRAL DEKALB COUNTY IN NORTH CENTRAL ILLINOIS...    
&nbsp;&nbsp;UNTIL 530 PM CDT.  

&nbsp;&nbsp;*AT 1218 AM CDT, A SEVERE THUNDERSTORM CAPABLE OF PRODUCING A  
&nbsp;&nbsp;TORNADO WAS LOCATED NEAR SYCAMORE,  
&nbsp;&nbsp;OR NEAR SHABBONA, MOVING SOUTHWEST AT 2 MPH.  
&nbsp;&nbsp;&nbsp;&nbsp;  HAZARD...TORNADO AND QUARTER-SIZED HAIL.  
&nbsp;&nbsp;&nbsp;&nbsp;  SOURCE...RADAR INDICATED ROTATION.  
&nbsp;&nbsp;&nbsp;&nbsp; IMPACT...FLYING DEBRIS WILL BE DANGEROUS TO THOSE CAUGHT WITHOUT SHELTER.   
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;MOBILE HOMES WILL BE DAMAGED OR DESTROYED.  
   &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;   DAMAGE TO ROOFS, WINDOWS, AND VEHICLES WILL OCCUR. TREE DAMAGE IS LIKELY.  

&nbsp;&nbsp;*THIS DANGEROUS STORM WILL BE NEAR...  
&nbsp;&nbsp;SYCAMORE AROUND 1240 AM CDT.   
&nbsp;&nbsp;DEKALB AROUND 600 AM CDT.  
&nbsp;&nbsp;COURTLAND AROUND 1140 AM.     

&nbsp;&nbsp;PRECAUTIONARY/PREPAREDNESS ACTIONS...   

&nbsp;&nbsp;TAKE COVER NOW! MOVE TO A BASEMENT OR AN INTERIOR ROOM  
&nbsp;&nbsp;ON THE LOWEST FLOOR OF A STURDY BUILDING.  
&nbsp;&nbsp;AVOID WINDOWS. IF YOU ARE OUTDOORS, IN A MOBILE HOME, OR IN A VEHICLE,   
&nbsp;&nbsp;MOVE TO THE CLOSEST SUBSTANTIAL SHELTER AND PROTECT YOURSELF FROM FLYING DEBRIS.    

Indeed, it appeared to be a bad storm, which could substantiate the large number of claims that she received for new roofs from hail and wind damage. Yet, she felt that something could be off.

While Casey could not process the data from multiple companies, she knew that the National Insurance Crime Bureau might be able to help by aggregating data from multiple insurance companies across the area hit by the storm and evaluating it for anomalies. Casey's request landed on your desk. As a new fraud specialist, you have been hired to investigate claims following storm damage, with the aim of reducing payouts to false claimants. You know you have to act fast. You begin by pulling the claims data for roofs. You also receive a database showing the actual path of the storm. Your task is to sort through the claims to see whether there are any unusual claim patterns from this recent weather event.

---
Case introduction and dataset come from: Cheng, C., & Lee, C.-C. (2023). A Case Study Using Data Analytics to Detect Hail Damage Insurance Claim Fraud. *Journal of Forensic Accounting Research, 8,* 287–306.

# Load and Prepare the Dataset
We will first get set up to run the assignment, using code from Coding Assignment #1.

### Load the Dataset and show first two rows:
  - Load the roof insurance claim dataset (provided in `.xlsx` format) into a Pandas DataFrame named `df`
  - Show the first two rows

In [None]:
import pandas as pd
import requests

# NOTE: replace `https://github.com/` with `https://raw.githubusercontent.com`
# https://github.com/gdsaxton/GDAN5400/blob/main/Coding%20Assignment%201/final_insurance_fraud.xlsx
url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/main/Coding%20Assignment%201/final_insurance_fraud.xlsx'

# Download the file
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
with open('final_insurance_fraud.xlsx', 'wb') as f:
    f.write(response.content)

# Load the Excel file
df = pd.read_excel('final_insurance_fraud.xlsx', engine='openpyxl')

df[:2]
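
`df[:2]` slices the first two rows by position; `df.head(2)` is the more common spelling of the same operation. A minimal sketch on a made-up frame (the values below are illustrative and are not taken from the claims file):

```python
import pandas as pd

# Hypothetical toy data standing in for the claims file
toy = pd.DataFrame({
    'Policy Number': [101, 102, 103],
    'City': ['Sycamore', 'DeKalb', 'Cortland'],
})

first_two_slice = toy[:2]      # positional row slice
first_two_head = toy.head(2)   # same rows, arguably easier to read

print(first_two_head)
print(first_two_slice.equals(first_two_head))
```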

### Apply the data cleaning operations from Coding Assignment #1

In [None]:
print(len(df))
df = df[df['Policy Number'].notnull()].copy()  # .copy() avoids SettingWithCopyWarning on the assignments below
df['Estimated cost to repair'] = df['Estimated cost to repair'].fillna(0)
df['Estimated cost to replace'] = df['Estimated cost to replace'].fillna(0)
print(len(df))
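
To see what these two cleaning steps do, here is a small sketch on invented data (the toy values are assumptions, not rows from the real file). Rows without a policy number are removed, and missing cost estimates are replaced with 0:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing policy number and one missing cost
toy = pd.DataFrame({
    'Policy Number': [101, np.nan, 103],
    'Estimated cost to repair': [2500.0, 1800.0, np.nan],
})

# Keep only rows that have a policy number
cleaned = toy[toy['Policy Number'].notnull()].copy()

# Replace missing cost estimates with 0
cleaned['Estimated cost to repair'] = cleaned['Estimated cost to repair'].fillna(0)

print(len(toy), '->', len(cleaned))
print(cleaned)
```

Keep in mind that filling with 0 pulls averages down, since the filled rows count toward any mean computed later.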

### Strip any whitespace from column names to avoid issues

In [None]:
df.columns = df.columns.str.strip()
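
A quick illustration of why this matters, using invented column names with stray spaces (an assumption for demonstration; the real file may or may not contain any):

```python
import pandas as pd

# Hypothetical column names with leading/trailing whitespace
toy = pd.DataFrame({' City': ['Sycamore'], 'Zip Code ': [60178]})
print(list(toy.columns))   # the names still carry the stray spaces

toy.columns = toy.columns.str.strip()
print(list(toy.columns))   # now toy['City'] and toy['Zip Code'] work as expected
```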

### **Identify Duplicate Claims:**
- Use pandas to identify whether there are any duplicate claims in your dataset based on ``House/Apartment Number``, ``Street Address``, ``City``, and ``Zip Code``.
- This step is from Coding Assignment #1; in this assignment, you will drop one of these duplicates.

In [None]:
duplicates = df[df.duplicated(subset=['House/Apartment Number', 'Street Address', 'City', 'Zip Code'], keep=False)]
print(f"Number of duplicate claims: {len(duplicates)}") #Optional step
duplicates
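
The `keep` argument controls which rows `duplicated()` flags. A minimal sketch on made-up addresses (not rows from the claims file):

```python
import pandas as pd

# Hypothetical toy data: the first two rows share an address
toy = pd.DataFrame({
    'Street Address': ['Main St', 'Main St', 'Oak Ave'],
    'City': ['Sycamore', 'Sycamore', 'DeKalb'],
})
key = ['Street Address', 'City']

all_copies = toy.duplicated(subset=key, keep=False)   # flags every row in a duplicate group
later_copies = toy.duplicated(subset=key)             # default keep='first': flags only the repeats

print(all_copies.tolist())
print(later_copies.tolist())
```

Using `keep=False`, as in the cell above, is handy for inspection because it shows both copies side by side.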

# **Instructions: Steps to Complete**

### Task 1: Write and Apply Functions
- Create a variable `avg_repair_cost` that calculates the average `Estimated cost to repair` 
  - *hints*: 1) this is not a variable added to the dataset – it should return a single number; 2) use a `built-in` function for this task
- Write a custom function to categorize roof age as 'Old' if `Age of roof` is greater than 15 and 'New' otherwise; apply the function to the dataset and create a new variable in `df` called `roof_age_category`
  - After you have created the new variable, show `df` with first 10 rows of only the `Age of roof` and `roof_age_category` columns

In [None]:
avg_repair_cost = df['Estimated cost to repair'].mean()
print("Average Repair Cost:", avg_repair_cost)

In [None]:
def roof_age_category(age):
    return 'Old' if age > 15 else 'New'
df['roof_age_category'] = df['Age of roof'].apply(roof_age_category)
print(df['roof_age_category'].value_counts(), '\n')
df[['Age of roof', 'roof_age_category']][:10]
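
The assignment asks for a custom function, but the same categorization can also be written without `.apply()`. A sketch of a vectorized alternative using `np.where`, on invented roof ages:

```python
import numpy as np
import pandas as pd

# Hypothetical roof ages, including the boundary value 15
toy = pd.DataFrame({'Age of roof': [3, 15, 16, 30]})

# Vectorized equivalent of the row-by-row .apply() version
toy['roof_age_category'] = np.where(toy['Age of roof'] > 15, 'Old', 'New')
print(toy)
```

Note that a roof aged exactly 15 is categorized as 'New', because the condition is strictly greater than 15.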

### Task 2: Convert Data Type
- Convert `Zip Code` to `string` format

In [None]:
df['Zip Code'].dtype

In [None]:
df['Zip Code'] = df['Zip Code'].astype(str)
df[['Zip Code']][:5]

In [None]:
df['Zip Code'].dtype
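
One reason to store zip codes as text is that numeric storage drops leading zeros. The sketch below uses invented values (the third stands in for a zip such as 02134; whether the claims file contains any like it is not assumed) and shows one way to restore the padding with `str.zfill`:

```python
import pandas as pd

# Hypothetical zip codes stored as integers
toy = pd.DataFrame({'Zip Code': [60115, 60178, 2134]})

as_text = toy['Zip Code'].astype(str)
padded = as_text.str.zfill(5)   # left-pad with zeros to five characters

print(as_text.tolist())
print(padded.tolist())
```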

### Task 3: Create Binary Variable
- Create a new variable in `df` that flags claims for high rainfall based on whether `Rainfall` is greater than 1.5
  - Call the new variable `high_rainfall` and assign values of 1 for rainfall > 1.5, otherwise 0 

In [None]:
#Option 1: lambda function
df['high_rainfall'] = df['Rainfall'].apply(lambda x: 1 if x > 1.5 else 0)
df[['Rainfall', 'high_rainfall']].head()

In [None]:
#Option 2: Custom function 
def classify_rainfall(rainfall, threshold=1.5):
    return 1 if rainfall > threshold else 0

# Apply the custom function
df['high_rainfall'] = df['Rainfall'].apply(classify_rainfall)
df[['Rainfall', 'high_rainfall']].head()

In [None]:
#Option 3: Using a Boolean Condition Directly
df['high_rainfall'] = (df['Rainfall'] > 1.5).astype(int)
df[['Rainfall', 'high_rainfall']].head()

In [None]:
#Option 4: Using `np.where()` from NumPy
import numpy as np
df['high_rainfall'] = np.where(df['Rainfall'] > 1.5, 1, 0)
df[['Rainfall', 'high_rainfall']].head()
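
The options above are interchangeable. A small check on invented rainfall values, including the boundary value 1.5, suggests they agree:

```python
import numpy as np
import pandas as pd

# Hypothetical rainfall amounts
toy = pd.DataFrame({'Rainfall': [0.4, 1.5, 1.6, 2.3]})

via_apply = toy['Rainfall'].apply(lambda x: 1 if x > 1.5 else 0)
via_bool = (toy['Rainfall'] > 1.5).astype(int)
via_where = pd.Series(np.where(toy['Rainfall'] > 1.5, 1, 0), index=toy.index)

print(via_apply.tolist())
print(via_bool.tolist())
print(via_where.tolist())
```

The Boolean and `np.where` versions operate on the whole column at once, which tends to be faster than `.apply()` on large datasets.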

### Task 4: Drop Unnecessary Columns
- Drop the following columns from the dataframe: `['Latitude', 'Longitude', 'Telephon number']`

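In [None]:
# Option 1: drop in place (errors='ignore' lets the cell be re-run safely)
columns_to_drop = ['Latitude', 'Longitude', 'Telephon number']
df.drop(columns=columns_to_drop, inplace=True, errors='ignore')

In [None]:
# Option 2: reassign the result (`columns=` already targets columns, so axis=1 is not needed)
columns_to_drop = ['Latitude', 'Longitude', 'Telephon number']
df = df.drop(columns=columns_to_drop, errors='ignore')

In [None]:
# Option 3: pass the labels positionally together with axis=1
df = df.drop(['Latitude', 'Longitude', 'Telephon number'], axis=1, errors='ignore')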

In [None]:
print(df.columns)
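
Dropping a column that no longer exists raises a `KeyError`, which is easy to trigger when re-running notebook cells. A sketch on a made-up frame showing the error and the `errors='ignore'` workaround:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'City': ['Sycamore'], 'Latitude': [41.99], 'Longitude': [-88.69]})
cols = ['Latitude', 'Longitude']

toy = toy.drop(columns=cols)           # first drop succeeds
try:
    toy.drop(columns=cols)             # the columns are already gone
    second_drop_failed = False
except KeyError:
    second_drop_failed = True

toy = toy.drop(columns=cols, errors='ignore')   # safe to re-run
print(second_drop_failed, list(toy.columns))
```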

### Task 5: Drop Duplicate Claim
Delete one of the duplicate observations identified in Coding Assignment #1 from the dataset (see `duplicates` above).

In [None]:
print(len(df))
df = df.drop_duplicates(subset=['House/Apartment Number', 'Street Address', 'City', 'Zip Code'], keep='first')
print(len(df))

[Optional] We can now re-run the duplicates check to confirm that none remain in the dataframe.

In [None]:
duplicates = df[df.duplicated(subset=['House/Apartment Number', 'Street Address', 'City', 'Zip Code'], keep=False)]
print(f"Number of duplicate claims: {len(duplicates)}") #Optional step
duplicates
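
Which copy survives depends on `keep`. A minimal sketch with invented claims (the `Claim` column is hypothetical and used only to tell the rows apart):

```python
import pandas as pd

# Hypothetical toy data: claims A and B share an address
toy = pd.DataFrame({
    'Street Address': ['Main St', 'Main St', 'Oak Ave'],
    'Claim': ['A', 'B', 'C'],
})

kept_first = toy.drop_duplicates(subset=['Street Address'], keep='first')
kept_last = toy.drop_duplicates(subset=['Street Address'], keep='last')

print(kept_first['Claim'].tolist())
print(kept_last['Claim'].tolist())
```

Before dropping, it is worth looking at both copies to decide which one should be retained.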

### Task 6: Show the Frequencies for `Adjuster` 
In this dataset, where each row is a different claim, the frequencies will tell us the **_number of claims_** handled by each Adjuster

In [None]:
# Count the number of claims handled by each Adjuster
adjuster_counts = df['Adjuster'].value_counts()
print(type(adjuster_counts))
adjuster_counts
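
`value_counts()` returns a Series sorted from most to least frequent. Passing `normalize=True` gives shares instead of counts, which can make an unusually busy adjuster easier to spot. A sketch with made-up adjuster names:

```python
import pandas as pd

# Hypothetical adjuster names, one row per claim
toy = pd.DataFrame({'Adjuster': ['Kim', 'Lopez', 'Kim', 'Kim']})

counts = toy['Adjuster'].value_counts()                 # number of claims per adjuster
shares = toy['Adjuster'].value_counts(normalize=True)   # proportion of all claims

print(counts)
print(shares)
```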

### Task 7: Aggregate Data to Get Averages by `Roofing Company`
- Use `groupby()` to create a new dataframe called `average_by_roofer` that contains the average values of `Estimated cost to repair` and `Estimated cost to replace` for each `Roofing Company`. 

In [None]:
average_by_roofer = df.groupby('Roofing Company')[['Estimated cost to repair', 
                                                   'Estimated cost to replace']].mean().reset_index()
average_by_roofer
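
An average is easier to interpret alongside the number of claims behind it. As an optional extension (not required by the task), the sketch below uses named aggregation on invented data; the output column names `avg_repair` and `n_claims` are hypothetical:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Roofing Company': ['Acme', 'Acme', 'Best'],
    'Estimated cost to repair': [1000.0, 3000.0, 500.0],
})

summary = (
    toy.groupby('Roofing Company')
       .agg(avg_repair=('Estimated cost to repair', 'mean'),
            n_claims=('Estimated cost to repair', 'count'))
       .reset_index()
)
print(summary)
```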

---

## **Deliverables**
1. Submit the link to your Google Colab notebook in the assignment area in Canvas.
2. Include comments in your code to explain each step.

## Submission
- Submit your completed Colab notebook with all code cells executed.
- Ensure your notebook includes helpful explanations (as Markdown cells) for each step.