[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Coding%20Assignment%201/GDAN%205400%20-%20Coding%20Assignment%201%20Solutions.ipynb)

# Coding Assignment #1

Welcome to your first coding assignment! You will work with the provided dataset, which contains information about roof insurance claims. In this assignment, you will:
1. Load and inspect the dataset.
2. Perform basic data exploration.
3. Practice Python programming skills, including:
   - Using basic arithmetic and comparison operators.
   - Creating and working with lists.
   - Using booleans.
   - Writing for loops.
   - Using if-elif-else statements.
4. Develop a Python-based workflow to prepare the roof insurance claim dataset for analysis. This assignment focuses on learning how to:
   - Extract, clean, and transform data.
   - Identify and handle missing values.
   - Filter and organize data for analysis.

## Dataset
The dataset you'll be working with is `final_insurance_fraud.xlsx`. This file contains information about insurance claims, including details like claim type, amount, and fraud status.



## CASE INTRODUCTION.

Casey Lee, an insurance claims processor, was reviewing claims received from a recent storm before finalizing authorization for roof replacements. She pulled up and reread the U.S. National Weather Service announcement:

&nbsp;&nbsp; TORNADO WARNING  
&nbsp;&nbsp; NATIONAL WEATHER SERVICE CHICAGO/ROMEOVILLE   
&nbsp;&nbsp;1215 AM CDT THU SEP 12 20XX  

&nbsp;&nbsp;THE NATIONAL WEATHER SERVICE IN CHICAGO HAS ISSUED A   
&nbsp;&nbsp;*TORNADO WARNING FOR...    
&nbsp;&nbsp;CENTRAL DEKALB COUNTY IN NORTH CENTRAL ILLINOIS...    
&nbsp;&nbsp;UNTIL 530 PM CDT.  

&nbsp;&nbsp;*AT 1218 AM CDT, A SEVERE THUNDERSTORM CAPABLE OF PRODUCING A  
&nbsp;&nbsp;TORNADO WAS LOCATED NEAR SYCAMORE,  
&nbsp;&nbsp;OR NEAR SHABBONA, MOVING SOUTHWEST AT 2 MPH.  
&nbsp;&nbsp;&nbsp;&nbsp;  HAZARD...TORNADO AND QUARTER-SIZED HAIL.  
&nbsp;&nbsp;&nbsp;&nbsp;  SOURCE...RADAR INDICATED ROTATION.  
&nbsp;&nbsp;&nbsp;&nbsp; IMPACT...FLYING DEBRIS WILL BE DANGEROUS TO THOSE CAUGHT WITHOUT SHELTER.   
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;MOBILE HOMES WILL BE DAMAGED OR DESTROYED.  
   &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;   DAMAGE TO ROOFS, WINDOWS, AND VEHICLES WILL OCCUR. TREE DAMAGE IS LIKELY.  

&nbsp;&nbsp;*THIS DANGEROUS STORM WILL BE NEAR...  
&nbsp;&nbsp;SYCAMORE AROUND 1240 AM CDT.   
&nbsp;&nbsp;DEKALB AROUND 600 AM CDT.  
&nbsp;&nbsp;COURTLAND AROUND 1140 AM.     

&nbsp;&nbsp;PRECAUTIONARY/PREPAREDNESS ACTIONS...   

&nbsp;&nbsp;TAKE COVER NOW! MOVE TO A BASEMENT OR AN INTERIOR ROOM  
&nbsp;&nbsp;ON THE LOWEST FLOOR OF A STURDY BUILDING.  
&nbsp;&nbsp;AVOID WINDOWS. IF YOU ARE OUTDOORS, IN A MOBILE HOME, OR IN A VEHICLE,   
&nbsp;&nbsp;MOVE TO THE CLOSEST SUBSTANTIAL SHELTER AND PROTECT YOURSELF FROM FLYING DEBRIS.    

Indeed, it appeared to be a bad storm, which could substantiate the large number of claims that she received for new roofs from hail and wind damage. Yet, she felt that something could be off.

While Casey could not process the data from multiple companies, she knew that the National Insurance Crime Bureau might be able to help by aggregating data from multiple insurance companies across the area hit by the storm and evaluating the data to look for anomalies. Casey's request landed on your desk. As a new fraud specialist, you have been hired to investigate claims following storm damage to hopefully reduce the payouts made to false claimants. You also knew you had to act fast. You began by pulling the claims data for roofs. You also received a database that showed the actual path of this storm. Your task is to sort through the claims to see if there were any unusual claim patterns from this recent weather event.

---
Case introduction and dataset come from: Cheng, C., & Lee, C.-C. (2023). A Case Study Using Data Analytics to Detect Hail Damage Insurance Claim Fraud. *Journal of Forensic Accounting Research, 8,* 287–306.

# **Instructions: Steps to Complete**

### 1. **Load the Dataset:**
   -  [This would be the ``Input`` tool in Alteryx, for example]
   - Load the roof insurance claim dataset (provided in `.xlsx` format) into a Pandas DataFrame named `df`

<br>
You have several options for doing step 1.

#### Option 1: Upload the File Manually

In [None]:
#Run this code to upload the file:
from google.colab import files
uploaded = files.upload()  # This will prompt you to upload the file

In [None]:
#Once uploaded, you can open the file using pandas:
import pandas as pd
df = pd.read_excel('final_insurance_fraud.xlsx')

#### Option 2: Mount Google Drive
Mount and access an Excel file stored in your Google Drive:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
file_path = '/content/drive/My Drive/final_insurance_fraud.xlsx'
df = pd.read_excel(file_path)

#### Option 3: Download from an Online URL
If the Excel file is hosted online, use `requests` to download it, save a local copy, and then read that copy with pandas:

```python
import pandas as pd
import requests

url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/main/Coding%20Assignment%201/final_insurance_fraud.xlsx'
response = requests.get(url)
response.raise_for_status()  # stop here with a clear error if the download failed
with open('final_insurance_fraud.xlsx', 'wb') as f:
    f.write(response.content)

df = pd.read_excel('final_insurance_fraud.xlsx', engine='openpyxl')
```

### 2. **Inspect the Dataset**
- Print the first 5 rows of the dataset to inspect its structure.
- Print all column names.
- Display the structure and data types for all variables.
- Display summary statistics for all numeric columns.

In [None]:
# View the first 5 rows of the dataset
print("First 5 rows:")
df.head()

In [None]:
print("Print all column names:\n")
print(df.columns)

In [None]:
print("Display the structure and data types for all variables.\n")
df.info()

In [None]:
print("Display summary statistics for all numeric columns:")
df.describe().T # note that I add the `.T` at the end to transpose the output (this is a personal preference)

### 3. **Use Operators**
- Use basic *arithmetic* operators to:
  - Add 10 to the `Wind Speed` for each claim and print the result.

- Use *comparison* operators to:
  - Show all rows of the dataset where `Home Square Feet` is greater than or equal to 3750.  


In [None]:
# Add 10 to Wind Speed and print the result
print("Wind Speed + 10 for the first 5 rows:")
print(df['Wind Speed'].head() + 10)

In [None]:
#Show all rows of the dataset where Home Square Feet is greater than or equal to 3750
df[df['Home Square Feet'] >= 3750]
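
To see why the filter works, it can help to look at the intermediate steps. The sketch below uses a small made-up DataFrame (`toy` and its values are hypothetical, not rows from the assignment data) with the same column names. It shows that arithmetic applies to every row at once, and that a comparison produces a column of booleans that can then be used to select rows.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    'Wind Speed': [40, 55, 62],
    'Home Square Feet': [3200, 3750, 4100],
})

# Arithmetic operators apply element by element to the whole column
wind_plus_10 = toy['Wind Speed'] + 10
print(wind_plus_10.tolist())

# A comparison returns a boolean Series (True/False for each row)
mask = toy['Home Square Feet'] >= 3750
print(mask.tolist())

# Passing the mask inside [] keeps only the rows where it is True
large_homes = toy[mask]
print(large_homes)
```

Note that the row with exactly 3750 square feet is kept because the operator is `>=` rather than `>`.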

### 4. **Create and Work with a List**
- Create and print out a `list` of all *unique* values from the `City` column. Call the list `city_names`

You have a number of options here. As long as you can print out the unique values then that is fine.

In [None]:
#First step: Create the list
city_names = df['City'].tolist()
print("List of first 5 cities:", city_names[:5]) #optional step

In [None]:
#Alternative to creating the list
city_names = list(df['City'])
print("List of first 5 cities:", city_names[:5]) #optional step

In [None]:
#Second step: convert the list into a set
city_names = set(city_names)
print(city_names) #This prints out all the unique values in the City column

In [None]:
#Additional final step - convert back to a list
city_names = list(city_names)
print(city_names)

#### Other options

In [None]:
# Convert the column to a Python set to extract unique values:
# Using set() with column
#unique_set = set(df['column_name'])
unique_set = set(df['City'])
print(unique_set)

In [None]:
# Using pd.unique()
# A direct function for getting unique values, similar to .unique():
#unique_values = pd.unique(df['column_name'])
unique_values = pd.unique(df['City'])
print(unique_values)
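
One thing to keep in mind when choosing among these options: a `set` has no guaranteed order, while `.unique()` keeps values in order of first appearance. Here is a minimal sketch on a made-up `City` column (the `toy` DataFrame is hypothetical) that ends with a plain Python list either way.

```python
import pandas as pd

# Hypothetical toy data with repeated city names
toy = pd.DataFrame({'City': ['Sycamore', 'DeKalb', 'Sycamore', 'Shabbona', 'DeKalb']})

# .unique() returns an array in order of first appearance; .tolist() makes it a list
city_names = toy['City'].unique().tolist()
print(city_names)

# sorted() turns a set into an alphabetized list
city_names_sorted = sorted(set(toy['City']))
print(city_names_sorted)
```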

### 5. **For Loop**
- Use a `for` loop to iterate through the `Rainfall` column and print:
  - `"The amount of rainfall was <rainfall> inches."`, where `<rainfall>` refers to the amount of rainfall.

In [None]:
#You can use any word or letter you'd like in place of "rainfall"
for rainfall in df['Rainfall']:
    print(f"The amount of rainfall was {rainfall} inches.")
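
Looping over the full column prints one line per claim, which gets long. If you only want to check that your loop works, one option is to loop over the first few values with `.head()`. The sketch below uses made-up rainfall values (the `toy` DataFrame and the `messages` list are hypothetical) and also stores each message in a list.

```python
import pandas as pd

# Hypothetical toy data; the rainfall values are made up
toy = pd.DataFrame({'Rainfall': [0.5, 1.25, 2.0, 0.0]})

messages = []  # collect the messages so they can be reused later
for rainfall in toy['Rainfall'].head(3):  # only the first 3 rows
    message = f"The amount of rainfall was {rainfall} inches."
    messages.append(message)
    print(message)
```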

### 6. **If-Elif-Else Statement**
- Write a simple conditional statement to check the `Age of roof` for the first 5 rows:
  - If less than 5, print `"New Roof"`.
  - If between 5 and 10, print `"Moderately New Roof"`.
  - Otherwise, print `"Old Roof"`.
- Print the result for the first 5 rows.
- *Hint*: the `if-elif-else` statement is nested under a `for` loop.

In [None]:
# Loop over the first 5 rows. Check and print the age of the roof for each row.
print("Roof age classifications:")
for age in df['Age of roof'].head(5):
    if age < 5:
        print("New Roof")
    elif 5 <= age <= 10:
        print("Moderately New Roof")
    else:
        print("Old Roof")
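
If you want to reuse the same logic elsewhere, one possible approach is to move the `if-elif-else` statement into a function. This is only a sketch: `classify_roof_age` is a hypothetical helper name, and the `toy` values are made up to hit each branch, including the boundaries at 5 and 10.

```python
import pandas as pd

def classify_roof_age(age):
    """Return a label for a roof age in years (hypothetical helper)."""
    if age < 5:
        return "New Roof"
    elif age <= 10:  # we already know age >= 5 at this point
        return "Moderately New Roof"
    else:
        return "Old Roof"

# Hypothetical toy data chosen to test each branch and both boundaries
toy = pd.DataFrame({'Age of roof': [2, 5, 10, 11, 30]})

labels = []
for age in toy['Age of roof'].head(5):
    labels.append(classify_roof_age(age))
print(labels)
```

Be aware that a missing value (`NaN`) fails both comparisons and would fall through to `"Old Roof"`, so it is worth checking for missing ages before relying on the labels.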

### 7. **Filter Invalid Records:**
   - Use PANDAS to remove rows where the `Policy Number` is missing (`null`).
   - How many records do you have after using the filter tool for roof claims?

In [None]:
#422 records are left after removing missing values on *Policy Number*
print(len(df))
df = df[df['Policy Number'].notnull()]
print(len(df))

This step filters out 2 records, leaving 422.

Duplicate claims for the same type of damage should also be investigated by insurance companies (see Step 9). Whether the duplicates arise from intentional or unintentional errors, an insurance company should pay a given claim only once.
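
For the `Policy Number` filter, an equivalent way to write it is `dropna()` with the `subset` argument. The sketch below uses a made-up DataFrame (`toy` is hypothetical) to count the missing values first and then confirm that both approaches keep the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing policy numbers
toy = pd.DataFrame({
    'Policy Number': ['P-001', np.nan, 'P-003', None],
    'City': ['Sycamore', 'DeKalb', 'Shabbona', 'DeKalb'],
})

# Count how many rows are missing a policy number before filtering
missing_count = toy['Policy Number'].isnull().sum()
print(missing_count)

# Approach used above: keep rows where the boolean mask is True
filtered_mask = toy[toy['Policy Number'].notnull()]

# Equivalent approach: drop rows with a missing value in the listed column(s)
filtered_dropna = toy.dropna(subset=['Policy Number'])

print(filtered_mask.equals(filtered_dropna))
```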

### 8. **Data Cleaning:**
   - Use PANDAS to replace `null` values with `0` in the `Estimated cost to repair` and `Estimated cost to replace` columns. For each claim, only one of these two columns will have data depending on the adjuster’s recommendation.

This next step is not necessary. I am adding this in so you can see that only 381 rows have a value on `Estimated cost to repair` and only 42 rows have a value on `Estimated cost to replace`.

In [None]:
df.describe().T

In [None]:
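#Replace null (missing) values with 0 for both columns
#Note: in recent pandas versions, calling fillna(..., inplace=True) on a selected column
#can raise a FutureWarning and may not update df; the alternative in the next cell avoids this
df['Estimated cost to repair'].fillna(0, inplace=True)
df['Estimated cost to replace'].fillna(0, inplace=True)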

In [None]:
# Alternative: Replace missing values with 0 for both columns (newer method)
df['Estimated cost to repair'] = df['Estimated cost to repair'].fillna(0)
df['Estimated cost to replace'] = df['Estimated cost to replace'].fillna(0)

Let's use `describe()` again. Now you can see that all rows have values for `Estimated cost to repair` and  `Estimated cost to replace`. Note also the different `mean` values.

In [None]:
df.describe().T
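
The change in the `mean` happens because pandas skips missing values when averaging but counts the zeros once they are filled in. A small worked sketch with made-up costs (the `toy` DataFrame and `cost_columns` list are hypothetical) makes the arithmetic visible and also shows one way to fill both columns in a single step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each claim has either a repair or a replace estimate
toy = pd.DataFrame({
    'Estimated cost to repair': [1000.0, np.nan, 3000.0, np.nan],
    'Estimated cost to replace': [np.nan, 8000.0, np.nan, 12000.0],
})
cost_columns = ['Estimated cost to repair', 'Estimated cost to replace']

# Before filling: NaN rows are skipped, so the mean is (1000 + 3000) / 2
mean_before = toy['Estimated cost to repair'].mean()

# Fill both columns at once and assign the result back
toy[cost_columns] = toy[cost_columns].fillna(0)

# After filling: zeros are counted, so the mean is (1000 + 0 + 3000 + 0) / 4
mean_after = toy['Estimated cost to repair'].mean()

print(mean_before, mean_after)
```

Keep this in mind when interpreting averages after the fill: the lower mean reflects the added zeros, not cheaper repairs.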

### 9. **Identify Duplicate Claims:**
   - Use PANDAS to identify whether there are any duplicate claims in your dataset based on ``House/Apartment Number``, ``Street Address``, ``City``, and ``Zip Code``.

In [None]:
duplicates = df[df.duplicated(subset=['House/Apartment Number', 'Street Address', 'City', 'Zip Code'], keep=False)]
print(f"Number of rows involved in duplicate claims: {len(duplicates)}") #Optional step; keep=False counts every copy
duplicates

Note that I used `duplicates` above, which will show the dataframe, rather than `print(duplicates)`. This is a personal preference but I like the above output much more than using the print command to show dataframes.
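
The `keep` argument controls which copies get flagged. With `keep=False`, every row in a duplicated group is marked; with the default `keep='first'`, only the later copies are. The sketch below uses made-up addresses (the `toy` DataFrame and `address_columns` list are hypothetical) to show the difference.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 share the same full address;
# row 3 has the same house number and street but a different city and zip
toy = pd.DataFrame({
    'House/Apartment Number': [12, 12, 40, 12],
    'Street Address': ['Oak St', 'Oak St', 'Elm St', 'Oak St'],
    'City': ['Sycamore', 'Sycamore', 'DeKalb', 'DeKalb'],
    'Zip Code': [60178, 60178, 60115, 60115],
})
address_columns = ['House/Apartment Number', 'Street Address', 'City', 'Zip Code']

# keep=False flags every row that has a match on all four columns
all_copies = toy[toy.duplicated(subset=address_columns, keep=False)]

# The default (keep='first') flags only the second and later copies
extra_copies = toy[toy.duplicated(subset=address_columns)]

print(len(all_copies), len(extra_copies))
```

Using `keep=False` is handy for investigation because it lets you view the matching claims side by side.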

### 10. **Export Data**
   - Export the cleaned dataset to a new `CSV` file.

I will show you two different options. Note that in the Week 2 notebooks we used the `to_pickle()` command. In this case you have to use `to_csv()` to save the file in CSV format. There are other options, but that is the easiest.

#### Option 1: Save to Your Google Drive

In [None]:
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Path to save the file in Google Drive
file_path = '/content/drive/My Drive/cleaned_roof_insurance_claims.csv'

# Save the DataFrame as a CSV file
df.to_csv(file_path, index=False)

print(f"File saved to {file_path}") #Optional step

#### Option 2: Save and Download the File to Your Computer
1. Save the DataFrame as a CSV File.  
Use pandas to save the DataFrame to a CSV file in the Colab environment:

In [None]:
df.to_csv('cleaned_roof_insurance_claims.csv', index=False)

2. Download the File to Your Computer.   
After saving the file, use the following code to download it:

In [None]:
from google.colab import files

# Download the file
files.download('cleaned_roof_insurance_claims.csv')
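
Whichever option you use, a quick way to confirm the export worked is to read the CSV back in and compare it with the original. This is a sketch on made-up data: the `toy` DataFrame and the `toy_claims.csv` file name are hypothetical, and a temporary folder is used so nothing is left behind. In Colab you could simply read back the file name you saved.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for the cleaned dataset
toy = pd.DataFrame({
    'Policy Number': ['P-001', 'P-003'],
    'Estimated cost to repair': [1000.0, 0.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'toy_claims.csv')
    toy.to_csv(csv_path, index=False)  # index=False avoids writing an extra index column
    reloaded = pd.read_csv(csv_path)

# The reloaded data should have the same shape and column names
print(reloaded.shape == toy.shape)
print(list(reloaded.columns) == list(toy.columns))
```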

---

## **Deliverables**
1. Submit the link to your Google Colab notebook in the assignment area in Canvas.
2. Include comments in your code to explain each step.

## **Tips**
- Use small chunks of code and inspect your dataset frequently.
- Handle missing and invalid data systematically to maintain data integrity.

Good luck! 🚀

## Submission
- Submit your completed Colab notebook with all code cells executed.
- Ensure your notebook includes helpful explanations (as Markdown cells) for each step.