[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%206%20Notebooks/GDAN%205400%20-%20Week%206%20Notebooks%20%28IV%29%20-%20Task%206%20-%20Box%20Plots.ipynb)

This notebook provides a mini-tutorial on understanding and using box plots in Python.

In [None]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

# Load Packages
Import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include these lines at the top of all my notebooks.

In [None]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_info_columns', 500)
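
If you would rather not change these settings for the whole notebook, pandas also offers `pd.option_context`, which applies an option only inside a `with` block. The following is a minimal sketch; the `wide` DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Made-up one-row frame with many columns, just to have something wide to show
wide = pd.DataFrame([range(30)], columns=[f"col_{i}" for i in range(30)])

# Show every column only inside this block
with pd.option_context("display.max_columns", None):
    inside = pd.get_option("display.max_columns")
    print(wide)

# Outside the block the previous setting is back in effect
after = pd.get_option("display.max_columns")
print("inside:", inside, "| after:", after)
```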

# Read in Data

In [None]:
import requests

# NOTE: for a direct download, replace `https://github.com/` with `https://raw.githubusercontent.com/` and drop `/blob` from the path
# https://github.com/gdsaxton/GDAN5400/blob/main/Coding%20Assignment%201/final_insurance_fraud.xlsx
url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/main/Coding%20Assignment%203/final_insurance_fraud.xlsx'

# Download the file (stop with an error if the request failed)
response = requests.get(url)
response.raise_for_status()
with open('final_insurance_fraud.xlsx', 'wb') as f:
    f.write(response.content)

# Load the Excel file
df = pd.read_excel('final_insurance_fraud.xlsx', engine='openpyxl')

print('# of rows:', len(df), '\n')

df[:2]

# **Mini-Tutorial: Understanding Boxplots in Data Analytics**

## **What Is a Boxplot?**
A **boxplot** (also called a **box-and-whisker plot**) is a statistical visualization that summarizes the distribution of a dataset. It provides key insights into **central tendency, variability, and outliers** by displaying five main summary statistics:

1. **Minimum** – The smallest non-outlier value in the dataset.
2. **First Quartile (Q1)** – The 25th percentile (lower quartile), below which 25% of the data falls.
3. **Median (Q2)** – The 50th percentile (middle value), which divides the dataset into two equal halves.
4. **Third Quartile (Q3)** – The 75th percentile (upper quartile), below which 75% of the data falls.
5. **Maximum** – The largest non-outlier value in the dataset.

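In addition to these, boxplots **highlight outliers**: data points that fall beyond the whiskers. By convention (and by default in seaborn and matplotlib), the whiskers reach at most 1.5 × IQR below Q1 and above Q3, where IQR = Q3 − Q1.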
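
These statistics can also be computed by hand. The sketch below assumes the common 1.5 × IQR whisker convention and uses a short list of made-up repair costs rather than the claims data; `five_number_summary` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def five_number_summary(values, whis=1.5):
    """Return boxplot statistics using the whis * IQR whisker convention."""
    data = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    # Whiskers end at the most extreme values that are still inside the fences
    inside = data[(data >= lower_fence) & (data <= upper_fence)]
    outliers = data[(data < lower_fence) | (data > upper_fence)]
    return {
        "whisker_low": float(inside.min()),
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "whisker_high": float(inside.max()),
        "outliers": outliers.tolist(),
    }

# Made-up repair costs; the last value is deliberately extreme
costs = [1200, 1500, 1800, 2000, 2200, 2500, 2700, 3000, 15000]
summary = five_number_summary(costs)
print(summary)
```

Here the "minimum" and "maximum" of the boxplot are the whisker ends (1200 and 3000), while 15000 is reported separately as an outlier.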

---

## **Why Are Boxplots Important?**
Boxplots are valuable for data analysts because they provide a **quick and effective way to detect key statistical properties** in a dataset. Here’s why they are useful:

### **1. Identifying Outliers**
- **Outliers** are extreme values that fall significantly above or below the majority of the data.
- In a boxplot, outliers are typically represented as **individual points** outside the "whiskers" of the plot.
- Detecting outliers is essential for identifying **data entry errors, fraud, or rare but significant events**.

**Example**: In an **insurance claims dataset**, extremely high repair costs might indicate **fraudulent claims** or **highly severe damage**.

---

### **2. Comparing Distributions Across Categories**
- Boxplots allow easy comparison of **distributions across different categories**.
- By plotting multiple boxplots side by side, analysts can **visually compare variations** in different groups.

**Example**: Comparing **repair costs** across different **roofing companies** can help insurers determine which companies **tend to charge higher or lower than the average**.
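
A numeric companion to side-by-side boxplots is a per-group summary. The sketch below uses a small made-up DataFrame (`toy`) whose column names mirror the ones used later in this notebook; the values are invented for illustration only.

```python
import pandas as pd

# Made-up data: two companies, four estimates each
toy = pd.DataFrame({
    "Roofing Company": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Estimated cost to repair": [1000, 1200, 1400, 1600, 2000, 3000, 4000, 9000],
})

# describe() returns count, mean, std, min, quartiles, and max for each group
by_company = toy.groupby("Roofing Company")["Estimated cost to repair"].describe()

# The IQR is the height of each company's box
by_company["iqr"] = by_company["75%"] - by_company["25%"]
print(by_company[["25%", "50%", "75%", "iqr"]])
```

In this toy data, company B has both a higher median and a much wider box than company A, which is what the side-by-side boxplot would show visually.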

---

### **3. Understanding Data Spread and Variability**
- The **length of the box** (Interquartile Range, or **IQR**) shows the spread of the **middle 50% of the data**.
- If the box is **small**, most values are **close to the median** (low variability).
- If the box is **large**, there is **high variability**, meaning data points are spread out.

**Example**: If **wind speeds in insurance claims** show a wide spread, it may indicate that claims occur in a variety of weather conditions, rather than just during major storms.

---

### **4. Detecting Skewness and Symmetry**
- If the **median** is in the **center** of the box and the whiskers are approximately **equal in length**, the distribution is **symmetric**.
- If the **median** is closer to one side of the box, or if one whisker is much longer, the data is **skewed**.

**Example**: If **hail diameter in insurance claims** is **right-skewed**, it suggests that most hail events are small, but a few extreme cases involve **large, damaging hail**.
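
The position of the median inside the box can be turned into a number. One option is Bowley's quartile skewness, (Q3 + Q1 − 2 × median) / (Q3 − Q1), which is 0 when the median sits in the middle of the box and positive when the upper half of the box is longer. The sketch below uses made-up values; `quartile_skewness` is a hypothetical helper name.

```python
import numpy as np

def quartile_skewness(values):
    """Bowley's quartile skewness: (Q3 + Q1 - 2*median) / (Q3 - Q1)."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    if iqr == 0:
        return 0.0  # degenerate box: no measurable skew
    return float((q3 + q1 - 2 * median) / iqr)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 8, 20]  # a few large values stretch the upper side

skew_symmetric = quartile_skewness(symmetric)
skew_right = quartile_skewness(right_skewed)
print("symmetric:", skew_symmetric)
print("right-skewed:", skew_right)
```

Note that this measure looks only at the box, so it ignores whisker length and outliers; it complements the visual check rather than replacing it.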

---

### **5. Quickly Summarizing Large Datasets**
- Boxplots are **compact and information-dense**, making them ideal for summarizing **large datasets** in a single visual.
- Unlike histograms, they do not require **choosing bin sizes**, making them a **clean and efficient visualization**.

**Example**: If an insurer wants to **compare claim amounts across multiple cities**, a **single boxplot per city** can provide an at-a-glance comparison.

---

## **Key Takeaways**
- **Boxplots summarize distributions** using quartiles, medians, and whiskers.
- They **identify outliers**, making them useful for detecting fraud or data errors.
- Boxplots **compare distributions** across multiple groups easily.
- They **show variability and skewness**, helping analysts understand how data is spread.
- **In finance and insurance**, boxplots are especially useful for analyzing claims, fraud detection, and market trends.

By mastering **boxplots**, data professionals can **quickly assess datasets**, detect anomalies, and make informed decisions based on visual trends.


# **Code Examples: Boxplots for Insurance Claims Data**

Now that we've covered **why boxplots are useful**, let's apply them to our dataset (`df`). These visualizations will help us **detect outliers, compare distributions, and assess variability** in key insurance-related variables.

---

### Example 1: Boxplot of Estimated Repair Costs
**Purpose**: This boxplot helps **understand the distribution of repair costs** and detect **outliers that might indicate overbilling or extreme damages**.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8,5))
sns.boxplot(x=df['Estimated cost to repair'], color='skyblue')
plt.xlabel("Estimated Repair Cost ($)")
plt.title("Boxplot of Estimated Repair Costs")
plt.show()

#### Key Insights
- Outliers appear as individual points beyond the whiskers.
- The median line shows the typical claim amount.
- The spread of the box indicates how variable repair costs are.
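
To read these values off as numbers rather than from the picture, matplotlib ships a helper, `matplotlib.cbook.boxplot_stats`, that computes boxplot statistics with the same 1.5 × IQR default. The sketch below runs it on made-up values standing in for `df['Estimated cost to repair']`; with the real data you would pass that column instead (after dropping missing values).

```python
import numpy as np
from matplotlib import cbook

# Made-up values standing in for the repair-cost column
costs = np.array([1200, 1500, 1800, 2000, 2200, 2500, 2700, 3000, 15000])

# boxplot_stats returns one dict per dataset; we have a single dataset
stats = cbook.boxplot_stats(costs, whis=1.5)[0]

print("median:  ", stats["med"])
print("box:     ", stats["q1"], "to", stats["q3"])
print("whiskers:", stats["whislo"], "to", stats["whishi"])
print("outliers:", stats["fliers"])
```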

#### Flipping the Axis in a Boxplot
##### **Purpose**
This example demonstrates how to **flip the axis of a boxplot** by passing the numerical variable as `y` instead of `x`. Depending on the layout and on any category labels involved, the vertical orientation can be easier to read.

In [None]:
plt.figure(figsize=(6,8))
sns.boxplot(y=df['Estimated cost to repair'], color='skyblue')
plt.ylabel("Estimated Repair Cost ($)")
plt.title("Boxplot of Estimated Repair Costs (Flipped Axis)")
plt.show()

#### How and Why Would You Flip the Axis?
- In Example 1 we passed the numerical variable as `x`, so seaborn drew the boxplot horizontally.
- By flipping the axis (plotting the numerical variable on the y-axis), we may:
    - Improve readability if we have longer category names in a categorical variable.
    - Enhance visualization for skewed distributions by making the spread easier to compare.
    - Ensure better spacing when multiple boxplots are compared together.

**Key Insights**
- The distribution of repair costs is now plotted vertically, which can help make trends more apparent.
- The spread and outliers remain the same, but depending on the dataset and layout of other visuals, this orientation may be preferable.

### Example 2: Boxplot Comparing Repair Costs by Roofing Company
**Purpose** – This visualization helps identify which roofing companies charge higher or lower than average for repairs and whether some companies have more variability in pricing.

In [None]:
plt.figure(figsize=(10,6))
sns.boxplot(x='Roofing Company', hue='Roofing Company', y='Estimated cost to repair', data=df, palette='Blues_r')
plt.xlabel("Roofing Company")
plt.ylabel("Estimated Repair Cost ($)")
plt.title("Distribution of Repair Costs by Roofing Company")
plt.xticks(rotation=45)
plt.show()

#### Key Insights
- Are certain roofing companies charging significantly more than others?
- Are there outliers, possibly indicating overcharging or fraud?
- How spread out are the costs for each company?
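
The outlier question can also be answered with a count per company. The sketch below applies the 1.5 × IQR rule within each group on a made-up DataFrame; `count_outliers` is a hypothetical helper, and the values are invented for illustration.

```python
import pandas as pd

def count_outliers(series, whis=1.5):
    """Count values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - whis * iqr) | (series > q3 + whis * iqr)
    return int(outside.sum())

# Made-up data; column names mirror the ones used in this notebook
toy = pd.DataFrame({
    "Roofing Company": ["A"] * 6 + ["B"] * 6,
    "Estimated cost to repair": [
        1000, 1100, 1200, 1300, 1400, 9000,  # A: one extreme estimate
        2000, 2100, 2200, 2300, 2400, 2500,  # B: tightly clustered
    ],
})

# Apply the rule separately to each company's estimates
outlier_counts = (
    toy.groupby("Roofing Company")["Estimated cost to repair"]
       .apply(count_outliers)
)
print(outlier_counts)
```

A count like this only flags candidates for review; an unusually high estimate may reflect severe damage rather than overcharging.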

### Example 3: Boxplot of Rainfall in Claims
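**Purpose** – This boxplot shows the distribution of rainfall recorded across insurance claims. Because it plots a single variable, it shows where claims concentrate along the rainfall scale, but it cannot by itself show whether higher rainfall leads to a spike in claims.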

In [None]:
plt.figure(figsize=(8,5))
sns.boxplot(x=df['Rainfall'], color='purple')
plt.xlabel("Rainfall (inches)")
plt.title("Boxplot of Rainfall Levels in Claims")
plt.show()

#### Key Insights
- Are claims concentrated at higher rainfall levels, or are they spread out across all levels?
- Are there extreme rainfall values that could indicate major storm events?
- Is the distribution skewed, or is damage occurring at consistent rainfall levels?

### Example 4: Boxplot of Roof Age in Claims
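**Purpose** – This boxplot shows the distribution of roof ages among claims, which helps flag claims on unusually new roofs. Whether older roofs lead to higher claim amounts is a separate question that requires comparing costs across age groups.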

In [None]:
plt.figure(figsize=(8,5))
sns.boxplot(x=df['Age of roof'], color='orange')
plt.xlabel("Roof Age (Years)")
plt.title("Boxplot of Roof Age in Claims")
plt.show()

#### Key Insights
- Are claims concentrated among older roofs?
- Are there outliers, such as very new roofs with claims that could indicate fraud?
- How spread out is the age distribution of insured roofs?
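
A boxplot of roof age alone cannot say whether older roofs come with higher claim amounts. One way to explore that, sketched below on made-up data, is to bin roof age with `pd.cut` and compare repair costs per band. The cut points and the `Roof age band` column are assumptions chosen for illustration, not part of the dataset.

```python
import pandas as pd

# Made-up data; column names mirror the ones used in this notebook
toy = pd.DataFrame({
    "Age of roof": [1, 3, 4, 8, 9, 12, 15, 18, 22, 25],
    "Estimated cost to repair": [900, 1100, 1000, 1500, 1700,
                                 2100, 2400, 2600, 3500, 4000],
})

# Hypothetical age bands; intervals are right-inclusive, e.g. (0, 5]
toy["Roof age band"] = pd.cut(
    toy["Age of roof"],
    bins=[0, 5, 10, 20, 30],
    labels=["0-5", "6-10", "11-20", "21-30"],
)

# Median repair cost within each age band
median_by_band = (
    toy.groupby("Roof age band", observed=True)["Estimated cost to repair"]
       .median()
)
print(median_by_band)
```

The same band column could then be passed as `x` to `sns.boxplot`, with the repair cost as `y`, to draw one box per age band.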

## **Conclusion**
Boxplots are an essential visualization tool in data analytics, especially when working with financial and insurance datasets. They allow us to:
- **Identify outliers** that may indicate fraudulent claims or unusual repair costs.
- **Compare distributions** across different categories, such as roofing companies or adjusters.
- **Understand variability** in key metrics like wind speed, rainfall, and estimated costs.

By learning how to adjust the axis orientation and properly format boxplots, analysts can **enhance interpretability** and **improve decision-making** when analyzing large datasets. Mastering these techniques ensures that insights are **clear, actionable, and effectively communicated** to stakeholders.