[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%206%20Notebooks/GDAN%205400%20-%20Week%206%20Notebooks%20%28III%29%20-%20Histograms.ipynb)

This notebook provides a mini-tutorial on understanding and using histograms in Python.

In [None]:
%%time
import datetime
print("Current date and time:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

# Load Packages and Set Working Directory
Import several necessary Python packages. We will use the <a href="http://pandas.pydata.org/">Python Data Analysis Library</a>, or <i>pandas</i>, extensively for data manipulation in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
pandas lets you set various options for, among other things, inspecting the data. I like to be able to see all of the columns, so I typically include these lines at the top of all my notebooks.

In [None]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_info_columns', 500)

# Read in Data

In [None]:
import requests

# NOTE: for a direct download, replace `https://github.com/` with `https://raw.githubusercontent.com/` and drop `/blob` from the path
# https://github.com/gdsaxton/GDAN5400/blob/main/Coding%20Assignment%201/final_insurance_fraud.xlsx
url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/main/Coding%20Assignment%203/final_insurance_fraud.xlsx'

# Download the file (stop with an HTTP error if the request failed)
response = requests.get(url)
response.raise_for_status()
with open('final_insurance_fraud.xlsx', 'wb') as f:
    f.write(response.content)

# Load the Excel file
df = pd.read_excel('final_insurance_fraud.xlsx', engine='openpyxl')

print('# of rows:', len(df), '\n')

df[:2]
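
The plotting cells later in this notebook refer to several columns by name. As a hedged sketch, a small helper can confirm those columns are present before plotting. The helper name `missing_columns` and the stand-in frame `demo` are hypothetical; in the notebook you would pass `df` instead.

```python
import pandas as pd

# Column names taken from the plotting cells later in this notebook.
REQUIRED_COLUMNS = [
    "Estimated cost to replace",
    "Wind Speed",
    "Age of roof",
    "Rainfall",
    "Hail Diameter",
]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame`."""
    return [name for name in required if name not in frame.columns]

# Tiny stand-in frame for demonstration only; it is not the insurance data.
demo = pd.DataFrame({"Wind Speed": [55, 70], "Rainfall": [1.2, 0.4]})
print(missing_columns(demo))
```

An empty list means every column the later cells expect is available.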

# Histograms

In this lesson we will cover how to visualize distributions using **histograms and trend curves** to identify patterns.

---

# **Mini-Tutorial: Understanding Histograms in Data Analytics**

## **What Is a Histogram?**
A **histogram** is a type of bar chart that represents the **distribution of a numerical variable**. Unlike a traditional bar chart, which compares different categories, a histogram groups data into **bins (intervals)** and shows how many observations fall into each bin.

Histograms provide an **at-a-glance** view of how data is distributed, making them one of the most commonly used visualization tools in data analytics.
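
To make the idea of bins concrete, here is a minimal sketch that counts values per bin with `numpy.histogram`. The claim amounts are invented for illustration and are not taken from the insurance dataset.

```python
import numpy as np

# Hypothetical claim amounts in dollars, invented for illustration only.
claims = np.array([4200, 5100, 5600, 6300, 7000, 7400, 8800, 9100, 12500, 14900])

# Split the range 4,000 to 16,000 into three equal-width bins and count values per bin.
counts, edges = np.histogram(claims, bins=3, range=(4000, 16000))

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"${left:,.0f} to ${right:,.0f}: {count} claims")
```

Each printed row corresponds to one bar of the histogram: the bin is the bar's position and width, and the count is its height.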

---

## **Why Are Histograms Important?**
Histograms help data analysts **understand patterns, trends, and anomalies** in datasets. Some key reasons histograms are valuable include:

### **1. Revealing the Shape of Data Distributions**
Histograms allow analysts to quickly determine whether data follows a **normal distribution, skewed distribution, or other patterns**. 

- A **normal (bell-shaped) distribution** suggests that values cluster symmetrically around the center, with fewer values toward either extreme.
- A **right-skewed (positively skewed) distribution** means most values are on the lower end, with some extreme high values.
- A **left-skewed (negatively skewed) distribution** means most values are high, with a few low outliers.

Understanding these patterns helps analysts decide which statistical methods are appropriate.
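
Shape can also be checked numerically. The sketch below, which uses synthetic samples rather than the insurance data, compares the skewness statistic from `Series.skew()` for a symmetric sample and a right-skewed one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

# Synthetic samples, invented for illustration (not drawn from the insurance data).
symmetric = pd.Series(rng.normal(loc=10_000, scale=2_000, size=5_000))
right_skewed = pd.Series(rng.lognormal(mean=9.0, sigma=0.8, size=5_000))

# Series.skew() is near 0 for symmetric data and positive for a long right tail.
print("symmetric sample skewness:   ", round(symmetric.skew(), 2))
print("right-skewed sample skewness:", round(right_skewed.skew(), 2))

# In a right-skewed sample the mean is pulled above the median by the large values.
print("mean > median:", right_skewed.mean() > right_skewed.median())
```

A value near zero is consistent with a symmetric shape, while a clearly positive or negative value points to right or left skew.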

### **2. Identifying Outliers**
Histograms help **spot outliers**, or extreme values that don’t fit the overall pattern of the dataset. 

For example, in an **insurance claims dataset**, if most claims are between **\$5,000 and \$15,000** but a few exceed **\$100,000**, those extreme values will show up as separate bars on the far right of the histogram. These could indicate **fraudulent claims or special cases** that require further investigation.
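
A histogram shows outliers visually; a simple numeric rule can list them. The sketch below applies the common 1.5 × IQR fence (Tukey's rule of thumb) to invented claim amounts. This is one possible rule chosen for illustration, not a method prescribed by this tutorial.

```python
import pandas as pd

# Hypothetical claim amounts: most fall in a typical range, two are far larger.
claims = pd.Series(
    [5200, 6100, 7400, 8000, 8900, 9500, 10200, 11800, 13500, 14900, 120000, 185000]
)

q1, q3 = claims.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # the 1.5 multiplier is a convention, not a law

outliers = claims[claims > upper_fence]
print(f"Upper fence: {upper_fence:,.0f}")
print("Flagged amounts:", outliers.tolist())
```

Values above the fence are candidates for review, not proof of fraud.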

### **3. Assessing Data Spread and Variability**
The **width of the histogram** (the range of bins) and the **height of the bars** (frequency of observations) provide insights into how **spread out or concentrated** data is.

- If most values are tightly clustered within a few bins, the dataset has **low variability**.
- If the histogram is widely spread out, the dataset has **high variability**, meaning values fluctuate more.

For accountants and financial analysts, understanding variability is crucial when analyzing revenue fluctuations, stock prices, or risk exposure.
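
As a rough illustration of spread, the sketch below builds two synthetic samples with the same center but different variability, bins them with identical edges, and reports the standard deviation alongside the number of non-empty bins.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Two synthetic samples with the same center but different spread (illustrative only).
tight = rng.normal(loc=10_000, scale=500, size=2_000)
wide = rng.normal(loc=10_000, scale=4_000, size=2_000)

# Use the same bin edges for both so the bar patterns are directly comparable.
edges = np.linspace(-10_000, 30_000, 21)  # 20 bins, each 2,000 wide
tight_counts, _ = np.histogram(tight, bins=edges)
wide_counts, _ = np.histogram(wide, bins=edges)

print("std (tight):", round(tight.std(ddof=1)), "| non-empty bins:", int((tight_counts > 0).sum()))
print("std (wide): ", round(wide.std(ddof=1)), "| non-empty bins:", int((wide_counts > 0).sum()))
```

The low-variability sample occupies only a few bins, while the high-variability sample spreads across many.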

### **4. Comparing Different Groups**
Histograms can help compare **different categories** within a dataset.

For example, in an **insurance dataset**, analysts might create separate histograms for:
- Claims from different **regions**.
- Claims from different **roofing materials**.
- Claims based on **wind speed categories**.

Comparing these distributions helps **identify trends and discrepancies** across groups.
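
When comparing groups, it helps to bin every group with the same edges. The sketch below does this for two synthetic regions; the column names `region` and `claim_amount` are hypothetical stand-ins, not columns confirmed to exist in `df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=2)

# Synthetic stand-in data; "region" and "claim_amount" are hypothetical column names.
claims = pd.DataFrame({
    "region": np.repeat(["North", "South"], 500),
    "claim_amount": np.concatenate([
        rng.normal(8_000, 1_500, 500),   # North: lower typical claim
        rng.normal(12_000, 1_500, 500),  # South: higher typical claim
    ]),
})

# Shared bin edges keep the two histograms on the same scale.
edges = np.arange(0, 22_000, 2_000)  # 0, 2000, ..., 20000
counts_by_region = {
    region: np.histogram(group["claim_amount"], bins=edges)[0]
    for region, group in claims.groupby("region")
}

# Rows are bins (labeled by their left edge), columns are regions, cells are counts.
table = pd.DataFrame(counts_by_region, index=pd.Index(edges[:-1], name="bin_start"))
print(table)
```

For a visual version, seaborn's `histplot` accepts a `hue` argument that overlays one histogram per group on shared axes.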

### **5. Spotting Data Entry Errors or Data Quality Issues**
Histograms often reveal **unexpected spikes or gaps** in data, which may indicate errors.

For instance, if an **insurance claims dataset** shows an unusually high number of claims with exactly **\$10,000**, it could suggest:
- A rounding issue in the data.
- A reporting bias where adjusters tend to estimate at round numbers.
- A systemic problem in claim processing.

Identifying such anomalies ensures **data integrity and accuracy** in analysis.
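
A spike at one exact value can hide inside a wide bin, so it is worth counting exact values as well. This is a minimal sketch on invented amounts using `Series.value_counts()`.

```python
import pandas as pd

# Hypothetical claim amounts with a suspicious pile-up at exactly 10,000.
claims = pd.Series([9850, 10000, 10000, 10120, 10000, 7300, 10000, 12450, 10000, 8875])

# Count how often each exact value occurs, most frequent first.
value_counts = claims.value_counts()
top_value = int(value_counts.index[0])
top_count = int(value_counts.iloc[0])
share = top_count / len(claims)

print(f"Most frequent amount: {top_value:,} ({top_count} of {len(claims)} claims, {share:.0%})")
```

A single amount that accounts for a large share of claims is a prompt to check how those values were recorded.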

---

## **Key Takeaways**
- **Histograms summarize numerical data by grouping values into bins and showing their frequency.**
- They help reveal **data distribution, outliers, variability, and potential errors**.
- Analysts use histograms to **compare groups, detect fraud, and assess risk**.
- A well-interpreted histogram **provides actionable insights** before applying more advanced statistical techniques.

Mastering histograms is an essential skill in **data analytics, finance, and accounting**, enabling professionals to quickly assess and interpret complex datasets.



# **Examples of Histograms Using the Insurance Claims Dataset**

Now that we understand what histograms are and why they are useful, let's explore some examples using the **insurance claims dataset (`df`)**. These examples highlight how histograms can help us **analyze distributions, detect anomalies, and gain business insights**.

---

## **Example 1: Distribution of Estimated Repair Costs**
Understanding how **repair costs** are distributed can help insurance companies assess typical claim amounts, identify outliers, and detect potential fraud.

### **Why This Matters**
- A **normal distribution** of repair costs suggests a predictable range of claims.
- **Skewed distributions** might indicate that some claims are significantly higher or lower than others.
- If there are **multiple peaks**, it may suggest different categories of claims, such as minor vs. major damage.

### **Insights to Look For**
- Are repair costs **normally distributed**, or do we see skewness?
- Are there **extremely high or low claims** that might need further investigation?
- Do most claims fall within a **typical range**, or is there high variability?

---

## **Example 2: Distribution of Wind Speed in Insurance Claims**
Wind speed is a critical factor in storm-related insurance claims. Examining its distribution can help analysts understand **how severe weather events impact claims**.

### **Why This Matters**
- Wind speeds **above certain thresholds** may indicate more severe storms.
- If most claims occur **within a specific wind speed range**, insurers can adjust their risk models accordingly.
- Extremely high wind speeds may be **outliers**, requiring verification of the claim data.

### **Insights to Look For**
- What is the **most common wind speed** for claims?
- Are there **very high wind speeds** that might suggest extreme weather events?
- Do we see a **pattern that aligns with past storm events**?

---

## **Example 3: Distribution of Roof Ages in Claims**
The age of a roof is a key factor in insurance claims, as older roofs are more susceptible to damage. A histogram of **roof ages** can provide valuable insights into risk assessment.

### **Why This Matters**
- **Older roofs** may have more claims, which affects **policy pricing**.
- A **spike in claims for new roofs** could indicate **fraudulent activity**, such as homeowners filing claims shortly after installing a new roof.
- Understanding **the relationship between roof age and claim frequency** helps insurers develop **better risk models**.

### **Insights to Look For**
- Are there **more claims for older roofs**, or is it spread evenly?
- Are there **unusual patterns**, such as a large number of claims for very new roofs?
- Does the **distribution match expectations** based on past data?

---

## **Example 4: Rainfall Levels in Claim Events**
Examining the **distribution of rainfall levels** in claims can help insurers understand how precipitation contributes to property damage.

### **Why This Matters**
- Higher rainfall levels may correlate with **roof leaks, flooding, and structural damage**.
- Identifying **rainfall thresholds** where claims increase can help **refine risk assessment models**.
- If most claims occur **at moderate rainfall levels**, insurers can adjust policies to reflect true risk levels.

### **Insights to Look For**
- What **rainfall range** is most common in claims?
- Are **high rainfall events** leading to significantly more claims?
- Are there **unexpected patterns**, such as claims occurring even at low rainfall levels?

---

## **Example 5: Hail Diameter in Claims**
Hail size is a critical factor in insurance claims, particularly for **roof and vehicle damage**. Understanding its distribution can help insurers assess **damage severity and claim validity**.

### **Why This Matters**
- **Larger hailstones** cause more damage, leading to higher payouts.
- If many claims report **large hail sizes**, but meteorological records suggest otherwise, there may be **reporting inaccuracies**.
- Understanding **typical hail diameters** helps insurers predict **future claim trends**.

### **Insights to Look For**
- Are **most claims** associated with **small, moderate, or large hail**?
- Are there **unusually high hail diameter reports** that might require further verification?
- Does the **distribution match historical weather data** for the region?

---

## **Conclusion**
Histograms are an invaluable tool in **insurance analytics**. By using them to explore **claim amounts, weather conditions, and structural factors**, insurers can detect patterns, assess risks, and **identify anomalies** that could indicate fraud or data quality issues. 

When interpreting histograms, always look for **skewness, peaks, gaps, and outliers**, as these can provide deeper insights into the underlying data.



# **Code Examples: Histograms for Insurance Claims Data**

Now that we've discussed **why histograms are useful**, let's generate some using our dataset (`df`). These visualizations will help us **identify trends, detect anomalies, and gain insights into claim characteristics**.

---

### Example 1: Distribution of Estimated Replacement Costs

**Purpose**: Show how **claim costs** (estimated replacement costs) are distributed. The cell below keeps only claims with a positive estimated cost.

**Histogram**: Counts how many claims fall into each cost range.

**KDE (Kernel Density Estimate) line**: Adds a smooth trend curve to highlight patterns.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8,5))
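# Keep only claims with a positive estimated replacement cost
replacement_costs = df.loc[df['Estimated cost to replace'] > 0, 'Estimated cost to replace']
sns.histplot(replacement_costs, bins=20, kde=True, color='blue')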
plt.xlabel("Estimated Replacement Cost ($)")
plt.ylabel("Frequency")
plt.title("Distribution of Replacement Costs")
plt.show()
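
The `kde=True` argument in the cell above overlays a kernel density estimate. To show the idea behind that curve, here is a hand-rolled sketch that averages one Gaussian bump per data point. The function name `gaussian_kde_curve` is hypothetical, the data are synthetic, and the bandwidth uses Scott's rule of thumb as an assumption; seaborn's internals may differ in details.

```python
import numpy as np

def gaussian_kde_curve(samples, grid, bandwidth):
    """Average a Gaussian bump centered on every sample, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = standardized distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(seed=3)
costs = rng.lognormal(mean=9.0, sigma=0.5, size=1_000)  # synthetic, illustrative only

# Scott's rule of thumb for the bandwidth; an assumption, not a tuned value.
bandwidth = costs.std(ddof=1) * len(costs) ** (-1 / 5)

grid = np.linspace(costs.min() - 3 * bandwidth, costs.max() + 3 * bandwidth, 400)
density = gaussian_kde_curve(costs, grid, bandwidth)

# A density curve should integrate to about 1 over the plotted range (trapezoid rule).
area = float(np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid)))
print(f"Area under the KDE curve: {area:.3f}")
```

A smaller bandwidth gives a bumpier curve that follows individual bars more closely; a larger one gives a smoother curve that can blur real features.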

#### Key Insights
- Look for skewness: is the distribution roughly symmetric, or does it have a long tail of high or low values?
- Identify outliers: are there extremely high replacement costs that could indicate fraudulent claims?
- Understand variability: are most claims within a predictable range, or is there a wide spread?
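
The `bins=20` setting in the cell above is a choice, and the picture can change with it. The sketch below, using synthetic data, compares a few fixed bin counts with NumPy's data-driven `"auto"` rule via `numpy.histogram_bin_edges`.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
costs = rng.lognormal(mean=9.0, sigma=0.6, size=1_000)  # synthetic, illustrative only

# Compare a few fixed bin counts with NumPy's data-driven "auto" rule.
for bins in (5, 20, 80, "auto"):
    edges = np.histogram_bin_edges(costs, bins=bins)
    counts, _ = np.histogram(costs, bins=edges)
    print(f"bins={bins!s:>4}: {len(counts):>3} bars, tallest bar holds {counts.max()} values")
```

Too few bins can hide features such as a second peak, while too many can make random noise look like structure, so it is worth trying more than one setting.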

### Example 2: Distribution of Wind Speed in Claims

In [None]:
plt.figure(figsize=(8,5))
sns.histplot(df['Wind Speed'], bins=15, kde=True, color='green')
plt.xlabel("Wind Speed (mph)")
plt.ylabel("Number of Claims")
plt.title("Distribution of Wind Speeds in Insurance Claims")
plt.show()

#### Key Insights
- Are most claims occurring at moderate wind speeds, or only at extreme levels?
- Do we see a spike at a particular wind speed that may suggest a major storm event?
- Are there any outlier wind speeds that may indicate data entry errors?

### Example 3: Distribution of Roof Ages in Claims

In [None]:
plt.figure(figsize=(8,5))
sns.histplot(df['Age of roof'], bins=10, kde=True, color='orange')
plt.xlabel("Roof Age (Years)")
plt.ylabel("Frequency")
plt.title("Distribution of Roof Age in Claims")
plt.show()

#### Key Insights
- Are older roofs more likely to have claims?
- Are there unusual spikes, such as many claims from very new roofs, which could suggest fraudulent activity?
- Does the distribution match expectations, or are claims evenly spread across different roof ages?

### Example 4: Rainfall Levels in Claim Events

In [None]:
plt.figure(figsize=(8,5))
sns.histplot(df['Rainfall'], bins=12, kde=True, color='purple')
plt.xlabel("Rainfall (inches)")
plt.ylabel("Claim Count")
plt.title("Distribution of Rainfall Levels in Claims")
plt.show()

#### Key Insights
- Does higher rainfall correlate with more insurance claims?
- Are there unexpected claims at low rainfall levels, possibly suggesting fraudulent or unnecessary claims?
- Can we identify a threshold where rainfall starts to significantly impact claim frequency?

### Example 5: Hail Diameter in Claims

In [None]:
plt.figure(figsize=(8,5))
sns.histplot(df['Hail Diameter'], bins=8, kde=True, color='red')
plt.xlabel("Hail Diameter (inches)")
plt.ylabel("Claim Frequency")
plt.title("Distribution of Hail Sizes in Claims")
plt.show()

#### Key Insights
- Are larger hail sizes associated with more claims?
- Are there many claims for extremely large hail that may not align with historical weather data?
- Understanding hail size patterns can help insurers assess damage severity and set policy pricing.

## **Conclusion**
These **histograms** provide a visual representation of how key insurance variables are distributed. By analyzing their shapes, peaks, and outliers, we can **detect patterns, assess risk, and identify data inconsistencies**.

- Use **skewness** to determine whether claims are **roughly symmetric or skewed** toward high or low values.
- Watch for **outliers**, which could indicate **errors or fraudulent claims**.
- Compare distributions to **real-world expectations** to improve **insurance decision-making**.

By incorporating histograms into data analysis, insurance professionals can gain **actionable insights** into claim trends, helping them **refine policies, detect anomalies, and enhance risk assessments**.
