[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%206%20Notebooks/GDAN%205400%20-%20Week%206%20Notebooks%20%28VIII%29%20-%20Bar%20Chart%20After%20Groupby.ipynb)

This notebook provides a mini-tutorial on understanding and using bar charts after `groupby` in Python.

In [None]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

# Load Packages and Set Working Directory
Import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks.

In [None]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_info_columns', 500)

# Read in Data

In [None]:
import requests

# NOTE: replace `https://github.com/` with `https://raw.githubusercontent.com`
# https://github.com/gdsaxton/GDAN5400/blob/main/Coding%20Assignment%201/final_insurance_fraud.xlsx
url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/main/Coding%20Assignment%203/final_insurance_fraud.xlsx'

# Download the file (stop with an error if the request fails)
response = requests.get(url)
response.raise_for_status()
with open('final_insurance_fraud.xlsx', 'wb') as f:
    f.write(response.content)

# Load the Excel file
df = pd.read_excel('final_insurance_fraud.xlsx', engine='openpyxl')

print('# of rows:', len(df), '\n')

df[:2]

# **Overview: Bar Charts After GroupBy**

## **What is `groupby()` and Why Use It?**
The `groupby()` function in Pandas is used to **split** a dataset into groups, **apply** calculations to each group, and **combine** the results. This is especially useful for **summarizing and aggregating numerical data based on categorical variables**.
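
To make the split, apply, and combine steps concrete, here is a minimal sketch on a tiny made-up table. The values are hypothetical and are not taken from the insurance dataset; only the column names mirror it.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the insurance dataset
toy = pd.DataFrame({
    'Roofing Company': ['A', 'A', 'B', 'B', 'B'],
    'Estimated cost to replace': [10000, 12000, 8000, 9000, 10000],
})

# Split: groupby() forms one group of rows per company
grouped = toy.groupby('Roofing Company')
for name, rows in grouped:
    print(name, rows['Estimated cost to replace'].tolist())

# Apply + combine: compute one mean per group and return them as a single Series
means = grouped['Estimated cost to replace'].mean()
print(means)
```

The result is a Series indexed by company, which is the shape that `.plot(kind='bar')` expects.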

When combined with **bar charts**, grouping helps visualize differences across categories, making it easier to identify **patterns, outliers, and trends**.

### **Common Use Cases for `groupby` and Bar Charts**
- **Comparing average costs** – e.g., average claim replacement cost by roofing company.
- **Summarizing counts** – e.g., number of claims per city.
- **Finding highest/lowest values** – e.g., insurance adjusters with the highest repair estimates.
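
Each of the three use cases above maps onto a different aggregation. The sketch below uses a small invented table (the cities, adjuster names, and dollar amounts are all hypothetical) so that the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the insurance dataset
toy = pd.DataFrame({
    'City': ['Dallas', 'Dallas', 'Austin', 'Austin', 'Austin'],
    'Roofing Company': ['A', 'B', 'A', 'B', 'B'],
    'Adjuster': ['Kim', 'Lee', 'Kim', 'Lee', 'Kim'],
    'Claim number': [1, 2, 3, 4, 5],
    'Estimated cost to replace': [10000, 8000, 12000, 9000, 10000],
    'Estimated cost to repair': [2000, 1500, 2500, 1000, 3000],
})

# Comparing average costs: mean replacement cost per roofing company
avg_cost = toy.groupby('Roofing Company')['Estimated cost to replace'].mean()

# Summarizing counts: number of claims per city
claim_counts = toy.groupby('City')['Claim number'].count()

# Finding highest values: largest repair estimate per adjuster, highest first
top_repair = (
    toy.groupby('Adjuster')['Estimated cost to repair']
    .max()
    .sort_values(ascending=False)
)

print(avg_cost, claim_counts, top_repair, sep='\n\n')
```

Any of these three Series can be passed straight to `.plot(kind='bar')`.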

---

# **Breaking Down the Code Example**

## **Example 1: Bar Chart of Average Replacement Cost by Roofing Company**
### **Objective**
Identify which roofing companies have higher or lower **average estimated replacement costs**.

### **Steps**
1. **Group the Data by Roofing Company**
   - The dataset contains multiple claims handled by different roofing companies.
   - We use `.groupby('Roofing Company')` to **group all claims handled by each company**.

2. **Calculate the Mean Replacement Cost** for each `Roofing Company`
   - The column `Estimated cost to replace` stores the replacement costs for each claim.
   - Using `.mean()` calculates the **average replacement cost** per company.

3. **Sort the Results** in descending order
   - `.sort_values(ascending=False)` sorts the companies **from highest to lowest** average replacement cost.

4. **Create a Bar Chart** to visualize the differences
   - `.plot(kind='bar')` generates a bar chart.
   - `color='skyblue'` sets the bar color.
   - `figsize=(10,5)` adjusts the figure size for readability.
   - `plt.xticks(rotation=45)` rotates the x-axis labels to prevent overlap.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

avg_replacement_by_roofer = df.groupby('Roofing Company')['Estimated cost to replace'].mean().sort_values(ascending=False)
avg_replacement_by_roofer.plot(kind='bar', color='skyblue', figsize=(10,5))
plt.title("Average Replacement Cost by Roofing Company")
plt.xlabel("Roofing Company")
plt.ylabel("Avg Replacement Cost ($)")
plt.xticks(rotation=45, ha='right')
plt.show()

#### Key Insights
- *Identifying Cost Differences:* Some roofing companies may charge significantly more for replacements than others. This could indicate differences in materials, labor costs, pricing strategies, or even potential fraud.
- *Potential Overpricing or Fraud:* If a particular company consistently has *higher-than-average* replacement costs, it may warrant further investigation for potential overpricing or fraudulent claims.
- *Regional Variations:* If certain roofing companies are concentrated in specific regions, this could highlight geographic differences in replacement costs due to local labor rates or material availability.
- *Insurer Decision-Making:* Insurers can compare vendors and assess cost efficiency.
- *Customer Decision-Making:* Policyholders and insurance companies can use this data to *make more informed decisions* when selecting roofing companies for repairs.
- *Comparing to Repair Costs:* This can be *compared with the average repair cost by roofing company* to see if high replacement costs align with high repair costs or if certain companies specialize in replacements.
- *Detecting Outliers:* Any roofing company with exceptionally high or low costs might need further investigation.

By combining `groupby()` with bar charts, analysts can summarize large datasets and gain actionable insights from categorical comparisons.
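
Two of the insights above, comparing replacement costs with repair costs and spotting higher-than-average companies, can be sketched in a few lines. The data below is invented, and the 1.25 cutoff is an arbitrary assumption chosen only for illustration, not a recommended fraud threshold.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the insurance dataset
toy = pd.DataFrame({
    'Roofing Company': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Estimated cost to replace': [10000, 12000, 8000, 9000, 20000, 22000],
    'Estimated cost to repair': [2000, 2400, 1800, 2000, 2100, 2300],
})

cost_cols = ['Estimated cost to replace', 'Estimated cost to repair']

# Average replacement and repair cost side by side for each company
by_company = toy.groupby('Roofing Company')[cost_cols].mean()

# Ratio of each company's average replacement cost to the overall average
overall = toy['Estimated cost to replace'].mean()
by_company['Replace vs overall'] = by_company['Estimated cost to replace'] / overall

# Flag companies more than 25% above the overall average (arbitrary cutoff)
flagged = by_company[by_company['Replace vs overall'] > 1.25]

print(by_company.round(2))
print('Flagged:', list(flagged.index))
```

Calling `by_company[cost_cols].plot(kind='bar')` on a two-column result like this draws grouped bars, one bar per cost column for each company.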

#### Customizing the Plot
There are lots of ways we can customize the above plot. Below I use a custom color, remove the border, add 'padding' below the title and around the x-axis and y-axis labels, change the font size, and add bold font. 

In [None]:
# Get Seaborn's default blue color
seaborn_blue = sns.color_palette("deep")[0]

avg_replacement_by_roofer = df.groupby('Roofing Company')['Estimated cost to replace'].mean().sort_values(ascending=False)

plt.figure(figsize=(10,5))
avg_replacement_by_roofer.plot(kind='bar', color=seaborn_blue)
plt.title("Average Replacement Cost by Roofing Company", fontsize=12, pad=20, fontweight='bold')
plt.xlabel("Roofing Company", labelpad=20, fontsize=11, fontweight='bold')  # Added label padding, font size, bold font
plt.ylabel("Avg Replacement Cost ($)", labelpad=20, fontsize=11, fontweight='bold')  # Added label padding, font size, bold font
plt.xticks(rotation=45, ha='right', fontsize=9)  # Align labels properly

plt.gca().spines[['top', 'right', 'left', 'bottom']].set_visible(False)  # Remove border

plt.show()
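
The same kind of customization can also be written against an explicit `Axes` object instead of the `plt.` functions. The sketch below is one possible variant, not a replacement for the cell above. It uses hypothetical averages in place of `avg_replacement_by_roofer`, and it adds value labels with `ax.bar_label`, which assumes Matplotlib 3.4 or later.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical averages standing in for avg_replacement_by_roofer
avg_cost = pd.Series({'Company A': 21000.0, 'Company B': 11000.0, 'Company C': 8500.0})

# Create the figure and axes explicitly, then draw the bars on that axes
fig, ax = plt.subplots(figsize=(10, 5))
avg_cost.plot(kind='bar', ax=ax, color='skyblue')

ax.set_title("Average Replacement Cost by Roofing Company", fontsize=12, pad=20, fontweight='bold')
ax.set_xlabel("Roofing Company", labelpad=20)
ax.set_ylabel("Avg Replacement Cost ($)", labelpad=20)
ax.tick_params(axis='x', labelrotation=45)

# Hide only the top and right borders
ax.spines[['top', 'right']].set_visible(False)

# Print each bar's value just above it
ax.bar_label(ax.containers[0], fmt='%.0f', padding=3)

fig.tight_layout()
```

Working with `ax` directly makes it easier to place several charts in one figure later, since each chart gets its own axes.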

## **Example 2: Bar Chart of Total Claims by City**
**Purpose** - This visualization helps identify which **cities** have the highest number of claims, allowing analysts to detect high-risk areas.

In [None]:
import matplotlib.pyplot as plt

# Count the number of claims per city
claims_by_city = df.groupby('City')['Claim number'].count().sort_values(ascending=False)

# Create a bar chart
claims_by_city.plot(kind='bar', color='skyblue', figsize=(10,5))
plt.title("Total Claims by City")
plt.xlabel("") #Remove x-axis title
plt.ylabel("Number of Claims", labelpad=20)
plt.xticks(rotation=0)
plt.show()

#### Key Insight
- Cities with a higher volume of claims might require additional risk assessment or adjusted insurance premiums.
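
For simple counts, `value_counts()` is a common shortcut that gives the same numbers as the `groupby` version when the counted column has no missing values. A minimal sketch on invented data (the city names and claim numbers are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the insurance dataset
toy = pd.DataFrame({
    'City': ['Dallas', 'Austin', 'Austin', 'Dallas', 'Austin', 'Houston'],
    'Claim number': [101, 102, 103, 104, 105, 106],
})

# groupby version: count non-missing claim numbers within each city
via_groupby = toy.groupby('City')['Claim number'].count().sort_values(ascending=False)

# Shortcut: count rows per city, already sorted from most to least frequent
via_value_counts = toy['City'].value_counts()

print(via_groupby)
print(via_value_counts)
```

The two differ only when `Claim number` contains missing values: `count()` skips those rows, while `value_counts()` on `City` still counts them.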

## **Example 3: Average Repair Cost by Adjuster**
**Purpose** - This example groups the dataset by `Adjuster` and calculates the average `Estimated cost to repair`, helping to identify patterns in claim assessments.

In [None]:
# Calculate the average repair cost per adjuster and print the results
avg_repair_by_adjuster = df.groupby('Adjuster')['Estimated cost to repair'].mean().sort_values(ascending=False)
print(avg_repair_by_adjuster)

In [None]:
#Plot the graph
avg_repair_by_adjuster.plot(kind='bar', color='skyblue', figsize=(10,5))

In [None]:
#Quick one-liner version
df.groupby('Adjuster')['Estimated cost to repair'].mean().sort_values(ascending=False).plot(kind='bar')
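
An average based on only one or two claims can be misleading, so it may help to look at the group size next to the mean before reading too much into a tall bar. The sketch below uses invented adjusters and amounts, and `.agg(['mean', 'count'])` is just one way to get both numbers at once.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the insurance dataset
toy = pd.DataFrame({
    'Adjuster': ['Kim', 'Kim', 'Kim', 'Lee', 'Lee', 'Park'],
    'Estimated cost to repair': [2000, 2500, 3000, 1500, 1000, 9000],
})

# Mean repair cost and number of claims per adjuster, highest mean first
summary = (
    toy.groupby('Adjuster')['Estimated cost to repair']
    .agg(['mean', 'count'])
    .sort_values('mean', ascending=False)
)

print(summary)
```

Here the adjuster with the highest average handled a single claim, which is worth knowing before drawing conclusions from the chart.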

## **Conclusions on Bar Charts with `groupby()`**
Using **bar charts with grouped data** allows for **clear comparisons** between categories, making it an essential tool for financial and insurance analytics. 

### **Key Takeaways**
- **Simplifies large datasets** – Grouping and visualizing data highlights **patterns and trends** across different categories.
- **Enhances decision-making** – Helps identify **which categories perform best or worst**, such as **roofing companies with higher average replacement costs**.
- **Detects anomalies and outliers** – Unusually high or low values may indicate **pricing inconsistencies, overbilling, or inefficiencies**.
- **Supports business strategy** – Insurance firms can **adjust vendor contracts, refine policies, or optimize pricing models** based on insights from grouped data.

### **Best Practices**
- Always **sort the data** in a meaningful way (e.g., descending order for cost comparisons).
- Rotate x-axis labels (`plt.xticks(rotation=45)`) to **prevent overlap and improve readability**.
- Choose appropriate colors and figure sizes for **clarity and presentation**.
- Consider **filtering out extreme outliers** for a more balanced visualization.
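
As an illustration of the last point, the sketch below drops extreme values with the common 1.5 × IQR rule before grouping. The data is invented, and the IQR rule is only one possible choice of filter, used here as an assumption rather than a recommendation for this dataset.

```python
import pandas as pd

# Hypothetical toy data with one extreme claim for company B
toy = pd.DataFrame({
    'Roofing Company': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Estimated cost to replace': [10000, 11000, 12000, 9000, 10000, 95000],
})

col = 'Estimated cost to replace'

# Keep rows within 1.5 * IQR of the first and third quartiles
q1, q3 = toy[col].quantile([0.25, 0.75])
iqr = q3 - q1
mask = toy[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
filtered = toy[mask]

# Compare the group means before and after filtering
before = toy.groupby('Roofing Company')[col].mean()
after = filtered.groupby('Roofing Company')[col].mean()

print(before)
print(after)
```

Because filtering changes the story the chart tells, it is usually worth reporting how many rows were removed and plotting both versions.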

By leveraging `groupby()` and bar charts, analysts can efficiently **summarize, compare, and visualize** key financial and operational metrics to drive **better business decisions**.