[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%206%20Notebooks/GDAN%205400%20-%20Week%206%20Notebooks%20%28VI%29%20-%20Task%208%20-%20Bar%20Charts.ipynb)

This notebook provides a mini-tutorial on understanding and using count plots in Python.

In [None]:
%%time
import datetime
print("Current date and time:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), "\n")

# Load Packages and Set Working Directory
Import the Python packages we need. We will use the <a href="http://pandas.pydata.org/">Python Data Analysis Library</a>, or <i>pandas</i>, extensively for data manipulation in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
pandas lets you set various options for, among other things, inspecting the data. I like to be able to see all of the columns, so I typically include these lines at the top of all my notebooks.

In [None]:
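# Display options: http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)       # show every column
pd.set_option('display.max_colwidth', 250)       # show up to 250 characters per cell
pd.set_option('display.max_info_columns', 500)   # per-column details in df.info() for up to 500 columns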

# Read in Data

In [None]:
import requests

# NOTE: to download from GitHub, use the raw URL: replace `https://github.com/`
# with `https://raw.githubusercontent.com/` and drop `/blob` from the path
# https://github.com/gdsaxton/GDAN5400/blob/main/Coding%20Assignment%201/final_insurance_fraud.xlsx
url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/main/Coding%20Assignment%203/final_insurance_fraud.xlsx'

# Download the file; raise an error right away if the request fails
response = requests.get(url, timeout=30)
response.raise_for_status()
with open('final_insurance_fraud.xlsx', 'wb') as f:
    f.write(response.content)

# Load the Excel file
df = pd.read_excel('final_insurance_fraud.xlsx', engine='openpyxl')

print('# of rows:', len(df), '\n')

df.head(2)

# **Mini-Tutorial: Understanding Countplots**

## **What is a Countplot?**
A **countplot** is a type of bar chart used to visualize the **frequency** of categorical data. It displays the count of observations in each category, making it useful for **identifying patterns, trends, and anomalies** in a dataset.

### **Why Use Countplots in Data Analytics?**
- **Quickly summarize categorical data** – See how often each category appears in the dataset.
- **Detect imbalances** – Identify underrepresented or overrepresented categories.
- **Find patterns in data** – Understand distributions across different groups.

## **How Countplots Help in Accounting & Insurance Analytics**
Countplots are particularly useful for professionals in **accounting, finance, and insurance**, as they:
- **Highlight the most frequent claims** – Show which types of insurance claims are most common.
- **Identify high-risk areas** – Show which regions generate the most claims.
- **Detect irregularities** – Spot unusual claim activity that may indicate fraud.
- **Monitor service provider usage** – Show which roofing companies or adjusters are handling the most claims.

### **Key Features of Countplots**
1. **Categorical on One Axis, Counts on the Other**  
   - The x-axis (or y-axis) represents a **categorical variable** (e.g., city, adjuster, roof type).
   - The y-axis (or x-axis) shows **the number of occurrences**.

2. **Customization Options**  
   - Can be displayed **vertically or horizontally**.
   - Can **sort categories** by frequency for better readability.
   - Supports **color customization** for clearer differentiation.

3. **Comparison Between Categories**  
   - By grouping by another categorical variable, countplots can show **comparisons within subcategories**.

By using **countplots effectively**, professionals can **gain quick insights** into categorical distributions, making it easier to detect trends, make informed decisions, and identify potential risks.
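
The customization and grouping features listed above can be sketched on a small example. The DataFrame `toy` and its `Roof_Type` column are hypothetical stand-ins rather than the notebook's dataset; the sketch sorts cities by frequency, draws horizontal bars, and uses `hue` to split each city by a second categorical variable.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data; the column names and values are illustrative only
toy = pd.DataFrame({
    "City": ["Austin", "Austin", "Dallas", "Dallas", "Dallas", "Houston"],
    "Roof_Type": ["Metal", "Shingle", "Shingle", "Shingle", "Metal", "Shingle"],
})

# Sort categories from most to least frequent
city_order = toy["City"].value_counts().index.tolist()

fig, ax = plt.subplots(figsize=(8, 4))
# y= gives horizontal bars; hue= splits each city's count by roof type
sns.countplot(data=toy, y="City", hue="Roof_Type", order=city_order, ax=ax)
ax.set_xlabel("Number of Claims")
ax.set_title("Claims by City and Roof Type (toy data)")
plt.show()

# The same grouped counts as a table, useful for checking the plot
grouped_counts = pd.crosstab(toy["City"], toy["Roof_Type"])
print(grouped_counts)
```

Passing `x="City"` instead of `y="City"` would draw the same counts as vertical bars.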


### **Example: Countplot of Claim Counts by City**
**Purpose** – 
This countplot visualizes **the number of claims made in different cities**, helping insurers understand geographic patterns in claims. Unlike a **bar chart**, which can display **any aggregated numerical value** (e.g., average repair cost), a countplot **only counts the number of occurrences of each category**.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
sns.countplot(y=df['City'], order=df['City'].value_counts().index, color='steelblue')
plt.xlabel("Number of Claims")
plt.ylabel("City")
plt.title("Number of Claims by City")
plt.show()

#### Key Insights
- Shows which cities have the highest claim activity.
- Helps insurers detect regional risks and adjust policies accordingly.
- Can flag cities with an unexpectedly high volume of claims for closer review; a high count alone does not establish fraud.
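
The tallies behind such a plot can also be inspected as numbers. The following is a minimal sketch on hypothetical data: `value_counts()` yields the same per-city counts a countplot draws, and the "more than twice the median" rule is an illustrative assumption for what "unexpectedly high" might mean, not an established fraud test.

```python
import pandas as pd

# Hypothetical toy data standing in for the claims DataFrame
toy = pd.DataFrame({
    "City": ["Dallas"] * 12 + ["Austin"] * 3 + ["Houston"] * 2 + ["Plano"] * 3,
})

# value_counts() returns the per-category tallies, sorted from most to least frequent
claim_counts = toy["City"].value_counts()

# Illustrative rule (an assumption): flag cities whose claim count
# is more than twice the median count across cities
threshold = 2 * claim_counts.median()
flagged = claim_counts[claim_counts > threshold]

print(claim_counts)
print("Threshold:", threshold)
print("Flagged cities:", list(flagged.index))
```

Any flagged city is a starting point for further review; the right threshold depends on the portfolio and on how many policies each city has.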

### How is This Different from a Bar Chart?
- Countplots are specifically used for categorical frequency (i.e., how many times a category appears in the dataset).
- Bar charts can display aggregated metrics, such as average cost per city or total payout per adjuster.
- A countplot always has "count" on one axis, while bar charts have a numerical variable that can be customized (e.g., sum, mean, max).
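- Countplots treat the plotted variable as categorical, so each distinct value gets its own bar. A continuous numerical variable (e.g., claim amount) should be binned into categories before it is counted.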
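
One way to count a continuous numerical column is to bin it into categories first, for example with `pd.cut`. In this sketch the amounts, bin edges, and labels are all illustrative assumptions.

```python
import pandas as pd

# Hypothetical claim amounts in dollars
amounts = pd.Series([1200, 4800, 5200, 9900, 15000, 22000, 700], name="Claim_Amount")

# Bin edges and labels are illustrative; intervals are closed on the right by default
bins = [0, 5000, 10000, 25000]
labels = ["Under 5k", "5k-10k", "Over 10k"]
amount_band = pd.cut(amounts, bins=bins, labels=labels)

# sort=False keeps the bands in their defined order instead of sorting by frequency
band_counts = amount_band.value_counts(sort=False)
print(band_counts)
```

The resulting `amount_band` series is categorical, so it could be passed to `sns.countplot` like any other category column.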

The main difference between a bar chart and a countplot lies in how they handle data:

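1. Bar Chart (`sns.barplot()` or `plt.bar()`)
Needs both a categorical variable and a numerical variable: `sns.barplot()` aggregates the numerical values within each category itself (the mean by default, or another function passed as `estimator`), whereas `plt.bar()` expects values you have already aggregated.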

Used to display comparisons between different categories.

Example: Showing the average claim amount per adjuster.

```python
sns.barplot(x=df['Adjuster'], y=df['Claim_Amount'], estimator=np.mean)
```

2. Countplot (`sns.countplot()`)
Automatically counts occurrences: It works directly with categorical data and plots the frequency of each category.

No need for explicit aggregation: It automatically calculates the count of each unique value in the category column.

Example: Showing how many claims each adjuster has handled.

```python
sns.countplot(y=df['Adjuster'])
```
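
To make the contrast concrete, both summaries can be reproduced with plain pandas. This is a sketch on hypothetical data (the adjuster names and amounts are invented): a bar plot of the mean reduces a numerical column within each category, while a countplot only tallies rows.

```python
import pandas as pd

# Hypothetical toy data; names and amounts are illustrative only
toy = pd.DataFrame({
    "Adjuster": ["Lee", "Lee", "Lee", "Patel", "Patel", "Gomez"],
    "Claim_Amount": [1000.0, 2000.0, 3000.0, 4000.0, 6000.0, 2500.0],
})

# What a bar plot of the mean summarizes: average claim amount per adjuster
mean_per_adjuster = toy.groupby("Adjuster")["Claim_Amount"].mean()

# What a countplot summarizes: number of claims (rows) per adjuster
claims_per_adjuster = toy.groupby("Adjuster").size()

summary = pd.DataFrame({
    "mean_claim": mean_per_adjuster,
    "n_claims": claims_per_adjuster,
})
print(summary)
```

Here the adjuster with the most claims is not the one with the highest average claim, which is exactly the kind of difference the two chart types are meant to separate.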

#### Summary



| Feature        | Bar Chart (`barplot`)       | Countplot (`countplot`)  |
|---------------|-----------------------------|-------------------------|
| **Data Input** | Categorical + numerical | Only categorical |
| **Aggregation** | Applies an estimator to the numerical variable (mean by default; e.g., sum, max) | Counts occurrences automatically |
| **Use Case** | Comparing values across categories | Counting occurrences of categories |


## **Conclusions on Countplots**
Countplots are a powerful visualization tool for analyzing **categorical data frequency**. They are especially useful in **insurance and accounting analytics** because they:
- **Quickly summarize categorical distributions** – Making it easy to see which categories appear most frequently.
- **Highlight trends and imbalances** – Identifying high-risk locations, commonly used service providers, or frequent claim types.
- **Detect anomalies** – Large variations in frequency could indicate fraud, data entry errors, or unexpected trends.
- **Enhance data-driven decision-making** – Helping insurers, auditors, and financial analysts make better-informed policy and business decisions.

By understanding **when to use a countplot versus a bar chart**, analysts can effectively visualize data **to extract meaningful insights and drive business intelligence**.
