[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%206%20Notebooks/GDAN%205400%20-%20Week%206%20Notebooks%20%28V%29%20-%20Task%207%20-%20Bar%20Charts.ipynb)

This notebook provides a mini-tutorial on understanding and using bar charts in Python.

In [None]:
%%time
import datetime
print("Current date and time:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

# Load Packages and Set Working Directory
Import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns, so I typically include these lines at the top of all my notebooks.

In [None]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_info_columns', 500)

# Read in Data

In [None]:
import requests

# NOTE: to download the raw file, replace `https://github.com/` with `https://raw.githubusercontent.com/` and drop `/blob`
# https://github.com/gdsaxton/GDAN5400/blob/main/Coding%20Assignment%201/final_insurance_fraud.xlsx
url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/main/Coding%20Assignment%203/final_insurance_fraud.xlsx'

# Download the file and stop early if the request failed
response = requests.get(url)
response.raise_for_status()
with open('final_insurance_fraud.xlsx', 'wb') as f:
    f.write(response.content)

# Load the Excel file
df = pd.read_excel('final_insurance_fraud.xlsx', engine='openpyxl')

print('# of rows:', len(df), '\n')

df.head(2)

# **Bar Charts in Data Analytics**

## **Introduction**
Bar charts are one of the most commonly used visualizations in data analytics. They are used to compare categorical variables and **display frequency counts, averages, or other aggregated metrics**. 

In the context of insurance claims, bar charts can help us:
- Compare **the number of claims** handled by different adjusters.
- Analyze **average repair costs** across different roofing companies.
- Examine **the frequency of different types of roof damage**.
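The "averages" case can be sketched as follows. This is a minimal illustration on made-up data: the column names `Roofing_Company` and `Repair_Cost` and all values are hypothetical and are not taken from the course dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
toy = pd.DataFrame({
    "Roofing_Company": ["A", "A", "B", "B", "C"],
    "Repair_Cost": [1000, 3000, 2000, 4000, 5000],
})

# Aggregate first: one mean repair cost per company
avg_cost = toy.groupby("Roofing_Company")["Repair_Cost"].mean()

# Then plot the aggregated values as bars
fig, ax = plt.subplots(figsize=(8, 5))
avg_cost.plot(kind="bar", color="skyblue", ax=ax)
ax.set_title("Average Repair Cost by Roofing Company (toy data)")
ax.set_xlabel(" ")
ax.set_ylabel("Average Repair Cost")
plt.xticks(rotation=0)
plt.show()
```

The same pattern (aggregate, then plot) works for sums, medians, or any other metric that `groupby` can compute.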

---

## Example 1: Bar Chart of Claim Counts by City
**Purpose**
This bar chart helps visualize **which cities have the most insurance claims**, providing insight into regional claim distribution.

#### Option 1: using pandas `.plot(kind='bar')` (built on `matplotlib`)

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(10,5))
df['City'].value_counts().plot(kind='bar', color='skyblue')
plt.title("Number of Claims by City")
plt.xlabel(" ")
plt.ylabel("Number of Claims")
plt.xticks(rotation=0)
plt.show()

#### Option 2: using `seaborn` and `countplot`

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10,6))
sns.countplot(y=df['City'], order=df['City'].value_counts().index, color='steelblue')
plt.xlabel("Number of Claims")
plt.ylabel("City")
plt.title("Number of Claims in Each City")
plt.show()

#### Key Insights
- Helps identify high-claim regions, which could indicate areas more prone to damage.
- If a city has an unusually high number of claims, it may warrant further investigation (e.g., fraud detection or regional weather patterns).
- Uneven distributions might suggest geographic risk factors for insurance policies.
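To judge whether a city's claim count is unusually high, it can help to look at shares rather than raw counts. The sketch below uses a small hypothetical Series of city names in place of `df['City']`; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical city values; in the notebook this would be df['City']
cities = pd.Series(
    ["Springfield", "Springfield", "Springfield", "Shelbyville", "Ogdenville"]
)

counts = cities.value_counts()                # raw number of claims per city
shares = cities.value_counts(normalize=True)  # proportion of all claims per city

# Put both views side by side for inspection
summary = pd.DataFrame({"claims": counts, "share": shares.round(2)})
print(summary)
```

What counts as "unusually high" is a judgment call that depends on context, such as each city's population or number of policies.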

### Differences Between the Two Options
#### Library Used
- First Version (`.plot(kind='bar')`): Uses the pandas `.plot(kind='bar')` method, which draws the chart with Matplotlib.
- Second Version (`sns.countplot`): Uses Seaborn’s `countplot()` function.

#### Data Handling
- First Version (`value_counts()`):
    - Uses `df['City'].value_counts()` to precompute the counts before plotting.
    - The `plot(kind='bar')` method then visualizes these precomputed counts.
- Second Version (`countplot`):
    - Directly counts occurrences within Seaborn without needing `value_counts()`.
    - The `order=df['City'].value_counts().index` ensures bars are ordered by frequency.
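To see what `value_counts()` hands to each option, the sketch below runs it on a small hypothetical Series (the city names are invented). The resulting Series is what Option 1 plots, and its index is what Option 2 receives through `order=`.

```python
import pandas as pd

# Hypothetical city values; in the notebook this would be df['City']
cities = pd.Series(["Dallas", "Austin", "Dallas", "Houston", "Dallas", "Austin"])

# Precomputed counts, sorted from most to least frequent
counts = cities.value_counts()
print(counts.tolist())  # [3, 2, 1]

# The index alone gives the category order passed to countplot's `order=`
order = counts.index.tolist()
print(order)  # ['Dallas', 'Austin', 'Houston']
```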

#### Orientation
- First Version (`.plot(kind='bar')`):
  - A **vertical bar chart**, meaning cities are placed along the **x-axis**.
  - Uses `plt.xticks(rotation=0)` to control the x-axis labels.
- Second Version (`sns.countplot`):
  - A **horizontal bar chart** (`y=df['City']`).
  - Cities are placed on the **y-axis**, making it easier to read when many categories exist.


#### Customization
- First Version (`.plot(kind='bar')`):
  - Uses `color='skyblue'` for bar color.
  - Leaves `xlabel` as empty (`" "`) to reduce clutter.
- Second Version (`sns.countplot`):
  - Uses `color='steelblue'` for bars.
  - Explicitly sets `xlabel` and `ylabel` for clarity.



#### Styling and Readability
- pandas/Matplotlib (`.plot(kind='bar')`):
  - Simple and good for quick visualizations.
  - Might require more formatting adjustments for aesthetics.
- Seaborn (`sns.countplot`):
  - Offers built-in themes (for example, via `sns.set_theme()`) that add touches such as grid lines and consistent spacing.
  - Often needs less manual styling to look polished.


### Which One to Use?
- Use `.plot(kind='bar')` when: You have already precomputed the counts and need a quick bar chart.
- Use `sns.countplot` when: You want cleaner visuals with automatic counting.

### **Final Takeaway**
Both methods achieve the **same goal** (plotting claim counts by city), but:
- `.plot(kind='bar')` requires **manual pre-aggregation** (`value_counts()`), which also sorts the categories by frequency.
- `sns.countplot` counts the categories **automatically**, but sorts them by frequency only because of the `order=` argument.
- **Horizontal bars**, as used in the `sns.countplot` example, are often more readable when dealing with long category names.


### Changing the First Option into a Horizontal Bar Chart
To turn the first version into a horizontal bar chart, use `.plot(kind='barh')` instead of `.plot(kind='bar')`.

In [None]:
plt.figure(figsize=(10,5))
df['City'].value_counts().plot(kind='barh', color='skyblue')
plt.title("Number of Claims by City")
plt.xlabel("Number of Claims")
plt.ylabel(" ")
plt.show()
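
One detail worth knowing about `barh`: it draws the first row at the bottom of the chart, so the descending output of `value_counts()` puts the largest bar at the bottom. A minimal sketch of one way to put the largest bar on top, using hypothetical city names rather than the course data:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical city values; in the notebook this would be df['City']
cities = pd.Series(["Dallas", "Austin", "Dallas", "Houston", "Dallas", "Austin"])

counts = cities.value_counts()  # descending: Dallas, Austin, Houston

fig, ax = plt.subplots(figsize=(10, 5))
# Sort ascending so the largest count is drawn last, i.e. at the top
counts.sort_values().plot(kind="barh", color="skyblue", ax=ax)
ax.set_title("Number of Claims by City (toy data)")
ax.set_xlabel("Number of Claims")
ax.set_ylabel(" ")
plt.show()
```

An alternative is to keep the descending order and call `ax.invert_yaxis()` after plotting.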

## Example 2: Bar Chart of Claim Counts by Number of Stories
This example shows both the vertical (`bar`) and horizontal (`barh`) bar chart options.

In [None]:
plt.figure(figsize=(8,5))
df['Stories'].value_counts().plot(kind='bar', color='lightcoral')
plt.title("Number of Claims by Number of Stories")
plt.xlabel(" ")
plt.ylabel("Number of Claims")
plt.xticks(rotation=0)
plt.show()

In [None]:
plt.figure(figsize=(8,5))
df['Stories'].value_counts().plot(kind='barh', color='lightcoral')
plt.title("Number of Claims by Number of Stories")
plt.xlabel("Number of Claims")
plt.ylabel(" ")
plt.show()

## **Conclusions on Bar Graphs**
Bar graphs are a fundamental visualization tool for comparing categorical data, making them highly useful in accounting and insurance analytics. They allow analysts to:
- **Identify patterns and trends** across different categories, such as roofing companies, adjusters, or cities.
- **Highlight outliers** by revealing unusually high or low values that may indicate errors, fraud, or significant regional variations.
- **Simplify complex datasets** into an easily interpretable format, making data-driven decision-making more efficient.
- **Enhance communication of insights** by presenting information in a clear and visually appealing manner.

By carefully selecting between **horizontal and vertical orientations** and choosing appropriate **sorting methods**, analysts can significantly improve the clarity and impact of their bar charts.
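
The point about sorting methods can be illustrated with an ordinal variable such as the number of stories. `value_counts()` orders bars by frequency, while adding `.sort_index()` orders them by the category values themselves. The data below is hypothetical and only stands in for `df['Stories']`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical story counts; in the notebook this would be df['Stories']
stories = pd.Series([2, 2, 2, 1, 1, 3])

by_frequency = stories.value_counts()              # index order: 2, 1, 3
by_category = stories.value_counts().sort_index()  # index order: 1, 2, 3

# For an ordinal variable, the natural category order is often easier to read
fig, ax = plt.subplots(figsize=(8, 5))
by_category.plot(kind="bar", color="lightcoral", ax=ax)
ax.set_title("Number of Claims by Number of Stories (toy data)")
ax.set_xlabel("Number of Stories")
ax.set_ylabel("Number of Claims")
plt.xticks(rotation=0)
plt.show()
```

Frequency order tends to suit nominal categories such as cities, where there is no inherent ordering to preserve.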