[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%206%20Notebooks/GDAN%205400%20-%20Week%206%20Notebooks%20%28II%29%20-%20Tasks%202-4%20-%20Univariate%20Statistics.ipynb)

This notebook provides a mini-tutorial on univariate statistics in Python.

In [None]:
%%time
import datetime
print("Current date and time:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

# Load Packages and Set Working Directory
Import the Python packages we need. We will use the <a href="http://pandas.pydata.org/">Python Data Analysis Library</a>, or <i>pandas</i>, extensively for data manipulation in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
pandas lets you set various display options that affect, among other things, how data is shown when you inspect it. I like to be able to see all of the columns, so I typically include these lines at the top of all my notebooks.

In [None]:
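# http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)        # show every column
pd.set_option('display.max_colwidth', 250)        # show up to 250 characters per cell
pd.set_option('display.max_info_columns', 500)    # let df.info() list up to 500 columns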

# Read in Data

In [None]:
import requests

# NOTE: replace `https://github.com/` with `https://raw.githubusercontent.com`
# https://github.com/gdsaxton/GDAN5400/blob/main/Coding%20Assignment%201/final_insurance_fraud.xlsx
url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/main/Coding%20Assignment%203/final_insurance_fraud.xlsx'

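# Download the file; stop with an HTTP error if the request failed,
# so that an error page is never saved as the workbook
response = requests.get(url)
response.raise_for_status()
with open('final_insurance_fraud.xlsx', 'wb') as f:
    f.write(response.content)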

# Load the Excel file
df = pd.read_excel('final_insurance_fraud.xlsx', engine='openpyxl')

print('# of rows:', len(df), '\n')

# Preview the first two rows
df.head(2)

# Mini-Tutorial: Understanding Univariate Statistics in Data Analytics

## **Lesson Objectives**
By the end of this lesson, students will:
1. Understand the concept and application of basic univariate statistics and probability.
2. Learn how to calculate and interpret key univariate statistics: **mean, median, mode, minimum, maximum, range, variance, and standard deviation**.

---

## **What Are Univariate Statistics?**
Univariate statistics refer to the analysis of a **single variable** in a dataset. The term "univariate" comes from "uni-" (one) and "variate" (variable), meaning that these statistics describe, summarize, and interpret the characteristics of one variable at a time.

In data analytics, univariate statistics help analysts understand the **distribution, central tendency, and variability** of data. They form the foundation for identifying patterns, detecting anomalies, and making data-driven decisions.

---

## **Why Are Univariate Statistics Important?**
Univariate statistics are useful for several reasons:

### **1. Summarizing Large Datasets**
Real-world datasets can contain thousands or millions of observations. Univariate statistics allow us to **summarize complex data** into meaningful insights. For example, rather than looking at every transaction in a financial dataset, we can quickly compute the **average transaction amount** to understand general spending behavior.

### **2. Identifying Central Tendencies**
Univariate statistics help pinpoint the **"center" of a dataset** using measures like:
- **Mean (Average)** – The arithmetic average of all values.
- **Median** – The middle value of the sorted data. Because extreme values barely move it, it is useful for skewed data.
- **Mode** – The most frequently occurring value.

For example, in an **insurance claims dataset**, knowing the **median claim amount** helps companies set expectations for typical payouts.
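
As a minimal sketch of how these three measures can disagree, the example below uses a small set of made-up claim amounts (the `claims` values are hypothetical, not taken from the course dataset). One very large claim pulls the mean well above the median, while the mode simply reports the most repeated amount.

```python
import pandas as pd

# Hypothetical claim amounts for illustration only (not from the course dataset)
claims = pd.Series([1200.0, 1500.0, 1500.0, 1800.0, 2100.0, 2400.0, 25000.0])

mean_claim = claims.mean()      # pulled upward by the single 25,000 claim
median_claim = claims.median()  # middle value of the sorted data
mode_claim = claims.mode()[0]   # most frequent value (first one if several tie)

print(f"Mean:   {mean_claim:.2f}")
print(f"Median: {median_claim:.2f}")
print(f"Mode:   {mode_claim:.2f}")
```

When the mean sits far above the median, as it does here, the data is likely right-skewed and the median is usually the better description of a "typical" claim.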

### **3. Measuring Variability and Risk**
Understanding how **spread out** data is can reveal patterns or inconsistencies:
- **Minimum & Maximum** – The smallest and largest values, helping to spot outliers.
- **Range** – The difference between max and min values, showing data dispersion.
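- **Variance & Standard Deviation** – Measure how far values typically lie from the mean. The standard deviation is the square root of the variance, so it is expressed in the same units as the data.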

For instance, an **insurance company** may analyze the **standard deviation of repair costs** to assess risk and pricing strategies.
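
The spread measures above can be sketched on a handful of made-up repair costs (the `costs` values are hypothetical). Note that pandas computes the *sample* variance and standard deviation by default (`ddof=1`, dividing by n - 1); passing `ddof=0` gives the population version instead.

```python
import pandas as pd

# Hypothetical repair costs for illustration only
costs = pd.Series([800.0, 950.0, 1000.0, 1050.0, 1200.0])

cost_range = costs.max() - costs.min()   # largest minus smallest value
sample_var = costs.var()                 # pandas default ddof=1: divide by n - 1
sample_std = costs.std()                 # square root of the sample variance
population_var = costs.var(ddof=0)       # divide by n instead

print(f"Range:               {cost_range:.2f}")
print(f"Sample variance:     {sample_var:.2f}")
print(f"Sample std dev:      {sample_std:.2f}")
print(f"Population variance: {population_var:.2f}")
```

The variance is in squared units (here, squared dollars), which is why the standard deviation is usually the easier number to interpret.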

### **4. Detecting Anomalies and Outliers**
Outliers can be **data errors, fraudulent activities, or special cases** that require further investigation. By analyzing:
- **Extremely high or low values**
- **Unusual gaps in data distributions**

companies can identify **potential fraud in financial transactions** or **misreported expenses in accounting records**.
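
One common rule of thumb for flagging extreme values is the 1.5 × IQR rule, sketched below on made-up transaction amounts (the `amounts` values and the 1.5 multiplier are assumptions chosen for illustration). It is only one of several possible approaches, and a flagged value is a candidate for review, not proof of fraud or error.

```python
import pandas as pd

# Hypothetical transaction amounts for illustration only
amounts = pd.Series([120.0, 135.0, 150.0, 160.0, 175.0, 190.0, 210.0, 5000.0])

q1 = amounts.quantile(0.25)   # first quartile
q3 = amounts.quantile(0.75)   # third quartile
iqr = q3 - q1                 # interquartile range

# Fences of the 1.5 x IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the values that fall outside the fences
outliers = amounts[(amounts < lower_fence) | (amounts > upper_fence)]
print(outliers)
```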

### **5. Supporting Decision-Making**
Univariate statistics form the backbone of **data-driven decision-making**. For example:
- An audit team examining expense reports might look at **variance in spending patterns**.
- A financial analyst predicting future sales might first analyze **historical averages**.
- A risk management team might assess **standard deviations in investment returns** to determine stability.

### **6. Preparing for More Complex Analyses**
Before diving into advanced analytics like **predictive modeling, machine learning, or hypothesis testing**, analysts must first **explore and clean their data** using univariate statistics. 

For example:
- A **skewed distribution** might indicate the need for **log transformations** before regression analysis.
- A **high standard deviation** might suggest the presence of **multiple underlying factors** affecting variability.
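
To illustrate the log-transformation point, here is a small sketch on made-up, strongly right-skewed values (the `values` series is hypothetical). It uses `np.log1p`, which computes log(1 + x) and therefore also tolerates zeros; a plain `np.log` would be an equally reasonable choice for strictly positive data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values for illustration only (each is double the last)
values = pd.Series([100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0, 6400.0])

skew_before = values.skew()   # sample skewness of the raw values
logged = np.log1p(values)     # log(1 + x) transformation
skew_after = logged.skew()    # sample skewness after the transformation

print(f"Skewness before log transform: {skew_before:.3f}")
print(f"Skewness after log transform:  {skew_after:.3f}")
```

A skewness close to zero after the transformation suggests the logged values are roughly symmetric, which is often a more comfortable starting point for regression.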

---

### **Conclusion**
Univariate statistics are a **fundamental tool in data analytics**. They help professionals summarize data, detect patterns, find anomalies, and make informed decisions. Mastering these concepts ensures that data analysts, accountants, and financial professionals **understand their data before applying more advanced analytical techniques**.

By using univariate statistics effectively, analysts can **turn raw data into actionable insights** that drive better business strategies.

---

### Code Examples: Calculating Basic Statistics using `Estimated cost to repair`

In [None]:
# Calculate basic statistics for 'Estimated cost to repair'
cost = df['Estimated cost to repair']

mean_value = cost.mean()
median_value = cost.median()
mode_value = cost.mode()[0]   # mode() returns every tied value; keep the first
min_value = cost.min()
max_value = cost.max()
range_value = max_value - min_value
std_dev = cost.std()          # sample standard deviation (pandas default ddof=1)
variance = cost.var()         # sample variance (pandas default ddof=1)

### Print results

In [None]:
print(f"Mean: ${mean_value:.2f}")
print(f"Median: ${median_value:.2f}")
print(f"Mode: ${mode_value:.2f}")
print(f"Min: ${min_value:.2f}")
print(f"Max: ${max_value:.2f}")
print(f"Range: ${range_value:.2f}")
print(f"Standard Deviation: ${std_dev:.2f}")
# Variance is measured in squared dollars, so it is printed without a $ sign
print(f"Variance: {variance:.2f} (squared dollars)")

### Summary of Different Univariate Statistics Using `Estimated cost to repair`

Let's analyze **Estimated cost to repair** using different univariate statistics.

| **Statistic**  | **Definition** | **Use in Accounting & Analytics** |
|---------------|---------------|------------------------------------|
| **Mean (Average)** | The sum of all values divided by the count. | Helps calculate the **average claim cost** for insurance companies. |
| **Median** | The middle value when data is sorted. | Useful for **skewed distributions** (e.g., fraud cases with extreme values). |
| **Mode** | The most frequently occurring value. | Identifies the **most common insurance claim amount**. |
| **Minimum (Min)** | The smallest value in the dataset. | Useful for detecting **low-end outliers** in financial reports. |
| **Maximum (Max)** | The largest value in the dataset. | Helps flag **unusually high claim amounts** that may indicate fraud. |
| **Range** | Difference between max and min values. | Shows the **spread of claim amounts** in different regions. |
| **Variance** | The average squared deviation from the mean (pandas' `.var()` divides by n - 1, the sample variance). | Helps quantify **risk or variability** in financial transactions. |
| **Standard Deviation (Std Dev)** | The square root of the variance; measures spread in the original units of the data. | Important for **risk assessment** and understanding volatility. |

---
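
Most of the statistics in this lesson can also be requested in a single call. The sketch below uses a small stand-in DataFrame named `demo` with made-up values (an assumption for illustration); in the notebook you would apply the same calls to `df['Estimated cost to repair']`.

```python
import pandas as pd

# Stand-in data with hypothetical values; in the notebook, use the real `df`
demo = pd.DataFrame({'Estimated cost to repair': [500.0, 750.0, 750.0, 1200.0, 3000.0]})
cost = demo['Estimated cost to repair']

# Ask for several statistics at once; the result is a Series indexed by name
summary = cost.agg(['mean', 'median', 'min', 'max', 'std', 'var'])
print(summary)

# describe() adds the count and quartiles; its '50%' row is the median
overview = cost.describe()
print(overview)
```

Neither call reports the mode or the range, so those still need `cost.mode()` and `cost.max() - cost.min()`.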