# **Chapter 3: Descriptive Statistics**

## **Introduction**

This chapter focuses on data exploration and descriptive statistics. Descriptive statistics allow us to understand, summarize, and visualize the characteristics of a dataset before applying more complex models.

## **3.1 The Data Quality Report**

A data quality report summarizes the key characteristics of a dataset. It includes tables and graphs describing both numerical and categorical variables. Typical measures include **central tendency**, **variation**, and **shape**. Data quality reports also check for issues such as **missing values** and **outliers**.
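As a rough illustration of what such a report can contain, the sketch below builds a tiny pandas DataFrame and summarizes one numerical and one categorical column. The dataset, the column names `age` and `city`, and the variable names are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with one missing value in each column
df = pd.DataFrame({
    "age": [23, 35, 31, np.nan, 52, 41],
    "city": ["Ghent", "Leuven", None, "Ghent", "Brussels", "Ghent"],
})

# Numerical variable: count, mean, standard deviation, min, quartiles, max
numeric_report = df["age"].describe()

# Categorical variable: count, number of distinct values, most frequent value
categorical_report = df["city"].describe()

# Number of missing values per column
missing_counts = df.isna().sum()

print(numeric_report)
print(categorical_report)
print(missing_counts)
```

A real report would typically cover every column and add plots, but the same three ingredients (numerical summary, categorical summary, missing-value counts) usually form its core.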

## **3.2 Central Tendency**

Central tendency describes the value around which the data cluster. The three most common measures are the **mean**, **median**, and **mode**.

### The Mean

The arithmetic mean (average) is given by:

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

**Example in Python:**

```python
import numpy as np

x = np.array([0, 1, 2, 0, 4, 0, 1, 2, 3])
np.mean(x)
```

This returns `np.float64(1.4444444444444444)`, i.e. $13/9 \approx 1.44$.

### The Median

The median is the middle value after sorting the data. If $n$ is odd, it is the $(n+1)/2$-th value. If $n$ is even, it is the average of the $n/2$-th and $(n/2)+1$-th values.

**Example in Python:**

```python
salaries = np.array([35000, 37000, 35000, 33000, 210000])
np.median(salaries)
```

This returns `np.float64(35000.0)`.
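The odd/even rule can also be written out by hand. The following is a minimal sketch, assuming a non-empty one-dimensional input; the helper name `median_by_rule` is made up for this example. It also prints the mean of the same salaries to show how strongly a single extreme value pulls the mean while leaving the median unchanged.

```python
import numpy as np

def median_by_rule(values):
    """Median via the odd/even position rule (positions in the text are 1-based)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    n = ordered.size
    if n % 2 == 1:
        # Odd n: the (n+1)/2-th value, which is index (n+1)/2 - 1 when counting from 0
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the n/2-th and (n/2 + 1)-th values
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

salaries = np.array([35000, 37000, 35000, 33000, 210000])

print(median_by_rule(salaries))      # odd n: 35000.0
print(median_by_rule(salaries[:4]))  # even n (first four salaries): 35000.0
print(np.mean(salaries))             # 70000.0, pulled upward by the 210000 salary
```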

### The Mode

The mode is the most frequently occurring value. Unlike the mean, it is not affected by extreme values.


**Example in Python:**

```python
import numpy as np
import pandas as pd
from scipy import stats

# Example data
data = np.array([1, 2, 2, 3, 4, 4, 4, 5])

# Using SciPy
mode_result = stats.mode(data, keepdims=True)
print("Mode:", mode_result.mode[0])
print("Count:", mode_result.count[0])

# Using pandas (alternative way)
print("Mode with pandas:", pd.Series(data).mode()[0])
```

```
Mode: 4
Count: 3
Mode with pandas: 4
```
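A dataset can have more than one mode. The short sketch below uses made-up values to show that `pd.Series.mode()` returns all of them, so indexing with `[0]` keeps only the smallest one.

```python
import pandas as pd

# Hypothetical data in which 1 and 2 both occur twice
values = pd.Series([1, 1, 2, 2, 3])

modes = values.mode()   # Series holding every mode, sorted in ascending order
print(modes.tolist())   # [1, 2]
```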


## **3.3 Variation and Shape**

In addition to central tendency, every variable is characterized by its **variation** (spread) and **shape**.

### Range

The range is the difference between the largest and smallest values:

$$
\text{Range} = \max(x) - \min(x)
$$

```python
salaries = np.array([35000, 37000, 35000, 33000, 210000])
np.max(salaries) - np.min(salaries)
```

This returns `np.int64(177000)`.

### Variance and Standard Deviation

The sample variance is:

$$
s^2 = \frac{1}{n-1} \sum_{i=1}^{n}(x_i - \bar{x})^2
$$

The sample standard deviation is:

$$
s = \sqrt{s^2}
$$

**Example in Python:**

```python
heights = np.array([65.71, 72.30, 68.31, 67.05, 70.68])
np.var(heights, ddof=1), np.std(heights, ddof=1)
```


This returns `(np.float64(7.158650000000013), np.float64(2.6755653608162917))`: a sample variance of about 7.16 and a sample standard deviation of about 2.68.
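To connect the formula to the NumPy call, the sketch below computes the sample variance step by step and compares it with `np.var`. The variable names are illustrative. Note that `np.var` and `np.std` divide by $n$ unless `ddof=1` is passed, which is why the examples in this section set it explicitly.

```python
import numpy as np

heights = np.array([65.71, 72.30, 68.31, 67.05, 70.68])
n = heights.size

# Squared deviations from the mean: (x_i - x_bar)^2
squared_deviations = (heights - heights.mean()) ** 2

# Sample variance divides by n - 1; the standard deviation is its square root
sample_variance = squared_deviations.sum() / (n - 1)
sample_std = np.sqrt(sample_variance)

# Dividing by n instead gives the population variance (NumPy's default, ddof=0)
population_variance = squared_deviations.sum() / n

print(np.isclose(sample_variance, np.var(heights, ddof=1)))   # True
print(np.isclose(population_variance, np.var(heights)))       # True
```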

### Z-Scores

The Z-score measures how many standard deviations a value lies from the mean:

$$
z_i = \frac{x_i - \bar{x}}{s}
$$

**Example in Python:**

```python
(heights - np.mean(heights)) / np.std(heights, ddof=1)
```

This returns `array([-1.15863363,  1.30439721, -0.18687639, -0.6578049 ,  0.6989177 ])`.
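SciPy offers the same calculation as a function. The sketch below, which assumes SciPy is installed, checks that `scipy.stats.zscore` with `ddof=1` matches the manual formula and that the resulting Z-scores have mean 0 and sample standard deviation 1.

```python
import numpy as np
from scipy import stats

heights = np.array([65.71, 72.30, 68.31, 67.05, 70.68])

# Manual Z-scores using the sample standard deviation
z_manual = (heights - heights.mean()) / heights.std(ddof=1)

# Same calculation with SciPy; ddof=1 selects the sample standard deviation
z_scipy = stats.zscore(heights, ddof=1)

print(np.allclose(z_manual, z_scipy))        # True
print(z_scipy.mean(), z_scipy.std(ddof=1))   # approximately 0 and 1
```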

## **3.4 Skewness**

Skewness measures the asymmetry of a distribution. As a rule of thumb, comparing the mean with the median indicates the direction of the skew:

- If mean < median → **Left-skewed**
- If mean = median → **Symmetrical**
- If mean > median → **Right-skewed**
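The rule can be checked on the salary data from Section 3.2. The sketch below compares the mean with the median and also computes a numerical skewness coefficient with `scipy.stats.skew`, which is one common way to quantify skew and not necessarily the measure intended by this chapter.

```python
import numpy as np
from scipy import stats

salaries = np.array([35000, 37000, 35000, 33000, 210000])

mean = np.mean(salaries)
median = np.median(salaries)

# Sample skewness coefficient: positive for right skew, negative for left skew
skewness = stats.skew(salaries)

print(mean > median)   # True: the single large salary pulls the mean upward
print(skewness > 0)    # True: the coefficient agrees that the data are right-skewed
```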

## **3.5 Quartiles**


Quartiles describe the position of a data value relative to the rest of the data. $Q_1$, $Q_2$, and $Q_3$ are three values that divide the ordered observations into four equally sized groups (i.e., each group contains 25% of all observations).


### Finding Quartiles




- **$Q_1$** corresponds to the 25th percentile, i.e., 25% of all observations in the data set are of lesser value than $Q_1$. In simple words, $Q_1$ is the median of all observations to the left of the median.
- **$Q_2$** corresponds to the 50th percentile, i.e., 50% of all observations in the data set are of lesser value than $Q_2$. In simple words, $Q_2$ corresponds to the median.
- **$Q_3$** corresponds to the 75th percentile, i.e., 75% of all observations in the data set are of lesser value than $Q_3$. In simple words, $Q_3$ is the median of all observations to the right of the median.

**Note: Before finding quartiles, sort the values in ascending order (smallest to largest).**

**Formula for the position of the first quartile (Q1):**

$$
\text{Position of } Q_1 = \frac{n+1}{4}
$$

**Formula for the position of the third quartile (Q3):**

$$
\text{Position of } Q_3 = \frac{3(n+1)}{4}
$$
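The position formulas can be applied by hand, interpolating linearly when the position is fractional. The sketch below does this for an illustrative ten-value dataset; the helper `value_at_position` is hypothetical and assumes the position lies between 1 and $n$. It also shows that NumPy's default `np.percentile` uses a different interpolation rule and can therefore return different quartiles, while `method="weibull"` (available in NumPy 1.22 and later) follows the $(n+1)$ rule.

```python
import numpy as np

def value_at_position(sorted_values, position):
    """Value at a 1-based, possibly fractional position, using linear interpolation."""
    lower = int(np.floor(position))
    fraction = position - lower
    if fraction == 0:
        return float(sorted_values[lower - 1])
    below = sorted_values[lower - 1]   # value at the position rounded down
    above = sorted_values[lower]       # next value in the sorted data
    return float(below + fraction * (above - below))

data = np.sort(np.array([7, 8, 5, 6, 3, 4, 9, 2, 1, 15]))
n = data.size

q1 = value_at_position(data, (n + 1) / 4)       # position 2.75
q3 = value_at_position(data, 3 * (n + 1) / 4)   # position 8.25
print(q1, q3)                                   # 2.75 8.25

# NumPy's default (linear) method places the quartiles differently
print(np.percentile(data, [25, 75]))                    # [3.25 7.75]

# The "weibull" method uses positions p * (n + 1), matching the formulas above
print(np.percentile(data, [25, 75], method="weibull"))  # [2.75 8.25]
```

Both conventions are in common use, so it is worth stating which one a reported quartile is based on.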

**Interquartile Range (IQR) Formula**

The interquartile range (IQR) measures the spread of the middle 50% of a dataset. It is calculated as:


$$
\mathrm{IQR} = Q_3 - Q_1
$$

Where:

- $Q_1$ is the first quartile (25th percentile)

- $Q_3$ is the third quartile (75th percentile)

**Using this method to calculate outliers**

You can use $Q_1$, $Q_3$, and the IQR to find outliers in your dataset. 

$$
\text{Lower Fence} = Q_1 - 1.5 \times \mathrm{IQR}
$$

$$
\text{Upper Fence} = Q_3 + 1.5 \times \mathrm{IQR}
$$

Any value below the lower fence is a lower outlier.

Any value above the upper fence is an upper outlier.

**Example in Python:**

```python
import numpy as np

# Example dataset
data = np.array([7, 8, 5, 6, 3, 4, 9, 2, 1, 15])

# Step 1: Sort the data
sorted_data = np.sort(data)
print("Sorted data:", sorted_data)

# Step 2: Calculate quartiles
Q1 = np.percentile(sorted_data, 25)
Q2 = np.percentile(sorted_data, 50)  # Median
Q3 = np.percentile(sorted_data, 75)
print(f"Q1 (25th percentile): {Q1}")
print(f"Q2 (Median): {Q2}")
print(f"Q3 (75th percentile): {Q3}")

# Step 3: Interquartile Range (IQR)
IQR = Q3 - Q1
print(f"IQR: {IQR}")

# Step 4: Calculate outlier fences
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR
print(f"Lower Fence: {lower_fence}")
print(f"Upper Fence: {upper_fence}")

# Step 5: Identify outliers
outliers = sorted_data[(sorted_data < lower_fence) | (sorted_data > upper_fence)]
print("Outliers:", outliers)
```

```
Sorted data: [ 1  2  3  4  5  6  7  8  9 15]
Q1 (25th percentile): 3.25
Q2 (Median): 5.5
Q3 (75th percentile): 7.75
IQR: 4.5
Lower Fence: -3.5
Upper Fence: 14.5
Outliers: [15]
```


### Five-Number Summary

The five-number summary consists of:

$$
\text{Min}, Q_1, \text{Median}, Q_3, \text{Max}
$$

```python
np.percentile(heights, [0, 25, 50, 75, 100])
```


## **3.6 Organizing Categorical Variables**

### **Frequency Tables (See Python Functions - `Group By` in Chapter 1)**

A **frequency table** counts occurrences in each category.  
A **relative frequency table** shows percentages.

**Example in Python:**

```python
import pandas as pd

# Example dataset
data = pd.Series(["Apple", "Banana", "Apple", "Orange", "Banana", "Apple", "Orange", "Orange"])

# Frequency table (counts)
freq_table = data.value_counts()
print("Frequency Table:")
print(freq_table)

# Relative frequency table (percentages)
rel_freq_table = data.value_counts(normalize=True) * 100
print("\nRelative Frequency Table (%):")
print(rel_freq_table)
```

```
Frequency Table:
Apple     3
Orange    3
Banana    2
Name: count, dtype: int64

Relative Frequency Table (%):
Apple     37.5
Orange    37.5
Banana    25.0
Name: proportion, dtype: float64
```


## **3.7 Visualizing Variables**



Visualizations provide insight into the structure of data.

- **Bar chart** → Categorical data comparison  
- **Pie chart** → Relative proportions  
- **Histogram** → Distribution of numerical data  
- **Boxplot** → Five-number summary and outliers  
- **Scatter plot** → Relationship between two variables  
- **Time series plot** → Trends over time  

Note: For further information, see Chapter 2: Data Visualization.


## **3.8 Summary**

Descriptive statistics provide the foundation for exploratory data analysis. By calculating measures of central tendency, spread, and shape, and by visualizing variables, analysts can identify patterns, errors, and anomalies in the data before moving into predictive or prescriptive analytics.