# 03: Summary Statistics

**Objective:** Quickly summarize your data’s center, spread, and shape so you can spot patterns and anomalies before diving deeper.

**Key Steps:**
<table width="100%">
  <tr>
    <td style="vertical-align: top; text-align: left; width: 60%; padding-right: 20px;">
      <ol style="font-size: 20px; line-height: 1.4;">
        <li>Compute central tendency (mean, median, mode)</li>
        <li>Measure dispersion (range, variance, standard deviation, IQR)</li>
        <li>Assess distribution shape (skewness, kurtosis)</li>
        <li>Detect anomalies or outliers early </li>
      </ol>
    </td>
    <td style="vertical-align: top; text-align: left; width: 40%;">
      <!-- relative path to cleaning.png -->
      <img src="../slides/sum stats.png" alt="summary stats" width="1000" />
    </td>
  </tr>
</table>

---
<audio controls src="../audio/summary.m4a"></audio>

---

# Summary Statistics: Step-by-Step

We’ll walk through calculating key summary statistics on our COMPAS dataset in both R and Python. Each step includes:

1. **What we’re doing** (plain English)  
2. **Code to run**  
3. **What the output tells you**

#### Review: Loading the Data

Reload the COMPAS CSV into our table `df`, which contains **24,272 rows** and **24 columns**. With the data in `df`, we’re ready to compute our summary statistics.  


In [1]:

df <- read.csv("../data/compas_scores_raw.csv", stringsAsFactors = FALSE)

#### Step 1: Compute the Mean (Average)

The mean tells us the “typical” value by adding up all values and dividing by the count.

In [2]:
mean_raw <- mean(df$RawScore, na.rm = TRUE)
mean_raw

**What you’ll see:**  
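A single number (in this case, `15.557`). This is the arithmetic center, or balance point, of the RawScore distribution. Because every value feeds into the sum, a few very large or very small scores can pull the mean toward them, which is why it helps to compare it with the median.  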
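
To make the definition concrete, here is a minimal sketch on a handful of made-up scores (the `scores` vector is invented for illustration and is not COMPAS data). It computes the mean by hand and shows why `na.rm = TRUE` matters when a value is missing.

```r
# Toy scores, invented for illustration (not COMPAS data)
scores <- c(12, 15, 18, NA, 20)

# Drop missing values, then add up what is left and divide by the count
observed    <- scores[!is.na(scores)]
manual_mean <- sum(observed) / length(observed)

manual_mean                  # 16.25
mean(scores, na.rm = TRUE)   # 16.25, the same calculation done by R
mean(scores)                 # NA, because one value is missing
```

Without `na.rm = TRUE`, a single missing value turns the whole result into `NA`, which is why the cells in this notebook pass it.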


#### Step 2: Find the Median

The **median** is the middle value when you sort all the numbers. It’s less affected by extreme values.

In [3]:
median_raw <- median(df$RawScore, na.rm = TRUE)
median_raw

**What you’ll see:**  
- One number (for example, `16`).  
- This is the “middle” score: 50% of your data are below it and 50% are above it.  
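- If this middle number is very different from the average you calculated before, your data are lopsided: a few extreme values on one side are pulling the mean away from the median. Here the mean (`15.557`) sits just below the median (`16`), which hints at a mild pull toward low scores.  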
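
The sketch below illustrates why the median is called robust. It uses made-up numbers (not COMPAS data) and adds one extreme value to see how far the mean and the median each move.

```r
# Toy scores, invented for illustration (not COMPAS data)
typical      <- c(10, 12, 13, 15, 16)
with_outlier <- c(typical, 90)   # same scores plus one extreme value

mean(typical)          # 13.2
median(typical)        # 13

mean(with_outlier)     # 26
median(with_outlier)   # 14
```

One extreme score nearly doubles the mean but shifts the median by a single point.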


#### Step 3: Get the Mode

The **mode** is the value that appears most often in your data. It highlights the single most common observation.

**Note:** Base R has no built-in function for the statistical mode (its `mode()` function reports an object’s storage type instead). One convenient option is to load the **DescTools** package and use `Mode()`.

In [4]:
library(DescTools)

Mode(df$RawScore, na.rm = TRUE)

**What you’ll see:**  
A single number (for example, `13`) — the RawScore value that occurs more frequently than any other. This indicates the most typical individual score in your dataset.

**Why it’s useful:**  
- **Identifies the peak** of your distribution, showing where data cluster most densely.  
- **Complements mean and median**, especially when your data are skewed or have multiple peaks.  
- **Helps with categorical data**, by revealing the most popular category or choice.  
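
If you would rather not install a package, the mode can also be found with base R’s `table()`. This is a rough sketch on made-up scores (not COMPAS data); the `modes` variable name is just for illustration. It keeps every value tied for the highest count, so it also copes with data that have more than one peak.

```r
# Toy scores, invented for illustration (not COMPAS data)
scores <- c(13, 16, 13, 21, 16, 13, NA)

# Count how often each distinct value occurs; table() ignores NA by default
counts <- table(scores)

# Keep every value that shares the highest count (handles ties)
modes <- as.numeric(names(counts)[counts == max(counts)])
modes   # 13
```

The same approach works for categorical columns if you drop the `as.numeric()` conversion.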

#### Step 4: Measure How Spread Out Your Data Are

We want to know not just the “middle” of our scores but how far apart they are. We’ll use three measures:

- **Range**: the gap between the smallest and largest value  
- **Variance**: roughly the average of squared differences from the mean (gets bigger when values are more spread out); R’s `var()` divides by n - 1 rather than n, the usual sample estimate  
- **Standard Deviation (SD)**: the square root of the variance, which tells you in the same units how far values typically fall from the mean  

In [5]:
range_raw <- range(df$RawScore, na.rm = TRUE)
var_raw   <- var(df$RawScore, na.rm = TRUE)
sd_raw    <- sd(df$RawScore, na.rm = TRUE)
range_raw; var_raw; sd_raw

**What you’ll see:** 

- **Range: 0.01 to 51**  
  The lowest score is 0.01, the highest is 51—a span of about 51 points. This tells you there are both extremely low and extremely high risk assessments in your data.

- **Variance: 70.3866**  
  Variance squares each deviation from the mean, so this large number confirms the scores are widely spread out.

- **Standard deviation: 8.3897**  
  By taking the square root of the variance, we get back to the original units: a RawScore typically lies about 8.4 points away from the mean. A high SD signals substantial variability in individual risk levels.
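
To see where these numbers come from, the sketch below rebuilds the range, variance, and standard deviation by hand on made-up scores (not COMPAS data). The `manual_var` and `manual_sd` names are illustrative only.

```r
# Toy scores, invented for illustration (not COMPAS data)
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Distance of each score from the mean (the mean here is 5)
deviations <- scores - mean(scores)

# Sample variance: sum of squared deviations divided by n - 1
manual_var <- sum(deviations^2) / (length(scores) - 1)

# Standard deviation: square root of the variance, back in score units
manual_sd <- sqrt(manual_var)

diff(range(scores))   # 7, the span from smallest to largest
manual_var            # about 4.57, the same as var(scores)
manual_sd             # about 2.14, the same as sd(scores)
```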

**Why it matters:**  
- Such a wide range and high spread mean outliers can heavily influence analyses.  
- Before modeling, you’ll often standardize or normalize these scores so that variables measured on different scales contribute comparably to downstream algorithms.  
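
One common way to standardize is the z-score: subtract the mean and divide by the standard deviation. The sketch below uses made-up scores (not COMPAS data) and is only one of several possible rescaling choices.

```r
# Toy scores, invented for illustration (not COMPAS data)
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score: how many standard deviations each score sits from the mean
z <- (scores - mean(scores)) / sd(scores)

mean(z)   # 0, up to floating-point rounding
sd(z)     # 1

# Base R's scale() performs the same centering and scaling
z_scaled <- as.numeric(scale(scores))
```

After this step the scores have mean 0 and standard deviation 1, whatever their original units were.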

#### Quick Overall Summary

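Now that we’ve calculated each statistic separately, we can use one command to get several of them at once. `summary()` reports the minimum, quartiles, median, mean, and maximum (but not the mode, variance, or standard deviation), giving you a fast “at-a-glance” view of your data’s key features.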


In [6]:
summary(df$RawScore)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.01   13.00   16.00   15.56   21.00   51.00 

**What you’ll see:**  

- **Min.** – the lowest score in your data, showing the smallest value recorded  
- **1st Qu.** – the 25th percentile, meaning 25% of the scores fall below this point  
- **Median** – the 50th percentile, the “middle” score where half the values are below and half above  
- **Mean** – the average score, calculated by summing all scores and dividing by the number of scores  
- **3rd Qu.** – the 75th percentile, meaning 75% of the scores fall below this point  
- **Max.** – the highest score in your data, showing the largest value recorded


**Why it matters:**  
This one-line summary gives you a quick snapshot of where your data cluster (center), how much they vary (spread), and the key cutoff points (percentiles). It’s an essential sanity check before creating charts or running more advanced analyses.  
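
The quartiles in the summary also give us the interquartile range (IQR): 21 - 13 = 8 for RawScore. One common rule of thumb, sketched below on made-up scores (not COMPAS data), flags values more than 1.5 × IQR beyond the quartiles as potential outliers. The 1.5 multiplier is a convention rather than anything specific to this dataset, and the variable names are illustrative.

```r
# Toy scores, invented for illustration (not COMPAS data)
scores <- c(1, 13, 14, 15, 16, 17, 18, 21, 22, 51)

# First and third quartiles (25th and 75th percentiles)
q   <- quantile(scores, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
iqr <- IQR(scores, na.rm = TRUE)   # same as q[2] - q[1]

# Fences: anything outside them is worth a closer look
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

flagged <- scores[scores < lower_fence | scores > upper_fence]
flagged   # 1 and 51
```

Applied to the RawScore quartiles shown earlier, the same rule would put the fences at 13 - 12 = 1 and 21 + 12 = 33. Flagged values are candidates for review, not automatic deletions.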

#### Step 5: Assess Distribution Shape (Skewness & Kurtosis)

We want to know whether our RawScore values are symmetrically distributed or lopsided, and whether they have heavy tails or are more tightly clustered than a normal bell curve.

---

**What we’re doing in plain English**  
- **Skewness** measures asymmetry:  
  - A **positive** skewness means a longer right tail (more extreme high scores).  
  - A **negative** skewness means a longer left tail (more extreme low scores).  
- **Kurtosis** measures “tailedness”:  
  - A **high** kurtosis (> 3) means heavy tails and more outliers than a normal distribution.  
  - A **low** kurtosis (< 3) means light tails and fewer outliers.
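
Both measures are built from the central moments of the data. The sketch below computes them by hand on made-up scores (not COMPAS data), using the moment-ratio definitions under which a normal distribution has kurtosis 3. These should agree with what the `moments` package reports, though it is worth confirming against the package documentation.

```r
# Toy scores, invented for illustration (not COMPAS data)
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)

d  <- scores - mean(scores)   # deviations from the mean
m2 <- mean(d^2)               # second central moment (divides by n, not n - 1)
m3 <- mean(d^3)               # third central moment: keeps the sign of the tail
m4 <- mean(d^4)               # fourth central moment: dominated by extreme values

skew_manual <- m3 / m2^(3 / 2)
kurt_manual <- m4 / m2^2

skew_manual   # 0.65625: positive, because the 9 stretches the right tail
kurt_manual   # 2.78125: slightly below 3, so tails a little lighter than normal
```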

---
> **Load the `moments` package**  
> Base R does not include built-in functions to calculate skewness and kurtosis. By loading `moments`, we gain easy access to `skewness()` and `kurtosis()`, letting us quickly measure how asymmetric our data are and how heavy their tails are compared to a normal distribution. Note that `kurtosis()` in `moments` reports raw kurtosis, for which a normal distribution scores 3, not excess kurtosis.  



In [7]:
library(moments)

# drop NAs, then calculate
skew_raw <- skewness(df$RawScore, na.rm = TRUE)
kurt_raw <- kurtosis(df$RawScore, na.rm = TRUE)

# show the results
cat("Skewness:", skew_raw, "\n")
cat("Kurtosis:", kurt_raw, "\n")

Skewness: -0.2938058 


Kurtosis: 3.06853 


#### Interpreting Skewness and Kurtosis

- **Skewness: -0.294**  
A negative value means the distribution has a longer left tail. In plain terms, most scores cluster on the right, but a few unusually low scores stretch out to the left.

- **Kurtosis: 3.069**  
A value close to 3 matches the “normal” bell curve. This tells you your data have about the expected number of extreme values, with tails only marginally heavier than normal.

**Why it matters**  
- Slight left skew (-0.294) indicates the distribution is close to symmetric, so you likely don’t need to transform or trim low-end scores unless your analysis demands very strict symmetry.
- Near-normal kurtosis (3.069) means extreme values are about as common as a normal distribution would predict, so the tails need no special treatment.  


## Next Steps: Data Visualization

We’ve completed our summary statistics and are ready to see the data in action. Choose your path:

1. **Visualize in R** by opening `04_data_visualization_R.ipynb`  
2. **Visualize in Python** by opening `04_data_visualization_Python.ipynb`  

Pick the notebook for your preferred language and let’s start creating charts to uncover patterns in our recidivism data!  
