# Generating Summary Statistics Using Pandas and SciPy

Descriptive statistics provide a **quantitative summary of a variable and the data points that comprise it**. They help us understand:

- Typical values in a dataset
- How spread out the data is
- Whether unusual or potentially dangerous values exist

### Real-world motivation
Imagine working for a company that monitors patients' health in real time. Incoming sensor data is continuously generated. By computing **summary statistics in micro-batches**, you can:

- Detect unusually high or low values
- Trigger alerts when thresholds are exceeded
- Identify potential anomalies indicating dangerous health conditions
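
The micro-batch idea above can be sketched roughly as follows. This is only an illustration: the heart-rate readings and the `LOW`/`HIGH` thresholds are hypothetical values invented for the example, not part of the original notebook or any clinical guidance.

```python
import pandas as pd

# Hypothetical heart-rate readings (beats per minute) for one micro-batch.
batch = pd.Series([72, 75, 71, 190, 74, 38, 73], name="heart_rate")

# Hypothetical alert thresholds, chosen only for illustration.
LOW, HIGH = 40, 180

# Summary statistics for the batch.
summary = batch.agg(["min", "max", "mean", "median"])

# Readings that fall outside the allowed range would trigger an alert.
out_of_range = batch[(batch < LOW) | (batch > HIGH)]

print(summary)
print(out_of_range)
```

In a real pipeline the same computation would presumably run once per incoming batch, with the flagged readings passed on to an alerting component.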

### Two categories of descriptive statistics

1. **Statistics describing values**
   - Sum
   - Mean
   - Median
   - Maximum / Minimum

2. **Statistics describing spread (distribution)**
   - Standard deviation
   - Variance
   - Counts
   - Quartiles

In this notebook, we generate these statistics using **pandas** and **scipy**, using the classic **mtcars dataset**.

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

import scipy
from scipy import stats

## Loading the dataset

The **mtcars** dataset contains specifications for 32 cars, including mileage, engine configuration, weight, and transmission type.

We load the CSV file and assign readable column names for clarity.

In [2]:
address = '/workspaces/python-for-data-science-and-machine-learning-essential-training-part-1-3006708/data/mtcars.csv'

cars = pd.read_csv(address)
cars.columns = ['car_names','mpg','cyl','disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb']

### Previewing the data

Before computing statistics, it is always good practice to inspect the first few rows of the dataset.

In [3]:
cars.head()

## Summary statistics describing numeric values

These statistics describe **typical or extreme values** for each variable.

### Column-wise sum

The `sum()` method adds all values **column by column** by default.

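⚠️ Note: `sum()` does not skip non-numeric columns by default. For a text column such as car names, the strings are concatenated into one long string. Pass `numeric_only=True` to exclude such columns.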

In [4]:
cars.sum()
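
To see the effect of a text column on `sum()` in isolation, here is a minimal sketch on a toy frame. The frame `toy` and its values are hypothetical; only the column names mirror the notebook's.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the notebook's.
toy = pd.DataFrame({
    "car_names": ["car_a", "car_b", "car_c"],
    "mpg": [21.0, 22.8, 18.7],
    "cyl": [6, 4, 8],
})

all_sums = toy.sum()                       # strings in car_names are concatenated
numeric_sums = toy.sum(numeric_only=True)  # car_names is skipped

print(all_sums)
print(numeric_sums)
```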

### Row-wise sum

To sum values **row-wise**, specify `axis=1`. We also pass `numeric_only=True` so that text columns are excluded from each row's total.

In [5]:
cars.sum(axis=1, numeric_only=True)

### Median

The **median** represents the middle value when data is ordered.

It is less sensitive to extreme values than the mean.

In [6]:
cars.median(numeric_only=True)
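
The claim that the median is less sensitive to extreme values can be checked with a small sketch. The two series below are hypothetical horsepower values made up for the example; they differ only in the last entry.

```python
import pandas as pd

# Hypothetical horsepower values; the second series ends in a deliberate outlier.
hp = pd.Series([90, 95, 100, 105, 110])
hp_with_outlier = pd.Series([90, 95, 100, 105, 610])

print(hp.mean(), hp.median())                            # 100.0 100.0
print(hp_with_outlier.mean(), hp_with_outlier.median())  # 200.0 100.0
```

A single outlier doubles the mean here, while the median does not move.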

### Mean

The **mean** is the arithmetic average of each variable: the sum of its values divided by the number of values.

In [7]:
cars.mean(numeric_only=True)

### Maximum values

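The `max()` method identifies the **largest observed value** in each variable. For a text column such as `car_names`, the "largest" value is the one that sorts last alphabetically.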

In [8]:
cars.max()

### Identifying where the maximum occurs

To find **which row contains the maximum value**, use `idxmax()`. It returns the index label of the first occurrence of the maximum.

Here we locate the index label of the row with the highest miles-per-gallon (mpg) value.

In [9]:
mpg = cars.mpg
mpg.idxmax()
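
Once the label is known, the full row can be retrieved with `.loc`. The sketch below assumes a toy frame with hypothetical values rather than the real dataset.

```python
import pandas as pd

# Toy frame with hypothetical values.
toy = pd.DataFrame({
    "car_names": ["car_a", "car_b", "car_c"],
    "mpg": [21.0, 33.9, 18.7],
})

best_label = toy["mpg"].idxmax()  # index label of the first maximum
best_row = toy.loc[best_label]    # full row for that label

print(best_label)
print(best_row["car_names"], best_row["mpg"])
```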

## Summary statistics describing distribution

These statistics help us understand **how spread out** the data is.

### Standard deviation

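Standard deviation measures the **typical distance of values from the mean**. It is the square root of the variance, so it is expressed in the same units as the data. Note that pandas computes the sample standard deviation by default (`ddof=1`, dividing by n - 1).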

Higher values indicate greater variability.

In [10]:
cars.std(numeric_only=True)
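
The following sketch shows what `std()` computes and how the `ddof` argument changes the result. The series `values` is an arbitrary example, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Arbitrary example values with mean 5.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()            # pandas default: ddof=1, divides by n - 1
population_std = values.std(ddof=0)  # divides by n, like np.std's default

# The sample formula written out by hand.
squared_deviations = (values - values.mean()) ** 2
manual_std = np.sqrt(squared_deviations.sum() / (len(values) - 1))

print(sample_std, manual_std)
print(population_std, np.std(values.to_numpy()))
```

The distinction matters mostly for small samples; with many observations the two results are nearly identical.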

### Variance

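Variance is the **square of the standard deviation**. Because each deviation from the mean is squared, large deviations contribute disproportionately, and the result is expressed in squared units of the original variable.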

In [11]:
cars.var(numeric_only=True)
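
A small sketch of how one distant value dominates the variance, using two made-up series that differ in a single entry:

```python
import pandas as pd

# Two made-up series that differ only in the last value.
tight = pd.Series([9.0, 10.0, 11.0])
one_far = pd.Series([9.0, 10.0, 20.0])

print(tight.var(), one_far.var())

# Variance equals the squared standard deviation (both use ddof=1 by default).
print(one_far.std() ** 2)
```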

### Frequency counts

`value_counts()` shows how often each unique value appears.

This is especially useful for **categorical or discrete variables**, such as number of gears.

In [12]:
gear = cars.gear
gear.value_counts()
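
When proportions are more useful than raw counts, `value_counts()` accepts `normalize=True`. The sketch below uses a hypothetical `gear` series, not the real column.

```python
import pandas as pd

# Hypothetical gear values for eight cars.
gear = pd.Series([4, 4, 3, 3, 3, 5, 4, 3], name="gear")

counts = gear.value_counts()                # sorted from most to least frequent
shares = gear.value_counts(normalize=True)  # relative frequencies that sum to 1

print(counts)
print(shares)
```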

## Comprehensive summary with `describe()`

The `describe()` method produces a **statistical summary** in a single table. By default it covers only the numeric columns and includes:

- Count
- Mean
- Standard deviation
- Minimum
- Quartiles (25%, 50%, 75%)
- Maximum

This is often the **first command run during exploratory data analysis (EDA)**.

In [13]:
cars.describe()
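
SciPy offers a comparable one-call summary through `scipy.stats.describe`, which also reports skewness and kurtosis. The sketch below is one possible way to put the two side by side; the `mpg` values are hypothetical stand-ins for the real column.

```python
import pandas as pd
from scipy import stats

# Hypothetical mpg values standing in for the real column.
mpg = pd.Series([21.0, 22.8, 18.7, 14.3, 24.4, 30.4, 15.2, 19.2], name="mpg")

pandas_summary = mpg.describe()               # count, mean, std, min, quartiles, max
quartiles = mpg.quantile([0.25, 0.5, 0.75])   # the same quartiles, requested directly

# SciPy's summary: nobs, minmax, mean, variance (ddof=1), skewness, kurtosis.
scipy_summary = stats.describe(mpg.to_numpy())

print(pandas_summary)
print(quartiles)
print(scipy_summary)
```

The pandas summary reports quartiles, while the SciPy summary reports shape measures, so the two complement each other.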