# Summary Statistics

This notebook shows how to compute basic summary statistics with pandas, using the **Automobile** dataset (`datasets/automobile.csv`) as the running example.

Summary statistics help us understand the central tendency, spread, and shape of the data distribution.

In [4]:
import pandas as pd
import numpy as np

# Load the dataset
# We specify na_values='?' because the dataset uses '?' for missing information
df = pd.read_csv('datasets/automobile.csv', na_values='?')

# Move 'price' to the first column for better visibility
col = df.pop('price')
df.insert(0, 'price', col)

# Display the first few rows to understand the structure
df.head()

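| | price | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | ... | num-of-cylinders | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 13495.0 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | ... | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 |
| 1 | 16500.0 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | ... | four | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 |
| 2 | 16500.0 | 1 | NaN | alfa-romero | gas | std | two | hatchback | rwd | front | ... | six | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 19 | 26 |
| 3 | 13950.0 | 2 | 164.0 | audi | gas | std | four | sedan | fwd | front | ... | four | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102.0 | 5500.0 | 24 | 30 |
| 4 | 17450.0 | 2 | 164.0 | audi | gas | std | four | sedan | 4wd | front | ... | five | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115.0 | 5500.0 | 18 | 22 |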
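
To see why `na_values='?'` matters, the following sketch reads a small in-memory CSV twice. The three-row CSV and the names `raw`, `as_text`, and `as_numeric` are hypothetical and only mimic the `?` placeholder convention; they are not taken from the Automobile file.

```python
import io

import pandas as pd

# Hypothetical three-row CSV that mimics the '?' placeholder convention
raw = io.StringIO("make,horsepower\naudi,102\nbmw,?\nmazda,68\n")

# Without na_values, '?' is kept as text, so the column is not numeric
as_text = pd.read_csv(raw)

raw.seek(0)  # rewind the buffer so it can be read a second time

# With na_values='?', the placeholder becomes NaN and the column is float
as_numeric = pd.read_csv(raw, na_values='?')

print(pd.api.types.is_numeric_dtype(as_text['horsepower']))     # False
print(pd.api.types.is_numeric_dtype(as_numeric['horsepower']))  # True

# Count the missing values in each column
print(as_numeric.isna().sum())
```

A column that stays as text would be silently left out of `describe()`, so converting placeholders at load time is usually the simpler choice.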


## 1. Descriptive Statistics with `describe()`

The `describe()` method provides a quick overview of the numerical columns in a DataFrame. For each one it reports the count, mean, standard deviation, minimum, maximum, and the quartiles (25%, 50%, 75%), ignoring missing values.

In [5]:
# Generate summary statistics for all numerical columns
df.describe()

| | price | symboling | normalized-losses | wheel-base | length | width | height | curb-weight | engine-size | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 201.0 | 205.0 | 164.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 201.0 | 201.0 | 205.0 | 203.0 | 203.0 | 205.0 | 205.0 |
| mean | 13207.129353 | 0.834146 | 122.0 | 98.756585 | 174.049268 | 65.907805 | 53.724878 | 2555.565854 | 126.907317 | 3.329751 | 3.255423 | 10.142537 | 104.256158 | 5125.369458 | 25.219512 | 30.75122 |
| std | 7947.066342 | 1.245307 | 35.442168 | 6.021776 | 12.337289 | 2.145204 | 2.443522 | 520.680204 | 41.642693 | 0.273539 | 0.316717 | 3.97204 | 39.714369 | 479.33456 | 6.542142 | 6.886443 |
| min | 5118.0 | -2.0 | 65.0 | 86.6 | 141.1 | 60.3 | 47.8 | 1488.0 | 61.0 | 2.54 | 2.07 | 7.0 | 48.0 | 4150.0 | 13.0 | 16.0 |
| 25% | 7775.0 | 0.0 | 94.0 | 94.5 | 166.3 | 64.1 | 52.0 | 2145.0 | 97.0 | 3.15 | 3.11 | 8.6 | 70.0 | 4800.0 | 19.0 | 25.0 |
| 50% | 10295.0 | 1.0 | 115.0 | 97.0 | 173.2 | 65.5 | 54.1 | 2414.0 | 120.0 | 3.31 | 3.29 | 9.0 | 95.0 | 5200.0 | 24.0 | 30.0 |
| 75% | 16500.0 | 2.0 | 150.0 | 102.4 | 183.1 | 66.9 | 55.5 | 2935.0 | 141.0 | 3.59 | 3.41 | 9.4 | 116.0 | 5500.0 | 30.0 | 34.0 |
| max | 45400.0 | 3.0 | 256.0 | 120.9 | 208.1 | 72.3 | 59.8 | 4066.0 | 326.0 | 3.94 | 4.17 | 23.0 | 288.0 | 6600.0 | 49.0 | 54.0 |


**Key Metrics Explained:**
*   **count**: Number of non-null observations. A count below 205 (for example, 201 for `price`) means the column has missing values.
*   **mean**: The average value of the column.
*   **std**: Standard deviation, which measures how widely the values are dispersed around the mean.
*   **min/max**: The smallest and largest values.
*   **25%, 50%, 75%**: The quartiles, expressed as percentiles. The 50th percentile is the **median**.
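
As a small illustration of these metrics, the sketch below runs `describe()` on a hypothetical toy frame (the values are made up; only the column names echo the Automobile dataset). It also shows two variations that may be useful: custom percentiles, and summarising a text column.

```python
import pandas as pd

# Hypothetical toy frame with one missing price
toy = pd.DataFrame({
    'price': [5000.0, 7000.0, None, 9000.0, 30000.0],
    'body-style': ['sedan', 'sedan', 'hatchback', 'wagon', 'sedan'],
})

# Default behaviour: numeric columns only; count skips the missing price
numeric_summary = toy.describe()
print(numeric_summary)

# Custom percentiles replace the default 25%/50%/75% rows
decile_summary = toy['price'].describe(percentiles=[0.1, 0.5, 0.9])
print(decile_summary)

# On a text column, describe() reports count, unique, top, and freq instead
text_summary = toy['body-style'].describe()
print(text_summary)
```

Here `count` for `price` is 4 rather than 5, which mirrors how the missing prices lower the count in the full dataset.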

## 2. Specific Aggregations

Sometimes you only need a few statistics for a subset of columns. The `.agg()` method takes a list of aggregation names and returns one row per statistic, which keeps the output more focused than running `describe()` on the whole DataFrame.

In [6]:
# Select specific columns of interest
cols_of_interest = ['price', 'horsepower', 'city-mpg', 'highway-mpg']

# Compute mean, minimum, and maximum for these columns
df[cols_of_interest].agg(['mean', 'min', 'max'])

| | price | horsepower | city-mpg | highway-mpg |
| --- | --- | --- | --- | --- |
| mean | 13207.129353 | 104.256158 | 25.219512 | 30.75122 |
| min | 5118.0 | 48.0 | 13.0 | 16.0 |
| max | 45400.0 | 288.0 | 49.0 | 54.0 |
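
`.agg()` also accepts a dictionary when different columns need different statistics. The sketch below assumes a hypothetical toy frame with made-up values; `per_column` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({
    'price': [5000.0, 7000.0, 9000.0, 30000.0],
    'horsepower': [60.0, 70.0, 90.0, 200.0],
})

# Map each column to the list of statistics wanted for it
per_column = toy.agg({
    'price': ['mean', 'median'],
    'horsepower': ['min', 'max'],
})

# Statistics that were not requested for a column show up as NaN
print(per_column)
```

The result has one row per requested statistic, so mixing different statistics across columns leaves gaps; passing a single shared list avoids them.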


## 3. Individual Statistics

You can also compute individual statistics for specific columns directly. This is useful for assigning values to variables for further analysis.

In [7]:
# Calculate the Mean Price
price_mean = df['price'].mean()
print(f"Mean Price: {price_mean:.2f}")

# Calculate the Median Price
# The median is the middle value and is less sensitive to outliers than the mean
price_median = df['price'].median()
print(f"Median Price: {price_median:.2f}")

# Calculate Standard Deviation
# A high std dev indicates the prices are spread out over a wider range
price_std = df['price'].std()
print(f"Price Std Dev: {price_std:.2f}")

# Calculate Variance
# Variance is the square of the standard deviation
price_var = df['price'].var()
print(f"Price Variance: {price_var:.2f}")

Mean Price: 13207.13
Median Price: 10295.00
Price Std Dev: 7947.07
Price Variance: 63155863.44
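
One detail worth knowing is that pandas computes the *sample* standard deviation and variance by default (`ddof=1`), whereas NumPy defaults to the population version (`ddof=0`). The sketch below checks both relationships on a hypothetical series of made-up prices.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one missing value
prices = pd.Series([5000.0, 7000.0, 9000.0, 30000.0, np.nan])

# pandas: sample standard deviation (ddof=1), NaN skipped automatically
sample_std = prices.std()

# Population standard deviation, matching NumPy's default ddof=0
population_std = prices.std(ddof=0)
numpy_std = np.nanstd(prices.to_numpy())

print(f"Sample std:     {sample_std:.2f}")
print(f"Population std: {population_std:.2f}")

# Variance is the square of the standard deviation (same ddof on both sides)
print(np.isclose(prices.var(), sample_std ** 2))   # True
print(np.isclose(population_std, numpy_std))       # True
```

The sample version divides by `n - 1` instead of `n`, so it is always slightly larger; the gap shrinks as the number of observations grows.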


## 4. Quantiles

A quantile is the value at or below which a given fraction of the observations fall. The most common quantiles are the quartiles (25%, 50%, 75%).

In [8]:
# Calculate specific quantiles for Price
# 0.25 = 25th percentile (1st quartile)
# 0.50 = 50th percentile (Median)
# 0.75 = 75th percentile (3rd quartile)
price_quantiles = df['price'].quantile([0.25, 0.5, 0.75])

print("Price Quantiles:")
print(price_quantiles)

Price Quantiles:
0.25     7775.0
0.50    10295.0
0.75    16500.0
Name: price, dtype: float64


### Interpretation of Quantiles

*   **25th Percentile (0.25)**: 25% of the cars cost 7,775 or less.
*   **50th Percentile (Median)**: 50% of the cars cost 10,295 or less. This is the middle price.
*   **75th Percentile (0.75)**: 75% of the cars cost 16,500 or less (equivalently, 25% of the cars cost more).
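
Quartiles are often combined into the interquartile range (IQR = Q3 - Q1), which measures the spread of the middle 50% of the data. The sketch below uses a hypothetical series of made-up prices and the common 1.5 × IQR rule of thumb to flag unusually high or low values; the rule is a convention, not something this notebook's dataset prescribes.

```python
import pandas as pd

# Hypothetical prices with one obviously extreme value
prices = pd.Series([5000.0, 6000.0, 7000.0, 8000.0,
                    9000.0, 10000.0, 11000.0, 45000.0])

q1 = prices.quantile(0.25)   # 6750.0 with the default linear interpolation
q3 = prices.quantile(0.75)   # 10250.0
iqr = q3 - q1                # spread of the middle 50% of the data

# Rule-of-thumb fences: values outside them are flagged as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(f"IQR: {iqr:.1f}")
print(outliers)
```

Only the 45,000 entry falls outside the fences here, which matches the intuition that it sits far from the rest of the values.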

## 5. Correlation

Correlation measures the strength and direction of the relationship between two numerical variables. 
The result is a value between -1 and 1:
*   **1**: Perfect positive correlation (as x increases, y increases)
*   **-1**: Perfect negative correlation (as x increases, y decreases)
*   **0**: No linear correlation (the variables may still be related in a non-linear way)

### Pearson Correlation

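Pearson correlation assesses the **linear** relationship between two continuous variables. It measures how closely the points follow a straight line and is sensitive to outliers. Normality is not required to compute the coefficient; it matters mainly for significance tests based on it.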

In [9]:
# Calculate Pearson Correlation
# We'll look at the correlation between Price, Horsepower, Engine Size, and MPG
cols_corr = ['price', 'horsepower', 'engine-size', 'city-mpg', 'highway-mpg']
pearson_corr = df[cols_corr].corr(method='pearson')

print("Pearson Correlation Matrix:")
print(pearson_corr)

Pearson Correlation Matrix:
                price  horsepower  engine-size  city-mpg  highway-mpg
price        1.000000    0.810533     0.872335 -0.686571    -0.704692
horsepower   0.810533    1.000000     0.810773 -0.803620    -0.770908
engine-size  0.872335    0.810773     1.000000 -0.653658    -0.677470
city-mpg    -0.686571   -0.803620    -0.653658  1.000000     0.971337
highway-mpg -0.704692   -0.770908    -0.677470  0.971337     1.000000
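
To make the Pearson coefficient less of a black box, the sketch below computes it from its definition (the sum of products of deviations from the mean, divided by the square root of the product of the summed squared deviations) and compares the result with pandas. The two series hold made-up values, and `r_manual` and `r_pandas` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower and price values for five cars
horsepower = pd.Series([60.0, 70.0, 90.0, 110.0, 200.0])
price = pd.Series([5000.0, 7000.0, 9000.0, 13000.0, 30000.0])

# Deviations of each observation from its column mean
dx = horsepower - horsepower.mean()
dy = price - price.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity (method='pearson' is the default)
r_pandas = horsepower.corr(price)

print(f"Manual: {r_manual:.6f}")
print(f"pandas: {r_pandas:.6f}")
```

Because the deviations are divided by their own magnitudes, the result is unit-free and always lies between -1 and 1.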


### Spearman Correlation

Spearman's rank correlation assesses the **monotonic** relationship (whether linear or not). It is based on the ranked values rather than the raw data and is less sensitive to outliers.

In [10]:
# Calculate Spearman Correlation
spearman_corr = df[cols_corr].corr(method='spearman')

print("\nSpearman Correlation Matrix:")
print(spearman_corr)


Spearman Correlation Matrix:
                price  horsepower  engine-size  city-mpg  highway-mpg
price        1.000000    0.850532     0.828417 -0.831284    -0.827265
horsepower   0.850532    1.000000     0.819521 -0.912505    -0.884378
engine-size  0.828417    0.819521     1.000000 -0.730056    -0.721342
city-mpg    -0.831284   -0.912505    -0.730056  1.000000     0.967738
highway-mpg -0.827265   -0.884378    -0.721342  0.967738     1.000000
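
Spearman's coefficient is equivalent to the Pearson coefficient computed on the ranks of the data. The sketch below illustrates this on a hypothetical, perfectly monotonic but non-linear relationship (`y = x ** 3`); the frame and variable names are made up for the example.

```python
import pandas as pd

# Hypothetical data: y always increases with x, but not along a straight line
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
toy['y'] = toy['x'] ** 3

# Pearson is below 1 because the relationship is curved
pearson = toy.corr(method='pearson').loc['x', 'y']

# Spearman is 1 because the ordering of x and y agrees perfectly
spearman = toy.corr(method='spearman').loc['x', 'y']

# Pearson on the ranks reproduces the Spearman value
rank_pearson = toy.rank().corr(method='pearson').loc['x', 'y']

print(f"Pearson:          {pearson:.4f}")
print(f"Spearman:         {spearman:.4f}")
print(f"Pearson on ranks: {rank_pearson:.4f}")
```

A Spearman value noticeably stronger than the Pearson value for the same pair is therefore a hint that the relationship is monotonic but not linear.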


### Interpretation of Results

*   **Price vs Horsepower (0.81 Pearson, 0.85 Spearman)**: Strong positive correlation. As horsepower increases, price tends to increase.
*   **Price vs City-MPG (-0.69 Pearson, -0.83 Spearman)**: Strong negative correlation. Cars with higher fuel efficiency (MPG) tend to be cheaper. The stronger Spearman value suggests the relationship is monotonic but not strictly linear.
*   **City-MPG vs Highway-MPG (0.97 Pearson, 0.97 Spearman)**: Very strong positive correlation, which makes sense because both measure the fuel efficiency of the same car.