---
## 📘 Author Information

**👨‍💻 Name:** Abdul Rehman  
**📌 Role:** Data Science Enthusiast | Python Learner  
**📅 Notebook Created:** July 2025  

**🔗 Connect with Me:**  


[![LinkedIn](https://img.shields.io/badge/LinkedIn-blue?style=flat&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/abdul-rehman-74b418350/)
[![GitHub](https://img.shields.io/badge/GitHub-black?style=flat&logo=github&logoColor=white)](https://github.com/datawithrehman/Data-Science-Beginning)
[![Twitter](https://img.shields.io/badge/Twitter-blue?style=flat&logo=twitter&logoColor=white)](https://x.com/datawithrehman)



# Descriptive Statistics with Pandas

## Introduction

Welcome to this beginner-friendly guide on performing descriptive statistics using the powerful pandas library in Python! Descriptive statistics help us summarize and understand the main features of a dataset. Instead of sifting through raw data, we can quickly get insights into its central tendency, variability, and distribution.

This notebook will cover essential pandas functions like `describe()`, `mean()`, `median()`, `std()`, `mode()`, and `value_counts()`, along with practical examples.


## 1. Getting Started: Importing Pandas and Creating a DataFrame

First, let's import the pandas library and create a sample DataFrame that we'll use throughout this tutorial.

In [None]:
import pandas as pd

# Create a sample DataFrame
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 28, 32, 38, 42, 48],
    'Income': [50000, 60000, 75000, 80000, 90000, 100000, 110000, 120000, 130000, 140000, 55000, 68000, 82000, 95000, 105000],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male']
}
df = pd.DataFrame(data)

print("Our DataFrame:\n")
print(df.head())

## 2. The Power of `df.describe()`

The `describe()` method generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding `NaN` values. By default it reports only the numerical columns of a DataFrame, which makes it a quick first overview of those columns.

For numerical data, `describe()` provides:
- `count`: Number of non-null observations
- `mean`: Arithmetic mean
- `std`: Sample standard deviation (normalized by N-1)
- `min`: Minimum value
- `25%` (Q1): 25th percentile (first quartile)
- `50%` (Q2): 50th percentile (median)
- `75%` (Q3): 75th percentile (third quartile)
- `max`: Maximum value

In [None]:
print("\nDescriptive statistics for numerical columns:\n")
print(df.describe())
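
As a rough illustration of how the rows of `describe()` relate to the individual methods covered later, the sketch below reproduces a few of them by hand. The Series `ages` and its values are made up for this example and are not the tutorial's DataFrame; the `percentiles` argument is shown only as one optional way to request other cut points.

```python
import pandas as pd

# Small illustrative Series (hypothetical values, not the tutorial's data)
ages = pd.Series([25, 30, 35, 40, 45], name="Age")

summary = ages.describe()

# Each entry in the summary matches a dedicated method
print(summary["mean"], ages.mean())
print(summary["50%"], ages.median())
print(summary["25%"], ages.quantile(0.25))
print(summary["std"], ages.std())

# Optionally request different percentiles (the median is always included)
custom = ages.describe(percentiles=[0.1, 0.9])
print(custom)
```

If the default quartiles are all you need, plain `describe()` is usually enough.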

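By default, `describe()` skips text columns. To include them, pass `include='all'`. Categorical columns then report `count`, `unique`, `top` (the most frequent value), and `freq` (how often it occurs); statistics that do not apply to a column appear as `NaN`: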

In [None]:
print("\nDescriptive statistics for all columns (including categorical):\n")
print(df.describe(include='all'))
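
If you only want the summary for the text columns, one option is to exclude the numeric ones instead. The sketch below assumes a small made-up DataFrame called `people`; the names and values are illustrative only.

```python
import pandas as pd

# Hypothetical mini DataFrame with one numeric and one text column
people = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
})

# Drop numeric columns so only the text columns are summarized
text_summary = people.describe(exclude="number")
print(text_summary)

# Individual entries can be read with .loc[row, column]
print("Most frequent gender:", text_summary.loc["top", "Gender"])
```

The result has the rows `count`, `unique`, `top`, and `freq` rather than the numeric statistics.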

## 3. Specific Statistical Measures

Pandas also allows you to calculate individual statistical measures for specific columns.

### Mean (`.mean()`)

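The arithmetic mean of a column: the sum of the values divided by how many there are. Missing values (`NaN`) are skipped by default.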

In [None]:
print(f"\nAverage Age: {df['Age'].mean():.2f}")
print(f"Average Income: {df['Income'].mean():.2f}")
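
To make the handling of missing values concrete, here is a minimal sketch using a made-up Series named `incomes` (not part of the tutorial's DataFrame). It shows that `mean()` ignores `NaN` unless you ask it not to.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one missing entry
incomes = pd.Series([50000, 60000, np.nan, 80000])

# Default behaviour: the NaN is skipped, so three values are averaged
default_mean = incomes.mean()

# The same result computed by hand: sum of non-null values / their count
manual_mean = incomes.sum() / incomes.count()

# With skipna=False, a single NaN makes the whole result NaN
strict_mean = incomes.mean(skipna=False)

print(default_mean, manual_mean, strict_mean)
```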

### Median (`.median()`)

The middle value of a column when sorted (or the average of the two middle values when the count is even). It's less affected by outliers than the mean.

In [None]:
print(f"\nMedian Age: {df['Age'].median()}")
print(f"Median Income: {df['Income'].median()}")
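
The claim about outliers is easy to check. The sketch below uses made-up salary figures (the names `salaries` and `with_outlier` are only for illustration) and adds a single extreme value to see how each measure reacts.

```python
import pandas as pd

# Hypothetical salaries without any extreme values
salaries = pd.Series([30000, 35000, 40000, 45000, 50000])

# The same salaries plus one very large outlier
with_outlier = pd.concat([salaries, pd.Series([1_000_000])], ignore_index=True)

print("Mean before/after:  ", salaries.mean(), with_outlier.mean())
print("Median before/after:", salaries.median(), with_outlier.median())
```

Here the mean jumps from 40,000 to 200,000, while the median only moves from 40,000 to 42,500.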

### Standard Deviation (`.std()`)

A measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean, while a high standard deviation indicates that the values are spread out over a wider range. By default, pandas computes the sample standard deviation, dividing by N-1 rather than N.

In [None]:
print(f"\nStandard Deviation of Age: {df['Age'].std():.2f}")
print(f"Standard Deviation of Income: {df['Income'].std():.2f}")
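
One detail that often surprises beginners is that `std()` in pandas and `np.std()` in NumPy use different defaults. The sketch below, built on a small made-up Series called `values`, compares the sample and population versions via the `ddof` parameter.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the population standard deviation is exactly 2
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = values.std()            # pandas default: ddof=1 (divide by N-1)
population_std = values.std(ddof=0)  # divide by N instead
numpy_std = np.std(values.to_numpy())  # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```

The sample version is always slightly larger than the population version for the same data, and the gap shrinks as the number of observations grows.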

### Mode (`.mode()`)

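The value that appears most frequently in a column. A column can have one mode or several tied modes. `mode()` always returns a Series containing every tied value, so when no value repeats (as with `Age` here) it returns all of the values rather than nothing.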

In [None]:
print("\nMode of Age:\n", df['Age'].mode())
print("Mode of City:\n", df['City'].mode())
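
Because `mode()` returns a Series rather than a single number, it helps to see how it behaves with ties. The three small Series below are made up for illustration.

```python
import pandas as pd

# One clear winner: 2 appears twice
single = pd.Series([1, 2, 2, 3]).mode()

# A tie: 1 and 2 both appear twice, so both are returned
tied = pd.Series([1, 1, 2, 2]).mode()

# No repeats at all: every value is tied, so all are returned (sorted)
all_unique = pd.Series([30, 10, 20]).mode()

print(single.tolist(), tied.tolist(), all_unique.tolist())

# If you need just one value, take the first entry of the result
first_mode = single.iloc[0]
print(first_mode)
```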

## 4. Analyzing Categorical Data with `value_counts()`

For categorical (non-numerical) data, `value_counts()` is especially useful. It returns a Series containing the count of each unique value, sorted in descending order so that the first element is the most frequently occurring value. Missing values are excluded by default.

In [None]:
print("\nCounts of each City:\n")
print(df['City'].value_counts())

print("\nCounts of each Gender:\n")
print(df['Gender'].value_counts())
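
The tutorial's DataFrame has no missing values, so here is a small sketch with a made-up Series called `genders` that does. It shows the default behaviour and the `dropna=False` option for counting missing entries as well.

```python
import pandas as pd

# Hypothetical column with one missing entry
genders = pd.Series(["Male", "Female", None, "Male"])

counts = genders.value_counts()                    # missing value is ignored
with_missing = genders.value_counts(dropna=False)  # missing value is counted

print(counts)
print(with_missing)

# The first index label is the most frequent value
print("Most common:", counts.index[0])
```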

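You can also get relative frequencies by setting `normalize=True`. The results are proportions between 0 and 1; multiply by 100 to express them as percentages:

In [None]:
print("\nProportion of each City:\n")
print(df['City'].value_counts(normalize=True))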
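
Since `normalize=True` returns fractions that sum to 1, a common follow-up is to convert them to rounded percentages. This is a minimal sketch on a made-up Series called `cities`, not the tutorial's DataFrame.

```python
import pandas as pd

# Hypothetical city column
cities = pd.Series(["New York", "Chicago", "New York", "Houston"])

# Fractions between 0 and 1 that add up to 1
proportions = cities.value_counts(normalize=True)

# Scale to percentages and round to one decimal place
percentages = proportions.mul(100).round(1)

print(percentages)
```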

## 5. Common Mistake Alert: Running Numerical Stats on Text Columns

It's crucial to understand your data types. Calling a numerical statistic such as `mean()` on a non-numerical (text/object) column raises a `TypeError`.

In [None]:
try:
    print(df['City'].mean())
except TypeError as e:
    print(f"\nError: {e}\n")
    print("You cannot calculate the mean of a text (object) column. Always check your data types!")
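
One way to avoid this error is to check the data types first and restrict the calculation to numeric columns. The sketch below assumes a small made-up DataFrame called `people`; it shows two common options, `numeric_only=True` and `select_dtypes()`.

```python
import pandas as pd

# Hypothetical DataFrame mixing a numeric and a text column
people = pd.DataFrame({
    "Age": [25, 30, 35],
    "City": ["New York", "Chicago", "Houston"],
})

# Inspect the data types before computing statistics
print(people.dtypes)

# Option 1: ask mean() to ignore non-numeric columns
numeric_means = people.mean(numeric_only=True)

# Option 2: keep only the numeric columns, then compute as usual
numeric_cols = people.select_dtypes(include="number")

print(numeric_means)
print(numeric_cols.mean())
```

Both options give the same result here; `select_dtypes()` is handy when you want to reuse the numeric subset for several calculations.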

## Mini Challenge

Try to calculate the `mode()` for the 'Income' column and `value_counts()` for the 'Gender' column in the DataFrame we created. What insights can you gather from these results?

In [None]:
# Your code here for the mini challenge
# print(df['Income'].mode())
# print(df['Gender'].value_counts())

## Conclusion

Pandas provides a robust and intuitive way to perform descriptive statistics on your datasets. These functions are fundamental for initial data exploration and understanding, helping you quickly grasp the characteristics of your data without complex operations. Keep exploring and happy analyzing!