# Basic Statistics with Pandas

This notebook demonstrates how to calculate fundamental statistical measures using Pandas. These measures are the starting point for most data analysis: they condense a dataset into a few numbers that describe its center and spread.

## Learning Objectives
- Calculate basic statistical measures (mean, median, sum)
- Apply statistical functions to single and multiple columns
- Generate comprehensive descriptive statistics
- Interpret statistical results in context
- Understand the difference between various statistical measures

## Prerequisites
- Basic understanding of Python and Pandas
- Familiarity with DataFrames
- Basic knowledge of statistical concepts
- Pandas library installed

## Step 1: Import Required Libraries

Let's start by importing the necessary libraries for our statistical analysis.

In [1]:
# Import the Python libraries
import pandas as pd
import numpy as np

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Libraries imported successfully!
Pandas version: 2.3.1
NumPy version: 1.26.4


## Step 2: Load the Dataset

We'll load our data for statistical analysis. If the CSV file doesn't exist, we'll create sample data to demonstrate the concepts.

In [2]:
# Load the CSV file into a DataFrame
try:
    df = pd.read_csv("./data/unknown.csv")
    print("CSV file loaded successfully!")
    print(f"Dataset shape: {df.shape}")
except FileNotFoundError:
    print("CSV file not found. Creating sample data for demonstration.")
    # Create sample data with multiple numeric columns
    np.random.seed(42)  # For reproducible results
    df = pd.DataFrame({
        'Name': [f'Person_{i}' for i in range(1, 21)],
        'Age': np.random.randint(20, 65, 20),
        'Salary': np.random.randint(30000, 100000, 20),
        'Experience': np.random.randint(0, 30, 20),
        'Rating': np.random.uniform(1, 5, 20).round(2),
        'Department': np.random.choice(['IT', 'Finance', 'Marketing', 'HR'], 20),
        'City': np.random.choice(['New York', 'Dallas', 'Chicago', 'Houston'], 20)
    })
    print("Sample data created successfully!")

# Display basic information about our dataset
print(f"\nDataset Overview:")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print("\nFirst 5 rows:")
print(df.head())

CSV file not found. Creating sample data for demonstration.
Sample data created successfully!

Dataset Overview:
Shape: (20, 7)
Columns: ['Name', 'Age', 'Salary', 'Experience', 'Rating', 'Department', 'City']

First 5 rows:
       Name  Age  Salary  Experience  Rating Department      City
0  Person_1   58   94925          24    2.56         IT   Chicago
1  Person_2   48   97969          27    1.73         HR  New York
2  Person_3   34   35311          13    4.02         HR   Houston
3  Person_4   62   83707          17    2.70         HR   Chicago
4  Person_5   27   58693          25    1.83         IT   Chicago


## Step 3: Single Column Statistics

Let's start by calculating basic statistics for individual columns. We'll focus on the 'Age' column as an example.

### Mean (Average)
The mean is the sum of all values divided by the number of values. It is the most common measure of central tendency, but it is sensitive to extreme values.

In [3]:
# Calculating the Mean
mean_age = df["Age"].mean()
print(f"Mean Age: {mean_age:.2f} years")

# Let's also see the calculation step by step
total_age = df["Age"].sum()
count_people = len(df["Age"])
manual_mean = total_age / count_people
print(f"Manual calculation: {total_age} ÷ {count_people} = {manual_mean:.2f}")

Mean Age: 42.85 years
Manual calculation: 857 ÷ 20 = 42.85
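
One caveat with the manual calculation: `mean()` skips missing values by default, whereas `len()` counts every row. The following is a minimal sketch on a small hypothetical Series (not the notebook's dataset) showing where the two diverge, and how `count()` keeps the manual calculation consistent.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value (illustration only)
ages = pd.Series([20, 30, np.nan, 40])

pandas_mean = ages.mean()              # NaN is skipped: (20 + 30 + 40) / 3
naive_mean = ages.sum() / len(ages)    # len() counts the NaN row: 90 / 4
safe_mean = ages.sum() / ages.count()  # count() ignores NaN: 90 / 3

print(f"mean():        {pandas_mean:.2f}")
print(f"sum()/len():   {naive_mean:.2f}")
print(f"sum()/count(): {safe_mean:.2f}")
```

When a column has no missing values, all three results agree.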


### Median
The median is the middle value when all values are arranged in order. With an even number of values, it is the average of the two middle values. It is less affected by extreme values (outliers) than the mean.

In [4]:
# Calculating the Median
median_age = df["Age"].median()
print(f"Median Age: {median_age} years")

# Let's see the sorted ages to understand the median better
sorted_ages = sorted(df["Age"])
print(f"Sorted ages: {sorted_ages}")
# Show the middle value(s): one if the count is odd, two if it is even
n = len(sorted_ages)
middle_values = sorted_ages[(n - 1) // 2 : n // 2 + 1]
print(f"Middle value(s): {middle_values}")

# Compare mean vs median
print(f"\nComparison:")
print(f"Mean: {mean_age:.2f}")
print(f"Median: {median_age}")
print(f"Difference: {abs(mean_age - median_age):.2f}")

Median Age: 42.5 years
Sorted ages: [21, 22, 27, 30, 30, 34, 38, 40, 41, 42, 43, 43, 43, 48, 55, 58, 58, 59, 62, 63]
Middle value(s): [42, 43]

Comparison:
Mean: 42.85
Median: 42.5
Difference: 0.35
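
To make the robustness of the median concrete, here is a small sketch that adds one extreme value to a hypothetical salary list (invented for illustration, not taken from the dataset above) and compares how far each measure moves.

```python
import pandas as pd

# Hypothetical salaries (illustration only, not the notebook's dataset)
salaries = pd.Series([40_000, 45_000, 50_000, 55_000, 60_000])

# Append a single extreme value
with_outlier = pd.concat([salaries, pd.Series([1_000_000])], ignore_index=True)

print(f"Without outlier: mean={salaries.mean():,.0f}, median={salaries.median():,.0f}")
print(f"With outlier:    mean={with_outlier.mean():,.0f}, median={with_outlier.median():,.0f}")
```

The single outlier roughly quadruples the mean, while the median only shifts from 50,000 to 52,500.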


### Sum
The sum is the total of all values in a column. This is useful for understanding totals and aggregates.

In [5]:
# Calculating the Sum
sum_age = df["Age"].sum()
print(f"Total Sum of Ages: {sum_age} years")

# This represents the combined age of all people in our dataset
print(f"If we combined all {len(df)} people's ages, it would be {sum_age} years")
print(f"That's an average of {sum_age/len(df):.2f} years per person")

Total Sum of Ages: 857 years
If we combined all 20 people's ages, it would be 857 years
That's an average of 42.85 years per person
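
Like `mean()`, `sum()` skips missing values, which means a column with no valid values sums to zero rather than to a missing result. The sketch below uses a hypothetical all-missing Series to show this, along with the `min_count` parameter, which is one way to ask for `NaN` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical column in which every value is missing (illustration only)
empty_col = pd.Series([np.nan, np.nan, np.nan])

default_total = empty_col.sum()            # 0.0, because NaN values are skipped
strict_total = empty_col.sum(min_count=1)  # NaN, because fewer than 1 valid value exists

print(f"Default sum:     {default_total}")
print(f"With min_count:  {strict_total}")
```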


## Step 4: Multiple Column Statistics

Now let's apply statistical functions to multiple columns simultaneously. This is efficient when working with datasets that have many numeric columns.

### Mean for All Numeric Columns

In [6]:
# Applying statistics to multiple columns
# Calculate mean for all numeric columns in the DataFrame
mean_values = df.mean(numeric_only=True)
print("Mean values for all numeric columns:")
print(mean_values)

# Let's format this nicely
print("\n=== MEAN VALUES ===")
for column, value in mean_values.items():
    if column == 'Age':
        print(f"{column}: {value:.1f} years")
    elif column == 'Salary':
        print(f"{column}: ${value:,.2f}")
    elif column == 'Experience':
        print(f"{column}: {value:.1f} years")
    elif column == 'Rating':
        print(f"{column}: {value:.2f} / 5.0")
    else:
        print(f"{column}: {value:.2f}")

Mean values for all numeric columns:
Age              42.85
Salary        69368.65
Experience       16.40
Rating            3.20
dtype: float64

=== MEAN VALUES ===
Age: 42.9 years
Salary: $69,368.65
Experience: 16.4 years
Rating: 3.20 / 5.0


### Median for All Numeric Columns

In [7]:
# Calculating the median for all numeric columns
median_values = df.median(numeric_only=True)
print("Median values for all numeric columns:")
print(median_values)

# Compare means vs medians
print("\n=== MEAN vs MEDIAN COMPARISON ===")
comparison_df = pd.DataFrame({
    'Mean': mean_values,
    'Median': median_values,
    'Difference': mean_values - median_values
})
print(comparison_df.round(2))

Median values for all numeric columns:
Age              42.500
Salary        68689.500
Experience       15.500
Rating            3.215
dtype: float64

=== MEAN vs MEDIAN COMPARISON ===
                Mean    Median  Difference
Age            42.85     42.50        0.35
Salary      69368.65  68689.50      679.15
Experience     16.40     15.50        0.90
Rating          3.20      3.22       -0.01


### Sum for All Numeric Columns

In [8]:
# Calculating the sum for all numeric columns
sum_values = df.sum(numeric_only=True)
print("Sum of values for all numeric columns:")
print(sum_values)

# Let's interpret these sums
print("\n=== INTERPRETATION OF SUMS ===")
# The result is a single float64 Series, so whole-number totals are formatted without decimals
for column, value in sum_values.items():
    if column == 'Age':
        print(f"Total {column}: {value:,.0f} years (combined age of all people)")
    elif column == 'Salary':
        print(f"Total {column}: ${value:,.0f} (total payroll)")
    elif column == 'Experience':
        print(f"Total {column}: {value:,.0f} years (combined experience)")
    elif column == 'Rating':
        print(f"Total {column}: {value:.2f} (sum of all ratings)")
    else:
        print(f"Total {column}: {value}")

Sum of values for all numeric columns:
Age               857.0
Salary        1387373.0
Experience        328.0
Rating             64.0
dtype: float64

=== INTERPRETATION OF SUMS ===
Total Age: 857 years (combined age of all people)
Total Salary: $1,387,373 (total payroll)
Total Experience: 328 years (combined experience)
Total Rating: 64.00 (sum of all ratings)
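
The three cells above each compute one statistic. As an alternative sketch, `select_dtypes` combined with `agg` can produce several statistics in a single table. The `people` DataFrame here is a hypothetical stand-in, not the notebook's dataset.

```python
import pandas as pd

# Hypothetical mini-dataset (illustration only)
people = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [25, 35, 45, 55],
    "Salary": [40_000, 50_000, 60_000, 90_000],
})

# Keep only the numeric columns, so no numeric_only flag is needed
numeric = people.select_dtypes(include="number")

# One row per statistic, one column per numeric variable
stats = numeric.agg(["mean", "median", "sum"])
print(stats)
```

This layout makes it easy to compare the mean and median of each column side by side.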


## Step 5: Comprehensive Descriptive Statistics

The `describe()` method summarizes every numeric column at once: count, mean, standard deviation, minimum, quartiles, and maximum.

In [9]:
# Descriptive statistics
summary = df.describe()
print("Summary of Descriptive Statistics:")
print(summary)

# Let's understand what each statistic means
print("\n=== EXPLANATION OF STATISTICS ===")
print("count: Number of non-null values")
print("mean:  Average value")
print("std:   Standard deviation (measure of spread)")
print("min:   Minimum value")
print("25%:   First quartile (25% of values are below this)")
print("50%:   Median (50% of values are below this)")
print("75%:   Third quartile (75% of values are below this)")
print("max:   Maximum value")

Summary of Descriptive Statistics:
             Age        Salary  Experience     Rating
count  20.000000     20.000000   20.000000  20.000000
mean   42.850000  69368.650000   16.400000   3.200000
std    13.128013  24795.167751    8.804305   1.023415
min    21.000000  32747.000000    1.000000   1.130000
25%    33.000000  46671.750000   10.250000   2.575000
50%    42.500000  68689.500000   15.500000   3.215000
75%    55.750000  95125.000000   25.000000   3.997500
max    63.000000  99092.000000   28.000000   4.840000

=== EXPLANATION OF STATISTICS ===
count: Number of non-null values
mean:  Average value
std:   Standard deviation (measure of spread)
min:   Minimum value
25%:   First quartile (25% of values are below this)
50%:   Median (50% of values are below this)
75%:   Third quartile (75% of values are below this)
max:   Maximum value
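
`describe()` also accepts options beyond the defaults. The sketch below, on a small hypothetical DataFrame rather than the notebook's dataset, requests custom percentiles through the `percentiles` parameter and calls `describe()` on a text column, which reports `count`, `unique`, `top`, and `freq` instead of numeric measures.

```python
import pandas as pd

# Hypothetical mini-dataset (illustration only)
staff = pd.DataFrame({
    "Department": ["IT", "HR", "IT", "Finance", "IT"],
    "Age": [25, 35, 45, 55, 65],
})

# Custom percentiles for the numeric column
numeric_summary = staff.describe(percentiles=[0.1, 0.5, 0.9])
print(numeric_summary)

# A text column is summarized by its most frequent value (top) and its frequency (freq)
text_summary = staff["Department"].describe()
print(text_summary)
```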


## Step 6: Additional Statistical Measures

Let's explore some additional statistical measures that are commonly used in data analysis.

In [10]:
# Additional statistical measures
print("=== ADDITIONAL STATISTICS ===")

# Standard deviation (measure of variability)
print("Standard Deviation:")
std_values = df.std(numeric_only=True)
for column, value in std_values.items():
    print(f"  {column}: {value:.2f}")

# Variance (square of standard deviation)
print("\nVariance:")
var_values = df.var(numeric_only=True)
for column, value in var_values.items():
    print(f"  {column}: {value:.2f}")

# Min and Max values
print("\nMinimum values:")
min_values = df.min(numeric_only=True)
print(min_values)

print("\nMaximum values:")
max_values = df.max(numeric_only=True)
print(max_values)

# Range (difference between max and min)
print("\nRange (Max - Min):")
range_values = max_values - min_values
print(range_values)

=== ADDITIONAL STATISTICS ===
Standard Deviation:
  Age: 13.13
  Salary: 24795.17
  Experience: 8.80
  Rating: 1.02

Variance:
  Age: 172.34
  Salary: 614800343.82
  Experience: 77.52
  Rating: 1.05

Minimum values:
Age              21.00
Salary        32747.00
Experience        1.00
Rating            1.13
dtype: float64

Maximum values:
Age              63.00
Salary        99092.00
Experience       28.00
Rating            4.84
dtype: float64

Range (Max - Min):
Age              42.00
Salary        66345.00
Experience       27.00
Rating            3.71
dtype: float64
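
One detail worth knowing about the standard deviation and variance above: Pandas computes the sample versions by default (`ddof=1`, dividing by n - 1), while NumPy's `np.std` defaults to the population version (`ddof=0`, dividing by n). The sketch below compares them on a short hypothetical Series.

```python
import numpy as np
import pandas as pd

# Hypothetical values (illustration only); mean is 5, squared deviations sum to 32
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = values.std()              # pandas default: ddof=1, sqrt(32 / 7)
population_std = values.std(ddof=0)    # divide by n instead: sqrt(32 / 8)
numpy_std = np.std(values.to_numpy())  # NumPy default: ddof=0

print(f"Sample std (ddof=1):     {sample_std:.4f}")
print(f"Population std (ddof=0): {population_std:.4f}")
print(f"np.std default:          {numpy_std:.4f}")
```

The gap between the two shrinks as the number of rows grows, but with 20 rows it is still visible in the second decimal place.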


## Step 7: Statistical Analysis by Groups

Let's perform statistical analysis grouped by categorical variables to gain more insights.

In [11]:
# Group statistics by Department
print("=== STATISTICS BY DEPARTMENT ===")
dept_stats = df.groupby('Department').agg({
    'Age': ['mean', 'median', 'std'],
    'Salary': ['mean', 'median', 'std'],
    'Experience': ['mean', 'median', 'std'],
    'Rating': ['mean', 'median', 'std']
}).round(2)

print(dept_stats)

# Group statistics by City
print("\n=== STATISTICS BY CITY ===")
city_stats = df.groupby('City').agg({
    'Age': 'mean',
    'Salary': 'mean',
    'Experience': 'mean',
    'Rating': 'mean'
}).round(2)

print(city_stats)

=== STATISTICS BY DEPARTMENT ===
              Age                  Salary                    Experience  \
             mean median    std      mean   median       std       mean   
Department                                                                
HR          41.00   41.0  16.45  69738.00  83707.0  31251.60      16.00   
IT          42.56   42.0  12.23  72976.89  65773.0  18517.39      18.56   
Marketing   46.75   43.0  11.09  60603.75  56116.5  29735.83      12.25   

                        Rating               
           median   std   mean median   std  
Department                                   
HR           14.0  8.76   3.56   3.99  1.07  
IT           24.0  9.57   2.70   2.58  1.00  
Marketing    13.5  7.50   3.70   3.63  0.55  

=== STATISTICS BY CITY ===
           Age    Salary  Experience  Rating
City                                        
Chicago   49.0  79714.00       19.33    2.35
Dallas    31.0  62937.67       14.00    3.83
Houston   43.0  60662.00       1
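
The dictionary form of `agg` produces two-level column headers, which wrap awkwardly when printed. As a sketch of one alternative, named aggregation yields flat, self-describing column names. The `staff` DataFrame and the output column names are hypothetical.

```python
import pandas as pd

# Hypothetical mini-dataset (illustration only)
staff = pd.DataFrame({
    "Department": ["IT", "IT", "HR", "HR", "HR"],
    "Salary": [60_000, 80_000, 50_000, 55_000, 75_000],
})

# Named aggregation: output_column=(input_column, function)
dept_summary = staff.groupby("Department").agg(
    headcount=("Salary", "count"),
    mean_salary=("Salary", "mean"),
    median_salary=("Salary", "median"),
)
print(dept_summary)
```

Including a headcount column is a deliberate choice here: with only a handful of rows per group, group-level statistics can swing widely, so it helps to see how many rows each figure rests on.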

## Step 8: Statistical Insights and Interpretation

Let's analyze our statistical results to gain meaningful insights from the data.

In [12]:
# Generate insights from our statistical analysis
print("=== DATA INSIGHTS ===")

# Age analysis
age_mean = df['Age'].mean()
age_std = df['Age'].std()
print(f"Age Analysis:")
print(f"  Average age: {age_mean:.1f} years")
print(f"  Most people are between {age_mean-age_std:.1f} and {age_mean+age_std:.1f} years old")

# Salary analysis
salary_mean = df['Salary'].mean()
salary_median = df['Salary'].median()
print(f"\nSalary Analysis:")
print(f"  Average salary: ${salary_mean:,.2f}")
print(f"  Median salary: ${salary_median:,.2f}")
if salary_mean > salary_median:
    print(f"  Mean > Median suggests some high earners are pulling the average up")
else:
    print(f"  Mean ≤ Median suggests salaries are not skewed toward high earners")

# Experience relative to age (ratio of the two means, not a correlation)
experience_mean = df['Experience'].mean()
print(f"\nExperience Analysis:")
print(f"  Average experience: {experience_mean:.1f} years")
print(f"  Experience to age ratio: {experience_mean/age_mean:.2f}")

# Rating analysis
rating_mean = df['Rating'].mean()
rating_std = df['Rating'].std()
print(f"\nRating Analysis:")
print(f"  Average rating: {rating_mean:.2f} / 5.0")
print(f"  Standard deviation: {rating_std:.2f}")
# Note: 1.0 is an arbitrary cutoff chosen for this 1-5 rating scale
if rating_std < 1:
    print(f"  Low standard deviation suggests consistent performance ratings")
else:
    print(f"  High standard deviation suggests varied performance ratings")

=== DATA INSIGHTS ===
Age Analysis:
  Average age: 42.9 years
  Most people are between 29.7 and 56.0 years old

Salary Analysis:
  Average salary: $69,368.65
  Median salary: $68,689.50
  Mean > Median suggests some high earners are pulling the average up

Experience Analysis:
  Average experience: 16.4 years
  Experience to age ratio: 0.38

Rating Analysis:
  Average rating: 3.20 / 5.0
  Standard deviation: 1.02
  High standard deviation suggests varied performance ratings
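
Two of the statements above can be checked directly rather than inferred. The sketch below, on a small hypothetical DataFrame, measures what share of values actually falls within one standard deviation of the mean, and computes a real correlation with `Series.corr` instead of a ratio of means.

```python
import pandas as pd

# Hypothetical mini-dataset (illustration only)
staff = pd.DataFrame({
    "Age": [22, 30, 35, 41, 48, 55, 63],
    "Experience": [1, 6, 10, 15, 20, 28, 35],
})

age_mean = staff["Age"].mean()
age_std = staff["Age"].std()

# Share of rows within one standard deviation of the mean
within_one_std = staff["Age"].between(age_mean - age_std, age_mean + age_std).mean()
print(f"Share within one std of the mean: {within_one_std:.0%}")

# Pearson correlation between the two columns (ranges from -1 to 1)
age_exp_corr = staff["Age"].corr(staff["Experience"])
print(f"Age/Experience correlation: {age_exp_corr:.2f}")
```

In the notebook's fallback sample data, `Age` and `Experience` are drawn independently at random, so a strong correlation should not be expected there.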


## Summary

In this notebook, we explored fundamental statistical concepts using Pandas:

### Basic Statistics Covered
1. **Mean**: Average value, sensitive to outliers
2. **Median**: Middle value, robust to outliers
3. **Sum**: Total of all values
4. **Standard Deviation**: Measure of data spread
5. **Min/Max**: Extreme values
6. **Quartiles**: Values that split the sorted data into four equal-sized parts (25%, 50%, 75%)

### Key Functions Used
- `df.mean()`: Calculate averages
- `df.median()`: Find middle values
- `df.sum()`: Compute totals
- `df.describe()`: Comprehensive statistics summary
- `df.std()`: Standard deviation (sample version, `ddof=1`, by default)
- `df.var()`: Variance (sample version, `ddof=1`, by default)
- `df.min()` / `df.max()`: Extreme values
- `df.groupby().agg()`: Group-wise statistics

### Important Insights
- **Mean vs Median**: A large gap between them signals skew or outliers
- **Standard Deviation**: Higher values indicate more variability
- **Grouped Statistics**: Reveal patterns across categories
- **Statistical Interpretation**: Context matters for meaningful analysis

## Best Practices

1. **Always explore your data** before applying statistics
2. **Use `numeric_only=True`** when a DataFrame mixes numeric and text columns; in pandas 2.x, `df.mean()` raises a `TypeError` otherwise
3. **Compare mean and median** to understand distribution shape
4. **Consider outliers** when interpreting results
5. **Group by categories** for deeper insights
6. **Round results appropriately** for readability
7. **Provide context** when interpreting statistical measures
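
As a minimal sketch of the first two practices, the checks below inspect column types and missing values before any statistics are computed. The `staff` DataFrame is hypothetical and stands in for whatever dataset is being analyzed.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset with one missing value (illustration only)
staff = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [25.0, np.nan, 45.0],
    "Salary": [40_000, 50_000, 60_000],
})

# Which columns are numeric?
print(staff.dtypes)
numeric_columns = staff.select_dtypes(include="number").columns.tolist()
print(f"Numeric columns: {numeric_columns}")

# How many values are missing in each column?
missing_per_column = staff.isna().sum()
print(missing_per_column)
```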

## Next Steps

- Learn about data visualization to complement statistical analysis
- Explore correlation analysis between variables
- Study advanced statistical measures (skewness, kurtosis)
- Practice with different datasets and domains
- Learn about statistical hypothesis testing