# Applying Aggregation Functions Directly to a DataFrame

One of the strengths of pandas is that you can call statistical and aggregation methods directly on a DataFrame or Series. These methods summarize data without explicit loops or manual calculations.

### Common Aggregation Methods

Here are some of the most commonly used methods:

| Method        | Description                                              | Works On           |
|---------------|----------------------------------------------------------|--------------------|
| `.sum()`      | Returns the **sum** of values                            | DataFrame / Series |
| `.mean()`     | Returns the **average (mean)** value                     | DataFrame / Series |
| `.count()`    | Counts **non-null values**                               | DataFrame / Series |
| `.min()`      | Returns the **minimum** value                            | DataFrame / Series |
| `.max()`      | Returns the **maximum** value                            | DataFrame / Series |
| `.std()`      | Returns the **standard deviation**                       | DataFrame / Series |
| `.var()`      | Returns the **variance**                                 | DataFrame / Series |
| `.describe()` | Generates **summary statistics** (count, mean, std, min, quartiles, max) | DataFrame / Series |


Example: Aggregating a Series


In [None]:
import pandas as pd

# Salary data
salaries = pd.Series([50000, 60000, 55000, 65000, 70000])

print("Sum:", salaries.sum())
print("Mean:", salaries.mean())
print("Max:", salaries.max())
print("Std Dev:", salaries.std())

Each method is applied directly to the Series, returning a single value.
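
Two defaults are worth seeing in action: `.count()` and `.mean()` skip missing values, and `.std()` / `.var()` use the sample formula (`ddof=1`). The sketch below uses a hypothetical Series with one missing entry, which is an assumption added for illustration and not part of the salary data above.

```python
import numpy as np
import pandas as pd

# Hypothetical salary data with one missing entry (illustrative only)
salaries_with_gap = pd.Series([50000, 60000, np.nan, 65000, 70000])

n_rows = len(salaries_with_gap)          # every row, including the NaN
n_valid = salaries_with_gap.count()      # non-null values only
mean_valid = salaries_with_gap.mean()    # NaN is skipped by default (skipna=True)

# .std() and .var() use the sample formula (ddof=1) unless told otherwise
complete = salaries_with_gap.dropna()
sample_std = complete.std()
population_std = complete.std(ddof=0)

print("Rows:", n_rows, "| Non-null:", n_valid)
print("Mean of non-null values:", mean_valid)
print("Sample std:", sample_std, "| Population std:", population_std)
```

If you need the population standard deviation (as NumPy's `np.std` computes by default), pass `ddof=0` explicitly.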

Example: Aggregating a DataFrame

In [None]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 30, 28],
    'Salary': [50000, 60000, 55000]
}
df = pd.DataFrame(data)

print(df.sum(numeric_only=True))   # Sum of numeric columns
print(df.mean(numeric_only=True))  # Mean of numeric columns
print(df.describe())

Notice that `numeric_only=True` tells `sum()` and `mean()` to skip non-numeric columns (like “Name”), while `describe()` summarizes only the numeric columns by default.
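
There is more than one way to restrict an aggregation to numeric columns. The sketch below compares three options on the same small DataFrame; which one reads best is a matter of taste, and the variable names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 30, 28],
    'Salary': [50000, 60000, 55000],
})

# Option 1: ask the method to skip non-numeric columns
means_flag = df.mean(numeric_only=True)

# Option 2: select the numeric columns first, then aggregate
means_select = df.select_dtypes(include='number').mean()

# Option 3: name the columns you care about
means_named = df[['Age', 'Salary']].mean()

print(means_flag)
```

All three return a Series indexed by column name, with one mean per numeric column.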

# More Advanced: Filtering Data and Applying Statistical Functions

We can combine **row filtering** with **aggregation functions** to analyze subsets of a DataFrame.  

The general syntax is:

> **df[df['column_name'] <condition> value]['target_column'].function()**

where:

- `df[...]` → filters the rows that meet the condition

- `['target_column']` → selects the column to aggregate

- `.function()` → applies the aggregation function


In [None]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 35, 28, 40],
    'Salary': [50000, 66000, 55000, 70000]
}
df = pd.DataFrame(data)

# Average salary of employees older than 30
avg_salary = df[df['Age'] > 30]['Salary'].mean()
print(avg_salary)

# Maximum salary for employees younger than 30
print(df[df['Age'] < 30]['Salary'].max())

# Count employees with salary above 60,000
print(df[df['Salary'] > 60000]['Name'].count())

# Standard deviation of salary for people aged 25–40
print(df[(df['Age'] >= 25) & (df['Age'] <= 40)]['Salary'].std())

So the syntax pattern is:

> **df[condition]['column'].aggregation()**

where `condition` is a Boolean expression on one or more columns, such as `df['Age'] > 30`.

| Expression                                                  | Meaning                                          |
| ----------------------------------------------------------- | ------------------------------------------------ |
| `df[df['Age'] > 30]['Salary'].mean()`                       | Mean of Salary where Age > 30                    |
| `df[df['Salary'] > 60000]['Name'].count()`                  | Count of employees with Salary > 60k             |
| `df[(df['Age'] >= 25) & (df['Age'] <= 40)]['Salary'].std()` | Standard deviation of Salary for 25–40 year olds |


This pattern allows you to filter data first, then aggregate only on the rows that meet your condition.
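
The same filter-then-aggregate step can also be written with `.loc`, which selects rows and a column in a single indexing call. This is a sketch of an alternative spelling, not a replacement for the pattern above; the name `older_than_30` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 35, 28, 40],
    'Salary': [50000, 66000, 55000, 70000],
})

# A Boolean Series with one True/False entry per row
older_than_30 = df['Age'] > 30

# Chained form: filter rows, then pick the column
chained = df[older_than_30]['Salary'].mean()

# .loc form: rows and column in one step
with_loc = df.loc[older_than_30, 'Salary'].mean()

print(chained, with_loc)
```

Storing the condition in a named variable also makes it easy to reuse the same filter for several aggregations.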

# Grouping Data with `groupby`

While filtering + aggregation lets us summarize a **subset** of data, the `groupby()` method allows us to compute statistics **across categories**.  
This is the classic **split–apply–combine** process:

1. **Split** data into groups based on one or more columns.  
2. **Apply** an aggregation function to each group.  
3. **Combine** results into a new DataFrame or Series.  
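
To make the three steps concrete, the sketch below performs them by hand with a loop and then compares the result with a single `groupby()` call. The manual version is for illustration only; in practice `groupby()` does this work for you.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 60000],
})

manual = {}
for dept in df['Department'].unique():
    group = df[df['Department'] == dept]      # 1. Split: rows for one department
    manual[dept] = group['Salary'].mean()     # 2. Apply: aggregate that group

# 3. Combine: collect the per-group results into one Series
manual_result = pd.Series(manual).sort_index()

# The same computation in one line
grouped_result = df.groupby('Department')['Salary'].mean()

print(manual_result)
print(grouped_result)
```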

---

## Basic Syntax

> df.groupby('column_name')['target_column'].aggregation_function()

where:

- `groupby('column_name')` → splits the data into groups.  
- `['target_column']` → selects the column to aggregate.  
- `.aggregation_function()` → applies functions like `mean()`, `sum()`, `count()`.  


## Example: Salary by Department

In [None]:
import pandas as pd

# Sample dataset
data = {
    'Department': ['HR','HR','IT','IT','Finance','Finance'],
    'Employee': ['Alice','Bob','Charlie','David','Eva','Frank'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 60000]
}
df = pd.DataFrame(data)
print(df)

# Average salary per department
df.groupby('Department')['Salary'].mean()
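
When one statistic per group is not enough, `.agg()` accepts a list of function names and returns one column per statistic. The sketch below reuses the department data; the choice of statistics is just an example.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 60000],
})

# One row per department, one column per aggregation
summary = df.groupby('Department')['Salary'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```

The result is a DataFrame indexed by department, so `summary.loc['IT', 'mean']` looks up a single value.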



### Grouping by Multiple Columns

In [None]:
# Example dataset with Region added
data2 = {
    'Department': ['HR','HR','IT','IT','Finance','Finance'],
    'Region': ['East','West','East','West','East','West'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 60000]
}
df2 = pd.DataFrame(data2)

# Group by Department and Region
df2.groupby(['Department','Region'])['Salary'].mean()
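
Grouping by two columns returns a Series with a two-level index (a `MultiIndex`). The sketch below shows a few common ways to work with that result; the variable names are chosen here for illustration.

```python
import pandas as pd

df2 = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 60000],
})

by_dept_region = df2.groupby(['Department', 'Region'])['Salary'].mean()

# Look up one group with a tuple of keys
hr_east = by_dept_region.loc[('HR', 'East')]

# Turn the index levels back into ordinary columns
flat = by_dept_region.reset_index()

# Pivot the Region level into columns: one row per department
wide = by_dept_region.unstack('Region')

print(flat)
print(wide)
```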

# The pandas Ecosystem: How It Fits In

Pandas does not exist in a vacuum. It is a central hub in the Python data science stack:

* **NumPy:** Provides the foundational n-dimensional array object. Pandas DataFrames are built on top of NumPy arrays.

* **Matplotlib/Seaborn:** Used for visualization. You can plot data directly from DataFrames and Series.

* **Scikit-learn:** A widely used machine learning library. Its estimators accept DataFrames and Series as inputs for model training.

* **Jupyter Notebooks:** The ideal interactive environment for exploratory data analysis with pandas.
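
As a small illustration of the NumPy connection, the sketch below moves data between the two libraries. It only covers the NumPy item in the list above and uses a tiny DataFrame made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [24, 30, 28], 'Salary': [50000, 60000, 55000]})

# Extract the underlying data as NumPy arrays
salary_array = df['Salary'].to_numpy()   # one-dimensional ndarray
values = df.to_numpy()                   # two-dimensional ndarray, one row per record

# NumPy functions accept a Series directly and return a Series
log_salary = np.log(df['Salary'])

print(type(salary_array), values.shape)
print(log_salary)
```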

# When to Use Pandas (And When Not To)

## Use pandas when:

* Working with tabular data (like spreadsheets or database tables)

* Data cleaning and preprocessing

* Exploratory data analysis

* Medium-sized datasets (up to a few gigabytes)

## Consider alternatives when:

* Working with very large datasets that don't fit in memory

* Running numerical computations that need extremely high performance (consider NumPy directly)

* Working with unstructured data like images or text
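
One rough way to judge whether a dataset fits comfortably in memory is to ask pandas how many bytes a DataFrame occupies. The sketch below uses a tiny made-up DataFrame; on real data, the same calls give a footprint you can compare against the available RAM.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 30, 28],
    'Salary': [50000, 60000, 55000],
})

# Bytes per column (plus the index); deep=True also measures string contents
per_column = df.memory_usage(deep=True)
total_bytes = per_column.sum()

print(per_column)
print(f"Total: {total_bytes / 1024:.1f} KiB")
```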