# Grouping and Aggregation - Solutions

Solutions to exercises on pandas GroupBy operations: aggregation functions, multiple aggregations, hierarchical groupings, transform(), filter(), apply(), and pivot tables.

## Question 1
Create a DataFrame with sales data (Product, Region, Sales, Quantity) and group by Product to calculate the total sales for each product.

In [None]:
import pandas as pd

sales_data = {
    'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Sales': [1000, 1500, 800, 1200, 900, 1100, 1300, 1400],
    'Quantity': [10, 15, 8, 12, 9, 11, 13, 14]
}
df_sales = pd.DataFrame(sales_data)
print(df_sales)

total_sales_by_product = df_sales.groupby('Product')['Sales'].sum()
print("\nTotal sales by product:")
print(total_sales_by_product)
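
The result above is a Series indexed by Product. As a minimal sketch, assuming you would rather have a flat DataFrame sorted by total sales, `as_index=False` keeps Product as a regular column. The name `totals_df` is only illustrative.

```python
import pandas as pd

# Same Product and Sales values as in the solution above.
df_sales = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
    'Sales': [1000, 1500, 800, 1200, 900, 1100, 1300, 1400],
})

# as_index=False keeps 'Product' as a column instead of moving it to the index.
totals_df = (
    df_sales.groupby('Product', as_index=False)['Sales']
    .sum()
    .sort_values('Sales', ascending=False)
)
print(totals_df)
```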

## Question 2
Using the same sales DataFrame, group by Region and calculate both mean sales and total quantity using agg().

In [None]:
region_stats = df_sales.groupby('Region').agg({
    'Sales': 'mean',
    'Quantity': 'sum'
})
print(region_stats)
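
With the dictionary form, the output columns keep their original names (Sales, Quantity), which hides which function was applied. One possible alternative is named aggregation, sketched below; the output names `mean_sales` and `total_quantity` are illustrative choices, not part of the exercise.

```python
import pandas as pd

# Same sales data as in Question 1.
df_sales = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Sales': [1000, 1500, 800, 1200, 900, 1100, 1300, 1400],
    'Quantity': [10, 15, 8, 12, 9, 11, 13, 14],
})

# Named aggregation: output_column=(input_column, function).
region_stats_named = df_sales.groupby('Region').agg(
    mean_sales=('Sales', 'mean'),
    total_quantity=('Quantity', 'sum'),
)
print(region_stats_named)
```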

## Question 3
Group the sales data by both Product and Region (hierarchical grouping) and calculate the sum of Sales.

In [None]:
hierarchical_sales = df_sales.groupby(['Product', 'Region'])['Sales'].sum()
print(hierarchical_sales)
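
Grouping by two keys returns a Series with a two-level MultiIndex. The sketch below shows three common ways to work with such a result: selecting one outer level, pivoting the inner level into columns, and flattening back to ordinary columns. The names `wide` and `flat` are illustrative.

```python
import pandas as pd

# Same sales data as in Question 1 (only the columns needed here).
df_sales = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Sales': [1000, 1500, 800, 1200, 900, 1100, 1300, 1400],
})

hierarchical_sales = df_sales.groupby(['Product', 'Region'])['Sales'].sum()

# Select every region for one product via the outer index level.
print(hierarchical_sales.loc['A'])

# Move the inner level (Region) into columns; missing combinations become NaN.
wide = hierarchical_sales.unstack('Region')
print(wide)

# Turn both index levels back into ordinary columns.
flat = hierarchical_sales.reset_index()
print(flat)
```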

## Question 4
Create a DataFrame with employee data (Department, Employee, Salary, Age) and find the employee with the highest salary in each department.

In [None]:
employee_data = {
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR', 'IT', 'Finance'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Salary': [70000, 55000, 80000, 65000, 60000, 75000, 70000],
    'Age': [28, 35, 32, 40, 29, 31, 38]
}
df_employees = pd.DataFrame(employee_data)
print(df_employees)

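# idxmax() returns the row label of the highest salary in each department
# (the first one if there is a tie); .loc then retrieves those full rows.
highest_paid_by_dept = df_employees.loc[df_employees.groupby('Department')['Salary'].idxmax()]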
print("\nHighest paid employee by department:")
print(highest_paid_by_dept)
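
Note that `idxmax()` returns only the first matching row per group, so tied top earners would be dropped. A minimal sketch of one way to keep ties, assuming that is the desired behavior, compares each salary with its department maximum via `transform('max')`. The extra row for 'Heidi' is hypothetical and added only to create a tie; it is not part of the exercise data.

```python
import pandas as pd

# Employee data from the solution above.
df_employees = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR', 'IT', 'Finance'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Salary': [70000, 55000, 80000, 65000, 60000, 75000, 70000],
    'Age': [28, 35, 32, 40, 29, 31, 38],
})

# Hypothetical extra employee, added only to produce a tie in HR.
extra = pd.DataFrame([{'Department': 'HR', 'Employee': 'Heidi', 'Salary': 60000, 'Age': 45}])
tied = pd.concat([df_employees, extra], ignore_index=True)

# transform('max') broadcasts each department's maximum back onto every row.
dept_max = tied.groupby('Department')['Salary'].transform('max')
top_with_ties = tied[tied['Salary'] == dept_max]
print(top_with_ties)
```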

## Question 5
Using the employee DataFrame, group by Department and apply multiple aggregation functions: count of employees, mean salary, and max age.

In [None]:
dept_stats = df_employees.groupby('Department').agg({
    'Employee': 'count',
    'Salary': 'mean',
    'Age': 'max'
})
print(dept_stats)
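
The dictionary form applies one function per column. When several statistics of the same column are needed, a list of function names can be passed instead; the sketch below does this for Salary only. The name `salary_summary` is illustrative.

```python
import pandas as pd

# Employee data from Question 4 (only the columns needed here).
df_employees = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR', 'IT', 'Finance'],
    'Salary': [70000, 55000, 80000, 65000, 60000, 75000, 70000],
})

# A list of function names yields one output column per function.
salary_summary = df_employees.groupby('Department')['Salary'].agg(['count', 'mean', 'min', 'max'])
print(salary_summary)
```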

## Question 6
Create a custom aggregation function that calculates the range (max - min) and apply it to the Salary column grouped by Department.

In [None]:
def salary_range(x):
    return x.max() - x.min()

salary_ranges = df_employees.groupby('Department')['Salary'].agg(salary_range)
print("Salary range by department:")
print(salary_ranges)
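
A custom function passed to `agg()` is called once per group in Python. For a simple range, the same numbers can also be obtained from two built-in reductions, as in this sketch; it is shown as an alternative for comparison, not as a replacement for the custom-function approach the exercise asks for.

```python
import pandas as pd

# Employee data from Question 4 (only the columns needed here).
df_employees = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR', 'IT', 'Finance'],
    'Salary': [70000, 55000, 80000, 65000, 60000, 75000, 70000],
})

grouped = df_employees.groupby('Department')['Salary']

# Custom aggregation, written as a lambda.
ranges_custom = grouped.agg(lambda s: s.max() - s.min())

# Same result from built-in reductions; the two Series align on Department.
ranges_builtin = grouped.max() - grouped.min()

print(ranges_custom)
print(ranges_builtin)
```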

## Question 7
Using transform() with groupby, add a column showing each employee's salary as a percentage of their department's total salary.

In [None]:
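# transform('sum') returns one value per original row (the department total),
# so the result aligns with df_employees and can be assigned as a column.
df_employees['Dept_Total_Salary'] = df_employees.groupby('Department')['Salary'].transform('sum')
# Each employee's share of the department total, in percent.
df_employees['Salary_Percentage'] = (df_employees['Salary'] / df_employees['Dept_Total_Salary']) * 100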
print(df_employees[['Employee', 'Department', 'Salary', 'Salary_Percentage']])
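
A quick sanity check for this kind of calculation is that the percentages within each department add up to 100. The sketch below computes the percentage without storing the intermediate total column and then verifies the sums; `pct` and `check` are illustrative names.

```python
import pandas as pd

# Employee data from Question 4 (only the columns needed here).
df_employees = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR', 'IT', 'Finance'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Salary': [70000, 55000, 80000, 65000, 60000, 75000, 70000],
})

# Department total broadcast to every row, then the per-row share in percent.
dept_total = df_employees.groupby('Department')['Salary'].transform('sum')
pct = df_employees['Salary'] / dept_total * 100

# Sum the shares per department; each should be 100 up to floating-point error.
check = pct.groupby(df_employees['Department']).sum()
print(check)
```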

## Question 8
Filter groups to show only departments where the average salary is greater than 60000.

In [None]:
high_salary_depts = df_employees.groupby('Department').filter(lambda x: x['Salary'].mean() > 60000)
print(high_salary_depts)
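
`filter()` calls the lambda once per group and keeps or drops whole groups. An equivalent selection can be sketched with `transform('mean')` and a boolean mask, which some readers find easier to combine with other row-level conditions. The names `mask`, `via_mask`, and `via_filter` are illustrative.

```python
import pandas as pd

# Employee data from Question 4.
df_employees = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR', 'IT', 'Finance'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Salary': [70000, 55000, 80000, 65000, 60000, 75000, 70000],
    'Age': [28, 35, 32, 40, 29, 31, 38],
})

# Department mean broadcast to every row, compared against the threshold.
mask = df_employees.groupby('Department')['Salary'].transform('mean') > 60000
via_mask = df_employees[mask]

# The filter() version from the solution above, for comparison.
via_filter = df_employees.groupby('Department').filter(lambda g: g['Salary'].mean() > 60000)

print(via_mask)
```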

## Question 9
Using apply() with groupby, create a function that returns the two highest-paid employees in each department.

In [None]:
def top_2_employees(group):
    return group.nlargest(2, 'Salary')

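# Select the columns explicitly so apply() does not operate on the grouping column
# (deprecated in pandas 2.2+); Department still appears in the result's index.
top_2_by_dept = df_employees.groupby('Department')[['Employee', 'Salary', 'Age']].apply(top_2_employees)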
print(top_2_by_dept)
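
The exercise asks for `apply()`, but the same top-N selection can also be sketched without it: sort by salary in descending order, then take the first two rows of each group with `head(2)`. This keeps the original flat index rather than producing a MultiIndex. The name `top_2_sorted` is illustrative.

```python
import pandas as pd

# Employee data from Question 4.
df_employees = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR', 'IT', 'Finance'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Salary': [70000, 55000, 80000, 65000, 60000, 75000, 70000],
    'Age': [28, 35, 32, 40, 29, 31, 38],
})

# Sort once, then keep the first two rows encountered in each department.
top_2_sorted = (
    df_employees.sort_values('Salary', ascending=False)
    .groupby('Department')
    .head(2)
)
print(top_2_sorted)
```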

## Question 10
Create a pivot table from the sales data showing Products as rows, Regions as columns, and sum of Sales as values.

In [None]:
pivot_table = df_sales.pivot_table(index='Product', columns='Region', values='Sales', aggfunc='sum')
print(pivot_table)
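
Not every product is sold in every region, so the pivot table above contains NaN for the missing combinations. As a minimal sketch, assuming zeros and row/column totals are wanted, `fill_value` and `margins` can be added; the label 'Total' is an arbitrary choice passed via `margins_name`.

```python
import pandas as pd

# Same sales data as in Question 1 (only the columns needed here).
df_sales = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'],
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Sales': [1000, 1500, 800, 1200, 900, 1100, 1300, 1400],
})

pivot_filled = df_sales.pivot_table(
    index='Product',
    columns='Region',
    values='Sales',
    aggfunc='sum',
    fill_value=0,          # replace missing Product/Region combinations with 0
    margins=True,          # add a totals row and a totals column
    margins_name='Total',  # label used for the totals row and column
)
print(pivot_filled)
```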