# Pandas Walkthrough

## Table of Contents
1. [Setting Up and Importing Libraries](#1-setting-up-and-importing-libraries)
2. [Creating Dummy Data](#2-creating-dummy-data)
3. [Understanding DataFrames](#3-understanding-dataframes)
4. [Loading and Saving Data](#4-loading-and-saving-data)
5. [Basic Data Exploration](#5-basic-data-exploration)
6. [Data Selection and Filtering](#6-data-selection-and-filtering)
7. [Basic Data Processing](#7-basic-data-processing)
8. [Handling Missing Data](#8-handling-missing-data)
9. [Grouping and Aggregation](#9-grouping-and-aggregation)
10. [Summary](#10-summary)

---



# Introduction to Pandas

Pandas is one of the most widely used Python libraries for **data analysis** and **machine learning workflows**.  
It provides fast, flexible, and expressive data structures that make working with structured data both easy and powerful.


## Why Pandas?

- **Data Cleaning**  
  Handle missing values, remove duplicates, and correct inconsistencies in raw datasets.

- **Data Transformation**  
  Convert, reshape, merge, and filter datasets into the format required for analysis.

- **Exploratory Data Analysis (EDA)**  
  Summarise datasets using descriptive statistics, and quickly generate insights.

- **Integration with Machine Learning**  
  Prepare and feed clean, well-structured data into machine learning libraries such as Scikit-Learn.


## Key Data Structures

1. **Series**  
   - A one-dimensional labelled array.  
   - Useful for representing a single column or a list with labels (like an Excel column).

2. **DataFrame**  
   - A two-dimensional table of rows and columns.  
   - The most commonly used structure in Pandas.  
   - Similar to a spreadsheet or SQL table.
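
The two structures described above can be seen side by side in a minimal sketch. The names `salaries` and `staff` and the values in them are made up for illustration and are not part of the tutorial's dataset.

```python
import pandas as pd

# Illustrative data only (hypothetical names and values)
# A Series: one-dimensional values, each attached to a label in the index
salaries = pd.Series([42000, 51000, 38000], index=["ann", "bob", "cy"], name="salary")

# A DataFrame: a table of columns that share one index
staff = pd.DataFrame(
    {"salary": [42000, 51000, 38000], "department": ["IT", "HR", "IT"]},
    index=["ann", "bob", "cy"],
)

print(salaries["bob"])        # look up a Series value by its label
print(type(staff["salary"]))  # one column of a DataFrame is itself a Series
print(staff.shape)            # (rows, columns)
```

Selecting a single column from `staff` gives back a Series, which is why the two structures are usually learned together.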


## Example Use Cases in Data Analysis

- Importing data from **CSV, Excel, SQL databases, or JSON** files.  
- Cleaning messy datasets by filling missing values or correcting data types.  
- Filtering large datasets to focus on relevant records.  
- Grouping and aggregating values to calculate totals, averages, and other statistics.  
- Creating quick visualisations of trends and distributions.


## Example Use Cases in Machine Learning

- **Preprocessing Data**: Encoding categorical variables, scaling numerical features, and handling null values.  
- **Feature Engineering**: Creating new features that improve model performance.  
- **Splitting Data**: Separating training and testing datasets to evaluate models properly.  
- **Pipeline Support**: Feeding clean DataFrames directly into machine learning libraries.


In short, Pandas acts as the **bridge between raw data and useful insights**, making it a core skill for anyone working in data analysis or machine learning.


## 1. Setting Up and Importing Libraries

First, let's import the essential libraries we'll need for this tutorial. Pandas is the main library for data manipulation, whilst NumPy helps with numerical operations and random data generation.

In [54]:
import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta

# Set random seed for reproducible results
np.random.seed(42)
random.seed(42)

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")


Libraries imported successfully!
Pandas version: 2.0.3




**What's happening here?**
- `pandas` (imported as `pd`) is our main data manipulation library
- `numpy` (imported as `np`) provides mathematical functions and random number generation
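- `datetime` and `timedelta` let us generate realistic start dates; `random` is imported and seeded for completeness, although the code below draws all of its random values from NumPy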
- Setting a seed ensures we get the same "random" data each time we run the code
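
To see what the seed does, the sketch below draws the same numbers twice by resetting it. The variable names are illustrative. The last two lines show NumPy's `default_rng` generator, which is a common alternative to the global seed but is not what this tutorial uses.

```python
import numpy as np

np.random.seed(42)
first_draw = np.random.randint(0, 100, size=5)

np.random.seed(42)  # resetting the seed restarts the same sequence
second_draw = np.random.randint(0, 100, size=5)

print(np.array_equal(first_draw, second_draw))  # True

# Alternative (not used in this tutorial): a local generator with its own seed
rng = np.random.default_rng(42)
print(rng.integers(0, 100, size=5))
```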

---

## 2. Creating Dummy Data

Let's create a realistic dataset about employees in a company. This will give us interesting data to work with throughout the tutorial.



In [55]:
# Create dummy data for an employee database
def create_employee_data(n_employees=100):
    """Create dummy employee data with realistic information"""
    
    # Define lists of possible values
    departments = ['Sales', 'Marketing', 'IT', 'HR', 'Finance', 'Operations']
    cities = ['London', 'Manchester', 'Birmingham', 'Glasgow', 'Bristol', 'Liverpool']
    job_titles = ['Analyst', 'Manager', 'Specialist', 'Coordinator', 'Director', 'Assistant']
    
    # Generate random data
    data = {
        'employee_id': range(1001, 1001 + n_employees),
        'first_name': [f"Employee_{i}" for i in range(1, n_employees + 1)],
        'department': np.random.choice(departments, n_employees),
        'job_title': np.random.choice(job_titles, n_employees),
        # Salary stays float: NaN is assigned later, and newer pandas warns on NaN in an int column
        'salary': np.random.normal(45000, 15000, n_employees).round(0),
        'years_experience': np.random.randint(0, 25, n_employees),
        'city': np.random.choice(cities, n_employees),
        'performance_score': np.random.uniform(1.0, 5.0, n_employees).round(2),
        'start_date': [
            datetime(2020, 1, 1) + timedelta(days=np.random.randint(0, 1460))
            for _ in range(n_employees)
        ]
    }
    
    return pd.DataFrame(data)

# Create the dataset
df = create_employee_data(100)

# Add some missing values to make it more realistic
df.loc[df.sample(5).index, 'performance_score'] = np.nan
df.loc[df.sample(3).index, 'salary'] = np.nan

print("Dummy data created successfully!")
print(f"Dataset shape: {df.shape}")


Dummy data created successfully!
Dataset shape: (100, 9)




**What's happening here?**
- We've created a function that generates realistic employee data
- The dataset includes various data types: integers, floats, strings, and dates
- We intentionally added some missing values (NaN) to demonstrate how to handle them later
- `df.shape` tells us we have 100 rows and 9 columns
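
One side effect of adding missing values is that a whole-number column can no longer be a plain integer column, because NaN is a floating-point value. The sketch below shows this on a small illustrative Series (the names are hypothetical), along with pandas' nullable `Int64` dtype as one possible alternative.

```python
import numpy as np
import pandas as pd

# Illustrative values only
ints = pd.Series([30000, 45000, 52000])
print(ints.dtype)  # int64

# NaN is a float, so a column holding NaN is stored as float64
with_gap = ints.astype("float64")
with_gap.iloc[1] = np.nan
print(with_gap.dtype)  # float64

# The nullable Int64 dtype (capital I) keeps whole numbers alongside missing values
nullable = pd.Series([30000, None, 52000], dtype="Int64")
print(nullable.dtype)         # Int64
print(nullable.isna().sum())  # 1
```

This is why `salary` appears as `float64` in the dtype listings later in the tutorial.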

---

## 3. Understanding DataFrames

A DataFrame is like a spreadsheet or table with rows and columns. Let's explore the basic properties of our DataFrame.



In [56]:
# Basic information about our DataFrame
print("=== BASIC DATAFRAME INFORMATION ===")
print(f"Shape (rows, columns): {df.shape}")
print(f"Total number of elements: {df.size}")
print(f"Memory usage: {df.memory_usage(deep=True).sum()} bytes")

print("\n=== COLUMN INFORMATION ===")
print("Column names:")
print(df.columns.tolist())

print("\nData types of each column:")
print(df.dtypes)

print("\n=== FIRST 5 ROWS (HEAD) ===")
print(df.head())

print("\n=== LAST 5 ROWS (TAIL) ===")
print(df.tail())


=== BASIC DATAFRAME INFORMATION ===
Shape (rows, columns): (100, 9)
Total number of elements: 900
Memory usage: 29901 bytes

=== COLUMN INFORMATION ===
Column names:
['employee_id', 'first_name', 'department', 'job_title', 'salary', 'years_experience', 'city', 'performance_score', 'start_date']

Data types of each column:
employee_id                   int64
first_name                   object
department                   object
job_title                    object
salary                      float64
years_experience              int32
city                         object
performance_score           float64
start_date           datetime64[ns]
dtype: object

=== FIRST 5 ROWS (HEAD) ===
   employee_id  first_name department job_title   salary  years_experience  \
0         1001  Employee_1         HR  Director  49520.0                 9   
1         1002  Employee_2    Finance   Analyst  23122.0                21   
2         1003  Employee_3         IT   Analyst  35059.0                 2 



**Key concepts:**
- **Shape**: Shows (number of rows, number of columns)
- **Columns**: The names of your data fields
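- **Data types**: Pandas infers a dtype for each column: text is stored as `object`, numbers as integer or float types (`int64`, `int32`, `float64`), and dates as `datetime64[ns]`. Here `salary` is `float64` because it contains missing values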
- **Head/Tail**: Quick ways to peek at your data
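
Once the dtypes are known, they can be used to pick columns or to change how a column is stored. The following is a small sketch on an illustrative frame called `demo` (a hypothetical name, not the tutorial's `df`), using `select_dtypes()` and `astype()`.

```python
import pandas as pd

# Illustrative data only
demo = pd.DataFrame({
    "department": ["IT", "HR", "IT"],
    "salary": [42000.0, 51000.0, 38000.0],
    "years_experience": [3, 12, 7],
})

# Pick out only the numeric columns
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['salary', 'years_experience']

# Store a repetitive text column as a category
demo["department"] = demo["department"].astype("category")
print(demo["department"].cat.categories.tolist())  # ['HR', 'IT']
```

Converting repetitive text to `category` is optional; it mainly helps with memory use when a column has few distinct values.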

---

## 4. Loading and Saving Data

Let's save our dummy data to a CSV file and then load it back. This is how you'd typically work with real data files.



In [57]:
# Save DataFrame to CSV
csv_filename = 'employee_data.csv'
df.to_csv(csv_filename, index=False)
print(f"Data saved to {csv_filename}")

# Load data from CSV
df_loaded = pd.read_csv(csv_filename)
print(f"Data loaded from {csv_filename}")
print(f"Loaded data shape: {df_loaded.shape}")

# Check if the data is identical (False here: a CSV round trip changes dtypes, e.g. start_date comes back as text)
print(f"Data identical after loading: {df.equals(df_loaded)}")

# Note: Dates might load as strings, so let's convert them properly
df_loaded['start_date'] = pd.to_datetime(df_loaded['start_date'])
print("Converted start_date to datetime format")

# Common CSV loading options
print("\n=== COMMON CSV LOADING OPTIONS ===")
print("# Load only first 10 rows:")
print("df_sample = pd.read_csv('employee_data.csv', nrows=10)")

print("# Load specific columns only:")
print("df_subset = pd.read_csv('employee_data.csv', usecols=['first_name', 'department', 'salary'])")

print("# Skip header row:")
print("df_no_header = pd.read_csv('employee_data.csv', skiprows=1, names=custom_column_names)")


Data saved to employee_data.csv
Data loaded from employee_data.csv
Loaded data shape: (100, 9)
Data identical after loading: False
Converted start_date to datetime format

=== COMMON CSV LOADING OPTIONS ===
# Load only first 10 rows:
df_sample = pd.read_csv('employee_data.csv', nrows=10)
# Load specific columns only:
df_subset = pd.read_csv('employee_data.csv', usecols=['first_name', 'department', 'salary'])
# Skip header row:
df_no_header = pd.read_csv('employee_data.csv', skiprows=1, names=custom_column_names)




**Key points about CSV operations:**
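- `index=False` stops pandas from writing the index (the row labels) as an extra column
- Always check your data after loading: a CSV stores no dtype information, so column types can change on the way back in
- CSV stores dates as text, so convert them with `pd.to_datetime()` or pass `parse_dates` to `pd.read_csv()`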
- You can load partial data using parameters like `nrows` or `usecols`
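
The date handling can be sketched without touching the file system by reading CSV text from memory. The CSV content and the names `plain` and `parsed` below are illustrative assumptions; `parse_dates` is a standard `read_csv` parameter.

```python
import io

import pandas as pd

# Illustrative CSV content held in memory instead of a file on disk
csv_text = "first_name,start_date\nEmployee_1,2021-03-15\nEmployee_2,2022-11-02\n"

# Without parse_dates the date column is read as text
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column is converted while loading
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["start_date"])

print(plain["start_date"].dtype)
print(parsed["start_date"].dtype)
print(parsed["start_date"].dt.year.tolist())  # [2021, 2022]
```

Parsing at load time saves a separate conversion step, at the cost of having to know the date columns in advance.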

---

## 5. Basic Data Exploration

Now let's explore our data to understand what we're working with. This is always the first step in any data analysis.



In [58]:
print("=== COMPREHENSIVE DATA OVERVIEW ===")
# info() prints its report directly and returns None, so wrapping it in print() also prints "None"
print(df.info())

print("\n=== STATISTICAL SUMMARY ===")
print(df.describe())

print("\n=== MISSING VALUES CHECK ===")
missing_data = df.isnull().sum()
print("Missing values per column:")
print(missing_data[missing_data > 0])

print(f"\nTotal missing values: {df.isnull().sum().sum()}")
print(f"Percentage of missing data: {(df.isnull().sum().sum() / df.size) * 100:.2f}%")

print("\n=== UNIQUE VALUES IN CATEGORICAL COLUMNS ===")
categorical_columns = ['department', 'job_title', 'city']
for col in categorical_columns:
    print(f"\n{col.upper()}:")
    print(f"  Unique values: {df[col].nunique()}")
    print(f"  Values: {df[col].unique().tolist()}")

print("\n=== VALUE COUNTS FOR DEPARTMENTS ===")
print(df['department'].value_counts())

print("\n=== BASIC STATISTICS FOR SALARY ===")
salary_stats = df['salary'].describe()
print(salary_stats)
print(f"Salary range: ${df['salary'].min():,.0f} to ${df['salary'].max():,.0f}")


=== COMPREHENSIVE DATA OVERVIEW ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   employee_id        100 non-null    int64         
 1   first_name         100 non-null    object        
 2   department         100 non-null    object        
 3   job_title          100 non-null    object        
 4   salary             97 non-null     float64       
 5   years_experience   100 non-null    int32         
 6   city               100 non-null    object        
 7   performance_score  95 non-null     float64       
 8   start_date         100 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int32(1), int64(1), object(4)
memory usage: 6.8+ KB
None

=== STATISTICAL SUMMARY ===
       employee_id        salary  years_experience  performance_score  \
count   100.000000     97.000000        100.000000        



**What each function tells us:**
- `info()`: Overall structure, data types, and memory usage
- `describe()`: Statistical summary of numerical columns
- `isnull().sum()`: Counts missing values in each column
- `nunique()`: Number of unique values (useful for categorical data)
- `value_counts()`: Frequency of each unique value
- `unique()`: List of all unique values
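
Two small variations on these functions are often handy: `isna().mean()` gives the share of missing values per column, and `value_counts(normalize=True)` gives proportions instead of counts. The sketch below uses an illustrative frame named `demo`, not the tutorial's dataset.

```python
import numpy as np
import pandas as pd

# Illustrative data only
demo = pd.DataFrame({
    "department": ["IT", "HR", "IT", "Sales"],
    "salary": [42000.0, np.nan, 38000.0, 51000.0],
})

# Mean of a boolean mask = fraction of True values, here the share of missing entries
missing_share = demo.isna().mean() * 100
print(missing_share)

# Proportions rather than raw counts
dept_share = demo["department"].value_counts(normalize=True)
print(dept_share)
```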

---

## 6. Data Selection and Filtering

Learning how to select specific data is crucial. Let's explore different ways to slice and filter our DataFrame.



In [59]:
print("=== SELECTING COLUMNS ===")
# Select single column (returns a Series)
department_series = df['department']
print("Single column (Series):")
print(f"Type: {type(department_series)}")
print(department_series.head())

# Select single column (returns a DataFrame)
department_df = df[['department']]
print(f"\nSingle column (DataFrame): Type: {type(department_df)}")

# Select multiple columns
selected_columns = df[['first_name', 'department', 'salary']]
print(f"\nMultiple columns shape: {selected_columns.shape}")
print(selected_columns.head())

print("\n=== SELECTING ROWS ===")
# Select by position using iloc
print("First 5 rows using iloc:")
print(df.iloc[:5])

print("\nRows 10-15 using iloc:")
print(df.iloc[10:15])

print("\nSpecific rows and columns using iloc:")
print(df.iloc[0:5, 1:4])  # First 5 rows, columns 1-3

print("\n=== FILTERING DATA (BOOLEAN INDEXING) ===")
# Simple filter
high_earners = df[df['salary'] > 60000]
print(f"Employees earning more than $60,000: {len(high_earners)}")
print(high_earners[['first_name', 'department', 'salary']].head())

# Multiple conditions with AND (&)
it_high_earners = df[(df['department'] == 'IT') & (df['salary'] > 50000)]
print(f"\nIT employees earning more than $50,000: {len(it_high_earners)}")

# Multiple conditions with OR (|)
sales_or_marketing = df[(df['department'] == 'Sales') | (df['department'] == 'Marketing')]
print(f"\nSales or Marketing employees: {len(sales_or_marketing)}")

# Using isin() for multiple values
london_birmingham = df[df['city'].isin(['London', 'Birmingham'])]
print(f"\nEmployees in London or Birmingham: {len(london_birmingham)}")

# String filtering
managers = df[df['job_title'].str.contains('Manager', na=False)]
print(f"\nEmployees with 'Manager' in job title: {len(managers)}")

print("\n=== QUERY METHOD (ALTERNATIVE FILTERING) ===")
# Using query method (often more readable)
experienced_analysts = df.query("job_title == 'Analyst' and years_experience > 5")
print(f"Experienced analysts: {len(experienced_analysts)}")

# Query with string values
london_high_performers = df.query("city == 'London' and performance_score > 4.0")
print(f"High-performing London employees: {len(london_high_performers)}")


=== SELECTING COLUMNS ===
Single column (Series):
Type: <class 'pandas.core.series.Series'>
0         HR
1    Finance
2         IT
3    Finance
4    Finance
Name: department, dtype: object

Single column (DataFrame): Type: <class 'pandas.core.frame.DataFrame'>

Multiple columns shape: (100, 3)
   first_name department   salary
0  Employee_1         HR  49520.0
1  Employee_2    Finance  23122.0
2  Employee_3         IT  35059.0
3  Employee_4    Finance  42803.0
4  Employee_5    Finance  32310.0

=== SELECTING ROWS ===
First 5 rows using iloc:
   employee_id  first_name department job_title   salary  years_experience  \
0         1001  Employee_1         HR  Director  49520.0                 9   
1         1002  Employee_2    Finance   Analyst  23122.0                21   
2         1003  Employee_3         IT   Analyst  35059.0                 2   
3         1004  Employee_4    Finance   Analyst  42803.0                 7   
4         1005  Employee_5    Finance   Analyst  32310.0      



**Key selection methods:**
- **Single brackets `df['col']`**: Returns a Series
- **Double brackets `df[['col']]`**: Returns a DataFrame
- **`iloc`**: Integer-based location selection
- **Boolean indexing**: Filter using conditions
- **`&` and `|`**: AND and OR for combining conditions
- **`isin()`**: Check if values are in a list
- **`str.contains()`**: String pattern matching
- **`query()`**: Alternative syntax for filtering
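
The list above covers position-based selection with `iloc`. Its label-based counterpart, `loc`, is not used in this tutorial's code, so here is a short sketch of it on an illustrative frame (`demo` and its index values are assumptions for the example).

```python
import pandas as pd

# Illustrative data only, indexed by a made-up employee ID
demo = pd.DataFrame(
    {
        "department": ["IT", "HR", "IT", "Sales"],
        "salary": [42000, 51000, 58000, 61000],
    },
    index=[1001, 1002, 1003, 1004],
)

# loc slices by label and includes the end label (iloc excludes the end position)
first_three = demo.loc[1001:1003, ["salary"]]
print(first_three)

# loc also accepts a boolean mask plus a column selection.
# Each condition needs its own parentheses because & binds tighter than == and >
it_high = demo.loc[(demo["department"] == "IT") & (demo["salary"] > 50000), "salary"]
print(it_high)
```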

---

## 7. Basic Data Processing

Let's learn how to transform and modify our data. This includes creating new columns, modifying existing ones, and basic calculations.



In [60]:
print("=== CREATING NEW COLUMNS ===")
# Create a new column based on existing data
df['salary_band'] = df['salary'].apply(lambda x: 'High' if x > 55000 
                                      else 'Medium' if x > 35000 
                                      else 'Low' if pd.notna(x) else 'Unknown')

print("Salary bands created:")
print(df['salary_band'].value_counts())

# Create column using numpy where
df['senior_employee'] = np.where(df['years_experience'] >= 10, 'Senior', 'Junior')
print(f"\nSenior vs Junior employees:")
print(df['senior_employee'].value_counts())

# Create a column based on multiple conditions
def categorise_employee(row):
    if pd.isna(row['performance_score']):
        return 'Not Rated'
    elif row['performance_score'] >= 4.0:
        return 'High Performer'
    elif row['performance_score'] >= 3.0:
        return 'Good Performer'
    else:
        return 'Needs Improvement'

df['performance_category'] = df.apply(categorise_employee, axis=1)
print(f"\nPerformance categories:")
print(df['performance_category'].value_counts())

print("\n=== MODIFYING EXISTING COLUMNS ===")
# Convert salary to thousands for easier reading
df['salary_k'] = (df['salary'] / 1000).round(1)
print("Salary in thousands ($k):")
print(df[['first_name', 'salary', 'salary_k']].head())

# Modify string columns
df['department_upper'] = df['department'].str.upper()
df['initials'] = df['first_name'].str[0] + '.'
print(f"\nString modifications:")
print(df[['first_name', 'initials', 'department', 'department_upper']].head())

print("\n=== SORTING DATA ===")
# Sort by single column
salary_sorted = df.sort_values('salary', ascending=False)
print("Top 5 earners:")
print(salary_sorted[['first_name', 'department', 'salary']].head())

# Sort by multiple columns
multi_sorted = df.sort_values(['department', 'salary'], ascending=[True, False])
print("\nSorted by department (A-Z), then salary (high to low):")
print(multi_sorted[['first_name', 'department', 'salary']].head(10))

print("\n=== BASIC CALCULATIONS AND AGGREGATIONS ===")
# Basic statistics
print(f"Average salary: ${df['salary'].mean():,.0f}")
print(f"Median salary: ${df['salary'].median():,.0f}")
print(f"Salary standard deviation: ${df['salary'].std():,.0f}")

# Calculations by group
dept_avg_salary = df.groupby('department')['salary'].mean().sort_values(ascending=False)
print(f"\nAverage salary by department:")
for dept, avg_sal in dept_avg_salary.items():
    print(f"  {dept}: ${avg_sal:,.0f}")

print("\n=== RENAMING COLUMNS ===")
# Rename specific columns
df_renamed = df.rename(columns={
    'first_name': 'employee_name',
    'years_experience': 'experience_years'
})
print("Columns after renaming:")
print(df_renamed.columns.tolist()[:5])  # Show first 5 column names


=== CREATING NEW COLUMNS ===
Salary bands created:
salary_band
Medium     48
High       25
Low        24
Unknown     3
Name: count, dtype: int64

Senior vs Junior employees:
senior_employee
Senior    63
Junior    37
Name: count, dtype: int64

Performance categories:
performance_category
Needs Improvement    46
Good Performer       27
High Performer       22
Not Rated             5
Name: count, dtype: int64

=== MODIFYING EXISTING COLUMNS ===
Salary in thousands ($k):
   first_name   salary  salary_k
0  Employee_1  49520.0      49.5
1  Employee_2  23122.0      23.1
2  Employee_3  35059.0      35.1
3  Employee_4  42803.0      42.8
4  Employee_5  32310.0      32.3

String modifications:
   first_name initials department department_upper
0  Employee_1       E.         HR               HR
1  Employee_2       E.    Finance          FINANCE
2  Employee_3       E.         IT               IT
3  Employee_4       E.    Finance          FINANCE
4  Employee_5       E.    Finance          FINANCE





**Key data processing techniques:**
- **`apply()` with lambda**: Apply functions to columns
- **`np.where()`**: Conditional value assignment
- **Custom functions with `apply()`**: More complex logic
- **String operations**: `.str` accessor for text manipulation
- **`sort_values()`**: Sorting by one or multiple columns
- **`groupby()`**: Group data for calculations
- **`rename()`**: Change column names
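
For banding a numeric column, `pd.cut()` is a possible vectorised alternative to `apply()` with a nested conditional. The sketch below reproduces the same thresholds as the salary bands above (35,000 and 55,000) on an illustrative Series; it is one way to do it, not the approach used in the tutorial's code.

```python
import numpy as np
import pandas as pd

# Illustrative salaries, including one missing value
salary = pd.Series([28000, 35000, 47500, 55000, 72000, np.nan])

# Bins are right-inclusive by default: (-inf, 35000], (35000, 55000], (55000, inf)
bands = pd.cut(
    salary,
    bins=[-np.inf, 35000, 55000, np.inf],
    labels=["Low", "Medium", "High"],
)

# Missing salaries fall in no bin; give them an explicit "Unknown" label
bands = bands.cat.add_categories("Unknown").fillna("Unknown")
print(bands.tolist())  # ['Low', 'Low', 'Medium', 'Medium', 'High', 'Unknown']
```

The result is a categorical column, which keeps the band order (Low, Medium, High) available for sorting.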

---

## 8. Handling Missing Data

Missing data is common in real datasets. Let's learn different strategies to handle it.



In [61]:
print("=== IDENTIFYING MISSING DATA ===")
# Check for missing values
print("Missing values per column:")
missing_summary = df.isnull().sum()
print(missing_summary[missing_summary > 0])

# Count rows with and without missing data
print(f"\nRows with any missing data: {df.isnull().any(axis=1).sum()}")
print(f"Rows with all data complete: {df.dropna().shape[0]}")

# Show rows with missing data
print("\nRows with missing values:")
missing_rows = df[df.isnull().any(axis=1)]
print(missing_rows[['employee_id', 'first_name', 'salary', 'performance_score']])

print("\n=== DIFFERENT STRATEGIES FOR HANDLING MISSING DATA ===")

# Strategy 1: Drop rows with missing data
df_dropped_rows = df.dropna()
print(f"Original rows: {len(df)}, After dropping rows with NaN: {len(df_dropped_rows)}")

# Strategy 2: Drop columns with missing data
df_dropped_cols = df.dropna(axis=1)
print(f"Original columns: {df.shape[1]}, After dropping columns with NaN: {df_dropped_cols.shape[1]}")

# Strategy 3: Fill missing values with a constant
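df_filled_constant = df.copy()
# Assign the result back: fillna(..., inplace=True) on a selected column is chained assignment
df_filled_constant['salary'] = df_filled_constant['salary'].fillna(0)
# Filling a numeric column with text turns it into an object column
df_filled_constant['performance_score'] = df_filled_constant['performance_score'].fillna('Not Rated')
# salary_k was derived from salary earlier and is not filled here, so its 3 NaN remain
print(f"Missing values after filling with constants: {df_filled_constant.isnull().sum().sum()}")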

# Strategy 4: Fill with statistical measures
df_filled_stats = df.copy()
# Fill salary with median (assigned back rather than filled in place on a selected column)
median_salary = df['salary'].median()
df_filled_stats['salary'] = df_filled_stats['salary'].fillna(median_salary)
print(f"Filled missing salaries with median: ${median_salary:,.0f}")

# Fill performance score with mean
mean_performance = df['performance_score'].mean()
df_filled_stats['performance_score'] = df_filled_stats['performance_score'].fillna(mean_performance)
print(f"Filled missing performance scores with mean: {mean_performance:.2f}")

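# Strategy 5: Forward fill, then backward fill any gap left at the start
# (fillna(method=...) is deprecated in newer pandas; ffill() and bfill() do the same job)
df_ffill = df.copy()
df_ffill['performance_score'] = df_ffill['performance_score'].ffill()  # Forward fill
df_ffill['performance_score'] = df_ffill['performance_score'].bfill()  # Backward fill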

print("\n=== CHECKING OUR WORK ===")
print("Missing values after different strategies:")
print(f"Original: {df.isnull().sum().sum()}")
print(f"Dropped rows: {df_dropped_rows.isnull().sum().sum()}")
print(f"Filled with constants: {df_filled_constant.isnull().sum().sum()}")
print(f"Filled with statistics: {df_filled_stats.isnull().sum().sum()}")
print(f"Forward/backward fill: {df_ffill.isnull().sum().sum()}")

# Let's use the statistically filled version for the rest of our analysis
# Derived columns such as salary_k are not recomputed, so the count below is not zero
df = df_filled_stats.copy()
print(f"\nUsing statistically filled dataset. Missing values: {df.isnull().sum().sum()}")


=== IDENTIFYING MISSING DATA ===
Missing values per column:
salary               3
performance_score    5
salary_k             3
dtype: int64

Rows with any missing data: 8
Rows with all data complete: 92

Rows with missing values:
    employee_id   first_name   salary  performance_score
6          1007   Employee_7  61301.0                NaN
15         1016  Employee_16      NaN               2.99
32         1033  Employee_33  58313.0                NaN
35         1036  Employee_36  40951.0                NaN
45         1046  Employee_46  56119.0                NaN
54         1055  Employee_55  40718.0                NaN
61         1062  Employee_62      NaN               3.83
85         1086  Employee_86      NaN               2.29

=== DIFFERENT STRATEGIES FOR HANDLING MISSING DATA ===
Original rows: 100, After dropping rows with NaN: 92
Original columns: 15, After dropping columns with NaN: 12
Missing values after filling with constants: 3
Filled missing salaries with median: $45,



**Missing data strategies:**
- **`dropna()`**: Remove rows or columns with missing values
- **`fillna()`**: Replace missing values with specified values
- **Statistical fills**: Use mean, median, or mode to fill gaps
- **Forward/backward fill**: Use adjacent values to fill gaps
- **Domain-specific fills**: Use business logic to determine appropriate values
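
As one example of a domain-specific fill, a missing salary can be replaced with the median of the employee's own department rather than the overall median. This is a sketch on illustrative data; `demo` and `salary_filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Illustrative data only
demo = pd.DataFrame({
    "department": ["IT", "IT", "IT", "HR", "HR"],
    "salary": [40000.0, np.nan, 50000.0, 30000.0, np.nan],
})

# transform() returns one value per original row, aligned with demo's index
dept_median = demo.groupby("department")["salary"].transform("median")

# Each gap is filled with the median of its own department
demo["salary_filled"] = demo["salary"].fillna(dept_median)
print(demo)
```

Whether a group-level fill is appropriate depends on the data; it assumes that employees in the same department have broadly comparable salaries.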

---

## 9. Grouping and Aggregation

Grouping data allows us to calculate statistics for different categories. This is essential for data analysis.



In [62]:
print("=== BASIC GROUPING ===")
# Group by single column
dept_groups = df.groupby('department')

# Basic statistics by group
print("Employee count by department:")
print(dept_groups.size())

print(f"\nAverage salary by department:")
dept_salary_avg = dept_groups['salary'].mean().sort_values(ascending=False)
for dept, avg in dept_salary_avg.items():
    print(f"  {dept}: ${avg:,.0f}")

print("\n=== MULTIPLE AGGREGATIONS ===")
# Multiple statistics at once
dept_stats = dept_groups['salary'].agg(['count', 'mean', 'median', 'min', 'max', 'std'])
dept_stats.columns = ['Count', 'Mean', 'Median', 'Min', 'Max', 'Std_Dev']
dept_stats = dept_stats.round(0)
print("Comprehensive salary statistics by department:")
print(dept_stats)

print("\n=== GROUPING BY MULTIPLE COLUMNS ===")
# Group by department and city
dept_city_groups = df.groupby(['department', 'city'])
dept_city_avg = dept_city_groups['salary'].mean().round(0)
print("Average salary by department and city:")
print(dept_city_avg)

# Convert to more readable format
dept_city_table = dept_city_avg.unstack(fill_value=0)
print(f"\nSalary table (departments vs cities):")
print(dept_city_table)

print("\n=== CUSTOM AGGREGATIONS ===")
# Define custom aggregation functions
def salary_range(series):
    return series.max() - series.min()

def high_performers_count(series):
    return (series > 4.0).sum()

# Apply custom functions
dept_custom = dept_groups.agg({
    'salary': ['mean', salary_range],
    'performance_score': ['mean', high_performers_count],
    'years_experience': 'mean'
}).round(2)

print("Custom aggregations by department:")
print(dept_custom)

print("\n=== FILTERING GROUPS ===")
# Filter groups based on conditions
large_departments = dept_groups.filter(lambda x: len(x) >= 15)
print(f"Employees in departments with 15+ people: {len(large_departments)}")

# Show which departments are large
large_dept_names = large_departments['department'].unique()
print(f"Large departments: {large_dept_names.tolist()}")

print("\n=== PIVOT TABLES ===")
# Create pivot table (like Excel pivot tables)
pivot_performance = df.pivot_table(
    values='salary',
    index='department',
    columns='senior_employee',
    aggfunc='mean',
    fill_value=0
).round(0)

print("Pivot table: Average salary by department and seniority")
print(pivot_performance)

# More complex pivot table
pivot_complex = df.pivot_table(
    values=['salary', 'performance_score'],
    index='department',
    columns='salary_band',
    aggfunc={'salary': 'mean', 'performance_score': 'mean'},
    fill_value=0
).round(2)

print(f"\nComplex pivot table:")
print(pivot_complex)

print("\n=== CROSS-TABULATION ===")
# Cross-tabulation (frequency tables)
crosstab = pd.crosstab(df['department'], df['salary_band'], margins=True)
print("Cross-tabulation: Department vs Salary Band")
print(crosstab)

# Percentage cross-tabulation
crosstab_pct = pd.crosstab(df['department'], df['salary_band'], normalize='index') * 100
crosstab_pct = crosstab_pct.round(1)
print(f"\nPercentage distribution within each department:")
print(crosstab_pct)


=== BASIC GROUPING ===
Employee count by department:
department
Finance       17
HR            25
IT            11
Marketing     19
Operations    17
Sales         11
dtype: int64

Average salary by department:
  Operations: $52,299
  HR: $46,468
  Sales: $46,047
  Marketing: $44,675
  IT: $43,763
  Finance: $42,741

=== MULTIPLE AGGREGATIONS ===
Comprehensive salary statistics by department:
            Count     Mean   Median      Min      Max  Std_Dev
department                                                    
Finance        17  42741.0  42941.0  15837.0  74619.0  14003.0
HR             25  46468.0  45889.0  15954.0  71829.0  11355.0
IT             11  43763.0  47260.0  23348.0  61301.0  14562.0
Marketing      19  44675.0  44554.0  19342.0  74476.0  15915.0
Operations     17  52299.0  49738.0  33561.0  90916.0  16104.0
Sales          11  46047.0  52076.0  10229.0  66020.0  17907.0

=== GROUPING BY MULTIPLE COLUMNS ===
Average salary by department and city:
department  city      
F



**Grouping and aggregation concepts:**
- **`groupby()`**: Split data into groups based on column values
- **`agg()`**: Apply multiple functions to grouped data
- **Multiple grouping**: Group by several columns simultaneously
- **Custom functions**: Write your own aggregation functions
- **`filter()`**: Keep only groups that meet certain conditions
- **`pivot_table()`**: Reshape data like Excel pivot tables
- **`crosstab()`**: Create frequency tables
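
When several statistics are computed at once, named aggregation is a possible way to get flat, readable column names without renaming them afterwards. The sketch below uses an illustrative frame, and the output column names (`headcount`, `mean_salary`, and so on) are chosen just for this example.

```python
import pandas as pd

# Illustrative data only
demo = pd.DataFrame({
    "department": ["IT", "IT", "HR", "HR", "HR"],
    "salary": [40000, 50000, 30000, 36000, 42000],
    "performance_score": [4.5, 3.2, 4.1, 2.8, 4.9],
})

# Each keyword becomes an output column: name=(input column, function)
summary = demo.groupby("department").agg(
    headcount=("salary", "count"),
    mean_salary=("salary", "mean"),
    salary_range=("salary", lambda s: s.max() - s.min()),
    high_performers=("performance_score", lambda s: (s > 4.0).sum()),
)
print(summary)
```

Compared with passing a dictionary of lists to `agg()`, this form avoids the two-level column header in the result.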

---

## 10. Summary

Congratulations! You've learned the fundamentals of pandas for data manipulation and analysis. Let's summarise what we've covered:



In [63]:
print("=== FINAL DATA SUMMARY ===")
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"Data types: {df.dtypes.value_counts().to_dict()}")
print(f"Missing values: {df.isnull().sum().sum()}")

print("\n=== KEY INSIGHTS FROM OUR DATA ===")
print(f"Total employees: {len(df)}")
print(f"Departments: {df['department'].nunique()} ({', '.join(df['department'].unique())})")
print(f"Average salary: ${df['salary'].mean():,.0f}")
print(f"Salary range: ${df['salary'].min():,.0f} - ${df['salary'].max():,.0f}")
print(f"Senior employees: {(df['senior_employee'] == 'Senior').sum()}")
print(f"High performers: {(df['performance_category'] == 'High Performer').sum()}")

print("\n=== WHAT YOU'VE LEARNED ===")
skills_learned = [
    "Creating and understanding DataFrames",
    "Loading and saving CSV files", 
    "Exploring data with info(), describe(), and value_counts()",
    "Selecting columns and filtering rows",
    "Creating new columns and modifying existing ones",
    "Sorting and basic calculations",
    "Handling missing data with various strategies",
    "Grouping data and calculating aggregations",
    "Creating pivot tables and cross-tabulations"
]

for i, skill in enumerate(skills_learned, 1):
    print(f"{i:2d}. {skill}")

print("\n=== NEXT STEPS ===")
next_steps = [
    "Learn data visualisation with matplotlib or seaborn",
    "Explore advanced pandas functions like merge() and join()",
    "Practice with real datasets from your domain",
    "Learn about time series analysis with pandas",
    "Explore statistical analysis and machine learning"
]

for i, step in enumerate(next_steps, 1):
    print(f"{i}. {step}")

print("\n=== SAVE YOUR WORK ===")
# Save the processed dataset
final_filename = 'processed_employee_data.csv'
df.to_csv(final_filename, index=False)
print(f"Processed data saved as: {final_filename}")

print("\nHappy analysing! 🐼")


=== FINAL DATA SUMMARY ===
Dataset shape: (100, 15)
Columns: ['employee_id', 'first_name', 'department', 'job_title', 'salary', 'years_experience', 'city', 'performance_score', 'start_date', 'salary_band', 'senior_employee', 'performance_category', 'salary_k', 'department_upper', 'initials']
Data types: {dtype('O'): 9, dtype('float64'): 3, dtype('int64'): 1, dtype('int32'): 1, dtype('<M8[ns]'): 1}
Missing values: 3

=== KEY INSIGHTS FROM OUR DATA ===
Total employees: 100
Departments: 6 (HR, Finance, IT, Marketing, Operations, Sales)
Average salary: $46,141
Salary range: $10,229 - $90,916
Senior employees: 63
High performers: 22

=== WHAT YOU'VE LEARNED ===
 1. Creating and understanding DataFrames
 2. Loading and saving CSV files
 3. Exploring data with info(), describe(), and value_counts()
 4. Selecting columns and filtering rows
 5. Creating new columns and modifying existing ones
 6. Sorting and basic calculations
 7. Handling missing data with various strategies
 8. Grouping data and calculating aggregations
 9. Creating pivot tables and cross-tabulations



## Key Takeaways

**Essential pandas concepts you now understand:**
- DataFrames are like spreadsheets with powerful manipulation capabilities
- Always explore your data first using `info()`, `describe()`, and `head()`
- Boolean indexing is your friend for filtering data
- Missing data is normal - choose the right strategy for your situation
- Groupby operations are powerful for calculating statistics by category
- Pandas integrates well with other Python libraries for further analysis

**Best practices to remember:**
- Always check your data after loading or transforming it
- Use meaningful variable names and add comments to your code
- Save intermediate results when doing complex transformations
- Start with simple operations before moving to complex ones
- Practice with different datasets to reinforce your learning
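
The advice to check data after loading or transforming it can be turned into a small reusable function. The following is a minimal sketch, assuming a table with `employee_id` and `salary` columns; the function name `find_problems` and the particular checks are illustrative choices, not a fixed recipe.

```python
import numpy as np
import pandas as pd


def find_problems(frame: pd.DataFrame) -> list:
    """Return a list of readable problems found in an employee table (hypothetical helper)."""
    problems = []
    if frame["employee_id"].duplicated().any():
        problems.append("duplicate employee_id")
    if frame["salary"].isna().any():
        problems.append("missing salary")
    if (frame["salary"] < 0).any():
        problems.append("negative salary")
    return problems


# Illustrative data only
clean = pd.DataFrame({"employee_id": [1001, 1002], "salary": [42000.0, 51000.0]})
messy = pd.DataFrame({"employee_id": [1001, 1001], "salary": [np.nan, -5.0]})

print(find_problems(clean))  # []
print(find_problems(messy))  # ['duplicate employee_id', 'missing salary', 'negative salary']
```

Running a check like this after each major step makes it easier to spot where a problem was introduced.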

This tutorial provides a solid foundation for data analysis with pandas. Each concept builds upon the previous ones, so take your time to understand each section before moving on.