# Module 7: Pandas for Data Analysis

## Topics Covered
1. Introduction to Pandas - Series and DataFrames
2. Reading Data (CSV, Excel, JSON)
3. Exploring DataFrames (head, info, describe)
4. Selecting Data - loc and iloc
5. Filtering and Boolean Indexing
6. Handling Missing Data
7. Data Types and Type Conversion
8. Adding and Removing Columns
9. Sorting and Ranking
10. Groupby Operations
11. Merging and Joining DataFrames
12. Pivot Tables and Cross-tabulation
13. Apply, Map, and Applymap
14. String Methods in Pandas
15. DateTime Operations
16. Exporting Data

## Learning Objectives

By the end of this module, you will be able to:
- Create and manipulate pandas Series and DataFrames
- Load data from various file formats
- Select, filter, and transform data efficiently
- Handle missing values appropriately
- Aggregate and group data for analysis
- Merge multiple datasets together
- Work with text and datetime data

---

---
# Section 1: Introduction to Pandas - Series and DataFrames
---

## What is Pandas?

Pandas is the most widely used Python library for working with tabular data. It provides:

- **DataFrame**: A 2D labeled data structure (like a spreadsheet)
- **Series**: A 1D labeled array (like a column)
- **Rich functionality**: Data manipulation, cleaning, analysis, and visualization

### Why This Matters in Data Science

Pandas is the foundation of the data science workflow in Python. It's used for:
- Loading and saving data in various formats
- Data cleaning and preprocessing
- Exploratory data analysis
- Feature engineering for machine learning

In [None]:
# Import pandas (convention: import as 'pd')
import pandas as pd
import numpy as np

print(f"Pandas version: {pd.__version__}")

## Series

A Series is a one-dimensional labeled array. Think of it as a single column of data with an index.

In [None]:
# Example: Creating a Series

# From a list
sales = pd.Series([100, 150, 200, 175, 225])
print("Series from list:")
print(sales)
print(f"\nIndex: {sales.index.tolist()}")
print(f"Values: {sales.values}")

In [None]:
# Example: Series with custom index

sales = pd.Series(
    [100, 150, 200, 175, 225],
    index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
)
print("Series with custom index:")
print(sales)

# Access by label (label slices include both endpoints, so 'Wed' is included)
print(f"\nWednesday sales: {sales['Wed']}")
print(f"Monday to Wednesday: \n{sales['Mon':'Wed']}")

In [None]:
# Example: Series from dictionary

population = pd.Series({
    'California': 39538223,
    'Texas': 29145505,
    'Florida': 21538187,
    'New York': 20201249
})
print("Population by state:")
print(population)
print(f"\nTexas population: {population['Texas']:,}")
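
The index is more than a display label: arithmetic between two Series matches values by label, not by position. The sketch below uses made-up weekly sales figures (`week1` and `week2` are illustrative names, not part of the course datasets) to show what happens when the labels only partly overlap.

```python
import pandas as pd

# Hypothetical unit sales for two weeks that cover different days
week1 = pd.Series([100, 150, 200], index=["Mon", "Tue", "Wed"])
week2 = pd.Series([120, 130, 90], index=["Tue", "Wed", "Thu"])

# Addition aligns on index labels; labels found in only one Series give NaN
combined = week1 + week2
print(combined)

# Series.add with fill_value treats a missing label as 0 instead
combined_filled = week1.add(week2, fill_value=0)
print(combined_filled)
```

Whether a missing day should count as 0 or stay missing depends on the data, so `fill_value=0` is a choice rather than a default to apply everywhere.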

## DataFrame

A DataFrame is a 2D labeled data structure with columns of potentially different types. Think of it as a spreadsheet or SQL table.

In [None]:
# Example: Creating a DataFrame from a dictionary

data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'salary': [70000, 80000, 90000, 75000]
}

df = pd.DataFrame(data)
print("DataFrame:")
print(df)

In [None]:
# Example: DataFrame attributes

print(f"Shape: {df.shape}")        # (rows, columns)
print(f"Columns: {df.columns.tolist()}")
print(f"Index: {df.index.tolist()}")
print(f"Data types:\n{df.dtypes}")

In [None]:
# Example: Accessing columns (returns Series)

# Using bracket notation
print("Names column:")
print(df['name'])
print(f"\nType: {type(df['name'])}")

# Using dot notation (only works when the column name is a valid Python
# identifier and does not clash with an existing DataFrame attribute or method)
print(f"\nAges: {df.age.tolist()}")

In [None]:
# Example: Selecting multiple columns

subset = df[['name', 'salary']]
print("Name and Salary:")
print(subset)
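
Dot notation is convenient but has edge cases that bracket notation avoids. The following sketch uses a small made-up DataFrame (`demo` and its column names are chosen purely for illustration) with one column that shares its name with a DataFrame method and one that contains a space.

```python
import pandas as pd

# Hypothetical columns picked to show where dot notation breaks down
demo = pd.DataFrame({
    "count": [3, 5, 2],               # same name as the DataFrame.count() method
    "unit price": [9.5, 4.0, 7.25],   # contains a space, so demo.unit price is a syntax error
})

# demo.count refers to the method, not the column
print(callable(demo.count))

# Bracket notation always returns the column
print(demo["count"].tolist())
print(demo["unit price"].tolist())
```

For that reason, bracket notation is usually the safer habit in scripts, with dot notation reserved for quick interactive exploration.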

## Practice Exercise 1.1

**Task:** Create a DataFrame containing information about 5 products:
- product_name: Laptop, Mouse, Keyboard, Monitor, Headphones
- price: 999.99, 29.99, 79.99, 299.99, 149.99
- quantity: 50, 200, 150, 75, 100

Then calculate the total inventory value (price * quantity) for each product.

In [None]:
# Your code here


In [None]:
# Solution 1.1

import pandas as pd

products = pd.DataFrame({
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'],
    'price': [999.99, 29.99, 79.99, 299.99, 149.99],
    'quantity': [50, 200, 150, 75, 100]
})

# Calculate inventory value
products['inventory_value'] = products['price'] * products['quantity']

print("Product Inventory:")
print(products)
print(f"\nTotal Inventory Value: ${products['inventory_value'].sum():,.2f}")

---
# Section 2: Reading Data (CSV, Excel, JSON)
---

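Pandas can read data from many file formats. The most common are CSV, Excel, and JSON, loaded with `read_csv()`, `read_excel()`, and `read_json()`. Note that `read_excel()` needs an optional engine such as openpyxl to be installed.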

In [None]:
# Example: Reading a CSV file

# Read the sales data
sales_df = pd.read_csv('assets/datasets/sales_data.csv')

print(f"Shape: {sales_df.shape}")
print(f"Columns: {sales_df.columns.tolist()}")
print("\nFirst 5 rows:")
print(sales_df.head())

In [None]:
# Example: Reading with options

# Common read_csv parameters:
# - sep: delimiter (default ',')
# - header: row number for column names
# - names: list of column names
# - usecols: which columns to read
# - dtype: data types for columns
# - parse_dates: columns to parse as dates
# - na_values: values to treat as NA

# Read only specific columns
sales_subset = pd.read_csv(
    'assets/datasets/sales_data.csv',
    usecols=['transaction_id', 'date', 'product', 'total_amount']
)
print("Subset of columns:")
print(sales_subset.head())
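
To see several of the `read_csv` parameters listed above working together without depending on a file, the sketch below parses a small in-memory CSV. The data, the semicolon delimiter, and the `missing` placeholder are all invented for illustration.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data with a placeholder for a missing amount
raw = io.StringIO(
    "id;date;product;amount\n"
    "1;2023-01-05;Laptop;999.99\n"
    "2;2023-01-06;Mouse;missing\n"
    "3;2023-01-07;Keyboard;79.99\n"
)

demo = pd.read_csv(
    raw,
    sep=";",                          # non-default delimiter
    usecols=["id", "date", "amount"], # skip the product column
    dtype={"id": "int64"},            # force a specific type
    parse_dates=["date"],             # parse as datetime
    na_values=["missing"],            # treat this text as NaN
)

print(demo)
print(demo.dtypes)
```

Because `missing` is declared in `na_values`, the `amount` column can be read as numeric instead of falling back to text.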

In [None]:
# Example: Reading with date parsing

sales_df = pd.read_csv(
    'assets/datasets/sales_data.csv',
    parse_dates=['date']
)

print(f"Date column type: {sales_df['date'].dtype}")
print(f"\nDate range: {sales_df['date'].min()} to {sales_df['date'].max()}")

In [None]:
# Example: Reading JSON

# Read the products JSON file
products_json = pd.read_json('assets/datasets/products.json')
print("JSON data:")
print(products_json)

In [None]:
# Example: Reading nested JSON (common with API responses)

import json

# Load JSON file
with open('assets/datasets/products.json', 'r') as f:
    products_data = json.load(f)

# Extract electronics products into DataFrame
electronics = pd.DataFrame(products_data['categories']['Electronics']['products'])
print("Electronics products:")
print(electronics)
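
When records contain nested objects, `pd.json_normalize()` can flatten them in one step. The sketch below uses a made-up payload (its structure is an assumption and does not mirror `products.json`) to show how `record_path` selects the list of records and `meta` copies a top-level field onto every row.

```python
import pandas as pd

# Hypothetical API-style response; the keys are invented for illustration
payload = {
    "store": "Demo Store",
    "products": [
        {"name": "Laptop", "price": 999.99, "specs": {"color": "silver", "warranty_years": 2}},
        {"name": "Mouse", "price": 29.99, "specs": {"color": "black", "warranty_years": 1}},
    ],
}

# record_path points at the list of records; meta repeats 'store' on each row.
# Nested dictionaries become dotted column names such as 'specs.color'.
flat = pd.json_normalize(payload, record_path="products", meta=["store"])
print(flat)
```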

In [None]:
# Example: Reading employees CSV

employees = pd.read_csv('assets/datasets/employees.csv')
print(f"Employees dataset: {employees.shape[0]} rows, {employees.shape[1]} columns")
print("\nColumn names:")
print(employees.columns.tolist())
print("\nFirst 3 rows:")
print(employees.head(3))

---
# Section 3: Exploring DataFrames
---

Before analyzing data, you need to understand its structure and content.

In [None]:
# Load data for exploration
df = pd.read_csv('assets/datasets/sales_data.csv')
print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")

In [None]:
# Example: head() and tail()

print("First 5 rows (head):")
print(df.head())

print("\nLast 3 rows (tail):")
print(df.tail(3))

In [None]:
# Example: info() - overview of the DataFrame

print("DataFrame Info:")
df.info()

In [None]:
# Example: describe() - statistical summary

print("Statistical Summary (numeric columns):")
print(df.describe())

In [None]:
# Example: describe() for all columns including non-numeric

print("Summary of all columns:")
print(df.describe(include='all'))

In [None]:
# Example: Value counts for categorical columns

print("Products sold (top 10):")
print(df['product'].value_counts().head(10))

print("\nSales by region:")
print(df['region'].value_counts())

In [None]:
# Example: Unique values

print(f"Unique categories: {df['category'].unique()}")
print(f"Number of unique products: {df['product'].nunique()}")
print(f"Number of unique sales reps: {df['sales_rep'].nunique()}")

In [None]:
# Example: Checking for missing values

print("Missing values per column:")
print(df.isnull().sum())

print(f"\nTotal missing values: {df.isnull().sum().sum()}")
print(f"Percentage missing: {df.isnull().sum().sum() / df.size * 100:.2f}%")

## Practice Exercise 3.1

**Task:** Load the employees.csv file and answer:
1. How many employees are there?
2. What departments exist and how many employees in each?
3. What is the average salary?
4. Are there any missing values?

In [None]:
# Your code here


In [None]:
# Solution 3.1

employees = pd.read_csv('assets/datasets/employees.csv')

# 1. Number of employees
print(f"1. Total employees: {len(employees)}")

# 2. Departments and counts
print("\n2. Employees by department:")
print(employees['department'].value_counts())

# 3. Average salary
print(f"\n3. Average salary: ${employees['salary'].mean():,.2f}")

# 4. Missing values
print("\n4. Missing values:")
missing = employees.isnull().sum()
print(missing[missing > 0])

---
# Section 4: Selecting Data - loc and iloc
---

Pandas provides two main ways to select data:
- **loc**: Label-based selection
- **iloc**: Integer position-based selection

In [None]:
# Create sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'department': ['Sales', 'IT', 'HR', 'IT', 'Sales'],
    'salary': [50000, 60000, 55000, 65000, 52000]
}, index=['E001', 'E002', 'E003', 'E004', 'E005'])

print("Sample DataFrame:")
print(df)

In [None]:
# Example: loc - label-based selection

# Single row by label
print("Row E002:")
print(df.loc['E002'])

# Multiple rows
print("\nRows E001 and E003:")
print(df.loc[['E001', 'E003']])

In [None]:
# Example: loc with row and column selection

# Single value
print(f"E002's salary: {df.loc['E002', 'salary']}")

# Row with specific columns
print("\nE003's name and department:")
print(df.loc['E003', ['name', 'department']])

# Multiple rows and columns (label slices include both endpoints)
print("\nSubset:")
print(df.loc['E001':'E003', 'name':'department'])

In [None]:
# Example: iloc - position-based selection

# First row (position 0)
print("First row (iloc[0]):")
print(df.iloc[0])

# Last row
print("\nLast row (iloc[-1]):")
print(df.iloc[-1])

In [None]:
# Example: iloc with slicing

# First 3 rows
print("First 3 rows:")
print(df.iloc[:3])

# Row positions 1-3 and column positions 0-2 (the slice end is excluded)
print("\nRows 1-3, columns 0-2:")
print(df.iloc[1:4, 0:3])

In [None]:
# Example: Modifying values with loc

df_copy = df.copy()

# Update single value
df_copy.loc['E002', 'salary'] = 65000
print("After updating E002's salary:")
print(df_copy)

# Update multiple values
df_copy.loc['E001', ['department', 'salary']] = ['Marketing', 55000]
print("\nAfter updating E001:")
print(df_copy)
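
A common source of confusion is that `loc` slices include the end label while `iloc` slices exclude the end position. The sketch below uses a made-up DataFrame with integer labels that do not start at 0 (`demo` is an illustrative name) so that the two behaviours are easy to tell apart.

```python
import pandas as pd

# Hypothetical data: the index labels are 10, 20, 30, 40 rather than 0, 1, 2, 3
demo = pd.DataFrame(
    {"value": ["a", "b", "c", "d"]},
    index=[10, 20, 30, 40],
)

# loc works on labels and includes both endpoints of the slice
by_label = demo.loc[10:30]
print(by_label)

# iloc works on positions and excludes the end, like a Python list slice
by_position = demo.iloc[0:2]
print(by_position)
```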

---
# Section 5: Filtering and Boolean Indexing
---

Filtering data based on conditions is one of the most common operations in data analysis.

In [None]:
# Load sales data
sales = pd.read_csv('assets/datasets/sales_data.csv')
print(f"Total records: {len(sales)}")

In [None]:
# Example: Simple filtering

# Filter by category
electronics = sales[sales['category'] == 'Electronics']
print(f"Electronics transactions: {len(electronics)}")
print(electronics.head())

In [None]:
# Example: Numeric comparisons

# High-value transactions
high_value = sales[sales['total_amount'] > 500]
print(f"Transactions over $500: {len(high_value)}")
print(high_value.head())

In [None]:
# Example: Multiple conditions

# Use & for AND, | for OR (with parentheses!)
filtered = sales[
    (sales['category'] == 'Electronics') & 
    (sales['total_amount'] > 200)
]
print(f"Electronics over $200: {len(filtered)}")

# OR condition
filtered2 = sales[
    (sales['region'] == 'North') | 
    (sales['region'] == 'South')
]
print(f"North or South region: {len(filtered2)}")

In [None]:
# Example: Using isin() for multiple values

# Filter for specific regions
regions = ['North', 'Central']
filtered = sales[sales['region'].isin(regions)]
print(f"North or Central: {len(filtered)}")

# Exclude certain products
exclude_products = ['Laptop', 'Monitor 27inch']
filtered2 = sales[~sales['product'].isin(exclude_products)]
print(f"Excluding Laptop and Monitor: {len(filtered2)}")

In [None]:
# Example: String methods for filtering

# Products containing 'Mouse'
mouse_products = sales[sales['product'].str.contains('Mouse', na=False)]
print(f"Mouse products: {len(mouse_products)}")

# Products starting with 'W'
w_products = sales[sales['product'].str.startswith('W', na=False)]
print(f"Products starting with W: {len(w_products)}")
print(w_products['product'].unique())

In [None]:
# Example: Using query() method (cleaner syntax)

# Equivalent to boolean indexing
result = sales.query('category == "Electronics" and total_amount > 200')
print(f"Electronics over $200 (using query): {len(result)}")

# With variables
min_amount = 300
result2 = sales.query('total_amount > @min_amount')
print(f"Transactions over ${min_amount}: {len(result2)}")
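
It can help to build each condition as a named boolean Series and then combine them. The sketch below uses a small made-up DataFrame (column names follow the sales data, but the values are invented) and also shows why Python's `and` keyword cannot replace `&`.

```python
import pandas as pd

# Hypothetical transactions for illustration
demo = pd.DataFrame({
    "category": ["Electronics", "Furniture", "Electronics", "Office Supplies"],
    "total_amount": [250.0, 120.0, 80.0, 300.0],
})

# Each comparison produces a boolean Series (a mask)
is_electronics = demo["category"] == "Electronics"
is_large = demo["total_amount"] > 200

both = demo[is_electronics & is_large]         # AND
either = demo[is_electronics | is_large]       # OR
neither = demo[~(is_electronics | is_large)]   # NOT

# 'and' asks for a single True/False value for a whole Series, which pandas rejects
try:
    demo[is_electronics and is_large]
except ValueError as exc:
    error_message = str(exc)
    print(f"ValueError: {error_message}")
```

Naming the masks also sidesteps the parentheses issue, since each comparison is evaluated before the masks are combined.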

## Practice Exercise 5.1

**Task:** Using the sales data, find:
1. All transactions from the Central region with quantity > 5
2. All Furniture or Office Supplies transactions over $100
3. Transactions where the sales_rep name contains 'son' (like Johnson, Wilson)

In [None]:
# Your code here


In [None]:
# Solution 5.1

sales = pd.read_csv('assets/datasets/sales_data.csv')

# 1. Central region, quantity > 5
central_high_qty = sales[(sales['region'] == 'Central') & (sales['quantity'] > 5)]
print(f"1. Central with qty > 5: {len(central_high_qty)} transactions")

# 2. Furniture or Office Supplies over $100
furn_office = sales[
    (sales['category'].isin(['Furniture', 'Office Supplies'])) & 
    (sales['total_amount'] > 100)
]
print(f"2. Furniture/Office Supplies > $100: {len(furn_office)} transactions")

# 3. Sales rep containing 'son'
son_reps = sales[sales['sales_rep'].str.contains('son', case=False, na=False)]
print(f"3. Sales reps with 'son': {len(son_reps)} transactions")
print(f"   Reps: {son_reps['sales_rep'].unique()}")

---
# Section 6: Handling Missing Data
---

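Real-world data often has missing values. Pandas typically represents them as `NaN` (Not a Number) in numeric columns, `None` or `NaN` in text columns, and `NaT` (Not a Time) in datetime columns. Methods such as `isnull()`, `dropna()`, and `fillna()` treat all of these as missing.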

In [None]:
# Load data with missing values
sales = pd.read_csv('assets/datasets/sales_data.csv')

print("Missing values:")
print(sales.isnull().sum())

In [None]:
# Example: Finding rows with missing values

# Rows with any missing value
rows_with_missing = sales[sales.isnull().any(axis=1)]
print(f"Rows with missing values: {len(rows_with_missing)}")
print(rows_with_missing.head())

In [None]:
# Example: Dropping missing values

# Drop rows with ANY missing value
clean_df = sales.dropna()
print(f"Original: {len(sales)} rows")
print(f"After dropna(): {len(clean_df)} rows")

# Drop rows only if specific columns are missing
clean_df2 = sales.dropna(subset=['sales_rep', 'unit_price'])
print(f"After dropna(subset): {len(clean_df2)} rows")

In [None]:
# Example: Filling missing values

df = sales.copy()

# Fill with a specific value
df['customer_rating'] = df['customer_rating'].fillna(0)

# Fill with mean
mean_price = df['unit_price'].mean()
df['unit_price'] = df['unit_price'].fillna(mean_price)

# Fill with a string
df['sales_rep'] = df['sales_rep'].fillna('Unknown')

print("After filling:")
print(df.isnull().sum())

In [None]:
# Example: Forward fill and backward fill

# Useful for time series data
data = pd.Series([1, np.nan, np.nan, 4, np.nan, 6])
print(f"Original: {data.tolist()}")

# Forward fill (use previous value)
print(f"Forward fill: {data.ffill().tolist()}")

# Backward fill (use next value)
print(f"Backward fill: {data.bfill().tolist()}")
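
Several columns can be filled in a single call by passing a dictionary to `fillna()`. The sketch below uses a small made-up DataFrame (column names follow the sales data, values are invented) and fills numeric columns with their median, which is one reasonable choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
demo = pd.DataFrame({
    "unit_price": [10.0, np.nan, 30.0, 20.0],
    "customer_rating": [4.0, 5.0, np.nan, 3.0],
    "sales_rep": ["Ann", None, "Raj", "Ann"],
})

# Map each column to its own fill value
fill_values = {
    "unit_price": demo["unit_price"].median(),
    "customer_rating": demo["customer_rating"].median(),
    "sales_rep": "Unknown",
}

# fillna returns a new DataFrame; demo itself is left unchanged
filled = demo.fillna(fill_values)
print(filled)
```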

---
# Section 7: Data Types and Type Conversion
---

Correct data types are essential for proper analysis and memory efficiency.

In [None]:
# Example: Checking data types

sales = pd.read_csv('assets/datasets/sales_data.csv')
print("Data types:")
print(sales.dtypes)

In [None]:
# Example: Converting types with astype()

df = sales.copy()

# Convert to category (saves memory for repeated strings)
df['category'] = df['category'].astype('category')
df['region'] = df['region'].astype('category')

print("After conversion:")
print(df.dtypes)
print(f"\nCategory values: {df['category'].cat.categories.tolist()}")

In [None]:
# Example: Converting dates

df = sales.copy()
print(f"Date type before: {df['date'].dtype}")

# Convert to datetime
df['date'] = pd.to_datetime(df['date'])
print(f"Date type after: {df['date'].dtype}")

# Now we can use datetime operations
print(f"\nYear range: {df['date'].dt.year.min()} - {df['date'].dt.year.max()}")

In [None]:
# Example: Numeric conversions with error handling

# Sample data with problematic values
data = pd.Series(['10', '20', 'thirty', '40', None])
print(f"Original: {data.tolist()}")

# errors='coerce' converts invalid values to NaN
numeric = pd.to_numeric(data, errors='coerce')
print(f"As numeric: {numeric.tolist()}")
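
The memory benefit of the `category` type can be checked directly with `memory_usage(deep=True)`. The sketch below builds a made-up column of repeated region names; the exact byte counts depend on the pandas version and platform, so only the comparison matters.

```python
import pandas as pd

# Hypothetical column: 5 distinct region names repeated 2,000 times each
regions = pd.Series(["North", "South", "East", "West", "Central"] * 2000)
as_category = regions.astype("category")

# deep=True counts the memory used by the string values themselves
string_bytes = regions.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"As strings:  {string_bytes:,} bytes")
print(f"As category: {category_bytes:,} bytes")
print(f"Categories:  {as_category.cat.categories.tolist()}")
```

A categorical column stores each distinct value once plus a small integer code per row, so the saving grows with the number of repeats.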

---
# Section 8: Adding and Removing Columns
---

In [None]:
# Example: Adding new columns

df = pd.read_csv('assets/datasets/sales_data.csv')

# Add calculated column
df['unit_price_filled'] = df['unit_price'].fillna(df['total_amount'] / df['quantity'])

# Add column with constant value
df['currency'] = 'USD'

# Add column based on condition
df['high_value'] = df['total_amount'] > 200

print(df[['transaction_id', 'total_amount', 'high_value', 'currency']].head())

In [None]:
# Example: Adding column using np.where (if-else)

df = pd.read_csv('assets/datasets/sales_data.csv')

# Create size category based on quantity
df['order_size'] = np.where(df['quantity'] > 5, 'Large', 'Small')

print(df[['product', 'quantity', 'order_size']].head(10))

In [None]:
# Example: Multiple conditions with np.select

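# Conditions are checked in order and the first match wins
conditions = [
    df['quantity'] <= 2,
    df['quantity'] <= 5,
    df['quantity'] <= 10,
    df['quantity'] > 10
]
choices = ['XS', 'S', 'M', 'L']

# Pass a string default: the built-in default of 0 does not mix with string
# choices and raises a TypeError on newer NumPy releases. Rows matching no
# condition (for example, a missing quantity) are labeled 'Unknown'.
df['size_category'] = np.select(conditions, choices, default='Unknown')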

print("Order size distribution:")
print(df['size_category'].value_counts())
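
When the conditions are consecutive numeric ranges, `pd.cut()` is an alternative way to assign bins. The sketch below uses made-up quantities and bin edges chosen to match the size labels used in this section; it assumes quantities are positive.

```python
import numpy as np
import pandas as pd

# Hypothetical order quantities
quantity = pd.Series([1, 2, 3, 5, 6, 10, 11])

# Bins are right-inclusive by default: (0, 2], (2, 5], (5, 10], (10, inf)
size = pd.cut(
    quantity,
    bins=[0, 2, 5, 10, np.inf],
    labels=["XS", "S", "M", "L"],
)

print(pd.DataFrame({"quantity": quantity, "size": size}))
```

The result is an ordered categorical, so sorting by `size` follows the XS < S < M < L order rather than alphabetical order.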

In [None]:
# Example: Removing columns

df = pd.read_csv('assets/datasets/sales_data.csv')
print(f"Columns before: {df.columns.tolist()}")

# Drop single column
df = df.drop('customer_rating', axis=1)

# Drop multiple columns
df = df.drop(['sales_rep', 'unit_price'], axis=1)

print(f"Columns after: {df.columns.tolist()}")

---
# Section 9: Sorting and Ranking
---

In [None]:
# Example: Sorting by values

df = pd.read_csv('assets/datasets/sales_data.csv')

# Sort by single column
sorted_by_amount = df.sort_values('total_amount', ascending=False)
print("Top 5 transactions by amount:")
print(sorted_by_amount[['transaction_id', 'product', 'total_amount']].head())

In [None]:
# Example: Sorting by multiple columns

# Sort by category (asc), then by total_amount (desc)
sorted_multi = df.sort_values(
    ['category', 'total_amount'], 
    ascending=[True, False]
)
print("Sorted by category then amount:")
print(sorted_multi[['category', 'product', 'total_amount']].head(10))

In [None]:
# Example: Ranking

df = pd.read_csv('assets/datasets/sales_data.csv')

# Add rank column
df['amount_rank'] = df['total_amount'].rank(ascending=False)

# Show top ranked
top_ranked = df.nsmallest(5, 'amount_rank')
print("Top 5 by rank:")
print(top_ranked[['transaction_id', 'total_amount', 'amount_rank']])
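
By default, `rank()` gives tied values the average of the ranks they span, which is why fractional ranks such as 1.5 can appear. The sketch below compares three `method` options on a made-up Series containing one tie.

```python
import pandas as pd

# Hypothetical amounts with a tie for the highest value
amounts = pd.Series([100, 200, 200, 50])

ranks = pd.DataFrame({
    "amount": amounts,
    # Default: tied values share the average of their ranks
    "average": amounts.rank(ascending=False),
    # 'min': tied values share the lowest rank, and the next rank is skipped
    "min": amounts.rank(ascending=False, method="min"),
    # 'dense': like 'min', but without gaps after a tie
    "dense": amounts.rank(ascending=False, method="dense"),
})

print(ranks)
```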

In [None]:
# Example: nlargest and nsmallest

# Top 5 highest amounts
print("Top 5 highest:")
print(df.nlargest(5, 'total_amount')[['transaction_id', 'product', 'total_amount']])

# Bottom 5
print("\nBottom 5:")
print(df.nsmallest(5, 'total_amount')[['transaction_id', 'product', 'total_amount']])

---
# Section 10: Groupby Operations
---

Groupby is one of the most powerful features in pandas. It allows you to:
1. **Split** data into groups
2. **Apply** a function to each group
3. **Combine** results into a new Series or DataFrame
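
The three steps can be written out by hand to see what `groupby` does in a single call. This is only an illustration on made-up data (`demo` and its values are invented), not how pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical transactions
demo = pd.DataFrame({
    "region": ["North", "South", "North", "South", "North"],
    "total_amount": [10, 20, 30, 40, 50],
})

# Manual version
manual = {}
for region, group in demo.groupby("region"):        # 1. split into groups
    manual[region] = group["total_amount"].sum()    # 2. apply a function
manual_result = pd.Series(manual)                   # 3. combine the results

# The same computation in one expression
grouped_result = demo.groupby("region")["total_amount"].sum()

print(manual_result)
print(grouped_result)
```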

In [None]:
# Load data
sales = pd.read_csv('assets/datasets/sales_data.csv')

# Example: Basic groupby with single aggregation
sales_by_region = sales.groupby('region')['total_amount'].sum()
print("Total sales by region:")
print(sales_by_region)

In [None]:
# Example: Multiple aggregations

region_stats = sales.groupby('region')['total_amount'].agg(['sum', 'mean', 'count'])
print("Region statistics:")
print(region_stats)

In [None]:
# Example: Named aggregations (recommended approach)

region_summary = sales.groupby('region').agg(
    total_sales=('total_amount', 'sum'),
    avg_sale=('total_amount', 'mean'),
    num_transactions=('transaction_id', 'count'),
    avg_quantity=('quantity', 'mean')
).round(2)

print("Region summary:")
print(region_summary)

In [None]:
# Example: Group by multiple columns

category_region = sales.groupby(['category', 'region']).agg(
    total_sales=('total_amount', 'sum'),
    num_transactions=('transaction_id', 'count')
).round(2)

print("Sales by category and region:")
print(category_region)

In [None]:
# Example: Resetting index after groupby

category_region_flat = sales.groupby(['category', 'region']).agg(
    total_sales=('total_amount', 'sum')
).reset_index()

print("Flattened result:")
print(category_region_flat.head(10))

In [None]:
# Example: Transform - apply function and keep original shape

# Calculate each transaction's percentage of region total
sales['region_total'] = sales.groupby('region')['total_amount'].transform('sum')
sales['pct_of_region'] = (sales['total_amount'] / sales['region_total'] * 100).round(2)

print("Sample with percentages:")
print(sales[['region', 'total_amount', 'region_total', 'pct_of_region']].head())

## Practice Exercise 10.1

**Task:** Using the sales data:
1. Find the total sales and average transaction amount for each product
2. Find the top-selling product in each category (by total sales)
3. Calculate what percentage each category contributes to total sales

In [None]:
# Your code here


In [None]:
# Solution 10.1

sales = pd.read_csv('assets/datasets/sales_data.csv')

# 1. Total and average by product
product_stats = sales.groupby('product').agg(
    total_sales=('total_amount', 'sum'),
    avg_transaction=('total_amount', 'mean')
).round(2).sort_values('total_sales', ascending=False)

print("1. Product statistics (top 10):")
print(product_stats.head(10))

# 2. Top product per category
category_product = sales.groupby(['category', 'product'])['total_amount'].sum().reset_index()
top_per_category = category_product.loc[
    category_product.groupby('category')['total_amount'].idxmax()
]
print("\n2. Top product per category:")
print(top_per_category)

# 3. Category percentage of total
total_sales = sales['total_amount'].sum()
category_pct = (sales.groupby('category')['total_amount'].sum() / total_sales * 100).round(2)
print("\n3. Category share of total sales:")
print(category_pct)

---
# Section 11: Merging and Joining DataFrames
---

Combining data from multiple sources is essential in data analysis.

In [None]:
# Create sample DataFrames
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4, 5],
    'customer_id': ['C001', 'C002', 'C001', 'C003', 'C002'],
    'amount': [100, 200, 150, 300, 250]
})

customers = pd.DataFrame({
    'customer_id': ['C001', 'C002', 'C003', 'C004'],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'city': ['NYC', 'LA', 'Chicago', 'Houston']
})

print("Orders:")
print(orders)
print("\nCustomers:")
print(customers)

In [None]:
# Example: Inner merge (only matching rows)

merged = pd.merge(orders, customers, on='customer_id', how='inner')
print("Inner merge:")
print(merged)

In [None]:
# Example: Left merge (keep all rows from left)

merged_left = pd.merge(orders, customers, on='customer_id', how='left')
print("Left merge:")
print(merged_left)

In [None]:
# Example: Right merge (keep all rows from right)

merged_right = pd.merge(orders, customers, on='customer_id', how='right')
print("Right merge:")
print(merged_right)

In [None]:
# Example: Outer merge (keep all rows from both)

merged_outer = pd.merge(orders, customers, on='customer_id', how='outer')
print("Outer merge:")
print(merged_outer)

In [None]:
# Example: Merging on different column names

df1 = pd.DataFrame({'emp_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'employee_id': [1, 2, 4], 'salary': [50000, 60000, 55000]})

merged = pd.merge(df1, df2, left_on='emp_id', right_on='employee_id', how='inner')
print("Merge on different column names:")
print(merged)

In [None]:
# Example: Concatenating DataFrames

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Vertical concatenation (stacking rows)
stacked = pd.concat([df1, df2], ignore_index=True)
print("Vertical concat:")
print(stacked)

# Horizontal concatenation (adding columns)
df3 = pd.DataFrame({'C': [9, 10]})
combined = pd.concat([df1, df3], axis=1)
print("\nHorizontal concat:")
print(combined)
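
Two optional `merge` parameters help verify a join: `indicator=True` adds a `_merge` column showing where each row came from, and `validate` raises an error if the key relationship is not what you expect. The sketch below re-creates small orders and customers tables like the ones in this section so that it runs on its own.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": ["C001", "C002", "C001", "C003", "C002"],
    "amount": [100, 200, 150, 300, 250],
})
customers = pd.DataFrame({
    "customer_id": ["C001", "C002", "C003", "C004"],
    "name": ["Alice", "Bob", "Charlie", "Diana"],
})

checked = pd.merge(
    orders,
    customers,
    on="customer_id",
    how="outer",
    indicator=True,           # adds _merge: 'both', 'left_only', or 'right_only'
    validate="many_to_one",   # each customer_id may appear only once in customers
)
print(checked)

# Customers that never appear in orders
no_orders = checked.loc[checked["_merge"] == "right_only", "name"].tolist()
print(f"Customers without orders: {no_orders}")
```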

---
# Section 12: Pivot Tables and Cross-tabulation
---

In [None]:
# Load sales data
sales = pd.read_csv('assets/datasets/sales_data.csv')

# Example: Basic pivot table
pivot = pd.pivot_table(
    sales,
    values='total_amount',
    index='category',
    columns='region',
    aggfunc='sum'
)
print("Pivot table - Sales by Category and Region:")
print(pivot.round(2))

In [None]:
# Example: Pivot with multiple aggregations

pivot_multi = pd.pivot_table(
    sales,
    values='total_amount',
    index='category',
    columns='region',
    aggfunc=['sum', 'mean', 'count']
)
print("Pivot with multiple aggregations:")
print(pivot_multi.round(2))

In [None]:
# Example: Pivot with margins (totals)

pivot_margins = pd.pivot_table(
    sales,
    values='total_amount',
    index='category',
    columns='region',
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)
print("Pivot with margins:")
print(pivot_margins.round(2))

In [None]:
# Example: Cross-tabulation (counts)

crosstab = pd.crosstab(sales['category'], sales['region'])
print("Cross-tabulation (counts):")
print(crosstab)

In [None]:
# Example: Cross-tabulation with percentages

crosstab_pct = pd.crosstab(
    sales['category'], 
    sales['region'],
    normalize='index'  # Row percentages
) * 100

print("Cross-tabulation (row percentages):")
print(crosstab_pct.round(1))
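
A pivot table can be thought of as a groupby on two keys followed by moving one key into the columns. The sketch below checks that equivalence on a small made-up dataset (column names follow the sales data, values are invented).

```python
import pandas as pd

# Hypothetical transactions
demo = pd.DataFrame({
    "category": ["A", "A", "B", "B", "A", "B"],
    "region": ["North", "South", "North", "South", "North", "South"],
    "total_amount": [10, 20, 30, 40, 50, 60],
})

pivot = pd.pivot_table(
    demo,
    values="total_amount",
    index="category",
    columns="region",
    aggfunc="sum",
)

# Same numbers: group by both keys, then unstack 'region' into columns
via_groupby = demo.groupby(["category", "region"])["total_amount"].sum().unstack()

print(pivot)
print(via_groupby)
```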

---
# Section 13: Apply, Map, and Applymap
---

These methods allow you to apply custom functions to your data.

In [None]:
# Example: apply() on Series

sales = pd.read_csv('assets/datasets/sales_data.csv')

# Apply a function to each value
def categorize_amount(amount):
    if amount < 100:
        return 'Small'
    elif amount < 500:
        return 'Medium'
    else:
        return 'Large'

sales['amount_category'] = sales['total_amount'].apply(categorize_amount)
print(sales[['total_amount', 'amount_category']].head(10))

In [None]:
# Example: apply() with lambda

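# Square root of amount (for NumPy functions like this one, the vectorized
# form np.sqrt(sales['total_amount']) gives the same result and is faster)
sales['amount_sqrt'] = sales['total_amount'].apply(lambda x: np.sqrt(x))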

# Discount calculation
sales['discount_price'] = sales['total_amount'].apply(lambda x: x * 0.9)

print(sales[['total_amount', 'amount_sqrt', 'discount_price']].head())

In [None]:
# Example: apply() on DataFrame (row-wise)

def calc_adjusted_amount(row):
    base = row['total_amount']
    if row['category'] == 'Electronics':
        return base * 1.1  # 10% markup
    elif row['category'] == 'Furniture':
        return base * 0.95  # 5% discount
    else:
        return base

sales['adjusted_amount'] = sales.apply(calc_adjusted_amount, axis=1)
print(sales[['category', 'total_amount', 'adjusted_amount']].head(10))

In [None]:
# Example: map() for value replacement

region_mapping = {
    'North': 'N',
    'South': 'S',
    'East': 'E',
    'West': 'W',
    'Central': 'C'
}

sales['region_code'] = sales['region'].map(region_mapping)
print(sales[['region', 'region_code']].drop_duplicates())

In [None]:
# Example: map() on DataFrame (element-wise)
# DataFrame.map() was added in pandas 2.1 as the new name for applymap();
# on older pandas versions, use df.applymap() instead.

df = pd.DataFrame({
    'A': [1.234, 2.567, 3.891],
    'B': [4.123, 5.456, 6.789]
})

# Format all values to 1 decimal place
formatted = df.map(lambda x: f'{x:.1f}')
print("Formatted DataFrame:")
print(formatted)
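
Row-wise `apply()` is flexible but calls a Python function once per row, which can be slow on large tables. Where the logic is a simple lookup, a vectorized version is often possible. The sketch below uses made-up data and the same markup and discount rules as the row-wise example in this section, then compares both approaches.

```python
import numpy as np
import pandas as pd

# Hypothetical transactions
demo = pd.DataFrame({
    "category": ["Electronics", "Furniture", "Office Supplies", "Electronics"],
    "total_amount": [100.0, 200.0, 50.0, 400.0],
})

def calc_adjusted_amount(row):
    """Row-wise rule: 10% markup on Electronics, 5% discount on Furniture."""
    if row["category"] == "Electronics":
        return row["total_amount"] * 1.1
    elif row["category"] == "Furniture":
        return row["total_amount"] * 0.95
    return row["total_amount"]

row_wise = demo.apply(calc_adjusted_amount, axis=1)

# Vectorized: look up a multiplier per category; unlisted categories get 1.0
multipliers = {"Electronics": 1.1, "Furniture": 0.95}
vectorized = demo["total_amount"] * demo["category"].map(multipliers).fillna(1.0)

print(pd.DataFrame({"row_wise": row_wise, "vectorized": vectorized}))
print(f"Same result: {np.allclose(row_wise, vectorized)}")
```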

---
# Section 14: String Methods in Pandas
---

Pandas provides string methods through the `.str` accessor.

In [None]:
# Example: Basic string methods

employees = pd.read_csv('assets/datasets/employees.csv')

# Create full name
employees['full_name'] = employees['first_name'] + ' ' + employees['last_name']

# Upper and lower case
print("Upper case names:")
print(employees['full_name'].str.upper().head())

print("\nLower case:")
print(employees['full_name'].str.lower().head())

In [None]:
# Example: String contains, starts with, ends with

# Find managers
managers = employees[employees['title'].str.contains('Manager', na=False)]
print(f"Managers: {len(managers)}")
print(managers['title'].unique())

# Names starting with 'J'
j_names = employees[employees['first_name'].str.startswith('J', na=False)]
print(f"\nNames starting with J: {len(j_names)}")

In [None]:
# Example: String length and slicing

# Get length of names
employees['name_length'] = employees['first_name'].str.len()

# Build initials from the first character of each name
employees['initials'] = employees['first_name'].str[:1] + employees['last_name'].str[:1]

print(employees[['first_name', 'last_name', 'name_length', 'initials']].head())

In [None]:
# Example: String split and extract

# Extract username from email
employees['username'] = employees['email'].str.split('@').str[0]

print(employees[['email', 'username']].head())

In [None]:
# Example: String replace

sales = pd.read_csv('assets/datasets/sales_data.csv')

# Replace in product names
sales['product_clean'] = sales['product'].str.replace(' ', '_', regex=False)  # literal match, not a regex
print(sales[['product', 'product_clean']].drop_duplicates().head())
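
For patterns with more than one part, `str.extract()` with a regular expression can pull several pieces into separate columns at once. The sketch below uses made-up email addresses; the pattern is a simple illustration and not a full email validator.

```python
import pandas as pd

# Hypothetical addresses, including one missing value
emails = pd.Series(["alice@example.com", "bob.smith@example.org", None])

# Named groups become column names; rows that do not match are left as NaN
parts = emails.str.extract(r"^(?P<username>[^@]+)@(?P<domain>.+)$")
print(parts)
```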

---
# Section 15: DateTime Operations
---

In [None]:
# Load and parse dates
sales = pd.read_csv('assets/datasets/sales_data.csv', parse_dates=['date'])
print(f"Date column type: {sales['date'].dtype}")
print(sales['date'].head())

In [None]:
# Example: Extracting date components

sales['year'] = sales['date'].dt.year
sales['month'] = sales['date'].dt.month
sales['day'] = sales['date'].dt.day
sales['day_of_week'] = sales['date'].dt.day_name()
sales['quarter'] = sales['date'].dt.quarter

print(sales[['date', 'year', 'month', 'day', 'day_of_week', 'quarter']].head())

In [None]:
# Example: Filtering by date

# Sales in 2023
sales_2023 = sales[sales['date'].dt.year == 2023]
print(f"Sales in 2023: {len(sales_2023)}")

# Sales in Q4
sales_q4 = sales[sales['quarter'] == 4]
print(f"Q4 sales: {len(sales_q4)}")

# Date range filter
start_date = '2023-01-01'
end_date = '2023-06-30'
sales_h1 = sales[(sales['date'] >= start_date) & (sales['date'] <= end_date)]
print(f"H1 2023 sales: {len(sales_h1)}")

In [None]:
# Example: Grouping by time periods

# Monthly sales
monthly_sales = sales.groupby(sales['date'].dt.to_period('M'))['total_amount'].sum()
print("Monthly sales (first 6 months):")
print(monthly_sales.head(6))

In [None]:
# Example: Date arithmetic

sales['days_ago'] = (pd.Timestamp.now() - sales['date']).dt.days

print("Recent transactions:")
print(sales[['date', 'days_ago']].head())
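
Another way to aggregate by time period is `resample()`, which works on a datetime index. The sketch below uses made-up dates and amounts and sums them into weekly totals; with the `'W'` frequency, each week ends on a Sunday and is labeled by that date.

```python
import pandas as pd

# Hypothetical transactions: two in the first week of January 2023, two in the second
demo = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-09", "2023-01-15"]),
    "total_amount": [10, 20, 30, 40],
})

# resample needs the dates in the index
weekly = demo.set_index("date")["total_amount"].resample("W").sum()
print(weekly)
```

Unlike grouping by `dt.to_period()`, `resample()` also emits periods with no transactions, which is useful when gaps should show up as zeros in a time series.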

---
# Section 16: Exporting Data
---

In [None]:
# Create sample DataFrame to export
sales = pd.read_csv('assets/datasets/sales_data.csv')
summary = sales.groupby('category').agg(
    total_sales=('total_amount', 'sum'),
    avg_sale=('total_amount', 'mean'),
    num_transactions=('transaction_id', 'count')
).round(2)

print("Summary to export:")
print(summary)

In [None]:
# Example: Export to CSV

summary.to_csv('assets/datasets/category_summary_export.csv')
print("Exported to CSV")

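# Without index: 'category' is the index of summary, so move it into a regular
# column first. Calling to_csv(index=False) directly would drop the category labels.
summary.reset_index().to_csv('assets/datasets/category_summary_no_index.csv', index=False)
print("Exported to CSV (no index)")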

In [None]:
# Example: Export to JSON

# orient='records' writes a list of row objects; other options include
# 'index', 'columns', 'split', 'values', and 'table'
summary.reset_index().to_json('assets/datasets/summary_records.json', orient='records', indent=2)
print("Exported to JSON")

In [None]:
# Example: Verify exports

print("Verify CSV export:")
verify_df = pd.read_csv('assets/datasets/category_summary_export.csv')
print(verify_df)
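
When a DataFrame comes from `groupby`, the group labels live in the index, so the `index` argument of `to_csv()` decides whether they are written at all. Calling `to_csv()` without a path returns the CSV text, which makes this easy to inspect. The sketch below uses a small made-up summary to compare the options.

```python
import pandas as pd

# Hypothetical transactions summarized by category
demo = pd.DataFrame({
    "category": ["Electronics", "Furniture", "Electronics"],
    "total_amount": [100.0, 50.0, 25.0],
})
summary = demo.groupby("category").agg(total_sales=("total_amount", "sum"))

# No path given, so to_csv returns the CSV content as a string
with_index = summary.to_csv()
without_index = summary.to_csv(index=False)
flattened = summary.reset_index().to_csv(index=False)

print("index=True (default):")
print(with_index)
print("index=False (category labels are lost):")
print(without_index)
print("reset_index() then index=False:")
print(flattened)
```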

## Practice Exercise 16.1

**Task:** Create a comprehensive sales report:
1. Load the sales data
2. Create a summary with: total sales, average transaction, top product, top region for each category
3. Export to both CSV and JSON formats

In [None]:
# Your code here


In [None]:
# Solution 16.1

import pandas as pd

# 1. Load data
sales = pd.read_csv('assets/datasets/sales_data.csv')

# 2. Create comprehensive summary
# Basic aggregations
summary = sales.groupby('category').agg(
    total_sales=('total_amount', 'sum'),
    avg_transaction=('total_amount', 'mean'),
    num_transactions=('transaction_id', 'count')
).round(2)

# Find top product per category
top_products = sales.groupby(['category', 'product'])['total_amount'].sum().reset_index()
top_products = top_products.loc[top_products.groupby('category')['total_amount'].idxmax()]
top_products = top_products.set_index('category')['product']

# Find top region per category
top_regions = sales.groupby(['category', 'region'])['total_amount'].sum().reset_index()
top_regions = top_regions.loc[top_regions.groupby('category')['total_amount'].idxmax()]
top_regions = top_regions.set_index('category')['region']

# Add to summary
summary['top_product'] = top_products
summary['top_region'] = top_regions

print("Sales Report by Category:")
print(summary)

# 3. Export
summary.to_csv('assets/datasets/sales_report.csv')
summary.reset_index().to_json('assets/datasets/sales_report.json', orient='records', indent=2)

print("\nExported to sales_report.csv and sales_report.json")

---
# Module Summary

## Key Takeaways

1. **DataFrames and Series** are the core data structures in pandas
2. **Reading data** with `read_csv()`, `read_json()`, `read_excel()`
3. **Exploring data** with `head()`, `info()`, `describe()`, `value_counts()`
4. **Selecting data** with `loc` (label-based) and `iloc` (position-based)
5. **Filtering** with boolean indexing and `query()`
6. **Missing data** handled with `dropna()`, `fillna()`
7. **Groupby** for aggregations: split-apply-combine
8. **Merging** DataFrames with `merge()` and `concat()`
9. **Pivot tables** for multi-dimensional summaries
10. **String and datetime** methods for text and time data

## Essential Functions

```python
# Reading/Writing
pd.read_csv(), df.to_csv()

# Exploration
df.head(), df.info(), df.describe(), df['col'].value_counts()

# Selection
df.loc[], df.iloc[], df[condition]

# Transformation
df.groupby(), df.merge(), df.pivot_table()
df.apply(), df['col'].str, df['col'].dt
```

## Next Module

In the next module, we'll cover **Data Visualization** using Matplotlib and Seaborn to create informative charts and graphs from your data.

## Additional Practice

For extra practice, try these challenges:

1. **Sales Dashboard**: Create a comprehensive analysis of the sales data including trends over time, regional comparisons, and product performance

2. **Employee Analysis**: Using the employees dataset, analyze salary distributions by department, identify experience levels based on hire dates, and create performance rankings

3. **Data Pipeline**: Create a pipeline that reads multiple data files, cleans and transforms them, joins them together, and exports summary reports