# 🐼 Pandas Fundamentals - Session 3

## Python Fundamentals Workshop: Data Science & ML

**What you'll learn in this notebook:**

- ✅ Create and manipulate Pandas Series and DataFrames
- ✅ Read data from CSV and JSON files
- ✅ Select and filter data
- ✅ Clean and prepare data for analysis
- ✅ Perform grouping and aggregation operations

**Estimated time:** 40-50 minutes

---

## 🎯 Why Pandas?

Pandas is the most popular library for data manipulation and analysis in Python:

- **Powerful:** Handle large datasets with ease
- **Flexible:** Work with structured data (like Excel, CSV, SQL)
- **Fast:** Optimized for performance
- **Intuitive:** Easy-to-use syntax for complex operations

Think of Pandas as **Excel on steroids**!

Let's start by importing Pandas!

In [None]:
import pandas as pd
import numpy as np

print(f"Pandas version: {pd.__version__}")
print("✅ Pandas imported successfully!")

---

## 1️⃣ Pandas Series

A Series is a **one-dimensional** labeled array. Think of it as a single column in a spreadsheet.

### 💡 Example: Creating and Using Series

In [None]:
# Create a Series from a list
temperatures = pd.Series([22, 25, 19, 23, 28, 30, 27])
print("Temperature Series:")
print(temperatures)
print()

# Series with custom index
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
temp_series = pd.Series([22, 25, 19, 23, 28, 30, 27], index=days)
print("Temperature by Day:")
print(temp_series)
print()

# Access elements by index
print(f"Temperature on Monday: {temp_series['Mon']}°C")
print(f"Temperature on Friday: {temp_series['Fri']}°C")
print()

# Perform operations
print("Temperatures in Fahrenheit:")
print(temp_series * 9/5 + 32)
print()

# Basic statistics
print("Statistics:")
print(f"  Mean: {temp_series.mean():.2f}°C")
print(f"  Max: {temp_series.max()}°C")
print(f"  Min: {temp_series.min()}°C")
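
The exercise that follows hints at boolean indexing (`series[series > 85]`), which the example above does not show. Here is a minimal sketch using the same temperature data. The 25°C threshold and the names `is_warm` and `warm_days` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
temp_series = pd.Series([22, 25, 19, 23, 28, 30, 27], index=days)

# Comparing a Series to a scalar returns a boolean Series with the same index
is_warm = temp_series > 25  # threshold chosen only for illustration
print(is_warm)

# Indexing with that boolean Series keeps only the entries marked True
warm_days = temp_series[is_warm]
print(warm_days)
```

The comparison is strict, so Tuesday (exactly 25) is excluded. Use `>=` to include boundary values.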

### ✏️ Your Turn: Create Your Own Series

In [None]:
# ✏️ YOUR TURN: Practice creating and using Series

# Task 1: Create a Series of student scores with names as index
# Names: ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']
# Scores: [85, 92, 78, 88, 95]
# Your code here
student_scores = None  # Create Series with index=names
print("Student Scores:")
print(student_scores if student_scores is not None else "Not created yet")
print()

# Task 2: Access Bob's score
# Your code here
bob_score = None  # Access using index
print(f"Bob's score: {bob_score}" if bob_score is not None else "Not accessed yet")
print()

# Task 3: Calculate the average score
# Your code here
avg_score = None  # Use .mean()
print(f"Average score: {avg_score:.2f}" if avg_score is not None else "Not calculated yet")
print()

# Task 4: Find scores above 85
# Hint: Use boolean indexing like: series[series > 85]
# Your code here
high_scores = None  # Filter scores > 85
print("Scores above 85:")
print(high_scores if high_scores is not None else "Not filtered yet")

---

## 2️⃣ Pandas DataFrames

A DataFrame is a **two-dimensional** labeled data structure. Think of it as a spreadsheet or SQL table.

### 💡 Example: Creating and Exploring DataFrames

In [None]:
# Create a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Boston'],
    'Salary': [70000, 85000, 65000, 75000, 90000]
}
df = pd.DataFrame(data)
print("Employee DataFrame:")
print(df)
print()

# Display basic info
print("DataFrame Info:")
print(f"  Shape: {df.shape} (rows, columns)")
print(f"  Columns: {list(df.columns)}")
print(f"  Data types:")
print(df.dtypes)
print()

# Access a single column (returns a Series)
print("Names column:")
print(df['Name'])
print()

# Access multiple columns
print("Name and Salary:")
print(df[['Name', 'Salary']])
print()

# Access a row by index
print("First row:")
print(df.iloc[0])  # iloc = integer location
print()

# Access row by index label
print("Row at index 2:")
print(df.loc[2])  # loc = label location
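
The exercise below asks for the row with the highest value in a column, which the example does not cover. The sketch below shows two common ways to do it on a trimmed copy of the employee data. The names `top_earners`, `top_label`, and `top_row` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Salary': [70000, 85000, 65000, 75000, 90000],
})

# Option 1: find the maximum, then filter. This keeps every row tied for the max.
max_salary = df['Salary'].max()
top_earners = df[df['Salary'] == max_salary]
print(top_earners)

# Option 2: idxmax() returns the index label of the first maximum,
# which .loc can then use to fetch that single row as a Series.
top_label = df['Salary'].idxmax()
top_row = df.loc[top_label]
print(top_row)
```

Filtering returns a DataFrame and handles ties. `idxmax()` returns a single label, so it suits cases where one row is enough.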

### ✏️ Your Turn: Create and Explore DataFrames

In [None]:
# ✏️ YOUR TURN: Practice creating and exploring DataFrames

# Task 1: Create a DataFrame with product information
# Use the column names 'Product', 'Price', and 'Stock' (the later tasks rely on them)
# Product: ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard']
# Price: [1200, 800, 500, 300, 100]
# Stock: [15, 30, 25, 10, 50]
# Your code here
products_df = None  # Create DataFrame from dict
print("Products DataFrame:")
print(products_df if products_df is not None else "Not created yet")
print()

# Task 2: Display the shape of the DataFrame
# Your code here
df_shape = None  # Use .shape
print(f"DataFrame shape: {df_shape}" if df_shape is not None else "Not accessed yet")
print()

# Task 3: Access only the 'Product' and 'Price' columns
# Your code here
price_info = None  # Select columns
print("Product and Price:")
print(price_info if price_info is not None else "Not selected yet")
print()

# Task 4: Get the product with the highest price
# Hint: Use df['Price'].max() to find max, then filter
# Your code here
max_price = None  # Find maximum price
print(f"Highest price: ${max_price}" if max_price is not None else "Not calculated yet")

---

## 3️⃣ Reading Data from Files

Pandas can read data from many formats: CSV, Excel, JSON, SQL, and more!

### 💡 Example: Reading CSV Data

In [None]:
from io import StringIO

# Sample CSV data (in real scenarios, you'd use pd.read_csv('file.csv'))
csv_data = """Name,Department,Salary,Years
Alice,Engineering,95000,5
Bob,Marketing,65000,3
Charlie,Engineering,88000,4
Diana,Sales,72000,6
Eve,Marketing,70000,2
Frank,Engineering,105000,8
Grace,Sales,68000,3"""

# Read CSV from string (simulating file read)
df = pd.read_csv(StringIO(csv_data))
print("Data loaded from CSV:")
print(df)
print()

# Explore the data
print("First 3 rows:")
print(df.head(3))
print()

print("Last 3 rows:")
print(df.tail(3))
print()

print("Summary statistics:")
print(df.describe())
print()

print("Quick info:")
df.info()  # info() prints its report itself and returns None, so don't wrap it in print()
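
The learning goals also mention JSON, so here is a comparable sketch for `pd.read_json`, again reading from an in-memory string rather than a file. The sample records and the names `json_data` and `json_df` are made up for illustration. With a real file you would pass its path instead.

```python
from io import StringIO

import pandas as pd

# Illustrative JSON: a list of records, one object per row
json_data = """[
  {"Name": "Alice", "Department": "Engineering", "Salary": 95000},
  {"Name": "Bob", "Department": "Marketing", "Salary": 65000},
  {"Name": "Charlie", "Department": "Engineering", "Salary": 88000}
]"""

# orient='records' tells pandas that each JSON object is one row
json_df = pd.read_json(StringIO(json_data), orient='records')
print(json_df)
print(json_df.dtypes)
```

Wrapping the string in `StringIO` mirrors the CSV example and avoids the deprecation warning recent pandas versions raise when a literal JSON string is passed directly.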

### ✏️ Your Turn: Load and Explore Data

In [None]:
# ✏️ YOUR TURN: Practice loading and exploring data
from io import StringIO

# Sample sales data
sales_csv = """Product,Category,Units,Revenue
Laptop,Electronics,45,54000
Phone,Electronics,120,96000
Desk,Furniture,30,9000
Chair,Furniture,75,11250
Monitor,Electronics,60,18000
Lamp,Furniture,90,4500"""

# Task 1: Load the CSV data into a DataFrame
# Hint: Use pd.read_csv(StringIO(sales_csv))
# Your code here
sales_df = None  # Load CSV
print("Sales Data:")
print(sales_df if sales_df is not None else "Not loaded yet")
print()

# Task 2: Display the first 3 rows
# Your code here
first_rows = None  # Use .head(3)
print("First 3 rows:")
print(first_rows if first_rows is not None else "Not displayed yet")
print()

# Task 3: Get summary statistics
# Your code here
stats = None  # Use .describe()
print("Summary statistics:")
print(stats if stats is not None else "Not calculated yet")
print()

# Task 4: Check data types of each column
# Your code here
data_types = None  # Use .dtypes
print("Data types:")
print(data_types if data_types is not None else "Not checked yet")

---

## 4️⃣ Data Selection and Filtering

Select specific data based on conditions - one of Pandas' most powerful features!

### 💡 Example: Filtering DataFrames

In [None]:
# Create sample data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'Age': [25, 35, 28, 42, 31, 29],
    'Department': ['HR', 'IT', 'IT', 'Sales', 'HR', 'IT'],
    'Salary': [60000, 85000, 72000, 95000, 65000, 78000]
}
df = pd.DataFrame(data)
print("Employee Data:")
print(df)
print()

# Filter: Employees older than 30
older_than_30 = df[df['Age'] > 30]
print("Employees older than 30:")
print(older_than_30)
print()

# Filter: IT Department employees
it_employees = df[df['Department'] == 'IT']
print("IT Department:")
print(it_employees)
print()

# Multiple conditions: IT employees earning more than 75000
it_high_earners = df[(df['Department'] == 'IT') & (df['Salary'] > 75000)]
print("IT employees earning > $75,000:")
print(it_high_earners)
print()

# Using .loc for label-based selection
print("Rows 1-3, Name and Salary columns:")
print(df.loc[1:3, ['Name', 'Salary']])
print()

# Using .iloc for position-based selection
print("First 3 rows, first 2 columns:")
print(df.iloc[0:3, 0:2])
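
One detail in the example above is easy to miss: `.loc` slices include the end label, while `.iloc` slices exclude the end position, as ordinary Python slicing does. The sketch below illustrates this on a trimmed copy of the employee data and shows why the two diverge after filtering. All variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'Salary': [60000, 85000, 72000, 95000, 65000, 78000],
})

# .loc slices by label and INCLUDES the end label: labels 1, 2 and 3
by_label = df.loc[1:3, 'Name']
print(by_label.tolist())

# .iloc slices by position and EXCLUDES the end position: positions 1 and 2
by_position = df.iloc[1:3, 0]
print(by_position.tolist())

# Filtering keeps the original index labels, so labels and positions diverge
high_paid = df[df['Salary'] > 75000]
print(list(high_paid.index))

# The first remaining row is position 0 but still carries label 1
first_by_position = high_paid.iloc[0]
print(first_by_position['Name'])

# reset_index(drop=True) renumbers the rows from 0 if you want them to match again
renumbered = high_paid.reset_index(drop=True)
print(list(renumbered.index))
```

On the filtered frame, `high_paid.loc[0]` would raise a `KeyError` because label 0 no longer exists. That is a common source of confusion when mixing the two accessors.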

### ✏️ Your Turn: Filter and Select Data

In [None]:
# ✏️ YOUR TURN: Practice filtering and selecting data

# Sample product data
products_data = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard', 'Mouse'],
    'Category': ['Electronics', 'Electronics', 'Electronics', 'Electronics', 'Accessories', 'Accessories'],
    'Price': [1200, 800, 500, 300, 100, 50],
    'Stock': [15, 30, 25, 10, 50, 75],
    'Rating': [4.5, 4.7, 4.2, 4.6, 4.3, 4.4]
}
products_df = pd.DataFrame(products_data)
print("Product Data:")
print(products_df)
print()

# Task 1: Filter products with price > 500
# Your code here
expensive_products = None  # Filter using boolean indexing
print("Products over $500:")
print(expensive_products if expensive_products is not None else "Not filtered yet")
print()

# Task 2: Filter Electronics category products
# Your code here
electronics = None  # Filter by category
print("Electronics:")
print(electronics if electronics is not None else "Not filtered yet")
print()

# Task 3: Find products with rating >= 4.5 AND price < 1000
# Hint: Use (condition1) & (condition2)
# Your code here
good_value = None  # Multiple conditions
print("High-rated products under $1000:")
print(good_value if good_value is not None else "Not filtered yet")
print()

# Task 4: Select only Product and Price columns for products with stock > 20
# Your code here
high_stock = None  # Filter then select columns
print("High stock items (Product and Price):")
print(high_stock if high_stock is not None else "Not selected yet")

---

## 5️⃣ Data Cleaning

Real-world data is messy! Pandas makes cleaning data much easier.

### 💡 Example: Handling Missing Data

In [None]:
# Create data with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, np.nan, 28, 35, np.nan],
    'Salary': [70000, 85000, np.nan, 90000, 75000],
    'Department': ['HR', 'IT', 'IT', np.nan, 'HR']
}
df = pd.DataFrame(data)
print("Data with missing values (NaN):")
print(df)
print()

# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print()

# Drop rows with any missing values
df_dropped = df.dropna()
print("After dropping rows with NaN:")
print(df_dropped)
print()

# Fill missing values with a per-column default (mean, median, or a constant)
df_filled = df.fillna({
    'Age': df['Age'].mean(),
    'Salary': df['Salary'].median(),
    'Department': 'Unknown'
})
print("After filling missing values:")
print(df_filled)
print()

# Fill forward (use previous value)
# df.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas releases
df_ffill = df.ffill()
print("Forward fill:")
print(df_ffill)

### ✏️ Your Turn: Clean Your Data

In [None]:
# ✏️ YOUR TURN: Practice data cleaning

# Create messy data
messy_data = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
    'Price': [1200, np.nan, 500, 300, np.nan],
    'Stock': [15, 30, np.nan, 10, 50],
    'Rating': [4.5, 4.7, 4.2, np.nan, 4.3]
}
messy_df = pd.DataFrame(messy_data)
print("Messy Data:")
print(messy_df)
print()

# Task 1: Count missing values in each column
# Your code here
missing_count = None  # Use .isnull().sum()
print("Missing values per column:")
print(missing_count if missing_count is not None else "Not counted yet")
print()

# Task 2: Fill missing Price values with the mean price
# Your code here
cleaned_df = None  # Use .fillna()
print("After filling missing prices with mean:")
print(cleaned_df if cleaned_df is not None else "Not cleaned yet")
print()

# Task 3: Drop all rows that have any missing values
# Your code here
complete_df = None  # Use .dropna()
print("After dropping rows with missing values:")
print(complete_df if complete_df is not None else "Not dropped yet")
print()

# Task 4: Fill missing Rating with 4.0 (neutral rating)
# Your code here
filled_ratings = None  # Use .fillna({'Rating': 4.0})
print("After filling missing ratings:")
print(filled_ratings if filled_ratings is not None else "Not filled yet")

---

## 6️⃣ Grouping and Aggregation

Group data by categories and calculate statistics - like pivot tables in Excel!

### 💡 Example: Group By Operations

In [None]:
# Create employee data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'Department': ['IT', 'HR', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'Sales'],
    'Salary': [85000, 60000, 78000, 92000, 65000, 95000, 88000, 75000],
    'Years': [5, 3, 4, 6, 2, 8, 5, 3]
}
df = pd.DataFrame(data)
print("Employee Data:")
print(df)
print()

# Group by Department and calculate mean salary
dept_salary = df.groupby('Department')['Salary'].mean()
print("Average Salary by Department:")
print(dept_salary)
print()

# Multiple aggregations
dept_stats = df.groupby('Department').agg({
    'Salary': ['mean', 'min', 'max'],
    'Years': 'mean'
})
print("Department Statistics:")
print(dept_stats)
print()

# Count employees per department
dept_count = df.groupby('Department').size()
print("Employee Count by Department:")
print(dept_count)
print()

# Group by and transform
df['Dept_Avg_Salary'] = df.groupby('Department')['Salary'].transform('mean')
print("With Department Average:")
print(df[['Name', 'Department', 'Salary', 'Dept_Avg_Salary']])
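
The dictionary form of `.agg()` above produces two-level column headers such as `('Salary', 'mean')`, which can be awkward to select later. As an optional alternative, pandas also supports named aggregation, where each output column gets a flat name. The sketch below uses the same department data. The output names (`avg_salary`, `max_salary`, `avg_years`) are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'Sales'],
    'Salary': [85000, 60000, 78000, 92000, 65000, 95000, 88000, 75000],
    'Years': [5, 3, 4, 6, 2, 8, 5, 3],
})

# Named aggregation: new_column_name=(source_column, aggregation_function)
dept_summary = df.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    max_salary=('Salary', 'max'),
    avg_years=('Years', 'mean'),
)
print(dept_summary)

# Flat column names make follow-up selection straightforward
print(dept_summary.loc['IT', 'avg_salary'])
```

Either form computes the same statistics. Named aggregation only changes how the resulting columns are labeled.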

### ✏️ Your Turn: Group and Aggregate

In [None]:
# ✏️ YOUR TURN: Practice grouping and aggregation

# Sample sales data
sales_data = {
    'Region': ['North', 'South', 'North', 'East', 'West', 'South', 'East', 'West', 'North', 'South'],
    'Product': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A', 'B', 'C'],
    'Sales': [1200, 1500, 1100, 1800, 1400, 1300, 1600, 1250, 1350, 1700],
    'Units': [30, 25, 28, 35, 22, 32, 30, 26, 27, 33]
}
sales_df = pd.DataFrame(sales_data)
print("Sales Data:")
print(sales_df)
print()

# Task 1: Calculate total sales by Region
# Your code here
regional_sales = None  # Use .groupby('Region')['Sales'].sum()
print("Total Sales by Region:")
print(regional_sales if regional_sales is not None else "Not calculated yet")
print()

# Task 2: Calculate average units sold per Product
# Your code here
avg_units = None  # Use .groupby('Product')['Units'].mean()
print("Average Units by Product:")
print(avg_units if avg_units is not None else "Not calculated yet")
print()

# Task 3: Find min and max sales for each Region
# Hint: Use .agg(['min', 'max'])
# Your code here
region_range = None  # Use .groupby().agg()
print("Sales Range by Region:")
print(region_range if region_range is not None else "Not calculated yet")
print()

# Task 4: Count how many sales transactions per Product
# Your code here
product_count = None  # Use .groupby('Product').size()
print("Transaction Count by Product:")
print(product_count if product_count is not None else "Not counted yet")

---

## 🎉 Congratulations!

You've completed Pandas Fundamentals! You now know how to:

- ✅ Create and manipulate Series and DataFrames
- ✅ Read data from CSV files
- ✅ Select and filter data with conditions
- ✅ Clean messy data (handle missing values)
- ✅ Group data and calculate aggregations

### 🚀 Next Steps

Ready for visualization? Open **`03-matplotlib-visualization.ipynb`** to learn how to create beautiful charts!

---

## 📚 Quick Reference

**Creating DataFrames:**
```python
pd.DataFrame(dict)               # From dictionary
pd.read_csv('file.csv')          # From CSV file
pd.read_json('file.json')        # From JSON file
```

**Exploring Data:**
```python
df.head()                        # First 5 rows
df.tail()                        # Last 5 rows
df.info()                        # Column info
df.describe()                    # Statistics
df.shape                         # (rows, cols)
```

**Selection:**
```python
df['column']                     # Single column
df[['col1', 'col2']]             # Multiple columns
df[df['age'] > 30]               # Filter rows
df.loc[0:5, 'name']              # Label-based
df.iloc[0:5, 0:3]                # Position-based
```

**Cleaning:**
```python
df.isnull()                      # Check for NaN
df.dropna()                      # Drop missing
df.fillna(value)                 # Fill missing
```

**Grouping:**
```python
df.groupby('col').mean()         # Group and aggregate
df.groupby('col').agg(['min', 'max'])  # Multiple aggregations
```

---

*Session 3 - Pandas Fundamentals* 🐍