# Pandas Practice Tutorial

This notebook provides a comprehensive introduction to pandas, Python's most popular data manipulation and analysis library, along with practical exercises to reinforce your learning.

## Table of Contents
1. [Introduction to Pandas](#Introduction-to-Pandas)
2. [Installation](#Installation)
3. [Core Data Structures](#Core-Data-Structures)
4. [Data Loading and Saving](#Data-Loading-and-Saving)
5. [Data Exploration](#Data-Exploration)
6. [Data Cleaning](#Data-Cleaning)
7. [Data Manipulation](#Data-Manipulation)
8. [Grouping and Aggregation](#Grouping-and-Aggregation)
9. [Merging and Joining](#Merging-and-Joining)
10. [Time Series Analysis](#Time-Series-Analysis)
11. [Practice Exercises](#Practice-Exercises)
12. [Visualization with Pandas](#Visualization-with-Pandas)
13. [Performance Tips](#Performance-Tips)
14. [Common Pitfalls and Best Practices](#Common-Pitfalls-and-Best-Practices)
15. [Additional Resources](#Additional-Resources)

## Introduction to Pandas

Pandas is an open-source data analysis and manipulation library built on top of NumPy. It provides labeled data structures (`Series` and `DataFrame`) and the functions needed to load, clean, reshape, and summarize structured (tabular) data.

### Key Features:
- **DataFrame and Series**: Primary data structures for handling structured data
- **Data Alignment**: Automatic alignment of data based on labels
- **Missing Data Handling**: Robust tools for dealing with missing data
- **Data Import/Export**: Read/write data from various formats (CSV, Excel, JSON, SQL, etc.)
- **Data Transformation**: Powerful tools for reshaping, pivoting, and transforming data
- **Time Series**: Comprehensive time series functionality
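
Of the features listed above, data alignment is the easiest to overlook, so here is a minimal sketch of it. The Series names and values are made up for illustration; the point is only that arithmetic matches rows by index label rather than by position.

```python
import pandas as pd

# Two Series whose indexes only partly overlap (hypothetical values)
q1_sales = pd.Series([100, 200, 300], index=['apple', 'banana', 'cherry'])
q2_sales = pd.Series([10, 20, 30], index=['banana', 'cherry', 'date'])

# Addition aligns on index labels, not on position.
# Labels found in only one Series produce NaN.
total = q1_sales + q2_sales
print(total)

# add() with fill_value=0 treats a label missing on one side as 0 instead
total_filled = q1_sales.add(q2_sales, fill_value=0)
print(total_filled)
```

Whether `NaN` or a filled value is the right result depends on the data: `NaN` keeps the gap visible, while `fill_value=0` assumes a missing label means "no sales".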

## Installation

In [None]:
# Install pandas into the environment used by the current notebook kernel
%pip install pandas

# Optional extras: Excel support (openpyxl, xlrd) and plotting (matplotlib, seaborn)
%pip install openpyxl xlrd matplotlib seaborn

In [19]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Pandas version: 2.3.0
NumPy version: 2.3.0


## Core Data Structures

### Series
A Series is a one-dimensional labeled array capable of holding any data type.

In [20]:
# Creating a Series
data = [10, 20, 30, 40, 50]
index = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(data, index=index, name='my_series')
print("Series:")
print(series)
print(f"\nData type: {series.dtype}")
print(f"Index: {series.index.tolist()}")
print(f"Values: {series.values}")

Series:
a    10
b    20
c    30
d    40
e    50
Name: my_series, dtype: int64

Data type: int64
Index: ['a', 'b', 'c', 'd', 'e']
Values: [10 20 30 40 50]


In [21]:
# Series from dictionary
dict_data = {'apple': 5, 'banana': 3, 'orange': 8, 'grape': 12}
fruit_series = pd.Series(dict_data)
print("\nFruit Series:")
print(fruit_series)


Fruit Series:
apple      5
banana     3
orange     8
grape     12
dtype: int64


### DataFrame
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types.

In [22]:
# Creating a DataFrame from dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Tokyo', 'Paris', 'Sydney'],
    'Salary': [50000, 60000, 70000, 55000, 65000],
    'Department': ['IT', 'Finance', 'IT', 'HR', 'Finance']
}

df = pd.DataFrame(data)
print("DataFrame:")
print(df)
print(f"\nShape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Index: {df.index.tolist()}")

DataFrame:
      Name  Age      City  Salary Department
0    Alice   25  New York   50000         IT
1      Bob   30    London   60000    Finance
2  Charlie   35     Tokyo   70000         IT
3    Diana   28     Paris   55000         HR
4      Eve   32    Sydney   65000    Finance

Shape: (5, 5)
Columns: ['Name', 'Age', 'City', 'Salary', 'Department']
Index: [0, 1, 2, 3, 4]


In [None]:
# DataFrame info
print("\nDataFrame Info:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print("\nData Types:")
print(df.dtypes)


DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        5 non-null      object
 1   Age         5 non-null      int64 
 2   City        5 non-null      object
 3   Salary      5 non-null      int64 
 4   Department  5 non-null      object
dtypes: int64(2), object(3)
memory usage: 332.0+ bytes

Data Types:
Name          object
Age            int64
City          object
Salary         int64
Department    object
dtype: object


## Data Loading and Saving

In [24]:
# Create sample data for demonstration
sample_data = {
    'product_id': range(1, 101),
    'product_name': [f'Product_{i}' for i in range(1, 101)],
    'category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], 100),
    'price': np.random.uniform(10, 500, 100).round(2),
    'quantity_sold': np.random.randint(1, 100, 100),
    'rating': np.random.uniform(1, 5, 100).round(1),
    'date_sold': pd.date_range('2023-01-01', periods=100, freq='D')
}

products_df = pd.DataFrame(sample_data)
print("Sample Products DataFrame:")
print(products_df.head())

Sample Products DataFrame:
   product_id product_name     category   price  quantity_sold  rating  \
0           1    Product_1     Clothing  164.32             63     4.8   
1           2    Product_2         Home  421.23              2     4.4   
2           3    Product_3  Electronics  163.21             28     2.3   
3           4    Product_4         Home  230.79              5     3.2   
4           5    Product_5  Electronics  250.13             13     3.5   

   date_sold  
0 2023-01-01  
1 2023-01-02  
2 2023-01-03  
3 2023-01-04  
4 2023-01-05  


In [None]:
# Save to CSV
products_df.to_csv('products.csv', index=False)
print("Data saved to products.csv")

# Read from CSV
loaded_df = pd.read_csv('products.csv')
print("\nLoaded DataFrame:")
print(loaded_df.head())
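
One detail of the CSV round trip is worth checking: CSV stores dates as text, so a datetime column is only restored as `datetime64` if `read_csv` is asked to parse it. The sketch below uses a small stand-in frame (`demo`, hypothetical values) and an in-memory buffer rather than `products.csv`, so it runs on its own.

```python
import io
import pandas as pd

# Small stand-in for products_df (hypothetical values)
demo = pd.DataFrame({
    'product_id': [1, 2, 3],
    'date_sold': pd.date_range('2023-01-01', periods=3, freq='D'),
})

# With no path argument, to_csv returns the CSV text as a string
csv_text = demo.to_csv(index=False)

# Without parse_dates, the date column is read back as text
plain = pd.read_csv(io.StringIO(csv_text))
print(plain['date_sold'].dtype)

# With parse_dates, the column is converted to datetime64 on load
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['date_sold'])
print(parsed['date_sold'].dtype)
```

The same `parse_dates` argument can be passed when reading `products.csv` from disk.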

In [None]:
# Other file formats
# Save to Excel
products_df.to_excel('products.xlsx', index=False, sheet_name='Products')

# Save to JSON
products_df.to_json('products.json', orient='records', date_format='iso')

# Read from JSON
json_df = pd.read_json('products.json')
print("\nJSON DataFrame shape:", json_df.shape)

## Data Exploration

In [25]:
# Basic exploration methods
print("First 5 rows:")
print(products_df.head())

print("\nLast 5 rows:")
print(products_df.tail())

print("\nRandom sample:")
print(products_df.sample(3))

First 5 rows:
   product_id product_name     category   price  quantity_sold  rating  \
0           1    Product_1     Clothing  164.32             63     4.8   
1           2    Product_2         Home  421.23              2     4.4   
2           3    Product_3  Electronics  163.21             28     2.3   
3           4    Product_4         Home  230.79              5     3.2   
4           5    Product_5  Electronics  250.13             13     3.5   

   date_sold  
0 2023-01-01  
1 2023-01-02  
2 2023-01-03  
3 2023-01-04  
4 2023-01-05  

Last 5 rows:
    product_id product_name     category   price  quantity_sold  rating  \
95          96   Product_96     Clothing  265.26             59     4.0   
96          97   Product_97  Electronics  111.48             84     1.6   
97          98   Product_98     Clothing   31.88             87     1.3   
98          99   Product_99  Electronics   43.30             36     1.4   
99         100  Product_100        Books  336.88             2

In [26]:
# Statistical summary
print("Statistical Summary:")
print(products_df.describe())

print("\nValue counts for category:")
print(products_df['category'].value_counts())

print("\nUnique values in category:")
print(products_df['category'].unique())
print(f"Number of unique categories: {products_df['category'].nunique()}")

Statistical Summary:
       product_id       price  quantity_sold      rating            date_sold
count  100.000000  100.000000     100.000000  100.000000                  100
mean    50.500000  250.066500      49.030000    2.923000  2023-02-19 12:00:00
min      1.000000   12.160000       2.000000    1.000000  2023-01-01 00:00:00
25%     25.750000  141.990000      27.000000    1.900000  2023-01-25 18:00:00
50%     50.500000  250.025000      47.500000    2.800000  2023-02-19 12:00:00
75%     75.250000  346.825000      75.000000    4.025000  2023-03-16 06:00:00
max    100.000000  499.020000      99.000000    5.000000  2023-04-10 00:00:00
std     29.011492  131.681611      28.236896    1.172139                  NaN

Value counts for category:
category
Electronics    28
Home           26
Books          24
Clothing       22
Name: count, dtype: int64

Unique values in category:
['Clothing' 'Home' 'Electronics' 'Books']
Number of unique categories: 4


In [27]:
# Data selection and indexing
print("Select single column:")
print(products_df['product_name'].head())

print("\nSelect multiple columns:")
print(products_df[['product_name', 'price', 'category']].head())

print("\nSelect rows by index:")
print(products_df.iloc[0:3])  # First 3 rows

print("\nSelect rows by condition:")
expensive_products = products_df[products_df['price'] > 400]
print(f"Products with price > 400: {len(expensive_products)}")
print(expensive_products[['product_name', 'price']].head())

Select single column:
0    Product_1
1    Product_2
2    Product_3
3    Product_4
4    Product_5
Name: product_name, dtype: object

Select multiple columns:
  product_name   price     category
0    Product_1  164.32     Clothing
1    Product_2  421.23         Home
2    Product_3  163.21  Electronics
3    Product_4  230.79         Home
4    Product_5  250.13  Electronics

Select rows by index:
   product_id product_name     category   price  quantity_sold  rating  \
0           1    Product_1     Clothing  164.32             63     4.8   
1           2    Product_2         Home  421.23              2     4.4   
2           3    Product_3  Electronics  163.21             28     2.3   

   date_sold  
0 2023-01-01  
1 2023-01-02  
2 2023-01-03  

Select rows by condition:
Products with price > 400: 16
   product_name   price
1     Product_2  421.23
14   Product_15  441.03
17   Product_18  437.33
21   Product_22  499.02
29   Product_30  484.49
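
The cell above selects rows by position with `iloc` and by condition with a boolean mask. A short sketch of how label-based `loc` differs may help; the `prices` frame and its index values are invented so that labels and positions do not coincide.

```python
import pandas as pd

# Hypothetical frame with a non-default index, so labels differ from positions
prices = pd.DataFrame(
    {'price': [164.32, 421.23, 163.21, 230.79]},
    index=[10, 20, 30, 40],
)

# iloc is positional and excludes the end of the slice: rows at positions 0 and 1
by_position = prices.iloc[0:2]

# loc is label-based and includes the end of the slice: labels 10, 20 and 30
by_label = prices.loc[10:30]

# loc also accepts a boolean mask and a column name in a single step
expensive = prices.loc[prices['price'] > 200, 'price']

print(by_position.index.tolist())  # [10, 20]
print(by_label.index.tolist())     # [10, 20, 30]
print(expensive.index.tolist())    # [20, 40]
```

With the default `RangeIndex` used by `products_df`, labels and positions happen to match, which is why the inclusive/exclusive difference is easy to miss.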


## Data Cleaning

In [28]:
# Create data with missing values for demonstration
clean_data = products_df.copy()
# Introduce some missing values
clean_data.loc[5:10, 'rating'] = np.nan
clean_data.loc[15:20, 'price'] = np.nan
clean_data.loc[25, 'category'] = np.nan

print("Missing values:")
print(clean_data.isnull().sum())

print("\nPercentage of missing values:")
print((clean_data.isnull().sum() / len(clean_data) * 100).round(2))

Missing values:
product_id       0
product_name     0
category         1
price            6
quantity_sold    0
rating           6
date_sold        0
dtype: int64

Percentage of missing values:
product_id       0.0
product_name     0.0
category         1.0
price            6.0
quantity_sold    0.0
rating           6.0
date_sold        0.0
dtype: float64


In [30]:
# Handling missing values
# Drop rows with any missing values
clean_df_dropped = clean_data.dropna()
print(f"Original shape: {clean_data.shape}")
print(f"After dropping NaN: {clean_df_dropped.shape}")

# Fill missing values
clean_df_filled = clean_data.copy()
clean_df_filled['rating'] = clean_df_filled['rating'].fillna(clean_df_filled['rating'].mean())
clean_df_filled['price'] = clean_df_filled['price'].fillna(clean_df_filled['price'].median())
clean_df_filled['category'] = clean_df_filled['category'].fillna('Unknown')

print("\nAfter filling missing values:")
print(clean_df_filled.isnull().sum())

Original shape: (100, 7)
After dropping NaN: (87, 7)

After filling missing values:
product_id       0
product_name     0
category         0
price            0
quantity_sold    0
rating           0
date_sold        0
dtype: int64


In [32]:
# Remove duplicates
print(f"Original shape: {clean_df_filled.shape}")
clean_df_no_duplicates = clean_df_filled.drop_duplicates()
print(f"After removing duplicates: {clean_df_no_duplicates.shape}")

# Data type conversion
print("\nData types before conversion:")
print(clean_df_filled.dtypes)

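# Convert date column to datetime (the column is already datetime64 here, so this
# is a no-op; the call matters when the column holds strings, e.g. after read_csv)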
clean_df_filled['date_sold'] = pd.to_datetime(clean_df_filled['date_sold'])
print("\nData types after conversion:")
print(clean_df_filled.dtypes)

Original shape: (100, 7)
After removing duplicates: (100, 7)

Data types before conversion:
product_id                int64
product_name             object
category                 object
price                   float64
quantity_sold             int64
rating                  float64
date_sold        datetime64[ns]
dtype: object

Data types after conversion:
product_id                int64
product_name             object
category                 object
price                   float64
quantity_sold             int64
rating                  float64
date_sold        datetime64[ns]
dtype: object
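
Because `date_sold` was already `datetime64`, the output above shows no change. The following sketch shows the conversion doing real work on a string column. The `raw_dates` values are hypothetical, and `errors='coerce'` is one possible policy for bad entries, not the only one.

```python
import pandas as pd

# Hypothetical raw column as it might arrive from a text file
raw_dates = pd.Series(['2023-01-01', '2023-01-02', 'not a date'])

# errors='coerce' turns entries that cannot be parsed into NaT instead of raising
parsed_dates = pd.to_datetime(raw_dates, errors='coerce')

print(parsed_dates)
print(f"Unparseable entries: {parsed_dates.isna().sum()}")
```

Counting the `NaT` values right after conversion is a cheap way to see how many rows failed to parse before deciding whether to drop or repair them.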


### 🎯 Practice Problems: Data Cleaning

**Problem 1**: Create a DataFrame with missing values and practice different strategies to handle them. Create a dataset with columns ['name', 'age', 'salary'] where some values are missing, then fill missing ages with the median and missing salaries with the mean.

<details>
<summary>Click to see solution</summary>

```python
# Solution
import numpy as np

# Create sample data with missing values
data_with_missing = {
    'name': ['John', 'Jane', 'Bob', 'Alice', 'Charlie'],
    'age': [25, np.nan, 35, 28, np.nan],
    'salary': [50000, 60000, np.nan, 55000, np.nan]
}

df_missing = pd.DataFrame(data_with_missing)
print('Original DataFrame:')
print(df_missing)
print('\nMissing values:')
print(df_missing.isnull().sum())

# Fill missing values
df_cleaned = df_missing.copy()
df_cleaned['age'] = df_cleaned['age'].fillna(df_cleaned['age'].median())
df_cleaned['salary'] = df_cleaned['salary'].fillna(df_cleaned['salary'].mean())

print('\nCleaned DataFrame:')
print(df_cleaned)
```

</details>

**Problem 2**: Remove duplicate rows from a DataFrame and convert a string column to proper datetime format.

<details>
<summary>Click to see solution</summary>

```python
# Solution
# Create sample data with duplicates
messy_data = {
    'product': ['A', 'B', 'A', 'C', 'B'],
    'price': [100, 200, 100, 300, 200],
    'date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-03', '2023-01-02']
}

df_messy = pd.DataFrame(messy_data)
print('Original DataFrame:')
print(df_messy)
print(f'Shape: {df_messy.shape}')

# Remove duplicates (copy so the column assignment below does not trigger SettingWithCopyWarning)
df_clean = df_messy.drop_duplicates().copy()
print(f'\nAfter removing duplicates: {df_clean.shape}')

# Convert date column
df_clean['date'] = pd.to_datetime(df_clean['date'])
print('\nData types after conversion:')
print(df_clean.dtypes)
```

</details>

In [None]:
# Your code for Problem 1 here



In [None]:
# Your code for Problem 2 here



## Data Manipulation

In [33]:
# Adding new columns
df_manipulated = clean_df_filled.copy()
df_manipulated['revenue'] = df_manipulated['price'] * df_manipulated['quantity_sold']
df_manipulated['price_category'] = pd.cut(df_manipulated['price'], 
                                         bins=[0, 50, 150, 300, 500], 
                                         labels=['Low', 'Medium', 'High', 'Premium'])

print("DataFrame with new columns:")
print(df_manipulated[['product_name', 'price', 'quantity_sold', 'revenue', 'price_category']].head())

DataFrame with new columns:
  product_name   price  quantity_sold   revenue price_category
0    Product_1  164.32             63  10352.16           High
1    Product_2  421.23              2    842.46        Premium
2    Product_3  163.21             28   4569.88           High
3    Product_4  230.79              5   1153.95           High
4    Product_5  250.13             13   3251.69           High
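
`pd.cut` uses right-closed intervals by default, so the bins above are (0, 50], (50, 150], (150, 300] and (300, 500]. The sketch below checks values placed on and around those edges; `edge_prices` is an invented Series used only for this check.

```python
import pandas as pd

# Values chosen to sit on and around the bin edges used above
edge_prices = pd.Series([0, 50, 50.01, 150, 300, 500])

edge_labels = pd.cut(
    edge_prices,
    bins=[0, 50, 150, 300, 500],
    labels=['Low', 'Medium', 'High', 'Premium'],
)

# 0 falls outside the first interval (0, 50] and becomes NaN;
# 50 -> Low, 50.01 -> Medium, 150 -> Medium, 300 -> High, 500 -> Premium
print(edge_labels.tolist())
```

If a price of exactly 0 should count as 'Low', passing `include_lowest=True` to `pd.cut` closes the first interval on the left.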


In [34]:
# Sorting data
print("Top 5 products by revenue:")
top_revenue = df_manipulated.nlargest(5, 'revenue')
print(top_revenue[['product_name', 'price', 'quantity_sold', 'revenue']])

print("\nSorted by price (descending):")
sorted_by_price = df_manipulated.sort_values('price', ascending=False)
print(sorted_by_price[['product_name', 'price']].head())

Top 5 products by revenue:
   product_name   price  quantity_sold   revenue
38   Product_39  476.55             93  44319.15
21   Product_22  499.02             75  37426.50
91   Product_92  431.52             69  29774.88
80   Product_81  479.57             59  28294.63
58   Product_59  445.77             62  27637.74

Sorted by price (descending):
   product_name   price
21   Product_22  499.02
86   Product_87  491.82
29   Product_30  484.49
48   Product_49  481.95
80   Product_81  479.57


In [35]:
# String operations
print("String operations:")
df_manipulated['product_name_upper'] = df_manipulated['product_name'].str.upper()
df_manipulated['product_name_length'] = df_manipulated['product_name'].str.len()
df_manipulated['contains_product'] = df_manipulated['product_name'].str.contains('Product')

print(df_manipulated[['product_name', 'product_name_upper', 'product_name_length', 'contains_product']].head())

String operations:
  product_name product_name_upper  product_name_length  contains_product
0    Product_1          PRODUCT_1                    9              True
1    Product_2          PRODUCT_2                    9              True
2    Product_3          PRODUCT_3                    9              True
3    Product_4          PRODUCT_4                    9              True
4    Product_5          PRODUCT_5                    9              True


### 🎯 Practice Problems: Data Manipulation

**Problem 1**: Create a new column 'price_per_unit' by dividing price by quantity_sold. Then create a categorical column 'performance' that assigns 'High' for ratings ≥4.0, 'Medium' for ratings 2.5-3.9, and 'Low' for ratings <2.5.

<details>
<summary>Click to see solution</summary>

```python
# Solution
df_practice = products_df.copy()

# Create price per unit column
df_practice['price_per_unit'] = df_practice['price'] / df_practice['quantity_sold']

# Create performance categories
def categorize_performance(rating):
    if rating >= 4.0:
        return 'High'
    elif rating >= 2.5:
        return 'Medium'
    else:
        return 'Low'

df_practice['performance'] = df_practice['rating'].apply(categorize_performance)

print('New columns added:')
print(df_practice[['product_name', 'price', 'quantity_sold', 'price_per_unit', 'rating', 'performance']].head())
print(f'\nPerformance distribution:')
print(df_practice['performance'].value_counts())
```

</details>

**Problem 2**: Filter the DataFrame to show only products where the product_name contains a number greater than 50, and sort them by price in descending order.

<details>
<summary>Click to see solution</summary>

```python
# Solution
# Extract number from product name and filter
# Raw string avoids an invalid-escape warning; expand=False returns a Series
df_practice['product_number'] = df_practice['product_name'].str.extract(r'(\d+)', expand=False).astype(int)

# Filter products with number > 50
high_number_products = df_practice[df_practice['product_number'] > 50]

# Sort by price descending
result = high_number_products.sort_values('price', ascending=False)

print(f'Products with number > 50: {len(result)} products')
print(result[['product_name', 'product_number', 'price', 'category']].head(10))
```

</details>

In [None]:
# Your code for Problem 1 here



In [None]:
# Your code for Problem 2 here



## Grouping and Aggregation

In [37]:
df_manipulated

| | product_id | product_name | category | price | quantity_sold | rating | date_sold | revenue | price_category | product_name_upper | product_name_length | contains_product |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Product_1 | Clothing | 164.32 | 63 | 4.8 | 2023-01-01 | 10352.16 | High | PRODUCT_1 | 9 | True |
| 1 | 2 | Product_2 | Home | 421.23 | 2 | 4.4 | 2023-01-02 | 842.46 | Premium | PRODUCT_2 | 9 | True |
| 2 | 3 | Product_3 | Electronics | 163.21 | 28 | 2.3 | 2023-01-03 | 4569.88 | High | PRODUCT_3 | 9 | True |
| 3 | 4 | Product_4 | Home | 230.79 | 5 | 3.2 | 2023-01-04 | 1153.95 | High | PRODUCT_4 | 9 | True |
| 4 | 5 | Product_5 | Electronics | 250.13 | 13 | 3.5 | 2023-01-05 | 3251.69 | High | PRODUCT_5 | 9 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 96 | Product_96 | Clothing | 265.26 | 59 | 4.0 | 2023-04-06 | 15650.34 | High | PRODUCT_96 | 10 | True |
| 96 | 97 | Product_97 | Electronics | 111.48 | 84 | 1.6 | 2023-04-07 | 9364.32 | Medium | PRODUCT_97 | 10 | True |
| 97 | 98 | Product_98 | Clothing | 31.88 | 87 | 1.3 | 2023-04-08 | 2773.56 | Low | PRODUCT_98 | 10 | True |
| 98 | 99 | Product_99 | Electronics | 43.30 | 36 | 1.4 | 2023-04-09 | 1558.80 | Low | PRODUCT_99 | 10 | True |


In [36]:
# Group by category
category_stats = df_manipulated.groupby('category').agg({
    'price': ['mean', 'min', 'max'],
    'quantity_sold': 'sum',
    'revenue': ['sum', 'mean'],
    'rating': 'mean'
}).round(2)

print("Statistics by category:")
print(category_stats)

Statistics by category:
              price                 quantity_sold    revenue           rating
               mean     min     max           sum        sum      mean   mean
category                                                                     
Books        291.54   85.45  481.95          1068  297858.00  12950.35   2.76
Clothing     266.72   31.88  484.49          1104  265183.80  12053.81   2.73
Electronics  213.34   42.76  431.52          1355  278155.99   9934.14   2.98
Home         266.66   12.16  499.02          1339  327758.07  12606.08   3.18
Unknown      120.76  120.76  120.76            37    4468.12   4468.12   3.50
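
The dict-of-lists aggregation above produces two-level column labels such as `('price', 'mean')`. When flat column names are easier to work with, named aggregation is one alternative. The sketch below uses a tiny invented frame (`sales`) rather than `df_manipulated`, and the output column names are arbitrary choices.

```python
import pandas as pd

# Hypothetical miniature of df_manipulated
sales = pd.DataFrame({
    'category': ['Books', 'Books', 'Home', 'Home'],
    'price': [10.0, 30.0, 20.0, 40.0],
    'revenue': [100.0, 300.0, 200.0, 400.0],
})

# Dict-of-lists aggregation yields a two-level column index
nested = sales.groupby('category').agg({'price': ['mean', 'max'], 'revenue': 'sum'})
print(nested.columns.tolist())

# Named aggregation: new_name=(source_column, function) yields flat column names
flat = sales.groupby('category').agg(
    avg_price=('price', 'mean'),
    max_price=('price', 'max'),
    total_revenue=('revenue', 'sum'),
)
print(flat)
```

Flat names make later steps such as `flat['total_revenue'].sort_values()` simpler than indexing with tuples.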


In [41]:
# Multiple grouping
price_category_stats = df_manipulated.groupby(['category', 'price_category'], observed=False).agg({
    'revenue': 'sum',
    'quantity_sold': 'mean',
    'rating': 'mean'
}).round(2)

print("\nStatistics by category and price category:")
print(price_category_stats)


Statistics by category and price category:
                              revenue  quantity_sold  rating
category    price_category                                  
Books       Low                  0.00            NaN     NaN
            Medium            1025.40          12.00    2.60
            High            158701.51          57.08    2.78
            Premium         138131.09          37.10    2.76
Clothing    Low               2773.56          87.00    1.30
            Medium           20899.02          66.00    3.21
            High            134658.83          51.55    3.09
            Premium         106852.39          36.00    2.17
Electronics Low               4979.60          58.00    3.15
            Medium           53786.33          51.44    2.69
            High            137749.29          51.73    2.82
            Premium          81640.77          34.50    3.63
Home        Low               4775.18          84.00    4.55
            Medium           16289.94    

In [None]:
# Pivot tables
pivot_table = df_manipulated.pivot_table(
    values='revenue',
    index='category',
    columns='price_category',
    aggfunc='sum',
    fill_value=0,
    observed=False  # price_category is categorical; keep unobserved categories explicitly
)

print("\nPivot table - Revenue by Category and Price Category:")
print(pivot_table)
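
A pivot table is a grouped aggregation reshaped so that one grouping key becomes the columns. The sketch below illustrates this on an invented frame (`sales`, with a plain string column `tier` standing in for `price_category`) by building the same table two ways.

```python
import pandas as pd

# Hypothetical data; the Books/High combination is deliberately absent
sales = pd.DataFrame({
    'category': ['Books', 'Home', 'Home', 'Home'],
    'tier': ['Low', 'Low', 'Low', 'High'],
    'revenue': [100, 200, 50, 400],
})

# Route 1: pivot_table
pivoted = sales.pivot_table(
    values='revenue', index='category', columns='tier',
    aggfunc='sum', fill_value=0,
)

# Route 2: groupby on both keys, then move 'tier' from the index to the columns
unstacked = sales.groupby(['category', 'tier'])['revenue'].sum().unstack(fill_value=0)

print(pivoted)
print(unstacked)
```

Both routes give 0 for the missing Books/High cell and 250 for Home/Low. `pivot_table` is usually the shorter spelling, while the `groupby` form is convenient when the long (unpivoted) result is also needed.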

### Practice Problem 1: Sales Analysis by Category

Using the `df_manipulated` DataFrame, create a grouped analysis that shows:
1. Total revenue by category
2. Average rating by category
3. Count of products by category
4. Maximum price by category

In [None]:
# Your solution here



<details>
<summary>Click to see solution</summary>

```python
# Solution: Sales Analysis by Category
category_analysis = df_manipulated.groupby('category').agg({
    'revenue': 'sum',
    'rating': 'mean',
    'product_id': 'count',
    'price': 'max'
}).round(2)

category_analysis.columns = ['Total Revenue', 'Avg Rating', 'Product Count', 'Max Price']
print(category_analysis)
```

</details>

### Practice Problem 2: Advanced Pivot Analysis

Create a pivot table that shows:
1. Average quantity sold for each combination of category and price_category
2. Fill missing values with 0
3. Add row and column margins (the `All` row and column; with `aggfunc='mean'` these are overall averages rather than sums)

In [None]:
# Your solution here



<details>
<summary>Click to see solution</summary>

```python
# Solution: Advanced Pivot Analysis
pivot_qty = df_manipulated.pivot_table(
    values='quantity_sold',
    index='category',
    columns='price_category',
    aggfunc='mean',
    fill_value=0,
    margins=True
).round(2)

print('Average Quantity Sold by Category and Price Category:')
print(pivot_qty)
```

</details>

## Merging and Joining

In [None]:
# Create additional DataFrames for merging
suppliers = pd.DataFrame({
    'product_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'supplier_name': ['Supplier_A', 'Supplier_B', 'Supplier_A', 'Supplier_C', 'Supplier_B',
                     'Supplier_A', 'Supplier_C', 'Supplier_B', 'Supplier_A', 'Supplier_C'],
    'supplier_country': ['USA', 'Germany', 'USA', 'Japan', 'Germany',
                        'USA', 'Japan', 'Germany', 'USA', 'Japan']
})

print("Suppliers DataFrame:")
print(suppliers)

In [None]:
# Merge DataFrames
merged_df = pd.merge(df_manipulated, suppliers, on='product_id', how='left')
print("\nMerged DataFrame (first 10 rows):")
print(merged_df[['product_name', 'category', 'price', 'supplier_name', 'supplier_country']].head(10))

print(f"\nOriginal shape: {df_manipulated.shape}")
print(f"Merged shape: {merged_df.shape}")

In [None]:
# Different types of joins
inner_join = pd.merge(df_manipulated, suppliers, on='product_id', how='inner')
outer_join = pd.merge(df_manipulated, suppliers, on='product_id', how='outer')

print(f"Inner join shape: {inner_join.shape}")
print(f"Outer join shape: {outer_join.shape}")
print(f"Left join shape: {merged_df.shape}")
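
Comparing shapes shows how many rows each join keeps, but not which rows matched. A sketch using the `indicator=True` option of `pd.merge` follows; the two small frames (`demo_products`, `demo_suppliers`) are invented so that each side has one unmatched `product_id`.

```python
import pandas as pd

# Hypothetical frames: id 1 has no supplier, id 4 has no product
demo_products = pd.DataFrame({'product_id': [1, 2, 3], 'price': [10.0, 20.0, 30.0]})
demo_suppliers = pd.DataFrame({'product_id': [2, 3, 4], 'supplier_name': ['A', 'B', 'C']})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
audit = pd.merge(demo_products, demo_suppliers, on='product_id', how='outer', indicator=True)
print(audit)

# Row counts for each join type on the same inputs
inner = pd.merge(demo_products, demo_suppliers, on='product_id', how='inner')
left = pd.merge(demo_products, demo_suppliers, on='product_id', how='left')
print(len(inner), len(left), len(audit))  # 2 3 4
```

Filtering `audit` on `_merge == 'left_only'` lists the products that would carry `NaN` supplier fields after a left join.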

### Practice Problem 1: Data Merging Challenge

Create a new DataFrame called `inventory` with the following data:
- product_id: [1, 2, 3, 11, 12, 13]
- stock_level: [50, 30, 75, 20, 40, 60]
- warehouse: ['A', 'B', 'A', 'C', 'B', 'A']

Then perform different types of joins with the main DataFrame and analyze the results.

In [None]:
# Your solution here



<details>
<summary>Click to see solution</summary>

```python
# Solution: Data Merging Challenge
inventory = pd.DataFrame({
    'product_id': [1, 2, 3, 11, 12, 13],
    'stock_level': [50, 30, 75, 20, 40, 60],
    'warehouse': ['A', 'B', 'A', 'C', 'B', 'A']
})

# Inner join - only matching records
inner_result = pd.merge(df_manipulated, inventory, on='product_id', how='inner')
print(f'Inner join shape: {inner_result.shape}')

# Left join - all records from main DataFrame
left_result = pd.merge(df_manipulated, inventory, on='product_id', how='left')
print(f'Left join shape: {left_result.shape}')

# Outer join - all records from both DataFrames
outer_result = pd.merge(df_manipulated, inventory, on='product_id', how='outer')
print(f'Outer join shape: {outer_result.shape}')

# Check for missing values after left join
print(f'Missing stock_level values: {left_result["stock_level"].isna().sum()}')
```

</details>

### Practice Problem 2: Supplier Analysis

Using the merged DataFrame with suppliers:
1. Find the supplier with the highest total revenue
2. Calculate average price by supplier country
3. Identify which supplier country has the most diverse product categories

In [None]:
# Your solution here



<details>
<summary>Click to see solution</summary>

```python
# Solution: Supplier Analysis
# 1. Supplier with highest total revenue
supplier_revenue = merged_df.groupby('supplier_name')['revenue'].sum().sort_values(ascending=False)
print('Supplier revenue ranking:')
print(supplier_revenue)
print(f'Top supplier: {supplier_revenue.index[0]}')

# 2. Average price by supplier country
country_avg_price = merged_df.groupby('supplier_country')['price'].mean().round(2)
print('\nAverage price by supplier country:')
print(country_avg_price)

# 3. Most diverse product categories by country
category_diversity = merged_df.groupby('supplier_country')['category'].nunique().sort_values(ascending=False)
print('\nCategory diversity by supplier country:')
print(category_diversity)
print(f'Most diverse country: {category_diversity.index[0]}')
```

</details>

## Time Series Analysis

In [None]:
# Time series operations
df_time = merged_df.copy()
df_time['date_sold'] = pd.to_datetime(df_time['date_sold'])
df_time.set_index('date_sold', inplace=True)

print("Time series DataFrame:")
print(df_time.head())

In [None]:
# Resample data by month ('ME' = month end; the older 'M' alias is deprecated since pandas 2.2)
monthly_sales = df_time.resample('ME').agg({
    'revenue': 'sum',
    'quantity_sold': 'sum',
    'price': 'mean'
}).round(2)

print("\nMonthly sales summary:")
print(monthly_sales)
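
Monthly resampling amounts to grouping rows by calendar month and aggregating each group. As a sketch of that idea that does not depend on a frequency alias, the example below groups an invented daily Series (`daily`) by its monthly period and checks the sums by hand.

```python
import pandas as pd

# Hypothetical daily revenue: 1, 2, ..., 60 starting on 2023-01-01
daily = pd.Series(
    range(1, 61),
    index=pd.date_range('2023-01-01', periods=60, freq='D'),
)

# Group by calendar month via a PeriodIndex, then sum each group
monthly = daily.groupby(daily.index.to_period('M')).sum()
print(monthly)
# January covers values 1..31 (sum 496), February 32..59 (sum 1274), 1 March is 60
```

The result is indexed by period (`2023-01`, `2023-02`, ...) rather than by a month-end timestamp, which some readers find clearer in reports.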

In [None]:
# Rolling statistics
df_time['revenue_7day_avg'] = df_time['revenue'].rolling(window=7).mean()
df_time['revenue_cumsum'] = df_time['revenue'].cumsum()

print("\nTime series with rolling statistics:")
print(df_time[['revenue', 'revenue_7day_avg', 'revenue_cumsum']].head(10))
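
With `window=7`, the first six rolling averages are `NaN` because the window is not yet full. The sketch below shows the same effect on a shorter window, along with the `min_periods` argument that relaxes it. The `daily_revenue` values are invented for illustration.

```python
import pandas as pd

daily_revenue = pd.Series(
    [10.0, 20.0, 30.0, 40.0, 50.0],
    index=pd.date_range('2023-01-01', periods=5, freq='D'),
)

# Default: the window must be full, so the first two results are NaN
strict = daily_revenue.rolling(window=3).mean()

# min_periods=1 averages whatever observations are available so far
lenient = daily_revenue.rolling(window=3, min_periods=1).mean()

print(strict.tolist())   # [nan, nan, 20.0, 30.0, 40.0]
print(lenient.tolist())  # [10.0, 15.0, 20.0, 30.0, 40.0]
```

The early `lenient` values are averages over fewer than three days, so they are noisier than the rest. Whether that is acceptable depends on how the series will be used.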

### Practice Problem 1: Time Series Resampling

Using the time series DataFrame:
1. Resample the data by week and calculate total revenue and average rating
2. Find the week with the highest total revenue
3. Calculate the percentage change in weekly revenue

In [None]:
# Your solution here



<details>
<summary>Click to see solution</summary>

```python
# Solution: Time Series Resampling
# 1. Weekly resampling
weekly_stats = df_time.resample('W').agg({
    'revenue': 'sum',
    'rating': 'mean'
}).round(2)

print('Weekly statistics:')
print(weekly_stats)

# 2. Week with highest revenue
max_revenue_week = weekly_stats['revenue'].idxmax()
max_revenue_value = weekly_stats['revenue'].max()
print(f'\nWeek with highest revenue: {max_revenue_week.strftime("%Y-%m-%d")} (${max_revenue_value:,.2f})')

# 3. Percentage change in weekly revenue
weekly_stats['revenue_pct_change'] = weekly_stats['revenue'].pct_change() * 100
print('\nWeekly revenue percentage change:')
print(weekly_stats[['revenue', 'revenue_pct_change']].dropna())
```

</details>

### Practice Problem 2: Rolling Window Analysis

Create rolling window calculations:
1. Calculate a 5-day rolling average for quantity sold
2. Find the 3-day rolling maximum price
3. Calculate the rolling standard deviation of revenue (7-day window)

In [None]:
# Your solution here



<details>
<summary>Click to see solution</summary>

```python
# Solution: Rolling Window Analysis
df_rolling = df_time.copy()

# 1. 5-day rolling average for quantity sold
df_rolling['qty_5day_avg'] = df_rolling['quantity_sold'].rolling(window=5).mean()

# 2. 3-day rolling maximum price
df_rolling['price_3day_max'] = df_rolling['price'].rolling(window=3).max()

# 3. 7-day rolling standard deviation of revenue
df_rolling['revenue_7day_std'] = df_rolling['revenue'].rolling(window=7).std()

print('Rolling window statistics:')
print(df_rolling[['quantity_sold', 'qty_5day_avg', 'price', 'price_3day_max', 
                  'revenue', 'revenue_7day_std']].head(10))

# Show summary statistics
print('\nSummary of rolling calculations:')
print(df_rolling[['qty_5day_avg', 'price_3day_max', 'revenue_7day_std']].describe())
```

</details>

## Practice Exercises

### Exercise 1: Sales Analysis
**Task**: Analyze the sales data to answer the following questions:

In [None]:
# Exercise 1 - Sales Analysis
print("=== EXERCISE 1: SALES ANALYSIS ===")
print("\nYour tasks:")
print("1. Find the top 3 categories by total revenue")
print("2. Calculate the average rating for each price category")
print("3. Identify products with rating above 4.5 and price below 100")
print("4. Create a summary showing total quantity sold by supplier country")

# Your solution here



<details>
<summary>Click to see solution</summary>

```python
# Solution: Sales Analysis
# Task 1: Top 3 categories by total revenue
top_categories = merged_df.groupby('category')['revenue'].sum().nlargest(3)
print('1. Top 3 categories by total revenue:')
print(top_categories)

# Task 2: Average rating by price category
# (observed=False is passed explicitly because price_category is categorical)
avg_rating_by_price = merged_df.groupby('price_category', observed=False)['rating'].mean().round(2)
print('\n2. Average rating by price category:')
print(avg_rating_by_price)

# Task 3: High-rated, low-priced products
high_rated_low_price = merged_df[(merged_df['rating'] > 4.5) & (merged_df['price'] < 100)]
print(f'\n3. Products with rating > 4.5 and price < 100: {len(high_rated_low_price)} products')
if len(high_rated_low_price) > 0:
    print(high_rated_low_price[['product_name', 'price', 'rating']].head())

# Task 4: Quantity sold by supplier country
qty_by_country = merged_df.groupby('supplier_country')['quantity_sold'].sum().sort_values(ascending=False)
print('\n4. Total quantity sold by supplier country:')
print(qty_by_country)
```

</details>

### Exercise 2: Data Transformation
**Task**: Transform and clean the data according to specifications:

In [None]:
# Exercise 2 - Data Transformation
print("\n=== EXERCISE 2: DATA TRANSFORMATION ===")
print("\nYour tasks:")
print("1. Create a new column 'profit_margin' assuming 30% profit margin")
print("2. Categorize products as 'Bestseller' (quantity_sold > 75) or 'Regular'")
print("3. Create a pivot table showing average price by category and price_category")
print("4. Find the month with the highest total revenue")

# Your solution here



<details>
<summary>Click to see solution</summary>

```python
# Solution: Data Transformation
df_ex2 = merged_df.copy()

# Task 1: Profit margin
df_ex2['profit_margin'] = df_ex2['price'] * 0.30
print('1. Added profit_margin column')
print(df_ex2[['product_name', 'price', 'profit_margin']].head())

# Task 2: Bestseller categorization
df_ex2['sales_category'] = np.where(df_ex2['quantity_sold'] > 75, 'Bestseller', 'Regular')
print('\n2. Sales category distribution:')
print(df_ex2['sales_category'].value_counts())

# Task 3: Pivot table
price_pivot = df_ex2.pivot_table(
    values='price',
    index='category',
    columns='price_category',
    aggfunc='mean'
).round(2)
print('\n3. Average price by category and price category:')
print(price_pivot)

# Task 4: Month with highest revenue
df_ex2['month'] = pd.to_datetime(df_ex2['date_sold']).dt.to_period('M')
monthly_revenue = df_ex2.groupby('month')['revenue'].sum()
highest_month = monthly_revenue.idxmax()
highest_revenue = monthly_revenue.max()
print(f'\n4. Month with highest revenue: {highest_month} (${highest_revenue:,.2f})')
```

</details>

### Exercise 3: Advanced Analysis
**Task**: Perform advanced data analysis:

In [None]:
# Exercise 3 - Advanced Analysis
print("\n=== EXERCISE 3: ADVANCED ANALYSIS ===")
print("\nYour tasks:")
print("1. Calculate correlation between price and rating")
print("2. Find the supplier with the highest average product rating")
print("3. Create a time series showing daily revenue trend")
print("4. Identify outliers in the price column using IQR method")

# Your solution here



<details>
<summary>Click to see solution</summary>

```python
# Solution: Advanced Analysis
# Task 1: Correlation between price and rating
correlation = merged_df['price'].corr(merged_df['rating'])
print(f'1. Correlation between price and rating: {correlation:.3f}')

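# Task 2: Supplier with highest average rating
supplier_ratings = merged_df.groupby('supplier_name')['rating'].mean().sort_values(ascending=False)
print(f'\n2. Supplier with highest average rating: {supplier_ratings.idxmax()} ({supplier_ratings.max():.2f})')
print('   All suppliers by average product rating:')
print(supplier_ratings)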

# Task 3: Daily revenue trend
daily_revenue = merged_df.groupby('date_sold')['revenue'].sum()
print('\n3. Daily revenue trend (first 10 days):')
print(daily_revenue.head(10))

# Task 4: Price outliers using IQR
Q1 = merged_df['price'].quantile(0.25)
Q3 = merged_df['price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = merged_df[(merged_df['price'] < lower_bound) | (merged_df['price'] > upper_bound)]
print(f'\n4. Price outliers (IQR method): {len(outliers)} products')
print(f'   Lower bound: ${lower_bound:.2f}, Upper bound: ${upper_bound:.2f}')
if len(outliers) > 0:
    print('   Outlier products:')
    print(outliers[['product_name', 'price']].head())
```

</details>
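
One caveat for Task 3: grouping by `date_sold` only returns dates that actually appear in the data, so days without any sales are missing from the trend rather than shown as zero. A minimal sketch of the alternative, assuming `date_sold` is a datetime column, uses `resample('D')`. The small `sales` frame here is invented for illustration.

```python
import pandas as pd

# Hypothetical sales records with a two-day gap (illustrative only)
sales = pd.DataFrame({
    "date_sold": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-04"]),
    "revenue": [100.0, 50.0, 80.0],
})

# groupby: one row per date present in the data (gap days are absent)
grouped = sales.groupby("date_sold")["revenue"].sum()

# resample: one row per calendar day, with empty days summed to 0
daily = sales.set_index("date_sold")["revenue"].resample("D").sum()

print(grouped)
print(daily)
```

The resampled series is usually the better input for plotting or rolling calculations, since every row then represents exactly one day.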

## Visualization with Pandas

In [None]:
# Basic plotting with pandas
import matplotlib.pyplot as plt

# Set up the plotting style
plt.style.use('default')
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Revenue by category
category_revenue = merged_df.groupby('category')['revenue'].sum()
category_revenue.plot(kind='bar', ax=axes[0,0], title='Total Revenue by Category')
axes[0,0].set_ylabel('Revenue ($)')
axes[0,0].tick_params(axis='x', rotation=45)

# 2. Price distribution
# Series.hist() does not accept a title argument, so set the title on the axes
merged_df['price'].hist(bins=20, ax=axes[0,1])
axes[0,1].set_title('Price Distribution')
axes[0,1].set_xlabel('Price ($)')
axes[0,1].set_ylabel('Frequency')

# 3. Rating vs Price scatter plot
merged_df.plot.scatter(x='price', y='rating', ax=axes[1,0], title='Rating vs Price')
axes[1,0].set_xlabel('Price ($)')
axes[1,0].set_ylabel('Rating')

# 4. Daily revenue trend
daily_revenue = merged_df.groupby('date_sold')['revenue'].sum()
daily_revenue.plot(ax=axes[1,1], title='Daily Revenue Trend')
axes[1,1].set_xlabel('Date')
axes[1,1].set_ylabel('Revenue ($)')
axes[1,1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("Visualization plots created successfully!")

## Performance Tips

In [None]:
# Performance tips for pandas
print("=== PANDAS PERFORMANCE TIPS ===")

# 1. Use vectorized operations instead of loops
import time

# Slow way (using loops)
# perf_counter() has higher resolution than time.time() for short intervals
start_time = time.perf_counter()
result_slow = []
for price in merged_df['price']:
    result_slow.append(price * 1.1)
slow_time = time.perf_counter() - start_time

# Fast way (vectorized)
start_time = time.perf_counter()
result_fast = merged_df['price'] * 1.1
fast_time = time.perf_counter() - start_time

print(f"Loop method time: {slow_time:.6f} seconds")
print(f"Vectorized method time: {fast_time:.6f} seconds")
# Guard against division by zero if the timer reads 0 on tiny inputs
if fast_time > 0:
    print(f"Speedup: {slow_time/fast_time:.1f}x")

# 2. Use appropriate data types
print("\n2. Memory usage optimization:")
print(f"Original memory usage: {merged_df.memory_usage(deep=True).sum() / 1024:.2f} KB")

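# Optimize data types
df_optimized = merged_df.copy()
# downcast='integer' picks the smallest integer type that can hold every value,
# which avoids the overflow risk of hard-coding int8 (max 127) or int16 (max 32,767)
df_optimized['product_id'] = pd.to_numeric(df_optimized['product_id'], downcast='integer')
df_optimized['quantity_sold'] = pd.to_numeric(df_optimized['quantity_sold'], downcast='integer')
df_optimized['category'] = df_optimized['category'].astype('category')
df_optimized['supplier_name'] = df_optimized['supplier_name'].astype('category')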

print(f"Optimized memory usage: {df_optimized.memory_usage(deep=True).sum() / 1024:.2f} KB")
print(f"Memory reduction: {(1 - df_optimized.memory_usage(deep=True).sum() / merged_df.memory_usage(deep=True).sum()) * 100:.1f}%")

# 3. Use query() for complex filtering
print("\n3. Query method for filtering:")
# Traditional filtering
filtered_traditional = merged_df[(merged_df['price'] > 100) & (merged_df['rating'] > 4.0) & (merged_df['category'] == 'Electronics')]

# Using query (more readable)
filtered_query = merged_df.query('price > 100 and rating > 4.0 and category == "Electronics"')

print(f"Traditional filtering result: {len(filtered_traditional)} rows")
print(f"Query method result: {len(filtered_query)} rows")
print("query() is often more readable; on large DataFrames it can also be faster when numexpr is installed")
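
Single-shot timings like the one in tip 1 are noisy, especially on a small DataFrame. The sketch below is one way to make the comparison more repeatable with the standard library's `timeit`, taking the best of several runs. The synthetic `prices` Series, its size, and the helper names `loop_version` and `vectorized_version` are assumptions for illustration.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic prices (illustrative only); large enough for a stable measurement
rng = np.random.default_rng(0)
prices = pd.Series(rng.uniform(1, 500, size=100_000))

def loop_version():
    # Python-level loop: one multiplication per element
    return [p * 1.1 for p in prices]

def vectorized_version():
    # Single vectorized multiplication over the whole Series
    return prices * 1.1

# Run each version 3 times and keep the fastest run
loop_time = min(timeit.repeat(loop_version, number=1, repeat=3))
vec_time = min(timeit.repeat(vectorized_version, number=1, repeat=3))

print(f"Loop:       {loop_time:.6f} s")
print(f"Vectorized: {vec_time:.6f} s")
```

Taking the minimum of repeated runs reduces the influence of background activity on the machine. Exact numbers will vary between systems.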

## Common Pitfalls and Best Practices

In [None]:
print("=== COMMON PITFALLS AND BEST PRACTICES ===")

# 1. Chained assignment warning
print("1. Avoiding chained assignment:")
# Bad practice (may cause SettingWithCopyWarning)
# df_subset = merged_df[merged_df['price'] > 100]
# df_subset['new_column'] = 'value'  # Ambiguous: pandas cannot tell whether merged_df should change too

# Good practice
df_subset = merged_df[merged_df['price'] > 100].copy()
df_subset['new_column'] = 'value'
print("Use .copy() when creating subsets that you plan to modify")

# 2. Efficient string operations
print("\n2. Efficient string operations:")
# Use vectorized string methods
product_names_upper = merged_df['product_name'].str.upper()
print("Use .str accessor for vectorized string operations")

# 3. Faster row iteration
print("\n3. Faster row iteration:")
# itertuples() is faster than iterrows() because it does not build a Series for every row
print("Use itertuples() instead of iterrows() when you must iterate over rows")
for row in merged_df.head(3).itertuples():
    print(f"Product {row.product_id}: {row.product_name} - ${row.price}")

# 4. Proper handling of missing data
print("\n4. Missing data best practices:")
print("- Always check for missing data before analysis")
print("- Choose appropriate strategy: drop, fill, or interpolate")
print("- Document your missing data handling decisions")

# 5. Index usage
print("\n5. Efficient index usage:")
print("- Set meaningful indexes for faster lookups")
print("- Use .loc and .iloc for explicit indexing")
print("- Reset index when necessary to avoid confusion")
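
Points 4 and 5 above are given only as advice, so here is a minimal sketch of what they can look like in code. The small DataFrame and its column names are hypothetical and chosen for illustration. Which missing-data strategy is appropriate depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing price (illustrative only)
df = pd.DataFrame({
    "product_id": [101, 102, 103, 104],
    "price": [10.0, np.nan, 30.0, 40.0],
})

# 4. Missing data: check first, then choose a strategy
missing_counts = df.isna().sum()                    # missing values per column
dropped = df.dropna(subset=["price"])               # drop rows lacking a price
filled = df["price"].fillna(df["price"].median())   # fill with the median
interpolated = df["price"].interpolate()            # linear interpolation

# 5. Index usage: a meaningful index allows label-based lookups
indexed = df.set_index("product_id")
by_label = indexed.loc[103, "price"]    # select by index label
by_position = indexed.iloc[2, 0]        # select by integer position
restored = indexed.reset_index()        # back to a default RangeIndex

print(missing_counts)
print(by_label, by_position)
```

Here `.loc[103, "price"]` and `.iloc[2, 0]` return the same value, but the first keeps working if the rows are reordered, while the second depends on position.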

## Additional Resources

### Official Documentation
- [Pandas Official Documentation](https://pandas.pydata.org/docs/)
- [10 Minutes to Pandas](https://pandas.pydata.org/docs/user_guide/10min.html)
- [Pandas Cookbook](https://pandas.pydata.org/docs/user_guide/cookbook.html)

### Recommended Books
- "Python for Data Analysis" by Wes McKinney (creator of pandas)
- "Pandas in Action" by Boris Paskhaver
- "Effective Pandas" by Matt Harrison

### Online Resources
- [Kaggle Learn - Pandas Course](https://www.kaggle.com/learn/pandas)
- [DataCamp Pandas Tutorials](https://www.datacamp.com/tutorial/pandas)
- [Real Python Pandas Tutorials](https://realpython.com/pandas-python-explore-dataset/)

### Practice Datasets
- [Kaggle Datasets](https://www.kaggle.com/datasets)
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
- [FiveThirtyEight Data](https://github.com/fivethirtyeight/data)
- [Our World in Data](https://ourworldindata.org/)

### Advanced Topics to Explore
1. **Multi-indexing**: Working with hierarchical indexes
2. **Categorical Data**: Efficient handling of categorical variables
3. **Time Series**: Advanced time series analysis and forecasting
4. **Performance Optimization**: Using Dask for larger-than-memory datasets
5. **Integration**: Working with SQL databases, APIs, and other data sources

## Summary

This tutorial covered the essential aspects of pandas:

- ✅ **Core Concepts**: Series and DataFrame structures
- ✅ **Data I/O**: Reading and writing various file formats
- ✅ **Data Exploration**: Basic statistics and data inspection
- ✅ **Data Cleaning**: Handling missing values and duplicates
- ✅ **Data Manipulation**: Filtering, sorting, and transforming data
- ✅ **Grouping & Aggregation**: Summarizing data by groups
- ✅ **Merging & Joining**: Combining multiple datasets
- ✅ **Time Series**: Working with temporal data
- ✅ **Performance**: Optimization techniques and best practices

### Next Steps
1. Practice with real datasets from Kaggle or other sources
2. Explore advanced pandas features like multi-indexing
3. Learn complementary libraries (NumPy, Matplotlib, Seaborn)
4. Consider Dask for big data scenarios
5. Integrate pandas with machine learning workflows

Remember: The key to mastering pandas is practice! Start with small datasets and gradually work your way up to more complex analyses.

---

*Happy data analyzing! 🐼📊*