# Pandas Practice - Essential Operations

A focused, practical guide to pandas using real datasets.

**Datasets:**
- **Flipkart**: E-commerce product data (25 columns)
- **Heart Disease**: Medical patient data (14 columns)

**What You'll Learn:**
1. Data Loading & Inspection
2. Selecting & Filtering Data
3. Data Cleaning Essentials
4. Creating New Columns
5. Sorting & Statistics
6. GroupBy & Aggregations
7. Visualizations
8. Practical Analysis

## 1. Setup & Data Loading

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

print(f"Pandas version: {pd.__version__}")

In [None]:
# Load datasets
flipkart = pd.read_csv('data/flipkard.csv')
heart = pd.read_csv('data/Heart_disease.csv')

print("Datasets loaded successfully!")
print(f"Flipkart shape: {flipkart.shape}")
print(f"Heart Disease shape: {heart.shape}")

In [None]:
# Quick look at Flipkart data
print("=== FLIPKART - First 5 rows ===")
flipkart.head()

In [None]:
# Quick look at Heart Disease data
print("=== HEART DISEASE - First 5 rows ===")
heart.head()

In [None]:
# Dataset info - see structure, data types, memory usage
print("=== FLIPKART INFO ===")
flipkart.info()

In [None]:
print("=== HEART DISEASE INFO ===")
heart.info()

## 2. Data Inspection & Selection

Learn to explore and select data using various methods.

In [None]:
# Basic inspection - Flipkart
print("Columns:", flipkart.columns.tolist())
print("\nShape:", flipkart.shape)
print("\nData types:")
print(flipkart.dtypes)

In [None]:
# Statistical summary
flipkart.describe()

In [None]:
# Select single column (returns Series)
prices = flipkart['price']
print(type(prices))
print(prices.head())

In [None]:
# Select multiple columns (returns DataFrame)
product_info = flipkart[['product_name', 'category', 'price', 'rating']]
print(type(product_info))
product_info.head()

In [None]:
# Row selection with .iloc[] (position-based)
print("First row:")
print(flipkart.iloc[0])

print("\nFirst 3 rows, first 4 columns:")
flipkart.iloc[:3, :4]

In [None]:
# Row selection with .loc[] (label-based; unlike .iloc, the end label is included)
print("Specific rows and columns:")
flipkart.loc[0:4, ['product_name', 'price', 'rating']]
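
The two indexers differ most visibly when slicing: `.iloc` excludes the end position, while `.loc` includes the end label. A minimal sketch on a small made-up frame (`toy` and its values are illustrative and not taken from the Flipkart data):

```python
import pandas as pd

# Hypothetical toy data; the non-default index makes labels differ from positions
toy = pd.DataFrame(
    {"price": [100, 200, 300, 400, 500]},
    index=[10, 20, 30, 40, 50],
)

by_position = toy.iloc[0:2]  # positions 0 and 1; the end position is excluded
by_label = toy.loc[10:30]    # labels 10, 20 and 30; the end label is included

print(len(by_position), len(by_label))
```

With the default integer index used by `flipkart`, `loc[0:4]` therefore returns five rows, not four.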

In [None]:
# Boolean indexing - simple filter
expensive = flipkart[flipkart['price'] > 50000]
print(f"Products > 50000: {len(expensive)}")
expensive[['product_name', 'price', 'category']].head()

In [None]:
# Multiple conditions with & (AND)
electronics_expensive = flipkart[
    (flipkart['category'] == 'Electronics') & 
    (flipkart['price'] > 40000)
]
print(f"Expensive Electronics: {len(electronics_expensive)}")
electronics_expensive[['product_name', 'price', 'rating']].head()

In [None]:
# Multiple conditions with | (OR)
mobiles_or_electronics = flipkart[
    (flipkart['category'] == 'Mobiles') | 
    (flipkart['category'] == 'Electronics')
]
print(f"Mobiles OR Electronics: {len(mobiles_or_electronics)}")
mobiles_or_electronics[['product_name', 'category']].head()
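
The parentheses around each condition are required because `&` and `|` bind more tightly than comparisons such as `==` and `>`. One way to sidestep the issue is to name each boolean mask first. The sketch below uses a small made-up frame (`toy`, `is_mobile`, and `is_pricey` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data, not the Flipkart dataset
toy = pd.DataFrame({
    "category": ["Mobiles", "Electronics", "Books", "Mobiles"],
    "price": [15000, 45000, 500, 60000],
})

# Each mask is a boolean Series aligned with the rows of toy
is_mobile = toy["category"] == "Mobiles"
is_pricey = toy["price"] > 40000

both = toy[is_mobile & is_pricey]    # AND: rows where both masks are True
either = toy[is_mobile | is_pricey]  # OR: rows where at least one mask is True
not_mobile = toy[~is_mobile]         # NOT: rows where the mask is False

print(len(both), len(either), len(not_mobile))
```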

In [None]:
# Using .isin() for multiple values
selected_categories = flipkart[flipkart['category'].isin(['Electronics', 'Mobiles', 'Appliances'])]
print(f"Products in selected categories: {len(selected_categories)}")
print("\nCategory counts:")
print(selected_categories['category'].value_counts())

In [None]:
# Using .between() for range filtering
mid_range = flipkart[flipkart['price'].between(20000, 40000)]
print(f"Products priced between 20K-40K: {len(mid_range)}")
mid_range[['product_name', 'price', 'category']].head()

In [None]:
# Heart Disease - filter patients by multiple conditions
high_risk = heart[
    (heart['age'] > 60) & 
    (heart['chol'] > 240) & 
    (heart['target'] == 1)
]
print(f"High risk patients: {len(high_risk)}")
high_risk.head()

## 3. Data Cleaning Essentials

Handle missing values, duplicates, and data type conversions.

In [None]:
# Check for missing values
print("=== FLIPKART - Missing Values ===")
print(flipkart.isnull().sum())

print("\n=== HEART DISEASE - Missing Values ===")
print(heart.isnull().sum())

In [None]:
# Missing value percentage
print("Missing percentage in Flipkart:")
(flipkart.isnull().sum() / len(flipkart) * 100).round(2)

In [None]:
# Create a working copy for cleaning
flipkart_clean = flipkart.copy()
heart_clean = heart.copy()

print("Working copies created!")

In [None]:
# Fill missing values with mean (for numeric columns)
# Assign the result back: calling fillna(inplace=True) on a selected column
# is deprecated chained assignment and may not update the DataFrame
if flipkart_clean['price'].isnull().any():
    flipkart_clean['price'] = flipkart_clean['price'].fillna(flipkart_clean['price'].mean())
    print("Filled missing prices with mean")

if flipkart_clean['rating'].isnull().any():
    flipkart_clean['rating'] = flipkart_clean['rating'].fillna(flipkart_clean['rating'].median())
    print("Filled missing ratings with median")
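
Whether to fill with the mean or the median matters when a column is skewed, because a single extreme value pulls the mean but barely moves the median. A small sketch with made-up numbers (the `prices` Series is illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one missing value and one extreme outlier
prices = pd.Series([100.0, 120.0, 110.0, np.nan, 5000.0])

mean_filled = prices.fillna(prices.mean())      # the outlier inflates the fill value
median_filled = prices.fillna(prices.median())  # the median stays near typical values

print(mean_filled[3], median_filled[3])
```

Here the mean-based fill lands far above every typical price, which is one reason the median is often preferred for skewed columns.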

In [None]:
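# Drop rows with missing values (alternative approach)
# Uncomment if needed:
# heart_clean = heart_clean.dropna()
# print(f"Rows after dropping NaN: {len(heart_clean)}")

# Report the actual count instead of assuming the dataset is complete
print(f"Rows with missing values in heart dataset: {heart_clean.isnull().any(axis=1).sum()}")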

In [None]:
# Check for duplicates
print(f"Flipkart duplicates: {flipkart_clean.duplicated().sum()}")
print(f"Heart duplicates: {heart_clean.duplicated().sum()}")

In [None]:
# Remove duplicates if any
flipkart_clean = flipkart_clean.drop_duplicates()
heart_clean = heart_clean.drop_duplicates()

print("After removing duplicates:")
print(f"Flipkart: {len(flipkart_clean)} rows")
print(f"Heart: {len(heart_clean)} rows")

In [None]:
# Data type conversions
print("Original data types:")
print(flipkart_clean[['listing_date', 'category', 'brand']].dtypes)

In [None]:
# Convert to datetime
flipkart_clean['listing_date'] = pd.to_datetime(flipkart_clean['listing_date'])

# Convert to category (saves memory for repeated values)
flipkart_clean['category'] = flipkart_clean['category'].astype('category')
flipkart_clean['brand'] = flipkart_clean['brand'].astype('category')

print("After conversion:")
print(flipkart_clean[['listing_date', 'category', 'brand']].dtypes)
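
The memory saving from the `category` dtype can be checked directly with `memory_usage(deep=True)`. The sketch below assumes a made-up column of three brand names repeated many times; the actual saving on the Flipkart data depends on how often its values repeat.

```python
import pandas as pd

# Hypothetical column: three distinct brand names repeated 10,000 times each
brands = pd.Series(["Samsung", "Apple", "Sony"] * 10_000)

bytes_as_strings = brands.memory_usage(deep=True)
bytes_as_category = brands.astype("category").memory_usage(deep=True)

print(f"strings:  {bytes_as_strings:,} bytes")
print(f"category: {bytes_as_category:,} bytes")
```

The categorical version stores each distinct string once plus a small integer code per row, so it helps most when a column has few unique values.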

In [None]:
# String operations - clean product names
print("Original product names:")
print(flipkart_clean['product_name'].head())

# Convert to uppercase
flipkart_clean['product_name_upper'] = flipkart_clean['product_name'].str.upper()

print("\nUppercase:")
print(flipkart_clean['product_name_upper'].head())

In [None]:
# Extract brand from product name (example)
flipkart_clean['brand_from_name'] = flipkart_clean['product_name'].str.split(' ').str[0]
print(flipkart_clean[['product_name', 'brand', 'brand_from_name']].head())

## 4. Creating & Transforming Columns

Add new columns with calculations and transformations.

In [None]:
# Calculate actual discount amount
flipkart_clean['discount_amount'] = flipkart_clean['price'] - flipkart_clean['final_price']

# Calculate savings percentage (verify against discount_percent)
flipkart_clean['savings_pct'] = ((flipkart_clean['discount_amount'] / flipkart_clean['price']) * 100).round(2)

print("New columns added!")
flipkart_clean[['product_name', 'price', 'final_price', 'discount_amount', 'savings_pct']].head()

In [None]:
# Create price categories using pd.cut()
# The open-ended top bin (np.inf) keeps prices above 100,000 from becoming NaN
flipkart_clean['price_category'] = pd.cut(
    flipkart_clean['price'],
    bins=[0, 10000, 30000, 50000, np.inf],
    labels=['Budget', 'Mid-Range', 'Premium', 'Luxury']
)

print("Price categories:")
print(flipkart_clean['price_category'].value_counts().sort_index())

In [None]:
# Create age groups in heart disease data
# Bins are right-inclusive by default: (0, 40], (40, 50], (50, 60], (60, 100]
heart_clean['age_group'] = pd.cut(
    heart_clean['age'],
    bins=[0, 40, 50, 60, 100],
    labels=['<=40', '41-50', '51-60', '61+']
)

print("Age groups:")
print(heart_clean['age_group'].value_counts().sort_index())
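
How `pd.cut` treats values that sit exactly on a bin edge depends on its `right` parameter. The sketch below uses made-up ages on the edges and `labels=False`, which returns the integer bin number instead of a label:

```python
import pandas as pd

# Hypothetical ages chosen to sit on or just past the bin edges
ages = pd.Series([40, 41, 60, 61])
edges = [0, 40, 50, 60, 100]

# Default right=True: bins are (0, 40], (40, 50], (50, 60], (60, 100]
right_closed = pd.cut(ages, bins=edges, labels=False)

# right=False: bins are [0, 40), [40, 50), [50, 60), [60, 100)
left_closed = pd.cut(ages, bins=edges, labels=False, right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```

With the default, a 40-year-old falls in the first bin; with `right=False`, the same patient moves to the second.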

In [None]:
# Using lambda functions
flipkart_clean['is_discounted'] = flipkart_clean['discount_percent'].apply(lambda x: 'Yes' if x > 0 else 'No')

print("Discounted products:")
print(flipkart_clean['is_discounted'].value_counts())

In [None]:
# Using np.where() for conditional columns
flipkart_clean['rating_category'] = np.where(
    flipkart_clean['rating'] >= 4.0, 'High',
    np.where(flipkart_clean['rating'] >= 3.0, 'Medium', 'Low')
)

print("Rating categories:")
print(flipkart_clean['rating_category'].value_counts())

In [None]:
# More complex function with .apply()
def categorize_health_risk(row):
    if row['age'] > 60 and row['chol'] > 240:
        return 'High Risk'
    elif row['age'] > 50 and row['chol'] > 200:
        return 'Medium Risk'
    else:
        return 'Low Risk'

heart_clean['risk_category'] = heart_clean.apply(categorize_health_risk, axis=1)

print("Risk categories:")
print(heart_clean['risk_category'].value_counts())
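
Row-wise `.apply(..., axis=1)` calls a Python function once per row, which can be slow on large frames. A possible vectorized alternative is `np.select`, sketched here on a made-up `patients` frame using the same thresholds as `categorize_health_risk`:

```python
import numpy as np
import pandas as pd

# Hypothetical patients, not rows from the heart disease dataset
patients = pd.DataFrame({
    "age": [65, 55, 45, 70],
    "chol": [250, 210, 300, 180],
})

# Conditions are checked in order; the first one that is True wins
conditions = [
    (patients["age"] > 60) & (patients["chol"] > 240),
    (patients["age"] > 50) & (patients["chol"] > 200),
]
choices = ["High Risk", "Medium Risk"]

patients["risk_category"] = np.select(conditions, choices, default="Low Risk")
print(patients)
```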

## 5. Sorting & Basic Statistics

Sort data and calculate statistical summaries.

In [None]:
# Sort by single column
print("Top 10 most expensive products:")
flipkart_clean.sort_values('price', ascending=False)[['product_name', 'price', 'category']].head(10)

In [None]:
# Sort by multiple columns
print("Sorted by category (asc) and price (desc):")
flipkart_clean.sort_values(
    ['category', 'price'], 
    ascending=[True, False]
)[['product_name', 'category', 'price']].head(10)

In [None]:
# Quick way to get top/bottom values
print("Top 5 highest rated products:")
print(flipkart_clean.nlargest(5, 'rating')[['product_name', 'rating', 'review_count']])

print("\n5 lowest rated products:")
print(flipkart_clean.nsmallest(5, 'rating')[['product_name', 'rating', 'review_count']])

In [None]:
# Basic statistics - Heart Disease
print("=== AGE Statistics ===")
print(f"Mean: {heart_clean['age'].mean():.2f}")
print(f"Median: {heart_clean['age'].median():.2f}")
print(f"Std Dev: {heart_clean['age'].std():.2f}")
print(f"Min: {heart_clean['age'].min()}")
print(f"Max: {heart_clean['age'].max()}")

In [None]:
# Percentiles and quantiles
print("Price percentiles:")
print(flipkart_clean['price'].quantile([0.25, 0.5, 0.75, 0.9, 0.95]))

In [None]:
# Value counts for categorical data
print("Category distribution:")
print(flipkart_clean['category'].value_counts())

print("\nPercentage distribution:")
print(flipkart_clean['category'].value_counts(normalize=True).mul(100).round(2))

In [None]:
# Correlation analysis
print("Correlation between age and cholesterol:")
print(heart_clean[['age', 'chol']].corr())

In [None]:
# Correlation matrix for all numeric columns
numeric_cols = heart_clean.select_dtypes(include=[np.number]).columns
correlation_matrix = heart_clean[numeric_cols].corr()
print("Correlation with target (heart disease):")
print(correlation_matrix['target'].sort_values(ascending=False))

## 6. GroupBy & Aggregations

Group data and calculate aggregate statistics.

In [None]:
# Simple groupby - average price by category
print("Average price by category:")
flipkart_clean.groupby('category')['price'].mean().sort_values(ascending=False).round(2)

In [None]:
# Multiple aggregations using .agg()
print("Price statistics by category:")
flipkart_clean.groupby('category')['price'].agg(['count', 'mean', 'median', 'min', 'max']).round(2)

In [None]:
# Group by multiple columns
print("Average rating by category and brand (top 15):")
# observed=True keeps only category/brand pairs that occur in the data
brand_category_rating = flipkart_clean.groupby(['category', 'brand'], observed=True)['rating'].mean().round(2)
print(brand_category_rating.head(15))
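
Because `category` and `brand` were converted to the `category` dtype earlier, grouping by both can produce a row for every possible combination, including pairs that never occur. The `observed` parameter of `groupby` controls this. A sketch on made-up data (the `toy` frame is illustrative):

```python
import pandas as pd

# Hypothetical data: 2 categories and 3 brands, but only 3 pairs actually occur
toy = pd.DataFrame({
    "category": pd.Categorical(["Mobiles", "Mobiles", "Books"]),
    "brand": pd.Categorical(["Apple", "Samsung", "Penguin"]),
    "rating": [4.5, 4.0, 3.5],
})

# observed=False: one row per possible pair, NaN where the pair has no data
all_pairs = toy.groupby(["category", "brand"], observed=False)["rating"].mean()

# observed=True: only pairs that appear in the data
seen_pairs = toy.groupby(["category", "brand"], observed=True)["rating"].mean()

print(len(all_pairs), len(seen_pairs))
```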

In [None]:
# Aggregate multiple columns with different functions
print("Category summary:")
flipkart_clean.groupby('category').agg({
    'price': ['mean', 'max'],
    'rating': 'mean',
    'review_count': 'sum',
    'units_sold': 'sum'
}).round(2)

In [None]:
# Named aggregations (cleaner column names)
category_stats = flipkart_clean.groupby('category').agg(
    avg_price=('price', 'mean'),
    max_price=('price', 'max'),
    total_units_sold=('units_sold', 'sum'),
    avg_rating=('rating', 'mean'),
    product_count=('product_id', 'count')
).round(2)

print("Category statistics with named aggregations:")
category_stats.sort_values('avg_rating', ascending=False)

In [None]:
# Heart disease - group by age group and sex
print("Health metrics by age group and sex:")
heart_clean.groupby(['age_group', 'sex']).agg({
    'chol': 'mean',
    'trestbps': 'mean',
    'thalach': 'mean',
    'target': 'mean'
}).round(2)

In [None]:
# Pivot table - average price by category and city
pivot = pd.pivot_table(
    flipkart_clean,
    values='price',
    index='category',
    columns='seller_city',
    aggfunc='mean'
).round(2)

print("Average price by category and city:")
pivot

In [None]:
# Cross-tabulation - count of products
crosstab = pd.crosstab(
    flipkart_clean['category'],
    flipkart_clean['seller_city'],
    margins=True
)

print("Product count by category and city:")
crosstab
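
With default arguments, `pd.crosstab` counts rows per combination, which matches a `pivot_table` that uses `aggfunc='count'`. The sketch below checks this on a made-up frame (`toy` and its values are illustrative):

```python
import pandas as pd

# Hypothetical products, not the Flipkart dataset
toy = pd.DataFrame({
    "category": ["Mobiles", "Mobiles", "Books", "Books", "Books"],
    "seller_city": ["Delhi", "Mumbai", "Delhi", "Delhi", "Mumbai"],
    "price": [1, 2, 3, 4, 5],
})

# Count of rows per category/city pair, computed two ways
counts_crosstab = pd.crosstab(toy["category"], toy["seller_city"])
counts_pivot = toy.pivot_table(
    index="category",
    columns="seller_city",
    values="price",
    aggfunc="count",
    fill_value=0,
)

print(counts_crosstab)
print(counts_pivot)
```

`crosstab` is the shorter choice for plain counts, while `pivot_table` is the more general tool when the cells should hold a mean, sum, or other aggregate.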

## 7. Visualizations

Create charts to understand data visually.

In [None]:
# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

In [None]:
# Price distribution - Histogram
plt.figure(figsize=(10, 5))
plt.hist(flipkart_clean['price'], bins=50, edgecolor='black', alpha=0.7)
plt.title('Price Distribution - Flipkart Products', fontsize=14, fontweight='bold')
plt.xlabel('Price (INR)')
plt.ylabel('Frequency')
plt.axvline(flipkart_clean['price'].mean(), color='red', linestyle='--', label=f'Mean: {flipkart_clean["price"].mean():.0f}')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Top categories - Bar chart
category_counts = flipkart_clean['category'].value_counts().head(10)

plt.figure(figsize=(10, 6))
category_counts.plot(kind='barh', color='skyblue', edgecolor='black')
plt.title('Top 10 Categories by Product Count', fontsize=14, fontweight='bold')
plt.xlabel('Number of Products')
plt.ylabel('Category')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Rating vs Price - Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(flipkart_clean['price'], flipkart_clean['rating'], alpha=0.5, s=30)
plt.title('Rating vs Price', fontsize=14, fontweight='bold')
plt.xlabel('Price (INR)')
plt.ylabel('Rating')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
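# Discount patterns by category - Box plot
top_categories = flipkart_clean['category'].value_counts().head(5).index
data_for_box = flipkart_clean[flipkart_clean['category'].isin(top_categories)].copy()
# Drop unused category levels so only the top 5 appear on the x-axis
data_for_box['category'] = data_for_box['category'].cat.remove_unused_categories()

# boxplot(by=...) creates its own figure, so a separate plt.figure() call would leave an empty one
data_for_box.boxplot(column='discount_percent', by='category', figsize=(12, 6))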
plt.title('Discount Percentage by Category', fontsize=14, fontweight='bold')
plt.suptitle('')  # Remove default title
plt.xlabel('Category')
plt.ylabel('Discount %')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Age distribution - Heart Disease
plt.figure(figsize=(10, 5))
plt.hist(heart_clean['age'], bins=20, edgecolor='black', alpha=0.7, color='coral')
plt.title('Age Distribution - Heart Disease Patients', fontsize=14, fontweight='bold')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.axvline(heart_clean['age'].mean(), color='red', linestyle='--', label=f'Mean Age: {heart_clean["age"].mean():.1f}')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap - Heart Disease
plt.figure(figsize=(12, 8))
numeric_cols = heart_clean.select_dtypes(include=[np.number]).columns[:10]  # Select first 10 numeric columns
corr = heart_clean[numeric_cols].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Heatmap - Health Metrics', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
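# Target distribution - Pie chart
# value_counts() sorts by frequency, so sort_index() is needed to order the
# counts as target 0, then 1, matching the hard-coded labels below
target_counts = heart_clean['target'].value_counts().sort_index()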

plt.figure(figsize=(8, 8))
plt.pie(target_counts, labels=['No Disease', 'Disease'], autopct='%1.1f%%', 
        startangle=90, colors=['lightgreen', 'lightcoral'])
plt.title('Heart Disease Distribution', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 8. Practical Analysis Examples

Real-world analysis scenarios using both datasets.

### Flipkart Analysis

In [None]:
# Find best value products (high rating, low price)
best_value = flipkart_clean[
    (flipkart_clean['rating'] >= 4.0) & 
    (flipkart_clean['price'] < flipkart_clean['price'].median())
].copy()

# Calculate value score
best_value['value_score'] = best_value['rating'] / (best_value['price'] / 1000)

print("Top 10 Best Value Products:")
best_value.nlargest(10, 'value_score')[[
    'product_name', 'category', 'price', 'rating', 'value_score'
]]

In [None]:
# Category-wise discount patterns
discount_analysis = flipkart_clean.groupby('category').agg({
    'discount_percent': ['mean', 'max'],
    'final_price': 'mean',
    'product_id': 'count'
}).round(2)

discount_analysis.columns = ['avg_discount', 'max_discount', 'avg_final_price', 'product_count']
print("\nDiscount Patterns by Category:")
discount_analysis.sort_values('avg_discount', ascending=False)

In [None]:
# Seller performance ranking
seller_performance = flipkart_clean.groupby('seller').agg(
    avg_rating=('rating', 'mean'),
    total_products=('product_id', 'count'),
    total_reviews=('review_count', 'sum'),
    total_units_sold=('units_sold', 'sum'),
    avg_seller_rating=('seller_rating', 'mean')
).round(2)

# Filter sellers with at least 10 products
seller_performance = seller_performance[seller_performance['total_products'] >= 10]

print("\nTop 10 Sellers by Performance:")
seller_performance.sort_values('avg_rating', ascending=False).head(10)

In [None]:
# Visualize seller performance
top_sellers = seller_performance.nlargest(10, 'avg_rating')

plt.figure(figsize=(10, 6))
top_sellers['avg_rating'].plot(kind='barh', color='steelblue', edgecolor='black')
plt.title('Top 10 Sellers by Average Rating', fontsize=14, fontweight='bold')
plt.xlabel('Average Rating')
plt.ylabel('Seller')
plt.xlim(0, 5)
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

### Heart Disease Analysis

In [None]:
# Risk factors by age group
age_risk_analysis = heart_clean.groupby('age_group').agg({
    'chol': 'mean',
    'trestbps': 'mean',
    'thalach': 'mean',
    'oldpeak': 'mean',
    'target': 'mean'
}).round(2)

age_risk_analysis.columns = ['Avg_Cholesterol', 'Avg_BP', 'Avg_Heart_Rate', 'Avg_Oldpeak', 'Disease_Rate']
print("Risk Factors by Age Group:")
age_risk_analysis

In [None]:
# Gender-based health metrics
gender_analysis = heart_clean.groupby('sex').agg({
    'age': 'mean',
    'chol': 'mean',
    'trestbps': 'mean',
    'thalach': 'mean',
    'target': ['mean', 'count']
}).round(2)

print("\nHealth Metrics by Gender (0=Female, 1=Male):")
gender_analysis

In [None]:
# Identify high-risk patient profiles
high_risk_patients = heart_clean[
    (heart_clean['age'] > 60) |
    (heart_clean['chol'] > 240) |
    (heart_clean['trestbps'] > 140)
].copy()

print(f"\nHigh Risk Patients: {len(high_risk_patients)} out of {len(heart_clean)} ({len(high_risk_patients)/len(heart_clean)*100:.1f}%)")
print(f"Disease rate in high-risk group: {high_risk_patients['target'].mean()*100:.1f}%")
print(f"Disease rate in overall population: {heart_clean['target'].mean()*100:.1f}%")

In [None]:
# Key findings summary
print("="*60)
print("KEY FINDINGS SUMMARY")
print("="*60)

# Flipkart insights
print("\nFLIPKART INSIGHTS:")
print(f"1. Total products analyzed: {len(flipkart_clean):,}")
print(f"2. Average product price: ₹{flipkart_clean['price'].mean():,.2f}")
print(f"3. Average discount: {flipkart_clean['discount_percent'].mean():.1f}%")
print(f"4. Highest rated category: {flipkart_clean.groupby('category')['rating'].mean().idxmax()}")
print(f"5. Most expensive category: {flipkart_clean.groupby('category')['price'].mean().idxmax()}")

# Heart Disease insights
print("\nHEART DISEASE INSIGHTS:")
print(f"1. Total patients analyzed: {len(heart_clean):,}")
print(f"2. Average patient age: {heart_clean['age'].mean():.1f} years")
print(f"3. Disease prevalence: {heart_clean['target'].mean()*100:.1f}%")
print(f"4. Average cholesterol: {heart_clean['chol'].mean():.1f} mg/dl")
print(f"5. Average blood pressure: {heart_clean['trestbps'].mean():.1f} mm Hg")

print("\n" + "="*60)

## Summary

You've learned the essential pandas operations:

1. **Data Loading & Inspection**: Read CSVs, explore structure
2. **Selection & Filtering**: Access specific data with conditions
3. **Data Cleaning**: Handle missing values, duplicates, types
4. **Transformations**: Create new columns, apply functions
5. **Statistics**: Sort, aggregate, calculate metrics
6. **GroupBy**: Aggregate data by categories
7. **Visualizations**: Create charts to understand data
8. **Practical Analysis**: Solve real-world problems

**Next Steps:**
- Practice with your own datasets
- Explore more visualization options with seaborn
- Learn time series analysis
- Study pandas performance optimization
- Build end-to-end data analysis projects