Copyright (c) 2025 aamirmd. All Rights Reserved.

This work is licensed under the MIT License. See LICENSE file for details.

# 📊 Pandas Tutorial: Your Guide to Data Analysis in Python

Welcome to this Pandas tutorial! 

Pandas is one of the most powerful and widely used data analysis libraries in Python. Whether you're a data scientist, analyst, or developer, Pandas will become an essential tool in your data manipulation toolkit.

### Pre-requisites:
- Basic Python knowledge (variables, types, lists, indexing, dictionaries)
- Familiarity with NumPy (preferred but not required)

### What you'll learn:
- How to work with Series and DataFrames
- Loading and manipulating data
- Essential data analysis operations
- Best practices for data handling

## 🚀 Getting Started: Installing and Importing Pandas

Pandas is easy to install and import. Let's get started!

📚 **Official Documentation**: For more detailed information, visit [pandas.pydata.org](https://pandas.pydata.org/)

💡 **Pro Tip**: Pandas is usually imported with the alias `pd`, which is a community standard you'll see in most Python data analysis code.

In [None]:
!pip install pandas

☁️ **Run on Colab!**

This notebook can be run on Google Colab without needing to install Python on your computer.

🔗 Follow the step-by-step guide here: [Google Colab Instructions](https://medium.com/@jessica0greene/running-your-notebooks-in-the-cloud-with-google-colab-4387529bfad4)

💫 **Benefits of using Colab:**
- Free GPU access
- Pre-installed data science libraries
- Easy sharing and collaboration
- No local setup required

In [None]:
import pandas as pd

## 📦 Series and DataFrames: The Building Blocks

Pandas provides two fundamental data structures that make data analysis a breeze:

### Series (1-dimensional)
- Think of it as a smart, labeled array
- Perfect for time series data, labeled vectors, and single-column data
- Combines array-like operations with dictionary-like indexing

### DataFrame (2-dimensional)
- Like a spreadsheet or SQL table in Python
- Consists of rows and columns with labels
- Can hold different types of data in different columns
- The most commonly used Pandas object
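
As a minimal sketch of the points above, the following example shows a Series used both like an array and like a dictionary, and a DataFrame whose columns hold different types. The names `prices` and `inventory` and all the values are made up for illustration.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration
prices = pd.Series([1.5, 2.0, 3.25], index=["apple", "banana", "cherry"])

# Array-like: arithmetic applies to every element at once
doubled = prices * 2

# Dictionary-like: look up a value by its label
banana_price = prices["banana"]

# A DataFrame can hold a different type in each column
inventory = pd.DataFrame({
    "item": ["apple", "banana", "cherry"],  # strings
    "count": [10, 0, 25],                   # integers
    "price": [1.5, 2.0, 3.25],              # floats
})

print(doubled)
print(banana_price)
print(inventory.dtypes)
```

Printing `inventory.dtypes` is a quick way to confirm which type Pandas inferred for each column.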

In [None]:
# Creating a Series
s = pd.Series([1, 2, 3])
print("Series example:")
display(s)

# Creating a DataFrame from a dictionary of columns
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})
print("DataFrame example:")
display(df)

## 🎯 Indexing and Slicing: Accessing Your Data

Pandas provides powerful ways to access and slice your data. Here's what makes it special:

### Key Features:
- Label-based indexing using `.loc[]`
- Position-based indexing using `.iloc[]`
- Boolean indexing for filtering data
- Multi-level indexing for hierarchical data

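💡 **Best Practice**: Prefer `.loc[]` and `.iloc[]` over direct `[]` indexing when selecting rows or individual elements. They make it explicit whether you mean a label or a position, which avoids ambiguity when the index itself contains integers.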
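
The first three access styles listed above can be sketched on a small DataFrame. The `scores` table and its values are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration
scores = pd.DataFrame(
    {"math": [90, 72, 85], "art": [60, 95, 78]},
    index=["ana", "ben", "cy"],
)

# Label-based: row label "ben", column label "art"
by_label = scores.loc["ben", "art"]

# Position-based: first row, second column (positions start at 0)
by_position = scores.iloc[0, 1]

# Boolean indexing: keep only the rows where the condition is True
high_math = scores[scores["math"] > 80]

# .loc slices include the end label; .iloc slices exclude the end position
loc_slice = scores.loc["ana":"ben"]
iloc_slice = scores.iloc[0:1]

print(by_label, by_position)
print(high_math)
```

Note the difference in slice semantics: `scores.loc["ana":"ben"]` returns two rows, while `scores.iloc[0:1]` returns one.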

In [None]:
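# Indexing by label
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print("Series:")
display(s)
print(f"Element 'b': {s['b']}")

# Labels can be integers too; here 900 is a label, not a position
s = pd.Series(['d', 'e', 'f'], index=[215, 168, 900])
print("Series:")
display(s)
print(f"Element 900: {s.loc[900]}")

# Slicing example
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print("Label slice 'b' to 'd' (end label included):")
display(s.loc['b':'d'])
print("Position slice 1 to 3 (end position excluded):")
display(s.iloc[1:3])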

## 📥 Data Loading: Getting Your Data into Pandas

Pandas excels at importing data from various sources. Here are the most common formats:

### Supported Formats:
- CSV files (`pd.read_csv()`)
- Excel files (`pd.read_excel()`)
- JSON data (`pd.read_json()`)
- SQL databases (`pd.read_sql()`)
- And many more!

💡 **Pro Tip**: Always check the data after loading using `.head()`, `.info()`, and `.describe()` to understand its structure and quality.
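
To try the load-then-inspect routine without any file on disk, here is a small sketch that reads CSV text from memory. The CSV content and the name `people` are invented for this example; `pd.read_csv` accepts a file-like object as well as a path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """name,height_cm,weight_kg
ana,165,58.5
ben,180,
cy,172,70.0
"""

# Read from an in-memory text buffer instead of a file path
people = pd.read_csv(io.StringIO(csv_text))

print(people.head(2))     # first two rows
people.info()             # column dtypes and non-null counts
print(people.describe())  # summary statistics for numeric columns

# Count missing values per column (the empty weight field becomes NaN)
missing = people.isna().sum()
print(missing)
```

Checking `isna().sum()` right after loading is one simple way to spot columns that need cleaning before analysis.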

In [None]:
path_to_data = "data/iris.csv"
df = pd.read_csv(path_to_data)
df

The following operations give a quick look at the dataset.

In [None]:
# To get the first 4 rows
print("First 4 rows")
print(df.head(4))
print()

# To get the last 4 rows
print("Last 4 rows")
print(df.tail(4))
print()

# To get the column names (returned as an Index object, not a Series)
columns = df.columns
print("Columns: ", columns)
print()

Next, get metadata and summary statistics for the dataset.

In [None]:
# To get metadata about the dataset
print("Metadata about dataset")
df.info()
print()

# To get statistics about columns
print("Summary statistics of columns/features")
df.describe()

## Data Selection and Indexing

In [None]:
path_to_data = "data/iris.csv"
df = pd.read_csv(path_to_data)
print(f"Columns are {df.columns.tolist()}")

- Some common ways to retrieve data from DataFrames are given below.

In [None]:
# To get a column as a Series object
column = df['sepal.length']
print("Sepal length column: ")
print(column)
print()

# To get a single element by row label and column label
item = df.loc[4, "petal.width"]
print(f"Item at row label 4 and column 'petal.width': {item}")
print()

# To get a single element using position-based indexing (0-based, like Python lists)
item = df.iloc[6, 3]
print(f"Element at row position 6 and column position 3 (0-based): {item}")
print()

## 🔧 Data Manipulation: Shaping Your Data

Data rarely comes in the exact format we need. Pandas makes it easy to transform and manipulate your data:

### Common Operations:
- Adding/removing columns
- Calculating derived values
- Handling missing data
- Merging and joining datasets
- Grouping and aggregating data

💡 **Best Practice**: Try to use vectorized operations instead of loops for better performance.
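
The operations listed above can be combined in a short pipeline. This is only a sketch: the `orders` and `customers` tables, their columns, and the choice of filling missing prices with the median are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration
orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "quantity": [2, 1, 5, 3],
    "unit_price": [10.0, 25.0, None, 4.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Oslo", "Lima", "Oslo"],
})

# Handle missing data: fill the missing price with the column median
orders["unit_price"] = orders["unit_price"].fillna(orders["unit_price"].median())

# Derived column, computed with a vectorized operation (no Python loop)
orders["total"] = orders["quantity"] * orders["unit_price"]

# Remove a column that is no longer needed
orders = orders.drop(columns=["unit_price"])

# Merge on the shared key, then group and aggregate
merged = orders.merge(customers, on="customer_id", how="left")
total_by_city = merged.groupby("city")["total"].sum()
print(total_by_city)
```

The multiplication on whole columns is the vectorized style the tip recommends: Pandas applies it to every row at once instead of iterating in Python.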

In [None]:
path_to_data = "data/iris.csv"
df = pd.read_csv(path_to_data)
print(f"Columns are {df.columns.tolist()}")

In [None]:
# Create a new column called sepal.area = sepal.length * sepal.width
df['sepal.area'] = df['sepal.length'] * df['sepal.width']
print("Iris dataframe (part) after adding 'sepal.area' column")
print(df[["sepal.length", "sepal.width", "sepal.area"]])

## 🔄 Sorting Data in Pandas

Organizing your data is crucial for analysis and visualization. Pandas' sorting capabilities make this task effortless!

### Key Features:
- Sort by one or multiple columns using the `by` parameter
- Control sort order with `ascending=True/False` (default is ascending)
- Handle missing values with `na_position='first'/'last'`
- Sort in place with `inplace=True`, or keep the default (`inplace=False`) to get a new sorted DataFrame and leave the original unchanged

### Example Syntax:
```python
df.sort_values(by='column_name')  # Sort by single column
df.sort_values(by=['col1', 'col2'])  # Sort by multiple columns
```

💡 **Pro Tip**: When sorting by multiple columns, you can specify different sort orders for each column using a list of booleans for the `ascending` parameter.
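
As a small sketch of mixed sort orders, the example below sorts one column ascending and another descending, and places missing values first. The `df_demo` table and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration
df_demo = pd.DataFrame({
    "species": ["b", "a", "b", "a"],
    "length": [5.1, None, 4.9, 6.3],
})

# species ascending, then length descending within each species;
# rows with a missing length are placed first within their group
sorted_demo = df_demo.sort_values(
    by=["species", "length"],
    ascending=[True, False],
    na_position="first",
)
print(sorted_demo)
```

Because `inplace` is left at its default, `df_demo` keeps its original row order and `sorted_demo` is a new DataFrame.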

In [None]:
# Load the iris dataset
path_to_data = "data/iris.csv"
df = pd.read_csv(path_to_data)

# Sort the dataframe by sepal.length
df_sorted = df.sort_values(by='sepal.length')

print("Iris dataset sorted by sepal length:")
display(df_sorted)