# Getting Started with Pandas

# 🐼 Introduction to Pandas

## What is Pandas?

**Pandas** is an open-source Python library used for **data analysis and manipulation**. It provides high-level data structures and powerful tools to work with structured (tabular, multidimensional, time series) data.

Pandas is built on top of **NumPy** and is widely used in fields such as:
- Data science and analytics
- Machine learning pipelines
- ETL (Extract, Transform, Load) workflows
- Financial and statistical modeling

---
## 📌 Goal of This Notebook
The purpose of this notebook is to help you build a strong foundation in using the Pandas library, which is essential for data analysis and manipulation in Python. Through hands-on examples and explanations, you'll learn how to inspect, filter, transform, and summarize structured data using Pandas' core tools.

---
## 🎯 What You Will Learn
By the end of this notebook, you will be able to:

* Understand the difference between a `Series` and a `DataFrame`
* Load and explore data using methods like `.head()`, `.info()`, `.describe()`, etc.
* Perform column and row selection using `.loc[]`, `.iloc[]`, and Boolean filtering
* Handle missing data using `.isnull()`, `.fillna()`, and `.dropna()`
* Create, update, rename, and replace columns
* Work with datetime columns and extract features like year, month, and weekday
* Group data and perform aggregations using `.groupby()` and `.agg()`
* Apply string operations to clean or filter textual data

---
# Let's get started!!
---
## 🧱 Core Data Structures in Pandas

Pandas provides two primary data structures:

### 1. 📐 Series

A **Series** is a **one-dimensional labeled array** capable of holding any data type (integers, floats, strings, objects, etc.).

#### Key Features:
- Like a column in Excel or a single column in a table.
- Has both **values** and an **index**.
- Can store missing or NaN values.


### 2. 📊 DataFrame
A DataFrame is a two-dimensional labeled data structure, similar to a spreadsheet, SQL table, or a dictionary of Series objects.

**Key Features:**

* Consists of rows and columns.
* Each column is a Series.
* Can hold different data types in different columns.
* Supports rich operations: filtering, grouping, merging, pivoting, etc.

### ✅ Summary Table
| Concept       | Type             | Dimensions | Use Case                            |
| ------------- | ---------------- | ---------- | ----------------------------------- |
| **Series**    | 1D labeled array | 1          | Storing a single column of data     |
| **DataFrame** | 2D table         | 2          | Storing structured/tabular datasets |


In [1]:
# Install pandas
!pip install pandas



## 📦 1. Imports

We begin by importing the required libraries:

* `pandas` is imported as pd: used for data manipulation and analysis.

* `numpy` is imported as np: used for numerical operations, including NaN values (np.nan).

`🔍 Note:` Python names are case-sensitive, so pandas names must be typed exactly: `pd.DataFrame` works, but `pd.dataframe` does not.

In [2]:
import pandas as pd
import numpy as np

# Note: names are case-sensitive (pd.DataFrame, not pd.dataframe)

## 🧪 2. Ways of Creating Series and DataFrame

### 🔹 2.1 Series
A Series in pandas is a one-dimensional labeled array that can hold any data type.

In [3]:
# Pandas uses two data structures - Series and DataFrame

# Creating the Series 
ser = pd.Series([1,2,3,np.nan,'A'])

# ✅ Creates a Series with integers, a NaN (missing value), and a string.
ser

0      1
1      2
2      3
3    NaN
4      A
dtype: object

* ✅ Creates a Series from a NumPy array.

* dtype: int32 or int64 depending on platform.

In [4]:
# Creating Series from numpy array
arr = np.array([14,56,3,77,11])
ser = pd.Series(arr)
ser

0    14
1    56
2     3
3    77
4    11
dtype: int32
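A Series pairs each value with an index label, and the label does not have to be a number. The following is a minimal sketch of that idea; the product names used as labels are made up for illustration.

```python
import pandas as pd

# Hypothetical product names used as index labels (illustrative data only)
prices = pd.Series([14, 56, 3], index=['pen', 'bag', 'clip'])

values = prices.values        # underlying NumPy array of values
labels = list(prices.index)   # index labels as a plain Python list
bag_price = prices['bag']     # look up a value by its label

print(labels)
print(bag_price)
```

When no `index` is passed, pandas falls back to the default integer labels 0, 1, 2, and so on, as in the cells above.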

### 🔹 2.2 DataFrame
A DataFrame is a two-dimensional, size-mutable, and heterogeneous data structure with labeled axes (rows and columns).

#### ✅ From Series


In [5]:
# Creating the dataframe from a series
sr = pd.Series(['Anna','Bob','Jenna','Park'])

# Converts a Series into a one-column DataFrame.
df = pd.DataFrame(sr)
df

Unnamed: 0,0
0,Anna
1,Bob
2,Jenna
3,Park


#### ✅ From Array

In [6]:
# creating dataframe from array
df = pd.DataFrame(arr)
# Converts a 1D NumPy array into a single-column DataFrame.
df

Unnamed: 0,0
0,14
1,56
2,3
3,77
4,11


#### ✅ From Dictionary

In [7]:
# Creating Dataframe from dictionary

# Dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
}
# Dataframe from dictionary
df = pd.DataFrame(data)

# Let's see the DataFrame
df


Unnamed: 0,Name,Age,Salary
0,Alice,25,50000
1,Bob,30,60000
2,Charlie,35,70000


* Creates a structured table with named columns.
* Each key in the dictionary becomes a column.
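To connect this back to the core data structures, here is a small sketch that checks two of the claims made earlier: each column of a DataFrame is a Series, and different columns can hold different data types. The float salaries are an assumption made here to show a second dtype.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000.0, 60000.0, 70000.0],  # floats, to show a second numeric dtype
})

age_column = df['Age']                          # selecting one column gives a Series
is_series = isinstance(age_column, pd.Series)
column_names = list(df.columns)                 # dictionary keys, in insertion order
age_is_int = pd.api.types.is_integer_dtype(df['Age'])
salary_is_float = pd.api.types.is_float_dtype(df['Salary'])

print(is_series, column_names, age_is_int, salary_is_float)
```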

## 💾 3. Reading & Writing Data (I/O Operations)


* `read_*` functions load data from external files into DataFrames.

* `to_*` functions save DataFrames into respective formats.

* `index=False` prevents pandas from writing row numbers into the output files.

In [None]:
# Reading and writing CSV
df = pd.read_csv('file.csv')
df.to_csv('file.csv', index=False)

# Reading and writing Excel
df = pd.read_excel('file.xlsx')
df.to_excel('file.xlsx', index=False)

# Reading and writing JSON
df = pd.read_json('file.json')
df.to_json('file.json')

# Reading and writing Parquet
df = pd.read_parquet('file.parquet')
df.to_parquet('file.parquet', index=False)
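
The file names in the cell above are placeholders, so it only runs if those files exist. As a self-contained sketch of the same round trip, the example below writes CSV text to memory instead of disk and shows what `index=False` changes. Using `io.StringIO` as a stand-in for a file is a choice made here for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# With no path argument, to_csv returns the CSV text as a string
with_index = df.to_csv()                # row labels become an unnamed first column
without_index = df.to_csv(index=False)  # only the data columns are written

header_with = with_index.splitlines()[0]
header_without = without_index.splitlines()[0]

# Read the CSV text back into a DataFrame, as read_csv would from a file
restored = pd.read_csv(io.StringIO(without_index))

print(header_with)
print(header_without)
```

The extra leading comma in the first header is the unnamed index column; reading such a file back produces a column called `Unnamed: 0`.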


## 🧾 Pandas DataFrame Inspection Commands

These are essential commands to **understand the structure, summary, and metadata** of a DataFrame `df`.

---

#### 📌 `df.head()`
* Returns the first 5 rows of the DataFrame by default.
* Useful for getting a quick look at the data after loading or preprocessing.
* You can pass a number inside head(n) to see the first n rows.

In [9]:
# Dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie','Tim','Mark','Lora'],
    'Age': [25, 30, 35,46,23,67],
    'Salary': [50000, 60000, 70000,80000,340000,760000]
}
# Dataframe from dictionary
df = pd.DataFrame(data)

# Fetching the first 2 rows of the table
df.head(2)


Unnamed: 0,Name,Age,Salary
0,Alice,25,50000
1,Bob,30,60000


#### 📌 tail()
* Returns the last 5 rows of the DataFrame by default; pass a number to `tail(n)` to see the last n rows.
* Helpful to see recent or ending data entries.


In [10]:
# Fetching the last 2 rows of the table
df.tail(2)

Unnamed: 0,Name,Age,Salary
4,Mark,23,340000
5,Lora,67,760000
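
The examples above pass `2` explicitly. A short sketch of what happens with no argument, using a made-up ten-row DataFrame so the default is visible:

```python
import pandas as pd

df = pd.DataFrame({'n': range(10)})   # ten rows holding 0 through 9

default_head = df.head()    # no argument: first 5 rows
default_tail = df.tail()    # no argument: last 5 rows
first_three = df.head(3)    # explicit row count

print(len(default_head), len(default_tail), len(first_three))
```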


#### 📌 info()
* Provides a summary of the DataFrame, including:
    * Number of non-null values
    * Column names
    * Data types
    * Memory usage
* Great for checking missing values and structure at a glance.

In [11]:
df.info()  # Overview of DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    6 non-null      object
 1   Age     6 non-null      int64 
 2   Salary  6 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 272.0+ bytes


#### 📌 describe()

* Generates statistical summary of numeric columns by default.
* Includes:
    * Count
    * Mean
    * Std deviation
    * Min, Max
    * 25%, 50%, 75% percentiles
* Use `df.describe(include='all')` to include non-numeric columns as well.

In [12]:
df.describe()  # Statistical summary


Unnamed: 0,Age,Salary
count,6.0,6.0
mean,37.666667,226666.666667
std,16.560998,283666.470819
min,23.0,50000.0
25%,26.25,62500.0
50%,32.5,75000.0
75%,43.25,275000.0
max,67.0,760000.0
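
The output above contains only the numeric columns. As a small sketch of the `include='all'` option mentioned earlier, the example below compares the two summaries on a shortened version of the table:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

numeric_summary = df.describe()             # numeric columns only
full_summary = df.describe(include='all')   # adds unique/top/freq rows for text columns

print(list(numeric_summary.columns))
print(list(full_summary.columns))
print(full_summary.loc['unique', 'Name'])
```

Statistics that do not apply to a column, such as the mean of `Name`, show up as `NaN` in the combined summary.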


#### 📌 shape
* Returns a tuple: (number of rows, number of columns).
* Helps you quickly understand the size of your dataset.

In [13]:
df.shape  # (rows, columns)

(6, 3)

#### 📌 columns
* Lists the column names in the DataFrame.
* Returns an Index object (like an array of strings).
* You can convert to list using: list(df.columns)

In [14]:
df.columns  # Column names

Index(['Name', 'Age', 'Salary'], dtype='object')

#### 📌 index
* Displays the index (row labels) of the DataFrame.
* Shows the range, type, or custom indexing if applied.



In [15]:
df.index  # Row index

RangeIndex(start=0, stop=6, step=1)

#### 📌 dtypes
* Returns the data type of each column.
* Helpful for type checking before operations (e.g., numeric vs. object).

In [16]:
df.dtypes  # Data types

Name      object
Age        int64
Salary     int64
dtype: object

## 🔍 Selecting and Filtering Data in Pandas

Pandas provides flexible and powerful ways to select, access, and filter data using **labels**, **positions**, and **conditions**.

---

### 📌 Selecting Columns

#### ✅ Single Column (Returns a Series)


In [17]:
df['Age']
# Returns a pandas Series containing values from the Age column.
# Series includes the index and single column of data.

0    25
1    30
2    35
3    46
4    23
5    67
Name: Age, dtype: int64

#### ✅ Multiple Columns (Returns a DataFrame)

In [18]:
# Returns a pandas DataFrame with selected columns.
# Column names must be passed as a list
df[['Name', 'Salary']]

Unnamed: 0,Name,Salary
0,Alice,50000
1,Bob,60000
2,Charlie,70000
3,Tim,80000
4,Mark,340000
5,Lora,760000


### 📌 Selecting Rows by Index
#### ✅ By Integer Position: iloc[]


In [19]:
df.iloc[0]

Name      Alice
Age          25
Salary    50000
Name: 0, dtype: object

* Fetches the first row using its integer location (zero-based).
* Returns a Series representing the row.
* Use `df.iloc[n]` to get the row at position n.
 
#### ✅ By Label: loc[]

In [20]:
df.loc[0]    # Fetch the row labeled 0 (not displayed: a cell shows only its last expression)
df.loc[1:3]  # Slice the rows labeled 1 through 3 (both ends included)

Unnamed: 0,Name,Age,Salary
1,Bob,30,60000
2,Charlie,35,70000
3,Tim,46,80000


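* `df.loc[0]` fetches the row with index label 0 and returns a Series.
* `df.loc[1:3]` returns a DataFrame with the rows labeled 1 through 3; this is the output shown above.
* Useful when custom index labels are used.
* Unlike `iloc`, `loc` is label-based and includes both endpoints of a slice.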
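
With the default integer index, labels and positions look identical, which hides the difference between the two accessors. The sketch below shows the slice behavior side by side and then repeats it with custom labels; the letter labels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35, 46, 23, 67]})

label_slice = df.loc[1:3]       # labels 1, 2, 3: the stop label is included
position_slice = df.iloc[1:3]   # positions 1, 2: the stop position is excluded

# With custom labels (hypothetical letters), loc uses the labels
# while iloc still counts positions from zero.
lettered = df.copy()
lettered.index = list('abcdef')
by_label = lettered.loc['b':'d']
first_row = lettered.iloc[0]

print(len(label_slice), len(position_slice))
print(by_label['Age'].tolist())
```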

### 📌 Filtering Rows (Conditional Selection)

#### ✅ Basic Condition

In [21]:
df[df['Age'] > 30]

Unnamed: 0,Name,Age,Salary
2,Charlie,35,70000
3,Tim,46,80000
5,Lora,67,760000


* Returns a new DataFrame with rows where Age > 30.
* This is a boolean mask filter.
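
Boolean masks can also be combined. A minimal sketch, reusing the same six-row table: each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons in plain Python.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Tim', 'Mark', 'Lora'],
    'Age': [25, 30, 35, 46, 23, 67],
    'Salary': [50000, 60000, 70000, 80000, 340000, 760000],
})

# A mask is a boolean Series aligned with the rows of df
mask = (df['Age'] > 30) & (df['Salary'] < 100000)

both = df[mask]                                              # rows meeting both conditions
either = df[(df['Age'] < 25) | (df['Salary'] > 500000)]      # rows meeting at least one
inverted = df[~mask]                                         # rows where the mask is False

print(both['Name'].tolist())
print(either['Name'].tolist())
```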

#### ✅ Multiple Conditions Using query()

In [22]:
df.query('Age > 30 & Salary < 760000')


Unnamed: 0,Name,Age,Salary
2,Charlie,35,70000
3,Tim,46,80000


* Uses a string expression to filter rows.
* Supports logical operators:

    * & (and), | (or), ~ (not)

* Very readable and clean for complex filtering.
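
Thresholds often live in Python variables rather than in the query string. As a sketch, `query()` can reference local variables with an `@` prefix; the variable names `min_age` and `max_salary` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Tim', 'Mark', 'Lora'],
    'Age': [25, 30, 35, 46, 23, 67],
    'Salary': [50000, 60000, 70000, 80000, 340000, 760000],
})

# Hypothetical thresholds kept in ordinary Python variables
min_age = 30
max_salary = 100000

# '@name' refers to a local variable; 'and' works like '&' inside query()
result = df.query('Age > @min_age and Salary < @max_salary')

print(result['Name'].tolist())
```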

## 🧼 3. Handling Missing Values & NA in Pandas

Real-world datasets often contain **missing values**. Pandas provides tools to **detect, handle, and clean** such values effectively.

---

### 📌 Detecting Missing Values

#### ✅ `df.isnull()`

In [24]:
df.isnull()

Unnamed: 0,Name,Age,Salary
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
5,False,False,False


* Returns a DataFrame of booleans:
    * True if a value is missing (NaN)
    * False otherwise
* You can use it to identify where missing values occur.
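
The table used so far has no missing values, so every cell above is `False`. The sketch below uses a small made-up table with gaps to show two common follow-ups: counting missing values per column and pulling out the affected rows.

```python
import numpy as np
import pandas as pd

# Illustrative data with one gap in each column
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None],
    'Age': [25, np.nan, 35],
    'Salary': [50000, 60000, np.nan],
})

missing_per_column = df.isnull().sum()            # True counts as 1 when summed
rows_with_missing = df[df.isnull().any(axis=1)]   # rows with at least one gap
total_missing = int(df.isnull().sum().sum())

print(missing_per_column)
print(total_missing)
```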

### 🔧 Handling Missing Values with `fillna()`
In data analysis, missing values are common. Instead of removing them entirely using dropna(), you can fill them with meaningful replacements using fillna().

In [66]:
# Each call returns a new object; df is unchanged unless you assign the result back.
df['Salary'].fillna(0)  # Replace missing values in 'Salary' with 0

df['Age'].fillna(df['Age'].mean())  # Replace missing ages with the column mean

df.ffill()  # Forward fill: fill missing values with the previous non-null value (replaces the deprecated fillna(method='ffill'))

df.bfill()  # Backward fill: fill missing values with the next non-null value (replaces the deprecated fillna(method='bfill'))

df.fillna({'Age': 25, 'Salary': 50000})  # Fill specific columns with different values


Unnamed: 0,Name,Age,Salary
0,Alice,25,50000
1,Bob,30,60000
2,Charlie,35,70000


#### ✅ Use Cases:
* Replacing missing numeric values with 0, the mean, the median, or a custom value
* Using forward fill (`ffill`) to propagate the last valid observation
* Using backward fill (`bfill`) to fill gaps with the next valid observation

#### 🔎 Why Use fillna()?
It preserves rows that would otherwise be dropped because of missing values, so more of the data stays available for analysis and modeling.
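
Because the example table contains no gaps, the filling strategies are easier to compare on a small Series that does. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 40.0])   # two gaps in the middle

filled_zero = s.fillna(0)          # constant replacement
filled_mean = s.fillna(s.mean())   # mean of the non-missing values (25.0)
forward = s.ffill()                # carry the previous value forward
backward = s.bfill()               # pull the next value backward

print(filled_mean.tolist())
print(forward.tolist())
print(backward.tolist())
```

None of these calls changes `s` itself; each returns a new Series.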



### 🗑️ Dropping Missing Values
#### ✅ `dropna()`


In [27]:
df.dropna()

Unnamed: 0,Name,Age,Salary
0,Alice,25,50000
1,Bob,30,60000
2,Charlie,35,70000
3,Tim,46,80000
4,Mark,23,340000
5,Lora,67,760000


* Removes any row that contains at least one missing value.
* Returns a new DataFrame with such rows dropped.

Parameters:
* axis=0 (default): drop rows
* axis=1: drop columns
* how='any': drop if any NaN present (default)
* how='all': drop only if all values are NaN in the row/column
* inplace=True: modify the DataFrame directly and return `None` instead of a new DataFrame
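
The table above has nothing to drop, so the output equals the input. The sketch below uses a made-up table with gaps to compare `how='any'` and `how='all'`. It also uses `subset`, a `dropna()` parameter not listed above, which limits the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Illustrative data: row 1 is entirely missing, rows 0 and 3 are partly missing
df = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, np.nan],
    'B': [np.nan, np.nan, 6.0, 8.0],
    'C': [9.0, np.nan, 11.0, 12.0],
})

any_missing = df.dropna()               # default how='any': keep only complete rows
all_missing = df.dropna(how='all')      # drop a row only if every value is missing
a_required = df.dropna(subset=['A'])    # only look at column 'A' when deciding

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(a_required.index.tolist())
```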

## 🛠️ 4. Modifying and Updating Data in Pandas

Pandas makes it easy to modify existing columns, create new ones, rename, and replace values in a DataFrame.

---

#### 🔼 1. Incrementing Column Values

In [28]:
df['Age'] = df['Age'] + 1
df

Unnamed: 0,Name,Age,Salary
0,Alice,26,50000
1,Bob,31,60000
2,Charlie,36,70000
3,Tim,47,80000
4,Mark,24,340000
5,Lora,68,760000


* Increases each value in the Age column by 1.
* The assignment overwrites the Age column in `df` with the new values.
* Works element-wise, just like a NumPy array.

#### 🆕 2. Creating a New Column

In [29]:
df['New_Column'] = df['Salary'] / 1000
df

Unnamed: 0,Name,Age,Salary,New_Column
0,Alice,26,50000,50.0
1,Bob,31,60000,60.0
2,Charlie,36,70000,70.0
3,Tim,47,80000,80.0
4,Mark,24,340000,340.0
5,Lora,68,760000,760.0


* Creates a new column named New_Column.
* Each value is the result of dividing Salary by 1000.
* You can create new columns by applying operations on existing ones.

#### ✏️ 3. Renaming Columns

In [30]:
df.rename(columns={'Salary': 'Annual Salary'}, inplace=True)
df

Unnamed: 0,Name,Age,Annual Salary,New_Column
0,Alice,26,50000,50.0
1,Bob,31,60000,60.0
2,Charlie,36,70000,70.0
3,Tim,47,80000,80.0
4,Mark,24,340000,340.0
5,Lora,68,760000,760.0


* Renames the column `Salary` to `Annual Salary`.
* `columns` expects a dictionary of `{old_name: new_name}`.
* `inplace=True` applies the change directly to the DataFrame without returning a new one.

#### 🔁 4. Replacing Values

In [31]:
df.replace({'Alice': 'Alicia'}, inplace=True)
df

Unnamed: 0,Name,Age,Annual Salary,New_Column
0,Alicia,26,50000,50.0
1,Bob,31,60000,60.0
2,Charlie,36,70000,70.0
3,Tim,47,80000,80.0
4,Mark,24,340000,340.0
5,Lora,68,760000,760.0


* Replaces all occurrences of 'Alice' with 'Alicia' in the entire DataFrame.
* Works on both strings and numeric values.
* Can be used with dictionaries, lists, or regex.
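
Two details of `replace()` are easy to miss: by default it matches whole cell values rather than substrings, and a nested dictionary restricts the replacement to one column. A small sketch with made-up data (the `City` column is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Alice Johnson', 'Bob'],
    'City': ['Alice', 'Paris', 'Rome'],   # 'Alice' also appears here on purpose
})

# Whole-value match in every column: 'Alice Johnson' is left untouched
everywhere = df.replace({'Alice': 'Alicia'})

# Nested dict {column: {old: new}}: only the 'Name' column is changed
name_only = df.replace({'Name': {'Alice': 'Alicia'}})

print(everywhere['Name'].tolist(), everywhere['City'].tolist())
print(name_only['Name'].tolist(), name_only['City'].tolist())
```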

### 🔤 String Operations on Pandas Series (.str Accessor)
When working with textual (string) data in Pandas, the `.str` accessor provides a variety of vectorized string functions similar to Python string methods.

#### 🔹 Convert to Lowercase

In [59]:
# Dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
}
# Dataframe from dictionary
df = pd.DataFrame(data)


df['Name'].str.lower()

0      alice
1        bob
2    charlie
Name: Name, dtype: object

* What It Does: Converts all characters in the `'Name'` column to lowercase.
* Example: `"Alice"` → `"alice"`
* Use Case: Standardize text data before comparisons or filtering.

#### 🔹 Replace Substring

In [60]:
df['Name'].str.replace('Alice', 'Alicia')


0     Alicia
1        Bob
2    Charlie
Name: Name, dtype: object

* What It Does: Replaces all occurrences of `'Alice'` with `'Alicia'` in the 'Name' column.

* Example: `"Alice Johnson"` → `"Alicia Johnson"`

* Use Case: Clean or update names, fix typos, or apply consistent formatting.

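`⚠️ Note:` In pandas 2.0 and later, `str.replace` treats the pattern as a literal string by default (`regex=False`); earlier versions defaulted to `regex=True`. Passing `regex=False` explicitly makes a simple string replacement behave the same across versions: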

In [61]:
df['Name'].str.replace('Alice', 'Alicia', regex=False)


0     Alicia
1        Bob
2    Charlie
Name: Name, dtype: object

#### 🔹 Check if Substring is Present

In [63]:
df['Name'].str.contains('li')


0     True
1    False
2     True
Name: Name, dtype: bool

* What It Does: Returns a boolean Series where True indicates the substring `'li'` is present in the name.

* Example: `"Alice" → True, "Bob" → False`

* Use Case: Useful for filtering rows based on partial matches.

#### 🔹 Remove Leading and Trailing Spaces


In [64]:
df['Name'].str.strip()


0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

* What It Does: Removes whitespace characters from both ends of the string.
* Example: `" Alice " → "Alice"`
* Use Case: Clean messy data before applying other string operations or comparisons.
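
The names in this table are already clean, so `strip()` leaves them unchanged. The sketch below uses deliberately messy made-up values and chains several `.str` methods, then uses the resulting boolean Series as a filter.

```python
import pandas as pd

names = pd.Series(['  Alice ', 'BOB', ' charlie'])   # inconsistent spacing and case

# Each .str call returns a new Series, so calls can be chained
cleaned = names.str.strip().str.lower()

# Literal substring check, then use the boolean result as a mask
has_li = cleaned.str.contains('li', regex=False)
matching = cleaned[has_li]

print(cleaned.tolist())
print(matching.tolist())
```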

## 🕒 5. Handling Dates and Time in Pandas

Pandas provides powerful tools to work with datetime data. You can convert strings to datetime, extract parts like year/month/day, and calculate elapsed time.

---

#### 📅 1. Convert Column to DateTime


In [47]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 32, 45, 29, 38],
    'Salary': [50000, 60000, 80000, 55000, 72000],
    'Date': ['2024-01-10', '15th Feb 2024', '2024-05-22', '1 July 24', '2024-08-18']
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Salary,Date
0,Alice,25,50000,2024-01-10
1,Bob,32,60000,15th Feb 2024
2,Charlie,45,80000,2024-05-22
3,David,29,55000,1 July 24
4,Eva,38,72000,2024-08-18


In [48]:
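# The dates use several formats; format='mixed' (pandas 2.0+) parses each value individually
df['Date'] = pd.to_datetime(df['Date'], format='mixed')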
df

Unnamed: 0,Name,Age,Salary,Date
0,Alice,25,50000,2024-01-10
1,Bob,32,60000,2024-02-15
2,Charlie,45,80000,2024-05-22
3,David,29,55000,2024-07-01
4,Eva,38,72000,2024-08-18


* Converts a string column (e.g., '2024-03-15') into a proper datetime64 format.
* Essential before performing date-based operations.

#### 📆 2. Extract Year
* Extracts the year part (e.g., 2024) from each datetime entry.
* Stored as a new column called Year

In [41]:
df['Year'] = df['Date'].dt.year
df

Unnamed: 0,Name,Age,Salary,Date,Year
0,Alice,25,50000,2024-01-10,2024
1,Bob,32,60000,2024-02-15,2024
2,Charlie,45,80000,2024-05-22,2024
3,David,29,55000,2024-07-01,2024
4,Eva,38,72000,2024-08-18,2024


### ✅ Summary Table
| Operation                       | Output Example | Description                    |
| ------------------------------- | -------------- | ------------------------------ |
| `pd.to_datetime()`              | `datetime64`   | Converts string to datetime    |
| `.dt.year`, `.dt.month`         | `2024`, `3`    | Extract year/month             |
| `.dt.isocalendar().week`        | `11`           | Extract ISO week number        |
| `.dt.day_name()`                | `'Monday'`     | Get full weekday name          |
| `.dt.time`                      | `'13:45:00'`   | Extract time portion           |
| `df['Date'] - df['Date'].min()` | `Timedelta`    | Get difference from first date |
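
The operations in the table can be tried on a short Series of dates. This is a minimal sketch using three of the dates from the example above, all written in the same ISO format so no format hint is needed.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2024-01-10', '2024-02-15', '2024-05-22']))

months = dates.dt.month                            # month number, 1 to 12
weekdays = dates.dt.day_name()                     # full weekday name
iso_weeks = dates.dt.isocalendar().week            # ISO week number
days_since_first = (dates - dates.min()).dt.days   # Timedelta converted to whole days

print(months.tolist())
print(weekdays.tolist())
print(iso_weeks.tolist())
print(days_since_first.tolist())
```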


### 📌 Setting a Column as Index in Pandas

✅ What It Does:
* `df.set_index('Date', inplace=True)` sets the Date column as the index of the DataFrame df.
* The index in pandas is used to label and access rows efficiently.
* `inplace=True` means the change is applied directly to df without needing to assign it back.

In [49]:
df.set_index('Date', inplace=True)
df

Date,Name,Age,Salary
2024-01-10,Alice,25,50000
2024-02-15,Bob,32,60000
2024-05-22,Charlie,45,80000
2024-07-01,David,29,55000
2024-08-18,Eva,38,72000
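
One practical benefit of a datetime index is that rows can be selected with date strings. The following sketch builds a small table that already has dates as its index and selects by month and by date range; `reset_index()` turns the index back into a regular column.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Salary': [50000, 60000, 80000]},
    index=pd.to_datetime(['2024-01-10', '2024-02-15', '2024-05-22']),
)
df.index.name = 'Date'

february = df.loc['2024-02']                  # every row in February 2024
first_quarter = df.loc['2024-01':'2024-03']   # label slice, both months included
restored = df.reset_index()                   # 'Date' becomes a column again

print(february['Name'].tolist())
print(first_quarter['Name'].tolist())
print(list(restored.columns))
```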


### 📊 Grouping and Aggregating Data in Pandas
Pandas provides powerful group-based operations using the `.groupby()` method. It's especially useful when you want to perform summary statistics across categories.

#### 🧩 Example 1: Grouping and Calculating Mean


In [57]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 32, 45, 29, 38],
    'Salary': [50000, 60000, 80000, 55000, 72000],
    'Date': ['2023-01-10', '15th Feb 2022', '2025-05-22', '1 July 20', '2019-08-18']
}

df = pd.DataFrame(data)
# Mixed date formats: format='mixed' (pandas 2.0+) parses each value individually
df['Date'] = pd.to_datetime(df['Date'], format='mixed')

df.groupby(df['Date'].dt.year)['Salary'].mean()


Date
2019    72000.0
2020    55000.0
2022    60000.0
2023    50000.0
2025    80000.0
Name: Salary, dtype: float64

##### ✅ What It Does:
* Groups the data based on unique values for year in the Date column.
* Calculates the mean (average) of the Salary column for each group.

#### 🧩 Example 2: Multiple Aggregations Using .agg()

In [58]:
df.groupby(df['Date'].dt.year).agg({'Salary': 'sum', 'Age': 'mean'})


Date,Salary,Age
2019,72000,38.0
2020,55000,29.0
2022,60000,32.0
2023,50000,25.0
2025,80000,45.0


#### ✅ What It Does:
* Groups by the year extracted from the Date column.
* Applies a different aggregation function to each column:

    * `'Salary': 'sum'` → Total salary per year.
    * `'Age': 'mean'` → Average age per year.
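
In the example above every year holds a single row, so the aggregates equal the original values. The sketch below groups by a made-up `Dept` column so that each group has two rows, and uses named aggregation, an alternative `.agg()` syntax that lets you choose the output column names.

```python
import pandas as pd

df = pd.DataFrame({
    'Dept': ['HR', 'IT', 'IT', 'HR'],   # hypothetical grouping column
    'Age': [25, 32, 45, 29],
    'Salary': [50000, 60000, 80000, 55000],
})

# Named aggregation: output_name=(source_column, function)
summary = df.groupby('Dept').agg(
    total_salary=('Salary', 'sum'),
    mean_age=('Age', 'mean'),
    headcount=('Salary', 'count'),
)

print(summary)
```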

# ✅ Conclusion
This notebook covered the foundational operations in Pandas, from understanding DataFrames and Series to filtering, grouping, handling missing values, datetime parsing, and string manipulation. These are the core building blocks for efficient data analysis and preprocessing using Pandas.

If you’ve grasped these concepts, you’re well on your way to becoming confident with Pandas.
To go deeper and explore advanced functionality, check out the official Pandas Documentation. Link: https://pandas.pydata.org/docs/

**Happy Data Wrangling! 🐼📊✨**