### **🏁 Lesson 1: Introduction to Pandas (Notebook 01_intro_to_pandas.ipynb)**

In [6]:
# ==========================================================
# 🧩 Environment Setup Cell
# Purpose:
# Before beginning any data analysis or visualization task,
# it's important to import and verify your core libraries.
# This ensures your environment is properly configured and
# you know exactly which versions of libraries are being used.
# ==========================================================

# --- Step 1: Import core data and numerical libraries ---

import pandas as pd              # 📊 Pandas: data analysis and DataFrame manipulation
import numpy as np               # 🔢 NumPy: numerical computing, arrays, and math functions
import matplotlib.pyplot as plt  # 📈 Matplotlib: visualizations and plots
import openpyxl                  # openpyxl: Excel engine used by pd.read_excel() for .xlsx files

# --- Step 2: Check library versions (optional but good practice) ---
# Why:
#   Version mismatches can cause function differences or bugs.
#   Logging versions ensures your code is reproducible and consistent
#   across machines, environments, or collaborators.

print("✅ Pandas Version :", pd.__version__)   # shows which version of Pandas is currently active
print("✅ NumPy Version  :", np.__version__)   # shows which version of NumPy is currently active

# --- Step 3: (Optional) Set up Matplotlib style ---
# Uncomment and customize this section if you want consistent plot styling.
# plt.style.use("seaborn-v0_8-whitegrid")  # applies a clean Seaborn-like style to all plots
# print("Matplotlib style set to 'seaborn-v0_8-whitegrid'")

# ==========================================================
# ✅ Output Example (your version numbers may differ):
# ✅ Pandas Version : 2.2.3
# ✅ NumPy Version  : 1.26.4
#
# If both versions print without an error, your environment is ready.
# ==========================================================


✅ Pandas Version : 2.3.2
✅ NumPy Version  : 2.3.2
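
The setup cell imports four libraries but prints versions for only two of them. A minimal sketch of one way to report all four, assuming the standard-library `importlib.metadata` module is acceptable for this; the helper name `report_versions` is hypothetical and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # package is missing from this environment
    return found


# Distribution names of the libraries imported in the setup cell.
versions = report_versions(["pandas", "numpy", "matplotlib", "openpyxl"])
for name, ver in versions.items():
    print(f"{name:<12}: {ver or 'NOT INSTALLED'}")
```

Because a missing package is reported rather than raised, this check can run before the imports and point at whichever library still needs installing.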


#### **🧩 3️⃣ Why Pandas?**

Pandas is a **powerful data analysis library** built on top of NumPy.  
It simplifies data manipulation, cleaning, and exploration that would otherwise require many lines of raw Python code.

---

### 🔍 Comparison: Raw Python vs. Pandas

| Problem | Raw Python Approach | Pandas Solution |
|----------|--------------------|-----------------|
| Manual loops for data cleaning | `for` loops, nested conditionals | `.apply()`, `.map()` |
| Complex joins & merges | Manual nested loops or dictionary merges | `.merge()` |
| Missing data handling | Manual `if`/`None` checks | `.isna()`, `.fillna()` |
| Summaries & statistics | Custom aggregate functions | `.describe()`, `.agg()` |
| Large CSV processing | Line-by-line file reading | `pd.read_csv()` with its optimized parser (plus `chunksize` for files too large for memory) |

---
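
To make the missing-data and summary rows of the table concrete, here is a small side-by-side sketch. The salary values are hypothetical sample data, not taken from the lesson's dataset.

```python
import pandas as pd

# Hypothetical sample: three salaries, one of them missing.
salaries = [50000, None, 70000]

# Raw Python: manual None checks, then a hand-written mean.
known = [s for s in salaries if s is not None]
mean_raw = sum(known) / len(known)
filled_raw = [s if s is not None else mean_raw for s in salaries]

# Pandas: .mean() skips missing values by default, .fillna() replaces them.
s = pd.Series(salaries, dtype="float64")
filled_pd = s.fillna(s.mean())

print(filled_raw)
print(filled_pd.tolist())
print(int(s.isna().sum()), "missing value(s) before filling")
```

Both versions fill the gap with the mean of the known values; the Pandas version expresses that intent in a single line.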

✅ **In short:**  
Pandas turns complex data-handling tasks into **short, readable expressions**,  
making it the go-to tool for data analysis, ETL, and preprocessing in GenAI, Data Science, and BI pipelines.
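
As one concrete illustration of the joins row in the comparison table, the sketch below merges two small tables. The `states` lookup table is a hypothetical addition for demonstration; the names and cities are borrowed from the lesson's sample data.

```python
import pandas as pd

# People table (names and cities borrowed from the lesson's dataset).
people = pd.DataFrame({"name": ["Aarav", "Vihaan", "Arjun"],
                       "city": ["Mumbai", "Pune", "Delhi"]})

# Hypothetical lookup table; Delhi is deliberately left out.
states = pd.DataFrame({"city": ["Mumbai", "Pune"],
                       "state": ["Maharashtra", "Maharashtra"]})

# Left join keeps every person; cities with no match get NaN in "state".
joined = people.merge(states, on="city", how="left")
print(joined)
```

A left join is used here so that unmatched rows stay visible instead of silently disappearing, which is usually the safer default while exploring data.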


**🧩 Import Data Example**

In [7]:


# ✅ Step 1: Define your Excel file path
# Use a raw string (r"...") or double backslashes (\\) so Windows backslashes are not read as escape sequences
file_path = r"C:\Users\dhira\Desktop\python-mastery\pandas\dataset\raw\people_basic_data.xlsx"


# ✅ Step 2: Read the Excel file into a DataFrame
df = pd.read_excel(file_path)  # .xlsx/.xlsm are read with openpyxl; legacy .xls needs xlrd

# ✅ Step 3: Print / preview your data
print(df)  # Prints the DataFrame (long frames are truncated); use df.head() for the first 5 rows

       Name  Age       City  Salary_INR
0     Aarav   23     Mumbai      115059
1    Vivaan   50  Ahmedabad       93035
2    Aditya   46  Ahmedabad       61033
3    Vihaan   37       Pune      187550
4     Arjun   37      Delhi      162866
5       Sai   54      Noida      139457
6   Krishna   23  Hyderabad       98410
7    Ishaan   47    Kolkata      169869
8    Pranav   28     Jaipur      122722
9     Rohit   23  Ahmedabad      105380
10   Ananya   41  Ahmedabad      106578
11     Diya   59     Mumbai       68630
12     Isha   49       Pune       45663
13   Aadhya   50  Hyderabad      124279
14     Myra   48  Hyderabad      180941
15    Pooja   51      Delhi       40848
16    Vanya   40    Chennai      175909
17    Navya   25     Mumbai      170697
18    Meera   53  Ahmedabad       77049
19   Aarohi   38     Jaipur      137382
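
Printing the whole frame works for 20 rows, but a quick preview is usually enough. The sketch below rebuilds the first five rows of the output above as an in-memory DataFrame (so it runs without the Excel file) and shows a few common inspection calls; the variable name `sample` is an assumption for illustration.

```python
import pandas as pd

# First five rows copied from the printed output, so no Excel file is needed.
sample = pd.DataFrame({
    "Name": ["Aarav", "Vivaan", "Aditya", "Vihaan", "Arjun"],
    "Age": [23, 50, 46, 37, 37],
    "City": ["Mumbai", "Ahmedabad", "Ahmedabad", "Pune", "Delhi"],
    "Salary_INR": [115059, 93035, 61033, 187550, 162866],
})

print(sample.shape)    # (number of rows, number of columns)
print(sample.head(3))  # first 3 rows; head() with no argument returns 5
print(sample.dtypes)   # one dtype per column
```

The same three calls work unchanged on the `df` loaded from Excel.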


**🧩 Mini Exercise**
- Create a list of five integers → convert it into a Series.
- Create a dict of names + ages → convert it into a DataFrame.
- Print the DataFrame's `.info()` and `.describe()` summaries.
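
A possible sketch for the first bullet, the list-to-Series conversion. The five integers and the Series name are hypothetical values chosen for illustration.

```python
import pandas as pd

# Five integers (hypothetical values) turned into a one-dimensional Series.
numbers = [10, 20, 30, 40, 50]
s = pd.Series(numbers, name="numbers")

print(s)
print(s.dtype)           # integer dtype inferred from the list
print(s.index.tolist())  # default index labels: 0 through 4
```

A Series is a single labeled column; a DataFrame, used for the remaining bullets, is a collection of such columns sharing one index.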

In [8]:

# ==========================================================
# Here we define two lists:
# 1️⃣ age  → numeric values (integers)
# 2️⃣ name → string values (text)
# Both lists must have the same length so that each row aligns;
# otherwise pd.DataFrame() raises a ValueError.
# ==========================================================
age  = [36, 34, 22, 24, 45]
name = ["Dhiraj", "John", "David", "Kevin", "Ravi"]

# ==========================================================
# pd.DataFrame() converts Python objects (lists, dicts, arrays) into
# a 2D labeled table, the core data structure in Pandas.
# Here the dict keys become column names and the lists become columns.
# ==========================================================
df_ex = pd.DataFrame({"name": name, "age": age})

# ==========================================================
# 🧾 Step 4: Inspect the DataFrame Structure
# ----------------------------------------------------------
# .info() → Prints meta information about the DataFrame:
#            - number of rows and columns
#            - column names and data types
#            - non-null counts (missing values check)
# It prints directly and returns None, so call it on its own;
# wrapping it in display() would also show a stray "None".
# ==========================================================

df_ex.info()

# ==========================================================
# .describe() → Gives descriptive statistics:
#                count, mean, std, min, max, percentiles, etc.
#                Only applies to numeric columns by default.
# ==========================================================
display(df_ex.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    5 non-null      object
 1   age     5 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 212.0+ bytes


|       | age      |
|-------|----------|
| count | 5.0      |
| mean  | 32.2     |
| std   | 9.391486 |
| min   | 22.0     |
| 25%   | 24.0     |
| 50%   | 34.0     |
| 75%   | 36.0     |
| max   | 45.0     |
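
The summary above covers only `age` because `.describe()` skips non-numeric columns by default. A minimal sketch of two ways to include the text column as well, rebuilding the same `df_ex` so the block runs on its own; the variable names `text_summary` and `full_summary` are assumptions for illustration.

```python
import pandas as pd

# Same data as the exercise cell above.
df_ex = pd.DataFrame({"name": ["Dhiraj", "John", "David", "Kevin", "Ravi"],
                      "age": [36, 34, 22, 24, 45]})

# Describing a single text column reports count, unique, top, and freq.
text_summary = df_ex["name"].describe()

# include="all" summarizes every column; statistics that do not apply
# to a column (for example the mean of names) appear as NaN.
full_summary = df_ex.describe(include="all")

print(text_summary)
print(full_summary)
```

Which form is more useful depends on the data: `include="all"` gives one combined table, while per-column calls keep numeric and text summaries separate and easier to read.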
