# Why We Use Pandas

# 📊 Pandas Overview  

**Pandas** is a powerful Python library used for **data manipulation and analysis**.  
It provides efficient and flexible data structures like **DataFrame** (2D) and **Series** (1D),  
making it easy to **clean, transform, and analyze data**.  

---

## 🔹 Key Features of Pandas  

- 📥 **Easy Data Import** → from CSV, Excel, SQL, etc.  
- 🧹 **Data Cleansing** → handle missing or incorrect values easily.  
- 📏 **Size Mutability** → add/delete columns and rows dynamically.  
- 🔄 **Reshaping and Pivoting** → transform datasets into desired formats.  
- ⚡ **Efficient Data Manipulation & Extraction** → fast filtering, selection, and aggregation.  


# The Two Main Data Structures in Pandas  

## 1. **Series** → 1D  
## 2. **DataFrame** → 2D  

A **data structure** is a way of organizing values in memory so that they can be stored, accessed, and modified efficiently.


| Feature                            | **Series** (1D)                                         | **DataFrame** (2D)                                               |
|------------------------------------|---------------------------------------------------------|-------------------------------------------------------------------|
| **Definition**                     | One-dimensional labeled array                           | Two-dimensional labeled table (rows + columns)                    |
| **Values mutability**              | **Mutable** — elements can be assigned/changed in-place | **Mutable** — cell values can be assigned/changed in-place        |
| **Structural mutability**          | Structural ops (drop/concat/reindex) **usually return a new object** | Structural ops (drop/concat/rename) **usually return a new object** |
| **Index / axes**                   | Single index                                            | Row index + column labels                                         |
| **Data types**                     | Typically homogeneous (backed by ndarray)               | Heterogeneous (different dtypes per column)                       |
| **Common creation**                | `pd.Series([1,2,3])`                                    | `pd.DataFrame({'a':[1,2],'b':[3,4]})`                             |
| **Interview tip**                  | *Values* can change in-place; *structure* changes often create copies. | Same distinction: *values* can change in-place; *structure* changes often create copies. |
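
To make the table concrete, here is a minimal sketch comparing the two structures. The names `ser` and `frame` and the values are made up for illustration.

```python
import pandas as pd

ser = pd.Series([1, 2, 3], name="a")
frame = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})

# A Series is 1D, a DataFrame is 2D
print(ser.ndim, ser.shape)      # 1 (3,)
print(frame.ndim, frame.shape)  # 2 (2, 2)

# Selecting a single column of a DataFrame gives back a Series
col = frame["a"]
print(type(col).__name__)       # Series

# A Series has one index; a DataFrame has a row index and column labels
print(list(ser.index))                          # [0, 1, 2]
print(list(frame.index), list(frame.columns))   # [0, 1] ['a', 'b']

# Each DataFrame column carries its own dtype
print(frame.dtypes)
```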


# 🔹 Why Pandas when we already have NumPy?

NumPy provides powerful tools for numerical computations, but **Pandas** builds on top of it to handle **structured/tabular data** more effectively.

---

## ✅ Limitations of NumPy
1. Works mainly with **homogeneous data** (all elements must be of the same type).
2. Provides only **array indexing** (difficult to work with row/column labels).
3. No direct support for handling **missing values (NaN)** in a structured way.
4. Lacks built-in tools for:
   - Data cleaning
   - Grouping/aggregating
   - Handling categorical/string data
   - Joining/merging multiple datasets
5. Less human-readable outputs for real-world tabular datasets.

---

## ✅ Why Pandas is Better
1. **Tabular data handling**  
   - Uses **DataFrame** (rows & columns like Excel/SQL table).  
   - Easier to read, interpret, and manipulate.

2. **Label-based indexing**  
   - Access data using column names & row labels (not just numbers).

3. **Heterogeneous data support**  
   - Can store different data types in the same table (int, float, string, datetime).

4. **Missing data handling**  
   - Functions like `.isnull()`, `.dropna()`, `.fillna()` make it easy.

5. **Data Cleaning & Transformation**  
   - Built-in functions for renaming, replacing, filtering, etc.

6. **Merging & Joining**  
   - SQL-like operations (`merge`, `join`, `concat`).

7. **Grouping & Aggregation**  
   - Powerful `.groupby()` functionality for summarizing data.

8. **File I/O**  
   - Can easily read/write CSV, Excel, SQL, JSON, etc. (NumPy's own I/O is limited to plain-text and binary array files, e.g. `np.loadtxt` and `np.save`).
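
As a rough sketch of several points above (heterogeneous columns, label-based access, missing-data handling, and grouping), the snippet below builds a small table. The DataFrame name `people` and all of its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data only: names, ages, and cities are invented
people = pd.DataFrame({
    "name": ["Amit", "Riya", "Karan", "Sneha"],
    "age": [25, np.nan, 22, 30],          # one missing value
    "city": ["Delhi", "Pune", "Delhi", "Pune"],
})

# Heterogeneous data: each column keeps its own dtype
print(people.dtypes)

# Label-based indexing: select a cell by row label and column name
first_name = people.loc[0, "name"]

# Missing-data handling: count NaN values, then fill them with the mean
missing_ages = int(people["age"].isnull().sum())
filled = people.fillna({"age": people["age"].mean()})

# Grouping and aggregation: mean age per city (NaN is skipped by mean)
mean_age_by_city = people.groupby("city")["age"].mean()
print(mean_age_by_city)
```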

---

## ✅ Example

### NumPy (less readable)
```python
import numpy as np

data = np.array([[1, "Amit", 25],
                 [2, "Riya", 28],
                 [3, "Karan", 22]])

print(data)
# Hard to see the structure: NumPy converts every value to a string
```

In [1]:
import pandas as pd
#1. Values are mutable (in-place change works)
# Series Example
s = pd.Series([1, 2, 3])
print("Original Series:\n", s)

# Change element in-place
s[0] = 100
print("\nAfter s[0] = 100 (value mutation):\n", s)

# DataFrame Example
df = pd.DataFrame({"A": [10, 20], "B": [30, 40]})
print("\nOriginal DataFrame:\n", df)

# Change a single cell in-place
df.loc[0, "A"] = 999
print("\nAfter df.loc[0,'A'] = 999 (value mutation):\n", df)


Original Series:
 0    1
1    2
2    3
dtype: int64

After s[0] = 100 (value mutation):
 0    100
1      2
2      3
dtype: int64

Original DataFrame:
     A   B
0  10  30
1  20  40

After df.loc[0,'A'] = 999 (value mutation):
      A   B
0  999  30
1   20  40


In [2]:
#🔹 2. Structural changes return new objects (immutability-like behavior)
# Drop operation on Series
s2 = s.drop(1)   # removes index 1
print("\nResult of s.drop(1):\n", s2)
print("\nOriginal Series remains unchanged:\n", s)

# Drop operation on DataFrame
df2 = df.drop(columns="B")
print("\nResult of df.drop(columns='B'):\n", df2)
print("\nOriginal DataFrame remains unchanged:\n", df)


Result of s.drop(1):
 0    100
2      3
dtype: int64

Original Series remains unchanged:
 0    100
1      2
2      3
dtype: int64

Result of df.drop(columns='B'):
      A
0  999
1   20

Original DataFrame remains unchanged:
      A   B
0  999  30
1   20  40


## 🔑 Pandas Mutability (Interview Note)

- **Series and DataFrame are mutable objects.**  
- **Values are mutable** → can update in-place (`s[0]=5`, `df.loc[0,'col']=7`).  
- **Structural methods** (`drop`, `concat`, `reindex`) → return a **new object** by default; direct assignment (`df['new'] = ...`) and `del df['col']` change the object in place.  
- `inplace=True` exists but is discouraged (may still copy internally).  

👉 **Interview phrasing:**  
“Pandas allows in-place modification of values, but structural operations generally return new objects — so treat element mutation and structure mutation differently.”


## 🔹 Homogeneous Nature of Pandas Series  

- A **Series is homogeneous** → all elements share the **same dtype**.  
- If the values have mixed types, Pandas **upcasts** them to a dtype that can hold all of them (for example, `int` + `float` → `float64`, `int` + `str` → `object`).  

### Example:




In [4]:
import pandas as pd

# Mixed types: int + string
s3 = pd.Series([10, 23, 43, 54, "abcs"])
print(s3)
print("Dtype:", s3.dtype)

# Pure integers
s2 = pd.Series([10, 23, 43, 54])
print(s2)
print("Dtype:", s2.dtype)

0      10
1      23
2      43
3      54
4    abcs
dtype: object
Dtype: object
0    10
1    23
2    43
3    54
dtype: int64
Dtype: int64
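
The upcasting rule can be seen with three small Series. This is only a sketch; the variable names are hypothetical.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])             # integers only → integer dtype
mixed_num = pd.Series([1, 2.5, 3])      # int + float → float64
with_text = pd.Series([1, 2.5, "x"])    # number + string → object

print(ints.dtype, mixed_num.dtype, with_text.dtype)

# In an object Series each element keeps its own Python type
element_types = [type(v).__name__ for v in with_text]
print(element_types)   # ['int', 'float', 'str']
```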


In [None]:
s=pd.Series([10,23,43,4,45,56,78])
s

0    10
1    23
2    43
3     4
4    45
5    56
6    78
dtype: int64

In [None]:
s.dtype

dtype('int64')

In [None]:
# s.name is None because no name has been assigned to the Series yet
print(s.name)

None


In [None]:
# Assign a name to the Series
s.name="numbers"

In [None]:
print(s)

0    10
1    23
2    43
3     4
4    45
5    56
6    78
Name: numbers, dtype: int64


# 📌 Indexing in Pandas Series  

A **Series** is like a 1D array with **labels (index)**. Indexing helps access, slice, and filter elements efficiently.  

---

## 🔹 Types of Indexing  

1. **Default Indexing**
   - Pandas assigns an integer index starting from 0.
   ```python
   s = pd.Series([10, 20, 30, 40])
   print(s[0])   # Access first element → 10
   ```


In [None]:
# 1. Default indexing
s4 = pd.Series([10, 20, 30, 40])
print(s4[0])
# To select several elements, use a slice: start (included) : stop (excluded)
s[0:2]   # first two elements of the Series `s` named "numbers"


10


0    10
1    23
Name: numbers, dtype: int64

### 2. Custom Indexing

You can define custom labels for the index.

In [None]:


s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])   # Access element with label 'b' → 20

### 3. Slicing

Works like Python list slicing (`start:stop`, with `stop` excluded).

In [None]:
s = pd.Series([10, 20, 30, 40, 50])
print(s[1:4])   # Elements at positions 1 to 3

### 4. Label-based Indexing (`.loc`)

Access elements using index labels.

In [None]:
s = pd.Series([100, 200, 300], index=['x', 'y', 'z'])
print(s.loc['y'])   # → 200

### 5. Position-based Indexing (`.iloc`)

Access elements using integer positions.

In [None]:
s = pd.Series([100, 200, 300], index=['x', 'y', 'z'])
print(s.iloc[1])   # → 200

200


In [None]:
s.iloc[[0,1,2]]

x    100
y    200
z    300
dtype: int64
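
The difference between `.loc` and `.iloc` is easiest to see when the labels are integers that do not match the positions. A minimal sketch, assuming a hypothetical Series `t` with labels 10, 20, 30:

```python
import pandas as pd

# Hypothetical Series whose labels are integers but not 0, 1, 2
t = pd.Series([100, 200, 300], index=[10, 20, 30])

print(t.loc[10])     # label 10 → 100
print(t.iloc[0])     # position 0 → 100

# Slicing: .loc includes the stop label, .iloc excludes the stop position
by_label = t.loc[10:20]     # labels 10 and 20
by_position = t.iloc[0:1]   # position 0 only
print(by_label)
print(by_position)
```

Using `.loc` or `.iloc` explicitly avoids guessing whether `t[10]` means a label or a position.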


### 6. Boolean Indexing

Filter elements based on conditions.

In [None]:
s = pd.Series([10, 20, 30, 40, 50])
print(s[s > 25])   # Returns elements > 25

2    30
3    40
4    50
dtype: int64


# 📌 Creating a Pandas Series from a Dictionary  

- A **dictionary** in Python has **key-value pairs**.  
- In Pandas Series:
  - **Keys** → become **index labels**  
  - **Values** → become **data elements**  

### Example:
```python
# Dictionary (the full example is in the next cell)
```


In [None]:

# Dictionary of fruits and their protein content (grams per 100g)
fruits_protein = {
    "Apple": 0.3,
    "Banana": 1.1,
    "Orange": 0.9,
    "Mango": 0.8,
    "Papaya": 0.5,
    "Guava": 2.6,
    "Grapes": 0.6,
    "Pineapple": 0.5,
    "Strawberry": 0.8,
    "Blueberry": 0.7,
    "Blackberry": 1.4,
    "Raspberry": 1.2,
    "Kiwi": 1.1,
    "Pomegranate": 1.7,
    "Watermelon": 0.6,
    "Cantaloupe (Muskmelon)": 0.8,
    "Cherry": 1.0,
    "Peach": 0.9,
    "Pear": 0.4,
    "Plum": 0.7,
    "Apricot": 1.4,
    "Fig": 0.8,
    "Date": 1.8,
    "Jackfruit": 1.7,
    "Avocado": 2.0,
    "Dragon Fruit": 1.2,
    "Lychee": 0.8,
    "Coconut (fresh)": 3.3
}


# Create Series from dictionary
s = pd.Series(fruits_protein,name="protein")

print("Series from Dictionary:\n", s)
print("\nIndex:", s.index)
print("Values:", s.values)

Series from Dictionary:
 Apple                     0.3
Banana                    1.1
Orange                    0.9
Mango                     0.8
Papaya                    0.5
Guava                     2.6
Grapes                    0.6
Pineapple                 0.5
Strawberry                0.8
Blueberry                 0.7
Blackberry                1.4
Raspberry                 1.2
Kiwi                      1.1
Pomegranate               1.7
Watermelon                0.6
Cantaloupe (Muskmelon)    0.8
Cherry                    1.0
Peach                     0.9
Pear                      0.4
Plum                      0.7
Apricot                   1.4
Fig                       0.8
Date                      1.8
Jackfruit                 1.7
Avocado                   2.0
Dragon Fruit              1.2
Lychee                    0.8
Coconut (fresh)           3.3
Name: protein, dtype: float64

Index: Index(['Apple', 'Banana', 'Orange', 'Mango', 'Papaya', 'Guava', 'Grapes',
       'Pineapple', 'St

### ✅ Conditional Selection in Pandas

- **Definition:** Selecting rows/values based on a condition (Boolean indexing).
- Works like filtering in SQL/Excel.
- Returns only the data that matches the condition.



### 📌 Conditional Selection in Pandas (Series)

- **Definition:** Extract elements from a Series using conditions.
- Produces a Boolean mask (`True/False`) and returns matching values.

In [5]:
# Conditional selection: the comparison returns a Boolean mask
s2>1

0    True
1    True
2    True
3    True
dtype: bool

In [6]:
s[s>1]

0    100
1      2
2      3
dtype: int64

### Logical Operators

1. and → `&`
2. or → `|`
3. not → `~`

In [9]:
s[(s>1) & (s<4)]

1    2
2    3
dtype: int64

In [12]:
# NOT operation: ~ inverts the Boolean mask
s[~(1>s)]

0    100
1      2
2      3
dtype: int64
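
The cells above show `&` and `~`. A short sketch of `|` follows, using a hypothetical Series `nums` with the same values as the Series filtered above.

```python
import pandas as pd

nums = pd.Series([100, 2, 3])   # same values as the Series filtered above

# OR: keep values that are below 3 or above 50
either = nums[(nums < 3) | (nums > 50)]
print(either)

# Parentheses are required because & and | bind tighter than comparisons
both = nums[(nums > 1) & (nums < 4)]
print(both)

# Python's `and` / `or` do not work element-wise and raise ValueError
try:
    nums[(nums > 1) and (nums < 4)]
except ValueError as exc:
    print("ValueError:", exc)
```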

### 📌 Pandas DataFrame

- **Definition:** A 2D labeled data structure (rows & columns), like a table in Excel/SQL.
- **Components:**
  - **Rows (Index):** Labels for each record.
  - **Columns:** Labels for each variable/feature.
  - **Values:** Actual data stored in cells.
- **Can store:** Different data types (int, float, string, bool) in different columns.
- **Creation:** From dictionaries, lists of lists, NumPy arrays, CSV/Excel/SQL files.

#### ✅ Example
```python
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["Delhi", "Mumbai", "Pune"]
}

df = pd.DataFrame(data)
print(df)
```


In [13]:
### 📌 Larger Pandas DataFrame Example


import pandas as pd

# Sample data: Employees information
data = {
    "EmployeeID": [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva", "Frank", "Grace", "Hannah", "Ian", "Jack"],
    "Age": [25, 30, 28, 35, 29, 40, 32, 27, 31, 38],
    "Department": ["HR", "IT", "IT", "Finance", "HR", "Finance", "IT", "HR", "Finance", "IT"],
    "Salary": [50000, 60000, 55000, 70000, 52000, 75000, 58000, 51000, 72000, 61000],
    "JoiningDate": ["2020-01-15", "2019-03-22", "2021-07-10", "2018-11-05", "2020-06-30",
                    "2017-09-17", "2019-12-01", "2021-04-12", "2018-05-20", "2020-08-25"]
}

# Create DataFrame
df = pd.DataFrame(data)

print(df)


   EmployeeID     Name  Age Department  Salary JoiningDate
0         101    Alice   25         HR   50000  2020-01-15
1         102      Bob   30         IT   60000  2019-03-22
2         103  Charlie   28         IT   55000  2021-07-10
3         104    David   35    Finance   70000  2018-11-05
4         105      Eva   29         HR   52000  2020-06-30
5         106    Frank   40    Finance   75000  2017-09-17
6         107    Grace   32         IT   58000  2019-12-01
7         108   Hannah   27         HR   51000  2021-04-12
8         109      Ian   31    Finance   72000  2018-05-20
9         110     Jack   38         IT   61000  2020-08-25


In [14]:
# head(n) returns the first n rows (default 5)
df.head(3)

   EmployeeID     Name  Age Department  Salary JoiningDate
0         101    Alice   25         HR   50000  2020-01-15
1         102      Bob   30         IT   60000  2019-03-22
2         103  Charlie   28         IT   55000  2021-07-10


In [16]:
# tail(n) returns the last n rows (default 5)
df.tail(3)

   EmployeeID    Name  Age Department  Salary JoiningDate
7         108  Hannah   27         HR   51000  2021-04-12
8         109     Ian   31    Finance   72000  2018-05-20
9         110    Jack   38         IT   61000  2020-08-25


In [17]:
# iloc selects rows by integer position; the stop position (3) is excluded
df.iloc[1:3]

   EmployeeID     Name  Age Department  Salary JoiningDate
1         102      Bob   30         IT   60000  2019-03-22
2         103  Charlie   28         IT   55000  2021-07-10


In [18]:
# loc selects by label; the stop label (3) is included
df.loc[1:3, ["Age", "Department"]]

   Age Department
1   30         IT
2   28         IT
3   35    Finance
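
The two selectors can also be combined with a Boolean mask. The sketch below uses `emp`, a hypothetical four-row stand-in for the employee table, so that it runs on its own.

```python
import pandas as pd

# Small stand-in for the employee table above (first four rows only)
emp = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 28, 35],
    "Department": ["HR", "IT", "IT", "Finance"],
})

# iloc: rows at positions 1 and 2 (stop position 3 is excluded)
by_position = emp.iloc[1:3]

# loc: rows labelled 1 through 3 (stop label 3 is included)
by_label = emp.loc[1:3, ["Age", "Department"]]

# loc also accepts a Boolean mask for the rows
it_staff = emp.loc[emp["Department"] == "IT", ["Name", "Age"]]

print(by_position)
print(by_label)
print(it_staff)
```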


In [19]:
df["Age"]

0    25
1    30
2    28
3    35
4    29
5    40
6    32
7    27
8    31
9    38
Name: Age, dtype: int64

### 📌 Drop Operation in Pandas DataFrame / Series

- **Definition:** Used to remove rows or columns from a DataFrame or elements from a Series.
- **Default behavior:** Returns a **new object**; the original remains unchanged.
- **`inplace=True` option:** Modifies the original object directly without creating a new one.

---

#### Example: Drop in DataFrame


In [20]:

import pandas as pd



# Drop a column (creates new DataFrame)
new_df = df.drop(columns="Age")
print("New DataFrame after drop:\n", new_df)

# Drop a row using index (in-place)
df.drop(index=1, inplace=True)  # Removes Bob
print("\nOriginal DataFrame after in-place drop:\n", df)


New DataFrame after drop:
    EmployeeID     Name Department  Salary JoiningDate
0         101    Alice         HR   50000  2020-01-15
1         102      Bob         IT   60000  2019-03-22
2         103  Charlie         IT   55000  2021-07-10
3         104    David    Finance   70000  2018-11-05
4         105      Eva         HR   52000  2020-06-30
5         106    Frank    Finance   75000  2017-09-17
6         107    Grace         IT   58000  2019-12-01
7         108   Hannah         HR   51000  2021-04-12
8         109      Ian    Finance   72000  2018-05-20
9         110     Jack         IT   61000  2020-08-25

Original DataFrame after in-place drop:
    EmployeeID     Name  Age Department  Salary JoiningDate
0         101    Alice   25         HR   50000  2020-01-15
2         103  Charlie   28         IT   55000  2021-07-10
3         104    David   35    Finance   70000  2018-11-05
4         105      Eva   29         HR   52000  2020-06-30
5         106    Frank   40    Finance   7

In [21]:
df.shape


(9, 6)

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9 entries, 0 to 9
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   EmployeeID   9 non-null      int64 
 1   Name         9 non-null      object
 2   Age          9 non-null      int64 
 3   Department   9 non-null      object
 4   Salary       9 non-null      int64 
 5   JoiningDate  9 non-null      object
dtypes: int64(3), object(3)
memory usage: 504.0+ bytes


In [23]:
df.describe()

       EmployeeID        Age        Salary
count    9.000000   9.000000      9.000000
mean   105.888889  31.666667  60444.444444
std      2.934469   5.099020   9632.122185
min    101.000000  25.000000  50000.000000
25%    104.000000  28.000000  52000.000000
50%    106.000000  31.000000  58000.000000
75%    108.000000  35.000000  70000.000000
max    110.000000  40.000000  75000.000000


In [24]:
# Broadcasting: the scalar 5000 is added to every value in the column
df["Salary"] = df["Salary"] + 5000

In [25]:
df["Salary"]

0    55000
2    60000
3    75000
4    57000
5    80000
6    63000
7    56000
8    77000
9    66000
Name: Salary, dtype: int64
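
Broadcasting with a scalar is the simplest case. When two Series are combined, Pandas first aligns them on their index labels. A minimal sketch with made-up salary and bonus figures:

```python
import pandas as pd

salary = pd.Series([50000, 60000, 55000], index=["Alice", "Bob", "Charlie"])

# Scalar broadcasting: the scalar is applied to every element
raised = salary + 5000
scaled = salary * 1.10          # result is float64

# Series-with-Series arithmetic aligns on index labels, not positions
bonus = pd.Series([1000, 2000], index=["Bob", "Alice"])
total = salary + bonus          # Charlie has no bonus, so the result is NaN
total_filled = salary.add(bonus, fill_value=0)   # treat a missing bonus as 0

print(raised)
print(total)
print(total_filled)
```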

In [26]:
# Rename a column

df.rename(columns={"Department" : "Dept"},inplace=True)

In [27]:
df

   EmployeeID     Name  Age     Dept  Salary JoiningDate
0         101    Alice   25       HR   55000  2020-01-15
2         103  Charlie   28       IT   60000  2021-07-10
3         104    David   35  Finance   75000  2018-11-05
4         105      Eva   29       HR   57000  2020-06-30
5         106    Frank   40  Finance   80000  2017-09-17
6         107    Grace   32       IT   63000  2019-12-01
7         108   Hannah   27       HR   56000  2021-04-12
8         109      Ian   31  Finance   77000  2018-05-20
9         110     Jack   38       IT   66000  2020-08-25


In [28]:
# List the unique departments

df["Dept"].unique()

array(['HR', 'IT', 'Finance'], dtype=object)

In [29]:
df["Dept"].value_counts()

Dept
HR         3
IT         3
Finance    3
Name: count, dtype: int64
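
`value_counts()` is effectively a per-group row count. The `.groupby()` method mentioned earlier generalises this to other aggregations. A sketch using `emp`, a hypothetical six-row stand-in for the employee table:

```python
import pandas as pd

# Stand-in for the employee table (salaries after the +5000 raise above)
emp = pd.DataFrame({
    "Dept": ["HR", "IT", "Finance", "HR", "Finance", "IT"],
    "Salary": [55000, 60000, 75000, 57000, 80000, 63000],
})

# value_counts() counts the rows in each group
print(emp["Dept"].value_counts())

# groupby() allows any aggregation per group
mean_salary = emp.groupby("Dept")["Salary"].mean()
summary = emp.groupby("Dept")["Salary"].agg(["min", "max", "count"])

print(mean_salary)
print(summary)
```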

In [31]:
df.isnull().sum()

EmployeeID     0
Name           0
Age            0
Dept           0
Salary         0
JoiningDate    0
dtype: int64

# Pandas `dropna()` Notes

## What is `dropna()`?
- Removes missing values (`NaN` / `None`) from a DataFrame or Series.

---

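## Syntax

`DataFrame.dropna(axis=0, how='any', subset=None, inplace=False)`; pass `thresh=n` in place of `how` to keep only rows/columns with at least `n` non-NaN values.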

---

## Parameters

### 1. axis
- `0` → drop **rows** with NaN (default)  
- `1` → drop **columns** with NaN  

### 2. how
- `'any'` → drop row/column if **any NaN** exists (default)  
- `'all'` → drop row/column only if **all values are NaN**  

### 3. thresh
- Minimum number of **non-NaN values required** to keep the row/column  

### 4. subset
- Specify column(s) to check for NaN while dropping  

### 5. inplace
- `False` → returns new DataFrame (default)  
- `True` → modifies the original DataFrame directly  

---

## Examples



In [39]:

import pandas as pd
import numpy as np

data = {
    "Name": ["Amit", "Riya", "Karan", "Sneha", "Vikas", "Neha", np.nan, "Rohan"],
    "Age": [25, np.nan, 30, 22, np.nan, 28, 35, 40],
    "City": ["Delhi", "Mumbai", np.nan, "Pune", "Delhi", "Chennai", "Kolkata", np.nan],
    "Salary": [50000, 60000, np.nan, 45000, 52000, np.nan, 70000, 80000]
}
df2 = pd.DataFrame(data)
df2



    Name   Age     City   Salary
0   Amit  25.0    Delhi  50000.0
1   Riya   NaN   Mumbai  60000.0
2  Karan  30.0      NaN      NaN
3  Sneha  22.0     Pune  45000.0
4  Vikas   NaN    Delhi  52000.0
5   Neha  28.0  Chennai      NaN
6    NaN  35.0  Kolkata  70000.0
7  Rohan  40.0      NaN  80000.0


In [42]:
# Drop rows with any NaN
df2.dropna()



    Name   Age   City   Salary
0   Amit  25.0  Delhi  50000.0
3  Sneha  22.0   Pune  45000.0


In [44]:
# Drop columns with any NaN
df2.dropna(axis=1)
# In this example every column contains at least one NaN, so all columns are dropped

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7]


In [45]:
# Drop rows where all values are NaN
df2.dropna(how='all')



    Name   Age     City   Salary
0   Amit  25.0    Delhi  50000.0
1   Riya   NaN   Mumbai  60000.0
2  Karan  30.0      NaN      NaN
3  Sneha  22.0     Pune  45000.0
4  Vikas   NaN    Delhi  52000.0
5   Neha  28.0  Chennai      NaN
6    NaN  35.0  Kolkata  70000.0
7  Rohan  40.0      NaN  80000.0


In [47]:
# Drop rows with fewer than 2 non-NaN values (every row here has at least 2, so all are kept)
df2.dropna(thresh=2)

    Name   Age     City   Salary
0   Amit  25.0    Delhi  50000.0
1   Riya   NaN   Mumbai  60000.0
2  Karan  30.0      NaN      NaN
3  Sneha  22.0     Pune  45000.0
4  Vikas   NaN    Delhi  52000.0
5   Neha  28.0  Chennai      NaN
6    NaN  35.0  Kolkata  70000.0
7  Rohan  40.0      NaN  80000.0


In [48]:
# Drop rows that have NaN in the 'Salary' column
df2.dropna(subset=['Salary'])

    Name   Age     City   Salary
0   Amit  25.0    Delhi  50000.0
1   Riya   NaN   Mumbai  60000.0
3  Sneha  22.0     Pune  45000.0
4  Vikas   NaN    Delhi  52000.0
6    NaN  35.0  Kolkata  70000.0
7  Rohan  40.0      NaN  80000.0


In [None]:
# dropna() also works on the employee DataFrame;
# it has no missing values, so every row is kept

df.dropna()


   EmployeeID     Name  Age     Dept  Salary JoiningDate
0         101    Alice   25       HR   55000  2020-01-15
2         103  Charlie   28       IT   60000  2021-07-10
3         104    David   35  Finance   75000  2018-11-05
4         105      Eva   29       HR   57000  2020-06-30
5         106    Frank   40  Finance   80000  2017-09-17
6         107    Grace   32       IT   63000  2019-12-01
7         108   Hannah   27       HR   56000  2021-04-12
8         109      Ian   31  Finance   77000  2018-05-20
9         110     Jack   38       IT   66000  2020-08-25


In [33]:
df

   EmployeeID     Name  Age     Dept  Salary JoiningDate
0         101    Alice   25       HR   55000  2020-01-15
2         103  Charlie   28       IT   60000  2021-07-10
3         104    David   35  Finance   75000  2018-11-05
4         105      Eva   29       HR   57000  2020-06-30
5         106    Frank   40  Finance   80000  2017-09-17
6         107    Grace   32       IT   63000  2019-12-01
7         108   Hannah   27       HR   56000  2021-04-12
8         109      Ian   31  Finance   77000  2018-05-20
9         110     Jack   38       IT   66000  2020-08-25



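---

# Pandas `fillna()` Notes

## What is `fillna()`?
- Replaces missing values (`NaN` / `None`) in a DataFrame or Series with a value you choose.

---

## Parameters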

### 1. value
- Replace NaN with a specific value (number, string, dict, or Series).  
- Example: `df.fillna(0)` replaces all NaN with `0`.

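### 2. method
- `'ffill'` (forward fill) → fills NaN with the **previous value**.  
- `'bfill'` (backward fill) → fills NaN with the **next value**.  
- Deprecated in recent Pandas releases (2.1+); prefer the dedicated methods `df.ffill()` and `df.bfill()`.  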

### 3. axis
- `0` → fill values **down rows** (default).  
- `1` → fill values **across columns**.  

### 4. inplace
- `False` → returns a new DataFrame (default).  
- `True` → modifies the original DataFrame directly.  

### 5. limit
- Maximum number of NaN values to forward/backward fill.  

---

## Example Dataset





In [72]:
import pandas as pd
import numpy as np

data = {
    "Name": ["Amit", "Riya", "Karan", "Sneha", "Vikas", "Neha", np.nan, "Rohan"],
    "Age": [25, np.nan, 30, 22, np.nan, 28, 35, 40],
    "City": ["Delhi", "Mumbai", np.nan, "Pune", "Delhi", "Chennai", "Kolkata", np.nan],
    "Salary": [50000, 60000, np.nan, 45000, 52000, np.nan, 70000, 80000]
}
df3 = pd.DataFrame(data)
print("Original DataFrame:\n", df3)

Original DataFrame:
     Name   Age     City   Salary
0   Amit  25.0    Delhi  50000.0
1   Riya   NaN   Mumbai  60000.0
2  Karan  30.0      NaN      NaN
3  Sneha  22.0     Pune  45000.0
4  Vikas   NaN    Delhi  52000.0
5   Neha  28.0  Chennai      NaN
6    NaN  35.0  Kolkata  70000.0
7  Rohan  40.0      NaN  80000.0


In [50]:
# 1. Fill all NaN with a constant value
df3.fillna(0)

    Name   Age     City   Salary
0   Amit  25.0    Delhi  50000.0
1   Riya   0.0   Mumbai  60000.0
2  Karan  30.0        0      0.0
3  Sneha  22.0     Pune  45000.0
4  Vikas   0.0    Delhi  52000.0
5   Neha  28.0  Chennai      0.0
6      0  35.0  Kolkata  70000.0
7  Rohan  40.0        0  80000.0


In [51]:
# 2. Fill NaN in a specific column with a value
df3["Age"].fillna(df3["Age"].mean())   # replace with mean age

0    25.0
1    30.0
2    30.0
3    22.0
4    30.0
5    28.0
6    35.0
7    40.0
Name: Age, dtype: float64

In [52]:
# 3. Forward fill (copy the previous value)
# ffill() replaces the deprecated fillna(method="ffill")
df3.ffill()

    Name   Age     City   Salary
0   Amit  25.0    Delhi  50000.0
1   Riya  25.0   Mumbai  60000.0
2  Karan  30.0   Mumbai  60000.0
3  Sneha  22.0     Pune  45000.0
4  Vikas  22.0    Delhi  52000.0
5   Neha  28.0  Chennai  52000.0
6   Neha  35.0  Kolkata  70000.0
7  Rohan  40.0  Kolkata  80000.0


In [None]:
# 4. Backward fill (copy the next value)
# bfill() replaces the deprecated fillna(method="bfill")
df3.bfill()

In [53]:
# 5. Fill with dictionary (different values for each column)
df3.fillna({"Age": 25, "City": "Unknown", "Salary": 0})

    Name   Age     City   Salary
0   Amit  25.0    Delhi  50000.0
1   Riya  25.0   Mumbai  60000.0
2  Karan  30.0  Unknown      0.0
3  Sneha  22.0     Pune  45000.0
4  Vikas  25.0    Delhi  52000.0
5   Neha  28.0  Chennai      0.0
6    NaN  35.0  Kolkata  70000.0
7  Rohan  40.0  Unknown  80000.0


In [54]:
# 6. Limit the fill: at most 1 consecutive NaN is filled in each gap
df3.ffill(limit=1)

    Name   Age     City   Salary
0   Amit  25.0    Delhi  50000.0
1   Riya  25.0   Mumbai  60000.0
2  Karan  30.0   Mumbai  60000.0
3  Sneha  22.0     Pune  45000.0
4  Vikas  22.0    Delhi  52000.0
5   Neha  28.0  Chennai  52000.0
6   Neha  35.0  Kolkata  70000.0
7  Rohan  40.0  Kolkata  80000.0
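
In practice the fill value often depends on the column. The sketch below is one possible approach, not the only one: it uses the mean for a numeric column, a placeholder for names, and the most frequent value for cities. The frame `people` is a shortened, hypothetical version of the dataset above.

```python
import numpy as np
import pandas as pd

# Same kind of data as df3 above, shortened to four rows
people = pd.DataFrame({
    "Name": ["Amit", "Riya", "Karan", np.nan],
    "Age": [25, np.nan, 30, 35],
    "City": ["Delhi", "Mumbai", np.nan, "Delhi"],
})

# One fill value per column
fill_values = {
    "Name": "Unknown",                   # placeholder text
    "Age": people["Age"].mean(),         # mean of the non-missing ages
    "City": people["City"].mode()[0],    # most frequent city
}
cleaned = people.fillna(fill_values)

# ffill() / bfill() are the method forms of forward and backward fill
forward = people["Age"].ffill()
backward = people["Age"].bfill()

print(cleaned)
```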


In [55]:
# replace() substitutes one value for another
df3["Name"].replace('Amit', 'Rose')

0     Rose
1     Riya
2    Karan
3    Sneha
4    Vikas
5     Neha
6      NaN
7    Rohan
Name: Name, dtype: object

# Dealing with Duplicate Values in Pandas

## What are duplicate values?
- Duplicate rows are rows in a DataFrame where **all or some columns have the same values**.
- They can cause incorrect analysis, so we need to **detect and remove them**.

---

## Functions to handle duplicates

### 1. `duplicated()`
- Returns a Boolean Series showing which rows are duplicates.
- Syntax:
  ```python
  DataFrame.duplicated(subset=None, keep='first')
  ```


In [58]:
# Duplicates:
# keep='first' (default) flags later copies; keep='last' flags earlier copies

df_dup = df3[df3.duplicated(keep='last')]
df_dup

# To drop duplicate rows, use drop_duplicates()

df3 = df3.drop_duplicates()
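
`df3` contains no repeated rows, so the cell above has nothing to flag. The sketch below uses a made-up frame, `dup_df`, that does contain duplicates, to show how `keep` and `subset` change the result.

```python
import pandas as pd

# Made-up rows: index 2 repeats index 0 exactly; index 3 repeats only the name
dup_df = pd.DataFrame({
    "Name": ["Amit", "Riya", "Amit", "Riya"],
    "City": ["Delhi", "Mumbai", "Delhi", "Pune"],
})

# keep='first' (default) flags later copies; keep='last' flags earlier ones;
# keep=False flags every row that has a duplicate
first_flags = dup_df.duplicated()              # only row 2 is True
last_flags = dup_df.duplicated(keep="last")    # only row 0 is True
all_flags = dup_df.duplicated(keep=False)      # rows 0 and 2 are True

# subset limits the comparison to the chosen columns
by_name = dup_df.duplicated(subset=["Name"])   # rows 2 and 3 are True

deduped = dup_df.drop_duplicates()             # drops row 2
deduped_by_name = dup_df.drop_duplicates(subset=["Name"], keep="last")

print(deduped)
print(deduped_by_name)
```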

# Dealing with Invalid Values using Lambda in Pandas

## What are invalid values?
- Values in a DataFrame that are **not correct or meaningful**.
- Examples:
  - Negative age values
  - Wrong city names
  - Out-of-range marks/scores

We can use **`apply()` with a `lambda` function** to detect and fix them.

---

## Using `lambda` with `apply()`

### General Syntax
```python
DataFrame['column'] = DataFrame['column'].apply(lambda x: <new value for x>)
```


In [59]:
# How to deal with invalid values using a lambda

# 1. Replace invalid ages (<0 or >120) with NaN
# np.nan (rather than pd.NA) keeps the column as float64
df3["Age"] = df3["Age"].apply(lambda x: x if (x >= 0 and x <= 120) else np.nan)

In [60]:
df3

    Name   Age     City   Salary
0   Amit  25.0    Delhi  50000.0
1   Riya   NaN   Mumbai  60000.0
2  Karan  30.0      NaN      NaN
3  Sneha  22.0     Pune  45000.0
4  Vikas   NaN    Delhi  52000.0
5   Neha  28.0  Chennai      NaN
6    NaN  35.0  Kolkata  70000.0
7  Rohan  40.0      NaN  80000.0
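
The ages in `df3` are all valid, so the table above is unchanged. To see the rule take effect, the sketch below uses made-up ages with two invalid entries, and also shows a vectorized alternative based on `between()` and `where()`.

```python
import numpy as np
import pandas as pd

# Made-up ages, including two invalid entries (-5 and 150)
ages = pd.Series([25, -5, 30, 150, np.nan], name="Age")

# between() is inclusive on both ends by default; NaN compares as False
valid = ages.between(0, 120)

# where() keeps values where the mask is True and puts NaN elsewhere
cleaned = ages.where(valid)

# Equivalent element-wise version with apply() and a lambda
cleaned_apply = ages.apply(lambda x: x if 0 <= x <= 120 else np.nan)

print(cleaned)
```

The vectorized form avoids calling a Python function once per element, which tends to matter on large columns.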


# Splitting a Column into Multiple Columns using `str.split()`

## Example

```python
import pandas as pd

# Create a sample DataFrame (the full example is in the next cell)
```


In [62]:
# Names are stored as a single string such as "Vaibhav_Bajpai"
df4 = pd.DataFrame({
    "name": ["Vaibhav_Bajpai", "Riya_Sharma", "Karan_Singh", "Sneha_Gupta"]
})

print("Original DataFrame:\n", df4)

# Split the 'name' column into two new columns: first_name and last_name
df4[["first_name", "last_name"]] = df4["name"].str.split("_", expand=True)

print("\nAfter splitting:\n", df4)


Original DataFrame:
              name
0  Vaibhav_Bajpai
1     Riya_Sharma
2     Karan_Singh
3     Sneha_Gupta

After splitting:
              name first_name last_name
0  Vaibhav_Bajpai    Vaibhav    Bajpai
1     Riya_Sharma       Riya    Sharma
2     Karan_Singh      Karan     Singh
3     Sneha_Gupta      Sneha     Gupta


In [75]:
def multiplying_age(x):
    return x * 2

df3["Age"] = df3["Age"].apply(multiplying_age)
df3


    Name   Age     City   Salary
0   Amit  50.0    Delhi  50000.0
1   Riya   NaN   Mumbai  60000.0
2  Karan  60.0      NaN      NaN
3  Sneha  44.0     Pune  45000.0
4  Vikas   NaN    Delhi  52000.0
5   Neha  56.0  Chennai      NaN
6    NaN  70.0  Kolkata  70000.0
7  Rohan  80.0      NaN  80000.0


In [78]:
df3["Age"]=df3["Age"].apply(lambda x:x/2)
df3

    Name   Age     City   Salary
0   Amit  25.0    Delhi  50000.0
1   Riya   NaN   Mumbai  60000.0
2  Karan  30.0      NaN      NaN
3  Sneha  22.0     Pune  45000.0
4  Vikas   NaN    Delhi  52000.0
5   Neha  28.0  Chennai      NaN
6    NaN  35.0  Kolkata  70000.0
7  Rohan  40.0      NaN  80000.0


# Joins in Pandas

## What are Joins?
- A **join** combines rows from two DataFrames based on a common column (called a **key**).
- Similar to **SQL JOINs**.
- In Pandas, joins are mainly performed using:
  - `merge()` → column or index based
  - `join()` → index based

---

## 🔹 Syntax for `merge()`
```python
# how can be 'inner' (default), 'left', 'right', 'outer', or 'cross'
pd.merge(left, right, how='inner', on='key_column')
```


In [79]:
import pandas as pd

# DataFrame 1: Employee details
ndf1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "name": ["Amit", "Riya", "Karan", "Sneha", "Anuj", "Priya"],
    "age": [25, 28, 22, 30, 26, 24]
})

# DataFrame 2: Salary & Department
ndf2 = pd.DataFrame({
    "id": [3, 4, 5, 6, 7, 8],
    "salary": [55000, 60000, 70000, 80000, 75000, 90000],
    "department": ["IT", "HR", "Finance", "Marketing", "Sales", "Admin"]
})

print("DataFrame 1:\n", ndf1)
print("\nDataFrame 2:\n", ndf2)


DataFrame 1:
    id   name  age
0   1   Amit   25
1   2   Riya   28
2   3  Karan   22
3   4  Sneha   30
4   5   Anuj   26
5   6  Priya   24

DataFrame 2:
    id  salary department
0   3   55000         IT
1   4   60000         HR
2   5   70000    Finance
3   6   80000  Marketing
4   7   75000      Sales
5   8   90000      Admin


## 🔹 Inner Join
- Keeps only matching rows from both DataFrames.


In [80]:
inner = pd.merge(ndf1, ndf2, on="id", how="inner")
print(inner)


   id   name  age  salary department
0   3  Karan   22   55000         IT
1   4  Sneha   30   60000         HR
2   5   Anuj   26   70000    Finance
3   6  Priya   24   80000  Marketing


## 🔹 Left Join
- Keeps all rows from `ndf1` and matching rows from `ndf2`.


In [81]:
left = pd.merge(ndf1, ndf2, on="id", how="left")
print(left)


   id   name  age   salary department
0   1   Amit   25      NaN        NaN
1   2   Riya   28      NaN        NaN
2   3  Karan   22  55000.0         IT
3   4  Sneha   30  60000.0         HR
4   5   Anuj   26  70000.0    Finance
5   6  Priya   24  80000.0  Marketing


## 🔹 Right Join
- Keeps all rows from `ndf2` and matching rows from `ndf1`.


In [82]:
right = pd.merge(ndf1, ndf2, on="id", how="right")
print(right)


   id   name   age  salary department
0   3  Karan  22.0   55000         IT
1   4  Sneha  30.0   60000         HR
2   5   Anuj  26.0   70000    Finance
3   6  Priya  24.0   80000  Marketing
4   7    NaN   NaN   75000      Sales
5   8    NaN   NaN   90000      Admin


## 🔹 Outer Join
- Keeps all rows from both DataFrames, fills missing values with NaN.


In [83]:
outer = pd.merge(ndf1, ndf2, on="id", how="outer")
print(outer)


   id   name   age   salary department
0   1   Amit  25.0      NaN        NaN
1   2   Riya  28.0      NaN        NaN
2   3  Karan  22.0  55000.0         IT
3   4  Sneha  30.0  60000.0         HR
4   5   Anuj  26.0  70000.0    Finance
5   6  Priya  24.0  80000.0  Marketing
6   7    NaN   NaN  75000.0      Sales
7   8    NaN   NaN  90000.0      Admin
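
Two related options can be sketched briefly: `indicator=True` in `merge()` records where each row came from, and `join()` performs the same kind of match on the index. The frames `left_df` and `right_df` are small made-up examples.

```python
import pandas as pd

left_df = pd.DataFrame({"id": [1, 2, 3], "name": ["Amit", "Riya", "Karan"]})
right_df = pd.DataFrame({"id": [3, 4], "salary": [55000, 60000]})

# indicator=True adds a `_merge` column with the origin of each row:
# 'left_only', 'right_only', or 'both'
outer = pd.merge(left_df, right_df, on="id", how="outer", indicator=True)
print(outer)

# join() matches on the index, so the key is moved into the index first
joined = left_df.set_index("id").join(right_df.set_index("id"), how="inner")
print(joined)
```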


In [85]:
# Concatenation with pd.concat()

# ✅ Row-wise concat (default)
row_concat = pd.concat([ndf1, ndf2])

# ✅ Column-wise concat
col_concat = pd.concat([ndf1, ndf2], axis=1)

print("Row-wise Concat:\n", row_concat)
print("\nColumn-wise Concat:\n", col_concat)

Row-wise Concat:
    id   name   age   salary department
0   1   Amit  25.0      NaN        NaN
1   2   Riya  28.0      NaN        NaN
2   3  Karan  22.0      NaN        NaN
3   4  Sneha  30.0      NaN        NaN
4   5   Anuj  26.0      NaN        NaN
5   6  Priya  24.0      NaN        NaN
0   3    NaN   NaN  55000.0         IT
1   4    NaN   NaN  60000.0         HR
2   5    NaN   NaN  70000.0    Finance
3   6    NaN   NaN  80000.0  Marketing
4   7    NaN   NaN  75000.0      Sales
5   8    NaN   NaN  90000.0      Admin

Column-wise Concat:
    id   name  age  id  salary department
0   1   Amit   25   3   55000         IT
1   2   Riya   28   4   60000         HR
2   3  Karan   22   5   70000    Finance
3   4  Sneha   30   6   80000  Marketing
4   5   Anuj   26   7   75000      Sales
5   6  Priya   24   8   90000      Admin
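
Two details of the output above are worth a small sketch: row-wise concat repeats the original index labels unless `ignore_index=True` is passed, and column-wise concat aligns rows on the index rather than on a key column such as `id`. The frames `a`, `b`, and `c` are made up for illustration.

```python
import pandas as pd

a = pd.DataFrame({"id": [1, 2], "name": ["Amit", "Riya"]})
b = pd.DataFrame({"id": [3, 4], "name": ["Karan", "Sneha"]})

# Row-wise concat keeps the original labels, so 0 and 1 appear twice
stacked = pd.concat([a, b])
print(list(stacked.index))            # [0, 1, 0, 1]

# ignore_index=True renumbers the rows 0..n-1
renumbered = pd.concat([a, b], ignore_index=True)
print(list(renumbered.index))         # [0, 1, 2, 3]

# Column-wise concat aligns rows on the index labels, not on a key column
c = pd.DataFrame({"salary": [50000, 60000]}, index=[1, 2])
side_by_side = pd.concat([a, c], axis=1)
print(side_by_side)                   # labels 0, 1, 2; unmatched cells are NaN
```

When the rows should be matched on a key column, `merge()` is usually the better fit than `concat(axis=1)`.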
