# Pandas Series

A **Pandas Series** is a **one-dimensional labeled array** in Python’s pandas library. You can think of it like a column in an Excel sheet or a single list of values, but with labels (called an **index**) attached to each element.  

---

## Key Features
- **One-dimensional**: Stores data in a single column (unlike a DataFrame, which has multiple columns).
- **Index labels**: Each value has an associated label (by default integers `0, 1, 2, ...`, but you can assign custom labels).
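- **Data types**: Each Series has a single `dtype` (integers, floats, strings, and so on); mixed values or arbitrary Python objects, even functions, are stored with the generic `object` dtype.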
- **Vectorized operations**: You can perform arithmetic and logical operations directly on a Series without using loops.
- **Integration with NumPy**: Many NumPy functions work seamlessly with Series.
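
The last two features are easiest to see side by side. The snippet below is a minimal sketch with made-up values (the `prices` Series is hypothetical): a NumPy ufunc and ordinary operators are applied to a whole Series at once, and the index labels carry through.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
prices = pd.Series([4.0, 9.0, 16.0], index=["a", "b", "c"])

roots = np.sqrt(prices)   # NumPy ufunc applied elementwise; index labels are kept
doubled = prices * 2      # vectorized arithmetic, no explicit loop
mask = prices > 5         # vectorized comparison gives a boolean Series

print(roots)
print(doubled[mask])      # keep only the elements where mask is True (b and c)
```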

---

## Example


In [1]:
import pandas as pd

# Create a Series from a list
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)


a    10
b    20
c    30
d    40
dtype: int64


Here:

The **values** are `[10, 20, 30, 40]`

The **index labels** are `['a', 'b', 'c', 'd']`

# 📌 Pandas Series Fundamentals

A **Series** is a one-dimensional labeled array capable of holding data of any type (integers, floats, strings, Python objects, etc.).  
It has two main components:

- **Values** → the actual data stored  
- **Index** → labels that identify each element  

---

## 1. Creating a Series

You can create a Series from:
- A **list**
- A **NumPy array**
- A **dictionary**
- A **scalar value**


In [2]:
import pandas as pd
import numpy as np

# From list
s1 = pd.Series([10, 20, 30, 40])

s1

0    10
1    20
2    30
3    40
dtype: int64

In [3]:
# From numpy array
s2 = pd.Series(np.array([1, 2, 3, 4]))
s2

0    1
1    2
2    3
3    4
dtype: int32

In [14]:
# From numpy array
arr = np.array([0.1, 0.2, 0.3])
s_np = pd.Series(arr, index=['x', 'y', 'z'], name='prob')
s_np


x    0.1
y    0.2
z    0.3
Name: prob, dtype: float64

In [5]:
# From dictionary (keys become index)
s3 = pd.Series({'a': 100, 'b': 200, 'c': 300})

s3


a    100
b    200
c    300
dtype: int64


In [13]:
s_dict = pd.Series({'NY': 8.4, 'LA': 3.9, 'CHI': 2.7}, name='population_millions')
s_dict


NY     8.4
LA     3.9
CHI    2.7
Name: population_millions, dtype: float64

In [8]:
# From scalar (broadcasted to index)
s4 = pd.Series(5, index=['x', 'y', 'z'])
s4

x    5
y    5
z    5
dtype: int64

In [16]:
s = s4
s.index, s.dtype

(Index(['x', 'y', 'z'], dtype='object'), dtype('int64'))

## 2. Index and Values

Two core attributes expose these components:

`s.index` → the labels

`s.values` → the data values

In [9]:
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print("Index:", s.index)
print("Values:", s.values)


Index: Index(['a', 'b', 'c'], dtype='object')
Values: [10 20 30]


## 3. Data Selection

You can access Series data using:

**Index label** `(s['a'])`

**Position** `(s.iloc[0])`

**Slicing** `(s[1:] or s['b':])`

In [10]:
print(s['a'])   # by label → 10
print(s.iloc[0])  # by position → 10 (plain s[0] on a labeled index is deprecated in pandas 2.x)
print(s['b':])  # slice by label → b, c


10
10
b    20
c    30
dtype: int64
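
To keep label-based and position-based access unambiguous, the explicit `.loc` and `.iloc` accessors can be used instead of plain brackets. A minimal sketch on the same three-element Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = s.loc["a"]          # label-based lookup → 10
by_position = s.iloc[0]        # position-based lookup → 10
label_slice = s.loc["a":"b"]   # label slices include the end label → a, b
pos_slice = s.iloc[0:1]        # positional slices exclude the end position → a only

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```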


## 4. Vectorized Operations

Operations are applied element-wise without loops (just like NumPy).

In [11]:
s = pd.Series([1, 2, 3, 4])
print(s * 2)      # multiply each element
print(s + 10)     # add 10 to each element


0    2
1    4
2    6
3    8
dtype: int64
0    11
1    12
2    13
3    14
dtype: int64


In [17]:
#Vectorized ops (elementwise, fast, no loops)
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='base')
s_times = s * 10          # multiply each element
s_add = s + 5             # add scalar
s_pow = s ** 2            # power
s_times, s_add, s_pow

(a    10
 b    20
 c    30
 Name: base, dtype: int64,
 a    6
 b    7
 c    8
 Name: base, dtype: int64,
 a    1
 b    4
 c    9
 Name: base, dtype: int64)

## 5. Alignment by Index

When performing operations between two Series, Pandas aligns them by **index labels**.

In [12]:
s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([100, 200, 300], index=['b', 'c', 'd'])

print(s1 + s2)


a      NaN
b    120.0
c    230.0
d      NaN
dtype: float64


In [18]:
#Alignment by index (labels drive arithmetic)
s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3],   index=['b', 'c', 'd'])
s1_plus_s2 = s1 + s2        # aligns on labels; missing pairs → NaN
s1_plus_s2

a     NaN
b    21.0
c    32.0
d     NaN
dtype: float64

**Tip**: Use `.add(other, fill_value=0)` (and similar: `sub`, `mul`, `div`) to control missing alignments.


In [19]:
s1.add(s2, fill_value=0)

a    10.0
b    21.0
c    32.0
d     3.0
dtype: float64
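
The same `fill_value` idea carries over to the other arithmetic methods. A small sketch using the two Series from above; note that the neutral fill differs per operation (0 for subtraction, 1 for multiplication), which is a choice made here for illustration:

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
s2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

# A label missing on one side is replaced by fill_value before the operation
diff = s1.sub(s2, fill_value=0)   # a: 10-0, b: 20-1, c: 30-2, d: 0-3
prod = s1.mul(s2, fill_value=1)   # a: 10*1, b: 20*1, c: 30*2, d: 1*3

print(diff)
print(prod)
```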

#### map vs apply (on Series)

`map`: elementwise transform using a dict, Series, or callable; best for simple 1→1 mappings.

`apply`: elementwise callable; more general/flexible (slightly slower), supports complex logic.

In [6]:
s = pd.Series(['NY', 'LA', 'CHI', 'LA'])

s

0     NY
1     LA
2    CHI
3     LA
dtype: object

In [7]:
# map with dict (fast remap)
city_to_state = {'NY': 'NY', 'LA': 'CA', 'CHI': 'IL'}
city_to_state

{'NY': 'NY', 'LA': 'CA', 'CHI': 'IL'}

In [8]:
s_map = s.map(city_to_state)       # 'LA'→'CA', etc.

s_map

0    NY
1    CA
2    IL
3    CA
dtype: object

In [22]:
# map with function
s_map_func = s.map(lambda x: x.lower())

s_map_func

0     ny
1     la
2    chi
3     la
dtype: object

In [23]:
# apply with a function, useful for more complex logic such as:
#   - if/else conditions
#   - different return values depending on rules
#   - external variables used inside the function

def tag_city(x):
    if x == 'LA':
        return 'West Coast'
    return 'Other'
s_apply = s.apply(tag_city)

s_apply


0         Other
1    West Coast
2         Other
3    West Coast
dtype: object

**Rule of thumb**

- Prefer `map` when remapping values or doing simple elementwise transforms.

- Use `apply` for custom logic that can’t be expressed as a simple mapping.
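
One detail worth knowing when remapping with a dict: values that have no key in the dict become missing. The sketch below uses a hypothetical city code (`'SEA'`) that is absent from the mapping, then fills the gap with a placeholder:

```python
import pandas as pd

s = pd.Series(["NY", "LA", "SEA"])           # 'SEA' is a made-up extra value
city_to_state = {"NY": "NY", "LA": "CA"}     # no entry for 'SEA'

mapped = s.map(city_to_state)                # unmapped values become NaN
with_default = mapped.fillna("unknown")      # replace the gaps with a placeholder

print(mapped)
print(with_default)
```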

### Missing Data

In [24]:
s = pd.Series([1.0, None, 3.5, np.nan, 5.0], name='measure')

s

0    1.0
1    NaN
2    3.5
3    NaN
4    5.0
Name: measure, dtype: float64

In [25]:
# Detect missing
mask = s.isna()            # True where NaN/None
mask

0    False
1     True
2    False
3     True
4    False
Name: measure, dtype: bool

In [26]:
num_missing = s.isna().sum()

num_missing

2

In [27]:
# Drop missing
s_drop = s.dropna()

s_drop

0    1.0
2    3.5
4    5.0
Name: measure, dtype: float64

In [28]:
# Fill missing
s_fill_const = s.fillna(0)                     # constant
s_fill_const

0    1.0
1    0.0
2    3.5
3    0.0
4    5.0
Name: measure, dtype: float64

In [29]:
s_fill_ffill = s.ffill()        # forward fill (fillna(method='ffill') is deprecated in pandas 2.x)
s_fill_ffill

0    1.0
1    1.0
2    3.5
3    3.5
4    5.0
Name: measure, dtype: float64

In [30]:
s_fill_bfill = s.bfill()        # backward fill (fillna(method='bfill') is deprecated in pandas 2.x)
s_fill_bfill

0    1.0
1    3.5
2    3.5
3    5.0
4    5.0
Name: measure, dtype: float64

In [31]:
s_fill_mean  = s.fillna(s.mean())              # statistic-based fill

s_fill_mean


0    1.000000
1    3.166667
2    3.500000
3    3.166667
4    5.000000
Name: measure, dtype: float64

### Quick Tour: .str Accessor (for string Series)

`.str` provides vectorized string operations; works only if the Series has string dtype (or values that can be treated as strings).

In [32]:
names = pd.Series(['  Alice  ', 'Bob', None, 'carol', 'DAVE '], dtype='string')

names

0      Alice  
1          Bob
2         <NA>
3        carol
4        DAVE 
dtype: string

In [33]:
# Clean whitespace and case
clean = names.str.strip().str.title()   # 'Alice', 'Bob', <NA>, 'Carol', 'Dave'

clean


0    Alice
1      Bob
2     <NA>
3    Carol
4     Dave
dtype: string

In [34]:
# Pattern contains / replace
has_a = clean.str.contains('a', case=False, na=False)   # boolean mask
replaced = clean.str.replace(r'^[A-Z]', 'X', regex=True)

has_a, replaced

(0     True
 1    False
 2    False
 3     True
 4     True
 dtype: boolean,
 0    Xlice
 1      Xob
 2     <NA>
 3    Xarol
 4     Xave
 dtype: string)

In [9]:
# Split and extract
emails = pd.Series(['a@x.com', 'b@y.org', None, 'c@z.net'], dtype='string')
domains = emails.str.split('@').str[1]

domains

0    x.com
1    y.org
2     <NA>
3    z.net
dtype: object

In [10]:
#\. → literal dot
#(\w+) → one or more word characters (letters/digits/underscore)
#$ → end of string
tlds = emails.str.extract(r'\.(\w+)$')   # capture group for TLD
tlds

      0
0   com
1   org
2  <NA>
3   net


**Common `.str` ops**

Cleaning: `.strip()`, `.lower()`, `.upper()`, `.title()`

Find/replace: `.contains()`, `.replace(pattern, repl, regex=True)`

Split/extract: `.split()`, `.extract(regex)`

Length/startswith/endswith: `.len()`, `.startswith()`, `.endswith()`
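
The length and prefix/suffix helpers listed above are not shown in the cells, so here is a minimal sketch on a small, made-up `names` Series. Missing values stay missing unless `na=` is given:

```python
import pandas as pd

names = pd.Series(["Alice", "Bob", None, "Carol"], dtype="string")

lengths = names.str.len()                    # character counts; missing stays <NA>
starts_a = names.str.startswith("A")         # boolean mask; <NA> where the value is missing
ends_l = names.str.endswith("l", na=False)   # na=False treats missing values as False

print(lengths)
print(starts_a)
print(ends_l)
```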

# 📌 Pandas DataFrame Fundamentals

A **DataFrame** in pandas is the most commonly used data structure — you can think of it as a **2-dimensional labeled table** (like an Excel sheet or SQL table) where:

- **Rows** have an index (labels, default 0, 1, 2, …)  
- **Columns** have names (labels, which you define or pandas assigns)  
- Each **column is a Series** (1D array), but they can have different data types  

---

## 📌 Key Features of a DataFrame

- **2D labeled data structure** → rows + columns  
- **Heterogeneous data types** → different columns can hold int, float, string, bool, etc.  
- **Size mutable** → can add/delete rows and columns  
- **Rich functionality** → indexing, filtering, grouping, aggregation, merging, reshaping, etc.  
- **Integration** → works seamlessly with NumPy, Excel, SQL, CSV, JSON  
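
A short sketch of the first three features, using a tiny hypothetical table: each column keeps its own dtype, a single column comes back as a Series, and columns can be added or dropped.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [25, 30],
    "active": [True, False],
})

print(df.dtypes)                       # one dtype per column (heterogeneous)
col = df["age"]                        # a single column is a Series
print(type(col).__name__)              # Series

df["age_next_year"] = df["age"] + 1    # size-mutable: add a column
df = df.drop(columns="active")         # ...or drop one
print(df.columns.tolist())
```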


In [36]:
import pandas as pd

# Create a DataFrame from dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 75000, 90000]
}

df = pd.DataFrame(data)
print(df)


      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   75000
3    David   40   90000


In [38]:
# List of dicts (good for JSON-like records)
records = [
    {"name": "Alice", "age": 25},
    {"name": "Bob",   "age": 30, "dept": "SE"},
]
df2 = pd.DataFrame(records)
df2

    name  age dept
0  Alice   25  NaN
1    Bob   30   SE


In [39]:
# From NumPy array (provide columns & index)
arr = np.array([[1, 2], [3, 4], [5, 6]])
df3 = pd.DataFrame(arr, columns=["x", "y"], index=["a", "b", "c"])
df3

   x  y
a  1  2
b  3  4
c  5  6


In [40]:
# From Series dict (aligns on indexes)
s1 = pd.Series([10, 20, 30], index=["a", "b", "c"], name="u")
s2 = pd.Series([0.1, 0.2, 0.3], index=["a", "b", "c"], name="v")
df4 = pd.DataFrame({"u": s1, "v": s2})
df4

    u    v
a  10  0.1
b  20  0.2
c  30  0.3


In [42]:
import pandas as pd

# Create sample employee data
data = {
    "id": [101, 102, 103, 104, 105],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Ethan"],
    "age": [25, 30, 28, 35, 40],
    "salary": [50000, 60000, 55000, 75000, 80000],
    "start_date": ["2021-06-15", "2020-08-01", "2019-03-20", "2018-11-10", "2017-01-25"]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Save to CSV
file_path = "employee.csv"
df.to_csv(file_path, index=False)

file_path


'employee.csv'

In [44]:
#read CSV
# Minimal
df = pd.read_csv("employee.csv")
df


    id     name  age  salary  start_date
0  101    Alice   25   50000  2021-06-15
1  102      Bob   30   60000  2020-08-01
2  103  Charlie   28   55000  2019-03-20
3  104    Diana   35   75000  2018-11-10
4  105    Ethan   40   80000  2017-01-25


In [47]:
# Common options
df = pd.read_csv(
    "employee.csv",
    usecols=["id", "name", "age", "salary", "start_date"],
    dtype={"id": "int64", "age": "Int64"},     # Int64 = nullable integer
    parse_dates=["start_date"],                # convert to datetime64[ns]
    na_values=["", "NA", "null"],              # custom missing markers
    thousands=",",                             # "10,000" -> 10000
    engine="python"                            # robust parsing when needed
)
df

    id     name  age  salary start_date
0  101    Alice   25   50000 2021-06-15
1  102      Bob   30   60000 2020-08-01
2  103  Charlie   28   55000 2019-03-20
3  104    Diana   35   75000 2018-11-10
4  105    Ethan   40   80000 2017-01-25


In [53]:
#Inspecting data quickly
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Cara", "Dan"],
    "age":  [25, 30, 28, 40],
    "salary": [80_000, 95_000, 87_500, 120_000]
})

df.head(3)                 # first 3 rows (use .tail() for last)


    name  age  salary
0  Alice   25   80000
1    Bob   30   95000
2   Cara   28   87500


In [54]:
df.info()                  # column dtypes, non-nulls, memory footprint


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    4 non-null      object
 1   age     4 non-null      int64 
 2   salary  4 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 228.0+ bytes


In [55]:
df.describe()              # numeric summary (count/mean/std/min/quantiles/max)

         age         salary
count   4.00       4.000000
mean   30.75   95625.000000
std     6.50   17365.554987
min    25.00   80000.000000
25%    27.25   85625.000000
50%    29.00   91250.000000
75%    32.50  101250.000000
max    40.00  120000.000000


In [56]:
#Renaming columns and index
# Rename selected columns
renamed = df.rename(columns={"salary": "annual_salary"})

renamed

    name  age  annual_salary
0  Alice   25          80000
1    Bob   30          95000
2   Cara   28          87500
3    Dan   40         120000


In [57]:
# Bulk overwrite all column names (ensure same length)
df2 = df.set_axis(["Name", "Age", "Salary"], axis="columns")

df2

    Name  Age  Salary
0  Alice   25   80000
1    Bob   30   95000
2   Cara   28   87500
3    Dan   40  120000


In [58]:
# Rename index (rarely needed)
df3 = df.rename(index={0: "row0", 1: "row1"})
df3

       name  age  salary
row0  Alice   25   80000
row1    Bob   30   95000
2      Cara   28   87500
3       Dan   40  120000


In [59]:
#Creating computed columns (vectorized)
# Vectorized arithmetic
df["monthly_salary"] = df["salary"] / 12
df

    name  age  salary  monthly_salary
0  Alice   25   80000     6666.666667
1    Bob   30   95000     7916.666667
2   Cara   28   87500     7291.666667
3    Dan   40  120000    10000.000000


In [61]:
# Conditional with np.where
import numpy as np
df["senior"] = np.where(df["age"] >= 30, "Y", "N")
df

    name  age  salary  monthly_salary senior
0  Alice   25   80000     6666.666667      N
1    Bob   30   95000     7916.666667      Y
2   Cara   28   87500     7291.666667      N
3    Dan   40  120000    10000.000000      Y


In [62]:
# String ops on a column (ensure string dtype)
df["initial"] = df["name"].astype("string").str[0]
df

    name  age  salary  monthly_salary senior initial
0  Alice   25   80000     6666.666667      N       A
1    Bob   30   95000     7916.666667      Y       B
2   Cara   28   87500     7291.666667      N       C
3    Dan   40  120000    10000.000000      Y       D


## 📌 Selection and Indexing in Pandas

In pandas, selection and indexing refer to the ways you can access, filter, and manipulate subsets of data from Series and DataFrames.

**Indexing Basics**

`Index` → labels that identify rows in a DataFrame or elements in a Series.

Can be numbers (default 0,1,2…) or custom labels (e.g., employee IDs)

In [63]:
#we will use this dataset

df = pd.DataFrame({
    "id":    [101, 102, 103, 104, 105, 106],
    "name":  ["Alice", "Bob", "Cara", "Dan", "Eve", "Frank"],
    "dept":  ["DS", "SE", "DS", "SE", "HR", "DS"],
    "age":   [25, 30, 28, 40, 35, 28],
    "salary":[80000, 95000, 87500, 120000, 70000, 87500],
    "city":  ["NY", "SF", "NY", "SF", "NY", "NY"]
})
df

    id   name dept  age  salary city
0  101  Alice   DS   25   80000   NY
1  102    Bob   SE   30   95000   SF
2  103   Cara   DS   28   87500   NY
3  104    Dan   SE   40  120000   SF
4  105    Eve   HR   35   70000   NY
5  106  Frank   DS   28   87500   NY


#### 1) Column selection (view vs copy) & safe writes with `.loc`


In [64]:
#Selecting columns
col = df["salary"]       # 1 column → Series (may be a view, not guaranteed)
col


0     80000
1     95000
2     87500
3    120000
4     70000
5     87500
Name: salary, dtype: int64

In [65]:
sub = df[["name","dept"]]# list of columns → DataFrame (new object)
sub

    name dept
0  Alice   DS
1    Bob   SE
2   Cara   DS
3    Dan   SE
4    Eve   HR
5  Frank   DS


**View vs copy (why it matters):**

`df["col"]` can return a view or a copy, depending on the pandas version and whether Copy-on-Write is enabled; modifying the result may unintentionally modify the parent or raise a warning.

If you need an independent object: `.copy()`

In [66]:
s = df["salary"].copy()  # guaranteed independent copy
s


0     80000
1     95000
2     87500
3    120000
4     70000
5     87500
Name: salary, dtype: int64

In [68]:
s.iloc[0] = 999999       # does not affect df
s

0    999999
1     95000
2     87500
3    120000
4     70000
5     87500
Name: salary, dtype: int64

In [69]:
df

    id   name dept  age  salary city
0  101  Alice   DS   25   80000   NY
1  102    Bob   SE   30   95000   SF
2  103   Cara   DS   28   87500   NY
3  104    Dan   SE   40  120000   SF
4  105    Eve   HR   35   70000   NY
5  106  Frank   DS   28   87500   NY


**Safe writes with `.loc`**

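Assign through a single `.loc[row_filter, column] = value` call rather than chained indexing (e.g. `df[mask]["col"] = value`), so the write reliably reaches the original DataFrame.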

In [70]:
# Give a 5% raise to DS department employees
df.loc[df["dept"] == "DS", "salary"] = (df.loc[df["dept"] == "DS", "salary"] * 1.05).round(0)
df


    id   name dept  age  salary city
0  101  Alice   DS   25   84000   NY
1  102    Bob   SE   30   95000   SF
2  103   Cara   DS   28   91875   NY
3  104    Dan   SE   40  120000   SF
4  105    Eve   HR   35   70000   NY
5  106  Frank   DS   28   91875   NY


#### 2) Row/column selection: `.loc`, `.iloc`, `.at`, `.iat`

`.loc` → label-based (rows & columns by labels; slicing is inclusive on labels)

`.iloc` → integer-position-based (0..n-1; slicing is end-exclusive)

`.at` → fast scalar access by label (single cell)

`.iat` → fast scalar access by position (single cell)
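
`.at` and `.iat` are described above but not exercised in the cells that follow, so here is a minimal sketch. The small frame and its `a`/`b`/`c` row labels are made up for the example:

```python
import pandas as pd

# Hypothetical frame with string row labels
df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Cara"], "age": [25, 30, 28]},
    index=["a", "b", "c"],
)

age_b = df.at["b", "age"]      # single cell by row label and column label → 30
first_name = df.iat[0, 0]      # single cell by row position and column position → 'Alice'

df.at["c", "age"] = 29         # .at also supports assignment to one cell
print(age_b, first_name, df.at["c", "age"])
```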

In [72]:
# .loc: rows by condition, columns by label
df.loc[df["age"] >= 30, ["name", "dept", "age"]]


  name dept  age
1  Bob   SE   30
3  Dan   SE   40
4  Eve   HR   35


In [73]:
# .loc: label slicing (inclusive) — after setting a labeled index (see later)
# df.loc["a":"c", "dept":"salary"]

# .iloc: rows/cols by integer position
df.iloc[0:3, 1:4]   # first 3 rows; columns 1..3


    name dept  age
0  Alice   DS   25
1    Bob   SE   30
2   Cara   DS   28
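
The inclusive/exclusive difference between `.loc` and `.iloc` slices is easier to see with a labeled index. A minimal sketch, assuming a hypothetical frame whose rows are labeled `a` to `d`:

```python
import pandas as pd

# Hypothetical frame with string row labels
df = pd.DataFrame(
    {
        "dept": ["DS", "SE", "DS", "HR"],
        "age": [25, 30, 28, 35],
        "salary": [80000, 95000, 87500, 70000],
    },
    index=["a", "b", "c", "d"],
)

by_label = df.loc["a":"c", "dept":"age"]   # both ends included → rows a, b, c; columns dept, age
by_pos = df.iloc[0:2, 0:2]                 # end excluded → rows 0, 1; columns 0, 1

print(by_label.shape)   # (3, 2)
print(by_pos.shape)     # (2, 2)
```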


#### 3) Boolean masks: `isin`, `between`, multiple conditions

In [77]:
# 3.1 .isin — membership tests
tech = df[df["dept"].isin(["DS", "SE"])]

tech

    id   name dept  age  salary city
0  101  Alice   DS   25   84000   NY
1  102    Bob   SE   30   95000   SF
2  103   Cara   DS   28   91875   NY
3  104    Dan   SE   40  120000   SF
5  106  Frank   DS   28   91875   NY


In [78]:
# 3.2 .between — numeric ranges (inclusive)
young = df[df["age"].between(25, 30)]

young

    id   name dept  age  salary city
0  101  Alice   DS   25   84000   NY
1  102    Bob   SE   30   95000   SF
2  103   Cara   DS   28   91875   NY
5  106  Frank   DS   28   91875   NY


In [79]:
# 3.3 Multiple conditions — use &, |, ~ with parentheses
ny_ds = df[(df["city"] == "NY") & (df["dept"] == "DS")]
ny_ds

    id   name dept  age  salary city
0  101  Alice   DS   25   84000   NY
2  103   Cara   DS   28   91875   NY
5  106  Frank   DS   28   91875   NY


In [80]:
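# (dept != "HR") | (city != "SF") is True unless a row is both HR and in SF,
# i.e. it equals ~((dept == "HR") & (city == "SF")); no such row exists, so every row is kept
not_hr_and_sf = df[(df["dept"] != "HR") | (df["city"] != "SF")]
not_hr_and_sf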

    id   name dept  age  salary city
0  101  Alice   DS   25   84000   NY
1  102    Bob   SE   30   95000   SF
2  103   Cara   DS   28   91875   NY
3  104    Dan   SE   40  120000   SF
4  105    Eve   HR   35   70000   NY
5  106  Frank   DS   28   91875   NY
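
The `~` operator mentioned in 3.3 negates a whole mask, and where the parentheses go changes the meaning. A small sketch on a trimmed-down copy of the dataset (only `dept` and `city`), contrasting "neither HR nor SF" with "not both HR and SF":

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["DS", "SE", "DS", "SE", "HR", "DS"],
    "city": ["NY", "SF", "NY", "SF", "NY", "NY"],
})

is_hr = df["dept"] == "HR"
is_sf = df["city"] == "SF"

neither = df[~(is_hr | is_sf)]    # not HR and not SF → rows 0, 2, 5
not_both = df[~(is_hr & is_sf)]   # same as (dept != "HR") | (city != "SF") → all rows here

print(neither.index.tolist())
print(len(not_both))
```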


#### 4) Sorting & ranking; handling duplicates

In [81]:
# sort by values (multi-key, stable)
sorted_df = df.sort_values(by=["dept","salary"], ascending=[True, False])

sorted_df

    id   name dept  age  salary city
2  103   Cara   DS   28   91875   NY
5  106  Frank   DS   28   91875   NY
0  101  Alice   DS   25   84000   NY
4  105    Eve   HR   35   70000   NY
3  104    Dan   SE   40  120000   SF
1  102    Bob   SE   30   95000   SF


In [82]:
# sort by index
sorted_idx = df.sort_index()
sorted_idx

    id   name dept  age  salary city
0  101  Alice   DS   25   84000   NY
1  102    Bob   SE   30   95000   SF
2  103   Cara   DS   28   91875   NY
3  104    Dan   SE   40  120000   SF
4  105    Eve   HR   35   70000   NY
5  106  Frank   DS   28   91875   NY


In [83]:
#Ranking
# Rank salaries within each dept (highest = rank 1)
df["dept_rank"] = (
    df.groupby("dept")["salary"]
      .rank(ascending=False, method="dense")
      .astype("Int64")
)
df[["name","dept","salary","dept_rank"]].sort_values(["dept","dept_rank"])


    name dept  salary  dept_rank
2   Cara   DS   91875          1
5  Frank   DS   91875          1
0  Alice   DS   84000          2
4    Eve   HR   70000          1
3    Dan   SE  120000          1
1    Bob   SE   95000          2


In [84]:
#Duplicates
# Create a duplicate to demo
dups = pd.concat([df, df.iloc[[2]]], ignore_index=True)

dups

    id   name dept  age  salary city  dept_rank
0  101  Alice   DS   25   84000   NY          2
1  102    Bob   SE   30   95000   SF          2
2  103   Cara   DS   28   91875   NY          1
3  104    Dan   SE   40  120000   SF          1
4  105    Eve   HR   35   70000   NY          1
5  106  Frank   DS   28   91875   NY          1
6  103   Cara   DS   28   91875   NY          1


In [86]:
# Mark duplicates (True for all but first occurrence by default)
dups["is_dup_name"] = dups.duplicated(subset=["name"])

dups["is_dup_name"]

0    False
1    False
2    False
3    False
4    False
5    False
6     True
Name: is_dup_name, dtype: bool

In [87]:
# Drop duplicates; keep='first' | 'last' | False (drop all duplicates)
dedup = dups.drop_duplicates(subset=["name"], keep="first")
dedup


    id   name dept  age  salary city  dept_rank  is_dup_name
0  101  Alice   DS   25   84000   NY          2        False
1  102    Bob   SE   30   95000   SF          2        False
2  103   Cara   DS   28   91875   NY          1        False
3  104    Dan   SE   40  120000   SF          1        False
4  105    Eve   HR   35   70000   NY          1        False
5  106  Frank   DS   28   91875   NY          1        False


## 📌 Aggregation in Pandas

Aggregation in pandas means computing summary statistics (like sum, mean, min, max, count, etc.) over a Series, DataFrame, or groups of data.
It’s a way to collapse data into a single value or a smaller set of values that summarize the dataset.
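
Before grouping enters the picture, the same reductions work directly on a Series or a whole DataFrame. A minimal sketch with a small made-up table (the `emp_small` name is hypothetical):

```python
import pandas as pd

# Hypothetical data, for illustration only
emp_small = pd.DataFrame({
    "age": [25, 30, 28],
    "salary": [80000, 95000, 87500],
})

total = emp_small["salary"].sum()                          # Series → one scalar (262500)
col_means = emp_small.mean()                               # DataFrame → one value per column
summary = emp_small["salary"].agg(["min", "max", "mean"])  # several reductions at once

print(total)
print(col_means)
print(summary)
```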

In [88]:
emp = pd.DataFrame({
    "id":    [101,102,103,104,105,106,107],
    "name":  ["Alice","Bob","Cara","Dan","Eve","Frank","Gina"],
    "dept":  ["DS","SE","DS","SE","HR","DS","DS"],
    "city":  ["NY","SF","NY","SF","NY","NY","SF"],
    "age":   [25,30,28,40,35,28,33],
    "salary":[80000, 95000, 87500, 120000, 70000, 88000, 91000],
    "fte":   [1.0,1.0,0.8,1.0,0.5,1.0,0.6],             # weight example
    "perf":  [3, 4, 4, 5, 3, 4, 5]                      # weight example
})
emp


    id   name dept city  age  salary  fte  perf
0  101  Alice   DS   NY   25   80000  1.0     3
1  102    Bob   SE   SF   30   95000  1.0     4
2  103   Cara   DS   NY   28   87500  0.8     4
3  104    Dan   SE   SF   40  120000  1.0     5
4  105    Eve   HR   NY   35   70000  0.5     3
5  106  Frank   DS   NY   28   88000  1.0     4
6  107   Gina   DS   SF   33   91000  0.6     5


**1) Split–Apply–Combine & agg**


In [91]:
#A. Basic reductions
# mean salary by department
emp.groupby("dept")["salary"].mean()



dept
DS     86625.0
HR     70000.0
SE    107500.0
Name: salary, dtype: float64

In [92]:
# multiple keys
emp.groupby(["dept","city"])["salary"].mean()


dept  city
DS    NY       85166.666667
      SF       91000.000000
HR    NY       70000.000000
SE    SF      107500.000000
Name: salary, dtype: float64

In [93]:
#B. agg with list/dict syntax
# list of reductions on one column
emp.groupby("dept")["salary"].agg(["count","mean","min","max","std"])


      count      mean    min     max           std
dept                                              
DS        4   86625.0  80000   91000   4679.298380
HR        1   70000.0  70000   70000           NaN
SE        2  107500.0  95000  120000  17677.669530


In [94]:
# dict: different reductions per column
emp.groupby("dept").agg({
    "salary": ["mean","median","max"],
    "age":    ["median","nunique"]
})

        salary                       age
          mean    median     max  median  nunique
dept                                             
DS     86625.0   87750.0   91000    28.0        3
HR     70000.0   70000.0   70000    35.0        1
SE    107500.0  107500.0  120000    35.0        2


**C. Named aggregation (clean column names)**

In [95]:
emp.groupby("dept").agg(
    n_emps = ("id", "count"),
    avg_age = ("age", "mean"),
    avg_salary = ("salary","mean"),
    p90_salary = ("salary", lambda s: s.quantile(0.90))
).reset_index()


  dept  n_emps  avg_age  avg_salary  p90_salary
0   DS       4     28.5     86625.0     90100.0
1   HR       1     35.0     70000.0     70000.0
2   SE       2     35.0    107500.0    117500.0


**2) transform (shape-preserving) vs apply (flexible)**

`transform` returns a Series aligned to the original rows (same length). Use for group metrics that you want broadcast back (feature engineering).

`apply` can return scalar, Series, or DataFrame per group (flexible, often slower).
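
The shape difference is the key point, so here is a minimal sketch on a four-row hypothetical table: `agg` collapses to one row per group, while `transform` returns one value per original row and can be assigned straight back as a column.

```python
import pandas as pd

# Hypothetical data, for illustration only
emp_small = pd.DataFrame({
    "dept": ["DS", "SE", "DS", "SE"],
    "salary": [80000, 95000, 90000, 105000],
})

per_group = emp_small.groupby("dept")["salary"].agg("mean")      # one value per dept (length 2)
per_row = emp_small.groupby("dept")["salary"].transform("mean")  # one value per row (length 4)

# Because per_row shares the original index, it can be used in row-wise arithmetic
emp_small["diff_from_dept_mean"] = emp_small["salary"] - per_row

print(len(per_group), len(per_row))
print(emp_small)
```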

In [97]:
# Dept mean salary broadcast back to every row (first step of a within-dept z-score)
emp["dept_salary_mean"] = emp.groupby("dept")["salary"].transform("mean")
emp["dept_salary_mean"]

0     86625.0
1    107500.0
2     86625.0
3    107500.0
4     70000.0
5     86625.0
6     86625.0
Name: dept_salary_mean, dtype: float64

In [98]:
emp["dept_salary_std"]  = emp.groupby("dept")["salary"].transform("std")
emp["salary_z_in_dept"] = (emp["salary"] - emp["dept_salary_mean"]) / emp["dept_salary_std"]
emp[["name","dept","salary","salary_z_in_dept"]]

# apply: custom summary per group → Series with a (dept, statistic) MultiIndex; add .unstack() for a DataFrame
def robust_summary(s):
    q1, q3 = s.quantile([0.25, 0.75])
    return pd.Series({
        "mean": s.mean(),
        "iqr":  q3 - q1,
        "mad":  (s - s.median()).abs().median()
    })

emp.groupby("dept")["salary"].apply(robust_summary)


dept      
DS    mean     86625.0
      iqr       3125.0
      mad       1750.0
HR    mean     70000.0
      iqr          0.0
      mad          0.0
SE    mean    107500.0
      iqr      12500.0
      mad      12500.0
Name: salary, dtype: float64

## Reshaping in pandas

What it is: Changing the layout of your table without changing the underlying data.

In [101]:
people = pd.DataFrame({
    "id": [101,102,103,104],
    "name": ["Alice","Bob","Cara","Dan"],
    "dept": ["DS","SE","DS","SE"],
    "city": ["NY","SF","NY","SF"]
})

people

    id   name dept city
0  101  Alice   DS   NY
1  102    Bob   SE   SF
2  103   Cara   DS   NY
3  104    Dan   SE   SF


In [102]:
scores_wide = pd.DataFrame({
    "id": [101,102,103,104],
    "Math": [88, 92, 79, 85],
    "Eng":  [90, 86, 81, 89],
    "CS":   [95, 84, 93, 87]
})
scores_wide

    id  Math  Eng  CS
0  101    88   90  95
1  102    92   86  84
2  103    79   81  93
3  104    85   89  87


**1) Long ↔ Wide reshaping**

A) melt (wide → long, “tidy”)

In [100]:
scores_long = scores_wide.melt(
    id_vars="id",                  # columns to keep
    var_name="subject",            # new column for former column names
    value_name="score"             # new column for values
)
# Result columns: id, subject, score
scores_long

     id subject  score
0   101    Math     88
1   102    Math     92
2   103    Math     79
3   104    Math     85
4   101     Eng     90
5   102     Eng     86
6   103     Eng     81
7   104     Eng     89
8   101      CS     95
9   102      CS     84
10  103      CS     93
11  104      CS     87


B) pivot (long → wide)

Requires unique pairs of index × columns. If duplicates exist, use pivot_table with an aggregator.
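
A minimal sketch of the duplicate case, using a made-up table in which student 101 has two Math scores. Averaging with `aggfunc="mean"` is an assumption for the example; any reducer could be used.

```python
import pandas as pd

# Hypothetical long table with a repeated (id, subject) pair
scores = pd.DataFrame({
    "id": [101, 101, 102],
    "subject": ["Math", "Math", "Math"],
    "score": [80, 90, 70],
})

# scores.pivot(index="id", columns="subject", values="score") would raise a
# ValueError here, because (101, "Math") appears twice.
wide = scores.pivot_table(
    index="id",
    columns="subject",
    values="score",
    aggfunc="mean",        # duplicates are combined by averaging
)
print(wide)
```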

In [103]:
wide_back = scores_long.pivot(
    index="id",
    columns="subject",
    values="score"
).reset_index()
# columns: id, CS, Eng, Math
wide_back

subject   id  CS  Eng  Math
0        101  95   90    88
1        102  84   86    92
2        103  93   81    79
3        104  87   89    85


C) stack / unstack (with MultiIndex)

In [104]:
# Build a MultiIndex example: mean score per dept × subject
long_with_dept = scores_long.merge(people[["id","dept"]], on="id", how="left")
long_with_dept

     id subject  score dept
0   101    Math     88   DS
1   102    Math     92   SE
2   103    Math     79   DS
3   104    Math     85   SE
4   101     Eng     90   DS
5   102     Eng     86   SE
6   103     Eng     81   DS
7   104     Eng     89   SE
8   101      CS     95   DS
9   102      CS     84   SE
10  103      CS     93   DS
11  104      CS     87   SE


In [105]:
grp = long_with_dept.groupby(["dept","subject"], as_index=True)["score"].mean()
# 'grp' has a MultiIndex (dept, subject)

grp

dept  subject
DS    CS         94.0
      Eng        85.5
      Math       83.5
SE    CS         85.5
      Eng        87.5
      Math       88.5
Name: score, dtype: float64

In [107]:
wide_subjects = grp.unstack("subject")      # rows=dept, columns=subject
wide_subjects

subject    CS   Eng  Math
dept                     
DS       94.0  85.5  83.5
SE       85.5  87.5  88.5


In [108]:
long_again    = wide_subjects.stack("subject")  # back to long
long_again

dept  subject
DS    CS         94.0
      Eng        85.5
      Math       83.5
SE    CS         85.5
      Eng        87.5
      Math       88.5
dtype: float64

#### Rule of thumb

Use melt/pivot when you have a clean key for rows and a single header dimension.

Use stack/unstack when working with MultiIndex objects (e.g., after groupby or pivot operations).

**2) Combining tables with concat**

Row-wise (append new rows)

In [109]:
new_hires = pd.DataFrame({
    "id":[105,106], "name":["Eve","Frank"], "dept":["HR","DS"], "city":["NY","NY"]
})

new_hires

    id   name dept city
0  105    Eve   HR   NY
1  106  Frank   DS   NY


In [110]:
all_people = pd.concat([people, new_hires], axis=0, ignore_index=True)
# same columns, more rows
all_people

    id   name dept city
0  101  Alice   DS   NY
1  102    Bob   SE   SF
2  103   Cara   DS   NY
3  104    Dan   SE   SF
4  105    Eve   HR   NY
5  106  Frank   DS   NY

Column-wise (add new columns, align on index)

In [111]:
# Feature tables aligned by id
skills = pd.DataFrame({"id":[101,102,103,104], "python":[3,2,3,1]}).set_index("id")
tenure = pd.DataFrame({"id":[101,103,104], "years":[4,2,6]}).set_index("id")

cols_join = pd.concat([skills, tenure], axis=1, join="outer")  # align on index; NaN where missing
# Use join="inner" to keep only the intersection of ids
cols_join

     python  years
id                
101       3    4.0
102       2    NaN
103       3    2.0
104       1    6.0


In [112]:
#Add a source key during concat
survey_a = pd.DataFrame({"id":[101,103], "sat":[4,5]})
survey_b = pd.DataFrame({"id":[102,104], "sat":[3,4]})

stacked = pd.concat({"A": survey_a, "B": survey_b}, axis=0)
# Now stacked has a MultiIndex (source, row)
stacked

      id  sat
A 0  101    4
  1  103    5
B 0  102    3
  1  104    4
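
Once the source key sits in the index, it can be used for selection or turned into an ordinary column. A short sketch; the level names `source` and `row` are labels chosen for this example:

```python
import pandas as pd

survey_a = pd.DataFrame({"id": [101, 103], "sat": [4, 5]})
survey_b = pd.DataFrame({"id": [102, 104], "sat": [3, 4]})

# names= labels the two index levels (dict key, original row number)
stacked = pd.concat({"A": survey_a, "B": survey_b}, names=["source", "row"])

only_a = stacked.loc["A"]                    # select every row that came from survey A
flat = stacked.reset_index(level="source")   # move the source key into a regular column

print(only_a)
print(flat)
```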


**3) merge joins (SQL-style)**

In [113]:
#Basic joins & join types
dept_mgr = pd.DataFrame({
    "dept":["DS","SE","HR"], "manager":["Mia","Noah","Zoe"]
})
dept_mgr

  dept manager
0   DS     Mia
1   SE    Noah
2   HR     Zoe


In [114]:
# LEFT: keep all people; add manager where dept matches
left_join  = people.merge(dept_mgr, on="dept", how="left")
left_join

    id   name dept city manager
0  101  Alice   DS   NY     Mia
1  102    Bob   SE   SF    Noah
2  103   Cara   DS   NY     Mia
3  104    Dan   SE   SF    Noah


In [115]:
# INNER: only rows with matching dept
inner_join = people.merge(dept_mgr, on="dept", how="inner")
inner_join

    id   name dept city manager
0  101  Alice   DS   NY     Mia
1  103   Cara   DS   NY     Mia
2  102    Bob   SE   SF    Noah
3  104    Dan   SE   SF    Noah


In [116]:
# RIGHT & OUTER as needed
right_join = people.merge(dept_mgr, on="dept", how="right")
outer_join = people.merge(dept_mgr, on="dept", how="outer")
right_join, outer_join

(      id   name dept city manager
 0  101.0  Alice   DS   NY     Mia
 1  103.0   Cara   DS   NY     Mia
 2  102.0    Bob   SE   SF    Noah
 3  104.0    Dan   SE   SF    Noah
 4    NaN    NaN   HR  NaN     Zoe,
       id   name dept city manager
 0  101.0  Alice   DS   NY     Mia
 1  103.0   Cara   DS   NY     Mia
 2  102.0    Bob   SE   SF    Noah
 3  104.0    Dan   SE   SF    Noah
 4    NaN    NaN   HR  NaN     Zoe)
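
When auditing a join, it can help to see which side each row came from. The sketch below uses `merge`'s `indicator` and `validate` parameters on trimmed-down, hypothetical versions of the two tables; the `audit` and `unmatched` names are made up for the example.

```python
import pandas as pd

# Hypothetical, trimmed-down tables
people_small = pd.DataFrame({"id": [101, 102], "dept": ["DS", "SE"]})
dept_mgr = pd.DataFrame({
    "dept": ["DS", "SE", "HR"],
    "manager": ["Mia", "Noah", "Zoe"],
})

audit = people_small.merge(
    dept_mgr,
    on="dept",
    how="outer",
    indicator=True,            # adds a _merge column: left_only / right_only / both
    validate="many_to_one",    # raises if dept is not unique in dept_mgr
)

unmatched = audit[audit["_merge"] != "both"]   # rows present on only one side
print(unmatched)
```

Here only the `HR` row is unmatched, since no one in `people_small` belongs to that department.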