# Pandas Basics


### pandas data structures: Series and DataFrame
To be honest, the pandas API feels less natural than NumPy's. But pandas is built on top of NumPy and its data structures are similar, so it is crucial to learn how they relate to NumPy's ndarray.

There are two fundamental data objects in pandas: Series and DataFrame.
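
As a minimal sketch of how the two objects relate to a NumPy `ndarray` (the variable names below are illustrative and not part of the later demos): a Series behaves like a 1-D array with labels attached, and a DataFrame behaves like a set of Series sharing one index.

```python
import numpy as np
import pandas as pd

# A Series is a 1-D ndarray with a label attached to each element
arr = np.array([10, 20, 30])
s = pd.Series(arr, index=["a", "b", "c"])

values = s.to_numpy()      # back to a plain ndarray
by_label = s["b"]          # label-based access
by_position = s.iloc[1]    # position-based access, like arr[1]

# A DataFrame is 2-D: each column is a Series, all sharing the same index
df_demo = pd.DataFrame({"x": s, "y": s * 2})
column = df_demo["y"]      # a single column is a Series
shape = df_demo.shape      # (rows, columns), like ndarray.shape
```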






In real projects, a typical data-wrangling workflow looks like this:
```
Load data
 → Inspect
 → Clean (missing, types, duplicates)
 → Filter
 → Feature engineering (new columns)
 → Group / Aggregate
 → Export / Plot / Model
```

To learn this well, we first look at the properties of the pandas data structures and how they relate to their NumPy counterparts.

After that, we dive into each step of the workflow through a few demos, each focusing on one important stage.

In [112]:
import pandas as pd
import numpy as np


### Constructor
Here we only demonstrate the most commonly used cases.

| **Input Type**     | **Typical Use Case** | **Best Feature**          |
| ------------------ | -------------------- | ------------------------- |
| **Dict of Lists**  | Manual data entry    | Clean and readable        |
| **List of Dicts**  | API / JSON data      | Handles row-based records |
| **NumPy Array**    | ML / Science         | High performance          |
| **Dict of Series** | Complex data merging | Automatic index alignment |



In [113]:
# Dict of lists: manual data entry
data = {
    "name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "age": [20, 21, 19, 22, 20],
    "major": ["CS", "Math", "CS", "Physics", "Math"],
    "score": [88, 92, 79, 85, 90]
}
# Keys become column names, values become column data; rows get a default integer index (0, 1, 2, ...)

df = pd.DataFrame(data)
print(df)


      name  age    major  score
0    Alice   20       CS     88
1      Bob   21     Math     92
2  Charlie   19       CS     79
3    David   22  Physics     85
4      Eva   20     Math     90


In [114]:
import json

# Example JSON data (List of Dictionaries)
json_data = '''
[
    {"id": 1, "score": 85, "active": true},
    {"id": 2, "score": 92, "active": false},
    {"id": 3, "score": 78, "active": true}
]
'''

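# Convert the JSON string (usually a list of dicts) to a DataFrame
# Option A: Directly from a string/file
# Wrap literal JSON in StringIO; passing a raw string to read_json is deprecated in recent pandas
from io import StringIO
df = pd.read_json(StringIO(json_data))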

# Option B: If you already have a Python list of dicts (very common)
data_list = json.loads(json_data)
df = pd.DataFrame(data_list)

print(df.head())

   id  score  active
0   1     85    True
1   2     92   False
2   3     78    True




In [115]:
import numpy as np
import pandas as pd

# --- NumPy -> Pandas ---
# Create a 4x3 matrix of random numbers
np_data = np.random.rand(4, 3)

# Convert to DataFrame and add names
df = pd.DataFrame(np_data, columns=['Intensity', 'Velocity', 'Mass'])
print(df)
# Perform a pandas-specific manipulation
df['Total'] = df.sum(axis=1)

# --- Pandas -> NumPy ---
# Convert back to an array for a machine learning model
array_back = df.to_numpy()


   Intensity  Velocity      Mass
0   0.907062  0.550655  0.811282
1   0.896156  0.273101  0.600234
2   0.476619  0.942895  0.657158
3   0.939339  0.546861  0.491751


In [116]:
data = {
    'A': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'B': pd.Series([4, 5], index=['a', 'b']) # 'c' will become NaN
}

# Included only to show that this is safe: pandas aligns the Series on their index labels
df = pd.DataFrame(data)
df

   A    B
a  1  4.0
b  2  5.0
c  3  NaN


In [117]:
df.head()

   A    B
a  1  4.0
b  2  5.0
c  3  NaN


In [118]:
df.columns

Index(['A', 'B'], dtype='object')

In [119]:
df.dtypes

A      int64
B    float64
dtype: object

In [120]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, a to c
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       3 non-null      int64  
 1   B       2 non-null      float64
dtypes: float64(1), int64(1)
memory usage: 180.0+ bytes


In [121]:
data = {
    "name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "age": [20, 21, 19, 22, 20],
    "major": ["CS", "Math", "CS", "Physics", "Math"],
    "score": [88, 92, 79, 85, 90]
}

df = pd.DataFrame(data)
print(df)


df["name"]              # Series
df[["name", "score"]]  # DataFrame


      name  age    major  score
0    Alice   20       CS     88
1      Bob   21     Math     92
2  Charlie   19       CS     79
3    David   22  Physics     85
4      Eva   20     Math     90


      name  score
0    Alice     88
1      Bob     92
2  Charlie     79
3    David     85
4      Eva     90


In [122]:
df.iloc[1, 3]      # position-based: row 1, column 3 (row first, then column)

np.int64(92)

In [123]:
df.iloc[0:3]

      name  age major  score
0    Alice   20    CS     88
1      Bob   21  Math     92
2  Charlie   19    CS     79


In [124]:
df["score"] > 85  # Boolean Series; use it as a mask to select rows (not columns)

df["name"]  # select a column by its name

0      Alice
1        Bob
2    Charlie
3      David
4        Eva
Name: name, dtype: object

In [125]:
df[df["major"] == "CS"]

      name  age major  score
0    Alice   20    CS     88
2  Charlie   19    CS     79


In [126]:
df[(df["major"] == "CS") & (df["score"] > 80)] # filtering with multiple conditions


    name  age major  score
0  Alice   20    CS     88


In [127]:
df["score"] = df["score"] + 5
print(df["score"])
# To cap the scores at 100, assign to the column only (not to whole rows):
# df.loc[df["score"] >= 100, "score"] = 100
# print(df["score"])
df.index = ["a", "b", "c", "d", "e"]

# DataFrame: different ways to index columns and rows
print("\n",df)
print(df.loc["a"])  # .loc selects a row by its index label (.iloc would use the integer position)
print(df["name"])   # [] with a column name selects that column as a Series

0    93
1    97
2    84
3    90
4    95
Name: score, dtype: int64

       name  age    major  score
a    Alice   20       CS     93
b      Bob   21     Math     97
c  Charlie   19       CS     84
d    David   22  Physics     90
e      Eva   20     Math     95
name     Alice
age         20
major       CS
score       93
Name: a, dtype: object
a      Alice
b        Bob
c    Charlie
d      David
e        Eva
Name: name, dtype: object
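
To make the difference between label-based and position-based access concrete, here is a small sketch on a separate toy frame (`df_idx` is a name introduced only for this illustration). It also shows one way to cap a column with `clip`, as an alternative to the commented-out assignment above.

```python
import pandas as pd

df_idx = pd.DataFrame(
    {"name": ["Alice", "Bob", "Charlie"], "score": [93, 97, 84]},
    index=["a", "b", "c"],
)

row_by_label = df_idx.loc["b"]       # row whose index label is "b"
row_by_position = df_idx.iloc[1]     # second row, whatever its label is
cell = df_idx.loc["b", "score"]      # row label, then column label

# Boolean mask selects rows; the column label selects the column
high = df_idx.loc[df_idx["score"] > 90, "name"]

# Cap scores at 95 without touching the other columns
df_idx["score"] = df_idx["score"].clip(upper=95)
```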


### Basic Process of Handling Data


1. Cleaning & Inspection
2. Feature Engineering & Filtering
3. Grouping & Aggregating
   

![](https://encrypted-tbn3.gstatic.com/licensed-image?q=tbn:ANd9GcRyepE8z-lPaT34Z4YcYNeOkoBl32-RNSiddhXm1AuKKSDgS86FsAdFmmcZqyGxaZkO_CHuLYfoOvJe-G_Y6c5CTJNU6GoNWJWJNRXL-Qg84mhQi40)


- Inspection & Basic Cleaning 🧹 We'll look at how to get a "health check" on your data. This involves identifying missing values (`NaN`), checking whether your numbers are actually being treated as strings (data types), and spotting duplicates that might skew your results.

- Feature Engineering & Filtering 🛠️ This is about shaping the data. We'll explore how to create new columns based on existing ones (like calculating "Profit" from "Revenue" and "Cost") and how to use boolean indexing to zoom in on the specific rows you care about.

- Grouping & Aggregating 📊 This is the "Summary" stage. We'll use the Split-Apply-Combine pattern to answer questions like, "What is the average sales volume per region?" or "Which category has the highest growth?"

We will first look at these stages through toy demos and conceptual analysis, and then put them into practice with a tiny project.
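
Before going through each stage in detail, the three stages can be sketched end to end on a tiny table. The sales data here is invented purely for illustration; the column names simply follow the "Profit from Revenue and Cost" example mentioned above.

```python
import pandas as pd

# Hypothetical sales data, invented for this illustration
sales = pd.DataFrame({
    "Region": ["North", "South", "North", "South"],
    "Revenue": [100.0, 80.0, None, 120.0],
    "Cost": [60.0, 50.0, 40.0, 90.0],
})

# 1. Inspect and clean: count missing values, then drop incomplete rows
missing_per_column = sales.isna().sum()
clean = sales.dropna().copy()

# 2. Feature engineering and filtering: derive Profit, keep profitable rows
clean["Profit"] = clean["Revenue"] - clean["Cost"]
profitable = clean[clean["Profit"] > 0]

# 3. Group and aggregate: average profit per region
avg_profit = profitable.groupby("Region")["Profit"].mean()
```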

In [128]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
    'Age': [25, np.nan, 30, 22, 28],
    'Score': [85, 90, np.nan, 88, 92],
    'City': ['NY', 'LA', 'SF', 'NY', 'LA'],
    'Visits': [1, 2, 2, 1, 1]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

Original DataFrame:
      Name   Age  Score City  Visits
0    Alice  25.0   85.0   NY       1
1      Bob   NaN   90.0   LA       2
2  Charlie  30.0    NaN   SF       2
3    David  22.0   88.0   NY       1
4     None  28.0   92.0   LA       1


In [129]:
#Inspect first rows
df.head()

      Name   Age  Score City  Visits
0    Alice  25.0   85.0   NY       1
1      Bob   NaN   90.0   LA       2
2  Charlie  30.0    NaN   SF       2
3    David  22.0   88.0   NY       1
4     None  28.0   92.0   LA       1


In [130]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    4 non-null      object 
 1   Age     4 non-null      float64
 2   Score   4 non-null      float64
 3   City    5 non-null      object 
 4   Visits  5 non-null      int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 332.0+ bytes


In [131]:
print("\nDescribe:")
print(df.describe()) # summary statistics for the numeric columns (int and float)



Describe:
         Age      Score    Visits
count   4.00   4.000000  5.000000
mean   26.25  88.750000  1.400000
std     3.50   2.986079  0.547723
min    22.00  85.000000  1.000000
25%    24.25  87.250000  1.000000
50%    26.50  89.000000  1.000000
75%    28.50  90.500000  2.000000
max    30.00  92.000000  2.000000


In [132]:
#detect the missing values
print("\nIs Null (True = missing):")
print(df.isnull())
print("\n the notnull method:")
print(df.notna()) # notna() is the element-wise inverse of isnull()


Is Null (True = missing):
    Name    Age  Score   City  Visits
0  False  False  False  False   False
1  False   True  False  False   False
2  False  False   True  False   False
3  False  False  False  False   False
4   True  False  False  False   False

 the notnull method:
    Name    Age  Score  City  Visits
0   True   True   True  True    True
1   True  False   True  True    True
2   True   True  False  True    True
3   True   True   True  True    True
4  False   True   True  True    True


In [133]:
df_dropped = df.dropna()
print("\nAfter dropna():")
print(df_dropped)


After dropna():
    Name   Age  Score City  Visits
0  Alice  25.0   85.0   NY       1
3  David  22.0   88.0   NY       1


In [134]:
df_dropped_strict = df.dropna(thresh=len(df.columns))  # require all columns non-null
print(df_dropped_strict)

    Name   Age  Score City  Visits
0  Alice  25.0   85.0   NY       1
3  David  22.0   88.0   NY       1


There are other ways to handle NaN values:
```python
# Fill with a fixed value
filled = df.fillna(0)

# Forward-fill (propagate the last valid observation downward)
# Note: fillna(method='ffill') is deprecated in recent pandas; use ffill() instead
filled_ffill = df.ffill()

# Backward-fill (use the next valid observation)
filled_bfill = df.bfill()
```
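
A single fill value rarely suits every column. As a hedged sketch (the frame `df_gaps` and the choice of mean and median are assumptions made for illustration), `fillna` also accepts a dict that maps each column to its own replacement:

```python
import numpy as np
import pandas as pd

df_gaps = pd.DataFrame({
    "Name": ["Alice", "Bob", None],
    "Age": [25.0, np.nan, 28.0],
    "Score": [85.0, 90.0, np.nan],
})

# Pass a dict to fillna to choose a different replacement per column
filled = df_gaps.fillna({
    "Name": "Unknown",                    # placeholder for missing text
    "Age": df_gaps["Age"].mean(),         # mean of the observed ages
    "Score": df_gaps["Score"].median(),   # median of the observed scores
})
```

Whether a mean, a median, or a placeholder is appropriate depends on the data; the point is only that each column can be treated separately.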

In [135]:
df_filled_zero = df.fillna(0)
print("\nFill all NaNs with 0:")
print(df_filled_zero)
# Filling the Name column with 0 makes no sense for text data
df_filled_zero.loc[4,"Name"] = "Unknown"
df_filled_zero # a better fill for the missing name


Fill all NaNs with 0:
      Name   Age  Score City  Visits
0    Alice  25.0   85.0   NY       1
1      Bob   0.0   90.0   LA       2
2  Charlie  30.0    0.0   SF       2
3    David  22.0   88.0   NY       1
4        0  28.0   92.0   LA       1


      Name   Age  Score City  Visits
0    Alice  25.0   85.0   NY       1
1      Bob   0.0   90.0   LA       2
2  Charlie  30.0    0.0   SF       2
3    David  22.0   88.0   NY       1
4  Unknown  28.0   92.0   LA       1


In [136]:
# Add duplicate row (modern pandas way)
df_with_dup = pd.concat([df_filled_zero, df_filled_zero.iloc[[0]]], ignore_index=True)
print("DataFrame with duplicate row:")
print(df_with_dup)

# Remove duplicates
df_no_dup = df_with_dup.drop_duplicates()
print("\nAfter drop_duplicates():")
print(df_no_dup)

DataFrame with duplicate row:
      Name   Age  Score City  Visits
0    Alice  25.0   85.0   NY       1
1      Bob   0.0   90.0   LA       2
2  Charlie  30.0    0.0   SF       2
3    David  22.0   88.0   NY       1
4  Unknown  28.0   92.0   LA       1
5    Alice  25.0   85.0   NY       1

After drop_duplicates():
      Name   Age  Score City  Visits
0    Alice  25.0   85.0   NY       1
1      Bob   0.0   90.0   LA       2
2  Charlie  30.0    0.0   SF       2
3    David  22.0   88.0   NY       1
4  Unknown  28.0   92.0   LA       1


### Feature Engineering

We will start from the cleaned data and go step by step.


In [137]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 32, 30, 22, 28],
    'Score': [85, 90, 78, 88, 92],
    'City': ['NY', 'LA', 'SF', 'NY', 'LA'],
    'Visits': [1, 2, 2, 1, 3]
}
df = pd.DataFrame(data)

In [138]:
# Feature engineering means creating new columns or
# transforming existing ones to capture more information.

# The .str accessor applies string methods element-wise to a Series
# Example: take the first 3 letters of Name (stored in a column called Initials)
df['Initials'] = df['Name'].str[:3]
print("\nAfter adding Initials:")
print(df)

# Example: Upper-case City names
df['City_Upper'] = df['City'].str.upper()
print("\nAfter City_Upper:")
print(df)


After adding Initials:
      Name  Age  Score City  Visits Initials
0    Alice   25     85   NY       1      Ali
1      Bob   32     90   LA       2      Bob
2  Charlie   30     78   SF       2      Cha
3    David   22     88   NY       1      Dav
4      Eve   28     92   LA       3      Eve

After City_Upper:
      Name  Age  Score City  Visits Initials City_Upper
0    Alice   25     85   NY       1      Ali         NY
1      Bob   32     90   LA       2      Bob         LA
2  Charlie   30     78   SF       2      Cha         SF
3    David   22     88   NY       1      Dav         NY
4      Eve   28     92   LA       3      Eve         LA


Key methods for feature engineering:

|Concept|Purpose|Key Tool|
|---|---|---|
|`.str`|Batch operations on text columns|`series.str.xxx()`|
|Categorical|Efficient storage of a fixed set of values|`.astype('category')`, `pd.Categorical`|
|Binning|Turn numeric values into groups|`pd.cut` (bins by value range), `pd.qcut` (bins by quantile)|
|Function Application|Custom transformations|`apply`, `DataFrame.map` (named `applymap` before pandas 2.1), vectorized math, NumPy ufuncs|

Other common string methods from the book:
- `str.lower()`, `str.replace()`, `str.split()`, `str.strip()`
- `str.match()` for regex matching
- `str.slice()` for substrings
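
The string methods listed above can be tried on a small Series. This is only a sketch; the sample names in `raw` are made up for the illustration.

```python
import pandas as pd

raw = pd.Series(["  Alice Smith ", "bob JONES", "Charlie_Brown"])

cleaned = raw.str.strip()                             # remove leading/trailing whitespace
lowered = cleaned.str.lower()                         # lower-case every element
spaced = cleaned.str.replace("_", " ", regex=False)   # literal (non-regex) replacement
first_names = spaced.str.split(" ").str[0]            # split, then take the first piece
starts_upper = cleaned.str.match(r"[A-Z]")            # regex anchored at the start of each string
prefix = cleaned.str.slice(0, 3)                      # same result as cleaned.str[:3]
```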


In [139]:
# Convert City to categorical
df['City_Category'] = df['City'].astype('category')
print("\nCity_Category dtype:", df['City_Category'].dtype)

# Show categories
print("Categories:", df['City_Category'].cat.categories)


City_Category dtype: category
Categories: Index(['LA', 'NY', 'SF'], dtype='object')


Categorical data means the variable can take only a limited, fixed set of possible values (categories).
- Examples: City = {NY, LA, SF}, Department = {HR, IT, Sales}.
- Internally, each unique string is mapped to an integer code (0, 1, 2, ...). The actual Series stores these integer codes.
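
The integer codes can be inspected directly through the `.cat` accessor. A minimal sketch, reusing the same city values as the demo above:

```python
import pandas as pd

cities = pd.Series(["NY", "LA", "SF", "NY", "LA"], dtype="category")

categories = list(cities.cat.categories)   # unique values in sorted order
codes = cities.cat.codes.tolist()          # one integer code per row

# Rebuild the original labels from codes + categories
rebuilt = [categories[c] for c in codes]
```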

In [140]:
ordered_cat = pd.Categorical(['Small', 'Large', 'Medium'], 
                              categories=['Small', 'Medium', 'Large'], 
                              ordered=True)
# Indexing a Categorical (ordered_cat[0]) returns a plain Python string, so comparing
# two such elements would be alphabetical ('Small' < 'Medium' is False).
# Compare the integer codes instead; they follow the declared category order.
small, large, medium = ordered_cat.codes   # 0, 2, 1

print(small < medium)   # True  (0 < 1)
print(large > medium)   # True  (2 > 1)  
print(medium > small)   # True  (1 > 0)
print(large < small)    # False (2 < 0 is false)

# Further: filter "Medium" or larger.
# Comparing the Categorical itself with a category label respects the declared order:
# filtered = ordered_cat[ordered_cat >= 'Medium']   # ['Large', 'Medium']

True
True
True
False


### Binning or Discretizing Continuous Variables

Binning means converting a continuous numeric variable into discrete intervals (bins).

Useful for turning ages, scores, or prices into groups like "low", "medium", "high".

There are two main ways in pandas:
1. Equal-width binning with `pd.cut`: divides the range of the data into equal-sized intervals. You can also pass explicit bin edges, as the next cell does.

In [141]:
ages = pd.Series([22, 25, 35, 45, 58])
bins = [20, 30, 40, 50, 60]
labels = ['20-30', '30-40', '40-50', '50-60']
age_groups = pd.cut(ages, bins=bins, labels=labels)

2. Quantile binning with `pd.qcut`: divides the data so that each bin has (approximately) the same number of observations.

In [142]:
scores = pd.Series([55, 65, 75, 85, 95])
score_quartiles = pd.qcut(scores, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
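
The two cells above do not display their results, so here is a self-contained sketch that repeats both calls and adds `pd.cut` with an integer `bins` argument (an addition for illustration) to show true equal-width bins.

```python
import pandas as pd

ages = pd.Series([22, 25, 35, 45, 58])
scores = pd.Series([55, 65, 75, 85, 95])

# Explicit edges: each interval is (left, right], so 22 and 25 both land in '20-30'
age_groups = pd.cut(ages, bins=[20, 30, 40, 50, 60],
                    labels=['20-30', '30-40', '40-50', '50-60'])

# An integer instead of a list asks pd.cut for that many equal-width bins
equal_width = pd.cut(ages, bins=3)

# Quantile bins: the edges come from the data's own quantiles
score_quartiles = pd.qcut(scores, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

print(age_groups.tolist())
print(score_quartiles.tolist())
```

With only five scores the quartile bins cannot hold exactly the same number of values, which is why the bin sizes are only approximately equal.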

### Pandas Basic: Grouping & Aggregating Data

Grouping and aggregation are essential when you want to **summarize data** across categories or groups. For example:

- Total sales per region
    
- Average age per department
    
- Count of users by signup month
    
This is done using `groupby()` followed by an aggregation function like `sum()`, `mean()`, `count()`, etc.

In [143]:
import pandas as pd

data = {
    'Department': ['Sales', 'HR', 'Sales', 'IT', 'HR', 'IT', 'Sales'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Salary': [50000, 60000, 55000, 70000, 62000, 68000, 53000],
    'YearsExperience': [2, 5, 3, 7, 4, 6, 2]
}

df = pd.DataFrame(data)
print(df)


  Department Employee  Salary  YearsExperience
0      Sales    Alice   50000                2
1         HR      Bob   60000                5
2      Sales  Charlie   55000                3
3         IT    David   70000                7
4         HR      Eve   62000                4
5         IT    Frank   68000                6
6      Sales    Grace   53000                2


In [144]:
# Single statistic: each call returns a Series indexed by Department
# Min and max salary per department
df.groupby('Department')['Salary'].min()
df.groupby('Department')['Salary'].max()

# Total salary by department
df.groupby('Department')['Salary'].sum()

# Average salary by department
df.groupby('Department')['Salary'].mean()

# Number of employees per department
df.groupby('Department')["Employee"].count()

# Multiple statistics for Salary by Department: returns a DataFrame
df.groupby('Department')['Salary'].agg(['sum', 'mean', 'count', 'min', 'max'])

               sum          mean  count    min    max
Department
HR          122000  61000.000000      2  60000  62000
IT          138000  69000.000000      2  68000  70000
Sales       158000  52666.666667      3  50000  55000


In [145]:
# You can compute different aggregations on different columns
df.groupby('Department').agg({
    'Salary': 'mean',
    'YearsExperience': 'sum'
})
# Equivalent to placing the two Series ['Salary'].mean() and ['YearsExperience'].sum() side by side

                  Salary  YearsExperience
Department
HR          61000.000000                9
IT          69000.000000               13
Sales       52666.666667                7


In [146]:
df.groupby('Department').agg({
    'Salary': ['mean', 'max'],
    'YearsExperience': 'sum'
})

                  Salary        YearsExperience
                    mean    max             sum
Department
HR          61000.000000  62000               9
IT          69000.000000  70000              13
Sales       52666.666667  55000               7


In [147]:
# Custom: salary range (max - min) per department
# (named salary_range so the built-in range() is not shadowed)
salary_range = df.groupby('Department')['Salary'].agg(lambda x: x.max() - x.min())
salary_range  # a Series named "Salary" holding max - min for each department

Department
HR       2000
IT       2000
Sales    5000
Name: Salary, dtype: int64

In [148]:
# Let's add a 'Region' column for demo
df['Region'] = ['North', 'South', 'North', 'West', 'South', 'East', 'North']
print(df)
# Group by Department and Region
df.groupby(['Department', 'Region'])['Salary'].mean()
# Grouping by two keys creates a MultiIndex in the result:
# IT, for example, has two inner labels (East and West), each with its own value

  Department Employee  Salary  YearsExperience Region
0      Sales    Alice   50000                2  North
1         HR      Bob   60000                5  South
2      Sales  Charlie   55000                3  North
3         IT    David   70000                7   West
4         HR      Eve   62000                4  South
5         IT    Frank   68000                6   East
6      Sales    Grace   53000                2  North


Department  Region
HR          South     61000.000000
IT          East      68000.000000
            West      70000.000000
Sales       North     52666.666667
Name: Salary, dtype: float64

In [149]:
# reset_index() turns Department back into a regular column
result = df.groupby('Department')['Salary'].mean().reset_index()
print(result)
# Without reset_index(), Department stays in the index of a Series
result_worse = df.groupby('Department')['Salary'].mean()
print(result_worse)

  Department        Salary
0         HR  61000.000000
1         IT  69000.000000
2      Sales  52666.666667
Department
HR       61000.000000
IT       69000.000000
Sales    52666.666667
Name: Salary, dtype: float64


In [150]:
df.groupby('Department', as_index=False)['Salary'].mean()

  Department        Salary
0         HR  61000.000000
1         IT  69000.000000
2      Sales  52666.666667


### Counting Frequency

In [151]:
df['Department'].value_counts()

Department
Sales    3
HR       2
IT       2
Name: count, dtype: int64

In [152]:
df.groupby('Department').size()  # returns a Series
df.groupby('Department').count()  # counts non-null values in each column

            Employee  Salary  YearsExperience  Region
Department
HR                 2       2                2       2
IT                 2       2                2       2
Sales              3       3                3       3


In [153]:
# Group + Aggregate, then Filter
# Step 1: Group and aggregate
avg_exp = df.groupby('Department')['YearsExperience'].mean().reset_index()

# Step 2: Filter
filtered = avg_exp[avg_exp['YearsExperience'] > 3]

print(filtered)


# The same pipeline written as a single method chain
(
    df.groupby('Department', as_index=False)['YearsExperience']
    .mean()
    .query('YearsExperience > 3')
)

  Department  YearsExperience
0         HR              4.5
1         IT              6.5


  Department  YearsExperience
0         HR              4.5
1         IT              6.5


### Bonus: Understanding GroupBy

A `DataFrameGroupBy` object is a special intermediate object that "remembers" how to split the data but hasn't performed any calculations yet.

In [154]:
import pandas as pd

# Sample data
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value': [10, 20, 15, 25, 12, 30],
    'Size': [100, 200, 150, 250, 120, 300]
}
df = pd.DataFrame(data)

# This returns a GROUPBY OBJECT, not a DataFrame
grouped = df.groupby('Category')
print(type(grouped))
# Output: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>


It stores:
- How to split the data (by 'Category')
- A reference to the original DataFrame, so that an aggregation can be applied to each group later

Inside:

In [189]:
# The GroupBy object stores information about groups
print("Groups:", grouped.groups)
# Output: Groups: {'A': [0, 2, 4], 'B': [1, 3, 5]}
# .groups maps each key to the row labels in that group; .indices gives the integer positions

print(grouped.indices)

Groups: {'A': [0, 2, 4], 'B': [1, 3, 5]}
{'A': array([0, 2, 4]), 'B': array([1, 3, 5])}
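
To see the split step explicitly, a GroupBy object can be iterated over. The sketch below (with its own small frame, `df_g`, introduced for illustration) performs split, apply, and combine by hand and compares the result with the built-in aggregation.

```python
import pandas as pd

df_g = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value': [10, 20, 15, 25, 12, 30],
})
grouped_demo = df_g.groupby('Category')

# Split: iterating yields (key, sub-DataFrame) pairs
manual = {}
for key, group in grouped_demo:
    # Apply: any computation on the sub-DataFrame
    manual[key] = group['Value'].sum()

# Combine: collect the per-group results into one Series
manual_result = pd.Series(manual, name='Value')

# The built-in aggregation performs all three steps at once
builtin_result = grouped_demo['Value'].sum()

# get_group pulls out the rows of a single group
group_a = grouped_demo.get_group('A')
```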


In [202]:
# NOW we get actual results (a Series)
result_series = grouped['Value'].sum()
# Calling .sum() on the whole GroupBy object (no column selected) returns a DataFrame
print(type(result_series))
# Output: <class 'pandas.core.series.Series'>

print("the name of index:", result_series.index.name)
print("the name of the series:",result_series.name)
print("the values of the series:",result_series.values)
print("the series:\n",result_series)

print("\nget a dataframe from the groupby object")
a = grouped.sum()
print("dataframe's indices have a name, as they made up a list\n",grouped.sum().index)
a["Size"]  # a single column of the aggregated DataFrame is a Series

# Element-wise comparison: summing every column matches the explicit agg() spec
a == grouped.agg({"Value":'sum', 'Size':'sum'})

<class 'pandas.core.series.Series'>
the name of index: Category
the name of the series: Value
the values of the series: [37 75]
the series:
 Category
A    37
B    75
Name: Value, dtype: int64

get a dataframe from the groupby object
dataframe's indices have a name, as they made up a list
 Index(['A', 'B'], dtype='object', name='Category')


          Value  Size
Category
A          True  True
B          True  True


In [157]:
result_series.name
result_series.info()
result_series["A"]

<class 'pandas.core.series.Series'>
Index: 2 entries, A to B
Series name: Value
Non-Null Count  Dtype
--------------  -----
2 non-null      int64
dtypes: int64(1)
memory usage: 140.0+ bytes


np.int64(37)

To understand `reset_index()`, it helps to think of the **Index** as the "address" of a row. Sometimes that address is a simple number, but after doing data analysis, that address can become a messy label or a complex category. `reset_index()` is the tool that gives your data a fresh start.

---

## 1. The "Why": Common Scenarios

### A. The "Clean Slate" (After Filtering)

When you delete rows, pandas doesn't "renumber" the remaining rows automatically. If you delete row #1, your index goes from 0 to 2. This is called a **discontinuous index**.

### B. Turning Labels into Data (After GroupBy)

When you group data (e.g., finding average sales per city), pandas often makes "City" the index. If you want to use "City" for a chart or merge it with another table later, it's often easier if "City" is a regular column.

### C. Moving from MultiIndex to Flat Data

After complex operations, you might end up with multiple index levels (a MultiIndex). `reset_index()` moves those levels back into ordinary columns, leaving a flat table with a default integer index.
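
A short sketch of the MultiIndex case, using a small made-up frame (`df_m`); the `level=` argument is shown as an optional extra for moving only one level.

```python
import pandas as pd

df_m = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'IT', 'IT'],
    'Region': ['North', 'South', 'North', 'North'],
    'Salary': [50000, 54000, 70000, 68000],
})

# Grouping by two keys produces a Series with a two-level MultiIndex
nested = df_m.groupby(['Department', 'Region'])['Salary'].mean()

# reset_index() turns both levels into regular columns
flat = nested.reset_index()

# level= moves only the named level; the other one stays in the index
partly_flat = nested.reset_index(level='Region')
```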

---

## 2. Practical Example: Step-by-Step

Let's look at a scenario where we filter a list of employees and see how the index behaves.

### The Setup

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'Tech', 'Tech', 'HR', 'Tech'],
    'Salary': [50000, 80000, 75000, 52000, 90000]
}

df = pd.DataFrame(data)

```

### Scenario 1: Resetting after Filtering

If we filter for only "Tech" employees, notice what happens to the index numbers:

```python
tech_df = df[df['Department'] == 'Tech']
print(tech_df)

# Output:
#       Name Department  Salary
# 1      Bob       Tech   80000  <-- Index starts at 1, not 0...
# 2  Charlie       Tech   75000
# 4      Eve       Tech   90000  <-- ...and skips 3

```

To fix this so it starts at 0, 1, 2:

```python
tech_df_cleaned = tech_df.reset_index(drop=True)

```

* **`drop=True`**: This is vital! If you don't use it, pandas will keep the old index (1, 2, 4) and turn it into a new column called "index".

---

### Scenario 2: Resetting after GroupBy

If we want to see the average salary per department:

```python
grouped = df.groupby('Department')['Salary'].mean()
print(grouped)

# Output:
# Department
# HR      51000.000000
# Tech    81666.666667
# Name: Salary, dtype: float64

```

In the output above, **Department** is the index: its name is printed on its own line above the labels, and HR and Tech are index labels rather than values in a column. To turn it back into a normal table:

```python
final_df = grouped.reset_index()
print(final_df)

# Output:
#   Department       Salary
# 0         HR  51000.000000
# 1       Tech  81666.666667

```

---

## 3. Key Parameters to Remember

| Parameter | What it does | When to use it |
| --- | --- | --- |
| **`drop=True`** | Discards the old index. | When the old index is just random numbers you no longer need. |
| **`drop=False`** (Default) | Keeps the old index and inserts it as a new column. | When the old index contains meaningful data (like dates or IDs). |
| **`inplace=True`** | Modifies the current DataFrame. | When you don't want to create a new variable (e.g., `df.reset_index(inplace=True)`). |
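
The three parameters can be compared side by side. In this sketch the `EmployeeID` index is a hypothetical example of an index that carries meaningful data.

```python
import pandas as pd

df_p = pd.DataFrame(
    {'Salary': [80000, 75000, 90000]},
    index=pd.Index([101, 102, 104], name='EmployeeID'),  # hypothetical IDs
)

kept = df_p.reset_index()               # default drop=False: EmployeeID becomes a column
dropped = df_p.reset_index(drop=True)   # EmployeeID is discarded

# inplace=True changes df_p itself and returns None
returned = df_p.reset_index(inplace=True)
```

Because `inplace=True` returns `None`, assigning its result to a variable is a common mistake; either reassign (`df = df.reset_index()`) or call it in place, not both.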

---
