# Pandas DataFrame Object

- A Series is like a single column of an Excel sheet [1D]. A DataFrame is like the whole sheet [2D].
- A Series has 2 types of index (implicit and explicit). Similarly, a DataFrame has both types of index along each dimension.
- If I pick a single row or a single column of a DataFrame, I get a Series object.
- Now let's first understand the CRUD operations on a pandas DataFrame.

## Create DataFrame

- It’s more important to learn how to use a DataFrame (filtering, grouping, merging, reshaping) than to study in depth how pandas structures data during creation. Initially I was more concerned with how pandas decides to build the DataFrame from whatever data I give it.
- I have experimented with **list of lists**, **list of dicts**, **list of Series**, **dict of lists**, **dict of dicts**, and **dict of Series**.
- But it is not worth learning in depth. The example below is what made me realize that.

In [22]:
# Example that made me realise why it's not worth learning "how pandas create the dataframe"

import pandas as pd

df = pd.DataFrame(data=[ 
    { 'food' : 5000, 'dairy': 1000}, 
    { 'travel' : 1000 }, 
    { 'communication' : 2000 }, 
    { 'household' : 10000 } ]
) 

df

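| | food | dairy | travel | communication | household |
|---|---|---|---|---|---|
| 0 | 5000.0 | 1000.0 | NaN | NaN | NaN |
| 1 | NaN | NaN | 1000.0 | NaN | NaN |
| 2 | NaN | NaN | NaN | 2000.0 | NaN |
| 3 | NaN | NaN | NaN | NaN | 10000.0 |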


- Knowing the construction logic is only moderately useful, mainly for debugging and for converting raw inputs.
- The `data` parameter of `pd.DataFrame()` accepts any 2D (tabular) structure.
- If the data can be interpreted as rows × columns, pandas can form a valid DataFrame.
    - 1D data → auto-converted to a single-column DataFrame.
    - 2D data → valid and structured as rows and columns.
    - 3D+ data (for example a 3D NumPy array) → raises `ValueError: Must pass 2-d input`.
- `pd.DataFrame()` is designed for 2D tabular data only.
- Anything higher-dimensional must be flattened or reduced before conversion.
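
The 1D and 2D cases above can be sketched with small made-up inputs. The values and column names here are illustrative only, and reshaping is just one possible way to reduce a 3D array before conversion:

```python
import numpy as np
import pandas as pd

# 1D input: a flat list becomes a single column labelled 0
df_1d = pd.DataFrame([10, 20, 30])
print(df_1d.shape)  # (3, 1)

# 2D input: a list of lists maps directly to rows x columns
df_2d = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
print(df_2d.shape)  # (3, 2)

# 3D input must be reduced first; here the last two axes are collapsed
arr = np.arange(27).reshape(3, 3, 3)
df_flat = pd.DataFrame(arr.reshape(3, -1))
print(df_flat.shape)  # (3, 9)
```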

In [35]:
import pandas as pd
import numpy as np


arr = np.arange(27).reshape(3,3,3)
try:
    pd.DataFrame(arr)
except ValueError as ex:
    print(ex) # Must pass 2-d input. shape=(3, 3, 3)

Must pass 2-d input. shape=(3, 3, 3)


- So, if I ever need to create a dummy DataFrame, I will just use the dict-of-lists form below.

In [36]:
import pandas as pd

data = {
    'Category': ['Food', 'Travel', 'Communication', 'Household'],
    'Amount': [5000, 1000, 2000, 10000]
}

df = pd.DataFrame(data)

print(df)

        Category  Amount
0           Food    5000
1         Travel    1000
2  Communication    2000
3      Household   10000


- The most important way to create a DataFrame is from external sources, and CSV and Excel files are the basics to start with.
- The parameters for reading data from CSV and Excel are pretty much the same. I mainly need to remember 2 of them: the file path and the header row. The file path is positional, and the header row selection is keyword-only (`header=`). Once the data is imported, I can clean it afterwards; there is no need to clean it while importing.
- Note that for importing data from an `.xlsx` file, pandas internally uses a Python package called `openpyxl`, so I have to install that.

In [68]:
from datetime import datetime
import pandas as pd

df = pd.read_csv(
    'carpe_diem.csv',
    sep=',',
    header=0,            # first row of the file is the header; it is replaced by `names`
    usecols=[0, 1, 2],
    # names must be a list of labels; a dict only contributes its keys (0, 1, 2)
    names=['Date', 'Validity', 'Time'],
    converters={
        # integer keys refer to column positions
        0: lambda date: datetime.strptime(date, '%Y/%m/%d').date(),
        1: lambda entry: entry == 'Filled',
        2: lambda dt: datetime.strptime(dt, '%B %d, %Y %I:%M %p').time()
    }
)
df

| | Date | Validity | Time |
|---|---|---|---|
| 0 | 2023-11-01 | True | 12:40:00 |
| 1 | 2023-11-02 | True | 12:36:00 |
| 2 | 2023-11-03 | True | 12:56:00 |
| 3 | 2023-11-04 | True | 12:01:00 |
| 4 | 2023-11-05 | True | 12:43:00 |
| ... | ... | ... | ... |
| 717 | 2025-10-18 | True | 10:51:00 |
| 718 | 2025-10-19 | True | 06:05:00 |
| 719 | 2025-10-20 | True | 11:12:00 |
| 720 | 2025-10-21 | True | 10:00:00 |


In [64]:
from datetime import datetime
import pandas as pd

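df = pd.read_excel(
    'carpe_diem.xlsx',
    header=0,            # first row of the sheet is the header; it is replaced by `names`
    usecols=[0, 1, 2],
    # names must be a list of labels; a dict only contributes its keys (0, 1, 2)
    names=['Date', 'Validity', 'Time'],
    converters={
        # Excel cells already arrive as datetime objects, so no string parsing is needed
        0: lambda date: date.date(),
        1: lambda entry: entry == 'Filled',
        2: lambda dt: dt.time()
    }
)

df

| | Date | Validity | Time |
|---|---|---|---|
| 0 | 2023-11-01 | True | 12:40:00 |
| 1 | 2023-11-02 | True | 12:36:00 |
| 2 | 2023-11-03 | True | 12:56:00 |
| 3 | 2023-11-04 | True | 12:01:00 |
| 4 | 2023-11-05 | True | 12:43:00 |
| ... | ... | ... | ... |
| 717 | 2025-10-18 | True | 10:51:00 |
| 718 | 2025-10-19 | True | 06:05:00 |
| 719 | 2025-10-20 | True | 11:12:00 |
| 720 | 2025-10-21 | True | 10:00:00 |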
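
The "import first, clean afterwards" idea mentioned earlier can be sketched as follows. The CSV text is a made-up stand-in for `carpe_diem.csv` (the real file's contents and formats are not shown in these notes), read from an in-memory buffer so the example runs on its own:

```python
import io
import pandas as pd

# Hypothetical sample data with the same three columns as above
raw = io.StringIO(
    "Date,Validity,Time\n"
    "2023/11/01,Filled,12:40 PM\n"
    "2023/11/02,Empty,12:36 PM\n"
)

# Import with only the source and the header row specified
df = pd.read_csv(raw, header=0)

# Clean after import instead of passing converters
df['Date'] = pd.to_datetime(df['Date'], format='%Y/%m/%d').dt.date
df['Validity'] = df['Validity'] == 'Filled'
df['Time'] = pd.to_datetime(df['Time'], format='%I:%M %p').dt.time

print(df)
```

Compared with `converters`, this keeps the import call short and lets each column be fixed (and re-fixed) in a separate step.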


## Read DataFrame

- When I learned about accessing a cell in a pandas Series, I needed the index of that cell, and that index could be explicit or implicit.
- A pandas DataFrame works the same way: I need an index to locate a cell, and that index can be explicit or implicit. The only difference is that I need 2 indices, because a DataFrame is 2D.
- And the good thing is that, because a DataFrame is 2D, I can access:
  * A specific cell
  * An entire column - This will be a Series by definition
  * An entire row - This will be a Series by definition
- pandas offers a simple syntax for the most common case: `df['column']` returns a Series representing that single column. This matches what I have observed while working on an Excel sheet, where I often need to grab an entire column and perform some task on it.
- If I chain another `[]` after that, like `df['column'][row]`, I can target a specific row. It technically works, but it is not how pandas is meant to be used. This is called **chained indexing**, and pandas **warns against it**, because the first step may return a copy instead of a view, so an assignment made through the chain may never reach the original DataFrame. So, even though `df['column'][row]` might give me the value I want, it’s not reliable for modifying data.
- To make this process clean and powerful, pandas introduced two dedicated indexers:
    1. `df.iloc[row_index, column_index]`
    2. `df.loc[row_label, column_label]`
- And yes, both `.iloc[]` and `.loc[]` also support:
    1. **Slicing**
    2. **Masking**
    3. **Fancy indexing**
- So now, in total, these are the valid and safe ways to access data in pandas:
    * `df['column']` → access a column (returns Series)
    * `df[['column 1', 'column 2']]` → access multiple columns (returns DataFrame) → I think of it as fancy indexing.
    * `df.iloc[row, column]`
    * `df.loc[row, column]`
    * A special syntax
- What `df[]` returns depends on what I put inside the `[]`: a single string gives that column as a Series, and a list of strings gives a DataFrame with those columns. But I can pass one more thing, a boolean Series (the result of something like `df['column'] > 10`). pandas aligns that boolean Series with the row index and returns only the rows where its value is `True`. So this special syntax filters rows instead of returning a column.
- So finally, here are the syntaxes I will come across:
    1. `df['column']` / `df[['column 1', 'column 2']]`
    2. `df.iloc[row, column]`
    3. `df.loc[row, column]`
    4. `df[boolean series]`
- One more small thing: how do I get the index of a pandas DataFrame? For a Series I got it with `series.index`, and the index returned that way is of course always the explicit one. A DataFrame has 2 indices, so it has 2 attributes, both of which return the explicit index.
    1. `df.index` - gives the row index
    2. `df.columns` - gives the column index
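
The access patterns listed above can be sketched on a small made-up DataFrame. The row labels `'a'` to `'d'` are chosen only for illustration, so that the explicit index differs from the implicit positions:

```python
import pandas as pd

df = pd.DataFrame(
    {'Category': ['Food', 'Travel', 'Communication', 'Household'],
     'Amount': [5000, 1000, 2000, 10000]},
    index=['a', 'b', 'c', 'd'],  # explicit row labels (illustrative)
)

col = df['Amount']                     # single column -> Series
sub = df[['Category', 'Amount']]       # list of columns -> DataFrame
row = df.loc['b']                      # single row by label -> Series

cell_by_pos = df.iloc[1, 1]            # implicit (positional) indices
cell_by_label = df.loc['b', 'Amount']  # explicit (label) indices

big = df[df['Amount'] > 1500]          # boolean Series -> filtered rows

# Modify through one .loc call instead of chained df['Amount']['b'] = ...
df.loc['b', 'Amount'] = 1200

print(type(col).__name__, type(sub).__name__, type(row).__name__)
print(cell_by_pos, cell_by_label)
print(list(big.index))
print(list(df.index), list(df.columns))
```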

## Clean DataFrame

- When I say cleaning a DataFrame, I mean performing operations on it so that it is ready to use.
- I usually find that this cleaning happens in 2 senses.
    1. Schema Cleaning
    2. Data Cleaning
- Schema Cleaning can be something like
    - Maybe while importing, some blank or unwanted columns were included - **Drop Those Columns**
        - `df = df.drop(columns=['Column 1', 'Column 2'])`
    - Maybe some column names don’t correctly describe the data - **Rename Columns**
        - `df = df.rename(columns={'Column 1 Old' : 'Column 1 New', 'Column 2 Old':'Column 2 New'})`
    - Maybe I will be using a particular column for lookups, but pandas has provided the default RangeIndex - **Set That Column as Index**
        - `df = df.set_index('Column Name')`
- Data Cleaning can be something like
    - Maybe I want all the string/text column data to be in lower case or upper case - **Perform String Operation**
    - Maybe a column has boolean-like values (True/False, Yes/No, 0/1), but pandas didn’t infer them as bool - **Convert Those Columns**
    - Maybe a column has numeric values, but pandas didn't infer them as numeric - **Convert Those Columns**
    - Maybe a column has datetime values, but pandas didn't infer them as datetime - **Convert Those Columns**
    - Maybe a column has categorical values, but pandas didn't infer them as categorical - **Convert Those Columns**
    - Maybe in a column, some rows have correct values and some have NaN values - **Drop Those Rows or Fill Those Cells**
    - Maybe some rows are duplicates in the original data - **Drop Those Rows**
    - Maybe I want the values of certain numeric columns to lie between 0 and 1, as is often done before feeding data to a neural network - **Normalize Those Columns**

### Schema Cleaning
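
The three schema-cleaning steps listed earlier can be chained as in this minimal sketch. The DataFrame and its column names (`Amt`, `Unnamed: 2`) are hypothetical, standing in for a messy import:

```python
import pandas as pd

# Hypothetical import result: one unwanted blank column and one vague name
df = pd.DataFrame({
    'Category': ['Food', 'Travel'],
    'Amt': [5000, 1000],
    'Unnamed: 2': [None, None],
})

df = df.drop(columns=['Unnamed: 2'])       # drop unwanted columns
df = df.rename(columns={'Amt': 'Amount'})  # rename unclear columns
df = df.set_index('Category')              # use a column as the row index

print(df)
print(df.loc['Travel', 'Amount'])  # lookup by the new explicit index
```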

### Data Cleaning
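
Each data-cleaning case listed earlier maps to roughly one pandas call. The sketch below uses a made-up DataFrame in which everything arrives as text, with one missing value and one duplicate row. The `Yes`/`No` mapping and the choice to drop (rather than fill) the missing row are assumptions made for this example:

```python
import pandas as pd

# Hypothetical raw data: all text, with a gap and a duplicate row
df = pd.DataFrame({
    'Category': ['Food', 'TRAVEL', 'TRAVEL', 'Household'],
    'Paid': ['Yes', 'No', 'No', 'Yes'],
    'Amount': ['5000', '1000', '1000', None],
    'Date': ['2023-11-01', '2023-11-02', '2023-11-02', '2023-11-04'],
})

df['Category'] = df['Category'].str.lower()              # string operation
df['Paid'] = df['Paid'].map({'Yes': True, 'No': False})  # boolean-like -> bool
df['Amount'] = pd.to_numeric(df['Amount'])               # text -> numeric
df['Date'] = pd.to_datetime(df['Date'])                  # text -> datetime
df['Category'] = df['Category'].astype('category')       # text -> categorical

df = df.drop_duplicates()                                # drop duplicate rows
df = df.dropna(subset=['Amount'])                        # or fill: df['Amount'].fillna(0)

# Min-max normalization of Amount to the 0-1 range
lo, hi = df['Amount'].min(), df['Amount'].max()
df['Amount_norm'] = (df['Amount'] - lo) / (hi - lo)

print(df)
```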

## Insight from DataFrame