**A DataFrame** is a two-dimensional, tabular data structure
in pandas, similar to a spreadsheet or SQL table.
It consists of rows, columns, and an index that labels the rows.

### Key Features:
1. Columns are labeled, and rows have an index. An index can be numeric or a custom label. 
2. Heterogeneous data. Different columns can store different data types.
3. Filtering, grouping, joining and reshaping data are supported.
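
As a quick illustration of the first two features, the sketch below builds a small DataFrame with a custom string index and columns of different types. The name `people`, the labels, and the values are made up for this example.

```python
import pandas as pd

# Hypothetical sample data; the row labels "a", "b", "c" are illustrative.
people = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie"],  # text column
        "Age": [25, 30, 35],                  # integer column
        "Score": [88.5, 92.0, 79.5],          # float column
    },
    index=["a", "b", "c"],                    # custom row labels
)

print(people.dtypes)          # one dtype per column
print(people.index.tolist())  # the custom labels
```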



In [1]:
import pandas as pd

### DataFrames can be created in a few ways:

From a Dictionary 

In [2]:
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"]
}
df = pd.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


From a List of Dictionaries

In [3]:
data = [
    {"Name": "Alice", "Age": 25, "City": "New York"},
    {"Name": "Bob", "Age": 30, "City": "Los Angeles"},
    {"Name": "Charlie", "Age": 35, "City": "Chicago"}
]
df = pd.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


From a List of Lists with Column Names

In [13]:
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df = pd.DataFrame(data, columns=["A", "B", "C"])
print(df)

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9


From a CSV or Excel File
(Reading .xlsx files requires openpyxl: pip install openpyxl)

In [9]:
# From a CSV file
df_csv = pd.read_csv("data.csv")
print(df_csv)

# From an Excel file
df_xlsx = pd.read_excel("data.xlsx")
print(df_xlsx)

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9


Inspecting the dataframe

In [15]:
print(df.head())  # First 5 rows by default

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9


In [16]:
df.info()  # info() prints its summary itself and returns None, so no print() is needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      int64
 1   B       3 non-null      int64
 2   C       3 non-null      int64
dtypes: int64(3)
memory usage: 204.0 bytes


In [17]:
print(df.describe())

         A    B    C
count  3.0  3.0  3.0
mean   4.0  5.0  6.0
std    3.0  3.0  3.0
min    1.0  2.0  3.0
25%    2.5  3.5  4.5
50%    4.0  5.0  6.0
75%    5.5  6.5  7.5
max    7.0  8.0  9.0


Accessing Data

In [19]:
print(df["B"])

0    2
1    5
2    8
Name: B, dtype: int64


In [20]:
print(df[["B", "A"]])

   B  A
0  2  1
1  5  4
2  8  7


Row by integer position

In [31]:
print(df.iloc[0]) # First row

A    1
B    2
C    3
Name: 0, dtype: int64


Row by label

In [32]:
print(df.loc[0])  # Same row as above: with the default index, label 0 is also position 0.

A    1
B    2
C    3
Name: 0, dtype: int64
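
The difference between `iloc` and `loc` only shows up once the index is not the default 0, 1, 2. A minimal sketch, assuming a made-up frame named `labeled` with string row labels:

```python
import pandas as pd

# Hypothetical frame whose row labels are strings, not positions.
labeled = pd.DataFrame(
    {"A": [1, 4, 7], "B": [2, 5, 8]},
    index=["x", "y", "z"],
)

first_by_position = labeled.iloc[0]  # first row, whatever its label is
row_by_label = labeled.loc["y"]      # the row whose label is "y"

print(first_by_position)
print(row_by_label)
```

Here `labeled.loc[0]` would raise a `KeyError`, because no row carries the label 0.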


### Filtering
Find rows where A > 3

In [33]:
print(df[df["A"] > 3])

   A  B  C
1  4  5  6
2  7  8  9
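
Several conditions can be combined in one filter. The sketch below rebuilds the same 3x3 data under the assumed name `df_nums`; each condition needs its own parentheses, and `&` / `|` are used instead of `and` / `or`.

```python
import pandas as pd

# Same values as the list-of-lists example; df_nums is an illustrative name.
df_nums = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["A", "B", "C"],
)

# Rows where A > 3 AND C < 9
both = df_nums[(df_nums["A"] > 3) & (df_nums["C"] < 9)]

# Rows where A > 3 OR B == 2
either = df_nums[(df_nums["A"] > 3) | (df_nums["B"] == 2)]

print(both)
print(either)
```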


Adding a new column

In [35]:
df["D"] = False  # Setting a default value
print(df)

   A  B  C      D
0  1  2  3  False
1  4  5  6  False
2  7  8  9  False


Modifying data

In [36]:
df.loc[df["B"] == 5, "D"] = True
print(df)

   A  B  C      D
0  1  2  3  False
1  4  5  6   True
2  7  8  9  False


Dropping a column or row

In [37]:
df = df.drop("D", axis=1)  # axis=1 for columns
print(df)

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
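
Rows are dropped by index label in the same way. A short sketch, using an illustrative copy of the data named `frame`:

```python
import pandas as pd

# Illustrative copy of the 3x3 example data.
frame = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["A", "B", "C"],
)

# axis=0 is the default, so this drops the row labeled 1.
without_row = frame.drop(1)
print(without_row)

# Optionally renumber the remaining rows 0..n-1.
renumbered = without_row.reset_index(drop=True)
print(renumbered)
```

`drop` returns a new DataFrame, so `frame` itself still has all three rows.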


### Advanced Operations

Grouping and aggregation.

In [38]:
data = {
    "City": ["New York", "Los Angeles", "New York", "Chicago"],
    "Age": [25, 30, 35, 28]
}
df = pd.DataFrame(data)
print(df.groupby("City")["Age"].mean())

City
Chicago        28.0
Los Angeles    30.0
New York       30.0
Name: Age, dtype: float64


Merging DataFrames

In [39]:
df1 = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
df2 = pd.DataFrame({"Name": ["Alice", "Bob"], "City": ["New York", "Los Angeles"]})
merged_df = pd.merge(df1, df2, on="Name")
print(merged_df)

    Name  Age         City
0  Alice   25     New York
1    Bob   30  Los Angeles
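
When the two frames do not contain exactly the same keys, the `how` argument decides which rows survive. A hedged sketch with made-up frames `left` and `right` (the names Dana and Eve are invented for illustration):

```python
import pandas as pd

# Hypothetical data: "Dana" exists only on the left, "Eve" only on the right.
left = pd.DataFrame({"Name": ["Alice", "Bob", "Dana"], "Age": [25, 30, 41]})
right = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Eve"], "City": ["New York", "Los Angeles", "Boston"]}
)

# Default how="inner": keep only names present in both frames.
inner = pd.merge(left, right, on="Name")

# how="left": keep every row of `left`; a missing City becomes NaN.
left_join = pd.merge(left, right, on="Name", how="left")

print(inner)
print(left_join)
```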


Handling Missing Data.
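Useful for cleaning/normalizing NaN values. Both methods return a new DataFrame; the sample data here contains no NaN, so the output is unchanged.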

In [40]:
df.dropna()  # Drops rows with any NaN

          City  Age
0     New York   25
1  Los Angeles   30
2     New York   35
3      Chicago   28


In [41]:
df.fillna(0)  # Replaces NaN with 0

          City  Age
0     New York   25
1  Los Angeles   30
2     New York   35
3      Chicago   28
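
To see these methods do something, the data needs actual gaps. The sketch below uses a made-up frame `with_gaps` containing one missing city and one missing age; the fill values "Unknown" and 0 are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values.
with_gaps = pd.DataFrame(
    {"City": ["New York", None, "Chicago"], "Age": [25, 30, np.nan]}
)

print(with_gaps.isna().sum())  # missing values per column

# Keep only rows without any missing value.
dropped = with_gaps.dropna()

# Fill each column with its own replacement value.
filled = with_gaps.fillna({"City": "Unknown", "Age": 0})

print(dropped)
print(filled)
```

Neither call modifies `with_gaps`; assign the result if you want to keep it.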


List comprehensions can be used with DataFrames to create or manipulate data.

In [42]:
# Create a new column based on a condition
df["Is_Senior"] = [age > 30 for age in df["Age"]]
print(df)

          City  Age  Is_Senior
0     New York   25      False
1  Los Angeles   30      False
2     New York   35       True
3      Chicago   28      False
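
For a simple condition like this one, the same column can also be computed without a Python-level loop, by comparing the whole column at once. A minimal sketch on an illustrative copy of the data named `ages`:

```python
import pandas as pd

# Illustrative copy of the city/age data.
ages = pd.DataFrame(
    {
        "City": ["New York", "Los Angeles", "New York", "Chicago"],
        "Age": [25, 30, 35, 28],
    }
)

# The comparison is applied element-wise and yields a boolean column.
ages["Is_Senior"] = ages["Age"] > 30
print(ages)
```

The vectorized form is usually shorter and faster on large frames; a list comprehension remains handy when the per-row logic is hard to express with column operations.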


### Reshaping
Involves operations like pivoting, melting, stacking or unstacking to change the shape of the data.

Pivoting (pivot and pivot_table)<br>
Turns the unique values of one column into new column headers, producing a wider table with one row per index value.

In [43]:
import pandas as pd

df = pd.DataFrame({
    "Date": ["2023-01", "2023-01", "2023-02", "2023-02"],
    "City": ["NY", "LA", "NY", "LA"],
    "Sales": [100, 150, 200, 250]
})
pivoted = df.pivot(index="Date", columns="City", values="Sales")
print(pivoted)

City      LA   NY
Date             
2023-01  150  100
2023-02  250  200


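Use pivot_table instead of pivot if you need aggregation, for example when the same index/column pair appears more than once (pivot raises a ValueError in that case).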

In [45]:
pivoted = df.pivot_table(index="Date", columns="City", values="Sales", aggfunc="mean")
print(pivoted)

City        LA     NY
Date                 
2023-01  150.0  100.0
2023-02  250.0  200.0
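
The difference matters as soon as the data holds more than one value per cell. In the sketch below, the made-up frame `sales` has two NY rows for 2023-01 and no LA row for 2023-02, so `pivot` cannot reshape it while `pivot_table` averages the duplicates.

```python
import pandas as pd

# Hypothetical data: ("2023-01", "NY") appears twice.
sales = pd.DataFrame(
    {
        "Date": ["2023-01", "2023-01", "2023-01", "2023-02"],
        "City": ["NY", "NY", "LA", "NY"],
        "Sales": [100, 300, 150, 200],
    }
)

try:
    sales.pivot(index="Date", columns="City", values="Sales")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table aggregates the duplicates (here with the mean).
summary = sales.pivot_table(
    index="Date", columns="City", values="Sales", aggfunc="mean"
)
print(summary)
```

Combinations that never occur in the data, such as LA in 2023-02 here, show up as NaN in the result.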


Melting (melt)<br>
Converts a "wide" DataFrame into a "long" format by turning columns into rows.

In [46]:
df = pd.DataFrame({
    "Date": ["2023-01", "2023-02"],
    "NY": [100, 200],
    "LA": [150, 250]
})
melted = pd.melt(df, id_vars=["Date"], value_vars=["NY", "LA"], var_name="City", value_name="Sales")
print(melted)

      Date City  Sales
0  2023-01   NY    100
1  2023-02   NY    200
2  2023-01   LA    150
3  2023-02   LA    250


Stacking and Unstacking (stack and unstack)<br>
Reshape by moving row or column indices to the other axis, often for hierarchical (multi-index) DataFrames.

In [47]:
df = pd.DataFrame({
    "NY": [100, 200],
    "LA": [150, 250]
}, index=["2023-01", "2023-02"])
stacked = df.stack()
print(stacked)

2023-01  NY    100
         LA    150
2023-02  NY    200
         LA    250
dtype: int64


In [48]:
unstacked = stacked.unstack()
print(unstacked)

          NY   LA
2023-01  100  150
2023-02  200  250


Transposing (transpose or .T)<br>
Flips rows and columns. Quick way to swap axes for small datasets or visualization.

In [49]:
df = pd.DataFrame({
    "A": [1, 2],
    "B": [3, 4]
})
print(df.T)

   0  1
A  1  2
B  3  4


### When to Use Reshaping
Pivoting: When you need a summary table (e.g., sales by city over time).<br>
<br>
Melting: When preparing data for tools that expect long-format data (e.g., plotting libraries like seaborn).<br>
<br>
Stack/Unstack: When working with hierarchical data or multi-index DataFrames.<br>
<br>
Transpose: For simple row-column swaps.

### Key Notes
Data Format: Pivoting creates wide formats; melting creates long formats. Choose based on your analysis needs.

Performance: Reshaping can be memory-intensive for large datasets. Clean the data first: pivot raises a ValueError on duplicate index/column pairs, so remove duplicates or aggregate them with pivot_table.

Index Awareness: unstack operates on index levels, and pivot falls back to the existing index when no index argument is given, so set the index appropriately (e.g., df.set_index()).
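
As a sketch of that last note, the long-format sales data can be reshaped by first moving two columns into a multi-level index and then unstacking one level. The names `long_df`, `indexed`, and `wide` are illustrative.

```python
import pandas as pd

# Same long-format data as the pivot example.
long_df = pd.DataFrame(
    {
        "Date": ["2023-01", "2023-01", "2023-02", "2023-02"],
        "City": ["NY", "LA", "NY", "LA"],
        "Sales": [100, 150, 200, 250],
    }
)

# Move Date and City into a two-level row index.
indexed = long_df.set_index(["Date", "City"])

# Unstack the City level so each city becomes a column.
wide = indexed["Sales"].unstack("City")
print(wide)
```

The result matches what `pivot(index="Date", columns="City", values="Sales")` produces for this data.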