In [1]:
import pandas as pd
import numpy as np

In [2]:
pd.set_option("display.max_columns", 50)

Create a small demo dataset (with intentionally "messy" types)

In [5]:
df = pd.DataFrame({
    "customer_id": ["001", "002", "003", "004", "005"],     # numeric-like but stored as string (common!)
    "age": [25, 31, 40, np.nan, 28],                        # float because NaN appears
    "income": ["55000", "72000", "not_available", "61000", "59000"],  # numbers as strings + a bad value
    "signup_date": ["2025-01-10", "2025-02-03", "2025-02-15", "bad_date", "2025-03-01"],
    "segment": ["A", "B", "A", "C", "B"],                   # good candidate for category
    "is_active": [1, 0, 1, 1, 0]                            # could be boolean
})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   customer_id  5 non-null      object 
 1   age          4 non-null      float64
 2   income       5 non-null      object 
 3   signup_date  5 non-null      object 
 4   segment      5 non-null      object 
 5   is_active    5 non-null      int64  
dtypes: float64(1), int64(1), object(4)
memory usage: 368.0+ bytes


Shape: rows vs columns

In [7]:
n_rows, n_cols = df.shape
print("=== SHAPE ===")
print(f"Rows (samples): {n_rows}")
print(f"Columns (features): {n_cols}")

=== SHAPE ===
Rows (samples): 5
Columns (features): 6


Column names and quick peek

In [8]:
print("\n=== COLUMN NAMES ===")
print(df.columns.tolist())

print("\n=== FIRST ROWS (HEAD) ===")
display(df.head())

print("\n=== LAST ROWS (TAIL) ===")
display(df.tail())

print("\n=== RANDOM SAMPLE ROWS ===")
display(df.sample(min(3, len(df)), random_state=42))


=== COLUMN NAMES ===
['customer_id', 'age', 'income', 'signup_date', 'segment', 'is_active']

=== FIRST ROWS (HEAD) ===


Unnamed: 0,customer_id,age,income,signup_date,segment,is_active
0,1,25.0,55000,2025-01-10,A,1
1,2,31.0,72000,2025-02-03,B,0
2,3,40.0,not_available,2025-02-15,A,1
3,4,,61000,bad_date,C,1
4,5,28.0,59000,2025-03-01,B,0



=== LAST ROWS (TAIL) ===


Unnamed: 0,customer_id,age,income,signup_date,segment,is_active
0,1,25.0,55000,2025-01-10,A,1
1,2,31.0,72000,2025-02-03,B,0
2,3,40.0,not_available,2025-02-15,A,1
3,4,,61000,bad_date,C,1
4,5,28.0,59000,2025-03-01,B,0



=== RANDOM SAMPLE ROWS ===


Unnamed: 0,customer_id,age,income,signup_date,segment,is_active
1,2,31.0,72000,2025-02-03,B,0
4,5,28.0,59000,2025-03-01,B,0
2,3,40.0,not_available,2025-02-15,A,1


Data types + non-null counts

In [9]:
print("\n=== INFO (DTYPES, NON-NULL, MEMORY) ===")
df.info()

# Just the dtypes, as a Series with one entry per column:
print("\n=== DTYPES ===")
print(df.dtypes)


=== INFO (DTYPES, NON-NULL, MEMORY) ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   customer_id  5 non-null      object 
 1   age          4 non-null      float64
 2   income       5 non-null      object 
 3   signup_date  5 non-null      object 
 4   segment      5 non-null      object 
 5   is_active    5 non-null      int64  
dtypes: float64(1), int64(1), object(4)
memory usage: 368.0+ bytes

=== DTYPES ===
customer_id     object
age            float64
income          object
signup_date     object
segment         object
is_active        int64
dtype: object
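
The comments in the demo dataset hint at how each column could be repaired. A minimal sketch of that cleanup follows, assuming that unparseable values should become missing values rather than raise errors (the `clean` name is only for illustration):

```python
import numpy as np
import pandas as pd

# Same demo data as above
df = pd.DataFrame({
    "customer_id": ["001", "002", "003", "004", "005"],
    "age": [25, 31, 40, np.nan, 28],
    "income": ["55000", "72000", "not_available", "61000", "59000"],
    "signup_date": ["2025-01-10", "2025-02-03", "2025-02-15", "bad_date", "2025-03-01"],
    "segment": ["A", "B", "A", "C", "B"],
    "is_active": [1, 0, 1, 1, 0],
})

clean = df.copy()
# errors="coerce" turns "not_available" into NaN and "bad_date" into NaT
clean["income"] = pd.to_numeric(clean["income"], errors="coerce")
clean["signup_date"] = pd.to_datetime(
    clean["signup_date"], format="%Y-%m-%d", errors="coerce"
)
# Nullable integer dtype keeps the missing age without falling back to float
clean["age"] = clean["age"].astype("Int64")
clean["segment"] = clean["segment"].astype("category")
clean["is_active"] = clean["is_active"].astype(bool)
# customer_id is left as a string so the leading zeros survive

print(clean.dtypes)
print(clean.isna().sum())
```

Whether coercing bad values to missing is acceptable depends on the analysis; in some pipelines it is safer to inspect or reject those rows instead.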


Memory usage

In [10]:
print("\n=== MEMORY USAGE (DETAILED) ===")
mem_by_col = df.memory_usage(deep=True)
print(mem_by_col)
print(f"Total memory (bytes): {mem_by_col.sum():,}")


=== MEMORY USAGE (DETAILED) ===
Index          128
customer_id    300
age             40
income         318
signup_date    333
segment        290
is_active       40
dtype: int64
Total memory (bytes): 1,449
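
The string columns dominate the totals above. As a rough illustration of why `segment` is a good candidate for `category`, the sketch below compares the two representations on a hypothetical larger column (10,000 rows, 3 distinct labels); on a 5-row frame the difference would be too small to be meaningful.

```python
import pandas as pd

# Hypothetical larger column: 10,000 rows drawn from 3 distinct labels
segment_str = pd.Series(["A", "B", "C"] * 3_334, name="segment")[:10_000]
segment_cat = segment_str.astype("category")

# deep=True counts the string payloads, not just the pointers
str_bytes = segment_str.memory_usage(deep=True)
cat_bytes = segment_cat.memory_usage(deep=True)

print(f"strings:  {str_bytes:,} bytes")
print(f"category: {cat_bytes:,} bytes")
```

The saving comes from storing small integer codes per row plus one copy of each label, so it shrinks as the number of distinct values grows.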


When datasets get “large,” the best way to load them depends on what “large” means relative to your RAM and what you need to do next (one-pass summary vs random access vs joins vs repeated queries).

Here are the main loading strategies, what each one optimizes for, and when to use it:

##### Pandas chunked reading
Best when
- Data does not fit in RAM
- You can compute results in a streaming way (aggregations, filtering, writing out smaller subset)

Tradeoff
- Not as convenient as a full DataFrame (you process chunk-by-chunk)
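
A minimal sketch of a streaming aggregation with `chunksize` is shown here. It uses a small in-memory CSV as a stand-in for a large file (the column names and values are made up for illustration); in practice you would pass a file path and a much larger chunk size.

```python
import io
import pandas as pd

# Stand-in for a large file on disk; in practice pass a path such as "data.csv"
csv_text = "segment,income\n" + "\n".join(
    f"{'ABC'[i % 3]},{50_000 + i}" for i in range(10)
)

totals = {}
# With chunksize set, read_csv returns an iterator of DataFrames
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Aggregate each chunk, then fold the result into the running totals
    for segment, value in chunk.groupby("segment")["income"].sum().items():
        totals[segment] = totals.get(segment, 0) + int(value)

print(totals)
```

Only one chunk is held in memory at a time, which is why this pattern suits sums, counts, and filters but not operations that need all rows at once (such as an exact median).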

##### Convert to Parquet and read Parquet

Best when
- You load the dataset repeatedly
- You want fast reads, smaller storage, and column pruning (read only needed columns)
- You want to preserve types (categorical, datetime, etc.)

Typical workflow
- One-time: CSV → Parquet
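
This workflow could look roughly like the sketch below. It assumes a Parquet engine (`pyarrow` or `fastparquet`) is installed and falls back to a message if not; the sample data and the temporary file path are illustrative only.

```python
import io
import os
import tempfile
import pandas as pd

csv_text = (
    "customer_id,segment,signup_date,income\n"
    "001,A,2025-01-10,55000\n"
    "002,B,2025-02-03,72000\n"
    "003,A,2025-02-15,61000\n"
)

# One-time conversion: parse the CSV with explicit types
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"customer_id": str, "segment": "category"},
    parse_dates=["signup_date"],
)

parquet_path = os.path.join(tempfile.mkdtemp(), "data.parquet")
try:
    df.to_parquet(parquet_path, index=False)  # needs pyarrow or fastparquet
    # Later loads: read only the columns you need (column pruning)
    subset = pd.read_parquet(parquet_path, columns=["segment", "signup_date"])
    print(subset.dtypes)
except ImportError:
    subset = None
    print("No Parquet engine found; install pyarrow or fastparquet.")
```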

If you are sticking to pandas only:

| Situation    | Best Pandas Method           |
| ------------ | ---------------------------- |
| < 30% RAM    | Normal `read_csv`            |
| 30–70% RAM   | Optimized `read_csv`         |
| > 80% RAM    | Chunking                     |
| Too big      | Chunking + save smaller file |
| Repeated use | CSV → Parquet (via pandas)   |
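
For the "chunking + save smaller file" row, one possible approach is to filter each chunk and append the surviving rows to a new CSV. The sketch below assumes a filter on a hypothetical `is_active` column and uses an in-memory CSV in place of a real large file.

```python
import io
import os
import tempfile
import pandas as pd

# Stand-in for a large CSV; in practice pass a file path
csv_text = "customer_id,is_active,income\n" + "\n".join(
    f"{n:03d},{n % 2},{50_000 + n}" for n in range(1, 11)
)

out_path = os.path.join(tempfile.mkdtemp(), "active_only.csv")
reader = pd.read_csv(io.StringIO(csv_text), dtype={"customer_id": str}, chunksize=4)

for chunk_no, chunk in enumerate(reader):
    kept = chunk[chunk["is_active"] == 1]
    first = chunk_no == 0
    # Write the header once, then append the remaining chunks
    kept.to_csv(out_path, mode="w" if first else "a", header=first, index=False)

# The reduced file can now be loaded in full
small = pd.read_csv(out_path, dtype={"customer_id": str})
print(len(small), "rows kept")
```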


#### Full Load (Naive Pandas)

In [None]:
import pandas as pd
df = pd.read_csv("data.csv")

When to use
- ✅ Dataset clearly fits in RAM
- ✅ You want maximum flexibility (EDA, ML, joins, plots)

Pros:
- Simplest
- Full pandas API available

Cons:
- High memory usage
- Slow for big CSVs
- Bad dtype inference (lots of object)

⚠️ Rule of thumb
A 2 GB CSV often becomes a 5–10 GB DataFrame in RAM; the exact factor depends on the dtypes, and string (`object`) columns expand the most.
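
Rather than relying on the rule of thumb alone, you can estimate the footprint from a sample. The sketch below reads only the first rows with `nrows`, measures bytes per row, and scales up; it assumes the sample is representative and that the total row count is known (both are assumptions, and the data is synthetic).

```python
import io
import pandas as pd

# Synthetic stand-in for a large CSV
csv_text = "customer_id,segment,income\n" + "\n".join(
    f"{n:05d},{'ABC'[n % 3]},{50_000 + n}" for n in range(1_000)
)
csv_bytes = len(csv_text.encode("utf-8"))

# Read only the first rows and measure their real in-memory footprint
sample = pd.read_csv(io.StringIO(csv_text), nrows=100, dtype={"customer_id": str})
bytes_per_row = sample.memory_usage(deep=True, index=False).sum() / len(sample)

total_rows = 1_000  # assumed known, e.g. from a separate line count of the file
estimated_bytes = bytes_per_row * total_rows

print(f"CSV size:            {csv_bytes:,} bytes")
print(f"Estimated DataFrame: {estimated_bytes:,.0f} bytes")
print(f"Expansion factor:    {estimated_bytes / csv_bytes:.1f}x")
```

Comparing the estimate with available RAM gives a more grounded basis for choosing between a full load, an optimized load, and chunking.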

#### Optimized Full Load (Best Practice Pandas)

In [None]:
import pandas as pd

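# Explicit dtypes skip type inference and use narrower types than the
# int64/float64 defaults.
# Note: plain integer dtypes such as "int8" cannot hold missing values.
# If a column may contain NaN, use a nullable dtype ("Int8") or a float dtype.
dtypes = {
    "user_id": "int32",
    "age": "int8",
    "income": "float32",
    "is_active": "bool"
}

df = pd.read_csv(
    "data.csv",
    usecols=["user_id", "age", "income", "signup_date", "is_active"],  # load only what you need
    dtype=dtypes,
    parse_dates=["signup_date"],  # datetime64 instead of object
    engine="pyarrow"  # optional; requires the pyarrow package
)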

| Feature            | Why                          |
| ------------------ | ---------------------------- |
| `usecols`          | Don’t load unused columns    |
| `dtype`            | Avoid `object`               |
| `parse_dates`      | Proper datetime              |
| `engine="pyarrow"` | Faster parser (if installed) |
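
To see what these options buy, the sketch below loads the same synthetic CSV twice, once naively and once with `usecols`, `dtype`, and `parse_dates`, and compares memory usage. The data (including the extra `notes` column) is made up for illustration, and the default parser is used so that `pyarrow` is not required.

```python
import io
import pandas as pd

# Synthetic CSV with one unused free-text column ("notes")
csv_text = "user_id,age,income,signup_date,is_active,notes\n" + "\n".join(
    f"{n},{20 + n % 50},{50_000 + n}.0,2025-01-{1 + n % 28:02d},{bool(n % 2)},free text {n}"
    for n in range(1_000)
)

naive = pd.read_csv(io.StringIO(csv_text))

optimized = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["user_id", "age", "income", "signup_date", "is_active"],
    dtype={"user_id": "int32", "age": "int8", "income": "float32", "is_active": "bool"},
    parse_dates=["signup_date"],
)

naive_bytes = naive.memory_usage(deep=True).sum()
optimized_bytes = optimized.memory_usage(deep=True).sum()

print(f"naive:     {naive_bytes:,} bytes")
print(f"optimized: {optimized_bytes:,} bytes")
print(optimized.dtypes)
```

The narrow dtypes are safe here only because the synthetic values fit them (ages below 128, no missing values); real data should be checked before choosing them.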


This is the default “professional” way to load large files in pandas.

When to use
- ✅ Data barely fits in RAM
- ✅ Performance matters
- ✅ You’ll analyze in memory

Pros:
- Much lower RAM
- Faster parsing
- Stable dtypes

Cons:
- Requires planning

💡 This should be your default habit. 💡