**Lesson Note: Memory Optimization Techniques in Pandas**

**Objective:**
By the end of this lesson, students will be able to:
1. Understand the importance of memory optimization in pandas.
2. Utilize various techniques to optimize memory usage in pandas DataFrames.
3. Implement these techniques on synthetic data generated with the Python `Faker` library.

**Introduction to Memory Optimization in Pandas:**
Pandas is a powerful library for data manipulation and analysis, but it can be memory-intensive when working with large datasets. Optimizing memory usage is crucial for efficient data processing, particularly on machines with limited resources.

**Memory Optimization Techniques:**
1. **Selecting Appropriate Data Types (dtypes):**
   - Choose the most appropriate data types for each column to minimize memory usage.
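   - Prefer the narrowest type that holds the data. `int64` and `float64` both take 8 bytes per value, so replacing floats with integers saves memory only when a smaller integer width (such as `int8` or `int32`) fits.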
   - Utilize categorical data types for columns with a limited number of unique values to save memory.

2. **Downcasting Numeric Data Types:**
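   - Downcast numeric data types (e.g., float64 to float32, int64 to int32) to reduce memory usage. Integer downcasts are lossless as long as every value fits the smaller range; float64 to float32 cuts precision to roughly 7 significant decimal digits, so check that this is acceptable for the data.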

3. **Sparse Data Structures:**
   - Convert columns containing mostly missing values (or mostly one repeated value, such as 0) to sparse data structures, which store only the entries that differ from that fill value.
   - Use `pd.SparseDtype` for creating sparse data types.

4. **Memory-efficient Data Loading:**
   - Use memory-efficient methods for loading data, such as `read_csv()` with appropriate parameters like `dtype` and `usecols`.
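
As a rough illustration of technique 4, the sketch below reads a small in-memory CSV with `usecols` and `dtype` set up front. The CSV text, its `Notes` column, and the chosen dtypes are assumptions made for this example; a real file would need dtypes that match its actual value ranges.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """Name,Age,Salary,Department,Notes
Alice,34,52000,IT,first row
Bob,45,61000,HR,second row
Carol,29,48000,IT,third row
"""

# Load only the columns that are needed and declare their dtypes up front,
# so pandas does not fall back to int64 and object for them.
loaded = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Age", "Salary", "Department"],
    dtype={"Age": "int8", "Salary": "int32", "Department": "category"},
)

print(loaded.dtypes)
print(loaded.memory_usage(deep=True))
```

Declaring dtypes at load time avoids building the larger default columns first and converting them afterwards, which matters most when the file barely fits in memory.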



```python
import pandas as pd
from faker import Faker

# Create Faker instance
fake = Faker()

# Generate synthetic data (1,000 rows)
data = {
    'Name': [fake.name() for _ in range(1000)],
    'Age': [fake.random_int(min=18, max=90) for _ in range(1000)],
    'Salary': [fake.random_int(min=20000, max=100000) for _ in range(1000)],
    'Department': [fake.random_element(elements=('IT', 'Finance', 'HR')) for _ in range(1000)]
}

# Create DataFrame
df = pd.DataFrame(data)

# Display memory usage before optimization
# (deep=True also counts the Python string objects behind object columns)
print("Memory Usage Before Optimization:")
print(df.memory_usage(deep=True))

# 1. Selecting Appropriate Data Types
# Convert 'Age' and 'Salary' from the default int64 to int32
df['Age'] = df['Age'].astype('int32')
df['Salary'] = df['Salary'].astype('int32')

# Convert 'Department' to categorical (only three distinct values)
df['Department'] = df['Department'].astype('category')

# 2. Downcasting Numeric Data Types
# downcast='integer' picks the smallest signed integer type that holds every
# value: int8 for 'Age' (max 90), int32 for 'Salary' (up to 100000)
df['Age'] = pd.to_numeric(df['Age'], downcast='integer')
df['Salary'] = pd.to_numeric(df['Salary'], downcast='integer')

# Display memory usage after optimization
print("\nMemory Usage After Optimization:")
print(df.memory_usage(deep=True))
```

```text
Memory Usage Before Optimization:
Index           132
Name          62290
Age            8000
Salary         8000
Department    52745
dtype: int64

Memory Usage After Optimization:
Index           132
Name          62290
Age            1000
Salary         4000
Department     1266
dtype: int64
```
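
The example above does not exercise technique 3 (sparse data structures). A minimal sketch of that idea follows, assuming a hypothetical `Bonus` column in which only every 50th row holds a value; the column name and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 1,000 rows, only every 50th row holds a value.
values = np.full(1000, np.nan)
values[::50] = 1.5
dense = pd.Series(values, name="Bonus")

# Store only the non-missing entries; NaN is the implicit fill value.
sparse = dense.astype(pd.SparseDtype("float64", np.nan))

# Compare the data buffers only (index excluded from both measurements).
dense_bytes = dense.memory_usage(index=False)
sparse_bytes = sparse.memory_usage(index=False)

print("Dense bytes: ", dense_bytes)
print("Sparse bytes:", sparse_bytes)
print("Density:     ", sparse.sparse.density)
```

The saving shrinks as the share of stored values grows, so a sparse dtype is mainly worthwhile when most of the column equals the fill value.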



**Conclusion:**
Memory optimization is essential for efficient data processing in pandas. By selecting appropriate data types, downcasting numeric types, and using sparse data structures, users can significantly reduce memory usage. These savings come without loss of information as long as every value fits the narrower type; float downcasts trade some precision for space. The techniques are particularly beneficial when dealing with large datasets or working on machines with limited resources.